The production of Optimised Machine-Code
for High-Level Languages using
Machine-Independent Intermediate Codes.
Peter Salkeld Robertson
Ph. D.
University of Edinburgh
1981
(Graduation date July 1981)
ABSTRACT
The aim of this work was to investigate the problems
associated with using machine-independent intermediate codes
in the translation from a high-level language into machine
code, with emphasis on minimising code size and providing
good run-time diagnostic capabilities.
The main result was a machine-independent intermediate
code, I-code, which has been used successfully to develop
optimising and diagnostic compilers for the IMP77 language
on a large number of different computer systems. In
addition, the work has been used to lay the foundations for
a project to develop an intermediate code for portable
SIMULA compilers.
The major conclusions of the research were that carefully
designed machine-independent intermediate codes can be used
to generate viable optimising and diagnostic compilers, and
that the commonality introduced into different code
generators processing the code for different machines
simplifies the tasks of creating new compilers and
maintaining old ones.
Contents
1 Introduction
2 Intermediate codes
2.1 Uncol
2.2 Janus
2.3 OCODE
2.4 P-code
2.5 Z-code
2.6 Summary and conclusions
2.6.1 Error checking and reporting
2.6.2 Efficiency
2.6.3 Assumptions
2.6.4 Interpretation
3 Optimisations
3.1 Classification of optimisations
3.1.1 Universal optimisations
3.1.2 Local optimisations
3.1.2.1 Remembering
3.1.2.2 Delaying
3.1.2.3 Inaccessible code removal
3.1.2.4 Peephole optimisation
3.1.2.5 Special cases
3.1.2.6 Algebraic manipulation
3.1.3 Global optimisations
3.1.3.1 Restructuring
3.1.3.2 Merging
3.1.3.2.1 Forward merging
3.1.3.2.2 Backward merging
3.1.3.3 Advancing
3.1.3.4 Factoring
3.1.3.5 Loop optimisations
3.1.3.5.1 Iteration
3.1.3.5.2 Holding
3.1.3.5.3 Removal of invariants
3.1.3.6 Expansion
3.1.3.7 Addressing optimisations
3.1.4 Source optimisations
3.2 Combination of optimisations
4 The design of the compiler
4.1 General structure
4.2 The intermediate code
4.2.1 Objectives
4.2.1.1 Scope
4.2.1.2 Information preservation
4.2.1.3 Target machine independence
4.2.1.4 Decision binding
4.2.1.5 Simplification
4.2.1.6 Redundancy
4.2.1.7 Ease of use
4.3 Code layout and addressing
4.3.1 Nested procedure definitions
4.3.2 Paged machines
4.3.3 Events
4.4 Data layout and addressing
4.5 Procedure entry and exit
4.5.1 User-defined procedures
4.5.2 External procedures
4.5.3 Permanent procedures
4.5.4 Primitive procedures
4.6 Language-specified and compiler-generated objects
4.6.1 Internal labels
4.6.2 Temporary objects
4.7 Object-file generation
4.7.1 Reordering
4.7.2 Jumps and branches
4.7.3 Procedures
4.7.4 External linkage
4.7.5 In-line constants
4.7.6 Object-file format
4.8 Summary
5 Review of the overall structure
5.1 Division of function
5.2 Testing and development
5.3 Diagnostics
5.3.1 Line numbers
5.3.2 Diagnostic tables
5.3.3 Run-time checks
6 Observations
6.1 Suitability of I-code for optimisation
6.2 Performance
6.3 Cost of optimisation
6.3.1 Compile time
6.3.2 Space requirement
6.3.3 Logical complexity
6.4 Comments on the results
6.4.1 Register remembering
6.4.2 Remembering environments
6.4.4 Array allocation and use
6.4.5 Parameters in registers
6.4.6 Condition-code remembering
6.4.7 Merging
6.5 Criticisms and benefits of the technique
6.5.1 Complexity
6.5.2 I/O overhead
6.5.3 Lack of gains
6.5.4 Flexibility
6.6 Comments on instruction sets and optimisation
7 Conclusions
7.1 Viability of the technique
7.2 Ease of portability
7.3 Nature of optimisations
Appendix
A1 Simplified I-code definition
A2 I-code internal representation
A3 Results
References
1. Introduction
Compilers for high-level languages form a significant
part of most computer systems, and with an ever increasing
number and variety of machine architectures on the market
the problems of compiler development, testing, and
maintenance consume more and more manpower and computer
time. Moreover, as computer technology is improving and
changing rapidly it is becoming evident that software costs
will increasingly dominate the total cost of a system.
Indeed, it may not be long before the lifetime of software
regularly exceeds that of the hardware on which it was
originally implemented, a state of affairs quite different
from that envisaged by Halpern when he concluded that "the
importance of the entire question of machine-independence is
diminishing .." [Halpern, 1965]. In addition, there is a
need to encourage the slowly-developing trend to write the
majority of software in high-level languages. Even though
the advantages of such an approach are many, a large number
of users still have a love of machine-code, usually fostered
by thoughts of "machine efficiency". Clearly, techniques
must be developed to simplify the production of usable
compilers which can "optimise" the match between the
executing program and the user's requirements, be they for
fast execution, small program size, reasonable execution
time but with good run-time diagnostics, or whatever.
One popular method for reducing the complexity of a compiler is to partition it into two major phases: one language-dependent and the other machine-dependent. The idea is that the language-dependent phase inputs the source program and deals with all the syntactic niceties of the language, finally generating a new representation of the program, an intermediate code. This is then input by a second phase which uses it to generate machine-code for the target computer. In this way it should be possible to produce a compiler to generate code for a different machine by taking the existing first phase and writing a new second phase. This ability to move a large portion of the compiler from machine to machine has led to such compilers being referred to as "portable compilers" even though the term is perhaps misleading, as only part of the complete compiler can be moved without change.
In practice many existing compilers generate intermediate representations of the program which are passed around within the compiler, for example the "analysis records" produced by the syntactic phase of compilation, but for the purposes of this work it is only when these representations are machine-independent and are made available outwith the compiler that they will be termed intermediate codes.
Much of the emphasis in designing intermediate codes has been on enabling a compiler to be bootstrapped quickly onto a new machine - either by interpreting the intermediate code, or by using a macro generator to expand it into
machine-code [Brown, 1977]. Once this has been done the intention is that the quality of the code so produced can be improved at leisure. While this approach has been very successful and relatively error-free, it has been the experience of several implementors that it is difficult to adapt the scheme to produce highly optimised code [Russell, 1974]; apparently considerations of portability and machine-independence have caused the problems of optimisation to be overlooked. The aspect of intermediate-code design which has received most debate concerns the level of the code: low-level with a fairly simple code-generator, or high-level with a more complex code-generator [Brown, 1972].
This thesis attempts to put machine-independence and optimisation on an equal footing, and describes the use of an intermediate code which takes a novel view of the process. Instead of the intermediate code describing the computation to be performed, it describes the operation of a code-generator which will produce a program to perform the required computation. This effectively adds an extra level of indirection into the compilation, weakening any linkage between the form of the intermediate code and the object code required for a particular implementation. In essence I-code attempts to describe the results required in a way which does not constrain the method of achieving those results.
In particular it should be noted that the code described, I-code, was designed specifically for the language IMP-77, a systems implementation language which contains many of the constructions which pose problems for optimisation [Robertson, 1979]. It in no way attempts to be a "universal" intermediate code. Notwithstanding, the code, with a small number of minor extensions to cover non-IMP features, has been used successfully in an ALGOL 60 compiler and is currently proving viable in projects for writing Pascal and Fortran 77 compilers. The intermediate code as finally designed is completely machine independent, except inasmuch as the source program it describes is machine dependent, demonstrating that the problems may not be as intractable as thought by Branquart et al. who state that "clearly complete machine independency is never reached" [Branquart, 1973].
In addition to the problems of machine independence there is also the question of operating system independence, as nowadays it is common for machines to have several systems available. For this reason the task of producing a compiler is far from finished when it can generate machine code [Richards, 1977]. To simplify the generation of versions of a compiler for different operating systems a third phase of compilation was added, although it soon became clear that the extra phase could be used for other purposes as well, as will be shown later (section 4).
Throughout the text examples are given of the code produced by compilers written to demonstrate the power of the intermediate code. The examples of the intermediate code are couched in terms of mnemonics for the various code items, although the production compilers use a compacted representation. The code and its representations are described in Appendix A1 and Appendix A2.
In the examples of code generated for various constructions it should be appreciated that the exact instructions and machine features used will depend very much on the context in which the code is produced, and so only typical code sequences can be given. The machines for which code is demonstrated are indicated by the following abbreviations in parentheses:

(Nova)     Data General NOVA
(PDP10)    Digital Equipment Corporation PDP10
(PDP11)    Digital Equipment Corporation PDP11
(VAX)      Digital Equipment Corporation VAX 11/780
(GEC4080)  General Electric Company 4080
(ICL2900)  International Computers Limited 2900
(4/75)     International Computers Limited 4/75
(7/16)     Interdata 7/16
(7/32)     Interdata 7/32
(PE3200)   Perkin Elmer 3200
2 Intermediate codes
This section gives a brief account of the more important
intermediate codes which have been discussed and have had an
influence on the design of I-code.
2.1 Uncol
UNCOL, UNiversal Computer Orientated Language, [Mock,
1958], was an early attempt to specify a means for solving
the M*N problem of producing compilers for M languages to
run on N machines. It was proposed that an intermediate
language, UNCOL, be defined which would be able to express
the constructs from any language, and which could itself be
translated into code for any machine, resulting in the need
for only M+N compilers. Indeed it was even suggested that
programs would be written directly in UNCOL rather than in
machine code.
These ideas were very ambitious, but were presented without
any concrete examples of what UNCOL might look like.
Proposals were made for an UNCOL in [Steel, 1961] but the
work was abandoned before anything like a complete
specification had been produced.
An UNCOL-like technique which has been used extensively
is to compile for a known type of machine, such as the IBM
360, and then emulate that machine on the target machine.
Unfortunately, to give this any chance of being efficient,
microcode support will be necessary, and this is rarely
available to compiler writers.
2.2 Janus
The first attempt at generating an UNCOL which seems to have been at least partially successful was JANUS [Coleman, 1974]. The approach was effectively to enumerate all the mechanisms found in current programming languages and the techniques used to implement them. From this large list was defined a set of primitive data-types and operations upon them. These primitives were then put together to model the objects in the source language. Once JANUS code had been produced the intention was that it would either be interpreted or compiled into machine code by a macro generator.
2.3 OCODE
Of all the languages which claim to be portable, perhaps the most successful has been BCPL [Richards, 1971]. The BCPL compiler generates the intermediate code OCODE which can either be interpreted or translated into machine code for direct execution. As BCPL is a fairly low-level language with only one data type, the word, many of the difficulties in designing intermediate codes do not arise. This means that the code can be pitched at a low level and be "semantically weak" without compromising the efficiency of the compiled code to any great extent.
The OCODE machine works by manipulating single-word objects
held on a stack, into which there are several pointers.
e.g. R(1, 2, 3)
STACK 3 adjust the top of stack to leave two
cells free for linkage information.
LN 1 stack the constant 1.
LN 2 stack the constant 2.
LN 3 stack the constant 3.
LL L6 stack the address of label L6 (the entry
to the routine).
RTAP 5 enter the procedure adjusting the stack
frame pointer by 5 locations.
.....
.....
.....
ENTRY 1 L6 'R'
entry point for the routine R.
SAVE 5 set the top of stack pointer to be 5
locations from the stack frame pointer.
.....
RTRN return.
2.4 P-code
P-code is the intermediate code used by the PASCAL<P>
compiler [Nori, 1976; Jensen, 1976] and was designed with
the aim of porting PASCAL quickly by means of an
interpreter. In this respect it has been very successful,
especially on microprocessor-based systems. The code is
similar to OCODE but has a greater range of instructions to
handle objects of differing types.
procedure ERROR(VAL:INTEGER); begin
0: ENT 4
TOTAL := TOTAL+1;
1: LDO 138
2: LDCI 1
3: ADDI
4: SRO 138
if INDEX >= 9 then begin
5: LDO 139
6: LDCI 9
7: GEQI
8: FJP 17
LIST[10].NUM := 255
9: LAO 140
10: LDCI 10
11: DEC 1
12: IXA 2
13: INC 1
14: LDCI 255
15: STO
end else begin
16: UJP 28
INDEX := INDEX+1;
17: LDO 139
18: LDCI 1
19: ADDI
20: SRO 139
LIST[INDEX].NUM := VAL
21: LAO 140
22: LDO 139
23: DEC 1
24: IXA 2
25: INC 1
26: LOD 0, 4
27: STO
end;
end;
28: RETP
2.5 Z-code
Z-code [Bourne, 1975] is the intermediate code produced by the ALGOL68C compiler, the main feature of which is the ability for the user to parameterise the first phase to modify the Z-code to suit the target machine, an idea previously investigated in SLANG [Sibley, 1961]. A set of eight registers is assumed by the code and others may be specified explicitly for each transfer. The memory with which the code works is assumed to be "a linear data store that is addressed by consecutive integers", addresses taking the form of base+displacement pairs. Intermingled with the instructions are directives which control the translation of the code into machine orders. Two of these directives are used to divide the code into "basic blocks" or "straight-line segments", and describe the usage of registers on entry to and exit from the blocks, although little use seems to be made of them at present.
As an example here is the Z-code generated by the PDP10 version of the compiler [Gardner, 1977]:

    int X := 2, Y := 3, Z := 2
 1:  F000 10 0 +2        load 2
     F040 10 6 +144      store in X
     F000 10 0 +3        load 3
     F040 10 6 +145      store in Y
 5:  F000 10 0 +2        load 2
     F040 10 6 +146      store in Z
    proc P = (int A, B) int: begin
 7:  S715 p*Z
     T246 677 712
    if A > B
 9:  R0
10:  F020 10 5 +4        load A
11:  F022 10 5 +5        subtract B
12:  F113 10 0 P713      ->L713 if <=
    then A
13:  F020 10 5 +4        load A
    else B
14:  H116 0 P714         ->L714
15:  L713
16:  F020 10 5 +5        load B
    fi
17:  L714
18:  R1 10 1
    end
19:  R1 10 1
20:  T247 667 712        end of P
2.6 Summary and conclusions
2.6.1 Error checking and reporting
The UNCOL approach of having one code for all languages and machines may well simplify the generation of some sort of compiler, but has the major disadvantage that the optimisation of error checking and reporting run-time errors cannot be left to the code generator - many errors are language-dependent and the code generator cannot know how to handle all of them. Instead the checks must be programmed into the intermediate representation explicitly. As will be shown later (5.3) this can inhibit several very powerful and effective optimisations. Sadly, this problem can result in the absence of all but the most trivial of run-time checks in the compiled code. Even when checking is provided in the intermediate code, as in the case of P-code with its CHK instruction for range testing, it is rare for the code to contain enough information to permit the error to be reported in source program terms: line numbers, procedure names, variable names and values, etc. As an example, many P-code interpreters locate run-time errors in terms of 'P-code instruction addresses' which are of negligible benefit to most users.
2.6.2 Efficiency
Commonly, little attention is paid to questions of run-time efficiency in the generation of intermediate code. An exception to this is Z-code which is parameterised in order that the match between the code and the target machine can be improved. In particular, the machine-independent phase is intended to perform simple register optimisation, although as the example in 2.5 shows, the insistence on using the same register will minimise any gains from remembering register contents. However, this is probably just a failure on the part of the current compilers and could be corrected at a later date.
Unfortunately, the fact that the compiler purports to optimise the intermediate code inhibits the code generator from attempting any but the most trivial peephole optimisations. This may be seen in the example by considering instructions 10 - 12. On many machines the subtract operation is not good for value comparison as firstly it may fail with overflow, and secondly it corrupts a register. A better implementation would be to replace the subtract with a suitable COMPARE, leaving the values untouched and available for later use. This cannot be done by the code generator as it cannot know that the intermediate code does not go on to use the result of the subtraction later. Similarly, if Z-code had chosen to use a COMPARE instruction in the first place, a machine without a compare would have to work hard to make sure all registers involved in the necessary subtract were restored
to their initial values before the intermediate code goes on to use them.
2.6.3 Assumptions
Most machine-independent intermediate codes have been designed, at least initially, assuming a linear store with one address increment corresponding to one basic object. In the case of O-code this is a direct result of the language definition, but in languages such as PASCAL it has led to a great loss of information, as the rich information about data types cannot be expressed in the code. The problems associated with putting languages onto machines with different addressing schemes have resulted in some intermediate codes being updated to accept a limited form of parameterisation to define the gross appearance of the target machine. Typical of the limitations of these codes is P-code where although the basic types of object can have differing sizes of machine representation, objects with enumerated types will always be given a 'fullword' even though the host machine could easily support a smaller item.
A typical assumption is that the difference between objects explicitly specified in the original source and those created by the intermediate code generator for its own purposes is insignificant. As will be shown later (4.6), this is not necessarily the case.
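To make the point about enumerated types concrete: a code generator that is told the value range of an object, rather than being handed a fixed 'fullword' request, is free to pick the smallest container the target supports. The fragment below is only a sketch of that decision, in C; the sizes offered are an assumed target, not something drawn from P-code or I-code.

#include <stdio.h>

/* Sketch: choose the smallest container able to hold values lo..hi.
   The available sizes are assumptions about a particular target.     */
static int container_bytes(long lo, long hi)
{
    if (-128 <= lo && hi <= 127)       return 1;   /* byte             */
    if (-32768 <= lo && hi <= 32767)   return 2;   /* halfword         */
    return 4;                                      /* 'fullword'       */
}

int main(void)
{
    printf("%d\n", container_bytes(0, 11));        /* a 12-value type: 1 */
    return 0;
}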
3 Optimisations
The task of any compiler for a high-level language is to
fit programs written in that language onto a specific
computer system so that the required computations may be
performed.
Optimisation may be described as the process by which the
fit is improved. Usually the quality of the optimisation is
measured in terms of two parameters: the size of the running
program, and, more commonly, the speed at which it executes.
While it is possible in some cases to make a program smaller
and increase its speed of execution, it is well-known that
in general speed and size are complementary. For example,
the following code fragments have the same effect, but the
first will probably be smaller than the second, which will
execute faster than the first:
-------------------------- -----------------------
| | A(1) = K |
| | A(2) = K |
| for J = 1,1,8 cycle | A(3) = K |
| A(J) = K | A(4) = K |
| repeat | A(5) = K |
| | A(6) = K |
| | A(7) = K |
| | A(8) = K |
| | J = 8 |
-------------------------- -----------------------
3.1 Classification of optimisations
With a subject as complex as optimisation it is difficult
to give a useful and definitive classification of the
various possibilities for improving programs. In addition
different authors have used many different terms to describe
optimisations which have been attempted [Aho, 1974; Lowry, 1969; Wichmann, 1977]. However, most optimisations fall into one of the following four groups: Universal, Local, Global and Source.
3.1.1 Universal optimisations
These are the optimisations which are independent of any particular program, but which depend on the complete environment in which the program is to be compiled or executed. They are the policy decisions taken by the compiler writer during the initial design of the compiler, and include such things as the fixed use of registers (stack pointers, code pointers, link registers etc), the representations of language-defined objects (arrays, records, strings etc), and the standards for communication with external objects. In addition, universal optimisation must take into account such questions as:
i Compilation speed or execution speed?
If the compiler is to be used in an environment
where programs are compiled roughly as often as
they are executed, such as in a teaching
environment, execution speed can be sacrificed
for an increase in compilation speed.
ii Diagnostics?
If the compiler is to produce code which
will provide extensive checking and will
give diagnostic information in the event of
program failure, allowance must be made for
the efficient checking of the program's
behaviour and the maintenance of the
recovery information used by the
diagnostics. If highly optimised code is
required these constraints may not apply.
In the current state of the art universal
optimisation is done by experience and guesswork;
attempts at producing compiler-compilers which can
approach the quality of hand-constructed compilers
have not met with great success [Brooker, 1967;
Feldman, 1966; Trout, 1967]. As will be shown later
(4.5), minor changes in the universal optimisation
can result in major changes in the performance of
the generated code, and so rules made at this stage
should be as flexible as possible to permit changes
to be made in the light of experience.
From the point of view of measurement universal
optimisation provides the base level from which
other optimisations are investigated. Roughly, the
better the universal optimisation the less effective
the other optimisations appear to be.
3.1.2 Local optimisations
Local optimisations may be defined as those optimisations which may be performed during a sequential scan of the program using only knowledge of statements already processed. Not only are these optimisations reasonably simple to perform but they can have a major effect on the generated code. Indeed Wulf et al. state that "In the final analysis the quality of the local code has a greater impact on both the size and speed of the final program than any other optimisation" [Wulf, 1975].
3.1.2.1 Remembering
Remembering optimisations are those optimisations which can be applied to single statements in the light of information gathered during the compilation of previous statements. These optimisations depend on remembering the current state of the machine and applying this knowledge to subsequent statements. Their chief characteristic is that they are applied during a sequential scan of the program, and as such are reasonably cheap to implement and execute.
For example:
----------------
| X = Y |
| if X = 0 start |
-----------------
on the PDP11 would generate:
------------
| MOV Y,X |
| BNE $1 | remembering that the previous
| | line sets the condition code.
------------
The most powerful of the remembering
optimisations is that whereby the correspondence
between values in registers and values in store is
remembered and references to the store value are
replaced by references to the register's value,
register operations usually being smaller and faster
than their store equivalents. Unfortunately there
are several cases where this leads to worse code
than the "obvious" version. For example, on the
(PE3200) the code on the right is larger and slower
than that on the left:
------------
| X = 2 |
| P = P<<2 |
-----------
----------- --------------
| LIS 3,2 | LIS 3,2 | pick up 2
| ST 3,X | ST 3,X | store it in X
| L 1,P | L 1,P | pick up P
| SLLS 1,2 | SLL 1,0(3) | shift it by 2
| ST 1,P | ST 1,P | store it in P
----------- --------------
In addition to keeping track of the changes in
the state of the machine from the compilation of one
statement to another, remembering also includes
preserving this state or environment for later use
when a label is encountered, either by merging the
current environment with the environment saved at
the jump to the label, or simply by restoring that
latter environment when it is not possible for
control to "fall through" from the statements
immediately preceding the label.
In all forms of remembering it is vital to be
able to keep the information up-to-date,
invalidating knowledge when it becomes false, a
process which is exacerbated when it is possible for
an object to be known by two or more apparently
different descriptions as in the following code:
---------------------------
| integer J, K |
| integerarray A(1:12) |
| integername P |
| P == J |
| J = 1; K = 1 |
---------------------------
At this point P and J refer to the same location as
do A(J) and A(K).
Except in the most simple of cases all that can be
done is to assume the worst and forget anything
potentially dangerous after writing to unknown
addresses.
3.1.2.2 Delaying
By delaying the first store into T until after
the conditional statement, and delaying the second
store into T until after the label, both
instructions can be combined, resulting in the code:
-------------------
| L 3,X |
| L 0,0(3) |
| BGE $1 |
| SR 0,0 |
| $1:ST 0,T |
| LIS 2,1 |
| ST 2,0(3) |
| LR 1,0 |
| {return} |
-------------------
This store itself can now be delayed until the
return from the function, at which point, as T is
local to the function and will be destroyed, the
instruction can be deleted altogether.
Section 3.2 gives a description of one way in which
this sort of optimisation has been achieved.
3.1.2.3 Inaccessible code removal
In several cases compilers can generate code
which will never be executed.
The common causes of this are either user-specified
conditions whose truth is constant and are used to
achieve some sort of "conditional compilation", or
structural statements following unconditional jumps
as below:
----------------------
| if X = 0 start |
| ERROR(1) |
| return |
| else |
| V = V-X |
| finish |
----------------------
Here the branch instruction usually generated at the
end of the if clause to take control past the else
clause, can never be executed.
Such inaccessible code can be eliminated to shorten
the program, but without directly affecting its
speed of execution.
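Within a code generator the simplest way to obtain this effect is to stop emitting instructions after any unconditional transfer of control until the next label is defined. The sketch below (C) illustrates the idea on the example above; the "instructions" are just text, and the routine names are invented for the illustration.

#include <stdio.h>

static int reachable = 1;              /* can control reach this point?   */

static void emit(const char *instr)
{
    if (reachable)                     /* otherwise the instruction is    */
        printf("        %s\n", instr); /* inaccessible and is dropped     */
}

static void flow_ends(void)            /* after a return, stop, etc.      */
{
    reachable = 0;
}

static void jump(const char *label)    /* unconditional branch            */
{
    if (reachable)
        printf("        BR   %s\n", label);
    reachable = 0;
}

static void define(const char *label)  /* label: control may arrive here  */
{
    printf("%s:\n", label);
    reachable = 1;
}

int main(void)
{
    /* if X = 0 start; ERROR(1); return; else; V = V-X; finish            */
    emit("TST  X");
    emit("BNE  $2");                   /* to the else clause              */
    emit("JSR  PC,ERROR");
    emit("RTS  PC");                   /* return                          */
    flow_ends();
    jump("$3");                        /* branch past the else clause:    */
                                       /* suppressed, it can never run    */
    define("$2");
    emit("SUB  X,V");
    define("$3");
    return 0;
}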
3.1.2.4 Peephole optimisations
Peephole optimisation [McKeeman, 1965] is the
technique of examining small sections of generated
code to make fairly obvious, but ad hoc,
improvements. Many of the gains from the
optimisation come from simplifying code sequences
created by the juxtaposition of code areas which
were produced separately.
For example (PE3200):
Before After
---------------- -------------------
| ST 4,X | ST 4,X |
| L 4,X | |
|- -|- -|
| AR 1,2 | |
| AHI 1,48 | AHI 1,48(2) |
---------------- -------------------
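In a code generator which keeps the most recent instructions in a buffer, such improvements can be made by matching small patterns over adjacent entries before the buffer is flushed. The following sketch (C) shows one such rule, removing the redundant load of the first example; the instruction representation is an assumption made for the illustration and is far simpler than anything a real code generator would use.

#include <stdio.h>
#include <string.h>

struct instr { char op[8]; int reg; char addr[16]; };

/* Sketch of one peephole rule: a load from X immediately after a store
   of the same register to X is redundant and can be deleted.           */
static int peephole(struct instr *code, int n)
{
    int out = 0;
    for (int i = 0; i < n; i++) {
        if (out > 0 &&
            strcmp(code[i].op, "L") == 0 &&
            strcmp(code[out - 1].op, "ST") == 0 &&
            code[i].reg == code[out - 1].reg &&
            strcmp(code[i].addr, code[out - 1].addr) == 0)
            continue;                        /* drop the redundant load  */
        code[out++] = code[i];
    }
    return out;                              /* new instruction count    */
}

int main(void)
{
    struct instr code[] = {
        { "ST", 4, "X" },
        { "L",  4, "X" },                    /* redundant                */
        { "AR", 1, "2" },
    };
    int n = peephole(code, 3);
    for (int i = 0; i < n; i++)
        printf("%-4s %d,%s\n", code[i].op, code[i].reg, code[i].addr);
    return 0;
}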
3.1.2.5 Special cases
Special-case optimisations are those which make
use of the particular structure and features of the
target machine to control the way in which certain
statements are implemented.
For example:
Obvious Optimised
---------------- -----------
(PDP11) | MOV #0,X | CLR X | X = 0
|- -|- -|
(PDP11) | ADD #1,X | INC X | X = X+1
|- -|- -|
(PE3200) | LHI 1,NULL | SR 0,0 | S = ""
| LHI 2,S | STB 0,S |
| BAL 15,MOVE | |
----------------------------
These optimisations are very similar to peephole
optimisations but are distinguished because they
actively control the generation of code rather than
passively alter code which has already been
produced. In particular they avoid one of the
drawbacks of peephole optimisation, namely that even
though it can reduce fairly complex instruction
sequences to a simpler form, the side-effects of
generating the long form in the first place often
degrade the code. In the example of setting a
string variable to the null string above the
optimised form uses only one register, the value of
which can be remembered. In the non-optimised
version three registers are immediately altered and
the knowledge of the contents of all of the
registers may need to be forgotten unless the code
generator knows how the MOVE routine works and can
forget only those registers which it uses.
3.1.2.6 Algebraic manipulations
Algebraic optimisations are improvements brought
about by using the algebraic properties of operators
and operands, and include:
. Folding, or compile-time evaluation
1+2 is replaced by 3
. Removal of null operations
A+0 is replaced by A
. Using commutative properties
-B+A is replaced by A-B
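These manipulations are most easily performed on the tree (or stacked) form of an expression before any code for it is emitted. The sketch below is in C, with an invented node representation; it folds constants and removes null operations, and the commutative rewriting of -B+A into A-B would be one more case of the same kind.

#include <stdio.h>
#include <stdlib.h>

/* A node is either a constant leaf ('k') or a binary '+' or '-'.      */
struct node { char op; long value; struct node *left, *right; };

static struct node *leaf(long v)
{
    struct node *n = malloc(sizeof *n);
    n->op = 'k'; n->value = v; n->left = n->right = NULL;
    return n;
}

static struct node *bin(char op, struct node *l, struct node *r)
{
    struct node *n = malloc(sizeof *n);
    n->op = op; n->left = l; n->right = r;
    return n;
}

static struct node *simplify(struct node *n)
{
    if (n->op == 'k') return n;
    n->left  = simplify(n->left);
    n->right = simplify(n->right);

    if (n->left->op == 'k' && n->right->op == 'k') {     /* folding     */
        long v = n->op == '+' ? n->left->value + n->right->value
                              : n->left->value - n->right->value;
        return leaf(v);                                  /* 1+2 => 3    */
    }
    if (n->right->op == 'k' && n->right->value == 0)     /* A+0 => A    */
        return n->left;                                  /* A-0 => A    */
    return n;
}

int main(void)
{
    struct node *e = bin('+', bin('+', leaf(1), leaf(2)), leaf(0));
    e = simplify(e);
    printf("op=%c value=%ld\n", e->op, e->value);        /* op=k value=3 */
    return 0;
}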
3.1.3 Global optimisations
Global optimisations may be defined as those improvements to the code which require knowledge of substantial parts of the program. In effect they are performed by examining numbers of statements in parallel, in contrast to the sequential scan required by local optimisation.
3.1.3.1 Restructuring
Restructuring optimisations are those optimisations which may be brought about by changing the order in which the code is laid out in memory without otherwise changing the nature of the code. As will be discussed later, there are many reasons why programs can be improved by changing the order of chunks of the code. A common reason is that many machines have conditional branch instructions with limited range while the unconditional branches have a much larger range. Hence if (A) represents a large number of statements and (B) represents a small number of statements, the program:
----------------------
| if X = 0 start |
| {A} |
| else |
| {B} |
| finish |
----------------------
could be improved by reordering as on the right
(PDP11):
original reordered
------------------ ----------------
| MOV X,R0 | MOV X,R0 |
| BEQ $1 | BEQ $1 |
| JMP $2 | {B} |
| $1: {A} | JMP $2 |
| BR $3 | $1: {A} |
| $2: {B} | $2: |
| $3: | |
------------------ ----------------
3.1.3.2 Merging
3.1.3.2.1 Forward merging
Forward merging, also somewhat confusingly referred to as "cross jumping" [Wulf, 1975], is the process whereby the point of convergence of two or more code sequences is moved back over common sub-sequences thus removing one of the sub-sequences, as in the case below.
----------------------
| if X > Y start |
| TEST(X, Y) |
| else |
| TEST(Y, X) |
| finish |
----------------------
  obvious code (VAX)          after merging
---------------------      --------------------
| CMPL X,Y          |      | CMPL X,Y         |
| BLE $1            |      | BLE $1           |
| PUSHL X           |      | PUSHL X          |
| PUSHL Y           |      | PUSHL Y          |
| CALLS 2,TEST      |      |                  |
| BRB $2            |      | BRB $2           |
| $1:PUSHL Y        |      | $1:PUSHL Y       |
| PUSHL X           |      | PUSHL X          |
| CALLS 2,TEST      |      | $2:CALLS 2,TEST  |
| $2:               |      |                  |
---------------------      --------------------
The simplest way to perform this optimisation is to take the code sequence about the point of a label and a reference to that label, and set two pointers: one, P1, to the unconditional jump and the other, P2, to the label. If the instructions immediately preceding the pointers are identical both pointers are moved back over that instruction. The label is
redefined at the new position of P2 and the
instruction passed over by P1 is deleted. The
process is repeated until either another label is
found or two different instructions are encountered.
The redefinition of the label involves creating a
completely new label, leaving the old one untouched.
This both prevents trouble with multiple references
to the label and permits the optimisation to be
attempted on them.
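Expressed over a symbolic instruction list, the two-pointer process just described is only a few lines of code. The sketch below is in C and is purely illustrative: the instructions are held as text, the label handling is simplified, and (as noted above) a complete implementation would introduce a new label rather than move the existing one.

#include <stdio.h>
#include <string.h>

#define MAXCODE 32

static char code[MAXCODE][20];
static int  ncode;

static int is_label(const char *s) { return s[strlen(s) - 1] == ':'; }

static void delete_at(int i)
{
    memmove(code[i], code[i + 1], (size_t)(ncode - i - 1) * sizeof code[0]);
    ncode--;
}

/* 'jump' indexes the unconditional branch, 'label' the definition of
   the label it refers to; identical instructions preceding both points
   are merged until a mismatch or another label is met.                 */
static void merge(int jump, int label)
{
    char tmp[20];
    while (jump > 0 && label - 1 > jump &&
           !is_label(code[jump - 1]) && !is_label(code[label - 1]) &&
           strcmp(code[jump - 1], code[label - 1]) == 0) {
        delete_at(jump - 1);               /* instruction passed by P1   */
        jump--; label--;                   /* the list has shrunk by one */
        strcpy(tmp, code[label - 1]);      /* move the label definition  */
        strcpy(code[label - 1], code[label]);  /* back over the shared   */
        strcpy(code[label], tmp);              /* instruction (P2)       */
        label--;
        /* a full implementation would create a new label at this point, */
        /* leaving other references to the old one untouched             */
    }
}

int main(void)
{
    static const char *example[] = {
        "CMPL X,Y", "BLE $1", "PUSHL X", "PUSHL Y", "CALLS 2,TEST",
        "BRB $2", "$1:", "PUSHL Y", "PUSHL X", "CALLS 2,TEST", "$2:",
    };
    ncode = (int)(sizeof example / sizeof example[0]);
    for (int i = 0; i < ncode; i++) strcpy(code[i], example[i]);

    merge(5, 10);                          /* BRB $2 and $2:             */
    for (int i = 0; i < ncode; i++) printf("%s\n", code[i]);
    return 0;
}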
As this optimisation simply causes the sharing of
execution paths there is no direct gain in execution
speed, but as the code size is reduced an indirect
improvement may be achieved if the shorter code
moves the label close enough to the reference to it
for a shorter and usually faster jump instruction to
be used.
The optimisation obviously must be performed while
labels and jumps are in a symbolic form, that is
before code addresses have been resolved. This
permits the merging of instructions which will
eventually have program-counter relative operands
and consequently be position dependent.
In this case as the code-generator is in complete
control the optimisation can be very simple,
although rather specific.
The techniques for handling common sub-expressions
have been investigated at length by several authors,
but measurements indicate that in most programs
expressions are so trivial the expense in finding
common sub-expressions is not repaid by the
resulting improvement in the generated code [Knuth,
1971].
The more general form of factoring can be seen in
the transformation of the following statements:
---------------------------------------
| if X = 0 then A = 1 else B = 2 |
| if X = 0 then C = 3 else D = 4 |
---------------------------------------
into:
-----------------------
| if X = 0 start |
| A = 1 |
| C = 3 |
| else |
| B = 2 |
| D = 4 |
| finish |
-----------------------
3.1.3.5 Loop optimisations
3.1.3.5.1 Iteration
Iteration is the process whereby the values in variables from previous iterations of a loop are used to calculate the new values for the current iteration, rather than calculating those values from scratch each time. One of the effects of this optimisation can be the reduction in strength of operations, such as changing multiplications into additions. In this context the IMP77 operators "++" and "--" may be used to great effect. Their action is to adjust the reference on the left by the number of items to the right, hence if X is an integer then X++1 is the next integer and X--2 is the integer two integers before X.
-------------------------------
| for J = 1,1,1000 cycle |
| A(J) = J |
| repeat |
-------------------------------
Can be optimised to:
-------------------------------
| integername T |
| T == A(J)--1 |
| for J = 1,1,1000 cycle |
| T == T++1 |
| T = J |
| repeat |
-------------------------------
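For readers more familiar with C than with the IMP77 "++" notation, the effect of the transformation is sketched below; the array size and the use of an ordinary C pointer are simply an illustration of the same strength reduction, not part of the original example.

#include <stdio.h>

int main(void)
{
    static int a[1001];

    /* Before: each trip recomputes the element address from J          */
    for (int j = 1; j <= 1000; j++)
        a[j] = j;

    /* After (iteration): the address is carried over from the previous
       trip and merely stepped, as with the IMP77 "++" operator above   */
    int *t = &a[0];                    /* corresponds to  T == A(J)--1   */
    for (int j = 1; j <= 1000; j++) {
        t = t + 1;                     /* T == T++1                      */
        *t = j;                        /* T = J                          */
    }

    printf("%d %d\n", a[1], a[1000]);  /* 1 1000                         */
    return 0;
}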
3.1.3.5.2 Holding
         A                     B
------------------- ---------------------
| | TEMP = P//Q |
| while X > 0 cycle | while X > 0 cycle |
| W(X) = P//Q | W(X) = TEMP |
| X = X-1 | X = X-1 |
| repeat | repeat |
------------------- ---------------------
B will be faster than A if the loop is executed at
least twice. If the loop is not executed at all
(X <= 0) B will be much slower than A (by an
alarming 80 microseconds on the 7/32).
3.1.3.5.3 Removal of invariants
This is the process whereby complex
sub-expressions which do not change their values as
the loop progresses are evaluated outside the loop
and held in temporaries:
---------------------------------
| for J = 1, 1, 1000 cycle |
| A(J) = LIMIT-MARGIN |
| repeat |
--------------------------------
Can be optimised to:
-------------------------------
| TEMP = LIMIT-MARGIN |
| for J = 1,1,1000 cycle |
| A(J) = TEMP |
| repeat |
-------------------------------
It is simply a special case of Holding.
3.1.4 Source optimisations
Source optimisations [Schneck, 1973] are those optimisations which can be effected by changes in the source program. They can be sub-divided into three categories: machine-independent [Hecht, 1973; Kildall, 1973], machine-dependent, and tentative. Tentative optimisations are those which, while unlikely to make the code worse, may improve it on some machines. For example, most machines will handle the comparison "X<1" better if it is rewritten as "X<=0", where X is an integer variable.
With this knowledge the second statement can be
compiled to:
-------------------
| MOV# 1,1,SZR |
| JMP $1 |
| LDA 1,D |
| STA 1,A |
| $1: |
-------------------
Immediately before the label $1 it is known that
once again the value of A is in accumulator 1, and
that the STA above the label is marked as pending as
before. Following the definition of the label the
environment before the jump to that label can be
combined with the environment just before the label
to give the new environment following the label.
The information in this environment is that A is in
accumulator 1 and that the same store is pending
from both old environments. This allows the two
marked store instructions to be removed and one
store placed after the label (and once again marked
as being "pending"). This gives the following code:
---------------------
| LDA 0,B |
| LDA 1,C |
| AND 0,1 |
| MOV# 1,1,SZR |
| JMP $1 |
| LDA 1,D |
| $1:STA 1,A |
---------------------
A simple jump optimisation notices that the JMP
passes over just one instruction and can therefore
be removed by inverting the skip condition on the
previous MOVe, giving:
------------------
| LDA 0,B |
| LDA 1,C |
| AND 0,1 |
| MOVZL 1,1 |
| MOV# 1,1,SNR |
| LDA 1,D |
| STA 1,A |
------------------
Finally, peephole optimisation combines the AND with
the MOVZL giving ANDZL, and then combines this with
the following MOV# to give the complete code
sequence as:
------------------
| LDA 0,B |
| LDA 1,C |
| ANDZL 0,1,SNR |
| LDA 1,D |
| STA 1,A |
------------------
The most interesting thing to notice about this
particular sequence of optimisations is that with
the possible exception of the removal of the marked
STA instructions, the final code can be generated
very simply with local optimisations.
4 The design of the compiler
This section describes the features of the compiler which have had an influence on the form of the intermediate code.
4.1 General structure
One of the aims of this type of compilation strategy is to simplify the production of compilers, and a successful technique for simplifying programs is to divide them into several communicating modules, each largely independent of the others but with well-defined interfaces between them. At the highest level a compiler can be split up into three major parts:
1 A language processor, which deals with the language-dependent parts such as parsing, semantic checking, and error reporting.
2 A code generator, which takes the decomposed form of the program as generated by 1 above, and constructs the appropriate code sequences to perform the required functions.
3 An object-file generator, which builds an object-file from the code sequences produced by 2, in the form required by the system which is to execute the program.
Commonly, the first two parts of this scheme are combined into one program which generates as its output an
assembly-language source file corresponding to the original
program. The third part then becomes the standard system
assembler. This approach clearly simplifies the production
of the compiler, as one part, the assembler, is provided
already and can ease the problems of checking the compiler
because the code it generates is presented in a well-known
form. Despite these advantages it was rejected for the
following reasons:
1 In order that assembly language can be
generated, the compiler must have an internal
form of the instructions which is changed into
text, processed by the assembler, and finally
converted into the machine representation.
These transformations can be eliminated if the
compiler works directly with the machine
representations.
2 In general, the system-provided assembler will
be expecting to receive a much more powerful
language than the rather stereotyped text
produced by compilers. This will certainly
degrade the performance of the assembler. A
solution to this is to produce a cut-down
version of the assembler which only recognises
those constructs generated by the compiler.
However, producing a new assembler removes one
of the reasons for choosing this route,
namely, not requiring extra work in writing
the object-file generator.
3 As will be seen later (section 4.7), even
after the code sequences have been produced
there remain several optimisations which can
be performed using knowledge gained during the
production of those sequences, for example,
generating short forms of jump instructions
when the distance between the jump and its
destination is small enough. While in certain
cases these optimisations can be performed by
a standard assembler it is unlikely that the
structure of the code-generator would be as
simple as if a special-purpose object-file
generator were available.
The main interface in such a system is clearly that
between the language and machine dependencies, as most
languages are largely machine-independent. It is this
interface between the language-dependent and
machine-dependent parts of the compiler which is termed the
INTERMEDIATE CODE. In the following discussion it is
assumed that the reader has a reasonable understanding of
the structure of the final form of I-code, a definition of
which may be found in Appendix A2.
code is to be generated by several different language processors, a simple intermediate code which is easy to produce may well be more attractive. As I-code was intended for optimisation a high-level code was chosen. In addition, as it was hoped that the code could eventually be used in different language processors, it was decided to keep the structure simple.
The complete compilation process may be thought of as a sequence of transformations working from the source program to the final object program via a number of intermediate representations. As the transformations are applied the representations become less dependent on the source language and more dependent on the target machine. In order to simplify the code-generator as much as possible the intermediate code must lie as far from the source language as is possible without straying from the objectives set out below.
4.2.1 Objectives
One of the dangers in designing an intermediate code is that of building into it old techniques and standard expansions of source constructions, which while they may be tried and tested cannot in any way be said to be "the only solutions" or even "the best solutions". One of the intentions behind the design of I-code was to
permit the use of varied implementation strategies. In the same way that the only practical definition of a "good" programming language is that it fits the style of the particular programmer using it, so the measure of the power of an intermediate code must include the ease with which it can adapt to an existing style of code-generator writing.
Inevitably, practical constraints prevent total generality; the most general form of a program is a canonical form of itself, but this is little help in compiling it. It follows that the intermediate code, while remaining true to the original program and distant from "real" machines, must provide enough simplification to make the task of code-generation as easy as possible without inhibiting optimisation.
From the start it was appreciated that an intermediate code suitable for use in optimising compilers would necessarily require more processing than a code such as O-code which was aimed at a quick implementation. The original hope was that although each machine-dependent code generator would not be small, typically about 3000-4000 IMP statements, large portions of one could be taken as a basis for a new implementation. This has proved to be the case, and provision of an existing code-generator as a template greatly simplifies the task of creating a new one (section 6.4.1).
4.2.1.1 Scope
The first and most fundamental objective in the design of I-code was that it should support the compilation of one specific language, IMP-77, on many different machines. Considerations of using the code to implement other languages were secondary to this main aim, but were used to bias the design when a choice had to be made from several otherwise equally suitable possibilities. In retrospect, a few areas of the code could have been made more general without significant overheads in the code generators, mainly in the area of data descriptor definitions, but a detailed discussion of one intermediate code supporting several languages is beyond the scope of this work.
In direct contrast to many intermediate codes, I-code was not designed with the intention of making it convenient to interpret; the prime aim was to permit compilation into efficient machine-code. Nevertheless it is possible to "compile" I-code into threaded code [Bell, 1973] or a form suitable for interpretation, either by generating a conventional interpretive code or by leaving the code in more-or-less its current form but with labels resolved and descriptors expanded into a more convenient representation.
For example, the following two program fragments are
semantically identical:
A B
----------------------------------------------------
| | P = 0 |
| | cycle |
| TEST for P = 1, 1, 10 | P = P+1 |
| | TEST |
| | repeat until P = 10 |
| | |
----------------------------------------------------
However, in "B" the information that the fragment contains a
simple for construction, while not completely lost, has been
scattered through the code, and this dilution of information
will increase the complexity of any code-generator wishing
to handle for loops specially.
To leave open all avenues for optimisation it is
necessary therefore, that all of the semantic information in
the source program is preserved in a compact form in the
I-code. One sure way of achieving this property is to
design the code in such a way as to allow the regeneration
of the source program, or at least a canonical form of it
which is not significantly different from the original. In
this context insignificant differences are the removal of
comments and the standardisation of variant forms of
statements, such as:
NEWLINE if COUNT = 0
and: if COUNT = 0 then NEWLINE
frequently pun on addresses and integer values. Several later versions of the codes have attempted to solve these problems by parameterising the intermediate code generator so that the characteristics of the target machine may be used to modify the code which is produced. However, they still have built in to them assumptions about how the objects can be addressed. There are so many constraints which can be imposed on the code to be generated, such as operating system requirements and conventions for communicating with the run-time environment, that a parameterised first phase could not be expected to generate code which was well-suited to every installation. The authors of JANUS [Coleman, 1974] write that they believe that the approach of using a parameterised intermediate code "... is a dead end, and that the adaptability must come in the translation from the intermediate language to machine code".
4.2.1.4 Simplification
For the complexity of the machine-dependent phases of compilation to be kept as low as possible the machine-independent phase must do as much work as possible while keeping within the constraints imposed by the previous objectives. One way of simplifying the intermediate code is for certain high-level constructions to be expanded into
lower-level constructions, but only when there is just one
expansion possible under the rules of the language, and that
expansion does not scatter information which may be of later
use.
The most obvious case of such expansion is in dealing with
complex conditional clauses such as:
-----------------------------------------------------
| if (A=B and C#D) or (E<F and G>H) then X else Y |
-----------------------------------------------------
IMP-77 specifies that the condition will only be evaluated
as far as is necessary to determine the inevitable truth or
falsity of the condition, and so, bearing in mind the
modifications to be discussed in section 4.6.1, the
statement can be represented more simply as:
---------------------------------
| if A # B then ->L1 |
| if C # D then ->L2 |
| L1: if E >= F then ->L3 |
| if G <= H then ->L3 |
| L2: X |
| ->L4 |
| L3: Y |
| L4: |
---------------------------------
This expansion is tricky and notoriously error prone, and
therefore is best done once and for all in the common phase.
Similarly it is possible to expand all simple control
structures into their equivalent labels and jumps, providing
that the structural information is not lost thereby.
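The expansion itself is conveniently written as a pair of mutually recursive routines, one generating a jump when the condition proves true and one when it proves false, so that evaluation of "and" and "or" stops as soon as the outcome is known. The sketch below (C) illustrates the idea on the statement above; the condition representation, the label numbering and the textual "code" it prints are inventions for the illustration, not the mechanism used by the IMP77 compiler.

#include <stdio.h>

enum kind { CMP, AND, OR };

struct cond {
    enum kind kind;
    const char *test, *negated;        /* e.g. "A = B" and "A # B"      */
    struct cond *left, *right;
};

static int newlabel = 0;

static void jump_if_false(struct cond *c, int l);

static void jump_if_true(struct cond *c, int l)
{
    int skip;
    switch (c->kind) {
    case CMP: printf("if %s then ->L%d\n", c->test, l); break;
    case OR:  jump_if_true(c->left, l);           /* either side will do */
              jump_if_true(c->right, l); break;
    case AND: skip = ++newlabel;                  /* left false => whole */
              jump_if_false(c->left, skip);       /* conjunction false   */
              jump_if_true(c->right, l);
              printf("L%d:\n", skip); break;
    }
}

static void jump_if_false(struct cond *c, int l)
{
    int skip;
    switch (c->kind) {
    case CMP: printf("if %s then ->L%d\n", c->negated, l); break;
    case AND: jump_if_false(c->left, l);
              jump_if_false(c->right, l); break;
    case OR:  skip = ++newlabel;
              jump_if_true(c->left, skip);
              jump_if_false(c->right, l);
              printf("L%d:\n", skip); break;
    }
}

int main(void)
{
    /* (A=B and C#D) or (E<F and G>H) */
    struct cond ab = { CMP, "A = B", "A # B", 0, 0 };
    struct cond cd = { CMP, "C # D", "C = D", 0, 0 };
    struct cond ef = { CMP, "E < F", "E >= F", 0, 0 };
    struct cond gh = { CMP, "G > H", "G <= H", 0, 0 };
    struct cond l  = { AND, 0, 0, &ab, &cd };
    struct cond r  = { AND, 0, 0, &ef, &gh };
    struct cond c  = { OR,  0, 0, &l, &r };

    jump_if_false(&c, 3);              /* ->L3 is the else part          */
    printf("X; ->L4\nL3: Y\nL4:\n");
    return 0;
}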
4.3 Code layout and addressing
4.3.1 Nested procedure definitions
A common feature of programming languages is the ability to nest the definition of a procedure within another procedure. In addition, several languages imply the definition of procedures within single statements, as in the case of name parameters in ALGOL-60 where the parameter which is actually passed can be a reference to a "thunk", a procedure to evaluate the parameter. With such nesting provision must be made for preventing the flow of execution from "falling through" into the procedure from the preceding statements, and this is usually accomplished by planting at the start of the procedure a jump to the statement following the end. While this is simple to implement it does introduce extra instructions which are not strictly necessary.
With user-defined nested procedures the overhead can be minimised if a number of procedures is defined, as one jump instruction can be used to skip them all. Unfortunately thunks will be generated throughout the code in a more-or-less random way, giving little opportunity to coalesce the jumps. Even if the extra execution time caused by these jumps is insignificant (the jumps round thunks defined in loops get executed repeatedly), the code which they are skipping
stretches the code in which they are nested. On machines with fixed-size jump instructions which can cover the whole machine, such as the DEC PDP 10, the stretching causes no problems, but if the addressing is limited, or if several different sizes of jump instruction are provided, the presence of the nested procedure can result in more code being produced later in the generation of large jumps.
4.3.2 Paged machines
On paged machines the overall performance of a program does not depend solely on the efficiency of the code produced by the compiler but includes a factor depending on the locality of references made by the executing program. Traditionally this locality has been improved by monitoring the execution of the program and then re-ordering parts of it in the light of the measurements. Unfortunately not all operating systems provide the user with convenient tools to enable the measurement to be done, leaving only ad hoc methods or intuition for guidance. Without careful control it is all too easy to move one procedure to improve references to it and thereby cause another piece of code to cross page boundaries and counteract any gains in paging performance. Even if the user can obtain the necessary information, a slight change in the program can invalidate the changes. Notwithstanding these problems it is evident that by careful
structuring of a program significant gains in paging behaviour can be obtained and so this option should not be pre-empted by the intermediate-code (as does Z-CODE which automatically reorders the definitions of procedures). The possibility of automatic improvement of paging behaviour was investigated by Pavelin who showed that the paging characteristics of a program can be improved by an automatic reordering of the code [Pavelin, 1970]. Pavelin's thesis describes the breaking-up of a program into "chunks", defined by branches and the destinations of branches. At each chunk boundary extra instructions are planted to cause the updating of a "similarity array" which records the dynamic characteristics of the program. After several runs the similarity arrays are merged and the result is used to specify a reordering of the chunks which should improve the paging performance. In test cases the working-set size of the code was reduced by as much as 40%. The thesis also went on to say that the various compilation problems associated with this "... can be alleviated by operating on an intermediate code which is machine independent with symbolic code addresses".
For example, on the INTERDATA 7/16 a procedure at the outermost textual level uses register 15 to access its local stack frame, giving the exit sequence:

-------------------
| LM   7, 4(15)   |
| BFCR 0, 8       |
-------------------

but a procedure nested within this would use register 14 thus:

-------------------
| LM   7, 4(14)   |
| BFCR 0, 8       |
-------------------

It follows that the signal routine must be told which base register to use at each stage of the recovery. This can be done either by planting code in the entry and exit sequences of each procedure, or by keeping a static table associating procedure start and finish addresses with the appropriate base register. The first method is poor as it imposes a run-time overhead on all procedures, whether they trap events or not. The second method is better but can be complicated if procedures are nested as the start-finish addresses alone no longer uniquely define the procedure.
One solution is to cause all procedures which use the same exit sequence to be loaded into distinct areas, and to associate the recovery information for the signal routine with each area. This reduces the static overhead to a few words per area, rather than a few words per procedure.
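The static-table method amounts to little more than a search, keyed by the interrupted program counter, of a table built when the code is laid out. A sketch of the idea in C is given below; the address values and register numbers are invented, and the rule of taking the smallest enclosing range is one simple way of resolving the nesting, not the area scheme actually adopted.

#include <stdio.h>

struct area {
    unsigned start, finish;        /* code addresses covered             */
    int      base_register;        /* e.g. 15 at the outer level, 14 ... */
};

static const struct area areas[] = {
    { 0x1000, 0x1800, 15 },        /* outer-level procedure              */
    { 0x1200, 0x1300, 14 },        /* procedure nested within it         */
};

static int base_for(unsigned pc)
{
    int best = -1;
    unsigned best_size = ~0u;
    for (unsigned i = 0; i < sizeof areas / sizeof areas[0]; i++)
        if (areas[i].start <= pc && pc < areas[i].finish &&
            areas[i].finish - areas[i].start < best_size) {
            best = areas[i].base_register;      /* innermost enclosing   */
            best_size = areas[i].finish - areas[i].start;
        }
    return best;                   /* -1: not in any known procedure     */
}

int main(void)
{
    printf("%d %d\n", base_for(0x1100), base_for(0x1250));   /* 15 14    */
    return 0;
}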
ii Array references with constant subscripts need no
address calculations at run-time. For example
using V as declared above, the element V(2) is
immediately addressable as the halfword with
displacement "a+2 + 2*2" from LNB.
iii In certain more general cases when the subscript is
a variable, access can be simplified by remembering
previous calculations. For example, the address of
the array element V(X) is
-------------------------------------
| addr(V(0)) + X*size of each element |
-------------------------------------
In the example above this becomes:
-------------------------------------
| LNB+a+2 + X*size of each element |
-------------------------------------
which can be rearranged to:
-------------------------------------------
| a+2 + (X*size of each element+LNB) |
-------------------------------------------
Hence the following code could be produced (7/16):
--------------------
| V(X) = 0 |
| |
| LH 1,X(LNB) | pick up X
| AHR 1,1 | *2 (2 bytes per integer)
| AHR 1,LNB | add in LNB
| SHR 0,0 | get zero
| STH 0,a+2(1) | store in V(X)
--------------------
Noting that the value now in register 1
(X*size+LNB) only depends on the size of each
element, X, and the local name base, it is clear
that register 1 can be used to address the X'th
element of any integer array of one dimension and
constant bounds declared at the current level.
Hence if the array W(1:12) were declared
immediately after Y in the example above, while
register 1 is not changed W(X) can be addressed as
a+2002(1).
On the other hand, a machine with limited store cover,
such as the Data General NOVA which only has an eight-bit
displacement, will almost certainly force the array to be
implemented as an immediately addressable pointer which is
initialised at run-time to the address of storage claimed
explicitly.
--------------------------------------------------------
| LNB |
| | |
| | |
| v .---.---.---. |
| .- -| X | V | Y | - - - |
| .---.---.---. |
| | |
| | |
| | .------.------. .--------. |
| +->| V(0) | V(1) | - - - - | V(999) | - - - |
| .------.------. .--------. |
--------------------------------------------------------
With this organisation the address of V(X) will be:
----------------------------
| V + X*size of each element |
----------------------------
and there is little that can be done by rearranging the
expression to improve on the "obvious" code (7/16):
----------------
| V(X) = 0 |
| |
| LH 1,X | pick up X
| AHR 1,1 | double it
| AH 1,V | add in addr(v(0))
| SHR 0,0 |
| STH 0,0(1) |
----------------
Not only is this second code sequence longer than the first
by two bytes, but it will execute more slowly as the second
addition involves a store reference whereas the equivalent
instruction in the first sequence uses a register.
In both cases, however, some simplification can be done if
the subscript is an expression of the form:
------------------------------
| X plus or minus CONSTANT |
------------------------------
in which case the constant can be removed from the subscript
expression evaluation and added into the final displacement.
For example (7/16):
------------------------
| V(X-7) = 0 |
| |
| LH 1,X | pick up X
| AHR 1,1 | double it
| AHR 1,LNB | add in LNB
| SHR 0,0 | get zero
| STH 0,a+2+(-7)*2(1) |
------------------------
Unfortunately even this optimisation may not be available.
For example, the ICL 2900 series performs array accesses
through a DESCRIPTOR REGISTER, and the extra displacement
cannot be added into the instruction. Also some machines,
such as the IBM 360, only permit positive displacements in
instructions.
The examples above pose the following problem: if the
intermediate code is to know nothing of the target machine
it cannot know the best way to declare the array, nor the
best way to access it. Therefore the code must always
produce the same sequences for array declarations and array
accesses. It follows that these sequences must remain quite
close to the original source and not include any explicit
address calculations.
As another example, the DEC PDP11 range has a hardware stack
which grows with decreasing store addresses. Because of this
it could be convenient to allocate storage for variables in
that order, from large addresses to small addresses.
However, in several cases it may be necessary to force
objects to be created in increasing addresses, such as when
program structures are to be mapped onto hardware-defined
structures in memory, resulting in an implementation which
requires to be able to create similar objects in different
ways depending on the context.
Finally, some machines provide instructions in which the
displacement of the operand is scaled before use, depending
on the size of that operand. The GEC 4080 is such a machine,
with instructions such as:
        LDB 1     load byte <1>
        LD  1     load halfword, bytes <2> & <3>
        LDW 1     load fullword, bytes <4>,<5>,<6> & <7>
When producing code for such machines it is convenient to
allocate all the local objects of the same size in
particular areas, and then order the areas in increasing
order of the size of the objects they contain. This permits
fuller use of the available displacement field in the
instructions.
The solution to these problems which was chosen in I-code
was to define a DESCRIPTOR for each object to be
manipulated. On input to the code-generator descriptors are
converted from their machine-independent form to a new form
appropriate to the target machine. As all subsequent
references to the object will be through descriptors, the
code produced will automatically reflect the decisions made
at the time the descriptors were created.
As has been discussed previously, it may be possible to
remove the overhead in setting up addressability for local
variables and parameters if the parameters can be held in
registers and the local variables are never referenced.
After examining many procedures which do use local variables
it is clear that a large number of them do not need the
complete overhead of setting up a local frame base, as they
could use the workspace pointer (stack pointer) instead.
The criterion is that the position of the locals relative to
the workspace pointer must be known at compile time. This
reduces to the procedure not having any objects with
computed sizes (arrays with computed bounds, for example)
and no calls on procedures which access the locals as their
global variables.
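As a minimal sketch of this criterion (the flag and function
names are invented for illustration, not taken from the
IMP77 compiler), the second phase can carry two flags per
procedure and make the decision when the end of the
procedure is reached; the PDP11 routine below then shows the
saving which becomes possible.
----------------------------------------------------------------
/* Illustrative sketch: per-procedure flags which decide whether */
/* the local frame base can be dispensed with in favour of       */
/* workspace-pointer (DS) relative addressing.                   */

struct proc_info {
    int has_dynamic_locals;   /* e.g. arrays with computed bounds */
    int has_dangerous_calls;  /* calls on procedures which access
                                 these locals as their globals    */
};

/* Consulted when the end of the procedure is processed; the     */
/* result is passed on to the third phase as a mark on the       */
/* procedure (compare section 4.7.2).                             */
int can_use_ds_relative(const struct proc_info *p)
{
    return !p->has_dynamic_locals && !p->has_dangerous_calls;
}
----------------------------------------------------------------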
Consider the compilation of the following procedure on the
PDP11:
----------------------------------------
| routine MARK(record(cellfm)name CHAIN) |
| integer N |
| N = 0 |
| while not CHAIN == NULL cycle |
| N = N+1 |
| CHAIN_INDEX = N |
| CHAIN == CHAIN_LINK |
| repeat |
| end |
----------------------------------------
The code normally produced for this routine would be:
--------------------------
| MOV LNB,-(SP) | remember old LNB
| MOV DS,-(SP) | remember DS
| MOV R0,(DS)+ | save the parameter
| MOV DS,LNB | set up local addressing
| ADD #20,DS | reserve local space
| CLR 10(LNB) | N = 0
| $1: MOV -2(LNB),R1 | test CHAIN
| BEQ $2 | branch if NULL
| INC 10(LNB) | N = N+1
| MOV 10(LNB),2(R1) | CHAIN_INDEX = N
| MOV (R1),-2(LNB) | CHAIN == CHAIN_LINK
| BR $1 | repeat
| $2: MOV (SP)+,DS | restore DS
| MOV (SP)+,LNB | restore LNB
| RTS PC | return
--------------------------
however, by using workspace pointer (DS) relative addressing
this reduces to:
--------------------------
| MOV R0,(DS)+ |
| TST (DS)+ | reserve local space
| CLR -2(DS) | N = 0
| $1: MOV -4(DS),R1 | test CHAIN
| BEQ $2 |
| INC -2(DS) | N = N+1
| MOV -2(DS),2(R1) | CHAIN_INDEX = N
| MOV (R1),-4(DS) | CHAIN == CHAIN_LINK
| BR $1 |
| $2: SUB #4,DS | restore DS
| RTS PC | return
--------------------------
This optimisation can be performed quite simply by the
third phase of compilation.
In the interface between the second and third phases the
code sequences generated by the second phase are made up of
items of the form:
<type> <VALUE>
where <type> describes where <VALUE> is to be put, for
example in the code area or in the private data area. To
achieve the workspace-pointer-relative addressing extra
types are introduced which specify that the associated value
is the displacement of a local variable from LNB. Other
codes are needed to be able to modify the operation part of
the instruction which uses the displacements but these will
be ignored here as they cause no difficulty and would just
obscure the discussion. In addition, an extra <modify DS>
item is output whenever DS is explicitly altered (as when
parameters are stacked using MOV ??,(DS)+). By default the
third phase will treat these extra types as being exactly
equivalent to <code area> types, and will generate the first
sequence of code. However, if, when the end of the procedure
is processed, the second phase discovers that no dynamic
objects or dangerous procedure calls were generated, it
marks the end of the procedure accordingly (in the same way
as described in section 4.7.2). This mark instructs the
third phase to relocate all values with the appropriate type
so as to make them relative to DS. The <modify DS> types are
used to keep the third phase's idea of the current position
of DS in step with reality.
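The relocation performed by the third phase might be
organised as in the following sketch; the item types and
names are invented, but the logic follows the description
above: local displacements are LNB-relative by default and,
if the procedure is marked, are rebased onto DS, whose
simulated position is kept in step by the <modify DS> items.
----------------------------------------------------------------
/* Illustrative sketch: rebasing local displacements from LNB    */
/* onto DS in the third phase.  Item types and names invented.   */

enum item_type { CODE_WORD, LOCAL_DISP, MODIFY_DS };

struct item { enum item_type type; int value; };

/* 'marked' is set if the second phase found no dynamic objects  */
/* and no dangerous calls; 'ds_minus_lnb' is the displacement of */
/* DS from the (notional) LNB once the locals are reserved.      */
void emit_procedure(const struct item *items, int n, int marked,
                    int ds_minus_lnb, void (*put_word)(int))
{
    int ds = ds_minus_lnb;        /* simulated position of DS    */

    for (int i = 0; i < n; i++) {
        switch (items[i].type) {
        case CODE_WORD:
            put_word(items[i].value);
            break;
        case LOCAL_DISP:          /* LNB-relative value; rebase  */
            put_word(marked ? items[i].value - ds : items[i].value);
            break;
        case MODIFY_DS:           /* e.g. MOV r,(DS)+  =>  +2    */
            ds += items[i].value;
            break;
        }
    }
}
----------------------------------------------------------------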
4.5 Procedure entry and exit
IMP is heavily based on the use of procedures; indeed the
only method of communicating with the controlling
environment is by means of procedure calls. Also the
techniques of structured programming result in the extensive
use of procedures. Clearly when writing a compiler for such
languages much thought must be given to making procedure
entry and exit (and the associated passing of parameters) as
efficient as possible.
4.5.1 User-defined procedures
The usual technique for procedure entry and exit is to have
standard preludes and postludes which cover all the
different types of procedure. For example the EMAS IMP code
sequences [Stephens, 1974] are (ICL4/75):
     -------------------------
     |      STM  4,14,16(WSP) | save the current environment
     |      BAL  15,PROC      | enter the procedure
     |      .                 |
     | PROC ST   15,60(WSP)   | save the return address
     |      LR   LNB,WSP      | set up the local stack frame
     |      LA   WSP,***(WSP) | claim local stack space
     |      BALR 10,0         | set up code addressability
     |      .                 |
     |      .                 |
     |      LM   4,15,16(LNB) | restore calling environment
     |      BCR  15,15        | return
     -------------------------
While this has proved to be convenient to generate and
efficient to execute it has one major problem, part of the
housekeeping of the procedure entry is performed at the call
itself. This seems undesirable for two reasons:
i Procedures are generally called more often than
they are defined. If part of the housekeeping of
procedure entry is done at the call that code will
be duplicated at each call, thus increasing the
size of the program. Putting that code within the
procedure reduces the size overhead.
ii If the knowledge of what housekeeping needs to be
done for procedure entry is needed outside the
procedure it becomes impossible to alter the entry
and exit sequences to suit the actual procedure.
In particular on certain machines it is possible to
remove the entry and exit sequences altogether when
the procedures are simple enough.
If the 4/75 compiler moved the environment-saving STM
instruction into the body of the procedure the storing of
the return address would be performed automatically:
-------------------------
| BAL 15,PROC |
|- - |
| PROC STM 4,15,16(WSP) |
| LR 8,WSP |
| .. .. |
-------------------------
This not only saves four bytes per call, very important on a
machine with a very severely limited immediate addressing
range, but also reduces the overhead in entering the
procedure by one instruction.
A further modification would be to pass one or more of the
parameters in the registers, leaving the way open for
remembering that fact inside the procedure.
Hence a call could be reduced from:
------------------------
| L 1,X |
| ST 1,64(WSP) |
| L 2,Y |
| ST 2,68(WSP) |
| BAL 15,PROC | PROC(X, Y)
|- -|
| PROC STM 4,15,16(WSP) |
| .. .. |
------------------------
to:
------------------------
| L 0,X |
| L 1,Y |
| BAL 15,PROC |
|- -|
| PROC STM 4,1,16(WSP) |
| .. .. |
------------------------
The ability to determine exactly how parameters are to be
passed can be of crucial importance in the efficiency of the
procedure mechanism.
When compiling for the PDP11 the obvious calling sequence
for a procedure with two integer value parameters would be:
-------------------
| MOV X,-(SP) |
| MOV Y,-(SP) |
| JSR PC,PROC |
-------------------
Unfortunately this produces problems inside the procedure as
the return address, stacked by JSR, is too far down the
stack to permit the use of the RTS instruction to return,
for this would leave on the stack the space used by the
parameters. Nor can the stack be adjusted before the return,
with the return then made indirect through a location beyond
the stack pointer, as space there must be considered
volatile, being used by interrupt handling. Extra
instructions are needed either at the call or inside the
procedure to adjust the stack; the JSR instruction may well
not be "a beauty" as claimed by some implementors [Bron,
1976]. A MARK instruction has been introduced in an attempt
to overcome this problem, but it is far from helpful as it
imposes an arbitrary register convention and puts all of the
overhead on the call rather than on the procedure itself.
On the other hand if all of the parameters can be passed in
registers, the JSR will put the return address on a clear
stack permitting the use of RTS for the return. As in
practice most procedures have few parameters, usually only
one or two, this can give a large saving.
As an example of the power of being able to alter entry and
exit sequences consider a recursive implementation of the
IMP routine SPACES:
-----------------------------
| routine SPACES(integer N) |
| return if N <= 0 |
| SPACES(N-1) |
| SPACE |
| end |
-----------------------------
On the PDP10 the straightforward coding for this would be:
--------------------------
| MOVE 0, X | pick up X
| MOVEM 0, 3(SP) | assign the parameter
| PUSHJ SP, SPACES | call SPACES
|- -|
| SPACES: MOVEM LNB,1(SP) | save old frame base
| MOVE LNB,SP | pick up new frame base
| ADDI SP,3 | reserve new frame space
| SKIPLE 1,2(LNB) | load, test & skip if X<=0
| JRST LAB1 | jump to LAB1
| MOVE SP,LNB | restore stack pointer
| POPJ SP | return
| LAB1: SOJ 1, 0 | X-1 -> ACC1
| MOVEM 1,3(SP) | assign parameter
| PUSHJ SP,SPACES | call SPACES
| PUSHJ SP,SPACE | call SPACE
| MOVE SP,LNB | restore stack pointer
| MOVE LNB,1(SP) | restore old frame base
| POPJ SP |
--------------------------
By applying the optimisations of passing the parameter in an
accumulator (called ARG) and remembering that the parameter
is in this accumulator on entry to the procedure, the code
reduces to:
---------------------------
| MOVE ARG,X | pick up X
| PUSHJ SP, SPACES | call SPACES
|- -|
| SPACES: MOVEM LNB,1(SP) |
| MOVE ARG,2(SP) | assign the parameter
| ADDI SP,3 |
| JUMPG ARG, LAB1 | ->LAB1 if ARG > 0
| MOVE SP,LNB | restore stack pointer
| MOVE LNB, 1(SP) |
| POPJ SP | return
| LAB1: SOJ ARG, 0 | parameter = ARG-1
| PUSHJ SP,SPACES |
| PUSHJ SP,SPACE |
| MOVE SP,LNB |
| MOVE LNB,1(SP) |
| POPJ SP |
---------------------------
On inspection it is clear that the local stack frame
(pointed at by LNB) is never used within the procedure
except by the entry and exit sequences. Hence by reducing
those sequences to the absolute minimum the code becomes
---------------------------
| MOVE ARG,X |
| PUSHJ SP, SPACES |
|- -|
| SPACES: JUMPG ARG, LAB1 |
| POPJ SP |
| LAB1: SOJ ARG, 0 |
| PUSHJ SP,SPACES |
| PUSHJ SP,SPACE |
| POPJ SP |
---------------------------
Finally, an opportunistic optimisation may be performed
[Knuth, 1974; Spier, 1976] by noticing that the final two
instructions may be combined so that the procedure SPACE
uses the return address pushed onto the stack for the return
from SPACES. This results in the tightest form of the code:
----------------------------
| MOVE ARG,X |
| PUSHJ SP, SPACES |
|- -|
| SPACES: JUMPG ARG, LAB1 |
| POPJ SP |
| LAB1: SOJ ARG, 0 |
| PUSHJ SP,SPACES |
| JRST SPACE |
---------------------------
The final steps in this optimisation can only be performed
once the body of the procedure has been compiled. In order
that the correct (in this case non-existent) entry sequence
can be used an extra pass over the object code is necessary.
This pass can be combined with the process of adjusting
labels and jumps which is carried out in the third phase of
compilation described in section 4.7. The code generator can
mark the position where an extra sequence is required and at
the end of the procedure can inform the third phase of any
salient features found in the body. The third phase can then
decide on the best entry and exit sequences to use.
This ability to tailor the "housekeeping" parts of
procedures can be used in many circumstances to limit the
inclusion of code which is needed to handle rare
constructions to those procedures which use the feature. As
an example of this consider the ICL 2900 series. The
machines of the series are designed around a hardware stack,
which resides in one, and only one, segment of the user's
virtual memory, and thus limits this data space to 255K
bytes. In order to be able to handle programs using very
large arrays space must be available off-stack in another
segment or set of consecutive segments. The maintenance of
this extra data space will require instructions to be
executed on entry to and on exit from procedures which claim
space from it, but not from those which only use space from
the stack. These extra instructions can be added to the
procedure in a simple manner by the third phase as it now
controls the form of the procedure when all the necessary
information is available.
For these optimisations to be performed the intermediate
code must not lay down rules for procedure entry and exit,
rather it should simply mark the points at which suitable
code is required. An additional consideration in the design
of the I-code for procedure entry and exit is the
requirement of some machines for a "pre-call" to be made to
prepare the hardware stack for parameters prior to their
evaluation and assignment.
For example (ICL2900):
-------------------
| PROC(1, 2, 3) |
|- -|
| PRCL 4 | pre-call
| LSS 1 | load 1
| SLSS 2 | stack it and load 2
| SLSS 3 | stack it and load 3
| ST TOS | store it on Top Of Stack
| RALN 8 | raise the Local Name Base
| | to point to the new frame
| CALL PROC |
-------------------
Following these considerations the form of procedure call
chosen for I-code was:
-----------------
| PROC P | stack procedure descriptor
| | \
| {stack param} | : repeated for each parameter
| ASSPAR | /
| ENTER | enter the procedure
-----------------
ASSPAR causes the value described on the top of the stack to
be assigned to the next parameter, identified by the
procedure descriptor second on the stack, using either
ASSVAL or ASSREF as appropriate.
In order to pass some of the parameters in registers all
that need be done is for the initial processing of the
descriptors for those parameters to define them as the
appropriate registers. PROC can then "claim" those
registers, the parameter assignment will load them, and
finally ENTER can release them for subsequent re-use on
return from the procedure.
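A sketch of how a code generator might act on this follows;
all names and the register-management primitives are
invented and stand for the rest of the code generator. PROC
claims the registers named in the procedure's parameter
descriptors, each ASSPAR either loads the next register or
stores into the parameter's frame location, and ENTER plants
the call and releases the registers.
----------------------------------------------------------------
/* Illustrative sketch: handling register parameters under the   */
/* PROC / ASSPAR / ENTER scheme.  All names are invented.        */

struct param_desc { int in_register; int reg; int frame_disp; };

extern void claim_register(int reg);
extern void release_register(int reg);
extern void load_register(int reg);        /* top of stack -> reg   */
extern void store_param(int frame_disp);   /* top of stack -> frame */
extern void plant_call(int proc);

static const struct param_desc *params;    /* from the descriptor   */
static int nparams, next_param;

void do_proc(const struct param_desc *p, int n)
{
    params = p; nparams = n; next_param = 0;
    for (int i = 0; i < n; i++)            /* busy from the call on */
        if (p[i].in_register)
            claim_register(p[i].reg);
}

void do_asspar(void)                       /* one per parameter     */
{
    const struct param_desc *p = &params[next_param++];
    if (p->in_register) load_register(p->reg);
    else                store_param(p->frame_disp);
}

void do_enter(int proc)
{
    plant_call(proc);                      /* then free the regs    */
    for (int i = 0; i < nparams; i++)
        if (params[i].in_register)
            release_register(params[i].reg);
}
----------------------------------------------------------------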
the same compiler. In particular the parameter passing
mechanisms may be different from those used in the
intermediate code. In order to cope with these and other
considerations any intermediate code which permits access to
external procedures must be sufficiently flexible to allow
the variations to be handled efficiently.
4.5.3 Permanent procedures
Most languages define a set of procedures which will be
available on any implementation without explicit action by
the user (such as the IMP procedures ITOS, READSYMBOL, READ,
and REM). Such procedures are termed "permanent procedures".
It is common for intermediate codes to provide specific code
items to invoke permanent procedures, but this has the
problem that the code-generator must know about all such
procedures, and the language-dependent phase must be changed
and the intermediate code extended if an implementation
wishes to make efficient use of procedures which can be
compiled in-line on particular machines. For example many
machines provide an instruction for moving blocks of store
around and it could be advantageous to have a procedure
which invoked this instruction directly.
Before investigating ways of improving the implementation of
permanent procedures it is useful to examine in some detail
the properties of the procedures mentioned above,
which were chosen because they typify the main problems in
this area.
ITOS is a fairly complicated string function which returns
as its result the decimal character string representation of
the integer value passed to it as a parameter. Because of
its complexity this procedure is almost always best
implemented as an external procedure which is linked into
the program along with any other external entities required.
REM is an integer function which returns the remainder of
dividing the first integer parameter by the second, and on
many machines can be efficiently compiled in-line, as most
integer divide instructions provide both the quotient and
the remainder. However, when compiling for machines such as
the DATA GENERAL NOVA or the DEC PDP11 when they do not have
the optional divide instructions, division has to be
performed by a complicated subroutine, suggesting that REM
itself should be an external procedure like ITOS.
READSYMBOL falls somewhere between the two, mainly because
it is defined to have a general name parameter, that is, the
parameter may be a reference to any type of entity: integer,
real, byteinteger, etc. To implement READSYMBOL as an
external procedure it would have to be passed the general
name parameter (comprising both the address of the actual
parameter and information about its type and precision), and
would have to interpret that parameter in order to be able
to store the character, suitably converted, in the
appropriate way. A much more efficient implementation is to
convert the statement:
               -----------------
               | READSYMBOL(S)  |
               -----------------
into the equivalent form:
               --------------------
               | S = F$READSYMBOL  |
               --------------------
where F$READSYMBOL is a function which returns as its result
the character value that READSYMBOL would have placed into
its parameter. Once this is done conversions and the choice
of store operation can be left to the usual assignment part
of the compiler.
A further complication can arise if, as in the case of the
INTERDATA 7/16 operating system, ISYS [Dewar, 1975], several
permanent procedures map directly onto system-provided
facilities: the function F$READSYMBOL can be replaced by the
supervisor call "SVC 8,0", SELECT INPUT by "SVC 6", etc.
The difficulty caused by READ is mainly one of space. As
READ can input an integer value, a real value, or a string
value depending on the type of its (general name type)
argument, it is going to be fairly large, especially if the
hardware on which it runs does not provide floating-point
instructions, forcing those functions to be performed by
subroutine. It follows that on small systems it may be
convenient to replace calls on READ by calls on smaller
procedures, chosen at compile-time by examining the type of
the parameter given to READ, which input solely integer,
real, or string values.
Finally it should be noted that the substitutions and
modifications discussed above may only be generated as
replacements for direct calls on the procedure; if the
procedure is passed as a parameter to another procedure no
alterations are possible and a "pure" version must be
available. As passing a procedure as a parameter is totally
distinct from calling the procedure this case does not
prevent the improvements being carried out where possible.
It should now be clear that the efficient implementation of
permanent procedures will differ greatly from the
implementation of user-defined procedures, and that the
implementation of permanent procedures will differ from
machine to machine. Hence the intermediate-code must make no
assumptions about either which permanent procedures are
available or how they are to be implemented. As a
side-effect of removing any built-in properties from
permanent procedures it becomes possible for a simple
code-generator to ignore any possibility of producing
special code and compile them all as externals.
These transformations of procedures can only be applied when
the procedures are invoked (called) directly. In the case of
procedures passed as parameters all calls will of necessity
be the same and hence either it will not be possible to pass
some permanent procedures as parameters, an unfortunate
limitation imposed by several languages, or there must be a
"pure" form of the procedures available.
This latter can be done very simply using I-code. The
permanent procedure descriptors are defined exactly as if
the procedures were truly external, but with an extra marker
showing them to be "permanent". The only time that this
marker is used is in the procedure call generating section
of the compiler. If the procedure is being passed as a
parameter this section of the compiler is not entered and so
the procedure will be passed as an external. All that is now
necessary is for there to be an external manifestation
available when the program executes. This method has the
added advantage that there is no compile-time overhead,
especially important considering that passing procedures as
parameters is one of the least-used features of IMP77.
4.5.4 Primitive procedures
It is rare for machines to provide simple instructions which
can deal directly with all of the requirements of high-level
languages and so several constructions will have to be
handled by subroutines. The code generator may then refer to
these "primitive procedures" as though they were
machine instructions. The cases in which such procedures are
required commonly include exponentiation, string
manipulation, and array declaration and access.
Given these procedures the code-generator has a choice
between calling them as closed subroutines or expanding them
in-line. The former produces dense code but will execute
more slowly than the latter (and possibly suffer from not
knowing what is corrupted by the routine and therefore
having to forget everything it knows). On the other hand,
while the expansion of primitive procedures in-line will
improve the execution speed of the program, the
code-generator must be able to create the appropriate code
sequences and so becomes more bulky.
Once again the choice must be left to the code-generator as
the benefits of a particular decision will depend on both
the target machine and the use to which the compiler is to
be put. If the compiler is to be used for large mathematical
problems it is likely that the gains made by putting
exponentiation in-line will outweigh the disadvantage of the
extra code size, whereas in operating-system work, where
exponentiation is probably never needed, the extra
complexity of the code generator to expand the routine would
not be desirable.
Given that some of the primitive procedures will be
referenced often (checked array access, for example) it is
important that entry to them is made as efficient as
possible, and in this area the ability to reorder code can
be used to great effect.
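The call-or-expand choice just described might be organised
as in this sketch (the names are invented); the Interdata
7/32 example which follows then shows how the closed-call
case can itself be made cheaper by reordering.
----------------------------------------------------------------
/* Illustrative sketch: choosing between in-line expansion and a */
/* closed call for a primitive procedure.  Names are invented.   */

typedef void (*expander)(void);     /* plants the in-line form   */

struct primitive {
    const char *name;         /* e.g. "REM", "MOVE", "CHECK"     */
    expander    expand;       /* null => no in-line form here    */
    int         corrupts;     /* register mask the closed form
                                 may overwrite                   */
};

extern void plant_call(const char *name);
extern void forget_registers(int mask);

void use_primitive(const struct primitive *p, int prefer_speed)
{
    if (p->expand && prefer_speed) {
        p->expand();                      /* bigger but faster   */
    } else {
        plant_call(p->name);              /* dense, but the      */
        forget_registers(p->corrupts);    /* remembered values   */
    }                                     /* must be discarded   */
}
----------------------------------------------------------------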
In the original Interdata 7/32 IMP77 compiler the
primitive routines were gathered together at the end of the
user's code, as it was only then that it was known which
procedures were required.
---------
| | <- CODE BASE (register 15)
| USER |
| CODE |
| |
---------
| |
| PRIM |
| PROCS |
| |
---------
With this scheme programs of 16Kbytes or less can reference
the primitive procedures with 32-bit instructions
(program-counter relative addressing). Unfortunately once
the program grew beyond this limit the larger and slower
48-bit form of the instructions had to be used in order to
achieve addressability. In the IMP77 code generator there
were 352 such large instructions.
In the new compiler the object code is reordered to place
the primitive procedures at the head of the user's code
where they can be addressed relative to CODE BASE.
---------
| | <- CODE BASE (register 15)
| PRIM |
| PROCS |
| |
---------
| |
| USER |
| CODE |
| |
---------
The immediate disadvantage of this is that it will push the
user's procedures further away from CODE BASE and hence
increase the chances of a user procedure reference requiring
a long (48-bit) instruction. However in practice this is
not a problem as the total size of the primitive procedures
is usually quite small, typically less than 800 bytes on the
7/32. The IMP77 code generator mentioned above now needs no
long references at all, saving 724 bytes of code, out of
about 40Kbytes. The compression of the code so achieved can
be enhanced slightly by bringing the destinations of more
jumps into the short-jump range, giving an extra saving of
20 bytes in the case above. In addition, now that a register
(CODE BASE) is pointing to the first primitive procedure the
list of procedures required can be reordered to place the
most frequently referenced one first and thereby reduce
references to it to 16-bit instructions
(BALR LINK,CODEBASE).
When compiling with checks on, by far the most commonly referenced primitive procedure is the routine which checks for the use of an unassigned variable (over 2000 references to it in the code generator), and this trivial optimisation results in a saving of more than 4000 bytes.
4.6 Language-specified and compiler-generated objects
During compilation various objects will be manipulated in
order to generate code. Some of these objects have a direct
representation in the source program and are referred to as
"language-specified" objects, whereas others are created by
the compilation process itself and are referred to as
"compiler-generated" objects. The fact that the
compiler-generated objects will be (or can be constrained to
be) used in a stereotyped and well-behaved fashion can be
used to great advantage to give simple means for optimising
parts of the program.
4.6.1 Internal labels
Using most intermediate codes the following program parts
would translate into effectively identical sequences:
     -----------------------------------------------------
     | ->LAB if X = 0        |    if X # 0 start          |
     |   Y = 3               |       Y = 3                |
     | LAB:                  |    finish                  |
     -----------------------------------------------------
At first glance this is as it should be, for the two program
fragments are semantically identical and could therefore be
implemented by the same object code, for example on the
PERKIN-ELMER 3200:
     -------------
     | L   1,X   | pick up X and set the condition code
     | BZ  $1    | branch equal (to zero)
     | LIS 0,3   | pick up 3
     | ST  0,Y   | store it in Y
     | $1:       | define label $1
     -------------
However, if it is known that the label $1 will only ever be
used once, the code-generator may remember that the current
value of the variable X will still be in register 1
following the label, and thus remove the need for it to be
loaded again if it is required before register 1 gets
altered.
In the case of user-defined labels no statement can be made
about the number of uses of each label without a complete
analysis of the parts of the program where the label is in
scope. This suggests that I-code should maintain a clear
distinction between user-defined and compiler-generated
labels. Also, by making the rule that compiler-generated
labels may only be used once, the internal representations
of labels may be reused by the code-generator, removing the
necessity for large tables of label definitions in this
phase of compilation.
This now leaves the question of how to represent conditional
jumps in the intermediate code. The first observation is
that user-specified jumps need never be conditional as they
can always be surrounded by appropriate compiler-generated
conditional jumps. This can be used to restrict the
processing of conditions and tests to the compiler-generated
jumps. The second observation is that in IMP77 conditionals
are always associated with the comparison of two values or
the testing of an implied boolean variable (predicates and
string resolution). There are currently three main ways in
which processors handle this:
1 "compare" instructions are used to set flags or
condition-codes which represent the relationship
between two values (one of which is frequently an
implied value of zero). These condition-codes are
later used to control the execution of conditional
branch instructions. This method is used in the
PDP11 (CMP, BNE, etc.).
2 Instructions are provided which compare two values
as above but instead of setting condition-codes
they skip one or more subsequent instructions
depending on a specified relationship. By skipping
unconditional branches in this way conditional
branch sequences may be generated. This method is
used in the PDP10: SKIPE etc.
3 Instructions are provided which compare two values
and branch to a specified label if a given
relationship holds. This method is used in the
PDP10: JUMPNE etc.
P-code uses compare instructions to set the boolean value
TRUE or FALSE on the stack and then uses this value either
as an operand in an expression or to condition a branch (a
variant of technique 1 above).
Z-code tests the value in a register against zero and
branches accordingly (technique 3 above).
These three techniques have fairly obvious possible
representations in I-code:
if X = Y start
1) PUSH X
PUSH Y
COMP {set condition code}
BNE 1 {branch not equal}
2) PUSH X
PUSH Y
SKIPE {compare and skip if equal}
GOTO 1
3) PUSH X
PUSH Y
JUMP # 1 {compare and branch if not equal}
All three of these representations have been tried in
different versions of I-code.
Technique 2) was rejected as it proved cumbersome to
implement effectively, especially on machines which did not
use skips; either the code-generator had to "look ahead" to
be able to locate the destination of the skip (which is
dependent on the instruction being skipped) or to check
before each instruction whether or not a skip had been
processed earlier and its destination had not yet been
resolved.
Technique 1) was perfect for machines with condition-codes
but required look-ahead over subsequent jumps on machines
which used skips.
Both 1) and 2) had the additional problem that to generate
conditional branches two separate I-code instructions had to
be given. In the case of 1) condition-codes are usually
altered by many instructions not directly involved in
comparison and hence the compare and its associated branch
must be made adjacent. With 2) there is the possibility of
generating meaningless constructions such as skipping a
line-number definition instruction. These difficulties add
complexity to the definition of the intermediate code and
require extra checks in the code generator.
Thus the third form was chosen as the most convenient, even
though all three forms can be suitably defined to be totally
equivalent. In particular the third technique provides all
the relevant information to the code-generator in one
instruction, and has proved to be simple and effective as a
basis for generating code for both condition-code and skip
sequences.
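A sketch of why the single compare-and-branch form is
convenient follows; the helper 'plant' and the exact
mnemonics are only indicative. One I-code item carries both
operands, the relation and the destination, and the code
generator lowers it directly onto whichever style the target
provides.
----------------------------------------------------------------
/* Illustrative sketch: one compare-and-branch I-code item       */
/* lowered onto a condition-code machine or onto a skip machine. */

enum relation { EQ, NE, LT, LE, GT, GE };

extern void plant(const char *fmt, ...);  /* emits one instruction */

/* Condition-code style (in the manner of the PDP11):            */
void cjump_cc(const char *lhs, const char *rhs,
              enum relation rel, int label)
{
    static const char *branch[] =
        { "BEQ", "BNE", "BLT", "BLE", "BGT", "BGE" };
    plant("CMP %s,%s", lhs, rhs);
    plant("%s  $%d", branch[rel], label);
}

/* Skip style (in the manner of the PDP10): skip the jump when   */
/* the OPPOSITE relation holds, so that the jump is taken when   */
/* 'rel' holds.                                                   */
void cjump_skip(const char *lhs, const char *rhs,
                enum relation rel, int label)
{
    static const char *skip_unless[] =
        { "CAMN", "CAME", "CAMGE", "CAMG", "CAMLE", "CAML" };
    plant("MOVE 1,%s", lhs);
    plant("%s 1,%s", skip_unless[rel], rhs);
    plant("JRST $%d", label);
}
----------------------------------------------------------------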
Using these ideas the following is the expansion of the
statements given at the start of section 4.6.1:
     ----------------      -----------------
     | PUSH X       |      | PUSH X        |
     | PUSHI 0      |      | PUSHI 0       |
     | COMP # 1     |      | COMP = 1      |
     | JUMP LAB     |      |               |
     | LOCATE 1     |      |               |
     | PUSH Y       |      | PUSH Y        |
     | PUSHI 3      |      | PUSHI 3       |
     | ASSVAL       |      | ASSVAL        |
     | LABEL LAB    |      | LOCATE 1      |
     ----------------      -----------------
4.6.2 Temporary objects
During the compilation of high-level languages it often
becomes necessary to create temporary objects which are not
present in the source program. The most common need for
temporaries is in the evaluation of expressions. Regardless
of the number of accumulators or registers available it is
always possible to construct an expression which will
require one more. To obtain this register, a register
currently in use must be selected and the value currently in
it must be saved in a temporary location. One apparent
exception to this is a machine in which expressions are
evaluated using a stack (e.g. ICL 2900) but in this case the
operands are always in temporaries.
Temporary variables may also be required to implement
certain high-level constructions, such as the IMP for
statement:
-----------------------
| for V = A, B, C cycle |
-----------------------
which is defined so that the initial values of B and C, and
the initial address of the control variable, V, are to be
used to control the loop regardless of any assignments to V,
B and C. While it is possible for a machine-independent
optimiser to discover whether these variables are modified
in the loop or not, in the simple case where little
optimisation is required the code generator must use
temporaries.
In the case of expression evaluation, however, the machine
independent phase cannot know how many temporaries will be
required. Even giving the first phase knowledge of the
number of registers available is not adequate for several
reasons. Firstly, the use of registers is commonly tied to
the operations being performed, as in the case of integer
multiply on several machines which requires a pair of
registers. For a machine-independent first phase to be able
to cope with this sort of limitation would require great
flexibility of parameterisation.
Secondly, the first phase would have to be given details of
the problems encountered in statements such as:
----------------------------
| LEFT = REM(A,5) + REM(B,7) |
----------------------------
On a PDP11 equipped with the EIS option a divide instruction
is available which provides both the quotient and the
remainder. Hence the statement could be compiled into:
-----------------
| MOV A,R1 |
| SXT R0 | propagate the sign of A
| DIV R0,#5 | remainder to R1
| MOV B,R3 |
| SXT R2 |
| DIV R2,#7 | remainder to R3
| ADD R3,R1 |
| MOV R1,LEFT |
-----------------
In this case no temporary store locations are required.
However, if the EIS option is not present no DIV instruction
is present and so a subroutine must be used instead. The
code becomes:
-----------------
| MOV A,R1 |
| MOV #5,R2 |
| JSR PC,DIV | result back in R1
| MOV R1,T1 | preserve remainder
| MOV B,R1 |
| MOV #7,R2 |
| JSR PC,DIV | result in R1
| ADD T1,R1 |
| MOV R1,LEFT |
-----------------
As the subroutine REM uses R1 (for one of its arguments and
to return its result) the result of the first call on REM
must be saved in a temporary, T1. Of course, the function
REM could be written so as to preserve the value in, say, R2
and this could be used instead of T1, but this would
increase the cost of REM when it is likely that the value in
R2 will not be of use, as most expressions are trivial
[Knuth, 1971].
Unless the machine-independent phase is given intimate
knowledge of the target machine (something of a
contradiction) it cannot know how many temporaries to use
nor when to use them. The solution adopted by most
intermediate codes is to base the code around a stack, thus
providing an unlimited number of temporaries which are
handled automatically. While this in itself does not hinder
the compilation for a machine without a hardware stack, as
the code-generator can always simulate the stack internally,
its presence invariably results in other parts of the code
using it, for example to pass parameters to procedures where
the receiving procedure contains built-in knowledge of the
layout of the stack.
As a stack does not require the explicit mention of
temporaries it has been adopted by I-code, but purely as a
descriptive mechanism. Because I-code does not specify the
computation but the compilation process needed to produce a
program which will perform the computation, this internal
stack need have no existence when the final program
executes.
The implementors of SIMPL-T describe an intermediate code with some properties similar to I-code, but based on "quadruples" of operators and operands rather than an internal stack [Basili, 1975]. The stack approach was rejected by them because "quads allow more flexibility in the design of the code generator since, for example, no stack is required". The exact meaning of this is not clear but it suggests the misconception that a stack-based intermediate code forces a stack-based object code representation. Regardless of the exact structure of the code generator or the input it takes, some form of internal stack is invariably required for operations such as protecting intermediate values in registers which are needed for other purposes, and it seems reasonable to make this stack more explicit if so doing will simplify the intermediate code and its processing.
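A purely descriptive stack of the kind adopted can be
sketched as follows; the fields and names are invented. Each
entry merely records where an operand currently is, and
"pushing" a value need plant no code at all; only when an
operation demands it is code generated and the descriptor
updated.
----------------------------------------------------------------
/* Illustrative sketch: the code generator's internal stack of   */
/* descriptors; no run-time stack need exist.  Names invented.   */

enum where { IN_CONSTANT, IN_REGISTER, IN_STORE };

struct descriptor {
    enum where where;
    long value;            /* IN_CONSTANT: the literal value     */
    int  reg;              /* IN_REGISTER: which register        */
    int  base, disp;       /* IN_STORE: base register + disp     */
};

#define STACK_MAX 64
static struct descriptor stack[STACK_MAX];
static int sp;

void push_constant(long v)
{
    stack[sp].where = IN_CONSTANT;
    stack[sp].value = v;
    sp++;
}

void push_store(int base, int disp)      /* nothing is loaded:   */
{                                        /* the operand is only  */
    stack[sp].where = IN_STORE;          /* described            */
    stack[sp].base = base;
    stack[sp].disp = disp;
    sp++;
}

/* When an operation forces it, or a register runs out, code is  */
/* planted and the entry becomes IN_REGISTER, or is spilled to a */
/* temporary and becomes IN_STORE again.                         */
----------------------------------------------------------------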
The third phase takes as its input two data streams
generated by the second phase. These streams are:
i The object stream, a sequence of items of the form:
<type> <value>* defining the code sequences
required in the object file.
ii the directive stream, a sequence of items defining
the logical structure of the object stream, that is
a specification of label definitions and label
references, and details of various code groupings
(blocks, procedures etc.).
The third phase starts by taking in the directive stream and
constructing a linkage map describing the whole program.
This linkage map is processed and then used to control the
generation of the final object file from the object stream.
The operations performed using the map are:
4.7.1 Reordering
As discussed previously, there are several gains to be
made by having the ability to output instructions in an
order different from that in which they were implied by the
linear structure of the source program. This reordering is
performed on the linkage map in a manner controlled by items
in the directive stream. In the most simple case of
exbedding procedures (section 4.3.1) this only entails
allocating code addresses to the items in the map each time
an "end-of-block" control item is input, resulting in the
procedures being laid out in "end" order. To facilitate
evaluating references to the reordered areas, all references
in the object stream are made relative to the start of the
appropriate area. As this process does not cause the
physical moving of the various areas there is an implicit
assumption that either the subsequent processing of the
object stream can do the reordering (for example by writing
its output to specific sections of a direct-access file), or
that the object file format can instruct the loader or
linker to do the shuffling.
With the linkage map available it becomes possible to make a
preliminary pass over the object stream performing
structural modifications which require knowledge of the
generated code and which alter its size and general
appearance. These modifications may be made by passing the
object stream through a buffer which is scanned and modified
under the control of the linkage table. In this way merging
common code sequences and reordering the arms of conditional
sequences may be achieved quite simply.
In typical programs the frequency of occurrence
of such jumps is:
PDP11 PE3200
------------------
2 byte | 88% 28% |
4 byte | 8% 71% |
6 byte | 2% <1% |
------------------
It has been suggested [Brown, 1977] that the
problem of deciding which form of jump to use can
be eased on certain machines by specifying a
"distance" parameter with the intermediate code,
e.g. "GOTO LAB,80" informing the code generator
that the label LAB is 80 instructions ahead.
It is difficult to think of any case in which this
could be of any use as it requires the code
generator to be able to predict the amount of
target machine-code which will be generated for
each intermediate code instruction.
The solution adopted by the IMP compilers has
been for the code generator to assume that all
jumps are the minimum size, and to let the third
phase stretch them where necessary.
The Perkin-Elmer CAL assembler [Interdata, 1974]
makes the opposite assumption, namely that jumps
are long until proven short. This was rejected as
the size of one jump is often dependent on another,
so that one of them will be short if and only if
both of them are short. By assuming them long
either they will never be found to be short, or the
process will have to examine all the jumps
repeatedly trying each jump in turn to see if it
can be "squeezed". Commonly enabling the "SQUEZ"
option in the CAL assembler can double or treble
the time to assemble programs.
With the assumption that all jumps start short and
then grow, all truly short jumps will be found with
no possibility of infinite loops, as the process
must terminate, in the worst case when all the
jumps have been made long.
Several methods for achieving this optimisation
have been described [Szymanski, 1978; Williams,
1978].
The technique used by the third phase of the IMP77
compilers for stretching jumps is as follows.
Once the linkage map has been constructed and
addresses provisionally allocated, all labels and
references to them are grouped according to the
block in which they occurred. This is to take
advantage of the fact that most references will be
local. A procedure STRETCH is now defined which
repeatedly attempts to lengthen each reference
within a particular group. If a reference is found
which must be stretched, the entry in the linkage
map is updated and all subsequent entries are
suitably modified to take account of the increased
size of the code. The process is repeated until no
alterations have been made.
STRETCH is first called once for each group of
references in the program. This "local stretch"
commonly resolves up to 80% of the references. A
final call on STRETCH is then made with all the
references lumped together as one group in order to
resolve references between blocks, and any local
references which, although processed by the local
stretch, have become invalidated by changes made by
the "global stretch".
The use of a local and a global stretch has a
considerable effect on the performance of the
compiler; if the calls on "local stretch" are taken
out "global stretch" has to do all the work in
ignorance of the block-structure of the labels.
This involves repeated searching of the complete
label and reference lists in order that changes in
the position of these items may be recorded. On
the Interdata 7/32 this increases the stretching
time for 1968 branches from 2.3 seconds out of a
total compilation time of 146 seconds up to 35
seconds!
The time taken to perform the stretching using both
local and global stretch is on average just over 1%
of the total compilation time excluding the time
for input and output.
Wulf et al. describe an optimisation on the PDP11
which attempts to shorten otherwise long
conditional jumps by making them jump to suitable
jumps to the same destination as this is smaller
and faster than the six byte instructions which
would be generated by default [Wulf, 1975]. This
was tried but eventually removed from the PDP11
compiler as finding suitable jumps was a tedious
task and of the average 2% of jumps which were long
in compiling many programs only one case was found
where the optimisation could be applied. That case
was in a program specially constructed to test the
optimisation.
At the same time that jumps and labels are being
processed, certain operations which depend on the
flow of control may be inserted into the code.
The GEC 4080 provides a good example of this
problem which can be handled elegantly by the third
phase.
The machine provides arithmetic instructions which
take either fixed point or floating point operands
depending on the state of a processor status bit.
This bit must be altered by the instructions SIM
(Set Integer Mode) and SFM (Set Floating Mode).
During code generation when a label is encountered
the state of the status bit will not in general be
known, and so a suitable mode switching instruction
will need to be planted; frequently this
instruction will be redundant. Given the presence
of the third phase, the second phase merely needs
to mark jumps with the current state of the bit,
and to mark labels with the required state (and the
previous state of the bit if control can "fall
through" past the label. During the process of
expanding jumps these mark bits can be checked. If
all references to a label have the same mode, no
action needs to be taken, but if the bits differ
the appropriate instruction must be added. As an
extra improvement if only one jump to a label is
from the wrong mode, the mode switching instruction
can be planted before that jump rather than after
its destination label, so shortening the execution
paths when no change is required.
ii Conflating jumps to jumps.
Nested conditional structures in high-level
languages often generate jumps which take control
directly to another jump. If the second jump can
be shown always to be taken whenever the first is,
the former can be redefined as jumping directly to
the destination of the latter.
e.g. -----------------------------------
| while N > 0 cycle |
| N = N-1 |
| if N > 5 then TEST1 else TEST2 |
| repeat |
-----------------------------------
In this program following the call on TEST1 the
else causes a jump to be taken to the next
statement (repeat). This statement is simply a
jump back to the previous cycle.
Hence the following code can be generated (3220):
-------------------
| $1: L 1,N |
| BLE $2 |
| SIS 1,1 |
| ST 1,N |
| CHI 1,5 |
| BLE $3 |
| BAL 15,TEST1 |
| B $1 |
| $3: BAL 15,TEST2 |
| B $1 |
-------------------
The danger with this optimisation is that an
otherwise short jump can be expanded to a long jump
as the following program demonstrates:
----------------------
| if X = 1 start |
| if Y = 1 start |
| {A} |
| else |
| {B} |
| finish |
| else |
| {C} |
| finish |
----------------------
The else following the sequence (A) causes a jump
to the next else which jumps past the finish. In
that form the first jump only has to skip (B) and
is likely to be a short jump. If it is made to
jump directly to the second finish it has to cover
{B} and {C}, so reducing the chances of its being
short.
Equally, the position can be reversed resulting in
the optimised jump being short when the original
was long. If this problem is considered serious
the third phase can check the sort of jump which
would be generated and act accordingly.
iii Removal of jumps round jumps.
Statements such as:
------------------
| ->LABEL if X = Y |
------------------
are common, either in the explicit form as given
above or in some higher-level representation such
as:
---------------
| exit if X = Y |
---------------
The simple code sequence generated for this would
be similar to (3220):
--------------
| L 1,X | pick up X
| C 1,Y | compare with Y
| BNE $1 | branch not equal
| B LABEL | jump to LABEL
| $1: |
--------------
by combining the two branches the code can be
reduced to:
--------------
| L 1,X |
| C 1,Y |
| BE LABEL |
--------------
While it is possible for the code generator to do
this immediately it was decided to leave the
optimisation to the third phase for four reasons:
1 The third phase can perform this optimisation
simply, almost as a side-effect of
constructing the linkage map.
2 There are several cases where the optimisation
can be extended in ways which would be awkward
for the second phase to deal with. In
particular, it would have either to look ahead
or to be able to modify code sequences already
generated. With a third phase, however, the
optimisation reduces to a straightforward
inspection of the linkage map. For example:
------------------
| exit if X = Y |
| repeat |
------------------
in which case the optimisation may be applied
twice to reduce the code to two instructions.
3 Leaving the optimisation to a later phase
simplifies the second phase which is the most
complicated part of the compiler.
4 On several machines if the destination of the
jump is too far away the original "jump round
a jump" may be the only form available (e.g.
PDP11). The distance to be jumped will only
be known exactly when all labels have been
processed.
4.7.5 In-line constants
When compiling for machines such as the Data General NOVA
which have a limited direct addressing range and no
full-length immediate operands, it is useful if constants
can be planted in the code sequence and addressed as
program-counter-relative operands. The simplest technique
for doing this is for the code generator to maintain a list
of required constants and to dump them in-line at a suitable
opportunity before the limit of addressability has been
exceeded. Such constants will need to be protected from
being executed and so will need to have a jump round them,
or will have to be planted in a "hole" in the code, that is
between an unconditional jump and the next label. As holes
occur frequently in high-level languages (for example
following every else or repeat) and do not require extra
code to be planted round the constants, they must be the
preferred position for the constants.
In order to minimise the number of constants planted it is
necessary to delay the dumping of them until the last
possible moment, placing them as near as possible to the
forward limit of addressability of the first outstanding
reference. This increases the chance of a subsequent
reference to the constant being able to address the previous
location. This poses problems if the second phase is to
handle the constants, as it cannot know the optimum position
for the constants in advance of producing the code
(especially if the code is to be reordered).
A convenient solution is to utilise the linkage table in the
third phase and include in it references to constants and
the locations of holes and "forced" holes, that is places
where an extra jump is required. Following the initial
resolution of jumps (4.7.?) the list of constants can be
examined and holes allocated. The labels are processed again
to take account of the extra code and any alignment
limitations. During the processing of the object stream the
constants are infiltrated into the object file.
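The dumping decision can be sketched as follows; the names,
word size and addressing range are illustrative only. The
third phase keeps the outstanding constants together with
the address of their first reference, and flushes the pool
into a hole whenever waiting for the next hole would take
the oldest reference out of range.
----------------------------------------------------------------
/* Illustrative sketch: dumping in-line constants into "holes"   */
/* before the first outstanding reference goes out of range.     */

#define MAX_RANGE  255     /* e.g. a byte displacement           */
#define MAX_CONSTS 128

struct constant { long value; int first_ref; };

static struct constant pool[MAX_CONSTS];   /* kept in first_ref  */
static int npool;                          /* order              */

extern void plant_constant(long value, int address);

/* Called at every hole (after an unconditional jump, before the */
/* next label) and at every forced hole.                         */
void maybe_dump_pool(int hole_address, int next_hole_address)
{
    if (npool == 0)
        return;
    /* Still safe to wait for the next hole?                     */
    if (next_hole_address - pool[0].first_ref <= MAX_RANGE)
        return;
    for (int i = 0; i < npool; i++) {
        plant_constant(pool[i].value, hole_address);
        hole_address += 2;                 /* one word each      */
    }
    npool = 0;
}
----------------------------------------------------------------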
5 Review of the overall structure
5.1 Division of function
The division of the machine-dependent phase into two
parts was motivated by three main considerations:
i to localise the changes necessary to produce
different object-file formats,
ii to permit the reordering of sections of the code,
iii to enable the production of short jumps whenever
possible.
In addition it turns out that on all of the machines for
which this technique has currently been applied points (ii)
and (iii) can be handled by almost identical pieces of code,
making this phase of compilation machine-independent to a
large extent and therefore easing the task of creating new
compilers.
Against this must be set the overheads incurred by
separating the compilation into two parts which have to
communicate. The interface between phases two and three
comprises the object file and the directive file, and the
third phase needs to process the whole of the directive file
before starting to look at the object file.
The ways in which these 'files' will be implemented, and
consequently the cost of the communication, will in general
vary from system to system. If large virtual memories are
available the data may be held in memory as mapped files or
arrays and accessed much more efficiently than on simpler
systems using the conventional approach of 'true' files with
their more cumbersome transfer operations.
5.2 Testing and development
Although the initial reason for choosing a multi-phase
approach to compiling was that of simplifying the generation
of new compilers, an extra advantage arose in that the task
of checking the compilers so produced and diagnosing faults
in them was very much simplified. This was because of two
features of the technique. Firstly, the programs
corresponding to the phases were of manageable size, varying
from about one thousand statements up to three thousand
statements. Secondly, the phases communicated with each
other using well-defined interfaces which could be monitored
to narrow down errors to a particular phase and even to
specific parts of that phase.
In addition, as the structure of the intermediate code
inevitably suggests the general techniques to apply in code
generation, many of the complete compilers on different
machines had great similarities; usually only the lowest
levels of code production and machine-specific optimisation
were appreciably different.
This gave rise to three convenient properties with regard to
testing and development:
i An error in one compiler will frequently give
notice of similar faults in others. Clearly, any
faults in the common first phase will be present in
all the compilers and only one correction will be
required.
ii An improvement in the performance of one compiler,
or the code it generates, can suggest similar
improvements in others.
iii The third effect on reflection seems obvious yet
was noted with some surprise. The systems on which
most of the investigation was done are run with
very different operating systems and used by
different types of user. These two factors
together caused a great spread in the demands
placed upon the compiler, resulting in more parts
of the compiler being thoroughly tested than would
happen when running on one particular system, where
users tend to be more stereotyped. Questions of
"proper practice" aside, it is a fact of life that
all software gets a better testing in the field
than at the hands of its creator.
5.3 Diagnostics
As mentioned previously optimisation is not just a process
of improving the storage requirements and speed of a program
but also involves fitting a program into the overall
framework of the run-time environment. In many applications
the provision of extensive run-time checks and post-mortem
traces can be of great importance. The ability to generate
such diagnostic code has certain implications for the
features in the intermediate code.
5.3.1 Line numbers
When producing information about the state of a computation,
whether it be an error report following a run-time fault or
an execution trace [Satterthwaite, 1972], the data must be
presented in a form which is meaningful to the user in terms
of the source program. The commonly-provided dump of the
machine state, registers, code addresses etc., is a complete
failure in this respect, as the correspondence between this
and the program state depends on the workings of the
compiler and other factors of which the user should not need
to be aware. The simplest way of specifying the point of
interest in a program is to give its line number. There are
two common techniques for providing line number information
at run-time, the choice of which depends on the uses to
which the compiler is to be put.
The first is to plant instructions which dynamically update
a variable with the current line number whenever it changes.
This has the significant advantages that it is extremely
cheap to implement and the line number is always immediately
available. Its obvious disadvantages are that it increases
the execution time for the program, and more significantly
it increases the size of the program, typically by about 6K
bytes on the Interdata 7/32 for a 1000 line program,
approximately a 50% increase.
The second technique is to build a table giving the
correspondence between line numbers and the addresses of the
associated code sequences. While this imposes a greater
burden on the compiler and takes more time to extract the
line number it has the advantage that it does not increase
the code size of the program, nor does it alter its
execution speed. Indeed it may even be possible to keep the
table out of main memory until it is required.
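The table technique amounts to the small amount of run-time
support sketched below (the layout is invented): the
compiler emits (code address, line number) pairs in address
order, and the error routine recovers the line from the
faulting address by binary search.
----------------------------------------------------------------
/* Illustrative sketch: recovering a source line number from a   */
/* code address via a table of (address, line) pairs sorted by   */
/* address.  The layout is invented.                             */

struct line_entry { unsigned address; unsigned line; };

/* Returns the line whose code starts at or before the faulting  */
/* address, or 0 if the address precedes all entries.            */
unsigned line_for_address(const struct line_entry *table, int n,
                          unsigned fault_address)
{
    int lo = 0, hi = n - 1, best = -1;

    while (lo <= hi) {
        int mid = (lo + hi) / 2;
        if (table[mid].address <= fault_address) {
            best = mid;          /* candidate; try for a later one */
            lo = mid + 1;
        } else {
            hi = mid - 1;
        }
    }
    return best < 0 ? 0 : table[best].line;
}
----------------------------------------------------------------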
The choice of technique has implications for the
compiled code. If the line number table approach is used
error procedures must have available the address of the
point of the error. The effects of this can be seen in the
following example of the sort of code generated for
unassigned variable checking on the 7/32 using both methods:
-------------
| 17 X = Y |
-------------
------------- -------------
| LHI 0,17 | | update line no
| ST 0,LINE | |
| L 1,Y | L 1,Y |
| C 1,UV | C 1,UV | check value
| BE ERROR | | give the error
| | BAL 8,TU | test for error
| ST 1,X | ST 1,X |
| | .. .. |
| | .. .. |
| |TU:BNER 8 | return if OK
| | B ERROR| give the error
------------- -------------
As the generated code depends on the method in use it cannot
be specified in the intermediate code and so the latter must
simply indicate the points in the program at which the line
number changes.
5.3.2 Diagnostic tables

In the event of program failure, or when explicitly requested by the user, a trace of the current state of a program, including the values in active variables and the execution history, can be of immense value. For such a trace to be provided the intermediate code must contain the identifiers used in the source program for all the variables, and a source-dependent description of those variables. This latter is needed so that the machine representations may be interpreted in the correct way when giving the values in variables. In I-code all this information is presented in the definitions of descriptors and may be used or discarded at will.

5.3.3 Run-time checks

Most languages define circumstances under which a program is to be considered in error and its execution terminated. These errors include creating a value too large to be represented (overflow), division by zero, use of an array index which is outwith the declared bounds, and so on. There is a natural division of these errors into those which are detected automatically by the machine and those which must be detected by explicit checks in the program. Commonly, machines catch division by zero automatically but do not provide such a feature for checking array subscripts.

The "hardware-detected" errors may be further divided into those which on detection cause the normal flow of control to be interrupted, and those which simply make the knowledge of the occurrence of the error available to the program, for example by setting a condition-code bit. For the purposes of this discussion the second form of hardware-detected error may be considered an error which is not detected automatically, as it still requires explicit instructions to test for the error and to transfer control accordingly. Clearly the more errors that fall into the automatic category the better, as they do not cause the user's program to grow with sequences of instructions which, in a correct program, will always be testing for conditions which never arise.

These differences complicate the design of intermediate codes as the classification differs from machine to machine: on the VAX all forms of overflow can be made to generate automatic interrupts, but the PDP11 only sets a condition-code bit on some overflows. There are two basic ways of handling this in the intermediate code: firstly the code can contain explicit requests for the checks to be performed, and secondly the code can be designed in such a way as to give the code-generator enough information to be able to decide where checks are necessary. Two specific examples can indicate which of these ways should be adopted.
Testing for arithmetic overflow is currently handled by
machines in three main ways:
1. An interrupt is generated whenever overflow occurs.
This is by far the best method as it requires no
overheads in the checked code.
2. A bit is set on overflow and is only cleared when
it is tested. This requires explicit checks in the
code but several tests may be conflated into a
single test at an appropriate point, for example
before the final result is stored.
3. A bit is set on overflow, but is cleared by the
next arithmetic operation. This again requires
explicit checking code but the tests must be
inserted after every operation.
For the intermediate code to indicate where overflow
testing is to be performed it would have to choose the
worst case from the three above, namely case 3. This
would result in a test being requested after every
arithmetic instruction, a test which may just as well be
included in the definition of the instructions
themselves.
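As an illustration of the second case, the following fragment sketches how a code generator for a machine with a 'sticky' overflow bit might conflate the checks. It is written in C with invented names and mnemonics and is not taken from any of the compilers; the point is only that the decision of where to plant the test belongs in the code generator:

    #include <stdio.h>

    static int overflow_pending = 0;   /* an unchecked arithmetic result exists */

    /* Arithmetic on a 'sticky bit' machine: the overflow bit stays set
       until it is tested, so no check need be planted here.            */
    static void gen_arith(const char *op, int reg, const char *operand)
    {
        printf("    %-4s %d,%s\n", op, reg, operand);
        overflow_pending = 1;
    }

    /* A single test before the result is stored covers every operation
       performed since the last test.                                    */
    static void gen_store(int reg, const char *dest)
    {
        if (overflow_pending) {
            printf("    BO   OVERFLOW\n");
            overflow_pending = 0;
        }
        printf("    ST   %d,%s\n", reg, dest);
    }

    int main(void)                     /* X = A + B + C */
    {
        printf("    L    1,A\n");
        gen_arith("A", 1, "B");
        gen_arith("A", 1, "C");
        gen_store(1, "X");             /* one overflow test instead of two */
        return 0;
    }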
The other area of low-level testing is in implied type
conversions such as storing from a 32-bit integer into a
16-bit integer. The VAX provides an instruction which
combines the test for truncation with the store (CVTLW).
The 7/32 has an instruction (CVHR) which can test the
value before assignment, and the 4/75 can most
efficiently test following the assignment (CH).
If the request for the check is a separate intermediate
code item, the 7/32 case is simple but the other
machines will require much more work to be able to
generate the efficient check. The problem can be
simplified by introducing new assignment instructions
which also perform the test, but this adds many new
instructions to the code as one instruction will be
required for every valid combination of types and every
sort of assignment.
The high-level checks such as array bound checking are
usually so complicated that the most efficient
implementations depend greatly on the particular
hardware, so much so that it would be foolish to attempt
to express them in the intermediate code.
The simplest solution is to ensure that the intermediate
code provides enough information to let the code
generator decide where and what checks are necessary.
The inclusion of checks against the use of unassigned
variables provides a good example of the power of
leaving the checking to the code-generator. In a
simple-minded approach the code-generator tests every
suitable value loaded from store. A minor improvement
to this is to mark the descriptor for every local
variable in a block when it is first assigned,
inhibiting the marking after the first jump.
Subsequently, marked objects need not be checked.
A much better improvement may be obtained by making a
trivial extension to the register remembering mechanism.
If an object is 'known' it must have been used
previously, and hence it will have been checked if
necessary. Even after the register which held the value
of the object has been altered, and hence the
association between the register and the object lost, if
the compiler remembers that the value was known it can
suppress any unassigned checks on future references.
At this point a useful property of IMP77 may be used to
great effect: once a variable has been assigned it
cannot become unassigned. This is not true in many
languages, as for example, in ALGOL60 the control
variable of a for loop is undefined (unassigned) at the
end of the loop. This means that in IMP77 the 'was
known' property of variables may be preserved across
procedure calls, even though all the register content
information must be forgotten.
This technique when applied on the 7/32 compiler results
in a reduction of 33% in the code required for checking.
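The bookkeeping involved can be sketched as follows. The structure and names are invented for the purpose of illustration (the compilers keep this information in the descriptors themselves), but the principle is the one described above:

    enum { NEVER_SEEN, WAS_KNOWN };

    struct descriptor {
        const char *name;    /* source identifier                              */
        int state;           /* NEVER_SEEN or WAS_KNOWN                        */
        int in_register;     /* register currently holding the value, -1 if none */
    };

    /* Called whenever the variable's value is loaded or assigned: from
       now on it cannot be unassigned (an IMP77 property), so the fact
       survives even after the register association is lost.            */
    static void note_value_seen(struct descriptor *d, int reg)
    {
        d->state = WAS_KNOWN;
        d->in_register = reg;
    }

    /* A procedure call destroys the register contents but not the
       'was known' flags.                                                */
    static void forget_registers(struct descriptor *d, int n)
    {
        int i;
        for (i = 0; i < n; i++)
            d[i].in_register = -1;
    }

    /* Plant an unassigned-variable check only if nothing is known.      */
    static int needs_unassigned_check(const struct descriptor *d)
    {
        return d->state != WAS_KNOWN;
    }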
While it is possible for the unassigned checks to be
placed in the intermediate code and for the first phase
to remove redundant checks, this suppression would
require a duplication of the remembering logic which
must, in any case, reside in the machine-dependent
phase.
6 Observations
6.1 Suitability of I-code for Optimisation
When considering the use of I-code for global
optimisation there are two techniques available:
Firstly, the optimisations can be performed using the
I-code and going straight into object code, possibly via a
third phase. In this case the only real constraint on
I-code is that it be powerful enough to be able to carry all
the information available in the source and to present it in
a compact form.
Secondly, the optimisations can be seen as an extra phase
introduced between the first phase (the I-code generator)
and what is normally the second phase (the code generator).
The optimiser takes in I-code and produces as its output a
new I-code stream which can be fed into the code generator.
In this case not only must the I-code carry all the source
information but it must be able to describe the generation
of an optimised program. Clearly the code must be able to
reflect the structure of the target machine in some way and
hence must be able to lose its machine independence.
The second technique is the more interesting as not only
does it permit the optional inclusion of the global
optimising without affecting the structure of the other
phases, but it removes the optimisations from the low-level details of code production and provides a means for separating the machine-independent and machine-dependent optimisations. In particular, in the same way as much of the code generator can be built from a standard "kit" with a few special machine-specific parts, so the global optimiser can utilise code from other optimisers.

The way in which the optimiser can influence the operation of the code generator is by making use of the fact that the intermediate code does not describe a computation but a compilation process. This compilation is driven by the descriptors, which are normally translated by the code generator from the machine-independent form in the I-code into the appropriate machine-dependent representation, reflecting the target machine architecture: registers, stacks, memory etc. By short-circuiting this translation a global optimiser can force the use of specific machine features.
For example consider the following fragment of an integer
function:
-------------------
| integer X |
| X = A(J) |
| X = 0 if X < 0 |
| result = X |
-------------------
The standard I-code produced for this fragment would have
the form:
------------------------
| DEF 12 "X" INTEGER |
| SIMPLE DEFAULT NONE |
| NONE |
| PUSH 12 | - X
| PUSH 6 | - A
| PUSH 7 | - J
| ACCESS |
| ASSVAL |
| PUSH 12 | - X
| PUSHI 0 |
| COMP >= 1 |
| BGE 1 |
| PUSH 12 | - X
| PUSHI 0 |
| ASSVAL |
| LOC 1 |
| PUSH 12 | - X
| RESULT |
------------------------
On the PDP11 the code generated for this could be:
-------------------
| MOV J,R2 |
| ADD R2,R2 | Scale the index
| ADD A,R2 | Add in ADDR(A(0))
| MOV (R2),X | X=A(J)
| BGE $1 | ->$1 if X >= 0
| CLR X | X = 0
| $1: MOV X,R1 | assign result register
| {return} |
-------------------
Here the obvious optimisation is to note that the local
variable, X, is eventually to be used as the result of the
function and so needs to end in register 1.
By changing the definition of X in the I-code into:
------------------------------------------------
| DEF 12 "X" INTEGER SIMPLE DEFAULT NONE SPECIAL R1 |
------------------------------------------------
and making no other changes, the code generator will produce
code of the form:
--------------------
| MOV J,R2 |
| ADD R2,R2 |
| ADD A,R2 |
| MOV (R2),R1 |
| BGE $1 |
| CLR R1 |
| $1: {return} |
--------------------
As this process necessitates the I-code becoming more and
more intimately involved with the structure of the target
machine, in that it starts referring directly to registers
and the like, it is necessary that a new control item be
added so that the code generator may be prevented from
pre-empting resources which the optimiser is manipulating.
The new item is RELEASE and it is used in conjunction with
the definition of machine-dependent descriptors. When such
a descriptor is introduced (using DEF) the associated target
machine component is considered to have been claimed and may
only be used in response to explicit direction from the
I-code. On receipt of the corresponding RELEASE the
component is once again made available for implicit use by
the code generator (for temporaries etc.). This mechanism
is an exact parallel to the way in which memory locations
are claimed by the definition of descriptors and released by
the END of the enclosing block.
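The effect on the code generator's own resource allocation can be pictured as follows; the fragment is an illustrative sketch in C, not the actual mechanism used, and the names are invented:

    #define NREGS 8

    enum reg_state { FREE, IN_USE, CLAIMED };

    static enum reg_state reg[NREGS];

    /* DEF ... SPECIAL Rn: the register now belongs to the I-code and
       may be used only when the I-code refers to it explicitly.       */
    static void claim_register(int n)   { reg[n] = CLAIMED; }

    /* RELEASE: the register is returned to the pool of temporaries.   */
    static void release_register(int n) { reg[n] = FREE; }

    /* Allocation of a temporary skips anything the optimiser has
       claimed, exactly as it skips registers already in use.          */
    static int get_temporary(void)
    {
        int n;
        for (n = 0; n < NREGS; n++)
            if (reg[n] == FREE) {
                reg[n] = IN_USE;
                return n;
            }
        return -1;                     /* caller must free or spill something */
    }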
The main assumption about this style of optimisation is
that the code generator has the ability to generate any
required instruction, provided that the pertinent
information is available at the required time.
As an example, the VAX 11/780 provides addressing modes in
which the value in a register may be scaled and added into
the effective operand address before the operand is used,
hence the following code:
------------------------
| integerarray A(1:9) |
|- -|
| A(J) = 0 |
| |
| MOVL J,R5 | pick up J
| CLRL 12(R3)[R5] | A(J) = 0
------------------------
The operand address generated by the CLRL instruction is:
------------------
| 12+R3 + R5*4 |
------------------
as there are 4 bytes (address units) to a longword.
This instruction can be generated naturally during the
non-optimised evaluation of array subscripts, and so the
optimiser can assume that the index mode of operand will be
used whenever a register operand is specified as an array
index.
The procedure has the added advantage that in the worst case,
where the code generator does not produce the instructions
that the optimiser hoped for, as long as the optimised I-code
still describes the required compilation, the code generator
will simply produce a more long-winded, but equally valid
version of the program. In other words, as long as some
choice is available and some temporary objects are left at
the disposal of the code generator, the optimiser cannot
force it into a state where working code cannot be produced.
In the example above even if the code generator does not
produce index mode operands, it can still generate sequences
of the form:
----------------------
| MULL3 R5,#4,R1 | R5*4 -> R1
| ADDL2 R3,R1 | R3+R1 -> R1
| CLRL 12(R1) | 0 -> (12(R1))
----------------------
6.2 Performance

The figures in appendix A3 are the results of measuring the effect of various optimisations on the Interdata 7/32 and the DEC PDP 11/45. One problem in choosing programs to be measured is that heavy use of particular language features will increase the overall effect of certain optimisations. As a trivial example of this consider the following "program":

--------------------------------
| begin                        |
|   integerarray A(1:1000)     |
|   A(1) = 0                   |
| endofprogram                 |
--------------------------------

With all array optimisations enabled, on the 7/32 this generates 30 bytes of code, whereas without the optimisation it results in 170 bytes of code, largely due to the procedure for declaring the array. Clearly a reduction of 82% is not to be expected on more typical programs.

Similarly the absence of features will bias the results. In particular the smaller programs will not demonstrate the power of the optimisations which only take effect when various size limits have been exceeded: the most obvious such limits being addressing restrictions caused by the size of address fields in instructions.
The major difficulty in producing results which are of any real value is that the effects of the optimisations depend on the individual style in which the programs under consideration were written. Inevitably users get a "feel" for the sort of statement for which the compiler generates good code and they modify their style of programming accordingly. If at some stage in its development a compiler produces poor code for a particular construction users will tend to avoid that construction, even long after the compiler has been improved and can compile it effectively. This well-known phenomenon [Whitfield, 1973] argues strongly that users should never see the object code generated by the compilers they are using.

The effects of many optimisations are difficult if not impossible to measure with any degree of accuracy as they interact with other optimisations to a great degree. The most obvious interaction is that between the size of jump instruction required and most of the other optimisations. The size of jump is determined by the amount of code generated between the jump and the label it references. If any other optimisation is inhibited this volume of code is likely to increase, decreasing the chances of being able to use the shorter forms of the jump.

Some optimisations depend almost totally on others; it is unlikely that the optimisation of reducing or removing the entry and exit code sequences associated with procedures (section 4.5.1) would have much effect if the parameters
were not passed in registers and references to them in the procedures were replaced by references to those registers.

In particular, it must be noted that it is always possible to generate programs which will benefit greatly from those optimisations which do not appear to be of much use from the figures given. However, the test programs used to derive the figures are typical of the programs processed by the compiler, and it is hoped that they give a more realistic and balanced view of the improvements which may be achieved in 'real' cases. Under some circumstances it may be advantageous to apply all optimisations even though some may appear to give little benefit, since this 'squeezing the pips' frequently removes one or two instructions from critical loops in a program. Yet again this shows the difficulty in quantifying the usefulness of optimisations as they are so dependent on the particular circumstances.
One area of measurement has been deliberately omitted
from the figures, namely the effect on execution time of the
optimisations. This was for several reasons:
1. On the systems used it was impossible to get
reliable timing measurements with any accuracy
greater than about plus or minus 5%.
2. For the reasons given previously, many programs
could benefit from fortuitous optimisations which
could not be expected every time.
3. Programs which would run long enough to improve the
accuracy of the measurements invariably lost this
accuracy through spending much time in the
system-provided procedures, mainly for input and
output. This point in particular suggests that as
the overhead is beyond the control of the general
user, the savings in code space may be much more
important. Even with ever-growing store sizes,
virtual memory systems will continue to treat
smaller programs better than larger ones.
4. Some of the optimisations, particularly passing
parameters in registers, prevent the compiled
program from running, unless the controlling
environment is modified in a parallel way. This
would invalidate the timings as the environment is
not usually under the control of the compiler.
From the crude measures which were obtained there is a
suggestion that the decrease in execution time roughly
parallels the decrease in code size.
6.3 Cost of optimisation
The cost of an optimisation is, in general, very
difficult to measure, as may be seen by considering the
three relevant areas: compile time, space requirement, and
logical complexity.
6.3.1 Compile time
In order to generate good code, the compiler must spend
time looking for the cases which are amenable to
improvement. If no optimisation is performed this time is
not used and so the compilation should take less time.
However, the non-optimised version commonly requires the
production of more code than the optimised version,
frequently over fifty percent more when comparing fully
diagnostic code with fully optimised code. On all the
compilers written so far, the time saved by not having to
generate these extra instructions more than outweighs the
time spent in deciding not to generate them.
6.3.2 Space requirement

Several optimisations increase the requirement for workspace, notably all the remembering optimisations. On most machines available at present the number of things which may be remembered is fairly small: sixteen registers and one condition-code is probably the maximum. Even if this number is increased by remembering several facts about each thing, the total amount of space needed will be small when compared with the space needed to hold the information about user-defined objects, information which is required whether optimisation is being performed or not. On large machines the extra memory required will be cheap; on small machines the need for the optimisation will have to be balanced against the size of the largest program which must be compiled.

6.3.3 Logical complexity

The cost of providing an optimisation includes a non-recurrent component, which is the difficulty of performing the optimisation at all because of the logical complexity of discovering the necessary circumstances. In a system which is aimed at portability this cost can often be shared over a number of implementations; the techniques used in one being applicable to others, perhaps after minor modifications.
6.4 Comments on the results

6.4.1 Register remembering

Of all the optimisations tested, a simple remembering of values in registers provided by far the greatest improvement in code size. One problem in implementing this optimisation is deciding what to remember, as shown by the following code sequence:

-----------
| X = Y   |
|         |
| L 1,Y   |
| ST 1,X  |
-----------

Following this sequence register 1 will contain both the value in X and the value in Y; should the compiler remember X or Y or both? The measurements show that the gain in remembering both (2 uses) as opposed to just one (1 use) is quite small.

The algorithm used to determine what to remember in the '1 use' case was simply to remember a new piece of information only if nothing else was known about the register in question. This gives the best results in cases such as: A = 0; B = 0; C = 0 where the value '0' will be remembered, but will perform badly with the more contorted case: A = 0; B = A; C = B as again only the value '0' will be remembered.

Unless very tight code is required, the cost in
maintaining multiple sets of information about each
register and searching for particular values will
probably rule out such extended remembering
optimisations.
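The '1 use' policy itself is almost trivial to implement. The following sketch, in C with invented data structures (the compilers record rather more than is shown here), records a new association only when nothing is yet known about the register, and allows a later load of a constant to be suppressed when that constant is already to hand:

    struct remembered {
        int  valid;              /* is anything known about this register?    */
        int  is_constant;        /* a constant rather than a variable is held */
        long constant;           /* the constant, when is_constant is set     */
        int  variable_tag;       /* descriptor tag, when a variable is held   */
    };

    static struct remembered regs[16];

    /* '1 use': remember the new fact only if nothing else is known
       about the register; otherwise keep what is already there.       */
    static void remember(int reg, int is_constant, long constant, int tag)
    {
        if (regs[reg].valid)
            return;
        regs[reg].valid        = 1;
        regs[reg].is_constant  = is_constant;
        regs[reg].constant     = constant;
        regs[reg].variable_tag = tag;
    }

    /* Look for a register already holding the wanted constant, so that
       the load instruction can be suppressed.  Returns -1 if none.     */
    static int find_constant(long wanted)
    {
        int r;
        for (r = 0; r < 16; r++)
            if (regs[r].valid && regs[r].is_constant && regs[r].constant == wanted)
                return r;
        return -1;
    }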
Perhaps a surprising result is that the PDP11 on
average gains about as much from this optimisation
as the 7/32.
This is the result of two interacting effects.
Firstly, the 7/32 dedicates up to five registers to
address local variables in the last five levels of
procedure nesting and locks three for other fixed
purposes, leaving about ten for intermediate
calculations. The PDP11, however, uses a display
in store to access intermediate levels and has to
load the address of a particular level each time it
is required. In addition the PDP11 implementation
fixes the use of four registers, leaving only four
for intermediate calculations.
Secondly, the 7/32 needs to use at least one
register to move values around while the PDP11
often requires none.
These two effects give a fairly large number of
transient values in the registers of the 7/32, and
a smaller number of more frequently used values
(addresses) in the registers of the PDP11. On
average it appears that the number of times
necessary values are found is roughly equal in the
two cases.
register later on. This instability seems to be
undesirable, but alternative strategies, such as
biasing the allocation towards or away from
particular registers, on average result in worse
code.
6.4.4 Common operands
On the 7/32 the only instruction which can be
used to simplify statements of the form: X = X op Y
is the AM (add to memory) instruction. It is
therefore somewhat surprising that its use
frequently saves over two percent of the code.
The two possible expansions of a suitable addition
statement are:
------------ ------------
| | |
| L 1,Y | L 1,X |
| AM 1,X | A 1,Y |
| | ST 1,X |
------------ ------------
The first saves four bytes and leaves the increment
in the register. Even if the incremented value is
required immediately the extra load instruction
will only increase the code size to that of the
alternative sequence.
As the PDP11 has many instructions which can be
used in this way it is hardly surprising that it
benefits much more.
6.4.5 Parameters in registers
This optimisation gives another significant
saving in code at little cost to the compiler,
simply by moving the store instructions for
parameter assignment from the many calls to the
unique procedure definitions. The effect is more
pronounced on the 7/32 as all assignments require
two instructions, a load and a store, whereas the
PDP11 can usually make do with one MOV instruction.
In the latter case the saving comes from the
ability to reduce the size of the procedure entry
and exit sequences if all of the parameters can be
passed in registers.
6.4.6 Condition-code remembering
On machines with condition codes many
instructions set the result of a comparison with
zero as a side-effect. Knowledge of this can be
used to inhibit explicit code to compare values
with zero. However, the small benefit so gained
suggests that it is not worth doing, even though it
is a very cheap test.
6.4.7 Merging

As the techniques for merging and delaying are
quite expensive, but not complicated, and have a
major influence on the design of the code-generator
the small gains achieved are probably not worth the
trouble, unless the last drop of efficiency is
required at all costs.
6.5 Criticisms and benefits of the technique

6.5.1 Complexity

The main argument against the use of high-level intermediate codes is that they move the complexity of code generation from the common machine-independent phase into the machine-dependent phase, forcing work to be repeated each time a new compiler is required. While this is undoubtedly true the overheads are not as great as they may first appear.

The extra complexity of the code generators may be split into two parts: an organisational part which builds and maintains the structures used during the compilation, and processes the intermediate code, using it to drive the second part, an operational part which uses the structures to generate code as instructed by the organisational part. The changes in the organisational part when moving to a new machine are small enough to permit the use of large sections of code from old compilers. Even when considering the operational part much will be similar from machine to machine; in particular the communication between the second and third phases and the bulk of that latter phase can be taken without change.

From examining the compilers produced using I-code it appears that about 60% of the source of the machine-dependent parts is common, 20% can be considered as being selected from a standard "kit" of parts, and the final 20% is unique to the host machine.
6.5.2 I/O overhead

One of the disadvantages of dividing a compiler into several distinct phases is that it results in an additional cost in communicating between consecutive phases. As discussed in section 5.1 this cost depends on the system running the compiler. In the worst case, where communication is achieved using conventional files, the overhead may not be too serious. The time spent doing input and output on the Interdata 7/32 compiler is about 27% of the total compilation time, and for the PDP11 is about 22%, breaking down as follows:

                ----        ----        ----
 Source ----> | P1 | ----> | P2 | ----> | P3 | ----> Object
                ----        ----        ----

  7/32:     7%          7%          10%          3%
           (4%)        (4%)         (5%)        (3%)

  PDP11:   9.4%        11%          0.6%         0.5%

The figures in parentheses give the percentage of time taken when the input and output requests are made directly to the file manager rather than via the standard subsystem procedures, thus reducing the internal I/O overhead to about 10% of the total compilation time.
One important gain in using such intermediate codes is that they can ease the difficulties associated with maintaining a number of compilers for different machines, when those compilers are self-compiling. For several reasons it may not be desirable to permit sites to have the source of the machine-independent phase: commonly to give freedom of choice for the form of the language in which that phase is written and to prevent local "improvements" which rapidly lead to non-standard language definitions. In such cases the intermediate-code generator can be maintained at one site and updated versions can be distributed in the intermediate code form without fear of compromising the quality of the object code generated from it. Such a technique is currently being used in the production of portable SIMULA compilers [Krogdahl, 1980].

6.5.4 Flexibility

At some stage in producing a compiler the needs of the end user must be considered. The flexibility afforded by the high-level nature of the intermediate code allows the compiler to be adapted to fit its environment. If the compiler is to be used for teaching, the quality of the code it produces can be sacrificed for compilation speed and high-quality diagnostics, particularly as compilation time may well be an order of magnitude greater than the execution time. Indeed many of the programs will fail to compile and
never reach execution. If the application is for compiling programs that will be compiled once and then executed many times, more effort can be expended in producing fast code, although this is not to say that diagnostics and fast code must be kept separate, as the longer a program runs without failing the more trouble will be caused when it fails without convenient diagnostics.

6.6 Comments on Instruction sets and compilation

Following the production of IMP compilers for several different processors, various features of instruction sets have become evident which influence the generation of code.

i The instruction set should be complete, that is,
where an instruction is available for one data type
it should be available for all data types for which
it is well-defined. Similarly, instruction formats
used by one operation should be available for all
similar operations. The best example of such an
instruction set is that provided by the DEC PDP10.
Unfortunately the majority of machines are not so
helpful. As an example of the sorts of thing which
go wrong consider the Perkin-Elmer 3200 series.
These machines provide three integer data types:
fullword (32 bits, signed), halfword (16 bits,
signed), and byte (8 bits, unsigned). There are
"add fullword" (A) and "add halfword" (AH)
instructions but no
"add byte" instruction. There are "add immediate
short" and "subtract immediate short" instructions
but multiply, divide, and, or, etc. cannot have
short immediate operands.
ii The instructions should be consistent, that is,
logically similar instructions should behave in
similar fashions.
Again, on the Perkin-Elmer 3200:
Load fullword (L) and load halfword (LH) set the
condition code but load byte (LB) does not.
Most register-store instructions can be replaced by
a load of the appropriate type followed by a
register-register instruction: e.g.
-------------- --------------
| CH 1,X | LH 0,X |
| | CR 1,0 |
-------------- --------------
both result in the same setting of the condition
code, but
-------------- --------------
| CLB 1,B | LB 0,B |
| | CLR 1,0 |
-------------- --------------
could result in different settings of the condition
code as CLR compares two unsigned 32 bit quantities
whereas CLB compares a zero-extended byte from
store with the zero-extended least significant byte
of register 1. For consistency either compare
halfword (CH) should use the sign-extended less
significant half of the register, or better, CLB
should not tamper with the value in the register.
iii Complex instructions should be avoided. There are
two reasons for this. Firstly, it is easier for a
compiler to break down statements into simple
operations than it is to build them up into complex
ones [Stockton-Gaines, 1965]. Secondly, if the
complex instructions do not perform the exact
function required by the language more instructions
will be needed to "prepare" for the complex
instruction and to "undo" its unwanted effects. As
an example, the DEC VAX11/780 is full of complex
instructions which seem to be well-suited to
high-level languages at first glance but on closer
inspection they are not so useful. A CASE
instruction is provided which indexes into a table
of displacements and adds the selected value to the
program counter. This would seem ideal for
compiling SWITCH jumps. Unfortunately, as the
table of displacements follows the CASE instruction
it would be very expensive to use it each time a
jump occurred using a particular switch. Instead
all references to the switch must jump to a common
CASE instruction. Even this does not help, as in
the event of an attempted jump to a non-existent
switch label the diagnostics or the event mechanism
will see the error as having occurred at the wrong
place in the program. Although this problem can be
"programmed around" it turns out that it is faster
to implement switches using sequences of simpler
instructions.
iv Machine designers should investigate carefully the
full consequences of building-in special fixed uses
of machine features. One of the best examples of a
clear oversight which causes grief to compiler
writers is found in the DATA GENERAL NOVA
multiplication instruction. This instruction
multiplies the value in register 1 by register 2
and places the double-length result in registers 0
and 1. As only registers 2 and 3 may be used for
addressing, and as register 3 is always used for
subroutine linkage, it follows that register 2 must
be used for addressing the local stack frame, but
this is exactly the register which must be
corrupted in order to use the multiply instruction!
Although specific machines have been used in the examples
similar problems abound in all machines. Indeed it is clear
that machines are most commonly designed for programmers
writing in assembler or FORTRAN, and furthermore writing
their programs in a particular style.
While it is clear that the problems could be called "mere
details" and that they are not difficult to surmount, it
remains that they complicate otherwise simple
code-generation algorithms making compilers larger, slower
and correspondingly more difficult to write, debug, and
maintain.

In conclusion it appears that the machine most suited to supporting high-level languages should have a small but complete set of very simple instructions, their simplicity permitting rapid execution and great flexibility.
7 Conclusions

7.1 Viability of the technique

The techniques described above have been used to create several IMP77 compilers which are in regular use on a number of systems. In terms of total memory space required for a compilation, they compare favourably with other compilers. The major weakness seems to be execution time, which can vary from twice as long as other compilers in the worst case, to half as long in the best case. As most of the effort in writing the compilers was spent in investigating the techniques involved and not in minimising compile time, and as the compilers which ran much faster were either totally or partially written in machine code (the IMP77 compilers are all written exclusively in IMP77), it seems that the technique can be used to produce acceptable service compilers.

7.2 Ease of portability

Although using I-code does not permit compilers to be written in as short a time as with P-code and OCODE, the large amount of code which is common to all of the compilers written so far means that, given a working code generator as a template, a new optimising compiler can be written in the space of a few months, with the final result producing code of high quality.
area, or created specially. Items on the stack
are modified by intermediate code control items
to reflect operations specified in the source
program. Such modifications may or may not
result in code being generated. From the point
of view of this definition stack elements are
considered to have at least three components:
i Type
ii Value
iii Access rule
The "Access rule" defines how the "Type" and
"Value" attributes are to interpreted in order to
locate the described object.
For example, the access rule for a constant could
be "Value contains the constant" while for a
variable it could be "Value contains the address
of the variable". Clearly, the access rules are
target-machine dependent. Descriptors may be
combined to give fairly complex access rules, as
in the case of applying "PLUS" to the stack when
the top two descriptors are for the variable X
and the constant 1 resulting in one descriptor
with the access rule "take the value in X and add
1 to it". The complexity of these access rules
may be restricted by a code-generator. In the
example above code could be generated to evaluate
X+1 resulting in an access rule "the value is in
register 1", say.
The importance of the code not describing the actual
computation which the source program specified but the
compilation process required is seen when attempting to use
the code for statements of the form:
A := if B = C then D else E;
This could not be encoded as:
PUSH A
PUSH B
PUSH C
JUMP # L1
PUSH D
BR L2
LOC L1
PUSH E
LOC L2
ASSVAL
The reason is that the items on the stack at the time of the
ASSVAL would be (from top to bottom) [E], [D], [A], because
no items were given which would remove them from the stack.
Hence the ASSVAL would assign the value of E to D and then
leave A dangling on the stack.
Unless otherwise stated, all constants in the
intermediate code are represented in octal.
Descriptors
DEF TAG TEXT TYPE FORM SIZE SPEC PREFIX
This item causes a new descriptor to be generated
and placed in the descriptor area. On creation,
the various fields of the DEF are used to
construct the machine-dependent representation
required for the object.
TAG is an identification which will
be used subsequently to refer to
the descriptor.
TEXT is the source-language identifier
given to the object (a null
string if no identifier was
specified).
TYPE is the type of the object:
GENERAL, INTEGER, REAL, STRING,
RECORD, LABEL, SWITCH, FORMAT.
FORM is one of: SIMPLE, NAME, ROUTINE,
FN, MAP, PRED, ARRAY, NARRAY,
ARRAYN, NARRAYN.
SIZE is either the TAG of the
appropriate record format
descriptor for records, the
maximum length of a string
variable, or the precision of
numerical variables: DEFAULT,
BYTE, SHORT, LONG.
SPEC has the value SPEC or NONE
depending on whether or not the
item is a specification.
PREFIX is one of: NONE, OWN, CONST,
EXTERNAL, SYSTEM, DYNAMIC, PRIM,
PERM or SPECIAL. If SPECIAL is
given there will follow an
implementation-dependent
specification of the properties
of the object (such as that it is
to be a register, for example).
Parameters and Formats

The parameters for procedures and the elements of record formats are defined by a list immediately following the procedure or format descriptor definition:

START      Start of definition list
FINISH     End of definition list
ALTBEG     Start of alternative sequence
ALT        Alternative separator
ALTEND     End of alternative sequence

Blocks

BEGIN      Start of BEGIN block
END        End of BEGIN block or procedure
PUSH <tag> Push a copy of the descriptor <tag> onto
the stack.
PROC <tag> This is the same as PUSH except that the
descriptor being stacked represents a
procedure which is about to be called
(using ENTER).
PUSHI <n> Push a descriptor for the integer constant
<n> onto the stack.
PUSHR <r> Push a descriptor for the real
(floating-point) constant <r> onto the
stack.
PUSHS <s> Push a descriptor for the string constant
<s> onto the stack.
SELECT <tag> TOS will be a descriptor for a record.
Replace this descriptor with one describing
the sub-element <tag> of this record.
Assignment
ASSVAL Assign the value described by TOS to the
variable described by SOS. Both TOS and
SOS are popped from the stack.
ASSREF Assign a reference to (the address of) the
variable described by TOS to the pointer
variable described by SOS. Both TOS and
SOS are popped from the stack.
JAM This is the same as ASSVAL except that the
value being assigned will be truncated if
necessary.
ASSPAR Assign the actual parameter described by
TOS to the formal parameter described by
SOS. This is equivalent to either ASSVAL
(for value parameters) or ASSREF (for
reference parameters).
RESULT TOS describes the result of the enclosing
function. Following the processing of the
result code must be generated to return
from the function.
MAP Similar to RESULT except that TOS describes
the result of a MAP. Again a return must
be generated.
DEFAULT <n>
INIT <n> Create <n> data items, corresponding to the
last descriptor defined, and give them all
an initial (constant) value. The constant
is popped from the stack in the case of
INIT but DEFAULT causes the
machine-dependent default value to be used
(normally the UNASSIGNED pattern).
Binary operators
ADD Addition
SUB Subtraction
MUL Multiplication
QUOT Integer division
DIVIDE Real division
IEXP Integer exponentiation
REXP Real exponentiation
AND Logical AND
OR Logical inclusive OR
XOR Logical exclusive OR
LSH Logical left shift
RSH Logical right shift
CONC String concatenate
ADDA ++
SUBA --
The given operation is performed on TOS and SOS, both of
which are removed from the stack, and the result
(SOS op TOS) is pushed onto the stack.
e.g. A = B-C
PUSH A
PUSH B
PUSH C
SUB
ASSVAL
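The way such a sequence drives a code generator can be sketched as follows. The fragment is written in C for illustration only; the instructions emitted are for an imaginary single-register machine and the stack handling is reduced to the bare minimum:

    #include <stdio.h>

    static const char *stack[32];        /* greatly simplified descriptor stack */
    static int sp = 0;

    static void push(const char *name) { stack[sp++] = name; }

    /* SUB: operate on SOS and TOS, pop both, and push a descriptor
       for the result (left here in register 1).                     */
    static void op_sub(void)
    {
        const char *tos = stack[--sp];
        const char *sos = stack[--sp];
        printf("    L    1,%s\n", sos);
        printf("    S    1,%s\n", tos);
        push("result in register 1");
    }

    /* ASSVAL: assign the value described by TOS to the variable
       described by SOS; both are popped.                            */
    static void op_assval(void)
    {
        --sp;                            /* TOS: the value, already in register 1 */
        printf("    ST   1,%s\n", stack[--sp]);   /* SOS: the destination */
    }

    int main(void)                       /* A = B - C */
    {
        push("A"); push("B"); push("C");
        op_sub();
        op_assval();
        return 0;
    }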
Unary Operators

NEG        Negate (unary minus)
NOT        Logical NOT (complement)
MOD        Modulus (absolute value)

The given operation is performed on TOS.
Arrays
DIM <d> <n> The stack will contain <d> pairs of
descriptors corresponding to the lower and
upper bounds for an array. This
information is used to construct <n> arrays
and any necessary accessing information for
use through the last <n> descriptors to
have been defined. All of these
descriptors will be for similar arrays.
INDEX SOS will be the descriptor for a
multi-dimensional array and TOS will be the
next non-terminal subscript. The stack is
popped.
ACCESS SOS will be the descriptor of an array and
TOS will be the final/only subscript. Both
descriptors are replaced by a descriptor
for the appropriate element of the array.
E.g. given arrays A(1:5) and B(1:4, 2:6),
and integers J,K:
A(J) = 0 K = B(J, K)
PUSH A PUSH K
PUSH J PUSH B
ACCESS PUSH J
PUSHI 0 INDEX
ASSVAL PUSH K
ACCESS
ASSVAL
Internal labels
Internal labels are those labels in the
intermediate code which have been created
by the process of translating from the
source program, and so do not appear
explicitly in the source program. The main
property of these labels is that they will
only be referred to once. This fact can be
used to re-use these labels, as, for
example, a forward reference to a
currently-defined label must cause its
redefinition.
LOCATE <l> define internal label <l>
GOTO <l> forward jump to internal label <l>
REPEAT <l> backward jump to internal label <l>
Conditional branches
These branches are always forward.
JUMPIF <cond> <label>
JUMPIFD <cond> <label>
JUMPIFA <cond> <label>
Where: <cond> ::= =, #,
<, <=,
>, >=,
TRUE, FALSE
The two items on the top of the stack are compared and a
jump is taken to <label> if the condition specified by
<cond> is true. In the case of <cond> being TRUE or FALSE
only one item is taken from the stack, and this represents a
boolean value to be tested.
User Labels
LABEL <d> locate label descriptor <d>
JUMP <d> Jump to the label described by <d>
CALL <d> Call the procedure described by <d>
Sundry Items
ON <e> <l> Start of event trap for events <e>.
Internal label <l> defines the end of the
event block.
EVENT <e> Signal event <e>
STOP stop
MONITOR monitor
RESOLVE <m> Perform a string resolution
FOR Start of a for loop
SLABEL <sd> Define switch label
SJUMP <sd> Select and jump to switch label
LINE <l> Set the current line number to <l>
Remembering values in registers
Code Total Incremental
Size Reduction Reduction
---- --------- ---------
P732.1 0 uses 9504 - -
1 use 8194 13.8% 13.8%
2 uses 8192 13.8% 0.0%
P732.2 0 uses 6500 - -
1 use 6126 5.8% 5.8%
2 uses 6126 5.8% 0.0%
P732.3 0 uses 10960 - -
1 use 9968 9.0% 9.0%
2 uses 9956 9.2% 0.2%
P732.4 0 uses 5288 - -
1 use 4970 6.0% 6.0%
2 uses 4958 6.2% 0.2%
P732.5 0 uses 5468 - -
1 use 4990 8.7% 8.7%
2 uses 4986 8.8% 0.1%
P732.6 0 uses 3424 - -
1 use 3208 6.3% 6.3%
2 uses 3208 6.3% 0.0%
P732.7 0 uses 10736 - -
1 use 9880 8.0% 8.0%
2 uses 9874 8.0% 0.0%
P732.8 0 uses 824 - -
1 use 770 6.6% 6.6%
2 uses 770 6.6% 0.0%
P732.9 0 uses 6448 - -
1 use 6148 4.6% 4.6%
2 uses 6148 4.6% 0.0%
P732.10 0 uses 22968 - -
1 use 20656 10.1% 10.1%
2 uses 20650 10.1% 0.0%
P732.11 0 uses 13996 - -
1 use 12470 10.9% 10.9%
2 uses 12442 11.1% 0.2%
P732.12 0 uses 32600 - -
1 use 28532 12.5% 12.5%
2 uses 28392 12.9% 0.4%
Code Total Incremental
Size Reduction Reduction
---- --------- ---------
P11.1 0 uses 9060 - -
1 use 7712 14.9% 14.9%
2 uses 7660 15.4% 0.5%
P11.2 0 uses 6276 - -
1 use 6000 4.4% 4.4%
2 uses 6000 4.4% 0.0%
P11.3 0 uses 9992 - -
1 use 9480 5.1% 5.1%
2 uses 9444 5.5% 0.4%
P11.4 0 uses 5020 - -
1 use 4772 5.4% 5.4%
2 uses 4768 5.6% 0.2%
P11.5 0 uses 5096 - -
1 use 4460 12.5% 12.5%
2 uses 4452 12.6% 0.1%
P11.6 0 uses 3692 - -
1 use 3064 17.0% 17.0%
2 uses 3064 17.0% 0.0%
P11.7 0 uses 7976 - -
1 use 7060 11.5% 11.5%
2 uses 7030 11.8% 0.3%
P11.8 0 uses 668 - -
1 use 652 2.4% 2.4%
2 uses 624 6.6% 4.2%
P11.9 0 uses 4888 - -
1 use 4492 8.1% 8.1%
2 uses 4484 8.3% 0.2%
P11.10 0 uses 20318 - -
1 use 19120 5.9% 5.9%
2 uses 19120 5.9% 0.0%
P11.11 0 uses 12938 - -
1 use 12162 6.0% 6.0%
2 uses 12148 6.1% 0.1%
P11.12 0 uses 12068 - -
1 use 10594 12.2% 12.2%
2 uses 10584 12.3% 0.0%
Remembering sets of registers (environments)
Code Total Incremental
Size Reduction Reduction
---- --------- ---------
P732.1 0 environments 8556 - -
1 environment 8316 2.8% 2.8%
2 environments 8238 3.7% 0.9%
3 environments 8232 3.8% 0.1%
4 environments 8222 3.9% 0.1%
5 environments 8218 4.0% 0.1%
6 environments 8192 4.2% 0.2%
P732.2 0 environments 6202 - -
1 environment 6128 1.2% 1.2%
2 environments 6130 1.2% 0.0%
3 environments 6126 1.2% 0.0%
4 environments 6126 1.2% 0.0%
5 environments 6126 1.2% 0.0%
6 environments 6126 1.2% 0.0%
P732.3 0 environments 10174 - -
1 environment 10062 1.1% 1.1%
2 environments 9968 2.0% 0.9%
3 environments 9966 2.0% 0.0%
4 environments 9964 2.1% 0.1%
5 environments 9956 2.1% 0.1%
6 environments 9956 2.1% 0.1%
P732.4 0 environments 5068 - -
1 environment 4978 1.8% 1.8%
2 environments 4958 2.2% 0.4%
3 environments 4958 2.2% 0.0%
4 environments 4958 2.2% 0.0%
5 environments 4958 2.2% 0.0%
6 environments 4958 2.2% 0.0%
P732.6 0 environments 3262 - -
1 environment 3250 0.4% 0.4%
2 environments 3216 1.4% 1.0%
3 environments 3208 1.7% 0.3%
4 environments 3208 1.7% 0.0%
5 environments 3208 1.7% 0.0%
6 environments 3208 1.7% 0.0%
P732.7 0 environments 10062 - -
1 environment 9970 0.9% 0.9%
2 environments 9894 1.7% 0.8%
3 environments 9880 1.8% 0.1%
4 environments 9874 1.9% 0.1%
5 environments 9874 1.9% 0.0%
6 environments 9874 1.9% 0.0%
P732.8 0 environments 806 - -
1 environment 782 3.0% 3.0%
2 environments 782 3.0% 0.0%
3 environments 770 4.5% 1.5%
4 environments 770 4.5% 0.0%
5 environments 770 4.5% 0.0%
6 environments 770 4.5% 0.0%
P732.9 0 environments 6244 - -
1 environment 6202 0.7% 0.7%
2 environments 6156 1.4% 0.7%
3 environments 6158 1.4% 0.0%
4 environments 6148 1.5% 0.1%
5 environments 6148 1.5% 0.0%
6 environments 6148 1.5% 0.0%
P732.10 0 environments 21214 - -
1 environment 20928 1.3% 1.3%
2 environments 20748 2.2% 0.9%
3 environments 20678 2.5% 0.3%
4 environments 20678 2.5% 0.0%
5 environments 20668 2.6% 0.1%
6 environments 20650 2.6% 0.0%
P732.11 0 environments 12772 - -
1 environment 12592 1.4% 1.4%
2 environments 12486 2.2% 0.8%
3 environments 12472 2.3% 0.1%
4 environments 12460 2.4% 0.1%
5 environments 12452 2.5% 0.1%
6 environments 12442 2.6% 0.1%
P732.12 0 environments 11522 - -
1 environment 11418 0.9% 0.9%
2 environments 11342 1.6% 0.7%
3 environments 11314 1.8% 0.2%
4 environments 11314 1.8% 0.0%
5 environments 11296 2.0% 0.2%
6 environments 11296 2.0% 0.0%
P11.1 0 environments 7686 - -
1 environment 7670 0.2% 0.2%
2 environments 7660 0.3% 0.1%
3 environments 7660 0.3% 0.0%
4 environments 7660 0.3% 0.0%
5 environments 7660 0.3% 0.0%
6 environments 7660 0.3% 0.0%
P11.2 0 environments 6012 - -
1 environment 6000 0.2% 0.2%
2 environments 6000 0.2% 0.0%
3 environments 6000 0.2% 0.0%
4 environments 6000 0.2% 0.0%
5 environments 6000 0.2% 0.0%
6 environments 6000 0.2% 0.0%
P11.3 0 environments 9472 - -
1 environment 9440 0.3% 0.3%
2 environments 9444 0.3% -0.0%
3 environments 9444 0.3% 0.0%
4 environments 9444 0.3% 0.0%
5 environments 9444 0.3% 0.0%
6 environments 9444 0.3% 0.0%
P11.4 0 environments 4784 - -
1 environment 4776 0.2% 0.2%
2 environments 4776 0.2% 0.0%
3 environments 4776 0.2% 0.0%
4 environments 4776 0.2% 0.0%
5 environments 4772 0.2% 0.0%
6 environments 4768 0.3% 0.1%
P11.5 0 environments 4512 - -
1 environment 4464 1.1% 1.1%
2 environments 4456 1.2% 0.1%
3 environments 4452 1.3% 0.1%
4 environments 4452 1.3% 0.0%
5 environments 4452 1.3% 0.0%
6 environments 4452 1.3% 0.0%
P11.6 0 environments 3076 - -
1 environment 3070 0.2% 0.2%
2 environments 3064 0.4% 0.2%
3 environments 3064 0.4% 0.0%
4 environments 3064 0.4% 0.0%
5 environments 3064 0.4% 0.0%
6 environments 3064 0.4% 0.0%
P11.7 0 environments 7104 - -
1 environment 7048 0.8% 0.8%
2 environments 7048 0.8% 0.0%
3 environments 7048 0.8% 0.0%
4 environments 7048 0.8% 0.0%
5 environments 7048 0.8% 0.0%
6 environments 7032 1.0% 1.0%
P11.8 0 environments 640 - -
1 environment 624 2.5% 2.5%
2 environments 624 2.5% 0.0%
3 environments 624 2.5% 0.0%
4 environments 624 2.5% 0.0%
5 environments 624 2.5% 0.0%
6 environments 624 2.5% 0.0%
P11.9 0 environments 4492 - -
1 environment 4484 0.2% 0.2%
2 environments 4484 0.2% 0.0%
3 environments 4484 0.2% 0.0%
4 environments 4484 0.2% 0.0%
5 environments 4484 0.2% 0.0%
6 environments 4484 0.2% 0.0%
P11.10 0 environments 19332 - -
1 environment 19196 0.7% 0.7%
2 environments 19158 0.9% 0.2%
3 environments 19138 1.0% 0.1%
4 environments 19138 1.0% 0.0%
5 environments 19120 1.1% 0.1%
6 environments 19120 1.1% 0.0%
P11.11 0 environments 12280 - -
1 environment 12200 0.6% 0.6%
2 environments 12168 0.9% 0.3%
3 environments 12160 1.0% 0.1%
4 environments 12156 1.0% 0.0%
5 environments 12148 1.1% 0.1%
6 environments 12148 1.1% 0.0%
P11.12 0 environments 10690 - -
1 environment 10616 0.7% 0.7%
2 environments 10604 0.8% 0.1%
3 environments 10604 0.8% 0.0%
4 environments 10594 0.9% 0.1%
5 environments 10584 1.0% 0.1%
6 environments 10584 1.0% 0.0%
Simple allocation of arrays and remembering subscripts
Allocation Remembering
Neither Simple (gain) Subscripts (gain)
------- ---- ------ ---------- ------
P732.1 8596 8476 (1.4%) 8312 (3.3%)
P732.2 6126 6126 (0.0%) 6126 (0.0%)
P732.3 10450 1014 (3.2%) 10426 (0.2%)
P732.4 5056 4958 (1.9%) 5056 (0.0%)
P732.5 5306 5054 (4.7%) 5308 -(0.0%)
P732.6 3384 3254 (3.8%) 3386 -(0.0%)
P732.7 10346 10112 (2.3%) 10344 (0.0%)
P732.8 806 806 (0.0%) 770 (4.5%)
P732.9 6138 6138 (0.0%) 6148 -(0.2%)
P732.10 20806 20684 (0.6%) 20776 (0.1%)
P732.11 12442 12442 (0.0%) 12442 (0.0%)
P732.12 11976 11946 (0.2%) 11326 (5.4%)
Both optimisations Total gain
------------------ ----------
P732.1 8192 4.7%
P732.2 6126 0.0%
P732.3 9956 4.7%
P732.4 4958 1.9%
P732.5 4986 6.0%
P732.6 3208 5.2%
P732.7 9874 4.6%
P732.8 770 4.5%
P732.9 6148 -0.2%
P732.10 20650 0.7%
P732.11 12442 0.0%
P732.12 11296 5.8%
Allocation Remembering
Neither Simple (gain) Subscripts (gain)
------- ---- ------ ---------- ------
P11.1 8572 8188 (4.5%) 7704 (10.1%)
P11.2 6000 6000 (0.0%) 6000 (0.0%)
P11.3 9764 9556 (2.1%) 9644 (1.2%)
P11.4 4848 4776 (1.5%) 4848 (0.0%)
P11.5 4656 4568 (1.9%) 4452 (4.4%)
P11.6 3356 3202 (4.6%) 3218 (4.1%)
P11.7 7844 7728 (1.4%) 7204 (8.2%)
P11.8 644 624 (3.1%) 644 (0.0%)
P11.9 4796 4796 (0.0%) 4484 (6.5%)
P11.10 19236 19140 (0.5%) 19216 (0.1%)
P11.11 12148 12148 (0.0%) 12148 (0.0%)
P11.12 11094 11060 (0.3%) 10616 (4.3%)
Both optimisations Total gain
------------------ ----------
P11.1 7660 10.6%
P11.2 6000 0.0%
P11.3 9444 3.3%
P11.4 4768 1.6%
P11.5 4452 4.4%
P11.6 3064 8.7%
P11.7 7032 10.4%
P11.8 624 3.1%
P11.9 4484 6.5%
P11.10 19120 0.6%
P11.11 12148 0.0%
P11.12 10584 4.6%
Simplifying: X = X op Y
Code Code
Without With Gain
------- ---- ----
P732.1 8292 8192 1.2%
P732.2 6156 6126 0.5%
P732.3 10068 9956 1.1%
P732.4 5088 4958 2.6%
P732.5 5180 4986 3.7%
P732.6 3368 3208 4.8%
P732.7 11438 11296 1.2%
P732.8 772 770 0.2%
P732.9 6214 6148 1.1%
P732.10 21086 20650 2.1%
P732.11 12590 12442 1.2%
P732.12 11438 11296 1.2%
P11.1 8284 7660 7.5%
P11.2 6220 6000 3.5%
P11.3 10040 9444 5.9%
P11.4 5136 4768 7.2%
P11.5 4800 4452 7.2%
P11.6 3342 3064 8.3%
P11.7 7596 7032 7.4%
P11.8 668 624 6.6%
P11.9 4724 4484 5.1%
P11.10 20634 19128 7.3%
P11.11 12894 12148 5.8%
P11.12 11492 10584 7.9%
Passing some parameters in registers
Code Total Incremental
Size Reduction Reduction
---- --------- ---------
P732.1 0 registers 8862 - -
1 register 8360 5.7% 5.7%
2 registers 8192 7.6% 1.9%
P732.2 0 registers 7196 - -
1 register 6544 9.1% 9.1%
2 registers 6126 14.9% 5.8%
P732.3 0 registers 10586 - -
1 register 9976 5.8% 5.8%
2 registers 9956 6.0% 0.2%
P732.4 0 registers 5126 - -
1 register 4958 3.3% 3.3%
2 registers 4958 3.3% 0.0%
P732.5 0 registers 5198 - -
1 register 5022 3.4% 3.5%
2 registers 4986 4.1% 0.7%
P732.6 0 registers 3402 - -
1 register 3222 5.3% 5.3%
2 registers 3208 5.7% 0.4%
P732.7 0 registers 10400 - -
1 register 10048 3.4% 3.4%
2 registers 9874 5.0% 1.6%
P732.8 0 registers 840 - -
1 register 810 3.6% 3.6%
2 registers 770 8.3% 4.7%
P732.9 0 registers 6404 - -
1 register 6172 3.6% 3.6%
2 registers 6148 4.0% 0.4%
P732.10 0 registers 21650 - -
1 register 20826 3.8% 3.8%
2 registers 20650 4.6% 0.8%
P732.11 0 registers 13476 - -
1 register 12442 7.7% 7.7%
2 registers 12442 7.7% 0.0%
P732.12 0 registers 11916 - -
1 register 11452 3.9% 3.9%
2 registers 11296 5.2% 1.3%
Code Total Incremental
Size Reduction Reduction
---- --------- ---------
P11.1 0 registers 7796 - -
1 register 7756 0.5% 0.5%
2 registers 7660 1.7% 1.2%
P11.2 0 registers 6192 - -
1 register 6072 1.9% 1.9%
2 registers 6000 3.1% 1.2%
P11.3 0 registers 9564 - -
1 register 9448 1.2% 1.2%
2 registers 9444 1.2% 0.0%
P11.4 0 registers 4776 - -
1 register 4768 0.2% 0.2%
2 registers 4768 0.2% 0.0%
P11.5 0 registers 4508 - -
1 register 4452 1.2% 1.2%
2 registers 4452 1.2% 0.0%
P11.6 0 registers 3098 - -
1 register 3064 1.1% 1.1%
2 registers 3064 1.1% 0.0%
P11.7 0 registers 7124 - -
1 register 7096 0.4% 0.4%
2 registers 7032 1.3% 0.9%
P11.8 0 registers 624 - -
1 register 624 0.0% 0.0%
2 registers 624 0.0% 0.0%
P11.9 0 registers 4520 - -
1 register 4488 0.7% 0.7%
2 registers 4484 0.8% 0.1%
P11.10 0 registers 19302 - -
1 register 19166 0.7% 0.7%
2 registers 19128 0.9% 0.2%
P11.11 0 registers 12364 - -
1 register 12152 1.7% 1.7%
2 registers 12148 1.7% 0.0%
P11.12 0 registers 10734 - -
1 register 10648 0.8% 0.8%
2 registers 10584 1.4% 0.6%
Remembering condition-codes
Unknown Remembered Gain
------- ---------- ----
P732.1 8820 8192 0.3%
P732.2 6134 6126 0.1%
P732.3 9976 9956 0.2%
P732.4 4968 4958 0.2%
P732.5 4988 4986 0.0%
P732.6 3212 3208 0.1%
P732.7 9880 9874 0.1%
P732.8 770 770 0.0%
P732.9 6150 6148 0.0%
P732.10 20684 20650 0.2%
P732.11 12474 12442 0.2%
P732.12 11318 11296 0.2%
P11.1 7732 7660 0.9%
P11.2 6012 6000 0.3%
P11.3 9516 9444 0.8%
P11.4 4792 4768 0.5%
P11.5 4452 4452 0.0%
P11.6 3076 3064 0.4%
P11.7 7064 7032 0.4%
P11.8 624 624 0.0%
P11.9 4496 4484 0.3%
P11.10 19204 19128 0.4%
P11.11 12192 12148 0.4%
P11.12 10626 10584 0.4%
Forward merging and delaying
Forward
Neither Merge Delaying Merge & Delay
------- ----- -------- -------------
P732.1 8192 8172 (0.2%) 8160 (0.4%) 8136 (0.7%)
P732.2 6126 6110 (0.3%) 6054 (1.2%) 6044 (1.3%)
P732.3 9956 9948 (0.2%) 9872 (0.8%) 9864 (0.9%)
P732.4 4958 4950 (0.2%) 4942 (0.3%) 4942 (0.3%)
P732.5 4986 4970 (0.3%) 4982 (0.1%) 4966 (0.4%)
P732.6 3208 3194 (0.4%) 3140 (2.1%) 3122 (2.7%)
P732.7 9874 9752 (1.2%) 9834 (0.4%) 9812 (1.6%)
P732.8 770 764 (0.5%) 738 (4.2%) 728 (5.4%)
P732.9 6148 6136 (0.2%) 6132 (0.3%) 6120 (0.4%)
P732.10 20650 20586 (0.3%) 20558 (0.4%) 20490 (0.8%)
P732.11 12442 12406 (0.3%) 12342 (0.8%) 12306 (1.1%)
P732.12 11296 11280 (0.1%) 11272 (0.2%) 11256 (0.4%)
P11.1 7660 7660 (0.0%) 7660 (0.0%) 7660 (0.0%)
P11.2 6000 6000 (0.0%) 5988 (0.2%) 5988 (0.2%)
P11.3 9444 9444 (0.0%) 9434 (0.1%) 9434 (0.1%)
P11.4 4768 4768 (0.0%) 4768 (0.0%) 4768 (0.0%)
P11.5 4452 4448 (0.1%) 4452 (0.0%) 4448 (0.1%)
P11.6 3064 3064 (0.0%) 3060 (0.1%) 3060 (0.1%)
P11.7 7032 7004 (0.4%) 7032 (0.0%) 7004 (0.4%)
P11.8 624 620 (0.6%) 620 (0.6%) 616 (1.3%)
P11.9 4484 4484 (0.0%) 4484 (0.0%) 4484 (0.0%)
P11.10 19128 19128 (0.0%) 19128 (0.0%) 19128 (0.0%)
P11.11 12148 12136 (0.1%) 12136 (0.1%) 12124 (0.2%)
P11.12 10584 10584 (0.0%) 10584 (0.0%) 10584 (0.0%)
All optimisations
None All Gain
---- --- ----
P732.1 11300 8136 28.0%
P732.2 7520 6044 19.6%
P732.3 12286 9864 19.7%
P732.4 5782 4942 14.5%
P732.5 6204 4966 19.9%
P732.6 4004 3122 22.0%
P732.7 11750 9812 16.5%
P732.8 988 728 26.3%
P732.9 6848 6120 10.6%
P732.10 24722 20490 17.1%
P732.11 14618 12306 15.8%
P732.12 14064 11256 20.0%
P11.1 9664 7660 20.7%
P11.2 6588 5988 9.1%
P11.3 11092 9434 14.9%
P11.4 5540 4768 13.9%
P11.5 5572 4448 20.2%
P11.6 3666 3060 16.5%
P11.7 8632 7004 18.7%
P11.8 752 616 18.1%
P11.9 5256 4484 14.7%
P11.10 23940 19128 20.1%
P11.11 14328 12124 15.4%
P11.12 12816 10584 17.8%
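As before, the gain is the reduction relative to the unoptimised size: for P732.1, (11300 - 8136)/11300 = 28.0%.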
Total CPU time and CPU time in input/output
Internal I/O: communication from phase 1 to phase 3.
External I/O: input of the source and the output of the object files.
Phase 1 Phase 2 Phase 3 Overall
----------- ---------- ---------- ------------------
Total % Total % Total % Internal External
CPU I/O CPU I/O CPU I/O I/O I/O
All optimisations:
P732.1: 53% 8% 32% 6% 14% 5% 13% 7%
P732.2: 53% 8% 35% 8% 12% 5% 16% 7%
P732.3: 51% 7% 34% 7% 15% 5% 13% 6%
P732.4: 54% 8% 32% 6% 14% 5% 13% 7%
P732.5: 49% 12% 36% 5% 14% 5% 12% 6%
P732.6: 50% 7% 37% 7% 12% 6% 14% 7%
P732.7: 48% 7% 39% 7% 13% 7% 15% 6%
P732.8: 54% 7% 36% 7% 10% 5% 16% 4%
P732.9: 50% 8% 36% 8% 14% 6% 15% 7%
P732.10: 54% 6% 29% 5% 17% 4% 11% 6%
P732.11: 52% 7% 32% 6% 16% 5% 11% 6%
P732.12: 52% 8% 32% 8% 17% 7% 15% 7%
No optimisation:
P732.1: 50% 7% 30% 7% 19% 8% 16% 7%
P732.2: 55% 8% 31% 9% 14% 8% 16% 8%
P732.3: 54% 8% 28% 8% 18% 6% 15% 7%
P732.4: 57% 8% 26% 7% 16% 6% 14% 7%
P732.5: 52% 8% 30% 8% 17% 7% 16% 7%
P732.6: 53% 8% 32% 9% 15% 8% 17% 8%
P732.7: 49% 7% 31% 8% 19% 8% 17% 6%
P732.8: 56% 7% 32% 9% 12% 6% 18% 5%
P732.9: 55% 9% 29% 8% 16% 7% 16% 8%
P732.10: 55% 6% 23% 6% 21% 5% 13% 6%
P732.11: 55% 8% 26% 6% 20% 6% 12% 6%
P732.12: 54% 8% 27% 8% 19% 8% 16% 8%
References
Aho, A.V. & Ullman, J.D. (1972)
"Optimisation of straight line programs"
SIAM Journal on Computing, March 1972.
Allen, F.E. & Cocke, J.A. (1971)
"A catalogue of optimising techniques"
Design and optimisation of compilers
Prentice-Hall, R. Rustin (ed). 1971.
Basili, V.R. & Turner, A.J. (1975)
"A transportable extendable compiler".
Software - Practice and Experience Vol 5,
1975.
Bell, G. (1973)
"Threaded Code".
CACM Vol 16, No 6, 1973.
Berry, R.E. (1978)
"Experience with the PASCAL P-compiler"
Software - Practice and experience
Vol 8, No 5, 1978.
Berthaud, M. & Griffiths, M. (1973)
"Incremental compilation and conversational
interpretation"
Annual review in automatic programming,
Vol 7, Part 2, 1973.
Bourne, S.R., Birrell, A.D. & Walker, I.W. (1975)
"Z code, an intermediate object code for
ALGOL68".
The Computer Laboratory, Cambridge.
Branquart, P., Cardinael, I. & Lewi, J. (1973)
"Optimised translation process, application to
ALGOL68".
Int. Comp. Symp. 1973.
Bron, C. & De Vries, W. (1976)
"A PASCAL compiler for PDP11 minicomputers".
Software - Practice and Experience Vol 6,
1976.
Brooker, R.A. et al. (1967)
"Experience with the compiler-compiler"
Computer Journal Vol 9, 1967.
Brown, P. (1972)
"Levels of language for portable software"
CACM 15, 1972.
Brown, P. (1977)
"Macro processors".
Software Portability (ed. P. Brown),
Cambridge University Press, 1977.
Buckle, J.K. (1978)
"The ICL 2900 Series".
Macmillan Press Ltd., 1978.
Cocke, J. (1970)
"Global common subexpression elimination"
ACM SIGPLAN Notices, Vol 5, No 7, 1970.
Cocke, J. & Schwartz, J.T. (1970)
"Programming languages and their compilers"
Courant Institute of Mathematical Sciences,
New York University, 1970.
Coleman, S.S., Poole, P.C. & Waite, W.M. (1974)
"The Mobile programming system, JANUS".
Software: Practice and Experience. Vol 4,
1974.
Dewar, H. (1975)
"The ISYS system".
Internal memo, Department of Computer Science,
University of Edinburgh, 1975.
Feldman, J.A. (1966)
""A formal semantics for computer languages
and its application in a compiler-compiler"
CACM 9, 1966.
Feldman, J.A. & Gries, D. (1968)
"Translator writing systems"
CACM 11, 1968.
Gardner, P. (1977)
"An ALGOL68C bootstrap".
Memo CSM-20, University of Essex, 1977.
Gilmore, B.A.C. (1980)
"DEIMOS User's Manual"
Edinburgh Regional Computing Centre.
Grosse-Lindemann, C.O. & Nageli, H.H. (1976)
"Postlude to a PASCAL compiler bootstrap on the DEC system 10".
Software - Practice and Experience, Vol 6, 1976.
Haddon, B.K. & Waite, W.M. (1978)
"Experience with the Universal Intermediate language JANUS"
Software - Practice and Experience Vol 8, No 5, 1978.
Halpern, M.I. (1965)
"Machine Independence: its technology and economics".
CACM Vol 8, No 12, 1965.
Hawkins, E.N. & Huxtable, D.H.R. (1963)
"A multi-pass translation scheme for Algol 60"
Annual review in automatic programming, No 3,
1963.
Hecht, M.S. & Ullman, J.D. (1975)
"A simple algorithm for global data flow
analysis problems"
SIAM J. Vol 4, 1975.
Huxtable, D.H.R. (1964)
"On writing an optimising translator for Algol
60"
Introduction to System Programming,
P. Wegner (ed.),
Academic Press, 1964.
Interdata (1974)
"Common Assembler Language (CAL) User's
Manual".
Interdata publication number 29-375R04. 1974.
Jensen, K. & Wirth, N. (1976)
"PASCAL User Manual and Report"
Springer-Verlag, 1976.
Kildall, G.A. (1973)
"A unified approach to global program optimisation"
ACM Symposium on Principles of Programming Languages, 1973.
Klint, P. (1979)
"Line numbers made cheap"
CACM Vol 22, No 10, 1979.
Knuth, D.E. (1962)
"History of writing compilers"
Proc. ACM 17, 1962.
Knuth, D.E. (1971)
"An empirical study of FORTRAN programs".
Software - Practice and Experience Vol 1, 1971.
Knuth, D.E. (1974)
"Structured programming with GOTO statements".
ACM Computing Surveys Vol 6, No 4, 1974.
Krogdahl, S., Myhre, O., Robertson, P.S. & Syrrist, G. (1980)
"The SCALA project: S-PORT".
Norwegian Computing Centre working paper, 1980.
Lecarme, O. & Peyrolle-Thomas, M. (1978)
"Self-compiling compilers: An appraisal of their implementation and portability"
Software - Practice and Experience Vol 8, No 2, 1978.
Lowry, E.S. & Medlock, C.W. (1969)
"Object code optimisation".
CACM Vol 12, 1969.
McKeeman, W.M. (1965)
"Peephole Optimisation"
CACM Vol 8, 1965.
Mock, O., Olsztyn, J. et al (1958)
"The problem of programming communication with changing machines".
CACM Vol 1, No 8, 1958.
Nelson, P.A. (1979)
"A comparison of PASCAL intermediate languages"
ACM SIGPLAN Notices, Vol 14, No 8, 1979.
Nori, K.V., Ammann, U. et al (1974 & 1976)
"The PASCAL <P>-compiler: Implementation notes".
Eidgenossische Technische Hochschule, Zurich, 1976.
Pavelin, C.J. (1970)
"The improvement of program behaviour in paged computer systems".
Ph.D. Thesis, University of Edinburgh, 1970.
Poole, P. (1974)
"Portable and adaptable compilers"
Advanced course on compiler construction,
Springer-Verlag 1974.
Richards, M. (1971)
"The portability of the BCPL compiler".
Software - Practice and Experience Vol 1,
1971.
Richards, M. (1977)
"Portable Compilers".
Software Portability (ed. P. Brown),
Cambridge University Press, 1977.
Robertson, P.S. (1977)
"The IMP-77 Language".
Internal Report CSR-19-77, Department of
Computer Science, University of Edinburgh,
1977.
Russell, W. (1974)
"A translator for PASCAL P-code".
Final year project report, Department of Computer Science,
University of Edinburgh, 1974.
Satterthwaite, E. (1972)
"Debugging tools for high-level languages".
Software - Practice and Experience, Vol 2, 1972.
Schneck, P.B. & Angel, E. (1973)
"A FORTRAN to FORTRAN optimising compiler".
Computer Journal Vol 16, 1973.
Sibley, R.A. (1961)
"The SLANG System"
CACM Vol 4, 1961.
Spier, M.J. (1976)
"Software Malpractice - A distasteful experience".
Software - Practice and Experience Vol 6, 1976.
Steel, T.B. (1961)
"A first version of UNCOL"
Proc. AFIPS WJCC 19, 1961.
Stephens, P.D. (1974)
"The IMP language and compiler"
Computer Journal Vol 17, 1974.
Stockton-Gaines, R. (1965)
"On the translation of machine language programs".
CACM Vol 8, No 12, 1965.
Szymanski, T.G. (1978)
"Assembling code for machines with span-dependent instructions".
CACM 21, 1978.
Tanenbaum, A.S. (1976)
"Structured Computer Organisation"
Prentice/Hall International, 1976.
Thompson, K. & Ritchie, D.M. (1974)
"The UNIX timesharing system"
CACM Vol 17, No 7, 1974.
Trout, R.G. (1967)
"A compiler-compiler system"
Proc. ACM, 1967.
Waite, W.M. (1970)
"The Mobile Programming System: STAGE 2"
CACM Vol 13, 1970.
Welsh, J. & Quinn, C. (1972)
"A PASCAL compiler for the ICL 1900 series".
Software - Practice and Experience, Vol 2,
1972.
Whitfield, H. & Wight, A.S. (1973)
"EMAS - The Edinburgh Multi-Access System"
Computer Journal Vol 16, 1973.
Wichmann, B.A. (1977)
"Optimisation"
Software Portability (ed. P. Brown),
Cambridge University Press, 1977.
Wichmann, B.A. (1977)
"How to call procedures, or second thoughts on Ackermann's function"
Software - Practice and Experience Vol 7, No 3, 1977.
Williams, M.H. (1978)
"Long/short address optimisation in assemblers".
Software - Practice and Experience, Vol 9, No 3, 1978.
Wulf, W., Johnsson, R.K., Weinstock, C.B., Hobbs, S.O. & Geschke, C.M. (1975)
"The design of an optimising compiler"
Elsevier Computer Science Library, 1975.