The production of Optimised Machine-Code
for High-Level Languages using
Machine-Independent Intermediate Codes.
Peter Salkeld Robertson
Ph. D.
University of Edinburgh
1981
(Graduation date July 1981)
ABSTRACT
The aim of this work was to investigate the problems
associated with using machine-independent intermediate codes
in the translation from a high-level language into machine
code, with emphasis on minimising code size and providing
good run-time diagnostic capabilities.
The main result was a machine-independent intermediate
code, I-code, which has been used successfully to develop
optimising and diagnostic compilers for the IMP77 language
on a large number of different computer systems. In
addition, the work has been used to lay the foundations for
a project to develop an intermediate code for portable
SIMULA compilers.
The major conclusions of the research were that carefully
designed machine-independent intermediate codes can be used
to generate viable optimising and diagnostic compilers, and
that the commonality introduced into different code
generators processing the code for different machines
simplifies the tasks of creating new compilers and
maintaining old ones.
Contents
1 Introduction
2 Intermediate codes
2.1 Uncol
2.2 Janus
2.3 OCODE
2.4 P-code
2.5 Z-code
2.6 Summary and conclusions
2.6.1 Error checking and reporting
2.6.2 Efficiency
2.6.3 Assumptions
2.6.4 Interpretation
3 Optimisations
3.1 Classification of optimisations
3.1.1 Universal optimisations
3.1.2 Local optimisations
3.1.2.1 Remembering
3.1.2.2 Delaying
3.1.2.3 Inaccessible code removal
3.1.2.4 Peephole optimisation
3.1.2.5 Special cases
3.1.2.6 Algebraic manipulation
3.1.3 Global optimisations
3.1.3.1 Restructuring
3.1.3.2 Merging
3.1.3.2.1 Forward merging
3.1.3.2.2 Backward merging
3.1.3.3 Advancing
3.1.3.4 Factoring
3.1.3.5 Loop optimisations
3.1.3.5.1 Iteration
3.1.3.5.2 Holding
3.1.3.5.3 Removal of invariants
3.1.3.6 Expansion
3.1.3.7 Addressing optimisations
3.1.4 Source optimisations
3.2 Combination of optimisations
4 The design of the compiler
4.1 General structure
4.2 The intermediate code
4.2.1 Objectives
4.2.1.1 Scope
4.2.1.2 Information preservation
4.2.1.3 Target machine independence
4.2.1.4 Decision binding
4.2.1.5 Simplification
4.2.1.6 Redundancy
4.2.1.7 Ease of use
4.3 Code layout and addressing
4.3.1 Nested procedure definitions
4.3.2 Paged machines
4.3.3 Events
4.4 Data layout and addressing
4.5 Procedure entry and exit
4.5.1 User-defined procedures
4.5.2 External procedures
4.5.3 Permanent procedures
4.5.4 Primitive procedures
4.6 Language-specified and compiler-generated objects
4.6.1 Internal labels
4.6.2 Temporary objects
4.7 Object-file generation
4.7.1 Reordering
4.7.2 Jumps and branches
4.7.3 Procedures
4.7.4 External linkage
4.7.5 In-line constants
4.7.6 Object-file format
4.8 Summary
5 Review of the overall structure
5.1 Division of function
5.2 Testing and development
5.3 Diagnostics
5.3.1 Line numbers
5.3.2 Diagnostic tables
5.3.3 Run-time checks
6 Observations
6.1 Suitability of I-code for optimisation
6.2 Performance
6.3 Cost of optimisation
6.3.1 Compile time
6.3.2 Space requirement
6.3.3 Logical complexity
6.4 Comments on the results
6.4.1 Register remembering
6.4.2 Remembering environments
6.4.4 Array allocation and use
6.4.5 Parameters in registers
6.4.6 Condition-code remembering
6.4.7 Merging
6.5 Criticisms and benefits of the technique
6.5.1 Complexity
6.5.2 I/O overhead
6.5.3 Lack of gains
6.5.4 Flexibility
6.6 Comments on Instruction sets and Optimisation
7 Conclusions
7.1 Viability of the technique
7.2 Ease of portability
7.3 Nature of optimisations
Appendix A1 Simplified I-code definition
A2 I-code internal representation
A3 Results
References
1. Introduction
Compilers for high-level languages form a significant
part of most computer systems, and with an ever increasing
number and variety of machine architectures on the market
the problems of compiler development, testing, and
maintenance consume more and more manpower and computer
time. Moreover, as computer technology is improving and
changing rapidly it is becoming evident that software costs
will increasingly dominate the total cost of a system.
Indeed, it may not be long before the lifetime of software
regularly exceeds that of the hardware on which it was
originally implemented, a state of affairs quite different
from that envisaged by Halpern when he concluded that "the
importance of the entire question of machine-independence is
diminishing .." [Halpern, 1965]. In addition, there is a
need to encourage the slowly-developing trend to write the
majority of software in high-level languages. Even though
the advantages of such an approach are many, a large number
of users still have a love of machine-code, usually fostered
by thoughts of "machine efficiency". Clearly, techniques
must be developed to simplify the production of usable
compilers which can "optimise" the match between the
executing program and the user's requirements, be they for
fast execution, small program size, reasonable execution
time but with good run-time diagnostics, or whatever.
One popular method for reducing the complexity of a
compiler is to partition it into two major phases: one
language-dependent and the other machine-dependent. The
idea is that the language-dependent phase inputs the source
program and deals with all the syntactic niceties of the
language, finally generating a new representation of the
program, an intermediate code. This is then input by a
second phase which uses it to generate machine-code for the
target computer. In this way it should be possible to
produce a compiler to generate code for a different machine
by taking the existing first phase and writing a new second
phase. This ability to move a large portion of the compiler
from machine to machine has led to such compilers being
referred to as "portable compilers" even though the term is
perhaps misleading, as only part of the complete compiler
can be moved without change. In practice many existing
compilers generate intermediate representations of the
program which are passed around within the compiler, for
example the "analysis records" produced by the syntactic
phase of compilation, but for the purposes of this work it
is only when these representations are machine-independent
and are made available outwith the compiler that they will
be termed intermediate codes.
Much of the emphasis in designing intermediate codes has
been on enabling a compiler to be bootstrapped quickly onto
a new machine - either by interpreting the intermediate
code, or by using a macro generator to expand it into
machine-code [Brown, 1977]. Once this has been done the
intention is that the quality of the code so produced can be
improved at leisure. While this approach has been very
successful and relatively error-free, it has been the
experience of several implementors that it is difficult to
adapt the scheme to produce highly optimised code [Russell,
1974]; apparently considerations of portability and
machine-independence have caused the problems of
optimisation to be overlooked. The aspect of
intermediate-code design which has received most debate
concerns the level of the code: low-level with a fairly
simple code-generator, or high-level with a more complex
code-generator [Brown, 1972].
This thesis attempts to put machine-independence and
optimisation on an equal footing, and describes the use of
an intermediate code which takes a novel view of the
process. Instead of the intermediate code describing the
computation to be performed, it describes the operation of a
code-generator which will produce a program to perform the
required computation. This effectively adds an extra level
of indirection into the compilation, weakening any linkage
between the form of the intermediate code and the object
code required for a particular implementation.
In essence I-code attempts to describe the results required
in a way which does not constrain the method of achieving
those results.
In particular it should be noted that the code described,
I-code, was designed specifically for the language IMP-77, a
systems implementation language which contains many of the
constructions which pose problems for optimisation
[Robertson, 1979]. It in no way attempts to be a
"universal" intermediate code. Notwithstanding, the code,
with a small number of minor extensions to cover non-IMP
features, has been used successfully in an ALGOL 60 compiler
and is currently proving viable in projects for writing
Pascal and Fortran 77 compilers.
The intermediate code as finally designed is completely
machine independent, except inasmuch as the source program
it describes is machine dependent, demonstrating that the
problems may not be as intractable as thought by Branquart
et al. who state that "clearly complete machine independency
is never reached" [Branquart, 1973].
In addition to the problems of machine independence there
is also the question of operating system independence, as
nowadays it is common for machines to have several systems
available. For this reason the task of producing a compiler
is far from finished when it can generate machine code
[Richards, 1977]. To simplify the generation of versions of
a compiler for different operating systems a third phase of
compilation was added, although it soon became clear that
the extra phase could be used for other purposes as well, as
will be shown later (section 4).
Throughout the text examples are given of the code
produced by compilers written to demonstrate the power of
the intermediate code. The examples of the intermediate
code are couched in terms of mnemonics for the various code
items, although the production compilers use a compacted
representation. The code and its representations are
described in Appendix A1 and Appendix A2.
In the examples of code generated for various
constructions it should be appreciated that the exact
instructions and machine features used will depend very much
on the context in which the code is produced, and so only
typical code sequences can be given.
The machines for which code is demonstrated are indicated
by the following abbreviations in parentheses:
(Nova) Data General NOVA
(PDP10) Digital Equipment Corporation PDP10
(PDP11) Digital Equipment Corporation PDP11
(VAX) Digital Equipment Corporation VAX 11/780
(GEC4080) General Electric Company 4080
(ICL2900) International Computers Limited 2900
(4/75) International Computers Limited 4/75
(7/16) Interdata 7/16
(7/32) Interdata 7/32
(PE3200) Perkin Elmer 3200
2 Intermediate codes
This section gives a brief account of the more important
intermediate codes which have been discussed and have had an
influence on the design of I-code.
2.1 Uncol
UNCOL, UNiversal Computer Orientated Language, [Mock,
1958], was an early attempt to specify a means for solving
the M*N problem of producing compilers for M languages to
run on N machines. It was proposed that an intermediate
language, UNCOL, be defined which would be able to express
the constructs from any language, and which could itself be
translated into code for any machine, resulting in the need
for only M+N compilers. Indeed it was even suggested that
programs would be written directly in UNCOL rather than in
machine code.
These ideas were very ambitious, but were presented without
any concrete examples of what UNCOL might look like.
Proposals were made for an UNCOL in [Steel, 1961] but the
work was abandoned before anything like a complete
specification had been produced.
An UNCOL-like technique which has been used extensively
is to compile for a known type of machine, such as the IBM
360, and then emulate that machine on the target machine.
Unfortunately, to give this any chance of being efficient,
microcode support will be necessary, and this is rarely
available to compiler writers.
2.2 Janus
The first attempt at generating an UNCOL which seems to
have been at least partially successful was JANUS [Coleman,
1974]. The approach was effectively to enumerate all the
mechanisms found in current programming languages and the
techniques used to implement them. From this large list was
defined a set of primitive data-types and operations upon
them. These primitives were then put together to model the
objects in the source language. Once JANUS code had been
produced the intention was that it would either be
interpreted or compiled into machine code by a macro
generator.
2.3 OCODE
Of all the languages which claim to be portable, perhaps
the most successful has been BCPL [Richards, 1971]. The
BCPL compiler generates the intermediate code OCODE which
can either be interpreted or translated into machine code
for direct execution. As BCPL is a fairly low-level
language with only one data type, the word, many of the
difficulties in designing intermediate codes do not arise.
This means that the code can be pitched at a low level and
be "semantically weak" without compromising the efficiency
of the compiled code to any great extent.
The OCODE machine works by manipulating single-word objects
held on a stack, into which there are several pointers.
e.g. R(1, 2, 3)
STACK 3 adjust the top of stack to leave two
cells free for linkage information.
LN 1 stack the constant 1.
LN 2 stack the constant 2.
LN 3 stack the constant 3.
LL L6 stack the address of label L6 (the entry
to the routine).
RTAP 5 enter the procedure adjusting the stack
frame pointer by 5 locations.
.....
.....
.....
ENTRY 1 L6 'R'
entry point for the routine R.
SAVE 5 set the top of stack pointer to be 5
locations from the stack frame pointer.
.....
RTRN return.
2.4 P-code
P-code is the intermediate code used by the PASCAL
compiler [Nori, 1976; Jensen, 1976] and was designed with
the aim of porting PASCAL quickly by means of an
interpreter. In this respect it has been very successful,
especially on microprocessor-based systems. The code is
similar to OCODE but has a greater range of instructions to
handle objects of differing types.
procedure ERROR(VAL:INTEGER); begin
0: ENT 4
TOTAL := TOTAL+1;
1: LDO 138
2: LDCI 1
3: ADDI
4: SRO 138
if INDEX >= 9 then begin
5: LDO 139
6: LDCI 9
7: GEQI
8: FJP 17
LIST[10].NUM := 255
9: LAO 140
10: LDCI 10
11: DEC 1
12: IXA 2
13: INC 1
14: LDCI 255
15: STO
end else begin
16: UJP 28
INDEX := INDEX+1;
17: LDO 139
18: LDCI 1
19: ADDI
20: SRO 139
LIST[INDEX].NUM := VAL
21: LAO 140
22: LDO 139
23: DEC 1
24: IXA 2
25: INC 1
26: LOD 0, 4
27: STO
end;
end;
28: RETP
2.5 Z-code
Z-code [Bourne, 1975] is the intermediate code produced
by the ALGOL68C compiler, the main feature of which is the
ability for the user to parameterise the first phase to
modify the Z-code to suit the target machine, an idea
previously investigated in SLANG [Sibley, 1961]. A set of
eight registers is assumed by the code and others may be
specified explicitly for each transfer. The memory with
which the code works is assumed to be "a linear data store
that is addressed by consecutive integers", addresses taking
the form of base+displacement pairs. Intermingled with the
instructions are directives which control the translation of
the code into machine orders. Two of these directives are
used to divide the code into "basic blocks" or
"straight-line segments", and describe the usage of
registers on entry to and exit from the blocks, although
little use seems to be made of them at present.
As an example here is the Z-code generated by the PDP10
version of the compiler [Gardner, 1977]:
int X := 2, Y := 3, Z := 2
1: F000 10 0 +2 load 2
F040 10 6 +144 store in X
F000 10 0 +3 load 3
F040 10 6 +145 store in Y
5: F000 10 0 +2 load 2
F040 10 6 +146 store in Z
proc P = (int A, B) int: begin
7: S715 p*Z
T246 677 712
if A > B
9: R0
10: F020 10 5 +4 load A
11: F022 10 5 +5 subtract B
12: F113 10 0 P713 ->L713 if <=
then A
13: F020 10 5 +4 load A
else B
14: H116 0 P714 ->L714
15: L713
16: F020 10 5 +5 load B
fi
17: L714
18: R1 10 1
end
19: R1 10 1
20: T247 667 712 end of P
2.6 Summary and conclusions
2.6.1 Error checking and reporting
The UNCOL approach of having one code for all languages
and machines may well simplify the generation of some sort
of compiler, but has the major disadvantage that the
checking for and reporting of run-time errors
cannot be left to the code generator - many errors are
language-dependent and the code generator cannot know how to
handle all of them. Instead the checks must be programmed
into the intermediate representation explicitly. As will be
shown later (5.3) this can inhibit several very powerful and
effective optimisations. Sadly, this problem can result in
the absence of all but the most trivial of run-time checks
in the compiled code.
Even when checking is provided in the intermediate code,
as in the case of P-code with its CHK instruction for range
testing, it is rare for the code to contain enough
information to permit the error to be reported in source
program terms: line numbers, procedure names, variable names
and values etc. As an example, many P-code interpreters
locate run-time errors in terms of 'P-code instruction
addresses' which are of negligible benefit to most users.
2.6.2 Efficiency
Commonly, little attention is paid to questions of
run-time efficiency in the generation of intermediate code.
An exception to this is Z-code which is parameterised in
order that the match between the code and the target machine
can be improved. In particular, the machine-independent
phase is intended to perform simple register optimisation,
although as the example in 2.5 shows, the insistence on
using the same register will minimise any gains from
remembering register contents. However, this is probably
just a failure on the part of the current compilers and
could be corrected at a later date. Unfortunately, the fact
that the compiler purports to optimise the intermediate code
inhibits the code generator from attempting any but the most
trivial peephole optimisations. This may be seen in the
example by considering instructions 10 - 12. On many
machines the subtract operation is not good for value
comparison as firstly it may fail with overflow, and
secondly it corrupts a register. A better implementation
would be to replace the subtract with a suitable COMPARE,
leaving the values untouched and available for later use.
This cannot be done by the code generator as it cannot know
that the intermediate code does not go on to use the result
of the subtraction later. Similarly, if Z-code had chosen
to use a COMPARE instruction in the first place, a machine
without a compare would have to work hard to make sure all
registers involved in the necessary subtract were restored
to their initial values before the intermediate code goes on
to use them.
2.6.3 Assumptions
Most machine-independent intermediate codes have been
designed, at least initially, assuming a linear store with
one address increment corresponding to one basic object. In
the case of O-code this is a direct result of the language
definition, but in languages such as PASCAL it has led to a
great loss of information, as the rich information about
data types cannot be expressed in the code. The problems
associated with putting languages onto machines with
different addressing schemes has resulted in some
intermediate codes being updated to accept a limited form of
parameterisation to define the gross appearance of the
target machine. Typical of the limitations of these codes
is P-code where although the basic types of object can have
differing sizes of machine representation, objects with
enumerated types will always be given a 'fullword' even
though the host machine could easily support a smaller item.
A typical assumption is that the difference between
objects explicitly specified in the original source and
those created by the intermediate code generator for its own
purposes is insignificant. As will be shown later (4.6),
this is not necessarily the case.
2.6.4 Interpretation
The vast majority of machine-independent intermediate
codes in current use have been designed in such a way as to
permit execution by interpretation. This immediately
imposes constraints on the form of the code, as, for
example, it will need to be possible to pre-process the code
into some consistent and manageable internal form for the
benefit of the interpreter. In order to give some sort of
efficiency to the interpretation process the intermediate
code of necessity must become like the order code of a
'real' machine. This results in code-generation being seen
as fitting the target machine to the intermediate code,
rather than fitting the intermediate code to the target
machine which is clearly the better strategy for
optimisation.
3 Optimisations
The task of any compiler for a high-level language is to
fit programs written in that language onto a specific
computer system so that the required computations may be
performed.
Optimisation may be described as the process by which the
fit is improved. Usually the quality of the optimisation is
measured in terms of two parameters: the size of the running
program, and, more commonly, the speed at which it executes.
While it is possible in some cases to make a program smaller
and increase its speed of execution at the same time, it is
well-known that in general one must be traded against the
other. For example,
the following code fragments have the same effect, but the
first will probably be smaller than the second, which will
execute faster than the first:
-------------------------- -----------------------
| | A(1) = K |
| | A(2) = K |
| for J = 1,1,8 cycle | A(3) = K |
| A(J) = K | A(4) = K |
| repeat | A(5) = K |
| | A(6) = K |
| | A(7) = K |
| | A(8) = K |
| | J = 8 |
-------------------------- -----------------------
3.1 Classification of optimisations
With a subject as complex as optimisation it is difficult
to give a useful and definitive classification of the
various possibilities for improving programs. In addition
different authors have used many different terms to describe
optimisations which have been attempted [Aho, 1974; Lowry,
1969; Wichmann, 1977]. However most optimisations fall into
one of the following four groups: Universal, Local, Global
and Source.
3.1.1 Universal optimisations
These are those optimisations which are
independent of any particular program, but which
depend on the complete environment in which the
program is to be compiled or executed. They are the
policy decisions taken by the compiler writer during
the initial design of the compiler, and include such
things as the fixed use of registers (stack
pointers, code pointers, link registers etc), the
representations of language-defined objects (arrays,
records, strings etc), and the standards for
communication with external objects.
In addition, universal optimisation must take
into account such questions as:
i Compilation speed or execution speed?
If the compiler is to be used in an
environment where programs are compiled
roughly as often as they are executed, such
as in a teaching environment, execution
speed can be sacrificed for an increase in
compilation speed.
ii Diagnostics?
If the compiler is to produce code which
will provide extensive checking and will
give diagnostic information in the event of
program failure, allowance must be made for
the efficient checking of the program's
behaviour and the maintenance of the
recovery information used by the
diagnostics. If highly optimised code is
required these constraints may not apply.
In the current state of the art universal
optimisation is done by experience and guesswork;
attempts at producing compiler-compilers which can
approach the quality of hand-constructed compilers
have not met with great success [Brooker, 1967;
Feldman, 1966; Trout, 1967]. As will be shown later
(4.5), minor changes in the universal optimisation
can result in major changes in the performance of
the generated code, and so rules made at this stage
should be as flexible as possible to permit changes
to be made in the light of experience.
From the point of view of measurement universal
optimisation provides the base level from which
other optimisations are investigated. Roughly, the
better the universal optimisation the less effective
the other optimisations appear to be.
3.1.2 Local optimisations
Local optimisations may be defined as those
optimisations which may be performed during a
sequential scan of the program using only knowledge
of statements already processed. Not only are these
optimisations reasonably simple to perform but they
can have a major effect on the generated code.
Indeed Wulf et al. state that "In the final analysis
the quality of the local code has a greater impact
on both the size and speed of the final program than
any other optimisation" [Wulf, 1975].
3.1.2.1 Remembering
Remembering optimisations are those optimisations
which can be applied to single statements in the
light of information gathered during the compilation
of previous statements. These optimisations depend
on remembering the current state of the machine and
applying this knowledge to subsequent statements.
Their chief characteristic is that they are applied
during a sequential scan of the program, and as such
are reasonably cheap to implement and execute.
For example:
----------------
| X = Y |
| if X = 0 start |
-----------------
on the PDP11 would generate:
------------
| MOV Y,X |
| BNE $1 | remembering that the previous
| | line sets the condition code.
------------
The most powerful of the remembering
optimisations is that whereby the correspondence
between values in registers and values in store is
remembered and references to the store value are
replaced by references to the register's value,
register operations usually being smaller and faster
than their store equivalents. Unfortunately there
are several cases where this leads to worse code
than the "obvious" version. For example, on the
(PE3200) the code on the right is larger and slower
than that on the left:
------------
| X = 2 |
| P = P<<2 |
-----------
----------- --------------
| LIS 3,2 | LIS 3,2 | pick up 2
| ST 3,X | ST 3,X | store it in X
| L 1,P | L 1,P | pick up P
| SLLS 1,2 | SLL 1,0(3) | shift it by 2
| ST 1,P | ST 1,P | store it in P
----------- --------------
In addition to keeping track of the changes in
the state of the machine from the compilation of one
statement to another, remembering also includes
preserving this state or environment for later use
when a label is encountered, either by merging the
current environment with the environment saved at
the jump to the label, or simply by restoring that
latter environment when it is not possible for
control to "fall through" from the statements
immediately preceding the label.
In all forms of remembering it is vital to be
able to keep the information up-to-date,
invalidating knowledge when it becomes false, a
process which is exacerbated when it is possible for
an object to be known by two or more apparently
different descriptions as in the following code:
---------------------------
| integer J, K |
| integerarray A(1:12) |
| integername P |
| P == J |
| J = 1; K = 1 |
---------------------------
At this point P and J refer to the same location as
do A(J) and A(K).
Except in the most simple of cases all that can be
done is to assume the worst and forget anything
potentially dangerous after writing to unknown
addresses.
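The bookkeeping this requires can be illustrated by the following sketch, written in Python purely for exposition; it is not part of any of the compilers described in this work, and the mnemonics and names used are illustrative only. A table associates each register with the variable whose value it is known to hold; the table is consulted before a load is generated, and is cleared whenever the knowledge may have become false.

    class Rememberer:
        def __init__(self):
            self.contents = {}                  # register -> variable it is known to hold

        def load(self, reg, var):
            """Return the instructions needed to get the value of var into reg."""
            for r, v in self.contents.items():
                if v == var:
                    if r == reg:
                        return []               # the value is already where it is wanted
                    self.contents[reg] = var
                    return [f"MOV {r},{reg}"]   # register-to-register move is cheaper
            self.contents[reg] = var
            return [f"LOAD {reg},{var}"]        # nothing is known: fetch from store

        def store(self, reg, var):
            self.contents[reg] = var            # the register now mirrors the variable
            return [f"STORE {reg},{var}"]

        def forget(self):
            self.contents.clear()               # e.g. after a write to an unknown address

    r = Rememberer()
    code = r.load(1, "Y") + r.store(1, "X")     # X = Y
    code += r.load(1, "X")                      # if X = 0 ... : no second load is emitted
    print(code)                                 # ['LOAD 1,Y', 'STORE 1,X']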
3.1.2.2 Delaying
Delaying is the process of generating
instructions but not planting them in the code
sequence until it is absolutely necessary. This is
of advantage if it is discovered that such "pending"
instructions are not needed, or can be combined with
other instructions.
The two common cases are illustrated below:
----------------------------------
| integerfn F(integername X) |
| integer T |
| T = X |
| T = 0 if T < 0 |
| X = 1 |
| result = T |
| end |
----------------------------------
The obvious code for the body of this function is
(PE3200):
-------------------
| L 3,X | address of parameter
| L 0,0(3) | value of parameter
| ST 0,T |
| BGE $1 | -> if T >= 0
| SR 0,0 |
| ST 0,T | T = 0
| $1:LIS 2,1 |
| ST 2,0(3) | X = 1
| LR 1,0 | load result
| {return} |
-------------------
By delaying the first store into T until after
the conditional statement, and delaying the second
store into T until after the label, both
instructions can be combined, resulting in the code:
-------------------
| L 3,X |
| L 0,0(3) |
| BGE $1 |
| SR 0,0 |
| $1:ST 0,T |
| LIS 2,1 |
| ST 2,0(3) |
| LR 1,0 |
| {return} |
-------------------
This store itself can now be delayed until the
return from the function, at which point, as T is
local to the function and will be destroyed, the
instruction can be deleted altogether.
Section 3.2 gives a description of one way in which
this sort of optimisation has been achieved.
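One way of organising such "pending" instructions is sketched below; the sketch is in Python, the tuple representation of the operations is assumed purely for illustration, and it is not the mechanism of the compilers described here. Each store is held back against its destination and is planted only when the stored value could be observed; a store into a local variable which is still pending at the return is dropped altogether.

    def generate_with_delay(ops, local_names):
        code, pending = [], {}                   # pending: destination -> delayed store
        for op in ops:
            kind = op[0]
            if kind == "store":                  # ("store", register, destination)
                _, reg, dest = op
                pending[dest] = f"ST   {reg},{dest}"
            elif kind == "use":                  # ("use", destination): value is read back
                _, dest = op
                if dest in pending:
                    code.append(pending.pop(dest))   # the store really is needed
                code.append(f"L    0,{dest}")
            elif kind == "return":
                for dest, instr in pending.items():
                    if dest not in local_names:  # stores into locals die with the frame
                        code.append(instr)
                code.append("RETURN")
        return code

    print(generate_with_delay([("store", 0, "T"), ("store", 2, "X"), ("return",)],
                              local_names={"T"}))
    # ['ST   2,X', 'RETURN'] - the delayed store into the local T has been deleted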
3.1.2.3 Inaccessible code removal
In several cases compilers can generate code
which will never be executed.
The common causes of this are either user-specified
conditions whose truth is constant and are used to
achieve some sort of "conditional compilation", or
structural statements following unconditional jumps
as below:
----------------------
| if X = 0 start |
| ERROR(1) |
| return |
| else |
| V = V-X |
| finish |
----------------------
Here the branch instruction, usually generated at the
end of the if clause to take control past the else
clause, can never be executed.
Such inaccessible code can be eliminated to shorten
the program, but without directly affecting its
speed of execution.
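The elimination itself is straightforward; the following Python sketch works over an assumed list representation of the generated code (not the representation used by the compilers described here) and simply drops everything between an instruction from which control cannot fall through and the next label.

    def remove_inaccessible(instructions):
        result, reachable = [], True
        for ins in instructions:
            if ins[0] == "label":
                reachable = True                 # control may arrive here by a jump
            if reachable:
                result.append(ins)
            if ins[0] in ("branch", "return"):   # control cannot fall through these
                reachable = False
        return result

    body = [("call", "ERROR", 1), ("return",),
            ("branch", "L2"),                    # the jump past the else clause: dead
            ("label", "L1"), ("assign", "V", "V-X"), ("label", "L2")]
    print(remove_inaccessible(body))             # the ("branch", "L2") has been removed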
3.1.2.4 Peephole optimisations
Peephole optimisation [McKeeman, 1965] is the
technique of examining small sections of generated
code to make fairly obvious, but ad hoc,
improvements. Many of the gains from the
optimisation come from simplifying code sequences
created by the juxtaposition of code areas which
were produced separately.
For example (PE3200):
Before After
---------------- -------------------
| ST 4,X | ST 4,X |
| L 4,X | |
|- -|- -|
| AR 1,2 | |
| AHI 1,48 | AHI 1,48(2) |
---------------- -------------------
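The first of these improvements can be expressed as a very small sliding-window pass over the generated instructions. The following sketch, in Python with the instructions held as plain strings, is intended only to illustrate the idea and is not the form used in the code-generators described later.

    def peephole(code):
        out = []
        for ins in code:
            # "ST r,x" immediately followed by "L r,x": the load is redundant.
            if (out and out[-1].startswith("ST ") and ins.startswith("L ")
                    and out[-1].split()[1] == ins.split()[1]):
                continue
            out.append(ins)
        return out

    print(peephole(["ST 4,X", "L 4,X", "AR 1,2", "AHI 1,48"]))
    # ['ST 4,X', 'AR 1,2', 'AHI 1,48']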
3.1.2.5 Special cases
Special-case optimisations are those which make
use of the particular structure and features of the
target machine to control the way in which certain
statements are implemented.
For example:
Obvious Optimised
---------------- -----------
(PDP11) | MOV #0,X | CLR X | X = 0
|- -|- -|
(PDP11) | ADD #1,X | INC X | X = X+1
|- -|- -|
(PE3200) | LHI 1,NULL | SR 0,0 | S = ""
| LHI 2,S | STB 0,S |
| BAL 15,MOVE | |
----------------------------
These optimisations are very similar to peephole
optimisations but are distinguished because they
actively control the generation of code rather than
passively alter code which has already been
produced. In particular they avoid one of the
drawbacks of peephole optimisation, namely that even
though it can reduce fairly complex instruction
sequences to a simpler form, the side-effects of
generating the long form in the first place often
degrade the code. In the example of setting a
string variable to the null string above the
optimised form uses only one register, the value of
which can be remembered. In the non-optimised
version three registers are immediately altered and
the knowledge of the contents of all of the
registers may need to be forgotten unless the code
generator knows how the MOVE routine works and can
forget only those registers which it uses.
3.1.2.6 Algebraic manipulations
Algebraic optimisations are improvements brought
about by using the algebraic properties of operators
and operands, and include:
. Folding, or compile-time evaluation
1+2 is replaced by 3
. Removal of null operations
A+0 is replaced by A
. Using commutative properties
-B+A is replaced by A-B
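These rewritings are conveniently applied to an expression tree before any code is generated. The following Python sketch shows the three cases above on a tuple representation of expressions which is assumed here purely for illustration.

    def simplify(e):
        if not isinstance(e, tuple):
            return e                                         # a constant or a variable
        if e[0] == "neg":                                    # unary minus
            x = simplify(e[1])
            return -x if isinstance(x, int) else ("neg", x)
        op, l, r = e[0], simplify(e[1]), simplify(e[2])
        if isinstance(l, int) and isinstance(r, int):        # folding: 1+2 becomes 3
            return {"+": l + r, "-": l - r, "*": l * r}[op]
        if op in ("+", "-") and r == 0:                      # null operation: A+0 becomes A
            return l
        if op == "+" and isinstance(l, tuple) and l[0] == "neg":
            return ("-", r, l[1])                            # commutativity: -B+A becomes A-B
        return (op, l, r)

    print(simplify(("+", 1, 2)))                             # 3
    print(simplify(("+", ("neg", "B"), "A")))                # ('-', 'A', 'B')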
3.1.3 Global optimisations
Global optimisations may be defined as those
improvements to the code which require knowledge of
substantial parts of the program. In effect they
are performed by examining numbers of statements in
parallel, in contrast to the sequential scan
required by local optimisation.
3.1.3.1 Restructuring
Restructuring optimisations are those
optimisations which may be brought about by changing
the order in which the code is laid out in memory
without otherwise changing the nature of the code.
As will be discussed later, there are many reasons
why programs can be improved by changing the order
of chunks of the code. A common reason is that many
machines have conditional branch instructions with
limited range while the unconditional branches have
a much larger range. Hence if (A) represents a
large number of statements and (B) represents a
small number of statements, the program:
----------------
| if X = 0 start |
| {A} |
| else |
| {B} |
| finish |
----------------
could be improved by reordering as on the right
(PDP11):
original reordered
------------------ ----------------
| MOV X,R0 | MOV X,R0 |
| BEQ $1 | BEQ $1 |
| JMP $2 | {B} |
| $1: {A} | JMP $2 |
| BR $3 | $1: {A} |
| $2: {B} | $2: |
| $3: | |
------------------ ----------------
3.1.3.2 Merging
3.1.3.2.1 Forward merging
Forward merging, also somewhat confusingly
referred to as "cross jumping" [Wulf, 1975], is the
process whereby the point of convergence of two or
more code sequences is moved back over common
sub-sequences thus removing one of the
sub-sequences, as in the case below.
----------------------
| if X > Y start |
| TEST(X, Y) |
| else |
| TEST(Y, X) |
| finish |
----------------------
obvious code (VAX) after merging
--------------------- --------------------
| CMPL X,Y | CMPL X,Y |
| BLE $1 | BLE $1 |
| PUSHL X | PUSHL X |
| PUSHL Y | PUSHL Y |
| CALLS 2,TEST | |
| BRB $2 | BRB $2 |
| $1:PUSHL Y | $1:PUSHL Y |
| PUSHL X | PUSHL X |
| CALLS 2,TEST | $2:CALLS 2,TEST |
| $2: | |
--------------------- --------------------
The simplest way to perform this optimisation is to
take the code sequence about the point of a label
and a reference to that label, and set two pointers:
one, P1, to the unconditional jump and the other,
P2, to the label. If the instructions immediately
preceding the pointers are identical both pointers
are moved back over that instruction. The label is
redefined at the new position of P2 and the
instruction passed over by P1 is deleted. The
process is repeated until either another label is
found or two different instructions are encountered.
The redefinition of the label involves creating a
completely new label, leaving the old one untouched.
This both prevents trouble with multiple references
to the label and permits the optimisation to be
attempted on them.
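The process can be rendered in a few lines. The Python sketch below is an illustration of the algorithm just described and not the code-generator's own implementation; it assumes that the code is held as a list of symbolic instructions with labels as separate items, that p1 indexes the unconditional jump and that p2 indexes the definition of the label it references.

    def forward_merge(code, p1, p2, new_label="$M"):
        merged = False
        while (p1 > 0 and p2 - 1 > p1 and
               not code[p2 - 1].endswith(":") and    # stop if another label is found
               code[p1 - 1] == code[p2 - 1]):        # identical instructions?
            del code[p1 - 1]                         # delete the copy above the jump
            p1 -= 1                                  # both pointers move back; the
            p2 -= 2                                  # deletion shifts the label's index too
            merged = True
        if merged:
            code[p1] = code[p1].split()[0] + " " + new_label   # retarget the jump
            code.insert(p2, new_label + ":")                   # redefine the label here
        return code

    prog = ["CMPL X,Y", "BLE $1", "PUSHL X", "PUSHL Y", "CALLS 2,TEST",
            "BRB $2", "$1:", "PUSHL Y", "PUSHL X", "CALLS 2,TEST", "$2:"]
    print(forward_merge(prog, p1=5, p2=10))

As noted above, the original label is left untouched for the benefit of any other references to it; the sketch does this by introducing the new label $M rather than moving $2.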
As this optimisation simply causes the sharing of
execution paths there is no direct gain in execution
speed, but as the code size is reduced an indirect
improvement may be achieved if the shorter code
moves the label close enough to the reference to it
for a shorter and usually faster jump instruction to
be used.
The optimisation obviously must be performed while
labels and jumps are in a symbolic form, that is
before code addresses have been resolved. This
permits the merging of instructions which will
eventually have program-counter relative operands
and consequently be position dependent.
3.1.3.2.2 Backward merging
A second, but much more difficult form of merging
involves moving instructions back over the preceding
branch code which generates the two paths being
considered.
Original (PE3200) Optimised
------------------- -------------------
| | L 2,R |
| L 1,X | L 1,X |
| BNE $1 | BNE $1 |
P1 -> | L 2,R | |
| LIS 3,1 | LIS 3,1 |
| ST 3,A(2) | ST 3,A(2) |
| B $2 | B $2 |
P2 -> | $1:L 2,R | $1: |
| LIS 3,3 | LIS 3,3 |
| ST 3,B(2) | ST 3,B(2) |
| $2: | $2: |
------------------- -------------------
The difficulty with this optimisation is that it
requires the branch and the associated condition
testing code to be treated as a single unit, so that
merged instructions do not split the test and the
use of the result. Also the testing instructions
must be checked to ensure that they are not able to
modify the operands of the merged instructions.
This information is easily available to the
code-generator as in IMP77 only procedure calls and
string resolution can have such side-effects. In a
way similar to the other form of merging, the two
pointers, P1 and P2, are set and adjusted; P1 being
moved forward over common code carrying the branch
sequence with it (L & BNE), and P2 being advanced,
deleting the code it passes over.
3.1.3.3 Advancing
Advancing is the process of moving operations
back in the instruction stream so that they are
executed earlier and pave the way for improving
subsequent statements.
On many machines the statements:
---------------
| X = X-1 |
| A(X) = P |
| X = X-1 |
| A(X) = Q |
---------------
could be compiled to more efficient code if
rewritten:
---------------
| X = X-2 |
| A(X+1) = P |
| A(X) = Q |
---------------
as only one calculation will need to be done to
address both A(X) and A(X+1), the constant, suitably
scaled, being added into the displacement field of
the appropriate instruction (PDP11):
-----------------
| SUB #2,X |
| MOV X,R1 |
| ADD R1,R1 |
| ADD A,R1 |
| MOV P,2(R1) |
| MOV Q,(R1) |
-----------------
3.1.3.4 Factoring
Factoring is the generalisation of merging and
involves the removal of common sections of code.
Included under this heading is the elimination of
common sub-expressions. At the source level this can
be seen in changes such
as:
-----------------------------
| D = SIN(X^2) + COS(X^2) |
-----------------------------
being replaced by
----------------------------
| real T |
| T = X^2 |
| D = SIN(T) + COS(T) |
----------------------------
At the machine level the optimisation is often
available as the result of address arithmetic in the
case of simple arrays:
-----------------
| A(J) = B(J) |
-----------------
Original (PE3200) Optimised
----------------- -----------------
| L 1,J | L 1,J |
| SLLS 1,2 | SLLS 1,2 |
| AR 1,LNB | AR 1,LNB |
| L 3,J | |
| SLLS 3,2 | |
| AR 3,LNB | |
| L 0,B(3) | L 0,B(1) |
| ST 0,A(1) | ST 0,A(1) |
-----------------------------------
In this case as the code-generator is in complete
control the optimisation can be very simple,
although rather specific.
The techniques for handling common sub-expressions
have been investigated at length by several authors,
but measurements indicate that in most programs
expressions are so trivial that the expense of finding
common sub-expressions is not repaid by the
resulting improvement in the generated code [Knuth,
1971].
The more general form of factoring can be seen in
the transformation of the following statements:
---------------------------------------
| if X = 0 then A = 1 else B = 2 |
| if X = 0 then C = 3 else D = 4 |
---------------------------------------
into:
-----------------------
| if X = 0 start |
| A = 1 |
| C = 3 |
| else |
| B = 2 |
| D = 4 |
| finish |
-----------------------
3.1.3.5 Loop optimisations
3.1.3.5.1 Iteration
Iteration is the process whereby the values in
variables from previous iterations of a loop are
used to calculate the new values for the current
iteration, rather than calculating those values from
scratch each time. One of the effects of this
optimisation can be the reduction in strength of
operations, such as changing multiplications into
additions. In this context the IMP77 operators "++"
and "--" may be used to great effect. Their action
is to adjust the reference on the left by the number
of items to the right, hence if X is an integer then
X++1 is the next integer and X--2 is the integer two
integers before X.
-------------------------------
| for J = 1,1,1000 cycle |
| A(J) = J |
| repeat |
-------------------------------
Can be optimised to:
-------------------------------
| integername T |
| T == A(1)--1 |
| for J = 1,1,1000 cycle |
| T == T++1 |
| T = J |
| repeat |
-------------------------------
3.1.3.5.2 Holding
Holding is the process of preloading values used
in a loop into registers or other such temporaries,
using those temporaries within the loop and finally
storing the values back into the required variables
at the end of the loop, if necessary. In the
previous example the value in T, the current address
of the array element being considered, could be
loaded into a register before the start of the loop.
In this case the final value in the register need
not be stored once the loop terminates.
The application of most other optimisations will,
at worst, have little or no effect on any particular
program; however, the danger of holding is that it
assumes that the values loaded outside the loop will
be required within the loop, and this assumption
could well be invalid.
For example, consider the following equivalent
programs:
          A                    B
------------------- ---------------------
|                   | TEMP = P//Q       |
| while X > 0 cycle | while X > 0 cycle |
|   W(X) = P//Q     |   W(X) = TEMP     |
|   X = X-1         |   X = X-1         |
| repeat            | repeat            |
------------------- ---------------------
B will be faster than A if the loop is executed at
least twice. If the loop is not executed at all
(X <= 0) B will be much slower than A (by an
alarming 80 microseconds on the 7/32).
3.1.3.5.3 Removal of invariants
This is the process whereby complex
sub-expressions which do not change their values as
the loop progresses are evaluated outside the loop
and held in temporaries:
---------------------------------
| for J = 1, 1, 1000 cycle |
| A(J) = LIMIT-MARGIN |
| repeat |
--------------------------------
Can be optimised to:
-------------------------------
| TEMP = LIMIT-MARGIN |
| for J = 1,1,1000 cycle |
| A(J) = TEMP |
| repeat |
-------------------------------
It is simply a special case of Holding.
3.1.3.6 Expansion
Expansion is the process of rewriting compact
representations of parts of a program in a more
explicit form, usually resulting in faster execution
but at the expense of more code. The two main uses
of expansion are to reduce the overheads in loop
control by repeating (unrolling) the loop body and
hence reducing the number of iterations, and to
replace calls on procedures by the body of the
procedure, with the necessary substitution for
parameters. Extra gains can come from the
interaction of the expanded code with the enclosing
code as in the following example:
-----------------------------
| for J = 1,1,100 cycle |
| A(J) = 0 |
| repeat |
-----------------------------
This can be expanded into:
-------------------------------
| for J = 2, 2, 100 cycle |
| A(J-1) = 0 |
| A(J) = 0 |
| repeat |
-------------------------------
and can generate the following code (PDP11):
--------------------
| CLR J |
| $1:ADD #2,J |
| MOV J,R1 |
| ADD R1,R1 |
| ADD LNB,R1 |
| CLR A-2(R1) |
| CLR A(R1) |
| CMP J,#100. |
| BNE $1 |
--------------------
3.1.4 Source optimisations
Source optimisations [Schneck, 1973] are those
optimisations which can be effected by changes in
the source program. They can be sub-divided into
three categories: machine-independent [Hecht, 1973;
Kildall, 1973], machine-dependent, and tentative.
Tentative optimisations are those which, while
unlikely to make the code worse, may improve it on
some machines. For example, most machines will
handle the comparison "X<1" better if it is
rewritten as "X<=0", where X is an integer variable.
3.2 Combination of optimisations
Many of the optimisations described above result
in an improvement in the generated code not only by
their own effects but also by their interaction with
other optimisations, as one improvement often
produces the conditions needed by another. As an
example consider the compilation of the following,
rather unlikely, statements on the Data General
NOVA:
---------------------
| A = (B&C)<<1 |
| A = D if A = 0 |
---------------------
The first statement can generate the obvious code:
---------------
| LDA 0,B |
| LDA 1,C |
| AND 0,1 |
| MOVZL 1,1 |
| STA 1,A |
---------------
At this stage the value in accumulator 1 (A) can be
remembered, and the STA instruction marked as
"pending" so that it can be removed later if it is
decided that deferring the store will improve the
code.
With this knowledge the second statement can be
compiled to:
-------------------
| MOV# 1,1,SZR |
| JMP $1 |
| LDA 1,D |
| STA 1,A |
| $1: |
-------------------
Immediately before the label $1 it is known that
once again the value of A is in accumulator 1, and
that the STA above the label is marked as pending as
before. Following the definition of the label the
environment before the jump to that label can be
combined with the environment just before the label
to give the new environment following the label.
The information in this environment is that A is in
accumulator 1 and that the same store is pending
from both old environments. This allows the two
marked store instructions to be removed and one
store placed after the label (and once again marked
as being "pending"). This gives the following code:
---------------------
| LDA 0,B |
| LDA 1,C |
| AND 0,1 |
| MOVZL 1,1 |
| MOV# 1,1,SZR |
| JMP $1 |
| LDA 1,D |
| $1:STA 1,A |
---------------------
A simple jump optimisation notices that the JMP
passes over just one instruction and can therefore
be removed by inverting the skip condition on the
previous MOVe, giving:
------------------
| LDA 0,B |
| LDA 1,C |
| AND 0,1 |
| MOVZL 1,1 |
| MOV# 1,1,SNR |
| LDA 1,D |
| STA 1,A |
------------------
Finally, peephole optimisation combines the AND with
the MOVZL giving ANDZL, and then combines this with the
the following MOV# to give the complete code
sequence as:
------------------
| LDA 0,B |
| LDA 1,C |
| ANDZL 0,1,SNR |
| LDA 1,D |
| STA 1,A |
------------------
The most interesting thing to notice about this
particular sequence of optimisations is that with
the possible exception of the removal of the marked
STA instructions, the final code can be generated
very simply with local optimisations.
4 The design of the compiler
This section describes the features of the compiler which
have had an influence on the form of the intermediate code.
4.1 General structure
One of the aims of this type of compilation strategy is
to simplify the production of compilers, and a successful
technique for simplifying programs is to divide them into
several communicating modules, each largely independent of
the others but with well-defined interfaces between them.
At the highest level a compiler can be split up into three
major parts:
1 A language processor, which deals with the
language-dependent parts such as parsing,
semantic checking, and error reporting.
2 A code generator, which takes the decomposed
form of the program as generated by 1 above,
and constructs the appropriate code sequences
to perform the required functions.
3 An object-file generator, which builds an
object-file from the code sequences produced
by 2, in the form required by the system which
is to execute the program.
Commonly, the first two parts of this scheme are combined
into one program which generates as its output an
assembly-language source file corresponding to the original
program. The third part then becomes the standard system
assembler. This approach clearly simplifies the production
of the compiler, as one part, the assembler, is provided
already and can ease the problems of checking the compiler
because the code it generates is presented in a well-known
form. Despite these advantages it was rejected for the
following reasons:
1 In order that assembly language can be
generated, the compiler must have an internal
form of the instructions which is changed into
text, processed by the assembler, and finally
converted into the machine representation.
These transformations can be eliminated if the
compiler works directly with the machine
representations.
2 In general, the system-provided assembler will
be expecting to receive a much more powerful
language than the rather stereotyped text
produced by compilers. This will certainly
degrade the performance of the assembler. A
solution to this is to produce a cut-down
version of the assembler which only recognises
those constructs generated by the compiler.
However, producing a new assembler removes one
of the reasons for choosing this route,
namely, not requiring extra work in writing
the object-file generator.
3 As will be seen later (section 4.7), even
after the code sequences have been produced
there remain several optimisations which can
be performed using knowledge gained during the
production of those sequences, for example,
generating short forms of jump instructions
when the distance between the jump and its
destination is small enough. While in certain
cases these optimisations can be performed by
a standard assembler it is unlikely that the
structure of the code-generator would be as
simple as if a special-purpose object-file
generator were available.
The main interface in such a system is clearly that
between the language and machine dependencies, as most
languages are largely machine-independent. It is this
interface between the language-dependent and
machine-dependent parts of the compiler which is termed the
INTERMEDIATE CODE. In the following discussion it is
assumed that the reader has a reasonable understanding of
the structure of the final form of I-code, a definition of
which may be found in Appendix A2.
4.2 The intermediate code
Even while remaining independent of machine architecture,
codes can be designed at various levels of abstraction.
Roughly, the higher the level of the intermediate code the
closer it is to the source language, and the lower the
level the closer it is to some (possibly hypothetical)
processor's instruction set.
The choice as to the level of the intermediate-code
eventually comes down to a question of where decisions are
to be taken.
If a low-level code is chosen, more decisions will have to
be made in the language-dependent phase (making it more
complicated) but fewer choices are available to the
code-generator (making it simpler, but removing chances for
improving the code in the light of particular machine
features). If a high-level code is chosen, decisions are
left to the code-generator resulting in a simpler language
processor but a more complicated code-generator which is
better able to adapt to a particular processor.
The design of the intermediate code can also be
influenced by its intended role in the complete compiling
system. If the code is to be used in the compilation of
just one language on many machines, there may be an
advantage in increasing the complexity of the code if it
results in simpler code generators at the expense of a more
complicated, but unique, first phase. Conversely, if the
code is to be generated by several different language
processors, a simple intermediate code which is easy to
produce may well be more attractive.
As I-code was intended for optimisation a high-level code
was chosen. In addition, as it was hoped that the code
could eventually be used in different language processors,
it was decided to keep the structure simple.
The complete compilation process may be thought of as a
sequence of transformations working from the source program
to the final object program via a number of intermediate
representations. As the transformations are applied the
representations become less dependent on the source language
and more dependent on the target machine. In order to
simplify the code-generator as much as possible the
intermediate code must lie as far from the source language
as is possible without straying from the objectives set out
below.
4.2.1 Objectives
One of the dangers in designing an intermediate code is
that of building into it old techniques and standard
expansions of source constructions, which while they may be
tried and tested cannot in any way be said to be "the only
solutions" or even "the best solutions".
One of the intentions behind the design of I-code was to
permit the use of varied implementation strategies. In the
same way that the only practical definition of a "good"
programming language is that it fits the style of the
particular programmer using it, so the measure of the power
of an intermediate code must include the ease with which it
can adapt to an existing style of code-generator writing.
Inevitably, practical constraints prevent total generality;
the most general form of a program is a canonical form of
itself, but this is little help in compiling it.
It follows that the intermediate code, while remaining true
to the original program and distant from "real" machines,
must provide enough simplification to make the task of
code-generation as easy as possible without inhibiting
optimisation.
From the start it was appreciated that an intermediate
code suitable for use in optimising compilers would
necessarily require more processing than a code such as
OCODE which was aimed at a quick implementation. The
original hope was that although each machine-dependent code
generator would not be small, typically about 3000-4000 IMP
statements, large portions of one could be taken as a basis
for a new implementation. This has proved to be the case,
and provision of an existing code-generator as a template
greatly simplifies the task of creating a new one
(section 6.4.1).
4.2.1.1 Scope
The first and most fundamental objective in the design of
I-code was that it should support the compilation of one
specific language, IMP-77, on many different machines.
Considerations of using the code to implement other
languages were secondary to this main aim, but were used to
bias the design when a choice had to be made from several
otherwise equally suitable possibilities. In retrospect, a
few areas of the code could have been made more general
without significant overheads in the code generators, mainly
in the area of data descriptor definitions, but a detailed
discussion of one intermediate code supporting several
languages is beyond the scope of this work.
In direct contrast to many intermediate codes, I-code was
not designed with the intention of making it convenient to
interpret; the prime aim was to permit compilation into
efficient machine-code. Nevertheless it is possible to
"compile" I-code into threaded code [Bell, 1973] or a form
suitable for interpretation, either by generating a
conventional interpretive code or by leaving the code in
more-or-less its current form but with labels resolved and
descriptors expanded into a more convenient representation.
4.2.1.2 Information preservation
As the translation of the source program into
intermediate-code is to be machine-independent it will not
be possible to know before code generation what details of
the program will be of interest to the code-generator. It
follows that any loss of information caused by the
translation is likely to reduce the scope for optimisation.
In addition, not only must the information present in the
source be available at the intermediate-code level, but also
it must be presented in a form in which it can be recognised
easily and used.
For example, the following two program fragments are
semantically identical:
A B
----------------------------------------------------
| | P = 0 |
| | cycle |
| TEST for P = 1, 1, 10 | P = P+1 |
| | TEST |
| | repeat until P = 10 |
| | |
----------------------------------------------------
However, in "B" the information that the fragment contains a
simple for construction, while not completely lost, has been
scattered through the code, and this dilution of information
will increase the complexity of any code-generator wishing
to handle for loops specially.
To leave open all avenues for optimisation it is
necessary therefore, that all of the semantic information in
the source program is preserved in a compact form in the
I-code. One sure way of achieving this property is to
design the code in such a way as to allow the regeneration
of the source program, or at least a canonical form of it
which is not significantly different from the original. In
this context insignificant differences are the removal of
comments and the standardisation of variant forms of
statements, such as:
NEWLINE if COUNT = 0
and: if COUNT = 0 then NEWLINE
4.2.1.3 Target machine independence
Most existing intermediate codes are built around a model
of a machine which will perform the required computation,
and it is this machine which must be mapped onto the actual
target computer. In order to simplify this mapping certain
assumptions are made resulting in the machine being defined
in terms of fixed-sized data objects, a fixed way of
addressing them, and a fixed set of operations on them,
usually involving some kind of stack. When compiling for
machines which are similar to the intermediate code machine
there is little problem in obtaining a reasonable match, but
when there are major differences it becomes impossible to
convert the code into an efficient machine representation.
For these reasons it was decided to make I-code
independent of actual machine representations; objects would
be described once in high-level terms and then all uses
would refer to that definition. This immediately removes
any assumptions about the sizes of data objects and the ways
in which they are addressed other than those assumptions
built in to the source language. One of the main
difficulties with existing codes has been their insistence
on the store containing a linear array of equally-sized
objects, the difference between one object and the next
being one address unit. When mapping such a structure onto
real machines with (say) byte addressed stores, problems
arise with arithmetic involving addresses as the codes
frequently pun on addresses and integer values. Several
later versions of the codes have attempted to solve these
problems by parameterising the intermediate code generator
so that the characteristics of the target machine may be
used to modify the code which is produced. However, they
still have built in to them assumptions about how the
objects can be addressed.
There are so many constraints which can be imposed on the
code to be generated, such as operating system requirements
and conventions for communicating with the run-time
environment, that a parameterised first phase could not be
expected to generate code which was well-suited to every
installation.
The authors of JANUS [Coleman, 1974] write that they believe
that the approach of using a parameterised intermediate code
"... is a dead end, and that the adaptability must come in
the translation from the intermediate language to machine
code".
4.2.1.4 Simplification
For the complexity of the machine-dependent phases of
compilation to be kept as low as possible the
machine-independent phase must do as much work as possible
while keeping within the constraints imposed by the previous
objectives. One way of simplifying the intermediate code is
for certain high-level constructions to be expanded into
lower-level constructions, but only when there is just one
expansion possible under the rules of the language, and that
expansion does not scatter information which may be of later
use.
The most obvious case of such expansion is in dealing with
complex conditional clauses such as:
-----------------------------------------------------
| if (A=B and C#D) or (E<F and G>H) then X else Y |
-----------------------------------------------------
IMP-77 specifies that the condition will only be evaluated
as far as is necessary to determine the inevitable truth or
falsity of the condition, and so, bearing in mind the
modifications to be discussed in section 4.6.1, the
statement can be represented more simply as:
---------------------------------
| if A # B then ->L1 |
| if C # D then ->L2 |
| L1: if E >= F then ->L3 |
| if G <= H then ->L3 |
| L2: X |
| ->L4 |
| L3: Y |
| L4: |
---------------------------------
This expansion is tricky and notoriously error prone, and
therefore is best done once and for all in the common phase.
Similarly it is possible to expand all simple control
structures into their equivalent labels and jumps, providing
that the structural information is not lost thereby.
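The expansion itself can be stated quite compactly. The following
sketch (in present-day Python, purely as an illustration; the Cmp, And
and Or node names and the new_label and emit parameters are assumptions
made here, not part of any IMP77 compiler) produces the same pattern of
single-use labels and compare-and-jump statements as the hand expansion
above.

    from dataclasses import dataclass

    @dataclass
    class Cmp:                     # a primitive comparison such as A = B
        left: str
        op: str
        right: str

    @dataclass
    class And:
        a: object
        b: object

    @dataclass
    class Or:
        a: object
        b: object

    NEGATE = {"=": "#", "#": "=", "<": ">=", ">=": "<", ">": "<=", "<=": ">"}

    def jump_false(c, lab, new_label, emit):
        # control reaches lab when c is false, falls through when it is true
        if isinstance(c, Cmp):
            emit(f"if {c.left} {NEGATE[c.op]} {c.right} then ->{lab}")
        elif isinstance(c, And):
            jump_false(c.a, lab, new_label, emit)
            jump_false(c.b, lab, new_label, emit)
        else:                                  # Or
            ok = new_label()                   # internal label, used once
            jump_true(c.a, ok, new_label, emit)
            jump_false(c.b, lab, new_label, emit)
            emit(f"{ok}:")

    def jump_true(c, lab, new_label, emit):
        # control reaches lab when c is true, falls through when it is false
        if isinstance(c, Cmp):
            emit(f"if {c.left} {c.op} {c.right} then ->{lab}")
        elif isinstance(c, Or):
            jump_true(c.a, lab, new_label, emit)
            jump_true(c.b, lab, new_label, emit)
        else:                                  # And
            no = new_label()
            jump_false(c.a, no, new_label, emit)
            jump_true(c.b, lab, new_label, emit)
            emit(f"{no}:")

Applied to the condition of the example, with new_label supplying L1,
L2, ... and emit printing its argument, jump_false generates the same
tests and single-use internal labels as the expansion shown above.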
4.2.1.5 Decision binding
In any program there will be various options open to a
code generator and at some stage in the compilation
decisions must be made as to the particular code sequences
to be generated. Inevitably these decisions will influence
the code which is produced subsequently. On the PDP11, for
example, there are two obvious ways of assigning the value
in X to the variable Y: either MOVe the value in directly,
or move the value into a register first and then assign the
register. If the latter way is chosen the value of X will
be available in the register for subsequent use, although
the former way is better if the value is not required in the
near future. In order to make use of the information which
may well be presented later it is necessary to be able to
defer taking irrevocable decisions until the last possible
moment. The structure of I-code permits this delaying in
the binding of decisions as it only specifies what needs to
be done in abstract terms (using descriptors of arbitrary
structure and complexity), and does not give instructions as
to how particular results are to be achieved.
4.2.1.6 Ease of use
Of prime importance in the design of the code is the ease
with which it may be used to generate good object code.
Obviously a high-level code will by its nature be more
difficult to handle than a low-level code, but this need not
be serious if the code is consistent and results in a
convenient expression of the original source. In particular
the code should be designed to permit extensive checking to
be performed during the compilation process to catch errors
in both the intermediate code and the machine-code generator
before those errors are passed on to the users. Low-level
codes are at a serious disadvantage in this respect as they
have lost much of the redundancy present in the source.
4.3 Code layout and addressing
4.3.1 Nested procedure definitions
A common feature of programming languages is the ability
to nest the definition of a procedure within another
procedure. In addition, several languages imply the
definition of procedures within single statements, as in the
case of name parameters in ALGOL-60 where the parameter
which is actually passed can be a reference to a "thunk", a
procedure to evaluate the parameter.
With such nesting provision must be made for preventing the
flow of execution from "falling through" into the procedure
from the preceding statements, and this is usually
accomplished by planting at the start of the procedure a
jump to the statement following the end. While this is
simple to implement it does introduce extra instructions
which are not strictly necessary. With user-defined nested
procedures the overhead can be minimised if a number of
procedures are defined together, as one jump instruction can be used
to skip them all. Unfortunately thunks will be generated
throughout the code in a more-or-less random way, giving
little opportunity to coalesce the jumps.
Even if the extra execution time caused by these jumps is
insignificant (though the jumps round thunks defined in loops get
executed repeatedly), the code which they are skipping
stretches the code in which they are nested. On machines
with fixed-size jump instructions which can cover the whole
machine, such as the DEC PDP 10, the stretching causes no
problems, but if the addressing is limited, or if several
different sizes of jump instruction are provided, the
presence of the nested procedure can result in more code
being produced later in the generation of large jumps.
4.3.2 Paged machines
On paged machines the overall performance of a program
does not depend solely on the efficiency of the code
produced by the compiler but includes a factor depending on
the locality of references made by the executing program.
Traditionally this locality has been improved by monitoring
the execution of the program and then re-ordering parts of
it in the light of the measurements. Unfortunately not all
operating systems provide the user with convenient tools to
enable the measurement to be done, leaving only ad hoc
methods or intuition for guidance. Without careful control
it is all too easy to move one procedure to improve
references to it and thereby cause another piece of code to
cross page boundaries and counteract any gains in paging
performance. Even if the user can obtain the necessary
information, a slight change in the program can invalidate
the changes.
Notwithstanding these problems it is evident that by careful
structuring of a program significant gains in paging
behaviour can be obtained and so this option should not be
pre-empted by the intermediate-code (as Z-CODE does by
automatically reordering the definitions of procedures).
The possibility of automatic improvement of paging
behaviour was investigated by Pavelin who showed that the
paging characteristics of a program can be improved by an
automatic reordering of the code [Pavelin, 1970].
Pavelin's thesis describes the breaking-up of a program into
"chunks", defined by branches and the destinations of
branches. At each chunk boundary extra instructions are
planted to cause the updating of a "similarity array" which
records the dynamic characteristics of the program. After
several runs the similarity arrays are merged and the result
is used to specify a reordering of the chunks which should
improve the paging performance. In test cases the
working-set size of the code was reduced by as much as 40%.
The thesis also went on to say that the various compilation
problems associated with this "... can be alleviated by
operating on an intermediate code which is machine
independent with symbolic code addresses".
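The flavour of such a scheme can be sketched as follows; the
similarity measure (co-occurrence within a window of the executed-chunk
trace) and the greedy layout are invented for this illustration and are
not taken from Pavelin's thesis.

    from collections import Counter

    def similarity(traces, window=8):
        # count how often two chunks are executed close to one another
        sim = Counter()
        for trace in traces:              # one executed-chunk trace per run
            for i, c in enumerate(trace):
                for d in trace[i + 1:i + window]:
                    if d != c:
                        sim[frozenset((c, d))] += 1
        return sim

    def reorder(chunks, sim):
        # greedily lay the chunks out so that similar chunks are neighbours
        order = [chunks[0]]
        remaining = set(chunks[1:])
        while remaining:
            last = order[-1]
            best = max(remaining, key=lambda c: sim[frozenset((last, c))])
            order.append(best)
            remaining.remove(best)
        return order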
4.3.3 Events
IMP provides a mechanism for signalling the occurrence of
synchronous "events" during the execution of a program.
These events are either generated automatically as the
result of a program error, or are signalled explicitly by
the program. The signalling of the event causes control to
be passed back through the dynamic chain of currently active
blocks until one is found which has specified a trap for the
particular event which has occurred. Execution then
continues from a point in that block determined by the trap.
In order for this to be implemented it is necessary for the
signal routine to be able to "unwind" the stack and recover
the environment of the block containing the trap.
If the entry and exit sequences of all blocks are identical,
as, for example, in the standard procedure entry mechanism
specified for the DEC VAX 11/780, the unwinding is fairly
trivial. More commonly, however, the recovery is dependent
on factors such as the textual level of the procedure and
whether it has been optimised or not. In such cases the
unwinding can be very expensive or even impossible unless
extra information is provided.
For example, on the INTERDATA 7/16 a procedure at the
outermost textual level uses register 15 to access its local
stack frame, giving the exit sequence:
-------------------
| LM 7, 4(15) |
| BFCR 0, 8 |
-------------------
but a procedure nested within this would use register 14
thus:
-------------------
| LM 7, 4(14) |
| BFCR 0, 8 |
-------------------
It follows that the signal routine must be told which
base register to use at each stage of the recovery. This can
be done either by planting code in the entry and exit
sequences of each procedure, or by keeping a static table
associating procedure start and finish addresses with the
appropriate base register.
The first method is poor as it imposes a run-time overhead
on all procedures, whether they trap events or not. The
second method is better but can be complicated if procedures
are nested as the start-finish addresses alone no longer
uniquely define the procedure. One solution is to cause all
procedures which use the same exit sequence to be loaded
into distinct areas, and to associate the recovery
information for the signal routine with each area. This
reduces the static overhead to a few words per area, rather
than a few words per procedure.
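A sketch of this second, table-based scheme (the addresses and register
numbers are invented): each load area groups procedures sharing one
exit sequence, and the signal routine consults the table with a return
address to find the frame-base register for that step of the unwinding.

    # Each load area groups procedures which share one exit sequence; the
    # addresses and register numbers below are purely illustrative.
    AREAS = [
        # (first address, last address, frame-base register for the area)
        (0x1000, 0x1FFF, 15),      # outermost-level procedures
        (0x2000, 0x2FFF, 14),      # procedures nested one level down
    ]

    def frame_base_register(return_address):
        # consulted by the signal routine at each step of the unwinding
        for first, last, reg in AREAS:
            if first <= return_address <= last:
                return reg
        raise LookupError("address is not in any known code area")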
4.4 Data addressing
One of the most important problems which faces the
compiler is the addressing of the various data objects used
by the program.
As an example of the difficulties which can arise, consider
the IMP declarations:
--------------------------------------
| integer X |
| integer array V(0:999) |
| integer Y |
-------------------------------------
On a machine such as the INTERDATA 7/16 which uses
base+displacement addressing with a 16-bit displacement, the
whole of the available storage (64K bytes) can be
addressed with a single instruction. In this case the most
efficient implementation of the array is as a row of one
thousand integers (halfwords) addressed directly via the
local name base (LNB):
---------------------------------------------------------
| LNB (Local Name Base) |
| | |
| | a a+2 a+4 a+2000 a+2002 |
| v .---.------.------. .--------.---. |
| .- -| X | V(0) | V(1) | - - - - - | V(999) | Y | - - |
| .---.------.------. .--------.---. |
---------------------------------------------------------
This implementation has several points in its favour:
i As the size of the array is known at compile-time,
no special code is required to create it at
run-time; the necessary storage can be claimed on
entry to the block along with that for simple
variables, return addresses etc.
ii Array references with constant subscripts need no
address calculations at run-time. For example
using V as declared above, the element V(2) is
immediately addressable as the halfword with
displacement "a+2 + 2*2" from LNB.
iii In certain more general cases when the subscript is
a variable, access can be simplified by remembering
previous calculations. For example, the address of
the array element V(X) is
-------------------------------------
| addr(V(0)) + X*size of each element |
-------------------------------------
In the example above this becomes:
-------------------------------------
| LNB+a+2 + X*size of each element |
-------------------------------------
which can be rearranged to:
-------------------------------------------
| a+2 + (X*size of each element+LNB) |
-------------------------------------------
Hence the following code could be produced (7/16):
--------------------
| V(X) = 0 |
| |
| LH 1,X(LNB) | pick up X
| AHR 1,1 | *2 (2 bytes per integer)
| AHR 1,LNB | add in LNB
| SHR 0,0 | get zero
| STH 0,a+2(1) | store in V(X)
--------------------
Noting that the value now in register 1
(X*size+LNB) only depends on the size of each
element, X, and the local name base, it is clear
that register 1 can be used to address the X'th
element of any integer array of one dimension and
constant bounds declared at the current level.
Hence if the array W(1:12) were declared
immediately after Y in the example above, while
register 1 is not changed W(X) can be addressed as
a+2002(1).
On the other hand, a machine with limited store cover,
such as the Data General NOVA which only has an eight-bit
displacement, will almost certainly force the array to be
implemented as an immediately addressable pointer which is
initialised at run-time to the address of storage claimed
explicitly.
--------------------------------------------------------
| LNB |
| | |
| | |
| v .---.---.---. |
| .- -| X | V | Y | - - - |
| .---.---.---. |
| | |
| | |
| | .------.------. .--------. |
| +->| V(0) | V(1) | - - - - | V(999) | - - - |
| .------.------. .--------. |
--------------------------------------------------------
With this organisation the address of V(X) will be:
----------------------------
| V + X*size of each element |
----------------------------
and there is little that can be done by rearranging the
expression to improve on the "obvious" code (7/16):
----------------
| V(X) = 0 |
| |
| LH 1,X | pick up X
| AHR 1,1 | double it
| AH 1,V | add in addr(V(0))
| SHR 0,0 |
| STH 0,0(1) |
----------------
Not only is this second code sequence longer than the first
by two bytes, but it will execute more slowly as the second
addition involves a store reference whereas the equivalent
instruction in the first sequence uses a register.
In both cases, however, some simplification can be done if
the subscript is an expression of the form:
------------------------------
| X plus or minus CONSTANT |
------------------------------
in which case the constant can be removed from the subscript
expression evaluation and added into the final displacement.
For example (7/16):
------------------------
| V(X-7) = 0 |
| |
| LH 1,X | pick up X
| AHR 1,1 | double it
| AHR 1,LNB | add in LNB
| SHR 0,0 | get zero
| STH 0,a+2+(-7)*2(1) |
------------------------
Unfortunately even this optimisation may not be
available. For example, the ICL 2900 series performs array
accesses through a DESCRIPTOR REGISTER, and the extra
displacement cannot be added into the instruction. Also
some machines, such as the IBM 360, only permit positive
displacements in instructions.
The examples above pose the following problem: If the
intermediate-code is to know nothing of the target machine
it cannot know the best way to declare the array, nor the
best way to access it. Therefore the code must always
produce the same sequences for array declarations and array
accesses. It follows that these sequences must remain quite
close to the original source and not include any explicit
address calculations.
As another example, the DEC PDP11 range has a hardware
stack which grows with decreasing store addresses. Because
of this it could be convenient to allocate storage for
variables in that order, from large addresses to small
addresses. However, in several cases it may be necessary to
force objects to be created in increasing addresses, such as
when program structures are to be mapped onto
hardware-defined structures in memory, resulting in an
implementation which must be able to create similar
objects in different ways depending on the context.
Finally, some machines provide instructions in which the
displacement of the operand is scaled before use, depending
on the size of that operand. The GEC 4080 is such a
machine, with instructions such as:
LDB 1 load byte <1>
LD 1 load halfword, bytes <2> & <3>
LDW 1 load fullword, bytes <4>,<5>,<6> & <7>
When producing code for such machines it is convenient to
allocate all the local objects of the same size in
particular areas, and then order the areas in increasing
order of the size of the objects they contain. This permits
fuller use of the available displacement field in the
instructions.
The solution to these problems which was chosen in I-code
was to define a DESCRIPTOR for each object to be
manipulated. On input to the code-generator descriptors are
converted from their machine-independent form to a new form
appropriate to the target machine. As all subsequent
reference to the object will be through descriptors the code
produced will automatically reflect the decisions made at
the time the descriptors were created.
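As a rough illustration of the idea (the field names, the displacement
test and the two target forms are assumptions made for this sketch, not
the I-code definition), the same machine-independent array descriptor
might be converted in quite different ways depending on the addressing
range of the target:

    def convert_array_descriptor(desc, frame, target):
        # desc:   machine-independent form, e.g.
        #         {'lower': 0, 'upper': 999, 'element_bytes': 2}
        # frame:  {'used': bytes already allocated in the local frame}
        # target: {'displacement_limit': ..., 'address_bytes': ...}
        elements = desc['upper'] - desc['lower'] + 1
        size = elements * desc['element_bytes']
        if frame['used'] + size <= target['displacement_limit']:
            # wide displacements (e.g. the 7/16): put the row itself in the
            # frame and address every element directly from LNB
            disp = frame['used']
            frame['used'] += size
            kind = 'direct'
        else:
            # limited displacements (e.g. the NOVA): keep only a pointer in
            # the frame; the row is claimed and the pointer filled in at
            # run time
            disp = frame['used']
            frame['used'] += target['address_bytes']
            kind = 'pointer'
        return {'kind': kind, 'base': 'LNB', 'disp': disp,
                'lower': desc['lower'], 'element_bytes': desc['element_bytes']}

Every later reference to the array goes through the returned
descriptor, so the decision taken here is reflected automatically in
all the code subsequently produced for it.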
As has been discussed previously, it may be possible to
remove the overhead in setting up addressability for local
variables and parameters if the parameters can be held in
registers and the local variables are never referenced.
After examining many procedures which do use local variables
it is clear that a large number of them do not need the
complete overhead in setting up a local frame base as they
could use the workspace pointer (stack pointer) instead.
The criterion is that the position of the locals relative to
the workspace pointer must be known at compile time. This
reduces to the procedure not having any objects with
computed sizes (arrays with computed bounds, for example)
and no calls on procedures which access the locals as their
global variables.
Consider the compilation of the following procedure on the
PDP11:
----------------------------------------
| routine MARK(record(cellfm)name CHAIN) |
| integer N |
| N = 0 |
| while not CHAIN == NULL cycle |
| N = N+1 |
| CHAIN_INDEX = N |
| CHAIN == CHAIN_LINK |
| repeat |
| end |
----------------------------------------
The code normally produced for this routine would be:
--------------------------
| MOV LNB,-(SP) | remember old LNB
| MOV DS,-(SP) | remember DS
| MOV R0,(DS)+ | save the parameter
| MOV DS,LNB | set up local addressing
| ADD #20,DS | reserve local space
| CLR 10(LNB) | N = 0
| $1: MOV -2(LNB),R1 | test CHAIN
| BEQ $2 | branch if NULL
| INC 10(LNB) | N = N+1
| MOV 10(LNB),2(R1) | CHAIN_INDEX = N
| MOV (R1),-2(LNB) | CHAIN == CHAIN_LINK
| BR $1 | repeat
| $2: MOV (SP)+,DS | restore DS
| MOV (SP)+,LNB | restore LNB
| RTS PC | return
--------------------------
however, by using workspace pointer (DS) relative addressing
this reduces to:
--------------------------
| MOV R0,(DS)+ |
| TST (DS)+ | reserve local space
| CLR -2(DS) | N = 0
| $1: MOV -4(DS),R1 | test CHAIN
| BEQ $2 |
| INC -2(DS) | N = N+1
| MOV -2(DS),2(R1) | CHAIN_INDEX = N
| MOV (R1),-4(DS) | CHAIN == CHAIN_LINK
| BR $1 |
| $2: SUB #4,DS | restore DS
| RTS PC | return
--------------------------
This optimisation can be performed quite simply by the
third phase of compilation.
In the interface between the second and third phases the
code sequences generated by the second phase are made up of
items of the form <type, value>, where the type describes where
the value is to be put, for example in the code area or in the
private data area. To achieve the workspace-pointer-relative
addressing extra types are introduced which specify that the
associated value is the displacement of a local variable from
LNB. Other codes are needed to be able to modify the operation
part of the instruction which uses the displacements, but these
will be ignored here as they cause no difficulty and would just
obscure the discussion. In addition, an extra item is output
whenever DS is explicitly altered (as when parameters are
stacked using MOV ??,(DS)+). By default the third phase will
treat these extra types as being exactly equivalent to the
ordinary code types, and will generate the first sequence of
code. However, if, when the end of the procedure is processed,
the second phase discovers that no dynamic objects or dangerous
procedure calls were generated, it marks the end of the
procedure accordingly (in the same way as described in section
4.7.2). This mark instructs the third phase to relocate all
values with the appropriate type so as to make them relative to
DS. The items recording alterations of DS are used to keep the
third phase's idea of the current position of DS in step with
reality.
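The effect of these marks on the third phase can be sketched as
follows; the item tags and the overall shapes are invented for this
illustration and do not reflect the real interface.

    def finish_procedure(items, locals_are_static, frame_offset):
        # items:             the buffered (tag, value) pairs for a procedure
        # locals_are_static: the mark set when no dynamic objects or
        #                    dangerous procedure calls were seen
        # frame_offset:      displacement of the local frame from the value
        #                    DS had once the parameters were stacked
        out, ds = [], frame_offset
        for tag, value in items:
            if tag == 'code':                # ordinary instruction words
                out.append(value)
            elif tag == 'dsmove':            # DS was explicitly altered
                ds += value
            elif tag == 'local':             # displacement of a local from LNB
                if locals_are_static:
                    out.append(value - ds)   # re-express it relative to DS
                else:
                    out.append(value)        # leave it LNB-relative
        return out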
4.5 Procedure entry and exit
IMP is heavily based on the use of procedures, indeed the
only method of communicating with the controlling
environment is by means of procedure calls. Also the
techniques of structured programming result in the extensive
use of procedures. Clearly when writing a compiler for such
languages much thought must be given to making procedure
entry and exit (and the associated passing of parameters) as
efficient as possible.
4.5.1 User-defined procedures
The usual technique for procedure entry and exit is to
have standard preludes and postludes which cover all the
different types of procedure. For example the EMAS IMP code
sequences [Stephens, 1974] are (ICL4/75):
-------------------------
| STM 4,14,16(WSP) | save the current environment
| BAL 15, PROC | enter the procedure
| . |
| PROC ST 15,60(WSP) | save the return address
| LR LNB,WSP | set up the local stack frame
| LA WSP,***(WSP) | claim local stack space
| BALR 10,0 | set up code addressability
| . |
| . |
| LM 4,15,16(LNB) | restore calling environment
| BCR 15,15 | return
-------------------------
While this has proved to be convenient to generate and
efficient to execute it has one major problem: part of the
housekeeping of the procedure entry is performed at the call
itself. This seems undesirable for two reasons:
i Procedures are generally called more often than
they are defined. If part of the housekeeping of
procedure entry is done at the call that code will
be duplicated at each call, thus increasing the
size of the program. Putting that code within the
procedure reduces the size overhead.
ii If the knowledge of what housekeeping needs to be
done for procedure entry is needed outside the
procedure it becomes impossible to alter the entry
and exit sequences to suit the actual procedure.
In particular on certain machines it is possible to
remove the entry and exit sequences altogether when
the procedures are simple enough.
If the 4/75 compiler moved the environment-saving STM
instruction into the body of the procedure the storing of
the return address would be performed automatically:
-------------------------
| BAL 15,PROC |
|- - |
| PROC STM 4,15,16(WSP) |
| LR 8,WSP |
| .. .. |
-------------------------
This not only saves four bytes per call, very important on a
machine with a very severely limited immediate addressing
range, but also reduces the overhead in entering the
procedure by one instruction.
A further modification would be to pass one or more of the
parameters in the registers, leaving the way open for
remembering that fact inside the procedure.
Hence a call could be reduced from:
------------------------
| L 1,X |
| ST 1,64(WSP) |
| L 2,Y |
| ST 2,68(WSP) |
| BAL 15,PROC | PROC(X, Y)
|- -|
| PROC STM 4,15,16(WSP) |
| .. .. |
------------------------
to:
------------------------
| L 0,X |
| L 1,Y |
| BAL 15,PROC |
|- -|
| PROC STM 4,1,16(WSP) |
| .. .. |
------------------------
The ability to determine exactly how parameters are to be
passed can be of crucial importance in the efficiency of the
procedure mechanism.
When compiling for the PDP11 the obvious calling sequence
for a procedure with two integer value parameters would be:
-------------------
| MOV X,-(SP) |
| MOV Y,-(SP) |
| JSR PC,PROC |
-------------------
Unfortunately this produces problems inside the procedure as
the return address, stacked by JSR, is too far down the
stack to permit the use of the RTS instruction to return,
for this would leave on the stack the space used by the
parameters. Nor can the stack be adjusted before the
return, for the return would then have to be made indirect
through a location beyond the stack pointer, and space there
must be considered volatile as it is used by interrupt
handling. Extra
instructions are needed either at the call or inside the
procedure to adjust the stack; the JSR instruction may well
not be "a beauty" as claimed by some implementors [Bron,
1976]. A MARK instruction has been introduced in an attempt
to overcome this problem, but it is far from helpful as it
imposes an arbitrary register convention and puts all of the
overhead on the call rather than on the procedure itself.
On the other hand if all of the parameters can be passed in
registers, the JSR will put the return address on a clear
stack permitting the use of RTS for the return. As in
practice most procedures have few parameters, usually only
one or two, this can give a large saving.
As an example of the power of being able to alter entry and
exit sequences consider a recursive implementation of the
IMP routine SPACES:
-----------------------------
| routine SPACES(integer N) |
| return if N <= 0 |
| SPACES(N-1) |
| SPACE |
| end |
-----------------------------
On the PDP10 the straightforward coding for this would be:
--------------------------
| MOVE 0, X | pick up X
| MOVEM 0, 3(SP) | assign the parameter
| PUSHJ SP, SPACES | call SPACES
|- -|
| SPACES: MOVEM LNB,1(SP) | save old frame base
| MOVE LNB,SP | pick up new frame base
| ADDI SP,3 | reserve new frame base
| SKIPLE 1,2(LNB) | load, test & skip if X<=0
| JRST LAB1 | jump to LAB1
| MOVE SP,LNB | restore stack pointer
| MOVE LNB,1(SP) | restore old frame base
| POPJ SP | return
| LAB1: SOJ 1, 0 | X-1 -> ACC1
| MOVEM 1,3(SP) | assign parameter
| PUSHJ SP,SPACES | call SPACES
| PUSHJ SP,SPACE | call SPACE
| MOVE SP,LNB | restore stack pointer
| MOVE LNB,1(SP) | restore old frame base
| POPJ SP |
--------------------------
By applying the optimisations of passing the parameter in an
accumulator (called ARG) and remembering that the parameter
is in this accumulator on entry to the procedure, the code
reduces to:
---------------------------
| MOVE ARG,X | pick up X
| PUSHJ SP, SPACES | call SPACES
|- -|
| SPACES: MOVEM LNB,1(SP) |
| MOVEM ARG,2(SP) | assign the parameter
| ADDI SP,3 |
| JUMPG ARG, LAB1 | ->LAB1 if ARG > 0
| MOVE SP,LNB | restore stack pointer
| MOVE LNB, 1(SP) |
| POPJ SP | return
| LAB1: SOJ ARG, 0 | parameter = ARG-1
| PUSHJ SP,SPACES |
| PUSHJ SP,SPACE |
| MOVE SP,LNB |
| MOVE LNB,1(SP) |
| POPJ SP |
---------------------------
On inspection it is clear that the local stack frame
(pointed at by LNB) is never used within the procedure
except by the entry and exit sequences. Hence by reducing
those sequences to the absolute minimum the code becomes
---------------------------
| MOVE ARG,X |
| PUSHJ SP, SPACES |
|- -|
| SPACES: JUMPG ARG, LAB1 |
| POPJ SP |
| LAB1: SOJ ARG, 0 |
| PUSHJ SP,SPACES |
| PUSHJ SP,SPACE |
| POPJ SP |
---------------------------
Finally, an opportunistic optimisation may be performed
[Knuth, 1974; Spier, 1976] by noticing that the final two
instructions may be combined so that the procedure SPACE
uses the return address pushed onto the stack for the return
from SPACES. This results in the tightest form of the code:
----------------------------
| MOVE ARG,X |
| PUSHJ SP, SPACES |
|- -|
| SPACES: JUMPG ARG, LAB1 |
| POPJ SP |
| LAB1: SOJ ARG, 0 |
| PUSHJ SP,SPACES |
| JRST SPACE |
---------------------------
The final steps in this optimisation can only be performed
once the body of the procedure has been compiled. In order
that the correct (in this case non-existent) entry sequence
can be used an extra pass over the object code is necessary.
This pass can be combined with the process of adjusting
labels and jumps which is carried out in the third phase of
compilation described in section 4.7. The code generator
can mark the position where an extra sequence is required
and at the end of the procedure can inform the third phase
of any salient features found in the body. The third phase
can then decide on the best entry and exit sequences to use.
This ability to tailor the "housekeeping" parts of
procedures can be used in many circumstances to limit the
inclusion of code which is needed to handle rare
constructions to those procedures which use the feature.
As an example of this consider the ICL 2900 series.
The machines of the series are designed around a hardware
stack, which resides in one, and only one, segment of the
user's virtual memory, and thus limits this data space to
255K bytes. In order to be able to handle programs using
very large arrays space must be available off-stack in
another segment or set of consecutive segments. The
maintenance of this extra data space will require
instructions to be executed on entry to and on exit from
procedures which claim space from it, but not from those
which only use space from the stack. These extra
instructions can be added to the procedure in a simple
manner by the third phase as it now controls the form of the
procedure when all the necessary information is available.
For these optimisations to be performed the intermediate
code must not lay down rules for procedure entry and exit,
rather it should simply mark the points at which suitable
code is required.
An additional consideration in the design of the I-code
for procedure entry and exit is the requirement of some
machines for a "pre-call" to be made to prepare a hardware
stack for parameters prior to their evaluation and
assignment.
For example (ICL2900):
-------------------
| PROC(1, 2, 3) |
|- -|
| PRCL 4 | pre-call
| LSS 1 | load 1
| SLSS 2 | stack it and load 2
| SLSS 3 | stack it and load 3
| ST TOS | store it on Top Of Stack
| RALN 8 | raise the Local Name Base
| | to point to the new frame
| CALL PROC |
-------------------
Following these considerations the form of procedure call
chosen for I-code was:
-----------------
| PROC P | stack procedure descriptor
| | \
| {stack param} | : repeated for each parameter
| ASSPAR | /
| ENTER | enter the procedure
-----------------
ASSPAR causes the value described on the top of the stack to
be assigned to the next parameter, identified by the
procedure descriptor second on the stack, using either
ASSVAL or ASSREF as appropriate.
In order to pass some of the parameters in registers all
that need be done is for the initial processing of the
descriptors for those parameters to define them as the
appropriate registers. PROC can then "claim" those
registers, the parameter assignment will load them, and
finally ENTER can release them for subsequent re-use on
return from the procedure.
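A sketch of how a code generator might exploit this; the descriptor
layout and the claim and release bookkeeping are invented for the
illustration.

    claimed = set()                     # registers currently reserved
    code = []                           # generated instructions (as text)

    def proc(callee):                   # PROC: stack the procedure descriptor
        for d in callee['params']:      # and claim any parameter registers
            if d['kind'] == 'register':
                claimed.add(d['reg'])
        return {'callee': callee, 'next': 0}

    def asspar(call, value):            # ASSPAR: assign the value on the top
        d = call['callee']['params'][call['next']]    # of the stack to the
        call['next'] += 1                             # next parameter
        if d['kind'] == 'register':
            code.append(f"load {value} into R{d['reg']}")
        else:
            code.append(f"store {value} at {d['disp']}(WSP)")

    def enter(call):                    # ENTER: call, then release registers
        code.append(f"call {call['callee']['name']}")
        for d in call['callee']['params']:
            if d['kind'] == 'register':
                claimed.discard(d['reg'])

For PROC(X, Y) with both parameter descriptors defined as registers,
this produces two loads and a call; the claimed registers are simply
unavailable to expression evaluation between PROC and ENTER.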
4.5.2 External procedures
Most useful languages provide means for compiling files
of procedures (and less commonly, data objects) which can be
accessed from other modules. Also, systems usually provide
extensive libraries of procedures which users of high-level
languages will want to access. In general an external
procedure is identified by a vector of quantities including
at least the entry address and a description of the
environment in which the procedure is to execute. Depending
on the type of operating system in question the number of
quantities in this vector will change. When the system
requires a "store image" which has all the addresses fixed
before execution only the entry address is required, as the
code of the procedure can be relocated in order to define
its environment. As this method demotes code-sharing to a
limited facility (making programs shareable is often a
privileged operation), several systems have selected a more
flexible scheme whereby executing programs have a writeable
"linkage area" into which are placed the entry vectors for
procedures. The code of these procedures may now be made
read-only and shared with only the linkage areas being
unique to each user. These vectors are filled in with the
references to the externals either prior to program
execution, or dynamically when the procedure is first
called. Finally, it must be noted that the compiler writer
will have little or no control over the standards required
by external procedures unless they have been generated with
the same compiler. In particular the parameter passing
mechanisms may be different from those used in the
intermediate code.
In order to cope with these and other considerations any
intermediate code which permits access to external
procedures must be sufficiently flexible to allow the
variations to be handled efficiently.
4.5.3 Permanent procedures
Most languages define a set of procedures which will be
available on any implementation without explicit action by
the user (such as the IMP procedures ITOS, READSYMBOL, READ,
and REM). Such procedures are termed "permanent
procedures" It is common for intermediate codes to provide
specific code items to invoke permanent procedures, but this
has the problem that the code-generator must know about all
such procedures, and the language-dependent phase must be
changed and the intermediate-code extended if an
implementation wishes to make efficient use of procedures
which can be compiled in-line on particular machines. For
example many machines provide an instruction for moving
blocks of store around and it could be advantageous to have
a procedure which invoked this instruction directly.
Before investigating ways of improving the implementation
of permanent procedures it is useful to examine in some
detail the properties of the procedures mentioned above,
which were chosen because they typify the main problems in
this area.
ITOS is a fairly complicated string function which
returns as its result the decimal character string
representation of the integer value passed to it as a
parameter. Because of its complexity this procedure is
almost always best implemented as an external procedure
which is linked into the program along with any other
external entities required.
REM is an integer function which returns the remainder of
dividing the first integer parameter by the second, and on
many machines can be efficiently compiled in-line, as most
integer divide instructions provide both the quotient and
the remainder. However, when compiling for machines such as
the DATA GENERAL NOVA, or the DEC PDP11 without the optional
divide instructions, division has to be
performed by a complicated subroutine, suggesting that REM
itself should be an external procedure like ITOS.
READSYMBOL falls somewhere between the two, mainly
because it is defined to have a general name parameter, that
is, the parameter may be a reference to any type of entity:
integer, real, byteinteger, etc. To implement READSYMBOL as
an external procedure it would have to be passed the general
name parameter (comprising both the address of the actual
parameter and information about its type and precision), and
would have to interpret that parameter in order to be able
to store the character, suitably converted, in the
appropriate way.
A much more efficient implementation is to convert the
statement:
---------------
| READSYMBOL(S) |
---------------
into the equivalent form:
------------------
| S = F$READSYMBOL |
------------------
where F$READSYMBOL is a function which returns as its result
the character value that READSYMBOL would have placed into
its parameter. Once this is done conversions and the choice
of store operation can be left to the usual assignment part
of the compiler.
A further complication can arise if, as in the case of
the INTERDATA 7/16 operating system, ISYS [Dewar, 1975],
several permanent procedures map directly onto
system-provided facilities: the function F$READSYMBOL can be
replaced by the supervisor call "SVC 8,0", SELECT INPUT by
"SVC 6" etc.
The difficulty caused by READ is mainly one of space. As
READ can input an integer value, a real value, or a string
value depending on the type of its (general name type)
argument, it is going to be fairly large, especially if the
hardware on which it runs does not provide floating-point
instructions, forcing those functions to be performed by
subroutine. It follows that on small systems it may be
convenient to replace calls on READ by calls on smaller
procedures, chosen at compile-time by examining the type of
the parameter given to READ, which input solely integer,
real, or string values.
Finally it should be noted that the substitutions and
modifications discussed above may only be generated as
replacements for direct calls on the procedure; if the
procedure is passed as a parameter to another procedure no
alterations are possible and a "pure" version must be
available. As passing a procedure as a parameter is totally
distinct from calling the procedure this case does not
prevent the improvements being carried out where possible.
It should now be clear that the efficient implementation
of permanent procedures will differ greatly from the
implementation of user-defined procedures, and will also
differ from one machine to another. Hence the
intermediate-code must make no
assumptions about either which permanent procedures are
available or how they are to be implemented.
As a side-effect of removing any built-in properties from
permanent procedures it becomes possible for a simple
code-generator to ignore any possibility of producing
special code and compile them all as externals.
These transformations of procedures can only be applied
when the procedures are invoked (called) directly. In the
case of procedures passed as parameters all calls will of
necessity be the same and hence either it will not be
possible to pass some permanent procedures as parameters, an
unfortunate limitation imposed by several languages, or
there must be a "pure" form of the procedures available.
This latter can be done very simply using I-code. The
permanent procedure descriptors are defined exactly as if
the procedures were truly external, but with an extra marker
showing them to be "permanent". The only time that this
marker is used is in the procedure call generating section
of the compiler. If the procedure is being passed as a
parameter this section of the compiler is not entered and so
the procedure will be passed as an external. All that is
now necessary is for there to be an external manifestation
available when the program executes. This method has the
added advantage that there is no compile-time overhead,
especially important considering that passing procedures as
parameters is one of the least-used features of IMP77.
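The scheme amounts to little more than the following sketch; the
in-line table and the two entry points are hypothetical names used only
for this illustration.

    IN_LINE = {                        # expansions this code generator knows
        'REM': lambda args: [f"divide {args[0]} by {args[1]}, keep remainder"],
    }

    def compile_direct_call(proc, args):
        # only here is the "permanent" marker ever examined
        if proc['permanent'] and proc['name'] in IN_LINE:
            return IN_LINE[proc['name']](args)
        return [f"call external {proc['name']}"]

    def compile_procedure_parameter(proc):
        # passing a procedure as a parameter never reaches the code above, so
        # a permanent procedure is passed as its pure external manifestation
        return [f"stack reference to external {proc['name']}"]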
4.5.4 Primitive Procedures
It is rare for machines to provide simple instructions
which can deal directly with all of the requirements of
high-level languages and so several constructions will have
to be handled by subroutines. The code generator may then
refer to these "primitive procedures" as though they were
machine instructions. The cases in which such procedures
are required commonly include exponentiation, string
manipulation, and array declaration and access.
Given these procedures the code-generator has a choice
between calling them as closed subroutines or expanding them
in-line. The former produces dense code but will execute
more slowly than the latter (and possibly suffer from not
knowing what is corrupted by the routine and therefore
having to forget everything it knows). On the other hand
while the expansion of primitive procedures in-line will
improve the execution speed of the program, it becomes
necessary for the code-generator to be able to create the
appropriate code sequences and thereby become more bulky.
Once again the choice must be left to the code-generator as
the benefits of a particular decision will depend on both
the target machine and the use to which the compiler is to
be put. If the compiler is to be used for large
mathematical problems it is likely that the gains made by
putting exponentiation in-line will outweigh the
disadvantage of the extra code size, whereas in
operating-system work as exponentiation is probably never
needed, the extra complexity of the code generator to expand
the routine would not be desirable.
Given that some of the primitive procedures will be
referenced often (checked array access, for example) it is
important that entry to them is made as efficient as
possible and in this area the ability to reorder code can be
used to great effect.
In the original Interdata 7/32 IMP77 compiler the
primitive routines were gathered together at the end of the
user's code, as it was only then that it was known which
procedures were required.
---------
| | <- CODE BASE (register 15)
| USER |
| CODE |
| |
---------
| |
| PRIM |
| PROCS |
| |
---------
With this scheme programs of 16Kbytes or less can reference
the primitive procedures with 32-bit instructions
(program-counter relative addressing). Unfortunately once
the program grew beyond this limit the larger and slower
48-bit form of the instructions had to be used in order to
achieve addressability. In the IMP77 code generator there
were 352 such large instructions.
In the new compiler the object code is reordered to place
the primitive procedures at the head of the user's code
where they can be addressed relative to CODE BASE.
---------
| | <- CODE BASE (register 15)
| PRIM |
| PROCS |
| |
---------
| |
| USER |
| CODE |
| |
---------
The immediate disadvantage of this is that it will push the
user's procedures further away from CODE BASE and hence
increase the chances of a user procedure reference requiring
a long (48-bit) instruction. However in practice this is
not a problem as the total size of the primitive procedures
is usually quite small, typically less than 800 bytes on the
7/32. The IMP77 code generator mentioned above now needs no
long references at all, saving 724 bytes of code, out of
about 40Kbytes. The compression of the code so achieved can
be enhanced slightly by bringing the destinations of more
jumps into the short-jump range, giving an extra saving of
20 bytes in the case above. In addition, now that a register
(CODE BASE) is pointing to the first primitive procedure the
list of procedures required can be reordered to place the
most frequently referenced one first and thereby reduce
references to it to 16-bit instructions
(BALR LINK,CODEBASE).
When compiling with checks on, by far the most commonly
referenced primitive procedure is the routine which checks
for the use of an unassigned variable (over 2000 references
to it in the code generator), and this trivial optimisation
results in a saving of more than 4000 bytes.
4.6 Language-specified and compiler-generated objects
During compilation various objects will be manipulated in
order to generate code. Some of these objects have a direct
representation in the source program and are referred to as
"language-specified" objects, whereas others are created by
the compilation process itself and are referred to as
"compiler-generated" objects. The fact that the
compiler-generated objects will be (or can be constrained to
be) used in a stereotyped and well-behaved fashion can be
used to great advantage to give simple means for optimising
parts of the program.
4.6.1 Internal labels
Using most intermediate codes the following program parts
would translate into effectively identical sequences:
-----------------------------------------------------
| ->LAB if X = 0 | if X # 0 start |
| Y = 3 | Y = 3 |
| LAB: | finish |
-----------------------------------------------------
At first glance this is as it should be, for the two
program fragments are semantically identical and could
therefore be implemented by the same object code, for
example on the PERKIN-ELMER 3200:
-----------------
| L 1,X | pick up X and set the condition code
| BZ $1 | branch equal (to zero)
| LIS 0,3 | pick up 3
| ST 0,Y | store it in Y
| $1: | define label $1
-----------------
However, if it is known that the label $1 will only ever be
used once, the code-generator may remember that the current
value of the variable X will still be in register 1
following the label, and thus remove the need for it to be
loaded again if it is required before register 1 gets
altered. In the case of user-defined labels no statement
can be made about the number of uses of each label without a
complete analysis of the parts of the program where the
label is in scope.
This suggests that I-code should maintain a clear
distinction between user-defined and compiler-generated
labels. Also, by making the rule that compiler-generated
labels may only be used once, the internal representations
of labels may be reused by the code-generator, removing the
necessity for large tables of label definitions in this
phase of compilation.
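Because each compiler-generated label is defined exactly once and
referenced exactly once, the code-generator's record for it can be
discarded, and its number reused, as soon as both events have been
seen. A minimal sketch of such bookkeeping (the record layout is
invented for the illustration):

    labels = {}                        # internal label number -> record
    spare = []                         # numbers free for reuse
    next_number = 1

    def new_internal_label():
        global next_number
        n = spare.pop() if spare else next_number
        if n == next_number:
            next_number += 1
        labels[n] = {'defined': False, 'used': False, 'address': None}
        return n

    def note(n, event, address=None):  # event is 'defined' or 'used'
        record = labels[n]
        record[event] = True
        if address is not None:
            record['address'] = address
        if record['defined'] and record['used']:
            del labels[n]              # both seen: the number may be reused
            spare.append(n)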
This now leaves the question of how to represent
conditional jumps in the intermediate code. The first
observation is that user-specified jumps need never be
conditional as they can always be surrounded by appropriate
compiler-generated conditional jumps. This can be used to
restrict the processing of conditions and tests to the
compiler-generated jumps. The second observation is that in
IMP77 conditionals are always associated with the comparison
of two values or the testing of an implied boolean variable
(predicates and string resolution).
There are currently three main ways in which processors
handle this:
1 "compare" instructions are used to set flags or
condition-codes which represent the relationship
between two values (one of which is frequently an
implied value of zero). These condition-codes are
later used to control the execution of conditional
branch instructions. This method is used in the
PDP11 (CMP, BNE, etc.).
2 Instructions are provided which compare two values
as above but instead of setting condition-codes
they skip one or more subsequent instructions
depending on a specified relationship. By skipping
unconditional branches in this way conditional
branch sequences may be generated. This method is
used in the PDP10: SKIPE etc.
3 Instructions are provided which compare two values
and branch to a specified label if a given
relationship holds. This method is used in the
PDP10: JUMPNE etc.
P-code uses compare instructions to set the boolean value
TRUE or FALSE on the stack and then uses this value either
as an operand in an expression or to condition a branch (a
variant of technique 1 above).
Z-code tests the value in a register against zero and
branches accordingly (technique 3 above).
These three techniques have fairly obvious possible
representations in I-code:
if X = Y start
1) PUSH X
PUSH Y
COMP {set condition code}
BNE 1 {branch not equal}
2) PUSH X
PUSH Y
SKIPE {compare and skip if equal}
GOTO 1
3) PUSH X
PUSH Y
JUMP # 1 {compare and branch if not equal}
All three of these representations have been tried in
different versions of I-code.
Technique 2) was rejected as it proved cumbersome to
implement effectively, especially on machines which did not
use skips; either the code-generator had to "look ahead" to
be able to locate the destination of the skip (which is
dependent on the instruction being skipped) or to check
before each instruction whether or not a skip had been
processed earlier and its destination had not yet been
resolved.
Technique 1) was perfect for machines with condition-codes
but required look-ahead over subsequent jumps on machines
which used skips.
Both 1) and 2) had the additional problem that to generate
conditional branches two separate I-code instructions had to
be given. In the case of 1) condition-codes are usually
altered by many instructions not directly involved in
comparison and hence the compare and its associated branch
must be made adjacent. With 2) there is the possibility of
generating meaningless constructions such as skipping a
line-number definition instruction. These difficulties add
complexity to the definition of the intermediate code and
require extra checks in the code generator.
Thus the third form was chosen as the most convenient,
even though all three forms can be suitably defined to be
totally equivalent. In particular the third technique
provides all the relevant information to the code-generator
in one instruction, and has proved to be simple and
effective as a basis for generating code for both
condition-code and skip sequences.
Using these ideas the following is the expansion of the
statements given at the start of section 4.6.1.
---------------- -----------------
| PUSH X | PUSH X |
| PUSHI 0 | PUSHI 0 |
| COMP # 1 | COMP = 1 |
| JUMP LAB | |
| LOCATE 1 | |
| PUSH Y | PUSH Y |
| PUSHI 3 | PUSHI 3 |
| ASSVAL | ASSVAL |
| LABEL LAB | LOCATE 1 |
---------------- -----------------
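The convenience of the combined comparison-and-branch form can be seen
by sketching how a single such item might be lowered for the two common
styles of machine; the mnemonics below are only indicative of the PDP11
and PDP10 styles.

    CC_BRANCH = {'=': 'BEQ', '#': 'BNE', '<': 'BLT',
                 '<=': 'BLE', '>': 'BGT', '>=': 'BGE'}
    SKIP = {'=': 'CAMN', '#': 'CAME', '<': 'CAMGE',
            '<=': 'CAMG', '>': 'CAMLE', '>=': 'CAML'}

    def lower_comp(op, label, a, b, style):
        # one I-code item: compare a with b and jump to label if "a op b"
        if style == 'condition-code':             # PDP11 style
            return [f"CMP {a},{b}", f"{CC_BRANCH[op]} {label}"]
        else:                                     # skip style, PDP10
            # skip the jump when the relation does not hold
            return [f"{SKIP[op]} {a},{b}", f"JRST {label}"]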
4.6.2 Temporary objects
During the compilation of high-level languages it often
becomes necessary to create temporary objects which are not
present in the source program. The most common need for
temporaries is in the evaluation of expressions. Regardless
of the number of accumulators or registers available it is
always possible to construct an expression which will
require one more. To obtain this register, a register
currently in use must be selected and the value currently in
it must be saved in a temporary location. One apparent
exception to this is a machine in which expressions are
evaluated using a stack (e.g. ICL 2900) but in this case
the operands are always in temporaries.
Temporary variables may also be required to implement
certain high-level constructions, such as the IMP for
statement:
-----------------------
| for V = A, B, C cycle |
-----------------------
which is defined so that the initial values of B and C, and
the initial address of the control variable, V, are to be
used to control the loop regardless of any assignments to V,
B and C. While it is possible for a machine-independent
optimiser to discover whether these variables are modified
in the loop or not, in the simple case where little
optimisation is required the code generator must use
temporaries.
In the case of expression evaluation, however, the machine
independent phase cannot know how many temporaries will be
required. Even giving the first phase knowledge of the
number of registers available is not adequate for several
reasons. Firstly, the use of registers is commonly tied to
the operations being performed, as in the case of integer
multiply on several machines which requires a pair of
registers. For a machine-independent first phase to be able
to cope with this sort of limitation would require great
flexibility of parameterisation.
Secondly, the first phase would have to be given details of
the problems encountered in statements such as:
----------------------------
| LEFT = REM(A,5) + REM(B,7) |
----------------------------
On a PDP11 equipped with the EIS option a divide instruction
is available which provides both the quotient and the
remainder. Hence the statement could be compiled into:
-----------------
| MOV A,R1 |
| SXT R0 | propagate the sign of A
| DIV R0,#5 | remainder to R1
| MOV B,R3 |
| SXT R2 |
| DIV R2,#7 | remainder to R3
| ADD R3,R1 | add the remainders
| MOV R1,LEFT |
-----------------
In this case no temporary store locations are required.
However, if the EIS option is not present no DIV instruction
is present and so a subroutine must be used instead. The
code becomes:
-----------------
| MOV A,R1 |
| MOV #5,R2 |
| JSR PC,DIV | result back in R1
| MOV R1,T1 | preserve remainder
| MOV B,R1 |
| MOV #7,R2 |
| JSR PC,DIV | result in R1
| ADD T1,R1 |
| MOV R1,LEFT |
-----------------
As the subroutine REM uses R1 (for one of its arguments and
to return its result) the result of the first call on REM
must be saved in a temporary, T1. Of course, the function
REM could be written so as to preserve the value in, say, R2
and this could be used instead of T1, but this would
increase the cost of REM when it is likely that the value in
R2 will not be of use as most expressions are trivial
[Knuth, 1971].
Unless the machine-independent phase is given intimate
knowledge of the target machine (something of a
contradiction) it cannot know how many temporaries to use
nor when to use them.
The solution adopted by most intermediate codes is to base
the code around a stack, thus providing an unlimited number
of temporaries which are handled automatically. While this
in itself does not hinder the compilation for a machine
without a hardware stack, as the code-generator can always
simulate the stack internally, its presence invariably
results in other parts of the code using it, for example to
pass parameters to procedures where the receiving procedure
contains built-in knowledge of the layout of the stack.
As a stack does not require the explicit mention of
temporaries it has been adopted by I-code, but purely as a
descriptive mechanism. Because I-code does not specify the
computation but the compilation process needed to produce a
program which will perform the computation, this internal
stack need have no existence when the final program
executes.
The implementors of SIMPL-T describe an intermediate code
with some properties similar to I-code, but based on
"quadruples" of operators and operands rather than an
internal stack [Basili, 1975]. The stack approach was
rejected by them because "quads allow more flexibility in
the design of the code generator since, for example, no
stack is required". The exact meaning of this is not clear
but it suggests the misconception that a stack-based
intermediate code forces a stack-based object code
representation. Regardless of the exact structure of the
code generator or the input it takes, some form of internal
stack is invariably required for operations such as
protecting intermediate values in registers which are needed
for other purposes, and it seems reasonable to make this
stack more explicit if so doing will simplify the
intermediate code and its processing.
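A sketch of the compile-time stack used purely as a descriptive
mechanism (all names are invented, and at least two registers are
assumed): pushing a value merely records a descriptor of where it is,
and a temporary comes into existence only when an operation finds every
register occupied.

    class Stack:
        # assumes at least two registers are available
        def __init__(self, registers):
            self.items = []                  # descriptors, not run-time values
            self.free = list(registers)
            self.temps = 0
            self.code = []

        def push(self, place):               # e.g. push(('store', 'X'))
            self.items.append(place)

        def to_register(self, place):
            if place[0] == 'reg':
                return place[1]
            if not self.free:                # no register left: spill one
                i = next(k for k, p in enumerate(self.items) if p[0] == 'reg')
                reg = self.items[i][1]
                self.temps += 1              # a temporary now comes into being
                self.code.append(f"store R{reg} in T{self.temps}")
                self.items[i] = ('temp', self.temps)
                self.free.append(reg)
            reg = self.free.pop()
            self.code.append(f"load {place} into R{reg}")
            return reg

        def add(self):                       # combine the top two descriptors
            b = self.items.pop()
            a = self.items.pop()
            ra = self.to_register(a)
            self.code.append(f"add {b} to R{ra}")
            if b[0] == 'reg':
                self.free.append(b[1])       # b's register becomes free again
            self.items.append(('reg', ra))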
4.7 Object file generation
Once a program has been compiled into sequences of
machine code instructions there still remains the task of
producing an object file in a form suitable for processing
by the operating system (if any) under which the program is
to be executed. This task was separated from the main part
of code generation (the second phase) and has become the
third phase of compilation for the following reasons:
i The particular format required in the final object
file will vary on any particular machine depending
on the operating system in use. As this is to a
large extent independent of the code sequences
needed to implement the program it was thought
sensible to keep the processes separate.
ii Even following the generation of the code by the
second phase there remain many opportunities for
further optimisation, both global and structural,
which require information only available once the
complete program has been compiled. Rather than
build global analysis into the second phase,
these optimisations were left to a third phase.
The third phase takes as its input two data streams
generated by the second phase. These streams are:
i The object stream, a sequence of items of the form
<type, value>, defining the code sequences
required in the object file.
ii The directive stream, a sequence of items defining
the logical structure of the object stream, that is
a specification of label definitions and label
references, and details of various code groupings
(blocks, procedures etc.).
The third phase starts by taking in the directive stream and
constructing a linkage map describing the whole program.
This linkage map is processed and then used to control the
generation of the final object file from the object stream.
The operations performed using the map are:
4.7.1 Reordering
As discussed previously, there are several gains to be
made by having the ability to output instructions in an
order different from that implied by the
linear structure of the source program. This reordering is
performed on the linkage map in a manner controlled by items
in the directive stream. In the most simple case of
exbedding procedures (section 4.3.1) this only entails
allocating code addresses to the items in the map each time
an "end-of-block" control item is input, resulting in the
procedures being laid out in "end" order.
To facilitate evaluating references to the reordered areas,
all references in the object stream are made relative to the
start of the appropriate area.
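By way of illustration only, this end-order allocation may be
sketched as follows (the sketch is in Python rather than IMP77,
and the names LinkageMap, end_of_block and entry.size are
invented for the illustration):

    class LinkageMap:
        # Entries for a block are accumulated as the directive
        # stream is read; addresses are allocated only when the
        # block's "end-of-block" item arrives, so that nested
        # procedures receive addresses before the procedures
        # which enclose them, giving the "end" order above.
        def __init__(self):
            self.next_address = 0
            self.order = []          # (entry, address) in final order

        def end_of_block(self, entries):
            for entry in entries:    # entries of the completed block
                self.order.append((entry, self.next_address))
                self.next_address += entry.size

Since references in the object stream are made relative to the
start of their area, no entry needs to be revisited when a
later block is placed.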
As this process does not cause the physical moving of the
various areas there is an implicit assumption that either
the subsequent processing of the object stream can do the
reordering (for example by writing its output to specific
sections of a direct-access file), or that the object file
format can instruct the loader or linker to do the
shuffling.
With the linkage map available it becomes possible to
make a preliminary pass over the object stream performing
structural modifications which require knowledge of the
generated code and which alter its size and general
appearance. These modifications may be made by passing the
object stream through a buffer which is scanned and modified
under the control of the linkage table. In this way merging
common code sequences and reordering the arms of conditional
sequences may be achieved quite simply.
4.7.2 Jumps and Branches
Following the construction of the linkage map structural
optimisations may be performed on jumps. The three
optimisations which are currently applied are:
i Use of the smallest instruction
A common feature of machines is that they
provide a variety of sizes of jump instruction,
depending on the reason for the jump (conditional
or unconditional) and the distance to be jumped.
e.g. PDP11
BEQ (2 byte instruction) conditional jump up to
256 bytes in either direction.
JMP (4 byte instruction) unconditional jump to
anywhere.
Perkin-Elmer 3200
BFFS
BFBS (2 byte instruction) conditional jump
forward (F) or backward (B) up to 32 bytes
away.
BFC (4 byte instruction) conditional jump to
within 16Kbytes of the current instruction.
BFC (6 byte variant) conditional jump to anywhere.
In typical programs the frequency of occurrence
of such jumps is:
                 PDP11    PE3200
         -------------------------
         2 byte |  88%      28%  |
         4 byte |   8%      71%  |
         6 byte |   2%      <1%  |
         -------------------------
It has been suggested [Brown, 1977] that the
problem of deciding which form of jump to use can
be eased on certain machines by specifying a
"distance" parameter with the intermediate code,
e.g. "GOTO LAB,80" informing the code generator
that the label LAB is 80 instructions ahead.
It is difficult to think of any case in which this
could be of use, as it requires the code
generator to be able to predict the amount of
target machine-code which will be generated for
each intermediate code instruction.
The solution adopted by the IMP compilers has
been for the code generator to assume that all
jumps are the minimum size, and to let the third
phase stretch them where necessary.
The Perkin-Elmer CAL assembler [Interdata, 1974]
makes the opposite assumption, namely that jumps
are long until proven short. This was rejected as
the size of one jump is often dependent on another,
so that one of them will be short if and only if
both of them are short. By assuming them long
either they will never be found to be short, or the
process will have to examine all the jumps
repeatedly trying each jump in turn to see if it
can be "squeezed". Commonly enabling the "SQUEZ"
option in the CAL assembler can double or treble
the time to assemble programs.
With the assumption that all jumps start short and
then grow, all truly short jumps will be found with
no possibility of infinite loops, as the process
must terminate, in the worst case when all the
jumps have been made long.
Several methods for achieving this optimisation
have been described [Szymanski, 1978; Williams,
1978].
The technique used by the third phase of the IMP77
compilers for stretching jumps is as follows.
Once the linkage map has been constructed and
addresses provisionally allocated, all labels and
references to them are grouped according to the
block in which they occurred. This is to take
advantage of the fact that most references will be
local. A procedure STRETCH is now defined which
repeatedly attempts to lengthen each reference
within a particular group. If a reference is found
which must be stretched, the entry in the linkage
map is updated and all subsequent entries are
suitably modified to take account of the increased
size of the code. The process is repeated until no
alterations have been made.
STRETCH is first called once for each group of
references in the program. This "local stretch"
commonly resolves up to 80% of the references. A
final call on STRETCH is then made with all the
references lumped together as one group in order to
resolve references between blocks, and any local
references which, although processed by the local
stretch, have become invalidated by changes made by
the "global stretch".
The use of a local and a global stretch has a
considerable effect on the performance of the
compiler: if the calls on "local stretch" are taken
out "global stretch" has to do all the work in
ignorance of the block-structure of the labels.
This involves repeated searching of the complete
label and reference lists in order that changes in
the position of these items may be recorded. On
the Interdata 7/32 this increases the stretching
time for 1968 branches from 2.3 seconds out of a
total compilation time of 146 seconds up to 35
seconds!
The time taken to perform the stretching using both
local and global stretch is on average just over 1%
of the total compilation time, excluding the time
for input and output.
Wulf et al. describe an optimisation on the PDP11
which attempts to shorten otherwise long
conditional jumps by making them jump to suitable
jumps to the same destination as this is smaller
and faster than the six byte instructions which
would be generated by default [Wulf, 1975]. This
was tried but eventually removed from the PDP11
compiler as finding suitable jumps was a tedious
task and, of the average 2% of jumps which were long
when compiling many programs, only one case was found
where the optimisation could be applied. That case
was in a program specially constructed to test the
optimisation.
At the same time that jumps and labels are being
processed, certain operations which depend on the
flow of control may be inserted into the code.
The GEC 4080 provides a good example of this
problem which can be handled elegantly by the third
phase.
The machine provides arithmetic instructions which
take either fixed point or floating point operands
depending on the state of a processor status bit.
This bit must be altered by the instructions SIM
(Set Integer Mode) and SFM (Set Floating Mode).
During code generation when a label is encountered
the state of the status bit will not in general be
known, and so a suitable mode switching instruction
will need to be planted; frequently this
instruction will be redundant. Given the presence
of the third phase, the second phase merely needs
to mark jumps with the current state of the bit,
and to mark labels with the required state (and the
previous state of the bit if control can "fall
through" past the label. During the process of
expanding jumps these mark bits can be checked. If
all references to a label have the same mode, no
action needs to be taken, but if the bits differ
the appropriate instruction must be added. As an
extra improvement, if only one jump to a label is
from the wrong mode, the mode switching instruction
can be planted before that jump rather than after
its destination label, so shortening the execution
paths when no change is required.
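The rule for planting the mode-switching instructions may be
sketched as follows (Python notation; Label, Jump and the mode
names are invented for the illustration and carry exactly the
mark bits described above):

    from collections import namedtuple

    # required: mode needed after the label; fall_through: mode on
    # the fall-through path, or None if control cannot reach the
    # label by falling through.  Each jump records the mode in
    # force at the point where it is taken.
    Label = namedtuple("Label", "required fall_through")
    Jump  = namedtuple("Jump",  "name mode")

    def mode_switches(label, jumps):
        switch = "SIM" if label.required == "integer" else "SFM"
        wrong = [j for j in jumps if j.mode != label.required]
        falls_wrong = (label.fall_through is not None
                       and label.fall_through != label.required)
        if not wrong and not falls_wrong:
            return []                    # all paths agree: plant nothing
        if len(wrong) == 1 and not falls_wrong:
            return [(wrong[0], switch)]  # switch before the one offending jump
        return [(label, switch)]         # otherwise switch after the label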
ii Conflating jumps to jumps.
Nested conditional structures in high-level
languages often generate jumps which take control
directly to another jump. If the second jump can
be shown always to be taken whenever the first is,
the first can be redefined as jumping directly to
the destination of the second.
e.g. -----------------------------------
| while N > 0 cycle |
| N = N-1 |
| if N > 5 then TEST1 else TEST2 |
| repeat |
-----------------------------------
In this program following the call on TEST1 the
else causes a jump to be taken to the next
statement (repeat). This statement is simply a
jump back to the previous cycle.
Hence the following code can be generated (3220):
-------------------
| $1: L 1,N |
| BLE $3 |
| SIS 1,1 |
| ST 1,N |
| CHI 1,5 |
| BLE $2 |
| BAL 15,TEST1 |
| B $1 |
| $2: BAL 15,TEST2 |
| B $1 |
-------------------
The danger with this optimisation is that an
otherwise short jump can be expanded to a long jump
as the following program demonstrates:
----------------------
| if X = 1 start |
| if Y = 1 start |
| {A} |
| else |
| {B} |
| finish |
| else |
| {C} |
| finish |
----------------------
The else following the sequence {A} causes a jump
to the point of the next else, which is itself a jump
past {C} to the second finish. In that form the first
jump only has to skip {B} and is likely to be a short
jump. If it is made to jump directly to the second
finish it has to cover {B} and {C}, so reducing the
chances of its being short.
Equally, the position can be reversed resulting in
the optimised jump being short when the original
was long. If this problem is considered serious
the third phase can check the sort of jump which
would be generated and act accordingly.
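The conflation itself is a simple matter of following chains
of unconditional jumps through the linkage map; the following
Python fragment is an illustrative sketch (the map layout is
an assumption, and a guard is needed against circular chains):

    def final_destination(label, uncond_jump_at):
        # uncond_jump_at maps a label to the destination of the
        # unconditional jump planted at that label, if there is one.
        seen = set()
        while label in uncond_jump_at and label not in seen:
            seen.add(label)
            label = uncond_jump_at[label]
        return label

    def conflate(references, uncond_jump_at):
        for ref in references:
            ref.target = final_destination(ref.target, uncond_jump_at)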
iii Removal of jumps round jumps.
Statements such as:
------------------
| ->LABEL if X = Y |
------------------
are common, either in the explicit form as given
above or in some higher-level representation such
as:
---------------
| exit if X = Y |
---------------
The simple code sequence generated for this would
be similar to (3220):
--------------
| L 1,X | pick up X
| C 1,Y | compare with Y
| BNE $1 | branch not equal
| B LABEL | jump to LABEL
| $1: |
--------------
by combining the two branches the code can be
reduced to:
--------------
| L 1,X |
| C 1,Y |
| BE LABEL |
--------------
While it is possible for the code generator to do
this immediately it was decided to leave the
optimisation to the third phase for four reasons:
1 The third phase can perform this optimisation
simply, almost as a side-effect of
constructing the linkage map.
2 There are several cases where the optimisation
can be extended in ways which would be awkward
for the second phase to deal with. In
particular, it would have either to look ahead
or to be able to modify code sequences already
generated. With a third phase, however, the
optimisation reduces to a straightforward
inspection of the linkage map. For example:
------------------
| exit if X = Y |
| repeat |
------------------
in which case the optimisation may be applied
twice to reduce the code to two instructions.
3 Leaving the optimisation to a later phase
simplifies the second phase which is the most
complicated part of the compiler.
4 On several machines if the destination of the
jump is too far away the original "jump round
a jump" may be the only form available (e.g.
PDP11). The distance to be jumped will only
be known exactly when all labels have been
processed.
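The core of the transformation, as it might appear in the
third phase, is sketched below in Python (the instruction
list layout and the table of inverse conditions are
illustrative assumptions):

    INVERSE = {"BE": "BNE", "BNE": "BE", "BL": "BGE",
               "BGE": "BL", "BG": "BLE", "BLE": "BG"}

    def remove_jump_round_jump(items):
        # items is a mixture of ('label', name) markers and
        # (opcode, target) instructions.
        out, i = [], 0
        while i < len(items):
            if (i + 2 < len(items)
                    and items[i][0] in INVERSE      # conditional branch...
                    and items[i + 1][0] == "B"      # ...over an unconditional jump
                    and items[i + 2] == ("label", items[i][1])):  # ...to the next line
                out.append((INVERSE[items[i][0]], items[i + 1][1]))
                out.append(items[i + 2])            # keep the label; it may be referenced
                i += 3
            else:
                out.append(items[i])
                i += 1
        return out

Applied to the sequence above, [("BNE", "$1"), ("B", "LABEL"),
("label", "$1")] becomes [("BE", "LABEL"), ("label", "$1")].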
4.7.5 In-line constants
When compiling for machines such as the Data General NOVA
which have a limited direct addressing range and no
full-length immediate operands, it is useful if constants
can be planted in the code sequence and addressed as
program-counter-relative operands. The simplest technique
for doing this is for the code generator to maintain a list
of required constants and to dump them in-line at a suitable
opportunity before the limit of addressability has been
exceeded. Such constants will need to be protected from
being executed and so will need to have a jump round them or
will have to be planted in a "hole" in the code, that is
between an unconditional jump and the next label. As holes
occur frequently in high-level languages (for example
following every else or repeat) and do not require extra
code to be planted round the constants, they are the
preferred position for the constants. In order to minimise
the number of constants planted it is necessary to delay
dumping them until the last possible moment, placing them
as near as possible to the forward limit of addressability
of the first outstanding reference. This increases the
chance of a subsequent reference to the constant being able
to address the previously planted copy.
This poses problems if the second phase is to handle the
constants as it cannot know which is the optimum position
for the constants in advance of producing the code
(especially if the code is to be reordered).
A convenient solution is to utilise the linkage table in the
third phase and include in it references to constants and
the locations of holes and "forced" holes, that is places
where an extra jump is required.
Following the initial resolution of jumps (section 4.7.2) the list
of constants can be examined and holes allocated. The
labels are processed again to take account of the extra code
and any alignment limitations. During the processing of the
object stream the constants are infiltrated into the object
file.
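A much simplified sketch of the delaying mechanism (Python;
the names and the addressability limit are assumptions, and
the re-use of previously planted copies is omitted) is:

    LIMIT = 255             # forward reach of a PC-relative operand (illustrative)

    class ConstantPool:
        def __init__(self):
            self.pending = []   # (address of first reference, value)

        def reference(self, address, value):
            # Record a reference; the constant itself is planted later.
            if not any(v == value for _, v in self.pending):
                self.pending.append((address, value))

        def must_dump_before(self, address):
            # True if emitting code beyond 'address' would put some
            # pending constant out of reach of its first reference,
            # in which case a hole must be forced.
            return any(address - first >= LIMIT for first, _ in self.pending)

        def dump(self):
            # Called at a hole (after an unconditional jump) or at a
            # forced hole; returns the constants to be planted there.
            values, self.pending = [v for _, v in self.pending], []
            return values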
4.8 Summary
The major decisions about the design of the compiler
were:
a) All information present in the source program
should be easily visible in the intermediate code.
b) The intermediate code should be as
machine-independent as the source language.
c) The code generator should be split into two
distinct phases joined by a stream of code
fragments and a linkage map defining the
connections between them.
d) The intermediate code should handle objects in
terms of language-dependent descriptors which are
converted into appropriate machine-dependent
descriptors by the second phase.
e) The intermediate code should distinguish clearly
between objects explicitly specified in the source
program and those implied by the translation.
f) All decisions about code and data addressing must
be left to the code-generator.
5 Review of the overall structure
5.1 Division of function
The division of the machine-dependent phase into two
parts was motivated by three main considerations:
i to localise the changes necessary to produce
different object-file formats,
ii to permit the reordering of sections of the code,
iii to enable the production of short jumps whenever
possible.
In addition it turns out that on all of the machines for
which this technique has currently been applied points (ii)
and (iii) can be handled by almost identical pieces of code,
making this phase of compilation machine-independent to a
large extent and therefore easing the task of creating new
compilers.
Against this must be set the overheads incurred by
separating the compilation into two parts which have to
communicate. The interface between phases two and three
comprises the object file and the directive file, and the
third phase needs to process the whole of the directive file
before starting to look at the object file.
The ways in which these 'files' will be implemented, and
consequently the cost of the communication, will in general
vary from system to system. If large virtual memories are
available the data may be held in memory as mapped files or
arrays and accessed much more efficiently than on simpler
systems using the conventional approach of 'true' files with
their more cumbersome transfer operations.
5.2 Testing and development
Although the initial reason for choosing a multi-phase
approach to compiling was that of simplifying the generation
of new compilers, an extra advantage arose in that the task
of checking the compilers so produced and diagnosing faults
in them was very much simplified. This was because of two
features of the technique.
Firstly, the programs corresponding to the phases were of
manageable size, varying from about one thousand statements
up to three thousand statements.
Secondly, the phases communicated with each other using
well-defined interfaces which could be monitored to narrow
down errors to a particular phase and even to specific parts
of that phase.
In addition, as the structure of the intermediate code
inevitably suggests the general techniques to apply in code
generation, many of the complete compilers on different
machines had great similarities; usually only the lowest
levels of code production and machine-specific optimisation
were appreciably different.
This gave rise to three convenient properties with regard to
testing and development:
i An error in one compiler will frequently give
notice of similar faults in others. Clearly, any
faults in the common first phase will be present in
all the compilers and only one correction will be
required.
ii An improvement in the performance of one compiler,
or the code it generates, can suggest similar
improvements in others.
iii The third effect, on reflection, seems obvious yet
was noted with some surprise. The systems on which
most of the investigation was done are run with
very different operating systems and used by
different types of user. These two factors
together caused a great spread in the demands
placed upon the compiler, resulting in more parts
of the compiler being thoroughly tested than would
happen when running on one particular system, where
users tend to be more stereotyped. Questions of
"proper practice" aside, it is a fact of life that
all software gets a better testing in the field
than at the hands of its creator.
5.3 Diagnostics
As mentioned previously optimisation is not just a
process of improving the storage requirements and speed of a
program but also involves fitting a program into the overall
framework of the run-time environment. In many applications
the provision of extensive run-time checks and post-mortem
traces can be of great importance. The ability to generate
such diagnostic code has certain implications for the
features in the intermediate code.
5.3.1 Line numbers
When producing information about the state of a
computation, whether it be an error report following a
run-time fault or an execution trace [Satterthwaite, 1972],
the data must be presented in a form which is meaningful to
the user in terms of the source program. The
commonly-provided dump of the machine state, registers, code
addresses etc., is a complete failure in this respect, as
the correspondence between this and the program state
depends on the workings of the compiler and other factors of
which the user should not need to be aware.
The simplest way of specifying the point of interest in a
program is to give its line number. There are two common
techniques for providing line number information at
run-time, the choice of which depends on the uses to which
the compiler is to be put.
The first is to plant instructions which dynamically update
a variable with the current line number whenever it changes.
This has the significant advantages that it is extremely
cheap to implement and the line number is always immediately
available. Its obvious disadvantages are that it increases
the execution time for the program, and more significantly
it increases the size of the program, typically by about 6K
bytes on the Interdata 7/32 for a 1000 line program,
approximately a 50% increase.
The second technique is to build a table giving the
correspondence between line numbers and the addresses of the
associated code sequences. While this imposes a greater
burden on the compiler and takes more time to extract the
line number it has the advantage that it does not increase
the code size of the program, nor does it alter its
execution speed. Indeed it may even be possible to keep the
table out of main memory until it is required.
The choice of technique will have implications on the
compiled code. If the line number table approach is used
error procedures must have available the address of the
point of the error. The effects of this can be seen in the
following example of the sort of code generated for
unassigned variable checking on the 7/32 using both methods:
-------------
| 17 X = Y |
-------------
--------------     ---------------
| LHI 0,17   |     |             |   update line no
| ST  0,LINE |     |             |
| L   1,Y    |     | L   1,Y     |
| C   1,UV   |     | C   1,UV    |   check value
| BE  ERROR  |     |             |   give the error
|            |     | BAL 8,TU    |   test for error
| ST  1,X    |     | ST  1,X     |
|            |     | .. ..       |
|            |     | .. ..       |
|            |     | TU: BNER 8  |   return if OK
|            |     |     B ERROR |   give the error
--------------     ---------------
As the generated code depends on the method in use it cannot
be specified in the intermediate code and so the latter must
simply indicate the points in the program at which the line
number changes.
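For the table method, the table need only record the code
address at which each new line starts; the reporting routine
then searches for the greatest entry not exceeding the fault
address. A sketch (Python; the class and its names are
invented for the illustration):

    import bisect

    class LineTable:
        def __init__(self):
            self.addresses = []   # code addresses, in increasing order
            self.lines = []       # corresponding source line numbers

        def note(self, address, line):
            # called by the compiler whenever the line number changes
            self.addresses.append(address)
            self.lines.append(line)

        def line_for(self, fault_address):
            # the line whose code contains the given address
            i = bisect.bisect_right(self.addresses, fault_address) - 1
            return self.lines[i] if i >= 0 else None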
5.3.2 Diagnostic tables
In the event of program failure, or when explicitly
requested by the user, a trace of the current state of a
program, including the values in active variables and the
execution history, can be of immense value. For such a
trace to be provided the intermediate code must contain the
identifiers used in the source program for all the
variables, and a source-dependent description of those
variables. This latter is needed so that the machine
representations may be interpreted in the correct way when
giving the values in variables. In I-code all this
information is presented in the definitions of descriptors
and may be used or discarded at will.
5.3.3 Run-time checks
Most languages define circumstances under which a program
is to be considered in error and its execution terminated.
These errors include creating a value too large to be
represented (overflow), division by zero, use of an array
index which is outwith the declared bounds, and so on.
There is a natural division of these errors into those which
are detected automatically by the machine and those which
must be detected by explicit checks in the program.
Commonly, machines catch division by zero automatically but
do not provide such a feature for checking array subscripts.
The "hardware-detected" errors may be furthur divided into
those which on detection cause the normal flow of control to
be interrupted, and those which simply make the knowledge of
the occurrence of the error available to the program, for
example by setting a condition-code bit. For the purposes
of this discussion the second form of hardware-detected
error may be considered an error which is not detected
automatically as it still requires explicit instructions to
test for the error and to transfer control accordingly.
Clearly the more errors that fall into the automatic
category the better, as they do not cause the user's program
to grow with sequences of instructions which, in a correct
program, will always be testing for conditions which never
arise.
These differences complicate the design of intermediate
codes as the classification differs from machine to machine:
on the VAX all forms of overflow can be made to generate
automatic interrupts, but the PDP11 only sets a
condition-code bit on some overflows.
There are two basic ways of handling this in the
intermediate code: firstly the code can contain explicit
requests for the checks to be performed, and secondly the
code can be designed in such a way as to give the
code-generator enough information to be able to decide where
checks are necessary.
Two specific examples can indicate which of these ways
should be adopted.
Testing for arithmetic overflow is currently handled by
machines in three main ways:
1. An interrupt is generated whenever overflow occurs.
This is by far the best method as it requires no
overheads in the checked code.
2. A bit is set on overflow and is only cleared when
it is tested. This requires explicit checks in the
code but several tests may be conflated into a
single test at an appropriate point, for example
before the final result is stored.
3. A bit is set on overflow, but is cleared by the
next arithmetic operation. This again requires
explicit checking code but the tests must be
inserted after every operation.
For the intermediate code to indicate where overflow
testing is to be performed it would have to choose the
worst case from the three above, namely case 3. This
would result in a test being requested after every
arithmetic instruction, a test which may just as well be
included in the definition of the instructions
themselves.
The other area of low-level testing is in implied type
conversions such as storing from a 32-bit integer into a
16-bit integer. The VAX provides an instruction which
combines the test for truncation with the store (CVTLW).
The 7/32 has an instruction (CVHR) which can test the
value before assignment, and the 4/75 can most
efficiently test following the assignment (CH).
If the request for the check is a separate intermediate
code item, the 7/32 case is simple but the other
machines will require much more work to be able to
generate the efficient check. The problem can be
simplified by introducing new assignment instructions
which also perform the test, but this adds many new
instructions to the code as one instruction will be
required for every valid combination of types and every
sort of assignment.
The high-level checks such as array bound checking are
usually so complicated that the most efficient
implementations depend greatly on the particular
hardware, so much so that it would be foolish to attempt
to express them in the intermediate code.
The simplest solution is to ensure that the intermediate
code provides enough information to let the code
generator decide where and what checks are necessary.
The inclusion of checks against the use of unassigned
variables provides a good example of the power of
leaving the checking to the code-generator. In a
simple-minded approach the code-generator tests every
suitable value loaded from store. A minor improvement
to this is to mark the descriptor for every local
variable in a block when it is first assigned,
inhibiting the marking after the first jump.
Subsequently, marked objects need not be checked.
A much better improvement may be obtained by making a
trivial extension to the register remembering mechanism.
If an object is 'known' it must have been used
previously, and hence it will have been checked if
necessary. Even after the register which held the value
of the object has been altered, and hence the
association between the register and the object lost, if
the compiler remembers that the value was known it can
suppress any unassigned checks on future references.
At this point a useful property of IMP77 may be used to
great effect: once a variable has been assigned it
cannot become unassigned. This is not true in many
languages, as for example, in ALGOL60 the control
variable of a for loop is undefined (unassigned) at the
end of the loop. This means that in IMP77 the 'was
known' property of variables may be preserved across
procedure calls, even though all the register content
information must be forgotten.
This technique when applied on the 7/32 compiler results
in a reduction of 33% in the code required for checking.
While it is possible for the unassigned checks to be
placed in the intermediate code and for the first phase
to remove redundant checks, this suppression would
require a duplication of the remembering logic which
must, in any case, reside in the machine-dependent
phase.
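The suppression rule may be sketched as follows (Python
notation; the names are invented, and the sketch assumes the
IMP77 property described above that an assigned variable can
never again become unassigned):

    class Variable:
        def __init__(self, name):
            self.name = name
            self.was_known = False    # has its value ever been seen?

    def load(var, registers):
        # Return True if an unassigned check must be planted for this
        # load.  'registers' maps each register to the variable it is
        # currently known to hold.
        check_needed = not var.was_known and var not in registers.values()
        var.was_known = True          # once checked, never checked again
        return check_needed

    def procedure_call(registers):
        registers.clear()             # register associations are forgotten,
                                      # but the was_known marks are retained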
6 Observations
6.1 Suitability of I-code for Optimisation
When considering the use of I-code for global
optimisation there are two techniques available:
Firstly, the optimisations can be performed using the
I-code and going straight into object code, possibly via a
third phase. In this case the only real constraint on
I-code is that it be powerful enough to be able to carry all
the information available in the source and to present it in
a compact form.
Secondly, the optimisations can be seen as an extra phase
introduced between the first phase (the I-code generator)
and what is normally the second phase (the code generator).
The optimiser takes in I-code and produces as its output a
new I-code stream which can be fed into the code generator.
In this case not only must the I-code carry all the source
information but it must be able to describe the generation
of an optimised program. Clearly the code must be able to
reflect the structure of the target machine in some way and
hence must be able to lose its machine independence.
The second technique is the more interesting as not only
does it permit the optional inclusion of the global
optimising without affecting the structure of the other
phases, but it removes the optimisations from the low-level
details of code production and provides a means for
separating the machine-independent and machine-dependent
optimisations. In particular, in the same way as much of the
code generator can be built from a standard "kit" with a few
special machine-specific parts, so the global optimiser can
utilise code from other optimisers.
The way in which the optimiser can influence the
operation of the code generator is by making use of the fact
that the intermediate code does not describe a computation
but a compilation process. This compilation is driven by
the descriptors which are normally translated by the code
generator from the machine-independent form in the I-code
into the appropriate machine-dependent representation,
reflecting the target machine architecture: registers,
stacks, memory etc. By short-circuiting this translation a
global optimiser can force the use of specific machine
features.
For example consider the following fragment of an integer
function:
-------------------
| integer X |
| X = A(J) |
| X = 0 if X < 0 |
| result = X |
-------------------
The standard I-code produced for this fragment would have
the form:
------------------------
| DEF 12 "X" INTEGER |
| SIMPLE DEFAULT NONE |
| NONE |
| PUSH 12 | - X
| PUSH 6 | - A
| PUSH 7 | - J
| ACCESS |
| ASSVAL |
| PUSH 12 | - X
| PUSHI 0 |
| COMP >= 1 |
| BGE 1 |
| PUSH 12 | - X
| PUSHI 0 |
| ASSVAL |
| LOC 1 |
| PUSH 12 | - X
| RESULT |
------------------------
On the PDP11 the code generated for this could be:
-------------------
| MOV J,R2 |
| ADD R2,R2 | Scale the index
| ADD A,R2 | Add in ADDR(A(0))
| MOV (R2),X | X=A(J)
| BGE $1 | ->$1 if X >= 0
| CLR X | X = 0
| $1: MOV X,R1 | assign result register
| {return} |
-------------------
Here the obvious optimisation is to note that the local
variable, X, is eventually to be used as the result of the
function and so needs to end in register 1.
By changing the definition of X in the I-code into:
----------------------------------------------------
| DEF 12 "X" INTEGER SIMPLE DEFAULT NONE SPECIAL R1 |
----------------------------------------------------
and making no other changes, the code generator will produce
code of the form:
--------------------
| MOV J,R2 |
| ADD R2,R2 |
| ADD A,R2 |
| MOV (R2),R1 |
| BGE $1 |
| CLR R1 |
| $1: {return} |
--------------------
As this process necessitates the I-code becoming more and
more intimately involved with the structure of the target
machine, in that it starts referring directly to registers
and the like, it is necessary that a new control item be
added so that the code generator may be prevented from
pre-empting resources which the optimiser is manipulating.
The new item is RELEASE and it is used in conjunction with
the definition of machine-dependent descriptors. When such
a descriptor is introduced (using DEF) the associated target
machine component is considered to have been claimed and may
only be used in response to explicit direction from the
I-code. On receipt of the corresponding RELEASE the
component is once again made available for implicit use by
the code generator (for temporaries etc.). This mechanism
is an exact parallel to the way in which memory locations
are claimed by the definition of descriptors and released by
the END of the enclosing block.
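The claim and release of machine components may be sketched
as follows (Python; the register-pool interface is an
assumption made for the illustration):

    class RegisterPool:
        def __init__(self, names):
            self.free = set(names)
            self.claimed = set()

        def claim(self, name):
            # on a DEF introducing a machine-dependent descriptor
            self.free.discard(name)
            self.claimed.add(name)

        def release(self, name):
            # on the corresponding RELEASE item
            self.claimed.discard(name)
            self.free.add(name)

        def temporary(self):
            # pick a register for implicit use by the code generator
            if not self.free:
                raise RuntimeError("no free register: value must be spilled")
            return self.free.pop()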
The main assumption about this style of optimisation is
that the code generator has the ability to generate any
required instruction, provided that the pertinent
information is available at the required time.
As an example, the VAX 11/780 provides addressing modes in
which the value in a register may be scaled and added into
the effective operand address before the operand is used,
hence the following code:
------------------------
| integerarray A(1:9) |
|- -|
| A(J) = 0 |
| |
| MOVL J,R5 | pick up J
| CLRL 12(R3)[R5] | A(J) = 0
------------------------
The operand address generated by the CLRL instruction is:
------------------
| 12+R3 + R5*4 |
------------------
as there are 4 bytes (address units) to a longword.
This instruction can be generated naturally during the
non-optimised evaluation of array subscripts, and so the
optimiser can assume that the index mode of operand will be
used whenever a register operand is specified as an array
index.
This approach has the added advantage that, in the worst case
where the code generator does not produce the instructions
the optimiser hoped for, then as long as the optimised I-code
still describes the required compilation the code generator
will simply produce a more long-winded, but equally valid,
version of the program. In other words, as long as some
choice is available and some temporary objects are left at
the disposal of the code generator, the optimiser cannot
force it into a state where working code cannot be produced.
In the example above even if the code generator does not
produce index mode operands, it can still generate sequences
of the form:
----------------------
| MULL3 R5,#4,R1 | R5*4 -> R1
| ADDL2 R3,R1 | R3+R1 -> R1
| CLRL 12(R1) | 0 -> (12(R1))
----------------------
6.2 Performance
The figures in appendix A3 are the results of measuring
the effect of various optimisations on the Interdata 7/32
and the DEC PDP 11/45.
One problem in choosing programs to be measured is that
heavy use of particular language features will increase the
overall effect of certain optimisations.
As a trivial example of this consider the following
"program":
--------------------------------
| begin |
| integerarray A(1:1000) |
| A(1) = 0 |
| endofprogram |
--------------------------------
With all array optimisations enabled, on the 7/32 this
generates 30 bytes of code, whereas without the optimisation
it results in 170 bytes of code, largely due to the
procedure for declaring the array.
Clearly a reduction of 82% is not to be expected on more
typical programs.
Similarly the absence of features will bias the results.
In particular the smaller programs will not demonstrate the
power of the optimisations which only take effect when
various size limits have been exceeded: the most obvious
such limits being addressing restrictions caused by the size
of address fields in instructions.
The major difficulty in producing results which are of
any real value is that the effects of the optimisations
depend on the individual style in which the programs under
consideration were written. Inevitably users get a "feel"
for the sort of statement for which the compiler generates
good code and they modify their style of programming
accordingly. If at some stage in its development a compiler
produces poor code for a particular construction users will
tend to avoid that construction, even long after the
compiler has been improved and can compile it effectively.
This well-known phenomenon [Whitfield, 1973] argues strongly
that users should never see the object code generated by the
compilers they are using.
The effects of many optimisations are difficult if not
impossible to measure with any degree of accuracy as they
interact with other optimisations to a great extent. The most
obvious interaction is that between the size of jump
instruction required and most of the other optimisations.
The size of jump is determined by the amount of code
generated between the jump and the label it references. If
any other optimisation is inhibited this volume of code is
likely to increase, decreasing the chances of being able to
use the shorter forms of the jump.
Some optimisations depend almost totally on others; it is
unlikely that the optimisation of reducing or removing the
entry and exit code sequences associated with procedures
(section 4.5.1) would have much effect if the parameters
were not passed in registers and references to them in the
procedures were replaced by references to those registers.
In particular, it must be noted that it is always possible
to generate programs which will benefit greatly from those
optimisations which do not appear to be of much use from the
figures given. However, the test programs used to derive
the figures are typical of the programs processed by the
compiler, and it is hoped that they give a more realistic
and balanced view of the improvements which may be achieved
in 'real' cases.
Under some circumstances it may be advantageous to apply
all optimisations even though some may appear to give little
benefit, since this 'squeezing the pips' frequently removes
one or two instructions from critical loops in a program.
Yet again this shows the difficulty in quantifying the
usefulness of optimisations as they are so dependent on the
particular circumstances.
One area of measurement has been deliberately omitted
from the figures, namely the effect on execution time of the
optimisations. This was for several reasons:
1. On the systems used it was impossible to get
reliable timing measurements with any accuracy
greater than about plus or minus 5%.
2. For the reasons given previously, many programs
could benefit from fortuitous optimisations which
could not be expected every time.
3. Programs which would run long enough to improve the
accuracy of the measurements invariably lost this
accuracy through spending much time in the
system-provided procedures, mainly for input and
output. This point in particular suggests that as
the overhead is beyond the control of the general
user, the savings in code space may be much more
important. Even with ever-growing store sizes,
virtual memory systems will continue to treat
smaller programs better than larger ones.
4. Some of the optimisations, particularly passing
parameters in registers, prevent the compiled
program from running, unless the controlling
environment is modified in a parallel way. This
would invalidate the timings as the environment is
not usually under the control of the compiler.
From the crude measures which were obtained there is a
suggestion that the decrease in execution time roughly
parallels the decrease in code size.
6.3 Cost of optimisation
The cost of an optimisation is, in general, very
difficult to measure, as may be seen by considering the
three relevant areas: compile time, space requirement, and
logical complexity.
6.3.1 Compile time
In order to generate good code, the compiler must spend
time looking for the cases which are amenable to
improvement. If no optimisation is performed this time is
not used and so the compilation should take less time.
However, the non-optimised version commonly requires the
production of more code than the optimised version,
frequently over fifty percent more when comparing fully
diagnostic code with fully optimised code. On all the
compilers written so far, the time saved by not having to
generate these extra instructions more than outweighs the
time spent in deciding not to generate them.
6.3.2 Space requirement
Several optimisations increase the requirement for
workspace, notably all the remembering optimisations. On
most machines available at present the number of things
which may be remembered is fairly small: sixteen registers
and one condition-code is probably the maximum. Even if
this number is increased by remembering several facts about
each thing, the total amount of space needed will be small
when compared with the space needed to hold the information
about user-defined objects, information which is required
whether optimisation is being performed or not. On large
machines the extra memory required will be cheap; on small
machines the need for the optimisation will have to be
balanced against the size of the largest program which must
be compiled.
6.3.3 Logical complexity
The cost of providing an optimisation includes a
non-recurrent component, which is the difficulty of
performing the optimisation at all because of the logical
complexity of discovering the necessary circumstances. In a
system which is aimed at portability this cost can often be
shared over a number of implementations; the techniques used
in one being applicable to others, perhaps after minor
modifications.
6.4 Comments on the results
6.4.1 Register remembering
Of all the optimisations tested, a simple
remembering of values in registers provided by far
the greatest improvement in code size.
One problem in implementing this optimisation is
deciding what to remember, as shown by the
following code sequence:
-----------
| X = Y |
| |
| L 1,Y |
| ST 1,X |
-----------
Following this sequence register 1 will contain
both the value in X and the value in Y; should the
compiler remember X or Y or both?
The measurements show that the gain in remembering
both (2 uses) as opposed to just one (1 use) are
quite small. The algorithm used to determine what
to remember in the '1 use' case was simply to
remember a new piece of information only if nothing
else was known about the register in question.
This gives the best results in cases such as:
A = 0; B = 0; C = 0
where the value '0' will be remembered, but will
perform badly with the more contorted case:
A = 0; B = A; C = B
as again only the value '0' will be remembered.
Unless very tight code is required, the cost in
maintaining multiple sets of information about each
register and searching for particular values will
probably rule out such extended remembering
optimisations.
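The '1 use' rule itself is very cheap; it may be sketched as
follows (Python notation, with invented names):

    class Remembering:
        def __init__(self):
            self.known = {}          # register -> value or variable held

        def note(self, register, fact):
            # remember a new fact only about a register with no
            # current associations (the '1 use' rule)
            if register not in self.known:
                self.known[register] = fact

        def find(self, fact):
            # a register already holding the required value, if any
            for register, held in self.known.items():
                if held == fact:
                    return register
            return None

        def forget(self, register):
            # called when the register is overwritten
            self.known.pop(register, None)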
Perhaps a surprising result is that the PDP11 on
average gains about as much from this optimisation
as the 7/32.
This is the result of two interacting effects.
Firstly, the 7/32 dedicates up to five registers to
address local variables in the last five levels of
procedure nesting and locks three for other fixed
purposes, leaving about ten for intermediate
calculations. The PDP11, however, uses a display
in store to access intermediate levels and has to
load the address of a particular level each time it
is required. In addition the PDP11 implementation
fixes the use of four registers, leaving only four
for intermediate calculations.
Secondly, the 7/32 needs to use at least one
register to move values around while the PDP11
often requires none.
These two effects give a fairly large number of
transient values in the registers of the 7/32, and
a smaller number of more frequently used values
(addresses) in the registers of the PDP11. On
average it appears that the number of times
necessary values are found is roughly equal in the
two cases.
6.4.2 Remembering environments
An environment is the complete knowledge
maintained by the compiler at any time. By
remembering and merging environments while
compiling IF-THEN-ELSE constructions the effects of
the implied labels and jumps on the remembering
optimisations can be minimised.
The measurements show that the gains achieved by
remembering more and more environments fall off
very quickly; two environments seem to be about the
best. However, the overhead in providing more than
one environment is simply compiler table space, and
so a compiler which can handle one environment can
easily handle more to get a very small but cheap
gain.
One clear result is the difference between the
effects on the two machines (sometimes an order of
magnitude). This is almost entirely due to the
difference in the number of available registers.
6.4.3 Array allocation and use
From monitoring service versions of the
compilers it is clear that in IMP77 the vast
majority of arrays have constant bounds.
Allocating these arrays on the local stack frame at
compile time is a simple operation and can save a
fair amount of code much of which would only be
executed once, as most arrays are declared in the
outermost block.
Remembering array address calculations can reduce
the code by about five percent, but it commonly has
little effect and is quite tedious and expensive to
achieve.
The small increase in code size for a few cases
is a side-effect of the register allocation
mechanism. Registers are chosen by giving priority
to those about which the least is known, and then
by selecting the least recently used such register.
Hence, which register will be used depends on the
compilation of previous statements. When a value
is required in a specific register, for example
during parameter transmission, occasionally it will
already be in that register purely by chance. A
minor change in the generated code, such as not
requiring a new register for an array access, can
result in the value not being in the correct
register later on. This instability seems to be
undesirable, but alternative strategies, such as
biasing the allocation towards or away from
particular registers, on average result in worse
code.
6.4.4 Common operands
On the 7/32 the only instruction which can be
used to simplify statements of the form: X = X op Y
is the AM (add to memory) instruction. It is
therefore somewhat surprising that its use
frequently saves over two percent of the code.
The two possible expansions of a suitable addition
statement are:
------------ ------------
| | |
| L 1,Y | L 1,X |
| AM 1,X | A 1,Y |
| | ST 1,X |
------------ ------------
The first saves four bytes and leaves the increment
in the register. Even if the incremented value is
required immediately the extra load instruction
will only increase the code size to that of the
alternative sequence.
As the PDP11 has many instructions which can be
used in this way it is hardly surprising that it
benefits much more.
6.4.5 Parameters in registers
This optimisation gives another significant
saving in code at little cost to the compiler,
simply by moving the store instructions for
parameter assignment from the many calls to the
unique procedure definitions. The effect is more
pronounced on the 7/32 as all assignments require
two instructions, a load and a store, whereas the
PDP11 can usually make do with one MOV instruction.
In the latter case the saving comes from the
ability to reduce the size of the procedure entry
and exit sequences if all of the parameters can be
passed in registers.
6.4.6 Condition-code remembering
On machines with condition codes many
instructions set the result of a comparison with
zero as a side-effect. Knowledge of this can be
used to inhibit explicit code to compare values
with zero. However, the small benefit so gained
suggests that it is not worth doing, even though it
is a very cheap test.
6.4.7 Merging
The large difference between the effect of
forward merging on the 7/32 and the PDP11 is mainly
due to the addressing modes available on the
machines.
On the PDP11 statements of the form "A=B" can be
compiled into a single instruction "MOV B,A",
ignoring any extra instructions which may be needed
to make A and B addressable. However, on the 7/32
all values must be moved via the registers,
resulting in two instructions for the same
statement:
-------------
| L 1,B |
| ST 1,A |
-------------
Hence the following code:
-------------------------------
| if X=0 then Y=1 else Y=12 |
-------------------------------
7/32 PDP11
------------------- -------------------
| L 1,X | TST X |
| BNE $1 | BNE $1 |
| LIS 2,1 | |
| ST 2,Y | MOV #1,Y |
| B $2 | BR $2 |
| $1:LIS 2,12 | $1: |
| ST 2,Y | MOV #12.,Y |
| $2: | $2: |
------------------- -------------------
With the 7/32 code, merging can reduce the sequence
by one instruction, a "STore", while with the PDP11
no such improvement is possible.
As the techniques for merging and delaying are
quite expensive, but not complicated, and have a
major influence on the design of the code-generator
the small gains achieved are probably not worth the
trouble, unless the last drop of efficiency is
required at all costs.
6.5 Criticisms and benefits of the technique
6.5.1 Complexity
The main argument against the use of high-level
intermediate codes is that they move the complexity of code
generation from the common machine independent phase into
the machine dependent phase, forcing work to be repeated each
time a new compiler is required.
While this is undoubtedly true the overheads are not as
great as they may first appear.
The extra complexity of the code generators may be split
into two parts: an organisational part which builds and
maintains the structures used during the compilation, and
processes the intermediate code, using it to drive the
second part, an operational part which uses the structures
to generate code as instructed by the organisational part.
The changes in the organisational part when moving to a new
machine are small enough to permit the use of large sections
of code from old compilers. Even when considering the
operational part much will be similar from machine to
machine, in particular the communication between the second
and third phases and the bulk of that latter phase can be
taken without change. From examining the compilers produced
using I-code it appears that about 60% of the source of the
machine dependent parts is common, 20% can be considered as
being selected from a standard "kit" of parts, and the final
20% unique to the host machine.
6.5.2 I/O overhead
One of the disadvantages of dividing a compiler into
several distinct phases is that it results in an additional
cost in communicating between consecutive phases. As
discussed in section 5.1 this cost depends on the system
running the compiler. In the worst case where communication
is achieved using conventional files the overhead may not be
too serious.
The time spent doing input and output on the Interdata 7/32
compiler is about 27% of the total compilation time, and for
the PDP11 is about 22%, breaking down as follows:
-------------------------------------------------------
| |
| ---- ---- ---- |
| | | | |---->| | |
| Source ----> | P1 |---->| P2 | | P3 |----> Object |
| | | | |---->| | |
| ---- ---- ---- |
| |
| 7/32: 7% 7% 10% 3% |
| (4%) (4%) (5%) (3%) |
| |
| PDP11: 9.4% 11% 0.6% 0.5% |
-------------------------------------------------------
The figures in parentheses give the percentage of time taken
when the input and output requests are made directly to the
file manager rather than via the standard subsystem
procedures, thus reducing the internal I/O overhead to about
10% of the total compilation time.
6.5.3 Lack of Gains
It has been argued that the gains brought about by
adopting a high-level code as opposed to a low-level one are
not worth the increased effort involved in processing it.
Depending on the uses to which the compiler is to be put,
small increases in code efficiency can outweigh a reasonable
increase in the cost of producing the compiler and using it.
A 5% improvement in the execution speed of the compiler
itself is not insignificant when the number of times it is
used is considered. However, it cannot be denied that a
careful redesign of critical parts of a program can have a
greater effect on its performance than any amount of
automatic optimisation. Notwithstanding, it seems
reasonable that programmers should be able to concentrate on
the large-scale efficiencies of program design and have the
detailed improvements left to the compiler.
Also it should be noted that measurements indicate that the
compilers execute faster when performing certain
optimisations than when not performing them, for example
passing parameters in registers.
If low-level codes are needed for some reason the
complexity saved from the machine independent phase can be
moved into a new phase which converts the high-level code
into a low-level one. This provides the low-level code for
those who want it while preserving the high-level interface
for use when good code is required.
One important gain in using such intermediate codes is
that they can ease the difficulties associated with
maintaining a number of compilers for different machines,
when those compilers are self-compiling.
For several reasons it may not be desirable to permit sites
to have the source of the machine-independent phase:
commonly to give freedom of choice for the form of the
language in which that phase is written and to prevent local
"improvements" which rapidly lead to non-standard language
definitions. In such cases the intermediate-code generator
can be maintained at one site and updated versions can be
distributed in the intermediate code form without fear of
compromising the quality of the object code generated from
it.
Such a technique is currently being used in the production
of portable SIMULA compilers [Krogdahl, 1980].
6.5.4 Flexibility
At some stage in producing a compiler the needs of the
end user must be considered. The flexibility afforded by
the high-level nature of the intermediate code allows the
compiler to be adapted to fit its environment. If the
compiler is to be used for teaching, the quality of the code
it produces can be sacrificed for compilation speed and
high-quality diagnostics, particularly as compilation time
may well be an order of magnitude greater than the execution
time. Indeed many of the programs will fail to compile and
never reach execution. If the application is for compiling
programs that will be compiled once and then executed many
times, more effort can be expended in producing fast code,
although this is not to say that diagnostics and fast code
must be kept separate as the longer a program runs without
falling the more trouble will be caused when it falls
without convenient diagnostics.
6.6 Comments on Instruction sets and compilation
Following the production of IMP compilers for several
different processors, various features of instruction sets
have become evident which influence the generation of code.
i The instruction set should be complete, that is,
where an instruction is available for one data type
it should be available for all data types for which
it is well-defined. Similarly, instruction formats
used by one operation should be available for all
similar operations. The best example of such an
instruction set is that provided by the DEC PDP10.
Unfortunately the majority of machines are not so
helpful. As an example of the sorts of things which go
wrong consider the Perkin-Elmer 3200 series. These
machines provide three integer data types: fullword
(32 bits, signed), halfword (16 bits, signed), and
byte (8 bits, unsigned). There are "add fullword"
(A) and "add halfword" (AH) instructions but no
"add byte" instruction. There are "add immediate
short" and "subtract immediate short" instructions
but multiply, divide, and, or, etc. cannot have
short immediate operands.
ii The instructions should be consistent, that is,
logically similar instructions should behave in
similar fashions.
Again, on the Perkin-Elmer 3200:
Load fullword (L) and load halfword (LH) set the
condition code but load byte (LB) does not.
Most register-store instructions can be replaced by
a load of the appropriate type followed by a
register-register instruction: e.g.
-------------- --------------
| CH 1,X | LH 0,X |
| | CR 1,0 |
-------------- --------------
both result in the same setting of the condition
code, but
-------------- --------------
| CLB 1,B | LB 0,B |
| | CLR 1,0 |
-------------- --------------
could result in different settings of the condition
code as CLR compares two unsigned 32 bit quantities
whereas CLB compares a zero-extended byte from
store with the zero-extended least significant byte
of register 1. For consistency either compare
halfword (CH) should use the sign-extended less
significant half of the register, or better, CLB
should not tamper with the value in the register.
iii Complex instructions should be avoided. There are
two reasons for this. Firstly, it is easier for a
compiler to break down statements into simple
operations than it is to build them up into complex
ones [Stockton-Gaines, 1965]. Secondly, if the
complex instructions do not perform the exact
function required by the language more instructions
will be needed to "prepare" for the complex
instruction and to "undo" its unwanted effects. As
an example, the DEC VAX11/780 is full of complex
instructions which seem to be well-suited to
high-level languages at first glance but on closer
inspection they are not so useful. A CASE
instruction is provided which indexes into a table
of displacements and adds the selected value to the
program counter. This would seem ideal for
compiling SWITCH jumps. Unfortunately, as the
table of displacements follows the CASE instruction
it would be very expensive to use it each time a
jump occurred using a particular switch. Instead
all references to the switch must jump to a common
CASE instruction. Even this does not help, as in
the event of an attempted jump to a non-existent
switch label the diagnostics or the event mechanism
will see the error as having occurred at the wrong
place in the program. Although this problem can be
"programmed around" it turns out that it is faster
to implement switches using sequences of simpler
instructions.
iv Machine designers should investigate carefully the
full consequences of building-in special fixed uses
of machine features. One of the best examples of a
clear oversight which causes grief to compiler
writers is found in the DATA GENERAL NOVA
multiplication instruction. This instruction
multiplies the value in register 1 by register 2
and places the double-length result in registers 0
and 1. As only registers 2 and 3 may be used for
addressing, and as register 3 is always used for
subroutine linkage, it follows that register 2 must
be used for addressing the local stack frame, but
this is exactly the register which must be
corrupted in order to use the multiply instruction!
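To make item iii above concrete, the following sketch shows a switch
compiled as a short sequence of simple operations: a bounds check at
the point of the jump followed by an indexed jump through a table of
code addresses. The sketch is illustrative only and is not taken from
the thesis; it is written in C, it uses the GNU "labels as values"
extension purely to mimic an indexed indirect jump, and the label
names and error handling are assumptions.

    #include <stdio.h>

    /* A hypothetical three-way switch compiled as: bounds check,
       indexed load from a table of code addresses, indirect jump.
       Because the check is made at the point of use, a jump to a
       non-existent switch label is detected where it occurred.     */
    static const char *colour(int n)
    {
        static void *table[] = { &&zero, &&one, &&two };

        if (n < 0 || n > 2)
            goto nolabel;           /* no such switch label           */
        goto *table[n];             /* indexed jump through the table */

    zero: return "red";
    one:  return "green";
    two:  return "blue";
    nolabel:
        return "error: no such switch label";
    }

    int main(void)
    {
        printf("%s\n", colour(1));  /* prints "green"                    */
        printf("%s\n", colour(7));  /* out of range: caught by the check */
        return 0;
    }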
Although specific machines have been used in the examples,
similar problems abound in all machines. Indeed it is clear
that machines are most commonly designed for programmers
writing in assembler or FORTRAN, and furthermore writing
their programs in a particular style.
While it is clear that the problems could be called "mere
details" and that they are not difficult to surmount, it
remains that they complicate otherwise simple
code-generation algorithms, making compilers larger, slower,
and correspondingly more difficult to write, debug, and
maintain.
In conclusion it appears that the machine most suited to
supporting high-level languages should have a small but
complete set of very simple instructions, their simplicity
permitting rapid execution and great flexibility.
7 Conclusions
7.1 Viability of the technique
The techniques described above have been used to create
several IMP77 compilers which are in regular use on a number
of systems. In terms of total memory space required for a
compilation, they compare favourably with other compilers.
The major weakness seems to be execution time, which can vary
from twice as long as other compilers in the worst case to
half as long in the best case. As most of the effort in
writing the compilers was spent in investigating the
techniques involved and not in minimising compile time, and
as the compilers which ran much faster were either totally
or partially written in machine code (the IMP77 compilers
are all written exclusively in IMP77), it seems that the
technique can be used to produce acceptable service
compilers.
7.2 Ease of portability
Although using I-code does not permit compilers to be
written in as short a time as with P-code and OCODE, the
large amount of code which is common to all of the compilers
written so far means that, given a working code generator as
a template, a new optimising compiler can be written in the
space of a few months, with the final result producing code
of high quality.
7.3 Nature of optimisations
During the course of the investigation it became clear
that one of the difficulties of optimisation is that gains
are achieved by applying a large number of ad hoc rules,
especially where peephole optimisations are concerned.
As instruction sets become more complicated and rich, there
is a corresponding increase in the variety of ways of
implementing high-level language features. This increases
the possibilities of optimisation and consequently the
complexity of compilers. By using high-level intermediate
codes, such as I-code, it should be possible to concentrate
on machine-independent optimisations knowing that the
resulting intermediate code can be used to generate
efficient code for current machines. Eventually, when
better instruction sets are available, hopefully with only
one way of doing things and no opportunities for non-trivial
optimisation, the same intermediate code can be used to
drive code generators which are much simpler and more
directly portable.
Appendix A1
The IMP Intermediate Code
A Brief Summary
The IMP intermediate code may be considered a sequence of
instructions to a stack-oriented machine which generates
programs for specific computers. It is important to note
that the intermediate code describes the compilation process
necessary to generate an executable form of a program; it
does not directly describe the computation defined by the
program.
The machine which accepts the intermediate code has two
main components:
1 A Descriptor area. This is used to hold
descriptors containing machine-dependent
definitions of the objects the program is to
manipulate. This area is maintained in a
block-structured fashion; that is, new descriptors
are added to the area during the definition of a
block and are removed from the area at the end of
the block.
2 A Stack. The stack holds copies of descriptors
taken from the descriptor area or the parameter
area, or created specially. Items on the stack
are modified by intermediate code control items
to reflect operations specified in the source
program. Such modifications may or may not
result in code being generated. From the point
of view of this definition stack elements are
considered to have at least three components:
i Type
ii Value
iii Access rule
The "Access rule" defines how the "Type" and
"Value" attributes are to interpreted in order to
locate the described object.
For example, the access rule for a constant could
be "Value contains the constant" while for a
variable it could be "Value contains the address
of the variable". Clearly, the access rules are
target-machine dependent. Descriptors may be
combined to give fairly complex access rules, as
in the case of applying "PLUS" to the stack when
the top two descriptors are for the variable X
and the constant 1, resulting in one descriptor
with the access rule "take the value in X and add
1 to it". The complexity of these access rules
may be restricted by a code-generator. In the
example above, code could be generated to evaluate
X+1, resulting in the access rule "the value is in
register 1", say.
The importance of the code describing the compilation
process required, rather than the actual computation which
the source program specified, is seen when attempting to use
the code for statements of the form:
A := if B = C then D else E;
This could not be encoded as:
PUSH A
PUSH B
PUSH C
JUMP # L1
PUSH D
BR L2
LOC L1
PUSH E
LOC L2
ASSVAL
The reason is that the items on the stack at the time of the
ASSVAL would be (from top to bottom) [E], [D], [A], because
no items were given which would remove them from the stack;
hence the ASSVAL would assign the value of E to D and then
leave A dangling on the stack.
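The effect can be seen mechanically: a code generator processes
every item in turn, compiling both arms of the conditional, so
nothing removes A, D or E from its descriptor stack before the
ASSVAL is met. The following trivial simulation in C (an
illustration only, not part of the definition) makes this explicit.

    #include <stdio.h>

    static const char *stack[10];
    static int top;

    static void push(const char *name) { stack[top++] = name; }

    int main(void)
    {
        push("A");          /* PUSH A                                  */
        push("B");          /* PUSH B                                  */
        push("C");          /* PUSH C                                  */
        top -= 2;           /* JUMP #  consumes the comparands B and C */
        push("D");          /* PUSH D  - compiled unconditionally      */
        push("E");          /* PUSH E  - also compiled                 */

        /* ASSVAL would now assign TOS to SOS: */
        printf("TOS=%s SOS=%s, leaving %s dangling\n",
               stack[top - 1], stack[top - 2], stack[top - 3]);
        return 0;
    }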
Unless otherwise stated, all constants in the
intermediate code are represented in octal.
Descriptors
DEF TAG TEXT TYPE FORM SIZE SPEC PREFIX
This item causes a new descriptor to be generated
and placed in the descriptor area. On creation,
the various fields of the DEF are used to
construct the machine-dependent representation
required for the object.
TAG is an identification which will
be used subsequently to refer to
the descriptor.
TEXT is the source-language identifier
given to the object (a null
string if no identifier was
specified).
TYPE is the type of the object:
GENERAL, INTEGER, REAL, STRING,
RECORD, LABEL, SWITCH, FORMAT.
FORM is one of: SIMPLE, NAME, ROUTINE,
FN, MAP, PRED, ARRAY, NARRAY,
ARRAYN, NARRAYN.
SIZE is either the TAG of the
appropriate record format
descriptor for records, the
maximum length of a string
variable, or the precision of
numerical variables: DEFAULT,
BYTE, SHORT, LONG.
SPEC has the value SPEC or NONE
depending on whether or not the
item is a specification.
PREFIX is one of: NONE, OWN, CONST,
EXTERNAL, SYSTEM, DYNAMIC, PRIM,
PERM or SPECIAL. If SPECIAL is
given there will follow an
implementation-dependent
specification of the properties
of the object (such as that it is
to be a register, for example).
Parameters and Formats
The parameters for procedures and the elements of record
formats are defined by a list immediately following the
procedure or format descriptor definition:
START Start of definition list
FINISH End of definition list
ALTBEG Start of alternative sequence
ALT Alternative separator
ALTEND End of alternative sequence.
Blocks
BEGIN Start of BEGIN block
END End of BEGIN block or procedure
PUSH Push a copy of the descriptor onto
the stack.
PROC This is the same as PUSH except that the
descriptor being stacked represents a
procedure which is about to be called
(using ENTER).
PUSHI Push a descriptor for the integer constant
onto the stack.
PUSHR Push a descriptor for the real
(floating-point) constant onto the
stack.
PUSHS Push a descriptor for the string constant
onto the stack.
SELECT TOS will be a descriptor for a record.
Replace this descriptor with one describing
the sub-element of this record.
Assignment
ASSVAL Assign the value described by TOS to the
variable described by SOS. Both TOS and
SOS are popped from the stack.
ASSREF Assign a reference to (the address of) the
variable described by TOS to the pointer
variable described by SOS. Both TOS and
SOS are popped from the stack.
JAM This is the same as ASSVAL except that the
value being assigned will be truncated if
necessary.
ASSPAR Assign the actual parameter described by
TOS to the formal parameter described by
SOS. This is equivalent to either ASSVAL
(for value parameters) or ASSREF (for
reference parameters).
RESULT TOS describes the result of the enclosing
function. Following the processing of the
result, code must be generated to return
from the function.
MAP Similar to RESULT except that TOS describes
the result of a MAP. Again a return must
be generated.
DEFAULT
INIT Create N data items, corresponding to the
last descriptor defined, and give them all
an initial (constant) value. The constant
is popped from the stack in the case of
INIT but DEFAULT causes the
machine-dependent default value to be used
(normally the UNASSIGNED pattern).
Binary operators
ADD Addition
SUB Subtraction
MUL Multiplication
QUOT Integer division
DIVIDE Real division
IEXP Integer exponentiation
REXP Real exponentiation
AND Logical AND
OR Logical inclusive OR
XOR Logical exclusive OR
LSH Logical left shift
RSH Logical right shift
CONC String concatenate
ADDA ++
SUBA --
The given operation is performed on TOS and SOS, both of
which are removed from the stack, and the result
(SOS op TOS) is pushed onto the stack.
e.g. A = B-C
PUSH A
PUSH B
PUSH C
SUB
ASSVAL
Unary Operators
NEG Negate (unary minus)
NOT Logical NOT (complement)
MOD Modulus (absolute value)
The given operation is performed on TOS.
Arrays
DIM The stack will contain pairs of
descriptors corresponding to the lower and
upper bounds for an array. This
information is used to construct the arrays,
and any necessary accessing information, for
use through the last descriptors to
have been defined. All of these
descriptors will be for similar arrays.
INDEX SOS will be the descriptor for a
multi-dimensional array and TOS will be the
next non-terminal subscript. The stack is
popped.
ACCESS SOS will be the descriptor of an array and
TOS will be the final/only subscript. Both
descriptors are replaced by a descriptor
for the appropriate element of the array.
E.g. given arrays A(1:5) and B(1:4, 2:6),
and integers J,K:
A(J) = 0            K = B(J, K)
PUSH A              PUSH K
PUSH J              PUSH B
ACCESS              PUSH J
PUSHC 0             INDEX
ASSVAL              PUSH K
                    ACCESS
                    ASSIGN
Internal labels
Internal labels are those labels in the
intermediate code which have been created
by the process of translating from the
source program, and so do not appear
explicitly in the source program. The main
property of these labels is that they will
only be referred to once. This fact can be
used to re-use these labels, as, for
example, a forward reference to a
currently-defined label must cause its
redefinition.
LOCATE <1> define internal label <1>
GOTO <1> forward jump to internal label <1>
REPEAT <1> backward jump to internal label <1>
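The following C sketch shows one way a code generator might
exploit the single-reference property; it is an assumed detail
for illustration, not the thesis's implementation. Each label
slot holds either the address at which the label was defined,
so that a later REPEAT can branch back to it, or the address of
the one pending forward GOTO, so that LOCATE can fix it up;
once reference and definition have met, the slot is free to be
re-used.

    #include <stdio.h>

    #define NLABELS 64

    enum state { FREE, DEFINED, PENDING };

    static struct { enum state state; int addr; } label[NLABELS];
    static int pc;                     /* address of the next instruction emitted   */

    static void locate(int l)          /* LOCATE <l>: define internal label l       */
    {
        if (label[l].state == PENDING) /* patch the single pending forward GOTO     */
            printf("patch branch at %d to target %d\n", label[l].addr, pc);
        label[l].state = DEFINED;      /* (re)defined at the current address        */
        label[l].addr  = pc;
    }

    static void fgoto(int l)           /* GOTO <l> (goto is a C keyword): the one   */
    {                                  /* forward reference to label l              */
        printf("emit branch at %d, target not yet known\n", pc);
        label[l].state = PENDING;      /* any earlier use of l is finished,         */
        label[l].addr  = pc++;         /* so the slot is simply re-used             */
    }

    static void repeat(int l)          /* REPEAT <l>: backward jump                 */
    {
        printf("emit branch at %d back to %d\n", pc++, label[l].addr);
        label[l].state = FREE;         /* its single reference has now been made    */
    }

    int main(void)
    {
        fgoto(1);                      /* e.g. the branch round a THEN part         */
        pc += 3;                       /* ...code for the THEN part...              */
        locate(1);                     /* define label 1 and patch the GOTO         */

        locate(2); pc += 2; repeat(2); /* a simple backward loop                    */
        return 0;
    }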
Conditional branches
These branches are always forward.
JUMPIF