The production of Optimised Machine-Code for High-Level Languages using Machine-Independent Intermediate Codes. Peter Salkeld Robertson Ph. D. University of Edinburgh 1981 (Graduation date July 1981)
ABSTRACT The aim of this work was to investigate the problems associated with using machine-independent intermediate codes in the translation from a high-level language into machine code, with emphasis on minimising code size and providing good run-time diagnostic capabilities. The main result was a machine-independent intermediate code, I-code, which has been used successfully to develop optimising and diagnostic compilers for the IMP77 language on a large number of different computer systems. In addition, the work has been used to lay the foundations for a project to develop an intermediate code for portable SIMULA compilers. The major conclusions of the research were that carefully designed machine-independent intermediate codes can be used to generate viable optimising and diagnostic compilers, and that the commonality introduced into different code generators processing the code for different machines simplifies the tasks of creating new compilers and maintaining old ones.
Contents

1 Introduction
2 Intermediate codes
   2.1 Uncol
   2.2 Janus
   2.3 OCODE
   2.4 P-code
   2.5 Z-code
   2.6 Summary and conclusions
      2.6.1 Error checking and reporting
      2.6.2 Efficiency
      2.6.3 Assumptions
      2.6.4 Interpretation
3 Optimisations
   3.1 Classification of optimisations
      3.1.1 Universal optimisations
      3.1.2 Local optimisations
         3.1.2.1 Remembering
         3.1.2.2 Delaying
         3.1.2.3 Inaccessible code removal
         3.1.2.4 Peephole optimisation
         3.1.2.5 Special cases
         3.1.2.6 Algebraic manipulation
      3.1.3 Global optimisations
         3.1.3.1 Restructuring
         3.1.3.2 Merging
            3.1.3.2.1 Forward merging
            3.1.3.2.2 Backward merging
         3.1.3.3 Advancing
         3.1.3.4 Factoring
         3.1.3.5 Loop optimisations
            3.1.3.5.1 Iteration
            3.1.3.5.2 Holding
            3.1.3.5.3 Removal of invariants
         3.1.3.6 Expansion
         3.1.3.7 Addressing optimisations
      3.1.4 Source optimisations
   3.2 Combination of optimisations
4 The design of the compiler
   4.1 General structure
   4.2 The intermediate code
      4.2.1 Objectives
         4.2.1.1 Scope
         4.2.1.2 Information preservation
         4.2.1.3 Target machine independence
         4.2.1.4 Decision binding
         4.2.1.5 Simplification
         4.2.1.6 Redundancy
         4.2.1.7 Ease of use
   4.3 Code layout and addressing
      4.3.1 Nested procedure definitions
      4.3.2 Paged machines
      4.3.3 Events
   4.4 Data layout and addressing
   4.5 Procedure entry and exit
      4.5.1 User-defined procedures
      4.5.2 External procedures
      4.5.3 Permanent procedures
      4.5.4 Primitive procedures
   4.6 Language-specified and compiler-generated objects
      4.6.1 Internal labels
      4.6.2 Temporary objects
   4.7 Object-file generation
      4.7.1 Reordering
      4.7.2 Jumps and branches
      4.7.3 Procedures
      4.7.4 External linkage
      4.7.5 In-line constants
      4.7.6 Object-file format
   4.8 Summary
5 Review of the overall structure
   5.1 Division of function
   5.2 Testing and development
   5.3 Diagnostics
      5.3.1 Line numbers
      5.3.2 Diagnostic tables
      5.3.3 Run-time checks
6 Observations
   6.1 Suitability of I-code for optimisation
   6.2 Performance
   6.3 Cost of optimisation
      6.3.1 Compile time
      6.3.2 Space requirement
      6.3.3 Logical complexity
   6.4 Comments on the results
      6.4.1 Register remembering
      6.4.2 Remembering environments
      6.4.4 Array allocation and use
      6.4.5 Parameters in registers
      6.4.6 Condition-code remembering
      6.4.7 Merging
   6.5 Criticisms and benefits of the technique
      6.5.1 Complexity
      6.5.2 I/O overhead
      6.5.3 Lack of gains
      6.5.4 Flexibility
   6.6 Comments on Instruction sets and Optimisation
7 Conclusions
   7.1 Viability of the technique
   7.2 Ease of portability
   7.3 Nature of optimisations
Appendix
   A1 Simplified I-code definition
   A2 I-code internal representation
   A3 Results
References
1. Introduction Compilers for high-level languages form a significant part of most computer systems, and with an ever increasing number and variety of machine architectures on the market the problems of compiler development, testing, and maintenance consume more and more manpower and computer time. Moreover, as computer technology is improving and changing rapidly it is becoming evident that software costs will increasingly dominate the total cost of a system. Indeed, it may not be long before the lifetime of software regularly exceeds that of the hardware on which it was originally implemented, a state of affairs quite different from that envisaged by Halpern when he concluded that "the importance of the entire question of machine-independence is diminishing .." [Halpern, 1965]. In addition, there is a need to encourage the slowly-developing trend to write the majority of software in high-level languages. Even though the advantages of such an approach are many, a large number of users still have a love of machine-code, usually fostered by thoughts of "machine efficiency". Clearly, techniques must be developed to simplify the production of usable compilers which can "optimise" the match between the executing program and the user's requirements, be they for fast execution, small program size, reasonable execution time but with good run-time diagnostics, or whatever.
One popular method for reducing the complexity of a compiler is to partition it into two major phases: one language-dependent and the other machine-dependent. The idea is that the language-dependent phase inputs the source program and deals with all the syntactic niceties of the language, finally generating a new representation of the program, an intermediate code. This is then input by a second phase which uses it to generate machine-code for the target computer. In this way it should be possible to produce a compiler to generate code for a different machine by taking the existing first phase and writing a new second phase. This ability to move a large portion of the compiler from machine to machine has led to such compilers being referred to as "portable compilers" even though the term is perhaps misleading, as only part of the complete compiler can be moved without change. In practice many existing compilers generate intermediate representations of the program which are passed around within the compiler, for example the "analysis records" produced by the syntactic phase of compilation, but for the purposes of this work it is only when these representations are machine-independent and are made available outwith the compiler that they will be termed intermediate codes. Much of the emphasis in designing intermediate codes has been on enabling a compiler to be bootstrapped quickly onto a new machine - either by interpreting the intermediate code, or by using a macro generator to expand it into
machine-code [Brown, 1977]. Once this has been done the intention is that the quality of the code so produced can be improved at leisure. While this approach has been very successful and relatively error-free, it has been the experience of several implementors that it is difficult to adapt the scheme to produce highly optimised code [Russell, 1974]; apparently considerations of portability and machine-independence have caused the problems of optimisation to be overlooked. The aspect of intermediate-code design which has received most debate concerns the level of the code: low-level with a fairly simple code-generator, or high-level with a more complex code-generator [Brown, 1972]. This thesis attempts to put machine-independence and optimisation on an equal footing, and describes the use of an intermediate code which takes a novel view of the process. Instead of the intermediate code describing the computation to be performed, it describes the operation of a code-generator which will produce a program to perform the required computation. This effectively adds an extra level of indirection into the compilation, weakening any linkage between the form of the intermediate code and the object code required for a particular implementation. In essence I-code attempts to describe the results required in a way which does not constrain the method of achieving those results.
In particular it should be noted that the code described, I-code, was designed specifically for the language IMP-77, a systems implementation language which contains many of the constructions which pose problems for optimisation [Robertson, 1979]. It in no way attempts to be a "universal" intermediate code. Notwithstanding, the code, with a small number of minor extensions to cover non-IMP features, has been used successfully in an ALGOL 60 compiler and is currently proving viable in projects for writing Pascal and Fortran 77 compilers. The intermediate code as finally designed is completely machine independent, except inasmuch as the source program it describes is machine dependent, demonstrating that the problems may not be as intractable as thought by Branquart et al. who state that "clearly complete machine independency is never reached" [Branquart, 1973]. In addition to the problems of machine independence there is also the question of operating system independence, as nowadays it is common for machines to have several systems available. For this reason the task of producing a compiler is far from finished when it can generate machine code [Richards, 1977]. To simplify the generation of versions of a compiler for different operating systems a third phase of compilation was added, although it soon became clear that the extra phase could be used for other purposes as well, as will be shown later (section 4).
Throughout the text examples are given of the code produced by compilers written to demonstrate the power of the intermediate code. The examples of the intermediate code are couched in terms of mnemonics for the various code items, although the production compilers use a compacted representation. The code and its representations are described in Appendix A1 and Appendix A2. In the examples of code generated for various constructions it should be appreciated that the exact instructions and machine features used will depend very much on the context in which the code is produced, and so only typical code sequences can be given. The machines for which code is demonstrated are indicated by the following abbreviations in parentheses:

    (Nova)     Data General NOVA
    (PDP10)    Digital Equipment Corporation PDP10
    (PDP11)    Digital Equipment Corporation PDP11
    (VAX)      Digital Equipment Corporation VAX 11/780
    (GEC4080)  General Electric Company 4080
    (ICL2900)  International Computers Limited 2900
    (4/75)     International Computers Limited 4/75
    (7/16)     Interdata 7/16
    (7/32)     Interdata 7/32
    (PE3200)   Perkin Elmer 3200
2 Intermediate codes This section gives a brief account of the more important intermediate codes which have been discussed and have had an influence on the design of I-code. 2.1 Uncol UNCOL, UNiversal Computer Orientated Language, [Mock, 1958], was an early attempt to specify a means for solving the M*N problem of producing compilers for M languages to run on N machines. It was proposed that an intermediate language, UNCOL, be defined which would be able to express the constructs from any language, and which could itself be translated into code for any machine, resulting in the need for only M+N compilers. Indeed it was even suggested that programs would be written directly in UNCOL rather than in machine code. These ideas were very ambitious, but were presented without any concrete examples of what UNCOL might look like. Proposals were made for an UNCOL in [Steel, 1961] but the work was abandoned before anything like a complete specification had been produced. An UNCOL-like technique which has been used extensively is to compile for a known type of machine, such as the IBM 360, and then emulate that machine on the target machine. Unfortunately, to give this any chance of being efficient, microcode support will be necessary, and this is rarely available to compiler writers.
2.2 Janus The first attempt at generating an UNCOL which seems to have been at least partially successful was JANUS [Coleman, 1974]. The approach was effectively to enumerate all the mechanisms found in current programming languages and the techniques used to implement them. From this large list was defined a set of primitive data-types and operations upon them. These primitives were then put together to model the objects in the source language. Once JANUS code had been produced the intention was that it would either be interpreted or compiled into machine code by a macro generator. 2.3 OCODE Of all the languages which claim to be portable, perhaps the most successful has been BCPL [Richards, 1971]. The BCPL compiler generates the intermediate code OCODE which can either be interpreted or translated into machine code for direct execution. As BCPL is a fairly low-level language with only one data type, the word, many of the difficulties in designing intermediate codes do not arise. This means that the code can be pitched at a low level and be "semantically weak" without compromising the efficiency of the compiled code to any great extent.
The OCODE machine works by manipulating single-word objects held on a stack, into which there are several pointers.

e.g. R(1, 2, 3)

    STACK 3          adjust the top of stack to leave two cells free
                     for linkage information.
    LN 1             stack the constant 1.
    LN 2             stack the constant 2.
    LN 3             stack the constant 3.
    LL L6            stack the address of label L6 (the entry to the
                     routine).
    RTAP 5           enter the procedure adjusting the stack frame
                     pointer by 5 locations.
    .....
    .....
    .....
    ENTRY 1 L6 'R'   entry point for the routine R.
    SAVE 5           set the top of stack pointer to be 5 locations
                     from the stack frame pointer.
    .....
    RTRN             return.

2.4 P-code

P-code is the intermediate code used by the PASCAL<P> compiler [Nori, 1976; Jensen, 1976] and was designed with the aim of porting PASCAL quickly by means of an interpreter. In this respect it has been very successful, especially on microprocessor-based systems. The code is similar to OCODE but has a greater range of instructions to handle objects of differing types.

    procedure ERROR(VAL:INTEGER);
    begin                            0: ENT  4
    TOTAL := TOTAL+1;                1: LDO  138
                                     2: LDCI 1
                                     3: ADDI
                                     4: SRO  138
    if INDEX >= 9 then begin         5: LDO  139
                                     6: LDCI 9
                                     7: GEQI
                                     8: FJP  17
    LIST[10].NUM := 255              9: LAO  140
                                    10: LDCI 10
                                    11: DEC  1
                                    12: IXA  2
                                    13: INC  1
                                    14: LDCI 255
                                    15: STO
    end else begin                  16: UJP  28
    INDEX := INDEX+1;               17: LDO  139
                                    18: LDCI 1
                                    19: ADDI
                                    20: SRO  139
    LIST[INDEX].NUM := VAL          21: LAO  140
                                    22: LDO  139
                                    23: DEC  1
                                    24: IXA  2
                                    25: INC  1
                                    26: LOD  0, 4
                                    27: STO
    end;
    end;                            28: RETP
2.5 Z-code Z-code [Bourne, 1975] is the intermediate code produced by the ALGOL68C compiler, the main feature of which is the ability for the user to parameterise the first phase to modify the Z-code to suit the target machine, an idea previously investigated in SLANG [Sibley, 1961]. A set of eight registers is assumed by the code and others may be specified explicitly for each transfer. The memory with which the code works is assumed to be "a linear data store that is addressed by consecutive integers", addresses taking the form of base+displacement pairs. Intermingled with the instructions are directives which control the translation of the code into machine orders. Two of these directives are used to divide the code into "basic blocks" or "straight-line segments", and describe the usage of registers on entry to and exit from the blocks, although little use seems to be made of them at present.
As an example here is the Z-code generated by the PDP10 version of the compiler [Gardner, 1977]:

    int X := 2, Y := 3, Z := 2
     1:  F000 10 0 +2        load 2
         F040 10 6 +144      store in X
         F000 10 0 +3        load 3
         F040 10 6 +145      store in Y
     5:  F000 10 0 +2        load 2
         F040 10 6 +146      store in Z
    proc P = (int A, B) int:
    begin
     7:  S715 p*Z
         T246 677 712
    if A > B
     9:  R0
    10:  F020 10 5 +4        load A
    11:  F022 10 5 +5        subtract B
    12:  F113 10 0 P713      ->L713 if <=
    then A
    13:  F020 10 5 +4        load A
    else B
    14:  H116 0 P714         ->L714
    15:  L713
    16:  F020 10 5 +5        load B
    fi
    17:  L714
    18:  R1 10 1
    end
    19:  R1 10 1
    20:  T247 667 712        end of P
2.6 Summary and conclusions 2.6.1 Error checking and reporting The UNCOL approach of having one code for all languages and machines may well simplify the generation of some sort of compiler, but has the major disadvantage that the optimisation of error checking and reporting run-time errors cannot be left to the code generator - many errors are language-dependent and the code generator cannot know how to handle all of them. Instead the checks must be programmed into the intermediate representation explicitly. As will be shown later (5.3) this can inhibit several very powerful and effective optimisations. Sadly, this problem can result in the absence of all but the most trivial of run-time checks in the compiled code. Even when checking is provided in the intermediate code, as in the case of P-code with its CHK instruction for range testing, it is rare for the code to contain enough information to permit the error to be reported in source program terms; line numbers, procedure names, variable names and values etc. As an example, many P-code interpreters locate run-time errors in terms of 'P-code instruction addresses' which are of negligible benefit to most users.
2.6.2 Efficiency Commonly, little attention is paid to questions of run-time efficiency in the generation of intermediate code. An exception to this is Z-code which is parameterised in order that the match between the code and the target machine can be improved. In particular, the machine-independent phase is intended to perform simple register optimisation, although as the example in 2.5 shows, the insistence on using the same register will minimise any gains from remembering register contents. However, this is probably just a failure on the part of the current compilers and could be corrected at a later date. Unfortunately, the fact that the compiler purports to optimise the intermediate code inhibits the code generator from attempting any but the most trivial peephole optimisations. This may be seen in the example by considering instructions 10 - 12. On many machines the subtract operation is not good for value comparison as firstly it may fail with overflow, and secondly it corrupts a register. A better implementation would be to replace the subtract with a suitable COMPARE, leaving the values untouched and available for later use. This cannot be done by the code generator as it cannot know that the intermediate code does not go on to use the result of the subtraction later. Similarly, if Z-code had chosen to use a COMPARE instruction in the first place, a machine without a compare would have to work hard to make sure all registers involved in the necessary subtract were restored
to their initial values before the intermediate code goes on to use them. 2.6.3 Assumptions Most machine-independent intermediate codes have been designed, at least initially, assuming a linear store with one address increment corresponding to one basic object. In the case of O-code this is a direct result of the language definition, but in languages such as PASCAL it has led to a great loss of information, as the rich information about data types cannot be expressed in the code. The problems associated with putting languages onto machines with different addressing schemes has resulted in some intermediate codes being updated to accept a limited form of parameterisation to define the gross appearance of the target machine. Typical of the limitations of these codes is P-code where although the basic types of object can have differing sizes of machine representation, objects with enumerated types will always be given a 'fullword' even though the host machine could easily support a smaller item. A typical assumption is that the difference between objects explicitly specified in the original source and those created by the intermediate code generator for its own purposes is insignificant. As will be shown later (4.6), this is not necessarily the case.
3 Optimisations

The task of any compiler for a high-level language is to fit programs written in that language onto a specific computer system so that the required computations may be performed. Optimisation may be described as the process by which the fit is improved. Usually the quality of the optimisation is measured in terms of two parameters: the size of the running program, and, more commonly, the speed at which it executes. While it is possible in some cases to make a program smaller and increase its speed of execution, it is well-known that in general one is gained only at the expense of the other. For example, the following code fragments have the same effect, but the first will probably be smaller than the second, which will execute faster than the first:

     --------------------------      -----------------------
    |                          |    | A(1) = K              |
    |                          |    | A(2) = K              |
    | for J = 1,1,8 cycle      |    | A(3) = K              |
    |   A(J) = K               |    | A(4) = K              |
    | repeat                   |    | A(5) = K              |
    |                          |    | A(6) = K              |
    |                          |    | A(7) = K              |
    |                          |    | A(8) = K              |
    |                          |    | J = 8                 |
     --------------------------      -----------------------

3.1 Classification of optimisations

With a subject as complex as optimisation it is difficult to give a useful and definitive classification of the various possibilities for improving programs. In addition different authors have used many different terms to describe
optimisations which have been attempted [Aho, 1974; Lowry, 1969; Wichmann, 1977]. However most optimisations fall into one of the following four groups: Universal, Local, Global and Source. 3.1.1 Universal Optimisations These are those optimisations which are independent of any particular program, but which depend on the complete environment in which the program is to be compiled or executed. They are the policy decisions taken by the compiler writer during the initial design of the compiler, and include such things as the fixed use of registers (stack pointers, code pointers, link registers etc), the representations of language-defined objects (arrays, records, strings etc), and the standards for communication with external objects. In addition, universal optimisation must take into account such questions as: i Compilation speed or execution speed? If the compiler is to be used in an environment where programs are compiled roughly as often as they are executed, such as in a teaching environment, execution speed can be sacrificed for an increase in compilation speed.
ii Diagnostics? If the compiler is to produce code which will provide extensive checking and will give diagnostic information in the event of program failure, allowance must be made for the efficient checking of the program's behaviour and the maintenance of the recovery information used by the diagnostics. If highly optimised code is required these constraints may not apply. In the current state of the art universal optimisation is done by experience and guesswork; attempts at producing compiler-compilers which can approach the quality of hand-constructed compilers have not met with great success [Brooker, 1967; Feldman, 1966; Trout, 1967]. As will be shown later (4.5), minor changes in the universal optimisation can result in major changes in the performance of the generated code, and so rules made at this stage should be as flexible as possible to permit changes to be made in the light of experience. From the point of view of measurement universal optimisation provides the base level from which other optimisations are investigated. Roughly, the better the universal optimisation the less effective the other optimisations appear to be.
3.1.2 Local optimisations Local optimisations may be defined as those optimisations which may be performed during a sequential scan of the program using only knowledge of statements already processed. Not only are these optimisations reasonably simple to perform but they can have a major effect on the generated code. Indeed Wulf et al. state that "In the final analysis the quality of the local code has a greater impact on both the size and speed of the final program than any other optimisation" [Wulf, 1975]. 3.1.2.1 Remembering Remembering optimisations are those optimisations which can be applied to single statements in the light of information gathered during the compilation of previous statements. These optimisations depend on remembering the current state of the machine and applying this knowledge to subsequent statements. Their chief characteristic is that they are applied during a sequential scan of the program, and as such are reasonably cheap to implement and execute.
For example:

    -------------------
    | X = Y            |
    | if X = 0 start   |
    -------------------

on the PDP11 would generate:

    -------------------
    | MOV Y,X          |
    | BNE $1           |   remembering that the previous
    |                  |   line sets the condition code.
    -------------------

The most powerful of the remembering optimisations is that whereby the correspondence between values in registers and values in store is remembered and references to the store value are replaced by references to the register's value, register operations usually being smaller and faster than their store equivalents. Unfortunately there are several cases where this leads to worse code than the "obvious" version. For example, on the (PE3200) the code on the right is larger and slower than that on the left:

    -------------
    | X = 2      |
    | P = P<<2   |
    -------------

    --------------     ----------------
    | LIS  3,2    |    | LIS 3,2      |   pick up 2
    | ST   3,X    |    | ST  3,X      |   store it in X
    | L    1,P    |    | L   1,P      |   pick up P
    | SLLS 1,2    |    | SLL 1,0(3)   |   shift it by 2
    | ST   1,P    |    | ST  1,P      |   store it in P
    --------------     ----------------
In addition to keeping track of the changes in the state of the machine from the compilation of one statement to another, remembering also includes preserving this state or environment for later use when a label is encountered, either by merging the current environment with the environment saved at the jump to the label, or simply by restoring that latter environment when it is not possible for control to "fall through" from the statements immediately preceding the label. In all forms of remembering it is vital to be able to keep the information up-to-date, invalidating knowledge when it becomes false, a task which is complicated when it is possible for an object to be known by two or more apparently different descriptions as in the following code:

    ---------------------------
    | integer J, K             |
    | integerarray A(1:12)     |
    | integername P            |
    | P == J                   |
    | J = 1; K = 1             |
    ---------------------------

At this point P and J refer to the same location as do A(J) and A(K). Except in the most simple of cases all that can be done is to assume the worst and forget anything potentially dangerous after writing to unknown addresses.
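The bookkeeping involved might be pictured as in the sketch below; it is expressed in C purely for illustration, and the table layout and routine names are hypothetical rather than those used in the IMP77 code generators:

    /* Illustrative sketch: a table associating registers with the
       variables whose values they currently hold.                    */

    #include <string.h>

    #define NREGS 8

    static char reg_holds[NREGS][32];     /* "" means contents unknown */

    /* Record that register r now holds the value of variable v. */
    static void remember(int r, const char *v)
    {
        strncpy(reg_holds[r], v, sizeof reg_holds[r] - 1);
        reg_holds[r][sizeof reg_holds[r] - 1] = '\0';
    }

    /* Return a register already known to hold v, or -1 if none. */
    static int lookup(const char *v)
    {
        for (int r = 0; r < NREGS; r++)
            if (strcmp(reg_holds[r], v) == 0)
                return r;
        return -1;
    }

    /* A store through an unknown address (via an integername, or a
       computed array element) may alter any variable, so all
       knowledge must be discarded.                                   */
    static void forget_all(void)
    {
        for (int r = 0; r < NREGS; r++)
            reg_holds[r][0] = '\0';
    }

    /* A store to a known variable invalidates only those registers
       thought to hold it, other than the one just stored from.       */
    static void forget_variable(const char *v, int keep)
    {
        for (int r = 0; r < NREGS; r++)
            if (r != keep && strcmp(reg_holds[r], v) == 0)
                reg_holds[r][0] = '\0';
    }

The interesting decisions lie not in the table itself but in when forget_all must be called, which is exactly the aliasing difficulty illustrated by P, J, A(J) and A(K) above.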
By delaying the first store into T until after the conditional statement, and delaying the second store into T until after the label, both instructions can be combined, resulting in the code:

    -------------------
    | L   3,X          |
    | L   0,0(3)       |
    | BGE $1           |
    | SR  0,0          |
    | $1:ST  0,T       |
    | LIS 2,1          |
    | ST  2,0(3)       |
    | LR  1,0          |
    | {return}         |
    -------------------

This store itself can now be delayed until the return from the function, at which point, as T is local to the function and will be destroyed, the instruction can be deleted altogether. Section 3.2 gives a description of one way in which this sort of optimisation has been achieved.

3.1.2.3 Inaccessible code removal

In several cases compilers can generate code which will never be executed.
The common causes of this are either user-specified conditions whose truth is constant and are used to achieve some sort of "conditional compilation", or structural statements following unconditional jumps as below:

    ----------------------
    | if X = 0 start      |
    |   ERROR(1)          |
    |   return            |
    | else                |
    |   V = V-X           |
    | finish              |
    ----------------------

Here the branch instruction usually generated at the end of the if clause, to take control past the else clause, can never be executed. Such inaccessible code can be eliminated to shorten the program, but without directly affecting its speed of execution.

3.1.2.4 Peephole optimisations

Peephole optimisation [McKeeman, 1965] is the technique of examining small sections of generated code to make fairly obvious, but ad hoc, improvements. Many of the gains from the optimisation come from simplifying code sequences created by the juxtaposition of code areas which were produced separately.
For example (PE3200):

        Before                   After
    -----------------      -------------------
    | ST  4,X        |     | ST  4,X          |
    | L   4,X        |     |                  |
    |- - - - - - - - |     |- - - - - - - - - |
    | AR  1,2        |     |                  |
    | AHI 1,48       |     | AHI 1,48(2)      |
    -----------------      -------------------

3.1.2.5 Special cases

Special-case optimisations are those which make use of the particular structure and features of the target machine to control the way in which certain statements are implemented. For example:

                Obvious              Optimised
              ----------------      ------------
    (PDP11)   | MOV #0,X      |     | CLR X     |    X = 0
              |- - - - - - - -|     |- - - - - -|
    (PDP11)   | ADD #1,X      |     | INC X     |    X = X+1
              |- - - - - - - -|     |- - - - - -|
    (PE3200)  | LHI 1,NULL    |     | SR  0,0   |    S = ""
              | LHI 2,S       |     | STB 0,S   |
              | BAL 15,MOVE   |     |           |
              ----------------      ------------

These optimisations are very similar to peephole optimisations but are distinguished because they actively control the generation of code rather than passively alter code which has already been produced. In particular they avoid one of the drawbacks of peephole optimisation, namely that even though it can reduce fairly complex instruction sequences to a simpler form, the side-effects of generating the long form in the first place often
degrade the code. In the example of setting a string variable to the null string above the optimised form uses only one register, the value of which can be remembered. In the non-optimised version three registers are immediately altered and the knowledge of the contents of all of the registers may need to be forgotten unless the code generator knows how the MOVE routine works and can forget only those registers which it uses.

3.1.2.6 Algebraic manipulations

Algebraic optimisations are improvements brought about by using the algebraic properties of operators and operands, and include:

    . Folding, or compile-time evaluation     1+2  is replaced by  3
    . Removal of null operations              A+0  is replaced by  A
    . Using commutative properties            -B+A is replaced by  A-B
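A sketch of how folding and the removal of null operations might be applied to a single node of an expression tree is given below; it is written in C for illustration only, the node layout is hypothetical, and the children of the node are assumed to have been simplified already:

    /* Illustrative only: one step of algebraic simplification on a
       binary node whose children have already been simplified.       */

    typedef struct node {
        char         op;          /* '+', '-', '*', or 0 for a constant */
        long         value;       /* valid when op == 0                 */
        struct node *left, *right;
    } node;

    static int is_const(const node *n, long v)
    {
        return n->op == 0 && n->value == v;
    }

    static node *simplify(node *n)
    {
        if (n->op == 0)
            return n;

        /* Folding: both operands are known at compile time. */
        if (n->left->op == 0 && n->right->op == 0) {
            switch (n->op) {
            case '+': n->value = n->left->value + n->right->value; break;
            case '-': n->value = n->left->value - n->right->value; break;
            case '*': n->value = n->left->value * n->right->value; break;
            }
            n->op = 0;
            return n;
        }

        /* Removal of null operations: A+0, A-0, A*1. */
        if ((n->op == '+' || n->op == '-') && is_const(n->right, 0))
            return n->left;
        if (n->op == '*' && is_const(n->right, 1))
            return n->left;

        /* Commutativity: move a constant operand to the right so the
           checks above (and the code generator) see a canonical form. */
        if ((n->op == '+' || n->op == '*') && n->left->op == 0) {
            node *t = n->left; n->left = n->right; n->right = t;
            return simplify(n);
        }
        return n;
    }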
3.1.3 Global optimisations

Global optimisations may be defined as those improvements to the code which require knowledge of substantial parts of the program. In effect they are performed by examining numbers of statements in parallel, in contrast to the sequential scan required by local optimisation.

3.1.3.1 Restructuring

Restructuring optimisations are those optimisations which may be brought about by changing the order in which the code is laid out in memory without otherwise changing the nature of the code. As will be discussed later, there are many reasons why programs can be improved by changing the order of chunks of the code. A common reason is that many machines have conditional branch instructions with limited range while the unconditional branches have a much larger range. Hence if (A) represents a large number of statements and (B) represents a small number of statements, the program:

    ------------------
    | if X = 0 start  |
    |   {A}           |
    | else            |
    |   {B}           |
    | finish          |
    ------------------
could be improved by reordering as on the right (PDP11):

        original                reordered
    ------------------      ----------------
    | MOV X,R0        |     | MOV X,R0     |
    | BEQ $1          |     | BEQ $1       |
    | JMP $2          |     | {B}          |
    | $1: {A}         |     | JMP $2       |
    | BR  $3          |     | $1: {A}      |
    | $2: {B}         |     | $2:          |
    | $3:             |     |              |
    ------------------      ----------------
3.1.3.2 Merging

3.1.3.2.1 Forward merging

Forward merging, also somewhat confusingly referred to as "cross jumping" [Wulf, 1975], is the process whereby the point of convergence of two or more code sequences is moved back over common sub-sequences thus removing one of the sub-sequences, as in the case below.

    ----------------------
    | if X > Y start      |
    |   TEST(X, Y)        |
    | else                |
    |   TEST(Y, X)        |
    | finish              |
    ----------------------

     obvious code (VAX)        after merging
    ---------------------     --------------------
    | CMPL  X,Y          |    | CMPL  X,Y        |
    | BLE   $1           |    | BLE   $1         |
    | PUSHL X            |    | PUSHL X          |
    | PUSHL Y            |    | PUSHL Y          |
    | CALLS 2,TEST       |    |                  |
    | BRB   $2           |    | BRB   $2         |
    | $1:PUSHL Y         |    | $1:PUSHL Y       |
    | PUSHL X            |    | PUSHL X          |
    | CALLS 2,TEST       |    | $2:CALLS 2,TEST  |
    | $2:                |    |                  |
    ---------------------     --------------------

The simplest way to perform this optimisation is to take the code sequence about the point of a label and a reference to that label, and set two pointers: one, P1, to the unconditional jump and the other, P2, to the label. If the instructions immediately preceding the pointers are identical both pointers are moved back over that instruction. The label is
redefined at the new position of P2 and the instruction passed over by P1 is deleted. The process is repeated until either another label is found or two different instructions are encountered. The redefinition of the label involves creating a completely new label, leaving the old one untouched. This both prevents trouble with multiple references to the label and permits the optimisation to be attempted on them. As this optimisation simply causes the sharing of execution paths there is no direct gain in execution speed, but as the code size is reduced an indirect improvement may be achieved if the shorter code moves the label close enough to the reference to it for a shorter and usually faster jump instruction to be used. The optimisation obviously must be performed while labels and jumps are in a symbolic form, that is before code addresses have been resolved. This permits the merging of instructions which will eventually have program-counter relative operands and consequently be position dependent.
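The two-pointer process just described might be sketched as follows; the instruction list and the helpers new_label, delete_instr, same_instr and retarget are hypothetical, and the awkward case of the two sequences running into one another is ignored:

    /* Illustrative sketch of forward merging over a doubly linked
       list of symbolic instructions.                                 */

    typedef struct instr {
        char          text[32];     /* symbolic instruction            */
        int           is_label;     /* non-zero if this is a label     */
        struct instr *prev, *next;
    } instr;

    extern instr *new_label(void);                  /* assumed helpers */
    extern void   delete_instr(instr *i);           /* unlink and free */
    extern int    same_instr(const instr *a, const instr *b);
    extern void   retarget(instr *jump, instr *label);

    /* 'jump' is the unconditional jump (P1), 'label' its destination (P2). */
    static void merge_forward(instr *jump, instr *label)
    {
        instr *p2 = label;

        /* While the instruction above the jump matches the one above the
           (moving) label position, delete the copy above the jump and
           move the label position back over its copy.                    */
        while (!jump->prev->is_label && !p2->prev->is_label
               && same_instr(jump->prev, p2->prev)) {
            delete_instr(jump->prev);
            p2 = p2->prev;
        }

        if (p2 != label) {
            /* Plant a completely new label just before p2 and retarget
               the jump at it; the old label is left untouched for any
               other references.                                          */
            instr *l = new_label();
            l->prev = p2->prev;  l->next = p2;
            p2->prev->next = l;  p2->prev = l;
            retarget(jump, l);
        }
    }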
In this case as the code-generator is in complete control the optimisation can be very simple, although rather specific. The techniques for handling common sub-expressions have been investigated at length by several authors, but measurements indicate that in most programs expressions are so trivial the expense in finding common sub-expressions is not repaid by the resulting improvement in the generated code [Knuth, 1971]. The more general form of factoring can be seen in the transformation of the following statements: --------------------------------------- | if X = 0 then A = 1 else B = 2 | | if X = 0 then C = 3 else D = 4 | --------------------------------------- into: ----------------------- | if X = 0 start | | A = 1 | | C = 3 | | else | | B = 2 | | D = 4 | | finish | -----------------------
3.1.3.5 Loop optimisations

3.1.3.5.1 Iteration

Iteration is the process whereby the values in variables from previous iterations of a loop are used to calculate the new values for the current iteration, rather than calculating those values from scratch each time. One of the effects of this optimisation can be the reduction in strength of operations, such as changing multiplications into additions. In this context the IMP77 operators "++" and "--" may be used to great effect. Their action is to adjust the reference on the left by the number of items to the right, hence if X is an integer then X++1 is the next integer and X--2 is the integer two integers before X.

    -------------------------------
    | for J = 1,1,1000 cycle       |
    |   A(J) = J                   |
    | repeat                       |
    -------------------------------

Can be optimised to:

    -------------------------------
    | integername T                |
    | T == A(J)--1                 |
    | for J = 1,1,1000 cycle       |
    |   T == T++1                  |
    |   T = J                      |
    | repeat                       |
    -------------------------------
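The same transformation, rendered in C purely as an illustration of what the optimisation achieves (the array is assumed to be indexable from 1 to 1000, and the pointer arithmetic stands in for the IMP77 ++ and -- operators):

    /* Before: each iteration recomputes the element address from J,
       implying a multiplication (or shift) on every trip.             */
    void fill_obvious(int a[1001])
    {
        for (int j = 1; j <= 1000; j++)
            a[j] = j;
    }

    /* After: the address is carried over from the previous iteration,
       so the multiplication is reduced in strength to an addition.    */
    void fill_iterated(int a[1001])
    {
        int *t = &a[0];                /* T == A(J)--1 in the IMP form  */
        for (int j = 1; j <= 1000; j++) {
            t = t + 1;                 /* T == T++1                     */
            *t = j;                    /* T = J                         */
        }
    }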
              A                             B
    -------------------------     -------------------------
    |                       |     | TEMP = P//Q           |
    | while X > 0 cycle     |     | while X > 0 cycle     |
    |   W(X) = P//Q         |     |   W(X) = TEMP         |
    |   X = X-1             |     |   X = X-1             |
    | repeat                |     | repeat                |
    -------------------------     -------------------------

B will be faster than A if the loop is executed at least twice. If the loop is not executed at all (X <= 0) B will be much slower than A (by an alarming 80 microseconds on the 7/32).

3.1.3.5.3 Removal of invariants

This is the process whereby complex sub-expressions which do not change their values as the loop progresses are evaluated outside the loop and held in temporaries:

    ---------------------------------
    | for J = 1, 1, 1000 cycle       |
    |   A(J) = LIMIT-MARGIN          |
    | repeat                         |
    ---------------------------------

Can be optimised to:

    ---------------------------------
    | TEMP = LIMIT-MARGIN            |
    | for J = 1,1,1000 cycle         |
    |   A(J) = TEMP                  |
    | repeat                         |
    ---------------------------------

It is simply a special case of Holding.
3.1.4 Source optimisations Source optimisations [Schneck, 1973] are those optimisations which can be effected by changes in the source program. They can be sub-divided into three categories; machine-independent [Hecht, 1973; Kildall, 1973], machine-dependent, and tentative. Tentative optimisations are those which, while unlikely to make the code worse, may improve it on some machines. For example, most machines will handle the comparison "X<1" better if it is rewritten as "X<=0", where X is an integer variable.
With this knowledge the second statement can be compiled to:

    -------------------
    | MOV# 1,1,SZR     |
    | JMP  $1          |
    | LDA  1,D         |
    | STA  1,A         |
    | $1:              |
    -------------------

Immediately before the label $1 it is known that once again the value of A is in accumulator 1, and that the STA above the label is marked as pending as before. Following the definition of the label the environment before the jump to that label can be combined with the environment just before the label to give the new environment following the label. The information in this environment is that A is in accumulator 1 and that the same store is pending from both old environments. This allows the two marked store instructions to be removed and one store placed after the label (and once again marked as being "pending"). This gives the following code:

    ---------------------
    | LDA  0,B           |
    | LDA  1,C           |
    | AND  0,1           |
    | MOV# 1,1,SZR       |
    | JMP  $1            |
    | LDA  1,D           |
    | $1:STA 1,A         |
    ---------------------

A simple jump optimisation notices that the JMP passes over just one instruction and can therefore be removed by inverting the skip condition on the previous MOVe, giving:

    -------------------
    | LDA   0,B        |
    | LDA   1,C        |
    | AND   0,1        |
    | MOVZL 1,1        |
    | MOV#  1,1,SNR    |
    | LDA   1,D        |
    | STA   1,A        |
    -------------------

Finally, peephole optimisation combines the AND with the MOVZL giving ANDZL, and then combines this with the following MOV# to give the complete code sequence as:

    -------------------
    | LDA   0,B        |
    | LDA   1,C        |
    | ANDZL 0,1,SNR    |
    | LDA   1,D        |
    | STA   1,A        |
    -------------------

The most interesting thing to notice about this particular sequence of optimisations is that with the possible exception of the removal of the marked STA instructions, the final code can be generated very simply with local optimisations.
4 The design of the compiler

This section describes the features of the compiler which have had an influence on the form of the intermediate code.

4.1 General structure

One of the aims of this type of compilation strategy is to simplify the production of compilers, and a successful technique for simplifying programs is to divide them into several communicating modules, each largely independent of the others but with well-defined interfaces between them. At the highest level a compiler can be split up into three major parts:

1  A language processor, which deals with the language-dependent parts such as parsing, semantic checking, and error reporting.

2  A code generator, which takes the decomposed form of the program as generated by 1 above, and constructs the appropriate code sequences to perform the required functions.

3  An object-file generator, which builds an object-file from the code sequences produced by 2, in the form required by the system which is to execute the program.

Commonly, the first two parts of this scheme are combined into one program which generates as its output an
assembly-language source file corresponding to the original program. The third part then becomes the standard system assembler. This approach clearly simplifies the production of the compiler, as one part, the assembler, is provided already and can ease the problems of checking the compiler because the code it generates is presented in a well-known form. Despite these advantages it was rejected for the following reasons: 1 In order that assembly language can be generated, the compiler must have an internal form of the instructions which is changed into text, processed by the assembler, and finally converted into the machine representation. These transformations can be eliminated if the compiler works directly with the machine representations. 2 In general, the system-provided assembler will be expecting to receive a much more powerful language than the rather stereotyped text produced by compilers. This will certainly degrade the performance of the assembler. A solution to this is to produce a cut-down version of the assembler which only recognises those constructs generated by the compiler. However, producing a new assembler removes one of the reasons for choosing this route, namely, not requiring extra work in writing the object-file generator.
3  As will be seen later (section 4.7), even after the code sequences have been produced there remain several optimisations which can be performed using knowledge gained during the production of those sequences, for example, generating short forms of jump instructions when the distance between the jump and its destination is small enough (one such scheme is sketched at the end of this section). While in certain cases these optimisations can be performed by a standard assembler it is unlikely that the structure of the code-generator would be as simple as if a special-purpose object-file generator were available.

The main interface in such a system is clearly that between the language and machine dependencies, as most languages are largely machine-independent. It is this interface between the language-dependent and machine-dependent parts of the compiler which is termed the INTERMEDIATE CODE. In the following discussion it is assumed that the reader has a reasonable understanding of the structure of the final form of I-code, a definition of which may be found in Appendix A2.
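Returning to the short-jump optimisation mentioned in point 3 above, one common way for an object-file generator to exploit its knowledge of the code is a relaxation pass over the symbolic jumps. The sketch below is purely illustrative, written in C, with a hypothetical instruction record and invented size and range values:

    /* Illustrative sketch of short/long jump selection. Instructions
       only ever grow, so the loop is guaranteed to terminate.         */

    typedef struct {
        int is_jump;        /* non-zero for a jump instruction         */
        int target;         /* index of the destination instruction    */
        int size;           /* current size estimate in bytes          */
        int addr;           /* assigned address                        */
    } item;

    enum { SHORT_JUMP = 2, LONG_JUMP = 4, SHORT_RANGE = 127 };

    static void assign_addresses(item *code, int n)
    {
        int changed;
        do {
            /* Lay the code out using the current size estimates. */
            int a = 0;
            for (int i = 0; i < n; i++) {
                code[i].addr = a;
                a += code[i].size;
            }

            /* Grow any short jump whose destination is now out of range. */
            changed = 0;
            for (int i = 0; i < n; i++) {
                if (!code[i].is_jump || code[i].size == LONG_JUMP)
                    continue;
                int d = code[code[i].target].addr - code[i].addr;
                if (d > SHORT_RANGE || d < -SHORT_RANGE - 1) {
                    code[i].size = LONG_JUMP;
                    changed = 1;
                }
            }
        } while (changed);
    }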
code is to be generated by several different language processors, a simple intermediate code which is easy to produce may well be more attractive. As I-code was intended for optimisation a high-level code was chosen. In addition, as it was hoped that the code could eventually be used in different language processors, it was decided to keep the structure simple. The complete compilation process may be thought of as a sequence of transformations working from the source program to the final object program via a number of intermediate representations. As the transformations are applied the representations become less dependent on the source language and more dependent on the target machine. In order to simplify the code-generator as much as possible the intermediate code must lie as far from the source language as is possible without straying from the objectives set out below. 4.2.1 Objectives One of the dangers in designing an intermediate code is that of building into it old techniques and standard expansions of source constructions, which while they may be tried and tested cannot in any way be said to be "the only solutions" or even "the best solutions". One of the intentions behind the design of I-code was to
permit the use of varied implementation strategies. In the same way that the only practical definition of a "good" programming language is that it fits the style of the particular programmer using it, so the measure of the power of an intermediate code must include the ease with which it can adapt to an existing style of code-generator writing. Inevitably, practical constraints prevent total generality; the most general form of a program is a canonical form of itself, but this is little help in compiling it. It follows that the intermediate code, while remaining true to the original program and distant from "real" machines, must provide enough simplification to make the task of code-generation as easy as possible without inhibiting optimisation. From the start it was appreciated that an intermediate code suitable for use in optimising compilers would necessarily require more processing than a code such as O-code which was aimed at a quick implementation. The original hope was that although each machine-dependent code generator would not be small, typically about 3000-4000 IMP statements, large portions of one could be taken as a basis for a new implementation. This has proved to be the case, and provision of an existing code-generator as a template greatly simplifies the task of creating a new one (section 6.4.1).
4.2.1.1 Scope The first and most fundamental objective in the design of I-code was that it should support the compilation of one specific language, IMP-77, on many different machines. Considerations of using the code to implement other languages were secondary to this main aim, but were used to bias the design when a choice had to be made from several otherwise equally suitable possibilities. In retrospect, a few areas of the code could have been made more general without significant overheads in the code generators, mainly in the area of data descriptor definitions, but a detailed discussion of one intermediate code supporting several languages is beyond the scope of this work. In direct contrast to many intermediate codes, I-code was not designed with the intention of making it convenient to interpret; the prime aim was to permit compilation into efficient machine-code. Nevertheless it is possible to "compile" I-code into threaded code [Bell, 1973] or a form suitable for interpretation, either by generating a conventional interpretive code or by leaving the code in more-or-less its current form but with labels resolved and descriptors expanded into a more convenient representation.
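As an illustration of the threaded-code alternative mentioned above, the fragment below shows the essential mechanism; it is written in C purely for illustration, the operation routines are invented, and no connection with the real I-code items is implied:

    /* A minimal threaded-code fragment: the "compiled" program is no
       more than a table of routine addresses driven by an inner loop. */

    #include <stdio.h>

    typedef void (*op)(void);

    static int stack[64];
    static int sp;

    static void push2(void) { stack[sp++] = 2; }
    static void push3(void) { stack[sp++] = 3; }
    static void add(void)   { sp--; stack[sp - 1] += stack[sp]; }
    static void print(void) { printf("%d\n", stack[--sp]); }

    /* The threaded form of "print 2 + 3"; a translator would emit
       this table from the intermediate code.                          */
    static op program[] = { push2, push3, add, print, NULL };

    int main(void)
    {
        for (op *pc = program; *pc != NULL; pc++)
            (*pc)();                /* thread through the routine table */
        return 0;
    }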
For example, the following two program fragments are semantically identical:

                A                              B
    --------------------------     --------------------------
    |                        |     | P = 0                  |
    |                        |     | cycle                  |
    | TEST for P = 1, 1, 10  |     |   P = P+1              |
    |                        |     |   TEST                 |
    |                        |     | repeat until P = 10    |
    --------------------------     --------------------------

However, in "B" the information that the fragment contains a simple for construction, while not completely lost, has been scattered through the code, and this dilution of information will increase the complexity of any code-generator wishing to handle for loops specially. To leave open all avenues for optimisation it is necessary, therefore, that all of the semantic information in the source program is preserved in a compact form in the I-code. One sure way of achieving this property is to design the code in such a way as to allow the regeneration of the source program, or at least a canonical form of it which is not significantly different from the original. In this context insignificant differences are the removal of comments and the standardisation of variant forms of statements, such as:

    NEWLINE if COUNT = 0
and:
    if COUNT = 0 then NEWLINE
frequently pun on addresses and integer values. Several later versions of the codes have attempted to solve these problems by parameterising the intermediate code generator so that the characteristics of the target machine may be used to modify the code which is produced. However, they still have built in to them assumptions about how the objects can be addressed. There are so many constraints which can be imposed on the code to be generated, such as operating system requirements and conventions for communicating with the run-time environment, that a parameterised first phase could not be expected to generate code which was well-suited to every installation. The authors of JANUS [Coleman, 1974] write that they believe that the approach of using a parameterised intermediate code "... is a dead end, and that the adaptability must come in the translation from the intermediate language to machine code". 4.2.1.4 Simplification For the complexity of the machine-dependent phases of compilation to be kept as low as possible the machine-independent phase must do as much work as possible while keeping within the constraints imposed by the previous objectives. One way of simplifying the intermediate code is for certain high-level constructions to be expanded into
lower-level constructions, but only when there is just one expansion possible under the rules of the language, and that expansion does not scatter information which may be of later use. The most obvious case of such expansion is in dealing with complex conditional clauses such as:

    ---------------------------------------------------
    | if (A=B and C#D) or (E<F and G>H) then X else Y  |
    ---------------------------------------------------

IMP-77 specifies that the condition will only be evaluated as far as is necessary to determine the inevitable truth or falsity of the condition, and so, bearing in mind the modifications to be discussed in section 4.6.1, the statement can be represented more simply as:

    ---------------------------------
    | if A # B then ->L1             |
    | if C # D then ->L2             |
    | L1: if E >= F then ->L3        |
    |     if G <= H then ->L3        |
    | L2: X                          |
    |     ->L4                       |
    | L3: Y                          |
    | L4:                            |
    ---------------------------------

This expansion is tricky and notoriously error prone, and therefore is best done once and for all in the common phase. Similarly it is possible to expand all simple control structures into their equivalent labels and jumps, providing that the structural information is not lost thereby.
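Returning to the conditional clause above, the expansion can be performed once and for all by a simple recursion over the condition. The sketch below is illustrative only: it is written in C, and the condition node, new_label, define_label and emit_test are all hypothetical, but it embodies the rule applied to "and" and "or":

    /* Expand a condition into jumps: control reaches label 'tl' if the
       condition is true and 'fl' if it is false.                        */

    typedef struct cond {
        enum { AND, OR, NOT, TEST } kind;
        struct cond *left, *right;       /* operands of AND, OR, NOT     */
        const char  *comparison;         /* for TEST, e.g. "A = B"       */
    } cond;

    extern int  new_label(void);                    /* assumed helpers   */
    extern void define_label(int l);
    extern void emit_test(const char *c, int true_label, int false_label);

    static void expand(const cond *c, int tl, int fl)
    {
        switch (c->kind) {
        case TEST:
            emit_test(c->comparison, tl, fl);
            break;
        case NOT:                         /* swap the two destinations   */
            expand(c->left, fl, tl);
            break;
        case AND: {                       /* right part is only reached  */
            int m = new_label();          /* if the left part is true    */
            expand(c->left, m, fl);
            define_label(m);
            expand(c->right, tl, fl);
            break;
        }
        case OR: {                        /* right part is only reached  */
            int m = new_label();          /* if the left part is false   */
            expand(c->left, tl, m);
            define_label(m);
            expand(c->right, tl, fl);
            break;
        }
        }
    }

The partial-evaluation rule of IMP-77 falls out directly: each elementary test jumps straight to its true or false destination, so no more of the condition is examined than is needed.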
4.3 Code layout and addressing 4.3.1 Nested procedure definitions A common feature of programming languages is the ability to nest the definition of a procedure within another procedure. In addition, several languages imply the definition of procedures within single statements, as in the case of name parameters in ALGOL-60 where the parameter which is actually passed can be a reference to a "thunk", a procedure to evaluate the parameter. With such nesting provision must be made for preventing the flow of execution from "falling through" into the procedure from the preceding statements, and this is usually accomplished by planting at the start of the procedure a jump to the statement following the end. While this is simple to implement it does introduce extra instructions which are not strictly necessary. With user-defined nested procedures the overhead can be minimised if a number of procedures is defined, as one jump instruction can be used to skip them all. Unfortunately thunks will be generated throughout the code in a more-or-less random way, giving little opportunity to coalesce the jumps. Even if the extra execution time caused by these jumps is insignificant (the jumps round thunks defined in loops get executed repeatedly), the code which they are skipping
stretches the code in which they are nested. On machines with fixed-size jump instructions which can cover the whole machine, such as the DEC PDP 10, the stretching causes no problems, but if the addressing is limited, or if several different sizes of jump instruction are provided, the presence of the nested procedure can result in more code being produced later in the generation of large jumps. 4.3.2 Paged machines On paged machines the overall performance of a program does not depend solely on the efficiency of the code produced by the compiler but includes a factor depending on the locality of references made by the executing program. Traditionally this locality has been improved by monitoring the execution of the program and then re-ordering parts of it in the light of the measurements. Unfortunately not all operating systems provide the user with convenient tools to enable the measurement to be done, leaving only ad hoc methods or intuition for guidance. Without careful control it is all too easy to move one procedure to improve references to it and thereby cause another piece of code to cross page boundaries and counteract any gains in paging performance. Even if the user can obtain the necessary information, a slight change in the program can invalidate the changes. Notwithstanding these problems it is evident that by careful
structuring of a program significant gains in paging behaviour can be obtained and so this option should not be pre-empted by the intermediate-code (as does Z-CODE which automatically reorders the definitions of procedures). The possibility of automatic improvement of paging behaviour was investigated by Pavelin who showed that the paging characteristics of a program can be improved by an automatic reordering of the code [Pavelin, 1970]. Pavelin's thesis describes the breaking-up of a program into "chunks", defined by branches and the destinations of branches. At each chunk boundary extra instructions are planted to cause the updating of a "similarity array" which records the dynamic characteristics of the program. After several runs the similarity arrays are merged and the result is used to specify a reordering of the chunks which should improve the paging performance. In test cases the working-set size of the code was reduced by as much as 40%. The thesis also went on to say that the various compilation problems associated with this "... can be alleviated by operating on an intermediate code which is machine independent with symbolic code addresses".
For example, on the INTERDATA 7/16 a procedure at the outermost textual level uses register 15 to access its local stack frame, giving the exit sequence:

    -------------------
    | LM   7, 4(15)    |
    | BFCR 0, 8        |
    -------------------

but a procedure nested within this would use register 14 thus:

    -------------------
    | LM   7, 4(14)    |
    | BFCR 0, 8        |
    -------------------

It follows that the signal routine must be told which base register to use at each stage of the recovery. This can be done either by planting code in the entry and exit sequences of each procedure, or by keeping a static table associating procedure start and finish addresses with the appropriate base register. The first method is poor as it imposes a run-time overhead on all procedures, whether they trap events or not. The second method is better but can be complicated if procedures are nested as the start-finish addresses alone no longer uniquely define the procedure. One solution is to cause all procedures which use the same exit sequence to be loaded into distinct areas, and to associate the recovery information for the signal routine with each area. This reduces the static overhead to a few words per area, rather than a few words per procedure.
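The static-table method might take a form such as the following sketch; it is written in C purely for illustration, the addresses and field names are invented, and for nested procedures the entries are assumed to be searched innermost first:

    /* Illustrative sketch: the signal routine finds the frame base
       register for the procedure containing a given code address.     */

    typedef struct {
        unsigned start, finish;   /* code addresses covered by the body */
        int      base_register;   /* register addressing its frame      */
    } recovery_entry;

    static const recovery_entry recovery_table[] = {
        /* hypothetical contents */
        { 0x0140, 0x01A6, 14 },   /* nested procedure                   */
        { 0x0100, 0x0200, 15 },   /* outermost-level procedure          */
    };

    static int base_register_for(unsigned pc)
    {
        int n = sizeof recovery_table / sizeof recovery_table[0];
        for (int i = 0; i < n; i++)
            if (pc >= recovery_table[i].start && pc < recovery_table[i].finish)
                return recovery_table[i].base_register;
        return -1;                /* not within any known procedure      */
    }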
ii  Array references with constant subscripts need no address calculations at run-time. For example using V as declared above, the element V(2) is immediately addressable as the halfword with displacement "a+2 + 2*2" from LNB.

iii In certain more general cases when the subscript is a variable, access can be simplified by remembering previous calculations. For example, the address of the array element V(X) is

    ---------------------------------------
    | addr(V(0)) + X*size of each element  |
    ---------------------------------------

In the example above this becomes:

    ---------------------------------------
    | LNB+a+2 + X*size of each element     |
    ---------------------------------------

which can be rearranged to:

    ---------------------------------------
    | a+2 + (X*size of each element+LNB)   |
    ---------------------------------------

Hence the following code could be produced (7/16):

    ----------------------
    | V(X) = 0            |
    |                     |
    | LH  1,X(LNB)        |   pick up X
    | AHR 1,1             |   *2 (2 bytes per integer)
    | AHR 1,LNB           |   add in LNB
    | SHR 0,0             |   get zero
    | STH 0,a+2(1)        |   store in V(X)
    ----------------------

Noting that the value now in register 1 (X*size+LNB) only depends on the size of each element, X, and the local name base, it is clear that register 1 can be used to address the X'th element of any integer array of one dimension and constant bounds declared at the current level.
Hence if the array W(1:12) were declared immediately after Y in the example above, while register 1 is not changed W(X) can be addressed as a+2002(1). On the other hand, a machine with limited store cover, such as the Data General NOVA which only has an eight-bit displacement, will almost certainly force the array to be implemented as an immediately addressable pointer which is initialised at run-time to the address of storage claimed explicitly.

       LNB
        |
        v
         .---.---.---.
    - - -| X | V | Y |- - -
         .---.---.---.
               |
               |    .------.------.          .--------.
               +--->| V(0) | V(1) | - - - -  | V(999) |- - -
                    .------.------.          .--------.
With this organisation the address of V(X) will be:

    ------------------------------
    | V + X*size of each element  |
    ------------------------------

and there is little that can be done by rearranging the expression to improve on the "obvious" code (7/16):

    -------------------
    | V(X) = 0         |
    |                  |
    | LH  1,X          |   pick up X
    | AHR 1,1          |   double it
    | AH  1,V          |   add in addr(V(0))
    | SHR 0,0          |
    | STH 0,0(1)       |
    -------------------

Not only is this second code sequence longer than the first by two bytes, but it will execute more slowly as the second addition involves a store reference whereas the equivalent instruction in the first sequence uses a register. In both cases, however, some simplification can be done if the subscript is an expression of the form:

    ------------------------------
    | X plus or minus CONSTANT    |
    ------------------------------

in which case the constant can be removed from the subscript expression evaluation and added into the final displacement. For example (7/16):

    -------------------------
    | V(X-7) = 0             |
    |                        |
    | LH  1,X                |   pick up X
    | AHR 1,1                |   double it
    | AHR 1,LNB              |   add in LNB
    | SHR 0,0                |   get zero
    | STH 0,a+2+(-7)*2(1)    |
    -------------------------
Unfortunately even this optimisation may not be available. For example, the ICL 2900 series performs array accesses through a DESCRIPTOR REGISTER, and the extra displacement cannot be added into the instruction. Also some machines, such as the IBM 360, only permit positive displacements in instructions. The examples above pose the following problem: if the intermediate code is to know nothing of the target machine it cannot know the best way to declare the array, nor the best way to access it. Therefore the code must always produce the same sequences for array declarations and array accesses. It follows that these sequences must remain quite close to the original source and not include any explicit address calculations.

As another example, the DEC PDP11 range has a hardware stack which grows with decreasing store addresses. Because of this it could be convenient to allocate storage for variables in that order, from large addresses to small addresses. However, in several cases it may be necessary to force objects to be created in increasing addresses, such as when program structures are to be mapped onto hardware-defined structures in memory, resulting in an implementation which must be able to create similar objects in different ways depending on the context.
Finally, some machines provide instructions in which the displacement of the operand is scaled before use, depending on the size of that operand. The GEC 4080 is such a machine, with instructions such as: LDB 1 load byte <1> LD 1 load halfword, bytes <2> & <3> LDW 1 load fullword, bytes <4>,<5>,<6> & <7> When producing code for such machines it is convenient to allocate all the local objects of the same size in particular areas, and then order the areas in increasing order of the size of the objects they contain. This permits fuller use of the available displacement field in the instructions.
The solution to these problems which was chosen in I-code was to define a DESCRIPTOR for each object to be manipulated. On input to the code-generator descriptors are converted from their machine-independent form to a new form appropriate to the target machine. As all subsequent reference to the object will be through descriptors the code produced will automatically reflect the decisions made at the time the descriptors were created. As has been discussed previously, it may be possible to remove the overhead in setting up addressability for local variables and parameters if the parameters can be held in registers and the local variables are never referenced. After examining many procedures which do use local variables it is clear that a large number of them do not need the complete overhead in setting up a local frame base as they could use the workspace pointer (stack pointer) instead. The criterion is that the position of the locals relative to the workspace pointer must be known at compile time. This reduces to the procedure not having any objects with computed sizes (arrays with computed bounds, for example) and no calls on procedures which access the locals as their global variables.
Consider the compilation of the following procedure on the PDP11: ---------------------------------------- | routine MARK(record(cellfm)name CHAIN) | | integer N | | N = 0 | | while not CHAIN == NULL cycle | | N = N+1 | | CHAIN_INDEX = N | | CHAIN == CHAIN_LINK | | repeat | | end | ---------------------------------------- The code normally produced for this routine would be: -------------------------- | MOV LNB,-(SP) | remember old LNB | MOV DS,-(SP) | remember DS | MOV R0,(DS)+ | save the parameter | MOV DS,LNB | set up local addressing | ADD #20,DS | reserve local space | CLR 10(LNB) | N = 0 | $1: MOV -2(LNB),R1 | test CHAIN | BEQ $2 | branch if NULL | INC 10(LNB) | N = N+1 | MOV 10(LNB),2(R1) | CHAIN_INDEX = N | MOV (R1),-2(LNB) | CHAIN == CHAIN_LINK | BR $1 | repeat | $2: MOV (SP)+,DS | restore DS | MOV (SP)+,LNB | restore LNB | RTS PC | return --------------------------
however, by using workspace pointer (DS) relative addressing this reduces to:

--------------------------
|     MOV R0,(DS)+       |
|     TST (DS)+          | reserve local space
|     CLR -2(DS)         | N = 0
| $1: MOV -4(DS),R1      | test CHAIN
|     BEQ $2             |
|     INC -2(DS)         | N = N+1
|     MOV -2(DS),2(R1)   | CHAIN_INDEX = N
|     MOV (R1),-4(DS)    | CHAIN == CHAIN_LINK
|     BR  $1             |
| $2: SUB #4,DS          | restore DS
|     RTS PC             | return
--------------------------

This optimisation can be performed quite simply by the third phase of compilation. In the interface between the second and third phases the code sequences generated by the second phase are made up of items of the form:

     <type> <VALUE>

where <type> describes where <VALUE> is to be put, for example in the code area or in the private data area. To achieve the workspace-pointer-relative addressing extra types are introduced which specify that the associated value is the displacement of a local variable from LNB. Other codes are needed to be able to modify the operation part of the instruction which uses the displacements but these will be ignored here as they cause no difficulty and would just obscure the discussion. In addition, an extra <modify DS> item is output whenever DS is explicitly altered (as when parameters are stacked using MOV ??,(DS)+). By default the third phase will treat these extra types as being exactly equivalent to <code area> types, and will generate the first
sequence of code. However, if, when the end of the procedure is processed, the second phase discovers that no dynamic objects or dangerous procedure calls were generated, it marks the end of the procedure accordingly (in the same way as described in section 4.7.2). This mark instructs the third phase to relocate all values with the appropriate type so as to make them relative to DS. The <modify DS> types are used to keep the third phase's idea of the current position of DS in step with reality.
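As a concrete illustration of this interface, the following Python fragment models the item stream and the DS-relative relocation in a highly simplified form; the type names, and the arithmetic relating the local name base to DS, are inventions for the example only.

--------------------------------------------------------------------------
# Simplified model of the <type> <VALUE> stream between phases two and three.
CODE, LOCAL_DISP, MODIFY_DS = "code", "local displacement", "modify DS"

def third_phase(items, relocate, frame_size):
    """items: (type, value) pairs; relocate: the 'safe procedure' mark;
    frame_size: distance from the local name base to DS once locals have been
    reserved.  <modify DS> items keep the running idea of DS up to date."""
    ds = frame_size
    out = []
    for kind, value in items:
        if kind == MODIFY_DS:
            ds += value                      # DS explicitly altered, e.g. (DS)+
        elif kind == LOCAL_DISP and relocate:
            out.append((CODE, value - ds))   # re-express relative to DS
        else:
            out.append((CODE, value))        # treat exactly as a code item
    return out
--------------------------------------------------------------------------

With relocate set false the local displacements pass through unchanged and the first (frame-base-relative) sequence is produced; with it set true they become displacements from DS, as in the second sequence.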
4.5 Procedure entry and exit.

IMP is heavily based on the use of procedures; indeed, the only method of communicating with the controlling environment is by means of procedure calls. Also the techniques of structured programming result in the extensive use of procedures. Clearly when writing a compiler for such languages much thought must be given to making procedure entry and exit (and the associated passing of parameters) as efficient as possible.

4.5.1 User-defined procedures

The usual technique for procedure entry and exit is to have standard preludes and postludes which cover all the different types of procedure. For example the EMAS IMP code sequences [Stephens, 1974] are (ICL4/75):

-------------------------
|      STM 4,14,16(WSP)  | save the current environment
|      BAL 15, PROC      | enter the procedure
|      .                 |
| PROC ST  15,60(WSP)    | save the return address
|      LR  LNB,WSP       | set up the local stack frame
|      LA  WSP,***(WSP)  | claim local stack space
|      BALR 10,0         | set up code addressability
|      .                 |
|      .                 |
|      LM  4,15,16(LNB)  | restore calling environment
|      BCR 15,15         | return
-------------------------
While this has proved to be convenient to generate and efficient to execute it has one major problem: part of the housekeeping of the procedure entry is performed at the call itself. This seems undesirable for two reasons:

i  Procedures are generally called more often than they are defined. If part of the housekeeping of procedure entry is done at the call that code will be duplicated at each call, thus increasing the size of the program. Putting that code within the procedure reduces the size overhead.

ii If the knowledge of what housekeeping needs to be done for procedure entry is needed outside the procedure it becomes impossible to alter the entry and exit sequences to suit the actual procedure. In particular on certain machines it is possible to remove the entry and exit sequences altogether when the procedures are simple enough.
If the 4/75 compiler moved the environment-saving STM instruction into the body of the procedure the storing of the return address would be performed automatically: ------------------------- | BAL 15,PROC | |- - | | PROC STM 4,15,16(WSP) | | LR 8,WSP | | .. .. | ------------------------- This not only saves four bytes per call, very important on a machine with a very severely limited immediate addressing range, but also reduces the overhead in entering the procedure by one instruction. A further modification would be to pass one or more of the parameters in the registers, leaving the way open for remembering that fact inside the procedure.
Hence a call could be reduced from: ------------------------ | L 1,X | | ST 1,64(WSP) | | L 2,Y | | ST 2,68(WSP) | | BAL 15,PROC | PROC(X, Y) |- -| | PROC STM 4,15,16(WSP) | | .. .. | ------------------------ to: ------------------------ | L 0,X | | L 1,Y | | BAL 15,PROC | |- -| | PROC STM 4,1,16(WSP) | | .. .. | ------------------------ The ability to determine exactly how parameters are to be passed can be of crucial importance in the efficiency of the procedure mechanism.
When compiling for the PDP11 the obvious calling sequence for a procedure with two integer value parameters would be:

-------------------
| MOV X,-(SP)     |
| MOV Y,-(SP)     |
| JSR PC,PROC     |
-------------------

Unfortunately this produces problems inside the procedure, as the return address, stacked by JSR, is too far down the stack to permit the use of the RTS instruction to return, for this would leave on the stack the space used by the parameters. Nor can the stack be adjusted before the return and the return then made indirectly through a location beyond the stack pointer, as space beyond the stack pointer must be considered volatile, being used by interrupt handling. Extra instructions are needed either at the call or inside the procedure to adjust the stack; the JSR instruction may well not be "a beauty" as claimed by some implementors [Bron, 1976]. A MARK instruction has been introduced in an attempt to overcome this problem, but it is far from helpful as it imposes an arbitrary register convention and puts all of the overhead on the call rather than on the procedure itself.
On the other hand if all of the parameters can be passed in registers, the JSR will put the return address on a clear stack permitting the use of RTS for the return. As in practice most procedures have few parameters, usually only one or two, this can give a large saving.

As an example of the power of being able to alter entry and exit sequences consider a recursive implementation of the IMP routine SPACES:

-----------------------------
| routine SPACES(integer N) |
| return if N <= 0          |
| SPACES(N-1)               |
| SPACE                     |
| end                       |
-----------------------------

On the PDP10 the straightforward coding for this would be:

---------------------------
|         MOVE  0, X       | pick up X
|         MOVEM 0, 3(SP)   | assign the parameter
|         PUSHJ SP, SPACES | call SPACES
|-                        -|
| SPACES: MOVEM LNB,1(SP)  | save old frame base
|         MOVE  LNB,SP     | pick up new frame base
|         ADDI  SP,3       | reserve new frame
|         SKIPLE 1,2(LNB)  | load, test & skip if X<=0
|         JRST  LAB1       | jump to LAB1
|         MOVE  SP,LNB     | restore stack pointer
|         MOVE  LNB,1(SP)  | restore old frame base
|         POPJ  SP         | return
| LAB1:   SOJ   1, 0       | X-1 -> ACC1
|         MOVEM 1,3(SP)    | assign parameter
|         PUSHJ SP,SPACES  | call SPACES
|         PUSHJ SP,SPACE   | call SPACE
|         MOVE  SP,LNB     | restore stack pointer
|         MOVE  LNB,1(SP)  | restore old frame base
|         POPJ  SP         |
---------------------------
By applying the optimisations of passing the parameter in an accumulator (called ARG) and remembering that the parameter is in this accumulator on entry to the procedure, the code reduces to: --------------------------- | MOVE ARG,X | pick up X | PUSHJ SP, SPACES | call SPACES |- -| | SPACES: MOVEM LNB,1(SP) | | MOVE ARG,2(SP) | assign the parameter | ADDI SP,3 | | JUMPG ARG, LAB1 | ->LAB1 if ARG > 0 | MOVE SP,LNB | restore stack pointer | MOVE LNB, 1(SP) | | POPJ SP | return | LAB1: SOJ ARG, 0 | parameter = ARG-1 | PUSHJ SP,SPACES | | PUSHJ SP,SPACE | | MOVE SP,LNB | | MOVE LNB,1(SP) | | POPJ SP | ---------------------------
On inspection it is clear that the local stack frame (pointed at by LNB) is never used within the procedure except by the entry and exit sequences. Hence by reducing those sequences to the absolute minimum the code becomes --------------------------- | MOVE ARG,X | | PUSHJ SP, SPACES | |- -| | SPACES: JUMPG ARG, LAB1 | | POPJ SP | | LAB1: SOJ ARG, 0 | | PUSHJ SP,SPACES | | PUSHJ SP,SPACE | | POPJ SP | --------------------------- Finally, an opportunistic optimisation may be performed [Knuth, 1974; Spier, 1976] by noticing that the final two instructions may be combined so that the procedure SPACE uses the return address pushed onto the stack for the return from SPACES. This results in the tightest form of the code: ---------------------------- | MOVE ARG,X | | PUSHJ SP, SPACES | |- -| | SPACES: JUMPG ARG, LAB1 | | POPJ SP | | LAB1: SOJ ARG, 0 | | PUSHJ SP,SPACES | | JRST SPACE | ---------------------------
The final steps in this optimisation can only be performed once the body of the procedure has been compiled. In order that the correct (in this case non-existent) entry sequence can be used an extra pass over the object code is necessary. This pass can be combined with the process of adjusting labels and jumps which is carried out in the third phase of compilation described in section 4.7. The code generator can mark the position where an extra sequence is required and at the end of the procedure can inform the third phase of any salient features found in the body. The third phase can then decide on the best entry and exit sequences to use. This ability to tailor the "housekeeping" parts of procedures can be used in many circumstances to limit the inclusion of code which is needed to handle rare constructions to those procedures which use the feature. As an example of this consider the ICL 2900 series. The machines of the series are designed around a hardware stack, which resides in one, and only one, segment of the user's virtual memory, and thus limits this data space to 255K bytes. In order to be able to handle programs using very large arrays space must be available off-stack in another segment or set of consecutive segments. The maintenance of this extra data space will require instructions to be executed on entry to and on exit from procedures which claim space from it, but not from those which only use space from the stack. These extra instructions can be added to the procedure in a simple
manner by the third phase as it now controls the form of the procedure when all the necessary information is available.

For these optimisations to be performed the intermediate code must not lay down rules for procedure entry and exit; rather it should simply mark the points at which suitable code is required. An additional consideration in the design of the I-code for procedure entry and exit is the requirement of some machines for a "pre-call" to be made to prepare the hardware stack for parameters prior to their evaluation and assignment.
For example (ICL2900):

-------------------
| PROC(1, 2, 3)   |
|-               -|
| PRCL 4          | pre-call
| LSS  1          | load 1
| SLSS 2          | stack it and load 2
| SLSS 3          | stack it and load 3
| ST   TOS        | store it on Top Of Stack
| RALN 8          | raise the Local Name Base
|                 | to point to the new frame
| CALL PROC       |
-------------------

Following these considerations the form of procedure call chosen for I-code was:

-----------------
| PROC P         | stack procedure descriptor
|                | \
| {stack param}  |  : repeated for each parameter
| ASSPAR         | /
| ENTER          | enter the procedure
-----------------

ASSPAR causes the value described on the top of the stack to be assigned to the next parameter, identified by the procedure descriptor second on the stack, using either ASSVAL or ASSREF as appropriate. In order to pass some of the parameters in registers all that need be done is for the initial processing of the descriptors for those parameters to define them as the appropriate registers. PROC can then "claim" those registers, the parameter assignment will load them, and finally ENTER can release them for subsequent re-use on return from the procedure.
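The way a code generator might exploit this is sketched below in Python; the register numbers, frame displacements, mnemonics and helper names are invented for illustration and do not correspond to any particular implementation.

--------------------------------------------------------------------------
# The first few parameter descriptors are defined as registers, the rest as
# displacements in the new stack frame.  Each ASSPAR then simply assigns into
# whatever the corresponding descriptor describes, and ENTER becomes the call.
PARAMETER_REGISTERS = [0, 1]            # a machine-dependent choice

def parameter_descriptors(count):
    descs = []
    for i in range(count):
        if i < len(PARAMETER_REGISTERS):
            descs.append(("register", PARAMETER_REGISTERS[i]))
        else:
            descs.append(("frame", 64 + 4 * i))   # illustrative displacement
    return descs

def compile_call(name, actuals, emit):
    for actual, target in zip(actuals, parameter_descriptors(len(actuals))):
        if target[0] == "register":
            emit(f"L    {target[1]},{actual}")    # ASSPAR into a register
        else:
            emit(f"L    14,{actual}")             # ASSPAR into the frame
            emit(f"ST   14,{target[1]}(WSP)")
    emit(f"BAL  15,{name}")                       # ENTER

code = []
compile_call("PROC", ["X", "Y"], code.append)     # produces the shorter call
--------------------------------------------------------------------------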
the same compiler. In particular the parameter passing mechanisms may be different from those used in the intermediate code. In order to cope with these and other considerations any intermediate code which permits access to external procedures must be sufficiently flexible to allow the variations to be handled efficiently.

4.5.3 Permanent procedures

Most languages define a set of procedures which will be available on any implementation without explicit action by the user (such as the IMP procedures ITOS, READSYMBOL, READ, and REM). Such procedures are termed "permanent procedures". It is common for intermediate codes to provide specific code items to invoke permanent procedures, but this has the problem that the code-generator must know about all such procedures, and the language-dependent phase must be changed and the intermediate-code extended if an implementation wishes to make efficient use of procedures which can be compiled in-line on particular machines. For example many machines provide an instruction for moving blocks of store around and it could be advantageous to have a procedure which invoked this instruction directly. Before investigating ways of improving the implementation of permanent procedures it is useful to examine in some detail the properties of the procedures mentioned above,
which were chosen because they typify the main problems in this area.

ITOS is a fairly complicated string function which returns as its result the decimal character string representation of the integer value passed to it as a parameter. Because of its complexity this procedure is almost always best implemented as an external procedure which is linked into the program along with any other external entities required.

REM is an integer function which returns the remainder of dividing the first integer parameter by the second, and on many machines can be efficiently compiled in-line, as most integer divide instructions provide both the quotient and the remainder. However, when compiling for machines such as the Data General NOVA or the DEC PDP11 in configurations which do not have the optional divide instructions, division has to be performed by a complicated subroutine, suggesting that REM itself should be an external procedure like ITOS.

READSYMBOL falls somewhere between the two, mainly because it is defined to have a general name parameter, that is, the parameter may be a reference to any type of entity: integer, real, byteinteger, etc. To implement READSYMBOL as an external procedure it would have to be passed the general name parameter (comprising both the address of the actual parameter and information about its type and precision), and
would have to interpret that parameter in order to be able to store the character, suitably converted, in the appropriate way. A much more efficient implementation is to convert the statement: --------------- | READSYMBOL(S) | --------------- into the equivalent form: ------------------ | S = F$READSYMBOL | ------------------ where F$READSYMBOL is a function which returns as its result the character value that READSYMBOL would have placed into its parameter. Once this is done conversions and the choice of store operation can be left to the usual assignment part of the compiler. A further complication can arise if, as in the case of the INTERDATA 7/16 operating system, ISYS [Dewar, 1975], several permanent procedures map directly onto system-provided facilities: the function F$READSYMBOL can be replaced by the supervisor call "SVC 8,0", SELECT INPUT by "SVC 6" etc. The difficulty caused by READ is mainly one of space. As read can input an integer value, a real value, or a string value depending on the type of its (general name type) argument, it is going to be fairly large, especially if the hardware on which it runs does not provide floating-point
instructions, forcing those functions to be performed by subroutine. It follows that on small systems it may be convenient to replace calls on READ by calls on smaller procedures, chosen at compile-time by examining the type of the parameter given to READ, which input solely integer, real, or string values.

Finally it should be noted that the substitutions and modifications discussed above may only be generated as replacements for direct calls on the procedure; if the procedure is passed as a parameter to another procedure no alterations are possible and a "pure" version must be available. As passing a procedure as a parameter is totally distinct from calling the procedure this case does not prevent the improvements being carried out where possible.

It should now be clear that the efficient implementation of permanent procedures will differ greatly from that of user-defined procedures, and will itself differ from machine to machine. Hence the intermediate-code must make no assumptions about either which permanent procedures are available or how they are to be implemented. As a side-effect of removing any built-in properties from permanent procedures it becomes possible for a simple code-generator to ignore any possibility of producing special code and compile them all as externals.
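The machine-dependent choices described above lend themselves to a simple table-driven treatment in the code generator. The following Python fragment is only an illustration: the table entries, and the procedure names IREAD, RREAD and SREAD, are invented.

--------------------------------------------------------------------------
# One possible shape for the permanent-procedure decisions: each entry says
# how a direct call is to be compiled on this particular target machine.
PERMANENT = {
    "ITOS":       ("external", "ITOS"),               # always a linked routine
    "REM":        ("in-line",  "divide instruction"), # where hardware divide exists
    "READSYMBOL": ("rewrite",  "S = F$READSYMBOL"),   # assignment from a function
    "READ":       ("dispatch", {"integer": "IREAD",   # smaller type-specific
                                "real":    "RREAD",   # routines chosen at
                                "string":  "SREAD"}), # compile time
}

def compile_permanent_call(name, parameter_type, emit):
    kind, detail = PERMANENT.get(name, ("external", name))
    if kind == "dispatch":
        emit(f"call {detail[parameter_type]}")
    elif kind == "external":
        emit(f"call {detail}")
    else:
        emit(f"{kind}: {detail}")      # expand in-line or rewrite the statement
--------------------------------------------------------------------------

A simple code generator can make every entry "external" and ignore the rest.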
These transformations of procedures can only be applied when the procedures are invoked (called) directly. In the case of procedures passed as parameters all calls will of necessity be the same and hence either it will not be possible to pass some permanent procedures as parameters, an unfortunate limitation imposed by several languages, or there must be a "pure" form of the procedures available. This latter can be done very simply using I-code. The permanent procedure descriptors are defined exactly as if the procedures were truly external, but with an extra marker showing them to be "permanent". The only time that this marker is used is in the procedure call generating section of the compiler. If the procedure is being passed as a parameter this section of the compiler is not entered and so the procedure will be passed as an external. All that is now necessary is for there to be an external manifestation available when the program executes. This method has the added advantage that there is no compile-time overhead, especially important considering that passing procedures as parameters is one of the least-used features of IMP77.

4.5.4 Primitive Procedures

It is rare for machines to provide simple instructions which can deal directly with all of the requirements of high-level languages and so several constructions will have to be handled by subroutines. The code generator may then refer to these "primitive procedures" as though they were
machine instructions. The cases in which such procedures are required commonly include exponentiation, string manipulation, and array declaration and access. Given these procedures the code-generator has a choice between calling them as closed subroutines or expanding them in-line. The former produces dense code but will execute more slowly than the latter (and possibly suffer from not knowing what is corrupted by the routine and therefore having to forget everything it knows). On the other hand while the expansion of primitive procedures in-line will improve the execution speed of the program, it becomes necessary for the code-generator to be able to create the appropriate code sequences and thereby become more bulky. Once again the choice must be left to the code-generator as the benefits of a particular decision will depend on both the target machine and the use to which the compiler is to be put. If the compiler is to be used for large mathematical problems it is likely that the gains made by putting exponentiation in-line will outweigh the disadvantage of the extra code size, whereas in operating-system work as exponentiation is probably never needed, the extra complexity of the code generator to expand the routine would not be desirable. Given that some of the primitive procedures will be referenced often (checked array access, for example) it is important that entry to them is made as efficient as possible and in this area the ability to reorder code can be used to great effect.
In the original Interdata 7/32 IMP77 compiler the primitive routines were gathered together at the end of the user's code, as it was only then that it was known which procedures were required. --------- | | <- CODE BASE (register 15) | USER | | CODE | | | --------- | | | PRIM | | PROCS | | | --------- With this scheme programs of 16Kbytes or less can reference the primitive procedures with 32-bit instructions (program-counter relative addressing). Unfortunately once the program grew beyond this limit the larger and slower 48-bit form of the instructions had to be used in order to achieve addressability. In the IMP77 code generator there were 352 such large instructions.
In the new compiler the object code is reordered to place the primitive procedures at the head of the user's code where they can be addressed relative to CODE BASE.

---------
|       | <- CODE BASE (register 15)
| PRIM  |
| PROCS |
|       |
---------
|       |
| USER  |
| CODE  |
|       |
---------

The immediate disadvantage of this is that it will push the user's procedures further away from CODE BASE and hence increase the chances of a user procedure reference requiring a long (48-bit) instruction. However in practice this is not a problem as the total size of the primitive procedures is usually quite small, typically less than 800 bytes on the 7/32. The IMP77 code generator mentioned above now needs no long references at all, saving 724 bytes of code, out of about 40Kbytes. The compression of the code so achieved can be enhanced slightly by bringing the destinations of more jumps into the short-jump range, giving an extra saving of 20 bytes in the case above. In addition, now that a register (CODE BASE) is pointing to the first primitive procedure the list of procedures required can be reordered to place the most frequently referenced one first and thereby reduce references to it to 16-bit instructions (BALR LINK,CODEBASE).
When compiling with checks on, by far the most commonly referenced primitive procedure is the routine which checks for the use of an unassigned variable (over 2000 references to it in the code generator), and this trivial optimisation results in a saving of more than 4000 bytes.
4.6 Language-specified and compiler-generated objects During compilation various objects will be manipulated in order to generate code. Some of these objects have a direct representation in the source program and are referred to as "language-specified" objects, whereas others are created by the compilation process itself and are referred to as "compiler-generated" objects. The fact that the compiler-generated objects will be (or can be constrained to be) used in a stereotyped and well-behaved fashion can be used to great advantage to give simple means for optimising parts of the program. 4.6.1 Internal labels Using most intermediate codes the following program parts would translate into effectively identical sequences: ----------------------------------------------------- | ->LAB if X = 0 | if X # 0 start | | Y = 3 | Y = 3 | | LAB: | finish | ----------------------------------------------------- At first glance this is as it should be, for the two program fragments are semantically identical and could therefore be implemented by the same object code, for example on the PERKIN-ELMER 3200: ----------------- | L 1,X | pick up X and set the condition code | BZ $1 | branch equal (to zero) | LIS 0,3 | pick up 3 | ST 0,Y | store it in Y | $1: | define label $1 -----------------
However, if it is known that the label $1 will only ever be used once, the code-generator may remember that the current value of the variable X will still be in register 1 following the label, and thus remove the need for it to be loaded again if it is required before register 1 gets altered. In the case of user-defined labels no statement can be made about the number of uses of each label without a complete analysis of the parts of the program where the label is in scope. This suggests that I-code should maintain a clear distinction between user-defined and compiler-generated labels. Also, by making the rule that compiler-generated labels may only be used once, the internal representations of labels may be reused by the code-generator, removing the necessity for large tables of label definitions in this phase of compilation. This now leaves the question of how to represent conditional jumps in the intermediate code. The first observation is that user-specified jumps need never be conditional as they can always be surrounded by appropriate compiler-generated conditional jumps. This can be used to restrict the processing of conditions and tests to the compiler-generated jumps. The second observation is that in IMP77 conditionals are always associated with the comparison of two values or the testing of an implied boolean variable (predicates and string resolution). There are currently three main ways in which processors handle this:
1 "compare" instructions are used to set flags or condition-codes which represent the relationship between two values (one of which is frequently an implied value of zero). These condition-codes are later used to control the execution of conditional branch instructions. This method is used in the PDP11: COMP, BNE etc.). 2 Instructions are provided which compare two values as above but instead of setting condition-codes they skip one or more subsequent instructions depending on a specified relationship. By skipping unconditional branches in this way conditional branch sequences may be generated. This method is used in the PDP10: SKIPE etc. 3 Instructions are provided which compare two values and branch to a specified label if a given relationship holds. This method is used in the PDP10: JUMPNE etc. P-code uses compare instructions to set the boolean value TRUE or FALSE on the stack and then uses this value either as an operand in an expression or to condition a branch (a variant of technique 1 above). Z-code tests the value in a register against zero and branches accordingly (technique 3 above).
These three techniques have fairly obvious possible representations in I-code:

     if X = Y start

1)   PUSH X
     PUSH Y
     COMP          {set condition code}
     BNE 1         {branch not equal}

2)   PUSH X
     PUSH Y
     SKIPE         {compare and skip if equal}
     GOTO 1

3)   PUSH X
     PUSH Y
     JUMP # 1      {compare and branch if not equal}

All three of these representations have been tried in different versions of I-code. Technique 2) was rejected as it proved cumbersome to implement effectively, especially on machines which did not use skips; either the code-generator had to "look ahead" to be able to locate the destination of the skip (which is dependent on the instruction being skipped) or to check before each instruction whether or not a skip had been processed earlier and its destination had not yet been resolved.
Technique 1) was perfect for machines with condition-codes but required look-ahead over subsequent jumps on machines which used skips. Both 1) and 2) had the additional problem that to generate conditional branches two separate I-code instructions had to be given. In the case of 1) condition-codes are usually altered by many instructions not directly involved in comparison and hence the compare and its associated branch must be made adjacent. With 2) there is the possibility of generating meaningless constructions such as skipping a line-number definition instruction. These difficulties add complexity to the definition of the intermediate code and require extra checks in the code generator. Thus the third form was chosen as the most convenient, even though all three forms can be suitably defined to be totally equivalent. In particular the third technique provides all the relevant information to the code-generator in one instruction, and has proved to be simple and effective as a basis for generating code for both condition-code and skip sequences.
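How the single compare-and-branch item can be lowered to both styles of machine is sketched below in Python; the mnemonics and relation tables are merely indicative of the two styles, not of any particular code generator.

--------------------------------------------------------------------------
# Lowering the I-code compare-and-branch (technique 3) for two machine styles.
CC_BRANCH   = {"=": "BE",   "#": "BNE",  "<": "BL",    ">": "BG"}    # condition codes
SKIP_UNLESS = {"=": "CAMN", "#": "CAME", "<": "CAMGE", ">": "CAMLE"} # skip machines

def lower_compare_branch(style, relation, lhs, rhs, label, emit):
    if style == "condition-codes":
        emit(f"CMP  {lhs},{rhs}")                    # compare, setting the codes
        emit(f"{CC_BRANCH[relation]}   {label}")     # branch if the relation holds
    else:
        # skip the following jump whenever the relation does NOT hold, so the
        # jump to the label is taken exactly when it does
        emit(f"{SKIP_UNLESS[relation]} {lhs},{rhs}")
        emit(f"JRST {label}")

code = []
lower_compare_branch("condition-codes", "#", "X", "Y", "$1", code.append)
lower_compare_branch("skips", "#", "X", "Y", "LAB1", code.append)
--------------------------------------------------------------------------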
Using these ideas the following is the expansion of the statements given at the start of section 4.6.1. ---------------- ----------------- | PUSH X | PUSH X | | PUSHI 0 | PUSHI 0 | | COMP # 1 | COMP = 1 | | JUMP LAB | | | LOCATE 1 | | | PUSH Y | PUSH Y | | PUSHI 3 | PUSHI 3 | | ASSVAL | ASSVAL | | LABEL LAB | LOCATE 1 | ---------------- ----------------- 4.6.2 Temporary objects During the compilation of high-level languages it often becomes necessary to create temporary objects which are not present in the source program. The most common need for temporaries is in the evaluation of expressions. Regardless of the number of accumulators or registers available it is always possible to construct an expression which will require one more. To obtain this register, a register currently in use must be selected and the value currently in it must be saved in a temporary location. One apparent exception to this is a machine in which expressions are evaluated using a stack (e.g. ICL 2900) but in this case the operands are always in temporaries.
Temporary variables may also be required to implement certain high-level constructions, such as the IMP for statement: ----------------------- | for V = A, B, C cycle | ----------------------- which is defined so that the initial values of B and C, and the initial address of the control variable, V, are to be used to control the loop regardless of any assignments to V, B and C. While it is possible for a machine-independent optimiser to discover whether these variables are modified in the loop or not, in the simple case where little optimisation is required the code generator must use temporaries. In the case of expression evaluation, however, the machine independent phase cannot know how many temporaries will be required. Even giving the first phase knowledge of the number of registers available is not adequate for several reasons. Firstly, the use of registers is commonly tied to the operations being performed, as in the case of integer multiply on several machines which requires a pair of registers. For a machine-independent first phase to be able to cope with this sort of limitation would require great flexibility of parameterisation.
Secondly, the first phase would have to be given details of the problems encountered in statements such as:

----------------------------
| LEFT = REM(A,5) + REM(B,7) |
----------------------------

On a PDP11 equipped with the EIS option a divide instruction is available which provides both the quotient and the remainder. Hence the statement could be compiled into:

-----------------
| MOV A,R1       |
| SXT R0         | propagate the sign of A
| DIV #5,R0      | remainder to R1
| MOV B,R3       |
| SXT R2         |
| DIV #7,R2      | remainder to R3
| ADD R3,R1      |
| MOV R1,LEFT    |
-----------------

In this case no temporary store locations are required. However, if the EIS option is not present there is no DIV instruction and so a subroutine must be used instead. The code becomes:

-----------------
| MOV A,R1       |
| MOV #5,R2      |
| JSR PC,DIV     | result back in R1
| MOV R1,T1      | preserve remainder
| MOV B,R1       |
| MOV #7,R2      |
| JSR PC,DIV     | result in R1
| ADD T1,R1      |
| MOV R1,LEFT    |
-----------------

As the subroutine REM uses R1 (for one of its arguments and to return its result) the result of the first call on REM must be saved in a temporary, T1. Of course, the function REM could be written so as to preserve the value in, say, R2
and this could be used instead of T1, but this would increase the cost of REM when it is likely that the value in R2 will not be of use as most expressions are trivial [Knuth, 1971]. Unless the machine-independent phase is given intimate knowledge of the target machine (something of a contradiction) it cannot know how many temporaries to use nor when to use them. The solution adopted by most intermediate codes is to base the code around a stack, thus providing an unlimited number of temporaries which are handled automatically. While this in itself does not hinder the compilation for a machine without a hardware stack, as the code-generator can always simulate the stack internally, its presence invariably results in other parts of the code using it, for example to pass parameters to procedures where the receiving procedure contains built-in knowledge of the layout of the stack. As a stack does not require the explicit mention of temporaries it has been adopted by I-code, but purely as a descriptive mechanism. Because I-code does not specify the computation but the compilation process needed to produce a program which will perform the computation, this internal stack need have no existence when the final program executes.
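The following Python fragment illustrates the stack as a purely descriptive device inside a code generator; the descriptor forms, the register allocation and the mnemonics are all invented for the sketch.

--------------------------------------------------------------------------
# PUSH puts a descriptor on a compile-time stack; operations combine
# descriptors and emit instructions only when a value must really exist.
# Nothing here implies a run-time stack in the compiled program.
stack = []

def operand(desc):
    return f"={desc[1]}" if desc[0] == "const" else str(desc[1])

def load(desc, emit):
    if desc[0] == "reg":
        return desc[1]
    emit(f"L    1,{operand(desc)}")
    return 1

def PUSH(desc):                         # e.g. ("var", "X") or ("const", 3)
    stack.append(desc)

def ADD(emit):
    rhs, lhs = stack.pop(), stack.pop()
    if lhs[0] == "const" and rhs[0] == "const":
        stack.append(("const", lhs[1] + rhs[1]))   # folded: no code at all
    else:
        reg = load(lhs, emit)
        emit(f"A    {reg},{operand(rhs)}")
        stack.append(("reg", reg))

code = []
PUSH(("var", "X")); PUSH(("const", 3)); ADD(code.append)
# code is now ['L    1,X', 'A    1,=3'] and the stack holds ("reg", 1)
--------------------------------------------------------------------------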
The implementors of SIMPL-T describe an intermediate code with some properties similar to I-code, but based on "quadruples" of operators and operands rather than an internal stack [Basili, 1975]. The stack approach was rejected by them because "quads allow more flexibility in the design of the code generator since, for example, no stack is required". The exact meaning of this is not clear but it suggests the misconception that a stack-based intermediate code forces a stack-based object code representation. Regardless of the exact structure of the code generator or the input it takes, some form of internal stack is invariably required for operations such as protecting intermediate values in registers which are needed for other purposes, and it seems reasonable to make this stack more explicit if so doing will simplify the intermediate code and its processing.
4.7 Object-file generation

The third phase takes as its input two data streams generated by the second phase. These streams are:

i  The object stream, a sequence of items of the form:

     <type> <value>*

   defining the code sequences required in the object file.

ii The directive stream, a sequence of items defining the logical structure of the object stream, that is, a specification of label definitions and label references, and details of various code groupings (blocks, procedures etc.).

The third phase starts by taking in the directive stream and constructing a linkage map describing the whole program. This linkage map is processed and then used to control the generation of the final object file from the object stream. The operations performed using the map are:

4.7.1 Reordering

As discussed previously, there are several gains to be made by having the ability to output instructions in an order different from that in which they were implied by the linear structure of the source program. This reordering is performed on the linkage map in a manner controlled by items in the directive stream. In the most simple case of exbedding procedures (section 4.3.1) this only entails
allocating code addresses to the items in the map each time an "end-of-block" control item is input, resulting in the procedures being laid out in "end" order. To facilitate evaluating references to the reordered areas, all references in the object stream are made relative to the start of the appropriate area. As this process does not cause the physical moving of the various areas there is an implicit assumption that either the subsequent processing of the object stream can do the reordering (for example by writing its output to specific sections of a direct-access file), or that the object file format can instruct the loader or linker to do the shuffling. With the linkage map available it becomes possible to make a preliminary pass over the object stream performing structural modifications which require knowledge of the generated code and which alter its size and general appearance. These modifications may be made by passing the object stream through a buffer which is scanned and modified under the control of the linkage table. In this way merging common code sequences and reordering the arms of conditional sequences may be achieved quite simply.
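A toy model of this "end order" layout, driven by the directive stream, is given below in Python; the directive format is invented purely for the illustration.

--------------------------------------------------------------------------
# Procedures are allocated their place in the object file only when their
# end-of-block directive arrives, so a textually nested procedure is laid
# out before, and outside, the procedure which encloses it.
def end_order_layout(directives):
    layout, open_blocks = [], []
    for directive in directives:
        action, name = directive.split()
        if action == "begin":
            open_blocks.append(name)
        else:                              # "end": the block can now be placed
            layout.append(open_blocks.pop())
    return layout

print(end_order_layout(["begin OUTER", "begin INNER", "end INNER", "end OUTER"]))
# -> ['INNER', 'OUTER']: the nested procedure is exbedded ahead of its parent
--------------------------------------------------------------------------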
4.7.2 Jumps and branches

In typical programs the frequency of occurrence of jumps of the various sizes is:

                 PDP11   PE3200
        ------------------------
        2 byte |  88%     28%  |
        4 byte |   8%     71%  |
        6 byte |   2%     <1%  |
        ------------------------

It has been suggested [Brown, 1977] that the problem of deciding which form of jump to use can be eased on certain machines by specifying a "distance" parameter with the intermediate code, e.g. "GOTO LAB,80" informing the code generator that the label LAB is 80 instructions ahead. It is difficult to think of any case in which this could be of any use as it requires the code generator to be able to predict the amount of target machine-code which will be generated for each intermediate code instruction.

The solution adopted by the IMP compilers has been for the code generator to assume that all jumps are the minimum size, and to let the third phase stretch them where necessary. The Perkin-Elmer CAL assembler [Interdata, 1974] makes the opposite assumption, namely that jumps are long until proven short. This was rejected as the size of one jump is often dependent on another, so that one of them will be short if and only if
both of them are short. By assuming them long either they will never be found to be short, or the process will have to examine all the jumps repeatedly trying each jump in turn to see if it can be "squeezed". Commonly, enabling the "SQUEZ" option in the CAL assembler can double or treble the time to assemble programs. With the assumption that all jumps start short and then grow, all truly short jumps will be found with no possibility of infinite loops, as the process must terminate, in the worst case when all the jumps have been made long. Several methods for achieving this optimisation have been described [Szymanski, 1978; Williams, 1978].

The technique used by the third phase of the IMP77 compilers for stretching jumps is as follows. Once the linkage map has been constructed and addresses provisionally allocated, all labels and references to them are grouped according to the block in which they occurred. This is to take advantage of the fact that most references will be local. A procedure STRETCH is now defined which repeatedly attempts to lengthen each reference within a particular group. If a reference is found
which must be stretched, the entry in the linkage map is updated and all subsequent entries are suitably modified to take account of the increased size of the code. The process is repeated until no alterations have been made. STRETCH is first called once for each group of references in the program. This "local stretch" commonly resolves up to 80% of the references. A final call on STRETCH is then made with all the references lumped together as one group in order to resolve references between blocks, and any local references which, although processed by the local stretch, have become invalidated by changes made by the "global stretch".

The use of a local and a global stretch has a considerable effect on the performance of the compiler: if the calls on "local stretch" are taken out "global stretch" has to do all the work in ignorance of the block-structure of the labels. This involves repeated searching of the complete label and reference lists in order that changes in the position of these items may be recorded. On the Interdata 7/32 this increases the stretching time for 1968 branches from 2.3 seconds out of a total compilation time of 146 seconds up to 35 seconds!
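The essence of the stretch process can be modelled in a few lines of Python; the sizes, the range and the data layout are invented for the example, and the real implementation works on the linkage map rather than on simple lists.

--------------------------------------------------------------------------
# Jumps start at their minimum size; any jump whose span no longer fits is
# lengthened, every later address is pushed down, and the scan repeats until
# nothing changes.  The process must terminate: jumps only ever grow.
SHORT, LONG, SHORT_RANGE = 2, 4, 127

def stretch(jumps, labels):
    """jumps: [{'at': address, 'to': label, 'size': SHORT}]; labels: {name: address}."""
    changed = True
    while changed:
        changed = False
        for jump in jumps:
            if jump["size"] == SHORT and abs(labels[jump["to"]] - jump["at"]) > SHORT_RANGE:
                jump["size"] = LONG
                growth = LONG - SHORT
                for other in jumps:            # everything beyond this jump moves
                    if other["at"] > jump["at"]:
                        other["at"] += growth
                for name in labels:
                    if labels[name] > jump["at"]:
                        labels[name] += growth
                changed = True
    return [j["size"] for j in jumps]
--------------------------------------------------------------------------

Calling such a routine once per block and then once over all the references together corresponds to the local and global stretches described above.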
The time taken to perform the stretching using both local and global stretch is on average just over 1% of the total compilation time excluding the time for input and output. Wulf et al. describe an optimisation on the PDP11 which attempts to shorten otherwise long conditional jumps by making them jump to suitable jumps to the same destination as this is smaller and faster than the six byte instructions which would be generated by default [Wulf, 1975]. This was tried but eventually removed from the PDP11 compiler as finding suitable jumps was a tedious task and of the average 2% of jumps which were long in compiling many programs only one case was found where the optimisation could be applied. That case was in a program specially constructed to test the optimisation. At the same time that jumps and labels are being processed, certain operations which depend on the flow of control may be inserted into the code. The GEC 4080 provides a good example of this problem which can be handled elegantly by the third phase. The machine provides arithmetic instructions which
take either fixed point or floating point operands depending on the state of a processor status bit. This bit must be altered by the instructions SIM (Set Integer Mode) and SFM (Set Floating Mode). During code generation when a label is encountered the state of the status bit will not in general be known, and so a suitable mode switching instruction will need to be planted; frequently this instruction will be redundant. Given the presence of the third phase, the second phase merely needs to mark jumps with the current state of the bit, and to mark labels with the required state (and the previous state of the bit if control can "fall through" past the label). During the process of expanding jumps these mark bits can be checked. If all references to a label have the same mode, no action needs to be taken, but if the bits differ the appropriate instruction must be added. As an extra improvement if only one jump to a label is from the wrong mode, the mode switching instruction can be planted before that jump rather than after its destination label, so shortening the execution paths when no change is required.

ii Conflating jumps to jumps. Nested conditional structures in high-level languages often generate jumps which take control directly to another jump. If the second jump can
be shown always to be taken whenever the first is, the first can be redefined as jumping directly to the destination of the second. e.g.

-----------------------------------
| while N > 0 cycle                |
|   N = N-1                        |
|   if N > 5 then TEST1 else TEST2 |
|   repeat                         |
-----------------------------------

In this program following the call on TEST1 the else causes a jump to be taken to the next statement (repeat). This statement is simply a jump back to the previous cycle. Hence the following code can be generated (3220):

--------------------
| $1: L   1,N      |
|     BLE $2       |
|     SIS 1,1      |
|     ST  1,N      |
|     CHI 1,5      |
|     BLE $3       |
|     BAL 15,TEST1 |
|     B   $1       |
| $3: BAL 15,TEST2 |
|     B   $1       |
| $2:              |
--------------------

The danger with this optimisation is that an otherwise short jump can be expanded to a long jump as the following program demonstrates:

----------------------
| if X = 1 start      |
|   if Y = 1 start    |
|     {A}             |
|   else              |
|     {B}             |
|   finish            |
| else                |
|   {C}               |
| finish              |
----------------------
The else following the sequence {A} causes a jump to the next else which jumps past the finish. In that form the first jump only has to skip {B} and is likely to be a short jump. If it is made to jump directly to the second finish it has to cover {B} and {C}, so reducing the chances of its being short. Equally, the position can be reversed resulting in the optimised jump being short when the original was long. If this problem is considered serious the third phase can check the sort of jump which would be generated and act accordingly.

iii Removal of jumps round jumps. Statements such as:

------------------
| ->LABEL if X = Y |
------------------

are common, either in the explicit form as given above or in some higher-level representation such as:

---------------
| exit if X = Y |
---------------
The simple code sequence generated for this would be similar to (3220):

--------------
| L   1,X     | pick up X
| C   1,Y     | compare with Y
| BNE $1      | branch not equal
| B   LABEL   | jump to LABEL
| $1:         |
--------------

By combining the two branches the code can be reduced to:

--------------
| L  1,X      |
| C  1,Y      |
| BE LABEL    |
--------------

While it is possible for the code generator to do this immediately it was decided to leave the optimisation to the third phase for four reasons:

1 The third phase can perform this optimisation simply, almost as a side-effect of constructing the linkage map.

2 There are several cases where the optimisation can be extended in ways which would be awkward for the second phase to deal with. In particular, it would have either to look ahead or to be able to modify code sequences already generated. With a third phase, however, the optimisation reduces to a straightforward inspection of the linkage map. For example:
------------------
| exit if X = Y   |
| repeat          |
------------------

in which case the optimisation may be applied twice to reduce the code to two instructions.

3 Leaving the optimisation to a later phase simplifies the second phase which is the most complicated part of the compiler.

4 On several machines if the destination of the jump is too far away the original "jump round a jump" may be the only form available (e.g. PDP11). The distance to be jumped will only be known exactly when all labels have been processed.

4.7.5 In-line constants

When compiling for machines such as the Data General NOVA which have a limited direct addressing range and no full-length immediate operands, it is useful if constants can be planted in the code sequence and addressed as program-counter-relative operands. The simplest technique for doing this is for the code generator to maintain a list of required constants and to dump them in-line at a suitable opportunity before the limit of addressability has been exceeded. Such constants will need to be protected from
being executed and so will need to have a jump round them or will have to be planted in a "hole" in the code, that is between an unconditional jump and the next label. As holes occur frequently in high-level languages (for example following every else or repeat) and do not require extra code to be planted round the constants, they must be the preferred position for the constants. In order to minimise the number of constants planted it is necessary to delay the dumping of them until the last possible moment, placing them as near as possible to the forward limit of addressability of the first outstanding reference. This increases the chance of a subsequent reference to the constant being able to address the previously planted copy. This poses problems if the second phase is to handle the constants as it cannot know which is the optimum position for the constants in advance of producing the code (especially if the code is to be reordered). A convenient solution is to utilise the linkage table in the third phase and include in it references to constants and the locations of holes and "forced" holes, that is places where an extra jump is required. Following the initial resolution of jumps (4.7.2) the list of constants can be examined and holes allocated. The labels are processed again to take account of the extra code and any alignment limitations. During the processing of the object stream the constants are infiltrated into the object file.
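Choosing where to dump a pooled constant can be reduced to a very small decision, sketched here in Python; the eight-bit range and the function name are illustrative only.

--------------------------------------------------------------------------
# Pick the last usable hole that the first outstanding reference to the
# constant can still address: dumping as late as possible maximises the
# chance that later references can share the same planted copy.
ADDRESS_RANGE = 255                 # e.g. an eight-bit displacement

def choose_hole(first_outstanding_reference, holes):
    limit = first_outstanding_reference + ADDRESS_RANGE
    usable = [h for h in holes if first_outstanding_reference < h <= limit]
    if not usable:
        return None                 # a hole must be forced with an extra jump
    return max(usable)

print(choose_hole(0x100, [0x120, 0x1A0, 0x240]))   # -> 0x1A0
--------------------------------------------------------------------------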
5 Review of the overall structure 5.1 Division of function The division of the machine-dependent phase into two parts was motivated by three main considerations: i to localise the changes necessary to produce different object-file formats, ii to permit the reordering of sections of the code, iii to enable the production of short jumps whenever possible. In addition it turns out that on all of the machines for which this technique has currently been applied points (ii) and (iii) can be handled by almost identical pieces of code, making this phase of compilation machine-independent to a large extent and therefore easing the task of creating new compilers. Against this must be set the overheads incurred by separating the compilation into two parts which have to communicate. The interface between phases two and three comprises the object file and the directive file, and the third phase needs to process the whole of the directive file before starting to look at the object file.
The ways in which these 'files' will be implemented, and consequently the cost of the communication, will in general vary from system to system. If large virtual memories are available the data may be held in memory as mapped files or arrays and accessed much more efficiently than on simpler systems using the conventional approach of 'true' files with their more cumbersome transfer operations.

5.2 Testing and development

Although the initial reason for choosing a multi-phase approach to compiling was that of simplifying the generation of new compilers, an extra advantage arose in that the task of checking the compilers so produced and diagnosing faults in them was very much simplified. This was because of two features of the technique. Firstly, the programs corresponding to the phases were of manageable size, varying from about one thousand statements up to three thousand statements. Secondly, the phases communicated with each other using well-defined interfaces which could be monitored to narrow down errors to a particular phase and even to specific parts of that phase. In addition, as the structure of the intermediate code inevitably suggests the general techniques to apply in code generation, many of the complete compilers on different machines had great similarities; usually only the lowest
levels of code production and machine-specific optimisation were appreciably different. This gave rise to three convenient properties with regard to testing and development: i An error in one compiler will frequently give notice of similar faults in others. Clearly, any faults in the common first phase will be present in all the compilers and only one correction will be required. ii An improvement in the performance of one compiler, or the code it generates, can suggest similar improvements in others. iii The third effect on reflection seems obvious yet was noted with some surprise. The systems on which most of the investigation was done are run with very different operating systems and used by different types of user. These two factors together caused a great spread in the demands placed upon the compiler, resulting in more parts of the compiler being thoroughly tested than would happen when running on one particular system, where users tend to be more stereotyped. Questions of "proper practice" aside, it is a fact of life that all software gets a better testing in the field than at the hands of its creator.
5.3 Diagnostics As mentioned previously optimisation is not just a process of improving the storage requirements and speed of a program but also involves fitting a program into the overall framework of the run-time environment. In many applications the provision of extensive run-time checks and post-mortem traces can be of great importance. The ability to generate such diagnostic code has certain implications for the features in the intermediate code. 5.3.1 Line numbers When producing information about the state of a computation, whether it be an error report following a run-time fault or an execution trace [Satterthwaite, 1972], the data must be presented in a form which is meaningful to the user in terms of the source program. The commonly-provided dump of the machine state, registers, code addresses etc., is a complete failure in this respect, as the correspondence between this and the program state depends on the workings of the compiler and other factors of which the user should not need to be aware. The simplest way of specifying the point of interest in a program is to give its line number. There are two common techniques for providing line number information at run-time, the choice of which depends on the uses to which the compiler is to be put.
The first is to plant instructions which dynamically update a variable with the current line number whenever it changes. This has the significant advantages that it is extremely cheap to implement and the line number is always immediately available. Its obvious disadvantages are that it increases the execution time for the program, and more significantly it increases the size of the program, typically by about 6K bytes on the Interdata 7/32 for a 1000 line program, approximately a 50% increase. The second technique is to build a table giving the correspondence between line numbers and the addresses of the associated code sequences. While this imposes a greater burden on the compiler and takes more time to extract the line number it has the advantage that it does not increase the code size of the program, nor does it alter its execution speed. Indeed it may even be possible to keep the table out of main memory until it is required.
The choice of technique will have implications on the compiled code. If the line number table approach is used error procedures must have available the address of the point of the error. The effects of this can be seen in the following example of the sort of code generated for unassigned variable checking on the 7/32 using both methods: ------------- | 17 X = Y | ------------- ------------- ------------- | LHI 0,17 | | update line no | ST 0,LINE | | | L 1,Y | L 1,Y | | C 1,UV | C 1,UV | check value | BE ERROR | | give the error | | BAL 8,TU | test for error | ST 1,X | ST 1,X | | | .. .. | | | .. .. | | |TU:BNER 8 | return if OK | | B ERROR| give the error ------------- ------------- As the generated code depends on the method in use it cannot be specified in the intermediate code and so the latter must simply indicate the points in the program at which the line number changes.
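The table technique can be illustrated with a few lines of Python; the addresses and line numbers are invented, and a real implementation would hold the table alongside the other diagnostic information in the object file.

--------------------------------------------------------------------------
# The compiler records (code address, line number) pairs; the error routine
# recovers the source line from the faulting address, so correct programs
# carry no extra instructions and run at full speed.
import bisect

LINE_TABLE = [(0x100, 1), (0x10C, 2), (0x118, 5), (0x140, 17)]   # sorted by address

def line_for(address):
    addresses = [entry[0] for entry in LINE_TABLE]
    i = bisect.bisect_right(addresses, address) - 1
    return LINE_TABLE[i][1] if i >= 0 else None

print(line_for(0x142))   # -> 17: the fault lies in the code generated for line 17
--------------------------------------------------------------------------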
5.3.2 Diagnostic tables In the event of program failure, or when explicitly requested by the user, a trace of the current state of a program, including the values in active variables and the execution history, can be of immense value. For such a trace to be provided the intermediate code must contain the identifiers used in the source program for all the variables, and a source-dependent description of those variables. This latter is needed so that the machine representations may be interpreted in the correct way when giving the values in variables. In I-code all this information is presented in the definitions of descriptors and may be used or discarded at will. 5.3.3 Run-time checks Most languages define circumstances under which a program is to be considered in error and its execution terminated. These errors include creating a value too large to be represented (overflow), division by zero, use of an array index which is outwith the declared bounds, and so on. There is a natural division of these errors into those which are detected automatically by the machine and those which must be detected by explicit checks in the program. Commonly, machines catch division by zero automatically but do not provide such a feature for checking array subscripts. The "hardware-detected" errors may be further divided into
those which on detection cause the normal flow of control to be interrupted, and those which simply make the knowledge of the occurrence of the error available to the program, for example by setting a condition-code bit. For the purposes of this discussion the second form of hardware-detected error may be considered an error which is not detected automatically, as it still requires explicit instructions to test for the error and to transfer control accordingly. Clearly the more errors that fall into the automatic category the better, as they do not cause the user's program to grow with sequences of instructions which, in a correct program, will always be testing for conditions which never arise. These differences complicate the design of intermediate codes as the classification differs from machine to machine: on the VAX all forms of overflow can be made to generate automatic interrupts, but the PDP11 only sets a condition-code bit on some overflows. There are two basic ways of handling this in the intermediate code: firstly the code can contain explicit requests for the checks to be performed, and secondly the code can be designed in such a way as to give the code-generator enough information to be able to decide where checks are necessary. Two specific examples can indicate which of these ways should be adopted.
Testing for arithmetic overflow is currently handled by machines in three main ways: 1. An interrupt is generated whenever overflow occurs. This is by far the best method as it requires no overheads in the checked code. 2. A bit is set on overflow and is only cleared when it is tested. This requires explicit checks in the code but several tests may be conflated into a single test at an appropriate point, for example before the final result is stored. 3. A bit is set on overflow, but is cleared by the next arithmetic operation. This again requires explicit checking code but the tests must be inserted after every operation. For the intermediate code to indicate where overflow testing is to be performed it would have to choose the worst case from the three above, namely case 3. This would result in a test being requested after every arithmetic instruction, which test may just as well be included into the definition of the instructions themselves. The other area of low-level testing is in implied type conversions such as storing from a 32-bit integer into a 16-bit integer. The VAX provides an instruction which
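The difference between cases 2 and 3 can be made concrete with a small software model (C, invented names; this models the checking discipline, not the generated code): with a 'sticky' overflow bit one conflated test covers a whole expression, whereas a bit cleared by the next operation forces a test after every add or multiply.

#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

static int overflow;                       /* the modelled condition-code bit */

/* Add with overflow detection; the result is unreliable once 'overflow' is set. */
static long add_checked(long a, long b)
{
    if ((b > 0 && a > LONG_MAX - b) || (b < 0 && a < LONG_MIN - b)) {
        overflow = 1;
        return 0;
    }
    return a + b;
}

int main(void)
{
    long a = LONG_MAX, b = 1, c = 2, x;

    /* Case 2: the bit is sticky, so one conflated test suffices
       just before the final result is stored.                    */
    overflow = 0;
    x = add_checked(add_checked(a, b), c);
    if (overflow) { puts("overflow in X = A + B + C"); exit(1); }

    /* Case 3 would demand the test after every call of add_checked,
       which is why the test may as well be part of the operation.  */
    (void)x;
    return 0;
}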
combines the test for truncation with the store (CVTLW). The 7/32 has an instruction (CVHR) which can test the value before assignment, and the 4/75 can most efficiently test following the assignment (CH). If the request for the check is a separate intermediate code item, the 7/32 case is simple but the other machines will require much more work to be able to generate the efficient check. The problem can be simplified by introducing new assignment instructions which also perform the test, but this adds many new instructions to the code as one instruction will be required for every valid combination of types and every sort of assignment. The high-level checks such as array bound checking are usually so complicated that the most efficient implementations depend greatly on the particular hardware, so much so that it would be foolish to attempt to express them in the intermediate code. The simplest solution is to ensure that the intermediate code provides enough information to let the code generator decide where and what checks are necessary. The inclusion of checks against the use of unassigned variables provides a good example of the power of leaving the checking to the code-generator. In a simple-minded approach the code-generator tests every
suitable value loaded from store. A minor improvement to this is to mark the descriptor for every local variable in a block when it is first assigned, inhibiting the marking after the first jump. Subsequently, marked objects need not be checked. A much greater improvement may be obtained by making a trivial extension to the register remembering mechanism. If an object is 'known' it must have been used previously, and hence it will have been checked if necessary. Even after the register which held the value of the object has been altered, and hence the association between the register and the object lost, if the compiler remembers that the value was known it can suppress any unassigned checks on future references. At this point a useful property of IMP77 may be used to great effect: once a variable has been assigned it cannot become unassigned. This is not true in many languages; for example, in ALGOL60 the control variable of a for loop is undefined (unassigned) at the end of the loop. This means that in IMP77 the 'was known' property of variables may be preserved across procedure calls, even though all the register content information must be forgotten. This technique, when applied in the 7/32 compiler, results in a reduction of 33% in the code required for checking. While it is possible for the unassigned checks to be placed in the intermediate code and for the first phase to remove redundant checks, this suppression would
require a duplication of the remembering logic which must, in any case, reside in the machine-dependent phase.
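The bookkeeping involved is slight, as the following sketch suggests (C with invented names, standing in for the IMP77 of the real code generators): each descriptor carries a 'was known' flag which survives the loss of register associations at a procedure call.

#include <stdio.h>

struct descriptor {
    const char *name;
    int was_known;        /* value has been loaded (and hence checked) before */
};

struct reg { struct descriptor *holds; };   /* current register contents */

#define NREGS 8
static struct reg regs[NREGS];

/* Emit a load of 'd', inserting an unassigned check only when needed. */
static void load(int r, struct descriptor *d)
{
    printf("  L   %d,%s\n", r, d->name);
    if (!d->was_known)
        printf("  ...unassigned check on %s...\n", d->name);
    d->was_known = 1;     /* in IMP77 a variable can never become unassigned again */
    regs[r].holds = d;
}

/* At a procedure call all register contents are forgotten,
   but the 'was known' property of the descriptors survives. */
static void call_seen(void)
{
    for (int i = 0; i < NREGS; i++)
        regs[i].holds = NULL;
}

int main(void)
{
    struct descriptor x = { "X", 0 };
    load(1, &x);          /* first use: check generated */
    call_seen();          /* registers forgotten at a call */
    load(1, &x);          /* no further check needed */
    return 0;
}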
6 Observations 6.1 Suitability of I-code for Optimisation When considering the use of I-code for global optimisation there are two techniques available: Firstly, the optimisations can be performed using the I-code and going straight into object code, possibly via a third phase. In this case the only real constraint on I-code is that it be powerful enough to be able to carry all the information available in the source and to present it in a compact form. Secondly, the optimisations can be seen as an extra phase introduced between the first phase (the I-code generator) and what is normally the second phase (the code generator). The optimiser takes in I-code and produces as its output a new I-code stream which can be fed into the code generator. In this case not only must the I-code carry all the source information but it must be able to describe the generation of an optimised program. Clearly the code must be able to reflect the structure of the target machine in some way and hence must be able to lose its machine independence. The second technique is the more interesting as not only does it permit the optional inclusion of global optimisation without affecting the structure of the other
phases, but it removes the optimisations from the low-level details of code production and provides a means for separating the machine-independent and machine-dependent optimisations. In particular, in the same way as much of the code generator can be built from a standard "kit" with a few special machine-specific parts, so the global optimiser can utilise code from other optimisers. The way in which the optimiser can influence the operation of the code generator is by making use of the fact that the intermediate code does not describe a computation but a compilation process. This compilation is driven by the descriptors which are normally translated by the code generator from the machine-independent form in the I-code into the appropriate machine-dependent representation, reflecting the target machine architecture: registers, stacks, memory etc. By short-circuiting this translation a global optimiser can force the use of specific machine features.
For example consider the following fragment of an integer function: ------------------- | integer X | | X = A(J) | | X = 0 if X < 0 | | result = X | ------------------- The standard I-code produced for this fragment would have the form: ------------------------ | DEF 12 "X" INTEGER | | SIMPLE DEFAULT NONE | | NONE | | PUSH 12 | - X | PUSH 6 | - A | PUSH 7 | - J | ACCESS | | ASSVAL | | PUSH 12 | - X | PUSHI 0 | | COMP >= 1 | | BGE 1 | | PUSH 12 | - X | PUSHI 0 | | ASSVAL | | LOC 1 | | PUSH 12 | - X | RESULT | ------------------------ On the PDP11 the code generated for this could be: ------------------- | MOV J,R2 | | ADD R2,R2 | Scale the index | ADD A,R2 | Add in ADDR(A(0)) | MOV (R2),X | X=A(J) | BGE $1 | ->$1 if X >= 0 | CLR X | X = 0 | $1: MOV X,R1 | assign result register | {return} | ------------------- Here the obvious optimisation is to note that the local variable, X, is eventually to be used as the result of the
function and so needs to end in register 1. By changing the definition of X in the I-code into: ------------------------------------------------ | DEF x X INTEGER SIMPLE DEFAULT NONE SPECIAL R1 | ------------------------------------------------ and making no other changes, the code generator will produce code of the form: -------------------- | MOV J,R2 | | ADD R2,R2 | | ADD A,R2 | | MOV (R2),R1 | | BGE $1 | | CLR R1 | | $1: {return} | -------------------- As this process necessitates the I-code becoming more and more intimately involved with the structure of the target machine, in that it starts referring directly to registers and the like, it is necessary that a new control item be added so that the code generator may be prevented from pre-empting resources which the optimiser is manipulating. The new item is RELEASE and it is used in conjunction with the definition of machine-dependent descriptors. When such a descriptor is introduced (using DEF) the associated target machine component is considered to have been claimed and may only be used in response to explicit direction from the I-code. On receipt of the corresponding RELEASE the component is once again made available for implicit use by the code generator (for temporaries etc.). This mechanism is an exact parallel to the way in which memory locations
are claimed by the definition of descriptors and released by the END of the enclosing block. The main assumption about this style of optimisation is that the code generator has the ability to generate any required instruction, provided that the pertinent information is available at the required time. As an example, the VAX 11/780 provides addressing modes in which the value in a register may be scaled and added into the effective operand address before the operand is used, hence the following code: ------------------------ | integerarray A(1:9) | |- -| | A(J) = 0 | | | | MOVL J,R5 | pick up J | CLRL 12(R3)[R5] | A(J) = 0 ------------------------ The operand address generated by the CLRL instruction is: ------------------ | 12+R3 + R5*4 | ------------------ as there are 4 bytes (address units) to a longword. This instruction can be generated naturally during the non-optimised evaluation of array subscripts, and so the optimiser can assume that the index mode of operand will be used whenever a register operand is specified as an array index. The procedure has the added advantage that in the worst case when the code generator will not produce the instructions
that the optimiser hoped for, as long as the optimised I-code still describes the required compilation, the code generator will simply produce a more long-winded, but equally valid, version of the program. In other words, as long as some choice is available and some temporary objects are left at the disposal of the code generator, the optimiser cannot force it into a state where working code cannot be produced. In the example above even if the code generator does not produce index mode operands, it can still generate sequences of the form: ---------------------- | MULL3 R5,#4,R1 | R5*4 -> R1 | ADDL2 R3,R1 | R3+R1 -> R1 | CLRL 12(R1) | 0 -> (12(R1)) ----------------------
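A sketch of the claim/RELEASE discipline itself (C, invented names): a register named in a SPECIAL descriptor is withdrawn from the pool the code generator uses for temporaries until the matching RELEASE arrives in the I-code.

#include <stdio.h>

#define NREGS 8
static int claimed[NREGS];              /* 1 => reserved by the optimiser */

static void def_special(int r)  { claimed[r] = 1; }   /* DEF ... SPECIAL Rn */
static void release(int r)      { claimed[r] = 0; }   /* RELEASE            */

/* Pick a register for an implicit temporary, skipping claimed ones. */
static int get_temp(void)
{
    for (int r = 0; r < NREGS; r++)
        if (!claimed[r])
            return r;
    return -1;                          /* caller must spill to memory */
}

int main(void)
{
    def_special(0);                     /* suppose the optimiser has claimed R0 */
    printf("temporary allocated: R%d\n", get_temp());   /* R1: R0 is avoided   */
    release(0);                         /* R0 returned to implicit use          */
    printf("temporary allocated: R%d\n", get_temp());   /* now R0 again         */
    return 0;
}

The parallel with memory is exact: a location is claimed by the definition of a descriptor and given back at the END of the enclosing block, as noted above.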
6.2 Performance The figures in appendix A3 are the results of measuring the effect of various optimisations on the Interdata 7/32 and the DEC PDP 11/45. One problem in choosing programs to be measured is that heavy use of particular language features will increase the overall effect of certain optimisations. As a trivial example of this consider the following "program": -------------------------------- | begin | | integerarray A(1:1000) | | A(1) = 0 | | endofprogram | -------------------------------- With all array optimisations enabled, on the 7/32 this generates 30 bytes of code, whereas without the optimisation it results in 170 bytes of code, largely due to the procedure for declaring the array. Clearly a reduction of 82% is not to be expected on more typical programs. Similarly the absence of features will bias the results. In particular the smaller programs will not demonstrate the power of the optimisations which only take effect when various size limits have been exceeded: the most obvious such limits being addressing restrictions caused by the size of address fields in instructions.
The major difficulty in producing results which are of any real value is that the effects of the optimisations depend on the individual style in which the programs under consideration were written. Inevitably users get a "feel" for the sort of statement for which the compiler generates good code and they modify their style of programming accordingly. If at some stage in its development a compiler produces poor code for a particular construction users will tend to avoid that construction, even long after the compiler has been improved and can compile it effectively. This well-known phenomenon [Whitfield, 1973] argues strongly that users should never see the object code generated by the compilers they are using. The effects of many optimisations are difficult if not impossible to measure with any degree of accuracy as they interact with other optimisations to a great extent. The most obvious interaction is that between the size of jump instruction required and most of the other optimisations. The size of jump is determined by the amount of code generated between the jump and the label it references. If any other optimisation is inhibited this volume of code is likely to increase, decreasing the chances of being able to use the shorter forms of the jump. Some optimisations depend almost totally on others; it is unlikely that the optimisation of reducing or removing the entry and exit code sequences associated with procedures (section 4.5.1) would have much effect if the parameters
were not passed in registers and references to them in the procedures were replaced by references to those registers. In particular, it must be noted that it is always possible to generate programs which will benefit greatly from those optimisations which do not appear to be of much use from the figures given. However, the test programs used to derive the figures are typical of the programs processed by the compiler, and it is hoped that they give a more realistic and balanced view of the improvements which may be achieved in 'real' cases. Under some circumstances it may be advantageous to apply all optimisations even though some may appear to give little benefit, since this 'squeezing the pips' frequently removes one or two instructions from critical loops in a program. Yet again this shows the difficulty in quantifying the usefulness of optimisations as they are so dependent on the particular circumstances.
One area of measurement has been deliberately omitted from the figures, namely the effect on execution time of the optimisations. This was for several reasons: 1. On the systems used it was impossible to get reliable timing measurements with any accuracy greater than about plus or minus 5%. 2. For the reasons given previously, many programs could benefit from fortuitous optimisations which could not be expected every time. 3. Programs which would run long enough to improve the accuracy of the measurements invariably lost this accuracy through spending much time in the system-provided procedures, mainly for input and output. This point in particular suggests that as the overhead is beyond the control of the general user, the savings in code space may be much more important. Even with ever-growing store sizes, virtual memory systems will continue to treat smaller programs better than larger ones.
4. Some of the optimisations, particularly passing parameters in registers, prevent the compiled program from running, unless the controlling environment is modified in a parallel way. This would invalidate the timings as the environment is not usually under the control of the compiler. From the crude measures which were obtained there is a suggestion that the decrease in execution time roughly parallels the decrease in code size. 6.3 Cost of optimisation The cost of an optimisation is, in general, very difficult to measure, as may be seen by considering the three relevant areas: compile time, space requirement, and logical complexity. 6.3.1 Compile time In order to generate good code, the compiler must spend time looking for the cases which are amenable to improvement. If no optimisation is performed this time is not used and so the compilation should take less time. However, the non-optimised version commonly requires the production of more code than the optimised version, frequently over fifty percent more when comparing fully diagnostic code with fully optimised code. On all the compilers written so far, the time saved by not having to generate these extra instructions more than outweighs the time spent in deciding not to generate them.
6.3.2 Space requirement Several optimisations increase the requirement for workspace, notably all the remembering optimisations. On most machines available at the present the number of things which may be remembered is fairly small: sixteen registers and one condition-code is probably the maximum. Even if this number is increased by remembering several facts about each thing, the total amount of space needed will be small when compared with the space needed to hold the information about user-defined objects, information which is required whether optimisation is being performed or not. On large machines the extra memory required will be cheap; on small machines the need for the optimisation will have to be balanced against the size of the largest program which must be compiled. 6.3.3 Logical complexity The cost of providing an optimisation includes a non-recurrent component, which is the difficulty of performing the optimisation at all because of the logical complexity of discovering the necessary circumstances. In a system which is aimed at portability this cost can often be shared over a number of implementations; the techniques used in one being applicable to others, perhaps after minor modifications.
6.4 Comments on the results 6.4.1 Register remembering Of all the optimisations tested, a simple remembering of values in registers provided by far the greatest improvement in code size. One problem in implementing this optimisation is deciding what to remember, as shown by the following code sequence: ----------- | X = Y | | | | L 1,Y | | ST 1,X | ----------- Following this sequence register 1 will contain both the value in X and the value in Y; should the compiler remember X or Y or both? The measurements show that the gain in remembering both (2 uses) as opposed to just one (1 use) is quite small. The algorithm used to determine what to remember in the '1 use' case was simply to remember a new piece of information only if nothing else was known about the register in question. This gives the best results in cases such as: A = 0; B = 0; C = 0 where the value '0' will be remembered, but will perform badly with the more contorted case: A = 0; B = A; C = B as again only the value '0' will be remembered. Unless very tight code is required, the cost in
maintaining multiple sets of information about each register and searching for particular values will probably rule out such extended remembering optimisations. Perhaps a surprising result is that the PDP11 on average gains about as much from this optimisation as the 7/32. This is the result of two interacting effects. Firstly, the 7/32 dedicates up to five registers to address local variables in the last five levels of procedure nesting and locks three for other fixed purposes, leaving about ten for intermediate calculations. The PDP11, however, uses a display in store to access intermediate levels and has to load the address of a particular level each time it is required. In addition the PDP11 implementation fixes the use of four registers, leaving only four for intermediate calculations. Secondly, the 7/32 needs to use at least one register to move values around while the PDP11 often requires none. These two effects give a fairly large number of transient values in the registers of the 7/32, and a smaller number of more frequently used values (addresses) in the registers of the PDP11. On average it appears that the number of times necessary values are found is roughly equal in the two cases.
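The '1 use' rule is simple enough to sketch (C, invented names; printing stands in for instruction emission): a store only creates a new register/variable association when nothing is already known about the register, and a later load is suppressed when the wanted value is already known to be in some register.

#include <stdio.h>
#include <string.h>

#define NREGS 8
static char known[NREGS][16];       /* "" => nothing known about the register */

/* After a store, remember the variable only if the slot is empty ('1 use'). */
static void note_store(int r, const char *var)
{
    if (known[r][0] == '\0')
        strncpy(known[r], var, sizeof known[r] - 1);
}

/* Load 'var', unless some register is already known to hold its value.
   The sketch always uses register 1 for simplicity.                      */
static void load(const char *var)
{
    for (int r = 0; r < NREGS; r++)
        if (strcmp(known[r], var) == 0) {
            printf("  reuse %s from R%d\n", var, r);
            return;
        }
    printf("  L   1,%s\n", var);
    known[1][0] = '\0';             /* old contents of R1 invalidated */
    note_store(1, var);
}

int main(void)
{
    load("Y");  printf("  ST  1,X\n");  note_store(1, "X");  /* X = Y        */
    load("Y");                                               /* load avoided */
    return 0;
}

With the '2 uses' strategy the note_store after the ST would also succeed, at the cost of keeping and searching several facts per register.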
register later on. This instability seems to be undesirable, but alternative strategies, such as biasing the allocation towards or away from particular registers, on average result in worse code. 6.4.4 Common operands On the 7/32 the only instruction which can be used to simplify statements of the form: X = X op Y is the AM (add to memory) instruction. It is therefore somewhat surprising that its use frequently saves over two percent of the code. The two possible expansions of a suitable addition statement are: ------------ ------------ | | | | L 1,Y | L 1,X | | AM 1,X | A 1,Y | | | ST 1,X | ------------ ------------ The first saves four bytes and leaves the increment in the register. Even if the incremented value is required immediately the extra load instruction will only increase the code size to that of the alternative sequence. As the PDP11 has many instructions which can be used in this way it is hardly surprising that it benefits much more.
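The special case is a straightforward pattern match in the code generator, roughly as follows (a C sketch with invented names; printing stands in for instruction emission):

#include <stdio.h>
#include <string.h>

/* When the destination is also the left operand of an addition, emit the
   7/32 'add to memory' form instead of load/add/store.                    */
static void assign_add(const char *dest, const char *lhs, const char *rhs)
{
    if (strcmp(dest, lhs) == 0) {          /* X = X + Y */
        printf("  L   1,%s\n", rhs);
        printf("  AM  1,%s\n", dest);      /* adds R1 into memory at dest */
    } else {                               /* general case */
        printf("  L   1,%s\n", lhs);
        printf("  A   1,%s\n", rhs);
        printf("  ST  1,%s\n", dest);
    }
}

int main(void)
{
    assign_add("X", "X", "Y");             /* X = X + Y : two instructions   */
    assign_add("Z", "X", "Y");             /* Z = X + Y : three instructions */
    return 0;
}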
6.4.5 Parameters in registers This optimisation gives another significant saving in code at little cost to the compiler, simply by moving the store instructions for parameter assignment from the many calls to the unique procedure definitions. The effect is more pronounced on the 7/32 as all assignments require two instructions, a load and a store, whereas the PDP11 can usually make do with one MOV instruction. In the latter case the saving comes from the ability to reduce the size of the procedure entry and exit sequences if all of the parameters can be passed in registers. 6.4.6 Condition-code remembering On machines with condition codes many instructions set the result of a comparison with zero as a side-effect. Knowledge of this can be used to inhibit explicit code to compare values with zero. However, the small benefit so gained suggests that it is not worth doing, even though it is a very cheap test.
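The remembering involved is equally cheap, as this sketch suggests (C, invented names): the code generator records which value the condition code currently reflects, and only emits the explicit comparison with zero when that record is stale.

#include <stdio.h>
#include <string.h>

static char cc_holds[16] = "";             /* variable the CC currently reflects */

static void emit_load(const char *var)
{
    printf("  L   1,%s\n", var);
    strncpy(cc_holds, var, sizeof cc_holds - 1);   /* load sets the CC from var */
}

static void branch_if_zero(const char *var, const char *label)
{
    if (strcmp(cc_holds, var) != 0)        /* only compare if the CC is stale */
        printf("  C   1,=0\n");
    printf("  BE  %s\n", label);
}

int main(void)
{
    emit_load("X");
    branch_if_zero("X", "L1");             /* compare suppressed               */
    cc_holds[0] = '\0';                    /* a procedure call, say, destroys the CC */
    branch_if_zero("X", "L2");             /* explicit compare emitted         */
    return 0;
}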
6.4.7 Merging As the techniques for merging and delaying are quite expensive, but not complicated, and have a major influence on the design of the code-generator, the small gains achieved are probably not worth the trouble, unless the last drop of efficiency is required at all costs.
6.5 Criticisms and benefits of the technique 6.5.1 Complexity The main argument against the use of high-level intermediate codes is that they move the complexity of code generation from the common machine independent phase into the machine dependent phase, forcing work to be repeated each time a new compiler is required. While this is undoubtedly true the overheads are not as great as they may first appear. The extra complexity of the code generators may be split into two parts: an organisational part which builds and maintains the structures used during the compilation, and processes the intermediate code, using it to drive the second part, an operational part which uses the structures to generate code as instructed by the organisational part. The changes in the organisational part when moving to a new machine are small enough to permit the use of large sections of code from old compilers. Even when considering the operational part much will be similar from machine to machine; in particular the communication between the second and third phases and the bulk of that latter phase can be taken without change. From examining the compilers produced using I-code it appears that about 60% of the source of the machine dependent parts is common, 20% can be considered as being selected from a standard "kit" of parts, and the final 20% is unique to the host machine.
6.5.2 I/O overhead One of the disadvantages of dividing a compiler into several distinct phases is that it results in an additional cost in communicating between consecutive phases. As discussed in section 5.1 this cost depends on the system running the compiler. Even in the worst case, where communication is achieved using conventional files, the overhead may not be too serious. The time spent doing input and output on the Interdata 7/32 compiler is about 27% of the total compilation time, and for the PDP11 is about 22%, breaking down as follows:

             ----         ----         ----
 Source --->| P1 |------->| P2 |------>| P3 |---> Object
             ----         ----         ----

 7/32:        7%           7%          10%        3%
             (4%)         (4%)         (5%)      (3%)

 PDP11:      9.4%         11%          0.6%       0.5%

The figures in parentheses give the percentage of time taken when the input and output requests are made directly to the file manager rather than via the standard subsystem procedures, thus reducing the internal I/O overhead to about 10% of the total compilation time.
One important gain in using such intermediate codes is that they can ease the difficulties associated with maintaining a number of compilers for different machines, when those compilers are self-compiling. For several reasons it may not be desirable to permit sites to have the source of the machine-independent phase: commonly to give freedom of choice for the form of the language in which that phase is written and to prevent local "improvements" which rapidly lead to non-standard language definitions. In such cases the intermediate-code generator can be maintained at one site and updated versions can be distributed in the intermediate code form without fear of compromising the quality of the object code generated from it. Such a technique is currently being used in the production of portable SIMULA compilers [Krogdahl, 1980]. 6.5.4 Flexibility At some stage in producing a compiler the needs of the end user must be considered. The flexibility afforded by the high-level nature of the intermediate code allows the compiler to be adapted to fit its environment. If the compiler is to be used for teaching, the quality of the code it produces can be sacrificed for compilation speed and high-quality diagnostics, particularly as compilation time may well be an order of magnitude greater than the execution time. Indeed many of the programs will fail to compile and
never reach execution. If the application is for compiling programs that will be compiled once and then executed many times, more effort can be expended in producing fast code, although this is not to say that diagnostics and fast code must be kept separate, as the longer a program runs without failing the more trouble will be caused when it fails without convenient diagnostics. 6.6 Comments on Instruction sets and compilation Following the production of IMP compilers for several different processors various features of instruction sets have become evident which influence the generation of code. i The instruction set should be complete, that is, where an instruction is available for one data type it should be available for all data types for which it is well-defined. Similarly, instruction formats used by one operation should be available for all similar operations. The best example of such an instruction set is that provided by the DEC PDP10. Unfortunately the majority of machines are not so helpful. As an example of the sorts of thing which go wrong consider the Perkin-Elmer 3200 series. These machines provide three integer data types: fullword (32 bits, signed), halfword (16 bits, signed), and byte (8 bits, unsigned). There are "add fullword" (A) and "add halfword" (AH) instructions but no
"add byte" instruction. There are "add immediate short" and "subtract immediate short" instructions but multiply, divide, and, or etc. cannot have short immediate operands. ii The instructions should be consistent, that is, logically similar instructions should behave in similar fashions. Again, on the Perkin-Elmer 3200: Load fullword (L) and load halfword (LH) set the condition code but load byte (LB) does not. Most register-store instructions can be replaced by a load of the appropriate type followed by a register-register instruction: e.g. -------------- -------------- | CH 1,X | LH 0,X | | | CR 1,0 | -------------- -------------- both result in the same setting of the condition code, but -------------- -------------- | CLB 1,B | LB 0,B | | | CLR 1,0 | -------------- -------------- could result in different settings of the condition code as CLR compares two unsigned 32 bit quantities whereas CLB compares a zero-extended byte from store with the zero-extended least significant byte of register 1. For consistency either compare halfword (CH) should use the sign-extended least significant half of the register, or better, CLB should not tamper with the value in the register.
iii Complex instructions should be avoided. There are two reasons for this. Firstly, it is easier for a compiler to break down statements into simple operations than it is to build them up into complex ones [Stockton-Gaines, 1965]. Secondly, if the complex instructions do not perform the exact function required by the language more instructions will be needed to "prepare" for the complex instruction and to "undo" its unwanted effects. As an example, the DEC VAX11/780 is full of complex instructions which seem to be well-suited to high-level languages at first glance but on closer inspection they are not so useful. A CASE instruction is provided which indexes into a table of displacements and adds the selected value to the program counter. This would seem ideal for compiling SWITCH jumps. Unfortunately, as the table of displacements follows the CASE instruction it would be very expensive to use it each time a jump occurred using a particular switch. Instead all references to the switch must jump to a common CASE instruction. Even this does not help, as in the event of an attempted jump to a non-existent switch label the diagnostics or the event mechanism will see the error as having occurred at the wrong place in the program. Although this problem can be "programmed around" it turns out that it is faster to implement switches using sequences of simpler instructions.
iv Machine designers should investigate carefully the full consequences of building-in special fixed uses of machine features. One of the best examples of a clear oversight which causes grief to compiler writers is found in the DATA GENERAL NOVA multiplication instruction. This instruction multiplies the value in register 1 by register 2 and places the double-length result in registers 0 and 1. As only registers 2 and 3 may be used for addressing, and as register 3 is always used for subroutine linkage, it follows that register 2 must be used for addressing the local stack frame, but this is exactly the register which must be corrupted in order to use the multiply instruction! Although specific machines have been used in the examples similar problems abound in all machines. Indeed it is clear that machines are most commonly designed for programmers writing in assembler or FORTRAN, and furthermore writing their programs in a particular style. While it is clear that the problems could be called "mere details" and that they are not difficult to surmount, it remains that they complicate otherwise simple code-generation algorithms making compilers larger, slower and correspondingly more difficult to write, debug, and
maintain. In conclusion it appears that the machine most suited to supporting high-level languages should have a small but complete set of very simple instructions, their simplicity permitting rapid execution and great flexibility.
7 Conclusions 7.1 Viability of the technique The techniques described above have been used to create several IMP77 compilers which are in regular use on a number of systems. In terms of total memory space required for a compilation, they compare favourably with other compilers. The major weakness seems to be execution time which can vary from twice as long as other compilers in the worst case, to half as long in the best case. As most of the effort in writing the compilers was spent in investigating the techniques involved and not in minimising compile time, and as the compilers which ran much faster were either totally, or partially written in machine code (the IMP77 compilers are all written exclusively in IMP77), it seems that the technique can be used to produce acceptable service compilers. 7.2 Ease of portability Although using I-code does not permit compilers to be written in as short a time as with P-code and O-code, the large amount of code which is common to all of the compilers written so far means that, given a working code generator as a template, a new optimising compiler can be written in the space of a few months, with the final result producing code of high quality.
area, or created specially. Items on the stack are modified by intermediate code control items to reflect operations specified in the source program. Such modifications may or may not result in code being generated. From the point of view of this definition stack elements are considered to have at least three components: i Type ii Value iii Access rule The "Access rule" defines how the "Type" and "Value" attributes are to be interpreted in order to locate the described object. For example, the access rule for a constant could be "Value contains the constant" while for a variable it could be "Value contains the address of the variable". Clearly, the access rules are target-machine dependent. Descriptors may be combined to give fairly complex access rules, as in the case of applying "PLUS" to the stack when the top two descriptors are for the variable X and the constant 1, resulting in one descriptor with the access rule "take the value in X and add 1 to it". The complexity of these access rules may be restricted by a code-generator. In the example above code could be generated to evaluate X+1 resulting in an access rule "the value is in register 1", say.
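As an illustration only (C, with invented names and a deliberately tiny set of access rules; the real descriptors carry much more), the stack and the folding of "PLUS" into a compound access rule might look like this:

#include <stdio.h>

enum access { A_CONSTANT, A_VARIABLE, A_VARIABLE_PLUS };   /* access rules */

struct descriptor {
    enum access rule;
    const char *name;       /* identifier, for variables   */
    long value;             /* the constant, or the offset */
};

#define STACKMAX 32
static struct descriptor stack[STACKMAX];
static int sp;

static void push(struct descriptor d) { stack[sp++] = d; }

/* Apply PLUS to the top two descriptors without generating any code:
   here TOS is assumed to be the constant and SOS the variable.        */
static void plus(void)
{
    struct descriptor t = stack[--sp];      /* TOS: the constant */
    struct descriptor s = stack[--sp];      /* SOS: the variable */
    struct descriptor r = { A_VARIABLE_PLUS, s.name, t.value };
    push(r);                                /* "value in X, plus 1" */
}

int main(void)
{
    push((struct descriptor){ A_VARIABLE, "X", 0 });
    push((struct descriptor){ A_CONSTANT, NULL, 1 });
    plus();
    printf("access rule: value in %s plus %ld\n", stack[0].name, stack[0].value);
    return 0;
}

A code generator that wished to restrict the complexity of access rules would, at this point, generate code for X+1 and replace the descriptor by one whose rule is simply "the value is in register 1".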
The importance of the code not describing the actual computation which the source program specified but the compilation process required is seen when attempting to use the code for statements of the form: A := if B = C then D else E; This could not be encoded as:

PUSH A
PUSH B
PUSH C
JUMP # L1
PUSH D
BR L2
LOC L1
PUSH E
LOC L2
ASSVAL

The reason is that the items on the stack at the time of the ASSVAL would be (from top to bottom) [E], [D], [A], because no items were given which would remove them from the stack. Hence the ASSVAL would assign the value of E to D and then leave A dangling on the stack. Unless otherwise stated, all constants in the intermediate code are represented in octal.
Descriptors DEF TAG TEXT TYPE FORM SIZE SPEC PREFIX This item causes a new descriptor to be generated and placed in the descriptor area. On creation, the various fields of the DEF are used to construct the machine-dependent representation required for the object. TAG is an identification which will be used subsequently to refer to the descriptor. TEXT is the source-language identifier given to the object (a null string if no identifier was specified). TYPE is the type of the object: GENERAL, INTEGER, REAL, STRING, RECORD, LABEL, SWITCH, FORMAT. FORM is one of: SIMPLE, NAME, ROUTINE, FN, MAP, PRED, ARRAY, NARRAY, ARRAYN, NARRAYN. SIZE is either the TAG of the appropriate record format descriptor for records, the maximum length of a string variable, or the precision of numerical variables: DEFAULT, BYTE, SHORT, LONG.
SPEC has the value SPEC or NONE depending on whether or not the item is a specification. PREFIX is one of: NONE, OWN, CONST, EXTERNAL, SYSTEM, DYNAMIC, PRIM, PERM or SPECIAL. If SPECIAL is given there will follow an implementation-dependent specification of the properties of the object (such as that it is to be a register, for example).
Parameters and Formats The parameters for procedures and the elements of record formats are defined by a list immediately following the procedure or format descriptor definition: START Start of definition list FINISH End of definition list ALTBEG Start of alternative sequence ALT Alternative separator ALTEND End of alternative sequence. Blocks BEGIN Start of BEGIN block END End of BEGIN block or procedure
PUSH <tag> Push a copy of the descriptor <tag> onto the stack. PROC <tag> This is the same as PUSH except that the descriptor being stacked represents a procedure which is about to be called (using ENTER). PUSHI <n> Push a descriptor for the integer constant <n> onto the stack. PUSHR <r> Push a descriptor for the real (floating-point) constant <r> onto the stack. PUSHS <s> Push a descriptor for the string constant <s> onto the stack. SELECT <tag> TOS will be a descriptor for a record. Replace this descriptor with one describing the sub-element <tag> of this record.
Assignment ASSVAL Assign the value described by TOS to the variable described by SOS. Both TOS and SOS are popped from the stack. ASSREF Assign a reference to (the address of) the variable described by TOS to the pointer variable described by SOS. Both TOS and SOS are popped from the stack. JAM This is the same as ASSVAL except that the value being assigned will be truncated if necessary. ASSPAR Assign the actual parameter described by TOS to the formal parameter described by SOS. This is equivalent to either ASSVAL (for value parameters) or ASSREF (for reference parameters). RESULT TOS describes the result of the enclosing function. Following the processing of the result, code must be generated to return from the function.
MAP Similar to RESULT except that TOS describes the result of a MAP. Again a return must be generated. DEFAULT <n> INIT <n> Create <n> data items, corresponding to the last descriptor defined, and give them all an initial (constant) value. The constant is popped from the stack in the case of INIT but DEFAULT causes the machine-dependent default value to be used (normally the UNASSIGNED pattern).
Binary operators ADD Addition SUB Subtraction MUL Multiplication QUOT Integer division DIVIDE Real division IEXP Integer exponentiation REXP Real exponentiation AND Logical AND OR Logical inclusive OR XOR Logical exclusive OR LSH Logical left shift RSH Logical right shift CONC String concatenate ADDA ++ SUBA -- The given operation is performed on TOS and SOS , both of which are removed from the stack, and the result (SOS op TOS) is pushed onto the stack. e.g. A = B-C PUSH A PUSH B PUSH C SUB ASSVAL
Unary Operators NEG Negate (unary minus) NOT Logical NOT (complement) MOD Modulus (absolute value) The given operation is performed on TOS.
Arrays DIM <d> <n> The stack will contain <d> pairs of descriptors corresponding to the lower and upper bounds for an array. This information is used to construct <n> arrays and any necessary accessing information for use through the last <n> descriptors to have been defined. All of these descriptors will be for similar arrays. INDEX SOS will be the descriptor for a multi-dimensional array and TOS will be the next non-terminal subscript. The stack is popped. ACCESS SOS will be the descriptor of an array and TOS will be the final/only subscript. Both descriptors are replaced by a descriptor for the appropriate element of the array. E.g. given arrays A(1:5) and B(1:4, 2:6), and integers J,K:

   A(J) = 0         K = B(J, K)

   PUSH A           PUSH K
   PUSH J           PUSH B
   ACCESS           PUSH J
   PUSHI 0          INDEX
   ASSVAL           PUSH K
                    ACCESS
                    ASSVAL
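The address arithmetic that INDEX and ACCESS imply for B(1:4, 2:6) can be sketched as follows (C, invented names; a contiguous row-major layout and a 4-byte element are assumptions of the example, not requirements of I-code):

#include <stdio.h>

struct bound { long low, high; };

/* Fold one non-terminal subscript into a partial index (INDEX). */
static long fold_index(long partial, long sub, struct bound b)
{
    return partial * (b.high - b.low + 1) + (sub - b.low);
}

/* Produce the element address from the final subscript (ACCESS). */
static long element_addr(long base, long partial, long sub,
                         struct bound b, long elsize)
{
    return base + fold_index(partial, sub, b) * elsize;
}

int main(void)
{
    struct bound d1 = { 1, 4 }, d2 = { 2, 6 };
    long base = 1000, elsize = 4;          /* arbitrary for the example */
    long j = 3, k = 5;

    long partial = fold_index(0, j, d1);   /* INDEX: consume subscript J */
    long addr = element_addr(base, partial, k, d2, elsize);   /* ACCESS  */
    printf("address of B(%ld,%ld) = %ld\n", j, k, addr);
    return 0;
}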
Internal labels Internal labels are those labels in the intermediate code which have been created by the process of translating from the source program, and so do not appear explicitly in the source program. The main property of these labels is that they will only be referred to once. This fact can be used to re-use these labels, as, for example, a forward reference to a currently-defined label must cause its redefinition. LOCATE <l> Define internal label <l> GOTO <l> Forward jump to internal label <l> REPEAT <l> Backward jump to internal label <l>
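A sketch of how a code generator might exploit the single-reference property (C, invented names; addresses and instruction sizes are purely illustrative): the slot recording an unresolved forward reference is freed by the LOCATE that defines the label, so the label number can immediately be re-used.

#include <stdio.h>

#define NLABELS 16
static long pending[NLABELS];       /* address of the unresolved forward jump,
                                       or 0 when the slot is free             */
static long here = 0x100;           /* current code address (illustrative)    */

static void io_goto(int l)          /* GOTO <l>: forward jump, leave a hole   */
{
    printf("  %04lx: B   ????      ; GOTO L%d\n", here, l);
    pending[l] = here;
    here += 4;
}

static void io_locate(int l)        /* LOCATE <l>: fill the hole, free slot   */
{
    printf("  fix jump at %04lx to target %04lx   ; LOCATE L%d\n",
           pending[l], here, l);
    pending[l] = 0;                 /* the label number may now be re-used    */
}

int main(void)
{
    io_goto(1);
    here += 8;                      /* some code in between */
    io_locate(1);
    io_goto(1);                     /* label 1 re-used for a new forward jump */
    here += 4;
    io_locate(1);
    return 0;
}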
Conditional branches These branches are always forward. JUMPIF <cond> <label> JUMPIFD <cond> <label> JUMPIFA <cond> <label> Where: <cond> ::= =, #, <, <=, >, >=, TRUE, FALSE The two items on the top of the stack are compared and a jump is taken to <label> if the condition specified by <cond> is true. In the case of <cond> being TRUE or FALSE only one item is taken from the stack, and this represents a boolean value to be tested. User Labels LABEL <d> Locate label descriptor <d> JUMP <d> Jump to the label described by <d> CALL <d> Call the procedure described by <d>
Sundry Items ON <e> <l> Start of event trap for events <e>. Internal label <l> defines the end of the event block. EVENT <e> Signal event <e> STOP stop MONITOR monitor RESOLVE <m> Perform a string resolution FOR Start of a for loop SLABEL <sd> Define switch label SJUMP <sd> Select and jump to switch label LINE <l> Set the current line number to <l>
Remembering values in registers Code Total Incremental Size Reduction Reduction ---- --------- --------- P732.1 0 uses 9504 - - 1 use 8194 13.8% 13.8% 2 uses 8192 13.8% 0.0% P732.2 0 uses 6500 - - 1 use 6126 5.8% 5.8% 2 uses 6126 5.8% 0.0% P732.3 0 uses 10960 - - 1 use 9968 9.0% 9.0% 2 uses 9956 9.2% 0.2% P732.4 0 uses 5288 - - 1 use 4970 6.0% 6.0% 2 uses 4958 6.2% 0.2% P732.5 0 uses 5468 - - 1 use 4990 8.7% 8.7% 2 uses 4986 8.8% 0.1% P732.6 0 uses 3424 - - 1 use 3208 6.3% 6.3% 2 uses 3208 6.3% 0.0% P732.7 0 uses 10736 - - 1 use 9880 8.0% 8.0% 2 uses 9874 8.0% 0.0% P732.8 0 uses 824 - - 1 use 770 6.6% 6.6% 2 uses 770 6.6% 0.0% P732.9 0 uses 6448 - - 1 use 6148 4.6% 4.6% 2 uses 6148 4.6% 0.0% P732.10 0 uses 22968 - - 1 use 20656 10.1% 10.1% 2 uses 20650 10.1% 0.0% P732.11 0 uses 13996 - - 1 use 12470 10.9% 10.9% 2 uses 12442 11.1% 0.2% P732.12 0 uses 32600 - - 1 use 28532 12.5% 12.5% 2 uses 28392 12.9% 0.4%
Code Total Incremental Size Reduction Reduction ---- --------- --------- P11.1 0 uses 9060 - - 1 use 7712 14.9% 14.9% 2 uses 7660 15.4% 0.5% P11.2 0 uses 6276 - - 1 use 6000 4.4% 4.4% 2 uses 6000 4.4% 0.0% P11.3 0 uses 9992 - - 1 use 9480 5.1% 5.1% 2 uses 9444 5.5% 0.4% P11.4 0 uses 5020 - - 1 use 4772 5.4% 5.4% 2 uses 4768 5.6% 0.2% P11.5 0 uses 5096 - - 1 use 4460 12.5% 12.5% 2 uses 4452 12.6% 0.1% P11.6 0 uses 3692 - - 1 use 3064 17.0% 17.0% 2 uses 4064 17.0% 0.0% P11.7 0 uses 7976 - - 1 use 7060 11.5% 11.5% 2 uses 7030 11.8% 0.3% P11.8 0 uses 668 - - 1 use 652 2.4% 2.4% 2 uses 624 6.6% 4.2% P11.9 0 uses 4888 - - 1 use 4492 8.1% 8.1% 2 uses 4484 8.3% 0.2% P11.10 0 uses 20318 - - 1 use 19120 5.9% 5.9% 2 uses 19120 5.9% 0.0% P11.11 0 uses 12938 - - 1 use 12162 6.0% 6.0% 2 uses 12148 6.1% 0.1% P11.12 0 uses 12068 - - 1 use 10594 12.2% 12.2% 2 uses 10584 12.3% 0.0%
Remembering sets of registers (environments) Code Total Incremental Size Reduction Reduction ---- --------- --------- P732.1 0 environments 8556 - - 1 environment 8316 2.8% 2.8% 2 environments 8238 3.7% 0.9% 3 environments 8232 3.8% 0.1% 4 environments 8222 3.9% 0.1% 5 environments 8218 4.0% 0.1% 6 environments 8192 4.2% 0.2% P732.2 0 environments 6202 - - 1 environment 6128 1.2% 1.2% 2 environments 6130 1.2% 0.0% 3 environments 6126 1.2% 0.0% 4 environments 6126 1.2% 0.0% 5 environments 6126 1.2% 0.0% 6 environments 6126 1.2% 0.0% P732.3 0 environments 10174 - - 1 environment 10062 1.1% 1.1% 2 environments 9968 2.0% 0.9% 3 environments 9966 2.0% 0.0% 4 environments 9964 2.1% 0.1% 5 environments 9956 2.1% 0.1% 6 environments 9956 2.1% 0.1% P732.4 0 environments 5068 - - 1 environment 4978 1.8% 1.8% 2 environments 4958 2.2% 0.4% 3 environments 4958 2.2% 0.0% 4 environments 4958 2.2% 0.0% 5 environments 4958 2.2% 0.0% 6 environments 4958 2.2% 0.0% P732.6 0 environments 3262 - - 1 environment 3250 0.4% 0.4% 2 environments 3216 1.4% 1.0% 3 environments 3208 1.7% 0.3% 4 environments 3208 1.7% 0.0% 5 environments 3208 1.7% 0.0% 6 environments 3208 1.7% 0.0% P732.7 0 environments 10062 - - 1 environment 9970 0.9% 0.9% 2 environments 9894 1.7% 0.8% 3 environments 9880 1.8% 0.1% 4 environments 9874 1.9% 0.1% 5 environments 9874 1.9% 0.0% 6 environments 9874 1.9% 0.0%
P732.8 0 environments 806 - - 1 environment 782 3.0% 3.0% 2 environments 782 3.0% 0.0% 3 environments 770 4.5% 1.5% 4 environments 770 4.5% 0.0% 5 environments 770 4.5% 0.0% 6 environments 770 4.5% 0.0% P732.9 0 environments 6244 - - 1 environment 6202 0.7% 0.7% 2 environments 6156 1.4% 0.7% 3 environments 6158 1.4% 0.0% 4 environments 6148 1.5% 0.1% 5 environments 6148 1.5% 0.0% 6 environments 6148 1.5% 0.0% P732.10 0 environments 21214 - - 1 environment 20928 1.3% 1.3% 2 environments 20748 2.2% 0.9% 3 environments 20678 2.5% 0.3% 4 environments 20678 2.5% 0.0% 5 environments 20668 2.6% 0.1% 6 environments 20650 2.6% 0.0% P732.11 0 environments 12772 - - 1 environment 12592 1.4% 1.4% 2 environments 12486 2.2% 0.8% 3 environments 12472 2.3% 0.1% 4 environments 12460 2.4% 0.1% 5 environments 12452 2.5% 0.1% 6 environments 12442 2.6% 0.1% P732.12 0 environments 11522 - - 1 environment 11418 0.9% 0.9% 2 environments 11342 1.6% 0.7% 3 environments 11314 1.8% 0.2% 4 environments 11314 1.8% 0.0% 5 environments 11296 2.0% 0.2% 6 environments 11296 2.0% 0.0% P11.1 0 environments 7686 - - 1 environment 7670 0.2% 0.2% 2 environments 7660 0.3% 0.1% 3 environments 7660 0.3% 0.0% 4 environments 7660 0.3% 0.0% 5 environments 7660 0.3% 0.0% 6 environments 7660 0.3% 0.0% P11.2 0 environments 6012 - - 1 environment 6000 0.2% 0.2% 2 environments 6000 0.2% 0.0% 3 environments 6000 0.2% 0.0% 4 environments 6000 0.2% 0.0% 5 environments 6000 0.2% 0.0% 6 environments 6000 0.2% 0.0%
P11.3 0 environments 9472 - - 1 environment 9440 0.3% 0.3% 2 environments 9444 0.3% -0.0% 3 environments 9444 0.3% 0.0% 4 environments 9444 0.3% 0.0% 5 environments 9444 0.3% 0.0% 6 environments 9444 0.3% 0.0% P11.4 0 environments 4784 - - 1 environment 4776 0.2% 0.2% 2 environments 4776 0.2% 0.0% 3 environments 4776 0.2% 0.0% 4 environments 4776 0.2% 0.0% 5 environments 4772 0.2% 0.0% 6 environments 4768 0.3% 0.1% P11.5 0 environments 4512 - - 1 environment 4464 1.1% 1.1% 2 environments 4456 1.2% 0.1% 3 environments 4452 1.3% 0.1% 4 environments 4452 1.3% 0.0% 5 environments 4452 1.3% 0.0% 6 environments 4452 1.3% 0.0% P11.6 0 environments 3076 - - 1 environment 3070 0.2% 0.2% 2 environments 3064 0.4% 0.2% 3 environments 3064 0.4% 0.0% 4 environments 3064 0.4% 0.0% 5 environments 3064 0.4% 0.0% 6 environments 3064 0.4% 0.0% P11.7 0 environments 7104 - - 1 environment 7048 0.8% 0.8% 2 environments 7048 0.8% 0.0% 3 environments 7048 0.8% 0.0% 4 environments 7048 0.8% 0.0% 5 environments 7048 0.8% 0.0% 6 environments 7032 1.0% 1.0% P11.8 0 environments 640 - - 1 environment 624 2.5% 2.5% 2 environments 624 2.5% 0.0% 3 environments 624 2.5% 0.0% 4 environments 624 2.5% 0.0% 5 environments 624 2.5% 0.0% 6 environments 624 2.5% 0.0% P11.9 0 environments 4492 - - 1 environment 4484 0.2% 0.2% 2 environments 4484 0.2% 0.0% 3 environments 4484 0.2% 0.0% 4 environments 4484 0.2% 0.0% 5 environments 4484 0.2% 0.0% 6 environments 4484 0.2% 0.0%
P11.10 0 environments 19332 - - 1 environment 19196 0.7% 0.7% 2 environments 19158 0.9% 0.2% 3 environments 19138 1.0% 0.1% 4 environments 19138 1.0% 0.0% 5 environments 19120 1.1% 0.1% 6 environments 19120 1.1% 0.0% P11.11 0 environments 12280 - - 1 environment 12200 0.6% 0.6% 2 environments 12168 0.9% 0.3% 3 environments 12160 1.0% 0.1% 4 environments 12156 1.0% 0.0% 5 environments 12148 1.1% 0.1% 6 environments 12148 1.1% 0.0% P11.12 0 environments 10690 - - 1 environment 10616 0.7% 0.7% 2 environments 10604 0.8% 0.1% 3 environments 10604 0.8% 0.0% 4 environments 10594 0.9% 0.1% 5 environments 10584 1.0% 0.1% 6 environments 10584 1.0% 0.0%
Simple allocation of arrays and remembering subscripts Allocation Remembering Neither Simple (gain) Subscripts (gain) ------- ---- ------ ---------- ------ P732.1 8596 8476 (1.4%) 8312 (3.3%) P732.2 6126 6126 (0.0%) 6126 (0.0%) P732.3 10450 1014 (3.2%) 10426 (0.2%) P732.4 5056 4958 (1.9%) 5056 (0.0%) P732.5 5306 5054 (4.7%) 5308 -(0.0%) P732.6 3384 3254 (3.8%) 3386 -(0.0%) P732.7 10346 10112 (2.3%) 10344 (0.0%) P732.8 806 806 (0.0%) 770 (4.5%) P732.9 6138 6138 (0.0%) 6148 -(0.2%) P732.10 20806 20684 (0.6%) 20776 (0.1%) P732.11 12442 12442 (0.0%) 12442 (0.0%) P732.12 11976 11946 (0.2%) 11326 (5.4%) Both optimisations Total gain ------------------ ---------- P732.1 8192 4.7% P732.2 6126 0.0% P732.3 9956 4.7% P732.4 4958 1.9% P732.5 4986 6.0% P732.6 3208 5.2% P732.7 9874 4.6% P732.8 770 4.5% P732.9 6148 -0.2% P732.10 20650 0.7% P732.11 12442 0.0% P732.12 11296 5.8%
Allocation Remembering Neither Simple (gain) Subscripts (gain) ------- ---- ------ ---------- ------ P11.1 8572 8188 (4.5%) 7704 (10.1%) P11.2 6000 6000 (0.0%) 6000 (0.0%) P11.3 9764 9556 (2.1%) 9644 (1.2%) P11.4 4848 4776 (1.5%) 4848 (0.0%) P11.5 4656 4568 (1.9%) 4452 (4.4%) P11.6 3356 3202 (4.6%) 3218 (4.1%) P11.7 7844 7728 (1.4%) 7204 (8.2%) P11.8 644 624 (3.1%) 644 (0.0%) P11.9 4796 4796 (0.0%) 4484 (6.5%) P11.10 19236 19140 (0.5%) 19216 (0.1%) P11.11 12148 12148 (0.0%) 12148 (0.0%) P11.12 11094 11060 (0.3%) 10616 (4.3%) Both optimisations Total gain ------------------ ---------- P11.1 7660 10.6% P11.2 6000 0.0% P11.3 9444 3.3% P11.4 4768 1.6% P11.5 4452 4.4% P11.6 3064 8.7% P11.7 7032 10.4% P11.8 624 3.1% P11.9 4484 6.5% P11.10 19120 0.6% P11.11 12148 0.0% P11.12 10584 4.6%
Simplifying: X = X op Y

              Code       Code
              Without    With      Gain
              -------    ----      ----
P732.1          8292      8192     1.2%
P732.2          6156      6126     0.5%
P732.3         10068      9956     1.1%
P732.4          5088      4958     2.6%
P732.5          5180      4986     3.7%
P732.6          3368      3208     4.8%
P732.7         11438     11296     1.2%
P732.8           772       770     0.2%
P732.9          6214      6148     1.1%
P732.10        21086     20650     2.1%
P732.11        12590     12442     1.2%
P732.12        11438     11296     1.2%

P11.1           8284      7660     7.5%
P11.2           6220      6000     3.5%
P11.3          10040      9444     5.9%
P11.4           5136      4768     7.2%
P11.5           4800      4452     7.2%
P11.6           3342      3064     8.3%
P11.7           7596      7032     7.4%
P11.8            668       624     6.6%
P11.9           4724      4484     5.1%
P11.10         20634     19128     7.3%
P11.11         12894     12148     5.8%
P11.12         11492     10584     7.9%
Passing some parameters in registers Code Total Incremental Size Reduction Reduction ---- --------- --------- P732.1 0 registers 8862 - - 1 register 8360 5.7% 5.7% 2 registers 8192 7.6% 1.9% P732.2 0 registers 7196 - - 1 register 6544 9.1% 9.1% 2 registers 6126 14.9% 5.8% P732.3 0 registers 10586 - - 1 register 9976 5.8% 5.8% 2 registers 9956 6.0% 0.2% P732.4 0 registers 5126 - - 1 register 4958 3.3% 3.3% 2 registers 4958 3.3% 0.0% P732.5 0 registers 5198 - - 1 register 5022 3.4% 3.5% 2 registers 4986 4.1% 0.7% P732.6 0 registers 3402 - - 1 register 3222 5.3% 5.3% 2 registers 3208 5.7% 0.4% P732.7 0 registers 10400 - - 1 register 10048 3.4% 3.4% 2 registers 9874 5.0% 1.6% P732.8 0 registers 840 - - 1 register 810 3.6% 3.6% 2 registers 770 8.3% 4.7% P732.9 0 registers 6404 - - 1 register 6172 3.6% 3.6% 2 registers 6148 4.0% 0.4% P732.10 0 registers 21650 - - 1 register 20826 3.8% 3.8% 2 registers 20650 4.6% 0.8% P732.11 0 registers 13476 - - 1 register 12442 7.7% 7.7% 2 registers 12442 7.7% 0.0% P732.12 0 registers 11916 - - 1 register 11452 3.9% 3.9% 2 registers 11296 5.2% 1.3%
Code Total Incremental Size Reduction Reduction ---- --------- --------- P11.1 0 registers 7796 - - 1 register 7756 0.5% 0.5% 2 registers 7660 1.7% 1.2% P11.2 0 registers 6192 - - 1 register 6072 1.9% 1.9% 2 registers 6000 3.1% 1.2% P11.3 0 registers 9564 - - 1 register 9448 1.2% 1.2% 2 registers 9444 1.2% 0.0% P11.4 0 registers 4776 - - 1 register 4768 0.2% 0.2% 2 registers 4768 0.2% 0.0% P11.5 0 registers 4508 - - 1 register 4452 1.2% 1.2% 2 registers 4452 1.2% 0.0% P11.6 0 registers 3098 - - 1 register 3064 1.1% 1.1% 2 registers 3064 1.1% 0.0% P11.7 0 registers 7124 - - 1 register 7096 0.4% 0.4% 2 registers 7032 1.3% 0.9% P11.8 0 registers 624 - - 1 register 624 0.0% 0.0% 2 registers 624 0.0% 0.0% P11.9 0 registers 4520 - - 1 register 4488 0.7% 0.7% 2 registers 4484 0.8% 0.1% P11.10 0 registers 19302 - - 1 register 19166 0.7% 0.7% 2 registers 19128 0.9% 0.2% P11.11 0 registers 12364 - - 1 register 12152 1.7% 1.7% 2 registers 12148 1.7% 0.0% P11.12 0 registers 10734 - - 1 register 10648 0.8% 0.8% 2 registers 10584 1.4% 0.6%
Remembering condition-codes

             Unknown    Remembered    Gain
             -------    ----------    ----
P732.1         8820        8192       0.3%
P732.2         6134        6126       0.1%
P732.3         9976        9956       0.2%
P732.4         4968        4958       0.2%
P732.5         4988        4986       0.0%
P732.6         3212        3208       0.1%
P732.7         9880        9874       0.1%
P732.8          770         770       0.0%
P732.9         6150        6148       0.0%
P732.10       20684       20650       0.2%
P732.11       12474       12442       0.2%
P732.12       11318       11296       0.2%

P11.1          7732        7660       0.9%
P11.2          6012        6000       0.3%
P11.3          9516        9444       0.8%
P11.4          4792        4768       0.5%
P11.5          4452        4452       0.0%
P11.6          3076        3064       0.4%
P11.7          7064        7032       0.4%
P11.8           624         624       0.0%
P11.9          4496        4484       0.3%
P11.10        19204       19128       0.4%
P11.11        12192       12148       0.4%
P11.12        10626       10584       0.4%
Forward merging and delaying Forward Neither Merge Delaying Merge & Delay ------- ----- -------- ------------- P732.1 8192 8172 (0.2%) 8160 (0.4%) 8136 (0.7%) P732.2 6126 6110 (0.3%) 6054 (1.2%) 6044 (1.3%) P732.3 9956 9948 (0.2%) 9872 (0.8%) 9864 (0.9%) P732.4 4958 4950 (0.2%) 4942 (0.3%) 4942 (0.3%) P732.5 4986 4970 (0.3%) 4982 (0.1%) 4966 (0.4%) P732.6 3208 3194 (0.4%) 3140 (2.1%) 3122 (2.7%) P732.7 9874 9752 (1.2%) 9834 (0.4%) 9812 (1.6%) P732.8 770 764 (0.5%) 738 (4.2%) 728 (5.4%) P732.9 6148 6136 (0.2%) 6132 (0.3%) 6120 (0.4%) P732.10 20650 20586 (0.3%) 20558 (0.4%) 20490 (0.8%) P732.11 12442 12406 (0.3%) 12342 (0.8%) 12306 (1.1%) P732.12 11296 11280 (0.1%) 11272 (0.2%) 11256 (0.4%) P11.1 7660 7660 (0.0%) 7660 (0.0%) 7660 (0.0%) P11.2 6000 6000 (0.0%) 5988 (0.2%) 5988 (0.2%) P11.3 9444 9444 (0.0%) 9434 (0.1%) 9434 (0.1%) P11.4 4768 4768 (0.0%) 4768 (0.0%) 4768 (0.0%) P11.5 4452 4448 (0.1%) 4452 (0.0%) 4448 (0.1%) P11.6 3064 3064 (0.0%) 3060 (0.1%) 3060 (0.1%) P11.7 7032 7004 (0.4%) 7032 (0.0%) 7004 (0.4%) P11.8 624 620 (0.6%) 620 (0.6%) 616 (1.3%) P11.9 4484 4484 (0.0%) 4484 (0.0%) 4484 (0.0%) P11.10 19128 19128 (0.0%) 19128 (0.0%) 19128 (0.0%) P11.11 12148 12136 (0.1%) 12136 (0.1%) 12124 (0.2%) P11.12 10584 10584 (0.0%) 10584 (0.0%) 10584 (0.0%)
All optimisations

             None       All       Gain
             ----       ---       ----
P732.1        11300      8136     28.0%
P732.2         7520      6044     19.6%
P732.3        12286      9864     19.7%
P732.4         5782      4942     14.5%
P732.5         6204      4966     19.9%
P732.6         4004      3122     22.0%
P732.7        11750      9812     16.5%
P732.8          988       728     26.3%
P732.9         6848      6120     10.6%
P732.10       24722     20490     17.1%
P732.11       14618     12306     15.8%
P732.12       14064     11256     20.0%

P11.1          9664      7660     20.7%
P11.2          6588      5988      9.1%
P11.3         11092      9434     14.9%
P11.4          5540      4768     13.9%
P11.5          5572      4448     20.2%
P11.6          3666      3060     16.5%
P11.7          8632      7004     18.7%
P11.8           752       616     18.1%
P11.9          5256      4484     14.7%
P11.10        23940     19128     20.1%
P11.11        14328     12124     15.4%
P11.12        12816     10584     17.8%
Total CPU time and CPU time in input/output Internal I/O: communication from phase 1 to phase 3. External I/O: input of the source and the output of the object files. Phase 1 Phase 2 Phase 3 Overall ----------- ---------- ---------- ------------------ Total % Total % Total % Internal External CPU I/O CPU I/O CPU I/O I/O I/O All optimisations: P732.1: 53% 8% 32% 6% 14% 5% 13% 7% P732.2: 53% 8% 35% 8% 12% 5% 16% 7% P732.3: 51% 7% 34% 7% 15% 5% 13% 6% P732.4: 54% 8% 32% 6% 14% 5% 13% 7% P732.5: 49% 12% 36% 5% 14% 5% 12% 6% P732.6: 50% 7% 37% 7% 12% 6% 14% 7% P732.7: 48% 7% 39% 7% 13% 7% 15% 6% P732.8: 54% 7% 36% 7% 10% 5% 16% 4% P732.9: 50% 8% 36% 8% 14% 6% 15% 7% P732.10: 54% 6% 29% 5% 17% 4% 11% 6% P732.11: 52% 7% 32% 6% 16% 5% 11% 6% P732.12: 52% 8% 32% 8% 17% 7% 15% 7% No optimisation: P732.1: 50% 7% 30% 7% 19% 8% 16% 7% P732.2: 55% 8% 31% 9% 14% 8% 16% 8% P732.3: 54% 8% 28% 8% 18% 6% 15% 7% P732.4: 57% 8% 26% 7% 16% 6% 14% 7% P732.5: 52% 8% 30% 8% 17% 7% 16% 7% P732.6: 53% 8% 32% 9% 15% 8% 17% 8% P732.7: 49% 7% 31% 8% 19% 8% 17% 6% P732.8: 56% 7% 32% 9% 12% 6% 18% 5% P732.9: 55% 9% 29% 8% 16% 7% 16% 8% P732.10: 55% 6% 23% 6% 21% 5% 13% 6% P732.11: 55% 8% 26% 6% 20% 6% 12% 6% P732.12: 54% 8% 27% 8% 19% 8% 16% 8%