The production of Optimised Machine-Code
              for High-Level Languages using
          Machine-Independent Intermediate Codes.
                 Peter Salkeld Robertson
                         Ph. D.
                University of Edinburgh
                          1981
               (Graduation date July 1981)
                          ABSTRACT
                          
   The aim of this work  was  to  investigate  the  problems
associated with using machine-independent intermediate codes
in  the  translation from a high-level language into machine
code, with emphasis on minimising code  size  and  providing
good run-time diagnostic capabilities.
   The  main  result  was a machine-independent intermediate
code, I-code, which has been used  successfully  to  develop
optimising  and  diagnostic compilers for the IMP77 language
on  a  large  number  of  different  computer  systems.   In
addition,  the work has been used to lay the foundations for
a project to  develop  an  intermediate  code  for  portable
SIMULA compilers.
   The major conclusions of the research were that carefully
designed  machine-independent intermediate codes can be used
to generate viable optimising and diagnostic compilers,  and
that   the   commonality   introduced  into  different  code
generators  processing  the  code  for  different   machines
simplifies   the   tasks   of  creating  new  compilers  and
maintaining old ones.
                        Contents
1  Introduction
2  Intermediate codes
   2.1  Uncol
   2.2  Janus
   2.3  OCODE
   2.4  P-code
   2.5  Z-code
   2.6  Summary and conclusions
        2.6.1  Error checking and reporting
        2.6.2  Efficiency
        2.6.3  Assumptions
        2.6.4  Interpretation
3  Optimisations
   3.1  Classification of optimisations
        3.1.1  Universal optimisations
        3.1.2  Local optimisations
               3.1.2.1  Remembering
               3.1.2.2  Delaying
               3.1.2.3  Inaccessible code removal
               3.1.2.4  Peephole optimisation
               3.1.2.5  Special cases
               3.1.2.6  Algebraic manipulation
        3.1.3  Global optimisations
               3.1.3.1  Restructuring
               3.1.3.2  Merging
                        3.1.3.2.1 Forward merging
                        3.1.3.2.2 Backward merging
               3.1.3.3  Advancing
               3.1.3.4  Factoring
               3.1.3.5  Loop optimisations
                        3.1.3.5.1 Iteration
                        3.1.3.5.2 Holding
                        3.1.3.5.3 Removal of invariants
               3.1.3.6  Expansion
               3.1.3.7  Addressing optimisations
        3.1.4  Source optimisations
   3.2  Combination of optimisations
4  The design of the compiler
      4.1  General structure
      4.2  The intermediate code
           4.2.1  Objectives
                  4.2.1.1  Scope
                  4.2.1.2  Information preservation
                  4.2.1.3  Target machine independence
                  4.2.1.4  Decision binding
                  4.2.1.5  Simplification
                  4.2.1.6  Redundancy
                  4.2.1.7  Ease of use
      4.3  Code layout and addressing
           4.3.1  Nested procedure definitions
           4.3.2  Paged machines
           4.3.3  Events
      4.4  Data layout and addressing
      4.5  Procedure entry and exit
           4.5.1  User-defined procedures
           4.5.2  External procedures
           4.5.3  Permanent procedures
           4.5.4  Primitive procedures
      4.6  Language-specified and compiler-generated objects
           4.6.1  Internal labels
           4.6.2  Temporary objects
      4.7  Object-file generation
           4.7.1  Reordering
           4.7.2  Jumps and branches
           4.7.3  Procedures
           4.7.4  External linkage
           4.7.5  In-line constants
           4.7.6  Object-file format
      4.8  Summary
5  Review of the overall structure
      5.1  Division of function
      5.2  Testing and development
      5.3  Diagnostics
           5.3.1 Line numbers
           5.3.2 Diagnostic tables
           5.3.3 Run-time checks
6  Observations
      6.1  Suitability of I-code for optimisation
      6.2  Performance
      6.3  Cost of optimisation
           6.3.1  Compile time
           6.3.2  Space requirement
           6.3.3  Logical complexity
      6.4  Comments on the results
           6.4.1  Register remembering
           6.4.2  Remembering environments
           6.4.4  Array allocation and use
           6.4.5  Parameters in registers
           6.4.6  Condition-code remembering
           6.4.7  Merging
      6.5  Criticisms and benefits of the technique
           6.5.1  Complexity
           6.5.2  I/O overhead
           6.5.3  Lack of gains
           6.5.4  Flexibility
      6.6  Comments on instruction sets and optimisation
7  Conclusions
      7.1  Viability of the technique
      7.2  Ease of portability
      7.3  Nature of optimisations
Appendix
      A1  Simplified I-code definition
      A2  I-code internal representation
      A3  Results
References
                      1. Introduction
   Compilers for high-level  languages  form  a  significant
part  of  most computer systems, and with an ever increasing
number and variety of machine architectures  on  the  market
the   problems   of   compiler   development,  testing,  and
maintenance consume more  and  more  manpower  and  computer
time.   Moreover,  as  computer  technology is improving and
changing rapidly it is becoming evident that software  costs
will  increasingly  dominate  the  total  cost  of a system.
Indeed, it may not be long before the lifetime  of  software
regularly  exceeds  that  of  the  hardware  on which it was
originally implemented, a state of affairs  quite  different
from  that  envisaged by Halpern when he concluded that "the
importance of the entire question of machine-independence is
diminishing .." [Halpern, 1965].   In addition, there  is  a
need  to  encourage the slowly-developing trend to write the
majority of software in high-level languages.   Even  though
the  advantages of such an approach are many, a large number
of users still have a love of machine-code, usually fostered
by thoughts of "machine  efficiency".   Clearly,  techniques
must  be  developed  to  simplify  the  production of usable
compilers  which  can  "optimise"  the  match  between   the
executing  program  and the user's requirements, be they for
fast execution, small  program  size,  reasonable  execution
time but with good run-time diagnostics, or whatever.
   One popular method for reducing the complexity of a compiler is
to partition it into two major phases: one language-dependent and
the other machine-dependent.  The idea is that the
language-dependent phase inputs the source program and deals with
all the syntactic niceties of the language, finally generating a new
representation of the program, an intermediate code.  This is then
input by a second phase which uses it to generate machine-code for
the target computer.  In this way it should be possible to produce a
compiler to generate code for a different machine by taking the
existing first phase and writing a new second phase.  This ability
to move a large portion of the compiler from machine to machine has
led to such compilers being referred to as "portable compilers",
even though the term is perhaps misleading, as only part of the
complete compiler can be moved without change.
   In practice many existing compilers generate intermediate
representations of the program which are passed around within the
compiler, for example the "analysis records" produced by the
syntactic phase of compilation, but for the purposes of this work it
is only when these representations are machine-independent and are
made available outwith the compiler that they will be termed
intermediate codes.
   Much of the emphasis in designing intermediate codes has been on
enabling a compiler to be bootstrapped quickly onto a new machine -
either by interpreting the intermediate code, or by using a macro
generator to expand it into machine-code [Brown, 1977].  Once this
has been done the intention is that the quality of the code so
produced can be improved at leisure.  While this approach has been
very successful and relatively error-free, it has been the
experience of several implementors that it is difficult to adapt the
scheme to produce highly optimised code [Russell, 1974]; apparently
considerations of portability and machine-independence have caused
the problems of optimisation to be overlooked.  The aspect of
intermediate-code design which has received most debate concerns the
level of the code: low-level with a fairly simple code-generator, or
high-level with a more complex code-generator [Brown, 1972].
   This thesis attempts to put machine-independence and optimisation
on an equal footing, and describes the use of an intermediate code
which takes a novel view of the process.  Instead of the
intermediate code describing the computation to be performed, it
describes the operation of a code-generator which will produce a
program to perform the required computation.  This effectively adds
an extra level of indirection into the compilation, weakening any
linkage between the form of the intermediate code and the object
code required for a particular implementation.  In essence I-code
attempts to describe the results required in a way which does not
constrain the method of achieving those results.
   In particular it should be noted that the code described, I-code,
was designed specifically for the language IMP-77, a systems
implementation language which contains many of the constructions
which pose problems for optimisation [Robertson, 1979].  It in no
way attempts to be a "universal" intermediate code.
Notwithstanding, the code, with a small number of minor extensions
to cover non-IMP features, has been used successfully in an ALGOL 60
compiler and is currently proving viable in projects for writing
Pascal and Fortran 77 compilers.
   The intermediate code as finally designed is completely machine
independent, except inasmuch as the source program it describes is
machine dependent, demonstrating that the problems may not be as
intractable as thought by Branquart et al. who state that "clearly
complete machine independency is never reached" [Branquart, 1973].
   In addition to the problems of machine independence there is also
the question of operating system independence, as nowadays it is
common for machines to have several systems available.  For this
reason the task of producing a compiler is far from finished when it
can generate machine code [Richards, 1977].  To simplify the
generation of versions of a compiler for different operating systems
a third phase of compilation was added, although it soon became
clear that the extra phase could be used for other purposes as well,
as will be shown later (section 4).
   Throughout the text examples are given of the code produced by
compilers written to demonstrate the power of the intermediate code.
The examples of the intermediate code are couched in terms of
mnemonics for the various code items, although the production
compilers use a compacted representation.  The code and its
representations are described in Appendix A1 and Appendix A2.  In
the examples of code generated for various constructions it should
be appreciated that the exact instructions and machine features used
will depend very much on the context in which the code is produced,
and so only typical code sequences can be given.  The machines for
which code is demonstrated are indicated by the following
abbreviations in parentheses:
      (Nova)     Data General NOVA
      (PDP10)    Digital Equipment Corporation PDP10
      (PDP11)    Digital Equipment Corporation PDP11
      (VAX)      Digital Equipment Corporation VAX 11/780
      (GEC4080)  General Electric Company 4080
      (ICL2900)  International Computers Limited 2900
      (4/75)     International Computers Limited 4/75
      (7/16)     Interdata 7/16
      (7/32)     Interdata 7/32
      (PE3200)   Perkin Elmer 3200
                    2 Intermediate codes
   This section gives a brief account of the more  important
intermediate codes which have been discussed and have had an
influence on the design of I-code.
2.1 Uncol
   UNCOL,  UNiversal  Computer  Orientated  Language, [Mock,
1958], was an early attempt to specify a means  for  solving
the  M*N  problem  of producing compilers for M languages to
run on N machines.   It was proposed  that  an  intermediate
language,  UNCOL,  be defined which would be able to express
the constructs from any language, and which could itself  be
translated  into code for any machine, resulting in the need
for only M+N compilers.   Indeed it was even suggested  that
programs  would  be written directly in UNCOL rather than in
machine code.
These ideas were very ambitious, but were presented  without
any  concrete  examples  of  what  UNCOL  might  look  like.
Proposals were made for an UNCOL in [Steel,  1961]  but  the
work   was   abandoned   before  anything  like  a  complete
specification had been produced.
   An UNCOL-like technique which has been  used  extensively
is  to  compile for a known type of machine, such as the IBM
360, and then emulate that machine on  the  target  machine.
Unfortunately,  to  give this any chance of being efficient,
microcode support will be  necessary,  and  this  is  rarely
available to compiler writers.
2.2 Janus
   The first attempt at generating an UNCOL which seems to have been
at least partially successful was JANUS [Coleman, 1974].  The
approach was effectively to enumerate all the mechanisms found in
current programming languages and the techniques used to implement
them.  From this large list was defined a set of primitive
data-types and operations upon them.  These primitives were then put
together to model the objects in the source language.  Once JANUS
code had been produced the intention was that it would either be
interpreted or compiled into machine code by a macro generator.
2.3 OCODE
   Of all the languages which claim to be portable, perhaps the most
successful has been BCPL [Richards, 1971].  The BCPL compiler
generates the intermediate code OCODE which can either be
interpreted or translated into machine code for direct execution.
As BCPL is a fairly low-level language with only one data type, the
word, many of the difficulties in designing intermediate codes do
not arise.  This means that the code can be pitched at a low level
and be "semantically weak" without compromising the efficiency of
the compiled code to any great extent.
The  OCODE machine works by manipulating single-word objects
held on a stack, into which there are several pointers.
                   e.g.  R(1, 2, 3)
STACK 3            adjust the top  of  stack  to  leave  two
                   cells free for linkage information. 
LN 1               stack the constant 1.
LN 2               stack the constant 2.
LN 3               stack the constant 3.
LL L6              stack  the address of label L6 (the entry
                   to the routine).
RTAP 5             enter the procedure adjusting  the  stack
                   frame pointer by 5 locations.
.....
.....
.....
ENTRY 1 L6 'R'
                   entry point for the routine R.
SAVE 5             set  the  top  of  stack  pointer to be 5
                   locations from the stack frame pointer.
.....
RTRN               return.
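
   To make the operation of the OCODE machine concrete, the
following sketch (in Python, purely for illustration) animates the
example above.  The mnemonics are those shown, but the frame
arithmetic is an assumed convention, chosen only so that the
operands STACK 3, RTAP 5 and SAVE 5 are mutually consistent; the
real OCODE machine differs in detail.

      def run(program, labels):
          stack = [0] * 100
          p, s, pc = 0, 1, 0    # frame, top of stack, program counter
          while pc is not None:
              op, *args = program[pc]
              pc += 1
              if op == "STACK":     # top := frame+n, leaving two
                  s = p + args[0]   # cells free for linkage
              elif op == "LN":      # stack a constant
                  stack[s] = args[0]; s += 1
              elif op == "LL":      # stack the address of a label
                  stack[s] = labels[args[0]]; s += 1
              elif op == "RTAP":    # call: the new frame is the k
                  entry = stack[s - 1]; s -= 1  # cells ending at top
                  new_p = s - args[0]
                  stack[new_p], stack[new_p + 1] = p, pc  # linkage
                  p, pc = new_p, entry
              elif op == "SAVE":    # callee: top := frame+k
                  s = p + args[0]
              elif op == "RTRN":    # return via the linkage cells
                  pc, p = stack[p + 1], stack[p]
              elif op == "HALT":
                  pc = None

      # The call R(1, 2, 3) from the example; label L6 is index 7.
      prog = [("STACK", 3), ("LN", 1), ("LN", 2), ("LN", 3),
              ("LL", "L6"), ("RTAP", 5), ("HALT",),
              ("SAVE", 5), ("RTRN",)]
      run(prog, {"L6": 7})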
2.4 P-code
   P-code is the intermediate code  used  by  the  PASCAL<P>
compiler  [Nori,  1976;  Jensen, 1976] and was designed with
the  aim  of  porting  PASCAL  quickly  by   means   of   an
interpreter.   In  this respect it has been very successful,
especially on microprocessor-based  systems.   The  code  is
similar  to OCODE but has a greater range of instructions to
handle objects of differing types.
   procedure ERROR(VAL:INTEGER);  begin
                     0:   ENT   4
      TOTAL := TOTAL+1;
                     1:   LDO   138
                     2:   LDCI  1
                     3:   ADDI
                     4:   SRO   138
      if INDEX >= 9 then begin
                     5:   LDO   139
                     6:   LDCI 9
                     7:   GEQI
                     8:   FJP   17
         LIST[10].NUM := 255
                     9:   LAO   140
                    10:   LDCI  10
                    11:   DEC   1
                    12:   IXA   2
                    13:   INC   1
                    14:   LDCI 255
                    15:   STO
      end else begin
                    16:   UJP   28
         INDEX := INDEX+1;
                    17:   LDO   139
                    18:   LDCI  1
                    19:   ADDI
                    20:   SRO   139
         LIST[INDEX].NUM := VAL
                    21:   LAO   140
                    22:   LDO   139
                    23:   DEC   1
                    24:   IXA   2
                    25:   INC   1
                    26:   LOD   0, 4
                    27:   STO
      end;
   end;
                    28:   RETP
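
   The address arithmetic in instructions 9 to 13 above can be made
explicit as follows.  The sketch (in Python, for illustration only)
takes the operand of LAO as the base address of LIST and assumes, as
the code implies, a lower bound of 1, two words per element, and NUM
at word offset 1 within an element.

      def element_field_address(base, index, low, size, field):
          # LAO base; LDCI index -- address of the array, the index
          index = index - low          # DEC low -- bias by the bound
          addr = base + index * size   # IXA size -- scale and add
          return addr + field          # INC field -- select .NUM

      assert element_field_address(140, 10, 1, 2, 1) == 159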
2.5 Z-code
   Z-code [Bourne, 1975] is the intermediate code produced by the
ALGOL68C compiler, the main feature of which is the ability for the
user to parameterise the first phase to modify the Z-code to suit
the target machine, an idea previously investigated in SLANG
[Sibley, 1961].  A set of eight registers is assumed by the code and
others may be specified explicitly for each transfer.  The memory
with which the code works is assumed to be "a linear data store that
is addressed by consecutive integers", addresses taking the form of
base+displacement pairs.  Intermingled with the instructions are
directives which control the translation of the code into machine
orders.  Two of these directives are used to divide the code into
"basic blocks" or "straight-line segments", and describe the usage
of registers on entry to and exit from the blocks, although little
use seems to be made of them at present.
   As an example here is the Z-code generated by the PDP10 version
of the compiler [Gardner, 1977]:
      int X := 2, Y := 3, Z := 2
       1:  F000 10 0 +2        load 2
           F040 10 6 +144      store in X
           F000 10 0 +3        load 3
           F040 10 6 +145      store in Y
       5:  F000 10 0 +2        load 2
           F040 10 6 +146      store in Z
      proc P = (int A, B) int: begin
       7:  S715 p*Z
           T246 677 712
      if A > B
       9:  R0
      10:  F020 10 5 +4        load A
      11:  F022 10 5 +5        subtract B
      12:  F113 10 0 P713      ->L713 if <=
      then A
      13:  F020 10 5 +4        load A
      else B
      14:  H116 0 P714         ->L714
      15:  L713
      16:  F020 10 5 +5        load B
      fi
      17:  L714
      18:  R1 10 1
      end
      19:  R1 10 1
      20:  T247 667 712        end of P
2.6 Summary and conclusions
2.6.1 Error checking and reporting
   The UNCOL approach of having one code for all languages and
machines may well simplify the generation of some sort of compiler,
but has the major disadvantage that the optimisation of error
checking and the reporting of run-time errors cannot be left to the
code generator - many errors are language-dependent and the code
generator cannot know how to handle all of them.  Instead the checks
must be programmed into the intermediate representation explicitly.
As will be shown later (5.3) this can inhibit several very powerful
and effective optimisations.  Sadly, this problem can result in the
absence of all but the most trivial of run-time checks in the
compiled code.
   Even when checking is provided in the intermediate code, as in
the case of P-code with its CHK instruction for range testing, it is
rare for the code to contain enough information to permit the error
to be reported in source program terms: line numbers, procedure
names, variable names and values etc.  As an example, many P-code
interpreters locate run-time errors in terms of 'P-code instruction
addresses', which are of negligible benefit to most users.
2.6.2 Efficiency
   Commonly, little attention is paid to questions of run-time
efficiency in the generation of intermediate code.  An exception to
this is Z-code, which is parameterised in order that the match
between the code and the target machine can be improved.  In
particular, the machine-independent phase is intended to perform
simple register optimisation, although as the example in 2.5 shows,
the insistence on using the same register will minimise any gains
from remembering register contents.  However, this is probably just
a failure on the part of the current compilers and could be
corrected at a later date.
   Unfortunately, the fact that the compiler purports to optimise
the intermediate code inhibits the code generator from attempting
any but the most trivial peephole optimisations.  This may be seen
in the example by considering instructions 10 - 12.  On many
machines the subtract operation is not good for value comparison as
firstly it may fail with overflow, and secondly it corrupts a
register.  A better implementation would be to replace the subtract
with a suitable COMPARE, leaving the values untouched and available
for later use.  This cannot be done by the code generator as it
cannot know that the intermediate code does not go on to use the
result of the subtraction later.  Similarly, if Z-code had chosen to
use a COMPARE instruction in the first place, a machine without a
compare would have to work hard to make sure all registers involved
in the necessary subtract were restored
to their initial values before the intermediate code goes on to use
them.
2.6.3 Assumptions
   Most machine-independent intermediate codes have been designed,
at least initially, assuming a linear store with one address
increment corresponding to one basic object.  In the case of OCODE
this is a direct result of the language definition, but in languages
such as PASCAL it has led to a great loss of information, as the
rich information about data types cannot be expressed in the code.
The problems associated with putting languages onto machines with
different addressing schemes have resulted in some intermediate
codes being updated to accept a limited form of parameterisation to
define the gross appearance of the target machine.  Typical of the
limitations of these codes is P-code where, although the basic types
of object can have differing sizes of machine representation,
objects with enumerated types will always be given a 'fullword' even
though the host machine could easily support a smaller item.
   A typical assumption is that the difference between objects
explicitly specified in the original source and those created by the
intermediate code generator for its own purposes is insignificant.
As will be shown later (4.6), this is not necessarily the case.
                    3 Optimisations
                    
   The task of any compiler for a high-level language is  to
fit  programs  written  in  that  language  onto  a specific
computer system so that the  required  computations  may  be
performed.
Optimisation  may  be  described as the process by which the
fit is improved.  Usually the quality of the optimisation is
measured in terms of two parameters: the size of the running
program, and, more commonly, the speed at which it executes.
While it is possible in some cases to make a program smaller
and increase its speed of execution, it is  well-known  that
in  general speed and size are complementary.   For example,
the following code fragments have the same effect,  but  the
first  will  probably be smaller than the second, which will
execute faster than the first:
  -------------------------- -----------------------
 |                          |    A(1) = K           |
 |                          |    A(2) = K           |
 |   for J = 1,1,8 cycle    |    A(3) = K           |
 |      A(J) = K            |    A(4) = K           |
 |   repeat                 |    A(5) = K           |
 |                          |    A(6) = K           |
 |                          |    A(7) = K           |
 |                          |    A(8) = K           |
 |                          |    J = 8              |
  -------------------------- -----------------------
3.1 Classification of optimisations
   With a subject as complex as optimisation it is difficult
to give  a  useful  and  definitive  classification  of  the
various  possibilities for improving programs.   In addition
different authors have used many different terms to describe
optimisations which have been attempted [Aho, 1974; Lowry, 1969;
Wichmann, 1977].  However most optimisations fall into one of the
following four groups: Universal, Local, Global and Source.
3.1.1 Universal optimisations
   These are those optimisations which are independent of any
particular program, but which depend on the complete environment in
which the program is to be compiled or executed.  They are the
policy decisions taken by the compiler writer during the initial
design of the compiler, and include such things as the fixed use of
registers (stack pointers, code pointers, link registers etc), the
representations of language-defined objects (arrays, records,
strings etc), and the standards for communication with external
objects.  In addition, universal optimisation must take into account
such questions as:
         i      Compilation speed or execution speed?
                If the compiler is to be used in an environment
                where programs are compiled roughly as often as
                they are executed, such as in a teaching
                environment, execution speed can be sacrificed
                for an increase in compilation speed.
        ii      Diagnostics?
                If the compiler is  to  produce  code  which
                will  provide  extensive  checking  and will
                give diagnostic information in the event  of
                program  failure, allowance must be made for
                the  efficient  checking  of  the  program's
                behaviour   and   the   maintenance  of  the
                recovery    information    used    by    the
                diagnostics.   If  highly  optimised code is
                required these constraints may not apply.
        In  the  current  state   of   the   art   universal
        optimisation  is  done  by experience and guesswork;
        attempts at producing compiler-compilers  which  can
        approach  the  quality of hand-constructed compilers
        have not met  with  great  success  [Brooker,  1967;
        Feldman, 1966; Trout, 1967].  As will be shown later
        (4.5),  minor  changes in the universal optimisation
        can result in major changes in  the  performance  of
        the  generated code, and so rules made at this stage
        should be as flexible as possible to permit  changes
        to be made in the light of experience.
           From  the  point of view of measurement universal
        optimisation provides  the  base  level  from  which
        other optimisations are investigated.   Roughly, the
        better the universal optimisation the less effective
        the other optimisations appear to be.
3.1.2 Local optimisations
   Local optimisations may be defined as those optimisations which
may be performed during a sequential scan of the program using only
knowledge of statements already processed.  Not only are these
optimisations reasonably simple to perform but they can have a major
effect on the generated code.  Indeed Wulf et al. state that "In the
final analysis the quality of the local code has a greater impact on
both the size and speed of the final program than any other
optimisation" [Wulf, 1975].
3.1.2.1 Remembering
   Remembering optimisations are those optimisations which can be
applied to single statements in the light of information gathered
during the compilation of previous statements.  These optimisations
depend on remembering the current state of the machine and applying
this knowledge to subsequent statements.  Their chief characteristic
is that they are applied during a sequential scan of the program,
and as such are reasonably cheap to implement and execute.
        For example:
 
             ----------------
           |  X = Y          |
           |  if X = 0 start |
            -----------------
            
           on the PDP11 would generate:
           
            ------------
           |  MOV  Y,X  |
           |  BNE $1    |  remembering that the previous
           |            |  line sets the condition code.
            ------------
            
           The    most    powerful    of   the   remembering
        optimisations is  that  whereby  the  correspondence
        between  values  in registers and values in store is
        remembered and references to  the  store  value  are
        replaced  by  references  to  the  register's value,
        register operations usually being smaller and faster
        than their store equivalents.   Unfortunately  there
        are  several  cases  where  this leads to worse code
        than the "obvious" version.   For  example,  on  the
        (PE3200)  the code on the right is larger and slower
        than that on the left:
                   ------------
                  |  X = 2    |
                  |  P = P<<2 |
                   -----------
                   
             ----------- --------------
            | LIS  3,2  |   LIS 3,2    |  pick up 2
            | ST   3,X  |   ST  3,X    |  store it in X
            | L    1,P  |   L   1,P    |  pick up P
            | SLLS 1,2  |   SLL 1,0(3) |  shift it by 2
            | ST   1,P  |   ST  1,P    |  store it in P
             ----------- --------------
           In addition to keeping track of  the  changes  in
        the state of the machine from the compilation of one
        statement  to  another,  remembering  also  includes
        preserving this state or environment for  later  use
        when  a  label is encountered, either by merging the
        current environment with the  environment  saved  at
        the  jump  to the label, or simply by restoring that
        latter environment  when  it  is  not  possible  for
        control   to  "fall  through"  from  the  statements
        immediately preceding the label.
         
           In all forms of remembering it  is  vital  to  be
        able    to    keep   the   information   up-to-date,
        invalidating knowledge  when  it  becomes  false,  a
        process which is exacerbated when it is possible for
        an  object  to  be  known  by two or more apparently
        different descriptions as in the following code:
             ---------------------------
            |   integer J, K            |
            |   integerarray A(1:12)    |
            |   integername P           |
            |   P == J                  |
            |   J = 1;  K = 1           |
             ---------------------------
        At this point P and J refer to the same location  as
        do A(J) and A(K).
        Except  in  the most simple of cases all that can be
        done is to assume  the  worst  and  forget  anything
        potentially   dangerous  after  writing  to  unknown
        addresses.
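            The bookkeeping involved can be pictured as below (a
         minimal sketch in Python; the names are illustrative and
         the aliasing rule is the worst-case one just described):

      class Remembering:
          def __init__(self):
              self.in_register = {}          # location -> register

          def loaded(self, location, register):
              self.in_register[location] = register

          def reference(self, location):
              # use the remembered register rather than re-loading
              return self.in_register.get(location)

          def stored(self, location=None):
              if location is None:           # store through an
                  self.in_register.clear()   # unknown address:
              else:                          # forget the lot
                  self.in_register.pop(location, None)

      r = Remembering()
      r.loaded("Y", "R0")             # after MOV Y,X, say
      assert r.reference("Y") == "R0"
      r.stored()                      # a store through a name
      assert r.reference("Y") is None # variable invalidates all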
3.1.2.2 Delaying
            By delaying the first store into T until after
        the  conditional  statement, and delaying the second
        store  into  T   until   after   the   label,   both
        instructions can be combined, resulting in the code:
            -------------------
           |      L   3,X      |
           |      L   0,0(3)   |
           |      BGE $1       |
           |      SR  0,0      |
           |   $1:ST  0,T      |
           |      LIS 2,1      |
           |      ST  2,0(3)   |
           |      LR  1,0      |
           |      {return}     |
            -------------------
        This  store  itself  can  now  be  delayed until the
        return from the function, at which point,  as  T  is
        local  to  the  function  and will be destroyed, the
        instruction can be deleted altogether.
        Section 3.2 gives a description of one way in  which
        this sort of optimisation has been achieved.
3.1.2.3 Inaccessible code removal
           In  several  cases  compilers  can  generate code
        which will never be executed.
        The common causes of this are either  user-specified
        conditions  whose  truth is constant and are used to
        achieve some sort of "conditional  compilation",  or
        structural  statements following unconditional jumps
        as below:
            ----------------------
           |   if X = 0 start     |
           |      ERROR(1)        |
           |      return          |
           |   else               |
           |      V = V-X         |
           |   finish             |
            ----------------------
        Here the branch instruction usually generated at the
        end of the if clause to take control past  the  else
        clause, can never be executed.
        
         Such inaccessible code can be eliminated to shorten
         the program, but without directly affecting its
         speed of execution.
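         One simple implementation is a single pass which discards
         instructions between an unconditional transfer of control
         and the next label (a sketch in Python; the opcode names
         are illustrative):

      def remove_inaccessible(code):
          out, reachable = [], True
          for op, arg in code:
              if op == "LABEL":        # a label makes code
                  reachable = True     # reachable again
              if reachable:
                  out.append((op, arg))
              if op in ("BR", "RET"):  # nothing after an
                  reachable = False    # unconditional transfer can
          return out                   # be reached by falling in

      # the branch past the else clause above can never be executed:
      code = [("RET", None), ("BR", "$2"),
              ("LABEL", "$1"), ("SUB", "V,X"), ("LABEL", "$2")]
      assert remove_inaccessible(code) == [
              ("RET", None),
              ("LABEL", "$1"), ("SUB", "V,X"), ("LABEL", "$2")]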
        
3.1.2.4 Peephole optimisations
           Peephole  optimisation  [McKeeman,  1965]  is the
        technique of examining small sections  of  generated
        code   to   make   fairly   obvious,   but  ad  hoc,
        improvements.    Many  of   the   gains   from   the
        optimisation  come  from  simplifying code sequences
        created by the juxtaposition  of  code  areas  which
        were produced separately.
         For example (PE3200):
               Before              After
            ---------------- -------------------
           |   ST  4,X      |      ST  4,X      |
           |   L   4,X      |                   |
           |-              -|-                 -|
           |   AR  1,2      |                   |
           |   AHI 1,48     |      AHI 1,48(2)  |
            ---------------- -------------------
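
         A peephole pass of this kind reduces to matching a short
         window of instructions against a pattern table.  The sketch
         below (Python, illustrative only) implements just the two
         improvements shown; a real pass must also check that the
         value killed by each rewrite is not needed later:

      def peephole(code):
          out = []
          for ins in code:
              if out:
                  prev = out[-1]
                  # ST r,X then L r,X: the re-load is redundant
                  if prev[0] == "ST" and ins == ("L",) + prev[1:]:
                      continue
                  # AR a,b then AHI a,k  becomes  AHI a,k(b)
                  if (prev[0] == "AR" and ins[0] == "AHI"
                          and ins[1] == prev[1]):
                      out[-1] = ("AHI", ins[1],
                                 f"{ins[2]}({prev[2]})")
                      continue
              out.append(ins)
          return out

      before = [("ST", 4, "X"), ("L", 4, "X"),
                ("AR", 1, 2), ("AHI", 1, 48)]
      assert peephole(before) == [("ST", 4, "X"), ("AHI", 1, "48(2)")]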
            
3.1.2.5 Special cases
           Special-case  optimisations  are those which make
        use of the particular structure and features of  the
        target  machine  to control the way in which certain
        statements are implemented.
        For example:
                      Obvious          Optimised
                   ---------------- -----------
        (PDP11)   |   MOV  #0,X    |   CLR   X |   X = 0
                  |-              -|-         -|
        (PDP11)   |   ADD  #1,X    |   INC   X |   X = X+1
                  |-              -|-         -|
         (PE3200)  |   LHI  1,NULL  |   SR  0,0 |   S = ""
                  |   LHI  2,S     |   STB 0,S |
                  |   BAL  15,MOVE |           |
                   ----------------------------
                   
           These optimisations are very similar to  peephole
        optimisations  but  are  distinguished  because they
        actively control the generation of code rather  than
        passively   alter   code   which  has  already  been
        produced.   In particular  they  avoid  one  of  the
        drawbacks of peephole optimisation, namely that even
        though  it  can  reduce  fairly  complex instruction
        sequences to a simpler  form,  the  side-effects  of
        generating  the  long  form in the first place often
        degrade the code.   In  the  example  of  setting  a
        string   variable  to  the  null  string  above  the
        optimised form uses only one register, the value  of
        which  can  be  remembered.   In  the  non-optimised
        version three registers are immediately altered  and
        the   knowledge  of  the  contents  of  all  of  the
        registers may need to be forgotten unless  the  code
        generator  knows  how the MOVE routine works and can
        forget only those registers which it uses.
        
3.1.2.6 Algebraic manipulations
           Algebraic optimisations are improvements  brought
        about by using the algebraic properties of operators
        and operands, and include:
           . Folding, or compile-time evaluation
                    1+2 is replaced by 3
           . Removal of null operations
                   A+0 is replaced by A
           . Using commutative properties
                   -B+A is replaced by A-B
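
   All three manipulations can be applied in one bottom-up walk of
the expression tree, as in this sketch (Python, illustrative only;
expressions are ("op", left, right) tuples with integer or name
leaves, and "neg" marks unary minus):

      def simplify(e):
          if not isinstance(e, tuple):
              return e                         # constant or variable
          if e[0] == "neg":
              return ("neg", simplify(e[1]))
          op, a, b = e[0], simplify(e[1]), simplify(e[2])
          if isinstance(a, int) and isinstance(b, int):  # folding
              return {"+": a + b, "-": a - b, "*": a * b}[op]
          if op == "+" and b == 0:
              return a                         # null operation
          if op == "+" and isinstance(a, tuple) and a[0] == "neg":
              return simplify(("-", b, a[1]))  # -B+A becomes A-B
          return (op, a, b)

      assert simplify(("+", 1, 2)) == 3
      assert simplify(("+", "A", 0)) == "A"
      assert simplify(("+", ("neg", "B"), "A")) == ("-", "A", "B")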
3.1.3 Global optimisations
   Global optimisations may be defined as those improvements to the
code which require knowledge of substantial parts of the program.
In effect they are performed by examining numbers of statements in
parallel, in contrast to the sequential scan required by local
optimisation.
3.1.3.1 Restructuring
   Restructuring optimisations are those optimisations which may be
brought about by changing the order in which the code is laid out in
memory without otherwise changing the nature of the code.  As will
be discussed later, there are many reasons why programs can be
improved by changing the order of chunks of the code.  A common
reason is that many machines have conditional branch instructions
with limited range while the unconditional branches have a much
larger range.  Hence if {A} represents a large number of statements
and {B} represents a small number of statements, the program:
             ----------------
            | if X = 0 start |
            |    {A}         |
            | else           |
            |    {B}         |
            | finish         |
             ----------------
        could  be  improved  by  reordering  as on the right
        (PDP11):
        
                 original           reordered
            ------------------ ----------------
           |     MOV  X,R0    |     MOV  X,R0  |
           |     BEQ  $1      |     BEQ  $1    |
           |     JMP  $2      |     {B}        |
           | $1: {A}          |     JMP  $2    |
           |     BR   $3      | $1: {A}        |
           | $2: {B}          | $2:            |
           | $3:              |                |
            ------------------ ----------------
3.1.3.2 Merging
3.1.3.2.1 Forward merging
   Forward merging, also somewhat confusingly referred to as "cross
jumping" [Wulf, 1975], is the process whereby the point of
convergence of two or more code sequences is moved back over common
sub-sequences, thus removing one of the sub-sequences, as in the
case below.
             ----------------------
            |   if X > Y start     |
            |      TEST(X, Y)      |
            |   else               |
            |      TEST(Y, X)      |
            |   finish             |
             ----------------------
          obvious code (VAX)        after merging
         ---------------------  --------------------
        |   CMPL  X,Y          |   CMPL  X,Y        |
        |   BLE   $1           |   BLE   $1         |
        |   PUSHL X            |   PUSHL X          |
        |   PUSHL Y            |   PUSHL Y          |
        |   CALLS 2,TEST       |                    |
        |   BRB   $2           |   BRB   $2         |
        | $1:PUSHL Y           | $1:PUSHL Y         |
        |   PUSHL X            |   PUSHL X          |
        |   CALLS 2,TEST       | $2:CALLS 2,TEST    |
        | $2:                  |                    |
         ---------------------  --------------------
   The simplest way to perform this optimisation is to take the code
sequence about the point of a label and a reference to that label,
and set two pointers: one, P1, to the unconditional jump and the
other, P2, to the label.  If the instructions immediately preceding
the pointers are identical both pointers are moved back over that
instruction.  The label is
       redefined   at  the  new  position  of  P2  and  the
       instruction passed  over  by  P1  is  deleted.   The
       process  is  repeated  until either another label is
       found or two different instructions are encountered.
       The redefinition of the label  involves  creating  a
       completely new label, leaving the old one untouched.
       This  both prevents trouble with multiple references
       to the label and  permits  the  optimisation  to  be
       attempted on them.
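            In outline the algorithm is as below (a sketch in
         Python, illustrative only; instructions are tuples,
         code[p1] is the jump ("BR", lab) and code[p2] the
         corresponding ("LABEL", lab)):

      def merge(code, p1, p2, new_label):
          while p1 > 0 and code[p1 - 1] == code[p2 - 1]:
              if code[p2 - 1][0] == "LABEL":  # stop at another label
                  break
              del code[p1 - 1]      # delete the copy above the jump
              p1 -= 1               # the jump moved down one place
              p2 -= 2               # the label moves back over the
                                    # shared instruction, itself
                                    # shifted down by the deletion
          code[p1] = ("BR", new_label)           # a completely new
          code.insert(p2, ("LABEL", new_label))  # label; the old
                                                 # one is untouched
      code = [("CALLS", "2,TEST"), ("BR", "$2"),
              ("LABEL", "$1"), ("CALLS", "2,TEST"), ("LABEL", "$2")]
      merge(code, 1, 4, "$3")
      assert code == [("BR", "$3"), ("LABEL", "$1"),
                      ("LABEL", "$3"), ("CALLS", "2,TEST"),
                      ("LABEL", "$2")]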
           As this optimisation simply causes the sharing of
        execution paths there is no direct gain in execution
        speed,  but  as the code size is reduced an indirect
        improvement may be  achieved  if  the  shorter  code
        moves  the label close enough to the reference to it
        for a shorter and usually faster jump instruction to
        be used.
        
        The optimisation obviously must be  performed  while
        labels  and  jumps  are  in a symbolic form, that is
        before code  addresses  have  been  resolved.   This
        permits  the  merging  of  instructions  which  will
        eventually have  program-counter  relative  operands
        and consequently be position dependent.
        In  this  case  as the code-generator is in complete
        control  the  optimisation  can  be   very   simple,
        although rather specific.
        The  techniques  for handling common sub-expressions
        have been investigated at length by several authors,
        but measurements  indicate  that  in  most  programs
        expressions  are  so  trivial the expense in finding
        common  sub-expressions  is  not   repaid   by   the
        resulting  improvement in the generated code [Knuth,
        1971].
        
           The more general form of factoring can be seen in
        the transformation of the following statements:
            ---------------------------------------
           |   if X = 0 then A = 1 else B = 2      |
           |   if X = 0 then C = 3 else D = 4      |
            ---------------------------------------
           into:
            -----------------------
           |   if X = 0 start      |
           |      A = 1            |
           |      C = 3            |
           |   else                |
           |      B = 2            |
           |      D = 4            |
           |   finish              |
            -----------------------
3.1.3.5 Loop optimisations
3.1.3.5.1 Iteration
   Iteration is the process whereby the values in variables from
previous iterations of a loop are used to calculate the new values
for the current iteration, rather than calculating those values from
scratch each time.  One of the effects of this optimisation can be
the reduction in strength of operations, such as changing
multiplications into additions.  In this context the IMP77 operators
"++" and "--" may be used to great effect.  Their action is to
adjust the reference on the left by the number of items to the
right, hence if X is an integer then X++1 is the next integer and
X--2 is the integer two integers before X.
             -------------------------
            |  for J = 1,1,1000 cycle |
            |     A(J) = J            |
            |  repeat                 |
             -------------------------
   Can be optimised to:
             -------------------------
            |  integername T          |
            |  T == A(1)--1           |
            |  for J = 1,1,1000 cycle |
            |     T == T++1           |
            |     T = J               |
            |  repeat                 |
             -------------------------
         
3.1.3.5.2 Holding
                     A                      B
             ------------------- ---------------------
            |                   |  TEMP = P//Q        |
            | while X > 0 cycle |  while X > 0 cycle  |
            |    W(X) = P//Q    |     W(X) = TEMP     |
            |    X = X-1        |     X = X-1         |
            | repeat            |  repeat             |
             ------------------- ---------------------
        B  will  be faster than A if the loop is executed at
        least twice.   If the loop is not  executed  at  all
        (X <= 0)  B  will  be  much  slower  than  A  (by an
        alarming 80 microseconds on the 7/32).
        
        3.1.3.5.3 Removal of invariants
        
           This   is    the    process    whereby    complex
        sub-expressions  which do not change their values as
        the loop progresses are evaluated outside  the  loop
        and held in temporaries:
           ---------------------------------
           |   for J = 1, 1, 1000 cycle     |
           |      A(J) = LIMIT-MARGIN       |
           |   repeat                       |
            --------------------------------
              Can be optimised to:
            -------------------------------
           |   TEMP = LIMIT-MARGIN         |
           |   for J = 1,1,1000 cycle      |
           |      A(J) = TEMP              |
           |   repeat                      |
            -------------------------------
        It is simply a special case of Holding.
3.1.4 Source optimisations
   Source optimisations [Schneck, 1973] are those optimisations
which can be effected by changes in the source program.  They can be
sub-divided into three categories: machine-independent [Hecht, 1973;
Kildall, 1973], machine-dependent, and tentative.  Tentative
optimisations are those which, while unlikely to make the code
worse, may improve it on some machines.  For example, most machines
will handle the comparison "X<1" better if it is rewritten as
"X<=0", where X is an integer variable.
3.2 Combination of optimisations
         With this knowledge the second statement can be
         compiled to:
            -------------------
           |   MOV#  1,1,SZR   |
           |   JMP   $1        |
           |   LDA   1,D       |
           |   STA   1,A       |
           | $1:               |
            -------------------
        Immediately before the label $1  it  is  known  that
        once  again  the value of A is in accumulator 1, and
        that the STA above the label is marked as pending as
        before.   Following the definition of the label  the
        environment  before  the  jump  to that label can be
        combined with the environment just before the  label
        to  give  the  new  environment following the label.
        The information in this environment is that A is  in
        accumulator  1  and  that  the same store is pending
        from both old environments.   This  allows  the  two
        marked  store  instructions  to  be  removed and one
        store placed after the label (and once again  marked
        as being "pending").  This gives the following code:
            ---------------------
           |      LDA   0,B      |
           |      LDA   1,C      |
           |      AND   0,1      |
           |      MOV#  1,1,SZR  |
           |      JMP   $1       |
           |      LDA   1,D      |
           |   $1:STA   1,A      |
            ---------------------
        A  simple  jump  optimisation  notices  that the JMP
        passes over just one instruction and  can  therefore
        be  removed  by  inverting the skip condition on the
        previous MOVe, giving:
            ------------------
           |   LDA   0,B      |
           |   LDA   1,C      |
           |   AND   0,1      |
           |   MOVZL 1,1      |
           |   MOV#  1,1,SNR  |
           |   LDA   1,D      |
           |   STA   1,A      |
            ------------------
         Finally, peephole optimisation combines the AND with
         the MOVZL giving ANDZL, and then combines this with
         the following MOV# to give the complete code
         sequence as:
            ------------------
           |   LDA   0,B      |
           |   LDA   1,C      |
           |   ANDZL 0,1,SNR  |
           |   LDA   1,D      |
           |   STA   1,A      |
            ------------------
        The most interesting  thing  to  notice  about  this
        particular  sequence  of  optimisations is that with
        the possible exception of the removal of the  marked
        STA  instructions,  the  final code can be generated
        very simply with local optimisations.
                4 The design of the compiler
   This section describes the features of the compiler which have
had an influence on the form of the intermediate code.
4.1 General structure
   One of the aims of this type of compilation strategy is to
simplify the production of compilers, and a successful technique for
simplifying programs is to divide them into several communicating
modules, each largely independent of the others but with
well-defined interfaces between them.  At the highest level a
compiler can be split up into three major parts:
          1    A language processor, which deals with the
               language-dependent parts such as parsing,
               semantic checking, and error reporting.
          2    A code generator, which takes the decomposed
               form of the program as generated by 1 above,
               and constructs the appropriate code sequences
               to perform the required functions.
          3    An object-file generator, which builds an
               object-file from the code sequences produced
               by 2, in the form required by the system which
               is to execute the program.
   Commonly, the first two parts of this scheme are combined
into one program which generates as its output an
assembly-language  source file corresponding to the original
program.   The third part then becomes the  standard  system
assembler.   This approach clearly simplifies the production
of the compiler, as one part,  the  assembler,  is  provided
already  and  can ease the problems of checking the compiler
because the code it generates is presented in  a  well-known
form.   Despite  these  advantages  it  was rejected for the
following reasons:
         1    In  order  that  assembly  language   can   be
              generated,  the compiler must have an internal
              form of the instructions which is changed into
              text, processed by the assembler, and  finally
              converted  into  the  machine  representation.
              These transformations can be eliminated if the
              compiler  works  directly  with  the   machine
              representations.
         2    In general, the system-provided assembler will
              be  expecting  to receive a much more powerful
              language  than  the  rather  stereotyped  text
              produced  by  compilers.   This will certainly
              degrade the performance of the  assembler.   A
              solution  to  this  is  to  produce a cut-down
              version of the assembler which only recognises
              those constructs generated  by  the  compiler.
              However, producing a new assembler removes one
              of   the  reasons  for  choosing  this  route,
              namely, not requiring extra  work  in  writing
              the object-file generator.
          3    As will be seen later (section 4.7), even
              after the code sequences  have  been  produced
              there  remain  several optimisations which can
              be performed using knowledge gained during the
              production of those  sequences,  for  example,
              generating  short  forms  of jump instructions
              when the distance between  the  jump  and  its
              destination is small enough.  While in certain
              cases  these optimisations can be performed by
              a standard assembler it is unlikely  that  the
              structure  of  the  code-generator would be as
              simple as  if  a  special-purpose  object-file
              generator were available.
   The  main  interface  in  such  a  system is clearly that
between the  language  and  machine  dependencies,  as  most
languages  are  largely  machine-independent.   It  is  this
interface     between     the     language-dependent     and
machine-dependent  parts of the compiler which is termed the
INTERMEDIATE  CODE.   In  the  following  discussion  it  is
assumed  that  the  reader has a reasonable understanding of
the structure of the final form of I-code, a  definition  of
which may be found in Appendix A2.
4.2 The intermediate code
   If the intermediate code is to be generated by several different
language processors, a simple intermediate code which is easy to
produce may well be more attractive.  As I-code was intended for
optimisation a high-level code was chosen.  In addition, as it was
hoped that the code could eventually be used in different language
processors, it was decided to keep the structure simple.
   The complete compilation process may be thought of as a sequence
of transformations working from the source program to the final
object program via a number of intermediate representations.  As the
transformations are applied the representations become less
dependent on the source language and more dependent on the target
machine.  In order to simplify the code-generator as much as
possible the intermediate code must lie as far from the source
language as is possible without straying from the objectives set out
below.
4.2.1 Objectives
   One of the dangers in designing an intermediate code is that of
building into it old techniques and standard expansions of source
constructions, which while they may be tried and tested cannot in
any way be said to be "the only solutions" or even "the best
solutions".  One of the intentions behind the design of I-code was
to
permit the use of varied implementation strategies.  In the same way
that the only practical definition of a "good" programming language
is that it fits the style of the particular programmer using it, so
the measure of the power of an intermediate code must include the
ease with which it can adapt to an existing style of code-generator
writing.  Inevitably, practical constraints prevent total
generality; the most general form of a program is a canonical form
of itself, but this is little help in compiling it.  It follows that
the intermediate code, while remaining true to the original program
and distant from "real" machines, must provide enough simplification
to make the task of code-generation as easy as possible without
inhibiting optimisation.
   From the start it was appreciated that an intermediate code
suitable for use in optimising compilers would necessarily require
more processing than a code such as OCODE which was aimed at a quick
implementation.  The original hope was that although each
machine-dependent code generator would not be small, typically about
3000-4000 IMP statements, large portions of one could be taken as a
basis for a new implementation.  This has proved to be the case, and
provision of an existing code-generator as a template greatly
simplifies the task of creating a new one (section 6.4.1).
4.2.1.1 Scope
   The first and most fundamental objective in the design of I-code
was that it should support the compilation of one specific language,
IMP-77, on many different machines.  Considerations of using the
code to implement other languages were secondary to this main aim,
but were used to bias the design when a choice had to be made from
several otherwise equally suitable possibilities.  In retrospect, a
few areas of the code could have been made more general without
significant overheads in the code generators, mainly in the area of
data descriptor definitions, but a detailed discussion of one
intermediate code supporting several languages is beyond the scope
of this work.
   In direct contrast to many intermediate codes, I-code was not
designed with the intention of making it convenient to interpret;
the prime aim was to permit compilation into efficient machine-code.
Nevertheless it is possible to "compile" I-code into threaded code
[Bell, 1973] or a form suitable for interpretation, either by
generating a conventional interpretive code or by leaving the code
in more-or-less its current form but with labels resolved and
descriptors expanded into a more convenient representation.
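   Threaded code in the sense of [Bell, 1973] replaces the
interpreter's decoding loop by a sequence of operation addresses.  A
minimal rendering (in Python, for illustration only; real threaded
code threads machine addresses, not function objects) is:

      def threaded_run(thread, stack):
          for operation in thread:      # the whole "interpreter"
              operation(stack)

      def push(value):
          return lambda stack: stack.append(value)

      def add(stack):
          stack.append(stack.pop() + stack.pop())

      program = [push(1), push(2), add]     # compiled form of 1+2
      stack = []
      threaded_run(program, stack)
      assert stack == [3]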
4.2.1.2 Information preservation
   For example, the following two program fragments are semantically
identical:
           A                             B
   ----------------------------------------------------
  |                           |   P = 0                |
  |                           |   cycle                |
  |  TEST for P = 1, 1, 10    |      P = P+1           |
  |                           |      TEST              |
  |                           |   repeat until P = 10  |
  |                           |                        |
   ----------------------------------------------------
However, in "B" the information that the fragment contains a
simple for construction, while not completely lost, has been
scattered through the code, and this dilution of information
will increase the complexity of any  code-generator  wishing
to handle for loops specially.
   To   leave  open  all  avenues  for  optimisation  it  is
necessary therefore, that all of the semantic information in
the source program is preserved in a  compact  form  in  the
I-code.   One  sure  way  of  achieving  this property is to
design the code in such a way as to allow  the  regeneration
of  the  source  program, or at least a canonical form of it
which is not significantly different from the original.   In
this  context  insignificant  differences are the removal of
comments  and  the  standardisation  of  variant  forms   of
statements, such as:
      NEWLINE if COUNT = 0
and: if COUNT = 0 then NEWLINE
frequently pun on addresses and integer values.  Several later
versions of the codes have attempted to solve these problems by
parameterising the intermediate code generator so that the
characteristics of the target machine may be used to modify the code
which is produced.  However, they still have built in to them
assumptions about how the objects can be addressed.  There are so
many constraints which can be imposed on the code to be generated,
such as operating system requirements and conventions for
communicating with the run-time environment, that a parameterised
first phase could not be expected to generate code which was
well-suited to every installation.  The authors of JANUS [Coleman,
1974] write that they believe that the approach of using a
parameterised intermediate code "... is a dead end, and that the
adaptability must come in the translation from the intermediate
language to machine code".
4.2.1.4 Simplification
   For the complexity of the machine-dependent phases of compilation
to be kept as low as possible the machine-independent phase must do
as much work as possible while keeping within the constraints
imposed by the previous objectives.  One way of simplifying the
intermediate code is for certain high-level constructions to be
expanded into
lower-level  constructions,  but only when there is just one
expansion possible under the rules of the language, and that
expansion does not scatter information which may be of later
use.
The most obvious case of such expansion is in  dealing  with
complex conditional clauses such as:
    -----------------------------------------------------
   |   if (A=B and C#D) or (E<F and G>H) then X else Y   |
    -----------------------------------------------------
    
IMP-77  specifies  that the condition will only be evaluated
as far as is necessary to determine the inevitable truth  or
falsity  of  the  condition,  and  so,  bearing  in mind the
modifications  to  be  discussed  in  section   4.6.1,   the
statement can be represented more simply as:
    ---------------------------------
   |         if A # B  then ->L1     |
   |         if C # D  then ->L2     |
   |   L1:   if E >= F then ->L3     |
   |         if G <= H then ->L3     |
   |   L2:   X                       |
   |         ->L4                    |
   |   L3:   Y                       |
   |   L4:                           |
    ---------------------------------
    
This  expansion  is  tricky and notoriously error prone, and
therefore is best done once and for all in the common phase.
Similarly it  is  possible  to  expand  all  simple  control
structures into their equivalent labels and jumps, providing
that the structural information is not lost thereby.
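By way of illustration (not from the thesis), the expansion
just shown can be generated by a small emitter; the following C
sketch hard-codes the example condition, allocating labels as
in the fragment above:

    #include <stdio.h>

    static int next_label = 1;
    static int new_label(void) { return next_label++; }

    static void jump_if(const char *cond, int lab) {
        printf("    if %s then ->L%d\n", cond, lab);
    }
    static void label(int lab)      { printf("L%d:\n", lab); }
    static void stmt(const char *s) { printf("    %s\n", s); }

    int main(void) {
        /* if (A=B and C#D) or (E<F and G>H) then X else Y */
        int l1 = new_label(), l2 = new_label();
        int l3 = new_label(), l4 = new_label();
        jump_if("A # B",  l1);  /* A=B fails: try second conjunction */
        jump_if("C # D",  l2);  /* first conjunction true: do X      */
        label(l1);
        jump_if("E >= F", l3);  /* E<F fails: do Y                   */
        jump_if("G <= H", l3);  /* G>H fails: do Y                   */
        label(l2); stmt("X");
        printf("    ->L%d\n", l4);
        label(l3); stmt("Y");
        label(l4);
        return 0;
    }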
4.3 Code layout and addressing

4.3.1 Nested procedure definitions

   A common feature of programming languages is the ability to
nest the definition of a procedure within another procedure.
In addition, several languages imply the definition of
procedures within single statements, as in the case of name
parameters in ALGOL-60, where the parameter which is actually
passed can be a reference to a "thunk", a procedure to evaluate
the parameter.
   With such nesting, provision must be made for preventing the
flow of execution from "falling through" into the procedure
from the preceding statements, and this is usually accomplished
by planting at the start of the procedure a jump to the
statement following its end.  While this is simple to implement
it does introduce extra instructions which are not strictly
necessary.  With user-defined nested procedures the overhead
can be minimised if several procedures are defined together, as
one jump instruction can be used to skip them all.
Unfortunately thunks will be generated throughout the code in a
more-or-less random way, giving little opportunity to coalesce
the jumps.  Even if the extra execution time caused by these
jumps is insignificant (the jumps round thunks defined in loops
get executed repeatedly), the code which they are skipping
stretches the code in which they are nested.  On machines with
fixed-size jump instructions which can cover the whole machine,
such as the DEC PDP10, the stretching causes no problems, but
if the addressing is limited, or if several different sizes of
jump instruction are provided, the presence of the nested
procedure can result in more code being produced later in the
generation of large jumps.

4.3.2 Paged machines

   On paged machines the overall performance of a program does
not depend solely on the efficiency of the code produced by the
compiler but includes a factor depending on the locality of the
references made by the executing program.  Traditionally this
locality has been improved by monitoring the execution of the
program and then re-ordering parts of it in the light of the
measurements.  Unfortunately not all operating systems provide
the user with convenient tools to enable the measurement to be
done, leaving only ad hoc methods or intuition for guidance.
Without careful control it is all too easy to move one
procedure to improve references to it and thereby cause another
piece of code to cross page boundaries and counteract any gains
in paging performance.  Even if the user can obtain the
necessary information, a slight change in the program can
invalidate the changes.  Notwithstanding these problems it is
evident that by careful
structuring of a program significant gains in paging behaviour
can be obtained, and so this option should not be pre-empted by
the intermediate code (as it is by Z-code, which automatically
reorders the definitions of procedures).
   The possibility of automatic improvement of paging behaviour
was investigated by Pavelin, who showed that the paging
characteristics of a program can be improved by an automatic
reordering of the code [Pavelin, 1970].  Pavelin's thesis
describes the breaking-up of a program into "chunks", defined
by branches and the destinations of branches.  At each chunk
boundary extra instructions are planted to cause the updating
of a "similarity array" which records the dynamic
characteristics of the program.  After several runs the
similarity arrays are merged and the result is used to specify
a reordering of the chunks which should improve the paging
performance.  In test cases the working-set size of the code
was reduced by as much as 40%.  The thesis also went on to say
that the various compilation problems associated with this
"... can be alleviated by operating on an intermediate code
which is machine independent with symbolic code addresses".
4.3.3 Events

   For example, on the INTERDATA 7/16 a procedure at the
outermost textual level uses register 15 to access its local
stack frame, giving the exit sequence:
     -------------------
    |   LM    7, 4(15)  |
    |   BFCR  0, 8      |
     -------------------
but a procedure nested within this would use register 14 thus:
     -------------------
    |   LM    7, 4(14)  |
    |   BFCR  0, 8      |
     -------------------
It follows that the signal routine must be told which base
register to use at each stage of the recovery.  This can be
done either by planting code in the entry and exit sequences of
each procedure, or by keeping a static table associating
procedure start and finish addresses with the appropriate base
register.  The first method is poor as it imposes a run-time
overhead on all procedures, whether they trap events or not.
The second method is better but can be complicated if
procedures are nested, as the start-finish addresses alone no
longer uniquely define the procedure.  One solution is to cause
all procedures which use the same exit sequence to be loaded
into distinct areas, and to associate the recovery information
for the signal routine with each area.  This reduces the static
overhead to a few words per area, rather than a few words per
procedure.
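A sketch in C of the second method's static table; the
structures, and the innermost-range rule used to resolve nested
procedures, are assumptions of mine rather than the thesis's:

    #include <stddef.h>

    struct recovery_entry {
        unsigned start, finish;   /* code addresses of the procedure   */
        int      base_reg;        /* e.g. 15 outermost, 14 when nested */
    };

    /* With nesting the ranges overlap; taking the smallest enclosing
       range picks the innermost procedure containing pc. */
    int base_register_for(const struct recovery_entry *t, size_t n,
                          unsigned pc) {
        const struct recovery_entry *best = NULL;
        for (size_t i = 0; i < n; i++) {
            if (t[i].start <= pc && pc < t[i].finish &&
                (!best || t[i].finish - t[i].start <
                          best->finish - best->start))
                best = &t[i];
        }
        return best ? best->base_reg : -1;  /* -1: not in any procedure */
    }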
    ii   Array references with constant subscripts  need  no
         address  calculations  at  run-time.   For  example
         using V as declared  above,  the  element  V(2)  is
         immediately   addressable   as  the  halfword  with
         displacement "a+2 + 2*2" from LNB.
         
    iii  In certain more general cases when the subscript is
         a variable, access can be simplified by remembering
         previous calculations.  For example, the address of
         the array element V(X) is
             -------------------------------------
            | addr(V(0)) + X*size of each element |
             -------------------------------------
         In the example above this becomes:
             -------------------------------------
            | LNB+a+2    + X*size of each element |
             -------------------------------------
         which can be rearranged to:
             -------------------------------------------
            | a+2        + (X*size of each element+LNB) |
             -------------------------------------------
         Hence the following code could be produced (7/16):
         
           --------------------
          |   V(X) = 0         |
          |                    |
          |   LH   1,X(LNB)    | pick up X
          |   AHR  1,1         | *2 (2 bytes per integer)
          |   AHR  1,LNB       | add in LNB
          |   SHR  0,0         | get zero
          |   STH  0,a+2(1)    | store in V(X)
           --------------------
           
         Noting  that  the   value   now   in   register   1
         (X*size+LNB)  only  depends  on  the  size  of each
         element, X, and the local name base,  it  is  clear
         that  register  1  can  be used to address the X'th
         element of any integer array of one  dimension  and
         constant  bounds  declared  at  the  current level.
         Hence if the array W(1:12) were declared immediately
         after Y in the example above, then while register 1
         remains unchanged W(X) can be addressed as a+2002(1).
         
   On  the  other  hand, a machine with limited store cover,
such as the Data General NOVA which only  has  an  eight-bit
displacement,  will  almost  certainly force the array to be
implemented as an immediately addressable pointer  which  is
initialised  at  run-time  to the address of storage claimed
explicitly.
  --------------------------------------------------------
 | LNB                                                    |
 |  |                                                     |
 |  |                                                     |
 |  v   .---.---.---.                                     |
 |  .- -| X | V | Y | - - -                               |
 |      .---.---.---.                                     |
 |            |                                           |
 |            |                                           |
 |            |  .------.------.         .--------.       |
 |            +->| V(0) | V(1) | - - - - | V(999) | - - - |
 |               .------.------.         .--------.       |
  --------------------------------------------------------
With this organisation the address of V(X) will be:
    ----------------------------
   | V + X*size of each element |
    ----------------------------
and there is little that can  be  done  by  rearranging  the
expression to improve on the "obvious" code (7/16):
  ----------------
 |   V(X) = 0     |
 |                |
 |   LH   1,X     |    pick up X
 |   AHR  1,1     |    double it
 |   AH   1,V     |    add in addr(V(0))
 |   SHR  0,0     |
 |   STH  0,0(1)  |
  ----------------
Not  only is this second code sequence longer than the first
by two bytes, but it will execute more slowly as the  second
addition  involves  a store reference whereas the equivalent
instruction in the first sequence uses a register.
In both cases, however, some simplification can be  done  if
the subscript is an expression of the form:
    ------------------------------
   |   X plus or minus CONSTANT   |
    ------------------------------
in which case the constant can be removed from the subscript
expression evaluation and added into the final displacement.
For example (7/16):
   ------------------------
  |   V(X-7) = 0           |
  |                        |
  |   LH   1,X             |      pick up X
  |   AHR  1,1             |      double it
  |   AHR  1,LNB           |      add in LNB
  |   SHR  0,0             |      get zero
  |   STH  0,a+2+(-7)*2(1) |
   ------------------------
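The folding itself is a one-line calculation; the following C
sketch (names invented) shows the displacement arithmetic for a
subscript of the form X plus or minus CONSTANT:

    /* Fold the constant part of a subscript into the instruction's
       displacement: for element size s, base displacement d and
       subscript X+c,
           address = d + s*(X+c) = (d + s*c) + s*X
       so (d + s*c) goes into the instruction and only s*X is
       computed at run time. */
    int folded_displacement(int base_disp, int elem_size, int constant) {
        return base_disp + elem_size * constant;
    }

    /* For V above (base displacement a+2, halfword elements) and the
       subscript X-7, folded_displacement(a+2, 2, -7) yields
       a+2+(-7)*2, exactly the displacement planted in the STH. */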
   Unfortunately even this optimisation may not be available.
For example, the ICL 2900 series performs array accesses
through a DESCRIPTOR REGISTER, and the extra displacement
cannot be added into the instruction.  Also some machines, such
as the IBM 360, only permit positive displacements in
instructions.
   The examples above pose the following problem: if the
intermediate code is to know nothing of the target machine it
cannot know the best way to declare the array, nor the best way
to access it.  Therefore the code must always produce the same
sequences for array declarations and array accesses.  It
follows that these sequences must remain quite close to the
original source and must not include any explicit address
calculations.
   As another example, the DEC PDP11 range has a hardware stack
which grows towards decreasing store addresses.  Because of
this it could be convenient to allocate storage for variables
in that order, from large addresses to small.  However, in
several cases it may be necessary to force objects to be
created at increasing addresses, such as when program
structures are to be mapped onto hardware-defined structures in
memory; the result is an implementation which must be able to
create similar objects in different ways depending on the
context.
   Finally, some machines provide instructions in which the
displacement of the operand is scaled before use, depending on
the size of that operand.  The GEC 4080 is such a machine, with
instructions such as:
     ----------------------------------------------------
    |   LDB  1    load byte <1>                          |
    |   LD   1    load halfword, bytes <2> & <3>         |
    |   LDW  1    load fullword, bytes <4>,<5>,<6> & <7> |
     ----------------------------------------------------
When producing code for such machines it is convenient to
allocate all the local objects of the same size in particular
areas, and then to order the areas in increasing order of the
size of the objects they contain.  This permits fuller use of
the available displacement field in the instructions.
   The solution to these problems which was chosen in I-code
was to define a DESCRIPTOR for each object to be manipulated.
On input to the code-generator descriptors are converted from
their machine-independent form to a new form appropriate to the
target machine.  As all subsequent references to the object
will be through its descriptor, the code produced will
automatically reflect the decisions made at the time the
descriptors were created.
   As has been discussed previously, it may be possible to
remove the overhead of setting up addressability for local
variables and parameters if the parameters can be held in
registers and the local variables are never referenced.  After
examining many procedures which do use local variables it is
clear that a large number of them do not need the complete
overhead of setting up a local frame base, as they could use
the workspace pointer (stack pointer) instead.  The criterion
is that the position of the locals relative to the workspace
pointer must be known at compile time.  This reduces to the
procedure having no objects with computed sizes (arrays with
computed bounds, for example) and no calls on procedures which
access the locals as their global variables; the PDP11 routine
compiled below satisfies both conditions.
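Before turning to that example, a minimal C sketch of the
descriptor conversion just described; the structures, and the
displacement test used to choose between the two array
organisations, are invented for illustration:

    /* One machine-independent descriptor, and its target-dependent
       replacement, built once when the declaration is processed. */
    enum access { DIRECT,         /* LNB+displacement (7/16 example)  */
                  VIA_POINTER };  /* through a pointer (NOVA example) */

    struct idesc { int elem_size, lower, upper; };  /* from the I-code */
    struct tdesc { enum access how; int disp; };    /* target form     */

    struct tdesc convert(struct idesc d, int frame_disp, int max_disp) {
        struct tdesc t;
        int bytes = (d.upper - d.lower + 1) * d.elem_size;
        if (frame_disp + bytes <= max_disp) {
            t.how  = DIRECT;       /* whole array within store cover  */
            t.disp = frame_disp;
        } else {
            t.how  = VIA_POINTER;  /* only the pointer is addressable */
            t.disp = frame_disp;   /* displacement of the pointer     */
        }
        return t;
    }
    /* All later loads and stores consult the tdesc, so the same
       I-code reference compiles differently on the two machines. */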
Consider  the  compilation of the following procedure on the
PDP11:
    ----------------------------------------
   | routine MARK(record(cellfm)name CHAIN) |
   |    integer N                           |
   |    N = 0                               |
   |    while not CHAIN == NULL cycle       |
   |       N = N+1                          |
   |       CHAIN_INDEX = N                  |
   |       CHAIN == CHAIN_LINK              |
   |    repeat                              |
   | end                                    |
    ----------------------------------------
    
The code normally produced for this routine would be:
    --------------------------
   |      MOV   LNB,-(SP)     | remember old LNB
   |      MOV   DS,-(SP)      | remember DS
   |      MOV   R0,(DS)+      | save the parameter
   |      MOV   DS,LNB        | set up local addressing
   |      ADD   #20,DS        | reserve local space
   |      CLR   10(LNB)       | N = 0
   | $1:  MOV   -2(LNB),R1    | test CHAIN
   |      BEQ   $2            | branch if NULL
   |      INC   10(LNB)       | N = N+1
   |      MOV   10(LNB),2(R1) | CHAIN_INDEX = N
   |      MOV   (R1),-2(LNB)  | CHAIN == CHAIN_LINK
   |      BR    $1            | repeat
   | $2:  MOV   (SP)+,DS      | restore DS
   |      MOV   (SP)+,LNB     | restore LNB
   |      RTS   PC            | return
    --------------------------
however, by using workspace pointer (DS) relative addressing
this reduces to:
    --------------------------
   |      MOV   R0,(DS)+      |
   |      TST   (DS)+         | reserve local space
   |      CLR   -2(DS)        | N = 0
   | $1:  MOV   -4(DS),R1     | test CHAIN
   |      BEQ   $2            |
   |      INC   -2(DS)        | N = N+1
   |      MOV   -2(DS),2(R1)  | CHAIN_INDEX = N
   |      MOV   (R1),-4(DS)   | CHAIN == CHAIN_LINK
   |      BR    $1            |
   | $2:  SUB   #4,DS         | restore DS
   |      RTS   PC            | return
    --------------------------
   This optimisation can be performed quite  simply  by  the
third phase of compilation.
In  the  interface  between  the second and third phases the
code sequences generated by the second phase are made up  of
items of the form:
   <type> <VALUE>
where  <type>  describes  where  <VALUE>  is  to be put, for
example in the code area or in the private  data  area.   To
achieve   the  workspace-pointer-relative  addressing  extra
types are introduced which specify that the associated value
is the displacement of a local variable from LNB.  Other
codes  are needed to be able to modify the operation part of
the instruction which uses the displacements but these  will
be  ignored  here as they cause no difficulty and would just
obscure the discussion.   In addition, an extra  <modify DS>
item  is  output  whenever DS is explicitly altered (as when
parameters are stacked using MOV ??,(DS)+).  By default the
third  phase  will  treat these extra types as being exactly
equivalent to <code area> types, and will generate the first
sequence of code.  However, if, when the end of the procedure
is processed, the second phase discovers that no dynamic
objects or dangerous procedure calls were generated, it marks
the end of the procedure accordingly (in the same way as
described in section 4.7.2).  This mark instructs the third
phase to relocate all values with the appropriate type so as to
make them relative to DS.  The <modify DS> types are used to
keep the third phase's idea of the current position of DS in
step with reality.
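A C sketch of this fix-up; the stream representation and the
arithmetic are simplified, and frame_size stands for the
distance by which DS lies above LNB once the locals have been
claimed:

    enum itype { CODE, LOCAL_DISP, MODIFY_DS, END_PROC };
    struct item { enum itype type; int value; };

    /* Rewrite LNB-relative displacements as DS-relative ones when
       the end-of-procedure mark (use_ds) says it is safe to do so. */
    void fixup(struct item *it, int n, int frame_size, int use_ds) {
        int ds = frame_size;          /* DS - LNB, tracked as we scan */
        for (int i = 0; i < n; i++) {
            switch (it[i].type) {
            case MODIFY_DS:           /* e.g. MOV ??,(DS)+ */
                ds += it[i].value;
                break;
            case LOCAL_DISP:          /* displacement of a local      */
                if (use_ds)
                    it[i].value -= ds;    /* now relative to DS       */
                it[i].type = CODE;        /* treat as plain code word */
                break;
            default:
                break;
            }
        }
    }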
4.5 Procedure entry and exit

   IMP is heavily based on the use of procedures; indeed the
only method of communicating with the controlling environment
is by means of procedure calls.  Also the techniques of
structured programming result in the extensive use of
procedures.  Clearly when writing a compiler for such languages
much thought must be given to making procedure entry and exit
(and the associated passing of parameters) as efficient as
possible.

4.5.1 User-defined procedures

   The usual technique for procedure entry and exit is to have
standard preludes and postludes which cover all the different
types of procedure.  For example the EMAS IMP code sequences
[Stephens, 1974] are (ICL4/75):
     -------------------------
    |      STM  4,14,16(WSP)  |  save the current environment
    |      BAL  15,PROC       |  enter the procedure
    |      .                  |
    | PROC ST   15,60(WSP)    |  save the return address
    |      LR   LNB,WSP       |  set up the local stack frame
    |      LA   WSP,***(WSP)  |  claim local stack space
    |      BALR 10,0          |  set up code addressability
    |      .                  |
    |      .                  |
    |      LM   4,15,16(LNB)  |  restore calling environment
    |      BCR  15,15         |  return
     -------------------------
   While  this  has  proved to be convenient to generate and
efficient to execute it has one major problem, part  of  the
housekeeping of the procedure entry is performed at the call
itself.  This seems undesirable for two reasons:
    i    Procedures  are  generally  called  more often than
         they are defined.   If part of the housekeeping  of
         procedure  entry is done at the call that code will
         be duplicated at each  call,  thus  increasing  the
         size of the program.   Putting that code within the
         procedure reduces the size overhead.
    ii   If the knowledge of what housekeeping needs  to  be
         done  for  procedure  entry  is  needed outside the
         procedure it becomes impossible to alter the  entry
         and exit sequences to suit the actual procedure.
         In particular on certain machines it is possible to
         remove the entry and exit sequences altogether when
         the procedures are simple enough.
If  the  4/75  compiler  moved  the  environment-saving  STM
instruction  into  the  body of the procedure the storing of
the return address would be performed automatically:
    -------------------------
   |      BAL   15,PROC      |
   |-                    -   |
   | PROC STM   4,15,16(WSP) |
   |      LR    8,WSP        |
   |      ..    ..           |
    -------------------------
This not only saves four bytes per call, very important on a
machine with a very severely  limited  immediate  addressing
range,  but  also  reduces  the  overhead  in  entering  the
procedure by one instruction.
A further modification would be to pass one or more  of  the
parameters  in  the  registers,  leaving  the  way  open for
remembering that fact inside the procedure.
Hence a call could be reduced from:
    ------------------------
   |      L   1,X           |
   |      ST  1,64(WSP)     |
   |      L   2,Y           |
   |      ST  2,68(WSP)     |
   |      BAL 15,PROC       |   PROC(X, Y)
   |-                      -|
   | PROC STM 4,15,16(WSP)  |
   |      ..  ..            |
    ------------------------
   to:
    ------------------------
   |      L   0,X           |
   |      L   1,Y           |
   |      BAL 15,PROC       |
   |-                      -|
   | PROC STM 4,1,16(WSP)   |
   |      ..  ..            |
    ------------------------
   The ability to determine exactly how parameters are to be
passed can be of crucial importance in the efficiency of the
procedure mechanism.
When compiling for the PDP11 the  obvious  calling  sequence
for a procedure with two integer value parameters would be:
    -------------------
   |   MOV   X,-(SP)   |
   |   MOV   Y,-(SP)   |
   |   JSR   PC,PROC   |
    -------------------
Unfortunately this produces problems inside the procedure as
the  return  address,  stacked  by  JSR, is too far down the
stack to permit the use of the RTS  instruction  to  return,
for  this  would  leave  on  the stack the space used by the
parameters.  Nor can the stack be adjusted before the return,
for the return would then have to be made indirect through a
location beyond the stack pointer, and space there must be
considered volatile, being used by interrupt handling.  Extra
instructions are needed either at the  call  or  inside  the
procedure  to adjust the stack; the JSR instruction may well
not be "a beauty" as claimed  by  some  implementors  [Bron,
1976].  A MARK instruction has been introduced in an attempt
to  overcome  this problem, but it is far from helpful as it
imposes an arbitrary register convention and puts all of the
overhead on the call rather than on the procedure itself.
On the other hand if all of the parameters can be passed  in
registers,  the  JSR  will put the return address on a clear
stack permitting the use of  RTS  for  the  return.   As  in
practice  most  procedures have few parameters, usually only
one or two, this can give a large saving.
As an example of the power of being able to alter entry  and
exit  sequences  consider  a recursive implementation of the
IMP routine SPACES:
    -----------------------------
   | routine SPACES(integer N)   |
   |    return if N <= 0         |
   |    SPACES(N-1)              |
   |    SPACE                    |
   | end                         |
    -----------------------------
 On the PDP10 the straightforward coding for this would be:
     --------------------------
    |         MOVE  0, X       | pick up X
    |         MOVEM 0, 3(SP)   | assign the parameter
    |         PUSHJ SP, SPACES | call SPACES
    |-                        -|
    | SPACES: MOVEM LNB,1(SP)  | save old frame base
    |         MOVE  LNB,SP     | pick up new frame base
     |         ADDI  SP,3       | reserve stack space
    |         SKIPLE 1,2(LNB)  | load, test & skip if X<=0
    |         JRST   LAB1      | jump to LAB1
    |         MOVE   SP,LNB    | restore stack pointer
    |         POPJ   SP        | return
    | LAB1:   SOJ    1, 0      | X-1 -> ACC1
    |         MOVEM  1,3(SP)   | assign parameter
    |         PUSHJ  SP,SPACES | call SPACES
    |         PUSHJ  SP,SPACE  | call SPACE
    |         MOVE   SP,LNB    | restore stack pointer
    |         MOVE   LNB,1(SP) | restore old frame base
    |         POPJ   SP        |
     --------------------------
By applying the optimisations of passing the parameter in an
accumulator (called ARG) and remembering that the  parameter
is  in  this accumulator on entry to the procedure, the code
reduces to:
    ---------------------------
   |         MOVE  ARG,X       | pick up X
   |         PUSHJ SP, SPACES  | call SPACES
   |-                         -|
    | SPACES: MOVEM LNB,1(SP)   | save old frame base
    |         MOVE  LNB,SP      | pick up new frame base
    |         MOVEM ARG,2(SP)   | assign the parameter
    |         ADDI  SP,3        | reserve stack space
   |         JUMPG ARG, LAB1   | ->LAB1 if ARG > 0
   |         MOVE  SP,LNB      | restore stack pointer
   |         MOVE  LNB, 1(SP)  |
   |         POPJ   SP         | return
   | LAB1:   SOJ    ARG, 0     | parameter = ARG-1
   |         PUSHJ  SP,SPACES  |
   |         PUSHJ  SP,SPACE   |
   |         MOVE   SP,LNB     |
   |         MOVE   LNB,1(SP)  |
   |         POPJ   SP         |
    ---------------------------
On inspection  it  is  clear  that  the  local  stack  frame
(pointed  at  by  LNB)  is  never  used within the procedure
except by the entry and exit sequences.   Hence by  reducing
those sequences to the absolute minimum the code becomes
    ---------------------------
   |         MOVE  ARG,X       |
   |         PUSHJ SP, SPACES  |
   |-                         -|
   | SPACES: JUMPG ARG, LAB1   |
   |         POPJ   SP         |
   | LAB1:   SOJ    ARG, 0     |
   |         PUSHJ  SP,SPACES  |
   |         PUSHJ  SP,SPACE   |
   |         POPJ   SP         |
    ---------------------------
Finally,  an  opportunistic  optimisation  may  be performed
[Knuth, 1974; Spier, 1976] by noticing that  the  final  two
instructions  may  be  combined  so that the procedure SPACE
uses the return address pushed onto the stack for the return
from SPACES.  This results in the tightest form of the code:
   ----------------------------
   |         MOVE  ARG,X       |
   |         PUSHJ SP, SPACES  |
   |-                         -|
   | SPACES: JUMPG ARG, LAB1   |
   |         POPJ   SP         |
   | LAB1:   SOJ    ARG, 0     |
   |         PUSHJ  SP,SPACES  |
   |         JRST   SPACE      |
    ---------------------------
   The final steps in this optimisation can only be performed
once the body of the procedure has been compiled.  In order
that the correct (in this case non-existent) entry sequence can
be used, an extra pass over the object code is necessary.  This
pass can be combined with the process of adjusting labels and
jumps which is carried out in the third phase of compilation,
described in section 4.7.  The code generator can mark the
position where an extra sequence is required and at the end of
the procedure can inform the third phase of any salient
features found in the body.  The third phase can then decide on
the best entry and exit sequences to use.
   This ability to tailor the "housekeeping" parts of
procedures can be used in many circumstances to limit the
inclusion of code which is needed to handle rare constructions
to those procedures which use the feature.  As an example of
this consider the ICL 2900 series.  The machines of the series
are designed around a hardware stack, which resides in one, and
only one, segment of the user's virtual memory, and thus limits
this data space to 255K bytes.  In order to be able to handle
programs using very large arrays, space must be available
off-stack in another segment or set of consecutive segments.
The maintenance of this extra data space will require
instructions to be executed on entry to and on exit from
procedures which claim space from it, but not from those which
only use space from the stack.  These extra instructions can be
added to the procedure in a simple manner by the third phase,
as it now controls the form of the procedure when all the
necessary information is available.
   For these optimisations to be performed the intermediate
code must not lay down rules for procedure entry and exit;
rather it should simply mark the points at which suitable code
is required.
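A C sketch of the decision the third phase might take; the
summary flags and the emitted pseudo-sequences are invented,
and the off-stack case corresponds to the 2900 example above:

    #include <stdio.h>

    /* Summary gathered by the code generator while compiling the
       body; at the end of the procedure the third phase uses it to
       pick the entry (and matching exit) sequence. */
    struct proc_summary {
        int uses_frame;       /* body references its local stack frame */
        int dynamic_objects;  /* e.g. arrays with computed bounds      */
        int dangerous_calls;  /* callees which see these locals        */
        int off_stack_space;  /* e.g. very large arrays (ICL 2900)     */
    };

    void emit_entry_sequence(const struct proc_summary *p) {
        if (!p->uses_frame) {
            /* nothing at all: the SPACES case above */
        } else if (!p->dynamic_objects && !p->dangerous_calls) {
            puts("<reserve locals, workspace-pointer relative>");
        } else {
            puts("<save LNB; set up new frame; claim stack space>");
        }
        if (p->off_stack_space)
            puts("<claim off-stack data space>"); /* only where needed */
    }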
   An additional consideration in the design of the I-code for
procedure entry and exit is the requirement of some machines
for a "pre-call" to be made to prepare a hardware stack for
parameters prior to their evaluation and assignment.
For example (ICL2900):
    -------------------
   |   PROC(1, 2, 3)   |
   |-                 -|
   |   PRCL   4        |   pre-call
   |   LSS    1        |   load 1
   |   SLSS   2        |   stack it and load 2
   |   SLSS   3        |   stack it and load 3
   |   ST     TOS      |   store it on Top Of Stack
   |   RALN   8        |   raise the Local Name Base
   |                   |   to point to the new frame
   |   CALL   PROC     |
    -------------------
   Following these considerations the form of procedure call
chosen for I-code was:
    -----------------
   |   PROC  P       |   stack procedure descriptor
   |                 |   \
   |   {stack param} |    : repeated for each parameter
   |   ASSPAR        |   /
   |   ENTER         |   enter the procedure
    -----------------
ASSPAR causes the value described on the top of the stack to
be  assigned  to  the  next  parameter,  identified  by  the
procedure  descriptor  second  on  the  stack,  using either
ASSVAL or ASSREF as appropriate.
In order to pass some of the parameters in registers all that
need be done is for the initial processing of the descriptors
for those parameters to define them as the appropriate
registers.  PROC can then "claim" those registers, the
parameter assignment will load them, and finally ENTER can
release them for subsequent re-use on return from the
procedure.
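A C sketch of a code-generator processing this call sequence
while passing the first parameters in registers; the register
convention and the helper routines are invented:

    #include <stdio.h>

    #define REG_PARAMS 2       /* invented rule: first two in R0, R1 */

    static int next_param;     /* state carried from PROC to ENTER */

    static void op_proc(void) {       /* stack procedure descriptor */
        next_param = 0;               /* and claim R0..R1 for the call */
    }
    static void op_asspar(void) {
        if (next_param < REG_PARAMS)
            /* this parameter's descriptor names a register, so the
               ordinary ASSVAL/ASSREF machinery loads it directly */
            printf("  load parameter into R%d\n", next_param);
        else
            printf("  store parameter %d in the parameter area\n",
                   next_param);
        next_param++;
    }
    static void op_enter(void) {
        printf("  call; release R0..R%d\n", REG_PARAMS - 1);
    }

    int main(void) {           /* PROC P; {param; ASSPAR}*3; ENTER */
        op_proc();
        op_asspar(); op_asspar(); op_asspar();
        op_enter();
        return 0;
    }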
the same compiler.  In particular the parameter passing
mechanisms may be different from those used in the intermediate
code.  In order to cope with these and other considerations any
intermediate code which permits access to external procedures
must be sufficiently flexible to allow the variations to be
handled efficiently.

4.5.3 Permanent procedures

   Most languages define a set of procedures which will be
available on any implementation without explicit action by the
user (such as the IMP procedures ITOS, READSYMBOL, READ, and
REM).  Such procedures are termed "permanent procedures".
   It is common for intermediate codes to provide specific code
items to invoke permanent procedures, but this has the problem
that the code-generator must know about all such procedures,
and the language-dependent phase must be changed and the
intermediate code extended if an implementation wishes to make
efficient use of procedures which can be compiled in-line on
particular machines.  For example many machines provide an
instruction for moving blocks of store around, and it could be
advantageous to have a procedure which invoked this instruction
directly.
   Before investigating ways of improving the implementation of
permanent procedures it is useful to examine in some detail the
properties of the procedures mentioned above,
which were chosen because they typify the main problems in this
area.
   ITOS is a fairly complicated string function which returns
as its result the decimal character string representation of
the integer value passed to it as a parameter.  Because of its
complexity this procedure is almost always best implemented as
an external procedure which is linked into the program along
with any other external entities required.
   REM is an integer function which returns the remainder of
dividing the first integer parameter by the second, and on many
machines can be efficiently compiled in-line, as most integer
divide instructions provide both the quotient and the
remainder.  However, when compiling for machines such as the
Data General NOVA or the DEC PDP11 which do not have the
optional divide instructions, division has to be performed by a
complicated subroutine, suggesting that REM itself should be an
external procedure like ITOS.
   READSYMBOL falls somewhere between the two, mainly because
it is defined to have a general name parameter, that is, the
parameter may be a reference to any type of entity: integer,
real, byteinteger, etc.  To implement READSYMBOL as an external
procedure it would have to be passed the general name parameter
(comprising both the address of the actual parameter and
information about its type and precision), and
would have to interpret that parameter in order to be able to
store the character, suitably converted, in the appropriate
way.  A much more efficient implementation is to convert the
statement:
     -------------------
    |   READSYMBOL(S)   |
     -------------------
into the equivalent form:
     ----------------------
    |   S = F$READSYMBOL   |
     ----------------------
where F$READSYMBOL is a function which returns as its result
the character value that READSYMBOL would have placed into its
parameter.  Once this is done, conversions and the choice of
store operation can be left to the usual assignment part of the
compiler.  A further complication can arise if, as in the case
of the INTERDATA 7/16 operating system ISYS [Dewar, 1975],
several permanent procedures map directly onto system-provided
facilities: the function F$READSYMBOL can be replaced by the
supervisor call "SVC 8,0", SELECT INPUT by "SVC 6", etc.
   The difficulty caused by READ is mainly one of space.  As
READ can input an integer value, a real value, or a string
value depending on the type of its (general name type)
argument, it is going to be fairly large, especially if the
hardware on which it runs does not provide floating-point
instructions, forcing those functions to be performed by
subroutine.  It follows that on small systems it may be
convenient to replace calls on READ by calls on smaller
procedures, chosen at compile-time by examining the type of the
parameter given to READ, which input solely integer, real, or
string values.
   Finally it should be noted that the substitutions and
modifications discussed above may only be generated as
replacements for direct calls on the procedure; if the
procedure is passed as a parameter to another procedure no
alterations are possible and a "pure" version must be
available.  As passing a procedure as a parameter is totally
distinct from calling the procedure, this case does not prevent
the improvements being carried out where possible.
   It should now be clear that the efficient implementation of
permanent procedures will differ greatly from the
implementation of user-defined procedures, and will itself
differ from machine to machine.  Hence the intermediate code
must make no assumptions about either which permanent
procedures are available or how they are to be implemented.  As
a side-effect of removing any built-in properties from
permanent procedures it becomes possible for a simple
code-generator to ignore any possibility of producing special
code and to compile them all as externals.
   These transformations of procedures can only be applied when
the procedures are invoked (called) directly.  In the case of
procedures passed as parameters all calls will of necessity be
the same, and hence either it will not be possible to pass some
permanent procedures as parameters, an unfortunate limitation
imposed by several languages, or there must be a "pure" form of
the procedures available.  This latter can be done very simply
using I-code.  The permanent procedure descriptors are defined
exactly as if the procedures were truly external, but with an
extra marker showing them to be "permanent".  The only time
that this marker is used is in the procedure call generating
section of the compiler.  If the procedure is being passed as a
parameter this section of the compiler is not entered and so
the procedure will be passed as an external.  All that is now
necessary is for there to be an external manifestation
available when the program executes.  This method has the added
advantage that there is no compile-time overhead, especially
important considering that passing procedures as parameters is
one of the least-used features of IMP77.

4.5.4 Primitive procedures

   It is rare for machines to provide simple instructions which
can deal directly with all of the requirements of high-level
languages, and so several constructions will have to be handled
by subroutines.  The code generator may then refer to these
"primitive procedures" as though they were
machine instructions.  The cases in which such procedures are
required commonly include exponentiation, string manipulation,
and array declaration and access.
   Given these procedures the code-generator has a choice
between calling them as closed subroutines or expanding them
in-line.  The former produces dense code but will execute more
slowly than the latter (and possibly suffer from not knowing
what is corrupted by the routine and therefore having to forget
everything it knows).  On the other hand, while the expansion
of primitive procedures in-line will improve the execution
speed of the program, it becomes necessary for the
code-generator to be able to create the appropriate code
sequences, and it thereby becomes more bulky.  Once again the
choice must be left to the code-generator, as the benefits of a
particular decision will depend on both the target machine and
the use to which the compiler is to be put.  If the compiler is
to be used for large mathematical problems it is likely that
the gains made by putting exponentiation in-line will outweigh
the disadvantage of the extra code size, whereas in
operating-system work, where exponentiation is probably never
needed, the extra complexity of the code generator needed to
expand the routine would not be desirable.
   Given that some of the primitive procedures will be
referenced often (checked array access, for example) it is
important that entry to them is made as efficient as possible,
and in this area the ability to reorder code can be used to
great effect.
   In   the  original  Interdata  7/32  IMP77  compiler  the
primitive routines were gathered together at the end of  the
user's  code,  as  it  was only then that it was known which
procedures were required.
    ---------
   |         | <- CODE BASE (register 15)
   |   USER  |
   |   CODE  |
   |         |
    ---------
   |         |
   |   PRIM  |
   |  PROCS  |
   |         |
    ---------
With this scheme programs of 16Kbytes or less can  reference
the    primitive   procedures   with   32-bit   instructions
(program-counter relative addressing).   Unfortunately  once
the  program  grew  beyond  this limit the larger and slower
48-bit form of the instructions had to be used in  order  to
achieve  addressability.   In the IMP77 code generator there
were 352 such large instructions.
In the new compiler the object code is  reordered  to  place
the  primitive  procedures  at  the  head of the user's code
where they can be addressed relative to CODE BASE.
    ---------
   |         | <- CODE BASE (register 15)
   |   PRIM  |
   |  PROCS  |
   |         |
    ---------
   |         |
   |   USER  |
   |   CODE  |
   |         |
    ---------
The immediate disadvantage of this is that it will push  the
user's  procedures  further  away  from  CODE BASE and hence
increase the chances of a user procedure reference requiring
a long (48-bit) instruction.   However in practice  this  is
not  a problem as the total size of the primitive procedures
is usually quite small, typically less than 800 bytes on the
7/32.  The IMP77 code generator mentioned above now needs no
long references at all, saving 724 bytes  of  code,  out  of
about 40Kbytes.  The compression of the code so achieved can
be  enhanced  slightly  by bringing the destinations of more
jumps into the short-jump range, giving an extra  saving  of
20 bytes in the case above.   In addition, now that a register
(CODE BASE) is pointing to the first primitive procedure the
list of procedures required can be reordered  to  place  the
most  frequently  referenced  one  first  and thereby reduce
references     to     it     to     16-bit      instructions
(BALR LINK,CODEBASE).
When compiling with checks on, by far the most commonly referenced primitive procedure is the routine which checks for the use of an unassigned variable (over 2000 references to it in the code generator), and this trivial optimisation results in a saving of more than 4000 bytes.
4.6 Language-specified and compiler-generated objects

   During compilation various objects will be manipulated in
order to generate code.  Some of these objects have a direct
representation in the source program and are referred to as
"language-specified" objects, whereas others are created by the
compilation process itself and are referred to as
"compiler-generated" objects.  The fact that the
compiler-generated objects will be (or can be constrained to
be) used in a stereotyped and well-behaved fashion can be used
to great advantage to give simple means for optimising parts of
the program.

4.6.1 Internal labels

   Using most intermediate codes the following program parts
would translate into effectively identical sequences:
     ---------------------------------------------
    |   ->LAB if X = 0    |   if X # 0 start      |
    |   Y = 3             |      Y = 3            |
    |   LAB:              |   finish              |
     ---------------------------------------------
At first glance this is as it should be, for the two program
fragments are semantically identical and could therefore be
implemented by the same object code, for example on the
PERKIN-ELMER 3200:
     -----------------
    |   L    1,X      |  pick up X and set the condition code
    |   BZ   $1       |  branch if equal (to zero)
    |   LIS  0,3      |  pick up 3
    |   ST   0,Y      |  store it in Y
    | $1:             |  define label $1
     -----------------
   However, if it is known that the label $1 will only ever be
used once, the code-generator may remember that the current
value of the variable X will still be in register 1 following
the label, and thus remove the need for it to be loaded again
if it is required before register 1 is altered.  In the case of
user-defined labels no statement can be made about the number
of uses of each label without a complete analysis of the parts
of the program where the label is in scope.  This suggests that
I-code should maintain a clear distinction between user-defined
and compiler-generated labels.  Also, by making the rule that
compiler-generated labels may only be used once, the internal
representations of labels may be reused by the code-generator,
removing the necessity for large tables of label definitions in
this phase of compilation.
   This leaves the question of how to represent conditional
jumps in the intermediate code.  The first observation is that
user-specified jumps need never be conditional, as they can
always be surrounded by appropriate compiler-generated
conditional jumps.  This can be used to restrict the processing
of conditions and tests to the compiler-generated jumps.  The
second observation is that in IMP77 conditionals are always
associated with the comparison of two values or the testing of
an implied boolean variable (predicates and string resolution).
There are currently three main ways in which processors handle
this:
    1    "compare" instructions are used  to  set  flags  or
         condition-codes  which  represent  the relationship
         between two values (one of which is  frequently  an
         implied value of zero).   These condition-codes are
         later used to control the execution of  conditional
         branch  instructions.   This  method is used in the
          PDP11 (CMP, BNE, etc.).
    2    Instructions are provided which compare two  values
         as  above  but  instead  of setting condition-codes
         they  skip  one  or  more  subsequent  instructions
         depending on a specified relationship.  By skipping
         unconditional  branches  in  this  way  conditional
         branch sequences may be generated.   This method is
         used in the PDP10: SKIPE etc.
    3    Instructions  are provided which compare two values
         and  branch  to  a  specified  label  if  a   given
         relationship  holds.   This  method  is used in the
         PDP10: JUMPNE etc.
         
   P-code uses compare instructions to set the boolean value
TRUE or FALSE on the stack and then uses this  value  either
as  an  operand in an expression or to condition a branch (a
variant of technique 1 above).
Z-code tests the  value  in  a  register  against  zero  and
branches accordingly (technique 3 above).
   These  three  techniques  have  fairly  obvious  possible
representations in I-code:
      if X = Y start
      1)   PUSH   X
           PUSH   Y
           COMP            {set condition code}
           BNE    1        {branch not equal}
      2)   PUSH   X
           PUSH   Y
           SKIPE           {compare and skip if equal}
           GOTO   1
      3)   PUSH   X
           PUSH   Y
           JUMP # 1        {compare and branch if not equal}
All three  of  these  representations  have  been  tried  in
different versions of I-code.
Technique 2) was rejected as it proved cumbersome to implement
effectively, especially on machines which did not use skips;
either the code-generator had to "look ahead" to be able to
locate the destination of the skip (which is dependent on the
instruction being skipped) or it had to check before each
instruction whether or not a skip had been processed earlier
whose destination had not yet been resolved.
   Technique 1) was perfect for machines with condition-codes
but required look-ahead over subsequent jumps on machines which
used skips.
   Both 1) and 2) had the additional problem that to generate
conditional branches two separate I-code instructions had to be
given.  In the case of 1), condition-codes are usually altered
by many instructions not directly involved in comparison, and
hence the compare and its associated branch must be kept
adjacent.  With 2) there is the possibility of generating
meaningless constructions, such as skipping a line-number
definition instruction.  These difficulties add complexity to
the definition of the intermediate code and require extra
checks in the code generator.
   Thus the third form was chosen as the most convenient, even
though all three forms can be suitably defined to be totally
equivalent.  In particular the third technique provides all the
relevant information to the code-generator in one instruction,
and has proved to be simple and effective as a basis for
generating code for both condition-code and skip sequences.
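The convenience of form 3) can be seen in a C sketch of the two
expansions; the mnemonics are only indicative of the PDP11 and
PDP10 styles:

    #include <stdio.h>

    enum cond { EQ, NE, LT, GE, LE, GT };

    /* Condition-code machine: compare, then branch on the condition. */
    void jump_cc(enum cond c, int lab) {
        static const char *b[] = {"BEQ","BNE","BLT","BGE","BLE","BGT"};
        printf("  CMP  top,second\n  %s  L%d\n", b[c], lab);
    }

    /* Skip machine: skip on the inverse condition over an
       unconditional jump to the label. */
    void jump_skip(enum cond c, int lab) {
        static const char *s[] = {"SKIPN","SKIPE","SKIPGE",
                                  "SKIPL","SKIPG","SKIPLE"};
        printf("  %s  top,second\n  JRST L%d\n", s[c], lab);
    }
    /* Either routine is driven by the single I-code instruction
       "JUMP <cond> <label>"; no look-ahead is needed in either case. */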
   Using these ideas the following is the expansion of the
statements given at the start of section 4.6.1:
     ----------------        -----------------
    |   PUSH   X     |      |   PUSH   X      |
    |   PUSHI  0     |      |   PUSHI  0      |
    |   COMP   #  1  |      |   COMP   =  1   |
    |   JUMP   LAB   |      |                 |
    |   LOCATE 1     |      |                 |
    |   PUSH   Y     |      |   PUSH   Y      |
    |   PUSHI  3     |      |   PUSHI  3      |
    |   ASSVAL       |      |   ASSVAL        |
    |   LABEL  LAB   |      |   LOCATE 1      |
     ----------------        -----------------

4.6.2 Temporary objects

   During the compilation of high-level languages it often
becomes necessary to create temporary objects which are not
present in the source program.  The most common need for
temporaries is in the evaluation of expressions.  Regardless of
the number of accumulators or registers available it is always
possible to construct an expression which will require one
more.  To obtain this register, a register currently in use
must be selected and the value currently in it must be saved in
a temporary location.  One apparent exception to this is a
machine in which expressions are evaluated using a stack (e.g.
the ICL 2900), but in this case the operands are always in
temporaries.
Temporary  variables  may  also  be  required  to  implement
certain  high-level  constructions,  such  as  the  IMP  for
statement:
    -----------------------
   | for V = A, B, C cycle |
    -----------------------
which is defined so that the initial values of B and C,  and
the  initial  address  of the control variable, V, are to be
used to control the loop regardless of any assignments to V,
B and C.   While it is possible  for  a  machine-independent
optimiser  to  discover whether these variables are modified
in the  loop  or  not,  in  the  simple  case  where  little
optimisation   is  required  the  code  generator  must  use
temporaries.
In the case of expression evaluation, however,  the  machine
independent  phase  cannot know how many temporaries will be
required.   Even giving the first  phase  knowledge  of  the
number  of  registers  available is not adequate for several
reasons.   Firstly, the use of registers is commonly tied to
the  operations  being  performed, as in the case of integer
multiply on  several  machines  which  requires  a  pair  of
registers.  For a machine-independent first phase to be able
to  cope  with  this  sort of limitation would require great
flexibility of parameterisation.
Secondly, the first phase would have to be given details  of
the problems encountered in statements such as:
    ----------------------------
   | LEFT = REM(A,5) + REM(B,7) |
    ----------------------------
On a PDP11 equipped with the EIS option a divide instruction
is  available  which  provides  both  the  quotient  and the
remainder.  Hence the statement could be compiled into:
    -----------------
    |   MOV   A,R1    |
    |   SXT   R0      | propagate the sign of A
    |   DIV   #5,R0   | remainder to R1
    |   MOV   B,R3    |
    |   SXT   R2      |
    |   DIV   #7,R2   | remainder to R3
    |   ADD   R3,R1   |
    |   MOV   R1,LEFT |
    -----------------
In this case no  temporary  store  locations  are  required.
However, if the EIS option is not present no DIV instruction
is  present  and so a subroutine must be used instead.   The
code becomes:
    -----------------
   |   MOV   A,R1    |
   |   MOV   #5,R2   |
   |   JSR   PC,DIV  | result back in R1
   |   MOV   R1,T1   | preserve remainder
   |   MOV   B,R1    |
   |   MOV   #7,R2   |
   |   JSR   PC,DIV  | result in R1
   |   ADD   T1,R1   |
    |   MOV   R1,LEFT |
    -----------------
As the subroutine REM uses R1 (for one of its arguments  and
to  return  its  result) the result of the first call on REM
must be saved in a temporary, T1.   Of course, the  function
REM could be written so as to preserve the value in, say, R2
and this could be used instead of T1, but this would increase
the cost of REM when it is likely that the value in R2 will not
be of use, as most expressions are trivial [Knuth, 1971].
   Unless the machine-independent phase is given intimate
knowledge of the target machine (something of a contradiction)
it cannot know how many temporaries to use nor when to use
them.  The solution adopted by most intermediate codes is to
base the code around a stack, thus providing an unlimited
number of temporaries which are handled automatically.  While
this in itself does not hinder the compilation for a machine
without a hardware stack, as the code-generator can always
simulate the stack internally, its presence invariably results
in other parts of the code using it, for example to pass
parameters to procedures where the receiving procedure contains
built-in knowledge of the layout of the stack.
   As a stack does not require the explicit mention of
temporaries it has been adopted by I-code, but purely as a
descriptive mechanism.  Because I-code does not specify the
computation but the compilation process needed to produce a
program which will perform the computation, this internal stack
need have no existence when the final program executes.
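A C sketch of such a purely compile-time stack (all structures
and helpers invented): the code-generator pushes descriptors of
values, not the values themselves, and emits instructions only
when an operation forces it to.

    #include <stdio.h>

    enum where { IN_MEMORY, IN_REGISTER, CONSTANT };
    struct desc { enum where where; int reg, value, addr; };

    static struct desc stk[64];
    static int sp;

    static void push_var(int addr) {
        stk[sp++] = (struct desc){ IN_MEMORY, 0, 0, addr };
    }
    static void push_const(int v) {
        stk[sp++] = (struct desc){ CONSTANT, 0, v, 0 };
    }
    static int into_register(struct desc *d) {   /* emit a load */
        if (d->where != IN_REGISTER) {
            printf("  load R1,<operand>\n");     /* register choice */
            *d = (struct desc){ IN_REGISTER, 1, 0, 0 }; /* elided    */
        }
        return d->reg;
    }

    static void op_add(void) {
        struct desc b = stk[--sp], *a = &stk[sp - 1];
        if (a->where == CONSTANT && b.where == CONSTANT)
            a->value += b.value;     /* folded: no code at all */
        else {
            int r = into_register(a);
            printf("  add R%d,<operand>\n", r);
        }
    }

    int main(void) {                 /* "compile" X + 3 */
        push_var(0); push_const(3); op_add();
        return 0;
    }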
The implementors of SIMPL-T describe an intermediate code with some properties similar to I-code, but based on "quadruples" of operators and operands rather than an internal stack [Basili, 1975]. The stack approach was rejected by them because "quads allow more flexibility in the design of the code generator since, for example, no stack is required". The exact meaning of this is not clear but it suggests the misconception that a stack-based intermediate code forces a stack-based object code representation. Regardless of the exact structure of the code generator or the input it takes, some form of internal stack is invariably required for operations such as protecting intermediate values in registers which are needed for other purposes, and it seems reasonable to make this stack more explicit if so doing will simplify the intermediate code and its processing.
   The  third  phase  takes  as  its  input two data streams
generated by the second phase.  These streams are:
    i    The object stream, a sequence of items of the form:
         <type> <value>*   defining   the   code   sequences
         required in the object file.
    ii   The  directive stream, a sequence of items defining
         the logical structure of the object stream, that is
         a specification  of  label  definitions  and  label
         references,  and  details of various code groupings
         (blocks, procedures etc.).
The third phase starts by taking in the directive stream and
constructing a linkage map  describing  the  whole  program.
This  linkage  map is processed and then used to control the
generation of the final object file from the object  stream.
The operations performed using the map are:
4.7.1 Reordering
   As  discussed  previously,  there are several gains to be
made by having the ability  to  output  instructions  in  an
order  different from that in which they were implied by the
linear structure of the source program.   This reordering is
performed on the linkage map in a manner controlled by items
in  the  directive  stream.   In  the  most  simple  case of
exbedding  procedures  (section  4.3.1)  this  only  entails
allocating code addresses to the items in the map each time an
"end-of-block" control item is input, resulting in the
procedures being laid out in "end" order.  To facilitate
evaluating references to the reordered areas, all references in
the object stream are made relative to the start of the
appropriate area.  As this process does not cause the physical
moving of the various areas there is an implicit assumption
that either the subsequent processing of the object stream can
do the reordering (for example by writing its output to
specific sections of a direct-access file), or that the object
file format can instruct the loader or linker to do the
shuffling.
   With the linkage map available it becomes possible to make a
preliminary pass over the object stream performing structural
modifications which require knowledge of the generated code and
which alter its size and general appearance.  These
modifications may be made by passing the object stream through
a buffer which is scanned and modified under the control of the
linkage table.  In this way merging common code sequences and
reordering the arms of conditional sequences may be achieved
quite simply.
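A C sketch of the "end"-order allocation (structures invented):
each block's code address is assigned when its end-of-block
item arrives, so an inner procedure is placed, and becomes
addressable, before the procedure containing it.

    #define MAXBLOCKS 64

    struct block { int size, base; };

    static struct block blocks[MAXBLOCKS];
    static int nblocks;
    static int next_free;    /* next unallocated code address */

    /* Called for each "end-of-block" item in the directive stream;
       size is the amount of code the block generated.  Inner blocks
       end first and so are laid out first, giving "end" order with
       no jumps round nested procedures. */
    int end_of_block(int size) {
        blocks[nblocks].size = size;
        blocks[nblocks].base = next_free;
        next_free += size;
        return nblocks++;    /* area number used by relative references */
    }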
            In typical programs the frequency of  occurrence
         of such jumps is:
                           PDP11    PE3200
                          ------------------
                2 byte   |   88%      28%   |
                4 byte   |    8%      71%   |
                6 byte   |    2%      <1%   |
                          ------------------
            It  has  been  suggested  [Brown, 1977] that the
         problem of deciding which form of jump to  use  can
         be  eased  on  certain  machines  by  specifying  a
         "distance" parameter with  the  intermediate  code,
         e.g.  "GOTO LAB,80"  informing  the  code generator
         that the label LAB is 80 instructions ahead.
         It is difficult to think of any case in which  this
         could  be  of  any  use  as  it  requires  the code
         generator to be  able  to  predict  the  amount  of
         target  machine-code  which  will  be generated for
         each intermediate code instruction.
            The solution adopted by the  IMP  compilers  has
         been  for  the  code  generator  to assume that all
         jumps are the minimum size, and to  let  the  third
         phase stretch them where necessary.
         The  Perkin-Elmer  CAL  assembler [Interdata, 1974]
         makes the opposite assumption,  namely  that  jumps
         are long until proven short.   This was rejected as
         the size of one jump is often dependent on another,
         so that one of them will be short if  and  only  if
         both  of  them  are  short.   By assuming them long
         either they will never be found to be short, or the
         process  will  have  to  examine  all   the   jumps
         repeatedly  trying  each  jump in turn to see if it
         can be "squeezed".   Commonly enabling the  "SQUEZ"
         option  in  the  CAL assembler can double or treble
         the time to assemble programs.
          With the assumption that all jumps start short  and
          then grow, all truly short jumps will be found with
          no  possibility  of  infinite loops, as the process
          must terminate, in the  worst  case  when  all  the
          jumps have been made long.
            Several  methods for achieving this optimisation
         have been  described  [Szymanski,  1978;  Williams,
         1978].
         The  technique used by the third phase of the IMP77
         compilers for stretching jumps is as follows.
         Once the  linkage  map  has  been  constructed  and
         addresses  provisionally  allocated, all labels and
         references to them are  grouped  according  to  the
         block  in  which  they  occurred.   This is to take
         advantage of the fact that most references will  be
         local.   A  procedure  STRETCH is now defined which
         repeatedly  attempts  to  lengthen  each  reference
         within a particular group.  If a reference is found
         which  must  be stretched, the entry in the linkage
         map is  updated  and  all  subsequent  entries  are
         suitably  modified to take account of the increased
         size of the code.  The process is repeated until no
         alterations have been made.
         STRETCH is first called  once  for  each  group  of
         references  in  the program.   This "local stretch"
         commonly resolves up to 80% of the  references.   A
         final  call  on  STRETCH  is then made with all the
         references lumped together as one group in order to
         resolve references between blocks,  and  any  local
         references  which,  although processed by the local
         stretch, have become invalidated by changes made by
         the "global stretch".
            The use of a local and a global  stretch  has  a
         considerable  effect  on  the  performance  of  the
          compiler: if the calls on "local stretch" are taken
         out "global stretch" has to  do  all  the  work  in
         ignorance  of  the  block-structure  of the labels.
         This involves repeated searching  of  the  complete
         label  and reference lists in order that changes in
         the position of these items may  be  recorded.   On
         the  Interdata  7/32  this increases the stretching
         time for 1968 branches from 2.3 seconds  out  of  a
         total  compilation  time  of  146  seconds up to 35
         seconds!
         The time taken to perform the stretching using both
         local and global stretch is on average just over 1%
         of the total compilation time  excluding  the  time
         for input and output.
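             The heart of this relaxation can be sketched in
          C as follows; the names, the two-size model and the
          range of 127 bytes are illustrative only, and not
          taken from the IMP77 phases (which are written in
          IMP77).   A reference is stretched at most once, so
          in the worst case the process stops with every jump
          long.

              /* "Start short, grow": stretch any short jump whose
                 target lies out of range, shift the provisional
                 addresses of everything after it, and repeat until
                 no jump needs to grow. */
              enum { SHORT_SIZE = 2, LONG_SIZE = 4, SHORT_RANGE = 127 };

              typedef struct { int addr; int label; int size; } Ref;

              static void stretch(Ref *refs, int nrefs,
                                  int *label_addr, int nlabels)
              {
                  int changed = 1;
                  while (changed) {
                      changed = 0;
                      for (int i = 0; i < nrefs; i++) {
                          int dist = label_addr[refs[i].label] - refs[i].addr;
                          if (dist < 0) dist = -dist;
                          if (refs[i].size == SHORT_SIZE && dist > SHORT_RANGE) {
                              int growth = LONG_SIZE - SHORT_SIZE;
                              refs[i].size = LONG_SIZE;
                              /* everything after this jump moves down */
                              for (int j = 0; j < nrefs; j++)
                                  if (refs[j].addr > refs[i].addr)
                                      refs[j].addr += growth;
                              for (int k = 0; k < nlabels; k++)
                                  if (label_addr[k] > refs[i].addr)
                                      label_addr[k] += growth;
                              changed = 1;
                          }
                      }
                  }
              }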
          Wulf et al. describe an optimisation on the PDP11
          which attempts to shorten otherwise long
          conditional jumps by making them jump to suitably
          placed jumps with the same destination, as this is
          smaller and faster than the six-byte sequence
          which would be generated by default [Wulf, 1975].   This
         was  tried  but  eventually  removed from the PDP11
         compiler as finding suitable jumps  was  a  tedious
         task and of the average 2% of jumps which were long
         in  compiling many programs only one case was found
         where the optimisation could be applied.  That case
         was in a program specially constructed to test  the
         optimisation.
            At the same time that jumps and labels are being
         processed,  certain  operations which depend on the
         flow of control may be inserted into the code.
         The GEC  4080  provides  a  good  example  of  this
         problem which can be handled elegantly by the third
         phase.
         The  machine provides arithmetic instructions which
         take either fixed point or floating point  operands
         depending  on  the state of a processor status bit.
         This bit must be altered by  the  instructions  SIM
         (Set  Integer  Mode)  and  SFM (Set Floating Mode).
         During code generation when a label is  encountered
         the  state of the status bit will not in general be
         known, and so a suitable mode switching instruction
          will need to be planted; frequently this
         instruction will be redundant.   Given the presence
         of the third phase, the second phase  merely  needs
         to  mark  jumps  with the current state of the bit,
          and to mark labels with the required state (and
          the previous state of the bit if control can "fall
          through" past the label).   During the process of
         expanding jumps these mark bits can be checked.  If
         all references to a label have the  same  mode,  no
         action  needs  to  be taken, but if the bits differ
         the appropriate instruction must be added.   As  an
         extra  improvement  if  only one jump to a label is
         from the wrong mode, the mode switching instruction
         can be planted before that jump rather  than  after
         its  destination label, so shortening the execution
         paths when no change is required.
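             The check made while expanding jumps can be
          sketched as follows (in C, with invented names; a
          fall-through entry to the label, where applicable,
          simply counts as one more reference):

              /* Each jump to a label records the mode in force
                 when it was generated. */
              enum Mode { INTEGER_MODE, FLOAT_MODE };

              typedef struct { int from_addr; enum Mode mode; } JumpRef;

              /* 0  => no action;   1 => plant SIM/SFM before that
                 one jump;   >1 => plant it after the label.      */
              static int wrong_mode_count(const JumpRef *jumps, int n,
                                          enum Mode required)
              {
                  int wrong = 0;
                  for (int i = 0; i < n; i++)
                      if (jumps[i].mode != required)
                          wrong++;
                  return wrong;
              }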
    ii   Conflating jumps to jumps.
         Nested   conditional   structures   in   high-level
         languages  often  generate jumps which take control
         directly to another jump.   If the second jump  can
         be  shown always to be taken whenever the first is,
          the former can be redefined as jumping directly to
          the destination of the latter.
            e.g.   -----------------------------------
                  | while N > 0 cycle                 |
                  |    N = N-1                        |
                  |    if N > 5 then TEST1 else TEST2 |
                  | repeat                            |
                   -----------------------------------
         In  this  program  following  the call on TEST1 the
         else  causes  a  jump  to  be  taken  to  the  next
         statement  (repeat).   This  statement  is simply a
         jump back to the previous cycle.
         Hence the following code can be generated (3220):
                   -------------------
                  | $1:  L   1,N      |
                  |      BLE $3       |
                  |      SIS 1,1      |
                  |      ST  1,N      |
                  |      CHI 1,5      |
                  |      BLE $2       |
                  |      BAL 15,TEST1 |
                  |      B   $1       |
                  | $3:  BAL 15,TEST2 |
                  |      B   $1       |
                   -------------------
         The  danger  with  this  optimisation  is  that  an
         otherwise short jump can be expanded to a long jump
         as the following program demonstrates:
                ----------------------
               |   if X = 1 start     |
               |      if Y = 1 start  |
               |         {A}          |
               |      else            |
               |         {B}          |
               |      finish          |
               |   else               |
               |      {C}             |
               |   finish             |
                ----------------------
          The else following the sequence {A} causes a jump
          to the next else which jumps past the finish.   In
          that form the first jump only has to skip {B} and
         is likely to be a short jump.   If it  is  made  to
         jump  directly to the second finish it has to cover
         {B} and {C}, so reducing the chances of  its  being
         short.
         Equally,  the position can be reversed resulting in
         the optimised jump being short  when  the  original
         was  long.   If  this problem is considered serious
         the third phase can check the sort  of  jump  which
         would be generated and act accordingly.
    iii  Removal of jumps round jumps.
         Statements such as:
                ------------------
               | ->LABEL if X = Y |
                ------------------
          are common, either in the explicit form as given
         above or in some higher-level  representation  such
         as:
                ---------------
               | exit if X = Y |
                ---------------
         The  simple  code sequence generated for this would
         be similar to (3220):
                --------------
               |    L   1,X   |  pick up X
               |    C   1,Y   |  compare with Y
               |    BNE $1    | branch not equal
               |    B   LABEL | jump to LABEL
               | $1:          |
                --------------
         by combining the  two  branches  the  code  can  be
         reduced to:
                --------------
               |    L   1,X   |
               |    C   1,Y   |
               |    BE  LABEL |
                --------------
          While it is possible for the code generator to do
          this immediately it was decided to leave the
          optimisation to the third phase for four reasons
          (a sketch of the third-phase inspection follows
          the list):
         1    The  third phase can perform this optimisation
              simply,   almost   as   a    side-effect    of
              constructing the linkage map.
          2    There are several cases where the optimisation
              can be extended in ways which would be awkward
              for  the  second  phase  to  deal  with.    In
              particular, it would have either to look ahead
              or to be able to modify code sequences already
              generated.   With  a third phase, however, the
              optimisation  reduces  to  a   straightforward
              inspection of the linkage map.  For example:
                  ------------------
                 |    exit if X = Y |
                 | repeat           |
                  ------------------
              in  which case the optimisation may be applied
              twice to reduce the code to two instructions.
         3    Leaving the  optimisation  to  a  later  phase
              simplifies  the second phase which is the most
              complicated part of the compiler.
         4    On several machines if the destination of  the
              jump  is too far away the original "jump round
              a jump" may be the only form  available  (e.g.
              PDP11).   The  distance to be jumped will only
              be known exactly when  all  labels  have  been
              processed.
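             A sketch of the inspection, in C with invented
          names and a deliberately simplified instruction
          stream, is:

              typedef enum { BR_COND, BR_UNCOND, OTHER } Kind;

              typedef struct {
                  Kind kind;
                  int  cond;    /* complement assumed at cond ^ 1  */
                  int  target;  /* destination label index, or -1  */
                  int  label;   /* label defined here, or -1       */
              } Instr;

              /* Bcc $1; B LABEL; $1: ...  =>  B!cc LABEL; $1: ... */
              static void conflate(Instr *code, int n)
              {
                  for (int i = 0; i + 2 < n; i++) {
                      if (code[i].kind   == BR_COND   &&
                          code[i+1].kind == BR_UNCOND &&
                          code[i].target >= 0         &&
                          code[i+2].label == code[i].target) {
                          code[i].cond  ^= 1;             /* invert */
                          code[i].target = code[i+1].target;
                          code[i+1].kind = OTHER;         /* delete */
                      }
                  }
              }

          On machines where the long form is unavoidable
          (reason 4 above) the rewrite is simply suppressed
          when the stretched distance demands it.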
4.7.5 In-line constants
             
   When compiling for machines such as the Data General NOVA
which   have  a  limited  direct  addressing  range  and  no
full-length immediate operands, it is  useful  if  constants
can  be  planted  in  the  code  sequence  and  addressed as
program-counter-relative operands.   The simplest technique
for doing this is for the code generator to maintain a list
of required constants and to dump them in-line at a suitable
opportunity before the  limit  of  addressability  has  been
exceeded.   Such  constants  will  need to be protected from
being executed and so will need to have a jump round them
or will have to be planted in a "hole" in the code, that is
between an unconditional jump and the next label.   As holes
occur frequently in high-level languages (for example
following every else or repeat) and do not require extra
code to be planted round the constants, they are the
preferred position for the constants.
   In order to minimise the number of constants planted it
is necessary to delay the dumping of them until the last
possible moment, placing them as near as possible to the
forward limit of addressability of the first outstanding
reference.   This increases the chance of a subsequent
reference to the constant being able to address the previous
location.   This poses problems if the second phase is to
handle the constants, as it cannot know the optimum position
for the constants in advance of producing the code
(especially if the code is to be reordered).   A convenient
solution is to utilise the linkage table in the third phase
and include in it references to constants and the locations
of holes and "forced" holes, that is places where an extra
jump is required.   Following the initial resolution of
jumps (4.7.?) the list of constants can be examined and
holes allocated.   The labels are processed again to take
account of the extra code and any alignment limitations.
During the processing of the object stream the constants
are infiltrated into the object file.
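   The bookkeeping can be sketched in C as follows; the
names and the range of 255 bytes are illustrative, and
alignment is ignored:

        /* Each pending constant records the address of its
           first outstanding reference; it must be planted
           before that reference passes out of range. */
        enum { MAX_RANGE = 255 };  /* assumed PC-relative range */

        typedef struct { int value; int first_ref; int planted; } Lit;

        /* Called at each hole: plant every constant for which
           waiting until the next expected hole would take its
           first reference out of range. */
        static int dump_due(Lit *pool, int n, int next_hole)
        {
            int dumped = 0;
            for (int i = 0; i < n; i++) {
                if (!pool[i].planted &&
                    pool[i].first_ref + MAX_RANGE < next_hole) {
                    pool[i].planted = 1;    /* last chance */
                    dumped++;
                }
            }
            return dumped;
        }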
             5 Review of the overall structure
           
5.1 Division of function
   The division of  the  machine-dependent  phase  into  two
parts was motivated by three main considerations:
    i    to   localise  the  changes  necessary  to  produce
         different object-file formats,
    ii   to permit the reordering of sections of the code,
    iii  to enable the production of  short  jumps  whenever
         possible.
In  addition  it  turns  out that on all of the machines for
which this technique has currently been applied points  (ii)
and (iii) can be handled by almost identical pieces of code,
making  this  phase  of compilation machine-independent to a
large extent and therefore easing the task of  creating  new
compilers.
   Against  this  must  be  set  the  overheads  incurred by
separating the compilation into  two  parts  which  have  to
communicate.   The  interface  between  phases two and three
comprises the object file and the directive  file,  and  the
third phase needs to process the whole of the directive file
before starting to look at the object file.
The ways in which these 'files' will be implemented, and
consequently the cost of the communication, will in general
vary from system to system.   If large virtual memories are
available the data may be held in memory as mapped files or
arrays and accessed much more efficiently than on simpler
systems using the conventional approach of 'true' files
with their more cumbersome transfer operations.

5.2 Testing and development
   Although the initial reason for choosing a multi-phase
approach to compiling was that of simplifying the generation
of new compilers, an extra advantage arose in that the task
of checking the compilers so produced and diagnosing faults
in them was very much simplified.   This was because of two
features of the technique.   Firstly, the programs
corresponding to the phases were of manageable size, varying
from about one thousand statements up to three thousand
statements.   Secondly, the phases communicated with each
other using well-defined interfaces which could be monitored
to narrow down errors to a particular phase and even to
specific parts of that phase.
   In addition, as the structure of the intermediate code
inevitably suggests the general techniques to apply in code
generation, many of the complete compilers on different
machines had great similarities; usually only the lowest
levels of code production and machine-specific  optimisation
were appreciably different.
This gave rise to three convenient properties with regard to
testing and development:
    i    An  error  in  one  compiler  will  frequently give
         notice of similar faults in others.   Clearly,  any
         faults in the common first phase will be present in
         all  the  compilers and only one correction will be
         required.
    ii   An improvement in the performance of one  compiler,
         or  the  code  it  generates,  can  suggest similar
         improvements in others.
    iii  The third effect on reflection  seems  obvious  yet
         was noted with some surprise.  The systems on which
         most  of  the  investigation  was done are run with
         very  different  operating  systems  and  used   by
         different   types  of  user.    These  two  factors
         together caused  a  great  spread  in  the  demands
         placed  upon  the compiler, resulting in more parts
         of the compiler being thoroughly tested than  would
         happen when running on one particular system, where
         users  tend  to be more stereotyped.   Questions of
         "proper practice" aside, it is a fact of life  that
         all  software  gets  a  better testing in the field
         than at the hands of its creator.
5.3 Diagnostics
   As mentioned previously, optimisation is not just a
process of improving the storage requirements and speed of
a program but also involves fitting a program into the
overall framework of the run-time environment.   In many
applications the provision of extensive run-time checks and
post-mortem traces can be of great importance.   The
ability to generate such diagnostic code has certain
implications for the features in the intermediate code.

5.3.1 Line numbers
   When producing information about the state of a
computation, whether it be an error report following a
run-time fault or an execution trace [Satterthwaite, 1972],
the data must be presented in a form which is meaningful to
the user in terms of the source program.   The
commonly-provided dump of the machine state, registers,
code addresses etc., is a complete failure in this respect,
as the correspondence between this and the program state
depends on the workings of the compiler and other factors
of which the user should not need to be aware.   The
simplest way of specifying the point of interest in a
program is to give its line number.   There are two common
techniques for providing line number information at
run-time, the choice of which depends on the uses to which
the compiler is to be put.
   The first is to plant instructions which dynamically
update a variable with the current line number whenever it
changes.   This has the significant advantages that it is
extremely cheap to implement and the line number is always
immediately available.   Its obvious disadvantages are that
it increases the execution time for the program, and more
significantly it increases the size of the program,
typically by about 6K bytes on the Interdata 7/32 for a
1000 line program, approximately a 50% increase.
   The second technique is to build a table giving the
correspondence between line numbers and the addresses of
the associated code sequences.   While this imposes a
greater burden on the compiler and takes more time to
extract the line number it has the advantage that it does
not increase the code size of the program, nor does it
alter its execution speed.   Indeed it may even be possible
to keep the table out of main memory until it is required.
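   The table technique reduces at run-time to a search for
the last entry not beyond the faulting address; a sketch in
C (names invented) is:

        /* Entries pair the start address of the code for a
           line with the line number, sorted by address. */
        typedef struct { unsigned addr; int line; } LineEntry;

        static int line_of(const LineEntry *t, int n, unsigned pc)
        {
            int lo = 0, hi = n - 1, best = -1;
            while (lo <= hi) {                  /* binary search */
                int mid = lo + (hi - lo) / 2;
                if (t[mid].addr <= pc) { best = t[mid].line; lo = mid + 1; }
                else                   { hi = mid - 1; }
            }
            return best;   /* -1: pc precedes all recorded lines */
        }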
The choice  of  technique  will  have  implications  on  the
compiled  code.   If  the line number table approach is used
error procedures must have  available  the  address  of  the
point of the error.   The effects of this can be seen in the
following  example  of  the  sort  of  code  generated   for
unassigned variable checking on the 7/32 using both methods:
    -------------
   | 17  X = Y   |
    -------------
    ------------- -------------
   | LHI  0,17   |             | update line no
   | ST   0,LINE |             |
   | L    1,Y    |   L    1,Y  |
   | C    1,UV   |   C    1,UV | check value
   | BE   ERROR  |             | give the error
   |             |   BAL  8,TU | test for error
   | ST   1,X    |   ST   1,X  |
   |             |   ..   ..   |
   |             |   ..   ..   |
   |             |TU:BNER 8    | return if OK
   |             |   B    ERROR| give the error
    ------------- -------------
As the generated code depends on the method in use it cannot
be specified in the intermediate code and so the latter must
simply  indicate the points in the program at which the line
number changes.
5.3.2 Diagnostic tables
   In the event of program failure, or when explicitly
requested by the user, a trace of the current state of a
program, including the values in active variables and the
execution history, can be of immense value.   For such a
trace to be provided the intermediate code must contain the
identifiers used in the source program for all the
variables, and a source-dependent description of those
variables.   This latter is needed so that the machine
representations may be interpreted in the correct way when
giving the values in variables.   In I-code all this
information is presented in the definitions of descriptors
and may be used or discarded at will.

5.3.3 Run-time checks
   Most languages define circumstances under which a
program is to be considered in error and its execution
terminated.   These errors include creating a value too
large to be represented (overflow), division by zero, use
of an array index which is outwith the declared bounds, and
so on.   There is a natural division of these errors into
those which are detected automatically by the machine and
those which must be detected by explicit checks in the
program.   Commonly, machines catch division by zero
automatically but do not provide such a feature for
checking array subscripts.
   The "hardware-detected" errors may be further divided
into
those which on detection cause the normal flow of control
to be interrupted, and those which simply make the
knowledge of the occurrence of the error available to the
program, for example by setting a condition-code bit.   For
the purposes of this discussion the second form of
hardware-detected error may be considered an error which is
not detected automatically, as it still requires explicit
instructions to test for the error and to transfer control
accordingly.   Clearly the more errors that fall into the
automatic category the better, as they do not cause the
user's program to grow with sequences of instructions
which, in a correct program, will always be testing for
conditions which never arise.
   These differences complicate the design of intermediate
codes as the classification differs from machine to
machine: on the VAX all forms of overflow can be made to
generate automatic interrupts, but the PDP11 only sets a
condition-code bit on some overflows.   There are two basic
ways of handling this in the intermediate code: firstly the
code can contain explicit requests for the checks to be
performed, and secondly the code can be designed in such a
way as to give the code-generator enough information to be
able to decide where checks are necessary.   Two specific
examples can indicate which of these ways should be
adopted.
Testing for arithmetic  overflow  is  currently  handled  by
machines in three main ways:
    1.   An interrupt is generated whenever overflow occurs.
         This  is  by  far the best method as it requires no
         overheads in the checked code.
    2.   A bit is set on overflow and is only  cleared  when
         it is tested.  This requires explicit checks in the
         code  but  several  tests  may  be conflated into a
         single test at an appropriate  point,  for  example
         before the final result is stored.
    3.   A  bit  is  set  on overflow, but is cleared by the
         next arithmetic  operation.   This  again  requires
         explicit  checking  code  but  the  tests  must  be
         inserted after every operation.
    For the intermediate code  to  indicate  where  overflow
    testing  is  to be performed it would have to choose the
    worst case from the three above, namely  case  3.   This
    would  result  in  a  test  being  requested after every
     arithmetic instruction, which test may just as well be
     included in the definition of the instructions
    themselves.
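        The consequence for the code generator can be put in
     one line per case; the following C fragment (names
     invented) merely restates the three cases:

          typedef enum { TRAPS, STICKY_BIT, VOLATILE_BIT } OvfKind;

          /* Overflow tests to plant for an expression of
             n_ops arithmetic operations. */
          static int overflow_tests_needed(OvfKind k, int n_ops)
          {
              switch (k) {
              case TRAPS:        return 0;      /* case 1: interrupt    */
              case STICKY_BIT:   return 1;      /* case 2: one test,
                                                   conflated at store   */
              case VOLATILE_BIT: return n_ops;  /* case 3: every op     */
              }
              return n_ops;
          }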
    The other area of low-level testing is in  implied  type
    conversions such as storing from a 32-bit integer into a
    16-bit  integer.   The VAX provides an instruction which
    combines the test for truncation with the store (CVTLW).
    The 7/32 has an instruction (CVHR) which  can  test  the
    value   before   assignment,   and  the  4/75  can  most
    efficiently test following the assignment (CH).
    If the request for the check is a separate  intermediate
    code  item,  the  7/32  case  is  simple  but  the other
    machines will require much  more  work  to  be  able  to
    generate  the  efficient  check.   The  problem  can  be
    simplified by introducing  new  assignment  instructions
    which  also  perform  the  test,  but this adds many new
    instructions to the code  as  one  instruction  will  be
    required  for every valid combination of types and every
    sort of assignment.
    The high-level checks such as array bound  checking  are
    usually   so   complicated   that   the  most  efficient
    implementations  depend  greatly   on   the   particular
    hardware, so much so that it would be foolish to attempt
    to express them in the intermediate code.
    The simplest solution is to ensure that the intermediate
    code   provides  enough  information  to  let  the  code
    generator decide where and what checks are necessary.
       The inclusion of checks against the use of unassigned
    variables provides  a  good  example  of  the  power  of
    leaving  the  checking  to  the  code-generator.   In  a
    simple-minded approach the  code-generator  tests  every
    suitable  value loaded from store.   A minor improvement
    to this is  to  mark  the  descriptor  for  every  local
    variable   in   a  block  when  it  is  first  assigned,
    inhibiting   the   marking   after   the   first   jump.
    Subsequently, marked objects need not be checked.
    A  much  better  improvement may be obtained by making a
    trivial extension to the register remembering mechanism.
    If  an  object  is  'known'  it  must  have  been   used
    previously,  and  hence  it  will  have  been checked if
    necessary.  Even after the register which held the value
    of  the  object  has  been  altered,   and   hence   the
    association between the register and the object lost, if
    the  compiler  remembers that the value was known it can
    suppress any unassigned checks on future references.
    At this point a useful property of IMP77 may be used  to
    great  effect:  once  a  variable  has  been assigned it
    cannot become unassigned.   This is  not  true  in  many
    languages,  as  for  example,  in  ALGOL60  the  control
    variable of a for loop is undefined (unassigned) at  the
    end  of  the  loop.   This  means that in IMP77 the 'was
    known' property of variables  may  be  preserved  across
    procedure  calls,  even  though all the register content
    information must be forgotten.
    This technique when applied on the 7/32 compiler results
    in a reduction of 33% in the code required for checking.
    While it is possible for the  unassigned  checks  to  be
    placed  in the intermediate code and for the first phase
     to remove redundant checks, this suppression would
    require  a  duplication  of  the remembering logic which
    must, in  any  case,  reside  in  the  machine-dependent
    phase.
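        The mechanism can be sketched in C (names invented;
     the thesis text above describes it as a trivial
     extension to the existing remembering tables):

          typedef struct {
              int in_reg;     /* register holding the value, or -1 */
              int was_known;  /* used, and so checked, before      */
          } VarState;

          static int needs_unassigned_check(VarState *v)
          {
              if (v->in_reg >= 0 || v->was_known)
                  return 0;         /* checked on an earlier use */
              v->was_known = 1;     /* this use plants the check */
              return 1;
          }

          /* At a procedure call the register contents are lost,
             but in IMP77 'was_known' may safely be preserved. */
          static void forget_registers(VarState *vars, int n)
          {
              for (int i = 0; i < n; i++)
                  vars[i].in_reg = -1;
          }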
                       6 Observations
6.1 Suitability of I-code for Optimisation
   When  considering  the   use   of   I-code   for   global
optimisation there are two techniques available:
   Firstly,  the  optimisations  can  be performed using the
I-code and going straight into object code, possibly  via  a
third  phase.   In  this  case  the  only real constraint on
I-code is that it be powerful enough to be able to carry all
the information available in the source and to present it in
a compact form.
   Secondly, the optimisations can be seen as an extra phase
introduced between the first phase  (the  I-code  generator)
and  what is normally the second phase (the code generator).
The optimiser takes in I-code and produces as its  output  a
new  I-code stream which can be fed into the code generator.
In this case not only must the I-code carry all  the  source
information  but  it must be able to describe the generation
of an optimised program.   Clearly the code must be able
reflect  the structure of the target machine in some way and
hence must be able to lose its machine independence.
   The second technique is the more interesting as not  only
does   it  permit  the  optional  inclusion  of  the  global
optimising without affecting  the  structure  of  the  other
phases, but it removes the optimisations from the low-level
details of code production and provides a means for
separating the machine-independent and machine-dependent
optimisations.   In particular, in the same way as much of
the code generator can be built from a standard "kit" with
a few special machine-specific parts, so the global
optimiser can utilise code from other optimisers.
   The way in which the optimiser can influence the
operation of the code generator is by making use of the
fact that the intermediate code does not describe a
computation but a compilation process.   This compilation
is driven by the descriptors which are normally translated
by the code generator from the machine-independent form in
the I-code into the appropriate machine-dependent
representation, reflecting the target machine architecture:
registers, stacks, memory etc.   By short-circuiting this
translation a global optimiser can force the use of
specific machine features.
For example consider the following fragment  of  an  integer
function:
    -------------------
   |   integer X       |
   |   X = A(J)        |
   |   X = 0 if X < 0  |
   |   result = X      |
    -------------------
The  standard  I-code  produced for this fragment would have
the form:
    ------------------------
   |   DEF 12 "X" INTEGER   |
   |    SIMPLE DEFAULT NONE |
   |    NONE                |
   |   PUSH     12          | - X
   |   PUSH     6           | - A
   |   PUSH     7           | - J
   |   ACCESS               |
   |   ASSVAL               |
   |   PUSH     12          | - X
   |   PUSHI    0           |
   |   COMP >=  1           |
   |   BGE      1           |
   |   PUSH     12          | - X
   |   PUSHI    0           |
   |   ASSVAL               |
   |   LOC      1           |
   |   PUSH     12          | - X
   |   RESULT               |
    ------------------------
On the PDP11 the code generated for this could be:
    -------------------
   |      MOV   J,R2   |
   |      ADD   R2,R2  | Scale the index
   |      ADD   A,R2   | Add in ADDR(A(0))
   |      MOV   (R2),X | X = A(J)
   |      BGE   $1     | ->$1 if X >= 0
   |      CLR   X      | X = 0
   | $1:  MOV   X,R1   | assign result register
   |      {return}     |
    -------------------
Here the obvious optimisation is  to  note  that  the  local
variable,  X,  is eventually to be used as the result of the
function and so needs to end in register 1.
By changing the definition of X in the I-code into:
    ---------------------------------------------------
   | DEF 12 "X" INTEGER SIMPLE DEFAULT NONE SPECIAL R1 |
    ---------------------------------------------------
and making no other changes, the code generator will produce
code of the form:
    --------------------
   |      MOV   J,R2    |
   |      ADD   R2,R2   |
   |      ADD   A,R2    |
   |      MOV   (R2),R1 |
   |      BGE   $1      |
   |      CLR   R1      |
   | $1:  {return}      |
    --------------------
   As this process necessitates the I-code becoming more and
more intimately involved with the structure  of  the  target
machine,  in  that it starts referring directly to registers
and the like, it is necessary that a  new  control  item  be
added  so  that  the  code  generator  may be prevented from
pre-empting resources which the optimiser  is  manipulating.
The  new  item is RELEASE and it is used in conjunction with
the definition of machine-dependent descriptors.   When such
a descriptor is introduced (using DEF) the associated target
machine component is considered to have been claimed and may
only  be  used  in  response  to explicit direction from the
I-code.    On  receipt  of  the  corresponding  RELEASE  the
component  is  once again made available for implicit use by
the code generator (for temporaries etc.).   This  mechanism
is  an  exact  parallel to the way in which memory locations
are claimed by the definition of descriptors and released by
the END of the enclosing block.
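   A sketch of the claim/release protocol in C (the
16-register machine and all the names are assumptions) is:

        static unsigned free_regs = 0xFFFFu;  /* 16 registers free */

        /* DEF ... SPECIAL Rr claims r; RELEASE returns it. */
        static void claim_reg(int r)   { free_regs &= ~(1u << r); }
        static void release_reg(int r) { free_regs |=  (1u << r); }

        /* The code generator may take temporaries only from
           the free set. */
        static int alloc_temp(void)
        {
            for (int r = 0; r < 16; r++)
                if (free_regs & (1u << r)) { claim_reg(r); return r; }
            return -1;    /* nothing free: a spill is required */
        }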
   The main assumption about this style of  optimisation  is
that  the  code  generator  has  the ability to generate any
required   instruction,   provided   that   the    pertinent
information is available at the required time.
As  an  example, the VAX 11/780 provides addressing modes in
which the value in a register may be scaled and  added  into
the  effective  operand  address before the operand is used,
hence the following code:
    ------------------------
   |   integerarray A(1:9)  |
   |-                      -|
   |   A(J) = 0             |
   |                        |
   |   MOVL   J,R5          |  pick up J
   |   CLRL   12(R3)[R5]    |  A(J) = 0
    ------------------------
The operand address generated by the CLRL instruction is:
    ------------------
   |   12+R3 + R5*4   |
    ------------------
as there are 4 bytes (address units) to a longword.
This instruction  can  be  generated  naturally  during  the
non-optimised  evaluation  of  array  subscripts, and so the
optimiser can assume that the index mode of operand will  be
used  whenever  a  register operand is specified as an array
index.
The procedure has the added advantage that, in the worst
case where the code generator does not produce the
instructions the optimiser hoped for, then as long as the
optimised I-code still describes the required compilation
the code generator
will simply produce a more long-winded,  but  equally  valid
version  of  the  program.   In other words, as long as some
choice is available and some temporary objects are  left  at
the  disposal  of  the  code generator, the optimiser cannot
force it into a state where working code cannot be produced.
In the example above even if the  code  generator  does  not
produce index mode operands, it can still generate sequences
of the form:
    ----------------------
   |   MULL3   R5,#4,R1   | R5*4  -> R1
   |   ADDL2   R3,R1      | R3+R1 -> R1
   |   CLRL    12(R1)     |     0 -> (12(R1))
    ----------------------
6.2 Performance
   The figures in appendix A3 are the results of measuring
the effect of various optimisations on the Interdata 7/32
and the DEC PDP 11/45.   One problem in choosing programs
to be measured is that heavy use of particular language
features will increase the overall effect of certain
optimisations.   As a trivial example of this consider the
following "program":
     --------------------------------
    | begin                          |
    |    integerarray A(1:1000)      |
    |    A(1) = 0                    |
    | endofprogram                   |
     --------------------------------
With all array optimisations enabled, on the 7/32 this
generates 30 bytes of code, whereas without the
optimisation it results in 170 bytes of code, largely due
to the procedure for declaring the array.   Clearly a
reduction of 82% is not to be expected on more typical
programs.
   Similarly the absence of features will bias the results.
In particular the smaller programs will not demonstrate the
power of the optimisations which only take effect when
various size limits have been exceeded: the most obvious
such limits being addressing restrictions caused by the
size of address fields in instructions.
   The major difficulty in producing results which are of
any real value is that the effects of the optimisations
depend on the individual style in which the programs under
consideration were written.   Inevitably users get a "feel"
for the sort of statement for which the compiler generates
good code and they modify their style of programming
accordingly.   If at some stage in its development a
compiler produces poor code for a particular construction
users will tend to avoid that construction, even long after
the compiler has been improved and can compile it
effectively.   This well-known phenomenon [Whitfield, 1973]
argues strongly that users should never see the object code
generated by the compilers they are using.
   The effects of many optimisations are difficult if not
impossible to measure with any degree of accuracy as they
interact with other optimisations a great deal.   The most
obvious interaction is that between the size of jump
instruction required and most of the other optimisations.
The size of jump is determined by the amount of code
generated between the jump and the label it references.
If any other optimisation is inhibited this volume of code
is likely to increase, decreasing the chances of being able
to use the shorter forms of the jump.
   Some optimisations depend almost totally on others; it
is unlikely that the optimisation of reducing or removing
the entry and exit code sequences associated with
procedures (section 4.5.1) would have much effect if the
parameters
were not passed in registers and references to them in the
procedures were replaced by references to those registers.
   In particular, it must be noted that it is always
possible to generate programs which will benefit greatly
from those optimisations which do not appear to be of much
use from the figures given.   However, the test programs
used to derive the figures are typical of the programs
processed by the compiler, and it is hoped that they give a
more realistic and balanced view of the improvements which
may be achieved in 'real' cases.   Under some circumstances
it may be advantageous to apply all optimisations even
though some may appear to give little benefit, since this
'squeezing the pips' frequently removes one or two
instructions from critical loops in a program.   Yet again
this shows the difficulty in quantifying the usefulness of
optimisations as they are so dependent on the particular
circumstances.
   One area of measurement  has  been  deliberately  omitted
from the figures, namely the effect on execution time of the
optimisations.  This was for several reasons:
    1.   On  the  systems  used  it  was  impossible  to get
         reliable  timing  measurements  with  any  accuracy
         greater than about plus or minus 5%.
    2.   For  the  reasons  given  previously, many programs
         could benefit from fortuitous  optimisations  which
         could not be expected every time.
    3.   Programs which would run long enough to improve the
         accuracy  of  the measurements invariably lost this
         accuracy  through  spending  much   time   in   the
         system-provided  procedures,  mainly  for input and
         output.   This point in particular suggests that as
         the  overhead  is beyond the control of the general
         user, the savings in code space may  be  much  more
         important.   Even  with  ever-growing  store sizes,
         virtual  memory  systems  will  continue  to  treat
         smaller programs better than larger ones.
    4.   Some  of  the  optimisations,  particularly passing
         parameters  in  registers,  prevent  the   compiled
         program   from   running,  unless  the  controlling
         environment is modified in a  parallel  way.   This
         would  invalidate the timings as the environment is
         not usually under the control of the compiler.
   From the crude measures which were obtained  there  is  a
suggestion  that  the  decrease  in  execution  time roughly
parallels the decrease in code size.
6.3 Cost of optimisation
   The  cost  of  an  optimisation  is,  in  general,   very
difficult  to  measure,  as  may  be seen by considering the
three relevant areas: compile time, space  requirement,  and
logical complexity.
6.3.1 Compile time
   In  order  to generate good code, the compiler must spend
time  looking  for  the  cases   which   are   amenable   to
improvement.   If  no optimisation is performed this time is
not used and so  the  compilation  should  take  less  time.
However,  the  non-optimised  version  commonly requires the
production  of  more  code  than  the   optimised   version,
frequently  over  fifty  percent  more  when comparing fully
diagnostic code with  fully  optimised  code.   On  all  the
compilers  written  so  far, the time saved by not having to
generate these extra instructions more  than  outweighs  the
time spent in deciding not to generate them.
6.3.2 Space requirement
   Several optimisations increase the requirement for
workspace, notably all the remembering optimisations.   On
most machines available at present the number of things
which may be remembered is fairly small: sixteen registers
and one condition-code is probably the maximum.   Even if
this number is increased by remembering several facts about
each thing, the total amount of space needed will be small
when compared with the space needed to hold the information
about user-defined objects, information which is required
whether optimisation is being performed or not.   On large
machines the extra memory required will be cheap; on small
machines the need for the optimisation will have to be
balanced against the size of the largest program which must
be compiled.

6.3.3 Logical complexity
   The cost of providing an optimisation includes a
non-recurrent component, which is the difficulty of
performing the optimisation at all because of the logical
complexity of discovering the necessary circumstances.   In
a system which is aimed at portability this cost can often
be shared over a number of implementations; the techniques
used in one being applicable to others, perhaps after minor
modifications.
6.4 Comments on the results

     6.4.1 Register remembering
             Of all the optimisations tested, a simple
          remembering of values in registers provided by far
          the greatest improvement in code size.   One
          problem in implementing this optimisation is
          deciding what to remember, as shown by the
          following code sequence:
                     -----------
                    | X = Y     |
                    |           |
                    | L    1,Y  |
                    | ST   1,X  |
                     -----------
          Following this sequence register 1 will contain
          both the value in X and the value in Y; should the
          compiler remember X or Y or both?   The
          measurements show that the gain in remembering
          both (2 uses) as opposed to just one (1 use) is
          quite small.
             The algorithm used to determine what to
          remember in the '1 use' case was simply to
          remember a new piece of information only if
          nothing else was known about the register in
          question.   This gives the best results in cases
          such as:
                    A = 0; B = 0; C = 0
          where the value '0' will be remembered, but will
          perform badly with the more contorted case:
                    A = 0; B = A; C = B
          as again only the value '0' will be remembered.
          Unless very tight code is required, the cost in
         maintaining multiple sets of information about each
         register  and  searching for particular values will
         probably  rule  out   such   extended   remembering
         optimisations.
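             The '1 use' rule itself is tiny; a sketch in C
          (names invented) is:

              typedef struct { int known; int object; } RegInfo;

              static void note_value(RegInfo *reg, int object)
              {
                  if (!reg->known) {    /* only fill an empty slot */
                      reg->known  = 1;
                      reg->object = object;
                  }
                  /* otherwise keep the existing fact: for
                     A = 0; B = 0; C = 0 the constant 0 stays
                     known, but for A = 0; B = A; C = B only the
                     value 0 is ever remembered. */
              }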
            Perhaps a surprising result is that the PDP11 on
         average  gains about as much from this optimisation
         as the 7/32.
         This is the result of two interacting effects.
         Firstly, the 7/32 dedicates up to five registers to
         address local variables in the last five levels  of
         procedure  nesting  and locks three for other fixed
         purposes,  leaving  about  ten   for   intermediate
          calculations.   The PDP11, however, uses a display
          in store to access intermediate levels and has to
         load the address of a particular level each time it
         is required.   In addition the PDP11 implementation
         fixes the use of four registers, leaving only  four
         for intermediate calculations.
         Secondly,  the  7/32  needs  to  use  at  least one
         register to move  values  around  while  the  PDP11
         often requires none.
         These  two  effects  give  a fairly large number of
         transient values in the registers of the 7/32,  and
         a  smaller  number  of  more frequently used values
         (addresses) in the  registers  of  the  PDP11.   On
         average   it  appears  that  the  number  of  times
         necessary values are found is roughly equal in  the
         two cases.
         register  later  on.   This instability seems to be
          undesirable, but alternative strategies, such as
          biasing the allocation towards or away from
          particular registers, on average result in worse
         code.
         
    6.4.4 Common operands
            On  the  7/32  the only instruction which can be
         used to simplify statements of the form: X = X op Y
         is the AM  (add  to  memory)  instruction.   It  is
         therefore   somewhat   surprising   that   its  use
         frequently saves over two percent of the code.
         The two possible expansions of a suitable  addition
         statement are:
             ------------ ------------
            |            |            |
            |  L   1,Y   |  L   1,X   |
            |  AM  1,X   |  A   1,Y   |
            |            |  ST  1,X   |
             ------------ ------------
         The first saves four bytes and leaves the increment
         in the register.   Even if the incremented value is
         required immediately  the  extra  load  instruction
         will  only  increase  the  code size to that of the
         alternative sequence.
         As the PDP11 has many  instructions  which  can  be
         used  in  this  way it is hardly surprising that it
         benefits much more.
    6.4.5 Parameters in registers
            This  optimisation  gives  another   significant
         saving  in  code  at  little  cost to the compiler,
         simply  by  moving  the  store   instructions   for
         parameter  assignment  from  the  many calls to the
         unique procedure definitions.   The effect is  more
         pronounced  on  the 7/32 as all assignments require
         two instructions, a load and a store,  whereas  the
         PDP11 can usually make do with one MOV instruction.
         In  the  latter  case  the  saving  comes  from the
         ability to reduce the size of the  procedure  entry
         and  exit sequences if all of the parameters can be
         passed in registers.
    6.4.6 Condition-code remembering
            On   machines   with   condition   codes    many
         instructions  set  the  result of a comparison with
         zero as a side-effect.   Knowledge of this  can  be
         used  to  inhibit  explicit  code to compare values
         with zero.   However, the small benefit  so  gained
         suggests that it is not worth doing, even though it
         is a very cheap test.
         As the techniques  for  merging  and  delaying  are
         quite  expensive,  but  not complicated, and have a
         major influence on the design of the code-generator
         the small gains achieved are probably not worth the
         trouble, unless the  last  drop  of  efficiency  is
         required at all costs.
6.5 Criticisms and benefits of the technique

6.5.1 Complexity
   The main argument against the use of high-level
intermediate codes is that they move the complexity of code
generation from the common machine-independent phase into
the machine-dependent phase, forcing work to be repeated
each time a new compiler is required.   While this is
undoubtedly true the overheads are not as great as they may
first appear.
   The extra complexity of the code generators may be split
into two parts: an organisational part which builds and
maintains the structures used during the compilation, and
processes the intermediate code, using it to drive the
second part, an operational part which uses the structures
to generate code as instructed by the organisational part.
The changes in the organisational part when moving to a new
machine are small enough to permit the use of large
sections of code from old compilers.   Even when
considering the operational part much will be similar from
machine to machine; in particular the communication between
the second and third phases and the bulk of that latter
phase can be taken without change.   From examining the
compilers produced using I-code it appears that about 60%
of the source of the machine-dependent parts is common, 20%
can be considered as being selected from a standard "kit"
of parts, and the final 20% is unique to the host machine.
6.5.2 I/O overhead
   One of the disadvantages of dividing a compiler into
several distinct phases is that it results in an additional
cost in communicating between consecutive phases.   As
discussed in section 5.1 this cost depends on the system
running the compiler.   In the worst case where
communication is achieved using conventional files the
overhead may not be too serious.   The time spent doing
input and output on the Interdata 7/32 compiler is about
27% of the total compilation time, and for the PDP11 is
about 22%, breaking down as follows:
     -------------------------------------------------------
    |                                                       |
    |             ----        ----        ----              |
    |            |    |      |    |----->|    |             |
    | Source --->| P1 |----->| P2 |      | P3 |---> Object  |
    |            |    |      |    |----->|    |             |
    |             ----        ----        ----              |
    |                                                       |
    | 7/32:    7%         7%          10%         3%        |
    |         (4%)       (4%)         (5%)       (3%)       |
    |                                                       |
    | PDP11:  9.4%        11%         0.6%       0.5%       |
     -------------------------------------------------------
The figures in parentheses give the percentage of time
taken when the input and output requests are made directly
to the file manager rather than via the standard subsystem
procedures, thus reducing the internal I/O overhead to
about 10% of the total compilation time.
6.5.3 Maintenance
   One important gain in using such intermediate codes is
that they can ease the difficulties associated with
maintaining a number of compilers for different machines,
when those compilers are self-compiling.   For several
reasons it may not be desirable to permit sites to have the
source of the machine-independent phase: commonly to give
freedom of choice for the form of the language in which
that phase is written and to prevent local "improvements"
which rapidly lead to non-standard language definitions.
In such cases the intermediate-code generator can be
maintained at one site and updated versions can be
distributed in the intermediate code form without fear of
compromising the quality of the object code generated from
it.   Such a technique is currently being used in the
production of portable SIMULA compilers [Krogdahl, 1980].

6.5.4 Flexibility
   At some stage in producing a compiler the needs of the
end user must be considered.   The flexibility afforded by
the high-level nature of the intermediate code allows the
compiler to be adapted to fit its environment.   If the
compiler is to be used for teaching, the quality of the
code it produces can be sacrificed for compilation speed
and high-quality diagnostics, particularly as compilation
time may well be an order of magnitude greater than the
execution time; indeed many of the programs will fail to
compile and
never reach execution.   If the application is for
compiling programs that will be compiled once and then
executed many times, more effort can be expended in
producing fast code, although this is not to say that
diagnostics and fast code must be kept separate, as the
longer a program runs without failing the more trouble will
be caused when it fails without convenient diagnostics.

6.6 Comments on instruction sets and compilation
   Following the production of IMP compilers for several
different processors various features of instruction sets
have become evident which influence the generation of code.
     i    The instruction set should be complete, that is,
          where an instruction is available for one data
          type it should be available for all data types
          for which it is well-defined.   Similarly,
          instruction formats used by one operation should
          be available for all similar operations.   The
          best example of such an instruction set is that
          provided by the DEC PDP10.   Unfortunately the
          majority of machines are not so helpful.   As an
          example of the sorts of thing which go wrong
          consider the Perkin-Elmer 3200 series.   These
          machines provide three integer data types:
          fullword (32 bits, signed), halfword (16 bits,
          signed), and byte (8 bits, unsigned).   There are
          "add fullword" (A) and "add halfword" (AH)
          instructions but no
         "add byte" instruction.   There are "add  immediate
         short"  and "subtract immediate short" instructions
         but multiply, divide, and, or  etc.  can  not  have
         short immediate operands.
    ii   The  instructions  should  be  consistent, that is,
         logically similar  instructions  should  behave  in
         similar fashions.
         Again, on the Perkin-Elmer 3200:
         Load  fullword  (L)  and load halfword (LH) set the
          condition code but load byte (LB) does not.
         Most register-store instructions can be replaced by
         a load  of  the  appropriate  type  followed  by  a
         register-register instruction: e.g.
             -------------- --------------
            |   CH   1,X   |   LH   0,X   |
            |              |   CR   1,0   |
             -------------- --------------
         both  result  in  the same setting of the condition
         code, but
             -------------- --------------
            |   CLB  1,B   |   LB   0,B   |
            |              |   CLR  1,0   |
             -------------- --------------
         could result in different settings of the condition
         code as CLR compares two unsigned 32 bit quantities
         whereas CLB  compares  a  zero-extended  byte  from
         store with the zero-extended least significant byte
          of register 1.   For consistency either compare
          halfword (CH) should use the sign-extended least
          significant half of the register, or, better, CLB
         should not tamper with the value in the register.
    iii  Complex instructions should be avoided.   There are
         two reasons for this.   Firstly, it is easier for a
         compiler  to  break  down  statements  into  simple
         operations than it is to build them up into complex
         ones  [Stockton-Gaines,  1965].   Secondly,  if the
         complex  instructions  do  not  perform  the  exact
         function required by the language more instructions
         will   be  needed  to  "prepare"  for  the  complex
         instruction and to "undo" its unwanted effects.  As
         an example, the DEC VAX11/780 is  full  of  complex
         instructions   which  seem  to  be  well-suited  to
         high-level languages at first glance but on  closer
         inspection   they  are  not  so  useful.    A  CASE
         instruction is provided which indexes into a  table
         of displacements and adds the selected value to the
         program   counter.    This  would  seem  ideal  for
         compiling  SWITCH  jumps.   Unfortunately,  as  the
         table of displacements follows the CASE instruction
         it  would  be  very expensive to use it each time a
         jump occurred using a particular  switch.   Instead
         all  references to the switch must jump to a common
         CASE instruction.   Even this does not help, as  in
         the  event  of  an attempted jump to a non-existent
         switch label the diagnostics or the event mechanism
         will see the error as having occurred at the  wrong
         place in the program.  Although this problem can be
         "programmed  around" it turns out that it is faster
         to implement switches using  sequences  of  simpler
         instructions.
    iv   Machine  designers should investigate carefully the
         full consequences of building-in special fixed uses
         of machine features.  One of the best examples of a
         clear oversight  which  causes  grief  to  compiler
         writers   is   found   in  the  DATA  GENERAL  NOVA
         multiplication   instruction.    This   instruction
         multiplies  the  value  in register 1 by register 2
         and places the double-length result in registers  0
         and  1.   As only registers 2 and 3 may be used for
         addressing, and as register 3 is  always  used  for
         subroutine linkage, it follows that register 2 must
         be  used  for addressing the local stack frame, but
         this  is  exactly  the  register  which   must   be
         corrupted in order to use the multiply instruction!
   Although specific machines have been used in the examples
similar problems abound in all machines.  Indeed it is clear
that  machines  are  most  commonly designed for programmers
writing in assembler or  FORTRAN,  and  furthermore  writing
their programs in a particular style.
   While it is clear that the problems could be called
"mere details" and that they are not difficult to surmount,
it remains that they complicate otherwise simple
code-generation algorithms, making compilers larger,
slower, and correspondingly more difficult to write, debug,
and maintain.
   In conclusion, it appears that the machine most suited
to supporting high-level languages should have a small but
complete set of very simple instructions, their simplicity
permitting rapid execution and great flexibility.
7  Conclusions

7.1  Viability of the technique

   The techniques described above have been used to create
several IMP77 compilers which are in regular use on a number
of systems.  In terms of the total memory space required for
a compilation they compare favourably with other compilers.
The major weakness seems to be execution time, which can
vary from twice as long as other compilers in the worst case
to half as long in the best case.  As most of the effort in
writing the compilers was spent in investigating the
techniques involved and not in minimising compile time, and
as the compilers which ran much faster were either totally
or partially written in machine code (the IMP77 compilers
are all written exclusively in IMP77), it seems that the
technique can be used to produce acceptable service
compilers.

7.2  Ease of portability

   Although using I-code does not permit compilers to be
written in as short a time as with P-code and OCODE, the
large amount of code which is common to all of the compilers
written so far means that, given a working code generator as
a template, a new optimising compiler can be written in the
space of a few months, with the final result producing code
of high quality.
           area, or created specially.   Items on the  stack
           are  modified  by intermediate code control items
           to reflect operations  specified  in  the  source
           program.   Such  modifications  may  or  may  not
           result in code being generated.   From the  point
           of  view  of  this  definition stack elements are
           considered to have at least three components:
                   i       Type
                   ii      Value
                   iii     Access rule
           The "Access rule"  defines  how  the  "Type"  and
           "Value" attributes are to interpreted in order to
           locate the described object.
           For example, the access rule for a constant could
           be  "Value  contains  the  constant"  while for a
           variable it could be "Value contains the  address
           of the variable".   Clearly, the access rules are
           target-machine  dependent.   Descriptors  may  be
           combined  to give fairly complex access rules, as
           in the case of applying "PLUS" to the stack  when
           the  top  two  descriptors are for the variable X
           and the constant 1 resulting  in  one  descriptor
           with the access rule "take the value in X and add
           1  to it".   The complexity of these access rules
           may be restricted by a  code-generator.   In  the
           example above code could be generated to evaluate
           X+1  resulting in an access rule "the value is in
           register 1", say.
   The importance of the code describing not the actual
computation which the source program specified but rather
the compilation process required is seen when attempting to
use the code for statements of the form:
   A := if B = C then D else E;
This could not be encoded as:
          PUSH   A
          PUSH   B
          PUSH   C
          JUMPIF # L1
          PUSH   D
          GOTO   L2
          LOCATE L1
          PUSH   E
          LOCATE L2
          ASSVAL
         
The reason is that the items on the stack at the time of the
ASSVAL  would be (from top to bottom) [E], [D], [A], because
no items were given which would remove them from the  stack.
Hence the ASSVAL would assign the value of E to D and then
leave A dangling on the stack.
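   An encoding which does respect the stack discipline
(assuming, for illustration, that the assignment is
duplicated into each arm of the conditional) would be:
          PUSH   B
          PUSH   C
          JUMPIF # L1
          PUSH   A
          PUSH   D
          ASSVAL
          GOTO   L2
          LOCATE L1
          PUSH   A
          PUSH   E
          ASSVAL
          LOCATE L2
Here the JUMPIF removes the descriptors for B and C, and
each ASSVAL finds exactly the two descriptors it requires,
so no items are left dangling on the stack.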
   Unless   otherwise   stated,   all   constants   in   the
intermediate code are represented in octal.
Descriptors
DEF TAG TEXT TYPE FORM SIZE SPEC PREFIX
           This item causes a new descriptor to be generated
           and placed in the descriptor area.   On creation,
           the  various  fields  of  the  DEF  are  used  to
           construct  the  machine-dependent  representation
           required for the object.
           TAG             is an identification  which  will
                           be  used subsequently to refer to
                           the descriptor.
           TEXT            is the source-language identifier
                           given  to  the  object  (a   null
                           string   if   no  identifier  was
                           specified).
           TYPE            is  the  type  of   the   object:
                           GENERAL,  INTEGER,  REAL, STRING,
                           RECORD, LABEL, SWITCH, FORMAT.
           FORM            is one of: SIMPLE, NAME, ROUTINE,
                           FN,  MAP,  PRED,  ARRAY,  NARRAY,
                           ARRAYN, NARRAYN.
           SIZE            is   either   the   TAG   of  the
                           appropriate     record     format
                           descriptor   for   records,   the
                           maximum  length   of   a   string
                           variable,  or  the  precision  of
                           numerical   variables:   DEFAULT,
                            BYTE, SHORT, LONG.
           SPEC            has   the   value  SPEC  or  NONE
                           depending on whether or  not  the
                           item is a specification.
           PREFIX          is  one  of:  NONE,  OWN,  CONST,
                           EXTERNAL, SYSTEM, DYNAMIC,  PRIM,
                           PERM  or SPECIAL.   If SPECIAL is
                           given  there   will   follow   an
                           implementation-dependent
                           specification  of  the properties
                           of the object (such as that it is
                           to be a register, for example).
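   As an illustration (with invented tag numbers), the
declarations %integer COUNT and %string(15) NAME might
produce:
      DEF 5 "COUNT" INTEGER SIMPLE DEFAULT NONE NONE
      DEF 6 "NAME" STRING SIMPLE 17 NONE NONE
Note that, as constants in the intermediate code are octal,
the maximum string length 15 appears as 17.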
Parameters and Formats
   The parameters for procedures and the elements of record
formats are defined by a list immediately following the
procedure or format descriptor definition:
START            Start of definition list
FINISH           End of definition list
ALTBEG           Start of alternative sequence
ALT              Alternative separator
ALTEND           End of alternative sequence
Blocks
BEGIN            Start of BEGIN block
END              End of BEGIN block or procedure
PUSH <tag>       Push a copy of the  descriptor  <tag>  onto
                 the stack.
PROC <tag>       This  is  the  same as PUSH except that the
                 descriptor  being  stacked   represents   a
                 procedure  which  is  about  to  be  called
                 (using ENTER).
PUSHI <n>        Push a descriptor for the integer  constant
                 <n> onto the stack.
PUSHR <r>        Push    a    descriptor    for   the   real
                 (floating-point)  constant  <r>  onto   the
                 stack.
PUSHS <s>        Push  a  descriptor for the string constant
                 <s> onto the stack.
                 
SELECT <tag>     TOS will be  a  descriptor  for  a  record.
                 Replace this descriptor with one describing
                 the sub-element <tag> of this record.
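   For example (an illustrative fragment), if R describes a
record with an integer sub-element whose descriptor has tag
F, the statement R_F = 0 could be encoded:
      PUSH   R
      SELECT F
      PUSHI  0
      ASSVAL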
Assignment
ASSVAL           Assign  the  value  described by TOS to the
                 variable described by SOS.   Both  TOS  and
                 SOS are popped from the stack.
ASSREF           Assign  a reference to (the address of) the
                 variable described by TOS  to  the  pointer
                 variable  described  by SOS.   Both TOS and
                 SOS are popped from the stack.
JAM              This is the same as ASSVAL except that  the
                 value  being  assigned will be truncated if
                 necessary.
ASSPAR           Assign the actual  parameter  described  by
                 TOS  to  the  formal parameter described by
                 SOS.   This is equivalent to either  ASSVAL
                 (for   value  parameters)  or  ASSREF  (for
                 reference parameters).
RESULT           TOS describes the result of the enclosing
                 function.  Following the processing of the
                 result, code must be generated to return
                 from the function.
MAP              Similar to RESULT except that TOS describes
                 the  result of a MAP.   Again a return must
                 be generated.
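   The distinction between ASSVAL and ASSREF mirrors the two
assignment operators of the source language.  As an
illustrative fragment, the assignments X = Y and P == Q
could be encoded:
      PUSH   X           PUSH   P
      PUSH   Y           PUSH   Q
      ASSVAL             ASSREF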
                 
DEFAULT <n>
INIT <n>         Create <n> data items, corresponding to the
                 last descriptor defined, and give them all
                 an initial (constant) value.  The constant
                 is popped from the stack in the case of
                 INIT, but DEFAULT causes the
                 machine-dependent default value to be used
                 (normally the UNASSIGNED pattern).
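   For example (an illustrative fragment), assuming the
descriptor for COUNT has just been defined, the declaration
%owninteger COUNT = 3 could produce:
      PUSHI  3
      INIT   1
whereas the same declaration without an initial value could
produce DEFAULT 1, filling the item with the machine's
UNASSIGNED pattern.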
Binary operators
ADD              Addition
SUB              Subtraction
MUL              Multiplication
QUOT             Integer division
DIVIDE           Real division
IEXP             Integer exponentiation
REXP             Real exponentiation
AND              Logical AND
OR               Logical inclusive OR
XOR              Logical exclusive OR
LSH              Logical left shift
RSH              Logical right shift
CONC             String concatenate
ADDA             ++
SUBA             --
   The given operation is performed on TOS and SOS, both of
which are removed from the stack, and the result (SOS op TOS)
is pushed onto the stack.
e.g.  A = B-C
      PUSH   A
      PUSH   B
      PUSH   C
      SUB
      ASSVAL
Unary operators
NEG              Negate (unary minus)
NOT              Logical NOT (complement)
MOD              Modulus (absolute value)
   The given operation is performed on TOS.
Arrays
DIM <d> <n>      The  stack  will  contain  <d>   pairs   of
                 descriptors  corresponding to the lower and
                 upper   bounds   for   an   array.     This
                 information is used to construct <n> arrays
                 and any necessary accessing information for
                 use  through  the  last  <n> descriptors to
                 have   been   defined.     All   of   these
                 descriptors will be for similar arrays.
INDEX            SOS will be the descriptor for a
                 multi-dimensional array and TOS will be the
                 next (not the final) subscript.  The
                 subscript is popped and absorbed into the
                 array descriptor.
ACCESS           SOS  will be the descriptor of an array and
                 TOS will be the final/only subscript.  Both
                 descriptors are replaced  by  a  descriptor
                 for the appropriate element of the array.
                 E.g.  given  arrays A(1:5) and B(1:4, 2:6),
                 and integers J,K:
                    A(J) = 0      K = B(J, K)
                    PUSH   A      PUSH K
                    PUSH   J      PUSH B
                    ACCESS        PUSH J
                     PUSHI  0      INDEX
                    ASSVAL        PUSH K
                                  ACCESS
                                   ASSVAL
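   The arrays used above would themselves have been created
by DIM; a plausible sequence (assuming the descriptors for A
and B are the most recently defined, and noting that one DIM
is needed for each as the arrays are not similar) is:
      PUSHI  1
      PUSHI  5
      DIM    1 1
      PUSHI  1
      PUSHI  4
      PUSHI  2
      PUSHI  6
      DIM    2 1
the first DIM creating A from one pair of bounds, the second
creating B from two.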
Internal labels
                    Internal labels are those labels in  the
                 intermediate  code  which have been created
                 by the  process  of  translating  from  the
                 source   program,  and  so  do  not  appear
                 explicitly in the source program.  The main
                 property of these labels is that they  will
                 only be referred to once.  This fact can be
                 used   to  re-use  these  labels,  as,  for
                 example,   a   forward   reference   to   a
                 currently-defined   label  must  cause  its
                 redefinition.
LOCATE <l>       define internal label <l>
GOTO <l>         forward jump to internal label <l>
REPEAT <l>       backward jump to internal label <l>
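   For example (an illustrative fragment), once the single
reference to an internal label has been processed its number
becomes free again; after
      LOCATE L1
        ...
      REPEAT L1
the number L1 may be re-used, so a subsequent GOTO L1
denotes a fresh forward label and the following LOCATE L1
redefines the number.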
Conditional branches
   These branches are always forward.
JUMPIF           <cond> <label>
JUMPIFD          <cond> <label>
JUMPIFA          <cond> <label>
                    Where:   <cond> ::= =, #,
                                        <, <=,
                                        >, >=,
                                        TRUE, FALSE
   The two items on the top of the stack are compared and a
jump is taken to <label> if the condition specified by
<cond> is true.  In the case of <cond> being TRUE or FALSE
only one item is taken from the stack, and this represents a
boolean value to be tested.
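   For example (an illustrative fragment), the statement
%if X = 0 %then Y = 1 could be encoded:
      PUSH   X
      PUSHI  0
      JUMPIF # L1
      PUSH   Y
      PUSHI  1
      ASSVAL
      LOCATE L1
The JUMPIF removes the descriptors for X and 0 from the
stack, so the conditional arm begins with a clean stack.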
User Labels
LABEL <d>        locate label descriptor <d>
JUMP <d>         Jump to the label described by <d>
CALL <d>         Call the procedure described by <d>
Sundry Items
ON <e> <l>       Start   of   event  trap  for  events  <e>.
                 Internal label <l> defines the end  of  the
                 event block.
EVENT <e>        Signal event <e>
STOP             stop
MONITOR          monitor
RESOLVE <m>      Perform a string resolution
FOR              Start of a for loop
SLABEL <sd>      Define switch label
SJUMP <sd>       Select and jump to switch label
LINE <l>         Set the current line number to <l>
Remembering values in registers
                        Code    Total    Incremental
                        Size  Reduction   Reduction
                        ----  ---------   ---------
P732.1   0 uses         9504       -           -
         1 use          8194    13.8%        13.8%
         2 uses         8192    13.8%         0.0%
P732.2   0 uses         6500       -           -
         1 use          6126     5.8%         5.8%
         2 uses         6126     5.8%         0.0%
P732.3   0 uses        10960       -           -
         1 use          9968     9.0%         9.0%
         2 uses         9956     9.2%         0.2%
P732.4   0 uses         5288       -           -
         1 use          4970     6.0%         6.0%
         2 uses         4958     6.2%         0.2%
P732.5   0 uses         5468       -           -
         1 use          4990     8.7%         8.7%
         2 uses         4986     8.8%         0.1%
P732.6   0 uses         3424       -           -
         1 use          3208     6.3%         6.3%
         2 uses         3208     6.3%         0.0%
P732.7   0 uses        10736       -           -
         1 use          9880     8.0%         8.0%
         2 uses         9874     8.0%         0.0%
P732.8   0 uses          824       -           -
         1 use           770     6.6%         6.6%
         2 uses          770     6.6%         0.0%
P732.9   0 uses         6448       -           -
         1 use          6148     4.6%         4.6%
         2 uses         6148     4.6%         0.0%
P732.10  0 uses        22968       -           -
         1 use         20656    10.1%        10.1%
         2 uses        20650    10.1%         0.0%
P732.11  0 uses        13996       -           -
         1 use         12470    10.9%        10.9%
         2 uses        12442    11.1%         0.2%
P732.12  0 uses        32600       -           -
         1 use         28532    12.5%        12.5%
         2 uses        28392    12.9%         0.4%
                       Code    Total    Incremental
                       Size  Reduction   Reduction
                       ----  ---------   ---------
P11.1   0 uses         9060       -           -
        1 use          7712    14.9%       14.9%
        2 uses         7660    15.4%        0.5%
P11.2   0 uses         6276       -           -
        1 use          6000     4.4%        4.4%
        2 uses         6000     4.4%        0.0%
P11.3   0 uses         9992       -           -
        1 use          9480     5.1%        5.1%
        2 uses         9444     5.5%        0.4%
P11.4   0 uses         5020       -           -
        1 use          4772     5.4%        5.4%
        2 uses         4768     5.6%        0.2%
P11.5   0 uses         5096       -           -
        1 use          4460    12.5%       12.5%
        2 uses         4452    12.6%        0.1%
P11.6   0 uses         3692       -           -
        1 use          3064    17.0%       17.0%
        2 uses         3064    17.0%        0.0%
P11.7   0 uses         7976       -           -
        1 use          7060    11.5%       11.5%
        2 uses         7030    11.8%        0.3%
P11.8   0 uses          668       -           -
        1 use           652     2.4%        2.4%
        2 uses          624     6.6%        4.2%
P11.9   0 uses         4888       -           -
        1 use          4492     8.1%        8.1%
        2 uses         4484     8.3%        0.2%
P11.10  0 uses        20318       -           -
        1 use         19120     5.9%        5.9%
        2 uses        19120     5.9%        0.0%
P11.11  0 uses        12938       -           -
        1 use         12162     6.0%        6.0%
        2 uses        12148     6.1%        0.1%
P11.12  0 uses        12068       -           -
        1 use         10594    12.2%       12.2%
        2 uses        10584    12.3%        0.0%
Remembering sets of registers (environments)
                        Code    Total    Incremental
                        Size  Reduction   Reduction
                        ----  ---------   ---------
P732.1  0 environments  8556      -           -
        1 environment   8316    2.8%         2.8%
        2 environments  8238    3.7%         0.9%
        3 environments  8232    3.8%         0.1%
        4 environments  8222    3.9%         0.1%
        5 environments  8218    4.0%         0.1%
        6 environments  8192    4.2%         0.2%
P732.2  0 environments  6202      -           -
        1 environment   6128    1.2%         1.2%
        2 environments  6130    1.2%         0.0%
        3 environments  6126    1.2%         0.0%
        4 environments  6126    1.2%         0.0%
        5 environments  6126    1.2%         0.0%
        6 environments  6126    1.2%         0.0%
P732.3  0 environments 10174      -           -
        1 environment  10062    1.1%         1.1%
        2 environments  9968    2.0%         0.9%
        3 environments  9966    2.0%         0.0%
        4 environments  9964    2.1%         0.1%
        5 environments  9956    2.1%         0.1%
        6 environments  9956    2.1%         0.1%
P732.4  0 environments  5068      -           -
        1 environment   4978    1.8%         1.8%
        2 environments  4958    2.2%         0.4%
        3 environments  4958    2.2%         0.0%
        4 environments  4958    2.2%         0.0%
        5 environments  4958    2.2%         0.0%
        6 environments  4958    2.2%         0.0%
P732.6  0 environments  3262      -           -
        1 environment   3250    0.4%         0.4%
        2 environments  3216    1.4%         1.0%
        3 environments  3208    1.7%         0.3%
        4 environments  3208    1.7%         0.0%
        5 environments  3208    1.7%         0.0%
        6 environments  3208    1.7%         0.0%
P732.7  0 environments 10062      -           -
        1 environment   9970    0.9%         0.9%
        2 environments  9894    1.7%         0.8%
        3 environments  9880    1.8%         0.1%
        4 environments  9874    1.9%         0.1%
        5 environments  9874    1.9%         0.0%
        6 environments  9874    1.9%         0.0%
P732.8  0 environments   806      -           -
        1 environment    782    3.0%         3.0%
        2 environments   782    3.0%         0.0%
        3 environments   770    4.5%         1.5%
        4 environments   770    4.5%         0.0%
        5 environments   770    4.5%         0.0%
        6 environments   770    4.5%         0.0%
P732.9  0 environments  6244      -           -
        1 environment   6202    0.7%         0.7%
        2 environments  6156    1.4%         0.7%
        3 environments  6158    1.4%         0.0%
        4 environments  6148    1.5%         0.1%
        5 environments  6148    1.5%         0.0%
        6 environments  6148    1.5%         0.0%
P732.10 0 environments 21214      -           -
        1 environment  20928    1.3%         1.3%
        2 environments 20748    2.2%         0.9%
        3 environments 20678    2.5%         0.3%
        4 environments 20678    2.5%         0.0%
        5 environments 20668    2.6%         0.1%
        6 environments 20650    2.6%         0.0%
P732.11 0 environments 12772      -           -
        1 environment  12592    1.4%         1.4%
        2 environments 12486    2.2%         0.8%
        3 environments 12472    2.3%         0.1%
        4 environments 12460    2.4%         0.1%
        5 environments 12452    2.5%         0.1%
        6 environments 12442    2.6%         0.1%
P732.12 0 environments 11522      -           -
        1 environment  11418    0.9%         0.9%
        2 environments 11342    1.6%         0.7%
        3 environments 11314    1.8%         0.2%
        4 environments 11314    1.8%         0.0%
        5 environments 11296    2.0%         0.2%
        6 environments 11296    2.0%         0.0%
P11.1   0 environments  7686      -           -
        1 environment   7670    0.2%         0.2%
        2 environments  7660    0.3%         0.1%
        3 environments  7660    0.3%         0.0%
        4 environments  7660    0.3%         0.0%
        5 environments  7660    0.3%         0.0%
        6 environments  7660    0.3%         0.0%
P11.2   0 environments  6012      -           -
        1 environment   6000    0.2%         0.2%
        2 environments  6000    0.2%         0.0%
        3 environments  6000    0.2%         0.0%
        4 environments  6000    0.2%         0.0%
        5 environments  6000    0.2%         0.0%
        6 environments  6000    0.2%         0.0%
P11.3   0 environments  9472      -           -
        1 environment   9440    0.3%         0.3%
        2 environments  9444    0.3%        -0.0%
        3 environments  9444    0.3%         0.0%
        4 environments  9444    0.3%         0.0%
        5 environments  9444    0.3%         0.0%
        6 environments  9444    0.3%         0.0%
P11.4   0 environments  4784      -           -
        1 environment   4776    0.2%         0.2%
        2 environments  4776    0.2%         0.0%
        3 environments  4776    0.2%         0.0%
        4 environments  4776    0.2%         0.0%
        5 environments  4772    0.2%         0.0%
        6 environments  4768    0.3%         0.1%
P11.5   0 environments  4512      -           -
        1 environment   4464    1.1%         1.1%
        2 environments  4456    1.2%         0.1%
        3 environments  4452    1.3%         0.1%
        4 environments  4452    1.3%         0.0%
        5 environments  4452    1.3%         0.0%
        6 environments  4452    1.3%         0.0%
P11.6   0 environments  3076      -           -
        1 environment   3070    0.2%         0.2%
        2 environments  3064    0.4%         0.2%
        3 environments  3064    0.4%         0.0%
        4 environments  3064    0.4%         0.0%
        5 environments  3064    0.4%         0.0%
        6 environments  3064    0.4%         0.0%
P11.7   0 environments  7104      -           -
        1 environment   7048    0.8%         0.8%
        2 environments  7048    0.8%         0.0%
        3 environments  7048    0.8%         0.0%
        4 environments  7048    0.8%         0.0%
        5 environments  7048    0.8%         0.0%
        6 environments  7032    1.0%         1.0%
P11.8   0 environments   640      -           -
        1 environment    624    2.5%         2.5%
        2 environments   624    2.5%         0.0%
        3 environments   624    2.5%         0.0%
        4 environments   624    2.5%         0.0%
        5 environments   624    2.5%         0.0%
        6 environments   624    2.5%         0.0%
P11.9   0 environments  4492      -           -
        1 environment   4484    0.2%         0.2%
        2 environments  4484    0.2%         0.0%
        3 environments  4484    0.2%         0.0%
        4 environments  4484    0.2%         0.0%
        5 environments  4484    0.2%         0.0%
        6 environments  4484    0.2%         0.0%
P11.10  0 environments 19332      -           -
        1 environment  19196    0.7%         0.7%
        2 environments 19158    0.9%         0.2%
        3 environments 19138    1.0%         0.1%
        4 environments 19138    1.0%         0.0%
        5 environments 19120    1.1%         0.1%
        6 environments 19120    1.1%         0.0%
P11.11  0 environments 12280      -           -
        1 environment  12200    0.6%         0.6%
        2 environments 12168    0.9%         0.3%
        3 environments 12160    1.0%         0.1%
        4 environments 12156    1.0%         0.0%
        5 environments 12148    1.1%         0.1%
        6 environments 12148    1.1%         0.0%
P11.12  0 environments 10690      -           -
        1 environment  10616    0.7%         0.7%
        2 environments 10604    0.8%         0.1%
        3 environments 10604    0.8%         0.0%
        4 environments 10594    0.9%         0.1%
        5 environments 10584    1.0%         0.1%
        6 environments 10584    1.0%         0.0%
Simple allocation of arrays and remembering subscripts
                   Allocation        Remembering
           Neither   Simple  (gain)  Subscripts  (gain)
           -------    ----   ------  ----------  ------
P732.1       8596     8476   (1.4%)     8312     (3.3%)
P732.2       6126     6126   (0.0%)     6126     (0.0%)
P732.3      10450    10114   (3.2%)    10426     (0.2%)
P732.4       5056     4958   (1.9%)     5056     (0.0%)
P732.5       5306     5054   (4.7%)     5308    -(0.0%)
P732.6       3384     3254   (3.8%)     3386    -(0.0%)
P732.7      10346    10112   (2.3%)    10344     (0.0%)
P732.8        806      806   (0.0%)      770     (4.5%)
P732.9       6138     6138   (0.0%)     6148    -(0.2%)
P732.10     20806    20684   (0.6%)    20776     (0.1%)
P732.11     12442    12442   (0.0%)    12442     (0.0%)
P732.12     11976    11946   (0.2%)    11326     (5.4%)
            Both optimisations   Total gain
            ------------------   ----------
    P732.1         8192             4.7%
    P732.2         6126             0.0%
    P732.3         9956             4.7%
    P732.4         4958             1.9%
    P732.5         4986             6.0%
    P732.6         3208             5.2%
    P732.7         9874             4.6%
    P732.8          770             4.5%
    P732.9         6148            -0.2%
    P732.10       20650             0.7%
    P732.11       12442             0.0%
    P732.12       11296             5.8%
                   Allocation        Remembering
           Neither   Simple  (gain)  Subscripts  (gain)
           -------    ----   ------  ----------  ------
P11.1        8572     8188   (4.5%)     7704    (10.1%)
P11.2        6000     6000   (0.0%)     6000     (0.0%)
P11.3        9764     9556   (2.1%)     9644     (1.2%)
P11.4        4848     4776   (1.5%)     4848     (0.0%)
P11.5        4656     4568   (1.9%)     4452     (4.4%)
P11.6        3356     3202   (4.6%)     3218     (4.1%)
P11.7        7844     7728   (1.4%)     7204     (8.2%)
P11.8         644      624   (3.1%)      644     (0.0%)
P11.9        4796     4796   (0.0%)     4484     (6.5%)
P11.10      19236    19140   (0.5%)    19216     (0.1%)
P11.11      12148    12148   (0.0%)    12148     (0.0%)
P11.12      11094    11060   (0.3%)    10616     (4.3%)
            Both optimisations   Total gain
            ------------------   ----------
    P11.1          7660            10.6%
    P11.2          6000             0.0%
    P11.3          9444             3.3%
    P11.4          4768             1.6%
    P11.5          4452             4.4%
    P11.6          3064             8.7%
    P11.7          7032            10.4%
    P11.8           624             3.1%
    P11.9          4484             6.5%
    P11.10        19120             0.6%
    P11.11        12148             0.0%
    P11.12        10584             4.6%
Simplifying: X = X op Y
                        Code      Code
                      Without     With      Gain
                      -------     ----      ----
P732.1                  8292      8192      1.2%
P732.2                  6156      6126      0.5%
P732.3                 10068      9956      1.1%
P732.4                  5088      4958      2.6%
P732.5                  5180      4986      3.7%
P732.6                  3368      3208      4.8%
P732.7                 11438     11296      1.2%
P732.8                   772       770      0.2%
P732.9                  6214      6148      1.1%
P732.10                21086     20650      2.1%
P732.11                12590     12442      1.2%
P732.12                11438     11296      1.2%
P11.1                   8284      7660      7.5%
P11.2                   6220      6000      3.5%
P11.3                  10040      9444      5.9%
P11.4                   5136      4768      7.2%
P11.5                   4800      4452      7.2%
P11.6                   3342      3064      8.3%
P11.7                   7596      7032      7.4%
P11.8                    668       624      6.6%
P11.9                   4724      4484      5.1%
P11.10                 20634     19128      7.3%
P11.11                 12894     12148      5.8%
P11.12                 11492     10584      7.9%
Passing some parameters in registers
                        Code    Total    Incremental
                        Size  Reduction   Reduction
                        ----  ---------   ---------
P732.1   0 registers    8862       -           -
         1 register     8360     5.7%         5.7%
         2 registers    8192     7.6%         1.9%
P732.2   0 registers    7196       -           -
         1 register     6544     9.1%         9.1%
         2 registers    6126    14.9%         5.8%
P732.3   0 registers   10586       -           -
         1 register     9976     5.8%         5.8%
         2 registers    9956     6.0%         0.2%
P732.4   0 registers    5126       -           -
         1 register     4958     3.3%         3.3%
         2 registers    4958     3.3%         0.0%
P732.5   0 registers    5198       -           -
         1 register     5022     3.4%         3.5%
         2 registers    4986     4.1%         0.7%
P732.6   0 registers    3402       -           -
         1 register     3222     5.3%         5.3%
         2 registers    3208     5.7%         0.4%
P732.7   0 registers   10400       -           -
         1 register    10048     3.4%         3.4%
         2 registers    9874     5.0%         1.6%
P732.8   0 registers     840       -           -
         1 register      810     3.6%         3.6%
         2 registers     770     8.3%         4.7%
P732.9   0 registers    6404       -           -
         1 register     6172     3.6%         3.6%
         2 registers    6148     4.0%         0.4%
P732.10  0 registers   21650       -           -
         1 register    20826     3.8%         3.8%
         2 registers   20650     4.6%         0.8%
P732.11  0 registers   13476       -           -
         1 register    12442     7.7%         7.7%
         2 registers   12442     7.7%         0.0%
P732.12  0 registers   11916       -           -
         1 register    11452     3.9%         3.9%
         2 registers   11296     5.2%         1.3%
                        Code    Total    Incremental
                        Size  Reduction   Reduction
                        ----  ---------   ---------
P11.1    0 registers    7796       -           -
         1 register     7756     0.5%         0.5%
         2 registers    7660     1.7%         1.2%
P11.2    0 registers    6192       -           -
         1 register     6072     1.9%         1.9%
         2 registers    6000     3.1%         1.2%
P11.3    0 registers    9564       -           -
         1 register     9448     1.2%         1.2%
         2 registers    9444     1.2%         0.0%
P11.4    0 registers    4776       -           -
         1 register     4768     0.2%         0.2%
         2 registers    4768     0.2%         0.0%
P11.5    0 registers    4508       -           -
         1 register     4452     1.2%         1.2%
         2 registers    4452     1.2%         0.0%
P11.6    0 registers    3098       -           -
         1 register     3064     1.1%         1.1%
         2 registers    3064     1.1%         0.0%
P11.7    0 registers    7124       -           -
         1 register     7096     0.4%         0.4%
         2 registers    7032     1.3%         0.9%
P11.8    0 registers     624       -           -
         1 register      624     0.0%         0.0%
         2 registers     624     0.0%         0.0%
P11.9    0 registers    4520       -           -
         1 register     4488     0.7%         0.7%
         2 registers    4484     0.8%         0.1%
P11.10   0 registers   19302       -           -
         1 register    19166     0.7%         0.7%
         2 registers   19128     0.9%         0.2%
P11.11   0 registers   12364       -           -
         1 register    12152     1.7%         1.7%
         2 registers   12148     1.7%         0.0%
P11.12   0 registers   10734       -           -
         1 register    10648     0.8%         0.8%
         2 registers   10584     1.4%         0.6%
Remembering condition-codes
                  Unknown      Remembered     Gain
                  -------      ----------     ----
P732.1              8220          8192        0.3%
P732.2              6134          6126        0.1%
P732.3              9976          9956        0.2%
P732.4              4968          4958        0.2%
P732.5              4988          4986        0.0%
P732.6              3212          3208        0.1%
P732.7              9880          9874        0.1%
P732.8               770           770        0.0%
P732.9              6150          6148        0.0%
P732.10            20684         20650        0.2%
P732.11            12474         12442        0.2%
P732.12            11318         11296        0.2%
P11.1               7732          7660        0.9%
P11.2               6012          6000        0.3%
P11.3               9516          9444        0.8%
P11.4               4792          4768        0.5%
P11.5               4452          4452        0.0%
P11.6               3076          3064        0.4%
P11.7               7064          7032        0.4%
P11.8                624           624        0.0%
P11.9               4496          4484        0.3%
P11.10             19204         19128        0.4%
P11.11             12192         12148        0.4%
P11.12             10626         10584        0.4%
Forward merging and delaying
                    Forward
         Neither     Merge       Delaying    Merge & Delay
         -------     -----       --------    -------------
P732.1    8192    8172 (0.2%)   8160 (0.4%)   8136 (0.7%)
P732.2    6126    6110 (0.3%)   6054 (1.2%)   6044 (1.3%)
P732.3    9956    9948 (0.2%)   9872 (0.8%)   9864 (0.9%)
P732.4    4958    4950 (0.2%)   4942 (0.3%)   4942 (0.3%)
P732.5    4986    4970 (0.3%)   4982 (0.1%)   4966 (0.4%)
P732.6    3208    3194 (0.4%)   3140 (2.1%)   3122 (2.7%)
P732.7    9874    9752 (1.2%)   9834 (0.4%)   9812 (1.6%)
P732.8     770     764 (0.5%)    738 (4.2%)    728 (5.4%)
P732.9    6148    6136 (0.2%)   6132 (0.3%)   6120 (0.4%)
P732.10  20650   20586 (0.3%)  20558 (0.4%)  20490 (0.8%)
P732.11  12442   12406 (0.3%)  12342 (0.8%)  12306 (1.1%)
P732.12  11296   11280 (0.1%)  11272 (0.2%)  11256 (0.4%)
P11.1     7660    7660 (0.0%)   7660 (0.0%)   7660 (0.0%)
P11.2     6000    6000 (0.0%)   5988 (0.2%)   5988 (0.2%)
P11.3     9444    9444 (0.0%)   9434 (0.1%)   9434 (0.1%)
P11.4     4768    4768 (0.0%)   4768 (0.0%)   4768 (0.0%)
P11.5     4452    4448 (0.1%)   4452 (0.0%)   4448 (0.1%)
P11.6     3064    3064 (0.0%)   3060 (0.1%)   3060 (0.1%)
P11.7     7032    7004 (0.4%)   7032 (0.0%)   7004 (0.4%)
P11.8      624     620 (0.6%)    620 (0.6%)    616 (1.3%)
P11.9     4484    4484 (0.0%)   4484 (0.0%)   4484 (0.0%)
P11.10   19128   19128 (0.0%)  19128 (0.0%)  19128 (0.0%)
P11.11   12148   12136 (0.1%)  12136 (0.1%)  12124 (0.2%)
P11.12   10584   10584 (0.0%)  10584 (0.0%)  10584 (0.0%)
All optimisations
                     None     All      Gain
                     ----     ---      ----
P732.1              11300     8136     28.0%
P732.2               7520     6044     19.6%
P732.3              12286     9864     19.7%
P732.4               5782     4942     14.5%
P732.5               6204     4966     19.9%
P732.6               4004     3122     22.0%
P732.7              11750     9812     16.5%
P732.8                988      728     26.3%
P732.9               6848     6120     10.6%
P732.10             24722    20490     17.1%
P732.11             14618    12306     15.8%
P732.12             14064    11256     20.0%
P11.1                9664     7660     20.7%
P11.2                6588     5988      9.1%
P11.3               11092     9434     14.9%
P11.4                5540     4768     13.9%
P11.5                5572     4448     20.2%
P11.6                3666     3060     16.5%
P11.7                8632     7004     18.7%
P11.8                 752      616     18.1%
P11.9                5256     4484     14.7%
P11.10              23940    19128     20.1%
P11.11              14328    12124     15.4%
P11.12              12816    10584     17.8%
Total CPU time and CPU time in input/output
Internal I/O: communication from phase 1 to phase 3.
External I/O: input of the source and the output of the object files.
          Phase 1      Phase 2       Phase 3           Overall
        -----------   ----------    ----------    ------------------
        Total    %    Total   %     Total   %     Internal  External
         CPU    I/O    CPU   I/O     CPU   I/O      I/O       I/O
All optimisations:
P732.1:   53%    8%    32%    6%     14%    5%      13%        7%
P732.2:   53%    8%    35%    8%     12%    5%      16%        7%
P732.3:   51%    7%    34%    7%     15%    5%      13%        6%
P732.4:   54%    8%    32%    6%     14%    5%      13%        7%
P732.5:   49%   12%    36%    5%     14%    5%      12%        6%
P732.6:   50%    7%    37%    7%     12%    6%      14%        7%
P732.7:   48%    7%    39%    7%     13%    7%      15%        6%
P732.8:   54%    7%    36%    7%     10%    5%      16%        4%
P732.9:   50%    8%    36%    8%     14%    6%      15%        7%
P732.10:  54%    6%    29%    5%     17%    4%      11%        6%
P732.11:  52%    7%    32%    6%     16%    5%      11%        6%
P732.12:  52%    8%    32%    8%     17%    7%      15%        7%
No optimisation:
P732.1:   50%    7%    30%    7%     19%    8%      16%        7%
P732.2:   55%    8%    31%    9%     14%    8%      16%        8%
P732.3:   54%    8%    28%    8%     18%    6%      15%        7%
P732.4:   57%    8%    26%    7%     16%    6%      14%        7%
P732.5:   52%    8%    30%    8%     17%    7%      16%        7%
P732.6:   53%    8%    32%    9%     15%    8%      17%        8%
P732.7:   49%    7%    31%    8%     19%    8%      17%        6%
P732.8:   56%    7%    32%    9%     12%    6%      18%        5%
P732.9:   55%    9%    29%    8%     16%    7%      16%        8%
P732.10:  55%    6%    23%    6%     21%    5%      13%        6%
P732.11:  55%    8%    26%    6%     20%    6%      12%        6%
P732.12:  54%    8%    27%    8%     19%    8%      16%        8%
                     References
                     
Aho, A.V. & Ullman, J.D. (1972)
          "Optimisation of straight line programs"
          SIAM J. Computing, March 1972.
Allen, F.E. & Cocke, J. (1971)
          "A catalogue of optimising techniques"
          Design and Optimisation of Compilers,
          R. Rustin (ed.), Prentice-Hall, 1971.
Basili, V.R. & Turner, A.J. (1975)
          "A transportable extendable compiler".
          Software -  Practice  and  Experience  Vol  5,
          1975.
          
Bell, J.R. (1973)
          "Threaded Code"
          CACM Vol 16, No 6, 1973.
          
Berry, R.E. (1978)
          "Experience with the PASCAL P-compiler"
          Software - Practice and experience
          Vol 8, No 5, 1978.
Berthaud, M. & Griffiths, M. (1973)
          "Incremental  compilation  and  conversational
          interpretation"
          Annual review in automatic programming,
          Vol 7, Part 2, 1973.
          
Bourne, S.R., Birrell, A.D. & Walker, I.W. (1975)
          "Z  code,  an  intermediate  object  code  for
          ALGOL68".
          The Computer Laboratory, Cambridge.
Branquart, P., Cardinael, I. & Lewi, J. (1973)
          "Optimised translation process, application to
          ALGOL68".
          Int.  Comp.  Symp. 1973.
Bron, C. & De Vries, W. (1976)
          "A PASCAL compiler for PDP11 minicomputers".
          Software  -  Practice  and  Experience  Vol 6,
          1976.
Brooker, R.A. et al. (1967)
          "Experience with the compiler-compiler"
          Comp. J. Vol 9, 1967.
Brown, P. (1972)
          "Levels of language for portable software"
          CACM Vol 15, 1972.
Brown, P. (1977)
          "Macro processors"
          Software Portability (ed. P. Brown),
          Cambridge University Press, 1977.
Buckle, J.K. (1978)
          "The ICL 2900 Series"
          Macmillan Press Ltd., 1978.
Cocke, J. (1970)
          "Global common subexpression elimination"
          ACM SIGPLAN Notices, Vol 5, No 7, 1970.
Cocke, J. & Schwartz, J.T. (1970)
          "Programming languages and their compilers"
          Courant Institute of Mathematical Sciences,
          New York University, 1970.
Coleman, S.S., Poole, P.C. & Waite, W.M. (1974)
          "The Mobile programming system, JANUS".
          Software:  Practice  and  Experience.   Vol 4,
          1974.
Dewar, H. (1975)
          "The ISYS system".
          Internal memo, Department of Computer Science,
          University of Edinburgh, 1975.
Feldman, J.A. (1966)
          "A formal semantics for computer languages and
          its application in a compiler-compiler"
          CACM Vol 9, 1966.
Feldman, J.A. & Gries, D. (1968)
          "Translator writing systems"
          CACM Vol 11, 1968.
Gardner, P. (1977)
          "An ALGOL68C bootstrap"
          Memo CSM-20, University of Essex, 1977.
Gilmore, B.A.C. (1980)
          "DEIMOS User's Manual"
          Edinburgh Regional Computing Centre, 1980.
Grosse-Lindemann, C.O. & Nageli, H.H. (1976)
          "Postlude to a PASCAL compiler bootstrap on the
          DEC system 10"
          Software - Practice and Experience, Vol 6, 1976.
Haddon, B.K. & Waite, W.M. (1978)
          "Experience with the Universal Intermediate
          language JANUS"
          Software - Practice and Experience, Vol 8, No 5,
          1978.
Halpern, M.I. (1965)
          "Machine Independence: its technology and
          economics"
          CACM Vol 8, No 12, 1965.
Hawkins, E.N. & Huxtable, D.H.R. (1963)
          "A multi-pass translation scheme for Algol 60"
          Annual  review in automatic programming, No 3,
          1963.
Hecht, M.S. & Ullman, J.D. (1975)
          "A  simple  algorithm  for  global  data  flow
          analysis problems"
          SIAM J.  Vol 4, 1975.
Huxtable, D.H.R. (1964)
          "On writing an optimising translator for Algol
          60"
          Introduction to System Programming,
          P.  Wegner (ed.),
          Academic Press, 1964.
Interdata (1974)
          "Common   Assembler   Language   (CAL)  User's
          Manual".
          Interdata publication number 29-375R04. 1974.
Jensen, K. & Wirth, N. (1976)
          "PASCAL User Manual and Report"
          Springer-Verlag, 1976.
Kildall, G.A. (1973)
          "A unified approach to global program
          optimisation"
          ACM Principles of Programming Languages, 1973.
Klint, P. (1979)
          "Line numbers made cheap"
          CACM Vol 22, No 10, 1979.
Knuth, D.E. (1962)
          "History of writing compilers"
          Proc. ACM 17, 1962.
Knuth, D.E. (1971)
          "An empirical study of FORTRAN programs"
          Software - Practice and Experience, Vol 1, 1971.
Knuth, D.E. (1974)
          "Structured programming with GOTO statements"
          ACM Computing Surveys, Vol 6, No 4, 1974.
Krogdahl, S., Myhre, O., Robertson, P.S. & Syrrist, G.
          (1980)
          "The SCALA project: S-PORT"
          Norwegian Computing Centre working paper, 1980.
Lecarme, O. & Peyrolle-Thomas, M. (1978)
          "Self-compiling compilers: an appraisal of their
          implementation and portability"
          Software - Practice and Experience, Vol 8, No 2,
          1978.
Lowry, E.S. & Medlock, C.W. (1969)
          "Object code optimisation"
          CACM Vol 12, 1969.
McKeeman, W.M. (1965)
          "Peephole Optimisation"
          CACM Vol 8, 1965.
Mock, O., Olsztyn, J. et al. (1958)
          "The problem of programming communication with
          changing machines"
          CACM Vol 1, No 8, 1958.
Nelson, P.A. (1979)
          "A comparison of PASCAL intermediate languages"
          ACM SIGPLAN Notices, Vol 14, No 8, 1979.
Nori, K.V., Ammann, U. et al. (1974 & 1976)
          "The PASCAL <P>-compiler: implementation notes"
          Eidgenossische Technische Hochschule, Zurich,
          1976.
Pavelin, C.J. (1970)
          "The improvement of program behaviour in paged
          computer systems"
          Ph.D. Thesis, University of Edinburgh, 1970.
Poole, P. (1974)
          "Portable and adaptable compilers"
          Advanced course on compiler construction,
          Springer-Verlag 1974.
Richards, M. (1971)
          "The portability of the BCPL compiler".
          Software - Practice and Experience, Vol 1,
          1971.
Richards, M. (1977)
          "Portable Compilers".
          Software   Portability   (ed.    P.    Brown),
          Cambridge University Press, 1977.
Robertson, P.S. (1977)
          "The IMP-77 Language".
          Internal  Report  CSR-19-77,   Department   of
          Computer Science,  University  of  Edinburgh,
          1977.
Russell, W. (1974)
          "A translator for PASCAL P-code"
          Final year project report, Department of Computer
          Science, University of Edinburgh, 1974.
Satterthwaite, E. (1972)
          "Debugging tools for high-level languages"
          Software - Practice and Experience, Vol 2, 1972.
Schneck, P.B. & Angel, E. (1973)
          "A FORTRAN to FORTRAN optimising compiler"
          Computer Journal Vol 16, 1973.
Sibley, R.A. (1961)
          "The SLANG System"
          CACM Vol 4, 1961.
Spier, M.J. (1976)
          "Software Malpractice - A distasteful experience"
          Software - Practice and Experience, Vol 6, 1976.
Steel, T.B. (1961)
          "A first version of UNCOL"
          Proc. AFIPS WJCC 19, 1961.
Stephens, P.D. (1974)
          "The IMP language and compiler"
          Computer Journal Vol 17, 1974.
Stockton-Gaines, R. (1965)
          "On the translation of machine language programs"
          CACM Vol 8, No 12, 1965.
Szymanski, T.G. (1978)
          "Assembling code for machines with span-dependent
          instructions"
          CACM Vol 21, 1978.
Tanenbaum, A.S. (1976)
          "Structured Computer Organisation"
          Prentice/Hall International, 1976.
Thompson, K. & Ritchie, D.M. (1974)
          "The UNIX timesharing system"
          CACM Vol 17, No 7, 1974.
Trout, R.G. (1967)
          "A compiler-compiler system"
          Proc.  ACM, 1967.
Waite, W.M. (1970)
          "The Mobile Programming System: STAGE 2"
          CACM Vol 13, 1970.
Welsh, J. & Quinn, C. (1972)
          "A PASCAL compiler for the ICL 1900 series".
          Software  -  Practice  and  Experience, Vol 2,
          1972.
Whitfield, H. & Wight, A.S. (1973)
          "EMAS - The Edinburgh Multi-Access System"
          Computer Journal Vol 16, 1973.
Wichmann, B.A. (1977)
          "Optimisation"
          Software Portability (ed. P. Brown),
          Cambridge University Press, 1977.
Wichmann, B.A. (1977)
          "How to call procedures, or second thoughts on
          Ackermann's function"
          Software - Practice and Experience, Vol 7, No 3,
          1977.
Williams, M.H. (1978)
          "Long/short address optimisation in assemblers"
          Software - Practice and Experience, Vol 9, No 3,
          1978.
Wulf, W., Johnsson, R.K., Weinstock, C.B., Hobbs, S.O. &
          Geschke, C.M. (1975)
          "The design of an optimising compiler"
          Elsevier Computer Science Library, 1975.