TinC

Porting the Tiny compiler from Pascal to C

By Pete Gray

Pete Gray is a programmer who specializes in embedded-systems development. He can be contacted at petegray@ieee.org.

Several years ago, Jack W. Crenshaw wrote a series of articles entitled "Let's Build a Compiler"—an introduction to compiler construction culminating in the Tiny compiler and language, coded in Pascal [1]. In this article, I describe how I ported, enhanced, and extended the Tiny compiler to make it a C-based pcode-generating, target-neutral compiler called "TinC." The complete source code for TinC is available from CUJ [2] and from my site [3].

Actually, rewriting Tiny 1.1 wasn't too difficult, mainly because it was originally written using crisp, clean code developed using the "keep it simple" principle. Another reason for the simplicity of the rewrite is that, in general, Pascal ports to C quite easily. (For that matter, there are also both free and commercial Pascal-to-C translators.)

Once ported, I began stripping the assembly-language generation out of the various compiler functions, replacing it with pcode-generating routine calls. To demonstrate a functioning compiler used with a specific target, Jack's original compiler necessarily generated assembly language directly from many levels and branches of the compiler's execution tree. As an example of pcode replacement, the routine performing the compiler function push primary was changed from:

EmitLn("MOVE D0,-(SP)");

to:

Stage (PC_PUSH_PRIMARY,"","","");

The use of pcodes as an intermediary language is a simple way of separating the high-level language being processed from the low-level language being generated. The high- and low-level languages become loosely coupled, but remain highly cohesive.

Another advantage of pcode generation is that instead of using the pcodes to immediately generate assembly language for the required target, they can be written to a staging buffer. Basically, the staging buffer is a big array full of pcodes sequenced in the order they were generated by the compiler. The population of a staging buffer also lets the pcodes be processed (optimized, for example) prior to assembly-language generation.

The original compiler was also interactive, which is good because you can see the generated code, and test enhancements on screen as you type in high-level source code. The new compiler processes either interactively or from an input file (if one is specified). I did this via a simple change to the GetChar routine. However, the introduction of the pcodes and staging buffer presented a slightly more complex problem, particularly when you consider pcode changes due to the addition of an optimizer. I resolved this issue by generating two output files—one with the original pcodes and their generic descriptions, and the other that would be generated following the optimization and containing target-specific assembly language. This had the additional benefit of allowing the examination of pre- and postoptimization code to aid optimizer development. Interactively, the compiler first displays the preoptimized code, then the optimized code. I added line numbers for compilation error reporting, which are required for noninteractive processing. Finally, the addition of a compiler "smart trace," where the compiler shows its execution path among routines (indented, according to the call level), helped me understand the compiler operation, and assisted in compiler development. The trace information is automatically written to a third output file called "trace.txt."

pcode Testing

A set of test programs verified the correct operation of the compiler, staging mechanism, preoptimization pcode generation, and compiler trace. For example, the high-level code:

result = 123

produced the preoptimized pcodes with these descriptions:

Move constant 123 to Primary
Move Primary to global variable result

and this output to the compiler trace file:

+Assignment [result]
   +Next
     +GetOp
     -GetOp [=]
   -Next
   +Next
     +GetNum
     -GetNum [123]
   -Next

Tidying Up

To finish off the initial port, I then split out the definitions into a header file, added new definitions representing such things as the maximum symbol size, keyword tokens, and so on, and renamed a few of the important variables. Most of the compiler routines I kept as-is to make it easy to reference from Jack's original text.

Nonglobal variables—for example, variables that are declared for private use in a routine and have local scope—are typically allocated space in a stack frame and referenced by the compiler via an offset using a Base Pointer. The Base Pointer is typically a register loaded with the Stack Pointer after local variables have been allowed for. However, this is complicated by various factors; for instance, depending on the target architecture, the natural growth of the stack may be either toward or away from memory location 0. The new compiler makes no assumption about stack orientation when generating pcodes. It simply calculates how much space is required for local variables, and passes this to the staging buffer as a pcode that says "Grow stack X units." Likewise, when the compiler encounters a local variable assignment, it passes the offset of the variable within the frame to the staging buffer. It's the back end of the compiler—the assembly language generated by the pcodes—that is written to make allowances for stack orientation and offsets.

Data Types

Data types are tricky, and more so when you allow for multiple (potentially different) target architectures. For example, take the classic char and int data types, typically represented internally by 1 and 2 bytes, respectively. Now, take an (imaginary) byte-oriented and word-oriented architecture, and compare. Make it easy, and say that a char occupies a word in the word-oriented architecture. Table 1 lists the space required for declaring local variables.

For this example, both of the architecture's PUSH/POP mechanisms adjust the Stack Pointer by 1. It becomes obvious from Table 1 that if you don't take the target architecture into account, the stack-frame/offset mechanism breaks. To allow for this, I defined SIZEOFCHAR and SIZEOFINT, and created the SizeOfType()function, which is used by the compiler when calculating storage-space allocation and stack-frame offsets. This lets you retarget the compiler between different architectures simply by changing definitions, rather than recoding the allocation and frame-offset routines.

The Next Chapter: Arrays

Arrays are where things got really interesting. Consider what this reference actually means:

arrayname[element]

It could be described as "the contents of the address specified by arrayname, modified by offset element." In fact, that's how I got the compiler to process it. Recognizing a global or local array declaration was simply a matter of having the declaration routines intercept the [ and ], and adjusting memory allocation accordingly. To get the compiler to process array usage correctly, I had to consider the BNF (Backus-Naur Form) associated with assignments:

<assignment>   ::= <ident> = <expression>
<expression>   ::= <first term> ( <addop> <term> )*
<first term>   ::= <first factor> <rest>
<term>         ::= <factor> <rest>
<rest>         ::= ( <mulop> <factor> )*

Clearly then, when an assignment of the form:

result = arrayname[element]

is encountered, it should be treated as a factor. Due to the smart naming convention used in the original compiler, Factor() is the name of the function where arrays need to be intercepted. When an array reference is found, Factor() needs to process element. At the highest level, element should be regarded as a Boolean expression, which is kind of convenient because the compiler already contains a function, BoolExpression(), which leaves the result in the primary register. Given this collective knowledge, it would seem that in order for Factor() to process an array, it would need to:

Call BoolExpression() to get the element.
Save the result to a secondary register.
Place the address of the array into the primary register.
Add the primary to the secondary.
Move [secondary] to the primary.

However, this process could be problematic. It doesn't allow for the fact that you may be accessing an element of an array whose data type isn't aligned with the target architecture; for example, a word array on a byte-oriented target, which would cause the element calculation to be incorrect. Allowing for this, the enhancement to Factor() to allow for arrays looks like this:

if (Token == '[') {        // an array element...
  Next();
  BoolExpression();        // evaluate element
  MovePrimSec();           // save element to secondary
  MatchString("]");
  LoadAddrVar(SavedValue);  // load address of array to primary
  AdjustSec(SavedValue);    // adjust secondary by array's 
		         // data type
  AddPrimSec();           // add primary to secondary
  MoveSecIPrim();         // move [secondary] to primary
  return;
}

That pretty much takes care of array use as a source, so I'll move on to when an array is a destination. This turns out to be similar, the main difference being that interception must happen within the compiler's Assignment() routine:

if (Token == '[') {   // an array element...
  Next();
  BoolExpression();   // evaluate element
  MovePrimSec();      // save element to secondary
  MatchString("]");		
  LoadAddrVar(Name);  // load address of array to primary
  AdjustSec(Name);    // adjust element by array's data type
  AddPrimSec();       // add primary to secondary
  MatchString("=");
  PushSec();          // push LHS (an address)
  BoolExpression();   // process RHS
  PopSec();           // pop LHS (an address) to secondary
  MovePrimSecI();     // move primary to [secondary]
  }

LHS and RHS refer to the left and right side of the assignment, respectively. Note that the secondary is now pushed before (and popped after) the Assignment routine processes the RHS by calling BoolExpression. This is to allow for the situation where you have arrays referenced on both sides of the assignment; for example:

destination[this_element] = source[that_element]

because the BoolExpression call to process the RHS will, eventually, call Factor() to process that_element, and Factor() uses the secondary register when an array is processed. Without the PushSec and PopSec, the address generated by the LHS would be overwritten when the RHS was processed.

Now consider the tricky cases of array usage; for example:

result = array1 [ array2 [9] ]

Not a problem—the compiler already copes with it. Why? Because when Factor processes array1[element], it calls BoolExpression, which calls Factor to process array2[element], and so on. The Factor routine has, for want of a better term, become indirectly recursive. Just to verify, look at the generic output from the compiler, generated from the aforementioned case (number column added for descriptive purposes):

1. Move constant 9 to Primary.
   Move Primary to Secondary.
   Move address of global variable array2 to Primary.
   Add Primary to Secondary.
2. Move [Secondary] to Primary.
   Move Primary to Secondary.
   Move address of global variable array1 to Primary.
   Add Primary to Secondary.
   Move [Secondary] to Primary.
3. Move Primary to global variable result.

Starting at Step 1, you see the offset (9) being loaded to primary and moved to secondary. Primary is then loaded with the address of array2, and added to secondary. This gives you the address of the value to be used as the offset into array2. At Step 2, the process starts again, but this time the offset is the current contents of the secondary; for example, array2[9], and the base is array1. At Step 3, the final result is stored. As a side-note, observe how the first two pcodes are prime candidates for optimization (they become "Move constant 9 to Secondary").

I do need to clarify one thing regarding the use of the secondary register. You'll notice how I had to perform a push/pop operation in the Assignment routine, due to the fact that the Factor routine, via BoolExpression, uses secondary. Well, as compiler development continues, you'll probably find that the frequency of secondary register usage increases—particularly if you delve into the other realms of indirection, such as pointers. As this occurs, it will no longer be satisfactory for the Factor routine, for example, to simply use the secondary register and rely on the fact that other routines will save and restore the register at the appropriate time. The Factor routine itself, and any other routine requiring the secondary register, will need to save and restore the secondary before and after use. A suitable and more efficient control mechanism might be for the compiler to set an internal flag at the point where the secondary is used, and unset it when freed—then other routines would "know" if they needed to push/pop the secondary before/after using it, simply by testing this "secondary in use" flag. The point being, of course, that regardless of which solution is used, this control mechanism must be implemented for the compiler to function properly.

Conclusion

Performing this port and enhancements clearly demonstrates the tight binding between the BNF and compiler structure. Because of the design of the original compiler, and the application of the simplicity principle, I was able to add array processing to the Tiny language using only a few dozen lines of code. Jack's Tiny compiler was—and still is—an excellent learning tool. I hope that porting it to C as a target-neutral, pcode-generating compiler with a few new bells and whistles is the first step of an ongoing evolutionary process. In the true spirit of the original—and with Jack's blessing—the new compiler is released as open source.

References

http://compilers.iecc.com/crenshaw/
http://www.cuj.com/code/
http://home.comcast.net/~pete.gray/