An Emulator Writer's HOWTO for Static Binary Translation

Readers in a hurry might prefer Michael Steil's very readable Abstract of this page. Since this document was first written a couple of years ago, members of the SBT mailing list led by Neil Bradley have been working on a project to write a multi-target SBT called "Orion". Neil's code uses the outline below as its starting point, and goes further by implementing the compiler-like optimizations which are merely alluded to in passing in this document. The Orion project will be written up separately; we're considering submitting a paper on it to next year's Working Conference on Reverse Engineering in 2007 (we were too slow off the mark for the 2006 submission deadline, plus too much of it would have had to be written in what one of my professors used to call the "future hopeful tense" ;-) ) We intend to update this document using examples from Orion, plus some of the other translators for real machines which have been written since this project was started. We'll also add links to source code for you to look at. But for now, read this (and our embrionic wiki page) and join our online chat group that discusses the subject, at Yahoo Groups.

An Emulator Writer's HOWTO for Static Binary Translation

There is a lot of Computer Science literature on binary translation, both of the sexy dynamic variety and the slightly duller (from the CS point of view) static variety. However most of what you'll find when you research the subject is rather dry and somewhat academic. I don't know of any accessible primers on static translation that someone from the emulation world can pick up and use in a practical project.

So... the aim of this HOWTO document is to give you a very practical method that you can adapt to your own area of expertise, which should pretty much guarantee that if you have already written or worked on a reasonable emulator, you will be able to write a translator which will take a specific program and generate a binary from it that runs on a different architecture, and does so faster than any emulator that you're used to using.

And that's why we're doing this: you have a favorite old video game that you want to run on some system and the emulator for that system works but it just isn't fast enough. Or perhaps you want to port an emulator which works OK on a 300MHz Pentium to a 16MHz ARM in your PDA. A straight port runs your favourite game at 2FPS and you don't think the hardware is likely to get 25 times faster in the near future! Play your cards right and you may get that factor of 25 out of a static binary translator.

This document tries to explain things simply - perhaps too simply at times, and there are a lot of examples which differ from the previous example in small details. This makes the document rather long, but that's deliberate; I don't want to skip too many stages and risk having anyone lose track of the process. Apologies in advance to those of you who think I'm taking it too slowly or have made this document too long.

What are we hoping to achieve?

First of all, let's have a quick recap of regular emulator techology. Most emulators work something like this:

for (;;) {
  if (cycles >= eventTrigger) do_some_timed_event();
  opcode = mem[Reg_PC++];
  cycles = cycles + 1;
  switch (opcode) {

  case TFRregreg:
    operand = mem[Reg_PC++];
    cycles = cycles + 1;
    firstreg = operand>>4;
    secondreg = operand&15;
    *Reg[firstreg] = *Reg[secondreg];
    break;

  case ADDimmediate:
    operand = mem[Reg_PC++];
    cycles = cycles + 1;
    temp = Reg_A + operand + Flag_Carry;
    Flag_Carry = temp >> 8;
    Flag_Overflow = (Reg_A ^ operand ^ (Flag_Carry<<7))>>7;
    Reg_A = temp & 0xff;
    Flag_Negative = Reg_A >> 7;
    Flag_Zero = (Reg_A == 0 ? 1 : 0);
    break;

  case MULdirect:
    ... similar code for all the available opcodes ...

  }
}

OK, we've all written an emulator before, and we know the code above is pretty inefficient, but play along with me... the actual details of how an opcode is emulated aren't too important, and the enclosing loop in this case isn't too onerous. I could have shortened it by removing the cycle count and test, but most real emulators do have something like that and many have a lot more overhead, so this will do for now as a realistic test-bed.

Now let's look at a very short program fragment:

        TFR  B,A
        ADDA #21

First of all we'll look at what is executed when we emulate these two instructions in a traditional emulator. Let's pretend we have a symbolic debugger which prints out the source of a program as it is executed; this is what we would see when we emulate the two lines above. We'll assume we're starting from the first instruction inside the main dispatch loop.

  if (cycles >= eventTrigger) do_some_timed_event();
  opcode = mem[Reg_PC++];
  cycles = cycles + 1;
  switch (opcode) {
  case TFRregreg:
    operand = mem[Reg_PC++];
    cycles = cycles + 1;
    firstreg = operand>>4;
    secondreg = operand&15;
    *Reg[firstreg] = *Reg[secondreg];
    break;
  }
}
for (;;) {
  if (cycles >= eventTrigger) do_some_timed_event();
  opcode = mem[Reg_PC++];
  cycles = cycles + 1;
  switch (opcode) {

  case ADDimmediate:
    operand = mem[Reg_PC++];
    cycles = cycles + 1;
    temp = Reg_A + operand + Flag_Carry;
    Flag_Carry = temp >> 8;
    Flag_Overflow = (Reg_A ^ operand ^ (Flag_Carry<<7))>>7;
    Reg_A = temp & 0xff;
    Flag_Negative = Reg_A >> 7;
    Flag_Zero = (Reg_A == 0 ? 1 : 0);
    break;
  }
}

Now, we're going to skip to the end of the book and look at the solution :-)
This is what we would like to be executed for the above two instructions:

    Reg_A = Reg_B;
    Reg_A = (Reg_A + 21 + Flag_Carry)&0xff;

Does that look like it might be 25 times more efficient that the first version? It just might be. I'll let you in on a secret: I've already done this once for my favourite video game, Cinematronic's Tailgunner, and the translated version was just fast enough to run on the GP32 handheld videogame system, which is an ARM and roughly similar to the Gameboy Advance. So the technique I'm explaining here does work and it makes it possible to run video games at full speed on relatively slow processors such as you'll find in handhelds and PDAs.

Of course, we're making some pretty big assumptions in the short sequence above, such as we are not interested in any of the flags again, we don't care if the addition causes an overflow, and we're not going to bother checking for our special timed events at this point. What we're going to do is come up with an algorithm which lets us find out when those assumptions are valid, and only generate the code to handle flags when we really are forced to, which in most programs will not be very often. The worst case code we could hope to generate for those two instructions would be this:

  Reg_A = Reg_B;
  temp = Reg_A + 21 + Flag_Carry;
  Flag_Carry = temp >> 8;
  Flag_Overflow = (Reg_A ^ 21 ^ (Flag_Carry<<7))>>7;
  Reg_A = temp & 0xff;
  Flag_Negative = Reg_A >> 7;
  Flag_Zero = (Reg_A == 0 ? 1 : 0);
  cycles = cycles + 4; Reg_PC = Reg_PC + 4;
  if (cycles >= eventTrigger) do_some_timed_event();

Even this is pretty good compared to what we originally had in the emulator, because we no longer have to fetch the opcodes and operands, nor dispatch via a switch statement to decode the opcodes.

General structure of the translator

So how do we arrange this transformation? Well, it should be pretty obvious that whoever wrote an emulator for the system you want to translate from has already done the heavy lifting. Basically what we're going to do is write a disassembler which takes a binary and instead of outputing native assembler, outputs the target code, and we're going to steal most of that target code from the existing emulator. If we're lucky we can steal the disassembler framework too, or at the least, mangle the emulator source to turn it into a disassembler.

In my examples I'm going to use C as the target code, but this method works just as well if you're starting from a traditional emulator that's written in the assembly language of the target machine. You still will have a dispatch loop, and the equivalent of a switch jump to a set of labels or procedures written in the target assembly language which decode the operands and execute instructions to accurately simulate the emulated machine. Our algorithm is simply that the disassembler outputs that code for each instruction instead of executing it.

We should be able to very quickly write a first naive hack at this which is still structured primarily as an emulator, but before we do that, let's have a little digression:

Let's go back to the days when you were a good old-fashioned emulator writer. You started by working on someone else's emulator because you thought it was written pretty inefficiently, and you thought you could do better, right? :-) Well, you may not have done the hack below but I guarantee that with hindsight you'll agree you wish you had! The best way to write a new emulator when you already have a working one, even if it isn't very efficient, is to hack up a "dual CPU" testbed. Basically this is the same as a regular emulator, but it contains two completely independent CPU cores. The main dispatch loop calls both cores for every instruction, and it compares all the registers after each one has emulated the instruction to see if they differ. It'll look something like this:

for (;;) {
  assert(Reg_PC_1 == Reg_PC_2);
  opcode = *Reg_PC_1++; Reg_PC_2 = Reg_PC_1;
  engine1(opcode); engine2(opcode);
  assert(Reg_PC_1 == Reg_PC_2);
  assert(Reg_A_1 == Reg_A_2);
  assert(Reg_B_1 == Reg_B_2);
  assert(Flag_Carry_1 == Flag_Carry_2);
  assert(Flag_Negative_1 == Flag_Negative_2);
  assert(Flag_Zero_1 == Flag_Zero_2);
  assert(Flag_Overflow_1 == Flag_Overflow_2);
}

The astute reader will have realised that it is also wise to check the RAM which has changed too, and indeed each engine does need its own copy of the RAM, but I'll save you a little effort and let you know that in practice, unless you break some really obvious part of the load/save-memory code, almost all emulation errors show up first in the registers. Comparing two entire sets of memory is a very expensive proposition, even in something as small as a 64K micro with just 4K of RAM.

OK, back on track... what we're going to do is modify one of the CPUs in our Dual CPU emulator so that instead of dispatching by opcode, it dispatches by the address of the opcode. To make this clear, let's look at our short example. We'll assume the code sequence for the two instructions I gave is in memory at address 0xBCA7...

BCA7:        TFR  B,A
BCA9:        ADDA #21

Our loop (omitting the consistency checks for brevity) now becomes:

for (;;) {
  opcode = mem[Reg_PC_1++];
  engine1(opcode); engine2(Reg_PC_2);
}

and the 'new' emulator dispatch loop now looks like this:

for (;;) {
  if (cycles >= eventTrigger) do_some_timed_event();

  switch (Reg_PC_2) {

  /* This section is automatically generated from the binary */

  case 0x0000:
    ... similar code for the rest of the program ...

  case 0xBCA7:
    cycles = cycles + 2;
    Reg_B_2 = Reg_A_2;
    Reg_PC_2 = 0xBCA9;
    break;

  case 0xBCA9:
    cycles = cycles + 2;
    temp = Reg_A_2 + 21 + Flag_Carry_2;
    Flag_Carry_2 = temp >> 8;
    Flag_Overflow_2 = (Reg_A_2 ^ 21 ^ (Flag_Carry_2<<7))>>7;
    Reg_A_2 = temp & 0xff;
    Flag_Negative_2 = Reg_A_2 >> 7;
    Flag_Zero_2 = (Reg_A_2 == 0 ? 1 : 0);
    Reg_PC_2 = 0xBCAB;
    break;

  case 0xBCAB:
    ... similar code for the rest of the program ...

  /* End of automatically generated section */
  }
}

You can see in the above that by switching from executing by opcode to executing by code address, we've already been able to make a few optimisations. For instance the old emulator had to decode the operand byte of the TFR opcode (to determine that the operands were register A and register B), and we can now see at the time we generate the code above that the operand to the ADD instruction is the literal value 21. We don't need to fetch that operand at emulate time any more, we know what it is at translate time. Also we know what the PC will be on the next instruction so we just set it directly instead of adding to it.

(From now on we'll assume we're talking about the 'new' emulator and drop the "_2" suffix on everything...)

Just to make it absolutely clear at this point, the body of the dispatch loop above was generated by a program similar to a disassembler, containing a dispatch table based on opcodes which eventually executes code something like this:

Reg_PC_translator = 0;
for (;;) {
  fprintf(codefile, "  case 0x%04x:\n", Reg_PC_translator);
  opcode = mem[Reg_PC_translator];
  Reg_PC_translator += 1;
  switch (opcode) {

  case TFRregreg:
    fprintf(codefile, "    cycles = cycles + 2;\n");
    operand = mem[Reg_PC_translator];
    firstreg = operand>>4;
    secondreg = operand&15;
    fprintf(codefile, "    Reg_%s = Reg_%s;\n", Regname[firstreg],
                                                Regname[secondreg]);
    fprintf(codefile, "    break;\n");
    Reg_PC_translator += 1;
    break;

    ... similar code for the other opcodes ...
  }
  fprintf(codefile, "\n");
}

So you should be able to see now that the structure of our translator is pretty much falling into shape; there's only a couple of key steps left. What we have above would actually work, but we still have the overhead of a dispatch loop. So how do we get rid of it? Well, it's pretty easy in principle - we just remove the break; and drop through to the next instruction:

for (;;) {
  if (cycles >= eventTrigger) do_some_timed_event();

  switch (Reg_PC) {

  /* This section is automatically generated from the binary */

  case 0x0000:
    ... similar code for the rest of the program ...

  case 0xBCA7:
    cycles = cycles + 2;
    Reg_A = Reg_B;

  case 0xBCA9:
    cycles = cycles + 2;
    temp = Reg_A + 21 + Flag_Carry;
    Flag_Carry = temp >> 8;
    Flag_Overflow = (Reg_A ^ 21 ^ (Flag_Carry<<7))>>7;;
    Reg_A = temp & 0xff;
    Flag_Negative = Reg_A >> 7;
    Flag_Zero = (Reg_A == 0 ? 1 : 0);
    Reg_PC = 0xBCAE;
    break;

    ... similar code for the rest of the program ...
  /* End of automatically generated section */
  }
}

This is now greatly improved, because we no longer have the overhead of returning to the dispatch loop for the majority of instructions. The only time we'll have to do that is if we execute a branch. We've also been able to drop the setting of the PC and we'll do it only when we need to use the PC, such as when we return to the dispatch loop, or when the code itself needs to use the PC (as with a JSR for example).

So let's extend our test program with "BVS 0x2204" (Branch if Overflow to the code at address 0x2204):

0000: ...
...
2204: L2204: ...
...
BCA7:        TFR  B,A
BCA9:        ADDA #21
BCAB:        BVS  0x2204
BCAF: ...
...

The code for the BVS is pretty simple: if (Flag_Overflow) break;, i.e. stop executing in-line code, and let the dispatch loop cause a jump to the new address:

for (;;) {
  if (cycles >= eventTrigger) do_some_timed_event();

  switch (Reg_PC) {

  /* This section is automatically generated from the binary */

  case 0x0000:
    ... similar code for the rest of the program ...

  case 0xBCA7:
    cycles = cycles + 2;
    Reg_A = Reg_B;

  case 0xBCA9:
    cycles = cycles + 2;
    temp = Reg_A + 21 + Flag_Carry;
    Flag_Carry = temp >> 8;
    Flag_Overflow = (Reg_A ^ 21 ^ (Flag_Carry<<7))>>7;
    Reg_A = temp & 0xff;
    Flag_Negative = Reg_A >> 7;
    Flag_Zero = (Reg_A == 0 ? 1 : 0);

  case 0xBCAB:
    cycles = cycles + 3;
    if (Flag_Overflow) {
      cycles = cycles + 1;
      Reg_PC = 0x2204;
      break;
    }

  case 0xBCAE:
    ... similar code for the rest of the program ...

  /* End of automatically generated section */
  }
}

I hope these small incremental changes aren't too tedious, but I didn't want to skip too many stages in case anyone loses the place. Here's another relatively trivial change but it's an important one:

for (;;) {
  if (cycles >= eventTrigger) do_some_timed_event();

  goto *dispatch[Reg_PC];

  /* This section is automatically generated from the binary */

  L_0000:
    ... similar code for the rest of the program ...

  L_BCA7:
    cycles = cycles + 2;
    Reg_A = Reg_B;

  L_BCA9:
    cycles = cycles + 2;
    temp = Reg_A + 21 + Flag_Carry;
    Flag_Carry = temp >> 8;
    Flag_Overflow = (Reg_A ^ 21 ^ (Flag_Carry<<7))>>7;
    Reg_A = temp & 0xff;
    Flag_Negative = Reg_A >> 7;
    Flag_Zero = (Reg_A == 0 ? 1 : 0);

  L_BCAB:
    cycles = cycles + 3;
    if (Flag_Overflow) {
      cycles = cycles + 1;
      Reg_PC = 0x2204;
      continue;
    }

  L_BCAE:
    ... similar code for the rest of the program ...

  /* End of automatically generated section */
}

I have used GCC's non-standard extension here which allows you to jump through an array of labels. This is how most switch jumps are executed anyway. (At least we hope so; a binary search table would be most inefficient in a program like this). So, the case labels have been replaced with simple labels, and the break out of the case has been replaced with a continue, which takes it back to the start of the enclosing dispatch loop. Otherwise the code is just the same.

Now, let's look more closely at the BVS instruction. In this case we do know at translate time where the destination is, so we could in fact do away with the return to the dispatch loop, and dispatch directly ourselves at the point of the BVS:

  L_BCAB:
    cycles = cycles + 3;
    if (Flag_Overflow) {
      cycles = cycles + 1;
      Reg_PC = 0x2204;
      continue;
    }

  L_BCAE:

Now becomes ...

  L_BCAB:
    cycles = cycles + 3;
    if (Flag_Overflow) {
      cycles = cycles + 1;
      goto L_2204;
    }

  L_BCAE:

This is pretty good but there's one difference between this and the original code. The test for scheduled events based on the cycle count will not be executed. Let's modify the code we generate a little to put it back in, inside the BVS code...

  L_BCAB:
    cycles = cycles + 3;
    if (Flag_Overflow) {
      cycles = cycles + 1;
      if (cycles >= eventTrigger) do_some_timed_event();
      goto L_2204;
    }

  L_BCAE:

You may have noticed that I'd already dropped that test in the two-opcode sequence when we dropped through the first instruction. The reason for this is that scheduled events (eg simulated interrupts from a disk drive) seldom need to be cycle-perfect, so as long as we do the test relatively frequently, it'll be often enough. (If this isn't true for your emulator then don't do this optimisation, but for now I'll assume it's OK).

What I'm going to suggest next is a criterion for when we need to insert one of these tests: I believe that they only need to be done when you execute a branch backwards. It is almost impossible for code to execute in a straight line forwards for any length of time. Even if it does execute straight-line code, it won't take long - programs just are not that big, and straight-line code is not going to be megabytes long. At some point - almost certainly within a K at most - it will branch backwards (or somewhere non-fixed, such as in a return from subroutine). I think those points are sufficiently frequent to be good places to insert the tests that would normally be in the dispatch loop. We will soon be getting to the point where you will see why it's worthwhile to eliminate as many of these tests as possible...

Aside: the most likely event that the majority of video-game emulators needs is a vsync interrupt, to allow flicker-free updating of the screen by the program. Let's say this is in fact the most frequent or time-critical interrupt that you will have. It's frequency is 50Hz. Your emulated CPU is 1MHz, so... you'll be looking for a clock interrupt at least once every 1,000,000/50 = 20,000 cycles. So your accuracy will not be diminished by more than 5% if you force out one of the event tests every 1,000 cycles, right? Not a problem, because you're keeping a cycle count in the translator anyway, and all you have to do is simply add it up while generating code, and if it ever gets to a multiple of 1,000 without you planting an event test due to something like a BVS, then force one out into the code unconditionally.

Incidentally, this same call can be used to adjust the real-time speed of your emulation to match the real CPU. If you've followed the suggestion above, you will have at least 1000 calls per emulated second, so we're dealing with quite a fine-grained timer here: all you need to do is compare the cycles to real time, and wait in a spin-loop until real time has caught up with emulated time. (I'm assuming your emulator is too fast; if it's too slow, you have other problems!). With a 1000Hz test, any delay loops are not going to be noticeable, and your video game will play smoothly. (With too few tests, the response can get 'choppy'). Cycle-perfect accuracy is seldom required for the majority of old arcade games. This is a much cheaper solution to real time accuracy than the traditional one which is to handle timing on each and every instruction. The danger there is that the overhead of the test is enough to slow you down. This isn't a problem on a fast processor, but if you had a fast processor you wouldn't need to create a static translation in the first place!

OK, that's the end of the simple formula to take an emulator and convert it almost mechanically into a static binary translator. If we want to generate smarter code than this, we now have to look at something a little more difficult. Let's look back at what we have so far:

  L_BCA7:
    cycles = cycles + 2;
    Reg_A = Reg_B;

  L_BCA9:
    cycles = cycles + 2;
    temp = Reg_A + 21 + Flag_Carry;
    Flag_Carry = temp >> 8;
    Flag_Overflow = (Reg_A ^ 21 ^ (Flag_Carry<<7))>>7;
    Reg_A = temp & 0xff;
    Flag_Negative = Reg_A >> 7;
    Flag_Zero = (Reg_A == 0 ? 1 : 0);

  L_BCAB:
    cycles = cycles + 3;
    if (Flag_Overflow) {
      cycles = cycles + 1;
      if (cycles >= eventTrigger) do_some_timed_event();
      goto L_2204;
    }

  L_BCAE:

If we know that this code above is only executed in a straight line, and none of the intermediate instructions are ever jumped to directly, we can rewrite the above as follows:

  L_BCA7:
    Reg_A = Reg_B;
    temp = Reg_A + 21 + Flag_Carry;
    Flag_Carry = temp >> 8;
    Flag_Overflow = (Reg_A ^ 21 ^ (Flag_Carry<<7))>>7;
    Reg_A = temp & 0xff;
    Flag_Negative = Reg_A >> 7;
    Flag_Zero = (Reg_A == 0 ? 1 : 0);
    if (Flag_Overflow) {
      cycles = cycles + 8;
      if (cycles >= eventTrigger) do_some_timed_event();
      goto L_2204;
    }

Note the only time we write to cycles is when we branch elsewhere, because the cycle count for the drop-through code is being remembered by the translator and will be output only when it is needed. (This is pretty much the same as how it keeps track of the PC ... and indeed the PC isn't even written to on a branch because it is known implicitly at the destination of a fixed jump as well. We'd write to it on a return from subroutine or an indirect jump, but only then).

I hear you ask Well, that looks interesting, but how do you eliminate those labels, and is the benefit really worth it? It looks like all you save are a few additions to the cycle count...
Good question, and one I'll only partially answer right now. Is the benefit really worth it? No, not as given above. Our example isn't complex enough to show the big win yet. Sorry to make this even longer, but all will soon become clear. Let's add a few more instructions to our program...

0000: ...
...
2204: L2204: ...
...
BCA3:        LDB  0x0100
BCA6:        ASL  B
BCA7:        TFR  B,A
BCA9:        ADDA #21
BCAB:        ADDA B
BCAC:        BVS  0x2204
BCAF: ...
...

which translates to

  L_BCA3:
    Reg_B = mem[0x0100];
    temp = Reg_B << 1;
    Reg_B = temp&0xff;
    Flag_Overflow = Flag_Carry;
    Flag_Carry = temp>>8;
    Flag_Negative = Reg_B >> 7;
    Flag_Zero = (Reg_B == 0 ? 1 : 0);
    Reg_A = Reg_B;
    temp = Reg_A + 21 + Flag_Carry;
    Flag_Carry = temp >> 8;
    Flag_Overflow = (Reg_A ^ 21 ^ (Flag_Carry<<7))>>7;
    Reg_A = temp & 0xff;
    Flag_Negative = Reg_A >> 7;
    Flag_Zero = (Reg_A == 0 ? 1 : 0);
    Reg_A = Reg_B;
    temp = Reg_A + Reg_B + Flag_Carry;
    Flag_Carry = temp >> 8;
    Flag_Overflow = (Reg_A ^ Reg_B ^ (Flag_Carry<<7))>>7;
    Reg_A = temp & 0xff;
    Flag_Negative = Reg_A >> 7;
    Flag_Zero = (Reg_A == 0 ? 1 : 0);
    if (Flag_Overflow) {
      cycles = cycles + 8;
      if (cycles >= eventTrigger) do_some_timed_event();
      goto L_2204;
    }

Blech. It's a lot of code for 6 original instructions. Most of it is coming from the setting of flags which are never used. What would happen if we don't emit the statements which set flags if those flags are never used at all before they are written to again??? I'll tell you, this is what would happen:

  L_BCA3:
    Reg_B = mem[0x0100];
    temp = Reg_B << 1;
    Reg_B = temp&0xff;
    Flag_Carry = temp >> 8;
    Reg_A = Reg_B;
    temp = Reg_A + 21 + Flag_Carry;
    Flag_Carry = temp >> 8;
    Reg_A = temp & 0xff;
    Reg_A = Reg_B;
    temp = Reg_A + Reg_B + Flag_Carry;
    Flag_Carry = temp >> 8;
    Flag_Overflow = (Reg_A ^ Reg_B ^ (Flag_Carry<<7))>>7;
    Reg_A = temp & 0xff;
    Flag_Negative = Reg_A >> 7;
    Flag_Zero = (Reg_A == 0 ? 1 : 0);
    if (Flag_Overflow) {
      cycles = cycles + 8;
      if (cycles >= eventTrigger) do_some_timed_event();
      goto L_2204;
    }

As you can see, most of the flags disappear except for the few we actually need. If we have a longer sequence (which we'll now assume is the case) we can drop even more of the flag calculations, except where we need to use them. Without a particularly complex dataflow analysis of the whole program, we should probably assume that we need to write back all the flags at any point where we jump elsewhere, or any time there is a necessary label which allows code elsewhere to jump into our sequence:

  L_BCA3:
    Reg_B = mem[0x0100];
    temp = Reg_B << 1;
    Reg_B = temp&0xff;
    Flag_Carry = temp >> 8;
    Reg_A = Reg_B;
    temp = Reg_A + 21 + Flag_Carry;
    Flag_Carry = temp >> 8;
    Reg_A = temp & 0xff;
    temp = Reg_A + Reg_B + Flag_Carry;
    Reg_A = temp & 0xff;
    Flag_Carry = temp >> 8;
    Flag_Overflow = (Reg_A ^ Reg_B ^ (Flag_Carry<<7))>>7;
    if (Flag_Overflow) {
      Flag_Negative = Reg_A >> 7;
      Flag_Zero = (Reg_A == 0 ? 1 : 0);
      cycles = cycles + 8;
      if (cycles >= eventTrigger) do_some_timed_event();
      goto L_2204;
    }

You've probably worked out by now that what we are trying to do is work with what the compiler writers call 'basic blocks', i.e. the longest sequence of straight-line code you can find. If we apply some relatively simple compiler techniques to the improved sequence above, it gets even better:

  L_BCA3:
    temp = (mem[0x0100] << 1)&0xff;
    Reg_B = temp&0xff;
    temp = Reg_B + 21 + (temp>>8);
    temp = (temp&0xff) + Reg_B + (temp>>8);
    Reg_A = temp & 0xff;
    Flag_Carry = temp >> 8;
    Flag_Overflow = (Reg_A ^ Reg_B ^ (Flag_Carry<<7))>>7;
    if (Flag_Overflow) {
      Flag_Negative = Reg_A >> 7;
      Flag_Zero = (Reg_A == 0 ? 1 : 0);
      cycles = cycles + 8;
      if (cycles >= eventTrigger) do_some_timed_event();
      goto L_2204;
    }

This is actually pretty damned good code now, although we did have some quite complex compiler-like optimisations to make and I haven't explained how to do those, nor am I going to, but I will show you a neat hack to do the redundant flag removal. But having shown, I hope, why the removal of unused labels is worthwhile, I'll now give you the promised second half of the answer to the question asked some time back: how do you eliminate those labels?

Well, we need to know the flow control of the program and note the destinations of jumps. Those jump destinations are the only opcodes which need to be labelled. To some extent you might be able to work those out statically, by having a slightly smarter disassembler framework which does a tree-walk of the source, starting at the entry point. I.e. whenever it hits a branch or subroutine call, it also goes off to the destination and looks at the code there too. The problem with this is that very few architectures are designed so that you always can determine the jump destination just by looking at the code. Most architectures allow indirect jumps through store or registers, or let you tweak a return address on the stack so that your return from subroutine goes somewhere other than where your subroutine was called from. This is a major pain in the ass and the only practical solution is to profile the execution of the code and mark the jump destinations at run time.

Fortunately we have just the piece of code to do this for us - the original emulator (or even our own one at this point, before we start making 'dangerous' assumptions such as removing the labels in front of any instructions) can be used to execute the code, and it can be modified to note in an array which is the same size as memory whether any code address is ever jumped to.

As you can imagine, any one run through the program is not going to exercise every path in the code, so your profiling has to load this array at startup to get the results of previous runs; it then adds any new info it finds in this run, and writes the cumulative results back out to a file. I found that about a week of playtesting fully exercised a game. It's pretty important that you play through all levels and try all problems, etc. You should execute all areas of code at least once, including things like getting on the high score table. Once you've done this you'll have a list of jump destinations. Use this list to seed a 'smart disassembler' as suggested above. The two techniques together should find all the executable code areas in your binary. The rest of the ROM you're translating is probably data. You might also run it through a 'traditional' disassembler and see if it finds anything that looks like code in some area that you missed. (It helps in this case if your list of jump destinations/tree-walking start points is held in ascii/hex. Makes it easier to add to by hand!)

Note: if you miss any jump targets, your translation will find them because when you do a goto *dispatch[Reg_PC];, all the unlabelled statements will have an entry which takes you to a diagnostic routine, and that routine will print out the value of Reg_PC and tell you to add it to the jump dest table, and rebuild your emulator. If you want to get really fancy, you can do this quietly and pass control over to the other emulator core (the 'traditional' emulator, which should still be in your program even if it isn't being called any more) and it can interpret instructions one by one until it gets to an address which is the entry point to one of our statically translated sequences. That'll let you play the game through, and you can automatically rebuild every time you run it, to incrementally improve the code.

Once you have that list of jump destinations, modify your translator like this (remember our example from early on?):

Reg_PC = 0;
for (;;) {
  if (jumpdest[Reg_PC]) fprintf(codefile, "L_%04x:\n", Reg_PC);
  opcode = mem[Reg_PC];
  Reg_PC += 1;
  switch (opcode) {
...
  case TFRregreg:
    cycles = cycles + 2;
    operand = mem[Reg_PC];
    firstreg = operand>>4;
    secondreg = operand&15;
    fprintf(codefile, "    Reg_%s = Reg_%s;\n", Regname[firstreg],
                                                Regname[secondreg]);
    Reg_PC += 1;
    break;
...
  }
  fprintf(codefile, "\n");
}

We're almost finished now! We just have to remove those redundant flag expressions and we're done. That turns out to be relatively easy. Here's a complete demo program; it's not actually using our example code above as I wrote it some time before writing this document, but it's very obvious what's going on:

// A test program to eliminate redundant code; does *not* do common
// subexpression elimination or constant folding, although in a real
// program the CSE and dead code stuff would best be done by the same program

// We might assume (perhaps wrongly) that we are translating hand-written
// assembly code and that no obvious peephole optimisation is possible.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#ifndef FALSE
#define FALSE (0!=0)
#define TRUE  (0==0)
#endif

#define MAX 1024
long defs[MAX], uses[MAX];
char *codes[MAX];
int nextfree = 0;

/* Metacodes */
#define Always 1
#define Flush 2

/* Real regs */
#define PC  4
#define A   8
#define B   16
#define CC  32

/* Internal variables */
#define TMP 64

/* See http://groups.yahoo.com/group/compilers101/ for a more general
   technique to eliminate redundant code.  Here we are limited to
   30 variables */

void flush(long regs)
{
  int i;
  long common_regs;

  for (i = nextfree-1; i >= 0; --i) {
    if (common_regs = (regs & defs[i])) {
      /* if bits in 'defs' then this needs to be dumped. */
      fprintf(stdout, "// %02d lookfor: %02x  Item %d - def %02x\n",
        nextfree, regs, i, defs[i]);
      defs[i] |= Always;  /* flag this line for printing */
      regs &= ~common_regs; /* don't need to look for these ones any more */
    }
    if (defs[i]&Always) regs |= uses[i]; /* Any unconditionals + parents */
  }

  for (i = 0; i < nextfree; i++) {
      fprintf(stdout, "/* %03d [def:%02x] [use:%02x] */  ",
        i, defs[i], uses[i]);
      fprintf(stdout, "%s %s", (defs[i]&Always ? "   " : "// "), codes[i]);
  }

  nextfree = 0;
}

#define dump(def, use, code) (defs[nextfree] = def, uses[nextfree] = use, \
                              codes[nextfree] = code, nextfree += 1)

int main(int argc, char **argv) {
int i;
  //  Test program for a non-existent machine.  Several details omitted.
  //
  //  $ORG A000
  //  LDB #42
  //  INP A,B
  //  INC B
  //  INC B
  //  ASR A,B
  //  LDB #1
  //  JMP 0xFC08    ; end of basic block ...

  dump(Always,  Always, "           //  LDB #42\n");
  dump(PC, 0, "PC = 0xA000;\n");
  dump(B,  0, "B = 42;\n");
  dump(Always,  Always, "           //  INP A,B\n");
  dump(PC, 0, "PC = 0xA002;\n");
  dump(A | Always,  B | Always, "A = fninput(B);\n"); /* has side effects */
  dump(Always, Always, "           //  INC B\n");
  dump(PC, 0, "PC = 0xA003;\n");
  dump(TMP,B, "tmp = B+1;\n");
  dump(CC, TMP, "CC = (tmp>>8)&1;\n");
  dump(B,  TMP, "B = tmp&255;\n");
  dump(Always, Always, "           //  INC B\n");
  dump(PC, 0, "PC = 0xA004;\n");
  dump(TMP,B, "tmp = B+1;\n");
  dump(CC, TMP, "CC = (tmp>>8)&1;\n");
  dump(B,  TMP, "B = tmp&255;\n");
  dump(Always, Always, "           //  ASR A,B\n");
  dump(PC, 0, "PC = 0xA005;\n");
  dump(A,  B, "A = B>>1;\n");
  dump(Always, Always, "           //  LDB #1\n");
  dump(PC, 0, "PC = 0xA006;\n");
  dump(B,  0, "B = 1;\n");
  dump(Always, Always, "           //  JMP 0xFC08\n");
  dump(PC, 0, "PC = 0xA008;\n");
  dump(Always, Always, "\n");
  dump(PC | Always, Always, "goto *lab[PC = 0xFC08];\n");
  flush(A | B | CC | PC); /* List of registers to force out (excluding TMP) */
  exit(0);
  return(0);
}

What we're doing here is taking the printf that we're currently using, and replacing it with a 'dumpf' which has two extra parameters; one is a list of the registers or internal variables that the statement uses as inputs, and the other is a list of the registers or internal variables that the statement writes to or defines. [Actually in that example we use a simple dump macro, but you know how to expand that to a stdargs procedure that'll take printf-style extra parameters, right?] The very simple def/use chain ensures that for any use of a variable, the immediately previous definition of it is output - and so on, recursively. This particular implementation is a really really simple one and gets its simplicity by using bit flags in long ints to simplify the tables it needs to keep. If your translator needs more than 30 registers or internal variables, a more general technique should be used. (Hint: look up Directed Acyclic Graphs)

The redundant expression elimination code above can also be twisted so that you can generate code conditionally, such as when you want to reconstitute the condition code registers from the split-apart individual flags. If you're generating assembler rather than C and have a processor where the flag bits are compatible with your emulated system, then you might not bother with separate flags and just use the hardware CC, but usually it's easier to calculate the flags explicitly. You can either keep the CC up to date, and examine the relevant bits when you need them, ... or you can do it this way, and just make minimal changes only to the flags that you need, and coalesce those flags into a combined CC only on the rare occassions when it is really needed, such as on the 6809's "PSHS CC" instruction which pushes the whole CC on the stack. However the easy way to manage the CC so that you not only amalgamate the flags when needed, but also generate code to update the individual flag bits beforehand, is to use the redundant code optimiser from above. Any time you translate an opcode which changes one of the CC flags, output this at the end of the sequence:

dump(CC, OV|MI|CA|ZE, "Reg_CC = Flag_Overflow << 8 | Flag_Negative << 7 | "
                               "Flag_Carry << 6 | Flag_Zero << 5;");

If anything uses the combined CC, it'll include that in it's use parameter, and the previous dump instruction will be generated, otherwise it will be dropped.

dump(Always | SP, SP | CC, "mem[Reg_SP++] = Reg_CC;");

So that's about it. I didn't explain the 'compiler-like optimisations' because to do so would require a document as long as this one, and it's time to cut you a break and stop here. Write the generated code carefully and you probably won't need that level of optimisation anyway - or if you have a good compiler it'll do it for you. A second reason for removing unused labels from each translated instruction is that it also allows the compiler to optimise more aggressively too. This is also a reason why we should eliminate most of the calls to the timed event test procedure - most compilers will not be able to work out the side-effects of that procedure (such as writing through a pointer), and the call will automatically cause the current basic block to end. This will severely limit the register optimisations that the compiler can do for you. You may find that within basic blocks, your compiler will keep each emulated register in a real register and only write it back to store at the end of the basic block. In exceptional cases it may slave to real registers throughout the whole emulator procedure or indeed the entire program. It doesn't hurt to declare your emulated registers with the register keyword, such as

  register int Reg_A, Reg_B;

But be careful not to do this for variables that aren't heavily used. For instance we seldom mention Reg_PC any more and it would be a bad choice to force into a register if it caused a more heavily referenced variable to be spilled back to store.

I believe that if you follow the procedure outlined above and are already familiar with emulators, you should not find it hard to write a translator, and take one program and translate it. This isn't high Computer Science, it's just hacking that any competant programmer can do. So pick your favourite video game and emulator and go make a translation!

If you're interested in emulation for other reasons (eg you need to convert some stiffware COBOL program that you've lost source to, from IBM360 binary to ARM code, the techniques are the very same. In fact you may find it easier because if you're working from compiled code rather than hand-written assembler, you're unlikely to hit any of the nasty 'gotcha's that we sometimes do, such as self-modifying code... (which you can handle either by retaining a traditional emulator as well as the translation, or you can manually analyse the code and special-case it if the generated code is consistently the same)

Good luck!

Graham

Postscript: Read this section only after you've actually got a translator running. If it's your first time round you can skip this section as it is concerned with optimisations that give relatively minor incremental improvement but take some effort to set up. Do not try any of this until you have got your translated program working correctly.

Constant expression folding

This is often done for you by the compiler but if you add a mechanism to do it yourself, you're halfway towards adding some other optimisations which need the same structure in place.

A good example of constant folding can be seen in the ARM, where literal constants in opcodes are only eight bits, so larger constants such as 0x12345678 are sometimes loaded in multiple operations, such as

MOV R0, #0x12000000
ORR R0, #0x00340000
ORR R0, #0x00005600
ORR R0, #0x00000078

An alternative of course is to do an LDR directly from an area of store pre-initialised with the constant by the linker, but this sort of thing does come up. When it does we don't want to emulate it the hard way, but would prefer a simple R0 = 0x12345678;

To do this, we have to remember any constant values we happen to see being written to a register, and if we can fold constant operations at translate time then we should do so.

The ideal way to do this would be to build up full expression trees and handle it exactly the way an optimising compiler would. But that is an incredible amount of work, and we can get quite a lot of the benefit from some relatively crude hacks. Firstly, we just need a procedure to remember(reg, value); and then a lookup procedure to get the value of reg if known. If it is known we use the constant instead of reg, if we're doing an operation with another constant, we fold them into one; if the value of reg is not known, we just plant code like normal.

Time for some more code examples. We'll ignore the complexity of decoding the LD and OR instructions to decide whether the parameters are immediate literal constants, direct from store locations, or other registers. We'll keep this short by assuming the operands reg, reg2, addr, const and shift have already been decoded by the translator framework.

The remember() procedure is implementing what is called 'constant propogation'. The code below which calculates val | (const<<shift) is doing 'constant expression folding'.

  case LDimmediate:  /* LD reg, byte-const, shift */
    remember(reg, const<<shift);
    dumpf(reg, 0, "R%d = %d", reg, const<<shift));
    break;

  case LDreg:        /* LD reg, reg2, shift */
    if (known(reg2, &val2)) {
      remember(reg, val2<<shift);
      dumpf(reg, 0, "R%d = %d", reg, val2<<shift));
    } else {
      forget(reg);
      dumpf(reg, reg2, "R%d = R%d<<%d", reg, reg2, shift));
    }
    break;

  case LDdirect:     /* LD reg, storeaddr, shift */
    forget(reg);
    dumpf(reg, 0, "R%d = mem[%d]<<%d", reg, addr, shift));
    break;

  case ORimmediate:  /* OR reg, byte-const, shift */
    if (known(reg, &val)) {
      remember(reg, val | (const<<shift));
      dumpf(reg, 0, "R%d = %d", reg, val | (const<<shift));
    } else {
      forget(reg);
      dumpf(reg, reg, "R%d = R%d | %d", reg, reg, const<<shift));
    }
    break;

  case ORreg:        /* OR reg, reg2, shift */
    if (known(reg, &val)) {
      if (known(reg2, &val2)) {
        remember(reg, val | (val2<<shift));
        dumpf(reg, 0, "R%d = %d", reg, val | (val2<<shift));
      } else {
        forget(reg);
        dumpf(reg, reg2, "R%d = %d | (R%d<<%d)", reg, val, reg2, shift);
      }
    } else if (known(reg2, &val2)) {
        forget(reg);
        dumpf(reg, reg, "R%d = R%d | %d", reg, reg, (val2<<shift));
    } else {
      forget(reg);
      dumpf(reg, reg|reg2, "R%d = R%d | (R%d<<%d)", reg, reg, reg2, const<<shift));
    }
    break;

  case ORdirect:     /* OR reg, storeaddr, shift */
    if (known(reg, &val)) {
      dumpf(reg, 0, "R%d = %d | (mem[%d]<<%d)", reg, val, addr, shift));
    } else {
      forget(reg);
      dumpf(reg, reg, "R%d = R%d | (mem[%d]<<%d)", reg, reg, addr, shift));
    }
    break;

Beware when you start this kind of code though - you have to be sure to call forget(reg) every time you write to a register with something that isn't a constant. So... in our short example above, we would generate calls to dump the following code:

R0 = 0x12000000;
R0 = 0x12340000;
R0 = 0x12345600;
R0 = 0x12345678;

This isn't nearly as bad as it looks because the redundant code elimination algorithm will only print out the last value of R0. None of the earlier ones have any dependents, so they will be dropped.

Now, why are we going to such lengths? Well, do you remember the code you added to record jump destinations, in order to determine the starts and finishes of basic blocks? We're going to modify that code a little so that we record for every instruction the values of all the registers at the point of that instruction! [Aside: in fact you can get away with recording the register values only at the entry points to every basic block. This speeds it up considerably for no loss of information and allows you to playtest in real time...] We do this by having several parallel arrays which are each the same size as the code memory, and noting in each element the first value of the register that we see at that address. If we ever see a different value for the register at that address, we tag the memory as 'deadbeef'. After many many runs of the code under the emulator, we soon converge on a set of register contents which are invariant.

We do this so that when we get to code translation time, we can treat a register as if it were a constant, if its value is invariant. This lets the constant-folding code above have even more data to play with. A perfect example of this is our original (imaginary) code sequence,

        TFR  B,A
        ADDA #21

I didn't make a big thing of it, but the ADD instruction as coded was actually an "add with carry", and we never knew the status of the carry flag at that point. However in real code (and this does happen on some real machines) the programmer would clear the carry first:

        CLC
        TFR  B,A
        ADDA #21

In that case, the code we could generate - in our original scheme, before we added all the fancy stuff - would have been:

  Flag_Carry = 0;
  Reg_A = Reg_B;
  temp = Reg_A + 21 + Flag_Carry;
  Flag_Carry = temp >> 8;
  Flag_Overflow = (Reg_A ^ 21 ^ (Flag_Carry<<7))>>7;
  Reg_A = temp & 0xff;
  Flag_Negative = Reg_A >> 7;
  Flag_Zero = (Reg_A == 0 ? 1 : 0);

So, not to repeat all the details again, the whole set of statements above should now collapse down to the rather nice:

Reg_A = (Reg_B + 21) & 0xff;

(This only worked because the whole sequence was a basic block. If the second or third instruction could be jumped to, the optimisations would break the code...)

But what has this to do with the trick of remembering the values of registers from running under the emulator? Well, it's because programmers for small micros tend to shave every byte. They might have a label in front of the "ADDA #21" and that label is branched to from places all over the code, where the programmer already knows that the carry is clear. So although remembering the value of registers in straight line code is relatively easy, optimising over the whole of the program is usually rather difficult! The normal way to do this would be with global dataflow analysis. That's hard work. People write theses about it. What we've done here is a hack, and to be honest it could in principle sometimes give the wrong result, but it's a question of low probabilities, and the longer you run the code to gather stats, the less likely this approximation is going to differ from reality. And if it does mess up once in a blue moon, it's just a video game, right?

So not only have we avoided doing global dataflow analysis, we also have not built up an Abstract Syntax Tree nor have we written a general-purpose optimiser using tree rewriting rules. If you want to do that stuff, by all means do so, but you're looking at the final 2 or 3% of improvement that requires 97% of effort!

There is one more optimisation I am going to mention, and I haven't done this myself yet so I'm not 100% sure of the details but I think it will turn out to be relatively easy. In fact I cheated in the earlier description by assuming that we could do this, though by the time you get here you've probably tried implementing everything yourself and realised that this particular tweak wasn't handled yet. We had something like this:

    Flag_Carry = temp >> 8;
    Flag_Overflow = (Reg_A ^ Reg_B ^ (Flag_Carry<<7))>>7;
    Flag_Negative = Reg_A >> 7;
    Flag_Zero = (Reg_A == 0 ? 1 : 0);
    if (Flag_Overflow) {
      goto L_2204;
    }

which in the next example became:

    Flag_Carry = temp >> 8;
    Flag_Overflow = (Reg_A ^ Reg_B ^ (Flag_Carry<<7))>>7;
    if (Flag_Overflow) {
      Flag_Negative = Reg_A >> 7;
      Flag_Zero = (Reg_A == 0 ? 1 : 0);
      goto L_2204;
    }

This was because Flag_Negative and Flag_Zero were not needed in the drop-through code, and by moving them into the if/then clause (where they are needed) the drop-through case will execute faster. I believe this optimisation can be done by slightly modifying the flush() to mark, but not remove, statements which have already been conditionally output. I'll work on this some day but you're all welcome to get to it first!

That really is it for me this time.

Graham Toal

gtoal at gtoal dot com