Friday, April 28, 2006

A look at the new compiler design

The higher levels of the compiler did not change at all. The stack effect inference module spits out a dataflow intermediate representation (IR), and the dataflow optimizer rewrites it. The most important dataflow IR node types are:
  • #push - push literals on the data stack
  • #shuffle - permute the elements of the data or call stack
  • #call - call a word
  • #label - an inlined recursive block (loop, etc)
  • #if - conditional with two child nodes
  • #dispatch - jump table with multiple nodes; jumps to the node indexed by a number on the data stack

What is different is that now, machine code is generated directly from optimized dataflow IR, instead of being converted to linear IR and having another optimization pass applied. Paradoxically, the new approach generates better code, and the key is making use of CPU registers in an intelligent way.
The simplest optimization performed by the generator is an old one; the defunct linear IR optimizer did it too. The idea is that inside runs of generated code between calls to runtime functions, we can merge multiple stack height change instructions into one final one at the end. As long as no runtime functions which inspect the stack are called, the temporary inconsistency does not create any issues, and saves instructions.
When the generator encouters a #push node, it allocates enough registers to hold the literals being pushed, and loads the literals into the registers. Then it makes a note that the top n stack elements are actually in registers now. This information is recorded on a phantom stack. Note that the dataflow optimizer already kills dead literal loads, so there is no loss of efficiency from immediately loading the literals into registers without checking usage first.
#shuffle nodes are similar. If the phantom stack contains sufficient register assignments for the shuffle to take place, then the phantom stack is simply rearranged. Otherwise, we add special placeholder "location" objects to the phantom stack, and perform the shuffle. For example, suppose the compiler comes across a swap, but the phantom stack is empty, meaning no stack items are in registers. Then the phantom stack ends up looking like this:
{ T{ ds-loc f 0 } T{ ds-loc f 1 } }
Since locations are indexed with the top of the stack starting from zero, this means that the "real" top of the stack is the "current" item underneath the top, and vice versa. Note that #shuffle nodes do not allocate registers or load values; they perform a purely formal operation. This is because in many cases, values which would be loaded by a shuffle are not actually used. For example, consider a shuffle operation which leaves certain locations fixed, such as over ( x y -- x y x ). The value y is part of the shuffle and is never loaded. It would be inefficient to allocate a register for y and load it.
Any node which ends up calling into the runtime, or calling another compiled word, "commits" the phantom stacks. What this does is look at the phantom stack, and write back any values stored in registers to the stack. It also shuffles stack elements if any location objects are present, taking care not to ignore locations which end up back at their original position. So in effect, a #shuffle node does not generate any code until the next time the phantom stack is "committed", and a #push only allocates some registers without writing the stack.
As I have described it so far, the optimizer only takes care of merging consecutive #push and #shuffle nodes. While this is a useful optimization, the old compiler design made it this far without any problems.
The real improvement is in the handling of intrinsics. In the old compiler, a handful of words, such as set-slot, fixnum+ and so on were compiled inline, as hand-tuned assembly segments. This is still being done, but the framework is far more intelligent, and the idea is that if an intrinsic is compiled with all its inputs as registers on the phantom stack, significant effort can be saved in loading/saving elements from the stack. For example, consider the following snippet:
dup cons? [ ... ] [ ... ] if
This is a typical type check. The dataflow optimizer inlines cons?:
dup tag 2 eq? [ ... ] [ ... ] if
Now during code generation, the dup modifies the phantom stack to note that the top two "real" stack items are actually both the "current" top stack item. When the #call node for tag is encountered, the compiler notes that the tag word has an associated compiler intrinsic, which requests one input and produces one output. Now the compiler first checks the phantom stack to see if the input can be loaded without generating any stack traffic. In this case, the phantom stack holds a location object, so a load intruction does need to be generated. Then the tag intrinsic is generated -- it is a single instruction which masks off the tag bits in the address given. Next up is the literal 2. A register is allocated, the integer 2 is loaded in, and a note of this is made on the phantom stack. The eq? followed by a conditional is handled as one operation; and this time, both inputs are on the phantom stack, as registers -- the return value of tag, and the literal 2. So the eq? does not generate any stack traffic at all. Also note that in this example, dup and 2 together increase the stack height by 2, while eq? decreases it by 1 and the conditional pops another value (the return value of eq?). So no stack height change instruction is generated, since the stack is restored to its original height before any runtime calls are made. Indeed, here is the generated x86 assembly for the above example:
MOV EAX,[ESI]
MOV ECX,16 -- the fixnum 2, with a tag, is encoded as the integer 16
CMP EAX,ECX
JNE false_branch
...
JMP end
: false-branch
...
: end
...

Now, I admit that the old linear IR optimizer would, after considerable effort and some rather ugly hand-coded heuristics, squeeze the same machine code out of the above type-check snippet. However, the code was considerably more complicated, and it did not deal with registers and stack locations in such full generality. Indeed, the old design would not manage to deal with the following:
over tag over tag eq? [ ... ] [ ... ] if
Both stack shuffle words would generate code, and the calls to tag would incur stack load/store overhead. The new design handles the above just fine; it loads the top two stack items into registers, masks off everything but the tag bits, compares them and branches.
Many other far more complicated examples generate good code with the new design, especially on PowerPC where there is a large number of registers available.

No comments: