Thursday, February 01, 2007

ARM port day 3

The Factor compiler backend architecture has multiple tiers. There are several levels of compilation one can implement:
  • Basic stack shuffling, subroutine calls, branching, with most primitives being compiled as calls into C code
  • Intrinsics can be selectively defined to replace calls to primitives with open-coded assembly making use of register allocation and minimizing stack shuffling
  • C library interface and callback support

The last one is not really optional since key components of Factor (non-blocking I/O, transcendental math functions, etc) depend on the C library interface. However, traditionally I have done that part last when porting Factor to a new architecture, since it usually requires the most fiddling and is easier to test if everything else works. Open-coding certain primitives such as slot, fixnum+ can bring huge performance improvements because it eliminates branching and memory transfers from inner loops, but this is optional.

Today I only got as far as the first level of support in the ARM backend. Basically, one has to implement various words which are called by the compiler to generate branches, generate calls, load stack values to registers, write registers to the stack, and so on. There is only a small number of such words. Also, one must describe the register file of the architecture for the register allocator. While the register allocator is not used much without intrinsics, it still plays a role in the compilation of stack shuffles (which never result in calls into C, they're always open-coded). So I worked on this one file for the most part:

Despite claiming to be a RISC architecture, ARM feels more like CISC, with the multiple load/store instructions (which I haven't used yet) and complex operand and addressing modes, including automatic pre- and post- increment on memory loads and stores. This makes ARM code a pleasure to write compared to say, PowerPC. I also find it nicer than x86 because it doesn't have odd limitations and bizarre legacy cruft.

Getting relocation right was the hardest part, really. When the Factor compiler builds up a compiled code block corresponding to a word, it does not know the final address the code will be stored at; and indeed, even if it did, saving the image and restarting Factor may very well load the image at a different base address. So the compiler has to retain as much information as is needed for the VM to be able to "fix up" compiled code when the image is loaded. Relocation information is recorded in a table along with every compiled code block, and essentially consists of a list of instructions; "store the address of this word at this offset", "fix up this absolute address for the current code heap base", and so on. Since ARM instruction encoding is different from x86 and PowerPC, I had to add some new relocation instructions for updating offsets in packed ARM instruction fields.

The annoying part came when I was coding support for compiling references to literals. On x86, an immediate operand can be 32 bits; on PowerPC, only 16 bits, so a load of a 32 bit value is synthesized as a pair of instructions, a load followed by a shift-load. On ARM, only 8 bit immediates with a 4 bit rotation are supported. Constructing larger integers and addresses from 8 bit chunks is retarded; thankfully, ARM does support PC-relative addressing, so what compilers tend to do is deposit the large integers close to the compiled code (close enough for a 12 bit displacement on a load instruction), and load it via PC-relative addressing.

A similar situation occurs when one wishes to call the C VM. Calls between compiled words are not a problem, because a PC-relative branch can have up to 26 bits displacement, which is enough for a pretty large code heap. However, there is no guarantee that the Linux kernel will map the C VM text segment close enough to the mmapped code heap for a 26 bit branch to be sufficient. Again, on PowerPC, we use a pair of instructions to synthesize a 32 bit address, then jump there. On ARM, I decided to go with a simpler approach which doesn't require relocation at all; I just put the list of primitives in a global register variable, R8. Then a call to a primitive call be done by a displaced load from R8 with the constant displacement being a primitive number.

I also found a bug in the ARM assembler, added a unit test, and fixed it. I like being able to unit test assembly code, I only wish that I had the foresight to implement the post-0.85 assembler design from the very start, so that the x86 and PowerPC assemblers could be unit tested when I was working on them. Nowadays they are relative stable and rarely change, but at one point I might write unit tests for them anyway.

So after 3 days work, what I have is essentially a subroutine threaded compiler for ARM with minor optimizations. This is a rather trivial accomplishment in itself; however, tomorrow I will start writing intrinsics for the most critical primitives, and this will result in a noticeable performance boost, with the full force of the register allocator and dataflow optimizer coming into play.

No comments: