Saturday, November 29, 2008

Faster generic arithmetic and integer overflow checking

While the Factor compiler does a pretty good job of inferring types and eliminating dispatch where possible, runtime polymorphism still ends up being used a lot. Integer arithmetic presents an additional challenge. Factor supports arbitrary-precision integers, and it tries to make this support transparent. However, the flip side of this is that even if the compiler can figure out that all of the values in an expression are integers, it may not be able to eliminate the overflow check, depending on whether or not the ranges of the values are known. Because of this, making runtime polymorphism efficient is still important. In the last few days, I worked on making integer arithmetic more efficient.

Consider a word like +. If compile-time type information is not available, this word has to look at the types of the top two stack items, and take appropriate action depending on whether they are fixnums, bignums, ratios, floats, or complex numbers. Until now, this dispatch was implemented by two indirect jumps. The type of the first object is used to key a dispatch table, where each entry then looks at the type of the second object and keys into a second-level dispatch table; the entries in the latter are the actual methods implementing the arithmetic operation.

In Factor, the generic word system generates Factor code for performing method dispatch. If you look at the definition of a word such as + in an older build, you'll see this two-level jump table pattern, implemented with Factor's dispatch combinator:
( scratchpad ) \ + def>> .
[
over tag {
[
dup tag {
[ fixnum=>+ ]
[ [ >bignum ] dip bignum=>+ ]
[ \ + no-math-method ]
[ \ + no-math-method ]
[ ratio=>+ ]
[ [ >float ] dip float=>+ ]
[ complex=>+ ]
[ \ + no-math-method ]
} dispatch
]
[
dup tag {
[ >bignum bignum=>+ ]
[ bignum=>+ ]
[ \ + no-math-method ]
[ \ + no-math-method ]
[ ratio=>+ ]
[ [ >float ] dip float=>+ ]
[ complex=>+ ]
[ \ + no-math-method ]
} dispatch
]
[ \ + no-math-method ]
[ \ + no-math-method ]
[
dup tag {
[ ratio=>+ ]
[ ratio=>+ ]
[ \ + no-math-method ]
[ \ + no-math-method ]
[ ratio=>+ ]
[ [ >float ] dip float=>+ ]
[ complex=>+ ]
[ \ + no-math-method ]
} dispatch
]
[
dup tag {
[ >float float=>+ ]
[ >float float=>+ ]
[ \ + no-math-method ]
[ \ + no-math-method ]
[ >float float=>+ ]
[ float=>+ ]
[ complex=>+ ]
[ \ + no-math-method ]
} dispatch
]
[
dup tag {
[ complex=>+ ]
[ complex=>+ ]
[ \ + no-math-method ]
[ \ + no-math-method ]
[ complex=>+ ]
[ complex=>+ ]
[ complex=>+ ]
[ \ + no-math-method ]
} dispatch
]
[ \ + no-math-method ]
} dispatch
]

The dispatch combinator compiles down to an indirect jump, and indirect jumps are expensive on modern CPUs. Also, note that this dispatch scheme did not favor any single numeric type over another. However, in practice, operations on bignums and ratios will always be slow, and dispatch overhead is negligible there; whereas code that works with floating point numbers and complex numbers can only be fast if the compiler can infer types and unbox values statically, in which case there is no runtime dispatch at all. So the performance of runtime arithmetic dispatch on types other than fixnums is largely irrelevant in Factor. Indeed, the overwhelming majority of calls to generic arithmetic operations involve fixnums, so some kind of trick to avoid two indirect jumps in this common case is needed.

There are many solutions, but the one I settled on is pretty much the simplest one. In Factor, all values are tagged pointers (except for unboxed floats and the like, which the compiler uses internally, but which cannot be observed in their unboxed state from the outside). A tagged pointer is a value where the low 3 bits encode a type and the remaining bits encode the payload. If the type tag is 0, the value is interpreted as an integer; otherwise, it is a pointer. All values in the heap are aligned on 8-byte boundaries, so if we mask off the 3 tag bits, we get a valid pointer.

So if we have two values, stored in registers x and y, then both have a tag of 0 if and only if their bitwise OR has a tag of zero. That is, we can check that both values are fixnums with a single conditional.
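
To make the representation from the last two paragraphs concrete, here is a minimal C sketch of the tag layout and the combined check. The names (cell, TAG_MASK, both_fixnums and so on) are made up for illustration and are not taken from the VM:
#include <stdint.h>
#include <stdbool.h>

/* A tagged value: payload in the high bits, type tag in the low 3 bits. */
typedef uintptr_t cell;

#define TAG_MASK   7          /* low 3 bits hold the type tag       */
#define FIXNUM_TAG 0          /* tag 0 means the value is a fixnum  */

#define TAG(x)     ((x) & TAG_MASK)           /* extract the type tag        */
#define UNTAG(x)   ((x) & ~(cell)TAG_MASK)    /* mask the tag off a pointer  */

/* Both values are fixnums exactly when their bitwise OR still has a zero
   tag, so one mask-and-compare covers both checks at once. */
static bool both_fixnums(cell x, cell y)
{
    return ((x | y) & TAG_MASK) == FIXNUM_TAG;
}

This is the check that both-fixnums? performs; as shown further down, it compiles to an or, a mask, and a compare.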

Here is the new Factor code generated for the dispatch performed by +:
( scratchpad ) \ + def>> .
[
both-fixnums?
[ { fixnum fixnum } declare fixnum=>+ ] [
over tag {
[
dup tag {
[ { fixnum fixnum } declare fixnum=>+ ]
[
{ fixnum bignum } declare
[ >bignum ] dip bignum=>+
]
[
{ fixnum tuple } declare
\ + no-math-method
]
[ \ + no-math-method ]
[ { fixnum ratio } declare ratio=>+ ]
[
{ fixnum float } declare [ >float ] dip
float=>+
]
[ { fixnum complex } declare complex=>+ ]
[
{ fixnum POSTPONE: f } declare
\ + no-math-method
]
} dispatch
]
[
dup tag {
[
{ bignum fixnum } declare
>bignum bignum=>+
]
[ { bignum bignum } declare bignum=>+ ]
[
{ bignum tuple } declare
\ + no-math-method
]
[ \ + no-math-method ]
[ { bignum ratio } declare ratio=>+ ]
[
{ bignum float } declare [ >float ] dip
float=>+
]
[ { bignum complex } declare complex=>+ ]
[
{ bignum POSTPONE: f } declare
\ + no-math-method
]
} dispatch
]
[ \ + no-math-method ]
[ \ + no-math-method ]
[
dup tag {
[ { ratio fixnum } declare ratio=>+ ]
[ { ratio bignum } declare ratio=>+ ]
[
{ ratio tuple } declare
\ + no-math-method
]
[ \ + no-math-method ]
[ { ratio ratio } declare ratio=>+ ]
[
{ ratio float } declare [ >float ] dip
float=>+
]
[ { ratio complex } declare complex=>+ ]
[
{ ratio POSTPONE: f } declare
\ + no-math-method
]
} dispatch
]
[
dup tag {
[
{ float fixnum } declare
>float float=>+
]
[
{ float bignum } declare
>float float=>+
]
[
{ float tuple } declare
\ + no-math-method
]
[ \ + no-math-method ]
[ { float ratio } declare >float float=>+ ]
[ { float float } declare float=>+ ]
[ { float complex } declare complex=>+ ]
[
{ float POSTPONE: f } declare
\ + no-math-method
]
} dispatch
]
[
dup tag {
[ { complex fixnum } declare complex=>+ ]
[ { complex bignum } declare complex=>+ ]
[
{ complex tuple } declare
\ + no-math-method
]
[ \ + no-math-method ]
[ { complex ratio } declare complex=>+ ]
[ { complex float } declare complex=>+ ]
[ { complex complex } declare complex=>+ ]
[
{ complex POSTPONE: f } declare
\ + no-math-method
]
} dispatch
]
[ \ + no-math-method ]
} dispatch
] if
]

Notice that it is essentially the same as the old version, except the two-level jump table has been wrapped in a conditional. If both-fixnums? outputs true, we go directly to the fixnum case; otherwise we do the two-level dispatch.

Since both-fixnums? needs to bitwise OR pointers together, I had no choice but to make it a VM primitive. The non-optimizing compiler inlines a fixed code sequence for calls to both-fixnums?, and the optimizing compiler expands it into a series of low-level IR operations:
( scratchpad ) [ both-fixnums? [ "Fixnums!" ] [ "At least one is not a fixnum" ] if ] test-mr mr.
=== word: ( gensym ), label: ( gensym )

_label 0
##prologue
_label 1
##peek V int-regs 692262 D 0
##peek V int-regs 692263 D 1
##copy V int-regs 692264 V int-regs 692262
##or V int-regs 692264 V int-regs 692264 V int-regs 692263
##copy V int-regs 692265 V int-regs 692264
##and-imm V int-regs 692265 V int-regs 692265 7
_compare-imm-branch 3 V int-regs 692265 0 cc/=
_label 2
##inc-d 1
##load-indirect V int-regs 692269 "Fixnums!"
##replace V int-regs 692269 D 0
##epilogue
##return
_label 3
##inc-d 1
##load-indirect V int-regs 692270 "At least one is not a fixnum"
##replace V int-regs 692270 D 0
_label 4
##epilogue
##return

So the machine code generated for a generic arithmetic word such as + now begins as follows:
    mov    (%r14),%rax
    mov    -0x8(%r14),%rcx
    or     %rcx,%rax
    and    $0x7,%rax
    cmp    $0x0,%rax
    jne    slow_dispatch
    ... handle fixnum+fixnum here ...
    ret
slow_dispatch:
    ... do table dispatch here ...

We load two values from the data stack, OR them together, mask off the low 3 bits, compare with zero, and branch on the result. Indeed, those readers who know the ways of x86 assembly will recognize that there is a shorter code sequence that accomplishes the same task:
    mov    (%r14),%rax
    mov    -0x8(%r14),%rcx
    or     %rcx,%rax
    test   $0x7,%rax
    jnz    slow_dispatch
    ... handle fixnum+fixnum here ...
    ret
slow_dispatch:
    ... do table dispatch here ...

Once I figure out a nice way to incorporate the test instruction into my code generator and value numbering passes, I will switch things up and eliminate the unnecessary instruction there.

This takes care of making dispatch faster. The next challenge was speeding up integer overflow handling. If the optimizing compiler's high-level optimizer can prove that a call to + has fixnum inputs and, in addition, that the result will not overflow (either because the ranges of the input values are constrained, or because the output is passed to bitand), it will replace it with a call to the fixnum+fast primitive. The low-level optimizer turns a call to fixnum+fast into a single low-level IR instruction, and this becomes a single machine instruction. Very efficient.

However, if the high-level optimizer cannot prove that the result will not overflow, but it does know the inputs are fixnums, it replaces the call to + with a call to fixnum+. This still avoids dispatch overhead; however, until now, the low-level optimizer did not do anything special for fixnum+: it would compile it as a subroutine call to a VM function, which added the two integers and checked for overflow in C. This was slow, because C does not provide a way to test the CPU's overflow flag.
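
Here is a rough C sketch of the kind of check such a helper is forced to do without access to the overflow flag: add at full width, then test whether the result still fits in the fixnum range. The names and exact bounds are illustrative, not the VM's actual code:
#include <stdint.h>
#include <stdbool.h>

/* Three bits of every cell are taken up by the type tag, so a fixnum
   carries that many fewer bits of payload than a machine word. */
enum { TAG_BITS = 3 };
#define FIXNUM_MAX (INTPTR_MAX >> TAG_BITS)
#define FIXNUM_MIN (-FIXNUM_MAX - 1)

/* Add two untagged fixnums. Returns true and stores the sum if the result
   still fits in a fixnum; returns false if the caller has to allocate a
   bignum instead. The addition itself cannot overflow intptr_t, because
   both operands fit in a word minus TAG_BITS bits. */
static bool fixnum_add_checked(intptr_t x, intptr_t y, intptr_t *sum)
{
    *sum = x + y;
    return *sum >= FIXNUM_MIN && *sum <= FIXNUM_MAX;
}

The machine code shown below does the same job with a single add followed by a conditional branch on the overflow flag.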

I made the low-level optimizer aware of the fixnum+ word; it converts it into a low-level instruction which can then participate in value numbering and register allocation. The CPU-specific backend is then responsible for generating an optimal machine code sequence for fixnum+. On x86 and PowerPC, this can use the CPU's overflow flag, which is more efficient than trying to simulate the same check in C. If the overflow flag is set after the addition, fixnum+ performs a subroutine call into the VM; in the common case, where there is no overflow, there is no call and the arithmetic is done entirely inline.

Here is an example of machine code emitted for fixnum+:
    sub    $0x8,%r14
    mov    (%r14),%rax
    mov    0x8(%r14),%rcx
    add    %rcx,%rax
    jno    no_overflow
    ... handle overflow ...
no_overflow:
    mov    %rax,(%r14)
    retq

I omitted the lengthy setup for the subroutine call into the VM, and register assignments will vary depending on the placement of fixnum+ in the code.

The results were very interesting. Some benchmarks, such as spectral-norm and mandelbrot, showed no improvement, because the compiler eliminates all the dispatch and overflow checks from the inner loops anyway. Other benchmarks showed an impressive gain. This is good because it means that I didn't waste my time implementing the above, but it is also bad because it shows that the high-level optimizer could probably do a better job of inferring types in some of these benchmarks!

Here are the results; I ran each benchmark 5 times and measured the runtime in milliseconds.
Benchmark           Before (ms)                   After (ms)
fasta               9492 8985 8975 9310 9219      7841 7818 8115 7807 7777
reverse-complement  5815 5611 5628 5645 5645      5176 5152 5177 5173 5289
binary-trees        2911 2867 2870 2890 2894      2163 2142 2186 2168 2136
fib2                113 112 114 112 112           82 83 84 85 85
fib3                301 302 301 300 301           260 261 259 260 260
project-euler.134   427 427 427 426 424           314 313 315 318 313

3 comments:

Wolf550e said...

RE: "jno no_overflow"

Shouldn't you take a jump on the uncommon case and fall through on the common case where there is no overflow?

Slava Pestov said...

Wolf550e: you'd think so, but in my testing reversing the conditional actually led to slower performance.

Anonymous said...

Instead of:

and $0x7,%rax
cmp $0x0,%rax
jne slow_dispatch

You can just do:

and $0x7,%rax
jnz slow_dispatch

(note that jnz and jne are actually the same opcode). All arithmetic and bitwise ops on the x86 family set the zero flag.

The best way to introduce this, I think, is through a peephole optimizer; it's been a while since I've looked at Factor, so I don't recall if it has one. But you can drop the 2nd instruction in a sequence matching:

and|or|xor \1,\2
cmp $0x0, \2
je|jne \3