Thursday, November 06, 2008

Further performance gains

One thing I've learned over the last few months is that minor improvements to performance, accumulated over time, are just as important as big architectural changes which result in big gains. With the new code generator in place, I spent some time working on a few tweaks I've been thinking about for a while.

Tuple dispatch overview


First up, I sped up method dispatch on tuples by eliminating some unnecessary conditional branches and memory accesses. In Factor, the object system is implemented in the language, and so all dispatch is done by generated Factor code. This code is generated at compile time. For example, we can define a tuple,
TUPLE: dog name breed ;

and look at the code generated for the breed accessor:
( scratchpad ) \ breed>> def>> .
[
dup tag dup 2 eq? [
drop dup { tuple } declare 1 slot { array } declare
7 slot dup \ dog eq?
[ drop { dog } declare dog=>breed>> ]
[ drop object=>breed>> ] if
] [ drop object=>breed>> ] if
]

As you can see above, the generated code uses various low-level primitives to accelerate dispatch. Usually you will never work with these primitives directly. The words named class=>generic are not real words in the dictionary; they are word objects corresponding to methods. So methods are essentially words, except they're not part of any vocabulary. Again, this is an implementation detail and you don't need to be aware of how methods are implemented behind the scenes. In the image where I defined this tuple, I had no other tuples with a slot named breed, so the generic accessor only has one method, and the code generated for a generic with one method is very simple. For generics with many methods, more complex code is generated; if you have enough methods, a hashtable dispatch is generated.

The general philosophy of method dispatch on tuples is as follows. To each tuple, we associate an integer which is the length of the superclass chain from the tuple up to the root class of tuples, tuple. I call this the tuple's "echelon". Every tuple instance in the heap holds a pointer to a tuple layout object in its first slot. The tuple layout object holds the echelon, the sequence of superclasses, and the tuple size.

The tuple class methods of a generic word are sorted into echelons, and dispatch proceeds by starting at the highest echelon and going down until a suitable method is found (or an error is thrown). At each echelon, the generic word looks at the corresponding entry in the superclass list of the tuple on the stack, and performs a hashtable lookup. So dispatch runs in O(n) time, where n is the maximum echelon number of the tuple classes participating in the generic word.

For example, suppose we have the following inheritance hierarchy -- its contrieved, because in practice you probably would not have square inherit from rectangle, but it demonstrates a point here:
tuple
text
shape
polygon
quad
rectangle
square
parallelogram
triangle
circle

Now, suppose we have a generic word:
GENERIC: draw ( shape -- )

M: text draw ... ;
M: rectangle draw ... ;
M: square draw ... ;
M: parallelogram draw ... ;
M: circle draw ... ;
M: triangle draw ... ;

Note that if we pass a square instance to draw, we want it to invoke the M: rectangle method. Here are the participating classes, sorted into echelons:
level 1: text
level 2: circle
level 3: triangle
level 4: rectangle, parallelogram

Method dispatch proceeds as follows:
  • If the tuple on the stack has echelon >= 4, we get the 4th element in its superclass chain, and check if it's rectangle or parallelogram. If so, we dispatch to that method. Note that the 4th element of the superclass chain of both rectangles and squares is rectangle.
  • If the tuple on the stack has echelon >= 3, we get the 3rd element and check if it's a triangle. If so, we dispatch to that method.
  • If the tuple on the stack has echelon >= 2, we get the 2nd element and check if its a circle. If it is, we dispatch to that method.
  • If the tuple on the stack has echelon >= 1, we get the 1st element and check if it's text. If so, we dispatch to that method.

For this generic word, a maximum of four tests must be performed because the inheritance hierarchy is very tall. This situation is rare, since for the most part inheritance is very flat.

Method dispatch: removing conditionals and indirection


The first thing I realized is that every tuple has an echelon level of at least 1, since it must have itself and tuple in the superclass chain. So the conditional test on the final step is unnecessary. Furthermore, if all tuples have echelon 1, there is no need to even load the echelon of the tuple on the stack into a register.

Next, I decided to put the superclass list inside the tuple layout object itself, instead of having the tuple layout object reference an array. This removes one level of indirection, which reduces CPU cache pollution.

To illustrate these improvements, look at the code generated for the call word before these improvements:
[
dup tag dup 3 eq? [
drop dup 0 slot 8 fixnum-fast dup 6 eq?
[ drop { quotation } declare quotation=>call ]
[ drop object=>call ] if
] [
dup 2 eq? [
drop dup { tuple } declare 1 slot
{ tuple-layout } declare 5 slot dup 1 fixnum>= [
drop dup { tuple } declare 1 slot
{ tuple-layout } declare 4 slot
{ array } declare 3 slot { word } declare
dup \ curry eq?
[ drop { curry } declare curry=>call ] [
dup \ compose eq?
[ drop { compose } declare compose=>call ]
[ drop object=>call ] if
] if
] [ drop object=>call ] if
] [ drop object=>call ] if
] if
]

And after:
[
dup tag dup 2 eq? [
drop dup { tuple } declare 1 slot { array } declare
7 slot dup \ compose eq?
[ drop { compose } declare compose=>call ] [
dup \ curry eq?
[ drop { curry } declare curry=>call ]
[ drop object=>call ] if
] if
] [
dup 3 eq? [
drop dup { hi-tag } declare 0 slot dup 14 eq?
[ drop { quotation } declare quotation=>call ]
[ drop object=>call ] if
] [ drop object=>call ] if
] if
]

Method dispatch: faster hashcode lookup


If a generic word has more than 4 methods at the same echelon level, a hashtable dispach is generated instead of a series of linear tests. Formerly, the generated code would indirect through the class word to look up the hashcode. Again, this is bad for locality because it pulls in the entire word object's cache line in for no good reason. So what I did was intersperse the hashcodes with the superclass chain in the tuple layout object. The tuple layout object will already be in the cache, since we just read one of the superclass chain entries, so reading the hashcode is essentially free.

Method dispatch: re-ordering tests


I noticed is that the >fixnum generic word had conditionals in an essentially random order. The cut-off between a series of conditionals and a jump table is 4 methods, and >fixnum, defining methods on every subtype of the reals, is the borderline case:
( scratchpad ) \ >fixnum def>> .
[
dup tag dup 5 eq?
[ drop { float } declare float=>>fixnum ] [
dup 4 eq?
[ drop { ratio } declare ratio=>>fixnum ] [
dup 0 eq?
[ drop { fixnum } declare fixnum=>>fixnum ] [
dup 1 eq?
[ drop { bignum } declare bignum=>>fixnum ]
[ drop object=>>fixnum ] if
] if
] if
] if
]

For various reasons, most >fixnum calls are made with a fixnum on the stack. By sorting the methods by tag number before generating the generic, I was able to get it to look like this:
( scratchpad ) \ >fixnum def>> .
[
dup tag dup 0 eq?
[ drop { fixnum } declare fixnum=>>fixnum ] [
dup 1 eq?
[ drop { bignum } declare bignum=>>fixnum ] [
dup 4 eq?
[ drop { ratio } declare ratio=>>fixnum ] [
dup 5 eq?
[ drop { float } declare float=>>fixnum ]
[ drop object=>>fixnum ] if
] if
] if
] if
]

This reduces the number of conditionals in the common case.

I/O system tweaks and string-nth intrinsic


I made some changes to io.ports and io.buffers; mostly adding inline declarations. I also added a hint to the push word for the sbuf class; stream-readln would push every character onto an sbuf, and adding this fast-path eliminated some method dispatch. Finally, I made string-nth a compiler intrinsic rather than a call into the VM, which means that words where the compiler infers you're operating on strings will inline the intrinsic and perform less memory loads/stores and subroutine calls than before.

Eliminating useless branches with value numbering


Take a look at what the high-level optimizer does to the following quotation:
( scratchpad ) [ [ { + - * } member? ] contains? ] optimized.
[
dup length swap 0 -rot \ ( gensym ) [
pick pick < [
swap
>r 2dup
>r
>r nth-unsafe dup \ * eq? [ t ] [ f ] if
[ drop t ] [
dup \ - eq? [ t ] [ f ] if
[ drop t ]
[ \ + eq? [ t ] [ f ] if [ t ] [ f ] if ] if
] if r>
r>
r>
swap
>r rot r>
swap
[ 2drop ]
[ >r >r 1 +-integer-fixnum r> r> ( gensym ) ] if
] [ 3drop f ] if
] label dup [ ] [ ] if [ t ] [ f ] if
]

A lot of dispatch is eliminated and methods are inlined, but there is some pretty crappy control flow redundancy there. I extended the low-level optimizer to eliminate it when building the low-level IR.

One of the worst offenders is the = word, defined as follows:
: = ( obj1 obj2 -- ? )
2dup eq? [ 2drop t ] [ equal? ] if ; inline

If obj2 is a word, then equal? expands into 2drop f, so we get
2dup eq? [ 2drop t ] [ 2drop f ] if

Then, dead code elimination converts this to
eq? [ t ] [ f ] if

Ideally, we'd get rid of [ t ] [ f ] if altogether. Instead of doing this in the high-level optimizer, I decided to detect this code pattern, along with its negation [ f ] [ t ] if (which is the inlined definition of not) and emit a branchless comparison, instead:
( scratchpad ) [ \ + = ] test-mr mr.
=== word: ( gensym ), label: ( gensym )

_label 0
##prologue
_label 1
##load-indirect V int-regs 582552 +
##peek V int-regs 582553 D 0
##compare V int-regs 582555 V int-regs 582553 V int-regs 582552 cc=
##replace V int-regs 582555 D 0
##epilogue
##return

Now, suppose you have a code sequence like [ t ] [ f ] if [ 1 ] [ 2 ] if after inlining. Value numbering builds up an expression chain where one comparison depends on the result of another, and it is able to collapse them down to a single comparison.

Finally, the empty conditional, [ ] [ ] if, manifests in value numbering as a conditional branch instruction where both successors point to the same node. This is replaced with an unconditional branch.

Results


Here are the timings, in milliseconds, for 10 runs of the reverse-complement benchmark:
5309 5310 5342 5297 5305 5635 5344 5313 5338 5303

This represents a 35% improvement over last time.

Finally, bootstrap is faster too. On my laptop it now takes around 3 minutes 45 seconds, which is about 40 seconds faster than before.

Edit: some more minor optimizations improved reverse-complement runtime by another 500 milliseconds or so.

No comments: