Sunday, September 05, 2010

Making Factor's continuous build system more robust

I've done some work on Factor's continuous build system over the weekend to make it more robust in the face of failure, with improved error reporting and less manual intervention required to fix problems when they come up. The current build system is called "mason", because it is based on an earlier build system named "builder" that was written by Eduardo Cavazos. Every binary package you download from factorcode.org was built, tested and benchmarked by mason.

Checking for disk space

Every once in a while build machines run out of disk space. This is a condition that Git doesn't handle very gracefully; if a git pull fails half way through, it leaves the repository in an inconsistent state. Instead of failing during source code checkout or testing, mason now checks disk usage before attempting a build. If less than 1 Gb is free, it sends out a warning e-mail and takes no further action. Disk usage is also now part of every build report; for example, take a look at the latest Mac OS X report. Finally, mason does a better job of cleaning up after itself when builds fail, reducing the rate of disk waste overall.

I must say the disk space check was very easy to implement using Doug Coleman's excellent cross-platform file-system-info library. Factor's I/O libraries are top-notch and everything works as expected across all of the platforms we run on.

Git repository corruption

Git is not 100% reliable, and sometimes repositories will end up in a funny state. One cause is when the disk fills up in the middle of a pull, but it seems to happen in other cases too. For example, just a few days ago, our 64-bit Windows build machine started failing builds with the following error:

From git://factorcode.org/git/factor
* branch master -> FETCH_HEAD
Updating d386ea7..eece1e3

error: Entry 'basis/io/sockets/windows/windows.factor' not uptodate. Cannot merge.

Of course nobody actually edits files in the repository in question, its a clone of the official repo that gets updated every 5 minutes. Why git messed up here I have no clue, but instead of expecting software to be perfect, we can design for failure.

If a pull fails with a merge error, or if the working copy somehow ends up containing modified or untracked files, mason deletes the repository and clones it again from scratch, instead of just falling over and requiring manual intervention.

Error e-mail throttling

Any time mason encounters an error, such as not being able to pull from the Factor Git repository, disk space exhaustion, or intermittent network failure, it sends out an e-mail to Factor-builds. Since it checks for new code every 5 minutes, this can get very annoying if there is a problem with the machine and nobody is able to fix it immediately; the Factor-builds list would get spammed with hundreds of duplicate messages. Now, mason uses a heuristic to limit the number of error e-mails sent out. If two errors are sent within twenty minutes of each other, no further errors are sent for another 6 hours.

More robust new code check

Previously mason would initiate a build if a git pull had pulled in patches. This was insufficient though, because if a build was killed half way through, for example due to power failure or machine reboot, it would not re-attempt a build when it came back up until new patches were pushed. Now mason compares the latest git revision with the last one that was actually built to completion (whether or not there were errors).

Build system dashboard

I've put together a simple dashboard page showing build system status. Sometimes VMs will crash (FreeBSD is particularly flaky when running under VirtualBox, for example) and we don't always notice that a VM is down until several days after, when no builds are being uploaded. Since mason now sends out heartbeats every 5 minutes to a central server, it was easy to put together a dashboard showing which machines have not sent any heartbeats for a while. These machines are likely to be down. The dashboard also allows a build to be forced even if no new code was pushed to the repository; this is useful to test things out after changing machine configuration.

The dashboard nicely complements my earlier work on live status display for the build farm.

Conclusion

I think mason is one of the most advanced continuous integration systems among open source language implementations, nevermind the less mainstream ones such as Factor. And thanks to Factor's advanced libraries, it is only 1600 lines of code. Here is a selection of the functionality from Factor's standard library used by mason:

  • io.files.info - checking disk usage
  • io.launcher - running processes, such as git, make, zip, tar, ssh, and of course the actual Factor instance being tested
  • io.timeouts - timeouts on network operations and child processes are invaluable; Factor's consistent and widely-used timeout API makes it easy
  • http.client - downloading boot images, making POST requests to builds.factorcode.org for the live status display feature
  • smtp - sending build report e-mails
  • twitter - Tweeting binary upload notifications
  • oauth - yes, Factor has a library to support the feared OAuth. Everyone complains about how hard OAuth is, but if you have easy to use libraries for HMAC, SHA1 and HTTP then it's no big deal at all.
  • xml.syntax - constructing HTML-formatted build reports using XML literal syntax

No comments: