Thread Sanitizer

I recently ran the gold linker under Thread Sanitizer. It’s a nice plugin for Valgrind which looks for race conditions in multi-threaded programs. To describe it briefly, it builds Happens-Before relationships based on mutex operations and warns when it notices a write and a read/write to the same memory location without a Happens-Before relationship. This approach can yield false positives to be sure, but it does a very nice job of identifying real problems.

It was able to identify one real bug in gold, one problem that led to less efficient link time, and several cases where several threads would set a shared memory location to the same value. The latter cases are not a problem on x86 architectures, though they could be on other processors. In order to get clean Thread Sanitizer results in the future, I fixed all of the cases so that I could get a clean run of gold, at least with the default settings.

The real bug that it found was a typical multi-threaded bug: the code looked fine, but it had a well-hidden error. Gold uses a workqueue of tasks to execute, with a pool of worker threads. Many of the tasks are run using a blocker token. The blocker token is set to the number of tasks that precede it. As each task completes, it decrements the blocker count. When the count goes to zero, the next set of tasks can start. This is a simple way to parallelize linker operations, in which one set of operations (e.g., process the symbol tables) must be run before the next set (e.g., process the relocations) can begin. Naturally I paid close attention to the blocker behaviour when a task completed, and there were no problems there. The problem arose in setting the blocker count when the tasks started. The code was doing a loop of “increment the blocker count” and then “queue the task.” What I forgot was that the process of queuing the task actually lets another thread in the pool start working on it immediately. When the task completed, it decremented the blocker count with a lock. But if the task completed fast enough, the initial code was still running the loop queuing up new tasks, and thus incrementing the blocker count. I didn’t think that I needed to lock the increment, since I wasn’t expecting any task to actually complete before I started all of the tasks. A dumb mistake—just the kind of mistake one makes in multi-threaded programming.

Gold is written in C++. In Go I would of course have each task communicate its completion on a channel. The locking would be handled by the runtime, and there would be no chance for me to make the same sort of error. If you write multi-threaded code, and you can’t use Go, you should definitely check out Thread Sanitizer.

4 Comments »

  1. jyasskin said,

    February 13, 2010 @ 2:25 am

    It’s easy enough to write a simple channel in C++. The hardest part is “wait for any”, which you probably don’t need for this problem. Why not do that, if it’s the right way to structure the problem?

    But you’re definitely right about tsan. It’s amazing at finding races.

  2. kcc said,

    February 13, 2010 @ 7:06 am

    >> This approach can yield false positives to be sure
    It can’t.
    There are no false positives unless you use raw atomic instructions for synchronization.
    There is another approach (called “hybrid” because it uses ‘happens-before’ and ‘lockset’) which can have false positives. More on these two approaches here: http://code.google.com/p/data-race-test/wiki/PureHappensBeforeVsHybrid

  3. Ian Lance Taylor said,

    February 13, 2010 @ 9:15 am

    Thanks for the comments.

    jyasskin: Channels are neither better nor worse than the threadpool I wrote. The difference is simply that since channels are built into the language, any concurrency issues only have to be fixed in one place. Gold had to be highly portable, and there is no standard concurrency library for C++. Even the libraries planned for C++0x are too low-level to really help people avoid concurrency problems; they just standardize and clarify the operations which programmers already use. (Of course portability also precludes writing gold in Go, at least at present; I was really just thinking of the advantage of including solid concurrency support in the language.)

    kcc: I didn’t mean to impugn the tool, which you’ve done a great job on. By false positives I meant cases where different threads can initialize a variable simultaneously, but they always initialize the variable to the same value. This is (as far as I know) safe on popular hosts, though it is technically a race and can fail on some unusual processors.

  4. kcc said,

    February 13, 2010 @ 9:42 am

    [Slightly off-topic]
    Here is a discussion about cases when two threads write the same value to the same memory. Quite entertaining 🙂
    http://jeremymanson.blogspot.com/2008/12/benign-data-races-in-java.html

RSS feed for comments on this post · TrackBack URI

Leave a Comment

You must be logged in to post a comment.