Multi-threading Memory

As I’ve written before, multi-threaded code is hard to write, and it becomes even harder when you find that lock contention is an efficiency issue. On a modern multi-core system, locks require some sort of memory cache coordination between the various cores. If your program uses fine-grained locks which may be held by threads running on different cores, access to those locks can become relatively expensive. For high performance code it then becomes desirable to avoid the locks when possible.

When you do this, you have to understand memory barriers. A memory barrier enforces a weak global ordering on memory reads and writes. Without a memory barrier, depending on cache behaviour, it is quite possible for one core to write A before B but for another core to see the write to B before it sees the write to A. Memory barriers enforce consistency.
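The ordering problem can be sketched with the standard publish/subscribe pattern, here written with the atomic operations the upcoming C++ standard proposes (the names `payload`, `ready`, and the helper functions are illustrative, not from any particular codebase):

```cpp
#include <atomic>
#include <thread>

int payload = 0;                  // "A": the data being published
std::atomic<bool> ready(false);   // "B": the flag saying A is valid

void writer() {
    payload = 42;                                  // write A first
    ready.store(true, std::memory_order_release);  // write B; the release
                                                   // barrier keeps A before B
}

int reader() {
    while (!ready.load(std::memory_order_acquire)) {}  // wait to see B
    return payload;  // the acquire barrier guarantees we also see A
}

int publish_demo() {
    std::thread t(writer);
    int seen = reader();
    t.join();
    return seen;
}
```

Without the release/acquire pair, a reader on another core could see `ready == true` while still observing the old value of `payload`.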

Unfortunately, different architectures define memory barriers in different ways, and different implementations of the same architecture implement them in different ways. So it is very easy to write code which works reliably in some places but fails occasionally in others.

Also, the compiler knows nothing about memory barriers (this will change to some extent with the next C++ standard and its atomic<> types). The compiler is generally free to reorder memory reads and writes as long as the final result is correct. When writing lock-free code using memory barriers, you therefore need to carefully indicate what you are doing in two different ways: once for the processor and once for the compiler. This will normally be done via some sort of inline assembler construct. And of course you want to make this run as efficiently as possible, since that is the point of the exercise in the first place.
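As a sketch of the two layers, here is the same publish pattern using GCC-style constructs: an empty inline-assembler clobber as a compiler-only barrier, and the GCC builtin `__sync_synchronize` as a full hardware barrier (the variable and function names are made up for illustration; formally this is still a data race by the letter of the standard):

```cpp
#include <thread>

static int data_word = 0;            // the value being published
static volatile int flag_word = 0;   // volatile so the spin loop isn't elided

// Compiler barrier: stops the compiler reordering memory accesses across
// this point, but emits no hardware fence at all.
static inline void compiler_barrier() { __asm__ __volatile__("" ::: "memory"); }

void barrier_writer() {
    data_word = 1;
    compiler_barrier();    // tell the compiler: keep the stores in order
    __sync_synchronize();  // tell the processor: full memory barrier
    flag_word = 1;
}

int barrier_reader() {
    while (flag_word == 0) {}
    __sync_synchronize();  // pair the barrier on the reading side
    return data_word;
}

int barrier_demo() {
    std::thread t(barrier_writer);
    int seen = barrier_reader();
    t.join();
    return seen;
}
```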

So basically getting all this right is a big hairball. The next C++ standard will push this hairball onto the compiler and library implementors. The types aren’t all that easy to understand, but at least they should work correctly. Until then, or when working in C, it’s really hard to write correct portable lock-free code. Better to avoid fine-grained locks, I think, or at least to avoid contention for them.
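One way to avoid contention, rather than sharing one finely locked counter, is to give each thread its own padded slot and combine the results at the end. This is only a sketch under assumed names (`PaddedCount`, `sharded_count` are invented here):

```cpp
#include <thread>
#include <vector>

// Pad each counter to its own cache line so the threads' writes do not
// fight over cache coherency for a shared line ("false sharing").
struct alignas(64) PaddedCount { long n = 0; };

long sharded_count(int nthreads, long iters) {
    std::vector<PaddedCount> counts(nthreads);
    std::vector<std::thread> threads;
    for (int i = 0; i < nthreads; ++i)
        threads.emplace_back([&counts, i, iters] {
            for (long j = 0; j < iters; ++j)
                counts[i].n++;   // no lock, no shared cache line
        });
    for (auto &t : threads) t.join();
    long total = 0;
    for (const auto &c : counts) total += c.n;   // combine once, at the end
    return total;
}
```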

Some people argue that transactional memory will solve these problems. I am very skeptical of this. I’ll write more about this later.

5 Comments »

  1. Simon Richter said,

    March 1, 2008 @ 6:28 am

    I still think that “volatile” serves us quite well here.

    When I write multi-threaded programs, rather than thinking of a single abstract machine with multiple threads, I think of multiple abstract machines that can see exactly each other’s “observable behaviour”, which is well-defined in the standard, and nothing else.

    Guaranteed atomic types were something I sorely missed in the standard — I just assumed that “unsigned int” was atomic for all purposes, which I believe holds true in most if not all implementations. My only use for them, however, is reporting progress when one thread is handling I/O and the other is doing a lengthy computation.

    The new standard essentially codifies existing practice in treating everything like I/O buffers (i.e. accesses aren’t ordered, but are guaranteed to happen) unless told otherwise. Since the overwhelming majority of objects in my programs aren’t shared, I’d actually prefer explicitly marking those that are (that takes quite a few load/store instructions out of the hot path), however that would break lots of existing code.

  2. Ian Lance Taylor said,

    March 3, 2008 @ 2:39 pm

    Thanks for the comment.

    volatile doesn’t help all that much. In particular, if you have two volatile global variables, and one thread changes first A and then B, there is no guarantee that a thread running on a different core will see the changes in that order. It might very well see the change to B before it sees the change to A. volatile only applies within a single thread; it does not imply anything about threads running on other cores.
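    The hazard can be sketched like this (illustrative names; the comments mark where the guarantee is missing):

    ```cpp
    #include <thread>

    volatile int A = 0;
    volatile int B = 0;

    void volatile_writer() {
        A = 1;  // first write
        B = 1;  // second write; another core may observe this one first
    }

    int volatile_reader() {
        while (B == 0) {}
        // On a weakly ordered machine this may still read A == 0:
        // volatile constrains the compiler within this thread,
        // not the order in which the hardware makes stores visible.
        return A;
    }

    void volatile_demo() {
        std::thread t(volatile_writer);
        volatile_reader();
        t.join();
    }
    ```

    After both threads finish, both stores have certainly landed; it is only the order in which a concurrent reader observes them that is unguaranteed.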

  3. rdb said,

    March 3, 2008 @ 4:39 pm

    Paul McKenney’s LCA2008 talk on this, “After 25 Years, C/C++ Understands Concurrency”, has slides, video & audio up on the Friday program link. (2nd try, only the one URL this time).

  4. Simon Richter said,

    March 4, 2008 @ 1:29 am

    Good point — so the current implementations do not really behave like single abstract machines operating concurrently, and my model of thinking is wrong (good thing I never share more than an unsigned int for progress reports between two threads 🙂 ).

    The question is: can this be fixed without imposing undue overhead?

    If writes to shared variables should be visible in-order, the implementation would have to insert memory barriers, which is an OS-specific thing (so the compiler would have to be OS aware). If it were to do this implicitly, it’d have to guess and err on the side of caution, degrading performance. If the programmer had to insert the barriers, there is no gain compared to the status quo.

    Shared memory between threads seems to be a bad idea altogether. I wonder if it would make more sense to have a “stream” like interface that an implementation could map to shared memory if it could guarantee consistency, but were not required to if other methods made more sense (for example, for NUMA).

  5. Ian Lance Taylor said,

    March 4, 2008 @ 5:56 am

    rdb: thanks for the pointer.

    Simon: yes, I tend to think that a stream-like interface for communicating between threads is the most workable method for multi-core systems. That idea came up earlier also.
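    Such a stream-like interface can be sketched as a small channel built on a mutex and condition variable; the implementation hides the shared memory behind send/receive, so it could equally be backed by message passing (the `Channel` class and its method names are invented for illustration):

    ```cpp
    #include <condition_variable>
    #include <mutex>
    #include <queue>

    template <typename T>
    class Channel {
        std::queue<T> q;
        std::mutex m;
        std::condition_variable cv;
    public:
        void send(T v) {
            { std::lock_guard<std::mutex> lock(m); q.push(v); }
            cv.notify_one();  // wake a receiver waiting on an empty queue
        }
        T receive() {
            std::unique_lock<std::mutex> lock(m);
            cv.wait(lock, [this] { return !q.empty(); });  // block until data
            T v = q.front();
            q.pop();
            return v;
        }
    };
    ```

    Because all sharing goes through one interface, the consistency question is answered once, inside the channel, instead of at every access to shared data.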
