Single Threaded Memory Model

One more round on the parallel programming theme. There has been some recent discussion on the gcc and LKML mailing lists about an interesting case.

static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
static int acquires_count = 0;

int
trylock()
{
  int res;

  res = pthread_mutex_trylock(&mutex);
  if (res == 0)
    ++acquires_count;

  return res;
}

On x86 processors, current gcc will optimize this to a comparison, an add-with-carry instruction, and an unconditional store to acquires_count. This eliminates a branch, but it means that the variable is written even when the lock is not acquired. That introduces a race condition.

This is clearly permitted by the C language standard, which in general describes a single-threaded model. The standard says nothing about precisely when values are written to memory. It simply says that if you assign a value to a global variable, then when you read that variable you should see the value you assigned.

It may seem that this should be an invalid optimization: the compiler should not move stores out of a conditionally executed basic block. Several people have reacted to this test case that way. However, note that the function call could be anything–it could even be a test of a global variable. And note that the store could just as well be a load–we could load a global variable twice, protected by some conditional both times. If we require that such code be safe in a multi-threaded environment, then we cannot coalesce the loads. That would be fine for people who write C as a fancy assembly language. But it would significantly hurt performance for complex programs, especially in C++, where the conditionals would be in different functions that are then inlined together.

So what should the compiler do, given a choice between optimizing complex C++ code and helping to generate correct complex multi-threaded code? Complex C++ code is much more common than complex multi-threaded code. I think the compiler is making the right choice. And it conforms to the language standard.

For this code to be correct in standard C, the variable needs to be marked volatile, or the code needs an explicit memory barrier (which requires compiler-specific magic–in the case of gcc, a volatile asm with an explicit memory clobber). But many people don’t see that–including Linus Torvalds.

I think this is a nice example of my earlier point, which is that our current models of parallel programming are generally too hard for people to use.



19 responses to “Single Threaded Memory Model”

  1. tromey

    I don’t really see this as a strike against the parallel programming models. Instead it is more a problem with the C language, and perhaps with either GCC or the C library. For instance, I consider the Java threading model to be very similar to the C model (as seen from a programming-language point of view, say) — but it does not have this problem. (To be precise, it does have this problem with the trylock approach. But that is a dumb thing to do.)

    Also, unfortunately Linus is partly right about GCC. I do think there’s been too much language lawyering and exploitation of odd language loopholes in pursuit of somewhat dubious (in my view) optimizations. This isn’t to invalidate the idea that one ought to write programs that conform strictly to standards. But in my view GCC has erred on the wrong side of this important QoI issue.

    I do worry that GCC will be supplanted as the compiler of choice by something else, partly due to this issue, but also due to the other ways that GCC has failed over the years to listen to its user community.

  2. Ian Lance Taylor

    My feeling is that gcc has actually been fairly responsive to the user community–except for the kernel programmers. The problem there, as I see it, is the kernel programmers tend to want something close to “C as assembly language.” They have a much clearer idea of the assembly code that they want than most people do. They also make extensive use of the inline asm syntax.

    Most programmers do not want “C as assembly language,” and most programmers don’t use inline asm syntax. Those programmers want performance. They are the ones who regularly complain that gcc is not competitive with icc, LLVM, Open64, etc.

    So gcc is caught between the kernel programmers and people who want performance. And since compiler programmers also want performance, gcc tends to veer toward the latter.

    The only way I see to fix this problem is for the kernel programmers to work more closely with gcc to state clearly what they want in terms that make sense in the compiler, not in terms of DWIM programming. Unfortunately they have largely given up on the gcc development process, and would switch to something else in a second if there were anything else which could fit their needs. I don’t know how to fix this problem.

    But note that if the kernel programmers find a different compiler that they can use, other programmers will find that compiler to be even less suitable than gcc.

  3. womble

    This sort of optimisation was discovered to be a problem for multi-threaded programs some time ago (http://doi.acm.org/10.1145/1065010.1065042) and has been fixed by some other compiler vendors. It will be banned in the next C++ standard, which has a multi-threaded abstract machine, and I would expect the C standard committee to follow suit.

  4. Ian Lance Taylor

    Thanks for the note.

    We worked this out on the gcc list today, and I wrote a patch to change the compiler. Although it follows the current language standards, it is clearly better to move toward the upcoming C++0x memory model. That memory model bans speculative stores, but appears to permit load coalescing. So there are variants of this code which will still fail.

  5. Simon Richter

    What is missing is something in between “volatile” and non-“volatile”; something that does not have accesses ordered internally and permits load/store coalescing, but takes accesses to a (either one specific or any) volatile object as a memory barrier.

    This way, I could mark the buffer I use for inter-thread communication with this modifier, and the (volatile) access to the mutex will provide the necessary barrier, while at the same time allowing even more optimizations for single-threaded code.

    For example, the standard zlib implementation uses global variables to write down the buffer positions. On a sane architecture like ARM or PowerPC I could keep those in registers and coalesce the byte loads since I have enough registers, which would result in a huge performance win if the data source is flash memory (which introduces several hundred wait states).

    I don’t think banning speculative stores is the answer. It’s throwing out the baby with the bathwater. If you have a 25-stage pipeline, then a speculative store inside a loop and undoing it outside if some condition didn’t hold might just mean a 5-fold speed increase (if the speculative store goes into the cache, which is always allowed since we cannot do that to volatile objects).

  6. tromey

    I’ve been thinking more about your feeling that GCC has been fairly responsive to its user community.

    I do think GCC has been improving. Certainly its internals are cleaner than ever before. And, this has had a payoff in terms of optimizations — SRA comes to mind as something that was enabled by the tree-ssa work.

    I still think there are serious user complaints that have not been addressed. For instance, people have complained for years about the speed of gcc itself.

    Also, sometimes GCC seems to invent new ways to annoy users. Here I’m thinking of aliasing and overflow. My feeling is that many users were surprised by these changes, since they had been programming in “C as it always was” (according to oral tradition, I suppose) and not “C as interpreted by a specification expert”.

    I suppose my view is that there are multiple constituencies for GCC, and it is hard to please them all. But it seems to me that the “compete with icc” and “do well on SPEC” contingent has too much of the upper hand.

    That’s putting it all a bit strongly of course. I’d like to hear your reaction.

  7. scott.tsai

    I think saying the above pthread code is writing “fancy assembly in C” and is mainly of interest to kernel developers is not quite accurate.
    It was a userspace pthread example, after all, and could have appeared in any multi-threaded program, in C or C++.

    Speculative store behavior is of interest to anyone who writes multi-threaded programs.

  8. Ian Lance Taylor

    Thanks for the notes.

    To Simon: look for the C++0x memory model. It has a description of why it does not want to support speculative stores.

    To Tom: I’ll reply at greater length later.

    To Scott: I didn’t mean to imply that this example was only of interest to kernel programmers. But I do think it is writing assembly code in C. Modern C allows quite a bit of latitude to the compiler to rearrange code. Although I’ve patched gcc, I continue to believe that a correct program would have an explicit memory barrier.

  9. ncm

    I don’t understand the example as written. Can you point me to someplace else where it’s explained differently?

  10. ncm

    I’m sorry, I missed seeing the links you posted.

  11. ncm

    So (pulling code from one of the e-mails referenced) the code generated is as if the source said

    tmp = acquires_count;
    res = pthread_mutex_trylock(&mutex);
    acquires_count = tmp + (res == 0);

    and as somebody else noted, under POSIX threads, moving the load across the pthread_* call is not allowed. Everybody writing from the gcc side says acquires_count should have been declared volatile. It seems to me that the pthread call should have an attribute forbidding the compiler from moving memory operations across it. I’m astonished there isn’t such an attribute. It seems to me it’s needed for practically every synchronization primitive.

  12. Ian Lance Taylor

    But no memory operations were moved over the pthread_mutex_trylock call. It didn’t make the transformation you state. The original test case looked like

    if (pthread_mutex_trylock (&m) == 0)
    ++acquires_count;

    gcc turned it into

    r = pthread_mutex_trylock (&m);
    acquires_count += r == 0;

    (gcc generated an add with carry flag to memory, which is a standard x86 instruction). A load and store were effectively added in the implicit else branch of the conditional, but no loads or stores were moved over the call to pthread_mutex_trylock.

  13. ncm

    I should never post when I have a fever.

  14. […] Reading mig21’s weblog, in which I often find really interesting stuff, I found Ian Lance Taylor’s article “Single Threaded Memory Model“, which kind of bothers me a bit. It reports a recent discussion on the gcc and LKML mailing lists about how C compilers, gcc in this case, optimize for single threaded code, sometimes leading to counter-intuitive results which won’t work properly in multi-threaded software (leading, for example, to race conditions). […]

  15. jarkao2

    “The standard says nothing about precisely when values are written to memory.”

    But, doesn’t the standard say anything about flow control?
    Eg., can something like this:
    if (sin(x) == 2.0)
    y = ++acquires_count/0;

    be turned to this as well?:
    r = sin(x) == 2.0;
    y = (acquires_count += r == 0)/0;

  16. jarkao2

    OK, this y changes too much, sorry.
    Let it be the same memory still:

    But, doesn’t the standard say anything about flow control?
    Eg., can something like this:
    if (sin(x) == 2.0)
    acquires_count = ++acquires_count/0;

    be turned to this as well?:
    r = sin(x) == 2.0;
    acquires_count = (acquires_count += r == 0)/0;

  17. Ian Lance Taylor

    Division by zero is a trapping instruction, so that modification could introduce a trap where one wasn’t before. Therefore, the transformation is not valid in this case.

  18. jarkao2

    Yes! After rethinking it again I’ve found this example is still wrong, but you were faster… I had thought division was too invasive, so such a transformation would have changed the ‘else’ branch (at least except for division by 1). Sorry for such loud thinking.

    I’ve simply wondered whether such optimization is safe
    against all such illegal traps, but it seems it really is!
    It’s only hard to get used to.

    On the other hand, isn’t it funny–gcc forces people to respect C’s single-threadedness just when they are
    about to forget single-processor boxes!

    Thanks very much for these great programming articles!

  19. […] Re: Memory access vs variable access

    On Jun 24, 5:51 pm, Gerhard Fiedler <geli…@gmail.com> wrote:

    > On 2008-06-24 11:50:26, gpderetta wrote:
    >
    > > If a specific architecture didn’t allow 32 bit load/stores to 32 bit
    > > objects, it would require the implementation to pad every object to the
    > > smaller load/store granularity. Pretty much all common architectures
    > > allow access to memory at least at 8/16/32 bit granularity (except for
    > > DSPs I guess), so it is not a problem.
    >
    > Ah, I didn’t know that. So on common hardware (maybe x86, x64, AMD, AMD64,
    > IA-64, PowerPC, ARM, Alpha, PA-RISC, MIPS, SPARC), memory access is
    > possible in byte granularity? Which then means that no common compiler
    > would write to locations that are not the actual purpose of the write
    > access?

    All x86 derivatives allow 8/16/32/64 access at any offset. I think both
    PowerPC and ARM allow access at any granularity as long as the access is
    properly aligned. IIRC very old Alphas only allowed accessing aligned
    32/64 bits (no byte access), but it got fixed because it was extremely
    inconvenient. I do not know about IA-64, MIPS, SPARC and PA-RISC, but I
    would be extremely surprised if they didn’t.

    > > Current compilers do not implement the rule above, but thread aware
    > > compilers approximate it well enough that, as long as you use correct
    > > locks, things work correctly *most of the time* (some compilers have
    > > been known to miscompile code which used trylocks for example).
    >
    > Do you have any links about which compilers specifically don’t create code
    > that works correctly? One objective of mine is to be able to separate this
    > "most of the time" into two clearly defined subsets, one of which works
    > "all of the time" 🙂

    Many do in corner cases. Usually these are considered bugs and are fixed
    when they are encountered. See for example
    http://www.airs.com/blog/archives/79

    > > Actually, discussing whether the next C++ standard prohibits
    > > speculative writes, is language specific and definitely on topic.
    >
    > Is "speculative writes" the technical term for the situation I described?
    > I’m not sure if it applies to this example.

    I think that "speculative store" is defined as the motion of a store
    outside of its position in program order (usually sinking it outside of
    loops or branches). It doesn’t take much to generalize the concept to
    that of the *addition* of a store not present in the original program
    (i.e. adjacent field overwrites). For details see "Concurrency memory
    model compiler consequences" by Hans Boehm:
    http://www.open-std.org/jtc1/sc22/wg…007/n2338.html

    HTH,
    –
    gpd […]
