Linkers part 7

As we’ve seen, what linkers do is basically quite simple, but the details can get complicated. The complexity arises because smart programmers can see small optimizations to speed up their programs a little bit, and sometimes the only place those optimizations can be implemented is the linker. Each such optimization makes the linker a little more complicated. At the same time, of course, the linker has to run as fast as possible, since nobody wants to sit around waiting for it to finish. Today I’ll talk about a classic small optimization implemented by the linker.

Thread Local Storage

I’ll assume you know what a thread is. It is often useful to have a global variable which can take on a different value in each thread (if you don’t see why this is useful, just trust me on this). That is, the variable is global to the program, but the specific value is local to the thread. If thread A sets the thread local variable to 1, and thread B then sets it to 2, then code running in thread A will continue to see the value 1 for the variable while code running in thread B sees the value 2. In POSIX threads this type of variable can be created via pthread_key_create and accessed via pthread_getspecific and pthread_setspecific.
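As a rough illustration of the key-based interface, here is a minimal sketch in C. It assumes a POSIX threads implementation; the names value_key, store_and_load, and run_key_demo are made up for this example. Each thread stores a different value under the same key and reads back only its own.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* One key shared by every thread; each thread keeps its own value under it. */
static pthread_key_t value_key;

/* Store the argument under the key, then read it back in the same thread. */
static void *store_and_load(void *arg)
{
    int rc = pthread_setspecific(value_key, arg);
    assert(rc == 0);
    return pthread_getspecific(value_key);
}

/* Runs two threads that store 1 and 2 under the key. Reports what each
   thread read back, and returns the value the calling thread still sees. */
int run_key_demo(intptr_t *seen_a, intptr_t *seen_b)
{
    pthread_t a, b;
    void *ret_a, *ret_b;
    int rc;

    rc = pthread_key_create(&value_key, NULL);   /* NULL: no destructor */
    assert(rc == 0);
    rc = pthread_setspecific(value_key, (void *)(intptr_t)99);
    assert(rc == 0);

    rc = pthread_create(&a, NULL, store_and_load, (void *)(intptr_t)1);
    assert(rc == 0);
    rc = pthread_create(&b, NULL, store_and_load, (void *)(intptr_t)2);
    assert(rc == 0);
    rc = pthread_join(a, &ret_a);
    assert(rc == 0);
    rc = pthread_join(b, &ret_b);
    assert(rc == 0);

    *seen_a = (intptr_t)ret_a;
    *seen_b = (intptr_t)ret_b;

    /* The calling thread's value was not disturbed by the other threads. */
    intptr_t mine = (intptr_t)pthread_getspecific(value_key);
    pthread_key_delete(value_key);
    return (int)mine;
}
```

Calling run_key_demo from main should report 1 and 2 for the two workers while the caller still sees its own 99. Note that every access goes through a function call.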

Those functions work well enough, but making a function call for each access is awkward and inconvenient. It would be more useful if you could just declare a regular global variable and mark it as thread local. That is the idea of Thread Local Storage (TLS), which I believe was invented at Sun. On a system which supports TLS, any global (or static) variable may be annotated with __thread. The variable is then thread local.
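For comparison, a minimal sketch of the same idea using __thread, assuming gcc or a compatible compiler and a POSIX threads library. The names counter, bump, and run_thread_local_demo are hypothetical.

```c
#include <assert.h>
#include <pthread.h>

/* One copy of this variable exists per thread; each new copy starts at 0. */
static __thread int counter;

/* Increment this thread's copy *times times, then report its final value. */
static void *bump(void *arg)
{
    int *times = arg;
    for (int i = 0; i < *times; i++)
        counter++;              /* touches only the calling thread's copy */
    *times = counter;           /* hand the result back to the creator */
    return NULL;
}

/* Runs two threads with different increment counts. Returns the calling
   thread's copy of counter, which neither worker can reach by name. */
int run_thread_local_demo(int *times_a, int *times_b)
{
    pthread_t a, b;
    int rc;

    counter = 100;              /* the calling thread's own copy */
    rc = pthread_create(&a, NULL, bump, times_a);
    assert(rc == 0);
    rc = pthread_create(&b, NULL, bump, times_b);
    assert(rc == 0);
    rc = pthread_join(a, NULL);
    assert(rc == 0);
    rc = pthread_join(b, NULL);
    assert(rc == 0);
    return counter;
}
```

If counter were an ordinary global, the workers would start from 100 and race with each other. With __thread, each worker ends at exactly its own increment count, and the caller still sees 100.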

Clearly this requires support from the compiler. It also requires support from the program linker and the dynamic linker. For maximum efficiency–and why do this if you aren’t going to get maximum efficiency?–some kernel support is also needed. The design of TLS on ELF systems fully supports shared libraries, including having multiple shared libraries, and the executable itself, use the same name to refer to a single TLS variable. TLS variables can be initialized. Programs can take the address of a TLS variable, and pass the pointers between threads, so the address of a TLS variable is a dynamic value and must be globally unique.
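The points about initialization and addresses can be sketched as follows, again assuming gcc-style __thread and POSIX threads. The struct report type and the function names are invented for this example. One thread hands the address of its copy to another thread, which stores through the pointer.

```c
#include <assert.h>
#include <pthread.h>

/* An initialized TLS variable: every thread's copy starts at 7. */
static __thread int slot = 7;

struct report {
    int *creator_slot;   /* in:  address of the creating thread's copy */
    int same_address;    /* out: did &slot in the new thread match it? */
    int own_value;       /* out: the new thread's own copy after the store */
};

static void *writer(void *arg)
{
    struct report *r = arg;
    r->same_address = (&slot == r->creator_slot);
    *r->creator_slot = 42;    /* store into the creator's copy via the pointer */
    r->own_value = slot;      /* this thread's own copy is untouched */
    return NULL;
}

/* Passes &slot to a new thread and returns the caller's copy afterwards. */
int run_address_demo(struct report *r)
{
    pthread_t t;
    int rc;

    r->creator_slot = &slot;
    rc = pthread_create(&t, NULL, writer, r);
    assert(rc == 0);
    rc = pthread_join(t, NULL);
    assert(rc == 0);
    return slot;
}
```

The two threads see different addresses for slot. The store through the pointer shows up in the creating thread's copy, while the writer's own copy keeps its initial value.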

How is this all implemented? First step: define different storage models for TLS variables.

  • Global Dynamic: Fully general access to TLS variables from an executable or a shared object.
  • Local Dynamic: Permits access to a variable which is bound locally within the executable or shared object from which it is referenced. This is true for all static TLS variables, for example. It is also true for protected symbols–I described those back in part 5.
  • Initial Executable: Permits access to a variable which is known to be part of the TLS image of the executable. This is true for all TLS variables defined in the executable itself, and for all TLS variables in shared libraries explicitly linked with the executable. This is not true for accesses from a shared library, nor for accesses to TLS variables defined in shared libraries opened by dlopen.
  • Local Executable: Permits access to TLS variables defined in the executable itself.

These storage models are defined in decreasing order of flexibility. Now, for efficiency and simplicity, a compiler which supports TLS will permit the developer to specify the appropriate TLS model to use (with gcc, this is done with the -ftls-model option, although the Global Dynamic and Local Dynamic models also require using -fpic). So, when compiling code which will be in an executable and never be in a shared library, the developer may choose to set the TLS storage model to Initial Executable.
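As a small sketch of how a model can be requested: the -ftls-model option takes one of global-dynamic, local-dynamic, initial-exec, or local-exec. gcc also accepts a per-variable tls_model attribute with the same names. The variable and function below are hypothetical, and the example only requests the model; the code the compiler and linker finally produce may differ.

```c
#include <assert.h>

/* Ask for the initial-exec model for this one variable. The same effect for
   a whole file would come from compiling with -ftls-model=initial-exec. */
static __thread int hits __attribute__((tls_model("initial-exec")));

/* Add amount to the calling thread's copy and return the new total. */
int record_hits(int amount)
{
    assert(amount >= 0);
    hits += amount;
    return hits;
}
```

The model changes only how the address of hits is computed, not what the variable means to the program.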

Of course, in practice, developers often do not know where code will be used. And developers may not be aware of the intricacies of TLS models. The program linker, on the other hand, knows whether it is creating an executable or a shared library, and it knows whether the TLS variable is defined locally. So the program linker gets the job of automatically optimizing references to TLS variables when possible. These references take the form of relocations, and the linker optimizes the references by changing the code in various ways.

The program linker is also responsible for gathering all TLS variables together into a single TLS segment (I’ll talk more about segments later; for now think of them as a section). The dynamic linker has to group together the TLS segments of the executable and all included shared libraries, resolve the dynamic TLS relocations, and build TLS segments dynamically when dlopen is used. The kernel has to make it possible for access to the TLS segments to be efficient.

That was all pretty general. Let’s do an example, again for i386 ELF. There are three different implementations of i386 ELF TLS; I’m going to look at the gnu implementation. Consider this trivial code:


__thread int i;
int foo() { return i; }

In global dynamic mode, this generates i386 assembler code like this:


leal i@TLSGD(,%ebx,1), %eax
call ___tls_get_addr@PLT
movl (%eax), %eax

Recall from part 4 that %ebx holds the address of the GOT table. The first instruction will have a R_386_TLS_GD relocation for the variable i; the relocation will apply to the offset of the leal instruction. When the program linker sees this relocation, it will create two consecutive entries in the GOT table for the TLS variable i. The first one will get a R_386_TLS_DTPMOD32 dynamic relocation, and the second will get a R_386_TLS_DTPOFF32 dynamic relocation. The dynamic linker will set the DTPMOD32 GOT entry to hold the module ID of the object which defines the variable. The module ID is an index within the dynamic linker’s tables which identifies the executable or a specific shared library. The dynamic linker will set the DTPOFF32 GOT entry to the offset within the TLS segment for that module. The __tls_get_addr function will use those values to compute the address (this function also takes care of lazy allocation of TLS variables, which is a further optimization specific to the dynamic linker). Note that __tls_get_addr is actually implemented by the dynamic linker itself; it follows that global dynamic TLS variables are not supported (and not necessary) in statically linked executables.
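To make the module ID and offset lookup concrete, here is a toy model in C. It is not the dynamic linker's real code: the toy_ names, the fixed table of segment sizes, and the use of calloc are all assumptions made for illustration. It only shows the idea of indexing a per-thread table by module ID, allocating the block lazily, and adding the offset.

```c
#include <assert.h>
#include <stdlib.h>

#define TOY_MAX_MODULES 4

/* Stand-in for the two consecutive GOT entries (DTPMOD32, DTPOFF32). */
struct toy_tls_index {
    unsigned long module;   /* module ID of the defining object */
    unsigned long offset;   /* offset within that module's TLS segment */
};

/* Hypothetical TLS segment sizes per module ID (ID 0 is unused here). */
static const size_t toy_segment_size[TOY_MAX_MODULES] = { 0, 16, 32, 8 };

/* Per-thread table of TLS blocks, indexed by module ID. */
static __thread char *toy_dtv[TOY_MAX_MODULES];

/* Toy version of the lookup: find (or lazily create) this thread's block
   for the module, then add the variable's offset. */
void *toy_tls_get_addr(const struct toy_tls_index *ti)
{
    assert(ti->module < TOY_MAX_MODULES);
    assert(ti->offset < toy_segment_size[ti->module]);

    if (toy_dtv[ti->module] == NULL) {
        /* Lazy allocation: the block appears on first access. A real
           implementation would copy the segment's initial image here. */
        toy_dtv[ti->module] = calloc(1, toy_segment_size[ti->module]);
        assert(toy_dtv[ti->module] != NULL);
    }
    return toy_dtv[ti->module] + ti->offset;
}
```

Repeated lookups with the same index return the same address within a thread, and two variables in the same module differ only by their offsets.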

At this point you are probably wondering what is so inefficient about pthread_getspecific. The real advantage of TLS shows when you see what the program linker can do. The leal; call sequence shown above is canonical: the compiler will always generate the same sequence to access a TLS variable in global dynamic mode. The program linker takes advantage of that fact. If the program linker sees that the code shown above is going into an executable, it knows that the access does not have to be treated as global dynamic; it can be treated as initial executable. The program linker will actually rewrite the code to look like this:


movl %gs:0, %eax
subl $i@GOTTPOFF(%ebx), %eax

Here we see that the TLS system has coopted the %gs segment register, with cooperation from the operating system, to point to the TLS segment of the executable. For each processor which supports TLS, some such efficiency hack is made. Since the program linker is building the executable, it builds the TLS segment, and knows the offset of i in the segment. The GOTTPOFF is not a real relocation; it is created and then resolved within the program linker. It is, of course, the offset from the GOT table to the address of i in the TLS segment. The movl (%eax), %eax from the original sequence remains to actually load the value of the variable.

Actually, that is what would happen if i were not defined in the executable itself. In the example I showed, i is defined in the executable, so the program linker can actually go from a global dynamic access all the way to a local executable access. That looks like this:


movl %gs:0,%eax
subl $i@TPOFF,%eax

Here i@TPOFF is simply the known offset of i within the TLS segment. I’m not going to go into why this uses subl rather than addl; suffice it to say that this is another efficiency hack in the dynamic linker.
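The arithmetic of that two-instruction sequence can be modeled in C. This is only a sketch under stated assumptions: the struct layout, the 64-byte size, and the toy_ names are invented. It assumes, as on i386 GNU/Linux, that the word at %gs:0 holds the thread pointer itself and that the executable's TLS segment sits just below it, so the variable's address is the thread pointer minus a positive offset.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_TLS_SIZE 64

/* Toy per-thread area: the executable's TLS data, followed by a word that
   points to itself. The thread pointer is the address of that word. */
struct toy_thread_area {
    char static_tls[TOY_TLS_SIZE];
    void *self;
};

/* Set up the self-pointer and return the thread pointer value. Loading the
   word at that address models "movl %gs:0, %eax". */
uintptr_t toy_thread_pointer(struct toy_thread_area *area)
{
    area->self = &area->self;
    return (uintptr_t)area->self;
}

/* Models "subl $i@TPOFF, %eax": subtract the link-time offset from the
   thread pointer to get the variable's address. */
void *toy_local_exec_addr(uintptr_t tp, uintptr_t tpoff)
{
    assert(tpoff > 0 && tpoff <= TOY_TLS_SIZE);
    return (void *)(tp - tpoff);
}
```

In this model an offset of 64 yields the first byte of the TLS data and an offset of 4 yields the last four bytes, directly below the thread pointer.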

If you followed all that, you’ll see that when an executable accesses a TLS variable which is defined in that executable, it requires two instructions to compute the address, typically followed by another one to actually load or store the value. That is significantly more efficient than calling pthread_getspecific. Admittedly, when a shared library accesses a TLS variable, the result is not much better than pthread_getspecific, but it shouldn’t be any worse, either. And the code using __thread is much easier to write and to read.

That was a real whirlwind tour. There are three separate but related TLS implementations on i386 (known as sun, gnu, and gnu2), and 23 different relocation types are defined. I’m certainly not going to try to describe all the details; I don’t know them all in any case. They all exist in the name of efficient access to the TLS variables for a given storage model.

Is TLS worth the additional complexity in the program linker and the dynamic linker? Since those tools are used for every program, and since the C standard global variable errno in particular can be implemented using TLS, the answer is most likely yes.


Comments

18 responses to “Linkers part 7”

  1. fche

    > Is TLS worth the additional complexity […] errno […] yes

    Is it your sense that real programs check errno frequently enough
    for this difference to be measurable? I don’t recall coming across
    numbers.

  2. Ian Lance Taylor

    I was thinking not so much that real programs check errno frequently enough, as that real multi-threaded programs frequently call library functions which are required to set errno.

    But I don’t have any numbers either, I’m just speculating.

  3. ncm

    That a pointer to one thread’s errno has the same numeric value as a pointer to some other thread’s errno came as a surprise to me. That seems like something not only a lot of extra work to support, but also likely to be unportable to some environments, and furthermore not necessarily what I would want anyway.

  4. Ian Lance Taylor

    No, the pointer to one thread’s errno has a different numeric value than the pointer to another thread’s errno. The address of a __thread variable is globally unique–each thread gets a different address for a __thread variable. When I say you can pass the pointer between threads, I mean that thread A can pass the address of a __thread variable to thread B, and if thread B makes an assignment through that pointer, thread A will see the result in the __thread variable but thread B will not. Hope that makes sense.

  5. avjo

    Hi Ian,

    Again please allow me to express my gratitude. This series
    is amazing.

    I’ve got two questions please.
    1. I can’t understand those ‘@’-based keywords. Can you please explain how are these keywords constructed ? What is this ‘@’ and what can I put at its sides (I don’t remember it being mentioned in my AT&T assembly book) ?
    e.g. $i@TPOFF, $i@GOTTPOFF(%ebx), i@TLSGD(,%ebx,1), ___tls_get_addr@PLT

    2. Another unfamiliar item: %gs:0. what is it ? I can’t remember the x86 has a %gs register.. and why does it end with a :0 ?

    Thank you so much,
    avjo

  6. Ian Lance Taylor

    Thanks for the note.

    The ‘@’ keywords are extensions to the existing assembly language. They don’t change the assembly, but they tell the assembler which relocation types to generate for the operand to which they are attached. The supported keywords are: PLTOFF (64-bit only), PLT, GOTPLT (64-bit only), GOTOFF, GOTPCREL (64-bit only), TLSGD, TLSLDM, TLSLD (64-bit only), GOTTPOFF, TPOFF, NTPOFF (32-bit only), DTPOFF, GOTNTPOFF (32-bit only), INDNTPOFF (32-bit only), GOT, TLSDESC, TLSCALL.

    The %gs register is a segment register. The x86 supports several segment registers. These days they are generally all set to the same value, but in the 80286 days they were used to select different portions of memory for different parts of the program. %gs:0 means address 0 in the segment addressed by the %gs segment register.

  7. avjo

    Hi Ian and thanks for the explanation.

    Do you know of any online page I can read more about
    this list of supported keywords ?

    Thanks again
    ~avjo

  8. Ian Lance Taylor

    They don’t seem to be in the assembler documentation. I think your best bet would be to look at the i386 ELF ABI supplement and at the TLS documentation. Here are some links. Look for the sample assembler code. In general the keywords correlate to specific relocation types.

    http://sco.com/developers/devspecs/
    http://docs.sun.com/app/docs/doc/817-1984/6mhm7pl2a
    http://people.redhat.com/drepper/tls.pdf
    http://www.lsd.ic.unicamp.br/~oliva/writeups/TLS/RFC-TLSDESC-x86.txt

  9. avjo

    Hi Ian,

    Is there any reason at all to prefer the pthread_getspecific/setspecific
    library calls over a __thread variable ?

    What about embedded systems with relatively old kernels (2.6.10 the
    oldest) ?

    Thanks!
    ~avjo

  10. Ian Lance Taylor

    As long as your kernel is 2.6.x, you should be able to use __thread variables. The only reason I know to prefer pthread_getspecific is that you can pass a destructor routine to pthread_key_create, which will be run when a thread exits. I don’t think there is any way to run a destructor for a __thread variable. In general __thread variables are more efficient and should be preferred.

  11. avjo

    Thank you.

    (PS – I still hope to pre-order your Linkers book one day 😉)

  12. avjo

    Hi Ian,

    When I’m trying to use __thread in an application, I get the following gcc error:

    error: function-scope ‘i’ implicitly auto and declared ‘__thread’

    (all I did is trying to compile an empty C main with the line ‘__thread int i;’)

    Any idea what is wrong ? (I’m using gcc 4.2.3 (Ubuntu 4.2.3-2ubuntu7) on 2.6.24-19 (ubuntu generic x86_64 kernel) on x86_64 platform…

    The compile line is just ‘gcc attempt.c’..

    Thank you!
    ~avjo

  13. Ian Lance Taylor

    __thread only works for global or static variables. It sounds like you wrote

    int main() { __thread int i; }

    That makes i a local variable in main, which in C is known as an “auto” variable (from the very old but still supported syntax “auto int i;”). A local variable can not be a TLS variable. Or, to put it another way, local variables are always TLS variables, in the sense that they can only be accessed by a single thread. TLS only makes sense when speaking about variables which can be accessed by multiple threads, which means a global or static variable.

  14. erichtsai

    Great blog!

    After going through a couple of TLS related documents, I still have questions. It seems to me that, by default, an executable will use the IE model to access an external TLS variable. With the IE model, an executable can access all TLS variables in shared libraries explicitly linked with that executable. So, I think these shared objects can’t support lazy binding for this executable any more. In order to support lazy binding, either the GD model or dlsym() has to be used. Am I right?

    Thanks!

    Eric

  15. Ian Lance Taylor

    Thanks for the comment. I guess I’m not sure just what you are saying. It’s true that when an executable uses the default IE model to access a TLS variable defined in a shared library, the dynamic linker has to resolve that access at startup time, rather than lazily. This doesn’t really affect how the shared libraries access the TLS variable, though; they will continue to use a function call to resolve the address.

    Lazy binding is not really a feature of TLS variables. Lazy binding is used for function calls, not variable references. TLS variables do support lazy allocation, which is not quite the same thing. It’s true that if an executable refers to a TLS variable, then that variable can not be allocated lazily. But that doesn’t really matter, as the allocation of a TLS variable referenced by an executable is essentially free. It simply becomes part of the executable’s TLS segment.

  16. ndatta

    Hi Ian,

    Your blog post series on linkers is very well written, thanks!

    I had a couple of questions:
    (i) Who populates the %fs or %gs register to point to the start of the TLS segment each time a thread switch happens? Is this done in the pthread library? Or by the NPTL in the Linux kernel? Or by some other mechanism? How can I programmatically verify the same, if that is at all possible?

    (ii) Is it not possible to see the value of the %gs or %fs register in gdb while debugging a program using a __thread variable? I compiled a simple test program that defined a __thread long l; global variable and printed its value in main(). When I set a breakpoint in gdb at main, and then do an “info registers” at the breakpoint, it shows the segment registers ds, es, fs and gs to be zero. This doesn’t make sense?! The disassembled code shows this instruction:
    mov %fs:0xfffffffffffffff8,%rdx
    I’m assuming that the negative offset is due to your note about the linker generating a subl instead of an addl. Is this correct? And how does it work with negative offsets anyhow?

    Thanks again.

  17. Ian Lance Taylor

    On GNU/Linux, the %fs and %gs registers are saved and restored by the kernel on each thread switch, just as with any other register. When a new thread is created, the NPTL pthread library uses CLONE_SETTLS to tell the kernel to point %fs or %gs to the area passed in as a parameter.

    I’m not sure what you want to programmatically verify, so I’m not sure how to answer that question.

    Current versions of gdb will print __thread variables correctly. The values of %fs and %gs are difficult to interpret as they are 16-bit segment registers, and do not store addresses directly. I don’t know how to get gdb to provide the address associated with a segment register, nor do I know how to print something like %fs:0 directly.

    The TLS works with negative offsets by simply having the NPTL library and the kernel point %gs to the top of the statically allocated TLS area.

  18. ndatta

    Great, that clears things up. The CLONE_SETTLS patch description is here: http://lwn.net/Articles/7603/.
