Linkers part 4

Shared Libraries

We’ve talked a bit about what object files and executables look like, so what do shared libraries look like? I’m going to focus on ELF shared libraries as used in SVR4 (and GNU/Linux, etc.), as they are the most flexible shared library implementation and the one I know best.

Windows shared libraries, known as DLLs, are less flexible in that you have to compile code differently depending on whether it will go into a shared library or not. You also have to express symbol visibility in the source code. This is not inherently bad, and indeed ELF has picked up some of these ideas over time, but the ELF format makes more decisions at link time and is thus more powerful.

When the program linker creates a shared library, it does not yet know which virtual address that shared library will run at. In fact, in different processes, the same shared library will run at different addresses, depending on the decisions made by the dynamic linker. This means that shared library code must be position independent. More precisely, it must be position independent after the dynamic linker has finished loading it. It is always possible for the dynamic linker to convert any piece of code to run at any virtual address, given sufficient relocation information. However, the relocation computations must be performed every time the program starts, which means that it will start more slowly. Therefore, any shared library system seeks to generate position independent code which requires a minimal number of relocations to be applied at runtime, while still running at close to the runtime efficiency of position dependent code.
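To make the startup cost concrete, here is a rough sketch in C of the simplest kind of load-time relocation: the dynamic linker walks a table and adds the load address to each word the table names. The names toy_reloc and toy_apply_relocs are invented for this illustration, and the record format is an assumption; real ELF relocation records also carry a type and usually a symbol index.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified relocation record: the word at this index
   holds an address computed as if the library were loaded at address 0. */
struct toy_reloc {
    size_t word_index;
};

/* Add the actual load address to every word named by a relocation.
   The work is proportional to the number of relocations, and it is
   repeated every time the program starts. */
void toy_apply_relocs(uintptr_t *image, size_t nwords, uintptr_t load_addr,
                      const struct toy_reloc *relocs, size_t nrelocs)
{
    for (size_t i = 0; i < nrelocs; i++) {
        assert(relocs[i].word_index < nwords);
        image[relocs[i].word_index] += load_addr;
    }
}
```

Words that are not named by any relocation are left untouched, which is why reducing the number of relocations directly reduces the work done at startup.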

An additional complexity is that ELF shared libraries were designed to be roughly equivalent to ordinary archives. This means that by default the main executable may override symbols in the shared library, such that references in the shared library will call the definition in the executable, even if the shared library also defines that same symbol. For example, an executable may define its own version of malloc. The C library also defines malloc, and the C library contains code which calls malloc. If the executable defines malloc itself, it will override the function in the C library. When some other function in the C library calls malloc, it will call the definition in the executable, not the definition in the C library.
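The override rule can be modelled as a search over the loaded modules in a fixed order, with the executable first. The sketch below is only a toy model under that assumption: toy_module and toy_lookup are hypothetical names, and a real dynamic linker uses hash tables and more elaborate scoping rules rather than a linear scan.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of a loaded module: a name plus the symbols it defines. */
struct toy_module {
    const char *name;
    const char *const *symbols;   /* NULL-terminated list of defined names */
};

/* Return the first module in search order that defines `symbol`,
   or NULL if no module defines it. */
const struct toy_module *toy_lookup(const struct toy_module *const *search_order,
                                    size_t nmodules, const char *symbol)
{
    assert(search_order != NULL && symbol != NULL);
    for (size_t i = 0; i < nmodules; i++) {
        for (const char *const *s = search_order[i]->symbols; *s != NULL; s++) {
            if (strcmp(*s, symbol) == 0)
                return search_order[i];
        }
    }
    return NULL;
}
```

If the executable and the C library both define malloc and the executable comes first in the search order, every lookup of malloc finds the executable's definition, including lookups made on behalf of code inside the C library.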

There are thus different requirements pulling in different directions for any specific ELF implementation. The right implementation choices will depend on the characteristics of the processor. That said, most, but not all, processors make fairly similar decisions. I will describe the common case here. An example of a processor which uses the common case is the i386; an example of a processor which makes some different decisions is the PowerPC.

In the common case, code may be compiled in two different modes. By default, code is position dependent. Putting position dependent code into a shared library will cause the program linker to generate a lot of relocation information, and cause the dynamic linker to do a lot of processing at runtime. Code may also be compiled in position independent mode, typically with the -fpic option. Position independent code is slightly slower when it calls a non-static function or refers to a global or static variable. However, it requires much less relocation information, and thus the dynamic linker will start the program faster.

Position independent code will call non-static functions via the Procedure Linkage Table or PLT. This PLT does not exist in .o files. In a .o file, use of the PLT is indicated by a special relocation. When the program linker processes such a relocation, it will create an entry in the PLT. It will adjust the instruction such that it becomes a PC-relative call to the PLT entry. PC-relative calls are inherently position independent and thus do not require a relocation entry themselves. The program linker will create a relocation for the PLT entry which tells the dynamic linker which symbol is associated with that entry. This process reduces the number of dynamic relocations in the shared library from one per function call to one per function called.
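The saving is just the difference between counting call sites and counting distinct callees. As a small illustration, assuming the call sites are given as a list of callee names (toy_count_plt_entries is a made-up helper, not part of any linker):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Count how many distinct callee names appear among the call sites.
   Without a PLT there would be one dynamic relocation per call site;
   with a PLT there is one per distinct callee. */
size_t toy_count_plt_entries(const char *const *call_sites, size_t ncalls)
{
    assert(call_sites != NULL || ncalls == 0);
    size_t distinct = 0;
    for (size_t i = 0; i < ncalls; i++) {
        int seen = 0;
        for (size_t j = 0; j < i; j++) {
            if (strcmp(call_sites[i], call_sites[j]) == 0) {
                seen = 1;
                break;
            }
        }
        if (!seen)
            distinct++;
    }
    return distinct;
}
```

A library with five call sites that between them call malloc three times, free once, and printf once would need five relocations in the first scheme and three in the second.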

Further, PLT entries are normally relocated lazily by the dynamic linker. On most ELF systems this laziness may be overridden by setting the LD_BIND_NOW environment variable when running the program. However, by default, the dynamic linker will not actually apply a relocation to the PLT until some code actually calls the function in question. This also speeds up startup time, in that many invocations of a program will not call every possible function. This is particularly true when considering the shared C library, which has many more function calls than any typical program will execute.

In order to make this work, the program linker initializes the PLT entries to load an index into some register or push it on the stack, and then to branch to common code. The common code calls back into the dynamic linker, which uses the index to find the appropriate PLT relocation, and uses that to find the function being called. The dynamic linker then initializes the PLT entry with the address of the function, and then jumps to the code of the function. The next time the function is called, the PLT entry will branch directly to the function.
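The same idea can be sketched in C with function pointers standing in for PLT entries. Everything here is a toy model with invented names: each slot starts out pointing at a small stub that passes its own index to common code, and the common code replaces the slot with the real function before calling it. A real dynamic linker would look the function up by symbol name rather than in a fixed table.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*toy_fn)(int);

/* Two stand-in "library functions". */
int toy_double(int x) { return 2 * x; }
int toy_negate(int x) { return -x; }

enum { TOY_NSLOTS = 2 };

/* Stand-in for the PLT relocations: index -> function to bind. */
static const toy_fn toy_plt_relocs[TOY_NSLOTS] = { toy_double, toy_negate };

/* Number of times the "dynamic linker" had to resolve a slot. */
int toy_resolve_count;

static int toy_bind_and_call(size_t index, int arg);

/* Per-entry stubs: supply the index, then branch to the common code. */
static int toy_stub0(int arg) { return toy_bind_and_call(0, arg); }
static int toy_stub1(int arg) { return toy_bind_and_call(1, arg); }

/* One slot per PLT entry, initially pointing at its own stub. */
static toy_fn toy_slots[TOY_NSLOTS] = { toy_stub0, toy_stub1 };

/* Common code: find the function for this index, patch the slot so the
   next call goes straight to the function, then make the call. */
static int toy_bind_and_call(size_t index, int arg)
{
    assert(index < (size_t)TOY_NSLOTS);
    toy_resolve_count++;
    toy_slots[index] = toy_plt_relocs[index];
    return toy_slots[index](arg);
}

/* What a call through PLT entry `index` amounts to. */
int toy_call_via_plt(size_t index, int arg)
{
    assert(index < (size_t)TOY_NSLOTS);
    return toy_slots[index](arg);
}
```

The first call through a slot increments toy_resolve_count; later calls through the same slot do not, and a slot that is never called is never resolved at all.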

Before giving an example, I will talk about the other major data structure in position independent code, the Global Offset Table or GOT. This is used for global and static variables. For every reference to a global variable from position independent code, the compiler will generate a load from the GOT to get the address of the variable, followed by a second load to get the actual value of the variable. The address of the GOT will normally be held in a register, permitting efficient access. Like the PLT, the GOT does not exist in a .o file, but is created by the program linker. The program linker will create the dynamic relocations which the dynamic linker will use to initialize the GOT at runtime. Unlike the PLT, the dynamic linker always fully initializes the GOT when the program starts.
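In C terms, the GOT behaves roughly like an array of pointers that is filled in once at startup, with each variable access going through its slot. This is only a sketch with made-up names (toy_got, toy_init_got, toy_read_global); in a real shared library the compiler and linker generate the table and the accesses, and the dynamic linker fills in the slots.

```c
#include <assert.h>
#include <stddef.h>

/* Two "global variables" that position independent code wants to read. */
int toy_counter = 7;
int toy_limit = 100;

/* Toy GOT: one slot per global variable, holding that variable's address. */
enum { TOY_GOT_COUNTER, TOY_GOT_LIMIT, TOY_GOT_SIZE };
int *toy_got[TOY_GOT_SIZE];

/* Stand-in for the dynamic linker: fill in every slot at startup.
   Unlike PLT entries, nothing here is deferred. */
void toy_init_got(void)
{
    toy_got[TOY_GOT_COUNTER] = &toy_counter;
    toy_got[TOY_GOT_LIMIT] = &toy_limit;
}

/* A variable read from position independent code: two loads. */
int toy_read_global(size_t slot)
{
    assert(slot < (size_t)TOY_GOT_SIZE);
    int *addr = toy_got[slot];   /* first load: the address, from the GOT */
    return *addr;                /* second load: the value itself */
}
```

Because the code only ever names a slot, the dynamic linker is free to point that slot at whichever definition of the variable wins symbol resolution.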

For example, on the i386, the address of the GOT is held in the register %ebx. This register is initialized at the entry to each function in position independent code. The initialization sequence varies from one compiler to another, but typically looks something like this:


call __i686.get_pc_thunk.bx
add $offset,%ebx

The function __i686.get_pc_thunk.bx simply looks like this:


mov (%esp),%ebx
ret

This sequence of instructions uses a position independent sequence to get the address at which it is running. Then it uses an offset to get the address of the GOT. The offset is the difference between the start of the GOT and the current location in the code, and the program linker fixes it at link time. Note that this requires that the GOT always be a fixed offset from the code, regardless of where the shared library is loaded. That is, the dynamic linker must load the shared library as a fixed unit; it may not load different parts at varying addresses.
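The arithmetic is easy to check in C. The layout numbers below are invented for the example, and the function names are hypothetical; the point is only that the constant is a difference of two offsets within the library, so the load address cancels out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical link-time layout, as offsets from the start of the library. */
enum {
    TOY_PC_AFTER_CALL = 0x0450,   /* instruction following the call */
    TOY_GOT_START     = 0x1ff4    /* start of the GOT */
};

/* The constant the program linker would place in the add instruction. */
uintptr_t toy_got_offset(void)
{
    return (uintptr_t)(TOY_GOT_START - TOY_PC_AFTER_CALL);
}

/* What the call/add pair computes when the library is loaded at load_addr. */
uintptr_t toy_got_address(uintptr_t load_addr)
{
    assert(load_addr <= UINTPTR_MAX - (uintptr_t)TOY_GOT_START); /* no wraparound */
    uintptr_t pc = load_addr + TOY_PC_AFTER_CALL;  /* value the thunk returns */
    return pc + toy_got_offset();
}
```

Whatever load address is passed in, the result is that load address plus TOY_GOT_START, using a constant that was known at link time.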

Global and static variables are now read or written by first loading the address via a fixed offset from %ebx. The program linker will create dynamic relocations for each entry in the GOT, telling the dynamic linker how to initialize the entry. These relocations are of type GLOB_DAT.

For function calls, the program linker will set up a PLT entry to look like this:


jmp *offset(%ebx)
pushl #index
jmp first_plt_entry

The program linker will allocate an entry in the GOT for each entry in the PLT. It will create a dynamic relocation for the GOT entry of type JMP_SLOT. It will initialize the GOT entry to the base address of the shared library plus the address of the second instruction in the code sequence above. When the dynamic linker does the initial lazy binding on a JMP_SLOT reloc, it will simply add the difference between the shared library load address and the shared library base address to the GOT entry. The effect is that the first jmp instruction will jump to the second instruction, which will push the index entry and branch to the first PLT entry. The first PLT entry is special, and looks like this:


pushl 4(%ebx)
jmp *8(%ebx)

This references the second and third entries in the GOT. The dynamic linker will initialize them to have appropriate values for a callback into the dynamic linker itself. The dynamic linker will use the index pushed by the first code sequence to find the JMP_SLOT relocation. When the dynamic linker determines the function to be called, it will store the address of the function into the GOT entry referenced by the first code sequence. Thus, the next time the function is called, the jmp instruction will branch directly to the right code.

That was a fast pass over a lot of details, but I hope that it conveys the main idea. It means that for position independent code on the i386, every call to a global function requires one extra instruction after the first time it is called. Every reference to a global or static variable requires one extra instruction. Almost every function uses four extra instructions when it starts to initialize %ebx (leaf functions which do not refer to any global variables do not need to initialize %ebx). This all has some negative impact on the program cache. This is the runtime performance penalty paid to let the dynamic linker start the program quickly.

On other processors, the details are naturally different. However, the general flavour is similar: position independent code in a shared library starts faster and runs slightly slower.

More tomorrow.


Comments

12 responses to “Linkers part 4”

  1. wh5a

    Thanks for your great article. I’ve got one question:

    It seems that if a PIC function only accesses global variables but does not call global functions, it will call __i686.get_pc_thunk.cx to compute the GOT address, and the value will be kept in %ecx instead of %ebx. Why is that?

    I’m running Linux. Thanks.

  2. Ian Lance Taylor

    %ebx is a callee saved register for the i386, which means that if a function changes %ebx, it must save it at the start of the function and restore it at the end. This is normally the best choice for the GOT register, since it means that the value does not have to be recomputed or restored after a function call.

    However, if a function does not call any other functions (i.e., it is a leaf function), then it is not important to keep the address of the GOT in a callee saved register. In fact, in that case, it is better to keep it in a caller saved register–that is, a register which a function is permitted to change without needing to save and restore it. So gcc optimizes by putting the GOT table in a caller saved register in a leaf function.

    gcc does not always use %ecx, incidentally, though that is a common choice. Depending on the function, it may choose any available caller saved register.

  3. […] Linkers part 4 – Shared Libraries (Procedure Linkage Table – PLT and Global Offset Table – GOT). […]

  4. jrlevine

    The other advantage of PIC is better code sharing. If there’s no relocation fixups in a page, all processes can share the same physical copy of the page. As soon as there’s load time fixup, you need a separate copy of the page per process. Making and maintaining the copy is way more work than the fixups themselves, since it requires a trap to the system and a copy of the whole page.

  5. Ian Lance Taylor

    Thanks–I remembered to put that bit into part 6.

  6. jlh

    For the uneducated reader, it may be worth saying explicitly that the offset added to the ebx register is the difference between the start of the GOT and the actual location in the code. Otherwise, the following sentence may not be as clear as you might think: “this requires that the GOT always be a fixed offset from the code, regardless of where the shared library is loaded”. An interesting note is that these offsets are all fixed in the code at link time by the linker program.

  7. Ian Lance Taylor

    jlh: yes; thanks for the note.

  8. berkus

    Thanks, this whole GOT/PLT thing is a bit clearer now. I could see the general picture before, but these little details were what kept buzzing in my head all the time.

  9. telenn

    “For every reference to a global variable from position independent code, the compiler will generate a load from the GOT to get the address of the variable, followed by a second load to get the actual value of the variable.” … “Every reference to a global or static variable requires one extra instruction”.

    Well, I thought there was a difference between global and static variables, as explained by U.Drepper in his document “How to write shared libraries” :
    For a non-static global variable (globvar) :
    movl globalvar@GOT(%ebx), %edx
    movl (%edx), %eax
    For a static global variable :
    movl staticvar@GOTOFF(%ebx), %eax

    So it looks like there’s one instruction fewer for accessing a static global variable. It’s as if the GOT entry for staticvar were a place for the variable itself, rather than a place for the absolute address of staticvar.
    What do you think ?

  11. Ian Lance Taylor

    You’re right, on some platforms the compiler can treat a static variable (or a variable with hidden visibility) differently and more efficiently. When this is done, a static variable does not require a GOT entry. The GOTOFF relocation computes the offset from the start of the GOT to the variable itself. This can work because there is no possibility that the variable is overridden by some other shared library, so the offset from the GOT to the variable is fixed.

  12. […] to implement calls to shared libraries—procedure linkage tables, or PLT’s. Ian Lance Taylor published a good treatment of the way dynamic linking works, for us the necessary thing is that inter-library […]