Archive for Programming

Linkers part 4

Shared Libraries

We’ve talked a bit about what object files and executables look like, so what do shared libraries look like? I’m going to focus on ELF shared libraries as used in SVR4 (and GNU/Linux, etc.), as they are the most flexible shared library implementation and the one I know best.

Windows shared libraries, known as DLLs, are less flexible in that you have to compile code differently depending on whether it will go into a shared library or not. You also have to express symbol visibility in the source code. This is not inherently bad, and indeed ELF has picked up some of these ideas over time, but the ELF format makes more decisions at link time and is thus more powerful.

When the program linker creates a shared library, it does not yet know which virtual address that shared library will run at. In fact, in different processes, the same shared library will run at different addresses, depending on the decisions made by the dynamic linker. This means that shared library code must be position independent. More precisely, it must be position independent after the dynamic linker has finished loading it. It is always possible for the dynamic linker to convert any piece of code to run at any virtual address, given sufficient relocation information. However, those relocation computations must be performed every time the program starts, which means that the program will start more slowly. Therefore, any shared library system seeks to generate position independent code which requires a minimal number of relocations to be applied at runtime, while still running at close to the runtime efficiency of position dependent code.

An additional complexity is that ELF shared libraries were designed to be roughly equivalent to ordinary archives. This means that by default the main executable may override symbols in the shared library, such that references in the shared library will call the definition in the executable, even if the shared library also defines that same symbol. For example, an executable may define its own version of malloc. The C library also defines malloc, and the C library contains code which calls malloc. If the executable defines malloc itself, it will override the function in the C library. When some other function in the C library calls malloc, it will call the definition in the executable, not the definition in the C library.

There are thus different requirements pulling in different directions for any specific ELF implementation. The right implementation choices will depend on the characteristics of the processor. That said, most, but not all, processors make fairly similar decisions. I will describe the common case here. An example of a processor which uses the common case is the i386; an example of a processor which makes some different decisions is the PowerPC.

In the common case, code may be compiled in two different modes. By default, code is position dependent. Putting position dependent code into a shared library will cause the program linker to generate a lot of relocation information, and cause the dynamic linker to do a lot of processing at runtime. Code may also be compiled in position independent mode, typically with the -fpic option. Position independent code is slightly slower when it calls a non-static function or refers to a global or static variable. However, it requires much less relocation information, and thus the dynamic linker will start the program faster.

Position independent code will call non-static functions via the Procedure Linkage Table or PLT. This PLT does not exist in .o files. In a .o file, use of the PLT is indicated by a special relocation. When the program linker processes such a relocation, it will create an entry in the PLT. It will adjust the instruction such that it becomes a PC-relative call to the PLT entry. PC-relative calls are inherently position independent and thus do not require a relocation entry themselves. The program linker will create a relocation for the PLT entry which tells the dynamic linker which symbol is associated with that entry. This process reduces the number of dynamic relocations in the shared library from one per function call to one per function called.

Further, PLT entries are normally relocated lazily by the dynamic linker. On most ELF systems this laziness may be overridden by setting the LD_BIND_NOW environment variable when running the program. However, by default, the dynamic linker will not actually apply a relocation to the PLT until some code actually calls the function in question. This also speeds up startup time, in that many invocations of a program will not call every possible function. This is particularly true when considering the shared C library, which has many more function calls than any typical program will execute.

In order to make this work, the program linker initializes the PLT entries to load an index into some register or push it on the stack, and then to branch to common code. The common code calls back into the dynamic linker, which uses the index to find the appropriate PLT relocation, and uses that to find the function being called. The dynamic linker then initializes the PLT entry with the address of the function, and then jumps to the code of the function. The next time the function is called, the PLT entry will branch directly to the function.
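This lazy binding dance can be modeled in plain C, with an array of function pointers standing in for the GOT entries and resolver stubs standing in for the common PLT code; all the names and data structures here are invented for illustration, not real dynamic linker interfaces:

```c
#include <assert.h>

/* A toy model of lazy PLT binding.  Each "PLT slot" is a function
   pointer in a GOT-like table.  Slots start out pointing at a
   resolver stub; the first call patches the slot, so later calls
   go straight to the target. */

static int real_square(int x) { return x * x; }
static int real_double(int x) { return 2 * x; }

typedef int (*func_t)(int);

static func_t got[2];        /* GOT entries, one per PLT slot */
static int resolver_calls;   /* how often the "dynamic linker" ran */

/* Table the resolver consults, standing in for JMP_SLOT relocations. */
static func_t targets[2] = { real_square, real_double };

static int resolve0(int x) { /* stub for slot 0 */
    resolver_calls++;
    got[0] = targets[0];     /* patch the GOT entry... */
    return got[0](x);        /* ...then continue into the real function */
}
static int resolve1(int x) { /* stub for slot 1 */
    resolver_calls++;
    got[1] = targets[1];
    return got[1](x);
}

int demo_lazy_binding(void) {
    got[0] = resolve0;       /* program-linker-style initialization */
    got[1] = resolve1;
    resolver_calls = 0;
    got[0](3);               /* first call: goes through the resolver */
    got[0](4);               /* second call: jumps directly */
    return resolver_calls;   /* slot 1 was never called, never resolved */
}
```

The resolver runs exactly once, even though the function is called twice, and the never-called second slot costs nothing at startup; that is the whole point of lazy binding.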

Before giving an example, I will talk about the other major data structure in position independent code, the Global Offset Table or GOT. This is used for global and static variables. For every reference to a global variable from position independent code, the compiler will generate a load from the GOT to get the address of the variable, followed by a second load to get the actual value of the variable. The address of the GOT will normally be held in a register, permitting efficient access. Like the PLT, the GOT does not exist in a .o file, but is created by the program linker. The program linker will create the dynamic relocations which the dynamic linker will use to initialize the GOT at runtime. Unlike the PLT, the dynamic linker always fully initializes the GOT when the program starts.

For example, on the i386, the address of the GOT is held in the register %ebx. This register is initialized at the entry to each function in position independent code. The initialization sequence varies from one compiler to another, but typically looks something like this:


call __i686.get_pc_thunk.bx
add $offset,%ebx

The function __i686.get_pc_thunk.bx simply looks like this:


mov (%esp),%ebx
ret

This sequence of instructions uses a position independent sequence to get the address at which it is running. Then it uses an offset to get the address of the GOT. Note that this requires that the GOT always be a fixed offset from the code, regardless of where the shared library is loaded. That is, the dynamic linker must load the shared library as a fixed unit; it may not load different parts at varying addresses.

Global and static variables are now read or written by first loading the address via a fixed offset from %ebx. The program linker will create dynamic relocations for each entry in the GOT, telling the dynamic linker how to initialize the entry. These relocations are of type GLOB_DAT.

For function calls, the program linker will set up a PLT entry to look like this:


jmp *offset(%ebx)
pushl $index
jmp first_plt_entry

The program linker will allocate an entry in the GOT for each entry in the PLT. It will create a dynamic relocation for the GOT entry of type JMP_SLOT. It will initialize the GOT entry to the base address of the shared library plus the address of the second instruction in the code sequence above. When the dynamic linker does the initial lazy binding on a JMP_SLOT reloc, it will simply add the difference between the shared library load address and the shared library base address to the GOT entry. The effect is that the first jmp instruction will jump to the second instruction, which will push the index entry and branch to the first PLT entry. The first PLT entry is special, and looks like this:


pushl 4(%ebx)
jmp *8(%ebx)

This references the second and third entries in the GOT. The dynamic linker will initialize them to have appropriate values for a callback into the dynamic linker itself. The dynamic linker will use the index pushed by the first code sequence to find the JMP_SLOT relocation. When the dynamic linker determines the function to be called, it will store the address of the function into the GOT entry referenced by the first code sequence. Thus, the next time the function is called, the jmp instruction will branch directly to the right code.

That was a fast pass over a lot of details, but I hope that it conveys the main idea. It means that for position independent code on the i386, every call to a global function requires one extra instruction after the first time it is called. Every reference to a global or static variable requires one extra instruction. Almost every function uses four extra instructions when it starts to initialize %ebx (leaf functions which do not refer to any global variables do not need to initialize %ebx). This all has some negative impact on the program cache. This is the runtime performance penalty paid to let the dynamic linker start the program quickly.

On other processors, the details are naturally different. However, the general flavour is similar: position independent code in a shared library starts faster and runs slightly slower.

More tomorrow.


Linkers part 3

Continuing notes on linkers.

Address Spaces

An address space is simply a view of memory, in which each byte has an address. The linker deals with three distinct types of address space.

Every input object file is a small address space: the contents have addresses, and the symbols and relocations refer to the contents by addresses.

The output program will be placed at some location in memory when it runs. This is the output address space, which I will generally describe using virtual memory addresses.

The output program will be loaded at some location in memory. This is the load memory address. On typical Unix systems virtual memory addresses and load memory addresses are the same. On embedded systems they are often different; for example, the initialized data (the initial contents of global or static variables) may be loaded into ROM at the load memory address, and then copied into RAM at the virtual memory address.
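The embedded ROM-to-RAM case can be sketched in C, with arrays standing in for the regions at the load memory address and the virtual memory address; the names are invented for illustration (real startup code would use linker-script symbols such as the ends of the data section):

```c
#include <assert.h>
#include <string.h>

/* Sketch of an embedded startup step: initialized data is stored in
   ROM at its load memory address and copied into RAM at its virtual
   memory address before main runs.  The arrays stand in for the
   linker-provided section boundary symbols. */

#define DATA_SIZE 4

static const int rom_image[DATA_SIZE] = { 1, 2, 3, 4 }; /* LMA copy */
static int ram_data[DATA_SIZE];                         /* VMA home */

void copy_data_section(void) {
    /* A real startup loop iterates from the data section's load
       address to its end; memcpy expresses the same copy. */
    memcpy(ram_data, rom_image, sizeof ram_data);
}

int data_initialized(void) {
    copy_data_section();
    return ram_data[0] == 1 && ram_data[3] == 4;
}
```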

Shared libraries can normally be run at different virtual memory addresses in different processes. A shared library has a base address when it is created; this is often simply zero. When the dynamic linker copies the shared library into the virtual memory space of a process, it must apply relocations to adjust the shared library to run at its virtual memory address. Shared library systems minimize the number of relocations which must be applied, since they take time when starting the program.

Object File Formats

As I said above, an assembler turns human readable assembly language into an object file. An object file is a binary data file written in a format designed as input to the linker. The linker generates an executable file. This executable file is a binary data file written in a format designed as input for the operating system or the loader (this is true even when linking dynamically, as normally the operating system loads the executable before invoking the dynamic linker to begin running the program). There is no logical requirement that the object file format resemble the executable file format. However, in practice they are normally very similar.

Most object file formats define sections. A section typically holds memory contents, or it may be used to hold other types of data. Sections generally have a name, a type, a size, an address, and an associated array of data.

Object file formats may be classed in two general types: record oriented and section oriented.

A record oriented object file format defines a series of records of varying size. Each record starts with some special code, and may be followed by data. Reading the object file requires reading it from the beginning and processing each record. Records are used to describe symbols and sections. Relocations may be associated with sections or may be specified by other records. IEEE-695 and Mach-O are record oriented object file formats used today.

In a section oriented object file format the file header describes a section table with a specified number of sections. Symbols may appear in a separate part of the object file described by the file header, or they may appear in a special section. Relocations may be attached to sections, or they may appear in separate sections. The object file may be read by reading the section table, and then reading specific sections directly. ELF, COFF, PE, and a.out are section oriented object file formats.
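The advantage of the section oriented layout is direct access: a reader consults the table and seeks straight to the section it wants. Here is a toy section-oriented format, far simpler than real ELF and invented purely for illustration, built and read back in memory:

```c
#include <stdint.h>
#include <string.h>

/* A miniature section-oriented format: a header gives the section
   count, followed by a table of (offset, size) entries, so a reader
   can jump straight to any section instead of scanning records from
   the beginning of the file.  Invented for illustration. */

struct mini_header  { uint32_t nsections; };
struct mini_section { uint32_t offset; uint32_t size; };

/* Build a 2-section "file" in a buffer, then read section 1 directly
   via the section table.  Returns 1 if the read-back matches. */
int read_second_section_ok(void) {
    unsigned char file[64];
    struct mini_header hdr = { 2 };
    struct mini_section sects[2] = {
        { 24, 5 },   /* section 0: "hello" at bytes 24..28 */
        { 29, 6 },   /* section 1: "world!" at bytes 29..34 */
    };
    memcpy(file, &hdr, sizeof hdr);
    memcpy(file + sizeof hdr, sects, sizeof sects);
    memcpy(file + 24, "hello", 5);
    memcpy(file + 29, "world!", 6);

    /* Reader side: load table entry 1, then seek to its contents. */
    struct mini_section s;
    memcpy(&s, file + sizeof hdr + 1 * sizeof s, sizeof s);
    return s.size == 6 && memcmp(file + s.offset, "world!", 6) == 0;
}
```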

Every object file format needs to be able to represent debugging information. Debugging information is generated by the compiler and read by the debugger. In general the linker can just treat it like any other type of data. However, in practice the debugging information for a program can be larger than the actual program itself. The linker can use various techniques to reduce the amount of debugging information, thus reducing the size of the executable. This can speed up the link, but requires the linker to understand the debugging information.

The a.out object file format stores debugging information using special strings in the symbol table, known as stabs. These special strings are simply the names of symbols with a special type. This technique is also used by some variants of ECOFF, and by older versions of Mach-O.

The COFF object file format stores debugging information using special fields in the symbol table. This type information is limited, and is completely inadequate for C++. A common technique to work around these limitations is to embed stabs strings in a COFF section.

The ELF object file format stores debugging information in sections with special names. The debugging information can be stabs strings or the DWARF debugging format.

More next week.


Linkers part 2

I’m back, and I’m still doing the linker technical introduction.

Shared libraries were invented as an optimization for virtual memory systems running many processes simultaneously. People noticed that there is a set of basic functions which appear in almost every program. Before shared libraries, in a system which runs multiple processes simultaneously, that meant that almost every process had a copy of exactly the same code. This suggested that on a virtual memory system it would be possible to arrange that code so that a single copy could be shared by every process using it. The virtual memory system would be used to map the single copy into the address space of each process which needed it. This would require less physical memory to run multiple programs, and thus yield better performance.

I believe the first implementation of shared libraries was on SVR3, based on COFF. This implementation was simple, and basically assigned each shared library a fixed portion of the virtual address space. This did not require any significant changes to the linker. However, requiring each shared library to reserve an appropriate portion of the virtual address space was inconvenient.

SunOS4 introduced a more flexible version of shared libraries, which was later picked up by SVR4. This implementation postponed some of the operation of the linker to runtime. When the program started, it would automatically run a limited version of the linker which would link the program proper with the shared libraries. The version of the linker which runs when the program starts is known as the dynamic linker. When it is necessary to distinguish them, I will refer to the version of the linker which creates the program as the program linker. This type of shared libraries was a significant change to the traditional program linker: it now had to build linking information which could be used efficiently at runtime by the dynamic linker.

That is the end of the introduction. You should now understand the basics of what a linker does. I will now turn to how it does it.

Basic Linker Data Types

The linker operates on a small number of basic data types: symbols, relocations, and contents. These are defined in the input object files. Here is an overview of each of these.

A symbol is basically a name and a value. Many symbols represent static objects in the original source code–that is, objects which exist in a single place for the duration of the program. For example, in an object file generated from C code, there will be a symbol for each function and for each global and static variable. The value of such a symbol is simply an offset into the contents. This type of symbol is known as a defined symbol. It’s important not to confuse the value of the symbol representing the variable my_global_var with the value of my_global_var itself. The value of the symbol is roughly the address of the variable: the value you would get from the expression &my_global_var in C.
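That distinction is worth a tiny check in C (the variable name is just an example):

```c
/* The symbol my_global_var has a value -- roughly the address
   &my_global_var -- which is distinct from the contents stored
   in the variable itself (42). */

int my_global_var = 42;

int symbol_vs_contents(void) {
    int *symbol_value  = &my_global_var; /* what the linker assigns */
    int variable_value = my_global_var;  /* what the program stores */
    return symbol_value != 0 && variable_value == 42;
}
```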

Symbols are also used to indicate a reference to a name defined in a different object file. Such a reference is known as an undefined symbol. There are other less commonly used types of symbols which I will describe later.

During the linking process, the linker will assign an address to each defined symbol, and will resolve each undefined symbol by finding a defined symbol with the same name.

A relocation is a computation to perform on the contents. Most relocations refer to a symbol and to an offset within the contents. Many relocations will also provide an additional operand, known as the addend. A simple, and commonly used, relocation is “set this location in the contents to the value of this symbol plus this addend.” The types of computations that relocations do are inherently dependent on the architecture of the processor for which the linker is generating code. For example, RISC processors which require two or more instructions to form a memory address will have separate relocations to be used with each of those instructions; for example, “set this location in the contents to the lower 16 bits of the value of this symbol.”
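The common "symbol plus addend" computation can be sketched concretely. The struct layout below is invented for illustration (real formats, such as ELF's Rela entries, differ in detail), patching a 32-bit little-endian word:

```c
#include <stdint.h>

/* A sketch of the most common relocation computation, "S + A":
   store the value of a symbol plus an addend at a given offset in
   the section contents.  Layout invented for illustration. */

struct reloc {
    uint32_t offset;       /* where in the contents to patch */
    uint32_t symbol_value; /* S: the address the linker assigned */
    int32_t  addend;       /* A: a constant from the object file */
};

/* Patch a 32-bit little-endian word in place. */
void apply_reloc(unsigned char *contents, const struct reloc *r) {
    uint32_t value = r->symbol_value + (uint32_t)r->addend;
    contents[r->offset + 0] = value & 0xff;
    contents[r->offset + 1] = (value >> 8) & 0xff;
    contents[r->offset + 2] = (value >> 16) & 0xff;
    contents[r->offset + 3] = (value >> 24) & 0xff;
}

/* Apply one relocation and read back the patched word. */
uint32_t demo_apply(void) {
    unsigned char contents[8] = { 0 };
    struct reloc r = { 4, 0x08049000, 16 }; /* S + A = 0x08049010 */
    apply_reloc(contents, &r);
    return (uint32_t)contents[4] | ((uint32_t)contents[5] << 8)
         | ((uint32_t)contents[6] << 16) | ((uint32_t)contents[7] << 24);
}
```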

During the linking process, the linker will perform all of the relocation computations as directed. A relocation in an object file may refer to an undefined symbol. If the linker is unable to resolve that symbol, it will normally issue an error (but not always: for some symbol types or some relocation types an error may not be appropriate).

The contents are what memory should look like during the execution of the program. Contents have a size, an array of bytes, and a type. They contain the machine code generated by the compiler and assembler (known as text). They contain the values of initialized variables (data). They contain static unnamed data like string constants and switch tables (read-only data or rdata). They contain uninitialized variables, in which case the array of bytes is generally omitted and assumed to contain only zeroes (bss). The compiler and the assembler work hard to generate exactly the right contents, but the linker really doesn’t care about them except as raw data. The linker reads the contents from each file, concatenates them all together sorted by type, applies the relocations, and writes the result into the executable file.

Basic Linker Operation

At this point we already know enough to understand the basic steps used by every linker.

  • Read the input object files. Determine the length and type of the contents. Read the symbols.
  • Build a symbol table containing all the symbols, linking undefined symbols to their definitions.
  • Decide where all the contents should go in the output executable file, which means deciding where they should go in memory when the program runs.
  • Read the contents data and the relocations. Apply the relocations to the contents. Write the result to the output file.
  • Optionally write out the complete symbol table with the final values of the symbols.
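The steps above can be exercised at toy scale. This sketch, with invented data structures, links two one-symbol "object files": it lays out their contents, resolves one undefined reference against the other file's symbol table, and applies a single byte-sized "set to symbol value" relocation:

```c
#include <string.h>

/* Toy linker pass over two "object files", each contributing a
   chunk of contents and one symbol.  Structures are invented for
   illustration; a relocation here simply stores the one-byte
   address of a named symbol into the contents. */

struct sym { const char *name; int offset; int defined; };
struct obj {
    unsigned char contents[8];
    int size;
    struct sym symbol;
    int reloc_offset;      /* where to store the value (-1: none) */
    const char *reloc_sym; /* name of the symbol whose value to store */
};

/* Lay out a then b, resolve a's relocation, patch the output image.
   Returns the resolved address, or -1 for an undefined symbol. */
int link_two(struct obj *a, struct obj *b, unsigned char *out) {
    struct obj *objs[2] = { a, b };
    int bases[2] = { 0, a->size };          /* layout decision */
    memcpy(out + bases[0], a->contents, a->size);
    memcpy(out + bases[1], b->contents, b->size);

    for (int i = 0; i < 2; i++) {
        struct obj *o = objs[i];
        if (o->reloc_offset < 0) continue;
        for (int j = 0; j < 2; j++) {       /* symbol table lookup */
            struct sym *s = &objs[j]->symbol;
            if (s->defined && strcmp(s->name, o->reloc_sym) == 0) {
                out[bases[i] + o->reloc_offset] =
                    (unsigned char)(bases[j] + s->offset);
                return bases[j] + s->offset;
            }
        }
        return -1;                          /* undefined symbol */
    }
    return 0;
}

int demo_link(void) {
    struct obj a = { {0x90,0,0,0}, 4, { "main",   0, 1 },  1, "helper" };
    struct obj b = { {0xc3,0,0,0}, 4, { "helper", 0, 1 }, -1, 0 };
    unsigned char image[16];
    int addr = link_two(&a, &b, image);
    /* helper lands at b's base (4); that value is patched into image[1] */
    return addr == 4 && image[1] == 4;
}
```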

More tomorrow.


Linkers part 1

I’ve been working on and off on a new linker. To my surprise, I’ve discovered in talking about this that some people, even some computer programmers, are unfamiliar with the details of the linking process. I’ve decided to write some notes about linkers, with the goal of producing an essay similar to my existing one about the GNU configure and build system.

As I only have the time to write one thing a day, I’m going to do this on my blog over time, and gather the final essay together later. I believe that I may be up to five readers, and I hope y’all will accept this digression into stuff that matters. I will return to random philosophizing and minding other people’s business soon enough.

A Personal Introduction

Who am I to write about linkers?

I wrote my first linker back in 1988, for the AMOS operating system which ran on Alpha Micro systems. (If you don’t understand the following description, don’t worry; all will be explained below). I used a single global database to register all symbols. Object files were checked into the database after they had been compiled. The link process mainly required identifying the object file holding the main function. Other object files were pulled in by reference. I reverse engineered the object file format, which was undocumented but quite simple. The goal of all this was speed, and indeed this linker was much faster than the system one, mainly because of the speed of the database.

I wrote my second linker in 1993 and 1994. This linker was designed and prototyped by Steve Chamberlain while we both worked at Cygnus Support (later Cygnus Solutions, later part of Red Hat). This was a complete reimplementation of the BFD based linker which Steve had written a couple of years before. The primary target was a.out and COFF. Again the goal was speed, especially compared to the original BFD based linker. On SunOS 4 this linker was almost as fast as running the cat program on the input .o files.

The linker I am now working on, called gold, will be my third. It is exclusively an ELF linker. Once again, the goal is speed, in this case being faster than my second linker. That linker has been significantly slowed down over the years by adding support for ELF and for shared libraries. This support was patched in rather than being designed in. Future plans for the new linker include support for incremental linking–which is another way of increasing speed.

There is an obvious pattern here: everybody wants linkers to be faster. This is because the job which a linker does is uninteresting. The linker is a speed bump for a developer, a process which takes a relatively long time but adds no real value. So why do we have linkers at all? That brings us to our next topic.

A Technical Introduction

What does a linker do?

It’s simple: a linker converts object files into executables and shared libraries. Let’s look at what that means. For cases where a linker is used, the software development process consists of writing program code in some language: e.g., C or C++ or Fortran (but typically not Java, as Java normally works differently, using a loader rather than a linker). A compiler translates this program code, which is human readable text, into another form of human readable text known as assembly code. Assembly code is a readable form of the machine language which the computer can execute directly. An assembler is used to turn this assembly code into an object file. For completeness, I’ll note that some compilers include an assembler internally, and produce an object file directly. Either way, this is where things get interesting.

In the old days, when dinosaurs roamed the data centers, many programs were complete in themselves. In those days there was generally no compiler–people wrote directly in assembly code–and the assembler actually generated an executable file which the machine could execute directly. As languages like Fortran and Cobol started to appear, people began to think in terms of libraries of subroutines, which meant that there had to be some way to run the assembler at two different times, and combine the output into a single executable file. This required the assembler to generate a different type of output, which became known as an object file (I have no idea where this name came from). And a new program was required to combine different object files together into a single executable. This new program became known as the linker (the source of this name should be obvious).

Linkers still do the same job today. In the decades that followed, one new feature has been added: shared libraries.

More tomorrow.


Cutting Without Measuring

We’re having our front steps replaced. They are made of concrete with embedded tiles (the old steps were badly cracked, and had to be removed anyhow when we had our foundation rebuilt). We were able to salvage about half the tiles from the old steps as they were destroyed, and managed to find matching new ones for about $15 a tile. My wife and daughter spent about two hours laying out the tiles on a sheet, marked with tape to the measured size of the steps, and with the location of each tile measured out. The tiles are each marked to indicate which one goes where (they are various different colors). This map will be used by the builders to place the tiles precisely into the concrete before it dries.

The amount of time spent planning for this task of placing tiles far exceeds the amount of time which the task will actually take. This approach is captured in the old carpentry dictum “measure twice, cut once.” While participating in this tile project, I couldn’t help noticing that this is not how programmers work. It is appropriate to take a long time to measure the tile location because once the concrete dries–a couple of hours at most–there is no going back. The tiles will be set into their position, and the only way to change them will be to destroy the steps again.

Programming is quite different. It takes place in a fluid medium. Any decision can be changed at any time. Of course, some decisions are harder to change than others, because they affect more code. But decisions at the level of “change this tile to another color” or “move this tile a few inches” are generally pretty easy to make in code. This fluidity means that programmers almost never measure twice and cut once. In fact, programmers often don’t measure at all. They just code.

There have been various efforts over the years to encourage programmers to spend more time in the design phase before they start the coding phase. I really don’t know any experienced programmer who takes them seriously. Most projects do start with a relatively informal written plan. This plan is helpful but soon becomes inaccurate. It is generally not updated in detail.

This is not inherently a bad thing. It is appropriate for the nature of the medium. Despite the claims made by the proponents of better design, the truth is that the biggest problems in programming do not arise from lack of design. They come where different systems are stitched together, such that the systems were never designed together in the first place. The biggest problems in programming come from managing complexity, dealing with unexpected and unplanned interactions between different parts of the program and between different layers of the overall system.

This complexity is most likely irreducible. Perhaps what we need in our programs is more ability to handle unforeseen events. Perhaps we need to write our code so that it does approximately the right thing in all cases, as well as doing the precisely correct thing in well understood cases. Of course the “right thing” to do is an inherently context dependent operation. So this would have to be expressed in the interfaces between components, rather than in the components themselves.

So perhaps what we need is an interface more sophisticated than the function call, one that can describe what the caller expects. So you wouldn’t just say “draw a window this size at this position,” you would also specify what type of data was going into the window and how the window was likely to be used. That information would be used when it turns out to be impossible or inappropriate to fulfill the original request, perhaps because the screen is too small or because the user has requested extra large windows.

Of course, if done in a simplistic manner, this amounts to adding parameters to the function call, which just pushes the problem back one level. Is there any way that we can capture this kind of fuzzy information in a useful way? Can our programs automatically learn what to do in unexpected situations? Can we provide mechanisms that we can use to teach them?

