Archive for Programming

Abolish Syntax

Although I can’t find it now, I think it was Dan Bernstein who said somewhere that programs should avoid syntax when possible. Using syntax means permitting syntax errors. Avoiding syntax means making syntax errors impossible.

Even if it wasn’t Bernstein who said this, you can see the idea in action in things like his tinydns configuration file format. There are no keywords or grouping constructs. Each statement is a single line. The first character on the line indicates the type of statement.

The expectation is that if people want a more comprehensible syntax, they will write a separate program which will read something and generate the un-syntax. That way any problems are isolated to that separate program. (Actually tinydns-data is itself a separate program which reads the un-syntax and turns it into a binary form for the tinydns program.)
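
To make the idea concrete, here is a minimal sketch in C of a reader for a format of this kind. It is not tinydns-data's actual code: the names classify_line, split_fields, and the record_kind values are hypothetical, and only a handful of the real line types are covered. The point is that the first character selects the statement type and the rest of the line is just colon-separated fields.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* A few tinydns-data line types (a subset, chosen for illustration). */
enum record_kind {
    REC_BLANK, REC_COMMENT, REC_NS, REC_HOST,
    REC_ALIAS, REC_MX, REC_CNAME, REC_UNKNOWN
};

/* The first character alone decides the statement type: no keywords,
   no grouping, no state carried over from one line to the next. */
enum record_kind classify_line(const char *line)
{
    assert(line != NULL);
    switch (line[0]) {
    case '\0':
    case '\n': return REC_BLANK;
    case '#':  return REC_COMMENT;
    case '.':  return REC_NS;      /* name server */
    case '=':  return REC_HOST;    /* A record plus PTR record */
    case '+':  return REC_ALIAS;   /* A record only */
    case '@':  return REC_MX;
    case 'C':  return REC_CNAME;
    default:   return REC_UNKNOWN;
    }
}

/* Split the text after the type character into colon-separated fields,
   modifying the buffer in place. Returns the number of fields stored;
   anything beyond max_fields is dropped. */
size_t split_fields(char *body, char *fields[], size_t max_fields)
{
    size_t n = 0;
    body[strcspn(body, "\n")] = '\0';   /* drop a trailing newline */
    while (n < max_fields) {
        fields[n++] = body;
        char *colon = strchr(body, ':');
        if (colon == NULL)
            break;
        *colon = '\0';
        body = colon + 1;
    }
    return n;
}
```

For a line such as "=www.example.com:192.0.2.1:86400", classify_line reports REC_HOST and split_fields, called on the text after the first character, yields three fields. The only way to get a line wrong is to pick an unknown type character or supply bad field values; there is nothing to nest or to leave unterminated.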

I think this idea deserves wider use. It fits with the general idea that modules should be independent. When you have a program which needs to read some data, the format of that data should be as simple as possible. When it is desirable to permit a more complicated representation, that should be done by providing a mechanism to convert the complicated representation to the simple one.

Programming languages are of course a sinkhole of syntax. That said, an example of a programming language with minimal syntax is sed. Those with only a passing familiarity with sed may be surprised at how powerful it is: the conditional branch commands t and T (the latter a GNU extension) make it Turing complete. It would not be a satisfactory language for general purpose programming, but it is quite effective in its own domain. Another language with minimal syntax is, of course, APL.


GCC Inline Assembler

GCC’s inline assembler syntax is very powerful and is the best mechanism I know of to mix assembly code and optimized C/C++ code. It lets you take advantage of assembly features like add-with-carry or direct calls into the kernel without losing optimizations. I don’t know of any other approach which supports that.

That said, the inline assembler syntax is also a set of traps for the unwary. Because the compiler applies optimizations around the assembler code, the inline assembler construct must precisely describe what the inline assembler code does. This is done by using constraints and by listing registers and memory that are clobbered, that is, changed in a way which cannot be easily described. Constraints are underdocumented, machine specific, and easy to get wrong.

For a complex and underdocumented construct like inline assembler, it is naturally tempting to simply copy some existing example. Unfortunately, even minor changes to the assembler code can require changes to the constraints. Unfortunately, there is no automated way to check whether you got them right. Unfortunately, it is common for incorrect constraints to work fine in simple cases and break in complex ones, or to work fine with one gcc release and break with another.

So using inline assembler really requires reading and understanding the documentation. In particular the = and & constraints must be used correctly. On non-orthogonal machines like the x86 the register class constraints must be used correctly. In many cases it will be better to simply write the assembler code in a separate file and call it.

Several years ago I sketched out a different approach that might be easier to use in some cases. However, actually implementing something along those lines requires embedding the assembler into the compiler. This is unlikely to ever actually happen. I’m certainly not working on it.


Multi Debugging

Many programs these days are written using multiple threads, multiple processes, and multiple languages. Our current debugging solutions don’t cope particularly well with any of those.

gdb supports multiple threads. However, the interface is hard to work with. You have to select which thread you want to look at. Threads are referred to using numbers which are relatively arbitrary; it would be helpful to be able to say things like “show me the server thread.” When a thread releases a lock, it would be helpful to be able to automatically switch to the thread which acquires the lock.

When debugging multiple processes, the most interesting case is handling remote calls between the processes. It would be desirable to be able to switch easily from the process making the call to the process executing the call. Naturally multiple processes may be running on different systems, so this requires communicating with different machines during the debugging session. Multi process core files would also be interesting.

Debugging multiple languages is a difficult case, but one applicable to many web-based applications. It’s normal for code to move in and out of a scripting language, such as Python, and underlying C/C++/Java code. In the multiple process case you may also have some code running in a browser written in JavaScript.

For code with strong interfaces, multi-process and multi-language debugging is less interesting. However, the reality of today’s programs is that they aren’t written with strong interfaces, and program logic moves between different components. A flexible and powerful debugger could be very useful.

There is a lot of interesting work going on with gdb these days. Making gdb more powerful is hard, but I hope that it will be possible.


Version Control Wish

A lot of smart people have thought much harder than I have about version control systems, and I am by no means an expert on them. That said, this is what I want from a VCS, beyond the obvious: I want to be able to name a patch. I want to be able to easily transfer that patch from one branch to another. I want to be able to add chunks to the patch, and modify existing chunks. If I earlier transferred the patch to another branch, I want to be able to easily move the modifications I made.

Clearly there is a sense in which a patch is a branch. But it isn’t a branch in the usual sense. I may have several active patches which live on my development branch. When I update my development branch–sync it to the master sources, or in general to other repositories–I want my patches to update also. When I want to move a patch to a release branch, I want the VCS to roll the patch back to the current merge point of the development branch and the release branch, and to apply that modified patch to the release branch.

For example, let’s say that patch P was started on the development branch at version Rd. Let’s say that release branch B was branched off of the development branch at version Rb. I do some work on P, and then I update the development branch to version Re, and then I do some more work on P. Now I’m happy with patch P and I want to put it on the release branch. I want the VCS to get P out of the development branch. I want it to reverse apply the diffs from Re back to Rb. I want it to take the resulting diff and apply it to the release branch.

Then I want to work on patch P some more, and then move it over to the release branch again. Now I want the VCS to pick up the changes since I last moved it over and only apply those changes–after, of course, removing any changes I dragged in from other people.

I want P to have a name, not a revision number, and I want these operations to be simple VCS commands, not complicated scripts.

Naturally merge conflicts are possible at several different stages here, and the final output may have to include several different bits of source code for each conflict. Or perhaps the VCS could ask me what to do as it goes along; that would be OK.

These are the sorts of operations I find myself doing fairly regularly. Obviously I can do them with any VCS, by using manual bookkeeping and attention to detail. I have yet to find any VCS which makes them simple.


Combining Versions

Sun introduced a symbol versioning scheme to use for the linker. Their implementation is relatively simple: symbol versions are defined in a version script provided when a shared library is created. The dynamic linker can verify that all required versions are present. This is useful for ensuring that an application can run with a specific version of the library.

In the Sun versioning scheme, when a symbol is changed to have an incompatible interface, the library file name must change. This then produces a new DT_SONAME entry, which leads to new DT_NEEDED entries, and thus manages incompatibility at that level.

Ulrich Drepper and Eric Youngdale introduced a much more sophisticated symbol versioning scheme, which is used by glibc, the GNU linker, and gold. The key differences are that versions may be specified in object files and that shared libraries may contain multiple independent versions of the same symbol. Versions are specified in object files by naming the symbol NAME@VERSION or NAME@@VERSION. In the former case the symbol is a hidden version, available only by specific request. In the latter case the symbol is a default version, and references to NAME will be linked to NAME@@VERSION. Versions may also be specified in version scripts.

This facility means that in principle it is never necessary to change the library file name. The versioning scheme lets the dynamic linker direct each symbol reference to the appropriate version. This in turn means that in a complicated program with many shared libraries compiled against different versions of the base library, only one instance of the base library needs to be loaded.

However, this additional complexity leads to additional ambiguity. There are now two possible sources of a symbol version: the name in the object file and an entry in the version script. There is the possibility that two instances of the same name will disagree on whether the name should be globally visible or not–in fact, this is normal, as undefined references will always use NAME@VERSION, not NAME@@VERSION. Symbol overriding can be confusing: if the main executable defines NAME without a version, which versions should it override in the shared library? Which version should be used in the program? Symbol visibility adds an additional wrinkle to this.

The most important issue for the linker arises when it sees both NAME and NAME@VERSION, and then sees NAME@@VERSION. At that time the linker has seen two separate symbols and has to decide whether to merge them. The rules that gold currently follows are these:

  • If NAME is hidden, and NAME@@VERSION is in a shared object, they are two independent symbols, and we do not change NAME or its version.
  • If NAME already has a version, because we earlier saw NAME@@VERSION2, then we produce two separate symbols, and leave NAME@@VERSION2 as the default symbol.
  • Otherwise, we change the version of NAME to VERSION, and do normal symbol resolution.
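
The three rules can be written down as a small decision function. This is a simplified model for illustration only, not gold's actual code: the type seen_symbol, the enum merge_action, and the function on_default_version are hypothetical names, and the sketch ignores everything the rules above do not mention (for example, what happens if the two versions turn out to be the same).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* What to do with an already-seen NAME when NAME@@VERSION shows up. */
enum merge_action {
    KEEP_INDEPENDENT,           /* rule 1: two unrelated symbols */
    KEEP_SEPARATE_OLD_DEFAULT,  /* rule 2: NAME@@VERSION2 stays the default */
    ADOPT_VERSION_AND_RESOLVE   /* rule 3: NAME becomes NAME@@VERSION */
};

/* The little we need to know about the NAME seen earlier. */
struct seen_symbol {
    bool hidden;            /* NAME is hidden */
    const char *version;    /* NULL if NAME has no version yet */
};

/* Apply the rules in the order listed above. */
enum merge_action on_default_version(const struct seen_symbol *name,
                                     bool new_def_in_shared_object)
{
    assert(name != NULL);
    if (name->hidden && new_def_in_shared_object)
        return KEEP_INDEPENDENT;
    if (name->version != NULL)
        return KEEP_SEPARATE_OLD_DEFAULT;
    return ADOPT_VERSION_AND_RESOLVE;
}
```

Note that the order of the tests matters: a hidden NAME that already carries a version is handled by the first rule when the new definition comes from a shared object, and by the second rule otherwise.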

I recently fixed a bug in this code in gold, which was breaking symbol overriding in a specific case. I wouldn’t be surprised if there are more bugs. As far as I know nobody has worked through all the symbol combining issues and defined what should happen.

