GCC Garbage

gcc uses garbage collection internally during the course of a compilation. I wasn’t too involved with gcc at the time that garbage collection was introduced, but I certainly saw the problems that it was trying to solve. Before garbage collection, gcc allocated memory using obstacks. An obstack is a memory allocation pool in which if you free an object you also free all more recently allocated objects. Obstacks are still used in gcc today. The issue before garbage collection was introduced was that all allocation was done on the current default obstack. There were several different obstacks with different lifetimes, and code would change the default back and forth. There was also code to copy data structures from one obstack to another in order to make it more permanent. It was very easy to introduce errors in which objects in a more permanent obstack pointed to objects in a more temporary obstack. This then lead to unpredictable memory crashes.

We can draw some lessons from this. If you are going to use memory pools in a complicated program, you must not use a default memory pool. All memory allocation must use an explicitly named pool. Otherwise you are going to get confused and the result will be dangling pointers.

Another lesson is that memory allocation is hard, but then we already knew that. Using a garbage collector resolves the memory allocation problem. However, you pay a fairly heavy price. Is it a price which gcc can afford to pay?

One answer is to observe that gcc’s garbage collector includes a ggc_free function. This function frees memory in the garbage collector. It is basically an assertion that there are no pointers to the memory. Any use of gcc_free in gcc’s code base is a case where the price of garbage collection was considered to be too high. Note that systems which are fully dedicated to garbage collection do not provide any equivalent to ggc_free. I count 24 uses of ggc_free in gcc’s code base today. Clearly in at least some cases developers feel that the cost of garbage collection is too high.

The question gcc faces is: is the garbage collector too expensive in all cases? Should we go back to explicit memory management–but this time with explicitly named pools, to avoid the problems of default obstacks? The answers are complicated by the fact that the implementation of pre-compiled headers is tied into the implementation of garbage collection.

Another way to ask the question is: if garbage collection is not too expensive, then why not write gcc in Java? The natural response would be that using Java would be too expensive–the compiler would be too slow. But the main reason that Java is slower than C++ is not the bytecode system–modern JIT technology is competitive with compilation to machine code; the slowness of Java is a result of garbage collection.


Posted

in

by

Tags:

Comments

6 responses to “GCC Garbage”

  1. graydon Avatar
    graydon

    Putting aside the theory-of-GC debate (and the fact that, due to the way current memory subsystems behave, I fully agree with the point that this alone slows a system down), and putting aside even the baroque over-design of the Java standard libraries…

    There is *also* a systematic performance pain that Java incurs through its inability to effect any sort of dense allocation of composite structures aside from monomorphic arrays. Every sort of memory structure you introduce — including the contents of arrays! — necessarily becomes a boxed allocation that you refer to through a pointer (modulo clever JITs that at least escape-analyze some locals and stack allocate them, thank goodness).

    This inability pushes serious memory costs through every structure in the system: fragmented allocation, allocation-header overhead, additional indirections during active programmatic traversal and (especially) during GC. These are relatively high costs, comparable to the similarly pessimistic omnipresent dynamic dispatch in java, or the omnipresent synchronization locks. Seldom mentioned costs, but very real. Every byte of overhead adds up, particularly when dealing with heavily-used, inner-loopy subsystems.

    It is the sort of design result of focusing a bit too much on simplicity of semantic model: it tries to relieve programmers from knowing or caring about the difference between reference and value types, or between static and dynamic dispatch. Sadly, saying “let clever implementations handle it” is too easy a habit to fall into during language design.

  2. tromey Avatar

    I tend to agree about the overhead of using boxed objects for everything. I’m not convinced this is a universal problem — I think there are Java programs that are competitive with analogous programs in other languages — but I think it would certainly make Java unsuitable for GCC. I base this on the fact that people have spent a lot of time, with measurable results, working to remove even a single word from a common GCC data structure. I.e., GCC is not one of those happy-go-lucky programs.

    For synchronization, nowadays high-performance Java programs don’t use synchronized, they use atomic operations and whatnot. I’m not sure whether this means the problem is “gone”, but it is lessened at least.

    As to ggc_free… I dislike it, of course. However, the problem it solves is not that GC is unsuitable at those points, but rather that GCC in general has a funny kind of GC where collections can only happen at relatively arbitrary — and occasionally distantly separated in time — points. So, these free calls are just reflections of another form of lameness in GCC; if we had stack marking we would not need them (one minor benefit of using the Boehm GC — but probably not enough to switch).

    I think I’d be happy with a return to pool allocation, assuming it was “done right”. It would make the incremental compiler more complicated, but probably not drastically so, at least assuming I could exert some control over lifetimes of things.

  3. BenHutchings Avatar
    BenHutchings

    With GC, there’s a trade-off between increased run-time and increased memory usage. You can reduce the run-time overhead to about zero but at the cost of around 3x higher memory usage. You can theoretically reduce memory usage to probably a little lower than you would see with explicit freeing, but at huge expense in run-time.

  4. ncm Avatar

    The notion of rewriting all of Gcc in Java is absurd on its face.

    However, there’s no reason it couldn’t be compiled more or less as-is with a C++ compiler — e.g. G++. Once in G++, uses of regular pointers could begin to be replaced with smart pointers. Some allocations could reference garbage-collected pools, others reference-counted.

  5. Ian Lance Taylor Avatar

    Thanks for the notes.

    A private e-mail questioned my comment that, if Java is slow, it is due to GC and the memory model. It’s hard to know for sure, of course, and it’s going to depend on the program; it just seems likely to me. Graydon, thanks for expanding on the memory model.

    BenHutchings: thanks for pointing out the time-space tradeoff. It is not helpful for gcc, which is already a big memory user.

    ncm: I certainly agree that gcc should be converted to C++.

  6. davem Avatar

    I sent a note to Ian mentioning the locality aspecs of using GC in gcc.
    With obstacks, RTL expressions were densely packed with the RTL
    objects of their sub-expressions. Usually all within the same L2 cache
    line.

    So if you took a cache miss on the top-level RTL expression, it would
    be amortized as you scanned it’s subexpressions.

    Now, with GC, objects all sit on different pages, let alone L2 cache lines.
    In fact, when scanning RTL during an optimization pass, GC based gcc
    has the most suboptimal memory reference pattern possible.

Leave a Reply