Certain things just don’t mix: oil and water, me and productivity, garbage collection and systems programming1.
Well, the latter is not exactly true. C++, Rust and Ada all have reference counted pointers in their standard library, and reference counting is a type of garbage collection. Both Swift and Nim use automatic reference counting as their default memory management strategy.
Interestingly, Swift’s predecessor Objective-C, older versions of Nim, and early versions of Rust all supported tracing garbage collection! Yet they all abandoned it.
There are many reasons why, but I think it boils down to a few key points:
Reference counting, at least in its simplest form, is deterministic and plays well with scope-based resource management. Objects are freed immediately after the last reference to them is dropped. A typical tracing garbage collector is non-deterministic: you don’t know when it will kick in and free things.
Reference counting, at least in its simplest form, is entirely local. The less you use of it, the less of an impact it has on your program. A typical tracing GC needs a hefty runtime, safepoints inserted into functions, etc., even if it is barely used.
Reference counting, at least in its simplest form, plays well with C, and therefore with anything that talks to C (i.e., it is good for FFI). A typical tracing GC needs its runtime to be set up by the host, or for a copy of it to be bundled with every FFI-exported library.
Note that I specifically mentioned reference counting in its simplest form, which happens to be the reference counting implementation used by all of these systems programming languages. They all have plenty of compile-time optimizations, sure, but the runtime component is bog standard reference counting. No fancy concurrent, coalesced, deferred reference counting or anything like that, because it would have the same issues as a tracing GC.
According to A Unified Theory of Garbage Collection, reference counting and tracing garbage collection are just two ends of a spectrum. The more you optimize one or the other the more they meet in the middle. As far as “systems programming languages” are concerned, those optimizations have an unacceptable cost.
But that got me thinking. If modern tracing garbage collectors are “in the middle of the spectrum”, and that is bad, and simple reference counting is “on one end of the spectrum”, and it is ok… what about the other end? Can we make “the simplest form” of tracing garbage collection work for systems programming?
Back to Basics
With how fancy modern tracing garbage collectors are, we tend to forget that the original algorithms were extremely simple.
Mark-and-Sweep is just graph reachability. Starting from a set of roots, find all reachable nodes and mark them. Go through the graph again and free every node that isn’t marked.
It’s not particularly efficient, but it’s extremely simple to implement. So simple in fact that you could just implement your own mark-and-sweep garbage collector for whatever graph-like data structures you use.
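To make that concrete, here is a sketch of such a hand-rolled collector for a hypothetical binary-node graph. An intrusive list tracks every allocation so the sweep can visit all of them; the names and layout are illustrative, not any particular library’s API:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

// Hypothetical graph node: two child pointers and a mark bit.
typedef struct Node {
    struct Node *left, *right;
    struct Node *all_next; // intrusive list of every allocated node
    int marked;
} Node;

static Node *all_nodes = NULL; // head of the allocation list

Node *node_new(Node *left, Node *right) {
    Node *n = malloc(sizeof *n);
    n->left = left; n->right = right;
    n->marked = 0;
    n->all_next = all_nodes;
    all_nodes = n;
    return n;
}

// Mark: flag everything reachable from a root.
static void mark(Node *n) {
    if (!n || n->marked) return;
    n->marked = 1;
    mark(n->left);
    mark(n->right);
}

// Sweep: walk the allocation list, free unmarked nodes, clear marks.
static size_t sweep(void) {
    size_t freed = 0;
    Node **link = &all_nodes;
    while (*link) {
        Node *n = *link;
        if (n->marked) {
            n->marked = 0;
            link = &n->all_next;
        } else {
            *link = n->all_next;
            free(n);
            freed++;
        }
    }
    return freed;
}

// Collect: mark from each root, then sweep. Returns nodes freed.
size_t gc(Node **roots, size_t nroots) {
    for (size_t i = 0; i < nroots; i++) mark(roots[i]);
    return sweep();
}
```

That is the whole algorithm; everything else in a production collector is optimization.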
The Yices2 SMT solver, written in C, does this, for example. It has a mark-and-sweep garbage collector you can invoke at any time to free expression nodes that are no longer used. Note that the collector is never implicitly called and there’s no runtime; it’s just a function that Yices makes available to the host.
For the set of roots it doesn’t need to do any fancy stack scanning. Any expression stored in a model is a root, and any root expressions not stored in a model can be passed in as an extra argument.
We’re so blinded by these fancy incremental, concurrent, generational garbage collectors that we forget just how simple tracing garbage collection really is.
Yices2 proves that tracing garbage collection can be used in a systems programming context. The host calls the GC if and when it wants to, no runtime.
But can we generalize this? Having each library implement their own custom mark-and-sweep tracing GC is fine and all, but I’m wondering how this could work as a general memory management strategy.
Just taking an existing, modern garbage collector and making it clean up only when explicitly called does not work very well; they’re incremental for a reason (in fact, pretty much every tracing GC-ed language tells you not to do this). But we don’t need any of that fancy-schmancy stuff, we’re going back to 1970.
Garbage Collected Arenas
While the languages mentioned previously all use a mix of scope-based resource management and reference counting, in C that’s actually not so common, since it has to be done manually. Needing to manually increment and decrement reference counts is not only pretty annoying, it is also very error-prone.
The more common memory management approach is to use Arenas, also known as Bump Allocators. You throw a bunch of allocations into a big bucket, and then you free them all in one go. Allocation and deallocation are both extremely fast, but deallocation is delayed to when it is convenient for the program, rather than immediately when an object becomes inaccessible.
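A minimal bump allocator sketch, assuming a single fixed-capacity bucket for brevity (real arenas typically chain buckets as they fill up):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

// One fixed-size bucket plus a cursor; no per-object free.
typedef struct {
    uint8_t *base;
    size_t   used;
    size_t   cap;
} Arena;

Arena arena_new(size_t cap) {
    Arena a = { malloc(cap), 0, cap };
    return a;
}

// Allocation is a bounds check plus a pointer bump.
void *arena_alloc(Arena *a, size_t size) {
    size = (size + 15) & ~(size_t)15; // keep 16-byte alignment
    if (a->used + size > a->cap) return NULL;
    void *p = a->base + a->used;
    a->used += size;
    return p;
}

// Deallocation frees every allocation in one go.
void arena_free(Arena *a) {
    free(a->base);
    a->base = NULL;
    a->used = a->cap = 0;
}
```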
You know, a bit like how a tracing GC works.
Yup, that’s what we’re doing. Alongside free(&arena) we’re adding a new collect(&arena, &roots) function in the style of Yices. Collect does the same thing as free, but it copies all data accessible through the roots to a new bucket (adjusting the pointers in the process). This is called a moving collector, which has various advantages compared to mark-and-sweep:
Allocating is just a pointer bump on the arena, as fast as can be.
Freeing is just resetting the arena pointer. That means inaccessible data, no matter how much or how big, has zero impact on the garbage collector2.
Data also gets compacted when copied, avoiding fragmentation and improving memory locality.
The particular algorithm we’ll use, Cheney’s Algorithm, is iterative. It uses no recursion whatsoever.
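A minimal sketch of such a collector in C. For simplicity it assumes each object’s GC pointers sit at the start of its payload with their count recorded in the header (standing in for the tracing function discussed below); all names are illustrative:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

// Per-allocation header: payload size, number of leading GC
// pointers, and a forwarding pointer used only during collection.
typedef struct Header {
    size_t size;    // payload size in bytes
    size_t nptrs;   // GC pointers at the start of the payload
    void  *forward; // set once the object has been copied
} Header;

typedef struct {
    uint8_t *base;
    size_t   used, cap;
} GcArena;

static size_t obj_total(size_t size) {
    return (sizeof(Header) + size + 15) & ~(size_t)15;
}

GcArena gc_arena_new(size_t cap) {
    GcArena a = { malloc(cap), 0, cap };
    return a;
}

// Allocation is still a pointer bump, plus filling in the header.
void *gc_alloc(GcArena *a, size_t size, size_t nptrs) {
    if (a->used + obj_total(size) > a->cap) return NULL;
    Header *h = (Header *)(a->base + a->used);
    a->used += obj_total(size);
    h->size = size; h->nptrs = nptrs; h->forward = NULL;
    return h + 1;
}

// Copy one object into to-space, or return its existing copy.
static void *evacuate(GcArena *to, void *obj) {
    if (!obj) return NULL;
    Header *h = (Header *)obj - 1;
    if (h->forward) return h->forward;
    void *copy = gc_alloc(to, h->size, h->nptrs);
    memcpy(copy, obj, h->size);
    h->forward = copy;
    return copy;
}

// Cheney's algorithm: evacuate the roots, then scan to-space left
// to right, evacuating whatever the copied objects point to. The
// scan cursor chasing the allocation cursor is the whole "queue":
// no recursion, no mark stack.
void gc_collect(GcArena *a, void **roots, size_t nroots) {
    GcArena to = gc_arena_new(a->cap);
    for (size_t i = 0; i < nroots; i++)
        roots[i] = evacuate(&to, roots[i]);
    for (size_t scan = 0; scan < to.used; ) {
        Header *h = (Header *)(to.base + scan);
        void **fields = (void **)(h + 1);
        for (size_t i = 0; i < h->nptrs; i++)
            fields[i] = evacuate(&to, fields[i]);
        scan += obj_total(h->size);
    }
    free(a->base);
    *a = to;
}
```

Note that gc_collect rewrites the root pointers in place, which is exactly the pointer adjustment mentioned above.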
So what’s the catch? Well, copying can be slow, but beyond that, not much. Having to pass the roots manually is a bit annoying, but Yices shows that it can work. It’s not an issue for a custom language that uses this as its memory management strategy, since it can easily compute the roots.
There’s an extra cost per allocation, however: a small header has to be prepended with the following information:
The size of the allocation.
A pointer to a “tracing/moving” function or to a data structure with information about the pointers contained in the allocation.
A forwarding pointer for the case where the same data is reachable from multiple roots but has already been copied.
The forwarding pointer is only needed during collection, so with some bit fiddling we can get away with an 8 byte header. Reference counting needs 8 bytes for the reference count, another 8 if there are weak references, plus however much is used by the malloc header.
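One possible bit-fiddling scheme for the single-word header: a word that normally points to the type’s tracing information, with the low bit flagging that it has been overwritten by a forwarding pointer during collection. The low bit is always free because pointers to these structures are at least 2-byte aligned. TypeInfo and all the names here are assumptions for illustration:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical per-type descriptor: size plus pointer layout.
typedef struct TypeInfo { size_t size, nptrs; } TypeInfo;

// One-word header: either a TypeInfo pointer (low bit clear) or a
// forwarding pointer (low bit set, only during collection).
typedef struct { uintptr_t word; } ObjHeader;

void header_init(ObjHeader *h, const TypeInfo *ti) {
    h->word = (uintptr_t)ti; // low bit clear: not forwarded
}

int header_is_forwarded(const ObjHeader *h) {
    return (int)(h->word & 1);
}

void header_set_forward(ObjHeader *h, void *new_location) {
    h->word = (uintptr_t)new_location | 1; // tag as forwarded
}

void *header_forward(const ObjHeader *h) {
    return (void *)(h->word & ~(uintptr_t)1); // strip the tag
}

const TypeInfo *header_type(const ObjHeader *h) {
    return (const TypeInfo *)h->word;
}
```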
The tracing/moving function is equivalent to the destructor in the reference counting case, with the advantage that it is not recursive.
Lastly, the set of roots has to be stored somewhere to be passed to the collect function. In a custom language, this can be implemented very easily following the approach outlined in this paper: Accurate garbage collection in an uncooperative environment.
Basically, you keep a shadow stack of pointers to garbage collected data. If the language has an effect system, you can restrict this shadow stack to be manipulated only by functions that directly or indirectly call collect. You could, for example, only allow applications (not libraries) to ever call collect3.
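A sketch of the shadow-stack idea, assuming each function registers the addresses of its GC-pointer locals on entry and unregisters them on exit (what a compiler, or a disciplined C programmer, would emit):

```c
#include <assert.h>
#include <stddef.h>

// Global shadow stack holding the ADDRESSES of local GC pointers,
// so collect() has an exact root set and can update the locals in
// place after moving objects.
#define SHADOW_MAX 1024
void **shadow_stack[SHADOW_MAX];
size_t shadow_top = 0;

void shadow_push(void **slot) { shadow_stack[shadow_top++] = slot; }
void shadow_pop(size_t n)     { shadow_top -= n; }

// Example of the emitted discipline:
void example(void) {
    void *a = NULL, *b = NULL;        // GC-pointer locals
    shadow_push(&a); shadow_push(&b); // register them as roots
    // ... allocate, call collect-capable functions ...
    (void)a; (void)b;
    shadow_pop(2);                    // unregister on exit
}
```

Only functions that can (transitively) reach a collect call need to pay this cost, which is exactly what the effect system tracks.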
Putting it all together
Start from a language like Zig or Odin where every function that allocates takes an allocator as input (implicitly or explicitly). From the function’s point of view, this allocator is just a bog standard arena that asks for some weird metadata. You could just as well pass in a normal, non-garbage collected arena to this function if desired, and you can have as many distinct arenas as makes sense for your application.
But, if you don’t need that level of control, you can also just pass in the default arena. This arena acts just like a regular arena, except it stores some extra metadata alongside each allocation. Again, from the function’s point of view, there’s no difference beyond needing to pass in a pointer to the metadata (can be compiler generated or written by hand, doesn’t matter).
Lastly, the application (not libraries!) can call collect at any point to clean up the default GC arena. The compiler needs to maintain a shadow-stack for every function in collect’s call stack. This can be easily tracked with a simple effect system. In the case of another language like C serving as the host, it can just pass the root set explicitly to collect.
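Ported to C, the allocator-parameter pattern might look like the following sketch. The library function only sees an allocator interface; whether the backing store is a plain arena or a GC arena is the caller’s choice. The Allocator and Bump names are illustrative, not a real API:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Zig/Odin-style allocator parameter: a function pointer plus context.
typedef struct Allocator {
    void *(*alloc)(struct Allocator *self, size_t size);
    void  *ctx;
} Allocator;

// Library function: allocates through whatever allocator it is given,
// with no idea whether it is garbage collected.
char *duplicate(Allocator *a, const char *s) {
    char *copy = a->alloc(a, strlen(s) + 1);
    if (copy) strcpy(copy, s);
    return copy;
}

// One possible backing: a plain, non-GC bump arena.
typedef struct { uint8_t buf[4096]; size_t used; } Bump;

void *bump_alloc(Allocator *self, size_t size) {
    Bump *b = self->ctx;
    if (b->used + size > sizeof b->buf) return NULL;
    void *p = b->buf + b->used;
    b->used += size;
    return p;
}
```

A GC arena would plug into the same interface by bumping past a header as well, which is why the library code never has to care.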
And we’re done. The only cost is the extra 8-byte header per allocation and a tracing function per type that stores GC pointers. Otherwise, the application can choose to never call collect at all and just call free directly, and everything will still work. The garbage collection is (almost) entirely optional4.
This honestly sounds too good to be true, so I’m wondering why it hasn’t been done before. The closest attempt I can think of is Cone. Maybe I’ll have to be the one to put it to the test in my own programming language. For now, here’s a GC Arena implementation in C (still needs improvement).
Update: This article was discussed on Hacker News
In case you don’t know what it means, application programming is meant to serve end-users directly, whereas systems programming is meant to serve other software. Think low-level libraries, game engines, operating systems, etc.
Copying large amounts of memory can be slow, but that can be mitigated by storing very large allocations (i.e., multiple megabytes) in their own separate buckets. These don’t need to be copied; they can just be attached to the new “bucket” that is left after collection. The only real issue is massive amounts of live, small allocations. Use a different approach if you need that many (e.g., pools), or split them up into different arenas.
Libraries can, of course, call collect on GC arenas they themselves create; the issue is only calling collect on the default GC arena. That should only ever be done by the application.
I haven’t covered multithreaded applications here, but they’re an important concern. We can’t add safepoints to functions or we’re back where we started with too-fancy GC. For reference counting the solution is to use atomic updates or keep data per-thread, moving it as necessary (like Nim does). For this approach I would follow Nim and make each thread have its own Arena, with data needing to be explicitly made available across threads through specialized shared-memory arenas.