Implementing a Garbage Collector in C++

TL;DR

Main garbage collector source code: gc_heap.h gc_heap.cpp
Scroll to the bottom for a link to a visualization showing the garbage collector in action (it may also provide some clarity if the text is vague)

Introduction

In this document I’ll attempt to give an overview of the new garbage collector (GC) I’ve implemented for my hobby ECMAScript 1997 interpreter mjs. Even though it was created for that specific purpose the principles - and code, with some adjustments - should be generally applicable.

To keep it relatively short I’ll assume a basic familiarity with the concept of garage collection (otherwise Wikipedia is a decent starting point) as well as a working knowledge of C++. I’ll try to limit the amount of C++ cruft and esoterica discussed here and instead refer the interested reader to the source code.

My implementation is inspired by an excellent talk given by Herb Sutter at CppCon 2016 titled “Leak-Freedom in C++… By Default” (video) (slides) and NaN tagging as described by Mike Pall in the following post to the LuaJIT mailing list: “LuaJIT 2.0 intellectual property disclosure and research opportunities”.

While I consider the implementation good enough (in this project, for now) it’s far from being “production ready” and you definitely shouldn’t be using it for anything you or anyone else rely on :). See my TODO list for a small sampling of what I know needs improvement.

Background

Before diving into the details, let’s have a look at the problem we’re trying to solve.

Why does an ECMAScript interpreter need a GC in the first place? Who isn’t cleaning up after themselves and causing us to have to expend extra clock cycles dealing with something that’s usually handled fairly automatically in modern C++?

Values

To answer that we first need to talk about values.

In ECMAScript 1997 a value can have one of the following types:

undefined
null
boolean
number
string
object

Note: In ECMAScript a value always has a single unchanging type, it’s the fact that variables and object properties can be assigned values of differing types that makes it a dynamically typed language.

The first four on the list are easily handled, and since strings are immutable and can’t contain references to anything else these five types are collectively called primitive by the ECMAScript specification. They are not the ones causing issues.

Objects

That leaves the culprit: object. “What’s in an object?”, you might ask. Well, it’s really just a collection of properties. Some of them are - as an implementation detail - treated specially, like for instance the prototype(*) or the name of the objects class, but essentially objects are just a lists of properties. A property of an object consists of its name/key (a string), its attributes (irrelevant for this discussion) and its value.

Now you may be seeing where this is going: Objects contain properties that contain values that may contain objects (or rather references to objects). This is at the root of the issue.

*: The internal prototype called [[Prototype]] in the specification. Not to be confused with the prototype property you’ll encounter when dealing with function objects.

Inadequacies of the Initial Garbage Collector

My general implementation strategy has been that of walking the path of least resistance unless I found enjoyment in diving into a specific area, so my initial version of the interpreter used the “obvious” C++ primtives.

This means that the interesting types were represented as follows:

objects references as std::shared_ptr<mjs::object>, where mjs::object manages a standard container of properties (key/value pairs)
values as the aptly named mjs::value - a classic tagged union capable of storing object references as well as the other ECMAScript primitives (numbers, strings, etc.)

The limitations of my initial approach to GC were seen when it was confronted with this simple piece of ECMAScript code:

var o = new Object(); // Create object 'o'. Note: no object initialiser/literal until ES3
o.p = o;              // Set property 'p' of object 'o' to point to 'o' itself

And lest you think this only occurs in pathological cases consider the constructor property of a functions prototype. It will contain a reference to the prototype object, which contains a reference to the constructor which…

As std::shared_ptr provides garbage collection through reference counting the initial implementation was faced with handling the classic issue of dealing with self references. The problem being that in the above example, even if there are no external references, the reference count of o will always be positive because of the reference from its own property p. This example is simplified but the problem persists even if the chain is longer.

The initial implementation handled this rather crudely by - when requested - going through all objects and clearing their internal property lists thereby ensuring that no self references existed. By manually breaking any cycles that might otherwise keep an object from being destructed leaks were avoided.

As you can imagine this was only requested when testing whether there were other leaks in the system as it was a hugely wasteful pessimization compared to just using the null collector. The initial GC was especially useless since I hadn’t bothered implementing detection of live objects, so it couldn’t even be used while the interpreter was running.

The Set of Live Objects

At the heart of garbage collection lies the task of determining which objects are live, i.e. which objects cannot be collected. std::shared_ptr does this by keeping track of the number of references to the object, but as mentioned above this alone isn’t sufficient due to cycles in the object graph. Edges in the object graph are formed by references in the property list to other objects and a cycle implies that you can reach an object by following its outgoing edges back to itself - like in the simplified example mentioned earlier.

In this context “live” means that we will possible be needing the object again in the future, but how can that be determined without predicting the future? In a bit of a cop-out garbage collection has the concept of root pointers which are a special class of object reference from the “outside” which force an object to kept alive even if it wouldn’t otherwise have been. In simple terms a root pointer can be thought of as something the “user” can refer to - like o in the example or the window object when used in a web browser context.

Given the root pointers determining the set of live objects is conceptually easy enough: it’s the set of all objects that are reachable starting from the root pointers. Or in pseudo-code:

    find-live(root-pointers)
        for o in root-pointers
            mark-live(o)

    mark-live(o)
        if contains(live-set, o) # Avoid infinite recursion
            return
        insert(live-set, o)
        for p in properties-of(o)
            if p is object-reference
                mark-live(p)

So by keeping track of the root pointers and going from there we should be able to create a garbage collection for our interpreter.

The New Garbage Collector

To actually implement a garbage collector we’ll have to become more specific about what a root pointer is and start discussing the implementation. I’ll be presenting the details without mentioning all the different trade-offs that were considered, but be aware that for nearly every decision there are lots of other interesting possibilities to explore.

Requirements

Before starting the implementation I had a set of requirements in mind:

The collector should be precise
It shouldn’t introduce too many changes in the existing interpreter
It shouldn’t rely on non-portable constructs
It should be type-safe to greatest possible degree
It should be exception safe
It should be possible to make thread-safe
It should be optimizable (meaning: it can be inefficient to start off with but it should be possible to make it fast at a later pointer)
It should be extensible - I’d like to be able to use the code as a testbed for exploring GC ideas
It should be possible to move objects in memory (to be able to compact the heap for example)
It doesn’t have to be generally useable, but it should be separable from the interpreter
It should be possible to keep the amount of memory used by the objects to a fixed number of bytes

The GC Heap

It was pretty clear from the start that I needed a special heap where objects could be stored. It both helps with the requirement to be able to restrict the amount of memory used for objects and in general simplifies things when it’s known that all valid objects live within a certain memory range.

In the code this is represented by the mjs::gc_heap class which handles storage management and object lifetime. Along with a set of helper (template) classes it handles everything related to the garbage collector. The raw storage is allocated in 64-bit chunks (to match the alignment requirement of pointers on 64-bit systems and numbers) called slots.

The public interface of gc_heap only permits construction of objects inside the managed heap and requesting a garbage collection. Everything else is handled automatically.

Each allocation in the heap is preceded by a slot_allocation_header which is an extra slot of overhead that records the number of slots allocated (including the header) as well as the type index of the object that lives there.

Storing the size of each allocation enables traversing the heap and having a type associated turns out to be pretty useful (read: essential).

Other Types on the GC Heap

Even though we don’t strictly need to store anything other than objects on the garbage collected heap, it would be really convenient to be able to have all (or most) things related to the interpreter state kept there as well. Like all the strings created during execution for example.

More importantly activation records (loosely “the call stack”) and function implementations can both contain self-references and it simplifies things a lot if they can be treated as normal ECMAScript objects by the GC even if they’re not implement the same way.

Hence the need for distinguishing between different kinds of “things” allocated on the heap via the type index.

Note: From now on I’ll try to use object when I mean an ECMAScript object and just object when I mean a C++ object that can be allocated on the heap.

GC Type Information

Except for special cases that will be discussed later the type index associated with an allocation can be used to obtain the statically initialized mjs::gc_type_info instance describing the type.

The type information object allows the garbage collector to destroy, move and perform the other necessary operations on anything that lives at a given storage position.

The same thing could be accomplished by requiring all objects constructed on the GC heap to derive from some abstract base class, but where is the fun in that? (Also some optimizations wouldn’t be possible.)

Tracked Pointers

To keep track of which objects are live, the GC needs to know whether there are any active references to them. To be able to move objects around in memory during garbage collection there also needs to be a layer of indirection for the clients, so they can remain (mostly) unaware that the pointer-to object has been relocated.

For these reasons gc_heap_ptr exists and serves the same purpose as std::shared_ptr in the old implementation. It’s also a smart pointer class template, but instead of increment/decrementing a reference count on construction/destruction it registers/deregisters itself with its associated gc_heap.

This way the GC heap can manage the storage as it sees fit and have everything “just work” by making sure all the tracked pointers are updated when it moves things around. If, at any point, it can determine that there are no pointers to an object it can safely delete it. Of course that will only make the garbage collector as good as the old implementation.

Determining the Root Pointers

To do better we need to determine the root pointers and use the algorithm sketched above and do proper garbage collection of anything that isn’t reachable from them.

By requiring that all objects on the GC heap store their pointers on the heap as well it’s easy to determine the root pointers: they’re all pointers which addresses lie outside the managed heap. This requirement also enables another important feature: for any object we can determine what other object it references by considering all the tracked pointers that are located inside its designated storage.

Compacting

With everything in place we’re ready to handle requests to perform a garbage collection! This is done by moving all live objects to a new storage location and destroying all the remaining objects. The new storage will be compact and easy to allocate from, and not copying the left-over objects is the garbage collection. Calling the destructor on the garbage objects is what’s referred to as finilization in many languages.

The method described above:

From the root pointers determine the set of live objects
Move the live objects to somewhere else
Destroy any left-over objects
Use “somewhere else” as new storage area

while using tracked pointers and all the other machinery does work, but it’s not very efficient.

Optimizations

To make sure the GC was relatively robust I ran the garbage collector after every statement when running some parts of the test suite. While this uncovered some bugs (see the Pitfalls section) it also made the tests take a lot longer to run.

Profiling showed that (copy) constructing and destroying gc_heap_ptr instances as well as processing tracked pointers in general was taking up most of the time. This matched my intuition, but I made sure to collect data before jumping to conclusions.

A Backdoor

In the current implementation tracked pointers contain a pointer to the containing gc_heap and a slot index into the storage area enabling it to always produce a pointer to the pointed-to object at its current storage location.

In numbers they take up lots of space and are very costly to move(*), so to enable optimizations at the expense of type-safety (please let me know if there is a better way) objects are allowed to use dangerous - but fast - methods of keeping track of object references.

*: Since the address at which a tracked pointer lives is important it isn’t cheaper to move than to copy.

Untracked Pointers

To limit these type-indiscretions the unsafe operations are accessed by using gc_heap_ptr_untracked which only stores the slot index and relies on the class implementer (since the user shouldn’t be exposed to it) supplying the gc_heap.

Since this presents a problem when the object is moved - the GC by definition isn’t tracking the pointer - a protocol is in place where the garbage collector notifies the moved object that it wants to be informed about its untracked pointers.

Fixup

The fixup protocol is a way of ensuring the work that would otherwise be automatically done when using tracked pointers can be manually performed by clever classes that use untracked pointers. After an object has been moved the garbage collector calls the objects fixup method (if it has one with a matching signature). The object can then go through its untracked pointers and fix up each in turn. Calling fixup on an untracked pointer ensures that whatever the pointer refers to is also moved and that the pointer value is updated.

It’s important that classes using untracked pointers provide a fixup method that fixes each untracked pointer up exactly once. Forgetting to fix up a pointer results in a dangling pointer and calling fixup more than once (or outside the fixup method) will at best result in a crash. This is the dangerous part of untracked pointers.

Value Representation

Another unsafe helper class is value_representation which takes the same amount of storage as a number (a IEEE 754 double precision floating point number) and can hold any ECMAScript value. Numbers themselves are stored using their normal representation (with all NaN values forced to a single representation) and the remaining types are encoded using NaN tagging.

For values of type string or object the pointer value is stored untracked in the same way as gc_heap_ptr_untracked and also requires a fixup after move.

Strings and Tables

Strings are stored as a single allocation with a small header containing the length followed by the UTF-16 characters it contains. Nothing fancy to see there, except a possible optimization where it could deduce its length from the allocation header. Some trickery would be needed to get the length modulo 4, but that could be handled if necessary.

Since the properties of an object can change, they are kept in a separate table that can be stored as a single allocation. When the table needs to grow a copy of greater size is created and the reference to the old table is simply forgotten (yay garbage collection). The table entries are internally represented by an untracked pointer to the key (a string), the attributes associated with the entry (a bit field), and a value_representation to store the actual value. All together this allows a compact representation where each entry only takes up two heap slots (i.e. 128 bits).

Representing the Set of Live Objects

During garbage collection the set of live objects isn’t stored explicitly. Instead the garbage collector keeps a stack of pointers that are waiting to be processed (the normal stack isn’t used to avoid overflowing it when processing long object chains).

The stack is initially populated by the root pointers and when processing a pointer the garbage collector performs the following steps to move the pointed-to object:

The object is move constructed at its new storage location
The old object is destroyed, the type index is changed to the special gc_moved type index and its new position is recorded at the old location
All tracked pointers internal to the object are added to the stack
If the object has a fixup method it’s called, which ensures that the untracked pointers are also added to the processing stack (this is where it’s convenient that untracked pointers have the same representation as tracked pointers except for the gc_heap pointer)

After the object has been moved the pointer is set to to the objects new location and the next pointer on the stack is processed.

Changing the type index of moved objects is crucial in avoiding infinite recursion when processing self-references. When a pointer to a gc_moved object is encountered the above steps are skipped and the pointer is updated directly.

As long as all objects are implemented correctly, the above steps ensures precise garbage collection and that all tracked and untracked pointers are updated.

Pitfalls

Even without considering the unsafe optimizations using the garbage collector still requires extra care for both class implementers and users alike compared to using std::shared_ptr.

The fact that objects can be moved around in memory means that the user has to very careful about only keeping direct (real) pointers to objects when they know a garbage collection can’t happen.

This is easier said than done and was the cause of quote a few bugs.

From obvious cases where I was storing a reference to an object on the GC heap: no go because objects can move.

Through the global object implementation capturing this in a lambda: no go because this can move when you’re an object on the GC heap!

To headscratchers like:

o->put(key, eval(some_expression));

not working because o::operator-> may be evaluated before the call to eval which may trigger a garbage collection, which may move o making the result of o-> stale!

Usage Guidelines

During the development I’ve formulated some rules I try to follow. They’re probably not sufficient to avoid all errors but they’re necessary:

Never keep raw references to GC objects - use managed pointers (tracker or untracked), as objects may move around in memory
Be careful about the (sometimes very short) scope where a dereferenced heap pointer is valid
Any class that might be constructed on the GC heap must not keep tracked pointers outside the GC heap. An example is having a std::vector<gc_heap_ptr<...>> as the pointers register as root pointers, but are only be destructed once the object is finalized, which means the garbage collector needs to run twice to clean up the mess
In general be prepared for the destructor being called a lot later than normal (finalization)
Be prepared to be moved and be inquired about your untracked pointers

Future Work

While the garbage collector works there are still lots left to be improved and explored. Again I’ll refer to my TODO list to see some of the things I’m considering. Feel free to use the code as a starting point or inspiration for your own experiments.

Not all of my requirements have been satisfied, but I’m content for now and fairly confident that the code can be adapted to future needs. I’m also not too keen on making more invasive optimizations before I’ve decided on what kinds of workloads I want to optimize for and what direction I want to take the code in.

The Garbage Collector in Action

I’ve created a short visualization of the garbage collector in action to maybe explain in pictures what I’ve failed to make clear in text. It shows most of the steps the garbage collector takes when h.garbage_collect() is executed in the following simplified example. It starts out showing all steps, but gradually skips intermediary actions to not get too long.


// Simple gc-enabled class capable of holding a single (tracked) object reference
class test_obj {
public:
    explicit test_obj(const gc_heap_ptr<object>& p) : pointer(p) {}
private:
    gc_heap_ptr<object> pointer;
};

void vis() {
    // Create heap with 32 slots
    gc_heap h{32};

    // Leak a string (for show)
    string{h, "Leaked!"};

    // Create object "o"
    auto o = object::make(h,
        string{h, "Object"}, // Class name "Object"
        nullptr);            // No prototype

    // Set its [[Value]] to false
    o->internal_value(value{false});

    // Add a self-referencing property "K1"
    o->put(string{h, "K1"}, value{o});

    // And another property set to NaN
    o->put(string{h, "K2"}, value{NAN});

    // Create a "test_obj" instance referencing "o"
    auto c = h.make<test_obj>(o);

    // Collect garbage (see visualization)
    // There are two root pointers: to "o" (@0008) and "c" (@001a)
    h.garbage_collect();
}

For kicks I also created a squashed (and totally unreadable) video of the garbage collector running after interpreting an empty statement. This means it’s mostly the global object being moved.