Quantifying the Cost of Pointer Dereference in Pointer Wars Level 2 Code

Pointer deference is an extremely useful operation, but it is not free.

(Note to Pointer Wars contestants: this is a work in progress, but mostly complete.)

Pointers are perhaps the thing that first comes to mind when people think of C and C++ versus other languages. They are incredibly useful, and even though they add a bit of expense when dereferenced, often either save compute time by avoiding useless memory copies between functions or offer such a wide range of flexibility such that the (minimal) performance cost is worth it. However, there are absolutely cases where they can be overused, and some of those cases exist within the Level 2 Pointer Wars repository.

Pointer Usage in Level 1

The Level 1 linked list code is compiled into shared library, which is code that is dynamically linked against the program that uses it when the program first executes. For shared libraries, it is incredibly common to have them provide a function to allocate a particular structure and to free it. In our case, we did so via linked_list_create() and linked_list_delete(). It’s relatively rare, although an acceptable design choice, for a shared library implementor to decide that the user program should provide a pointer to memory where the shared library should populate a structure. This relies on the user program managing its own memory, as opposed to the shared library. Examples of both are shown below:

// Normal allocation with a shared library:
//
struct linked_list * ll = linked_list_create();
...
linked_list_delete(ll);

// Another acceptable design point, where user specifies the
// memory address to populate:
//
struct linked_list ll; // assume allocated on the stack.
linked_list_create(&ll);
...
linked_list_delete(&ll);

This is perfectly acceptable. You want the user library to work with ANY linked list structure, not just one at a fixed address, and the only reasonable means to do that is with a pointer.

Pointer Usage in Level 2

Level 2 is where things get silly. The implementation of the queue is a bit suspect.

// Recall queue definition:
//
struct queue {
    struct linked_list * ll;
}

struct queue * q = queue_create();
queue_delete(q);

Our queue implementation uses exactly ONE linked list per queue, meaning that the benefit of indirection that a pointer gives us is useless. It’s a performance cost that we don’t need to pay in this case, and with some minor changes, we can rework the queue and linked list code such that it doesn’t need to be paid if the struct is changed to be defined this way:

// A better queue definition, that avoids unnecessary
// pointer dereference:
//
struct queue {
    struct linked_list ll;
}

It might not be initially obvious why this ends up with fewer pointer dereferences. Hang on.

Compiler Code Generation for Pointer Dereferences

A compiler, in the common case, has to emit an additional load instruction to determine to load the address of the pointer into a register (let’s pick X17), and then perform a second load instruction using the X17 register as a base address to access a member in the struct. It may be that the pointer’s address is already in X17, so in some cases you might get the dereference “for free”, but keep in mind that Pointer Wars is generally calling library code for most of its work. The compiler by default isn’t optimizing between the program and your shared library, because it has no idea what the shared library contents are at run time. Your shared library code also has no idea how a program might use it because it can’t, so you won’t benefit from any sort of optimization like that.

Load instructions take a few CPU cycles to generate data. Most desktop/server/phone/laptop CPUs can generate a data for a load, at best, every 3 to 5 CPU cycles. The Pointer Wars testing machine, per the ARM provided Software Optimization Guide, takes four CPU cycles.

Justify The Optimization

Before jumping into the code, justify the optimization. We already identified tradeoffs: not using a pointer removes your ability to have indirection (that is, to specify any linked list as opposed to only a single one). We also determined that for our queue implementation, this tradeoff is never used. Create a mathematical function that takes number of edges as an input and the output is the number of useless pointer dereferences. Then, assume that your code searches through 1,000,000,000 edges and can do 450,000,000 pointer dereferences per second (derived from a 1.8 GHz frequency, with all pointer dereferences never overlapping (you may wish to justify why this must be true, because it isn’t always)). How much runtime might be saved?

The Magic Behind The Optimization

Consider the two implementations:

// Old implementation.
//
struct queue {
    struct linked_list * ll;
};

void queue_push(struct queue * q, unsigned int data) {
    linked_list_insert_end(q->ll, data);
}

// New implementation.
//
struct better_queue {
    struct linked_list ll;
};
void better_queue_push(struct better_queue * qb, unsigned int data) {
    linked_list_insert_end(&qb->ll, data);
}

These look nearly identical, but there’s a really fundamental difference: the compiler is smart enough to reason that q and &q->ll are the same address, by construction. They point to the exact same thing, and moreover they must always point to the exact same thing because they are always guaranteed by the C compiler to have the same address. Absolute magic.

This sort of analysis is called “alias analysis”, which is generally used to determine if two pointers point to the same location or may otherwise overlap. The compiler cannot always determine this (in fact, this is a well known ‘undecidable problem’, meaning that its mathematically impossible to specify a general algorithm that solves it), and often requires the compiler to generate conservative code to handle the case where two pointers may point to the same location. But in this case, it works for us, and gives us an optimization benefit.