Custom Memory Allocator: Implementation and Performance Measurements

Dynamic memory allocation isn’t free. Today I discuss a way to reduce how often it happens in a breadth first search workload.

This is part of an ongoing series of posts relating to the high performance C programming challenge Pointer Wars. Come join us!

The Pointer Wars challenge has four levels to it. Level 2 builds on the linked list built in Level 1 by extending its functionality into a queue. The challenge centers around performance feedback from me to participants; Level 2 performance data comes from an application I wrote that performs Breadth First Search over a directed graph. The directed graph represents which Wikipedia articles link to other articles. The data comes from 2007, so it’s rather dated, but there are still around 40,000,000 Wikipedia article links in the directed graph. One hundred pairs of pseudo-randomly selected articles are searched to determine whether a link between them exists. Unfortunately, I have not found article names that map onto the node ids (so we’ll never know if you can go from shoe to Parmesan cheese), but that’s OK. The numbers are good enough.

Level 3 in Pointer Wars revolves around identifying performance problems and improving the somewhat garbage code that I wrote (as always, an initial implementation can almost always be improved). There are portions of the code that will drive the branch predictor nuts, and there are absolutely memory hierarchy related improvements that can be made. Algorithmic and some virtual memory system effects can also be improved upon. One of the recommendations that I teased in Level 2 was a custom memory allocator, and while it isn’t the bulk of the performance improvement that can likely be achieved, it’s substantial. It’s unlikely that a participant will get the best performance possible without implementing one.

I delve into the realm of a custom memory allocator below. The breadth first search uses the queue code I wrote for Level 2, which in turn uses the linked list code I wrote for Level 1. The queue and underlying linked list that I built call malloc() every time that they allocate a new node, and free() whenever they remove one of the nodes. Any nodes remaining in the linked list are deleted on deallocation, which occurs at the end of each search.

As of this writing, all of the participants who opted in to sharing information about their code also have the same memory allocation and deallocation strategy as I did, although a few people did float the idea on LinkedIn when I first announced the competition.

Qualitative and Quantitative Justification

The malloc() and free() functions, or C++ new/delete operators, are not the first thought when many people think of sources of performance problems. There’s a variety of reasons why they actually can be problematic, detailed here. In short, if the program has to ask the kernel for additional virtual memory, that is not a cheap operation, and many of the optimizations that the kernel might do to make the allocation fast cause additional kernel operations later. Those additional kernel operations incur latency that needs to be amortized over the lifetime of the memory allocation to overcome that additional time. For example, the first access on a virtual page might result in a page fault for a variety of reasons, a common one being because the kernel hasn’t assigned it a physical page number yet. If the program does billions of memory accesses per page fault, this is certainly noise that no one should care about. But if it doesn’t, that becomes more problematic.

The malloc() invocation for each push onto the queue, and the free() for each pop or for cleanup when the queue is deallocated, eat up considerable time in this workload. I have a microbenchmark showing that malloc() takes around 26 nanoseconds on the Raspberry Pi 4B with all CPUs running at 1.80 GHz and a 64-bit version of Ubuntu. The free() call comes in a bit slower at 31 nanoseconds. Briefly, these are averages derived from calling each function 10,000 times in a row, repeating that process several times, and taking the last result. They’re good, but not great, measurements, and tend to vary 10-15% run to run.
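For reference, a microbenchmark along these lines is enough to reproduce that kind of measurement. The allocation size, repeat count, and output format below are illustrative choices rather than the exact values from my harness:

```c
/* Rough sketch of the measurement style described above: time 10,000
 * back-to-back malloc() calls, then 10,000 back-to-back free() calls,
 * repeat a few times, and keep the last averages. Allocation size (64
 * bytes) and repeat count are illustrative, not the actual harness. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define CALLS   10000
#define REPEATS 5

static double elapsed_ns(struct timespec a, struct timespec b) {
    return (b.tv_sec - a.tv_sec) * 1e9 + (b.tv_nsec - a.tv_nsec);
}

int main(void) {
    static void *ptrs[CALLS];
    double malloc_ns = 0.0, free_ns = 0.0;
    struct timespec start, end;

    for (int r = 0; r < REPEATS; r++) {
        clock_gettime(CLOCK_MONOTONIC, &start);
        for (int i = 0; i < CALLS; i++)
            ptrs[i] = malloc(64);
        clock_gettime(CLOCK_MONOTONIC, &end);
        malloc_ns = elapsed_ns(start, end) / CALLS;   /* keep last result */

        clock_gettime(CLOCK_MONOTONIC, &start);
        for (int i = 0; i < CALLS; i++)
            free(ptrs[i]);
        clock_gettime(CLOCK_MONOTONIC, &end);
        free_ns = elapsed_ns(start, end) / CALLS;     /* keep last result */
    }

    printf("malloc(): ~%.1f ns/call, free(): ~%.1f ns/call\n",
           malloc_ns, free_ns);
    return 0;
}
```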

The initial queue implementation I made, over the course of the 100 pair searches, calls malloc() 2,038,458,095 times. It calls free() the same number of times. The number of brk() system calls (the mechanism by which POSIX compliant kernels allow user programs to change the size of their heap) is also a bit higher than I’d like during the search phase of the program, but it is in the noise relative to the number of malloc() and free() calls. The 100 searches take 1,028 seconds, of which I estimate 116 seconds is spent within the malloc() and free() functions.
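For the curious, the 116 second estimate falls straight out of the call counts and the microbenchmark averages: 2,038,458,095 malloc() calls at roughly 26 ns each is about 53 seconds, and the same number of free() calls at roughly 31 ns each is about 63 seconds, for roughly 116 seconds total. (If you want to see the brk() traffic for yourself, something along the lines of strace -c -e trace=brk ./program will summarize the system call counts, though that is not necessarily how I gathered my numbers.)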

Now, the question is, can anything be done about it? The program does have a maximum number of nodes that it can ever allocate, bounded by the number of edges in the directed graph, roughly 41,800,000. There are search pairs that search the entire graph, meaning that the upper bound will eventually be hit. If we allocated that many nodes and then simply re-used pointers, there might be an advantage. It’s a pretty tight bound, however: in order to beat malloc(), whatever grabs a pointer to a new node had better take less than 26 nanoseconds, and whatever replaces free() had better take less than 31 nanoseconds. The latter sounds easier than the former. The program’s memory access patterns suggest that a custom memory allocator that doesn’t always make use of malloc() and free() is likely beneficial, if the code can fit within that very tight time window.

It is achievable, though. A push and a pop off of a stack are both O(1) worst case time complexity operations, and far more importantly (since assuming that worst case time complexity implies anything about best case performance is an egregious sin), their implementations should only take a few nanoseconds assuming everything in the CPU pipeline “just works”.

A Strategy for Memory Allocation and Deallocation

This is by no means the best custom memory allocator ever devised, but it should capture much of the benefit with relatively little effort. Given that Pointer Wars participants are probably going to read this, I’m not going to provide what I think would be the highest performing solution (that’s for participants to noodle on). Here’s roughly the framework I’m following, with a rough code sketch after the list (Github link to linked_list.h here):

  1. Remove the malloc() and free() calls within the linked list code for node pointers. Instead, I replace them with static functions __linked_list_create_node() and __linked_list_delete_node() defined within linked_list.c. These aren’t exposed outside the shared library because they have no reason to be.
  2. Add a global linked list of node pointers. Concurrency support be damned. It’s not thread safe; in fact, it’s not even safe for use in signal handlers on a computer with a single CPU core*, but that’s not a concern (for Pointer Wars level 3, at least). This linked list contains pointers to nodes that have been previously allocated by malloc() but are not currently in use. It holds any node created by any linked list, and persists across deallocation of a linked list, which is important for our particular workload.
  3. The __linked_list_create_node() function prefers to use a pointer from the list in #2, and calls malloc() only if none are available. The thought here is that popping a pointer off of a stack or queue is cheaper than a malloc() call. That is almost certainly true; it should add only a few nanoseconds of execution time in the best case.
  4. The __linked_list_delete_node() function pushes the pointer onto the list in #2.
  5. A new function, linked_list_final_cleanup(), is added and called at the end of the test program. It is responsible for free()’ing all of the memory that was allocated.
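To make the shape of this concrete, here is a minimal sketch of the idea. The function names match the list above, but the node layout and the global free list variable are assumptions for illustration, not the actual Pointer Wars linked_list.c:

```c
/* Minimal free-list sketch of the strategy above. node_t and free_list
 * are illustrative names, not the real implementation. Not thread safe,
 * not signal safe, by design. */
#include <stdlib.h>

typedef struct node {
    void *data;
    struct node *next;
} node_t;

/* Global stack of previously malloc()'d nodes that are not currently in
 * use. Persists across linked list deallocations. */
static node_t *free_list = NULL;

static node_t *__linked_list_create_node(void) {
    if (free_list != NULL) {
        /* Reuse: pop a node off the free list instead of calling malloc(). */
        node_t *n = free_list;
        free_list = n->next;
        return n;
    }
    /* Free list is empty: fall back to the system allocator. */
    return malloc(sizeof(node_t));
}

static void __linked_list_delete_node(node_t *n) {
    /* Push the node back onto the free list instead of calling free(). */
    n->next = free_list;
    free_list = n;
}

void linked_list_final_cleanup(void) {
    /* One time shutdown cost: release everything still on the free list.
     * Assumes all nodes have been returned by this point. */
    while (free_list != NULL) {
        node_t *n = free_list;
        free_list = n->next;
        free(n);
    }
}
```

The design point worth noting is that both the allocate and release paths are a handful of pointer assignments in the common case, which is exactly what is needed to beat the 26 and 31 nanosecond targets from earlier.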

My queue code is such that I have no need to change it. Its functionality is nearly entirely dependent on the linked list, by design.

Performance Results and Analysis

Functionality was verified with the Pointer Wars functional test suite, valgrind, and a before and after comparison of the number of nodes searched per query.

Let’s start with the comparison for the entire search process.

| Metric | Before Custom Allocator | After Custom Allocator |
|---|---|---|
| Wall Time [s] | 1,028 | 874 |
| Highest Node Per Second Processing [nodes/s] | 2,793,307 | 3,223,047 |
| Total malloc() calls | 2,038,458,095 | 21,068,515 |
| Total free() calls | 2,038,458,095 | 200 |

Around 15% of the wall time is completely gone with the memory allocator implemented! Do note that the custom allocator number does not count the free() calls at the end of the program that are performed in linked_list_final_cleanup(). All performance metrics require context; in this case I’m interested in a program that would eventually run indefinitely, searching through pairs of data. The free() calls at the end are a one time shutdown cost, which scales with neither the run time of the program nor the number of queries performed, and therefore I am not interested in it. It was a burdensome expense that was needlessly paid in the original code and is now removed.

The number of malloc() calls is about half the total number of edges in the directed graph. That may seem unexpected or buggy at first, but remember that with the custom allocator, malloc() is only called when the free list is empty, that is, when the number of nodes simultaneously in use exceeds every previous high-water mark. What that data is telling you is that there’s never more than half of the graph’s edges in the breadth first search queue at once, even though all of the edges are eventually pushed into the queue on some searches. Cool finding!

Now, let’s pull some individual searches and compare some more data, including some sourced from hardware performance counters.

Search 1/100

The initial search is where the bulk of the malloc() cost is going to hit, because it’s first and also because it searches around 20% of the edges in the directed graph. Not a bad reduction in run time, especially for the initial iteration. There’s a nice reduction in the number of loads and branches that are speculatively executed, and little change in my estimated data cache hit rates (these consist of loads only, for reasons that I will go into in a future blog post). A “local hit rate” specifies how often an L(n) cache miss hits within the L(n+1) cache, so if you were expecting to see a higher hit rate in the L2 cache (a global hit rate is often reported instead, specifying how many additional hits the L2 cache provided versus going all the way to DRAM), that’s why.
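To spell the bookkeeping out (my notation; both rates are estimates computed from raw counter values rather than reported directly by the hardware):

    L1 data cache hit rate ≈ L1 data load hits ÷ loads executed
    L2 local hit rate      ≈ L2 hits ÷ L1 data load misses

so the L2 number only describes traffic that has already missed in L1.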

| Metric | Before Custom Allocator | After Custom Allocator |
|---|---|---|
| Edges Examined | 8,164,900 | 8,164,900 |
| Wall Time [s] | 6.350 | 6.087 |
| malloc() calls | 28,731,879 | 20,573,828 |
| free() calls | 28,731,879 | 2 |
| Loads Executed | 2,482,526,394 | 1,414,757,409 |
| Estimated L1 Data Cache Hit Rate | 98.9% | 98.1% |
| Estimated L2 Cache Local Hit Rate | 60.5% | 59.6% |
| Branches Executed | 2,408,172,829 | 1,520,861,490 |
| Percentage of Branches Predicted Correctly | 99.8% | 99.8% |

Search 3/100

These results are interesting because it’s the third time through. The 2nd search is very short, and actually took longer to run (although the time is short enough that it’s probably just noise). Microarchitectural structures are warmed up now, any faulting behavior we may have seen on the first run is resolved, and this may be more representative of what a ‘normal’ search looks like.

The nodes/second throughput of both the before and after runs increased over the 1st run, showing you that some startup cost existed in both before and after runs.

These hardware performance counters are starting to look more realistic as well. It appears that a lot of loads are done by malloc() and free(), and many of those are hits. You’re starting to see loads going to DRAM more often reflected in the hit rates. This is expected; the representation of the directed graph is vastly larger than the L2 cache capacity.

| Metric | Before Custom Allocator | After Custom Allocator |
|---|---|---|
| Edges Examined | 12,092,093 | 12,092,093 |
| Wall Time [s] | 6.406 | 5.729 |
| malloc() calls | 32,373,005 | 2 |
| free() calls | 32,373,005 | 2 |
| Loads Executed | 2,516,271,981 | 689,686,690 |
| Estimated L1 Data Cache Hit Rate | 98.1% | 93.2% |
| Estimated L2 Cache Local Hit Rate | 60.2% | 56.9% |
| Branches Executed | 2,127,965,750 | 684,051,174 |
| Percentage of Branches Predicted Correctly | 99.8% | 99.8% |

Search 9/100

This search results in no path being found, which is different from the two above.

| Metric | Before Custom Allocator | After Custom Allocator |
|---|---|---|
| Edges Examined | 41,899,618 | 41,899,618 |
| Wall Time [s] | 15.221 | 13.862 |
| malloc() calls | 41,899,619 | 2 |
| free() calls | 41,899,619 | 2 |
| Loads Executed | 3,703,903,844 | 1,356,949,442 |
| Estimated L1 Data Cache Hit Rate | 96.6% | 91.7% |
| Estimated L2 Cache Local Hit Rate | 58.9% | 54.6% |
| Branches Executed | 3,288,923,885 | 1,461,067,494 |
| Percentage of Branches Predicted Correctly | 99.7% | 99.8% |

Human Factors

A consideration for any performance optimization is whether the engineer time is worth the estimated performance increase. In this case it was. I spent around 15 minutes implementing the allocator, and then an hour validating it and collecting performance numbers. The custom allocator is not thread safe (but neither is the linked list nor queue), but this will impact level 4 code when I write my solution. That’s probably a few more hours of work, and if you throw in that I ideated on the allocator before actually writing it, that’s probably another few hours. Call it a day of effort.

One Final Caveat

I did say earlier that the queue and linked list implementations can’t be used in a signal handler or any concurrent code. True. However, the use of malloc() and free() in a signal handler is forbidden by the C standard anyway, so it’s a moot point.

The more important thing is thread safety. Nothing stops two threads from popping the same pointer to allocate into a linked list! That would be bad, but given that no thread safety guarantee is provided by the code, that’s fine.

Level 4 of Pointer Wars ventures into the wonderful world of multi-threaded code, but my wife and I will be providing a baseline solution that “just works” at the expense of less than optimal performance.
