malloc() isn’t free: Some Findings and Suggestions

Dynamic memory allocation is often treated as being free by programmers. It isn’t.

For our first batch of participants, week 1 of the high-performance C competition Pointer Wars: Linked List Edition is complete (for those curious, you can still join!). Code has been reviewed, microbenchmarks run, and performance results delivered. A few people even had code that crashed the performance testing software.

Week 1 focused on the implementation of a linked list, and the microbenchmarks run against it were as follows:

  1. Repeated creation of struct linked_list objects via the linked_list_create() function.
  2. Repeated insertion of data into the linked list at the front of the list (simulating insertion into a stack).
  3. Repeated insertion of data into the linked list at the end of the list (simulating insertion into a queue).

The participants wrote other functions as well, but these were the ones tested this week.

Microbenchmarks like these are not real workloads, even if they behave like portions of one. Performance inferences can be made from them, but the context around the microbenchmark matters greatly. In the case of Pointer Wars, the benchmark framework introduces various sources of optimism so that a consistent performance result can be seen. These results very much reflect the best that one can achieve with the code as written, on the system it is benchmarked on. For instance, the performance testing framework that my wife and I use warms up all the microarchitectural structures by repeating the same code in a tight loop for many, many iterations. All of the caches, TLBs, branch predictors, prefetchers, and other microarchitectural state that may exist have been exercised over and over again.
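
To make the warming concrete, here’s a minimal sketch of that kind of harness. This is not the actual Pointer Wars framework; the iteration counts, the 64-byte allocation size, and the use of clock_gettime() are my own assumptions:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define WARMUP_ITERATIONS 100000L    /* assumption: enough to warm caches, TLBs, predictors */
    #define TIMED_ITERATIONS  10000000L  /* assumption: enough for a stable average */

    /* Volatile sink so the compiler can't elide the malloc()/free() pair. */
    static void * volatile sink;

    int main(void) {
        struct timespec start, end;

        /* Warm-up phase: exercise the code under test so microarchitectural
         * state is primed before any timing begins. */
        for (long i = 0; i < WARMUP_ITERATIONS; i++) {
            void *p = malloc(64);
            sink = p;
            free(p);
        }

        /* Timed phase: the same tight loop, many, many iterations. */
        clock_gettime(CLOCK_MONOTONIC, &start);
        for (long i = 0; i < TIMED_ITERATIONS; i++) {
            void *p = malloc(64);
            sink = p;
            free(p);
        }
        clock_gettime(CLOCK_MONOTONIC, &end);

        double ns = (end.tv_sec - start.tv_sec) * 1e9
                  + (end.tv_nsec - start.tv_nsec);
        printf("average: %.1f ns per malloc()/free() pair\n",
               ns / TIMED_ITERATIONS);
        return 0;
    }

Everything this loop touches stays hot from one iteration to the next, which is exactly the optimism being described.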

Performance Findings

The average minimum time for most participants’ code to perform a single, warm invocation of linked_list_create() was 25 to 26 nanoseconds on a Raspberry Pi 4B with all of its CPU cores’ frequency set to 1.8 GHz. One participant came in at 27 nanoseconds, likely due to having a doubly linked list and therefore an extra member in their linked_list structure to initialize (per ARM’s Cortex-A72 (TM) Software Optimization Guide, the store instruction that the compiler likely emitted for that initialization has a latency of 1 CPU cycle). One might wonder why the results are all so close between participants: same system, same benchmark software, and frankly very nearly the same code. There are a limited number of ways to initialize a linked list structure, and I enforced input validation of all arguments via our functional testing framework (so one can’t skip it and get a better result).

On one hand, you might claim, “WOAH, 26 nanoseconds, that is really fast.” On the other hand, the result the programmer gets back from linked_list_create() is a pointer, and for most of the contestants’ implementations a single member within that structure is set to NULL. Considering that, 26 nanoseconds actually feels like an eternity, but an acceptable eternity. I don’t think I’m alone in espousing the opinion that a one-off higher-cost operation is acceptable for the allocation and initialization of a structure.

There’s a reasonable amount going on in that linked_list_create() microbenchmark, both in terms of C code and the C compiler’s handling of function calls and returns. Input validation is one thing (and probably eats a few nanoseconds), checking that malloc()’s result is non-NULL is another, the call to malloc() itself is an indirect procedure call, stack frame allocation costs something, and the call and return instructions aren’t free. All in all, let’s call the malloc() invocation 20 nanoseconds of that 26 nanoseconds on average.
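
To picture where those nanoseconds go, here’s a minimal sketch of such a create function. The struct layout and the allocator-as-function-pointer signature are assumptions on my part, not the actual competition API:

    #include <stdlib.h>

    /* Assumed layout; participants' actual structs vary. */
    struct node {
        void *data;
        struct node *next;
    };

    struct linked_list {
        struct node *head;
        void *(*malloc_fptr)(size_t);  /* assumption: allocator registered at creation */
    };

    struct linked_list *linked_list_create(void *(*malloc_fptr)(size_t)) {
        if (malloc_fptr == NULL) {                  /* input validation */
            return NULL;
        }
        struct linked_list *list = malloc_fptr(sizeof(*list));  /* indirect call */
        if (list == NULL) {                         /* check the allocation */
            return NULL;
        }
        list->head = NULL;                          /* the single NULL store */
        list->malloc_fptr = malloc_fptr;
        return list;
    }

Even in this handful of lines there’s a branch, an indirect call, and a couple of stores, and the compiler still has to set up and tear down a stack frame around all of it.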

All of the participants performed a malloc() invocation within their linked_list_insert_front() and linked_list_insert_end() functions to allocate a new node structure. There’s absolutely some performance being left on the table with that implementation.
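
The pattern looks something like the following sketch (again, names and layout are assumptions, not any participant’s actual code); note the allocator call sitting on the hot path of every single insertion:

    #include <stdbool.h>
    #include <stddef.h>

    /* Reusing struct node and struct linked_list from the sketch above. */
    bool linked_list_insert_front(struct linked_list *list, void *data) {
        if (list == NULL) {                         /* input validation */
            return false;
        }
        /* One allocator call per insertion: roughly 20 ns of the
         * 27 to 32 ns the whole operation costs. */
        struct node *n = list->malloc_fptr(sizeof(*n));
        if (n == NULL) {
            return false;
        }
        n->data = data;
        n->next = list->head;
        list->head = n;
        return true;
    }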

A More Likely Scenario

What happens when your particular workload is dominated by the insertion and removal of nodes, such as in an implementation of Breadth First Search? The linked list (or, more likely, a queue that uses a linked list internally) is likely created exactly once, or perhaps once per search that the application performs. The operations over the list are insertion at the end and removal at the front, and these operations absolutely dominate what the list actually does in the application.
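
The shape of that workload looks roughly like this; every function named here (queue_create(), queue_push_back(), and so on) is a hypothetical stand-in rather than a real API:

    /* All of these are hypothetical stand-ins, not a real API. */
    struct graph;
    struct queue;
    struct queue *queue_create(void);
    void queue_destroy(struct queue *q);
    void queue_push_back(struct queue *q, int v);
    int  queue_pop_front(struct queue *q);
    int  queue_empty(struct queue *q);
    int  neighbor_count(struct graph *g, int v);
    int  neighbor(struct graph *g, int v, int i);
    int  visited(struct graph *g, int v);
    void mark_visited(struct graph *g, int v);

    void bfs(struct graph *g, int start) {
        struct queue *q = queue_create();     /* one-off cost: happens once */
        queue_push_back(q, start);
        mark_visited(g, start);

        while (!queue_empty(q)) {
            int v = queue_pop_front(q);       /* removal at the front */
            for (int i = 0; i < neighbor_count(g, v); i++) {
                int w = neighbor(g, v, i);
                if (!visited(g, w)) {
                    mark_visited(g, w);
                    queue_push_back(q, w);    /* insertion at the end */
                }
            }
        }
        queue_destroy(q);
    }

One queue_create(), and then an insertion and a removal for every reachable vertex: the per-operation cost is all that matters.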

Insertion at the beginning of a linked list might seem relatively cheap, and it should be; participant code comes in around 27 to 32 nanoseconds per operation. Now assume your data set is large enough that 1,000,000,000 nodes get pushed onto the linked list. At 20 nanoseconds per call, that’s 20 seconds of just malloc()’ing data out of the 32 seconds the insertions take in total! The malloc() calls are substantial overhead; if they were never performed, the run time would drop to 12 seconds, for nearly a 3x speedup.

Now, it’s a linked list; one might argue that the entire purpose of the data structure is the freedom to NOT have all of your data contiguous within the program’s virtual address space. But that doesn’t mean every node insertion requires a call to malloc(). In fact, it’s not hard to envision a solution where a number of nodes are pre-allocated with a single malloc() invocation and then handed out as needed by the insertion functions of the linked list. It’s a bit more complex than that, as the linked list must ensure that none of the pre-allocated nodes are still in use before the corresponding free() call is performed.
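
Here’s a minimal sketch of that idea: a fixed-size pool that services node allocations from one up-front malloc() and recycles returned nodes through a free list. The pool size is an arbitrary assumption, and a production version would need pool-growth logic that this omits:

    #include <stdlib.h>

    #define POOL_SIZE 4096   /* assumption: tune to the workload */

    struct node {
        void *data;
        struct node *next;
    };

    struct node_pool {
        struct node *storage;    /* one malloc() backs POOL_SIZE nodes */
        struct node *free_list;  /* recycled nodes, chained through .next */
        size_t next_unused;      /* index of the next never-used node */
    };

    struct node_pool *pool_create(void) {
        struct node_pool *p = malloc(sizeof(*p));
        if (p == NULL) return NULL;
        p->storage = malloc(POOL_SIZE * sizeof(struct node));
        if (p->storage == NULL) { free(p); return NULL; }
        p->free_list = NULL;
        p->next_unused = 0;
        return p;
    }

    /* Hand out a node without touching malloc() on the hot path. */
    struct node *pool_alloc(struct node_pool *p) {
        if (p->free_list != NULL) {
            struct node *n = p->free_list;
            p->free_list = n->next;
            return n;
        }
        if (p->next_unused < POOL_SIZE)
            return &p->storage[p->next_unused++];
        return NULL;   /* pool exhausted; a real version would grow here */
    }

    /* Return a node to the pool instead of calling free(). */
    void pool_free(struct node_pool *p, struct node *n) {
        n->next = p->free_list;
        p->free_list = n;
    }

    /* The pool's backing memory can only be free()'d once no nodes
     * remain in use, which is the bookkeeping alluded to above. */
    void pool_destroy(struct node_pool *p) {
        free(p->storage);
        free(p);
    }

With something like this, the hot insertion path touches only memory that malloc() handed out once, up front, and the free() bookkeeping moves to pool_destroy().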

Microbenchmark Optimism to Be Aware Of

Perhaps you’re still not quite sold on the idea that dynamic memory allocation is expensive, because 20 nanoseconds isn’t that much time. Here’s a bunch of reasons why that 20 nanosecond number is extremely optimistic relative to more real-world code:

  • All of the page faulting of newly used virtual pages occurs within the first iteration of each microbenchmark. Page faults are rather expensive. They require finding a physical frame to allocate, clearing of that frame, updating of the process’ page table, updating of various kernel metadata structures around virtual and physical memory allocation, may involve TLB shootdowns, and are generally slow.
  • malloc() is invoked constantly by the microbenchmark and is therefore faster by way of caching and other microarchitectural structure warming. The instruction and data caches are primed for it, as are the TLBs. Misses in any of those microarchitectural structures are likely extremely rare, and a few misses is all it takes to turn that 20 nanoseconds into potentially 200 or more.
  • The microbenchmark behaves such that there is never any internal or external fragmentation within the virtual address space that malloc() manages. In real-world applications, a substantial amount of time can be spent by malloc() searching for an appropriately sized block of virtual memory to return. That doesn’t occur here.
  • The microbenchmark always requests the same amount of memory per call to malloc(). Some malloc() implementations have special case logic for this, although I am not familiar with the one installed on my machine. It’s worth being on the list, however, as it’s common. For accuracy, this of course excludes the initial allocation of the linked list structure (which may be a different size than a linked list node structure), but that’s so rare relative to the other operations that it’s moot.
  • The microbenchmark likely does not allocate enough heap space to invoke the brk() system call. On POSIX compliant systems such as this one, brk() is the system call by which a process requests a change to the size of its heap. Once a system call is invoked, the process is going to wait for quite a while, substantially longer than 20 nanoseconds. For fairness’ sake, this is not a common operation, but it is one to be aware of.
  • The microbenchmark likely receives pointers that are close to contiguous in the process’ virtual address space. Given that there has been no fragmentation, and that every structure is malloc()’d and then free()’d in the same order, the addresses returned are likely close to sequential. They likely won’t be touching, because most malloc() implementations stick a small preamble before the pointer they return (often for ‘free list’ tracking). CPUs love sequential or nearly sequential virtual addresses.
  • The malloc() data structures are likely cached. The entirety of the malloc() data structures and the linked list probably does not fit within the L1 cache of a CPU core on the Raspberry Pi 4B, but for the performance tests run, it absolutely would fit within the L2. Given how frequently it is accessed, it likely stays there. L2 accesses are substantially faster than DRAM accesses, generally.
  • The microbenchmark is single threaded. The likely mechanism by which malloc() is made thread safe is to surround (nearly) the entire function with a lock. Given that there is a single thread in the process, the lock is always available, and thus there is no delay in acquiring it.
  • The microbenchmark is run on a system that never context switches. All the caching benefits mentioned go out the window once a context switch occurs. For sane performance number collection, we configured the system to avoid context switching, and also to detect context switches and throw away any results where one occurred. (One possible detection mechanism is sketched below.)
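
For the curious, here’s one possible way to detect a polluted measurement; this is a sketch using getrusage()’s context switch counters, not necessarily the mechanism our framework uses:

    #include <sys/resource.h>

    /* Run a benchmark and report whether any context switch, voluntary
     * or involuntary, occurred during the measurement window. */
    int run_was_clean(void (*benchmark)(void)) {
        struct rusage before, after;
        getrusage(RUSAGE_SELF, &before);
        benchmark();
        getrusage(RUSAGE_SELF, &after);
        /* If either counter moved, the run was disturbed: discard it. */
        long switches = (after.ru_nvcsw  - before.ru_nvcsw) +
                        (after.ru_nivcsw - before.ru_nivcsw);
        return switches == 0;
    }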
