Digging into a Weak Scaling Performance Problem: Part 2

Now that we’ve defined weak scaling and the workload, it’s time to dig into what might be going wrong. In the last part we identified resources shared across software threads and hardware cores as a potential source of weak scaling problems. As a brainstorming exercise, I’ve put together a table of all the shared resources I can think of and where problems might lie.

As a reminder, each thread has its own queue and pushes 10,000,000 values into the queue as its workload. It’s a bit of an odd workload by itself, but perhaps somewhere down the line another thread consumes values from that queue.
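For concreteness, the body of each thread’s work looks roughly like the following sketch (queue_push() and struct thread_private_data are stand-in names, not necessarily what the real code calls them):

#define PUSH_COUNT 10000000

// Sketch of a thread's workload: push 10,000,000 values into its own queue.
// (queue_push() is a placeholder name for the real push routine.)
static void * thread_work(void * arg)
{
    struct thread_private_data * tpd = arg;   // this thread's private data, including its queue
    for (unsigned int i = 0; i < PUSH_COUNT; i++) {
        queue_push(tpd->q, i);
    }
    return NULL;
}

No thread ever touches another thread’s queue, which is exactly why poor scaling is surprising here.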

Resource | Sharing Problems | Notes
Virtual and physical address space | Not inherently problematic. | This is expected behavior with multiple threads.
I/O storage devices | I/O devices may not scale well across multiple threads, whether the limit is in the device itself, the driver, or the kernel. | Not a problem, the program does no storage I/O.
I/O to console | Prints to the console may be serialized. | Not a problem, the program does not print.
Kernel: system calls | System calls may use a shared resource in the kernel, or cause the scheduler to block a thread’s execution. | Not a problem, strace shows that the program does not make any.
Kernel: virtual memory management | If one thread causes another to swap out pages, it may stall that thread’s execution. | Not a problem, the program’s memory usage fits within DRAM capacity.
C library calls (such as malloc()/free()) | Library functions may not be able to handle concurrent calls from multiple threads. | Surprisingly, not a problem.
Shared cache lines between threads | Placing each thread’s data structures on the same cache line causes false sharing. | A problem. More on this later.
L2 cache behavior | The L2 cache is shared among cores, so each thread gets less capacity with multiple threads running. | If the L2 cache is inclusive of each core’s L1 cache, the L2 might end up evicting a cache line of core A to satisfy a request from core B.
Thread startup and teardown time | Happens sequentially, could limit scaling. | Not a problem for this workload.
Thread migration | A thread might migrate from one core to another. | Unsure if this is a problem for us. Hopefully the OS is smart enough to avoid this, but it should be checked.

Kernel and I/O Device Analysis

Let’s start knocking these out. We can eliminate I/O device behavior because we don’t use any I/O devices, and we can eliminate prints to the console because we don’t do any of those either. Virtual memory isn’t a problem (yet) because our threads don’t consume enough memory for it to be one (my Raspberry Pi 4B has 8 GB of DRAM, and this is the only workload running on the machine).

I used the strace utility to determine which system calls the program was making, and somewhat surprisingly I didn’t see any brk() or mmap() calls after the threads were initialized, or any other system calls for that matter. The brk() system call is the means by which a process grows its heap, and mmap() asks the kernel for a region of virtual memory (which may or may not be considered part of the heap). I did see an mmap() call performed for each thread as it was initialized, to allocate its thread-local stack, but that happens before I start measuring the time it takes for each thread to run (and it shouldn’t be a bottleneck for only 4 threads). So for this case it’s safe to say that system calls aren’t limiting our ability to scale. Note that even if we did see a bunch of system calls, that alone wouldn’t imply the workload can’t scale; the concern would be a shared resource locked within the kernel, limiting the ability of all threads to use that resource at the same time.

C Library Calls

With the kernel-level effects out of the way, it’s time to dig into the code. My queue implementation calls malloc() every time a new node is added to the queue. That’s less than great and can be improved, but it also serves well for debugging performance problems. Conventional wisdom suggests that routines like malloc() wrap the whole allocation in a lock as their means of being thread safe. In my particular case, I found this not to be true: my version of the C library (aarch64 glibc 2.39) maintains thread-specific allocation structures. That was a cool finding! I spotted it initially by seeing some unexpected assembly language (a read of the ARM TPIDR_EL0 system register, which holds a thread-specific identifier), and then confirmed it by reading the source. Really cool. At some point repeated malloc() calls are going to trigger a brk() or mmap(), but we never hit that point in our workload, so it’s not the source of our weak scaling problem.
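If you want to poke at that register yourself, here’s a minimal standalone sketch (not glibc’s code, just an illustration) of reading it from C with inline assembly on AArch64:

#include <stdint.h>
#include <stdio.h>

// Read TPIDR_EL0, the AArch64 thread pointer register available at EL0.
// Thread-local storage lookups (like the ones malloc() uses to find
// per-thread allocation state) start from this register.
static inline uint64_t read_tpidr_el0(void)
{
    uint64_t value;
    __asm__ volatile("mrs %0, tpidr_el0" : "=r"(value));
    return value;
}

int main(void)
{
    printf("TPIDR_EL0 = 0x%llx\n", (unsigned long long)read_tpidr_el0());
    return 0;
}

Each thread sees a different value here, which is how the allocator can find its own per-thread state without taking a global lock.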

Cache Lines and False Sharing

On my system (and nearly every modern CPU) the cache line size is 64 bytes, and those cache lines are kept coherent. In coherent caches, a store by one core to any byte within a cache line will cause all other copies of that line to be invalidated in every other cache that contains it. That means the next time another core tries to load from or store to that cache line, it misses in its own cache and has to request the line from the L2 cache.
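As a contrived illustration (hypothetical, not code from this program), imagine two threads each incrementing their own counter, with both counters packed into the same 64-byte line:

struct counters {
    unsigned long thread0_count;  // bytes 0-7
    unsigned long thread1_count;  // bytes 8-15: same 64-byte cache line as above
};

// Each thread writes only its own field, but every store invalidates the
// line in the other core's cache, so both cores keep missing and re-fetching
// a line they never actually share data through. That is false sharing.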

For reference, here’s what my data structures look like:

struct node {
    struct node * next;
    unsigned int data;
};

struct linked_list {
    struct node * head;
    struct node * tail;
    size_t size;
};

struct queue {
    struct linked_list* ll; // <-- the fact that this is a pointer at all is rather silly.
};

And I pre-allocate the queue structures prior to the threads running:

    // Populate thread private data.
    //
    for (size_t i = 1; i <= THREAD_COUNT; i++) {
        tpd[i - 1].q = queue_create();
    }
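queue_create() isn’t shown here, but a minimal sketch consistent with the address layout below might look like this (the real implementation may differ in details):

#include <stdlib.h>

struct queue * queue_create(void)
{
    // Two small heap allocations per queue: the queue itself, then its
    // linked list. Successive small allocations like these tend to land
    // right next to each other on the heap, which is how structures
    // belonging to different threads end up packed into the same cache lines.
    struct queue * q = malloc(sizeof(*q));
    q->ll = malloc(sizeof(*q->ll));
    q->ll->head = NULL;
    q->ll->tail = NULL;
    q->ll->size = 0;
    return q;
}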

If you look at the addresses of the queues and their linked lists:

Address of thread 0 queue: 0xaaaaaaac12a0
Address of thread 1 queue: 0xaaaaaaac12e0
Address of thread 2 queue: 0xaaaaaaac1320
Address of thread 3 queue: 0xaaaaaaac1360
Address of thread 0 queue's linked list: 0xaaaaaaac12c0
Address of thread 1 queue's linked list: 0xaaaaaaac1300
Address of thread 2 queue's linked list: 0xaaaaaaac1340
Address of thread 3 queue's linked list: 0xaaaaaaac1380

You’ll see that thread 1’s queue is on the same cache line as thread 0’s linked list, that thread 2’s queue is on the same cache line as thread 1’s linked list, and that thread 3’s queue is on the same cache line as thread 2’s linked list. Those structures are never meant to be shared among threads, so this is a case of false sharing that may have a performance impact.
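A quick way to sanity-check this: two addresses fall in the same 64-byte cache line exactly when they agree in every bit above bit 5, i.e. when shifting both right by 6 gives the same line number. A small helper (illustrative only) makes the check explicit:

#include <stdint.h>

// Two addresses share a 64-byte cache line when their line numbers
// (address / 64, i.e. address >> 6) are equal.
static int same_cache_line(uintptr_t a, uintptr_t b)
{
    return (a >> 6) == (b >> 6);
}

// same_cache_line(0xaaaaaaac12c0, 0xaaaaaaac12e0) -> 1
//   (thread 0's linked list and thread 1's queue)
// same_cache_line(0xaaaaaaac1300, 0xaaaaaaac1320) -> 1
//   (thread 1's linked list and thread 2's queue)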
