Blindly scaling software is hard and often unpleasant. This multi-part blog post defines and debugs a weak scaling performance problem on a multi-core CPU.
A lot of software performance work revolves around reducing latency as a means of achieving a particular performance target. Latency is the time a task takes from start to finish. A typical flow for performance improvement is to find, via profiling, where the code is spending most of its time and then to optimize that piece, often once a business need arises to justify the work. This works great when you’re working with code that primarily operates within a single thread. Whether it scales well (haha) is another matter.
Scaling code such that multiple independent threads, processes, machines, and/or data centers can perform the same function is an entirely different problem and requires a different type of thinking. It’s no longer solely a latency issue. Often, it’s a matter of getting data where it needs to go at a high enough rate to achieve a particular set of functional and performance goals. One of the bottlenecks that any system will eventually encounter is contention: wanting to use a shared resource that is currently busy. Indeed, the weak scaling problem contrived for this post has exactly that issue.
Let’s start with some definitions:
- Processor: For the purposes of this blog post, I mean a processor core, although the definitions below hold just as well when adding an additional CPU, machine, cluster, etc.
- Strong scaling: Strong scaling is a measure of how well the latency improves for a fixed dataset when the processor count is increased. If a 100 second task on a single core processor now takes 25 seconds on a four core processor, it scaled perfectly.
- Weak scaling: Weak scaling is a measure of whether additional processors can keep up with additional data. If a single processor can perform 1,000,000 requests per second, and adding three more cores allows the system to process 4,000,000 requests per second, it scaled perfectly. (Both examples are worked through in the short sketch just below.)
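To make the arithmetic concrete, here’s a tiny sketch of my own (not anything from the workload later in the post) that works both of the examples above. An efficiency of 1.0 means perfect scaling; anything below that means the extra cores aren’t pulling their weight.

```c
#include <stdio.h>

int main(void) {
    /* Strong scaling: fixed dataset, more cores should cut latency. */
    double t_1core = 100.0, t_4core = 25.0;      /* seconds, from the example above */
    double speedup    = t_1core / t_4core;       /* 4.0x */
    double strong_eff = speedup / 4.0;           /* 1.0 == perfect strong scaling */

    /* Weak scaling: dataset grows with core count, throughput should grow with it. */
    double tput_1core = 1000000.0, tput_4core = 4000000.0;  /* requests per second */
    double weak_eff   = tput_4core / (4.0 * tput_1core);    /* 1.0 == perfect weak scaling */

    printf("strong scaling efficiency: %.2f\n", strong_eff);
    printf("weak scaling efficiency:   %.2f\n", weak_eff);
    return 0;
}
```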
Strong scaling tends to be what people immediately think of when asked to define scaling. Weak scaling is also crucial in today’s world, especially as the amount of data we generate grows at astronomical rates, but it tends to get much less love. Strong scaling is all about handling latency effectively (and is indeed limited by the latency of the ‘sequential’ portion of the system, per Amdahl’s law). Weak scaling is about ensuring that your throughput scales, and importantly, you’ll note that nowhere in its definition does the time each individual request takes appear.
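As a quick aside on that sequential portion, here’s what Amdahl’s law looks like in code. The parallel fractions below are made-up numbers purely for illustration; the point is that even a mostly parallel workload tops out well short of a 4x speedup on four cores.

```c
#include <stdio.h>

/* Amdahl's law: with a fraction p of the work parallelizable across n cores,
 * the best-case speedup is 1 / ((1 - p) + p / n). */
static double amdahl_speedup(double p, int n) {
    return 1.0 / ((1.0 - p) + p / (double)n);
}

int main(void) {
    /* Even with 95% of the work parallelized, 4 cores fall well short of 4x. */
    printf("p=0.95, n=4:  %.2fx\n", amdahl_speedup(0.95, 4));   /* ~3.48x */
    printf("p=0.95, n=64: %.2fx\n", amdahl_speedup(0.95, 64));  /* ~15.4x */
    return 0;
}
```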
It’s important to note that strong scaling does not at all imply that weak scaling must also occur. It’s easy to fool oneself into thinking that throughput is always the reciprocal of latency for a given request. That only holds when one task is started immediately after another completes and tasks never overlap. Any attempt to hide latency by keeping multiple requests in flight at once breaks that mathematical relationship: if each request takes 1 microsecond but ten are always in flight, throughput is roughly 10,000,000 requests per second, not 1,000,000.
The Workload and Performance Measurements
I used a contrived workload that I built for part of my high performance C programming competition, Pointer Wars. I have some code that implements a queue based on a singly linked list, albeit with some rather suboptimal memory allocation and deallocation behavior included. The queue implements its push() and pop() operations in constant time, so the latency you’re seeing is that of a “typical” implementation, rather than, say, a quadratic-time one where pushing repeatedly into the queue in a loop blows up.
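For context, the queue is roughly the following shape. This is a minimal sketch rather than the actual Pointer Wars code (the names queue_push/queue_pop and the exact allocation behavior are mine), but it captures the important bits: O(1) push and pop, plus one malloc per push and one free per pop.

```c
#include <stdlib.h>

/* Minimal singly linked list queue: O(1) push at the tail, O(1) pop at the head.
 * One malloc per push and one free per pop -- the kind of allocation behavior
 * that looks harmless on one core and gets interesting on four. */
struct node {
    unsigned long value;
    struct node *next;
};

struct queue {
    struct node *head;   /* a fresh queue starts with head == tail == NULL */
    struct node *tail;
};

static int queue_push(struct queue *q, unsigned long value) {
    struct node *n = malloc(sizeof(*n));
    if (n == NULL) return -1;
    n->value = value;
    n->next = NULL;
    if (q->tail != NULL) q->tail->next = n;
    else q->head = n;
    q->tail = n;
    return 0;
}

static int queue_pop(struct queue *q, unsigned long *value) {
    struct node *n = q->head;
    if (n == NULL) return -1;
    *value = n->value;
    q->head = n->next;
    if (q->head == NULL) q->tail = NULL;
    free(n);
    return 0;
}
```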
The workload consists of each core having its own queue and pushing 10,000,000 elements into it, coming in around 230 nanoseconds per push (yuck, slow!). The four core variant therefore does 40,000,000 pushes total across four queues, which satisfies the weak scaling requirement that the data grow with the processor count. You can also see that the code doesn’t quite have perfect weak scaling: the four bars aren’t all the same height, although admittedly it’s not as bad as I thought it might turn out.
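The harness itself is nothing fancy. Here’s a sketch of the shape of the four core variant, building on the hypothetical queue_push() and struct queue from the sketch above and plain pthreads; the real measurement code times each thread and differs in its details.

```c
#include <pthread.h>
#include <stdio.h>

/* Build with -pthread, alongside the queue sketch above. */
#define PUSHES_PER_THREAD 10000000UL
#define NUM_THREADS 4

/* Each thread owns its own queue and pushes 10,000,000 elements into it,
 * so the total work grows with the core count (the weak scaling setup). */
static void *push_worker(void *arg) {
    struct queue *q = arg;
    for (unsigned long i = 0; i < PUSHES_PER_THREAD; i++) {
        if (queue_push(q, i) != 0) break;
    }
    return NULL;
}

int main(void) {
    pthread_t threads[NUM_THREADS];
    struct queue queues[NUM_THREADS] = {0};   /* all heads/tails start out NULL */

    for (int i = 0; i < NUM_THREADS; i++)
        pthread_create(&threads[i], NULL, push_worker, &queues[i]);
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(threads[i], NULL);

    puts("done: 40,000,000 pushes across 4 queues");
    return 0;
}
```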

I think it’s likely that once some optimization reduces the latency per push, the weak scaling gap will become more pronounced, but that’s a moot point for now. The question becomes whether we can get the four core variant back to a wall clock time of 2,300 ms, which is what 10,000,000 pushes at roughly 230 nanoseconds apiece works out to on a single core.
Performance debug to follow in the next post!