Pointer Wars: The Target Machines

The response to my last post on the new C programming competition has been immense. So far we’re up to 9 participants (about eight more than my wife and I expected), and counting! Pointer Wars: Linked List edition is designed to have the competitors code a linked list in C under a variety of interesting scenarios. You guys provide code, we (my wife and I) provide statistically sound benchmarking and feedback on the code, with feedback focusing on whether or not your code would be accepted into a professional code repository. Entrance fee is $25 USD. It’s not too late to join in!

This post describes some aspects of the target architecture, containing some (but my no means all) of what I think is important. There is absolutely missing information about the machine from the able below, and much of that missing information will be exposed by the performance tests that we run. Feel free to leave a comment here or send an email asking for a specific stat (the email address is the same one as the welcome email you received).

Board	Raspberry Pi 4B
System On Chip	BCM2711, passively cooled with attached heatsink
Power Supply	Raspberry Pi provided USB-C adapter
CPU	4x Cortex-A72, clock speed of 1.8 GHz
Memory	8GB LPDDR4
Operating System	A 64-bit variant of Ubuntu
Documentation	ARM Provided Technical Reference Manual (TRM) ARM Provided Software Optimization Guide

Commentary On the Raspberry Pi Architectural Statistics

When I think about CPU performance problems, the first question that comes to mind is whether it’s worth doing. That skill takes a while to develop, so assume yes, because that’s a major point of the competition.

With that assumption in mind, for small pieces of heavily used code, I put some thought into what the software spends its time doing, what is needed to make sure that occurs optimally in the particular target system. Let’s consider iteration through a linked list. What’s the bottleneck that dictates how many iterations per second we can do? Does cache behavior play a part? Virtual memory? Clock speed? Yes to all of the above.

Participants: Consider what optimally needs to happen for each access in an iteration over your linked list (the virtual memory system “just works”, the caches hit, the CPU pipeline is always full, the OS never preempts your process…) and then come up with a model that gives an iteration count as a function of clock speed. Is there a piece of information missing from the table above that would *really* help you in this analysis?

Test your code against your model and see whether both your code and your model hold water. Part of the beauty of the competition is that with a single machine being used, you can keep just a single CPU in your head when thinking about performance.

I consider the perforamnce debug work done when I either achieve 95% of the theoretical maximum value I calculated, or refine my mental model with a constraint that I wasn’t able to think of. Why 95%? Because 100% is really hard, and one of my advisors who wrote some really great high performance code gave me that heuristic.

For those curious, a well written linked list implementation, under some conditions, can likely achieve a few hundred million iterations per second per core on the target machine.

Chris Feilbach's Blog