Pointer Wars: Linked List Edition Example Performance Review

Here is an example performance review for week 1, run using my own ‘average’ solution to the problem.

For week 1, performance testing focuses on the creation of linked_list
structures and insertion into a single linked_list structure at either
the front or the end. These were chosen because the operations are
common and present an opportunity for you to improve the performance
of your code without much effort for week 2.

Performance numbers should always come with the context under which
they were measured. With respect to that, your code was run on a
Raspberry Pi 4B with all four cores set to their maximum frequency
of 1.8 GHz. Further, your code was run on a 64-bit version of Ubuntu,
using gcc 13.3.0, compiled with -O3 (unless you specified otherwise,
which is entirely legal within this competition). The benchmarking
software was run with a real-time scheduling policy, meaning that
context switches should never occur, but in the event that they
do, the software throws out that particular set of data points
and restarts the test during which the context switch occurred.

(In case you’re curious, on POSIX-compliant systems one can use
the getrusage() function to get a lot of information about a
process’ or thread’s resource usage, and that’s how the
code detects context switches.)
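
(If you want to poke at this yourself, here is a minimal sketch of the
idea, not the benchmark’s actual code: it compares the involuntary
context switch count reported by getrusage() before and after a timed
region. On Linux you can also ask for per-thread counts with
RUSAGE_THREAD if _GNU_SOURCE is defined.)

    #include <stdio.h>
    #include <sys/resource.h>

    /* Number of involuntary context switches this process has seen so
     * far, or -1 on error. Comparing the count before and after a timed
     * region tells you whether the scheduler interrupted it. */
    static long involuntary_context_switches(void) {
        struct rusage usage;
        if (getrusage(RUSAGE_SELF, &usage) != 0) {
            return -1;
        }
        return usage.ru_nivcsw;
    }

    int main(void) {
        long before = involuntary_context_switches();
        /* ... timed benchmark region would run here ... */
        long after = involuntary_context_switches();
        if (after != before) {
            printf("context switch detected, discard this sample\n");
        }
        return 0;
    }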

The linked_list data sizes are relatively small to keep run time
down. No deletions are performed on the linked_list during the
test. There is absolutely some optimism in these numbers as a
result.

The small size also means that the linked_list likely fits within
the cache hierarchy of the Raspberry Pi in every test,
leading to performance results that are much more optimistic
than real-world workloads would show. That’s OK for week 1.

We provide the number of samples, minimum time, arithmetic mean,
and standard deviation (to the extent that the distribution is a
normal distribution at all!) along with a brief description of
what we are testing.

The results might surprise you by being faster than expected,
especially when you consider where most of the time in, say,
linked_list_create() is spent. Fast does not mean free: when
software works with millions or billions of pieces of data, small
costs add up, and anything can be improved if the engineering need
exists to do so.

As always, if you have questions send them to
<email redacted to avoid spam>.

Your performance results:

Create 10,000 linked_lists: n: 1000 min: 278311 ns mean: 294676.962 ns stddev: 8568.53609939037 ns
Insert 1 elements at front of linked_list: n: 1000 min: 74 ns mean: 78.435 ns stddev: 14.3679426154196 ns
Insert 2 elements at front of linked_list: n: 1000 min: 92 ns mean: 111.555 ns stddev: 242.386511536842 ns
Insert 4 elements at front of linked_list: n: 1000 min: 129 ns mean: 146.464 ns stddev: 9.71579662199639 ns
Insert 8 elements at front of linked_list: n: 1000 min: 259 ns mean: 265.474 ns stddev: 52.0689669957072 ns
Insert 16 elements at front of linked_list: n: 1000 min: 518 ns mean: 543.012 ns stddev: 21.2040056593091 ns
Insert 32 elements at front of linked_list: n: 1000 min: 1037 ns mean: 1081.028 ns stddev: 29.7119372643391 ns
Insert 64 elements at front of linked_list: n: 1000 min: 2074 ns mean: 2124.401 ns stddev: 260.070990691004 ns
Insert 128 elements at front of linked_list: n: 1000 min: 4185 ns mean: 4325.264 ns stddev: 411.099963882265 ns
Insert 256 elements at front of linked_list: n: 1000 min: 8370 ns mean: 8639.817 ns stddev: 485.051297813951 ns
Insert 512 elements at front of linked_list: n: 1000 min: 16796 ns mean: 16890.215 ns stddev: 222.236452399242 ns
Insert 1024 elements at front of linked_list: n: 1000 min: 33814 ns mean: 36098.914 ns stddev: 1401.87056842064 ns
Insert 2048 elements at front of linked_list: n: 1000 min: 70073 ns mean: 74248.457 ns stddev: 2388.20888955531 ns
Insert 4096 elements at front of linked_list: n: 1000 min: 146165 ns mean: 153352.248 ns stddev: 3359.06072176375 ns
Insert 8192 elements at front of linked_list: n: 1000 min: 293829 ns mean: 307067.061 ns stddev: 6577.10375631089 ns
Insert 16384 elements at front of linked_list: n: 100 min: 590233 ns mean: 608037.4 ns stddev: 7579.2567933802 ns
Insert 32768 elements at front of linked_list: n: 100 min: 1195594 ns mean: 1294847.28 ns stddev: 141171.629805714 ns
Insert 1 elements at end of linked_list: n: 1000 min: 55 ns mean: 63.765 ns stddev: 12.65510865224 ns
Insert 2 elements at end of linked_list: n: 1000 min: 74 ns mean: 86.55 ns stddev: 11.1808541713055 ns
Insert 4 elements at end of linked_list: n: 1000 min: 129 ns mean: 133.988 ns stddev: 8.88571077629699 ns
Insert 8 elements at end of linked_list: n: 1000 min: 277 ns mean: 296.434 ns stddev: 34.6360454440169 ns
Insert 16 elements at end of linked_list: n: 1000 min: 574 ns mean: 584.392 ns stddev: 13.9473415388023 ns
Insert 32 elements at end of linked_list: n: 1000 min: 1092 ns mean: 1142.818 ns stddev: 236.38412568529 ns
Insert 64 elements at end of linked_list: n: 1000 min: 2166 ns mean: 2192.167 ns stddev: 172.70073280389 ns
Insert 128 elements at end of linked_list: n: 1000 min: 4296 ns mean: 4343.084 ns stddev: 327.612611088157 ns
Insert 256 elements at end of linked_list: n: 1000 min: 8630 ns mean: 8898.64 ns stddev: 546.443636983724 ns
Insert 512 elements at end of linked_list: n: 1000 min: 17574 ns mean: 18431.747 ns stddev: 987.967551588105 ns

The performance of your linked_list_create() function is acceptable.

Each sample calls linked_list_create() 10,000 times in a loop, measuring
the time of the entire loop. It then calls linked_list_delete() on all
those linked_list pointers, before performing the test again.
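
To make that concrete, the measurement pattern looks roughly like the
sketch below. The signatures of linked_list_create() and
linked_list_delete() are assumed here for illustration; the real
harness differs in its details.

    #include <stdio.h>
    #include <time.h>

    /* Assumed signatures; use your own library's declarations. */
    struct linked_list;
    struct linked_list * linked_list_create(void);
    void linked_list_delete(struct linked_list *ll);

    #define NUM_LISTS 10000

    int main(void) {
        static struct linked_list *lists[NUM_LISTS];
        struct timespec start, end;

        clock_gettime(CLOCK_MONOTONIC, &start);
        for (int i = 0; i < NUM_LISTS; i++) {
            lists[i] = linked_list_create();  /* timed region: creation only */
        }
        clock_gettime(CLOCK_MONOTONIC, &end);

        long long elapsed_ns = (end.tv_sec - start.tv_sec) * 1000000000LL
                             + (end.tv_nsec - start.tv_nsec);
        printf("10,000 creations took %lld ns\n", elapsed_ns);

        /* Cleanup happens outside the timed region. */
        for (int i = 0; i < NUM_LISTS; i++) {
            linked_list_delete(lists[i]);
        }
        return 0;
    }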

Creation of a new linked_list, even for very short-lived linked_lists,
is not very common when weighed against insertion and deletion of nodes.
Generally it’s expected that resource allocation takes some time and tends
to be fairly rare. On the other hand, what the user of your linked_list
library gets back is a pointer to a structure with a minimal number of
members initialized to default values. A modern CPU core,
including the one your code is being run on, can perform those
operations in a few CPU cycles (assuming two stores, and per the
Software Optimization Guide link I posted, this should have a latency
of 2 CPU cycles).
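
To make the “two stores” concrete, here is a sketch of what a minimal
create function might do. The member names (a head pointer and a size
counter) are assumptions for illustration; your structure and allocator
hook will differ.

    #include <stddef.h>
    #include <stdlib.h>

    /* Hypothetical structure layout, for illustration only. */
    struct node;
    struct linked_list {
        struct node *head;
        size_t size;
    };

    struct linked_list * linked_list_create(void) {
        struct linked_list *ll = malloc(sizeof(struct linked_list));
        if (ll == NULL) {
            return NULL;
        }
        ll->head = NULL;  /* store #1 */
        ll->size = 0;     /* store #2: a couple of cycles on a modern core */
        return ll;
    }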

This raises the question of whether you’re really just measuring how
fast malloc() is with this test.

The performance of your linked_list_insert_front() function is acceptable.

Inserting an element at the front of your linked_list has a worst case
time complexity of O(k), constant time. The performance test as a whole
therefore has a worst case time complexity of O(n), and you can see
empirical evidence of this in the way the minimum times roughly double
as the input size doubles.
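
For intuition, a constant time front insertion might look like the
sketch below. The node layout and function signature here are
assumptions for illustration, not your required API.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdlib.h>

    /* Hypothetical types, for illustration only. */
    struct node {
        unsigned int data;
        struct node *next;
    };

    struct linked_list {
        struct node *head;
        size_t size;
    };

    /* Constant amount of work regardless of list length:
     * one allocation, a few stores, no traversal. */
    bool linked_list_insert_front(struct linked_list *ll, unsigned int data) {
        if (ll == NULL) {
            return false;
        }
        struct node *n = malloc(sizeof(struct node));
        if (n == NULL) {
            return false;
        }
        n->data = data;
        n->next = ll->head;
        ll->head = n;
        ll->size++;
        return true;
    }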

A word of warning: a more accurate estimate of how fast
linked_list_insert_front() really is can be obtained by taking the
8,192-element data point, dividing it by 8,192, and treating that as
the minimum time per invocation of linked_list_insert_front(). Here
that works out to 293,829 ns / 8,192, or roughly 36 ns per insertion.
The 1, 2, and 4 entry tests are more likely to be measuring test
overhead than anything else.

Your implementation of linked_list_insert_end() needs improvement for
week 2 in Pointer Wars. This was an intentional and expected outcome
in week 1.

Linked lists are often used as the internal data structure for a queue
or a stack. In the case of a queue, which is a FIFO (First In First Out)
data structure, new data is added at the end of the list and removed
from the front.

You’ll see a worst case time complexity of O(n^2) for the test, and you’ll
notice that at 512 elements your code performs substantially worse than
the linked_list_insert_front() tests.

The first step to improve the performance of your linked list should be
to consider the algorithm used. Can you achieve a worst case time
complexity of O(k), at least for most cases, for this function?

That answer is yes. There are at least two good ways to do it, one
of which is to modify the list to be doubly linked and keep a
tail pointer in the struct linked_list. This comes at the expense
of keeping an additional pointer around for each node, which may or
may not matter depending on what code you’re designing. If you’re
on an embedded system with limited memory, this is probably more of a
concern than on a Raspberry Pi with 8 GB of DRAM.
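
A hedged sketch of that direction with just a tail pointer added is
below; member names are assumptions, and a fully doubly linked variant
would also add a prev pointer to each node.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdlib.h>

    /* Hypothetical types, for illustration only. */
    struct node {
        unsigned int data;
        struct node *next;
    };

    struct linked_list {
        struct node *head;
        struct node *tail;  /* extra pointer buys O(k) insertion at the end */
        size_t size;
    };

    bool linked_list_insert_end(struct linked_list *ll, unsigned int data) {
        if (ll == NULL) {
            return false;
        }
        struct node *n = malloc(sizeof(struct node));
        if (n == NULL) {
            return false;
        }
        n->data = data;
        n->next = NULL;
        if (ll->tail == NULL) {  /* empty list: new node is head and tail */
            ll->head = n;
        } else {
            ll->tail->next = n;  /* no traversal of the list required */
        }
        ll->tail = n;
        ll->size++;
        return true;
    }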

For week 2, a queue is going to be built around your linked_list
that disallows the random insertion and random removal of elements.
Insertion at the end and removal at the front will be replaced with
queue_push() and queue_pop() functions.

With that constraint in mind, think about a means of hitting that
worst case time complexity goal of O(k).
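
One possible shape for that queue, purely illustrative since the week 2
API is not fixed here, is a thin wrapper that only exposes push and pop:

    #include <stdbool.h>

    /* Assumed week 1 signatures; substitute your own declarations. */
    struct linked_list;
    bool linked_list_insert_end(struct linked_list *ll, unsigned int data);
    bool linked_list_remove_front(struct linked_list *ll, unsigned int *out);

    /* Restricting the interface to push and pop keeps insertion at the
     * tail and removal at the head, so both can stay O(k). */
    struct queue {
        struct linked_list *list;
    };

    bool queue_push(struct queue *q, unsigned int data) {
        return linked_list_insert_end(q->list, data);
    }

    bool queue_pop(struct queue *q, unsigned int *out) {
        return linked_list_remove_front(q->list, out);
    }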

Both your linked_list_insert_front() and linked_list_insert_end() functions
call malloc_fptr() for each node insertion.

As you saw, and hopefully inferred from your performance results for
linked_list_create(), dynamic memory allocation is expensive relative
to setting up the structure itself.

Insertion is a common enough operation; common operations should be
made fast, within the constraints of engineer time, of course. The best
implementations for week 2 are going to work around this, and the
difference will show for the implementations that do versus those that
do not.

One means of achieving this goal is to build a custom memory allocator.
With this design, the allocator grabs a large chunk of memory up front
and then hands out pointers into that chunk each time a new node
structure is allocated. Once that chunk fills up, the process repeats.

Those pointers also have to be returned to the allocator on the deletion
of the node, which can be messy because there is no guarantee that
nodes will be deleted in the order in which they were inserted!

A simple, less performant, solution is to use a linked list of
unused node structure pointers within the linked_list structure.
On the first allocation of a large chunk of memory, add each pointer
to the list. When a pointer is requested, remove it from the list.
If the list is empty, perform the allocation of a large chunk again
and repeat.
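
A hedged sketch of that free-list approach follows. The chunk size,
names, and node layout are all assumptions, and plain malloc() stands
in for malloc_fptr(); a complete version would also remember every
chunk it allocated so that linked_list_delete() can hand each one back
to free_fptr().

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdlib.h>

    #define CHUNK_NODES 1024  /* nodes to grab per large allocation */

    /* Hypothetical node; its next pointer is reused to chain free nodes. */
    struct node {
        unsigned int data;
        struct node *next;
    };

    struct node_allocator {
        struct node *free_list;  /* singly linked list of unused nodes */
    };

    /* Carve one large allocation into CHUNK_NODES nodes and push them
     * onto the free list. Returns false if the allocation fails. */
    static bool allocator_grow(struct node_allocator *a) {
        struct node *chunk = malloc(CHUNK_NODES * sizeof(struct node));
        if (chunk == NULL) {
            return false;
        }
        for (size_t i = 0; i < CHUNK_NODES; i++) {
            chunk[i].next = a->free_list;
            a->free_list = &chunk[i];
        }
        return true;
    }

    /* Pop a node off the free list, growing it when it runs dry. */
    static struct node * allocator_alloc(struct node_allocator *a) {
        if (a->free_list == NULL && !allocator_grow(a)) {
            return NULL;
        }
        struct node *n = a->free_list;
        a->free_list = n->next;
        return n;
    }

    /* Deleted nodes go back on the free list in whatever order they
     * are removed from the linked_list. */
    static void allocator_free(struct node_allocator *a, struct node *n) {
        n->next = a->free_list;
        a->free_list = n;
    }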

Memory leaks will not be tolerated in this competition, so on
linked_list_delete(), you would have to devise an algorithm to
return all allocated memory via free_fptr(). That’s something
for you to think on.

The implementation of a custom allocator is not necessary for
week 2, but it is recommended. Dynamic memory allocation and freeing
are a source of substantial wasted performance in modern software.