If Dekker’s algorithm is new to you, you might want to start with my previous blog post here first: Mutual Exclusion: No You Don’t Need Atomic Operations to Build a Lock In C.
As a very short recap, Dekker’s algorithm provides mutual exclusion between two threads: one and only one thread can enter the so-called ‘critical section’ at a time. It does so with two flag variables, one per thread, where each thread signals to the other that it wants to enter the critical section. There’s also a turn variable for breaking ties when both threads want to enter the critical section at the same time.
Here’s some code that correctly implements Dekker’s algorithm, but that will not work on x86/x64 or ARM machines. The algorithm is sound. The problem is that the machine’s execution of that algorithm breaks the invariant that only one thread can be in the critical section at once.
#include <stdbool.h>

volatile bool wants_to_enter[2];
volatile int turn;

// Standard init stuff.
//
void init_lock(void) {
    wants_to_enter[0] = false;
    wants_to_enter[1] = false;
    turn = 0;
}

// Enters the critical section. Spins until the critical section
// is entered.
//
void enter_critical_section(int thread_id) {
    int other_thread_id = thread_id ^ 1;
    wants_to_enter[thread_id] = true;
    while (wants_to_enter[other_thread_id]) {
        if (turn == other_thread_id) {
            // Back off: drop our flag, wait for our turn, then raise it again.
            wants_to_enter[thread_id] = false;
            while (turn == other_thread_id)
                ;
            wants_to_enter[thread_id] = true;
        }
    }
}

// Leaves the critical section.
//
void leave_critical_section(int thread_id) {
    turn = thread_id ^ 1;
    wants_to_enter[thread_id] = false;
}
Just for sanity purposes, I used the code above and pthreads to create a program with two threads, where each thread increments a single, shared counter up to one million. Each thread checks to see whether the counter has reached one million. If it hasn’t, the thread enters the critical section, increments the counter, and leaves. I also count how many times each thread enters the critical section.
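A minimal sketch of that kind of driver looks something like the following. Treat it as an approximation rather than the exact program I ran: the names, the re-check under the lock, and the pthread plumbing are just one reasonable way to structure it.

#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

// The Dekker lock from above, assumed to be compiled into the same program.
void init_lock(void);
void enter_critical_section(int thread_id);
void leave_critical_section(int thread_id);

#define TARGET 1000000

static volatile long shared_counter;    // Protected by the Dekker lock.
static long per_thread_counter[2];      // Increments performed by each thread.

static void *worker(void *arg) {
    int thread_id = (int)(intptr_t)arg;
    while (shared_counter < TARGET) {
        enter_critical_section(thread_id);
        if (shared_counter < TARGET) {  // Re-check under the lock.
            shared_counter++;
            per_thread_counter[thread_id]++;
        }
        leave_critical_section(thread_id);
    }
    return NULL;
}

int main(void) {
    pthread_t threads[2];
    init_lock();
    for (intptr_t i = 0; i < 2; i++)
        pthread_create(&threads[i], NULL, worker, (void *)i);
    for (int i = 0; i < 2; i++)
        pthread_join(threads[i], NULL);
    printf("The final shared counter value: %ld\n", (long)shared_counter);
    printf("Thread 0 counter value: %ld\n", per_thread_counter[0]);
    printf("Thread 1 counter value: %ld\n", per_thread_counter[1]);
    printf("Sum of per thread counters: %ld\n",
           per_thread_counter[0] + per_thread_counter[1]);
    return 0;
}

Here’s the output from one run: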
The final shared counter value: 1000000
Thread 0 counter value: 528874
Thread 1 counter value: 499873
Sum of per thread counters: 1028747
Now, the final shared counter value is correct, but the sum of the number of times that the threads enter the critical section is more than 1 million. That implies that both threads are entering the critical section at the same time, which is obviously a problem, given that Dekker’s algorithm is designed to avoid that specific behavior!
The Big Picture: Do Memory Operations Complete in the Same Order as They Are Specified In a Program?
The short answer is no. There are nice mathematical and algorithmic benefits to computers that do have this property, but they often come at the cost of real-world-visible performance.
Consider a program written in assembly language. When you look at that program and think about how a CPU executes it, your mind might go to a simple (and valid) model such as the fetch-decode-execute model or the sequential execution model. In these models, the CPU goes instruction by instruction, executing each one and then moving on to the next. It’s an ancient model, but it’s still incredibly useful because, even with all of our crazy cool CPU innovations to increase performance, we still respect it (unless there’s good reason not to). As you continued your education, hopefully you learned about pipelines and how they overlap instructions across multiple pipeline stages for a theoretical speedup of up to s, where s is the number of stages. You may have learned that certain instructions take more cycles than others (multiplication and division are common examples). And if you do graduate work in computer architecture, you’ll absolutely know that we execute instructions out of order, and the CPU only gives the illusion of the sequential execution model.
While these are compounding factors, none of them is the reason Dekker’s algorithm fails to run properly on x86. Even the plain fetch-decode-execute model can botch it up. When you considered the models in the last paragraph, you likely assumed that results are written back into registers and are immediately available for use as soon as they are written back. You likely assumed the same for a store. Where that model fails is when there are two or more entities both attempting to access the same memory.
That raises the question: when does a store actually complete? Is it when the data is ready and the same CPU can read back from the store it performed? Or is it when every CPU and/or device in the system can read from the store? Those are two very different things, and they come at very different costs.
The Life of Stores
Stores really have two purposes in life:
- To store data that will possibly be used later by the program.
- To communicate data with another thread/process/device.
Let’s consider case #1 first. A store to the stack, whether because the compiler had to spill due to a lack of registers, un-optimized code, parameter passing, or whatever else, just needs to be available by the time the program is ready to load it. From the perspective of the CPU executing the instructions, the store can be considered complete once the data is ready to be written. So long as it actually gets written such that later loads in the program get the correct value, the CPU did the correct thing.
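To make that concrete, here’s a hypothetical example of a #1 store. Nothing outside this function will ever observe the stores into scratch, so the only thing the CPU owes anyone is that the loads in the second loop see the values written by the first.

// A pure #1 store: 'scratch' lives on this function's stack, and its stores
// only ever need to be visible to the loads later in this same function.
int sum_of_squares(const int *values, int n) {
    int scratch[8];
    int total = 0;
    if (n > 8)
        n = 8;
    for (int i = 0; i < n; i++)
        scratch[i] = values[i] * values[i];  // Store: private to this function.
    for (int i = 0; i < n; i++)
        total += scratch[i];                 // Load: just needs to see the store above.
    return total;
}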
One really important caveat: all of this assumes that the thread being executed hasn’t been context switched to another core. What I just wrote is all garbage in that case, but for the sake of the blog post let’s ignore it, mostly because context switches are very, very rare relative to the rate at which a CPU executes instructions.
Now, let’s consider the #2 case, which is really the more interesting one. Say we’re sharing data between two or more threads using a flag-based synchronization technique. The idea is simple: write some data to be processed by another thread, set the flag, and go.
data = x / 42; // Instead of an int, often a pointer to a struct.
flag = 1;
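For context, the consuming side of that hand-off looks roughly like this (a sketch; process is a hypothetical function, and data and flag are the variables from the fragment above):

while (flag == 0)
    ;              // Spin until the producer says the data is ready.
process(data);     // Only correct if the store to data became visible before the store to flag.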
Now, let’s go with the semantics of store completion where a store is considered complete once the CPU that executed it can execute a later load in the program and see the store’s result. This effectively means that the store has started, but it’s not guaranteed that every other CPU and/or device in the system can see it yet. What bad things can happen?
- The stores never actually reach the other CPU, which means one thread is starved of work indefinitely. That sort of defeats the entire point of using another thread!
- The stores become visible in the wrong order. The other thread just computed with a garbage value, because it saw the flag variable set to 1 before the data variable was set to the value of x / 42.
Both of these are horribly bad.
You can “fix” them by requiring that, for every store a CPU performs, it waits until the store is visible to every other CPU and/or device in the system. This makes stores ridiculously expensive, because it takes time for whatever that store conveys to propagate through the system. Moreover, consider case #1 stores. Apart from the context-switch example, stack stores are never meant to be communicated to other threads. It’s a complete waste of time to wait for them to be seen by other cores. So that in and of itself isn’t a fix, it’s complete nonsense.
Memory Consistency Models and x86 Fences to Save The Day
The solution that industry and academia have converged on is to provide a memory consistency model that defines how loads and stores are ordered by a CPU, and to provide explicit instructions for when those orderings are too weak.
For the purposes of this discussion, the problem that occurs with Dekker’s algorithm on the x86 is two-fold:
- The x86 memory consistency model allows stores to complete before their results are visible to every other CPU and device in the system.
- The x86 memory consistency model allows loads that follow a store to complete without the store being globally visible first.
Both of these qualities are excellent for stack stores and other types of stores that do not communicate between threads. But they can run into problems with inter-thread communication! Dekker’s algorithm relies on each thread being able to see whether the other thread wants to enter the critical section before it enters itself.
To prove this to yourself, consider the case where both threads attempt to enter the critical section by setting their wants_to_enter flag. Now pretend that the other thread can’t see it (as in the x86 case). What happens? They both end up entering the critical section. Now, pretend that only one thread communicates with the other thread its intent to enter the critical section. What happens? It works, but with potential starvation of the thread that never updates its flag.
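If you want to see the reordering in isolation, the classic “store buffering” litmus test distills it down to two stores and two loads. This sketch isn’t from the code above (the names cpu_0, cpu_1, r0, and r1 are mine), but it’s the exact store-then-load pattern Dekker’s flag handshake depends on. On x86, a run can end with both r0 and r1 equal to 0:

volatile int x, y;  // Both start at 0.

void cpu_0(int *r0) {
    x = 1;
    *r0 = y;    // This load may complete before the store to x is visible to cpu_1.
}

void cpu_1(int *r1) {
    y = 1;
    *r1 = x;    // This load may complete before the store to y is visible to cpu_0.
}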
So, on an x86 machine, how can you fix this problem? You can do so with a fence instruction (often called a barrier in other ISAs). The x86 ISA provides three: MFENCE, LFENCE, and SFENCE, of which MFENCE is the one we need (yes, this is hand waving, but those instructions are a blog post in and of themselves!). The MFENCE instruction provides a variety of guarantees, but the important one for us is that the store is visible to all CPUs before the loads and stores that come after it complete (my not saying “and devices” after CPUs here is intentional).
For Dekker’s algorithm to work, thread 0 has to guarantee that thread 1 can see its flag before thread 0 checks thread 1’s flag (and vice versa, of course). The MFENCE instruction provides exactly that guarantee, among others.
Just to spell it out again, assume that both thread 0 and thread 1 are attempting to enter the critical section at the same time, and assume that thread 0’s load of thread 1’s flag completes before thread 0’s own flag store becomes visible to thread 1. The following execution of events can happen:
- Thread 0 checks to see whether thread 1’s flag is set. It isn’t.
- Thread 1 sets flag.
- Thread 1 reads thread 0’s flag. It isn’t set.
- Thread 0 sets flag.
- Both threads enter the critical section.
Of course, it can work properly as well, which is why finding these bugs can be so frustrating:
- Thread 1 sets flag.
- Thread 0 checks to see whether thread 1’s flag is set. It is.
- Thread 1 reads thread 0’s flag. It isn’t set.
- Thread 0 sets flag.
- Thread 0 waits because it isn’t its turn.
A non-C11-compliant, non-portable solution is to add __builtin_ia32_mfence() function calls to the code every time the flag is set.
#include <stdbool.h>

volatile bool wants_to_enter[2];
volatile int turn;

// Standard init stuff.
//
void init_lock(void) {
    wants_to_enter[0] = false;
    wants_to_enter[1] = false;
    turn = 0;
}

// Enters the critical section. Spins until the critical section
// is entered.
//
void enter_critical_section(int thread_id) {
    int other_thread_id = thread_id ^ 1;
    wants_to_enter[thread_id] = true;
    // Make our flag visible to the other CPU before we read its flag.
    __builtin_ia32_mfence();
    while (wants_to_enter[other_thread_id]) {
        if (turn == other_thread_id) {
            // Back off: drop our flag, wait for our turn, then raise it again.
            wants_to_enter[thread_id] = false;
            while (turn == other_thread_id)
                ;
            wants_to_enter[thread_id] = true;
            __builtin_ia32_mfence();
        }
    }
}

// Leaves the critical section.
//
void leave_critical_section(int thread_id) {
    turn = thread_id ^ 1;
    wants_to_enter[thread_id] = false;
}
And with those fixes, you’ll see the mutual exclusion working perfectly:
The counter value: 1000000
Thread 0 counter value: 497254
Thread 1 counter value: 502746
Sum of per thread counters: 1000000
Satisfying. Please do subscribe if you like this sort of content!
And a new thing for our very modern world: this content may not be reproduced in whole or in part, nor included in training data, by any AI model without the express written consent of the author.