On one of the C programming subreddits today I saw that some people were under the impression that in order to build a mutex in C, atomic operations must be used. There’s a performance argument to be made that atomic operations should be used for this kind of work, but functionally this is flat out wrong. I’ll show you why, with an example of a mutex (defined later) that protects a critical section.
Consider Dekker’s algorithm, an algorithm for mutual exclusion between two concurrent entities (processes, threads, a signal handler and its corresponding process, exceedingly nerdy human beings who never learned to take turns, etc.). A critical section is a section of code that only one entity may execute at a time; mutual exclusion means that only one of the two entities can be inside the critical section at any given moment. A mutex is a concurrent programming construct that implements mutual exclusion.
Here’s the code (written for an x86/x64 machine, important later!):
#include <stdbool.h>
volatile bool wants_to_enter[2];
volatile int turn;
// Standard init stuff.
//
void init_lock(void) {
    wants_to_enter[0] = false;
    wants_to_enter[1] = false;
    turn = 0;
}
// Enters the critical section. Spins until the critical section
// is entered.
//
void enter_critical_section(int thread_id) {
    int other_thread_id = thread_id ^ 1;
    wants_to_enter[thread_id] = true;
    asm volatile("mfence" ::: "memory"); // x86/x64 isn't sequentially consistent.
    while (wants_to_enter[other_thread_id]) {
        if (turn == other_thread_id) {
            wants_to_enter[thread_id] = false;
            while (turn == other_thread_id);
            wants_to_enter[thread_id] = true;
            asm volatile("mfence" ::: "memory");
        }
    }
}
// Leaves the critical section.
//
void leave_critical_section(int thread_id) {
    turn = thread_id ^ 1;
    wants_to_enter[thread_id] = false;
}
Understanding Dekker’s Algorithm
In Dekker’s algorithm there are three pieces of state that each thread has available to it:
- Whether the thread wants to enter the critical section (or has already entered it!).
- Whether the other thread wants to enter the critical section.
- Whose ‘turn’ it is, in the case that a tie breaker is required.
Start by considering the case where only one thread enters and leaves the critical section. Assume that turn starts out as 0. When thread 0 enters the critical section it sets its wants_to_enter flag to true. Then, it checks the other thread’s flag. If the other flag is clear, the other thread has not attempted to enter, so it’s OK to enter the critical section.
Now, let’s say that thread 1 comes along and attempts to enter the critical section. It sees that the other thread is in the critical section. It remains looping in that while() loop until the other thread leaves the critical section, providing the aforementioned mutual exclusion. Once that thread leaves the critical section, the other thread enters it.
Now, for the real beauty of it! What happens when both threads want to enter the critical section at the same time? Both threads set their flags, both threads enter the while loop, but only the thread whose turn it is leaves the loop. The other thread backs off by clearing its wants_to_enter flag, allowing the winning thread to enter the section. It then spins on the turn variable, waiting for the other thread to signal its completion.
Take a few moments to convince yourself it really works. That’s where the intuition is built for these sort of things.
More Concurrent Topics: Volatility in C, Deadlock, and Starvation
The volatile keyword has been blogged to death, but just to provide another correct definition among the vast sea of incorrect ones: when a variable is qualified as volatile in C, it means that the compiler may not assume that the value it last wrote is still the value when it reads it again. In practice, this does two really important things for us:
- It ensures that the flag variables are actually written to memory when updated. C compilers will often try to remove so-called ‘silent stores’, or stores that the compiler believes will never be read by that program.
- It ensures the turn variable is actually read from memory. A good compiler might stick the turn variable in a register otherwise and never read its value from memory or write to memory, both of which break the algorithm.
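As a quick illustration of the second point, here’s a sketch using a signal handler and its process (one of the entity pairs from earlier; the names here are mine). Without the volatile qualifier, an optimizing compiler could legally load done once into a register, hoist the load out of the loop, and spin forever.

```c
#include <assert.h>
#include <signal.h>
#include <unistd.h>

// volatile: the compiler must re-read this from memory on every loop
// iteration, and the store in the handler must actually reach memory.
static volatile sig_atomic_t done;

static void on_alarm(int sig) {
    (void)sig;
    done = 1;
}

// Spins until the SIGALRM handler sets the flag.
void wait_for_alarm(void) {
    signal(SIGALRM, on_alarm);
    alarm(1); // Deliver SIGALRM in one second.
    while (!done)
        ;
}
```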
This algorithm is deadlock free. Deadlock would occur if both threads attempted to acquire the lock at the same time and neither could ever enter the critical section. If you want to see the deadlock for yourself, remove the turn tie-breaker and the line inside the while loop that clears the flag: with both flags stuck at true, each thread spins on the other’s flag forever.
Further, this algorithm does not starve either thread of the critical section, which means that each thread will get to enter it within some finite amount of time. Because the leaving thread hands turn to the other thread on release, if both threads immediately attempt to enter again, the thread that won the last race doesn’t get to win again. Lack of starvation is incredibly important for parallel system performance, especially when you’re counting on near-linear speedup as threads are added.
And About That Claim That This Only Works on x86/x64 Systems…
It works on other systems too, if you code with care and know the pitfalls.
Modern memory systems are incredibly complex beasts and they like to make many optimizations, including performing memory accesses out of the order defined by the program. In some cases this leads to a faster computer, but often at the expense of the programmer, who has to reason about how the underlying hardware works. Briefly, the guarantees about how much reordering a memory system may perform are called its memory consistency model. x86-based machines are generally modeled as implementing a “TSO” (total store order) model, which behaves mostly as you would intuitively expect, with one notable exception: a later load may be reordered before an earlier store to a different address. That is exactly the pattern in the code above (store your own flag, then load the other thread’s flag), which is why it requires a memory fence (sometimes called a memory barrier) to execute correctly.
On ARM-based systems, no such intuitive guarantees hold without direct intervention. On an ARM system, it would be entirely legal for the processor to perform the other thread’s flag load (the while loop condition) before its own flag store becomes visible to the other thread. You’ll note that if both threads perform that read before their flag writes are seen, the algorithm is broken.
And that’s not the only problem with it. But getting into all of that is a blog post in itself!
Much more on this later.
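As a preview of the portable route, here’s a hedged sketch of the same algorithm written with C11 atomics (my own translation, not code from the original post). Sequentially consistent operations, the default for the atomic_* functions, forbid the store/load reordering described above on any conforming implementation, so the hand-placed fences disappear.

```c
#include <stdatomic.h>
#include <stdbool.h>

// Dekker's algorithm with C11 atomics. The default memory order,
// memory_order_seq_cst, provides the ordering the mfence supplied
// on x86/x64, but portably.
static atomic_bool c11_wants_to_enter[2];
static atomic_int c11_turn;

void c11_enter(int thread_id) {
    int other_thread_id = thread_id ^ 1;
    atomic_store(&c11_wants_to_enter[thread_id], true);
    while (atomic_load(&c11_wants_to_enter[other_thread_id])) {
        if (atomic_load(&c11_turn) == other_thread_id) {
            atomic_store(&c11_wants_to_enter[thread_id], false);
            while (atomic_load(&c11_turn) == other_thread_id)
                ;
            atomic_store(&c11_wants_to_enter[thread_id], true);
        }
    }
}

void c11_leave(int thread_id) {
    atomic_store(&c11_turn, thread_id ^ 1);
    atomic_store(&c11_wants_to_enter[thread_id], false);
}
```

The spin loops are still there; only the memory-ordering obligations have moved from hand-placed fences into the atomics.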
One Last Quick Thing
It should go without saying, but unless you have a really good reason, never write your own code for this sort of thing. It’s nasty and buggy and usually performs less well than industry standard solutions.