An Atomic Counter for Guaranteed Thread Safety

John M. Dlugosz

If you've ever cursed the overhead of the various mechanisms to keep your programs thread-safe, you're going to love the technique that John will show you. He can test and modify a counter (for example, a reference count) in a single operation that can't be interrupted in a threaded program.

When it comes to multi-threaded programs, one issue looms large: race conditions. The operating system (Windows NT or Windows 95/98) provides primitives for mutual exclusion as a means to cope with these race conditions. When you set up a mutual exclusion, if one thread is executing a particular block of code, then no other thread is allowed to execute that same block. There are a variety of kernel objects for this, including the Mutex.

But Mutexes are slow. They're pure overhead and don't contribute any real meaning to the program, so making them expensive is just adding insult to injury. Adding thread safety to the reference counting in a string class of mine, for example, slowed it down by a factor of three.

Win32 provides a CRITICAL_SECTION object—it's designed specifically for these race conditions and is basically a Mutex that cheats. Without changing from user mode to kernel mode, it can check for the (hopefully common) case of the object being un-owned, and grab it. It can also check for the case of recursive locking, where the same thread acquires the object more than once. In other cases, when one thread has to wait for another, a real Mutex is used. For a program with little or no contention, this produces a dramatic increase in speed.

But this still isn't the last word in eliminating race conditions. To get any faster, we need to get away from the idea of a protected region of code altogether.

Digging deeper

Just how does the CRITICAL_SECTION do its user-mode checking, anyway? It tests a flag to see whether the object is owned. If it's not owned, it takes ownership. What if the operation was interrupted after deciding the object was un-owned, but before taking ownership? Can the CRITICAL_SECTION code cause its own race conditions? In a word, no. It modifies a flag and tests it as a single unbroken operation. This proves that there is a way to do a read/modify/write operation in a thread-safe manner. In fact, if you peruse the Win32 API, you'll find a set of functions for this purpose:

InterlockedCompareExchange
InterlockedDecrement
InterlockedExchange
InterlockedExchangeAdd
InterlockedIncrement

As an aside, I was surprised to find that the Microsoft compiler for x86 doesn't emit inline code for these. Although they could be optimized down to a single machine instruction, you instead call a small function. Furthermore, the InterlockedExchange is implemented so poorly that I had to get a second opinion to even be sure it was correct.

Why might these functions be useful? Suppose you were implementing a reference count in some object. One thread might increment the value while another thread decrements the value, or two threads might increment it at the same time. This is a classic read/modify/write operation, susceptible to serious race conditions. But instead of throwing a region of mutual exclusion around this simple operation (appalling, because the overhead outweighs the operation by a factor of a thousand—like shipping a single SIMM in a foam block the size of a boxcar), we could use the Interlocked functions.

Basically, the CPU has a primitive capability of doing an increment or decrement in a single unbroken operation, or atomically. You might suppose that a single INC DWORD PTR instruction is atomic in and of itself. After all, an interrupt can't occur between the read and the write, since interrupts are only serviced between instructions. That was true enough once upon a time, but today we have superscalar CPUs, where an instruction like this is actually compiled internally into three RISC instructions, which can then be scheduled and mixed in with other instructions! If that makes your head spin, consider a simpler problem: a machine with more than one CPU, like mine. And things other than the CPU can access main memory, such as a bus-mastering card.

In order to be truly safe, you must use a special feature of the CPU to make sure that the instruction is indeed atomic. On the x86, this is the bus lock, achieved with the LOCK prefix. EnterCriticalSection uses one of these instructions. It atomically reads the old value and writes a "taken" flag to see whether the critical section is already taken. If the old value is "not taken," the code proceeds. If instead the old value is already "taken," it has to wait, and since writing the "taken" flag didn't change anything, nothing needs to be undone.

For a reference count, a single atomic increment or decrement is all that's needed to accomplish our final goal! There's no need to go through the whole logic of EnterCriticalSection and then LeaveCriticalSection. Using atomic primitives is two to three times faster, and even more importantly for some uses, they never block. Table 1 shows some comparison times when incrementing a counter.

Table 1. Comparison times when incrementing a counter.

regular memory variable: 31.25 nanoseconds
atomic counter: 178.1 nanoseconds
critical section: 504.6 nanoseconds

Taking the next step

That kind of performance improvement is impressive. The Win32 calls, however, are sadly incomplete and inherently inefficient, not to mention too primitive for C++. We can do a lot better, with just a little work. Instead of this:

y= InterlockedExchangeAdd (&x, 1);

I'd rather write the cleaner:

y= x++;

This is just syntactic sugar; InterlockedExchangeAdd of 1 is just an atomic post-increment. But what about an atomic pre-increment? There's no such function in the Win32 arsenal, so you'd have to use the post-increment and then repeat the operation on the returned "old" value to find out what the new value was at that instant:

y = InterlockedExchangeAdd (&x, 1) + 1;

This is why I said that the functions are inherently inefficient. As it turns out, the machine primitive is a pre-increment! A similar trick is done in reverse to undo the operation on the return value, as part of the implementation of InterlockedExchangeAdd. So, to get a pre-increment, you apply two more operations that cancel each other out. Yuck. If there were both pre- and post-increment available, you could use whichever you really needed. Which one was native under the hood wouldn't matter, since in either case you do the minimum work to get the result you actually wanted.

Another shortcoming of the Interlocked suite of functions is the argument type—they only work on signed long values. Actually, the code works on any four-byte integral value, but only a long* signature is provided, so you have to cast to use int, unsigned, unsigned long, or pointers. It can't work at all on short, unsigned short, char, or unsigned char. A proper C++ solution would be fully rich in terms of available types and would be type-safe without casting. In C++, when you want to use the same code for a variety of types, with full type safety and no casting, you use templates. Once the template class is written, you can declare an integer atomic counter, a byte atomic counter, or whatever other type you wish, like this:

atomic_counter<int> c1;
atomic_counter<unsigned> c2;
atomic_counter<byte> c3;
// … etc…

The template class implements prefix and postfix increment and decrement, in addition to += and -=, and they all behave atomically. Listing 1 shows the template class definition, which you can see is pretty simple.

Listing 1. Template class definition.

template <typename T>
class atomic_counter
{
protected:
   T value;
   // >> need a critical section, too.
public:
   operator T() const volatile { return value; }
   T operator= (atomic_counter x) volatile
        { return value = x.value; }
   atomic_counter (T x) : value(x) {}
   atomic_counter() {}
   T operator++() volatile;
   T operator--() volatile;
   T operator++ (int) volatile;
   T operator-- (int) volatile;
   T operator+= (T) volatile;
   T operator-= (T) volatile;
   T exchange (T) volatile;
};

This is a fairly ordinary template, which creates a class to hold a value of the specified type and apply various operators to it. This parameterized class could be implemented using a CRITICAL_SECTION, and then it would handle any type, as long as the underlying operations existed. This would include floating point values and any user-defined types such as BCD. But that's not what I have in mind. I want the efficient atomic instructions to be used, not critical sections.

Getting specialized

The special atomic instructions only work for one-, two-, or four-byte integer types (signed or unsigned). The instructions can also be applied to pointers with a slightly different implementation. They won't work on arbitrary types, and certainly not on user-defined class types. So while the general case will use a critical section, you instruct the compiler to use an alternate implementation for particular types. This is hidden from the user. This code will use the general (critical section) implementation:

atomic_counter<double> dc;
double val= dc++;

This code will use the alternate implementation:

atomic_counter<unsigned long> Lc;
unsigned long val= Lc++;

Keeping the syntax uniform is more than just being neat and relaxing the need for the user to remember two different template names and when to use each. Keeping the syntax the same is crucial in writing other templates. Suppose I wrote the following:

template <typename T>
T foo()
{
   atomic_counter<T> count;
   // …

Here I don't know what T will be, and this template would inherit any special restrictions from the templates used to implement it. That isn't nice. In order to be able to use an atomic_counter<T> within some other template function or template class, the syntax must be uniform across all types. But that doesn't mean the implementation has to be the same!

In the simplest case, you can write an explicit specialization of a class. For example, here's an explicit specialization of atomic_counter<int>:

template<>
class atomic_counter<int>
{
protected:
   int value;
public:
   inline int operator++()
      { return InterlockedExchangeAdd (&value, 1) + 1; }
   // in reality, would need volatile and casting.
   // … etc
};

This tells the compiler that an atomic_counter<int> is to use this definition instead of generating it from the generic atomic_counter<T>. Simple enough—you supply a specialized definition for each of the types that can be handled specially. That can get rather lengthy, though. The thought of writing 10 classes, each with 10 functions, makes me cringe. That's what templates are supposed to be saving me from! Well, you can use another template to implement nine of the special cases. You still need to define all the classes, but now they're trivial:

template<>
class atomic_counter<int> : public atomic_integer_xx<int>
{
public:
   atomic_counter() {}
   atomic_counter (int x) : atomic_integer_xx<int>(x) {}
   int operator= (atomic_counter x) volatile
        { return value = x.value; }
};

The real work is done by the helper template atomic_integer_xx, and then you declare the specialization to inherit the implementation generated by the helper template. You need only supply the constructors and assignment operator. The key to writing the helper template is uniformity. If all the different flavors look the same, then the template can generate them from a generic description. The key to making all of them look the same is to isolate the differences in small functions and allow overloading to choose the correct function. For example, the implementation of prefix increment is as follows, where internal::Xadd is an overloaded function that handles one-, two-, or four-byte values with the appropriate atomic instructions (Xadd returns the old value, so adding 1 yields the new value that prefix increment must return):

   T operator++() volatile
      { return 1 + internal::Xadd (&value, 1); }

The beauty of this approach is that all the functions except exchange can be implemented in terms of Xadd. So the template handles all the common operators and all the available types, and the real implementation boils down to three primitives to atomically add bytes, shorts, and longs.

Atomic CPU instructions

You need assembly language to implement the three forms of Xadd. The supplied InterlockedExchangeAdd only handles the four-byte case, and no supplied function exists for the other two sizes. But the function is quite trivial to write:

__declspec(naked) int __fastcall Xadd (volatile int*, int)
 {
 __asm {
    lock xadd dword ptr [ECX], EDX
    mov EAX, EDX
    ret
    }
 }

This inline assembly emits the instruction and returns the proper result. The __fastcall keyword tells the compiler to pass the arguments in registers instead of on the stack; they arrive in ECX and EDX, where the lock xadd instruction can use them immediately. The result is returned by moving it into the EAX register. The __declspec(naked) tells the Microsoft compiler not to bother generating the normal entry and exit code for the function—there's no need for it. The byte and 16-bit forms are almost identical.

volatile values

You might have noticed that all the members of atomic_counter are declared volatile, as are various arguments within the implementation. This is necessary in order to allow the user to declare volatile atomic counters. In C++, volatile is a modifier similar to const. While const says that the value won't be changed by the code, volatile says that the value might be changed by some process unknown to the code. In particular, any time a volatile value is used, it must actually be read from memory. The compiler isn't allowed to optimize it out by keeping it in a register, or to assume that just because it hasn't been touched by code in this block, it still holds the same value.

Consider a loop like this:

atomic_counter<int> count= 0;
// create some threads …
// use count …
// later,
while (count > 0) Sleep(1000);

The idea here is that this code waits for other threads to finish their work, which decrements count. When the count hits zero, this code can proceed. When this code is optimized, you get an infinite loop, because the compiler sees that count is never changed in the body of the loop and assumes it never changes. The volatile keyword tells the compiler that count could indeed change by other means; in this case, by other threads.

volatile atomic_counter<int> count= 0;

I got into all this so that atomic counters could be manipulated simultaneously on multiple threads; they should be volatile in general. If a counter is declared as volatile, you couldn't call operator++ on it unless operator++ were a volatile function. (It's the same as with const, which you should be more familiar with.) If I declare the variable as volatile, I can only pass a pointer or reference to it to a function that understands that it's volatile and respects it as such. I couldn't pass it to a function that then caches it in a register—that would go against the standing orders about this variable. So, the need for the volatile keyword propagates all the way down into the most primitive implementation functions.

The use of volatile also requires me to supply an assignment operator, which at first glance would seem to be no different from what the compiler would do for me anyway. But the compiler-generated copy assignment operator isn't volatile, and if I want to assign a value to a volatile variable, I need to specify otherwise.

In practice

The bottom line is that atomic counters are easy to use and satisfy a specialized but somewhat common need more efficiently than a more general-purpose mutual exclusion mechanism. The class is more than twice as fast as a CRITICAL_SECTION, though still much slower than an ordinary non-thread-safe counter. However, it's as good as we can get on the hardware. Besides the speed, it offers two more advantages. It doesn't need to be initialized like a CRITICAL_SECTION, so initialization order issues are avoided. Second, it fits in the same place as a primitive value—an atomic int is four bytes, no larger than a normal int. This means that atomic arithmetic can be performed on structures that have already been defined to hold integer values at specific offsets, and there aren't large CRITICAL_SECTION objects that outweigh the object being protected. The atomic_counter template is a powerful tool in my arsenal of thread-safe code.

Availability: Download sample code here.

John is a Senior Software Engineer with Kodak Health Imaging in Dallas. In addition to programming and writing, he enjoys computer art, photography, and science fiction. john@dlugosz.com, www.dlugosz.com.