P99 futexes: non-blocking integer valued condition variables

Some time ago I wrote about Linux futexes as a really nice concept for a control data structure that goes beyond the ones we learn or teach in school (mutex, semaphore, condition variable…). I have now gone one step further and integrated futexes into P99: on Linux this uses the corresponding Linux feature under the hood; on other platforms a C11 thread implementation based on mutexes and condition variables can be used.

One of the real disadvantages of most control structures is that they have two very different kinds of events: user events (e.g. a call to cnd_signal) and system events, often called “spurious wakeups”. Unless we program system code, these spurious wakeups are just an annoyance. They are easily forgotten during development and lead to subtle bugs that only appear under heavy load or when changing the platform, and handling them often makes the user code overly complex.

p99_futex is designed to work around this type of problem, while still providing a close integration of the control structure into the system and efficiently distinguishing a “fast path” for operations from a “slow path” where we handle congestion. It provides a counter, similar to a condition variable, that can be incremented atomically and waited for, just as the Linux system call does. (Except that, for ideological reasons, the base type is unsigned instead of int as in Linux.)



Simple C11 atomics: atomic_flag

C11 (and similarly C++11) has a new primitive type in its new toolbox for atomics: atomic_flag. As you’d perhaps imagine from the name, a variable of that type has two different states, called “cleared” and “set”, and access to it is atomic. It is even guaranteed to be “lock-free”, see below.

Sounds like atomic_flag is the same as _Atomic(_Bool), doesn’t it? Not quite. The operations allowed on it are even more restricted than for a _Bool, but in exchange it might be more efficient.
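To make the restricted interface a bit more concrete, here is a minimal sketch of a spinlock built only from the two basic operations that C11 provides for atomic_flag, namely atomic_flag_test_and_set and atomic_flag_clear. The names spin_try_lock, spin_lock and spin_unlock are hypothetical and chosen just for this illustration.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical names, for illustration only. */
static atomic_flag busy = ATOMIC_FLAG_INIT;   /* starts in the "cleared" state */

/* Returns true if the flag was "cleared" before and is now ours. */
static bool spin_try_lock(atomic_flag *f) {
  return !atomic_flag_test_and_set(f);
}

/* Busy-wait until we are the one who changes the flag from cleared to set. */
static void spin_lock(atomic_flag *f) {
  while (atomic_flag_test_and_set(f)) {
    /* spin */
  }
}

/* Put the flag back into the "cleared" state. */
static void spin_unlock(atomic_flag *f) {
  atomic_flag_clear(f);
}
```

Note that there is no operation that just reads the state without setting it; this is one of the restrictions compared to _Atomic(_Bool).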


linux futexes: non-blocking integer valued condition variables

POSIX’ condition variables pthread_cond_t unfortunately have some drawbacks:

  • They require the use of at least two separate other objects that are only loosely coupled with the condition variable itself:

    • A plain variable, say X of type int, that actually holds the value on which the condition depends.
    • A mutex that is just to regulate the access to X and to the condition variable.
  • Generally they are not lock free. To access X we have to

    • lock the mutex
    • inspect X
    • possibly wait for the condition to become true
    • do changes
    • possibly signal other waiters about the changes we made
    • unlock the mutex

Linux’ concept of futexes allows us to associate an equivalent concept directly with the variable X and, in most cases, to access this variable without taking any lock.
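As a rough sketch of that idea, the following Linux-only code wraps the raw futex system call. glibc exports no wrapper for it, so futex_wait and futex_wake are hypothetical helper names, as are counter_post and counter_wait_nonzero. The cast from atomic_int to int pointer assumes that both have the same representation, which holds on common Linux platforms but is not guaranteed by the C standard.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <limits.h>
#include <linux/futex.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical wrappers around the raw system call. */
static long futex_wait(atomic_int *addr, int expected) {
  /* Sleeps only if *addr still equals expected, else fails with EAGAIN. */
  return syscall(SYS_futex, (int *)addr, FUTEX_WAIT, expected, NULL, NULL, 0);
}

static long futex_wake(atomic_int *addr, int count) {
  /* Returns the number of waiters that have been woken up. */
  return syscall(SYS_futex, (int *)addr, FUTEX_WAKE, count, NULL, NULL, 0);
}

/* The change to X itself is a plain atomic increment, no mutex involved.
   This sketch always issues the wake call; a tuned version would keep
   track of waiters and skip the system call when there are none. */
static void counter_post(atomic_int *x) {
  atomic_fetch_add(x, 1);
  futex_wake(x, INT_MAX);
}

/* Block until X is non-zero. The loop re-checks X because futex_wait may
   return for other reasons (value changed, signal delivered). */
static void counter_wait_nonzero(atomic_int *x) {
  while (atomic_load(x) == 0) {
    futex_wait(x, 0);
  }
}
```

The value test and the decision to sleep are done atomically by the kernel, which is what replaces the separate mutex of the condition variable approach.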

Why sem_t is not suited as an atomic counter

Before C11, C itself has no primitive for atomic access to variables. Whenever we want to deal with some sort of counter in a threaded environment (e.g. for reference counting) we thus have to be careful not to erase data that other threads might have written or might just be in the course of reading.

sem_t as a counter

At first glance, sem_t seems to fit that need. It has an increment function sem_post, a decrement sem_trywait and a value test with sem_getvalue. If we are careful about interrupt handling (see this post) this should give us access to a “good” counting device, one might think.

First, if we wanted to do this, an implementation of a reference counter would be relatively tedious. sem_t is meant to be waited for if the counter is 0, but here we would need it the other way round: we would need to wait until the counter falls to 0. Nevertheless this is doable, not with one semaphore but with several.

Side effects

What is more annoying is another defect of sem_t, due to its ill-conceived API: its system calls tend to have side effects. The sem_trywait call can be used to decrement a counter without waiting, and in particular to know once the counter has reached 0. If the counter has been decremented, so it was not 0 before, sem_trywait returns 0, otherwise it returns -1.

But the latter case has a side effect: errno is set to an error code. So in that case, the caller not only has to do what it has to do but also has to do repair work, namely reset errno to 0. This may sound harmless, but it isn’t: errno is not a simple variable but an

expression which expands to a modifiable lvalue that has type int

This must be so, since errno is guaranteed to be different per thread: one thread’s errno must not influence that of another thread. Before C11, C has no concept of thread-local variables, therefore the system usually has to deploy quite some machinery to achieve a suitable mechanism.
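The repair work described above could look like the following sketch. The helper name sem_try_decrement is hypothetical. Instead of resetting errno to 0 it restores the previous value, which is a slight variation of what the text describes, and it deliberately treats every failure as “counter was 0”, so other errors such as EINVAL are masked.

```c
#include <errno.h>
#include <semaphore.h>
#include <stdbool.h>

/* Hypothetical helper: decrement the counter if it is not 0.
   Returns true if the decrement took place. */
static bool sem_try_decrement(sem_t *sem) {
  int saved = errno;              /* first access to the per-thread errno */
  if (sem_trywait(sem) == 0)
    return true;
  /* Assumption: the failure just means that the counter was 0 (EAGAIN).
     Undo the side effect so the caller's errno stays untouched. */
  errno = saved;                  /* second access, only needed for repair */
  return false;
}
```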

So generally the use of sem_t might turn out to be quite inefficient for such a purpose.


Had the API for the function gone a different way, namely by returning the precise error code instead of -1, this would be much simpler to handle. No access to pseudo-global variables would be necessary, no side effects would take place.

This strategy has been chosen for the other POSIX lock primitives such as pthread_mutex_t or pthread_rwlock_t.
Both are suitable for the purpose. An implementation with one pthread_mutex_t, one pthread_cond_t and an unsigned should be relatively obvious.
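For illustration, here is one way such an implementation might look. It is only a sketch: the type refcount and the functions refcount_acquire, refcount_release and refcount_wait_zero are hypothetical names, and the return values of the pthread calls are ignored for brevity.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical reference counter: one mutex, one condition, one unsigned. */
typedef struct {
  pthread_mutex_t mtx;
  pthread_cond_t  zero;    /* signaled when count drops to 0 */
  unsigned        count;
} refcount;

#define REFCOUNT_INITIALIZER \
  { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0 }

static void refcount_acquire(refcount *rc) {
  pthread_mutex_lock(&rc->mtx);
  ++rc->count;
  pthread_mutex_unlock(&rc->mtx);
}

/* Returns the number of references that remain. */
static unsigned refcount_release(refcount *rc) {
  pthread_mutex_lock(&rc->mtx);
  assert(rc->count > 0);
  unsigned remaining = --rc->count;
  if (!remaining) pthread_cond_broadcast(&rc->zero);
  pthread_mutex_unlock(&rc->mtx);
  return remaining;
}

/* Block until the counter has fallen to 0. The loop also takes care of
   spurious wakeups of pthread_cond_wait. */
static void refcount_wait_zero(refcount *rc) {
  pthread_mutex_lock(&rc->mtx);
  while (rc->count) pthread_cond_wait(&rc->zero, &rc->mtx);
  pthread_mutex_unlock(&rc->mtx);
}
```

None of these functions touches errno: the pthread functions report errors through their return value.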

pthread_rwlock_t is even directly suitable as a locking structure for resource control. Just have all users of a resource issue a pthread_rwlock_rdlock when they start to use it and a pthread_rwlock_unlock when they cease to do so. Whether the resource is unused may then simply be tested by a call to pthread_rwlock_wrlock, which only returns once no user holds the lock anymore.

Scope Bound Resource Management with for Scopes

Resource management can be tedious in C. E.g. to protect a critical block from simultaneous execution in a threaded environment you’d have to place a lock / unlock pair before and after that block:

pthread_mutex_t guard = PTHREAD_MUTEX_INITIALIZER;

pthread_mutex_lock(&guard);
// critical block comes here
pthread_mutex_unlock(&guard);

This is very error prone since you have to provide such calls every time you have such a block. If the block is longer than a few lines it is difficult to keep track of that, since the lock / unlock calls are spread on the same level as the other code.

Within C99 (and equally in C++, BTW) it is possible to extend the language to some extent, such that this structure becomes more visible and your lock / unlock calls are guaranteed to match. Below, we will give an example of a macro that will help us to write something like

P99_PROTECTED_BLOCK(pthread_mutex_lock(&guard),
    pthread_mutex_unlock(&guard)) {
       // critical block comes here
}

If we want to make this even a bit more comfortable for cases where we still need to know the mutex variable, we may have something like:

GUARDED_BLOCK(guard) {
       // critical block comes here
}

The macro P99_PROTECTED_BLOCK can be defined as follows:

#define P99_PROTECTED_BLOCK(BEFORE, AFTER)                         \
for (int _one1_ = 1;                                               \
     /* be sure to execute BEFORE only at the first evaluation */  \
     (_one1_ ? ((void)(BEFORE), _one1_) : _one1_);                 \
     /* run AFTER exactly once */                                  \
     ((void)(AFTER), _one1_ = 0))                                  \
  /* Ensure that a `break' will still execute AFTER */             \
  for (; _one1_; _one1_ = 0)

As you may see, this uses two for statements. The first defines an auxiliary variable _one1_ that is used to control that the dependent code is only executed exactly once. The arguments BEFORE and AFTER are then placed such that they will be executed before and after the dependent code, respectively.

The second for is just there to make sure that AFTER is executed even when the dependent code executes a break statement. A continue behaves like a break here: it leaves the block and still runs AFTER, but it does not continue an enclosing loop. For other premature exits such as return, exit or goto there is unfortunately no such cure. When programming the dependent statement we have to be careful about these, but this problem is just the same as in the “plain” C version.
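The following small test program sketches the behavior for break. The macro is repeated (without its comments) so that the snippet compiles on its own; the helpers enter, leave and run_block and the two counters are hypothetical names that only serve to observe what is executed.

```c
#define P99_PROTECTED_BLOCK(BEFORE, AFTER)                 \
for (int _one1_ = 1;                                       \
     (_one1_ ? ((void)(BEFORE), _one1_) : _one1_);         \
     ((void)(AFTER), _one1_ = 0))                          \
  for (; _one1_; _one1_ = 0)

/* Hypothetical instrumentation to count BEFORE and AFTER executions. */
static int entered = 0;
static int exited  = 0;
static void enter(void) { ++entered; }
static void leave(void) { ++exited; }

/* Returns 1 if the end of the protected block was reached, 0 if the
   block was left early through the break. */
static int run_block(int stop_early) {
  int reached_end = 0;
  P99_PROTECTED_BLOCK(enter(), leave()) {
    if (stop_early) break;   /* leaves the inner for, AFTER still runs */
    reached_end = 1;
  }
  return reached_end;
}
```

In both cases, with and without the early break, enter and leave are each called exactly once per block.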

Generally there is no run time performance cost for using such a macro. Any decent compiler will detect that the dependent code is executed exactly once, and thus optimize out all the control that has to do with our variable _one1_.

The GUARDED_BLOCK macro could now be realized as:

#define GUARDED_BLOCK(NAME)        \
P99_PROTECTED_BLOCK(               \
    pthread_mutex_lock(&(NAME)),   \
    pthread_mutex_unlock(&(NAME)))

Now, to have more specific control over the mutex variable we may use the following:

#define P99_GUARDED_BLOCK(T, NAME, INITIAL, BEFORE, AFTER)           \
for (int _one1_ = 1; _one1_; _one1_ = 0)                             \
  for (T NAME = (INITIAL);                                           \
       /* be sure to execute BEFORE only at the first evaluation */  \
       (_one1_ ? ((void)(BEFORE), _one1_) : _one1_);                 \
       /* run AFTER exactly once */                                  \
       ((void)(AFTER), _one1_ = 0))                                  \
    /* Ensure that a `break' will still execute AFTER */             \
    for (; _one1_; _one1_ = 0)

This is a bit more complex than the previous one because in addition it declares a local variable NAME of type T and initializes it.
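A usage sketch for this variant might look as follows. To keep the snippet self-contained, the three nested for statements are repeated under the hypothetical name GUARDED_BLOCK_VAR; guard, shared_counter and increment_guarded are likewise just names for this illustration.

```c
#include <pthread.h>

/* Same construction as above, repeated under a hypothetical name. */
#define GUARDED_BLOCK_VAR(T, NAME, INITIAL, BEFORE, AFTER)   \
for (int _one1_ = 1; _one1_; _one1_ = 0)                     \
  for (T NAME = (INITIAL);                                   \
       (_one1_ ? ((void)(BEFORE), _one1_) : _one1_);         \
       ((void)(AFTER), _one1_ = 0))                          \
    for (; _one1_; _one1_ = 0)

static pthread_mutex_t guard = PTHREAD_MUTEX_INITIALIZER;
static unsigned shared_counter = 0;

static void increment_guarded(void) {
  /* Declares a local variable mut that is only visible inside the block. */
  GUARDED_BLOCK_VAR(pthread_mutex_t *, mut, &guard,
                    pthread_mutex_lock(mut),
                    pthread_mutex_unlock(mut)) {
    /* mut could be passed on here, e.g. to pthread_cond_wait */
    ++shared_counter;
  }
}
```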

Unfortunately, the use of static for the declaration of a for-scope variable is not allowed by the standard. To implement a simple macro for a critical section in programs that would not depend on any argument, we have to do a bit more than this.

Other block macros that can be implemented with such a technique:

  • pre- and postconditions
  • make sure that some dynamic initialization of a static variable is performed exactly once
  • code instrumentation

P99 now has a lot of examples that use this feature.

how to make sem_t interrupt safe

The POSIX semaphore wait calls

int sem_wait(sem_t *sem);
int sem_trywait(sem_t *sem);
int sem_timedwait(sem_t *sem, const struct timespec *abs_timeout);

can be interrupted at any time, e.g. by I/O that is delivered or when a child process terminates. In such a case errno is set to EINTR:

EINTR The call was interrupted by a signal handler; see signal(7).

A signal handler always interrupts a blocked call to one of these functions, regardless of the use
of the sigaction(2) SA_RESTART flag.

(Side note, sem_post is always atomic in that sense and will never return EINTR.)

So even if we don’t use sem_t in a signal handler context, the whole semaphore approach becomes error prone unless we check for EINTR on return from the wait. In particular there are good chances that with small test cases during development everything runs fine, but that then during production byzantine errors occur once in a while that will be very hard to track.

To test for EINTR systematically it is easy to write wrappers that just re-issue the corresponding wait whenever the return indicates that the call was interrupted.

#include <errno.h>
#include <semaphore.h>
#include <time.h>

/* Re-issue the wait as long as it was merely interrupted by a signal. */
static inline
int sem_wait_nointr(sem_t *sem) {
  while (sem_wait(sem))
    if (errno == EINTR) errno = 0;
    else return -1;
  return 0;
}

static inline
int sem_trywait_nointr(sem_t *sem) {
  while (sem_trywait(sem))
    if (errno == EINTR) errno = 0;
    else return -1;
  return 0;
}

static inline
int sem_timedwait_nointr(sem_t *sem, const struct timespec *abs_timeout) {
  while (sem_timedwait(sem, abs_timeout))
    if (errno == EINTR) errno = 0;
    else return -1;
  return 0;
}

Since they use the inline keyword, the variants given here are only suitable for C99, but it should be easy for you to adapt them for other contexts. The inline here has the advantage that the call can be inlined efficiently at the caller side, basically resulting in the call of the real POSIX function in question plus some conditional jumps.

BTW, the evaluation of errno here has some thread magic. The POSIX standard guarantees that it must behave as if it were a local variable for each thread, so we don’t have to worry about concurrent access to it.

Semaphores and Mutexes aren’t so similar

A common misconception I read about the two POSIX types sem_t and pthread_mutex_t is that they’d be similar, almost interchangeable it seems. They have indeed some things in common:

  • both are control structures that can be placed in memory that is shared between threads (or processes, we will concentrate on threads, here)
  • both can be used to block a thread to wait for another one
  • both have three different variants for such a wait function
    1. Unconditional, blocking wait, sem_wait and pthread_mutex_lock.
    2. Non-blocking wait, sem_trywait and pthread_mutex_trylock.
    3. Time dependent wait, sem_timedwait and pthread_mutex_timedlock.

But in other aspects they are quite different. The main difference that I see, already on a conceptual level, is the following:

  • pthread_mutex_t is thread oriented. If a mutex is held by a particular thread it can only be unlocked by that thread.
  • sem_t is token oriented. If a thread is blocked on a semaphore, any active thread may post on the semaphore.
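The token-oriented side can be sketched as follows: one thread blocks in sem_wait, and a different thread that never waited on the semaphore releases it with sem_post. The names token, poster and wait_for_other_thread are hypothetical. With a mutex the corresponding hand-over is not permitted, since only the thread that locked it may unlock it.

```c
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>

static sem_t token;

/* Runs in a second thread and posts on a semaphore it never waited for. */
static void *poster(void *arg) {
  (void)arg;
  sem_post(&token);
  return NULL;
}

/* Returns 0 if the calling thread was released by the other thread's post. */
static int wait_for_other_thread(void) {
  pthread_t t;
  if (sem_init(&token, 0, 0)) return -1;   /* counter starts at 0 */
  if (pthread_create(&t, NULL, poster, NULL)) {
    sem_destroy(&token);
    return -1;
  }
  int ret;
  /* Blocks until poster has posted; retry if a signal interrupts the wait. */
  while ((ret = sem_wait(&token)) && errno == EINTR) {
    /* retry */
  }
  pthread_join(t, NULL);
  sem_destroy(&token);
  return ret;
}
```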

This already holds certain surprises for the unaware, and each of these concepts has its own set of pitfalls that have to be taken care of.

But then, the special specification for the mutex and semaphore concepts as they are given by POSIX adds a particular oddity that makes their use a bit tricky for starters:

  • The semaphore calls can be interrupted at any time, e.g. by I/O that is delivered or when a child process terminates. In such a case errno is set to EINTR.
  • A thread A that is blocked in pthread_mutex_lock can receive as many signals as it likes. It will only return to the application once the thread B that holds the lock releases it. If B returns, crashes or idles for whatever reason without unlocking, A will stay blocked.

The reason for that is just that sem_t is designed to be used in signal handlers, so it must be sensitive to the fact that an interrupt arrived. pthread_mutex_t is better not used in interrupt handlers, unless you really know what you are doing.

In subsequent posts, I will try to show the particular advantages and pitfalls of the two different interfaces.