Chapter 12. Optimize Concurrency

It’s tough to make predictions, especially about the future.

Yogi Berra (1925–2015), baseball legend and inadvertent humorist

This quip appeared in English-language physics and economics journals prior to becoming a “Yogi-ism.” It is also attributed as a Danish proverb. It seems unlikely, though, that Berra misappropriated it from any of these sources.

All but the smallest modern computers process multiple execution streams concurrently. They contain multiple CPU cores, graphics processors with hundreds of simple cores, audio processors, disk controllers, network cards, and even keyboards with separate computing power and memory. Like it or not, developers live in a concurrent world, and must understand how to program concurrent activities.

Software practices for concurrency evolved in a world of single-core processors. Since 2005, the advent of multicore microprocessors providing true (versus time-sliced) concurrency has altered the development landscape, informing best practices with new rules. These rules may be unfamiliar even to developers experienced with concurrency issues that arise in single-processor systems.

If the future direction of processor development leads to commercial devices with dozens or hundreds of cores, programming best practices will continue to change. Several competing tools promoting fine-grained concurrency have arrived that anticipate this future. However, general-purpose hardware with many cores has yet to become mainstream,1 standards of practice have not yet solidified in the developer community, and clear leaders among fine-grained concurrency solutions have yet to emerge. This future is by no means certain. It is with some sadness that I leave this very interesting topic for others to cover.

Concurrency can be provided to programs by a variety of mechanisms. Some of them live outside of C++, in the operating system or on the hardware. C++ code appears to run normally as a single program or a group of programs communicating via I/O. Still, these approaches to concurrency have some bearing on the design of C++ programs.

The C++ standard library directly supports a shared-memory thread model of concurrency. Many developers are more familiar with the concurrency features of their operating system, or with the C-language calls of the POSIX Threads (pthreads) library. It is also my experience that developers are far less familiar with the concurrency features of C++, and concurrency in general, than they are with other areas of C++ programming. For this reason, I provide more extensive notes on how these C++ features may be used than I do for other features covered in this book.

Concurrency support in the C++ standard library is a work in progress. While the standard has made tremendous strides in providing foundational concepts and capabilities, many capabilities are still working their way through standardization and will not arrive until C++17 or later.

This chapter discusses several techniques to improve performance of thread-based concurrent programs. The discussion assumes that readers are already basically familiar with thread-level concurrency and synchronization primitives, and are looking for ways to optimize their threaded programs. Providing “basic familiarity” with thread-level concurrency would be the topic of yet another book.

Concurrency Refresher

Concurrency is the simultaneous (or apparently simultaneous) execution of multiple threads of control. The goal of concurrency is not a reduction in the number of instructions executed or data words accessed per se. It is rather an increase in utilization of computing resources that reduces run time as measured on a wall clock.

Concurrency improves performance by permitting some program activities to move forward while other activities wait for an event to occur or a resource to become available. This allows computing resources to be utilized more of the time. The more activities that proceed concurrently, the more resources are used and the more activities wait for events and resources, in a positive feedback loop that increases overall utilization of computing and I/O resources up to some saturation point, hopefully near 100% utilization. The result is a reduction of wall-clock time versus what would happen if each task was performed to completion before the next task started, with the computer idling while awaiting events.

From the standpoint of optimization, the challenge of concurrency is to find enough independent tasks to fully utilize available computing resources, even if some tasks must wait on external events or availability of resources.

C++ offers a modest library for shared-memory thread-based concurrency. This is by no means the only way that C++ programs can implement a system of cooperating programs. Other kinds of concurrency also have an impact on C++ programs, and are thus worth at least a brief look.

This section examines the C++ memory model and the basic toolkit for using shared memory in multithreaded programs. In my opinion, it’s the gnarliest topic in C++. It’s gnarly because our sad little monkey brains are single-threaded. They can reason naturally about just one cause-and-effect stream at a time.

A Walk Through the Concurrency Zoo

Concurrency can be provided to programs by the computer hardware, by operating systems, by function libraries, and by the features of C++ itself. This section describes a number of concurrency features and their impact on C++.

The fact that certain concurrency features are built into C++ while others are provided by library code or the operating system must not be read as an endorsement of one concurrency model over others. Some facilities are built in because they have to be. There is no other way to provide them. The most notable forms of concurrency include:

Time slicing

This is a function of the scheduler in the operating system. In time slicing, an operating system maintains a list of currently executing programs and system tasks, and allocates chunks of time to each program. Any time a program is waiting for an event or resource, the program is taken off the operating system’s runnable list, making its share of the processor available for the other programs.

An operating system is processor- and hardware-dependent. It makes use of a timer and periodic interrupts to mediate the scheduling of processes. C++ remains unaware that it is being time sliced.

Virtualization

In one typical kind of virtualization, a lightweight operating system called a hypervisor allocates chunks of processor time to a guest virtual machine. The guest virtual machine (VM) contains a filesystem image and a memory image, which is typically an operating system running one or more programs. When a hypervisor runs a guest virtual machine, certain processor instructions and accesses to certain regions of memory cause a trap into the hypervisor, allowing the hypervisor to emulate I/O devices and other hardware resources. In another type of virtualization, a conventional operating system hosts the guest virtual machines, using the operating system’s I/O facilities to more efficiently emulate I/O when the host and guest virtual machines run the same operating system.

Some advantages of virtualization include:

· Guest virtual machines are packaged as disk files when they aren’t running. The guest VM can be checkpointed and saved, loaded and resumed, and copied and run on multiple computer hosts.

· Multiple guest virtual machines can be run concurrently, resources permitting. Each guest VM is isolated from the others by the hypervisor, in cooperation with the computer’s virtual memory protection hardware. This allows hardware to be commoditized and rented by the hour.

· Guest virtual machines can be configured to use a portion of the resources (physical memory, processor cores) of a host computer. Computing resources can be tailored to fit the requirements of the programs running on each guest VM, providing consistent levels of performance and preventing unintended interactions among multiple VMs running concurrently on the same hardware.

As with traditional time slicing, C++ remains unaware that it is running within a guest VM under a hypervisor. C++ programs may be indirectly aware that they have constrained resources. Virtualization is relevant to C++ program design because it makes it both possible to limit the computational resources consumed by a program and necessary for the program to know what resources are actually available.

Containerization

Containerization is similar to virtualization in that containers hold a filesystem image and a memory image containing a checkpointed program state. They differ in that the container host is an operating system, so I/O and system resources can be provided directly rather than being emulated, less efficiently, by a hypervisor.

Containerization has the same advantages as virtualization (packaging, configuration, and isolation) plus containers can run somewhat more efficiently.

Containerization is not visible to C++ programs running within a container. It is relevant to C++ programs for the same reason as virtualization.

Symmetric multiprocessing

A symmetric multiprocessor is a computer containing several execution units that execute the same machine code and have access to the same physical memory. Modern multicore processors are symmetric multiprocessors. The currently executing programs and system tasks can be run on any available execution unit, though the choice of execution unit may have consequences on performance.

A symmetric multiprocessor executes multiple threads of control with true hardware concurrency. If there are n execution units, the total run time of a compute-bound program can be reduced to as little as 1/n of its single-threaded run time. These hardware threads stand in contrast to software threads, discussed later, which may or may not run on distinct hardware threads, and thus may or may not reduce total run time.

Simultaneous multithreading

Some processors are designed so that each hardware core has two (or more) register sets and executes two or more corresponding streams of instructions. When one stream stalls (as for an access to main memory), instructions from the other stream can execute. A processor core having this feature acts like two (or more) cores, so that a “four-core processor” really can host eight hardware threads. This is important, as we’ll see in “Create as Many Runnable Threads as Cores”, because the most efficient use of software threads matches the number of software threads to the number of hardware threads.
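
The standard library offers a portable way to ask how many hardware threads a machine provides. The following is a minimal sketch; the returned value is only a hint, and may be 0 if the implementation cannot determine it:

#include <iostream>
#include <thread>

int main() {
    // Number of concurrent hardware threads (cores times threads per core).
    // The value is a hint; it may be 0 if it cannot be determined.
    unsigned hw_threads = std::thread::hardware_concurrency();
    std::cout << "hardware threads: " << hw_threads << std::endl;
    return 0;
}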

Multiple processes

Processes are concurrent streams of execution that have their own protected virtual memory space. Processes communicate using pipes, queues, network I/O, or some other unshared mechanism.2 Processes synchronize using synchronization primitives or by pending on input (that is, blocking until the input becomes available).

The principal advantage of processes is that the operating system isolates one process from another. If one process crashes, the others remain alive, although they may have nothing to do.

The greatest weakness of processes is that they have a lot of state: virtual memory tables, multiple execution unit contexts, plus the contexts of all suspended threads. They are slower to start, stop, and switch between than threads.

C++ doesn’t experience processes directly. Generally, a C++ program is represented by an operating system process. There are no tools in C++ to manipulate processes because not all operating systems even possess a process concept. On small processors, programs may be time sliced but not protected from one another, so that they act more like threads.

Distributed processing

Distributed processing is distribution of an activity onto a collection of processors that are not necessarily the same, and that communicate over links that are typically slow compared to the processor. A cluster of cloud instances communicating via TCP/IP links is an example of distributed processing. An example of distributed processing on a single PC is offloading drivers to processors running on disk drives and network cards. Another example is offloading graphics tasks to the many kinds of specialized processors in a graphics processing unit (GPU). GPUs have traditionally been provided on video cards, but the latest silicon stylings from several manufacturers put GPUs onto the microprocessor die.

In a typical distributed processing setup, data flows through a pipeline or network of processes, where each process performs some transformation on the input, pushing the transformed data to the next stage in the pipeline. This model, which is as old as Unix command-line pipelines, allows relatively heavyweight processes to run efficiently. Processes in a pipeline are long-lived, avoiding startup costs. Processes can continuously transform units of work, so they use whole time slices, subject to availability of input. Most importantly, processes do not share memory or synchronize with each other, so they run at full speed.

Although C++ contains no process concept, distributed processing is relevant to C++ development because it influences the design and structure of programs. Sharing memory does not scale beyond a few threads. Some treatments of concurrency advocate abandoning shared memory altogether. Distributed processing systems often decompose naturally into subsystems, producing modular, understandable, reconfigurable architectures.

Threads

Threads are concurrent streams of execution within a process that share the same memory. Threads synchronize using synchronization primitives and communicate using shared memory locations.

Their strength, in comparison to processes, is that threads consume fewer resources and are faster to create and faster to switch between.

Threads have several weaknesses, however. Since all threads in a process share the same memory space, a thread writing to invalid memory locations can overwrite data structures belonging to other threads, causing them to crash or behave unpredictably. Further, access to shared memory is many times slower than access to unshared memory, and it must be synchronized between threads or its contents become difficult to interpret.

Most operating systems support threads with an operating system–dependent library. Up until recently, C++ developers with experience in concurrency used a native thread library or the POSIX Thread (pthread) library, which provides a cross-platform solution with basic services.

Tasks

A task is a unit of execution that can be invoked asynchronously in the context of a separate thread. In task-based concurrency, the tasks and the threads are separately and explicitly managed, so that a task is assigned to a thread for execution. By contrast, in thread-based concurrency, the thread and the executable code running on the thread are managed as a unit.

Task-based concurrency is built atop threads, so tasks share the strengths and weaknesses of threads.

An additional advantage of task-based concurrency is that the number of active software threads can be matched to the number of hardware threads so that threads execute efficiently. The program can prioritize and queue tasks for execution. By contrast, in a thread-based system, the operating system prioritizes threads in an opaque and operating system–dependent manner.

The cost of tasks’ extra flexibility is greater application complexity. The program must implement some means of prioritizing or sequencing tasks. The program must also manage a pool of threads upon which to run tasks.

Interleaved Execution

Astronomers have a funny way of thinking about the universe. Hydrogen makes up 73% of the visible matter in the universe. Helium makes up 25%. Everything else makes up less than 2%. Astronomers can describe most features of the observable universe as if it consisted exclusively of hydrogen and helium. They call all the other elements—the ones that make up planets, processors, and people—“metals,” as if they had no separate identity, and pretty much ignore them.

Concurrent programs can similarly be abstracted into loads, stores, and branches, and the branches mostly ignored, as if all the complexity of programming was irrelevant. Discussions of concurrency (including the ones in this book) often feature simple program fragments consisting mostly of sequences of assignment statements to illustrate concurrency concepts.

Concurrent execution of two threads of control can be modeled as interleaving the simple load and store statements of the two threads. If thread 1 and thread 2 each consist of a single statement, the possible interleavings are “12” and “21”. If each thread has two statements, there are several potential interleavings: “1122”, “1212”, “2112”, “1221”, “2121”, “2211”. In real programs, a very large number of interleavings are possible.

In the days of single-core processors, concurrency was implemented by time slicing in the operating system. Race conditions were relatively infrequent because many statements from one thread executed before the operating system gave control to the other thread. The interleavings observed in practice were “1111...11112222....2222”. With today’s multicore processors, interleavings of individual statements are possible, so race conditions are more frequently visible. Developers who wrote concurrent programs in the single-processor past may have a sense of complacency about their skill that is no longer warranted.

Sequential Consistency

As noted in Chapter 2, C++ believes in a simple, intuitive model of the computer. This model includes the requirement that programs are sequentially consistent. That is, programs behave as if statements are executed in the order in which they are written, subject to C++ flow of control statements. Obvious as this requirement seems, those weasel-words “as if” in the previous sentence permit many compiler optimizations and innovative microprocessor designs.

For instance, the program fragment in Example 12-1 is sequentially consistent even if y is set to 0 before x is, or x is set to 1 after y is set to 1, or even if x is set to 1 after the comparison y == 1 in the if statement, as long as the assignment x = 1 happens before assert(x == 1)—that is, before the value is used.

Example 12-1. Sequential consistency means “as if” executed in order

int x = 0, y = 0;
x = 1;
y = 1;
if (y == 1) {
    assert(x == 1);
}

The reader may well ask, “Why would the compiler want to reorder statements?” In fact, there are many reasons, all of them about the dark magic of optimal code generation in compilers, of which I dare not speak. And it’s not just the compiler moving things around; modern microprocessors reorder loads and stores too (see “There Are Multiple Streams of Program Execution”).

The combined effect of compiler optimizations, out-of-order execution, caching, and write buffering can be modeled using the common metaphor of loads and stores breaking loose from their places in the program listing and moving up and down the page to a point before or after their intended position. This metaphor captures all these effects without having to explain or understand details of compiler optimizations and hardware behavior in a specific processor.

In a particular compiler/hardware combination, not all possible movements of loads and stores actually happen. Moving loads and stores of shared variables thus illustrates a worst-case scenario. But trying to reason about a specific hardware architecture when it has three levels of cache and behavior that depends on which cache line each variable is on is insanely difficult. It is also not fruitful, because most programs must run on more than one hardware device over their lifetime.

What matters is that a program remains sequentially consistent when use of a variable is moved up or down with respect to other statements, so long as it doesn’t move past an update of that variable. Likewise, the program remains sequentially consistent when updates of a variable are moved up or down, so long as they don’t move past a use.

Races

Concurrency creates a problem for C++, which provides no way to tell when two functions may run concurrently and what variables are shared. Code-moving optimizations that are perfectly reasonable when considering one function at a time may cause problems when two functions run concurrently.

If thread 1 consists of the statement x = 0 and thread 2 the statement x = 100, the outcome of the program depends on a race between the two threads. A race occurs when the result of the concurrent execution of these statements depends on which interleaving occurs in a given run of the program. The interleaving “12” produces the result x == 100, and the interleaving “21” produces x == 0. The result of this program, or any program containing a race, is nondeterministic—that is, unpredictable.
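
Here is a minimal sketch of this two-thread race. It uses std::atomic<int> so the conflicting stores are well defined, yet the final value still depends on the interleaving; the variable names are hypothetical:

#include <atomic>
#include <iostream>
#include <thread>

int main() {
    std::atomic<int> x(-1);                // shared variable
    std::thread t1([&x]() { x = 0; });     // "thread 1"
    std::thread t2([&x]() { x = 100; });   // "thread 2"
    t1.join();
    t2.join();
    std::cout << x << std::endl;           // prints 0 or 100, run to run
    return 0;
}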

The C++ standard memory model says programs behave as if they were sequentially consistent, as long as they do not contain races. If a program contains races, sequential consistency may be violated.

Example 12-2 is a multithreaded version of Example 12-1, this time with more meaningful variable names.

Example 12-2. Sequential consistency of multiple threads

// thread 1, running in core 1
shared_result_x = 1;
shared_flag_y = 1;
...

// thread 2, running in core 2
while (shared_flag_y != 1)
    /* busy-wait for shared_flag_y to be set to 1 */ ;
assert(shared_result_x == 1);

shared_result_x is a result computed in thread 1 to be used by thread 2. shared_flag_y is a flag set by thread 1 that tells thread 2 that the result is ready to use. If the compiler or processor reorders the two statements in thread 1 so that shared_flag_y is set before shared_result_x, thread 2 may (though it will not always) exit its while loop after seeing the new value of shared_flag_y while still seeing the old value of shared_result_x, so the assert fires. Each thread is sequentially consistent by itself, but the interaction of the two threads is a race.

Sequential consistency of variables shared between threads is guaranteed by some languages that don’t have that pesky “as if” in their definition. It is provided in others because shared variables are explicitly declared; the compiler doesn’t move them around and generates special code to ensure that the hardware doesn’t move them either. A concurrent C++ program must explicitly force a particular interleaving to preserve sequential consistency.
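
One way to force the needed interleaving in Example 12-2 is to make the flag a std::atomic. The following sketch assumes the default (sequentially consistent) ordering, so the store to shared_result_x cannot become visible after the store to shared_flag_y:

#include <atomic>
#include <cassert>
#include <thread>

int shared_result_x = 0;
std::atomic<int> shared_flag_y(0);

void thread1() {                      // produces the result
    shared_result_x = 1;
    shared_flag_y = 1;                // atomic store publishes the result
}

void thread2() {                      // consumes the result
    while (shared_flag_y != 1)
        /* busy-wait for shared_flag_y to be set to 1 */ ;
    assert(shared_result_x == 1);     // now guaranteed to hold
}

int main() {
    std::thread t1(thread1), t2(thread2);
    t1.join();
    t2.join();
    return 0;
}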

Synchronization

Synchronization is the forced ordering of the interleaving of statements in multiple threads. Synchronization allows developers to reason about the order in which things happen in a multithreaded program. Without synchronization, the order in which things happen is unpredictable, making it hard to coordinate actions between threads.

Synchronization primitives are programming constructs whose purpose is to achieve synchronization by forcing a particular interleaving of concurrent programs. All synchronization primitives work by causing one thread to wait for, or pend on, another thread. By enforcing a specific execution order, synchronization primitives prevent races.

A variety of synchronization primitives have been proposed and implemented in the past 50 years of programming. Microsoft Windows has a rich set of synchronization primitives, including events on which threads can pend, two kinds of mutexes, a very general semaphore, and Unix-style signals. Linux has its own rich but distinct set.

It is important to understand that synchronization primitives exist as concepts only. There is no authoritative voice that says exactly what a semaphore is, or exactly how a monitor should be implemented. What Windows calls a semaphore can be hard to recognize from Dijkstra’s original description.3 Furthermore, the whole menagerie of proposed synchronization primitives can be synthesized out of a sufficiently rich set of pieces, in the same way that any Boolean function can be synthesized in hardware from NAND and NOR gates. Different implementations on different operating systems should thus be expected.

Classical synchronization primitives interact with the operating system to move threads between active and pending states. This implementation is appropriate for computers with only one relatively slow execution unit. However, the latency of starting and stopping threads using the operating system can be significant. When there are multiple processors executing streams of instructions in a truly concurrent fashion, synchronization by busy-waiting on shared variables can produce very short wait times. Synchronization library designers may also take a hybrid approach.

Atomicity

An operation on a shared variable (particularly a class instance having multiple members) is atomic if no thread can ever see an update performed by another thread in a half-completed state. If update operations are not atomic, some interleavings of two threads’ code allow one thread to access the shared variable when it is in an inconsistent, half-updated state due to an update operation in progress on another thread. Viewed another way, atomicity is a promise that these undesirable interleavings cannot occur.

Atomicity by mutual exclusion

Traditionally, atomicity is provided via mutual exclusion. Each thread wishing to access a shared variable must acquire a mutex before accessing the shared variable, and release the mutex when the operation is complete. The part of the program bounded by acquiring and releasing the mutex is called a critical section. If one thread has acquired a mutex, all other threads pend when they try to acquire the mutex. Thus, only one thread at a time can perform operations on the shared data. This thread is said to hold the mutex. The mutex serializes the threads so that critical sections execute one after the other.
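
As a concrete sketch, the critical section below updates a hypothetical shared counter under a std::mutex, using the std::lock_guard RAII lock described later in this chapter:

#include <mutex>

std::mutex counter_mutex;        // guards shared_counter
long shared_counter = 0;         // hypothetical shared data

void add_to_counter(long n) {
    std::lock_guard<std::mutex> guard(counter_mutex);  // acquire; pend if held
    shared_counter += n;         // critical section: one thread at a time
}                                // guard's destructor releases the mutex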

Loads and stores of shared variables must happen within a critical section, where only one thread is running, or a race with unpredictable results may occur. But as discussed in “Sequential Consistency”, the compiler and processor both move loads and stores. A mechanism called a memory fence prevents loads and stores of shared variables from leaking out of a critical section. In the processor, special instructions tell it not to move loads and stores past the memory fence. In the compiler, the memory fence is conceptual. The optimizer does not move loads past function calls, because any function call might contain a critical section.

The memory fence at the top of a critical section must prevent loads of shared variables from leaking out the top of the critical section. This memory fence is said to have acquire semantics, because it happens when the thread acquires a mutex. Similarly, the memory fence at the bottom of a critical section must prevent stores of shared variables from leaking out the bottom. This memory fence is said to have release semantics, because it happens when the thread releases a mutex.

In the days of single-core processors, memory fences were unnecessary. Compilers did not reorder loads and stores across function calls, and the operating system synchronized memory almost accidentally when it switched threads. But in the multicore world, programmers must contend with this new issue. Developers who use synchronization primitives provided by the C++ standard library or the native synchronization library of their operating system don’t need to worry about memory fences, but programmers implementing synchronization primitives or lock-free data structures must take a keen interest.

Atomic hardware operations

Implementing atomicity by mutual exclusion has costs that make it a troublesome tool to wield:

· Since only one thread can hold a mutex, operations on shared variables cannot run concurrently. The more time a critical section consumes, the more time the critical section takes away from concurrent execution. The more threads that operate on the shared variable, the more time the critical section takes away from concurrent execution.

· When a thread releases a mutex, another thread pending on the mutex can acquire it. But the mutex may not guarantee which other thread acquires it, because providing such a guarantee is expensive. If many threads are pending on the mutex, some threads may never acquire it; the computations performed by those threads do not move forward. This situation is called starvation.

· If a thread holds one mutex and needs to acquire a second mutex, a situation can arise where the thread can never move forward because another thread holds the second mutex and wishes to acquire the first mutex. This situation is called deadlock, or, more colorfully, the deadly embrace. A thread can deadlock itself by trying to lock the same mutex twice. Any number of threads can deadlock because of a cyclic dependency on mutexes among the threads. There is no known way to guarantee that a program that attempts to acquire multiple mutexes is deadlock-free, although there are deadlock avoidance strategies.

For simple variables like integers and pointers, certain operations can be performed atomically on some computers because the operations are performed by single machine instructions. These special atomic instructions contain memory fences that ensure the instruction is not interrupted halfway through execution.

Atomic instructions form the basis for implementing mutexes and other synchronization primitives. Surprisingly sophisticated thread-safe data structures can be implemented using only the operations that can be done atomically by the hardware. This is called lock-free programming because code that works in this way does not have to wait to acquire a mutex.
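
To illustrate, here is a minimal busy-waiting spin lock built on std::atomic_flag, whose test_and_set() member typically maps to a single atomic read-modify-write instruction. It is a sketch of the idea, not a replacement for std::mutex:

#include <atomic>

std::atomic_flag lock_flag = ATOMIC_FLAG_INIT;   // clear means "unlocked"

void spin_lock() {
    // test_and_set() atomically sets the flag and returns its previous value.
    // Keep spinning until the previous value was clear; then we hold the lock.
    while (lock_flag.test_and_set())
        /* busy-wait */ ;
}

void spin_unlock() {
    lock_flag.clear();           // atomically clear the flag, releasing the lock
}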

Lock-free programs scale to larger numbers of concurrent threads, but they are not a cure-all. Threads are still serialized by atomic operations, even those performed by a single instruction. However, the duration of the critical section is an order of magnitude smaller than that of even the most efficient mutex.

C++ Concurrency Facilities Refresher

As of C++14, concurrency support in the C++ standard library is somewhat Spartan, compared to the rich facilities in popular operating systems. This is partly because the C++ facilities must contain only behavior implementable on all operating systems. It is also partly because C++ concurrency is still a work in progress, with substantial improvements scheduled for C++17. The advantage of using C++ concurrency features, versus native calls into the operating system, is that the C++ facilities are defined to behave in a consistent way across platforms.

The concurrency facilities of the C++ standard library fit together like a set of stacking blocks, building concurrency up from the operating system’s very C-ish threading library to a fully C++ threading solution that can pass arbitrary argument lists, return values, throw exceptions, and be stored in containers.

Threads

The <thread> header file provides the std::thread class, which allows a program to create thread objects as thin wrappers around the operating system’s own threading facilities. std::thread’s constructor takes a callable object (a function pointer, function object, lambda, or bind expression) as an argument, and executes it in the context of a new software thread. C++ uses variadic template argument forwarding magic to call functions with arbitrary argument lists, whereas the underlying operating system thread invocation typically takes a pointer to a void function with one void* argument.

std::thread is an RAII (resource acquisition is initialization) class for managing operating system threads. It provides a native_handle() member function, which returns the operating system’s native thread handle. This allows a program access to the operating system’s generally richer set of functions that act on threads.

Example 12-3 shows a simple example of std::thread use.

Example 12-3. Starting a few simple threads

void f1(int n) {
    std::cout << "thread " << n << std::endl;
}

void thread_example() {
    std::thread t1;            // thread variable, not a thread
    t1 = std::thread(f1, 1);   // assign thread to a thread variable
    t1.join();                 // wait for thread to complete
    std::thread t2(f1, 2);
    std::thread t3(std::move(t2));
    std::thread t4([]() { return; });  // works with lambdas too
    t4.detach();
    t3.join();
}

Thread t1 is initially empty. Threads cannot be copied because they each contain a unique handle to the underlying resource, but the move assignment operator allows an empty thread to be assigned an rvalue. t1 can own any thread that executes a function taking an integer argument. A function pointer to f1 and an integer argument are provided to the constructor. The second argument is forwarded to the callable object (f1), which is started in std::thread’s constructor.

Thread t2 is started with the same function but a different second argument. Thread t3 is an example of the move constructor in action. After the move constructor, t3 is running the thread started as t2, and t2 is empty. Thread t4 shows that threads can be started with lambdas as their callable object.

The operating system thread represented by a std::thread has to be disposed of before the std::thread is destroyed. The thread can be joined, as in t3.join(), which means the current thread waits for the thread being joined to end. The operating system thread can also be dissociated from the std::thread object, as in t4.detach(). In this case, the thread continues to execute, but isn’t visible to the thread that started it any longer. A detached thread ends when its callable object returns. If the callable object does not return, the thread is a resource leak, continuing to consume resources until the full program ends. If neither join() nor detach() is called before the std::thread is destroyed, its destructor calls terminate(), abruptly stopping the whole program.

Although it is possible to work directly with std::thread, it may be more productive to think of std::thread as a building block of more sophisticated facilities. Any value returned by the function object is ignored. Any exception thrown by the function object causes immediate, unconditional program termination via a call to terminate(). These limitations make naked calls to std::thread fragile, almost as if the standard writers wanted to discourage their use.

Promises and Futures

The C++ std::promise and std::future template classes are, respectively, the sending and receiving ends of a message from one thread to another. Promises and futures allow threads to asynchronously produce values and throw exceptions. The promise and future share a dynamically allocated variable called the shared state that can hold either a value of a defined type, or an exception of any type encapsulated in a standard wrapper. A thread of execution can pend on a future, so the future also acts as a synchronization device.

Promises and futures may be used simply to implement an asynchronous function call and return. However, the promise and future are far more general than that simple use. They can implement a graph of dynamically changing communication points among threads. That said, they provide no structuring mechanism, so completely wild and crazy communication graphs may be difficult to debug.

The C++ <future> header file contains the functionality of promises and futures. An instance of the std::promise template allows a thread to set the shared state to either a value of a specified type or an exception. The sending thread does not wait for the shared state to be read; it can continue immediately.

A promise’s shared state is not ready until either a value or an exception is set. The shared state is meant to be set exactly once. Otherwise:

· If the thread attempts to set a value or exception more than once, the attempt throws std::future_error with an error code of promise_already_satisfied; the shared state keeps the first value or exception set and remains ready.

· If the thread never sets a value or exception, when the promise is destroyed, the destructor sets the shared state to the exception std::future_error with error code broken_promise, and the shared state becomes ready, releasing any futures waiting on the promise. To get this useful error indication, the promise must be destroyed in the thread’s callable object.

std::future allows a thread to receive the value or exception saved in the shared state of a promise. A future is a synchronization primitive; the receiving thread pends in the call to the future’s get() member function until the corresponding promise is made ready by a call to set the value or exception.

A future is not valid until it has been obtained from a promise (or assigned from another valid future). The receiving thread cannot pend on a future until it is valid. The future must be obtained from the promise before the sending thread is started. Otherwise, the receiving thread may try to pend on the future before it is valid.

Promises and futures can’t be copied. They are entities that represent a specific communication rendezvous. They can be constructed and move-constructed, and a future is obtained from its promise by calling get_future(). Ideally, the promise ends up in the sending thread and the future in the receiving thread. There is an idiom in which the promise is created in the receiving thread, the future is obtained from it, and the promise is then passed as an rvalue reference using std::move(promise) into the sending thread, so its contents are moved into a promise belonging to the sending thread. std::async() performs this strange magic for the developer. It is also possible to pass a promise by reference to the sending thread. Example 12-4 shows how to control thread interaction using promises and futures.

Example 12-4. Promises, futures, and threads

void promise_future_example() {
    auto meaning = [](std::promise<int>& prom) {
        prom.set_value(42);  // compute the meaning of life
    };
    std::promise<int> prom;
    std::thread(meaning, std::ref(prom)).detach();
    std::future<int> result = prom.get_future();
    std::cout << "the meaning of life: " << result.get() << "\n";
}

In Example 12-4, the promise prom is created before the std::thread is invoked. This isn’t perfect as written, because it doesn’t test for a broken promise. It is necessary though, because if prom is not constructed before the thread starts, there is no way to guarantee that the future result will be valid before the call to result.get().

The program then constructs an anonymous std::thread. Its arguments are the lambda meaning, which is the callable object to be executed, and prom, the promise that is an argument to meaning. Note that since prom is a reference argument, it must be wrapped in std::ref() so the argument-forwarding magic will work correctly. The call to detach() dissociates the running thread from the anonymous std::thread, which is destroyed.

Now two things are happening: the operating system is getting ready to execute meaning, and the program is creating the future result. The program may execute prom.get_future() before the thread begins to run. This is why prom was created before the thread was constructed—so that the future would be valid and the program could pend, awaiting the thread.

The program pends in result.get(), waiting for the thread to set the shared state in prom. The thread calls prom.set_value(42), making the shared state ready and releasing the program. The program finishes by outputting "the meaning of life: 42".

There is nothing magical about futures. A developer who wanted to design a thread that first returned an int and then returned a std::string could make two promises. The receiving program would create two matching futures.
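
A sketch of that two-promise arrangement might look like the following; the names int_prom, str_prom, and two_promise_example() are hypothetical:

#include <future>
#include <iostream>
#include <string>
#include <thread>

void two_promise_example() {
    std::promise<int> int_prom;
    std::promise<std::string> str_prom;
    auto int_result = int_prom.get_future();
    auto str_result = str_prom.get_future();

    std::thread t([&int_prom, &str_prom]() {
        int_prom.set_value(42);              // first result is ready
        str_prom.set_value("and a string");  // second result is ready
    });

    std::cout << int_result.get() << " " << str_result.get() << "\n";
    t.join();
}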

Making the future ready signals that the computation is complete. Because a program can pend on a future, it isn’t necessary for the program to pend on thread termination. This will be important in “Asynchronous Tasks”, when we discuss std::async(), and in “Implement a Task Queue and Thread Pool”, where we discuss thread pools, because it is more efficient to reuse threads than it is to destroy and re-create them.

Asynchronous Tasks

The C++ standard library task template class wraps a callable object in a try block and saves the returned value or thrown exception in a promise. Tasks permit callable objects to be called asynchronously by threads.

Task-based concurrency in the C++ standard library is only half-formed. C++11 provides the async() template function that packages a callable object as a task and invokes it on a reusable thread. async() is a bit of a “god function” (see “Beware of ‘God Functions’”), hiding many juicy details of thread pools and task queues.

Tasks are represented in the C++ standard library <future> header file. The std::packaged_task template class wraps any callable object (which can be a function pointer, function object, lambda, or bind expression) so that it can be called asynchronously. The packaged task is itself a callable object that can be the callable argument to a std::thread. The great advantage of tasks versus other callable objects is that a task can return a value or throw an exception without abruptly terminating the program. The task’s return value or thrown exception is stored in a shared state that can be accessed through a std::future object.

Example 12-5 is a simplified version of Example 12-4 that uses a packaged_task.

Example 12-5. packaged_task and thread

void promise_future_example_2() {
    auto meaning = std::packaged_task<int(int)>(
        [](int n) { return n; });
    auto result = meaning.get_future();
    auto t = std::thread(std::move(meaning), 42);
    std::cout << "the meaning of life: " << result.get() << "\n";
    t.join();
}

The packaged_task meaning contains both a callable object and a std::promise. This solves the problem of getting the promise destructor called in the context of the thread. Notice that the lambda in meaning simply returns; the machinery of setting the promise is all comfortably hidden.

In this example, I joined the thread instead of detaching it. Although it isn’t completely obvious from this example, both main program and thread can continue running concurrently after the main program gets the future’s value.

The <future> header also provides a facility based on tasks. A template function std::async() executes a callable object argument, and the callable object may be executed in the context of a new thread. Instead of a thread object, however, std::async() returns a std::future, capable of holding either a return value or an exception thrown by the callable object called by std::async(). Furthermore, an implementation may choose to allocate std::async() threads out of a thread pool for improved efficiency. This is illustrated in Example 12-6.

Example 12-6. tasks and async()

void promise_future_example_3() {
    auto meaning = [](int n) { return n; };
    auto result = std::async(std::move(meaning), 42);
    std::cout << "the meaning of life: " << result.get() << "\n";
}

The lambda defined as meaning and the lambda’s argument are passed to std::async(). Type deduction is used to determine the template parameters of std::async(). std::async() returns a future able to retrieve an int result or an exception, which is moved into result. The call to result.get() pends until the thread invoked by std::async() fulfills its promise by returning its int argument. Thread termination is managed within std::async(), which may keep the thread in a thread pool.

The example code need not explicitly manage thread termination. std::async() may use a thread pool maintained by the C++ runtime system to recycle the thread if this is cheaper than destroying and re-creating threads as needed. Explicit thread pools may be added in C++17.

Mutexes

C++ provides several mutex classes to provide mutual exclusion for critical sections. Their definitions are simple enough to be implemented as thin wrappers over the operating system–dependent native mutex facilities.

The <mutex> header file contains four mutex classes:

std::mutex

A simple, relatively efficient mutex. On Windows, this class tries a busy-wait first, and falls back to operating system calls if it cannot obtain the mutex quickly.

std::recursive_mutex

A mutex that allows a thread that already holds the mutex to acquire it again, as in a nested function call. This class may be less efficient due to the need to count how many times it is acquired.

std::timed_mutex

A mutex that permits a timed attempt to acquire the mutex. The requirement for timed attempts generally requires intervention from the operating system, which significantly increases the latency of this type of mutex in comparison to std::mutex.

std::recursive_timed_mutex

A mutex that is both timed and recursive, with mustard, ketchup, and secret sauce. This type of mutex is tasty, but expensive.

My experience is that recursive and timed mutexes may be warnings of a design that can be simplified. The scope of a recursive lock is difficult to reason about, so it is deadlock-bait. The expense of these mutexes should be tolerated only when necessary, and they should be avoided when designing new code.

C++14 added the <shared_mutex> header file containing support for shared mutexes, also known as reader/writer mutexes. A single thread can lock a shared mutex in an exclusive mode, in order to atomically update a data structure. Multiple threads can lock a shared mutex in a shared mode to atomically read a data structure, but it is locked against exclusive access until all readers have released the mutex. The shared mutex may allow more threads to access the data structure without waiting, provided most accesses are for reading. These shared mutexes are available:

std::shared_timed_mutex

A shared mutex that supports timed and untimed attempts to acquire the mutex.

std::shared_mutex

A simpler shared mutex, scheduled for inclusion in C++17.

It is my experience that reader/writer mutexes lead to starvation of the writer thread unless reads are infrequent, in which case the value of the reader/writer optimization is negligible. As with recursive mutexes, developers must make a strong case for using this more complex mutex, and should usually choose a simpler, more predictable mutex.

Locks

In C++, the word lock refers to an RAII class that acquires and releases a mutex in a structured manner. Use of the word can be confusing because mutexes are also sometimes referred to as locks. Acquiring a mutex is also called locking the mutex, and releasing the mutex is also called unlocking the mutex. In C++, the mutex member function to acquire a mutex is named lock(). I have been using mutexes for over 20 years, and I still have to concentrate to keep these concepts straight.

The C++ standard library provides a simple lock that acquires one mutex, and a more general lock that can be used, together with the std::lock() function, to acquire multiple mutexes. std::lock() implements a deadlock avoidance algorithm.

The <mutex> header file contains two lock templates:

std::lock_guard

A simple RAII lock. During the class’s construction, the program waits for the lock to be acquired, and it releases the lock on destruction of the lock_guard. Pre-standard implementations of this class were often called scope_guard.

std::unique_lock

A general-purpose mutex ownership class that offers RAII locking, deferred locking, timed locking attempts, transfer of mutex ownership, and use with condition variables.
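
For instance, two mutexes can be acquired together, without risk of deadlock, by constructing deferred std::unique_lock instances and handing them to std::lock(). A minimal sketch, with hypothetical mutexes m1 and m2:

#include <mutex>

std::mutex m1, m2;               // hypothetical mutexes guarding two structures

void update_both() {
    std::unique_lock<std::mutex> l1(m1, std::defer_lock);  // not yet locked
    std::unique_lock<std::mutex> l2(m2, std::defer_lock);
    std::lock(l1, l2);           // acquire both, avoiding deadlock
    // ... critical section touching both shared structures ...
}                                // destructors release both mutexes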

A lock for shared mutexes was added in C++14, in the <shared_mutex> header file:

std::shared_lock

A mutex ownership class for shared (reader/writer) mutexes. It offers all the sophisticated features of std::unique_lock, plus control of shared mutexes.

A single thread can lock a shared mutex in an exclusive mode, to atomically update a data structure. Multiple threads can lock a shared mutex in a shared mode to atomically read a data structure, but it is locked against exclusive access until all readers have released the mutex.
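
The sketch below shows the typical division of labor, assuming a hypothetical map guarded by a C++14 std::shared_timed_mutex: readers take a std::shared_lock, and the writer takes an exclusive std::unique_lock:

#include <map>
#include <shared_mutex>
#include <string>

std::shared_timed_mutex table_mutex;      // C++14 shared (reader/writer) mutex
std::map<std::string, int> table;         // hypothetical shared data structure

int reader(std::string const& key) {
    std::shared_lock<std::shared_timed_mutex> lk(table_mutex);  // shared mode
    auto it = table.find(key);            // many readers may run concurrently
    return it == table.end() ? 0 : it->second;
}

void writer(std::string const& key, int value) {
    std::unique_lock<std::shared_timed_mutex> lk(table_mutex);  // exclusive mode
    table[key] = value;                   // one writer, no concurrent readers
}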

Condition Variables

Condition variables allow C++ programs to implement the monitor concept proposed by famous computer scientists C.A.R. Hoare and Per Brinch Hansen and widely deployed in Java as synchronized classes.4

A monitor shares a data structure among multiple threads. When a thread successfully enters the monitor, it owns a mutex that allows it to update the shared data structure. A thread may leave the monitor after updating the shared data structure, giving up its exclusive access. It may also pend on a condition variable, temporarily giving up exclusive access until a specific change occurs.

A monitor may have one or more condition variables. Each condition variable summarizes a conceptual change-of-state event in the data structure. When a thread running inside the monitor updates the data structure, it must notify any condition variables affected by the update that the event they represent has occurred.

C++ provides two implementations of condition variables in the <condition_variable> header file. They differ in the generality of the lock argument they accept:

std::condition_variable

The most efficient condition variable, this requires the use of std::unique_lock to lock the mutex.

std::condition_variable_any

A condition variable that can use any BasicLockable lock; that is, any lock offering lock() and unlock() member functions. This condition variable may be less efficient than std::condition_variable.

When a thread is released by a condition variable, the thread must verify that the data structure is in the expected state. This is because some operating systems may spuriously notify the condition variable. (My experience is that it may also happen due to programmer error.) Example 12-7 is an extended example of using condition variables to implement a multithreaded producer/consumer.

Example 12-7. Simple producer and consumer using condition variables

void cv_example() {
    std::mutex m;
    std::condition_variable cv;
    bool terminate = false;
    int shared_data = 0;
    int counter = 0;

    auto consumer = [&]() {
        std::unique_lock<std::mutex> lk(m);
        do {
            while (!(terminate || shared_data != 0))
                cv.wait(lk);
            if (terminate)
                break;
            std::cout << "consuming " << shared_data << std::endl;
            shared_data = 0;
            cv.notify_one();
        } while (true);
    };

    auto producer = [&]() {
        std::unique_lock<std::mutex> lk(m);
        for (counter = 1; true; ++counter) {
            cv.wait(lk, [&]() { return terminate || shared_data == 0; });
            if (terminate)
                break;
            shared_data = counter;
            std::cout << "producing " << shared_data << std::endl;
            cv.notify_one();
        }
    };

    auto p = std::thread(producer);
    auto c = std::thread(consumer);
    std::this_thread::sleep_for(std::chrono::milliseconds(1000));
    {
        std::lock_guard<std::mutex> l(m);
        terminate = true;
    }
    std::cout << "total items consumed " << counter << std::endl;
    cv.notify_all();
    p.join();
    c.join();
    exit(0);
}

The producer thread in Example 12-7 “produces” by setting a single integer called shared_data to a nonzero value. The consumer “consumes” shared_data by setting it back to zero. The main program thread launches the producer and consumer, then takes a 1,000-millisecond nap. When the main thread wakes up, it locks mutex m to briefly enter the monitor, then sets a flag called terminate, which causes both producer and consumer threads to exit. The main program notifies the condition variable that the state of terminate has changed, joins the two threads, and exits.

consumer enters the monitor by locking the mutex m. The consumer is a single loop that pends on a condition variable named cv. While it is pending on cv, consumer is not in the monitor; mutex m is available. When there is something to consume, cv is notified. The consumer wakes up, relocks the mutex m, and returns from its call to cv.wait(), conceptually reentering the monitor.

Example 12-7 uses a single condition variable, which has the meaning “the data structure has been updated.” The particular update consumer normally waits for is shared_data != 0, but it also needs to wake up on terminate == true. This is an economical use of synchronization primitives, in contrast to the array of signals in the Windows WaitForMultipleObjects() function. A similar example could have been coded using one condition variable to wake the consumer and another to wake the producer.

The consumer invokes cv.wait() in a loop, checking each time it wakes to see whether the proper condition exists, because some implementations can spuriously wake threads waiting on a condition variable when it isn’t appropriate. If the condition is met, the while loop exits. If the condition that woke consumer was terminate == true, consumer exits its outer loop and returns. Otherwise, the condition was shared_data != 0. consumer prints a message, then indicates it has consumed the data by setting shared_data to zero and notifying cv that the shared data has changed. At this point, consumer is still inside the monitor, holding a lock on mutex m, but it continues the loop, entering cv.wait() again, unlocking the mutex and conceptually exiting the monitor.

The producer is similar. It pends until it can lock mutex m, then enters an outer loop that causes it to continue until it sees terminate == true. The producer waits on cv looking for a change in state. In this example, producer uses a version of wait() that takes a predicate argument, causing it to wait as long as the predicate returns false. The predicate is thus the condition implied by notifying the condition variable. This second form is syntactic sugar, hiding the while loop. Initially, shared_data is already zero, so the producer doesn’t wait on cv. It updates shared_data, then notifies cv, then loops back and enters cv.wait(), releasing the mutex and conceptually exiting the monitor.

Atomic Operations on Shared Variables

The C++ standard library <atomic> header file provides low-level tools for building multithreaded synchronization primitives: memory fences and atomic load and store.

std::atomic provides a standard mechanism to update an arbitrary data structure atomically, so long as its type is trivially copyable. Any specialization of std::atomic<T> must provide the following functions for the type T:

load()

std::atomic<T> provides the member function T load(memory_order), which copies the T object out of the std::atomic<T> atomically.

store()

std::atomic<T> provides the member function void store(T, memory_order), which copies a T object into the std::atomic<T> atomically.

is_lock_free()

is_lock_free() returns bool true if all operations defined on this type are implemented without the use of mutual exclusion, as by a single read-modify-write machine instruction.

Specializations of std::atomic are provided for integral and pointer types. Where the processor supports it, these specializations synchronize memory without invoking the operating system’s synchronization primitives. The specializations provide a number of operations that can be implemented atomically on modern hardware.
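
A brief sketch of the integral specialization, using a hypothetical shared counter:

#include <atomic>
#include <iostream>

std::atomic<unsigned long> hit_count(0);   // hypothetical shared counter

void count_hit() {
    hit_count.fetch_add(1);      // atomic read-modify-write; no mutex needed
}

void report() {
    std::cout << "hits so far: " << hit_count.load() << "\n";
    std::cout << "lock-free: " << std::boolalpha
              << hit_count.is_lock_free() << "\n";
}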

The performance of std::atomic depends on the processor for which code is being compiled:

· Intel-architecture PCs have a variety of read-modify-write instructions, and the cost of the atomic access depends on the memory fence, with some fences having no cost at all.

· On single-core processors with read-modify-write instructions, std::atomic may generate no extra code at all.

· On processors that do not have atomic read-modify-write instructions, std::atomic may be implemented using expensive mutual exclusion.

Memory fences

Most std::atomic member functions take an optional memory_order argument that selects a memory fence around the operation. If the memory_order argument is not provided, the default is memory_order_seq_cst, the strongest ordering. This provides a full fence that is always safe, but can be expensive. More refined fences can be selected, but this should be done only by knowledgeable users.

Memory fences synchronize main memory with the cache of multiple hardware threads. In general, a memory fence is executed on each of two threads to synchronize one thread with the other. C++ allows the following memory fences:

memory_order_acquire

memory_order_acquire may be casually thought to mean “acquire all work done by other threads.” It ensures that subsequent loads are not moved before the current load or any previous loads. It does this, paradoxically, by waiting for store operations currently in flight between the processor and main memory to complete. Without the fence, if a store is in flight when this thread does a load to the same address, the thread gets old information, as if the load had moved up inside the program.

memory_order_acquire may be less expensive than the default full fence. It would be appropriate, for instance, to use memory_order_acquire when atomically reading the flag in the while loop of a busy-wait.

memory_order_release

memory_order_release may casually be thought to mean “release all work done by this thread to this point.” It ensures that previous loads and stores done by this thread are not moved past the current store. It does this by waiting for store operations currently in flight within this thread to complete.

memory_order_release may be less expensive than the default full fence. It would be appropriate, for instance, when setting the flag at the end of a home-built mutex.

memory_order_acq_rel

This combines the two previous guarantees, creating a full fence.

memory_order_consume

memory_order_consume is a potentially weaker (and faster) form of memory_order_acquire that requires only that the current load take place before other operations that are data-dependent on it. For instance, when a load of a pointer is marked memory_order_consume, subsequent operations that dereference this pointer won’t be moved before it.

memory_order_relaxed

With this value, all reorderings are OK.

As currently implemented on most processors, memory fences are a blunt tool. The memory fence blocks forward progress until all writes in flight have completed. In reality, only writes to shared locations need be completed, but neither C++ nor x86-compatible processors have the means to identify this more restrictive set of locations, especially as the set varies from one thread to the next.
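
To make the acquire/release pairing concrete, here is a sketch (not one of the numbered examples) of the busy-wait flag mentioned above: the writer publishes data with a release store, and the reader spins on an acquire load.

#include <atomic>
#include <thread>

int payload = 0;                     // ordinary, non-atomic shared data
std::atomic<bool> ready(false);      // flag guarding the payload

void producer_thread() {
    payload = 42;                                    // write the data first
    ready.store(true, std::memory_order_release);    // "release work done so far"
}

void consumer_thread() {
    while (!ready.load(std::memory_order_acquire))   // "acquire others' work"
        ;                                            // busy-wait
    // The release/acquire pair guarantees that payload == 42 is visible here.
}

int main() {
    std::thread c(consumer_thread), p(producer_thread);
    p.join();
    c.join();
    return 0;
}

Replacing both memory orders with memory_order_relaxed would remove the guarantee: the consumer could see ready == true and still read a stale payload.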

STOP AND THINK

Multicore x86 processors released to date have a helpful memory model. All loads have acquire semantics, and all stores have release semantics. That is, the x86 architecture imposes a strong ordering on memory access. It was probably necessary to forgo a more aggressive memory model in the x86 to maintain compatibility with a host of legacy programs whose concurrency features were not in fact correctly implemented. PowerPC, ARM, and Itanium processors all have weaker ordering (and higher performance).

What this means is that developers who write concurrent C++ programs using Visual C++ on the x86 can currently get away with multithreaded murder—but if they recompile the same program on an ARM or PowerPC processor, they may be surprised by new threading bugs in their supposedly debugged and working code.

Atomic access is no panacea. The memory fence is quite expensive. To find out just how expensive, I performed an experiment, timing an atomic store operation versus a simple store. Example 12-8 times a simple atomic store (with default full fence) performed in a loop.

Example 12-8. Atomic store test

typedef unsigned long long counter_t;
std::atomic<counter_t> x;

for (counter_t i = 0, iterations = 10'000'000 * multiplier;
     i < iterations; ++i)
    x = i;

The test took 15,318 milliseconds on my PC. The non-atomic version of this test, reproduced in Example 12-9, took 992 milliseconds—about 14 times faster. In an example with more writes in flight, the difference might be even more dramatic.

Example 12-9. Non-atomic store test

typedef unsigned long long counter_t;
counter_t x;

for (counter_t i = 0, iterations = 10'000'000 * multiplier;
     i < iterations; ++i)
    x = i;

If std::atomic were implemented with an operating system mutex, as it might be on some small processors, the difference in performance might be several orders of magnitude. std::atomic must thus be used with some knowledge of the target hardware.

If you’re wondering what the single quotes are doing in the numeric constant, the ability to put group-separators in a numeric constant is one of several small but wonderful additions in C++14. Some people may find the C++14 changes small and fussy. I find them very helpful.

On Deck: Future C++ Concurrency Features

The developer community shows a tremendous interest in concurrency, driven no doubt by the availability and the need to make use of rapidly growing computer resources. Many developers became experienced with thread-based concurrency using either native calls or the C-style functions of the POSIX Threads (pthreads) library. Some of these developers built C++-style wrappers around the native calls. The best of these efforts have emerged in public, gathering user communities and cross-pollinating one another. Many proposals for new C++ concurrency features are now working their way through the standardization process. C++17 is likely to have far more extensive concurrency support—but nothing is certain about the future except that it is coming.

This is just a sample of what may arrive by C++17:

Cooperative multithreading

In cooperative multithreading, two or more software threads pass execution among themselves via explicit statements so that only one thread is actually running at a time. Coroutines are an example of cooperative multithreading.

Cooperative multithreading has a few key advantages:

· Each thread can maintain its context when not actively executing.

· Since only one thread runs at a time, variables are never accessed concurrently, so mutual exclusion is not necessary.

Coroutines appear to be coming in C++17, and are available right now in Boost. A bunch of building blocks for various innovative concurrency schemes are listed in a recent proposal from the C++ Concurrency TR working group.

SIMD instructions

SIMD is an acronym for single instruction multiple data. In processors supporting SIMD, certain instructions operate on a vector of registers. The processor performs the same action simultaneously on each register in the vector, reducing overhead versus scalar operations.

The C++ compiler does not normally generate SIMD instructions because their behavior is complex and not well matched to the way C++ describes programs. Compiler-dependent pragmas or inline assembly features allow SIMD instructions to be inserted into functions, which are then provided in libraries for specialized tasks like digital signal processing or computer graphics. SIMD programming is thus both processor- and compiler-dependent. Among the many web references regarding SIMD instructions is a Stack Exchange Q&A that contains lots of references about using SIMD in C++.
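
As a small, processor- and compiler-dependent illustration (assuming an x86 target with SSE and the <immintrin.h> intrinsics accepted by Visual C++, GCC, and Clang), this sketch adds two float arrays four elements at a time:

#include <immintrin.h>

// n is assumed to be a multiple of 4 in this sketch.
void add_arrays(float* dst, const float* a, const float* b, unsigned n) {
    for (unsigned i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);              // load 4 floats (unaligned)
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(dst + i, _mm_add_ps(va, vb));   // 4 additions in one instruction
    }
}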

Optimize Threaded C++ Programs

TANSTAAFL: There ain’t no such thing as a free lunch.

First used in print in the article “Economics in Eight Words” in the El Paso Herald-Post on June 27, 1938. Many geeky people recognize the quote from Robert A. Heinlein’s 1966 novel The Moon Is a Harsh Mistress.

Desktop-class microprocessors in widespread use as of early 2016 have a few cores with highly pipelined execution units and multiple levels of cache memory. Their architectures are appropriate for high-performance execution of a few threads of control. Executing many threads requires frequent, expensive context switches.

This architecture must be held in mind when designing concurrent programs. Attempting to force-fit a data-parallel, fine-grained model of concurrency onto current desktop processor designs can produce programs that use concurrency inefficiently.

In the bright future of personal hover-cars and cities in the clouds, programming languages will automatically parallelize programs in an effective way. Until then, it’s up to individual developers to find tasks that can be done concurrently. Although opportunities for concurrency are as diverse as are programs, there are some reliable places to look for threadable code. I’ve listed a few of these in the following sections.

The threading behavior of a program may be wired deeply into its structure. It can thus be more difficult to modify threading behavior than to optimize allocation or function calls. That said, there are some design practices for concurrent programs that optimize performance. Some of these practices have evolved only in the past few years, so they may not be familiar even to experienced users.

Prefer std::async to std::thread

From a performance standpoint, a significant problem with std::thread is that each invocation starts a new software thread. Starting a thread has both direct and indirect costs that make this operation very expensive:

· The direct cost includes calling into the operating system to allocate space for the thread in the operating system’s tables, allocate memory for the thread’s stack, initialize the thread’s register set, and schedule the thread to run. If the thread gets a new scheduling quantum, there is a delay before it begins to run. If it gets the rest of the invoking thread’s scheduling quantum, there is a delay while storing the invoking thread’s registers.

· An indirect cost of creating threads is an increase in the amount of memory used. Each thread must reserve storage for its own stack of called functions. If large numbers of threads are started and stopped frequently, this can cause thrashing in the cache as threads executing on the computer compete for access to a limited cache.

· Another indirect cost is incurred when the number of software threads exceeds the number of hardware threads. All threads then begin to slow down as they must be scheduled by the operating system.

I measured the cost of starting and stopping a thread by running the program fragment in Example 12-10 in a loop. The void function within the thread returns immediately. This is the briefest possible function. The call to join() causes the main program to wait until the thread finishes, causing thread invocations to be laid end-to-end with no overlap. This would be bad concurrency design if it didn’t have a specific goal of measuring the cost of starting threads.

Example 12-10. Starting and stopping std::thread

std::thread t;
t = std::thread([]() { return; });
t.join();

In fact, this test probably underrepresents the cost of thread invocation, because the threads don’t write anything into the cache. But the results are still significant: 10,000 thread invocations took around 1,350 milliseconds, or about 135 microseconds per thread start and stop, on Windows. This is many thousands of times the cost of executing the lambda. std::thread is a fabulously expensive way to perform a brief computation, even if it could be done concurrently.

There is no escaping this cost because it is a latency. It is the time it takes before the call to std::thread’s constructor returns and the time before the call to join() ends. Even if the program detaches the threads instead of waiting to join them, the test still consumes over 700 milliseconds.

A useful optimization in concurrent programming is to reuse threads instead of creating new threads for each use. A thread can pend on some condition variable until needed, be released, and then execute a callable object. While some of the costs of switching threads (saving and restoring registers and flushing and refilling the cache) are the same, other costs, like allocating memory for the thread and scheduling it in the operating system, are eliminated or reduced.

The template function std::async() runs a callable object in the context of a thread, but the implementation is allowed to reuse threads. The standard hints that std::async() may be implemented using a thread pool. On Windows, std::async() is significantly faster. I ran the code fragment in Example 12-11 in a loop to measure the improvement.

Example 12-11. Starting and stopping with async()

std::async(std::launch::async, []() { return; });

std::async() returns a std::future, which in this case is an unnamed temporary. The program calls the destructor of this unnamed std::future as soon as std::async() returns. Because the future comes from std::async(), its destructor blocks until the asynchronous task completes (any exception the task threw is stored in the future and, since get() is never called, discarded). No explicit join() or detach() call is required. Example 12-11, like Example 12-10, lays the thread executions out end-to-end.

The improvement was dramatic: 10,000 invocations of the simple lambda took only 86 milliseconds, about 14 times faster than if a new thread was started each time.
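
The timing tests above discard their futures immediately, but a future can also carry a result back to the caller. Here is a sketch (not one of the numbered examples) that sums half of a vector on another thread via std::async() while the calling thread sums the other half:

#include <future>
#include <numeric>
#include <vector>

long long parallel_sum(const std::vector<int>& v) {
    auto mid = v.begin() + v.size() / 2;
    auto upper = std::async(std::launch::async,
                            [&] { return std::accumulate(mid, v.end(), 0LL); });
    long long lower = std::accumulate(v.begin(), mid, 0LL);  // overlaps with the task
    return lower + upper.get();   // get() waits for the task and rethrows any exception
}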

Create as Many Runnable Threads as Cores

Older treatises on concurrency advise creating as many threads as are convenient, in a manner similar to creating dynamic variables. This thinking reflects an antique world where threads competed for the attention of a single processor. In the modern world of multicore processors, this advice is too simplistic.

The optimizing developer may distinguish two kinds of threads, with different implications for performance:

Runnable threads compute continuously

A runnable thread consumes 100% of the computing resources of the core upon which it runs. If there are n cores, scheduling a runnable thread on each core can reduce wall clock run time almost to 1/n. However, once a thread is running on each available core, scheduling additional threads produces no further improvement in run time. Instead, the threads simply chop up the available hardware pie into smaller and smaller slices. In fact, there is a limit to how finely access can be sliced, beyond which a time slice is all starting and stopping, with no time in the middle for computation. As the number of runnable threads increases, overall performance drops off and can ultimately approach zero.

Waitable threads wait for an external event, then compute briefly

A waitable thread consumes just a few percent of the computational resources available on a core. Scheduling multiple waitable threads on a core uses a greater fraction of the available resources if execution of the waitable threads is interleaved so that the computation on one waitable thread happens during the wait on another waitable thread. This reduces wall clock run time up to a saturation point at which all computational resources are in use.

In the single-core past, there was no benefit from scheduling computations as runnable threads. All performance gains came from interleaving waitable threads with other threads. This is no longer the case when there are multiple cores.

C++ provides a function called std::thread::hardware_concurrency() that returns the number of available cores. This function accounts for cores allocated to other virtual machines by hypervisors, and cores that behave as two or more logical cores due to simultaneous multithreading. It allows a program to scale to hardware containing more (or fewer) cores as time passes.
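
A minimal sketch of using this value to size a program's complement of runnable threads (the fallback of 2 is an arbitrary choice of mine for the case where the count is unknown):

#include <thread>

unsigned choose_thread_count() {
    unsigned n = std::thread::hardware_concurrency();
    return n != 0 ? n : 2;   // the call may return 0 if the count cannot be determined
}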

To test the effect of multiple threads on performance, I wrote a function, timewaster() (Example 12-12), that repeatedly performs a time-consuming computation. The computation takes 3,087 milliseconds if all the iterations are performed in a single loop in the main program.

Example 12-12. An iterated waste of computer time

void timewaster(unsigned iterations) {
    for (counter_t i = 0; i < iterations; ++i)
        fibonacci(n);
}

I then wrote a function to create threads that each ran timewaster(), dividing the number of iterations performed by the number of threads (Example 12-13). I used this function to test performance as I varied the number of threads.

Example 12-13. A multithreaded time waster

void multithreaded_timewaster(
    unsigned iterations,
    unsigned threads)
{
    std::vector<std::thread> t;
    t.reserve(threads);
    for (unsigned i = 0; i < threads; ++i)
        t.push_back(std::thread(timewaster, iterations/threads));
    for (unsigned i = 0; i < threads; ++i)
        t[i].join();
}

Some observations:

· std::thread::hardware_concurrency() returns 4 on my hardware. As expected, the fastest time, 1,870 milliseconds, came by running four threads.

· Somewhat surprisingly, the best time was closer to half (not a quarter) of the single-thread time.

· Although the general trend was for experiments with more than four threads to take longer, there was considerable variation from run to run, so that the result was not reliable.

There is, of course, a practical obstacle to limiting the number of threads launched by a program. In a big program with multiple developers, or with third-party libraries, it may be hard to know how many threads have been launched elsewhere in the program. Threads are created at need, anywhere in the program. Although the operating system serves as a container that knows all the threads that are running, this knowledge is not made explicit in C++.

Implement a Task Queue and Thread Pool

The solution to the problem of not knowing how many threads are running is to make threading explicit: provide a thread pool, a data structure that contains a fixed number of long-lived threads, and a task queue, a data structure that contains a list of computations to perform, which are serviced by the threads in the thread pool.

In task-oriented programming, the program is written as a collection of runnable task objects that are executed by threads from the thread pool. When a thread becomes available, it fetches a task from the task queue. When the thread finishes the task, it does not terminate; it either runs the next task or blocks, pending the arrival of a new task to be executed.

There are several advantages to task-oriented programs:

· Task-oriented programs can efficiently handle I/O completion events from nonblocking I/O calls, achieving high processor utilization.

· Having a thread pool and task queue eliminates the overhead of starting threads for short-lived tasks.

· Task-oriented programming centralizes asynchronous processing in one set of data structures, so that it is easy to limit the number of threads in use.

A disadvantage of task-oriented programs is a problem called inversion of control. Instead of the flow of control being specified by the program, it becomes implicit in the order of event messages received. This can make task-oriented programs difficult to reason about or debug, though my personal experience is that this confusion rarely occurs in practice.

Standard thread pools and task queues are coming to C++, probably in C++17. They are available today in the Boost library and Intel’s Threading Building Blocks.
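
To make the idea concrete, here is a deliberately simplified sketch of a thread pool fed by a task queue, built only from standard library parts. It omits futures for results, exception handling, and work stealing, all of which the library versions provide.

#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class thread_pool {
public:
    explicit thread_pool(unsigned nthreads) {
        for (unsigned i = 0; i < nthreads; ++i)
            workers_.emplace_back([this] { run(); });
    }
    ~thread_pool() {
        {
            std::unique_lock<std::mutex> lock(m_);
            done_ = true;
        }
        cv_.notify_all();
        for (auto& t : workers_) t.join();
    }
    void submit(std::function<void()> task) {
        {
            std::unique_lock<std::mutex> lock(m_);
            tasks_.push(std::move(task));
        }
        cv_.notify_one();
    }
private:
    void run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(m_);
                cv_.wait(lock, [this] { return done_ || !tasks_.empty(); });
                if (done_ && tasks_.empty()) return;
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task();   // run the task outside the critical section
        }
    }
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> tasks_;
    std::vector<std::thread> workers_;
    bool done_ = false;
};

A program creates one pool, sized by std::thread::hardware_concurrency(), and calls submit() from anywhere, so the number of threads in use is fixed and known.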

Perform I/O in a Separate Thread

The physical realities of spinning disks and distant network connections create delay between the time a program requests data and the time the data becomes available. I/O is thus a perfect place to look for concurrency. It is also typical of I/O that a program must transform the data before it can be written, or after it is read. For instance, a file of XML data may be read from the Internet. Then the data is parsed to extract information interesting to the program. Because the data is not usable until after it has been transformed, the whole process, including reading and parsing, becomes an obvious candidate to move into a separate thread.
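
A sketch of this pattern using std::async() (the file name and the trivial "parse" step, counting lines, are placeholders for whatever transformation the program really performs):

#include <algorithm>
#include <cstddef>
#include <fstream>
#include <future>
#include <sstream>
#include <string>
#include <utility>

// Read a whole file and "parse" it entirely in a worker thread.
std::pair<std::string, std::size_t> load_and_parse(const std::string& name) {
    std::ifstream in(name);
    std::ostringstream contents;
    contents << in.rdbuf();                              // the slow part: the read
    std::string text = contents.str();
    auto lines = static_cast<std::size_t>(
        std::count(text.begin(), text.end(), '\n'));     // the "parse" step
    return std::make_pair(text, lines);
}

// In the caller:
//   auto fut = std::async(std::launch::async, load_and_parse, std::string("data.xml"));
//   ... other work proceeds while the worker reads and parses ...
//   auto result = fut.get();   // rendezvous only when the data is needed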

Program Without Synchronization

Synchronization and mutual exclusion slow down multithreaded programs. Getting rid of this synchronization can improve performance. There are three easy ways to program without explicit synchronization, and one hard way:

Event-oriented programming

In event-oriented programming, the program is written as a collection of event-handling functions that are invoked from a framework. The underlying framework dispatches each event from an event queue to the handler function registered for that event. Event-oriented programming is similar in many ways to task-oriented programming. The framework in an event-oriented program acts like the task scheduler, and the event handlers are similar to tasks. The important difference is that in an event-oriented program, the framework is single-threaded, and the event-handling functions are not concurrent.

There are several advantages to event-oriented programs:

· Because the underlying framework is single-threaded, no synchronization is required.

· Event-oriented programs can efficiently handle I/O completion events from nonblocking I/O calls. An event-oriented program achieves the same high processor utilization as a multithreaded program does.

As with task-oriented programs, the chief disadvantage of event-oriented programs is a problem called inversion of control, where the flow of control becomes implicit in the order of event messages received. This can make event-oriented programs difficult to reason about or debug.

Coroutines

Coroutines are executable objects that explicitly hand off execution from one object to another, but remember their execution pointers so they can resume if called again. Like event-oriented programs, coroutines are not truly multithreaded, so no synchronization is required unless they are controlled by multiple threads.

Two kinds of coroutines can be distinguished. The first kind has its own stack, and can relinquish control to another coroutine at any point in execution. The second kind borrows the stack of another thread, and can relinquish control only at its top level.

Coroutines may arrive by C++17.

Message passing

In a message-passing program, threads of control take input from one or more sources, transform it, and put it on one or more output sinks. The connected outputs and inputs form a graph with well-defined entry and exit nodes. The items read and written by the thread implementing each stage of a message-passing program may be implemented as network datagrams, character I/O streams, or data structures in explicit queues.

Unix command-line pipelines and web services are examples of message-passing programming. The components of a distributed processing system are message-passing programs.

Among the advantages of message-passing programs are:

· Synchronization of the output of each stage to the input of the next stage is implicit. It is either handled by the operating system when the stages communicate with datagrams, or provided within a queue connecting the stages. The concurrency of the system happens outside the stages, so stages can usually be considered to be single-threaded code that may block on input and output actions.

· Because the outputs of one stage are each connected to a single input of the next stage, issues of starvation and fairness occur less frequently.

· Synchronization occurs less frequently, on larger units of data. This increases the fraction of time that multiple threads can run concurrently.

· Because pipeline stages do not share variables, they don’t slow down for mutexes and memory fences.

· Larger units of work can be passed between pipeline stages, so that each stage uses whole time slices instead of stopping and starting for mutual exclusion. This improves processor utilization.

Disadvantages of message-passing programs include:

· Messages are not inherently object-oriented. The C++ developer must write code to marshal input messages into member function calls.

· Error recovery can be an issue when a pipeline stage crashes.

· Not every problem has an obvious solution as a pipeline of independent programs passing messages.

STOP AND THINK

This book discusses the C++ facilities for sharing variables and synchronizing threads, because these facilities must be baked into the C++ language to function efficiently. Message-passing programs use libraries outside of C++, so this book only spends a few words on them. That doesn’t mean message passing is unimportant.

A large and vocal design community thinks explicit synchronization and shared variables are a bad idea: complex, race-prone, and unscalable. These developers think the only scalable way to program concurrently is with pipelines.

The highly concurrent architectures of GPUs don’t even provide shared memory, so for these processors a message-passing program design is required.

Lock-free programming

Hic Sunt Dracones (English translation: Here Be Dragons)

Inscription on the Hunt-Lenox Globe (c. 1503–07), implying dangerous unknown shores

Lock-free programming refers to programming practices that permit multithreaded updates to data structures without the need for mutual exclusion. In a lock-free program, expensive mutexes are replaced by atomic operations synchronized by the hardware. Lock-free data structures can perform significantly better than conventional containers guarded by a mutex, especially when many threads access the same container.

Lock-free array, queue, and hash-table container classes in C++ have been released publicly. Boost has lock-free stack and queue containers, but they have been tested only on the GCC and Clang compilers. Intel’s Threading Building Blocks has lock-free array, queue, and hash-map containers. These containers are not identical to the equivalent C++ standard library containers due to the requirements of lock-free programming.

Lock-free data structures are insanely difficult to reason about. Even well-known experts argue about the correctness of published algorithms. For this reason, I recommend sticking to widely published and well-supported code rather than attempting to build lock-free data structures on your own.
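
The flavor of lock-free code, at its very simplest, is a compare-and-exchange retry loop. The sketch below (my own, and far simpler than a lock-free container) records a new sample into a shared running maximum without a mutex:

#include <atomic>

void update_maximum(std::atomic<int>& maximum, int value) {
    int prev = maximum.load();
    while (prev < value &&
           !maximum.compare_exchange_weak(prev, value)) {
        // On failure, compare_exchange_weak reloads prev with the current value;
        // the loop exits as soon as the store succeeds or another thread has
        // already published a value at least as large.
    }
}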

Remove Code from Startup and Shutdown

A program can launch as many threads as needed to perform tasks concurrently or make use of multiple CPU cores. One part of the program, however, is quite difficult to run concurrently: the code executed before main() gets control, and after main() exits.

Before main() starts, all variables with static storage duration (“Storage Duration of Variables”) are initialized. For plain-old-data types, this initialization has zero cost. The linker points the variables at the initialization data. But for variables of class type and static storage duration, the initialization process calls each variable’s constructor serially, in a single thread, in a particular order defined by the standard.

It is easy to forget that variables such as strings, which are initialized with static string constants, actually execute code during initialization. Likewise, constructors that contain function calls or non-const expressions in their initializer lists also are executed at run time, even if the constructor body is empty.

These costs, which may each be small when considered in isolation, can add up dramatically, causing large programs to be unresponsive for several seconds after startup.
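
One widely used way to move such work out of startup (a sketch of a general technique, not a prescription) is to construct the object on first use inside a function, so nothing runs before main() and the cost is paid only if the table is actually needed:

#include <string>
#include <vector>

// At namespace scope, this constructor runs before main(), serially:
//   const std::vector<std::string> keyword_table = { "alignas", "alignof", "asm" };

// Constructed on first call instead; since C++11 this initialization is thread-safe.
const std::vector<std::string>& keywords() {
    static const std::vector<std::string> table = { "alignas", "alignof", "asm" };
    return table;
}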

OPTIMIZATION WAR STORY

Google’s Chrome browser is a massive program, developed by hundreds of people over a period of years. The number of tables that must be initialized in Chrome is mind-boggling. To ensure that startup performance does not degrade (further), the Chromium project administrators added a rule to the review procedure requiring approval for every static variable initialized with executable code.

Make Synchronization More Efficient

Synchronization is the overhead cost of shared-memory concurrency. Reducing this overhead cost is an important pathway to optimal performance.

The received wisdom is that synchronization primitives such as mutexes are expensive. The truth, of course, is complicated. On both Windows and Linux, for instance, the mutex upon which std::mutex is based is a hybrid design that busy-waits on an atomic variable for a short period of time, then pends on an operating system signal if it cannot rapidly acquire the mutex. The cost of the “happy path,” where no other thread holds the mutex, is low. The cost when the thread must pend on the signal is measured in milliseconds. The important cost of a mutex is the cost of waiting for another thread to release it, not the cost of calling lock().

Multithreaded programs most frequently run into trouble when there are many threads contending to acquire the same mutex. When contention is low, holding the mutex usually doesn’t slow other threads down. When contention is high, any contended mutex causes threads to run one at a time, spoiling the developer’s attempt to execute the parts of a program concurrently.

Concurrent C++ programs are significantly more complex than single-threaded ones. It is harder to create examples or test cases that produce meaningful results, so I must fall back on heuristics in this section.

Reduce the Scope of Critical Sections

A critical section is a region of code bounded by acquiring a mutex and releasing that mutex. During execution of a critical section, no other thread can access shared variables controlled by that mutex. And that’s the problem in a nutshell. If the critical section does anything but access shared variables, other threads have to wait for no good reason.

To illustrate this problem, look again at Example 12-7. The two lambdas producer and consumer are run concurrently in two threads. Both threads immediately attempt to enter the monitor controlled by mutex m. Both threads remain in the monitor except when they are in cv.wait(). producer sets the shared variable shared_data to the next value in sequence. consumer clears shared_data. But producer and consumer each do one other thing: they log one line of text to cout.

As written, cv_example can produce somewhere between 35 and 45 updates to shared_data in 1,000 milliseconds, on my i7 system compiled with Visual Studio 2015 in release mode. It may not immediately be obvious whether this is a lot or a little. But if I comment out the two output statements, cv_example can produce 1.25 million updates. Putting characters on the console performs a lot of behind-the-scenes magic, so this isn’t too surprising in retrospect.

There are two lessons here. Lesson number one is that the monitor concept, where code is always in the monitor except while waiting on a condition variable, can be difficult to use efficiently. Lesson number two is that performing I/O in a critical section does not lead to optimal performance.
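
A sketch of the second lesson, shrinking the critical section so that formatting and I/O happen outside the lock (the names here are mine, not those of Example 12-7):

#include <iostream>
#include <mutex>
#include <string>

std::mutex m;
int shared_data = 0;

void update_and_log(int value) {
    // Expensive work (formatting) happens outside the critical section...
    std::string line = "shared_data set to " + std::to_string(value) + "\n";
    {
        std::lock_guard<std::mutex> lock(m);
        shared_data = value;     // ...only the shared access is inside it.
    }
    std::cout << line;           // I/O also outside the lock; output from different
                                 // threads may interleave, but there is no data race
}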

Limit the Number of Concurrent Threads

As mentioned in “Create as Many Runnable Threads as Cores”, the number of runnable threads should be less than or equal to the number of processor cores to eliminate the overhead of context switching. To provide a second reason why, I first have to explain how mutexes are implemented.

The mutex class provided by Windows, Linux, and most other modern operating systems is a hybrid design optimized for multicore processors. A thread t1 attempting to acquire a mutex that is not locked acquires the mutex immediately. If t1 attempts to acquire a mutex that is held by another thread, t2, t1 first busy-waits for a limited amount of time. If t2 releases the mutex before the busy-wait times out, t1 acquires the mutex and continues to run, efficiently using its whole time slice. If t2 does not release the mutex in time, t1 pends on an operating system event, giving up its time slice. t1 goes on the operating system’s “pending” list.

If there are more threads than cores, only some threads are assigned to cores and actually progress at any instant of time. The remaining threads wait on the operating system’s “runnable” queue, and are eventually given a time slice. The operating system wakes on a periodic interrupt to make decisions about what to run. This interrupt is slow compared to the rate at which individual instructions are executed. A thread on the “runnable” queue may thus wait many milliseconds before the operating system allocates it a core on which to run.

If t2 holds the mutex but is waiting on the operating system’s “runnable” queue instead of actually executing, it cannot release the mutex. When t1 attempts to acquire the mutex, its busy-wait times out, and t1 pends on an operating system event, which means t1 gives up its time slice and its core and goes on the operating system’s “pending” list. Eventually t2 gets a core to run on and releases the mutex, signaling the event on which t1 is pending. The operating system notes that the event has occurred, and moves t1 to the “runnable” queue. But the operating system does not necessarily allocate a core to run t1 immediately.5 The operating system runs the threads currently assigned to cores for their full time slices, unless they block on an event. The operating system doesn’t allocate a core to the newly runnable t1 until some other thread exhausts its time slice, which may take many milliseconds.

It is this slow path the developer attempts to eliminate by limiting the number of active threads. Avoiding the slow path means that instead of hundreds of interlocked operations per second, threads t1 and t2 can perform millions of interlocked operations per second.

The ideal number of threads contending for a brief critical section is two. When there are only two threads, there are no issues of fairness or starvation, and no chance of the thundering herd problem described in the following section.

Avoid the Thundering Herd

A phenomenon called the thundering herd occurs when many threads are pending on an event, such as the availability of work, that only one thread can service. When the event occurs, all the threads become runnable, but only a few of them can run immediately because there are only a few cores. One of these takes the work item. The operating system moves the rest to the runnable queue, and eventually runs each one. Each thread discovers that the signaled event has already been serviced, and pends again on the event, spending time on logistics without making progress.

Avoiding the thundering herd may be as easy as limiting the number of threads created to service the event. Two threads may be better than one, but 100 threads is probably not better (see “Limit the Number of Concurrent Threads”). Limiting the number of threads may be easier for software designs that implement work queues than for designs that associate a thread with each work item.

Avoid Lock Convoys

A lock convoy occurs when many threads synchronize, pending on a resource or critical section. This causes additional congestion because they all attempt to proceed at once and are made to progress one at a time, as if in a convoy.

In a simple case, a thundering herd occurs over and over. There are enough threads contending for a mutex that many threads pend on the mutex’s operating system signal. When the thread holding the mutex releases it, the event is signaled, and all the pending threads become runnable. The first thread to get the processor locks the mutex again. All the remaining threads eventually get the processor, check the mutex, see it is still locked, and go back to pending. The overall effect is that the operating system spends a lot of time restarting threads, but most threads don’t progress. Worse yet, all the threads are still synchronized. They will all wake at once when the next thread releases the mutex, and repeat the cycle.

In a more complex case, a thundering herd of threads all attempt to acquire a second mutex or perform some action, like reading a file, that is bottlenecked by the physical properties of the device. Since threads are synchronized, they all attempt to access the second resource at about the same time. Performance is degraded because the threads request the same resource at the same time, causing serialization. If they had not been synchronized, they might all have progressed.

Lock convoys are visible as systems that usually run well, but occasionally seem to become unresponsive for a few seconds at a time.

Reducing the number of threads or scheduling threads to start at different times may relieve the lock convoy, though there is always a risk that it will simply emerge in another place. Sometimes it is best simply to acknowledge that certain groups of tasks can’t be performed concurrently because they share some hardware device or other bottleneck.

Reduce Contention

Threads in a multithreaded program may contend over resources. At any point where two or more threads need the same resource, mutual exclusion causes a thread to pend, losing concurrency. There are several techniques to address the problem of contention:

Be aware that memory and I/O are resources

Not every developer realizes that the memory manager is a resource. The memory manager must serialize access in multithreaded systems, or its data structures will get corrupted. A large number of threads all attempting to allocate dynamic variables (std::string is a particular offender) at once may experience performance that falls off a cliff suddenly as the number of threads increases.

File I/O is a resource. The disk drive can only be in one place at a time. Attempting to do I/O on multiple files simultaneously can cause performance to fall off suddenly.

Network I/O is also a resource. The Ethernet connector is a narrow pipe through which bits squirt. It is possible for modern processors to saturate even gigabit Ethernet cables. Saturating a WiFi connection is easy.

When there is a performance problem, it is time to take a step back and ask, “What is the whole program doing right now?” Logging that is no problem most of the time may be slowing the whole system down if something else is also banging away at the disk or network interface. That dynamic data structure may not scale to many threads.

Duplicate resources

Instead of multiple threads contending for a resource such as a shared map or hash table, sometimes it is possible to duplicate the table so that each thread has an unshared copy, eliminating contention. Although maintaining two copies of a data structure is more work, it may still result in reduced wall clock time versus a shared data structure.

Even hardware resources like disk drives and network interface cards can be duplicated to improve throughput.

Partition resources

Instead of multiple threads contending for a single data structure, sometimes it is possible to partition the data structure so that each thread only accesses the portion of the data that it works with.

Fine-grained locking

Instead of locking an entire data structure with a single mutex, multiple mutexes can be used. For example, in a hash table, one mutex can lock the hash table backbone against modification (such as inserting and deleting entries), and another mutex can lock the entries against modification. Reader/writer locks may be a good choice here. To access a hash table entry, a thread takes a read lock on the backbone, and either a read or write lock on the entry. To insert or delete an entry, the thread takes a write lock on the backbone. (A minimal sketch of this scheme appears after this list.)

Lock-free data structures

Use of a lock-free data structure such as a hash table relieves the need for mutual exclusion. This is the ultimate extent to which fine-grained locking can be taken.

Resource scheduling

Some resources, such as a hard drive, are not susceptible to duplication or partitioning. But it is still possible to schedule disk activity so that it doesn’t all happen at once, or so that accesses to adjacent parts of the disk are performed together. The operating system may schedule reads and writes at a fine-grained level, but the program can sequence operations like reading configuration files so that they don’t all happen at the same time.
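
Here is the minimal sketch promised under “Fine-grained locking”: a map with a reader/writer lock on the backbone and a plain mutex per entry. It uses std::shared_timed_mutex from C++14, and the class and its two operations are illustrative, not a complete container.

#include <mutex>
#include <shared_mutex>
#include <string>
#include <unordered_map>

struct entry {
    std::mutex m;          // guards value
    std::string value;
};

class locked_table {
public:
    // Modify an existing entry: shared (read) lock on the backbone,
    // exclusive lock on the one entry being touched.
    bool set(const std::string& key, const std::string& v) {
        std::shared_lock<std::shared_timed_mutex> backbone(backbone_m_);
        auto it = map_.find(key);
        if (it == map_.end()) return false;
        std::lock_guard<std::mutex> entry_lock(it->second.m);
        it->second.value = v;
        return true;
    }
    // Insert a new entry: exclusive (write) lock on the backbone.
    void insert(const std::string& key) {
        std::unique_lock<std::shared_timed_mutex> backbone(backbone_m_);
        map_[key];   // default-construct the entry if not present
    }
private:
    std::shared_timed_mutex backbone_m_;
    std::unordered_map<std::string, entry> map_;
};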

Don’t Busy-Wait on a Single-Core System

C++ concurrency features allow developers to implement high-performance synchronization primitives that do busy-waits. But busy-waiting is not always a good idea.

On a single-core processor, the only way to synchronize threads is to invoke the operating system’s synchronization primitives. A busy-wait is laughably ineffective. In fact, busy-waiting causes the waiting thread to waste its whole time slice, because the thread holding the mutex cannot run to complete its critical section until the waiting thread gives up the processor.

Don’t Wait Forever

What happens to a thread that waits unconditionally for an event? If the program runs properly, perhaps nothing. But what if the user attempts to stop the program? The user interface shuts down, but the program doesn’t stop, because there are threads still running. If main() tries to join the waiting thread, it hangs. If the waiting thread is detached, main() may exit. What happens then depends on how the thread is waiting. If it’s waiting for a flag to be set, it waits forever. If it’s waiting on an operating system event, it waits forever. If it’s waiting on a C++ object, then it very much depends on whether some nonblocked thread deletes the object. That might cause the waiting thread to terminate. Or not.

Waiting forever is the enemy of error recovery. It is the difference between a program that mostly works but sometimes is flaky, and a program that has solid, dependable behavior that is reassuring to users.
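
One common pattern (a sketch, not a prescription) is to replace an unconditional wait with a bounded wait that periodically checks a shutdown flag:

#include <atomic>
#include <chrono>
#include <condition_variable>
#include <mutex>

std::mutex m;
std::condition_variable cv;
bool data_ready = false;
std::atomic<bool> stop_requested(false);   // set by the shutdown path

// Returns true when data arrives, false if shutdown was requested first.
bool wait_for_data() {
    std::unique_lock<std::mutex> lock(m);
    while (!data_ready) {
        if (stop_requested)
            return false;                                      // give up cleanly
        cv.wait_for(lock, std::chrono::milliseconds(100));     // never waits forever
    }
    return true;
}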

Rolling Your Own Mutex May Be Ineffective

It is not difficult to code a simple class that acts as a mutex, busy-waiting until another thread updates an atomic variable. Such a class may even be faster than the system-provided mutex when contention is low and critical sections are brief. However, the mutexes that come with an operating system often know secrets about the operating system and the way it schedules tasks that improve their performance or avoid priority inversion issues on that particular operating system.

The design of robust mutexes is informed by the design of the operating system they must run on. Rolling your own is not a broad pathway to optimization.
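
For reference, the kind of class this section has in mind looks something like the following sketch: correct as far as it goes, and sometimes faster than std::mutex under light contention, but ignorant of the scheduler and priority issues that system mutexes handle.

#include <atomic>

class spin_mutex {
public:
    void lock() {
        while (flag_.test_and_set(std::memory_order_acquire))
            ;    // busy-wait; on a single core this burns the whole time slice
    }
    void unlock() {
        flag_.clear(std::memory_order_release);
    }
private:
    std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
};

// Works with std::lock_guard<spin_mutex>, since it provides lock() and unlock().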

Limit Producer Output Queue Length

In a producer/consumer program, any time the producer is faster than the consumer, data builds up in the queue between producer and consumer. Some of the many problems with this situation include:

· The producer contends for the processor, memory allocator, and other resources, further slowing the consumer and exacerbating the problem.

· The producer will eventually consume all system memory resources, causing the entire program to terminate unexpectedly.

· If the program is designed to recover from exceptions, it may need to process all queued data before restarting, which increases the time to recover.

This situation is especially likely when a program sometimes takes input from a streaming source whose maximum rate limits how often the producer runs, and sometimes takes input from a fixed source like a file, where the producer can run continuously.

The solution is to limit the queue size and block the producer when the queue is full. The queue only needs to be big enough to smooth any variation in the consumer’s performance. Frequently only a few entries are necessary. Any extra queue entries only cause the producer to run further and further ahead, increasing resource consumption without contributing to concurrency.
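
A sketch of such a bounded queue: push() blocks the producer whenever the queue already holds max_size items, so the producer can never run far ahead of the consumer. The class and its capacity parameter are illustrative.

#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>

template <typename T>
class bounded_queue {
public:
    explicit bounded_queue(std::size_t max_size) : max_size_(max_size) {}
    void push(T item) {                       // called by the producer
        std::unique_lock<std::mutex> lock(m_);
        not_full_.wait(lock, [this] { return q_.size() < max_size_; });
        q_.push(std::move(item));
        not_empty_.notify_one();
    }
    T pop() {                                 // called by the consumer
        std::unique_lock<std::mutex> lock(m_);
        not_empty_.wait(lock, [this] { return !q_.empty(); });
        T item = std::move(q_.front());
        q_.pop();
        not_full_.notify_one();               // wake a producer blocked on a full queue
        return item;
    }
private:
    std::mutex m_;
    std::condition_variable not_full_, not_empty_;
    std::queue<T> q_;
    const std::size_t max_size_;
};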

Concurrency Libraries

There are a number of well-regarded concurrency libraries. The developer looking to implement a message-passing style of concurrency would be well advised to take up one of these tools. Threading Building Blocks in particular offers some features for concurrency that are not yet in the C++ standard. Options include:

Boost.Thread and Boost.Coroutine

Boost’s Thread library is an anticipation of the C++17 standard library thread library. Some parts are still very experimental. Boost.Coroutine is also experimental.

POSIX Threads

POSIX Threads (pthreads) is a cross-platform library of threads and synchronization primitives—possibly the oldest and most widely used library for concurrency. Pthreads is a C-style function library that provides traditional capabilities. It is extensively documented, widely available on Linux distributions, and also available for Windows.

Threading Building Blocks (TBB)

TBB is an ambitious, well-documented C++ threading API with a template flavor. It provides parallel for loops, tasks and thread pools, concurrent containers, data-flow message-passing classes, and synchronization primitives. TBB was developed by Intel to promote effective use of multicore microprocessors. It is now open source, has extensive documentation including at least one good book (Intel Threading Building Blocks by James Reinders, also from O’Reilly), and runs on both Windows and Linux.

0mq (also spelled ZeroMQ)

0mq is a communication library for connecting message-passing programs. It supports a variety of communication paradigms and strives for efficiency and parsimony. My personal experience with this library has been very positive. 0mq is open source, well documented, and actively supported. There is also a reimagining of 0mq called nanomsg that is said to address some issues in 0mq.

Message Passing Interface (MPI)

MPI is an API specification for message passing in a distributed network of computers. Implementations have the flavor of a C-style function library. MPI has its origin at California’s Lawrence Livermore National Laboratory, a place long associated with supercomputer clusters and with the kind of high-energy physics that goes boom. MPI is well documented in an old-fashioned 1980s DoD style. Implementations exist for both Linux and Windows, including one from Boost, but they don’t always cover the full specification.

OpenMP

OpenMP is an API for “multi-platform shared-memory parallel programming in C/C++ and Fortran.” In use, a developer decorates a C++ program with pragmas that define its parallel behavior. OpenMP provides a fine-grained model of concurrency with an emphasis on numerical computation, and is evolving toward programming of GPUs. OpenMP is available on Linux using both GCC and Clang, and on Windows using Visual C++.

C++ AMP

C++ AMP is an open specification of a C++ library designed to perform parallel data computations on GPU devices. The version from Microsoft resolves to DirectX 11 calls.

Summary

· A multithreaded C++ program is sequentially consistent if it contains no races.

· A large and vocal design community thinks explicit synchronization and shared variables are a bad idea.

· Performing I/O in a critical section does not lead to optimal performance.

· The number of runnable threads should be less than or equal to the number of processor cores.

· The ideal number of threads contending for a brief critical section is two.

1 Yes, yes, I have heard of GPUs. When millions of developers are programming directly on GPUs, they will be mainstream. That time has yet to arrive.

2 Some operating systems permit designated blocks of memory to be shared between processes. These mechanisms for sharing memory among processes are arcane and operating system–dependent, and are not covered here.

3 Edsger W. Dijkstra, “Cooperating Sequential Processes (EWD-123),” E.W. Dijkstra Archive, Center for American History, University of Texas at Austin (September 1965).

4 See C.A.R. Hoare, “Monitors: An Operating System Structuring Concept,” Communications of the ACM 17 (Oct 1974): 549–557.

5 I’m skipping over a lot of arcane details here. Windows fiddles thread priorities to give the newly runnable thread a turn sooner. Other operating systems do other things in hopes of reducing the duration of critical sections. The takeaway should be that once the operating system gets involved in a critical section, it’s going to be a long wait.