Optimized C++ (2016)
Chapter 2. Computer Behavior Affecting Optimization
Lying, the telling of beautiful untrue things, is the proper aim of Art.
Oscar Wilde, “The Decay of Lying,” Intentions (1891)
The purpose of this chapter is to provide the bare minimum of background information about computer hardware to motivate the optimizations described in this book, to prevent readers going mad poring over 600-page processor handbooks. It takes a superficial look at processor architecture to extract some heuristics to guide optimization efforts. The very impatient reader might skip this chapter for now and return when other chapters refer to it, though the information here is important and useful.
Microprocessor devices in current use are incredibly diverse. They range from sub-$1 embedded devices with just a few thousand gates and clock rates below 1 MHz to desktop-class devices with billions of gates and gigahertz clocks. Mainframe computers can be the size of a large room, containing thousands of independent execution units and drawing enough electrical power to light a small city. It is tempting to believe that nothing connects this profusion of computing devices that would lead to usable generalizations. In reality, though, they have useful similarities. After all, if there were no similarities, it would not be possible to compile C++ code for the many processors for which there are compilers.
All computers in wide use execute instructions stored in memory. The instructions act upon data that is also stored in memory. The memory is divided into many small words of a few bits each. A few precious memory words are registers, which are directly named in machine instructions. Most memory words are named by a numerical address. A particular register in each computer contains the address of the next instruction to execute. If the memory is like a book, the execution address is like a finger pointing to the next word to read. An execution unit (also called processor, core, CPU, computer, and a bunch of other names) reads a stream of instructions from the memory and acts upon them. The instructions tell the execution unit what data to read (load, fetch) from memory, how to act upon the data, and what results to write (store, save) to memory. The computer is made up of devices that obey physical laws. It takes some nonzero amount of time to read or write each memory address, and time for an instruction to transform the data upon which it acts.
Beyond this basic outline, familiar to any first-year computer science major, the family tree of computer architectures grows in a riot of profusion. Because computer architectures are so variable, rigorous numerical rules regarding hardware behavior are hard to come by. Modern processors do so many different, interacting things to speed up instruction execution that instruction timing has become for all intents and purposes nondeterministic. Couple that with the problem that many developers don’t even know precisely what processors their code will run on, and heuristics are the most you can expect.
Lies C++ Believes About Computers
Of course, a C++ program at least pretends to believe in a version of the simple model of the computer laid out in the previous section. There is memory, addressable in char-sized bytes, which is essentially infinite. A special address called nullptr exists, which is different from any valid memory address. Integer 0 converts to nullptr, though nullptr is not necessarily at address 0. There is a single conceptual execution address pointing to the source code statement that is currently being executed. Statements are executed in the order in which they are written, subject to the effects of C++ flow of control statements.
C++ knows that computers are really more complex than this simple model. It offers a few glimpses under the cover at the glittering machinery within:
· A C++ program only has to behave “as if” statements were executed in order. The C++ compiler and the computer itself may change the execution order to make the program run faster, as long as the meaning of any computation is not changed.
· As of C++11, C++ no longer believes that there is only a single execution address. The C++ standard library now comes with facilities for starting and stopping threads, and synchronizing access to memory between threads. Before C++11, programmers lied to the C++ compiler about their threads, sometimes leading to difficult-to-debug problems.
· Certain memory addresses may be device registers instead of ordinary memory. The values at these addresses may change during the interval between two consecutive reads of the location by the same thread, indicating some change in the hardware. Such locations are described to C++ asvolatile. Declaring a variable volatile requires the compiler to fetch a new copy of the variable every time it is used, rather than optimizing the program by saving the value in a register and reusing it. Pointers to volatile memory may also be declared.
· C++11 offers a magical spell called std::atomic<> that causes memory to behave for just a moment as if it really was a simple linear store of bytes, wishing away all the complexities of modern microprocessors, with their multiple threads of execution, multilayer memory caches, and so on. Some developers think this is what volatile is for. They are mistaken.
The operating system also lies to programs and their users. In fact, the whole purpose of an operating system is to tell each program a set of very convincing lies. Among the most important lies the OS wants each program to believe are that the program is alone on the computer, that the physical memory is infinite, and that there are an infinite number of processors available for running the program’s threads.
The operating system uses the computer’s hardware to hide its lies, so that the C++ program really has no choice but to believe them. These lies don’t generally affect the running program very much except to slow it down. However, they can complicate performance measurements.
The Truth About Computers
Only the simplest microprocessors and certain historical mainframe computers correspond directly to the C++ model. Importantly for optimization, the actual memory hardware of real computers is very slow compared to the instruction execution rate, the memory is not really accessed in bytes, the memory is not a simple linear array of identical cells, and it has a finite capacity. Real computers may have more than one instruction address. Real computers are fast, not because they execute each instruction quickly, but because they overlap execution of many instructions at the same time and contain intricate circuitry to ensure that overlapping instructions behave as if they execute one after the other.
Memory Is Slow
The main memory of a computer is very slow compared to its internal gates and registers. Pouring electrons out of the microprocessor chip into the relative vastness of a copper circuit board trace, and then pushing them down the trace to a memory chip several centimeters away, takes thousands of times as long as moving electrons across the minute distances separating transistors within the microprocessor. Main memory is so slow that a desktop processor can execute hundreds of instructions in the length of time it takes to fetch a single data word from main memory.
The implication for optimization is that access to memory dominates other costs in a processor, including the cost of executing instructions.
THE VON NEUMANN BOTTLENECK
The interface to main memory is a choke point that limits execution speed. This choke point even has a name. It’s called the von Neumann bottleneck, after famous computer architecture pioneer and mathematician John von Neumann (1903–1957).
For example, a PC using DDR2 memory devices running at 1,000 MHz (typical of computers a few years ago, and easy to compute with) has a theoretical bandwidth of 2 billion words per second, or 500 picoseconds (ps) per word. But that doesn’t mean the computer can read or write a random word of data every 500 picoseconds.
First, only sequential accesses can complete in one cycle (one-half of a tick of the 1,000 MHz clock). Access to a nonsequential location completes in somewhere between 6 and 10 cycles.
Several activities contend for access to the memory bus. The processor continually fetches memory containing the next instructions to be executed. The cache memory controller fetches data memory blocks for the cache, and flushes written cache lines. The DRAM controller also steals cycles to refresh the charge in the dynamic RAM cells of the memory device. The number of cores in a multicore processor is sufficient to guarantee that the memory bus is saturated with traffic. The actual rate at which data can be read from main memory into a particular core is more like 20–80 nanoseconds (ns) per word.
Moore’s Law makes it possible to put more cores in a microprocessor each year. But it does little to make the interface to main memory faster. Thus, doubling the number of cores in the future will have a diminishing effect on performance. The cores will all be starved for access to memory. This looming limit on performance is called the memory wall.
Memory Is Not Accessed in Bytes
Although C++ believes every byte is separately accessible, computers often compensate for slow physical memory by fetching data in bigger chunks. The smallest processors may fetch a byte at a time from main memory. Desktop-class processors may fetch 64 bytes at once. Some supercomputers and graphics processors fetch even more.
When C++ fetches a multibyte data type like an int, double, or pointer, it may be that the bytes making up that data span two physical memory words. This is called an unaligned memory access. The implication for optimization is that an unaligned access takes twice as long as if all the bytes were in the same word, because two words must be read. The C++ compiler works to align structs so that every field begins on a byte address that is a multiple of the field’s size. But this creates its own problem: “holes” in the structs containing unused data. Paying attention to the size of data fields and their order in the struct can result in structs that are as compact as possible while still being aligned.
Some Memory Accesses Are Slower than Others
To further compensate for slow main memory, many computers contain cache memory, a kind of fast temporary storage located very close to the processor, to speed up access to the most frequently used memory words. Some computers have no cache; others have one or several levels of cache, each smaller, faster, and more expensive than the one before. When an execution unit requests bytes from a cached memory word, the bytes may be fetched rapidly without accessing main memory again. How much faster is cache memory? A rule of thumb is that each level of cache memory is about 10 times faster than the one below it in the memory hierarchy. On desktop-class processors, the time cost of a memory access can vary across five orders of magnitude depending on whether the data being accessed is in first-, second-, or third-level cache, in main memory, or on a virtual memory page on disk. This is one reason why obsessing over instruction clock cycles and other arcana is so often maddening and unhelpful; the state of the cache makes instruction execution times largely nondeterministic.
When an execution unit needs to fetch data that is not in the cache, some data that is currently in cache must be discarded to make room. The data selected to be discarded is generally the least recently used data. This matters to optimization because it means heavily used memory locations can be accessed more quickly than less heavily used locations.
Reading even one byte of data that is not in cache causes many nearby bytes to be cached as well (as a consequence, this also means that many bytes currently in the cache are discarded). These nearby bytes are then ready for quick access. For optimization, this is important because it meansmemory in adjacent locations can be accessed more quickly (on average) than memory in distant locations.
In C++ terms, the implication is that a block of code containing a loop may execute faster because instructions making up the loop are heavily used and close together, and thus likely to stay in cache. A block of code containing function calls or if statements that causes execution to hop around may execute more slowly because individual parts of the code are less heavily used, and not close together. Such a code block uses up more cache space than a tight loop. If the program is large and the cache is finite, some of the code must be discarded from the cache to make room for other things, slowing access the next time that code is needed. Similarly, a data structure consisting of consecutive locations, like an array or vector, may be faster to access than a data structure consisting of nodes linked by pointers, because data in consecutive locations remains in a small number of cache locations. Accessing a data structure consisting of records connected by pointers (e.g., a list or tree) may be slower because of the need to read data for each node from main memory into new cache lines.
Memory Words Have a Big End and a Little End
A single byte of data can be fetched from memory, but often what happens is that several consecutive bytes are fetched together to form a number. For instance, in Microsoft’s Visual C++, four bytes fetched together form an int. Because the same memory can be accessed in two different ways, the people who design computers have to answer a question. Does the first byte—the one whose byte address is smallest—form the most significant bits of the int, or the least significant bits?
At first it seems like this couldn’t possibly matter. Of course, it’s important that every part of a given computer agrees on which end of the int goes at the lowest address, or chaos reigns. And sure, it’s possible to tell the difference. If the int value 0x01234567 is stored at addresses 1000–1003, and the little end is stored first, address 1000 holds the byte 0x01 and address 1003 holds the byte 0x67, whereas if the big end is stored first, address 1000 holds 0x67 and the 0x01 is at address 1003. Computers that read the most significant bits at the first byte address are called big-endian. Little-endian computers read the least significant bits first. There are two ways to store an integer (or pointer), and no reason to choose one over the other, so different teams working on different processors for different companies may choose differently.
The problem comes when data written on disk or sent over a network by one computer must be read by a different computer. Disks and networks send a byte at a time, not a whole int at a time. So it matters which end is stored or sent first. If the sending computer and the receiving computer don’t agree, the value sent as 0x01234567 is received as 0x67452301, a very different value indeed.
Endian-ness is just one of the reasons why C++ cannot specify the way bits are laid out in an int, or how setting one field in a union affects the other fields. It’s one reason why programs can be written that work on one kind of computer but crash on another.
Memory Has Finite Capacity
The memory in a computer is not, in fact, infinite. To maintain the illusion of infinite memory, the operating system can use the physical memory like a cache memory, and store data that doesn’t fit into physical memory as a file on disk. This scheme is called virtual memory. Virtual memory produces the illusion of more physical memory. However, retrieving a block of memory from disk takes dozens of milliseconds—an eternity for a modern computer.
Making cache memory fast is expensive. There may be gigabytes of main memory in a desktop computer or smartphone, but only a few million bytes of cache. Programs and their data don’t generally fit into the cache.
One effect of all this caching and virtual memory is that due to caching, a particular function running in the context of a whole program may run slower than the same function running in a test harness 10,000 times to measure its performance. In the context of the whole program, the function and its data are not likely to be in cache. In the context of the test, they generally are. This effect magnifies the benefit of optimizations that reduce the use of memory or disk, while leaving unchanged the benefit of optimizations that reduce code size.
A second effect of caching is that if a big program is making scattered accesses to many memory locations, there may not be sufficient cache memory to hold the data being immediately used by the program. This causes a kind of performance meltdown called page thrashing. When page thrashing happens in a microprocessor’s internal cache, the result is reduced performance. When it happens in the operating system’s virtual memory file, performance drops a thousandfold. This problem was more common when physical memory was less plentiful, but it still happens.
Instruction Execution Is Slow
Simple microprocessors of the sort embedded in coffee makers and microwave ovens are designed to execute instructions as fast as they can be fetched from memory. Desktop-class microprocessors have additional resources to process many instructions concurrently, so they are capable of executing instructions many times faster than they can be fetched from main memory, relying on a fast cache memory to feed their execution units most of the time. The significance of this to optimization is that memory access dominates the cost of computation.
Modern desktop-class computers execute instructions at an awesome rate if nothing gets in the way. They can complete an instruction every few hundred picoseconds (a picosecond is 10-12 seconds, a ridiculously short time). But that doesn’t mean that each instruction only takes picoseconds to perform. The processor contains a “pipeline” of instructions that it is working on concurrently. Instructions work their way through the pipeline, being decoded, getting their arguments, performing their computations, and saving their results. The more powerful the processor, the more complicated this pipeline becomes, breaking instruction execution into a dozen phases so that more instructions can be concurrently processed.
If instruction A computes a value needed by instruction B, then instruction B can’t do its computation until instruction A produces its result. This causes a pipeline stall, a brief pause in instruction execution that occurs because execution of the two instructions can’t be fully overlapped. The pipeline stall is particularly long if instruction A fetches a value from memory, then performs a computation that produces a value needed by instruction B. Pipeline stalls defeat all the fancy machinery of the microprocessor, making it nearly as slow as the processor in your toaster from time to time.
Making Decisions Is Hard for Computers
Another thing that can cause a pipeline stall is when the computer needs to make a decision. After most instructions, execution continues with the instruction at the next memory address. Most of the time, this next instruction is already in cache. Consecutive instructions can be fed into the pipeline as soon as the first pipeline stage becomes available.
Transfer-of-control instructions are different. A jump or jump-to-subroutine instruction changes the execution address to an arbitrary new value. The “next” instruction cannot be read from memory and put in the pipeline until the execution address is updated, some time during processing of the jump instruction. The memory word at the new execution address is less likely to be in cache. The pipeline stalls while the execution address is updated and the new “next” instruction is loaded into the pipeline.
After a conditional branch instruction, execution continues in one of two different places: either the next instruction or else the instruction at the address that is the branch target, depending upon the result of some previous computation. The pipeline stalls until all instructions involved in the previous computation complete, and remains stalled while the next execution address is determined and the value at that address is fetched.
The significance of this for optimization is that computation is faster than decision.
There Are Multiple Streams of Program Execution
Any program running on a modern operating system shares the computer with other programs running at the same time, with periodic maintenance processes checking the disk or looking for Java or Flash updates, and with the various parts of the operating system that control the network interface, disks, sound devices, accelerometers, thermometers, and other peripherals. Every program competes with other programs for the computer’s resources.
A program generally doesn’t notice this too much. It just runs a little slower. An exception occurs, however, when many programs start simultaneously and all compete for the memory and disk. For performance tuning, if a program must run at startup, or at times of peak load, performance must be measured under load.
As of early 2016, desktop computers have up to 16 processor cores. The microprocessors used in phones and tablets have up to eight. A quick look at the Windows Task Manager, Linux’s process status output, or Android’s task list typically shows there are a great many more software processes than this, and most processes have multiple threads of execution. The operating system runs each thread for a short time, then switches contexts to another thread or process. To the program, it is as if one statement took a nanosecond, and the next statement took 60 milliseconds.
What does it mean to switch contexts? If the operating system is switching from one thread to another thread in the same program, it means saving the processor’s registers for the thread being suspended, and loading the saved registers for the thread being resumed. The registers of a modern processor contain hundreds of bytes of data. When the new thread resumes execution, its data may not be in cache, so there is an initial period of slow execution while the new context is loaded into cache. There is thus a considerable cost to switching thread contexts.
When the operating system switches context from one program to another, the procedure is even more expensive. All dirty cache pages (ones with written data that has not reached main memory) must be flushed to physical memory. All the processor registers are saved. Then the physical-to-virtual memory page registers in the memory manager are saved. Next, the physical-to-virtual memory page registers for the new process are reloaded, and the processor registers for the new process are reloaded. Finally, execution can resume. But the cache is empty, so there is an initial period of slow performance and heavy memory contention while the cache is refilled.
When a program must wait for some event to occur, it may continue to wait even after the event occurs until the operating system makes a processor available to continue the program. This can make the program run time longer and more variable when the program runs in the context of other programs competing for the computer’s resources.
The execution units of a multicore processor and their associated cache memories act more or less independently of one another to achieve better performance. However, all execution units share the same main memory. Execution units must compete for access to the hardware linking them to main memory, making the von Neumann bottleneck even more limiting in a computer with multiple execution units.
When an execution unit writes a value, the value goes first into cache memory. It eventually has to be written through the cache all the way to main memory, so that the value is visible to other execution units. But because of contention for access to main memory among the execution units, main memory may not be updated until hundreds of instructions after the execution unit changes a value.
If the computer has multiple execution units, one execution unit may thus not see the data written by another execution unit change in main memory for an extended period of time, and changes to main memory may not occur in the same order as instruction execution order. Depending upon unpredictable timing factors, an execution unit may see either the old value of a shared memory word or the updated value. Special synchronization instructions must be used to ensure that threads running in different execution units see a consistent view of memory. The significance of this to optimization is that it’s much slower to access data shared between threads of execution than unshared data.
Calling into the Operating System Is Expensive
All but the smallest processors have hardware to enforce isolation between programs, so that program A can’t read, write, or execute in physical memory belonging to program B. The same hardware protects the kernel of the operating system from being overwritten by programs. The OS kernel, on the other hand, needs to access memory belonging to every program, so the programs can make system calls into the operating system. Some operating systems also allow programs to make requests to share memory. The many ways system calls occur and shared memory is arranged are varied and arcane. The significance for optimization is that system calls are expensive; hundreds of times more expensive than function calls within a single thread of one program.
C++ Tells Lies Too
The biggest lie C++ tells its users is that the computer it runs on has a simple, consistent structure. In exchange for pretending to believe this lie, C++ lets developers program without having to learn the intimate details of every microprocessor device, like they would have to do to program in brutally honest assembly language.
All Statements Are Not Equally Expensive
In the peaceful, bygone days of Kernighan and Ritchie’s C programming, every statement was about as expensive as every other statement. A function call might contain an arbitrarily complex computation. However, an assignment statement generally copied something that fit in a machine register into something else that fit into a machine register. Thus, the statement:
int i,j;
...
i = j;
copied 2 or 4 bytes from j to i. The declaration could be int or float or struct bigstruct *, but the assignment statement still did about the same amount of work.
This is no longer true. In C++, assigning one int to another is exactly the same amount of work as the corresponding C statement. But a statement like BigInstance i = OtherObject; can copy whole structures. More significantly, this kind of assignment invokes a constructor member function of BigInstance, which can hide arbitrarily complex machinery. A constructor is also invoked for each expression passed to a function’s formal arguments, and again as the function returns a value. Arithmetic operators and comparison operators can be overloaded too, so A=B*C;may multiply n-dimensional matrices, and if (x<y) ... may compare two paths through a directed graph of arbitrary complexity. The importance of this to optimization is that some statements hide large amounts of computation. The form of the statement does not tell how expensive it is.
Developers who learned C++ first may not find this rule surprising. But for those who learned C first, their instincts can lead them disastrously astray.
Statements Are Not Executed in Order
C++ programs behave as if they were executed in order, subject to C++ flow of control statements. The weasel-words “as if” in the previous sentence are the foundation upon which many compiler optimizations, and many tricks of modern computer hardware, are built.
Under the hood, of course, the compiler can and does sometimes reorder statements to improve performance. But the compiler knows that a variable must contain an up-to-date result of any computation assigned to it before it is tested or assigned to another variable. Modern microprocessors may also choose to execute instructions out of order, but they contain logic to ensure that writes to memory happen before subsequent reads of the same location. Even the microprocessor’s memory control logic may choose to delay writes to memory for optimal use of the memory bus. But the memory controller knows which writes are currently “in flight” from the execution unit through cache memory to main memory, and makes sure the in-flight value is used if the same address is subsequently read.
Concurrency complicates this picture. C++ programs are compiled without knowledge of other threads that may be running concurrently. The C++ compiler does not know which variables, if any, are shared between threads. The combined effect of the compiler reordering statements and the computer delaying writes to main memory damages the illusion that statements are executed in order when a program contains concurrent threads that share data. The developer must add explicit synchronizing code to multithreaded programs to ensure consistently predictable behavior. Thesynchronizing code reduces the amount of concurrency obtainable when concurrent threads share data.
Summary
· Access to memory dominates other costs in a processor.
· An unaligned access takes twice as long as if all the bytes were in the same word.
· Heavily used memory locations can be accessed more quickly than less heavily used locations.
· Memory in adjacent locations can be accessed more quickly than memory in distant locations.
· Due to caching, a function running in the context of a whole program may run slower than the same function running in a test harness.
· It’s much slower to access data shared between threads of execution than unshared data.
· Computation is faster than decision.
· Every program competes with other programs for the computer’s resources.
· If a program must run at startup, or at times of peak load, performance must be measured under load.
· Every assignment, function argument initialization, and function return invokes a constructor, a function that can hide arbitrary amounts of code.
· Some statements hide large amounts of computation. The form of the statement does not tell how expensive it is.
· Synchronizing code reduces the amount of concurrency obtainable when concurrent threads share data.