Optimized C++ (2016)
Chapter 1. An Overview of Optimization
The world has a stupendous appetite for computation. Whether the code you are writing runs on a watch, phone, tablet, workstation, supercomputer, or planet-spanning network of data centers, there are many programs that must run flat-out all the time. So it may not be enough merely to accurately convert the cool idea in your head into lines of code. It may not be enough even to comb the code for defects until it runs correctly all the time. Your application may be sluggish on the kind of hardware your customers can afford. Your hardware team may have saddled you with a tiny processor to meet their power consumption goals. You may be battling a competitor over throughput or frames per second. Or you may be building out to planetary scale and just slightly nervous about boiling the oceans. Enter optimization.
This book is about optimization—specifically, optimizing C++ programs, with particular reference to patterns of behavior of C++ code. Some of the techniques in this book are applicable to other programming languages, but I have made no attempt to explain the techniques in a universal way. Some optimizations that are effective on C++ code have no effect or are simply not possible in other languages.
This book is about taking correct code that embodies best practices of C++ design, and changing it into correct code that still embodies good C++ design but also runs faster and consumes fewer resources on pretty much any computer. Many opportunities for optimization occur because some C++ features, used casually, run slowly and consume many resources. Such code, while correct, is ill-considered, given just a little general knowledge about modern microprocessor devices or a little thought about the costs of various C++ constructs. Other optimizations are available because of the fine control C++ offers over memory management and copying.
This book is not about laboriously coding assembly language subroutines, counting clock cycles, or learning how many instructions Intel’s latest silicon stylings can dispatch concurrently. There exist developers who work with a single platform (the Xbox is a good example) for several years, and have the time and the need to master these dark arts. However, the great majority of developers target phones, tablets, or PCs, which contain an endless variety of microprocessor chips—some of which have not yet even been designed. Developers of software embedded into products also face a variety of processors with widely varying architectures. Attempting to learn processor arcana will make most developers crazy and paralyze them with indecision. I don’t recommend this path. Processor-dependent optimization is simply not fruitful for most applications, which must by definition run on a variety of processors.
This book is likewise not about learning the fastest operating system–dependent way to do some operation on Windows, and again on Linux, OS X, and every embedded operating system. It is about what you can do in C++, including with the C++ standard library. Breaking out of C++ to perform an optimization may make it hard for peers to review or comment on the optimized code. It’s not an action to be taken lightly.
This book is about learning how to optimize. Any static catalog of techniques or functions is doomed, as new algorithms are discovered and new language features become available. Instead, this book provides several running examples of how code may be improved incrementally, so that the reader becomes familiar with the code tuning process and develops the mindset that makes optimization fruitful.
This book is about optimizing the coding process as well. Developers mindful of the runtime cost of their code can write code that is efficient from the start. With practice, writing fast code usually takes no longer than writing slow code.
Finally, this book is about performing miracles; about checking in a change and later hearing a colleague exclaim in surprise, “Wow, what happened? It just started right up. Who fixed something?” Optimization is something you can also do to your status as a developer and your personal pride in your craft.
Optimization Is Part of Software Development
Optimization is a coding activity. In traditional software development processes, optimization takes place after Code Complete, during the integration and testing phase of a project, when the performance of a whole program can be observed. In an Agile development process, one or more sprints may be allocated to optimization after a feature with performance goals is coded, or as needed to meet performance targets.
The goal of optimization is to improve the behavior of a correct program so that it also meets customer needs for speed, throughput, memory footprint, power consumption, and so on. Optimization is thus as important to the development process as coding features is. Unacceptably poor performance is the same kind of problem for users as bugs and missing features.
One important difference between bug fixing and performance tuning is that performance is a continuous variable. A feature is either coded, or it is not. A bug is either present or absent. However, performance can be very bad or very good or somewhere in between. Optimization is also an iterative process in which each time the slowest part of the program is improved, a new slowest part appears.
Optimization is very much an experimental science, calling for a scientific mindset to a greater degree than some other coding tasks. To be successful at optimization requires observing behavior, forming testable hypotheses based on these observations, and performing experiments that result in measurements that either support or refute the hypotheses. Experienced developers often believe they have valid experience and valid intuitions about optimal code. But unless they frequently test their intuitions, they will frequently be wrong. My personal experiences writing the test programs for this book revealed several results that contradicted my intuitions. Experimentation, rather than intuition, is a theme of this book.
Optimization Is Effective
It is difficult for developers to reason about the effects of individual coding decisions on the overall performance of a large program. Thus, practically all complete programs contain significant opportunities for optimization. Even code produced by experienced teams with plenty of time can often be sped up by a factor of 30%–100%. For more rushed or less experienced teams, I have seen performance improvements of 3 to 10 times. A speedup of more than 10x by tweaking code is less likely. However, selection of a better algorithm or data structure can be the difference between whether a feature can be deployed or is infeasibly slow.
It’s OK to Optimize
Many treatments of optimization begin with a stern warning: don’t! Don’t optimize, and if you have to optimize, don’t do it until the end of the project, and don’t do any more optimizing than you must. For instance, famous computer scientist Donald Knuth said of optimization:
We should forget about small efficiencies, say about 97 percent of the time: premature optimization is the root of all evil.
Donald Knuth, Structured Programming with go to Statements, ACM Computing Surveys 6(4), December 1974, p268. CiteSeerX: 10.1.1.103.6084
Or this from William A. Wulf:
More computing sins are committed in the name of efficiency (without necessarily achieving it) than for any other single reason—including blind stupidity.
“A Case Against the GOTO,” Proceedings of the 25th National ACM Conference (1972): 796
The advice not to optimize has become received wisdom, unquestioned even by many experienced coders, who wince reflexively when conversation turns to performance tuning. I think cynical promotion of this advice has been used too often to excuse weak development habits and avoid the small amount of analysis that might result in vastly faster code. I also think uncritical acceptance of this advice is responsible for a lot of wasted CPU cycles, a lot of frustrated user-hours, and too much time spent reworking code that should have been more efficient from the start.
My advice is less dogmatic. It is OK to optimize. It is OK to learn efficient programming idioms, and to apply them all the time, even if you don’t know what code is performance-critical. These idioms are good C++. You won’t be disrespected by your peers for having used them. If somebody asks why you didn’t write something “simple” and inefficient, you may say, “It takes the same amount of time to write efficient code as slow, wasteful code. Why would anyone choose deliberately to write inefficient code?”
What’s not OK is to make no progress for days because you can’t decide which algorithm will be better, when you don’t know that it matters. What’s not OK is to spend weeks coding something in assembler because you guess it might be time-critical, then spoil the whole effort by calling your code as a function when the C++ compiler might have inlined it for you. What’s not OK is to demand that your team code half their program in C because “everybody knows C is faster,” when you do not in fact know either that C really is faster, or that C++ isn’t fast. In other words, all the best practices of software development still apply. Optimization isn’t an excuse to break the rules.
It’s not OK to waste a bunch of extra time on optimization when you don’t know where you have performance problems. Chapter 3 introduces the 90/10 rule, the notion that only about 10% of a program’s code is performance-critical. It is thus neither necessary nor helpful to change every line of a program in order to improve the program’s performance. Since only 10% of the program has a significant impact on performance, your chances of picking a good starting point at random are poor. Chapter 3 provides tools to help determine where the hot spots in the code are.
When I was in college, my professors warned that optimal algorithms might have a higher startup cost than simple ones. They should therefore only be used on large data sets. Although that’s possibly true for some esoteric algorithms, my experience is that optimal algorithms for simple searching and sorting tasks take little time to set up, and they give a performance improvement even on small data sets.
I have also been advised to develop programs using whatever algorithm is easiest to code, and then go back and optimize it if the program runs too slowly. While it is undeniably good advice to continue making progress, once you have coded an optimal search or sort a couple of times, it is no more difficult to get running than a slower algorithm. You might as well do it right the first time and debug only one algorithm.
Received wisdom is, in fact, probably the single greatest enemy of performance improvement efforts. For instance, “everybody knows” that an optimal sort algorithm runs in O(n log n) time, where n is the size of the data set (see “Time Cost of Algorithms” for a brief review of the big-O notation and time cost). This received wisdom is valuable to the extent that it keeps developers from believing their O(n2) insertion sort is optimal, but not so good if it prevents them from checking the literature to find out that radix sort is faster at O(n logr n) (where r is the radix or number of sorting buckets), that flashsort has even faster O(n) performance on randomly distributed data, or that quicksort, which received wisdom makes the benchmark against which other sorts are measured, has painfully bad O(n2) worst-case performance. Aristotle said that women had fewer teeth than men (The History of Animals, Book II, part 1), a bit of received wisdom that stood for 1,500 years before someone was curious enough to count the teeth in a few mouths. The antidote to received wisdom is the scientific method, in the form of experimentation. Chapter 3 covers instruments to measure software performance and experiments to validate optimizations.
There is also received wisdom in the software development world that optimization is not relevant. The reasoning goes that even if your code runs slowly today, each new year brings faster processors, solving your performance problems for free through the passage of time. Like most received wisdom, this nugget of thought was never really true. It might have seemed true in the 1980s and 1990s, when desktop computers and standalone applications dominated the development scene and single-core processors doubled in speed every 18 months. But while today’s multicore processors continue to become more powerful in the aggregate, the performance of individual cores improves only gradually, or sometimes even declines. Programs today must also run on mobile platforms, where battery life and heat dissipation constrain the instruction execution rate. Furthermore, while time may bring new customers with faster computers, it does nothing to improve the performance of existing hardware. Existing customers’ workloads are all that is increasing with time. The only speed upgrade an existing customer will ever get from your company comes from optimizing subsequent releases. Optimization keeps your program fresh.
A Nanosecond Here, a Nanosecond There
A billion here, a billion there, pretty soon you’re talking real money.
Frequently misattributed to Senator Everett Dirkson (1898–1969), who claims he never said this, though he admits saying a lot of things like it
Desktop computers are astonishingly fast. They can dispatch a new instruction every nanosecond (or better). That’s every 10-9 seconds! It is seductively appealing to believe that optimization cannot possibly matter when a computer is so fast.
The problem with this way of thinking is that the faster the processor is, the faster wasted instructions pile up. If 50% of the instructions executed by a program are unnecessary, that program can be made to run twice as fast by removing them, no matter how rapidly each unnecessary instruction is executed.
Your colleagues who say “efficiency doesn’t matter” may also mean it doesn’t matter for certain applications, which are human response–bound and run on desktop computers that are already very fast. Efficiency matters very much on small embedded and mobile processors with memory, power, or speed constraints. It also matters on servers running flat-out on big computers. Another way to say this is that efficiency matters for any application that must contend for constrained resources (memory, power, CPU cycles). Efficiency matters a great deal, too, any time your workload is big enough to be spread across multiple computers. In this case, efficiency can be the difference between spending on 100 servers or cloud instances versus 500 or 1,000.
In 50 years, computer performance has improved by six orders of magnitude. And yet, here we are talking about optimization. If the past is any guide, optimization will likely remain relevant far into the future.
Summary of Strategies for Optimizing C++ Code
Round up the usual suspects.
Capt. Louis Renault (Claude Rains), Casablanca, 1942
The mix of features in C++ provides a continuum of implementation choices ranging from hands-off automation and expressiveness on one hand, to increasingly fine control over performance on the other hand. It is this degree of choice that makes it possible to tune C++ programs to meet requirements for performance.
C++ has its “usual suspects” for optimization hot spots, including function calls, memory allocation, and loops. The following is a summary list of ways to improve the performance of C++ programs, which is also an outline of this book. The advice is shockingly simple. All of it has been published before. But of course, the devil is in the details. The examples and heuristics in this book will help you better recognize optimization opportunities when you see them.
Use a Better Compiler, Use Your Compiler Better
C++ compilers are complex software artifacts. Each compiler makes different decisions about what machine code to generate for C++ statements. They see different opportunities for optimization. They will produce different executables from the same source code. If you are squeezing the last iota of performance from your code, it may be worthwhile to try several compilers to see if one produces a faster executable for your code.
The most important advice about which C++ compiler to pick is use a compiler conforming to C++11. C++11 implements rvalue references and move semantics that eliminate many copy operations that were unavoidable with previous C++ versions. (Move semantics are discussed in“Implement Move Semantics”.)
Sometimes using a better compiler means using your compiler better. For instance, if your application seems sluggish, have a look at the compiler switches to see if the optimizer is turned on. This seems totally obvious, but I can’t count the number of times I have given this advice to people who subsequently admitted that, yes, their code ran much faster when compiled with optimization turned on. In many cases, you need go no further. The compiler alone may make your program several times faster if you ask it nicely.
By default, most compilers don’t turn on any optimizations. Compile times are slightly shorter without the optimization pass. This was a big deal in the 1990s, but nowadays compilers and computers are so fast that the extra cost is insignificant. Debugging is also simpler with the optimizer turned off because the flow of execution exactly follows the source code. The optimizer may move code out of loops, remove some function calls, and remove some variables altogether. Some compilers won’t emit debug symbols at all when optimization is turned on. Other compilers are more generous, but understanding what the program is doing by watching the flow of execution in the debugger can be challenging. Many compilers allow individual optimizations to be turned on and off in a debug build without affecting debugging too much. Just turning on function inlining can have a significant impact on a C++ program because good C++ style includes writing many small member functions to access the member variables of each class.
The documentation that comes with a C++ compiler contains an extensive description of the optimization flags and pragmas available. This documentation is like the owner’s manual that comes with a new car. You can get in your new car and drive it without reading the manual, but there is a lot of information there that might help you use this big, complicated tool more effectively.
If you are lucky enough to be developing for the x86 architecture on Windows or Linux, you have a choice of several excellent compilers under very active development. Microsoft pushed out three versions of Visual C++ in the five years before this book was written. GCC releases more than one version a year.
As of early 2016, there is reasonable consensus that Intel’s C++ compiler generates the tightest code on both Linux and Windows, that the GNU C++ compiler GCC has lower performance but excellent standards conformance, and that Microsoft’s Visual C++ is in between. I would love to help your decision making by producing a little chart that says Intel C++ generates code that is so many percent faster than GCC, but it depends on your code, and on who has just pushed out a souped-up release. Intel C++ costs over a thousand dollars, but offers a 30-day free trial. There are free express versions of Visual C++. On Linux, GCC is always free. It is inexpensive to perform a little experiment, trying each compiler on your code to see if any give a performance advantage.
Use Better Algorithms
The biggest bang for the optimization buck comes from choosing an optimal algorithm. Optimization efforts can improve the performance of a program in a dramatic fashion. They can put zip into code that seemed sluggish before, just as upgrading your PC makes applications run faster. Unfortunately, just like upgrading your PC, most optimizations improve performance by at most a constant factor. Many optimization efforts give an improvement of 30%–100%. If you are lucky, you might triple your performance. But a quantum leap in performance is unlikely—unless you can locate a more efficient algorithm.
OPTIMIZATION WAR STORY
Back in the days of 8-inch floppies and 1 MHz processors, a certain developer designed a program to manage radio stations. One aspect of this program was to produce a sorted log of the songs played each day. The trouble was, it took about 27 hours to sort a day’s worth of data, which was obviously unacceptable. To get this important sort to run faster, the developer resorted to heroic feats. He reverse engineered the computer and hacked into the microcode by undocumented means. He microcoded an in-memory sort that reduced run time to a still-unacceptable 17 hours. Desperate, he called the computer manufacturer, for whom I worked, for assistance.
I asked the developer what sorting algorithm he was using. He said, “Merge sort.” Merge sort is a member of the family of optimal comparison sorting algorithms. How many records was he sorting? “A few thousand,” he replied. This made no sense. The system he was using should have been able to sort his data in less than an hour.
I thought to ask the developer to describe his sort algorithm in detail. I can no longer recall the tortured words that followed, but it turned out the developer had implemented an insertion sort. Insertion sort is a poor choice, taking time proportional to the square of the number of records to be sorted (see“Time Cost of Sorting Algorithms”). He knew there was something called merge sort, and that it was optimal. He had found a way to describe his insertion sort using the words “merge” and “sort.”
I coded a very conventional sort routine for this customer, using a real merge sort. It sorted his data in 45 minutes.
It is foolish to struggle heroically to optimize a bad algorithm. Learning and using optimal algorithms for searching and sorting is a broad path to optimal code. An inefficient search or sort routine can completely dominate the running time of a program. Code tweaking cuts run time by a constant factor. Switching to a more optimal algorithm can cut run time by a factor that grows bigger the bigger your data set is. Even on small data sets of a dozen items an optimal search or sort can save a bunch of time if the data is searched frequently. Chapter 5, Optimize Algorithmscontains some guidance on what optimal algorithms look like.
Chances to use optimal algorithms come in all sizes, from compact little closed-form computations, to tight keyword search functions, to complex data structures and massive programs. There are many excellent books covering this subject. Whole careers can be spent studying it. I regret that I can only touch lightly on the topic of optimal algorithms in this book.
“Optimization Patterns” covers several important techniques for improving performance; these include precomputation (moving computation from run time to link, compile, or design time), lazy computation (moving computation toward the point where a sometimes unused result is actually needed), and caching (saving and reusing expensive computations). Chapter 7, Optimize Hot Statements adds many examples of these techniques in practice.
Use Better Libraries
The standard C++ template and runtime libraries that come with a C++ compiler must be maintainable, general, and very robust. It may come as a surprise to developers that these libraries are not necessarily tuned for speed. It may come as an even greater surprise that even after 30 years of C++, the libraries that come with commercial C++ compilers still contain bugs, and may not conform to the current C++ standard, or even to the standard in force when the compiler was released. This complicates the task of measuring or recommending optimizations, and makes non-portable any optimization experience developers believe they have. Chapter 8, Use Better Libraries covers these issues.
Mastery of the C++ standard library is a critical skill for the optimizing developer. This book provides recommendations for algorithms for searching and sorting (Chapter 9, Optimize Searching and Sorting), optimal idioms of use of container classes (Chapter 10, Optimize Data Structures), I/O (Chapter 11, Optimize I/O), concurrency (Chapter 12, Optimize Concurrency), and memory management (Chapter 13, Optimize Memory Management).
There are open source libraries for important functions like memory management (see “High-Performance Memory Managers”) that provide sophisticated implementations that can be faster and more capable than what’s provided by a vendor’s C++ runtime library. The advantage of these alternative libraries is that they may be easy to drop into an existing project, and provide an immediate speed improvement.
There are many publicly available libraries from the Boost project and Google Code, among many others, that provide libraries for things such as I/O, windowing, string handling (see “Adopt a novel string implementation”) and concurrency (see “Concurrency Libraries”) that are not drop-in replacements for the standard libraries but offer improved performance and added features. These libraries gain part of their speed advantage from making different design trade-offs than the ones made for the standard library.
Finally, it is possible to develop a project-specific library that relaxes some of the safety and robustness constraints of the standard library in exchange for a speed advantage. All these topics are covered in Chapter 8, Use Better Libraries.
Function calls are expensive in several ways (see “Cost of Function Calls”). Good function library APIs provide functions that reflect the idioms of use of these APIs, so that the user doesn’t have to make unnecessarily frequent calls to the most fundamental functions. For instance, an API that gets characters and provides only a get_char() function requires the user to spend a function call for each character needed. If the API also provides a get_buffer() function, the API can avoid the expense of calling a function for each character.
Function and class libraries are good places to hide the complexity that sometimes comes with highly tuned programs. Libraries should pay your program back for the cost of calling into them by doing work with maximum efficiency. Library functions are quite likely to be at the bottom of deeply nested calling chains, where the effect of improved performance is magnified.
Reduce Memory Allocation and Copying
Reducing calls into the memory manager is such an effective optimization that a developer can be a successful optimizer knowing only this one trick. While the cost of most C++ language features is at most a few instructions, the cost of each call into the memory manager is measured in thousands of instructions.
Because strings are such an important (and costly) part of many C++ programs, I have devoted a whole chapter to them as a case study in optimization. Chapter 4, Optimize String Use: A Case Study introduces and motivates many optimization concepts within the familiar context of string handling. Chapter 6, Optimize Dynamically Allocated Variables is devoted to reducing the cost of dynamic memory allocation without giving up useful C++ programming idioms like strings and standard library containers.
A single call to a buffer-copying function may also consume thousands of cycles. Doing less copying is thus an obvious way to speed up your code. A lot of copying happens in association with memory allocation, so fixing one often does away with the other. Other hot spots for copying are constructors and assignment operators, and input/output. Chapter 6, Optimize Dynamically Allocated Variables covers this subject.
Aside from allocation and function calls, the cost of a single C++ statement is generally insignificant. But execute that same code a million times in a loop, or every time a program processes an event, and suddenly it’s a big deal. Most programs have one or more main event processing loops, and one or more functions that process characters. Identifying and optimizing these loops is almost always fruitful. Chapter 7, Optimize Hot Statements offers some advice on how to find frequently executed code. You can bet it will always be in a loop.
The literature on optimization contains a cornucopia of techniques for using individual C++ statements efficiently. Many programmers believe that knowledge of these tricks is the bread and butter of optimization. The problem with this thinking is that, unless the code is extremely hot (frequently executed), removing one or two memory accesses from it does not make a measurable difference in overall performance. Chapter 3, Measure Performance contains techniques to determine what parts of a program are frequently executed, before attempting to reduce the amount of computation in these places.
It is also the case that modern C++ compilers do a really outstanding job of finding these local improvements. Developers should therefore attempt not to go all OCD on a big code base changing every occurrence of i++ to ++i, unrolling all the loops, and breathlessly explaining to each colleague exactly what Duff’s device is and why it’s so cool. Still, I take a brief look into this cornucopia in Chapter 7, Optimize Hot Statements.
Use Better Data Structures
Selecting the most appropriate data structure has a profound effect on performance. This is partly because the algorithms for inserting, iterating, sorting, and retrieving entries have a runtime cost that depends on the data structure. In addition, different data structures make differing use of the memory manager. It is also partly because the data structure may have good cache locality, or not. Chapter 10, Optimize Data Structures explores the performance, behavior, and trade-offs among data structures provided by the C++ standard library. Chapter 9, Optimize Searching and Sortingdiscusses use of standard library algorithms to implement table data structures on top of simple vectors and C arrays.
Most programs must wait for activities taking place in the tiresome, lumbering world of physical reality to complete. They must wait for files to be read off mechanical disks, pages to be returned from the Internet, or users’ slow fingers to press mechanical key switches. Any time a program’s forward progress is blocked waiting for such an event is a wasted opportunity to perform some other computation.
Modern computers have more than one processor core available to execute instructions. If work is parceled out to several processors, it can be completed sooner.
Along with concurrent execution come tools for synchronizing concurrent threads so that they can share data. These tools can be used well or poorly. Chapter 12, Optimize Concurrency covers some considerations in synchronizing concurrent threads of control efficiently.
Optimize Memory Management
The memory manager, the part of the C++ runtime library that manages the allocation of dynamic memory, is frequently executed code in many C++ programs. C++ has an extensive API for memory management, though most developers have never used it. Chapter 13, Optimize Memory Management shows some techniques for improving the performance of memory management.
This book helps the developer identify and exploit the following opportunities to improve code performance:
· Use better compilers, and turn on the optimizer.
· Use optimal algorithms.
· Use better libraries, and use libraries better.
· Reduce memory allocation.
· Reduce copying.
· Remove computation.
· Use optimal data structures.
· Increase concurrency.
· Optimize memory management.
As I said, the devil is in the details. Let’s press on.