Chapter 7. Optimize Hot Statements

The idea is there, locked inside, and all you have to do is remove the excess stone.

Michelangelo di Lodovico Buonarroti Simoni (1475–1564), in response to the question, “How do you create your masterpieces?”

Optimizing at the statement level can be modeled as a process of removing instructions from the stream of execution, in a manner similar to how Michelangelo described the process of sculpting his masterpieces. The problem with Michelangelo’s advice is that it doesn’t address which part of the stone is excess, and which part is the masterpiece.

The problem with optimizing at the statement level is that, aside from function calls, no C++ statement consumes more than a handful of machine instructions. Focusing on such small-scale optimizations doesn’t generally produce enough improvement to make them worth the effort unless the developer can find factors that magnify the cost of the statement, making it hot enough to be worth optimizing. These factors include:

Loops

The cost of statements within a loop is multiplied by the number of times they are repeated. Loops must be identified by the developer. The profiler may point to a function containing a hot loop, but it won’t say which loop in the function is hot. It may point to a function that is hot because it is called from inside one or more loops, but it won’t say which call sites are hot. Since the profiler does not point directly to the loop, the developer must inspect the code to find the loop, using the profiler output as a source of clues.

Frequently called functions

The cost of the function is multiplied by the number of times the function is called. A profiler points directly to hot functions.

Idioms used throughout the program

This is a catch-all category of C++ statements and idioms for which less expensive alternatives exist. If these idioms are widely used within a program, changing to less expensive idioms may improve overall performance.

Optimizing code at the statement level can produce significant performance improvements on the smaller, simpler processors that are embedded into instruments, appliances, peripherals, and toys, because instructions are fetched directly from memory and executed one after another. However, desktop- and handset-class processors provide so much instruction-level concurrency and caching that statement-level optimization produces smaller returns than optimizing allocation or copying.

On programs designed for desktop-class computers, statement-level optimization may thus only be appropriate for frequently called library functions, or the innermost loops of programs like game graphics engines or programming language interpreters that run flat-out all the time.

An issue with optimizing at the statement level is that the effectiveness of optimizations can be compiler-dependent. Each compiler has one or more plans for how to generate code for a particular C++ statement. A coding idiom that improves performance on one compiler may produce no improvement on another, or may even slow down the code. A trick that improved performance when using GCC may not work with Visual C++. More critically, it means that when a team upgrades to a new compiler version, the new compiler may deoptimize their carefully tuned code. This is another reason why statement-level optimization may be less fruitful than other kinds of performance improvement efforts.

Remove Code from Loops

A loop has two parts: a block of controlled statements that are executed repeatedly, and a controlling clause that determines how many times the loop is repeated. The general comments on removing computation from C++ statements apply to the controlled statements of a loop. With loops, the controlling clause offers extra opportunities for optimization because it is, in a sense, overhead.

Consider the for loop in Example 7-1, which traverses a string, replacing space characters with asterisks.

Example 7-1. Unoptimized for loop

char s[] = "This string has many space (0x20) chars. ";
...
for (size_t i = 0; i < strlen(s); ++i)
    if (s[i] == ' ')
        s[i] = '*';

The test i < strlen(s) in the for loop is performed for each character in the string. The call to strlen() is expensive, traversing its string argument to count its characters, turning this algorithm from O(n) to O(n²). This is an example of an inner loop hidden in a library function (see “Estimate the Cost of Loops”).

Ten million iterations of this loop took 13,238 milliseconds when compiled with Visual Studio 2010, and 11,467 milliseconds with Visual Studio 2015. The VS2015 measurement is 15% faster, signaling that the two compilers generate somewhat different code for this loop.

Cache the Loop End Value

The loop end value, the value returned by the expensive call to strlen(), can be precomputed and cached in the loop entry code to improve performance. The modified for loop looks like Example 7-2.

Example 7-2. for loop with cached end value

for (size_t i = 0, len = strlen(s); i < len; ++i)
    if (s[i] == ' ')
        s[i] = '*';

The effect of this change is dramatic, because strlen() is so costly. A test of the optimized code runs in 636 milliseconds with VS2010 and 541 milliseconds with VS2015—nearly 20 times faster than the original. VS2015 is still outrunning VS2010, this time by 17%.

Use More Efficient Loop Statements

This is the syntax for the C++ for loop:

for (init-expression ; condition ; continue-expression ) controlled-statement

Roughly speaking, this compiles into code that looks like this:

    init-expression ;
L1: if ( ! condition ) goto L2;
    controlled-statement ;
    continue-expression ;
    goto L1;
L2:

The for loop must jump twice: once if condition is false, and once more after evaluating the continue-expression. These jumps may slow down execution. C++ also has a simpler loop form called the do loop that is less widely used. Its syntax is:

do controlled-statement while ( condition ) ;

Roughly speaking, this compiles into the following code:

L1: controlled-statement
    if ( condition ) goto L1;

So, a for loop can often be simplified into a do loop that may be faster. Example 7-3 is the example string-scanning code as a do loop.

Example 7-3. for loop converted to do loop

size_t i = 0, len = strlen(s); // for loop init-expression
do {
    if (s[i] == ' ')
        s[i] = '*';
    ++i;                       // for loop continue-expression
} while (i < len);             // for loop condition

With Visual Studio 2010, this was a performance win: it took 482 milliseconds to run the test loop, a 12% improvement. With Visual Studio 2015, however, this change performed significantly less well: 674 milliseconds on the test loop, or 25% slower than the for loop.

Count Down Instead of Up

A variation on caching the end value is to count down instead of up, caching the end value in the loop index variable. Many loops have one end condition that is less expensive to evaluate than the other. For instance, in the loop in Example 7-3, one end condition is a constant 0, while the other end is an expensive call to strlen(). Example 7-4 is the for loop of Example 7-1 reorganized to count down instead of up.

Example 7-4. Optimized loop counting down

for (int i = (int)strlen(s)-1; i >= 0; --i)
    if (s[i] == ' ')
        s[i] = '*';

Note that I changed the type of the induction variable i from size_t, which is unsigned, to int, which is signed. The for loop’s termination clause is i >= 0. If i were unsigned, it would be, by definition, always greater than or equal to zero, and the loop could never terminate. This is a typical mistake when counting down to zero.
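
To illustrate the trap (this snippet is not one of the book’s timed tests), the unsigned version compiles cleanly but never exits:

// Hypothetical buggy variant: with an unsigned induction variable, i >= 0 is
// always true, and --i wraps around from 0 to a huge value instead of going negative.
for (size_t i = strlen(s) - 1; i >= 0; --i)   // loops forever
    if (s[i] == ' ')
        s[i] = '*';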

The same timing test measured this loop at 619 milliseconds when compiled with Visual Studio 2010, and 571 milliseconds for Visual Studio 2015. It’s not clear that these results represent a significant change compared to the code in Example 7-2.

Remove Invariant Code from Loops

Example 7-2, where the loop end value is cached for efficient reuse, is an instance of the more general technique of moving invariant code out of a loop. Code is invariant with respect to a loop if it does not depend on the loop induction variable. For instance, in the somewhat contrived loop in Example 7-5, the assignment statement j = 100; and the subexpression j * x * x are loop-invariant.

Example 7-5. Loop with invariant code

int i, j, x, a[10];
...
for (i = 0; i < 10; ++i) {
    j = 100;
    a[i] = i + j * x * x;
}

The loop can be rewritten as shown in Example 7-6.

Example 7-6. Loop with invariant code moved out

int i, j, x, a[10];
...
j = 100;
int tmp = j * x * x;
for (i = 0; i < 10; ++i) {
    a[i] = i + tmp;
}

Modern compilers are good at finding bits of invariant code (like the ones shown here) that are computed over and over, and moving the computations out of loops to improve performance. A developer would not normally need to rewrite this loop because the compiler already finds the invariant code and rewrites the loop.

When a statement inside a loop calls a function, the compiler may not be able to determine if the value returned by the function depends on something in the loop. The function may be complex, or the function’s body may be in another compilation unit and not visible to the compiler. The developer must manually identify loop-invariant function calls and remove them from loops.

Remove Unneeded Function Calls from Loops

A function call can execute an arbitrarily large number of instructions. If the function is loop-invariant, it’s plainly a good idea to pull it out of the loop to improve performance. In Example 7-1, reproduced here, the call to strlen() is loop-invariant and can be pulled out of the loop:

char* s = "sample data with spaces";

...

for (size_t i = 0; i < strlen(s); ++i)

if (s[i] == ' ')

s[i] = '*'; // change ' ' to '*'

Example 7-7 shows what the revised loop might look like.

Example 7-7. Loop where strlen() is loop invariant

char* s = "sample data with spaces";

...

size_t end = strlen(s);

for (size_t i = 0; i < end; ++i)

if (s[i] == ' ')

s[i] = '*'; // change ' ' to '*'

In Example 7-8, the value returned by strlen() is not loop-invariant, because removing a space character decreases the string length. The end condition cannot be pulled out of this loop.

Example 7-8. Loop in which strlen() is not loop-invariant

char* s = "sample data with spaces";

size_t i;

...

for (i = 0; i < strlen(s); ++i)

if (s[i] == ' ')

strcpy(&s[i], &s[i+1]); // remove space

s[i] = '\0';

There is no simple rule for determining whether a function is loop-invariant in a particular situation. Example 7-8 shows that a particular function can be loop-invariant for one loop, but not invariant for another. This is a place where human judgment is a superior tool to the compiler’s thorough but limited analysis. (The repeated call to strlen() is not the only thing less than optimal about this function. Other optimizations are left as an exercise.)

One kind of function that can always be pulled out of a loop is a pure function, a function whose returned value depends only on the value of its arguments, and which has no side effects. If such a function appears in a loop, and its argument is not modified in the loop, the function is loop-invariant and can be pulled out of the loop. In Examples 7-7 and 7-8, the function strlen() is a pure function. In the loop of Example 7-7, its argument s is never modified by the loop, so the call to strlen() is loop-invariant. In the loop of Example 7-8, the call to strcpy() modifies s, so the call strlen(s) is not loop-invariant.

Here is another example involving the math functions sin() and cos(), which return the mathematical sine and cosine of a value in radians. Many mathematical functions are pure functions, so this situation arises frequently in numerical calculations. In Example 7-9, a graphics rotation transformation is applied to a figure with 16 vertices. A timing test performing this transformation 100 million times took 7,502 milliseconds with VS2010 and 6,864 milliseconds with VS2015, with the VS2015 code again running somewhat faster (by about 9%).
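
The Point type is not shown in the book’s listing; a minimal definition consistent with the member names used below might look like this:

struct Point {
    double x_, y_;   // 2D coordinates, named as in Examples 7-9 and 7-10
};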

Example 7-9. rotate() with loop containing invariant pure functions

void rotate(std::vector<Point>& v, double theta) {

for (size_t i = 0; i < v.size(); ++i) {

double x = v[i].x_, y = v[i].y_;

v[i].x_ = cos(theta)*x - sin(theta)*y;

v[i].y_ = sin(theta)*x + cos(theta)*y;

}

}

The functions sin(theta) and cos(theta) depend only on the function argument theta. They don’t depend on the loop variable. They can be lifted out of the loop as shown in Example 7-10.

Example 7-10. rotate_invariant() with invariant pure functions lifted out of the loop

void rotate_invariant(std::vector<Point>& v, double theta) {

double sin_theta = sin(theta);

double cos_theta = cos(theta);

for (size_t i = 0; i < v.size(); ++i) {

double x = v[i].x_, y = v[i].y_;

v[i].x_ = cos_theta*x - sin_theta*y;

v[i].y_ = sin_theta*x + cos_theta*y;

}

}

The revised function runs the test loop about 3% faster, at 7,382 milliseconds (VS2010) and 6,620 milliseconds (VS2015).

The savings are less dramatic on a PC than those made by lifting the call to strlen() in the previous section, because the math functions generally operate on one or two numbers held in registers and do not access memory like strlen() does. On a retro PC from the 1990s or an embedded processor without floating-point instructions in hardware, the savings might be more dramatic because the computation of sines and cosines is more expensive.

Sometimes, a function called within a loop does no work at all, or does work that is not necessary. Such functions can, of course, be removed. It is easy to say, “Well, duh! No competent developer would call a function that does no useful work!” Far harder is to remember all the places where a function is called, and check them all when the function’s behavior changes during the several-year lifetime of a project.

The following is an idiom that I have encountered repeatedly in my career:

UsefulTool subsystem;
InputHandler input_getter;
...
while (input_getter.more_work_available()) {
    subsystem.initialize();
    subsystem.process_work(input_getter.get_work());
}

In this recurring pattern, subsystem is initialized each time before it is asked to process the next unit of work. There may be something wrong with this code that can only be determined by inspecting UsefulTool::initialize(). It may be that initialize() need only be called before the first unit of work is processed, or perhaps for the first unit of work and again after an error. Often process_work() establishes on exit the same class invariant that initialize() establishes, so calling initialize() each time through the loop merely repeats work that process_work() has already done. If so, the call to initialize() can be lifted out of the loop:

UsefulTool subsystem;
InputHandler input_getter;
...
subsystem.initialize();
while (input_getter.more_work_available()) {
    subsystem.process_work(input_getter.get_work());
}

It is presumptuous to blame the developer who wrote the original code for being sloppy. Sometimes the behavior of initialize() changes, moving code into process_work(). Sometimes the project documentation is inadequate, or the schedule is rushed, or the purpose of the code of initialize() is unclear, and the developer was just programming defensively. But I have found several times that initialization that only needs to be done once is done every time work is processed.

If saving time is important enough, it is worth looking at each function call in a loop to see whether that work is actually needed.

Remove Hidden Function Calls from Loops

Normal function calls have a distinctive look, with a function name and list of argument expressions in parentheses. C++ code can also call functions implicitly, without this easy-to-spot syntax. This can happen when a variable of class type appears in any of the following situations:

· Declaration of a class instance (calls constructor)

· Initialization of a class instance (calls constructor)

· Assignment to a class instance (calls assignment operator)

· Arithmetic expressions involving class instances (calls operator member functions)

· Scope exit (calls destructor of class instances declared in the scope)

· Function arguments (each argument expression is copy-constructed into its formal argument)

· Function return of a class instance (calls copy constructor, perhaps twice)

· Inserting items into a standard library container (items are move- or copy-constructed)

· Inserting items into a vector (all items are move- or copy-constructed if the vector is reallocated)

These function calls are hidden. They don’t have the familiar look of a function call with a name and argument list; they look like assignments and declarations. It’s easy to miss the fact that a function is being called. I’ve talked about these before, in “Eliminate Unneeded Copying”.

Hidden function calls resulting from constructing a function’s formal arguments can sometimes be eliminated if the function signature is changed to pass a reference or pointer to the class instead of passing the actual argument by value. The benefit of this was demonstrated for strings in “Eliminate Copying of String Arguments”, and for any object that copies data in “Eliminate Copying on Function Call”.
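
For instance (the function names here are hypothetical, not from the book), changing a by-value string parameter to a reference to const removes a hidden copy construction at every call site:

#include <string>

// before: the caller's string is copy-constructed into s on every call
std::size_t count_spaces_byval(std::string s);

// after: s binds directly to the caller's string; no hidden copy, no allocation
std::size_t count_spaces(const std::string& s);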

Hidden function calls resulting from copying out the function’s return value can be removed if the function signature is changed so that the class instance to be returned goes into a reference or pointer argument used as an out parameter instead. The benefit of this was demonstrated for strings in “Eliminate Copying of Returned String Values”, and for any object that copies data in “Eliminate Copying on Function Return”.
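
Similarly, sketching with another hypothetical function, the returned object can be replaced by an out parameter that the caller already owns:

#include <string>

// before: the returned string may be copy-constructed into the caller's variable
std::string build_header();

// after: the function fills a string supplied by the caller
void build_header(std::string& result_out);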

If an assignment or initialized declaration is loop-invariant, it can be lifted out of the loop. Sometimes even if the variable needs to be set for each pass through the loop, you can lift the declaration out of the loop and perform a less expensive function on each iteration of the loop. For example, std::string is a class that contains a dynamically allocated array of char. In the following code:

for (...) {
    std::string s("<p>");
    ...
    s += "</p>";
}

it is expensive to put the declaration of s in the for loop. At the closing brace of the controlled block of statements, s’s destructor is called. The destructor frees s’s dynamically allocated memory, so it must be reallocated the next time through the loop. This code can be rewritten as follows:

std::string s;
for (...) {
    s.clear();
    s += "<p>";
    ...
    s += "</p>";
}

Now s’s destructor is not called until later. This not only saves the cost of a function call each time through the loop but, as a side effect, the dynamic array inside s is reused, possibly eliminating a hidden call to the memory manager when characters are appended to s.

This behavior doesn’t apply only to strings, or only to classes that contain dynamic memory. A class instance may contain a resource fetched from the operating system, such as a window or file handle, or may do any expensive thing inside its constructors and destructors.

Remove Expensive, Slow-Changing Calls from Loops

Some function calls are not invariant, but they might as well be. A good example is a call to get the current time of day to use in a logging application. It takes a nontrivial number of instructions to fetch the time of day from the operating system, and more time to format the time as text. Example 7-11 is a function to format the current time into a null-terminated character array.

Example 7-11. timetoa(): format the time into an array of char

# include <ctime>

char* timetoa(char *buf, size_t bufsz) {

if (buf == 0 || bufsz < 9)

return nullptr; // invalid arguments

time_t t = std::time(nullptr); // get time from O/S

tm tm = *std::localtime(&t); // break time into hours, minutes, seconds

size_t sz = std::strftime(buf, bufsz, "%c", &tm); // format into buffer

if (sz == 0) strcpy(buf, "XX:XX:XX"); // error

return buf;

}

In an experimental test loop, timetoa() took about 700 nanoseconds to fetch and format the time. That may not sound like much, but the cost is significant: it is about twice as long as it takes to append two text strings to a file. In the same test loop, the statement

out << "Fri Jan 01 00:00:00 2016"

<< " Test log line test log line test log line\n";

took only 372 nanoseconds, whereas the statement

out << timetoa(buf, sizeof(buf))

<< " Test log line test log line test log line\n";

took 1,042 nanoseconds.

Logging must be as efficient as possible, or else it slows down the program being logged. This is bad if it makes the program slower, but worse if that slowdown changes the program’s behavior so that bugs disappear when logging is turned on. Getting the current time dominates the cost of logging in this example.

The time of day changes very slowly compared to the instruction execution rate of a modern computer. Apparently my program could log a million lines between time ticks. Two consecutive calls to fetch the time of day are thus very likely to return the same value. If several lines are to be logged at once, it doesn’t make sense to fetch the time anew for each line.

I ran another test to simulate the behavior of logging when a program requests the time once, and puts out a group of 10 lines using the same data. As expected, this test printed lines in an average of 376 nanoseconds per line.
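
A minimal sketch of that idea (not the book’s test harness) caches the time_t value and reformats only when it changes, reusing timetoa() from Example 7-11:

const char* cached_time(char* buf, size_t bufsz) {
    static std::time_t last = (std::time_t)(-1);
    std::time_t now = std::time(nullptr);
    if (now != last) {         // the clock ticks at most once per second
        last = now;
        timetoa(buf, bufsz);   // reformat only when the time has changed
    }
    return buf;                // assumes the caller passes the same buffer each time
}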

Push Loops Down into Functions to Reduce Call Overhead

If a program iterates through a string, array, or other data structure and calls a function for each iteration, it may be possible to improve performance by a technique called loop inversion. A loop calling a function is inverted by putting the loop inside the function. This changes the function’s interface to reference the whole data structure, rather than referencing a single entry. In this way, the cost of n–1 function calls can be removed if the data structure contains n entries.

For a very simple example, imagine a library function that replaces nonprinting characters with a dot (“.”) character:

# include <ctype>

void replace_nonprinting(char& c) {

if (!isprint(c))

c = '.';

}

To replace all the nonprinting characters in a string, a program calls replace_nonprinting() in a loop:

for (unsigned i = 0, e = str.size(); i < e; ++i)
    replace_nonprinting(str[i]);

If the compiler cannot inline the call to the replace_nonprinting() function, it will call the function 26 times if the preceding loop is given the string “Ring the carriage bell\x07\x07!!” to process.

The library designer might add an overload of replace_nonprinting() to process a whole string:

void replace_nonprinting(std::string& str) {

for (unsigned i = 0, e = str.size(); i < e; ++i)

if (!isprint(str[i]))

c = '.';

}

The loop is now inside the function, eliminating n–1 calls to replace_nonprinting().

Note that the code implementing the behavior of replace_nonprinting() must be duplicated in the new overload. It doesn’t work to just move the loop around. The following version actually adds a function call to the previous total:

void replace_nonprinting(std::string& str) {

for (unsigned i = 0, e = str.size(); i < e; ++i)

replace_nonprinting(str[i]);

}

Do Some Actions Less Frequently

Here’s a motivating question: “A program’s main loop processes about 1,000 transactions per second. How frequently does it need to check for a command to terminate the loop?”

Of course, the answer is, “It depends.” In fact, it depends on two things: how fast the program needs to respond to the termination request, and how costly it is to check for termination.

If the program needs to terminate within 1 second to achieve its responsiveness goal, and it takes an average of 500 ± 100 milliseconds to stop after detecting that shutdown was commanded, it is necessary to check every 400 milliseconds (1,000 – (500 + 100) = 400 milliseconds). Checking more often would be wasteful.

The other factor is the cost to check for termination. If the loop is a Windows message loop, and the termination command arrives as the WM_CLOSE Windows message, there is no extra overhead for checking. The cost is built into dispatching the event. If a signal handler causes a bool flag to be set, the cost of checking this flag each time through the loop is tiny.
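
A minimal sketch of the flag approach (the handler and flag names are hypothetical) looks like this:

#include <csignal>

volatile std::sig_atomic_t g_terminate = 0;

extern "C" void handle_sigterm(int) { g_terminate = 1; }

...
std::signal(SIGTERM, handle_sigterm);
while (!g_terminate) {
    // process one transaction; the check costs only a load and a compare
}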

But what if the loop is on an embedded device where a key press must be polled for, and the key has to be debounced for 50 milliseconds? Testing each time through the loop will add the 50-millisecond key polling cost to the cost of each transaction, dropping the transaction rate from 1,000 per second to 1/0.051 ≈ 20 transactions per second. Ouch!

If the program instead polls only at 400-millisecond intervals, the effect on the loop is less dramatic. The math here is a bit tedious, but let’s go through it. Transactions take about 1 millisecond each (1,000 transactions per second). To complete a 50-millisecond poll every 400 milliseconds, the poll must be started 350 milliseconds after the previous poll finishes, which works out to 2.5 polls per 1,000 milliseconds of wall-clock time. That results in a transaction rate of 1,000 – (2.5 * 50) = 875 transactions per second.

Example 7-12 shows code to check for such a key press.

Example 7-12. Check for events occasionally

void main_loop(Event evt) {
    static unsigned counter = 1;
    if ((counter % 350) == 0)
        if (poll_for_exit())
            exit_program();  // noreturn
    ++counter;
    switch (evt) {
        ...
    }
}

Execution enters main_loop() every millisecond (assuming events occur every millisecond). Every time through the loop, counter is incremented. When it reaches 350, the program calls poll_for_exit(), which takes 50 milliseconds. If the exit key is detected in poll_for_exit(), the code calls exit_program(), which takes 400–600 milliseconds to shut things down.

This casual polling method illustrates how to take bigger bites of computation between polling. It is, however, fraught with assumptions:

· It assumes that events happen every millisecond, and not sometimes every 2 or every 5 milliseconds, and that the event rate doesn’t slow down if there is no work.

· It assumes that polling always takes exactly 50 milliseconds.

· It assumes that a debugger will never get control so that no events are processed for half a minute while the developer inspects some variables.

A more careful approach is to measure the elapsed time from event to event, and from beginning to end of poll_for_exit().
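
One way to do that, sketched here with std::chrono (the helper names come from Example 7-12), is to poll on measured elapsed time rather than on a counted number of events:

#include <chrono>

void main_loop(Event evt) {
    using clock = std::chrono::steady_clock;
    static clock::time_point last_poll = clock::now();
    if (clock::now() - last_poll >= std::chrono::milliseconds(350)) {
        if (poll_for_exit())
            exit_program();          // noreturn
        last_poll = clock::now();    // restart the interval after the 50 ms poll
    }
    switch (evt) {
        ...
    }
}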

A developer wishing to achieve the full 1,000 transactions per second under the constraints listed here must find some source of true concurrency to separate polling the keyboard from the main loop. This is typically implemented via interrupts, multiple cores, or a hardware keyboard scanner.

What About Everything Else?

There are many sources on the Internet of low-level optimization techniques involving loops. For instance, some sources note that ++i is generally faster than i++ because no intermediate value need be saved and returned. Some sources recommend loop unrolling, to reduce the total number of times the loop test and increment code is executed.

The problem with these recommendations is that they don’t always make any difference at all. You may spend the time to do an experiment and observe no improvement. The recommendations may be based on conjecture not informed by experiment, or may apply only to a particular compiler at a particular date. They may come from a compiler design textbook, describing optimizations that the C++ compiler is already programmed to do. Thirty years on, modern C++ compilers are very good at finding code to move out of loops. In fact, the compiler is better at it than most programmers. This is why such optimizations are so often frustratingly ineffective, and thus why this section isn’t longer.

Remove Code from Functions

Functions, like loops, consist of two parts: a block of code that forms the body of the function, and the head of the function, consisting of the argument list and return type. As with loops, these two parts can be optimized separately.

Although the cost of executing the function’s body can be arbitrarily large, the cost of calling the function, like the cost of most C++ statements, is minuscule. But when the function is called many times, that cost is multiplied and can become significant, so that reducing this cost becomes important.

Cost of Function Calls

Functions are the oldest and most important abstraction in programming. The programmer defines a function once, then calls its name from many locations in the code. For each call, the computer saves its place in the executing code, transfers control to the function body, then returns to the next statement after the function call, effectively inserting the function’s body into the stream of instruction execution.

This notational convenience does not come for free. Each time a program calls a function, something like the following happens (depending of course on processor architecture and optimizer settings):

1. The executing code pushes a new frame onto the call stack to hold the function’s arguments and local variables.

2. Each argument expression is evaluated and copied into the stack frame.

3. The execution address is copied into the stack frame to form the return address.

4. The executing code updates the execution address to the first statement of the function body (instead of the next statement after the function call).

5. The instructions in the function body execute.

6. The return address is copied from the stack frame into the instruction address, transferring control to the statement after the function call.

7. The stack frame is popped off the stack.

There is some good news about the cost of functions. A program with functions is generally more compact than the same program with large functions expanded inline. This improves cache and virtual memory performance. Still, all other things being equal, making frequently called functions more efficient can be an effective optimization.

Basic cost of function calls

There are a number of details that can slow down function calls in C++, and which therefore form the basis for optimization of function calls:

Function arguments

Beyond the cost of evaluating the argument expressions, there is a cost to copy each argument value onto the stack in memory. The first few arguments may be passed efficiently in registers if they are small, but if there are many arguments, at least some are passed via the stack.

Member function calls (versus function calls)

Every member function call has an extra hidden argument: a pointer to the “this” class instance through which the member function is invoked. That pointer must be written to memory on the call stack or saved in a register.

Call and return

These add nothing to the useful work of the program. They are pure overhead that would not exist if the function call was replaced by the function body. Indeed, many compilers attempt to put the function body inline if the function is small, and if the function definition is available at the point of the function call. If the function cannot be inlined, there are several costs.

Calling the function requires the execution address to be written to the stack frame in memory to form the return address.

Function return requires the execution address to be read from the stack and loaded into the execution pointer. At the call and again at the return, execution continues at a nonconsecutive memory address. As noted in “Making Decisions Is Hard for Computers”, the computer executes consecutive instructions very efficiently. However, when execution transfers to a nonconsecutive location, there is a good chance of a pipeline stall and/or a cache miss.

Cost of virtual functions

In C++, the program can define any member function virtual. Derived classes can define a member function with the same signature that overrides the base class’s virtual member function, providing a new function body to be used when the virtual function is called on an instance of the derived class, even if called from a pointer or reference to the base class’s type. The program selects what function to call when it dereferences the class instance. The particular override invoked is thus determined at run time by the actual type of the class instance.

Each instance of a class with virtual member functions contains an unnamed pointer to a table, called the vtable, which points to the body associated with each virtual function signature visible in that class. The vtable pointer is generally made the first field of the class instance to make dereferencing it less expensive.

Since a virtual function call selects one of several function bodies, the code to call a virtual function dereferences the pointer to the class instance to get a pointer to the vtable. The code indexes into the vtable (that is, the code adds a small integer offset to the vtable, and dereferences that address) to get the virtual function’s execution address. There are thus two extra nonsequential memory loads for every virtual function call, each with increased probability of a cache miss and execution pipeline stall. Another issue with virtual functions is that it is difficult for the compiler to inline them. The compiler can do so only if it has access both to the function body and to the code that constructs the instance (so it can tell which virtual function body to call).

Member function calls in derived classes

When one class is derived from another, member functions of the derived class may have to do extra work:

Virtual member functions defined in a derived class

If the most base class (the class at the root of the derivation hierarchy) has no virtual member functions, the code must add an offset to the “this” class instance pointer to get to the vtable of the derived class, then index into the vtable to get the function execution address. This code contains more instruction bytes, and the instructions are often slower because of the extra arithmetic they do. This cost is significant on small embedded processors, but on desktop-class processors instruction-level concurrency hides most of the cost of the extra arithmetic.

Member function calls defined in a derived class with multiple inheritance

The code must add an offset to the “this” class instance pointer to form the class instance pointer to a multiply inherited class. This cost is significant on small embedded processors, but on desktop-class processors instruction-level concurrency hides most of the cost of the extra arithmetic.

Virtual member function calls defined in a derived class with multiple inheritance

As for virtual member function calls in a derived class, if the most base class has no virtual member functions, the code must add an offset to the “this” class instance pointer to get to the vtable of the derived class, then index into the vtable to get the function execution address. The code must also add a potentially different offset to the “this” class instance pointer to form the class instance pointer of the derived class. This cost is significant on small embedded processors, but on desktop-class processors instruction-level concurrency hides most of the cost of the extra arithmetic.

Virtual multiple inheritance

The code must dereference a table in the class instance to determine the offset to add to the class instance pointer to form the instance pointer of a virtually multiply inherited class. There is additional overhead, as described previously, if the called function is virtual.

Cost of pointers to functions

C++ provides function pointers so that code can explicitly choose at run time which of several function bodies is executed as a result of a function call through the function pointer. These mechanisms have additional costs beyond the basic cost of function call and return:

Function pointers (pointers to free functions and static member functions)

C++ allows a program to define variables that are pointers to functions. Function pointers permit the programmer to explicitly choose any free function with a specific signature, consisting of its list of argument types and return type, to be called at run time when the function pointer is dereferenced. The program selects what function to call through the function pointer explicitly, by assigning a function to the function pointer.

The code must dereference a pointer to get the execution address of the function. It is also unlikely that the compiler can inline these functions.
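
For illustration (these functions are hypothetical), a function pointer selects one of several free functions with the same signature at run time:

double halve(double x)  { return x * 0.5; }
double square(double x) { return x * x; }

double apply(double (*op)(double), double x) {
    return op(x);                    // indirect call through the pointer
}
...
double (*op)(double) = square;       // explicit run-time selection
double y = apply(op, 3.0);           // y == 9.0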

Member function pointers

Member function pointer declarations identify both the function signature and the class in the context of which the call is interpreted. The program selects what function to call through the member function pointer explicitly, by assigning a function to the function pointer.

A member function pointer may have any of several representations. It must be general enough to call any member function under any of the complex circumstances listed previously. It is not unreasonable to assume worst-case performance for a member function pointer.
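
A short hypothetical example shows the syntax; the pointer type names both the signature and the class through which the call is made:

struct Counter {
    int n = 0;
    void bump()  { ++n; }
    void reset() { n = 0; }
};
...
void (Counter::*action)() = &Counter::bump;   // explicit selection of a member
Counter c;
(c.*action)();                                // call through the member function pointer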

Summary of function call costs

A C-style void function call with no arguments is thus the cheapest, having no cost if it can be inlined, and only two memory accesses plus the cost of two nonlocal transfers of execution otherwise.

A virtual function call belonging to a derived class whose base class has no virtual functions, contained in a class having virtual multiple inheritance, is a worst-case scenario. Fortunately, it is extremely rare. The code must dereference a table in the class instance to determine the offset to add to the class instance pointer, form the instance pointer of the virtually multiply inherited function’s instance, then dereference the instance to get the vtable, then index into the vtable to get the function execution address.

At this point, the reader may either be shocked that a function call can be so expensive, or amazed at how efficiently C++ implements very complex features. Both thoughts are appropriate. The point to understand is that there are costs to calling functions, and thus opportunities for optimization. The bad news here is that removing one nonsequential memory access is not much of an improvement unless the function is very frequently called. The good news is that a profiler will point directly to the most frequently called functions, allowing the developer to quickly home in on the best targets for attention.

Declare Brief Functions Inline

An effective way to remove function call overhead is to inline the function. To inline a function, the compiler must have access to the function definition at the point of the call to be inlined. Functions whose bodies appear inside the body of a class definition are implicitly declared inline. Functions defined outside the class definition may also be explicitly marked with the inline keyword. Additionally, the compiler may choose on its own to inline short functions whose definitions appear before their first use in a particular compilation unit. Although the C++ standard says that the inline keyword is only a “hint” to the compiler, compilers must be good at inlining in order to be commercially successful.
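
Both forms look like this (a hypothetical example); what matters is that the definition is visible where the call is compiled:

struct Pixel {
    int r, g, b;
    int luma() const { return (2*r + 5*g + b) / 8; }   // implicitly inline: defined in the class body
};

inline int clamp255(int x) {                            // explicitly declared inline
    return x < 0 ? 0 : (x > 255 ? 255 : x);
}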

When the compiler inlines a function, it has additional opportunities to improve the code, beyond removing the call and return statements. Some arithmetic may be performed at compile time. Some branches may be eliminated because the compiler determines they cannot be taken when arguments have particular values. Inlining is thus a means of removing redundant computation to improve performance, by performing the computation at compile time.

Function inlining is probably the most powerful code optimization. In fact, the difference in performance between the “debug” and “release” builds in Visual Studio (or between -O0 and -O2 builds with GCC) comes mostly from the fact that debug builds turn off function inlining by default.

Define Functions Before First Use

Defining a function (providing the function body) above the point of the first call to the function gives the compiler a chance to optimize the function call. The compiler can choose on its own to inline a function call if it has the function definition available when it compiles a call to that function. This even works with virtual functions, if the compiler can see both the function body and the code that instantiates the class variable, pointer, or reference through which the virtual function is called.

Eliminate Unused Polymorphism

In C++, virtual member functions are frequently used to implement runtime polymorphism. Polymorphism allows a member function to do one of several different but semantically related things, depending on the object on which the member function is called.

To implement polymorphic behavior, a base class defines a virtual member function. Any derived classes can choose to override the behavior of the base class function with behavior specialized to that derived class. The various implementations are all related by a semantic concept that must be implemented differently for each derivation.

A classic example of polymorphism is the Draw() function defined in a hierarchy of classes derived from DrawableObject, representing graphical objects to draw on the screen. When a program calls drawObjPtr->Draw(), the implementation of Draw() is selected by dereferencing the virtual function table in the instance pointed to by drawObjPtr. The selected implementation of Draw() draws a triangle when the class instance is a Triangle, a rectangle when it is a Rectangle, and so forth. The program declares DrawableObject::Draw() virtual, so that the proper derived class’s Draw() member is called. When the program must select among several implementations at run time, the virtual function table is a very fast mechanism, despite its overhead of two additional memory loads and their associated pipeline stalls.

Still, polymorphism can occasionally cause unnecessary overhead. A class may have originally been designed to facilitate a hierarchy of derived classes that is never implemented, or a function may have been declared virtual in the expectation of polymorphic behavior that was never implemented. In the preceding example, all the drawable objects might end up being implemented as a list of points that are connected in order, so that the base class version of Draw() is always used. Removing the virtual specifier on the DrawableObject declaration of Draw() when there are no overrides speeds up all calls to Draw().
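
A sketch of the change, assuming the hierarchy described above, is simply the removal of one keyword:

class DrawableObject {
public:
    virtual ~DrawableObject();
    void Draw() const;   // was: virtual void Draw() const; no overrides exist
    ...
};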

STOP AND THINK

There is tension between the designer’s wish to say that DrawableObject can be the root of a set of DrawableObject-derived classes, and the optimizing developer’s wish to improve performance because the Draw() member function has no overrides. Presumably the Draw() member function has been fingered by some experiment as being costly. If so, the designer will probably bow out. It’s easy to put the virtual specifier back in later if needed. The wisest optimizing developers know better than to weaken the design without good reason, and don’t go chasing every virtual function in a program with scalpel in hand.

Discard Unused Interfaces

In C++, virtual member functions can be used to implement interfaces. An interface is a declaration of a common set of functions that together describe the intended behavior of an object, and which may be implemented in different ways under different circumstances. A base class defines the interface by declaring a set of pure virtual functions (functions with declarations, but no function body). Because pure virtual functions have no body, C++ prevents the interface base class from being instantiated. Derived classes implement the interface by providing overrides (definitions) for all the pure virtual functions in the interface base class. An advantage of the interface idiom in C++ is that the derived class must implement all the functions declared in the interface, or the compiler will not allow the program to create instances of the derived class.

For example, a developer may use an interface class to isolate operating system dependencies, particularly if the design foresees implementing the program for several operating systems. A class for reading and writing files might be defined by the interface class File that follows. File is called an abstract base class because it cannot be instantiated:

// file.h - interface
class File {
public:
    virtual ~File() {}
    virtual bool Open(Path& p) = 0;
    virtual bool Close() = 0;
    virtual int GetChar() = 0;
    virtual unsigned GetErrorCode() = 0;
};

Somewhere else in the code, a derived class WindowsFile is defined, supplying overrides that implement these functions on the Windows operating system. The C++11 keyword override is optional, and tells the compiler that this declaration is meant to override a virtual function declaration in a base class. When override is specified, if there is no virtual function declaration in the base class, the compiler issues a warning message:

// windowsfile.h - Windows implementation of the File interface
#include "file.h"

class WindowsFile : public File { // C++11-style declaration
public:
    ~WindowsFile() {}
    bool Open(Path& p) override;
    bool Close() override;
    int GetChar() override;
    unsigned GetErrorCode() override;
};

In addition to the header file, there is also a windowsfile.cpp file containing the implementations of the override functions for the Windows implementation:

// windowsfile.cpp - Windows implementation
#include "windowsfile.h"

bool WindowsFile::Open(Path& p) {
    ...
}

bool WindowsFile::Close() {
    ...
}
...

Sometimes a program defines an interface, but only provides a single implementation. In this case, the cost of the virtual function calls (particularly the frequently used call to GetChar()) may be saved by eliminating the interface, removing the virtual keyword from the class definition in file.h, and providing implementations for the File’s member functions.

STOP AND THINK

As in the previous section, there is tension between the designer’s wish to clearly define an interface (which is a Good Thing) and the optimizing developer’s desire to improve performance (which also matters if GetChar() has been flagged by the profiler as a hot function). When the program is mature, it may be easier to judge whether there will likely ever be another implementation. This knowledge can illuminate the choice to optimize or retain the original design. The optimizing developer, if she is not the architect who originally defined the interface, must be prepared for push-back when she proposes this change. She would be well advised to have performance numbers to hand to justify the change.

Select interface implementation at link time

Virtual functions permit a running program to choose among several implementations. Interfaces allow a designer to specify what functions must be coded during development to make an object useful. The problem with implementing the interface idiom using C++ virtual functions is that virtual functions provide a runtime solution with runtime cost to a design-time problem.

In the previous section, an interface called File was defined to isolate operating system dependency. The interface was implemented in the derived class WindowsFile. If the program is ported to Linux, the source code may eventually include a LinuxFile derivation of the File interface, but WindowsFile and LinuxFile can never be instantiated in the same program. Each makes low-level calls that are only implemented on one operating system. The overhead of the virtual function call is unneeded. Further, if a program reads large files, File::GetChar() may be hot enough to optimize.

The developer can use the linker to select among several implementations if the decision doesn’t have to be made at run time. Instead of declaring a C++ interface, the header file declares (but does not implement) the member functions directly, in much the same way as if they were standard library functions:

// file.h - interface
class File {
public:
    File();
    bool Open(Path& p);
    bool Close();
    int GetChar();
    unsigned GetErrorCode();
};

A file called windowsfile.cpp contains the Windows implementation:

// windowsfile.cpp - Windows implementation
#include "file.h"

bool File::Open(Path& p) {
    ...
}

bool File::Close() {
    ...
}
...

A similar file called linuxfile.cpp contains the Linux implementation. The Visual Studio project file references windowsfile.cpp. The Linux makefile references linuxfile.cpp. The decision is implemented by the list of arguments handed to the linker. Now the call to GetChar() will be as efficient as a function call can be. (Note that there are other approaches to optimizing a call like GetChar(), including loop inversion, described in “Push Loops Down into Functions to Reduce Call Overhead”.)

Selecting the implementation at link time has the advantage of being very general. Its disadvantage is that part of the decision is in the .cpp files, and part is in the makefile or project file.

Select interface implementation at compile time

In the previous section, an implementation of the File abstraction was selected by the linker. This was possible because the implementation of File depended on the operating system. A particular program executable can run on only one operating system, so the decision didn’t have to be made at run time.

If a different compiler is used for the two different implementations of File (say, Visual Studio for the Windows version and GCC for the Linux version), the implementation can be selected at compile time using #ifdef. The header file remains the same. This time there is a single file called file.cpp, and the preprocessor selects the implementation:

// file.cpp - implementation
#include "file.h"

#ifdef _WIN32

bool File::Open(Path& p) {
    ...
}

bool File::Close() {
    ...
}
...

#else // Linux

bool File::Open(Path& p) {
    ...
}

bool File::Close() {
    ...
}
...

#endif

This solution requires a preprocessor macro that can be used to select the desired implementation. Some developers like this solution because more of the decision making is visible in the .cpp file. Others find it messy and un-object-oriented to have both implementations in the same file.

Select Implementation at Compile Time with Templates

C++ template specializations are another means of selecting implementation at compile time. Templates allow the developer to create a family of classes with a common interface, but behavior that depends on a type parameter to the template. Template parameters may be any type—class types that provide their own set of member functions, or basic types that provide their built-in set of operators. There are thus two interfaces: the public members of the template class, and the interface defined by the operators and functions invoked on the template parameters. Interfaces defined in abstract base classes are very strict, requiring the derived class to implement all functions in the abstract base class. Interfaces defined by templates are less strict. Only functions in the parameters that are actually invoked by a particular use of a template need be defined.

This property of templates is a two-edged sword: the compiler doesn’t give an error message right away if the developer forgets to implement an interface function in a particular template specialization, but the developer can also choose not to implement functions that are inconvenient if they aren’t used in a given context.

From an optimization standpoint, the most important difference between polymorphic class hierarchies and template instances is that generally the whole template is available at compile time. In most use cases, C++ inlines template function calls, improving performance in several ways (as noted in “Declare Brief Functions Inline”).

Template programming provides powerful optimizations. For the developer not familiar with templates, it is worth the considerable mental effort required to learn to use this C++ feature effectively.
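
As a sketch of this approach (the implementation class names are hypothetical, and Path is assumed to be declared as in the earlier examples), the implementation becomes a template parameter, so the choice is fixed when the template is instantiated and the calls can be inlined:

template <typename Impl>
class BasicFile {
public:
    bool Open(Path& p)  { return impl_.Open(p); }
    bool Close()        { return impl_.Close(); }
    int  GetChar()      { return impl_.GetChar(); }
private:
    Impl impl_;   // WindowsFileImpl, LinuxFileImpl, a test mock, ...
};

// One hypothetical implementation; a Linux version or a test mock would
// provide the same member functions.
class WindowsFileImpl {
public:
    bool Open(Path& p);
    bool Close();
    int  GetChar();
};

typedef BasicFile<WindowsFileImpl> File;   // selection fixed at compile time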

Eliminate Uses of the PIMPL Idiom

PIMPL, short for “Pointer to IMPLementation,” is a coding idiom used as a compilation firewall, a mechanism to prevent a change in one header file from triggering recompilation of many source files. During the adolescence of C++ in the 1990s, use of the PIMPL idiom was justifiable because the compilation time for a large program might be measured in hours. Here’s how PIMPL works.

Imagine a widely used class called BigClass (declared in Example 7-13) that has some inline functions and is implemented using classes Foo, Bar, and Baz. Normally, any change to bigclass.h, foo.h, bar.h, or baz.h, even if the change only amounted to a single character in a comment, would trigger recompilation of the many files that included bigclass.h.

Example 7-13. bigclass.h before implementing the PIMPL idiom

# include "foo.h"

# include "bar.h"

# include "baz.h"

class BigClass {

public:

BigClass();

void f1(int a) { ... }

void f2(float f) { ... }

Foo foo_;

Bar bar_;

Baz baz_;

};

To implement PIMPL, the developer defines a new class, called Impl in this example. bigclass.h is changed as follows, shown in Example 7-14.

Example 7-14. bigclass.h after implementing the PIMPL idiom

class Impl;

class BigClass {
public:
    BigClass();
    void f1(int a);
    char f2(float f);
    Impl* impl_;
};

C++ permits a pointer to be declared that points to an incomplete type—that is, to an object for which there is no definition yet. In this case, Impl is an incomplete type. This works because all pointers are the same size, so the compiler knows how to reserve storage for the pointer. After implementing PIMPL, BigClass’s externally visible definition does not depend on foo.h, bar.h, or baz.h. The complete definition of Impl is inside bigclass.cpp (Example 7-15).

Example 7-15. bigclass.cpp, containing the definition of Impl

# include "foo.h"

# include "bar.h"

# include "baz.h"

# include "bigclass.h"

class Impl {

void g1(int a);

void g2(float f);

Foo foo_;

Bar bar_;

Baz baz_;

};

void Impl::g1(int a) {

...

}

char Impl::g2(float f) {

...

}

void BigClass::BigClass() {

impl_ = new Impl;

}

void BigClass::f1(int a) {

impl_ -> g1(a);

}

char BigClass::f2(float f) {

return impl_ -> g2(f)

}

At compile time, after implementing PIMPL, changes to foo.h, bar.h, or baz.h, or the implementation of Impl cause bigclass.cpp to be recompiled, but bigclass.h is unchanged, limiting the scope of the recompilation.

At run time, the story is different. PIMPL adds nothing to the program except delay. Member functions of BigClass that might have been inlined now require a member function call, and each of those calls in turn invokes a member function of Impl. Projects that use PIMPL often use it in many places, creating many layers of nested function calls. Debugging is also tedious because of the extra layers of function calls.

In 2016, there is little need to use PIMPL. Compilation times are maybe 1% of the times experienced in the 1990s. Furthermore, even in the ’90s, for PIMPL to be necessary, BigClass had to be a big class depending on many header files. Such classes violate many rules of object-oriented programming. Breaking BigClass up to provide more focused interfaces might be just as effective as implementing PIMPL.

Eliminate Calls into DLLs

Calls into dynamic link libraries (DLLs) on Windows are made through a function pointer that is either explicitly set in the program if the DLL is loaded on demand, or implicitly set if the DLL is automatically loaded at startup. Linux has dynamic libraries too, with the same implementation.

Some DLL calls are necessary. For instance, an application may implement a third-party plug-in library. Other reasons to use DLLs are less clear-cut. For instance, one reason DLLs have been used is to provide the ability to patch bugs by replacing just a DLL. Experience has demonstrated that bug fixes are generally produced in batches, covering many areas of the program at once. This limits the chance that a single DLL will contain all the fixes, spoiling this use for DLLs.

Changing DLLs into object code libraries and linking the libraries into a single executable is another way to improve function call performance.

Use Static Member Functions Instead of Member Functions

Every call to a member function has an additional, implicit argument: a “this” pointer to the class instance whose member is being called. Class member data is accessed via an offset from the “this” pointer. Virtual member functions must dereference the “this” pointer to get the vtable pointer.

Sometimes a member function depends only on its arguments, doesn’t access any member data, and doesn’t call any virtual member functions. In this case, the “this” pointer serves no purpose.

Such member functions should be declared static. Static member functions don’t have to evaluate an implicit “this” pointer. They can be referred to with regular function pointers instead of more expensive member function pointers (see “Cost of pointers to functions”).
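As a minimal sketch (the class and function names here are invented for illustration), a conversion helper that reads no member data and calls no virtual functions can be declared static, and can then be called as Angle::degreesToRadians(90.0) without any object at all:

class Angle {
public:
    explicit Angle(double radians) : radians_(radians) {}

    // Depends only on its argument: no member data, no virtual calls,
    // so the implicit "this" pointer would serve no purpose.
    static double degreesToRadians(double degrees) {
        return degrees * 3.14159265358979323846 / 180.0;
    }

    double radians() const { return radians_; }

private:
    double radians_;
};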

Move Virtual Destructor to Base Class

The destructor of any class that will have derived classes should be declared virtual. This is necessary so that if a delete-expression references a pointer to the base class, both the derived class destructor and the base class destructor will be called.

There is another reason to provide some virtual function declaration in the most base class of any hierarchy: to ensure that a vtable pointer will be included in the base class.

The base class has a special position in a class hierarchy. If the base class has virtual member function declarations, the vtable pointer will be at offset zero in any class instance derived from this base class. If the base class declares member variables and doesn’t declare any virtual member functions, but some derived class does declare virtual member functions, every virtual member function call will have to add an offset to the “this” pointer to obtain the address of the vtable pointer. Ensuring that at least one member function in the base class is virtual forces the vtable to appear at offset zero, which generates more efficient code.

The destructor is an excellent candidate to be that function. It must be virtual anyway if the base class will have derived classes. The destructor is called only once during the life of a class instance, so the cost of making it virtual is minimal except for very small classes that are constructed and destroyed at a high rate in the program (and rarely are these small classes used to derive subclasses).

This may seem like such a rare case that it doesn’t deserve mention, but I have worked on several projects where important class hierarchies had a base class containing an intrusive reference count, transaction ID, or some such variable. This base class had little knowledge about what classes might be derived from it. Usually the first derivation in the hierarchy was an abstract base class declaring a bunch of virtual member functions. One thing the base class did know was that instances would eventually be destroyed.
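A minimal sketch of such a hierarchy, with invented names, might look like this:

// A base class that knows little about its eventual derived classes,
// only that instances will eventually be destroyed. Making the
// destructor virtual forces the vtable pointer to offset zero and
// makes deletion through a RefCounted* call the derived destructor.
class RefCounted {
public:
    virtual ~RefCounted() {}
protected:
    unsigned refCount_ = 0;
};

// A typical next layer: an abstract interface whose virtual functions
// reuse the vtable pointer the base class already established.
class Shape : public RefCounted {
public:
    virtual void draw() = 0;
};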

Optimize Expressions

Below the statement level, at the level of arithmetic expressions involving basic data types (integers, floats, pointers), there is a final opportunity to optimize. If a very hot function consists of a single expression, it may be the only opportunity.

STOP AND THINK

Modern compilers are very good at optimizing expressions involving basic data types. They are so good that the code they generate ranges from amazing all the way up to provably optimal. But they are not very brave. They will optimize expressions only when they can prove that the optimization does not change the behavior of the program.

Developers are smarter than compilers, though far less meticulous. Developers can code optimizations in situations where the compiler cannot determine that an optimization is safe, because developers can reason about the design and the intent of functions defined in other code modules not visible to the compiler.

It is in these relatively rare cases that developers can do better than compilers.

Optimizing expressions can have a significant effect on smaller processors that execute instructions one at a time. On desktop-class processors with multistage pipelines, the performance improvement is detectable, but not great. This is not the place to hunt for big game in the optimization jungle. Go here only on those rare occasions when you must squeeze the final iota of performance out of some red-hot inner loop or function.

Simplify Expressions

C++ evaluates expressions in strict order by the precedence and associativity of their operators. The expression a*b+a*c is evaluated as if written ((a*b)+(a*c)) because the precedence rules in C++ make multiplication higher priority than addition. The C++ compiler never uses the distributive law to recode this expression to a more efficient form like a*(b+c). The expression a+b+c is evaluated as if written ((a+b)+c), because the + operator associates to the left. The compiler never rewrites the expression to (a+(b+c)) even though that is equivalent in the math of integers and real numbers.

The reason C++ keeps its hands off expressions is that the modular arithmetic of the C++ int type isn't exactly like the mathematics of integers, and the approximate arithmetic of the C++ float type is not exactly like real math. C++ must let the programmer state precisely what is intended; if the compiler reordered expressions on its own, it could introduce overflows and precision losses of various kinds. That also means the developer, not the compiler, is responsible for writing expressions in a form that uses the fewest possible operators.

Horner’s rule for evaluating polynomials exemplifies how powerful rewriting expressions to a more efficient form can be as an optimization. Although most C++ developers don’t evaluate polynomials every day, we are all familiar with them.

The polynomial y = ax³ + bx² + cx + d could be written in C++ as:

y = a*x*x*x + b*x*x + c*x + d;

This statement performs six multiplies and three adds. Horner’s rule uses a repeated application of the distributive law to rewrite this statement as:

y = (((a*x + b)*x) + c)*x + d;

The optimized statement performs three multiplies and three adds. In general, Horner's rule reduces the n(n+1)/2 multiplies of the naive form to n, where n is the degree of the polynomial.
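As an illustration (this is a sketch, not code from the book's test harness), a general polynomial evaluator using Horner's rule might look like the following, assuming coefficients are supplied from the highest-degree term down to the constant:

#include <vector>

// Evaluate a polynomial by Horner's rule. Coefficients run from the
// highest-degree term down to the constant: {a, b, c, d} means
// ax³ + bx² + cx + d. One multiply and one add per loop iteration.
double horner(const std::vector<double>& coeff, double x) {
    double y = 0.0;
    for (double a : coeff)
        y = y * x + a;
    return y;
}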

A / B * C: A CAUTIONARY TALE

The reason C++ does not reorder arithmetic expressions itself is that doing so is dangerous. Numerical analysis is another subject that fills whole books. This is a single example of what can go wrong.

The C++ compiler will evaluate the expression a / b * c as if it was parenthesized ((a / b) * c). But there’s a problem with this expression. If a, b, and c are integral types, the result of a / b is not exact. Thus, if a = 2, b = 3, and c = 10, a / b * c is 2 / 3 * 10 = 0, when we really want it to be 6. The problem is that the inexactness in a / b is multiplied by c, producing a very incorrect result. A developer with good math skills might rearrange this expression as c * a / b, which the compiler will evaluate as if parenthesized ((c * a) / b). Then 2 * 10 / 3 = 6.

Problem solved, right? Well actually, no. If you do the multiplication first, there is a danger of overflow. If a = 86,400 (the number of seconds in a day), b = 90,000 (a constant used in video sampling), and c = 1,000,000 (the number of microseconds in a second), the expression c * a overflows a 32-bit unsigned. The original expression, while having a significant error, is less wrong than the revised one.
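A small demonstration, assuming the usual 32-bit unsigned int, shows both failure modes side by side:

#include <cstdio>

int main() {
    unsigned a = 86400, b = 90000, c = 1000000;

    unsigned r1 = a / b * c;   // division first: 86400/90000 truncates to 0
    unsigned r2 = c * a / b;   // multiplication first: c*a wraps modulo 2^32

    // The true mathematical result is 960,000; neither ordering produces it.
    printf("%u %u\n", r1, r2);
    return 0;
}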

The developer is solely responsible for knowing that the expression as written, with arguments having the expected magnitudes, will produce a correct result. The compiler cannot help with this task, which is why it leaves expressions alone.

Group Constants Together

One thing the compiler can do is evaluate constant expressions. Faced with an expression like

seconds = 24 * 60 * 60 * days;

or

seconds = days * (24 * 60 * 60);

the compiler can evaluate the constant part of the expression to produce something equivalent to

seconds = 86400 * days;

But if the programmer writes

seconds = 24 * days * 60 * 60;

the compiler is obliged to perform the multiplications at run time.

Always group constant expressions together with parentheses, or put them on the left side of the expression, or better yet, separate them out into the initializer of a const variable or put them in a constexpr function if your compiler supports this C++11 feature. This allows the compiler to efficiently evaluate the constant expression at compile time.
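For instance, something like the following sketch (the names are illustrative) guarantees that the constant part is folded at compile time no matter how the surrounding expression is written:

// Two ways to make the constant part unmistakably constant
// (constexpr is a C++11 feature).
const unsigned kSecondsPerDay = 24 * 60 * 60;

constexpr unsigned seconds_per_day() { return 24 * 60 * 60; }

unsigned to_seconds(unsigned days) {
    return days * kSecondsPerDay;   // compiles to 86400 * days
}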

Use Less-Expensive Operators

Some arithmetic operations are less expensive to compute than others. For instance, all but the smallest processors in use today can perform a shift or addition in one internal clock cycle. A few specialized digital signal processor chips have single-cycle multipliers, but for PCs, multiplication is an iterated computation similar to the decimal multiplication procedure taught to schoolchildren. Division is a more complex iterated procedure. This hierarchy of costs provides opportunities for optimization.

For instance, the integer expression x*4 can be recoded more efficiently as x<<2. Any half-decent compiler already performs this reduction in strength optimization. But what if the expression is x*y or x*func()? In many cases, the compiler cannot determine that y or func() always produces a value that is a power of two. This is where the programmer’s knowledge and skill beats the compiler. If one argument or the other can be modified to provide the exponent rather than the power of two value, the developer can rewrite the expression to use shift instead of multiplication.

Another possible optimization is to rewrite a multiplication into a sequence of shifts and adds. For instance, the integer expression x*9 can be rewritten as x*8+x*1, which in turn can be rewritten as (x<<3)+x. This optimization is most effective when the constant operand does not contain many set bits, because each set bit expands to a shift-and-add term. It’s effective on desktop- or handset-class processors that have instruction caches and pipelined execution units, and on very small processors where long multiplication is a subroutine call. Like all optimizations, this one should be tested to be sure it improves performance on a particular processor, but it is generally a win.
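As a sketch, with invented function names, the two rewrites described above might look like this:

// x*9 expressed as a shift and an add; for unsigned x the two forms
// compute the same value.
inline unsigned times9(unsigned x) {
    return (x << 3) + x;            // x*8 + x
}

// When the caller can supply the exponent instead of the power-of-two
// multiplier, a shift replaces the multiply outright.
inline unsigned mul_pow2(unsigned x, unsigned log2_factor) {
    return x << log2_factor;
}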

Use Integer Arithmetic Instead of Floating Arithmetic

Floating-point arithmetic is expensive. Floating-point numbers have a complex internal representation with a normalized integer mantissa, a separate exponent, and two signs. The hardware that implements the floating-point unit on a PC may occupy as much as 20% of the die area. Some multicore processors share a single floating-point arithmetic unit but have multiple integer arithmetic units per core.

Integer results can be computed at least 10 times faster than equivalent floating-point results, even on processors with hardware floating-point arithmetic units, and even when results of integer division are rounded instead of truncated. On small processors where floating-point math is implemented via a function library, integer math is many times faster. And yet, it is very typical to find developers using floating point for things as simple as rounded division.

Example 7-16 is the way most code I’ve encountered produces a rounded quotient. It converts the integral arguments to floating point, performs the division, and rounds the result.

Example 7-16. Integer rounding with floating point

unsigned q = (unsigned)round((double)n / (double)d);

A test loop that repeats this operation 100 million times took 3,125 milliseconds on my PC.

To obtain a rounded integral quotient, it is necessary to look at the remainder from division. The value of the remainder ranges from 0 to d–1, where d is the denominator. If the remainder is greater than or equal to half the denominator, the quotient should be rounded up to the next integer. The formula for signed integers is only a bit more complex.

C++ provides a function ldiv() from the C runtime library, which produces a structure containing both the quotient and remainder. Example 7-17 is a function to round a division result using ldiv().

Example 7-17. Rounding integer division using ldiv()

inline unsigned div0(unsigned n, unsigned d) {
    auto r = ldiv(n, d);
    return (r.rem >= (d >> 1)) ? r.quot + 1 : r.quot;
}

This function isn’t perfect. ldiv() expects signed (long) arguments, so the compiler complains about a signed/unsigned mismatch. It produces a correct result as long as both arguments are representable as positive signed values. div0() runs the test loop in 435 milliseconds, a satisfying 6 times faster than the floating-point version.

Example 7-18 is a function to compute the rounded quotient of two unsigned arguments.

Example 7-18. Rounding integer division

inline unsigned div1(unsigned n, unsigned d) {
    unsigned q = n / d;
    unsigned r = n % d;
    return r >= (d >> 1) ? q + 1 : q;
}

div1() computes the quotient and remainder. (d >> 1) is an efficient, reduced-strength form of d/2, which is half the denominator d. If the remainder is greater than or equal to half the denominator, the quotient is rounded up. Key to the success of this function is an optimization performed by the compiler. The x86 machine instruction that divides two integers produces both a quotient and a remainder. The Visual C++ compiler is smart enough to invoke this instruction only once when this code is executed. The same test, calling this function instead of using floating-point math, took 135 milliseconds. This is a rewarding 22 times faster.

Example 7-19 is another way to round an unsigned that is even faster, though at a price.

Example 7-19. Rounding integer division

inline unsigned div2(unsigned n, unsigned d) {
    return (n + (d >> 1)) / d;
}

div2() adds half the denominator d to the numerator n before dividing. div2()’s weakness is that for large numerators, n + (d >> 1) may overflow. If the developer is aware of the general magnitude of the arguments, div2() is very fast, at 102 milliseconds in the test loop (30 times faster than the floating-point version that is so common).

Double May Be Faster than Float

On my i7 PC running Visual C++, double floating-point computations were faster than float floating-point computations. First I present the result, then I will speculate about why this may occur.

The following code iteratively computes the distance traveled by a falling body, a typical floating-point computation:

float d, t, a = -9.8f, v0 = 0.0f, d0 = 100.0f;
for (t = 0.0; t < 3.01f; t += 0.1f) {
    d = a*t*t + v0*t + d0;
}

Ten million iterations of this loop consumed 1,889 milliseconds.

The same routine coded to use double variables and constants looks like this:

double d, t, a = -9.8, v0 = 0.0, d0 = 100.0;
for (t = 0.0; t < 3.01; t += 0.1) {
    d = a*t*t + v0*t + d0;
}

Ten million iterations of this version took only 989 milliseconds, making the double version almost twice as fast.

Why does this happen? Visual C++ generates floating-point instructions that reference the old “x87 FPU coprocessor” register stack. In this scheme, all floating-point computations are performed in a single 80-bit format. Single-precision float and double-precision double values are lengthened when they are moved into the FPU registers. The conversion for float may take more time than the conversion for double.

There are many ways to compile floating-point operations. On the x86 platform, using the SSE registers allows computation to be done directly in four different sizes. A compiler that used SSE instructions might behave differently, as might a compiler for a non-x86 processor.

Replace Iterative Computations with Closed Forms

What is it about C++ and bit-twiddling? Does the rich collection of arithmetic and bit-logical operators in C++ just facilitate moving bits around, or did the need to get bits of information into and out of device registers and network packets mold C++ into the language it is today?

Many occasions arise to count the number of set bits, find the most significant bit, determine the parity of bits in a word, determine whether the bits in a word represent a power of two, and so on. Most of these problems have simple O(n) solutions involving iterating through the n bits of the word. There may be more complex iterative solutions with better performance. But for some problems, there are faster and more compact closed-form solutions: constant-time computations that produce a solution without the need for iteration.

For example, there is a simple iterative algorithm to determine whether an integer value is a power of two. All such values have a single bit set, so counting the number of set bits is one solution. A simple implementation of this algorithm might look like Example 7-20.

Example 7-20. Iterative test of whether an integer is a power of 2

inline bool is_power_2_iterative(unsigned n) {
    for (unsigned one_bits = 0; n != 0; n >>= 1)
        if ((n & 1) == 1)
            if (one_bits != 0)
                return false;
            else
                one_bits += 1;
    return true;
}

A test loop of this algorithm took 549 milliseconds.

This problem also has a closed-form solution. If x is the nth power of two, it has a single set bit in the nth position (counting from zero for the least significant bit). Then x–1 is a bitmask with set bits in positions n–1,...,0, and x & (x–1) equals zero. If x is not a power of two, it has some additional bits set, x–1 zeros out only the least significant set bit, and x & (x–1) does not equal zero.

A function to determine whether x is a power of two using the closed form looks like Example 7-21.

Example 7-21. Closed-form test of whether an integer is a power of 2

inline bool is_power_2_closed(unsigned n) {
    return ((n != 0) && !(n & (n - 1)));
}

The same test loop using this revised function runs in 238 milliseconds, 2.3 times as fast. There are even faster ways. Rick Regan has an excellent web page that explores 10 ways, with timing measurements.

GET A COPY OF HACKER’S DELIGHT

The preceding section contains some advice on efficient bit-twiddling, but it’s only a sample to get you interested. There are hundreds of little tricks to improve arithmetic performance.

There is a book that should be in the library of every developer with an interest in optimizing at the expression level: Hacker’s Delight by Henry S. Warren, Jr. (Addison-Wesley Professional), now in its second edition. If you have the slightest interest in efficient expressions, reading Hacker’s Delight is like opening your first Lego kit or building your first electronic circuit from discrete components. Warren also maintains a website for Hacker’s Delight, with even more interesting links and discussions.

For a free taste of the Hacker’s Delight drug, you can peruse MIT Artificial Intelligence Laboratory Memo 239, affectionately known as HAKMEM,3 on the Web. HAKMEM is the conceptual ancestor of Hacker’s Delight, chock-full of bit-twiddling hacks collected at a time when the fastest processors were 10,000 times slower than your phone.

Optimize Control Flow Idioms

As noted in “Making Decisions Is Hard for Computers”, computation is faster than flow of control, due to pipeline stalls that occur in the processor when the instruction pointer must be updated to a nonconsecutive address. The C++ compiler tries hard to reduce the number of instruction pointer updates. It pays to be aware of what the compiler is doing in order to write the fastest code possible.

Use switch Instead of if-elseif-else

The flow of control in an if-else if-else statement is linear: evaluate the if condition, and if true, execute the first controlled block. Otherwise evaluate each else if condition, and execute the controlled block for the first one that evaluates to true. Testing a variable against each of n values takes an if-then-else if sequence with n controlled blocks. If all possible cases are about equally likely, the if-then-else if sequence makes O(n) comparisons. When this code is executed very frequently, as for event or instruction dispatch code, the cost adds up significantly.

The switch statement will also test a variable for each of n values, but the form of the switch statement, which compares the switch value against a series of constants, lets the compiler perform a number of useful optimizations.

In a frequently occurring case, when the constants to be tested against are taken from a set of sequential or nearly sequential values, the switch statement compiles into a jump table indexed by the test value or an expression derived from the test value. The switch statement performs a single indexing operation and jumps to the address in the table. The cost of the comparison is O(1), no matter how many cases appear. The cases don’t have to be presented in sequential order; the compiler can sort the jump table.

When the constants form a sequence with large gaps, a jump table becomes unmanageably large. The compiler may still sort the tested constants and emit code that performs a binary search. For a switch statement comparing against n values, the worst-case cost of this search is O(log2 n). In any case, the compiler never emits code for a switch statement that is slower than an equivalent if-then statement.
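As an illustration (the event names are invented), a dispatch routine over a dense set of values is exactly the shape of switch a compiler can turn into a jump table:

// Dispatch on a dense set of event codes; a compiler can compile this
// switch into a jump table indexed by ev.
enum Event { EV_OPEN, EV_CLOSE, EV_READ, EV_WRITE };

void dispatch(Event ev) {
    switch (ev) {
    case EV_OPEN:  /* handle open  */ break;
    case EV_CLOSE: /* handle close */ break;
    case EV_READ:  /* handle read  */ break;
    case EV_WRITE: /* handle write */ break;
    default:       /* ignore       */ break;
    }
}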

Sometimes a single branch of the if-elseif-else logic is very probable. In this case, the amortized performance of an if statement may approach a constant if the very probable case is tested first.

Use Virtual Functions Instead of switch or if

Before C++, if developers wanted to introduce polymorphic behavior in a program, they might have coded a struct or union type with a discriminator variable that told what the particular struct or union represented. The program would have been full of code that looked like this:

if (p->animalType == TIGER) {
    tiger_pounce(p->tiger);
}
else if (p->animalType == RABBIT) {
    rabbit_hop(p->rabbit);
}
else if (...)

Most experienced developers know this antipattern is the poster child for object-oriented programming. But it takes a while for the object-oriented mindset to become ingrained in novice developers. I have seen lots of production C++ code containing halfway-objectified code like this:

void Animal::move() {
    if (this->animalType == TIGER) {
        pounce();
    }
    else if (this->animalType == RABBIT) {
        hop();
    }
    else if (...)
        ...
}

From an optimization standpoint, the problem with this code is the use of a block of if statements to discriminate the derived type of the object. C++ classes already contain a mechanism to do this: virtual member functions, and the vtable pointer as discriminator.

A virtual function call indexes into the vtable to get the address of the virtual function body. It’s a constant-time operation, always. Thus, a virtual move() member function in the base class is overridden in derived classes representing each animal with the code for pounce, hop, swim, and so on.
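A sketch of the fully objectified version, reusing the animal names from the example above, might look like this:

// The vtable pointer replaces the animalType discriminator; each call
// to move() is a constant-time vtable lookup.
class Animal {
public:
    virtual ~Animal() {}
    virtual void move() = 0;
};

class Tiger : public Animal {
public:
    void move() override { pounce(); }
private:
    void pounce() { /* ... */ }
};

class Rabbit : public Animal {
public:
    void move() override { hop(); }
private:
    void hop() { /* ... */ }
};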

Use No-Cost Exception Handling

Exception handling is one of those optimizations best applied at design time. The design of an error propagation method ripples through every line of a program, so retrofitting a program for exception handling may be too expensive. That said, use of exception handling leads to programs that are faster when they execute normally, and better behaved when they fail.

Some C++ developers regard the C++ exception handling facility with suspicion. There is received wisdom that exception handling makes programs larger and slower, and that turning exception handling off via a compiler switch is thus an optimization.

The truth is more complex. It is true that if a program doesn’t use exception handling, turning it off via a compiler switch will make the program smaller, and maybe a bit faster. Jeff Preshing measured the cost at between 1.4% and 4% in a blog entry. But it isn’t clear exactly how well the resulting program will work. All the containers in the C++ standard library use new expressions that throw exceptions. Many other libraries, including stream I/O and the concurrency libraries covered in this book (see Chapter 12), throw exceptions. The dynamic_cast operator throws exceptions. With exception handling turned off, it isn’t clear what happens when the program encounters a situation where it would throw an exception.

If the program doesn’t throw exceptions, it may completely ignore error codes. In this case, the developer gets what she deserves. Otherwise, the program must patiently, carefully pass error codes up through its many layers of function calls, translating error codes from one code set to another at library boundaries and freeing resources as it goes. It must do this whether each operation succeeds or fails.

With exceptions, some of the cost of handling errors is moved from the happy path of normal program execution to the error path. In addition, the compiler automatically reclaims resources by invoking the destructors of all automatic variables on the path between the exception throw and the try/catch block that handles the exception. This simplifies the logic of the program’s happy path, with consequent performance gains.
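A minimal sketch, with an invented error and buffer, of how the happy path stays free of cleanup and error-propagation code:

#include <memory>
#include <stdexcept>

void process() {
    std::unique_ptr<char[]> buffer(new char[4096]);  // automatic owner
    // ... work that may throw ...
    throw std::runtime_error("device not ready");
}   // buffer's destructor runs during stack unwinding; no cleanup code

void caller() {
    try {
        process();
    }
    catch (const std::runtime_error&) {
        // The happy path above carries no error-code plumbing;
        // the cost is paid only here, on the error path.
    }
}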

In the early days of C++, each stack frame contained an exception context: a pointer to the list of objects that had been constructed, and thus must be destroyed if an exception was thrown through the stack frame. This context was updated dynamically as the program executed. This was undesirable because it put runtime cost in the happy path; this may be the source of the legend of the high cost of exception handling. A newer implementation maps the objects that need to be destroyed to a range of instruction pointer values. This mechanism has no runtime cost unless an exception is thrown, so exceptions are very inexpensive. Visual Studio uses this newer no-cost mechanism for 64-bit builds, and the older mechanism for 32-bit builds. The mechanism is selectable by a compiler switch in Clang.

Don’t use exception specifications

Exception specifications decorate function declarations, saying what exceptions the function may throw. A function with no exception specification may throw exceptions without penalty. A function with an exception specification may throw only the exceptions listed in the specification. If it throws any other exception, the program is unconditionally and immediately stopped by a call to terminate().

There are two problems with exception specifications. One problem is that it is difficult for the developer to know what exceptions may be thrown by a called function, especially one in an unfamiliar library. This makes programs using exception specifications brittle and likely to fail suddenly.

The second problem is that exception specifications negatively affect performance. Thrown exceptions must be checked, as if every function call with an exception specification entered a new try/catch block.

Traditional exception specifications were deprecated as of C++11.

C++11 introduced a new exception specification called noexcept. Declaring a function noexcept tells the compiler that the function cannot possibly throw any exception. If the function does throw one, terminate() is called, just as with a throw() specification. The difference is that the compiler requires certain move constructors and move assignment operators to be declared noexcept to implement move semantics (discussed in “Implement Move Semantics”). The noexcept specification on these functions serves as a declaration that move semantics are more important than the strong exception safety guarantee for certain objects. It’s very obscure, I know.
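A minimal sketch of what this looks like in practice, assuming C++11 (the Buffer class is invented for illustration):

#include <cstddef>
#include <utility>

// Marking the move constructor and move assignment noexcept lets
// containers such as std::vector move elements rather than copy them
// when they reallocate.
class Buffer {
public:
    explicit Buffer(std::size_t n) : size_(n), data_(new char[n]) {}
    ~Buffer() { delete[] data_; }

    Buffer(Buffer&& other) noexcept
        : size_(other.size_), data_(other.data_) {
        other.size_ = 0;
        other.data_ = nullptr;
    }

    Buffer& operator=(Buffer&& other) noexcept {
        std::swap(size_, other.size_);
        std::swap(data_, other.data_);
        return *this;
    }

    Buffer(const Buffer&) = delete;             // move-only for brevity
    Buffer& operator=(const Buffer&) = delete;

private:
    std::size_t size_;
    char*       data_;
};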

Summary

· Optimizing at the statement level doesn’t provide enough improvement to make it worthwhile unless there are factors that magnify the cost of the statement.

· The cost of statements in loops is magnified by the number of iterations of the loops.

· The cost of statements in functions is magnified by the number of times the functions are called.

· The cost of a frequently used idiom is magnified by the number of times the idiom is used.

· Some C++ statements (assignment, initialization, function argument evaluation) contain hidden function calls.

· Function calls into the operating system are expensive.

· An effective way to remove function call overhead is to inline the function.

· There is currently little need for the PIMPL idiom. Compile times today are maybe 1% of what they were when PIMPL was invented.

· double arithmetic may be faster than float arithmetic.

1 Some readers react, “Aagghh! Why would anyone write such code? Don’t they know std::string has a constant-time length() function?” Yet this sort of code is distressingly common in programs needing optimization. I also wanted to make the example obvious, because I’m illustrating a point.

2 When a real, mechanical key is depressed, it makes a connection that is at first intermittent. Taking an instantaneous look at the connection state may show the key as not pressed when in fact it is still in that initial, intermittent phase. “Debouncing” delays reporting a key as being pressed until the connection becomes continuous; 50 milliseconds is a typical debouncing interval.

3 Beeler, Michael, Gosper, R. William, and Schroeppel, Rich, “HAKMEM,” Memo 239, Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, 1972.