Optimize Searching and Sorting - Optimized C++ (2016)

Chapter 9. Optimize Searching and Sorting

There’s a way to do it better—find it.

Thomas A. Edison (1847–1931), American inventor and optimizer

C++ programs do a lot of searching. From programming language compilers to web browsers, from list controls to databases, many repeated activities have a search at the bottom of some inner loop. Searching comes up frequently on lists of hot functions in my experience. It is thus worthwhile to devote special attention to searching efficiently.

This chapter looks at searching in tables through the optimizer’s eye. I use searching as an example of a general process that a developer can use to approach the optimization task, by breaking down an existing solution into its component algorithms and data structures, then looking at each of these for opportunities to improve performance. I also evaluate some specific search methods to demonstrate the optimization process.

Most C++ developers know that the standard library container std::map can be used for looking up values that are associated with an index number or alphanumeric key string. Such associations are called key/value tables. They create a map from keys to values, hence the name of the container. Developers who know std::map remember that its performance is good in a big-O sense. This chapter explores ways to optimize the basic map-based search.

Fewer developers know that the C++ standard library’s <algorithm> header contains several iterator-based algorithms that can search sequence containers. These algorithms don’t all have the same big-O efficiency, even under optimal conditions. The best algorithm for each situation isn’t obvious, and the advice on the Internet doesn’t always point to the optimal method. Searching for the best search algorithm is presented as another example of the optimization process.

Even developers who know their way around the standard library algorithms may not yet have heard that hash-table-based containers were added in C++11 (and have been available in Boost for years). These unordered associative containers achieve excellent average constant-time lookup efficiency, but they are not a panacea.

Key/Value Tables Using std::map and std::string

As an example, this section explores the performance of searching and sorting a very common kind of key/value table. The table’s key type is a string of ASCII characters that can be initialized by a C++ string literal or held in a std::string. This kind of table is typical of tables used to parse initialization profiles, command lines, XML files, database tables, and other applications requiring a limited set of keys. The table’s value type can be as simple as an integer, or arbitrarily complex. The value type doesn’t affect search performance, except that a really big value might reduce cache performance. In my experience, simple value types dominate anyway, so this table’s value is a simple unsigned.

It’s easy to build a table that maps std::string names to values using std::map. The table may be defined simply:

#include <string>
#include <map>

std::map<std::string, unsigned> table;

Developers using a C++11 compiler can use the list initializer syntax shown here to easily populate the table with entries:

std::map<std::string, unsigned> const table {
    { "alpha",   1 }, { "bravo",    2 },
    { "charlie", 3 }, { "delta",    4 },
    { "echo",    5 }, { "foxtrot",  6 },
    { "golf",    7 }, { "hotel",    8 },
    { "india",   9 }, { "juliet",  10 },
    { "kilo",   11 }, { "lima",    12 },
    { "mike",   13 }, { "november",14 },
    { "oscar",  15 }, { "papa",    16 },
    { "quebec", 17 }, { "romeo",   18 },
    { "sierra", 19 }, { "tango",   20 },
    { "uniform",21 }, { "victor",  22 },
    { "whiskey",23 }, { "x-ray",   24 },
    { "yankee", 25 }, { "zulu",    26 }
};

Otherwise, the developer must write code like the following to insert each element:

table["alpha"] = 1;
table["bravo"] = 2;
...
table["zulu"] = 26;

Values may be retrieved or tested simply:

unsigned val = table["echo"];
...
std::string key = "diamond";
if (table.find(key) != table.end())
    std::cout << "table contains " << key << std::endl;

Creating tables with std::map is an example of how the C++ standard library provides powerful abstractions that achieve reasonable big-O performance with minimal thought and typing effort. It’s an instance of a general property of C++ mentioned in Chapter 1:

The mix of features in C++ provides a continuum of implementation choices ranging from hands-off automation and expressiveness on one hand, to increasingly fine control over performance on the other hand. It is this degree of choice that makes it possible to tune C++ programs to meet requirements for performance.

Toolkit to Improve Search Performance

But what if a profiling run fingers a function containing a table lookup as one of the hottest functions in a program? For instance:

void HotFunction(std::string const& key) {
    ...
    auto it = table.find(key);
    if (it == table.end()) {
        // activity when table item not found
        ...
    }
    else {
        // activity when table item is found
        ...
    }
    ...
}

Can the developer do better than the simplest implementation? How do we find out?

An experienced developer’s eye may of course immediately spot inefficiencies that can be tackled. He may know that an algorithm is suboptimal, or that a better data structure exists. I sometimes operate in this way, though the path is fraught with risk. It is better for the developer interested in optimization to proceed methodically:

· Measure performance of the existing implementation to obtain a baseline for comparison.

· Identify the abstract activity to be optimized.

· Decompose the activity to be optimized into its component algorithms and data structures.

· Change or replace algorithms and data structures that may not be optimal, and perform experiments to determine if the changes are effective.

If the activity being optimized is viewed as an abstraction, the optimization task is picking the abstraction’s baseline implementation apart into its pieces and building up a more specialized abstraction with better performance.

I like to do this process on a whiteboard if possible, or in a text editor, or in an engineering notebook. The process is iterative. The longer you spend pulling the problem apart, the more parts you discover. There are enough parts in most activities worth optimizing that human beings cannot reliably remember them all if they try to hold the whole thing in their heads. Paper has a better memory, and a whiteboard or text editor is even better because it allows thoughts to be easily added and erased.

Make a Baseline Measurement

As mentioned in “Measure Baseline Performance and Set Goals”, it is important to measure the performance of the unoptimized code, to get a baseline measurement against which to test possible optimizations.

I wrote a test that looks up all 26 values in the table introduced in “Key/Value Tables Using std::map and std::string”, and also 27 values not in the table. I repeated these 53 lookups a million times to get a measurable duration. For maps of strings, this test ran for about 2,310 milliseconds.

Identify the Activity to Be Optimized

The next step is to identify the activity to be optimized, so that the activity can be decomposed into pieces that can more easily be searched for candidate optimizations.

The idea of “the activity to be optimized” is a matter of judgment on the developer’s part. But there are a couple of clues. In this case, the developer can see that the baseline implementation is a std::map with std::string keys. A look at the hot code reveals a call to std::map::find(), a function that returns an iterator to the found entry, or to end(). Although std::map supports lookup, insertion, deletion, and iteration, the hot function is only doing a lookup. It may be necessary to look at other places in the code to see if other operations are performed on the table. Identifying where the table is constructed and destroyed is particularly interesting, because these activities may be time-consuming.

In this case, the activity to be optimized is reasonably obvious: looking up values in a key/value table implemented as a map with string keys. It is important to abstract from the baseline solution, however. The developer must not limit their investigation to only tables built from std::map and std::string.

There is a technique called “thinking backward and forward” for abstracting from a baseline implementation to a problem statement. The technique for thinking backward is to ask, “Why?” Why does the baseline implementation use a std::map, and not some other data structure? The answer is reasonably obvious: std::map facilitates looking up a value given a key. Why does the baseline implementation use std::string, as opposed to int or pointer-to-Foo? The answer is that the keys are strings of ASCII text. This analysis leads the developer to write a more abstract problem statement, looking up values in a key/value table with text keys. Notice the words map and string do not appear in this statement. This is a deliberate attempt to free the problem description from its baseline implementation.

Decompose the Activity to Be Optimized

The next step is to decompose the activity to be optimized into its component algorithms and data structures. The baseline solution (a map with string keys) can again be used as an example of the algorithms and data structures comprising the activity. However, the baseline solution represents one specific implementation of the activity, and the optimizing developer is seeking other possible implementations. That is why it is important to describe the algorithms and data structures in a general way that doesn’t restrict thinking to the existing solution.

The activity to be optimized is looking up values in a key/value table with text keys. Decomposing this statement into its component algorithms and data structures and checking the result against the baseline implementation produces this list:

1. The table, a data structure containing the keys and associated values

2. The key, a data structure that contains text

3. An algorithm to compare keys

4. An algorithm to search the table data structure to see if a specific key is present, and if so get the associated value

5. An algorithm to construct the table, or insert keys into the table

How did I know I needed these parts? Mostly it came from looking at the definition of the baseline solution’s data structure, std::map:

1. In the baseline solution, the table is a std::map.

2. In the baseline solution, the keys are instances of std::string.

3. Partly this is logic, but also std::map’s template definition provides a defaultable parameter for specifying a key comparison function.

4. The hot function contains a call to std::map::find() rather than using operator[].

5. This comes from knowledge that the map must be constructed and destroyed. General knowledge about std::map informs me that it is implemented as a balanced binary tree. The map is thus a linked data structure that must be constructed node by node, so there must be an insertion algorithm.

The last item, the algorithm for (and thus the time cost of) building and destroying tables, is often overlooked. This cost can be significant. Even if the cost is small compared to the time consumed doing table lookups, if the table has static storage duration (see “Storage Duration of Variables”) the initialization cost may be added to all other initialization taking place at startup (see “Remove Code from Startup and Shutdown”). The program becomes unresponsive if there are too many initializations. If the table has automatic storage duration, it may be initialized multiple times during the program run, making the startup cost more significant. Luckily, there are key/value table implementations with low or no cost to create and destroy.

Change or Replace Algorithms and Data Structures

In this step, the developer can think forward, asking, “How?” How, and in what different ways, can the program look up values in a key/value table with text keys? The developer looks for algorithms in the baseline solution that are less than optimal, and looks for costly behaviors provided by the data structures that are not necessary for the particular activity being optimized, and can thus be removed or simplified. Then the developer performs experiments to see whether any changes improve performance.

The activity to be optimized offers the following opportunities:

· The table data structure may be changed or made more efficient. Some choices of table data structure constrain the choice of search and insertion algorithm. Choice of table data structure may also affect performance if the data structure contains dynamic variables that require calls to the memory manager.

· The key data structure may be changed or made more efficient.

· The comparison algorithm may be changed or made more efficient.

· The search algorithm may be changed or made more efficient, though it may be constrained by the choice of table data structure.

· The insertion algorithm, and also when and how to construct and destroy the data structure, may be changed or made more efficient.

Here are some observations that are relevant:

· std::map is implemented as a balanced binary tree. The search algorithm in a balanced binary tree is O(log2 n). The biggest performance gain would likely result if std::map could be replaced by a data structure that facilitates lookup at a lower time cost.

· A quick look at the definition of std::map shows that the operations defined on maps are inserting items into the map, looking up items in the map, deleting items from the map, and traversing the items in the map. To facilitate this dynamism, std::map is a node-based data structure. The consequence is that std::map frequently calls into the memory manager during construction, and has poor cache locality. In the activity to be optimized, items are only inserted into the table during its construction, and never deleted until the table is. A less dynamic data structure might improve performance by invoking the allocator less, and having better cache locality.

· The functionality actually needed for a key data structure is to contain the characters, and to compare two keys. std::string does a bunch of additional stuff beyond what is needed for a table key. Strings maintain a dynamically allocated buffer so they can be modified and be made longer or shorter, but the key/value table does not modify keys. Furthermore, looking up a literal string causes an expensive conversion to std::string that might not be necessary if the keys were stored differently.

· std::string instances are designed to act as values. std::string defines all six comparison operators so that strings can be compared for equality or put into order. std::map is designed to work by default with value-like key types that implement the comparison operators. A data structure that does not define its own comparison operator can be used with std::map, by providing a comparison function as a template argument.

· The C++11-style list initialization shown previously for maps with string keys resembles a C-style static aggregate initializer, but it is not one. The initializer invokes executable code that invokes the memory allocator to create a std::string from each string literal, then calls insert() to add each item into the map, which invokes the allocator again to allocate a new node in the map’s balanced tree data structure. The advantage of this initializer is that the resulting table is const. However, the runtime cost is about the same as that of building up a map the pre-C++11 way, by inserting one value at a time. A data structure that could be constructed using a real C-style aggregate initializer would have zero runtime cost to construct and destroy.

Using the Optimization Process on Custom Abstractions

In the case of maps with string keys, the developer is fortunate. Many things about std::map can be programmed. The key type can be set. The key type’s comparison algorithm can be modified. The way the map allocates nodes can be programmed. Furthermore, there are options beyond std::map that allow the data structure and search algorithm to be improved, all within the C++ standard library. This, in a nutshell, is why C++ can be tuned for performance so effectively. The next few sections take the reader through this process.

The deliberate process outlined here can be used to optimize non–standard library abstractions too, but it may require more work. std::map and std::string are well-defined and well-documented data structures. Excellent resources on the Web reveal what operations they support, and how they are implemented. For a custom-built abstraction, if the abstraction was designed thoughtfully, its header file tells the developer what operations are provided. Comments or well-designed code tells what algorithm is in use, and how frequently the memory manager is called.

If the code is a mess, or if the interface is poorly designed and the code is spread across many files, then I have bad news and good news. The bad news is that the optimization process outlined here offers no specific help for the developer. All I have are platitudes, like “You knew the job was dangerous when you took it,” and “That’s why you make the big bucks.” The good news is that the more horrible the code is, the more likely it contains opportunities for optimization that will reward the developer for wading in.

Optimize Search Using std::map

The optimizing developer can improve performance by changing the data structure representing keys, and therefore the algorithm for comparing keys, while leaving the table data structure alone.

Use Fixed-Size Character Array Keys with std::map

As noted in Chapter 4, a developer may want to avoid the costs of using std::string keys in a really hot key/value table. Allocation dominates the cost of creating the table. The developer can cut this cost in half by using a key type that does not dynamically allocate storage. Furthermore, if the table uses std::string keys, and the developer wants to look up entries given by C-style string literals, as in

unsigned val = table["zulu"];

every lookup converts the char* string literal to a std::string, at the cost of allocating memory that is then immediately freed.

If the maximum size of a key is not too long, one solution is to make the key type some class type that contains a char array big enough to hold the biggest key. It isn’t possible to use an array directly, like this:

std::map<char[10],unsigned> table

because C++ arrays don’t have built-in comparison operators.

Here is the definition of a simple fixed-size character array template class called charbuf:

template <unsigned N=10, typename T=char> struct charbuf {
    charbuf();
    charbuf(charbuf const& cb);
    charbuf(T const* p);
    charbuf& operator=(charbuf const& rhs);
    charbuf& operator=(T const* rhs);
    bool operator==(charbuf const& that) const;
    bool operator<(charbuf const& that) const;
private:
    T data_[N];
};

charbuf is extremely simple. It can be initialized with or assigned a C-style null-terminated string. It can be compared to another charbuf. Because the constructor charbuf(T const*) is not explicit, it can also be compared to a null-terminated string by type conversion. Its size is fixed at compile time. It does not use dynamic memory at all.

Out of the box, C++ does not know how to compare two class instances or put them in order. The developer must define any relational operators she intends to use. The C++ standard library generally uses only operator== and operator<. The other four operators can be synthesized from these two. The operators may be defined as free functions:

template <unsigned N=10, typename T=char>
bool operator<(charbuf<N,T> const& cb1, charbuf<N,T> const& cb2);

But it is easier and better C++ style to define an operator< inside charbuf, where it has access to charbuf’s implementation.
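As a sketch, the member functions might be filled in as follows. This is my own illustration of one possible implementation, not the book’s listing; it assumes T is a character type, and the copy helper simply truncates overlong input (the design requires that all keys fit in N-1 characters anyway):

```cpp
#include <cstring>

template <unsigned N = 10, typename T = char> struct charbuf {
    charbuf() { data_[0] = T(); }
    charbuf(charbuf const& cb) { std::memcpy(data_, cb.data_, sizeof(data_)); }
    charbuf(T const* p) { copy(p); }
    charbuf& operator=(charbuf const& rhs) {
        std::memmove(data_, rhs.data_, sizeof(data_)); // memmove: safe on self-assignment
        return *this;
    }
    charbuf& operator=(T const* rhs) { copy(rhs); return *this; }
    bool operator==(charbuf const& that) const {
        return std::strcmp(data_, that.data_) == 0; // assumes T is char
    }
    bool operator<(charbuf const& that) const {
        return std::strcmp(data_, that.data_) < 0;  // assumes T is char
    }
private:
    void copy(T const* p) { // truncates overlong input; keys must fit in N-1 chars
        unsigned i = 0;
        for (; i < N - 1 && p[i]; ++i) data_[i] = p[i];
        data_[i] = T();
    }
    T data_[N];
};
```

Note that because the converting constructor is not explicit, an expression like `cb == "alpha"` works by converting the literal to a temporary charbuf.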

charbuf requires the programmer to think. It can handle only strings that fit in its fixed-size internal buffer, counting the trailing zero byte. It is therefore not guaranteed to be safe, like std::string. Verifying that all potential keys fit in the charbuf is an example of pushing computation out of run time all the way to design time. It is also an example of a safety compromise that may be required to improve performance. Only an individual design team has the perspective to say whether the benefits outweigh the risks. Pundits making blanket assertions are not credible.

I ran the same test of 53 names a million times using a std::map with charbuf keys. The test ran in 1,331 milliseconds. This is about twice as fast as the version using std::string.

Use C-Style String Keys with std::map

Sometimes the program has access to long-lived C-style null-terminated character strings, whose char* pointers can be used for the keys in a std::map. For instance, the table may be constructed using C++ string literals. The program can avoid the cost of constructing and destroying std::string instances by using the char*s directly.

There is a problem with using char* as the key type, though. std::map puts key/value pairs into its internal data structure using an ordering relation on the key type. By default, this relation evaluates the expression key1 < key2. std::string defines an operator< to compare the character strings. char* also defines an operator<, but the operator compares the pointers, not the pointed-to strings.

std::map allows the developer to solve this problem by providing a nondefault comparison algorithm. This is an example of the fine control that C++ offers over the behavior of its standard containers. The comparison algorithm is provided by the third template parameter of std::map. The default value for the comparison algorithm is the function object std::less<Key>. std::less defines a member function, bool operator()(Key const& k1, Key const& k2), which compares two keys by returning the result of the expression key1 < key2.

The program could in principle specialize std::less for char*. However, this specialization would be visible for at least the whole file, possibly causing unexpected behavior of some other part of the program.
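For illustration only, such a specialization might look like the following sketch (my own code, not the book’s; note also that the standard does not strictly sanction specializing a std template for a built-in type like char const*, which is one more reason to prefer the alternatives that follow):

```cpp
#include <cstring>
#include <functional>
#include <map>

// Specialization that compares the pointed-to strings, not the pointers.
// Every std::map<char const*, ...> declared with the default comparator
// in code that sees this specialization silently picks it up -- the
// visibility problem described above.
namespace std {
    template <> struct less<char const*> {
        bool operator()(char const* p1, char const* p2) const {
            return strcmp(p1, p2) < 0;
        }
    };
}

std::map<char const*, unsigned> table; // uses the specialization
```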

Instead of a function object, the program can provide a C-style free function to perform the comparison, as shown in Example 9-1. The function’s signature becomes the third argument to the map declaration, and the map is initialized with a pointer to the function.

Example 9-1. Map with C-style char* keys, free function as comparison function

bool less_free(char const* p1, char const* p2) {
    return strcmp(p1, p2) < 0;
}
...
std::map<char const*,
         unsigned,
         bool(*)(char const*, char const*)> table(less_free);

A test using this variation ran for 1,450 milliseconds, a significant improvement over the version using std::string keys.

The program could also create a new function object to encapsulate the comparison. In Example 9-2, less_for_c_strings is the name of a class type, so it serves as the type argument, and no pointer is needed.

Example 9-2. Map with char* keys, function object as comparison function

struct less_for_c_strings {
    bool operator()(char const* p1, char const* p2) {
        return strcmp(p1, p2) < 0;
    }
};
...
std::map<char const*,
         unsigned,
         less_for_c_strings> table;

This variation ran the test in 820 milliseconds. This is almost three times as fast as the original version, and nearly twice as fast as the char*-and-free-function version.

In C++11, another way to provide a char* comparison function for std::map is to define a lambda and pass it to the map constructor. Lambdas are convenient because they can be defined locally and have compact syntax. This approach is illustrated in Example 9-3.

Example 9-3. Map with char* keys, lambda as comparison function

auto comp = [](char const* p1, char const* p2) {
    return strcmp(p1, p2) < 0;
};
std::map<char const*,
         unsigned,
         decltype(comp)> table(comp);

Note in this example the use of the C++11 decltype keyword. The third argument to the map is a type. The name comp is a variable, and decltype(comp) is the type of that variable. The type of a lambda doesn’t have a name, and each lambda’s type is unique, so decltype is the only way to get the lambda’s type.

In the preceding example, the lambda behaves like a function object containing an operator(), so this mechanism has the same measured performance as the previous one, although the lambda must be passed as a constructor argument to the map.

The best time for the null-terminated string table was about three times faster than the original string version, and 55% faster than the fixed-size array version.

C++ VERSUS THE VISUAL STUDIO COMPILER

Visual Studio 2010 and Visual Studio 2015 compiled Example 9-3 correctly, but Visual Studio 2012 and 2013 produced an error message due to a reported bug in Visual Studio’s standard library implementation.

It is an interesting C++ factoid that a lambda with no capture can decay to a function pointer. For users who really like lambda syntax, it is possible to use a lambda even on the Visual Studio 2012 and 2013 compilers. As the third constructor argument to the map, use the signature of the function pointer to which the lambda decays:

auto comp = [](char const* p1, char const* p2) {
    return strcmp(p1, p2) < 0;
};
std::map<char const*,
         unsigned,
         bool(*)(char const*, char const*)> kvmap(comp);

In this case, the performance becomes that of std::map with a free function pointer, which is slower than its performance with a function object.

As the evolving C++ standard gradually teaches C++ compilers to do more and more type deduction, lambdas are going to become more interesting. As of early 2016, however, there is not much to recommend them over function objects.

Using Map’s Cousin std::set When the Key Is in the Value

Some programmers find it natural to define a data structure containing both a key and other data serving as the key’s value. Indeed, std::map internally declares a struct that combines the key and value as if it were defined as follows:

template <typename KeyType, typename ValueType> struct value_type {
    KeyType const first;
    ValueType     second;
    // ... constructors and assignment operators
};

If the program defines such a data structure, it can’t be used directly in a std::map. std::map demands that the key and value be defined separately for pragmatic reasons. The key must be constant, because changing the key could make the whole map data structure invalid. Also, specifying the key lets the map know how to access it.

std::map has a cousin, std::set, that facilitates the use of data structures containing their own keys. This type uses a comparison function that, by default, compares two whole elements using std::less. Thus, to use std::set and a user-defined struct that contains its own key, the developer must specialize std::less for the user-defined struct, specify operator< for the user-defined struct, or provide a nondefault comparison object. There is nothing to recommend any of these methods over the others. It is a style preference.
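A sketch of the last option, a nondefault comparison object for a struct that carries its own key (the struct and comparator names here are my own illustration):

```cpp
#include <cstring>
#include <set>

struct item {            // element carries its own key
    char const* key;
    unsigned    value;
};

// Nondefault comparison object: orders items by key contents only
struct item_less {
    bool operator()(item const& a, item const& b) const {
        return std::strcmp(a.key, b.key) < 0;
    }
};

std::set<item, item_less> items {
    { "alpha", 1 }, { "bravo", 2 }, { "charlie", 3 }
};
```

A lookup then constructs a probe element whose value field is ignored by the comparator: `items.find(item{ "bravo", 0 })`.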

I mention this now because when I describe using a sequence container for a key/value table, it will also be necessary either to define a comparison operator for the item data structure or to specify a comparison function when the search algorithm is invoked.

Optimize Search Using the <algorithm> Header

In the previous section, I explored improving performance by changing the data structure representing keys, and therefore the algorithm for comparing keys. In this section, I explore changes to the search algorithm and table data structure.

In addition to data structures like std::string and std::map, the C++ standard library provides a collection of algorithms, including algorithms for searching and sorting. Standard library algorithms take iterators as arguments. Iterators abstract the behavior of pointers, separating the traversal of values from the data structure that contains those values. Standard library algorithms are specified in terms of the abstract behavior of their iterator arguments, rather than in terms of some concrete data structure. Iterator-based algorithms can be applied within many kinds of data structures, as long as the iterators over these data structures have the required properties.

The standard library search algorithms take two iterators: one pointing to the beginning of the sequence of items to be searched, and one to the end (the item just past the last item) of the sequence to be searched. All take a key to be searched for, and optionally take a comparison function. The algorithms differ in what they return, and whether the comparison function must define an ordering relationship among the keys or just an equality test.

The part of the data structure to be searched can be described as the range [first, last), where the square bracket to the left of first means that first is included in the interval, and the close parenthesis after last means that last is not included. The range notation sees much use in descriptions of standard library algorithms.

Some iterator-based search methods implement divide-and-conquer algorithms. These algorithms rely on a particular property of some iterators—the ability to compute the distance, or number of elements, between two iterators—and in this way achieve their better-than-linear big-O behavior. The distance between two iterators can always be computed by incrementing one iterator until it compares equal to the second, but this results in a time cost of O(n) to compute the distance. A special property of random-access iterators is that the distance can be computed in constant time.

The sequence containers that provide random-access iterators are C-style arrays, std::string, std::vector, and std::deque. Divide-and-conquer algorithms can be applied to std::list, but their time cost will be O(n), not O(log2 n), because of the higher cost to compute the distance on bidirectional iterators.
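The difference in distance computation is visible through std::distance, which works on both iterator categories but at very different cost (a small illustration of my own, not from the book):

```cpp
#include <iterator>
#include <list>
#include <vector>

std::vector<int> vec(1000000);   // random-access iterators
std::list<int>   lst(1000000);   // bidirectional iterators

// Constant time: effectively a pointer subtraction
auto dv = std::distance(vec.begin(), vec.end());

// Linear time: std::distance must step node by node, counting
// all one million elements
auto dl = std::distance(lst.begin(), lst.end());
```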

The names string and map are evocative of their function, and may lead the novice user directly to a solution using these data types. Unfortunately, not all algorithms for iterator-based search have such evocative names. They are also very general, so choosing the right one makes a considerable difference in performance, even among algorithms with the same big-O time cost.

Key/Value Table for Search in Sequence Containers

There are several reasons to choose a sequence container implementation for a key/value table over using std::map or its cousin std::set. Sequence containers consume less memory than maps. They can also be less expensive to set up. One very useful property of standard library algorithms is that they can iterate over ordinary arrays of any type. They can thus perform efficient lookups in statically initialized arrays of structs. This eliminates all costs of setting up and destroying the tables. Furthermore, some coding standards, such as MISRA C++, forbid or limit the use of dynamically allocated data structures. It is reassuring to have an efficient solution for searching in these environments.

The examples in this section make use of the following struct definition:

struct kv { // (key,value) pairs
    char const* key;
    unsigned    value; // can be anything
};

A static array of these key/value pairs is defined as follows:

kv names[] = { // in alphabetical order
    { "alpha",   1 }, { "bravo",    2 },
    { "charlie", 3 }, { "delta",    4 },
    { "echo",    5 }, { "foxtrot",  6 },
    { "golf",    7 }, { "hotel",    8 },
    { "india",   9 }, { "juliet",  10 },
    { "kilo",   11 }, { "lima",    12 },
    { "mike",   13 }, { "november",14 },
    { "oscar",  15 }, { "papa",    16 },
    { "quebec", 17 }, { "romeo",   18 },
    { "sierra", 19 }, { "tango",   20 },
    { "uniform",21 }, { "victor",  22 },
    { "whiskey",23 }, { "x-ray",   24 },
    { "yankee", 25 }, { "zulu",    26 }
};

The initializer for the names array is a static aggregate initializer. The C++ compiler creates initialized data for C-style structs at compile time. There is zero runtime cost to create such arrays.

The various algorithms are evaluated by searching in this small table for each of its 26 keys, and for 27 invalid strings chosen to fall between the values in the table. The 53 searches are then repeated a million times. This is the same test performed for std::map-based containers in the previous section.

Standard library container classes provide member functions begin() and end() so the program can obtain an iterator range to search. C-style arrays are more primitive, offering no such functions. However, a little template magic provides type-safe template functions for this purpose. Because each function takes a reference to an array as its argument, the array does not decay to a pointer as it normally would:

// get size and begin/end from a C-style array
template <typename T, int N> size_t size(T (&a)[N]) {
    return N;
}

template <typename T, int N> T* begin(T (&a)[N]) {
    return &a[0];
}

template <typename T, int N> T* end(T (&a)[N]) {
    return &a[N];
}

In C++11, more sophisticated definitions for begin() and end() using the same template magic can be found in namespace std in the <iterator> header. This header is included any time a standard library container class header is included. Visual Studio 2010 provides these definitions in anticipation of the standard. Unfortunately, std::size() did not become standard until C++17, and it does not appear in Visual Studio 2010, though a simple equivalent is easy enough to provide.

std::find(): Obvious Name, O(n) Time Cost

The standard library <algorithm> header defines a template function find() as follows:

template <class It, class T> It find(It first, It last, const T& key)

The find() algorithm is a simple linear search. Linear search is the most general kind of search. It doesn’t require the searched data to be ordered in any way. It only requires the keys to be comparable for equality.

find() returns an iterator to the first entry in a sequence container that compares equal to key. The iterator arguments first and last delimit the range to be searched, with last pointing to the first element after the end of the data to be searched. The types of first and last, given by the template parameter It, depend on the kind of data structure to be traversed by find(). An example is given in Example 9-4.

Example 9-4. Linear search using std::find()

kv* result = std::find(std::begin(names), std::end(names), key);

In this invocation, names is the name of the array to be searched. key is a value to be compared to each kv entry. To do the comparison, a specific function comparing keys must be defined in the scope where find() is instantiated. This function tells std::find() everything it needs to know to do the comparison. C++ allows the equality operator bool operator==(v1,v2) to be overloaded for values of arbitrary pairs of types. If key is a pointer to char, the required function is:

bool operator==(kv const& n1, char const* key) {
    return strcmp(n1.key, key) == 0;
}

A timing experiment using std::find() and a set of keys both in and not in the sample 26-element table took 1,425 milliseconds.

A variation of find() called find_if() takes the comparison function as a fourth argument. Instead of defining operator==() in the scope of find(), the developer can write it as a lambda. The lambda takes a single argument, the table element being compared against. So, the lambda must capture the key value from the environment.

std::binary_search(): Does Not Return Values

Binary search is a divide-and-conquer strategy that is so generally useful that the C++ standard library provides several different algorithms that use it. But for some reason, the evocative name binary_search was used for an algorithm that is not particularly useful for looking up values.

The standard library algorithm binary_search() returns a bool indicating whether a key is in a sorted table. Strangely enough, there is no related function to return the matching table element. Thus, neither of the two most obvious names, find and binary_search, provides the solution we are seeking.

If the program just wants to know if an item is in the table, instead of finding its value, std::binary_search() performed the timing test in 972 milliseconds.

Binary Search Using std::equal_range()

If the sequence container is sorted, the developer can piece together an efficient searching function from the bits and pieces provided by the C++ standard library. Unfortunately, these pieces have names that don’t evoke the obvious notion of a binary search.

The C++ standard library’s <algorithm> header contains the template function std::equal_range(), defined as:

template <class ForwardIt, class T>
std::pair<ForwardIt, ForwardIt>
    equal_range(ForwardIt first, ForwardIt last, const T& value);

equal_range() returns a pair of iterators delimiting a subsequence of the sorted sequence [first, last) that contains entries equal to value. If no entries are equal to value, equal_range() returns a pair of iterators that point to the same position, indicating an empty range. If the returned iterators are unequal, there is at least one entry that matched key. By construction, in the example problem there can be no more than one match, and the first iterator points to it. Example 9-5 sets result to the matching table entry, or to the end of the table.

Example 9-5. Binary search using std::equal_range()

auto res = std::equal_range(std::begin(names),
                            std::end(names),
                            key);
kv* result = (res.first == res.second)
                 ? std::end(names)
                 : res.first;

An experiment to test the performance of equal_range() on the sample table ran for 1,810 milliseconds. This is actually worse than the linear search for the same-sized table, which is pretty shocking. However, we’ll see that equal_range() was not the best choice for a binary search function.

Binary Search Using std::lower_bound()

Although equal_range() promises O(log2 n) time cost, it actually contains more machinery than is necessary for table lookup. A possible implementation of equal_range() might look like this:

template <class It, class T>
std::pair<It, It>
    equal_range(It first, It last, const T& value) {
    return std::make_pair(std::lower_bound(first, last, value),
                          std::upper_bound(first, last, value));
}

upper_bound() makes a second divide-and-conquer pass through the table to find the end of the returned range, because equal_range() is general enough to work with sorted sequences containing more than one value with the same key. But by construction, in the example table the range will always contain either one entry or none. The search can be performed using lower_bound() and one additional comparison, as shown in Example 9-6.

Example 9-6. Binary search using std::lower_bound()

// requires bool operator<(kv const& e, char const* k),
// e.g. { return strcmp(e.key, k) < 0; }
kv* result = std::lower_bound(std::begin(names),
                              std::end(names),
                              key);
if (result != std::end(names) && strcmp(key, result->key) != 0)
    result = std::end(names);

In this example, std::lower_bound() returns an iterator to the first entry in the table whose key is not less than key. This iterator points to the end of the table if all the entries are less than key. It may point to an entry that is greater than key. The final if statement sets result to the end of the table if either of these conditions is true. Otherwise, result points to the entry whose key is equal to key.

A version of the experiment using this search took 973 milliseconds, which is a satisfying 86% faster than std::equal_range(). This is to be expected because it is doing about half the work.

A search using std::lower_bound has performance that is competitive with the best implementation using std::map, and has the additional advantage of zero cost to construct or destroy the static table. The function std::binary_search() also runs the experiment in 973 milliseconds, albeit with only the Boolean result. It seems as if this is as far as we can go using the C++ standard library algorithms.

Handcoded Binary Search

It is possible to handcode a binary search taking the same arguments as the standard library functions. The standard library algorithms all use a single ordering function, operator<(), so that only the minimum interface need be provided. Because these functions eventually have to determine if a key matches an entry, they do an extra comparison at the end, noting that a == b can be defined as !(a < b) && !(b < a).

The original table occupies a range of consecutive values denoted as [start, end). At each step in the search, the function (shown in Example 9-7) computes the midpoint of the range and compares key against the entry at the midpoint. If that entry is less than key, the search continues in the right half, [mid+1, stop); otherwise it continues in the left half, [start, mid), with mid itself remaining a candidate result.

Example 9-7. Handcoded binary search using “<” for comparison

// requires bool operator<(kv const&, char const*) and
// bool operator<(char const*, kv const&), defined in terms of strcmp()
kv* find_binary_lessthan(kv* start, kv* end, char const* key) {
    kv* stop = end;
    while (start < stop) {
        auto mid = start + (stop - start) / 2;
        if (*mid < key) { // search right half [mid+1, stop)
            start = mid + 1;
        }
        else { // search left half [start, mid)
            stop = mid;
        }
    }
    return (start == end || key < *start) ? end : start;
}

A timing run for this binary search took 968 milliseconds. This is approximately the same as the earlier version based on std::lower_bound().

Handcoded Binary Search using strcmp()

It is possible to peel the onion yet further, noting that operator<() is defined in terms of strcmp(). Like operator<(), strcmp() compares two keys. But strcmp() produces more information: an int result that is less than, equal to, or greater than zero, as the first key is less than, equal to, or greater than the second key. The resulting code, as shown in Example 9-8, looks like what you might write in C.

During each iteration of the while loop, the sequence to search is given by the interval [start, stop). At each step, mid is set to the midpoint of the sequence to be searched. The value returned by strcmp() divides the sequence into three parts instead of two: [start, mid), [mid, mid+1), and [mid+1, stop). If mid->key is greater than key, we know that key must be in the left part of the sequence, before mid. If mid->key is less than key, key must be in the right part of the sequence beginning with mid+1. Otherwise, mid->key equals key, and the loop terminates. The if/else logic does the more frequently occurring comparisons first to improve performance.

Example 9-8. Handcoded binary search using strcmp()

kv* find_binary_3(kv* start, kv* end, char const* key) {
    auto stop = end;
    while (start < stop) {
        auto mid = start + (stop - start) / 2;
        auto rc = strcmp(mid->key, key);
        if (rc > 0) {
            stop = mid;
        }
        else if (rc < 0) {
            start = mid + 1;
        }
        else {
            return mid;
        }
    }
    return end;
}

A timing run for this version of the binary search took 771 milliseconds. This is nearly 26% faster than the best standard library–based search.

Optimize Search in Hashed Key/Value Tables

The previous section examined changes to the algorithm for a particular table data structure that was very efficient to set up. In this section, I examine another table data structure and algorithm: that of hash tables.

The idea of a hash table is that the key, whatever its type, can be fed to a hash function that reduces the key to an integer hash. The hash in turn becomes an array index that leads directly to a table entry. If the table entry then matches the key, the search is a success. If the hash always leads directly to the table entry, the hash table allows access in constant time. The only cost is the cost of producing the hash. Like linear search, hashing does not assume any ordering relationship among the keys, but requires only a means to compare keys for equality.

The complicated part of implementing a hash table is finding an efficient hash function. A 10-letter string contains many more bits than a 32-bit integer. More than one string can thus hash to the same table index. Some mechanism must be provided to resolve such collisions. Each entry in the hash table might be the head of a list of entries that hash to that index. Alternately, adjacent indexes might be searched for the matching entry until an empty index is found.

Another problem is that the hash function may not produce a particular index for any of the valid keys in the table, leaving an unused location in the hash table. This makes a hash table potentially larger than a sorted array of the same entries.

A poor hash function or an unlucky set of keys can cause many keys to hash to the same index. Then the performance of the hash table degrades to O(n), making it no better than a linear search.

A good hash function produces array indexes that are not highly correlated to the bits of the key. Random number generators and cryptographic encoders are good for this goal. But if the hash function is costly to compute, there may be no advantage over a binary search unless the table is very large.

Hunting better hash functions has been a pastime of computer scientists for many years. A Q&A on Stack Exchange provides performance data and reference links for several popular hash functions. The developer tuning hash table code should be aware that these waters have been heavily fished. There are no big performance improvements to be found here.

C++ defines a standard hash function object called std::hash. std::hash is a template with specializations for integers, floating-point data, pointers, and std::string. The unspecialized definition of std::hash, which is also used for pointers, converts the hashed type to a size_t, and then randomizes the bits of this value.

Hashing with a std::unordered_map

In C++11, the standard header <unordered_map> provides a hash table. Visual Studio 2010 anticipated the standard and provided this header as well. std::unordered_map can’t be used with the handcoded static table used in previous examples, though. The entries must be inserted into the hash table, adding cost at startup. The code for creating a hash table using std::unordered_map and inserting entries is shown in Example 9-9.

Example 9-9. Initializing a hash table

std::unordered_map<std::string, unsigned> table;
for (auto it = std::begin(names); it != std::end(names); ++it)
    table[it->key] = it->value;

The default hash function used by std::unordered_map is the template function object std::hash. This template has a specialization for std::string, so no hash function need be explicitly provided.

After entries are installed in the table, a lookup can be performed:

auto it = table.find(key);

it is an iterator to the table that points either to a valid entry or to table.end().

std::unordered_map with std::string keys uses all the default values of the map’s template to achieve simplicity and reasonable performance. An experiment to measure the performance of std::unordered_map took 1,725 milliseconds, not counting time to construct the table. This is 56% faster than std::map with string keys, but hardly a world-beating result. Given the hype about hashing being a performance win, this result was surprising and disappointing.

Hashing with Fixed Character Array Keys

A version of the simple fixed character array template class charbuf from section “Use Fixed-Size Character Array Keys with std::map” can be used with hash tables. The following template extends charbuf with a means to hash the character string and an operator== for comparing keys if there is a collision:

template <unsigned N = 10, typename T = char> struct charbuf {
    charbuf();
    charbuf(charbuf const& cb);
    charbuf(T const* p);
    charbuf& operator=(charbuf const& rhs);
    charbuf& operator=(T const* rhs);
    operator size_t() const;
    bool operator==(charbuf const& that) const;
    bool operator<(charbuf const& that) const;
private:
    T data_[N];
};

The hash function is operator size_t(). This is a little unintuitive, and also a little unclean. It happens that the default specialization of std::hash casts its argument to a size_t. For pointers, this normally just casts the bits of the pointer, but for a charbuf, the conversion invokes charbuf::operator size_t(), which returns the hash as a size_t. Of course, with operator size_t() hijacked for this use, any other context that implicitly converts a charbuf to an integral type will quietly produce the hash instead, which can be very misleading. The declaration of a hash table using charbuf looks like this:

std::unordered_map<charbuf<>, unsigned> table;

Performance of this hash table was disappointing. It ran the test of 53 keywords in 2,277 milliseconds, worse than either the hash table or the map with std::string keys.

Hashing with Null-Terminated String Keys

These things must be done caaaaarefully, or it hurts the spell.

The Wicked Witch of the West (Margaret Hamilton), The Wizard of Oz, 1939, musing about how to remove the ruby slippers from Dorothy’s feet

If the hash table can be initialized with long-lived null-terminated strings, such as C++ string literals, a hash-based key/value table can be constructed with pointers to these strings. There is performance gold to be mined by making std::unordered_map work with char* keys. But it isn’t trivial.

The full definition of std::unordered_map is:

template<
    typename Key,
    typename Value,
    typename Hash = std::hash<Key>,
    typename KeyEqual = std::equal_to<Key>,
    typename Allocator = std::allocator<std::pair<const Key, Value>>
> class unordered_map;

Hash is a function object or function pointer type declaration for the function that hashes a Key. KeyEqual is a function object or function pointer type declaration for the function that compares two instances of Key for equality to resolve hash collisions.

When Key is a pointer, Hash is well defined. The program compiles without errors. It may even appear to run. (My initial tests ran, producing an excellent test run time and a false sense of accomplishment.) But the program is not correct. What happens is that std::hash produces a hash of the pointer values, rather than a hash of the pointed-to strings. If, for instance, a test program loads the table from an array of strings, then tests that each string can be found, the pointers to the test keys are the same as the pointers to the keys that initialized the table, and it appears to work. Test the table with a duplicate string taken from user input, however, and the test claims that the string is not in the table because the pointer to the test string is not the same as the pointer to the key that initialized the table.

This problem can be solved by providing a nondefault hash function in place of the default value for the third argument to the template. Just as for a map, this hash function can be a function object, lambda declaration, or free function pointer:

struct hash_c_string {
    static void hash_combine(size_t& seed, char v) {
        seed ^= v + 0x9e3779b9 + (seed << 6) + (seed >> 2);
    }
    std::size_t operator()(char const* p) const {
        size_t hash = 0;
        for (; *p; ++p)
            hash_combine(hash, *p);
        return hash;
    }
};

// this solution is incomplete -- see below
std::unordered_map<char const*, unsigned, hash_c_string> table;

I have taken the hash function from Boost's hash_combine(). Similar functionality has been proposed for the standard library, but, alas, Visual Studio 2010 did not provide it.

Careful testing reveals that this declaration is still not correct, although it compiles without error and produces a program that may work for some small tables. The problem is with KeyEqual, the fourth argument of the std::unordered_map template. The default value of this argument isstd::equal_to, a function object that applies operator== to its two operands. operator== is defined for pointers, but it compares the pointers for their order in the computer’s memory space, rather than the pointed-to strings.

The solution, of course, is another nondefault function object for the KeyEqual template parameter. The complete solution is shown in Example 9-10.

Example 9-10. std::unordered_map with null-terminated string keys

struct hash_c_string {
    static void hash_combine(size_t& seed, char v) {
        seed ^= v + 0x9e3779b9 + (seed << 6) + (seed >> 2);
    }
    std::size_t operator()(char const* p) const {
        size_t hash = 0;
        for (; *p; ++p)
            hash_combine(hash, *p);
        return hash;
    }
};

struct comp_c_string {
    bool operator()(char const* p1, char const* p2) const {
        return strcmp(p1, p2) == 0;
    }
};

std::unordered_map<
    char const*,
    unsigned,
    hash_c_string,
    comp_c_string
> table;

This version of the key/value table, using std::unordered_map and char* keys, ran the test in 993 milliseconds. This is 73% faster than the hash table based on std::string. But it's a paltry 9% faster than the best implementation based on std::map and char* keys. And it is slower than the binary-search algorithm on a simple static array of key/value entries using std::lower_bound. This is not what years of hype led me to expect. (In “std::unordered_map and std::unordered_multimap”, we’ll see that larger hash tables have a greater advantage over binary search–based search algorithms.)

Hashing with a Custom Hash Table

A hash function for use with unknown keys must be very general. If the keys are known in advance, as they are in the example table, a very simple hash function may suffice.

A hash that creates a table in which there are no collisions for a given set of keys is called a perfect hash. A hash that creates a table with no unused spaces is called a minimal hash. The Holy Grail of hash functions is a minimal perfect hash, creating a table with no collisions and no empty spaces. Perfect hashes for reasonably short sets of keywords are easy to create, and perfect minimal hashes are only a little more difficult. The first letter (or first couple of letters), the sum of the letters, and the length of the key are all example hash functions that might be tried.

In the example table for this section, the 26 valid entries each begin with a different letter and they are sorted, so a hash based on the first letter is a perfect minimal hash. It doesn’t matter what invalid keys hash to. They are compared against the valid key at their hash, and the comparison fails.

Example 9-11 shows a very simple custom hash table implemented to resemble std::unordered_map.

Example 9-11. Minimal perfect hash table based on the example table

unsigned hash(char const* key) {
    if (key[0] < 'a' || key[0] > 'z')
        return 0;
    return key[0] - 'a';
}

kv* find_hash(kv* first, kv* last, char const* key) {
    unsigned i = hash(key);
    return strcmp(first[i].key, key) ? last : first + i;
}

hash() maps the first letter of key to one of the 26 table entries. The initial range test on key[0] is defensive programming, preventing access to undefined storage in case the key is something like “@#$%”.

An experiment to test find_hash() ran in 253 milliseconds. This result is stunning.

Although the simple hash function that worked for the sample table was particularly fortunate, the table was not contrived to achieve this result. There is often a simple function that is a minimal perfect hash. There are papers on the Internet discussing various methods to automatically generate perfect minimal hash functions for small keyword sets. The GNU Project (among others) has built a command-line tool called gperf for producing a perfect hash function that is often also minimal.

Stepanov’s Abstraction Penalty

The experiment I used for searches looked up all 26 valid table entries and 27 invalid table entries. This creates a kind of average performance. Linear search looked relatively better on tests consisting only of keys in the table, because linear search terminates as soon as it finds a matching entry. Binary searches make about the same number of comparisons whether they find the key in the table or not.

Table 9-1 summarizes the results of the search experiments.

VS2010 release, i7, 1m iterations    Time        % improvement    % improvement
                                                 vs. previous     vs. category

map<string>                          2,307 ms       --               --
map<char*> free function             1,453 ms       59%              59%
map<char*> function object             820 ms       77%             181%
map<char*> lambda                      820 ms        0%             181%
std::find()                          1,425 ms       --               --
std::equal_range()                   1,806 ms       --               --
std::lower_bound                       973 ms       53%              86%
find_binary_3way()                     771 ms       26%             134%
std::unordered_map()                   509 ms       --               --
find_hash()                            195 ms      161%             161%

Table 9-1. Summary of search performance experiments

As expected, binary search was faster than linear search, and hashing was faster than binary search.

The C++ standard library provides a bunch of ready-to-use, debugged algorithms and data structures that are useful in many situations. The standard defines their worst-case time cost in big-O terms to demonstrate that they are widely applicable.

But there is a cost for using the extremely powerful and general-purpose machinery of the standard library. Even when a standard library algorithm with good performance exists, it is often not competitive with the best handcoded algorithm. This may be due to weaknesses in the template code or weaknesses in the compiler design, or just because the standard library code has to work for very general situations (like using only operator<() and not strcmp()). This cost may persuade the developer to code really important searches by hand.

The gap in performance between standard algorithms and good, handcoded algorithms is called Stepanov’s Abstraction Penalty, after Alexander Stepanov, who designed the original standard library algorithms and container classes, at a time when no compiler existed that could compile them. Stepanov’s Abstraction Penalty is the inevitable cost of providing a universal solution versus a custom-coded one. Stepanov’s Abstraction Penalty is a toll for using highly productive tools like the C++ standard library algorithms. It’s not a bad thing, but it’s a thing developers need to keep in mind when they need to go really fast.

Optimize Sorting with the C++ Standard Library

A sequence container must be sorted before it can be searched efficiently using a divide-and-conquer algorithm. The C++ standard library provides two standard algorithms, std::sort() and std::stable_sort(), that can sort sequence containers efficiently.

Although the standard does not specify which sorting algorithm is used, it was written so that std::sort could be implemented using some variation of quicksort, and std::stable_sort could be implemented using merge sort. C++03 requires std::sort to have average performance that is O(n log2 n). Implementations conforming to C++03 generally implemented std::sort using quicksort, usually with some median-picking trick to reduce the chance of quicksort’s O(n2) worst-case. C++11 requires the worst-case performance to be O(n log2 n). Implementations conforming to C++11 are generally hybrid sorts such as Timsort or introsort.

std::stable_sort is usually a variant of merge sort. The standard’s peculiar wording is that std::stable_sort runs in O(n log2 n) time if sufficient additional memory can be allocated, but otherwise runs in O(n (log2 n)2) time. A typical implementation is to use merge sort if the recursion depth is not too great, and heapsort if it is.

The value of a stable sort is that a program can sort a range of records by each of several criteria (like first name, and then last name) and get records sorted by the second criterion, then by the first criterion within the second (like last name, and then first name within last name). Only stable sorts provide this property. This additional property justifies having two sorts.

Table 9-2 reports the results of an experiment in which I sorted 100,000 randomly generated key/value records stored in a std::vector. An interesting result is that std::stable_sort() actually performed better than std::sort(). I also tested sorting an already-sorted table. I examine sorting in different data structures in Chapter 10.

std::vector, 100k items, VS2010 release, i7    Time

std::sort() vector                             18.61 ms
std::sort() sorted vector                       3.77 ms
std::stable_sort()                             16.08 ms
std::stable_sort() sorted                       5.01 ms

Table 9-2. Summary of sort performance experiments

The sequence container std::list provides only bidirectional iterators, so std::sort(), which requires random-access iterators, cannot be used with it at all. Instead, std::list provides a sort() member function that runs in O(n log2 n) time.

The ordered associative containers maintain their data in sorted order, so it isn’t necessary to sort them. The unordered associative containers maintain their data in a particular order that is not meaningful to users. They cannot be sorted.

The C++ standard library <algorithm> header contains pieces of the various sort algorithms, from which more complicated sorts can be built up for input data that has additional special properties:

· std::sort_heap converts a range having the heap property into a sorted range. sort_heap is not a stable sort.

· std::partition performs the basic action of a quicksort.

· std::merge performs the basic action of a merge sort.

· The insert member of the various sequence containers performs the basic action of an insertion sort.

Summary

· The mix of features in C++ provides a continuum of implementation choices ranging from hands-off automation and expressiveness on the one hand, to fine control over performance on the other hand. It is this degree of choice that makes it possible to tune C++ programs to meet requirements for performance.

· There are enough parts in most activities worth optimizing that human beings cannot reliably remember them all. Paper has a better memory.

· In a test of searching a table with 26 keys, std::unordered_map with string keys was only 52% faster than std::map with string keys. Given the hype about hashing being a performance win, this result was surprising.

· Stepanov’s Abstraction Penalty is a toll for using highly productive tools like the C++ standard library algorithms.

1 Such a table might not meet the needs of Arabic- or Chinese-speaking developers. Meeting these needs is another subject about which whole books are written. It is my expectation that developers using wide character sets have already solved these problems. Some English-speaking developers still find them mysterious and distracting, so I am wishing that complexity away in this example.