Optimized C++ (2016)
Chapter 5. Optimize Algorithms
Time heals what reason cannot.
Seneca the Younger (c. 4 BC–65 AD)
If a program takes hours to run, when it needs to take seconds, the only optimization likely to be successful is to select a more efficient algorithm. Most optimizations improve performance by only a constant factor. Replacing an inefficient algorithm with a betterperforming one is the only sure ticket to an orderofmagnitude improvement in performance.
Design of efficient algorithms is the topic of many whole computer science textbooks and numerous doctoral dissertations. There are specialist computer scientists who devote their careers to analysis of algorithms. There’s no way to cover the topic in a single short chapter. Instead, this chapter looks briefly at the time cost of algorithms, providing a guide to know when you’re in trouble.
I take a look at common searching and sorting algorithms, and present a toolkit for optimizing searching and sorting in existing programs. In addition to picking an algorithm that is optimal for unknown data, there are algorithms with exceptional performance on data that is sorted or nearly sorted, or that has other special properties.
Computer scientists study important algorithms and data structures because they are examples of how to optimize code. I have collected some important optimization techniques in the hope that the reader will come to recognize places where they can be applied.
OPTIMIZATION WAR STORY
Many programming problems have simple solutions that are infeasibly slow, and more subtle published solutions that are relatively more efficient. A team’s best option may be to seek an outside expert in algorithm analysis to determine whether an efficient solution to a particular problem may exist. Retaining such a consultant can be money well spent.
I once was on a team developing functional board testers (including the Fluke 9010A pictured in Chapter 2). Our board tester had a builtin RAM test for diagnosing manufacturing defects in the device under test. I once thought to examine the coverage of this test by connecting the tester to a Commodore PET computer and running the RAM test over its video memory, so that the test patterns were visible on the PET’s builtin screen. I inserted faults into the RAM circuit by the crude but effective method of jabbing a screwdriver between adjacent pins of the PET’s video RAM chips. We were all quite surprised to learn that faults that visibly changed the patterns written to RAM were often not detected by our test, which had been developed by some very smart engineers. Furthermore, Moore’s Law was doubling the amount of RAM we needed to test every 18 months. We needed a new RAM test algorithm that was much faster than the existing test, and with demonstrably better fault coverage.
An exhaustive RAM test is infeasible, taking O(2^{n}) accesses to memory (where n is the number of RAM addresses—see “Time Cost of Algorithms” for more on the “bigO” notation). Published RAM test algorithms at that time were mostly infeasibly slow O(n^{2}) or O(n^{3}) tests developed without much thought when memory devices held only a few hundred words. Published tests that were theoretically sound required 30 accesses to each cell to achieve basic fault coverage. I had an idea for a much better test using pseudorandom sequences, but lacked the mathematical knowhow to demonstrate its soundness. We had amply demonstrated that intuition alone did not guarantee success. We needed an algorithm expert.
A call to my old professors at the University of Washington led to a PhD candidate named David Jacobson who was happy to trade his research assistant stipend for a developer paycheck for a while. Our collaboration led to a bestinclass RAM test requiring only five passes through memory, several other novel functional test algorithms, and half a dozen US patents.
Time Cost of Algorithms
The time cost of an algorithm is an abstract mathematical function describing how rapidly the cost of the algorithm grows as a function of the size of its input. Many factors affect the run time of a program on a specific computer. As a result, run time is less than perfect as a way to talk about performance of an algorithm. Time cost abstracts away these details, leaving a simple relationship between input size and cost. Algorithms can be classified into families with similar time costs, and the families can be studied for common features. Time cost is well covered in any textbook on algorithms and data structures—I like The Algorithm Design Manual, Second Edition, by Steven S. Skiena (SpringerVerlag)—so this section will be brief.
The time cost is customarily written in “bigO” notation, as O(f(n)), where n is some significant aspect of the size of the input, and f(n) is a function describing how many of some significant operations an algorithm performs to process an input of size n. The function f(n) is typically simplified to show only its fastestgrowing term, because for large values of n, this term dominates the value of f(n).
Using searching and sorting algorithms as an example, if n is the number of items to be searched through or sorted, f(n) is often the number of comparisons made between two items to put the items into sorted order.
Here is a very rough guide to the time cost of some common algorithms, and the implications of those costs to program run times:
O(1), or constant time
The fastest possible algorithms have constant time cost. That is, they have a fixed cost that doesn’t depend on the size of the input at all. A constanttime algorithm is like the Holy Grail: incredibly valuable if you find it, but you can search for your whole life without achieving your quest.Beware, therefore, of strangers selling constanttime algorithms. The constant of proportionality may be very high. That is, the cost may be one operation, but it may be one hellaciously long operation. In fact, it may be O(n) in disguise, or worse.
O(log_{2} n)
The time cost can be less than linear. For instance, a search algorithm that can divide the input in half at each step takes O(log2 n) time. Algorithms with less than linear time cost have a cost that increases more slowly than the input does. They are thus efficient enough that there is often (but not always) no reason to look for a faster algorithm; the code implementing the algorithm simply isn’t hot enough to show up on a profiler’s list of expensive functions. Algorithms with O(log_{2} n) time cost can be invoked many times in a program’s overall run without fear of making the whole program infeasibly slow. Binary search is a frequently occurring algorithm with O(log_{2} n) time cost.
O(n), or linear time
If an algorithm’s time cost is O(n), the time taken by the algorithm is proportional to the size of its input. Such an algorithm is called a lineartime algorithm, or just linear. A time cost of O(n) is typical of algorithms that scan their input from one end to the other, as to find the least or greatest value in a list. A lineartime algorithm grows more expensive at the same rate as its input. Such algorithms are inexpensive enough that a program can scale to larger and larger inputs without unexpectedly great demands on computational resources. However, multiple lineartime algorithms can combine in ways that make their overall run time O(n^{2}) or worse, so they may be implicated if a program’s overall run time is too great for large inputs.
O(n log_{2} n)
Algorithms can have a superlinear time cost. For instance, many sorting algorithms compare pairs of inputs, and divide the space to be sorted into two parts at each step. These algorithms have time cost O(n log_{2} n). Algorithms that are O(n log_{2} n) become relatively more costly as nincreases, but the rate of growth is so slow that these algorithms are generally considered tractable even for large values of n. Still, you don’t want to invoke such an algorithm needlessly during a program’s run.
O(n^{2}), O(n^{3}), etc.
Some algorithms, including less efficient sorting algorithms, must compare each input value to every other value. These algorithms have time cost O(n2). The cost of such algorithms grows so fast that it becomes quite worrisome for large values of n. There are a number of problems that have simple solutions that are O(n^{2}) or O(n^{3}), and more subtle solutions that are faster.
O(2^{n})
O(2^{n}) algorithms grow expensive so rapidly that they can be considered only for small values of n. Sometimes that is OK. Algorithms that require inspecting all combinations of n inputs are O(2^{n}). Scheduling and tripplanning problems like the famous Traveling Salesman problem are O(2^{n}). If the fundamental problem an algorithm needs to solve is O(2^{n}), the developer is faced with several unpalatable options: use a heuristic algorithm that doesn’t guarantee an optimal solution, limit the solution to small values of n, or find some other way to add value that doesn’t involve solving the problem at all.
Table 51 estimates how long algorithms of different time costs will take if each operation on the n input values takes a single nanosecond. A more complete version of this table can be found on page 38 of Skiena’s The Algorithm Design Manual.
log_{2} n 
n 
nlog_{2} n 
n^{2} 
2^{n} 

10 
< 1 μs 
< 1 μs 
< 1 μs 
< 1 μs 
1 μs 
20 
< 1 μs 
< 1 μs 
< 1 μs 
< 1 μs 
1 μ 
30 
< 1 μs 
< 1 μs 
< 1 μs 
< 1 μs 
1 s 
40 
< 1 μs 
< 1 μs 
< 1 μs 
1.6 μs 
18 min 
50 
< 1 μs 
< 1 μs 
< 1 μs 
2.5 μs 
10^{13} years 
100 
< 1 μs 
< 1 μs 
< 1 μs 
10 μs 
∞ 
1,000 
< 1 μs 
1 μs 
10 μs 
1 ms 
∞ 
10,000 
< 1 μs 
10 μs 
130 μs 
100 ms 
∞ 
100,000 
< 1 μs 
100 μs 
2 ms 
10 s 
∞ 
1,000,000 
< 1 μs 
1 ms > 
20 ms 
17 min 
∞ 
Table 51. Runtime cost for an algorithm taking 1 ns per operation 
BestCase, Average, and WorstCase Time Cost
The conventional bigO notation assumes that an algorithm has the same running time for any input. However, some algorithms are sensitive to properties of their input, running faster for some sequences of inputs than for other sequences of the same size. When considering what algorithm to use in performancecritical code, it is important to know whether the algorithm has a bad worst case. We will see some examples of this in “Time Cost of Sorting Algorithms”.
Some algorithms likewise have very fortunate bestcase behavior, like a sort that takes less time if the input is already sorted or nearly sorted. When the input is likely to have useful properties like being mostly sorted, selecting an algorithm with especially good bestcase performance can improve the run time of a program.
Amortized Time Cost
Amortized time cost means time cost averaged over a large number of inputs. For instance, inserting an item into a heap has O(log2 n) time cost. Building a heap one item at a time is thus O(n log2 n). However, building a heap by the most efficient method is O(n), which means that inserting each item by this method has amortized O(1) time cost. The most efficient algorithm doesn’t insert items one at a time, though. It processes all the items into successively larger subheaps using a divideandconquer algorithm.
The significance of amortized time cost is that some individual operations are fast, while others are slow. For instance, the amortized time cost to append a character to a std::string is constant, but it involves a call to the memory manager some fraction of the time. When the string is short, the memory manager may be called nearly every time a character is appended. Only once the program has appended thousands or millions of characters does this amortized cost become small.
Other Costs
Sometimes an algorithm can be sped up by saving intermediate results. Such an algorithm thus has not only a cost in time, but also a cost in auxiliary storage. For instance, the familiar recursive algorithm for binary tree traversal runs in linear time, but also requires log2 n auxiliary storage locations on the stack during the recursion. Algorithms that have large auxiliary space costs may not be suitable for constrained environments.
There are algorithms that run faster if they are parallelized, but have a cost in the number of processors running in parallel needed to obtain the theoretically promised speedup. Algorithms requiring more than log2 n processors are usually not feasible on generalpurpose computers, which have a small, fixed number of processors. Such algorithms may be possible using purposebuilt hardware or graphics processors. I wish this book had space to cover parallel algorithm design, but alas it does not.
Toolkit to Optimize Searching and Sorting
The toolkit for optimizing searches and sorts contains just three tools:
· Replace inferior algorithms having poor averagecase bigO time costs with algorithms having better averagecase time costs.
· Exploit additional knowledge of the data (for instance, knowledge that the data is usually sorted or nearly sorted) to select algorithms having particularly good bigO time costs on data with these properties, and avoid algorithms with particularly poor bigO behavior on data with these properties.
· Tweak the algorithm to improve its performance by a constant factor.
I explore the application of these three tools in Chapter 9.
Efficient Search Algorithms
The time costs of the most important algorithms for searching and sorting are presented in every undergraduate computer science curriculum, and eagerly memorized by all developers early in their college education. The problem with undergraduate algorithms and data structures classes is that they are too brief. The instructor can provide indepth coverage of just a few algorithms to teach analysis, or shallow coverage of more algorithms to provide memorable facts about time cost. The instructor is probably also teaching how to program at the same time. As a result, students exit the class proud of their new knowledge, and unaware that they have missed out on many nuances. This incompleteness of knowledge leads to optimization opportunities even when nominally optimal algorithms are in use.
Time Cost of Searching Algorithms
How fast, in bigO terms, is the fastest method to look up a table entry? Hint: binary search, at O(log2 n), is a useful benchmark, but it isn’t the fastest.
Right now some readers are saying, “Wait...WHAT?” They weren’t taught anything other than linear search and binary search. But there are many search algorithms:
· Linear search, at O(n), is expensive, but extremely general. It can be used on an unsorted table. It requires only that keys can be compared for equality, so it works even if keys can’t be put into any kind of order. If the table is sorted, the linear search can terminate before seeing all table elements. It’s still O(n), but on average about twice as fast.
If it’s OK to mutate the table, a version of linear search that moves each search result to the head of the table can perform very well in some situations. For instance, the symbol table in a programming language compiler is searched every time an identifier is used in an expression. There are so many expressions that look like i = i + 1; that this optimization can make linear search reasonably useful.
· Binary search, at O(log2 n), has good performance, but it’s not the best possible search. Binary search requires input data that is sorted on the search key, and keys that can be compared not only for equality, but for an ordering relation such as lessthan.
Binary search is the workhorse of the searching and sorting world. It is a divideandconquer algorithm that repeatedly divides a sorted table in half by comparing the key to the middle element in the portion of the table being searched to determine if the key comes before or after that element.
· Interpolation search divides the sorted table into two parts like binary search, but uses extra knowledge about the keys to improve its partitioning guess. Interpolation search achieves a very respectable O(log log n) performance when the keys are uniformly distributed. This improvement can be very noticeable if the table is large or the cost of testing a table entry is significant (like when it’s on a rotating disk). However, interpolation search is not the fastest possible search.
· It is possible to find a record in average O(1) time by hashing: converting the key to an array index in a hash table. Hashing doesn’t work on just any list of key/value pairs. It requires a specially constructed table. Hash table entries need only be compared for equality. Hashing has worstcase performance of O(n), and may require more hash table entries than there are records to search for. However, when the table has fixed contents (like month names or programming language keywords), these pathological cases can be eliminated.
All Searches Are Equal When n Is Small
What is the cost of searching in a table with a single entry? Does it make any difference what search algorithm is used? Table 52 shows the cost of searching a sorted table using the best possible versions of linear, binary, and hashtable search algorithms. The answer is that for small tables, all methods examine the same number of entries. For a table of 100,000 entries, however, the dominating term of the time cost function would rule, and things would look different.
Table size 
Linear 
Binary 
Hash 
1 
1 
1 
1 
2 
1 
2 
1 
4 
2 
3 
1 
8 
4 
4 
1 
16 
8 
5 
1 
26 
13 
6 
1 
32 
16 
6 
1 
Table 52. Table size versus average number of accesses to search 
Efficient Sort Algorithms
There is an amazing menagerie of sorting algorithms. Many have been proposed only in the past 10 years or so. Many of the newer sorts are hybrids that improve bestcase or worstcase performance. If you completed your computer science education prior to 2000, it is worth reviewing the literature. Wikipedia has an accessible summary of sorting algorithms. Here are just a few fun facts not always covered in algorithms classes, to prove that deeper knowledge has benefits:
· “Everyone knows” that an optimal sort runs in O(n log2 n) time, right? Wrong. Wrong twice, in fact. Only sorts that work by comparing pairs of input values have this behavior. Radix sorts (sorts that divide the input repeatedly into one of r buckets) have time cost O(n logr n), where r is the radix, or number of sorting buckets. This is considerably better than comparison sorts for large data sets. Furthermore, when the sorting keys are taken from certain sets, like the consecutive integers from 1 to n, flashsort can sort the data in O(n) time.
· Quicksort, a frequently implemented and generally wellregarded algorithm, has poor worstcase performance of O(n2). The worst case cannot reliably be avoided, and naïve implementations often perform poorly.
· Some sorts, including insertion sort, have excellent (linear) performance if the data is almost sorted, even if they aren’t all that good for random data. Other sorts (such as the naïve quicksort just mentioned) have worstcase performance on sorted data. If the data is usually sorted or almost sorted, knowledge of this additional property can be used to select a sort having good performance on sorted data.
Time Cost of Sorting Algorithms
Table 53 lists the time complexity of several sorts for basecase, average, and worstcase input data. While most of these sorts have the same O(n log_{2} n) average performance, they differ in their bestcase and worstcase performance, and in the amount of extra space they consume.
Sort 
Best case 
Average 
Worst case 
Space 
Notes on best/worst case 
Insertion sort 
n 
n^{2} 
n^{2} 
1 
Best case is sorted or nearly sorted data 
Quicksort 
n log_{2} n 
n log_{2} n 
n^{2} 
log_{2} n 
Worst case is sorted data and naïve (first/last) choice of pivot 
Merge sort 
n log_{2} n 
n log_{2} n 
n log_{2} n 
1 

Tree sort 
n log_{2} n 
n log_{2} n 
n log_{2} n 
n 

Heapsort 
n log_{2} n 
n log_{2} n 
n log_{2} n 
1 

Timsort 
n 
n log_{2} n 
n log_{2} n 
n 
Best case is sorted data 
Introsort 
n log_{2} n 
n log_{2} n 
n log_{2} n 
1 

Table 53. Time cost for some sorting algorithms 
Replace Sorts Having Poor WorstCase Performance
Quicksort is a very popular sort. Its internal overhead is quite low, and its average performance is optimal for a sort based on comparing two keys. But quicksort has a flaw. If you run quicksort on an array that is already sorted (or nearly sorted), and you use the first or last element as the pivot, its performance is very poor. Sophisticated implementations of quicksort pick a pivot at random to overcome this weakness most of the time, or spend many extra cycles to compute the median and use that as the initial pivot. It is thus naïve to assume that quicksort will always have good performance. You must either know something about the input data—specifically, that it is not already sorted—or know that the implementation of the algorithm chooses the initial pivot carefully.
If you don’t know anything about the input data, mergesort, treesort, and heapsort all provide a comfortable assurance that there is no pathological case that will cause performance to balloon to unacceptable levels.
Exploit Known Properties of the Input Data
If you know that the input data is sorted or nearly sorted, the normally unacceptable insertion sort has excellent O(n) performance on such data.
A relatively new hybrid sort called Timsort also has excellent O(n) performance if the data is sorted or nearly sorted, and optimal O(n log _{2} n) performance the rest of the time. Timsort is now the standard sort in Python.
A recent sort called introsort is a hybrid of quicksort and heapsort. Introsort starts doing a quicksort, but switches to a heapsort if unfortunate input data causes quicksort’s recursion depth to become too great. Introsort guarantees a reasonable O(n log_{2} n) worstcase time, while taking advantage of quicksort’s efficient implementation to modestly reduce averagecase run time. Since C++11, introsort has been the preferred implementation of std::sort().
Another recently proposed sort called flashsort has excellent O(n) performance for data drawn from a particular probability distribution. Flashsort is related to radix sorting; data is sorted into buckets based on percentile in the probability distribution. A simple case of flashsort occurs when the data items are uniformly distributed.
Optimization Patterns
Experienced optimizing developers don’t rely exclusively on dizzying intuitive leaps to find opportunities to improve code performance. There are patterns that recur in optimized code. Developers study algorithms and data structures partly because they contain a library of ideas for improving performance.
This section gathers together a few general techniques for improving performance that are so useful that they deserve specific mention. The reader may recognize some of these patterns as the core of familiar data structures, C++ language features, or hardware innovations:
Precomputation
Remove computation from the hot part of the program by performing it before execution arrives at the hot code—earlier in the program, at link time, compile time, or design time.
Lazy computation
Remove computation from some code paths by performing the computation closer to the point where it is needed.
Batching
Perform computation on several items together rather than one item at a time.
Caching
Reduce computation by saving and reusing the results of an expensive computation rather than recomputing them.
Specialization
Reduce computation by removing generality that is not used.
Taking bigger bites
Reduce the cost of a repeated operation by acting on bigger groups of input at a time.
Hinting
Reduce computation by providing a hint that might improve performance.
Optimizing the expected path
Test for inputs or events at run time in decreasing order of expected frequency.
Hashing
Compute a compact numerical summary (the hash) of a larger data structure such as a variablelength character string. The hash can stand in for the data structure in comparisons to improve performance.
Doublechecking
Reduce computation by performing an inexpensive check, followed only if necessary by an expensive check.
Precomputation
Precomputation is a very general technique whose goal is to remove computation from hot code by performing it earlier, before execution arrives at the hot spot. Different variations move computation from the hot spot to a lesshot part of the code, to link time, compile time, or even design time. In general, the earlier the computation can be performed, the better.
Precomputation is only possible to the extent that the value to be computed does not depend on the context. A computation such as this can be precomputed by the compiler because it doesn’t depend on anything in the program:
int sec_per_day = 60 * 60 * 24;
A related computation like this one depends on variables in the program:
int sec_per_weekend = (date_end  date_beginning + 1) * 60 * 60 * 24;
The trick with precomputation is to either realize that (date_end  date_beginning + 1) may have an unchanging value in the program, so it can be replaced with 2, or to factor out the part of the expression that can be precomputed.
Here are some examples of precomputation:
· The C++ compiler automatically precomputes the value of constant expressions using the compiler’s builtin rules of associativity and operator precedence. The compiler precomputes the value of sec_per_day in the previous example with no problem.
· A template function call with particular arguments is evaluated at compile time. The compiler generates efficient code when the arguments are constants.
· The designer can observe, for example, that in the context of a particular program the concept of “weekend” is always two days, and precompute that constant as he is writing the program.
Lazy Computation
The goal of lazy computation is to delay computation to a point closer to where the computation is needed. There are several benefits to lazy computation. If the computation is not needed on all execution paths (all branches of the ifthenelse logic) in a function, it is only performed on the paths that need the result. Examples include:
Twopart construction
Often information that is needed to construct an object is not available when the instance could be statically constructed. Instead of constructing an object all at once, the constructor contains minimal code to establish an empty object. Later, the running program calls an initialization member function to finish construction. Delaying initialization until additional data is available means the constructed object can often be an efficient, flat data structure (see “Flatten Data Structures”).
In some cases, there is a cost to check whether a lazily computed value has been computed yet. This cost is comparable to the cost of ensuring a pointer to a dynamically constructed class is valid.
Copyonwrite
Instead of copying a dynamic member variable when an object is copied, the two instances share a single copy of the dynamic variable. Copying is deferred until either instance wants to modify the variable.
Batching
The goal of batching is to collect multiple work items and process them together. Batching may be used to remove repeated function calls or other computation that would occur if the items were processed one at a time. It may also be used because there is a more efficient algorithm for processing all inputs together, to postpone computation to a time when more computational resources are available. For instance:
· Buffered output is an example of batching in which output characters are appended to a buffer until the buffer is full or the program reaches an endofline or endoffile character. The entire buffer is passed to an output routine, saving the cost of calling the output routine once for each character.
· The optimal method for converting an unsorted array into a heap is an example of batching to use a more efficient algorithm. The time cost of inserting n items into a heap one by one is O(n log_{2} n). The cost of building the heap all at once is O(n).
· A multithreaded task queue is an example of batching to use computing resources efficiently.
· Saving or updating in the background is an example of batching.
Caching
Caching refers to any of several methods to reduce computation by saving and reusing the result of an expensive computation, rather than recomputing the result each time it is needed. For example:
· The compiler caches the result of small repeated chunks of code, like the computation needed to dereference an array element. It sees a statement like a[i][j] = a[i][j] + c; and saves the array expression, producing code that looks more like auto p = &a[i][j]; *p = *p + c;.
· Cache memory refers to special circuits in computers that make frequently needed memory locations more rapidly accessible to the processor. Caching is a very important concept in the design of computer hardware. There are many levels of caching in the hardware and software of an x86 PC.
· The length of a Cstyle character string must be computed each time it is needed by counting the characters. std::string caches the string length rather than computing it each time it is needed.
· A thread pool is a cache of expensivetocreate threads.
· Dynamic programming is an algorithmic technique in which a computation that has a recurrence relation is sped up by computing subparts and caching the result.
Specialization
Specialization is the opposite of generalization. The goal of specialization is to eliminate parts of an expensive computation that are not required in a specific situation.
It may be possible to simplify an action or data structure by removing a feature that makes it expensive, but that is not required in a given situation in question. This can be accomplished by lifting a constraint from the problem, or adding a constraint to the implementation—making the dynamic static, bounding the unbounded, and so on. For example:
· The template function std::swap() has a default implementation that may copy its arguments. However, the developer can provide a specialization that is more efficient given knowledge of the internals of a data structure. (The C++11 version of std::swap() uses move semantics to achieve an efficient result if the argument type implements a move constructor.)
· std::string is dynamically resizable to hold variablelength character strings. It provides many operations to manipulate strings. If it’s only necessary to compare fixed strings, a Cstyle array or pointer to literal string and a comparison function may have better performance.
Taking Bigger Bites
The aim of taking bigger bites is to reduce the number of iterations of a repeated operation, cutting down on the overhead of repetition. Strategies include:
· Request large blocks of input from the operating system or send large blocks of output to the operating system to reduce the overhead of calling into the kernel for small blocks or individual items. The downside of taking bigger bytes, especially when writing, is that more data may be lost if the program crashes. This can be a problem, for instance, for log files.
· Move or clear buffers by word or longword instead of bytebybyte. This optimization only improves performance if the two ranges are aligned to the samesized boundaries.
· Compare strings by word or longword. This only works on a bigendian machine, and not on the x86, which is littleendian (see “Memory Words Have a Big End and a Little End”). Machine architecture–dependent hacks like this one can be dangerous because they are nonportable.
· Perform more work when a thread wakes up. Check for multiple units of work to process rather than relinquishing the processor after a single unit. This saves the overhead of repeated thread wakeup.
· Instead of performing a maintenance task each time through a loop, perform it every 10th or every 100th time.
Hinting
In hinting, the cost of an operation is reduced by using a hint that, if provided, may lead to reduced computation.
For instance, one overload of the insert() member function of std::map takes an optional insertion point argument. An optimal hint can make insertion O(1). Otherwise, inserting in a map costs O(log_{2} n).
Optimizing the Expected Path
In ifthenelse code with several elseif arms, if the tests are laid out in random order, each time execution passes through the ifthenelse block, about half the tests will be computed. If there is one case that happens 95% of the time, and that test is performed first, then 95% of the time only one test will be performed.
Hashing
In hashing, a large input data structure or long character string is summarized by an algorithm as an integer, called the hash. The hashes of two inputs can be efficiently compared to rule out equality of the inputs. If the hashes are different, the inputs are definitely not the same. If the hashes are equal, the inputs are probably equal. Hashing can be used with doublechecking to optimize a deterministic comparison for equality. It is typical that the hash for an input is cached so that it need not be recomputed.
DoubleChecking
In doublechecking, an inexpensive check is used to rule out some cases, followed if necessary by an expensive check to rule out all other cases. For example:
· Doublechecking is frequently used with caching. The cache is checked quickly to see if the desired value is there, and if not the value is fetched or computed by a more expensive process.
· Comparing two std::string instances for equality normally requires comparing them characterbycharacter. However, a preliminary comparison of the lengths of the two strings can quickly rule out equality.
· Doublechecking may be used in hashing. The hashes of two inputs can be compared efficiently for equality; if the hashes are different, the inputs are not equal. Only if the hashes are the same is it necessary to compare the inputs bytebybyte.
Summary
· Beware of strangers selling constanttime algorithms. They may be O(n).
· Multiple efficient algorithms can be combined in ways that make their overall run time O(n^{2}) or worse.
· Binary search, at O(log_{2} n), isn’t the fastest search. Interpolation search is O(log log n) and hashing is constanttime.
· For small tables with < 4 entries, all search algorithms examine about the same number of entries.