
Chapter 3. Measure Performance

Measure what is measurable, and make measurable what is not so.

Galileo Galilei (1564–1642)

Measurement and experimentation are the basis of any serious attempt to improve a program’s performance. This chapter introduces two software tools that measure performance: the profiler and the software timer. I discuss how to design experiments to measure performance so that their results are meaningful, rather than misleading.

The most basic and most frequently performed measurement of software performance answers the question, “How long?” How long does it take to execute a function? How long to fetch a configuration from disk? How long to start up or shut down?

These questions can be attacked (clumsily) with ridiculously simple instruments. Galileo reportedly timed a swinging cathedral lamp against his own pulse. I’m sure every developer has informally timed activities by counting out loud; in the United States, we count “one-Mississippi, two-Mississippi, three-Mississippi...” to get approximate seconds. A digital wristwatch with a stopwatch function was once de rigueur for computer geeks, and not just for the fashion statement. In the embedded world, hardware-savvy developers have nifty instruments at their disposal, including frequency counters and oscilloscopes that can time even short routines accurately. Software vendors sell a variety of specialized tools too numerous to review here.

This chapter focuses on two tools that are widely available, generally useful, and inexpensive. The first tool, the profiler, is typically provided alongside the compiler by the compiler vendor. The profiler produces a tabular report of the cumulative time spent in each function called during a program run. It is the go-to instrument for software optimization because it produces a list of the hottest functions in your program.

The second tool, the software timer, is something developers can build, like Jedi knights constructing their own lightsaber (if you will excuse the cheesy Star Wars reference). If the fancy version of the compiler that has a profiler is priced out of reach, or the compiler vendor on some embedded platform doesn’t offer one, the developer can still perform experiments by timing long-running activities. The software timer is also helpful for timing tasks that are not compute-bound.

A third tool, the lab notebook, is so old-school that many developers may think it has gone completely out of fashion. But a lab notebook or equivalent text file is an indispensable tool in the optimization toolkit.

The Optimizing Mindset

Before diving into measurements and experiments, I would like to offer just a little of the philosophy of optimization that I practice and that I hope to teach in this book.

Performance Must Be Measured

Your senses are usually not accurate enough to detect an incremental change in performance. Your memory is inadequate to exactly recall the results of many experiments. Book learning can mislead you, causing you to believe things that are not always true. Developers often show spectacularly poor intuition when thinking about whether or not a particular piece of code needs to be optimized. They write functions, knowing they will be used, but with little thought to how heavily or by what code. Then an inefficient bit of code finds its way into critical components, where it is called zillions of times. Experience can also deceive you. Programming languages, compilers, libraries, and processors all evolve. Functions that were once reliable hot spots can become more efficient, and vice versa. Only measurement tells you if you are winning or losing the optimization game.

The developers whose skill at optimization I respect most all approach the optimization task systematically:

· They make testable predictions, and write their predictions down.

· They keep a record of code changes.

· They make measurements with the best instruments available.

· They keep detailed notes of experimental results.

STOP AND THINK

Go back and read the preceding section again. It contains the most important advice in this book. Most developers (including the author) don’t behave in this methodical way naturally. It’s a skill that must be practiced.

Optimizers Are Big Game Hunters

I say we take off and nuke the whole site from orbit. It’s the only way to be sure.

Ellen Ripley (Sigourney Weaver), Aliens, 1986

Optimizers are big game hunters. Making a program run 1% faster is not worth the risk that modifying a working program might introduce bugs. The effect of a change must be at least locally dramatic to make it worthwhile. Furthermore, a 1% speedup might be a measurement artifact masquerading as improvement. Such a speedup needs to be proven, with randomization, sample statistics, and confidence levels. It’s too much work for too little effect. It’s not a place we go in this book.

A 20% improvement is a different animal. It blows through objections about methodology. There are not a lot of statistics in this book, something for which I do not apologize. The point of this book is to help the developer find performance improvements that are dramatic enough to outweigh any question of their value. Such an improvement may still depend on factors like the operating system and compiler, so it may not have much effect on another system or at another time. But changes that are dramatic almost never bite with reduced performance when the developers port their code to a new system.

The 90/10 Rule

The fundamental rule of optimization is the 90/10 rule: a program spends 90% of its run time in 10% of its code. This rule is a heuristic: not a law of nature, but rather a useful generalization to guide thinking and planning. This rule is also sometimes called the 80/20 rule. The idea is the same. Intuitively, the 90/10 rule means that certain blocks of code are hot spots that are executed very frequently, while other parts of the code are hardly ever executed. These hot spots are the targets for optimization efforts.

OPTIMIZATION WAR STORY

I was introduced to the 90/10 rule in one of my first projects as a professional developer: an embedded device with a keyboard interface, which very coincidentally was called the 9010A (Figure 3-1).

The Fluke 9010A Microsystem Troubleshooter, a desktop test instrument with an array of operating keys, a single-line vacuum fluorescent display, and microcassette tape storage

Figure 3-1. The Fluke 9010A (British Computer History Museum)

There was a function that polled the keyboard to see whether the STOP key had been pressed. This function was executed frequently by every routine. Hand-optimizing the C compiler’s Z80 assembly language output code for this one function (a 45-minute task) improved overall throughput by 7%, which was a big deal in this particular device.

This experience was typical of optimization efforts in general. At the beginning of the optimization process, a lot of run time was spent in one place in the program. The place was pretty obvious: a housekeeping activity done repeatedly, at each iteration of every loop. Optimizing the activity required making the painful choice to code in assembly language instead of C. But the scope of the assembly language was extremely limited, reducing the risk entailed by the choice to use assembly language.

The experience was also typical in that this one block of code was very hot. After we improved this code, another block of code took first place as the most frequently executed—but its contribution to overall run time was much smaller. It was so small, in fact, that we stopped optimizing after just this one change. We could find no other change that gave us even a 1% speedup.

A consequence of the 90/10 rule is that optimizing every routine in a program is not helpful. Optimizing a small part of your code gives you practically all the performance improvement you are going to get. Identifying the hot 10% is time well spent. Selecting code to optimize by guesswork is likely time wasted.

I would like to return to the quote by Don Knuth from the first chapter. Here is a longer version of the same quote:

Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil.

Donald Knuth, Structured Programming with go to Statements, ACM Computing Surveys 6 (Dec 1974): 268. CiteSeerX: 10.1.1.103.6084

Dr. Knuth was not warning that optimization in general was evil, as some people have suggested. He was saying only that wasting time optimizing the noncritical 90% of a program was evil. Apparently he was aware of the 90/10 rule too.

Amdahl’s Law

Amdahl’s Law, coined by and named after computer engineering pioneer Gene Amdahl, describes how much overall performance will improve when a portion of the code is optimized. There are several ways to express Amdahl’s Law, but with regard to optimization, it may be expressed by the equation

$$S_T = \frac{1}{(1 - P) + \dfrac{P}{S_P}}$$

where S_T is the proportion of improvement in the whole program run time resulting from an optimization, P is the proportion of total original run time that was subject to optimization, and S_P is the proportion of improvement in the optimized portion P.
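
Amdahl’s Law is simple enough to turn directly into code. The following sketch is mine, not from the book; the helper name amdahl_speedup is made up, and the two calls correspond to the worked examples that follow.

#include <iostream>

// Overall speedup S_T, given the fraction P of original run time that was
// optimized and the speedup S_P achieved on that fraction (Amdahl's Law).
double amdahl_speedup(double P, double S_P) {
    return 1.0 / ((1.0 - P) + P / S_P);
}

int main() {
    std::cout << amdahl_speedup(0.8, 1.3) << std::endl;   // 1.22642 (the text rounds to 1.22)
    std::cout << amdahl_speedup(0.1, 100.0) << std::endl; // 1.10988, about 1.11
    return 0;
}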

For example, suppose a program takes 100 seconds of run time. Through profiling (see “Profile Program Execution”), you discover that the program spends 80 seconds performing multiple calls to a single function, f. Now suppose you recode f to make it 30% faster. How much would this improve the program’s overall run time?

P, the proportion of the original run time spent within function f, is 0.8. S_P is 1.3. Plugging this into Amdahl’s Law yields

$$S_T = \frac{1}{(1 - 0.8) + \dfrac{0.8}{1.3}} = \frac{1}{0.2 + 0.62} = 1.22$$

Improving this one function by 30% improved the overall program run time by 22%. In this case, Amdahl’s Law illustrates the 90/10 rule, and provides an example of how powerful even a modest performance improvement can be in the hot 10% of the code.

Let’s look at a second example. Suppose again that a program takes 100 seconds of run time. Through profiling, you discover that a single function, g, consumes 10 seconds of that run time. Now suppose you recode g so that it runs 100 times faster. How much will this improve the program’s overall run time?

P, the proportion of the original run time spent within function g, is 0.1. S_P is 100. Plugging this into Amdahl’s Law yields

$$S_T = \frac{1}{(1 - 0.1) + \dfrac{0.1}{100}} = \frac{1}{0.9 + 0.001} = 1.11$$

In this case, Amdahl’s Law provides a cautionary tale. Even if heroic coding or black magic reduces the run time of function g to zero, it is still in the unimportant 90%. The overall performance improvement is still 11%, to two decimal places. Amdahl’s Law says that even a really successful optimization isn’t valuable if the optimized code doesn’t account for much of the overall program’s run time. The lesson of Amdahl’s Law is that, when your colleague comes to a meeting all excited that he knows how to make some calculation run 10 times faster, it doesn’t necessarily mean all your performance troubles are over.

Perform Experiments

Developing software is always an experiment, in the sense that you start up the program thinking that it is going to do a particular thing, and then watch to see whether it does. Performance tuning is experimental in a more formal sense. You must start with correct code, in the sense of code that does what you expect it to do. You have to look at this code with a fresh set of eyes and ask, “Why is this code hot?” Why does this particular function, out of the many hundreds in your program, appear at the top of the profiler’s naughty list? Is the function wasting time doing something redundantly? Is there a faster way to do the same computation? Is the function using expensive resources? Is the function, in fact, as efficient as it possibly could be, but just called a whole lot of times, leaving no opportunity to improve performance?

Your proposed answer to the question “Why is this code hot?” forms the hypothesis you will test. The experiment takes the form of two elapsed time measurements of the running program: one before you make a change and one after. If the second time is less than the first, your experiment validates the hypothesis.

Note the very special language. Experiments don’t necessarily prove anything at all. Your modified code may have run faster or slower for a variety of reasons having nothing to do with your change. For example:

· The computer might receive an email or check for a Java update while you are measuring run time.

· A colleague might check in an improved library just before you recompile.

· Your fix may run faster because it is not correct.

Good scientists are skeptics. They are always suspicious. If the expected improvement does not appear, or appears too good to be true, a skeptic tries the experiment again, or questions her assumptions, or looks for a bug.

Good scientists are also open to new knowledge, even if it contradicts knowledge already in their heads. I learned several unexpected things about optimization in the process of writing this book. The technical reviewers learned things too. Good scientists never stop learning.

OPTIMIZATION WAR STORY

Chapter 5 contains an example function for keyword lookup. I coded several versions of the example. One was a linear search, another was a binary search. When I measured the performance of these two functions, the linear search was consistently a few percent faster than the binary search. This, I felt, was unreasonable. The binary search just had to be faster. But the timing numbers told a different story.

I was aware that someone on the Internet had reported that linear lookup was often faster because it enjoyed better cache locality than binary search, and indeed my linear search implementation should have had excellent cache locality. But this result violated both experience and long-revered book learning about search performance.

In investigating more deeply, I first realized that I had been testing a table with just a few keywords, and using sample words that would be found in the table in my test. If the table had eight entries, a linear search would examine half (four) of them on average before returning. The binary search would divide the table in half four times before it found the keyword each time it was called. The two algorithms had exactly the same average performance on small keyword sets. This reality invalidated my intuition about binary search “always” being better.

But this was not the result I intended to demonstrate! So I made the table larger, thinking there must be a size at which the binary search was faster. I also added more words to my test that were not in the table. The results were as before, with the linear search being faster. At this time, I had to put this coding task aside for a couple of days. But the result gnawed at me.

I was still sure that the binary search should be faster. I reexamined the unit tests for the two searches, eventually to discover that the linear search always returned success after the first comparison. My test case had checked for a nonzero return, rather than testing for the correct return. Sheepishly, I fixed the linear search code and test cases. Now the experiment produced the expected result that the binary search was faster.

In this case, the experimental results first rejected and later validated my hypothesis—and challenged my assumptions along the way.

Keep a Lab Notebook

Good optimizers (like all good scientists) are concerned with repeatability. This is where the lab notebook comes into use. Each timing run starts with a hypothesis, one or more small code tweaks, and an input data set, and ends up with an unremarkable integer number of milliseconds. It’s not too hard to remember the previous run’s time long enough to compare it with the next. As long as each code change is successful, that’s enough.

Eventually, however, the developer will guess wrong and the timing of the latest run will be worse than that of the previous run. Suddenly the developer’s mind will be full of questions. Was run #5 faster than run #3 even though it was slower than run #4? And just what code change produced run #3? Was the speed difference just an artifact, or was it really faster?

If each experiment run is documented, it can quickly be repeated. This makes answering such questions trivial. Otherwise, the developer must go back and rerun a previous experiment to get the timing—that is, if he can remember exactly what code change to make or unmake. If runs are brief and the developer’s memory is excellent, then he’s lucky and may waste only a little time. But it is possible to be unlucky too, and lose track of a promising line of work, or maybe repeat it unnecessarily the next day.

When I give this advice, there is always someone who says, “I can do that without paper! I can write a Perl script that would modify the FOO command of the SmartFOO check-in tool to save the test results of each run with the change set for that run. If I save the test results in a file... If I run the test in a directory set up just so...”

I don’t want to discourage innovation among software developers. If you’re a senior manager driving uptake of best practices, go for it. I will say, however, that writing on paper is a robust, easy-to-use technology that is a thousand years down its learning curve. It will continue to work when the team updates the revision manager or test suite. It will work at the developer’s next job. This old-school solution may still optimize use of the developer’s time.

Measure Baseline Performance and Set Goals

For a developer working alone on her own time, the optimization process can be casual, iterating until performance “feels good enough.” Developers working in teams, however, have managers and stakeholders to satisfy. Optimization efforts are guided by two numbers: a baseline measure of performance before optimization, and a performance goal. The baseline performance measurement is important not just for measuring the success of individual improvements, but also for justifying the cost of optimization efforts to stakeholders.

A performance goal is important because optimization is a process with diminishing returns. Initially there is “low-hanging fruit” that may easily be picked: individual processes or naïvely coded functions that produce big improvements when optimized. But once these easy optimization targets have been improved, more effort is required for each next turn of the optimization crank.

Many teams do not initially think to set design goals for performance or responsiveness, simply because they are not accustomed to doing so. Fortunately, poor performance is usually obvious (a user interface with long periods of unresponsiveness, a server that won’t scale, excessive hosting costs for CPU time, etc.). Once a team looks at performance, numerical goals are easy to set. A whole subdiscipline of user experience (UX) design centers on how users perceive waiting time. Here is a list of commonly measured performance aspects to start you off, plus just enough UX numbers to make you dangerous:

Startup time

The elapsed time from pressing the Enter key until the program enters its main input processing loop. Often, but not always, the developer can just measure the time elapsed between entering the main() procedure and entering the main loop. Operating system vendors who offer certification for programs have strict requirements for programs that run as the computer starts up, or whenever a user logs on. For instance, Microsoft demands of hardware vendors seeking certification that the Windows shell must enter its main loop less than 10 seconds after starting. This limits the number of other programs the vendor is allowed to preload and launch in the busy startup environment. Microsoft offers specialized tools to measure startup time.

Shutdown time

The elapsed time from the user clicking the Close icon or entering a command to exit until the process actually exits. Often, but not always, you can just measure the time elapsed between the main window receiving the shutdown command and exiting main(). Shutdown time also includes the time needed to stop all threads and dependent processes. Operating system vendors who offer certification have strict requirements for shutdown time. Shutdown time is also important because the time it takes to restart a service or long-running program is equal to its shutdown time plus its startup time.

Response time

The average or worst-case time needed to perform a command. For websites, both the average response time and worst-case response time contribute to users’ satisfaction with the site. Response times can be categorized into coarse powers-of-10 baskets as follows:

Less than 0.1 seconds: user in direct control

If response time is less than 0.1 seconds, the users feel like they are directly controlling the user interface, and that UI changes result directly from their actions. This is the maximum delay between a user beginning to drag and the dragged object moving, or between a user clicking a field and the field being highlighted. Any longer and the user feels like he is issuing a command that the computer carries out.

0.1 to 1 seconds: user in control of command

If response time is between 0.1 and 1 seconds, the users feel as though they are in charge, but interpret the brief delay as the computer executing a command and causing a change to the UI. The users are able to tolerate this amount of delay without losing their focus on their current train of thought.

1 to 10 seconds: computer in charge

If response time is between 1 and 10 seconds, the users feel they have executed a command and have lost control to the computer while it processes the command. Users may lose focus and forget things in short-term memory that they need to complete the task; 10 seconds is about the maximum time a user can remain focused. User satisfaction with the UI drops rapidly if they encounter many waits of this duration.

More than 10 seconds: time for a coffee break

If response time is greater than 10 seconds, the users perceive that they have enough time to do some other task. If their work requires them to use the UI, they will go for coffee while the computer crunches numbers. If it’s an option, the user may close the program and look elsewhere for satisfaction.

Jakob Nielsen wrote an interesting article about time scales in user experience that points to scholarly studies for the extra-curious.

Throughput

The inverse of response time. Throughput is generally expressed as the average number of operations per unit of time on some test workload. Throughput measures the same thing as response time, but is more appropriate for batch-oriented programs like databases and web services. Generally you want this number as big as possible.

It is also possible to overoptimize. For instance, in many cases, users regard a response time of under 0.1 seconds as instantaneous. In such a situation, improving response time from 0.1 seconds to 1 millisecond adds virtually no value, even though it is 100 times faster.
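
For example, both average response time and throughput can be computed from a single wall-clock measurement of a batch of operations. The following sketch uses the standard C++ <chrono> facilities; DoTransaction() is a made-up stand-in for whatever command or request you actually measure.

#include <chrono>
#include <cmath>
#include <iostream>

// Hypothetical unit of work standing in for a real command or request.
volatile double sink;
void DoTransaction() {
    double s = 0;
    for (int i = 1; i <= 10000; ++i)
        s += std::sqrt((double)i);
    sink = s;
}

int main() {
    using namespace std::chrono;
    const unsigned n = 1000;
    auto start = steady_clock::now();
    for (unsigned i = 0; i < n; ++i)
        DoTransaction();
    auto stop = steady_clock::now();
    double seconds = duration<double>(stop - start).count();
    std::cout << "average response time: " << seconds / n << " sec\n";
    std::cout << "throughput: " << n / seconds << " ops/sec\n";
    return 0;
}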

You Can Improve Only What You Measure

Optimizing a single function, subsystem, task, or test case is never the same as improving the performance of a whole program. Your test setup differs in many ways from the production-version program running on customer data, so the performance improvements measured in your test rarely translate into the same performance improvements in the wild. Making a single task faster may not make the whole program faster, even if the task exercises much of the program’s logic.

For instance, if a database developer profiles a database performing a specific select query 1,000 times, and optimizes based on that profile, she will speed up not the whole database, but the database performing a specific select query. This speedup may well improve the performance of other select queries. It will have a less predictable effect on delete or update queries, indexing, and all the other things a database does.

Profile Program Execution

A profiler is a program that produces statistics about where another program spends its time. The profiler constructs a report showing how frequently each statement or function is executed, and how much time accumulates in each function.

Many compiler suites, including both Visual Studio on Windows and GCC on Linux, come with a profiler to help find hot spots. Historically, Microsoft only offered its profiler with expensive versions of Visual Studio, but Visual Studio 2015 Community Edition comes with very extensive profiler support. There are also open source profilers for Windows that work with earlier versions of Visual Studio.

There are several ways to implement a profiler. One method used by both Windows and Linux works as follows:

1. The programmer recompiles the program to be profiled with a special compiler flag that instruments every function in the program. This involves adding a couple of extra assembly language instructions to the beginning and end of each function.

2. The programmer links the instrumented program with a profiling library.

3. Any time the instrumented program runs, it creates a profiling table as a file on disk.

4. The profiler takes the profiling table as input, and produces a range of available textual or graphical reports.

Another profiling method works in this way:

1. The unmodified program is instrumented by linking it with a profiling library. The library contains a routine that interrupts program execution at a high rate and records the value of the instruction pointer.

2. Any time the instrumented program runs, it creates a profiling table as a file on disk.

3. The profiler reads the profiling table as input, and produces a range of available textual or graphical reports.

The output of the profiler may take several forms. One form is a source listing annotated with the number of times each line was executed. Another form is a list of function names along with the number of times each was invoked. Another form is the same list of functions, with the total time accumulated within each function and all functions called from within it. Still another form is a list of functions and the amount of time spent in each function, minus the time spent in called functions, in system code, or waiting for events.

The profiler instrumentation is carefully designed to be as inexpensive as possible. Its effect on overall run time is small, and typically takes the form of a few percent slowdown of every operation. The first method gives exact numbers, at the cost of higher overhead and disabling of certain optimizations. The second method gives approximate results and may miss a few infrequently called functions, but has the advantage of running over the production code.
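
To make the first (instrumenting) method more concrete, here is a rough sketch of what the injected entry/exit code conceptually does, written by hand for a single function. It is only a model of the idea, not how any particular vendor’s profiler is implemented; the names Instrumented, g_fooCalls, and g_fooNanos are mine.

#include <chrono>
#include <cstdint>
#include <iostream>

// One accumulator per instrumented function; a real profiler keeps a table
// of these and writes it to disk when the program exits.
static uint64_t g_fooCalls = 0;
static uint64_t g_fooNanos = 0;

// RAII helper playing the role of the compiler-injected entry/exit code.
struct Instrumented {
    std::chrono::steady_clock::time_point start;
    Instrumented() : start(std::chrono::steady_clock::now()) {}
    ~Instrumented() {
        auto elapsed = std::chrono::steady_clock::now() - start;
        g_fooNanos +=
            std::chrono::duration_cast<std::chrono::nanoseconds>(elapsed).count();
        g_fooCalls += 1;
    }
};

void Foo() {
    Instrumented probe;                    // "entry" and "exit" instrumentation
    volatile double s = 0;                 // stand-in for the function's real work
    for (int i = 0; i < 1000; ++i) s += i * 0.5;
}

int main() {
    for (int i = 0; i < 1000; ++i)
        Foo();
    std::cout << "Foo(): " << g_fooCalls << " calls, "
              << g_fooNanos << " ns total" << std::endl;
    return 0;
}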

The most important advantage of a profiler is that it directly displays a list of the hottest functions in the code. The process of optimization is reduced to making a “naughty list” of functions to investigate, inspecting each function for optimization opportunities, making a change, and rerunning the code to obtain new profiler output until no single function appears particularly hot or until you run out of ideas. Because hot spots observed by the profiler are by definition places where a lot of computation takes place, this process is generally straightforward.

My experience with profiling is that profiling a debug build of a program produces results that are just as relevant as those obtained from a release build. In some ways, the debug build is easier to profile because it includes all functions, including those that are inlined, whereas the release build hides very frequently called inlined functions.

OPTIMIZING PRO TIP

One issue with profiling a debug build on Windows is that the debug build is linked with a debug version of the runtime library. The debug version of the memory manager functions performs many extra tests so it can better report doubly freed storage and memory leaks. The cost of these extra tests can significantly increase the cost of certain functions. There is an environment variable to tell the debugger not to use the debug memory manager: go to Control Panel→System Properties→Advanced System Settings→Environment Variables→System Variables, and add a new variable called _NO_DEBUG_HEAP with a value of 1.

Using a profiler is an excellent way to find candidates for optimization, but it isn’t perfect:

· A profiler cannot tell you that there is a more efficient algorithm to solve your computing problem. Tuning a bad algorithm is just a waste of time.

· A profiler does not give crisp results on input that exercises many different tasks. For instance, a SQL database may execute very different code when doing an insert query from when doing a select query. Thus, code that is hot when loading up the database with insert queries may not be executed at all when reading data from the database with select queries. Running a test that is a mix of loading the database and querying the database makes the insert code less prominent in the profiler report, unless the profiled run performs a lot of computation.

Therefore, to identify the hottest functions more easily, try optimizing one task at a time. It can be helpful to profile a subsystem of the full program running in a test harness, if test code is available. However, you introduce another source of uncertainty if you optimize one task at a time: it doesn’t necessarily improve the performance of the entire system. Results may be less dramatic when the program is run on input that performs a mix of tasks.

· The profiler gives misleading information when a program is I/O-bound or multithreaded, because the profiler subtracts time spent in system calls or waiting on events. Removing this time is theoretically sensible because the program is not solely responsible for these waits. But the result is that the profiler tells how much work a program is doing, not how much “wall clock time” it takes to perform its work. Some profilers give a count of the number of function calls as well as the time spent within each function. A very high call count can be a clue when the profiler hides wall clock time.

The profiler is a one-trick pony; there are optimization opportunities it does not highlight, and issues in interpreting profiler output. Still, for many programs, the profiler produces results good enough that no other method is needed.

Time Long-Running Code

If a program does one task that is compute-bound, profiling will automatically show the hot spots. But if a program does a lot of different things, no function may be particularly hot in the profiler. The program may also spend time waiting for I/O or external events that slow down overall progress as measured by a wall clock. In these cases, you need to time parts of the program with a stopwatch, and then try to reduce the measured run time of parts that are slow.

A developer uses run timing to find hot spots by successively narrowing the part of a long-running task being timed until one section takes a length of time that does not make intuitive sense. After identifying the suspicious section of code, the developer times experiments on small subsystems or individual functions in a test harness.

Run timing is an effective way to test hypotheses about how to reduce the cost of a particular function.

It is hardly a stretch to realize that a computer can be programmed to act as a stopwatch. A phone or laptop is quite handy for waking you up at 6:45 on weekday mornings, or reminding you 5 minutes before your stand-up meeting at 10. Measuring the submicrosecond run times of functions in modern computers is more challenging, especially because the common Windows/PC platform has historically had trouble providing a high-resolution clock that behaved consistently across hardware models and software versions.

As a developer you must therefore be prepared to “roll your own” software stopwatch, knowing that it may change in the future. To help make this possible, I will discuss how time is measured, and what tools support measuring time on computers.

“A Little Learning” About Measuring Time

A little learning is a dangerous thing.

Alexander Pope, “An Essay on Criticism”, 1711

A perfect measurement would exactly capture the size, weight, or, in the case of this book, duration of a measured phenomenon every single time. Making a perfect measurement is as likely as an archer reliably hitting the exact center of the bull’s-eye with every shot, splitting arrow after arrow. Such archery only happens in storybooks, and it is the same with measurement.

Real measurement experiments (like real archers) must contend with variation: sources of error that spoil the quest for perfection. Variation comes in two flavors: random and systematic. Random variations affect each measurement differently, like a puff of wind that causes a particular arrow to drift in flight. Systematic variations affect every measurement in a similar way, like the way an archer’s stance can bias every shot so that it goes to the left of its intended destination.

Variation itself can be measured. Summary measures of variation form properties of a measurement called precision and trueness. Together these properties form the intuitive property called accuracy.

Precision, trueness, and accuracy

It is apparently the case that the scientists who get really excited about measurement argue endlessly about terminology. You need only look up “accuracy” on Wikipedia for a taste of how much controversy there is about what words to use for explaining agreed-upon concepts. I chose to explain terms in the context of the standard ISO 5725-1, “Accuracy (trueness and precision) of measurement methods and results - Part 1: General principles and definitions” (1994).

A measurement is precise if it is free from random variation. That is, if the same phenomenon is measured repeatedly, and the measured values are close together, the measurement is precise. A series of precise measurements may still contain systematic variation. An archer who puts a grouping of shots in the same off-center area can be said to be very precise even if not very accurate. His target might look like Figure 3-2.

If I measure a phenomenon (say, the run time of a function) 10 times, and get the same result all 10 times, I may hypothesize that my measurement is precise. (As in any experiment, I should remain skeptical until I have a lot of evidence.) If I get the same result six times, a slightly different result three times, and a very different result once, the measurement is less precise.

A measurement is true if it is free from systematic variation. That is, if I measure the same phenomenon repeatedly, and the average of all the results clusters close to the actual quantity being measured, then I may believe the measurement is true. Individual measurements may be affected by random variation so that they are closer to or further from the actual quantity. Trueness is not a skill rewarded in archery. In Figure 3-3, the average of the four shots would be in the bull’s-eye if only it were an arrow. Furthermore, all these shots have the same accuracy (distance from the bull’s-eye) in ring units.

Image of an archery target. Four arrows are clustered very close together in the third ring of the target.

Figure 3-2. High-precision (but poor-accuracy) target shooting

The accuracy of a measurement is an informal concept that depends on how close each individual measurement comes to the actual quantity being measured. The distance from the actual quantity has a component of random variation and a component of systematic variation. A measurement must be both precise and true to be accurate.

Image of an archery target with four arrows, all in different places in the second ring.

Figure 3-3. The archer who fired these shots has true aim

Measuring time

The software performance measurements covered in this book are either a duration (the amount of time that elapses between two events) or a rate (the number of events per unit of time, the inverse of duration). The instrument used for measuring duration is a clock.

All clocks work by counting some periodic fluctuation. In some clocks, the count is divided into hours, minutes, and seconds for presentation. In other clocks, the count is presented directly as a number of ticks. But clocks (other than sundials) don’t directly measure hours, minutes, and seconds. They count ticks, and it is only by comparing the ticks to a second reference clock that a clock may be calibrated to present hours, minutes, and seconds.

All sources of periodic fluctuation are subject to variations that make them imperfect clocks. Some of these variations are random, while others are systematic:

· A sundial makes use of the periodic off-plane rotation of the Earth. By definition, one full rotation is one day. The Earth makes an imperfect clock because its period is long, and because its rotation speeds up and slows down detectably (in microseconds) as the continents drift slowly across its surface. This variation is random. Tidal forces from the Moon and Sun slow the overall rate of the Earth’s rotation. This variation is systematic.

· A grandfather clock counts the regular swinging of a pendulum. Gears divide the pendulum’s swing down to drive hands that display the time. The period of the pendulum may be manually adjusted so that the displayed time is synchronized to the Earth’s rotation. The period of a pendulum’s swing depends on the weight of the pendulum and its length, so that every swing may be faster or slower than desired. This variation is systematic. Friction, air pressure, and the accumulation of dust may all affect a pendulum even if it is initially set perfectly. These are random sources of variation.

· An electric clock uses the periodic 60 Hz sine wave of its AC power supply to drive a synchronous motor. Gears divide down the fundamental oscillation and drive hands to display the time. An electric clock is not a perfect clock because the AC power is only 60 Hz (in the United States) by convention, not by some law of nature. The power company slows down the oscillation in periods of heavy use, and speeds it up again later so electric clocks will not run slow. So, a second measured by an electric clock may be longer on a hot summer afternoon than on a temperate evening (as we have always suspected). This variation is random. An electric clock made for use in the United States will run consistently slow if plugged into a 50 Hz AC outlet in Europe. In contrast to the random variation introduced by temperature, the variation caused by using a European power outlet is systematic.

· A digital wristwatch uses the induced vibration of a quartz crystal as its fundamental fluctuation. A logic circuit divides the fundamental oscillation down and drives a display. The periodic oscillation of the crystal depends on its size, on the temperature, and on the voltage applied to it. The effect of the crystal’s size is a systematic variation, while variations of the temperature and voltage are random.

A tick count is an inherently unsigned quantity. There is no such thing as –5 tick events. I mention this seemingly obvious fact because, as shown later, many developers implementing timekeeping functions choose a signed representation for durations. I don’t know why they do this. My teenage son would say, “It’s just a thing.”

Measurement resolution

The resolution of a measurement is the size of the units in which the measurement is presented.

An archer shooting at a target gets the same score for putting an arrow anywhere in a given ring. The bull’s-eye is not an infinitesimal point, but rather a circle of a certain diameter (see Figure 3-4). A shot is either a bull’s-eye or is in the first ring, second ring, and so on. The width of the ring is the resolution of an archery score.

Drawing of an archery target on a stand. The target has concentric circles alternately black and white.

Figure 3-4. Resolution: a hit anywhere in a circle is the same score

The useful resolution of a time measurement is limited by the duration of the underlying fluctuation. A time measurement can be one tick or two ticks, but nothing in between. The period between these ticks is the clock’s useful resolution.

An observer may perceive events as happening between two ticks of a slow clock like a pendulum clock. This just means people have a faster (but less accurate) clock in their heads that they informally compare to the pendulum clock. An observer wanting to make measurements of imperceptible durations like milliseconds has only the clock’s ticks to go on.

There is no required connection between the accuracy of a measurement and the resolution with which it is presented. If I log my daily activities, I may be able to report that it took two days to write this section of this book. The useful resolution of this measurement is in this case one day. I suppose I could convert this time to seconds, reporting instead that I took 172,800 seconds. But unless I had a stopwatch in my hand, reporting the time in seconds might give a false sense that the measurement was more accurate than it was, or a misleading impression that I did not eat or sleep for this whole period.

A measurement may be reported in units smaller than the useful resolution because the units are standard. I own an oven that displays the oven temperature in degrees Fahrenheit. The thermostat that controls the oven, however, has a useful resolution of 5°F, so that as the oven heats, the display reads 300°F, then 305, 310, 315, and so on. It makes more sense to display the temperature in familiar degree units rather than thermostat units. It just means that the least significant digit of the measurement can only assume the values 0 and 5.

The reader might be surprised and disappointed to know the useful resolutions of the many inexpensive thermometers, scales, and other measuring devices around them that have display resolutions of one unit or one-tenth of a unit.

Measuring with several clocks

A man with a watch knows what time it is. A man with two watches is never quite sure.

Most frequently attributed to Lee Segall

When two events happen in the same place, it is easy to measure the elapsed time in timer ticks on a single clock. When two events happen at a distance from one another, it may be necessary to use two clocks. Tick counts can’t be compared directly between different clocks.

Humankind has attacked this problem by synchronizing clocks to the Coordinated Universal Time reference. Coordinated Universal Time is synchronized to astronomical midnight at longitude 0, an arbitrary line that passes through a nifty plaque at the Royal Observatory in Greenwich, England (see Figure 3-5). This allows a time in ticks to be converted to a time in hours, minutes, and seconds past midnight UTC (Universal Time Coordinated, a clunky acronym negotiated between French and English horologists that favors neither French nor English spelling).

If two clocks are perfectly synchronized to UTC, the UTC time from one is directly comparable to the other. But of course, perfect synchronization is not possible. The two clocks contain independent sources of variation that will cause them to drift from UTC, and from one another.

Photograph of the Royal Observatory at Greenwich England. There is an open doorway with a plaque over it, and a brass line embedded in the walk and steps leading up to the door, and above the door to the roof

Figure 3-5. Markings of the Prime Meridian at the Royal Observatory, Greenwich England (photo by Ævar Arnfjörð Bjarmason, license CC BY-SA 3.0)

Measuring Time with Computers

Building a clock on a computer requires a source of periodic fluctuation—preferably one with good precision and trueness—and a way for software to obtain ticks from that source. A purpose-built computer designed for telling time could easily be created. However, the most popular current computer architectures were not designed with much thought to providing good clocks. I will illustrate the problems using the PC architecture and Microsoft Windows. The problems for Linux and embedded platforms are similar.

The crystal oscillator at the heart of a PC clock circuit has a typical basic accuracy of 100 parts per million, which is 0.01%, or about 8 seconds per day. While this accuracy is only a little better than that provided by a digital wristwatch, it is more than enough for performance measurement, where very informal results accurate to a few percent are usable. The clock circuits on cheap embedded processors are less accurate, but the most serious problem is not with the source of periodic fluctuation—more difficult is the problem of getting a reliable tick count for use by programs.

Hardware evolution of tick counters

The original IBM PC did not have any hardware tick counters. It did have a time-of-day clock that the software could read. The earliest Microsoft C runtime library copied the ANSI C library, providing the function time_t time(time_t*), which returns the number of seconds since 00:00 January 1, 1970 UTC. The original version of time() returned a 32-bit signed integer, but during the run-up to Y2K, this was changed to a 64-bit signed integer.
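
A minimal sketch of timing with time(), whose 1-second resolution limits it to long-running activities; DoLongActivity() below is a placeholder of my own invention.

#include <ctime>
#include <iostream>

void DoLongActivity() {
    // Placeholder: several seconds of real work would go here.
}

int main() {
    std::time_t start = std::time(nullptr);   // seconds since 00:00 January 1, 1970 UTC
    DoLongActivity();
    std::time_t stop = std::time(nullptr);
    std::cout << "elapsed: " << std::difftime(stop, start) << " sec" << std::endl;
    return 0;
}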

The original IBM PC used a periodic interrupt from the AC power supply to wake the kernel to perform task switches and other kernel operations. The period of this tick was 16.67 milliseconds in North America, because the AC power is 60 Hz, and 10 milliseconds where the AC power is 50 Hz.

By Windows 98, and possibly earlier, Microsoft C provided the ANSI C function clock_t clock(), which returned a tick counter in a signed format. The constant CLOCKS_PER_SEC specified the number of ticks per second. A returned value of -1 implied that clock() was not available. clock() originally reported a tick based on the periodic AC interrupt. clock() as implemented on Windows differs from the ANSI specification, measuring elapsed wall-clock time and not CPU time. clock() was recently reimplemented in terms of GetSystemTimeAsFileTime(), and as of 2015 it returned a 1-millisecond tick with 1-millisecond resolution, making it a good millisecond clock on Windows.
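
A corresponding sketch using clock() and CLOCKS_PER_SEC; note that on Windows this measures elapsed wall-clock time as described above, whereas a strictly ANSI-conforming library would report CPU time. DoBigTask() is again a placeholder.

#include <ctime>
#include <iostream>

void DoBigTask() {
    volatile double s = 0;                       // placeholder work
    for (int i = 0; i < 10000000; ++i) s += i * 0.5;
}

int main() {
    std::clock_t start = std::clock();
    DoBigTask();
    std::clock_t stop = std::clock();
    double ms = 1000.0 * (stop - start) / CLOCKS_PER_SEC;
    std::cout << "DoBigTask() took about " << ms << " ms" << std::endl;
    return 0;
}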

Starting in Windows 2000, a software tick counter based on the AC power interrupt was made available via a call to DWORD GetTickCount(). The tick counted by GetTickCount() depends on the hardware of the PC, and may be significantly longer than 1 millisecond. GetTickCount() performs a computation to convert the tick to milliseconds to partially relieve this ambiguity. An updated version of this call, ULONGLONG GetTickCount64(), returns the same tick count in a 64-bit unsigned integer so that longer durations can be measured. Although there is no way to inspect the current interrupt period, a pair of functions reduce the period and then restore it:

MMRESULT timeBeginPeriod(UINT)
MMRESULT timeEndPeriod(UINT)

These functions act on a global variable affecting all processes and many other functions, such as Sleep(), that depend on the AC interrupt. Another call, DWORD timeGetTime(), appears to get the same tick counter by a different method.

Starting with the Pentium architecture, Intel provided a hardware register called the Time Stamp Counter (TSC). The TSC is a 64-bit register that counts ticks from the processor clock. This counter could be accessed very rapidly by the RDTSC instruction. Starting in Windows 2000, the TSC could be read by calling the function BOOL QueryPerformanceCounter(LARGE_INTEGER*), which produces a tick count with no particular resolution. The resolution can be obtained through a call to BOOL QueryPerformanceFrequency(LARGE_INTEGER*), which returns the frequency in ticks per second. LARGE_INTEGER is a struct holding 64 bits of integer in a signed format, because the version of Visual Studio at the time these calls were introduced did not have a native 64-bit signed integer type.
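
A minimal Windows-only sketch of how QueryPerformanceCounter() and QueryPerformanceFrequency() are used together: the frequency converts the raw tick difference into seconds. DoBigTask() is a placeholder of my own invention.

#include <windows.h>
#include <iostream>

void DoBigTask() {
    volatile double s = 0;                       // placeholder work
    for (int i = 0; i < 10000000; ++i) s += i * 0.5;
}

int main() {
    LARGE_INTEGER freq, start, stop;
    QueryPerformanceFrequency(&freq);            // ticks per second
    QueryPerformanceCounter(&start);
    DoBigTask();
    QueryPerformanceCounter(&stop);
    double seconds = double(stop.QuadPart - start.QuadPart) / freq.QuadPart;
    std::cout << "DoBigTask() took " << seconds << " sec" << std::endl;
    return 0;
}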

A problem with the initial version of QueryPerformanceCounter() was that its tick rate depended on the processor’s clock. The processor clock was different for different processors and motherboards. Older PCs, notably those with processors from Advanced Micro Devices (AMD), did not have a TSC at the time. When the TSC was not available, QueryPerformanceCounter() fell back to the low-resolution tick returned by GetTickCount().

Windows 2000 also added void GetSystemTimeAsFileTime(FILETIME*), which returns the number of 100-nanosecond ticks since 00:00 January 1, 1601 UTC. FILETIME is a struct holding 64 bits of integer, this time in an unsigned format. Although the displayed resolution of the tick counter appears to be very high, some implementations use the same slow counter used by GetTickCount().

More problems with QueryPerformanceCounter() soon appeared. Some processors implemented a variable clock rate to manage power consumption. This caused the tick period to change. In multiprocessor systems using several discrete processors, the value returned by QueryPerformanceCounter() depended on which processor a thread ran on. Processors began to implement instruction reordering, so that the RDTSC instruction might be delayed, reducing the accuracy of software using the TSC.

To solve these problems, Windows Vista used a different counter, typically the Advanced Configuration and Power Interface (ACPI) power management timer, for QueryPerformanceCounter(). Using this counter solved the synchronization problem in multiprocessors, but significantly increased latency. In the meantime, Intel respecified the TSC to be the maximum, nonchanging clock frequency. Intel also added the nonreorderable RDTSCP instruction.

Since Windows 8, a reliable high-resolution hardware tick count based on the TSC has been available. void GetSystemTimePreciseAsFileTime(FILETIME*) produces a high-resolution tick with fixed frequency and submicrosecond accuracy on any current system running Windows 8 or later.
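
A hedged sketch of reading this counter: GetSystemTimePreciseAsFileTime() fills in a FILETIME, whose two 32-bit halves must be combined into a 64-bit count of 100-nanosecond ticks. The helper to_ticks() below is mine, and the code assumes a Windows 8 or later SDK target.

#include <windows.h>
#include <iostream>

// Combine the two 32-bit halves of a FILETIME into a 64-bit tick count
// (100-nanosecond units since 00:00 January 1, 1601 UTC).
unsigned long long to_ticks(const FILETIME& ft) {
    ULARGE_INTEGER u;
    u.LowPart = ft.dwLowDateTime;
    u.HighPart = ft.dwHighDateTime;
    return u.QuadPart;
}

int main() {
    FILETIME start, stop;
    GetSystemTimePreciseAsFileTime(&start);
    volatile double s = 0;                        // placeholder work
    for (int i = 0; i < 1000000; ++i) s += i * 0.5;
    GetSystemTimePreciseAsFileTime(&stop);
    unsigned long long ticks = to_ticks(stop) - to_ticks(start);
    std::cout << ticks / 10 << " microseconds" << std::endl;
    return 0;
}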

The summary of this history lesson is that PCs were never designed as clocks, so the tick counters they provide are unreliable. If the past 35 years are any guide, future processors and future operating systems may continue to frustrate any hope of getting a solid, high-resolution tick count.

The only tick counter reliably available in all generations of PC is the tick counter returned by GetTickCount(), with all its warts. The 1-millisecond tick returned by clock() is better, and ought to be available in PCs manufactured in the last 10 years or so. Limiting consideration to Windows 8 and later and new processors, the 100-nanosecond tick counter from GetSystemTimePreciseAsFileTime() is very precise. However, my experience is that millisecond accuracy is sufficient for timing experiments.

Wraparound

Wraparound is what happens when the tick counter of a clock achieves its maximum value and then increments to zero. Twelve-hour analog clocks wrap around every day at noon and again at midnight. Windows 98 hung if it ran continuously for 49 days (see Q216641) due to wraparound of a 32-bit millisecond tick counter. The Y2K bug occurred when time expressed as a two-digit year wrapped around. The Mayan calendar wrapped around in 2012, supposedly heralding the end of the world. In January 2038, the Unix epoch (signed, 32-bit seconds since 00:00 January 1, 1970 UTC) will wrap around, possibly causing the actual end of the world for some long-lived embedded systems. The problem with wraparound is that, in the absence of additional bits to record it, the next increment of time is numerically before the previous time. Clocks that wrap around are good only for measuring durations that are less than the wraparound interval.

For instance, on Windows, the GetTickCount() function returns a tick count with 1-millisecond resolution in a 32-bit unsigned integer. The value returned by GetTickCount() wraps around about every 49 days. GetTickCount() may be used for timing operations that take much less than 49 days without concern. If a program calls GetTickCount() at the beginning and end of an operation, the difference between the returned values may be interpreted as the number of milliseconds elapsed between the calls. For example:

DWORD start = GetTickCount();
DoBigTask();
DWORD end = GetTickCount();
cout << "Startup took " << end-start << " ms" << endl;

The way C++ implements unsigned arithmetic produces a correct answer even in the presence of wraparound.
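
Why does the subtraction survive wraparound? Unsigned arithmetic in C++ is modular (mod 2^32 for a 32-bit DWORD), so end - start yields the elapsed tick count as long as the true elapsed time is less than one wraparound interval. A tiny illustration with made-up counter values:

#include <cstdint>
#include <iostream>

int main() {
    // Pretend the 32-bit tick counter wrapped around during the measurement.
    uint32_t start   = 0xFFFFFF00u;     // shortly before wraparound
    uint32_t end     = 0x00000200u;     // shortly after wraparound
    uint32_t elapsed = end - start;     // modular subtraction: 0x300 == 768 ticks
    std::cout << elapsed << " ticks" << std::endl;
    return 0;
}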

GetTickCount() is less effective for remembering time since startup. Many long-lived servers can remain up for months or even years. The problem with wraparound is that, in the absence of bits to record the number of wraparounds, end-start might imply no wraparounds, or one, or more.

Starting in Vista, Microsoft added the GetTickCount64() function, which returns a 64-bit unsigned tick count with 1-millisecond display resolution. GetTickCount64() takes millions of years to wrap around, greatly reducing the likelihood of anyone ever witnessing a problem.

Resolution is not accuracy

On Windows, the GetTickCount() function returns an unsigned 32-bit integer. If a program calls GetTickCount() at the beginning and end of an operation, the difference between the returned values may be interpreted as the number of milliseconds that elapsed between the calls. The resolution of GetTickCount() is thus 1 millisecond.

For example, the following block of code measures the relative performance of an arbitrary function called Foo() on Windows, by calling Foo() in a loop. Tick counts obtained at the start and end of the loop give a time in milliseconds for the loop:

DWORD start = GetTickCount();
for (unsigned i = 0; i < 1000; ++i) {
    Foo();
}
DWORD end = GetTickCount();
cout << "1000 calls to Foo() took " << end-start << "ms" << endl;

If Foo() performed some substantial calculation, this block of code might produce the following output:

1000 calls to Foo() took 16ms

Unfortunately, the accuracy of the call to GetTickCount() may be 10 milliseconds or 15.67 milliseconds, as documented in Microsoft’s web page on GetTickCount(). That is, if you call GetTickCount() twice in a row, the difference may be 0, or 1 millisecond, or 10, 15, or 16 milliseconds. So, the fundamental precision of the measurement is 15 milliseconds, and the extra resolution is valueless. The result output from the previous code block might have been 10 ms or 20 ms or exactly 16 ms.

What is especially frustrating about GetTickCount() is that, beyond having a resolution of 1 millisecond, it is not guaranteed to be implemented in any particular way, or in the same way on two different Windows computers.

I tested the various timing functions on Windows to find out what their usable resolution was on a particular computer (an i7-based Surface Pro 3) and operating system (Windows 8.1). The test, shown in Example 3-1, calls a timing function repeatedly and checks the difference between the values returned by consecutive calls. If the usable resolution of the tick is greater than the latency of the function call, consecutive calls will return either the same value or a value differing by the size of the fundamental tick, in units of the function’s resolution. I averaged the nonzero differences just in case the operating system stole a time slice for some other task.

Example 3-1. Measuring the tick duration of GetTickCount()

unsigned nz_count = 0, nz_sum = 0;
ULONG last, next;
for (last = GetTickCount(); nz_count < 100; last = next) {
    next = GetTickCount();
    if (next != last) {
        nz_count += 1;
        nz_sum += (next - last);
    }
}
std::cout << "GetTickCount() mean resolution "
          << (double)nz_sum / nz_count
          << " ticks" << std::endl;

The results of this test are summarized in Table 3-1.

Function                              Tick duration
time()                                1 sec
GetTickCount()                        15.6 ms
GetTickCount64()                      15.6 ms
timeGetTime()                         15.6 ms
clock()                               1.0 ms
GetSystemTimeAsFileTime()             0.9 ms
GetSystemTimePreciseAsFileTime()      ~450 ns
QueryPerformanceCounter()             ~450 ns

Table 3-1. Measured tick durations, i7 Surface Pro 3, Windows 8.1

Of particular note is that GetSystemTimeAsFileTime(), which has a display resolution of 100 nanoseconds, appears to be based on the same slow 1-millisecond tick as clock(), and that GetSystemTimePreciseAsFileTime() appears to be implemented using QueryPerformanceCounter().

Modern computers have fundamental clock periods measured in hundreds of picoseconds (100 picoseconds is 10⁻¹⁰ seconds!). They execute instructions in a few nanoseconds each. But there are no accessible tick counters on the PC with picosecond or nanosecond resolution. On PCs, the fastest tick counters available have 100-nanosecond resolution, and their fundamental accuracy may be far worse than their resolution. As a consequence, it isn’t possible to measure the duration of a single call to many functions. See “Overcoming Measurement Obstacles” to find out how to overcome this problem.

Latency

Latency is the length of time that elapses between commanding an activity to start and the moment it actually begins. Latency is the time between dropping a penny into a well and hearing the splash (see Figure 3-6). It’s the time between when the starter’s gun fires and when the runner begins to move forward.

With regard to time measurement on computers, latency occurs because starting the clock, running the experiment, and stopping the clock are operations performed serially. The measurement performed can be broken up into five phases:

1. “Starting the clock” involves a function call that gets a tick count from the operating system. This call takes a nonzero amount of time. Somewhere in the middle of the function call, the tick counter value is actually read from a processor register. This value becomes the start time. Call this interval t1.

2. After the value of the tick counter is read, it still must be returned and assigned to a variable. These actions take time. The actual clock is ticking, but the starting tick count is not increasing. Call this interval t2.

3. The measured experiment begins and ends. Call this interval t3.

4. “Stopping the clock” involves another function call to get the tick count. During the part of the function call up to the point where the tick counter value is read, the timer keeps ticking even though the experiment is over. Call this interval t4.

5. After the value of the tick counter is read, it still must be returned and assigned to a variable. Since the tick counter is read, no more error accumulates even though the clock is still ticking. This interval is t5.


Figure 3-6. Latency: the time between throwing the penny and hearing the splash

So, while the actual duration of the measurement is t3, the measured duration is the longer value t2+t3+t4. The latency is thus t2+t4. If the latency is a significant fraction of the experiment run time, the experimenter must subtract out the latency.

Assume for instance that it takes 1 microsecond (μs) to obtain a tick count, and that the tick count is obtained by the very last instruction executed in that time. In the following pseudocode, time measurement begins on the last instruction of the first call to get_tick(), so there is no latency before the activity being measured. The latency of the call to get_tick() at the end of the test gets added to the measured duration:

start = get_tick()   // 1 μs latency before, doesn't matter
do_activity()
stop = get_tick()    // 1 μs after, added onto measured interval
duration = stop - start

If the measured activity took 1 microsecond, the measured value will be 2 microseconds, a 100% error. If the measured activity took 1 millisecond, the measured value will be 1.001 milliseconds, a 0.1% error.

If the same function is called before and after the experiment, then t1=t4 and t2=t5. The latency is just the duration of the timer function call.

I measured the call latency of the Windows timing functions, which is just their duration. Example 3-2 presents a typical test harness for timing the GetSystemTimeAsFileTime() function.

Example 3-2. Latency of Windows timing functions

ULONG start = GetTickCount();
LARGE_INTEGER count;
for (counter_t i = 0; i < nCalls; ++i)
    QueryPerformanceCounter(&count);
ULONG stop = GetTickCount();
std::cout << stop - start
          << "ms for 100m QueryPerformanceCounter() calls"
          << std::endl;

The results of this test are summarized in Table 3-2.

Function                              Call duration
GetSystemTimeAsFileTime()             2.8 ns
GetTickCount()                        3.8 ns
GetTickCount64()                      6.7 ns
QueryPerformanceCounter()             8.0 ns
clock()                               13 ns
time()                                15 ns
timeGetTime()                         17 ns
GetSystemTimePreciseAsFileTime()      22 ns

Table 3-2. Latency of Windows timing functions, VS 2013, i7, Windows 8.1

Among the interesting things to note here is that the latencies range from a few nanoseconds to a couple dozen nanoseconds on my i7 tablet, so all these calls are reasonably efficient. This means latency won’t affect measurement accuracy when function calls are timed in a loop lasting around one second. Still, the costs vary by a factor of 10 among functions that read the same low-resolution tick. GetSystemTimePreciseAsFileTime() has the highest latency, and the highest latency relative to its tick, at about 5%. Latency is a much bigger problem on slower processors.

Nondeterministic behavior

Computers are incredibly complex devices with tons of internal state, most of which is invisible to developers. Executing a function changes the state of the computer (for instance, the contents of cache memory), so that each repeated execution happens under conditions that are different from the previous one. Uncontrollable changes in internal state are thus a source of random variation in measurements.

Furthermore, the operating system schedules tasks in unpredictable ways, so that other activity on the processors and memory buses varies during a measurement run. This makes any measurement less accurate.

The operating system may even stop the code being timed to give a time slice away to another program. The tick counter continues to run during such pauses, producing a larger measured duration than if the operating system had not given a time slice to another program. This causes a somewhat larger random variation that can affect measurement.

Overcoming Measurement Obstacles

So, how bad is it really? Can we use a computer at all for timing? And what do we need to do to make it all work? This section summarizes my personal experience using the stopwatch class described in the next section to test functions for this book.

Don’t sweat the small stuff

The really good news is that measurements only have to be accurate to within a couple of percent to be good enough to guide optimization. Looking at it another way, for the usual linear kinds of improvement you can expect from optimization, the measurements need only about two significant digits of accuracy; for an experiment that loops on some function for 1,000 milliseconds, that is an error budget of about 10 milliseconds. If we look at likely sources of error, as illustrated in Table 3-3, they are all well below this.

Variation                             Contribution %
Tick counter function latency         < 0.00001
Basic clock stability                 < 0.01
Tick counter usable resolution        < 0.1

Table 3-3. Contribution of variation in a 1-second Windows time measurement

Measure relative performance

The ratio of the run time of the optimized code to the run time of the original code is called relative performance. Among the many good things about relative measurements is that they cancel out systematic variation, since the same variation applies to both measurements. The relative number in percentage terms is also easier to grasp intuitively than a number of milliseconds.
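For example (with invented numbers), if the original code runs a test in 1,250 milliseconds and the optimized code runs the same test in 500 milliseconds, the relative performance is 1250/500 = 2.5: the optimized code is 2.5 times as fast as the original, a 60% reduction in run time.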

Improve repeatability by measuring module tests

Module tests, those staged subsystem tests using canned input data, often make good, repeatable actions for a profile run or performance measurement. Many organizations have an extensive library of module tests, and new tests can be added specifically for performance tuning.

A common concern about performance tuning goes like this: “My code is a big ball of yarn and I don’t have any test cases. I have to test performance on my live input data or live database, which is always changing. I don’t get consistent or repeatable results. What should I do?”

I do not have magic bullets to kill this particular beast. If I test modules or subsystems with a repeatable set of mock input data, improved performance on these tests generally translates into improved performance on live data. If I identify hot functions in a big but unrepeatable test, improving those hot functions using module test cases generally improves performance on live data. Every developer knows why they should build software systems out of loosely coupled modules. Every developer knows why they should maintain a library of good test cases. Optimization is just one more reason.

Tune performance with metrics

There is a ray of hope for the developer forced to tune performance in an unpredictable live system. Instead of measuring the value of, say, a critical response time, the developer can collect metrics, population statistics such as mean and variance, or an exponentially smoothed average of the response time. Since these statistics get their value from a large number of individual events, a durable improvement in these statistics indicates a successful code change.
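As one possible sketch (not from the book; the class name and smoothing factor are invented for illustration), an exponentially smoothed response-time metric takes only a few lines of C++:

// Exponentially smoothed average: each new sample moves the estimate a
// fraction alpha of the way toward the sample, so older samples decay away.
class SmoothedMetric {
public:
    explicit SmoothedMetric(double alpha)
        : m_alpha(alpha), m_value(0.0), m_primed(false) {}
    void Record(double sample_ms) {
        if (!m_primed) { m_value = sample_ms; m_primed = true; }
        else           { m_value = m_alpha * sample_ms + (1.0 - m_alpha) * m_value; }
    }
    double Value() const { return m_value; }
private:
    double m_alpha;
    double m_value;
    bool   m_primed;
};

A durable drop in Value() over a large number of recorded responses, rather than any single reading, is the signal that a code change helped.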

Here are just a few of the issues that may arise when optimizing with metrics:

· Population statistics must be based on a large number of events to be valid. The change/test/evaluate loop is far more time-consuming with metrics than with direct measurement of a fixed input.

· Collecting metrics requires far more infrastructure than either profiling or run timing. A persistent store is usually required to hold the statistical data. Feeding this store can be expensive enough to affect performance. The system for collecting metrics must be carefully designed to make it flexible enough to support many experiments.

· Although there is a well-established methodology to verify or refute hypotheses based on statistics, the methodology requires some statistical sophistication on the part of developers to do properly.

Improve accuracy by averaging many iterations

An experiment can improve upon the accuracy of a single measurement by averaging several measurements. This is what happens when a developer times a function call in a loop or runs a program over input that causes it to call a particular function many times.

One advantage of measuring many iterations of a function call is that random variations tend to cancel out. The state of the caches tends to converge on a single value, so that each iteration is an apples-to-apples comparison. Over a long enough interval, random scheduler behavior has about the same effect on the timing of the original function and the optimized function. Although the absolute times may not be typical of the same functions in the larger program, the relative performance still measures the degree of improvement accurately.

Another advantage is that a coarser and easier-to-obtain tick counter can be used. Computers are fast enough that a single second is long enough for thousands or even millions of iterations.

Reduce nondeterministic operating system behavior by raising priority

It is possible to reduce the chance of the operating system taking slices out of a measurement run by increasing the priority of the measurement process. On Windows, the SetPriorityClass() function manipulates the priority of a process. The SetThreadPriority() function does the same for the priority of a thread. The following code increases the priority of the current process and thread:

SetPriorityClass(GetCurrentProcess(), ABOVE_NORMAL_PRIORITY_CLASS);
SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_HIGHEST);

After the measurement, the process and thread priority should usually be restored to normal:

SetPriorityClass(GetCurrentProcess(), NORMAL_PRIORITY_CLASS);
SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_NORMAL);
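One way to make sure the normal settings are always restored, even if the measurement code exits early, is to wrap these calls in a small RAII guard. The following is a sketch of my own (not from the book), using the same Windows functions plus GetPriorityClass() and GetThreadPriority() to save the previous settings:

#include <windows.h>

// Raises process and thread priority for the lifetime of a scope,
// then restores whatever settings were in effect before.
class RaisePriority {
public:
    RaisePriority()
      : m_oldClass(GetPriorityClass(GetCurrentProcess())),
        m_oldPriority(GetThreadPriority(GetCurrentThread())) {
        SetPriorityClass(GetCurrentProcess(), ABOVE_NORMAL_PRIORITY_CLASS);
        SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_HIGHEST);
    }
    ~RaisePriority() {
        SetPriorityClass(GetCurrentProcess(), m_oldClass);
        SetThreadPriority(GetCurrentThread(), m_oldPriority);
    }
private:
    DWORD m_oldClass;
    int   m_oldPriority;
};

Declaring a RaisePriority instance at the top of the timing scope then works like the stopwatch described later in this chapter: construction raises the priority, destruction puts it back.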

Nondeterminism happens—get over it

The way I measure performance for optimization is extremely casual. It isn’t informed by great knowledge of statistics. My tests run for seconds, not hours. I don’t feel the need to apologize for this casual approach. These methods translate into human-visible performance improvements in overall program run time, so I must be on the right track.

If I run the same timing experiment on two different days, I get results that may differ by somewhere between 0.1% and 1%. This is no doubt due to differences in initial conditions in my PC. I have no way to control these conditions, so I don’t worry. If I see larger variations, I can make the timing runs longer. As this makes my test/debug cycles longer, I don’t do it unless I see a need to.

Even when I see variation of a few percent between runs, relative variation of timing measurements within a single run seems to be way below 1%. That is, I can see even relatively subtle changes by timing two variations of a function in the same test run.

I try to make timing runs on a quiet computer that isn’t streaming video, running a Java update, or uncompressing a big archive. I try not to move the mouse or switch windows during a timing run. When PCs contained only one processor core, this was really important. With modern multicore processors, I don’t notice much variation even if I forget to turn off Pandora.

If a timing test calls a function 10,000 times, that code and related data are going to remain in cache. If I were doing a test of worst-case absolute time for a hard real-time system, that would matter. But I am doing a relative time measurement on a system whose kernel is already nondeterministic. Furthermore, I’m testing only functions that my profiler has already identified as hot. They will thus be in cache already during production runs, so the iterated test actually represents realistic conditions.

If a modified function appears to be 1% faster, it’s generally not worth modifying. Amdahl’s Law will shrink its small contribution to overall run time to insignificance. An improvement of 10% has marginal value. A change of 100% is far more interesting. Accepting only large changes in performance frees the developer from worrying about methodology.
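For a rough illustration of the 1% case (numbers invented for the example): if the function accounts for 10% of total run time, Amdahl’s Law gives an overall speedup of 1 / (0.9 + 0.1/1.01) ≈ 1.001 when the function is made 1% faster, a 0.1% improvement in the whole program, whereas making the same function twice as fast yields 1 / (0.9 + 0.1/2) ≈ 1.05, about 5%.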

Create a Stopwatch Class

To time part of a program, instrument the code with a stopwatch class. The class works exactly like a mechanical stopwatch. Instantiate the stopwatch or call its Start() member function, and the stopwatch begins to tick. Call the stopwatch’s Stop() member function or destroy the stopwatch instance, and the stopwatch stops ticking and displays the elapsed time.

It’s not hard to build a stopwatch class, and there are plenty available on the Internet. Example 3-3 shows one I have used.

Example 3-3. Stopwatch class

template <typename T> class basic_stopwatch : T {
    typedef T BaseTimer;
public:
    // create, optionally start timing an activity
    explicit basic_stopwatch(bool start);
    explicit basic_stopwatch(char const* activity = "Stopwatch",
                             bool start = true);
    basic_stopwatch(std::ostream& log,
                    char const* activity = "Stopwatch",
                    bool start = true);

    // stop and destroy a stopwatch
    ~basic_stopwatch();

    // get last lap time (time of last stop)
    unsigned LapGet() const;

    // predicate: return true if the stopwatch is running
    bool IsStarted() const;

    // show accumulated time, keep running, set/return lap
    unsigned Show(char const* event = "show");

    // (re)start a stopwatch, set/return lap time
    unsigned Start(char const* event_name = "start");

    // stop a running stopwatch, set/return lap time
    unsigned Stop(char const* event_name = "stop");

private:    // members
    char const*   m_activity;   // "activity" string
    unsigned      m_lap;        // lap time (time of last stop)
    std::ostream& m_log;        // stream on which to log events
};

This code just reproduces the class definition. The member functions are inlined for maximum performance.
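As a sketch only (not the book’s implementation), the inlined members might look something like the following, written in terms of the base timer interface (Clear(), Start(), GetMs()) described next:

#include <iostream>

template <typename T>
basic_stopwatch<T>::basic_stopwatch(char const* activity, bool start_now)
    : m_activity(activity), m_lap(0), m_log(std::cout) {
    if (start_now)
        Start();
}

template <typename T>
basic_stopwatch<T>::~basic_stopwatch() {
    if (BaseTimer::IsStarted())
        Stop();                     // RAII: destruction stops and reports
}

template <typename T>
unsigned basic_stopwatch<T>::Start(char const* event_name) {
    m_log << m_activity << ": " << event_name << std::endl;
    BaseTimer::Start();             // start the underlying tick counter
    return m_lap;
}

template <typename T>
unsigned basic_stopwatch<T>::Stop(char const* event_name) {
    m_lap = BaseTimer::GetMs();     // elapsed milliseconds since Start()
    m_log << m_activity << ": " << event_name << " " << m_lap << "ms" << std::endl;
    BaseTimer::Clear();
    return m_lap;
}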

The class that is the value of Stopwatch’s template parameter T is an even simpler kind of timer that provides operating system–dependent and C++ standard–dependent functions to access a tick counter. I wrote several versions of the TimerBase class to test different tick counter implementations. Depending on how modern the C++ compiler in use is, the class given by T can use the C++ <chrono> library or can directly obtain a tick from the operating system. Example 3-4 shows a TimerBase class that uses the C++ <chrono> library available in C++11 and later.

Example 3-4. TimerBase class using <chrono>

#include <chrono>
using namespace std::chrono;

class TimerBase {
public:
    // clears the timer
    TimerBase() : m_start(system_clock::time_point::min()) { }

    // clears the timer
    void Clear() {
        m_start = system_clock::time_point::min();
    }

    // returns true if the timer is running
    bool IsStarted() const {
        return (m_start != system_clock::time_point::min());
    }

    // start the timer
    void Start() {
        m_start = system_clock::now();
    }

    // get the number of milliseconds since the timer was started
    unsigned long GetMs() {
        if (IsStarted()) {
            system_clock::duration diff;
            diff = system_clock::now() - m_start;
            return (unsigned)(duration_cast<milliseconds>(diff).count());
        }
        return 0;
    }

private:
    system_clock::time_point m_start;
};

This class has the advantage of portability among operating systems, but it requires C++11.

Example 3-5 shows a TimerBase class with the same functionality that uses the clock() function available on both Linux and Windows.

Example 3-5. TimerBase class using clock()

#include <ctime>

class TimerBaseClock {
public:
    // clears the timer
    TimerBaseClock() { m_start = -1; }

    // clears the timer
    void Clear() { m_start = -1; }

    // returns true if the timer is running
    bool IsStarted() const { return (m_start != -1); }

    // start the timer
    void Start() { m_start = clock(); }

    // get the number of milliseconds since the timer was started
    unsigned long GetMs() {
        clock_t now;
        if (IsStarted()) {
            now = clock();
            clock_t dt = (now - m_start);
            return (unsigned long)(dt * 1000 / CLOCKS_PER_SEC);
        }
        return 0;
    }

private:
    clock_t m_start;
};

This class has the advantage of portability across C++ versions and operating systems, but the disadvantage that the clock() function measures somewhat different things on Linux and Windows.

Example 3-6 shows a TimerBase class that works for older versions of Windows and Linux. On Windows, the gettimeofday() function must also be explicitly provided, since it is not part of the Windows API or the C standard library.

Example 3-6. TimerBase class using gettimeofday()

#include <sys/time.h>   // declares gettimeofday() and timeval on Linux;
                        // on Windows both must be supplied by the developer

class TimerBaseGetTimeOfDay {
public:
    // clears the timer
    TimerBaseGetTimeOfDay() { Clear(); }

    // clears the timer
    void Clear() { m_start.tv_sec = 0; m_start.tv_usec = 0; }

    // returns true if the timer is running
    bool IsStarted() const {
        return (m_start.tv_sec != 0) || (m_start.tv_usec != 0);
    }

    // start the timer
    void Start() { gettimeofday(&m_start, NULL); }

    // get the number of milliseconds since the timer was started
    unsigned long GetMs() {
        timeval now;
        if (IsStarted()) {
            gettimeofday(&now, NULL);
            long dt = (long)(now.tv_sec - m_start.tv_sec) * 1000 +
                      (long)(now.tv_usec - m_start.tv_usec) / 1000;
            return (unsigned long)dt;
        }
        return 0;
    }

private:
    timeval m_start;
};

This class is portable across C++ versions and operating systems, but it requires an implementation of the gettimeofday() function when used on Windows.
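For reference, a gettimeofday() shim for Windows can be built on GetSystemTimeAsFileTime(). The following is a sketch of my own (assuming timeval has not already been defined by another header such as <winsock2.h>), not code from the book:

#include <windows.h>

struct timeval { long tv_sec; long tv_usec; };

int gettimeofday(struct timeval* tv, void* /*tz*/) {
    FILETIME ft;
    GetSystemTimeAsFileTime(&ft);                // 100 ns units since January 1, 1601
    unsigned long long t =
        ((unsigned long long)ft.dwHighDateTime << 32) | ft.dwLowDateTime;
    t -= 116444736000000000ULL;                  // shift epoch from 1601 to 1970
    tv->tv_sec  = (long)(t / 10000000ULL);       // 100 ns units to seconds
    tv->tv_usec = (long)((t % 10000000ULL) / 10ULL); // remainder to microseconds
    return 0;
}

With a shim like this in place, the class in Example 3-6 compiles unchanged on Windows.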

The simplest use of the stopwatch class makes use of the RAII (Resource Acquisition Is Initialization) idiom. The program instantiates the stopwatch class at the beginning of any set of statements enclosed by braces. When the stopwatch is instantiated, its default action is to start timing. When the stopwatch is destroyed at the closing brace, it outputs the accumulated time. The program can output the accumulated time during program execution by calling the stopwatch’s Show() member function. This lets the developer instrument several connected regions of code with a single timer. For example:

{
    Stopwatch sw("activity");
    DoActivity();
}

This will print two lines on the standard output:

activity: start
activity: stop 1234ms

The stopwatch has no overhead while it is running. Latency at start and stop consists of the cost of the system call to get the current time, plus the cost of producing output if the Show() member function is called. This latency is not significant when used to time tasks taking tens of milliseconds or more. Overhead becomes significant, and thus timing accuracy decreases, if the developer attempts to time microsecond activities.

The greatest weakness of run timing is probably the need for intuition, and thus experience, in interpreting the results. After repeated timing runs narrow the search to a particular area of interest, the developer must then inspect the code or perform experiments to identify and remove hot spots. Code inspection makes use of the developer’s own experience or the heuristic rules outlined in this book. These rules have the advantage of helping you identify code that has a high runtime cost. The disadvantage is that they don’t point unambiguously to a most-hot place.

OPTIMIZATION PRO TIP

The developer can’t always time activities that take place before main() gets control or after main() exits. If there are many classes instantiated at global scope, this may become a problem. Program time spent outside of main() can be a source of considerable overhead in large programs with many global variables. See “Remove Code from Startup and Shutdown” for more on program startup.

Time Hot Functions in a Test Harness

Once a profiler or runtime analysis fingers a function as an optimization candidate, an easy way to work with the function is to build a test harness that calls the function many times. This allows the run time of the function to grow to a measurable value. It also evens out variations in run time caused by background tasks, context switching, and so on. The edit/compile/run cycle to time a function in isolation is faster than the more lengthy cycle of edit/compile/run followed by running the profiler and inspecting its output. Many examples in this book use this technique.

The timing test harness (Example 3-7) is just a call to the function, surrounded by a loop of, say, 10,000 iterations, surrounded in turn by an invocation of the stopwatch.

Example 3-7. Timing test harness

typedef unsigned counter_t;
counter_t const iterations = 10000;
...
{
    Stopwatch sw("function_to_be_timed()");
    for (counter_t i = 0; i < iterations; ++i) {
        result = function_to_be_timed();
    }
}

The number of iterations has to be guessed. The total run time should be a few hundred to a few thousand milliseconds on desktop processors if the tick counter used by Stopwatch has a useful resolution of around 10 milliseconds.

I use the type counter_t instead of unsigned or unsigned long because for some sufficiently brief functions, the variable may have to be a 64-bit unsigned long long type. It’s just easier to get accustomed to using a typedef than it is to go back and retype all the type names. It’s an optimization of the optimization process itself.

The outermost set of curly braces is important. It defines the scope within which sw, the instance of the Stopwatch class, exists. Because Stopwatch uses the RAII idiom, the sw constructor gets the starting tick count, and its destructor gets the final tick count and puts the results onto the standard output stream.

Estimate Code Cost to Find Hot Code

Experience has taught me that profiling and measuring run times are effective methods of finding candidates for optimization. The profiler may point to a function that is very frequently called, or that consumes a significant fraction of total processing time. It is unlikely to point to a specific C++ statement and accuse, “Colonel Mustard, in the conservatory, with the lead pipe!” There may also be times when instrumenting code for profiling is very expensive. Run timing also may implicate a large block of code without pointing to a specific problem.

The developer’s next step is to estimate the cost of each statement in the identified block. This is not a precise task, like proving a theorem. Most of the value can be obtained from a cursory inspection, looking for expensive statements and structures that magnify the cost of code.

Estimate the Cost of Individual C++ Statements

As described in “Memory Is Slow”, the cost in time of memory access dominates all other instruction execution costs. On the simple microprocessors used in toasters and coffee makers, the time taken by an instruction is literally the time needed to read each byte of the instruction from memory, plus the time needed to read data needed as input to the instruction, plus the time needed to write data that is the result of the instruction. Decoding and executing the instruction itself takes a comparatively insignificant amount of time, which is hidden underneath the memory access time.

On desktop-class microprocessors, the situation is more complex. Many instructions are simultaneously in various stages of being executed. The cost of reading the stream of instructions is hidden. However, the cost of accessing the data on which the instructions operate is more difficult to hide. For this reason, the cost of reading and writing data may be used to approximate the relative cost of instruction execution across all classes of microprocessor.

To estimate how expensive a C++ statement is, a useful rule of thumb is to count the number of memory reads and writes performed by the statement. For example, in the statement a = b + c;, where a, b, and c are integers, the values of locations b and c must be read from memory, and the sum must be written to memory at location a. This statement thus costs three memory accesses. The count doesn’t depend on the microprocessor instruction set; it is the unavoidable, inevitable cost of the statement.

r = *p + a[i]; can be counted as follows: one access to read i, one to read a[i], one to read p, one to read the data to which *p points, and one to write the result in r, for a total of five accesses. The cost of function calls in memory accesses is described in “Cost of Function Calls”.

It’s important to understand that this is a heuristic, a rule of thumb. In actual hardware, additional memory accesses are needed to fetch the instructions that perform the statement. However, these accesses are sequential, so they are probably very efficient. And this additional cost is proportional to the cost of accessing the data. The compiler may optimize away some of the accesses by reusing previous computations or taking advantage of static analysis of the code. The cost in units of time also depends on whether the memory accessed by the C++ statement is in cache.

But all other factors being equal, the part that matters is how many reads and writes to memory are necessary to access the data used by a statement. This heuristic is not perfect, but it’s all you get unless you want to read the assembly language intermediate output of the compiler, which is tedious and usually unrewarding.

Estimate the Cost of Loops

Because each C++ statement generally performs just a handful of memory accesses, it is unlikely that a statement standing alone will be very hot unless some other factor acts on it to execute it very frequently. One such factor is when the statement appears in a loop. Then the cost of the statement is multiplied by the number of times it is executed.

If you’re very lucky, you may occasionally find code that shouts out how hot it is. The profiler may finger a single function called a million times, or some hot function may contain a loop like:

for (int i = 1; i < 1000000; ++i) {
    do_something_expensive();
    if (mostly_true) {
        do_more_stuff();
        even_more();
    }
}

This loop is very obviously executed a million times, and is thus very hot. It’s likely you’ll have to do a little work to optimize it.

Estimate repeat count in nested loops

When a loop is nested inside another loop, the repeat count of the code block is the repeat count of the inner loop multiplied by the repeat count of the outer loop. For example:

for (int i = 0; i < 100; ++i) {
    for (int j = 0; j < 50; ++j) {
        fiddle(a[i][j]);
    }
}

Here, the repeat count is 100*50 = 5,000.

This particular block is extremely straightforward. There are infinite variations. For instance, in mathematics, there are important cases where code iterates over a triangular part of a matrix. And sometimes poor coding practices make it harder to discern the outline of the nested loops.
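For example (an illustration of the triangular case, not code from the book), traversing the lower triangle of an n×n matrix executes the inner statement 1 + 2 + ... + n = n(n+1)/2 times, about half the n² repeat count of the full rectangular nest:

for (int i = 0; i < n; ++i) {
    for (int j = 0; j <= i; ++j) {   // inner loop runs i+1 times on the i-th pass
        fiddle(a[i][j]);
    }
}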

A nested loop may not be immediately obvious. If a loop invokes a function, and the function contains another loop, this inner loop is a nested loop. As we will see in “Push Loops Down into Functions to Reduce Call Overhead”, the cost of invoking the function repeatedly in the outer loop can also sometimes be eliminated.

An inner loop may be embedded within a standard library function, particularly a string-handling or character I/O function. If the repeat count is great enough, it may be worth recoding the contents of the standard library function to escape the invocation cost.
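As a hedged illustration (mine, not the book’s), a loop whose termination test calls strlen() hides a second loop over the whole string inside the library call, turning a linear scan into quadratic work; hoisting the length out of the condition removes the hidden inner loop:

#include <cctype>
#include <cstring>

void to_upper_slow(char* s) {
    // strlen() re-walks the entire string on every iteration: O(n) work done n times
    for (size_t i = 0; i < strlen(s); ++i)
        s[i] = (char)toupper((unsigned char)s[i]);
}

void to_upper_fast(char* s) {
    // compute the length once; the hidden nested loop is gone
    for (size_t i = 0, n = strlen(s); i < n; ++i)
        s[i] = (char)toupper((unsigned char)s[i]);
}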

Estimate loops with variable repeat count

Not every loop provides a nice, neat count of its iterations. Many loops repeat until some condition occurs. An example is a loop that processes characters until it finds a space, or that processes digits until it finds a non-digit. The repeat count of such a loop can be estimated. Only a casual estimate is needed; say, five digits per number or six letters per word. The goal is to locate likely candidates for optimization.

Recognize implicit loops

Programs that process events (Windows UI programs, for instance) contain an implicit loop at their top level. This loop is not even visible in the program because it is hidden in the framework. If the framework is feeding events at the maximum possible rate, any code that is executed each time the event handler gets control, before the event is dispatched, or during dispatch is likely to be hot, as is code in the most frequently dispatched events.

Recognize false loops

Not every while or do statement is a loop at all. I have encountered code where a do statement was used to facilitate flow of control. There are other, better ways to code the simple example that follows, but with more complicated if-then-else logic, this idiom makes some sense. This “loop” is executed exactly once. If it reaches the while(0), it exits:

do {
    if (!operation1())
        break;
    if (!operation2(x, y, z))
        break;
} while (0);

This idiom is also sometimes used to package up several statements into a C-style macro.
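For instance (a generic illustration, with a hypothetical DO_STEPS() macro and a hypothetical handle_not_ready() function), wrapping the statements in do { ... } while(0) lets a multistatement macro behave as a single statement after an unbraced if:

#define DO_STEPS(x, y, z)         \
    do {                          \
        if (!operation1())        \
            break;                \
        if (!operation2(x, y, z)) \
            break;                \
    } while (0)

// usage: the expansion is one statement, so the else still binds correctly
//   if (ready)
//       DO_STEPS(a, b, c);
//   else
//       handle_not_ready();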

Other Ways to Find Hot Spots

A developer familiar with the code can choose to proceed on intuition alone, making hypotheses about areas of code that may be affecting the overall run time, and performing experiments to see whether a change in one of these areas improves overall performance.

I do not recommend this route unless you have no one to please but yourself. Instrumenting code with a profiler or timer allows developers to demonstrate to colleagues and managers that they are making progress in an optimization effort. If you go on intuition alone and do not deliver results, and sometimes even if you do deliver, your teammates will question your methods and distract you from whatever good you are doing. And they should. From their point of view, there is no way to tell if you are using your highly trained professional intuition, or just trying stuff at random.

OPTIMIZATION WAR STORY

My personal record using experience and intuition is mixed. I once reduced the unresponsive startup time of an interactive game application from an unacceptable 16 seconds to around 4 seconds. Unfortunately, I did not save a baseline measurement of the original code. My manager would only believe that I had reduced startup time from eight seconds to four, because that was what I could show him. This very evidence-minded manager then went to the trouble of instrumenting the original code with a profiler and printing the results for me. Interestingly enough, his instrumentation found pretty much the same list of functions as my intuition. However, my credibility as an optimization expert was blown because I had not pursued the task in a methodical way.

Summary

· Performance must be measured.

· Make testable predictions, and write the predictions down.

· Make a record of code changes.

· If each experiment run is documented, it can quickly be repeated.

· A program spends 90% of its run time in 10% of its code.

· A measurement must be both true and precise to be accurate.

· Resolution is not accuracy.

· On Windows, the function clock() provides a reliable 1-millisecond clock tick. For Windows 8 and later, the function GetSystemTimePreciseAsFileTime() provides a submicrosecond tick.

· Accepting only large changes in performance frees the developer from worrying about methodology.

· To estimate how expensive a C++ statement is, count the number of memory reads and writes performed by the statement.