Writing Efficient C++ - C++ Software Engineering - Professional C++ (2014)

Part V: C++ Software Engineering

Chapter 25: Writing Efficient C++

WHAT’S IN THIS CHAPTER?

· What “efficiency” and “performance” mean

· What kind of language-level optimizations you can use

· Which design-level guidelines you can follow to design efficient programs

· What profiling tools are

WROX.COM DOWNLOADS FOR THIS CHAPTER

Please note that all the code examples for this chapter are available as a part of this chapter’s code download on the book’s website at www.wrox.com/go/proc++3e on the Download Code tab.

The efficiency of your programs is important regardless of your application domain. If your product competes with others in the marketplace, speed can be a major differentiator: Given the choice between a slower and a faster program, which one would you choose? No one would buy an operating system that takes two weeks to boot up, unless it was the only option. Even if you don’t intend to sell your products, they will have users. Those users will not be happy with you if they end up wasting time waiting for your programs to complete tasks.

Now that you understand the concepts of Professional C++ design and coding, and have tackled some of the more complex facilities that the language provides, you are ready to incorporate performance into your programs. Writing efficient programs involves thought at the design level, as well as details at the implementation level. Although this chapter falls late in this book, remember to consider performance from the beginning of your program life cycle.

OVERVIEW OF PERFORMANCE AND EFFICIENCY

Before delving further into the details, it’s helpful to define the terms performance and efficiency, as used in this book. The performance of a program can refer to several areas, such as speed, memory usage, disk access, and network use. This chapter focuses on speed performance. The term efficiency, when applied to programs, means running without wasted effort. An efficient program completes its tasks as quickly as possible within the given circumstances. A program can be efficient without being fast, if the application domain is inherently prohibitive to quick execution.

NOTE An efficient, or high-performance, program runs as fast as possible for its particular tasks.

Note that the title of this chapter, “Writing Efficient C++,” means writing programs that run efficiently, not efficiently writing programs. That is, the time you learn to save by reading this chapter will be your users’, not your own!

Two Approaches to Efficiency

Language-level efficiency involves using the language as efficiently as possible, for example, passing objects by reference instead of by value. However, this will only get you so far. Much more important is design-level efficiency, which includes choosing efficient algorithms, avoiding unnecessary steps and computations, and selecting appropriate design optimizations. Optimizing existing code, more often than not, involves replacing a bad algorithm or data structure with a better, more efficient one.

Two Kinds of Programs

As noted, efficiency is important for all application domains. Additionally, there is a small subset of programs, such as system-level software, embedded systems, intensive computational applications, and real-time games, which require extremely high levels of efficiency. Most programs don’t. Unless you write those types of high-performance applications, you probably don’t need to worry about squeezing every ounce of speed out of your C++ code. Think of it as the difference between building normal family cars and building sports cars. Every car must be reasonably efficient, but sports cars require extremely high performance. You wouldn’t want to waste your time optimizing family cars for speed when they’ll never go faster than 70 miles per hour.

Is C++ an Inefficient Language?

C programmers often resist using C++ for high-performance applications. They claim that the language is inherently less efficient than C or a similar procedural language because C++ includes high-level concepts, such as exceptions and virtual methods. However, there are problems with this argument.

First, you cannot ignore the effect of compilers. When discussing the efficiency of a language, you must separate the performance capabilities of the language itself from the effectiveness of its compilers at optimizing it. Recall that the C or C++ code you write is not the code that the computer executes. A compiler first translates that code into machine language, applying optimizations in the process. This means that you can’t simply run benchmarks of C and C++ programs and compare the result. You’re really comparing the compiler optimizations of the languages, not the languages themselves. C++ compilers can “optimize away” many of the high-level constructs in the language to generate machine code similar to that generated from a comparable C program.

Critics, however, still maintain that some features of C++ cannot be optimized away. For example, as Chapter 9 explains, virtual methods require the existence of a vtable and an additional level of indirection at run time, possibly making them slower than regular non-virtual function calls. However, when you really think about it, this argument is still unconvincing. Virtual method calls provide more than just a function call: They also give you a run-time choice of which function to call. A comparable non-virtual function call would need a conditional statement to decide which function to call. If you don’t need those extra semantics, you can use a non-virtual function. A general design rule in the C++ language is, “if you don’t use it, you don’t need to pay for it.” If you don’t use virtual methods, you pay no performance penalty for the fact that you could use them. Thus, non-virtual function calls in C++ are identical to function calls in C in terms of performance. One caveat: a class that is meant to serve as a base class should declare its destructor virtual, so that deleting a derived object through a base class pointer runs the correct destructor; that small overhead buys correctness, not speed.

Far more important, the high-level constructs of C++ enable you to write cleaner programs that are more efficient at the design level, are more easily maintained, and avoid accumulating unnecessary and dead code.

I believe that you will be better served in your development, performance, and maintenance by choosing C++ instead of a procedural language.

LANGUAGE-LEVEL EFFICIENCY

Many books, articles, and programmers spend a lot of time trying to convince you to apply language-level optimizations to your code. These tips and tricks are important, and can speed up your programs in some cases. However, they are far less important than the overall design and algorithm choices in your program. You can pass-by-reference all you want, but it won’t ever make your program fast if you perform twice as many disk writes as you need. It’s easy to get bogged down in references and pointers and forget about the big picture.

Furthermore, good optimizing compilers perform some of these language-level tricks automatically. You should never spend time optimizing a particular area unless a profiler, discussed later in this chapter, tells you that the area is a bottleneck.

That being said, using certain language-level optimizations, such as pass-by-reference, is just considered good coding style.

In this book, I’ve tried to present a balance of strategies. Thus, I’ve included here what I feel are the most useful language-level optimizations. This list is not comprehensive, but should give you a good start if you want to optimize your code. However, make sure to read, and practice, the design-level efficiency advice described later in this chapter as well.

WARNING Apply language-level optimizations judiciously. I recommend making a clean, well-structured design and implementation first. Then use a profiler, and only invest time optimizing those parts that are flagged by the profiler as being performance bottlenecks.

Handle Objects Efficiently

C++ does a lot of work for you behind the scenes, particularly with regard to objects. You should always be aware of the performance impact of the code you write. If you follow a few simple guidelines, your code will become more efficient. Note that these guidelines are only relevant for objects, and not for primitive types such as bool, int, float, and so on.

Pass-by-Reference

This rule is discussed elsewhere in this book, but it’s worth repeating here.

WARNING Objects should rarely be passed by value to a function or method.

Pass-by-value incurs copying costs that are avoided with pass-by-reference. One reason why this rule can be difficult to remember is that on the surface there doesn’t appear to be any problem when you pass-by-value. Consider a class to represent a person that looks like this:

class Person
{
public:
    Person();
    Person(const std::string& inFirstName, const std::string& inLastName,
        int inAge);
    const std::string& getFirstName() const { return mFirstName; }
    const std::string& getLastName() const { return mLastName; }
    int getAge() const { return mAge; }
private:
    std::string mFirstName, mLastName;
    int mAge;
};

You could write a function that takes a Person object as follows:

void processPerson(Person p)
{
    // Process the person.
}

You might call it like this:

Person me("Marc", "Gregoire", 35);
processPerson(me);

This doesn’t look like there’s any more code than if you wrote the function like this instead:

void processPerson(const Person& p)
{
    // Process the person.
}

The call to the function remains the same. However, consider what happens when you pass-by-value in the first version of the function. In order to initialize the p parameter of processPerson(), me must be copied with a call to its copy constructor. Even though you didn’t write a copy constructor for the Person class, the compiler generates one that copies each of the data members. That still doesn’t look so bad: there are only three data members. However, two of those are strings, which are themselves objects with copy constructors. So, each of their copy constructors will be called as well. The version of processPerson() that takes p by reference incurs no such copying costs. Thus, pass-by-reference in this example avoids three constructor calls when the code enters the function.

And you’re still not done. Remember that p in the first version of processPerson() is a local variable to the processPerson() function, and so must be destroyed when the function exits. This destruction requires a call to the Person destructor, which will call the destructor of all of the data members. strings have destructors, so exiting this function (if you passed by value) incurs calls to three destructors. None of those calls are needed if the Person object is passed by reference.
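To see these constructor and destructor costs for yourself, you can instrument a class to count copies. The following sketch uses a hypothetical CountedPerson class (not part of this book's code) whose copy constructor increments a static counter:

```cpp
#include <string>
#include <utility>

// Hypothetical instrumented class (not from this book's code) that counts
// how many times its copy constructor runs.
class CountedPerson
{
public:
    CountedPerson(std::string first, std::string last)
        : mFirst(std::move(first)), mLast(std::move(last)) {}
    CountedPerson(const CountedPerson& src)
        : mFirst(src.mFirst), mLast(src.mLast) { ++sCopies; }
    static int copies() { return sCopies; }
    static void resetCopies() { sCopies = 0; }
private:
    static int sCopies;
    std::string mFirst, mLast;
};
int CountedPerson::sCopies = 0;

// Pass-by-value: initializing p invokes the copy constructor.
void processByValue(CountedPerson p) { (void)p; }

// Pass-by-const-reference: no copy is made.
void processByReference(const CountedPerson& p) { (void)p; }
```

Calling processByValue() bumps the counter by exactly one (and each std::string member is copied inside that one Person-level copy), while processByReference() leaves it untouched.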

NOTE If a function must modify an object, you can pass the object by reference. If the function should not modify the object, you can pass it by const reference, as in the preceding example. See Chapter 10 for details on references and const.

NOTE Avoid using pass-by-pointer, which is a relatively obsolete method for pass-by-reference, and is a throwback to the C language, rarely suitable for C++ (unless passing nullptr has meaning in your design).

Return-by-Reference

Just as you should pass objects by reference to functions, you should also return them by reference from functions in order to avoid copying the objects unnecessarily. Unfortunately, it is sometimes impossible to return objects by reference, such as when you write overloaded operator+ and other similar operators. You should never return a reference or a pointer to a local object that will be destroyed when the function exits.

Since C++11, the language has support for move semantics, which allows you to efficiently return objects by value, instead of using reference semantics.

Catch Exceptions by Reference

As noted in Chapter 13, you should catch exceptions by reference in order to avoid an extra copy. Throwing exceptions is heavy in terms of performance, so any little thing you can do to improve their efficiency will help.
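As a minimal sketch (using a standard library exception rather than any class from this book), catching by const reference lets you inspect the thrown object without copying it, and without slicing derived exception types down to their base class:

```cpp
#include <stdexcept>
#include <string>

// Sketch: catch by const reference to avoid copying the exception object.
// A derived exception (runtime_error) is caught through its base class
// (exception) without slicing, so what() returns the original message.
std::string describeError()
{
    try {
        throw std::runtime_error("disk full");
    } catch (const std::exception& e) { // by const reference: no copy, no slicing
        return e.what();
    }
}
```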

Use Move Semantics

You should implement a move constructor and move assignment operator for your objects, which allow the C++ compiler to use move semantics. With move semantics for your objects, returning them by value from a function will be efficient without incurring large copying costs. Consult Chapter 10 for details on move semantics.
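As a sketch (DataHolder is a hypothetical class, not from this book), a move constructor and move assignment operator typically steal the source object's resources instead of copying them:

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical class demonstrating move semantics: the move operations
// transfer ownership of the vector's buffer instead of copying each element.
class DataHolder
{
public:
    explicit DataHolder(std::vector<int> data) : mData(std::move(data)) {}

    // Move constructor: steals src's data; src is left empty but valid.
    DataHolder(DataHolder&& src) noexcept : mData(std::move(src.mData)) {}

    // Move assignment operator: same idea, with a self-assignment check.
    DataHolder& operator=(DataHolder&& rhs) noexcept
    {
        if (this != &rhs) { mData = std::move(rhs.mData); }
        return *this;
    }

    std::size_t size() const { return mData.size(); }
private:
    std::vector<int> mData;
};
```

Marking the move operations noexcept matters: standard containers such as vector only use a type's move constructor during reallocation when it promises not to throw.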

Avoid Creating Temporary Objects

The compiler creates temporary, unnamed objects in several circumstances. Chapter 8 explains that after writing a global operator+ for a class, you can add objects of that class to other types, as long as those types can be converted to objects of that class. For example, the SpreadsheetCell class definition looks in part like the following:

class SpreadsheetCell
{
public:
    // Other constructors omitted for brevity
    SpreadsheetCell(double initialValue);
    friend SpreadsheetCell operator+(const SpreadsheetCell& lhs,
        const SpreadsheetCell& rhs);
    // Remainder omitted for brevity
};

The constructor that takes a double allows you to write code like this:

SpreadsheetCell myCell(4), aThirdCell;
aThirdCell = myCell + 5.6;
aThirdCell = myCell + 4;

The first addition line constructs a temporary SpreadsheetCell object from the 5.6 argument; then calls the operator+ with myCell and the temporary object as arguments. The result is stored in aThirdCell. The second addition line does the same thing, except that 4 must be coerced to a double in order to call the double constructor of the SpreadsheetCell.

The important point in this example is that the compiler generates code to create an extra, unnamed SpreadsheetCell object for each addition line. That object must be constructed and destructed with calls to its constructor and destructor. If you’re still skeptical, try inserting cout statements in your constructor and destructor and watching the printout.

In general, the compiler constructs a temporary object whenever your code converts a variable of one type to another type for use in a larger expression. This rule applies mostly to function calls. For example, suppose that you write a function with the following prototype:

void doSomething(const SpreadsheetCell& s);

You can call it like this:

doSomething(5.56);

The compiler constructs a temporary SpreadsheetCell object from 5.56 using the double constructor, which it passes to doSomething(). Note that if you remove the const from the s parameter, you can no longer call doSomething() with a constant; you must pass a variable.

You should generally attempt to avoid cases in which the compiler is forced to construct temporary objects. Although it is impossible to avoid in some situations, you should at least be aware of the existence of this “feature” so you aren’t surprised by performance and profiling results.
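One possible way to avoid such a temporary, sketched here with a hypothetical Cell class standing in for SpreadsheetCell, is to provide an additional operator+ overload that takes the built-in type directly, so no conversion is needed:

```cpp
// Hypothetical Cell class standing in for SpreadsheetCell.
class Cell
{
public:
    Cell(double v) : mValue(v) {} // implicit conversion from double still allowed
    double value() const { return mValue; }
private:
    double mValue;
};

// General version: adding two Cells.
Cell operator+(const Cell& lhs, const Cell& rhs)
{
    return Cell(lhs.value() + rhs.value());
}

// Overload for double: an exact match for expressions like myCell + 5.6,
// so the compiler never constructs a temporary Cell for the right-hand side.
Cell operator+(const Cell& lhs, double rhs)
{
    return Cell(lhs.value() + rhs);
}
```

Overload resolution prefers the double version for both myCell + 5.6 (exact match) and myCell + 4 (a standard int-to-double conversion beats a user-defined conversion to Cell).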

Move semantics are also used by the compiler to make working with temporary objects more efficient. That’s another reason to add move semantics to your classes.

The Return-Value Optimization

A function that returns an object by value can cause the creation of a temporary object. Continuing with the Person example, consider this function:

Person createPerson()
{
    Person newP;
    return newP;
}

Suppose that you call it like this (assuming that operator<< is implemented for the Person class):

cout << createPerson();

Even though this call does not store the result of createPerson() anywhere, the result must be stored somewhere in order to pass it to operator<<. To generate code for this behavior, the compiler is allowed to create a temporary variable in which to store the Person object returned from createPerson().

Even if the result of the function is not used anywhere, the compiler might still generate code to create the temporary object. For example, suppose that you have this code:

createPerson();

The compiler might generate code to create a temporary object for the return value, even though it is not used.

However, you usually don’t need to worry about this issue, because the compiler optimizes away the temporary variable in most cases. This optimization is called the return-value optimization (RVO), and is typically applied when compiler optimizations are enabled, as in release builds.

Note that if the object you want to return from a function supports move semantics, then it is moved out of the function instead of copied.

Use Inline Methods and Functions

As described in Chapter 8, the code for an inline method or function is inserted directly into the code where it is called, avoiding the overhead of a function call. You should mark as inline all functions and methods that you think can qualify for this optimization. However, remember that inlining requests by the programmer are only a recommendation to the compiler, which is allowed to refuse them.

On the other hand, some compilers inline appropriate functions and methods during their optimization steps, even if those functions aren’t marked with the inline keyword. Thus, you should read your compiler documentation before wasting a lot of effort deciding which functions to inline.
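For illustration, a trivial function such as the following is a typical inlining candidate; remember that the keyword remains only a hint, and the inline definition normally belongs in a header so it is visible at every call site:

```cpp
// A trivial function marked inline: the compiler may substitute the body
// x * x directly at each call site, avoiding function-call overhead.
// (The keyword is a request, not a guarantee.)
inline int square(int x)
{
    return x * x;
}
```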

DESIGN-LEVEL EFFICIENCY

The design choices in your program affect its performance far more than do language details such as pass-by-reference. For example, if you choose an algorithm for a fundamental task in your application that runs in O(n²) time instead of a simpler one that runs in O(n) time, you could potentially perform the square of the number of operations that you really need. To put numbers on that, a task that uses an O(n²) algorithm and performs one million operations would perform only one thousand with an O(n) algorithm. Even if that operation is optimized beyond recognition at the language level, the simple fact that you perform one million operations when a better algorithm would use only one thousand will make your program very inefficient. Always choose your algorithms carefully. Refer to Part II, specifically Chapter 4, of this book for a detailed discussion of algorithm design choices and big-O notation.

In addition to your choice of algorithms, design-level efficiency includes specific tips and tricks. Instead of writing your own data structures and algorithms, you should use existing ones, such as the STL or the Boost libraries, as much as possible, because these are written by experts. You should also think about incorporating multithreading in your design; see Chapter 23. The remainder of this section presents two more design techniques for optimizing your program: caching and object pools.

Cache Where Necessary

Caching means storing items for future use in order to avoid retrieving or recalculating them. You might be familiar with the principle from its use in computer hardware. Modern computer processors are built with memory caches that store recently and frequently accessed memory values in a location that is quicker to access than main memory. Most memory locations that are accessed at all are accessed more than once in a short time period, so caching at the hardware level can significantly speed up computations.

Caching in software follows the same approach. If a task or computation is particularly slow, you should make sure that you are not performing it more than necessary. Store the results in memory the first time you perform the task so that they are available for future needs. Here is a list of tasks that are usually slow:

· Disk access: You should avoid opening and reading the same file more than once in your program. If memory is available, save the file contents in RAM if you need to access it frequently.

· Network communication: Whenever you need to communicate over a network, your program is subject to the vagaries of the network load. Treat network accesses like file accesses, and cache as much static information as possible.

· Mathematical computations: If you need the result of a very complex computation in more than one place, perform the calculation once and share the result. However, if it’s not very complex, then it’s probably faster to just calculate it instead of retrieving it from a cache. Use a profiler if you want to make sure.

· Object allocation: If you need to create and use a large number of short-lived objects in your program, consider using an object pool, described later in this chapter.

· Thread creation: This task can also be slow. You can “cache” threads in a thread-pool, similar to caching objects in an object-pool.
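The principle behind caching computations can be sketched with a simple memoization function. Here, expensiveComputation() is a hypothetical stand-in; the technique only pays off when the real computation costs more than the cache lookup:

```cpp
#include <cmath>
#include <map>

// Hypothetical stand-in for a slow computation.
double expensiveComputation(int input)
{
    return std::sqrt(static_cast<double>(input)) * 3.14;
}

// Memoizing wrapper: results are stored the first time they are computed,
// so repeated requests for the same input skip the computation entirely.
double cachedComputation(int input)
{
    static std::map<int, double> cache; // persists across calls
    auto it = cache.find(input);
    if (it != cache.end()) {
        return it->second;              // cache hit: reuse the stored result
    }
    double result = expensiveComputation(input);
    cache[input] = result;              // store for future calls
    return result;
}
```

Note that this sketch sidesteps cache invalidation (discussed next) because the result depends only on the input; caches over mutable data need more care.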

Cache Invalidation

One common problem with caching is that the data you store are often only copies of the underlying information. The original data might change during the lifetime of the cache. For example, you might want to cache the values in a configuration file so that you don’t need to read it repeatedly. However, the user might be allowed to change the configuration file while your program is running, which would make your cached version of the information obsolete. In cases like this, you need a mechanism for cache invalidation: When the underlying data change, you must either stop using your cached information, or repopulate your cache.

One technique for cache invalidation is to request that the entity managing the underlying data notify your program of the data change. It could do this through a callback that your program registers with the manager. Alternatively, your program could poll for certain events that would trigger it to repopulate the cache automatically. Regardless of your specific cache invalidation technique, make sure you think about these issues before relying on a cache in your program.

NOTE Always keep in mind that maintaining caches takes code, memory, and processing time. On top of that, caches can be a source of subtle bugs. You should only add caching to a particular area when a profiler clearly shows that that area is a performance bottleneck. First write clean and correct code, then profile it, and only then optimize parts of it.

Use Object Pools

If your program needs a large number of short-lived objects of the same type that have an expensive constructor (for example, a constructor creating a large, pre-sized vector for storing data), and a profiler confirms that allocating and deallocating these objects is a bottleneck, then you can create a pool, or cache, of those objects. Whenever you need an object in your code, you ask the pool for one. When you are done with the object, you return it to the pool. The object pool creates the objects only once, so their constructor is called only once, not each time they are used. Thus, object pools are appropriate when the constructor performs some setup actions that apply to many uses of the object, and when you can set instance-specific parameters on the object through non-constructor method calls.

An Object Pool Implementation

This section provides an implementation of a pool class template that you can use in your programs. The pool allocates a chunk of objects of the specified class when it is constructed, and hands them out via the acquireObject() method. If acquireObject() is called but there are no free objects, the pool allocates another chunk of objects. acquireObject() returns an Object, which is a std::shared_ptr with a custom deleter. The custom deleter doesn't actually delete the memory, but simply puts the object back on the list of free objects.

The most difficult aspect of an object pool implementation is keeping track of which objects are free and which are in use. This implementation takes the approach of storing free objects in a queue. Each time a client requests an object, the pool gives that client the top object from the queue.

The code uses the std::queue class from the Standard Template Library (STL), discussed in Chapter 16. Memory for allocated objects is handled by std::unique_ptr and std::shared_ptr smart pointers.

Here is the class definition, with comments that explain the details. Note that the template is parameterized on the class type from which the objects in the pool are to be constructed.

#include <cstddef>
#include <queue>
#include <stdexcept>
#include <memory>

// Provides an object pool that can be used with any class that provides a
// default constructor.
//
// The object pool constructor creates a pool of objects, which it hands out
// to clients when requested via the acquireObject() method. acquireObject()
// returns an Object which is a std::shared_ptr with a custom deleter that
// automatically puts the object back into the object pool when the shared_ptr
// is destroyed and its reference count reaches 0.
//
// The constructor and destructor on each object in the pool will be called only
// once each for the lifetime of the program, not once per acquisition and release.
//
// The primary use of an object pool is to avoid creating and deleting objects
// repeatedly. The object pool is most suited to applications that use large
// numbers of objects with expensive constructors for short periods of time, if
// a profiler tells you that allocating and deallocating these objects is a
// bottleneck.
template <typename T>
class ObjectPool
{
public:
    // Creates an object pool with chunkSize objects.
    // Whenever the object pool runs out of objects, chunkSize
    // more objects will be added to the pool. The pool only grows:
    // objects are never removed from the pool (freed), until
    // the pool is destroyed.
    //
    // Throws invalid_argument if chunkSize is 0.
    // Throws bad_alloc if allocation fails.
    ObjectPool(size_t chunkSize = kDefaultChunkSize);

    // Prevent assignment and pass-by-value
    ObjectPool(const ObjectPool<T>& src) = delete;
    ObjectPool<T>& operator=(const ObjectPool<T>& rhs) = delete;

    // The type of smart pointer returned by acquireObject().
    using Object = std::shared_ptr<T>;

    // Reserves an object for use.
    Object acquireObject();

private:
    // mFreeList stores the objects that are not currently in use by clients.
    std::queue<std::unique_ptr<T>> mFreeList;
    size_t mChunkSize;
    static const size_t kDefaultChunkSize = 10;

    // Allocates mChunkSize new objects and adds them to mFreeList.
    void allocateChunk();
};

Note that the user of the object pool specifies through the template parameter the name of the class from which objects can be created, and through the constructor the allocation “chunk size.” This “chunk size” controls the number of objects created at one time. Here is the code that defines the kDefaultChunkSize:

template <typename T>
const size_t ObjectPool<T>::kDefaultChunkSize;

The default of 10, given in the class definition, is probably too small for most uses. If your program requires thousands of objects at once, you should use a larger, more appropriate value.

The constructor validates the chunkSize parameter, and calls the allocateChunk() helper method to obtain a starting allocation of objects:

template <typename T>
ObjectPool<T>::ObjectPool(size_t chunkSize)
{
    if (chunkSize == 0) {
        throw std::invalid_argument("chunk size must be positive");
    }
    mChunkSize = chunkSize;
    // Create mChunkSize objects to start.
    allocateChunk();
}

The allocateChunk() method allocates mChunkSize elements. It stores a unique_ptr to each object in the queue:

// Allocates mChunkSize new objects.
template <typename T>
void ObjectPool<T>::allocateChunk()
{
    for (size_t i = 0; i < mChunkSize; ++i) {
        mFreeList.emplace(std::make_unique<T>());
    }
}

acquireObject() returns the top object from the free list, first calling allocateChunk() if there are no free objects:

template <typename T>
typename ObjectPool<T>::Object ObjectPool<T>::acquireObject()
{
    if (mFreeList.empty()) {
        allocateChunk();
    }
    // Move next free object from the queue to a local unique_ptr.
    std::unique_ptr<T> obj(std::move(mFreeList.front()));
    mFreeList.pop();
    // Convert the object pointer to an Object (a shared_ptr with
    // a custom deleter).
    Object smartObject(obj.release(), [this](T* t) {
        // The custom deleter doesn't actually deallocate the
        // memory, but simply puts the object back on the free list.
        mFreeList.push(std::unique_ptr<T>(t));
    });
    // Return the Object.
    return smartObject;
}

Using the Object Pool

Consider an application that is structured around obtaining requests for actions from users and processing those requests. This application would most likely be the middleware between a graphical front-end and a back-end database. For example, it could be part of an airline reservation system or an online banking application. You might want to encode each user request in an object, with a class that looks something like this:

class UserRequest
{
public:
    UserRequest() {}
    virtual ~UserRequest() {}
    // Methods to populate the request with specific information.
    // Methods to retrieve the request data.
    // (not shown)
private:
    // Data members (not shown)
};

Instead of creating and deleting large numbers of requests throughout the lifetime of your program, you could use an object pool. Your program structure could then look as follows:

ObjectPool<UserRequest>::Object obtainUserRequest(ObjectPool<UserRequest>& pool)
{
    // Obtain a UserRequest object from the pool.
    auto request = pool.acquireObject();
    // Populate the request with user input. (not shown)
    return request;
}

void processUserRequest(ObjectPool<UserRequest>::Object& req)
{
    // Process the request. (not shown)
    // Return the request to the pool.
    req.reset();
}

int main()
{
    ObjectPool<UserRequest> requestPool(10);
    for (size_t i = 0; i < 100; ++i) {
        auto req = obtainUserRequest(requestPool);
        processUserRequest(req);
    }
    return 0;
}

PROFILING

It is good to think about efficiency as you design and code. There is no point in writing obviously inefficient programs if it can be avoided with some common sense, or experience-based intuitions. However, I urge you not to get too obsessed with performance during the design and coding phases. It’s best to first make a clean, well-structured design and implementation, then use a profiler, and only optimize parts that are flagged by the profiler as being performance bottlenecks. Remember, Chapter 4 introduces the “90/10” rule: 90 percent of the running time of most programs is spent in only 10 percent of the code (Hennessy and Patterson, Computer Architecture, A Quantitative Approach, Fourth Edition, [Morgan Kaufmann, 2006]). This means that you could optimize 90 percent of your code out of existence, but still only improve the running time of the program by 10 percent. Obviously, you want to optimize the parts of the code that are exercised the most for the specific workload that you expect the program to run.

Consequently, it is often helpful to profile your program to determine which parts of the code require optimization. There are many profiling tools available that analyze programs as they run in order to generate data about their performance. Most profiling tools provide analysis at the function level by specifying the amount of time (or percent of total execution time) spent in each function in the program. After running a profiler on your program, you can usually tell immediately which parts of the program need optimization. Profiling before and after optimizing is also useful to prove that your optimizations had an effect.

If you are using Microsoft Visual C++ 2013, you already have a great built-in profiler, which is discussed later in this chapter. Another great profiling tool is Rational PurifyPlus from IBM. Both of these products require license fees, but you should check if your company or academic institution has a site license for their use. If the license restriction is prohibitive, there are several free profiling tools. Very Sleepy and Luke Stackwalker are popular profilers for Windows, Valgrind and gprof (GNU profiler) are well-known profilers for Unix/Linux systems, and there are plenty of other choices.

Profiling Example with gprof

The power of profiling can best be seen with a real coding example. As a disclaimer, the performance bugs in the first attempt shown are not subtle. Real efficiency issues would probably be more complex, but a program long enough to demonstrate them would be too lengthy for this book.

Suppose that you work for the United States Social Security Administration. Every year the administration puts up a website that allows users to look up the popularity of new baby names from the previous year. Your job is to write the back-end program that looks up names for users. Your input is a file containing the name of every new baby. This file will obviously contain redundant names. For example, in the file for boys for 2003, the name Jacob was the most popular, showing up 29,195 times. Your program must read the file to construct an in-memory database. A user may then request the absolute number of babies with a given name, or the rank of that name among all the babies.

First Design Attempt

A logical design for this program consists of a NameDB class with the following public methods:

#include <string>

class NameDB
{
public:
    // Reads the list of baby names in nameFile to populate the database.
    // Throws invalid_argument if nameFile cannot be opened or read.
    NameDB(const std::string& nameFile);

    // Returns the rank of the name (1st, 2nd, etc.).
    // Returns -1 if the name is not found.
    int getNameRank(const std::string& name) const;

    // Returns the number of babies with this name.
    // Returns -1 if the name is not found.
    int getAbsoluteNumber(const std::string& name) const;

    // Protected and private members and methods not shown
};

The hard part is choosing a good data structure for the in-memory database. A first attempt might be an array, or a vector from the STL, of name/count pairs. Each entry in the vector would store one of the names, along with a count of the number of times that name shows up in the raw data file. Here is the complete class definition with such a design:

#include <string>
#include <vector>
#include <utility>

class NameDB
{
public:
    NameDB(const std::string& nameFile);
    int getNameRank(const std::string& name) const;
    int getAbsoluteNumber(const std::string& name) const;

private:
    std::vector<std::pair<std::string, int>> mNames;

    // Helper methods
    bool nameExists(const std::string& name) const;
    void incrementNameCount(const std::string& name);
    void addNewName(const std::string& name);
};

Note the use of the STL vector and pair, discussed in Chapter 16. A pair is a utility class that combines two variables of different types.

Here are the implementations of the constructor and the helper methods nameExists(), incrementNameCount(), and addNewName(). The loops in nameExists() and incrementNameCount() iterate over all the elements of the vector.

// Reads the names from the file and populates the database.
// The database is a vector of name/count pairs, storing the
// number of times each name shows up in the raw data.
NameDB::NameDB(const string& nameFile)
{
    // Open the file and check for errors.
    ifstream inFile(nameFile.c_str());
    if (!inFile) {
        throw invalid_argument("Unable to open file");
    }

    // Read the names one at a time.
    string name;
    while (inFile >> name) {
        // Look up the name in the database so far.
        if (nameExists(name)) {
            // If the name exists in the database, just increment the count.
            incrementNameCount(name);
        } else {
            // If the name doesn't yet exist, add it with a count of 1.
            addNewName(name);
        }
    }
    inFile.close();
}

// Returns true if the name exists in the database, false otherwise.
bool NameDB::nameExists(const string& name) const
{
    // Iterate through the vector of names looking for the name.
    for (auto& entry : mNames) {
        if (entry.first == name) {
            return true;
        }
    }
    return false;
}

// Precondition: name exists in the vector of names.
// Postcondition: the count associated with name is incremented.
void NameDB::incrementNameCount(const string& name)
{
    for (auto& entry : mNames) {
        if (entry.first == name) {
            entry.second++;
            return;
        }
    }
}

// Adds a new name to the database.
void NameDB::addNewName(const string& name)
{
    mNames.push_back(make_pair(name, 1));
}

Note that in the preceding example, you could use an algorithm like find_if() to accomplish the same thing as the loops in nameExists() and incrementNameCount(). The loops are shown explicitly in order to emphasize the performance problems.

The savvy reader might notice some performance problems already. What if there are hundreds of thousands of names? The many linear searches involved in populating the database might become slow.

In order to complete the example, here are the implementations of the two public methods:

// Returns the rank of the name.
// First looks up the name to obtain the number of babies with that name.
// Then iterates through all the names, counting all the names with a higher
// count than the specified name. Returns that count as the rank.
int NameDB::getNameRank(const string& name) const
{
    // Make use of the getAbsoluteNumber() method.
    int num = getAbsoluteNumber(name);

    // Check if we found the name.
    if (num == -1) {
        return -1;
    }

    // Now count all the names in the vector that have a
    // count higher than this one. If no name has a higher count,
    // this name is rank number 1. Every name with a higher count
    // decreases the rank of this name by 1.
    int rank = 1;
    for (auto& entry : mNames) {
        if (entry.second > num) {
            rank++;
        }
    }
    return rank;
}

// Returns the count associated with this name.
int NameDB::getAbsoluteNumber(const string& name) const
{
    for (auto& entry : mNames) {
        if (entry.first == name) {
            return entry.second;
        }
    }
    return -1;
}

Profile of the First Attempt

In order to test the program, you need a main() function:

#include "NameDB.h"
#include <iostream>
using namespace std;

int main()
{
    NameDB boys("boys_long.txt");
    cout << boys.getNameRank("Daniel") << endl;
    cout << boys.getNameRank("Jacob") << endl;
    cout << boys.getNameRank("William") << endl;
    return 0;
}

This main() function creates one NameDB database called boys, telling it to populate itself with the file boys_long.txt, which contains 500,500 names.

There are three steps to using gprof:

1. Compile your program with a special flag that causes it to log raw execution information the next time it is run. When using GCC as your compiler, the flag is -pg, for example:

> gcc -lstdc++ -std=c++11 -pg -o namedb NameDB.cpp NameDBTest.cpp

2. Next, run your program. This run should generate a file called gmon.out in the working directory. Be patient when you run the program because this first version is slow.

3. The final step is to run the gprof command in order to analyze the gmon.out profiling information and produce a (somewhat) readable report. gprof outputs to standard out, so you should redirect the output to a file:

> gprof namedb gmon.out > gprof_analysis.out

Now you can analyze the data. Unfortunately, the output file is somewhat cryptic and intimidating, and it takes a little while to learn how to interpret it. gprof provides two separate sets of information. The first set summarizes the amount of time spent executing each function in the program. The second, more useful, set summarizes the amount of time spent executing each function and its descendants; it is also called a call graph. Here is some of the output from the gprof_analysis.out file, edited to make it more readable. Note that the numbers will be different on your machine.

index %time self children called name
[1]   100.0 0.00 14.06           main [1]
            0.00 14.00  1/1          NameDB::NameDB [2]
            0.00  0.04  3/3          NameDB::getNameRank [25]
            0.00  0.01  1/1          NameDB::~NameDB [28]

The following list explains the different columns:

· index: an index to be able to refer to this entry in the call graph.

· %time: the percentage of the total execution time of the program required by this function and its descendants.

· self: how many seconds the function itself was executing.

· children: how many seconds the descendants of this function were executing.

· called: how often this function was called.

· name: the name of the function. If the name of the function is followed by a number between square brackets, that number refers to another index in the call graph.

The preceding extract tells us that main() and its descendants took 100 percent of the total execution time of the program: 14.06 seconds. The second line shows that the NameDB constructor took 14.00 of those 14.06 seconds, so it’s immediately clear where the performance issue is situated. To track down which part of the constructor is taking so long, jump to the call graph entry with index 2; that’s the index in square brackets after the name in the last column. The call graph entry with index 2 is as follows on my test system:

[2]   99.6  0.02 14.00  1             NameDB::NameDB [2]
            1.20  6.14  500500/500500     NameDB::nameExists [3]
            1.24  5.24  499500/499500     NameDB::incrementNameCount [4]
            0.00  0.18  1000/1000         NameDB::addNewName [19]
            0.00  0.00  1/1               vector::vector [69]

The nested entries below NameDB::NameDB show which of its descendants took the most time. Here you can see that nameExists() took 6.14 seconds, and incrementNameCount() took 5.24 seconds. Remember that these times are the sums of all the calls to each function. The fourth column shows the number of calls (500,500 to nameExists() and 499,500 to incrementNameCount()). No other function took a significant amount of time.

Without going any further in this analysis, two things should jump out at you:

1. 14 seconds to populate the database of approximately 500,000 names is slow. Perhaps you need a better data structure.

2. nameExists() and incrementNameCount() take almost identical time, and are called almost the same number of times. If you think about the application domain, that makes sense: Most names in the input text file are duplicates, so the vast majority of the calls to nameExists() are followed by a call to incrementNameCount(). If you look back at the code, you can see that these functions are almost identical; they could probably be combined. In addition, most of what they are doing is searching the vector. It would probably be better to use a sorted data structure to reduce the searching time.

Second Attempt

With these two observations from the gprof output, it’s time to redesign the program. The new design uses a map instead of a vector. Chapter 16 explains that the STL map keeps its entries sorted and provides O(log n) lookups instead of the O(n) searches in a vector. A good exercise for the reader is to use a std::unordered_map, which has expected O(1) lookups, and to profile whether it is faster than std::map for this application.

The new version of the program also combines nameExists() and incrementNameCount() into one nameExistsAndIncrement() method.

Here is the new class definition:

#include <string>
#include <map>

class NameDB
{
public:
    NameDB(const std::string& nameFile);
    int getNameRank(const std::string& name) const;
    int getAbsoluteNumber(const std::string& name) const;

private:
    std::map<std::string, int> mNames;

    bool nameExistsAndIncrement(const std::string& name);
    void addNewName(const std::string& name);
};

Here are the new method implementations:

// Reads the names from the file and populates the database.
// The database is a map associating names with their frequency.
NameDB::NameDB(const string& nameFile)
{
    // Open the file and check for errors.
    ifstream inFile(nameFile.c_str());
    if (!inFile) {
        throw invalid_argument("Unable to open file");
    }

    // Read the names one at a time.
    string name;
    while (inFile >> name) {
        // Look up the name in the database so far.
        if (!nameExistsAndIncrement(name)) {
            // If the name exists in the database, the
            // method incremented it, so we just continue.
            // We get here if it didn't exist, in which case
            // we add it with a count of 1.
            addNewName(name);
        }
    }
    inFile.close();
}

// Returns true if the name exists in the database, false
// otherwise. If it finds it, it increments it.
bool NameDB::nameExistsAndIncrement(const string& name)
{
    // Find the name in the map.
    auto res = mNames.find(name);
    if (res != end(mNames)) {
        res->second++;
        return true;
    }
    return false;
}

// Adds a new name to the database.
void NameDB::addNewName(const string& name)
{
    mNames[name] = 1;
}

// Returns the rank of the name.
// First looks up the name to obtain the number of babies with that name.
// Then iterates through all the names, counting all the names with a higher
// count than the specified name. Returns that count as the rank.
int NameDB::getNameRank(const string& name) const
{
    int num = getAbsoluteNumber(name);

    // Check if we found the name.
    if (num == -1) {
        return -1;
    }

    // Now count all the names in the map that have a
    // count higher than this one. If no name has a higher count,
    // this name is rank number 1. Every name with a higher count
    // decreases the rank of this name by 1.
    int rank = 1;
    for (auto& entry : mNames) {
        if (entry.second > num) {
            rank++;
        }
    }
    return rank;
}

// Returns the count associated with this name.
int NameDB::getAbsoluteNumber(const string& name) const
{
    auto res = mNames.find(name);
    if (res != end(mNames)) {
        return res->second;
    }
    return -1;
}

Profile of the Second Attempt

By following the same steps shown earlier, you can obtain the gprof performance data on the new version of the program. The data are quite encouraging:

index %time self children called name
[1]   100.0 0.00  0.21           main [1]
            0.02  0.18  1/1          NameDB::NameDB [2]
            0.00  0.01  1/1          NameDB::~NameDB [13]
            0.00  0.00  3/3          NameDB::getNameRank [28]
[2]    95.2 0.02  0.18  1        NameDB::NameDB [2]
            0.02  0.16  500500/500500  NameDB::nameExistsAndIncrement [3]
            0.00  0.00  1000/1000      NameDB::addNewName [24]
            0.00  0.00  1/1            map::map [87]

If you run this on your machine, the output will be different. It’s even possible that you will not see the data for NameDB methods in your output. Because of the efficiency of this second attempt, the timings are getting so small that you might see more map methods in the output than NameDB methods.

On my test system, main() now takes only 0.21 seconds: a 67-fold improvement! There are certainly further improvements you could make to this program. For example, the current constructor performs a lookup to see whether the name is already in the map and, if not, adds it with a count of 1. You can combine these two operations by always calling the map’s insert() method without first checking whether the name exists. insert() returns a pair<iterator, bool>. The Boolean is true if insert() added the name to the map, and false if the name was already present. When the name is already in the map, you increment its count, which you can reach through the iterator in the returned pair. The iterator refers to a map entry, so to access the count, you use res.first->second. To implement this improvement, remove the nameExistsAndIncrement() and addNewName() methods, and change the constructor as follows:

NameDB::NameDB(const string& nameFile)
{
    // Open the file and check for errors.
    ifstream inFile(nameFile.c_str());
    if (!inFile) {
        throw invalid_argument("Unable to open file");
    }

    // Read the names one at a time.
    string name;
    while (inFile >> name) {
        auto res = mNames.insert(make_pair(name, 1));
        if (res.second == false) {
            res.first->second++; // Already in the map, so increment the count.
        }
    }
    inFile.close();
}

getNameRank() still uses a loop that iterates over all elements in the map. A good exercise for the reader is to come up with another data structure so that the linear iteration in getNameRank() can be avoided.

Profiling Example with Visual C++ 2013

Most editions of Microsoft Visual C++ 2013 come with a great built-in profiler, which is briefly discussed in this section. The VC++ profiler has a complete graphical user interface. This book does not recommend one profiler over another, but it is instructive to compare what a command-line profiler like gprof provides with a GUI-based profiler like the one included with VC++.

Profile of the First Design Attempt

To start profiling an application in Visual C++ 2013, first open the project in Visual Studio. This example uses the same NameDB code as the first, inefficient design attempt shown earlier; the code is not repeated here. Once your project is open in Visual Studio, click the “Analyze” menu and choose “Performance and Diagnostics.” A new window appears. Figure 25-1 shows what this window might look like.

image

FIGURE 25-1

In this new window, enable the “Performance Wizard” option and click the “Start” button. This will start a wizard. The first page of this wizard is shown in Figure 25-2.

image

FIGURE 25-2

Depending on the version of VC++ 2013, there are a number of different profiling methods available. The following non-exhaustive list explains three of them:

· CPU Sampling: Used to monitor applications with low overhead. This means that the act of profiling the application will not have a big performance impact on the target application.

· Instrumentation: This method will add extra code to the application to be able to accurately count the number of function calls and to time individual function calls. However, this method has a much bigger performance impact on the application. It is recommended to use the CPU Sampling method first to get an idea about the bottlenecks in your application. If this method does not give you enough information, you can try the Instrumentation method.

· Concurrency: This allows you to monitor multithreaded applications. It allows you to graphically see which threads are running, which threads are waiting for something, and so on.

For this profiling example, leave the default CPU Sampling method selected and click the “Next” button. The next page of the wizard asks you to select the application that you want to profile. Select your NameDB project and click the “Next” button. On the last page of the wizard, enable “Launch profiling after the wizard finishes” and click the “Finish” button. You might get a message saying that you don’t have the right credentials for profiling and asking whether you would like to upgrade them. If you get this message, allow VC++ to upgrade your credentials; otherwise, the profiler will not work.

When the program execution is finished, Visual Studio automatically opens the profiling report. Figure 25-3 shows how this report might look when profiling the first attempt of the NameDB application.

image

FIGURE 25-3

From this report you can immediately see the hot path. Just as with gprof, it shows that the NameDB constructor takes up most of the running time of the program. The Visual Studio profiling report is interactive. For example, you can drill down into the NameDB constructor by clicking on it. This results in a drill-down report for that function, which looks like Figure 25-4.

image

FIGURE 25-4

This drill-down view shows a graphical breakdown at the top and the actual code of the method at the bottom. The code view shows the percentage of the running time spent on each line; the line using the most time is shown in red. When you are interactively browsing the profiling report, you can always go back by using the back arrow at the top left of the report.

At the top of the report there is also a drop-down menu, which you can use to quickly jump to certain summary or details pages. If you go back to the “Summary” of the profiling report, you can see that there is a “Show Trimmed Call Tree” link on the right. Clicking that link displays a trimmed call tree showing you an alternative view of the hot path in your code. This is shown in Figure 25-5.

image

FIGURE 25-5

Also in this view, you immediately see that main() is calling the NameDB constructor, which is using up most of the time.

SUMMARY

This chapter discussed the key aspects of efficiency and performance in C++ programs, and provided several specific tips and techniques for designing and writing more efficient applications. Hopefully you gained an appreciation for the importance of performance and for the power of profiling tools.

There are two important things to remember from this chapter. First, do not get too obsessed with performance while designing and coding. It’s best to first make a correct, well-structured design and implementation, then use a profiler, and only optimize the parts that the profiler flags as performance bottlenecks. Second, design-level efficiency is far more important than language-level efficiency. For example, don’t use algorithms or data structures with bad complexity when better ones are available.