Optimized C++ (2016)

Chapter 11. Optimize I/O

A program is a spell cast over a computer, turning input into error messages.

Anonymous

This chapter treats efficient use of the C++ streaming I/O functions for the commonly occurring example of reading and writing text data. Reading and writing data are activities so commonplace that developers don’t even notice them—yet these activities are time-consuming.

The rotation of a disk platter is as ponderous and slow to today’s ultra-fast computer chips as the rotation of the Earth is to us. The read heads have inertia that must be overcome to move their bulk from track to track. These physical properties fiercely resist attempts at improving hardware performance. In the internetworked world of limited data rates and busy servers, response latency may be measured in seconds rather than milliseconds. Even the speed of light becomes a significant factor when streaming data from distant computers.

Another problem with I/O is that there is a lot of code between the user’s program and the spinning platters of the disk or the network interface card. The cost of all this code must be managed to make I/O as efficient as possible.

A Recipe for Reading Files

There are a number of formulas on the Web for reading and writing files, and even some that claim to be fast. They range in performance over a whole order of magnitude in my tests. Readers who are already smart enough not to use an unvetted recipe from the Internet (such as the legendarily dangerous and incorrect recipes for making explosives or cooking meth) might be wise to add to that list recipes for reading a file in C++.

OPTIMIZATION WAR STORY

My sister-in-law Marcia liked to bake pies, but she was never satisfied with the recipes she could find for pie crust. She could not get her crusts to come out as light and flaky as the best pie crusts she had tasted. Although Marcia was not a software developer, she was a pretty good optimizer, so this is what she did.

She collected a number of recipes for pie crust, all of which claimed to be “the best pie crust ever.” She noted that all of them had ingredients in common: flour, salt, water, shortening. Then she looked at the differences. Some used butter or lard instead of vegetable shortening. Some called specifically for cold ingredients or to cool the dough before rolling it out. Some added a little sugar, a tablespoon of vinegar, or an egg. Marcia took the distinctive advice from several recipes and performed tasty experiments over several months. As a result of her patient experimentation, she settled upon a combination of the advice from several recipes. The resulting pie crusts were amazing.

As with my sister-in-law’s optimization effort, I discovered that there are several techniques for improving the performance of file-reading functions. Many of these techniques can be combined. I also discovered some techniques that had little value.

Example 11-1 is a simple function to read a text file into a string. I have encountered code like this over and over, often as a prelude to parsing the string as an XML or JSON block.

Example 11-1. Original file_reader() function

std::string file_reader(char const* fname) {

std::ifstream f;

f.open(fname);

if (!f) {

std::cout << "Can't open " << fname

<< " for reading" << std::endl;

return "";

}

std::stringstream s;

std::copy(std::istreambuf_iterator<char>(f.rdbuf()),

std::istreambuf_iterator<char>(),

std::ostreambuf_iterator<char>(s) );

return s.str();

}

The fname argument contains a filename. If the file cannot be opened, file_reader() prints an error message to standard output, and returns the empty string. Otherwise, std::copy() copies f’s streambuf into std::stringstream s’s streambuf.

Create a Parsimonious Function Signature

From a library design standpoint, file_reader() could be improved (see “Parsimony Is a Virtue in Library Design”). It does several different things: it opens a file; it performs error handling (in the form of reporting failure to open the file); it reads the open, valid stream into a string. As a library function, this combination of responsibilities makes file_reader() hard to use. For instance, if the client program implements exception handling, or wants to use a Windows resource for error message strings, or even just wants to print to std::cerr, file_reader() can’t be used. file_reader() also creates new memory and returns it, a pattern that potentially causes multiple copies to be made as the returned value is passed up a calling chain (see “Copy Free Libraries”). When the file cannot be opened, file_reader() returns the empty string. It also returns the empty string if the file is readable, but doesn’t contain any characters. It would be nice to get an error indication so these cases could be distinguished.

Example 11-2 is an updated version of file_reader() that separates the concerns of opening files and reading streams, and has a signature that users won’t immediately want to change.

Example 11-2. stream_read_streambuf_stringstream(), now with parsimony!

void stream_read_streambuf_stringstream(

std::istream& f,

std::string& result) {

std::stringstream s;

std::copy(std::istreambuf_iterator<char>(f.rdbuf()),

std::istreambuf_iterator<char>(),

std::ostreambuf_iterator<char>(s) );

std::swap(result, s.str());

}

The last line of stream_read_streambuf_stringstream() exchanges the dynamic storage of result with that of s.str(). I might instead have assigned s.str() to result, but that would have caused an allocation and copy unless the compiler and string library implementation both provided move semantics. std::swap() is specialized for many standard library classes to call a swap() member function. The member function in turn swaps pointers, which is far less expensive than an allocate-and-copy operation.

As a bonus for being parsimonious, f becomes a std::istream, rather than a std::ifstream. This function could work on other kinds of streams too, like std::stringstream, for instance. The function is also shorter and easier to read, containing just one conceptual nugget of computation.

stream_read_streambuf_stringstream() may be called with this client code:

std::string s;

std::ifstream f;

f.open(fname);

if (!f) {

std::cerr << "Can't open " << fname

<< " for reading" << std::endl;

}

else {

stream_read_streambuf_stringstream(f, s);

}

Notice that the client is now responsible for opening the file and for reporting errors in a manner appropriate to the client, while the stream-reading “magic” is still in stream_read_streambuf_stringstream()—only this version of the function isn’t very magical.

I performed an experiment that read a 10,000-line file 100 times. This experiment exercises much of the machinery in C++ standard I/O. Because of the repetition, the operating system almost certainly caches the file contents. In real-world conditions, which are difficult to simulate in a test loop, this function probably takes longer than the measured 1,548 milliseconds in both Visual Studio 2010 and 2015.

Because this program reads the disk, and because there is only one disk on my computer, the test run was more sensitive to other disk activity on the PC than other tests I have run. I had to close nonessential programs and stop nonessential services. Even then, I had to make several test runs and report the smallest number I got.

stream_read_streambuf_stringstream() makes use of standard library idioms, but isn’t particularly efficient. It uses character iterators to copy a single character at a time. It is reasonable to hypothesize that each character fetched operates a substantial amount of machinery instd::istream, and perhaps even in the host operating system’s file I/O API. It is equally reasonable to imagine that a std::string in the std::stringstream is extended by one character at a time, resulting in a significant number of calls into the memory allocator.

There are several variations of the copy-the-stream-iterator idea. Example 11-3 uses std::string::assign() to copy an iterator from the input stream into a std::string.

Example 11-3. Another copy-the-stream-iterator file reader

void stream_read_streambuf_string(

std::istream& f,

std::string& result) {

result.assign(std::istreambuf_iterator<char>(f.rdbuf()),

std::istreambuf_iterator<char>());

}

This code ran my test in 1,510 milliseconds when compiled with Visual Studio 2010, and 1,787 milliseconds using Visual Studio 2015.

Shorten Calling Chains

Example 11-4 is another character-at-a-time variation. std::istream has an operator<<() that takes a streambuf as an argument. Perhaps operator<<() bypasses the istream API to call directly into the streambuf.

Example 11-4. Append stream to a stringstream, a character at a time

void stream_read_streambuf(std::istream& f, std::string& result) {

std::stringstream s;

s << f.rdbuf();

std::swap(result, s.str());

}

This code ran my test in 1,294 milliseconds on Visual Studio 2010 and 1,181 milliseconds on Visual Studio 2015—17% and 51% faster, respectively, than stream_read_streambuf_string(). It is my hypothesis that the underlying code just runs a smaller amount of machinery.

Reduce Reallocation

stream_read_streambuf_string() contains seeds of hope for optimization. Although there is no obvious way to optimize the iterate-the-streambuf idiom, the std::string into which the file is being read can be preallocated by a call to reserve() to prevent costly reallocation as the string grows longer, character by character. Example 11-5 tests whether this idea improves performance.

Example 11-5. stream_read_string_reserve(): Preallocating storage for the result

void stream_read_string_reserve(std::istream& f,

std::string& result)

{

f.seekg(0,std::istream::end);

std::streamoff len = f.tellg();

f.seekg(0);

if (len > 0)

result.reserve(static_cast<std::string::size_type>(len));

result.assign(std::istreambuf_iterator<char>(f.rdbuf()),

std::istreambuf_iterator<char>());

}

stream_read_string_reserve() finds how long the stream is by positioning its stream pointer at the end of the stream, reading the offset, and then resetting the stream pointer to the beginning. istream::tellg() actually returns a small struct that gives the position, including an offset for a partially read UTF-8 multibyte character. Fortunately, this struct implements a conversion to a signed integral type. The type must be signed because tellg() can fail, returning an offset of -1 if, for instance, the stream has not been opened, or if there has been an error, or if end-of-file has been reached. If tellg() returns an offset instead of -1, that value can be used as a hint to std::string::reserve() to preallocate sufficient storage for the whole file, in a manner similar to remove_ctrl_reserve() in Example 4-3.

stream_read_string_reserve() tests the hypothesis that the cost of seeking the file pointer twice is less than the cost of the allocations eliminated by the call to reserve(). This is not a foregone conclusion. If seeking to the end of the file causes all the disk sectors in the file to be read, there is a substantial cost. On the other hand, having read these disk sectors, the operating system may have them in cache, reducing other costs. Maybe it depends on the file size. C++ might be able to seek the file pointer to end-of-file without reading anything but the directory entry for the file, or there may be an operating system–dependent function for doing this (perhaps more than one).

When the conjectures start piling up like this, the experienced optimizer recognizes that the cost of finding the answers to all these questions is high, and the benefit unknown. An experiment will quickly tell if the C++ idiom of seeking to end-of-file to get the file size is helpful. On Windows, stream_read_string_reserve() performed no better on the test than stream_read_streambuf_string(). However, this may only mean that the degree of improvement was not noticeable compared to other inefficiencies in this method of reading the file.

Spoiler alert: the technique of finding the length of the stream and preallocating storage is a useful tool. It is good library design to factor out reusable tools into their own functions. A function stream_size() implementing this technique is in Example 11-6.

Example 11-6. stream_size(): compute length of a stream

std::streamoff stream_size(std::istream& f) {

std::istream::pos_type current_pos = f.tellg();

if (-1 == current_pos)

return -1;

f.seekg(0,std::istream::end);

std::istream::pos_type end_pos = f.tellg();

f.seekg(current_pos);

return end_pos - current_pos;

}

A stream-reading function may be called with the stream already partially consumed. stream_size() accounts for this possibility by saving the current position, seeking to the end of the stream, and computing the difference between the current position and the end. The stream pointer is restored to the saved position before the function ends. This is actually more correct than the simplified computation done in stream_read_string_reserve(). It’s another example of how good library design makes functions more flexible and general.

Example 11-7 is a version of stream_read_string() with the file size estimation performed outside the function. This allows the developer to provide an estimate when called on streams that cannot determine their size. It defaults to the behavior of stream_read_string() when no estimate is available.

Example 11-7. stream_read_string_2(): general-purpose version of stream_read_string()

void stream_read_string_2(std::istream& f,

std::string& result,

std::streamoff len = 0)

{

if (len > 0)

result.reserve(static_cast<std::string::size_type>(len));

result.assign(std::istreambuf_iterator<char>(f.rdbuf()),

std::istreambuf_iterator<char>());

}

stream_read_string_2() ran the test in 1,566 milliseconds on VS2010 and 1,766 milliseconds on VS2015. The cost of the extra function call to stream_size(), while presumably nonzero, was not observable in my test. On the other hand, stream_read_string_2() was not observably faster than stream_read_string(). Did this technique fail? We’ll see later.

Take Bigger Bites—Use a Bigger Input Buffer

C++ streams contain a class derived from std::streambuf that improves performance in the case of file reading by reading data from the underlying operating system in bigger chunks. The data goes into a buffer inside the streambuf, from which it is extracted byte-by-byte by the iterator-based input methods discussed previously. Some articles on the Internet suggest that increasing the size of this input buffer will improve performance. Example 11-8 is a simple way to do this.

Example 11-8. Increasing the size of std::streambuf’s internal buffer

std::ifstream in8k;

in8k.open(filename);

char buf[8192];

in8k.rdbuf()->pubsetbuf(buf, sizeof(buf));

Many Internet-dwellers report having difficulty using pubsetbuf(). It must be called after the stream is opened and before any characters are read from the stream. The call fails if any status bits (failbit, eofbit) are set in the stream. The buffer must remain valid until the stream is closed. I obtained a small but measurable 20–50 millisecond improvement in test run time on most of the input functions in this section by setting a larger buffer in the streambuf. Values over 8 KB had little additional effect on run time. This was rather disappointing, because back in the day, increasing the size of a similar buffer in the C FILE struct produced a dramatic improvement on code I had written. This demonstrates again how prior experience can lead the optimizing developer astray. This improvement amounts to around 5% on the 1,500-millisecond test run times observed so far, but might be a bigger component if the overall run time can be reduced.

Take Bigger Bites—Read a Line at a Time

In the introduction to this section, I noted that the files to be read were often text files. For a text file divided into lines, it is a reasonable hypothesis that a function that reads lines might reduce the number of function calls, hopefully using a line-oriented or buffer-filling interface internally. In addition, if the result string is updated less frequently, there should be less copying and reallocation. Indeed, there is a function called getline() in the standard library, and Example 11-9 can be used to test this hypothesis.

Example 11-9. stream_read_getline() reads the file a line at a time

void stream_read_getline(std::istream& f, std::string& result) {

std::string line;

result.clear();

while (getline(f, line))

(result += line) += "\n";

}

stream_read_getline() appends to result. result must be cleared at the start because there is nothing requiring that it be empty when passed into the function. clear() does not return the string’s dynamic buffer to the memory manager. It just sets the length of the string to zero. Depending how the string argument was used before the call, it might already have a substantial dynamic buffer, reducing the burden of allocation.

Testing stream_read_getline() validates these hypotheses: it takes only 1,284 milliseconds on VS2010 and 1,440 milliseconds on VS2015 to complete 100 iterations of reading a 10,000-line file.

Although result might happen already to be long enough to prevent reallocation, it’s a good plan to reserve space, just in case. The same mechanism used in stream_read_string_2() can be added to stream_read_getline(), producing the function in Example 11-10.

Example 11-10. stream_read_getline_2(): line at a time, plus preallocation of result variable

void stream_read_getline_2(std::ifstream& f,

std::string& result,

std::streamoff len = 0)

{

std::string line;

result.clear();

if (len > 0)

result.reserve(static_cast<std::string::size_type>(len));

while (getline(f, line))

(result += line) += "\n";

}

This optimization yielded a just-measurable 3% performance improvement versus stream_read_getline(). This improvement can be combined with a bigger streambuf to get test run times of 1,193 (VS2010) and 1,404 (VS2015) milliseconds.

Another way to approach taking bigger bites is by using the std::streambuf member function sgetn(), which retrieves an arbitrary amount of data into a buffer argument to the call. For files of reasonable size, the entire file can be retrieved in a single big gulp. Thestream_read_sgetn() function in Example 11-11 illustrates this approach.

Example 11-11. stream_read_sgetn()

bool stream_read_sgetn(std::istream& f, std::string& result) {

std::streamoff len = stream_size(f);

if (len == -1)

return false;

result.resize (static_cast<std::string::size_type>(len));

f.rdbuf()->sgetn(&result[0], len);

return true;

}

In stream_read_sgetn(), sgetn() copies data directly into the result string, which must be made big enough to hold the data. The size of the stream must therefore be determined before the call to sgetn(). It is not optional, like in the stream_read_string_2() function inExample 11-7. The size is determined by a call to stream_size().

As previously noted, stream_size() can fail. It would be nice to get a failure indication out of stream_read_sgetn(). Fortunately, since this library function uses the copy-free idiom (see “Copy Free Libraries”), its return value is available for a success/fail indication.

stream_read_sgetn() is fast. It completed the test in 307 (VS2010) and 148 (VS2015) milliseconds, over 4 times faster than stream_read_streambuf(). When combined with a bigger rdbuf, test run time shrank to 244 (VS2010) and 134 (VS2015) milliseconds. The modest improvement from a bigger rdbuf makes a bigger contribution when the overall time is shorter.

Shorten Calling Chains Again

std::istream provides a read() member function that copies characters directly into a buffer. This function mimics the low-level read() function in Linux and ReadFile() on Windows. If std::istream::read() is wired directly to this low-level functionality, bypassing the buffering and other baggage of C++ stream I/O, it should be more efficient. Furthermore, if the entire file can be read at once, it may produce a very efficient call. Example 11-12 implements this functionality.

Example 11-12. stream_read_string() uses read() into a string’s storage

bool stream_read_string(std::istream& f, std::string& result) {

std::streamoff len = stream_size(f);

if (len == -1)

return false;

result.resize (static_cast<std::string::size_type>(len));

f.read(&result[0], result.length());

return true;

}

stream_read_string(), running the test in 267 (VS2010) and 144 (VS2015) milliseconds, is about 25% faster than stream_read_sgetn() and 5 times faster than file_reader().

A problem with stream_read_sgetn() and stream_read_string() is that they assume the pointer &s[0] points to a contiguous block of storage. Prior to C++11, the C++ standard did not require the characters of a string to be stored contiguously, though all standard library implementations of which I am aware were coded this way. The C++11 standard contains clear language in section 21.4.1 saying that string storage is contiguous.

I tested a function that dynamically allocated a character array into which data was read and then copied into a string using assign(). This function would be usable even if a novel string implementation violated the contiguous storage rule:

bool stream_read_array(std::istream& f, std::string& result) {

std::streamoff len = stream_size(f);

if (len == -1)

return false;

std::unique_ptr<char> data(new char[static_cast<size_t>(len)]);

f.read(data.get(), static_cast<std::streamsize>(len));

result.assign(data.get(), static_cast<std::string::size_type>(len));

return true;

}

This function performed my test in 307 (VS2010) and 186 (VS2015) milliseconds, only a little slower than stream_read_string().

Things That Didn’t Help

I found some seriously complex advice to construct a custom streambuf to improve performance. Example 11-13 is one such function found in the wild.

Example 11-13. Kids, don’t try this at home

// from: http://stackoverflow.com/questions/8736862

class custombuf : public std::streambuf

{

public:

custombuf(std::string& target): target_(target) {

this->setp(this->buffer_, this->buffer_ + bufsize - 1);

}

private:

std::string& target_;

enum { bufsize = 8192 };

char buffer_[bufsize];

int overflow(int c) {

if (!traits_type::eq_int_type(c, traits_type::eof())) {

*this->pptr() = traits_type::to_char_type(c);

this->pbump(1);

}

this->target_.append(this->pbase(),

this->pptr() - this->pbase());

this->setp(this->buffer_, this->buffer_ + bufsize - 1);

return traits_type::not_eof(c);

}

int sync() { this->overflow(traits_type::eof()); return 0; }

};

std::string stream_read_custombuf(std::istream& f) {

std::string data;

custombuf sbuf(data);

std::ostream(&sbuf) << f.rdbuf() << std::flush;

return data;

}

The problem with this code sample is that it’s an attempt to optimize an inefficient algorithm. As observed previously (in stream_read_streambuf()), inserting a streambuf on an ostream wasn’t particularly fast. Performance, at 1,312 (VS2010) and 1,182 (VS2015) milliseconds, was no better than that of stream_read_streambuf(). Any improvement was probably due to using an 8 KB buffer in the custom streambuf, which can be done with just a couple of lines of code.

Writing Files

To test my file-reading functions, I had to create files. This let me test file-writing functions. My first file-writing attempt looked like Example 11-14.

Example 11-14. stream_write_line()

void stream_write_line(std::ostream& f, std::string const& line) {

f << line << std::endl;

}

This function, called 10,000 times to produce a file and iterated 100 times to produce timing data, took 1,972 (VS2010) and 2,110 (VS2015) milliseconds.

stream_write_line() pushes out each line with a std::endl. Something I didn’t know about endl is that it flushes the output. Without the std::endl, writing should be faster, because std::ofstream only pushes a few big blocks of data out to the operating system. Example 11-15 tests this hypothesis.

Example 11-15. stream_write_line_noflush() is faster

void stream_write_line_noflush(std::ostream& f,

std::string const& line)

{

f << line << "\n";

}

Of course, either stream_write_line_noflush() must be finished with an f.flush(), or the stream must be closed, so that the last buffer full of data goes out. stream_write_line_noflush() produced my file in 367 (VS2010) and 302 (VS2015) milliseconds, about 5 times faster than stream_write_line().

I also called stream_write_line_noflush() with the entire file in a single string. As expected, this was much faster, completing the test loop in 132 (VS2010) and 137 (VS2015) milliseconds. This was about 1.7 times faster than putting out the file line-by-line.

Reading from std::cin and Writing to std::cout

When reading from the standard input, it is worth knowing that std::cin is tied to std::cout. Requesting input from std::cin flushes std::cout first, so that interactive console programs display their prompts. A call to istream::tie() will produce a pointer to a tied stream, if any exist. A call to istream::tie(nullptr) breaks an existing tie. Flushing is quite costly, as shown in the previous section.

Another thing to know about std::cin and std::cout is that the C++ streams are conceptually connected to the C FILE* objects stdin and stdout. This permits a program to use both C++ and C I/O statements and have interleaving of the output or input make some kind of sense. The way std::cout is connected to stdout is implementation-defined. Most standard library implementations forward std::cout directly to stdout by default. stdout is line-buffered by default, a mode not present in C++ iostreams. When stdout sees a newline, it flushes its buffer.

Performance improves if the connections are severed. Calling the static member function std::ios_base::sync_with_stdio(false) breaks this connection, improving performance at the expense of unpredictable interleaving if a program uses both C and C++ I/O functions.

I did not test how significant the performance difference was.

Summary

· Code for “fast” file I/O on the Internet is not necessarily fast, no matter what the site tries to sell you.

· Increasing the size of the rdbuf produces a few percent performance improvement on reading files.

· The best read times I could obtain used the std::streambuf::sgetn() function to fill a string buffer preallocated to the size of the file.

· std::endl flushes the output. It’s expensive if you are not doing console output.

· std::cout is tied to std::cin and stdout. Breaking these ties can improve performance.