Professional C++ (2014)

Part I: Introduction to Professional C++

Chapter 2: Working with Strings

WHAT’S IN THIS CHAPTER?

· The differences between C-style strings and C++ strings

· Details of the C++ std::string class

· What raw string literals are

WROX.COM DOWNLOADS FOR THIS CHAPTER

Please note that all the code examples for this chapter are available as a part of this chapter’s code download on the book’s website at www.wrox.com/go/proc++3e on the Download Code tab.

Every program that you write will use strings of some kind. With the old C language there is not much choice but to use a dumb null-terminated character array to represent a string. Unfortunately, doing so can cause a lot of problems, such as buffer overflows, which can result in security vulnerabilities. The C++ STL includes a safe and easy-to-use std::string class that does not have these disadvantages.

This chapter discusses strings in more detail. It starts with a discussion of the old C-style strings, explains their disadvantages, and ends with the C++ string class and raw string literals.

DYNAMIC STRINGS

Strings in languages that have supported them as first-class objects tend to have a number of attractive features, such as being able to expand to any size, or have sub-strings extracted or replaced. In other languages, such as C, strings were almost an afterthought; there was no really good “string” data type, just fixed arrays of bytes. The “string library” was nothing more than a collection of rather primitive functions without even bounds checking. C++ provides a string type as a first-class data type.

C-Style Strings

In the C language, strings are represented as an array of characters. The last character of a string is a null character ('\0') so that code operating on the string can determine where it ends. This null character is officially known as NUL, spelled with one L, not two. NUL is not the same as the NULL pointer. Even though C++ provides a better string abstraction, it is important to understand the C technique for strings because they still arise in C++ programming. One of the most common situations is where a C++ program has to call a C-based interface in some third-party library or as part of interfacing to the operating system.

By far, the most common mistake that programmers make with C strings is that they forget to allocate space for the '\0' character. For example, the string "hello" appears to be five characters long, but six characters worth of space are needed in memory to store the value, as shown in Figure 2-1.

FIGURE 2-1: The string "hello" in memory occupies six characters: 'h', 'e', 'l', 'l', 'o', '\0'.

C++ contains several functions from the C language that operate on strings. These functions are defined in the <cstring> header. As a general rule of thumb, these functions do not handle memory allocation. For example, the strcpy() function takes two strings as parameters. It copies the second string onto the first, whether it fits or not. The following code attempts to build a wrapper around strcpy() that allocates the correct amount of memory and returns the result, instead of taking in an already allocated string. It uses the strlen() function to obtain the length of the string. The caller is responsible for freeing the memory allocated by copyString().

char* copyString(const char* str)
{
    char* result = new char[strlen(str)]; // BUG! Off by one!
    strcpy(result, str);
    return result;
}

The copyString() function as written is incorrect. The strlen() function returns the length of the string, not the amount of memory needed to hold it. For the string "hello", strlen() will return 5, not 6. The proper way to allocate memory for a string is to add one to the amount of space needed for the actual characters. It seems a bit unnatural to have +1 all over the place. Unfortunately, that’s how it works, so keep this in mind when you work with C-style strings. The correct implementation is as follows:

char* copyString(const char* str)
{
    char* result = new char[strlen(str) + 1];
    strcpy(result, str);
    return result;
}
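Because the caller owns the memory returned by copyString(), it is easy to forget the matching delete[]. As a minimal sketch, assuming a C++11 compiler, the result can be handed to a std::unique_ptr<char[]> so that the array is released automatically. The helper name copyStringOwned() is hypothetical and is not part of the chapter's code download.

```cpp
#include <cstring>
#include <memory>

// Corrected copyString() from the text: allocates strlen(str) + 1 bytes
// so that there is room for the terminating '\0'.
char* copyString(const char* str)
{
    char* result = new char[std::strlen(str) + 1];
    std::strcpy(result, str);
    return result;
}

// Hypothetical helper (not from the chapter): transfers ownership of the
// copy to a smart pointer, which calls delete[] when it goes out of scope.
std::unique_ptr<char[]> copyStringOwned(const char* str)
{
    return std::unique_ptr<char[]>(copyString(str));
}
```

A caller can then write auto copy = copyStringOwned("hello"); and pass copy.get() to any function that expects a const char*.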

One way to remember that strlen() returns only the number of actual characters in the string is to consider what would happen if you were allocating space for a string made up of several others. For example, if your function took in three strings and returned a string that was the concatenation of all three, how big would it be? To hold exactly enough space, it would be the length of all three strings, added together, plus one for the trailing '\0' character. If strlen() included the '\0' in the length of the string, the allocated memory would be too big. The following code uses the strcpy() and strcat() functions to perform this operation. The cat in strcat() stands for concatenate.

char* appendStrings(const char* str1, const char* str2, const char* str3)
{
    char* result = new char[strlen(str1) + strlen(str2) + strlen(str3) + 1];
    strcpy(result, str1);
    strcat(result, str2);
    strcat(result, str3);
    return result;
}

The sizeof() operator in C and C++ can be used to get the size of a certain data type or variable. For example, sizeof(char) returns 1 because a char has a size of 1 byte. However, in the context of C-style strings, sizeof() is not the same as strlen(). You should never use sizeof() to try to get the size of a string. If the C-style string is stored as a char[], then sizeof() returns the actual memory used by the string, including the '\0' character. For example:

char text1[] = "abcdef";

size_t s1 = sizeof(text1); // is 7

size_t s2 = strlen(text1); // is 6

However, if the C-style string is stored as a char*, then sizeof() returns the size of a pointer! For example:

const char* text2 = "abcdef";

size_t s3 = sizeof(text2); // is platform-dependent

size_t s4 = strlen(text2); // is 6

s3 will be 4 when compiled in 32-bit mode and will be 8 when compiled in 64-bit mode because it is returning the size of a const char*, which is a pointer.
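The same pitfall appears when a char[] is passed to a function, because the array decays to a pointer at the call. The following is a minimal sketch of that effect; both function names are hypothetical and serve only as illustration.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical helper: 'text' is a pointer here, so sizeof(text) is the
// size of a pointer, no matter how large the caller's array was.
std::size_t sizeSeenByCallee(const char* text)
{
    return sizeof(text);
}

// Hypothetical helper: strlen() counts the characters up to, but not
// including, the terminating '\0'.
std::size_t lengthSeenByCallee(const char* text)
{
    return std::strlen(text);
}
```

Called with the text1 array from the earlier example, sizeSeenByCallee() reports the pointer size while lengthSeenByCallee() still reports 6, which is why functions that need a buffer's capacity usually take it as a separate parameter.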

A complete list of C functions to operate on strings can be found in the <cstring> header file.

WARNING When you use the C-style string functions with Microsoft Visual Studio, the compiler is likely to give you security-related warnings or even errors about these functions being deprecated. You can eliminate these warnings by using other C standard library functions, such as strcpy_s() or strcat_s(), which are part of the “secure C library” standard (ISO/IEC TR 24731). However, the best solution is to switch to the C++ string class, discussed later in this chapter.
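Because strcpy_s() and strcat_s() are not available on every platform, the sketch below uses a different technique: a bounded copy built on std::snprintf(), which is part of the standard library since C++11. It is not what the warning above describes, only a portable illustration of the same idea. The name boundedCopy() is hypothetical.

```cpp
#include <cstddef>
#include <cstdio>

// Hypothetical helper: copies src into dest without ever writing more than
// destSize bytes. Returns true if the whole string fit, false if it was
// truncated or if dest has no room at all.
bool boundedCopy(char* dest, std::size_t destSize, const char* src)
{
    if (destSize == 0) {
        return false;
    }
    // snprintf() writes at most destSize - 1 characters plus a '\0' and
    // returns the number of characters the complete string would need.
    int needed = std::snprintf(dest, destSize, "%s", src);
    return needed >= 0 && static_cast<std::size_t>(needed) < destSize;
}
```

Unlike strcpy(), this version reports truncation instead of overflowing the destination, at the cost of an extra size parameter.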

String Literals

You’ve probably seen strings written in a C++ program with quotes around them. For example, the following code outputs the string hello by including the string itself, not a variable that contains it:

cout << "hello" << endl;

In the preceding line, "hello" is a string literal because it is written as a value, not a variable. The actual memory associated with a string literal is in a read-only part of memory. This allows the compiler to optimize memory usage by reusing references to equivalent string literals. That is, even if your program uses the string literal "hello" 500 times, the compiler is allowed to create just one instance of hello in memory. This is called literal pooling.

String literals can be assigned to variables, but because string literals are in a read-only part of memory and because of the possibility of literal pooling, assigning them to variables can be risky. The C++ standard officially says that string literals are of type “array of n const char”; however, for backward compatibility with older non-const aware code, most compilers do not force you to assign a string literal to a variable of type const char*. They let you assign a string literal to a char* without const, and the program will work fine unless you attempt to change the string. Generally, the behavior of modifying string literals is undefined. It could, for example, cause a crash, or it could keep working with seemingly inexplicable side effects, or the modification could silently be ignored, or it could just work; it all depends on your compiler. For example, the following code exhibits undefined behavior:

char* ptr = "hello"; // Assign the string literal to a variable.

ptr[1] = 'a'; // Undefined behavior!

A much safer way to code is to use a pointer to const characters when referring to string literals. The following code contains the same bug, but because it assigned the literal to a const char*, the compiler will catch the attempt to write to read-only memory.

const char* ptr = "hello"; // Assign the string literal to a variable.

ptr[1] = 'a'; // Error! Attempts to write to read-only memory

You can also use a string literal as an initial value for a character array (char[]). In this case, the compiler creates an array that is big enough to hold the string and copies the string to this array. So, the compiler will not put the literal in read-only memory and will not do any literal pooling.

char arr[] = "hello"; // Compiler takes care of creating appropriate sized

// character array arr.

arr[1] = 'a'; // The contents can be modified.

The C++ string Class

C++ provides a much-improved implementation of the concept of a string as part of the Standard Library. In C++, std::string is a class (actually an instantiation of the basic_string class template) that supports many of the same functionalities as the <cstring> functions, but takes care of memory allocation for you. The string class is defined in the <string> header in the std namespace, and has already been introduced in the previous chapter. Now it’s time to take a deeper look at it.

What Is Wrong with C-Style Strings?

To understand the necessity of the C++ string class, consider the advantages and disadvantages of C-style strings.

Advantages:

· They are simple, making use of the underlying basic character type and array structure.

· They are lightweight, taking up only the memory that they need if used properly.

· They are low level, so you can easily manipulate and copy them as raw memory.

· They are well understood by C programmers — why learn something new?

Disadvantages:

· They require incredible efforts to simulate a first-class string data type.

· They are unforgiving and susceptible to difficult-to-find memory bugs.

· They don’t leverage the object-oriented nature of C++.

· They require knowledge of their underlying representation on the part of the programmer.

The preceding lists were carefully constructed to make you think that perhaps there is a better way. As you’ll learn, C++ strings solve all the problems of C strings and render most of the arguments about the advantages of C strings over a first-class data type irrelevant.

Using the string Class

Even though string is a class, you can almost always treat it as if it were a built-in type. In fact, the more you think of it as a simple type, the better off you are. Through the magic of operator overloading, C++ strings are much easier to use than C-style strings. For example, the + operator is redefined for strings to mean “string concatenation.” The following produces 1234:

string A("12");

string B("34");

string C;

C = A + B; // C will become "1234"

The += operator is also overloaded to allow you to easily append a string:

string A("12");

string B("34");

A += B; // A will become "1234"
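As a small sketch of how += is typically used in practice, the following function builds one string out of several pieces. The function name join() and its parameters are hypothetical and do not come from the chapter.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: concatenates all words, placing 'separator' between
// neighbors. The string grows as needed; no manual allocation is involved.
std::string join(const std::vector<std::string>& words,
                 const std::string& separator)
{
    std::string result;
    for (std::size_t i = 0; i < words.size(); ++i) {
        if (i != 0) {
            result += separator;
        }
        result += words[i];
    }
    return result;
}
```

Compare this with appendStrings() earlier in the chapter: there is no length arithmetic, no +1 for the terminator, and nothing for the caller to free.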

Another problem with C strings is that you cannot use == to compare them. Suppose you have the following two strings:

const char* a = "12";

char b[] = "12";

Writing a comparison as follows always returns false, because it compares the pointer values, not the contents of the strings:

if (a == b)

NOTE C arrays and pointers are related. You can think of C arrays, like the b array in the example, as pointers to the first element in the array. Chapter 22 goes deeper into the array-pointer duality.

To compare C strings, you have to write something as follows:

if (strcmp(a, b) == 0)

Furthermore, there is no way to use <, <=, >=, or > to compare C strings, so strcmp() returns a value less than 0, equal to 0, or greater than 0, depending on the lexicographic relationship of the strings. This results in very clumsy code, which is also error-prone.

With C++ strings, operator==, operator!=, operator<, and so on are all overloaded to work on the actual string characters. Individual characters can still be accessed with operator[].
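The contrast between the two comparison styles can be sketched as follows. Both function names are hypothetical. Note that strcmp() only promises the sign of its result, so the first helper reduces it to -1, 0, or 1 explicitly.

```cpp
#include <cstring>
#include <string>

// Hypothetical helper: maps strcmp()'s negative, zero, or positive result
// onto exactly -1, 0, or 1.
int compareCStrings(const char* a, const char* b)
{
    int result = std::strcmp(a, b);
    return (result > 0) - (result < 0);
}

// With std::string, the relational operators compare the characters
// themselves, so no helper function is needed.
bool sortsBefore(const std::string& a, const std::string& b)
{
    return a < b;
}
```

With these definitions, compareCStrings("apple", "banana") yields -1 and sortsBefore("apple", "banana") yields true.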

As the following code shows, when string operations require extending the string, the memory requirements are automatically handled by the string class, so memory overruns are a thing of the past.

string myString = "hello";
myString += ", there";
string myOtherString = myString;
if (myString == myOtherString) {
    myOtherString[0] = 'H';
}
cout << myString << endl;
cout << myOtherString << endl;

The output of this code is:

hello, there

Hello, there

There are several things to note in this example. One point to note is that there are no memory leaks even though strings are allocated and resized left and right. All of these string objects are created as stack variables. While the string class certainly has a bunch of allocating and resizing to do, the string destructors clean up this memory when string objects go out of scope.

Another point to note is that the operators work the way you want them to. For example, the = operator copies the strings, which is most likely what you want. If you are used to working with array-based strings, this will either be refreshingly liberating for you or somewhat confusing. Don’t worry — once you learn to trust the string class to do the right thing, life gets so much easier.

For compatibility, you can use the c_str() method on a string to get a const character pointer, representing a C-style string. However, the returned const pointer becomes invalid whenever the string has to perform any memory reallocation, or when the string object is destroyed. You should call the method just before using the result so that it accurately reflects the current contents of the string, and you must never return the result of c_str() called on a stack-based string from your function.
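The following sketch illustrates both pieces of advice. The function names are hypothetical, and strlen() merely stands in for any C-based interface that expects a const char*.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Safe: c_str() is called at the point where the C function needs it,
// while the string is alive and has not been modified since.
std::size_t lengthViaCApi(const std::string& text)
{
    return std::strlen(text.c_str());
}

// Safe alternative to returning c_str() from a function: return the
// std::string by value and let the caller call c_str() when required.
std::string makeGreeting(const std::string& name)
{
    std::string greeting = "hello, " + name;
    return greeting; // Returning greeting.c_str() here would dangle.
}
```

The caller keeps the returned string in a variable and calls c_str() on that variable, so the pointer stays valid for as long as the variable exists and is not modified.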

Consult a Standard Library Reference, for example http://www.cppreference.com/ or http://www.cplusplus.com/reference/, for a complete list of all supported operations that you can perform on string objects.

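std::string Literals

A string literal in source code is usually interpreted as a const char*. You can use the standard user-defined literal "s" to interpret a string literal as an std::string instead. This suffix is available since C++14 and is declared in the std::string_literals namespace, so it has to be brought into scope with a using directive such as using namespace std; before it can be used. For example: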

auto string1 = "Hello World"; // string1 will be a const char*

auto string2 = "Hello World"s; // string2 will be an std::string
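As a self-contained sketch, assuming a C++14 compiler, the deduced types can be verified at compile time. The function name helloLength() is hypothetical.

```cpp
#include <cstddef>
#include <string>
#include <type_traits>

using namespace std::string_literals; // Makes the "s" suffix visible.

// Hypothetical helper that checks the deduced types with static_assert.
std::size_t helloLength()
{
    auto string1 = "Hello World";  // const char*
    auto string2 = "Hello World"s; // std::string
    static_assert(std::is_same<decltype(string1), const char*>::value,
                  "a plain literal deduces to const char*");
    static_assert(std::is_same<decltype(string2), std::string>::value,
                  "the s suffix produces a std::string");
    return string2.size(); // Member functions are available directly.
}
```

If either assumption about the types were wrong, the code would fail to compile rather than misbehave at run time.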

Numeric Conversions

The std namespace includes a number of helper functions making it easy to convert numerical values into strings or strings into numerical values. The following functions are available to convert numerical values into strings:

· string to_string(int val);

· string to_string(unsigned val);

· string to_string(long val);

· string to_string(unsigned long val);

· string to_string(long long val);

· string to_string(unsigned long long val);

· string to_string(float val);

· string to_string(double val);

· string to_string(long double val);

They are pretty straightforward to use. For example, the following code converts a long double value into a string:

long double d = 3.14L;

string s = to_string(d);
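A common use of to_string() is to build a message or label from a number, as in the following sketch. The function name makeLabel() and the "item-" prefix are hypothetical. The floating-point overloads have traditionally formatted their result like printf()'s %f, so check the output if the exact number of decimals matters to you.

```cpp
#include <string>

// Hypothetical helper: converts the number to text and appends it to a
// fixed prefix using string concatenation.
std::string makeLabel(int id)
{
    return "item-" + std::to_string(id);
}
```

Without to_string(), the expression "item-" + id would compile but perform pointer arithmetic on the literal instead of concatenation.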

Converting in the other direction is done by the following set of functions, also defined in the std namespace. In these prototypes, str is the string that you want to convert, idx is a pointer that will receive the index of the first non-converted character, and base is the mathematical base that should be used during conversion. The idx pointer can be a null pointer in which case it will be ignored. These functions throw invalid_argument if no conversion could be performed and throw out_of_range if the converted value is outside the range of the return type.

· int stoi(const string& str, size_t *idx=0, int base=10);

· long stol(const string& str, size_t *idx=0, int base=10);

· unsigned long stoul(const string& str, size_t *idx=0, int base=10);

· long long stoll(const string& str, size_t *idx=0, int base=10);

· unsigned long long stoull(const string& str, size_t *idx=0, int base=10);

· float stof(const string& str, size_t *idx=0);

· double stod(const string& str, size_t *idx=0);

· long double stold(const string& str, size_t *idx=0);

For example:

const string s = "1234";

int i = stoi(s); // i will be 1234
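The idx parameter and the two exceptions can be combined into a stricter conversion, sketched below. The function name parseInt() is hypothetical. It accepts a string only if the entire string is a valid base-10 int.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: returns true and stores the result in 'value' only
// if the whole string converts to an int. Returns false if nothing could be
// converted, if the value does not fit, or if characters are left over.
bool parseInt(const std::string& text, int& value)
{
    try {
        std::size_t idx = 0;
        int parsed = std::stoi(text, &idx);
        if (idx != text.size()) {
            return false; // Trailing characters, as in "12abc".
        }
        value = parsed;
        return true;
    } catch (const std::invalid_argument&) {
        return false; // No conversion could be performed.
    } catch (const std::out_of_range&) {
        return false; // The value does not fit in an int.
    }
}
```

The base parameter works independently of this helper: for example, stoi("ff", nullptr, 16) interprets its argument as hexadecimal.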

Raw String Literals

Raw string literals are string literals that can span across multiple lines of code, that don’t require escaping of embedded double quotes, and where escape sequences like \t and \n are not processed as escape sequences, but as normal text. Escape sequences are discussed in Chapter 1. For example, if you write the following with a normal string literal, you will get a compiler error because the string contains non-escaped double quotes:

string str = "Hello "World"!"; // Error!

With a normal string you have to escape the double quotes as follows:

string str = "Hello \"World\"!";

With a raw string literal you can avoid the need to escape the quotes. The raw string literal starts with R"( and ends with )".

string str = R"(Hello "World"!)";

Raw string literals can span across multiple lines. For example, if you write the following with a normal string literal, you will get a compiler error, because a normal string literal cannot span multiple lines:

string str = "Line 1
Line 2 with \t"; // Error!

Instead, you can use a raw string literal as follows:

string str = R"(Line 1
Line 2 with \t)";

This also demonstrates that with the raw string literal the \t escape character is not replaced with an actual tab character but is taken literally. If you write str to the console the output will be:

Line 1
Line 2 with \t

Since the raw string literal ends with )" you cannot embed a )" in your string using this syntax. For example, the following string is not valid because it contains the )" in the middle of the string:

string str = R"(The characters )" are embedded in this string)"; // Error!

If you need embedded )" characters, you need to use the extended raw string literal syntax, which is as follows:

R"d-char-sequence(r-char-sequence)d-char-sequence"

The r-char-sequence is the actual raw string. The d-char-sequence is an optional delimiter sequence, which should be the same at the beginning and at the end of the raw string literal. This delimiter sequence can have at most 16 characters. You should choose this delimiter sequence as a sequence that will not appear in the middle of your raw string literal.

The previous example can be rewritten using a unique delimiter sequence as follows:

string str = R"-(The characters )" are embedded in this string)-";

Raw string literals make it easier to work with database querying strings, regular expressions, and so on. Regular expressions are discussed in Chapter 18.
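As a brief sketch of the rules above, the following functions return the same text written with and without raw string literals. The function names and the file path are hypothetical and chosen only for illustration.

```cpp
#include <string>

// Normal literal: every backslash has to be escaped.
std::string escapedPath()
{
    return "C:\\temp\\new.txt";
}

// Raw literal: \t and \n are ordinary characters, not escape sequences.
std::string rawPath()
{
    return R"(C:\temp\new.txt)";
}

// Extended syntax: the delimiter '-' allows )" to appear inside the text.
std::string rawWithDelimiter()
{
    return R"-(Embedded )" characters)-";
}
```

Here escapedPath() and rawPath() produce identical strings, and the raw version is closer to what you would type in a file dialog or a regular expression tester.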

Nonstandard Strings

There are several reasons why many C++ programmers don’t use C++-style strings. Some programmers simply aren’t aware of the string type because it was not always part of the C++ specification. Others have discovered over the years that the C++ string doesn’t provide the behavior they need and have developed their own string type. Perhaps the most common reason is that development frameworks and operating systems tend to have their own way of representing strings, such as the CString class in Microsoft’s MFC. Often, this is for backward compatibility or legacy issues. When starting a project in C++, it is very important to decide ahead of time how your group will represent strings.

SUMMARY

This chapter discussed the C++ string class and why you should use it instead of the old plain C-style character arrays. It also explained a number of helper functions to make it easier to convert numerical values into strings and vice versa, and introduced the concept of raw string literals.