Strings - The Standard Library - The C++ Programming Language (2013)

Part IV: The Standard Library

36. Strings

Prefer the standard to the offbeat.

– Strunk & White

Introduction

Character Classification

Classification Functions; Character Traits

Strings

string vs. C-Style Strings; Constructors; Fundamental Operations; String I/O; Numeric Conversions; STL-like Operations; The find Family; Substrings

Advice

36.1. Introduction

The standard library offers character classification operations in <cctype> (§36.2), strings with associated operations in <string> (§36.3), regular expression matching in <regex> (Chapter 37), and support for C-style strings in <cstring> (§43.4). Handling of different character sets, encodings, and conventions (locales) is discussed in Chapter 39.

A simplified string implementation is presented in §19.3.

36.2. Character Classification

The standard library provides classification functions to help users to manipulate strings (and other character sequences) and traits specifying properties of a character type to help implementers of operations on strings.

36.2.1. Classification Functions

In <cctype>, the standard library provides functions to classify the characters from the basic execution character set:

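[Table not reproduced: the character classification functions isspace(c), isalpha(c), isdigit(c), isxdigit(c), isupper(c), islower(c), isalnum(c), iscntrl(c), ispunct(c), isprint(c), and isgraph(c).]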

In addition, the standard library provides two useful functions for removing case differences:

[Table not reproduced: toupper(c) and tolower(c).]

The equivalent functions for wide characters are provided in <cwctype>.

The character classification functions are sensitive to the "C" locale (§39.5.1, §39.5.2). Equivalent functions for other locales are provided in <locale> (§39.5.1).

One reason that these character classification functions are useful is that character classification can be trickier than it might appear. For example, a novice might write:

if ('a'<=ch && ch<='z') // a lowercase character

This is more verbose (and most likely slower) than:

if (islower(ch)) // a lowercase character

Also, there is no guarantee that the characters are contiguous in a code space. Furthermore, the use of standard character classifications is far easier to convert to another locale:

if (islower(ch,danish)) // a lowercase character in Danish
                        // (assuming "danish" is the name for a Danish locale)

Note that Danish has three more lowercase characters than English, so that the initial explicit test using 'a' and 'z' would be flat wrong.
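
As a minimal sketch of using a <cctype> classification function over a whole string, consider the following. The name count_lower is hypothetical (it does not come from the text), and the conversion to unsigned char is there because passing a negative char value to islower() is undefined.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: count the lowercase characters in s
// according to the current C locale.
int count_lower(const std::string& s)
{
    int n = 0;
    for (unsigned char ch : s)      // unsigned char: islower() must not be given a negative value
        if (std::islower(ch)) ++n;
    return n;
}
```

For example, count_lower("Hello, World!") counts the eight lowercase letters without assuming anything about how the character set is laid out.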

36.2.2. Character Traits

As shown in §23.2, a string template can, in principle, use any type with proper copy operations as its character type. However, efficiency can be improved and implementations can be simplified for types that don’t have user-defined copy operations. Consequently, the standard string requires that a type used as its character type be a POD (§8.2.6). This also helps to make I/O of strings simple and efficient.

The properties of a character type are defined by its char_traits. A char_traits is a specialization of the template:

template<typename C> struct char_traits { };

All char_traits are defined in std, and the standard ones are presented in <string>. The general char_traits itself has no properties; only char_traits specializations for a particular character type have. Consider char_traits<char>:

template<>
struct char_traits<char> { // char_traits operations should not throw exceptions
    using char_type = char;
    using int_type = int; // type of integer value of character
    using off_type = streamoff; // offset in stream
    using pos_type = streampos; // position in stream
    using state_type = mbstate_t; // multibyte stream state (§39.4.6)
    // ...
};

The standard provides four specializations of char_traits (§iso.21.2.3):

template<> struct char_traits<char>;
template<> struct char_traits<char16_t>;
template<> struct char_traits<char32_t>;
template<> struct char_traits<wchar_t>;

The members of the standard char_traits are all static functions:

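[Table not reproduced: the static member functions of char_traits, including eq(), lt(), compare(), length(), find(), move(), copy(), assign(), to_char_type(), to_int_type(), eq_int_type(), eof(), and not_eof().]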

Comparing with eq() is often not simply an ==. For example, a case-insensitive char_traits would define its eq() so that eq('b','B') would return true.

Because copy() does not protect against overlapping ranges, it may be faster than move(). The compare() function uses lt() and eq() to compare characters. It returns an int, where 0 represents an exact match, a negative number means that its first argument comes lexicographically before the second, and a positive number means that its first argument comes after its second.
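
The case-insensitive eq() mentioned above can be sketched as a traits class. This is only an illustration under stated assumptions, not code from the text: the names ci_char_traits and ci_string are hypothetical, and case is folded with toupper() in the current C locale.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical traits class: compares characters without regard to case.
// Everything not redefined here is inherited from std::char_traits<char>.
struct ci_char_traits : std::char_traits<char> {
    static int fold(char c)         // map a character to its uppercase value
    {
        return std::toupper(static_cast<unsigned char>(c));
    }

    static bool eq(char a, char b) { return fold(a) == fold(b); }
    static bool lt(char a, char b) { return fold(a) < fold(b); }

    // 0 for a match, negative if s1 comes first, positive if s2 comes first
    static int compare(const char* s1, const char* s2, std::size_t n)
    {
        for (std::size_t i = 0; i != n; ++i) {
            if (lt(s1[i], s2[i])) return -1;
            if (lt(s2[i], s1[i])) return 1;
        }
        return 0;
    }

    // pointer to the first character in [s:s+n) equal to c, or nullptr
    static const char* find(const char* s, std::size_t n, char c)
    {
        for (std::size_t i = 0; i != n; ++i)
            if (eq(s[i], c)) return s + i;
        return nullptr;
    }
};

using ci_string = std::basic_string<char, ci_char_traits>;
```

With these definitions, ci_string("Hello")==ci_string("hELLO") holds, because the comparison operators of basic_string go through the traits' compare().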

The I/O-related functions are used by the implementation of low-level I/O (§38.6).

36.3. Strings

In <string>, the standard library provides a general string template basic_string:

template<typename C,
typename Tr = char_traits<C>,
typename A = allocator<C>>
class basic_string {
public:
using traits_type = Tr;
using value_type = typename Tr::char_type;
using allocator_type = A;
using size_type = typename allocator_traits<A>::size_type;
using difference_type = typename allocator_traits<A>::difference_type;
using reference = value_type&;
using const_reference = const value_type&;
using pointer = typename allocator_traits<A>::pointer;
using const_pointer = typename allocator_traits<A>::const_pointer;
using iterator = /* implementation-defined */;
using const_iterator = /* implementation-defined */;
using reverse_iterator = std::reverse_iterator<iterator>;
using const_reverse_iterator = std::reverse_iterator<const_iterator>;

static const size_type npos = -1; // integer representing end-of-string

// ...
};

The elements (characters) are stored contiguously, so that low-level input operations can safely use a basic_string’s sequence of characters as a source or target.

The basic_string offers the strong guarantee (§13.2): if a basic_string operation throws, the string is left unchanged.

Specializations are offered for a few standard character types:

using string = basic_string<char>;
using u16string = basic_string<char16_t>;
using u32string = basic_string<char32_t>;
using wstring = basic_string<wchar_t>;

All these strings provide a host of operations.

Like containers (Chapter 31), basic_string is not meant to be used as a base class and offers move semantics so that it can be efficiently returned by value.

36.3.1. string vs. C-Style Strings

I assume some familiarity with string from the many examples in this book, so I start with a few examples contrasting string use with the use of C-style strings (§43.4) which are popular with programmers primarily familiar with C and C-style C++.

Consider making up an email address by concatenating a user identifier and a domain name:

string address(const string& identifier, const string& domain)
{
return identifier + '@' + domain;
}

void test()
{
string t = address("bs","somewhere");
cout << t << '\n';
}

This is trivial. Now consider a plausible C-style version. A C-style string is a pointer to an array of zero-terminated characters. The user controls allocation and is responsible for deallocation:

char* address(const char* identifier, const char* domain)
{
    size_t iden_len = strlen(identifier);
    size_t dom_len = strlen(domain);
    char* addr = (char*)malloc(iden_len+dom_len+2); // remember space for 0 and '@'
    strcpy(addr,identifier); // strcpy(destination,source)
    addr[iden_len] = '@';
    strcpy(addr+iden_len+1,domain);
    return addr;
}


void test2()
{
char* t = address("bs","somewhere");
printf("%s\n",t);
free(t);
}

Did I get that right? I hope so. At least it gave the output I expected. Like most experienced C programmers, I got the C version correct (I hope) the first time, but there are a lot of details to get right. However, experience (i.e., error logs) shows that this is not always the case. Often, such simple programming tasks are given to relative novices who still don’t know all the techniques needed to get it right. The implementation of the C-style address() contains a lot of tricky pointer manipulation, and its use requires the caller to remember to free the returned memory. Which code would you prefer to maintain?

Sometimes, it is claimed that C-style strings are more efficient than strings. However, for most uses, the string does fewer allocations and deallocations than a C-style equivalent (because of the small-string optimization and move semantics; §19.3.3, §19.3.1). Also, strlen() is an O(N) operation (it must scan for the terminating zero), whereas string::size() is a simple read. In the example, this implies that the C-style code traverses each input string twice, whereas the string version does only one traversal per input. Efficiency concerns at this level are often misguided, but the string version has a fundamental edge.

The fundamental difference between C-style strings and string is that string is a proper type with conventional semantics, whereas the C-style string is a set of conventions supported by a few useful functions. Consider assignment and comparison:

void test3()
{
string s1 = "Ring";
if (s1!="Ring") insanity();
if (s1<"Opera") cout << "check";
string s2 = address(s1,"Valkyrie");


char s3[] = "Ring";
if (strcmp(s3,"Ring")!=0) insanity();
if (strcmp(s3,"Opera")<0) cout << "check";
char* s4 = address(s3,"Valkyrie");
free(s4);
}

Finally, consider sorting:

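int cmp_cstr(const void* a, const void* b) // helper: qsort() passes pointers to the elements
{
    return strcmp(*static_cast<const char* const*>(a),*static_cast<const char* const*>(b));
}

void test4()
{
    vector<string> vs = {"Grieg", "Williams", "Bach", "Handel" };
    sort(vs.begin(),vs.end()); // assuming that I haven't defined sort(vs)

    const char* as[] = {"Grieg", "Williams", "Bach", "Handel" };
    qsort(as,sizeof(as)/sizeof(*as),sizeof(*as),cmp_cstr); // (array, number of elements, element size, comparison)
}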

The C-style string sort function qsort() is presented in §43.7. Again, sort() is as fast as (and typically much faster than) qsort(), so there is no performance reason to choose the lower-level, more verbose, and less maintainable programming style.

36.3.2. Constructors

A basic_string offers a bewildering variety of constructors:

[Table not reproduced: the basic_string constructors and destructor.]

The most common variants are also the simplest:

string s0; // the empty string
string s1 {"As simple as that!"}; // construct from C-style string
string s2 {s1}; // copy constructor

Almost always, the destructor is implicitly invoked.

There is no string constructor that takes only a number of elements:

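string s3(7); // error: no string(int)
string s4('a'); // error: no string(char)
string s5(7,'a'); // OK: 7 'a's
string s6(0); // danger: passing nullptr
// note the ()s: with {}s, the initializer-list constructor is preferred, so string{7,'a'} has two characters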

The declaration of s6 shows a mistake sometimes made by programmers used to C-style strings:

const char* p = 0; // set p to "no string"

Unfortunately, the compiler cannot catch the definition of s6 or the even nastier case of a const char* holding the nullptr:

string s6(0); // danger: passing nullptr
string s7 {p}; // may or may not be OK depending on the value of p
string s8 {"OK"}; // OK: pass pointer to C-style string

Don’t try to initialize a string with a nullptr. At best, you get a nasty run-time error. At worst, you get mysterious undefined behavior.
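
One way to keep a possibly null const char* away from the string constructor is to check it first. The following is a minimal sketch; the name to_string_or_empty is hypothetical, and mapping "no string" to the empty string is an assumption that may not suit every program.

```cpp
#include <string>

// Hypothetical helper: treat a null pointer as "no string" and map it to
// the empty string instead of handing it to the string constructor.
std::string to_string_or_empty(const char* p)
{
    return p ? std::string{p} : std::string{};
}
```

A caller that must distinguish "no string" from the empty string would need a different design, for example testing the pointer before calling.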

If you try to construct a string with more characters than your implementation can handle, the constructor throws std::length_error. For example:

string s9(string::npos,'x'); // throws length_error (note the ()s: {}s would select the initializer-list constructor)

The value string::npos represents a position beyond a string’s length and is generally used to mean “the end of the string.” For example:

string ss {"Fleetwood Mac"};
string ss2 {ss,0,9}; // "Fleetwood"
string ss3 {ss,10,string::npos}; // "Mac"

Note that the substring notation is (position,length) rather than [start,end).
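
For code that thinks in half-open ranges, the translation to the (position,length) form can be wrapped once. This is a small sketch; the name slice is hypothetical and the function assumes b<=e.

```cpp
#include <string>

// Hypothetical helper: copy the half-open range [b:e) of s by translating
// it into the (position,length) form that the string constructor expects.
// Assumes b<=e; the constructor throws std::out_of_range if b>s.size().
std::string slice(const std::string& s, std::string::size_type b, std::string::size_type e)
{
    return std::string(s, b, e - b);
}
```

Because the constructor clamps the length to the end of the string, slice(s,10,string::npos) yields everything from position 10 onward.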

There are no literals of type string. A user-defined literal could be used for that (§19.2.6), for example, "The Beatles"s and "Elgar"s. Note the s suffix.

36.3.3. Fundamental Operations

A basic_string offers comparisons, control of size and capacity, and access operations.

[Table not reproduced: the basic_string comparison operations ==, !=, <, <=, >, and >=.]

For more comparison operations, see §36.3.8.

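The size and capacity mechanisms for basic_string are the same as those for vector (§31.3.3):

[Table not reproduced: the basic_string size and capacity operations size(), length(), max_size(), resize(), shrink_to_fit(), reserve(), capacity(), clear(), and empty().]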

A resize() or reserve() that would cause size() to exceed max_size() will throw std::length_error.

An example:

void fill(istream& in, string& s, int max)
    // use s as target for low-level input (simplified)
{
    s.resize(max); // make sure there are max characters to read into
    in.read(&s[0],max);
    const auto n = in.gcount(); // number of characters read
    s.resize(n); // keep only the characters actually read
    s.shrink_to_fit(); // discard excess capacity
}

Note that a reserve() alone would not do here: it changes only capacity(), and writing to characters beyond size() is undefined. It would also be sloppy to forget to make use of the number of characters read.

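[Table not reproduced: the basic_string access operations, including [], at(), front(), back(), push_back(), pop_back(), +=, +, c_str(), data(), copy(), and swap().]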

An out-of-range access using at() throws std::out_of_range. A +=(), push_back(), or + that would cause size() to exceed max_size() will throw std::length_error.

There is no implicit conversion of a string to a char*. That was tried in many places and found to be error-prone. Instead, the standard library provides the explicit conversion function c_str() to const char*.

A string can contain a zero value character (e.g., '\0'). Using a function, such as strcmp(), that assumes C-style string conventions on the result of s.c_str() or s.data() on a string containing a zero character may cause surprise.
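
The surprise is easy to reproduce. In this sketch (the function names are hypothetical), a string of five characters looks like a string of two characters to a C-style function:

```cpp
#include <cstring>
#include <string>

// Build a 5-character string whose third character is '\0'.
// The length is passed explicitly; without it, the constructor
// would stop at the first zero.
std::string with_embedded_zero()
{
    return std::string("ab\0cd", 5);
}

// What a C-style function sees: only the characters before the first zero.
std::size_t c_style_length(const std::string& s)
{
    return std::strlen(s.c_str());
}
```

Here with_embedded_zero().size() is 5, whereas c_style_length(with_embedded_zero()) is 2.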

36.3.4. String I/O

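A basic_string can be written using << (§38.4.2) and read into using >> (§38.4.1):

[Table not reproduced: the basic_string I/O operations in>>s, out<<s, getline(in,s,d), and getline(in,s).]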

An input operation that would cause size() to exceed max_size() will throw std::length_error.

A getline() removes its terminator character (by default '\n') from the input stream but does not enter it into the string. This simplifies handling of lines. For example:

vector<string> lines;
for (string s; getline(cin,s);)
lines.push_back(s);

The string I/O operations all return a reference to their input stream, so that operations can be chained. For example:

string first_name;
string second_name;
cin >> first_name >> second_name;

The string target of an input operation is set to empty before reading and expands to hold the characters read. A read operation can also be terminated by reaching end-of-file (§38.3).
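
The three-argument getline() can use a terminator other than '\n'. As a minimal sketch, the following splits a string into fields; the name split is hypothetical, and reading from an istringstream rather than cin is an assumption made to keep the example self-contained.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: split text into fields separated by delim.
// Each getline() removes the delimiter from the stream but does not
// store it in the field; the loop ends at end-of-input.
std::vector<std::string> split(const std::string& text, char delim)
{
    std::istringstream in{text};
    std::vector<std::string> fields;
    for (std::string s; std::getline(in, s, delim);)
        fields.push_back(s);
    return fields;
}
```

For example, split("a,b,,c",',') yields the four fields "a", "b", "", and "c".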

36.3.5. Numeric Conversions

In <string>, the standard library provides a set of functions for extracting numeric values from their character representation in a string or wstring (note: not a basic_string<C,Tr,A>). The desired numeric types are encoded in the function names:

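[Tables not reproduced: the numeric conversion functions stoi(), stol(), stoul(), stoll(), stoull(), stof(), stod(), stold(), to_string(), and to_wstring().]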

Each of these sto* (String to) functions has three variants, like stoi. For example:

string s = "123.45";
auto x1 = stoi(s); // x1 = 123
auto x2 = stod(s); // x2 = 123.45

The second argument of a sto* function is a pointer used to indicate how far into the string the search for a numeric value progressed. For example:

string ss = "123.4567801234";
size_t dist = 0; // put number of characters read here
auto x = stoi(ss,&dist); // x = 123 (an int)
++dist; // ignore the dot
auto y = stoll(&ss[dist]); // y = 4567801234 (a long long)

This is not my favorite interface for parsing several numbers from a string. I prefer to use a stringstream (§38.2.2).
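
A stream-based alternative for reading several numbers might look like the sketch below. The name parse_numbers is hypothetical, and the input is assumed to be whitespace-separated.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: read whitespace-separated numbers from text
// until an extraction fails (non-numeric input or end of input).
std::vector<double> parse_numbers(const std::string& text)
{
    std::istringstream in{text};
    std::vector<double> result;
    for (double d; in >> d;)
        result.push_back(d);
    return result;
}
```

The stream keeps track of the position, so no explicit character count has to be carried from one conversion to the next.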

Initial whitespace is skipped. For example:

string s = " 123.45";
auto x1 = stoi(s); // x1 = 123

The base argument can be in the range [2:36] with the 0123456789abcdefghijklmnopqrstuvwxyz used as “digits” with their value determined by their position in this sequence. Any further base value will be an error or an extension. For example:

string s4 = "149F";
auto x5 = stoi(s4); // x5 = 149
auto x6 = stoi(s4,nullptr,10); // x6 = 149
auto x7 = stoi(s4,nullptr,8); // x7 = 014
auto x8 = stoi(s4,nullptr,16); // x8 = 0x149F

string s5 = "1100101010100101"; // binary
auto x9 = stoi(s5,nullptr,2); // x9 = 0xcaa5

If a conversion function doesn’t find characters in its string argument that it can convert to a number, it throws invalid_argument. If it finds a number that it cannot represent in its target type, it throws out_of_range; in addition, the conversions to floating-point types set errno to ERANGE (§40.3). For example:

stoi("Hello, World!"); // throws std::invalid_argument
stoi("12345678901234567890"); // throws std::out_of_range; errno=ERANGE
stof("123456789e1000"); // throws std::out_of_range; errno=ERANGE
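
When a failed conversion is an expected event rather than an error, the two exceptions can be handled close to the call. The following is a sketch only: the name stoi_or and the policy of returning a fallback value are assumptions, not part of the standard library.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: convert all of s to an int, or return fallback if
// s holds no number, holds a number that does not fit in an int, or has
// characters left over after the number.
int stoi_or(const std::string& s, int fallback)
{
    try {
        std::size_t used = 0;       // number of characters consumed by stoi()
        const int value = std::stoi(s, &used);
        return used == s.size() ? value : fallback;   // reject trailing characters such as "12abc"
    }
    catch (const std::invalid_argument&) { return fallback; }
    catch (const std::out_of_range&) { return fallback; }
}
```

The check against s.size() uses the second argument of stoi() to detect input that is only partly numeric.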

The sto* functions encode their target type in their names. This makes them unsuitable for generic code where the target can be a template parameter. In such cases, consider to<X> (§25.2.5.1).
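
A generic extraction function in the spirit of to<X> might be sketched with a string stream. This is not the definition from §25.2.5.1; the single template parameter, the use of runtime_error, and the rejection of trailing characters are all assumptions made for this illustration.

```cpp
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical sketch: extract a value of type Target from s using >>.
// Throws std::runtime_error if no value can be read or if anything other
// than whitespace follows the value.
template<typename Target>
Target to(const std::string& s)
{
    std::istringstream in{s};
    Target t;
    if (!(in >> t)) throw std::runtime_error{"to<>(): no value found"};
    in >> std::ws;                  // skip trailing whitespace, if any
    if (!in.eof()) throw std::runtime_error{"to<>(): trailing characters"};
    return t;
}
```

Here the target type is a template argument, as in to<int>("42") or to<double>("2.5"), so the function can be called from generic code.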

36.3.6. STL-like Operations

The basic_string provides the usual set of iterators:

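[Table not reproduced: the basic_string iterator operations begin(), end(), cbegin(), cend(), rbegin(), rend(), crbegin(), and crend().]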

Because string has the required member types and the functions for obtaining iterators, strings can be used together with the standard algorithms (Chapter 32). For example:

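void f(string& s)
{
    // a lambda avoids ambiguity between the <cctype> and <locale> versions of islower()
    auto p = find_if(s.begin(),s.end(),[](unsigned char c) { return islower(c)!=0; });
    // ...
}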

The most common operations on strings are supplied directly by string. Hopefully, these versions will be optimized for strings beyond what would be easy to do for general algorithms.

The standard algorithms (Chapter 32) are not as useful for strings as one might think. General algorithms tend to assume that the elements of a container are meaningful in isolation. This is typically not the case for a string.
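
Algorithms do fit when each character really can be handled in isolation, as in a simple case conversion. The following is a minimal sketch; the name to_upper is hypothetical, and conversion is done per character with toupper() in the current C locale.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper: return a copy of s with every character uppercased.
// The lambda takes unsigned char because toupper() must not be given a
// negative value.
std::string to_upper(std::string s)
{
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::toupper(c)); });
    return s;
}
```

Operations whose meaning depends on neighboring characters, such as finding a word or a multi-character encoding unit, are typically better served by the string's own member functions.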

A basic_string offers complex assign()s:

[Table not reproduced: the basic_string assign() operations.]

We can insert(), append(), and erase() in a basic_string:

[Table not reproduced: the basic_string insert(), append(), and erase() operations.]

For example:

void add_middle(string& s, const string& middle) // add middle name
{
auto p = s.find(' ');
s.insert(p,' '+middle);
}


void test()
{
string dmr = "Dennis Ritchie";
add_middle(dmr,"MacAlistair");
cout << dmr << '\n';
}

As for vectors, append()ing (adding characters at the end) is typically more efficient than insert()ing elsewhere.
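
If the name contains no space, find() returns string::npos and an insert() at that position throws out_of_range. A variant that falls back to append() could be sketched as follows; the name with_middle and the fallback policy are assumptions.

```cpp
#include <string>

// Hypothetical variant of add_middle(): returns the extended name by value.
// If name has no space, the middle name is appended instead of inserted.
std::string with_middle(std::string name, const std::string& middle)
{
    const auto p = name.find(' ');
    if (p == std::string::npos)
        name.append(' ' + middle);  // no space found: add at the end
    else
        name.insert(p, ' ' + middle);   // insert before the first space
    return name;
}
```

For example, with_middle("Dennis Ritchie","MacAlistair") gives "Dennis MacAlistair Ritchie".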

In the following, I use s[b:e) to denote a sequence of elements [b:e) in s:

[Tables not reproduced: the basic_string replace() operations.]

The replace() replaces one substring with another and adjusts the string’s size accordingly. For example:

void f()
{
    string s = "but I have heard it works even if you don't believe in it";
    s.replace(0,4,""); // erase initial "but "
    s.replace(s.find("even"),4,"only");
    s.replace(s.find(" don't"),6,""); // erase by replacing with ""
    assert(s=="I have heard it works only if you believe in it");
}

Code relying on “magic” constants like the number of characters to be replaced is error-prone.
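
One way to avoid such constants is to let the search string supply the length. This is a sketch; the name replace_first is hypothetical, and leaving the string unchanged when nothing is found is an assumption.

```cpp
#include <string>

// Hypothetical helper: return a copy of s in which the first occurrence
// of from has been replaced by to. If from does not occur, the copy is
// returned unchanged.
std::string replace_first(std::string s, const std::string& from, const std::string& to)
{
    const auto pos = s.find(from);
    if (pos != std::string::npos)
        s.replace(pos, from.size(), to);    // length comes from from, not from a literal
    return s;
}
```

With this helper, replace_first(s,"even","only") needs no separately maintained character count, and a failed search no longer makes replace() throw.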

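A replace() returns a reference to the object for which it was called. This can be used for chaining operations. Beware, though, that before C++17 the order of evaluation in such a chain is not fully specified: a find() argument may be evaluated before an earlier replace() has modified the string, so the following example is reliable only from C++17 onward: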

void f2()
{
string s = "but I have heard it works even if you don't believe in it";
s.replace(0,4,"").replace(s.find("even"),4,"only").replace(s.find(" don't"),6,"");
assert(s=="I have heard it works only if you believe in it");
}

36.3.7. The find Family

There is a bewildering variety of functions for finding substrings. As usual, find() searches from s.begin() onward, whereas rfind() searches backward from s.end(). The find functions use string::npos (“not a position”) to represent “not found.”

[Table not reproduced: the basic_string find() and rfind() operations.]

For example:

void f()
{
string s {"accdcde"};


auto i1 = s.find("cd"); // i1==2 s[2]=='c' && s[3]=='d'
auto i2 = s.rfind("cd"); // i2==4 s[4]=='c' && s[5]=='d'
}

The find_*_of() functions differ from find() and rfind() by looking for a single character, rather than a whole sequence of characters:

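[Table not reproduced: the basic_string find_first_of(), find_last_of(), find_first_not_of(), and find_last_not_of() operations.]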

For example:

string s {"accdcde"};

auto i1 = s.find("cd"); // i1==2 s[2]=='c' && s[3]=='d'
auto i2 = s.rfind("cd"); // i2==4 s[4]=='c' && s[5]=='d'

auto i3 = s.find_first_of("cd"); // i3==1 s[1]=='c'
auto i4 = s.find_last_of("cd"); // i4==5 s[5]=='d'
auto i5 = s.find_first_not_of("cd"); // i5==0 s[0]!='c' && s[0]!='d'
auto i6 = s.find_last_not_of("cd"); // i6==6 s[6]!='c' && s[6]!='d'
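
A common use of the find_*_not_of() functions is trimming. The following is a minimal sketch; the name trim and the default set of whitespace characters are assumptions.

```cpp
#include <string>

// Hypothetical helper: return s without leading and trailing characters
// from the set whitespace.
std::string trim(const std::string& s, const std::string& whitespace = " \t\n\r")
{
    const auto first = s.find_first_not_of(whitespace);
    if (first == std::string::npos) return {};      // nothing but whitespace
    const auto last = s.find_last_not_of(whitespace);
    return s.substr(first, last - first + 1);       // (position,length)
}
```

The test against string::npos is essential: for a string consisting only of whitespace there is no position to start from.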

36.3.8. Substrings

A basic_string offers a low-level notion of substring:

[Table not reproduced: the basic_string substr() operations.]

Note that substr() creates a new string:

void user()
{
string s = "Mary had a little lamb";
string s2 = s.substr(0,4); // s2 == "Mary"
s2 = "Rose"; // does not change s
}

We can compare substrings:

[Table not reproduced: the basic_string compare() operations.]

For example:

void f()
{
string s = "Mary had a little lamb";
string s2 = s.substr(0,4); // s2 == "Mary"
auto i1 = s.compare(s2); // i1 is positive
auto i2 = s.compare(0,4,s2); // i2==0
}

This explicit use of constants to denote positions and lengths is brittle and error-prone.
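
The constants can often be derived from the strings involved. As a sketch, prefix and suffix tests can be written with compare() so that no literal position or length appears; the names starts_with and ends_with are hypothetical helpers, not members of the string described here.

```cpp
#include <string>

// Hypothetical helper: does s begin with prefix?
// compare() clamps the length to the end of s, so a short s is safe.
bool starts_with(const std::string& s, const std::string& prefix)
{
    return s.compare(0, prefix.size(), prefix) == 0;
}

// Hypothetical helper: does s end with suffix?
// The size test keeps the start position from wrapping around.
bool ends_with(const std::string& s, const std::string& suffix)
{
    return s.size() >= suffix.size()
        && s.compare(s.size() - suffix.size(), suffix.size(), suffix) == 0;
}
```

For example, starts_with(s,"Mary") is true for "Mary had a little lamb", and the 4 in s.compare(0,4,s2) no longer has to be kept in sync with the text by hand.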

36.4. Advice

[1] Use character classifications rather than handcrafted checks on character ranges; §36.2.1.

[2] If you implement string-like abstractions, use char_traits to implement operations on characters; §36.2.2.

[3] A basic_string can be used to make strings of characters of any type; §36.3.

[4] Use strings as variables and members rather than as base classes; §36.3.

[5] Prefer string operations to C-style string functions; §36.3.1.

[6] Return strings by value (rely on move semantics); §36.3.2.

[7] Use string::npos to indicate “the rest of the string”; §36.3.2.

[8] Do not pass a nullptr to a string function expecting a C-style string; §36.3.2.

[9] A string can grow and shrink, as needed; §36.3.3.

[10] Use at() rather than iterators or [] when you want range checking; §36.3.3, §36.3.6.

[11] Use iterators and [] rather than at() when you want to optimize speed; §36.3.3, §36.3.6.

[12] If you use strings, catch length_error and out_of_range somewhere; §36.3.3.

[13] Use c_str() to produce a C-style string representation of a string (only) when you have to; §36.3.3.

[14] string input is type sensitive and doesn’t overflow; §36.3.4.

[15] Prefer a stringstream or a generic value extraction function (such as to<X>) over direct use of the sto* numeric conversion functions; §36.3.5.

[16] Use the find() operations to locate values in a string (rather than writing an explicit loop); §36.3.7.

[17] Directly or indirectly, use substr() to read substrings and replace() to write substrings; §36.3.8.