Regular Expressions - The Standard Library - The C++ Programming Language (2013)

The C++ Programming Language (2013)

Part IV: The Standard Library

37. Regular Expressions

If the code and the comments disagree, then both are probably wrong.

– Norm Schryer

Regular Expressions

Regular Expression Notation

regex

Match Results; Formatting

Regular Expression Functions

regex_match(); regex_search(); regex_replace()

Regular Expression Iterators

regex_iterator; regex_token_iterator

regex_traits

Advice

37.1. Regular Expressions

In <regex>, the standard library provides regular expressions:

regex_match(): Match a regular expression against a string (of known size).

regex_search(): Search for a string that matches a regular expression in an (arbitrarily long) stream of data.

regex_replace(): Search for strings that match a regular expression in an (arbitrarily long) stream of data and replace them.

regex_iterator: iterate over matches and submatches.

regex_token_iterator: iterate over non-matches.

The result of a regex_search() is a collection of matches, typically represented as an smatch:

void use()
{
ifstream in("file.txt"); //
input file
if (!in) cerr << "no file\n";

regex pat {R"(\w{2}\s*\d{5}(–\d{4})?)"}; // U.S. postal code pattern

int lineno = 0;
for (string line; getline(in,line);) {

++lineno;
smatch matches; //
matched strings go here
if (regex_search(line, matches, pat)) {
cout << lineno << ": " << matches[0] << '\n'; //
the complete match
if (1<matches.size() && matches[1].matched)
cout << "\t: " << matches[1] << '\n';//
submatch
}
}

}

This function reads a file looking for U.S. postal codes, such as TX77845 and DC 20500–0001. An smatch type is a container of regex results. Here, matches[0] is the whole pattern and matches[1] is the optional four-digit subpattern. I used a raw string (§7.3.2.1) which is particularly suitable for regular expressions because they tend to contain a lot of backslashes. Had I used a conventional string, the pattern definition would have been:

regex pat {"\\w{2}\\s*\\d{5}(–\\d{4})?"}; // U.S. postal code pattern

The regular expression syntax and semantics are designed so that regular expressions can be compiled into state machines for efficient execution [Cox,2007]. The regex type performs this compilation at run time.

37.1.1. Regular Expression Notation

The regex library can recognize several variants of the notation for regular expressions (§37.2). Here, I first present the default notation used, a variant of the ECMA standard used for ECMAScript (more commonly known as JavaScript).

The syntax of regular expressions is based on characters with special meaning:

Image

For example, we can specify a line starting with zero or more As followed by one or more Bs followed by an optional C like this:

^A*B+C?$

Examples that match:

AAAAAAAAAAAABBBBBBBBBC
BC
B

Examples that do not match:

AAAAA // no B
AAAABC // initial space
AABBCC // too many Cs

A part of a pattern is considered a subpattern (which can be extracted separately from an smatch) if it is enclosed in parentheses.

A pattern can be optional or repeated (the default is exactly once) by adding a suffix:

Image

For example:

A{3}B{2,4}C*

Examples that match:

AAABBC
AAABBB

Example that do not match:

AABBC // too few As
AAABC // too few Bs
AAABBBBBCCC // too many Bs

A suffix ? after any of the repetition notations makes the pattern matcher “lazy” or “non-greedy.” That is, when looking for a pattern, it will look for the shortest match rather than the longest. By default, the pattern matcher always looks for the longest match (similar to C++’s Max Munch rule; §10.3). Consider:

ababab

The pattern (ab)* matches all of ababab. However, (ab)*? matches only the first ab.

The most common character classifications have names:

Image

Several character classes are supported by shorthand notation:

Image

In addition, languages supporting regular expressions often provide:

Image

For full portability, use the character class names rather than these abbreviations.

As an example, consider writing a pattern that describes C++ identifiers: an underscore or a letter followed by a possibly empty sequence of letters, digits, or underscores. To illustrate the subtleties involved, I include a few false attempts:

[:alpha:][:alnum:]* // wrong: characters from the set ":alph" followed by ...
[[:alpha:]][[:alnum:]]* // wrong: doesn't accept underscore ('_' is not alpha)
([[:alpha:]]|_)[[:alnum:]]* // wrong: underscore is not part of alnum either
([[:alpha:]]|_)([[:alnum:]]|_)* // OK, but clumsy
[[:alpha:]_][[:alnum:]_]* // OK: include the underscore in the character classes
[_[:alpha:]][_[:alnum:]]* // also OK
[_[:alpha:]]\w* // \w is equivalent to [_[:alnum:]]

Finally, here is a function that uses the simplest version of regex_match()37.3.1) to test whether a string is an identifier:

bool is_identifier(const string& s)
{
regex pat {"[_[:alpha:]]\\w*"};
return regex_match(s,pat);
}

Note the doubling of the backslash to include a backslash in an ordinary string literal. As usual, backslashes can also denote special characters:

Image

To add to the opportunities for confusion, two further logically different uses of the backslash are provided:

Image

Using raw string literals alleviates many problems with special characters. For example:

bool is_identifier(const string& s)
{
regex pat {R"([_[:alpha:]]\w*)"};
return regex_match(s,pat);
}

Here are some examples of patterns:

Ax* // A, Ax, Axxxx
Ax+ // Ax, Axxx Not A
\d–?\d // 1-2, 12 Not 1--2
\w{2}–\d{4,5} // Ab-1234, XX-54321, 22-5432 Digits are in \w
(\d*:)?(\d+) // 12:3, 1:23, 123, :123 Not 123:
(bs|BS) // bs, BS Not bS
[aeiouy] // a, o, u An English vowel, not x
[^aeiouy] // x, k Not an English vowel, not e
[a^eiouy] // a, ^, o, u An English vowel or ^

A group (a subpattern) potentially to be represented by a sub_match is delimited by parentheses. If you need parentheses that should not define a subpattern, use (? rather than plain (. For example:

(\s|:|,)*(\d*) // spaces, colons, and/or commas followed by a number

Assuming that we were not interested in the characters before the number (presumably separators), we could write:

(?\s|:|,)*(\d*) // spaces, colons, and/or commas followed by a number

This would save the regular expression engine from having to store the first characters: the (? variant has only one subpattern.

Image

That last pattern is useful for parsing XML. It finds tag/end-of-tag markers. Note that I used a non-greedy match (a lazy match), .*?, for the subpattern between the tag and the end tag. Had I used plain .*, this input would have caused a problem:

Always look for the <b>bright</b> side of <b>life</b>.

A greedy match for the first subpattern would match the first < with the last >. A greedy match on the second subpattern would match the first <b> with the last </b>. Both would be correct behavior, but unlikely what the programmer wanted.

It is possible to vary details of the regular expression notation using options (§37.2). For example, if regex_constants::grep is used, a?x:y is a sequence of five ordinary characters because ? does not mean “optional” in grep.

For a more exhaustive presentation of regular expressions, see [Friedl,1997].

37.2. regex

A regular expression is a matching engine (usually a state machine) constructed from a sequence of characters, such as a string:

template<class C, class traits = regex_traits<C>>
class basic_regex {
public:
using value_type = C;
using traits_type = traits;
using string_type = typename traits::string_type;
using flag_type = regex_constants::syntax_option_type;
using locale_type = typename traits::locale_type;

~basic_regex(); //not virtual; basic_regex is not meant to be used as a base class
// ...
};

The regex_traits are presented in §37.5.

Like string, regex is an alias for the version that uses chars:

using regex = basic_regex<char>;

The meaning of regular expression patterns is controlled by syntax_option_type constants defined identically in regex_constants and regex:

Image

Use the default unless you have a good reason not to. Good reasons include a large body of existing regular expressions in a non-default notation.

A regex object can be constructed from a string or similar sequence of characters:

Image

The main use of regex is through the search, match, and replace functions (§37.3), but there are also a few operations on regex itself:

Image

You can determine the locale or a regex by a call of getloc() and learn what flags are used from flags(), but unfortunately there is no (standard) way of reading out its pattern. If you need to output a pattern, keep a copy of the string used to initialize. For example:

regex pat1 {R"(\w+\d*)"}; // no way of outputting the pattern in pat1

string s {R"(\w+\d*)"};
regex pat2 {s};
cout << s << '\n'; //
the pattern in pat2

37.2.1. Match Results

Results from a regular expression match are gathered in a match_results object which contains one or more sub_match objects:

template<class Bi>
class sub_match : public pair<Bi,Bi> {
public:
using value_type = typename iterator_traits<Bi>::value_type;
using difference_type = typename iterator_traits<Bi>::difference_type;
using iterator = Bi;
using string_type = basic_string<value_type>;


bool matched; // true if *this contains a match
// ...
};

Bi must be a bidirectional iterator (§33.1.2). A sub_match can be seen as a pair of iterators into the string being matched.

Image

For example:

regex pat ("<(.*?)>(.*?)</(.*?)>");

string s = "Always look for the <b> bright </b> side of <b> death </b>";

if (regex_search(s1,m,p2))
if (m[1]==m[3]) cout << "match\n";

The output is match.

A match_results is a container of sub_matches:

template<class Bi, class A = allocator<sub_match<Bi>>
class match_results {
public:
using value_type = sub_match<Bi>;
using const_reference = const value_type&;
using reference = const_reference;
using const_iterator = /*
implementation-defined */;
using iterator = const_iterator;
using difference_type = typename iterator_traits<Bi>::difference_type;
using size_type = typename allocator_traits<A>::size_type;
using allocator_type = A;
using char_type = typename iterator_traits<Bi>::value_type;
using string_type = basic_string<char_type>;


~match_results(); // not virtual

// ...
};

Bi must be a bidirectional iterator (§33.1.2).

As for basic_string and basic_ostream, a few standard aliases are provided for the most common match_results:

using cmatch = match_results<const char*>; // C-style string
using wcmatch = match_results<const wchar_t*>; // wide C-style string
using smatch = match_results<string::const_iterator>; // string
using wsmatch = match_results<wstring::const_iterator>; // wstring

A match_results provides access to its match string, its sub_matches, and the characters before and after the match:

Image

A match_results provides a conventional set of operations:

Image

Image

We can subscript a regex_match to access a sub_match, for example, m[i]. If a subscript, i, refers to a nonexistent sub_match, the result of m[i] represents an unmatched sub_match. For example:

void test()
{
regex pat ("(AAAA)(BBB)?");
string s = "AAAA";
smatch m;
regex_search(s,m,pat);


cout << boolalpha;
cout << m[0].matched << '\n'; //
true: we found a match
cout << m[1].matched << '\n'; // true: there was a first sub_match
cout << m[2].matched << '\n'; // false: no second sub_match
cout << m[3].matched << '\n'; // false: there couldn't be a third sub_match for pat
}

37.2.2. Formatting

In regex_replace(), formatting is done using a format() function:

Image

Formats can contain formatting characters:

Image

For an example, see §37.3.3.

The details of formatting done by format() are controlled by a set of options (flags):

Image

37.3. Regular Expression Functions

The functions for applying regular expression patterns to data are regex_search() for searching in a sequence of characters, regex_match() for matching a fixed-length sequence of characters, and regex_replace() for doing replacement of patterns.

The details of matching are controlled by a set of options (flags):

Image

37.3.1. regex_match()

To look for a pattern matching a whole sequence with a known length, such as a line of text, use regex_match():

Image

As an example, consider a naive program for validating the format of a table. If the table format is as expected, the program writes “all is well” to cout; if not, it writes error messages to cerr. A table is a series of rows, each with four tab-separated fields, except for the first (title row) which may have only three fields. For example:

Image

The numbers are supposed to add up horizontally and vertically.

The program reads the title line and then does the sums for each line until it reaches the final line labeled “Total”:

int main()
{
ifstream in("table.txt"); //
input file
if (!in) cerr << "no file\n";

string line; // input buffer
int lineno = 0;

regex header {R"(^[\w ]+(\t[\w ]+)*$)"}; // tab-separated words
regex row {R"(^([\w ]+)(\t\d+)(\t\d+)(\t\d+)$)"}; // label followed by three tab-separated numbers

if (getline(in,line)) { // check and discard the header line
smatch matches;
if (!regex_match(line,matches,header))
cerr << "no header\n";
}


int boys = 0; // running totals
int girls = 0;

while (getline(in,line)) {
++lineno;
smatch matches;
// submatches go here

if (!regex_match(line,matches,row))
cerr << "bad line: " << lineno << '\n';


int curr_boy = stoi(matches[2]); // for stoi() see §36.3.5
int curr_girl = stoi(matches[3]);
int curr_total = stoi(matches[4]);
if (curr_boy+curr_girl != curr_total) cerr << "bad row sum \n";


if (matches[1]=="Total") { // last line
if (curr_boy != boys) cerr << "boys do not add up\n";
if (curr_girl != girls) cerr << "girls do not add up\n";
cout << "all is well\n";
return 0;
}
boys += curr_boy;
girls += curr_girl;
}


cerr << "didn't find total line\n")
return 1;
}

37.3.2. regex_search()

To look for a pattern in a part of a sequence, such as a file, use regex_search():

Image

For example, I could look for some of the more popular misspellings of my name like this:

regex pat {"[Ss]tro?u?v?p?stra?o?u?p?b?"};

smatch m;
for (string s; cin>>s; )
if (regex_search(s,m,pat))
if (m[0]!="stroustrup" && m[0]!="Stroustrup")
cout << "Found: " << m[0] << '\n';

Given suitable input, this will output misspellings of Stroustrup, such as:

Found: strupstrup
Found: Strovstrup
Found: stroustrub
Found: Stroustrop

Note that regex_search() will find its pattern even if it is “hidden” among other characters. For example, it will find strustrub in abstrustrubal. If you want to match a pattern against every character in an input string, use regex_match37.3.1).

37.3.3. regex_replace()

To make simple substitutions of a pattern in a part of a sequence, such as a file, use regex_replace():

Image

Copying a format is done using the regex’s format()37.2.2) with the $ prefix notation, for example, $& for the match and $2 for the second submatch. Here is a little test program that takes a string of word and number pairs and outputs them as {word,number}, one per line:

void test1()
{
string input {"x 1 y2 22 zaq 34567"};
regex pat {"(\w+)\s(\d+)"}; //
word space number
string format {"{$1,$2}\n"};

cout << regex_replace(input,pat,format);
}

The output is:

{x,1}
{y2,22}
{zaq,34567}

Note the annoying “spurious” spaces at the beginning of the lines. By default, regex_match() copies unmatched characters to its output, so the two spaces that were not matched by pat are printed.

To eliminate those spaces, we can use the format_no_copy option (§37.2.2):

cout << regex_replace(input,pat,format,regex_constants::format_no_copy);

Now we get:

{x,1}
{y2,22}
{zaq,34567}

Submatches do not have to be output in order:

void test2()
{
string input {"x 1 y 2 z 3"};
regex pat {"(\w)\s(\d+)"}; //
word space number
string format {"$2: $1\n"};

cout << regex_replace(input,pat,format,regex_constants::format_no_copy);
}

Now we get:

1: x
22: y2
34567: zeq

37.4. Regular Expression Iterators

The regex_search() function allows us to find a single occurrence of a pattern in a data stream. What if we wanted to find and do something to all such occurrences? If the data is organized as a sequence of easily recognized lines or records, we can iterate over those and use regex_match()for each. If what we want to do with each occurrence of a pattern is a simple substitution, we can use regex_replace(). If we want to iterate over a sequence of characters doing something for each occurrence of a pattern, we use a regex_iterator.

37.4.1. regex_iterator

A regex_iterator is a bidirectional iterator that searches a sequence for the next match of a pattern when incremented:

template<class Bi,
class C = typename iterator_traits<Bi>::value_type,
class Tr = typename regex_traits<C>::type>
class regex_iterator {
public:
using regex_type = basic_regex<C,Tr>;
using value_type = match_results<Bi>;
using difference_type = ptrdiff_t;
using pointer = const value_type*;
using reference = const value_type&;
using iterator_category = forward_iterator_tag;
//
...
}

The regex_traits are described in §37.5.

The usual set of aliases is provided:

using cregex_iterator = regex_iterator<const char*>;
using wcregex_iterator = regex_iterator<const wchar_t*>;
using sregex_iterator = regex_iterator<string::const_iterator>;
using wsregex_iterator = regex_iterator<wstring::const_iterator>;

A regex_iterator provides a minimal set of iterator operations:

Image

A regex_iterator is a bidirectional iterator, so we cannot directly iterate over an istream.

As an example, we can output all whitespace-separated words in a string:

void test()
{
string input = "aa as; asd ++e^asdf asdfg";
regex pat {R"(\s+(\w+))"};
for (sregex_iterator p(input.begin(),input.end(),pat); p!=sregex_iterator{}; ++p)
cout << (*p)[1] << '\n';
}

This outputs:

as
asd
asdfg

Note that we are missing the first word, aa, because it has no preceding whitespace. If we simplify the pattern to R"((\ew+))", we get

aa
as
asd
e
asdf
asdfg

You cannot write through a regex_iterator and regex_iterator{} is the only possible end-of-sequence.

37.4.2. regex_token_iterator

A regex_token_iterator is an adaptor for regex_iterator that iterates over sub_matches of the match_results found:

template<class Bi,
class C = typename iterator_traits<Bi>::value_type,
class Tr = typename regex_traits<C>::type>
class regex_token_iterator {
public:
using regex_type = basic_regex<C,Tr>;
using value_type = sub_match<Bi>;
using difference_type = ptrdiff_t;
using pointer = const value_type*;
using reference = const value_type&;
using iterator_category = forward_iterator_tag;
//
...

The regex_traits are described in §37.5.

The usual set of aliases is provided:

using cregex_token_iterator = regex_token_iterator<const char*>;
using wcregex_token_iterator = regex_token_iterator<const wchar_t*>;
using sregex_token_iterator = regex_token_iterator<string::const_iterator>;
using wsregex_token_iterator = regex_token_iterator<wstring::const_iterator>;

A regex_token_iterator provides a minimal set of iterator operations:

Image

The x argument lists the sub_matches to be included in the iteration. For example (iterating over matches 1 and 3):

void test1()
{
string input {"aa::bb cc::dd ee::ff"};
regex pat {R"((\w+)([[:punct:]]+)(\w+)\s*)"};
sregex_token_iterator end {};
for (sregex_token_iterator p{input.begin(),input.end(),pat,{1,3}}; p!=end; ++p)
cout << *p << '\n';
}

This gives the output:

aa
bb
cc
dd
ee
ff

The –1 option basically inverts the strategy for reporting matches by representing each character sequence that does not match as a sub_match. This is often referred to as token splitting (that is, splitting a character stream into tokens) because when your pattern matches the token separators, option –1 leaves you with the tokens. For example:

void test2()
{
string s {"1,2 , 3 ,4,5, 6 7"}; //
input
regex pat {R"(\s*,\s*)"}; // use comma as token separator
copy(sregex_token_iterator{s.begin(),s.end(),pat,–1)},
sregex_token_iterator{},
ostream_iterator<string>{cout,"\n"});
}

The output is:

1
2
3
4
5
6 7

This could equivalently be written using an explicit loop:

void test3()
{
sregex_token_iterator end{};
for (sregex_token_iterator p {s.begin(),s.end(),pat,–1}; p!=end; ++p)
cout << *p << '\n';
}

37.5. regex_traits

A regex_traits<T> represents the correspondence between a character type, a string type, and a locale as needed for a regex implementer:

template<class C>
struct regex_traits {
public:
using char_type = C;
using string_type = basic_string<char_type>;
using locale_type = locale;
using char_class_type = /*
implementation-defined bitmask type */;
//
...
};

The standard library provides specializations regex_traits<char> and regex_traits<wchar_t>.

Image

A transform is used to generate strings for fast comparisons in pattern-matching implementations.

A classification name is one of the character classifications listed in §37.1.1, such as alpha, s, and xdigit.

37.6. Advice

[1] Use regex for most conventional uses of regular expressions; §37.1.

[2] The regular expression notation can be adjusted to match various standards; §37.1.1, §37.2.

[3] The default regular expression notation is that of ECMAScript; §37.1.1.

[4] For portability, use the character class notation to avoid nonstandard abbreviations; §37.1.1.

[5] Be restrained; regular expressions can easily become a write-only language; §37.1.1.

[6] Prefer raw string literals for expressing all but the simplest patterns; §37.1.1.

[7] Note that \i allows you to express a subpattern in terms of a previous subpattern; §37.1.1.

[8] Use ? to make patterns “lazy”; §37.1.1, §37.2.1.

[9] regex can use ECMAScript, POSIX, awk, grep, and egrep notation; §37.2.

[10] Keep a copy of the pattern string in case you need to output it; §37.2.

[11] Use regex_search() for looking at streams of characters and regex_match() to look for fixed layouts; §37.3.2, §37.3.1.