String Localization and Regular Expressions - Coding the Professional Way - Professional C++ (2014)

Part III: Coding the Professional Way

Chapter 18: String Localization and Regular Expressions

WHAT’S IN THIS CHAPTER?

· How to localize your applications to reach a worldwide audience

· How to use regular expressions to do powerful pattern matching

WROX.COM DOWNLOADS FOR THIS CHAPTER

Please note that all the code examples for this chapter are available as a part of this chapter’s code download on the book’s website at www.wrox.com/go/proc++3e on the Download Code tab.

This chapter starts with a discussion of localization, which is increasingly important because it allows you to write software that can be adapted to different regions around the world.

The second part of this chapter introduces the regular expressions library, which makes it easy to perform pattern matching on strings. It allows you to search for sub-strings matching a given pattern, but also to validate, parse, and transform strings. Regular expressions are really powerful and it’s recommended that you start using them instead of manually writing your own string processing code.

LOCALIZATION

When you’re learning how to program in C or C++, it’s useful to think of a character as equivalent to a byte and to treat all characters as members of the ASCII character set (American Standard Code for Information Interchange). ASCII is a 7-bit set usually stored in an 8-bit char type. In reality, experienced C++ programmers recognize that successful programs are used throughout the world. Even if you don’t initially write your program with international audiences in mind, you shouldn’t prevent yourself from localizing, or making the software locale aware, at a later date.

Localizing String Literals

A critical aspect of localization is that you should never put any native-language string literals in your source code, except maybe for debug strings targeted at the developer. In Microsoft Windows applications, this is accomplished by putting the strings in STRINGTABLE resources. Most other platforms offer similar capabilities. If you need to translate your application to another language, translating those resources should be all that needs to be done, without requiring any source changes. There are tools available that help you with this translation process.

To make your source code localizable, you should not compose sentences out of string literals, even if the individual literals can be localized. For example:

cout << "Read " << n << " bytes" << endl;

This statement cannot be localized to Dutch because it requires a reordering of the words. The Dutch translation is as follows:

cout << n << " bytes gelezen" << endl;

To make sure you can properly localize this statement, you could implement something as follows:

cout << Format(IDS_TRANSFERRED, n) << endl;

IDS_TRANSFERRED is the name of an entry in a string resource table. For the English version, IDS_TRANSFERRED could be defined as "Read $1 bytes", while the Dutch version of the resource could be defined as "$1 bytes gelezen". The Format() function loads the string resource and substitutes $1 with the value of n.
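The following is a minimal sketch of how such a Format() function could work. The StringId and Language enumerations and the in-memory table are hypothetical stand-ins for a platform resource mechanism such as a STRINGTABLE, and the extra Language parameter is an assumption made for illustration. Only the $1 substitution mirrors the description above:

```cpp
#include <map>
#include <string>

// Hypothetical resource identifiers and languages, for illustration only.
enum class StringId { Transferred };
enum class Language { English, Dutch };

// Stand-in for a platform string resource table.
inline const std::string& loadString(Language lang, StringId id)
{
    static const std::map<Language, std::map<StringId, std::string>> table = {
        { Language::English, { { StringId::Transferred, "Read $1 bytes" } } },
        { Language::Dutch,   { { StringId::Transferred, "$1 bytes gelezen" } } },
    };
    return table.at(lang).at(id);
}

// Loads the resource and replaces every "$1" with the value of n.
inline std::string Format(Language lang, StringId id, long long n)
{
    std::string result = loadString(lang, id);
    const std::string placeholder = "$1";
    const std::string value = std::to_string(n);
    std::string::size_type pos = 0;
    while ((pos = result.find(placeholder, pos)) != std::string::npos) {
        result.replace(pos, placeholder.size(), value);
        pos += value.size();  // continue after the inserted text
    }
    return result;
}
```

Because the word order lives in the resource string and not in the code, the same call produces a correctly ordered sentence for each language.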

Wide Characters

The problem with viewing a character as a byte is that not all languages, or character sets, can be fully represented in 8 bits, or 1 byte. C++ has a built-in type called wchar_t that holds a wide character. Languages with non-ASCII (U.S.) characters, such as Japanese and Arabic, can be represented in C++ with wchar_t. However, the C++ standard does not define a size for wchar_t. Some compilers use 16 bits while others use 32 bits. To write portable software, it is not safe to assume that wchar_t is of a particular size.

If there is any chance that your program will be used in a non-Western character set context (hint: there is!), you should use wide characters from the beginning. When working with wchar_t, string and character literals are prefixed with the letter L to indicate that a wide-character encoding should be used. For example, to initialize a wchar_t character to be the letter m, you write it like this:

wchar_t myWideCharacter = L'm';

There are wide-character versions of most of your favorite types and classes. The wide string class is wstring. The “prefix letter w” pattern applies to streams as well. Wide-character file output streams are handled with the wofstream, and input is handled with the wifstream. The joy of pronouncing these class names (woof-stream? whiff-stream?) is reason enough to make your programs locale aware! Streams are discussed in detail in Chapter 12.

In addition to cout, cin, cerr, and clog there are wide versions of the built-in console and error streams called wcout, wcin, wcerr, and wclog. Using them is no different than using the non-wide versions:

wcout << L"I am wide-character aware." << endl;

Non-Western Character Sets

Wide characters are a great step forward because they increase the amount of space available to define a single character. The next step is to figure out how that space is used. In wide character sets, just like in ASCII, characters are represented by numbers, now called code points. The only difference is that each number does not fit in 8 bits. The map of characters to code points is quite a bit larger because it handles many different character sets in addition to the characters that English-speaking programmers are familiar with.

The Universal Character Set (UCS), defined by the International Standard ISO 10646, and Unicode are both standardized sets of characters. They contain around one hundred thousand abstract characters, each identified by an unambiguous name and a code point. The same characters with the same numbers exist in both standards. Both have specific encodings that you can use. For example, UTF-8 is a Unicode encoding in which each character is encoded using one to four 8-bit bytes. UTF-16 encodes each Unicode character as one or two 16-bit values, and UTF-32 encodes each Unicode character as exactly one 32-bit value.
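To make the “one to four bytes” rule of UTF-8 concrete, here is a small hand-written encoder. It is only a sketch for illustration: the function name encodeUtf8 is made up for this example, and production code would normally rely on a well-tested library instead.

```cpp
#include <string>

// Encodes one Unicode code point as UTF-8 (one to four bytes).
// Returns an empty string for values that are not valid code points.
inline std::string encodeUtf8(char32_t cp)
{
    std::string out;
    if (cp <= 0x7F) {                       // 1 byte: plain ASCII
        out += static_cast<char>(cp);
    } else if (cp <= 0x7FF) {               // 2 bytes
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0xFFFF) {              // 3 bytes
        if (cp >= 0xD800 && cp <= 0xDFFF) {
            return out;                     // surrogates are not characters
        }
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {            // 4 bytes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For example, the letter A (U+0041) needs a single byte, while the Greek letter pi (U+03C0) needs two.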

Different applications can use different encodings. Unfortunately, the C++ standard does not specify a size for wide characters (wchar_t). On Windows it is 16 bits, while on other platforms it could be 32 bits. You need to be aware of this when using wide characters for character encoding in cross-platform code. To help solve this issue, there are two other character types: char16_t and char32_t. The following list gives an overview of all character types supported:

· char: Stores at least 8 bits. Can be used to store ASCII characters, or as a basic building block for storing UTF-8 encoded Unicode characters, where one Unicode character is encoded as one to four chars.

· char16_t: Stores at least 16 bits. Can be used as the basic building block for UTF-16 encoded Unicode characters where one Unicode character is encoded as one or two char16_ts.

· char32_t: Stores at least 32 bits. Can be used for storing UTF-32 encoded Unicode characters as one char32_t.

· wchar_t: Stores a wide character of a compiler-specific size and encoding.

The benefit of using char16_t and char32_t instead of wchar_t is that the size of char16_t is guaranteed to be at least 16 bits, and the size of char32_t is guaranteed to be at least 32 bits, independent of the compiler. There is no minimum size guaranteed for wchar_t.

The standard also defines the following two macros:

· __STDC_UTF_32__: If this is defined by the compiler, then the type char32_t represents a UTF-32 encoding. If it is not defined, the type char32_t has a compiler-dependent encoding.

· __STDC_UTF_16__: If this is defined by the compiler, then the type char16_t represents a UTF-16 encoding. If it is not defined, the type char16_t has a compiler-dependent encoding.

String literals can have a string prefix to turn them into a specific type. The complete set of supported string prefixes is as follows:

· u8: A char string literal with UTF-8 encoding.

· u: A char16_t string literal, which can be UTF-16 if __STDC_UTF_16__ is defined by the compiler.

· U: A char32_t string literal, which can be UTF-32 if __STDC_UTF_32__ is defined by the compiler.

· L: A wchar_t string literal with a compiler-dependent encoding.

All of these string literals can be combined with the raw string literal prefix, R, discussed in Chapter 2. For example:

const char* s1 = u8R"(Raw UTF-8 encoded string literal)";

const wchar_t* s2 = LR"(Raw wide string literal)";

const char16_t* s3 = uR"(Raw char16_t string literal)";

const char32_t* s4 = UR"(Raw char32_t string literal)";

If you are using Unicode encoding, for example, by using u8 UTF-8 string literals, or if your compiler defines __STDC_UTF_16__ or __STDC_UTF_32__, you can insert a specific Unicode code point in your non-raw string literal by using the \uABCD notation. For example, \u03C0 represents the pi character (π), and \u00B2 represents the superscript two character (²). The following code prints "π r²", provided that your console is able to display UTF-8:

const char* formula = u8"\u03C0 r\u00B2";

cout << formula << endl;
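The same code point occupies a different number of code units depending on the literal prefix. The following sketch counts them at compile time. The helper codeUnits and the constant names are invented for this example, and the UTF-16 and UTF-32 counts assume that the compiler uses those encodings for u and U literals:

```cpp
#include <cstddef>

// Number of code units in a string literal, excluding the terminating null.
template <typename CharT, std::size_t N>
constexpr std::size_t codeUnits(const CharT (&)[N])
{
    return N - 1;
}

// U+03C0 (pi) lies inside the Basic Multilingual Plane.
constexpr std::size_t piUtf8  = codeUnits(u8"\u03C0");  // two 8-bit units
constexpr std::size_t piUtf16 = codeUnits(u"\u03C0");   // one 16-bit unit
constexpr std::size_t piUtf32 = codeUnits(U"\u03C0");   // one 32-bit unit

// U+1F600 lies outside the Basic Multilingual Plane.
constexpr std::size_t faceUtf8  = codeUnits(u8"\U0001F600");  // four 8-bit units
constexpr std::size_t faceUtf16 = codeUnits(u"\U0001F600");   // a surrogate pair
constexpr std::size_t faceUtf32 = codeUnits(U"\U0001F600");   // one 32-bit unit
```

Note that code points above U+FFFF require the eight-digit \UABCDEFGH notation.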

Besides the std::string class, there is also support for wstring, u16string, and u32string. They are all defined as follows:

· typedef basic_string<char> string;

· typedef basic_string<wchar_t> wstring;

· typedef basic_string<char16_t> u16string;

· typedef basic_string<char32_t> u32string;

Multibyte characters are characters composed of one or more bytes with a compiler-dependent encoding, similar to how Unicode can be represented with one to four bytes using UTF-8, or with one or two 16-bit values using UTF-16. There are conversion functions to convert between char16_t/char32_t and multibyte characters, and vice versa: mbrtoc16, c16rtomb, mbrtoc32, and c32rtomb.

Unfortunately, the support for char16_t and char32_t stops there. For example, the I/O stream classes in the standard library do not include support for these character types. This means that there is nothing like a version of cout or cin that supports char16_t and char32_t, making it difficult to print such strings to a console or to read them from user input. If you want to do more with char16_t and char32_t strings, you need to resort to third-party libraries.

Locales and Facets

Character sets are only one of the differences in data representation between countries. Even countries that use similar character sets, such as Great Britain and the United States, still differ in how they represent data such as dates and money.

The standard C++ mechanism that groups specific data about a particular set of cultural parameters is called a locale. An individual component of a locale, such as date format, time format, number format, etc., is called a facet. An example of a locale is U.S. English. An example of a facet is the format used to display a date. There are several built-in facets common to all locales. C++ also provides a way to customize or add facets.

Using Locales

When using I/O streams, data is formatted according to a particular locale. Locales are objects that can be attached to a stream. They are defined in the <locale> header file. Locale names can be implementation-specific. One standard is to separate a language and an area in two-letter sections with an optional encoding. For example, the locale for the English language as spoken in the U.S. is en_US, while the locale for the English language as spoken in Great Britain is en_GB. The locale for Japanese spoken in Japan with Japanese Industrial Standard encoding is ja_JP.jis.

Locale names on Windows follow a different standard, which has the following general format:

lang[_country_region[.code_page]]

Everything between the square brackets is optional. The following table lists some examples:

| LOCALE | LINUX GCC | WINDOWS |
|---|---|---|
| U.S. English | en_US | English_United States |
| Great Britain English | en_GB | English_Great Britain |

Most operating systems have a mechanism to determine the locale as defined by the user. In C++, you can pass an empty string to the locale object constructor to create a locale from the user’s environment. Once this object is created, you can use it to query the locale, possibly making programmatic decisions based on it. The following code demonstrates how to use the user’s locale by calling the imbue() method on a stream. The result is that everything that is sent to wcout is formatted according to the formatting rules for your environment:

wcout.imbue(locale(""));

wcout << 32767 << endl;

This means that if your system locale is English United States and you output the number 32767, the number is displayed as 32,767; but, if your system locale is Dutch Belgium, the same number is displayed as 32.767.

The default locale is the classic locale, and not the user’s locale. The classic locale uses ANSI C conventions, and has the name C. The classic C locale is similar to U.S. English, but there are slight differences. For example, numbers are handled without any punctuation:

wcout.imbue(locale("C"));

wcout << 32767 << endl;

The output of this code is as follows:

32767

The following code manually sets the U.S. English locale, so the number 32767 is formatted with U.S. English punctuation, independent of your system locale:

wcout.imbue(locale("en_US")); // Use "English_United States" on Windows

wcout << 32767 << endl;

The output of this code is as follows:

32,767
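Named locales such as en_US are only usable if they are available on the system that runs the program. As a sketch of an alternative that does not depend on that, the following code builds a locale from the classic locale plus a custom numeric punctuation facet. The names CommaGrouping and formatGrouped are made up for this example, and it uses a narrow string stream instead of wcout to keep the result easy to inspect:

```cpp
#include <locale>
#include <sstream>
#include <string>

// A custom numeric punctuation facet: groups digits by three, separated by ','.
class CommaGrouping : public std::numpunct<char>
{
protected:
    char do_thousands_sep() const override { return ','; }
    std::string do_grouping() const override { return "\3"; }
};

// Formats n through a stream imbued with the classic locale plus the facet.
inline std::string formatGrouped(long n)
{
    std::ostringstream out;
    // The locale takes ownership of the facet and deletes it when no longer used.
    out.imbue(std::locale(std::locale::classic(), new CommaGrouping));
    out << n;
    return out.str();
}
```

Calling formatGrouped(32767) yields the same text as the U.S. English example, without requiring that locale to be present.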

A locale object allows you to query information about the locale. For example, the following program creates a locale matching the user’s environment. The name() method is used to get a C++ string that describes the locale. Then, the find() method is used on the string object to find a given sub-string, which returns string::npos when the given sub-string is not found. The code checks for the Windows name and the Linux GCC name. One of two messages is output, depending on whether the locale appears to be U.S. English or not:

locale loc("");
if (loc.name().find("en_US") == string::npos &&
    loc.name().find("United States") == string::npos) {
    wcout << L"Welcome non-U.S. English speaker!" << endl;
} else {
    wcout << L"Welcome U.S. English speaker!" << endl;
}
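Constructing a locale from a name that the system does not recognize throws a std::runtime_error. A defensive sketch, using a hypothetical helper called localeOrClassic, falls back to the classic locale in that case:

```cpp
#include <locale>
#include <stdexcept>
#include <string>

// Returns the named locale, or the classic "C" locale if the name is unknown.
inline std::locale localeOrClassic(const std::string& name)
{
    try {
        return std::locale(name);
    } catch (const std::runtime_error&) {
        return std::locale::classic();
    }
}
```

This makes code that uses platform-specific names, such as en_US on Linux and English_United States on Windows, degrade gracefully instead of terminating.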

Using Facets

You can use the std::use_facet() function to obtain a particular facet in a particular locale. The argument to use_facet() is a locale. For example, the following expression retrieves the standard monetary punctuation facet of the British English locale using the Linux GCC locale name:

use_facet<moneypunct<wchar_t>>(locale("en_GB"));

Note that the innermost template type determines the character type to use. This is usually wchar_t or char. The use of nested template classes is unfortunate, but once you get past the syntax, the result is an object that contains all the information you want to know about British money punctuation. The data available in the standard facets are defined in the <locale> header and its associated files.

The following program brings together locales and facets by printing out the currency symbol in both U.S. English and British English. Note that, depending on your environment, the British currency symbol may appear as a question mark, a box, or not at all. If your environment is equipped to handle it, you may actually get the British pound symbol:

locale locUSEng("en_US"); // For Linux

//locale locUSEng("English_United States"); // For Windows

locale locBritEng("en_GB"); // For Linux

//locale locBritEng("English_Great Britain"); // For Windows

wstring dollars = use_facet<moneypunct<wchar_t>>(locUSEng).curr_symbol();

wstring pounds = use_facet<moneypunct<wchar_t>>(locBritEng).curr_symbol();

wcout << L"In the US, the currency symbol is " << dollars << endl;

wcout << L"In Great Britain, the currency symbol is " << pounds << endl;
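The same use_facet() technique works for other standard facets. As a sketch that does not depend on any installed named locale, the following code queries the numpunct facet instead of moneypunct. The NumberFormatInfo structure and the describeNumbers function are invented for this example:

```cpp
#include <locale>
#include <string>

// A few properties of how a locale punctuates numbers.
struct NumberFormatInfo
{
    char decimalPoint;
    char thousandsSep;
    bool groupsDigits;
};

// Queries the numeric punctuation facet of the given locale.
inline NumberFormatInfo describeNumbers(const std::locale& loc)
{
    const auto& punct = std::use_facet<std::numpunct<char>>(loc);
    return { punct.decimal_point(),
             punct.thousands_sep(),
             !punct.grouping().empty() };  // empty grouping means no separators
}
```

For the classic locale, the decimal point is a period and digits are not grouped, which matches the unpunctuated 32767 output shown earlier.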

REGULAR EXPRESSIONS

Regular expressions, defined in the <regex> header, are a powerful feature of the Standard Library. They are a special mini-language for string processing. They might seem complicated at first, but once you get to know them, they make working with strings easier. Regular expressions can be used for several string-related operations:

· Validation: Check if an input string is well-formed.

For example: Is the input string a well-formed phone number?

· Decision: Check what kind of string an input represents.

For example: Is the input string the name of a JPEG or a PNG file?

· Parsing: Extract information from an input string.

For example: From a full filename, extract the filename part without the full path and without its extension.

· Transformation: Search sub-strings and replace them with a new formatted sub-string.

For example: Search all occurrences of “C++14” and replace them with “C++”.

· Iteration: Search all occurrences of a sub-string.

For example: Extract all phone numbers from an input string.

· Tokenization: Split a string into sub-strings based on a set of delimiters.

For example: Split a string on whitespace, commas, periods, and so on to extract its individual words.

Of course, you could write your own code to perform any of the preceding operations on your strings, but using the regular expressions feature is highly recommended, because writing correct and safe code to process strings can be tricky.

Before we can go into more detail on the regular expressions, there is some important terminology to know. The following terms are used throughout the discussion:

· Pattern: The actual regular expression is a pattern represented by a string.

· Match: Determines whether there is a match between a given regular expression and all of the characters in a given sequence [first,last).

· Search: Determines whether there is some sub-string within a given sequence [first,last) that matches a given regular expression.

· Replace: Identifies sub-strings in a given sequence, and replaces them with a corresponding new sub-string computed from another pattern, called a substitution pattern.

If you look around on the internet you will find several different grammars for regular expressions. For this reason, C++ includes support for several of these grammars: ECMAScript, basic, extended, awk, grep, and egrep. If you already know any of these regular expression grammars, you can use it straight away in C++ by telling the regular expression library to use that specific syntax (syntax_option_type). The default grammar in C++ is ECMAScript whose syntax is explained in detail in the following section. It is also the most powerful grammar, so it’s recommended to use ECMAScript instead of one of the other more limited grammars. Explaining the other regular expression grammars falls outside the scope of this book.

NOTE If this is your first encounter with regular expressions, stick with the default ECMAScript syntax.
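As a brief sketch of how a grammar is selected, the following helper passes a syntax option to the regex constructor. The function name matchesWith is made up for this example:

```cpp
#include <regex>
#include <string>

// Returns true if the whole text matches the pattern under the given grammar.
inline bool matchesWith(const std::string& pattern, const std::string& text,
                        std::regex::flag_type grammar)
{
    return std::regex_match(text, std::regex(pattern, grammar));
}
```

With std::regex::ECMAScript or std::regex::extended, the pattern a+ matches aa. With std::regex::basic, the + is an ordinary character, so the same pattern matches only the literal text a+.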

ECMAScript Syntax

A regular expression pattern is a sequence of characters representing what you want to match. Any character in the regular expression matches itself except for the following special characters:

^ $ \ . * + ? ( ) [ ] { } |

These special characters are explained throughout the following discussion. If you need to match one of these special characters, you need to escape it using the \ character. For example:

\[ or \. or \* or \\

Anchors

The special characters ^ and $ are called anchors. The ^ character matches the position immediately following a line terminator character, and $ matches the position of a line terminator character. ^ and $ by default also match the beginning or ending of a string, respectively, but this behavior can be disabled. For example, ^test$ matches only the string test, and not strings that contain test in the line with anything else like 1test, test2, test abc, and so on.
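The effect of the anchors can be sketched with regex_search(), which is covered in detail later in this chapter. The two function names are invented for this example:

```cpp
#include <regex>
#include <string>

// True only if s is exactly "test": the anchors pin both ends of the match.
inline bool isExactlyTest(const std::string& s)
{
    static const std::regex pattern("^test$");
    return std::regex_search(s, pattern);
}

// True if "test" occurs anywhere in s: no anchors, so any position will do.
inline bool containsTest(const std::string& s)
{
    static const std::regex pattern("test");
    return std::regex_search(s, pattern);
}
```

The input 1test satisfies containsTest() but not isExactlyTest().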

Wildcards

The wildcard character . can be used to match any character except a newline character. For example, the regular expression a.c will match abc, and a5c, but will not match ab5c, ac, and so on.

Alternation

The | character can be used to specify the “or” relationship. For example, a|b matches a or b.
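The wildcard and alternation examples can be checked with regex_match(), which requires the entire input to match. This is only a sketch, and the function names are made up:

```cpp
#include <regex>
#include <string>

// True if text as a whole matches a.c: 'a', any one non-newline character, 'c'.
inline bool matchesWildcard(const std::string& text)
{
    static const std::regex pattern("a.c");
    return std::regex_match(text, pattern);
}

// True if text as a whole is either "a" or "b".
inline bool matchesAlternation(const std::string& text)
{
    static const std::regex pattern("a|b");
    return std::regex_match(text, pattern);
}
```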

Grouping

Parentheses () are used to mark sub-expressions, also called capture groups. Capture groups can be used for several purposes:

· Capture groups can be used to identify individual sub-sequences of the original string; each marked sub-expression (capture group) is returned in the result. For example, take the following regular expression: (.)(ab|cd)(.). It has three marked sub-expressions. Running a regex_search() with this regular expression on 1cd4 results in a match with four entries. The first entry is the entire match 1cd4 followed by three entries for the three marked sub-expressions. These three entries are 1, cd, and 4. The details on how to use the regex_search() algorithm are shown in a later section.

· Capture groups can be used during matching for a purpose called back references (explained later).

· Capture groups can be used to identify components during replace operations (explained later).
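The first use of capture groups can be sketched as follows. The helper searchGroups is hypothetical; it collects the entire match followed by each capture group into a vector:

```cpp
#include <regex>
#include <string>
#include <vector>

// Returns the whole match followed by each capture group,
// or an empty vector if the pattern is not found in text.
inline std::vector<std::string> searchGroups(const std::string& pattern,
                                             const std::string& text)
{
    std::vector<std::string> parts;
    std::smatch m;
    if (std::regex_search(text, m, std::regex(pattern))) {
        for (const auto& sub : m) {
            parts.push_back(sub.str());
        }
    }
    return parts;
}
```

Calling searchGroups("(.)(ab|cd)(.)", "1cd4") returns the four entries 1cd4, 1, cd, and 4.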

Repetition

Parts of a regular expression can be repeated by using one of four repeats:

· * matches the preceding part zero or more times. For example: a*b matches b, ab, aab, aaaab, and so on.

· + matches the preceding part one or more times. For example: a+b matches ab, aab, aaaab, and so on, but not b.

· ? matches the preceding part zero or one time. For example: a?b matches b and ab, but nothing else.

· {...} represents a bounded repeat. a{n} matches a repeated exactly n times; a{n,} matches a repeated n times or more; and a{n,m} matches a repeated between n and m times inclusive. For example, a{3,4} matches aaa and aaaa but not a, aa, aaaaa, and so on.

The repeats described in the previous list are called greedy because they find the longest match while still matching the remainder of the regular expression. To make them non-greedy, a ? can be added behind the repeat as in *?, +?, ??, and {...}?. A non-greedy repetition repeats its pattern as few times as possible while still matching the remainder of the regular expression.
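The difference between greedy and non-greedy repeats can be observed by inspecting the capture groups of a match. The following sketch uses a made-up helper, joinSubMatches, which joins the groups with a | character so they are easy to compare:

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Runs a search and joins capture groups 1..n with '|'.
// Returns "<no match>" if the pattern is not found in input.
inline std::string joinSubMatches(const std::string& pattern,
                                  const std::string& input)
{
    std::smatch m;
    if (!std::regex_search(input, m, std::regex(pattern))) {
        return "<no match>";
    }
    std::string joined;
    for (std::size_t i = 1; i < m.size(); ++i) {
        if (i > 1) {
            joined += '|';
        }
        joined += m[i].str();  // empty when the group did not participate
    }
    return joined;
}
```

On the input aaabbb, the greedy pattern (a+)(ab)*(b+) gives aaa||bbb, while the non-greedy variant (a+?)(ab)*(b+) gives aa|ab|bb.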

For example, the following table shows a greedy and a non-greedy regular expression and the resulting sub matches when running them on the input sequence aaabbb:

| REGULAR EXPRESSION | SUB MATCHES |
|---|---|
| Greedy: (a+)(ab)*(b+) | "aaa" "" "bbb" |
| Non-greedy: (a+?)(ab)*(b+) | "aa" "ab" "bb" |

Precedence

Just as with mathematical formulas it’s important to know the precedence of regular expression elements. Precedence is as follows:

· Elements: like a are the basic building blocks of a regular expression.

· Quantifiers: like +, *, ?, and {...} bind tightly to the element on the left; for example, b+.

· Concatenation: like ab+c binds after quantifiers.

· Alternations: like | bind last.

For example, take the regular expression ab+c|d. This matches abc, abbc, abbbc, and so on, and also d. Parentheses can be used to change these precedence rules. For example, ab+(c|d) matches abc, abbc, abbbc, ..., abd, abbd, abbbd, and so on. However, by using parentheses you also mark it as a sub-expression or capture group. It is possible to change the precedence rules without creating new capture groups by using (?:...). For example, ab+(?:c|d) matches the same as the preceding ab+(c|d) but does not create an additional capture group.
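The number of capture groups in a pattern can be queried with the mark_count() method of a regex object, which makes the effect of (?:...) visible. The wrapper function below is just a sketch with a made-up name:

```cpp
#include <regex>
#include <string>

// Returns the number of capture groups (marked sub-expressions) in pattern.
inline unsigned captureGroupCount(const std::string& pattern)
{
    return std::regex(pattern).mark_count();
}
```

For ab+(c|d) the count is 1, while for ab+(?:c|d) it is 0, even though both patterns accept the same strings.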

Character Set Matches

Instead of having to write (a|b|c|...|z), which is clumsy and introduces a capture group, a special syntax for specifying sets of characters or ranges of characters is available. In addition, a “not” form of the match is also available. A character set is specified between square brackets, and allows you to write [c1c2...cn], which matches any of the characters c1, c2, ..., or cn. For example, [abc] matches any character a, b, or c. If the first character is ^, it means “any but”:

· ab[cde] matches abc, abd, and abe.

· ab[^cde] matches abf, abp, and so on but not abc, abd, and abe.

If you need to match the ^, [ or ] characters themselves, you need to escape them; for example: [\[\^\]] matches the characters [, ^ or ].

If you want to specify all letters, you could use a character set like [abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ]; however, this is clumsy and doing this several times is awkward, especially if you make a typo and omit one of the letters accidentally. There are two solutions to this.

The range specification in square brackets allows you to write [a-zA-Z], which recognizes all the letters in the range a to z and A to Z. If you need to match a hyphen, you need to escape it; for example, [a-zA-Z\-]+ matches any word including a hyphenated word.
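A short sketch of the range syntax with an escaped hyphen follows. The function name isWord is invented for this example, and the pattern is written as a raw string literal so that the backslash reaches the regular expression unchanged:

```cpp
#include <regex>
#include <string>

// True if s consists only of letters and hyphens (at least one character).
inline bool isWord(const std::string& s)
{
    static const std::regex pattern(R"([a-zA-Z\-]+)");
    return std::regex_match(s, pattern);
}
```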

Another capability is to use one of the character classes. These are used to denote specific types of characters and are represented as [:name:]. Which character classes are available depends on the locale, but the names listed in the following table are always recognized. The exact meaning of these character classes is also dependent on the locale. This table assumes the standard C locale.

| CHARACTER CLASS NAME | DESCRIPTION |
|---|---|
| digit | Digits. |
| d | Same as digit. |
| xdigit | Digits (digit) and the following letters used in hexadecimal numbers: 'a', 'b', 'c', 'd', 'e', 'f', 'A', 'B', 'C', 'D', 'E', 'F'. |
| alpha | Alphabetic characters. For the C locale these are all lowercase and uppercase letters. |
| alnum | A combination of the alpha class and the digit class. |
| w | Same as alnum. |
| lower | Lowercase letters, if applicable to the locale. |
| upper | Uppercase letters, if applicable to the locale. |
| blank | A blank character is a space character used to separate words within a line of text. For the C locale these are ' ' or '\t' (tab). |
| space | Whitespace characters. For the C locale, these are ' ', '\t', '\n', '\r', '\v', and '\f'. |
| s | Same as space. |
| print | Printable characters. These occupy a printing position, for example, on a display, and are the opposite of control characters (cntrl). Examples are lowercase letters, uppercase letters, digits, punctuation characters, and space characters. |
| cntrl | Control characters. These are the opposite of printable characters (print), and don’t occupy a printing position, for example, on a display. Some examples for the C locale are '\f' (form feed), '\n' (new line), and '\r' (carriage return). |
| graph | Characters with a graphical representation. These are all characters that are printable (print), except the space character ' '. |
| punct | Punctuation characters. For the C locale, these are all graphical characters (graph) that are not alphanumeric (alnum). |

Character classes are used within character sets; for example, [[:alpha:]]* in English means the same as [a-zA-Z]*.

Because certain concepts like matching digits are so common, there are shorthand patterns for them. For example, [:digit:] and [:d:] mean the same thing as [0-9]. Some classes have an even shorter pattern using the escape notation \. For example, \d means [[:digit:]]. Therefore, to recognize a sequence of one or more digits, you can write any of the following patterns:

· [0-9]+

· [[:digit:]]+

· [[:d:]]+

· \d+
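That these four patterns are interchangeable can be checked with a small sketch. The function name digitPatternsMatching is made up; it counts how many of the four patterns accept a given input as a whole:

```cpp
#include <regex>
#include <string>

// Returns how many of the four equivalent digit patterns match the whole input.
inline int digitPatternsMatching(const std::string& input)
{
    static const std::regex patterns[] = {
        std::regex("[0-9]+"),
        std::regex("[[:digit:]]+"),
        std::regex("[[:d:]]+"),
        std::regex(R"(\d+)"),
    };
    int count = 0;
    for (const auto& pattern : patterns) {
        if (std::regex_match(input, pattern)) {
            ++count;
        }
    }
    return count;
}
```

For any input, the result is expected to be either 4 or 0, never something in between.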

The following table lists the available escape notations for character classes:

| ESCAPE NOTATION | EQUIVALENT TO |
|---|---|
| \d | [[:d:]] |
| \D | [^[:d:]] |
| \s | [[:s:]] |
| \S | [^[:s:]] |
| \w | [_[:w:]] |
| \W | [^_[:w:]] |

Some examples:

· Test[5-8] matches Test5, Test6, Test7, and Test8.

· [[:lower:]] matches a, b, and so on but not A, B, and so on.

· [^[:lower:]] matches any character except lowercase letters like a, b, and so on.

· [[:lower:]5-7] matches any lower case letter like a, b, and so on and the numbers 5, 6, and 7.

Word Boundaries

A word boundary can mean the following:

· The beginning of the source string if the first character of the source string is one of the word characters [A-Za-z0-9_]. Matching the beginning of the source string is enabled by default, but you can disable it (regex_constants::match_not_bow).

· The end of the source string if the last character of the source string is one of the word characters. Matching the end of the source string is enabled by default, but you can disable it (regex_constants::match_not_eow).

· The first character of a word, which is one of the word characters, while the preceding character is not a word character.

· The end of a word, which is a non-word character after a word, while the preceding character is a word character.

You can use \b to match a word boundary, and \B to match anything except a word boundary.
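As a sketch of a typical use of \b, the following hypothetical helper searches for a whole word, so that the word is not found when it is merely part of a longer word:

```cpp
#include <regex>
#include <string>

// True if word occurs in text as a whole word.
// Assumes word contains only word characters, so it needs no escaping.
inline bool containsWord(const std::string& text, const std::string& word)
{
    const std::regex pattern("\\b" + word + "\\b");
    return std::regex_search(text, pattern);
}
```

For example, cat is found in "the cat sat" but not in "concatenate".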

Back References

Back references allow you to reference a captured group inside the regular expression itself: \n refers to the n-th captured group, with n>0. For example, the regular expression (\d+)-.*-\1 matches a string that has the following format:

· one or more digits captured in a capture group (\d+)

· followed by a dash -

· followed by zero or more characters .*

· followed by another dash -

· followed by exactly the same digits captured by the first capture group \1

This regular expression matches 123-abc-123, 1234-a-1234, and so on but does not match 123-abc-1234, 123-abc-321, and so on.
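The back reference example can be tried out with the following sketch. The function name hasMatchingIds is invented, and regex_match() is used so that the entire string has to match the pattern:

```cpp
#include <regex>
#include <string>

// True if s has the form <digits>-<anything>-<the same digits>.
inline bool hasMatchingIds(const std::string& s)
{
    static const std::regex pattern(R"((\d+)-.*-\1)");
    return std::regex_match(s, pattern);
}
```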

Lookahead

Regular expressions support positive lookahead (?=pattern) and negative lookahead (?!pattern). The characters following the lookahead must match (positive) or not match (negative) the lookahead pattern, but those characters are not yet consumed. For example, the following regular expression matches an input sequence that consists of at least one lower case letter, at least one upper case letter, at least one punctuation character, and is at least eight characters long:

(?=.*[[:lower:]])(?=.*[[:upper:]])(?=.*[[:punct:]]).{8,}
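Wrapped in a function, this pattern could serve as a simple password check. This is only a sketch: the name isStrongPassword is made up, and the rules are exactly those of the pattern above, nothing more:

```cpp
#include <regex>
#include <string>

// True if s has at least one lowercase letter, one uppercase letter,
// one punctuation character, and is at least eight characters long.
inline bool isStrongPassword(const std::string& s)
{
    static const std::regex pattern(
        R"((?=.*[[:lower:]])(?=.*[[:upper:]])(?=.*[[:punct:]]).{8,})");
    return std::regex_match(s, pattern);
}
```

Each lookahead inspects the input from the start without consuming anything, so the final .{8,} still sees the complete string.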

Regular Expressions and Raw String Literals

As seen in the preceding sections, regular expressions often use special characters that should be escaped in normal C++ string literals. For example, if you write \d in a regular expression it matches any digit. However, since \ is a special character in C++, you need to escape it in your regular expression string literal as \\d, otherwise your C++ compiler tries to interpret the \d. It can get more complicated if you want your regular expression to match a single backslash character \. Because \ is a special character in the regular expression syntax itself, you need to escape it as \\. The \ character is also a special character in C++ string literals, so you need to escape it in your C++ string literal, resulting in \\\\.

You can use raw string literals to make complicated regular expressions easier to read in your C++ source code. Raw string literals are discussed in Chapter 2. For example, take the following regular expression:

"( |\\n|\\r|\\\\)"

This regular expression searches for spaces, newlines, carriage returns, and backslashes. As you can see, you need a lot of escape characters. Using raw string literals, this can be replaced with the following more readable regular expression:

R"(( |\n|\r|\\))"

The raw string literal starts with R"( and ends with )". Everything in between is the regular expression. Of course, you still need a double backslash at the end because the backslash needs to be escaped in the regular expression itself.
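A quick sketch confirms that both spellings produce the same pattern text. The variable and function names are made up for this example:

```cpp
#include <regex>
#include <string>

// The same regular expression, written as a normal and as a raw string literal.
const std::string escapedPattern = "( |\\n|\\r|\\\\)";
const std::string rawPattern = R"(( |\n|\r|\\))";

// True if text contains a space, newline, carriage return, or backslash.
inline bool containsSeparator(const std::string& text)
{
    static const std::regex pattern(rawPattern);
    return std::regex_search(text, pattern);
}
```

The expression escapedPattern == rawPattern evaluates to true, because the compiler turns both literals into the same sequence of characters.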

This concludes a brief description of the ECMAScript grammar. The following section starts with actually using regular expressions in your C++ code.

The regex Library

Everything for the regular expression library is in the <regex> header file and in the std namespace. The basic templated types defined by the regular expression library are:

· basic_regex: An object representing a specific regular expression.

· match_results: A sub-string that matched a regular expression, including all the captured groups. It is a collection of sub_matches.

· sub_match: An object containing a pair of iterators into the input sequence. These iterators represent the matched capture group. The pair is an iterator pointing to the first character of a matched capture group and an iterator pointing to one-past-the-last character of the matched capture group. It has a str() method that returns the matched capture group as a string.
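To illustrate the iterator pair stored in a sub_match, the following sketch converts it into character offsets within the source string. The function name groupRange is invented for this example:

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <utility>

// Returns the [begin, end) offsets of the given capture group within text,
// or {npos, npos} if there is no match or the group did not participate.
inline std::pair<std::size_t, std::size_t> groupRange(const std::string& text,
                                                      const std::regex& re,
                                                      std::size_t group)
{
    std::smatch m;
    if (!std::regex_search(text, m, re) || group >= m.size() || !m[group].matched) {
        return { std::string::npos, std::string::npos };
    }
    const std::ssub_match& sub = m[group];
    // first points at the first matched character, second one past the last.
    return { static_cast<std::size_t>(sub.first - text.begin()),
             static_cast<std::size_t>(sub.second - text.begin()) };
}
```

For the text id=42; and the pattern id=(\d+), capture group 1 spans the offsets 3 to 5, which is where the characters 42 are located.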

The library provides three key algorithms: regex_match(), regex_search(), and regex_replace(). All of these algorithms have different versions that allow you to specify the source string as an STL string, a character array, or as a begin/end iterator pair. The iterators can be any of the following:

· const char*

· const wchar_t*

· string::const_iterator

· wstring::const_iterator

In fact, any iterator that behaves as a bidirectional iterator can be used. Iterators are discussed in detail in Chapter 16.

The library also defines regular expression iterators, which are very important if you want to find all occurrences of a pattern in a source string. There are two templated regular expression iterators defined:

· regex_iterator: iterates over all the occurrences of a pattern in a source string

· regex_token_iterator: iterates over all the capture groups of all occurrences of a pattern in a source string

To make the library easier to use, the standard defines a number of typedefs for the preceding templates:

typedef basic_regex<char> regex;

typedef basic_regex<wchar_t> wregex;

typedef sub_match<const char*> csub_match;

typedef sub_match<const wchar_t*> wcsub_match;

typedef sub_match<string::const_iterator> ssub_match;

typedef sub_match<wstring::const_iterator> wssub_match;

typedef match_results<const char*> cmatch;

typedef match_results<const wchar_t*> wcmatch;

typedef match_results<string::const_iterator> smatch;

typedef match_results<wstring::const_iterator> wsmatch;

typedef regex_iterator<const char*> cregex_iterator;

typedef regex_iterator<const wchar_t*> wcregex_iterator;

typedef regex_iterator<string::const_iterator> sregex_iterator;

typedef regex_iterator<wstring::const_iterator> wsregex_iterator;

typedef regex_token_iterator<const char*> cregex_token_iterator;

typedef regex_token_iterator<const wchar_t*> wcregex_token_iterator;

typedef regex_token_iterator<string::const_iterator> sregex_token_iterator;

typedef regex_token_iterator<wstring::const_iterator> wsregex_token_iterator;

The following sections explain the regex_match(), regex_search(), and regex_replace() algorithms, and the regex_iterator and regex_token_iterator.

regex_match()

The regex_match() algorithm can be used to compare a given source string with a regular expression pattern and returns true if the pattern matches the entire source string, false otherwise. It is very easy to use. There are six versions of the regex_match() algorithm accepting different kinds of arguments. They all have the following form:

template<...>

bool regex_match(InputSequence[, MatchResults], RegEx[, Flags]);

All variations return true when the entire input sequence matches the pattern, false otherwise. The InputSequence can be represented as:

· A start and end iterator into a source string

· A std::string

· A C-style string

The optional MatchResults parameter is a reference to a match_results and receives the match. If regex_match() returns false, you should only call match_results::empty() or match_results::size(); the rest of the object's state is unspecified. If regex_match() returns true, a match is found and you can inspect the match_results object for what exactly got matched. How to do this is explained with examples in the following sections.

The RegEx parameter is the regular expression that needs to be matched. The optional Flags parameter specifies options for the matching algorithm. In most cases you can keep the default. Consult a Standard Library Reference, for example http://www.cppreference.com/ or http://www.cplusplus.com/reference/, for more details.

regex_match() Example

Suppose you want to write a program that asks the user to enter a date in the format year/month/day, where year is four digits, month is a number between 1 and 12, and day is a number between 1 and 31. You can use a regular expression together with the regex_match() algorithm to validate the user input as follows. The details of the regular expression are explained after the code:

regex r("\\d{4}/(?:0?[1-9]|1[0-2])/(?:0?[1-9]|[1-2][0-9]|3[0-1])");
while (true) {
    cout << "Enter a date (year/month/day) (q=quit): ";
    string str;
    if (!getline(cin, str) || str == "q")
        break;
    if (regex_match(str, r))
        cout << " Valid date." << endl;
    else
        cout << " Invalid date!" << endl;
}

The first line creates the regular expression. The expression consists of three parts separated by a forward slash / character, one part for year, one for month, and one for day. The following list explains these parts:

· \\d{4}: This matches any combination of four digits; for example, 1234, 2010, and so on.

· (?:0?[1-9]|1[0-2]): This sub part of the regular expression is wrapped inside parentheses to make sure the precedence is correct. We don’t need any capture group so (?:...) is used. The inner expression consists of an alternation of two parts separated by the | character.

· 0?[1-9]: This matches any number from 1 to 9 with an optional 0 in front of it. For example, it matches 1, 2, 9, 03, 04, and so on. It does not match 0, 10, 11, and so on.

· 1[0-2]: This matches 10, 11, or 12, and nothing else.

· (?:0?[1-9]|[1-2][0-9]|3[0-1]): This sub part is also wrapped inside a non-capture group and consists of an alternation of three parts:

· 0?[1-9]: This matches any number from 1 to 9 with an optional 0 in front of it. For example, it matches 1, 2, 9, 03, 04, and so on. It does not match 0, 10, 11, and so on.

· [1-2][0-9]: This matches any number between 10 and 29 inclusive and nothing else.

· 3[0-1]: This matches 30 or 31 and nothing else.

The example then enters an infinite loop to ask the user to enter a date. Each date entered is given to the regex_match() algorithm. When regex_match() returns true the user has entered a date that matches the date regular expression pattern.

This example can be expanded a bit by asking the regex_match() algorithm to return captured sub-expressions in a results object. The following code extracts the year, month, and day digits into three separate integer variables.

To understand this code, you have to understand what a capture group does. By specifying a match_results object like smatch in the call to regex_match(), the elements of the match_results object are filled in when the regular expression matches the string. To be able to extract these sub-strings, you must create capture groups, so parentheses are used to define new capture groups.

The first element, [0], in a match_results object contains the string that matched the entire pattern. When using regex_match() and a match is found, this is the entire source sequence. When using regex_search(), discussed in the next section, this is a sub-string in the source sequence that matches the regular expression. Element [1] is the sub-string matched by the first capture group, [2] by the second capture group, and so on. To get a string representation of a capture group, you can write m[i] as in the following code or write m[i].str(), where i is the index of the capture group.

The regular expression in the revised example has a few small changes. The first part matching the year is wrapped in a capture group, while the month and day parts are now also capture groups instead of non-capture groups. The call to regex_match() includes a smatch parameter, which receives the matched capture groups. Here is the adapted example:

regex r("(\\d{4})/(0?[1-9]|1[0-2])/(0?[1-9]|[1-2][0-9]|3[0-1])");
while (true) {
    cout << "Enter a date (year/month/day) (q=quit): ";
    string str;
    if (!getline(cin, str) || str == "q")
        break;
    smatch m;
    if (regex_match(str, m, r)) {
        // m[1], m[2], and m[3] are the year, month, and day capture groups.
        int year = stoi(m[1]);
        int month = stoi(m[2]);
        int day = stoi(m[3]);
        cout << " Valid date: Year=" << year
             << ", month=" << month
             << ", day=" << day << endl;
    } else {
        cout << " Invalid date!" << endl;
    }
}

In this example, there are four elements in the smatch results object: the full match and three captured groups:

· [0]: the string matching the full regular expression, which is the full date in this example

· [1]: the year

· [2]: the month

· [3]: the day

When you execute this example you can get the following output:

Enter a date (year/month/day) (q=quit): 2011/12/01

Valid date: Year=2011, month=12, day=1

Enter a date (year/month/day) (q=quit): 11/12/01

Invalid date!

NOTE These date-matching examples only check if the date consists of a year (four digits), a month (1-12), and a day (1-31). They do not perform any validation for leap years and so on. If you need that, you have to write code to validate the year, month and day values that are extracted by regex_match(). This validation is not a job for regular expressions, so this is not shown.

regex_search()

The regex_match() algorithm discussed in the previous section returns true if the entire source string matches the regular expression, false otherwise. It cannot be used to find a matching sub-string in the source string. The regex_search() algorithm allows you to search for a sub-string that matches a certain pattern in a source string. There are six versions of the regex_search() algorithm. They all have the following form:

template<...>

bool regex_search(InputSequence[, MatchResults], RegEx[, Flags]);

All variations return true when a match is found in the input sequence, false otherwise. The InputSequence can be represented as:

· A start and end iterator into a source string

· A std::string

· A C-style string

The optional MatchResults parameter is a reference to a match_results and receives the match. If regex_search() returns false, you should only call match_results::empty() or match_results::size(); the rest of the object's state is unspecified. If regex_search() returns true, a match is found and you can inspect the match_results object for what exactly got matched.

The RegEx parameter is the regular expression that needs to be matched. The optional Flags parameter specifies options for the matching algorithm. In most cases you can keep the default. Consult a Standard Library Reference for more details.

Two versions of the regex_search() algorithm accept a begin and end iterator as the input sequence that you want to process. You might be tempted to use this version of regex_search() in a loop to find all occurrences of a pattern in a source string by manipulating these begin and end iterators for each regex_search() call. Never do this! It can cause problems when your regular expression uses anchors (^ or $), word boundaries, and so on. It can also cause an infinite loop due to empty matches. Use the regex_iterator or regex_token_iterator as explained later in this chapter to extract all occurrences of a pattern from a source string.

WARNING Never use regex_search() in a loop to find all occurrences of a pattern in a source string. Instead, use a regex_iterator or regex_token_iterator.

regex_search() Example

The regex_search() algorithm can be used to extract matching sub-strings from an input sequence. The following example extracts code comments from input lines. The regular expression searches for a sub-string that starts with // followed by some optional whitespace \\s* followed by one or more characters captured in a capture group (.+). This capture group captures only the comment sub-string. The smatch object m receives the search results. You can check the m[1].first and m[1].second iterators to see where exactly the sub-string matching the first capture group was found in the source string.

regex r("//\\s*(.+)$");
while (true) {
    cout << "Enter a string with optional code comments (q=quit): ";
    string str;
    if (!getline(cin, str) || str == "q")
        break;
    smatch m;
    if (regex_search(str, m, r))
        cout << " Found comment '" << m[1] << "'" << endl;
    else
        cout << " No comment found!" << endl;
}

The output of this program can look as follows:

Enter a string with optional code comments (q=quit): std::string str; // Our source string

Found comment 'Our source string'

Enter a string with optional code comments (q=quit): int a; // A comment with // in the middle

Found comment 'A comment with // in the middle'

Enter a string with optional code comments (q=quit): float f; // A comment with a (tab) character

Found comment 'A comment with a (tab) character'

The match_results object also has prefix() and suffix() methods, which return the part of the source string preceding and following the match, respectively.

regex_iterator

As explained in the previous section, you should never use regex_search() in a loop to extract all occurrences of a pattern from a source sequence. Instead, you should use a regex_iterator or regex_token_iterator. They work similarly to iterators for STL containers, which are discussed in Chapter 16.

Internally, both a regex_iterator and a regex_token_iterator contain a pointer to the regular expression. Because of this, you should not create them with a temporary regex object.

WARNING Never try to create a regex_iterator or regex_token_iterator with a temporary regex object.

regex_iterator Example

The following example asks the user to enter a source string, extracts every word from the string, and prints all words between quotes. The regular expression in this case is [\\w]+, which searches for one or more word characters. This example uses std::string as source, so it uses sregex_iterator for the iterators. A standard iterator loop is used, but in this case, the end iterator is obtained slightly differently from the end iterators of ordinary STL containers. Normally, you specify an end iterator for a particular container, but for regex_iterator, there is only one “end” value. You can get this end iterator by declaring a regex_iterator type using the default constructor; it will implicitly be initialized to the end value.

The for loop creates a start iterator called iter, which accepts a begin and end iterator into the source string together with the regular expression. The loop body is called for every match found, which is every word in this example. The sregex_iterator iterates over all the matches. By dereferencing a sregex_iterator, you get a smatch object. Accessing the first element of this smatch object, [0], gives you the matched sub-string:

regex reg("[\\w]+");
while (true) {
    cout << "Enter a string to split (q=quit): ";
    string str;
    if (!getline(cin, str) || str == "q")
        break;
    const sregex_iterator end;
    for (sregex_iterator iter(cbegin(str), cend(str), reg);
         iter != end; ++iter) {
        cout << "\"" << (*iter)[0] << "\"" << endl;
    }
}

The output of this program can look as follows:

Enter a string to split (q=quit): This, is a test.

"This"

"is"

"a"

"test"

As this example demonstrates, even simple regular expressions can do some powerful string manipulation.

regex_token_iterator

The previous section describes regex_iterator, which iterates through every matched pattern. In each iteration of the loop you get a match_results object, which you can use to extract sub-expressions for that match captured by capture groups.

A regex_token_iterator can be used to automatically iterate over all or selected capture groups across all matched patterns. There are four constructors with the following format:

regex_token_iterator(BidirectionalIterator a,

BidirectionalIterator b,

const regex_type& re

[, SubMatches

[, Flags]]);

All of them require a begin and end iterator as input sequence, and a regular expression. The optional SubMatches parameter is used to specify which capture groups should be iterated over. SubMatches can be specified in four ways:

· A single integer representing the index of the capture group that you want to iterate over.

· A vector with integers representing the indices of the capture groups that you want to iterate over.

· An initializer_list with capture group indices.

· A C-style array with capture group indices.

When you omit SubMatches or when you specify a 0 for SubMatches, you get an iterator that iterates over all capture groups with index 0, which are the sub-strings matching the full regular expression. The optional Flags parameter specifies options for the matching algorithm. In most cases you can keep the default. Consult a Standard Library Reference for more details.

regex_token_iterator Examples

The previous regex_iterator example can be rewritten using a regex_token_iterator as follows. Note that *iter is used in the loop body instead of (*iter)[0] as in the regex_iterator example because the token iterator with 0 as the default submatch index automatically iterates over all capture groups with index 0. The output of this code is exactly the same as the output generated by the regex_iterator example:

regex reg("[\\w]+");
while (true) {
    cout << "Enter a string to split (q=quit): ";
    string str;
    if (!getline(cin, str) || str == "q")
        break;
    const sregex_token_iterator end;
    for (sregex_token_iterator iter(cbegin(str), cend(str), reg);
         iter != end; ++iter) {
        cout << "\"" << *iter << "\"" << endl;
    }
}

The following example asks the user to enter a date and then uses a regex_token_iterator to iterate over the second and third capture groups (month and day), which are specified as a vector of integers. The regular expression used for dates is explained earlier in this chapter. The only difference is that ^ and $ anchors are added, which are not necessary earlier because that example uses regex_match().

regex reg("^(\\d{4})/(0?[1-9]|1[0-2])/(0?[1-9]|[1-2][0-9]|3[0-1])$");
while (true) {
    cout << "Enter a date (year/month/day) (q=quit): ";
    string str;
    if (!getline(cin, str) || str == "q")
        break;
    vector<int> indices{ 2, 3 };  // iterate over month and day only
    const sregex_token_iterator end;
    for (sregex_token_iterator iter(cbegin(str), cend(str), reg, indices);
         iter != end; ++iter) {
        cout << "\"" << *iter << "\"" << endl;
    }
}

This code prints only the month and day of valid dates. Output generated by this example can look as follows:

Enter a date (year/month/day) (q=quit): 2011/1/13

"1"

"13"

Enter a date (year/month/day) (q=quit): 2011/1/32

Enter a date (year/month/day) (q=quit): 2011/12/5

"12"

"5"

The regex_token_iterator can also be used to perform so-called field splitting or tokenization. It is a much safer and more flexible alternative to the old strtok() function from C. Tokenization is triggered in the regex_token_iterator constructor by specifying -1 as the capture group index to iterate over. When in tokenization mode, the iterator iterates over all sub-strings of the input sequence that do not match the regular expression. The following code demonstrates this by tokenizing a string on the delimiters , and ; with any number of whitespace characters before or after the delimiters:

regex reg(R"(\s*[,;]\s*)");
while (true) {
    cout << "Enter a string to split on ',' and ';' (q=quit): ";
    string str;
    if (!getline(cin, str) || str == "q")
        break;
    const sregex_token_iterator end;
    for (sregex_token_iterator iter(cbegin(str), cend(str), reg, -1);
         iter != end; ++iter) {
        cout << "\"" << *iter << "\"" << endl;
    }
}

The regular expression in this example is specified as a raw string literal and searches for patterns that match the following:

· Zero or more whitespace characters,

· followed by a , or ; character,

· followed by zero or more whitespace characters.

The output can be as follows:

Enter a string to split on ',' and ';' (q=quit): This is, a; test string.

"This is"

"a"

"test string."

As you can see from this output, the string is split on , and ;. All whitespace characters around the , or ; are removed, because the tokenization iterator iterates over all sub-strings that do not match the regular expression, and because the regular expression matches , and ; together with any whitespace around them.

regex_replace()

The regex_replace() algorithm requires a regular expression, and a formatting string that is used to replace matching sub-strings. This formatting string can reference part of the matched sub-strings by using the following escape sequences:

ESCAPE SEQUENCE    REPLACED WITH
$n                 the string matching the n-th capture group; for example, $1 for the first capture group, $2 for the second, and so on
$&                 the string matching the whole regular expression, which is the same as $0
$`                 the part of the input sequence that appears to the left of the sub-string matching the regular expression
$'                 the part of the input sequence that appears to the right of the sub-string matching the regular expression
$$                 a dollar sign

There are six versions of the regex_replace() algorithm. The difference between them is in the type of arguments. Four of them have the following format:

string regex_replace(InputSequence, RegEx, FormatString[, Flags]);

These four versions return the resulting string after performing the replacement. The InputSequence can be a std::string or a C-style string. The RegEx parameter is the regular expression that needs to be matched. The FormatString can be a std::string or a C-style string. The optional Flags parameter specifies options for the replace algorithm.

Two versions of the regex_replace() algorithm have the following format:

OutputIterator regex_replace(OutputIterator,

BidirectionalIterator first,

BidirectionalIterator last,

RegEx, FormatString[, Flags]);

These two versions write the resulting string to the given output iterator and return this output iterator. The input sequence is given as a begin and end iterator. The other parameters are identical to the other four versions of regex_replace().

regex_replace() Examples

As a first example, take the source HTML string <body><h1>Header</h1><p>Some text</p></body> and the regular expression <h1>(.*)</h1><p>(.*)</p>. The following table shows the different escape sequences and what they will be replaced with:

ESCAPE SEQUENCE    REPLACED WITH
$1                 Header
$2                 Some text
$&                 <h1>Header</h1><p>Some text</p>
$`                 <body>
$'                 </body>

The following code demonstrates the use of regex_replace():

const string str("<body><h1>Header</h1><p>Some text</p></body>");

regex r("<h1>(.*)</h1><p>(.*)</p>");

const string format("H1=$1 and P=$2");

string result = regex_replace(str, r, format);

cout << "Original string: '" << str << "'" << endl;

cout << "New string : '" << result << "'" << endl;

The output of this program is as follows:

Original string: '<body><h1>Header</h1><p>Some text</p></body>'

New string : '<body>H1=Header and P=Some text</body>'

The regex_replace() algorithm accepts a number of flags that can be used to manipulate how it is working. The most important flags are given in the following table:

FLAG                 DESCRIPTION
format_default       The default is to replace all occurrences of the pattern, and to also copy everything that does not match the pattern to the result string.
format_no_copy       Replace all occurrences of the pattern, but do not copy anything that does not match the pattern to the result string.
format_first_only    Replace only the first occurrence of the pattern.

The following example modifies the previous code to use the format_no_copy flag:

const string str("<body><h1>Header</h1><p>Some text</p></body>");

regex r("<h1>(.*)</h1><p>(.*)</p>");

const string format("H1=$1 and P=$2");

string result = regex_replace(str, r, format,

regex_constants::format_no_copy);

cout << "Original string: '" << str << "'" << endl;

cout << "New string : '" << result << "'" << endl;

The output is as follows. Compare this with the output of the previous version.

Original string: '<body><h1>Header</h1><p>Some text</p></body>'

New string : 'H1=Header and P=Some text'

Another example is to get an input string and replace each word boundary with a newline so that the target string contains only one word per line. The following example demonstrates this without using any loops to process a given string. The code first creates a regular expression that matches individual words. When a match is found it is replaced with $1\n where $1 is replaced with the matched word. Note also the use of the format_no_copy flag to prevent copying whitespace and other non-word characters from the source string to the result string:

regex reg("([\\w]+)");
const string format("$1\n");
while (true) {
    cout << "Enter a string to split over multiple lines (q=quit): ";
    string str;
    if (!getline(cin, str) || str == "q")
        break;
    cout << regex_replace(str, reg, format,
                          regex_constants::format_no_copy) << endl;
}

The output of this program can be as follows:

Enter a string to split over multiple lines (q=quit): This is a test.

This

is

a

test

SUMMARY

This chapter gave you an appreciation for coding with localization in mind. As anyone who has been through a localization effort will tell you, adding support for a new language or locale is infinitely easier if you have planned ahead; for example, by using Unicode characters and being mindful of locales.

The second part of this chapter explained the regular expressions library. Once you know the syntax of regular expressions, it becomes much easier to work with strings. Regular expressions allow you to validate strings, search for sub-strings inside an input sequence, perform find-and-replace operations, and so on. It is highly recommended you get to know them and start using them instead of writing your own string manipulation routines. They will make your life easier.