C# 5.0 in a Nutshell (2012)
Chapter 26. Regular Expressions
The regular expressions language identifies character patterns. The .NET types supporting regular expressions are based on Perl 5 regular expressions and support both search and search/replace functionality.
Regular expressions are used for tasks such as:
§ Validating text input such as passwords and phone numbers (ASP.NET provides the RegularExpressionValidator control just for this purpose)
§ Parsing textual data into more structured forms (e.g., extracting data from an HTML page for storage in a database)
§ Replacing patterns of text in a document (e.g., whole words only)
This chapter is split into conceptual sections, which teach the basics of regular expressions in .NET, and reference sections, which describe the regular expressions language.
All regular expression types are defined in System.Text.RegularExpressions.
NOTE
For more on regular expressions, http://regular-expressions.info is a good online reference with lots of examples, and Mastering Regular Expressions by Jeffrey E. F. Friedl is invaluable for the serious programmer.
The samples in this chapter are all preloaded into LINQPad. There is also an interactive utility available called Expresso (http://www.ultrapico.com) which assists in building and visualizing regular expressions, and comes with its own expression library.
Regular Expression Basics
One of the most common regular expression operators is a quantifier. ? is a quantifier that matches the preceding item 0 or 1 time. In other words, ? means optional. An item is either a single character or a complex structure of characters in square brackets. For example, the regular expression "colou?r" matches color and colour, but not colouur:
Console.WriteLine (Regex.Match ("color", @"colou?r").Success); // True
Console.WriteLine (Regex.Match ("colour", @"colou?r").Success); // True
Console.WriteLine (Regex.Match ("colouur", @"colou?r").Success); // False
Regex.Match searches within a larger string. The object that it returns has properties for the Index and Length of the match, as well as the actual Value matched:
Match m = Regex.Match ("any colour you like", @"colou?r");
Console.WriteLine (m.Success); // True
Console.WriteLine (m.Index); // 4
Console.WriteLine (m.Length); // 6
Console.WriteLine (m.Value); // colour
Console.WriteLine (m.ToString()); // colour
You can think of Regex.Match as a more powerful version of the string’s IndexOf method. The difference is that it searches for a pattern rather than a literal string.
The IsMatch method is a shortcut for calling Match and then testing the Success property.
The regular expressions engine works from left to right by default, so only the leftmost match is returned. You can use the NextMatch method to return more matches:
Match m1 = Regex.Match ("One color? There are two colours in my head!",
  @"colou?rs?");
Match m2 = m1.NextMatch();
Console.WriteLine (m1); // color
Console.WriteLine (m2); // colours
The Matches method returns all matches in a MatchCollection. We can rewrite the preceding example as follows:
foreach (Match m in Regex.Matches
  ("One color? There are two colours in my head!", @"colou?rs?"))
  Console.WriteLine (m);
Another common regular expressions operator is the alternator, expressed with a vertical bar, |. An alternator expresses alternatives. The following matches “Jen”, “Jenny”, and “Jennifer”:
Console.WriteLine (Regex.IsMatch ("Jenny", "Jen(ny|nifer)?")); // True
The parentheses around an alternator separate the alternatives from the rest of the expression.
NOTE
From Framework 4.5, you can specify a timeout when matching regular expressions. If a match operation takes longer than the specified TimeSpan, a RegexMatchTimeoutException is thrown. This can be useful if your program processes arbitrary regular expressions (for instance, in an advanced search dialog box) because it prevents malformed regular expressions from infinitely spinning.
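The timeout described in this note can be sketched as follows. This is a minimal illustration, assuming a hypothetical helper named SafeIsMatch that reports a timeout as null; whether the second call actually times out depends on the runtime version and the machine, so no output is claimed for it.

```csharp
using System;
using System.Text.RegularExpressions;

static class TimeoutDemo
{
  // Hypothetical helper: returns null when the match is abandoned.
  public static bool? SafeIsMatch (string input, string pattern, TimeSpan timeout)
  {
    try
    {
      return Regex.IsMatch (input, pattern, RegexOptions.None, timeout);
    }
    catch (RegexMatchTimeoutException)
    {
      return null;   // The match took longer than 'timeout'
    }
  }

  static void Main()
  {
    TimeSpan limit = TimeSpan.FromMilliseconds (250);
    Console.WriteLine (SafeIsMatch ("colour", @"colou?r", limit));   // True

    // Nested quantifiers can backtrack heavily on a near-miss input.
    string nearMiss = new string ('a', 40) + "!";
    bool? result = SafeIsMatch (nearMiss, @"^(a+)+$", limit);
    Console.WriteLine (result.HasValue ? result.ToString() : "timed out");
  }
}
```

The same TimeSpan argument is accepted by the Regex constructor, which is convenient when one instance is reused for many inputs.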
Compiled Regular Expressions
In some of the preceding examples, we called a static Regex method repeatedly with the same pattern. An alternative approach in these cases is to instantiate a Regex object with the pattern and RegexOptions.Compiled, and then call instance methods:
Regex r = new Regex (@"sausages?", RegexOptions.Compiled);
Console.WriteLine (r.Match ("sausage")); // sausage
Console.WriteLine (r.Match ("sausages")); // sausages
RegexOptions.Compiled instructs the Regex instance to use lightweight code generation (DynamicMethod in Reflection.Emit) to dynamically build and compile code tailored to that particular regular expression. This results in (up to 10 times) faster matching, at the expense of an initial compilation cost (a few tens of microseconds).
A Regex instance is immutable.
NOTE
The regular expressions engine is fast. Even without compilation, a simple match typically takes less than a microsecond.
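One way to pay the construction cost only once is to keep the Regex in a static field and reuse it. The sketch below assumes the RegexOptions.Compiled flag from Table 26-1 is wanted; the class and method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

static class SausageMatcher
{
  // Built once and shared; a Regex instance is immutable, so reuse is safe.
  static readonly Regex Pattern =
    new Regex (@"sausages?", RegexOptions.Compiled);

  public static int CountMatches (string[] inputs)
  {
    int count = 0;
    foreach (string input in inputs)
      if (Pattern.IsMatch (input)) count++;
    return count;
  }

  static void Main()
  {
    string[] menu = { "sausage roll", "two sausages", "salad" };
    Console.WriteLine (CountMatches (menu));   // 2
  }
}
```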
RegexOptions
The RegexOptions flags enum lets you tweak matching behavior. A common use for RegexOptions is to perform a case-insensitive search:
Console.WriteLine (Regex.Match ("a", "A", RegexOptions.IgnoreCase)); // a
This applies the current culture’s rules for case equivalence. The CultureInvariant flag lets you request the invariant culture instead:
Console.WriteLine (Regex.Match ("a", "A", RegexOptions.IgnoreCase
  | RegexOptions.CultureInvariant));
Most of the RegexOptions flags can also be activated within a regular expression itself, using a single-letter code as follows:
Console.WriteLine (Regex.Match ("a", @"(?i)A")); // a
You can turn options on and off throughout an expression as follows:
Console.WriteLine (Regex.Match ("AAAa", @"(?i)a(?-i)a")); // Aa
Another useful option is IgnorePatternWhitespace or (?x). This allows you to insert whitespace to make a regular expression more readable—without the whitespace being taken literally.
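As a small illustration of (?x), the following sketch spreads a time-of-day pattern over several lines with comments. The pattern and the names TimeOfDay and FindTime are assumptions made for this example only.

```csharp
using System;
using System.Text.RegularExpressions;

static class VerbosePatternDemo
{
  // (?x) lets the pattern span lines; # starts a comment to end of line.
  const string TimeOfDay = @"(?x)
    \d{1,2}   # hours
    :         # separator
    \d{2}     # minutes";

  public static string FindTime (string text)
  {
    return Regex.Match (text, TimeOfDay).Value;
  }

  static void Main()
  {
    Console.WriteLine (FindTime ("Meet at 9:30 sharp"));        // 9:30
    // In this mode a space in the input must be matched explicitly, e.g. with \s.
    Console.WriteLine (Regex.IsMatch ("a b", @"(?x) a \s b"));  // True
  }
}
```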
Table 26-1 lists all RegexOptions values along with their single-letter codes, where one exists.
Table 26-1. Regular expression options
Enum value | Regular expressions code | Description
None | (none) | (no options set)
IgnoreCase | i | Ignores case (by default, regular expressions are case-sensitive)
Multiline | m | Changes ^ and $ so that they match the start/end of a line instead of start/end of the string
ExplicitCapture | n | Captures only explicitly named or explicitly numbered groups (see Groups)
Compiled | (none) | Forces compilation of regular expression to IL
Singleline | s | Makes . match every character (instead of matching every character except \n)
IgnorePatternWhitespace | x | Eliminates unescaped whitespace from the pattern
RightToLeft | (none) | Searches from right to left; can’t be specified midstream
ECMAScript | (none) | Forces ECMA compliance (by default, the implementation is not ECMA-compliant)
CultureInvariant | (none) | Turns off culture-specific behavior for string comparisons
Character Escapes
Regular expressions have the following metacharacters, which have a special rather than literal meaning:
\ * + ? | { [ ( ) ^ $ . #
To refer to a metacharacter literally, you must prefix the character with a backslash. In the following example, we escape the ? character to match the string "what?":
Console.WriteLine (Regex.Match ("what?", @"what\?")); // what? (correct)
Console.WriteLine (Regex.Match ("what?", @"what?")); // what (incorrect)
WARNING
If the character is inside a set (square brackets), this rule does not apply, and most metacharacters are interpreted literally (the exceptions are \ and ], and, depending on position, ^ and -). We discuss sets in the following section.
Regex’s static Escape and Unescape methods convert a string containing regular expression metacharacters by replacing them with escaped equivalents, and vice versa. For example:
Console.WriteLine (Regex.Escape (@"?")); // \?
Console.WriteLine (Regex.Unescape (@"\?")); // ?
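Regex.Escape is most useful when part of a pattern comes from text that should be treated literally, such as user input. The following sketch assumes a hypothetical helper, ContainsLiteral, to show the difference.

```csharp
using System;
using System.Text.RegularExpressions;

static class LiteralSearch
{
  // Hypothetical helper: treats 'literal' as plain text, not as a pattern.
  public static bool ContainsLiteral (string text, string literal)
  {
    return Regex.IsMatch (text, Regex.Escape (literal));
  }

  static void Main()
  {
    string text = "Everyone knows 1+1=2";
    Console.WriteLine (Regex.Escape ("1+1=2"));            // 1\+1=2
    Console.WriteLine (ContainsLiteral (text, "1+1=2"));   // True
    // Unescaped, + is a quantifier: "one or more 1s, followed by 1=2".
    Console.WriteLine (Regex.IsMatch (text, "1+1=2"));     // False
  }
}
```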
We express all the regular expression strings in this chapter with the C# @ (verbatim string) literal. This is to bypass C#’s escape mechanism, which also uses the backslash. Without the @, a literal backslash would require four backslashes:
Console.WriteLine (Regex.Match ("\\", "\\\\")); // \
Unless you include the (?x) option, spaces are treated literally in regular expressions:
Console.Write (Regex.IsMatch ("hello world", @"hello world")); // True
Character Sets
Character sets act as wildcards for a particular set of characters.
Expression | Meaning | Inverse (“not”)
[abcdef] | Matches a single character in the list | [^abcdef]
[a-f] | Matches a single character in a range | [^a-f]
\d | Matches a decimal digit; for ASCII text, same as [0-9] | \D
\w | Matches a word character (by default, varies according to CultureInfo.CurrentCulture; for example, in English, same as [a-zA-Z_0-9]) | \W
\s | Matches a whitespace character; includes [ \n\r\t\f\v] | \S
\p{category} | Matches a character in a specified category | \P{category}
. | (Default mode) Matches any character except \n | \n
. | (SingleLine mode) Matches any character | (none)
To match exactly one of a set of characters, put the character set in square brackets:
Console.Write (Regex.Matches ("That is that.", "[Tt]hat").Count); // 2
To match any character except those in a set, put the set in square brackets with a ^ symbol before the first character:
Console.Write (Regex.Match ("quiz qwerty", "q[^aeiou]").Index); // 5
You can specify a range of characters with a hyphen. The following regular expression matches a chess move:
Console.Write (Regex.Match ("b1-c4", @"[a-h]\d-[a-h]\d").Success); // True
\d indicates a digit character, so it matches any digit. \D matches any nondigit character.
\w indicates a word character, which includes letters, numbers, and the underscore. \W matches any nonword character. These work as expected for non-English letters too, such as Cyrillic.
. matches any character except \n (but allows \r).
\p matches a character in a specified category, such as {Lu} for uppercase letter or {P} for punctuation (we list the categories in the reference section later in the chapter):
Console.Write (Regex.IsMatch ("Yes, please", @"\p{P}")); // True
We will find more uses for \d, \w, and . when we combine them with quantifiers.
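To make the behavior of \w and \p concrete, here is a brief sketch. It assumes a hypothetical helper, IsSingleWord, and uses the RegexOptions.ECMAScript flag to show \w narrowed to ASCII word characters.

```csharp
using System;
using System.Text.RegularExpressions;

static class WordCharDemo
{
  public static bool IsSingleWord (string s, RegexOptions options)
  {
    return Regex.IsMatch (s, @"^\w+$", options);
  }

  static void Main()
  {
    // Six lowercase Cyrillic letters, written as escapes to avoid encoding issues.
    string cyrillic = "\u043F\u0440\u0438\u0432\u0435\u0442";
    Console.WriteLine (IsSingleWord (cyrillic, RegexOptions.None));        // True
    // ECMAScript mode restricts \w to [a-zA-Z_0-9].
    Console.WriteLine (IsSingleWord (cyrillic, RegexOptions.ECMAScript));  // False
    Console.WriteLine (Regex.Matches ("Hello World", @"\p{Lu}").Count);    // 2
  }
}
```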
Quantifiers
Quantifiers match an item a specified number of times.
Quantifier | Meaning
* | Zero or more matches
+ | One or more matches
? | Zero or one match
{n} | Exactly n matches
{n,} | At least n matches
{n,m} | Between n and m matches
The * quantifier matches the preceding character or group zero or more times. The following matches cv.doc, along with any numbered versions of the same file (e.g., cv2.doc, cv15.doc):
Console.Write (Regex.Match ("cv15.doc", @"cv\d*\.doc").Success); // True
Notice that we have to escape out the period in the file extension with a backslash.
The following allows anything between cv and .doc and is equivalent to dir cv*.doc:
Console.Write (Regex.Match ("cvjoint.doc", @"cv.*\.doc").Success); // True
The + quantifier matches the preceding character or group one or more times. For example:
Console.Write (Regex.Matches ("slow! yeah slooow!", "slo+w").Count); // 2
The {} quantifier matches a specified number (or range) of repetitions. The following matches a blood pressure reading:
Regex bp = new Regex (@"\d{2,3}/\d{2,3}");
Console.WriteLine (bp.Match ("It used to be 160/110")); // 160/110
Console.WriteLine (bp.Match ("Now it's only 115/75")); // 115/75
Greedy Versus Lazy Quantifiers
By default, quantifiers are greedy, as opposed to lazy. A greedy quantifier repeats as many times as it can before proceeding. A lazy quantifier repeats as few times as it can before proceeding. You can make any quantifier lazy by suffixing it with the ? symbol. To illustrate the difference, consider the following HTML fragment:
string html = "<i>By default</i> quantifiers are <i>greedy</i> creatures";
Suppose we want to extract the two phrases in italics. If we execute the following:
foreach (Match m in Regex.Matches (html, @"<i>.*</i>"))
  Console.WriteLine (m);
the result is not two matches, but a single match, as follows:
<i>By default</i> quantifiers are <i>greedy</i>
The problem is that our * quantifier greedily repeats as many times as it can before matching </i>. So, it chomps right through the first </i>, stopping only at the final </i> (the last point at which the rest of the expression can still match).
If we make the quantifier lazy:
foreach (Match m in Regex.Matches (html, @"<i>.*?</i>"))
  Console.WriteLine (m);
the * bails out at the first point at which the rest of the expression can match. Here’s the result:
<i>By default</i>
<i>greedy</i>
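A lazy quantifier is not the only way to stop at the first closing tag. As an alternative sketch (not the approach used above), a negated character set can forbid the match from running past the next < character; the helper name ExtractItalics is hypothetical, and the pattern assumes the italic text contains no nested tags.

```csharp
using System;
using System.Text.RegularExpressions;

static class ItalicsDemo
{
  // [^<]* cannot cross a '<', so each match ends at the nearest </i>.
  public static string[] ExtractItalics (string html)
  {
    MatchCollection matches = Regex.Matches (html, @"<i>[^<]*</i>");
    string[] result = new string [matches.Count];
    for (int i = 0; i < matches.Count; i++)
      result[i] = matches[i].Value;
    return result;
  }

  static void Main()
  {
    string html = "<i>By default</i> quantifiers are <i>greedy</i> creatures";
    foreach (string s in ExtractItalics (html))
      Console.WriteLine (s);
    // <i>By default</i>
    // <i>greedy</i>
  }
}
```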
Zero-Width Assertions
The regular expressions language lets you place conditions on what should occur before or after a match, through lookbehind, lookahead, anchors, and word boundaries. These are called zero-width assertions, because they don’t increase the width (or length) of the match itself.
Lookahead and Lookbehind
The (?=expr) construct checks whether the text that follows matches expr, without including expr in the result. This is called positive lookahead. In the following example, we look for a number followed by the word “miles”:
Console.WriteLine (Regex.Match ("say 25 miles more", @"\d+\s(?=miles)"));
OUTPUT: 25
Notice the word “miles” was not returned in the result, even though it was required to satisfy the match.
After a successful lookahead, matching continues as though the sneak preview never took place. So, if we append .* to our expression as follows:
Console.WriteLine (Regex.Match ("say 25 miles more", @"\d+\s(?=miles).*"));
the result is 25 miles more.
Lookahead can be useful in enforcing rules for a strong password. Suppose a password has to be at least six characters and contain at least one digit. With a lookahead, we could achieve this as follows:
string password = "...";
bool ok = Regex.IsMatch (password, @"(?=.*\d).{6,}");
This first performs a lookahead to ensure that a digit occurs somewhere in the string. If satisfied, it returns to its position before the sneak preview began and matches six or more characters. (In the section Cookbook Regular Expressions, later in this chapter, we include a more substantial password validation example.)
The opposite is the negative lookahead construct, (?!expr). This requires that the match not be followed by expr. The following expression matches “good”—unless “however” or “but” appears later in the string:
string regex = "(?i)good(?!.*(however|but))";
Console.WriteLine (Regex.IsMatch ("Good work! But...", regex)); // False
Console.WriteLine (Regex.IsMatch ("Good work! Thanks!", regex)); // True
The (?<=expr) construct denotes positive lookbehind and requires that a match be preceded by a specified expression. The opposite construct, (?<!expr), denotes negative lookbehind and requires that a match not be preceded by a specified expression. For example, the following matches “good”—unless “however” appears earlier in the string:
string regex = "(?i)(?<!however.*)good";
Console.WriteLine (Regex.IsMatch ("However good, we...", regex)); // False
Console.WriteLine (Regex.IsMatch ("Very good, thanks!", regex)); // True
We could improve these examples by adding word boundary assertions, which we will introduce shortly.
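Positive lookbehind is handy when a marker must precede the text you want but should not be part of the result. The sketch below assumes a hypothetical helper, FirstDollarAmount, that returns the first number preceded by a dollar sign.

```csharp
using System;
using System.Text.RegularExpressions;

static class PriceDemo
{
  // (?<=\$) requires a preceding '$' without including it in the match.
  public static string FirstDollarAmount (string text)
  {
    return Regex.Match (text, @"(?<=\$)\d+(\.\d{2})?").Value;
  }

  static void Main()
  {
    Console.WriteLine (FirstDollarAmount ("Cost: $42.50 (was 60)"));   // 42.50
    // A failed match has an empty Value.
    Console.WriteLine (FirstDollarAmount ("no prices here") == "");    // True
  }
}
```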
Anchors
The anchors ^ and $ match a particular position. By default:
§ ^ matches the start of the string.
§ $ matches the end of the string.
NOTE
^ has two context-dependent meanings: an anchor and a character class negator.
$ has two context-dependent meanings: an anchor and a replacement group denoter.
For example:
Console.WriteLine (Regex.Match ("Not now", "^[Nn]o")); // No
Console.WriteLine (Regex.Match ("f = 0.2F", "[Ff]$")); // F
If you specify RegexOptions.Multiline or include (?m) in the expression:
§ ^ matches the start of the string or line (directly after a \n).
§ $ matches the end of the string or line (directly before a \n).
There’s a catch to using $ in multiline mode: a new line in Windows is nearly always denoted with \r\n rather than just \n. This means that for $ to be useful, you must usually match the \r as well, with a positive lookahead:
(?=\r?$)
The positive lookahead ensures that \r doesn’t become part of the result. The following matches lines that end in ".txt":
string fileNames = "a.txt" + "\r\n" + "b.doc" + "\r\n" + "c.txt";
string r = @".+\.txt(?=\r?$)";
foreach (Match m in Regex.Matches (fileNames, r, RegexOptions.Multiline))
  Console.Write (m + " ");
OUTPUT: a.txt c.txt
The following matches all empty lines in string s:
MatchCollection emptyLines = Regex.Matches (s, "^(?=\r?$)",
  RegexOptions.Multiline);
The following matches all lines that are either empty or contain only whitespace:
MatchCollection blankLines = Regex.Matches (s, "^[ \t]*(?=\r?$)",
  RegexOptions.Multiline);
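To see the two patterns above in action, the following sketch runs them against a sample string with Windows line endings. The sample text and the helper names are assumptions for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class BlankLineDemo
{
  public static int CountEmptyLines (string s)
  {
    return Regex.Matches (s, "^(?=\r?$)", RegexOptions.Multiline).Count;
  }

  public static int CountBlankLines (string s)
  {
    return Regex.Matches (s, "^[ \t]*(?=\r?$)", RegexOptions.Multiline).Count;
  }

  static void Main()
  {
    // Five lines: "one", an empty line, "two", a whitespace-only line, "three".
    string s = "one\r\n\r\ntwo\r\n \t\r\nthree";
    Console.WriteLine (CountEmptyLines (s));   // 1
    Console.WriteLine (CountBlankLines (s));   // 2
  }
}
```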
NOTE
Since an anchor matches a position rather than a character, specifying an anchor on its own matches an empty string:
Console.WriteLine (Regex.Match ("x", "$").Length); // 0
Word Boundaries
The word boundary assertion \b matches where word characters (\w) adjoin either:
§ Nonword characters (\W)
§ The beginning/end of the string (^ and $)
\b is often used to match whole words. For example:
foreach (Match m in Regex.Matches ("Wedding in Sarajevo", @"\b\w+\b"))
  Console.WriteLine (m);
Wedding
in
Sarajevo
The following statements highlight the effect of a word boundary:
int one = Regex.Matches ("Wedding in Sarajevo", @"\bin\b").Count; // 1
int two = Regex.Matches ("Wedding in Sarajevo", @"in").Count; // 2
The next query uses positive lookahead to return words followed by “(sic)”:
string text = "Don't loose (sic) your cool";
Console.Write (Regex.Match (text, @"\b\w+\b\s(?=\(sic\))")); // loose
Groups
Sometimes it’s useful to separate a regular expression into a series of subexpressions, or groups. For instance, consider the following regular expression that represents a U.S. phone number such as 206-465-1918:
\d{3}-\d{3}-\d{4}
Suppose we wish to separate this into two groups: area code and local number. We can achieve this by using parentheses to capture each group:
(\d{3})-(\d{3}-\d{4})
We then retrieve the groups programmatically as follows:
Match m = Regex.Match ("206-465-1918", @"(\d{3})-(\d{3}-\d{4})");
Console.WriteLine (m.Groups[1]); // 206
Console.WriteLine (m.Groups[2]); // 465-1918
The zeroth group represents the entire match. In other words, it has the same value as the match’s Value:
Console.WriteLine (m.Groups[0]); // 206-465-1918
Console.WriteLine (m); // 206-465-1918
Groups are part of the regular expressions language itself. This means you can refer to a group within a regular expression. The \n syntax lets you index the group by group number n within the expression. For example, the expression (\w)ee\1 matches deed and peep. In the following example, we find all words in a string starting and ending in the same letter:
foreach (Match m in Regex.Matches ("pop pope peep", @"\b(\w)\w+\1\b"))
  Console.Write (m + " "); // pop peep
The parentheses around the \w instruct the regular expressions engine to store the submatch in a group (in this case, a single letter), so it can be used later. We refer to that group later using \1, meaning the first group in the expression.
Named Groups
In a long or complex expression, it can be easier to work with groups by name rather than index. Here’s a rewrite of the previous example, using a group that we name 'letter':
string regEx =
  @"\b" + // word boundary
  @"(?'letter'\w)" + // match first letter, and name it 'letter'
  @"\w+" + // match middle letters
  @"\k'letter'" + // match last letter, denoted by 'letter'
  @"\b"; // word boundary
foreach (Match m in Regex.Matches ("bob pope peep", regEx))
  Console.Write (m + " "); // bob peep
To name a captured group:
(?'group-name'group-expr) or (?<group-name>group-expr)
To refer to a group:
\k'group-name' or \k<group-name>
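Both forms of the syntax can be tried with a short sketch. The patterns and the names IsoDate, Describe, and FirstQuoted are assumptions made for this example; the first method reads named groups from C#, and the second uses \k to require a matching closing quote.

```csharp
using System;
using System.Text.RegularExpressions;

static class NamedGroupDemo
{
  static readonly Regex IsoDate =
    new Regex (@"(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})");

  // Reads the named groups and rearranges them as day/month/year.
  public static string Describe (string text)
  {
    Match m = IsoDate.Match (text);
    if (!m.Success) return "no date";
    return m.Groups["day"].Value + "/" + m.Groups["month"].Value
      + "/" + m.Groups["year"].Value;
  }

  // \k<quote> must match the same quote character that opened the string.
  public static string FirstQuoted (string text)
  {
    return Regex.Match (text, @"(?<quote>['""]).*?\k<quote>").Value;
  }

  static void Main()
  {
    Console.WriteLine (Describe ("Released 2012-06-26"));   // 26/06/2012
    Console.WriteLine (FirstQuoted ("say 'hi' now"));       // 'hi'
  }
}
```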
The following example matches a simple (nonnested) XML/HTML element, by looking for start and end nodes with a matching name:
string regFind =
  @"<(?'tag'\w+?).*>" + // match first tag, and name it 'tag'
  @"(?'text'.*?)" + // match text content, name it 'text'
  @"</\k'tag'>"; // match last tag, denoted by 'tag'
Match m = Regex.Match ("<h1>hello</h1>", regFind);
Console.WriteLine (m.Groups ["tag"]); // h1
Console.WriteLine (m.Groups ["text"]); // hello
Allowing for all possible variations in XML structure, such as nested elements, is more complex. The .NET regular expressions engine has a sophisticated extension called “matched balanced constructs” that can assist with nested tags; information on this is available on the Internet and in Mastering Regular Expressions by Jeffrey E. F. Friedl.
Replacing and Splitting Text
The Regex.Replace method works like string.Replace, except that it uses a regular expression.
The following replaces “cat” with “dog”. Unlike with string.Replace, “catapult” won’t change into “dogapult”, because we match on word boundaries:
string find = @"\bcat\b";
string replace = "dog";
Console.WriteLine (Regex.Replace ("catapult the cat", find, replace));
OUTPUT: catapult the dog
The replacement string can reference the original match with the $0 substitution construct. The following example wraps numbers within a string in angle brackets:
string text = "10 plus 20 makes 30";
Console.WriteLine (Regex.Replace (text, @"\d+", @"<$0>"));
OUTPUT: <10> plus <20> makes <30>
You can access any captured groups with $1, $2, $3, and so on, or ${name} for a named group. To illustrate how this can be useful, consider the regular expression in the previous section that matched a simple XML element. By rearranging the groups, we can form a replacement expression that moves the element’s content into an XML attribute:
string regFind =
  @"<(?'tag'\w+?).*>" + // match first tag, and name it 'tag'
  @"(?'text'.*?)" + // match text content, name it 'text'
  @"</\k'tag'>"; // match last tag, denoted by 'tag'
string regReplace =
  @"<${tag}" + // <tag
  @" value=""" + // (space) value="
  @"${text}" + // text
  @"""/>"; // "/>
Console.Write (Regex.Replace ("<msg>hello</msg>", regFind, regReplace));
Here’s the result:
<msg value="hello"/>
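Numbered substitutions work the same way. The sketch below, using hypothetical helper names, reorders the parts of a date and shows the ${1} form, which keeps a group number separate from a digit that follows it in the replacement string.

```csharp
using System;
using System.Text.RegularExpressions;

static class SubstitutionDemo
{
  // Rewrites yyyy-MM-dd dates as dd/MM/yyyy by reordering numbered groups.
  public static string ToDayFirst (string text)
  {
    return Regex.Replace (text, @"(\d{4})-(\d{2})-(\d{2})", "$3/$2/$1");
  }

  // ${1} is group 1; the 0 after it is a literal character.
  public static string AppendZero (string text)
  {
    return Regex.Replace (text, @"(\d+)", "${1}0");
  }

  static void Main()
  {
    Console.WriteLine (ToDayFirst ("Due 2012-06-26 and 2012-07-01"));
    // Due 26/06/2012 and 01/07/2012
    Console.WriteLine (AppendZero ("5 and 12"));   // 50 and 120
  }
}
```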
MatchEvaluator Delegate
Replace has an overload that takes a MatchEvaluator delegate, which is invoked per match. This allows you to delegate the content of the replacement string to C# code when the regular expressions language isn’t expressive enough. For example:
Console.WriteLine (Regex.Replace ("5 is less than 10", @"\d+",
  m => (int.Parse (m.Value) * 10).ToString()) );
OUTPUT: 50 is less than 100
In the cookbook, we show how to use a MatchEvaluator to escape Unicode characters appropriately for HTML.
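As a further sketch of the delegate overload, the following capitalizes the first letter of each word, which a plain replacement string cannot express. The helper name Capitalize is hypothetical, and the pattern assumes lowercase ASCII words.

```csharp
using System;
using System.Text.RegularExpressions;

static class EvaluatorDemo
{
  // The lambda runs once per match and returns that match's replacement text.
  public static string Capitalize (string text)
  {
    return Regex.Replace (text, @"\b[a-z]", m => m.Value.ToUpperInvariant());
  }

  static void Main()
  {
    Console.WriteLine (Capitalize ("the quick brown fox"));   // The Quick Brown Fox
  }
}
```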
Splitting Text
The static Regex.Split method is a more powerful version of the string.Split method, with a regular expression denoting the separator pattern. In this example, we split a string, where any digit counts as a separator:
foreach (string s in Regex.Split ("a5b7c", @"\d"))
  Console.Write (s + " "); // a b c
The result, here, doesn’t include the separators themselves. You can include the separators, however, by wrapping the expression in a positive lookahead. The following splits a camel-case string into separate words:
foreach (string s in Regex.Split ("oneTwoThree", @"(?=[A-Z])"))
  Console.Write (s + " "); // one Two Three
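Another way to keep the separators is to wrap the separator pattern in capturing parentheses, in which case Regex.Split returns the captured text as separate elements. A minimal sketch, with a hypothetical helper name:

```csharp
using System;
using System.Text.RegularExpressions;

static class SplitDemo
{
  // The parentheses capture each separator, so Split returns it as well.
  public static string[] SplitKeepingDigits (string text)
  {
    return Regex.Split (text, @"(\d)");
  }

  static void Main()
  {
    Console.WriteLine (string.Join (" ", SplitKeepingDigits ("a5b7c")));   // a 5 b 7 c
  }
}
```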
Cookbook Regular Expressions
Recipes
Matching U.S. Social Security number/phone number
string ssNum = @"\d{3}-\d{2}-\d{4}";
Console.WriteLine (Regex.IsMatch ("123-45-6789", ssNum)); // True
string phone = @"(?x)
( \d{3}[-\s] | \(\d{3}\)\s? )
\d{3}[-\s]?
\d{4}";
Console.WriteLine (Regex.IsMatch ("123-456-7890", phone)); // True
Console.WriteLine (Regex.IsMatch ("(123) 456-7890", phone)); // True
Extracting “name = value” pairs (one per line)
Note that this starts with the multiline directive (?m):
string r = @"(?m)^\s*(?'name'\w+)\s*=\s*(?'value'.*)\s*(?=\r?$)";
string text =
@"id = 3
secure = true
timeout = 30";
foreach (Match m in Regex.Matches (text, r))
  Console.WriteLine (m.Groups["name"] + " is " + m.Groups["value"]);
id is 3
secure is true
timeout is 30
Strong password validation
The following checks whether a password has at least six characters, and whether it contains a digit, symbol, or punctuation mark:
string r = @"(?x)^(?=.* ( \d | \p{P} | \p{S} )).{6,}";
Console.WriteLine (Regex.IsMatch ("abc12", r)); // False
Console.WriteLine (Regex.IsMatch ("abcdef", r)); // False
Console.WriteLine (Regex.IsMatch ("ab88yz", r)); // True
Lines of at least 80 characters
string r = @"(?m)^.{80,}(?=\r?$)";
string fifty = new string ('x', 50);
string eighty = new string ('x', 80);
string text = eighty + "\r\n" + fifty + "\r\n" + eighty;
Console.WriteLine (Regex.Matches (text, r).Count); // 2
Parsing dates/times (N/N/N H:M:S AM/PM)
This expression handles a variety of numeric date formats—and works whether the year comes first or last. The (?x) directive improves readability by allowing whitespace; the (?i) switches off case sensitivity (for the optional AM/PM designator). You can then access each component of the match through the Groups collection:
string r = @"(?x)(?i)
(\d{1,4}) [./-]
(\d{1,2}) [./-]
(\d{1,4}) [\sT]
(\d+):(\d+):(\d+) \s? (A\.?M\.?|P\.?M\.?)?";
string text = "01/02/2008 5:20:50 PM";
foreach (Group g in Regex.Match (text, r).Groups)
  Console.WriteLine (g.Value + " ");
01/02/2008 5:20:50 PM
01
02
2008
5
20
50
PM
(Of course, this doesn’t verify that the date/time is correct.)
Matching Roman numerals
string r =
@"(?i)\bm*" +
@"(d?c{0,3}|c[dm])" +
@"(l?x{0,3}|x[lc])" +
@"(v?i{0,3}|i[vx])" +
@"\b";
Console.WriteLine (Regex.IsMatch ("MCMLXXXIV", r)); // True
Removing repeated words
Here, we capture a named group called dupe:
string r = @"(?'dupe'\w+)\W\k'dupe'";
string text = "In the the beginning...";
Console.WriteLine (Regex.Replace (text, r, "${dupe}"));
In the beginning...
Word count
string r = @"\b(\w|[-'])+\b";
string text = "It's all mumbo-jumbo to me";
Console.WriteLine (Regex.Matches (text, r).Count); // 5
Matching a Guid
string r =
@"(?i)\b" +
@"[0-9a-fA-F]{8}\-" +
@"[0-9a-fA-F]{4}\-" +
@"[0-9a-fA-F]{4}\-" +
@"[0-9a-fA-F]{4}\-" +
@"[0-9a-fA-F]{12}" +
@"\b";
string text = "Its key is {3F2504E0-4F89-11D3-9A0C-0305E82C3301}.";
Console.WriteLine (Regex.Match (text, r).Index); // 12
Parsing an XML/HTML tag
Regex is useful for parsing HTML fragments—particularly when the document may be imperfectly formed:
string r =
@"<(?'tag'\w+?).*>" + // match first tag, and name it 'tag'
@"(?'text'.*?)" + // match text content, name it 'text'
@"</\k'tag'>"; // match last tag, denoted by 'tag'
string text = "<h1>hello</h1>";
Match m = Regex.Match (text, r);
Console.WriteLine (m.Groups ["tag"]); // h1
Console.WriteLine (m.Groups ["text"]); // hello
Splitting a camel-cased word
This requires a positive lookahead to include the uppercase separators:
string r = @"(?=[A-Z])";
foreach (string s in Regex.Split ("oneTwoThree", r))
  Console.Write (s + " "); // one Two Three
Obtaining a legal filename
string input = "My \"good\" <recipes>.txt";
char[] invalidChars = System.IO.Path.GetInvalidFileNameChars();
string invalidString = Regex.Escape (new string (invalidChars));
string valid = Regex.Replace (input, "[" + invalidString + "]", "");
Console.WriteLine (valid);
My good recipes.txt
Escaping Unicode characters for HTML
string htmlFragment = "© 2007";
string result = Regex.Replace (htmlFragment, @"[\u0080-\uFFFF]",
  m => @"&#" + ((int)m.Value[0]).ToString() + ";");
Console.WriteLine (result); // &#169; 2007
Unescaping characters in an HTTP query string
string sample = "C%23 rocks";
string result = Regex.Replace (
sample,
@"%[0-9a-f][0-9a-f]",
m => ((char) Convert.ToByte (m.Value.Substring (1), 16)).ToString(),
RegexOptions.IgnoreCase
);
Console.WriteLine (result); // C# rocks
Parsing Google search terms from a web stats log
This should be used in conjunction with the previous example to unescape characters in the query string:
string sample =
"http://google.com/search?hl=en&q=greedy+quantifiers+regex&btnG=Search";
Match m = Regex.Match (sample, @"(?<=google\..+search\?.*q=).+?(?=(&|$))");
string[] keywords = m.Value.Split (
  new[] { '+' }, StringSplitOptions.RemoveEmptyEntries);
foreach (string keyword in keywords)
  Console.Write (keyword + " "); // greedy quantifiers regex
Regular Expressions Language Reference
Table 26-2 through Table 26-12 summarize the regular expressions grammar and syntax supported in the .NET implementation.
Table 26-2. Character escapes
Escape code sequence | Meaning | Hexadecimal equivalent
\a | Bell | \u0007
\b | Backspace | \u0008
\t | Tab | \u0009
\r | Carriage return | \u000D
\v | Vertical tab | \u000B
\f | Form feed | \u000C
\n | Newline | \u000A
\e | Escape | \u001B
\nnn | ASCII character nnn as octal (e.g., \052) |
\xnn | ASCII character nn as hex (e.g., \x3F) |
\cl | ASCII control character l (e.g., \cG for Ctrl-G) |
\unnnn | Unicode character nnnn as hex (e.g., \u07DE) |
\symbol | A nonescaped symbol |
Special case: within a regular expression, \b means word boundary, except in a [ ] set, in which \b means the backspace character.
Table 26-3. Character sets
Expression | Meaning | Inverse (“not”)
[abcdef] | Matches a single character in the list | [^abcdef]
[a-f] | Matches a single character in a range | [^a-f]
\d | Matches a decimal digit; for ASCII text, same as [0-9] | \D
\w | Matches a word character (by default, varies according to CultureInfo.CurrentCulture; for example, in English, same as [a-zA-Z_0-9]) | \W
\s | Matches a whitespace character; includes [ \n\r\t\f\v] | \S
\p{category} | Matches a character in a specified category (see Table 26-4) | \P{category}
. | (Default mode) Matches any character except \n | \n
. | (SingleLine mode) Matches any character | (none)
Table 26-4. Character categories
Category | Meaning
\p{L} | Letters
\p{Lu} | Uppercase letters
\p{Ll} | Lowercase letters
\p{N} | Numbers
\p{P} | Punctuation
\p{M} | Diacritic marks
\p{S} | Symbols
\p{Z} | Separators
\p{C} | Control characters
Table 26-5. Quantifiers
Quantifier | Meaning
* | Zero or more matches
+ | One or more matches
? | Zero or one match
{n} | Exactly n matches
{n,} | At least n matches
{n,m} | Between n and m matches
The ? suffix can be applied to any of the quantifiers to make them lazy rather than greedy.
Table 26-6. Substitutions
Expression | Meaning
$0 | Substitutes the matched text
$group-number | Substitutes an indexed group-number within the matched text
${group-name} | Substitutes a text group-name within the matched text
Substitutions are specified only within a replacement pattern.
Table 26-7. Zero-width assertions
Expression | Meaning
^ | Start of string (or line in multiline mode)
$ | End of string (or line in multiline mode)
\A | Start of string (ignores multiline mode)
\z | End of string (ignores multiline mode)
\Z | End of string, or just before a newline at the end of the string
\G | Where search started
\b | On a word boundary
\B | Not on a word boundary
(?=expr) | Continue matching only if expression expr matches on right (positive lookahead)
(?!expr) | Continue matching only if expression expr doesn’t match on right (negative lookahead)
(?<=expr) | Continue matching only if expression expr matches on left (positive lookbehind)
(?<!expr) | Continue matching only if expression expr doesn’t match on left (negative lookbehind)
(?>expr) | Subexpression expr is matched once and not backtracked
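The difference between $ and \z matters when validating input, because outside multiline mode $ also matches just before a trailing newline. The sketch below uses two hypothetical helpers to show this.

```csharp
using System;
using System.Text.RegularExpressions;

static class EndAnchorDemo
{
  // $ tolerates a single trailing \n.
  public static bool IsAllDigitsLoose (string s)
  {
    return Regex.IsMatch (s, @"^\d+$");
  }

  // \A and \z match only at the very start and very end of the string.
  public static bool IsAllDigitsStrict (string s)
  {
    return Regex.IsMatch (s, @"\A\d+\z");
  }

  static void Main()
  {
    Console.WriteLine (IsAllDigitsLoose ("123\n"));    // True
    Console.WriteLine (IsAllDigitsStrict ("123\n"));   // False
    Console.WriteLine (IsAllDigitsStrict ("123"));     // True
  }
}
```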
Table 26-8. Grouping constructs
Syntax | Meaning
(expr) | Capture matched expression expr into indexed group
(?'number'expr) | Capture matched substring into a specified group number
(?'name'expr) | Capture matched substring into group name
(?'name1-name2'expr) | Undefine name2, and store interval and current group into name1; if name2 is undefined, matching backtracks; name1 is optional
(?:expr) | Noncapturing group
Table 26-9. Back references
Parameter syntax | Meaning
\index | Reference a previously captured group by index
\k<name> | Reference a previously captured group by name
Table 26-10. Alternation
Expression syntax | Meaning
| (vertical bar) | Logical or
(?(expr)yes|no) | Matches yes if expression matches; otherwise, matches no (no is optional)
(?(name)yes|no) | Matches yes if named group has a match; otherwise, matches no (no is optional)
Table 26-11. Miscellaneous constructs
Expression syntax | Meaning
(?#comment) | Inline comment
#comment | Comment to end of line (works only in IgnorePatternWhitespace mode)
Table 26-12. Regular expression options
Option | Meaning
(?i) | Case-insensitive match (“ignore” case)
(?m) | Multiline mode; changes ^ and $ so that they match beginning and end of any line
(?n) | Captures only explicitly named or numbered groups
(?s) | Single-line mode; changes meaning of “.” so that it matches every character
(?x) | Eliminates unescaped whitespace from the pattern
Compiled, RightToLeft, ECMAScript, and CultureInvariant have no inline code; specify them through RegexOptions instead.