Working with Regular Expressions - THE RUBY WAY, Third Edition (2015)

THE RUBY WAY, Third Edition (2015)

Chapter 3. Working with Regular Expressions

I would choose
To lead him in a maze along the patterned paths...

—Amy Lowell, “Patterns”

The power of regular expressions as a computing tool has often been underestimated. From their earliest theoretical beginnings in the 1940s, they found their way onto computer systems in the 1960s and from there into various tools in the UNIX operating system. In the 1990s, the popularity of Perl helped make regular expressions a household programming item rather than the esoteric domain of bearded gurus.

The beauty of regular expressions is that almost everything in our experience can be understood in terms of patterns. Where there are patterns that we can describe, we can detect matches, we can find the pieces of reality that correspond to those matches, and we can replace those pieces with others of our own choosing.

The regular expression engine used in Ruby is called Onigmo. Onigmo is itself a revision of another library called Oniguruma, which was used by Ruby 1.9. Oniguruma is commonly misspelled by non-Japanese writers; it is fortunate that Onigmo is much easier to spell.

The Onigmo engine provides several improvements to Ruby’s regular expression support. Most notably, it handles internationalized strings far better, and adds some powerful features. Additional issues with internationalization and regular expressions are dealt with in Chapter 4, “Internationalization in Ruby.”

3.1 Regular Expression Syntax

The typical regular expression is delimited by a pair of slashes; the %r form can also be used. Table 3.1 gives some simple examples.

Image

Table 3.1 Basic Regular Expressions

It is also possible to place a modifier, consisting of a single letter, immediately after a regex. Table 3.2 shows the most common modifiers.

Image

Table 3.2 Regular Expression Modifiers

Other modifiers will be covered in Chapter 4.

To complete our introduction to regular expressions, Table 3.3 lists the most common symbols and notations available.

Image

Image

Table 3.3 Common Notations Used in Regular Expressions

Character classes in brackets can also use the && operator for combining together:

reg1 = /[a-z&&[^aeiou]]/ # any letter but vowels a, e, i, o, and u
reg2 = /[a-z&&[^m-p]]/ # the entire alphabet minus m through p

Because this can be confusing, I recommend using this feature sparingly.

An understanding of regex handling greatly benefits the modern programmer. A complete discussion of this topic is far beyond the scope of this book, but if you’re interested, see the definitive work Mastering Regular Expressions, by Jeffrey Friedl.

3.2 Compiling Regular Expressions

Regular expressions can be compiled using the class method Regexp.compile (which is really only a synonym for Regexp.new). The first parameter is required and may be a string or a regex. (Note that if the parameter is a regex with options, the options will not carry over into the newly compiled regex.)

pat1 = Regexp.compile("^foo.*") # /^foo.*/
pat2 = Regexp.compile(/bar$/i) # /bar/ (i not propagated)

The second parameter, if present, is normally a bitwise OR of any of the following constants: Regexp::EXTENDED, Regexp::IGNORECASE, and Regexp::MULTILINE. Additionally, any non-nil value will have the result of making the regex case insensitive; we do not recommend this practice.

options = Regexp::MULTILINE || Regexp::IGNORECASE
pat3 = Regexp.compile("^foo", options)
pat4 = Regexp.compile(/bar/, Regexp::IGNORECASE)

The third parameter, if it is specified, is the language parameter, which enables multibyte character support. It can take any of four string values:

"N" or "n" means None
"E" or "e" means EUC
"S" or "s" means Shift-JIS
"U" or "u" means UTF-8

Of course, regular expression literals may be specified without calling new or compile, simply by enclosing them in slash delimiters.

pat1 = /^foo.*/
pat2 = /bar$/i

Regular expression encodings will be covered in Chapter 4.

3.3 Escaping Special Characters

The class method Regexp.escape escapes any characters that are special characters used in regular expressions. Such characters include the asterisk, question mark, and brackets:

str1 = "[*?]"
str2 = Regexp.escape(str1) # "\[\*\?\]"

The method Regexp.quote is an alias.

3.4 Using Anchors

An anchor is a special expression that matches a position in a string rather than a character or sequence of characters. As we’ll see later, this is a simple case of a zero-width assertion, a match that doesn’t consume any of the string when it matches.

The most common anchors were already listed at the beginning of this chapter. The simplest are ^ and $, which match the beginning and end of the string.

string = "abcXdefXghi"
/def/ =~ string # 4
/abc/ =~ string # 0
/ghi/ =~ string # 8
/^def/ =~ string # nil
/def$/ =~ string # nil
/^abc/ =~ string # 0
/ghi$/ =~ string # 8

However, I’ve told a small lie. These anchors don’t actually match the beginning and end of the string but rather of the line. Consider the same patterns applied to a similar string with embedded newlines:

string = "abc\ndef\nghi"
/def/ =~ string # 4
/abc/ =~ string # 0
/ghi/ =~ string # 8
/^def/ =~ string # 4
/def$/ =~ string # 4
/^abc/ =~ string # 0
/ghi$/ =~ string # 8

However, we also have the special anchors \A and \Z, which match the real beginning and end of the string itself.

string = "abc\ndef\nghi"
/\Adef/ =~ string # nil
/def\Z/ =~ string # nil
/\Aabc/ =~ string # 0
/ghi\Z/ =~ string # 8

The \z is the same as \Z except that the latter matches before a terminating newline, whereas the former must match explicitly.

string = "abc\ndef\nghi"
str2 << "\n"
/ghi\Z/ =~ string # 8
/\Aabc/ =~ str2 # 8
/ghi\z/ =~ string # 8
/ghi\z/ =~ str2 # nil

It’s also possible to match a word boundary with \b, or a position that is not a word boundary with \B. These gsub examples make it clear how this works:

str = "this is a test"
str.gsub(/\b/,"|") # "|this| |is| |a| |test|"
str.gsub(/\B/,"-") # "t-h-i-s i-s a t-e-s-t"

There is no way to distinguish between beginning and ending word boundaries.

3.5 Using Quantifiers

A big part of regular expressions is handling optional items and repetition. An item followed by a question mark is optional; it may be present or absent, and the match depends on the rest of the regex. (It doesn’t make sense to apply this to an anchor but only to a subpattern of nonzero width.)

pattern = /ax?b/
pat2 = /a[xy]?b/

pattern =~ "ab" # 0
pattern =~ "acb" # nil
pattern =~ "axb" # 0
pat2 =~ "ayb" # 0
pat2 =~ "acb" # nil

It is common for entities to be repeated an indefinite number of times (which we can specify with the + quantifier). For example, this pattern matches any positive integer:

pattern = /[0-9]+/
pattern =~ "1" # 0
pattern =~ "2345678" # 0

Another common occurrence is a pattern that occurs zero or more times. We could do this with + and ?, of course; here we match the string Huzzah followed by zero or more exclamation points:

pattern = /Huzzah(!+)?/ # Parentheses are necessary here
pattern =~ "Huzzah" # 0
pattern =~ "Huzzah!!!!" # 0

However, there’s a better way. The * quantifier describes this 'margin-top:6.0pt;margin-right:0cm;margin-bottom:6.0pt; margin-left:0cm;line-height:normal'>

pattern = /Huzzah!*/ # * applies only to !
pattern =~ "Huzzah" # 0
pattern =~ "Huzzah!!!!" # 0

What if we want to match a U.S. Social Security number? Here’s a pattern for that:

ssn = "987-65-4320"
pattern = /\d\d\d-\d\d-\d\d\d\d/
pattern =~ ssn # 0

But that’s a little unclear. Let’s explicitly say how many digits are in each group. A number in braces is the quantifier to use here:

pattern = /\d{3}-\d{2}-\d{4}/

This is not necessarily a shorter pattern, but it is more explicit and arguably more readable.

Comma-separated ranges can also be used. Imagine that an Elbonian phone number consists of a part with three to five digits and a part with three to seven digits. Here’s a pattern for that:

elbonian_phone = /\d{3,5}-\d{3,7}/

The beginning and ending numbers are optional (though we must have one or the other):

/x{5}/ # Match 5 xs
/x{5,7}/ # Match 5-7 xs
/x{,8}/ # Match up to 8 xs
/x{3,}/ # Match at least 3 xs

Obviously, the quantifiers ?, +, and * could be rewritten in this way:

/x?/ # same as /x{0,1}/
/x*/ # same as /x{0,}
/x+/ # same as /x{1,}

The terminology of regular expressions is full of colorful personifying terms such as greedy, reluctant, lazy, and possessive. The greedy/non-greedy distinction is one of the most important.

Consider this piece of code. You might expect that this regex would match “Where the” but it matches the larger substring “Where the sea meets the” instead:

str = "Where the sea meets the moon-blanch'd land,"
match = /.*the/.match(str)
p match[0] # Display the entire match:
# "Where the sea meets the"

The reason is that the * operator is greedy—in matching, it consumes as much of the string as it can for the longest match possible. We can make it non-greedy by appending a question mark:

str = "Where the sea meets the moon-blanch'd land,"
match = /.*?the/.match(str)
p match[0] # Display the entire match:
# "Where the"

This shows us that the * operator is greedy by default unless a ? is appended. The same is true for the + and {m,n} quantifiers, and even for the ? quantifier itself.

I haven’t been able to find good examples for the {m,n}? and ?? cases. If you know of any, please share them.

3.6 Positive and Negative Lookahead

Naturally, a regular expression is matched against a string in a linear fashion (with backtracking as necessary). Therefore, there is the concept of the “current location” in the string—rather like a file pointer or a cursor.

The term lookahead refers to a construct that matches a part of the string ahead of the current location. It is a zero-width assertion because even when a match succeeds, no part of the string is consumed (that is, the current location does not change).

In this next example, the string “New World” will be matched if it is followed by “Symphony” or “Dictionary”; however, the third word is not part of the match:

s1 = "New World Dictionary"
s2 = "New World Symphony"
s3 = "New World Order"

reg = /New World(?= Dictionary| Symphony)/
m1 = reg.match(s1)
m.to_a[0] # "New World"
m2 = reg.match(s2)
m.to_a[0] # "New World"
m3 = reg.match(s3) # nil

Here is an example of negative lookahead:

reg2 = /New World(?! Symphony)/
m1 = reg.match(s1)
m.to_a[0] # "New World"
m2 = reg.match(s2)
m.to_a[0] # nil
m3 = reg.match(s3) # "New World"

In this example, “New World” is matched only if it is not followed by “Symphony”.

3.7 Positive and Negative Lookbehind

If lookahead isn’t enough for you, you can use lookbehind—detecting whether the current location is preceded by a given pattern.

Like many areas of regular expressions, this can be difficult to understand and motivate. Thanks goes to Andrew Johnson for the following example.

Imagine that we are analyzing some genetic sequence. (The DNA molecule consists of four “base” molecules, abbreviated A, C, G, and T.) Suppose that we are scanning for all non-overlapping nucleotide sequences (of length 4) that follow a T. We couldn’t just try to match a T and four characters because the T may have been the last character of the previous match.

gene = 'GATTACAAACTGCCTGACATACGAA'
seqs = gene.scan(/T(\w{4})/)
# seqs is: [["TACA"], ["GCCT"], ["ACGA"]]

But in this preceding code, we miss the GACA sequence that follows GCCT. Using a positive lookbehind (as follows), we catch them all:

gene = 'GATTACAAACTGCCTGACATACGAA'
seqs = gene.scan(/(?<=T)(\w{4})/)
# seqs is: [["TACA"], ["GCCT"], ["GACA"], ["ACGA"]]

This next example is adapted from one by K. Kosako. Suppose that we want to take a bunch of text in XML (or HTML) and shift to uppercase all the text outside the tags (that is, the cdata). Here is a way to do that using lookbehind:

text = <<-EOF
<body> <h1>This is a heading</h1>
<p> This is a paragraph with some
<i>italics</i> and some <b>boldface</b>
in it...</p>
</body>
EOF

pattern = /(?:^| # Beginning or...
(?<=>) # following a '>'
)
([^<]*) # Then all non-'<' chars (captured).
/x

puts text.gsub(pattern) {|s| s.upcase }

# Output:
# <body> <h1>THIS IS A HEADING</h1>
# <p>THIS IS A PARAGRAPH WITH SOME
# <i>ITALICS</i> AND SOME <b>BOLDFACE</b>
# IN IT...</p>
# </body>

3.8 Accessing Backreferences

Each parenthesized piece of a regular expression will be a submatch of its own. These are numbered and can be referenced by these numbers in more than one way. Let’s examine the more traditional “ugly” ways first.

The special global variables $1, $2, and so on, can be used to reference matches:

str = "a123b45c678"
if /(a\d+)(b\d+)(c\d+)/ =~ str
puts "Matches are: '#$1', '#$2', '#$3'"
# Prints: Matches are: 'a123', 'b45', 'c768'
end

Within a substitution such as sub or gsub, these variables cannot be used:

str = "a123b45c678"
str.sub(/(a\d+)(b\d+)(c\d+)/, "1st=#$1, 2nd=#$2, 3rd=#$3")
# "1st=, 2nd=, 3rd="

Why didn’t this work? Because the arguments to sub are evaluated before sub is called. This code is equivalent:

str = "a123b45c678"
s2 = "1st=#$1, 2nd=#$2, 3rd=#$3"
reg = /(a\d+)(b\d+)(c\d+)/
str.sub(reg,s2)
# "1st=, 2nd=, 3rd="

This code, of course, makes it much clearer that the values $1 through $3 are unrelated to the match done inside the sub call.

In this kind of case, the special codes \1, \2, and so on, can be used:

str = "a123b45c678"
str.sub(/(a\d+)(b\d+)(c\d+)/, '1st=\1, 2nd=\2, 3rd=\3')
# "1st=a123, 2nd=b45, 3rd=c768"

Notice that we used single quotes (hard quotes) in the preceding example. If we used double quotes (soft quotes) in a straightforward way, the backslashed items would be interpreted as octal escape sequences:

str = "a123b45c678"
str.sub(/(a\d+)(b\d+)(c\d+)/, "1st=\1, 2nd=\2, 3rd=\3")
# "1st=\001, 2nd=\002, 3rd=\003"

The way around this is to double-escape:

str = "a123b45c678"
str.sub(/(a\d+)(b\d+)(c\d+)/, "1st=\\1, 2nd=\\2, 3rd=\\3")
# "1st=a123, 2nd=b45, 3rd=c678"

It’s also possible to use the block form of a substitution, in which case the global variables may be used:

str = "a123b45c678"
str.sub(/(a\d+)(b\d+)(c\d+)/) { "1st=#$1, 2nd=#$2, 3rd=#$3" }
# "1st=a123, 2nd=b45, 3rd=c678"

When using a block in this way, it is not possible to use the special backslashed numbers inside a double-quoted string (or even a single-quoted one). This is reasonable if you think about it.

As an aside here, I will mention the possibility of noncapturing groups. Sometimes you may want to regard characters as a group for purposes of crafting a regular expression, but you may not need to capture the matched value for later use. In such a case, you can use a noncapturing group, denoted by the (?:...) syntax:

str = "a123b45c678"
str.sub(/(a\d+)(?:b\d+)(c\d+)/, "1st=\\1, 2nd=\\2, 3rd=\\3")
# "1st=a123, 2nd=c678, 3rd="

In the preceding example, the second grouping was thrown away, and what was the third submatch became the second.

I personally don’t like either the \1 or the $1 notation. They are convenient sometimes, but it isn’t ever necessary to use them. We can do it in a “prettier,” more object-oriented way.

The class method Regexp.last_match returns an object of class MatchData (as does the instance method match). This object has instance methods that enable the programmer to access backreferences.

The MatchData object is manipulated with a bracket notation as though it were an array of matches. The special element 0 contains the text of the entire matched string. Thereafter, element n refers to the nth match:

pat = /(.+[aiu])(.+[aiu])(.+[aiu])(.+[aiu])/i
# Four identical groups in this pattern
refs = pat.match("Fujiyama")
# refs is now: ["Fujiyama","Fu","ji","ya","ma"]
x = refs[1]
y = refs[2..3]
refs.to_a.each {|x| print "#{x}\n"}

Note that the object refs is not a true array. Therefore, when we want to treat it as one by using the iterator each, we must use to_a (as shown) to convert it to an array.

We may use more than one technique to locate a matched substring within the original string. The methods begin and end return the beginning and ending offsets of a match, respectively. (It is important to realize that the ending offset is really the index of the next character after the match.)

str = "alpha beta gamma delta epsilon"
# 0....5....0....5....0....5....
# (for your counting convenience)

pat = /(b[^ ]+ )(g[^ ]+ )(d[^ ]+ )/
# Three words, each one a single match
refs = pat.match(str)

# "beta "
p1 = refs.begin(1) # 6
p2 = refs.end(1) # 11
# "gamma "
p3 = refs.begin(2) # 11
p4 = refs.end(2) # 17
# "delta "
p5 = refs.begin(3) # 17
p6 = refs.end(3) # 23
# "beta gamma delta"
p7 = refs.begin(0) # 6
p8 = refs.end(0) # 23

Similarly, the offset method returns an array of two numbers, which are the beginning and ending offsets of that match. To continue the previous example:

range0 = refs.offset(0) # [6,23]
range1 = refs.offset(1) # [6,11]
range2 = refs.offset(2) # [11,17]
range3 = refs.offset(3) # [17,23]

The portions of the string before and after the matched substring can be retrieved by the instance methods pre_match and post_match, respectively. To continue the previous example:

before = refs.pre_match # "alpha "
after = refs.post_match # "epsilon"

3.9 Named Matches

A special form of subexpression is the named expression. This in effect gives a name to a pattern (rather than just a number).

The syntax is simple: (?<name>expr) where name is some name starting with a letter (like a Ruby identifier). Notice how similar this is to the non-named atomic subexpression.

What can we do with a named expression? One thing is to use it as a backreference. The following example is a simple regex that matches a doubled word (see also Section 3.15.6, “Detecting Doubled Words in Text”):

re1 = /\s+(\w+)\s+\1\s+/
str = "Now is the the time for all..."
re1.match(str).to_a # ["the the","the"]

Note how we capture the word and then use \1 to reference it. We can use named references in much the same way. We give the name to the subexpression when we first use it, and we access the backreference by \k followed by that same name (always in angle brackets):

re2 = /\s+(?<anyword>\w+)\s+\k<anyword>\s+/

The second variant is longer but arguably more readable. (Be aware that if you use named backreferences, you cannot use numbered backreferences in the same regex.) Use this feature at your discretion.

Ruby has long had the capability to use backreferences in strings passed to sub and gsub; in the past, this has been limited to numbered backreferences, but in very recent versions, named matches can be used:

str = "I breathe when I sleep"

# Numbered matches...
r1 = /I (\w+) when I (\w+)/
s1 = str.sub(r1,'I \2 when I \1')

# Named matches...
r1 = /I (?<verb1>\w+) when I (?<verb2>\w+)/
s2 = str.sub(r2,'I \k<verb2> when I \k<verb1>')

puts s1 # I sleep when I breathe
puts s2 # I sleep when I breathe

Another use for named expressions is to re-invoke that expression. In this case, we use \g (rather than \k) preceding the name.

For example, let’s define a spaces subpattern so that we can use it again. The previous regex re2 then becomes this:

re3 = /(?<spaces>\s+)(?<anyword>\w+)\g<spaces>\k<anyword>\g<spaces>/

Note how we invoke the pattern repeatedly by means of the \g marker. This feature makes more sense if the regular expression is recursive; that is the topic of the next section.

A notation such as \g<1> may also be used if there are no named subexpressions. This re-invokes a captured subexpression by referring to it by number rather than name.

One final note on the use of named matches. The name can be used (as a symbol or a string) as a MatchData index. Here is an example:

str = "My hovercraft is full of eels"
reg = /My (?<noun>\w+) is (?<predicate>.*)/
m = reg.match(str)
puts m[:noun] # hovercraft
puts m["predicate"] # full of eels
puts m[1] # same as m[:noun] or m["noun"]

3.10 Using Character Classes

Character classes are simply a form of alternation (specification of alternative possibilities) where each submatch is a single character. In the simplest case, we list a set of characters inside square brackets:

/[aeiou]/ # Match any single letter a, e, i, o, u; equivalent
# to /(a|e|i|o|u)/ except for group-capture

Inside a character class, escape sequences such as \n are still meaningful, but metacharacters such as . and ? do not have any special meanings:

/[.\n?]/ # Match any of: period, newline, question mark

The caret (^) has special meaning inside a character class if used at the beginning; it negates the list of characters (or refers to their complement):

[^aeiou] # Any character EXCEPT a, e, i, o, u

The hyphen, used within a character class, indicates a range of characters (a lexicographic range, that is):

/[a-mA-M]/ # Any letter in the first half of the alphabet
/[^a-mA-M]/ # Any OTHER letter, or number, or non-alphanumeric
# character

When a hyphen is used at the beginning or end of a character class, or a caret is used in the middle of a character class, these characters lose their special meaning and only represent themselves literally. The same is true of a left bracket, but a right bracket must obviously be escaped:

/[-^[\]]/ # Match a hyphen, caret, or right bracket

Ruby regular expressions may contain references to named character classes, which are basically named patterns (of the form [[:name:]]). For example, [[:digit:]] means the same as [0-9] in a pattern. In many cases, this turns out to be shorthand or is at least more readable.

Some others are [[:print:]] (printable characters) and [[:alpha:]] (alphabetic characters). Here are some examples:

s1 = "abc\007def"
/[[:print:]]*/.match(s1)
m1 = Regexp::last_match[0] # "abc"

s2 = "1234def"
/[[:digit:]]*/.match(s2)
m2 = Regexp::last_match[0] # "1234"

/[[:digit:]]+[[:alpha:]]/.match(s2)
m3 = Regexp::last_match[0] # "1234d"

A caret before the character class name negates the class:

/[[:^alpha:]]/ # Any non-alpha character

Named character classes provide another non-obvious feature: they match Unicode characters. For example, the [[:lower:]] class matches strings such as “élan” that contain lowercase characters outside of [a-z].

There are also shorthand notations for many classes. The most common ones are \d (to match a digit), \w (to match any “word” character), and \s (to match any whitespace character such as a space, tab, or newline):

str1 = "Wolf 359"
/\w+/.match(str1) # matches "Wolf" (same as /[a-zA-Z_0-9]+/)
/\w+ \d+/.match(str1) # matches "Wolf 359"
/\w+ \w+/.match(str1) # matches "Wolf 359"
/\s+/.match(str1) # matches " "

The “negated” forms are typically capitalized:

/\W/ # Any non-word character
/\D/ # Any non-digit character
/\S/ # Any non-whitespace character

3.11 Extended Regular Expressions

Regular expressions are frequently cryptic, especially as they get longer. The x directive enables you to stretch out a regex across multiple lines; spaces and newlines are ignored so that you can use these for indentation and readability. This also encourages the use of comments, although comments are possible even in simple regexes.

For a contrived example of a moderately complex regular expression, let’s suppose that we had a list of addresses like this:

addresses =
[ "409 W Jackson Ave", "No. 27 Grande Place",
"16000 Pennsylvania Avenue", "2367 St. George St.",
"22 Rue Morgue", "33 Rue St. Denis",
"44 Rue Zeeday", "55 Santa Monica Blvd.",
"123 Main St., Apt. 234", "123 Main St., #234",
"345 Euneva Avenue, Suite 23", "678 Euneva Ave, Suite A"]

In these examples, each address consists of three parts—a number, a street name, and an optional suite or apartment number. I’m making the arbitrary rules that there can be an optional “No.” on the front of the number, and the period may be omitted. Likewise, let’s arbitrarily say that the street name may consist of ordinary word characters but also allows the apostrophe, hyphen, and period. Finally, if the optional suite number is used, it must be preceded by a comma and one of the following tokens: Apt., Suite, or # (number sign).

Here is the regular expression I created for this. Notice that I’ve commented it heavily (maybe even too heavily):

regex = / ^ # Beginning of string
((No\.?)\s+)? # Optional: No[.]
\d+ \s+ # Digits and spacing
((\w|[.'-])+ # Street name... may be
\s* # multiple words.
)+
(,\s* # Optional: Comma etc.
(Apt\.?|Suite|\#) # Apt[.], Suite, #
\s+ # Spacing
(\d+|[A-Z]) # Numbers or single letter
)?
$ # End of string
/x

The point here is clear. When your regex reaches a certain threshold (which is a matter of opinion), make it an extended regex so that you can format it and add comments.

You may have noticed that I used ordinary Ruby comments here (# ...) instead of regex comments ((?#...)). Why did I do that? Simply because I could. The regex comments are needed only when the comment needs to be closed other than at the end of the line (for example, when more “meat” of the regex follows the comment on the same line).

3.12 Matching a Newline with a Dot

Ordinarily a dot matches any character except a newline. When the m (multiline) modifier is used, a newline will be matched by a dot. The same is true when the Regexp::MULTILINE option is used in creating a regex.

str = "Rubies are red\nAnd violets are blue.\n"
pat1 = /red./
pat2 = /red./m

str =~ pat1 # nil
str =~ pat2 # 11

This multiline mode has no effect on where anchors match (such as ^, $, \A, and \Z); they match in the same places. All that is affected is whether a dot matches a newline.

3.13 Using Embedded Options

The common way to specify options for a regex is to use a trailing option (such as i or m). But what if we want an option to apply only to part of a regular expression?

We can turn options on and off with a special notation. Within parentheses, a question mark followed by one or more options “turns on” those options for the remainder of the regular expression. A minus sign preceding one or more options “turns off” those options.

/abc(?i)def/ # Will match abcdef, abcDEF, abcDef, ...
# but not ABCdef
/ab(?i)cd(?-i)ef/ # Will match abcdef, abCDef, abcDef, ...
# but not ABcdef or abcdEF
/(?imx).*/ # Same as /.*/imx
/abc(?i-m).*/m # For last part of regex, turn on case
# sensitivity, turn off multiline

If we want, we can use a colon followed by a subexpression, and those options specified will be in effect only for that subexpression:

/ab(?i:cd)ef/ # Same as /ab(?i)cd(?-i)ef/

For technical reasons, it is not possible to treat the o option this way. The x option can be treated this way, but I don’t know why anyone ever would.

3.14 Using Embedded Subexpressions

We can use the ?> notation to specify subexpressions in a regex:

re = /(?>abc)(?>def)/ # Same as /abcdef/
re.match("abcdef").to_a # ["abcdef"]

Notice that the subexpressions themselves don’t imply grouping. We can turn them into capturing groups with additional parentheses, of course.

Note that this notation is possessive—that is, it is greedy, and it does not allow backtracking into the subexpression.

str = "abccccdef"
re1 = /(abc*)cdef/
re2 = /(?>abc*)cdef/

re1 =~ str # 0
re2 =~ str # nil
re1.match(str).to_a # ["abccccdef", "abccc"]
re2.match(str).to_a # []

In the preceding example, re2’s subexpression abc* consumes all the instances of the letter c, and it (possessively) won’t give them back to allow backtracking.

In addition to (?>), there is another way of expressing possessiveness, with the postfix + quantifier. This is distinct from the + meaning “one or more” and can in fact be combined with it. In fact, it is a “secondary” quantifier, like the ? (which gives us ??, +?, and *?).

In essence, + applied to a repeated pattern is the same as enclosing that repeated pattern in an independent subexpression. Here is an example:

r1 = /x*+/ # Same as: /(?>x*)/
r2 = /x++/ # Same as: /(?>x+)/
r3 = /x?+/ # Same as: /(?>x?)/

For technical reasons, Ruby does not honor the {n,m}+ notation as possessive.

3.14.1 Recursion in Regular Expressions

The ability to re-invoke a subexpression makes it possible to craft recursive regular expressions. For example, here is one that matches any properly nested parenthesized expression. (Thanks again to Andrew Johnson.)

str = "a * ((b-c)/(d-e) - f) * g"

reg = /(? # begin named expression
\( # match open paren
(?: # non-capturing group
(?> # possessive subexpr to match:
\\[()] # either an escaped paren
| # OR
[^()] # a non-paren character
) # end possessive
| # OR
\g # a nested parens group (recursive call)
)* # repeat non-captured group zero or more
\) # match closing paren
) # end named expression
/x
m = reg.match(str).to_a # ["((b-c)/(d-e) - f)",
"((b-c)/(d-e) - f)"]

Note that left recursion is not allowed. This is legal:

str = "bbbaccc"
re1 = /(?<foo>a|b\g<foo>c)/
re1.match(str).to_a # ["bbbaccc","bbbaccc"]

But this is illegal:

re2 = /(?<foo>a|\g<foo>c)/ # Syntax error!

This example is illegal because of the recursion at the head of each alternative. This leads, if you think about it, to an infinite regress.

3.15 A Few Sample Regular Expressions

This section presents a small list of regular expressions that might be useful either in actual use or as samples to study.

3.15.1 Matching an IP Address

Suppose that we want to determine whether a string is a valid IPv4 address. The standard form of such an address is a dotted quad or dotted decimal string. This is, in other words, four decimal numbers separated by periods, each number ranging from 0 to 255.

The pattern given here will do the trick (with a few exceptions, such as “127.1”). We break up the pattern a little just for readability. Note that the \d symbol is double-escaped so that the slash in the string gets passed on to the regex. (We’ll improve on this in a minute.)

num = "(\\d|[01]?\\d\\d|2[0-4]\\d|25[0-5])"
pat = "^(#{num}\.){3}#{num}$"
ip_pat = Regexp.new(pat)

ip1 = "9.53.97.102"

if ip1 =~ ip_pat # Prints "yes"
puts "yes"
else
puts "no"
end

Note how we have an excess of backslashes when we define num in the preceding example. Let’s define it as a regex instead of a string:

num = /(\d|[01]?\d\d|2[0-4]\d|25[0-5])/

When a regex is interpolated into another, to_s is called, which preserves all the information in the original regex:

num.to_s # "(?-mix:(\\d|[01]?\\d\\d|2[0-4]\\d|25[0-5]))"

In some cases, it is more important to use a regex instead of a string for embedding. A good rule of thumb is to interpolate regexes unless there is some reason you must interpolate a string.

IPv6 addresses are not in widespread use yet, but we include them for completeness. These consist of eight colon-separated 16-bit hex numbers with zeroes suppressed:

num = /[0-9A-Fa-f]{0,4}/
pat = /^(#{num}:){7}#{num}$/
ipv6_pat = Regexp.new(pat)

v6ip = "abcd::1324:ea54::dead::beef"

if v6ip =~ ipv6_pat # Prints "yes"
puts "yes"
else
puts "no"
end

3.15.2 Matching a Keyword-Value Pair

Occasionally we want to work with strings of the form “attribute=value” (as, for example, when we parse some kind of configuration file for an application).

The following code fragment extracts the keyword and the value. The assumptions are that the keyword or attribute is a single word, the value extends to the end of the line, and the equal sign may be surrounded by whitespace:

pat = /(\w+)\s*=\s*(.*?)$/
str = "color = blue"

matches = pat.match(str)

puts matches[1] # "color"
puts matches[2] # "blue"

3.15.3 Matching Roman Numerals

In the following example, we match against a complex pattern to determine whether a string is a valid Roman number (up to decimal 3999). As before, the pattern is broken up into parts for readability:

rom1 = /m{0,3}/i
rom2 = /(d?c{0,3}|c[dm])/i
rom3 = /(l?x{0,3}|x[lc])/i
rom4 = /(v?i{0,3}|i[vx])/i
roman = /^#{rom1}#{rom2}#{rom3}#{rom4}$/

year1985 = "MCMLXXXV"

if year1985 =~ roman # Prints "yes"
puts "yes"
else
puts "no"
end

You might be tempted to put the i on the end of the whole expression and leave it off the smaller ones, like so:

# This doesn't work!

rom1 = /m{0,3}/
rom2 = /(d?c{0,3}|c[dm])/
rom3 = /(l?x{0,3}|x[lc])/
rom4 = /(v?i{0,3}|i[vx])/
roman = /^#{rom1}#{rom2}#{rom3}#{rom4}$/i

Why doesn’t this work? Look at this for the answer:

rom1.to_s # "(?-mix:m{0,3})"

Notice how the to_s captures the flags for each subexpression, and these then override the flag on the big expression.

3.15.4 Matching Numeric Constants

A simple decimal integer is the easiest number to match. It has an optional sign and consists thereafter of digits (except that Ruby allows an underscore as a digit separator). Note that the first digit should not be a zero because then it would be interpreted as an octal constant:

int_pat = /^[+-]?[1-9][\d_]*$/

Integer constants in other bases are similar. Note that the hex and binary patterns have been made case insensitive because they contain at least one letter:

hex_pat = /^[+-]?0x[\da-f_]+$/i
oct_pat = /^[+-]?0[0-7_]+$/
bin_pat = /^[+-]?0b[01_]+$/i

A normal floating point constant is a little tricky; the number sequences on each side of the decimal point are optional, but one or the other must be included:

float_pat = /^(\d[\d_]*)*\.[\d_]*$/

Finally, scientific notation builds on the ordinary floating point pattern:

sci_pat = /^(\d[\d_]*)?\.[\d_]*(e[+-]?)?(_*\d[\d_]*)$/i

These patterns can be useful if, for instance, you have a string and you want to verify its validity as a number before trying to convert it.

3.15.5 Matching a Date/Time String

Suppose that we want to match a date/time in the form mm/dd/yy hh:mm:ss. This pattern is a good first attempt:

datetime = /(\d\d)\/(\d\d)\/(\d\d) (\d\d): (\d\d):(\d\d)/

However, that will also match invalid date/times and miss valid ones. The following example is pickier. Note how we build it up by interpolating smaller regexes into larger ones:

mo = /(0?[1-9]|1[0-2])/ # 01 to 09 or 1 to 9 or 10-12
dd = /([0-2]?[1-9]|[1-3][01])/ # 1-9 or 01-09 or 11-19 etc.
yy = /(\d\d)/ # 00-99
hh = /([01]?[1-9]|[12][0-4])/ # 1-9 or 00-09 or...
mi = /([0-5]\d)/ # 00-59, both digits required
ss = /([0-6]\d)?/ # allows leap seconds ;-)

date = /(#{mo}\/#{dd}\/#{yy})/
time = /(#{hh}:#{mi}:#{ss})/

datetime = /(#{date} #{time})/

Here’s how we might call it using String#scan to return an array of matches:

str="Recorded on 11/18/07 20:31:00"
str.scan(datetime)
# [["11/18/07 20:31:00", "11/18/07", "11", "18", "00",
# "20:31:00", "20", "31", ":00"]]

Of course, this could all have been done as a large extended regex:

datetime = %r{(
(0?[1-9]|1[0-2])/ # mo: 01 to 09 or 1 to 9 or 10-12
([0-2]?[1-9]|[1-3][01])/ # dd: 1-9 or 01-09 or 11-19 etc.
(\d\d) [ ] # yy: 00-99
([01]?[1-9]|[12][0-4]): # hh: 1-9 or 00-09 or...
([0-5]\d): # mm: 00-59, both digits required
(([0-6]\d))? # ss: allows leap seconds ;-)
)}x

Note the use of the %r{} notation so that we don’t have to escape the slashes.

3.15.6 Detecting Doubled Words in Text

In this section, we implement the famous double-word detector. Typing the same word twice in succession is a common typing error. The following code detects instances of that occurrence:

double_re = /\b(['A-Z]+) +\1\b/i


str="There's there's the the pattern."
str.scan(double_re) # [["There's"],["the"]]

Note that the trailing i in the regex is for case-insensitive matching. There is an array for each grouping, hence the resulting array of arrays.

3.15.7 Matching All-Caps Words

This example is simple if we assume no numerics, underscores, and so on:

allcaps = /\b[A-Z]+\b/


string = "This is ALL CAPS"
string[allcaps] # "ALL"
Suppose you want to extract every word in all-caps:
string.scan(allcaps) # ["ALL", "CAPS"]

If we wanted, we could extend this concept to include Ruby identifiers and similar items.

3.15.8 Matching Version Numbers

A common convention is to express a library or application version number by three dot-separated numbers. This regex matches that kind of string, with the package name and the individual numbers as submatches:

package = "mylib-1.8.12"
matches = package.match(/(.*)-(\d+)\.(\d+)\.(\d+)/)
name, major, minor, tiny = matches[1..-1]

3.15.9 A Few Other Patterns

Let’s end this list with a few more “odds and ends.” As usual, most of these could be done in more than one way.

Suppose that we wanted to match a two-character U.S. state postal code. The simple way is just /[A-Z]{2}/, of course. But this matches names such as XY and ZZ that look legal but are meaningless. The following regex matches all the 51 usual codes (50 states and the District of Columbia):

state = /^A[LKZR] | C[AOT] | D[EC] | FL | GA | HI | I[DLNA] |
K[SY] | LA | M[EDAINSOT] | N[EVHJMYCD] | O[HKR] |
PA | RI | S[CD] | T[NX] | UT | V[TA] | W[AVIY]$/x

For clarity, I’ve made this an extended regex (by using the x modifier). The spaces and newlines are ignored.

In a similar vein, here is a regex to match a U.S. ZIP Code (which may be five or nine digits):

zip = /^\d{5}(-\d{4})?$/

The anchors (in this regex and others) are only to ensure that there are no extraneous characters before or after the matched string. Note that this regex will not catch all invalid codes. In that sense, it is less useful than the preceding one.

The following regex matches a phone number in the North American Numbering Plan (NANP) format; it allows three common ways of writing such a phone number:

phone = /^((\(\d{3}\) |\d{3}-)\d{3}-\d{4}|\d{3}\.\d{3}\.\d{4})$/
"(512) 555-1234" =~ phone # true
"512.555.1234" =~ phone # true
"512-555-1234" =~ phone # true
"(512)-555-1234" =~ phone # false
"512-555.1234" =~ phone # false

Matching a dollar amount with optional cents is also trivial:

dollar = /^\$\d+(\.\d\d)?$/

This one obviously requires at least one digit to the left of the decimal and disallows spaces after the dollar sign. Also note that if you only wanted to detect a dollar amount rather than validate it, the anchors would be removed and the optional cents would be unnecessary.

3.16 Conclusion

That ends our discussion of regular expressions in Ruby. Now that we have looked at both strings and regexes, let’s take a look at some issues with internationalization in Ruby. This topic builds on both the string and regex material we have already seen.