Regular Expressions - Foundation ActionScript 3, Second Edition (2014)

Chapter 10. Regular Expressions

This chapter covers the following topics:

· What regular expressions are and why they are useful

· The anatomy of regular expressions

· How to use regular expressions in ActionScript 3.0

· Useful regular expressions

· Resources for more information about regular expressions

In this chapter, you’ll spend some time looking at regular expressions, a brand-new feature introduced into ActionScript 3.0 that has helped make it a proper, grown-up programming language.

Regular expressions have often been considered something of a dark art, reserved for propeller-heads who eat Perl scripts for breakfast and go back for seconds. Seeing regular expressions in the wild, you would be forgiven for writing them off as incomprehensible gobbledygook. For example, take a look at the following regular expression:

^([a-zA-Z0-9._-]+)@([a-zA-Z0-9.-]+)\.([a-zA-Z]{2,4})$

Believe it or not, this pattern can be used to make sure that an e-mail address is valid.

By learning a few simple rules, it’s possible to break down even complex regular expressions into understandable chunks. This chapter is all about learning those simple rules. I promise that by the end of this chapter, you will be able to break down the preceding regular expression and understand exactly what each part does.

Once you’ve mastered regular expressions, you’ll find a whole bunch of uses for them in your ActionScript projects. They help you solve a specific kind of problem that would otherwise require a lot of coding. In fact, regular expressions are not just part of ActionScript 3.0; you can also use them in a number of different programming languages, from JavaScript to Java, from Perl to PHP, and beyond.

Why You Need Regular Expressions

A regular expression is a string of characters that describes a pattern that you can use to search a string. Those of you who have used ActionScript in previous versions might very well exclaim, “Hold on just a minute! Isn’t that what the String.indexOf() method is for?” Well, yes, but regular expressions are like String.indexOf() on steroids.

String.indexOf() returns the first position of a character or substring within a string. Here is an example of its syntax:

var bookTitle:String = "Foundation ActionScript 3.0";
var firstIndex:int = bookTitle.indexOf("n");
trace(firstIndex); // outputs 3

The first appearance of the character n is as the fourth letter in the string. The output is 3 because the first index position is 0, as in arrays.

One of the first problems with String.indexOf() is that it returns only the first index of the particular substring. In the previous example, if you wanted to find all instances of n, you might use this script:

var bookTitle:String = "Foundation ActionScript 3.0";
var startIndex:int = 0;
var positionIndex:int;
var stringLength:uint = bookTitle.length;
var positions:Array = [];
// Search again from just past each match, including the final character.
while (startIndex < stringLength) {
    positionIndex = bookTitle.indexOf("n", startIndex);
    if (positionIndex > -1) {
        positions.push(positionIndex);
        startIndex = positionIndex + 1;
    } else {
        break;
    }
}
trace(positions); // outputs 3,9,16

That seems like a lot of work, doesn’t it? Another problem with using the String.indexOf() method to search for a string is that you have no control over whether the string you’re searching for is matched against a whole word or part of a word. For example, consider the following variable:

var tongueTwister:String = "Peter Piper picked a peck of pickled peppers";

The variable tongueTwister contains the opening line of a particularly bothersome tongue twister. Now, let’s say you wanted to see whether this string contained the word pick. Humans can look at the string and confirm that although the words picked and pickled are there, the word pick is nowhere to be seen. Nonetheless, if you use String.indexOf() to search the string, it will return a match:

trace(tongueTwister.indexOf("pick")); // outputs 12

What’s happening here is that String.indexOf() isn’t searching for the word pick; it’s searching for any consecutive sequence of characters containing the letters p, i, c, and k (in that order). The value of the tongueTwister variable contains this sequence of characters (twice, in fact), so the result is a match.

If your brain works faster than mine, you might think that you could just add a space on either side of the word you’re searching for in the search string to get a whole-word match with String.indexOf(). That would work in this instance, but would fail if the word you were searching for were at the beginning or end of the string, or if it were nudged up against any pesky punctuation.

Another potential problem with using String.indexOf() to search strings is that you must be very exact. Let’s revisit the old chestnut of American English vs. British English. If you need to search a string of text for a word and you aren’t sure whether the author has used the British or American spelling, you have to search for both (colour vs. color in this example):

var entry:String = "Purple is my favourite colour";
if (entry.indexOf("color") > -1 || entry.indexOf("colour") > -1) {
    trace("We have a match!");
}

I admit that this example doesn’t seem too bad, but any amount of extra typing seems unnecessary for a word that differs by only a single letter.

What’s the best way to get around the problems with String.indexOf()? Let me put it this way: this would be a very short chapter if regular expressions weren’t the answer.

The examples for this chapter are presented so that users of the Flash integrated development environment (IDE) can copy the code directly into the timeline to test it. The same code can be wrapped in a document class, as was initially demonstrated in Chapter 2, so it can be tested in both the Flash IDE and Flash Builder. For brevity’s sake, the chapter text does not include all the document class code; it concentrates specifically on the regular expression syntax. However, this chapter’s downloadable files include document classes for both Flash and Flash Builder users to run all the code included within the chapter.

Introducing the RegExp Class

In ActionScript 3.0, regular expressions are represented by the RegExp class. You can create a new RegExp object in two ways:

· By using the new keyword with the RegExp constructor: This is the same technique you used to create instances of almost all the classes you’ve met thus far. The RegExp constructor takes two arguments, both specified as strings: the pattern to search for and a series of modifiers that change the way the regular expression behaves:

var myFirstRegExp:RegExp = new RegExp("pattern", "modifiers");

· By using a regular expression literal: A regular expression literal is similar to a string literal, except that it is delineated by forward slashes (/), with patterns placed between the forward slashes and modifiers placed after them:

var myFirstRegExp:RegExp = /pattern/modifiers;

In terms of functionality, these two techniques are the same; they both create a new RegExp object with the specified pattern and modifiers. However, depending on which technique you choose and the characters in your pattern, you might need to slightly change how the pattern is specified. If you use the constructor technique, the pattern is an ordinary string, so you leave out the forward-slash delimiters and must escape any character that has special meaning in a string with a backslash (\). That includes the backslash itself. For example, suppose that you want to define the pattern of the letters ABC followed by any number of digits. You add the letters and then the \d metasequence, which matches a decimal digit, followed by an asterisk (*), which allows that digit to appear any number of times. Because the pattern is written as a string, the backslash in \d has to be doubled:

var pattern:RegExp = new RegExp("ABC\\d*");

I prefer regular expression literals because they require less typing, and anything that reduces the wear and tear on my poor fingers has to be a good thing. When specifying your pattern as a regular expression literal, you’ll need to escape the forward slashes and backslashes using a backslash character (much as you escape special characters in a string literal). You’ll see examples of regular expression literals throughout this chapter.
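As a rough sketch of the difference between the two techniques, the following timeline code builds the same pattern both ways and then shows a literal that contains a forward slash. The sample strings and variable names are made up for illustration, and the test() method used here is explained in the next section.

```actionscript
// The same pattern written both ways: ABC followed by any number of digits.
var fromLiteral:RegExp = /ABC\d*/;
// In a string, the backslash itself must be escaped, hence "\\d".
var fromConstructor:RegExp = new RegExp("ABC\\d*");

var partNumber:String = "ABC123"; // hypothetical sample value
trace(fromLiteral.test(partNumber)); // outputs true
trace(fromConstructor.test(partNumber)); // outputs true

// In a literal, a forward slash inside the pattern must be escaped instead.
var pathRegExp:RegExp = /assets\/images/;
trace(pathRegExp.test("assets/images/logo.png")); // outputs true
```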

Having said all that, sometimes you have no choice but to use the RegExp constructor. This is necessary when either the pattern or the modifiers for the regular expression (or both) come from the value of a variable, as explained in the “Using variables to build a regular expression” section later in this chapter.

Anatomy of a Regular Expression Pattern

Now that you know how to create a RegExp object, it’s time to look at how the pattern for a regular expression is built and exactly what you can do with it.

A very simple regular expression pattern might look something like this:

pick

Yes, it really is just a simple string of characters. This signifies a regular expression that will match the character sequence p, i, c, k, in that order.

If you’re thinking that this is the same as the tongue-twister example you saw earlier, you’re right. You can verify it by adapting the earlier example to use a regular expression instead of the String.indexOf() method:

var tongueTwister:String = "Peter Piper picked a peck of pickled peppers";
var pickRegExp:RegExp = /pick/;
trace(pickRegExp.test(tongueTwister)); // outputs true

Here the test() method of the RegExp object is used to see whether the value of the tongueTwister variable matches the pattern you’ve defined. This method returns a Boolean value indicating whether the specified string contains the pattern: true for a match and false for no match. In this case, you should see the value true traced to the Output panel (or output to the console if you are using Flash Builder).

To see exactly what is being matched, you can use the replace() method of the String object. The String.replace() method replaces the text matched by a regular expression with the specified replacement string:

var tongueTwister:String = "Peter Piper picked a peck of pickled peppers";
var pickRegExp:RegExp = /pick/;
var replaced:String = tongueTwister.replace(pickRegExp, "MATCH");
trace(replaced);

If you test this code, you should see the following in the Output panel:

Peter Piper MATCHed a peck of pickled peppers

You can tell that the pickRegExp regular expression matched the first occurrence of the string pick at the beginning of the word picked, which has been replaced by the string MATCH, the specified replacement string.

You might have noticed that the string pick that is part of the word pickled toward the end of the tongueTwister string wasn’t replaced. Unless you tell it otherwise, a regular expression will stop searching when it finds the first occurrence of a string that matches the specified pattern. If you want it to continue and find all matches, you’ll need to use the global modifier (see the “Using the global modifier” section later in this chapter).

In the previous regular expression example, the pattern was just made up of regular characters, so that makes it as useless as String.indexOf() for solving the tongue-twister problem. However, regular expressions can also contain metacharacters, which add a lot more power and flexibility to string searches.

Introducing Metacharacters

Metacharacters are characters that have special meaning in the regular expression pattern, and they make regular expressions a powerful tool. Table 10-1 contains a partial list of the metacharacters.

Table 10-1. Common metacharacters

Metacharacter   Description
\b              Matches the position between a word character and a nonword character
\d              Matches a single digit
\s              Matches any whitespace character such as a space, tab, or newline
\w              Matches any alphanumeric character or an underscore (_)

The metacharacters listed in Table 10-1 also have exact opposites, which can be specified using the uppercase version of the same letter. For example, to match any character that is not a digit, you can use the \D metacharacter. The word boundary metacharacter (\b) is a little trickier in this respect because you need to remember that it matches a position between two characters. Its opposite, \B, still matches a position between two characters, either two word characters or two nonword characters.
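To get a feel for these metacharacters and their uppercase opposites, you could try something like the following sketch. The room variable and its value are hypothetical, and String.replace() is used only to make the first matched character visible.

```actionscript
var room:String = "Room 101"; // hypothetical sample value

// Without the global modifier, only the first match is replaced.
trace(room.replace(/\d/, "#")); // outputs Room #01 (first digit)
trace(room.replace(/\D/, "#")); // outputs #oom 101 (first nondigit)
trace(room.replace(/\s/, "_")); // outputs Room_101 (first whitespace character)
trace(room.replace(/\S/, "_")); // outputs _oom 101 (first nonwhitespace character)
trace(room.replace(/\w/, "_")); // outputs _oom 101 (first word character)
trace(room.replace(/\W/, "_")); // outputs Room_101 (first nonword character)
```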

You might have noticed that one of these metacharacters finally offers the solution to the tongue-twister problem from earlier. By placing a word boundary metacharacter (\b) on either side of the pick string, you can specify that you want it to match only as a whole word:

\bpick\b

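Now the pattern matches only if there is a word boundary on either side of pick. In other words, pick must not be directly preceded or followed by an alphanumeric character or an underscore; the start and end of the string count as boundaries, too: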

var tongueTwister:String = "Peter Piper picked a peck of pickled peppers";
var pickRegExp:RegExp = /\bpick\b/;
trace(pickRegExp.test(tongueTwister)); // outputs false

That gives an output of false, which is the desired result because the string being tested does not contain the word pick. Now you can put away this pesky problem and peruse other possibilities in programming.

Using Anchors to Restrict the Position of Matches

Like their heavy-chained nautical counterparts, anchors can restrict the action of your regular expression. Up to this point, the regular expression examples have been free as a bird—free to hunt the entire search string for a match. Anchors allow you to specify where in the string to look for a match to the pattern: at the beginning, at the end, or both.

Take the following variable, which gives the recipe for a good story:

var goodStory:String = "beginning, middle and end";

Using Start-of-String Anchors

Let’s say that you want to match the word beginning, but only if it appears at the beginning of the string. You could indicate that as part of your pattern by preceding it with the start-of-string anchor, represented by a caret (^), which is typed with Shift+6 on most English-language keyboards:

^beginning

This matches the string beginning, but only if it is the first thing in the string being searched.

Using End-of-String Anchors

The start-of-string anchor has a counterpart called the end-of-string anchor, which is represented by a dollar sign ($). You use this anchor if you want the pattern to match only if it appears at the end of the string. The end-of-string anchor goes at the end of the pattern:

end$

This matches the string end, but only if it is the last thing in the string being searched.
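The two anchors described so far can be tried out against the goodStory string with a sketch along these lines. The variable names for the regular expressions are invented for this example.

```actionscript
var goodStory:String = "beginning, middle and end";

var startsWithBeginning:RegExp = /^beginning/;
var startsWithMiddle:RegExp = /^middle/;
var endsWithEnd:RegExp = /end$/;
var endsWithMiddle:RegExp = /middle$/;

trace(startsWithBeginning.test(goodStory)); // outputs true
trace(startsWithMiddle.test(goodStory)); // outputs false: middle is not at the start
trace(endsWithEnd.test(goodStory)); // outputs true
trace(endsWithMiddle.test(goodStory)); // outputs false: middle is not at the end
```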

Combining Anchors

Finally, you can use a combination of both anchors to specify that the pattern should match the entire string:

^beginning, middle and end$

This would match the string beginning, middle and end, but only if the search string contained exactly that string and nothing else. Let’s see how it works in an example:

var goodStory0:String = "beginning, middle and end";
var goodStory1:String = "beginning, middle and end with epilogue";

var myRegExp:RegExp = /^beginning, middle and end$/;

trace(myRegExp.test(goodStory0)); // outputs true
trace(myRegExp.test(goodStory1)); // outputs false

This example tries to match the exact string beginning, middle and end. Because the first recipe contains this exactly, running test() on this string returns true. The second recipe does not end with the specified string end, so it returns false when tested.

Providing Alternatives with Alternation

In the examples so far, every character specified in the regular expression patterns must match for the string as a whole to be considered a match. Alternation allows you to specify a number of alternative patterns to be matched by separating the strings with a pipe (or vertical bar) symbol (|). As an example, the following pattern will match either the word one or two:

one|two

You could use alternation to solve the earlier spelling problem to match either color or colour, as follows:

var entry:String = "Purple is my favourite colour";
var colorRegExp:RegExp = /color|colour/;
if (colorRegExp.test(entry)) {
    trace("We have a match!");
}

You can specify as many alternatives as you like:

one|two|three|four|five|six|seven|eight|nine|ten

Alternation operates on the entire pattern. You can force the alternation to act only on a particular part of the pattern using groups (covered in the “Grouping patterns” section later in this chapter).
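The fact that alternation splits the whole pattern, rather than just the words next to the pipe, is easy to overlook. The following sketch, which uses a made-up pattern and made-up sample strings, shows the effect:

```actionscript
// This reads as "I like cats" OR "dogs", not "I like" followed by cats or dogs.
var petRegExp:RegExp = /I like cats|dogs/;

trace(petRegExp.test("I like cats")); // outputs true
trace(petRegExp.test("dogs bark")); // outputs true: the right-hand alternative matches alone
trace(petRegExp.test("I like fish")); // outputs false
```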

Using Character Classes and Character Ranges

Character classes allow you to specify that instead of a specific character, you want one of a number of characters to be matched at a given position in a pattern. You create a character class by wrapping the characters to be matched in square brackets. For example, if you want a regular expression to match any of the vowels in the English alphabet, you could create a character class like this:

[aeiou]

This pattern will match only a single character, but that character can be any one of those specified in the character class. You can use the character class as part of a larger expression:

b[aeiou]g

This pattern would match bag, beg, big, bog, and bug.
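You can check this for yourself with a short loop such as the one below. The list of words is an assumption made for the example; the last two entries are there to show strings that the character class rejects.

```actionscript
var vowelRegExp:RegExp = /b[aeiou]g/;
var words:Array = ["bag", "beg", "big", "bog", "bug", "byg", "bg"];

for (var i:int = 0; i < words.length; i++) {
    // Exactly one vowel must sit between the b and the g.
    trace(words[i] + ": " + vowelRegExp.test(words[i]));
}
// outputs true for bag, beg, big, bog, and bug; false for byg and bg
```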

Specifying each character that could possibly match is all well and good, but what if you want to match any letter of the alphabet? You would end up with the following:

[abcdefghijklmnopqrstuvwxyz]

Thankfully, this can be rewritten much more efficiently as a character range. A character range in a character class is specified as two characters separated by a hyphen (-). The following pattern is equivalent to the previous example:

[a-z]

You can also combine character ranges in a single character class by specifying them one after another. To match any alphanumeric character, you could use the following pattern:

var entry:String = "Purple is my favourite colour";
var colorRegExp:RegExp = /[a-zA-Z0-9]/;

if (colorRegExp.test(entry)) {
    trace("We have a match!");
}

The characters in a character class don’t have to be alphanumeric. For example, you might want to construct a pattern to match any of the standard punctuation characters:

[.,;:'!?]

The only symbols you need to be wary of when using a character class are the hyphen and the opening and closing square brackets. A literal hyphen is safest as the first or the last character in the character class (to avoid confusion with a character range). If you want to include square brackets in the class, escape them with backslashes:

var entry:String = "The [colour] Purple";
var colorRegExp:RegExp = /[\[\]-]/;

if (colorRegExp.test(entry)) {
    trace("We have a match!");
}

Simple, no?

Matching Any Character Using the Dot Metacharacter

Sometimes you want your patterns to be extremely flexible. The dot metacharacter, represented by a period or full-stop symbol (.), will match any single character in the string, without caring what that character is. The only exception to this rule is that, by default, it will not match a newline character (you’ll find out how to alter this behavior in the “Using the dotall modifier” section later in this chapter).

Let’s say that you want to match any string that is exactly five characters long, but you don’t care which five characters they are. You could construct a pattern that consists solely of five dot metacharacters:

.....

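This pattern matches hello, knife, a bag, and even &^%$£. In fact, it matches any string that is five characters in length, provided that none of those characters is a newline. Because the pattern has no anchors, it also matches the first five characters of any longer string.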

As with all the other metacharacters, if you want to match a period character literally in your pattern, you need to escape it with a backslash:

ActionScript [123]\.0

This expression matches ActionScript 1.0, ActionScript 2.0, or ActionScript 3.0.

Note that the period symbol has no special meaning when specified as part of a character class, so there’s no need to escape it there.
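The difference between an escaped and an unescaped period can be seen in a sketch like this one. The version strings being tested, including the deliberately wrong ActionScript 3x0, are invented for the example.

```actionscript
var escapedDot:RegExp = /ActionScript [123]\.0/;
var unescapedDot:RegExp = /ActionScript [123].0/;

trace(escapedDot.test("ActionScript 3.0")); // outputs true
trace(escapedDot.test("ActionScript 4.0")); // outputs false: 4 is not in the character class
trace(escapedDot.test("ActionScript 3x0")); // outputs false: \. matches only a literal period
trace(unescapedDot.test("ActionScript 3x0")); // outputs true: . matches any character

// Inside a character class, the period is already literal.
var punctuationRegExp:RegExp = /[.,]/;
trace(punctuationRegExp.test("3x0")); // outputs false
trace(punctuationRegExp.test("3.0")); // outputs true
```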

Matching a Number of Occurrences Using Quantifiers

So far, each character in the regular expression patterns you’ve seen has matched exactly one character in the string being searched. However, in a regular expression, you can use quantifiers to determine how many times a given character should be matched. Table 10-2 shows the available quantifiers.

Table 10-2. Regular expression quantifiers

Quantifier   Description
?            Matches zero or one occurrence of the preceding character
*            Matches zero or more occurrences of the preceding character
+            Matches one or more occurrences of the preceding character

I’ll discuss each one of these quantifiers in turn.

Matching Zero or One Occurrence

Earlier you saw an example of the String.indexOf() method, showing that it isn’t ideal for matching a word when you aren’t quite sure of its spelling. The specific example was matching the British English or American English spelling of the word colour/color. If your memory is as bad as mine, here’s a little refresher of the rather awkward solution using String.indexOf():

var entry:String = "Purple is my favourite colour";
if (entry.indexOf("color") > -1 || entry.indexOf("colour") > -1) {
    trace("We have a match!");
}

You then saw a way to solve this problem using a regular expression with alternation, but it wasn’t much of an improvement:

var entry:String = "Purple is my favourite colour";
var colorRegExp:RegExp = /color|colour/;
if (colorRegExp.test(entry)) {
    trace("We have a match!");
}

You still need to specify the majority of the letters in the word twice. What you really need is a way to specify that the letter u is optional in the word colour and does not need to be present in the string to match. Using the zero-or-one quantifier, represented by a question mark (?), you can do just that:

colou?r

You can now rewrite the code to use the regular expression:

var entry:String = "Purple is my favourite colour";
var colorRegExp:RegExp = /colou?r/;
if (colorRegExp.test(entry)) {
    trace("We have a match!");
}

Matching Zero or More Occurrences

If you want to say that a given character can appear zero or more times, use the zero-or-more quantifier, which is represented by an asterisk (*). Like the zero-or-one quantifier, this quantifier is placed after the character you want to be matched zero or more times in the string.

For example, the following pattern matches any word beginning with i and ending with s, with zero or more other characters in between. (Remember that \w matches any alphanumeric character, and \b specifies the beginning or end of the word.)

\bi\w*s\b

This pattern matches the words is, insulates, and inconsistencies with equal aplomb.

Matching One or More Occurrences

The ? and * quantifiers allow zero occurrences of a given character. However, you might need to specify that there should be at least one occurrence. In these cases, you can use the one-or-more quantifier, which is represented by the plus sign (+). As with the other quantifiers, you place this symbol after the character that you want to match in the string.

Modifying the earlier example, you can say that you want to match any word beginning with i and ending with s, but that there must be at least one character between them by replacing the * quantifier with a + quantifier:

\bi\w+s\b

This pattern would still match insulates and inconsistencies, but would no longer match is because there is no letter between the i and the s.
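If you would like to compare the two quantifiers side by side, a sketch such as the following should do the trick. The sample words are taken from the text, plus the word it as an assumed example of something neither pattern should match.

```actionscript
var zeroOrMore:RegExp = /\bi\w*s\b/;
var oneOrMore:RegExp = /\bi\w+s\b/;
var samples:Array = ["is", "insulates", "inconsistencies", "it"];

for (var i:int = 0; i < samples.length; i++) {
    trace(samples[i] + ": " + zeroOrMore.test(samples[i]) + ", " + oneOrMore.test(samples[i]));
}
// outputs:
// is: true, false
// insulates: true, true
// inconsistencies: true, true
// it: false, false
```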

How to Prevent Greedy Quantifiers

By default, the * (zero-or-more) and + (one-or-more) quantifiers are greedy—they’ll consume as much as they possibly can and leave only what is left for the rest of the pattern to match. I feel a demonstration of the problem coming up:

var compassPoints:String = "Naughty elephants squirt water";
var firstWordRegExp:RegExp = /\b.+\b/;
trace(compassPoints.replace(firstWordRegExp, "MATCH"));

Here, you want to match a word boundary (\b), followed by one or more non-newline characters (.+), followed by another word boundary (\b). You might reasonably expect that Naughty would be replaced by MATCH. What you actually get in the Output panel is the following:

MATCH

What happened to the rest of the string? The answer is that the .+ portion of the regular expression ate every last bit of it. The word boundaries that were matched were the very beginning of the string and the very end of the string, and the rest was consumed by the greedy quantifier because the dot metacharacter matches any non-newline character, including whitespace characters.

To put the quantifiers on a diet and stop them from being so greedy, you can add a question mark (?) just after the quantifier symbol:

var compassPoints:String = "Naughty elephants squirt water";
var firstWordRegExp:RegExp = /\b.+?\b/;
trace(compassPoints.replace(firstWordRegExp, "MATCH"));

This might seem a little confusing at first because the question mark is also the symbol for the zero-or-one quantifier. However, when it is placed after either the * or + quantifier, it forces that quantifier to consume as few characters as possible, while allowing the entire pattern to be matched:

MATCH elephants squirt water

If only it were that easy to correct the appetite of human beings, I could give up my extortionate gym membership.

Another way to solve the problem is by restricting which characters are allowed to appear between the word boundaries:

var compassPoints:String = "Naughty elephants squirt water";
var firstWordRegExp:RegExp = /\b\w+\b/;
trace(compassPoints.replace(firstWordRegExp, "MATCH"));

Now, instead of matching one or more non-newline characters (using the . metacharacter) surrounded by word boundaries, the expression will match only one or more characters that can make up a word (using the \w sequence, which matches only alphanumeric characters and underscores) surrounded by word boundaries:

MATCH elephants squirt water

You’ll often find that there are many ways to make your regular expression patterns more specific. Be pragmatic, and don’t be afraid to experiment to see which approach works best for you.
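Greedy matching tends to bite hardest when a delimiter appears more than once in the string. As a further sketch, using a made-up snippet of markup, compare what the greedy and lazy versions of the same pattern replace:

```actionscript
var markup:String = "<b>one</b> and <b>two</b>"; // hypothetical sample value

// Greedy: .+ runs to the end and backs up only as far as the LAST </b>.
var greedyRegExp:RegExp = /<b>.+<\/b>/;
// Lazy: .+? stops at the FIRST </b> that lets the pattern succeed.
var lazyRegExp:RegExp = /<b>.+?<\/b>/;

trace(markup.replace(greedyRegExp, "MATCH")); // outputs MATCH
trace(markup.replace(lazyRegExp, "MATCH")); // outputs MATCH and <b>two</b>
```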

Being More Specific with Bounds

Sometimes being able to specify that you want zero or one or more occurrences of a character isn’t specific enough. You might want to specify that you want at least four occurrences of this character, or between two and six occurrences of that character. Although you could do this by stringing together some of the quantifiers you’ve already met, it wouldn’t be pretty:

\b\w\w\w?\w?\w?\w?\b

This pattern will match words of two to six characters, but you would be forgiven for taking a while to work that out.

Thankfully, you can use a bound in your regular expression patterns to specify how many characters should be matched. Like quantifiers, bounds are placed after the character that you want to be affected, and they are denoted by curly braces ({}).

The simplest example of a bound specifies exactly how many occurrences should be matched. The following pattern matches words of exactly two characters:

\b\w{2}\b

You can also specify both a minimum and a maximum number of occurrences to be matched, separated by a comma. The following pattern matches words of between two and six characters:

\b\w{2,6}\b

Finally, you can leave off the maximum value (keeping the comma) to specify that you want at least the minimum number of occurrences to match, but without an upper limit. The following pattern matches words of at least two characters:

\b\w{2,}\b
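The three kinds of bound can be compared with a sketch like the one below. The sample words are assumptions chosen to be one, two, six, and seven characters long.

```actionscript
var exactlyTwo:RegExp = /\b\w{2}\b/;
var twoToSix:RegExp = /\b\w{2,6}\b/;
var atLeastTwo:RegExp = /\b\w{2,}\b/;
var samples:Array = ["a", "to", "banana", "bananas"];

for (var i:int = 0; i < samples.length; i++) {
    trace(samples[i] + ": " + exactlyTwo.test(samples[i]) + ", "
        + twoToSix.test(samples[i]) + ", " + atLeastTwo.test(samples[i]));
}
// outputs:
// a: false, false, false
// to: true, true, true
// banana: false, true, true
// bananas: false, false, true
```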

Beware that bounds that can match a variable number of characters (those that have a maximum value specified or that are unlimited) are greedy by default. Just like the * and + quantifiers, they will consume as many occurrences as possible while allowing the rest of the pattern to match. You can demonstrate this by going back to the earlier example and replacing the + quantifier with a bound looking for two or more occurrences of a non-newline character:

var compassPoints:String = "Naughty elephants squirt water";
var firstWordRegExp:RegExp = /\b.{2,}\b/;
trace(compassPoints.replace(firstWordRegExp, "MATCH"));

This will produce the same result as using the + quantifier; namely that the entire string will be replaced by MATCH.

If you want a bound to be lazy rather than greedy, just append a question mark after the closing curly brace, just as with the quantifiers:

var compassPoints:String = "Naughty elephants squirt water";
var firstWordRegExp:RegExp = /\b.{2,}?\b/;
trace(compassPoints.replace(firstWordRegExp, "MATCH"));

This results in just the first word being replaced.

Grouping Patterns

Using the quantifiers with single characters is incredibly restrictive. What if you want to apply a quantifier or a bound to a sequence of characters? You can group them by using parentheses, as in this example:

b(an)+a

This matches the letter b, followed by one or more occurrences of the sequence an, followed by the letter a, which would include the word banana:

var myFavoriteFruit:String = "banana";
var bananaRegExp:RegExp = /b(an)+a/;
trace(bananaRegExp.test(myFavoriteFruit)); // outputs true

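Of course, it would also include the sequence bananananananana because the pattern specifies one or more occurrences of an. If you want to be more restrictive, you could use a bound instead. Because test() looks for a match anywhere in the string, the anchors you met earlier are also needed to stop a longer string from matching on its first six characters:

var myFavoriteFruit:String = "banana";
var bananaRegExp:RegExp = /^b(an){2}a$/;
trace(bananaRegExp.test(myFavoriteFruit)); // outputs true
trace(bananaRegExp.test("bananananananana")); // outputs false

No more bananananananana for you.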

Groups are also useful when using alternation. As you saw earlier, the alternation operator (|) acts on the entire pattern instead of just the preceding character (as is the case with the quantifiers). Consider the following expression:

\bb(oa|iscui)t\b

This would match both boat and biscuit because the parentheses limit the alternation between the substrings oa and iscui between the opening b and closing t.
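To see what the parentheses are doing, it can help to compare the grouped pattern with a version that leaves them out. In this sketch the ungrouped pattern and the test string boa are made up for the comparison:

```actionscript
var grouped:RegExp = /\bb(oa|iscui)t\b/;
// Without the group, the alternatives are \bboa and iscuit\b.
var ungrouped:RegExp = /\bboa|iscuit\b/;

trace(grouped.test("boat")); // outputs true
trace(grouped.test("biscuit")); // outputs true
trace(grouped.test("boa")); // outputs false: the closing t is required
trace(ungrouped.test("boa")); // outputs true: the left-hand alternative matches alone
```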

Accessing Matched Strings with Backreferences

In addition to allowing you to organize your patterns, groups let you extract certain pieces of information from a regular expression. When you enclose either part or all of a pattern in a group, the portion of the string matched by that group, referred to as a capture group, is available later in the pattern via a backreference.

A backreference is a numeric reference to a capture group preceded by a backslash (\), starting at 1 for the first group and counting up to a maximum of 99.

Working out the index of your group can be quite troublesome, particularly if you have nested groups (groups within groups) in your pattern, but a simple rule of thumb should see you through: count the number of unescaped opening parentheses, starting from the left side of your pattern, up to and including the group you want to target. The number you end up with will be the index of the backreference to that group.
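The counting rule can be tried out on a small pattern with nested groups. In the sketch below, which uses invented patterns and test strings, counting opening parentheses from the left makes ((\w)(\w)) group 1, the first (\w) group 2, and the second (\w) group 3:

```actionscript
// Two characters followed by the same two characters in reverse order.
var mirrorRegExp:RegExp = /((\w)(\w))\3\2/;
// Two characters followed by the same pair again.
var repeatRegExp:RegExp = /((\w)(\w))\1/;

trace(mirrorRegExp.test("abba")); // outputs true: \3 is b and \2 is a
trace(mirrorRegExp.test("abab")); // outputs false
trace(repeatRegExp.test("abab")); // outputs true: \1 is ab
trace(repeatRegExp.test("abba")); // outputs false
```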

A simple example might make this a little clearer. Suppose that you want to search through a piece of text with HTML tags and find any references to headings. You could use the dot metacharacter and quantifiers to write the pattern, like this:

<h[1-6]>.*?</h[1-6]>

This would do the job of matching valid heading tags, but it would also match strings with mismatched opening and closing heading tags:

var invalidHtmlText:String = "<h1>A mismatched example</h6>";
var headingRegExp:RegExp = /<h[1-6]>.*?<\/h[1-6]>/;
trace(headingRegExp.test(invalidHtmlText)); // outputs true

According to the pattern, the value of the invalidHtmlText variable is perfectly valid. It has an opening header tag with a level of 1 through 6 (<h[1-6]>), some text (.*?), and then a closing heading tag with a level of 1 through 6 (<\/h[1-6]>). Nothing in the pattern says that the opening and closing tags must be of the same level. Just in case the expressions are still looking uncomfortably foreign to you, Table 10-3 gives a more detailed breakdown.

Table 10-3. Regular expression breakdown

Pattern element   What it matches
<h                Matches literally the characters <h
[1-6]             Matches any one digit 1 through 6
>                 Matches literally the character >
.                 Matches any one character
*                 Matches zero or more occurrences of the preceding character (because it is a period, it means any character)
?                 Prevents the preceding quantifier from being greedy, meaning it matches only the minimal number of characters to fulfill the expression’s requirements
<\/h              Matches literally the characters </h (note the need to escape the forward slash)
[1-6]>            Matches any one digit 1 through 6, followed by literally the character >

Notice that when translating the pattern to an ActionScript 3.0 regular expression literal, you need to escape the forward slash in the closing heading tag. This is necessary because regular expression literals are delineated by forward slashes, and you need to tell the ActionScript compiler that this forward slash is part of the pattern, not the end delimiter.

To solve this problem, you need a way to tell the regular expression engine that whatever number was used to open the tag should also be used to close the tag. You can do this by wrapping the portion of the pattern that matches the contents of the opening tag in parentheses and then using a backreference in the closing tag to specify that they must match:

var invalidHtmlText:String = "<h1>A mismatched example</h6>";
var validHtmlText:String = "<h1>A matching example</h1>";
var headingRegExp:RegExp = /<(h[1-6])>.*?<\/\1>/;
trace(headingRegExp.test(invalidHtmlText)); // outputs false
trace(headingRegExp.test(validHtmlText)); // outputs true

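As in the previous example, the first part of the pattern matches the opening tag. The difference is that the closing tag is now matched with <\/\1>, where the backreference \1 stands for whatever the first group, (h[1-6]), captured. For the valid string, that is h1, so only </h1> can close the tag.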

Running this example will confirm that only the valid HTML text will match the regular expression.
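To see the counting rule for nested groups in action, here is a small sketch. The pattern and the test strings are made up for illustration and have nothing to do with the heading example. The outer group opens first, so it is group 1; the group nested inside it opens second, so it is group 2:

```actionscript
// Groups are numbered by the position of their opening parenthesis:
//   \1 -> ((\w)\w)  the outer group: two word characters
//   \2 -> (\w)      the inner group: the first of those two characters
var nestedRegExp:RegExp = /((\w)\w)-\2-\1/;
trace(nestedRegExp.test("ab-a-ab")); // outputs true
trace(nestedRegExp.test("ab-b-ab")); // outputs false
```

In the second string, the inner group captures a, so the b that follows the first hyphen cannot satisfy \2.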

Using Backreferences with the String.replace() Method

Backreferences can also be used in the replacement string that is passed to the String.replace() method. When used in this context, backreferences are specified slightly differently: they use a $ (dollar sign) instead of a backslash, followed by the capture group index.

To demonstrate replacing using regular expressions and backreferences, imagine that you have loaded in HTML text dynamically to populate a TextField instance. Although a TextField instance can be populated with HTML text, it understands only the most basic HTML tags. One of the tags not understood is <strong>, which needs to be converted to a <b> tag to display properly in a text field. To go through a string and replace all occurrences of the <strong> tag and its contents with the <b> tag with the same contents, you can use this code:

var htmlText:String = "<strong>This text is important</strong>";
var strongRegExp:RegExp = /<strong>(.*?)<\/strong>/;
var replaced:String = htmlText.replace(strongRegExp, "<b>$1</b>");
trace(replaced); // outputs: <b>This text is important</b>

Here, the backreference $1 refers to the capture group containing the match for (.*?). Remember that the full match for the entire expression consists of the opening and closing <strong> tags and the contents between. This full match is replaced by the opening and closing <b> tags, enclosing whatever characters are contained in the first capture group denoted by the parentheses. In the example, the matched characters for the capture group are This text is important, so the backreference $1 includes these characters, and you can use this backreference to insert these characters into your final string.

You can also use the special index 0 (zero) to make use of the part of the search string that was matched by the whole pattern.
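As a rough sketch of reusing the whole match in a replacement, the following wraps a matched word in <b> tags. It assumes the $& replacement code, which is what the ActionScript 3.0 documentation for String.replace() lists for the entire matched substring; the sentence and funRegExp variables are invented for this example:

```actionscript
var sentence:String = "Regular expressions are fun";
var funRegExp:RegExp = /fun/;
// $& is replaced by the text matched by the whole pattern
trace(sentence.replace(funRegExp, "<b>$&</b>"));
// outputs: Regular expressions are <b>fun</b>
```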

Using Backreferences After the Pattern has been Matched

One of the methods of the RegExp object you haven’t yet explored is the exec() method. It is similar to the test() method in that it executes the regular expression against the specified string. But instead of just returning true for a match, you actually get some useful information.

The exec() method returns an Object containing the groups that were matched, stored by group index, including the part of the string matched by the entire pattern at index 0, as well as two other special properties:

· input: the string that was passed to the method

· index: the position within the string in which the matched substring was found

Returning to the example in the previous section, you can see what this means in practice:

var htmlText:String = "<strong>This text is important</strong> " +
    "while this text is not as important";
var strongRegExp:RegExp = /<strong>(.*?)<\/strong>/;
var matches:Object = strongRegExp.exec(htmlText);
for (var i:String in matches) {
    trace(i + ": " + matches[i]);
}

Running this example results in the following text in the Output panel:

0: <strong>This text is important</strong>
1: This text is important
input: <strong>This text is important</strong> while this text is not as important
index: 0

The array of capture groups contains two elements. The first (index 0) holds the entire matched substring, and the second (index 1) holds the text matched by the group denoted by parentheses. In addition, the whole string being searched is contained in the input property. Finally, index traces as 0 because the matched substring begins at the first character in the string through which you were searching.
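In practice, you will usually read the pieces you need directly rather than loop over everything. The following is a minimal sketch along those lines; the sample string is made up, and the null check is there because exec() returns null when the pattern does not match:

```actionscript
var htmlText:String = "Some <strong>bold</strong> text";
var strongRegExp:RegExp = /<strong>(.*?)<\/strong>/;
var result:Object = strongRegExp.exec(htmlText);
if (result != null) {
    trace(result[0]);    // outputs <strong>bold</strong>
    trace(result[1]);    // outputs bold
    trace(result.index); // outputs 5
}
```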

Understanding the E-Mail Regular Expression Pattern

As promised, you can now make sense of the e-mail validation regular expression pattern presented at the beginning of the chapter:

^([a-zA-Z0-9._-]+)@([a-zA-Z0-9.-]+)\.([a-zA-Z]{2,4})$

Let’s break down each of its parts. First, notice that the whole pattern is enclosed in start and end anchors (^ and $, respectively):

^([a-zA-Z0-9._-]+)@([a-zA-Z0-9.-]+)\.([a-zA-Z]{2,4})$

The pattern matches only if the entire string being searched matches the pattern. If you omit these anchors, the pattern would match a string that contained a valid e-mail address somewhere within it. This might be what you want if you’re trying to extract all e-mail addresses from a larger string, but it’s not correct for validating an e-mail address.

Moving on, you can see that there is a group toward the beginning of the expression containing a character range:

^([a-zA-Z0-9._-]+)@([a-zA-Z0-9.-]+)\.([a-zA-Z]{2,4})$

This group contains a character class that matches any alphanumeric character, a period, an underscore, or a hyphen. This character class then has the one-or-more quantifier applied to it to indicate that you want to match as many of these characters in a row as possible. This will match the mailbox name portion of an e-mail address. It is grouped for readability only, but it might be useful later on if you want to reference the mailbox name using a backreference.

Next comes the @ symbol. This symbol has no special meaning in the pattern and is treated as a literal character:

^([a-zA-Z0-9._-]+)@([a-zA-Z0-9.-]+)\.([a-zA-Z]{2,4})$

This matches the @ character that separates the mailbox name from the domain name in an e-mail address.

Following that is another group containing a character range:

^([a-zA-Z0-9._-]+)@([a-zA-Z0-9.-]+)\.([a-zA-Z]{2,4})$

This group is almost identical to the first, except that the character range does not contain an underscore character because it would be invalid in a domain name. Again, the pattern looks for one or more occurrences of the character range (which is why you use the + quantifier).

Next is an escaped period:

^([a-zA-Z0-9._-]+)@([a-zA-Z0-9.-]+)\.([a-zA-Z]{2,4})$

Remember that the period needs to be escaped if you want to match it literally because it has special meaning within a regular expression pattern. Without the preceding backslash, it matches any non-newline character.

The final part of the regular expression pattern is another group:

^([a-zA-Z0-9._-]+)@([a-zA-Z0-9.-]+)\.([a-zA-Z]{2,4})$

This group consists of a character class matching an uppercase or lowercase letter and a bounds operator, indicating that you want between two and four characters that match that class. This matches the top-level domain (com, net, org, uk, and so on) for the domain name portion of the e-mail address.
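Because each part of the address sits in its own group, the same pattern can also pull an address apart. Here is a short sketch using exec(); the address is invented for the example:

```actionscript
var emailRegExp:RegExp =
    /^([a-zA-Z0-9._-]+)@([a-zA-Z0-9.-]+)\.([a-zA-Z]{2,4})$/;
var parts:Object = emailRegExp.exec("jane.doe@mail.example.com");
if (parts != null) {
    trace(parts[1]); // outputs jane.doe     (mailbox name)
    trace(parts[2]); // outputs mail.example (domain, minus the top level)
    trace(parts[3]); // outputs com          (top-level domain)
}
```

Notice that the second group first grabs as much as it can and then gives back just enough for the escaped period and the final group to match.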

Changing Regular Expression Behavior with Modifiers

You’ve been concentrating on patterns for so long that you might have forgotten the other element of regular expressions: modifiers. Modifiers are used to change the behavior of the entire regular expression or just some of the metacharacters within a pattern. ActionScript 3.0 supports the modifiers listed in Table 10-4.

Table 10-4. Regular expression modifiers

Modifier    Property      Description
i           ignoreCase    Specifies that the entire pattern is case-insensitive
g           global        Specifies that the pattern should be matched as many times as possible throughout the string being searched instead of just once
m           multiline     Allows the start-of-string and end-of-string anchors to match the start and end of a line, respectively
s           dotall        Allows the dot metacharacter to match newline characters
x           extended      Specifies that whitespace in the pattern should be ignored

The modifiers are separate from the pattern in a regular expression and are specified either as a string passed as the second argument to the RegExp constructor or after the second forward slash in a regular expression literal. You can specify more than one modifier (it doesn’t make sense to specify the same modifier more than once). A regular expression using all the available modifiers (more power, more power) would look something like this:

/pattern/igmsx

And if you use all the modifiers, you’re writing more complex regular expressions than you’ll ever need.

After the modifiers are configured for a RegExp object, you can test to see which ones have been set by using the equivalent property names (as specified in Table 10-4):

var globalRegExp:RegExp = /abc/g;
trace(globalRegExp.global); // outputs true

The properties are read-only Boolean values, so the preceding example would output true to the Output panel.
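The same properties work when the modifiers are passed as a string to the RegExp constructor. A quick sketch, with an arbitrary pattern and an arbitrary choice of modifiers:

```actionscript
// Modifiers supplied as the second constructor argument
var flagsRegExp:RegExp = new RegExp("abc", "gi");
trace(flagsRegExp.global);     // outputs true
trace(flagsRegExp.ignoreCase); // outputs true
trace(flagsRegExp.multiline);  // outputs false
```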

Let’s now look at each of the modifiers in turn to see how they affect your patterns.

Using the Case-Insensitive Modifier

Using the i (case-insensitive) modifier allows you to specify that any alphabetic character in your pattern should match both the uppercase and lowercase versions in the string being searched (an a in the pattern matches either an a or an A):

var colorRegExp:RegExp = /colou?r/i;
trace(colorRegExp.test("colour")); // outputs true
trace(colorRegExp.test("Color")); // outputs true
trace(colorRegExp.test("COLOUR")); // outputs true

Using the case-insensitive modifier, you can reduce the number of characters in the e-mail validation pattern by eliminating all the uppercase character ranges:

/^([a-z0-9._-]+)@([a-z0-9.-]+)\.([a-z]{2,4})$/i

Unfortunately, the case-insensitive modifier has no effect on non-English characters, such as è and È. For occasions when you want to perform a case-insensitive match on a string containing non-English characters, you have to use character classes or alternation instead.
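One way to apply that workaround is sketched below. The word and the variable name are made up, and the example assumes your source file is saved in an encoding that preserves the accented characters. The i modifier takes care of the unaccented letters, and a character class lists both cases of the accented one:

```actionscript
// [éÉ] spells out both cases of the accented letter explicitly
var cafeRegExp:RegExp = /caf[éÉ]/i;
trace(cafeRegExp.test("café")); // outputs true
trace(cafeRegExp.test("CAFÉ")); // outputs true
```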

Using the Global Modifier

The g (global) modifier allows you to use the exec() method to find more than one occurrence of your entire pattern in the specified search string. For example, without the global modifier, multiple calls to the exec() method in the following example result in the same word being matched every time:

var compassPoints:String = "Naughty elephants squirt water";
var wordRegExp:RegExp = /\b\w+\b/;
trace(wordRegExp.exec(compassPoints)); // outputs Naughty
trace(wordRegExp.exec(compassPoints)); // outputs Naughty
trace(wordRegExp.exec(compassPoints)); // outputs Naughty
trace(wordRegExp.exec(compassPoints)); // outputs Naughty

This occurs because, without the global modifier, every call to the exec() method starts searching from the beginning of the string. You can change this behavior with the global modifier:

var compassPoints:String = "Naughty elephants squirt water";
var wordRegExp:RegExp = /\b\w+\b/g;
trace(wordRegExp.exec(compassPoints)); // outputs Naughty
trace(wordRegExp.exec(compassPoints)); // outputs elephants
trace(wordRegExp.exec(compassPoints)); // outputs squirt
trace(wordRegExp.exec(compassPoints)); // outputs water

This time, the RegExp object remembers the position at the end of the previous match (it is stored in the object's lastIndex property). The next time exec() is called, it begins its search from where it previously left off.
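Rather than writing one exec() call per expected match, you would normally loop until exec() returns null. The following sketch does that and also traces the lastIndex property of the RegExp object, which is where the resume position is kept:

```actionscript
var compassPoints:String = "Naughty elephants squirt water";
var wordRegExp:RegExp = /\b\w+\b/g;
var match:Object = wordRegExp.exec(compassPoints);
while (match != null) {
    // match.index is where this word starts;
    // lastIndex is where the next search will begin
    trace(match[0] + " at " + match.index +
        ", next search from " + wordRegExp.lastIndex);
    match = wordRegExp.exec(compassPoints);
}
// outputs:
// Naughty at 0, next search from 7
// elephants at 8, next search from 17
// squirt at 18, next search from 24
// water at 25, next search from 30
```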

The global modifier also changes the behavior of the String.match() method. Normally, this method would return an array containing exactly one element, consisting of the first substring that was matched by the specified regular expression:

var compassPoints:String = "Naughty elephants squirt water";
var wordRegExp:RegExp = /\b\w+\b/;
trace(compassPoints.match(wordRegExp)); // outputs Naughty

This example outputs the following to the Output panel:

Naughty

However, if you use the global modifier, the array returned from the String.match() call will contain one element for each time the pattern was matched throughout the entire string:

var compassPoints:String = "Naughty elephants squirt water";
var wordRegExp:RegExp = /\b\w+\b/g;
// outputs Naughty,elephants,squirt,water
trace(compassPoints.match(wordRegExp));

The revised example outputs the following to the Output panel:

Naughty,elephants,squirt,water

The final method affected by the global modifier is the String.replace() method. I trust that you can work out what the following example does:

var compassPoints:String = "Naughty elephants squirt water";
var wordRegExp:RegExp = /\b\w+\b/g;
// outputs MATCH MATCH MATCH MATCH
trace(compassPoints.replace(wordRegExp, "MATCH"));

Using the Multiline Modifier

The m (multiline) modifier changes the behavior of the start-of-string and end-of-string anchors so that they also match the start and end of a line in a string, respectively. When combined with the global modifier, this modifier makes it easy to construct a regular expression to take a string containing multiple lines and convert them into list items:

var list:String = "one\ntwo\nthree\nfour";
var singleLineRegExp:RegExp = /^(.*?)$/mg;
trace(list.replace(singleLineRegExp, "<li>$1</li>"));

This example produces the following output:

<li>one</li>
<li>two</li>
<li>three</li>
<li>four</li>

Notice that the newlines are still present. They weren’t actually consumed by the anchors, so they weren’t replaced by the replacement string.
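The same pair of modifiers is handy for picking out whole lines rather than replacing them. As a sketch, the following pulls every line that starts with # out of a made-up block of settings text:

```actionscript
var config:String = "# settings\nwidth=100\n# colors\ncolor=red";
// With m, ^ and $ match at the start and end of every line;
// with g, match() returns every matching line.
var commentRegExp:RegExp = /^#.*$/mg;
trace(config.match(commentRegExp)); // outputs # settings,# colors
```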

Using the Dotall Modifier

The dot metacharacter normally matches any character in a string with the exception of newlines. Using the s (dotall) modifier means that the dot metacharacter will match any character in the string being searched, including newlines. This is a subtle shift, but a useful one.

Going back to the <strong> tag example, the pattern can find a <strong> tag whose contents are spread across multiple lines only if the dot is allowed to match newlines. Try the following with and without the dotall modifier:

var htmlText:String = "<strong>This text\nis important</strong>";
var strongRegExp:RegExp = /<strong>(.*?)<\/strong>/s;
var replaced:String = htmlText.replace(strongRegExp, "<b>$1</b>");
trace(replaced);
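If you would rather see the difference side by side, this sketch runs the same pattern against the same string with and without the s modifier, using test() instead of replace():

```actionscript
var htmlText:String = "<strong>This text\nis important</strong>";
// Without s, the dot stops at the newline, so the closing tag is never reached
trace(/<strong>(.*?)<\/strong>/.test(htmlText));  // outputs false
// With s, the dot matches the newline too
trace(/<strong>(.*?)<\/strong>/s.test(htmlText)); // outputs true
```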

Using the Extended Modifier

Using the x (extended) modifier allows you to format your regular expression pattern using whitespace without actually affecting the pattern itself. This is generally used to aid readability of a pattern. For example, you could use whitespace to separate the various parts of the e-mail validation pattern, like so:

/^ ([a-z0-9._-]+) @ ([a-z0-9.-]+) \. ([a-z]{2,4}) $/ix

In my experience, the extended modifier is rarely (if ever) used. If other developers pick up your code, they might assume that the whitespace is part of the pattern, and only when they look at the modifiers (if they look at them at all) will they realize that the whitespace has no meaning. Sometimes using the extended modifier makes sense, such as when you have a really long regular expression. If you use it, make sure that you add in a comment before the regular expression to point it out.

Using Variables to Build a Regular Expression

Another (perhaps better) option for breaking up long regular expressions to make them more readable is to build up the expression using variables. To use variables to construct a regular expression, you must use the RegExp constructor instead of a regular expression literal. Consider the following example:

var localName:String = "^([a-z0-9._-]+)";
var domain:String = "([a-z0-9.-]+)";
var topLevel:String = "([a-z]{2,4})$";
var emailValidator:RegExp =
    new RegExp(localName + "@" + domain + "\\." + topLevel, "i");
var email:String = "someAddress@someserver.com";
trace(emailValidator.test(email)); // outputs true

This example breaks out each of the groups and assigns them to variables; it then constructs the regular expression by concatenating those strings and passing the result to the RegExp constructor. (Note that because the backslash for the period is within a string, you need to escape that backslash with another backslash so that it is read literally and not ignored.) Whether this is more readable than including the expression in a literal declaration, as in the following version, is debatable:

var emailValidator:RegExp =
    /^([a-z0-9._-]+)@([a-z0-9.-]+)\.([a-z]{2,4})$/i;
var email:String = "someAddress@someserver.com";
trace(emailValidator.test(email)); // outputs true

At the very least, you have options about how you want to represent your regular expressions and can decide what works best for you and your group.
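The constructor is also the only route when part of the pattern is not known until runtime. The following is a minimal sketch with a hypothetical searchTerm value. It assumes the term contains no regular expression metacharacters; a term that did would need those characters escaped first:

```actionscript
// Hypothetical value, for example something typed into a search box
var searchTerm:String = "ELEPHANTS";
// Wrap the term in word boundaries; note the doubled backslashes
var termRegExp:RegExp = new RegExp("\\b" + searchTerm + "\\b", "i");
var compassPoints:String = "Naughty elephants squirt water";
trace(termRegExp.test(compassPoints));   // outputs true
trace(compassPoints.search(termRegExp)); // outputs 8
```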

Useful Regular Expressions

Table 10-5 shows a list of regular expression patterns that you might find useful in your projects. See if you can work out how they do what they do. (Note that most of these regular expression patterns do not include any boundaries, so if you want to ensure that they are not matched within other words, you should include the \b metacharacter.)

Table 10-5. Common regular expressions

Matches                                            Regular expression
U.S. Social Security number                        \d{3}-\d{2}-\d{4}
24-hour time with optional seconds (hh:mm[:ss])    ([01][0-9]|2[0-3]):([0-5][0-9])(:([0-5][0-9]))?
U.S. date (mm/dd/yyyy)                             (0?[1-9]|1[012])/(0?[1-9]|[12][0-9]|3[01])/([0-9]{4})
UK date (dd/mm/yyyy)                               (0?[1-9]|[12][0-9]|3[01])/(0?[1-9]|1[012])/([0-9]{4})
E-mail address                                     ([a-z0-9._-]+)@([a-z0-9.-]+)\.([a-z]{2,4})
URL                                                ^http\://[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(/\S*)?$

Many of the regular expression patterns shown in Table 10-5 could be written differently or more accurately. Half the job of creating the pattern for a regular expression is to find the right balance between clarity and accuracy. The official regular expression to validate an e-mail address is more than 6,000 characters long and is almost completely incomprehensible to mere mortals. The version presented here is 42 characters long, much more understandable, and good enough in all but the most exceptional cases.
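To give an idea of how you might drop these patterns into your own code, here is a sketch that uses two of them. The Social Security number pattern is wrapped in \b boundaries so that it can be found inside a longer string, and the time pattern is wrapped in ^ and $ anchors so that it validates a whole string. The sample values are invented:

```actionscript
// Finding a Social Security number inside a longer piece of text
var ssnRegExp:RegExp = /\b\d{3}-\d{2}-\d{4}\b/;
trace(ssnRegExp.test("SSN: 123-45-6789")); // outputs true
trace(ssnRegExp.test("1123-45-67890"));    // outputs false

// Validating that an entire string is a 24-hour time
var timeRegExp:RegExp = /^([01][0-9]|2[0-3]):([0-5][0-9])(:([0-5][0-9]))?$/;
trace(timeRegExp.test("23:59"));    // outputs true
trace(timeRegExp.test("07:30:15")); // outputs true
trace(timeRegExp.test("24:00"));    // outputs false
```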

Regular Expression Resources

I hope this chapter has given you an insight into regular expressions, but there’s no way it could possibly tell the whole story. If you have a taste for regular expressions and want to explore some of the more esoteric features, you could do no better than getting yourself a copy of Jeffrey Friedl’s Mastering Regular Expressions (O’Reilly). This book will tell you everything you ever wanted to know—and more—about regular expressions. It then messes with your head with a look at how regular expression engines work and how to best optimize your patterns to squeeze every last ounce of performance from them. Be warned that by the time you’ve finished this book, you will either be institutionalized or a fully paid-up member of the propeller-head club.

If you are averse to institutionalization, you might want to check out Tony Stubblebine’s excellent Regular Expression Pocket Reference (O’Reilly). All developers who use regular expressions more than once per year should have a copy of this reference on their desks.

If you’re in a fix and you can’t quite work out how to build a regular expression to suit your needs, chances are that someone has solved the problem before. If it is solved, it’s probably listed on the Regular Expression Library website (http://www.regexlib.com), which contains a searchable list of regular expression patterns that have been contributed by visitors to the site. The collection is ever-growing and driven by the community, so if you find a solution to a problem that isn’t already listed, you can contribute it to the developer community through this site.

Summary

I covered a lot of ground in this chapter. If you’re still reading, give yourself a pat on the back (and extra pats if you read it all in one sitting). You started by looking at what regular expressions are and why they are useful. Next, you spent a long time wading through the various features of a regular expression pattern and how they can be used in a variety of practical examples, using the various regular expression-capable methods along the way. You then looked at the modifiers that can be applied to a regular expression and how they affect the way in which a pattern is matched. Finally, you saw some commonly used regular expressions and were directed to some resources for learning more about regular expressions.

Regular expressions offer an amazing amount of power for searching through and manipulating string data in ActionScript. Actually, much of programming comes down to manipulating strings and other types of data. In the next chapter, you’ll look at using XML, one of the most useful ways of storing and passing data back and forth with the server. See you on the next page, when you’re ready to add yet another powerful tool to your ActionScript toolkit.