Regular Expressions - Learning JavaScript (2016)

Learning JavaScript (2016)

Chapter 17. Regular Expressions

Regular expressions provide sophisticated string matching functionality. If you want to match things that “look like” an email address or a URL or a phone number, regular expressions are your friends. A natural complement to string matching is string replacement, and regular expressions support that as well—for example, if you want to match things that look like email addresses and replace them with a hyperlink for that email address.

Many introductions to regular expressions use esoteric examples such as “match aaaba and abaaba but not abba,” which has the advantage of breaking the complexity of regular expressions into neat chunks of functionality, but has the disadvantage of seeming very pointless (when do you ever need to match aaaba?). I am going to try to introduce the features of regular expressions using practical examples from the get-go.

Regular expressions are often abbreviated “regex” or “regexp”; in this book, we’ll use the former for brevity.

Substring Matching and Replacing

The essential job of a regex is to match a substring within a string, and optionally replace it. Regexes allow you to do this with incredible power and flexibility, so before we dive into it, let’s briefly cover the non-regex search and replace functionality of String.prototype, which is suitable for very modest search and replacement needs.

If all you need to do is determine if a specific substring exists in a bigger string, the following String.prototype methods will suffice:

const input = "As I was going to Saint Ives";

input.startsWith("As") // true

input.endsWith("Ives") // true

input.startsWith("going", 9) // true -- start at index 9

input.endsWith("going", 14) // true -- treat index 14 as the end of the string

input.includes("going") // true

input.includes("going", 10) // false -- starting at index 10

input.indexOf("going") // 9

input.indexOf("going", 10) // -1

input.indexOf("nope") // -1

Note that all of these methods are case sensitive. So input.startsWith("as") would be false. If you want to do a case-insensitive comparison, you can simply convert the input to lowercase:

input.toLowerCase().startsWith("as") // true

Note that this doesn’t modify the original string; String.prototype.toLowerCase returns a new string and doesn’t modify the original string (remember that strings in JavaScript are immutable).

If we want to go a step further and find a substring and replace it, we can use String.prototype.replace:

const input = "As I was going to Saint Ives";

const output = input.replace("going", "walking");

Again, the original string (input) is not modified by this replacement; output now contains the new string with “going” replaced with “walking” (we could have assigned back to input, of course, if we really wanted input to change).

Constructing Regular Expressions

Before we get into the complexities of the regex metalanguage, let’s talk about how they’re actually constructed and used in JavaScript. For these examples, we’ll be searching for a specific string, just as before—an overkill for regexes, but an easy way to understand how they’re used.

Regexes in JavaScript are represented by the class RegExp. While you can construct a regex with the RegExp constructor, regexes are important enough to merit their own literal syntax. Regex literals are set off with forward slashes:

const re1 = /going/; // regex that can search for the word "going"

const re2 = new RegExp("going"); // equivalent object constructor

There is a specific reason to use the RegExp constructor that we’ll cover later in this chapter, but except for that special case, you should prefer the more convenient literal syntax.

Searching with Regular Expressions

Once we have a regex, we have multiple options for using it to search in a string.

To understand the options for replacement, we’re going to get a little preview of the regex metalanguage—using a static string here would be very boring. We’ll use the regex /\w{3,}/ig, which will match all words three letters or longer (case insensitive). Don’t worry about understanding this right now; that will come later in this chapter. Now we can consider the search methods available to us:

const input = "As I was going to Saint Ives";

const re = /\w{3,}/ig;

// starting with the string (input)

input.match(re); // ["was", "going", "Saint", "Ives"]

input.search(re); // 5 (the first three-letter word starts at index 5)

// starting with the regex (re)

re.test(input); // true (input contains at least one three-letter word)

re.exec(input); // ["was"] (first match)

re.exec(input); // ["going"] (exec "remembers" where it is)

re.exec(input); // ["Saint"]

re.exec(input); // ["Ives"]

re.exec(input); // null -- no more matches

// note that any of these methods can be used directly with a regex literal

input.match(/\w{3,}/ig);

input.search(/\w{3,}/ig);

/\w{3,}/ig.test(input);

/\w{3,}/ig.exec(input);

// ...

Of these methods, RegExp.prototype.exec provides the most information, but you’ll find that it’s the one you use the least often in practice. I find myself using String.prototype.match and RegExp.prototype.test the most often.

Replacing with Regular Expressions

The same String.prototype.replace method we saw before for simple string replacement also accepts a regex, but it can do a lot more. We’ll start with a simple example—replacing all four-letter words:

const input = "As I was going to Saint Ives";

const output = input.replace(/\w{4,}/ig, '****'); // "As I was ****

// to **** ****"

We’ll learn about much more sophisticated replacement methods later in this chapter.

Input Consumption

A naïve way to think about regexes is “a way to find a substring within a larger string” (often called, colorfully, “needle in a haystack”). While this naïve conception is often all you need, it will limit your ability to understand the true nature of regexes and leverage them for more powerful tasks.

The sophisticated way to think about a regex is a pattern for consuming input strings. The matches (what you’re looking for) become a byproduct of this thinking.

A good way to conceptualize the way regexes work is to think of a common children’s word game: a grid of letters in which you are supposed to find words. We’ll ignore diagonal and vertical matches; as a matter of fact, let’s think only of the first line of this word game:

§ X J A N L I O N A T U R E J X E E L N P

Humans are very good at this game. We can look at this, and pretty quickly pick out LION, NATURE, and EEL (and ION while we’re at it). Computers—and regexes—are not as clever. Let’s look at this word game as a regex would; not only will we see how regexes work, but we will also see some of the limitations that we need to be aware of.

To simplify things, let’s tell the regex that we’re looking for LION, ION, NATURE, and EEL; in other words, we’ll give it the answers and see if it can verify them.

The regex starts at the first character, X. It notes that none of the words it’s looking for start with the letter X, so it says “no match.” Instead of just giving up, though, it moves on to the next character, J. It finds the same situation with J, and then moves on to A. As we move along, we consider the letters the regex engine is moving past as being consumed. Things don’t get interesting until we hit the L. The regex engine then says, “Ah, this could be LION!” Because this could be a potential match, it doesn’t consume the L; this is an important point to understand. The regex goes along, matching the I, then the O, then the N. Now it recognizes a match; success! Now that it has recognized a match it can then consume the whole word, so L, I, O, and N are now consumed. Here’s where things get interesting. LION and NATURE overlap. As humans, we are untroubled by this. But the regex is very serious about not looking at things it’s already consumed. So it doesn’t “go back” to try to find matches in things it’s already consumed. So the regex won’t find NATURE because the N has already been consumed; all it will find is ATURE, which is not one of the words it is looking for. It will, however, eventually find EEL.

Now let’s go back to the example and change the O in LION to an X. What will happen then? When the regex gets to the L, it will again recognize a potential match (LION), and therefore not consume the L. It will move on to the I without consuming it. Then it will get to the X; at this point, it realizes that there’s no match: it’s not looking for any word that starts with LIX. What happens next is that the regex goes back to where it thought it had a match (the L), consumes that L, and moves on, just as normal. In this instance, it will match NATURE, because the N hadn’t been consumed as part of LION.

A partial example of this process is shown in Figure 17-1.

Regex example

Figure 17-1. Regex example

Before we move on to discussing the specifics of the regex metalanguage, let’s consider abstractly the algorithm a regex employs when “consuming” a string:

§ Strings are consumed from left to right.

§ Once a character has been consumed, it is never revisited.

§ If there is no match, the regex advances one character at a time attempting to find a match.

§ If there is a match, the regex consumes all the characters in the match at once; matching continues with the next character (if the regex is global, which we’ll talk about later).

This is the general algorithm, and it probably won’t surprise you that the details are much more complicated. In particular, the algorithm can be aborted early if the regex can determine that there won’t be a match.

As we move through the specifics of the regex metalanguage, try to keep this algorithm in mind; imagine your strings being consumed from left to right, one character at a time, until there are matches, at which point whole matches are consumed at once.

Alternation

Imagine you have an HTML page stored in a string, and you want to find all tags that can reference an external resource (<a>, <area>, <link>, <script>, <source>, and sometimes, <meta>). Furthermore, some of the tags may be mixed case (<Area>, <LINKS>, etc.). Regular expression alternations can be used to solve this problem:

const html = 'HTML with <a href="/one">one link</a>, and some JavaScript.' +

'<script src="stuff.js"></script>';

const matches = html.match(/area|a|link|script|source/ig); // first attempt

The vertical bar (|) is a regex metacharacter that signals alternation. The ig signifies to ignore case (i) and to search globally (g). Without the g, only the first match would be returned. This would be read as “find all instances of the text area, a, link, script, or source, ignoring case.” The astute reader might wonder why we put area before a; this is because regexes evaluate alternations from left to right. In other words, if the string has an area tag in it, it would match the a and then move on. The a is then consumed, and rea would not match anything. So you have to matcharea first, then a; otherwise, area will never match.

If you run this example, you’ll find that you have many unintended matches: the word link (inside the <a> tag), and instances of the letter a that are not an HTML tag, just a regular part of English. One way to solve this would be to change the regex to /<area|<a|<link|<script|<source/ (angle brackets are not regex metacharacters), but we’re going to get even more sophisticated still.

Matching HTML

In the previous example, we perform a very common task with regexes: matching HTML. Even though this is a common task, I must warn you that, while you can generally do useful things with HTML using regexes, you cannot parse HTML with regexes. Parsing means to completely break something down into its component parts. Regexes are capable of parsing regular languages only (hence the name). Regular languages are extremely simple, and most often you will be using regexes on more complex languages. Why the warning, then, if regexes can be used usefully on more complex languages? Because it’s important to understand the limitations of regexes, and recognize when you need to use something more powerful. Even though we will be using regexes to do useful things with HTML, it’s possible to construct HTML that will defeat our regex. To have a solution that works in 100% of the cases, you would have to employ a parser. Consider the following example:

const html = '<br> [!CDATA[[<br>]]';

const matches = html.match(/<br>/ig);

This regex will match twice; however, there is only one true <br> tag in this example; the other matching string is simply non-HTML character data (CDATA). Regexes are also extremely limited when it comes to matching hierarchical structures (such as an <a> tag within a <p> tag). The theoretical explanations for these limitations are beyond the scope of this book, but the takeaway is this: if you’re struggling to make a regex to match something very complicated (such as HTML), consider that a regex simply might not be the right tool.

Character Sets

Character sets provide a compact way to represent alternation of a single character (we will combine it with repetition later, and see how we can extend this to multiple characters). Let’s say, for example, you wanted to find all the numbers in a string. You could use alternation:

const beer99 = "99 bottles of beer on the wall " +

"take 1 down and pass it around -- " +

"98 bottles of beer on the wall.";

const matches = beer99.match(/0|1|2|3|4|5|6|7|8|9/g);

How tedious! And what if we wanted to match not numbers but letters? Numbers and letters? Lastly, what if you wanted to match everything that’s not a number? That’s where character sets come in. At their simplest, they provide a more compact way of representing single-digit alternation. Even better, they allow you to specify ranges. Here’s how we might rewrite the preceding:

const m1 = beer99.match(/[0123456789]/g); // okay

const m2 = beer99.match(/[0-9]/g); // better!

You can even combine ranges. Here’s how we would match letters, numbers, and some miscellaneous punctuation (this will match everything in our original string except whitespace):

const match = beer99.match(/[\-0-9a-z.]/ig);

Note that order doesn’t matter: we could just as easily have said /[.a-z0-9\-]/. We have to escape the dash to match it; otherwise, JavaScript would attempt to interpret it as part of a range (you can also put it right before the closing square bracket, unescaped).

Another very powerful feature of character sets is the ability to negate character sets. Negated character sets say “match everything but these characters.” To negate a character set, use a caret (^) as the first character in the set:

const match = beer99.match(/[^\-0-9a-z.]/);

This will match only the whitespace in our original string (if we wanted to match only whitespace, there are better ways to do it, which we’ll learn about shortly).

Named Character Sets

Some character sets are so common—and so useful—that there are handy abbreviations for them:

Named character set

Equivalent

Notes

\d

[0-9]

\D

[^0-9]

\s

[ \t\v\n\r]

Includes tabs, spaces, and vertical tabs.

\S

[^ \t\v\n\r]

\w

[a-zA-Z_]

Note that dashes and periods are not included in this, making it unsuitable for things like domain names and CSS classes.

\W

[^a-zA-Z_]

Probably the most commonly used of these abbreviations is the whitespace set (\s). For example, whitespace is often used to line things up, but if you’re trying to parse it programmatically, you want to be able to account for different amounts of whitespace:

const stuff =

'hight: 9\n' +

'medium: 5\n' +

'low: 2\n';

const levels = stuff.match(/:\s*[0-9]/g);

(The * after the \s says “zero or more whitespace,” which we’ll learn about shortly.)

Don’t overlook the usefulness of the negated character classes (\D, \S, and \W); they represent a great way of getting rid of unwanted cruft. For example, it’s a great idea to normalize phone numbers before storing in a database. People have all kinds of fussy ways of entering phone numbers: dashes, periods, parentheses, and spaces. For searching, keying, and identification, wouldn’t it be nice if they were just 10-digit numbers? (Or longer if we’re talking about international phone numbers.) With \D, it’s easy:

const messyPhone = '(505) 555-1515';

const neatPhone = messyPhone.replace(/\D/g, '');

Similarly, I often use \S to make sure there’s data in required fields (they have to have at least one character that’s not whitespace):

const field = ' something ';

const valid = /\S/.test(field);

Repetition

Repetition metacharacters allow you to specify how many times something matches. Consider our earlier example where we were matching single digits. What if, instead, we wanted to match numbers (which may consist of multiple contiguous digits)? We could use what we already know and do something like this:

const match = beer99.match(/[0-9][0-9][0-9]|[0-9][0-9]|[0-9]/);

Notice how we again have to match the most specific strings (three-digit numbers) before we match less specific ones (two-digit numbers). This will work for one-, two-, and three-digit numbers, but when we add four-digit numbers, we’d have to add to our alternation. Fortunately, there is a better way:

const match = beer99.match(/[0-9]+/);

Note the + following the character group: this signals that the preceding element should match one or more times. “Preceding element” often trips up beginners. The repetition metacharacters are modifiers that modify what comes before them. They do not (and cannot) stand on their own. There are five repetition modifiers:

Repetition modifier

Description

Example

{n}

Exactly n.

/d{5}/ matches only five-digit numbers (such as a zip code).

{n,}

At least n.

/\d{5,}/ matches only five-digit numbers or longer.

{n, m}

At least n, at most m.

/\d{2,5}/ matches only numbers that are at least two digits, but no more than five.

?

Zero or one. Equivalent to {0,1}.

/[a-z]\d?/i matches letter followed by an optional digit.

*

Zero or more (sometimes called a “Klene star” or “Klene closure”).

/[a-z]\d*/i matches a letter followed by an optional number, possibly consisting of multiple digits.

+

One or more.

/[a-z]\d+/i matches a letter followed by a required number, possibly containing multiple digits.

The Period Metacharacter and Escaping

In regex, the period is a special character that means “match anything” (except newlines). Very often, this catch-all metacharacter is used to consume parts of the input that you don’t care about. Let’s consider an example where you’re looking for a single five-digit zip code, and then you don’t care about anything else on the rest of the line:

const input = "Address: 333 Main St., Anywhere, NY, 55532. Phone: 555-555-2525.";

const match = input.match(/\d{5}.*/);

You might find yourself commonly matching a literal period, such as the periods in a domain name or IP address. Likewise, you may often want to match things that are regex metacharacters, such as asterisks and parentheses. To escape any special regex character, simply prefix it with a backslash:

const equation = "(2 + 3.5) * 7";

const match = equation.match(/\(\d \+ \d\.\d\) \* \d/);

TIP

Many readers may have experience with filename globbing, or being able to use *.txt to search for “any text files.” The * here is a “wildcard” metacharacter, meaning it matches anything. If this is familiar to you, the use of * in regexes may confuse you, because it means something completely different, and cannot stand alone. The period in a regex is more closely related to the* in filename globbing, except that it only matches a single character instead of a whole string.

A True Wildcard

Because the period matches any character except newlines, how do you match any character including newlines? (This comes up more often than you might think.) There are lots of ways to do this, but probably the most common is [\s\S]. This matches everything that’s whitespace…and everything that’s not whitespace. In short, everything.

Grouping

So far, the constructs we’ve learned about allow us to identify single characters (repetition allows us to repeat that character match, but it’s still a single-character match). Grouping allows us to construct subexpressions, which can then be treated like a single unit.

In addition to being able to create subexpressions, grouping can also “capture” the results of the groups so you can use them later. This is the default, but there is a way to create a “noncapturing group,” which is how we’re going to start. If you have some regex experience already, this may be new to you, but I encourage you to use noncapturing groups by default; they have performance advantages, and if you don’t need to use the group results later, you should be using noncapturing groups. Groups are specified by parentheses, and noncapturing groups look like (?:<subexpression>), where <subexpression> is what you’re trying to match. Let’s look at some examples. Imagine you’re trying to match domain names, but only .com, .org, and .edu:

const text = "Visit oreilly.com today!";

const match = text.match(/[a-z]+(?:\.com|\.org|\.edu)/i);

Another advantage of groups is that you can apply repetition to them. Normally, repetition applies only to the single character to the left of the repetition metacharacter. Groups allow you to apply repetition to whole strings. Here’s a common example. If you want to match URLs, and you want to include URLs that start with http://, https://, and simply // (protocol-independent URLs), you can use a group with a zero-or-one (?) repetition:

const html = '<link rel="stylesheet" href="http://insecure.com/stuff.css">\n' +

'<link rel="stylesheet" href="https://secure.com/securestuff.css">\n' +

'<link rel="stylesheet" href="//anything.com/flexible.css">';

const matches = html.match(/(?:https?)?\/\/[a-z][a-z0-9-]+[a-z0-9]+/ig);

Look like alphabet soup to you? It does to me too. But there’s a lot of power packed into this example, and it’s worth your while to slow down and really consider it. We start off with a noncapturing group: (?:https?)?. Note there are two zero-or-one repetition metacharacters here. The first one says “the s is optional.” Remember that repetition characters normally refer only to the character to their immediate left. The second one refers to the whole group to its left. So taken all together, this will match the empty string (zero instances of https?), http, or https. Moving on, we match two slashes (note we have to escape them: \/\/). Then we get a rather complicated character class. Obviously domain names can have letters and numbers in them, but they can also have dashes (but they have to start with a letter, and they can’t end with a dash).

This example isn’t perfect. For example, it would match the URL //gotcha (no TLD) just as it would match //valid.com. However, to match completely valid URLs is a much more complicated task, and not necessary for this example.

NOTE

If you’re feeling a little fed up with all the caveats (“this will match invalid URLs”), remember that you don’t have to do everything all the time, all at once. As a matter of fact, I use a very similar regex to the previous one all the time when scanning websites. I just want to pull out all the URLs—or suspect URLs—and then do a second analysis pass to look for invalid URLs, broken URLs, and so on. Don’t get too caught up in making perfect regexes that cover every case imaginable. Not only is that sometimes impossible, but it is often unnecessary effort when it is possible. Obviously, there is a time and place to consider all the possibilities—for example, when you are screening user input to prevent injection attacks. In this case, you will want to take the extra care and make your regex ironclad.

Lazy Matches, Greedy Matches

What separates the regex dilettantes from the pros is understanding lazy versus greedy matching. Regular expressions, by default, are greedy, meaning they will match as much as possible before stopping. Consider this classic example.

You have some HTML, and you want to replace, for example, <i> text with <strong> text. Here’s our first attempt:

const input = "Regex pros know the difference between\n" +

"<i>greedy</i> and <i>lazy</i> matching.";

input.replace(/<i>(.*)<\/i>/ig, '<strong>$1</strong>');

The $1 in the replacement string will be replaced by the contents of the group (.*) in the regex (more on this later).

Go ahead and try it. You’ll find the following disappointing result:

"Regex pros know the difference between

<strong>greedy</i> and <i>lazy</strong> matching."

To understand what’s going on here, think back to how the regex engine works: it consumes input until it satisfies the match before moving on. By default, it does so in a greedy fashion: it finds the first <i> and then says, “I’m not going to stop until I see an </i> and I can’t find any more past that.” Because there are two instances of </i>, it ends at the second one, not the first.

There’s more than one way to fix this example, but because we’re talking about greedy versus lazy matching, we’ll solve it by making the repetition metacharacter (*) lazy instead. We do so by following it with a question mark:

input.replace(/<i>(.*?)<\/i>/ig, '<strong>$1</strong>');

The regex is exactly the same except for the question mark following the * metacharacter. Now the regex engine thinks about this regex this way: “I’m going to stop as soon as I see an </i>.” So it lazily stops matching every time it sees an </i> without scanning further to see if it could match later. While we normally have a negative association with the word lazy, that behavior is what we want in this case.

All of the repetition metacharacters—*, +, ?, {n}, {n,} and {n,m}—can be followed with a question mark to make them lazy (though in practice, I’ve only ever used it for * and +).

Backreferences

Grouping enables another technique called backreferences. In my experience, this is one of the least used regex features, but there is one instance where it comes in handy. Let’s consider a very silly example before we consider a truly useful one.

Imagine you want to match band names that follow the pattern XYYX (I bet you can think of a real band name that follows this pattern). So we want to match PJJP, GOOG, and ANNA. This is where backreferences come in. Each group (including subgroups) in a regex is assigned a number, from left to right, starting with 1. You can refer to that group in a regex with a backslash followed by a number. In other words, \1 means “whatever group #1 matched.” Confused? Let’s see the example:

const promo = "Opening for XAAX is the dynamic GOOG! At the box office now!";

const bands = promo.match(/(?:[A-Z])(?:[A-Z])\2\1/g);

Reading from left to right, we see there are two groups, then \2\1. So if the first group matches X and the second group matches A, then \2 must match A and \1 must match X.

If this sounds cool to you but not very useful, you’re not alone. The only time I think I have ever needed to use backreferences (other than solving puzzles) is matching quotation marks.

In HTML, you can use either single or double quotes for attribute values. This enables us to easily do things like this:

// we use backticks here because we're using single and

// double quotation marks:

const html = `<img alt='A "simple" example.'>` +

`<img alt="Don't abuse it!">`;

const matches = html.match(/<img alt=(?:['"]).*?\1/g);

Note that there’s some simplifying going on in this example; if the alt attribute didn’t come first, this wouldn’t work, nor would it if there were extra whitespace. We’ll see this example revisited later with these problems addressed.

Just as before, the first group will match either a single or double quote, followed by zero or more characters (note the question mark that makes the match lazy), followed by \1—which will be whatever the first match was, either a single quote or a double quote.

Let’s take a moment to reinforce our understanding of lazy versus greedy matching. Go ahead and remove the question mark after the *, making the match greedy. Run the expression again; what do you see? Do you understand why? This is a very important concept to understand if you want to master regular expressions, so if this is not clear to you, I encourage you to revisit the section on lazy versus greedy matching.

Replacing Groups

One of the benefits grouping brings is the ability to make more sophisticated replacements. Continuing with our HTML example, let’s say that we want to strip out everything but the href from an <a> tag:

let html = '<a class="nope" href="/yep">Yep</a>';

html = html.replace(/<a .*?(href=".*?").*?>/, '<a $1>');

Just as with backreferences, all groups are assigned a number starting with 1. In the regex itself, we refer to the first group with \1; in the replacement string, we use $1. Note the use of lazy quantifiers in this regex to prevent it from spanning multiple <a> tags. This regex will also fail if thehref attribute uses single quotes instead of double quotes.

Now we’ll extend the example. We want to preserve the class attribute and the href attribute, but nothing else:

let html = '<a class="yep" href="/yep" id="nope">Yep</a>';

html = html.replace(/<a .*?(class=".*?").*?(href=".*?").*?>/, '<a $2 $1>');

Note in this regex we reverse the order of class and href so that href always occurs first. The problem with this regex is that class and href always have to be in the same order and (as mentioned before) it will fail if we use single quotes instead of double. We’ll see an even more sophisticated solution in the next section.

In addition to $1, $2, and so on, there are also $‘ (everything before the match), $& (the match itself), and $’ (everything after the match). If you want to use a literal dollar sign, use $$:

const input = "One two three";

input.replace(/two/, '($`)'); // "One (One ) three"

input.replace(/\w+/g, '($&)'); // "(One) (two) (three)"

input.replace(/two/, "($')"); // "One ( three) three"

input.replace(/two/, "($$)"); // "One ($) three"

These replacement macros are often neglected, but I’ve seen them used in very clever solutions, so don’t forget about them!

Function Replacements

This is my favorite feature of regexes, which often allows you to break down a very complex regex into some simpler regexes.

Let’s consider again the practical example of modifying HTML elements. Imagine you’re writing a program that converts all <a> links into a very specific format: you want to preserve the class, id, and href attributes, but remove everything else. The problem is, your input is possibly messy. The attributes aren’t always present, and when they are, you can’t guarantee they’ll be in the same order. So you have to consider the following input variations (among many):

const html =

`<a class="foo" href="/foo" id="foo">Foo</a>\n` +

`<A href='/foo' Class="foo">Foo</a>\n` +

`<a href="/foo">Foo</a>\n` +

`<a onclick="javascript:alert('foo!')" href="/foo">Foo</a>`;

By now, you should be realizing that this is a daunting task to accomplish with a regex: there are just too many possible variations! However, we can significantly reduce the number of variations by breaking this up into two regexes: one to recognize <a> tags, and another to replace the contents of an <a> tag with only what you want.

Let’s consider the second problem first. If all you had was a single <a> tag, and you wanted to discard all attributes other than class, id, and href, the problem is easier. Even still, as we saw earlier, this can cause problems if we can’t guarantee the attributes come in a particular order. There are multiple ways to solve this problem, but we’ll use String.prototype.split so we can consider attributes one at a time:

function sanitizeATag(aTag) {

// get the parts of the tag...

const parts = aTag.match(/<a\s+(.*?)>(.*?)<\/a>/i);

// parts[1] are the attributes of the opening <a> tag

// parts[2] are what's between the <a> and </a> tags

const attributes = parts[1]

// then we split into individual attributes

.split(/\s+/);

return '<a ' + attributes

// we only want class, id, and href attributes

.filter(attr => /^(?:class|id|href)[\s=]/i.test(attr))

// joined by spaces

.join(' ')

// close the opening <a> tag

+ '>'

// add the contents

+ parts[2]

// and the closing tag

+ '</a>';

}

This function is longer than it needs to be, but we’ve broken it down for clarity. Note that even in this function, we’re using multiple regexes: one to match the parts of the <a> tag, one to do the split (using a regex to identify one or more whitespace characters), and one to filter only the attributes we want. It would be much more difficult to do all of this with a single regex.

Now for the interesting part: using sanitizeATag on a block of HTML that might contain many <a> tags, among other HTML. It’s easy enough to write a regex to match just the <a> tags:

html.match(/<a .*?>(.*?)<\/a>/ig);

But what do we do with it? As it happens, you can pass a function to String.prototype.replace as the replacement parameter. So far, we’ve only be using strings as the replacement parameter. Using a function allows you to take special action for each replacement. Before we finish our example, let’s use console.log to see how it works:

html.replace(/<a .*?>(.*?)<\/a>/ig, function(m, g1, offset) {

console.log(`<a> tag found at ${offset}. contents: ${g1}`);

});

The function you pass to String.prototype.replace receives the following arguments in order:

§ The entire matched string (equivalent to $&).

§ The matched groups (if any). There will be as many of these arguments as there are groups.

§ The offset of the match within the original string (a number).

§ The original string (rarely used).

The return value of the function is what gets replaced in the returned string. In the example we just considered, we weren’t returning anything, so undefined will be returned, converted into a string, and used as a replacement. The point of that example was the mechanics, not the actual replacement, so we simply discard the resultant string.

Now back to our example…. We already have our function to sanitize an individual <a> tag, and a way to find <a> tags in a block of HTML, so we can simply put them together:

html.replace(/<a .*?<\/a>/ig, function(m) {

return sanitizeATag(m);

});

We can simplify this even further—considering that the function sanitizeATag matches exactly what String.prototype.replace expects, we can rid ourselves of the anonymous function, and use sanitizeATag directly:

html.replace(/<a .*?<\/a>/ig, sanitizeATag);

Hopefully the power of this functionality is clear. Whenever you find yourself faced with a problem involving matching small strings within a bigger string, and you need to do processing on the smaller strings, remember that you can pass a function to String.prototype.replace!

Anchoring

Very often, you’ll care about things at the beginning or end of a string, or the entire string (as opposed to just a part of it). That’s where anchors come in. There are two anchors—^, which matches the beginning of the line, and $, which matches the end of the line:

const input = "It was the best of times, it was the worst of times";

const beginning = input.match(/^\w+/g); // "It"

const end = input.match(/\w+$/g); // "times"

const everything = input.match(/^.*$/g); // sames as input

const nomatch1 = input.match(/^best/ig);

const nomatch2 = input.match(/worst$/ig);

There’s one more nuance to anchors that you need to be aware of. Normally, they match the beginning and end of the whole string, even if you have newlines in it. If you want to treat a string as multiline (as separated by newlines), you need to use the m (multiline) option:

const input = "One line\nTwo lines\nThree lines\nFour";

const beginnings = input.match(/^\w+/mg); // ["One", "Two", "Three", "Four"]

const endings = input.match(/\w+$/mg); // ["line", "lines", "lines", "Four"]

Word Boundary Matching

One of the often-overlooked useful gems of regexes is word boundary matches. Like beginning and end-of-line anchors, the word boundary metacharacter, \b, and its inverse, \B, do not consume input. This can be a very handy property, as we’ll see shortly.

A word boundary is defined where a \w match is either preceded by or followed by a \W (nonword) character, or the beginning or end of the string. Imagine you’re trying to replace email addresses in English text with hyperlinks (for the purposes of this discussion, we’ll assume email addresses start with a letter and end with a letter). Think of the situations you have to consider:

const inputs = [

"john@doe.com", // nothing but the email

"john@doe.com is my email", // email at the beginning

"my email is john@doe.com", // email at the end

"use john@doe.com, my email", // email in the middle, with comma afterward

"my email:john@doe.com.", // email surrounded with punctuation

];

It’s a lot to consider, but all of these email addresses have one thing in common: they exist at word boundaries. The other advantage of word boundary markers is that, because they don’t consume input, we don’t need to worry about “putting them back” in the replacement string:

const emailMatcher =

/\b[a-z][a-z0-9._-]*@[a-z][a-z0-9_-]+\.[a-z]+(?:\.[a-z]+)?\b/ig;

inputs.map(s => s.replace(emailMatcher, '<a href="mailto:$&">$&</a>'));

// returns [

// "<a href="mailto:john@doe.com">john@doe.com</a>",

// "<a href="mailto:john@doe.com">john@doe.com</a> is my email",

// "my email is <a href="mailto:john@doe.com">john@doe.com</a>",

// "use <a href="mailto:john@doe.com">john@doe.com</a>, my email",

// "my email:<a href="mailto:john@doe.com>john@doe.com</a>.",

// ]

In addition to using word boundary markers, this regex is using a lot of the features we’ve covered in this chapter: it may seem daunting at first glance, but if you take the time to work through it, you’re well on your way to regex mastery (note especially that the replacement macro, $&, doesnot include the characters surrounding the email address…because they were not consumed).

Word boundaries are also handy when you’re trying to search for text that begins with, ends with, or contains another word. For example, /\bcount/ will find count and countdown, but not discount, recount, or accountable. /\bcount\B/ will only find countdown, /\Bcount\b/ will finddiscount and recount, and /\Bcount\B/ will only find accountable.

Lookaheads

If greedy versus lazy matching is what separates the dilettantes from the pros, lookaheads are what separate the pros from the gurus. Lookaheads—like anchor and word boundary metacharacters—don’t consume input. Unlike anchors and word boundaries, however, they are general purpose: you can match any subexpression without consuming it. As with word boundary metacharacters, the fact that lookaheads don’t match can save you from having to “put things back” in a replacement. While that can be a nice trick, it’s not required. Lookaheads are necessary whenever there is overlapping content, and they can simplify certain types of matching.

A classic example is validating that a password matches some policy. To keep it simple, let’s say our password must contain at least one uppercase letter, number, and lowercase letter, and no nonletter, nonnumber characters. We could, of course, use multiple regexes:

function validPassword(p) {

return /[A-Z]/.test(p) && // at least one uppercase letter

/[0-9]/.test(p) && // at least one number

/[a-z]/.test(p) && // at least one lowercase letters

!/[^a-zA-Z0-9]/.test(p); // only letters and numbers

}

Let’s say we want to combine that into one regular expression. Our first attempt fails:

function validPassword(p) {

return /[A-Z].*[0-9][a-z]/.test(p);

}

Not only does this require the capital letter to come before the numbers to come before the two lowercase letters, but we haven’t tested for the invalid characters at all. And there’s really no sensible way to do it, either, because characters are consumed as the regex is processed.

Lookaheads come to the rescue by not consuming input; essentially each lookahead is an independent regex that doesn’t consume any input. Lookaheads in JavaScript look like (?=<subexpression>). They also have a “negative lookahead”: (?!<subexpression>) will match only things that aren’t followed by the subexpression. Now we can write a single regex to validate our passwords:

function validPassword(p) {

return /(?=.*[A-Z])(?=.*[0-9])(?=.*[a-z])(?!.*[^a-zA-Z0-9])/.test(p);

}

You might be looking at this soup of letters and characters and thinking that our multiregex function is better—or at least easier to read. And in this example, I would probably agree. However, it demonstrates one of the important uses of lookaheads (and negative lookaheads). Lookaheads definitely fall into the category of “advanced regex,” but are important for solving certain problems.

Constructing Regexes Dynamically

We started off this chapter by saying that you should prefer the regex literal over the RegExp constructor. In addition to having to type four fewer letters, we prefer the regex literal because we don’t have to escape backslashes as we do in JavaScript strings. Where we do need to use theRegExp constructor is when we want to construct regexes dynamically. For example, you might have an array of usernames you want to match in a string; there’s no (sensible) way to get those usernames into a regex literal. This is where the RegExp constructor comes in, because it constructs the regex from a string—which can be constructed dynamically. Let’s consider this example:

const users = ["mary", "nick", "arthur", "sam", "yvette"];

const text = "User @arthur started the backup and 15:15, " +

"and @nick and @yvette restored it at 18:35.";

const userRegex = new RegExp(`@(?:${users.join('|')})\\b`, 'g');

text.match(userRegex); // [ "@arthur", "@nick", "@yvette" ]

The equivalent literal regex in this example would be /@(?:mary|nick|arthur|sam|yvette)\b/g, but we’ve managed to construct it dynamically. Note that we have to use double backslashes before the b (word boundary metacharacter); the first backslash is to escape the second backslash in the string.

Conclusion

While this chapter has touched on the major points of regexes, it only scratches the surface of the techniques, examples, and complexities inherent in regexes. Becoming proficient at regexes is about 20% understanding the theory, and 80% practice. Using a robust regex tester (such as regular expressions 101) can be very helpful when you’re starting out (and even when you’re experienced!). The most important thing you can take away from this chapter is an understanding of how a regex engine consumes input; a lack of understanding in that area is the source of a lot of frustration.