Special Characters - JavaScript Regular Expressions (2015)

Chapter 3. Special Characters

In this chapter, we will be taking a look at some special characters and some more advanced techniques that will help us create more detailed Regex patterns. We will also slowly transition from using our Regex testing environment, and go back to using standard JavaScript to build more complete real-world examples.

Before we get ahead of ourselves, there are still a couple things we can learn using our current setup, starting with some constraints.

In this chapter, we will cover the following topics:

· Defining boundaries for a Regex

· Defining nongreedy quantifiers

· Defining Regex with groups

Nonvisual constraints

Until now, all the constraints we have been putting on our patterns had to do with characters that could or couldn't be displayed, but Regex provides a number of positional constraints, which allow you to filter out some false positives.

Matching the beginning and end of an input

The first such set is the start-of-string and end-of-string matchers. Using the caret character (^) to match the start of a string and the dollar sign ($) to match the end, we can force a pattern to be positioned at these locations. For example, you can add the dollar sign at the end of a word to make sure that it is the last thing in the provided string. In the next example, I used the /^word|word$/g pattern to match an occurrence of word that either starts or ends a string. The following image shows what the regular expression matches when given a text input:

[Figure: Matching the beginning and end of an input]

Using both the start and end characters together ensures that your pattern is the only thing in the string. For example, if you have a /world/ pattern, it will match the world string as well as any other string that merely contains world, such as hello world. However, if you want to make sure that the string contains only world, you can modify the pattern to be /^world$/. This means that Regex will attempt to find a match that both begins the string and ends it. This, of course, will only happen if it is the only thing in the string.

This is the default behavior, but it is worth mentioning that this isn't always the case. In the previous chapter, we saw the m (multiline) flag. This flag makes the caret character match not only the beginning of the string but also the beginning of every line. The same goes for the dollar sign: it will match the end of each line in addition to the end of the entire string. So, it really comes down to what you need in a given situation.
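As a minimal sketch of the behavior described above, the following snippet compares the anchored pattern with an unanchored input and shows the effect of the m flag. The sample strings and variable names are made up for illustration.

```javascript
// Sample inputs are illustrative assumptions, not the book's own test text.
var exact = /^world$/;
var matchesAlone = exact.test("world");          // the pattern is the whole string
var matchesEmbedded = exact.test("hello world"); // extra text precedes the pattern

var lines = "word one\ntwo word\nword";

// Without the m flag, ^ only matches at the very start of the string.
var startOnly = lines.match(/^word/g);

// With the m flag, ^ also matches right after each line break.
var everyLine = lines.match(/^word/gm);

console.log(matchesAlone, matchesEmbedded);      // true false
console.log(startOnly.length, everyLine.length); // 1 2
```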

Matching word boundaries

Word boundaries are very similar to the string boundaries we just saw, except that they work in the context of a single word. For example, suppose we want to match can, but only can on its own, and not the can in candy. As we saw in earlier examples, if you just type a pattern such as /can/g, you will get matches for can even if it's part of another word, for example, when the user typed candy. Using the \b character, we can denote a word boundary (either at the beginning or at the end of a word), so we can fix this problem using a pattern similar to /\bcan\b/g, as shown here:

[Figure: Matching word boundaries]
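The difference can be sketched in a console as follows. The sample sentence is an assumption chosen so that can appears both on its own and inside candy.

```javascript
// Hypothetical sample text: "can" appears twice alone and once inside "candy".
var text = "can you open a can of candy";

var loose = text.match(/can/g);      // also picks up the "can" in "candy"
var strict = text.match(/\bcan\b/g); // only whole-word occurrences

console.log(loose.length, strict.length); // 3 2
```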

Matching nonword boundaries

Paired with the \b character, we have the \B symbol, which is its inverse. As we have seen on multiple occasions, a capital symbol usually refers to the opposite functionality, and this one is no exception. The uppercase version puts a constraint on the pattern that prevents it from being at the edge of a word. Now, we'll run the same example text, except with /can\B/g, which will swap the matches. A standalone can is no longer matched because its n sits at a word boundary, whereas the can inside candy is matched because it is followed by another word character:

[Figure: Matching nonword boundaries]
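A small sketch of the inverse matcher, using an assumed sample sentence in which can occurs both alone and inside candy:

```javascript
// Hypothetical sample text, not the book's original input.
var text = "can you open a can of candy";

// \B requires that the position after "can" is NOT a word boundary,
// so only the "can" that continues into "candy" qualifies.
var found = text.match(/can\B/g);
var position = text.search(/can\B/);

console.log(found.length);                        // 1
console.log(position === text.indexOf("candy"));  // true
```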

Matching a whitespace character

You can match a whitespace character using the \s (backslash s) character, which matches things such as spaces and tabs. It is similar to a word boundary, but it does have some distinctions. First of all, a word boundary matches the end of a word even if it is the last word in a string, unlike the whitespace character, which would require an extra space. So, /foo\b/ would match foo. However, /foo\s/ would not, because there is no whitespace character following foo at the end of the string. Another difference is that a boundary matcher will count something such as a period or dash as an actual boundary, whereas the whitespace character will only match if there is an actual whitespace character:

[Figure: Matching a whitespace character]

Note

It's worth mentioning that the whitespace character has an \S inverse matcher, which will match anything but a whitespace character.
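The distinctions between \b, \s, and \S can be checked directly. This is only a sketch; the inputs other than foo are assumptions added for illustration.

```javascript
// Each pair is [result with \b, result with \s] for the same input.
var endOfString = [/foo\b/.test("foo"), /foo\s/.test("foo")];
var beforeDash = [/foo\b/.test("foo-bar"), /foo\s/.test("foo-bar")];
var beforeSpace = [/foo\b/.test("foo bar"), /foo\s/.test("foo bar")];

// \s also matches a tab, and \S+ collects runs of non-whitespace characters.
var beforeTab = /foo\s/.test("foo\tbar");
var chunks = "a b\tc".match(/\S+/g);

console.log(endOfString); // [ true, false ]
console.log(beforeDash);  // [ true, false ]
console.log(beforeSpace); // [ true, true ]
console.log(beforeTab, chunks.length); // true 3
```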

Defining nongreedy quantifiers

In the previous section, we had a look at multipliers, where you can specify that a pattern should be repeated a certain number of times. By default, JavaScript will try to match the largest number of characters possible, which means that it will be a greedy match. Let's say we have a pattern such as /\d{1,4}/, which matches between one and four digits. By default, if we use 124582948, it will return 1245, as it will take the maximum number of characters allowed (the greedy approach). However, if we want, we can add the question mark operator (?) after the multiplier to tell JavaScript not to use greedy matching and instead return the minimum number of characters possible:

[Figure: Defining nongreedy quantifiers]
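The same comparison can be reproduced in a console. This sketch uses the input from the text; the variable names are illustrative.

```javascript
var input = "124582948";

// Greedy: take as many digits as the multiplier allows (up to four).
var greedy = /\d{1,4}/.exec(input)[0];

// Nongreedy: the trailing ? makes the multiplier stop at the minimum (one digit).
var lazy = /\d{1,4}?/.exec(input)[0];

// With the g flag, the greedy version splits the input into chunks of up to four.
var greedyChunks = input.match(/\d{1,4}/g);

console.log(greedy, lazy);           // 1245 1
console.log(greedyChunks.join(" ")); // 1245 8294 8
```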

Greedy matching can introduce bugs that are difficult to find in your code. Consider the following example text:

<div class="container" id="main">
    Site content
</div>

If we wanted to extract the class, we might think of writing a pattern in this way:

/class=".*"/

The problem here is that the * character will attempt to match as many characters as possible, so instead of getting class="container" like we wanted, we would get class="container" id="main". Since the dot character will match anything (except a line break), the regular expression will match from the first quotation mark after class= all the way to the last quotation mark on the line, which is the one that closes the id attribute. To fix this, we can use the nongreedy question mark and change the pattern to /class=".*?"/. This will cause it to stop at the minimum required match, which is when we reach the next quotation mark:

[Figure: Defining nongreedy quantifiers]
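Here is a short sketch of both patterns applied to the opening tag from the example text. The variable names are assumptions.

```javascript
var html = '<div class="container" id="main">';

// Greedy: .* runs to the LAST quotation mark on the line.
var greedyMatch = /class=".*"/.exec(html)[0];

// Nongreedy: .*? stops at the first quotation mark that lets the pattern finish.
var lazyMatch = /class=".*?"/.exec(html)[0];

console.log(greedyMatch); // class="container" id="main"
console.log(lazyMatch);   // class="container"
```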

Matching groups in Regex

The last main topic that I have left out until now is groups. However, in order to work with groups, we have to move back into a JavaScript console, as this will provide the actual results object that we will need to look at.

Groups are how we extract data from the input provided. Without groups, you can check whether there is a match, or whether a given input text follows a specific pattern. However, you can't take advantage of a loosely defined pattern to pull out the relevant content. The syntax is fairly simple: you wrap the part of the pattern you want inside parentheses, and this part of the expression will then be extracted into its own element of the results.

Grouping characters together to create a clause

Let's start with something basic: a person's name. In standard JavaScript, if you had a string with someone's name, you would probably split it by the space character and check whether there are two or three components in it. If there are two, the first would be the first name and the second would be the last name; however, if there are three components, then the second component would be the middle name and the third would be the last name.

Instead of imposing a condition like this, we can create a simple pattern as shown:

/(\S+) (\S*) ?\b(\S+)/

The first group contains a mandatory non-space word. The plus sign will again repeat the pattern indefinitely (one or more times). Next, we want a space followed by a second word; this time, I've used the asterisk to denote that it could be of length zero. After this, we have another space, though this time it's optional.

Note

If there is no middle name, there won't be a second space, so the optional space is skipped and the pattern continues with the word boundary. The space is optional, but the \b still makes sure that the final group starts at the beginning of a new word.

Now, open up a JavaScript console (in Chrome) and create a variable for this pattern:

var pattern = /(\S+) (\S*) ?\b(\S+)/

Then, try running the exec method on this pattern with different names, with and without a middle name, and take a look at the resulting output:

[Figure: Grouping characters together to create a clause]
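For reference, this is roughly what such a console session looks like. The names are made-up sample inputs, and the arrays are converted with slice only to make them easy to print.

```javascript
var pattern = /(\S+) (\S*) ?\b(\S+)/;

// Hypothetical sample names, with and without a middle name.
var withMiddle = pattern.exec("John Michael Smith");
var withoutMiddle = pattern.exec("John Smith");

// exec returns null when the pattern does not match at all,
// which is the case for a single word with no space in it.
var singleWord = pattern.exec("John");

console.log(withMiddle.slice(0));    // [ 'John Michael Smith', 'John', 'Michael', 'Smith' ]
console.log(withoutMiddle.slice(0)); // [ 'John Smith', 'John', '', 'Smith' ]
console.log(singleWord);             // null
```

Because exec can return null, it is worth checking the result before reading the group indexes.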

Whether the string has a middle name or not, the result will have the three groups, which we can assign to variables. Therefore, instead of writing this:

var res = name.split(" ");
var first_name = res[0];
var middle_name, last_name;
if (res.length == 2) {
    middle_name = "";
    last_name = res[1];
} else {
    middle_name = res[1];
    last_name = res[2];
}

We can remove the conditional statements (if-else) from the preceding code and write something similar to this:

var res = /(\S+) (\S*) ?\b(\S+)/.exec(name);
var first_name = res[1];
var middle_name = res[2];
var last_name = res[3];

If the middle name is left out, our result will still have the group; it will just be an empty string.

Another thing worth mentioning is that the indexes of the groups start at 1, so the first group is at index 1 of the result, and index 0 holds the entire match.

Capture and noncapture groups

In the first chapter, we saw an example where we wanted to parse some kind of XML tokens, and we said that we needed an extra constraint where the closing tag had to match the opening tag for it to be valid. So, for example, this should be parsed:

<duration>5 Minutes</duration>

However, this should not be parsed:

<duration>5 Minutes</title>

This is because the closing tag doesn't match the opening tag. The way to reference previous groups in your pattern is by using a backslash character followed by the group's index number. As an example, let's write a small script that will accept a line-delimited series of XML tags and then convert it into a JavaScript object.

To start with, let's create an input string:

var xml = [
    "<title>File.js</title>",
    "<size>36 KB</size>",
    "<language>JavaScript</language>",
    "<modified>5 Minutes</name>"
].join("\n");

Here, we have four properties, but the last property does not have a valid closing tag, so it should not be picked up. Next, we will cycle through the lines of this string and set the properties of a data object:

var data = {};
xml.split("\n").forEach(function(line){
    // Declare match locally so that it does not leak into the global scope
    var match = /<(\w+)>([^<]*)<\/\1>/.exec(line);
    if (match) {
        var tag = match[1];
        data[tag] = match[2];
    }
});

If we output data in a console, we will see that we do, in fact, get three valid properties:

[Figure: Capture and noncapture groups]
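The whole script can be put together as a self-contained sketch. The skipped array is an addition for illustration only; it simply records the lines that the back reference rejects.

```javascript
var xml = [
  "<title>File.js</title>",
  "<size>36 KB</size>",
  "<language>JavaScript</language>",
  "<modified>5 Minutes</name>"
].join("\n");

var data = {};
var skipped = []; // hypothetical helper: lines whose closing tag did not match

xml.split("\n").forEach(function (line) {
  // \1 must repeat exactly the tag name captured by the first group.
  var match = /<(\w+)>([^<]*)<\/\1>/.exec(line);
  if (match) {
    data[match[1]] = match[2];
  } else {
    skipped.push(line);
  }
});

console.log(JSON.stringify(data));
// {"title":"File.js","size":"36 KB","language":"JavaScript"}
console.log(skipped.length); // 1
```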

However, let's take a moment to examine the pattern. We look for an opening tag with a name inside it, and we then pick up all the characters except for an opening angle bracket using a negated range. After this, we look for a closing tag using a back reference (\1) to make sure it matches the opening tag. You may have also noticed that we needed to escape the forward slash, so that JavaScript wouldn't think we were closing the Regex pattern.

Note

A back reference allows you to refer back to a sub-pattern (group) that appears earlier in the same pattern, so that the text matched by the sub-pattern is remembered and used as part of the matching. For example, /(no)\1/ matches nono in nono. Here, \1 stands for the text matched by the first sub-pattern, (no), so the pattern effectively becomes /nono/.
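One detail that is easy to miss is that a back reference repeats the text that the group actually matched, not the group's pattern. The following sketch illustrates this; the inputs other than nono are assumptions.

```javascript
// The example from the note: \1 must repeat the captured "no".
var nono = /(no)\1/.exec("nono");

// (\w)\1 needs the SAME character twice, not just any two word characters.
var sameChar = /(\w)\1/.test("aa");
var differentChars = /(\w)\1/.test("ab");
var doubled = /(\w)\1/.exec("hello")[0];

console.log(nono[0], nono[1]);         // nono no
console.log(sameChar, differentChars); // true false
console.log(doubled);                  // ll
```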

All the groups we have seen so far have been capture groups, and they tell Regex to extract this portion of the pattern into its own variable. However, there are other kinds of groups, or other uses for parentheses, that provide even more functionality. The first of these is a noncapture group.

Matching non capture groups

A noncapture group groups a part of a pattern, but it does not extract this data into the results array, and it cannot be used in back references. One benefit of this is that it allows you to apply multipliers to full sections of your pattern. For example, if we want a pattern that repeats world indefinitely, we can write it like this:

/(?:world)*/

This will match world as well as worldworldworld and so on (and, because the asterisk allows zero repetitions, the empty string too). The syntax for a noncapture group is similar to a standard group, except that you start it with a question mark and a colon (?:). Grouping it allows us to treat the entire thing as a single object and use modifiers that usually only work on individual characters.
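To see the difference in the results array, we can compare a noncapture group with a regular capture group. This is only a sketch; the ^ and $ anchors are added here so that the whole input has to consist of repetitions.

```javascript
var repeated = "worldworldworld";

// Noncapture group: only the full match (index 0) is returned.
var nonCapturing = /^(?:world)*$/.exec(repeated);

// Capture group: index 1 holds the last repetition that was matched.
var capturing = /^(world)*$/.exec(repeated);

// Without anchors, * also allows zero repetitions, so this matches an empty string.
var emptyMatch = /(?:world)*/.exec("hello")[0];

console.log(nonCapturing.length);            // 1
console.log(capturing.length, capturing[1]); // 2 world
console.log(emptyMatch === "");              // true
```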

The other most common use for noncapture groups (which can be done in capture groups as well) works in conjunction with a pipe character. A pipe character allows you to insert multiple options one after the other inside your pattern, for example, in a situation where we want to match either yes or no, we can create this pattern:

/yes|no/

Most of the time, though, this set of options will only be a small piece of your pattern. For example, if we are parsing log messages, we may want to extract the log level and the message. The log level can be one of only a few options (such as debug, info, error, and so on), but the message will always be there. Note that the square brackets have to be escaped, because an unescaped [ starts a character range. Now, instead of writing a pattern like this one:

/\[info\] - .*|\[debug\] - .*|\[error\] - .*/

We can extract the common part into its own noncapture group:

/\[(?:info|debug|error)\] - .*/

By doing this we remove a lot of the duplicate code.
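A quick sketch of the log pattern in use is shown below. The log lines are invented samples, the ^ anchor is an added assumption, and the literal square brackets are escaped because an unescaped [ starts a character range.

```javascript
// Literal brackets must be escaped; (?:...) groups the alternatives without capturing.
var logLine = /^\[(?:info|debug|error)\] - .*/;

var known = logLine.test("[error] - Disk full");      // a listed level
var unknown = logLine.test("[trace] - Entering loop"); // "trace" is not in the list

// Pitfall: unescaped, [info] is a character range matching ONE of i, n, f, o.
var pitfall = /[info] - .*/.test("foo - bar");

// If the level and message are needed separately, capture groups can be used instead.
var parts = /^\[(info|debug|error)\] - (.*)/.exec("[error] - Disk full");

console.log(known, unknown, pitfall); // true false true
console.log(parts[1], "|", parts[2]); // error | Disk full
```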

Matching lookahead groups

The last set of groups you can have in your code is lookahead groups. These groups allow us to set a constraint on a pattern without including this constraint in the actual match. With noncapture groups, JavaScript will not create a special index for a section, although it will include it in the full result (the result's first element). With lookahead groups, we want to be able to make sure there is or isn't some text after our match, but we don't want this text in the results.

For example, let's say we have some input text and we want to parse out all .com domain names. We might not necessarily want .com in the match, just the actual domain name. In this case, we can create this pattern:

/\w+(?=\.com)/g

The group starting with ?= means that we want this text to follow our pattern, but we don't actually want to include it in the match. We also have to escape the period, since it is a special character. Now, we can use this pattern to extract the domains:

text.match(/\w+(?=\.com)/g)

We can assume that we have a variable text similar to this:

[Figure: Matching lookahead groups]
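As a sketch, here is the pattern applied to a made-up text value. The sentence and domain names are assumptions for illustration only.

```javascript
// Hypothetical input containing two .com domains and one .org domain.
var text = "visit example.com and test.org or packtpub.com today";

// \w+ is matched only where ".com" follows, and ".com" itself is left out.
var domains = text.match(/\w+(?=\.com)/g);

console.log(domains); // [ 'example', 'packtpub' ]
```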

Using a negative lookahead

Finally, if we wanted to use a negative lookahead, that is, a lookahead group that makes sure that the included text does not follow the pattern, we can simply use an exclamation point instead of an equals sign:

var text = "Mr. Smith & Mrs. Doe";

text.match(/\w+(?!\.)\b/g);

This will match all the words that do not end in a period, that is, it will pull out the names from this text:

[Figure: Using a negative lookahead]
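Running the snippet confirms the result, and it also shows why the trailing \b matters. The second pattern below drops the \b purely as an illustration: without it, the engine can backtrack and accept part of a title.

```javascript
var text = "Mr. Smith & Mrs. Doe";

// Words that are not immediately followed by a period.
var names = text.match(/\w+(?!\.)\b/g);

// Without \b, backtracking lets "M" (from "Mr.") and "Mr" (from "Mrs.") slip through.
var withoutBoundary = text.match(/\w+(?!\.)/g);

console.log(names);           // [ 'Smith', 'Doe' ]
console.log(withoutBoundary); // [ 'M', 'Smith', 'Mr', 'Doe' ]
```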

Summary

In this chapter, we learned how to work with greedy and nongreedy matches. We also learned how to use groups to create more complex regular expressions. While learning how to group a Regex, we also learned about capturing groups, non-capturing groups, and lookahead groups.

In the next chapter, we will implement everything we've learned so far in this book and create a real-world example to match and validate information inputted by a user.