Getting Started with Regular Expressions - Introduction to Regular Expressions in SAS (2014)

Chapter 2. Getting Started with Regular Expressions

2.1 Introduction

This chapter focuses entirely on developing your understanding of regular expressions (RegEx) before getting into the details of using them in SAS. We will begin actually implementing RegEx with SAS in Chapter 3. It is a natural inclination to jump right into the SAS code behind all of this. However, RegEx patterns are fundamental to making the SAS coding elements useful. Without going through the RegEx first, the forthcoming SAS functions and calls could be discussed only at a very theoretical level, which is the opposite of what I am trying to accomplish in this book. Also, trying to learn too many different elements of any process at the same time can simply be overwhelming.

To facilitate the mission of this book—practical application—without becoming overwhelmed by too much information at one time (new functions, calls, and expressions), there is a very short bit of test code to use with the RegEx examples throughout the chapter. I want to stress the point that obtaining a thorough understanding of RegEx syntax is critical for harnessing the full power of this incredible capability in SAS.

RegEx consist of letters, numbers, metacharacters, and special characters, which form patterns. In order for SAS to properly interpret these patterns, all RegEx values must be encapsulated by delimiter pairs—I use the forward slash, /, throughout the text. (Refer to the test code). They act as the container for our patterns. So, all RegEx patterns that we create will look something like this: /pattern/.

For example, suppose we want to match the string of characters “Street” in an address. The pattern would look like /Street/. But we are clearly interested in doing more with RegEx than just searching for strings. So, the remainder of this chapter explores the various RegEx elements that we can insert into / / to develop rich capabilities.


Before going any farther, I should clarify some upcoming terminology. Metacharacter is a term used quite frequently in this book, so I need to be clear as to what it actually means. A metacharacter is a character or set of characters used by a programming language like SAS for something other than its literal meaning. For example, \s represents a whitespace character in RegEx patterns, rather than just being a \ and the letter “s” collocated in the text. We begin our discussion of specific metacharacters in Section 2.3.

All nonliteral RegEx elements are some kind of metacharacter. It is good to keep this distinction clear, as I also make references to character when I want to discuss the actual string values or the results of metacharacter use.

Special Character

A special character is one of a limited set of ASCII characters that affects the structure and behavior of RegEx patterns. For example, opening and closing parentheses, ( and ), are used to create logical groups of characters or metacharacters in RegEx patterns. These are discussed thoroughly in Section 2.2.

RegEx Pattern Processing

At this juncture, it is also important to clarify how RegEx are processed by SAS. SAS reads each pattern from left to right in sequential chunks, matching each element (character or metacharacter) of the pattern in succession. If we want to match the string “hello”, SAS searches until the first match of the letter “h” is found. Then, SAS determines whether the letter “e” immediately follows, and so on until the entire string is found. Below is some pseudo code for this process, for which the logic is true even after we begin replacing characters with metacharacters (it would simply look more impressive).

Pseudo Code for Pattern Matching Process







In this pseudo code, we see the START tag is our initiation of the algorithm, and the END tag denotes the termination of the algorithm. Meanwhile, the NEXT tag tells us when to skip to the next line of pseudo code, and the GOTO tag tells us to jump to a specified line in the pseudo code. The POS tag denotes the character position. We also have the usual IF, THEN, and ELSE logical tags in the code.

Again, this example demonstrates the search for “hello” in some text source. The algorithm initiates by testing whether the first character position is an “h”. If it is not true, then the algorithm increments the character position by one—and tests for “h” again. If the first position is an “h”, the character position is incremented, and the code tests for the letter “e”. This continues until the word “hello” is found.
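
The search described above can be sketched in ordinary DATA step code. This is only an illustration of the left-to-right logic under simplifying assumptions, not how SAS implements RegEx matching internally. The variable names (text, target, start, found) and the sample string are hypothetical and chosen for this sketch.

```sas
/* Sketch of the left-to-right scan for "hello" (illustrative only) */
data _null_;
   length text $50 target $5;
   text   = 'well, hello there';   /* hypothetical source text */
   target = 'hello';
   found  = 0;
   /* Try each starting position in turn until a match is found */
   do start = 1 to length(text) - length(target) + 1 while (found = 0);
      matched = 1;
      /* Compare one character at a time; stop at the first mismatch */
      do i = 1 to length(target) while (matched = 1);
         if substr(text, start + i - 1, 1) ne substr(target, i, 1) then matched = 0;
      end;
      if matched = 1 then found = start;
   end;
   put found=;
run;
```

With this sample string, the log shows found=7, which is the position of the “h” in “hello”.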

2.1.1 RegEx Test Code

The following code snippet enables you to quickly test new RegEx concepts as we go through the chapter. As you learn new RegEx metacharacters, options, and so on, you can edit this code in an effort to test the functionality. Also, more interesting data can be introduced by editing the datalines portion of the code. However, because we haven’t yet discussed the details of how the pieces work, I discourage making edits outside the marked places in the code in order to avoid unforeseen errors arising at run time.

To keep things simple, we are using the DATALINES statement to define our data source and print the source string and the matched portion to the log. This should make it easier to follow what each new metacharacter is doing as we go through the text. Notice that everything is contained in a single DATA step, which does not generate a resulting data set (we are using _NULL_). The first line of our code is an IF statement that tests for the first record of our data set. The RegEx pattern is created only if we have encountered the first record in the data set, and is retained using the RETAIN statement. Afterward, the pattern reference identifier is reused by our code due to the RETAIN statement. Next, we pull in the data lines using the INPUT statement that assumes 50-character strings. Don’t worry about the details of the CALL routine on the next line for now. We start writing SAS code in Chapter 3.

Essentially, the CALL routine inside the RegEx Testing Framework code shown below uses the RegEx pattern to find only the first matching occurrence of our pattern on each line of the datalines data. Finally, we use another IF statement to determine whether we found a pattern match. If we did, the code prints the results to the SAS log.

/*RegEx Testing Framework*/
data _NULL_;
   if _N_=1 then
   do;
      retain pattern_ID;
      pattern="/METACHARACTERS AND CHARACTERS GO HERE/"; /*<--Edit the pattern*/
      pattern_ID=prxparse(pattern); /*Compile the pattern once, on the first record*/
   end;
   input some_data $50.;
   call prxsubstr(pattern_ID, some_data, position, length);
   if position ^= 0 then
   do;
      match=substr(some_data, position, length);
      put match:$QUOTE. "found in " some_data:$QUOTE.;
   end;
   datalines;
Smith, BOB A.
ROBERT Allen Smith
Smithe, Cindy
103 Pennsylvania Ave. NW, Washington, DC 20216
508 First St. NW, Washington, DC 20001
650 1st St NE, Washington, DC 20002
3000 K Street NW, Washington, DC 20007
1560 Wilson Blvd, Arlington, VA 22209
1(800) 789-1234
;
run;
Note: I have provided a jumble of data in the datalines portion of the code above. However, feel free to edit the data lines to thoroughly test each metacharacter as we go through this chapter.

Figure 2.1 shows an example of the SAS log output provided by the previous code. For this example, I used merely the character string /Street/ for the pattern in order to create the output.

Figure 2.1: Example Output where, pattern=/Street/


The remaining information in this chapter provides a solid foundation for building robust, complex patterns in the future. Each element discussed is an independently useful building block for sophisticated text manipulation and analysis capabilities. Once we begin to combine these basic elements, we will create some very powerful analytic tools.

2.2 Special Characters

In addition to / (the forward slash), the characters ( ) | and \ (the backslash) are special and are thus treated differently than the RegEx metacharacters to be discussed later. Since some of these special characters are so fundamental to the structure of the RegEx pattern construction, we need to briefly discuss them first.

( )

The two parentheses create logical groups of pattern characters and metacharacters—the same way they work in SAS code for logic operations. It is important to create logical groupings in order to construct more sophisticated patterns. Nesting the parentheses is also possible.


|

The vertical bar represents a logical OR (much like in SAS). Again, the proper use of this element creates more sophisticated patterns. We will explore some interesting ways to use this character, starting with the example in Table 2.1. It is important to remember that the items in an OR condition are tried from left to right, so the first item is always tested before the pattern moves on to the next one.


\

The backslash is a tricky one as it has a couple of uses. It is used as an integral component of many other metacharacters (examples abound in Section 2.3). Think about it as an initiator that tells SAS, “Hey, this is a metacharacter, not just some letter.” But that’s not all it does. Since the special characters defined above also appear in text that we might want to process, the backslash also acts as a blocker that tells SAS, “Hey, treat this special character as just a regular character.” By using \, we can create patterns that include parentheses, vertical bars, backslashes, forward slashes, and more; we simply add a \ in front of each occurrence of all the special characters that we want to treat as characters. For example, if we want our pattern to include open and closed parentheses respectively, the pattern would contain \( \).

Since you haven’t learned any RegEx metacharacters yet, let’s revisit strings using some of these new concepts. Notice that we can already start to match useful patterns with the characters and special characters.

Table 2.1: Examples using (), |, and \




“Cat” “cat”


“cat” “mouse”


“Street” “street” “Road” “road”

/\(This\)|\(That\)/

“(This)” or “(That)”

Note: In Perl parlance, \ is known as an escape character. To avoid any unnecessary confusion, we will dispense with this lingo and just refer to it as the backslash. However, be prepared to see that term used quite a bit in the Perl literature and on community websites.
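
As a quick illustration of grouping, OR, and the backslash, the sketch below uses the PRXMATCH function, which returns the position where a match begins, or 0 when there is no match. The patterns and sample strings are hypothetical and chosen only for illustration; they are not the entries of Table 2.1.

```sas
/* Sketch only: patterns and strings are illustrative assumptions */
data _null_;
   /* (S|s)treet: the group limits the OR to the first letter */
   p1 = prxmatch('/(S|s)treet/', '3000 K street NW');
   /* \( and \): the backslash makes the parentheses literal characters */
   p2 = prxmatch('/\(This\)|\(That\)/', 'Choose (That) one');
   /* Without the backslashes, ( ) only group, so "This" alone matches */
   p3 = prxmatch('/(This)|(That)/', 'This one');
   put p1= p2= p3=;
run;
```

Any nonzero value in the log means the pattern was found in the corresponding string.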

Now, there are some additional special characters that also need the backslash in front of them in order to be matched as normal characters. They are: { } [ ] ^ $ . * + and ?. All these characters are reserved and are thus treated differently, because they each have a special purpose and meaning in the world of RegEx. Since each one is defined and discussed at length in Sections 2.4 and 2.5, we will not discuss them further here. For now, just remember that they can’t be used as part of pattern strings without the backslash immediately preceding them. Table 2.2 shows a few examples of how to use them as normal characters.

Table 2.2: Examples using { } [ ] ^ $ . * +



Pattern                              Matches

/\$1\.00 \+ \$0\.50 = \$1\.50/       “$1.00 + $0.50 = $1.50”

/2\*3 = 6/                           “2*3 = 6”





Note: Notice that = and , match as characters (i.e., without a backslash) because they are not considered special characters.

2.3 Basic Metacharacters

As you write RegEx patterns in the future, you will find yourself using most of the metacharacters discussed in this section frequently because they are fundamental elements of RegEx pattern creation. Now, we can already build some useful patterns with the information discussed in Section 2.1. However, the metacharacters in this section create the greatest return on time investment due to how flexible and powerful they can make RegEx patterns.

Notice as we go through the examples how we can obtain some unexpected results. It is important to be very strategic when using some of these RegEx metacharacters as you don’t always know what to expect in the text that you are processing. Even when you know the source quite well, there are inevitably errors or unknown changes that can wreck a poorly designed pattern. So, like any good analyst, you need to be thinking a few steps ahead in order to maintain robust RegEx code.

Note: Unlike SAS, all RegEx metacharacters are case sensitive, as you will see shortly. If a letter is defined here as lowercase or uppercase, then it MUST be used that way. Otherwise, your programs will do something very different from what you expect. In other words, even though you can be lazy with capitalization when writing SAS code (e.g., DATA vs. data), the same is not true here.

2.3.1 Wildcard

The wildcard metacharacter, which is a period (.), matches any single character value, except for a newline character (\n). The ability to match virtually any single character will prove useful when you are searching for the superset of associated character strings. You might also want to use it when you have no idea what values might be in a particular character position. Table 2.3 provides examples.

Table 2.3: Examples using .




“Ran” “Run” “R+n” “R n” “R(n” “Ron” …


“Fun” “fun” “Run” “run” “bun” “(un” “-un” …


“Street.” “Street,” “Streets” “Street+” “Street_”…

Note: The period matches anything except the newline character (\n), including a literal period. This can be helpful, but must be used wisely. Also note that the period itself never matches the newline character; a newline has to be matched explicitly, for example with \n.
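
The difference between the wildcard and an escaped period can be checked with a few PRXMATCH calls. This is a minimal sketch; the patterns and sample strings are hypothetical and chosen only for illustration.

```sas
/* Sketch only: patterns and strings are illustrative assumptions */
data _null_;
   p1 = prxmatch('/R.n/', 'Ron');                /* any one character between R and n */
   p2 = prxmatch('/R.n/', 'Rn');                 /* 0: the period needs exactly one character */
   p3 = prxmatch('/Street./', 'K Street, NW');   /* the wildcard accepts the comma */
   p4 = prxmatch('/Street\./', 'K Street, NW');  /* 0: \. matches only a literal period */
   put p1= p2= p3= p4=;
run;
```

A result of 0 means no match; any other value is the position where the match starts.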

2.3.2 Word

The metacharacter \w matches any word character value, which includes alphanumeric and underscore (_) values. It matches any single letter (regardless of case), number, or underscore for a single character position. But do not be fooled by the underscore inclusion; \w does NOT match hyphens, dashes, spaces, or punctuation marks. Table 2.4 provides examples.

Table 2.4: Examples using \w




“Ran” “Run” “Ron” …


“Fun” “fun” “Run” “run” “Bun” “bun” “_un” …


“Streets” “Street_”

Note: The \w metacharacter should not have any unintentional spaces before or after it. Such spaces make the pattern try to match those spaces in addition to the \w. (This goes for any RegEx metacharacter.)

2.3.3 Non-word

The metacharacter \W matches a non-word character value (i.e., everything that \w doesn’t include). In Perl-style RegEx that includes the newline character (\n), unlike the wildcard. The \W metacharacter is valuable when you are unsure what is in a character cell but you know that you don’t want a word character (i.e., alphanumeric and _). Table 2.5 provides examples.

Table 2.5: Examples using \W




“Washington.” “Washington,” “Washington;”…


“D.C.” “D,C.” “D C.” “D C ” …


“Street.” “Street,” “Street+” …

Note: You will continue to see lowercase and uppercase versions of these RegEx characters acting as near opposites, with some exceptions. It might not be overly clever, but does help simplify matters.
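
The complementary behavior of \w and \W can be seen by running both against the same strings. The sketch below uses hypothetical patterns and strings chosen only for illustration.

```sas
/* Sketch only: patterns and strings are illustrative assumptions */
data _null_;
   w1 = prxmatch('/Street\w/', 'Streets');   /* "s" is a word character */
   w2 = prxmatch('/Street\w/', 'Street_');   /* the underscore counts as well */
   w3 = prxmatch('/Street\w/', 'Street-');   /* 0: a hyphen is not a word character */
   n1 = prxmatch('/Street\W/', 'Street-');   /* the hyphen is a non-word character */
   n2 = prxmatch('/Street\W/', 'Streets');   /* 0: "s" is a word character */
   put w1= w2= w3= n1= n2=;
run;
```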

2.3.4 Tab

The metacharacter \t matches only the tab character in a string. Unlike the RegEx characters to follow, this metacharacter matches only the tab whitespace character. This is especially useful when the tab holds some special significance, such as when you are processing tab-delimited text files. Table 2.6 provides examples.

Table 2.6: Examples using \t




“SAS ”


“SAS Institute Inc”


“Street ”

Note: This metacharacter does not have an opposite (i.e., \T does not exist).

2.3.5 Whitespace

The metacharacter \s matches on a single whitespace character, which includes the space, tab, newline, carriage return, and form feed characters. You must include this when you are matching on anything in text that is separated by white space, and you are unsure of which will occur. Table 2.7 provides examples.

Table 2.7: Examples using \s




“SAS ” “SAS ”


“SAS Institute Inc” “SAS Institute Inc”


“Street ” “Street ”

Note: This form of the \s metacharacter matches only one whitespace character. We review how to find multiple matches in Section 2.5.2 because that is frequently needed when you are matching text.
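
Because tabs and spaces look alike on the page, a small sketch may help show how \t and \s differ. It assumes an ASCII session, where '09'x is the tab character; the variable names and strings are hypothetical.

```sas
/* Sketch only: assumes ASCII, where '09'x is the tab character */
data _null_;
   tabbed = 'SAS' || '09'x || 'Institute';      /* words separated by a tab */
   spaced = 'SAS Institute';                    /* words separated by a space */
   t1 = prxmatch('/SAS\tInstitute/', tabbed);   /* the tab matches \t */
   t2 = prxmatch('/SAS\tInstitute/', spaced);   /* 0: a space is not a tab */
   s1 = prxmatch('/SAS\sInstitute/', tabbed);   /* \s accepts the tab */
   s2 = prxmatch('/SAS\sInstitute/', spaced);   /* \s accepts the space */
   put t1= t2= s1= s2=;
run;
```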

2.3.6 Non-whitespace

The metacharacter \S matches on a single non-whitespace character—the exact opposite of \s. This metacharacter is often used to account for unexpected dashes, apostrophes, commas, and so on, that might otherwise prevent a match. Table 2.8 provides examples.

Table 2.8: Examples using \S




“Leonato’s” “Leonatoas” “Leonato_s” …


“Washingtons” “Washington.” “Washington,” …


“Street.” “Street,” “Streets” “Street+” “Street_”…

2.3.7 Digit

The metacharacter \d matches on a numerical digit character (i.e., 0–9). This RegEx metacharacter is probably the most straightforward one as it has a very narrow focus. Just remember that a single occurrence of \d is for only one character position in any text. In order to capture larger numbers (i.e., anything greater than 9), you have to build patterns with multiple occurrences of \d. Table 2.9 provides examples, but we discuss more sophisticated methods for accomplishing this later in the chapter. (See “Repetition Modifiers” in Section 2.5.2.)

Table 2.9: Examples using \d




“1st” “9st” “4st” …


“101” “102” “103” …


“1-800-123-4567” “1-800-789-3456” …

Note: Just remember that even though your pattern might be correct, the data is not necessarily correct (4st and 9st don’t make sense!).
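
The one-position-per-\d rule can be tried out with the sketch below. The patterns and strings are hypothetical and chosen only for illustration.

```sas
/* Sketch only: patterns and strings are illustrative assumptions */
data _null_;
   /* Three digits, a hyphen, then four digits: one \d per position */
   d1 = prxmatch('/\d\d\d-\d\d\d\d/', 'Call 1-800-123-4567 today');
   d2 = prxmatch('/\dst/', '650 1st St NE');   /* a digit followed by "st" */
   d3 = prxmatch('/\dst/', 'First St. NW');    /* 0: no digit precedes "st" */
   put d1= d2= d3=;
run;
```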

2.3.8 Non-digit

The metacharacter \D matches on any single non-digit character. Again, this is the opposite of the lowercase metacharacter \d. This metacharacter matches on every value that is not a number. Table 2.10 provides examples.

Table 2.10: Examples using \D




“1-800-123-4567” “1.800.123.4567” …


“1560 Wilson Blvd” “1560_Wilson_Blvd” …


“19th Street” “19th.Street” “19…Street” …

2.3.9 Newline

The metacharacter \n matches a newline character. It is quite useful for some patterns to know that you have encountered a new line. For instance, you might be processing addresses in a text file, which often contain different pieces of information on different lines. Table 2.11 provides examples.

Table 2.11: Examples using \n



/103 Pennsylvania Ave\. NW,\nWashington, DC 20216/

“103 Pennsylvania Ave. NW, Washington, DC 20216”

/<html tag>\n/

“<html tag>

” …


t” …

Note: The test code does not enable us to actually try this metacharacter because it uses data lines, which is a feature of SAS that intentionally ignores newline characters when typed (i.e., hitting the Enter key just creates the start of a new data line in the SAS code window). For this reason, newline characters are not present in data lines for you to read and match on. But have faith, for now, that this one works as advertised. You will discover ways to process different text sources in the next chapter, enabling you to process newline characters.

2.3.10 Bell

The metacharacter \a matches an alarm “bell” character. The alarm character falls into a class of non-printing or invisible characters that are part of the ASCII character set. ASCII was developed long ago when operating systems used non-printing characters fairly extensively. Today, however, these characters are relatively uncommon, and most often occur only in files meant for computers to read rather than humans—since they are not displayed. When encountered, these characters generate an alarm tone, or “bell,” on a computer’s internal speaker. While they are often associated with errors, they can also be used to alert users that the end of a file or process has been achieved (e.g., in a system log file). You can use this metacharacter when you know to expect such a character in a source file. Table 2.12 provides examples.

Table 2.12: Examples using \a









Note: Since the alarm character is a non-printing ASCII character, I am representing its location in the matching text with the BEL ASCII character. However, remember that such a code does not appear in our text.

2.3.11 Control Character

The metacharacter \cA-\cZ matches a control character for the letter that follows the \c. For example, \cF matches control-F in the source. This is one of several examples where you might be processing less-often-used file types (i.e., not a file meant for humans to read). Control characters, or non-printing characters, were once used extensively by transactional computing and telecommunications systems. These control characters, while not visible in most text editors, are still part of the ASCII character set, and can still be used by older systems in these regimes. For our examples in Table 2.13, we stick with the convention that is used for the alarm metacharacter above—the standard ASCII abbreviation is used despite the fact that they are never actually seen in text.

Table 2.13: Examples using \cA-\cZ




DLE the non-printing Data Link Escape ASCII control character ^P


STX the non-printing Start of Text ASCII control character ^B


STX hello ETX the non-printing Start of Text ASCII control character ^B followed by the character string “hello” and completed with the non-printing End of Text ASCII control character ^C

2.3.12 Octal

The metacharacter \ddd matches an octal¹ character of the form ddd. It is used to match on the octal code for an ASCII character for which you are searching. It can be especially useful when you need to find specific non-printing ASCII characters in a file. The default behavior by SAS is to return the ASCII character associated with this octal code in the results. Table 2.14 provides examples.

Table 2.14: Examples using \ddd





“ ! ”

This octal code translates to the ! ASCII character.



This series of octal codes translates to the “HELLO” string of ASCII characters.



These octal codes translate to the two non-printing ASCII characters BEL and TAB. Refer to our discussion of the alarm metacharacter in Section 2.3.10 regarding characters that are not displayed.

Note: You will discover how to search for ranges of these values in the next section (Section 2.4). Also note that the largest ASCII value is decimal 127, octal 177, and hexadecimal 7F.

2.3.13 Hexadecimal

The metacharacter \xdd matches a hexadecimal² character of the form dd. The purpose of our implementation here is again not about searching through raw hexadecimal files, etc. We are using this to search for the hexadecimal code associated with the ASCII characters that we want in a source (manipulation of raw hex data sources is a different book). Table 2.15 provides examples.

Table 2.15: Examples using \xdd






This hexadecimal code translates to the + ASCII character.



These hexadecimal codes translate to the 1+1=2 ASCII characters.


“00 FF”

This is a reminder that we can match hexadecimal numbers stored in ASCII, and that they are not the same.
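
Octal and hexadecimal codes can be tried against printable characters, which makes the results easy to verify. The sketch below assumes an ASCII session, where octal 041 is “!” and hexadecimal 2B is “+”; the sample strings are hypothetical.

```sas
/* Sketch only: assumes ASCII character codes */
data _null_;
   o1 = prxmatch('/\041/', 'Hello!');                  /* octal 041 is "!" */
   h1 = prxmatch('/\x2B/', '1+1=2');                   /* hexadecimal 2B is "+" */
   h2 = prxmatch('/\x31\x2B\x31\x3D\x32/', '1+1=2');   /* the codes for 1+1=2 */
   h3 = prxmatch('/\x2B/', '00 2B');                   /* 0: the text "2B" is not a "+" */
   put o1= h1= h2= h3=;
run;
```

The last call illustrates the reminder above: a hexadecimal number stored as ASCII text is not the same as the character that the code represents.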

2.4 Character Classes

In addition to using the built-in RegEx characters to match patterns, users have the ability to create custom character matching. This capability is derived via different uses of [ and ] (square braces). The square braces essentially create a custom metacharacter, where the items contained between the opening brace and closing brace are possible match values for a single character cell. In addition to putting a list of characters inside the braces, you can also include metacharacters. Each of the metacharacters discussed below comes with an example that uses a metacharacter inside the braces, and those examples all have the same match results. Just for fun, they all identify a hexadecimal number range present in the ASCII source file (stored as ASCII characters in the source file, but representing the range of possible hexadecimal values).

Note: Remember that some of the components discussed in this section are special characters that must be escaped with \ in order to be matched in isolation. Specifically, these characters are: ^, [, and ].

2.4.1 List

The metacharacter […] matches any one of the specific characters or metacharacters listed within the braces. Being able to define an unordered list of things that you want to appear in a space is very convenient, and can sometimes be more convenient than the metacharacters that identify broad classes of character types. Table 2.16 provides examples.

Table 2.16: Examples using […]




“a” “b” “c” “A” “B” “C”


“0” “1” “3” “7”


“cat” “Cat” “bat” “Bat” “rat” “Rat”


“0” “1” “2” “3” “4” “5” “6” “7” “8” “9” “A” “B” “C” “D” “E” “F”

2.4.2 Not List

The metacharacter [^…] matches any one character that is not listed within the braces. (In Perl-style RegEx, this includes the newline character unless \n is itself listed.) Sometimes it is easier to write down what we don’t want rather than what we do. And for that reason, we might want to use this metacharacter. We can quickly identify the unwanted items and define them here. Table 2.17 provides examples.

Table 2.17: Examples using [^…]




“d” “e” “f” …


“2” “4” “5” “6” “8” “9”


“fat” “Fat” “hat” “rat” “mat” “Hat” …


“0” “1” “2” “3” “4” “5” “6” “7” “8” “9” “A” “B” “C” “D” “E” “F”

2.4.3 Range

The metacharacter […-…] matches anything that falls into a range of character values. Case matters for letters listed in the braces, so [a-f] and [A-F] are different ranges. RegEx, and by extension SAS, understands the inherent order of letters and numbers. Therefore, we can define any range of numbers or letters to be matched by this metacharacter. Table 2.18 provides examples.

Table 2.18: Examples using […-…]




“f” “g” “h” “i” “j” “k” “l” “m”


“1” “2” “3” “4” “5” “6” “7” “8” “9”


“a” “b” “c” “A” “B” “C”


“0” “1” “2” “3” “4” “5” “6” “7” “8” “9” “A” “B” “C” “D” “E” “F”
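
The three kinds of character class can be compared side by side with PRXMATCH. This is a minimal sketch; the patterns and strings are hypothetical and chosen only for illustration.

```sas
/* Sketch only: patterns and strings are illustrative assumptions */
data _null_;
   c1 = prxmatch('/[cCbBrR]at/', 'the Rat');           /* list: allowed first letters */
   c2 = prxmatch('/[0-9A-F][0-9A-F]/', 'value: 7F');   /* range: two hexadecimal digits */
   c3 = prxmatch('/[^0-9A-F]/', '00FF');               /* 0: no character outside the list */
   c4 = prxmatch('/[a-f]/', 'ABCDEF');                 /* 0: ranges are case sensitive */
   put c1= c2= c3= c4=;
run;
```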

2.5 Modifiers

There are two significant things that you probably notice missing from the previous sections, which are worth further discussion here. First, all of the applicable metacharacters thus far have ignored letter case. In other words, \w, \S, \D, and . all match on a letter regardless of whether it is lowercase or uppercase. However, there are situations in which the case of a letter becomes important, but the letter itself is not known in advance.

Second, we can use a single match character as many times as we like, which creates additional fuzziness for our matches. However, there is a downside to just typing them out: each occurrence must exist in order to match the pattern. For instance, if the source text for the \D examples above contained “19thStreet” with no spaces, we’d never find it by using \D three times. And since the primary goal of the RegEx capability is to have automated text processing, we need a robust way to make this kind of matching more flexible.

Over the next two subsections (2.5.1 and 2.5.2), we will work through ways to overcome these limitations by using modifiers. There are two types of modifiers: case modifiers and repetition modifiers. Combining them gives us significant robustness and flexibility in real-world RegEx implementations, and should be considered as fundamental to real-world implementations as the metacharacters that we have discussed thus far.

2.5.1 Case Modifiers

When performing matches on text, there is the obvious consideration of letter case (upper vs. lower). Although I have already introduced a rudimentary way to handle this in situations where the letter is known, there still must be a methodology for accounting for letter case when it is unknown. This section discusses a variety of approaches to dealing with case matching. Depending on the situation, some approaches are more convenient than others, while not necessarily being right or wrong.


The metacharacter \l matches when the next character in a pattern is lowercase. This metacharacter applies only to characters (metacharacters, groups, and so on don’t work). In practice, it is often simpler to type the lowercase version of the desired character value, or provide a list of lowercase letters to match. Table 2.19 provides examples.

Table 2.19: Examples using \l




“street” …


“ sas Institute” …


“sleet” “fleet” …


The metacharacter \u matches when the next letter in a pattern is uppercase. It functions exactly as the lowercase version introduced above (\l), but applies to uppercase. Table 2.20 provides examples.

Table 2.20: Examples using \u




“Inc.” …


“Street” “St.” …


“Ave.” “Avenue,”

Lowercase Range

The metacharacter \L…\E matches when all the characters between the \L and \E are lowercase. Strings typed between \L and \E are forced to match on lowercase only, even when they are typed in as capital letters. However, unlike the \l metacharacter, \L…\E can also contain character classes and repetition modifiers. Table 2.21 provides examples.

Table 2.21: Examples using \L…\E




“sas” “abc” “123” …


“these are lowercase”


“ Read ” “ Road ” “ Rode ” “ Ride ” “ Real ” …

Note: When applying case modifiers to non-alphabet characters, the modifier is ignored. It doesn’t apply to those characters, so it doesn’t affect the match.

Uppercase Range

The metacharacter \U…\E creates a match when all the characters between the \U and \E are uppercase. Again, this metacharacter functions the same way as the lowercase version discussed above, but applies to uppercase. This metacharacter can be useful for identifying acronyms or other text where capital letters are important. Table 2.22 provides examples.

Table 2.22: Examples using \U…\E




“SAS” “CIA” …


“SAS Institute Inc.” …



Note: Notice that other metacharacters are not allowed inside \L…\E or \U…\E metacharacters. In other words, \w can’t be used to replace the character classes above.

Quote Range

The metacharacter \Q…\E matches all content inside the \Q and \E as character strings, disabling everything including the backslash character. Metacharacters cannot be used inside \Q…\E. The functionality provided by this metacharacter is great for searching within strings that contain a significant number of reserved characters, such as XML, webserver logs, or HTML. Table 2.23 provides examples.

Table 2.23: Examples using \Q…\E



Pattern                                          Matches

/\Q<html tag name>\E/                            “<html tag name>”

/\Qf(x) + f(y) = z\E/                            “f(x) + f(y) = z”

/\Q<!DOCTYPE HTML> <html lang="en-US">\E/        “<!DOCTYPE HTML> <html lang="en-US">”
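
To see what \Q…\E saves us, the sketch below matches the same text twice: once with the quote range and once with a backslash in front of every special character. The variable names are hypothetical; the text is the second example from Table 2.23.

```sas
/* Sketch only: compares \Q...\E with manual backslashes */
data _null_;
   text = 'f(x) + f(y) = z';
   q1 = prxmatch('/\Qf(x) + f(y) = z\E/', text);    /* everything inside is literal */
   q2 = prxmatch('/f\(x\) \+ f\(y\) = z/', text);   /* same match with backslashes */
   q3 = prxmatch('/f(x) + f(y) = z/', text);        /* 0: ( ) and + keep their special meaning */
   put q1= q2= q3=;
run;
```

The first two calls find the text at the same position, while the third fails because the parentheses act as groups and the plus sign acts as a repetition modifier.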

2.5.2 Repetition Modifiers

Repetition modifiers change the matching repetition behavior of the metacharacters and characters immediately preceding them in a pattern. They can also modify the matching repetition of an entire group—defined using () to surround the group of metacharacters and characters before the modifier. Just keep in mind that repetition of the entire group means that it repeats back-to-back (e.g., “haha”), unless we also modify the individual metacharacters.

Now, there are two types of repetition modifiers, greedy and lazy. Greedy repetition modifiers try to match as many times as possible within the confines of their definition. Lazy modifiers attempt to find a match as few times as possible. They have similar uses, which can make the difference between their results subtle.

Introduction to Greedy Repetition Modifiers

Let’s start by discussing greedy modifiers because they are a little more intuitive to use. As we go through the examples, it is important to keep in mind that greedy modifiers match as many times as possible—constantly searching for the last possible time the match is still true. It is therefore easy to create patterns that match differently from what you might expect.

There is a concept in RegEx known as backtracking, which is the root cause for potential issues with greedy modifiers (hint: backtracking results in the need for lazy modifiers). As we discuss further when we examine lazy repetition modifiers, a greedy modifier actually tries to maximize the matches of a modified pattern chunk by searching until the match fails. Upon that failure, the system then backtracks to the position where the modified chunk last matched. The processing time wasted with backtracking for a single match is insignificant. However, as soon as we introduce a few additional factors, this problem can waste tremendous computing cycles—multiple modified pattern chunks, numerous match iterations (think loops), and large data sources. It is important to be mindful of these factors when designing patterns as they can have unintended consequences.

Greedy 0 or More

The modifier * requires the immediately preceding character or metacharacter to match 0 or more times. It enables us to generate unlimited optional matches within text. For example, we might want to match every occurrence of a word root, along with all of its prefixes and suffixes. By allowing the prefixes and suffixes to be optional, we are able to achieve this goal. Table 2.24 provides examples.

Table 2.24: Examples using *




“Sing” “Sings” “Singing” “Singer” “Singers” …


“DC” “D.C.” “D C “ “D….-!$%^ C.-)*&^%”…


“19th Street” “19thStreet” “19Street” …


“Hell” “Hello” “Hellooooooooooooo” …
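To make the greedy behavior of * concrete, here is a minimal sketch. The pattern /Hello*/ and the three test strings are illustrative assumptions chosen to be consistent with the last row of matches above; PRXPARSE and CALL PRXSUBSTR are the same routines used in the chapter's test code.

```sas
/* Sketch: greedy * lets the final "o" repeat zero or more times. */
data _null_;
   length some_data $30 match $30;
   pattern_ID = prxparse('/Hello*/');   /* illustrative pattern */
   do some_data = 'Hell', 'Hello', 'Hellooooo there';
      call prxsubstr(pattern_ID, some_data, pos, len);
      if pos ^= 0 then
         do;
            match = substr(some_data, pos, len);
            put match= some_data=;
         end;
   end;
run;
```

Because the modifier is greedy, the third string should yield every trailing "o" in the match rather than stopping at the first one.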

Greedy 1 or More

The modifier + requires the immediately preceding character or metacharacter to match 1 or more times. The plus sign modifier works similarly to the asterisk modifier, with the exception that it enforces a match of the metacharacter or character at least 1 time. Table 2.25 provides examples.

Table 2.25: Examples using +




“Run” “Ruin” “Runt” “Runners” …


Words with all letters capitalized, and surrounded by spaces.


“19th Street” “19th.Street” “19…Street” …


“ha” “hahahahahahaha” …

Note: Pay special attention to the addition of the \s metacharacter in the second example in Table 2.25. If it were not present, the pattern would also match single capital letters at the beginning of words. By adding \s, the pattern requires a whitespace character to immediately follow the one or more capital letters, thus eliminating matches on single letters at the beginning of words.
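The difference between * and + is easiest to see on text that does not contain the repeated chunk at all. The following sketch assumes an illustrative string and uses PRXMATCH, which returns the position of the first match or 0 when there is none.

```sas
/* Sketch: zero repetitions satisfy *, but not +. */
data _null_;
   some_data = 'oh no';
   star_pos = prxmatch('/(ha)*/', some_data);  /* an empty match is allowed  */
   plus_pos = prxmatch('/(ha)+/', some_data);  /* needs at least one "ha"    */
   put star_pos= plus_pos=;
run;
```

Here star_pos should be nonzero, because matching "ha" zero times still counts as a match, while plus_pos should be 0.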

Greedy 0 or 1 Time

The modifier ? creates a match of only 0 or 1 time. The question mark provides us the ability to make the occurrence of a metacharacter optional without allowing it to match multiple times. This can be effective for matching word pairs that have inconsistent use of dashes or spaces (e.g., short-term vs. short term). Table 2.26 provides examples.

Table 2.26: Examples using ?




“1-800-123-4567” “18001234567” …


“1560 Wilson Blvd.” “1560 Wilson Blvd” …


“19th Street” “19thStreet” …
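As a sketch of the short-term versus short term case mentioned above, the hypothetical pattern below makes a single dash or space optional. The pattern and test strings are assumptions for illustration, not entries from Table 2.26.

```sas
/* Sketch: ? allows zero or one separator, but never two. */
data _null_;
   length some_data $40;
   pattern_ID = prxparse('/short[- ]?term/');   /* illustrative pattern */
   do some_data = 'a short-term fix', 'a short term fix',
                  'a shortterm fix', 'a short--term fix';
      position = prxmatch(pattern_ID, some_data);
      put position= some_data=;
   end;
run;
```

The first three strings should report a nonzero position, and the last should report 0 because two dashes exceed the single optional character.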

Greedy n Times

The modifier {n} creates a match of exactly n times. Being able to match on a metacharacter exactly n number of times is the same as typing that metacharacter out that many times. However, from the perspective of coding and maintaining the RegEx patterns, using the modifier is a much better approach. It limits the opportunity for us to make typographical errors when initially creating the RegEx pattern, and it improves readability when later editing and sharing the patterns. Table 2.27 provides examples.

Table 2.27: Examples using {n}




“1-800-123-4567” “1.800.123.4567” …


“Round” “Runts” “Ruins” …


“19th Street” “19th.Street” “19…Street” …


“12345-6789” “12345” …

Greedy n or More

The modifier {n,} creates a match at least n times. By ensuring that we can match something at least n times, we are able to create functionality very similar to the plus modifier. However, we are raising the minimum number of times that the metacharacter must match. This is quite useful for certain applications, but must be handled with caution. Also, like the + modifier, we can easily get very long strings of unanticipated matches due to a single logical error in pattern construction. Table 2.28 provides examples.

Table 2.28: Examples using {n,}




“1-800-123-4567” “1-800-789-12” …


“143-25-7689” “12345689-546545654-9820”…


“19th Street” “19th, Not My Street” …

Note: Be mindful to not type a space after the comma inside the curly braces. It is easy to do out of habit, but it will wreck our pattern!

Greedy n to m Times

The modifier {n,m} creates a match at least n, but not more than m times. Creating a match with a specified range is quite useful for ensuring that data quality standards are being maintained. When extracting semi-structured data elements such as ZIP codes, birthdates, and phone numbers, it is important to maintain a certain level of flexibility while also ensuring that the source is within expected tolerances. For instance, a two-digit year might be accepted in lieu of a four-digit year, but a four-digit ZIP code would be unacceptable. Table 2.29 provides examples.

Table 2.29: Examples using {n,m}




“1-800-123-4567” …


“10-20-1950” “8-30-52” “4-3-1979”…


“Washington” “Wash” “Waste” “Washing” …

Note: As you can see in the examples above, the {n,m} might not always be the best choice of modifier, but these examples are meant to demonstrate the flexibility of implementation. For instance, the year in the second example is allowed to be three digits with this usage. Using an OR clause with the {n} modifier is a simple fix.
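The fix described in the note can be sketched as follows. Both patterns are assumptions written for illustration; they borrow the ^ and $ metacharacters from Section 2.7 so that the whole value must match, and TRIM removes the trailing blanks that SAS pads onto character variables.

```sas
/* Sketch: {2,4} accepts a three-digit year; an OR of {4} and {2} does not. */
data _null_;
   length some_data $12;
   loose_ID  = prxparse('/^\d{1,2}-\d{1,2}-\d{2,4}$/');
   strict_ID = prxparse('/^\d{1,2}-\d{1,2}-(\d{4}|\d{2})$/');
   do some_data = '10-20-1950', '8-30-52', '4-3-197';
      loose  = prxmatch(loose_ID,  trim(some_data));
      strict = prxmatch(strict_ID, trim(some_data));
      put some_data= loose= strict=;
   end;
run;
```

Only the last value should differ between the two patterns: the loose version accepts the three-digit year, and the strict version rejects it.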

Introduction to Lazy Repetition Modifiers

Now that you are familiar with greedy modifiers, let’s begin examining the lazy ones. In terms of syntax, they differ from the greedy modifiers only by the addition of a question mark (?). By adding the question mark immediately after each of the greedy modifiers, we are able to subtly change their behavior—sometimes in unexpected ways.

In general, lazy modifiers are used to both avoid overmatching and improve performance when compared to the greedy modifiers. There are situations when matching with greedy modifiers would lead to either grabbing too much information, or simply slowing down system performance. For instance, processing semi-structured text files such as HTML or XML is a great example of when lazy modifiers would come in handy.
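A small markup example shows the overmatching problem. The string and both patterns below are illustrative assumptions; the only difference between the patterns is the question mark that makes + lazy.

```sas
/* Sketch: greedy versus lazy matching of a tag. */
data _null_;
   some_data = '<b>bold</b> and <i>italic</i>';
   greedy_ID = prxparse('/<.+>/');    /* runs to the LAST ">" in the string  */
   lazy_ID   = prxparse('/<.+?>/');   /* stops at the FIRST ">" it can reach */
   call prxsubstr(greedy_ID, some_data, pos, len);
   greedy = substr(some_data, pos, len);
   call prxsubstr(lazy_ID, some_data, pos, len);
   lazy = substr(some_data, pos, len);
   put greedy= / lazy=;
run;
```

The greedy pattern should capture the entire string, while the lazy pattern should capture only the opening tag.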

Lazy 0 or More

The modifier *? creates a match 0 or more times, but as few times as necessary to create the match. In some situations, it creates the same matches as does the greedy version. However, in other cases, the results are very different. To make it clearer, Table 2.30 describes the details of a few examples.

Table 2.30: Examples using *?






This matches only the word “Sing” because the modifier is given the option to match nothing. And since it is lazy, it will take that option every time, regardless of whether a word character immediately follows the “g” in “Sing”.


“Singing ” …

Comparing this to the example above, you see that appending the \s on the pattern creates additional matches. The \s forces the pattern to continue searching for a match that includes white space. This could be “Sing “ or many other combinations (similar to the greedy outcomes).



This example demonstrates why we need to be careful with lazy modifiers. Even when “ha” exists, it is ignored, again because the modifier has the option to do so. The greedy version of this would match as many times as the word “ha” occurred back-to-back, with a minimum of zero times.
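The effect of appending \s to a lazy pattern can be checked by comparing match lengths. This is a minimal sketch with an assumed test string; the two patterns follow the descriptions above.

```sas
/* Sketch: a lazy *? matches nothing unless a later element forces it on. */
data _null_;
   some_data = 'Singing loudly';
   bare_ID   = prxparse('/Sing\w*?/');     /* nothing follows the modifier  */
   forced_ID = prxparse('/Sing\w*?\s/');   /* \s must be reached to succeed */
   call prxsubstr(bare_ID,   some_data, pos, bare_len);
   call prxsubstr(forced_ID, some_data, pos, forced_len);
   put bare_len= forced_len=;
run;
```

The first length should be 4 (just "Sing"), and the second should be 8, because the lazy modifier has to consume "ing" before the white space can match.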

Lazy 1 or More

The modifier +? creates a match 1 or more times, but as few times as necessary to create a match. Again, if it is possible, this matches only once. Table 2.31 provides examples.

Table 2.31: Examples using +?






This matches only “Sing” plus exactly one word character following the “g”. Again, by giving the lazy modifier an option to match the minimum, it will do so every time.


“Singing ” …

Again, we see that appending the \s on the pattern creates additional matches. The \s forces the pattern to continue searching for a match that includes white space. This could be “Singi “ or many other combinations (similar to the greedy outcomes).



This example is less of a cautionary tale than for *?. But it might still provide undesirable results. Even when “ha” exists numerous times back-to-back, it matches only the first time, unless an additional match element follows it. Again, this is because the modifier has the option to match only once. The greedy version of this would match as many times as the word “ha” occurred back-to-back, with a minimum of once.

Lazy 0 or 1 Times

The modifier ?? creates a match 0 or 1 times, but as few times as necessary to create a match. Unless forced, this modifier will match 0 times. Table 2.32 provides examples.

Table 2.32: Examples using ??






This matches only the word “Sing” because the modifier is given the option to match nothing. And since it is lazy, it will take that option every time, regardless of whether a word character immediately follows the “g” in “Sing”. The reasoning is the same as with the *? modifier.


“Sings ” …

Again, just as with the *? modifier, we see that appending the \s on the pattern creates additional matches. The \s forces the pattern to continue searching for a match that includes white space. This could be “Sings “ or a few other combinations (similar to the greedy outcomes).



This example demonstrates why we need to be careful with lazy modifiers. Even when “ha” exists, it is ignored, again because the modifier has the option to do so. The greedy version of this would match “ha” one time whenever it is present, and zero times otherwise.

Lazy n Times

The modifier {n}? creates a match exactly n times. This modifier functions exactly as the greedy version, making the ? unnecessary. Using this modifier results in no performance enhancement or change in functionality, which makes it a completely unnecessary addition to the Perl language. It has been included here for the sake of completeness. Table 2.33 shows that the same examples reveal the same results.

Table 2.33: Examples using {n}?




“1-800-123-4567” “1.800.123.4567” …


“Round” “Runts” “Ruins” …


“19th Street” “19th.Street” “19…Street” …


“12345-6789” “12345” …

Lazy n or More

The modifier {n,}? creates a match at least n times, but as few times as necessary to create a match. This functions just like the *? or +? modifiers, except that the minimum number of matches is arbitrary. Again, we see similar behavior resulting from the laziness of the modifier. Table 2.34 provides examples.

Table 2.34: Examples using {n,}?





“Singing” …

This usage matches exactly n=3 times. Again, by giving the lazy modifier an option to match the minimum, it will do so every time.


“0000 ” …

Now that you have the hang of these modifiers, this example should be a little more interesting. Appending \s on the pattern still forces it to match each 0 until the white space is encountered. The pattern is “anchored” to the first occurrence of a 0, thus capturing more than the minimum.



Without surrounding information in the pattern, this matches only the minimum number of times. By having nothing else to force additional matching, the lazy modifier just stops after the minimum of n=4.

Lazy n to m Times

The modifier {n,m}? creates a match at least n times, but no more than m times—as few times in that range as necessary to create the match. It functions like many of the other lazy modifiers discussed thus far, but it sets a cap on how many times it can match in addition to having an arbitrary minimum. Table 2.35 provides examples.

Table 2.35: Examples using {n,m}?





“Ready” …

This usage matches the word metacharacter only one time. Again, by giving the lazy modifier an option to match the minimum, it will do so every time.


“0000 ” …

Again, the pattern is “anchored” to the first occurrence of a 0, thus capturing the minimum if it exists, up to the maximum.


“ ha”

By not having anything after the “anchor” point for the pattern to match on, there is nothing to force additional matching. The lazy modifier just stops after the minimum of n=0.

2.6 Options

Options affect the behavior of the entire RegEx pattern with which they are associated. These behavioral changes provide benefits ranging from making RegEx creation more convenient, to providing new or enhanced functionality.

Options occur after the closing slash character. There is also one item of significance that occurs before the first slash character; it is not actually an option, but this is the best place to go over it. We are not going to cover all of the options, for the same reason we haven’t covered absolutely all of the metacharacters thus far: this is an introductory text.

2.6.1 Ignore Case

The option //i ignores letter case for the entire pattern, even character strings. This is a great option to use when we know exactly what words we are searching for, but we don’t want the letter case to be an issue. Table 2.36 provides examples.

Table 2.36: Examples using //i



/1600 Pennsylvania Avenue/i

“1600 pennsylvania avenue” “1600 PENNSYLVANIA AVENUE” …


“street” “Street” “STREET” …

/CAPS don’t MaTtEr/i

“caps don’t matter” “CaPs DoN’t MATTER” …
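The following sketch compares the same pattern with and without the i option. The test strings are taken from the middle row of the table; the pattern /Street/ is an assumption for illustration.

```sas
/* Sketch: the i option removes case sensitivity from the whole pattern. */
data _null_;
   length some_data $20;
   do some_data = 'street', 'Street', 'STREET';
      exact  = prxmatch('/Street/',  some_data);
      nocase = prxmatch('/Street/i', some_data);
      put some_data= exact= nocase=;
   end;
run;
```

Without the option, only the second string should match; with it, all three should.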

2.6.2 Single Line

The option //s forces the dot character (.) to match everything, including the newline character, when it occurs in the pattern. This can be very helpful to ensure that we don’t miss anything for a particular character position. Table 2.37 provides examples.

Table 2.37: Examples using //s



/43rd and Times Square.New York, NY 10036/s

“43rd and Times Square
New York, NY 10036” …

/Bob Smith.\d{3}-\d{3}-\d{4}/s

“Bob Smith
123-456-7891” …
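To try the second example outside of the test code, we can build a two-line string ourselves. This sketch assumes that the newline is a line feed character, written in SAS as the hexadecimal constant '0A'x.

```sas
/* Sketch: the dot matches a newline only when the s option is present. */
data _null_;
   some_data = 'Bob Smith' || '0A'x || '123-456-7891';
   without_s = prxmatch('/Bob Smith.\d{3}-\d{3}-\d{4}/',  some_data);
   with_s    = prxmatch('/Bob Smith.\d{3}-\d{3}-\d{4}/s', some_data);
   put without_s= with_s=;
run;
```

The first result should be 0 and the second should be 1.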

2.6.3 Multiline

The option //m causes ^ and $ to match on more than just the string start and end, respectively. With this option, ^ also matches immediately after each newline character, and $ also matches immediately before each one, so every line within the string has its own start and end. This enhanced functionality applies to two metacharacters that we haven’t covered yet (we’ll discuss them in Section 2.7), so if you need to, feel free to peek ahead and come back to this one. Table 2.38 provides examples.

Table 2.38: Examples using //m




Words at the beginning of a string and words following a newline character.


Words immediately before a space and the string end, and before a space and newline character.
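A minimal sketch of the m option follows. The two-line string is an assumption built with the line feed constant '0A'x, and the pattern simply looks for a word at the start of a line.

```sas
/* Sketch: with the m option, ^ also matches right after a newline. */
data _null_;
   some_data = 'first line' || '0A'x || 'second line';
   without_m = prxmatch('/^second/',  some_data);
   with_m    = prxmatch('/^second/m', some_data);
   put without_m= with_m=;
run;
```

Without the option, the result should be 0 because “second” is not at the start of the string. With it, the result should be the position just after the newline.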

2.6.4 Compile Once

The option //o is known as the compile once option. By having the “o” immediately following the closing slash, SAS knows to compile that RegEx only once. This option creates a very nice simplification to SAS code, which I demonstrate by showing updated test code below (see Section 2.1.1 for the original code). Notice how the IF block is removed, and only the two lines that do not include the RETAIN statement remain. These changes are possible due to the compilation happening the first time through the DATA step. Every subsequent loop through reuses the previously compiled expression, if it exists.

Updated Test Code

/*RegEx Testing Framework*/
data _NULL_;
*if _N_=1 then
*   do;
*      retain pattern_ID;
*      pattern="/Run/"; /*<--Edit the pattern here.*/
*      pattern_ID=prxparse(pattern);
*   end;
pattern="/Run/o"; /*<--Edit the pattern here.*/
pattern_ID=prxparse(pattern);
input some_data $50.;
call prxsubstr(pattern_ID, some_data, position, length);
if position ^= 0 then
   do;
      match=substr(some_data, position, length);
      put match:$QUOTE. "found in " some_data:$QUOTE.;
   end;
datalines;
Smith, BOB A.
ROBERT Allen Smith
Smithe, Cindy
103 Pennsylvania Ave. NW, Washington, DC 20216
508 First St. NW, Washington, DC 20001
650 1st St NE, Washington, DC 20002
3000 K Street NW, Washington, DC 20007
1560 Wilson Blvd, Arlington, VA 22209
1(800) 789-1234
;
run;

2.6.5 Substitution Operator

While the substitution operator s// is not technically an option, it belongs here if only because it truly stands apart from the other items discussed in this section. Although the substitution operation is similar in appearance to the other options, it fundamentally changes the RegEx activity from a matching operation to a match-and-replace operation. Placing “s” in front of the surrounding slashes (//) signifies that the pattern is being used to replace the text being matched and insert the accompanying replacement text. This operator is another peek at additional functionality that is explored in the next chapter with SAS functions. Once a pattern is matched, we can then do a variety of things with that information. A great analogy for how this works in practice is the find-and-replace functionality provided by many word processing applications, except this is much more powerful. Also, notice that there is a third slash in the examples below (in the middle of the patterns). That additional slash denotes where the matching portion of the RegEx ends and the replacement portion begins. And notice something important in the last example: everything is a string literal. That’s right, all the characters that occur between the second and third slash are treated as just characters. Table 2.39 provides some examples, but we cover this in detail in the next chapter, where we also discuss how to insert more than just character strings.

Table 2.39: Examples using s//



Replaces with








“1 (800) - ” …


Note: This is a more advanced function that our test code is not set up to handle. You’ll just need to accept it as true until we use it with some SAS code in the next chapter.
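For readers who want a preview, here is a minimal sketch using the PRXCHANGE function, which accepts a substitution pattern, the number of replacements to make (-1 means all of them), and the source text. The pattern and address are illustrative assumptions and are not taken from Table 2.39.

```sas
/* Sketch: s/match/replacement/ applied with PRXCHANGE. */
data _null_;
   length new_data $50;
   some_data = '3000 K Street NW, Washington, DC 20007';
   /* The period in the replacement text is a literal character. */
   new_data = prxchange('s/Street/St./', -1, some_data);
   put new_data=;
run;
```

The result should read “3000 K St. NW, Washington, DC 20007”. Note that the period after “St” is inserted as-is, which illustrates that the replacement portion is treated as a string literal.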

2.7 Zero-width Metacharacters

Zero-width characters, often called positional characters, are not matched in isolation because they do not have a width. They are used as an additional piece of information for making a proper pattern match. There are numerous examples for how these zero-width characters can be used. For instance, perhaps you want to match a particular word, but only if it occurs at the beginning of a line.

2.7.1 Start of Line

The metacharacter ^ matches the beginning of a line or string. Depending on the text that we are processing, we might know a priori that a new line signifies something specific. For example, we might be looking for the beginning of a new paragraph, which could be denoted by a new line in combination with a capital letter and no preceding white space. Or we might need to be prepared to match an address that includes a new line for the city, state, and zip. Table 2.40 provides examples.

Table 2.40: Example using ^



/^Washington, DC 20007/

Washington, DC 20007”


The first word in a string.

Note: This metacharacter is often used as the logical NOT symbol, including within the character class metacharacters discussed in Section 2.3 and in SAS code. So be careful not to get confused in its usage when shifting between contexts.

2.7.2 End of Line

The metacharacter $ matches the end of a line or string. There are numerous situations in which this might become relevant, similar to the reasons for the ^ metacharacter. Table 2.41 provides examples.

Table 2.41: Example using $



/3000 K Street NW,$/

“3000 K Street NW,
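One practical wrinkle when experimenting with $ in SAS is that character variables are padded with trailing blanks to their declared length, and those blanks are part of the text being searched. The sketch below, with an assumed variable length of 50, shows the effect and one way around it using the TRIM function.

```sas
/* Sketch: trailing blanks can keep $ from matching where we expect. */
data _null_;
   length some_data $50;
   some_data = '3000 K Street NW,';
   padded  = prxmatch('/NW,$/', some_data);         /* blanks follow the comma */
   trimmed = prxmatch('/NW,$/', trim(some_data));   /* blanks removed first    */
   put padded= trimmed=;
run;
```

The padded search should return 0, while the trimmed search should find the match. Allowing optional white space before the $ in the pattern is another way to handle this.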



2.7.3 Word Boundary

The metacharacter \b matches a word boundary. The \b RegEx assertion metacharacter is zero-width because it actually represents the invisible gap between two characters, with a \w character on one side and \W on the other. Therefore, when you use this metacharacter, you won’t generate matches that contain the associated non-word character. Table 2.42 provides examples.

Table 2.42: Example using \b




“Street” from the substrings, “Street,” “Street ” …
But does NOT match from the substring “Streets” etc.


“800” “888” … from the substrings “(800)” “-888-“ …
But does NOT match from the substrings “18002” …


Words in all caps. Without the second \b, the output would also include single capitalized letters from the front of a word.
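The first row of the table can be sketched as follows. The pattern /\bStreet\b/ and the two test strings are assumptions chosen to mirror the behavior described.

```sas
/* Sketch: \b requires a word boundary on each side of "Street". */
data _null_;
   length some_data $30;
   do some_data = '3000 K Street, NW', 'Two Streets over';
      bounded   = prxmatch('/\bStreet\b/', some_data);
      unbounded = prxmatch('/Street/',     some_data);
      put some_data= bounded= unbounded=;
   end;
run;
```

Both patterns should match the first string. Only the unbounded pattern should match the second, because the “s” in “Streets” is a word character and therefore no boundary follows “Street”.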

2.7.4 Non-word Boundary

The metacharacter \B matches a non-word boundary (i.e., anywhere \b does not match). This is especially useful for matching root words or substrings without including the surrounding pieces of information. Table 2.43 provides examples.

Table 2.43: Examples using \B




“read” from the substrings, “reads” “reading” “reader” …
But does NOT match from the substring “read”


“un” from the substrings, “fun ” “rerun.” “gun,” …
But does NOT match from the substring “un”


Any word longer than three letters.
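A short sketch of the first row: the hypothetical pattern /read\B/ requires that another word character immediately follow “read”.

```sas
/* Sketch: \B succeeds only where there is NO word boundary. */
data _null_;
   length some_data $10;
   do some_data = 'read', 'reader', 'reading';
      position = prxmatch('/read\B/', some_data);
      put some_data= position=;
   end;
run;
```

The standalone word “read” should return 0, since a boundary follows the “d”, while the other two strings should return 1.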

2.7.5 String Start

The metacharacter \A matches the beginning of a string. Similar to the word boundary metacharacter (\b), \A occurs between two character cells. It also denotes when a string value occurs to its right with nothing to its left. In the context of data lines (as in our test code for this chapter), that situation occurs at the beginning of each line.

However, suppose we had a more complex task such as stitching together multiple strings of extracted text (stored in SAS variables). In this context, \A could be a key to determining in what order to place or sort them. However, for our test code, the \A matches only on the beginning of each data line, since each line is identified as the beginning of the string. So, this is another one that you have to approach with a little bit of faith until we start doing some more interesting tasks in the next chapter. Table 2.44 provides examples.

Table 2.44: Examples using \A




The first word of a line. In the case of our test code, it matches:
“ROBERT ” from line 2;
“103 ” from line 4;
“508 ” from line 5;
“650 ” from line 6;
“3000 ” from line 7;
and “1560 ” from line 8.
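The difference between \A and ^ only becomes visible when a string contains more than one line and the m option from Section 2.6.3 is in effect. The sketch below assumes a two-line string built with the line feed constant '0A'x.

```sas
/* Sketch: under the m option, ^ matches at each line start; \A does not. */
data _null_;
   some_data = 'one' || '0A'x || 'two';
   caret_pos  = prxmatch('/^two/m',  some_data);
   stringA_pos = prxmatch('/\Atwo/m', some_data);
   put caret_pos= stringA_pos=;
run;
```

The ^ version should find “two” at the start of the second line, while the \A version should return 0 because “two” is not at the very beginning of the string.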

2.8 Summary

We have explored a variety of interesting new concepts in this chapter, and I’ve been doing my utmost to make them tangible along the way. Hopefully, you are now ready to tackle the challenge of implementing these concepts in SAS code in the coming chapters. Following are some takeaways you should keep in mind for the coming pages and beyond.


It should have become clear through reading this chapter that there are many ways to accomplish the same task, making few of them truly right or wrong. You have to decide the most efficient and effective approach for accomplishing your goals to determine what is best for a given situation.

Scratching the Surface

We have only begun to scratch the surface of what RegEx can do. The information you have learned thus far is a solid foundation upon which you can develop sophisticated functionality.

Start Small

As we have explored a variety of RegEx capabilities throughout this chapter, it is easy to become overwhelmed with attempting to do too much at once. As with anything, it is best to start small by experimenting with simple patterns and iteratively evolve them. And remember that leveraging just a few of the elements we have covered can have a tremendous impact on the processing and analysis of textual information.

1 Octal is a number system that uses base-8 instead of base-10. This system has only numbers 0–7 represented. Some old microcontrollers and microprocessors used this encoding, but it is extremely rare today.

2 Hexadecimal is a number system that uses base-16 instead of base-10. The possible values go from “0” to “F” in a single character position (where A=10, B=11, …, F=15).