Strings and Patterns - Zend PHP 5 Certification Study Guide (2014)

Strings and Patterns

As we mentioned in the PHP Basics chapter, strings wear many hats in PHP—far from being relegated to mere collections of textual characters, they can be used to store binary data of any kind, as well as text encoded in a way that PHP does not understand natively, but that one of its extensions can manipulate directly.

String manipulation is a very important skill for every PHP developer—a fact that is reflected in the number of exam questions that either revolve directly around strings or that require a firm grasp on the way they work. Therefore, you should ensure that you are very familiar with them before taking the exam.

Keep in mind, however, that strings are a vast topic; once again, we focus on the PHP features that are most likely to be relevant to the Zend exam.

String Basics

Strings can be defined using one of several methods. Most commonly, you will encapsulate them in single quotes or double quotes. In PHP, unlike some other languages, these two methods behave quite differently: single quotes represent simple strings, in which almost all characters are used literally. Double quotes, on the other hand, encapsulate complex strings that allow for special escape sequences (for example, to insert special characters) and for variable substitution, which makes it possible to embed the value of a variable directly in a string, without the need for any special operator.

Escape sequences are sometimes called control characters and take the form of a backslash (\) followed by one or more characters. Perhaps the most common escape sequence is the newline character \n. In the following example, we use hex and octal notation to display an asterisk:

echo "\x2a";

echo "\052";
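The difference between the two quoting styles is easy to verify by measuring the resulting strings. The following is only a minimal sketch; the variable names are ours and purely illustrative:

```php
<?php
// Single quotes keep the backslash and the letter n as two separate bytes.
$single = '\n';

// Double quotes turn the same two characters into one newline byte.
$double = "\n";

echo strlen($single), "\n"; // 2
echo strlen($double), "\n"; // 1

// Hex and octal escapes are only interpreted inside double quotes.
var_dump("\x2a" === '*'); // bool(true)
var_dump('\x2a' === '*'); // bool(false)
```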

Variable Interpolation

A variable can be embedded directly inside a double-quote string by simply typing its name. For example:

$who = "World";

echo "Hello $who\n"; // Shows "Hello World" followed by

// a newline

echo 'Hello $who\n'; // Shows "Hello $who\n"

Clearly, this simple syntax won’t work in situations in which the parser cannot readily parse the name of the variable you want interpolated because of the way the name is positioned inside of the string. In these cases, you can encapsulate the variable’s name in braces to make it clear:

$me = 'Davey';

$names = array('Smith', 'Jones', 'Jackson');

echo "There cannot be more than two {$me}s!";

echo "Citation: {$names[1]}[1987]";

In the first example above, the braces help us append a hard-coded letter “s” to the value of $me. Without braces, the parser would be looking for the variable $mes, which obviously does not exist. In the second example, if the braces were not available, the parser would interpret our input as $names[1][1987], since the square brackets are used for array syntax. This would clearly not give us the result we intended, since 1987 is the year of the citation.

The Heredoc and Nowdoc Syntax

Another syntax, called heredoc, can be used to declare complex strings—in general, the functionality it provides is similar to that of double quotes. Because heredoc uses a special set of tokens to encapsulate the string, it’s easier to declare strings that include many double-quote characters or span many lines.

A heredoc string is delimited by the special operator <<< followed by an identifier. You must then close the string using the same identifier, optionally followed by a semicolon, placed at the very beginning of its own line (that is, it should not be preceded by whitespace). Heredoc identifiers must follow the same rules as variable naming (explained in the PHP Basics chapter), and are similarly case-sensitive. By convention, the identifiers are usually written in upper case.

The heredoc syntax behaves like double quotes in every way, meaning that variables and escape sequences are interpolated:

$who = "World";

echo <<<TEXT

So I said, "Hello $who"

TEXT;

The above code will output So I said, "Hello World". Note how the newline characters right after the opening token and at the end of the string (before the closing token) are ignored.

In PHP 5.3, the new nowdoc syntax was introduced. Nowdoc is to Heredoc as single quoted strings are to double quoted strings. That is, no interpolation is done, and the entire string is considered literal (all $ and escape sequences are ignored).

To use nowdoc simply single-quote the identifier after the <<<.

$who = "World";

echo <<<'TEXT'

So I said, "Hello $who"

TEXT;

The above code will output So I said, "Hello $who". With nowdoc, as in single quoted strings, the variable is treated as literal.

PHP 5.3 also added the ability to double quote the identifier, which gives the traditional heredoc behavior.
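As a brief sketch of the double-quoted form (reusing the $who variable from the earlier examples, with the result stored in an illustrative $text variable), the following should behave exactly like a heredoc with a bare identifier:

```php
<?php
$who = "World";

// The double-quoted identifier behaves like a bare one: $who is interpolated.
$text = <<<"TEXT"
So I said, "Hello $who"
TEXT;

echo $text; // So I said, "Hello World"
```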

Heredoc and nowdoc strings can be used in almost all situations in which a string is an appropriate value. The only exception, which applies only to heredoc, is the declaration of a class property (explained in the Object-Oriented Programming in PHP chapter).

Prior to PHP 5.3, using heredoc when defining a property results in a parse error:

class Hello {

public $greeting = <<<EOT

Hello World

EOT;

}

Additionally, even in PHP 5.6, there is the caveat that you cannot interpolate variables when using heredoc to define properties. Nowdoc can be used with no issue.
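To illustrate both forms side by side, here is a small sketch that assumes PHP 5.3 or later. The Greeting class and its property names are hypothetical and exist only for this example:

```php
<?php
// Hypothetical class, used only to illustrate the two property forms.
class Greeting {
    // Nowdoc: the text is stored literally, so $who is not a variable here.
    public $template = <<<'EOT'
Hello $who
EOT;

    // Heredoc is accepted as long as it contains no variables.
    public $plain = <<<EOT
Hello World
EOT;
}

$g = new Greeting();

echo $g->template, "\n"; // Hello $who
echo $g->plain, "\n";    // Hello World
```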

Escaping Literal Values

All three string-definition syntaxes feature a set of several characters that require escaping in order to be interpreted as literals.

When using single-quote strings, single quote characters can be escaped using a backslash:

echo 'This is \'my\' string';

A similar set of escaping rules applies to double-quote strings, where double quote characters and dollar sign can also be escaped by prefixing them with a backslash:

$a = 10;

echo "The value of \$a is \"$a\".";

Backslashes themselves can be escaped in both cases using the same technique:

echo "Here's an escaped backslash: - \\ -";

Note that you cannot escape a brace. Therefore, if you need the literal string {$ to be printed out, you will need to escape the dollar sign in order to prevent the parser from interpreting the sequence as an attempt to interpolate a variable:

echo "Here's a literal brace + dollar sign: {\$";

Heredoc strings provide the same escaping mechanisms as double-quote strings, with the exception that you do not need to escape double quote characters, since they have no semantic value.

Working with Strings

While PHP strings can store any data, including multibyte characters like those found in Unicode/UTF-8, the standard PHP string functions work on a per-byte basis, not a per-character basis. If you wish to work with multibyte strings, you should look at the iconv and mbstring extensions.

While iconv is considered superior to mbstring, the mbstring extension provides alternatives to many more common string functions. The functions provided by these extensions account for the fact that in multibyte strings, more than one byte can be used to represent a single character.

Note: As of PHP 5.6, UTF-8 is the default value of the default_charset setting. This means that the automatically generated Content-Type header will now specify UTF-8, and that the iconv, mbstring and filter extensions will all use UTF-8 by default.

Determining the Length of a String

The strlen() function is used to determine the length, in bytes, of a string. Note that strlen(), like most string functions, is binary-safe. This means that all characters in the string are counted, regardless of their value. (In some languages (notably C), some functions are designed to work with “zero-terminated” strings, where the NUL character is used to signal the end of a string. This causes problems when dealing with binary objects, since bytes with a value of zero are quite common; luckily, most PHP functions are capable of handling binary data without any problem.)

As in most cases in which an alternative function is supported, you can count characters rather than bytes by using iconv_strlen() with iconv, or mb_strlen() with mbstring.
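The difference between counting bytes and counting characters can be seen with a short sketch. We write the accented character as explicit UTF-8 bytes so that the example does not depend on the encoding of the source file, and we guard the extension functions because neither extension is guaranteed to be loaded:

```php
<?php
// "é" encoded as UTF-8 occupies two bytes (0xC3 0xA9).
$word = "caf\xC3\xA9";

echo strlen($word), "\n"; // 5: strlen() counts bytes

// mb_strlen() is only available when the mbstring extension is loaded.
if (function_exists('mb_strlen')) {
    echo mb_strlen($word, 'UTF-8'), "\n"; // 4: counts characters
}

// iconv_strlen() is only available when the iconv extension is loaded.
if (function_exists('iconv_strlen')) {
    echo iconv_strlen($word, 'UTF-8'), "\n"; // 4: counts characters
}
```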

Transforming a String

The strtr() function can be used to translate certain characters of a string into other characters. It is often used as an aid in the practice known as transliteration, to transform accented characters that cannot appear in, for example, URLs or e-mail addresses into their unaccented equivalents:

Listing 3.1: Translating characters

// Translate a single character

echo strtr('abc', 'a', '1'); // Outputs 1bc

// Translate multiple-characters

$subst = array(

'1' => 'one',

'2' => 'two',

);

echo strtr('123', $subst); // Outputs onetwo3

Using Strings as Arrays

You can access the individual characters of a string as if they were members of an array. For example:

$string = 'abcdef';

echo $string[1]; // Outputs 'b'

This approach can be very handy when you need to scan a string one character at a time:

$s = 'abcdef';

for ($i = 0; $i < strlen($s); $i++) {

if ($s[$i] > 'c') {

echo $s[$i];

}

}

Note that string character indices are zero-based—the first character of an arbitrary string $s has an index of zero, and the last has an index of strlen($s)-1.

The index can be any valid expression; for example, you can use the rand() function to get a random character.
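A minimal sketch of the random-character idea might look like this (the $index variable is ours; the upper bound is the index of the last character):

```php
<?php
$s = 'abcdef';

// rand() returns an integer between the two bounds, inclusive.
$index = rand(0, strlen($s) - 1);

echo $s[$index]; // one of a, b, c, d, e or f
```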

As of PHP 5.5, you can use this syntax directly on string literals; this is known as dereferencing:

echo "abcdef"[1]; // Outputs 'b'

Comparing, Searching and Replacing Strings

Comparison is, perhaps, one of the most common operations performed on strings. At times, PHP’s type-juggling mechanisms also make it the most maddening, particularly because strings that can be interpreted as numbers are often transparently converted to their numeric equivalent. Consider, for example, the following code:

$string = '123aa';

if ($string == 123) {

// The string equals 123

}

You’d expect this comparison to return false, since the two operands are most definitely not the same. However, PHP first transparently converts the contents of $string to the integer 123, thus making the comparison true. Naturally, the best way to avoid this problem is to use the identity operator === whenever you are performing a comparison that could potentially lead to type-juggling problems.
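The following sketch shows how the identity operator behaves with the same value. Because === compares both type and value, no conversion takes place:

```php
<?php
$string = '123aa';

// The identity operator compares type and value, so nothing is converted.
var_dump($string === 123); // bool(false): string versus integer
var_dump('123' === 123);   // bool(false): still different types
var_dump('123' === '123'); // bool(true)

// An explicit cast makes any conversion visible in the code.
var_dump((int) '123' === 123); // bool(true)
```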

In addition to comparison operators, you can also use the specialized functions strcmp() and strcasecmp() to match strings. These are identical, with the exception that the former is case-sensitive, while the latter is not. In both cases, a result of zero indicates that the two strings passed to the function are equal:

Listing 3.2: Comparing strings with strcmp and strcasecmp

$str = "Hello World";

if (strcmp($str, "hello world") === 0) {

// We won't get here, because of case sensitivity

}

if (strcasecmp($str, "hello world") === 0) {

// We will get here, because strcasecmp()

// is case-insensitive

}

A further variant of strcasecmp(), strncasecmp() allows you to only test a given number of characters inside two strings. For example:

$s1 = 'abcd1234';

$s2 = 'abcd5678';

// Compare the first four characters

echo strncasecmp($s1, $s2, 4);

You can also perform a comparison between portions of strings by using the substr_compare() function.
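As a short sketch, substr_compare() takes the main string, the string to compare against, a starting offset and, optionally, a length and a case-insensitivity flag. The values below are illustrative:

```php
<?php
$s1 = 'Hello World';

// Compare 5 characters of $s1, starting at index 6, with "World".
echo substr_compare($s1, 'World', 6, 5), "\n"; // 0

// The fifth argument switches to a case-insensitive comparison.
echo substr_compare($s1, 'WORLD', 6, 5, true), "\n"; // 0

// A negative offset counts from the end of the string.
echo substr_compare($s1, 'World', -5), "\n"; // 0
```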

Simple Searching Functionality

PHP provides a number of very powerful search facilities for which functionality varies from the very simple (and correspondingly faster) to the very complex (and correspondingly slower).

The simplest way to search inside a string is to use the strpos() and strstr() families of functions. The former allows you to find the position of a substring (usually called the needle) inside a string (called the haystack). It returns either the numeric position of the needle’s first occurrence within the haystack, or false if a match could not be found. Here’s an example:

$haystack = "abcdefg";

$needle = 'abc';

if (strpos($haystack, $needle) !== false) {

echo 'Found';

}

Note that, because strings are zero-indexed, it is necessary to use the identity operators when calling strpos() to ensure that a return value of zero—which indicates that the needle occurs right at the beginning of the haystack—is not mistaken for a return value of false.
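The pitfall is easiest to see by comparing the two kinds of test on the same return value. This is only a sketch; the $position variable is ours:

```php
<?php
$haystack = 'abcdefg';
$needle   = 'abc';

$position = strpos($haystack, $needle);

var_dump($position);           // int(0): found at the very start
var_dump($position == false);  // bool(true): the loose test is misleading
var_dump($position === false); // bool(false): the strict test is correct
```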

You can also specify an optional third parameter to strpos() to indicate that you want the search to start from a specific position within the haystack. For example:

$haystack = '123456123456';

$needle = '123';

echo strpos($haystack, $needle); // outputs 0

echo strpos($haystack, $needle, 1); // outputs 6

The strstr() function works similarly to strpos() in that it searches the haystack for a needle. The only real difference is that this function returns the portion of the haystack that starts with the needle instead of the latter’s position:

$haystack = '123456';

$needle = '34';

echo strstr($haystack, $needle); // outputs 3456

In general, strstr() is slower than strpos(). Therefore, you should use the latter if your only goal is to determine whether a certain needle occurs inside the haystack. Also, note that you cannot force strstr() to start looking for the needle from a given location by passing a third parameter.
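Since PHP 5.3, strstr() does accept a third parameter, but it is a boolean rather than an offset: when it is true, the function returns the part of the haystack before the needle. A small sketch with an illustrative address:

```php
<?php
$user = 'davey@example.com';

// Default behavior: everything from the needle onwards.
echo strstr($user, '@'), "\n"; // @example.com

// With the third argument set to true: everything before the needle.
echo strstr($user, '@', true), "\n"; // davey
```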

Both strpos() and strstr() are case sensitive and start looking for the needle from the beginning of the haystack. However, PHP provides variants that work in a case-insensitive way or start looking for the needle from the end of the haystack. For example:

// Case-insensitive search

echo stripos('Hello World', 'hello'); // outputs zero

echo stristr('Hello My World', 'my'); // outputs "My World"

// Reverse search

echo strrpos('123123', '123'); // outputs 3

Matching Against a Mask

You can use the strspn() function to match a string against a “whitelist” mask of allowed characters. This function returns the length of the initial segment of the string that consists entirely of characters specified in the mask:

$string = '133445abcdef';

$mask = '12345';

echo strspn($string, $mask); // Outputs 6

The strcspn() function works just like strspn(), but uses a blacklist approach instead. In other words, the mask is used to specify which characters are disallowed, and the function returns the length of the initial segment of the string that does not contain any of the characters from the mask.
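Here is a brief sketch of the blacklist approach, using illustrative values that mirror the strspn() example:

```php
<?php
$string = 'abcdef133445';
$mask   = '12345';

// Six characters pass before the first character from the mask appears.
echo strcspn($string, $mask), "\n"; // 6

// If no character from the mask occurs, the whole length is returned.
echo strcspn('abc', $mask), "\n"; // 3
```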

Both strspn() and strcspn() accept two optional parameters that define the starting position and the length of the string to examine. For example:

$string = '1abc234';

$mask = 'abc';

echo strspn($string, $mask, 1, 4); // Outputs 3

In the example above, strspn() will start examining the string from the second character (index 1), and continue for up to four characters. However, only the first three characters it encounters satisfy the mask’s constraints and, therefore, the script outputs 3.

Simple Search and Replace Operations

Replacing portions of a string with a different substring is another very common task for PHP developers. Simple substitutions are performed using str_replace() (as well as its case-insensitive variation, str_ireplace()) and substr_replace(). Here’s an example:

// Outputs Hello Reader

echo str_replace("World", "Reader", "Hello World");

// Also outputs Hello Reader

echo str_ireplace("world", "Reader", "Hello World");

In both cases, the function takes three parameters: a needle, a replacement string and a haystack. PHP will look for the needle in the haystack (using either a case-sensitive or case-insensitive search algorithm) and substitute every instance of it with the replacement string. Optionally, you can specify a fourth parameter, passed by reference, that the function fills, upon return, with the number of substitutions made:

$a = 0; // Initialize

str_replace('a', 'b', 'a1a1a1', $a);

echo $a; // outputs 3

If you need to search and replace more than one needle at a time, you can pass the first two arguments to str_replace() in the form of arrays:

Listing 3.3: Using str_replace with array arguments

// outputs Bonjour Monde

echo str_replace(

array("Hello", "World"),

array("Bonjour", "Monde"),

"Hello World"

);

// outputs Bye Bye

echo str_replace(

array("Hello", "World"),

"Bye",

"Hello World"

);

In the first example, the replacements are made based on array indexes. The first element of the search array is replaced by the first element of the replacement array, and the output is “Bonjour Monde”. In the second example, only the needle argument is an array; in this case, both search terms are replaced by the same string resulting in “Bye Bye”.

If you need to replace a portion of a string whose starting and ending points you already know, you can use substr_replace():

echo substr_replace("Hello World", "Reader", 6);

echo substr_replace(

"Canned tomatoes are good", "potatoes", 7, 8

);

The third argument is our starting point (the letter W in the first example); the function replaces the contents of the string from there until the end with the second argument passed to it, thus resulting in the output “Hello Reader”. You can also pass an optional fourth parameter to define the length of the substring that will be replaced (as shown in the second example, which outputs “Canned potatoes are good”).

Combining substr_replace() with strpos() can prove to be a powerful tool. For example:

$user = "davey@example.com";

$name = substr_replace($user, "", strpos($user, '@'));

echo "Hello " . $name;

By using strpos() to locate the first occurrence of the @ symbol, we can replace the rest of the e-mail address with an empty string, leaving us with just the username, which we output in greeting.

Extracting Substrings

The very flexible and powerful substr() function allows you to extract a substring from a larger string. It takes three parameters: the string to be worked on, a starting index and an optional length. The starting index can be specified as either a positive integer (meaning the index of a character in the string starting from the beginning) or a negative integer (meaning the index of a character starting from the end). Here are a few simple examples:

$x = '1234567';

echo substr($x, 0, 3); // outputs 123

echo substr($x, 1, 1); // outputs 2

echo substr($x, -2); // outputs 67

echo substr($x, 1); // outputs 234567

echo substr($x, -2, 1); // outputs 6

Formatting Strings

PHP provides a number of different functions that can be used to format output in a variety of ways. Some of them are designed to handle special data types (for example, numbers or currency values), while others provide a more generic interface for formatting strings according to more complex rules.

Formatting rules are sometimes governed by locale considerations. For example, most English-speaking countries format numbers by using commas as the separators between thousands, and the point as a separator between the integer portion of a number and its fractional part. In many European countries, this custom is reversed: the dot (or a space) separates thousands, and the comma is the fractional delimiter.

In PHP, the current locale is set by calling the setlocale() function, which takes two parameters: a category that indicates which functions are affected by the change, and the name of the locale you want to set. For example, you can change currency formatting (which we’ll examine in a few paragraphs) to reflect the standard US rules by calling setlocale() as in the following example:

setlocale(LC_MONETARY, 'en_US');

Formatting Numbers

Number formatting is typically used when you wish to output a number and separate its digits into thousands and decimal points. The number_format() function, used for this purpose, is not locale-aware. This means that, even if you have a French or German locale set, it will still use periods for decimals and commas for thousands, unless you specify otherwise.

The number_format() function accepts 1, 2 or 4 arguments (but not three). If only one argument is given, the default formatting is used: the number will be rounded to the nearest integer, and a comma will be used to separate thousands. If two arguments are given, the number will be rounded to the given number of decimal places and a period and comma will be used to separate decimals and thousands, respectively. Should you pass in all four parameters, the number will be rounded to the number of decimal places given, and number_format() will use the first character of the third and fourth arguments as decimal and thousand separators respectively.

Here are a few examples:

echo number_format("100000.698"); // Shows 100,001

echo number_format("100000.698", 3, ",", " "); // Shows 100 000,698
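The two-argument form can be sketched in the same way. The second call below also passes an empty string as the thousands separator, which is a handy way to suppress grouping altogether:

```php
<?php
// Two arguments: two decimal places, default separators.
echo number_format(100000.698, 2), "\n"; // 100,000.70

// Four arguments, with an empty thousands separator.
echo number_format(100000.698, 2, '.', ''), "\n"; // 100000.70
```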

Formatting Currency Values

Currency formatting, unlike number formatting, is locale aware and will display the correct currency symbol (either international or national notations—e.g.: USD or $, respectively) depending on how your locale is set.

When using money_format(), we must specify the formatting rules we want to use by passing the function a specially-crafted string that consists of a percent symbol (%) followed by a set of flags that determine the minimum width of the resulting output, its integer and decimal precision, and a conversion character that determines whether the currency value is formatted using the locale’s national or international rules.

The money_format() function is not available on Windows, nor on some variants of UNIX.

For example, to output a currency value using the American national notation with two decimal places, we’d use the following function call:

setlocale(LC_MONETARY, "en_US");

echo money_format('%.2n', "100000.698");

This example displays “$100,000.70”.

If we simply change the locale to Japanese, we can display the number in Yen.

setlocale(LC_MONETARY, "ja_JP.UTF-8");

echo money_format('%.2n', "100000.698");

This time, the output is “¥100,000.70”. Similarly, if we change our formatting to use the i conversion character, money_format() will produce its output using the international notation, for example:

setlocale(LC_MONETARY, "en_US");

echo money_format('%.2i', "100000.698");

setlocale(LC_MONETARY, "ja_JP");

echo money_format('%.2i', "100000.698");

The first example displays “USD 100,000.70”, while the second outputs “JPY 100,000.70”. As you can see, money_format() is a must for any international commerce site that accepts multiple currencies, as it allows you to easily display amounts in currencies that you are not familiar with.

There are two important things that you should keep in mind here. First, a call to setlocale() affects the entire process inside which it is executed, rather than the individual script. Thus, you should be careful to always reset the locale whenever you need to perform a formatting operation, particularly if your application requires the use of multiple locales, or is hosted alongside other applications that may do the same.
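One way to limit the impact of a locale change is to remember the previous setting and restore it afterwards. The sketch below is only one possible approach: withLocale() is a hypothetical helper of our own, and the locale name en_US.UTF-8 is an assumption, since the available names depend on the operating system. Passing the string "0" to setlocale() returns the current setting without changing it.

```php
<?php
// Hypothetical helper: run $callback under $locale, then restore the old one.
function withLocale($category, $locale, $callback)
{
    // Passing "0" only queries the current locale; it does not change it.
    $previous = setlocale($category, '0');

    // setlocale() returns false if the locale is not installed on the system.
    setlocale($category, $locale);

    $result = call_user_func($callback);

    // Put things back the way they were.
    setlocale($category, $previous);

    return $result;
}

$before = setlocale(LC_MONETARY, '0');

// localeconv() reports the formatting rules of the active locale.
$info = withLocale(LC_MONETARY, 'en_US.UTF-8', 'localeconv');

$after = setlocale(LC_MONETARY, '0');

var_dump($before === $after); // bool(true)
```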

In addition, you should keep in mind that the default rounding rules change from locale to locale. For example, US currency values are regularly expressed as dollars and cents, while Japanese currency values are represented as integers. Therefore, if you don’t specify a decimal precision, the same value can yield very different locale-dependent formatted strings:

setlocale(LC_MONETARY, "en_US");

echo money_format('%i', "100000.698");

setlocale(LC_MONETARY, "ja_JP");

echo money_format('%i', "100000.698");

The first example displays “USD 100,000.70”; however, the Japanese output is now “JPY 100,001”. As you can see, the latter value was rounded up to the next integer.

Generic Formatting

If you are not handling numbers or currency values, you can use the printf() family of functions to perform arbitrary formatting of a value. All the functions in this group perform in an essentially identical way: they take an input string that specifies the output format and one or more values. The only difference is in the way they return their results: the “plain” printf() function simply writes it to the script’s output, while other variants may return it (sprintf()), write it out to a file (fprintf()), and so on.

The formatting string usually contains a combination of literal text, which is copied directly into the function’s output, and specifiers that determine how the input should be formatted. The specifiers are then used to format each input parameter in the order in which they are passed to the function (thus, the first specifier is used to format the first data parameter, the second specifier is used to format the second parameter, and so on).

A formatting specifier always starts with a percent symbol (if you want to insert a literal percent character in your output, you need to escape it as %%) and is followed by a type specification token, which identifies the type of formatting to be applied; a number of optional modifiers can be inserted between the two to affect the output:

· A sign specifier (a plus or minus symbol) to determine how signed numbers are to be rendered

· A padding specifier that indicates what character should be used to make up the required output length, should the input not be long enough on its own

· An alignment specifier that indicates if the output should be left or right aligned

· A numeric width specifier that indicates the minimum length of the output

· A precision specifier that indicates how many decimal digits should be displayed for floating-point numbers
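Each of the modifiers listed above can be sketched with sprintf(). The square brackets in the alignment examples are literal text, added only to make the padding visible:

```php
<?php
echo sprintf('%+d', 5), "\n";        // +5        sign specifier
echo sprintf('%05d', 42), "\n";      // 00042     zero padding, width 5
echo sprintf("%'*8s", 'abc'), "\n";  // *****abc  custom padding character
echo sprintf('[%-6s]', 'ab'), "\n";  // [ab    ]  left alignment, width 6
echo sprintf('[%6s]', 'ab'), "\n";   // [    ab]  right alignment (default)
echo sprintf('%.2F', 3.14159), "\n"; // 3.14      precision
echo sprintf('%d%%', 50), "\n";      // 50%       literal percent sign
```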

It is important that you be familiar with some of the most commonly-used type specifiers:

Type Specifier   Description
b                Output an integer as a binary number
c                Output the character which has the input integer as its ASCII value
d                Output a signed decimal number
e                Output a number using scientific notation (e.g., 3.8e+9)
u                Output an unsigned decimal number
f                Output a locale-aware float number
F                Output a non-locale-aware float number
o                Output a number using its octal representation
s                Output a string
x                Output a number as hexadecimal with lowercase letters
X                Output a number as hexadecimal with uppercase letters
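As a quick sketch, several of these type specifiers can be tried out with sprintf(); the input values are arbitrary:

```php
<?php
echo sprintf('%b', 5), "\n";     // 101
echo sprintf('%c', 65), "\n";    // A
echo sprintf('%d', -42), "\n";   // -42
echo sprintf('%u', 3), "\n";     // 3
echo sprintf('%.2F', 1.5), "\n"; // 1.50
echo sprintf('%o', 8), "\n";     // 10
echo sprintf('%s', 'PHP'), "\n"; // PHP
echo sprintf('%x', 255), "\n";   // ff
echo sprintf('%X', 255), "\n";   // FF
```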

Here are some simple examples of printf() usage:

Listing 3.4: Using printf

$n = 123;

$f = 123.45;

$s = "A string";

printf("%d", $n); // prints 123

printf("%d", $f); // prints 123

// Prints "The string is A string"

printf("The string is %s", $s);

// Example with precision

printf("%3.3f", $f); // prints 123.450

// Complex formatting

function showError($msg, $line, $file) {

return sprintf("An error occurred in %s on " .

"line %d: %s", $file, $line, $msg);

}

echo showError("Invalid confibulator", __LINE__, __FILE__);

Parsing Formatted Input

The sscanf() family of functions works in a similar way to printf(), except that, instead of formatting output, it allows you to parse formatted input. For example, consider the following:

$data = '123 456 789';

$format = '%d %d %d';

var_dump(sscanf($data, $format));

When this code is executed, the function interprets its input according to the rules specified in the format string and returns an array that contains the parsed data:

array(3) {

[0]=>

int(123)

[1]=>

int(456)

[2]=>

int(789)

}

Note that the data must match the format passed to sscanf() exactly, or the function will fail to retrieve all the values. For this reason, sscanf() is normally only useful in those situations in which input follows a well-defined format (that is, it is not provided by the user!).
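Because a mismatch leaves some values unparsed, it is worth checking the result before using it. The following sketch wraps sscanf() in a hypothetical parseDate() helper of our own, which returns null unless every field was read:

```php
<?php
// Hypothetical helper: parse "YYYY-MM-DD" or return null on any mismatch.
function parseDate($input)
{
    $parts = sscanf($input, '%4d-%2d-%2d');

    // A failed parse leaves null entries (or yields a non-array result).
    if (!is_array($parts) || in_array(null, $parts, true)) {
        return null;
    }

    return $parts;
}

var_dump(parseDate('2014-03-15')); // an array holding 2014, 3 and 15
var_dump(parseDate('not a date')); // NULL
```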

Perl Compatible Regular Expressions

Perl Compatible Regular Expressions (normally abbreviated as “PCRE”) offer a very powerful string-matching and replacement mechanism that far surpasses anything we have examined so far.

Since PHP 5.3, PCRE is always enabled.

PHP also offers POSIX-compatible regular expressions, which use the ereg_* family of functions. POSIX regular expressions are simpler; however, they are also less capable and slower than PCRE regular expressions and are officially deprecated since PHP 5.3.

Regular expressions are often thought of as very complex, and they can be at times. However, their basic building blocks are relatively simple to understand and fairly easy to use. Because of their power, of course, they are also much more computationally intensive than the simple search-and-replace functions we examined earlier in this chapter. Therefore, you should use them only when appropriate; that is, when using the simpler functions is either impossible or so complicated that it’s not worth the effort.

A regular expression is a string that describes a set of matching rules. The simplest possible regular expression is one that matches only one string; for example, Davey matches only the string “Davey”. In fact, such a simple regular expression would be pointless, as you could just as easily perform the match using strpos(), which is a much faster alternative.

The real power of regular expressions comes into play when you don’t know the exact string that you want to match. In this case, you can specify one or more meta-characters and quantifiers, which do not have a literal meaning, but instead stand to be interpreted in a special way.

In this chapter, we will discuss the basics of regular expressions that are required by the exam. More thorough coverage can be found in the PHP manual, or in one of the many regular expression books available (most notably, Mastering Regular Expressions, by Jeffrey Friedl, published by O’Reilly Media).

Delimiters

A regular expression is always delimited by a starting and an ending character. Any character that is not alphanumeric, whitespace or a backslash can be used for this purpose, as long as the beginning and ending delimiters match. Since any occurrence of the delimiter inside the expression itself must be escaped, it’s usually a good idea to pick one that isn’t likely to appear in the expression. By convention, the forward slash (/) is used, although another character, such as @ or #, is common when the pattern itself contains forward slashes.
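The effect of the delimiter choice can be sketched with preg_match(), which returns 1 when the pattern matches and 0 when it does not. The URL is just an illustrative value:

```php
<?php
$url = 'http://example.com/path';

// With / as the delimiter, every slash in the pattern must be escaped.
var_dump(preg_match('/^http:\/\//', $url)); // int(1)

// Choosing # as the delimiter avoids the escaping altogether.
var_dump(preg_match('#^http://#', $url)); // int(1)
```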

Metacharacters

The term “metacharacter” is a bit of a misnomer—as a metacharacter can actually be composed of more than one character. However, every metacharacter represents a single character in the matched expression. Here are the most common ones:

Metacharacter   Description
.               Match any character
^               Match the start of the string
$               Match the end of the string
\s              Match any whitespace character
\d              Match any digit
\w              Match any “word” character
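A short sketch shows these metacharacters at work. The subject strings are arbitrary, and preg_match() returns 1 for a match and 0 otherwise:

```php
<?php
// A word character, one whitespace character, then a digit, and nothing else.
$pattern = '/^\w\s\d$/';

var_dump(preg_match($pattern, 'a 1'));  // int(1)
var_dump(preg_match($pattern, 'a-1'));  // int(0): "-" is not whitespace
var_dump(preg_match($pattern, 'a 1!')); // int(0): the string must end after the digit

// The dot stands for any single character.
var_dump(preg_match('/^a.c$/', 'abc')); // int(1)
var_dump(preg_match('/^a.c$/', 'ac'));  // int(0)
```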

Metacharacters can also be expressed using grouping expressions. For example, a series of valid alternatives for a character can be provided by using square brackets:

/ab[cd]e/

The expression above will match both abce and abde. You can also use other metacharacters, and provide ranges of valid characters inside a grouping expression:

/ab[c-e\d]/

This will match abc, abd, abe and any combination of ab followed by a digit.
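To try this out, we can anchor the same expression with ^ and $ (our addition, so that nothing else may surround the match) and test a few candidate strings:

```php
<?php
$pattern = '/^ab[c-e\d]$/';

var_dump(preg_match($pattern, 'abc')); // int(1)
var_dump(preg_match($pattern, 'abe')); // int(1)
var_dump(preg_match($pattern, 'ab7')); // int(1)
var_dump(preg_match($pattern, 'abz')); // int(0): z is outside the range c-e
```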

Quantifiers

A quantifier allows you to specify the number of times a particular character or metacharacter can appear in a matched string. There are four types of quantifiers:

Quantifier   Description
*            The character can appear zero or more times
+            The character can appear one or more times
?            The character can appear zero or one times
{n,m}        The character can appear at least n times, and no more than m. The maximum can be omitted ({n,}) to indicate a minimum limit with no maximum; to express a maximum limit without a minimum, use zero as the minimum ({0,m}).

Thus, for example, the expression ab?c matches both ac and abc, while ab{1,3}c matches abc, abbc and abbbc.
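These cases can be checked with a quick sketch. The ^ and $ anchors are our addition, so that the whole subject has to match:

```php
<?php
var_dump(preg_match('/^ab?c$/', 'ac'));   // int(1)
var_dump(preg_match('/^ab?c$/', 'abc'));  // int(1)
var_dump(preg_match('/^ab?c$/', 'abbc')); // int(0): at most one b allowed

var_dump(preg_match('/^ab{1,3}c$/', 'abbbc'));  // int(1)
var_dump(preg_match('/^ab{1,3}c$/', 'abbbbc')); // int(0): four b's are too many

var_dump(preg_match('/^ab*c$/', 'ac')); // int(1): zero b's are fine
var_dump(preg_match('/^ab+c$/', 'ac')); // int(0): at least one b required
```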

Greediness

By default quantifiers are greedy, which means they will try to match as much of the string as possible (up to the maximum number of allowed times). For example, if we want to match markdown syntax for inline code, we might use the simple expression:

/`(.*)`/

However, given the string some `code` and `more code` here, the expression will match everything from the first backtick to the last, including the word “and” and the two extra backticks in the middle, rather than each individual occurrence of code.

To make a quantifier ungreedy, simply follow it by a question mark ?:

/`(.*?)`/

Doing this will match the blocks individually.
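The difference can be seen by running both expressions over the sample string with preg_match_all(), which is described later in this chapter. This is a minimal sketch; the variable names are hypothetical:

```php
<?php
$text = 'some `code` and `more code` here';

// Greedy: .* grabs as much as it can, up to the last backtick.
preg_match_all('/`(.*)`/', $text, $greedy);

// Ungreedy: .*? stops at the first backtick that lets the match succeed.
preg_match_all('/`(.*?)`/', $text, $lazy);

print_r($greedy[1]); // a single capture spanning both code blocks
print_r($lazy[1]);   // two captures, one per code block
```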

Modifiers

You can change the behavior of the expression by adding a pattern modifier after the closing delimiter. For example: /expression/<modifier>.

Modifier    Description
i           Case-insensitive expression
m           Indicates that you are matching against a multi-line string, and that ^ and $ should match the start and end of each line (delimited by a newline (\n) character). By default, the subject is treated as a single line, and ^ and $ will match the start and end of the string respectively.
s           When this modifier is used, the . (any) metacharacter will also match newlines.
x           This modifier will ignore regular whitespace (e.g. spaces and newlines) unless escaped, or inside character classes. Additionally, it will ignore all characters between an unescaped # and the next newline, allowing you to add comments.
e           The e modifier can only be used in preg_replace(), and is also known as the “eval” modifier. It allows you to use valid PHP code as the replacement string, which will be evaluated. This is considered bad practice for security reasons and is deprecated since PHP 5.5. Its use is highly discouraged.
U           This modifier inverts the behavior of “greediness”, meaning that quantifiers are not greedy by default, and those followed by a ? become greedy.
u           Treats both the pattern and subject as UTF-8 strings. This is particularly important because it will match characters instead of bytes.
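The following sketch exercises several of these modifiers. The sample strings and variable names are hypothetical, and the UTF-8 character is written as explicit bytes so that the result does not depend on the encoding of the source file:

```php
<?php
// i: case-insensitive matching.
$caseInsensitive = preg_match('/php/i', 'I love PHP');

// m: ^ and $ match at the start and end of every line.
$lines    = "first\nsecond\nthird";
$withoutM = preg_match_all('/^\w+$/', $lines, $ignored);
$withM    = preg_match_all('/^\w+$/m', $lines, $perLine);

// s: the dot also matches a newline.
$dotDefault = preg_match('/a.b/', "a\nb");
$dotAll     = preg_match('/a.b/s', "a\nb");

// x: whitespace and # comments inside the pattern are ignored.
$extended = preg_match('/
    ^(\d{4})   # year
    -(\d{2})   # month
/x', '2014-06', $dateParts);

// u: "é" is two bytes in UTF-8, but a single character.
$eAcute    = "\xC3\xA9";
$asBytes   = preg_match('/^.$/', $eAcute);
$asUnicode = preg_match('/^.$/u', $eAcute);

var_dump($caseInsensitive, $withoutM, $withM, $dotDefault, $dotAll,
    $extended, $asBytes, $asUnicode);
```

The e modifier is deliberately left out, since its use is discouraged.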

Sub-Expressions

A sub-expression is a regular expression contained within the main regular expression (or another sub-expression); you define one by encapsulating it in parentheses:

/a(bc.)e/

This expression will match the letter a, followed by the letters b and c, followed by any character and, finally, the letter e. As you can see, sub-expressions by themselves do not have any influence on the way a regular expression is executed; however, you can use them in conjunction with quantifiers to allow complex expressions to happen more than once. For example:

/a(bc.)+e/

This expression will match the letter a, followed by the expression bc. repeated one or more times, followed by the letter e.

Sub-expressions can also be used as capturing patterns, which we will examine in the next section.
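As a minimal sketch of a quantified sub-expression, the script below tests the second expression against a few hypothetical inputs. The ^ and $ anchors are an addition, used here so that the whole sample has to match:

```php
<?php
// "a", then one or more repetitions of "bc" plus any character, then "e".
$regex = '/^a(bc.)+e$/';

// Hypothetical sample inputs.
$samples = array('abcxe', 'abcxbcye', 'ae', 'abce');

$results = array();
foreach ($samples as $sample) {
    $results[$sample] = preg_match($regex, $sample);
}

print_r($results);
```

The first two samples match (one and two repetitions of the sub-expression respectively). The sample ae fails because the sub-expression must appear at least once, and abce fails because the any-character slot consumes the e, leaving nothing for the final e.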

Matching and Extracting Strings

The preg_match() function can be used to match a regular expression against a given string. It returns the integer 1 if the pattern matches, 0 if it does not, and FALSE if an error occurs. If an optional third parameter is passed, it is filled with an array of all the captured subpatterns. Here’s an example:

Listing 3.5: Using preg_match

$name = "Davey Shafik";

// Simple match: only letters and whitespace, from start to end
$regex = '/^[a-zA-Z\s]+$/';
if (preg_match($regex, $name)) {
    // Valid Name
}

// Match with subpatterns and capture
$regex = '/^(\w+)\s(\w+)/';
$matches = array();
if (preg_match($regex, $name, $matches)) {
    var_dump($matches);
}

If you run the second example, you will notice that the $matches array is populated, on return, with the following values:

array(3) {
  [0]=>
  string(12) "Davey Shafik"
  [1]=>
  string(5) "Davey"
  [2]=>
  string(6) "Shafik"
}

As you can see, the first element of the array contains the entire matched string, while the second element (index 1) contains the first captured subpattern, and the third element contains the second matched subpattern.

Named Matches

For convenience, we can also name our matches by adding ?<name> or ?'name' immediately after the opening parenthesis of a sub-expression:

Listing 3.6: Named matches in regular expressions

// Match with subpatterns and capture
$name = "Davey Shafik";
$regex = '/^(?<firstname>\w+)\s(?<lastname>\w+)/';
$matches = array();
if (preg_match($regex, $name, $matches)) {
    var_dump($matches);
}

This will now output:

array(5) {
  [0] =>
  string(12) "Davey Shafik"
  'firstname' =>
  string(5) "Davey"
  [1] =>
  string(5) "Davey"
  'lastname' =>
  string(6) "Shafik"
  [2] =>
  string(6) "Shafik"
}

Performing Multiple Matches

The preg_match_all() function allows you to perform multiple matches on a given string based on a single regular expression. For example:

Listing 3.7: Multiple matches

$string = "a1bb b2cc c2dd";
$regex = "#([abc])\d#";
$matches = array();
if (preg_match_all($regex, $string, $matches)) {
    var_dump($matches);
}

This script outputs the following:

array(2) {
  [0]=>
  array(3) {
    [0]=>
    string(2) "a1"
    [1]=>
    string(2) "b2"
    [2]=>
    string(2) "c2"
  }
  [1]=>
  array(3) {
    [0]=>
    string(1) "a"
    [1]=>
    string(1) "b"
    [2]=>
    string(1) "c"
  }
}

As you can see, all the whole-pattern matches are stored in the first sub-array of the result, while the first captured subpattern of every match is stored in the corresponding slot of the second sub-array.

Capture Flags

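The fourth argument for preg_match_all() is a combination of flags (a bit mask). PREG_OFFSET_CAPTURE can be combined with either of the other two, but it makes no sense to combine PREG_PATTERN_ORDER with PREG_SET_ORDER. There are three flags possible: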

Flag                   Description
PREG_PATTERN_ORDER     The default behavior; the result is an array of capture group matches. That is, all whole-pattern matches are in the first key (0), all matches for the first capture group are in the second key (1), etc.
PREG_SET_ORDER         With this flag, each match is an array whose first array index is the entire matched string, and each subsequent index is a capture group. This is particularly useful when using named capture groups.
PREG_OFFSET_CAPTURE    This flag will result in each matching string in the result being an array with the string itself and its byte offset from the beginning of the string.
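The following sketch reuses the string and expression from Listing 3.7 to compare the shape of the result under each flag. The names of the result variables are hypothetical:

```php
<?php
$string = "a1bb b2cc c2dd";
$regex  = '#([abc])\d#';

// Default ordering: one sub-array per pattern or capture group.
preg_match_all($regex, $string, $byPattern, PREG_PATTERN_ORDER);

// Set ordering: one sub-array per match.
preg_match_all($regex, $string, $bySet, PREG_SET_ORDER);

// Set ordering plus offsets: every string becomes array(string, offset).
preg_match_all($regex, $string, $withOffsets,
    PREG_SET_ORDER | PREG_OFFSET_CAPTURE);

print_r($byPattern[0]);      // all whole-pattern matches
print_r($bySet[0]);          // first match: whole pattern, then capture group
print_r($withOffsets[1][0]); // second match: whole pattern and its offset
```

With PREG_SET_ORDER, a simple foreach over the result visits one complete match at a time, which is often the more natural shape for processing.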

Using PCRE to Replace Strings

Whilst str_replace() is quite flexible, it still only works on “whole” strings, where you know the exact text to search for. Using preg_replace(), however, you can replace text that matches a pattern you specify. It is even possible to reuse captured subpatterns directly in the substitution string by prefixing their index with a dollar sign. In the example below, we use this technique to replace the entire matched pattern with a string that is composed using the first captured subpattern ($1).

$body = "[b]Make Me Bold![/b]";
$regex = "@\[b\](.*?)\[/b\]@i";
$replacement = '<b>$1</b>';
$body = preg_replace($regex, $replacement, $body);

Just as with str_replace(), we can pass arrays of search and replacement arguments, and we can also pass in an array of subjects on which to perform the search-and-replace operation. This can speed things up considerably, since each regular expression is compiled once and reused for every subject. Here’s an example:

Listing 3.8: Multiple arguments with preg_replace

$subjects['body'] = "[b]Make Me Bold![/b]";
$subjects['subject'] = "[i]Make Me Italics![/i]";

$regex[] = "@\[b\](.*?)\[/b\]@i";
$regex[] = "@\[i\](.*?)\[/i\]@i";

// Single quotes make it explicit that $1 is a backreference, not a PHP variable
$replacements[] = '<b>$1</b>';
$replacements[] = '<i>$1</i>';

$results = preg_replace($regex, $replacements, $subjects);

When you execute the code shown above, you will end up with an array that looks like this:

array(2) {
  ["body"]=>
  string(20) "<b>Make Me Bold!</b>"
  ["subject"]=>
  string(23) "<i>Make Me Italics!</i>"
}

Notice how the resulting array keeps the keys of the $subjects array that we passed in. The $subjects array itself is not passed by reference, nor is it modified.

Summary

This chapter covered what is most likely going to be the bulk of your work as a developer: manipulating strings. While regular expressions may be complex, they are extremely powerful. Just remember that with great power comes great responsibility; in this case, don’t use them if you don’t have to. Never underestimate the power of the string functions and regular expressions.