Content Formatting with Regular Expressions - PHP & MySQL: Novice to Ninja, 5th Edition (2012)

Chapter 8. Content Formatting with Regular Expressions

We’re almost there! We’ve designed a database to store jokes, organized them into categories, and tracked their authors. We’ve learned how to create a web page that displays this library of jokes to site visitors. We’ve even developed a set of web pages that a site administrator can use to manage the joke library without knowing anything about databases.

In so doing, we’ve built a site that frees the resident webmaster from continually having to plug new content into tired HTML page templates, and from maintaining an unmanageable mass of HTML files. The HTML is now kept completely separate from the data it displays. If you want to redesign the site, you simply have to make the changes to the HTML contained in the PHP templates that you’ve constructed. A change to one file (for example, modifying the footer) is immediately reflected in the page layouts of all pages in the site.

Only one task still requires knowledge of HTML: content formatting. On any but the simplest of websites, it will be necessary to allow content (in our case, jokes) to include some sort of formatting. In a simple case, this might merely be the ability to break text into paragraphs. Often, however, content providers will expect facilities such as bold or italic text, hyperlinks, and so on.

Supporting these requirements with our current code is deceptively easy. In the past couple of chapters, we’ve used htmlout to output user-submitted content:

chapter7/jokes/jokes.html.php (excerpt)

<?php htmlout($joke['text']); ?>

If, instead, we just echo out the raw content pulled from the database, we can enable administrators to include formatting in the form of HTML code in the joke text:

<?php echo $joke['text']; ?>

Following this simple change, a site administrator could include HTML tags that would have their usual effect on the joke text when inserted into a page. But is this really what we want? Left unchecked, content providers can do a lot of damage by including HTML code in the content they add to your site’s database. Particularly if your system will be enabling nontechnical users to submit content, you’ll find that invalid, obsolete, and otherwise inappropriate code will gradually infest the pristine website you set out to build. With one stray tag, a well-meaning user could tear apart the layout of your site. In this chapter, you’ll learn about several new PHP functions that specialize in finding and replacing patterns of text in your site’s content. I’ll show you how to use these capabilities to provide a simpler markup language for your users that’s better suited to content formatting. By the time we’ve finished, we’ll have completed a content management system that anyone with a web browser can use—no knowledge of HTML required.

Regular Expressions

To implement our own markup language, we’ll have to write some PHP code to spot our custom tags in the text of jokes and then replace them with their HTML equivalents. For tackling this sort of task, PHP includes extensive support for regular expressions. A regular expression is a short piece of code that describes a pattern of text that may occur in content like our jokes. We use regular expressions to search for and replace patterns of text. They’re available in many programming languages and environments, and are especially prevalent in web development languages like PHP. The popularity of regular expressions has everything to do with how useful they are, and absolutely nothing to do with how easy they are to use—because they’re not at all easy. In fact, to most people who encounter them for the first time, regular expressions look like what might eventuate if you fell asleep with your face on the keyboard. Here, for example, is a relatively simple (yes, really!) regular expression that will match any string that might be a valid email address:

/^[\w\.\-]+@([\w\-]+\.)+[a-z]+$/i

Scary, huh? By the end of this section, you’ll actually be able to make sense of that. The language of a regular expression is cryptic enough that, once you master it, you may feel as if you’re able to weave magical incantations with the code that you write. To begin with, let’s start with some very simple regular expressions. This is a regular expression that searches for the text “PHP” (without the quotes):

/PHP/

Fairly simple, right? It’s the text for which you want to search surrounded by a pair of matching delimiters. Traditionally, slashes (/) are used as regular expression delimiters, but another common choice is the hash character (#). You can actually use any character as a delimiter except letters, numbers, backslashes (\), or whitespace. I’ll use slashes for all the regular expressions in this chapter.

Tip: Escape Delimiter Characters

To include a forward slash as part of a regular expression that uses forward slashes as delimiters, you must escape it with a preceding backslash (\/); otherwise, it will be interpreted as marking the end of the pattern. The same goes for other delimiter characters: if you use hash characters as delimiters, you’ll need to escape any hashes within the expression with backslashes (\#).

To use a regular expression, you must be familiar with the regular expression functions available in PHP. preg_match is the most basic, and can be used to determine whether a regular expression is matched by a particular text string. Consider this code:

chapter8/preg_match1/index.php

<?php

$text = 'PHP rules!';

if (preg_match('/PHP/', $text))

{

$output = '$text contains the string “PHP”.';

}

else

{

$output = '$text does not contain the string “PHP”.';

}

include 'output.html.php';

In this example, the regular expression finds a match because the string stored in the variable $text contains “PHP”. This example will therefore output the message shown in Figure 8.1 (note that the single quotes around the strings in the code prevent PHP from filling in the value of the variable $text).

Figure 8.1. The regular expression finds a match
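With preg_match in hand, the earlier tip about escaping delimiter characters can be checked directly. The following is only a small sketch; the sample string and variable names are invented for illustration. It searches for text containing a forward slash, first with slash delimiters (where the slash must be escaped) and then with hash delimiters (where it needn't be).

```php
<?php
// Hypothetical sample text containing a forward slash.
$path = 'includes/helpers.inc.php';

// With slash delimiters, the literal slash must be escaped as \/.
$withSlashes = preg_match('/includes\/helpers/', $path);

// With hash delimiters, the slash needs no escaping.
$withHashes = preg_match('#includes/helpers#', $path);

var_dump($withSlashes); // int(1)
var_dump($withHashes);  // int(1)
```

Both calls describe the same pattern; choosing a delimiter that doesn't appear in the pattern simply saves some backslashes.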

By default, regular expressions are case-sensitive; that is, lowercase characters in the expression only match lowercase characters in the string, and uppercase characters only match uppercase characters. If you want to perform a case-insensitive search instead, you can use a pattern modifier to make the regular expression ignore case. Pattern modifiers are single-character flags following the ending delimiter of an expression. The modifier for performing a case-insensitive match is i. So while /PHP/ will only match strings that contain “PHP”, /PHP/i will match strings that contain “PHP”, “php”, or even “pHp”. Here’s an example to illustrate this:

chapter8/preg_match2/index.php

<?php

$text = 'What is Php?';

if (preg_match('/PHP/i', $text))

{

$output = '$text contains the string “PHP”.';

}

else

{

$output = '$text does not contain the string “PHP”.';

}

include 'output.html.php';

Again, as shown in Figure 8.2, this outputs the same message despite the string actually containing “Php”.

Figure 8.2. No need to be picky …

Regular expressions are almost a programming language unto themselves. A dazzling variety of characters have a special significance when they appear in a regular expression. Using these special characters, you can describe in great detail the pattern of characters that a PHP function like preg_match will search for. To show you what I mean, let’s look at a slightly more complex regular expression:

/^PH.*/

The caret (^), the dot (.), and the asterisk (*) are all special characters that have a specific meaning inside a regular expression. Specifically, the caret means “the start of the string,” the dot means “any character,” and the asterisk means “zero or more of the preceding character.” Therefore, the pattern /^PH.*/ matches not only the string “PH”, but “PHP”, “PHX”, “PHP: Hypertext Preprocessor”, and any other string beginning with “PH”. When you first encounter it, regular expression syntax can be downright confusing and difficult to remember, so if you intend to make extensive use of it, a good reference might come in handy. The PHP Manual includes a very thorough regular expression reference, but let’s start with the basics. Here are some of the most commonly used regular expression special characters (try not to lose too much sleep memorizing these), and some simple examples to illustrate how they work:

^ (caret)

The caret matches the start of the string. This excludes any characters—it considers merely the position itself.

$ (dollar)

A dollar character matches the end of the string. This excludes any characters—it considers merely the position itself:

/PHP/ Matches 'PHP rules!' and 'What is PHP?'

/^PHP/ Matches 'PHP rules!' but not 'What is PHP?'

/PHP$/ Matches 'I love PHP' but not 'What is PHP?'

/^PHP$/ Matches 'PHP' and no other string.

. (dot)

This is the wildcard character. It matches any single character except a newline (\n):[44]

/^...$/ Matches any three-character string (no newlines).

* (asterisk)

An asterisk requires that the preceding character appears zero or more times. When matching, the asterisk will be greedy, including as many characters as possible. For example, for the string 'a word here, a word there', the pattern /a.*word/ will match 'a word here, a word'. In order to make a minimal match (just 'a word'), use the question mark character (explained shortly).

+ (plus)

This character requires that the preceding character appears one or more times. When matching, the plus will be greedy (just like the asterisk) unless you use the question mark character.

? (question mark)

This character makes the preceding character optional. If placed after a plus or an asterisk, it instead dictates that the match for this preceding symbol will be a minimal match (also known as non-greedy or lazy matching), including as few characters as possible:

/bana?na/ Matches 'banana' and 'banna', but not 'banaana'.

/bana+na/ Matches 'banana' and 'banaana', but not 'banna'.

/bana*na/ Matches 'banna', 'banana', and 'banaaana',

but not 'bnana'.

/^[a-zA-Z]+$/ Matches any string of one or more letters only.

| (pipe)

The pipe causes the regular expression to match either the pattern on the left of the pipe, or the pattern on the right.

(…) (round brackets)

Round brackets define a group of characters that must occur together, to which you can then apply a modifier like *, +, or ? by placing it after the closing bracket. You can also refer to a bracketed portion of a regular expression later to obtain the portion of the string that it matched:

/^(yes|no)$/ Matches the strings 'yes' and 'no' only.

/ba(na)+na/ Matches 'banana' and 'banananana',

but not 'bana' or 'banaana'.

/ba(na|ni)+/ Matches 'bana' and 'banina',

but not 'naniba'.

[…] (square brackets)

Square brackets define a character class. A character class matches one character out of those listed within the square brackets. A character class can include an explicit list of characters (for instance, [aqz], which is the same as (a|q|z)), or a range of characters (such as [a-z], which is the same as (a|b|c|…|z)). A character class can also be defined so that it matches one character that’s not listed in the brackets. To do this, simply insert a caret (^) after the opening square bracket (so [^a] will match any single character except ‘a’).

Let’s see all these in action:

/[12345]/ Matches '1a' (contains ‘1’) and '39' (contains ‘3’),

but doesn’t match 'a' or '76'.

/[^12345]/ Matches '1a' (contains ‘a’) and '39' (contains ‘9’),

but not '1', or '54'.

/[1-5]/ Equivalent to /[12345]/.

/^[a-z]$/ Matches any single lowercase letter.

/^[^a-z]$/ Matches any single character not a lowercase letter.

/[0-9a-zA-Z]/ Matches any string containing a letter or number.

If you want to use one of these special characters as a literal character to be matched by the regular expression pattern, escape it by placing a backslash (\) before it:

/1\+1=2/ Matches any string containing '1+1=2'.

/\$\$\$/ Matches any string containing '$$$'.
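A handy way to get comfortable with these special characters is to feed a few of the patterns above to preg_match and see which strings they accept. The sketch below does just that with a small table of cases taken from the examples in this section; the $cases and $results variable names are my own, and the short array syntax assumes PHP 5.4 or later.

```php
<?php
// Each entry: pattern, subject string, and whether we expect a match.
$cases = [
    ['/^PHP/',       'PHP rules!',   true],
    ['/^PHP/',       'What is PHP?', false],
    ['/PHP$/',       'I love PHP',   true],
    ['/bana?na/',    'banna',        true],
    ['/bana+na/',    'banna',        false],
    ['/ba(na)+na/',  'banananana',   true],
    ['/^(yes|no)$/', 'maybe',        false],
    ['/[^12345]/',   '54',           false],
    ['/1\+1=2/',     'So 1+1=2.',    true],
];

$results = [];
foreach ($cases as $case) {
    // preg_match returns 1 when the pattern is found in the subject.
    $matched = preg_match($case[0], $case[1]) === 1;
    $results[] = $matched;
    printf("%-14s %-16s %s\n", $case[0], "'{$case[1]}'",
        $matched ? 'match' : 'no match');
}
```

Changing a subject string or a pattern and rerunning the script is usually quicker than reasoning about a tricky expression in your head.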

There are also a number of so-called escape sequences that will match a character that’s either not easily typed, or a certain type of character:

\n

This sequence matches a newline character.

\r

This matches a carriage-return character.

\t

This matches a tab character.

\s

This sequence matches any whitespace character, which includes any newline, carriage-return, or tab character; it’s essentially the same as [ \n\r\t] (PCRE also treats a few rarer characters, such as the form feed, as whitespace).

\S

This matches any nonwhitespace character, and is essentially the same as [^ \n\r\t].

\d

This matches any digit; it’s the same as [0-9].

\D

This sequence matches anything but a digit, and is the same as [^0-9].

\w

This matches any “word” character. It’s the same as [a-zA-Z0-9_].

\W

This sequence matches any “non-word” character, and is the same as [^a-zA-Z0-9_].

\b

This code is a little special because it doesn’t actually match a character. Instead, it matches a word boundary—the start or end of a word.

\B

Like \b, this won’t match a character. Rather, it matches a position in the string that is not a word boundary.

\\

This matches an actual backslash character. So if you want to match the string “\n” exactly, your regular expression would be /\\n/, not /\n/ (which matches a newline character). Similarly, if you wanted to match the string “\\” exactly, your regular expression would be /\\\\/.

Important: \\ becomes \\\\

To use your regular expression with a PHP function like preg_match, you need to write it as a PHP string. Just like regular expressions, however, PHP uses \\ to indicate a single backslash in a PHP string. A regular expression like /\\n/ must therefore be written in PHP as '/\\\\n/' to work properly. PHP takes the four backslashes to mean two backslashes, which is what you want your regular expression to contain.
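The backslash counting is easier to trust once you’ve seen it run. In this sketch (the sample strings are made up), one string contains a literal backslash followed by the letter n, and the other contains a real newline character; each is tested against both forms of the pattern.

```php
<?php
// A literal backslash followed by the letter n (two characters).
// In a single-quoted PHP string, \\ produces one backslash.
$literal = 'line1\\nline2';

// A real newline character; this needs a double-quoted PHP string.
$newline = "line1\nline2";

// '/\\\\n/' is the regular expression /\\n/: a backslash, then an n.
var_dump(preg_match('/\\\\n/', $literal)); // int(1)
var_dump(preg_match('/\\\\n/', $newline)); // int(0)

// '/\n/' is the regular expression /\n/: a newline character.
var_dump(preg_match('/\n/', $literal));    // int(0)
var_dump(preg_match('/\n/', $newline));    // int(1)
```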

Believe it or not, we now have everything we need to be able to understand the email address regular expression I showed you at the start of this section:

/^[\w\.\-]+@([\w\-]+\.)+[a-z]+$/i

/

The slash is the starting delimiter that marks the beginning of the regular expression.

^

We match the beginning of the string to make sure that nothing appears before the email address.

[\w\.\-]+

The name portion of the email address is made up of one or more (+) characters that are either “word” characters, dots, or hyphens ([\w\.\-]).

@

The name is followed by the @ character.

([\w\-]+\.)+

Then we have one or more (+) subdomains (such as “sitepoint.”), each of which is one or more “word” characters or hyphens ([\w\-]+) followed by a dot (\.).

[a-z]+

Next, there’s the top-level domain (for example, “com”), which is simply one or more letters ([a-z]+).

$

Finally, we match the end of the string, to make sure that nothing appears after the email address.

/i

The slash is the ending delimiter marking the end of the regular expression. The pattern modifier i following the slash indicates that the letters in the regular expression (such as [a-z]) should be treated case-insensitively.
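To see the whole pattern at work, here is a brief sketch that runs it against a handful of candidate strings. The addresses are invented for the example, and the $candidates and $results names are my own.

```php
<?php
$pattern = '/^[\w\.\-]+@([\w\-]+\.)+[a-z]+$/i';

// Hypothetical inputs: two plausible addresses and two malformed ones.
$candidates = [
    'kevin@sitepoint.com',
    'Some.Body@Mail.Example.ORG',
    'no-at-sign.example.com',
    'two@@example.com',
];

$results = [];
foreach ($candidates as $email) {
    $results[$email] = preg_match($pattern, $email) === 1;
    echo $email, ': ', ($results[$email] ? 'looks valid' : 'rejected'), "\n";
}
```

The first two strings are accepted and the last two are rejected. Bear in mind that the pattern only checks the shape of the string; it can’t tell you whether the mailbox actually exists.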

Got all that? If you’re feeling anything like I was when I first learned regular expressions, you’re probably a little nervous. Okay, so you can follow along with a breakdown of a regular expression that someone else wrote for you, but can you really come up with this gobbledygook yourself? Don’t sweat it: in the rest of this chapter, we’ll look at a bunch more regular expressions, and before you know it you’ll be writing expressions of your own with confidence.

String Replacement with Regular Expressions

As you may recall, we’re aiming in this chapter to make it easier for non-HTML-savvy users to add formatting to the jokes on our website. For example, if a user puts asterisks around a word in the text of a joke (for example, 'Knock *knock*…'), we’d like to display that joke with HTML emphasis tags around that word ('Knock <em>knock</em>…'). We can detect the presence of plain-text formatting such as this in a joke’s text using preg_match with the regular expression syntax we’ve just learned; however, what we need to do is pinpoint that formatting and replace it with appropriate HTML tags. To achieve this, we need to look at another regular expression function offered by PHP: preg_replace.

preg_replace, like preg_match, accepts a regular expression and a string of text, and attempts to match the regular expression in the string. In addition, preg_replace takes another string of text and replaces every match of the regular expression with that string. The syntax for preg_replace is as follows:

$newString = preg_replace(regExp, replaceWith, oldString);

Here, regExp is the regular expression, and replaceWith is the string that will replace matches in oldString. The function returns the new string with all the replacements made. In that code, this newly generated string is stored in $newString. We’re now ready to build our joke formatting function.
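Before we do, here is a minimal sketch of that call in action, using a made-up string. It collapses every run of whitespace (spaces and newlines alike) into a single space, which is the kind of job a plain string search can’t do in one step.

```php
<?php
// Hypothetical input with irregular spacing and blank lines.
$oldString = "Knock   knock.\n\nWho is  there?";

// \s+ matches one or more whitespace characters in a row;
// every such run is replaced with a single space.
$newString = preg_replace('/\s+/', ' ', $oldString);

echo $newString; // Knock knock. Who is there?
```

Note that $oldString itself is left untouched; preg_replace returns a new string.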

Emphasized Text

In Chapter 6, we wrote a helper function, htmlout, for outputting arbitrary text as HTML. This function is housed in a shared include file, helpers.inc.php. Since we’ll now want to output text containing plain-text formatting as HTML, let’s add a new helper function to the file for this purpose:

chapter8/includes/helpers.inc.php (excerpt)

function markdown2html($text)

{

$text = html($text);

// Convert plain-text formatting to HTML

return $text;

}

The plain-text formatting syntax we’ll support is a simplified form of Markdown, created by John Gruber.

Markdown is a text-to-HTML conversion tool for web writers. Markdown allows you to write using an easy-to-read, easy-to-write plain-text format, then convert it to structurally valid XHTML (or HTML).

-- the Markdown home page

Since this helper function will convert Markdown to HTML, it’s named markdown2html. This function’s first action is to use the html helper function to convert any HTML code present in the text into HTML text. We want to avoid any HTML code appearing in the output except that which is generated from plain-text formatting.[45] Let’s start with formatting that will create bold and italic text. In Markdown, you can emphasize text by surrounding it with a pair of asterisks (*), or a pair of underscores (_). Obviously, we’ll replace any such pair with an <em> and </em> tag.[46] To achieve this, we’ll use two regular expressions: one that handles a pair of asterisks and one that handles a pair of underscores. Let’s start with the underscores:

/_[^_]+_/

Breaking this down:

/

We choose our usual slash character to begin (and therefore delimit) our regular expression.

_

There’s nothing special about underscores in regular expressions, so this will simply match an underscore character in the text.

[^_]+

A sequence of one or more characters that aren’t underscores.

_

The second underscore, which marks the end of the italicized text.

/

The end of the regular expression.

Now, it’s easy enough to feed this regular expression to preg_replace, but we have a problem:

$text = preg_replace('/_[^_]+_/', '<em>emphasized text</em>', $text);

The second argument we pass to preg_replace needs to be the text that we want to replace each match with. The problem is, we have no idea what the text that goes between the <em> and </em> tags should be, because it’s part of the text that’s being matched by our regular expression! Thankfully, another feature of preg_replace comes to our rescue. If you surround a portion of the regular expression with round brackets (or parentheses), you can capture the corresponding portion of the matched text and use it in the replacement string. To do this, you’ll use the code $n, where n is 1 for the first parenthesized portion of the regular expression, 2 for the second, and so on, up to 99 for the 99th. Consider this example:

$text = 'banana';

$text = preg_replace('/(.*)(nana)/', '$2$1', $text);

echo $text; // outputs 'nanaba'

So $1 is replaced with the text matched by the first round-bracketed portion of the regular expression ((.*)—zero or more non-newline characters), which is ba in this case. $2 is replaced by nana, which is the text matched by the second round-bracketed portion of the regular expression ((nana)). The replacement string '$2$1', therefore, produces 'nanaba'. We can use the same principle to create our emphasized text, adding a pair of round brackets to our regular expression:

/_([^_]+)_/

These brackets have no effect on how the expression works at all, but they create a group of matched characters that we can reuse in our replacement string:

chapter8/includes/helpers.inc.php (excerpt)

$text = preg_replace('/_([^_]+)_/', '<em>$1</em>', $text);

The pattern to match and replace pairs of asterisks looks much the same, except we need to escape the asterisks with backslashes, since the asterisk character normally has a special meaning in regular expressions:

chapter8/includes/helpers.inc.php (excerpt)

$text = preg_replace('/\*([^\*]+)\*/', '<em>$1</em>', $text);
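Here is a quick sketch of the two emphasis patterns applied to sample text of my own invention. The second part illustrates a limitation that seems worth knowing about: the underscore pattern can’t tell emphasis from underscores that happen to appear inside a word.

```php
<?php
// Both emphasis styles in one hypothetical joke line.
$text = 'Knock *knock*. Who is _there_?';
$text = preg_replace('/_([^_]+)_/', '<em>$1</em>', $text);
$text = preg_replace('/\*([^\*]+)\*/', '<em>$1</em>', $text);
echo $text, "\n"; // Knock <em>knock</em>. Who is <em>there</em>?

// Caveat: underscores inside an ordinary word are treated as markup too.
$name = preg_replace('/_([^_]+)_/', '<em>$1</em>', 'helpers_inc_php');
echo $name, "\n"; // helpers<em>inc</em>php
```

For joke text this is unlikely to matter much, but it’s a reminder that our simplified syntax is more permissive than full Markdown.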

That takes care of emphasized text, but Markdown also supports creating strong emphasis (<strong> tags) by surrounding text with a pair of double asterisks or underscores (**strong emphasis** or __strong emphasis__). Here’s the regular expression to match pairs of double underscores:

/__(.+?)__/s

The double underscores at the start and end are straightforward enough, but what’s going on inside the round brackets? Previously, in our single-underscore pattern, we used [^_]+ to match a series of one or more characters, none of which could be underscores. That works fine when the end of the emphasized text is marked by a single underscore, but when the end is a double underscore we want to allow for the emphasized text to contain single underscores (for example, __text_with_strong_emphasis__). “No underscores allowed,” therefore, won’t cut it—we must come up with some other way to match the emphasized text. You might be tempted to use .+ (one or more characters, any kind), giving us a regular expression like this:[47]

/__(.+)__/s

The problem with this pattern is that the + is greedy—it will cause this portion of the regular expression to gobble up as many characters as it can. Consider this joke, for example:

__Knock-knock.__ Who’s there? __Boo.__ Boo who? __Aw, don’t cry about it!__

When presented with this text, the regular expression above will see just a single match, beginning with two underscores at the start of the joke and ending with two underscores at the end. The rest of the text in between (including all the other double underscores) will be gobbled up by the greedy .+ as the text to be emphasized! To fix this problem, we can ask the + to be non-greedy by adding a question mark after it. Instead of matching as many characters as possible, .+? will match as few characters as possible while still matching the rest of the pattern, ensuring we’ll match each piece of emphasized text (and the double-underscores that surround it) individually. This gets us to our final regular expression:

/__(.+?)__/s
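The difference between the greedy and non-greedy versions is easiest to appreciate side by side. This sketch runs both patterns over a shortened version of the joke above; the variable names are my own.

```php
<?php
$joke = '__Knock-knock.__ Who is there? __Boo.__ Boo who?';

// Greedy: .+ grabs as much as it can, so there is one big match.
$greedy = preg_replace('/__(.+)__/s', '<strong>$1</strong>', $joke);

// Non-greedy: .+? stops at the first closing pair it can find.
$lazy = preg_replace('/__(.+?)__/s', '<strong>$1</strong>', $joke);

echo $greedy, "\n";
// <strong>Knock-knock.__ Who is there? __Boo.</strong> Boo who?

echo $lazy, "\n";
// <strong>Knock-knock.</strong> Who is there? <strong>Boo.</strong> Boo who?
```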

Using the same technique, we can also come up with a regular expression for double-asterisks. This is how the finished code for applying strong emphasis ends up looking:

chapter8/includes/helpers.inc.php (excerpt)

$text = preg_replace('/__(.+?)__/s', '<strong>$1</strong>', $text);

$text = preg_replace('/\*\*(.+?)\*\*/s', '<strong>$1</strong>', $text);

One last point: we must avoid converting pairs of single asterisks and underscores into <em> and </em> tags until after we’ve converted the pairs of double asterisks and underscores in the text into <strong> and </strong> tags. Our markdown2html function, therefore, will apply strong emphasis first, then regular emphasis:

chapter8/includes/helpers.inc.php (excerpt)

function markdown2html($text)

{

$text = html($text);

// strong emphasis

$text = preg_replace('/__(.+?)__/s', '<strong>$1</strong>', $text);

$text = preg_replace('/\*\*(.+?)\*\*/s', '<strong>$1</strong>', $text);

// emphasis

$text = preg_replace('/_([^_]+)_/', '<em>$1</em>', $text);

$text = preg_replace('/\*([^\*]+)\*/', '<em>$1</em>', $text);

return $text;

}
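To see why the order matters, consider this small sketch, which applies the two asterisk patterns to the same made-up text in both orders.

```php
<?php
$text = '**Boo.**';

// Wrong order: the single-asterisk pattern claims the inner pair first,
// leaving nothing for the double-asterisk pattern to match.
$wrong = preg_replace('/\*([^\*]+)\*/', '<em>$1</em>', $text);
$wrong = preg_replace('/\*\*(.+?)\*\*/s', '<strong>$1</strong>', $wrong);

// Right order: strong emphasis first, then regular emphasis.
$right = preg_replace('/\*\*(.+?)\*\*/s', '<strong>$1</strong>', $text);
$right = preg_replace('/\*([^\*]+)\*/', '<em>$1</em>', $right);

echo $wrong, "\n"; // *<em>Boo.</em>*
echo $right, "\n"; // <strong>Boo.</strong>
```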

Paragraphs

While we could choose characters to mark the start and end of paragraphs just as we did for emphasized text, a simpler approach makes more sense. Since your users will type the content into a form field that allows them to create paragraphs using the Enter key, we’ll take a single newline to indicate a line break (<br>) and a double newline to indicate a new paragraph (</p><p>). As I explained earlier, you can represent a newline character in a regular expression as \n. Other whitespace characters you can write this way include a carriage return (\r) and a tab space (\t). Exactly which characters are inserted into text when the user hits Enter depends on the user’s operating system. In general, Windows computers represent a line break as a carriage return followed by a newline (\r\n), whereas Mac computers used to represent it as a single carriage return character (\r). These days, Macs and Linux computers use a single newline character (\n) to indicate a new line.[48] To deal with these different line-break styles, any of which may be submitted by the browser, we must do some conversion:

// Convert Windows (\r\n) to Unix (\n)

$text = preg_replace('/\r\n/', "\n", $text);

// Convert Macintosh (\r) to Unix (\n)

$text = preg_replace('/\r/', "\n", $text);

Note: Regular Expressions in Double Quoted Strings

All the regular expressions we’ve seen so far in this chapter have been expressed as single-quoted PHP strings. The automatic variable substitution provided by double-quoted PHP strings is sometimes more convenient, but double-quoted strings can cause headaches when used with regular expressions. Double-quoted PHP strings and regular expressions share a number of special character escape codes. "\n" is a PHP string containing a newline character. Likewise, /\n/ is a regular expression that will match any string containing a newline character. We can represent this regular expression as a single-quoted PHP string ('/\n/') and all is well, because the code \n has no special meaning in a single-quoted PHP string. If we were to use a double-quoted string to represent this regular expression, we’d have to write "/\\n/", with a double backslash. The double backslash tells PHP to include an actual backslash in the string, rather than combining it with the n that follows it to represent a newline character. This string will therefore generate the desired regular expression, /\n/. Because of the added complexity it introduces, it’s best to avoid using double-quoted strings when writing regular expressions. Note, however, that I have used double quotes for the replacement strings ("\n") passed as the second parameter to preg_replace. In this case, we actually do want to create a string containing a newline character, so a double-quoted string does the job perfectly.
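If you’d like to confirm how PHP reads each of these strings, the following sketch compares them character by character. The variable names are mine, chosen just for this illustration.

```php
<?php
$single = '/\n/';  // four characters: slash, backslash, n, slash
$double = "/\\n/"; // the same four characters, written with double quotes
$raw    = "/\n/";  // three characters: slash, newline, slash

var_dump($single === $double); // bool(true)
var_dump(strlen($single));     // int(4)
var_dump(strlen($raw));        // int(3)
```

The first two strings hand the regular expression engine exactly the same pattern; the third has already had its \n converted to a newline by PHP before the engine ever sees it.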

With our line breaks all converted to newline characters, we can convert them to paragraph breaks (when they occur in pairs) and line breaks (when they occur alone):

// Paragraphs

$text = '<p>' . preg_replace('/\n\n/', '</p><p>', $text) . '</p>';

// Line breaks

$text = preg_replace('/\n/', '<br>', $text);

Note the addition of <p> and </p> tags surrounding the joke text. Because our jokes may contain paragraph breaks, we must make sure the joke text is output within the context of a paragraph to begin with. This code does the trick: the line breaks in the text will now become the natural line- and paragraph-breaks expected by the user, removing the requirement to learn anything new to create this simple formatting. It turns out, however, that there’s a simpler way to achieve the same result in this case—there’s no need to use regular expressions at all! PHP’s str_replace function works a lot like preg_replace, except that it only searches for strings instead of regular expression patterns:

$newString = str_replace(searchFor, replaceWith, oldString);

We can therefore rewrite our line-breaking code as follows:

chapter8/includes/helpers.inc.php (excerpt)

// Convert Windows (\r\n) to Unix (\n)

$text = str_replace("\r\n", "\n", $text);

// Convert Macintosh (\r) to Unix (\n)

$text = str_replace("\r", "\n", $text);

// Paragraphs

$text = '<p>' . str_replace("\n\n", '</p><p>', $text) . '</p>';

// Line breaks

$text = str_replace("\n", '<br>', $text);

str_replace is much more efficient than preg_replace because there’s no need for it to apply the complex rules that govern regular expressions. Whenever str_replace (or str_ireplace, if you need a case-insensitive search) can do the job, you should use it instead of preg_replace.
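Here is a short sketch of the str_replace version at work on a made-up submission that uses Windows-style line endings.

```php
<?php
// Hypothetical form input: one line break, then a paragraph break.
$text = "Knock-knock.\r\nWho is there?\r\n\r\nBoo.";

// Normalize line endings. \r\n must be handled before \r alone,
// or each Windows line break would turn into two newlines.
$text = str_replace("\r\n", "\n", $text);
$text = str_replace("\r", "\n", $text);

// Double newlines become paragraph breaks, then single ones line breaks.
$text = '<p>' . str_replace("\n\n", '</p><p>', $text) . '</p>';
$text = str_replace("\n", '<br>', $text);

echo $text;
// <p>Knock-knock.<br>Who is there?</p><p>Boo.</p>
```

The order of the four steps matters here just as it did with emphasis: paragraphs have to be converted before single line breaks, or no double newlines would be left to find.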

Hyperlinks

While supporting the inclusion of hyperlinks in the text of jokes may seem unnecessary, such a feature makes plenty of sense in other applications. Here’s what a hyperlink looks like in Markdown:[49]

[linked text](link URL)

Simple, right? You put the text of the link in square brackets, and follow it with the URL for the link in round brackets. As it turns out, you’ve already learned everything you need to match and replace links like this with HTML links. If you’re feeling up to the challenge, you should stop reading right here and try to tackle the problem yourself! First, we need a regular expression that will match links of this form. The regular expression is as follows:

/\[([^\]]+)]\(([-a-z0-9._~:\/?#@!$&'()*+,;=%]+)\)/i

This is a rather complicated regular expression. You can see how regular expressions have gained a reputation for being indecipherable! Squint at it for a little while, and see if you can figure out how it works. Grab a pen and break it into parts if you need to. If you have a highlighter pen handy, you might use it to mark the two pairs of parentheses (()) used to capture portions of the matched string: the linked text ($1) and the link URL ($2). Let me break it down for you:

/

As with all our regular expressions, we choose to mark its beginning with a slash.

\[

This matches the opening square bracket ([). Since square brackets have a special meaning in regular expressions, we must escape it with a backslash to have it interpreted literally.

([^\]]+)

First of all, this portion of the regular expression is surrounded with round brackets, so the matching text will be available to us as $1 when we write the replacement string. Inside the round brackets, we’re after the linked text. Because the end of the linked text is marked with a closing square bracket (]), we can describe it as one or more characters, none of which is a closing square bracket ([^\]]+).

]\(

This will match the closing square bracket that ends the linked text, followed by the opening round bracket that signals the start of the link URL. The round bracket needs to be escaped with a backslash to prevent it from having its usual grouping effect. (The square bracket doesn’t need to be escaped with a backslash because there is no unescaped opening square bracket currently in play.)

([-a-z0-9._~:\/?#@!$&'()*+,;=%]+)

Again, the round brackets make the matching text available to us as $2 in the replacement string. As for the gobbledygook inside the brackets, it will match any URL.[50] The square brackets contain a list of characters that may appear in a URL, which is followed by a + to indicate that one or more of these acceptable characters must be present. Within a square-bracketed list of characters, many of the characters that normally have a special meaning within regular expressions lose that meaning. ., ?, +, *, (, and ) are all listed here without the need to be escaped by backslashes. The only character that does need to be escaped in this list is the slash (/), which must be written as \/ to prevent it from being mistaken for the end-of-regular-expression delimiter. Note also that the hyphen (-) is listed first so that it’s treated as a literal character; listing it last or escaping it with a backslash would also work. In the middle of the list, an unescaped hyphen would be taken to indicate a range of characters (as in a-z and 0-9).

\)

This escaped round bracket matches the closing round bracket ()) at the end of the link URL.

/i

We mark the end of the regular expression with a slash, followed by the case-insensitivity flag, i.
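The character-class rules described in this walkthrough are easy to check in isolation. The following is a small sketch rather than part of the chapter’s code archive; the sample strings and variable names are invented for illustration.

```php
<?php
// Illustrative sketch: sample strings and variable names are assumptions.

// A hyphen listed first in the class is treated as a literal hyphen.
$hyphenFirst = preg_match('/^[-a-z]+$/', 'well-known');

// Without it, a-z is only a range, so the hyphen in the subject fails to match.
$rangeOnly = preg_match('/^[a-z]+$/', 'well-known');

// Inside square brackets, . ? + * ( ) need no backslashes.
$specials = preg_match('/^[.?+*()]+$/', '(.+?)*');

// The slash still needs escaping, because / is the pattern delimiter here.
$slash = preg_match('/^[a-z\/]+$/', 'jokes/new');

// The i modifier lets the a-z range accept uppercase letters too.
$caseInsensitive = preg_match('/^[a-z]+$/i', 'HTTP');

var_dump($hyphenFirst, $rangeOnly, $specials, $slash, $caseInsensitive);
```

Run from the command line, the five results should be 1, 0, 1, 1, and 1, in that order.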

We can therefore convert links with the following PHP code:

chapter8/includes/helpers.inc.php (excerpt)

$text = preg_replace(

'/\[([^\]]+)]\(([-a-z0-9._~:\/?#@!$&\'()*+,;=%]+)\)/i',

'<a href="$2">$1</a>', $text);

As you can see, $1 is used in the replacement string to substitute the captured link text, and $2 is used for the captured URL. Additionally, because we’re expressing our regular expression as a single-quoted PHP string, we have to escape the single quote that appears in the list of acceptable characters with a backslash.
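To see the link conversion in action, here is a minimal sketch that applies the same pattern and replacement to a made-up sentence containing two links. The sample text and the $html variable name are assumptions for illustration only.

```php
<?php
// Made-up sample text; the pattern and replacement are the ones shown above.
$text = 'Visit [SitePoint](http://www.sitepoint.com/) '
      . 'or [the PHP manual](http://php.net/preg_replace).';

$html = preg_replace(
    '/\[([^\]]+)]\(([-a-z0-9._~:\/?#@!$&\'()*+,;=%]+)\)/i',
    '<a href="$2">$1</a>',
    $text
);

// Prints:
// Visit <a href="http://www.sitepoint.com/">SitePoint</a> or <a href="http://php.net/preg_replace">the PHP manual</a>.
echo $html, "\n";
```

Because ( and ) are in the list of acceptable URL characters, the + initially swallows the closing round bracket as well; the regular expression engine then backtracks by one character so that the final \) has something to match.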

Putting It All Together

Here’s our finished helper function for converting Markdown to HTML:

chapter8/includes/helpers.inc.php (excerpt)

function markdown2html($text)

{

$text = html($text);

// strong emphasis

$text = preg_replace('/__(.+?)__/s', '<strong>$1</strong>', $text);

$text = preg_replace('/\*\*(.+?)\*\*/s', '<strong>$1</strong>', $text);

// emphasis

$text = preg_replace('/_([^_]+)_/', '<em>$1</em>', $text);

$text = preg_replace('/\*([^\*]+)\*/', '<em>$1</em>', $text);

// Convert Windows (\r\n) to Unix (\n)

$text = str_replace("\r\n", "\n", $text);

// Convert Macintosh (\r) to Unix (\n)

$text = str_replace("\r", "\n", $text);

// Paragraphs

$text = '<p>' . str_replace("\n\n", '</p><p>', $text) . '</p>';

// Line breaks

$text = str_replace("\n", '<br>', $text);

// [linked text](link URL)

$text = preg_replace(

'/\[([^\]]+)]\(([-a-z0-9._~:\/?#@!$&\'()*+,;=%]+)\)/i',

'<a href="$2">$1</a>', $text);

return $text;

}

For added convenience when using this in a PHP template, we’ll add a markdownout function that calls markdown2html and then echoes out the result:

chapter8/includes/helpers.inc.php (excerpt)

function markdownout($text)

{

echo markdown2html($text);

}

We can then use this helper in our two templates that output joke text. First, in the admin pages, we have the joke search results template:

chapter8/admin/jokes/jokes.html.php

<?php include_once $_SERVER['DOCUMENT_ROOT'] .

'/includes/helpers.inc.php'; ?>

<!DOCTYPE html>

<html lang="en">

<head>

<meta charset="utf-8">

<title>Manage Jokes: Search Results</title>

</head>

<body>

<h1>Search Results</h1>

<?php if (isset($jokes)): ?>

<table>

<tr><th>Joke Text</th><th>Options</th></tr>

<?php foreach ($jokes as $joke): ?>

<tr>

<td><?php markdownout($joke['text']); ?></td>

<td>

<form action="?" method="post">

<div>

<input type="hidden" name="id" value="<?php

htmlout($joke['id']); ?>">

<input type="submit" name="action" value="Edit">

<input type="submit" name="action" value="Delete">

</div>

</form>

</td>

</tr>

<?php endforeach; ?>

</table>

<?php endif; ?>

<p><a href="?">New search</a></p>

<p><a href="..">Return to JMS home</a></p>

</body>

</html>

Second, we have the public joke list page:

chapter8/jokes/jokes.html.php

<?php include_once $_SERVER['DOCUMENT_ROOT'] .

'/includes/helpers.inc.php'; ?>

<!DOCTYPE html>

<html lang="en">

<head>

<meta charset="utf-8">

<title>List of Jokes</title>

</head>

<body>

<p>Here are all the jokes in the database:</p>

<?php foreach ($jokes as $joke): ?>

<blockquote>

<?php markdownout($joke['text']); ?>

<p>

(by <a href="mailto:<?php htmlout($joke['email']); ?>">

<?php htmlout($joke['name']); ?></a>)

</p>

</blockquote>

<?php endforeach; ?>

</body>

</html>

With these changes made, take your new plain-text formatting for a spin! Edit a few of your jokes to contain Markdown syntax and verify that the formatting is correctly displayed.
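If you’d like to check the helper outside the browser first, a short command-line script will do. This is a sketch under stated assumptions: the sample joke is invented, and so that the script runs on its own it defines stand-in copies of html and markdown2html (guarded by function_exists). In the real site you would include helpers.inc.php instead.

```php
<?php
// Stand-ins so this sketch runs by itself; normally these functions come
// from includes/helpers.inc.php.
if (!function_exists('html')) {
    function html($text)
    {
        return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
    }
}

if (!function_exists('markdown2html')) {
    function markdown2html($text)
    {
        $text = html($text);
        // strong emphasis, then emphasis
        $text = preg_replace('/__(.+?)__/s', '<strong>$1</strong>', $text);
        $text = preg_replace('/\*\*(.+?)\*\*/s', '<strong>$1</strong>', $text);
        $text = preg_replace('/_([^_]+)_/', '<em>$1</em>', $text);
        $text = preg_replace('/\*([^\*]+)\*/', '<em>$1</em>', $text);
        // normalize line endings, then paragraphs and line breaks
        $text = str_replace("\r\n", "\n", $text);
        $text = str_replace("\r", "\n", $text);
        $text = '<p>' . str_replace("\n\n", '</p><p>', $text) . '</p>';
        $text = str_replace("\n", '<br>', $text);
        // [linked text](link URL)
        $text = preg_replace(
            '/\[([^\]]+)]\(([-a-z0-9._~:\/?#@!$&\'()*+,;=%]+)\)/i',
            '<a href="$2">$1</a>',
            $text
        );
        return $text;
    }
}

// Invented sample joke with Windows-style line breaks.
$joke = "Why did the **chicken** cross the _road_?\r\n\r\n"
      . "To read [this page](http://example.com/)!";

$html = markdown2html($joke);

// Prints:
// <p>Why did the <strong>chicken</strong> cross the <em>road</em>?</p><p>To read <a href="http://example.com/">this page</a>!</p>
echo $html, "\n";
```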

Tip: Use the PHP Markdown Library

What’s nice about adopting a formatting syntax like Markdown for your own website is that there’s often plenty of open-source code out there to help you deal with it. Your newfound regular expression skills will serve you well in your career as a web developer, but if you want to support Markdown formatting on your site, the easiest approach is to avoid writing all the Markdown-handling code yourself! Instead, a quick Google search will find you the PHP Markdown project, from which you can download a markdown.php file that you can drop in your site’s includes folder. You can then use the Markdown function it contains in your markdown2html helper function:

function markdown2html($text)

{

$text = html($text);

include_once $_SERVER['DOCUMENT_ROOT'] . '/includes/markdown.php';

return Markdown($text);

}

Go ahead and give this a try. Make sure your formatting still works, and then curse me for dragging you through the ordeal of regular expressions when you could have avoided it. (Seriously, it’s a handy skill.)

Real World Content Submission

It seems a shame to have spent so much time and effort on a content management system that’s really easy to use, when the only people who are actually allowed to use it are the site administrators. Furthermore, while it’s extremely convenient for an administrator not to have to edit HTML when making updates to the site’s content, submitted documents still need to be transcribed into the “Add new joke” form, and any formatted text converted into Markdown—a tedious and mind-numbing task, to say the least.

What if we put the “Add new joke” form in the hands of casual site visitors? If you recall, we actually did this in Chapter 4 when we put an Add your own joke link on our public joke list page, through which users could submit their own jokes. At the time, this was simply a device that demonstrated how INSERT statements could be made from within PHP scripts, and we’ve since removed it (because it was incompatible with some changes we made to our database structure), but given how easy Markdown is to write, it sure would be nice to put a joke submission form back in the hands of our visitors.

In the next chapter, we’ll introduce access control to your joke database, making your website one that could survive in the real world. Most importantly, you’ll limit access to the admin pages for the site to authorized users only. But perhaps more excitingly, we will revisit the idea of accepting joke submissions from your visitors.


[44] If you put an s pattern modifier at the end of your regular expression, the dot character will also match newlines.

[45] Technically, this breaks one of the features of Markdown: support for inline HTML. “Real” Markdown can contain HTML code, which will be passed through to the browser untouched. The idea is that you can use HTML to produce any formatting that is too complex to create using Markdown’s plain-text formatting syntax. Since we don’t want to allow this, it might be more accurate to say we’ll support Markdown-style formatting.

[46] You may be more accustomed to using <b> and <i> tags for bold and italic text; however, I’ve chosen to respect the most recent HTML standards, which recommend using the more meaningful <strong> and <em> tags, respectively. If bold text doesn’t necessarily indicate strong emphasis in your content, and italic text isn’t representative of emphasis, you might want to use <b> and <i> instead.

[47] The s pattern modifier at the end of the regular expression ensures that the dot (.) will truly match any character, including newlines.

[48] In fact, the type of line breaks used can vary between software programs on the same computer. If you’ve ever opened a text file in Notepad to see all the line breaks missing, you’ve experienced the frustration this can cause. Advanced text editors used by programmers usually let you specify the type of line breaks to use when saving a text file.

[49] Markdown also supports a more advanced link syntax where you put the link URL at the end of the document, as a footnote. But we won’t be supporting that kind of link in our simplified Markdown implementation.

[50] It will also match some strings that are invalid URLs, but it’s close enough for our purposes. If you’re especially intrigued by regular expressions, you might want to check out RFC 3986, the official standard for URLs. Appendix B of this specification demonstrates how to parse a URL with a rather impressive regular expression.
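As an illustration of footnote 50, here is a minimal sketch that applies the regular expression given in Appendix B of RFC 3986 from PHP. The sample URL, the ~ delimiters, and the variable names are assumptions made for this example; the pattern itself is the one given in the RFC, which assigns the scheme, authority, path, query, and fragment to groups 2, 4, 5, 7, and 9.

```php
<?php
// Pattern from RFC 3986, Appendix B, wrapped in ~ delimiters because the
// pattern itself contains both / and #.
$pattern = '~^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?~';

// Invented sample URL.
$url = 'http://example.com/jokes/?id=3#top';

preg_match($pattern, $url, $m);

// Pick out the numbered groups that the RFC maps to URL components.
$parts = array(
    'scheme'    => $m[2],
    'authority' => $m[4],
    'path'      => $m[5],
    'query'     => $m[7],
    'fragment'  => $m[9],
);

print_r($parts);
```

For this sample URL, the parts come out as http, example.com, /jokes/, id=3, and top.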