Using Strings - Beginning Lua Programming (2007)

Chapter 5. Using Strings

The last chapter covered tables and the table library. You already know strings, but you don't know the string library, and that's the main topic of this chapter. Among many other things, you'll learn about the following:

· Converting between uppercase and lowercase

· Getting individual characters and substrings out of strings

· Getting user input

· Reading from and writing to files

· Doing pattern matching and replacement on strings (which is very powerful, and accounts for more than half of the chapter)

Basic String Conversion Functions

Many of the functions in the string library take a single string argument (and, in a few cases, one or two numeric arguments) and return a single string. In the following example, string.lower takes a string and returns a string that's the same except that any uppercase letters have been converted to lowercase:

> print(string.lower("HELLO there!"))

hello there!

Of course, string.lower doesn't care if the string given to it contains no uppercase letters, or no letters at all—it just returns the same string as shown here:

> print(string.lower("hello there!"))

hello there!

> print(string.lower("1 2 3 4"))

1 2 3 4

You may occasionally find yourself doing something like the following, and wondering why Str hasn't been changed to "abc":

> Str = "ABC"

> string.lower(Str)

> print(Str)

ABC

If you think about it, you'll see that this is because strings are immutable, so it's impossible for a function like string.lower to work by side effect the way functions like table.sort do. Instead, string.lower (and all string functions) works by return value:

> Str = "ABC"

> Str = string.lower(Str)

> print(Str)

abc

It's convenient to say things like “string.lower converts any uppercase characters in a string to lowercase” (instead of the previous wordier “takes a string and returns a string” phrasing). Saying it this way doesn't mean that string.lower changes the characters right in the string itself, which is impossible. Rather, the conversion process creates a new string, which is the function's return value.

string.upper is the obvious counterpart to string.lower:

> print(string.upper("HELLO there!"))

HELLO THERE!

Both string.lower and string.upper use the current locale to decide what characters are uppercase or lowercase letters. There's a short example of this later in this chapter.

string.reverse reverses a string, character by character, like this:

> print(string.reverse("desserts"))

stressed

Lua 5.0 didn't have string.reverse. It can be written by looping through a string backwards, putting its characters in an array, and passing that array to table.concat.
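The approach just described can be sketched as follows. This is only one possible way to write it; the name Reverse is made up for this illustration, and table.insert is used to append to the array because Lua 5.0 has no # operator.

```lua
-- Reverses Str without using string.reverse (illustrative sketch).
local function Reverse(Str)
  local Chars = {}
  -- Walk the string from its last character to its first:
  for I = string.len(Str), 1, -1 do
    table.insert(Chars, string.sub(Str, I, I))
  end
  -- Join the collected characters into one string:
  return table.concat(Chars)
end

print(Reverse("desserts")) --> stressed
```

Collecting the characters in an array and joining them once avoids creating a new intermediate string on every pass through the loop.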

string.rep takes a string and a number, and repeats the string that many times, like this:

> print(string.rep("a", 5))

aaaaa

> print(string.rep("Hubba", 2))

HubbaHubba

string.sub returns a substring of the string given to it. The second and third arguments of string.sub are the (one-based) numeric positions of the desired first and last characters. For example:

> Str = "alphaBRAVOcharlie"

> print(string.sub(Str, 1, 5))

alpha

> print(string.sub(Str, 6, 10))

BRAVO

> print(string.sub(Str, 11, 17))

charlie

Negative numbers can be used as positions: -1 means the last character in the string, -2 means the second-to-last, and so on:

> print(string.sub(Str, -7, -1))

charlie

> print(string.sub(Str, 6, -8))

BRAVO

> print(string.sub(Str, -17, 5))

alpha

All of the built-in functions that take character positions understand negative ones.

If string.sub's last argument is omitted, it defaults to -1 (or, equivalently, the length of the string), as follows:

> print(string.sub(Str, 1))

alphaBRAVOcharlie

> print(string.sub(Str,-7))

charlie
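To get a single character out of a string, you can give string.sub the same position twice. The following is a minimal sketch; CharAt is a made-up helper name, not a library function.

```lua
local Str = "alphaBRAVOcharlie"

-- Returns the single character at position Pos. Negative positions count
-- from the end of the string, just as they do for string.sub itself.
local function CharAt(S, Pos)
  return string.sub(S, Pos, Pos)
end

print(CharAt(Str, 1))  --> a
print(CharAt(Str, 6))  --> B
print(CharAt(Str, -1)) --> e
```

A position past the end of the string gives the empty string rather than an error.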

String Length

string.len (like the # operator) returns the length of the string given to it. For example:

> print(string.len(""))

0

> print(string.len("ABCDE"))

5

Converting Between Characters and Character Codes

string.byte returns the numerical byte values used by your system to represent the characters of the substring specified (string.sub-style) by its second and third arguments. For example:

> print(string.byte("ABCDE",1, 5))

65 66 67 68 69

> print(string.byte("ABCDE", 2, -2))

66 67 68

The third argument defaults to the value of the second argument (so only one character's byte value is returned):

> print(string.byte("ABCDE", 2))

66

> print(string.byte("ABCDE", -1))

69

string.byte is often called with only one argument, in which case the second argument defaults to 1:

> print(string.byte("ABCDE"))

65

> print(string.byte("A"))

65

The Lua 5.0 string.byte only took two arguments and therefore never returned more than one value.

string.char takes integers like those returned by string.byte and returns a string composed of those characters, as follows:

> print(string.char(65))

A

> print(string.char(65, 65, 66, 66))

AABB

If called with no arguments, it returns the empty string.

The byte values used by string.byte and string.char are not necessarily the same from system to system. Most of the examples in this book assume that ASCII characters have their ASCII values. (If this is not true on your system, you probably know it.)
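As a small illustration of how these two functions fit together, the sketch below lists the byte values of a string and then rebuilds a string from its own bytes. ByteList is a made-up name, and the values in the comment assume an ASCII system.

```lua
-- Returns the byte values of Str as one space-separated string.
local function ByteList(Str)
  local Codes = {}
  for I = 1, string.len(Str) do
    Codes[I] = string.byte(Str, I)
  end
  return table.concat(Codes, " ")
end

print(ByteList("ABCDE")) --> 65 66 67 68 69

-- Round trip: string.char undoes string.byte.
local Word = "Lua"
print(string.char(string.byte(Word, 1, -1)) == Word) --> true
```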

Formatting Strings and Numbers with string.format

The code you've seen and written so far has formatted its output using string concatenation, tostring, and the automatic insertion of tabs with print. A more powerful tool for formatting output is string.format, which can do the following:

· Pad strings with spaces and pad numbers with spaces or zeros

· Output numbers in hexadecimal or in octal (base 8)

· Trim extra characters from a string

· Left- or right-justify output within columns

Try It Out

Outputting Data in Columns

Imagine you have a bulletin board system, and you want a report of how many new posts and replies to existing posts each user has made. If you wanted to read this report in a web browser, you could format it as HTML, but if you want to read it on the command line or in a text editor, then it'll be easier to read if you format it in columns. (This works only with fixed-width fonts like Courier or Fixedsys, which you should be using to read and write code anyway.)

1. Define the following function:

-- Prints a report of posts per user:

function Report(Users)

  print("USERNAME NEW POSTS REPLIES")

  for Username, User in pairs(Users) do

    print(string.format("%-15s %6d %6d",

      Username, User.NewPosts, User.Replies))

  end

end

2. Run it as follows:

Report({

  arundelo = {NewPosts = 39, Replies = 19},

  kwj = {NewPosts = 22, Replies = 81},

  leethax0r = {NewPosts = 5325, Replies = 0}})

The output should be this:

USERNAME NEW POSTS REPLIES

arundelo            39     19

leethax0r         5325      0

kwj                 22     81

How It Works

The string.format function takes as its first argument a format string that includes format placeholders like "%-15s" and "%6d". It returns a string constructed by replacing the placeholders with the rest of the string.format arguments.

All format placeholders start with the % character. A placeholder that ends in the character s inserts the corresponding argument as a string:

> print(string.format("-->%s<--", "arundelo"))

-->arundelo<--

A number that's between the % and the s uses spaces to pad the string out to that many characters:

> print(string.format("-->%15s<--", "arundelo"))

-->       arundelo<--

A - (minus sign) in front of the number left-justifies the string:

> print(string.format("-->%-15s<--", "arundelo"))

-->arundelo       <--

A placeholder that ends in the character d inserts the corresponding argument as a decimal integer:

> print(string.format("-->%d<--", 39))

-->39<--

A number in between the % and the d uses spaces to pad the number out to that many characters (numbers can also be left-justified just like strings, but that's not done in this example):

> print(string.format("-->%6d<--", 39))

-->    39<--

There should be a one-to-one correspondence between placeholders and extra arguments (“extra arguments” here refers to arguments after the format string). The first placeholder corresponds to the first extra argument, the second placeholder to the second extra argument, and so on:

> print(string.format("%-15s %6d %6d", "arundelo", 39, 19))

arundelo            39     19

Apart from its ability to format strings and numbers to certain lengths, string.format is also a useful alternative to string concatenation. Both of the following two lines insert the variables Link and LinkText into an HTML <a> (anchor) element. The second has the disadvantage of greater length, but it's arguably more readable, especially after your eye gets used to picking out the two format placeholders:

Anchor = '<a href="' .. Link .. '">' .. LinkText .. '</a>'

Anchor = string.format('<a href="%s">%s</a>', Link, LinkText)

This readability advantage ramps up the more alternating literal strings and variables you're dealing with. If you find yourself using string.format a lot in a particular piece of code, you can give it a shorter name, like this:

local Fmt = string.format

A single percent sign in string.format's output is represented by two consecutive ones in the format string, as in the following example. This %% pseudo-placeholder is the only exception to the one-to-one correspondence between format placeholders and the rest of the string.format arguments:

> print(string.format("We gave %d%%!", 110))

We gave 110%!

When necessary, string.format converts strings to numbers and numbers to strings, but any other conversions need to be done by hand. For example:

> print(string.format("%d-%s", "50", 50))

50-50

> print(string.format("%s", {}))

stdin:1: bad argument #2 to 'format' (string expected, got table)

stack traceback:

[C]: in function 'format'

stdin:1: in main chunk

[C]: ?

> print(string.format("%s", tostring({})))

table: 0x4927e8

The %d placeholder only works for integers. For fractional numbers, use %f as follows:

> print(string.format("We gave %f percent!", 99.44))

We gave 99.440000 percent!

By default, six digits are printed after the decimal point. To control this, supply a precision, which should be an integer preceded by a dot. For example:

> print(string.format("%.0f | %.1f | %.2f | %.3f",

>> 99.44, 99.44, 99.44, 99.44))

99 | 99.4 | 99.44 | 99.440

If there's a width (the number used to control column widths in the previous Try-It-Out example), put it before the precision. The width includes any digits after the decimal point, and the decimal point itself, if there is one. Here's an example:

For readability, this example uses a square-bracket string, as covered in Chapter 2.

> print(string.format([[

>> -->%5.0f<-- width: 5; precision: 0

>> -->%5.1f<-- width: 5; precision: 1

>> -->%5.2f<-- width: 5; precision: 2]], 99.44, 99.44, 99.44))

-->   99<-- width: 5; precision: 0

--> 99.4<-- width: 5; precision: 1

-->99.44<-- width: 5; precision: 2

If either a string or a number is longer than the width allotted for it, it will spill over—no characters or digits will be removed:

> -- Print some usernames in a 15-character column:

> print(string.format([[

>> -->%-15s<--

>> -->%-15s<--

>> -->%-15s<--

>> -->%-15s<--]],

>> "arundelo", "kwj", "leethax0r", "reallylongusername"))

-->arundelo       <--

-->kwj            <--

-->leethax0r      <--

-->reallylongusername<--

> -- Print various numbers in a 6-character column:

> print(string.format([[

>> -->%6.1f<--

>> -->%6.1f<--

>> -->%6.1f<--]], 1, 99.44, 123456789))

-->   1.0<--

-->  99.4<--

-->123456789.0<--

If you give a %s placeholder a precision, it's treated as a maximum width—characters will be trimmed from the end of the string to make it fit. The following strings are printed with a width (the minimum width) of 2 and a precision (the maximum width) of 2:

> print(string.format([[

>> -->%2.2s<--

>> -->%2.2s<--

>> -->%2.2s<--]], "a", "ab", "abc"))

--> a<--

-->ab<--

-->ab<--
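Combining a width and a precision of the same size gives a column that's always exactly that wide, whether the string needs padding or trimming. Here is a minimal sketch; the helper name Cell and the column width of 10 are arbitrary choices made for this illustration.

```lua
-- Pads or trims Str so the result is always exactly 10 characters wide,
-- left-justified.
local function Cell(Str)
  return string.format("%-10.10s", Str)
end

print("[" .. Cell("kwj") .. "]")                --> [kwj       ]
print("[" .. Cell("reallylongusername") .. "]") --> [reallylong]
```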

Along with the - (left-justification) character, you can place the following characters right after the percent sign (that is, before the width and precision, if any):

> -- "+" -- all numbers are printed with a sign:

> print(string.format("%+d", 1))

+1

> -- " " -- use a space if a sign is omitted on a positive

> -- number:

> print(string.format("% d", 1))

 1

> -- "0" -- pad numbers with zeros instead of spaces:

> print(string.format("%05d", 1))

00001

> -- "#" -- various uses -- with %f, output always has a

> -- decimal point; with %x, output always has "0x":

> print(string.format("%#.0f", 1))

1.

> print(string.format("%#x", 31))

0x1f

These characters don't have to be in any particular order when combined. For example:

> print(string.format("-->%-+5d<--", 11))

-->+11  <--

> print(string.format("-->%+-5d<--", 11))

-->+11  <--

The %q placeholder (“q” stands for quote) surrounds a string with double quotes, backslash-escapes any double quotes, backslashes, or newlines in it, and converts any zero bytes to "\000":

> print(string.format("%q", [[backslash: \; double quote: "]]))

"backslash: \\; double quote: \""

> print(string.format("%q", "\0\n"))

"\000\

"

In other words, it converts a string into a representation of that string usable in Lua source code:

> WeirdStr = "\"\\\n\0"

> Fnc = loadstring(string.format("return %q", WeirdStr))

> print(Fnc() == WeirdStr)

true

This lets you save data for subsequent use by formatting it as a valid Lua script and putting it into a file. When you want to retrieve the data, you load the file into a string, pass it to loadstring, and call the resulting function. (You'll see how to save and load files in the next section.)

This kind of data file can consist of a single return statement (like the string passed to loadstring in the previous example). It can also be a series of function calls whose arguments are the data in question, like this:

Record{Username = "arundelo", NewPosts = 39, Replies = 19}

Record{Username = "kwj", NewPosts = 22, Replies = 81}

Record{Username = "leethax0r", NewPosts = 5325, Replies = 0}

The function called (Record in this example) is one that you define to do what's necessary to make the data available to the main program. If you use setfenv to place this function in the environment of the function returned by loadstring, then there's no need for the data file to have access to any other functions.
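Here is one way loading such a data file might look. This is a sketch under some assumptions: the data is held in a string rather than read from a file, the Users table and the body of Record are made up for the illustration, and the version check is there only so that the code also runs on Lua versions after 5.1, which dropped loadstring and setfenv in favor of passing an environment to load.

```lua
-- Text of a hypothetical data file (normally this would be read from disk):
local Data = [[
Record{Username = "arundelo", NewPosts = 39, Replies = 19}
Record{Username = "kwj", NewPosts = 22, Replies = 81}
]]

-- Collects every record that the data file passes to Record:
local Users = {}
local Env = {
  Record = function(Rec)
    Users[Rec.Username] = Rec
  end,
}

-- Compile Data so that Env is the only environment it can see:
local Chunk, ErrStr
if setfenv then
  -- Lua 5.1: compile, then swap in the restricted environment.
  Chunk, ErrStr = loadstring(Data)
  if Chunk then setfenv(Chunk, Env) end
else
  -- Later versions: the environment is an argument to load.
  Chunk, ErrStr = load(Data, "data", "t", Env)
end
assert(Chunk, ErrStr)
Chunk()

print(Users.kwj.Replies) --> 81
```

Because Env contains nothing but Record, a data file that tries to call anything else (print or os.remove, for instance) will fail with an error instead of doing something unexpected.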

If you want to save a table, you need to build up a string by splicing together the table's keys and values with the appropriate curly braces, square brackets, commas, and so on. Chapter 7 has an example of this.

Other common placeholders include %x and %X for hexadecimal output, %o for octal output, %e and %E for scientific notation, %g and %G for automatic selection between scientific notation and standard notation, and %i, which is just a synonym for %d. Here are some examples:

> print(string.format("%%x: %x\t%%o: %o",31,31))

%x: 1f %o: 37

> print(string.format("%%e: %e\t%%g: %g",3.1,3.1))

%e: 3.100000e+00 %g: 3.1

> print(string.format("%%e: %e\t%%g: %g", 11 ^ 13, 11 ^ 13))

%e: 3.452271e+13 %g: 3.45227e+13

Placeholders that come in both uppercase and lowercase differ only in the case of any letters they might print. For example:

> print(string.format("%x, %X", 255, 255))

ff, FF

More detail on format strings is available in books on the C programming language, because the string.format format placeholders are almost exactly the same as those used by the C language's printf family of functions.

The differences are as follows: C doesn't understand the %q placeholder, and Lua doesn't understand the h, L, or l modifiers, the %n or %p placeholders, or the * character used as a width or precision. To compensate for the lack of *, you can build a format string in Lua at run time.
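Building a format string at run time might look like the following sketch, which uses string.format itself to assemble the placeholder. PadLeft is a made-up name for this illustration.

```lua
-- Right-justifies Str in a column whose width is only known at run time.
-- (In C this could be written with the * character: printf("%*s", Width, Str).)
local function PadLeft(Str, Width)
  -- "%%" yields a literal "%", so a Width of 6 produces the format "%6s":
  local Fmt = string.format("%%%ds", Width)
  return string.format(Fmt, Str)
end

print("-->" .. PadLeft("kwj", 6) .. "<--") --> -->   kwj<--
```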

A lot of string.format's work is actually done by the C function sprintf. This means that you should not include any “\0” characters in strings you give to string.format (unless they'll be formatted by %q placeholders).

Input/Output

It's time to take a brief break from the string library, and detour through the topic of input/output (I/O). You've already used print for output, but print does more than output data: it converts everything to strings, separates multiple arguments with tabs, and appends a newline. This is convenient for debugging because you can see what values a function returns, but a more fundamental output function is io.write. This function expects all of its arguments to be strings or numbers, and it outputs each of the arguments without adding anything. The > character that appears to be at the end of the following output is really just the Lua prompt; it didn't get put on its own line because no newline was output:

> Str1, Str2 = "Alpha", "Bravo"

> io.write(Str1, Str2)

AlphaBravo>

Here's the same thing, with a space and a newline:

> io.write(Str1, " ", Str2, "\n")

Alpha Bravo

>

In the following, io.write is used to build a print-like function that needs neither the concatenation operator nor table.concat. Trace its execution for a few sample arguments so that you can see how it works, especially how it builds one line of output with several calls to io.write:

function Print(...)

  local ArgCount = select("#", ...)

  for I = 1, ArgCount do

    -- Only print a separator if one argument has already

    -- been printed:

    if I > 1 then

      io.write("\t")

    end

    io.write(tostring(select(I, ...)))

  end

  io.write("\n")

end

The function io.read reads (by default) one line of input. Call it like so:

Line = io.read()

Your cursor will move to the next line, but there'll be no Lua prompt. Type some random text and press Enter. The Lua prompt will reappear. Print Line and you'll see that it contains whatever you typed.

To write to or read from a file, you need a handle to that file. A file handle is an object with methods like write and read. Getting a handle to a file is called opening the file.

Writing to and Reading from a File

The following example will write to a file named test.txt in your current directory. (If you already have a file of this name that you don't want overwritten, move it or use a different name.) Do the following in the interpreter:

> FileHnd, ErrStr = io.open("test.txt", "w")

> print(FileHnd, ErrStr)

file (0x485ad4) nil

> FileHnd:write("Line 1\nLine 2\nLine 3\n")

> FileHnd:close()

> FileHnd, ErrStr = io.open("test.txt")

> print(FileHnd, ErrStr)

file (0x485ad4) nil

> print(FileHnd:read())

Line 1

> print(FileHnd:read())

Line 2

> print(FileHnd:read())

Line 3

> print(FileHnd:read())

nil

> FileHnd:close()

> print(os.remove("test.txt"))

true

The first io.open argument is the name of the file to be opened. Its second argument, "w", means that the file will be opened in write mode, making it possible to write to it but not read from it. If the file is successfully opened, io.open returns a handle to it; if for some reason it can't be opened, io.open returns nil and an error message. (Printing the file handle and the error message is fine at the interpreter, but an actual program would instead test the handle with something like "if FileHnd then". There's an example of this in the next chapter.)

The file handle has a write method, which works just like io.write, but writes to the file instead of your screen. The following line writes three lines (each one terminated with a newline character) to the file:

FileHnd:write("Line 1\nLine 2\nLine 3\n")

After this, calling the file handle's close method closes the file. This ensures that all output actually makes it to the hard drive, and that any system resources associated with the open file aren't tied up any longer than they need to be. After the file is closed, an attempt to use any of the handle's methods will cause an error.

Next, the file is reopened, but this time in read mode instead of write mode. (The second io.open argument defaults to "r".)

The first three times FileHnd:read is called, it returns the file's three lines, one after the other. Notice that the newline characters marking the ends of these lines are not returned. (You can tell this because, if they were returned, then print would appear to print an extra blank line after each line.)

The fourth time FileHnd:read is called, it returns nil, which means that the end of the file has been reached. The file is then closed with FileHnd:close and removed with os.remove, whose return value of true just means that the removal was successful. If you skip the removal step, you can look at test.txt in your text editor.

The first character of the second io.open argument must be an r (for read mode), a w (for write mode, which discards any contents the file may have already had), or an a (for append mode, which writes to the end of the file, preserving any previous contents). This letter can optionally be followed by (in any order) a “+” character and/or a “b” character. Including "+" opens the file in one of three different versions of read/write mode, depending on the first letter. Including "b" opens the file in binary mode (specifically binary read, binary write, or binary append mode).

There's more discussion of the distinction between binary mode and text mode (the default) in Chapter 13, but here's the essence. Some systems use something other than "\n" to mark the end of a line. Lua (actually, the C I/O library that Lua uses) does a translation that lets you ignore this and always use "\n". That's a good thing, but if you're working with a file that isn't text, and you don't want the library messing with any bytes that happen to look like end-of-line markers, or if you have to deal with text files that were created on a system with a different end-of-line convention, a mode string with "b" turns off this translation.

If given no arguments, read (either io.read or the read method of a file handle) reads and returns one line (with no trailing "\n") or returns nil on end-of-file. If it's given an argument, the argument should be one of the following:

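· "*n" reads a number and returns it as a number rather than a string.

· "*a" reads the whole file, starting at the current position. At end-of-file, it returns the empty string.

· "*l" reads the next line, without the trailing "\n", or returns nil on end-of-file. This is the default.

· A number reads a string of up to that many characters, or returns nil on end-of-file. If the number is 0, it reads nothing and returns the empty string, or nil on end-of-file.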

read can be given multiple arguments, in which case it reads and returns multiple values.
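The following sketch shows a few of the standard read formats ("*n" for a number, "*l" for a line, "*a" for the rest of the file, and a character count) at work on a small file, including one call that uses two formats at once. The function name ReadDemo, the file contents, and the use of os.tmpname to pick a scratch filename are all choices made just for this illustration.

```lua
-- Writes a small file, then reads it back using several read formats.
-- Returns everything that was read so the results can be inspected.
local function ReadDemo(Filename)
  local Hnd = assert(io.open(Filename, "w"))
  Hnd:write("42 apples\nsecond line\nthird line\n")
  Hnd:close()

  Hnd = assert(io.open(Filename, "r"))
  -- Two formats in one call: a number, then the rest of that line.
  local Num, Rest = Hnd:read("*n", "*l")
  -- Exactly six characters:
  local Six = Hnd:read(6)
  -- Everything that remains in the file:
  local Tail = Hnd:read("*a")
  -- At end-of-file, reading a line returns nil:
  local AtEof = Hnd:read("*l")
  Hnd:close()
  os.remove(Filename)
  return Num, Rest, Six, Tail, AtEof
end

print(ReadDemo(os.tmpname()))
```

Here Num is the number 42 (not the string "42"), Rest is " apples" (the newline that ended the line is not included), Six is "second", Tail is " line\nthird line\n", and AtEof is nil.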

The I/O library supplies the following three pre-opened file handles to virtual files (things that act like files, but don't actually reside on the hard disk):

· io.stdin is the standard input file handle (read-only). By default, it reads from the keyboard. io.stdin:read acts the same as io.read.

· io.stdout is the standard output file handle (write-only). By default, it writes to the screen. io.stdout:write acts the same as io.write.

· io.stderr is the standard error file handle (write-only). By default, it too writes to the screen.

Now that you know the basics of I/O, you can write a script that works with files.

Try It Out

Sorting and Eliminating Duplicate Lines in a File

1. Create a file with the following contents, and save it as sortuniq.lua:

-- This script outputs all unique lines of a file, sorted.

-- It does no error checking!

--

-- Usage:

-- lua sortuniq.lua INFILE OUTFILE

--

-- If OUTFILE is not given, standard output will be used. If

-- no arguments are given, standard input and standard output

-- will be used.

-- Like pairs, but loops in order by key. (Unlike the

-- version in Chapter 4, this only handles all-string or

-- all-numeric keys.)

function SortedPairs(Tbl)

  local Sorted = {} -- A (soon to be) sorted array of Tbl's keys.

  for Key in pairs(Tbl) do

    Sorted[#Sorted + 1] = Key

  end

  table.sort(Sorted)

  local I = 0

  -- The iterator itself:

  return function()

    I = I + 1

    local Key = Sorted[I]

    return Key, Tbl[Key]

  end

end

function Main(InFilename, OutFilename)

  -- Make the lines of the input file (standard input if no

  -- name was given) keys of a table:

  local Lines = {}

  local Iter = InFilename and io.lines(InFilename) or io.lines()

  for Line in Iter do

    Lines[Line] = true

  end

  -- Get a handle to the output file (standard output if no

  -- name was given):

  local OutHnd = OutFilename

    and io.open(OutFilename, "w")

    or io.stdout

  -- Write each line in Lines to the output file, in order:

  for Line in SortedPairs(Lines) do

    OutHnd:write(Line, "\n")

  end

  OutHnd:close()

end

Main(...)

2. Create a file with the following contents, and save it as testin.txt:

bravo

alpha

charlie

alpha

charlie

charlie

alpha

3. At your shell, type this:

lua sortuniq.lua testin.txt testout.txt

4. Open the testout.txt file in your text editor. Here's what it should contain:

alpha

bravo

charlie

How It Works

io.lines is an iterator factory. If given a filename, it opens that file and returns an iterator that loops through all lines in the file. If called with no argument, it returns an iterator that loops through all lines found on standard input (by default, the lines you type).

sortuniq.lua is written to use either of these. If no input file is given as a command-line argument, io.lines would be called with no argument and loop through standard input. In the previous example, though, InFilename is testin.txt, each line of which is made a key in the table Lines inside the loop. This has the effect of ignoring duplicate lines:

local Lines = {}

local Iter = InFilename and io.lines(InFilename) or io.lines()

for Line in Iter do

  Lines[Line] = true

end

When the iterator hits the end of testin.txt, it closes the file right before the loop is exited. Next, testout.txt (the file named by OutFilename) is opened. Again, the following is done so that standard output will be used if no output file was specified on the command line:

local OutHnd = OutFilename

  and io.open(OutFilename, "w")

  or io.stdout

Next, the SortedPairs function loops through Lines in order by key, and each key is written to testout.txt, along with a newline—the io.lines iterator, like io.read, doesn't include newlines on the lines it returns:

for Line in SortedPairs(Lines) do

  OutHnd:write(Line, "\n")

end

Finally, testout.txt is closed and the program is exited.

If you don't supply an output file at the command line, then standard output is used, which means the output will get written right to your screen (as though io.write had been used instead of OutHnd:write). If you also omit the input file, then standard input is used, which means you can type the input instead of saving it in a file. (To send an EOF when you're done typing, use Ctrl+D on Unix-like systems, or Ctrl+Z followed by Enter on Windows.)

Omitting the filenames also lets you use your shell's redirection operators (<, >, and |) to make standard input come from a file or another program instead of your keyboard, and standard output go to a file or another program instead of your screen. For example, the following code:

lua sortuniq.lua < testin.txt > testout.txt

does the same thing as this:

lua sortuniq.lua testin.txt testout.txt

except that in the version with < and >, the shell opens testin.txt and testout.txt and sortuniq.lua just uses standard input and standard output.

This script does no error checking. If, for instance, the output file can't be created or is read-only, then the io.open will fail, and output will default to standard output with no explanation of why that happened. You'll learn how to check for and handle errors in the next chapter, but for now just remember that most I/O library functions will return nil and an error string if something unexpected happens.

For example, a file's write function simply triggers an error when you give it bad arguments or use it with a closed file, but when it doesn't trigger an error, it returns true or nil depending on whether it successfully wrote what you wanted it to write. If it returns nil, the second return value will be an error message. Accidentally writing to a file opened in read mode gives the cryptic "No error", and writing to a disk drive that has no more space might give an error message like "No space left on device". The former is just a programmer error, but the latter is something that the program can't prevent. If a program doesn't check the write return values, it won't know if something like this happens. That's fine for scripts where the likelihood or harmfulness of a disk drive filling up is small, for example, but something mission-critical may merit more care.

Some functions (such as io.open) are more likely to fail in this way, so their return values should always be checked, except in interactive use or quick-and-dirty scripts where you can spot problems and fix them on the fly.
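Checking these return values might look like the following sketch. WriteLines is a made-up helper, and the filename is deliberately one that is assumed not to be creatable (its directory is assumed not to exist) so that the failure path is the one that runs.

```lua
-- Writes each string in Lines to Filename, one per line. Returns true on
-- success, or nil and an error message if the file can't be opened or a
-- write fails.
local function WriteLines(Filename, Lines)
  local Hnd, ErrStr = io.open(Filename, "w")
  if not Hnd then
    return nil, ErrStr
  end
  for _, Line in ipairs(Lines) do
    local Ok, WriteErr = Hnd:write(Line, "\n")
    if not Ok then
      Hnd:close()
      return nil, WriteErr
    end
  end
  Hnd:close()
  return true
end

local Ok, ErrStr = WriteLines("no-such-directory/out.txt", {"alpha", "bravo"})
if not Ok then
  -- Report the problem on standard error instead of failing silently:
  io.stderr:write("sortuniq: ", ErrStr, "\n")
end
```

The helper passes the error message back to its caller, in the same nil-plus-message style that the I/O library itself uses, so the caller can decide whether to retry, report the problem, or give up.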

When an I/O function returns nil and an error message, it also returns, as an undocumented third value, the numeric error code from C. Later in this chapter, you'll find out about the implications of using undocumented features.

Pattern-Matching

Other than the string.dump function (to be covered in Chapter 10), the rest of the functions in the string library are all for finding or replacing substrings. (In some cases, the substring being searched for is the entire string being searched.) A string being searched is called a subject. A substring to be searched for is specified with a string called a pattern, and the substring, if found, is called a match.

Searching for a Specific String

A single pattern can have different matches. For example, a pattern that searched for “sequences of 1 or more whitespace characters” would match both the " " and "\t\n\t\n" substrings, among others. However, you can also use a pattern to search for one specific match. In the simplest case, such a pattern is identical to the substring it matches, like this one:

> Str = "The rain in Spain stays mainly in the plain."

> print(string.gsub(Str, "ai", "oy"))

The royn in Spoyn stays moynly in the ployn. 4

string.gsub searches its first argument (the subject) for substrings that match the pattern given as its second argument, and replaces any that it finds with its third argument. Thus every occurrence of "ai" in Str is replaced with "oy". As a second result, string.gsub returns the number of substitutions done (4 in this case).

gsub stands for global substitute. It's global because all matches are replaced, not just the first one.

If there are no matches, then no substitutions are done, meaning that string.gsub returns the subject and 0. Here's an example:

> print(string.gsub("A shrew in Kew glues newbies to a pew.",

>> "ai", "oy"))

A shrew in Kew glues newbies to a pew. 0

If a fourth argument is given, then at most that many substitutions are done (starting from the beginning of the string), like this:

> Str = "The rain in Spain stays mainly in the plain."

> print(string.gsub(Str, "ai", "oy", 2))

The royn in Spoyn stays mainly in the plain. 2

> print(string.gsub(Str, "ai", "oy", 999))

The royn in Spoyn stays moynly in the ployn. 4

The replacement string can be a different length than the pattern, as shown here:

> print(string.gsub(Str, "ai", "izzai"))

The rizzain in Spizzain stays mizzainly in the plizzain. 4

In particular, replacing matches with the empty string deletes them. For example:

> print(string.gsub(Str, "ai", ""))

The rn in Spn stays mnly in the pln. 4
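Because string.gsub returns the number of substitutions as its second result, it can also be used just to count matches. The sketch below does that; Count is a made-up name, and the pattern is assumed to consist only of letters and digits, which always match themselves.

```lua
-- Counts the non-overlapping occurrences of Pat in Subject by doing a
-- throwaway substitution and keeping only gsub's second result.
local function Count(Subject, Pat)
  local _, Hits = string.gsub(Subject, Pat, "")
  return Hits
end

local Rain = "The rain in Spain stays mainly in the plain."
print(Count(Rain, "ai")) --> 4
print(Count(Rain, "zz")) --> 0
```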

Matching Any of Several Characters

How often have you been buying something online and seen a message like “when entering credit card number, please enter only numbers without spaces or dashes”? This is bad programming—a user should be able to enter their credit card number with separators, so they can more easily double-check it. It takes only a few lines of code to allow this and still make sure the user didn't type in something that obviously isn't a credit card number.

Try It Out

Validating Credit Card Numbers

1. Define the following function:

-- Returns Str without any whitespace or separators, or nil

-- if Str doesn't satisfy a simple validity test:

function ValidateCreditCard(Str)

  Str = string.gsub(Str, "[ /,.-]", "")

  return string.find(Str, "^%d%d%d%d%d%d%d%d%d%d%d%d%d%d%d%d$")

    and Str

end

2. Test it out like this:

> print(ValidateCreditCard("1234123412341234"))

1234123412341234

> print(ValidateCreditCard("1234 1234 1234 1234"))

1234123412341234

> print(ValidateCreditCard("1234-1234-1234-1234"))

1234123412341234

> print(ValidateCreditCard("1234-1234-1234-123"))

nil

> print(ValidateCreditCard("221B Baker Street"))

nil

How It Works

This example introduces several new concepts. The first one is that a pattern can include bracket classes, which are lists of characters enclosed by square brackets. The bracket class [ /,.-] matches a character if it's a space, a slash, a comma, a dot, or a dash:

> print(string.gsub("The rain? -- the plain!", "[ /,.-]", ""))

Therain?theplain! 6

The user can use any of those characters as separators, anywhere in the credit card number, and string.gsub will remove them, but leave intact any digits, letters, or other characters:

Str = string.gsub(Str, "[ /,.-]", "")

After separators have been removed, the string will be considered valid only if it's nothing more and nothing less than 16 digits. %d is shorthand for the bracket class [0123456789]—it matches any decimal digit:

> print(string.gsub("52 pickup", "%d", "Digit"))

DigitDigit pickup 2

> print(string.gsub("52 pickup", "%d%d", "TwoDigits"))

TwoDigits pickup 1

Patterns can include %d and other combinations of the percent sign with other characters. Although they look the same, these are unrelated to string.format placeholders.

If a pattern's first character is ^ (a caret), it means that the rest of the pattern will only match at the beginning of the subject. In this example, not only is "ab" contained in "abc", but "ab" is right at the beginning of "abc", so a substitution is made:

> print(string.gsub("abc", "^ab", "XX"))

XXc 1

In the next example, "bc" is contained in "abc", but it's not right at the beginning, so no substitution is made:

> print(string.gsub("abc", "^bc", "XX"))

abc 0

The caret is said to anchor the pattern at the beginning. A pattern is anchored at the end if it ends with $ (a dollar sign):

> print(string.gsub("abc", "ab$", "XX"))

abc 0

> print(string.gsub("abc", "bc$", "XX"))

aXX 1

So the pattern "^%d%d%d%d%d%d%d%d%d%d%d%d%d%d%d%d$", anchored both at the beginning and the end, matches a subject only if it contains 16 digits and nothing else.

The function string.find returns the positions of the first and last character of the first match, or nil if there is no match:

> print(string.find("duck duck goose", "duck"))

1 4

> print(string.find("duck duck goose", "goose"))

11 15

> print(string.find("duck duck goose", "swan"))

nil

ValidateCreditCard, however, depends only on the fact that string.find's first result will be a true value (a position number) if there's a match and nil if there's no match. When this result is anded with Str itself, the effect is that the function returns nil if Str is something other than 16 digits, and Str if it is 16 digits:

return string.find(Str, "^%d%d%d%d%d%d%d%d%d%d%d%d%d%d%d%d$")

and Str
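Typing sixteen copies of %d by hand invites miscounting. One way around this, sketched below, is to build the pattern with string.rep, which repeats a string a given number of times. The names CardPat and ValidateCreditCard2 are made up for this illustration:

```lua
-- Build "^%d%d...%d$" with exactly 16 digit classes:
local CardPat = "^" .. string.rep("%d", 16) .. "$"

-- Same behavior as ValidateCreditCard, but with a generated pattern:
function ValidateCreditCard2(Str)
  Str = string.gsub(Str, "[ /,.-]", "")
  return string.find(Str, CardPat) and Str
end

print(ValidateCreditCard2("1234-1234-1234-1234"))   --> 1234123412341234
print(ValidateCreditCard2("1234 1234 1234 12345"))  --> nil
```

The second call fails because 17 digits remain after the separators are removed, and the pattern is anchored at both ends.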

Characters like [, %, and ^ are called magic characters, because they have special meanings rather than representing themselves the way a standard character does (for example, “a”). To match a magic character, escape it with a percent sign. For example, "%[" matches "[", "%^" matches "^", "%%" matches "%", and so on, as follows:

> -- If not for the extra %, this would give an error message:

> print(string.gsub("90%, 100%, 110%", "110%%", "a lot!"))

90%, 100%, a lot! 1

> -- If not for the %, this would replace the zero-length

> -- string before the 3:

> print(string.gsub("3^10", "%^", " to the power of "))

3 to the power of 10 1

All magic characters are punctuation characters. This means that nonpunctuation characters—including letters, digits, whitespace, and control characters—always represent themselves and don't need to be percent-escaped.

Backslash-escaping is decoded when a chunk is compiled, so the pattern-matching functions don't even know about it. The pattern "\\" contains only one character (the backslash), which it matches. If you have trouble separating percent-escaping and backslash-escaping in your mind, you can quote the troublesome pattern with square brackets: [[\]].

Not all punctuation characters are magic, but a punctuation character (or any nonalphanumeric character) escaped by a percent sign always represents itself, whether it's normally magic or not. In the following example, both "@" and "%@" match the at-sign, which is not magic:

> print(string.gsub(" somebody@example.com ", "@", " AT "))

somebody AT example.com 1

> print(string.gsub(" somebody@example.com ", "%@", " AT "))

somebody AT example.com 1

The reasoning behind this is that anytime you want to match a punctuation character, you don't have to remember whether it's magic or not, you can just go ahead and escape it. This also applies to characters that are only magic in certain contexts. For example, the dollar sign has its magic anchoring meaning only at the very end of a pattern, and anywhere else it represents itself, but rather than thinking about that, you can feel free to escape it wherever it occurs if you don't want it to be magic. Here's how:

> print(string.gsub("I have $5.", "$%d", "some money"))

I have some money. 1

> print(string.gsub("I have $5.", "%$%d", "some money"))

I have some money. 1
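The rule that an escaped punctuation character always represents itself makes it possible to turn arbitrary text into a pattern that matches that text literally. The sketch below relies on two features covered later in this chapter: the %p character class (punctuation) and %0 in a replacement string (the whole match). EscapePattern is a hypothetical helper name:

```lua
-- Hypothetical helper: puts a percent sign in front of every
-- punctuation character, so the result matches Str literally.
function EscapePattern(Str)
  return (string.gsub(Str, "%p", "%%%0"))
end

print(EscapePattern("$5.00"))  --> %$5%.00
print(string.gsub("I have $5.00.", EscapePattern("$5.00"), "some money"))
--> I have some money.  1
```

Without the escaping, the dot in "$5.00" would match any character rather than only a decimal point.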

The individual units of a pattern are called pattern items. Pattern items that produce matches that are one character in length are called character classes. Some character classes are themselves one character long (“a”), and some are longer (“%d”, “[ /,.-]”).

In addition to character classes, you've seen pattern items with zero-character matches (“^” and “$”). In the next section, you'll see how to make pattern items whose matches are longer than one character. All this complexity can make it hard to decode patterns, but a useful technique is to write them down on scratch paper and break them into their elements. For example, you could diagram the "p[aeiou]t" pattern as shown in Figure 5-1.

Figure 5-1

If it wasn't obvious from the pattern itself, it's easy to see from this diagram that "p[aeiou]t" will match any of the character sequences "pat," "pet," "pit," "pot," and "put" (wherever they may occur in the subject, because there's no anchoring).

Bracket classes have a few other features. One is that you can specify a range of characters, with the lowest character in the range followed by a - (hyphen) character and the highest character in the range (with “lowest” and “highest” being defined in terms of string.byte). In the following example, "[0-4]" is equivalent to "[01234]", and "[a-e]" is equivalent to "[abcde]":

> print(string.gsub("0123456789", "[0-4]", "n"))

nnnnn56789 5

> Str = "the quick brown fox jumps over the lazy dog"

> print(string.gsub(Str, "[a-e]", "!"))

th! qui!k !rown fox jumps ov!r th! l!zy !og 7

Actually, although "[0-9]" is guaranteed to match all and only the decimal digits, the meaning of other ranges is dependent on your system's character set. Chances are, however, that your system uses the ASCII character set or some superset of it, so that "[a-z]" matches all and only the 26 lowercase letters of the English alphabet, and "[A-Z]" matches the 26 uppercase letters. This will be assumed in the examples.

Because the hyphen has a magic meaning inside bracket classes, it generally needs to be escaped to be included in a bracket class: "[a%-z]" matches the letter a, the hyphen, and the letter z. The only reason that you didn't need to escape the hyphen in "[ /,.-]" (from the ValidateCreditCard example) is that it did not have a character to its right, so it was in a nonmagical position. Alternatively, you could have written that bracket class as "[ /,.%-]".

You can combine ranges with other ranges, and with other things in a bracket class. "[a-emnv-z]" matches the letters a through e, the letters m and n, and the letters v through z.

If a bracket class starts with ^ (a caret), it matches all characters that would not be matched without the caret. "[aeiou]" matches all vowels, so "[^aeiou]" matches any character that isn't a vowel in the following example:

> Str = "the quick brown fox jumps over the lazy dog"

> print(string.gsub(Str, "[^aeiou]", ""))

euioouoeeao 32
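Ranges and negated bracket classes cover many small cleanup and validation jobs. The following sketch assumes an ASCII-compatible character set; DigitsOnly and IsHexByte are names invented for the illustration:

```lua
-- Removes every character that is not a decimal digit:
function DigitsOnly(Str)
  return (string.gsub(Str, "[^0-9]", ""))
end

-- True if Str is exactly two hexadecimal digits (assumes ASCII ranges):
function IsHexByte(Str)
  return string.find(Str, "^[0-9A-Fa-f][0-9A-Fa-f]$") ~= nil
end

print(DigitsOnly("(555) 867-5309"))  --> 5558675309
print(IsHexByte("7f"))               --> true
print(IsHexByte("G0"))               --> false
```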

There are other character classes that, like "%d", use a percent sign followed by a letter and match some predefined category of characters. A complete list is provided later in the chapter, but here is an example that uses “%a” (letters), “%u” (uppercase letters), and “%l” (lowercase letters):

> Str = "abc Abc aBc abC ABc AbC aBC ABC"

> print(string.gsub(Str, "%u%l%l", "Xxx"))

abc Xxx aBc abC ABc AbC aBC ABC 1

> print(string.gsub(Str, "%a%a%a", "!!!"))

!!! !!! !!! !!! !!! !!! !!! !!! 8

Just as "%d" is another way of writing "[0-9]", "%l" is another way of writing "[a-z]" and "%a" is another way of writing "[A-Za-z]". Unlike bracket classes, which are based on your system's character set, character classes like "%a" are based on your system's locale settings, so they may give different results with non-English characters.

Unfortunately, a full explanation of locales and other character-encoding issues is outside the scope of this book, but the following example (which may not work as-is on your system) shows that the %a character class and the string.upper function use the current locale. The example uses the os.setlocale function to manipulate the ctype locale setting, which governs the categorization of characters:

> -- Change the ctype (character type) locale, saving the

> -- current one:

> OrigCtype = os.setlocale(nil, "ctype")

> print(OrigCtype)

C

> -- Set the ctype locale to Brazilian Portuguese:

> print(os.setlocale("pt_BR", "ctype"))

pt_BR

> -- Test [A-Za-z] versus %a:

> Str = "Pontif\237cia Universidade Cat\243lica do Rio de Janeiro"

> print(Str)

Pontifícia Universidade Católica do Rio de Janeiro

> print(string.gsub(Str, "[A-Za-z]", "."))

......í... ............ ...ó.... .. ... .. ....... 42

> print(string.gsub(Str, "%a", "."))

.......... ............ ........ .. ... .. ....... 44

> -- Also test string.upper:

> print(string.upper(Str))

PONTIFÍCIA UNIVERSIDADE CATÓLICA DO RIO DE JANEIRO

> -- Go back to the original ctype locale:

> print(os.setlocale(OrigCtype, "ctype"))

C

> -- Test string.upper again, with the original ctype locale

> -- which, being "C", will not recognize the non-English

> -- characters:

> print(string.upper(Str))

PONTIFíCIA UNIVERSIDADE CATóLICA DO RIO DE JANEIRO

Locales use 8-bit character encodings, where each character is one byte. A more flexible encoding system is Unicode. In Unicode's UTF-8 format, ASCII characters (“regular” characters) are one byte long and other characters are two or more bytes. Lua has no built-in support for Unicode (for example, it measures string length in bytes), but it does nothing to prevent you from using Unicode. In UTF-8, the character í happens to be represented as "\195\173", so if the string "Pontif\195\173cia" is written to a file, and that file is opened with a text editor that reads it as UTF-8, then the editor will show “Pontifícia.”

To do more complicated things with Unicode and other character encodings, try the Selene Unicode library (slnunicode), available at luaforge.net. For more information on Unicode and other character encoding issues, check out the Joel Spolsky article at joelonsoftware.com/articles/Unicode.html.
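For simple jobs, a bracket class range over byte values is sometimes enough. In UTF-8, every byte of a multibyte character after the first lies in the range 128 through 191, so counting the bytes outside that range counts the characters. This is only a sketch: it assumes the string is valid UTF-8, and Utf8Len is a hypothetical name:

```lua
-- Hypothetical helper: counts characters in a UTF-8 string by counting
-- every byte that is NOT a continuation byte (128 through 191).
-- Assumes Str is valid UTF-8.
function Utf8Len(Str)
  local _, Count = string.gsub(Str, "[^\128-\191]", "")
  return Count
end

Word = "Pontif\195\173cia"
print(#Word)          --> 11
print(Utf8Len(Word))  --> 10
```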

"%D" matches any character that isn't a digit, "%L" matches any character that isn't a lowercase letter, and “%A” matches any character that isn't alphabetical. In fact, any pattern item that is a percent sign followed by a lowercase letter has a counterpart with an uppercase letter that matches all the characters not matched by the lowercase version.

Any character class that isn't itself a bracket class can be included in a bracket class. This includes character classes like “%d” and “%a”; both "[%d.]" and "[0-9.]" match digits and decimal points. However, the two endpoints of a hyphen-separated range must be single characters (representing themselves), not longer character classes like "%d", "%a", "%%", or "%]".

The only characters that are magic inside a bracket class are "-", “^”, "]", and “%”, and of these, only "%" is magic in all positions. But unless it's the endpoint of a range, a punctuation character can be escaped even when that's not strictly necessary, like this:

> print(string.gsub("!@#$%^&*", "[%!%@%#%$]", "."))

....%^&* 4

As with patterns as a whole, you can understand complex bracket classes better if you diagram them. Often, all you need to do is show how the parts are divided up. For example, Figure 5-2 is a diagram of "[aq-z%]%d]"—it makes clear that the characters matched are the letter a, the letters q through z, the ] character, and all digits.

Figure 5-2

To match any character at all, including whitespace and control characters, use the . (dot) character class. Here are some examples:

> print(string.find("abc", "a.c"))

1 3

> print(string.find("a c", "a.c"))

1 3

> print(string.find("a!c", "a.c"))

1 3

> print(string.find("a\0c", "a.c"))

1 3

> -- It matches one and only one character:

> print(string.find("abbc", "a.c"))

nil

> print(string.find("ac", "a.c"))

nil

A pattern cannot contain the character "\0". To match a zero, use "%z":

> print(string.gsub("a\0b\0c", "%z", "Z"))

aZbZc 2

Matches of Varying Lengths

In the previous section, you could always tell just by looking at a pattern how long its match would be. Many patterns need to match varying quantities of characters. An example of this is squeezing whitespace, which turns all runs of one or more whitespace characters into a single space.

Try It Out

Squeezing Whitespace

1. Define the following function:

-- Squeezes whitespace:

function Squeeze(Str)

return (string.gsub(Str, "%s+", " "))

end

2. Run this test code:

TestStrs = {

"nospaces",

"alpha bravo charlie",

" alpha bravo charlie ",

"\nalpha\tbravo\tcharlie\na\tb\tc",

[[

alpha

bravo

charlie]]}

for _, TestStr in ipairs(TestStrs) do

io.write("UNSQUEEZED: <", TestStr, ">\n")

io.write(" SQUEEZED: <", Squeeze(TestStr), ">\n\n")

end

The output should be as follows:

UNSQUEEZED: <nospaces>

SQUEEZED: <nospaces>

UNSQUEEZED: <alpha bravo charlie>

SQUEEZED: <alpha bravo charlie>

UNSQUEEZED: < alpha bravo charlie >

SQUEEZED: < alpha bravo charlie >

UNSQUEEZED: <

alpha bravo charlie

a b c>

SQUEEZED: < alpha bravo charlie a b c>

UNSQUEEZED: < alpha

bravo

charlie>

SQUEEZED: < alpha bravo charlie>

How It Works

The %s character class matches a whitespace character ("\t", "\n", "\v", "\f", "\r", and " ", plus any other whitespace characters defined by your locale). A character class followed by + (the plus sign) is a single pattern item that matches one or more (as many as possible) of that character class. Therefore, "%s+" matches sequences of consecutive whitespace characters in this example:

> print(string.find("abc xyz", "%s+"))

4 4

> print(string.find("abc   xyz", "%s+"))

4 6

The characters do not all have to be the same, as shown here:

> print(string.find("abc\n \txyz ", "%s+"))

4 6

Squeeze's string.gsub call replaces any such sequences with a single space. (The parentheses around the call are there to keep string.gsub's second value from being returned.)

+ works with any character class. For example:

> print(string.gsub("aaa bbb aaa ccc", "a+", "X"))

X bbb X ccc 2

> print(string.gsub("aaa bbb aaa ccc", "[ab]+", "X"))

X X X ccc 3

Matches never overlap. When "a+" grabs all three occurrences of a at the beginning of "aaa bbb aaa ccc", it starts looking for the next match at the fourth character, not the second.

* (the asterisk, or star) is similar to +, except it matches 0 or more characters. In the following example, "^[%a_][%w_]*$" matches Lua identifiers and keywords (%w matches “word” characters, which are letters and numbers):

> Strs = {"Strl", "_G", "function", "1st", "splunge?",

>> "alpha bravo charlie", '"global" substitution'}

> for _, Str in ipairs(Strs) do

>> print(string.format("%q is %sa valid identifier or keyword",

>> Str, string.find(Str, "^[%a_][%w_]*$") and "" or "NOT "))

>> end

"Strl" is a valid identifier or keyword

"_G" is a valid identifier or keyword

"function" is a valid identifier or keyword

"1st" is NOT a valid identifier or keyword

"splunge?" is NOT a valid identifier or keyword

"alpha bravo charlie" is NOT a valid identifier or keyword

"\"global\" substitution" is NOT a valid identifier or keyword

Figure 5-3 is a diagram of "^[%a_][%w_]*$".

Figure 5-3

Notice that the * only applies to the character class right before it. This is also true of +, and of - and ?, the two other similar magic characters you'll soon learn.
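The identifier pattern accepts keywords too, because a pattern cannot express "any word except these." A table lookup can supply that exclusion. The sketch below uses the reserved words of Lua 5.1 (an assumption: other versions of Lua may reserve additional words), and IsIdentifier is a name invented for the illustration:

```lua
-- Reserved words of Lua 5.1 (other versions may differ):
local KeywordList = {"and", "break", "do", "else", "elseif", "end",
  "false", "for", "function", "if", "in", "local", "nil", "not", "or",
  "repeat", "return", "then", "true", "until", "while"}

-- Turn the list into a set for quick lookup:
local IsKeyword = {}
for _, Word in ipairs(KeywordList) do
  IsKeyword[Word] = true
end

-- True if Str has the shape of an identifier and is not a keyword:
function IsIdentifier(Str)
  return string.find(Str, "^[%a_][%w_]*$") ~= nil and not IsKeyword[Str]
end

print(IsIdentifier("_G"))        --> true
print(IsIdentifier("function"))  --> false
print(IsIdentifier("1st"))       --> false
```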

If you know what regular expressions are, then you've already figured out that patterns are the Lua equivalent. The limitation on what length modifiers like * can apply to is one of the things that keep Lua patterns from being full-fledged regular expressions, a choice that was made intentionally, to keep the string library small and simple. You can simulate most regular expression features through a clever use of Lua patterns.

When string.find finds a zero-length match, the start and end positions it returns will be the positions of (respectively) the characters after and before the match. Generally, such a match is at the beginning of the string, so the positions are 1 and 0. Zero-length matches can be confusing. The first of the following string.finds matches "aa", but the second one matches the empty string at the beginning of "aabb":

> print(string.find("aabb", "a*"))

1 2

> print(string.find("aabb", "b*"))

1 0

This is because searches always start at the beginning of the subject. When the pattern is "a*", Lua tries to find any occurrences of the letter a at the beginning of the subject string. In this example, it finds two. When the pattern is "b*", Lua tries to find any occurrences of b at the beginning of the subject string. In this too, it succeeds, but only by finding zero b's—the empty string.

string.find finds just the first match, so only if Lua fails to find a match at the beginning of the subject does it look for one at the second character. If it fails there, it looks at the third, and so on. A string.gsub can replace multiple matches. Some of them may be empty strings, but none of them will overlap. Of the matches in the following example, all but the second-to-last are empty strings:

> print(string.gsub("aabb", "b*", "<match>"))

<match>a<match>a<match><match> 4

Sometimes a search finds substrings that look like matches but turn out not to be. For example, this function trims (removes) any whitespace from the end of a string:

-- Trims trailing whitespace:

function TrimRight(Str)

return (string.gsub(Str, "%s+$", ""))

end

If given "  A  B  C  " (that's two spaces before and after each letter), the example returns "  A  B  C". Here's what string.gsub needs to do behind the scenes for this to work: It grabs as many whitespace characters as it can starting at character 1 of the subject. (This is two whitespace characters.) Then it checks whether it's hit the end of the subject. It hasn't, so it tries grabbing one less whitespace character for a total of one. This of course doesn't get it to the end of the subject either, and because grabbing zero whitespace characters is not an option with +, it gives up on finding a match at character 1 of the subject and goes through the same business at character 2. This doesn't result in a match either, so it moves on to character 3, where it can give up immediately, because character 3 is not a whitespace character. This whole process continues down the subject string until a match is finally found at character 10 (right after the C).

In the previous example, string.gsub looks at some characters more than once. Character 2 gets looked at once as a continuation of a potential match at character 1, and again as the beginning of a potential match. This is called backtracking.

By comparison, the Squeeze function from earlier in this chapter had a string.gsub that used the same pattern except without the $ anchor. Given the same subject string, it would grab as many whitespace characters as it could starting at character 1, and because this would be a match, it could immediately jump to character 3. It knows just by looking at each character in the subject whether that character is the start of a match or not, and it also doesn't have to backtrack to see how long the match is.

If a pattern match seems to be a slow spot in your program, try to rewrite it to reduce or eliminate backtracking. The version of TrimRight given earlier is fine for short strings, but it will be noticeably slow if used to trim a sufficiently long string (perhaps read in from a file) with lots of runs of consecutive whitespace in it. There are a few ways to write TrimRight to avoid backtracking. One is to look at as many characters of the string as necessary, starting at the end and working towards the beginning, like this:

-- Trims trailing whitespace:

function TrimRight(Str)

-- By searching from the end backwards, find the position

-- of the last nonwhitespace character:

local I = #Str

while I > 0 and string.find(string.sub(Str, I, I), "%s") do

I = I - 1

end

return string.sub(Str, 1, I)

end
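Trimming from the left does not have the same problem. A pattern anchored with ^ is tried only at the start of the subject, so there are no repeated attempts further along the string. The following sketch shows a left-hand counterpart; TrimLeft is a name chosen for the illustration:

```lua
-- Trims leading whitespace. Because of the ^ anchor, the pattern is
-- only ever tried at character 1 of the subject:
function TrimLeft(Str)
  return (string.gsub(Str, "^%s+", ""))
end

print("<" .. TrimLeft("  A  B  C  ") .. ">")  --> <A  B  C  >
print("<" .. TrimLeft("ABC") .. ">")          --> <ABC>
```

Applying TrimLeft to the result of TrimRight would trim both ends.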

The - (hyphen or minus sign) is just like * except that * matches as many characters as possible, and - matches as few as possible. Both the patterns in the following example match two vertical bar characters and the characters in between them, but the first one finds the biggest match it can, whereas the second finds the smallest it can:

> print(string.find("|abc|def|", "%|.*%|"))

1 9

> print(string.find("|abc|def|", "%|.-%|"))

1 5

Because * and + match as many characters as possible, they are called greedy. In contrast, - is called nongreedy.

string.find always finds the first match in the subject, no matter whether greedy or nongreedy matching is used. In the following example, both the greedy and the nongreedy patterns find "<lengthy>", even though "<short>" has fewer characters:

> Str = "blah<lengthy>blah<short>blah"

> print(string.find(Str, "%<%a*%>"))

5 13

> print(string.find(Str, "%<%a-%>"))

5 13

Also, the greedy/nongreedy distinction controls how long a match will be, but not whether a match will be found. If * finds a match, then - will find one too, and vice versa.
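The difference matters most with string.gsub, where a greedy item can swallow several intended matches at once. As a small illustration (the sample string is made up), here is an attempt to strip angle-bracketed tags with each kind of item:

```lua
Html = "<b>bold</b> and <i>italic</i>"

-- Greedy: one match runs from the first "<" to the last ">",
-- so the whole string is replaced by the empty string.
print(string.gsub(Html, "<.*>", ""))  -- prints an empty string, then 1

-- Nongreedy: each match stops at the nearest ">".
print(string.gsub(Html, "<.->", ""))  --> bold and italic  4
```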

A character class followed by ? (a question mark) matches zero or one of that character class. In the following example, this is used to match either "Mr" or "Mr.":

> print(string.gsub("Mr. Smith and Mr Smythe", "Mr%.?", "Mister"))

Mister Smith and Mister Smythe 2

? is greedy in this example—it matched the dot in "Mr." even though nothing after it forced it to.

*, +, -, and ? are all of the magic characters used to control the length of a match. Here's a summary:

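· * matches zero or more of its character class, as many as possible (greedy)

· + matches one or more of its character class, as many as possible (greedy)

· - matches zero or more of its character class, as few as possible (nongreedy)

· ? matches zero or one of its character class (greedy)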

These magic characters can be combined for other match lengths. For example, "%a%a%a%a?%a?%a?" matches three to six letters, and "%a%a%a%a*" matches three or more letters.
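Patterns like these can be generated rather than typed. The sketch below builds an "at least Min, at most Max" pattern for any character class by using string.rep; RepeatRange is a hypothetical helper, and it assumes Min is not greater than Max:

```lua
-- Hypothetical helper: returns a pattern matching between Min and Max
-- occurrences of Class (assumes 0 <= Min <= Max):
function RepeatRange(Class, Min, Max)
  return string.rep(Class, Min) .. string.rep(Class .. "?", Max - Min)
end

Pat = RepeatRange("%a", 3, 6)
print(Pat)                                         --> %a%a%a%a?%a?%a?
print(string.find("abcd", "^" .. Pat .. "$"))      --> 1  4
print(string.find("ab", "^" .. Pat .. "$"))        --> nil
print(string.find("abcdefg", "^" .. Pat .. "$"))   --> nil
```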

Captures

Selected parts of a match can be captured, which means they can be separated from the rest of the match. This allows one pattern to do the work of several.

The FriendlyDate function used in the following example takes a date formatted as yyyy-mm-dd; captures that date's year, month, and day; and returns the same date in a more user-friendly format. First, define FriendlyDate's helper function, Ordinal, and then FriendlyDate itself, like this:

-- Returns the ordinal form of N:

function Ordinal(N)

N = tonumber(N)

local Terminator = "th"

assert(N > 0, "Ordinal only accepts positive numbers")

assert(math.floor(N) == N, "Ordinal only accepts integers")

if string.sub(N, -2, -2) ~= "1" then

local LastDigit = N % 10

if LastDigit == 1 then

Terminator = "st"

elseif LastDigit == 2 then

Terminator = "nd"

elseif LastDigit == 3 then

Terminator = "rd"

end

end

return N .. Terminator

end

-- Returns a user-friendly version of a date string. (Assumes

-- its argument is a valid date formatted yyyy-mm-dd.)

function FriendlyDate(DateStr)

local Year, Month, Day = string.match(DateStr,

"^(%d%d%d%d)%-(%d%d)%-(%d%d)$")

Month = ({"January", "February", "March", "April", "May",

"June", "July", "August", "September", "October",

"November", "December"})[tonumber(Month)]

return Month .. " " .. Ordinal(Day) .. ", " .. Year

end

Then try it out like this:

> print(FriendlyDate("1964-02-09"))

February 9th, 1964

> print(FriendlyDate("2007-07-13"))

July 13th, 2007

> -- FriendlyDate assumes its argument is valid. It complains

> -- about some invalid arguments, but not all of them:

> print(FriendlyDate("9999-99-99"))

stdin:7: attempt to concatenate local 'Month' (a nil value)

stack traceback:

stdin:7: in function 'FriendlyDate'

stdin:1: in main chunk

[C]: ?

> print(FriendlyDate("0000-12-99"))

December 99th, 0000

Parentheses within a pattern are used to capture parts of the match. The string.match function returns all captures from the first match it finds, or nil if no match is found:

> Pat = "(%a)(%a*)"

> print(string.match("123 alpha bravo charlie", Pat))

a lpha

> print(string.match("123", Pat))

nil

In FriendlyDate, the pattern given to string.match captures the year, month, and day, which will be nil if DateStr is incorrectly formatted, as shown here:

-- Returns a user-friendly version of a date string. (Assumes

-- its argument is a valid date formatted yyyy-mm-dd.)

function FriendlyDate(DateStr)

local Year, Month, Day = string.match(DateStr,

"^(%d%d%d%d)%-(%d%d)%-(%d%d)$")

Month = ({"January", "February", "March", "April", "May",

"June", "July", "August", "September", "October",

"November", "December"})[tonumber(Month)]

return Month .. " " .. Ordinal(Day) .. ", " .. Year

end

After the year, month, and day are in hand, it's just a matter of turning the month from a number to a name, and turning the date from, e.g., "01" to "1st".
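Because the captures are nil when the pattern doesn't match, a stricter function can test them before going further. The following sketch returns the three parts as numbers, or nil for input that is badly formatted or obviously out of range. ParseDate is a hypothetical name, and the range check is deliberately rough (it does not know how many days each month has):

```lua
-- Hypothetical helper: returns year, month, and day as numbers, or nil
-- if DateStr is not yyyy-mm-dd or is obviously out of range.
function ParseDate(DateStr)
  local Year, Month, Day = string.match(DateStr,
    "^(%d%d%d%d)%-(%d%d)%-(%d%d)$")
  if not Year then
    return nil -- The pattern did not match at all.
  end
  Year, Month, Day = tonumber(Year), tonumber(Month), tonumber(Day)
  if Month < 1 or Month > 12 or Day < 1 or Day > 31 then
    return nil -- Rough range check only.
  end
  return Year, Month, Day
end

print(ParseDate("1964-02-09"))  --> 1964  2  9
print(ParseDate("9999-99-99"))  --> nil
print(ParseDate("1964-2-9"))    --> nil
```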

string.find also returns any captures, after its first two return values. For example:

> print(string.find("123 alpha bravo charlie", "(%a)(%a*)"))

5 9 a lpha

Lua 5.0 didn't have string.match. In its place, use string.find (ignoring its first two return values).

If string.match is given a pattern with no captures, it returns the entire (first) match like this:

> print(string.match("123 alpha bravo charlie", "%a+"))

alpha

If one capture contains another, they are ordered according to the position of the first parenthesis like this:

> print(string.match("abcd", "(((%a)(%a))((%a)(%a)))"))

abcd ab a b cd c d

If () is used as a capture, then the position in the subject is captured—more specifically, the position of the next character. The position at the end of a pattern is the length of the subject plus one, as shown here:

> print(string.match("abcd", "()(%a+)()"))

1 abcd 5

A position capture is a number, such as the following:

> print(type(string.match("abcd", "ab()cd")))

number

All other captures are strings, although they may look like numbers or be empty strings, as shown in the following example:

> print(type(string.match("1234", "(%d+)")))

string

> print(string.match("1234", "(%a*)") == "")

true

A *, +, -, or ? character must be directly after its character class, with no intervening capture parentheses. Similarly, a ^ or $ anchor character must be the very first or last character in its pattern. In the current implementation of Lua, the patterns "(%a)*" and "(^%a)" are both valid, but they have different meanings than you might expect. "(%a)*" means “capture a letter that is followed by a star” in the following example:

> print(string.match("ab*cd", "(%a)*"))

b

And "(^%a)" means “capture a caret and a letter” here:

> print(string.match("a^z", "(^%a)"))

^z

The equivalent patterns that give the star and the caret their magic meanings are "(%a*)" and "^(%a)" in the following example:

> print(string.match("ab*cd", "(%a*)"))

ab

> print(string.match("a^z", "^(%a)"))

a

A percent followed by 1 through 9 represents that capture, so %1 is the first capture. This means that in the following example, "(%a+) %1" matches two consecutive identical words separated by a space (the pattern translates to “match and capture a word, match a space, match the first capture”):

> print(string.gsub("Paris in the the spring",

>> "(%a+) %1", "WORD WORD"))

Paris in WORD WORD spring 1
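Another common use of a replayed capture is matching text enclosed in quotes when either kind of quote is allowed: capture the opening quote, then require the same character to close. In this sketch, QuotedText is a made-up helper name, and escaped quotes inside the text are not handled:

```lua
-- Hypothetical helper: returns the text inside the first pair of
-- matching single or double quotes, or nil if there is none.
function QuotedText(Str)
  local Quote, Body = string.match(Str, "([\"'])(.-)%1")
  return Body
end

print(QuotedText([[He said "it's late" and left]]))  --> it's late
print(QuotedText([[a 'single' word]]))               --> single
print(QuotedText("no quotes here"))                  --> nil
```

In the first call, the apostrophe in "it's" does not end the match, because %1 holds a double quote.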

A capture used within the pattern that captured it is said to be replayed. You cannot replay a capture inside itself, because this would cause an infinite regress. If you try to do so, you get the following error:

> print(string.match("blah", "(%1)"))

stdin:1: invalid capture index

stack traceback:

[C]: in function 'match'

stdin:1: in main chunk

[C]: ?

You can also access captures within a string.gsub replacement string. For example (%% represents a literal percent sign):

> Percents = "90 percent, 100 percent, 110 percent"

> print(string.gsub(Percents, "(%d+) percent", "%1%%"))

90%, 100%, 110% 3

In a replacement string, %0 stands for the whole match, as in this example:

> Str = "alpha, bravo, charlie"

> print(string.gsub(Str, "%a+", "<%0>"))

<alpha>, <bravo>, <charlie> 3

Lua 5.0 didn't understand %0. The whole match, if desired, had to be explicitly captured.
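Captures can appear in a replacement string in any order, which makes rearranging text straightforward. As a small illustration (the names are sample data), this turns "Last, First" into "First Last":

```lua
Names = "Lovelace, Ada; Hopper, Grace"

-- %1 is the last name and %2 is the first name:
print(string.gsub(Names, "(%a+), (%a+)", "%2 %1"))
--> Ada Lovelace; Grace Hopper  2
```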

Because percent signs are magic in replacement strings, any replacement string supplied by a user or otherwise generated at run time must be percent-escaped automatically, like so:

Str = string.gsub(Str, "%%", "%%%%")
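Wrapped in a function, that escaping step might look like the following sketch. EscapeReplacement is a hypothetical name, and UserText stands in for text that arrives at run time:

```lua
-- Hypothetical helper: doubles every percent sign so that Str can be
-- used safely as a string.gsub replacement string.
function EscapeReplacement(Str)
  return (string.gsub(Str, "%%", "%%%%"))
end

UserText = "100% off"
print(EscapeReplacement(UserText))  --> 100%% off
print(string.gsub("Today: DISCOUNT", "DISCOUNT",
  EscapeReplacement(UserText)))     --> Today: 100% off  1
```

The parentheses around the string.gsub call inside the helper keep the substitution count from being returned along with the escaped string.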

Matching Balanced Delimiters

Parentheses, curly braces, and square brackets, as used in Lua, are examples of delimiters, because they mark the beginning and end of whatever they surround. You use them in pairs, so that each open delimiter has a corresponding close delimiter. Delimiters that are paired in this way are said to be balanced, and Lua offers a pattern item that matches them. This pattern item is four characters long: the first two characters are “%b”, and the next two are the open delimiter and the close delimiter. For example, "%b()" matches balanced parentheses, "%b{}" matches balanced curly braces, and "%b[]" matches balanced square brackets. The two characters after %b always represent themselves; any magic meaning they might have is ignored.

Here's an example that converts the two top-level pairs into "balanced":

> Str = "((a b) (b c)) ((c d) (e f))"

> print(string.gsub(Str, "%b()", "balanced"))

balanced balanced 2

To get at the inner pairs, you'd only need to grab the top-level pairs and run "%b()" on them.

For comparison, here are three attempts to do this without the %b pattern item, where the first would work if there was only one pair of parentheses, and the second two would work if there were no nested pairs:

> print(string.gsub(Str, "%(.*%)", "imbalanced"))

imbalanced 1

> print(string.gsub(Str, "%(.-%)", "imbalanced"))

imbalanced imbalanced) imbalanced imbalanced) 4

> print(string.gsub(Str, "%([^()]*%)", "imbalanced"))

(imbalanced imbalanced) (imbalanced imbalanced) 4

The %b delimiters can be any characters (other than "\0"). For example, a "%b %" pattern would match delimited strings that begin with a space and end with a percent sign.
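A %b item can be captured like any other part of a pattern. The sketch below pulls out the argument list of the first function call in a piece of source text, nested parentheses included. CallArguments is a made-up helper name, and like %b itself, it is confused by parentheses inside quoted strings:

```lua
-- Hypothetical helper: returns the text between the parentheses of the
-- first call-like construct (a word followed by balanced parentheses),
-- or nil if there is none.
function CallArguments(Code)
  local Args = string.match(Code, "%a+(%b())")
  -- Strip the outer parentheses from the capture:
  return Args and string.sub(Args, 2, -2)
end

print(CallArguments("print(f(1, 2), g(3))"))  --> f(1, 2), g(3)
print(CallArguments("no call here"))          --> nil
```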

A string like '{"a", "}", "z"}' would confuse "%b{}", because it doesn't know not to treat the quoted "}" as a delimiter. If you run into this situation, and you can figure out which delimiter characters should be ignored, then you can use the trick of converting them to characters you know will be unused (such as "\1" and "\2"), doing the %b matching, and then converting them back.

It might be easier to just arrange for the string you're looking at to be a valid Lua table constructor. To do this, you can prepend "return" to it, apply loadstring to it, give the resulting function an empty environment with setfenv, and call the function (using pcall to guard against errors, as described in the next chapter). The function's return value will be the table described by the table constructor.

More on string.find, string.match, and string.gsub

There are a few more features of string.find, string.match, and string.gsub that need to be covered. After you learn them, you'll know everything there is to know about these functions.

string.find and string.match both take a third argument, which is a number that specifies which character of the subject to start the search at. Any matches that start before this character will be ignored. For example:

> Subj, Pat = "abc <--> xyz", "(%a+)"

> -- Start searching at character 2 ("b"):

> print(string.match(Subj, Pat, 2))

bc

> -- Start searching at character 5 ("<"):

> print(string.match(Subj, Pat, 5))

xyz

A caret anchors the pattern at the beginning of the search, not the beginning of the subject, as follows:

> Subj, Pat = "aa ab ac", "^(a%a)"

> -- Character 4 is an "a", so this matches:

> print(string.match(Subj, Pat, 4))

ab

> -- Character 5 is not an "a", so this doesn't match:

> print(string.match(Subj, Pat, 5))

nil
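Anchoring at the start of the search is what makes a simple scanner possible: at each position, you ask whether a token begins exactly there. The following is a minimal sketch of that idea, not a complete lexer. Scan is a hypothetical name, and the token rules (runs of digits, runs of letters, any other single character, whitespace skipped) are assumptions chosen for illustration.

```lua
-- Hypothetical helper: splits Subj into number, word, and
-- single-character tokens, skipping whitespace.
function Scan(Subj)
  local Ret, Pos = {}, 1
  while Pos <= #Subj do
    -- Try each token pattern at the current position only:
    local _, End = string.find(Subj, "^%d+", Pos)
    if not End then
      _, End = string.find(Subj, "^%a+", Pos)
    end
    if not End then
      End = Pos -- Any other single character.
    end
    local Tok = string.sub(Subj, Pos, End)
    if not string.find(Tok, "^%s$") then
      Ret[#Ret + 1] = Tok
    end
    Pos = End + 1
  end
  return Ret
end

print(table.concat(Scan("x1 = 42+y"), "|")) --> x|1|=|42|+|y
```

Without the caret, string.find would skip ahead to the next run of digits or letters, and characters in between would be silently lost.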

Returned string positions are reckoned from the beginning of the subject (not the beginning of the search) like this:

> Subj = "aa ab ac"

> print(string.find(Subj, "(a%a)", 6))

7 8 ac

> print(string.match(Subj, "()(a%a)()", 6))

7 ac 9
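Because returned positions are relative to the whole subject, they can be fed straight back in as the next starting position. Here is a minimal sketch of that loop. FindAll is a hypothetical name, and the handling of empty matches (stepping one extra character, and not reporting an empty match at the very end of the subject) is a simplifying assumption.

```lua
-- Hypothetical helper: returns an array of {Start, End} pairs, one
-- for each nonoverlapping match of Pat in Subj.
function FindAll(Subj, Pat)
  local Ret, Pos = {}, 1
  while Pos <= #Subj do
    local Start, End = string.find(Subj, Pat, Pos)
    if not Start then
      break
    end
    Ret[#Ret + 1] = {Start, End}
    -- Resume after this match. An empty match has End == Start - 1,
    -- so step past Start to avoid finding it again forever:
    Pos = math.max(End, Start) + 1
  end
  return Ret
end

for _, Range in ipairs(FindAll("aa ab ac", "a%a")) do
  print(Range[1], Range[2])
end
--> 1 2
--> 4 5
--> 7 8
```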

If the fourth string.find argument is true, it will ignore the magic meanings of characters in its second argument, treating it as a plain old string rather than a pattern. To give a fourth argument, you need to give a third; use 1 if you want the search to start at the beginning of the subject as usual. For example:

> -- Both of these look for a caret, an "a", a percent sign,

> -- and another "a":

> print(string.find("characters: ^a%a", "^a%a", 1, true))

13 16

> print(string.find("ab", "^a%a", 1, true))

nil
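A plain search is handy whenever the text you're looking for comes from somewhere else (user input, for example) and might contain magic characters. The following sketch wraps it in a hypothetical Contains helper and compares the two kinds of search on a string containing a plus sign.

```lua
-- Hypothetical helper: true if Needle occurs literally in Subj.
function Contains(Subj, Needle)
  return string.find(Subj, Needle, 1, true) ~= nil
end

-- As a pattern, "1+1" means "one or more 1s followed by a 1":
print(string.find("1+1=2", "1+1"))          --> nil
-- As a plain string, it means exactly what it says:
print(string.find("1+1=2", "1+1", 1, true)) --> 1 3
print(Contains("100%", "%"))                --> true
```

The last call would be an error without the plain flag, because a lone "%" is not a valid pattern.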

So far, the string.gsub replacement argument (the third argument) has always been a string. That string can include captures, but sometimes that's not enough power to do what you want to do. For smarter replacements, the replacement argument can be a function or a table. If it's a function, it is called on each match with the match's captures as arguments, and the match is replaced with the function's return value. In the following example, the first letter and the rest of the letters of each word are captured, and the first letter is capitalized:

> Str = "If it ain't broke, don't fix it."

> Str = string.gsub(Str, "(%a)([%a'-]*)",

>> function(First, Rest)

>> return string.upper(First) .. Rest

>> end)

> print(Str)

If It Ain't Broke, Don't Fix It.
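A replacement function is useful whenever the replacement has to be computed from the match. As a further sketch, here is a hypothetical PercentEncode helper that replaces every character other than a letter or digit with a percent sign and the character's two-digit hexadecimal code. Treating only letters and digits as safe is a simplifying assumption, not a full URL-encoding rule.

```lua
-- Hypothetical helper: percent-encodes every character that is not
-- a letter or a digit.
function PercentEncode(Str)
  -- The extra parentheses discard string.gsub's second return value.
  return (string.gsub(Str, "%W", function(Char)
    return string.format("%%%02X", string.byte(Char))
  end))
end

print(PercentEncode("a b&c")) --> a%20b%26c
```

The value returned by the function is inserted as is, so the percent signs it contains are not treated as capture references.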

If there are no captures, the whole match is passed to the function. If the function returns nil or false, no replacement is done. Both these points are demonstrated by the following example, which turns "cat" into "dog", but makes no change to other words that contain "cat":

> Str = "concatenate cathy ducat cat"

> Str = string.gsub(Str, "%a+",

>> function(Match)

>> return Match == "cat" and "dog"

>> end)

> print(Str)

concatenate cathy ducat dog

In Lua 5.0, if the replacement function returned nil or false, the match was replaced with the empty string.

If the string.gsub replacement argument is a table, then the first capture—or the whole match, if there are no captures—is used to index the table, and the match is replaced with the value at that index, unless it's nil or false, in which case no replacement is done. Here's an example:

> Str = "dog bites man"

> Str = string.gsub(Str, "%a+", {dog = "man", man = "dog"})

> print(Str)

man bites dog
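A table replacement works well as a lookup from each matched character to its substitute. The following sketch (EscapeHtml is a hypothetical name, and it requires Lua 5.1 or later, as noted below) escapes the three characters that have special meaning in HTML text.

```lua
-- Hypothetical helper: replaces <, >, and & with HTML entities.
function EscapeHtml(Str)
  local Entities = {["<"] = "&lt;", [">"] = "&gt;", ["&"] = "&amp;"}
  -- The whole match (a single character) is used to index Entities.
  return (string.gsub(Str, "[<>&]", Entities))
end

print(EscapeHtml("a < b && c > d")) --> a &lt; b &amp;&amp; c &gt; d
```

Doing this with three separate string.gsub calls would require care about ordering, because the ampersands introduced by one call could be escaped again by the next.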

In Lua 5.0, the replacement argument could only be a string or a function.

Iterating Through All Matches

There's one more function in the string library: string.gmatch (where gmatch stands for global match). string.match only sees the first match, but string.gmatch lets you get at all the matches. It does this by iterating through the matches—like pairs and ipairs, it's an iterator factory.

The following example of string.gmatch is an HTML tokenizer. Most computer languages are composed, at one level, of tokens. These are the units in terms of which the language's syntax is defined. (Lua tokens include literal strings, keywords, parentheses, commas, and so on.) The example separates a string of HTML (the language used to write web pages) into tokens. Specifically, it returns a table of HTML tags and what's in between those tags. An HTML tag consists of characters delimited by open and close angle brackets (<>), like <this>. (This is a crude tokenizer; an industrial-strength HTML tokenizer would take us too far off topic.)

Here's the example, which defines the TokenizeHtml function using string.gmatch:

-- Turns a string of HTML into an array of tokens. (Each

-- token is a tag or a string before or after a tag; literal

-- open angle brackets cannot occur outside of tags, and

-- literal close angle brackets cannot occur inside them.)

function TokenizeHtml(Str)

  local Ret = {}

  -- Chop off any leading nontag text:

  local BeforeFirstTag, Rest = string.match(Str, "^([^<]*)(.*)")

  if BeforeFirstTag ~= "" then

    Ret[1] = BeforeFirstTag

  end

  -- Get all tags and anything in between or after them:

  for Tag, Nontag in

    string.gmatch(Rest, "(%<[^>]*%>)([^<]*)")

  do

    Ret[#Ret + 1] = Tag

    if Nontag ~= "" then

      Ret[#Ret + 1] = Nontag

    end

  end

  return Ret

end

Now, try it out:

> Html = "<p>Some <i>italicized</i> text.</p>"

> for _, Token in ipairs(TokenizeHtml(Html)) do print(Token) end

<p>

Some

<i>

italicized

</i>

text.

</p>

First, TokenizeHtml separates its argument into two parts: the possibly empty string before the first tag (if there is a first tag), and the possibly empty remainder of the string:

function TokenizeHtml(Str)

  local Ret = {}

  local BeforeFirstTag, Rest = string.match(Str, "^([^<]*)(.*)")

  if BeforeFirstTag ~= "" then

    Ret[1] = BeforeFirstTag

  end

Notice that this match cannot fail, even on an empty string. Also notice that, if there's at least one tag in Str, the BeforeFirstTag string will be as long as it needs to be to include everything that comes before the tag, because greedy matching is used.

After TokenizeHtml saves the text before the first tag (if it's nonempty), it uses string.gmatch to loop through the rest of the string. string.gmatch takes a subject and a pattern. It returns an iterator that, on each iteration, returns the captures from the pattern's next match in the subject. In this case, the pattern is as follows:

"(%<[^>]*%>)([^<]*)"

Figure 5-4 shows how this pattern can be diagrammed.

Figure 5-4

The two captures returned by the iterator are given the names Tag and Nontag. The pattern guarantees that Tag will never be the empty string (at a minimum, it will be "<>"). If two tags are next to each other, then Nontag will be empty, in which case it won't be put into Ret:

  for Tag, Nontag in

    string.gmatch(Rest, "(%<[^>]*%>)([^<]*)")

  do

    Ret[#Ret + 1] = Tag

    if Nontag ~= "" then

      Ret[#Ret + 1] = Nontag

    end

  end

The loop does an iteration for each match, so if there are no tags to be found, then it will do zero iterations and Ret will have only one or zero elements.
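The edge cases just described can be checked directly. The block below repeats TokenizeHtml from above so that it runs on its own; the test strings are made up for illustration.

```lua
function TokenizeHtml(Str)
  local Ret = {}
  -- Chop off any leading nontag text:
  local BeforeFirstTag, Rest = string.match(Str, "^([^<]*)(.*)")
  if BeforeFirstTag ~= "" then
    Ret[1] = BeforeFirstTag
  end
  -- Get all tags and anything in between or after them:
  for Tag, Nontag in string.gmatch(Rest, "(%<[^>]*%>)([^<]*)") do
    Ret[#Ret + 1] = Tag
    if Nontag ~= "" then
      Ret[#Ret + 1] = Nontag
    end
  end
  return Ret
end

print(#TokenizeHtml(""))                          --> 0
print(#TokenizeHtml("no tags at all"))            --> 1
print(table.concat(TokenizeHtml("<b></b>"), "|")) --> <b>|</b>
print(table.concat(TokenizeHtml("x<br>y"), "|"))  --> x|<br>|y
```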

In Lua 5.0, string.gmatch was named string.gfind (but was otherwise identical).

If the pattern given to string.gmatch has no captures, the iterator will return the whole match:

> for Letter in string.gmatch("1st 2nd 3rd", "%a") do

>> print(Letter)

>> end

s

t

n

d

r

d
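With more than one capture, each iteration hands back all of them, which makes string.gmatch a convenient way to fill a table. The following is a small sketch. ParsePairs is a hypothetical name, and the input format (keys and values made of letters and digits, joined by an equal sign) is an assumption chosen for illustration.

```lua
-- Hypothetical helper: turns "key = value" pairs found in Str into
-- a table. Values are kept as strings.
function ParsePairs(Str)
  local Ret = {}
  for Key, Val in string.gmatch(Str, "(%w+)%s*=%s*(%w+)") do
    Ret[Key] = Val
  end
  return Ret
end

Settings = ParsePairs("width = 80, height=24")
print(Settings.width, Settings.height) --> 80 24
```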

Tricks for the Tricky

Sometimes when you can't write a pattern to match what you want to match, you can get around this by matching a little more than you need to, and then ignoring the match if it's a false positive. This is the approach used in the previous cat-to-dog example:

> Str = "concatenate cathy ducat cat"

> Str = string.gsub(Str, "%a+",

>> function(Match)

>> return Match == "cat" and "dog"

>> end)

> print(Str)

concatenate cathy ducat dog

This approach would also work for tasks such as the following:

· Changing "cat", "dog", and "bird", but not other words, to "animal"

· Matching “cat” case-insensitively without using "[Cc][Aa][Tt]"

· Matching 3-to-10-letter words
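The three tasks in this list might be sketched as follows. All three match every run of letters with "%a+" and then decide in Lua code whether the match is wanted. The function names are hypothetical, and the third one collects the words rather than replacing them.

```lua
-- Changes "cat", "dog", and "bird", but not other words, to "animal":
function Animalize(Str)
  local Animals = {cat = true, dog = true, bird = true}
  return (string.gsub(Str, "%a+", function(Word)
    -- nil for any other word, which means "leave it alone":
    return Animals[Word] and "animal"
  end))
end

-- Changes the word "cat", in any mix of cases, to "dog":
function CatToDogAnyCase(Str)
  return (string.gsub(Str, "%a+", function(Word)
    return string.lower(Word) == "cat" and "dog"
  end))
end

-- Returns an array of the 3-to-10-letter words in Str:
function MidLengthWords(Str)
  local Ret = {}
  for Word in string.gmatch(Str, "%a+") do
    if #Word >= 3 and #Word <= 10 then
      Ret[#Ret + 1] = Word
    end
  end
  return Ret
end

print(Animalize("cat dogs bird catalog"))  --> animal dogs animal catalog
print(CatToDogAnyCase("Cat CAT cathy"))    --> dog dog cathy
print(table.concat(MidLengthWords("a cat concatenation of strings"), " "))
--> cat strings
```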

Another technique is to use a modified version of the subject. In the following, the pattern "%Wcat%W" even finds the word “cat” at the end of the subject, because the concatenated newlines ensure that it can't be at the very end (or the very beginning):

> Str = "concatenate cathy ducat cat"

> Count = 0

> for _ in string.gmatch("\n" .. Str .. "\n", "%Wcat%W") do

>> Count = Count + 1

>> end

> io.write("'cat' occurs ", Count, " time(s)\n")

'cat' occurs 1 time(s)

Another technique is to capture positions and use them to look around inside the subject, like this:

> Str = "concatenate cathy ducat cat"

> Str = string.gsub(Str, "()cat()",

>>   function(Pos1, Pos2)

>>     Pos1 = Pos1 - 1 -- The character before the match.

>>     -- Is the match at the beginning of the string or

>>     -- preceded by a nonword character?

>>     if Pos1 == 0 or string.find(Str, "^%W", Pos1) then

>>       -- Is it also at the end of the string or followed by

>>       -- a nonword character?

>>       if Pos2 > #Str or string.find(Str, "^%W", Pos2) then

>>         return "dog"

>>       end

>>     end

>>   end)

> print(Str)

concatenate cathy ducat dog

In the case of matching a whole word, there's yet another technique—the undocumented frontier pattern item, which is shown here:

> Str = "concatenate cathy ducat cat"

> Str = string.gsub(Str, "%f[%w]cat%f[%W]",

>> function(Match)

>> return Match == "cat" and "dog"

>> end)

> print(Str)

concatenate cathy ducat dog

%f is followed by a bracket class. It matches an empty string that comes after a character not in the class and before one that is in the class, but the empty string matched can also be at the beginning or end of the subject. The pattern "%f[a]a" matches any “a” that isn't preceded by another “a,” and the pattern "a%f[^a]" matches any “a” that isn't followed by another “a.”
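The two patterns just mentioned, and a whole-word counter built on the same idea, can be tried out as follows. CountWord is a hypothetical helper, and it assumes Word contains only letters and digits (so that it has no magic characters).

```lua
-- An "a" not preceded by another "a":
print((string.gsub("aaa baa", "%f[a]a", "X")))  --> Xaa bXa
-- An "a" not followed by another "a":
print((string.gsub("aaa baa", "a%f[^a]", "X"))) --> aaX baX

-- Hypothetical helper: counts occurrences of Word as a whole word.
function CountWord(Str, Word)
  local Count = 0
  for _ in string.gmatch(Str, "%f[%w]" .. Word .. "%f[%W]") do
    Count = Count + 1
  end
  return Count
end

print(CountWord("cat cat concatenate cat", "cat")) --> 3
```

Because %f matches an empty string, it doesn't consume the space between two neighboring words. On this same subject, the padded "%Wcat%W" technique shown earlier would count only 2, since the first match uses up the space that the second "cat" would need in front of it.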

The fact that %f is undocumented means that it may—without notice—change or disappear altogether in a subsequent release of Lua. It also means that there's no explicit guarantee of its behavior, and it may act unexpectedly when used in unusual situations such as this one:

> -- This should find the empty string at the beginning of the

> -- subject, but it doesn't (due to an implementation quirk):

> print(string.find("\0", "%f[%z]"))

nil

Magic Characters Chart

Here's a chart of all the magic characters and magic character sequences that patterns can contain:

[Tables 5-3 and 5-4: chart of magic characters and magic character sequences]

Summary

In this chapter, you learned all the functions in the string library (except for string.dump, which is covered in Chapter 10), along with a few in the I/O library. Here are the highlights:

· You can convert the case of strings with string.lower and string.upper.

· You can obtain substrings with string.sub.

· You can format strings with string.format, whose first argument is a format string with placeholders (which start with a percent sign).

· You get more fine-grained control over output with io.write, and over standard input with io.read.

· When you use io.open to open a file, it returns a file handle—an object with read, write, and close methods.

· You perform pattern matching with magic characters such as %, ^, and *.

· Anchoring (which you do with ^ and $) forces a match to be at the beginning and/or end of the subject string (or the searched part of the subject string, if a third argument is given to string.find or string.match).

· Greedy matching (“*”, “+”, and “?”) matches as many characters as possible. Nongreedy matching (“-”) matches as few as possible. Either way, each match will be as close to the beginning of the searched part of the subject string as possible.

· string.gsub does substitution on all matches of a pattern.

· string.find finds the first match.

· string.match returns the captures from the first match.

· string.gmatch returns an iterator that loops through all matches.

At this point, you have all the tools you need to start writing real programs, except that the programs you write may not be very robust. In the next chapter, you'll learn how to prevent errors, or to keep them from stopping your program cold when they do happen. To test your understanding of this chapter, you can do the following exercises (answers are in the appendix).

Exercises

1. Write a function that takes an n-character string and returns an n-element array whose elements are the string's characters (in order).

2. Write the format string Frmt so that the following:

for _, Name in ipairs({"Lynn", "Jeremy", "Sally"}) do

io.write(string.format(Frmt, Name))

end

will print this:

  Lynn

Jeremy

 Sally

3. Write a comparison function that allows table.sort to sort in “dictionary order.” Specifically, case distinctions and any characters other than letters or numbers should be ignored, unless they are the only ways in which two strings differ.

> Names = {"Defoe", "Deforest", "Degas", "de Forest"}

> table.sort(Names, DictCmp)

> for _, Name in ipairs(Names) do print(Name) end

Defoe

Deforest

de Forest

Degas

4. Write a function that starts up a subinterpreter that prints a prompt, reads a line, and prints the result(s) of evaluating the expression(s) typed onto that line. Typing a line with nothing but the word "quit" should exit the subinterpreter.

> ExprInterp()

expression> 2 + 2

4

expression> true, false, nil

true false nil

expression> string.gsub("somewhere", "[Ss]", "%0h")

shomewhere 1

expression> quit

>

There's no need to check for errors in what is typed (you'll learn how to do this in the next chapter) or to special-case empty lines. (Hint: This exercise doesn't require any pattern matching.)

5. The TrimRight function given in this chapter trims off trailing whitespace. Write its counterpart: TrimLeft, a function that trims leading whitespace.

6. Does TrimLeft ever need to do any backtracking?

7. Write an Interpolate function that replaces dollar signs followed by identifiers with the value of the named global variable.

> Where, Who, What =

>> "in xanadu", "kubla khan", "a stately pleasure-dome"

> print(Interpolate("$Where did $Who\n$What decree"))

in xanadu did kubla khan

a stately pleasure-dome decree

> print(Interpolate("string = $string, asdf = $asdf"))

string = table: 0x481dd0, asdf = nil