UNIX: The Complete Reference (2007)

Part V: Tools and Programming

Chapter 21: awk and sed

Overview

The Swiss army knife of the UNIX System toolkit is awk. Many useful awk programs are only one line long, and in fact even a one-line awk program can be the equivalent of a regular UNIX System tool. For example, with a one-line awk program, you can count the number of lines in a file (like wc), print the first field in each line (like cut), print all lines that contain the phrase “open source” (like grep), or exchange the position of the third and fourth fields in each line (like join and paste). However, awk is a programming language with control structures, functions, and variables that allow you to write even more complex programs.

awk is specially designed for working with structured files and text patterns. It has built-in features for breaking input lines into fields and comparing these fields to patterns that you specify. This chapter will show you how to use awk to work with structured files such as inventories, mailing lists, and other tables or simple databases.

awk is often used in command pipelines with tools like sort, tr, or sed. Each of these commands can act as a preprocessor or filter to simplify a problem before solving it in awk. For example, it is difficult to sort lines in awk, so using sort on a file before passing the information to awk can make your programs much simpler. In fact, you can process a file in awk, send the result to sort through a pipeline, and then return the output to awk for further processing.

sed is an abbreviation for stream editor. Like awk, it can do complex pattern matching and editing on a stream of characters, although it does not have all of the powerful programming capabilities of awk. In addition to processing text like awk, sed can be used as an efficient noninteractive editor for very large files. sed uses a syntax that is very similar to many vi and ed commands. sed is more challenging to learn than awk, but it is often used as a preprocessor for awk programs.

This chapter will describe many of the commands of awk, enough to enable you to use it for many applications. It does not cover all of the functions, built-in variables, or control structures that awk provides. For a full description of the awk language with many examples, refer to The AWK Programming Language, by Alfred Aho, Brian Kernighan, and Peter Weinberger.

Because awk can be used for almost all of the same tasks, and most people find awk easier to use, this chapter does not devote as much time to sed. If you want to learn sed in greater depth, consult sed & awk, by Dale Dougherty and Arnold Robbins (see the last section of this chapter for bibliographical information).

Versions of awk

The awk program was originally developed by Aho, Kernighan, and Weinberger in 1977 as a pattern-scanning language (the name “AWK” comes from their initials). Many new features have been added since then. The version of awk first implemented in UNIX System V, Release 3.1, added many features, such as additional built-in functions. In order to preserve compatibility with programs that were written for the original version, this one was named nawk (new awk). The use of two different commands for the two versions was a temporary step to provide time to convert programs using the older version to the new one. On some systems, including AIX, the awk command actually runs nawk.

On some Linux and UNIX systems, the awk command may actually run the gawk program. gawk is an enhanced, freely distributable version of awk that is part of the GNU system. It includes some new features and extensions, including the ability to do pattern matching that ignores the distinction between uppercase and lowercase.

For simplicity, this chapter refers to the language as awk and uses the command name awk in the examples. If you want to be sure which version of awk you are using, consult your system manual pages.

How awk Works

The basic operation of awk is simple. It reads input from a file, a pipe, or the keyboard, and searches each line of input for patterns that you have specified. When it finds a line that matches a pattern, it performs an action. You specify the patterns and actions in an awk program.

An awk program consists of one or more pattern/action statements of the form

pattern {action}

A statement like this tells awk to test for the pattern in every line of input, and to perform the corresponding action whenever the pattern matches the input line. The pattern/action concept is an extension of the target/search model used by grep. In grep, the target is a pattern, and the action is to print the line containing the pattern.

You can use awk as a replacement for grep. The following awk program searches for lines containing the word “widget.” When it finds such a line, it prints it.

/widget/ {print}

The slashes indicate that you are searching for the target string “widget”. The action, print, is enclosed in braces.

Here is another example of a simple awk program:

/widget/ {w_count=w_count+1}

The pattern is the same, but the action is different. In this case, whenever a line contains “widget,” the variable w_count is incremented by 1.

The simplest way to run an awk program is to include it on the command line as an argument to the awk command, followed by the name of an input file. For example, the following program prints every line from the file inventory that contains the string “widget”:

$ awk '/widget/ {print}' inventory

This command line consists of the awk command, then the text of the program itself in single quotes, and then the name of the input file, inventory. The program text is enclosed in single quotes to prevent the shell from interpreting its contents as separate arguments or as instructions to the shell.

Default Patterns and Actions

If you want the action to apply to every line in the file, omit the pattern. By default, awk will match every line, so an action statement with no pattern causes awk to perform that action for every line in the input. For example, the command

$ awk '{print $1}' students

uses the special variable $1 to print the first field of every line in the file students.

You can also omit the action. The default action is to print an entire line, so if you specify a pattern with no action, awk will print every line that matches that pattern. For example,

$ awk '/science/' students

will print every line in students that contains the string science.

Working with Fields

You may recall from Chapter 20 that the shell automatically assigns the variables $1, $2, and so on to the command-line arguments for a script. Similarly, awk automatically separates each line of input into fields and assigns the fields to variables. So $1 is the first field in each line, $2 is the second, and so on. The entire line is in $0.

This makes it easy to work with tables and other formatted text files. For example, instead of printing whole lines, you can print specific fields from a table. Suppose you have the following list of names, states, and phone numbers:

Ben IN 650-333-4321

Dan AK 907-671-4321

Marissa NJ 732-741-1234

Robin CA 650-273-1234

If you want to print the names of everyone in area code 650, the pattern to match is 650-, and the action when a match is found is to print the name in the first field.

You can use the awk program

/650-/ {print $1}

where $1 indicates the first field in each line. You can run this program with the following command:

$ awk '/650-/ {print $1}' contacts

This produces the following output:

Ben

Robin

Fields are separated by a field separator. The default field separator is white space, consisting of any number of spaces and/or tabs. This means that each word in a line is a separate field. Many structured files use a field separator other than a space, such as a colon, a comma, or a single tab, so that you can have several words in one field. You can use the -F option on the command line to specify the field separator. For example,

$ awk -F, 'program goes here'

specifies a comma as the separator, and

$ awk -F"\t" 'program goes here'

tells awk to use a tab as a separator. Since the backslash is a special character in the shell, it must be enclosed in quotation marks. Otherwise, the effect would be to tell awk to use t as the field separator.
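The effect of -F can be seen with a small experiment. The following sketch assumes two hypothetical files, people.csv and people.tsv, which are created here only for illustration; the data and variable names are not from the examples above.

```shell
# Hypothetical comma-separated and tab-separated data.
printf '%s\n' 'Jane Doe,New Jersey,732-555-0100' > people.csv
printf 'Jane Doe\tNew Jersey\n' > people.tsv

# With -F, the two-word values stay together in one field.
csv_state=$(awk -F, '{print $2}' people.csv)
tsv_name=$(awk -F"\t" '{print $1}' people.tsv)

# With the default separator, $1 is only the first word.
default_first=$(awk '{print $1}' people.csv)

echo "$csv_state"
echo "$tsv_name"
echo "$default_first"
```

The first two commands print New Jersey and Jane Doe; the last prints only Jane, because white space splits the name into two fields.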

Using Standard Input and Output

Like most UNIX System commands, awk uses standard input and output. If you do not specify an input file, the program will read and act on standard input. This allows you to use an awk program as a part of a command pipeline. For example, it is common to use sort to sort data before awk operates on it:

$ sort input_file | awk -f program_file

Because the default for standard input is the keyboard, if you do not specify an input file, and if it is not part of a pipeline, an awk program will read and act on lines that you type in from the keyboard. This can be useful for testing your awk programs. Remember that you can terminate input by typing CTRL-D.

As with any command that uses standard output, you can redirect output from an awk program to a file or to a pipeline. For example, the command

$ awk '{print $1}' contacts > namelist

copies the first field from each line of contacts to a file called namelist.

You can get input from multiple files by listing each filename in the command line. awk takes its input from each file in turn. For example, the following command line reads and acts on all of the first file, phone1, and then reads and acts on the second file, phone2. It sends the output (the first field of each line) to lp.

$ awk '{print $1}' phone1 phone2 | lp

Running an awk Program from a File

You can store the text of an awk program in a file. To run a program from a file, use awk -f, followed by the filename. The following command line runs the program saved in the file prog_file. awk takes its input from input_file:

$ awk -f prog_file input_file

If the file is not in the current directory, you must give awk a full pathname. If you are using gawk, you can use the environment variable AWKPATH to specify a list of directories to search for program files. The default AWKPATH is .:/usr/lib/awk:/usr/local/lib/awk. If you modify your AWKPATH, you may want to save it in your shell configuration file (e.g., in .bash_profile if you are using bash).

Here’s how you could set and use AWKPATH in bash:

$ export AWKPATH=$AWKPATH:$HOME/bin/awk

$ ls ~/bin/awk

testprog

$ gawk -f testprog testinput

An even better way to save an awk program in a file is to create an executable script. If you add the line #!/bin/awk -f (where /bin/awk is the path for awk on your system) to the top of your file, you can run the program as a stand-alone script. You must have execute permission on the file before you can run it.

$ cat sampleProg

#!/bin/awk -f

/black/ {print}

$ chmod u+x sampleProg

$ ./sampleProg inputfile

Sphinx of black quartz, judge my vow.

When you run this script, the shell reads the first line and calls awk, which runs the program.

Multiline Programs

You can do a surprising amount with one-line awk programs, but programs can also contain many lines. Multiline programs simply consist of multiple pattern/action statements. Each line of input is checked against all of the patterns in turn. For each matching pattern, the corresponding action is performed. For example,

$ cat countStudents

# Count the number of lines containing "science" or "writing"

/science/ { sci = sci + 1 }

/writing/ { wri = wri + 1 }

# At the end of the input, print the totals

END {print sci " science and " wri " writing students." }

$ awk -f countStudents student-list

47 science and 39 writing students.

This program uses the END statement to perform an action at the end of the input. See the section “BEGIN and END” later in this chapter for more information about how END works.

An action statement can also continue over multiple lines. Although you can chain together multiple actions using semicolons, your programs will be easier to read if you break them up into separate lines. If you do, the opening brace of the action must be on the same line as the pattern it matches. You can have as many lines as you want in the action before the final brace. For example,

$ cat numberLines

# Add line numbers to the input

# Since there is no pattern, do this to every line in the file

{

n = n + 1 # add 1 to the number of lines

print n " " $0 # print the line number, a space, and the original line

}

The comments in these programs make them easier to read. Like the shell, awk uses the # symbol for comments. Any line or part of a line beginning with the # symbol will be ignored by awk. The comment begins with the # character and ends at the end of the line.

Specifying Patterns

Because pattern matching is such a fundamental part of awk, the awk language provides a rich set of operators for specifying patterns. You can use these operators to specify patterns that match a particular word, a phrase, a group of words that have some letters in common (such as all words starting with A), or a number within a certain range. You can also use special operators to combine simple patterns into more complex patterns. These are the basic pattern types in awk:

§ Regular expressions are sequences of letters, numbers, and special characters that specify strings to be matched. awk accepts the same regular expressions as the egrep command, discussed in Chapter 19.

§ Comparison patterns are patterns in which you compare two elements using operators such as == (equal to), != (not equal to), > (greater than), and < (less than).

§ Compound patterns are built up from other patterns, using the logical operators and (&&), or (||), and not (!).

§ Range patterns have a starting pattern and an ending pattern. They search for the starting pattern and then match every line until they find a line that matches the ending pattern.

§ BEGIN and END are special built-in patterns that send instructions to your awk program to perform certain actions before or after the main processing loop.

Regular Expressions

You can search for lines that match a regular expression by enclosing it in a pair of slashes (/…/). The simplest kind of regular expression is just a word or string. For example, to match lines containing the phrase “boxing wizards” anywhere in the line, you can use the pattern

/boxing wizards/

Expressions can also include escape sequences. The most common are \t for TAB and \n for newline.

Table 21–1 shows the special symbols that you can use to form more complex regular expressions.

Table 21–1: awk Regular Expressions

Symbol   Definition                                               Example         Matches
.        Matches any single character.                            th.nk           think, thank, thunk, etc.
\        Quotes the following character.                          \*\*\*          ***
*        Matches zero or more repetitions of the previous item.   ap*le           ale, apple, etc.
+        Matches one or more repetitions of the previous item.    .+              any non-empty line
?        Matches the previous item zero or one times.             index\.html?    index.htm, index.html
^        Matches the beginning of a line.                         ^If             any line beginning with If
$        Matches the end of a line.                               \.$             any line ending in a period
[]       Matches any one of the characters inside.                [QqXx]          Q, q, X, or x
[a-z]    Matches any one of the characters in the range.          [0-9]*          any number: 0110, 27, 9876, etc.
[^ ]     Matches any character not inside.                        [^\n]           any character but newline
()       Groups a portion of the pattern.                         script(\.sh)?   script, script.sh
|        Matches either the value before or after the |.          (E|e)xit        Exit, exit

To illustrate how you can use regular expressions, consider a file containing the inventory of items in a stationery store. The file inventory includes a one-line record for each item. Each record contains the item name, how many are on hand, how much each costs, and how much each sells for:

pencils 108 .11 .15

markers 50 .45 .75

pens 24 .53 .75

notebooks 15 .75 1.00

erasers 200 .12 .15

books 10 1.00 1.50

If you want to search for the price of markers, but you cannot remember whether you called them “marker” or “markers,” you could use the regular expression

/markers?/

as the pattern.

To find out how many books you have on hand, you could use the pattern

/^books/

to find entries that contain “books” only at the beginning of a line. This would match the record for books, but not the one for notebooks.
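To try these patterns, the sketch below recreates the inventory file shown above and runs both patterns against it. The shell variable names are chosen only for illustration.

```shell
# Recreate the inventory file from the text.
cat > inventory <<'EOF'
pencils 108 .11 .15
markers 50 .45 .75
pens 24 .53 .75
notebooks 15 .75 1.00
erasers 200 .12 .15
books 10 1.00 1.50
EOF

# /markers?/ matches either "marker" or "markers".
marker_line=$(awk '/markers?/' inventory)

# Without the ^ anchor, "books" also matches inside "notebooks".
unanchored=$(awk '/books/ {print $1}' inventory)
anchored=$(awk '/^books/ {print $1, $2}' inventory)

echo "$marker_line"
echo "$unanchored"
echo "$anchored"
```

The unanchored pattern prints both notebooks and books, while the anchored one prints only the books record.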

Case Sensitivity

In awk, string patterns are case sensitive. For example, the pattern /student/ wouldn’t match the string “Student”. In gawk, you can set the built-in variable IGNORECASE to a nonzero value (for example, in a BEGIN action) if you want to make matching case-insensitive.

Alternately, you can use tr to convert all of your input to lowercase before running awk, like this:

cat inputfiles | tr '[A-Z]' '[a-z]' | awk -f programfile

Some versions of awk have the functions tolower and toupper to help you control the case of strings (see the later section “Working with Strings”).
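As a minimal sketch of the function-based approach, the following assumes that your awk provides tolower (it is part of the POSIX awk specification, but older versions may lack it). The file roster and its contents are hypothetical.

```shell
# Hypothetical input with mixed case.
printf '%s\n' 'Student one' 'student two' 'STUDENT three' 'teacher' > roster

# Case-sensitive match: only the all-lowercase line matches.
exact=$(awk '/student/' roster | wc -l | tr -d ' ')

# Match against a lowercased copy of the line; the original line is printed.
folded=$(awk 'tolower($0) ~ /student/' roster)
folded_count=$(printf '%s\n' "$folded" | wc -l | tr -d ' ')

echo "$exact"
echo "$folded_count"
echo "$folded"
```

Unlike the tr pipeline, this leaves the printed lines in their original case, because only the copy used for the comparison is converted.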

Comparison Operators

The preceding section dealt with string matches where the target string may occur anywhere in a line. Sometimes, though, you want to compare a string or pattern with a specific string. For example, suppose you want to find all the items in the earlier example that sell for 75 cents. You want to match .75, but only when it is in the fourth field (selling price).

Use the tilde (~) operator to test whether a string matches a regular expression. For example,

$4 ~ /^\.75/

checks whether the string $4 contains a match for the expression /^\.75/. That is to say, it checks whether field 4 begins with .75 (the backslash is necessary to prevent the . from being interpreted as a special character). This pattern will match strings such as “.75”, “.7552”, and “.75potatoes”. If you wish to test whether field 4 contains precisely the string .75 and nothing else, you could use

$4 ~ /^\.75$/

You can test for nonmatching strings with !~. This is similar to ~, but it matches if the string does not contain a match for the regular expression.

The == operator checks whether two strings are identical. For example,

$1==$3

checks to see whether the value of field 1 is equal to the value of field 3.

Do not confuse == with =. The former (==) tests whether two strings are identical. The single equal sign (=) assigns a value to a variable. For example,

$1="hello"

sets the value of field 1 equal to “hello”. It would be used as part of an action statement. On the other hand,

$1=="hello"

compares the value of field 1 to the string “hello”. It could be a pattern statement.

The != operator tests whether the values of two expressions are not equal. For example,

$1 != "pencils"

is a pattern that matches any line where the first field is not “pencils.”
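The difference between matching a whole line and comparing a single field can be sketched with the inventory file from earlier in the chapter, recreated here so that the example is self-contained. The shell variable names are illustrative.

```shell
cat > inventory <<'EOF'
pencils 108 .11 .15
markers 50 .45 .75
pens 24 .53 .75
notebooks 15 .75 1.00
erasers 200 .12 .15
books 10 1.00 1.50
EOF

# Matching anywhere in the line also finds notebooks, which cost .75.
any75=$(awk '/\.75/ {print $1}' inventory)

# Restricting the match to field 4 finds only items that sell for .75.
sell75=$(awk '$4 ~ /^\.75$/ {print $1}' inventory)

# != as a pattern: count the records whose first field is not "pencils".
others=$(awk '$1 != "pencils"' inventory | wc -l | tr -d ' ')

echo "$any75"
echo "$sell75"
echo "$others"
```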

Comparing Order

The comparison operators <, >, <=, and >= can compare two numbers or two strings. With numbers, they work just as you would expect. For example,

$1 <= 10

would match lines in which the first field is a number less than or equal to 10.

When used with strings, these operators compare the strings according to the standard ASCII alphabetical order. For example,

"vanished" < "vorpal"

Remember that in the ASCII character code, all uppercase letters precede all lowercase letters, so

"Horse" < "cart"

Compound Patterns

Compound patterns are combinations of patterns, joined with the logical operators && (and), || (or), and ! (not). You can create very complex compound patterns.

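For example, here is a small but useful program that works on a text file formatted with HTML. It checks whether each <B> starting tag is followed by exactly one </B> ending tag:

# Each error rule comes before the rule that changes bold,
# so that a change made for a line cannot trigger an error for that same line.
/<B>/ && bold==1 { print "Missing </B> before line " NR }
/<B>/ && bold==0 { bold=1 }
/<\/B>/ && bold==0 { print "Extra </B> at line " NR }
/<\/B>/ && bold==1 { bold=0 }

This program looks for the HTML tag <B>. If bold is already set when it finds one, the previous bold text was never closed, so it prints an error message; otherwise, it marks the start of bold text by setting the variable bold. The program also searches for the ending tag </B>. If the text wasn’t bold, it prints an error message; otherwise, it changes bold to show that the text is no longer bold. The error rules come first because awk tests every pattern in order against the same line, so a rule that has just changed bold would otherwise trigger the error rule that follows it. The slash in </B> is escaped (<\/B>) so that awk does not treat it as the end of the regular expression. You could easily extend this program to test for <I> and other tags.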

Compound patterns are useful for numeric variables as well as strings. For example,

$1 < 10 && $2 >= 30

matches a line if field 1 is less than 10 and field 2 is greater than or equal to 30.

Range Patterns

The syntax for a range pattern is

startPattern, endPattern

This causes awk to compare each line of input to startPattern. When it finds a line that matches startPattern, that line and every line following it will match the range. awk will continue to match every line until it encounters one that matches endPattern. After that line, the range will no longer match lines of input (until another copy of startPattern appears).

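In other words, a range pattern matches all the lines from a starting pattern to an ending pattern. If you have a table in which at least one of the fields is sorted, you can use a range to pull out a section of data. For example, if you have a table in which the first field of each line is the line number, you could use this program to print lines 100 to 199:

$ awk '$1 == 100, $1 == 199 {print}' datafile

Comparing the first field with == is safer here than a pattern such as /100/, which matches any line containing the digits 100 anywhere, including line 1000 or a line with a data value of 2100.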
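The following sketch tries a range pattern on a generated table. It assumes a hypothetical datafile of 300 lines in which the first field is the line number, and it compares that field exactly instead of searching for the digits anywhere in the line.

```shell
# Build a hypothetical numbered table: "1 item1" through "300 item300".
awk 'BEGIN { for (i = 1; i <= 300; i++) print i, "item" i }' > datafile

# The range starts at the line whose first field is 100 and ends at 199.
selected=$(awk '$1 == 100, $1 == 199' datafile)

count=$(printf '%s\n' "$selected" | wc -l | tr -d ' ')
first=$(printf '%s\n' "$selected" | head -n 1)
last=$(printf '%s\n' "$selected" | tail -n 1)

echo "$count lines, from '$first' to '$last'"
```

The range includes both the starting and the ending line, so 100 lines are selected.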

BEGIN and END

BEGIN and END are special patterns that separate parts of your awk program from the normal awk loop that examines each line of input. The BEGIN pattern applies before any lines are read. It causes the action following it to be performed before any input is processed.

This allows you to set a variable or print a heading before the main loop of the awk program. For example, suppose you are writing a program that will generate a table. You could use a BEGIN statement to print a header at the top:

BEGIN {print "Artist Album SongTitle TrackNum"}

The END pattern is similar to BEGIN, but it applies after the last line of input has been read. Suppose you need to count the number of lines in a file. You could use

{ numline = numline + 1 }

END { print "There were " numline " lines of input." }

This awk program counts each line of input and then prints the total when all the input has been processed. A shorter way to write this program is

END { print "There were " NR " lines of input." }

which uses a built-in awk variable to automatically count the lines.
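BEGIN, a main action, and END can be combined in one program. The sketch below assumes a hypothetical two-column file named stock; the heading text and the variable name report are illustrative.

```shell
# Hypothetical two-column input file.
printf '%s\n' 'pencils 108' 'markers 50' 'pens 24' > stock

report=$(awk '
    BEGIN { print "Item Count" }          # runs once, before any input
          { print $1, $2 }                # runs for every line
    END   { print NR " items listed" }    # runs once, after the last line
' stock)

echo "$report"
```

The heading appears first, then one line per record, and finally the total taken from NR.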

Specifying Actions

The preceding sections have illustrated some of the patterns you can use. This section gives you a brief introduction to the kinds of actions that awk can take when it matches a pattern. An action can be as simple as printing a line or changing the value of a variable, or as complex as invoking control structures and user-defined functions.

Variables

The awk program allows you to create variables, assign values to them, and perform operations on them. Variables can contain strings or numbers. A variable name can be any sequence of letters and digits, beginning with a letter. Underscores are permitted as part of a variable name, for example, old_price. Unlike many programming languages, awk doesn’t require you to declare variables as numeric or string; they are assigned a type depending on how they are used. The type of a variable may change if it is used in a different way. All variables are initially set to null (or for numbers, 0). Variables are global throughout an awk program, except for the parameters of user-defined functions, which are local to the function.

Built-in Variables

Table 21–2 shows the awk built-in variables. These variables either are set automatically or have a standard default value. For example, FILENAME is set to the name of the current input file as soon as the file is read. FS, the field separator, has a default value. Other commonly used built-in variables are NF, the number of fields in the current record (by default, each line is considered a record), and NR, the number of records read so far (which we used in the preceding example to count the number of lines in a file). ARGV is an array of the command-line arguments to your awk program.

Table 21–2: awk Built-in Variables

Variable   Meaning                   Variable   Meaning
FS         Input field separator     NF         Number of fields in this record
OFS        Output field separator    NR         Number of records read so far
RS         Input record separator    FNR        Number of records from this file
ORS        Output record separator   RSTART     Set by match to the match index
ARGC       Number of arguments       RLENGTH    Set by match to the match length
ARGV       Array of arguments        OFMT       Output format for numbers
FILENAME   Name of input file        SUBSEP     Subscript separator for arrays

Built-in variables have uppercase names. They may contain string values (FILENAME, FS, OFS), or numeric values (NR, NF). You can reset the values of these variables. For example, you can change the default field separator by changing the value of FS.
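To see several of these variables at work, the sketch below reads two hypothetical colon-separated files, part1 and part2. It sets FS and OFS in a BEGIN action and prints FILENAME, NR, FNR, and NF for each record.

```shell
# Two hypothetical colon-separated input files.
printf '%s\n' 'a:b:c' 'd:e' > part1
printf '%s\n' 'f:g:h:i' > part2

out=$(awk 'BEGIN { FS = ":"; OFS = " | " }
           { print FILENAME, NR, FNR, NF, $1 }' part1 part2)

echo "$out"
```

NR keeps counting across both files, while FNR starts again at 1 for part2. The commas in the print statement are replaced by the value of OFS in the output.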

Actions Involving Fields

You have already seen the field identifiers $1, $2, and so on. These are a special kind of built-in variable. You can assign values to them; change their values; and compare them to other variables, strings, or numbers. These operations allow you to create new fields, erase a field, or change the order of two or more fields.

For example, recall the inventory file, which contained the name of each item, the number on hand, the price paid for each, and the selling price. The entry for pencils is

pencils 108 .11 .15

The following awk program calculates the total value of each item in the file:

{

$5 = $2 * $4

print $0

}

This program multiplies field 2 times field 4 and puts the result in a new field ($5), which is added at the end of the record. (By default, a record is one line.) The program also prints the new record with $0.

You can use the NF variable to access the last field in the current record. For example, suppose that some lines have four fields while others have five. Since NF is the number of fields, $NF is the field identifier for the last field in the record (just as, in a line with four fields, $4 is the identifier for the last field). You can add a new field at the end of each record by increasing the value of NF by one and assigning the new data to $NF. For example,

/pencil/ { # search for lines containing "pencil"

NF += 1 # increase the number of fields

$NF="Empire" # give the new last field the value "Empire"

}
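The field operations described above can be tried on the pencils record alone. This is only a sketch: the record is passed to awk through a pipe, and a print statement is added to each program so that the changed record is visible.

```shell
line='pencils 108 .11 .15'

# Add a fifth field holding quantity times selling price.
with_value=$(echo "$line" | awk '{ $5 = $2 * $4; print $0 }')

# Grow the record by one field, then fill in the new last field.
with_brand=$(echo "$line" | awk '/pencil/ { NF += 1; $NF = "Empire"; print }')

# $NF is the last field, whatever the number of fields.
last_field=$(echo "$line" | awk '{ print $NF }')

echo "$with_value"
echo "$with_brand"
echo "$last_field"
```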

Record Separators

You have already seen many examples in which awk gets its input from a file. It normally reads one line at a time and treats each input line as a separate record. However, you might have a file with multiline records, such as a mailing list with separate lines for name, street, city, and state. To make it easier to read a file like this, you can change the record separator character.

The default separator is a newline. To change this, set the variable RS to an alternate separator. For example, to tell awk to use a blank line as a record separator, set the record separator to null in the BEGIN section of your program, like this:

BEGIN {RS=""} # break records at blank lines

Now all of the lines up to the next blank line will be read in at once as a single record. You can use the variables $1, $2, and so on to work with the fields, just as you normally would.

When working with multiline records, you may wish to leave the field separator as a space (the default value), or you may wish to change it to a newline, with a statement such as

BEGIN {RS=""; FS="\n"} # separate fields at newlines

Then you can use the field identifiers to refer to complete lines of the record.
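As a sketch of multiline records, the following assumes a hypothetical mailing list in which each entry has three lines (name, street, and city with state) and entries are separated by a blank line.

```shell
# Hypothetical mailing list with blank lines between entries.
cat > mailing <<'EOF'
Jane Doe
12 Elm Street
Springfield, NJ

John Roe
7 Oak Avenue
Fairview, CA
EOF

# Each blank-line-separated block is one record; each line is one field.
summary=$(awk 'BEGIN { RS = ""; FS = "\n" }
               { print $1 " lives in " $3 }' mailing)
records=$(awk 'BEGIN { RS = "" } END { print NR }' mailing)

echo "$summary"
echo "$records"
```

With these settings $1 is the name line and $3 is the city line, and NR counts entries (two here) instead of lines.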

Working with Strings

awk provides a full range of functions and operations for working with strings. For example, you can assign strings to variables, concatenate strings, extract substrings, and find the length of a string.

You already know how to assign a string to a variable:

class = "music151"

Don’t forget the quotes around music151. If you do, awk will try to assign class the value of a variable named music151. Since you probably don’t have a variable by that name, class will end up set to null.

You can also combine several strings into one variable. For example, you could enter this at the command line:

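$ awk '{studentID = $1 $3
> print studentID}'
Long, Adam 2008
Long,2008

The comma appears in the output because, with the default field separator, $1 is “Long,” including the comma. Similarly, you could use print $3 $2 with that input to print 2008Adam.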

Some of the most useful string functions are length, which returns the length of a string, match, which searches for a regular expression within a string, and sub, which substitutes a string for a specified expression. You can use gsub to perform a “global” string substitution, in which anything in the line that matches a target regular expression is replaced by a new string. substr takes a string and returns the substring at a given position. In addition to these standard functions, gawk provides the functions toupper and tolower to change the case of a string.

This program shows how you can use some of the string functions:

length($0) > 10 {               # pattern matches any line longer than 10 characters
    gsub(/[0-9]+/, "---")       # replace all strings of digits with ---
    print substr($0, 1, 10)     # print the first ten characters of the new string
}
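The string functions can also be tried one at a time. The sketch below feeds a single hypothetical line to awk; the variable names first and n are illustrative.

```shell
out=$(echo 'Order 66 shipped 2008' | awk '{
    print length($0)                                  # number of characters
    print substr($0, 7, 2)                            # 2 characters from position 7
    if (match($0, /[0-9]+/)) print RSTART, RLENGTH    # where the first number is
    first = $0
    sub(/[0-9]+/, "N", first)                         # replace the first match only
    print first
    n = gsub(/[0-9]+/, "N")                           # replace every match in $0
    print n, $0
}')

echo "$out"
```

sub changes only the first run of digits in the copy, while gsub, called without a third argument, changes $0 itself and returns the number of replacements it made.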

Working with Numbers

awk includes the usual arithmetic operators +, -, *, and /. (Unlike in shell scripting, you do not need to quote * when multiplying in an awk program.) The % operator calculates the modulus of two numbers (the remainder from integer division), and the ^ operator is used for exponentiation.

In addition to =, you can use the assignment operators +=, -=, *=, /=, %=, and ^= as shortcuts. For example,

{ total += $1} # add the value of $1 to total

END { print "Average = " total/NR } # divide total by the number of lines

will find the average of the numbers in the first field of the input.

You can also use the C-style shortcuts ++ and -- to increment or decrement the value of a variable. For example,

x++

is the same as x += 1 (or x=x+1).

awk provides a number of built-in arithmetic functions. These include trigonometric functions such as cos, the cosine function, and atan2, the arctangent function, as well as the natural logarithm function log and the exponential function exp. Other useful functions are int, which returns the integer part of a number, and rand, which generates a random number between 0 and 1. For example, you can calculate an approximate value of pi with

atan2(1, 1) * 4 # four times arctan of 1/1
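A few of these operators and functions are sketched below. The file numbers is hypothetical, and printf is used, as an assumption about the desired output, to round the value of pi to four decimal places.

```shell
# Hypothetical file with one number per line.
printf '%s\n' 10 20 60 > numbers

# Running total with +=, then the average in the END action.
avg=$(awk '{ total += $1 } END { print "Average = " total / NR }' numbers)

# Modulus, exponentiation, and the integer part of a quotient.
ops=$(awk 'BEGIN { print 7 % 3, 2 ^ 10, int(7 / 2) }')

# Four times the arctangent of 1/1, rounded for display.
pi=$(awk 'BEGIN { printf "%.4f\n", atan2(1, 1) * 4 }')

echo "$avg"
echo "$ops"
echo "$pi"
```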

Arrays

It is particularly easy to create and use arrays in awk. Instead of declaring or defining an array, you define the individual array elements as needed and awk creates the array automatically. One feature of awk is that it uses associative arrays, that is, arrays that can use strings as well as numbers for subscripts. For example, votes["republican"] and votes["democratic"] could be two elements of an associative array.

You may be familiar with associative arrays from some other language, but by a different name. In Perl, they are called hashes, and in Python they are dictionaries. There is no built-in data type for associative arrays in C, but they are sometimes implemented with hash tables.

You define an element of an array by assigning a value to it. For example,

stock[1] = $2

assigns the value of field 2 to the first element of the array stock. You do not need to define or declare an array before assigning its elements.

You can use a string as the element identifier. For example,

number[$1] = $2

If the first field ($1) is pencil, and the second field ($2) is 108, this creates an array element:

number["pencil"] = 108

When an element of an array has been defined, it can be used like any other variable. You can change it, use it in comparisons and expressions, and set variables or fields equal to it. For example, you could print the value of number["pencil"] with

print number["pencil"]

You can delete an element of an array with

delete array[subscript]

and you can test whether a particular subscript occurs in an array with

subscript in array

where this expression will return a value of 1 if array[subscript] exists and 0 if it does not.
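The array operations above can be combined in a short sketch. It rebuilds the inventory file so that it is self-contained and uses the item names in that file (for example pencils) as subscripts; the variable names had_pens, had_staplers, and still_pens are illustrative.

```shell
cat > inventory <<'EOF'
pencils 108 .11 .15
markers 50 .45 .75
pens 24 .53 .75
notebooks 15 .75 1.00
erasers 200 .12 .15
books 10 1.00 1.50
EOF

out=$(awk '
    { number[$1] = $2 }                         # item name -> quantity on hand
    END {
        print number["pencils"]                 # look up one element
        had_pens = ("pens" in number)           # 1: this subscript exists
        had_staplers = ("staplers" in number)   # 0: and the test does not create it
        delete number["pens"]
        still_pens = ("pens" in number)         # 0 after the delete
        print had_pens, had_staplers, still_pens
    }' inventory)

echo "$out"
```

Testing with in is the safe way to check for an element, because merely referring to number["staplers"] in an expression would create that element with a null value.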

Control Statements

awk provides control flow statements that allow you to test logical conditions (with if statements) or loop through blocks of code (with for and while statements). The syntax is similar to that used in C.

if... then Statements

The if statement evaluates an expression and performs an action if the expression was true. It has the form

if (condition) action

For example, this statement checks the number of pencils in an inventory and alerts you if you are running low:

/pencil/ {if ($2 < 144) print "Order more pencils"}

You can add an else clause to an if statement. For example,

if (length(input) > 0)

print "Good, we have input"

else

print "Nope, no input here"

awk provides a similar conditional form that can be used in an expression. The form is

expression1 ? expression2 : expression3

If expression1 is true, the whole statement has the value of expression2; otherwise, it has the value of expression3. For example,

rank = ($1 > 50000) ? "high" : "low"

sets rank to "high" if the first field is greater than 50000, and to "low" otherwise.
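A small sketch of the conditional expression in action, run on two made-up input values (rank_demo is a shell variable used only for this illustration):

```shell
# Classify each input number with the ?: conditional expression.
rank_demo=$(printf '72000\n31000\n' | awk '
    {
        rank = ($1 > 50000) ? "high" : "low"
        print $1, rank
    }')
echo "$rank_demo"
```

The same result could be written with if and else; the conditional form is convenient when the choice is part of a larger expression or an assignment.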

while Loops

A while loop is used to repeat a statement as long as some condition is met. The form is

while(condition) {

action

}

For example, suppose you have a file in which different records contain different numbers of fields, such as a list of the test scores for each student in a school, where some students have more test scores than others, like this:

Gignoux, Chris 97 88 95 92

Landfield, Ryan 75 93 99 94 89

You could use while to loop through every field in each record, add up the total score, and print the average for each student:

{

sum=0

i=2

while (i<=NF) {

sum += $i

i++

}

average=sum/(NF-1)

print "The average for " $1 " is " average

}

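In this program, i is a counter for each field in the record after the first field, which contains the student’s name. While i is less than or equal to NF (the number of fields in the record), sum is incremented by the contents of field i. The average is the sum divided by the number of fields containing numbers. Note that the program assumes the name occupies a single field. With the default whitespace separator, a name such as "Gignoux, Chris" spans two fields, so for the sample data shown you would start i at 3, divide by NF-2, and print both $1 and $2.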
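Here is a sketch of the while loop run against the two sample records. With the default whitespace separator, each name in that data occupies two fields ($1 and $2), so this variant starts the counter at field 3 and divides by NF - 2; that adjustment is an assumption made to fit the sample data, not part of the program above.

```shell
# Average the scores on each line; fields 1 and 2 hold the name.
avg_demo=$(printf 'Gignoux, Chris 97 88 95 92\nLandfield, Ryan 75 93 99 94 89\n' | awk '
    {
        sum = 0
        i = 3                         # first score field
        while (i <= NF) {
            sum += $i
            i++
        }
        average = sum / (NF - 2)      # NF - 2 fields contain scores
        print "The average for " $1 " " $2 " is " average
    }')
echo "$avg_demo"
```

If the names were stored as a single field, for example with a tab as the field separator, the original starting index of 2 and divisor of NF - 1 would apply unchanged.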

The do-while statement is like the while statement, except that it executes the action first and then tests the condition, so the action always runs at least once. It has the form

do action while(condition)

The break command is used to exit from a surrounding loop early. It can be included in a while loop or a for loop.

for Loops

The for statement repeats an action as long as a condition is satisfied. The for statement includes an initial statement that is executed the first time through the loop, a test that is executed each time through the loop, and a statement that is performed after each successful test. It has the form

for(initial statement; test; increment) statement

The for statement is usually used to repeat an action some number of times. The following example uses for to total the scores for each student and find the average, exactly like the while example just shown:

{

sum=0

for (i=2; i<=NF; i++) sum += $i

average=sum/(NF-1)

print "The average for " $1 " is " average

}

You can use for loops to step through the elements of an array. For example, to count the number of tables in an HTML document, and the number of rows and cells in the tables, use this:

/<TABLE>/ {count["table"]++}

/<TR>/ {count["tablerow"]++}

/<TD>/ {count["tablecell"]++}

END {for (s in count) print s, count[s]}

The array is called count. As you find each pattern, you increment the counter with the appropriate subscript. After reading the file, you print out the totals.
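The counting program can be tried on a tiny made-up HTML fragment, as sketched below. The order in which for (s in count) visits the subscripts is not defined by awk, so this sketch pipes the result through sort to get a predictable listing; count_demo is a shell variable used only here.

```shell
# Count table, row, and cell tags, then list the totals in sorted order.
count_demo=$(printf '<TABLE>\n<TR>\n<TD>\n<TD>\n<TR>\n<TD>\n</TABLE>\n' | awk '
    /<TABLE>/ { count["table"]++ }
    /<TR>/    { count["tablerow"]++ }
    /<TD>/    { count["tablecell"]++ }
    END { for (s in count) print s, count[s] }' | LC_ALL=C sort)
echo "$count_demo"
```

Note that each pattern counts matching lines, not individual tags, so a line containing several <TD> tags would be counted once.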

Ending a Program

The exit command tells awk to stop reading input. When awk comes to an exit statement, it immediately executes the END action, if there is one, and then terminates. You might use this command to end a program if you discover an error in the input file, such as a missing field.
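The following sketch shows exit stopping a program at the first record with a missing field. The three input lines are made-up data, and exit_demo is a shell variable used only for this illustration.

```shell
# Stop at the first record that has fewer than two fields.
exit_demo=$(printf 'a 1\nb\nc 3\n' | awk '
    NF < 2 { print "bad record at line " NR; exit }
    { print $1, $2 }
    END { print "done" }')
echo "$exit_demo"
```

The third input line is never processed, but the END action still runs after exit, which is why "done" is the last line of output.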

User-Defined Functions

Like many programming languages, awk allows you to define your own functions within a program. Your functions may take parameters (arguments) and may return a value.

Once a function has been defined, it may be used in a pattern or action, in any place where you could use a built-in function.

To define a function, you specify its name, the parameters it takes, and the actions to perform. A function is defined by a statement of the form

function function_name(list of parameters) {action_list}

For example, you can define a function called in_range, which takes the value of a field and returns 1 if the value is within a certain range and 0 otherwise, as follows:

function in_range(testval, lower, upper) {

if (testval > lower && testval < upper)

return 1

else

return 0

}

Make sure that there is no space between the function name and the opening parenthesis when you call the function; it is safest to omit the space in the definition as well. The return statement is optional, but the function will not return a value if it is missing.

How to Call a Function

Once you have defined your function, you use it just like a built-in awk function. For example, you can use in_range as follows:

if (in_range($5, 10, 15))

print "Found a match!"

This lets you know when the value of the fifth field lies strictly between 10 and 15, because in_range uses > and < rather than >= and <=.

Functions may be recursive; that is, they may call themselves. A simple example of a recursive function is the factorial function:

function factorial(n) {

if (n<=1)

return 1

else

return n * factorial(n-1)

}

If you call this function in a program like this:

print factorial(4)

it calculates and prints the value, which in this case would be 24 (because 4*3*2*1 is 24).
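Both functions can be placed in one program and called from a BEGIN action, as in this sketch. The bodies are condensed with the conditional expression, which is a stylistic choice for the example rather than a requirement; func_demo is a shell variable used only here.

```shell
# Define two functions and call them without reading any input.
func_demo=$(awk '
    function in_range(testval, lower, upper) {
        return (testval > lower && testval < upper) ? 1 : 0
    }
    function factorial(n) {
        return (n <= 1) ? 1 : n * factorial(n - 1)
    }
    BEGIN {
        print in_range(12, 10, 15), in_range(15, 10, 15)
        print factorial(4)
    }')
echo "$func_demo"
```

The second call to in_range returns 0 because the comparison excludes the endpoints of the range.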

Input and Output

As mentioned at the beginning of this chapter, awk uses standard input and output. This means that you can use the normal shell redirection operators to save output to a file (or read input from a file). You can include awk in command pipelines, and you can get input from the keyboard if no file is specified. However, awk also has a few special features for working with input and output.

Getting Input

Normally your awk program gets input from the file or files that you specify when you run the command, or from standard input if no files are specified. Sometimes, however, you need to get input from another source in addition to this input file. For example, as part of a program you may want to display a message and get a response that the user types in at the keyboard.

You can use the getline function to read a line of input from the keyboard or another file. By default, getline reads its input from the same file that you specified on the awk command line. Each time it is called, it reads the next line and splits it into fields. This is useful if you want precise control over the input loop; for example, you may wish to read the file only up to a certain point and then go to an END statement.

The following instruction reads the next line of the current input (the files named on the command line, or standard input if there are none) and assigns it to the variable X:

getline X

To get input from another file, you redirect the input to getline, as in this example:

getline < "my_file"

This will read the next line of the file my_file. You can then use the built-in variables $0, $1, and so on to work with the line. Note that unlike shell file redirection, awk requires you to put quotes around the filename, or it will be interpreted as a variable name. You might use getline like this to combine data from two different files, by reading in data from my_file in addition to whatever file you may have opened from the command line. You can also read input from a named file and assign it to a variable, as in this example:

getline nextline < "my_file"

This reads a line from my_file and assigns it to nextline.
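As a sketch of combining two inputs, the program below pairs each line of its main input with the next line of a second file. The file is created in a temporary directory, and its path is passed in through an awk variable named f (a name chosen for this example), which is why the filename is not written as a quoted string inside the program.

```shell
# Pair each line of standard input with the next line of a second file.
tmpdir=$(mktemp -d)
printf 'first\nsecond\n' > "$tmpdir/my_file"
getline_demo=$(printf 'a\nb\n' | awk -v f="$tmpdir/my_file" '
    {
        # getline returns 1 when it reads a line, 0 at end of file
        if ((getline nextline < f) > 0)
            print $0, nextline
    }')
rm -r "$tmpdir"
echo "$getline_demo"
```

Checking that the return value is greater than 0 guards against both end of file and an unreadable file, for which getline returns -1.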

The UNIX System identifies the keyboard as the logical file /dev/tty. To read a line from the keyboard, use getline with /dev/tty as the filename. You must enclose /dev/tty in quotes, as in "/dev/tty", just as you would any other string or filename.

This example shows how you could use getline with keyboard input to add information interactively to a file. The following program fragment prints the item name (field 1) and old price for each inventory record, prompts the user to type in the new price, and then substitutes the new price and writes the new record to the file outputfile:

{

print $1, "Old price:", $4

getline new < "/dev/tty"

$4=new

print "New price:", $0 > "outputfile"

}

Using Command-Line Arguments

Normally awk interprets words on the command line after the program as names of input files. However, it is possible to use the command line to give arguments to an awk program.

The number of command-line arguments is stored as a built-in variable (ARGC). The command-line arguments themselves are stored in a built-in array called ARGV. The awk command itself is counted as an argument, so ARGV[0] is awk. ARGV[1] is the next command-line argument, ARGV[2] comes after that, and so on.

Since by default awk treats words on the command line as input filenames, you must tell it not to try to read the contents of your command-line arguments. If you want to use a word as an argument, you must read its value in a BEGIN statement and then set the corresponding ARGV element to null so that it will not be treated as a filename. For example,

BEGIN {

searchpattern=ARGV[1]

ARGV[1]=""

}

$0 ~ searchpattern {print}

sets a variable called searchpattern equal to the first command-line argument and then sets ARGV[1] equal to null so that awk will not try to read in lines from it. The program then searches its input for the word in searchpattern and prints any lines that match.
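A sketch of the same program run from the shell, with the word open supplied as the command-line argument and two made-up lines of input on standard input (argv_demo is a shell variable used only here):

```shell
# Use the first command-line argument as a search pattern, not a filename.
argv_demo=$(printf 'open source\nclosed\n' | awk '
    BEGIN {
        searchpattern = ARGV[1]
        ARGV[1] = ""                  # keep awk from opening it as a file
    }
    $0 ~ searchpattern { print }' open)
echo "$argv_demo"
```

Because the only remaining argument is empty, awk falls back to reading standard input.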

Printing Output

There are two commands for printing output in awk. One of these is the print command, which you have been using. The other command, printf, can be used to print formatted output.

The print command has this form:

print expr1, expr2, ...

The expressions may be variables, strings, or any other awk expression. The commas are necessary if you want items separated by the output field separator. If you leave out the commas, the values will be printed with no separators between them. Remember that if you want to print a string, it must be enclosed in quotes. A word that is not enclosed in quotes is treated as a variable. By itself, print prints the entire record ($0).

You can control the character used to separate output fields by setting the output field separator (OFS) variable. The following statement prints the item name and selling price from an inventory file, using a tab as the output field separator:

BEGIN {OFS="\t"} {print $1, $4}

The printf command provides formatted output, similar to C. With printf, you can print out a number as an integer, with different numbers of decimal places, or in octal or hex. You can print a string left-justified, truncated to a specific length, or padded with initial spaces. For example,

printf("%s\n%d\n%f\n", $1, $2, $3)

will print three fields (a string, an integer, and a decimal number), with a newline after each one.
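The width and precision controls mentioned above can be sketched on one made-up input line. The format used here left-justifies the string in 8 columns, right-justifies the integer in 5 columns, and prints the decimal number with two places; printf_demo is a shell variable used only for this illustration.

```shell
# Format a string, an integer, and a decimal number in fixed-width columns.
printf_demo=$(echo 'pencil 108 1.5' | awk '
    { printf("%-8s|%5d|%.2f\n", $1, $2, $3) }')
echo "$printf_demo"
```

Unlike print, printf does not add a newline or the output field separator on its own, so the format string must include them.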

Sending Output to Files

You can use the shell redirection operators on the command line to save output from an awk program in a file or pipe it to a command. But you can also use file redirection inside a program to send part of the output to a file. For example,

{

if ($6 ~ "toy")

print $0 >> "toy_file"

else

print $0 >> "alt_file"

}

This separates an inventory file into two parts based on the contents of the sixth field. The operator >> is used to append output to the files toy_file and alt_file.
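The splitting program can be sketched on two made-up inventory lines. To keep the example from leaving files behind, the output files are created in a temporary directory and their paths are passed in as the awk variables toys and rest, names chosen only for this illustration.

```shell
# Split the input into two files according to the sixth field.
tmpdir=$(mktemp -d)
printf 'ball 1 2 3 4 toy\nlamp 1 2 3 4 home\n' |
awk -v toys="$tmpdir/toy_file" -v rest="$tmpdir/alt_file" '
    {
        if ($6 ~ "toy")
            print $0 >> toys
        else
            print $0 >> rest
    }'
toy_lines=$(cat "$tmpdir/toy_file")
alt_lines=$(cat "$tmpdir/alt_file")
rm -r "$tmpdir"
echo "$toy_lines / $alt_lines"
```

Inside a single awk run, > and >> differ only in whether the file is emptied when it is first opened; later print statements to the same file append in both cases.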

sed

sed works in basically the same way as awk: it takes a set of patterns and simple editing commands and applies them to an input stream. It has a different syntax (which will seem very familiar if you are a vi user, but will probably be rather difficult if you are not), and slightly different capabilities. In particular, it lacks the field processing and control flow features of awk. Most programs that can be written in sed can also be written in awk. However, sed can be very useful for performing a simple set of editing commands on input before sending it on to awk.

How sed Works

To edit a file with sed, you give it a list of editing commands and the filename. For example, the following command deletes the first line of the file data and prints the result to standard output:

$ sed '1d' data

Note that editing commands are enclosed in single quotation marks. This is because the editing command list is treated as an argument to the sed command, and it may contain spaces, newlines, or other special characters. The name of the file to edit can be specified as the second argument on the command line. If you do not give it a filename, sed reads and edits standard input.

The sed command reads its input one line at a time. If a line is selected by a command in the command list, sed performs the appropriate editing operation and prints the resulting line. If a line is not selected, it is copied to standard output. Editing commands and line addresses are very similar to the commands and addresses used with ed, which is discussed in Appendix A. Experienced vi users will also recognize many of the commands.

sed does not modify the original file. To save the changes sed makes, use file redirection, as in

$ sed '1d' data > newdata

Selecting Lines

The sed editing commands generally consist of an address and an operation. The address tells sed which lines to act on. There are two ways to specify addresses: by line numbers and by regular expression patterns.

As the previous example showed, you can specify a line with a single number. You can also specify a range of lines, by listing the first and last lines in the range, separated by a comma. The following command deletes the first four lines of data:

$ sed '1,4d' data

Regular expression patterns select all lines that contain a string matching the pattern. The following command removes all lines containing “New York” from the file states:

$ sed '/New York/d' states

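sed uses regular expressions much like those in awk, although by default sed uses basic regular expressions, in which characters such as |, +, ?, and parentheses are not special. You can also specify a range using two regular expressions separated by a comma, just like in awk.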

Editing Commands

In addition to the delete command (d), sed supports a (append), i (insert), and c (change) for adding text. It uses r and w to read from or write to a file.

By default, sed prints all lines to standard output. If you invoke sed with the -n option (no copy), only those lines that you explicitly print are sent to standard output. For example, the following prints lines 10 through 20 only:

$ sed -n '10,20p' file
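The addressing forms described above can be sketched on small made-up inputs piped in from printf; the shell variables range_demo and del_demo exist only to capture the output.

```shell
# Print only lines 2 through 4, then delete lines matching a pattern.
range_demo=$(printf 'one\ntwo\nthree\nfour\nfive\n' | sed -n '2,4p')
del_demo=$(printf 'Albany, New York\nBoston, Massachusetts\n' | sed '/New York/d')
echo "$range_demo"
echo "$del_demo"
```

Without -n, the p command would print the selected lines twice, once explicitly and once through the default output.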

Replacing Strings

The substitute (s) command works like the similar vi command. This example switches all occurrences of 2006 to 2007 in the file scheduling:

$ sed 's/2006/2007/g' scheduling

Because there is no line address or pattern at the beginning, this command will be applied to every line in the input file. As in vi, the g at the end of the substitution stands for “global”. It causes the substitution to be applied to every part of the line that matches the pattern.

You can also use an explicit search pattern to find all the lines containing the string “2006” before applying the substitution:

$ sed '/2006/s//2007/g' scheduling

This command tells sed to operate on all lines containing the pattern 2006, and in each of those lines to change all instances of the target (2006) to 2007.
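The effect of the g flag and of the address form can be sketched on made-up input lines; the shell variable names are chosen only for this illustration.

```shell
# Compare a global substitution, a first-match substitution,
# and a substitution restricted to lines that match an address.
sub_all=$(echo 'May 2006 to June 2006' | sed 's/2006/2007/g')
sub_first=$(echo 'May 2006 to June 2006' | sed 's/2006/2007/')
sub_addr=$(printf '2006 plan\n2005 plan\n' | sed '/2006/s//2007/g')
echo "$sub_all"
echo "$sub_first"
echo "$sub_addr"
```

In the last command, the empty pattern // reuses the most recent regular expression, which here is the 2006 from the address.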

Substitution is a very common use of sed. If you are not familiar with this syntax for substitutions, you might want to review vi substitutions in Chapter 5.

Using sed and awk Together

It is often convenient to use sed and awk together to solve a problem. Even though awk has a full set of commands for manipulating text, using sed to filter the input to awk can simplify and clarify things. You can use sed for its simple text editing capabilities, and awk for its ability to deal with fields and records, as well as for its rich programming capabilities.

The following example shows how you can use sed and awk together to extract a list of songs from a music database. Here is part of the entry for one song from an XML music data file:

$ cat mysongs

<key>Name</key><string>Airportman</string>

<key>Artist</key><string>R.E.M.</string>

<key>Album</key><string>Up</string>

<key>Genre</key><string>Rock</string>

<key>Kind</key><string>MPEG audio file</string>

<key>Size</key><integer>4091947</integer>

<key>Total Time</key><integer>255608</integer>

<key>Track Number</key><integer>1</integer>

<key>Track Count</key><integer>14</integer>

<key>Year</key><integer>1998</integer>

The data is stored as a simple keyword/value pair, with XML markup tags.

In its current form, the information is hard to read. Also, there are some fields that you don’t really need. You can use sed to turn this file full of data into a useful table. Specifically, you can eliminate the XML tags and create a table showing the song title, artist, album, and track number.

Processing the File with sed

The first step is to use sed to remove the XML tags, and to insert a : after the keyword in each line. Inserting the : isn’t too hard. The substitution

s/<\/key>/: /

will replace the “</key>” entries with “: ”. Removing the XML tags, however, is a bit more difficult. The substitution

s/<.*>//g

will actually delete everything from the first < to the last >. That’s because * is greedy, meaning it will try to match the largest pattern possible (in this case, the entire line, since each line begins with < and ends with >). The substitution

s/<[^>]*>//g

will do the trick, although it’s more complicated. The pattern “<[^>]*>” matches a <, then any string of characters that does not include >, and finally a > sign. So the substitution will delete the XML tags “<key>”, “<integer>”, and “</integer>”.
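The difference between the two patterns can be seen by applying each one to a single line of the data, as in this sketch (line, greedy, and tight are shell variables used only here):

```shell
# Apply the greedy and the bracketed patterns to the same line.
line='<key>Album</key><string>Up</string>'
greedy=$(echo "$line" | sed 's/<.*>//g')
tight=$(echo "$line" | sed 's/<[^>]*>//g')
echo "greedy: [$greedy]"
echo "tight:  [$tight]"
```

The greedy pattern leaves nothing behind, while the bracketed pattern removes only the four tags and keeps the text between them.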

You can combine the two substitutions on one line with a semicolon (;) and run a single sed command:

$ sed 's/<\/key>/: /; s/<[^>]*>//g' mysongs

Name: Airportman

Artist: R.E.M.

Album: Up

Genre: Rock

Kind: MPEG audio file

Size: 4091947

Total Time: 255608

Track Number: 1

Track Count: 14

Year: 1998

This is an improvement. It’s readable; but it still has a block structure and it still includes extra information.

You can remove the extra lines with statements like

/Kind: / d

or remove them all at once with an alternation (which requires extended regular expressions, enabled in most versions of sed by the -E option):

/(Kind|Genre|Size|Total Time|Track Count|Year): / d

but the output is still on multiple lines:

$ sed -E 's/<\/key>/: /; s/<[^>]*>//g;

> /(Kind|Genre|Size|Total Time|Track Count|Year): / d' mysongs

Name: Airportman

Artist: R.E.M.

Album: Up

Track Number: 1

That’s fine for a short example like this, but not ideal for a long file with many entries.

Using sed as a Filter for awk

At this point, a better solution would be to use sed to remove the field name along with the XML tags, and then pass the results to awk. You can then use awk to select only the fields that you want, and arrange them so that they are all on one line and in the right order.

The sed command to remove the “<key>…</key>” data and the other tags is

$ sed 's/<key>.*<\/key>//; s/<[^>]*>//g' mysongs

Airportman

R.E.M.

Up

Rock

MPEG audio file

4091947

255608

1

1

1

14

1998

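The awk command will read the records in this format and use the field variables to select the fields you want and output them in the proper order. Since the input records use newline as the field delimiter and a blank line as the record delimiter, the awk program includes an initial statement defining the field separator (FS) and record separator (RS) accordingly. Note that the output above contains twelve lines, two more than the excerpt of mysongs shown earlier, so the track number is the tenth field here; with the ten-line excerpt alone it would be the eighth.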

The commands for the awk program are in the file makesonglist.

$ cat makesonglist

BEGIN {FS="\n"; RS=""; OFS="\t"}

{print $2, $3, $1, $10}

Putting the sed command together with the awk program produces the result you want.

$ sed 's/<key>.*<\/key>//; s/<[^>]*>//g' mysongs | awk -f makesonglist

R.E.M. Up Airportman 1
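The whole pipeline can be sketched as a self-contained script. This version feeds in only the ten lines of the excerpt shown earlier, so it assumes the track number is the eighth field rather than the tenth, and it puts the awk program on the command line instead of in the makesonglist file. The shell variable song_row is used only to capture the result.

```shell
# Strip the keys and tags with sed, then build one tab-separated row with awk.
song_row=$(printf '%s\n' \
    '<key>Name</key><string>Airportman</string>' \
    '<key>Artist</key><string>R.E.M.</string>' \
    '<key>Album</key><string>Up</string>' \
    '<key>Genre</key><string>Rock</string>' \
    '<key>Kind</key><string>MPEG audio file</string>' \
    '<key>Size</key><integer>4091947</integer>' \
    '<key>Total Time</key><integer>255608</integer>' \
    '<key>Track Number</key><integer>1</integer>' \
    '<key>Track Count</key><integer>14</integer>' \
    '<key>Year</key><integer>1998</integer>' |
    sed 's/<key>.*<\/key>//; s/<[^>]*>//g' |
    awk 'BEGIN { FS = "\n"; RS = ""; OFS = "\t" }
         { print $2, $3, $1, $8 }')   # artist, album, title, track number
echo "$song_row"
```

Setting RS to the empty string puts awk in paragraph mode, so a file with many songs separated by blank lines would produce one output row per song.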

Troubleshooting Your awk Programs

If awk finds an error in a program, it will give you a “Syntax error” message. This can be frustrating, especially to a beginner, as the syntax of awk programs can be tricky. Here are some points to check if you are getting a mysterious error message or if you are not getting the output you expect:

§ Make sure that there is a space between the final single quotation mark in the command line and any arguments or input filenames that follow it.

§ Make sure you enclosed the awk program in single quotation marks to protect it from interpretation by the shell.

§ Make sure you put braces around the action statement.

§ Do not confuse the operators == and =. Use == for comparing the value of two variables or expressions. Use = to assign a value to a variable.

§ Regular expressions must be enclosed in forward slashes, not backslashes.

§ If you are using a filename inside a program, it must be enclosed in quotation marks. (But filenames on the command line are not enclosed in quotation marks.)

§ Each pattern/action pair should be on its own line to ensure the readability of your program. However, if you choose to combine them, use a semicolon in between.

§ If your field separator is something other than a space, and you are sending output to a new file, specify the output field separator as well as the input field separator in order to get the expected results.

§ If you change the order of fields or add a new field, use a print statement as part of the action statement, or the new modified field will not be created.

§ If an action statement takes more than one line, the opening brace must be on the same line as the pattern statement.

§ Remember to use a > if you want to redirect output to a file on the command line.

Summary

This chapter has described the basic concepts of the awk programming language, and given you a short introduction to sed and to using sed with awk. At this point, you should be able to write short but very useful awk programs to perform many tasks.

This chapter is only an introduction to awk. It should be enough to give you a sense of the language and its potential, and to make it possible for you to learn more by using it. If you find that you want to learn more about awk, sed, or regular expressions, consult the resources listed next.

How to Find Out More

The following book is an entertaining and comprehensive treatment by the inventors of awk. It provides a thorough description of the language and many examples, including a relational database and a recursive descent parser:

· Aho, Alfred, Brian Kernighan, and Peter Weinberger. The AWK Programming Language. Reading, MA: Addison-Wesley, 1988.

This book is a good, thorough introduction to both awk and sed, with many examples and instructive longer programs:

· Dougherty, Dale, and Arnold Robbins. sed & awk. 2nd ed. Sebastopol, CA: O’Reilly & Associates, 1997.

This book is another very good awk reference:

· Robbins, Arnold. Effective awk Programming. 3rd ed. Sebastopol, CA: O’Reilly & Associates, 2001.

These two books are both very good introductions to understanding and using regular expressions, both for sed and awk and for other programs:

§ Forta, Ben. Sams Teach Yourself Regular Expressions in 10 Minutes. 1st ed. Indianapolis, IN: Sams, 2004.

§ Friedl, Jeffrey E.F. Mastering Regular Expressions. 2nd ed. Sebastopol, CA: O’Reilly & Associates, 2002.

You can download gawk from the GNU site, which also has a great deal of documentation:

§ http://www.gnu.org/software/gawk/gawk.html

§ http://www.gnu.org/software/gawk/manual/gawk.html

Similarly, you can download sed or consult the sed manual:

· http://www.gnu.org/software/sed/

· http://www.gnu.org/software/sed/manual/sed.html

For a guide to frequently asked questions about awk and its relatives, see

· http://www.faqs.org/faqs/computer-lang/awk/faq/index.html

The sed FAQ (which may also be helpful when working with awk regular expressions) can be found at

· http://www.student.northpark.edu/pemente/sed/sedfaq.html

Resources for the book The AWK Programming Language can be found at

· http://cm.bell-labs.com/cm/cs/awkbook/