Patterns, Actions, and Variables - The awk Language - Effective awk Programming (2015)


Part I. The awk Language

Chapter 7. Patterns, Actions, and Variables

As you have already seen, each awk statement consists of a pattern with an associated action. This chapter describes how you build patterns and actions, what kinds of things you can do within actions, and awk’s predefined variables.

The pattern–action rules and the statements available for use within actions form the core of awk programming. In a sense, everything covered up to here has been the foundation that programs are built on top of. Now it’s time to start building something useful.

Pattern Elements

Patterns in awk control the execution of rules—a rule is executed when its pattern matches the current input record. The following is a summary of the types of awk patterns:

/regular expression/

A regular expression. It matches when the text of the input record fits the regular expression. (See Chapter 3.)

expression

A single expression. It matches when its value is nonzero (if a number) or non-null (if a string). (See Expressions as Patterns.)

begpat, endpat

A pair of patterns separated by a comma, specifying a range of records. The range includes both the initial record that matches begpat and the final record that matches endpat. (See Specifying Record Ranges with Patterns.)

BEGIN
END

Special patterns for you to supply startup or cleanup actions for your awk program. (See The BEGIN and END Special Patterns.)

BEGINFILE
ENDFILE

Special patterns for you to supply startup or cleanup actions to be done on a per-file basis. (See The BEGINFILE and ENDFILE Special Patterns.)

empty

The empty pattern matches every input record. (See The Empty Pattern.)

Regular Expressions as Patterns

Regular expressions are one of the first kinds of patterns presented in this book. This kind of pattern is simply a regexp constant in the pattern part of a rule. Its meaning is ‘$0 ~ /pattern/’. The pattern matches when the input record matches the regexp. For example:

/foo|bar|baz/ { buzzwords++ }

END { print buzzwords, "buzzwords seen" }

Expressions as Patterns

Any awk expression is valid as an awk pattern. The pattern matches if the expression’s value is nonzero (if a number) or non-null (if a string). The expression is reevaluated each time the rule is tested against a new input record. If the expression uses fields such as $1, the value depends directly on the new input record’s text; otherwise, it depends on only what has happened so far in the execution of the awk program.
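For instance, a pattern built from NR is reevaluated for every record; the following sketch (with hypothetical sample input) uses the comparison ‘NR % 2 == 0’, which is nonzero exactly on the even-numbered records:

```shell
# Print only the even-numbered input records: the pattern
# 'NR % 2 == 0' is retested against each new record.
printf 'alpha\nbeta\ngamma\ndelta\n' |
awk 'NR % 2 == 0'
```

Because the pattern uses no fields, it depends only on how many records have been read so far, not on the records' text.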

Comparison expressions, using the comparison operators described in Variable Typing and Comparison Expressions, are a very common kind of pattern. Regexp matching and nonmatching are also very common expressions. The left operand of the ‘~’ and ‘!~’ operators is a string. The right operand is either a constant regular expression enclosed in slashes (/regexp/), or any expression whose string value is used as a dynamic regular expression (see Using Dynamic Regexps). The following example prints the second field of each input record whose first field is precisely ‘li’:

$ awk '$1 == "li" { print $2 }' mail-list

(There is no output, because there is no person with the exact name ‘li’.) Contrast this with the following regular expression match, which accepts any record with a first field that contains ‘li’:

$ awk '$1 ~ /li/ { print $2 }' mail-list

555-5553

555-6699

A regexp constant as a pattern is also a special case of an expression pattern. The expression /li/ has the value one if ‘li’ appears in the current input record. Thus, as a pattern, /li/ matches any record containing ‘li’.

Boolean expressions are also commonly used as patterns. Whether the pattern matches an input record depends on whether its subexpressions match. For example, the following command prints all the records in mail-list that contain both ‘edu’ and ‘li’:

$ awk '/edu/ && /li/' mail-list

Samuel 555-3430 samuel.lanceolis@shu.edu A

The following command prints all records in mail-list that contain either ‘edu’ or ‘li’ (or both, of course):

$ awk '/edu/ || /li/' mail-list

Amelia 555-5553 amelia.zodiacusque@gmail.com F

Broderick 555-0542 broderick.aliquotiens@yahoo.com R

Fabius 555-1234 fabius.undevicesimus@ucb.edu F

Julie 555-6699 julie.perscrutabor@skeeve.com F

Samuel 555-3430 samuel.lanceolis@shu.edu A

Jean-Paul 555-2127 jeanpaul.campanorum@nyu.edu R

The following command prints all records in mail-list that do not contain the string ‘li’:

$ awk '! /li/' mail-list

Anthony 555-3412 anthony.asserturo@hotmail.com A

Becky 555-7685 becky.algebrarum@gmail.com A

Bill 555-1675 bill.drowning@hotmail.com A

Camilla 555-2912 camilla.infusarum@skynet.be R

Fabius 555-1234 fabius.undevicesimus@ucb.edu F

Martin 555-6480 martin.codicibus@hotmail.com A

Jean-Paul 555-2127 jeanpaul.campanorum@nyu.edu R

The subexpressions of a Boolean operator in a pattern can be constant regular expressions, comparisons, or any other awk expressions. Range patterns are not expressions, so they cannot appear inside Boolean patterns. Likewise, the special patterns BEGIN, END, BEGINFILE, and ENDFILE, which never match any input record, are not expressions and cannot appear inside Boolean patterns.

The precedence of the different operators that can appear in patterns is described in Operator Precedence (How Operators Nest).

Specifying Record Ranges with Patterns

A range pattern is made of two patterns separated by a comma, in the form ‘begpat, endpat’. It is used to match ranges of consecutive input records. The first pattern, begpat, controls where the range begins, while endpat controls where the range ends. For example, the following:

awk '$1 == "on", $1 == "off"' myfile

prints every record in myfile between ‘on’/‘off’ pairs, inclusive.

A range pattern starts out by matching begpat against every input record. When a record matches begpat, the range pattern is turned on, and the range pattern matches this record as well. As long as the range pattern stays turned on, it automatically matches every input record read. The range pattern also matches endpat against every input record; when this succeeds, the range pattern is turned off again for the following record. Then the range pattern goes back to checking begpat against each record.

The record that turns on the range pattern and the one that turns it off both match the range pattern. If you don’t want to operate on these records, you can write if statements in the rule’s action to distinguish them from the records you are interested in.
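For example, with hypothetical ‘on’/‘off’ markers in the input, an if statement in the action can exclude the boundary records from the output:

```shell
# The range matches 'on' through 'off' inclusive; the if
# statement then skips the two marker records themselves.
printf 'on\na\nb\noff\nc\n' |
awk '$1 == "on", $1 == "off" { if ($1 != "on" && $1 != "off") print }'
```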

It is possible for a pattern to be turned on and off by the same record. If the record satisfies both conditions, then the action is executed for just that record. For example, suppose there is text between two identical markers (e.g., the ‘%’ symbol), each on its own line, that should be ignored. A first attempt would be to combine a range pattern that describes the delimited text with the next statement (not discussed yet, see The next Statement). This causes awk to skip any further processing of the current record and start over again with the next input record. Such a program looks like this:

/^%$/,/^%$/ { next }

{ print }

This program fails because the range pattern is both turned on and turned off by the first line, which just has a ‘%’ on it. To accomplish this task, write the program in the following manner, using a flag:

/^%$/ { skip = ! skip; next }

skip == 1 { next } # skip lines with `skip' set
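Run against some hypothetical sample input, the flag version works as intended; a default print rule is appended here so that the records that survive the two weed-out rules are actually output:

```shell
# Records between (and including) the '%' markers are skipped;
# everything else falls through to the print rule.
printf 'keep1\n%%\nhidden\n%%\nkeep2\n' |
awk '/^%$/    { skip = ! skip; next }
     skip == 1 { next }
               { print }'
```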

In a range pattern, the comma (‘,’) has the lowest precedence of all the operators (i.e., it is evaluated last). Thus, the following program attempts to combine a range pattern with another, simpler test:

echo Yes | awk '/1/,/2/ || /Yes/'

The intent of this program is ‘(/1/,/2/) || /Yes/’. However, awk interprets this as ‘/1/, (/2/ || /Yes/)’. This cannot be changed or worked around; range patterns do not combine with other patterns:

$ echo Yes | gawk '(/1/,/2/) || /Yes/'

error→ gawk: cmd. line:1: (/1/,/2/) || /Yes/

error→ gawk: cmd. line:1: ^ syntax error

As a minor point of interest, although it is poor style, POSIX allows you to put a newline after the comma in a range pattern. (d.c.)

The BEGIN and END Special Patterns

All the patterns described so far are for matching input records. The BEGIN and END special patterns are different. They supply startup and cleanup actions for awk programs. BEGIN and END rules must have actions; there is no default action for these rules because there is no current record when they run. BEGIN and END rules are often referred to as “BEGIN and END blocks” by longtime awk programmers.

Startup and cleanup actions

A BEGIN rule is executed once only, before the first input record is read. Likewise, an END rule is executed once only, after all the input is read. For example:

$ awk '

> BEGIN { print "Analysis of \"li\"" }

> /li/ { ++n }

> END { print "\"li\" appears in", n, "records." }' mail-list

Analysis of "li"

"li" appears in 4 records.

This program finds the number of records in the input file mail-list that contain the string ‘li’. The BEGIN rule prints a title for the report. There is no need to use the BEGIN rule to initialize the counter n to zero, as awk does this automatically (see Variables). The second rule increments the variable n every time a record containing the pattern ‘li’ is read. The END rule prints the value of n at the end of the run.

The special patterns BEGIN and END cannot be used in ranges or with Boolean operators (indeed, they cannot be used with any operators). An awk program may have multiple BEGIN and/or END rules. They are executed in the order in which they appear: all the BEGIN rules at startup and all the END rules at termination. BEGIN and END rules may be intermixed with other rules. This feature was added in the 1987 version of awk and is included in the POSIX standard. The original (1978) version of awk required the BEGIN rule to be placed at the beginning of the program, the END rule to be placed at the end, and only allowed one of each. This is no longer required, but it is a good idea to follow this template in terms of program organization and readability.

Multiple BEGIN and END rules are useful for writing library functions, because each library file can have its own BEGIN and/or END rule to do its own initialization and/or cleanup. The order in which library functions are named on the command line controls the order in which their BEGIN and END rules are executed. Therefore, you have to be careful when writing such rules in library files so that the order in which they are executed doesn’t matter. See Command-Line Options for more information on using library functions. See Chapter 10 for a number of useful library functions.

If an awk program has only BEGIN rules and no other rules, then the program exits after the BEGIN rules are run.[34] However, if an END rule exists, then the input is read, even if there are no other rules in the program. This is necessary in case the END rule checks the FNR and NR variables.
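This behavior is what makes the classic END-only line counter work: the program has no other rules, yet the input is still read so that NR holds the record count at the end (a sketch, with hypothetical input):

```shell
# With only an END rule present, awk still reads all of its
# input, so NR is the total number of records.
printf 'a\nb\nc\n' | awk 'END { print NR }'
```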

Input/output from BEGIN and END rules

There are several (sometimes subtle) points to be aware of when doing I/O from a BEGIN or END rule. The first has to do with the value of $0 in a BEGIN rule. Because BEGIN rules are executed before any input is read, there simply is no input record, and therefore no fields, when executing BEGIN rules. References to $0 and the fields yield a null string or zero, depending upon the context. One way to give $0 a real value is to execute a getline command without a variable (see Explicit Input with getline). Another way is simply to assign a value to $0.

The second point is similar to the first, but from the other direction. Traditionally, due largely to implementation issues, $0 and NF were undefined inside an END rule. The POSIX standard specifies that NF is available in an END rule. It contains the number of fields from the last input record. Most probably due to an oversight, the standard does not say that $0 is also preserved, although logically one would think that it should be. In fact, all of BWK awk, mawk, and gawk preserve the value of $0 for use in END rules. Be aware, however, that some other implementations and many older versions of Unix awk do not.
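As a hedged illustration (it relies on $0 being preserved in the END rule, which POSIX guarantees only for NF), the following sketch prints the field count and text of the last input record:

```shell
# In BWK awk, mawk, and gawk, $0 in an END rule still holds the
# last input record; POSIX guarantees only NF, not $0.
printf 'first\nlast one\n' | awk 'END { print NF, $0 }'
```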

The third point follows from the first two. The meaning of ‘print’ inside a BEGIN or END rule is the same as always: ‘print $0’. If $0 is the null string, then this prints an empty record. Many longtime awk programmers use an unadorned ‘print’ in BEGIN and END rules, to mean ‘print ""’, relying on $0 being null. Although one might generally get away with this in BEGIN rules, it is a very bad idea in END rules, at least in gawk. It is also poor style, because if an empty line is needed in the output, the program should print one explicitly.

Finally, the next and nextfile statements are not allowed in a BEGIN rule, because the implicit read-a-record-and-match-against-the-rules loop has not started yet. Similarly, those statements are not valid in an END rule, because all the input has been read. (See The next Statement and The nextfile Statement.)

The BEGINFILE and ENDFILE Special Patterns

This section describes a gawk-specific feature.

Two special kinds of rule, BEGINFILE and ENDFILE, give you “hooks” into gawk’s command-line file processing loop. As with the BEGIN and END rules (see the previous section), all BEGINFILE rules in a program are merged, in the order they are read by gawk, and all ENDFILE rules are merged as well.

The body of the BEGINFILE rules is executed just before gawk reads the first record from a file. FILENAME is set to the name of the current file, and FNR is set to zero.

The BEGINFILE rule provides you the opportunity to accomplish two tasks that would otherwise be difficult or impossible to perform:

§ You can test if the file is readable. Normally, it is a fatal error if a file named on the command line cannot be opened for reading. However, you can bypass the fatal error and move on to the next file on the command line.

You do this by checking if the ERRNO variable is not the empty string; if so, then gawk was not able to open the file. In this case, your program can execute the nextfile statement (see The nextfile Statement). This causes gawk to skip the file entirely. Otherwise, gawk exits with the usual fatal error.

§ If you have written extensions that modify the record handling (by inserting an “input parser”; see Customized input parsers), you can invoke them at this point, before gawk has started processing the file. (This is a very advanced feature, currently used only by the gawkextlib project.)

The ENDFILE rule is called when gawk has finished processing the last record in an input file. For the last input file, it will be called before any END rules. The ENDFILE rule is executed even for empty input files.

Normally, when an error occurs while reading input in the normal input-processing loop, the error is fatal. However, if an ENDFILE rule is present, the error becomes nonfatal, and instead ERRNO is set. This makes it possible to catch and process I/O errors at the level of the awk program.

The next statement (see The next Statement) is not allowed inside either a BEGINFILE or an ENDFILE rule. The nextfile statement is allowed only inside a BEGINFILE rule, not inside an ENDFILE rule.

The getline statement (see Explicit Input with getline) is restricted inside both BEGINFILE and ENDFILE: only redirected forms of getline are allowed.

BEGINFILE and ENDFILE are gawk extensions. In most other awk implementations, or if gawk is in compatibility mode (see Command-Line Options), they are not special.

The Empty Pattern

An empty (i.e., nonexistent) pattern is considered to match every input record. For example, the program:

awk '{ print $1 }' mail-list

prints the first field of every record.

Using Shell Variables in Programs

awk programs are often used as components in larger programs written in shell. For example, it is very common to use a shell variable to hold a pattern that the awk program searches for. There are two ways to get the value of the shell variable into the body of the awk program.

A common method is to use shell quoting to substitute the variable’s value into the program inside the script. For example, consider the following program:

printf "Enter search pattern: "

read pattern

awk "/$pattern/ "'{ nmatches++ }

END { print nmatches, "found" }' /path/to/data

The awk program consists of two pieces of quoted text that are concatenated together to form the program. The first part is double-quoted, which allows substitution of the pattern shell variable inside the quotes. The second part is single-quoted.

Variable substitution via quoting works, but can potentially be messy. It requires a good understanding of the shell’s quoting rules (see Shell Quoting Issues), and it’s often difficult to correctly match up the quotes when reading the program.

A better method is to use awk’s variable assignment feature (see Assigning variables on the command line) to assign the shell variable’s value to an awk variable. Then use dynamic regexps to match the pattern (see Using Dynamic Regexps). The following shows how to redo the previous example using this technique:

printf "Enter search pattern: "

read pattern

awk -v pat="$pattern" '$0 ~ pat { nmatches++ }

END { print nmatches, "found" }' /path/to/data

Now, the awk program is just one single-quoted string. The assignment ‘-v pat="$pattern"’ still requires double quotes, in case there is whitespace in the value of $pattern. The awk variable pat could be named pattern too, but that would be more confusing. Using a variable also provides more flexibility, as the variable can be used anywhere inside the program—for printing, as an array subscript, or for any other use—without requiring the quoting tricks at every point in the program.

Actions

An awk program or script consists of a series of rules and function definitions interspersed. (Functions are described later. See User-Defined Functions.) A rule contains a pattern and an action, either of which (but not both) may be omitted. The purpose of the action is to tell awk what to do once a match for the pattern is found. Thus, in outline, an awk program generally looks like this:

[pattern] { action }
pattern [{ action }]

function name(args) { … }

An action consists of one or more awk statements, enclosed in braces (‘{…}’). Each statement specifies one thing to do. The statements are separated by newlines or semicolons. The braces around an action must be used even if the action contains only one statement, or if it contains no statements at all. However, if you omit the action entirely, omit the braces as well. An omitted action is equivalent to ‘{ print $0 }’:

/foo/ { } match foo, do nothing --- empty action

/foo/ match foo, print the record --- omitted action

The following types of statements are supported in awk:

Expressions

Call functions or assign values to variables (see Chapter 6). Executing this kind of statement simply computes the value of the expression. This is useful when the expression has side effects (see Assignment Expressions).

Control statements

Specify the control flow of awk programs. The awk language gives you C-like constructs (if, for, while, and do) as well as a few special ones (see Control Statements in Actions).

Compound statements

Enclose one or more statements in braces. A compound statement is used in order to put several statements together in the body of an if, while, do, or for statement.

Input statements

Use the getline command (see Explicit Input with getline). Also supplied in awk are the next statement (see The next Statement) and the nextfile statement (see The nextfile Statement).

Output statements

Such as print and printf. See Chapter 5.

Deletion statements

For deleting array elements. See The delete Statement.

Control Statements in Actions

Control statements, such as if, while, and so on, control the flow of execution in awk programs. Most of awk’s control statements are patterned after similar statements in C.

All the control statements start with special keywords, such as if and while, to distinguish them from simple expressions. Many control statements contain other statements. For example, the if statement contains another statement that may or may not be executed. The contained statement is called the body. To include more than one statement in the body, group them into a single compound statement with braces, separating them with newlines or semicolons.

The if-else Statement

The if-else statement is awk’s decision-making statement. It looks like this:

if (condition) then-body [else else-body]

The condition is an expression that controls what the rest of the statement does. If the condition is true, then-body is executed; otherwise, else-body is executed. The else part of the statement is optional. The condition is considered false if its value is zero or the null string; otherwise, the condition is true. Refer to the following:

if (x % 2 == 0)

print "x is even"

else

print "x is odd"

In this example, if the expression ‘x % 2 == 0’ is true (i.e., if the value of x is evenly divisible by two), then the first print statement is executed; otherwise, the second print statement is executed. If the else keyword appears on the same line as then-body and then-body is not a compound statement (i.e., not surrounded by braces), then a semicolon must separate then-body from the else. To illustrate this, the previous example can be rewritten as:

if (x % 2 == 0) print "x is even"; else

print "x is odd"

If the ‘;’ is left out, awk can’t interpret the statement and it produces a syntax error. Don’t actually write programs this way, because a human reader might fail to see the else if it is not the first thing on its line.

The while Statement

In programming, a loop is a part of a program that can be executed two or more times in succession. The while statement is the simplest looping statement in awk. It repeatedly executes a statement as long as a condition is true. For example:

while (condition)

body

body is a statement called the body of the loop, and condition is an expression that controls how long the loop keeps running. The first thing the while statement does is test the condition. If the condition is true, it executes the statement body. After body has been executed, condition is tested again, and if it is still true, body executes again. This process repeats until the condition is no longer true. If the condition is initially false, the body of the loop never executes and awk continues with the statement following the loop. This example prints the first three fields of each record, one per line:

awk '

{

i = 1

while (i <= 3) {

print $i

i++

}

}' inventory-shipped

The body of this loop is a compound statement enclosed in braces, containing two statements. The loop works in the following manner: first, the value of i is set to one. Then, the while statement tests whether i is less than or equal to three. This is true when i equals one, so the ith field is printed. Then the ‘i++’ increments the value of i and the loop repeats. The loop terminates when i reaches four.

A newline is not required between the condition and the body; however, using one makes the program clearer unless the body is a compound statement or else is very simple. The newline after the open brace that begins the compound statement is not required either, but the program is harder to read without it.

The do-while Statement

The do loop is a variation of the while looping statement. The do loop executes the body once and then repeats the body as long as the condition is true. It looks like this:

do

body

while (condition)

Even if the condition is false at the start, the body executes at least once (and only once, unless executing body makes condition true). Contrast this with the corresponding while statement:

while (condition)

body

This statement does not execute the body even once if the condition is false to begin with. The following is an example of a do statement:

{

i = 1

do {

print $0

i++

} while (i <= 10)

}

This program prints each input record 10 times. However, it isn’t a very realistic example, because in this case an ordinary while would do just as well. This situation reflects actual experience; only occasionally is there a real use for a do statement.

The for Statement

The for statement makes it more convenient to count iterations of a loop. The general form of the for statement looks like this:

for (initialization; condition; increment)

body

The initialization, condition, and increment parts are arbitrary awk expressions, and body stands for any awk statement.

The for statement starts by executing initialization. Then, as long as the condition is true, it repeatedly executes body and then increment. Typically, initialization sets a variable to either zero or one, increment adds one to it, and condition compares it against the desired number of iterations. For example:

awk '

{

for (i = 1; i <= 3; i++)

print $i

}' inventory-shipped

This prints the first three fields of each input record, with one field per line.

It isn’t possible to set more than one variable in the initialization part without using a multiple assignment statement such as ‘x = y = 0’. This makes sense only if all the initial values are equal. (But it is possible to initialize additional variables by writing their assignments as separate statements preceding the for loop.)

The same is true of the increment part. Incrementing additional variables requires separate statements at the end of the loop. The C compound expression, using C’s comma operator, is useful in this context, but it is not supported in awk.

Most often, increment is an increment expression, as in the previous example. But this is not required; it can be any expression whatsoever. For example, the following statement prints all the powers of two between 1 and 100:

for (i = 1; i <= 100; i *= 2)

print i

If there is nothing to be done, any of the three expressions in the parentheses following the for keyword may be omitted. Thus, ‘for (; x > 0;)’ is equivalent to ‘while (x > 0)’. If the condition is omitted, it is treated as true, effectively yielding an infinite loop (i.e., a loop that never terminates).
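A loop with all three expressions omitted is sometimes written deliberately, with the exit condition moved into the body; this sketch counts down from three:

```shell
# 'for (;;)' is an infinite loop; an explicit break in the
# body supplies the exit condition instead.
awk 'BEGIN {
    x = 3
    for (;;) {
        if (x <= 0)
            break
        print x
        x--
    }
}'
```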

In most cases, a for loop is an abbreviation for a while loop, as shown here:

initialization

while (condition) {

body

increment

}

The only exception is when the continue statement (see The continue Statement) is used inside the loop. Changing a for statement to a while statement in this way can change the effect of the continue statement inside the loop.

The awk language has a for statement in addition to a while statement because a for loop is often both less work to type and more natural to think of. Counting the number of iterations is very common in loops. It can be easier to think of this counting as part of looping rather than as something to do inside the loop.

There is an alternative version of the for loop, for iterating over all the indices of an array:

for (i in array)

do something with array[i]

See Scanning All Elements of an Array for more information on this version of the for loop.
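As a concrete sketch (with hypothetical input), the array-scanning form is commonly used to report counts accumulated during the main input loop; the output is piped through sort because the order in which for (i in array) visits the indices is unspecified:

```shell
# Count occurrences of each first field, then scan the array
# in the END rule. Index order is unspecified, hence the sort.
printf 'a x\nb y\na z\n' |
awk '{ count[$1]++ }
     END { for (name in count) print name, count[name] }' |
sort
```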

The switch Statement

This section describes a gawk-specific feature. If gawk is in compatibility mode (see Command-Line Options), it is not available.

The switch statement allows the evaluation of an expression and the execution of statements based on a case match. Case statements are checked for a match in the order they are defined. If no suitable case is found, the default section is executed, if supplied.

Each case contains a single constant, be it numeric, string, or regexp. The switch expression is evaluated, and then each case’s constant is compared against the result in turn. The type of constant determines the comparison: numeric or string do the usual comparisons. A regexp constant does a regular expression match against the string value of the original expression. The general form of the switch statement looks like this:

switch (expression) {

case value or regular expression:

case-body

default:

default-body

}

Control flow in the switch statement works as it does in C. Once a match to a given case is made, the case statement bodies execute until a break, continue, next, nextfile, or exit is encountered, or the end of the switch statement itself. For example:

while ((c = getopt(ARGC, ARGV, "aksx")) != -1) {

switch (c) {

case "a":

# report size of all files

all_files = TRUE;

break

case "k":

BLOCK_SIZE = 1024 # 1K block size

break

case "s":

# do sums only

sum_only = TRUE

break

case "x":

# don't cross filesystems

fts_flags = or(fts_flags, FTS_XDEV)

break

case "?":

default:

usage()

break

}

}

Note that if none of the statements specified here halt execution of a matched case statement, execution falls through to the next case until execution halts. In this example, the case for "?" falls through to the default case, which is to call a function named usage(). (The getopt() function being called here is described in Processing Command-Line Options.)
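Because the getopt() example above is not self-contained, a smaller, runnable sketch may help; it is gawk-specific, and the input values here are hypothetical. The first case is a regexp constant, matched against the string value of the switch expression:

```shell
# Classify each record: a regexp case, a string-constant case,
# and a default. Requires gawk; switch is not POSIX awk.
printf '42\nhello\n\n' |
gawk '{
    switch ($0) {
    case /^[0-9]+$/:      # regexp constant, matched against $0
        print "number"
        break
    case "hello":         # string constant, compared for equality
        print "greeting"
        break
    default:
        print "other"
        break
    }
}'
```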

The break Statement

The break statement jumps out of the innermost for, while, or do loop that encloses it. The following example finds the smallest divisor of any integer, and also identifies prime numbers:

# find smallest divisor of num

{

num = $1

for (divisor = 2; divisor * divisor <= num; divisor++) {

if (num % divisor == 0)

break

}

if (num % divisor == 0)

printf "Smallest divisor of %d is %d\n", num, divisor

else

printf "%d is prime\n", num

}

When the remainder is zero in the first if statement, awk immediately breaks out of the containing for loop. This means that awk proceeds immediately to the statement following the loop and continues processing. (This is very different from the exit statement, which stops the entire awk program. See The exit Statement.)

The following program illustrates how the condition of a for or while statement could be replaced with a break inside an if:

# find smallest divisor of num

{

num = $1

for (divisor = 2; ; divisor++) {

if (num % divisor == 0) {

printf "Smallest divisor of %d is %d\n", num, divisor

break

}

if (divisor * divisor > num) {

printf "%d is prime\n", num

break

}

}

}

The break statement is also used to break out of the switch statement. This is discussed in The switch Statement.

The break statement has no meaning when used outside the body of a loop or switch. However, although it was never documented, historical implementations of awk treated the break statement outside of a loop as if it were a next statement (see The next Statement). (d.c.) Recent versions of BWK awk no longer allow this usage, nor does gawk.

The continue Statement

Similar to break, the continue statement is used only inside for, while, and do loops. It skips over the rest of the loop body, causing the next cycle around the loop to begin immediately. Contrast this with break, which jumps out of the loop altogether.

The continue statement in a for loop directs awk to skip the rest of the body of the loop and resume execution with the increment-expression of the for statement. The following program illustrates this fact:

BEGIN {

for (x = 0; x <= 20; x++) {

if (x == 5)

continue

printf "%d ", x

}

print ""

}

This program prints all the numbers from 0 to 20—except for 5, for which the printf is skipped. Because the increment ‘x++’ is not skipped, x does not remain stuck at 5. Contrast the for loop from the previous example with the following while loop:

BEGIN {

x = 0

while (x <= 20) {

if (x == 5)

continue

printf "%d ", x

x++

}

print ""

}

This program loops forever once x reaches 5, because the increment (‘x++’) is never reached.

The continue statement has no special meaning with respect to the switch statement, nor does it have any meaning when used outside the body of a loop. Historical versions of awk treated a continue statement outside a loop the same way they treated a break statement outside a loop: as if it were a next statement (discussed in the following section). (d.c.) Recent versions of BWK awk no longer work this way, nor does gawk.

The next Statement

The next statement forces awk to immediately stop processing the current record and go on to the next record. This means that no further rules are executed for the current record, and the rest of the current rule’s action isn’t executed.

Contrast this with the effect of the getline function (see Explicit Input with getline). That also causes awk to read the next record immediately, but it does not alter the flow of control in any way (i.e., the rest of the current action executes with a new input record).

At the highest level, awk program execution is a loop that reads an input record and then tests each rule’s pattern against it. If you think of this loop as a for statement whose body contains the rules, then the next statement is analogous to a continue statement. It skips to the end of the body of this implicit loop and executes the increment (which reads another record).

For example, suppose an awk program works only on records with four fields, and it shouldn’t fail when given bad input. To avoid complicating the rest of the program, write a “weed out” rule near the beginning, in the following manner:

NF != 4 {
    printf("%s:%d: skipped: NF != 4\n", FILENAME, FNR) > "/dev/stderr"
    next
}

Because of the next statement, the program’s subsequent rules won’t see the bad record. The error message is redirected to the standard error output stream, as error messages should be. For more detail, see Special Filenames in gawk.

If the next statement causes the end of the input to be reached, then the code in any END rules is executed. See The BEGIN and END Special Patterns.
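
As a sketch of this behavior, the following pipeline (with made-up data) skips malformed records with next and still runs its END rule once the input is exhausted:

```shell
# Three records, one of which fails the NF test and is weeded out by next.
printf '1 2 3 4\nbad record\n5 6 7 8\n' |
awk 'NF != 4 { next }                     # skip malformed records
     { good++ }                           # only well-formed records get here
     END { print good, "good records" }'  # prints "2 good records"
```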

The next statement is not allowed inside BEGINFILE and ENDFILE rules. See The BEGINFILE and ENDFILE Special Patterns.

According to the POSIX standard, the behavior is undefined if the next statement is used in a BEGIN or END rule. gawk treats it as a syntax error. Although POSIX does not disallow it, most other awk implementations don’t allow the next statement inside function bodies (see User-Defined Functions). Just as with any other next statement, a next statement inside a function body reads the next record and starts processing it with the first rule in the program.

The nextfile Statement

The nextfile statement is similar to the next statement. However, instead of abandoning processing of the current record, the nextfile statement instructs awk to stop processing the current datafile.

Upon execution of the nextfile statement, FILENAME is updated to the name of the next datafile listed on the command line, FNR is reset to 1, and processing starts over with the first rule in the program. If the nextfile statement causes the end of the input to be reached, then the code in any END rules is executed. An exception to this is when nextfile is invoked during execution of any statement in an END rule; in this case, it causes the program to stop immediately. See The BEGIN and END Special Patterns.

The nextfile statement is useful when there are many datafiles to process but it isn’t necessary to process every record in every file. Without nextfile, in order to move on to the next datafile, a program would have to continue scanning the unwanted records. The nextfile statement accomplishes this much more efficiently.
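
For instance, a program that needs only the first record of each file can be sketched like this (the filenames are placeholders; this requires an awk that supports nextfile, such as gawk, mawk, or a current BWK awk):

```shell
printf 'alpha\nbeta\n' > file1
printf 'gamma\ndelta\n' > file2

# Print the first record of each file, then skip directly to the next file.
awk 'FNR == 1 { print FILENAME ":", $0; nextfile }' file1 file2
# file1: alpha
# file2: gamma
```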

In gawk, execution of nextfile causes additional things to happen: any ENDFILE rules are executed if gawk is not currently in an END or BEGINFILE rule, ARGIND is incremented, and any BEGINFILE rules are executed. (ARGIND hasn’t been introduced yet. See Predefined Variables.)

With gawk, nextfile is useful inside a BEGINFILE rule to skip over a file that would otherwise cause gawk to exit with a fatal error. In this case, ENDFILE rules are not executed. See The BEGINFILE and ENDFILE Special Patterns.

Although it might seem that ‘close(FILENAME)’ would accomplish the same as nextfile, this isn’t true. close() is reserved for closing files, pipes, and coprocesses that are opened with redirections. It is not related to the main processing that awk does with the files listed in ARGV.

NOTE

For many years, nextfile was a common extension. In September 2012, it was accepted for inclusion into the POSIX standard. See the Austin Group website.

The current version of BWK awk and mawk also support nextfile. However, they don’t allow the nextfile statement inside function bodies (see User-Defined Functions). gawk does; a nextfile inside a function body reads the next record and starts processing it with the first rule in the program, just as any other nextfile statement.

The exit Statement

The exit statement causes awk to immediately stop executing the current rule and to stop processing input; any remaining input is ignored. The exit statement is written as follows:

exit [return code]

When an exit statement is executed from a BEGIN rule, the program stops processing everything immediately. No input records are read. However, if an END rule is present, as part of executing the exit statement, the END rule is executed (see The BEGIN and END Special Patterns). If exit is used in the body of an END rule, it causes the program to stop immediately.

An exit statement that is not part of a BEGIN or END rule stops the execution of any further automatic rules for the current record, skips reading any remaining input records, and executes the END rule if there is one. gawk also skips any ENDFILE rules; they do not execute.

In such a case, if you don’t want the END rule to do its job, set a variable to a nonzero value before the exit statement and check that variable in the END rule. See Assertions for an example that does this.

If an argument is supplied to exit, its value is used as the exit status code for the awk process. If no argument is supplied, exit causes awk to return a “success” status. In the case where an argument is supplied to a first exit statement, and then exit is called a second time from an END rule with no argument, awk uses the previously supplied exit value. (d.c.) See gawk’s Exit Status for more information.

For example, suppose an error condition occurs that is difficult or impossible to handle. Conventionally, programs report this by exiting with a nonzero status. An awk program can do this using an exit statement with a nonzero argument, as shown in the following example:

BEGIN {
    if (("date" | getline date_now) <= 0) {
        print "Can't get system date" > "/dev/stderr"
        exit 1
    }
    print "current date is", date_now
    close("date")
}

NOTE

For full portability, exit values should be between zero and 126, inclusive. Negative values, and values of 127 or greater, may not produce consistent results across different operating systems.
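
A minimal sketch of checking the status from the shell (the value 3 is arbitrary):

```shell
awk 'BEGIN { print "giving up"; exit 3 }'
echo "awk exited with status $?"   # prints "awk exited with status 3"
```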

Predefined Variables

Most awk variables are available to use for your own purposes; they never change unless your program assigns values to them, and they never affect anything unless your program examines them. However, a few variables in awk have special built-in meanings. awk examines some of these automatically, so that they enable you to tell awk how to do certain things. Others are set automatically by awk, so that they carry information from the internal workings of awk to your program.

This section documents all of gawk’s predefined variables, most of which are also documented in the chapters describing their areas of activity.

Built-in Variables That Control awk

The following is an alphabetical list of variables that you can change to control how awk does certain things.

The variables that are specific to gawk are marked with a pound sign (‘#’). These variables are gawk extensions. In other awk implementations or if gawk is in compatibility mode (see Command-Line Options), they are not special. (Any exceptions are noted in the description of each variable.)

BINMODE #

On non-POSIX systems, this variable specifies use of binary mode for all I/O. Numeric values of one, two, or three specify that input files, output files, or all files, respectively, should use binary I/O. A numeric value less than zero is treated as zero, and a numeric value greater than three is treated as three. Alternatively, string values of "r" or "w" specify that input files and output files, respectively, should use binary I/O. A string value of "rw" or "wr" indicates that all files should use binary I/O. Any other string value is treated the same as "rw", but causes gawk to generate a warning message. BINMODE is described in more detail in Using gawk on PC operating systems. mawk (see Other Freely Available awk Implementations) also supports this variable, but only using numeric values.

CONVFMT

A string that controls the conversion of numbers to strings (see Conversion of Strings and Numbers). It works by being passed, in effect, as the first argument to the sprintf() function (see String-Manipulation Functions). Its default value is "%.6g". CONVFMT was introduced by the POSIX standard.
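
A small sketch of CONVFMT at work; concatenating a number with the null string forces a number-to-string conversion, which goes through CONVFMT:

```shell
awk 'BEGIN {
    x = 3.14159
    print (x "")        # default CONVFMT "%.6g": prints 3.14159
    CONVFMT = "%.2g"
    print (x "")        # two significant digits: prints 3.1
}'
```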

FIELDWIDTHS #

A space-separated list of columns that tells gawk how to split input with fixed columnar boundaries. Assigning a value to FIELDWIDTHS overrides the use of FS and FPAT for field splitting. See Reading Fixed-Width Data for more information.

FPAT #

A regular expression (as a string) that tells gawk to create the fields based on text that matches the regular expression. Assigning a value to FPAT overrides the use of FS and FIELDWIDTHS for field splitting. See Defining Fields by Content for more information.

FS

The input field separator (see Specifying How Fields Are Separated). The value is a single-character string or a multicharacter regular expression that matches the separations between fields in an input record. If the value is the null string (""), then each character in the record becomes a separate field. (This behavior is a gawk extension. POSIX awk does not specify the behavior when FS is the null string. Nonetheless, some other versions of awk also treat "" specially.)

The default value is " ", a string consisting of a single space. As a special exception, this value means that any sequence of spaces, TABs, and/or newlines is a single separator.[35] It also causes spaces, TABs, and newlines at the beginning and end of a record to be ignored.

You can set the value of FS on the command line using the -F option:

awk -F, 'program' input-files

If gawk is using FIELDWIDTHS or FPAT for field splitting, assigning a value to FS causes gawk to return to the normal, FS-based field splitting. An easy way to do this is to simply say ‘FS = FS’, perhaps with an explanatory comment.
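
For example, the default value of FS collapses runs of whitespace into single separators and ignores leading and trailing whitespace:

```shell
printf '   alpha\tbeta  \n' | awk '{ print NF; print $1 }'
# 2
# alpha
```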

IGNORECASE #

If IGNORECASE is nonzero or non-null, then all string comparisons and all regular expression matching are case-independent. This applies to regexp matching with ‘~’ and ‘!~’, the gensub(), gsub(), index(), match(), patsplit(), split(), and sub() functions, record termination with RS, and field splitting with FS and FPAT. However, the value of IGNORECASE does not affect array subscripting and it does not affect field splitting when using a single-character field separator. See Case Sensitivity in Matching.

LINT #

When this variable is true (nonzero or non-null), gawk behaves as if the --lint command-line option is in effect (see Command-Line Options). With a value of "fatal", lint warnings become fatal errors. With a value of "invalid", only warnings about things that are actually invalid are issued. (This is not fully implemented yet.) Any other true value prints nonfatal warnings. Assigning a false value to LINT turns off the lint warnings.

This variable is a gawk extension. It is not special in other awk implementations. Unlike with the other special variables, changing LINT does affect the production of lint warnings, even if gawk is in compatibility mode. Much as the --lint and --traditional options independently control different aspects of gawk’s behavior, the control of lint warnings during program execution is independent of the flavor of awk being executed.

OFMT

A string that controls the conversion of numbers to strings (see Conversion of Strings and Numbers) for printing with the print statement. It works by being passed as the first argument to the sprintf() function (see String-Manipulation Functions). Its default value is "%.6g". Earlier versions of awk used OFMT to specify the format for converting numbers to strings in general expressions; this is now done by CONVFMT.

OFS

The output field separator (see Output Separators). It is output between the fields printed by a print statement. Its default value is " ", a string consisting of a single space.

ORS

The output record separator. It is output at the end of every print statement. Its default value is "\n", the newline character. (See Output Separators.)
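
A quick sketch showing both output separators together; OFS appears between the arguments of print, and ORS ends each output record:

```shell
echo 'a b' | awk 'BEGIN { OFS = "-"; ORS = "!\n" } { print $1, $2 }'
# a-b!
```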

PREC #

The working precision of arbitrary-precision floating-point numbers, 53 bits by default (see Setting the Precision).

ROUNDMODE #

The rounding mode to use for arbitrary-precision arithmetic on numbers, by default "N" (roundTiesToEven in the IEEE 754 standard; see Setting the Rounding Mode).

RS

The input record separator. Its default value is a string containing a single newline character, which means that an input record consists of a single line of text. It can also be the null string, in which case records are separated by runs of blank lines. If it is a regexp, records are separated by matches of the regexp in the input text. (See How Input Is Split into Records.)

The ability for RS to be a regular expression is a gawk extension. In most other awk implementations, or if gawk is in compatibility mode (see Command-Line Options), just the first character of RS’s value is used.
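
For example, setting RS to the null string enables the “paragraph mode” described above; in this mode, newline also acts as a field separator, regardless of FS:

```shell
printf 'alpha\nbeta\n\ngamma\n' |
awk 'BEGIN { RS = "" } { print NR ":", $1 }'
# 1: alpha
# 2: gamma
```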

SUBSEP

The subscript separator. It has the default value of "\034" and is used to separate the parts of the indices of a multidimensional array. Thus, the expression ‘foo["A", "B"]’ really accesses foo["A\034B"] (see Multidimensional Arrays).
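
A short sketch demonstrating the equivalence; the comma in a multidimensional subscript is really concatenation with SUBSEP:

```shell
awk 'BEGIN {
    foo["A", "B"] = 1
    if (("A" SUBSEP "B") in foo)    # same element as foo["A", "B"]
        print "same element"
}'
# same element
```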

TEXTDOMAIN #

Used for internationalization of programs at the awk level. It sets the default text domain for specially marked string constants in the source text, as well as for the dcgettext(), dcngettext(), and bindtextdomain() functions (see Chapter 13). The default value of TEXTDOMAIN is "messages".

Built-in Variables That Convey Information

The following is an alphabetical list of variables that awk sets automatically on certain occasions in order to provide information to your program.

The variables that are specific to gawk are marked with a pound sign (‘#’). These variables are gawk extensions. In other awk implementations or if gawk is in compatibility mode (see Command-Line Options), they are not special:

ARGC, ARGV

The command-line arguments available to awk programs are stored in an array called ARGV. ARGC is the number of command-line arguments present. See Other Command-Line Arguments. Unlike most awk arrays, ARGV is indexed from 0 to ARGC − 1. In the following example:

$ awk 'BEGIN {
>        for (i = 0; i < ARGC; i++)
>            print ARGV[i]
> }' inventory-shipped mail-list
awk
inventory-shipped
mail-list

ARGV[0] contains ‘awk’, ARGV[1] contains ‘inventory-shipped’, and ARGV[2] contains ‘mail-list’. The value of ARGC is three, one more than the index of the last element in ARGV, because the elements are numbered from zero.

The names ARGC and ARGV, as well as the convention of indexing the array from 0 to ARGC − 1, are derived from the C language’s method of accessing command-line arguments.

The value of ARGV[0] can vary from system to system. Also, you should note that the program text is not included in ARGV, nor are any of awk’s command-line options. See Using ARGC and ARGV for information about how awk uses these variables. (d.c.)

ARGIND #

The index in ARGV of the current file being processed. Every time gawk opens a new datafile for processing, it sets ARGIND to the index in ARGV of the filename. When gawk is processing the input files, ‘FILENAME == ARGV[ARGIND]’ is always true.

This variable is useful in file processing; it allows you to tell how far along you are in the list of datafiles as well as to distinguish between successive instances of the same filename on the command line.

While you can change the value of ARGIND within your awk program, gawk automatically sets it to a new value when it opens the next file.

ENVIRON

An associative array containing the values of the environment. The array indices are the environment variable names; the elements are the values of the particular environment variables. For example, ENVIRON["HOME"] might be "/home/arnold". Changing this array does not affect the environment passed on to any programs that awk may spawn via redirection or the system() function. (In a future version of gawk, it may do so.)

Some operating systems may not have environment variables. On such systems, the ENVIRON array is empty (except for ENVIRON["AWKPATH"] and ENVIRON["AWKLIBPATH"]; see The AWKPATH Environment Variable and The AWKLIBPATH Environment Variable).
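
A minimal sketch (the variable name GREETING is purely illustrative):

```shell
# Environment variables set for the awk process appear in ENVIRON.
GREETING=hello awk 'BEGIN { print ENVIRON["GREETING"] }'
# hello
```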

ERRNO #

If a system error occurs during a redirection for getline, during a read for getline, or during a close() operation, then ERRNO contains a string describing the error.

In addition, gawk clears ERRNO before opening each command-line input file. This enables checking if the file is readable inside a BEGINFILE pattern (see The BEGINFILE and ENDFILE Special Patterns).

Otherwise, ERRNO works similarly to the C variable errno. Except for the case just mentioned, gawk never clears it (sets it to zero or ""). Thus, you should only expect its value to be meaningful when an I/O operation returns a failure value, such as getline returning −1. You are, of course, free to clear it yourself before doing an I/O operation.

FILENAME

The name of the current input file. When no datafiles are listed on the command line, awk reads from the standard input and FILENAME is set to "-". FILENAME changes each time a new file is read (see Chapter 4). Inside a BEGIN rule, the value of FILENAME is "", because there are no input files being processed yet.[36] (d.c.) Note, though, that using getline (see Explicit Input with getline) inside a BEGIN rule can give FILENAME a value.

FNR

The current record number in the current file. awk increments FNR each time it reads a new record (see How Input Is Split into Records). awk resets FNR to zero each time it starts a new input file.
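
The difference between FNR and NR can be sketched with two small files (the filenames are placeholders):

```shell
printf 'a\nb\n' > one.txt
printf 'c\n' > two.txt

# FNR restarts for each file; NR keeps counting across files.
awk '{ print FILENAME, FNR, NR }' one.txt two.txt
# one.txt 1 1
# one.txt 2 2
# two.txt 1 3
```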

NF

The number of fields in the current input record. NF is set each time a new record is read, when a new field is created, or when $0 changes (see Examining Fields).

Unlike most of the variables described in this subsection, assigning a value to NF has the potential to affect awk’s internal workings. In particular, assignments to NF can be used to create fields in or remove fields from the current record. See Changing the Contents of a Field.
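
For example, decreasing NF discards the trailing fields and rebuilds $0:

```shell
echo 'a b c d e' | awk '{ NF = 3; print }'
# a b c
```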

FUNCTAB #

An array whose indices and corresponding values are the names of all the built-in, user-defined, and extension functions in the program.

NOTE

Attempting to use the delete statement with the FUNCTAB array causes a fatal error. Any attempt to assign to an element of FUNCTAB also causes a fatal error.

NR

The number of input records awk has processed since the beginning of the program’s execution (see How Input Is Split into Records). awk increments NR each time it reads a new record.

PROCINFO #

The elements of this array provide access to information about the running awk program. The following elements (listed alphabetically) are guaranteed to be available:

PROCINFO["egid"]

The value of the getegid() system call.

PROCINFO["euid"]

The value of the geteuid() system call.

PROCINFO["FS"]

This is "FS" if field splitting with FS is in effect, "FIELDWIDTHS" if field splitting with FIELDWIDTHS is in effect, or "FPAT" if field matching with FPAT is in effect.

PROCINFO["identifiers"]

A subarray, indexed by the names of all identifiers used in the text of the awk program. An identifier is simply the name of a variable (be it scalar or array), built-in function, user-defined function, or extension function. For each identifier, the value of the element is one of the following:

"array"

The identifier is an array.

"builtin"

The identifier is a built-in function.

"extension"

The identifier is an extension function loaded via @load or -l.

"scalar"

The identifier is a scalar.

"untyped"

The identifier is untyped (could be used as a scalar or an array; gawk doesn’t know yet).

"user"

The identifier is a user-defined function.

The values indicate what gawk knows about the identifiers after it has finished parsing the program; they are not updated while the program runs.

PROCINFO["gid"]

The value of the getgid() system call.

PROCINFO["pgrpid"]

The process group ID of the current process.

PROCINFO["pid"]

The process ID of the current process.

PROCINFO["ppid"]

The parent process ID of the current process.

PROCINFO["sorted_in"]

If this element exists in PROCINFO, its value controls the order in which array indices will be processed by ‘for (indx in array)’ loops. This is an advanced feature, so we defer the full description until later; see Scanning All Elements of an Array.

PROCINFO["strftime"]

The default time format string for strftime(). Assigning a new value to this element changes the default. See Time Functions.

PROCINFO["uid"]

The value of the getuid() system call.

PROCINFO["version"]

The version of gawk.

The following additional elements in the array are available to provide information about the MPFR and GMP libraries if your version of gawk supports arbitrary-precision arithmetic (see Chapter 15):

PROCINFO["mpfr_version"]

The version of the GNU MPFR library.

PROCINFO["gmp_version"]

The version of the GNU MP library.

PROCINFO["prec_max"]

The maximum precision supported by MPFR.

PROCINFO["prec_min"]

The minimum precision required by MPFR.

The following additional elements in the array are available to provide information about the version of the extension API, if your version of gawk supports dynamic loading of extension functions (see Chapter 16):

PROCINFO["api_major"]

The major version of the extension API.

PROCINFO["api_minor"]

The minor version of the extension API.

On some systems, the array may contain elements "group1" through "groupN" for some N, where N is the number of supplementary groups that the process has. Use the in operator to test for these elements (see Referring to an Array Element).

The PROCINFO array has the following additional uses:

§ It may be used to provide a timeout when reading from any open input file, pipe, or coprocess. See Reading Input with a Timeout for more information.

§ It may be used to cause coprocesses to communicate over pseudo-ttys instead of through two-way pipes; this is discussed further in Two-Way Communications with Another Process.

RLENGTH

The length of the substring matched by the match() function (see String-Manipulation Functions). RLENGTH is set by invoking the match() function. Its value is the length of the matched string, or −1 if no match is found.

RSTART

The start index in characters of the substring that is matched by the match() function (see String-Manipulation Functions). RSTART is set by invoking the match() function. Its value is the position of the string where the matched substring starts, or zero if no match was found.
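
A short sketch of match() setting both variables:

```shell
awk 'BEGIN {
    if (match("uses awk daily", /awk/))
        print RSTART, RLENGTH    # match starts at position 6, length 3
}'
# 6 3
```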

RT #

The input text that matched the text denoted by RS, the record separator. It is set every time a record is read.

SYMTAB #

An array whose indices are the names of all defined global variables and arrays in the program. SYMTAB makes gawk’s symbol table visible to the awk programmer. It is built as gawk parses the program and is complete before the program starts to run.

The array may be used for indirect access to read or write the value of a variable:

foo = 5
SYMTAB["foo"] = 4
print foo    # prints 4

The isarray() function (see Getting Type Information) may be used to test if an element in SYMTAB is an array. Also, you may not use the delete statement with the SYMTAB array.

You may use an index for SYMTAB that is not a predefined identifier:

SYMTAB["xxx"] = 5
print SYMTAB["xxx"]

This works as expected: in this case SYMTAB acts just like a regular array. The only difference is that you can’t then delete SYMTAB["xxx"].

The SYMTAB array is more interesting than it looks. Andrew Schorr points out that it effectively gives awk data pointers. Consider his example:

# Indirect multiply of any variable by amount, return result
function multiply(variable, amount)
{
    return SYMTAB[variable] *= amount
}

NOTE

In order to avoid severe time-travel paradoxes,[37] neither FUNCTAB nor SYMTAB is available as an element within the SYMTAB array.

CHANGING NR AND FNR

awk increments NR and FNR each time it reads a record, instead of setting them to the absolute value of the number of records read. This means that a program can change these variables and their new values are incremented for each record. (d.c.) The following example shows this:

$ echo '1
> 2
> 3
> 4' | awk 'NR == 2 { NR = 17 }
>           { print NR }'
1
17
18
19

Before FNR was added to the awk language (see Major Changes Between V7 and SVR3.1), many awk programs used this feature to track the number of records in a file by resetting NR to zero when FILENAME changed.

Using ARGC and ARGV

Built-in Variables That Convey Information presented the following program describing the information contained in ARGC and ARGV:

$ awk 'BEGIN {
>        for (i = 0; i < ARGC; i++)
>            print ARGV[i]
> }' inventory-shipped mail-list
awk
inventory-shipped
mail-list

In this example, ARGV[0] contains ‘awk’, ARGV[1] contains ‘inventory-shipped’, and ARGV[2] contains ‘mail-list’. Notice that the awk program is not entered in ARGV. The other command-line options, with their arguments, are also not entered. This includes variable assignments done with the -v option (see Command-Line Options). Normal variable assignments on the command line are treated as arguments and do show up in the ARGV array. Given the following program in a file named showargs.awk:

BEGIN {
    printf "A=%d, B=%d\n", A, B
    for (i = 0; i < ARGC; i++)
        printf "\tARGV[%d] = %s\n", i, ARGV[i]
}

END { printf "A=%d, B=%d\n", A, B }

Running it produces the following:

$ awk -v A=1 -f showargs.awk B=2 /dev/null
A=1, B=0
        ARGV[0] = awk
        ARGV[1] = B=2
        ARGV[2] = /dev/null
A=1, B=2

A program can alter ARGC and the elements of ARGV. Each time awk reaches the end of an input file, it uses the next element of ARGV as the name of the next input file. By storing a different string there, a program can change which files are read. Use "-" to represent the standard input. Storing additional elements and incrementing ARGC causes additional files to be read.

If the value of ARGC is decreased, that eliminates input files from the end of the list. By recording the old value of ARGC elsewhere, a program can treat the eliminated arguments as something other than filenames.

To eliminate a file from the middle of the list, store the null string ("") into ARGV in place of the file’s name. As a special feature, awk ignores filenames that have been replaced with the null string. Another option is to use the delete statement to remove elements from ARGV (see The delete Statement).

All of these actions are typically done in the BEGIN rule, before actual processing of the input begins. See Splitting a Large File into Pieces and Duplicating Output into Multiple Files for examples of each way of removing elements from ARGV.

To actually get options into an awk program, end the awk options with -- and then supply the awk program’s options, in the following manner:

awk -f myprog.awk -- -v -q file1 file2 …

The following fragment processes ARGV in order to examine, and then remove, the previously mentioned command-line options:

BEGIN {
    for (i = 1; i < ARGC; i++) {
        if (ARGV[i] == "-v")
            verbose = 1
        else if (ARGV[i] == "-q")
            debug = 1
        else if (ARGV[i] ~ /^-./) {
            e = sprintf("%s: unrecognized option -- %c",
                        ARGV[0], substr(ARGV[i], 2, 1))
            print e > "/dev/stderr"
        } else
            break
        delete ARGV[i]
    }
}

Ending the awk options with -- isn’t necessary in gawk. Unless --posix has been specified, gawk silently puts any unrecognized options into ARGV for the awk program to deal with. As soon as it sees an unknown option, gawk stops looking for other options that it might otherwise recognize. The previous command line with gawk would be:

gawk -f myprog.awk -q -v file1 file2 …

Because -q is not a valid gawk option, it and the following -v are passed on to the awk program. (See Processing Command-Line Options for an awk library function that parses command-line options.)

When designing your program, you should choose options that don’t conflict with gawk’s, because it will process any options that it accepts before passing the rest of the command line on to your program. Using ‘#!’ with the -E option may help (see Executable awk Programs and Command-Line Options).

Summary

§ Pattern–action pairs make up the basic elements of an awk program. Patterns are either normal expressions, range expressions, or regexp constants; one of the special keywords BEGIN, END, BEGINFILE, or ENDFILE; or empty. The action executes if the current record matches the pattern. Empty (missing) patterns match all records.

§ I/O from BEGIN and END rules has certain constraints. This is also true, only more so, for BEGINFILE and ENDFILE rules. The latter two give you “hooks” into gawk’s file processing, allowing you to recover from a file that otherwise would cause a fatal error (such as a file that cannot be opened).

§ Shell variables can be used in awk programs by careful use of shell quoting. It is easier to pass a shell variable into awk by using the -v option and an awk variable.

§ Actions consist of statements enclosed in curly braces. Statements are built up from expressions, control statements, compound statements, input and output statements, and deletion statements.

§ The control statements in awk are if-else, while, for, and do-while. gawk adds the switch statement. There are two flavors of for statement: one for performing general looping, and the other for iterating through an array.

§ break and continue let you exit early or start the next iteration of a loop (or get out of a switch).

§ next and nextfile let you read the next record and start over at the top of your program or skip to the next input file and start over, respectively.

§ The exit statement terminates your program. When executed from an action (or function body), it transfers control to the END statements. From an END statement body, it exits immediately. You may pass an optional numeric value to be used as awk’s exit status.

§ Some predefined variables provide control over awk, mainly for I/O. Other variables convey information from awk to your program.

§ ARGC and ARGV make the command-line arguments available to your program. Manipulating them from a BEGIN rule lets you control how awk will process the provided datafiles.


[34] The original version of awk kept reading and ignoring input until the end of the file was seen.

[35] In POSIX awk, newline does not count as whitespace.

[36] Some early implementations of Unix awk initialized FILENAME to "-", even if there were datafiles to be processed. This behavior was incorrect and should not be relied upon in your programs.

[37] Not to mention difficult implementation issues.