Effective awk Programming (2015)

Part I. The awk Language

Chapter 9. Functions

This chapter describes awk’s built-in functions, which fall into three categories: numeric, string, and I/O. gawk provides additional groups of functions to work with values that represent time, do bit manipulation, sort arrays, provide type information, and internationalize and localize programs.

Besides the built-in functions, awk has provisions for writing new functions that the rest of a program can use. The second part of this chapter describes these user-defined functions. Finally, we explore indirect function calls, a gawk-specific extension that lets you determine at runtime what function is to be called.

Built-in Functions

Built-in functions are always available for your awk program to call. This section defines all the built-in functions in awk; some of these are mentioned in other sections but are summarized here for your convenience.

Calling Built-in Functions

To call one of awk’s built-in functions, write the name of the function followed by arguments in parentheses. For example, ‘atan2(y + z, 1)’ is a call to the function atan2() and has two arguments.

Whitespace is ignored between the built-in function name and the opening parenthesis, but nonetheless it is good practice to avoid using whitespace there. User-defined functions do not permit whitespace in this way, and it is easier to avoid mistakes by following a simple convention that always works—no whitespace after a function name.

Each built-in function accepts a certain number of arguments. In some cases, arguments can be omitted. The defaults for omitted arguments vary from function to function and are described under the individual functions. In some awk implementations, extra arguments given to built-in functions are ignored. However, in gawk, it is a fatal error to give extra arguments to a built-in function.

When a function is called, expressions that create the function’s actual parameters are evaluated completely before the call is performed. For example, in the following code fragment:

i = 4
j = sqrt(i++)

the variable i is incremented to the value five before sqrt() is called with a value of four for its actual parameter. The order of evaluation of the expressions used for the function’s parameters is undefined. Thus, avoid writing programs that assume that parameters are evaluated from left to right or from right to left. For example:

i = 5
j = atan2(++i, i *= 2)

If the order of evaluation is left to right, then i first becomes six, and then 12, and atan2() is called with the two arguments six and 12. But if the order of evaluation is right to left, i first becomes 10, then 11, and atan2() is called with the two arguments 11 and 10.
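When the order does matter, one way to stay portable is to compute each argument in a separate statement before the call. The following is a minimal sketch of that idea; the variable names y, x, and the shell variable out are chosen here purely for illustration:

```shell
out=$(awk 'BEGIN {
    i = 5
    y = ++i          # evaluated first: i becomes 6
    x = (i *= 2)     # evaluated second: i becomes 12
    printf "%d %d %.4f\n", y, x, atan2(y, x)
}')
printf '%s\n' "$out"   # prints: 6 12 0.4636
```

Because each statement completes before the next one starts, the result no longer depends on how a particular awk orders argument evaluation.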

Numeric Functions

The following list describes all of the built-in functions that work with numbers. Optional parameters are enclosed in square brackets ([ ]):

atan2(y, x)

Return the arctangent of y / x in radians. You can use ‘pi = atan2(0, -1)’ to retrieve the value of π.

cos(x)

Return the cosine of x, with x in radians.

exp(x)

Return the exponential of x (e ^ x) or report an error if x is out of range. The range of values x can have depends on your machine’s floating-point representation.

int(x)

Return the nearest integer to x, located between x and zero and truncated toward zero. For example, int(3) is 3, int(3.9) is 3, int(-3.9) is −3, and int(-3) is −3 as well.
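The truncation behavior can be checked directly. The sketch below prints the four values just mentioned and then, as an assumption of ours rather than anything built into awk, shows one common idiom for rounding to the nearest integer instead of truncating:

```shell
out=$(awk 'BEGIN {
    print int(3), int(3.9), int(-3.9), int(-3)
    # Illustrative idiom (not a built-in): round half away from zero.
    x = -3.9
    r = (x < 0) ? int(x - 0.5) : int(x + 0.5)
    print r
}')
printf '%s\n' "$out"
# prints:
# 3 3 -3 -3
# -4
```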

log(x)

Return the natural logarithm of x, if x is positive; otherwise, return NaN (“not a number”) on IEEE 754 systems. Additionally, gawk prints a warning message when x is negative.

rand()

Return a random number. The values of rand() are uniformly distributed between zero and one. The value could be zero but is never one.[41]

Often random integers are needed instead. Following is a user-defined function that can be used to obtain a random nonnegative integer less than n:

function randint(n)
{
    return int(n * rand())
}

The multiplication produces a random number greater than or equal to zero and less than n. Using int(), this result is made into an integer between zero and n − 1, inclusive.

The following example uses a similar function to produce random integers between one and n. This program prints a new random number for each input record:

# Function to roll a simulated die.
function roll(n) { return 1 + int(rand() * n) }

# Roll 3 six-sided dice and
# print total number of points.
{
    printf("%d points\n", roll(6) + roll(6) + roll(6))
}

CAUTION

In most awk implementations, including gawk, rand() starts generating numbers from the same starting number, or seed, each time you run awk.[42] Thus, a program generates the same results each time you run it. The numbers are random within one awk run but predictable from run to run. This is convenient for debugging, but if you want a program to do different things each time it is used, you must change the seed to a value that is different in each run. To do this, use srand().

sin(x)

Return the sine of x, with x in radians.

sqrt(x)

Return the positive square root of x; for example, sqrt(4) is 2. gawk prints a warning message if x is negative.

srand([x])

Set the starting point, or seed, for generating random numbers to the value x.

Each seed value leads to a particular sequence of random numbers.[43] Thus, if the seed is set to the same value a second time, the same sequence of random numbers is produced again.

CAUTION

Different awk implementations use different random-number generators internally. Don’t expect the same awk program to produce the same series of random numbers when executed by different versions of awk.

If the argument x is omitted, as in ‘srand()’, then the current date and time of day are used for a seed. This is the way to get random numbers that differ from one run to the next.

The return value of srand() is the previous seed. This makes it easy to keep track of the seeds in case you need to consistently reproduce sequences of random numbers.

POSIX does not specify the initial seed; it differs among awk implementations.
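The effect of reseeding can be seen in a short experiment. This sketch seeds the generator with an arbitrary value (42 is our choice, nothing special), draws two numbers, reseeds with the same value, and draws again; the names a, b, result, and out are illustrative only:

```shell
out=$(awk 'BEGIN {
    srand(42); a = rand() " " rand()   # first pair of numbers
    srand(42); b = rand() " " rand()   # same seed, so the same pair again
    result = (a == b) ? "same" : "different"
    print result
}')
printf '%s\n' "$out"   # prints: same
```

The actual numbers drawn depend on the awk implementation, which is why the sketch compares the two sequences rather than printing them.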

String-Manipulation Functions

The functions in this section look at or change the text of one or more strings.

gawk understands locales (see Where You Are Makes a Difference) and does all string processing in terms of characters, not bytes. This distinction is particularly important to understand for locales where one character may be represented by multiple bytes. Thus, for example, length() returns the number of characters in a string, and not the number of bytes used to represent those characters. Similarly, index() works with character indices, and not byte indices.

CAUTION

A number of functions deal with indices into strings. For these functions, the first character of a string is at position (index) one. This is different from C and the languages descended from it, where the first character is at position zero. You need to remember this when doing index calculations, particularly if you are used to C.
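A quick check of the one-based convention, using only standard functions (the shell variable out is just a convenient place to hold the result):

```shell
out=$(awk 'BEGIN {
    s = "peanut"
    # The first character is at position 1, not 0.
    print substr(s, 1, 1), index(s, "p"), index(s, "an")
}')
printf '%s\n' "$out"   # prints: p 1 3
```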

In the following list, optional parameters are enclosed in square brackets ([ ]). Several functions perform string substitution; the full discussion is provided in the description of the sub() function, which comes toward the end, because the list is presented alphabetically.

Those functions that are specific to gawk are marked with a pound sign (‘#’). They are not available in compatibility mode (see Command-Line Options):

asort(source [, dest [, how ] ]) #
asorti(source [, dest [, how ] ]) #

These two functions are similar in behavior, so they are described together.

NOTE

The following description ignores the third argument, how, as it requires understanding features that we have not discussed yet. Thus, the discussion here is a deliberate simplification. (We do provide all the details later on; see Sorting Array Values and Indices with gawk for the full story.)

Both functions return the number of elements in the array source. For asort(), gawk sorts the values of source and replaces the indices of the sorted values of source with sequential integers starting with one. If the optional array dest is specified, then source is duplicated into dest. dest is then sorted, leaving the indices of source unchanged.

When comparing strings, IGNORECASE affects the sorting (see Sorting Array Values and Indices with gawk). If the source array contains subarrays as values (see Arrays of Arrays), they will come last, after all scalar values. Subarrays are not recursively sorted.

For example, if the contents of a are as follows:

a["last"] = "de"
a["first"] = "sac"
a["middle"] = "cul"

A call to asort():

asort(a)

results in the following contents of a:

a[1] = "cul"
a[2] = "de"
a[3] = "sac"

The asorti() function works similarly to asort(); however, the indices are sorted, instead of the values. Thus, in the previous example, starting with the same initial set of indices and values in a, calling ‘asorti(a)’ would yield:

a[1] = "first"
a[2] = "last"
a[3] = "middle"
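When the original indices must be preserved, the optional dest argument is the tool to use. The following sketch requires gawk; the array name sorted and the shell variable out are our own choices:

```shell
out=$(gawk 'BEGIN {
    a["last"] = "de"; a["first"] = "sac"; a["middle"] = "cul"
    n = asort(a, sorted)          # the sorted copy goes into "sorted"
    for (i = 1; i <= n; i++)
        printf "%s ", sorted[i]
    print a["last"]               # the source array keeps its indices
}')
printf '%s\n' "$out"   # prints: cul de sac de
```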

gensub(regexp, replacement, how [, target]) #

Search the target string target for matches of the regular expression regexp. If how is a string beginning with ‘g’ or ‘G’ (short for “global”), then replace all matches of regexp with replacement. Otherwise, how is treated as a number indicating which match of regexp to replace. If no target is supplied, use $0. It returns the modified string as the result of the function and the original target string is not changed.

gensub() is a general substitution function. Its purpose is to provide more features than the standard sub() and gsub() functions.

gensub() provides an additional feature that is not available in sub() or gsub(): the ability to specify components of a regexp in the replacement text. This is done by using parentheses in the regexp to mark the components and then specifying ‘\N’ in the replacement text, where N is a digit from 1 to 9. For example:

$ gawk '
> BEGIN {
>     a = "abc def"
>     b = gensub(/(.+) (.+)/, "\\2 \\1", "g", a)
>     print b
> }'
def abc

As with sub(), you must type two backslashes in order to get one into the string. In the replacement text, the sequence ‘\0’ represents the entire matched text, as does the character ‘&’.

The following example shows how you can use the third argument to control which match of the regexp should be changed:

$ echo a b c a b c |
> gawk '{ print gensub(/a/, "AA", 2) }'
a b c AA b c

In this case, $0 is the default target string. gensub() returns the new string as its result, which is passed directly to print for printing.

If the how argument is a string that does not begin with ‘g’ or ‘G’, or if it is a number that is less than or equal to zero, only one substitution is performed. If how is zero, gawk issues a warning message.

If regexp does not match target, gensub()’s return value is the original unchanged value of target.

gsub(regexp, replacement [, target])

Search target for all of the longest, leftmost, nonoverlapping matching substrings it can find and replace them with replacement. The ‘g’ in gsub() stands for “global,” which means replace everywhere. For example:

{ gsub(/Britain/, "United Kingdom"); print }

replaces all occurrences of the string ‘Britain’ with ‘United Kingdom’ for all input records.

The gsub() function returns the number of substitutions made. If the variable to search and alter (target) is omitted, then the entire input record ($0) is used. As in sub(), the characters ‘&’ and ‘\’ are special, and the third argument must be assignable.
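The return value and the explicit target argument can be combined, as in this small sketch (the variable names s, n, and out are illustrative):

```shell
out=$(awk 'BEGIN {
    s = "Britain and Britain"
    n = gsub(/Britain/, "United Kingdom", s)   # s is modified in place
    print n ": " s
}')
printf '%s\n' "$out"   # prints: 2: United Kingdom and United Kingdom
```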

index(in, find)

Search the string in for the first occurrence of the string find, and return the position in characters where that occurrence begins in the string in. Consider the following example:

$ awk 'BEGIN { print index("peanut", "an") }'

3

If find is not found, index() returns zero.

With BWK awk and gawk, it is a fatal error to use a regexp constant for find. Other implementations allow it, simply treating the regexp constant as an expression meaning ‘$0 ~ /regexp/’. (d.c.)

length([string])

Return the number of characters in string. If string is a number, the length of the digit string representing that number is returned. For example, length("abcde") is five. By contrast, length(15 * 35) works out to three. In this example, 15 ⋅ 35 = 525, and 525 is then converted to the string "525", which has three characters.

If no argument is supplied, length() returns the length of $0.

NOTE

In older versions of awk, the length() function could be called without any parentheses. Doing so is considered poor practice, although the 2008 POSIX standard explicitly allows it, to support historical practice. For programs to be maximally portable, always supply the parentheses.

If length() is called with a variable that has not been used, gawk forces the variable to be a scalar. Other implementations of awk leave the variable without a type. (d.c.) Consider:

$ gawk 'BEGIN { print length(x) ; x[1] = 1 }'
0
error→ gawk: fatal: attempt to use scalar `x' as array

$ nawk 'BEGIN { print length(x) ; x[1] = 1 }'
0

If --lint has been specified on the command line, gawk issues a warning about this.

With gawk and several other awk implementations, when given an array argument, the length() function returns the number of elements in the array. (c.e.) This is less useful than it might seem at first, as the array is not guaranteed to be indexed from one to the number of elements in it. If --lint is provided on the command line (see Command-Line Options), gawk warns that passing an array argument is not portable. If --posix is supplied, using an array argument is a fatal error (see Chapter 8).

match(string, regexp [, array])

Search string for the longest, leftmost substring matched by the regular expression regexp and return the character position (index) at which that substring begins (one, if it starts at the beginning of string). If no match is found, return zero.

The regexp argument may be either a regexp constant (/…/) or a string constant ("…"). In the latter case, the string is treated as a regexp to be matched. See Using Dynamic Regexps for a discussion of the difference between the two forms, and the implications for writing your program correctly.

The order of the first two arguments is the opposite of most other string functions that work with regular expressions, such as sub() and gsub(). It might help to remember that for match(), the order is the same as for the ‘~’ operator: ‘string ~ regexp’.

The match() function sets the predefined variable RSTART to the index. It also sets the predefined variable RLENGTH to the length in characters of the matched substring. If no match is found, RSTART is set to zero, and RLENGTH to −1.

For example:

{
    if ($1 == "FIND")
        regex = $2
    else {
        where = match($0, regex)
        if (where != 0)
            print "Match of", regex, "found at", where, "in", $0
    }
}

This program looks for lines that match the regular expression stored in the variable regex. This regular expression can be changed. If the first word on a line is ‘FIND’, regex is changed to be the second word on that line. Therefore, if given:

FIND ru+n
My program runs
but not very quickly
FIND Melvin
JF+KM
This line is property of Reality Engineering Co.
Melvin was here.

awk prints:

Match of ru+n found at 12 in My program runs
Match of Melvin found at 1 in Melvin was here.
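RSTART and RLENGTH are most often used together with substr() to pull out the text that matched. A minimal sketch, using one of the input lines above (the variable names s and out are our own):

```shell
out=$(awk 'BEGIN {
    s = "My program runs"
    if (match(s, /ru+n/))
        print RSTART, RLENGTH, substr(s, RSTART, RLENGTH)
    match(s, /xyz/)               # no match this time
    print RSTART, RLENGTH
}')
printf '%s\n' "$out"
# prints:
# 12 3 run
# 0 -1
```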

If array is present, it is cleared, and then the zeroth element of array is set to the entire portion of string matched by regexp. If regexp contains parentheses, the integer-indexed elements of array are set to contain the portion of string matching the corresponding parenthesized subexpression. For example:

$ echo foooobazbarrrrr |
> gawk '{ match($0, /(fo+).+(bar*)/, arr)
>         print arr[1], arr[2] }'
foooo barrrrr

In addition, multidimensional subscripts are available providing the start index and length of each matched subexpression:

$ echo foooobazbarrrrr |
> gawk '{ match($0, /(fo+).+(bar*)/, arr)
>         print arr[1], arr[2]
>         print arr[1, "start"], arr[1, "length"]
>         print arr[2, "start"], arr[2, "length"]
> }'
foooo barrrrr
1 5
9 7

There may not be subscripts for the start and length for every parenthesized subexpression, because they may not all have matched text; thus, they should be tested for with the in operator (see Referring to an Array Element).

The array argument to match() is a gawk extension. In compatibility mode (see Command-Line Options), using a third argument is a fatal error.

patsplit(string, array [, fieldpat [, seps ] ]) #

Divide string into pieces defined by fieldpat and store the pieces in array and the separator strings in the seps array. The first piece is stored in array[1], the second piece in array[2], and so forth. The third argument, fieldpat, is a regexp describing the fields in string (just as FPAT is a regexp describing the fields in input records). It may be either a regexp constant or a string. If fieldpat is omitted, the value of FPAT is used. patsplit() returns the number of elements created. seps[i] is the separator string between array[i] and array[i+1]. Any leading separator will be in seps[0].

The patsplit() function splits strings into pieces in a manner similar to the way input lines are split into fields using FPAT (see Defining Fields by Content).

Before splitting the string, patsplit() deletes any previously existing elements in the arrays array and seps.
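As an illustration, the following sketch (gawk only) treats runs of digits as the fields and shows where the text between them ends up. The input string and the names parts, seps, and out are made up for the example:

```shell
out=$(gawk 'BEGIN {
    n = patsplit("a1b22c333", parts, /[0-9]+/, seps)
    printf "%d:", n
    # seps[i-1] is the text just before parts[i]; seps[0] is the leading one.
    for (i = 1; i <= n; i++)
        printf " %s[%s]", seps[i-1], parts[i]
    print ""
}')
printf '%s\n' "$out"   # prints: 3: a[1] b[22] c[333]
```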

split(string, array [, fieldsep [, seps ] ])

Divide string into pieces separated by fieldsep and store the pieces in array and the separator strings in the seps array. The first piece is stored in array[1], the second piece in array[2], and so forth. The string value of the third argument, fieldsep, is a regexp describing where to split string (much as FS can be a regexp describing where to split input records). If fieldsep is omitted, the value of FS is used. split() returns the number of elements created. seps is a gawk extension, with seps[i] being the separator string between array[i] and array[i+1]. If fieldsep is a single space, then any leading whitespace goes into seps[0] and any trailing whitespace goes into seps[n], where n is the return value of split() (i.e., the number of elements in array).

The split() function splits strings into pieces in a manner similar to the way input lines are split into fields. For example:

split("cul-de-sac", a, "-", seps)

splits the string "cul-de-sac" into three fields using ‘-’ as the separator. It sets the contents of the array a as follows:

a[1] = "cul"
a[2] = "de"
a[3] = "sac"

and sets the contents of the array seps as follows:

seps[1] = "-"
seps[2] = "-"

The value returned by this call to split() is three.

As with input field splitting, when the value of fieldsep is " ", leading and trailing whitespace is ignored in values assigned to the elements of array but not in seps, and the elements are separated by runs of whitespace. Also, as with input field splitting, if fieldsep is the null string, each individual character in the string is split into its own array element. (c.e.)

Note, however, that RS has no effect on the way split() works. Even though ‘RS = ""’ causes the newline character to also be an input field separator, this does not affect how split() splits strings.

Modern implementations of awk, including gawk, allow the third argument to be a regexp constant (/…/) as well as a string. (d.c.) The POSIX standard allows this as well. See Using Dynamic Regexps for a discussion of the difference between using a string constant or a regexp constant, and the implications for writing your program correctly.

Before splitting the string, split() deletes any previously existing elements in the arrays array and seps.

If string is null, the array has no elements. (So this is a portable way to delete an entire array with one statement. See The delete Statement.)

If string does not match fieldsep at all (but is not null), array has one element only. The value of that element is the original string.
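The special cases described above can be exercised with the standard three-argument form of split(). This is a small sketch; the array names w, e, and one, and the sample strings, are ours:

```shell
out=$(awk 'BEGIN {
    n1 = split("  a  b ", w, " ")     # single space: runs of whitespace separate
    n2 = split("", e, "-")            # null string: no elements at all
    n3 = split("nodash", one, "-")    # separator absent: exactly one element
    print n1, w[1], w[2], n2, n3, one[1]
}')
printf '%s\n' "$out"   # prints: 2 a b 0 1 nodash
```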

In POSIX mode (see Command-Line Options), the fourth argument is not allowed.

sprintf(format, expression1, …)

Return (without printing) the string that printf would have printed out with the same arguments (see Using printf Statements for Fancier Printing). For example:

pival = sprintf("pi = %.2f (approx.)", 22/7)

assigns the string ‘pi = 3.14 (approx.)’ to the variable pival.

strtonum(str) #

Examine str and return its numeric value. If str begins with a leading ‘0’, strtonum() assumes that str is an octal number. If str begins with a leading ‘0x’ or ‘0X’, strtonum() assumes that str is a hexadecimal number. For example:

$ echo 0x11 |
> gawk '{ printf "%d\n", strtonum($1) }'
17

Using the strtonum() function is not the same as adding zero to a string value; the automatic coercion of strings to numbers works only for decimal data, not for octal or hexadecimal.[44]

Note also that strtonum() uses the current locale’s decimal point for recognizing numbers (see Where You Are Makes a Difference).
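The difference between strtonum() and ordinary string-to-number coercion is easy to see side by side. The sketch below assumes gawk in its default mode (that is, without --posix or --non-decimal-data):

```shell
out=$(gawk 'BEGIN {
    # strtonum() honors the 0x and 0 prefixes; adding zero does not.
    print strtonum("0x11"), "0x11" + 0, strtonum("011"), "011" + 0
}')
printf '%s\n' "$out"   # prints: 17 0 9 11
```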

sub(regexp, replacement [, target])

Search target, which is treated as a string, for the leftmost, longest substring matched by the regular expression regexp. Modify the entire string by replacing the matched text with replacement. The modified string becomes the new value of target. Return the number of substitutions made (zero or one).

The regexp argument may be either a regexp constant (/…/) or a string constant ("…"). In the latter case, the string is treated as a regexp to be matched. See Using Dynamic Regexps for a discussion of the difference between the two forms, and the implications for writing your program correctly.

This function is peculiar because target is not simply used to compute a value, and not just any expression will do—it must be a variable, field, or array element so that sub() can store a modified value there. If this argument is omitted, then the default is to use and alter $0.[45] For example:

str = "water, water, everywhere"
sub(/at/, "ith", str)

sets str to ‘wither, water, everywhere’, by replacing the leftmost longest occurrence of ‘at’ with ‘ith’.

If the special character ‘&’ appears in replacement, it stands for the precise substring that was matched by regexp. (If the regexp can match more than one string, then this precise substring may vary.) For example:

{ sub(/candidate/, "& and his wife"); print }

changes the first occurrence of ‘candidate’ to ‘candidate and his wife’ on each input line. Here is another example:

$ awk 'BEGIN {
>     str = "daabaaa"
>     sub(/a+/, "C&C", str)
>     print str
> }'
dCaaCbaaa

This shows how ‘&’ can represent a nonconstant string and also illustrates the “leftmost, longest” rule in regexp matching (see How Much Text Matches?).

The effect of this special character (‘&’) can be turned off by putting a backslash before it in the string. As usual, to insert one backslash in the string, you must write two backslashes. Therefore, write ‘\\&’ in a string constant to include a literal ‘&’ in the replacement. For example, the following shows how to replace the first ‘|’ on each line with an ‘&’:

{ sub(/\|/, "\\&"); print }

As mentioned, the third argument to sub() must be a variable, field, or array element. Some versions of awk allow the third argument to be an expression that is not an lvalue. In such a case, sub() still searches for the pattern and returns zero or one, but the result of the substitution (if any) is thrown away because there is no place to put it. Such versions of awk accept expressions like the following:

sub(/USA/, "United States", "the USA and Canada")

For historical compatibility, gawk accepts such erroneous code. However, using any other nonchangeable object as the third parameter causes a fatal error and your program will not run.

Finally, if the regexp is not a regexp constant, it is converted into a string, and then the value of that string is treated as the regexp to match.
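For instance, the pattern may come from a variable instead of a regexp constant. In this sketch the names pat, n, and out are illustrative, and the return value is printed to show that exactly one substitution took place:

```shell
out=$(awk 'BEGIN {
    str = "water, water, everywhere"
    pat = "wat"                   # a string, used as a dynamic regexp
    n = sub(pat, "[&]", str)      # only the leftmost match is replaced
    print n, str
}')
printf '%s\n' "$out"   # prints: 1 [wat]er, water, everywhere
```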

substr(string, start [, length ])

Return a length-character-long substring of string, starting at character number start. The first character of a string is character number one.[46] For example, substr("washington", 5, 3) returns "ing".

If length is not present, substr() returns the whole suffix of string that begins at character number start. For example, substr("washington", 5) returns "ington". The whole suffix is also returned if length is greater than the number of characters remaining in the string, counting from character start.

If start is less than one, substr() treats it as if it was one. (POSIX doesn’t specify what to do in this case: BWK awk acts this way, and therefore gawk does too.) If start is greater than the number of characters in the string, substr() returns the null string. Similarly, if length is present but less than or equal to zero, the null string is returned.
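The boundary cases involving a large start or an out-of-range length can be confirmed with a one-liner. The brackets in the format string are there only to make the null strings visible:

```shell
out=$(awk 'BEGIN {
    s = "washington"
    # start past the end, zero length, and a length that overshoots
    printf "[%s] [%s] [%s]\n", substr(s, 11), substr(s, 5, 0), substr(s, 5, 100)
}')
printf '%s\n' "$out"   # prints: [] [] [ington]
```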

The string returned by substr() cannot be assigned. Thus, it is a mistake to attempt to change a portion of a string, as shown in the following example:

string = "abcdef"
# try to get "abCDEf", won't work
substr(string, 3, 3) = "CDE"

It is also a mistake to use substr() as the third argument of sub() or gsub():

gsub(/xyz/, "pdq", substr($0, 5, 20)) # WRONG

(Some commercial versions of awk treat substr() as assignable, but doing so is not portable.)

If you need to replace bits and pieces of a string, combine substr() with string concatenation, in the following manner:

string = "abcdef"
string = substr(string, 1, 2) "CDE" substr(string, 6)
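If you do this often, the concatenation can be wrapped in a small user-defined function. The function below, replace_at(), is a hypothetical helper written for this illustration; it is not part of awk:

```shell
out=$(awk '
# Hypothetical helper: return s with the len characters that start
# at position start replaced by repl.
function replace_at(s, start, len, repl)
{
    return substr(s, 1, start - 1) repl substr(s, start + len)
}

BEGIN { print replace_at("abcdef", 3, 3, "CDE") }
')
printf '%s\n' "$out"   # prints: abCDEf
```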

tolower(string)

Return a copy of string, with each uppercase character in the string replaced with its corresponding lowercase character. Nonalphabetic characters are left unchanged. For example, tolower("MiXeD cAsE 123") returns "mixed case 123".

toupper(string)

Return a copy of string, with each lowercase character in the string replaced with its corresponding uppercase character. Nonalphabetic characters are left unchanged. For example, toupper("MiXeD cAsE 123") returns "MIXED CASE 123".
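A common use of these two functions is comparing strings without regard to case, by converting both sides first. A minimal sketch, with made-up sample strings:

```shell
out=$(awk 'BEGIN {
    a = "MiXeD cAsE 123"; b = "mixed CASE 123"
    # Fold both strings to lowercase before comparing them.
    verdict = (tolower(a) == tolower(b)) ? "equal" : "different"
    print verdict
}')
printf '%s\n' "$out"   # prints: equal
```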

MATCHING THE NULL STRING

In awk, the ‘*’ operator can match the null string. This is particularly important for the sub(), gsub(), and gensub() functions. For example:

$ echo abc | awk '{ gsub(/m*/, "X"); print }'

XaXbXcX

Although this makes a certain amount of sense, it can be surprising.
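If the intent was to replace only actual runs of ‘m’, using ‘+’ instead of ‘*’ avoids the surprise, because ‘+’ requires at least one character. In this sketch, n holds the number of substitutions that gsub() reports:

```shell
out=$(echo abc | awk '{ n = gsub(/m+/, "X"); print n, $0 }')
printf '%s\n' "$out"   # prints: 0 abc
```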

More about ‘\’ and ‘&’ with sub(), gsub(), and gensub()

CAUTION

This subsubsection has been reported to cause headaches. You might want to skip it upon first reading.

When using sub(), gsub(), or gensub(), and trying to get literal backslashes and ampersands into the replacement text, you need to remember that there are several levels of escape processing going on.

First, there is the lexical level, which is when awk reads your program and builds an internal copy of it to execute. Then there is the runtime level, which is when awk actually scans the replacement string to determine what to generate.

At both levels, awk looks for a defined set of characters that can come after a backslash. At the lexical level, it looks for the escape sequences listed in Escape Sequences. Thus, for every ‘\’ that awk processes at the runtime level, you must type two backslashes at the lexical level. When a character that is not valid for an escape sequence follows the ‘\’, BWK awk and gawk both simply remove the initial ‘\’ and put the next character into the string. Thus, for example, "a\qb" is treated as "aqb".

At the runtime level, the various functions handle sequences of ‘\’ and ‘&’ differently. The situation is (sadly) somewhat complex. Historically, the sub() and gsub() functions treated the two-character sequence ‘\&’ specially; this sequence was replaced in the generated text with a single ‘&’. Any other ‘\’ within the replacement string that did not precede an ‘&’ was passed through unchanged. This is illustrated in Table 9-1.

Table 9-1. Historical escape sequence processing for sub() and gsub()

You type      sub() sees    sub() generates
--------      ----------    ---------------
\&            &             The matched text
\\&           \&            A literal ‘&’
\\\&          \&            A literal ‘&’
\\\\&         \\&           A literal ‘\&’
\\\\\&        \\&           A literal ‘\&’
\\\\\\&       \\\&          A literal ‘\\&’
\\q           \q            A literal ‘\q’

This table shows the lexical-level processing, where an odd number of backslashes becomes an even number at the runtime level, as well as the runtime processing done by sub(). (For the sake of simplicity, the rest of the following tables only show the case of even numbers of backslashes entered at the lexical level.)

The problem with the historical approach is that there is no way to get a literal ‘\’ followed by the matched text.

Several editions of the POSIX standard attempted to fix this problem but weren’t successful. The details are irrelevant at this point in time.

At one point, the gawk maintainer submitted proposed text for a revised standard that reverts to rules that correspond more closely to the original existing practice. The proposed rules have special cases that make it possible to produce a ‘\’ preceding the matched text. This is shown in Table 9-2.

Table 9-2. Gawk rules for sub() and backslash

You type      sub() sees    sub() generates
--------      ----------    ---------------
\\\\\\&       \\\&          A literal ‘\&’
\\\\&         \\&           A literal ‘\’, followed by the matched text
\\&           \&            A literal ‘&’
\\q           \q            A literal ‘\q’
\\\\          \\            \\

In a nutshell, at the runtime level, there are now three special sequences of characters (‘\\\&’, ‘\\&’, and ‘\&’) whereas historically there was only one. However, as in the historical case, any ‘\’ that is not part of one of these three sequences is not special and appears in the output literally.

gawk 3.0 and 3.1 follow these rules for sub() and gsub(). The POSIX standard took much longer to be revised than was expected. In addition, the gawk maintainer’s proposal was lost during the standardization process. The final rules are somewhat simpler. The results are similar except for one case.

The POSIX rules state that ‘\&’ in the replacement string produces a literal ‘&’, ‘\\’ produces a literal ‘\’, and ‘\’ followed by anything else is not special; the ‘\’ is placed straight into the output. These rules are presented in Table 9-3.

Table 9-3. POSIX rules for sub() and gsub()

You type      sub() sees    sub() generates
--------      ----------    ---------------
\\\\\\&       \\\&          A literal ‘\&’
\\\\&         \\&           A literal ‘\’, followed by the matched text
\\&           \&            A literal ‘&’
\\q           \q            A literal ‘\q’
\\\\          \\            \

The only case where the difference is noticeable is the last one: ‘\\\\’ is seen as ‘\\’ and produces ‘\’ instead of ‘\\’.

Starting with version 3.1.4, gawk followed the POSIX rules when --posix was specified (see Command-Line Options). Otherwise, it continued to follow the proposed rules, as that had been its behavior for many years.

When version 4.0.0 was released, the gawk maintainer made the POSIX rules the default, breaking well over a decade’s worth of backward compatibility.[47] Needless to say, this was a bad idea, and as of version 4.0.1, gawk resumed its historical behavior, and only follows the POSIX rules when --posix is given.
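Two cases come out the same under every set of rules shown in the tables: a bare ‘&’ stands for the matched text, and ‘\\&’ typed in the program yields a literal ‘&’. The following sketch checks just those two, so it should not depend on which rules your awk follows (the variable names are ours):

```shell
out=$(awk 'BEGIN {
    s = "abc"; sub(/b/, "[&]", s)      # & stands for the matched text
    t = "abc"; sub(/b/, "\\&", t)      # \\& in the source gives a literal &
    print s, t
}')
printf '%s\n' "$out"   # prints: a[b]c a&c
```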

The rules for gensub() are considerably simpler. At the runtime level, whenever gawk sees a ‘\’, if the following character is a digit, then the text that matched the corresponding parenthesized subexpression is placed in the generated output. Otherwise, no matter what character follows the ‘\’, it appears in the generated text and the ‘\’ does not, as shown in Table 9-4.

Table 9-4. Escape sequence processing for gensub()

You type      gensub() sees    gensub() generates
--------      -------------    ------------------
&             &                The matched text
\\&           \&               A literal ‘&’
\\\\          \\               A literal ‘\’
\\\\&         \\&              A literal ‘\’, then the matched text
\\\\\\&       \\\&             A literal ‘\&’
\\q           \q               A literal ‘q’

Because of the complexity of the lexical- and runtime-level processing and the special cases for sub() and gsub(), we recommend the use of gawk and gensub() when you have to do substitutions.

Input/Output Functions

The following functions relate to input/output (I/O). Optional parameters are enclosed in square brackets ([ ]):

close(filename [, how])

Close the file filename for input or output. Alternatively, the argument may be a shell command that was used for creating a coprocess, or for redirecting to or from a pipe; then the coprocess or pipe is closed. See Closing Input and Output Redirections for more information.

When closing a coprocess, it is occasionally useful to first close one end of the two-way pipe and then to close the other. This is done by providing a second argument to close(). This second argument (how) should be one of the two string values "to" or "from", indicating which end of the pipe to close. Case in the string does not matter. See Two-Way Communications with Another Process, which discusses this feature in more detail and gives an example.

Note that the second argument to close() is a gawk extension; it is not available in compatibility mode (see Command-Line Options).
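Here is a minimal sketch of closing one end of a coprocess. It assumes gawk and a sort utility on the search path; the data values are invented. Because sort cannot produce any output until it sees end-of-file, the "to" end is closed first:

```awk
BEGIN {
    cmd = "sort"
    print "pear"  |& cmd
    print "apple" |& cmd
    close(cmd, "to")              # send end-of-file to sort; keep the read end open
    while ((cmd |& getline line) > 0)
        print "sorted:", line
    close(cmd)                    # now close the coprocess completely
}
```

Without the first close(), the getline loop would wait forever for output that sort is not yet ready to write.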

fflush([filename])

Flush any buffered output associated with filename, which is either a file opened for writing or a shell command for redirecting output to a pipe or coprocess.

Many utility programs buffer their output (i.e., they save information to write to a disk file or the screen in memory until there is enough for it to be worthwhile to send the data to the output device). This is often more efficient than writing every little bit of information as soon as it is ready. However, sometimes it is necessary to force a program to flush its buffers (i.e., write the information to its destination, even if a buffer is not full). This is the purpose of the fflush() function—gawk also buffers its output, and the fflush() function forces gawk to flush its buffers.

Brian Kernighan added fflush() to his awk in April 1992. For two decades, it was a common extension. In December 2012, it was accepted for inclusion into the POSIX standard. See the Austin Group website.

POSIX standardizes fflush() as follows: if there is no argument, or if the argument is the null string (""), then awk flushes the buffers for all open output files and pipes.

NOTE

Prior to version 4.0.2, gawk would flush only the standard output if there was no argument, and flush all output files and pipes if the argument was the null string. This was changed in order to be compatible with Brian Kernighan’s awk, in the hope that standardizing this feature in POSIX would then be easier (which indeed proved to be the case).

With gawk, you can use ‘fflush("/dev/stdout")’ if you wish to flush only the standard output.

fflush() returns zero if the buffer is successfully flushed; otherwise, it returns a nonzero value. (gawk returns −1.) In the case where all buffers are flushed, the return value is zero only if all buffers were flushed successfully. Otherwise, it is −1, and gawk warns about the problem filename.

gawk also issues a warning message if you attempt to flush a file or pipe that was opened for reading (such as with getline), or if filename is not an open file, pipe, or coprocess. In such a case, fflush() returns −1, as well.
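The following sketch shows one way to use fflush() and check its return value. It assumes gawk and a cat utility; the variable name cmd and the message text are invented for the example:

```awk
BEGIN {
    cmd = "cat"
    print "via the pipe" | cmd

    # Force the line down the pipe now rather than when the pipe is closed.
    if (fflush(cmd) != 0)
        print "could not flush " cmd > "/dev/stderr"

    print "directly to standard output"
    fflush("/dev/stdout")         # gawk: flush only the standard output

    close(cmd)
}
```

The relative order of the two lines on the screen is not guaranteed, because cat runs as a separate process; fflush() controls only when gawk hands the data over.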

INTERACTIVE VERSUS NONINTERACTIVE BUFFERING

As a side point, buffering issues can be even more confusing if your program is interactive (i.e., communicating with a user sitting at a keyboard).[48]

Interactive programs generally line buffer their output (i.e., they write out every line). Noninteractive programs wait until they have a full buffer, which may be many lines of output. Here is an example of the difference:

$ awk '{ print $1 + $2 }'
1 1
2
2 3
5
Ctrl-d

Each line of output is printed immediately. Compare that behavior with this example:

$ awk '{ print $1 + $2 }' | cat
1 1
2 3
Ctrl-d
2
5

Here, no output is printed until after the Ctrl-d is typed, because it is all buffered and sent down the pipe to cat in one shot.

system(command)

Execute the operating system command command and then return to the awk program. Return command’s exit status.

For example, if the following fragment of code is put in your awk program:

END {
    system("date | mail -s 'awk run done' root")
}

the system administrator is sent mail when the awk program finishes processing input and begins its end-of-input processing.

Note that redirecting print or printf into a pipe is often enough to accomplish your task. If you need to run many commands, it is more efficient to simply print them down a pipeline to the shell:

while (more stuff to do)
    print command | "/bin/sh"
close("/bin/sh")

However, if your awk program is interactive, system() is useful for running large self-contained programs, such as a shell or an editor. Some operating systems cannot implement the system() function. system() causes a fatal error if it is not supported.

NOTE

When --sandbox is specified, the system() function is disabled (see Command-Line Options).
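As a small sketch of using the value that system() returns, the following fragment runs a command and tests its exit status. It assumes a POSIX-style shell in which ‘test -d /’ succeeds; the choice of command is arbitrary:

```awk
BEGIN {
    status = system("test -d /")      # the root directory should exist
    if (status == 0)
        print "command succeeded"
    else
        print "command failed with status " status
}
```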

CONTROLLING OUTPUT BUFFERING WITH SYSTEM()

The fflush() function provides explicit control over output buffering for individual files and pipes. However, its use is not portable to many older awk implementations. An alternative method to flush output buffers is to call system() with a null string as its argument:

system("") # flush output

gawk treats this use of the system() function as a special case and is smart enough not to run a shell (or other command interpreter) with the empty command. Therefore, with gawk, this idiom is not only useful, it is also efficient. Although this method should work with other awk implementations, it does not necessarily avoid starting an unnecessary shell. (Other implementations may only flush the buffer associated with the standard output and not necessarily all buffered output.)

If you think about what a programmer expects, it makes sense that system() should flush any pending output. The following program:

BEGIN {
    print "first print"
    system("echo system echo")
    print "second print"
}

must print:

first print
system echo
second print

and not:

system echo
first print
second print

If awk did not flush its buffers before calling system(), you would see the latter (undesirable) output.

Time Functions

awk programs are commonly used to process log files containing timestamp information, indicating when a particular log record was written. Many programs log their timestamps in the form returned by the time() system call, which is the number of seconds since a particular epoch. On POSIX-compliant systems, it is the number of seconds since 1970-01-01 00:00:00 UTC, not counting leap seconds. All known POSIX-compliant systems support timestamps from 0 through 2^31 − 1, which is sufficient to represent times through 2038-01-19 03:14:07 UTC. Many systems support a wider range of timestamps, including negative timestamps that represent times before the epoch.

In order to make it easier to process such log files and to produce useful reports, gawk provides the following functions for working with timestamps. They are gawk extensions; they are not specified in the POSIX standard.[49] However, recent versions of mawk (see Other Freely Available awk Implementations) also support these functions. Optional parameters are enclosed in square brackets ([ ]):

mktime(datespec)

Turn datespec into a timestamp in the same form as is returned by systime(). It is similar to the function of the same name in ISO C. The argument, datespec, is a string of the form "YYYY MM DD HH MM SS [DST]". The string consists of six or seven numbers representing, respectively, the full year including century, the month from 1 to 12, the day of the month from 1 to 31, the hour of the day from 0 to 23, the minute from 0 to 59, the second from 0 to 60,[50] and an optional daylight-savings flag.

The values of these numbers need not be within the ranges specified; for example, an hour of −1 means 1 hour before midnight. The origin-zero Gregorian calendar is assumed, with year 0 preceding year 1 and year −1 preceding year 0. The time is assumed to be in the local time zone. If the daylight-savings flag is positive, the time is assumed to be daylight savings time; if zero, the time is assumed to be standard time; and if negative (the default), mktime() attempts to determine whether daylight savings time is in effect for the specified time.

If datespec does not contain enough elements or if the resulting time is out of range, mktime() returns −1.

strftime([format [, timestamp [, utc-flag] ] ])

Format the time specified by timestamp based on the contents of the format string and return the result. It is similar to the function of the same name in ISO C. If utc-flag is present and is either nonzero or non-null, the value is formatted as UTC (Coordinated Universal Time, formerly GMT or Greenwich Mean Time). Otherwise, the value is formatted for the local time zone. The timestamp is in the same format as the value returned by the systime() function. If no timestamp argument is supplied, gawk uses the current time of day as the timestamp. Without a format argument, strftime() uses the value of PROCINFO["strftime"] as the format string (see Predefined Variables). The default string value is "%a %b %e %H:%M:%S %Z %Y". This format string produces output that is equivalent to that of the date utility. You can assign a new value to PROCINFO["strftime"] to change the default format; see the following list for the various format directives.

systime()

Return the current time as the number of seconds since the system epoch. On POSIX systems, this is the number of seconds since 1970-01-01 00:00:00 UTC, not counting leap seconds. It may be a different number on other systems.

The systime() function allows you to compare a timestamp from a log file with the current time of day. In particular, it is easy to determine how long ago a particular record was logged. It also allows you to produce log records using the “seconds since the epoch” format.

The mktime() function allows you to convert a textual representation of a date and time into a timestamp. This makes it easy to do before/after comparisons of dates and times, particularly when dealing with date and time data coming from an external source, such as a log file.

The strftime() function allows you to easily turn a timestamp into human-readable information. It is similar in nature to the sprintf() function (see String-Manipulation Functions), in that it copies nonformat specification characters verbatim to the returned string, while substituting date and time values for format specifications in the format string.
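The three functions are often used together. The following sketch, which assumes gawk, converts an invented log timestamp with mktime(), computes its age with systime(), and formats it with strftime(). The local-time results depend on your time zone, so only the last line has a fixed result:

```awk
BEGIN {
    # mktime() interprets its argument in the local time zone.
    then = mktime("2014 09 22 12 00 00")
    if (then == -1) {
        print "bad datespec" > "/dev/stderr"
    } else {
        age = systime() - then
        printf "record is %d days old\n", int(age / 86400)

        # Format the same instant in local time and in UTC (utc-flag = 1).
        print strftime("%Y-%m-%d %H:%M:%S", then)
        print strftime("%Y-%m-%d %H:%M:%S", then, 1)
    }

    # Timestamp 0 is the POSIX epoch.
    print strftime("%Y-%m-%d %H:%M:%S", 0, 1)    # 1970-01-01 00:00:00
}
```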

strftime() is guaranteed by the 1999 ISO C standard[51] to support the following date format specifications:

%a

The locale’s abbreviated weekday name.

%A

The locale’s full weekday name.

%b

The locale’s abbreviated month name.

%B

The locale’s full month name.

%c

The locale’s “appropriate” date and time representation. (This is ‘%a %b %e %T %Y’ in the "C" locale.)

%C

The century part of the current year. This is the year divided by 100 and truncated to the next lower integer.

%d

The day of the month as a decimal number (01–31).

%D

Equivalent to specifying ‘%m/%d/%y’.

%e

The day of the month, padded with a space if it is only one digit.

%F

Equivalent to specifying ‘%Y-%m-%d’. This is the ISO 8601 date format.

%g

The year modulo 100 of the ISO 8601 week number, as a decimal number (00–99). For example, January 1, 2012, is in week 52 of 2011. Thus, the year of its ISO 8601 week number is 2011, even though its year is 2012. Similarly, December 31, 2012, is in week 1 of 2013. Thus, the year of its ISO week number is 2013, even though its year is 2012.

%G

The full year of the ISO week number, as a decimal number.

%h

Equivalent to ‘%b’.

%H

The hour (24-hour clock) as a decimal number (00–23).

%I

The hour (12-hour clock) as a decimal number (01–12).

%j

The day of the year as a decimal number (001–366).

%m

The month as a decimal number (01–12).

%M

The minute as a decimal number (00–59).

%n

A newline character (ASCII LF).

%p

The locale’s equivalent of the AM/PM designations associated with a 12-hour clock.

%r

The locale’s 12-hour clock time. (This is ‘%I:%M:%S %p’ in the "C" locale.)

%R

Equivalent to specifying ‘%H:%M’.

%S

The second as a decimal number (00–60).

%t

A TAB character.

%T

Equivalent to specifying ‘%H:%M:%S’.

%u

The weekday as a decimal number (1–7). Monday is day one.

%U

The week number of the year (with the first Sunday as the first day of week one) as a decimal number (00–53).

%V

The week number of the year (with Monday as the first day of the week) as a decimal number (01–53). The method for determining the week number is as specified by ISO 8601. (To wit: if the week containing January 1 has four or more days in the new year, then it is week one; otherwise it is the last week, 52 or 53, of the previous year and the next week is week one.)

%w

The weekday as a decimal number (0–6). Sunday is day zero.

%W

The week number of the year (with the first Monday as the first day of week one) as a decimal number (00–53).

%x

The locale’s “appropriate” date representation. (This is ‘%m/%d/%y’ in the "C" locale.)

%X

The locale’s “appropriate” time representation. (This is ‘%T’ in the "C" locale.)

%y

The year modulo 100 as a decimal number (00–99).

%Y

The full year as a decimal number (e.g., 2015).

%z

The time zone offset in ‘+HHMM’ format (e.g., the format necessary to produce RFC 822/RFC 1036 date headers).

%Z

The time zone name or abbreviation; no characters if no time zone is determinable.

%Ec %EC %Ex %EX %Ey %EY %Od %Oe %OH
%OI %Om %OM %OS %Ou %OU %OV %Ow %OW %Oy

“Alternative representations” for the specifications that use only the second letter (‘%c’, ‘%C’, and so on).[52] (These facilitate compliance with the POSIX date utility.)

%%

A literal ‘%’.

If a conversion specifier is not one of those just listed, the behavior is undefined.[53]
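To see a few of these specifications at work, the following sketch formats a fixed timestamp in UTC, so that the results do not depend on the local time zone. It assumes gawk and a system strftime() that supports the 1999 ISO C specifications; the timestamp is chosen to show how the ISO 8601 week-based year (‘%G’) can differ from the calendar year (‘%Y’):

```awk
BEGIN {
    ts = 1325376000               # 2012-01-01 00:00:00 UTC, a Sunday

    print strftime("%F %T", ts, 1)            # 2012-01-01 00:00:00
    print strftime("%G-W%V-%u", ts, 1)        # 2011-W52-7
    print strftime("%Y day %j, %y", ts, 1)    # 2012 day 001, 12
}
```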

For systems that are not yet fully standards-compliant, gawk supplies a copy of strftime() from the GNU C Library. It supports all of the just-listed format specifications. If that version is used to compile gawk (see Appendix B), then the following additional format specifications are available:

%k

The hour (24-hour clock) as a decimal number (0–23). Single-digit numbers are padded with a space.

%l

The hour (12-hour clock) as a decimal number (1–12). Single-digit numbers are padded with a space.

%s

The time as a decimal timestamp in seconds since the epoch.

Additionally, the alternative representations are recognized but their normal representations are used.

The following example is an awk implementation of the POSIX date utility. Normally, the date utility prints the current date and time of day in a well-known format. However, if you provide an argument to it that begins with a ‘+’, date copies nonformat specifier characters to the standard output and interprets the current time according to the format specifiers in the string. For example:

$ date '+Today is %A, %B %d, %Y.'
Today is Monday, September 22, 2014.

Here is the gawk version of the date utility. It has a shell “wrapper” to handle the -u option, which requires that date run as if the time zone is set to UTC:

#! /bin/sh
#
# date --- approximate the POSIX 'date' command

case $1 in
-u)  TZ=UTC0     # use UTC
     export TZ
     shift ;;
esac

gawk 'BEGIN {
    format = PROCINFO["strftime"]
    exitval = 0

    if (ARGC > 2)
        exitval = 1
    else if (ARGC == 2) {
        format = ARGV[1]
        if (format ~ /^\+/)
            format = substr(format, 2)   # remove leading +
    }
    print strftime(format)
    exit exitval
}' "$@"

Bit-Manipulation Functions

I can explain it for you, but I can’t understand it for you.

—Anonymous

Many languages provide the ability to perform bitwise operations on two integer numbers. In other words, the operation is performed on each successive pair of bits in the operands. Three common operations are bitwise AND, OR, and XOR. The operations are described in Table 9-5.

Table 9-5. Bitwise operations

                Bit operator
            AND       OR        XOR
Operands    0    1    0    1    0    1
--------    ----------------------------
0           0    0    0    1    0    1
1           0    1    1    1    1    0

As you can see, the result of an AND operation is 1 only when both bits are 1. The result of an OR operation is 1 if either bit is 1. The result of an XOR operation is 1 if either bit is 1, but not both. The next operation is the complement; the complement of 1 is 0 and the complement of 0 is 1. Thus, this operation “flips” all the bits of a given value.

Finally, two other common operations are to shift the bits left or right. For example, if you have a bit string ‘10111001’ and you shift it right by three bits, you end up with ‘00010111’.[54] If you start over again with ‘10111001’ and shift it left by three bits, you end up with ‘11001000’. The following list describes gawk’s built-in functions that implement the bitwise operations. Optional parameters are enclosed in square brackets ([ ]):

and(v1, v2 [, …])

Return the bitwise AND of the arguments. There must be at least two.

compl(val)

Return the bitwise complement of val.

lshift(val, count)

Return the value of val, shifted left by count bits.

or(v1, v2 [, …])

Return the bitwise OR of the arguments. There must be at least two.

rshift(val, count)

Return the value of val, shifted right by count bits.

xor(v1, v2 [, …])

Return the bitwise XOR of the arguments. There must be at least two.

For all of these functions, first the double-precision floating-point value is converted to the widest C unsigned integer type, then the bitwise operation is performed. If the result cannot be represented exactly as a C double, leading nonzero bits are removed one by one until it can be represented exactly. The result is then converted back into a C double. (If you don’t understand this paragraph, don’t worry about it.)

Here is a user-defined function (see User-Defined Functions) that illustrates the use of these functions:

# bits2str --- turn a byte into readable ones and zeros

function bits2str(bits,        data, mask)
{
    if (bits == 0)
        return "0"

    mask = 1
    for (; bits != 0; bits = rshift(bits, 1))
        data = (and(bits, mask) ? "1" : "0") data

    while ((length(data) % 8) != 0)
        data = "0" data

    return data
}

BEGIN {
    printf "123 = %s\n", bits2str(123)
    printf "0123 = %s\n", bits2str(0123)
    printf "0x99 = %s\n", bits2str(0x99)
    comp = compl(0x99)
    printf "compl(0x99) = %#x = %s\n", comp, bits2str(comp)
    shift = lshift(0x99, 2)
    printf "lshift(0x99, 2) = %#x = %s\n", shift, bits2str(shift)
    shift = rshift(0x99, 2)
    printf "rshift(0x99, 2) = %#x = %s\n", shift, bits2str(shift)
}

This program produces the following output when run:

$ gawk -f testbits.awk
123 = 01111011
0123 = 01010011
0x99 = 10011001
compl(0x99) = 0xffffff66 = 11111111111111111111111101100110
lshift(0x99, 2) = 0x264 = 0000001001100100
rshift(0x99, 2) = 0x26 = 00100110

The bits2str() function turns a binary number into a string. Initializing mask to one creates a binary value where the rightmost bit is set to one. Using this mask, the function repeatedly checks the rightmost bit. ANDing the mask with the value indicates whether the rightmost bit is one or not. If so, a "1" is concatenated onto the front of the string. Otherwise, a "0" is added. The value is then shifted right by one bit and the loop continues until there are no more one bits.

If the initial value is zero, it returns a simple "0". Otherwise, at the end, it pads the value with zeros to represent multiples of 8-bit quantities. This is typical in modern computers.

The main code in the BEGIN rule shows the difference between the decimal and octal values for the same numbers (see Octal and hexadecimal numbers), and then demonstrates the results of the compl(), lshift(), and rshift() functions.
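A common use of these functions is to treat a number as a set of flags. The following sketch assumes gawk; the flag names and values are hypothetical. It also repeats the shift example given earlier on the bit string ‘10111001’ (0xB9). Because lshift() does not limit its result to eight bits, the result is masked with 0xFF to keep only the low byte:

```awk
BEGIN {
    READ = 4; WRITE = 2; EXEC = 1        # hypothetical flag values

    perms = or(READ, EXEC)               # 5
    printf "perms = %d\n", perms
    printf "writable? %s\n", (and(perms, WRITE) ? "yes" : "no")   # no

    perms = xor(perms, WRITE)            # toggle WRITE on: 7
    printf "perms = %d\n", perms

    printf "%d\n", rshift(0xB9, 3)               # 23  (00010111)
    printf "%d\n", and(lshift(0xB9, 3), 0xFF)    # 200 (11001000)
}
```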

Getting Type Information

gawk provides a single function that lets you distinguish an array from a scalar variable. This is necessary for writing code that traverses every element of an array of arrays (see Arrays of Arrays).

isarray(x)

Return a true value if x is an array. Otherwise, return false.

isarray() is meant for use in two circumstances. The first is when traversing a multidimensional array: you can test if an element is itself an array or not. The second is inside the body of a user-defined function (not discussed yet; see User-Defined Functions), to test if a parameter is an array or not.

NOTE

Using isarray() at the global level to test variables makes no sense. Because you are the one writing the program, you are supposed to know if your variables are arrays or not. And in fact, due to the way gawk works, if you pass the name of a variable that has not been previously used to isarray(), gawk ends up turning it into a scalar.
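As a sketch of the first circumstance, the following hypothetical function uses isarray() to walk an array of arrays and count its scalar elements. It assumes gawk; the function name and the sample data are invented:

```awk
# count_leaves --- count the scalar elements in an array of arrays

function count_leaves(arr,    key, n)
{
    for (key in arr) {
        if (isarray(arr[key]))
            n += count_leaves(arr[key])   # descend into the subarray
        else
            n++                           # a scalar element
    }
    return n + 0
}

BEGIN {
    data["a"] = 1
    data["b"]["x"] = 2
    data["b"]["y"]["z"] = 3
    print count_leaves(data)      # 3
}
```

The count does not depend on the order in which ‘for (key in arr)’ visits the elements, which awk leaves unspecified.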

String-Translation Functions

gawk provides facilities for internationalizing awk programs. These include the functions described in the following list. The descriptions here are purposely brief. See Chapter 13 for the full story. Optional parameters are enclosed in square brackets ([ ]):

bindtextdomain(directory [, domain])

Set the directory in which gawk will look for message translation files, in case they will not or cannot be placed in the “standard” locations (e.g., during testing). It returns the directory in which domain is “bound.”

The default domain is the value of TEXTDOMAIN. If directory is the null string (""), then bindtextdomain() returns the current binding for the given domain.

dcgettext(string [, domain [, category] ])

Return the translation of string in text domain domain for locale category category. The default value for domain is the current value of TEXTDOMAIN. The default value for category is "LC_MESSAGES".

dcngettext(string1, string2, number [, domain [, category] ])

Return the plural form used for number of the translation of string1 and string2 in text domain domain for locale category category. string1 is the English singular variant of a message, and string2 is the English plural variant of the same message. The default value for domain is the current value of TEXTDOMAIN. The default value for category is "LC_MESSAGES".

User-Defined Functions

Complicated awk programs can often be simplified by defining your own functions. User-defined functions can be called just like built-in ones (see Function Calls), but it is up to you to define them (i.e., to tell awk what they should do).

Function Definition Syntax

It’s entirely fair to say that the awk syntax for local variable definitions is appallingly awful.

—Brian Kernighan

Definitions of functions can appear anywhere between the rules of an awk program. Thus, the general form of an awk program is extended to include sequences of rules and user-defined function definitions. There is no need to put the definition of a function before all uses of the function. This is because awk reads the entire program before starting to execute any of it.

The definition of a function named name looks like this:

function name([parameter-list])
{
body-of-function
}

Here, name is the name of the function to define. A valid function name is like a valid variable name: a sequence of letters, digits, and underscores that doesn’t start with a digit. Here too, only the 52 upper- and lowercase English letters may be used in a function name. Within a single awk program, any particular name can only be used as a variable, array, or function.

parameter-list is an optional list of the function’s arguments and local variable names, separated by commas. When the function is called, the argument names are used to hold the argument values given in the call.

A function cannot have two parameters with the same name, nor may it have a parameter with the same name as the function itself.

CAUTION

According to the POSIX standard, function parameters cannot have the same name as one of the special predefined variables (see Predefined Variables), nor may a function parameter have the same name as another function.

Not all versions of awk enforce these restrictions. gawk always enforces the first restriction. With --posix (see Command-Line Options), it also enforces the second restriction.

Local variables act like the empty string if referenced where a string value is required, and like zero if referenced where a numeric value is required. This is the same as the behavior of regular variables that have never been assigned a value. (There is more to understand about local variables; see Functions and Their Effects on Variable Typing.)

The body-of-function consists of awk statements. It is the most important part of the definition, because it says what the function should actually do. The argument names exist to give the body a way to talk about the arguments; local variables exist to give the body places to keep temporary values.

Argument names are not distinguished syntactically from local variable names. Instead, the number of arguments supplied when the function is called determines how many argument variables there are. Thus, if three argument values are given, the first three names in parameter-list are arguments and the rest are local variables.

It follows that if the number of arguments is not the same in all calls to the function, some of the names in parameter-list may be arguments on some occasions and local variables on others. Another way to think of this is that omitted arguments default to the null string.

Usually when you write a function, you know how many names you intend to use for arguments and how many you intend to use as local variables. It is conventional to place some extra space between the arguments and the local variables, in order to document how your function is supposed to be used.
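The following sketch pulls these points together. The function repeat() is a hypothetical helper, not a built-in: str and n are its arguments, i and out (after the extra space) are local variables, and an omitted n is detected by comparing it with the null string:

```awk
# repeat --- return str repeated n times; n defaults to 2 when omitted

function repeat(str, n,    i, out)
{
    if (n == "")          # an omitted argument looks like the null string
        n = 2
    for (i = 0; i < n; i++)
        out = out str
    return out
}

BEGIN {
    print repeat("ab", 3)     # ababab
    print repeat("ab")        # abab
}
```

Because i and out are named in the parameter list, they start out empty on every call and do not disturb any global variables of the same names.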

During execution of the function body, the arguments and local variable values hide, or shadow, any variables of the same names used in the rest of the program. The shadowed variables are not accessible in the function definition, because there is no way to name them while their names have been taken away for the arguments and local variables. All other variables used in the awk program can be referenced or set normally in the function’s body.

The arguments and local variables last only as long as the function body is executing. Once the body finishes, you can once again access the variables that were shadowed while the function was running.

The function body can contain expressions that call functions. They can even call this function, either directly or by way of another function. When this happens, we say the function is recursive. The act of a function calling itself is called recursion.

All the built-in functions return a value to their caller. User-defined functions can do so also, using the return statement, which is described in detail in The return Statement. Many of the subsequent examples in this section use the return statement.

In many awk implementations, including gawk, the keyword function may be abbreviated func. (c.e.) However, POSIX only specifies the use of the keyword function. This actually has some practical implications. If gawk is in POSIX-compatibility mode (see Command-Line Options), then the following statement does not define a function:

func foo() { a = sqrt($1) ; print a }

Instead, it defines a rule that, for each record, concatenates the value of the variable ‘func’ with the return value of the function ‘foo.’ If the resulting string is non-null, the action is executed. This is probably not what is desired. (awk accepts this input as syntactically valid, because functions may be used before they are defined in awk programs.[55])

To ensure that your awk programs are portable, always use the keyword function when defining a function.

Function Definition Examples

Here is an example of a user-defined function, called myprint(), that takes a number and prints it in a specific format:

function myprint(num)
{
    printf "%6.3g\n", num
}

To illustrate, here is an awk rule that uses our myprint() function:

$3 > 0 { myprint($3) }

This program prints, in our special format, all the third fields that contain a positive number in our input. Therefore, when given the following input:

 1.2   3.4    5.6   7.8
 9.10 11.12 -13.14 15.16
17.18 19.20  21.22 23.24

this program, using our function to format the results, prints:

   5.6
  21.2

This function deletes all the elements in an array (recall that the extra whitespace signifies the start of the local variable list):

function delarray(a,    i)
{
    for (i in a)
        delete a[i]
}

When working with arrays, it is often necessary to delete all the elements in an array and start over with a new list of elements (see The delete Statement). Instead of having to repeat this loop everywhere that you need to clear out an array, your program can just call delarray(). (This guarantees portability. The use of ‘delete array’ to delete the contents of an entire array is a relatively recent[56] addition to the POSIX standard.)

The following is an example of a recursive function. It takes a string as an input parameter and returns the string in reverse order. Recursive functions must always have a test that stops the recursion. In this case, the recursion terminates when the input string is already empty:

function rev(str)
{
    if (str == "")
        return ""

    return (rev(substr(str, 2)) substr(str, 1, 1))
}

If this function is in a file named rev.awk, it can be tested this way:

$ echo "Don't Panic!" |
> gawk -e '{ print rev($0) }' -f rev.awk
!cinaP t'noD

The C ctime() function takes a timestamp and returns it as a string, formatted in a well-known fashion. The following example uses the built-in strftime() function (see Time Functions) to create an awk version of ctime():

# ctime.awk
#
# awk version of C ctime(3) function

function ctime(ts,    format)
{
    format = "%a %b %e %H:%M:%S %Z %Y"

    if (ts == 0)
        ts = systime()       # use current time as default
    return strftime(format, ts)
}

You might think that ctime() could use PROCINFO["strftime"] for its format string. That would be a mistake, because ctime() is supposed to return the time formatted in a standard fashion, and user-level code could have changed PROCINFO["strftime"].

Calling User-Defined Functions

Calling a function means causing the function to run and do its job. A function call is an expression and its value is the value returned by the function.

Writing a function call

A function call consists of the function name followed by the arguments in parentheses. awk expressions are what you write in the call for the arguments. Each time the call is executed, these expressions are evaluated, and the values become the actual arguments. For example, here is a call to foo() with three arguments (the first being a string concatenation):

foo(x y, "lose", 4 * z)

CAUTION

Whitespace characters (spaces and TABs) are not allowed between the function name and the opening parenthesis of the argument list. If you write whitespace by mistake, awk might think that you mean to concatenate a variable with an expression in parentheses. However, it notices that you used a function name and not a variable name, and reports an error.

Controlling variable scope

Unlike in many languages, there is no way to make a variable local to a { … } block in awk, but you can make a variable local to a function. It is good practice to do so whenever a variable is needed only in that function.

To make a variable local to a function, simply declare the variable as an argument after the actual function arguments (see Function Definition Syntax). Look at the following example, where variable i is a global variable used by both functions foo() and bar():

function bar()
{
    for (i = 0; i < 3; i++)
        print "bar's i=" i
}

function foo(j)
{
    i = j + 1
    print "foo's i=" i
    bar()
    print "foo's i=" i
}

BEGIN {
    i = 10
    print "top's i=" i
    foo(0)
    print "top's i=" i
}

Running this script produces the following, because the i in functions foo() and bar() and at the top level refer to the same variable instance:

top's i=10
foo's i=1
bar's i=0
bar's i=1
bar's i=2
foo's i=3
top's i=3

If you want i to be local to both foo() and bar(), do as follows (the extra space before i is a coding convention to indicate that i is a local variable, not an argument):

function bar(    i)
{
    for (i = 0; i < 3; i++)
        print "bar's i=" i
}

function foo(j,    i)
{
    i = j + 1
    print "foo's i=" i
    bar()
    print "foo's i=" i
}

BEGIN {
    i = 10
    print "top's i=" i
    foo(0)
    print "top's i=" i
}

Running the corrected script produces the following:

top's i=10
foo's i=1
bar's i=0
bar's i=1
bar's i=2
foo's i=1
top's i=10

Besides scalar values (strings and numbers), you may also have local arrays. By using a parameter name as an array, awk treats it as an array, and it is local to the function. In addition, recursive calls create new arrays. Consider this example:

function some_func(p1,      a)
{
    if (p1++ > 3)
        return

    a[p1] = p1

    some_func(p1)

    printf("At level %d, index %d %s found in a\n",
         p1, (p1 - 1), (p1 - 1) in a ? "is" : "is not")
    printf("At level %d, index %d %s found in a\n",
         p1, p1, p1 in a ? "is" : "is not")
    print ""
}

BEGIN {
    some_func(1)
}

When run, this program produces the following output:

At level 4, index 3 is not found in a
At level 4, index 4 is found in a

At level 3, index 2 is not found in a
At level 3, index 3 is found in a

At level 2, index 1 is not found in a
At level 2, index 2 is found in a

Passing function arguments by value or by reference

In awk, when you declare a function, there is no way to declare explicitly whether the arguments are passed by value or by reference.

Instead, the passing convention is determined at runtime when the function is called, according to the following rule: if the argument is an array variable, then it is passed by reference. Otherwise, the argument is passed by value.

Passing an argument by value means that when a function is called, it is given a copy of the value of this argument. The caller may use a variable as the expression for the argument, but the called function does not know this—it only knows what value the argument had. For example, if you write the following code:

foo = "bar"
z = myfunc(foo)

then you should not think of the argument to myfunc() as being “the variable foo.” Instead, think of the argument as the string value "bar". If the function myfunc() alters the values of its local variables, this has no effect on any other variables. Thus, if myfunc() does this:

function myfunc(str)
{
    print str
    str = "zzz"
    print str
}

to change its first argument variable str, it does not change the value of foo in the caller. The role of foo in calling myfunc() ended when its value ("bar") was computed. If str also exists outside of myfunc(), the function body cannot alter this outer value, because it is shadowed during the execution of myfunc() and cannot be seen or changed from there.

However, when arrays are the parameters to functions, they are not copied. Instead, the array itself is made available for direct manipulation by the function. This is usually termed call by reference. Changes made to an array parameter inside the body of a function are visible outside that function.

NOTE

Changing an array parameter inside a function can be very dangerous if you do not watch what you are doing. For example:

function changeit(array, ind, nvalue)
{
    array[ind] = nvalue
}

BEGIN {
    a[1] = 1; a[2] = 2; a[3] = 3
    changeit(a, 2, "two")
    printf "a[1] = %s, a[2] = %s, a[3] = %s\n",
            a[1], a[2], a[3]
}

prints ‘a[1] = 1, a[2] = two, a[3] = 3’, because changeit() stores "two" in the second element of a.
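Passing an array by reference is also a convenient way to hand several results back to the caller. The following is a minimal sketch rather than part of the original text; the function name minmax() and the result indices "min" and "max" are hypothetical names chosen for illustration:

```awk
# minmax --- store the smallest and largest element of vec in result
# (minmax, "min", and "max" are illustrative names, not from the text)
function minmax(vec, result,    i, first)
{
    split("", result)       # portable idiom to clear the result array
    first = 1
    for (i in vec) {
        if (first || vec[i] < result["min"])
            result["min"] = vec[i]
        if (first || vec[i] > result["max"])
            result["max"] = vec[i]
        first = 0
    }
}

BEGIN {
    v[1] = 7; v[2] = 3; v[3] = 12
    minmax(v, r)            # r is filled in by the function
    print r["min"], r["max"]
}
```

This prints ‘3 12’, because the changes that minmax() makes to result are visible in the caller's array r.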

Some awk implementations allow you to call a function that has not been defined. They only report a problem at runtime, when the program actually tries to call the function. For example:

BEGIN {
    if (0)
        foo()
    else
        bar()
}
function bar() { … }
# note that `foo' is not defined

Because the ‘if’ statement will never be true, it is not really a problem that foo() has not been defined. Usually, though, it is a problem if a program calls an undefined function.

If --lint is specified (see Command-Line Options), gawk reports calls to undefined functions.

Some awk implementations generate a runtime error if you use either the next statement or the nextfile statement (see The next Statement, and The nextfile Statement) inside a user-defined function. gawk does not have this limitation.

The return Statement

As seen in several earlier examples, the body of a user-defined function can contain a return statement. This statement returns control to the calling part of the awk program. It can also be used to return a value for use in the rest of the awk program. It looks like this:

return [expression]

The expression part is optional. Due most likely to an oversight, POSIX does not define what the return value is if you omit the expression. Technically speaking, this makes the returned value undefined, and therefore unpredictable. In practice, though, all versions of awk simply return the null string, which acts like zero if used in a numeric context.

A return statement without an expression is assumed at the end of every function definition. So, if control reaches the end of the function body, then technically the function returns an unpredictable value. In practice, it returns the empty string. awk does not warn you if you use the return value of such a function.

Sometimes, you want to write a function for what it does, not for what it returns. Such a function corresponds to a void function in C, C++, or Java, or to a procedure in Ada. Thus, it may be appropriate to not return any value; simply bear in mind that you should not be using the return value of such a function.
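As a small illustration of the behavior described above, the following sketch calls a function that falls off the end of its body and then inspects the result. The function name log_msg() is a hypothetical example, and the output reflects what awk implementations do in practice rather than anything POSIX guarantees:

```awk
# log_msg --- print a message; there is no return statement
# (log_msg is an illustrative name, not from the text)
function log_msg(msg)
{
    print "log: " msg
}

BEGIN {
    result = log_msg("starting")
    # In practice the value is the null string, so its length
    # is zero and it acts like zero in a numeric context.
    print "length:", length(result)
    print "as number:", result + 0
}
```

In practice this prints ‘log: starting’, then ‘length: 0’ and ‘as number: 0’. Code that relies on this is still depending on behavior that POSIX leaves undefined.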

The following is an example of a user-defined function that returns a value for the largest number among the elements of an array:

function maxelt(vec,   i, ret)
{
    for (i in vec) {
        if (ret == "" || vec[i] > ret)
            ret = vec[i]
    }
    return ret
}

You call maxelt() with one argument, which is an array name. The local variables i and ret are not intended to be arguments; there is nothing to stop you from passing more than one argument to maxelt() but the results would be strange. The extra space before i in the function parameter list indicates that i and ret are local variables. You should follow this convention when defining functions.

The following program uses the maxelt() function. It loads an array, calls maxelt(), and then reports the maximum number in that array:

function maxelt(vec,   i, ret)
{
    for (i in vec) {
        if (ret == "" || vec[i] > ret)
            ret = vec[i]
    }
    return ret
}

# Load all fields of each record into nums.
{
    for(i = 1; i <= NF; i++)
         nums[NR, i] = $i
}

END {
    print maxelt(nums)
}

Given the following input:

 1 5 23 8 16
44 3 5 2 8 26
256 291 1396 2962 100
-6 467 998 1101
99385 11 0 225

the program reports (predictably) that 99,385 is the largest value in the array.
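The comparison in maxelt() is numeric here because the values come from input fields that look like numbers. If an array is filled from string constants instead, awk compares the elements as strings, so that "9" is considered greater than "10". The following is a sketch of one way to force a numeric comparison, using the same ‘+ 0’ idiom that appears later in this chapter. The name maxelt_num() and the extra local variable seen are assumptions made for this example:

```awk
# maxelt_num --- like maxelt(), but always compares numerically
# (maxelt_num and seen are illustrative names, not from the text)
function maxelt_num(vec,   i, ret, seen)
{
    for (i in vec) {
        if (seen == 0 || vec[i] + 0 > ret) {
            ret = vec[i] + 0    # adding zero forces a numeric value
            seen = 1
        }
    }
    return ret
}

BEGIN {
    v[1] = "10"; v[2] = "9"; v[3] = "-3"    # string constants
    print maxelt_num(v)
}
```

This prints ‘10’. The separate seen flag replaces the test against the null string, so that the function does not have to compare a number with "".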

Functions and Their Effects on Variable Typing

awk is a very fluid language. It is possible that awk can’t tell if an identifier represents a scalar variable or an array until runtime. Here is an annotated sample program:

function foo(a)
{
    a[1] = 1    # parameter is an array
}

BEGIN {
    b = 1
    foo(b)    # invalid: fatal type mismatch

    foo(x)    # x uninitialized, becomes an array dynamically
    x = 1     # now not allowed, runtime error
}

In this example, the first call to foo() generates a fatal error, so awk will not report the second error. If you comment out that call, though, then awk does report the second error.

Usually, such things aren’t a big issue, but it’s worth being aware of them.
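When a function must cope with either kind of argument, gawk's isarray() function can test a parameter before it is used. The following gawk-specific sketch shows the idea; the function name kind() is a hypothetical helper, not something from the original text:

```awk
# kind --- report whether the argument is an array or a scalar
# (kind is an illustrative name; isarray() is a gawk extension)
function kind(v)
{
    return isarray(v) ? "array" : "scalar"
}

BEGIN {
    b = 1
    a[1] = 1
    print "a:", kind(a)
    print "b:", kind(b)
}
```

With gawk, this prints ‘a: array’ followed by ‘b: scalar’. A function can use such a test to avoid the fatal type mismatch shown earlier.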

Indirect Function Calls

This section describes an advanced, gawk-specific extension.

Often, you may wish to defer the choice of function to call until runtime. For example, you may have different kinds of records, each of which should be processed differently.

Normally, you would have to use a series of if-else statements to decide which function to call. By using indirect function calls, you can specify the name of the function to call as a string variable, and then call the function. Let’s look at an example.

Suppose you have a file with your test scores for the classes you are taking, and you wish to get the sum and the average of your test scores. The first field is the class name. The following fields are the functions to call to process the data, up to a “marker” field ‘data:’. Following the marker, to the end of the record, are the various numeric test scores.

Here is the initial file:

Biology_101 sum average data: 87.0 92.4 78.5 94.9
Chemistry_305 sum average data: 75.2 98.3 94.7 88.2
English_401 sum average data: 100.0 95.6 87.1 93.4

To process the data, you might write initially:

{
    class = $1
    for (i = 2; $i != "data:"; i++) {
        if ($i == "sum")
            sum()    # processes the whole record
        else if ($i == "average")
            average()
        …            # and so on
    }
}

This style of programming works, but can be awkward. With indirect function calls, you tell gawk to use the value of a variable as the name of the function to call.

The syntax is similar to that of a regular function call: an identifier immediately followed by an opening parenthesis, any arguments, and then a closing parenthesis, with the addition of a leading ‘@’ character:

the_func = "sum"
result = @the_func()   # calls the sum() function

Here is a full program that processes the previously shown data, using indirect function calls:

# indirectcall.awk --- Demonstrate indirect function calls

# average --- return the average of the values in fields $first - $last

function average(first, last,   sum, i)
{
    sum = 0;
    for (i = first; i <= last; i++)
        sum += $i

    return sum / (last - first + 1)
}

# sum --- return the sum of the values in fields $first - $last

function sum(first, last,   ret, i)
{
    ret = 0;
    for (i = first; i <= last; i++)
        ret += $i

    return ret
}

These two functions expect to work on fields; thus, the parameters first and last indicate where in the fields to start and end. Otherwise, they perform the expected computations and are not unusual:

# For each record, print the class name and the requested statistics
{
    class_name = $1
    gsub(/_/, " ", class_name)  # Replace _ with spaces

    # find start
    for (i = 1; i <= NF; i++) {
        if ($i == "data:") {
            start = i + 1
            break
        }
    }

    printf("%s:\n", class_name)
    for (i = 2; $i != "data:"; i++) {
        the_function = $i
        printf("\t%s: <%s>\n", $i, @the_function(start, NF) "")
    }
    print ""
}

This is the main processing for each record. It prints the class name (with underscores replaced with spaces). It then finds the start of the actual data, saving it in start. The last part of the code loops through each function name (from $2 up to the marker, ‘data:’), calling the function named by the field. The indirect function call itself occurs as a parameter in the call to printf. (The printf format string uses ‘%s’ as the format specifier so that we can use functions that return strings, as well as numbers. Note that the result from the indirect call is concatenated with the empty string, in order to force it to be a string value.)

Here is the result of running the program:

$ gawk -f indirectcall.awk class_data1
Biology 101:
    sum: <352.8>
    average: <88.2>

Chemistry 305:
    sum: <356.4>
    average: <89.1>

English 401:
    sum: <376.1>
    average: <94.025>
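The program above trusts that every field before the marker names a real function. As far as we are aware, gawk stops with a fatal error when an indirect call names a function that does not exist, so it can be worthwhile to check the name first. The following sketch is one possible approach and is not part of the original program; the allowed array and the dispatch() function are hypothetical names, and sum() is a simplified copy of the function shown earlier:

```awk
# sum --- return the sum of the values in fields $first - $last
function sum(first, last,   ret, i)
{
    for (i = first; i <= last; i++)
        ret += $i
    return ret + 0
}

# dispatch --- call the named function only if it is listed in the
# global array allowed (dispatch and allowed are illustrative names)
function dispatch(name, first, last)
{
    if (! (name in allowed))
        return "unknown function"
    return @name(first, last)
}

BEGIN {
    allowed["sum"] = 1      # functions that may be called indirectly
    $0 = "10 20 30"         # sample record for the demonstration
    print dispatch("sum", 1, NF)
    print dispatch("median", 1, NF)
}
```

With gawk, this prints ‘60’ and then ‘unknown function’. The second request is rejected before the indirect call is ever attempted.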

The ability to use indirect function calls is more powerful than you may think at first. The C and C++ languages provide “function pointers,” which are a mechanism for calling a function chosen at runtime. One of the most well-known uses of this ability is the C qsort() function, which sorts an array using the famous “quicksort” algorithm (see the Wikipedia article for more information). To use this function, you supply a pointer to a comparison function. This mechanism allows you to sort arbitrary data in an arbitrary fashion.

We can do something similar using gawk, like this:

# quicksort.awk --- Quicksort algorithm, with user-supplied
#                   comparison function

# quicksort --- C.A.R. Hoare's quicksort algorithm. See Wikipedia
#               or almost any algorithms or computer science text.

function quicksort(data, left, right, less_than,    i, last)
{
    if (left >= right)  # do nothing if array contains fewer
        return          # than two elements

    quicksort_swap(data, left, int((left + right) / 2))
    last = left
    for (i = left + 1; i <= right; i++)
        if (@less_than(data[i], data[left]))
            quicksort_swap(data, ++last, i)
    quicksort_swap(data, left, last)
    quicksort(data, left, last - 1, less_than)
    quicksort(data, last + 1, right, less_than)
}

# quicksort_swap --- helper function for quicksort, should really be inline

function quicksort_swap(data, i, j,      temp)
{
    temp = data[i]
    data[i] = data[j]
    data[j] = temp
}

The quicksort() function receives the data array, the starting and ending indices to sort (left and right), and the name of a function that performs a “less than” comparison. It then implements the quicksort algorithm.

To make use of the sorting function, we return to our previous example. The first thing to do is write some comparison functions:

# num_lt --- do a numeric less than comparison

function num_lt(left, right)
{
    return ((left + 0) < (right + 0))
}

# num_ge --- do a numeric greater than or equal to comparison

function num_ge(left, right)
{
    return ((left + 0) >= (right + 0))
}

The num_ge() function is needed to perform a descending sort; when used to perform a “less than” test, it actually does the opposite (greater than or equal to), which yields data sorted in descending order.

Next comes a sorting function. It is parameterized with the starting and ending field numbers and the comparison function. It builds an array with the data and calls quicksort() appropriately, and then formats the results as a single string:

# do_sort --- sort the data according to `compare'
#             and return it as a string

function do_sort(first, last, compare,      data, i, retval)
{
    delete data
    for (i = 1; first <= last; first++) {
        data[i] = $first
        i++
    }

    quicksort(data, 1, i-1, compare)

    retval = data[1]
    for (i = 2; i in data; i++)
        retval = retval " " data[i]

    return retval
}

Finally, the two sorting functions call do_sort(), passing in the names of the two comparison functions:

# sort --- sort the data in ascending order and return it as a string

function sort(first, last)
{
    return do_sort(first, last, "num_lt")
}

# rsort --- sort the data in descending order and return it as a string

function rsort(first, last)
{
    return do_sort(first, last, "num_ge")
}

Here is an extended version of the datafile:

Biology_101 sum average sort rsort data: 87.0 92.4 78.5 94.9
Chemistry_305 sum average sort rsort data: 75.2 98.3 94.7 88.2
English_401 sum average sort rsort data: 100.0 95.6 87.1 93.4

Finally, here are the results when the enhanced program is run:

$ gawk -f quicksort.awk -f indirectcall.awk class_data2
Biology 101:
    sum: <352.8>
    average: <88.2>
    sort: <78.5 87.0 92.4 94.9>
    rsort: <94.9 92.4 87.0 78.5>

Chemistry 305:
    sum: <356.4>
    average: <89.1>
    sort: <75.2 88.2 94.7 98.3>
    rsort: <98.3 94.7 88.2 75.2>

English 401:
    sum: <376.1>
    average: <94.025>
    sort: <87.1 93.4 95.6 100.0>
    rsort: <100.0 95.6 93.4 87.1>
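Because the comparison function is chosen by name, the same quicksort() can order other kinds of data. The following self-contained sketch repeats quicksort() and quicksort_swap() from the text so that it can be run on its own, and adds a string comparison. The function str_lt() and the sample data are assumptions made for this example:

```awk
function quicksort(data, left, right, less_than,    i, last)
{
    if (left >= right)
        return

    quicksort_swap(data, left, int((left + right) / 2))
    last = left
    for (i = left + 1; i <= right; i++)
        if (@less_than(data[i], data[left]))
            quicksort_swap(data, ++last, i)
    quicksort_swap(data, left, last)
    quicksort(data, left, last - 1, less_than)
    quicksort(data, last + 1, right, less_than)
}

function quicksort_swap(data, i, j,      temp)
{
    temp = data[i]
    data[i] = data[j]
    data[j] = temp
}

# str_lt --- do a string less than comparison
# (illustrative name; concatenating "" forces a string comparison)
function str_lt(left, right)
{
    return ((left "") < (right ""))
}

BEGIN {
    n = split("pear apple fig", fruit)
    quicksort(fruit, 1, n, "str_lt")
    for (i = 1; i <= n; i++)
        printf("%s%s", fruit[i], (i < n ? " " : "\n"))
}
```

With gawk, this prints ‘apple fig pear’. Only the name passed as the last argument changes; the sorting code itself is untouched.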

Another example where indirect function calls are useful can be found in processing arrays. This is described in Traversing Arrays of Arrays.

Remember that you must supply a leading ‘@’ in front of an indirect function call.

Starting with version 4.1.2 of gawk, indirect function calls may also be used with built-in functions and with extension functions (see Chapter 16). The only thing you cannot do is pass a regular expression constant to a built-in function through an indirect function call.[57]
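As a brief illustration, and assuming gawk 4.1.2 or later, the following sketch calls two built-in string functions through a variable. The variable names funcs, fn, and result are arbitrary choices for this example:

```awk
BEGIN {
    n = split("toupper tolower", funcs)
    for (i = 1; i <= n; i++) {
        fn = funcs[i]
        result = @fn("Hello")   # indirect call to a built-in function
        print fn ": " result
    }
}
```

With a suitable gawk, this prints ‘toupper: HELLO’ and then ‘tolower: hello’. Older versions of gawk reject indirect calls to built-in functions.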

gawk does its best to make indirect function calls efficient. For example, in the following case:

for (i = 1; i <= n; i++)
    @the_func()

gawk looks up the actual function to call only once.

Summary

§ awk provides built-in functions and lets you define your own functions.

§ POSIX awk provides three kinds of built-in functions: numeric, string, and I/O. gawk provides functions that sort arrays, work with values representing time, do bit manipulation, determine variable type (array versus scalar), and internationalize and localize programs. gawk also provides several extensions to some of the standard functions, typically in the form of additional arguments.

§ Functions accept zero or more arguments and return a value. The expressions that provide the argument values are completely evaluated before the function is called. Order of evaluation is not defined. The return value can be ignored.

§ The handling of backslash in sub() and gsub() is not simple. It is more straightforward in gawk’s gensub() function, but that function still requires care in its use.

§ User-defined functions provide important capabilities but come with some syntactic inelegancies. In a function call, there cannot be any space between the function name and the opening left parenthesis of the argument list. Also, there is no provision for local variables, so the convention is to add extra parameters, and to separate them visually from the real parameters by extra whitespace.

§ User-defined functions may call other user-defined (and built-in) functions and may call themselves recursively. Function parameters “hide” any global variables of the same names. You cannot use the name of a reserved variable (such as ARGC) as the name of a parameter in user-defined functions.

§ Scalar values are passed to user-defined functions by value. Array parameters are passed by reference; any changes made by the function to array parameters are thus visible after the function has returned.

§ Use the return statement to return from a user-defined function. An optional expression becomes the function’s return value. Only scalar values may be returned by a function.

§ If a variable that has never been used is passed to a user-defined function, how that function treats the variable can set its nature: either scalar or array.

§ gawk provides indirect function calls using a special syntax. By setting a variable to the name of a function, you can determine at runtime what function will be called at that point in the program. This is equivalent to function pointers in C and C++.


[41] The C version of rand() on many Unix systems is known to produce fairly poor sequences of random numbers. However, nothing requires that an awk implementation use the C rand() to implement the awk version of rand(). In fact, gawk uses the BSD random() function, which is considerably better than rand(), to produce random numbers.

[42] mawk uses a different seed each time.

[43] Computer-generated random numbers really are not truly random. They are technically known as pseudorandom. This means that although the numbers in a sequence appear to be random, you can in fact generate the same sequence of random numbers over and over again.

[44] Unless you use the --non-decimal-data option, which isn’t recommended. See Allowing Nondecimal Input Data for more information.

[45] Note that this means that the record will first be regenerated using the value of OFS if any fields have been changed, and that the fields will be updated after the substitution, even if the operation is a “no-op” such as ‘sub(/^/, "")’.

[46] This is different from C and C++, in which the first character is number zero.

[47] This was rather naive of him, despite there being a note in this section indicating that the next major version would move to the POSIX rules.

[48] A program is interactive if the standard output is connected to a terminal device. On modern systems, this means your keyboard and screen.

[49] The GNU date utility can also do many of the things described here. Its use may be preferable for simple time-related operations in shell scripts.

[50] Occasionally there are minutes in a year with a leap second, which is why the seconds can go up to 60.

[51] Unfortunately, not every system’s strftime() necessarily supports all of the conversions listed here.

[52] If you don’t understand any of this, don’t worry about it; these facilities are meant to make it easier to “internationalize” programs. Other internationalization features are described in Chapter 13.

[53] This is because ISO C leaves the behavior of the C version of strftime() undefined and gawk uses the system’s version of strftime() if it’s there. Typically, the conversion specifier either does not appear in the returned string or appears literally.

[54] This example shows that zeros come in on the left side. For gawk, this is always true, but in some languages, it’s possible to have the left side fill with ones.

[55] This program won’t actually run, because foo() is undefined.

[56] Late in 2012.

[57] This may change in a future version; recheck the documentation that comes with your version of gawk to see if it has.