A Library of awk Functions - Problem Solving with awk - Effective awk Programming (2015)

Part II. Problem Solving with awk

Part II shows how to use awk and gawk for problem solving. There is lots of code here for you to read and learn from. It contains the following chapters:

§ Chapter 10, A Library of awk Functions

§ Chapter 11, Practical awk Programs

Chapter 10. A Library of awk Functions

User-Defined Functions describes how to write your own awk functions. Writing functions is important, because it allows you to encapsulate algorithms and program tasks in a single place. It simplifies programming, making program development more manageable and making programs more readable.

In their seminal 1976 book, Software Tools,[58] Brian Kernighan and P.J. Plauger wrote:

Good programming is not learned from generalities, but by seeing how significant programs can be made clean, easy to read, easy to maintain and modify, human-engineered, efficient and reliable, by the application of common sense and good programming practices. Careful study and imitation of good programs leads to better writing.

In fact, they felt this idea was so important that they placed this statement on the cover of their book. Because we believe strongly that their statement is correct, this chapter and Chapter 11 provide a good-sized body of code for you to read and, we hope, to learn from.

This chapter presents a library of useful awk functions. Many of the sample programs presented later in this book use these functions. The functions are presented here in a progression from simple to complex.

Extracting Programs from Texinfo Source Files presents a program that you can use to extract the source code for these example library functions and programs from the Texinfo source for this book. (This has already been done as part of the gawk distribution.)

The programs in this chapter and in Chapter 11 freely use gawk-specific features. Rewriting these programs for different implementations of awk is pretty straightforward:

§ Diagnostic error messages are sent to /dev/stderr. Use ‘| "cat 1>&2"’ instead of ‘> "/dev/stderr"’ if your system does not have a /dev/stderr, or if you cannot use gawk.

§ A number of programs use nextfile (see The nextfile Statement) to skip any remaining input in the input file.

§ Finally, some of the programs choose to ignore upper- and lowercase distinctions in their input. They do so by assigning one to IGNORECASE. You can achieve almost the same effect[59] by adding the following rule to the beginning of the program:

# ignore case

{ $0 = tolower($0) }

Also, verify that all regexp and string constants used in comparisons use only lowercase letters.
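
As a minimal sketch of the portable-diagnostics advice above (the error() helper name is our own, not part of the book's library), piping through cat works on systems without /dev/stderr:

```shell
# Portable diagnostic output: pipe to "cat 1>&2" instead of
# writing to /dev/stderr, which not every awk or system supports.
awk '
function error(msg) {
    print msg | "cat 1>&2"   # send msg to standard error portably
    close("cat 1>&2")        # wait for cat to finish
}
BEGIN { error("demo: something went wrong") }
' 2>&1
```

With gawk, the simpler ‘print msg > "/dev/stderr"’ form is equivalent.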

Naming Library Function Global Variables

Due to the way the awk language evolved, variables are either global (usable by the entire program) or local (usable just by a specific function). There is no intermediate state analogous to static variables in C.

Library functions often need to have global variables that they can use to preserve state information between calls to the function—for example, getopt()’s variable _opti (see Processing Command-Line Options). Such variables are called private, as the only functions that need to use them are the ones in the library.

When writing a library function, you should try to choose names for your private variables that will not conflict with any variables used by either another library function or a user’s main program. For example, a name like i or j is not a good choice, because user programs often use variable names like these for their own purposes.

The example programs shown in this chapter all start the names of their private variables with an underscore (‘_’). Users generally don’t use leading underscores in their variable names, so this convention immediately decreases the chances that the variable names will be accidentally shared with the user’s program.

In addition, several of the library functions use a prefix that helps indicate what function or set of functions use the variables—for example, _pw_byname() in the user database routines (see Reading the User Database). This convention is recommended, as it even further decreases the chance of inadvertent conflict among variable names. Note that this convention is used equally well for variable names and for private function names.[60]

As a final note on variable naming, if a function makes global variables available for use by a main program, it is a good convention to start those variables’ names with a capital letter—for example, getopt()’s Opterr and Optind variables (see Processing Command-Line Options). The leading capital letter indicates that it is global, while the fact that the variable name is not all capital letters indicates that the variable is not one of awk’s predefined variables, such as FS.

It is also important that all variables in library functions that do not need to save state are, in fact, declared local.[61] If this is not done, the variables could accidentally be used in the user’s program, leading to bugs that are very difficult to track down:

function lib_func(x, y, l1, l2)

{

# some_var should be local but by oversight is not

use variable some_var

}

A different convention, common in the Tcl community, is to use a single associative array to hold the values needed by the library function(s), or “package.” This significantly decreases the number of actual global names in use. For example, the functions described in Reading the User Database might have used array elements PW_data["inited"], PW_data["total"], PW_data["count"], and PW_data["awklib"], instead of _pw_inited, _pw_awklib, _pw_total, and _pw_count.
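
The single-array convention can be sketched as follows; the Counter_data package and counter_next() function are hypothetical illustrations, not part of any real library:

```shell
# One associative array holds all of the "package" state, so only
# a single global name (Counter_data) is introduced.
awk '
function counter_next() {
    if (! Counter_data["inited"]) {   # lazy one-time initialization
        Counter_data["inited"] = 1
        Counter_data["count"] = 0
    }
    return ++Counter_data["count"]
}
BEGIN { print counter_next(); print counter_next() }
'
```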

The conventions presented in this section are exactly that: conventions. You are not required to write your programs this way—we merely recommend that you do so.

General Programming

This section presents a number of functions that are of general programming use.

Converting Strings to Numbers

The strtonum() function (see String-Manipulation Functions) is a gawk extension. The following function provides an implementation for other versions of awk:

# mystrtonum --- convert string to number

function mystrtonum(str, ret, n, i, k, c)

{

if (str ~ /^0[0-7]*$/) {

# octal

n = length(str)

ret = 0

for (i = 1; i <= n; i++) {

c = substr(str, i, 1)

# index() returns 0 if c not in string,

# includes c == "0"

k = index("1234567", c)

ret = ret * 8 + k

}

} else if (str ~ /^0[xX][[:xdigit:]]+$/) {

# hexadecimal

str = substr(str, 3) # lop off leading 0x

n = length(str)

ret = 0

for (i = 1; i <= n; i++) {

c = substr(str, i, 1)

c = tolower(c)

# index() returns 0 if c not in string,

# includes c == "0"

k = index("123456789abcdef", c)

ret = ret * 16 + k

}

} else if (str ~ \

/^[-+]?([0-9]+([.][0-9]*([Ee][0-9]+)?)?|([.][0-9]+([Ee][-+]?[0-9]+)?))$/) {

# decimal number, possibly floating point

ret = str + 0

} else

ret = "NOT-A-NUMBER"

return ret

}

# BEGIN { # gawk test harness

# a[1] = "25"

# a[2] = ".31"

# a[3] = "0123"

# a[4] = "0xdeadBEEF"

# a[5] = "123.45"

# a[6] = "1.e3"

# a[7] = "1.32"

# a[8] = "1.32E2"

#

# for (i = 1; i in a; i++)

# print a[i], strtonum(a[i]), mystrtonum(a[i])

# }

The function first looks for C-style octal numbers (base 8). If the input string matches a regular expression describing octal numbers, then mystrtonum() loops through each character in the string. It sets k to the index in "1234567" of the current octal digit. The return value will either be the same number as the digit, or zero if the character is not there, which will be true for a ‘0’. This is safe, because the regexp test in the if ensures that only octal values are converted.

Similar logic applies to the code that checks for and converts a hexadecimal value, which starts with ‘0x’ or ‘0X’. The use of tolower() simplifies the computation for finding the correct numeric value for each hexadecimal digit.

Finally, if the string matches the (rather complicated) regexp for a regular decimal integer or floating-point number, the computation ‘ret = str + 0’ lets awk convert the value to a number.

A commented-out test program is included, so that the function can be tested with gawk and the results compared to the built-in strtonum() function.

Assertions

When writing large programs, it is often useful to know that a condition or set of conditions is true. Before proceeding with a particular computation, you make a statement about what you believe to be the case. Such a statement is known as an assertion. The C language provides an <assert.h> header file and corresponding assert() macro that a programmer can use to make assertions. If an assertion fails, the assert() macro arranges to print a diagnostic message describing the condition that should have been true but was not, and then it kills the program. In C, using assert() looks like this:

#include <assert.h>

int myfunc(int a, double b)

{

assert(a <= 5 && b >= 17.1);

}

If the assertion fails, the program prints a message similar to this:

prog.c:5: assertion failed: a <= 5 && b >= 17.1

The C language makes it possible to turn the condition into a string for use in printing the diagnostic message. This is not possible in awk, so this assert() function also requires a string version of the condition that is being tested. Following is the function:

# assert --- assert that a condition is true. Otherwise, exit.

function assert(condition, string)

{

if (! condition) {

printf("%s:%d: assertion failed: %s\n",

FILENAME, FNR, string) > "/dev/stderr"

_assert_exit = 1

exit 1

}

}

END {

if (_assert_exit)

exit 1

}

The assert() function tests the condition parameter. If it is false, it prints a message to standard error, using the string parameter to describe the failed condition. It then sets the variable _assert_exit to one and executes the exit statement. The exit statement jumps to the END rule. If the END rule finds _assert_exit to be true, it exits immediately.

The purpose of the test in the END rule is to keep any other END rules from running. When an assertion fails, the program should exit immediately. If no assertions fail, then _assert_exit is still false when the END rule is run normally, and the rest of the program’s END rules execute. For all of this to work correctly, assert.awk must be the first source file read by awk. The function can be used in a program in the following way:

function myfunc(a, b)

{

assert(a <= 5 && b >= 17.1, "a <= 5 && b >= 17.1")

}

If the assertion fails, you see a message similar to the following:

mydata:1357: assertion failed: a <= 5 && b >= 17.1

There is a small problem with this version of assert(). An END rule is automatically added to the program calling assert(). Normally, if a program consists of just a BEGIN rule, the input files and/or standard input are not read. However, now that the program has an END rule, awk attempts to read the input datafiles or standard input (see Startup and cleanup actions), most likely causing the program to hang as it waits for input.

There is a simple workaround to this: make sure that such a BEGIN rule always ends with an exit statement.
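
The workaround can be sketched like this, with the book's assert() and END rule inlined and a trivial assertion of our own:

```shell
# A BEGIN-only program using assert(); the explicit "exit 0" keeps
# awk from reading stdin once the END rule has been added.
awk '
function assert(condition, string)
{
    if (! condition) {
        printf("%s:%d: assertion failed: %s\n",
               FILENAME, FNR, string) | "cat 1>&2"
        _assert_exit = 1
        exit 1
    }
}
BEGIN {
    assert(2 + 2 == 4, "2 + 2 == 4")
    print "all assertions passed"
    exit 0    # crucial: prevents awk from waiting for input
}
END {
    if (_assert_exit)
        exit 1
}
'
```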

Rounding Numbers

The way printf and sprintf() (see Using printf Statements for Fancier Printing) perform rounding often depends upon the system’s C sprintf() subroutine. On many machines, sprintf() rounding is unbiased, which means it doesn’t always round a trailing .5 up, contrary to naive expectations. In unbiased rounding, .5 rounds to even, rather than always up, so 1.5 rounds to 2 but 4.5 rounds to 4. This means that if you are using a format that does rounding (e.g., "%.0f"), you should check what your system does. The following function does traditional rounding; it might be useful if your awk’s printf does unbiased rounding:

# round.awk --- do normal rounding

function round(x, ival, aval, fraction)

{

ival = int(x) # integer part, int() truncates

# see if fractional part

if (ival == x) # no fraction

return ival # ensure no decimals

if (x < 0) {

aval = -x # absolute value

ival = int(aval)

fraction = aval - ival

if (fraction >= .5)

return int(x) - 1 # -2.5 --> -3

else

return int(x) # -2.3 --> -2

} else {

fraction = x - ival

if (fraction >= .5)

return ival + 1

else

return ival

}

}

# test harness

# { print $0, round($0) }

The Cliff Random Number Generator

The Cliff random number generator is a very simple random number generator that “passes the noise sphere test for randomness by showing no structure.” It is easily programmed, in less than 10 lines of awk code:

# cliff_rand.awk --- generate Cliff random numbers

BEGIN { _cliff_seed = 0.1 }

function cliff_rand()

{

_cliff_seed = (100 * log(_cliff_seed)) % 1

if (_cliff_seed < 0)

_cliff_seed = - _cliff_seed

return _cliff_seed

}

This algorithm requires an initial “seed” of 0.1. Each new value uses the current seed as input for the calculation. If the built-in rand() function (see Numeric Functions) isn’t random enough, you might try using this function instead.
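
A quick sanity check of the generator: each value it produces is the absolute value of a remainder modulo 1, so every result falls in the interval [0, 1):

```shell
# Draw a few Cliff random numbers and verify they stay in [0, 1).
awk '
BEGIN { _cliff_seed = 0.1 }
function cliff_rand()
{
    _cliff_seed = (100 * log(_cliff_seed)) % 1
    if (_cliff_seed < 0)
        _cliff_seed = - _cliff_seed
    return _cliff_seed
}
BEGIN {
    ok = 1
    for (i = 1; i <= 5; i++) {
        r = cliff_rand()
        if (r < 0 || r >= 1)
            ok = 0
    }
    print (ok ? "in range" : "out of range")
}
'
```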

Translating Between Characters and Numbers

One commercial implementation of awk supplies a built-in function, ord(), which takes a character and returns the numeric value for that character in the machine’s character set. If the string passed to ord() has more than one character, only the first one is used.

The inverse of this function is chr() (from the function of the same name in Pascal), which takes a number and returns the corresponding character. Both functions are written very nicely in awk; there is no real reason to build them into the awk interpreter:

# ord.awk --- do ord and chr

# Global identifiers:

# _ord_: numerical values indexed by characters

# _ord_init: function to initialize _ord_

BEGIN { _ord_init() }

function _ord_init( low, high, i, t)

{

low = sprintf("%c", 7) # BEL is ascii 7

if (low == "\a") { # regular ascii

low = 0

high = 127

} else if (sprintf("%c", 128 + 7) == "\a") {

# ascii, mark parity

low = 128

high = 255

} else { # ebcdic(!)

low = 0

high = 255

}

for (i = low; i <= high; i++) {

t = sprintf("%c", i)

_ord_[t] = i

}

}

Some explanation of the numbers used by _ord_init() is worthwhile. The most prominent character set in use today is ASCII.[62] Although an 8-bit byte can hold 256 distinct values (from 0 to 255), ASCII only defines characters that use the values from 0 to 127.[63] In the now distant past, at least one minicomputer manufacturer used ASCII, but with mark parity, meaning that the leftmost bit in the byte is always 1. This means that on those systems, characters have numeric values from 128 to 255. Finally, large mainframe systems use the EBCDIC character set, which uses all 256 values. There are other character sets in use on some older systems, but they are not really worth worrying about:

function ord(str, c)

{

# only first character is of interest

c = substr(str, 1, 1)

return _ord_[c]

}

function chr(c)

{

# force c to be numeric by adding 0

return sprintf("%c", c + 0)

}

#### test code ####

# BEGIN {

# for (;;) {

# printf("enter a character: ")

# if ((getline var) <= 0)

# break

# printf("ord(%s) = %d\n", var, ord(var))

# }

# }

An obvious improvement to these functions is to move the code for the _ord_init function into the body of the BEGIN rule. It was written this way initially for ease of development. There is a “test program” in a BEGIN rule, to test the function. It is commented out for production use.

Merging an Array into a String

When doing string processing, it is often useful to be able to join all the strings in an array into one long string. The following function, join(), accomplishes this task. It is used later in several of the application programs (see Chapter 11).

Good function design is important; this function needs to be general, but it should also have a reasonable default behavior. It is called with an array as well as the beginning and ending indices of the elements in the array to be merged. This assumes that the array indices are numeric—a reasonable assumption, as the array was likely created with split() (see String-Manipulation Functions):

# join.awk --- join an array into a string

function join(array, start, end, sep, result, i)

{

if (sep == "")

sep = " "

else if (sep == SUBSEP) # magic value

sep = ""

result = array[start]

for (i = start + 1; i <= end; i++)

result = result sep array[i]

return result

}

An optional additional argument is the separator to use when joining the strings back together. If the caller supplies a nonempty value, join() uses it; if it is not supplied, it has a null value. In this case, join() uses a single space as a default separator for the strings. If the value is equal to SUBSEP, then join() joins the strings with no separator between them. SUBSEP serves as a “magic” value to indicate that there should be no separation between the component strings.[64]
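
The three separator behaviors can be seen in a short usage sketch, joining an array produced by split():

```shell
# join() with the default separator, an explicit one, and SUBSEP.
awk '
function join(array, start, end, sep,    result, i)
{
    if (sep == "")
        sep = " "
    else if (sep == SUBSEP)   # magic value
        sep = ""
    result = array[start]
    for (i = start + 1; i <= end; i++)
        result = result sep array[i]
    return result
}
BEGIN {
    n = split("a:b:c", parts, ":")
    print join(parts, 1, n)           # default separator: one space
    print join(parts, 1, n, "-")      # explicit separator
    print join(parts, 1, n, SUBSEP)   # no separator at all
}
'
```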

Managing the Time of Day

The systime() and strftime() functions described in Time Functions provide the minimum functionality necessary for dealing with the time of day in human-readable form. Although strftime() is extensive, the control formats are not necessarily easy to remember or intuitively obvious when reading a program.

The following function, getlocaltime(), populates a user-supplied array with preformatted time information. It returns a string with the current time formatted in the same way as the date utility:

# getlocaltime.awk --- get the time of day in a usable format

# Returns a string in the format of output of date(1)

# Populates the array argument time with individual values:

# time["second"] -- seconds (0 - 59)

# time["minute"] -- minutes (0 - 59)

# time["hour"] -- hours (0 - 23)

# time["althour"] -- hours (0 - 12)

# time["monthday"] -- day of month (1 - 31)

# time["month"] -- month of year (1 - 12)

# time["monthname"] -- name of the month

# time["shortmonth"] -- short name of the month

# time["year"] -- year modulo 100 (0 - 99)

# time["fullyear"] -- full year

# time["weekday"] -- day of week (Sunday = 0)

# time["altweekday"] -- day of week (Monday = 0)

# time["dayname"] -- name of weekday

# time["shortdayname"] -- short name of weekday

# time["yearday"] -- day of year (0 - 365)

# time["time zone"] -- abbreviation of time zone name

# time["ampm"] -- AM or PM designation

# time["weeknum"] -- week number, Sunday first day

# time["altweeknum"] -- week number, Monday first day

function getlocaltime(time, ret, now, i)

{

# get time once, avoids unnecessary system calls

now = systime()

# return date(1)-style output

ret = strftime("%a %b %e %H:%M:%S %Z %Y", now)

# clear out target array

delete time

# fill in values, force numeric values to be

# numeric by adding 0

time["second"] = strftime("%S", now) + 0

time["minute"] = strftime("%M", now) + 0

time["hour"] = strftime("%H", now) + 0

time["althour"] = strftime("%I", now) + 0

time["monthday"] = strftime("%d", now) + 0

time["month"] = strftime("%m", now) + 0

time["monthname"] = strftime("%B", now)

time["shortmonth"] = strftime("%b", now)

time["year"] = strftime("%y", now) + 0

time["fullyear"] = strftime("%Y", now) + 0

time["weekday"] = strftime("%w", now) + 0

time["altweekday"] = strftime("%u", now) + 0

time["dayname"] = strftime("%A", now)

time["shortdayname"] = strftime("%a", now)

time["yearday"] = strftime("%j", now) + 0

time["timezone"] = strftime("%Z", now)

time["ampm"] = strftime("%p", now)

time["weeknum"] = strftime("%U", now) + 0

time["altweeknum"] = strftime("%W", now) + 0

return ret

}

The string indices are easier to use and read than the various formats required by strftime(). The alarm program presented in An Alarm Clock Program uses this function. A more general design for the getlocaltime() function would have allowed the user to supply an optional timestamp value to use instead of the current time.

Reading a Whole File at Once

Often, it is convenient to have the entire contents of a file available in memory as a single string. A straightforward but naive way to do that might be as follows:

function readfile(file, tmp, contents)

{

if ((getline tmp < file) < 0)

return

contents = tmp

while ((getline tmp < file) > 0)

contents = contents RT tmp

close(file)

return contents

}

This function reads from file one record at a time, building up the full contents of the file in the local variable contents. It works, but is not necessarily efficient.

The following function, based on a suggestion by Denis Shirokov, reads the entire contents of the named file in one shot:

# readfile.awk --- read an entire file at once

function readfile(file, tmp, save_rs)

{

save_rs = RS

RS = "^$"

getline tmp < file

close(file)

RS = save_rs

return tmp

}

It works by setting RS to ‘^$’, a regular expression that will never match if the file has contents. gawk reads data from the file into tmp, attempting to match RS. The match fails after each read, but fails quickly, such that gawk fills tmp with the entire contents of the file. (See How Input Is Split into Records for information on RT and RS.)

In the case that file is empty, the return value is the null string. Thus, calling code may use something like:

contents = readfile("/some/path")

if (length(contents) == 0)

# file was empty …

This tests the result to see if it is empty or not. An equivalent test would be ‘contents == ""’.

See Reading an Entire File for an extension function that also reads an entire file into memory.

Quoting Strings to Pass to the Shell

Michael Brennan offers the following programming pattern, which he uses frequently:

#! /bin/sh

awkp='

…

'

input_program | awk "$awkp" | /bin/sh

For example, a program of his named flac-edit has this form:

$ flac-edit -song="Whoope! That's Great" file.flac

It generates the following output, which is to be piped to the shell (/bin/sh):

chmod +w file.flac

metaflac --remove-tag=TITLE file.flac

LANG=en_US.88591 metaflac --set-tag=TITLE='Whoope! That'"'"'s Great' file.flac

chmod -w file.flac

Note the need for shell quoting. The function shell_quote() does it. SINGLE is the one-character string "'" and QSINGLE is the three-character string "\"'\"":

# shell_quote --- quote an argument for passing to the shell

function shell_quote(s, # parameter

SINGLE, QSINGLE, i, X, n, ret) # locals

{

if (s == "")

return "\"\""

SINGLE = "\x27" # single quote

QSINGLE = "\"\x27\""

n = split(s, X, SINGLE)

ret = SINGLE X[1] SINGLE

for (i = 2; i <= n; i++)

ret = ret QSINGLE SINGLE X[i] SINGLE

return ret

}
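
A usage sketch: quoting a string that itself contains a single quote yields one shell word built from the SINGLE/QSINGLE pieces described above:

```shell
# shell_quote() on a string containing a single quote. The string is
# passed in with -v to avoid shell-quoting gymnastics in this example.
awk -v s="That's Great" '
function shell_quote(s,    SINGLE, QSINGLE, i, X, n, ret)
{
    if (s == "")
        return "\"\""
    SINGLE = "\x27"          # single quote
    QSINGLE = "\"\x27\""
    n = split(s, X, SINGLE)
    ret = SINGLE X[1] SINGLE
    for (i = 2; i <= n; i++)
        ret = ret QSINGLE SINGLE X[i] SINGLE
    return ret
}
BEGIN { print shell_quote(s) }
'
```

The result is a string the shell parses back into exactly the original argument.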

Datafile Management

This section presents functions that are useful for managing command-line datafiles.

Noting Datafile Boundaries

The BEGIN and END rules are each executed exactly once, at the beginning and end of your awk program, respectively (see The BEGIN and END Special Patterns). We (the gawk authors) once had a user who mistakenly thought that the BEGIN rules were executed at the beginning of each datafile and the END rules were executed at the end of each datafile.

When informed that this was not the case, the user requested that we add new special patterns to gawk, named BEGIN_FILE and END_FILE, that would have the desired behavior. He even supplied us the code to do so.

Adding these special patterns to gawk wasn’t necessary; the job can be done cleanly in awk itself, as illustrated by the following library program. It arranges to call two user-supplied functions, beginfile() and endfile(), at the beginning and end of each datafile. Besides solving the problem in only nine(!) lines of code, it does so portably; this works with any implementation of awk:

# transfile.awk

#

# Give the user a hook for filename transitions

#

# The user must supply functions beginfile() and endfile()

# that each take the name of the file being started or

# finished, respectively.

FILENAME != _oldfilename {

if (_oldfilename != "")

endfile(_oldfilename)

_oldfilename = FILENAME

beginfile(FILENAME)

}

END { endfile(FILENAME) }

This file must be loaded before the user’s “main” program, so that the rule it supplies is executed first.

This rule relies on awk’s FILENAME variable, which automatically changes for each new datafile. The current filename is saved in a private variable, _oldfilename. If FILENAME does not equal _oldfilename, then a new datafile is being processed and it is necessary to call endfile() for the old file. Because endfile() should only be called if a file has been processed, the program first checks to make sure that _oldfilename is not the null string. The program then assigns the current filename to _oldfilename and calls beginfile() for the file. Because, like all awk variables, _oldfilename is initialized to the null string, this rule executes correctly even for the first datafile.

The program also supplies an END rule to do the final processing for the last file. Because this END rule comes before any END rules supplied in the “main” program, endfile() is called first. Once again, the value of multiple BEGIN and END rules should be clear.

If the same datafile occurs twice in a row on the command line, then endfile() and beginfile() are not executed at the end of the first pass and at the beginning of the second pass. The following version solves the problem:

# ftrans.awk --- handle datafile transitions

#

# user supplies beginfile() and endfile() functions

FNR == 1 {

if (_filename_ != "")

endfile(_filename_)

_filename_ = FILENAME

beginfile(FILENAME)

}

END { endfile(_filename_) }

Counting Things shows how this library function can be used and how it simplifies writing the main program.
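
As a small demonstration of the ftrans.awk pattern, the beginfile() and endfile() bodies below (which simply count lines per file) are our own illustration, not from the book's library:

```shell
# Per-file line counts via the ftrans.awk FNR == 1 rule.
tmp1=$(mktemp); tmp2=$(mktemp)
printf 'a\nb\n' > "$tmp1"
printf 'c\n'    > "$tmp2"
awk '
function beginfile(name) { lines = 0 }
function endfile(name)   { print lines " line(s)" }
FNR == 1 {
    if (_filename_ != "")
        endfile(_filename_)
    _filename_ = FILENAME
    beginfile(FILENAME)
}
{ lines++ }
END { endfile(_filename_) }
' "$tmp1" "$tmp2"
rm -f "$tmp1" "$tmp2"
```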

SO WHY DOES GAWK HAVE BEGINFILE AND ENDFILE?

You are probably wondering, if beginfile() and endfile() functions can do the job, why does gawk have BEGINFILE and ENDFILE patterns?

Good question. Normally, if awk cannot open a file, this causes an immediate fatal error. In this case, there is no way for a user-defined function to deal with the problem, as the mechanism for calling it relies on the file being open and at the first record. Thus, the main reason for BEGINFILE is to give you a “hook” to catch files that cannot be processed. ENDFILE exists for symmetry, and because it provides an easy way to do per-file cleanup processing. For more information, refer to The BEGINFILE and ENDFILE Special Patterns.

Rereading the Current File

Another request for a new built-in function was for a function that would make it possible to reread the current file. The requesting user didn’t want to have to use getline (see Explicit Input with getline) inside a loop.

However, as long as you are not in the END rule, it is quite easy to arrange to immediately close the current input file and then start over with it from the top. For lack of a better name, we’ll call the function rewind():

# rewind.awk --- rewind the current file and start over

function rewind( i)

{

# shift remaining arguments up

for (i = ARGC; i > ARGIND; i--)

ARGV[i] = ARGV[i-1]

# make sure gawk knows to keep going

ARGC++

# make current file next to get done

ARGV[ARGIND+1] = FILENAME

# do it

nextfile

}

The rewind() function relies on the ARGIND variable (see Built-in Variables That Convey Information), which is specific to gawk. It also relies on the nextfile keyword (see The nextfile Statement). Because of this, you should not call it from an ENDFILE rule. (This isn’t necessary anyway, because gawk goes to the next file as soon as an ENDFILE rule finishes!)

Checking for Readable Datafiles

Normally, if you give awk a datafile that isn’t readable, it stops with a fatal error. There are times when you might want to just ignore such files and keep going.[65] You can do this by prepending the following program to your awk program:

# readable.awk --- library file to skip over unreadable files

BEGIN {

for (i = 1; i < ARGC; i++) {

if (ARGV[i] ~ /^[a-zA-Z_][a-zA-Z0-9_]*=.*/ \

|| ARGV[i] == "-" || ARGV[i] == "/dev/stdin")

continue # assignment or standard input

else if ((getline junk < ARGV[i]) < 0) # unreadable

delete ARGV[i]

else

close(ARGV[i])

}

}

This works because the getline is not fatal; when the file cannot be read, it simply returns −1. Removing the element from ARGV with delete skips the file (because it’s no longer in the list). See also Using ARGC and ARGV.

Because awk variable names only allow the English letters, the regular expression check purposely does not use character classes such as ‘[:alpha:]’ and ‘[:alnum:]’ (see Using Bracket Expressions).

Checking for Zero-Length Files

All known awk implementations silently skip over zero-length files. This is a by-product of awk’s implicit read-a-record-and-match-against-the-rules loop: when awk tries to read a record from an empty file, it immediately receives an end-of-file indication, closes the file, and proceeds on to the next command-line datafile, without executing any user-level awk program code.

Using gawk’s ARGIND variable (see Predefined Variables), it is possible to detect when an empty datafile has been skipped. Similar to the library file presented in Noting Datafile Boundaries, the following library file calls a function named zerofile() that the user must provide. The arguments passed are the filename and the position in ARGV where it was found:

# zerofile.awk --- library file to process empty input files

BEGIN { Argind = 0 }

ARGIND > Argind + 1 {

for (Argind++; Argind < ARGIND; Argind++)

zerofile(ARGV[Argind], Argind)

}

ARGIND != Argind { Argind = ARGIND }

END {

if (ARGIND > Argind)

for (Argind++; Argind <= ARGIND; Argind++)

zerofile(ARGV[Argind], Argind)

}

The user-level variable Argind allows the awk program to track its progress through ARGV. Whenever the program detects that ARGIND is greater than ‘Argind + 1’, it means that one or more empty files were skipped. The action then calls zerofile() for each such file, incrementingArgind along the way.

The ‘ARGIND != Argind’ rule simply keeps Argind up to date in the normal case.

Finally, the END rule catches the case of any empty files at the end of the command-line arguments. Note that the test in the condition of the for loop uses the ‘<=’ operator, not ‘<’.

Treating Assignments as Filenames

Occasionally, you might not want awk to process command-line variable assignments (see Assigning variables on the command line). In particular, if you have a filename that contains an ‘=’ character, awk treats the filename as an assignment and does not process it.

Some users have suggested an additional command-line option for gawk to disable command-line assignments. However, some simple programming with a library file does the trick:

# noassign.awk --- library file to avoid the need for a

# special option that disables command-line assignments

function disable_assigns(argc, argv, i)

{

for (i = 1; i < argc; i++)

if (argv[i] ~ /^[a-zA-Z_][a-zA-Z0-9_]*=.*/)

argv[i] = ("./" argv[i])

}

BEGIN {

if (No_command_assign)

disable_assigns(ARGC, ARGV)

}

You then run your program this way:

awk -v No_command_assign=1 -f noassign.awk -f yourprog.awk *

The function works by looping through the arguments. It prepends ‘./’ to any argument that matches the form of a variable assignment, turning that argument into a filename.

The use of No_command_assign allows you to disable command-line assignments at invocation time, by giving the variable a true value. When not set, it is initially zero (i.e., false), so the command-line arguments are left alone.

Processing Command-Line Options

Most utilities on POSIX-compatible systems take options on the command line that can be used to change the way a program behaves. awk is an example of such a program (see Command-Line Options). Often, options take arguments (i.e., data that the program needs to correctly obey the command-line option). For example, awk’s -F option requires a string to use as the field separator. The first occurrence on the command line of either -- or a string that does not begin with ‘-’ ends the options.

Modern Unix systems provide a C function named getopt() for processing command-line arguments. The programmer provides a string describing the one-letter options. If an option requires an argument, it is followed in the string with a colon. getopt() is also passed the count and values of the command-line arguments and is called in a loop. getopt() processes the command-line arguments for option letters. Each time around the loop, it returns a single character representing the next option letter that it finds, or ‘?’ if it finds an invalid option. When it returns −1, there are no options left on the command line.

When using getopt(), options that do not take arguments can be grouped together. Furthermore, options that take arguments require that the argument be present. The argument can immediately follow the option letter, or it can be a separate command-line argument.

Given a hypothetical program that takes three command-line options, -a, -b, and -c, where -b requires an argument, all of the following are valid ways of invoking the program:

prog -a -b foo -c data1 data2 data3

prog -ac -bfoo -- data1 data2 data3

prog -acbfoo data1 data2 data3

Notice that when the argument is grouped with its option, the rest of the argument is considered to be the option’s argument. In this example, -acbfoo indicates that all of the -a, -b, and -c options were supplied, and that ‘foo’ is the argument to the -b option.

getopt() provides four external variables that the programmer can use:

optind

The index in the argument value array (argv) where the first nonoption command-line argument can be found.

optarg

The string value of the argument to an option.

opterr

Usually getopt() prints an error message when it finds an invalid option. Setting opterr to zero disables this feature. (An application might want to print its own error message.)

optopt

The letter representing the command-line option.

The following C fragment shows how getopt() might process command-line arguments for awk:

int

main(int argc, char *argv[])

{

...

/* print our own message */

opterr = 0;

while ((c = getopt(argc, argv, "v:f:F:W:")) != -1) {

switch (c) {

case 'f': /* file */

...

break;

case 'F': /* field separator */

...

break;

case 'v': /* variable assignment */

...

break;

case 'W': /* extension */

...

break;

case '?':

default:

usage();

break;

}

}

...

}

As a side point, gawk actually uses the GNU getopt_long() function to process both normal and GNU-style long options (see Command-Line Options).

The abstraction provided by getopt() is very useful and is quite handy in awk programs as well. Following is an awk version of getopt(). This function highlights one of the greatest weaknesses in awk, which is that it is very poor at manipulating single characters. Repeated calls to substr() are necessary for accessing individual characters (see String-Manipulation Functions).[66]

The discussion that follows walks through the code a bit at a time:

# getopt.awk --- Do C library getopt(3) function in awk

# External variables:

# Optind -- index in ARGV of first nonoption argument

# Optarg -- string value of argument to current option

# Opterr -- if nonzero, print our own diagnostic

# Optopt -- current option letter

# Returns:

# -1 at end of options

# "?" for unrecognized option

# <c> a character representing the current option

# Private Data:

# _opti -- index in multiflag option, e.g., -abc

The function starts out with comments presenting a list of the global variables it uses, what the return values are, what they mean, and any global variables that are “private” to this library function. Such documentation is essential for any program, and particularly for library functions.

The getopt() function first checks that it was indeed called with a string of options (the options parameter). If options has a zero length, getopt() immediately returns −1:

function getopt(argc, argv, options, thisopt, i)

{

if (length(options) == 0) # no options given

return -1

if (argv[Optind] == "--") { # all done

Optind++

_opti = 0

return -1

} else if (argv[Optind] !~ /^-[^:[:space:]]/) {

_opti = 0

return -1

}

The next thing to check for is the end of the options. A -- ends the command-line options, as does any command-line argument that does not begin with a ‘-’. Optind is used to step through the array of command-line arguments; it retains its value across calls to getopt(), because it is a global variable.
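The test in the else if clause above can be tried on its own. This is a sketch, assuming a POSIX shell and an awk that supports POSIX character classes; note that ‘--’ also matches the regexp, which is why getopt() tests for it explicitly first:

```shell
# Classify arguments the way getopt()'s regexp does.
# "--" matches too, but getopt() special-cases it before this test.
printf -- '%s\n' -a -- data1 - |
awk '{ print $0 ": " ($0 ~ /^-[^:[:space:]]/ ? "option" : "not an option") }'
```

A bare ‘-’ does not match, because the bracket expression requires at least one character after the hyphen.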

The regular expression that is used, /^-[^:[:space:]]/, checks for a ‘-’ followed by anything that is not whitespace and not a colon. If the current command-line argument does not match this pattern, it is not an option, and it ends option processing. Continuing on:

if (_opti == 0)

_opti = 2

thisopt = substr(argv[Optind], _opti, 1)

Optopt = thisopt

i = index(options, thisopt)

if (i == 0) {

if (Opterr)

printf("%c -- invalid option\n", thisopt) > "/dev/stderr"

if (_opti >= length(argv[Optind])) {

Optind++

_opti = 0

} else

_opti++

return "?"

}

The _opti variable tracks the position in the current command-line argument (argv[Optind]). If multiple options are grouped together with one ‘-’ (e.g., -abx), it is necessary to return them to the user one at a time.

If _opti is equal to zero, it is set to two, which is the index in the string of the next character to look at (we skip the ‘-’, which is at position one). The variable thisopt holds the character, obtained with substr(). It is saved in Optopt for the main program to use.
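The character-at-a-time use of substr() is easy to see in isolation. This sketch (assuming a POSIX shell with awk on the PATH) steps through a grouped option argument the same way getopt() does, starting at position two to skip the ‘-’:

```shell
# Step through the grouped options in "-abx" one character
# at a time, as getopt() does; position 1 holds the "-".
awk 'BEGIN {
    arg = "-abx"
    for (i = 2; i <= length(arg); i++)
        print substr(arg, i, 1)
}'
```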

If thisopt is not in the options string, then it is an invalid option. If Opterr is nonzero, getopt() prints an error message on the standard error that is similar to the message from the C version of getopt().

Because the option is invalid, it is necessary to skip it and move on to the next option character. If _opti is greater than or equal to the length of the current command-line argument, it is necessary to move on to the next argument, so Optind is incremented and _opti is reset to zero. Otherwise, Optind is left alone and _opti is merely incremented.

In any case, because the option is invalid, getopt() returns "?". The main program can examine Optopt if it needs to know what the invalid option letter actually is. Continuing on:

if (substr(options, i + 1, 1) == ":") {

# get option argument

if (length(substr(argv[Optind], _opti + 1)) > 0)

Optarg = substr(argv[Optind], _opti + 1)

else

Optarg = argv[++Optind]

_opti = 0

} else

Optarg = ""

If the option requires an argument, the option letter is followed by a colon in the options string. If there are remaining characters in the current command-line argument (argv[Optind]), then the rest of that string is assigned to Optarg. Otherwise, the next command-line argument is used (‘-xFOO’ versus ‘-x FOO’). In either case, _opti is reset to zero, because there are no more characters left to examine in the current command-line argument. Continuing:

if (_opti == 0 || _opti >= length(argv[Optind])) {

Optind++

_opti = 0

} else

_opti++

return thisopt

}

Finally, if _opti is either zero or greater than the length of the current command-line argument, it means this element in argv is through being processed, so Optind is incremented to point to the next element in argv. If neither condition is true, then only _opti is incremented, so that the next option letter can be processed on the next call to getopt().

The BEGIN rule initializes both Opterr and Optind to one. Opterr is set to one, because the default behavior is for getopt() to print a diagnostic message upon seeing an invalid option. Optind is set to one, because there’s no reason to look at the program name, which is in ARGV[0]:

BEGIN {

Opterr = 1 # default is to diagnose

Optind = 1 # skip ARGV[0]

# test program

if (_getopt_test) {

while ((_go_c = getopt(ARGC, ARGV, "ab:cd")) != -1)

printf("c = <%c>, Optarg = <%s>\n",

_go_c, Optarg)

printf("non-option arguments:\n")

for (; Optind < ARGC; Optind++)

printf("\tARGV[%d] = <%s>\n",

Optind, ARGV[Optind])

}

}

The rest of the BEGIN rule is a simple test program. Here are the results of two sample runs of the test program:

$ awk -f getopt.awk -v _getopt_test=1 -- -a -cbARG bax -x

c = <a>, Optarg = <>

c = <c>, Optarg = <>

c = <b>, Optarg = <ARG>

non-option arguments:

ARGV[3] = <bax>

ARGV[4] = <-x>

$ awk -f getopt.awk -v _getopt_test=1 -- -a -x -- xyz abc

c = <a>, Optarg = <>

error→ x -- invalid option

c = <?>, Optarg = <>

non-option arguments:

ARGV[4] = <xyz>

ARGV[5] = <abc>

In both runs, the first -- terminates the arguments to awk, so that it does not try to interpret the -a, etc., as its own options.

NOTE

After getopt() is through, user-level code must clear out the elements of ARGV from 1 to Optind - 1, so that awk does not try to process the command-line options as filenames.

Using ‘#!’ with the -E option may help avoid conflicts between your program’s options and gawk’s options, as -E causes gawk to abandon processing of further options (see Executable awk Programs and Command-Line Options).
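In code, the cleanup the NOTE describes is a short loop. The following is a sketch (run from a POSIX shell, with a hard-coded Optind standing in for the value getopt() would leave behind):

```shell
# Delete ARGV[1]..ARGV[Optind-1] so that awk does not try to
# open "-a" and "-b" as input files.  Optind is hard-coded here
# for illustration; with only a BEGIN rule, awk reads no input,
# so the leftover "data1" and "data2" names are never opened.
awk 'BEGIN {
    Optind = 3
    for (i = 1; i < Optind; i++)
        delete ARGV[i]
    for (i = 1; i < ARGC; i++)
        if (i in ARGV)
            print "file:", ARGV[i]
}' -a -b data1 data2
```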

Several of the sample programs presented in Chapter 11 use getopt() to process their arguments.

Reading the User Database

The PROCINFO array (see Predefined Variables) provides access to the current user’s real and effective user and group ID numbers, and, if available, the user’s supplementary group set. However, because these are numbers, they do not provide very useful information to the average user. There needs to be some way to find the user information associated with the user and group ID numbers. This section presents a suite of functions for retrieving information from the user database. See Reading the Group Database for a similar suite that retrieves information from the group database.

The POSIX standard does not define the file where user information is kept. Instead, it provides the <pwd.h> header file and several C language subroutines for obtaining user information. The primary function is getpwent(), for “get password entry.” The “password” comes from the original user database file, /etc/passwd, which stores user information along with the encrypted passwords (hence the name).

Although an awk program could simply read /etc/passwd directly, this file may not contain complete information about the system’s set of users.[67] To be sure you are able to produce a readable and complete version of the user database, it is necessary to write a small C program that calls getpwent(). getpwent() is defined as returning a pointer to a struct passwd. Each time it is called, it returns the next entry in the database. When there are no more entries, it returns NULL, the null pointer. When this happens, the C program should call endpwent() to close the database. Following is pwcat, a C program that “cats” the password database:

/*

* pwcat.c

*

* Generate a printable version of the password database.

*/

#include <stdio.h>

#include <pwd.h>

int

main(int argc, char **argv)

{

struct passwd *p;

while ((p = getpwent()) != NULL)

printf("%s:%s:%ld:%ld:%s:%s:%s\n",

p->pw_name, p->pw_passwd, (long) p->pw_uid,

(long) p->pw_gid, p->pw_gecos, p->pw_dir, p->pw_shell);

endpwent();

return 0;

}

If you don’t understand C, don’t worry about it. The output from pwcat is the user database, in the traditional /etc/passwd format of colon-separated fields. The fields are:

Login name

The user’s login name.

Encrypted password

The user’s encrypted password. This may not be available on some systems.

User-ID

The user’s numeric user ID number. (On some systems, it’s a C long, and not an int. Thus, we cast it to long for all cases.)

Group-ID

The user’s numeric group ID number. (Similar comments about long versus int apply here.)

Full name

The user’s full name, and perhaps other information associated with the user.

Home directory

The user’s login (or “home”) directory (familiar to shell programmers as $HOME).

Login shell

The program that is run when the user logs in. This is usually a shell, such as Bash.

A few lines representative of pwcat’s output are as follows:

$ pwcat

root:x:0:1:Operator:/:/bin/sh

nobody:*:65534:65534::/:

daemon:*:1:1::/:

sys:*:2:2::/:/bin/csh

bin:*:3:3::/bin:

arnold:xyzzy:2076:10:Arnold Robbins:/home/arnold:/bin/sh

miriam:yxaay:112:10:Miriam Robbins:/home/miriam:/bin/sh

andy:abcca2:113:10:Andy Jacobs:/home/andy:/bin/sh

...
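Once the database is in this colon-separated form, ordinary field splitting pulls it apart. This sketch (assuming a POSIX shell with awk, using one sample line in place of pwcat’s output) extracts a few of the fields:

```shell
# Split one pwcat-style line with FS set to ":" and pick out
# the login name ($1), user ID ($3), and login shell ($7).
echo 'arnold:xyzzy:2076:10:Arnold Robbins:/home/arnold:/bin/sh' |
awk -F: '{ printf("user %s has uid %s and shell %s\n", $1, $3, $7) }'
```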

With that introduction, following is a group of functions for getting user information. There are several functions here, corresponding to the C functions of the same names:

# passwd.awk --- access password file information

BEGIN {

# tailor this to suit your system

_pw_awklib = "/usr/local/libexec/awk/"

}

function _pw_init( oldfs, oldrs, olddol0, pwcat, using_fw, using_fpat)

{

if (_pw_inited)

return

oldfs = FS

oldrs = RS

olddol0 = $0

using_fw = (PROCINFO["FS"] == "FIELDWIDTHS")

using_fpat = (PROCINFO["FS"] == "FPAT")

FS = ":"

RS = "\n"

pwcat = _pw_awklib "pwcat"

while ((pwcat | getline) > 0) {

_pw_byname[$1] = $0

_pw_byuid[$3] = $0

_pw_bycount[++_pw_total] = $0

}

close(pwcat)

_pw_count = 0

_pw_inited = 1

FS = oldfs

if (using_fw)

FIELDWIDTHS = FIELDWIDTHS

else if (using_fpat)

FPAT = FPAT

RS = oldrs

$0 = olddol0

}

The BEGIN rule sets a private variable to the directory where pwcat is stored. Because it is used to help out an awk library routine, we have chosen to put it in /usr/local/libexec/awk; however, you might want it to be in a different directory on your system.

The function _pw_init() fills three copies of the user information into three associative arrays. The arrays are indexed by username (_pw_byname), by user ID number (_pw_byuid), and by order of occurrence (_pw_bycount). The variable _pw_inited is used for efficiency, as _pw_init() needs to be called only once.

Because this function uses getline to read information from pwcat, it first saves the values of FS, RS, and $0. It notes in the variable using_fw whether field splitting with FIELDWIDTHS is in effect or not. Doing so is necessary, as these functions could be called from anywhere within a user’s program, and the user may have his or her own way of splitting records and fields. This makes it possible to restore the correct field-splitting mechanism later. The test can only be true for gawk. It is false if using FS or FPAT, or on some other awk implementation.

The code that checks for FPAT, using the using_fpat variable and PROCINFO["FS"], is similar.

The main part of the function uses a loop to read database lines, split the lines into fields, and then store the lines into each array as necessary. When the loop is done, _pw_init() cleans up by closing the pipeline, setting _pw_inited to one, and restoring FS (and FIELDWIDTHS or FPAT if necessary), RS, and $0. The use of _pw_count is explained shortly.

The getpwnam() function takes a username as a string argument. If that user is in the database, it returns the appropriate line. Otherwise, it relies on the array reference to a nonexistent element to create the element with the null string as its value:

function getpwnam(name)

{

_pw_init()

return _pw_byname[name]

}

Similarly, the getpwuid() function takes a user ID number argument. If that user number is in the database, it returns the appropriate line. Otherwise, it returns the null string:

function getpwuid(uid)

{

_pw_init()

return _pw_byuid[uid]

}

The getpwent() function simply steps through the database, one entry at a time. It uses _pw_count to track its current position in the _pw_bycount array:

function getpwent()

{

_pw_init()

if (_pw_count < _pw_total)

return _pw_bycount[++_pw_count]

return ""

}

The endpwent() function resets _pw_count to zero, so that subsequent calls to getpwent() start over again:

function endpwent()

{

_pw_count = 0

}

A conscious design decision in this suite is that each subroutine calls _pw_init() to initialize the database arrays. The overhead of running a separate process to generate the user database, and the I/O to scan it, are only incurred if the user’s main program actually calls one of these functions. If this library file is loaded along with a user’s program, but none of the routines are ever called, then there is no extra runtime overhead. (The alternative is to move the body of _pw_init() into a BEGIN rule, which always runs pwcat. This simplifies the code but runs an extra process that may never be needed.)

In turn, calling _pw_init() is not too expensive, because the _pw_inited variable keeps the program from reading the data more than once. If you are worried about squeezing every last cycle out of your awk program, the check of _pw_inited could be moved out of _pw_init() and duplicated in all the other functions. In practice, this is not necessary, as most awk programs are I/O-bound, and such a change would clutter up the code.

The id program in Printing Out User Information uses these functions.

Reading the Group Database

Much of the discussion presented in Reading the User Database applies to the group database as well. Although there has traditionally been a well-known file (/etc/group) in a well-known format, the POSIX standard only provides a set of C library routines (<grp.h> and getgrent()) for accessing the information. Even though this file may exist, it may not have complete information. Therefore, as with the user database, it is necessary to have a small C program that generates the group database as its output. grcat, a C program that “cats” the group database, is as follows:

/*

* grcat.c

*

* Generate a printable version of the group database.

*/

#include <stdio.h>

#include <grp.h>

int

main(int argc, char **argv)

{

struct group *g;

int i;

while ((g = getgrent()) != NULL) {

printf("%s:%s:%ld:", g->gr_name, g->gr_passwd,

(long) g->gr_gid);

for (i = 0; g->gr_mem[i] != NULL; i++) {

printf("%s", g->gr_mem[i]);

if (g->gr_mem[i+1] != NULL)

putchar(',');

}

putchar('\n');

}

endgrent();

return 0;

}

Each line in the group database represents one group. The fields are separated with colons and represent the following information:

Group Name

The group’s name.

Group Password

The group’s encrypted password. In practice, this field is never used; it is usually empty or set to ‘*’.

Group ID Number

The group’s numeric group ID number; the association of name to number must be unique within the file. (On some systems it’s a C long, and not an int. Thus, we cast it to long for all cases.)

Group Member List

A comma-separated list of usernames. These users are members of the group. Modern Unix systems allow users to be members of several groups simultaneously. If your system does, then there are elements "group1" through "groupN" in PROCINFO for those group ID numbers. (Note that PROCINFO is a gawk extension; see Predefined Variables.)

Here is what running grcat might produce:

$ grcat

wheel:*:0:arnold

nogroup:*:65534:

daemon:*:1:

kmem:*:2:

staff:*:10:arnold,miriam,andy

other:*:20:

Here are the functions for obtaining information from the group database. There are several, modeled after the C library functions of the same names:

# group.awk --- functions for dealing with the group file

BEGIN {

# Change to suit your system

_gr_awklib = "/usr/local/libexec/awk/"

}

function _gr_init( oldfs, oldrs, olddol0, grcat,

using_fw, using_fpat, n, a, i)

{

if (_gr_inited)

return

oldfs = FS

oldrs = RS

olddol0 = $0

using_fw = (PROCINFO["FS"] == "FIELDWIDTHS")

using_fpat = (PROCINFO["FS"] == "FPAT")

FS = ":"

RS = "\n"

grcat = _gr_awklib "grcat"

while ((grcat | getline) > 0) {

if ($1 in _gr_byname)

_gr_byname[$1] = _gr_byname[$1] "," $4

else

_gr_byname[$1] = $0

if ($3 in _gr_bygid)

_gr_bygid[$3] = _gr_bygid[$3] "," $4

else

_gr_bygid[$3] = $0

n = split($4, a, "[ \t]*,[ \t]*")

for (i = 1; i <= n; i++)

if (a[i] in _gr_groupsbyuser)

_gr_groupsbyuser[a[i]] = _gr_groupsbyuser[a[i]] " " $1

else

_gr_groupsbyuser[a[i]] = $1

_gr_bycount[++_gr_count] = $0

}

close(grcat)

_gr_count = 0

_gr_inited++

FS = oldfs

if (using_fw)

FIELDWIDTHS = FIELDWIDTHS

else if (using_fpat)

FPAT = FPAT

RS = oldrs

$0 = olddol0

}

The BEGIN rule sets a private variable to the directory where grcat is stored. Because it is used to help out an awk library routine, we have chosen to put it in /usr/local/libexec/awk. You might want it to be in a different directory on your system.

These routines follow the same general outline as the user database routines (see Reading the User Database). The _gr_inited variable is used to ensure that the database is scanned no more than once. The _gr_init() function first saves FS, RS, and $0, and then sets FS and RS to the correct values for scanning the group information. It also takes care to note whether FIELDWIDTHS or FPAT is being used, and to restore the appropriate field-splitting mechanism.

The group information is stored in several associative arrays. The arrays are indexed by group name (_gr_byname), by group ID number (_gr_bygid), and by position in the database (_gr_bycount). There is an additional array indexed by username (_gr_groupsbyuser), which is a space-separated list of groups to which each user belongs.

Unlike in the user database, it is possible to have multiple records in the database for the same group. This is common when a group has a large number of members. A pair of such entries might look like the following:

tvpeople:*:101:johnny,jay,arsenio

tvpeople:*:101:david,conan,tom,joan

For this reason, _gr_init() looks to see if a group name or group ID number is already seen. If so, the usernames are simply concatenated onto the previous list of users.[68]
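The merging idiom can be demonstrated on its own. This sketch (assuming a POSIX shell with awk) feeds the two tvpeople lines through the same if/else logic that _gr_init() uses:

```shell
# When a group name repeats, append the new member list ($4)
# to the entry already saved for that name.
printf '%s\n' \
    'tvpeople:*:101:johnny,jay,arsenio' \
    'tvpeople:*:101:david,conan,tom,joan' |
awk -F: '{
    if ($1 in byname)
        byname[$1] = byname[$1] "," $4
    else
        byname[$1] = $0
}
END { print byname["tvpeople"] }'
```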

Finally, _gr_init() closes the pipeline to grcat, restores FS (and FIELDWIDTHS or FPAT, if necessary), RS, and $0, initializes _gr_count to zero (it is used later), and makes _gr_inited nonzero.

The getgrnam() function takes a group name as its argument, and if that group exists, it is returned. Otherwise, it relies on the array reference to a nonexistent element to create the element with the null string as its value:

function getgrnam(group)

{

_gr_init()

return _gr_byname[group]

}

The getgrgid() function is similar; it takes a numeric group ID and looks up the information associated with that group ID:

function getgrgid(gid)

{

_gr_init()

return _gr_bygid[gid]

}

The getgruser() function does not have a C counterpart. It takes a username and returns the list of groups that have the user as a member:

function getgruser(user)

{

_gr_init()

return _gr_groupsbyuser[user]

}

The getgrent() function steps through the database one entry at a time. It uses _gr_count to track its position in the list:

function getgrent()

{

_gr_init()

if (++_gr_count in _gr_bycount)

return _gr_bycount[_gr_count]

return ""

}

The endgrent() function resets _gr_count to zero so that getgrent() can start over again:

function endgrent()

{

_gr_count = 0

}

As with the user database routines, each function calls _gr_init() to initialize the arrays. Doing so only incurs the extra overhead of running grcat if these functions are used (as opposed to moving the body of _gr_init() into a BEGIN rule).

Most of the work is in scanning the database and building the various associative arrays. The functions that the user calls are themselves very simple, relying on awk’s associative arrays to do the work.

The id program in Printing Out User Information uses these functions.

Traversing Arrays of Arrays

Arrays of Arrays described how gawk provides arrays of arrays. In particular, any element of an array may be either a scalar or another array. The isarray() function (see Getting Type Information) lets you distinguish an array from a scalar. The following function, walk_array(), recursively traverses an array, printing the element indices and values. You call it with the array and a string representing the name of the array:

function walk_array(arr, name, i)

{

for (i in arr) {

if (isarray(arr[i]))

walk_array(arr[i], (name "[" i "]"))

else

printf("%s[%s] = %s\n", name, i, arr[i])

}

}

It works by looping over each element of the array. If any given element is itself an array, the function calls itself recursively, passing the subarray and a new string representing the current index. Otherwise, the function simply prints the element’s name, index, and value. Here is a main program to demonstrate:

BEGIN {

a[1] = 1

a[2][1] = 21

a[2][2] = 22

a[3] = 3

a[4][1][1] = 411

a[4][2] = 42

walk_array(a, "a")

}

When run, the program produces the following output:

$ gawk -f walk_array.awk

a[1] = 1

a[2][1] = 21

a[2][2] = 22

a[3] = 3

a[4][1][1] = 411

a[4][2] = 42

The function just presented simply prints the name and value of each scalar array element. However, it is easy to generalize it, by passing in the name of a function to call when walking an array. The modified function looks like this:

function process_array(arr, name, process, do_arrays, i, new_name)

{

for (i in arr) {

new_name = (name "[" i "]")

if (isarray(arr[i])) {

if (do_arrays)

@process(new_name, arr[i])

process_array(arr[i], new_name, process, do_arrays)

} else

@process(new_name, arr[i])

}

}

The arguments are as follows:

arr

The array.

name

The name of the array (a string).

process

The name of the function to call.

do_arrays

If this is true, the function can handle elements that are subarrays.

If subarrays are to be processed, that is done before walking them further.

When run with the following scaffolding, the function produces the same results as does the earlier version of walk_array():

BEGIN {

a[1] = 1

a[2][1] = 21

a[2][2] = 22

a[3] = 3

a[4][1][1] = 411

a[4][2] = 42

process_array(a, "a", "do_print", 0)

}

function do_print(name, element)

{

printf "%s = %s\n", name, element

}

Summary

§ Reading programs is an excellent way to learn Good Programming. The functions and programs provided in this chapter and the next are intended to serve that purpose.

§ When writing general-purpose library functions, put some thought into how to name any global variables so that they won’t conflict with variables from a user’s program.

§ The functions presented here fit into the following categories:

General problems

Number-to-string conversion, testing assertions, rounding, random number generation, converting characters to numbers, joining strings, getting easily usable time-of-day information, and reading a whole file in one shot

Managing datafiles

Noting datafile boundaries, rereading the current file, checking for readable files, checking for zero-length files, and treating assignments as filenames

Processing command-line options

An awk version of the standard C getopt() function

Reading the user and group databases

Two sets of routines that parallel the C library versions

Traversing arrays of arrays

Two functions that traverse an array of arrays to any depth


[58] Sadly, over 35 years later, many of the lessons taught by this book have yet to be learned by a vast number of practicing programmers.

[59] The effects are not identical. Output of the transformed record will be in all lowercase, while IGNORECASE preserves the original contents of the input record.

[60] Although all the library routines could have been rewritten to use this convention, this was not done, in order to show how our own awk programming style has evolved and to provide some basis for this discussion.

[61] gawk’s --dump-variables command-line option is useful for verifying this.

[62] This is changing; many systems use Unicode, a very large character set that includes ASCII as a subset. On systems with full Unicode support, a character can occupy up to 32 bits, making simple tests such as used here prohibitively expensive.

[63] ASCII has been extended in many countries to use the values from 128 to 255 for country-specific characters. If your system uses these extensions, you can simplify _ord_init() to loop from 0 to 255.

[64] It would be nice if awk had an assignment operator for concatenation. The lack of an explicit operator for concatenation makes string operations more difficult than they really need to be.

[65] The BEGINFILE special pattern (see The BEGINFILE and ENDFILE Special Patterns) provides an alternative mechanism for dealing with files that can’t be opened. However, the code here provides a portable solution.

[66] This function was written before gawk acquired the ability to split strings into single characters using "" as the separator. We have left it alone, as using substr() is more portable.

[67] It is often the case that password information is stored in a network database.

[68] There is a subtle problem with the code just presented. Suppose that the first time there were no names. This code adds the names with a leading comma. It also doesn’t check that there is a $4.