The Evolution of the awk Language - Appendices - Effective awk Programming (2015)

Part IV. Appendices

Part IV contains three appendices, the last of which is the license that covers the gawk source code:

§ Appendix A

§ Appendix B

§ Appendix C

Appendix A. The Evolution of the awk Language

This book describes the GNU implementation of awk, which follows the POSIX specification. Many longtime awk users learned awk programming with the original awk implementation in Version 7 Unix. (This implementation was the basis for awk in Berkeley Unix, through 4.3-Reno. Subsequent versions of Berkeley Unix, and, for a while, some systems derived from 4.4BSD-Lite, used various versions of gawk for their awk.) This appendix briefly describes the evolution of the awk language, with cross-references to other parts of the book where you can find more information.

To save space, we have omitted information on the history of features in gawk from this edition. You can find it in the online documentation.

Major Changes Between V7 and SVR3.1

The awk language evolved considerably between the release of Version 7 Unix (1978) and the new version that was first made generally available in System V Release 3.1 (1987). This section summarizes the changes, with cross-references to further details:

§ The requirement for ‘;’ to separate rules on a line (see awk Statements Versus Lines)

§ User-defined functions and the return statement (see User-Defined Functions)

§ The delete statement (see The delete Statement)

§ The do-while statement (see The do-while Statement)

§ The built-in functions atan2(), cos(), sin(), rand(), and srand() (see Numeric Functions)

§ The built-in functions gsub(), sub(), and match() (see String-Manipulation Functions)

§ The built-in functions close() and system() (see Input/Output Functions)

§ The ARGC, ARGV, FNR, RLENGTH, RSTART, and SUBSEP predefined variables (see Predefined Variables)

§ Assignable $0 (see Changing the Contents of a Field)

§ The conditional expression using the ternary operator ‘?:’ (see Conditional Expressions)

§ The expression ‘indx in array’ outside of for statements (see Referring to an Array Element)

§ The exponentiation operator ‘^’ (see Arithmetic Operators) and its assignment operator form ‘^=’ (see Assignment Expressions)

§ C-compatible operator precedence, which breaks some old awk programs (see Operator Precedence (How Operators Nest))

§ Regexps as the value of FS (see Specifying How Fields Are Separated) and as the third argument to the split() function (see String-Manipulation Functions), rather than using only the first character of FS

§ Dynamic regexps as operands of the ‘~’ and ‘!~’ operators (see Using Dynamic Regexps)

§ The escape sequences ‘\b’, ‘\f’, and ‘\r’ (see Escape Sequences)

§ Redirection of input for the getline function (see Explicit Input with getline)

§ Multiple BEGIN and END rules (see The BEGIN and END Special Patterns)

§ Multidimensional arrays (see Multidimensional Arrays)
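Several of the additions listed above can be seen working together in one short program. The following is only an illustrative sketch: the function names new_awk_demo and plural, and the sample input, are invented for this example and do not come from any awk distribution. It should run under any POSIX-compatible awk.

```shell
# Hypothetical demo of features that arrived with the 1987 "new awk".
new_awk_demo() {
    printf 'apple banana\napple cherry\n' | awk '
    # User-defined function with a return statement; suffix is a local variable.
    function plural(n, word,    suffix) {
        suffix = (n == 1) ? "" : "s"             # ternary operator ?:
        return n " " word suffix
    }
    { for (i = 1; i <= NF; i++) count[$i]++ }
    END {
        delete count["cherry"]                   # delete statement
        if (!("cherry" in count))                # "in" outside a for loop
            print "cherry removed"
        print plural(count["apple"], "apple")
        print plural(count["banana"], "banana")
        s = "foo bar foo"
        n = gsub(/foo/, "baz", s)                # gsub() returns the number of substitutions
        print n, s
        if (match("xxabcxx", /abc/))             # match() sets RSTART and RLENGTH
            print RSTART, RLENGTH
        print 2 ^ 10                             # exponentiation operator
    }'
}
new_awk_demo
```

None of these constructs is accepted by the V7 awk, which is why programs written for "new" awk could not be run on the original implementation.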

Changes Between SVR3.1 and SVR4

The System V Release 4 (1989) version of Unix awk added these features (some of which originated in gawk):

§ The ENVIRON array (see Predefined Variables)

§ Multiple -f options on the command line (see Command-Line Options)

§ The -v option for assigning variables before program execution begins (see Command-Line Options)

§ The -- signal for terminating command-line options

§ The ‘\a’, ‘\v’, and ‘\x’ escape sequences (see Escape Sequences)

§ A defined return value for the srand() built-in function (see Numeric Functions)

§ The toupper() and tolower() built-in string functions for case translation (see String-Manipulation Functions)

§ A cleaner specification for the ‘%c’ format-control letter in the printf function (see Format-Control Letters)

§ The ability to dynamically pass the field width and precision ("%*.*d") in the argument list of printf and sprintf() (see Format-Control Letters)

§ The use of regexp constants, such as /foo/, as expressions, where they are equivalent to using the matching operator, as in ‘$0 ~ /foo/’ (see Using Regular Expression Constants)

§ Processing of escape sequences inside command-line variable assignments (see Assigning variables on the command line)
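As a rough illustration of the SVR4 additions, the sketch below combines the -v option, the ENVIRON array, toupper(), a dynamic field width, the ‘%c’ format, and a regexp constant used as an expression. The function name svr4_demo, the GREETING environment variable, and the input lines are assumptions made up for the example.

```shell
# Hypothetical demo of features added in the System V Release 4 awk.
svr4_demo() {
    printf 'foo bar\nbaz\nfood\n' |
    GREETING=hello awk -v width=8 '
    BEGIN {
        print toupper(ENVIRON["GREETING"])       # ENVIRON array and toupper()
        printf "[%*d]\n", width, 42              # field width taken from the argument list
        printf "[%c%c]\n", 65, "xyz"             # %c: a number gives that character, a string gives its first character
    }
    /foo/                                        # bare regexp constant, same as $0 ~ /foo/
    '
}
svr4_demo
```

Because width is assigned with -v, its value is already available inside the BEGIN rule, before any input is read.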

Changes Between SVR4 and POSIX awk

The POSIX Command Language and Utilities standard for awk (1992) introduced the following changes into the language:

§ The use of -W for implementation-specific options (see Command-Line Options)

§ The use of CONVFMT for controlling the conversion of numbers to strings (see Conversion of Strings and Numbers)

§ The concept of a numeric string and tighter comparison rules to go with it (see Variable Typing and Comparison Expressions)

§ The use of predefined variables as function parameter names is forbidden (see Function Definition Syntax)

§ More complete documentation of many of the previously undocumented features of the language
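The effect of CONVFMT and of the numeric-string comparison rules is easiest to see side by side. This is a minimal sketch, assuming a POSIX-compatible awk; the function name posix_demo and the input values are invented for the example.

```shell
# Hypothetical demo of CONVFMT and numeric-string comparison.
posix_demo() {
    echo '10 9' | awk '
    BEGIN { CONVFMT = "%.2g" }
    {
        x = 3.14159
        s = x ""                       # number-to-string conversion uses CONVFMT
        print s
        print x                        # print uses OFMT instead (default %.6g)
        # Both fields look like numbers, so they are compared numerically: 10 > 9.
        print ($1 > $2)
        # Concatenating "" forces string comparison: "10" sorts before "9".
        print (($1 "") > ($2 ""))
    }'
}
posix_demo
```

The parentheses around each comparison matter: without them, awk would treat ‘>’ in a print statement as an output redirection.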

In 2012, a number of extensions that had been commonly available for many years were finally added to POSIX. They are:

§ The fflush() built-in function for flushing buffered output (see Input/Output Functions)

§ The nextfile statement (see The nextfile Statement)

§ The ability to delete all of an array at once with ‘delete array’ (see The delete Statement)
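The three 2012 additions can be tried together as follows. This is a sketch only: the function name posix2012_demo and the temporary files are invented for the example, mktemp -d is assumed to be available, and very old awk implementations that predate these extensions will reject the program.

```shell
# Hypothetical demo of nextfile, whole-array delete, and fflush().
posix2012_demo() {
    dir=$(mktemp -d) || return 1
    printf 'a1\na2\na3\n' > "$dir/a.txt"
    printf 'b1\nb2\n' > "$dir/b.txt"
    awk '
    { records++ }
    FNR == 1 { seen[FILENAME] = $0 }
    FNR == 2 { nextfile }             # skip the rest of the current file
    END {
        print "records read:", records
        n = 0
        for (f in seen) n++
        print "files seen:", n
        delete seen                   # remove every element at once
        n = 0
        for (f in seen) n++
        print "after delete:", n
        fflush()                      # flush buffered output
    }' "$dir/a.txt" "$dir/b.txt"
    rm -rf "$dir"
}
posix2012_demo
```

The third line of a.txt is never read, because nextfile abandons the file as soon as its second record has been processed.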

See Common Extensions Summary for a list of common extensions not permitted by the POSIX standard.

The 2008 POSIX standard can be found online at http://www.opengroup.org/onlinepubs/9699919799/.

Extensions in Brian Kernighan’s awk

Brian Kernighan has made his version available via his home page (see Other Freely Available awk Implementations).

This section describes common extensions that originally appeared in his version of awk:

§ The ‘**’ and ‘**=’ operators (see Arithmetic Operators and Assignment Expressions)

§ The use of func as an abbreviation for function (see Function Definition Syntax)

§ The fflush() built-in function for flushing buffered output (see Input/Output Functions)

See Common Extensions Summary for a full list of the extensions available in his awk.

Extensions in gawk Not in POSIX awk

The GNU implementation, gawk, adds a large number of features. They can all be disabled with either the --traditional or the --posix option (see Command-Line Options).

A number of features have come and gone over the years. This section summarizes the additional features over POSIX awk that are in the current version of gawk.

§ Additional predefined variables:

§ The ARGIND, BINMODE, ERRNO, FIELDWIDTHS, FPAT, IGNORECASE, LINT, PROCINFO, RT, and TEXTDOMAIN variables (see Predefined Variables)

§ Special files in I/O redirections:

§ The /dev/stdin, /dev/stdout, /dev/stderr, and /dev/fd/N special filenames (see Special Filenames in gawk)

§ The /inet, /inet4, and /inet6 special files for TCP/IP networking using ‘|&’, where the choice of file specifies which version of the IP protocol to use (see Using gawk for Network Programming)

§ Changes and/or additions to the language:

§ The ‘\x’ escape sequence (see Escape Sequences)

§ Full support for both POSIX and GNU regexps (see Chapter 3)

§ The ability for FS and for the third argument to split() to be null strings (see Making Each Character a Separate Field)

§ The ability for RS to be a regexp (see How Input Is Split into Records)

§ The ability to use octal and hexadecimal constants in awk program source code (see Octal and hexadecimal numbers)

§ The ‘|&’ operator for two-way I/O to a coprocess (see Two-Way Communications with Another Process)

§ Indirect function calls (see Indirect Function Calls)

§ Directories on the command line produce a warning and are skipped (see Directories on the Command Line)

§ New keywords:

§ The BEGINFILE and ENDFILE special patterns (see The BEGINFILE and ENDFILE Special Patterns)

§ The switch statement (see The switch Statement)

§ Changes to standard awk functions:

§ The optional second argument to close() that allows closing one end of a two-way pipe to a coprocess (see Two-Way Communications with Another Process)

§ POSIX compliance for gsub() and sub() with --posix

§ The length() function accepts an array argument and returns the number of elements in the array (see String-Manipulation Functions)

§ The optional third argument to the match() function for capturing text-matching subexpressions within a regexp (see String-Manipulation Functions)

§ Positional specifiers in printf formats for making translations easier (see Rearranging printf Arguments)

§ The split() function’s additional optional fourth argument, which is an array to hold the text of the field separators (see String-Manipulation Functions)

§ Additional functions only in gawk:

§ The gensub(), patsplit(), and strtonum() functions for more powerful text manipulation (see String-Manipulation Functions)

§ The asort() and asorti() functions for sorting arrays (see Controlling Array Traversal and Array Sorting)

§ The mktime(), systime(), and strftime() functions for working with timestamps (see Time Functions)

§ The and(), compl(), lshift(), or(), rshift(), and xor() functions for bit manipulation (see Bit-Manipulation Functions)

§ The isarray() function to check if a variable is an array or not (see Getting Type Information)

§ The bindtextdomain(), dcgettext(), and dcngettext() functions for internationalization (see Internationalizing awk Programs)

§ Changes and/or additions in the command-line options:

§ The AWKPATH environment variable for specifying a path search for the -f command-line option (see Command-Line Options)

§ The AWKLIBPATH environment variable for specifying a path search for the -l command-line option (see Command-Line Options)

§ The -b, -c, -C, -d, -D, -e, -E, -g, -h, -i, -l, -L, -M, -n, -N, -o, -O, -p, -P, -r, -S, -t, and -V short options. Also, the ability to use GNU-style long-named options that start with --; and the --assign, --bignum, --characters-as-bytes, --copyright, --debug, --dump-variables, --exec, --field-separator, --file, --gen-pot, --help, --include, --lint, --lint-old, --load, --non-decimal-data, --optimize, --posix, --pretty-print, --profile, --re-interval, --sandbox, --source, --traditional, --use-lc-numeric, and --version long options (see Command-Line Options)

§ Support for the following obsolete systems was removed from the code and the documentation for gawk version 4.0:

§ Amiga

§ Atari

§ BeOS

§ Cray

§ MIPS RiscOS

§ MS-DOS with the Microsoft Compiler

§ MS-Windows with the Microsoft Compiler

§ NeXT

§ SunOS 3.x, Sun 386 (Road Runner)

§ Tandem (non-POSIX)

§ Prestandard VAX C compiler for VAX/VMS

§ GCC for VAX and Alpha has not been tested for a while.

§ Support for the following obsolete system was removed from the code for gawk version 4.1:

§ Ultrix
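A few of the gawk-only functions listed in this section can be exercised with a short script. This is a hedged sketch rather than a reference: it requires gawk to be installed under that name, the wrapper function gawk_demo and the sample strings are invented for the example, and the wrapper simply prints a notice when gawk is missing.

```shell
# Hypothetical demo of some gawk-only function features.
gawk_demo() {
    command -v gawk >/dev/null 2>&1 || { echo "gawk not installed"; return 0; }
    echo 'key=value' | gawk '
    {
        # Third argument to match(): captured subexpressions land in the array.
        if (match($0, /([a-z]+)=([a-z]+)/, m))
            print m[1], m[2]
        # gensub() returns the result instead of modifying its target.
        print gensub(/(.+)=(.+)/, "\\2=\\1", 1)
        # Fourth argument to split(): the separators that were found.
        n = split("a1b22c", parts, /[0-9]+/, seps)
        print n, seps[1], seps[2]
        print length(parts)            # length() of an array
        print strtonum("0x1F")         # hexadecimal string to number
    }'
}
gawk_demo
```

Running the same program with --traditional or --posix disables these extensions, so it is not expected to work in those modes.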

Common Extensions Summary

The following table summarizes the common extensions supported by gawk, Brian Kernighan’s awk, and mawk, the three most widely used freely available versions of awk (see Other Freely Available awk Implementations).

Feature                      BWK awk   mawk   gawk   Now standard
‘\x’ escape sequence         X         X      X      -
FS as null string            X         X      X      -
/dev/stdin special file      X         X      X      -
/dev/stdout special file     X         X      X      -
/dev/stderr special file     X         X      X      -
delete without subscript     X         X      X      X
fflush() function            X         X      X      X
length() of an array         X         X      X      -
nextfile statement           X         X      X      X
** and **= operators         X         -      X      -
func keyword                 X         -      X      -
BINMODE variable             -         X      X      -
RS as regexp                 -         X      X      -
Time-related functions       -         X      X      -

Regexp Ranges and Locales: A Long Sad Story

This section describes the confusing history of ranges within regular expressions and their interactions with locales, and how this affected different versions of gawk.

The original Unix tools that worked with regular expressions defined character ranges (such as ‘[a-z]’) to match any character between the first character in the range and the last character in the range, inclusive. Ordering was based on the numeric value of each character in the machine’s native character set. Thus, on ASCII-based systems, ‘[a-z]’ matched all the lowercase letters, and only the lowercase letters, as the numeric values for the letters from ‘a’ through ‘z’ were contiguous. (On an EBCDIC system, the range ‘[a-z]’ includes additional nonalphabetic characters as well.)

Almost all introductory Unix literature explained range expressions as working in this fashion, and in particular, would teach that the “correct” way to match lowercase letters was with ‘[a-z]’, and that ‘[A-Z]’ was the “correct” way to match uppercase letters. And indeed, this was true.[104]

The 1992 POSIX standard introduced the idea of locales (see Where You Are Makes a Difference). Because many locales include other letters besides the plain 26 letters of the English alphabet, the POSIX standard added character classes (see Using Bracket Expressions) as a way to match different kinds of characters besides the traditional ones in the ASCII character set.

However, the standard changed the interpretation of range expressions. In the "C" and "POSIX" locales, a range expression like ‘[a-dx-z]’ is still equivalent to ‘[abcdxyz]’, as in ASCII. But outside those locales, the ordering was defined to be based on collation order.

What does that mean? In many locales, ‘A’ and ‘a’ are both less than ‘B’. In other words, these locales sort characters in dictionary order, and ‘[a-dx-z]’ is typically not equivalent to ‘[abcdxyz]’; instead, it might be equivalent to ‘[ABCXYabcdxyz]’, for example.

This point needs to be emphasized: much literature teaches that you should use ‘[a-z]’ to match a lowercase character. But on systems with non-ASCII locales, this also matches all of the uppercase characters except ‘A’ or ‘Z’! This was a continuous cause of confusion, even well into the twenty-first century.

To demonstrate these issues, the following example uses the sub() function, which does text replacement (see String-Manipulation Functions). Here, the intent is to remove trailing uppercase characters:

$ echo something1234abc | gawk-3.1.8 '{ sub("[A-Z]*$", ""); print }'

something1234a

This output is unexpected, as the ‘bc’ at the end of ‘something1234abc’ should not normally match ‘[A-Z]*’. This result is due to the locale setting (and thus you may not see it on your system).
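Two ways of sidestepping the problem in a portable script are sketched here: forcing the "C" locale, where ranges keep their ASCII meaning, or using a POSIX character class instead of a range. The function names strip_upper_c_locale and strip_upper_class are invented for the example, and the second form assumes an awk that supports POSIX character classes.

```shell
# Hypothetical helpers: remove trailing uppercase letters from each input line.
strip_upper_c_locale() {
    # Forcing the "C" locale gives ranges their traditional ASCII meaning.
    LC_ALL=C awk '{ sub(/[A-Z]*$/, ""); print }'
}
strip_upper_class() {
    # A POSIX character class names the set of letters directly.
    awk '{ sub(/[[:upper:]]*$/, ""); print }'
}
echo something1234abc | strip_upper_c_locale
echo something1234ABC | strip_upper_c_locale
echo something1234abc | strip_upper_class
```

With either approach the lowercase ‘bc’ is left alone, and only genuinely uppercase trailing letters are removed.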

Similar considerations apply to other ranges. For example, ‘["-/]’ is perfectly valid in ASCII, but is not valid in many Unicode locales, such as en_US.UTF-8.

Early versions of gawk used regexp matching code that was not locale-aware, so ranges had their traditional interpretation.

When gawk switched to using locale-aware regexp matchers, the problems began; especially as both GNU/Linux and commercial Unix vendors started implementing non-ASCII locales, and making them the default. Perhaps the most frequently asked question became something like, “Why does ‘[A-Z]’ match lowercase letters?!?”

This situation existed for close to 10 years, if not more, and the gawk maintainer grew weary of trying to explain that gawk was being nicely standards-compliant, and that the issue was in the user’s locale. During the development of version 4.0, he modified gawk to always treat ranges in the original, pre-POSIX fashion, unless --posix was used (see Command-Line Options).[105]

Fortunately, shortly before the final release of gawk 4.0, the maintainer learned that the 2008 standard had changed the definition of ranges, such that outside the "C" and "POSIX" locales, the meaning of range expressions was undefined.[106]

By using this lovely technical term, the standard gives license to implementors to implement ranges in whatever way they choose. The gawk maintainer chose to apply the pre-POSIX meaning in all cases: with the default regexp matching, with --traditional, and with --posix. In all cases, gawk remains POSIX-compliant.

Major Contributors to gawk

Always give credit where credit is due.

—Anonymous

This section names the major contributors to gawk and/or this book, in approximate chronological order:

§ Dr. Alfred V. Aho, Dr. Peter J. Weinberger, and Dr. Brian W. Kernighan, all of Bell Laboratories, designed and implemented Unix awk, from which gawk gets the majority of its feature set.

§ Paul Rubin did the initial design and implementation in 1986, and wrote the first draft (around 40 pages) of this book.

§ Jay Fenlason finished the initial implementation.

§ Diane Close revised the first draft of this book, bringing it to around 90 pages.

§ Richard Stallman helped finish the implementation and the initial draft of this book. He is also the founder of the FSF and the GNU Project.

§ John Woods contributed parts of the code (mostly fixes) in the initial version of gawk.

§ In 1988, David Trueman took over primary maintenance of gawk, making it compatible with “new” awk, and greatly improving its performance.

§ Conrad Kwok, Scott Garfinkle, and Kent Williams did the initial ports to MS-DOS with various versions of MSC.

§ Pat Rankin provided the VMS port and its documentation.

§ Hal Peterson provided help in porting gawk to Cray systems. (This is no longer supported.)

§ Kai Uwe Rommel provided the initial port to OS/2 and its documentation.

§ Michal Jaegermann provided the port to Atari systems and its documentation. (This port is no longer supported.) He continues to provide portability checking, and has done a lot of work to make sure gawk works on non-32-bit systems.

§ Fred Fish provided the port to Amiga systems and its documentation. (With Fred’s sad passing, this is no longer supported.)

§ Scott Deifik currently maintains the MS-DOS port using DJGPP.

§ Eli Zaretskii currently maintains the MS-Windows port using MinGW.

§ Juan Grigera provided a port to Windows32 systems. (This is no longer supported.)

§ For many years, Dr. Darrel Hankerson acted as coordinator for the various ports to different PC platforms and created binary distributions for various PC operating systems. He was also instrumental in keeping the documentation up to date for the various PC platforms.

§ Christos Zoulas provided the extension() built-in function for dynamically adding new functions. (This was obsoleted at gawk 4.1.)

§ Jürgen Kahrs contributed the initial version of the TCP/IP networking code and documentation, and motivated the inclusion of the ‘|&’ operator.

§ Stephen Davies provided the initial port to Tandem systems and its documentation. (However, this is no longer supported.) He was also instrumental in the initial work to integrate the byte-code internals into the gawk code base.

§ Matthew Woehlke provided improvements for Tandem’s POSIX-compliant systems.

§ Martin Brown provided the port to BeOS and its documentation. (This is no longer supported.)

§ Arno Peters did the initial work to convert gawk to use GNU Automake and GNU gettext.

§ Alan J. Broder provided the initial version of the asort() function as well as the code for the optional third argument to the match() function.

§ Andreas Buening updated the gawk port for OS/2.

§ Isamu Hasegawa, of IBM in Japan, contributed support for multibyte characters.

§ Michael Benzinger contributed the initial code for switch statements.

§ Patrick T.J. McPhee contributed the code for dynamic loading in Windows32 environments. (This is no longer supported.)

§ Anders Wallin helped keep the VMS port going for several years.

§ Assaf Gordon contributed the code to implement the --sandbox option.

§ John Haque made the following contributions:

§ The modifications to convert gawk into a byte-code interpreter, including the debugger

§ The addition of true arrays of arrays

§ The additional modifications for support of arbitrary-precision arithmetic

§ The initial text of Chapter 15

§ The work to merge the three versions of gawk into one, for the 4.1 release

§ Improved array internals for arrays indexed by integers

§ The improved array sorting features were also driven by John, together with Pat Rankin

§ Panos Papadopoulos contributed the original text for Including Other Files into Your Program.

§ Efraim Yawitz contributed the original text for Chapter 14.

§ The development of the extension API first released with gawk 4.1 was driven primarily by Arnold Robbins and Andrew Schorr, with notable contributions from the rest of the development team.

§ John Malmberg contributed significant improvements to the OpenVMS port and the related documentation.

§ Antonio Giovanni Colombo rewrote a number of examples in the early chapters that were severely dated, for which I am incredibly grateful.

§ Arnold Robbins has been working on gawk since 1988, at first helping David Trueman, and as the primary maintainer since around 1994.

Summary

§ The awk language has evolved over time. The first release was with V7 Unix, circa 1978. In 1987, for System V Release 3.1, major additions, including user-defined functions, were made to the language. Additional changes were made for System V Release 4, in 1989. Since then, further minor changes have happened under the auspices of the POSIX standard.

§ Brian Kernighan’s awk provides a small number of extensions that are implemented in common with other versions of awk.

§ gawk provides a large number of extensions over POSIX awk. They can be disabled with either the --traditional or the --posix option.

§ The interaction of POSIX locales and regexp matching in gawk has been confusing over the years. Today, gawk implements Rational Range Interpretation, where ranges of the form ‘[a-z]’ match only the characters numerically from ‘a’ through ‘z’ in the machine’s native character set. Usually this is ASCII, but it can be EBCDIC on IBM S/390 systems.

§ Many people have contributed to gawk development over the years. We hope that the list provided in this appendix is complete and gives the appropriate credit where credit is due.


[104] And Life was good.

[105] And thus was born the Campaign for Rational Range Interpretation (or RRI). A number of GNU tools have already implemented this change, or will soon. Thanks to Karl Berry for coining the phrase “Rational Range Interpretation.”

[106] See the standard and its rationale.