Preface - Effective awk Programming (2015)

Preface

Arnold Robbins

Nof Ayalon

Israel

Several kinds of tasks occur repeatedly when working with text files. You might want to extract certain lines and discard the rest. Or you may need to make changes wherever certain patterns appear, but leave the rest of the file alone. Such jobs are often easy with awk. The awk utility interprets a special-purpose programming language that makes it easy to handle simple data-reformatting jobs.
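
As a minimal sketch of the two jobs just described, the following shell fragment first extracts the lines that match a pattern and then changes only the matching lines, passing the rest through untouched. The sample records and the variable names (sample, extracted, changed) are invented for illustration and do not come from the book.

```shell
# Hypothetical sample data: three name/department records.
sample='alice sales
bob engineering
carol sales'

# Extract certain lines and discard the rest:
# a pattern with no action prints every record that matches it.
extracted=$(printf '%s\n' "$sample" | awk '/sales/')

# Make changes wherever a pattern appears, but leave the rest alone:
# sub() edits matching records; the trailing 1 prints every record.
changed=$(printf '%s\n' "$sample" |
    awk '/sales/ { sub(/sales/, "marketing") } 1')

printf '%s\n' "$extracted"
printf '%s\n' "$changed"
```

Both commands use only features that should be present in any POSIX awk, so they do not depend on gawk.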

The GNU implementation of awk is called gawk; if you invoke it with the proper options or environment variables, it is fully compatible with the POSIX[1] specification of the awk language and with the Unix version of awk maintained by Brian Kernighan. This means that all properly written awk programs should work with gawk. So most of the time, we don’t distinguish between gawk and other awk implementations.

Using awk you can:

§ Manage small, personal databases

§ Generate reports

§ Validate data

§ Produce indexes and perform other document-preparation tasks

§ Experiment with algorithms that you can adapt later to other computer languages

In addition, gawk provides facilities that make it easy to:

§ Extract bits and pieces of data for processing

§ Sort data

§ Perform simple network communications

§ Profile and debug awk programs

§ Extend the language with functions written in C or C++

This book teaches you about the awk language and how you can use it effectively. You should already be familiar with basic system commands, such as cat and ls,[2] as well as basic shell facilities, such as input/output (I/O) redirection and pipes.

Implementations of the awk language are available for many different computing environments. This book, while describing the awk language in general, also describes the particular implementation of awk called gawk (which stands for “GNU awk”). gawk runs on a broad range of Unix systems, ranging from Intel-architecture PC-based computers up through large-scale systems. gawk has also been ported to Mac OS X, Microsoft Windows (all versions), and OpenVMS.[3]

History of awk and gawk

RECIPE FOR A PROGRAMMING LANGUAGE

1 part egrep

1 part snobol

2 parts ed

3 parts C

Blend all parts well using lex and yacc. Document minimally and release.

After eight years, add another part egrep and two more parts C. Document very well and release.

The name awk comes from the initials of its designers: Alfred V. Aho, Peter J. Weinberger, and Brian W. Kernighan. The original version of awk was written in 1977 at AT&T Bell Laboratories. In 1985, a new version made the programming language more powerful, introducing user-defined functions, multiple input streams, and computed regular expressions. This new version became widely available with Unix System V Release 3.1 (1987). The version in System V Release 4 (1989) added some new features and cleaned up the behavior in some of the “dark corners” of the language. The specification for awk in the POSIX Command Language and Utilities standard further clarified the language. Both the gawk designers and the original awk designers at Bell Laboratories provided feedback for the POSIX specification.

Paul Rubin wrote gawk in 1986. Jay Fenlason completed it, with advice from Richard Stallman. John Woods contributed parts of the code as well. In 1988 and 1989, David Trueman, with help from me, thoroughly reworked gawk for compatibility with the newer awk. Circa 1994, I became the primary maintainer. Current development focuses on bug fixes, performance improvements, standards compliance, and, occasionally, new features.

In May 1997, Jürgen Kahrs felt the need for network access from awk, and with a little help from me, set about adding features to do this for gawk. At that time, he also wrote the bulk of TCP/IP Internetworking with gawk (a separate document, available as part of the gawk distribution). His code finally became part of the main gawk distribution with gawk version 3.1.

John Haque rewrote the gawk internals, in the process providing an awk-level debugger. This version became available as gawk version 4.0 in 2011.

See Major Contributors to gawk for a full list of those who have made important contributions to gawk.

A Rose by Any Other Name

The awk language has evolved over the years. Full details are provided in Appendix A. The language described in this book is often referred to as “new awk.” By analogy, the original version of awk is referred to as “old awk.”

On most current systems, when you run the awk utility you get some version of new awk.[4] If your system’s standard awk is the old one, you will see something like this if you try the test program:

$ awk 1 /dev/null

error→ awk: syntax error near line 1

error→ awk: bailing out near line 1

In this case, you should find a version of new awk, or just install gawk!
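
The same test can be scripted. The sketch below assumes that an old awk exits with a nonzero status when it reports the syntax error shown above; the variable name awk_kind is made up for this example.

```shell
# Run the test program from the text. New awk accepts the pattern 1,
# reads nothing from /dev/null, and exits successfully.
if awk 1 /dev/null 2>/dev/null; then
    awk_kind="new awk"
else
    # Assumption: old awk fails here with a nonzero exit status.
    awk_kind="old awk (consider installing gawk)"
fi

echo "$awk_kind"
```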

Throughout this book, whenever we refer to a language feature that should be available in any complete implementation of POSIX awk, we simply use the term awk. When referring to a feature that is specific to the GNU implementation, we use the term gawk.

Using This Book

The term awk refers to a particular program as well as to the language you use to tell this program what to do. When we need to be careful, we call the language “the awk language,” and the program “the awk utility.” This book explains both how to write programs in the awk language and how to run the awk utility. The term “awk program” refers to a program written by you in the awk programming language.

Primarily, this book explains the features of awk as defined in the POSIX standard. It does so in the context of the gawk implementation. While doing so, it also attempts to describe important differences between gawk and other awk implementations. Finally, it notes any gawk features that are not in the POSIX standard for awk.

This book has the difficult task of being both a tutorial and a reference. If you are a novice, feel free to skip over details that seem too complex. You should also ignore the many cross-references; they are for the expert user and for the online Info and HTML versions of the book.

There are sidebars scattered throughout the book. They add a more complete explanation of points that are relevant, but not likely to be of interest on first reading.

Most of the time, the examples use complete awk programs. Some of the more advanced sections show only the part of the awk program that illustrates the concept being described.

Although this book is aimed principally at people who have not been exposed to awk, there is a lot of information here that even the awk expert should find useful. In particular, the description of POSIX awk and the example programs in Chapter 10 and Chapter 11 should be of interest.

This book is split into several parts, as follows:

§ Part I describes the awk language and the gawk program in detail. It starts with the basics, and continues through all of the features of awk. It contains the following chapters:

§ Chapter 1, Getting Started with awk, provides the essentials you need to know to begin using awk.

§ Chapter 2, Running awk and gawk, describes how to run gawk, the meaning of its command-line options, and how it finds awk program source files.

§ Chapter 3, Regular Expressions, introduces regular expressions in general, and in particular the flavors supported by POSIX awk and gawk.

§ Chapter 4, Reading Input Files, describes how awk reads your data. It introduces the concepts of records and fields, as well as the getline command. I/O redirection is first described here. Network I/O is also briefly introduced here.

§ Chapter 5, Printing Output, describes how awk programs can produce output with print and printf.

§ Chapter 6, Expressions, describes expressions, which are the basic building blocks for getting most things done in a program.

§ Chapter 7, Patterns, Actions, and Variables, describes how to write patterns for matching records, actions for doing something when a record is matched, and the predefined variables awk and gawk use.

§ Chapter 8, Arrays in awk, covers awk’s one and only data structure: the associative array. Deleting array elements and whole arrays is described, as well as sorting arrays in gawk. The chapter also describes how gawk provides arrays of arrays.

§ Chapter 9, Functions, describes the built-in functions awk and gawk provide, as well as how to define your own functions. It also discusses how gawk lets you call functions indirectly.

§ Part II shows how to use awk and gawk for problem solving. There is lots of code here for you to read and learn from. This part contains the following chapters:

§ Chapter 10, A Library of awk Functions, provides a number of functions meant to be used from main awk programs.

§ Chapter 11, Practical awk Programs, provides many sample awk programs.

Reading these two chapters allows you to see awk solving real problems.

§ Part III focuses on features specific to gawk. It contains the following chapters:

§ Chapter 12, Advanced Features of gawk, describes a number of advanced features. Of particular note are the abilities to control the order of array traversal, have two-way communications with another process, perform TCP/IP networking, and profile your awk programs.

§ Chapter 13, Internationalization with gawk, describes special features for translating program messages into different languages at runtime.

§ Chapter 14, Debugging awk Programs, describes the gawk debugger.

§ Chapter 15, Arithmetic and Arbitrary-Precision Arithmetic with gawk, describes advanced arithmetic facilities.

§ Chapter 16, Writing Extensions for gawk, describes how to add new variables and functions to gawk by writing extensions in C or C++.

§ Part IV provides the following appendices, including the GNU General Public License:

§ Appendix A describes how the awk language has evolved since its first release to the present. It also describes how gawk has acquired features over time.

§ Appendix B describes how to get gawk, how to compile it on POSIX-compatible systems, and how to compile and use it on different non-POSIX systems. It also describes how to report bugs in gawk and where to get other freely available awk implementations.

§ Appendix C presents the license that covers the gawk source code.

The version of this book distributed with gawk contains additional appendices and other end material. To save space, we have omitted them from the printed edition. You may find them online, as follows:

§ The appendix on implementation notes describes how to disable gawk’s extensions, how to contribute new code to gawk, where to find information on some possible future directions for gawk development, and the design decisions behind the extension API.

§ The appendix on basic concepts provides some very cursory background material for those who are completely unfamiliar with computer programming.

§ The glossary defines most, if not all, of the significant terms used throughout the book. If you find terms that you aren’t familiar with, try looking them up here.

§ The GNU FDL is the license that covers this book.

Some of the chapters have exercise sections; these have also been omitted from the print edition but are available online.

Typographical Conventions

This book is written in Texinfo, the GNU documentation formatting language. A single Texinfo source file is used to produce both the printed and online versions of the documentation. Because of this, the typographical conventions are slightly different from those in other books you may have read.

Examples you would type at the command line are preceded by the common shell primary and secondary prompts, ‘$’ and ‘>’. Input that you type is shown like this. Output from the command, usually its standard output, appears like this. Error messages and other output on the command’s standard error are preceded by the glyph “error→”. For example:

$ echo hi on stdout

hi on stdout

$ echo hello on stderr 1>&2

error→ hello on stderr
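
An awk program can write to either stream as well. The sketch below assumes that the awk in use (or the operating system) provides the special filename /dev/stderr, which is not guaranteed everywhere; the variable names only_stdout and only_stderr are invented for this example.

```shell
# Capture standard output only; standard error is discarded.
only_stdout=$(awk 'BEGIN {
    print "hi on stdout"
    print "hello on stderr" > "/dev/stderr"   # assumes /dev/stderr exists
}' 2>/dev/null)

# Capture standard error only: point stderr at the capture first,
# then send standard output to /dev/null.
only_stderr=$(awk 'BEGIN {
    print "hi on stdout"
    print "hello on stderr" > "/dev/stderr"   # assumes /dev/stderr exists
}' 2>&1 >/dev/null)

echo "stdout: $only_stdout"
echo "stderr: $only_stderr"
```

Where /dev/stderr is unavailable, piping the message to a command such as "cat 1>&2" from within the awk program is a common fallback.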

In the text, almost anything related to programming, such as command names, variable and function names, and string, numeric, and regexp constants, appears in this font. Code fragments appear in the same font and quoted, ‘like this’. Things that are replaced by the user or programmer appear in this font. Options look like this: -f. Filenames are indicated like this: /path/to/ourfile. The first occurrence of a new term is usually its definition and appears in the same font as the previous occurrence of “definition” in this sentence.

Characters that you type at the keyboard look like this. In particular, there are special characters called “control characters.” These are characters that you type by holding down both the CONTROL key and another key, at the same time. For example, a Ctrl-d is typed by first pressing and holding the CONTROL key, next pressing the d key, and finally releasing both keys.

For the sake of brevity, throughout this book, we refer to Brian Kernighan’s version of awk as “BWK awk.” (See Other Freely Available awk Implementations for information on his and other versions.)

NOTE

Notes of interest look like this.

CAUTION

Cautionary or warning notes look like this.

Dark Corners

Dark corners are basically fractal—no matter how much you illuminate, there’s always a smaller but darker one.

—Brian Kernighan

Until the POSIX standard (and Effective awk Programming), many features of awk were either poorly documented or not documented at all. Descriptions of such features (often called “dark corners”) are noted in this book with “(d.c.).”

But, as noted by the opening quote, any coverage of dark corners is by definition incomplete.

Extensions to the standard awk language that are supported by more than one awk implementation are marked “(c.e.)” for “common extension.”

The GNU Project and This Book

The Free Software Foundation (FSF) is a nonprofit organization dedicated to the production and distribution of freely distributable software. It was founded by Richard M. Stallman, the author of the original Emacs editor. GNU Emacs is the most widely used version of Emacs today.

The GNU[5] Project is an ongoing effort on the part of the Free Software Foundation to create a complete, freely distributable, POSIX-compliant computing environment. The FSF uses the GNU General Public License (GPL) to ensure that its software’s source code is always available to the end user. The GPL applies to the C language source code for gawk. To find out more about the FSF and the GNU Project online, see the GNU Project’s home page. This book may also be read from GNU’s website.

The book you are reading is actually free—at least, the information in it is free to anyone. The machine-readable source code for the book comes with gawk.

The book itself has gone through multiple previous editions. Paul Rubin wrote the very first draft of The GAWK Manual; it was around 40 pages long. Diane Close and Richard Stallman improved it, yielding a version that was around 90 pages and barely described the original, “old” version of awk.

I started working with that version in the fall of 1988. As work on it progressed, the FSF published several preliminary versions (numbered 0.x). In 1996, edition 1.0 was released with gawk 3.0.0. The FSF published the first two editions under the title The GNU Awk User’s Guide. SSC published two editions of the book under the title Effective awk Programming, and O’Reilly published the third edition in 2001.

This edition maintains the basic structure of the previous editions. For FSF edition 4.0, the content was thoroughly reviewed and updated. All references to gawk versions prior to 4.0 were removed. Of significant note for that edition was the addition of Chapter 14.

For FSF edition 4.1 (the fourth edition as published by O’Reilly), the content has been reorganized into parts, and the major new additions are Chapter 15 and Chapter 16.

This book will undoubtedly continue to evolve. If you find an error in the book, please report it! See Reporting Problems and Bugs for information on submitting problem reports electronically.

How to Stay Current

You may have a newer version of gawk than the one described here. To find out what has changed, you should first look at the NEWS file in the gawk distribution, which provides a high-level summary of the changes in each release.

You can then look at the online version of this book to read about any new features.


[1] The 2008 POSIX standard is accessible online.

[2] These utilities are available on POSIX-compliant systems, as well as on traditional Unix-based systems. If you are using some other operating system, you still need to be familiar with the ideas of I/O redirection and pipes.

[3] Some other, obsolete systems to which gawk was once ported are no longer supported and the code for those systems has been removed.

[4] Only Solaris systems still use an old awk for the default awk utility. A more modern awk lives in /usr/xpg6/bin on these systems.

[5] GNU stands for “GNU’s Not Unix.”