Writing Extensions for gawk - Moving Beyond Standard awk with gawk - Effective awk Programming (2015)

Effective awk Programming (2015)

Part III. Moving Beyond Standard awk with gawk

Chapter 16. Writing Extensions for gawk

It is possible to add new functions written in C or C++ to gawk using dynamically loaded libraries. This facility is available on systems that support the C dlopen() and dlsym() functions. This chapter describes how to create extensions using code written in C or C++.

If you don’t know anything about C programming, you can safely skip this chapter, although you may wish to review the documentation on the extensions that come with gawk (see The Sample Extensions in the gawk Distribution) and the information on the gawkextlib project (see The gawkextlib Project). The sample extensions are automatically built and installed when gawk is.

NOTE

When --sandbox is specified, extensions are disabled (see Command-Line Options).

Introduction

An extension (sometimes called a plug-in) is a piece of external compiled code that gawk can load at runtime to provide additional functionality, over and above the built-in capabilities described in the rest of this book.

Extensions are useful because they allow you (of course) to extend gawk’s functionality. For example, they can provide access to system calls (such as chdir() to change directory) and to other C library routines that could be of use. As with most software, “the sky is the limit”; if you can imagine something that you might want to do and can write in C or C++, you can write an extension to do it!

Extensions are written in C or C++, using the application programming interface (API) defined for this purpose by the gawk developers. The rest of this chapter explains the facilities that the API provides and how to use them, and presents a small example extension. In addition, it documents the sample extensions included in the gawk distribution and describes the gawkextlib project. See http://www.gnu.org/software/gawk/manual/html_node/Extension-Design.html for a discussion of the extension mechanism goals and design.

Extension Licensing

Every dynamic extension must be distributed under a license that is compatible with the GNU GPL (see Appendix C).

In order for the extension to tell gawk that it is properly licensed, the extension must define the global symbol plugin_is_GPL_compatible. If this symbol does not exist, gawk emits a fatal error and exits when it tries to load your extension.

The declared type of the symbol should be int. It does not need to be in any allocated section, though. The code merely asserts that the symbol exists in the global scope. Something like this is enough:

int plugin_is_GPL_compatible;

How It Works at a High Level

Communication between gawk and an extension is two-way. First, when an extension is loaded, gawk passes it a pointer to a struct whose fields are function pointers. This is shown in Figure 16-1.

Loading the extension

Figure 16-1. Loading the extension

The extension can call functions inside gawk through these function pointers, at runtime, without needing (link-time) access to gawk’s symbols. One of these function pointers is to a function for “registering” new functions. This is shown in Figure 16-2.

Registering a new function

Figure 16-2. Registering a new function

In the other direction, the extension registers its new functions with gawk by passing function pointers to the functions that provide the new feature (do_chdir(), for example). gawk associates the function pointer with a name and can then call it, using a defined calling convention. This is shown in Figure 16-3.

Calling the new function

Figure 16-3. Calling the new function

The do_xxx() function, in turn, then uses the function pointers in the API struct to do its work, such as updating variables or arrays, printing messages, setting ERRNO, and so on.

Convenience macros make calling through the function pointers look like regular function calls so that extension code is quite readable and understandable.

Although all of this sounds somewhat complicated, the result is that extension code is quite straightforward to write and to read. You can see this in the sample extension filefuncs.c (see Example: Some File Functions) and also in the testext.c code for testing the APIs.

Some other bits and pieces:

§ The API provides access to gawk’s do_xxx values, reflecting command-line options, like do_lint, do_profiling, and so on (see API Variables). These are informational: an extension cannot affect their values inside gawk. In addition, attempting to assign to them produces a compile-time error.

§ The API also provides major and minor version numbers, so that an extension can check if the gawk it is loaded with supports the facilities it was compiled with. (Version mismatches “shouldn’t” happen, but we all know how that goes.) See API version constants and variables for details.

API Description

C or C++ code for an extension must include the header file gawkapi.h, which declares the functions and defines the data types used to communicate with gawk. This (rather large) section describes the API in detail.

Introduction

Access to facilities within gawk is achieved by calling through function pointers passed into your extension.

API function pointers are provided for the following kinds of operations:

§ Allocating, reallocating, and releasing memory.

§ Registration functions. You may register:

§ Extension functions

§ Exit callbacks

§ A version string

§ Input parsers

§ Output wrappers

§ Two-way processors

All of these are discussed in detail later in this chapter.

§ Printing fatal, warning, and “lint” warning messages.

§ Updating ERRNO, or unsetting it.

§ Accessing parameters, including converting an undefined parameter into an array.

§ Symbol table access: retrieving a global variable, creating one, or changing one.

§ Creating and releasing cached values; this provides an efficient way to use values for multiple variables and can be a big performance win.

§ Manipulating arrays:

§ Retrieving, adding, deleting, and modifying elements

§ Getting the count of elements in an array

§ Creating a new array

§ Clearing an array

§ Flattening an array for easy C-style looping over all its indices and elements

Some points about using the API:

§ The following types, macros, and/or functions are referenced in gawkapi.h. For correct use, you must therefore include the corresponding standard header file before including gawkapi.h:

C entity

Header file

EOF

<stdio.h>

Values for errno

<errno.h>

FILE

<stdio.h>

NULL

<stddef.h>

memcpy()

<string.h>

memset()

<string.h>

size_t

<sys/types.h>

struct stat

<sys/stat.h>

§ Due to portability concerns, especially to systems that are not fully standards-compliant, it is your responsibility to include the correct files in the correct way. This requirement is necessary in order to keep gawkapi.h clean, instead of becoming a portability hodge-podge as can be seen in some parts of the gawk source code.

§ The gawkapi.h file may be included more than once without ill effect. Doing so, however, is poor coding practice.

§ Although the API only uses ISO C 90 features, there is an exception; the “constructor” functions use the inline keyword. If your compiler does not support this keyword, you should either place ‘-Dinline=''’ on your command line or use the GNU Autotools and include a config.hfile in your extensions.

§ All pointers filled in by gawk point to memory managed by gawk and should be treated by the extension as read-only. Memory for all strings passed into gawk from the extension must come from calling one of gawk_malloc(), gawk_calloc(), or gawk_realloc(), and is managed by gawk from then on.

§ The API defines several simple structs that map values as seen from awk. A value can be a double, a string, or an array (as in multidimensional arrays, or when creating a new array). String values maintain both pointer and length, because embedded NUL characters are allowed.

NOTE

By intent, strings are maintained using the current multibyte encoding (as defined by LC_xxx environment variables) and not using wide characters. This matches how gawk stores strings internally and also how characters are likely to be input into and output from files.

§ When retrieving a value (such as a parameter or that of a global variable or array element), the extension requests a specific type (number, string, scalar, value cookie, array, or “undefined”). When the request is “undefined,” the returned value will have the real underlying type.

However, if the request and actual type don’t match, the access function returns “false” and fills in the type of the actual value that is there, so that the extension can, e.g., print an error message (such as “scalar passed where array expected”).

You may call the API functions by using the function pointers directly, but the interface is not so pretty. To make extension code look more like regular code, the gawkapi.h header file defines several macros that you should use in your code. This section presents the macros as if they were functions.

General-Purpose Data Types

I have a true love/hate relationship with unions.

—Arnold Robbins

That’s the thing about unions: the compiler will arrange things so they can accommodate both love and hate.

—Chet Ramey

The extension API defines a number of simple types and structures for general-purpose use. Additional, more specialized data structures are introduced in subsequent sections, together with the functions that use them.

The general-purpose types and structures are as follows:

typedef void *awk_ext_id_t;

A value of this type is received from gawk when an extension is loaded. That value must then be passed back to gawk as the first parameter of each API function.

#define awk_const …

This macro expands to ‘const’ when compiling an extension, and to nothing when compiling gawk itself. This makes certain fields in the API data structures unwritable from extension code, while allowing gawk to use them as it needs to.

typedef enum awk_bool {
awk_false = 0,
awk_true
} awk_bool_t;

A simple Boolean type.

typedef struct awk_string {
char *str; /* data */
size_t len; /* length thereof, in chars */
} awk_string_t;

This represents a mutable string. gawk owns the memory pointed to if it supplied the value. Otherwise, it takes ownership of the memory pointed to. Such memory must come from calling one of the gawk_malloc(), gawk_calloc(), or gawk_realloc() functions!

As mentioned earlier, strings are maintained using the current multibyte encoding.

typedef enum {
AWK_UNDEFINED,
AWK_NUMBER,
AWK_STRING,
AWK_ARRAY,
AWK_SCALAR, /* opaque access to a variable */
AWK_VALUE_COOKIE /* for updating a previously created value */
} awk_valtype_t;

This enum indicates the type of a value. It is used in the following struct.

typedef struct awk_value {
awk_valtype_t val_type;
union {
awk_string_t s;
double d;
awk_array_t a;
awk_scalar_t scl;
awk_value_cookie_t vc;
} u;
} awk_value_t;

An “awk value.” The val_type member indicates what kind of value the union holds, and each member is of the appropriate type.

#define str_value u.s
#define num_value u.d
#define array_cookie u.a
#define scalar_cookie u.scl
#define value_cookie u.vc

Usings these macros makes accessing the fields of the awk_value_t more readable.

typedef void *awk_scalar_t;

Scalars can be represented as an opaque type. These values are obtained from gawk and then passed back into it. This is discussed in a general fashion in the text following this list, and in more detail in Variable access and update by cookie.

typedef void *awk_value_cookie_t;

A “value cookie” is an opaque type representing a cached value. This is also discussed in a general fashion in the text following this list, and in more detail in Creating and using cached values.

Scalar values in awk are either numbers or strings. The awk_value_t represents values. The val_type member indicates what is in the union.

Representing numbers is easy—the API uses a C double. Strings require more work. Because gawk allows embedded NUL bytes in string values, a string must be represented as a pair containing a data pointer and length. This is the awk_string_t type.

Identifiers (i.e., the names of global variables) can be associated with either scalar values or with arrays. In addition, gawk provides true arrays of arrays, where any given array element can itself be an array. Discussion of arrays is delayed until Array Manipulation.

The various macros listed earlier make it easier to use the elements of the union as if they were fields in a struct; this is a common coding practice in C. Such code is easier to write and to read, but it remains your responsibility to make sure that the val_type member correctly reflects the type of the value in the awk_value_t struct.

Conceptually, the first three members of the union (number, string, and array) are all that is needed for working with awk values. However, because the API provides routines for accessing and changing the value of a global scalar variable only by using the variable’s name, there is a performance penalty: gawk must find the variable each time it is accessed and changed. This turns out to be a real issue, not just a theoretical one.

Thus, if you know that your extension will spend considerable time reading and/or changing the value of one or more scalar variables, you can obtain a scalar cookie[95] object for that variable, and then use the cookie for getting the variable’s value or for changing the variable’s value. Theawk_scalar_t type holds a scalar cookie, and the scalar_cookie macro provides access to the value of that type in the awk_value_t struct. Given a scalar cookie, gawk can directly retrieve or modify the value, as required, without having to find it first.

The awk_value_cookie_t type and value_cookie macro are similar. If you know that you wish to use the same numeric or string value for one or more variables, you can create the value once, retaining a value cookie for it, and then pass in that value cookie whenever you wish to set the value of a variable. This saves storage space within the running gawk process and reduces the time needed to create the value.

Memory Allocation Functions and Convenience Macros

The API provides a number of memory allocation functions for allocating memory that can be passed to gawk, as well as a number of convenience macros. This subsection presents them all as function prototypes, in the way that extension code would use them:

void *gawk_malloc(size_t size);

Call the correct version of malloc() to allocate storage that may be passed to gawk.

void *gawk_calloc(size_t nmemb, size_t size);

Call the correct version of calloc() to allocate storage that may be passed to gawk.

void *gawk_realloc(void *ptr, size_t size);

Call the correct version of realloc() to allocate storage that may be passed to gawk.

void gawk_free(void *ptr);

Call the correct version of free() to release storage that was allocated with gawk_malloc(), gawk_calloc(), or gawk_realloc().

The API has to provide these functions because it is possible for an extension to be compiled and linked against a different version of the C library than was used for the gawk executable.[96] If gawk were to use its version of free() when the memory came from an unrelated version ofmalloc(), unexpected behavior would likely result.

Two convenience macros may be used for allocating storage from gawk_malloc() and gawk_realloc(). If the allocation fails, they cause gawk to exit with a fatal error message. They should be used as if they were procedure calls that do not return a value:

#define emalloc(pointer, type, size, message) …

The arguments to this macro are as follows:

pointer

The pointer variable to point at the allocated storage.

type

The type of the pointer variable. This is used to create a cast for the call to gawk_malloc().

size

The total number of bytes to be allocated.

message

A message to be prefixed to the fatal error message. Typically this is the name of the function using the macro.

For example, you might allocate a string value like so:

awk_value_t result;

char *message;

const char greet[] = "Don't Panic!";

emalloc(message, char *, sizeof(greet), "myfunc");

strcpy(message, greet);

make_malloced_string(message, strlen(message), & result);

#define erealloc(pointer, type, size, message) …

This is like emalloc(), but it calls gawk_realloc() instead of gawk_malloc(). The arguments are the same as for the emalloc() macro.

Constructor Functions

The API provides a number of constructor functions for creating string and numeric values, as well as a number of convenience macros. This subsection presents them all as function prototypes, in the way that extension code would use them:

static inline awk_value_t *
make_const_string(const char *string, size_t length,
awk_value_t *result);

This function creates a string value in the awk_value_t variable pointed to by result. It expects string to be a C string constant (or other string data), and automatically creates a copy of the data for storage in result. It returns result.

static inline awk_value_t *
make_malloced_string(const char *string, size_t length,
awk_value_t *result);

This function creates a string value in the awk_value_t variable pointed to by result. It expects string to be a ‘char *’ value pointing to data previously obtained from gawk_malloc(), gawk_calloc(), or gawk_realloc(). The idea here is that the data is passed directly togawk, which assumes responsibility for it. It returns result.

static inline awk_value_t *
make_null_string(awk_value_t *result);

This specialized function creates a null string (the “undefined” value) in the awk_value_t variable pointed to by result. It returns result.

static inline awk_value_t *
make_number(double num, awk_value_t *result);

This function simply creates a numeric value in the awk_value_t variable pointed to by result.

Registration Functions

This section describes the API functions for registering parts of your extension with gawk.

Registering an extension function

Extension functions are described by the following record:

typedef struct awk_ext_func {

const char *name;

awk_value_t *(*function)(int num_actual_args, awk_value_t *result);

size_t num_expected_args;

} awk_ext_func_t;

The fields are:

const char *name;

The name of the new function. awk-level code calls the function by this name. This is a regular C string.

Function names must obey the rules for awk identifiers. That is, they must begin with either an English letter or an underscore, which may be followed by any number of letters, digits, and underscores. Letter case in function names is significant.

awk_value_t *(*function)(int num_actual_args, awk_value_t *result);

This is a pointer to the C function that provides the extension’s functionality. The function must fill in *result with either a number or a string. gawk takes ownership of any string memory. As mentioned earlier, string memory must come from one of gawk_malloc(),gawk_calloc(), or gawk_realloc().

The num_actual_args argument tells the C function how many actual parameters were passed from the calling awk code.

The function must return the value of result. This is for the convenience of the calling code inside gawk.

size_t num_expected_args;

This is the number of arguments the function expects to receive. Each extension function may decide what to do if the number of arguments isn’t what it expected. As with real awk functions, it is likely OK to ignore extra arguments.

Once you have a record representing your extension function, you register it with gawk using this API function:

awk_bool_t add_ext_func(const char *namespace,
const awk_ext_func_t *func);

This function returns true upon success, false otherwise. The namespace parameter is currently not used; you should pass in an empty string (""). The func pointer is the address of a struct representing your function, as just described.

Registering an exit callback function

An exit callback function is a function that gawk calls before it exits. Such functions are useful if you have general “cleanup” tasks that should be performed in your extension (such as closing database connections or other resource deallocations). You can register such a function with gawkusing the following function:

void awk_atexit(void (*funcp)(void *data, int exit_status),
void *arg0);

The parameters are:

funcp

A pointer to the function to be called before gawk exits. The data parameter will be the original value of arg0. The exit_status parameter is the exit status value that gawk intends to pass to the exit() system call.

arg0

A pointer to private data that gawk saves in order to pass to the function pointed to by funcp.

Exit callback functions are called in last-in, first-out (LIFO) order—that is, in the reverse order in which they are registered with gawk.

Registering an extension version string

You can register a version string that indicates the name and version of your extension, with gawk, as follows:

void register_ext_version(const char *version);

Register the string pointed to by version with gawk. Note that gawk does not copy the version string, so it should not be changed.

gawk prints all registered extension version strings when it is invoked with the --version option.

Customized input parsers

By default, gawk reads text files as its input. It uses the value of RS to find the end of the record, and then uses FS (or FIELDWIDTHS or FPAT) to split it into fields (see Chapter 4). Additionally, it sets the value of RT (see Predefined Variables).

If you want, you can provide your own custom input parser. An input parser’s job is to return a record to the gawk record-processing code, along with indicators for the value and length of the data to be used for RT, if any.

To provide an input parser, you must first provide two functions (where XXX is a prefix name for your extension):

awk_bool_t XXX_can_take_file(const awk_input_buf_t *iobuf);

This function examines the information available in iobuf (which we discuss shortly). Based on the information there, it decides if the input parser should be used for this file. If so, it should return true. Otherwise, it should return false. It should not change any state (variable values, etc.) within gawk.

awk_bool_t XXX_take_control_of(awk_input_buf_t *iobuf);

When gawk decides to hand control of the file over to the input parser, it calls this function. This function in turn must fill in certain fields in the awk_input_buf_t structure and ensure that certain conditions are true. It should then return true. If an error of some kind occurs, it should not fill in any fields and should return false; then gawk will not use the input parser. The details are presented shortly.

Your extension should package these functions inside an awk_input_parser_t, which looks like this:

typedef struct awk_input_parser {

const char *name; /* name of parser */

awk_bool_t (*can_take_file)(const awk_input_buf_t *iobuf);

awk_bool_t (*take_control_of)(awk_input_buf_t *iobuf);

awk_const struct awk_input_parser *awk_const next; /* for gawk */

} awk_input_parser_t;

The fields are:

const char *name;

The name of the input parser. This is a regular C string.

awk_bool_t (*can_take_file)(const awk_input_buf_t *iobuf);

A pointer to your XXX_can_take_file() function.

awk_bool_t (*take_control_of)(awk_input_buf_t *iobuf);

A pointer to your XXX_take_control_of() function.

awk_const struct input_parser *awk_const next;

This is for use by gawk; therefore it is marked awk_const so that the extension cannot modify it.

The steps are as follows:

1. Create a static awk_input_parser_t variable and initialize it appropriately.

2. When your extension is loaded, register your input parser with gawk using the register_input_parser() API function (described next).

An awk_input_buf_t looks like this:

typedef struct awk_input {

const char *name; /* filename */

int fd; /* file descriptor */

#define INVALID_HANDLE (-1)

void *opaque; /* private data for input parsers */

int (*get_record)(char **out, struct awk_input *iobuf,

int *errcode, char **rt_start, size_t *rt_len);

ssize_t (*read_func)();

void (*close_func)(struct awk_input *iobuf);

struct stat sbuf; /* stat buf */

} awk_input_buf_t;

The fields can be divided into two categories: those for use (initially, at least) by XXX_can_take_file(), and those for use by XXX_take_control_of(). The first group of fields and their uses are as follows:

const char *name;

The name of the file.

int fd;

A file descriptor for the file. If gawk was able to open the file, then fd will not be equal to INVALID_HANDLE. Otherwise, it will.

struct stat sbuf;

If the file descriptor is valid, then gawk will have filled in this structure via a call to the fstat() system call.

The XXX_can_take_file() function should examine these fields and decide if the input parser should be used for the file. The decision can be made based upon gawk state (the value of a variable defined previously by the extension and set by awk code), the name of the file, whether or not the file descriptor is valid, the information in the struct stat, or any combination of these factors.

Once XXX_can_take_file() has returned true, and gawk has decided to use your input parser, it calls XXX_take_control_of(). That function then fills either the get_record field or the read_func field in the awk_input_buf_t. It must also ensure that fd is not set toINVALID_HANDLE. The following list describes the fields that may be filled by XXX_take_control_of():

void *opaque;

This is used to hold any state information needed by the input parser for this file. It is “opaque” to gawk. The input parser is not required to use this pointer.

int (*get_record)(char **out,
struct awk_input *iobuf,
int *errcode,
char **rt_start,
size_t *rt_len);

This function pointer should point to a function that creates the input records. Said function is the core of the input parser. Its behavior is described in the text following this list.

ssize_t (*read_func)();

This function pointer should point to a function that has the same behavior as the standard POSIX read() system call. It is an alternative to the get_record pointer. Its behavior is also described in the text following this list.

void (*close_func)(struct awk_input *iobuf);

This function pointer should point to a function that does the “teardown.” It should release any resources allocated by XXX_take_control_of(). It may also close the file. If it does so, it should set the fd field to INVALID_HANDLE.

If fd is still not INVALID_HANDLE after the call to this function, gawk calls the regular close() system call.

Having a “teardown” function is optional. If your input parser does not need it, do not set this field. Then, gawk calls the regular close() system call on the file descriptor, so it should be valid.

The XXX_get_record() function does the work of creating input records. The parameters are as follows:

char **out

This is a pointer to a char * variable that is set to point to the record. gawk makes its own copy of the data, so the extension must manage this storage.

struct awk_input *iobuf

This is the awk_input_buf_t for the file. The fields should be used for reading data (fd) and for managing private state (opaque), if any.

int *errcode

If an error occurs, *errcode should be set to an appropriate code from <errno.h>.

char **rt_start
size_t *rt_len

If the concept of a “record terminator” makes sense, then *rt_start should be set to point to the data to be used for RT, and *rt_len should be set to the length of the data. Otherwise, *rt_len should be set to zero. gawk makes its own copy of this data, so the extension must manage this storage.

The return value is the length of the buffer pointed to by *out, or EOF if end-of-file was reached or an error occurred.

It is guaranteed that errcode is a valid pointer, so there is no need to test for a NULL value. gawk sets *errcode to zero, so there is no need to set it unless an error occurs.

If an error does occur, the function should return EOF and set *errcode to a value greater than zero. In that case, if *errcode does not equal zero, gawk automatically updates the ERRNO variable based on the value of *errcode. (In general, setting ‘*errcode = errno’ should do the right thing.)

As an alternative to supplying a function that returns an input record, you may instead supply a function that simply reads bytes, and let gawk parse the data into records. If you do so, the data should be returned in the multibyte encoding of the current locale. Such a function should follow the same behavior as the read() system call, and you fill in the read_func pointer with its address in the awk_input_buf_t structure.

By default, gawk sets the read_func pointer to point to the read() system call. So your extension need not set this field explicitly.

NOTE

You must choose one method or the other: either a function that returns a record, or one that returns raw data. In particular, if you supply a function to get a record, gawk will call it, and will never call the raw read function.

gawk ships with a sample extension that reads directories, returning records for each entry in a directory (see Reading Directories). You may wish to use that code as a guide for writing your own input parser.

When writing an input parser, you should think about (and document) how it is expected to interact with awk code. You may want it to always be called, and to take effect as appropriate (as the readdir extension does). Or you may want it to take effect based upon the value of an awkvariable, as the XML extension from the gawkextlib project does (see The gawkextlib Project). In the latter case, code in a BEGINFILE section can look at FILENAME and ERRNO to decide whether or not to activate an input parser (see The BEGINFILE and ENDFILE Special Patterns).

You register your input parser with the following function:

void register_input_parser(awk_input_parser_t *input_parser);

Register the input parser pointed to by input_parser with gawk.

Customized output wrappers

An output wrapper is the mirror image of an input parser. It allows an extension to take over the output to a file opened with the ‘>’ or ‘>>’ I/O redirection operators (see Redirecting Output of print and printf).

The output wrapper is very similar to the input parser structure:

typedef struct awk_output_wrapper {

const char *name; /* name of the wrapper */

awk_bool_t (*can_take_file)(const awk_output_buf_t *outbuf);

awk_bool_t (*take_control_of)(awk_output_buf_t *outbuf);

awk_const struct awk_output_wrapper *awk_const next; /* for gawk */

} awk_output_wrapper_t;

The members are as follows:

const char *name;

This is the name of the output wrapper.

awk_bool_t (*can_take_file)(const awk_output_buf_t *outbuf);

This points to a function that examines the information in the awk_output_buf_t structure pointed to by outbuf. It should return true if the output wrapper wants to take over the file, and false otherwise. It should not change any state (variable values, etc.) within gawk.

awk_bool_t (*take_control_of)(awk_output_buf_t *outbuf);

The function pointed to by this field is called when gawk decides to let the output wrapper take control of the file. It should fill in appropriate members of the awk_output_buf_t structure, as described next, and return true if successful, false otherwise.

awk_const struct output_wrapper *awk_const next;

This is for use by gawk; therefore it is marked awk_const so that the extension cannot modify it.

The awk_output_buf_t structure looks like this:

typedef struct awk_output_buf {

const char *name; /* name of output file */

const char *mode; /* mode argument to fopen */

FILE *fp; /* stdio file pointer */

awk_bool_t redirected; /* true if a wrapper is active */

void *opaque; /* for use by output wrapper */

size_t (*gawk_fwrite)(const void *buf, size_t size, size_t count,

FILE *fp, void *opaque);

int (*gawk_fflush)(FILE *fp, void *opaque);

int (*gawk_ferror)(FILE *fp, void *opaque);

int (*gawk_fclose)(FILE *fp, void *opaque);

} awk_output_buf_t;

Here too, your extension will define XXX_can_take_file() and XXX_take_control_of() functions that examine and update data members in the awk_output_buf_t. The data members are as follows:

const char *name;

The name of the output file.

const char *mode;

The mode string (as would be used in the second argument to fopen()) with which the file was opened.

FILE *fp;

The FILE pointer from <stdio.h>. gawk opens the file before attempting to find an output wrapper.

awk_bool_t redirected;

This field must be set to true by the XXX_take_control_of() function.

void *opaque;

This pointer is opaque to gawk. The extension should use it to store a pointer to any private data associated with the file.

size_t (*gawk_fwrite)(const void *buf, size_t size, size_t count,
FILE *fp, void *opaque);
int (*gawk_fflush)(FILE *fp, void *opaque);
int (*gawk_ferror)(FILE *fp, void *opaque);
int (*gawk_fclose)(FILE *fp, void *opaque);

These pointers should be set to point to functions that perform the equivalent functions as the <stdio.h> functions do, if appropriate. gawk uses these function pointers for all output. gawk initializes the pointers to point to internal “pass-through” functions that just call the regular<stdio.h> functions, so an extension only needs to redefine those functions that are appropriate for what it does.

The XXX_can_take_file() function should make a decision based upon the name and mode fields, and any additional state (such as awk variable values) that is appropriate.

When gawk calls XXX_take_control_of(), that function should fill in the other fields as appropriate, except for fp, which it should just use normally.

You register your output wrapper with the following function:

void register_output_wrapper(awk_output_wrapper_t *output_wrapper);

Register the output wrapper pointed to by output_wrapper with gawk.

Customized two-way processors

A two-way processor combines an input parser and an output wrapper for two-way I/O with the ‘|&’ operator (see Redirecting Output of print and printf). It makes identical usage of the awk_input_parser_t and awk_output_buf_t structures as described earlier.

A two-way processor is represented by the following structure:

typedef struct awk_two_way_processor {

const char *name; /* name of the two-way processor */

awk_bool_t (*can_take_two_way)(const char *name);

awk_bool_t (*take_control_of)(const char *name,

awk_input_buf_t *inbuf,

awk_output_buf_t *outbuf);

awk_const struct awk_two_way_processor *awk_const next; /* for gawk */

} awk_two_way_processor_t;

The fields are as follows:

const char *name;

The name of the two-way processor.

awk_bool_t (*can_take_two_way)(const char *name);

The function pointed to by this field should return true if it wants to take over two-way I/O for this filename. It should not change any state (variable values, etc.) within gawk.

awk_bool_t (*take_control_of)(const char *name,
awk_input_buf_t *inbuf,
awk_output_buf_t *outbuf);

The function pointed to by this field should fill in the awk_input_buf_t and awk_outut_buf_t structures pointed to by inbuf and outbuf, respectively. These structures were described earlier.

awk_const struct two_way_processor *awk_const next;

This is for use by gawk; therefore it is marked awk_const so that the extension cannot modify it.

As with the input parser and output processor, you provide “yes I can take this” and “take over for this” functions, XXX_can_take_two_way() and XXX_take_control_of().

You register your two-way processor with the following function:

void register_two_way_processor(awk_two_way_processor_t *
two_way_processor);

Register the two-way processor pointed to by two_way_processor with gawk.

Printing Messages

You can print different kinds of warning messages from your extension, as described here. Note that for these functions, you must pass in the extension ID received from gawk when the extension was loaded:[97]

void fatal(awk_ext_id_t id, const char *format, ...);

Print a message and then cause gawk to exit immediately.

void warning(awk_ext_id_t id, const char *format, ...);

Print a warning message.

void lintwarn(awk_ext_id_t id, const char *format, ...);

Print a “lint warning.” Normally this is the same as printing a warning message, but if gawk was invoked with ‘--lint=fatal’, then lint warnings become fatal error messages.

All of these functions are otherwise like the C printf() family of functions, where the format parameter is a string with literal characters and formatting codes intermixed.

Updating ERRNO

The following functions allow you to update the ERRNO variable:

void update_ERRNO_int(int errno_val);

Set ERRNO to the string equivalent of the error code in errno_val. The value should be one of the defined error codes in <errno.h>, and gawk turns it into a (possibly translated) string using the C strerror() function.

void update_ERRNO_string(const char *string);

Set ERRNO directly to the string value of ERRNO. gawk makes a copy of the value of string.

void unset_ERRNO(void);

Unset ERRNO.

Requesting Values

All of the functions that return values from gawk work in the same way. You pass in an awk_valtype_t value to indicate what kind of value you expect. If the actual value matches what you requested, the function returns true and fills in the awk_value_t result. Otherwise, the function returns false, and the val_type member indicates the type of the actual value. You may then print an error message or reissue the request for the actual value type, as appropriate. This behavior is summarized in Table 16-1.

Table 16-1. API value types returned

Type of Actual Value

String

Number

Array

Undefined

String

String

String

False

False

Number

Number if can be converted, else false

Number

False

False

Type

Array

False

False

Array

False

Requested

Scalar

Scalar

Scalar

False

False

Undefined

String

Number

Array

Undefined

Value cookie

False

False

False

False

Accessing and Updating Parameters

Two functions give you access to the arguments (parameters) passed to your extension function. They are:

awk_bool_t get_argument(size_t count,
awk_valtype_t wanted,
awk_value_t *result);

Fill in the awk_value_t structure pointed to by result with the countth argument. Return true if the actual type matches wanted, and false otherwise. In the latter case, result->val_type indicates the actual type (see Table 16-1). Counts are zero-based—the first argument is numbered zero, the second one, and so on. wanted indicates the type of value expected.

awk_bool_t set_argument(size_t count, awk_array_t array);

Convert a parameter that was undefined into an array; this provides call by reference for arrays. Return false if count is too big, or if the argument’s type is not undefined. See Array Manipulation for more information on creating arrays.

Symbol Table Access

Two sets of routines provide access to global variables, and one set allows you to create and release cached values.

Variable access and update by name

The following routines provide the ability to access and update global awk-level variables by name. In compiler terminology, identifiers of different kinds are termed symbols, thus the “sym” in the routines’ names. The data structure that stores information about symbols is termed a symbol table. The functions are as follows:

awk_bool_t sym_lookup(const char *name,
awk_valtype_t wanted,
awk_value_t *result);

Fill in the awk_value_t structure pointed to by result with the value of the variable named by the string name, which is a regular C string. wanted indicates the type of value expected. Return true if the actual type matches wanted, and false otherwise. In the latter case, result->val_type indicates the actual type (see Table 16-1).

awk_bool_t sym_update(const char *name, awk_value_t *value);

Update the variable named by the string name, which is a regular C string. The variable is added to gawk’s symbol table if it is not there. Return true if everything worked, and false otherwise.

Changing types (scalar to array or vice versa) of an existing variable is not allowed, nor may this routine be used to update an array. This routine cannot be used to update any of the predefined variables (such as ARGC or NF).

An extension can look up the value of gawk’s special variables. However, with the exception of the PROCINFO array, an extension cannot change any of those variables.

CAUTION

It is possible for the lookup of PROCINFO to fail. This happens if the awk program being run does not reference PROCINFO; in this case, gawk doesn’t bother to create the array and populate it.

Variable access and update by cookie

A scalar cookie is an opaque handle that provides access to a global variable or array. It is an optimization that avoids looking up variables in gawk’s symbol table every time access is needed. This was discussed earlier, in General-Purpose Data Types.

The following functions let you work with scalar cookies:

awk_bool_t sym_lookup_scalar(awk_scalar_t cookie,
awk_valtype_t wanted,
awk_value_t *result);

Retrieve the current value of a scalar cookie. Once you have obtained a scalar cookie using sym_lookup(), you can use this function to get its value more efficiently. Return false if the value cannot be retrieved.

awk_bool_t sym_update_scalar(awk_scalar_t cookie, awk_value_t *value);

Update the value associated with a scalar cookie. Return false if the new value is not of type AWK_STRING or AWK_NUMBER. Here too, the predefined variables may not be updated.

It is not obvious at first glance how to work with scalar cookies or what their raison d’être really is. In theory, the sym_lookup() and sym_update() routines are all you really need to work with variables. For example, you might have code that looks up the value of a variable, evaluates a condition, and then possibly changes the value of the variable based on the result of that evaluation, like so:

/* do_magic --- do something really great */

static awk_value_t *

do_magic(int nargs, awk_value_t *result)

{

awk_value_t value;

if ( sym_lookup("MAGIC_VAR", AWK_NUMBER, & value)

&& some_condition(value.num_value)) {

value.num_value += 42;

sym_update("MAGIC_VAR", & value);

}

return make_number(0.0, result);

}

This code looks (and is) simple and straightforward. So what’s the problem?

Well, consider what happens if awk-level code associated with your extension calls the magic() function (implemented in C by do_magic()), once per record, while processing hundreds of thousands or millions of records. The MAGIC_VAR variable is looked up in the symbol table once or twice per function call!

The symbol table lookup is really pure overhead; it is considerably more efficient to get a cookie that represents the variable, and use that to get the variable’s value and update it as needed.[98]

Thus, the way to use cookies is as follows. First, install your extension’s variable in gawk’s symbol table using sym_update(), as usual. Then get a scalar cookie for the variable using sym_lookup():

static awk_scalar_t magic_var_cookie; /* cookie for MAGIC_VAR */

static void

my_extension_init()

{

awk_value_t value;

/* install initial value */

sym_update("MAGIC_VAR", make_number(42.0, & value));

/* get the cookie */

sym_lookup("MAGIC_VAR", AWK_SCALAR, & value);

/* save the cookie */

magic_var_cookie = value.scalar_cookie;

}

Next, use the routines in this section for retrieving and updating the value through the cookie. Thus, do_magic() now becomes something like this:

/* do_magic --- do something really great */

static awk_value_t *

do_magic(int nargs, awk_value_t *result)

{

awk_value_t value;

if ( sym_lookup_scalar(magic_var_cookie, AWK_NUMBER, & value)

&& some_condition(value.num_value)) {

value.num_value += 42;

sym_update_scalar(magic_var_cookie, & value);

}

return make_number(0.0, result);

}

NOTE

The previous code omitted error checking for presentation purposes. Your extension code should be more robust and carefully check the return values from the API functions.

Creating and using cached values

The routines in this section allow you to create and release cached values. Like scalar cookies, in theory, cached values are not necessary. You can create numbers and strings using the functions in Constructor Functions. You can then assign those values to variables using sym_update() orsym_update_scalar(), as you like.

However, you can understand the point of cached values if you remember that every string value’s storage must come from gawk_malloc(), gawk_calloc(), or gawk_realloc(). If you have 20 variables, all of which have the same string value, you must create 20 identical copies of the string.[99]

It is clearly more efficient, if possible, to create a value once and then tell gawk to reuse the value for multiple variables. That is what the routines in this section let you do. The functions are as follows:

awk_bool_t create_value(awk_value_t *value, awk_value_cookie_t *result);

Create a cached string or numeric value from value for efficient later assignment. Only values of type AWK_NUMBER and AWK_STRING are allowed. Any other type is rejected. AWK_UNDEFINED could be allowed, but doing so would result in inferior performance.

awk_bool_t release_value(awk_value_cookie_t vc);

Release the memory associated with a value cookie obtained from create_value().

You use value cookies in a fashion similar to the way you use scalar cookies. In the extension initialization routine, you create the value cookie:

static awk_value_cookie_t answer_cookie; /* static value cookie */

static void

my_extension_init()

{

awk_value_t value;

char *long_string;

size_t long_string_len;

/* code from earlier */

/* … fill in long_string and long_string_len … */

make_malloced_string(long_string, long_string_len, & value);

create_value(& value, & answer_cookie); /* create cookie */

}

Once the value is created, you can use it as the value of any number of variables:

static awk_value_t *

do_magic(int nargs, awk_value_t *result)

{

awk_value_t new_value;

… /* as earlier */

value.val_type = AWK_VALUE_COOKIE;

value.value_cookie = answer_cookie;

sym_update("VAR1", & value);

sym_update("VAR2", & value);

sym_update("VAR100", & value);

}

Using value cookies in this way saves considerable storage, as all of VAR1 through VAR100 share the same value.

You might be wondering, “Is this sharing problematic? What happens if awk code assigns a new value to VAR1; are all the others changed too?”

That’s a great question. The answer is that no, it’s not a problem. Internally, gawk uses reference-counted strings. This means that many variables can share the same string value, and gawk keeps track of the usage. When a variable’s value changes, gawk simply decrements the reference count on the old value and updates the variable to use the new value.

Finally, as part of your cleanup action (see Registering an exit callback function) you should release any cached values that you created, using release_value().

Array Manipulation

The primary data structure[100] in awk is the associative array (see Chapter 8). Extensions need to be able to manipulate awk arrays. The API provides a number of data structures for working with arrays, functions for working with individual elements, and functions for working with arrays as a whole. This includes the ability to “flatten” an array so that it is easy for C code to traverse every element in an array. The array data structures integrate nicely with the data structures for values to make it easy to both work with and create true arrays of arrays (see General-Purpose Data Types).

Array data types

The data types associated with arrays are as follows:

typedef void *awk_array_t;

If you request the value of an array variable, you get back an awk_array_t value. This value is opaque[101] to the extension; it uniquely identifies the array but can only be used by passing it into API functions or receiving it from API functions. This is very similar to way ‘FILE *’ values are used with the <stdio.h> library routines.

typedef struct awk_element {
/* convenience linked list pointer, not used by gawk */
struct awk_element *next;
enum {
AWK_ELEMENT_DEFAULT = 0, /* set by gawk */
AWK_ELEMENT_DELETE = 1 /* set by extension */
} flags;
awk_value_t index;
awk_value_t value;
} awk_element_t;

The awk_element_t is a “flattened” array element. awk produces an array of these inside the awk_flat_array_t (see the next item). Individual elements may be marked for deletion. New elements must be added individually, one at a time, using the separate API for that purpose. The fields are as follows:

struct awk_element *next;

This pointer is for the convenience of extension writers. It allows an extension to create a linked list of new elements that can then be added to an array in a loop that traverses the list.

enum { … } flags;

A set of flag values that convey information between the extension and gawk. Currently there is only one: AWK_ELEMENT_DELETE. Setting it causes gawk to delete the element from the original array upon release of the flattened array.

index
value

The index and value of the element, respectively. All memory pointed to by index and value belongs to gawk.

typedef struct awk_flat_array {
awk_const void *awk_const opaque1; /* for use by gawk */
awk_const void *awk_const opaque2; /* for use by gawk */
awk_const size_t count; /* how many elements */
awk_element_t elements[1]; /* will be extended */
} awk_flat_array_t;

This is a flattened array. When an extension gets one of these from gawk, the elements array is of actual size count. The opaque1 and opaque2 pointers are for use by gawk; therefore they are marked awk_const so that the extension cannot modify them.

Array functions

The following functions relate to individual array elements:

awk_bool_t get_element_count(awk_array_t a_cookie, size_t *count);

For the array represented by a_cookie, place in *count the number of elements it contains. A subarray counts as a single element. Return false if there is an error.

awk_bool_t get_array_element(awk_array_t a_cookie,
const awk_value_t *const index,
awk_valtype_t wanted,
awk_value_t *result);

For the array represented by a_cookie, return in *result the value of the element whose index is index. wanted specifies the type of value you wish to retrieve. Return false if wanted does not match the actual type or if index is not in the array (see Table 16-1).

The value for index can be numeric, in which case gawk converts it to a string. Using nonintegral values is possible, but requires that you understand how such values are converted to strings (see Conversion of Strings and Numbers); thus, using integral values is safest.

As with all strings passed into gawk from an extension, the string value of index must come from gawk_malloc(), gawk_calloc(), or gawk_realloc(), and gawk releases the storage.

awk_bool_t set_array_element(awk_array_t a_cookie,
const awk_value_t *const index,
const awk_value_t *const value);

In the array represented by a_cookie, create or modify the element whose index is given by index. The ARGV and ENVIRON arrays may not be changed, although the PROCINFO array can be.

awk_bool_t set_array_element_by_elem(awk_array_t a_cookie,
awk_element_t element);

Like set_array_element(), but take the index and value from element. This is a convenience macro.

awk_bool_t del_array_element(awk_array_t a_cookie,
const awk_value_t* const index);

Remove the element with the given index from the array represented by a_cookie. Return true if the element was removed, or false if the element did not exist in the array.

The following functions relate to arrays as a whole:

awk_array_t create_array(void);

Create a new array to which elements may be added. See How to create and populate arrays for a discussion of how to create a new array and add elements to it.

awk_bool_t clear_array(awk_array_t a_cookie);

Clear the array represented by a_cookie. Return false if there was some kind of problem, true otherwise. The array remains an array, but after calling this function, it has no elements. This is equivalent to using the delete statement (see The delete Statement).

awk_bool_t flatten_array(awk_array_t a_cookie, awk_flat_array_t **data);

For the array represented by a_cookie, create an awk_flat_array_t structure and fill it in. Set the pointer whose address is passed as data to point to this structure. Return true upon success, or false otherwise. See the next section for a discussion of how to flatten an array and work with it.

awk_bool_t release_flattened_array(awk_array_t a_cookie,
awk_flat_array_t *data);

When done with a flattened array, release the storage using this function. You must pass in both the original array cookie and the address of the created awk_flat_array_t structure. The function returns true upon success, false otherwise.

Working with all the elements of an array

To flatten an array is to create a structure that represents the full array in a fashion that makes it easy for C code to traverse the entire array. Some of the code in extension/testext.c does this, and also serves as a nice example showing how to use the APIs.

We walk through that part of the code one step at a time. First, the gawk script that drives the test extension:

@load "testext"

BEGIN {

n = split("blacky rusty sophie raincloud lucky", pets)

printf("pets has %d elements\n", length(pets))

ret = dump_array_and_delete("pets", "3")

printf("dump_array_and_delete(pets) returned %d\n", ret)

if ("3" in pets)

printf("dump_array_and_delete() did NOT remove index \"3\"!\n")

else

printf("dump_array_and_delete() did remove index \"3\"!\n")

print ""

}

This code creates an array with split() (see String-Manipulation Functions) and then calls dump_array_and_delete(). That function looks up the array whose name is passed as the first argument, and deletes the element at the index passed in the second argument. The awk code then prints the return value and checks if the element was indeed deleted. Here is the C code that implements dump_array_and_delete(). It has been edited slightly for presentation.

The first part declares variables, sets up the default return value in result, and checks that the function was called with the correct number of arguments:

static awk_value_t *

dump_array_and_delete(int nargs, awk_value_t *result)

{

awk_value_t value, value2, value3;

awk_flat_array_t *flat_array;

size_t count;

char *name;

int i;

assert(result != NULL);

make_number(0.0, result);

if (nargs != 2) {

printf("dump_array_and_delete: nargs not right "

"(%d should be 2)\n", nargs);

goto out;

}

The function then proceeds in steps, as follows. First, retrieve the name of the array, passed as the first argument, followed by the array itself. If either operation fails, print an error message and return:

/* get argument named array as flat array and print it */

if (get_argument(0, AWK_STRING, & value)) {

name = value.str_value.str;

if (sym_lookup(name, AWK_ARRAY, & value2))

printf("dump_array_and_delete: sym_lookup of %s passed\n",

name);

else {

printf("dump_array_and_delete: sym_lookup of %s failed\n",

name);

goto out;

}

} else {

printf("dump_array_and_delete: get_argument(0) failed\n");

goto out;

}

For testing purposes and to make sure that the C code sees the same number of elements as the awk code, the second step is to get the count of elements in the array and print it:

if (! get_element_count(value2.array_cookie, & count)) {

printf("dump_array_and_delete: get_element_count failed\n");

goto out;

}

printf("dump_array_and_delete: incoming size is %lu\n",

(unsigned long) count);

The third step is to actually flatten the array, and then to double-check that the count in the awk_flat_array_t is the same as the count just retrieved:

if (! flatten_array(value2.array_cookie, & flat_array)) {

printf("dump_array_and_delete: could not flatten array\n");

goto out;

}

if (flat_array->count != count) {

printf("dump_array_and_delete: flat_array->count (%lu)"

" != count (%lu)\n",

(unsigned long) flat_array->count,

(unsigned long) count);

goto out;

}

The fourth step is to retrieve the index of the element to be deleted, which was passed as the second argument. Remember that argument counts passed to get_argument() are zero-based, and thus the second argument is numbered one:

if (! get_argument(1, AWK_STRING, & value3)) {

printf("dump_array_and_delete: get_argument(1) failed\n");

goto out;

}

The fifth step is where the “real work” is done. The function loops over every element in the array, printing the index and element values. In addition, upon finding the element with the index that is supposed to be deleted, the function sets the AWK_ELEMENT_DELETE bit in the flags field of the element. When the array is released, gawk traverses the flattened array, and deletes any elements that have this flag bit set:

for (i = 0; i < flat_array->count; i++) {

printf("\t%s[\"%.*s\"] = %s\n",

name,

(int) flat_array->elements[i].index.str_value.len,

flat_array->elements[i].index.str_value.str,

valrep2str(& flat_array->elements[i].value));

if (strcmp(value3.str_value.str,

flat_array->elements[i].index.str_value.str) == 0) {

flat_array->elements[i].flags |= AWK_ELEMENT_DELETE;

printf("dump_array_and_delete: marking element \"%s\" "

"for deletion\n",

flat_array->elements[i].index.str_value.str);

}

}

The sixth step is to release the flattened array. This tells gawk that the extension is no longer using the array, and that it should delete any elements marked for deletion. gawk also frees any storage that was allocated, so you should not use the pointer (flat_array in this code) once you have called release_flattened_array():

if (! release_flattened_array(value2.array_cookie, flat_array)) {

printf("dump_array_and_delete: could not release flattened array\n");

goto out;

}

Finally, because everything was successful, the function sets the return value to success, and returns:

make_number(1.0, result);

out:

return result;

}

Here is the output from running this part of the test:

pets has 5 elements

dump_array_and_delete: sym_lookup of pets passed

dump_array_and_delete: incoming size is 5

pets["1"] = "blacky"

pets["2"] = "rusty"

pets["3"] = "sophie"

dump_array_and_delete: marking element "3" for deletion

pets["4"] = "raincloud"

pets["5"] = "lucky"

dump_array_and_delete(pets) returned 1

dump_array_and_delete() did remove index "3"!

How to create and populate arrays

Besides working with arrays created by awk code, you can create arrays and populate them as you see fit, and then awk code can access them and manipulate them.

There are two important points about creating arrays from extension code:

§ You must install a new array into gawk’s symbol table immediately upon creating it. Once you have done so, you can then populate the array.

Similarly, if installing a new array as a subarray of an existing array, you must add the new array to its parent before adding any elements to it.

Thus, the correct way to build an array is to work “top down.” Create the array, and immediately install it in gawk’s symbol table using sym_update(), or install it as an element in a previously existing array using set_array_element(). We show example code shortly.

§ Due to gawk internals, after using sym_update() to install an array into gawk, you have to retrieve the array cookie from the value passed in to sym_update() before doing anything else with it, like so:

§ awk_value_t value;

§ awk_array_t new_array;

§

§ new_array = create_array();

§ val.val_type = AWK_ARRAY;

§ val.array_cookie = new_array;

§

§ /* install array in the symbol table */

§ sym_update("array", & val);

§

new_array = val.array_cookie; /* YOU MUST DO THIS */

If installing an array as a subarray, you must also retrieve the value of the array cookie after the call to set_element().

The following C code is a simple test extension to create an array with two regular elements and with a subarray. The leading #include directives and boilerplate variable declarations (see Boilerplate Code) are omitted for brevity. The first step is to create a new array and then install it in the symbol table:

/* create_new_array --- create a named array */

static void

create_new_array()

{

awk_array_t a_cookie;

awk_array_t subarray;

awk_value_t index, value;

a_cookie = create_array();

value.val_type = AWK_ARRAY;

value.array_cookie = a_cookie;

if (! sym_update("new_array", & value))

printf("create_new_array: sym_update(\"new_array\") failed!\n");

a_cookie = value.array_cookie;

Note how a_cookie is reset from the array_cookie field in the value structure.

The second step is to install two regular values into new_array:

(void) make_const_string("hello", 5, & index);

(void) make_const_string("world", 5, & value);

if (! set_array_element(a_cookie, & index, & value)) {

printf("fill_in_array: set_array_element failed\n");

return;

}

(void) make_const_string("answer", 6, & index);

(void) make_number(42.0, & value);

if (! set_array_element(a_cookie, & index, & value)) {

printf("fill_in_array: set_array_element failed\n");

return;

}

The third step is to create the subarray and install it:

(void) make_const_string("subarray", 8, & index);

subarray = create_array();

value.val_type = AWK_ARRAY;

value.array_cookie = subarray;

if (! set_array_element(a_cookie, & index, & value)) {

printf("fill_in_array: set_array_element failed\n");

return;

}

subarray = value.array_cookie;

The final step is to populate the subarray with its own element:

(void) make_const_string("foo", 3, & index);

(void) make_const_string("bar", 3, & value);

if (! set_array_element(subarray, & index, & value)) {

printf("fill_in_array: set_array_element failed\n");

return;

}

}

Here is a sample script that loads the extension and then dumps the array:

@load "subarray"

function dumparray(name, array, i)

{

for (i in array)

if (isarray(array[i]))

dumparray(name "[\"" i "\"]", array[i])

else

printf("%s[\"%s\"] = %s\n", name, i, array[i])

}

BEGIN {

dumparray("new_array", new_array);

}

Here is the result of running the script:

$ AWKLIBPATH=$PWD ./gawk -f subarray.awk

new_array["subarray"]["foo"] = bar

new_array["hello"] = world

new_array["answer"] = 42

(See How gawk Finds Extensions for more information on the AWKLIBPATH environment variable.)

API Variables

The API provides two sets of variables. The first provides information about the version of the API (both with which the extension was compiled, and with which gawk was compiled). The second provides information about how gawk was invoked.

API version constants and variables

The API provides both a “major” and a “minor” version number. The API versions are available at compile time as constants:

GAWK_API_MAJOR_VERSION

The major version of the API

GAWK_API_MINOR_VERSION

The minor version of the API

The minor version increases when new functions are added to the API. Such new functions are always added to the end of the API struct.

The major version increases (and the minor version is reset to zero) if any of the data types change size or member order, or if any of the existing functions change signature.

It could happen that an extension may be compiled against one version of the API but loaded by a version of gawk using a different version. For this reason, the major and minor API versions of the running gawk are included in the API struct as read-only constant integers:

api->major_version

The major version of the running gawk

api->minor_version

The minor version of the running gawk

It is up to the extension to decide if there are API incompatibilities. Typically, a check like this is enough:

if (api->major_version != GAWK_API_MAJOR_VERSION

|| api->minor_version < GAWK_API_MINOR_VERSION) {

fprintf(stderr, "foo_extension: version mismatch with gawk!\n");

fprintf(stderr, "\tmy version (%d, %d), gawk version (%d, %d)\n",

GAWK_API_MAJOR_VERSION, GAWK_API_MINOR_VERSION,

api->major_version, api->minor_version);

exit(1);

}

Such code is included in the boilerplate dl_load_func() macro provided in gawkapi.h (discussed in Boilerplate Code).

Informational variables

The API provides access to several variables that describe whether the corresponding command-line options were enabled when gawk was invoked. The variables are:

do_debug

This variable is true if gawk was invoked with --debug option.

do_lint

This variable is true if gawk was invoked with --lint option.

do_mpfr

This variable is true if gawk was invoked with --bignum option.

do_profile

This variable is true if gawk was invoked with --profile option.

do_sandbox

This variable is true if gawk was invoked with --sandbox option.

do_traditional

This variable is true if gawk was invoked with --traditional option.

The value of do_lint can change if awk code modifies the LINT predefined variable (see Predefined Variables). The others should not change during execution.

Boilerplate Code

As mentioned earlier (see How It Works at a High Level), the function definitions as presented are really macros. To use these macros, your extension must provide a small amount of boilerplate code (variables and functions) toward the top of your source file, using predefined names as described here. The boilerplate needed is also provided in comments in the gawkapi.h header file:

/* Boilerplate code: */

int plugin_is_GPL_compatible;

static gawk_api_t *const api;

static awk_ext_id_t ext_id;

static const char *ext_version = NULL; /* or … = "some string" */

static awk_ext_func_t func_table[] = {

{ "name", do_name, 1 },

/* … */

};

/* EITHER: */

static awk_bool_t (*init_func)(void) = NULL;

/* OR: */

static awk_bool_t

init_my_extension(void)

{

}

static awk_bool_t (*init_func)(void) = init_my_extension;

dl_load_func(func_table, some_name, "name_space_in_quotes")

These variables and functions are as follows:

int plugin_is_GPL_compatible;

This asserts that the extension is compatible with the GNU GPL. If your extension does not have this, gawk will not load it (see Extension Licensing).

static gawk_api_t *const api;

This global static variable should be set to point to the gawk_api_t pointer that gawk passes to your dl_load() function. This variable is used by all of the macros.

static awk_ext_id_t ext_id;

This global static variable should be set to the awk_ext_id_t value that gawk passes to your dl_load() function. This variable is used by all of the macros.

static const char *ext_version = NULL; /* or … = "some string" */

This global static variable should be set either to NULL, or to point to a string giving the name and version of your extension.

static awk_ext_func_t func_table[] = { … };

This is an array of one or more awk_ext_func_t structures, as described earlier (see Registering an extension function). It can then be looped over for multiple calls to add_ext_func().

static awk_bool_t (*init_func)(void) = NULL;
OR
static awk_bool_t init_my_extension(void) { … }
static awk_bool_t (*init_func)(void) = init_my_extension;

If you need to do some initialization work, you should define a function that does it (creates variables, opens files, etc.) and then define the init_func pointer to point to your function. The function should return awk_false upon failure, or awk_true if everything goes well.

If you don’t need to do any initialization, define the pointer and initialize it to NULL.

dl_load_func(func_table, some_name, "name_space_in_quotes")

This macro expands to a dl_load() function that performs all the necessary initializations.

The point of all the variables and arrays is to let the dl_load() function (from the dl_load_func() macro) do all the standard work. It does the following:

1. Check the API versions. If the extension major version does not match gawk’s, or if the extension minor version is greater than gawk’s, it prints a fatal error message and exits.

2. Load the functions defined in func_table. If any of them fails to load, it prints a warning message but continues on.

3. If the init_func pointer is not NULL, call the function it points to. If it returns awk_false, print a warning message.

4. If ext_version is not NULL, register the version string with gawk.

How gawk Finds Extensions

Compiled extensions have to be installed in a directory where gawk can find them. If gawk is configured and built in the default fashion, the directory in which to find extensions is /usr/local/lib/gawk. You can also specify a search path with a list of directories to search for compiled extensions. See The AWKLIBPATH Environment Variable for more information.

Example: Some File Functions

No matter where you go, there you are.

—Buckaroo Banzai

Two useful functions that are not in awk are chdir() (so that an awk program can change its directory) and stat() (so that an awk program can gather information about a file). In order to illustrate the API in action, this section implements these functions for gawk in an extension.

Using chdir() and stat()

This section shows how to use the new functions at the awk level once they’ve been integrated into the running gawk interpreter. Using chdir() is very straightforward. It takes one argument, the new directory to change to:

@load "filefuncs"

newdir = "/home/arnold/funstuff"

ret = chdir(newdir)

if (ret < 0) {

printf("could not change to %s: %s\n", newdir, ERRNO) > "/dev/stderr"

exit 1

}

The return value is negative if the chdir() failed, and ERRNO (see Predefined Variables) is set to a string indicating the error.

Using stat() is a bit more complicated. The C stat() function fills in a structure that has a fair amount of information. The right way to model this in awk is to fill in an associative array with the appropriate information:

file = "/home/arnold/.profile"

ret = stat(file, fdata)

if (ret < 0) {

printf("could not stat %s: %s\n",

file, ERRNO) > "/dev/stderr"

exit 1

}

printf("size of %s is %d bytes\n", file, fdata["size"])

The stat() function always clears the data array, even if the stat() fails. It fills in the following elements:

"name"

The name of the file that was stat()ed.

"dev"
"ino"

The file’s device and inode numbers, respectively.

"mode"

The file’s mode, as a numeric value. This includes both the file’s type and its permissions.

"nlink"

The number of hard links (directory entries) the file has.

"uid"
"gid"

The numeric user and group ID numbers of the file’s owner.

"size"

The size in bytes of the file.

"blocks"

The number of disk blocks the file actually occupies. This may not be a function of the file’s size if the file has holes.

"atime"
"mtime"
"ctime"

The file’s last access, modification, and inode update times, respectively. These are numeric timestamps, suitable for formatting with strftime() (see Time Functions).

"pmode"

The file’s “printable mode.” This is a string representation of the file’s type and permissions, such as is produced by ‘ls -l’—for example, "drwxr-xr-x".

"type"

A printable string representation of the file’s type. The value is one of the following:

"blockdev"
"chardev"

The file is a block or character device (“special file”).

"directory"

The file is a directory.

"fifo"

The file is a named pipe (also known as a FIFO).

"file"

The file is just a regular file.

"socket"

The file is an AF_UNIX (“Unix domain”) socket in the filesystem.

"symlink"

The file is a symbolic link.

"devbsize"

The size of a block for the element indexed by "blocks". This information is derived from either the DEV_BSIZE constant defined in <sys/param.h> on most systems, or the S_BLKSIZE constant in <sys/stat.h> on BSD systems. For some other systems, a priori knowledge is used to provide a value. Where no value can be determined, it defaults to 512.

Several additional elements may be present, depending upon the operating system and the type of the file. You can test for them in your awk program by using the in operator (see Referring to an Array Element):

"blksize"

The preferred block size for I/O to the file. This field is not present on all POSIX-like systems in the C stat structure.

"linkval"

If the file is a symbolic link, this element is the name of the file the link points to (i.e., the value of the link).

"rdev"
"major"
"minor"

If the file is a block or character device file, then these values represent the numeric device number and the major and minor components of that number, respectively.

C Code for chdir() and stat()

Here is the C code for these extensions.[102]

The file includes a number of standard header files, and then includes the gawkapi.h header file, which provides the API definitions. Those are followed by the necessary variable declarations to make use of the API macros and boilerplate code (see Boilerplate Code):

#ifdef HAVE_CONFIG_H

#include <config.h>

#endif

#include <stdio.h>

#include <assert.h>

#include <errno.h>

#include <stdlib.h>

#include <string.h>

#include <unistd.h>

#include <sys/types.h>

#include <sys/stat.h>

#include "gawkapi.h"

#include "gettext.h"

#define _(msgid) gettext(msgid)

#define N_(msgid) msgid

#include "gawkfts.h"

#include "stack.h"

static const gawk_api_t *api; /* for convenience macros to work */

static awk_ext_id_t *ext_id;

static awk_bool_t init_filefuncs(void);

static awk_bool_t (*init_func)(void) = init_filefuncs;

static const char *ext_version = "filefuncs extension: version 1.0";

int plugin_is_GPL_compatible;

By convention, for an awk function foo(), the C function that implements it is called do_foo(). The function should have two arguments. The first is an int, usually called nargs, that represents the number of actual arguments for the function. The second is a pointer to an awk_value_tstructure, usually named result:

/* do_chdir --- provide dynamically loaded chdir() function for gawk */

static awk_value_t *

do_chdir(int nargs, awk_value_t *result)

{

awk_value_t newdir;

int ret = -1;

assert(result != NULL);

if (do_lint && nargs != 1)

lintwarn(ext_id,

_("chdir: called with incorrect number of arguments, "

"expecting 1"));

The newdir variable represents the new directory to change to, which is retrieved with get_argument(). Note that the first argument is numbered zero.

If the argument is retrieved successfully, the function calls the chdir() system call. If the chdir() fails, ERRNO is updated:

if (get_argument(0, AWK_STRING, & newdir)) {

ret = chdir(newdir.str_value.str);

if (ret < 0)

update_ERRNO_int(errno);

}

Finally, the function returns the return value to the awk level:

return make_number(ret, result);

}

The stat() extension is more involved. First comes a function that turns a numeric mode into a printable representation (e.g., octal 0644 becomes ‘-rw-r--r--’). This is omitted here for brevity:

/* format_mode --- turn a stat mode field into something readable */

static char *

format_mode(unsigned long fmode)

{

}

Next comes a function for reading symbolic links, which is also omitted here for brevity:

/* read_symlink --- read a symbolic link into an allocated buffer.

… */

static char *

read_symlink(const char *fname, size_t bufsize, ssize_t *linksize)

{

}

Two helper functions simplify entering values in the array that will contain the result of the stat():

/* array_set --- set an array element */

static void

array_set(awk_array_t array, const char *sub, awk_value_t *value)

{

awk_value_t index;

set_array_element(array,

make_const_string(sub, strlen(sub), & index),

value);

}

/* array_set_numeric --- set an array element with a number */

static void

array_set_numeric(awk_array_t array, const char *sub, double num)

{

awk_value_t tmp;

array_set(array, sub, make_number(num, & tmp));

}

The following function does most of the work to fill in the awk_array_t result array with values obtained from a valid struct stat. This work is done in a separate function to support the stat() function for gawk and also to support the fts() extension, which is included in the same file but whose code is not shown here (see File-Related Functions).

The first part of the function is variable declarations, including a table to map file types to strings:

/* fill_stat_array --- do the work to fill an array with stat info */

static int

fill_stat_array(const char *name, awk_array_t array, struct stat *sbuf)

{

char *pmode; /* printable mode */

const char *type = "unknown";

awk_value_t tmp;

static struct ftype_map {

unsigned int mask;

const char *type;

} ftype_map[] = {

{ S_IFREG, "file" },

{ S_IFBLK, "blockdev" },

{ S_IFCHR, "chardev" },

{ S_IFDIR, "directory" },

#ifdef S_IFSOCK

{ S_IFSOCK, "socket" },

#endif

#ifdef S_IFIFO

{ S_IFIFO, "fifo" },

#endif

#ifdef S_IFLNK

{ S_IFLNK, "symlink" },

#endif

#ifdef S_IFDOOR /* Solaris weirdness */

{ S_IFDOOR, "door" },

#endif /* S_IFDOOR */

};

int j, k;

The destination array is cleared, and then code fills in various elements based on values in the struct stat:

/* empty out the array */

clear_array(array);

/* fill in the array */

array_set(array, "name", make_const_string(name, strlen(name),

& tmp));

array_set_numeric(array, "dev", sbuf->st_dev);

array_set_numeric(array, "ino", sbuf->st_ino);

array_set_numeric(array, "mode", sbuf->st_mode);

array_set_numeric(array, "nlink", sbuf->st_nlink);

array_set_numeric(array, "uid", sbuf->st_uid);

array_set_numeric(array, "gid", sbuf->st_gid);

array_set_numeric(array, "size", sbuf->st_size);

array_set_numeric(array, "blocks", sbuf->st_blocks);

array_set_numeric(array, "atime", sbuf->st_atime);

array_set_numeric(array, "mtime", sbuf->st_mtime);

array_set_numeric(array, "ctime", sbuf->st_ctime);

/* for block and character devices, add rdev,

major and minor numbers */

if (S_ISBLK(sbuf->st_mode) || S_ISCHR(sbuf->st_mode)) {

array_set_numeric(array, "rdev", sbuf->st_rdev);

array_set_numeric(array, "major", major(sbuf->st_rdev));

array_set_numeric(array, "minor", minor(sbuf->st_rdev));

}

The latter part of the function makes selective additions to the destination array, depending upon the availability of certain members and/or the type of the file. It then returns zero, for success:

#ifdef HAVE_STRUCT_STAT_ST_BLKSIZE

array_set_numeric(array, "blksize", sbuf->st_blksize);

#endif /* HAVE_STRUCT_STAT_ST_BLKSIZE */

pmode = format_mode(sbuf->st_mode);

array_set(array, "pmode", make_const_string(pmode, strlen(pmode),

& tmp));

/* for symbolic links, add a linkval field */

if (S_ISLNK(sbuf->st_mode)) {

char *buf;

ssize_t linksize;

if ((buf = read_symlink(name, sbuf->st_size,

& linksize)) != NULL)

array_set(array, "linkval",

make_malloced_string(buf, linksize, & tmp));

else

warning(ext_id, _("stat: unable to read symbolic link `%s'"),

name);

}

/* add a type field */

type = "unknown"; /* shouldn't happen */

for (j = 0, k = sizeof(ftype_map)/sizeof(ftype_map[0]); j < k; j++) {

if ((sbuf->st_mode & S_IFMT) == ftype_map[j].mask) {

type = ftype_map[j].type;

break;

}

}

array_set(array, "type", make_const_string(type, strlen(type), & tmp));

return 0;

}

The third argument to stat() was not discussed previously. This argument is optional. If present, it causes do_stat() to use the stat() system call instead of the lstat() system call. This is done by using a function pointer: statfunc. statfunc is initialized to point to lstat()(instead of stat()) to get the file information, in case the file is a symbolic link. However, if the third argument is included, statfunc is set to point to stat(), instead.

Here is the do_stat() function, which starts with variable declarations and argument checking:

/* do_stat --- provide a stat() function for gawk */

static awk_value_t *

do_stat(int nargs, awk_value_t *result)

{

awk_value_t file_param, array_param;

char *name;

awk_array_t array;

int ret;

struct stat sbuf;

/* default is lstat() */

int (*statfunc)(const char *path, struct stat *sbuf) = lstat;

assert(result != NULL);

if (nargs != 2 && nargs != 3) {

if (do_lint)

lintwarn(ext_id,

_("stat: called with wrong number of arguments"));

return make_number(-1, result);

}

Then comes the actual work. First, the function gets the arguments. Next, it gets the information for the file. If the called function (lstat() or stat()) returns an error, the code sets ERRNO and returns:

/* file is first arg, array to hold results is second */

if ( ! get_argument(0, AWK_STRING, & file_param)

|| ! get_argument(1, AWK_ARRAY, & array_param)) {

warning(ext_id, _("stat: bad parameters"));

return make_number(-1, result);

}

if (nargs == 3) {

statfunc = stat;

}

name = file_param.str_value.str;

array = array_param.array_cookie;

/* always empty out the array */

clear_array(array);

/* stat the file; if error, set ERRNO and return */

ret = statfunc(name, & sbuf);

if (ret < 0) {

update_ERRNO_int(errno);

return make_number(ret, result);

}

The tedious work is done by fill_stat_array(), shown earlier. When done, the function returns the result from fill_stat_array():

ret = fill_stat_array(name, array, & sbuf);

return make_number(ret, result);

}

Finally, it’s necessary to provide the “glue” that loads the new function(s) into gawk.

The filefuncs extension also provides an fts() function, which we omit here (see File-Related Functions). For its sake, there is an initialization function:

/* init_filefuncs --- initialization routine */

static awk_bool_t

init_filefuncs(void)

{

}

We are almost done. We need an array of awk_ext_func_t structures for loading each function into gawk:

static awk_ext_func_t func_table[] = {

{ "chdir", do_chdir, 1 },

{ "stat", do_stat, 2 },

#ifndef __MINGW32__

{ "fts", do_fts, 3 },

#endif

};

Each extension must have a routine named dl_load() to load everything that needs to be loaded. It is simplest to use the dl_load_func() macro in gawkapi.h:

/* define the dl_load() function using the boilerplate macro */

dl_load_func(func_table, filefuncs, "")

And that’s it!

Integrating the Extensions

Now that the code is written, it must be possible to add it at runtime to the running gawk interpreter. First, the code must be compiled. Assuming that the functions are in a file named filefuncs.c, and idir is the location of the gawkapi.h header file, the following steps[103] create a GNU/Linux shared library:

$ gcc -fPIC -shared -DHAVE_CONFIG_H -c -O -g -Iidir filefuncs.c

$ gcc -o filefuncs.so -shared filefuncs.o

Once the library exists, it is loaded by using the @load keyword:

# file testff.awk

@load "filefuncs"

BEGIN {

"pwd" | getline curdir # save current directory

close("pwd")

chdir("/tmp")

system("pwd") # test it

chdir(curdir) # go back

print "Info for testff.awk"

ret = stat("testff.awk", data)

print "ret =", ret

for (i in data)

printf "data[\"%s\"] = %s\n", i, data[i]

print "testff.awk modified:",

strftime("%m %d %Y %H:%M:%S", data["mtime"])

print "\nInfo for JUNK"

ret = stat("JUNK", data)

print "ret =", ret

for (i in data)

printf "data[\"%s\"] = %s\n", i, data[i]

print "JUNK modified:", strftime("%m %d %Y %H:%M:%S", data["mtime"])

}

The AWKLIBPATH environment variable tells gawk where to find extensions (see How gawk Finds Extensions). We set it to the current directory and run the program:

$ AWKLIBPATH=$PWD gawk -f testff.awk

/tmp

Info for testff.awk

ret = 0

data["blksize"] = 4096

data["devbsize"] = 512

data["mtime"] = 1412004710

data["mode"] = 33204

data["type"] = file

data["dev"] = 2053

data["gid"] = 1000

data["ino"] = 10358899

data["ctime"] = 1412004710

data["blocks"] = 8

data["nlink"] = 1

data["name"] = testff.awk

data["atime"] = 1412004716

data["pmode"] = -rw-rw-r--

data["size"] = 666

data["uid"] = 1000

testff.awk modified: 09 29 2014 18:31:50

Info for JUNK

ret = -1

JUNK modified: 01 01 1970 02:00:00

The Sample Extensions in the gawk Distribution

This section provides a brief overview of the sample extensions that come in the gawk distribution. Some of them are intended for production use (e.g., the filefuncs, readdir, and inplace extensions). Others mainly provide example code that shows how to use the extension API.

File-Related Functions

The filefuncs extension provides three different functions, as follows. The usage is:

@load "filefuncs"

This is how you load the extension.

result = chdir("/some/directory")

The chdir() function is a direct hook to the chdir() system call to change the current directory. It returns zero upon success or a value less than zero upon error. In the latter case, it updates ERRNO.

result = stat("/some/path", statdata [, follow])

The stat() function provides a hook into the stat() system call. It returns zero upon success or a value less than zero upon error. In the latter case, it updates ERRNO.

By default, it uses the lstat() system call. However, if passed a third argument, it uses stat() instead.

In all cases, it clears the statdata array. When the call is successful, stat() fills the statdata array with information retrieved from the filesystem, as follows:

Subscript

Field in struct stat

File type

"name"

The filename

All

"dev"

st_dev

All

"ino"

st_ino

All

"mode"

st_mode

All

"nlink"

st_nlink

All

"uid"

st_uid

All

"gid"

st_gid

All

"size"

st_size

All

"atime"

st_atime

All

"mtime"

st_mtime

All

"ctime"

st_ctime

All

"rdev"

st_rdev

Device files

"major"

st_major

Device files

"minor"

st_minor

Device files

"blksize"

st_blksize

All

"pmode"

A human-readable version of the mode value, like that printed by ls (for example, "-rwxr-xr-x")

All

"linkval"

The value of the symbolic link

Symbolic links

"type"

The type of the file as a string—one of "file", "blockdev", "chardev", "directory", "socket", "fifo", "symlink", "door", or "unknown" (not all systems support all file types)

All

flags = or(FTS_PHYSICAL, ...)
result = fts(pathlist, flags, filedata)

Walk the file trees provided in pathlist and fill in the filedata array, as described next. flags is the bitwise OR of several predefined values, also described in a moment. Return zero if there were no errors, otherwise return −1.

The fts() function provides a hook to the C library fts() routines for traversing file hierarchies. Instead of returning data about one file at a time in a stream, it fills in a multidimensional array with data about each file and directory encountered in the requested hierarchies.

The arguments are as follows:

pathlist

An array of filenames. The element values are used; the index values are ignored.

flags

This should be the bitwise OR of one or more of the following predefined constant flag values. At least one of FTS_LOGICAL or FTS_PHYSICAL must be provided; otherwise fts() returns an error value and sets ERRNO. The flags are:

FTS_LOGICAL

Do a “logical” file traversal, where the information returned for a symbolic link refers to the linked-to file, and not to the symbolic link itself. This flag is mutually exclusive with FTS_PHYSICAL.

FTS_PHYSICAL

Do a “physical” file traversal, where the information returned for a symbolic link refers to the symbolic link itself. This flag is mutually exclusive with FTS_LOGICAL.

FTS_NOCHDIR

As a performance optimization, the C library fts() routines change directory as they traverse a file hierarchy. This flag disables that optimization.

FTS_COMFOLLOW

Immediately follow a symbolic link named in pathlist, whether or not FTS_LOGICAL is set.

FTS_SEEDOT

By default, the C library fts() routines do not return entries for . (dot) and .. (dot-dot). This option causes entries for dot-dot to also be included. (The extension always includes an entry for dot; more on this in a moment.)

FTS_XDEV

During a traversal, do not cross onto a different mounted filesystem.

filedata

The filedata array holds the results. fts() first clears it. Then it creates an element in filedata for every element in pathlist. The index is the name of the directory or file given in pathlist. The element for this index is itself an array. There are two cases:

The path is a file

In this case, the array contains two or three elements:

"path"

The full path to this file, starting from the “root” that was given in the pathlist array.

"stat"

This element is itself an array, containing the same information as provided by the stat() function described earlier for its statdata argument. The element may not be present if the stat() system call for the file failed.

"error"

If some kind of error was encountered, the array will also contain an element named "error", which is a string describing the error.

The path is a directory

In this case, the array contains one element for each entry in the directory. If an entry is a file, that element is the same as for files, just described. If the entry is a directory, that element is (recursively) an array describing the subdirectory. If FTS_SEEDOT was provided in the flags, then there will also be an element named "..". This element will be an array containing the data as provided by stat().

In addition, there will be an element whose index is ".". This element is an array containing the same two or three elements as for a file: "path", "stat", and "error".

The fts() function returns zero if there were no errors. Otherwise, it returns −1.

NOTE

The fts() extension does not exactly mimic the interface of the C library fts() routines, choosing instead to provide an interface that is based on associative arrays, which is more comfortable to use from an awk program. This includes the lack of a comparison function, because gawk already provides powerful array sorting facilities. Although an fts_read()-like interface could have been provided, this felt less natural than simply creating a multidimensional array to represent the file hierarchy and its information.

See test/fts.awk in the gawk distribution for an example use of the fts() extension function.

Interface to fnmatch()

This extension provides an interface to the C library fnmatch() function. The usage is:

@load "fnmatch"

This is how you load the extension.

result = fnmatch(pattern, string, flags)

The return value is zero on success, FNM_NOMATCH if the string did not match the pattern, or a different nonzero value if an error occurred.

In addition to the fnmatch() function, the fnmatch extension adds one constant (FNM_NOMATCH), and an array of flag values named FNM.

The arguments to fnmatch() are:

pattern

The filename wildcard to match

string

The filename string

flag

Either zero, or the bitwise OR of one or more of the flags in the FNM array

The flags are as follows:

Array element

Corresponding flag defined by fnmatch()

FNM["CASEFOLD"]

FNM_CASEFOLD

FNM["FILE_NAME"]

FNM_FILE_NAME

FNM["LEADING_DIR"]

FNM_LEADING_DIR

FNM["NOESCAPE"]

FNM_NOESCAPE

FNM["PATHNAME"]

FNM_PATHNAME

FNM["PERIOD"]

FNM_PERIOD

Here is an example:

@load "fnmatch"

flags = or(FNM["PERIOD"], FNM["NOESCAPE"])

if (fnmatch("*.a", "foo.c", flags) == FNM_NOMATCH)

print "no match"

Interface to fork(), wait(), and waitpid()

The fork extension adds three functions, as follows:

@load "fork"

This is how you load the extension.

pid = fork()

This function creates a new process. The return value is zero in the child and the process ID number of the child in the parent, or −1 upon error. In the latter case, ERRNO indicates the problem. In the child, PROCINFO["pid"] and PROCINFO["ppid"] are updated to reflect the correct values.

ret = waitpid(pid)

This function takes a numeric argument, which is the process ID to wait for. The return value is that of the waitpid() system call.

ret = wait()

This function waits for the first child to die. The return value is that of the wait() system call.

There is no corresponding exec() function.

Here is an example:

@load "fork"

if ((pid = fork()) == 0)

print "hello from the child"

else

print "hello from the parent"

Enabling In-Place File Editing

The inplace extension emulates GNU sed’s -i option, which performs “in-place” editing of each input file. It uses the bundled inplace.awk include file to invoke the extension properly:

# inplace --- load and invoke the inplace extension.

@load "inplace"

# Please set INPLACE_SUFFIX to make a backup copy. For example, you may

# want to set INPLACE_SUFFIX to .bak on the command line or in a BEGIN rule.

BEGINFILE {

inplace_begin(FILENAME, INPLACE_SUFFIX)

}

ENDFILE {

inplace_end(FILENAME, INPLACE_SUFFIX)

}

For each regular file that is processed, the extension redirects standard output to a temporary file configured to have the same owner and permissions as the original. After the file has been processed, the extension restores standard output to its original destination. If INPLACE_SUFFIX is not an empty string, the original file is linked to a backup filename created by appending that suffix. Finally, the temporary file is renamed to the original filename.

If any error occurs, the extension issues a fatal error to terminate processing immediately without damaging the original file.

Here are some simple examples:

$ gawk -i inplace '{ gsub(/foo/, "bar") }; { print }' file1 file2 file3

To keep a backup copy of the original files, try this:

$ gawk -i inplace -v INPLACE_SUFFIX=.bak '{ gsub(/foo/, "bar") }

> { print }' file1 file2 file3

Character and Numeric values: ord() and chr()

The ordchr extension adds two functions, named ord() and chr(), as follows:

@load "ordchr"

This is how you load the extension.

number = ord(string)

Return the numeric value of the first character in string.

char = chr(number)

Return a string whose first character is that represented by number.

These functions are inspired by the Pascal language functions of the same name. Here is an example:

@load "ordchr"

printf("The numeric value of 'A' is %d\n", ord("A"))

printf("The string value of 65 is %s\n", chr(65))

Reading Directories

The readdir extension adds an input parser for directories. The usage is as follows:

@load "readdir"

When this extension is in use, instead of skipping directories named on the command line (or with getline), they are read, with each entry returned as a record.

The record consists of three fields. The first two are the inode number and the filename, separated by a forward slash character. On systems where the directory entry contains the file type, the record has a third field (also separated by a slash), which is a single letter indicating the type of the file. The letters and their corresponding file types are shown in Table 16-2.

Table 16-2. File types returned by the readdir extension

Letter

File type

b

Block device

c

Character device

d

Directory

f

Regular file

l

Symbolic link

p

Named pipe (FIFO)

s

Socket

u

Anything else (unknown)

On systems without the file type information, the third field is always ‘u’.

NOTE

On GNU/Linux systems, there are filesystems that don’t support the d_type entry (see the readdir(3) manual page), and so the file type is always ‘u’. You can use the filefuncs extension to call stat() in order to get correct type information.

Here is an example:

@load "readdir"

...

BEGIN { FS = "/" }

{ print "filename is", $2 }

Reversing Output

The revoutput extension adds a simple output wrapper that reverses the characters in each output line. Its main purpose is to show how to write an output wrapper, although it may be mildly amusing for the unwary. Here is an example:

@load "revoutput"

BEGIN {

REVOUT = 1

print "don't panic" > "/dev/stdout"

}

The output from this program is ‘cinap t'nod’.

Two-Way I/O Example

The revtwoway extension adds a simple two-way processor that reverses the characters in each line sent to it for reading back by the awk program. Its main purpose is to show how to write a two-way processor, although it may also be mildly amusing. The following example shows how to use it:

@load "revtwoway"

BEGIN {

cmd = "/magic/mirror"

print "don't panic" |& cmd

cmd |& getline result

print result

close(cmd)

}

The output from this program also is: ‘cinap t'nod’.

Dumping and Restoring an Array

The rwarray extension adds two functions, named writea() and reada(), as follows:

@load "rwarray"

This is how you load the extension.

ret = writea(file, array)

This function takes a string argument, which is the name of the file to which to dump the array, and the array itself as the second argument. writea() understands arrays of arrays. It returns one on success, or zero upon failure.

ret = reada(file, array)

reada() is the inverse of writea(); it reads the file named as its first argument, filling in the array named as the second argument. It clears the array first. Here too, the return value is one on success and zero upon failure.

The array created by reada() is identical to that written by writea() in the sense that the contents are the same. However, due to implementation issues, the array traversal order of the re-created array is likely to be different from that of the original array. As array traversal order in awk is by default undefined, this is (technically) not a problem. If you need to guarantee a particular traversal order, use the array sorting features in gawk to do so (see Controlling Array Traversal and Array Sorting).

The file contains binary data. All integral values are written in network byte order. However, double-precision floating-point values are written as native binary data. Thus, arrays containing only string data can theoretically be dumped on systems with one byte order and restored on systems with a different one, but this has not been tried.

Here is an example:

@load "rwarray"

ret = writea("arraydump.bin", array)

ret = reada("arraydump.bin", array)

Reading an Entire File

The readfile extension adds a single function named readfile(), and an input parser:

@load "readfile"

This is how you load the extension.

result = readfile("/some/path")

The argument is the name of the file to read. The return value is a string containing the entire contents of the requested file. Upon error, the function returns the empty string and sets ERRNO.

BEGIN { PROCINFO["readfile"] = 1 }

In addition, the extension adds an input parser that is activated if PROCINFO["readfile"] exists. When activated, each input file is returned in its entirety as $0. RT is set to the null string.

Here is an example:

@load "readfile"

contents = readfile("/path/to/file");

if (contents == "" && ERRNO != "") {

print("problem reading file", ERRNO) > "/dev/stderr"

...

}

Extension Time Functions

The time extension adds two functions, named gettimeofday() and sleep(), as follows:

@load "time"

This is how you load the extension.

the_time = gettimeofday()

Return the time in seconds that has elapsed since 1970-01-01 UTC as a floating-point value. If the time is unavailable on this platform, return −1 and set ERRNO. The returned time should have sub-second precision, but the actual precision may vary based on the platform. If the standard Cgettimeofday() system call is available on this platform, then it simply returns the value. Otherwise, if on MS-Windows, it tries to use GetSystemTimeAsFileTime().

result = sleep(seconds)

Attempt to sleep for seconds seconds. If seconds is negative, or the attempt to sleep fails, return −1 and set ERRNO. Otherwise, return zero after sleeping for the indicated amount of time. Note that seconds may be a floating-point (nonintegral) value. Implementation details: depending on platform availability, this function tries to use nanosleep() or select() to implement the delay.

API Tests

The testext extension exercises parts of the extension API that are not tested by the other samples. The extension/testext.c file contains both the C code for the extension and awk test code inside C comments that run the tests. The testing framework extracts the awk code and runs the tests. See the source file for more information.

The gawkextlib Project

The gawkextlib project provides a number of gawk extensions, including one for processing XML files. This is the evolution of the original xgawk (XML gawk) project.

As of this writing, there are seven extensions:

§ errno extension

§ GD graphics library extension

§ PDF extension

§ PostgreSQL extension

§ MPFR library extension (this provides access to a number of MPFR functions that gawk’s native MPFR support does not)

§ Redis extension

§ XML parser extension, using the Expat XML parsing library

You can check out the code for the gawkextlib project using the Git distributed source code control system. The command is as follows:

git clone git://git.code.sf.net/p/gawkextlib/code gawkextlib-code

You will need to have the Expat XML parser library installed in order to build and use the XML extension.

In addition, you must have the GNU Autotools installed (Autoconf, Automake, Libtool, and GNU gettext).

The simple recipe for building and testing gawkextlib is as follows. First, build and install gawk:

cd .../path/to/gawk/code

./configure --prefix=/tmp/newgawk Install in /tmp/newgawk for now

make && make check Build and check that all is OK

make install Install gawk

Next, build gawkextlib and test it:

cd .../path/to/gawkextlib-code

./update-autotools Generate configure, etc.

You may have to run this command twice

./configure --with-gawk=/tmp/newgawk Configure, point at ``installed'' gawk

make && make check Build and check that all is OK

make install Install the extensions

If you have installed gawk in the standard way, then you will likely not need the --with-gawk option when configuring gawkextlib. You may need to use the sudo utility to install both gawk and gawkextlib, depending upon how your system works.

If you write an extension that you wish to share with other gawk users, consider doing so through the gawkextlib project. See the project’s website for more information.

Summary

§ You can write extensions (sometimes called plug-ins) for gawk in C or C++ using the application programming interface (API) defined by the gawk developers.

§ Extensions must have a license compatible with the GNU General Public License (GPL), and they must assert that fact by declaring a variable named plugin_is_GPL_compatible.

§ Communication between gawk and an extension is two-way. gawk passes a struct to the extension that contains various data fields and function pointers. The extension can then call into gawk via the supplied function pointers to accomplish certain tasks.

§ One of these tasks is to “register” the name and implementation of new awk-level functions with gawk. The implementation takes the form of a C function pointer with a defined signature. By convention, implementation functions are named do_XXXX() for some awk-level functionXXXX().

§ The API is defined in a header file named gawkapi.h. You must include a number of standard header files before including it in your source file.

§ API function pointers are provided for the following kinds of operations:

§ Allocating, reallocating, and releasing memory

§ Registration functions (you may register extension functions, exit callbacks, a version string, input parsers, output wrappers, and two-way processors)

§ Printing fatal, warning, and “lint” warning messages

§ Updating ERRNO, or unsetting it

§ Accessing parameters, including converting an undefined parameter into an array

§ Symbol table access (retrieving a global variable, creating one, or changing one)

§ Creating and releasing cached values; this provides an efficient way to use values for multiple variables and can be a big performance win

§ Manipulating arrays (retrieving, adding, deleting, and modifying elements; getting the count of elements in an array; creating a new array; clearing an array; and flattening an array for easy C-style looping over all its indices and elements)

§ The API defines a number of standard data types for representing awk values, array elements, and arrays.

§ The API provides convenience functions for constructing values. It also provides memory management functions to ensure compatibility between memory allocated by gawk and memory allocated by an extension.

§ All memory passed from gawk to an extension must be treated as read-only by the extension.

§ All memory passed from an extension to gawk must come from the API’s memory allocation functions. gawk takes responsibility for the memory and releases it when appropriate.

§ The API provides information about the running version of gawk so that an extension can make sure it is compatible with the gawk that loaded it.

§ It is easiest to start a new extension by copying the boilerplate code described in this chapter. Macros in the gawkapi.h header file make this easier to do.

§ The gawk distribution includes a number of small but useful sample extensions. The gawkextlib project includes several more (larger) extensions. If you wish to write an extension and contribute it to the community of gawk users, the gawkextlib project is the place to do so.


[95] See the “cookie” entry in the Jargon file for a definition of cookie, and the “magic cookie” entry in the Jargon file for a nice example.

[96] This is more common on MS-Windows systems, but it can happen on Unix-like systems as well.

[97] Because the API uses only ISO C 90 features, it cannot make use of the ISO C 99 variadic macro feature to hide that parameter. More’s the pity.

[98] The difference is measurable and quite real. Trust us.

[99] Numeric values are clearly less problematic, requiring only a C double to store.

[100] OK, the only data structure.

[101] It is also a “cookie,” but the gawk developers did not wish to overuse this term.

[102] This version is edited slightly for presentation. See extension/filefuncs.c in the gawk distribution for the complete version.

[103] In practice, you would probably want to use the GNU Autotools (Automake, Autoconf, Libtool, and gettext) to configure and build your libraries. Instructions for doing so are beyond the scope of this book. See The gawkextlib Project for Internet links to the tools.