Data Processing - Pro Bash Programming: Scripting the GNU/Linux Shell, Second Edition (2015)

Pro Bash Programming: Scripting the GNU/Linux Shell, Second Edition (2015)

CHAPTER 13. Data Processing

Data manipulation includes a wide range of actions, far more than can be adequately covered in a single chapter. However, most actions are just the application of techniques already covered in earlier chapters. Arrays are a basic data structure, and although the syntax was covered in Chapter 5 and they were used in the fifteen puzzle code in Chapter 11, I haven’t yet explained their uses. Parameter expansion has been used in a number of chapters, but its application to parsing data structures has not been discussed.

This chapter will cover different ways of using strings and arrays, how to parse character-delimited records into their individual fields, and how to read a data file. There are two function libraries for manipulating two-dimensional grids, and there are functions for sorting and searching arrays.

Arrays

Arrays are not included in the POSIX shell, but bash has used indexed arrays since version 2.0, and in version 4.0, associative arrays were added. Indexed arrays are assigned and referenced using integer subscripts; associative arrays use strings. There is no preset limit to the number of elements an array can contain; they are limited only by available memory.

Holes in an Indexed Array

If some elements of an indexed array are unset, the array is left with holes and it becomes a sparse array. It will then be impossible to traverse the array merely by incrementing an index. There are various ways of dealing with such an array. To demonstrate, let’s create an array and poke some holes in it:

array=( a b c d e f g h i j )
unset array[2] array[4] array[6] array[8]

The array now contains six elements instead of the original ten:

$ sa "${array[@]}"
:a:
:b:
:d:
:f:
:h:
:j:

One way to iterate through all the remaining elements is to expand them as arguments to for. In this method, there is no way of knowing what the subscript for each element is:

for i in "${array[@]}"
do
: do something with each element, $i, here
done

With a packed array (one with no holes), the index can start at 0 and be incremented to get the next element. With a sparse (or any) array, the ${!array[@]} expansion lists the subscripts:

$ echo "${!array[@]}"
0 1 3 5 7 9

This expansion can be used as the argument to for:

for i in "${!array[@]}"
do
: do something with ${array[$i]} here
done

That solution does not provide a method of referring to the next element. You can save the previous element yet not get the value of the next one. To do that, you could put the list of subscripts into an array and use its elements to reference the original array. It’s much simpler to pack the array, removing the holes:

$ array=( "${array[@]}" )
$ echo "${!array[@]}"
0 1 2 3 4 5

Note that this will convert an associative array to an indexed array.

Using an Array for Sorting

Ordering data alphabetically (or numerically) is not usually a task for the shell. The sort command is a very flexible and efficient tool that can handle most sorting needs. There are, however, a couple of cases where sorting can best be done by the shell.

The most obvious is file name expansion, in which the result of expanding wildcards is always sorted alphabetically. This is useful, for example, when working with date-stamped files. If the date stamp uses the standard ISO format, YYYY-MM-DD, or a compressed version, YYYYMMDD, the files will automatically be sorted in date order. If you have files in the format log.YYYYMMDD, this loops through them in chronological order:

for file in log.* ## loop through files in chronological order
do
: do whatever
done

There is no need to use ls; the shell sorts the wildcard expansion.

With bash-4.x, another expansion is sorted alphabetically: associative arrays with single-character subscripts:

$ declare -A q
$ q[c]=1 q[d]=2 q[a]=4
$ sa "${q[@]}"
:4:
:1:
:2:

This led to writing a function that sorts the letters of a word (Listing 13-1).

Listing 13-1. lettersort, Sort Letters in a Word Alphabetically

lettersort() #@ Sort letters in $1, store in $2
{
local letter string
declare -A letters
string=${1:?}
while [ -n "$string" ]
do
letter=${string:0:1}
letters["$letter"]=${letters["$letter"]}$letter
string=${string#?}
done
printf -v "${2:-_LETTERSORT}" "%s" "${letters[@]}"
}

What’s the point, you ask? Take a look at these examples:

$ lettersort triangle; printf "%s\n" "$_LETTERSORT"
aegilnrt
$ lettersort integral; printf "%s\n" "$_LETTERSORT"
aegilnrt

When the letters are sorted, you can see that the two words contain the same letters. Therefore, they are anagrams of each other. Try this process with the words altering, alerting, and relating.

Insertion Sort Function

If you really want to do your sorting in the shell, you can. The function in Listing 13-2 is slower than the external sort command when there are more than 15 to 20 elements (the exact numbers will vary depending on your computer, its load, and so on). It inserts each element into the correct position in an array and then prints the resulting array.

Image Note The sort function is a program written in C, optimized for speed, and compiled, whereas the script written in bash is interpreted at runtime. However, it all depends on the number of elements you are sorting and the way your scipt is structured, which determines the suitability ofsort over using your own scripted sort.

Listing 13-2. isort, Sort Command-Line Arguments

isort()
{
local -a a
a=( "$1" ) ## put first argument in array for initial comparison
shift ## remove first argument
for e ## for each of the remaining arguments…
do
if [ "$e" \< "${a[0]}" ] ## does it precede the first element?
then
a=( "$e" "${a[@]}" ) ## if yes, put it first
elif [ "$e" \> "${a[${#a[@]}-1]}" ] ## if no, does it go at the end?
then
a=( "${a[@]}" "$e" ) ## if yes, put it at the end
else ## otherwise,
n=0
while [ "${a[$n]}" \< "$e" ] ## find where it goes
do
n=$(( $n + 1 ))
done
a=( "${a[@]:0:n}" "$e" "${a[@]:n}" ) ## and put it there
fi
done
printf "%s\n" "${a[@]}"
}

To put Canada’s ten provincial capitals in alphabetical order, you’d use this code:

$ isort "St. John's" Halifax Fredericton Charlottetown "Quebec City" \
Toronto Winnipeg Regina Edmonton Victoria
Charlottetown
Edmonton
Fredericton
Halifax
Quebec City
Regina
St. John's
Toronto
Victoria
Winnipeg

Searching an Array

As with the isort function, this function is designed for use with relatively small arrays. If the array contains more than a certain number of elements (50? 60? 70?), it is faster to pipe it through grep. The function in Listing 13-3 takes the name of an array and a search string as arguments and stores elements containing the search string in a new array, _asearch_elements.

Listing 13-3. asearch, Search Elements of an Array for a String

asearch() #@ Search for substring in array; results in array _asearch_elements
{ #@ USAGE: asearch arrayname string
local arrayname=$1 substring=$2 array

eval "array=( \"\${$arrayname[@]}\" )"

case ${array[*]} in
*"$substring"*) ;; ## it's there; drop through
*) return 1 ;; ## not there; return error
esac

unset _asearch_elements
for subscript in "${!array[@]}"
do
case ${array[$subscript]} in
*"$substring"*)
_asearch_elements+=( "${array[$subscript]}" )
;;
esac
done
}

To see the function in action, put the provincial capitals from the previous section into an array and call asearch:

$ capitals=( "St. John's" Halifax Fredericton Charlottetown "Quebec City"
Toronto Winnipeg Regina Edmonton Victoria )
$ asearch captials Hal && printf "%s\n" "${_asearch_elements[@]}"
Halifax
$ asearch captials ict && printf "%s\n" "${_asearch_elements[@]}"
Fredericton
Victoria

Reading an Array into Memory

There are various ways of reading a file into an array with bash. The most obvious is also the slowest: a while read loop:

unset array
while read line
do
array+=( "$line" )
done < "$kjv" ## kjv is defined in Chapter 8

A faster method that is still portable uses the external command, cat:

IFS=$'\n' ## split on newlines, so each line is a separate element
array=( $(cat "$kjv") )

In bash, cat is unnecessary:

array=( < "$kjv" ) ## IFS is still set to a newline

With bash-4.x, a new built-in command, mapfile, is even faster:

mapfile -t array < "$kjv"

The options to mapfile allow you to select the line at which to start reading (actually, it’s the number of lines to skip before starting to read), the number of lines to read, and the index at which to start populating the array. If no array name is given, the variable MAPFILE is used.

The following are the seven options to mapfile:

· -n num: Reads no more than num lines

· -O index: Begins populating the array at element index

· -s num: Discards the first num lines

· -t: Removes the trailing newline from each line

· -u fd: Reads from input stream fd instead of the standard input

· -C callback: Evaluates the shell command callback every N lines, where N is set by -c N

· -c N: Specifies the number of lines between each evaluation of callback; the default is 5000

With older versions of bash, you could use sed to extract ranges of lines from a file; with bash-4.x, you could use mapfile. Listing 13-4 installs a function that uses mapfile if the version of bash is 4.x or greater but sed is used if not.

Listing 13-4. getlines, Store a Range of Lines from a File in an Array

if [ "${BASH_VERSINFO[0]}" -ge 4 ]
then
getlines() #@ USAGE: getlines file start num arrayname
{
mapfile -t -s$(( $2 - 1 )) -n ${3:?} "$4" < "$1"
}
else
getlines() #@ USAGE: getlines file start num arrayname
{
local IFS=$'\n' getlinearray arrayname=${4:?}
getlinearray=( $(sed -n "$2,$(( $2 - 1 + $3 )) p" "$1") )
eval "$arrayname=( \"\${getlinearray[@]}\" )"
}
fi

Process substitution and external utilities can be used with mapfile to extract portions of a file using different criteria:

mapfile -t exodus < <(grep ^Exodus: "$kjv") ## store the book of Exodus
mapfile -t books < <(cut -d: -f1 "$kjv" | uniq) ## store names of all books in KJV

Image Tip You can also use readarray to read the data from a file into an array, it is basically an alias for mapfile.

Two-Dimensional Grids

Programmers often have to deal with two-dimensional grids. As a constructor of crossword puzzles, I need to convert a grid from a puzzle file to a format that my clients’ publications can import into desktop publishing software. As a chess tutor, I need to convert chess positions into a format I can use in worksheets for my students. In games such as tic-tac-toe, maxit, and fifteen (from Chapter 11), the game board is a grid.

The obvious structure to use is a two-dimensional array. Because bash has only one-dimensional arrays, a workaround is needed to simulate two dimensions. This can be done as an array, a string, an array of strings, or a “poor man’s” array (see Chapter 9).

For a chess diagram, an associative array could be used, with the squares identified using the standard algebraic notation (SAN) for squares, a1, b1 to g8, h8:

declare -A chessboard
chessboard["a1"]=R
chessboard["a2"]=P
: ... 60 squares skipped
chessboard["g8"]=r
chessboard["h8"]=b

A structure that I’ve used on a few occasions is an array in which each element is a string representing a rank:

chessboard=(
RNBQKBRN
PPPPPPPP
" "
" "
" "
" "
pppppppp
rnbqkbnr
)

My preference, when using bash, is a simple indexed array:

chessboardarray=(
R N B Q K B R N
P P P P P P P P
"" "" "" "" "" "" "" ""
"" "" "" "" "" "" "" ""
"" "" "" "" "" "" "" ""
"" "" "" "" "" "" "" ""
p p p p p p p p
r n b q k b n r
)

Or, in a POSIX shell, it could be a single string:

chessboard="RNBQKBRNPPPPPPPP pppppppprnbqkbnr"

Next, two function libraries are discussed, one for dealing with grids in a single string and the other for grids stored in arrays.

Working with Single-String Grids

I have a function library, stringgrid-funcs, for dealing with two-dimensional grids stored in a single string. There is a function to initialize all elements of a grid to a given character and one to calculate the index in the string of a character based on the x and y coordinates. There’s one to fetch the character in the string using x/y and one to place a character into the grid at x/y. Finally, there are functions to print a grid, starting either with the first row or with the last row. These functions only work with square grids.

Function: initgrid

Given the name of the grid (that is, the variable name), the size, and optionally the character with which to fill it, initgrid (Listing 13-5) creates a grid with the parameters supplied. If no character is supplied, a space is used.

Listing 13-5. initgrid, Create a Grid and Fill It

initgrid() #@ Fill N x N grid with a character
{ #@ USAGE: initgrid gridname size [character]
## If a parameter is missing, it's a programming error, so exit
local grid gridname=${1:?} char=${3:- } size
export gridsize=${2:?} ## set gridsize globally

size=$(( $gridsize ** 2 )) ## total number of characters in grid
printf -v grid "%$size.${size}s" " " ## print string of spaces to variable
eval "$gridname=\${grid// /"$char"}" ## replace spaces with desired character
}

The length of the string is the square of the grid size. A string of that length is created using a width specification in printf, with the -v option to save it to a variable supplied as an argument. Pattern substitution then replaces the spaces with the requested string.

This and the other functions in this library use the ${var:?} expansion, which displays an error and exits the script if there is no value for the parameter. This is appropriate because it is a programming error, not a user error if a parameter is missing. Even if it’s missing because the userfailed to supply it, it is still a programming error; the script should have checked that a value had been entered.

A tic-tac-toe grid is a string of nine spaces. For something this simple, the initgrid function is hardly necessary, but it is a useful abstraction:

$ . stringgrid-funcs
$ initgrid ttt 3
$ sa "$ttt" ## The sa script/function has been used in previous chapters
: :

Function: gridindex

To convert x and y coordinates into the corresponding position in the grid string, subtract 1 from the row number, multiply it by the gridsize, and add the columns. Listing 13-6, gridindex, is a simple formula that could be used inline when needed, but again the abstraction makes using string grids easier and localizes the formula so that if there is a change, it only needs fixing in one place.

Listing 13-6. gridindex, Calculate Index from Row and Column

gridindex() #@ Store row/column's index into string in var or $_gridindex
{ #@ USAGE: gridindex row column [gridsize] [var]]
local row=${1:?} col=${2:?}

## If gridsize argument is not given, take it from definition in calling script
local gridsize=${3:-$gridsize}
printf -v "${4:-_GRIDINDEX}" "%d" "$(( ($row - 1) * $gridsize + $col - 1))"
}

What’s the index of row 2, column 3 in the tic-tac-toe grid string?

$ gridindex 2 3 ## gridsize=3
$ echo "$_GRIDINDEX"
5

Function: putgrid

To change a character in the grid string, putgrid (Listing 13-7) takes four arguments: the name of the variable containing the string, the row and column coordinates, and the new character. It splits the string into the part before the character and the part after it using bash’s substring parameter expansion. It then sandwiches the new character between the two parts and assigns the composite string to the gridname variable. (Compare this with the _overlay function in Chapter 7.)

Listing 13-7. putgrid, Insert Character in Grid at Specified Row and Column

putgrid() #@ Insert character int grid at row and column
{ #@ USAGE: putgrid gridname row column char
local gridname=$1 ## grid variable name
local left right ## string to left and right of character to be changed
local index ## result from gridindex function
local char=${4:?} ## character to place in grid
local grid=${!gridname} ## get grid string though indirection

gridindex ${2:?} ${3:?} "$gridsize" index

left=${grid:0:index}
right=${grid:index+1}
grid=$left$4$right
eval "$gridname=\$grid"
}

Here’s the code for the first move in a tic-tac-toe game:

$ putgrid ttt 1 2 X
$ sa "$ttt"
: X :

Function: getgrid

The opposite of putgrid is getgrid (Listing 13-8). It returns the character in a given position. Its arguments are the grid name (I could have used the string itself, because nothing is being assigned to it, but the grid name is used for consistency), the coordinates, and the name of the variable in which to store the character. If no variable name is supplied, it is stored in _GRIDINDEX.

Listing 13-8. getgrid, Get Character at Row and Column Location in Grid

getgrid() #@ Get character from grid in row Y, column X
{ #@ USAGE: getgrid gridname row column var
: ${1:?} ${2:?} ${3:?} ${4:?}
local grid=${!1}
gridindex "$2" "$3"
eval "$4=\${grid:_GRIDINDEX:1}"
}

This snippet returns the piece in the square e1. A chess utility would convert the square to coordinates and then call the getgrid function. Here it is used directly:

$ gridsize=8
$ chessboard="RNBQKBRNPPPPPPPP pppppppprnbqkbnr"
$ getgrid chessboard 1 5 e1
$ sa "$e1"
:K:

Function: showgrid

This function (Listing 13-9) extracts rows from a string grid using substring expansion and the gridsize variable and prints them to the standard output.

Listing 13-9. showgrid, Print a Grid from a String

showgrid() #@ print grid in rows to stdout
{ #@ USAGE: showgrid gridname [gridsize]
local grid=${!1:?} gridsize=${2:-$gridsize}
local row ## the row to be printed, then removed from local copy of grid

while [ -n "$grid" ] ## loop until there's nothing left
do
row=${grid:0:"$gridsize"} ## get first $gridsize characters from grid
printf "\t:%s:\n" "$row" ## print the row
grid=${grid#"$row"} ## remove $row from front of grid
done
}

Here another move is added to the tic-tac-toe board and displays it:

$ gridsize=3 ## reset gridsize after changing it for the chessboard
$ putgrid ttt 2 2 O ## add O's move in the center square
$ showgrid ttt ## print it
: X :
: O :
: :

Function: rshowgrid

For most grids, counting begins in the top left corner. For others, such as a chessboard, it starts in the lower left corner. To display a chessboard, the rgridshow function extracts and displays rows starting from the end of the string rather than from the beginning.

In Listing 13-10, substring expansion is used with a negative.

Listing 13-10. rshowgrid, Print a Grid in Reverse Order

rshowgrid() #@ print grid to stdout in reverse order
{ #@ USAGE: rshowgrid grid [gridsize]
local grid gridsize=${2:-$gridsize} row
grid=${!1:?}
while [ -n "$grid" ]
do
## Note space before minus sign
## to distinguish it from default value substitution
row=${grid: -$gridsize} ## get last row from grid
printf "\t:%s:\n" "$row" ## print it
grid=${grid%"$row"} ## remove it
done
}

Here, rshowgrid is used to display the first move of a chess game. (For those who are interested, the opening is called Bird’s Opening. It’s not often played, but I have been using it successfully for 45 years.)

$ gridsize=8
$ chessboard="RNBQKBRNPPPPPPPP pppppppprnbqkbnr"
$ putgrid chessboard 2 6 ' '
$ putgrid chessboard 4 6 P
$ rshowgrid chessboard
:rnbqkbnr:
:pppppppp:
: :
: :
: P :
: :
:PPPPP PP:
:RNBQKBRN:

These output functions can be augmented by piping the output through a utility such as sed or awk or even replaced with a custom function for specific uses. I find that the chessboard looks better when piped through sed to add some spacing:

$ rshowgrid chessboard | sed 's/./& /g' ## add a space after every character
: r n b q k b n r :
: p p p p p p p p :
: :
: :
: P :
: :
: P P P P P P P :
: R N B Q K B R N :

Two-Dimensional Grids Using Arrays

For many grids, a single string is more than adequate (and is portable to other shells), but an array-based grid offers more flexibility. In the fifteen puzzle in Chapter 11, the board is stored in an array. It is printed with printf using a format string that can easily be changed to give it a different look. The tic-tac-toe grid in an array could be as follows:

$ ttt=( "" X "" "" O "" "" X "" )

And this is the format string:

$ fmt="
| |
%1s | %1s | %1s
----+---+----
%1s | %1s | %1s
----+---+----
%1s | %1s | %1s
| |

"

And the result, when printed, looks like this:

$ printf "$fmt" "${ttt[@]}"

| |
| X |
----+---+----
| O |
----+---+----
| X |
| |

If the format string is changed to this:

fmt="

_/ _/
%1s _/ %1s _/ %1s
_/ _/
_/_/_/_/_/_/_/_/_/_/
_/ _/
%1s _/ %1s _/ %1s
_/ _/
_/_/_/_/_/_/_/_/_/_/
_/ _/
%1s _/ %1s _/ %1s
_/ _/

"

the output will look like this:

_/ _/
_/ X _/
_/ _/
_/_/_/_/_/_/_/_/_/_/
_/ _/
_/ O _/
_/ _/
_/_/_/_/_/_/_/_/_/_/
_/ _/
_/ X _/
_/ _/

The same output could be achieved with a single-string grid, but it would require looping over every character in the string. An array is a group of elements that can be addressed individually or all at once, depending on the need.

The functions in arraygrid-funcs mirror those in stringgrid-funcs. In fact, the gridindex function is identical to the one in stringgrid-funcs, so it’s not repeated here. As with the sdtring grid functions, some of them expect the size of the grid to be available in a variable, agridsize.

Function: initagrid

Most of the functions for array grids are simpler than their single-string counterparts. A notable exception is initagrid (Listing 13-11), which is longer and slower, due to the necessity of a loop instead of a simple assignment. The entire array may be specified as arguments, and any unused array elements will be initialized to an empty string.

Listing 13-11. initagrid, Initialize a Grid Array

initagrid() #@ Fill N x N grid with supplied data (or placeholders if none)
{ #@ USAGE: initgrid gridname size [character ...]
## If a required parameter is missing, it's a programming error, so exit
local grid gridname=${1:?} char=${3:- } size
export agridsize=${2:?} ## set agridsize globally

size=$(( $agridsize * $agridsize )) ## total number of elements in grid

shift 2 ## Remove first two arguments, gridname and agridsize
grid=( "$@" ) ## What's left goes into the array

while [ ${#grid[@]} -lt $size ]
do
grid+=( "" )
done

eval "$gridname=( \"\${grid[@]}\" )"
}

Function: putagrid

Changing a value in an array is a straightforward assignment. Unlike changing a character in a string, there is no need to tear it apart and put it back together. All that’s needed is the index calculated from the coordinates. This function (Listing 13-12) requires agridsize to be defined.

Listing 13-12. putagrid, Replace a Grid Element

putagrid() #@ Replace character in grid at row and column
{ #@ USAGE: putagrid gridname row column char
local left right pos grid gridname=$1
local value=${4:?} index
gridindex ${2:?} ${3:?} "$agridsize" index ## calculate the index
eval "$gridname[index]=\$value" ## assign the value
}

Function: getagrid

Given the x and y coordinates, getagrid fetches the value at that position and stores it in a supplied variable (Listing 13-13).

Listing 13-13. getagrid, Extract an Entry from a Grid

getagrid() #@ Get entry from grid in row Y, column X
{ #@ USAGE: getagrid gridname row column var
: ${1:?} ${2:?} ${3:?} ${4:?}
local grid

eval "grid=( \"\${$1[@]}\" )"
gridindex "$2" "$3"
eval "$4=\${grid[$_GRIDINDEX]}"
}

Function: showagrid

The function showagrid (Listing 13-14) prints each row of an array grid on a separate line.

Listing 13-14. showagrid, Description

showagrid() #@ print grid to stdout
{ #@ USAGE: showagrid gridname format [agridsize]
local gridname=${1:?} grid
local format=${2:?}
local agridsize=${3:-${agridsize:?}} row

eval "grid=( \"\${$1[@]}\" )"
printf "$format" "${grid[@]}"
}

Function: rshowagrid

The function rshowagrid (Listing 13-15) prints each row of an array grid on a separate line in reverse order.

Listing 13-15. rshowagrid, Description

rshowagrid() #@ print grid to stdout in reverse order
{ #@ USAGE: rshowagrid gridname format [agridsize]
local format=${2:?} temp grid
local agridsize=${3:-$agridsize} row
eval "grid=( \"\${$1[@]}\" )"
while [ "${#grid[@]}" -gt 0 ]
do
## Note space before minus sign
## to distinguish it from default value substitution
printf "$format" "${grid[@]: -$agridsize}"
grid=( "${grid[@]:0:${#grid[@]}-$agridsize}" )
done
}

Data File Formats

Data files are used for many purposes and come in many different flavors, which are divided into two main types: line oriented and block oriented. In line-oriented files, each line is a complete record, usually with fields separated by a certain character. In block-oriented files, each record can span many lines, and there may be more than one block in a file. In some formats, a record is more than one block (a chess game in PGN format, for example, is two blocks separated by a blank line).

The shell is not the best language for working with large files of data; it is better when working with individual records. However, there are utilities such as sed and awk that can work efficiently with large files and extract records to pass to the shell. This section deals with processing single records.

Line-Based Records

Line-based records are those where each line in the file is a complete record. It will usually be divided into fields by a delimiting character, but sometimes the fields are defined by length: the first 20 characters are the names, the next 20 are the first line of the address, and so on.

When the files are large, the processing is usually done by an external utility such as sed or awk. Sometimes an external utility will be used to select a few records for the shell to process. This snippet searches the password file for users whose shell is bash and feeds the results to the shell to perform some (unspecified) checks:

grep 'bash$' /etc/passwd |
while read line
do
: perform some checking here
done

Delimiter-Separated Values

Most single-line records will have fields delimited by a certain character. In /etc/passwd, the delimiter is a colon. In other files, the delimiter may be a tab, tilde, or, very commonly, a comma. For these records to be useful, they must be split into their separate fields.

When records are received on an input stream, the easiest way to split them is to change IFS and read each field into its own variable:

grep 'bash$' /etc/passwd |
while IFS=: read user passwd uid gid name homedir shell
do
printf "%16s: %s\n" \
User "$user" \
Password "$passwd" \
"User ID" "$uid" \
"Group ID" "$gid" \
Name "$name" \
"Home directory" "$homedir" \
Shell "$shell"

read < /dev/tty
done

Sometimes it is not possible to split a record as it is read, such as if the record will be needed in its entirety as well as split into its constituent fields. In such cases, the entire line can be read into a single variable and then split later using any of several techniques. For all of these, the examples here will use the root entry from /etc/passwd:

record=root:x:0:0:root:/root:/bin/bash

The fields can be extracted one at a time using parameter expansion:

for var in user passwd uid gid name homedir shell
do
eval "$var=\${record%%:*}" ## extract the first field
record=${record#*:} ## and take it off the record
done

As long as the delimiting character is not found within any field, records can be split by setting IFS to the delimiter. When doing this, file name expansion should be turned off (with set -f) to avoid expanding any wildcard characters. The fields can be stored in an array and variables can be set to reference them:

IFS=:
set -f
data=( $record )
user=0
passwd=1
uid=2
gid=3
name=4
homedir=5
shell=6

The variable names are the names of the fields that can then be used to retrieve values from the data array:

$ echo;printf "%16s: %s\n" \
User "${data[$user]}" \
Password "${data[$passwd]}" \
"User ID" "${data[$uid]}" \
"Group ID" "${data[$gid]}" \
Name "${data[$name]}" \
"Home directory" "${data[$homedir]}" \
Shell "${data[$shell]}"

User: root
Password: x
User ID: 0
Group ID: 0
Name: root
Home directory: /root
Shell: /bin/bash

It is more usual to assign each field to a scalar variable. This function (Listing 13-16) takes a passwd record and splits it on colons and assigns fields to the variables.

Listing 13-16. split_passwd, Split a Record from /etc/passwd into Fields and Assign to Variables

split_passwd() #@ USAGE: split_passwd RECORD
{
local opts=$- ## store current shell options
local IFS=:
local record=${1:?} array

set -f ## Turn off filename expansion
array=( $record ) ## Split record into array
case $opts in *f*);; *) set +f;; esac ## Turn on expansion if previously set

user=${array[0]}
passwd=${array[1]}
uid=${array[2]}
gid=${array[3]}
name=${array[4]}
homedir=${array[5]}
shell=${array[6]}
}

The same thing can be accomplished using a here document (Listing 13-17).

Listing 13-17. split_passwd, Split a Record from /etc/passwd into Fields and Assign to Variables

split_passwd()
{
IFS=: read user passwd uid gid name homedir shell <<.
$1
.
}

More generally, any character-delimited record can be split into variables for each field with this function (Listing 13-18).

Listing 13-18. split_record, Split a Record by Reading Variables

split_record() #@ USAGE parse_record record delimiter var ...
{
local record=${1:?} IFS=${2:?} ## record and delimiter must be provided
: ${3:?} ## at least one variable is required
shift 2 ## remove record and delimiter, leaving variables

## Read record into a list of variables using a 'here document'
read "$@" <<.
$record
.
}

Using the record defined earlier, here’s the output:

$ split_record "$record" : user passwd uid gid name homedir shell
$ sa "$user" "$passwd" "$uid" "$gid" "$name" "$homedir" "$shell"
:root:
:x:
:0:
:0:
:root:
:/root:
:/bin/bash:

Fixed-Length Fields

Less common than delimited fields are fixed-length fields. They aren’t used often, but when they are, they would be looped through name=width strings to parse them, which is how many text editors import data from fixed-length field data files:

line="John 123 Fourth Street Toronto Canada "
for nw in name=15 address=20 city=12 country=22
do
var=${nw%%=*} ## variable name precedes the equals sign
width=${nw#*=} ## field width follows it
eval "$var=\${line:0:width}" ## extract field
line=${line:width} ## remove field from the record
done

Block File Formats

Among the many types of block data files to work with is the portable game notation (PGN) chess file. It stores one or more chess games in a format that is both human readable and machine readable. All chess programs can read and write this format.

Each game begins with a seven-tag roster that identifies where and when the game was played, who played it, and the results. This is followed by a blank line and then the moves of the game.

Here’s a PGN chess game file (from http://cfaj.freeshell.org/Fidel.pgn):

[Event "ICS rated blitz match"]
[Site "69.36.243.188"]
[Date "2009.06.07"]
[Round "-"]
[White "torchess"]
[Black "FidelCastro"]
[Result "1-0"]

1. f4 c5 2. e3 Nc6 3. Bb5 Qc7 4. Nf3 d6 5. b3 a6 6. Bxc6+ Qxc6 7. Bb2 Nf6
8. O-O e6 9. Qe1 Be7 10. d3 O-O 11. Nbd2 b5 12. Qg3 Kh8 13. Ne4 Nxe4 14.
Qxg7#
{FidelCastro checkmated} 1-0

You can use a while loop to read the tags and then mapfile to get the moves of the game. The gettag function extracts the value from each tag and assigns it to the tag name (Listing 13-19).

Listing 13-19. readpgn, Parse a PGN Game and Print Game in a Column

pgnfile="${1:?}"
header=0
game=0

gettag() #@ create a variable with the same name and value as the tag
{
local tagline=$1
tag=${tagline%% *} ## get line before the first space
tag=${tag#?} ## remove the open bracket
IFS='"' read a val b <<. ## get the 2nd field, using " as delimiter
$tagline
.

eval "$tag=\$val"
}

{
while IFS= read -r line
do
case $line in
\[*) gettag "$line" ;;
"") [ -n "$Event" ] && break;; ## skip blank lines at beginning of file
esac
done
mapfile -t game ## read remainder of the file
} < "$pgnfile"

## remove blank lines from end of array
while [ -z "${game[${#game[@]}-1]}" ]
do
unset game[${#game[@]}-1]
done

## print the game with header
echo "Event: $Event"
echo "Date: $Date"
echo
set -f
printf "%4s %-10s %-10s\n" "" White Black "" ========== ========== \
"" "$White" "$Black" ${game[@]:0:${#game[@]}-1}
printf "%s\n" "${game[${#game[@]}-1]}"

Summary

This chapter only scratched the surface of the possibilities for data manipulation, but it is hoped that it will provide techniques to solve some of your needs and provide hints for others. Much of the chapter involved using that most basic of programming structures, arrays. Techniques were shown for working with single-line, character-delimited records, and basic techniques for working with blocks of data in files.

Exercises

1. Modify the isort and asearch functions to use sort and grep, respectively, if the array exceeds a certain size.

2. Write a function that transposes rows and columns in a grid (either a single-string grid or an array). For example, transform these:

123
456
789

into these:

147
256
369

3. Convert some of the grid functions, either string or array versions, to work with grids that are not square, for example, 6 × 3.

4. Convert the code that parses fixed-width records into a function that accepts the line of data as the first argument, followed by the varname=width list.