Filters and Utilities - Tools and Programming - UNIX: The Complete Reference (2007)

UNIX: The Complete Reference (2007)

Part V: Tools and Programming

Chapter List

Chapter 19: Filters and Utilities

Chapter 20: Shell Scripting

Chapter 21: awk and sed

Chapter 22: Perl

Chapter 23: Python

Chapter 24: C and C++ Programming Tools

Chapter 25: An Overview of Java

Chapter 19: Filters and Utilities

Overview

One of the most valuable features of the UNIX System is the rich set of commands it gives you. This chapter surveys a particularly useful set of commands that are often referred to as tools or utilities. They are small, modular commands, each of which performs a specific function, such as sorting a list or searching for a word in a file. You can use them singly and in combination to carry out many common tasks.

Most of the tools described in this chapter are what are often referred to as filters. Filters are programs that read standard input, operate on it, and produce the result as standard output. They are not interactive-they do not prompt you or wait for input. Filters are often used with other commands in a command pipeline. By allowing you to combine filters in pipelines, the UNIX System makes it easy to accomplish tasks that would be overly difficult and time-consuming in other operating systems.

Most of the filters are designed to work with text or with text files. In general, filters do not modify the original file, so you can experiment without much risk of overwriting data. (Exceptions to this rule are carefully noted.) Also, most of the tools in this chapter have other command-line options that are not included here. To get more details about the options that are available, check the man pages or the references at the end of this chapter.

A number of the tools described in this chapter have features that are especially useful in dealing with files containing structured lists. Such files are often used as simple databases. Typically, each line in the file is a separate record containing information about a particular item. The information is often structured in fields. For example, each line in a personnel file may contain a record consisting of information about one employee, with fields for name, address, phone number, and so forth. The UNIX System includes tools to search, edit, and reformat this type of file.

This chapter also describes a number of miscellaneous tools, including commands for compressing files, performing numerical calculations, and monitoring input and output. For other utilities, see Chapter 3(which includes the commands for working with files and directories) and Chapter 5 (which explains the main tools for editing text). The chapter after this one, which shows you how to write shell scripts, includes many uses of the tools presented here. And Chapter 21 explains how to use awk and sed, a very powerful pair of tools for working with files and pattern matching.

Most of the tools described here can be found in any standard UNIX or Linux system. A few, such as patch and tac, come with Linux but are not part of the standard UNIX command set. You can download free versions of many of these tools through the GNU project, at http://www.gnu.org/. Versions of most of the tools mentioned in this chapter are also available for Microsoft Windows through the MKS toolkit (http://www.mkssoftware.com/).

Finding Patterns in Files

Among the most commonly used tools in the UNIX System are those for finding words in files, especially grep, fgrep, and egrep. These commands search for text that matches a target or pattern that you specify You can use them to extract information from files, to search the output of a command for lines relating to a particular item, and to locate files containing a particular key word.

The three commands in the grep family are very similar. All of them print lines matching a target. They differ, however, in how you specify the search targets.

§ grep is the most commonly used of the three commands. It lets you search for a target which may be one or more words or patterns containing wildcards and other regular expression elements.

§ fgrep (fixed grep) does not allow regular expressions but does allow you to search for multiple targets.

§ egrep (extended grep) takes a richer set of regular expressions, as well as allowing multiple target searches, and is considerably faster than grep.

grep

The grep command searches through one or more files for lines containing a target and then prints all of the matching lines it finds. For example, the following command prints all lines in the file mtg_note that contain the word “room”:

$ grep room mtg_note

will be held at 2:00 in room 1J303. We will discuss

Note that you specify the target as the first argument and follow it with the names of the files to search. Think of the command as “search for target in file.”

The target can be a phrase-that is, two or more words separated by spaces. If the target contains spaces, however, you have to enclose it in quotes to prevent the shell from treating the different words as separate arguments. The following searches for lines containing the phrase “boxing wizards” in the file pangrams:

$ grep "boxing wizards" pangrams

The five boxing wizards jump quickly.

Note that if the words “boxing” and “wizards” appear on different lines (separated by a newline character), grep will not find them, because it looks at only one line at a time.

If you give grep two or more files to search, it includes the name of the file before each line of output. For example, the following command searches for lines containing the string “vacation” in all of the files in the current directory:

$ grep vacation *

mbox: I'll be gone on vacation July 24–28, but we could meet

mbox: so, the only week when we're all available for a vacation

savemail: sounds like a great idea for a vacation. I'd love

The output lists the names of the two files that contain the target word “vacation”-mbox and savemail-and the line(s) containing the target in each file.

You can use this feature to locate a file when you have forgotten its name but remember a key word that would identify it. For example, if you keep copies of your saved e-mail in a particular directory, you can use grep to find the one dealing with a particular subject by searching for a word or phrase that you know is contained in it. The following command shows how you can use grep to find a mail from someone named Dan:

$ grep Dan *

savemail27: From: Dan N <dnidz>

savemail43: well, sure. Dancing is pretty good exercise, so I

This shows you that the letter you were looking for is in the file savemail27.

Searching for Patterns Using Regular Expressions

The examples so far have used grep to search for specific words or strings of text, but grep also allows you to search for patterns that may match a number of different words or strings. The patterns for grepcan be the same kinds of regular expressions that were described in Chapter 5. For example,

$ grep 'ch.*se' recipes

will find entries containing “chinese” or “cheese”, or in fact any line that has a ch sometime before an se, including something like “Blanch for 45 seconds”.

In the preceding pattern, the dot (.) matches any character other than newline. The asterisk says that those characters may be repeated any number of times. Together, .* indicates any string of any characters. Note that in this example the target pattern “ch.*se” is enclosed in single quotation marks. This prevents the asterisk from being treated by the shell as a filename wildcard. In general, you need to use quotes around any regular expression containing a character that has special meaning for the shell. (Filename wildcards and other special shell symbols are discussed in Chapter 4.)

Other regular expression symbols that are often useful in specifying targets for grep include the caret (^) and dollar sign ($), which are used to anchor words to the beginning and end of lines, and brackets ([ ]), which are used to indicate a class of characters. The following example shows how these can be used to specify patterns as targets:

$ grep '^Section [1–9]$' manuscript

This command finds all lines that contain just “Section n”, where n is a number from 1 to 9, in the file manuscript. The caret at the beginning and the dollar sign at the end indicate that the pattern must match the whole line. The brackets indicate that the target can include any one of the numbers from 1 to 9.

Table 19–1 lists regular expression symbols that are useful in forming grep search patterns.

Table 19–1: grep Regular Expressions

Symbol

Definition

Example

Matches

.

Matches any single character.

th.nk

think, thank, thunk, etc.

\

Quotes the following character.

script\.py

script.py

*

Matches zero or more repetitions of the previous item.

ap*le

ale, apple, etc.

[ ]

Matches any one of the characters inside.

[QqXx]

Q, q, X, or x

[a-z]

Matches any one of the characters in the range.

[0–9]*

any number: 0110, 27, 9876, etc.

^

Matches the beginning of a line.

^If

any line beginning with If

$

Matches the end of a line.

\.$

any line ending in a period

Options for grep

Normally, grep distinguishes between uppercase and lowercase. For example, the following command would find “Unix” but not “UNIX" or “unix”:

$ grep Unix notes

You can use the −i (ignore case) option to find all lines containing a target regardless of uppercase and lowercase distinctions. This command finds all occurrences of the word “unix” regardless of capitalization:

$ grep −i unix notes

The −r option causes grep to recursively search files in all the subdirectories of the current directory.

$ grep −r "\.p[ly]" *

PerlScripts/quickmail.pl: # usage: quickmail.pl recipient subject contents

PythonScripts/zwrite.py: # usage: zwrite.py username

The backslash (\) prevents the dot (.) from being treated as a regular expression character-it represents a period here, so grep searches for a file containing “.pl” or “.py”. Be careful: if the directory contains many subdirectories with many files in them, it can take a very long time for a command like this to complete.

Another useful grep option, −n, allows you to list the line number on which the target (here, while) is found. For example,

$ grep −n while perlsample.pl

4: while (<>){

11: while ($n > −0) {

One of the common uses of grep is to find which of several files in a directory deals with a particular topic. If all you want is to identify the files that contain a particular word or pattern, there is no need to print out the matching lines. With the −l (list) option, grep suppresses the printing of matching lines and just prints the names of files that contain the target. The following example lists all files in the current directory that include the word “Duckpond”:

$ grep −l Duckpond *

about.html

index.html

report.cgi

You can use this option with the shell command substitution feature described in Chapter 4 to use these filenames as arguments to another UNIX System command. For example, the following command will use more to list all the files found by grep:

more 'grep −l Duckpond *'

By default, grep finds all lines that match the target pattern. Sometimes, though, it is useful to find the lines that do not match a particular pattern. You can do this with the −v option, which tells grep to print all lines that do not contain the specified target. This provides a quick way to find entries in a file that are missing a required piece of information. For example, suppose the file phonenums contains your personal phone book. The following command will print all lines in phonenums that do not contain numbers:

$ grep −v '[0–9]' phonenums

The −v option can also be useful for removing unwanted information from the output of another command. Chapter 3 described the file command and showed how you can use it to get a short description of the type of information contained in a file. Because the file command includes the word “directory” in its output for directories, you could list all files in the current directory that are not directories by piping the output of file to grep −v, as shown in the following example:

$ file * | grep −v directory

fgrep

The fgrep command is similar to grep, but with three main differences: You can use it to search for several targets at once, it does not allow you to use regular expressions to search for patterns, and it is faster than grep. When you need to search many files or a very large file, the difference in speed can be significant.

With fgrep, you can search for lines containing any one of several targets. For example, the following command finds all entries in the phone_nums file that contain any of the words “saul”, “michelle”, or “anita”:

$ fgrep "saul

> michelle

> anita" phone_nums

The output might look like this:

saul 555–1122

saul (home) 555–1100

michelle 555–3344

anita 555–6677

When you give fgrep multiple search targets, each one must be on a separate line, and the entire search string must be in quotation marks. In this example, if you didn’t put michelle on a separate line you would be searching for saul michelle, and if you left out the quotes, the command would execute as soon as you hit ENTER.

With the −f (file) option, you can tell fgrep to take the search targets from a file, rather than having to enter them directly If you had a file in your home directory named .friends containing the usernames of your friends on the system, you could use fgrep to search the output of the finger command for the names on your list, like this:

$ finger | fgrep −f −/.friends

egrep

The egrep command is the most powerful member of the grep command family You can use it like fgrep to search for multiple targets, and it provides a larger set of regular expressions than grep. In fact, if you find yourself using the extended features of egrep often, you may want to add an alias that replaces grep with egrep in your shell configuration file. (For example, if you are using bash, you could add the line “alias grep=egrep” to your .bashrc.)

You can tell egrep to search for several targets in two ways: by putting them on separate lines as in fgrep, or by separating them with the vertical bar or pipe symbol (|). For example, the following command uses the pipe symbol to tell egrep to search for the words dan, robin, ben, and mari in the file phone_list:

$ egrep "dan|robin ben|mari" phone_list

dan dnidz x1234

robin rpelc x3141

ben bsquared x9876

marissa mbaskett x2718

Note that there are no spaces between the pipe symbol and the targets. If there were, egrep would consider the spaces part of the target string. Also note the use of quotation marks to prevent the shell from interpreting the pipe symbol as an instruction to create a pipeline.

Table 19–2 summarizes the egrep extensions to the grep regular expression symbols.

Table 19–2: Additional egrep Regular Expressions

Symbol

Definition

Example

Matches

+

Matches one or more repetitions of the previous item.

.+

any non-empty line

?

Matches the previous item zero or one times.

index\.html?

index.htm, index.html

( )

Groups a portion of the pattern.

script(\.pl)?

script, script.pl

|

Matches either the value before or after the |.

(E|e)xit

Exit, exit

The egrep command provides most of the basic options of both grep and fgrep. You can tell it to ignore uppercase and lowercase distinctions (−i), search recursively through subdirectories (−r), print the line number of each match (−n), print only the names of files containing target lines (−l), print lines that do not contain the target (−v), and take the list of targets from a file (−f).

Compressing and Packaging Files

Compression replaces a file with an encoded version containing fewer bytes. The compressed version of the file saves all the information that was in the original file. The original file can be recovered by undoing the compression procedure.

Compressed files require less storage space but are also less convenient to work with than uncompressed files. Most commands won’t work on compressed files-for example, you can’t edit a text file while it’s compressed. Because of this, compressed files are ideal for backups, which won’t need to be accessed very often. Compression is also used to reduce the size of files being sent over a network or distributed on a web site.

Most UNIX variants provide utilities for compressing files. SVR4-based systems include the pack and compress commands. Other systems, including Linux, provide the gzip command, which is probably the most popular compression utility for UNIX today It is available for most platforms (including Windows) at http://www.gzip.org/. The command bzip2, a somewhat newer utility that’s very similar to gzip, can be downloaded for various platforms from http://www.bzip.org/.

The compress command is more efficient than pack, meaning that it will almost always create smaller compressed files. Similarly, gzip is more efficient than compress, and bzip2 is generally more efficient than gzip.

All UNIX variants include the tar command, which was originally designed for creating tape archives for backups but is now commonly used to “bundle” files, often before compressing them.

pack

The pack command replaces a file with a compressed version. The original file is destroyed, so be sure to make a copy beforehand if you need to save the file. The compressed file has .z appended to the filename, to indicate how it was compressed. To uncompress the file, use the unpack command, with the original filename as the argument.

$ pack research-data

pack: research-data: 45.4% Compression

$ ls research*

research-data.z

$ unpack research-data

unpack: research-data: unpacked

$ ls research*

research-data

The second line of this example shows that the file research-data.z is 45.4 percent smaller than research-data. Note that the compressed file is deleted when it is uncompressed. If you want to keep the compressed file, you will need to create a copy.

compress

The compress command works in pretty much the same way as pack. It adds .Z (uppercase) at the end of the compressed filename, instead of the .z (lowercase) that pack uses. The uncompress command will recover the original file. As with pack, compressing or uncompressing a file will delete it, so be sure to make a copy if you need to save the original version.

$ compress research-data

$ ls research*

research-data.Z

$ uncompress research-data

Note that, unlike pack, compress does not report after compressing or uncompressing a file. The −v (verbose) option will cause it to display feedback.

gzip

The gzip command will also replace a file with a compressed version. A file compressed with gzip has the extension .gz. To uncompress the file, use either gzip −d (for decompress), or the command gunzip. As with compress, the −v option will cause gzip and gunzip to display a confirmation after compressing or uncompressing a file.

$ gzip −v research-data

research-data: 81.3% -- replaced with research-data.gz

$ gunzip −v *.gz

download.gz 33.6% -- replaced with download

research-data.gz: 81.3% -- replaced with research-data

gunzip can also be used to decompress .z and .Z files. Some systems (such as Linux) include the command bzip2 (and the related command bunzip2 for decompressing files), which is an alternative to gzipthat works in the same way

Working with Compressed Files

The gzip package comes with a set of tools for working with compressed files. These tools include zcat, zmore, zless, zgrep, and zdiff, which do for compressed files what their counterparts do with ordinary text files.

The zcat command reads files that have been compressed by compress or gzip and prints the uncompressed content to standard output.

The zmore and zless commands work like the more and less commands, printing compressed files in their uncompressed form, one screen at a time.

The zgrep command searches a compressed file for lines that match a grep search target, and prints them in uncompressed form. The following finds lines that contain “toss” in the compressed file fulltext.gz.

$ zgrep toss fulltext.gz

Your mind is tossing on the ocean;

The zdiff command is based on the diff command, which is described later in this chapter. zdiff reads the files specified as its arguments and prints the result of doing a diff on the uncompressed contents. It can be used to compare two compressed files, or to compare a compressed file to an uncompressed file.

tar

As noted previously, two of the most common uses of compression are creating backup files and sending files over a network. In both of these cases, you may have many files that you want to keep together. For example, you may be backing up an entire directory, or e-mailing all of the files for a project. The tar command can be used to “package” a group of files into a single file. It is commonly used on files before compressing them.

The syntax for the tar command is complicated. This section will cover only the basic commands for combining or separating a group of files. More details can be found in the UNIX man page for tar.

To combine files with tar, use the command

$ tar −cvf mail.tar save sent

This will create a file called mail.tar that contains the files save and sent. (The c option stands for create.) You can list as many files to include as you like, including directories. To package all the files in the directory -/Project into a tar file, use

$ tar −cvf projectfiles.tar −/Project

Note that, unlike the compression tools, tar leaves the original files unchanged. Also, it does not automatically add the .tar extension to the combined file. Unlike most UNIX commands, tar does not require the-in front of options, so tar −cvf could also be written as tar cvf.

To separate a .tar file, use the command

$ tar −xvf projectfiles.tar

This will extract all of the files from projectfiles.tar. (The x option stands for extract.)

Some versions of tar (including the versions found on most Linux systems) have an option to create a .tar file and compress it with gzip in one step. This can be convenient, since tar is commonly used to package files before compressing them. The following command will tar and compress all files starting with cs in the current directory:

$ tar −cvzf csfiles.tar.gz cs*

These versions of tar can also extract .tar.gz files in a single step. To do this, use the command

$ tar −xvzf csfiles.tar.gz

Counting Lines, Words, and File Size

The command wc (word count) is a flexible little tool that provides several ways to count the size of a file. The command nl is another small tool. It can be used to add line numbers to a file.

WC

The command wc (word count) prints the number of bytes, lines, or words in a file. For example,

$ cat samplefile

This file contains 143 bytes.

It has 30 words,

and it is 5 lines long.

It has 3 lines that contain the number 3.

The longest line is 41 bytes.

$ wc −c samplefile # Size of the file in bytes.

143 samplefile

$ wc −w samplefile # Number of words in the file.

30 samplefile

$ grep 3 samplefile | wc −1 # Number of lines in file that contain "3".

3

$ wc −L samplefile # Length of the longest line.

41 samplefile

nl

To number each line in a file, use the command

$ nl filename > numbered

This will only add numbers at the beginning of nonempty lines. To number all the lines in a file, use

$ nl −ba hello.py

1 #!/usr/bin/python

2

3 print "Hello, world"

Working with Columns and Fields

Many files contain information that is organized in terms of position within a line. These include tables, which organize text in columns, and files such as /etc/password that consist of lines made up of fields. The UNIX System includes a number of tools designed specifically to work with files organized in columns or fields. You can use the commands described in this section to extract and modify or rearrange information in field-structured or columnstructured files.

§ cut allows you to select particular columns or fields from files.

§ colrm deletes one or more columns from a file or set of files.

§ paste glues together columns or fields from existing files.

§ join merges information from two database files.

cut

Often you are interested in only some of the fields or columns contained in a table or file. For example, you may want to get a list of e-mail addresses from a personnel file that contains names, employee numbers, e-mail addresses, telephone numbers, and so forth. cut allows you to extract from such files only the fields or columns you want.

When you use cut, you have to tell it how to identify fields and which fields to select. You can identify fields either by character position or by the use of field separator characters. You must specify either the −cor the −f option and the field or fields to select.

Using cut with Fields

Many files can be thought of as a list of records, each consisting of several fields, with a specific kind of information in each field. An example is the file contact-info shown here, which contains names, usernames, phone numbers, and office numbers:

$ cat contact-info

Barker-Plummer,D dbp 555–1111 1J333

Etchemendy,J etch 555–2222 2F328

Liu, A a-liu 555–3333 1J322

Field-structured files like this are used often in the UNIX System, both for personal databases like this one and to hold system information.

A field-structured file uses a field separator or delimiter to separate the different fields. In the preceding example, the field separator is the tab character, but any other character-such as a colon (:) or the percent sign (%)-could be used.

To retrieve a particular field from each record of a file, you tell cut the number of the field you want. For example, the following command uses cut to retrieve the e-mail addresses from contact-info by cutting out the second field from each line or record:

$ cut −f2 contact-info

dbp

etch

a-liu

You can use cut to select any set of fields from a file. The following command uses cut to produce a list of names and telephone numbers from contact-info by selecting the first and third fields from each record:

$ cut −f1, 3 contact-info > phone-list

You can also specify a range of adjacent fields, as in the following example, which includes each person’s username and telephone number in the output:

$ cut −f1–3 contact-info > contact-short

If you omit the last number from a range, it means “to the end of the line.” The following command copies everything except field two from contact-info to contact-short:

$ cut −f1, 3− contact-info > contact_short

Using cut with Multiple Files

You can use cut to select fields from several files at once. For example, if you have two files of contact information, one containing personal contacts and one for work-related contacts, you could create a list of all the names and phone numbers in both files with the following command:

cut −f1, 3 contacts.work contacts.home > contacts.all

Of course, the files must share the same formatting, so that the command cut −f1,3 works correctly on both of them.

Specifying Delimiters

Fields are separated by delimiters. The default field delimiter is a tab, as in the preceding example. This is a convenient choice because when you print out a file that uses tabs to separate fields, the fields automatically line up in columns. However, for files containing many fields, the use of tabs often causes individual records to run over into two lines, which can make the display confusing or unreadable. The use of tabs as a delimiter can also cause confusion because a tab looks just like a collection of spaces. As a result, sometimes it is better to use a different character as the field separator.

To tell cut to treat some other character as the field separator, use the −d (delimiter) option, followed by the character. Separators are often infrequently used characters like the colon (:), percent sign (%), and caret (^).

The /etc/passwd file contains information about users in records using : as the field separator. This example shows how you could use cut to select the login name, user name, and home directory (the first, fifth, and sixth fields) from the /etc/passwd file:

$ cat /etc/passwd

root:x:0:0:root:/root:/bin/bash

dbp:x:944:100:Dave Barker-Plummer:/home/dbp:/bin/bash

etch:x:945:100:John Etchemendy:/home/etch:/bin/bash

a-liu:x:946:100 :Albert Liu:/home/a-liu:/bin/bash

$ cut −d: −f 1, 5–6 /etc/passwd

root:root:/

dbp:Dave Barker-Plummer:/home/dbp

etch:John Etchemendy:/home/etch

a-liu:Albert Liu:/home/a-liu

If the delimiter has special meaning to the shell, it should be enclosed in quotes. For example, the following tells cut to print all fields from the second one on, using a space as the delimiter:

$ cut −d' ' −f2− file

Using cut with Columns

Some files arrange information into columns with fixed widths. For example, the long form of the ls command uses spaces to align its output:

$ ls −1

-rw-rw-r--1 jmf users 2958 Oct 8 13:02 inbox

-rw-rw-r--1 jmf users 553 Oct 8 12:32 save

-rw-rw-r--1 jmf users 464787 Oct 8 13:03 sent

Each of the types of information in this output is assigned a fixed number of characters. In this example, the permissions field consists of the characters in positions 1–10, the size is contained in characters 35–42, and the name field is characters 56 and following. (The size of the columns may vary on different systems.)

The −c (column) option tells cut to identify fields in terms of character positions within a line. The following command selects the size (positions 35–42) and name (positions 56 to end) for each file in the long output of ls:

$ ls −l | cut −c35–42, 56–

2958 inbox

553 save

464787 sent

colrm

The colrm command is a specialized command that you can use to remove one or more columns from a file or set of files. Although you can use the cut command to do this, colrm is a simple alternative when that is exactly what you need to do. You specify the range of character positions to remove from standard input. For example, the following command deletes the characters in columns 8–12 from the file pangrams.

$ cat pangrams

The quick brown fox jumps over the lazy dog.

The five boxing wizards jump quickly.

Sphinx of black quartz, judge my vow.

$ cat pangrams | colrm 8 12

The quips over the lazy dog.

The five jump quickly.

Sphinx judge my vow.

paste

The paste command joins files together line by line. You can use it to create new tables by gluing together fields or columns from two or more files. In this example, paste creates a new file by combining the information in states and state_abbr:

$ cat states

Alabama

Alaska

Arizona

Arkansas

California

$ cat state_abbr

AL

AK

AZ

AR

CA

$ paste states state_abbr > states.comp

$ cat states.comp

Alabama AL

Alaska AK

Arizona AZ

Arkansas AR

California CA

Of course, if the contents of the files do not line up correctly (e.g., if they are not in the same order) the output from paste may not be what you were expecting.

Specifying the paste Field Separator

The paste command separates the parts of the lines it pastes together with a field separator. The default delimiter is tab, but as with cut, you can use the −d (delimiter) option to specify another one if you want. The following command combines the states files with a third file containing the capitals, using a colon as the separator:

$ paste −d: states state_abbr capitals

Alabama:AL:Montgomery

Alaska:AK:Juneau

Arizona:AZ:Phoenix

Arkansas:AR:Little Rock

California:CA:Sacramento

Using paste with Standard Input

You can use the minus sign (−) to tell paste to use standard input as one of its input “files.” This feature allows you to paste information from a command pipeline or from the keyboard.

For example, the following command will add a new field to each line of the addresses file.

$ paste addresses − > addresses.new

Here, paste reads each line of addresses and then waits for you to type a line from your keyboard. paste prints the output line to the file addresses.new and then goes on to read the next line of input from addresses.

Using cut and paste to Reorganize a File

You can use cut and paste together to reorder the contents of a structured file. A typical use is to switch the order of some of the fields in a file. The following commands switch the second and third fields of the contact-info file:

$ cut −f1, 3 contact-info > temp

$ cut −f4- contact-info > temp2

$ cut −f2 contact-info | paste temp-temp2 > contacts.new

The first command cuts fields one and three from contact-info and places them in temp. The second command cuts out the fourth field from contact-info and puts it in temp2. Finally, the last command cuts out the second field and uses a pipe to send its output to paste, which creates a new file, contacts.new with the fields in the desired order. The result is to change the order of fields from name, username, phone number, room number to name, phone number, room number, username. Note the use of the minus sign to tell paste to put the standard input (from the pipeline) between the contents of temp and temp2.

There is a much easier way to do the swapping of fields illustrated here, using the awk language. You’ll see how in Chapter 21.

join

The join command joins together two existing files on the basis of a key field that contains entries common to both of them. It is similar to paste, but join matches lines according to the key field, rather than simply gluing them together. The key field appears only once in the output.

For example, a jewelry store might use two files to keep information about merchandise, one named merch containing the stock number and description of each item, and one, costs, containing the stock number and cost of each item. The following uses join to create a single file from these two, listing stock numbers, descriptions, and costs. (Here the first field is the key field.)

$ cat merch

63A457 man's gold watch

73B312 garnet ring

82B119 sapphire pendant

$ cat costs

63A457 125.50

73B312 255.00

82B119 534.75

$ join merch costs

63A457 man's gold watch 125.50

73B312 garnet ring 255.00

82B119 sapphire pendant 534.75

The join command requires that both input files be sorted according to the common field on which they are joined.

Specifying the join Field

By default, join uses the first field of each input file as the common field. You can specify which fields to use with the −j (join) option. The following command tells join to join the files using field 2 in the first file and field 3 in the second file:

$ join −j1 2 −j2 3 ss_no personnel > new_data

Specifying Field Separators

The join command treats any white space (a space or tab) in the input as a field separator and uses the space character as the default delimiter in the output. You can change the field separator with the −t (tab) option. The following command joins the data in the system files /etc/passwd and /etc/group, both of which use a colon as their field separator. The colon is also used as the delimiter for the output.

$ join −t: /etc/passwd /etc/group > all_data

Unfortunately, the option letter that join uses to specify the delimiter (−t) is different from the one (−d) that is used by cut, paste, and several other UNIX System commands.

Sorting the Contents of Files

The UNIX command sort is a powerful, general-purpose tool for sorting information in a file or as part of a command pipeline. It is sometimes used with uniq, a command that identifies and removes duplicate lines from sorted data. The sort and uniq commands can operate on either whole lines or specific fields.

sort

The sort command orders or reorders the lines of a file. In the simplest form, all you need to do is give it the name of the file to sort, and it will print the lines from the file in ASCII order. This example shows how you could use sort to put a list of names into alphabetical order:

$ sort names

cunningham, j.p.

lewis,s.h.

long, s.

rosen,k.h.

rosinski,r.r.

wiseman, s.

You can use sort to combine the contents of several files into a single sorted file. The following command creates a file names.all containing all of the names in three input files, sorted in alphabetical order:

$ sort names.work names.class names.personal > names.all

The −o (output) option tells sort to save the results to a file. For example, this command will sort commandlist and replace its contents with the sorted output:

$ sort −o commandlist commandlist

Be careful: you cannot just redirect the output of sort to the original file. Because the shell creates the output file before it runs sort, the following command would delete the original file before sorting it:

$ sort commandlist > commandlist # File will be emptied!

Alternative Sorting Rules

By default, sort sorts its input according to the order of characters in the ASCII character set. This is similar to alphabetical order, with the difference that all uppercase letters precede any lowercase letters. In addition, numbers are sorted by their ASCII representation, not their numerical value, so 100 precedes 20, and so forth.

Several options allow you to change the rule that sort uses to order its output. These include options to ignore case, sort in numerical order, and reverse the order of the sorted output. You can also tell sortwhich column or field of a file to act on, and whether or not to include duplicate lines in the output.

Table 19–3 summarizes the most common options for sort.

Table 19–3: Options for sort

Option

Mnemonic

Effect

−d

Dictionary

Sort on letters, digits, and blanks only.

−f

Fold

Ignore uppercase and lowercase distinctions.

−n

Numeric

Sort by numeric value, in ascending order.

−r

Reverse

Reverse order of output.

−o filename

Output

Send output to a file.

−u

Unique

Eliminate duplicate lines in output.

Ignore Case You can get a more normal alphabetical ordering with the −f (fold) option that tells sort to ignore the differences between uppercase and lowercase versions of the same letter. The following example shows how the output of sort changes when you use the −f option:

$ sort locations

Lincroft

Summit

holmdel

middletown

$ sort −f locations

holmdel

Lincroft

middletown

Summit

Numerical Sorting To tell sort to sort numbers by their numerical value, use the −n (numeric) option. Here’s an example of how the −n option changes the output of sort. This uses wc to get the size of each file in the output from ls and then pipes the list of sizes and files to sort.

$ wc 'ls' | sort

100 Palo Alto

12 Fox Island

130 Seattle

22 Rumson

4 Santa Monica

$ wc 'ls' | sort −n

4 Santa Monica

12 Fox Island

22 Rumson

100 Palo Alto

130 Seattle

Reverse Order The −r (reverse) option tells sort to reverse the order of its output. In the previous example, the −r option could be used to list the largest files first, like this:

$ wc −c 'ls' sort −rn

130 Seattle

100 Palo Alto

22 Rumson

12 Fox Island

4 Santa Monica

Sorting by Column or Field The sort command provides a way for you to specify the field or column to use for its comparisons. You do this by telling sort to skip one or more fields or columns. For example, the following command ignores the first column of the output from file, so it sorts by the second column, which is the file type.

$ file * | sort +1

notes: ASCII English text

tmp: ASCII English text

mbox: ASCII mail text

bin: directory

Desktop: directory

Mail: directory

zwrite: symbolic link to /home/raf/scripts/Python/zwrite.py

Like cut, sort allows you to specify an alternative field separator. You do this with the −t (tab) option. The following command tells sort to skip the first four fields in a file that uses a colon (:) as a field separator:

$ sort −t: +4 /etc/passwd

Suppressing Repeated Lines Sorting often reveals that a file contains multiple copies of the same line. The next section describes the uniq command, which is designed to remove repeated lines from input files. But because this is such a common sorting task, sort also provides an option, −u (unique), that removes repeated lines from its output. Repeated lines are likely to occur when you combine and sort data from several different files into a single file. For example, if you have several lists of e-mail addresses, you may want to create a single file containing all of them. The following command uses the −u option to ensure that the resulting file contains only one copy of each address:

$ sort −u names.* > uniq-names

uniq

The uniq command filters or removes repeated lines from files. It is usually used with files that have first been sorted by sort. In its simplest form it has the same effect as the −u option to sort, but uniq also provides several useful options of its own.

The following example illustrates how you can use uniq as an alternative to the −u option of sort:

$ sort names.* | uniq > names

Counting Repetitions

One of the most valuable uses of uniq is in counting the number of occurrences of each line. This is a very convenient way to collect frequency data. The following illustrates how you could use uniq along with cut and sort to produce a listing of the number of entries for each ZIP code in a mailing list:

$ cut −f6 mail.list

07760

07733

07733

07760

07738

07760

07731

$ cut −f6 mail.list | sort | uniq −c | sort −rn

3 07760

2 07733

1 07738

1 07731

The preceding pipeline uses four commands: The first cuts the ZIP code field from the mailing list file. The second uses sort to group identical lines together. The third uses uniq −c to remove repeated lines and add a count of how many times each line appeared in the data. The final sort −rn arranges the lines numerically (n) in reverse order (r), so that the data is displayed in order of descending frequency.

Finding Repeated and Nonrepeated Lines

uniq can also be used to show which lines occur more than once and which occur only once. The −d (duplicate) option tells uniq to show only repeated lines, and the −u (unique) option prints only lines that appear exactly once. For example, the following shows ZIP codes that appear only once in the mailing list from the preceding example:

$ cut −f6 mail.list | uniq −u

07738

07731

Comparing Files

Often you need to see whether two files have different contents and to list the differences if there are any For example, you may want to compare two versions of a document you’re working on to see what you’ve changed. It is also sometimes useful to be able to tell whether files having the same name in two different directories are simply different copies of the same file, or whether the files themselves are different.

§ cmp, comm, and diff each tell whether two files are the same or different, and they give information about where or how the files differ. The differences among them have to do with how much information they give you, and how they display it.

§ patch uses the list of differences produced by diff, together with an original file, to update the original to include the differences.

§ dircmp tells whether the files in two directories are the same or different.

cmp

The cmp command is the simplest of the file comparison tools. It tells you whether two files differ, and if they do, it reports the position in the file where the first difference occurs. The following example illustrates how it works:

$ cat note

Nate,

Here's the first draft of the plan.

I think it needs more work.

$ cat note.more

Nate,

Here's the first draft of the new plan.

I think it needs more work.

Let me know what you think.

$ cmp note note.more

note note.more differ: byte 37, line 2

This output shows that the first difference in the two files occurs at the 37th character, which is in the second line. cmp does not print anything if there are no differences in the files.

comm

The comm (common) command is designed to compare two sorted files and show lines that are the same or different. You can display lines that are found only in the first file, lines found only in the second file, and/or lines that are found in both files.

By default, comm prints its output in three columns: lines unique to the first file, those unique to the second file, and lines found in both, respectively The following illustrates how it works, using two files containing lists of cities:

$ comm cities.1 cities.2

New York

Palo Alto

San Francisco

Santa Monica

Seattle

This shows that “New York” is only in the first file, “Santa Monica” only occurs in the second, and “Palo Alto”, “San Francisco”, and “Seattle” are found in both.

The comm command provides options you can use to control which of the summary reports it prints. Options −1 and −2 suppress the reports of lines unique to the first and second files, respectively Use −3 to suppress printing of the lines found in both. These options can be combined. For example, to print only the lines unique to the first file, use −23, like this:

$ comm −23 cities.1 cities.2

New York

diff

The diff command compares two files, line by line, and prints out differences. In addition, for each block of text that differs between the two files, diff tells you how the text from the first file would have to be changed to match the text from the second.

The following example illustrates the diff output for the two note files described earlier:

$ diff note note.more

2c2

< Here's the first draft of the plan.

--

> Here's the first draft of the new plan.

3a4

> Let me know what you think.

Lines containing text that is found only in the first file begin with <. Lines containing text found only in the second file begin with >. Dashed lines separate parts of the diff output that refer to different files.

Each section of the diff output begins with a code that indicates what kinds of differences the following lines refer to. In the preceding example, the first pair of differences begin with the code 3c3. This tells you that there is a change (c) between line 3 in the first file and line 3 in the second file. The second difference begins with 4a5. The letter a (append) indicates that line 5 in the second file is added following line 4 in the first. Similarly, a d (deleted) would indicate lines found in one file but not in the other.

patch

If you save the output from diff, you can use the patch command to recreate the second file by applying the differences to the first file. The patched version replaces the original file. The following shows how you could patch the file project.c using the difference file diffs.

$ diff project.c project2.c > diffs

$ patch project.c diffs

After this pair of commands, the contents of project.c are identical to the contents of project2.c.

The patch command allows you to keep track of successive versions of a file without having to keep all of the intermediate versions. All you need to do is to keep the original version and the output from diffneeded to change it into each new version. (This is how some revision control systems store files. See Chapter 24 for an explanation of revision control.)

dircmp

Some versions of UNIX, such as Solaris, include the dircmp command, which compares the contents of two directories and tells you how they differ. The output of dircmp lists the filenames that are unique to each directory If there are files with the same name in both directories, dircmp tells you whether their contents are the same or different.

The following command compares the contents of ~jcm/Dev with the contents of ~jcm/ Dev/Backup:

$ dircmp ~jcm/Dev ~jcm/Dev/Backup

In addition to comparing two of your own directories, dircmp may be used to compare directories belonging to different users. For example, if two users are working on the same project and each has their own copy of the files, they may need to determine which files are no longer identical.

Examining File Contents

Chapter 3 described several commands for viewing text files: cat, head, tail, and the pagers pg, more, and less. These are adequate for most purposes, but they are of limited use with files that contain nonprinting ASCII characters, and they are of no use at all with files that contain binary data. This section describes the od and strings commands, which help you view the contents of files that contain nonprinting characters or binary data. It also includes the tac command, which is a backward version of cat.

od

The od command shows you the exact contents of a file, including nonprinting characters. It can be used for both text and data files. od prints the contents of each byte of the file in any of several different representations, including octal, hexadecimal, and “character” formats. The following discussion deals only with the character representation, which is invoked with the −c (character) option. To illustrate how odworks, consider how it displays an ordinary text file. For example,

$ cat example

The UNIX Operating System is becoming

increasingly popular.

$ od −c example

0000000 T h e U N I X O p e r a t i

0000020 n g S y s t e m i s b e c

0000040 o m i n g \n i n c r e a s i n

0000060 g l y p o p u l a r . \n

0000076

Each line of the output shows 16 bytes of data, interpreted as ASCII characters. The number at the beginning of each line is the octal representation of the offset, or position, in the file of the first byte in the line. The other fields show each byte in its character representation. The file in this example is an ordinary text file, so the output consists mostly of normal characters. The only thing that is special is the \n, which represents the newline at the end of each line in the file. Newline is an ASCII character, but od uses the special sequence \n to make it visible. Other special sequences include \t (tab), \b (backspace), and \r(return). Less common nonprinting characters are shown as a threedigit octal representation of their ASCII encoding.

You can specify an offset, a number of bytes of input to skip before displaying the data, as an octal number following the filename. For example, the following command skips 16 bytes (octal 20):

$ od −c data_file 20

strings

Some files are mostly binary data but may contain a few readable strings. If these files are very long, then using od to read them can take a very long time. The command strings will search a file for printable characters. By default, strings prints any chains of four or more printable characters that it finds. In this example, strings searches the binary file ping for printable characters and prints chains of six or more characters.

$ strings −6 ping

The strings command can be used on multiple files at once. The −f option tells it to print the name of the file when it prints a string of characters, so that you know which file the string came from.

$ strings −f /bin/* | grep version more

In this example, strings searches all the files in /bin. It sends the results to grep, which searches for lines containing the word “version”. Each of these lines is printed to the screen, along with the name of the file it came from.

tac

The tac command is a backward version of cat. It takes a list of files and prints them line by line to standard output, but in reverse line order. Like cat, tac can accept standard input.

You can use the −s option to tell tac to use a separator other than newline to mark breaks between “lines”. For example, if the individual records in the file accounts are separated by ***, the following command will print them in reverse order.

$ tac −s "***" accounts

Editing and Formatting Files

There are many ways to edit and format files in the UNIX System. Chapter 5 described the text editors vi and emacs. Chapter 21 will explain how to use awk and sed to write programs that modify file contents. In addition, the troff, nroff, and LaTeX systems can be used to create formatted documents. For example, many of the UNIX man pages are formatted with nroff, which is why they cannot be saved to a file with

$ man command > manfile

To save a man page, use the command

$ man command | col −b > manfile

which sends the output of man through the col filter for nroff output. Formatting documents with troff, nroff, and LaTeX is explained in detail on the companion web site.

The commands pr and fmt can be used to add simple formatting to a file, such as a header with page numbers, often before printing it.

The tr command is a small but useful tool for processing text. It translates characters according to a simple set of rules.

spell searches a file for misspelled words. The related commands ispell and aspell allow you to interactively correct the spelling in a file.

pr

The most common use of pr is to add a header to every page of a file. The header contains the page number, date, time, and name of the file. For example, if names is a simple data file that contains a short list of names and addresses, with no header information, then with pr, you get the following:

$ pr names

Aug 28 15:25 2006 names Page 1

Nate nate@engineer.com

Rebecca rlf@library.edu

Dan dkraut@bio.ca.edu

Liz liz@thebest.net

pr is often used to add header information to files when they are printed, as shown here:

$ pr notes lp

If you name several files, each one will have its own header, with the appropriate name and page numbers in the output.

You can also use pr in a pipeline to add header information to the output of another command. This is very useful for printing data files when you need to keep track of date or version information. The following commands print out the long format file listing of the current directory with a header that includes today’s date:

$ ls −1 pr lp

You can customize the heading with the −h option followed by the heading you want. The following command prints “Chapter 19 --- First Draft” at the top of each page of output:

$ pr −h "Chapter 19 --- First Draft" chapter19 | lp

Note that when the header text contains spaces, it must be enclosed by quotation marks.

Simple Formatting with pr

pr also has options for simple formatting. To double-space a file when you print it, use the −d option. The −n option adds line numbers to the output. The following command prints the file double-spaced and with line numbers:

$ pr −d −n program.c lp

You can use pr to print output in two or more columns. For example, the following prints the names of the files in the current directory in three columns:

$ ls pr −3 lp

pr handles simple page layout, including setting the number of lines per page, the line width, and the offset of the left margin. The following command specifies a line width of 60 characters, a left margin offset of eight characters, and a page length of 60 lines:

$ pr −w 60 −o 8 −1 60 note lp

fmt

Another simple formatter, fmt, can be used to control the width of your output. fmt breaks, fills, or joins lines in the input you give it and produces output lines that have (up to) the number of characters you specify The default width is 72 characters, but you can use the −w option to specify other line widths. fmt is a quick way to consolidate files that contain lots of short lines, or eliminate long lines from files before sending them to a printer. In general it makes ragged files look better. The following illustrates how fmt works.

$ cat sample

This is an example of

a short

file

that contains lines of varying width.

We can even up the lines in the file sample as follows.

$ fmt −w 16 sample

This is an

example of a

short file that

contains lines

of varying

width.

tr

tr replaces one set of characters with another set. For example, you could use tr to translate all the : (colon) characters in the /etc/passwd file into tabs, like this:

$ tr : '\t' < /etc/passwd

root x 0 0 root /root /bin/bash

dbp x 944 100 Dave Barker-Plummer /home/dbp /bin/bash

etch x 945 100 John Etchemendy /home/etch /bin/bash

a-liu x 946 100 Albert Liu /home/a-liu /bin/bash

In this example, the escape sequence \t stands for the TAB character. It is enclosed in single quotes to prevent the shell from interpreting it. File redirection (with the input operator <) is used to send the contents of /etc/passwd to tr. The tr command is one of the few common UNIX System tools that does not allow you to specify a filename as an argument. tr only reads standard input, so you have to use input redirection or a pipe to give it input.

The tr command can translate any number of characters. In general, you give tr two lists of characters: the list of characters to be translated, and the list of characters they will be replaced by. tr translates the first character in the input list to the first character in the output list, the second input character to the second output character, and so on. For example, the following command replaces the characters a, b, and cin lowerfile with the corresponding uppercase letters, and saves the output to a new file:

$ tr abc ABC < lowerfile > upperfile

Because each character in the input list corresponds to one character in the output list, the two lists must have the same number of characters.

Specifying Ranges and Repetitions

You can use brackets and a minus sign (−) to indicate a range of characters, similar to the use of range patterns in regular expressions and filename matching. The following example uses tr to translate all lowercase letters in name_file to uppercase:

$ cat name_file

ben

robin

dan

marissa

$ tr ' [a-z] ' ' [A-Z] ' < name_file

BEN

ROBIN

DAN

MARISSA

tr can be used to encode or decode text using simple substitution ciphers (codes). A specific example of this is the rot13 cipher, which replaces each letter in the input text with the letter 13 letters later in the alphabet (wrapping around at the end). For instance, k is translated to x and Y is translated to L. The following command encrypts a file using this rule. Note that rot13 maps lowercase letters to lowercase letters and uppercase letters to uppercase letters.

$ cat hello

Hello, world

$ tr "[a-m] [n-z] [A-M] [N-Z]" "[n-z] [a-m] [N-Z] [A-M]" < hello > code.rot13

$ cat code.rot13

Uryyb, j beyq

You can use the same tr command to decrypt a file encrypted with the rot13 rule. The rot13 cipher is sometimes used to weakly encrypt potentially offensive jokes in newsgroups.

If you want to translate each of a set of input characters to the same single output character, you can use an asterisk to tell tr to repeat the output character. For example, the following replaces each digit in the input with the number sign (#).

$ tr ' [0–9] ' ' [#*] ' < data

This particular feature of tr is not found in all versions of UNIX.

Removing Repeated Characters

The previous example translates digits to number signs. Each digit of a number will produce a number sign in the output. For example, 1024 comes out as #. You can tell tr to remove repeated characters from the translated string with the −s (squeeze) option. The following version of the preceding command replaces each number in the input with a single number sign in the output, regardless of how many digits it contains:

$ tr −s ' [0–9] ' ' [#*] ' < data

You can use tr to create a list of all the words appearing in a file. The following command puts every word in the file on a separate line by replacing each group of spaces with a newline. It then sorts the words into alphabetical order and uses uniq to produce an output that lists each word and the number of times it occurs in the file.

$ cat short_file

This is the first line.

And this is the last.

$ cat short_file | tr −s ' ' '\n' sort | uniq −c

1 And

1 This

1 first

2 is

1 last.

1 line.

2 the

1 this

If you wanted to list words in order of descending frequency, you could pipe the output of uniq −c to sort −rn.

Other Options for tr

Sometimes it is convenient to specify the input list by its complement, that is, by telling tr which characters not to translate. You can do this with the −c (complement) option.

The following command makes nonalphanumeric characters in a file easily visible by translating characters that are not alphabetic or digits to an underscore.

$ tr −c ' [A-Z] [a-z] [0–9] ' ' [_*] ' < messyfile

You can use the −d (delete) option to tell tr to delete characters in the input set from its output. This is an easy way to remove special or nonprinting characters from a file. The following command uses the −cand −d options to remove everything except alphabetic characters and digits:

$ tr −cd " [a-z] [A-Z] [0–9]" < messyfile

In particular, this example will delete punctuation marks, spaces, and other characters.

spell

spell is a UNIX command that allows you to check the spelling in a file. Running

$ spell textfile

will produce a list of the words that are misspelled in textfile. The option −b causes spell to use British spellings.

Linux systems come with the command ispell, which allows you to interactively correct misspelled words. ispell can be downloaded from http://ficus-www.cs.uda.edu/geoff/ispell.html. A similar program, called aspell, can be found at http://aspell.net/. To check the spelling in a file with aspell, use

$ aspell check textfile

aspell often does a better job of suggesting alternatives to misspelled words than ispell. The manual can be found online at http://aspell.net/man-html/index.html.

Saving Output

In addition to the file redirection operator >, the UNIX System provides several commands that you can use to record output. The command tee can be used to copy standard output to a file, while script can be used to keep a record of your session. You can also use mail from the command line to send output as e-mail.

tee

The tee command is named after a tee joint in plumbing. A tee joint splits an incoming stream of water into two separate streams. tee splits its (standard) input into two or more output streams; one is sent to standard output, the others are sent to the files you specify

The following command uses file to display information about files in the current directory By sending the output to tee, you can view it on your screen and at the same time save a copy in the file filetypes:

$ file * | tee filetypes

In this example, if the file filetypes already exists, it will be overwritten. You can use tee −a filetypes to append output to the file.

You can also use tee inside a pipeline to monitor part of a complex command. The following example prints the output of a grep command by sending it directly to lp. Passing the data through tee allows you to see the output on your screen as well:

$ grep perl filetypes | tee /dev/tty lp

Note the use of /dev/tty in this example. Recall that tee sends one output stream to standard output, and the other to a specified file. In this case, you cannot use the standard output from tee to view the information, because standard output is used as the input to lp. In order to display the data on the screen, this command makes use of the fact that /dev/tty is the name of the logical file corresponding to your display Sending the data to the “file” /dev/tty displays it on your screen.

Finally, tee can be used in a shell script to create a log file. For example, if you have a script that can be run periodically to backup files, the last line in the script could be

$ echo "'date' Backup completed." tee −a logfile

This will print a message containing the current date and time to standard output, and also append the message to logfile.

script

The script command copies everything displayed on your terminal screen to a file, including both your input and the system’s prompts and output. You can use it to keep a record of part or all of a session. It can be very handy when you want to document how to solve a complicated problem, or when you are learning to use a new program. To use it, you invoke script with the name of the file in which you want the transcript stored. For example,

$ script mysession

Script started, file is mysession

To terminate the script program and end recording, type CTRL-D:

$ [CTRL-D] Script done, file is mysession

If you invoke script without a filename argument, it uses the default filename typescript.

An example of a file produced by script is shown here:

$ script ksh-install

Script started on Mon 27 Nov 2006 09:59:58 AM PST

$ cd Desktop^M

$ gunzip ksh.2006–02–14.linux.i386.gz^M

$ mv ksh* ../bin^M

$ cd ../bin^M

$ ln −s ksh* ksh^M

$

Script done on Mon 27 Nov 2006 10:01:06 AM PST

Note that script includes all of the characters you type, including CTRL-M (which represents RETURN), in its output file. The script command is not very useful for recording sessions with screen-oriented programs such as vi because the resulting files include screen control characters that make them difficult to use.

mail

The mail command, and the related commands mailx and Mail, were introduced in Chapter 2. Most users will quickly switch to a more full-featured mail program, but mail is still useful for certain tasks. In particular, it can be used in a pipeline to mail the output of a command, as in this example:

$ find . -print mail root

This will send a list of files to the root user. The mail command can also be useful in shell scripts, as will be seen in the next chapter.

If the mail command is unable to send a message, it will save it in the file dead.letter.

Working with Dates and Times

The UNIX System includes several tools for working with dates and times. Two of these are date, which can get the current time or format an arbitrary time, and touch, which can change the modification time associated with a file.

date

The date command prints the current time and date in any of a variety of formats. It is also used by the system administrator to set or change the system time. You can use it to timestamp data in files, to display the time as part of a prompt, or simply as part of your login .profile sequence.

By itself, date prints the date in a default format, like this:

$ date

Mon Sept 18 17:19:33 PDT 2006

You can change the information that date prints with format specification options. Date format specifications are entered as arguments to date. They begin with a plus sign (+), followed by codes that tell datewhat information to display and how to display it. These codes use the percent sign (%) followed by a single letter to specify particular types of information. Format specifications can include ordinary text that you specify as part of the argument.

Here is one example of the type of formatting you can use with date:

$ date "+Today is %A, %B %d, %Y"

Today is Monday, September 18, 2006

Table 19–4 lists some of the more useful date format specifications.

Table 19–4: date Format Specifications

Unit

Symbol

Example

Unit

Symbol

Example

Year

Y

y

2006

06

Hour

H

I

17 (00 to 23)

5 (1 to 12)

Month

B

b

m

November

Nov

11

Minute

M

23 (00 to 59)

Day of week

A

a

Saturday

Sat

Second

S

03

Day of month

d

e

04

4

A.M./P.M.

P

P

AM

pm

Day of year

j

256

Time

T

X

14:20:15

02:20:15 PM

Date

D

03/27/79

Newline

n

One common use of date is to create a timestamp, a string which can be added to data in order to mark the date when it was created. For example,

$ cat output > "logfile.$(date "+%Y.%j, %X")"

$ ls log*

logfile.2006.261.17.19.48

uses command substitution to create a file with the date and time appended to the filename. In some versions of date, the command

$ date +%s

1158625190

will print the number of seconds since January 1,1970 UTC, which is a common format for a timestamp.

Like the cal command, date can be used to look up a specific day. The GNU version of date has a −d option that allows you to specify a particular time or date to display:

$ date −d 1/1/2007

Mon Jan 1 00:00:00 PST 2007

$ date +%A −d 11/23 # Find the day of the week for 11/23 this year.

Thursday

touch

Chapter 3 showed how you can use the touch command to create a new empty file. But the primary purpose of touch is to change the access and modification times for each of its filename arguments.

Every file in the UNIX file system has three times associated with it, and the touch command can be used to change two of them. One is the modification time, that is, the time when the file was last changed. This is the time that is displayed with ls −1. Files also have an access time, which can be displayed with ls −lu. You can use the −mtime and −atime options to find in order to search for files according to these times.

The command

$ touch filename

changes both the modification time and access time of filename to the current time. The command touch −m changes only the modification time, and the −a option causes touch to change only the access time.

One use of touch is in working with shell scripts that perform actions according to how recently a file was changed. For example, you could write a script to back up files that used touch to mark each file after copying it. The script could use find to search for files by modification date in order to copy only those files that had changed since the last backup.

Performing Mathematical Calculations

The UNIX system provides several tools for doing mathematical operations. One of these is the factor command, which finds the prime factors of a positive integer. Some systems include the primes command, which can be used to generate a list of prime numbers. This section describes two of the most powerful and useful UNIX tools for mathematical calculations.

§ bc (basic calculator) is a powerful and flexible program for executing arithmetic calculations. It includes control statements and the ability to create and save userdefined functions.

§ dc (desk calculator) is an older alternative to bc. It uses the RPN (Reverse Polish Notation) method of entering data and operations (unlike bc, which uses the more familiar infix method).

bc

The bc command is both a calculator and a mini-language for writing mathematical programs. It provides all of the standard arithmetic operations, as well as a set of control statements and user-definable functions.

Using bc is fairly intuitive, as this example shows.

$ bc

32+17

49

sqrt (49)

7

quit

As you can see, most arithmetic operators act just as you would expect. To add 32 and 17, just type 32+17, and bc will print the result. The command to find a square root is sqrt, and the command to exit bc is quit. To do longer strings of calculations, parentheses can be used to group terms:

$ bc

(( (1+5)*(3+4))/6)^2

49

quit

By default, bc does not save any decimal places in the result of a calculation. For example, if you try to find the square root of 2, it will report that the result is 1:

$ bc

sqrt(2)

1

The bc command can be used to do calculations to any degree of precision, but you must remember to specify how many decimal places to preserve (the scale). You do this by setting the variable scale to the appropriate number. For instance,

scale=4

sqrt (2)

1.4142

This time the result shows the square root to four decimal places.

A number of common mathematical functions are available with the −l (library) option. This tells bc to read in a set of predefined functions, including s (sine), c (cosine), a (arctangent), 1 (natural logarithm), and e (raises the constant e to a power). The following example shows how you could use the arctangent of 1 to find the approximate value of pi:

$ bc −1

scale=6

a(1) * 4

3.141592

You can save the result of a calculation with a variable. For example, you might want to save the value of pi in order to use it in another line:

pi=a(1) * 4

16*pi

50.265472

In newer versions of bc, the result of your latest calculation is automatically saved in the variable last.

Table 19–5 lists the most common bc operators, instructions, and functions.

Table 19–5: bc Operators and Functions

Symbol

Operation

Symbol

Operation

+

Addition

sqrt(x)

Square root

Subtraction

scale=n

Set scale

/

Division

ibase=n

Set input base

*

Multiplication

obase=n

Set output base

%

Remainder

define f(x)

Define function

A

Exponentiation

for, if, while

Control statements

()

Grouping

quit

Exit bc

Changing Bases

The bc command can be used to convert numbers from one base to another. The ibase variable sets the input base, and obase controls the output base. In the following example, when you enter the binary number 11010, bc displays the decimal equivalent, 26:

$ bc

ibase=2

11010

26

ibase=1010

To change back to the default input base of 10, you will need to enter the number 10 in the new base. So, in the preceding example, the line ibase=1010 returns to decimal input, since 1010 is binary for the number 10.

To convert typed decimal numbers to their hexadecimal representation, use obase:

$ bc

obase=16

41968

A3F0

This time you can return to decimal by typing obase=10, since you did not change the input base.

Control Statements

You can use bc control statements to write numerical programs or functions. The bc control statements include for, if, and while. Their syntax and use is the same as the corresponding statements in the C language. Curly brackets can be used to group terms that are part of a control statement.

The following example uses the for statement to compute the first four squares:

$ bc

for(i=1;i<=4;i=i+1) i^2

1

4

9

16

The next example uses while to print the squares of the first ten integers:

x=1

while(x<=10) {

x^2

x=x+1

The following line tests the value of the variable x and, if it is greater than 10, sets y to 10:

if(x>10) y=10

Defining Your Own Functions

You can define your own bc functions and use them just like built-in functions. The format of bc function definitions is the same as that of functions in the C language. The following illustrates how you could define and use a function:

define pyth(a,b){

c=a^2+b^2

return(sqrt(c))

}

pyth (3, 4)

5

Another example, which uses the for statement, is a function to compute factorials:

define f(n) {

auto x, i

x=1

for (i=1;i<=n;i=i+1) x=x*i

return(x)

}

The bc command is ultimately rather limited for programming, but bc can be used in shell scripts to perform calculations. The next chapter describes shell scripting in detail, including a few examples with bc. Alternately, the awk language can be used to perform the same functions that bc provides, but it is more powerful and flexible. awk is described in Chapter 21.

Reading Functions from Files

The bc command allows you to read in a file and then continue to accept keyboard input. This allows you to build up libraries of bc programs and functions. For example, suppose you have saved the factorial function in a file called lib.bc. If you tell bc to read this file when it starts up, you can use these functions just like built-in functions. For instance,

$ bc lib.bc

f (5)

120

dc

The dc command gives you a calculator that uses the Reverse Polish Notation (RPN) method of entering data and operations. This approach is based on the concept of a stack that contains the data you enter, and operators that act on the numbers at the top of the stack.

When you enter a number, it is placed on top of the stack. Each new number is placed above the preceding one. An operation (+, −, etc.) acts on the number or numbers on the top of the stack. In most cases, an operation replaces the numbers on top of the stack with the result of the operation.

Data and operators can be separated by any white space (a new line, space, or tab). You do not need a space between operator symbols. The following shows how you could use dc to add two numbers:

$ dc

32 64+p

96

q

$

In this example, the numbers 32 and 64 are added together. The “p” tells dc to print the result. The “q” is the instruction to quit the program.

The dc command provides the standard arithmetic operators, including remainder (%) and square root (v). The f command prints the full stack. A good way to learn about how dc works is to experiment with it, printing out the stack before and after each operation.

By default, dc does not save any decimal places in the result of a calculation. As with bc, you must remember to specify the scale. To set the scale, enter the desired number followed by k. For instance,

4 k 2vp

1.4142

This example prints the square root of 2 to 4 decimal places.

There are no parentheses in dc. Because the result of each calculation is added to the top of the stack, it is possible to do long strings of calculations without them. The calculation

$ bc

(((1+5)*(3+4))/6)^2

49

would look like

$ dc

1 5+ 3 4+ * 6/ 2^

49

in dc. Notice that the second example is much more concise. This is one of the features of RPN that makes it very appealing to some people.

Table 19–6 shows the symbols for the basic dc operators.

Table 19–6: dc Operators

Symbol

Operation

Symbol

Operation

+

Addition

P

Print top item on stack

Subtraction

d

Duplicate top item on stack

/

Division

r

Reverse top two items on stack

*

Multiplication

f

Print entire stack

%

Remainder

c

Clear stack

^

Exponentiation

sx

Save to memory register x

v

Square root

lx

Load from register x

k

Set scale

l

Set input base

q or Q

Exit program

o

Set output base

Programming in dc

The dc language also includes instructions that you can use to write complex numerical programs. The syntax is very unintuitive, however, and most users will be much more comfortable using bc or another scripting language. Programmers who do learn to program in dc typically see it as an interesting challenge rather than a serious choice of language.

For those who are curious about the sort of programs that can be written in dc, Amit Singh has a remarkably readable example at http://www.kernelthread.com/hanvi/html/dc.html.

Summary

The UNIX System gives you many commands that can be used singly or in combination to perform a wide variety of tasks, and to solve a wide range of problems. They can be thought of as software tools. This chapter has surveyed a number of the most useful tools in the UNIX System toolkit, including tools for finding patterns in files; working with compressed files, modifying structured files; comparing files; and performing numerical computations.

These tools can be used on the command line, as shown in many of these examples, or invoked from scripts. In fact, many of these tools are extremely useful when writing scripts to automate complex tasks. Shell scripting will be discussed in the next chapter. Two very powerful tools for scripting, awk and sed, are discussed in the chapter after that.

Table 19–7 summarizes the commands that were discussed in this chapter.

Table 19–7: Command Summary

Commands

Use

Commands

Use

grep

fgrep

egrep

Search for text in a file

cmp

comm

diff

dircmp

Compare files or directories

pack

compress

gzip

bzip2

Compress a file See section on compressing files for related commands, such as gunzip and zcat

od

strings

View files with unusual characters

tar

Package/extract a set of files

tac

Print a file backward

wc

Count words or lines

pr

font

Add simple formatting

nl

Add line numbers

tr

Translate certain characters

cut

colrm

Get certain columns or fields

tee

Fork output to a file and to standard output

paste

join

Combine files formatted with columns or fields

script

Record all text on the screen

sort

Sort lines

mail

Send mail

uniq

Remove duplicate lines

date

Display the date and time

patch

Update a file to a new version

touch

Update the date on a file

spell

ispell

Check spelling

bc

dc

Perform calculations

How to Find Out More

Many books on the UNIX System contain discussions of tools and filters. One excellent work, with lots of examples, is

· Powers, Shelley et al. UNIX Power Tools. 3rd ed. Sebastopol, CA: O’Reilly & Associates, 2002.

A dictionary-like reference, similar in style to the UNIX man pages, is

· Robbins, Arnold. UNIX in a Nutshell. 4th ed. Sebastopol, CA: O’Reilly Media, 2005.

Information on GNU project tools, and how to obtain them, is available at the GNU web site, http://www.gnu.org/.