Putting It All Together - Advanced Programming Concepts - Practical C Programming, 3rd Edition (2011)

Practical C Programming, 3rd Edition (2011)

Part III. Advanced Programming Concepts

Chapter 22. Putting It All Together

For there isn’t a job on the top of the earth the beggar don’t know, nor do.

—Rudyard Kipling

In this chapter, we create a complete program. Every step of the process is covered from setting forth the requirements to testing the result.

Requirements

Before we start, we need to decide what we are going to do. This step is very important and is left out of far too many programming cycles.

This chapter’s program must fulfill several requirements. First, it must be long enough to demonstrate modular programming, but at the same time be short enough to fit inside a single chapter. It must be complex enough to demonstrate a wide range of C features, but be simple enough for a novice C programmer to understand.

Finally, the program must be useful. Usefulness is not simple to define. What’s useful to one person might not be useful to another. We decided to refine this requirement and restate it as “The program must be useful to C programmers.”

The program we have selected reads C source files and generates simple statistics on the nesting of parentheses, and in the ratio of comments to code lines.

Specification

The specification for our statistics program is shown in the following sidebar:

PRELIMINARY SPECIFICATION FOR A C STATISTICS GATHERING PROGRAM

Steve Oualline

February 10, 1996

The program stat gathers statistics about C source files and prints them. The command line is:

stat <files..>

Where <files..> is a list of source files. The following shows the output of the program on a short test file.

[File: stat/stat.out]

1 (0 {0 /*-*/

2 (0 {0 /********************************************************

3 (0 {0 * Name: Calculator (Version 2). *

4 (0 {0 * *

5 (0 {0 * Purpose: *

6 (0 {0 * Act like a simple four-function calculator. *

7 (0 {0 * *

8 (0 {0 * Usage: *

9 (0 {0 * Run the program. *

10 (0 {0 * Type in an operator (+ - * /) and a number. *

11 (0 {0 * The operation will be performed on the current *

12 (0 {0 * result, and a new result will be displayed. *

13 (0 {0 * *

14 (0 {0 * Type 'Q' to quit. *

15 (0 {0 * *

16 (0 {0 * Notes: Like version 1 but written with a switch *

17 (0 {0 * statement. *

18 (0 {0 ********************************************************/

19 (0 {0 /*+*/

20 (0 {0 #include <stdio.h>

21 (0 {0 char line[100]; /* line of text from input */

22 (0 {0

23 (0 {0 int result; /* the result of the calculations */

24 (0 {0 char operator; /* operator the user specified */

25 (0 {0 int value; /* value specified after the operator */

26 (0 {0 int main()

27 (0 {1 {

28 (0 {1 result = 0; /* initialize the result */

29 (0 {1

30 (0 {1 /* loop forever (or until break reached) */

31 (0 {2 while (1) {

32 (0 {2 printf("Result: %d\n", result);

33 (0 {2 printf("Enter operator and number: ");

34 (0 {2

35 (0 {2 fgets(line, sizeof(line), stdin);

36 (0 {2 sscanf(line, "%c %d", &operator, &value);

37 (0 {2

38 (0 {2 if ((operator == 'q') || (operator == 'Q'))

39 (0 {2 break;

40 (0 {3 switch (operator) {

41 (0 {3 case '+':

42 (0 {3 result += value;

43 (0 {3 break;

44 (0 {3 case '-':

45 (0 {3 result -= value;

46 (0 {3 break;

47 (0 {3 case '*':

48 (0 {3 result *= value;

49 (0 {3 break;

50 (0 {3 case '/':

51 (0 {4 if (value == 0) {

52 (0 {4 printf("Error:Divide by zero\n");

53 (0 {4 printf(" operation ignored\n");

54 (0 {3 } else

55 (0 {3 result /= value;

56 (0 {3 break;

57 (0 {3 default:

58 (0 {3 printf("Unknown operator %c\n", operator);

59 (0 {3 break;

60 (0 {2 }

61 (0 {1 }

62 (0 {1 return (0);

63 (0 {0 }

Total number of lines: 63

Maximum nesting of () : 2

Maximum nesting of {} : 4

Number of blank lines .................4

Number of comment only lines ..........20

Number of code only lines .............34

Number of lines with code and comments 5

Comment to code ratio 64.1%

Code Design

Several schools of code design exist. In structured programming, you divide the code into modules, then divide the modules into submodules, then divide the sub-modules into subsubmodules, and so on. Other approaches exist, such as state tables and transition diagrams.

All have the same basic principle at heart: “Arrange the program’s information in the clearest and simplest way possible, and then try to turn it into C code.”

Our program breaks down into several logical modules. First, we have a token scanner, which reads raw C code and turns it into tokens. This subdivides into three smaller modules. The first reads the input file, the second determines what type of character we have, and finally, the third assembles this information into a token. A token is a group of characters that form a single word, number, or symbol.

The main module consumes tokens and output statistics. Again, this module breaks down into smaller submodules: a do_file procedure to manage each file and a submodule for each statistic.

Token Module

Our program scans C source code and uses the tokens to generate statistics. For example, the line:

answer = (123 + 456) / 89; /* Compute some sort of result */

consists of the tokens:

T_ID The word "answer"

T_OPERATOR The character "="

T_L_PAREN Left parenthesis

T_NUMBER The number 123

T_OPERATOR The character "+"

T_NUMBER The number 456

T_R_PAREN The right parenthesis

T_OPERATOR The divide operator

T_NUMBER The number 89

T_OPERATOR The semicolon

T_COMMENT The comment

T_NEW_LINE The end-of-line character

So how do we identify a token? Most of the time, we simply have to look at the first character. For example, the * character signals an operator token, while the A character starts an identifier. But some characters are ambiguous. For example, the / character could be the divide operator, or it could be the start of a /* comment. In order to identify this character, we need to look ahead one character. So one of the requirements for our input module is that it allow us to peek ahead one character.

The token module builds tokens from groups of characters. For example, an identifier is defined as a letter or underscore, followed by any number of letters or digits. So our tokenizer needs to contain the pseudo code:

if the current character is a letter, then

scan until we get a character that's not a letter or digit.

As you can see from the pseudo code, our tokenizer depends a great deal on character types, so we need a module to help us with the type information.

Input Module

The input module needs to do two things: first, it needs to provide the current and next characters to the token module, and second, it needs to buffer the entire line for display at a later time.

How not to design an input module

Sometimes a software design will undergo several revisions before it is finally coded, as was the case with the input module. I carefully designed the module, reviewed my design, and decided to throw it away.

However, nothing written is completely wasted, so I am including the first design as a example of how things go wrong and what can be done to improve them.

The first design consisted of a public structure:

struct input_file {

FILE *file; /* File we are reading */

char line[LINE_MAX];/* Current line */

char *char_ptr; /* Current character on the line */

int cur_char; /* Current character (can be EOF) */

int next_char; /* Next character (can be EOF) */

};

and functions that operated on the structure:

extern void in_open(struct input_file *in_file, const char name[]);

extern void in_read_char(struct input_file *in_file);

extern void in_flush(struct input_file *in_file);

To use this package, the caller needs to call in_open to open the file and then check in_file.file to see if the file is opened. In C, these operations are done as follows:

struct input_file in_file; /* File for input */

/* ... */

in_open(&in_file, name);

if (in_file.file == NULL) {

fprintf(stderr,"Error: Could not open input file: %s\n", name);

The token module needs to look at the current and next character. For example, when it sees a slash (/) and a star (*), it knows that it’s looking at a comment. The current character is stored in in_file.cur_char and the next character is in in_file.next_char. The C code to check for a comment might look like:

if ((in_file.cur_char == '/') && (in_file.next_char == '*')) {

/* Handle a comment */

To move up a character, the user calls in_read_char to advance the input by one character.

Finally, when the file is finished, the user closes the file with the statement:

fclose(in_file.file);

In a good module design:

§ The amount of information needed by the people who use the module should be minimized.

§ The number rules that the users of the module must follow in order to use the module properly should be small.

§ The module should be easily expandable.

The design of the input module requires that the user know an awful lot about how the module is designed. In order to open a file, the user must know to call the in_open function, then check the file field of the in_file structure for errors. Consequently, the user must be aware of the in_filestructure and, what’s worse, the internal workings of the in_file structure.

There are other cases for which the user needs to know about the internals of struct in_file, such as when accessing the current character (cur_char) or next character (next_char). Also, the user must manually close the file using yet another member of our data structure.

So this design requires that the user know a great deal about the internal workings of our module and access things properly. A better design would demand much less from the user.

A better input module

Our new module design eliminates the in_file structure (as far as the user is concerned) and provides the following functions:

extern int in_open(const char name[]);

extern void in_close(void);

extern void in_read_char(void);

extern int in_cur_char(void);

extern int in_next_char(void);

extern void in_flush(void);

These functions hide all the bookkeeping necessary to handle the input file. Also, the opening of a file is simpler. We can open the file and check for errors in one function call.

The big win in this design is that the caller does not need to know about the structure of the input file. In fact, the structure has been removed from the header entirely. This design tremendously simplifies the information that the caller needs in order to use the module.

This design has a couple of drawbacks. The module has a lot more functions than our previous design. Also, the module allows us to have only one file open at a time. This restriction exists because the structure we removed from the head is being placed in the module itself. The one-file restriction limits our flexibility. However, we don’t need to open multiple files, so we don’t need this feature at this time. In this case, we decided that the gain of simplifying the interface was well worth the decrease in flexibility.

Character Type Module

The purpose of the character type module is to read characters and decode their types. Some types overlap. For example, the C_ALPHA_NUMERIC includes the C_NUMERIC character set.

This module stores most of the type information in an array, and requires only a little logic to handle the special types like C_ALPHA_NUMERIC.

The functions in this module are:

extern int is_char_type(int ch, enum CHAR_TYPE kind);

extern enum CHAR_TYPE get_char_type(int ch);

One question comes up: how do we initialize the character array? We could require the user to initialize it before he calls any of the functions, such as:

main() {

/* .... */

init_char_type();

/* ...... */

type_info = ch_to_type(ch);

Another solution is to put a check at the beginning of each function to initialize the array if needed:

int is_char_type(int ch, enum CHAR_TYPE kind)

if (!ch_setup) {

init_char_type();

ch_setup = 0;

}

The second method requires a little more code. But it has several advantages. First, it makes life simpler for the user. He doesn’t have to remember to initialize the character type module. Also, mistakes are harder to make. If the user doesn’t have to do the initialization, he can’t forget to do it. Finally, this method hides an internal bookkeeping matter inside the character type module so that the user doesn’t have to worry about it.

Statistics Submodules

Each of our statistics submodules looks at the token stream and produces statistics about it. For example, the parentheses counter counts the nesting of parentheses. Some statistics are reported on a line-by-line basis, such as the current parentheses nesting. Others are reported at the end-of-file, such as the maximum nesting of parentheses.

We collect four statistics, a count of the number of lines, the parentheses ( ) nesting, the curly-brace {} nesting, and a count of the lines with comments versus the lines without comments. Because each of our static submodules performs similar functions, we give the procedures in each of them similar names: (xx is the sub-module identifier listed below).

xx_init

Initializes the statistic. This function is called at the beginning of each file.

xx_take_token

Receives a token and updates the statistic based on it.

xx_line_start

Writes out the value of the statistic that’s output at the beginning of each line. In some cases, this may be nothing.

xx_eof

Writes out the statistical information that goes at the end of the file.

The xx part of the identifier is the submodule identifier. It is:

lc

Line counter submodule

pc

Parentheses counter submodule

bc

Brace counter submodule

cc

Comment / not-comment line counter submodule

Coding

The coding process was fairly simple. The only problem that came up was getting the end-of-line right.

Functional Description

This section describes all modules and major functions in our program. For a more complete and detailed description, take a look at the listings at the end of this chapter.

ch_type Module

The ch_type module computes the type of a character. For the most part, this computation is done through a table named type_info. Some types like C_ALPHA_NUMERIC, for example, include two different types of characters, C_ALPHA and C_DIGIT. So, in addition to our table, we need a little code for the special cases.

in_file Module

This modules reads data from the input file one character at a time. It buffers up a line and on demand writes it to the output.

token Module

We want an input stream of tokens. But what we have to start with is an input stream consisting of characters. The main function of this class, next_token, turns characters into tokens. Actually, our tokenizer is rather simple, because we don’t have to deal with most of the details that a full C tokenizer must handle.

The coding for this function is fairly straightforward, except for the fact that it breaks up multiline comments into a series of T_COMMENT and T_NEW_LINE tokens.

Line Counter Submodule (lc)

The simplest statistic we collect is a count of the number of lines processed so far. This concept is done through the line_counter submodule. The only token it cares about is T_NEW_LINE. At the beginning of each line, it outputs the line number (the current count of the T_NEW_LINE tokens). At the end-of-file, this submodule outputs nothing. We define a lc_eof function for consistency’s sake, but the function does absolutely nothing.

Brace Counter Submodule (bc)

This submodule keeps track of the nesting level of the curly braces {}. We feed the submodule a stream of tokens through the bc_take_token function. This function keeps track of the left and right curly braces and ignores everything else:

void bc_take_token(enum TOKEN_TYPE token) {

switch (token) {

case T_L_CURLY:

++bc_cur_level;

if (bc_cur_level > bc_max_level)

bc_max_level = bc_cur_level;

break;

case T_R_CURLY:

--bc_cur_level;

break;

default:

/* Ignore */

break;

}

}

The results of this statistic are printed in two places. The first is at the beginning of each line. The second is at the end-of-file. We define two functions to print these statistics:

static void bc_line_start(void) {

printf("{%-2d ", bc_cur_level);

}

static void bc_eof(void) {

printf("Maximum nesting of {} : %d\n", bc_max_level);

}

Parentheses Counter Submodule (pc)

This submodule is very similar to the brace counter submodule. As a matter of fact, it was created by copying the brace counter submodule and performing a few simple edits.

We probably should combine the Parentheses Counter submodule and the Brace Counter submodule into one submodule that uses a parameter to tell it what to count. Oh well, something for the next version.

Comment Counter Submodule (cc)

In these functions, we keep track of lines with comments in them, lines with code in them, lines with both, and lines with none. The results are printed at the end-of-file.

do_file Procedure

The do_file procedure reads each file one token at a time, and sends each token to the take_token routine for every statistic class:

while (1) {

cur_token = next_token();

lc_take_token(cur_token);

pc_take_token(cur_token);

bc_take_token(cur_token);

cc_take_token(cur_token);

Expandability

One thing to keep in mind when designing any software is expandability. In other words, what must someone do to add to our program? Suppose someone wants to add a new statistic to our program. What must they do?

Suppose they are adding a Word Count submodule (wc). They need to define four procedures: wc_init, wc_take_token, wc_line_start, and wc_eof.

We need to call these procedures at the proper point. But how do they know where to put the procedure calls? The answer is that they can use the editor to look for every place that a Comment Counter submodule procedure is used and clone the Comment Counter calls. This method is not exactly the best way of doing things, especially if the calls are spread across several files. But, the method is the best that we can come up with given the limitations of C.

C++ has no such limitations. In the book Practical C++ Programming, we design a similar program using C++ classes. The result is that instead of having multiple lists of procedure classes, we have one class list. The single list makes expandability and maintainability much easier.

Testing

To test this program, we came up with a small C program that contains every different type of possible token. The results are shown in Example 22-1.

Example 22-1. stat/test.out

1 (0 {0 /* This is a single line comment */

2 (0 {0 /*

3 (0 {0 * This is a multiline

4 (0 {0 * comment.

5 (0 {0 */

6 (0 {0 int main()

7 (0 {1 {

8 (0 {1 /* A procedure */

9 (0 {1 int i; /* Comment / code line */

10 (0 {1 char foo[10];

11 (0 {1

12 (0 {1 strcpy(foo, "abc"); /* String */

13 (0 {1 strcpy(foo, "a\"bc"); /* String with special character */

14 (0 {1

15 (0 {1 foo[0] = 'a'; /* Character */

16 (0 {1 foo[1] = '\''; /* Character with escape */

17 (0 {1

18 (0 {1 i = 3 / 2; /* Slash that's not a commment */

19 (0 {1 i = 3; /* Normal number */

20 (0 {1 i = 0x123ABC; /* Hex number */

21 (0 {1

22 (1 {1 i = ((1 + 2) * /* Nested () */

23 (0 {1 (3 + 4));

24 (0 {1

25 (0 {2 {

26 (0 {2 int j; /* Nested {} */

27 (0 {1 }

28 (0 {1 return (0);

29 (0 {0 }

30 (0 {0

Total number of lines: 30

Maximum nesting of () : 2

Maximum nesting of {} : 2

Number of blank lines .................6

Number of comment only lines ..........6

Number of code only lines .............8

Number of lines with code and comments 10

Comment to code ratio 88.9%

Revisions

Currently, the program collects a very limited set of statistics. You could add things like average identifier size and tracking statistics for each procedure. One thing we kept in mind when we designed our program was the need for expandability.

We stopped our statistics collection at four types of statistics because we had fulfilled our mission to demonstrate a reasonably advanced set of C constructs. We didn’t add more because the program would be too complex to fit in the chapter. On the whole, the program does its job well.

A Final Warning

Just because you can generate a statistic doesn’t mean that it’s useful.

Program Files

The in_file.h File

/********************************************************

* input_file -- Data from the input file. *

* *

* The current two characters are stored in *

* cur_char and next_char. *

* Lines are buffered so that they can be output to *

* the screen after a line is assembled. *

* *

* Functions: *

* in_open -- Opens the input file. *

* in_close -- Closes the input file. *

* read_char -- Reads the next character. *

* in_char_char -- Returns the current character. *

* in_next_char -- Returns the next character. *

* in_flush -- Sends line to the screen. *

********************************************************/

/********************************************************

* in_open -- Opens the input file. *

* *

* Parameters *

* name -- Name of disk file to use for input. *

* *

* Returns *

* 0 -- Open succeeded. *

* nonzero -- Open failed. *

********************************************************/

extern int in_open(const char name[]);

/********************************************************

* in_close -- Closes the input file. *

********************************************************/

extern void in_close(void);

/********************************************************

* in_read_char -- Read the next character from the *

* input file. *

********************************************************/

extern void in_read_char(void);

/********************************************************

* in_cur_char -- Gets the current input character. *

* *

* Returns *

* current character. *

********************************************************/

extern int in_cur_char(void);

/********************************************************

* in_next_char -- Peeks ahead one character. *

* *

* Returns *

* next character. *

********************************************************/

extern int in_next_char(void);

/********************************************************

* in_flush -- Flushes the buffered input line to the *

* screen. *

********************************************************/

extern void in_flush(void);

The in_file.c File

/********************************************************

* infile module *

* Handles opening, reading, and display of *

* data from the input file. *

* *

* Functions: *

* in_open -- Opens the input file. *

* in_close -- Closes the input file. *

* read_char -- Reads the next character. *

* in_char_char -- Returns the current character. *

* in_next_char -- Returns the next character. *

* in_flush -- Sends line to the screen. *

********************************************************/

#include <stdio.h>

#include <errno.h>

#include "in_file.h"

#define LINE_MAX 500 /* Longest possible line */

struct input_file {

FILE *file; /* File we are reading */

char line[LINE_MAX];/* Current line */

char *char_ptr; /* Current character on the line */

int cur_char; /* Current character (can be EOF)*/

int next_char; /* Next character (can be EOF) */

};

/* Input file that we are reading */

static struct input_file in_file = {

NULL, /* file */

"", /* line */

NULL, /* char_ptr */

'\0', /* cur_char */

'\0' /* next_char */

};

/********************************************************

* in_open -- Opens the input file. *

* *

* Parameters *

* name -- Name of disk file to use for input. *

* *

* Returns *

* 0 -- Open succeeded. *

* nonzero -- Open failed. *

********************************************************/

int in_open(const char name[])

{

in_file.file = fopen(name, "r");

if (in_file.file == NULL)

return (errno);

/*

* Initialize the input file and read the first two

* characters.

*/

in_file.cur_char = fgetc(in_file.file);

in_file.next_char = fgetc(in_file.file);

in_file.char_ptr = in_file.line;

return (0);

}

/********************************************************

* in_close -- Closes the input file. *

********************************************************/

void in_close(void)

{

if (in_file.file != NULL) {

fclose(in_file.file);

in_file.file = NULL;

}

}

/********************************************************

* in_cur_char -- Gets the current input character. *

* *

* Returns *

* current character. *

********************************************************/

int in_cur_char(void)

{

return (in_file.cur_char);

}

/********************************************************

* in_next_char -- Peeks ahead one character. *

* *

* Returns *

* next character. *

********************************************************/

int in_next_char(void)

{

return (in_file.next_char);

}

/********************************************************

* in_flush -- Flushes the buffered input line to the *

* screen. *

********************************************************/

void in_flush(void)

{

*in_file.char_ptr = '\0'; /* End the line */

fputs(in_file.line, stdout); /* Send the line */

in_file.char_ptr = in_file.line; /* Reset the line */

}

/********************************************************

* in_read_char -- Reads the next character from the *

* input file. *

********************************************************/

void in_read_char(void)

{

*in_file.char_ptr = in_file.cur_char;

++in_file.char_ptr;

in_file.cur_char = in_file.next_char;

in_file.next_char = fgetc(in_file.file);

};

The ch_type.h File

/********************************************************

* char_type -- Character type module. *

********************************************************/

enum CHAR_TYPE {

C_EOF, /* End of file character */

C_WHITE, /* Whitespace or control character */

C_NEWLINE, /* A newline character */

C_ALPHA, /* A Letter (includes _) */

C_DIGIT, /* A Number */

C_OPERATOR, /* Random operator */

C_SLASH, /* The character '/' */

C_L_PAREN, /* The character '(' */

C_R_PAREN, /* The character ')' */

C_L_CURLY, /* The character '{' */

C_R_CURLY, /* The character '}' */

C_SINGLE, /* The character '\'' */

C_DOUBLE, /* The character '"' */

/* End of simple types, more complex, derrived types follow */

C_HEX_DIGIT,/* Hexidecimal digit */

C_ALPHA_NUMERIC/* Alpha numeric */

};

/********************************************************

* is_char_type -- Determines if a character belongs to *

* a given character type. *

* *

* Parameters *

* ch -- Character to check. *

* kind -- Type to check it for. *

* *

* Returns: *

* 0 -- Character is not of the specified kind. *

* 1 -- Character is of the specified kind. *

********************************************************/

extern int is_char_type(int ch, enum CHAR_TYPE kind);

/********************************************************

* get_char_type -- Given a character, returns its type.*

* *

* Note: We return the simple types. Composite types *

* such as C_HEX_DIGIT and C_ALPHA_NUMERIC are not *

* returned. *

* *

* Parameters: *

* ch -- Character having the type we want. *

* *

* Returns *

* character type. *

********************************************************/

extern enum CHAR_TYPE get_char_type(int ch);

The ch_type.c File

/********************************************************

* ch_type package *

* *

* This module is used to determine the type of *

* various characters. *

* *

* Public functions: *

* init_char_type -- Initializes the table. *

* is_char_type -- Is a character of a given type? *

* get_char_type -- Given char, returns type. *

********************************************************/

#include <stdio.h>

#include "ch_type.h"

/* Define the type information array */

static enum CHAR_TYPE type_info[256];

static int ch_setup = 0; /* True if character type info setup */

/********************************************************

* fill_range -- Fills in a range of types for the *

* character type class. *

* *

* Parameters *

* start, end -- Range of items to fill in. *

* type -- Type to use for filling. *

********************************************************/

static void fill_range(int start, int end, enum CHAR_TYPE type)

{

int cur_ch; /* Character we are handling now */

for (cur_ch = start; cur_ch <= end; ++cur_ch) {

type_info[cur_ch] = type;

}

}

/*********************************************************

* init_char_type -- Initializes the char type table. *

*********************************************************/

static void init_char_type(void)

{

fill_range(0, 255, C_WHITE);

fill_range('A', 'Z', C_ALPHA);

fill_range('a', 'z', C_ALPHA);

type_info['_'] = C_ALPHA;

fill_range('0', '9', C_DIGIT);

type_info['!'] = C_OPERATOR;

type_info['#'] = C_OPERATOR;

type_info['$'] = C_OPERATOR;

type_info['%'] = C_OPERATOR;

type_info['^'] = C_OPERATOR;

type_info['&'] = C_OPERATOR;

type_info['*'] = C_OPERATOR;

type_info['-'] = C_OPERATOR;

type_info['+'] = C_OPERATOR;

type_info['='] = C_OPERATOR;

type_info['|'] = C_OPERATOR;

type_info['~'] = C_OPERATOR;

type_info[','] = C_OPERATOR;

type_info[':'] = C_OPERATOR;

type_info['?'] = C_OPERATOR;

type_info['.'] = C_OPERATOR;

type_info['<'] = C_OPERATOR;

type_info['>'] = C_OPERATOR;

type_info['/'] = C_SLASH;

type_info['\n'] = C_NEWLINE;

type_info['('] = C_L_PAREN;

type_info[')'] = C_R_PAREN;

type_info['{'] = C_L_CURLY;

type_info['}'] = C_R_CURLY;

type_info['"'] = C_DOUBLE;

type_info['\''] = C_SINGLE;

}

/********************************************************

* is_char_type -- Determines if a character belongs to *

* a given character type. *

* *

* Parameters *

* ch -- Character to check. *

* kind -- Type to check it for. *

* *

* Returns: *

* 0 -- Character is not of the specified kind. *

* 1 -- Character is of the specified kind. *

********************************************************/

int is_char_type(int ch, enum CHAR_TYPE kind)

{

if (!ch_setup) {

init_char_type();

ch_setup = 1;

}

if (ch == EOF) return (kind == C_EOF);

switch (kind) {

case C_HEX_DIGIT:

if (type_info[ch] == C_DIGIT)

return (1);

if ((ch >= 'A') && (ch <= 'F'))

return (1);

if ((ch >= 'a') && (ch <= 'f'))

return (1);

return (0);

case C_ALPHA_NUMERIC:

return ((type_info[ch] == C_ALPHA) ||

(type_info[ch] == C_DIGIT));

default:

return (type_info[ch] == kind);

}

};

/********************************************************

* get_char_type -- Given a character, returns its type.*

* *

* Note: We return the simple types. Composite types *

* such as C_HEX_DIGIT and C_ALPHA_NUMERIC are not *

* returned. *

* *

* Parameters: *

* ch -- Character having the type we want. *

* *

* Returns *

* character type. *

********************************************************/

enum CHAR_TYPE get_char_type(int ch) {

if (!ch_setup) {

init_char_type();

ch_setup = 1;

}

if (ch == EOF) return (C_EOF);

return (type_info[ch]);

}

The token.h File

/********************************************************

* token -- Token handling module. *

* *

* Functions: *

* next_token -- Gets the next token from the input.*

********************************************************/

/*

* Define the enumerated list of tokens.

*/

enum TOKEN_TYPE {

T_NUMBER, /* Simple number (floating point or integer */

T_STRING, /* String or character constant */

T_COMMENT, /* Comment */

T_NEWLINE, /* Newline character */

T_OPERATOR, /* Arithmetic operator */

T_L_PAREN, /* Character "(" */

T_R_PAREN, /* Character ")" */

T_L_CURLY, /* Character "{" */

T_R_CURLY, /* Character "}" */

T_ID, /* Identifier */

T_EOF /* End of File */

};

/*

* We use #define here instead of "const int" because so many old

* software packages use #define. We must have picked

* up a header file that uses #define for TRUE/FALSE. Consequently,

* we protect against double defines as well as against using #define

* ourselves.

*/

#ifndef TRUE

#define TRUE 1 /* Define a simple TRUE/FALSE values */

#define FALSE 0

#endif /* TRUE */

/*********************************************************

* next_token -- Reads the next token in an input stream.*

* *

* Parameters *

* in_file -- File to read. *

* *

* Returns *

* next token. *

*********************************************************/

extern enum TOKEN_TYPE next_token(void);

The token.c File

/*********************************************************

* token -- Token handling module. *

* *

* Functions: *

* next_token -- Gets the next token from the input.*

*********************************************************/

#include <stdio.h>

#include <stdlib.h>

#include "ch_type.h"

#include "in_file.h"

#include "token.h"

static int in_comment = FALSE; /* True if we're in a comment */

/*********************************************************

* read_comment -- Reads in a comment. *

* *

* Returns *

* Token read. Can be a T_COMMENT or T_NEW_LINE, *

* depending on what we read. *

* *

* Multiline comments are split into multiple *

* tokens. *

*********************************************************/

static enum TOKEN_TYPE read_comment(void)

{

if (in_cur_char() == '\n') {

in_read_char();

return (T_NEWLINE);

}

while (1) {

in_comment = TRUE;

if (in_cur_char() == EOF) {

fprintf(stderr, "Error: EOF inside comment\n");

return (T_EOF);

}

if (in_cur_char() == '\n')

return (T_COMMENT);

if ((in_cur_char() == '*') &&

(in_next_char() == '/')) {

in_comment = FALSE;

/* Skip past the ending */

in_read_char();

in_read_char();

return (T_COMMENT);

}

in_read_char();

}

}

/*********************************************************

* next_token -- Reads the next token in an input stream.*

* *

* Returns *

* next token. *

*********************************************************/

enum TOKEN_TYPE next_token(void)

{

if (in_comment)

return (read_comment());

while (is_char_type(in_cur_char(), C_WHITE)) {

in_read_char();

}

if (in_cur_char() == EOF)

return (T_EOF);

switch (get_char_type(in_cur_char())) {

case C_NEWLINE:

in_read_char();

return (T_NEWLINE);

case C_ALPHA:

while (is_char_type(in_cur_char(), C_ALPHA_NUMERIC))

in_read_char();

return (T_ID);

case C_DIGIT:

in_read_char();

if ((in_cur_char() == 'X') || (in_cur_char() == 'x')) {

in_read_char();

while (is_char_type(in_cur_char(), C_HEX_DIGIT))

in_read_char();

return (T_NUMBER);

}

while (is_char_type(in_cur_char(), C_DIGIT))

in_read_char();

return (T_NUMBER);

case C_SLASH:

/* Check for '/', '*' characters */

if (in_next_char() == '*') {

return (read_comment());

}

/* Fall through */

case C_OPERATOR:

in_read_char();

return (T_OPERATOR);

case C_L_PAREN:

in_read_char();

return (T_L_PAREN);

case C_R_PAREN:

in_read_char();

return (T_R_PAREN);

case C_L_CURLY:

in_read_char();

return (T_L_CURLY);

case C_R_CURLY:

in_read_char();

return (T_R_CURLY);

case C_DOUBLE:

while (1) {

in_read_char();

/* Check for end of string */

if (in_cur_char() == '"')

break;

/* Escape character, then skip the next character */

if (in_cur_char() == '\\')

in_read_char();

}

in_read_char();

return (T_STRING);

case C_SINGLE:

while (1) {

in_read_char();

/* Check for end of character */

if (in_cur_char() == '\'')

break;

/* Escape character, then skip the next character */

if (in_cur_char() == '\\')

in_read_char();

}

in_read_char();

return (T_STRING);

default:

fprintf(stderr, "Internal error: Very strange character\n");

abort();

}

fprintf(stderr, "Internal error: We should never get here\n");

abort();

return (T_EOF); /* Should never get here either */

/* But we put in the return to avoid a compiler */

/* warning. */

}

The stat.c File

/********************************************************

* stat *

* Produces statistics about a program. *

* *

* Usage: *

* stat [options] <file-list> *

* *

********************************************************/

#include <stdio.h>

#include <stdlib.h>

#include <memory.h>

#include "ch_type.h"

#include "in_file.h"

#include "token.h"

/********************************************************

********************************************************

********************************************************

* line_counter -- Handles line number / line count *

* stat. *

* *

* Counts the number of T_NEW_LINE tokens seen and *

* outputs the current line number at the beginning *

* of the line. *

* *

* At EOF, it will output the total number of lines. *

********************************************************/

static int cur_line; /* Current line number */

/********************************************************

* lc_init -- Initializes the line counter variables. *

********************************************************/

static void lc_init(void)

{

cur_line = 0;

};

/********************************************************

* lc_take_token -- Consumes tokens and looks for *

* end-of-line tokens. *

* *

* Parameters *

* token -- The token coming in from the input *

* stream. *

********************************************************/

static void lc_take_token(enum TOKEN_TYPE token) {

if (token == T_NEWLINE)

++cur_line;

}

/********************************************************

* lc_line_start -- Outputs the per-line statistics, *

* namely the current line number. *

********************************************************/

static void lc_line_start(void) {

printf("%4d ", cur_line);

}

/********************************************************

* lc_eof -- Outputs the eof statistics. *

* In this case, the number of lines. *

********************************************************/

static void lc_eof(void) {

printf("Total number of lines: %d\n", cur_line);

}

/********************************************************

********************************************************

********************************************************

* paren_count -- Counts the nesting level of (). *

* *

* Counts the number of T_L_PAREN vs T_R_PAREN tokens *

* and writes the current nesting level at the beginning*

* of each line. *

* *

* Also keeps track of the maximum nesting level. *

********************************************************/

static int pc_cur_level;

static int pc_max_level;

/********************************************************

* pc_init -- Initializes the () counter variables. *

********************************************************/

void pc_init(void) {

pc_cur_level = 0;

pc_max_level = 0;

};

/********************************************************

* pc_take_token -- Consumes tokens and looks for *

* () tokens. *

* *

* Parameters *

* token -- The token coming in from the input *

* stream. *

********************************************************/

void pc_take_token(enum TOKEN_TYPE token) {

switch (token) {

case T_L_PAREN:

++pc_cur_level;

if (pc_cur_level > pc_max_level)

pc_max_level = pc_cur_level;

break;

case T_R_PAREN:

--pc_cur_level;

break;

default:

/* Ignore */

break;

}

}

/********************************************************

* pc_line_start -- Outputs the per-line statistics, *

* namely the current () nesting. *

********************************************************/

static void pc_line_start(void) {

printf("(%-2d ", pc_cur_level);

}

/********************************************************

* pc_eof -- Outputs the eof statistics. *

* In this case, the max nesting of (). *

********************************************************/

void pc_eof(void) {

printf("Maximum nesting of () : %d\n", pc_max_level);

}

/********************************************************

********************************************************

********************************************************

* brace_counter -- Counts the nesting level of {}. *

* *

* Counts the number of T_L_CURLY vs T_R_CURLY tokens *

* and writes the current nesting level at the beginning*

* of each line. *

* *

* Also, keeps track of the maximum nesting level. *

* *

* Note: brace_counter and paren_counter should *

* probably be combined. *

********************************************************/

static int bc_cur_level; /* Current nesting level */

static int bc_max_level; /* Maximum nesting level */

/********************************************************

* pc_init -- Initialize the {} counter variables. *

********************************************************/

void bc_init(void) {

bc_cur_level = 0;

bc_max_level = 0;

};

/********************************************************

* bc_take_token -- Consumes tokens and looks for *

* {} tokens. *

* *

* Parameters *

* token -- The token coming in from the input *

* stream. *

********************************************************/

void bc_take_token(enum TOKEN_TYPE token) {

switch (token) {

case T_L_CURLY:

++bc_cur_level;

if (bc_cur_level > bc_max_level)

bc_max_level = bc_cur_level;

break;

case T_R_CURLY:

--bc_cur_level;

break;

default:

/* Ignore */

break;

}

}

/********************************************************

* bc_line_start -- Outputs the per-line statistics, *

* namely the current {} nesting. *

********************************************************/

static void bc_line_start(void) {

printf("{%-2d ", bc_cur_level);

}

/********************************************************

* bc_eof -- Outputs the eof statistics. *

* In this case, the max nesting of {}. *

********************************************************/

static void bc_eof(void) {

printf("Maximum nesting of {} : %d\n", bc_max_level);

}

/********************************************************

********************************************************

********************************************************

* comment_counter -- Counts the number of lines *

* with and without comments. *

* *

* Outputs nothing at the beginning of each line, but *

* will output a ratio at the end of file. *

* *

* Note: This class makes use of two bits: *

* CF_COMMENT -- a comment was seen *

* CF_CODE -- code was seen *

* to collect statistics. *

* *

* These are combined to form an index into the counter *

* array so the value of these two bits is very *

* important. *

********************************************************/

static const int CF_COMMENT = (1<<0); /* Line contains comment */

static const int CF_CODE = (1<<1); /* Line contains code */

/* These bits are combined to form the statistics */

/* 0 -- [0] Blank line */

/* CF_COMMENT -- [1] Comment-only line */

/* CF_CODE -- [2] Code-only line */

/* CF_COMMENT|CF_CODE -- [3] Comments and code on this line */

static int counters[4]; /* Count of various types of stats. */

static int flags; /* Flags for the current line */

/********************************************************

* cc_init -- Initializes the comment counter variables.*

********************************************************/

static void cc_init(void) {

memset(counters, '\0', sizeof(counters));

flags = 0;

};

/********************************************************

* cc_take_token -- Consumes tokens and looks for *

* comments tokens. *

* *

* Parameters *

* token -- The token coming in from the input *

* stream. *

********************************************************/

void cc_take_token(enum TOKEN_TYPE token) {

switch (token) {

case T_COMMENT:

flags |= CF_COMMENT;

break;

default:

flags |= CF_CODE;

break;

case T_NEWLINE:

++counters[flags];

flags = 0;

break;

}

}

/********************************************************

* cc_line_start -- Outputs the per-line statistics. *

********************************************************/

static void cc_line_start(void)

{

/* Do nothing */

}

/********************************************************

* cc_eof -- Outputs the eof statistics. *

* In this case, the comment/code ratios. *

********************************************************/

static void cc_eof(void) {

printf("Number of blank lines .................%d\n",

counters[0]);

printf("Number of comment only lines ..........%d\n",

counters[1]);

printf("Number of code only lines .............%d\n",

counters[2]);

printf("Number of lines with code and comments %d\n",

counters[3]);

printf("Comment to code ratio %3.1f%%\n",

(float)(counters[1] + counters[3]) /

(float)(counters[2] + counters[3]) * 100.0);

}

/********************************************************

* do_file -- Processes a single file. *

* *

* Parameters *

* name -- The name of the file to process. *

********************************************************/

static void do_file(const char *const name)

{

enum TOKEN_TYPE cur_token; /* Current token type */

/*

* Initialize the counters

*/

lc_init();

pc_init();

bc_init();

cc_init();

if (in_open(name) != 0) {

printf("Error: Could not open file %s for reading\n", name);

return;

}

while (1) {

cur_token = next_token();

lc_take_token(cur_token);

pc_take_token(cur_token);

bc_take_token(cur_token);

cc_take_token(cur_token);

switch (cur_token) {

case T_NEWLINE:

lc_line_start();

pc_line_start();

bc_line_start();

cc_line_start();

in_flush();

break;

case T_EOF:

lc_eof();

pc_eof();

bc_eof();

cc_eof();

in_close();

return;

default:

/* Do nothing */

break;

}

}

}

int main(int argc, char *argv[])

{

char *prog_name = argv[0]; /* Name of the program */

if (argc == 1) {

printf("Usage is %s [options] <file-list>\n", prog_name);

exit (8);

}

for (/* argc set */; argc > 1; --argc) {

do_file(argv[1]);

++argv;

}

return (0);

}

UNIX Makefile for CC (Generic Unix)

# File: stat/makefile.unx

#

# Makefile for the UNIX standard cc compiler

#

CC=cc

CFLAGS=-g

OBJS= stat.o ch_type.o token.o in_file.o

all: stat.out stat test.out

test.out: test.c stat

stat test.c >test.out

# This generates a test output based on another example

# in this book.

stat.out: stat

stat ../calc3/calc3.c >stat.out

stat: $(OBJS)

$(CC) $(CFLAGS) -o stat $(OBJS)

stat.o: stat.c token.h

$(CC) $(CFLAGS) -c stat.c

ch_type.o: ch_type.c ch_type.h

$(CC) $(CFLAGS) -c ch_type.c

token.o: token.c token.h ch_type.h in_file.h

$(CC) $(CFLAGS) -c token.c

in_file.o: in_file.c in_file.h

$(CC) $(CFLAGS) -c in_file.c

clean:

rm -f stat stat.o ch_type.o token.o in_file.o

UNIX Makefile for gcc

# File: stat/makefile.gcc

#

# Makefile for the Free Software Foundations g++ compiler

#

CC=gcc

CFLAGS=-g -Wall -D__USE_FIXED_PROTOTYPES__

OBJS= stat.o ch_type.o token.o in_file.o

all: stat.out stat test.out

test.out: test.c stat

stat test.c >test.out

# This generates a test output based on another example

# in this book.

stat.out: stat

stat ../calc3/calc3.c >stat.out

stat: $(OBJS)

$(CC) $(CFLAGS) -o stat $(OBJS)

stat.o: stat.c token.h

$(CC) $(CFLAGS) -c stat.c

ch_type.o: ch_type.c ch_type.h

$(CC) $(CFLAGS) -c ch_type.c

token.o: token.c token.h ch_type.h in_file.h

$(CC) $(CFLAGS) -c token.c

in_file.o: in_file.c in_file.h

$(CC) $(CFLAGS) -c in_file.c

clean:

rm -f stat stat.o ch_type.o token.o in_file.o

Turbo C++ Makefile

# File: stat/makefile.tcc

#

# Makefile for Borland's Borland-C++ compiler

#

CC=tcc

#

# Flags

# -N -- Check for stack overflow.

# -v -- Enable debugging.

# -w -- Turn on all warnings.

# -ml -- Large model.

#

CFLAGS=-N -v -w -ml

OBJS= stat.obj ch_type.obj token.obj in_file.obj

all: stat.out stat.exe test.out

test.out: test.c stat.exe

stat test.c >test.out

# This generates a test output based on another example

# in this book.

stat.out: stat.exe

stat ..\calc3\calc3.c >stat.out

stat.exe: $(OBJS)

$(CC) $(CFLAGS) -estat $(OBJS)

stat.obj: stat.c token.h

$(CC) $(CFLAGS) -c stat.c

in_file.obj: in_file.c in_file.h

$(CC) $(CFLAGS) -c in_file.c

ch_type.obj: ch_type.c ch_type.h

' $(CC) $(CFLAGS) -c ch_type.c

token.obj: token.c token.h ch_type.h

$(CC) $(CFLAGS) -c token.c

clean:

erase stat.exe

erase stat.obj

erase ch_type.obj

erase in_file.obj

erase token.obj

Borland C++ Makefile

# File: stat/makefile.bcc

#

# Makefile for Borland's Borland C++ compiler

#

CC=bcc

#

# Flags

# -N -- Check for stack overflow.

# -v -- Enable debugging.

# -w -- Turn on all warnings.

# -ml -- Large model.

#

CFLAGS=-N -v -w -ml

OBJS= stat.obj ch_type.obj token.obj in_file.obj

all: stat.out stat.exe test.out

test.out: test.c stat.exe

stat test.c >test.out

# This generates a test output based on another example

# in this book.

stat.out: stat.exe

stat ..\calc3\calc3.c >stat.out

stat.exe: $(OBJS)

$(CC) $(CFLAGS) -estat $(OBJS)

stat.obj: stat.c token.h

$(CC) $(CFLAGS) -c stat.c

in_file.obj: in_file.c in_file.h

$(CC) $(CFLAGS) -c in_file.c

ch_type.obj: ch_type.c ch_type.h

$(CC) $(CFLAGS) -c ch_type.c

token.obj: token.c token.h ch_type.h

$(CC) $(CFLAGS) -c token.c

clean:

erase stat.exe

erase stat.obj

erase ch_type.obj

erase in_file.obj

erase token.obj

Microsoft Visual C++ Makefile

# File: stat/makefile.msc

#

# Makefile for Microsoft Visual C++

#

CC=cl

#

# Flags

# AL -- Compile for large model.

# Zi -- Enable debugging.

# W1 -- Turn on warnings.

#

CFLAGS=/AL /Zi /W1

OBJS= stat.obj ch_type.obj token.obj in_file.obj

all: stat.out stat.exe test.out

test.out: test.c stat.exe

stat test.c >test.out

# This generates a test output based on another example

# in this book.

stat.out: stat.exe

stat ..\calc3\calc3.c >stat.out

stat.exe: $(OBJS)

$(CC) $(CCFLAGS) $(OBJS)

stat.obj: stat.c token.h

$(CC) $(CCFLAGS) -c stat.c

ch_type.obj: ch_type.c ch_type.h

$(CC) $(CCFLAGS) -c ch_type.c

token.obj: token.c token.h ch_type.h

$(CC) $(CCFLAGS) -c token.c

in_file.obj: in_file.c

$(CC) $(CCFLAGS) -c in_file.c

clean:

erase stat.exe

erase stat.obj

erase ch_type.obj

erase token.obj

erase in_file.obj

Programming Exercises

Exercise 22-1: Write a program that checks a text file for doubled words (for example, “in the the file”).

Exercise 22-2: Write a program that removes four-letter words from a file and replaces them with more acceptable equivalents.

Exercise 22-3: Write a mailing list program. This program will read, write, sort, and print mailing labels.

Exercise 22-4: Update the statistics program presented in this chapter, adding a cross-reference capability.

Exercise 22-5: Write a program that takes a text file and splits long lines into two smaller lines. The split point should be at the end of a sentence if possible, or at the end of a word if a sentence is too long.