Hacking: The Art of Exploitation (2008)

Chapter 0x200. PROGRAMMING

Hacker is a term for both those who write code and those who exploit it. Even though these two groups of hackers have different end goals, both groups use similar problem-solving techniques. Since an understanding of programming helps those who exploit, and an understanding of exploitation helps those who program, many hackers do both. There are interesting hacks found in both the techniques used to write elegant code and the techniques used to exploit programs. Hacking is really just the act of finding a clever and counterintuitive solution to a problem.

The hacks found in program exploits usually use the rules of the computer to bypass security in ways never intended. Programming hacks are similar in that they also use the rules of the computer in new and inventive ways, but the final goal is efficiency or smaller source code, not necessarily a security compromise. There are actually an infinite number of programs that can be written to accomplish any given task, but most of these solutions are unnecessarily large, complex, and sloppy. The few solutions that remain are small, efficient, and neat. Programs that have these qualities are said to have elegance, and the clever and inventive solutions that tend to lead to this efficiency are called hacks. Hackers on both sides of programming appreciate both the beauty of elegant code and the ingenuity of clever hacks.

In the business world, more importance is placed on churning out functional code than on achieving clever hacks and elegance. Because of the tremendous exponential growth of computational power and memory, spending an extra five hours to create a slightly faster and more memory efficient piece of code just doesn't make business sense when dealing with modern computers that have gigahertz of processing cycles and gigabytes of memory. While time and memory optimizations go without notice by all but the most sophisticated of users, a new feature is marketable. When the bottom line is money, spending time on clever hacks for optimization just doesn't make sense.

True appreciation of programming elegance is left for the hackers: computer hobbyists whose end goal isn't to make a profit but to squeeze every possible bit of functionality out of their old Commodore 64s, exploit writers who need to write tiny and amazing pieces of code to slip through narrow security cracks, and anyone else who appreciates the pursuit and the challenge of finding the best possible solution. These are the people who get excited about programming and really appreciate the beauty of an elegant piece of code or the ingenuity of a clever hack. Since an understanding of programming is a prerequisite to understanding how programs can be exploited, programming is a natural starting point.

What Is Programming?

Programming is a very natural and intuitive concept. A program is nothing more than a series of statements written in a specific language. Programs are everywhere, and even the technophobes of the world use programs every day. Driving directions, cooking recipes, football plays, and DNA are all types of programs. A typical program for driving directions might look something like this:

Start out down Main Street headed east. Continue on Main Street until you see
a church on your right. If the street is blocked because of construction, turn
right there at 15th Street, turn left on Pine Street, and then turn right on
16th Street. Otherwise, you can just continue and make a right on 16th Street.
Continue on 16th Street, and turn left onto Destination Road. Drive straight
down Destination Road for 5 miles, and then you'll see the house on the right.
The address is 743 Destination Road.

Anyone who knows English can understand and follow these driving directions, since they're written in English. Granted, they're not eloquent, but each instruction is clear and easy to understand, at least for someone who reads English.

But a computer doesn't natively understand English; it only understands machine language. To instruct a computer to do something, the instructions must be written in its language. However, machine language is arcane and difficult to work with—it consists of raw bits and bytes, and it differs from architecture to architecture. To write a program in machine language for an Intel x86 processor, you would have to figure out the value associated with each instruction, how each instruction interacts, and myriad low-level details. Programming like this is painstaking and cumbersome, and it is certainly not intuitive.

What's needed to overcome the complication of writing machine language is a translator. An assembler is one form of machine-language translator—it is a program that translates assembly language into machine-readable code. Assembly language is less cryptic than machine language, since it uses names for the different instructions and variables, instead of just using numbers. However, assembly language is still far from intuitive. The instruction names are very esoteric, and the language is architecture specific. Just as machine language for Intel x86 processors is different from machine language for Sparc processors, x86 assembly language is different from Sparc assembly language. Any program written using assembly language for one processor's architecture will not work on another processor's architecture. If a program is written in x86 assembly language, it must be rewritten to run on Sparc architecture. In addition, in order to write an effective program in assembly language, you must still know many low-level details of the processor architecture you are writing for.

These problems can be mitigated by yet another form of translator called a compiler. A compiler converts a high-level language into machine language. High-level languages are much more intuitive than assembly language and can be converted into many different types of machine language for different processor architectures. This means that if a program is written in a high-level language, the program only needs to be written once; the same piece of program code can be compiled into machine language for various specific architectures. C, C++, and Fortran are all examples of high-level languages. A program written in a high-level language is much more readable and English-like than assembly language or machine language, but it still must follow very strict rules about how the instructions are worded, or the compiler won't be able to understand it.

Pseudo-code

Programmers have yet another form of programming language called pseudo-code. Pseudo-code is simply English arranged with a general structure similar to a high-level language. It isn't understood by compilers, assemblers, or any computers, but it is a useful way for a programmer to arrange instructions. Pseudo-code isn't well defined; in fact, most people write pseudo-code slightly differently. It's sort of the nebulous missing link between English and high-level programming languages like C. Pseudo-code makes for an excellent introduction to common universal programming concepts.

Control Structures

Without control structures, a program would just be a series of instructions executed in sequential order. This is fine for very simple programs, but most programs, like the driving directions example, aren't that simple. The driving directions included statements like "Continue on Main Street until you see a church on your right" and "If the street is blocked because of construction…". These statements are known as control structures, and they change the flow of the program's execution from a simple sequential order to a more complex and more useful flow.

If-Then-Else

In the case of our driving directions, Main Street could be under construction. If it is, a special set of instructions needs to address that situation. Otherwise, the original set of instructions should be followed. These types of special cases can be accounted for in a program with one of the most natural control structures: the if-then-else structure. In general, it looks something like this:

If (condition) then
{
    Set of instructions to execute if the condition is met;
}
Else
{
    Set of instructions to execute if the condition is not met;
}

For this book, a C-like pseudo-code will be used, so every instruction will end with a semicolon, and the sets of instructions will be grouped with curly braces and indentation. The if-then-else pseudo-code structure of the preceding driving directions might look something like this:

Drive down Main Street;
If (street is blocked)
{
    Turn right on 15th Street;
    Turn left on Pine Street;
    Turn right on 16th Street;
}
Else
{
    Turn right on 16th Street;
}

Each instruction is on its own line, and the various sets of conditional instructions are grouped between curly braces and indented for readability. In C and many other programming languages, the then keyword is implied and therefore left out, so it has also been omitted in the preceding pseudo-code.

Of course, other languages require the then keyword in their syntax (BASIC, Fortran, and even Pascal, for example). These types of syntactical differences in programming languages are only skin deep; the underlying structure is still the same. Once a programmer understands the concepts these languages are trying to convey, learning the various syntactical variations is fairly trivial. Since C will be used in the later sections, the pseudo-code used in this book will follow a C-like syntax, but remember that pseudo-code can take on many forms.

Another common rule of C-like syntax is that when a set of instructions bounded by curly braces consists of just one instruction, the curly braces are optional. For the sake of readability, it's still a good idea to indent these instructions, but it's not syntactically necessary. The driving directions from before can be rewritten following this rule to produce an equivalent piece of pseudo-code:

Drive down Main Street;
If (street is blocked)
{
    Turn right on 15th Street;
    Turn left on Pine Street;
    Turn right on 16th Street;
}
Else
    Turn right on 16th Street;

This rule about sets of instructions holds true for all of the control structures mentioned in this book, and the rule itself can be described in pseudo-code.

If (there is only one instruction in a set of instructions)
    The use of curly braces to group the instructions is optional;
Else
{
    The use of curly braces is necessary;
    Since there must be a logical way to group these instructions;
}

Even the description of a syntax itself can be thought of as a simple program. There are variations of if-then-else, such as select/case statements, but the logic is still basically the same: If this happens do these things, otherwise do these other things (which could consist of even more if-then statements).

While/Until Loops

Another elementary programming concept is the while control structure, which is a type of loop. A programmer will often want to execute a set of instructions more than once. A program can accomplish this task through looping, but it requires a set of conditions that tells it when to stop looping, lest it continue into infinity. A while loop says to execute the following set of instructions in a loop while a condition is true. A simple program for a hungry mouse could look something like this:

While (you are hungry)
{
    Find some food;
    Eat the food;
}

The set of two instructions following the while statement will be repeated while the mouse is still hungry. The amount of food the mouse finds each time could range from a tiny crumb to an entire loaf of bread. Similarly, the number of times the set of instructions in the while statement is executed changes depending on how much food the mouse finds.

Another variation on the while loop is an until loop, a syntax that is available in the programming language Perl (C doesn't use this syntax). An until loop is simply a while loop with the conditional statement inverted. The same mouse program using an until loop would be:

Until (you are not hungry)
{
    Find some food;
    Eat the food;
}

Logically, any until-like statement can be converted into a while loop. The driving directions from before contained the statement "Continue on Main Street until you see a church on your right." This can easily be changed into a standard while loop by simply inverting the condition.

While (there is not a church on the right)
    Drive down Main Street;

For Loops

Another looping control structure is the for loop. This is generally used when a programmer wants to loop for a certain number of iterations. The driving direction "Drive straight down Destination Road for 5 miles" could be converted to a for loop that looks something like this:

For (5 iterations)
    Drive straight for 1 mile;

In reality, a for loop is just a while loop with a counter. The same statement can be written as such:

Set the counter to 0;
While (the counter is less than 5)
{
    Drive straight for 1 mile;
    Add 1 to the counter;
}

The C-like pseudo-code syntax of a for loop makes this even more apparent:

For (i=0; i<5; i++)
    Drive straight for 1 mile;

In this case, the counter is called i, and the for statement is broken up into three sections, separated by semicolons. The first section initializes the counter, setting it to its starting value, in this case 0. The second section is like a while statement using the counter: "While the counter meets this condition, keep looping." The third and final section describes what action should be taken on the counter during each iteration. In this case, i++ is a shorthand way of saying, "Add 1 to the counter called i."

Using all of the control structures, the driving directions from "What Is Programming?" can be converted into a C-like pseudo-code that looks something like this:

Begin going East on Main Street;
While (there is not a church on the right)
    Drive down Main Street;
If (street is blocked)
{
    Turn right on 15th Street;
    Turn left on Pine Street;
    Turn right on 16th Street;
}
Else
    Turn right on 16th Street;
Turn left on Destination Road;
For (i=0; i<5; i++)
    Drive straight for 1 mile;
Stop at 743 Destination Road;

More Fundamental Programming Concepts

In the following sections, more universal programming concepts will be introduced. These concepts are used in many programming languages, with a few syntactical differences. As I introduce these concepts, I will integrate them into pseudo-code examples using C-like syntax. By the end, the pseudo-code should look very similar to C code.

Variables

The counter used in the for loop is actually a type of variable. A variable can simply be thought of as an object that holds data that can be changed, hence the name. There are also variables that don't change, which are aptly called constants. Returning to the driving example, the speed of the car would be a variable, while the color of the car would be a constant. In pseudo-code, variables are simple abstract concepts, but in C (and in many other languages), variables must be declared and given a type before they can be used. This is because a C program will eventually be compiled into an executable program. Like a cooking recipe that lists all the required ingredients before giving the instructions, variable declarations allow you to make preparations before getting into the meat of the program. Ultimately, all variables are stored in memory somewhere, and their declarations allow the compiler to organize this memory more efficiently. In the end though, despite all of the variable type declarations, everything is all just memory.

In C, each variable is given a type that describes the information that is meant to be stored in that variable. Some of the most common types are int (integer values), float (decimal floating-point values), and char (single character values). Variables are declared simply by using these keywords before listing the variables, as you can see below.

int a, b;
float k;
char z;

The variables a and b are now defined as integers, k can accept floating point values (such as 3.14), and z is expected to hold a character value, like A or w. Variables can be assigned values when they are declared or anytime afterward, using the = operator.

int a = 13, b;
float k;
char z = 'A';

k = 3.14;
z = 'w';
b = a + 5;

After the preceding instructions are executed, the variable a will contain the value of 13, k will contain the number 3.14, z will contain the character w, and b will contain the value 18, since 13 plus 5 equals 18. Variables are simply a way to remember values; however, with C, you must first declare each variable's type.

Arithmetic Operators

The statement b = a + 5 is an example of a very simple arithmetic operator. In C, the following symbols are used for various arithmetic operations.

Operation	Symbol	Example
Addition	+	b = a + 5
Subtraction	-	b = a - 5
Multiplication	*	b = a * 5
Division	/	b = a / 5
Modulo reduction	%	b = a % 5

The first four operations should look familiar. Modulo reduction may seem like a new concept, but it's really just taking the remainder after division. If a is 13, then 13 divided by 5 equals 2, with a remainder of 3, which means that a % 5 = 3. Also, since the variables a and b are integers, the statement b = a / 5 will result in the value of 2 being stored in b, since that's the integer portion of the quotient. Floating-point variables must be used to retain the more correct answer of 2.6.

To get a program to use these concepts, you must speak its language. The C language also provides several forms of shorthand for these arithmetic operations. One of these was mentioned earlier and is used commonly in for loops.

Full Expression	Shorthand	Explanation
i = i + 1	i++ or ++i	Add 1 to the variable.
i = i - 1	i-- or --i	Subtract 1 from the variable.

These shorthand expressions can be combined with other arithmetic operations to produce more complex expressions. This is where the difference between i++ and ++i becomes apparent. The first expression means "Increment the value of i by 1 after evaluating the arithmetic operation," while the second expression means "Increment the value of i by 1 before evaluating the arithmetic operation." The following example will help clarify.

int a, b;
a = 5;
b = a++ * 6;

At the end of this set of instructions, b will contain 30 and a will contain 6, since the shorthand of b = a++ * 6; is equivalent to the following statements:

b = a * 6;
a = a + 1;

However, if the instruction b = ++a * 6; is used, the order of the addition to a changes, resulting in the following equivalent instructions:

a = a + 1;
b = a * 6;

Since the order has changed, in this case b will contain 36, and a will still contain 6.

Quite often in programs, variables need to be modified in place. For example, you might need to add an arbitrary value like 12 to a variable, and store the result right back in that variable (for example, i = i + 12). This happens commonly enough that shorthand also exists for it.

Full Expression	Shorthand	Explanation
i = i + 12	i+=12	Add some value to the variable.
i = i - 12	i-=12	Subtract some value from the variable.
i = i * 12	i*=12	Multiply the variable by some value.
i = i / 12	i/=12	Divide the variable by some value.

Comparison Operators

Variables are frequently used in the conditional statements of the previously explained control structures. These conditional statements are based on some sort of comparison. In C, these comparison operators use a shorthand syntax that is fairly common across many programming languages.

Condition	Symbol	Example
Less than	<	(a < b)
Greater than	>	(a > b)
Less than or equal to	<=	(a <= b)
Greater than or equal to	>=	(a >= b)
Equal to	==	(a == b)
Not equal to	!=	(a != b)

Most of these operators are self-explanatory; however, notice that the shorthand for equal to uses double equal signs. This is an important distinction, since the double equal sign is used to test equivalence, while the single equal sign is used to assign a value to a variable. The statement a = 7 means "Put the value 7 in the variable a," while a == 7 means "Check to see whether the variable a is equal to 7." (Some programming languages like Pascal actually use := for variable assignment to eliminate visual confusion.) Also, notice that an exclamation point generally means not. This symbol can be used by itself to invert any expression.

!(a < b) is equivalent to (a >= b)

These comparison operators can also be chained together using shorthand for OR and AND.

Logic	Symbol	Example
OR	||	((a < b) || (a < c))
AND	&&	((a < b) && !(a < c))

The example statement consisting of the two smaller conditions joined with OR logic will fire true if a is less than b, OR if a is less than c. Similarly, the example statement consisting of two smaller comparisons joined with AND logic will fire true if a is less than b AND a is not less than c. These statements should be grouped with parentheses and can contain many different variations.

Many things can be boiled down to variables, comparison operators, and control structures. Returning to the example of the mouse searching for food, hunger can be translated into a Boolean true/false variable. Naturally, 1 means true and 0 means false.

While (hungry == 1)
{
    Find some food;
    Eat the food;
}

Here's another shorthand used by programmers and hackers quite often. C has traditionally had no dedicated Boolean type, so any nonzero value is considered true, and a statement is considered false if it evaluates to 0. In fact, the comparison operators will actually return a value of 1 if the comparison is true and a value of 0 if it is false. Checking to see whether the variable hungry is equal to 1 will return 1 if hungry equals 1 and 0 if hungry equals 0. Since the program only uses these two cases, the comparison operator can be dropped altogether.

While (hungry)
{
    Find some food;
    Eat the food;
}

A smarter mouse program with more inputs demonstrates how comparison operators can be combined with variables.

While ((hungry) && !(cat_present))
{
    Find some food;
    If(!(food_is_on_a_mousetrap))
        Eat the food;
}

This example assumes there are also variables that describe the presence of a cat and the location of the food, with a value of 1 for true and 0 for false. Just remember that any nonzero value is considered true, and the value of 0 is considered false.

Functions

Sometimes there will be a set of instructions the programmer knows he will need several times. These instructions can be grouped into a smaller subprogram called a function. In other languages, functions are known as subroutines or procedures. For example, the action of turning a car actually consists of many smaller instructions: Turn on the appropriate blinker, slow down, check for oncoming traffic, turn the steering wheel in the appropriate direction, and so on. The driving directions from the beginning of this chapter require quite a few turns; however, listing every little instruction for every turn would be tedious (and less readable). You can pass variables as arguments to a function in order to modify the way the function operates. In this case, the function is passed the direction of the turn.

Function Turn(variable_direction)
{
    Activate the variable_direction blinker;
    Slow down;
    Check for oncoming traffic;
    while(there is oncoming traffic)
    {
        Stop;
        Watch for oncoming traffic;
    }
    Turn the steering wheel to the variable_direction;
    while(turn is not complete)
    {
        if(speed < 5 mph)
            Accelerate;
    }
    Turn the steering wheel back to the original position;
    Turn off the variable_direction blinker;
}

This function describes all the instructions needed to make a turn. When a program that knows about this function needs to turn, it can just call this function. When the function is called, the instructions found within it are executed with the arguments passed to it; afterward, execution returns to where it was in the program, after the function call. Either left or right can be passed into this function, which causes the function to turn in that direction.

By default in C, functions can return a value to a caller. For those familiar with functions in mathematics, this makes perfect sense. Imagine a function that calculates the factorial of a number—naturally, it returns the result.

In C, functions aren't labeled with a "function" keyword; instead, they are declared by the data type of the variable they are returning. This format looks very similar to variable declaration. If a function is meant to return an integer (perhaps a function that calculates the factorial of some number x), the function could look like this:

int factorial(int x)
{
    int i, result = 1;

    for(i=1; i <= x; i++)   // Multiply result by every value from 1 to x.
        result *= i;
    return result;
}

This function is declared as an integer because it multiplies every value from 1 to x and returns the result, which is an integer. A separate variable, result, accumulates the product, so the loop's upper bound x never changes while the loop is running. The return statement at the end of the function passes back the contents of the variable result and ends the function. This factorial function can then be used like an integer variable in the main part of any program that knows about it.

int a=5, b;
b = factorial(a);

At the end of this short program, the variable b will contain 120, since the factorial function will be called with the argument of 5 and will return 120.

Also in C, the compiler must "know" about functions before it can use them. This can be done by simply writing the entire function before using it later in the program or by using function prototypes. A function prototype is simply a way to tell the compiler to expect a function with this name, this return data type, and these data types as its functional arguments. The actual function can be located near the end of the program, but it can be used anywhere else, since the compiler already knows about it. An example of a function prototype for the factorial() function would look something like this:

int factorial(int);

Usually, function prototypes are located near the beginning of a program. There's no need to actually define any variable names in the prototype, since this is done in the actual function. The only thing the compiler cares about is the function's name, its return data type, and the data types of its functional arguments.

If a function doesn't have any value to return, it should be declared as void, as is the case with the turn() function I used as an example earlier. However, the turn() function doesn't yet capture all the functionality that our driving directions need. Every turn in the directions has both a direction and a street name. This means that a turning function should have two variables: the direction to turn and the street to turn onto. This complicates the function of turning, since the proper street must be located before the turn can be made. A more complete turning function using proper C-like syntax is listed below in pseudo-code.

void turn(variable_direction, target_street_name)
{
    Look for a street sign;
    current_intersection_name = read street sign name;
    while(current_intersection_name != target_street_name)
    {
        Look for another street sign;
        current_intersection_name = read street sign name;
    }

    Activate the variable_direction blinker;
    Slow down;
    Check for oncoming traffic;
    while(there is oncoming traffic)
    {
        Stop;
        Watch for oncoming traffic;
    }
    Turn the steering wheel to the variable_direction;
    while(turn is not complete)
    {
        if(speed < 5 mph)
            Accelerate;
    }
    Turn the steering wheel right back to the original position;
    Turn off the variable_direction blinker;
}

This function includes a section that searches for the proper intersection by looking for street signs, reading the name on each street sign, and storing that name in a variable called current_intersection_name. It will continue to look for and read street signs until the target street is found; at that point, the remaining turning instructions will be executed. The pseudo-code driving instructions can now be changed to use this turning function.

Begin going East on Main Street;
while (there is not a church on the right)
    Drive down Main Street;
if (street is blocked)
{
    turn(right, 15th Street);
    turn(left, Pine Street);
    turn(right, 16th Street);
}
else
    turn(right, 16th Street);
turn(left, Destination Road);
for (i=0; i<5; i++)
    Drive straight for 1 mile;
Stop at 743 Destination Road;

Functions aren't commonly used in pseudo-code, since pseudo-code is mostly used as a way for programmers to sketch out program concepts before writing compilable code. Since pseudo-code doesn't actually have to work, full functions don't need to be written out; simply jotting down "Do some complex stuff here" will suffice. But in a programming language like C, functions are used heavily. Most of the real usefulness of C comes from collections of existing functions called libraries.

Getting Your Hands Dirty

Now that the syntax of C feels more familiar and some fundamental programming concepts have been explained, actually programming in C isn't that big of a step. C compilers exist for just about every operating system and processor architecture out there, but for this book, Linux and an x86-based processor will be used exclusively. Linux is a free operating system that everyone has access to, and x86-based processors are the most popular consumer-grade processors on the planet. Since hacking is really about experimenting, it's probably best if you have a C compiler to follow along with.

Included with this book is a Live CD you can use to follow along if your computer has an x86 processor. Just put the CD in the drive and reboot your computer. It will boot into a Linux environment without modifying your existing operating system. From this Linux environment you can follow along with the book and experiment on your own.

Let's get right to it. The firstprog.c program is a simple piece of C code that will print "Hello, world!" 10 times.

firstprog.c

#include <stdio.h>

int main()
{
    int i;
    for(i=0; i < 10; i++)           // Loop 10 times.
    {
        printf("Hello, world!\n");  // put the string to the output.
    }
    return 0;                       // Tell OS the program exited without errors.
}

The main execution of a C program begins in the aptly named main() function. Any text following two forward slashes (//) is a comment, which is ignored by the compiler.

The first line may be confusing, but it's just C syntax that tells the compiler to include headers for a standard input/output (I/O) library named stdio. This header file is added to the program when it is compiled. It is located at /usr/include/stdio.h, and it defines several constants and function prototypes for corresponding functions in the standard I/O library. Since the main() function uses the printf() function from the standard I/O library, a function prototype is needed for printf() before it can be used. This function prototype (along with many others) is included in the stdio.h header file. A lot of the power of C comes from its extensibility and libraries. The rest of the code should make sense and look a lot like the pseudo-code from before. You may have even noticed that there's a set of curly braces that can be eliminated. It should be fairly obvious what this program will do, but let's compile it using GCC and run it just to make sure.

The GNU Compiler Collection (GCC) is a free C compiler that translates C into machine language that a processor can understand. The outputted translation is an executable binary file, which is called a.out by default. Does the compiled program do what you thought it would?

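reader@hacking:~/booksrc $ gcc firstprog.c
reader@hacking:~/booksrc $ ls -l a.out
-rwxr-xr-x 1 reader reader 6621 2007-09-06 22:16 a.out
reader@hacking:~/booksrc $ ./a.out
Hello, world!
Hello, world!
Hello, world!
Hello, world!
Hello, world!
Hello, world!
Hello, world!
Hello, world!
Hello, world!
Hello, world!
reader@hacking:~/booksrc $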

The Bigger Picture

Okay, this has all been stuff you would learn in an elementary programming class—basic, but essential. Most introductory programming classes just teach how to read and write C. Don't get me wrong, being fluent in C is very useful and is enough to make you a decent programmer, but it's only a piece of the bigger picture. Most programmers learn the language from the top down and never see the big picture. Hackers get their edge from knowing how all the pieces interact within this bigger picture. To see the bigger picture in the realm of programming, simply realize that C code is meant to be compiled. The code can't actually do anything until it's compiled into an executable binary file. Thinking of C-source as a program is a common misconception that is exploited by hackers every day. The binary a.out's instructions are written in machine language, an elementary language the CPU can understand. Compilers are designed to translate the language of C code into machine language for a variety of processor architectures. In this case, the processor is in a family that uses the x86 architecture. There are also Sparc processor architectures (used in Sun Workstations) and the PowerPC processor architecture (used in pre-Intel Macs). Each architecture has a different machine language, so the compiler acts as a middle ground—translating C code into machine language for the target architecture.

As long as the compiled program works, the average programmer is only concerned with source code. But a hacker realizes that the compiled program is what actually gets executed out in the real world. With a better understanding of how the CPU operates, a hacker can manipulate the programs that run on it. We have seen the source code for our first program and compiled it into an executable binary for the x86 architecture. But what does this executable binary look like? The GNU development tools include a program called objdump, which can be used to examine compiled binaries. Let's start by looking at the machine code the main() function was translated into.

reader@hacking:~/booksrc $ objdump -D a.out | grep -A20 main.:

08048374 <main>:

8048374: 55 push %ebp

8048375: 89 e5 mov %esp,%ebp

8048377: 83 ec 08 sub $0x8,%esp

804837a: 83 e4 f0 and $0xfffffff0,%esp

804837d: b8 00 00 00 00 mov $0x0,%eax

8048382: 29 c4 sub %eax,%esp

8048384: c7 45 fc 00 00 00 00 movl $0x0,0xfffffffc(%ebp)

804838b: 83 7d fc 09 cmpl $0x9,0xfffffffc(%ebp)

804838f: 7e 02 jle 8048393 <main+0x1f>

8048391: eb 13 jmp 80483a6 <main+0x32>

8048393: c7 04 24 84 84 04 08 movl $0x8048484,(%esp)

804839a: e8 01 ff ff ff call 80482a0 <printf@plt>

804839f: 8d 45 fc lea 0xfffffffc(%ebp),%eax

80483a2: ff 00 incl (%eax)

80483a4: eb e5 jmp 804838b <main+0x17>

80483a6: c9 leave

80483a7: c3 ret

80483a8: 90 nop

80483a9: 90 nop

80483aa: 90 nop

reader@hacking:~/booksrc $

The objdump program will spit out far too many lines of output to sensibly examine, so the output is piped into grep with the command-line option to only display 20 lines after the regular expression main.:. Each byte is represented in hexadecimal notation, which is a base-16 numbering system. The numbering system you are most familiar with is base-10: it has ten digit symbols, so at 10 you need to add an extra digit. Hexadecimal uses 0 through 9 to represent the values 0 through 9, but it also uses A through F to represent the values 10 through 15. This is a convenient notation since a byte contains 8 bits, each of which can be either true or false. This means a byte has 256 (2⁸) possible values, so each byte can be described with 2 hexadecimal digits.

The hexadecimal numbers—starting with 0x8048374 on the far left—are memory addresses. The bits of the machine language instructions must be put somewhere, and this somewhere is called memory. Memory is just a collection of bytes of temporary storage space that are numbered with addresses.

Like a row of houses on a local street, each with its own address, memory can be thought of as a row of bytes, each with its own memory address. Each byte of memory can be accessed by its address, and in this case the CPU accesses this part of memory to retrieve the machine language instructions that make up the compiled program. Older Intel x86 processors use a 32-bit addressing scheme, while newer ones use a 64-bit one. The 32-bit processors have 2³² (or 4,294,967,296) possible addresses, while the 64-bit ones have 2⁶⁴ (1.84467441 x 10¹⁹) possible addresses. The 64-bit processors can run in 32-bit compatibility mode, which allows them to run 32-bit code quickly.

The hexadecimal bytes in the middle of the listing above are the machine language instructions for the x86 processor. Of course, these hexadecimal values are only representations of the bytes of binary 1s and 0s the CPU can understand. But since 0101010110001001111001011000001111101100111100001 … isn't very useful to anything other than the processor, the machine code is displayed as hexadecimal bytes and each instruction is put on its own line, like splitting a paragraph into sentences.

Come to think of it, the hexadecimal bytes really aren't very useful themselves, either. That's where assembly language comes in. The instructions on the far right are in assembly language. Assembly language is really just a collection of mnemonics for the corresponding machine language instructions. The instruction ret is far easier to remember and make sense of than 0xc3 or 11000011. Unlike C and other compiled languages, assembly language instructions have a direct one-to-one relationship with their corresponding machine language instructions. This means that since every processor architecture has different machine language instructions, each also has a different form of assembly language. Assembly is just a way for programmers to represent the machine language instructions that are given to the processor. Exactly how these machine language instructions are represented is simply a matter of convention and preference. While you can theoretically create your own x86 assembly language syntax, most people stick with one of the two main types: AT&T syntax and Intel syntax. The assembly shown in the first objdump output above is AT&T syntax, as just about all of Linux's disassembly tools use this syntax by default. It's easy to recognize AT&T syntax by the cacophony of % and $ symbols prefixing everything (take a look again at that first listing). The same code can be shown in Intel syntax by providing an additional command-line option, -M intel, to objdump, as shown in the output below.

reader@hacking:~/booksrc $ objdump -M intel -D a.out | grep -A20 main.:

08048374 <main>:

8048374: 55 push ebp

8048375: 89 e5 mov ebp,esp

8048377: 83 ec 08 sub esp,0x8

804837a: 83 e4 f0 and esp,0xfffffff0

804837d: b8 00 00 00 00 mov eax,0x0

8048382: 29 c4 sub esp,eax

8048384: c7 45 fc 00 00 00 00 mov DWORD PTR [ebp-4],0x0

804838b: 83 7d fc 09 cmp DWORD PTR [ebp-4],0x9

804838f: 7e 02 jle 8048393 <main+0x1f>

8048391: eb 13 jmp 80483a6 <main+0x32>

8048393: c7 04 24 84 84 04 08 mov DWORD PTR [esp],0x8048484

804839a: e8 01 ff ff ff call 80482a0 <printf@plt>

804839f: 8d 45 fc lea eax,[ebp-4]

80483a2: ff 00 inc DWORD PTR [eax]

80483a4: eb e5 jmp 804838b <main+0x17>

80483a6: c9 leave

80483a7: c3 ret

80483a8: 90 nop

80483a9: 90 nop

80483aa: 90 nop

reader@hacking:~/booksrc $

Personally, I think Intel syntax is much more readable and easier to understand, so for the purposes of this book, I will try to stick with this syntax. Regardless of the assembly language representation, the commands a processor understands are quite simple. These instructions consist of an operation and sometimes additional arguments that describe the destination and/or the source for the operation. These operations move memory around, perform some sort of basic math, or interrupt the processor to get it to do something else. In the end, that's all a computer processor can really do. But in the same way millions of books have been written using a relatively small alphabet of letters, an infinite number of possible programs can be created using a relatively small collection of machine instructions.

Processors also have their own set of special variables called registers. Most of the instructions use these registers to read or write data, so understanding the registers of a processor is essential to understanding the instructions. The bigger picture keeps getting bigger….

The x86 Processor

The 8086 CPU was the first x86 processor. It was developed and manufactured by Intel, which later developed more advanced processors in the same family: the 80186, 80286, 80386, and 80486. If you remember people talking about 386 and 486 processors in the '80s and '90s, this is what they were referring to.

The x86 processor has several registers, which are like internal variables for the processor. I could just talk abstractly about these registers now, but I think it's always better to see things for yourself. The GNU development tools also include a debugger called GDB. Debuggers are used by programmers to step through compiled programs, examine program memory, and view processor registers. A programmer who has never used a debugger to look at the inner workings of a program is like a seventeenth-century doctor who has never used a microscope. Similar to a microscope, a debugger allows a hacker to observe the microscopic world of machine code—but a debugger is far more powerful than this metaphor allows. Unlike a microscope, a debugger can view the execution from all angles, pause it, and change anything along the way.

Below, GDB is used to show the state of the processor registers right before the program starts.

reader@hacking:~/booksrc $ gdb -q ./a.out

Using host libthread_db library "/lib/tls/i686/cmov/libthread_db.so.1".

(gdb) break main

Breakpoint 1 at 0x804837a

(gdb) run

Starting program: /home/reader/booksrc/a.out

Breakpoint 1, 0x0804837a in main ()

(gdb) info registers

eax 0xbffff894 -1073743724

ecx 0x48e0fe81 1222704769

edx 0x1 1

ebx 0xb7fd6ff4 -1208127500

esp 0xbffff800 0xbffff800

ebp 0xbffff808 0xbffff808

esi 0xb8000ce0 -1207956256

edi 0x0 0

eip 0x804837a 0x804837a <main+6>

eflags 0x286 [ PF SF IF ]

cs 0x73 115

ss 0x7b 123

ds 0x7b 123

es 0x7b 123

fs 0x0 0

gs 0x33 51

(gdb) quit

The program is running. Exit anyway? (y or n) y

reader@hacking:~/booksrc $

A breakpoint is set on the main() function so execution will stop right before our code is executed. Then GDB runs the program, stops at the breakpoint, and is told to display all the processor registers and their current states.

The first four registers (EAX, ECX, EDX, and EBX) are known as general purpose registers. These are called the Accumulator, Counter, Data, and Base registers, respectively. They are used for a variety of purposes, but they mainly act as temporary variables for the CPU when it is executing machine instructions.

The second four registers (ESP, EBP, ESI, and EDI) are also general purpose registers, but they are sometimes known as pointers and indexes. These stand for Stack Pointer, Base Pointer, Source Index, and Destination Index, respectively. The first two registers are called pointers because they store 32-bit addresses, which essentially point to that location in memory. These registers are fairly important to program execution and memory management; we will discuss them more later. The last two registers are also technically pointers, which are commonly used to point to the source and destination when data needs to be read from or written to. There are load and store instructions that use these registers, but for the most part, these registers can be thought of as just simple general-purpose registers.

The EIP register is the Instruction Pointer register, which points to the current instruction the processor is reading. Like a child pointing his finger at each word as he reads, the processor reads each instruction using the EIP register as its finger. Naturally, this register is quite important and will be used a lot while debugging. Currently, it points to a memory address at 0x804837a.

The remaining EFLAGS register actually consists of several bit flags that are used for comparisons. The registers listed after it (CS, SS, DS, ES, FS, and GS) are segment registers: the actual memory is split into several different segments, which will be discussed later, and these registers keep track of that. For the most part, these registers can be ignored since they rarely need to be accessed directly.

Assembly Language

Since we are using Intel syntax assembly language for this book, our tools must be configured to use this syntax. Inside GDB, the disassembly syntax can be set to Intel by simply typing set disassembly intel or set dis intel, for short. You can configure this setting to run every time GDB starts up by putting the command in the file .gdbinit in your home directory.

reader@hacking:~/booksrc $ gdb -q

(gdb) set dis intel

(gdb) quit

reader@hacking:~/booksrc $ echo "set dis intel" > ~/.gdbinit

reader@hacking:~/booksrc $ cat ~/.gdbinit

set dis intel

reader@hacking:~/booksrc $

Now that GDB is configured to use Intel syntax, let's begin understanding it. The assembly instructions in Intel syntax generally follow this style:

operation <destination>, <source>

The destination and source values will either be a register, a memory address, or a value. The operations are usually intuitive mnemonics: The mov operation will move a value from the source to the destination, sub will subtract, inc will increment, and so forth. For example, the instructions below will move the value from ESP to EBP and then subtract 8 from ESP (storing the result in ESP).

8048375: 89 e5 mov ebp,esp

8048377: 83 ec 08 sub esp,0x8

There are also operations that are used to control the flow of execution. The cmp operation is used to compare values, and basically any operation beginning with j is used to jump to a different part of the code (depending on the result of the comparison). The example below first compares a 4-byte value located at EBP minus 4 with the number 9. The next instruction is shorthand for jump if less than or equal to, referring to the result of the previous comparison. If that value is less than or equal to 9, execution jumps to the instruction at 0x8048393. Otherwise, execution flows to the next instruction, which is an unconditional jump to 0x80483a6.

804838b: 83 7d fc 09 cmp DWORD PTR [ebp-4],0x9

804838f: 7e 02 jle 8048393 <main+0x1f>

8048391: eb 13 jmp 80483a6 <main+0x32>

These examples have been from our previous disassembly, and we have our debugger configured to use Intel syntax, so let's use the debugger to step through the first program at the assembly instruction level.

The -g flag can be used by the GCC compiler to include extra debugging information, which will give GDB access to the source code.

reader@hacking:~/booksrc $ gcc -g firstprog.c

reader@hacking:~/booksrc $ ls -l a.out

-rwxr-xr-x 1 matrix users 11977 Jul 4 17:29 a.out

reader@hacking:~/booksrc $ gdb -q ./a.out

Using host libthread_db library "/lib/libthread_db.so.1".

(gdb) list

1 #include <stdio.h>

3 int main()

4 {

5 int i;

6 for(i=0; i < 10; i++)

7 {

8 printf("Hello, world!\n");

9 }

10 }

(gdb) disassemble main

Dump of assembler code for function main():

0x08048384 <main+0>: push ebp

0x08048385 <main+1>: mov ebp,esp

0x08048387 <main+3>: sub esp,0x8

0x0804838a <main+6>: and esp,0xfffffff0

0x0804838d <main+9>: mov eax,0x0

0x08048392 <main+14>: sub esp,eax

0x08048394 <main+16>: mov DWORD PTR [ebp-4],0x0

0x0804839b <main+23>: cmp DWORD PTR [ebp-4],0x9

0x0804839f <main+27>: jle 0x80483a3 <main+31>

0x080483a1 <main+29>: jmp 0x80483b6 <main+50>

0x080483a3 <main+31>: mov DWORD PTR [esp],0x80484d4

0x080483aa <main+38>: call 0x80482a8 <_init+56>

0x080483af <main+43>: lea eax,[ebp-4]

0x080483b2 <main+46>: inc DWORD PTR [eax]

0x080483b4 <main+48>: jmp 0x804839b <main+23>

0x080483b6 <main+50>: leave

0x080483b7 <main+51>: ret

End of assembler dump.

(gdb) break main

Breakpoint 1 at 0x8048394: file firstprog.c, line 6.

(gdb) run

Starting program: /hacking/a.out

Breakpoint 1, main() at firstprog.c:6

6 for(i=0; i < 10; i++)

(gdb) info register eip

eip 0x8048394 0x8048394

(gdb)

First, the source code is listed and the disassembly of the main() function is displayed. Then a breakpoint is set at the start of main(), and the program is run. This breakpoint simply tells the debugger to pause the execution of the program when it gets to that point. Since the breakpoint has been set at the start of the main() function, the program hits the breakpoint and pauses before actually executing any instructions in main(). Then the value of EIP (the Instruction Pointer) is displayed.

Notice that EIP contains a memory address that points to an instruction in the main() function's disassembly (the mov instruction at <main+16>). The instructions before this (<main+0> through <main+14>) are collectively known as the function prologue and are generated by the compiler to set up memory for the rest of the main() function's local variables. Part of the reason variables need to be declared in C is to aid the construction of this section of code. The debugger knows this part of the code is automatically generated and is smart enough to skip over it. We'll talk more about the function prologue later, but for now we can take a cue from GDB and skip it.

The GDB debugger provides a direct method to examine memory, using the command x, which is short for examine. Examining memory is a critical skill for any hacker. Most hacker exploits are a lot like magic tricks—they seem amazing and magical, unless you know about sleight of hand and misdirection. In both magic and hacking, if you were to look in just the right spot, the trick would be obvious. That's one of the reasons a good magician never does the same trick twice. But with a debugger like GDB, every aspect of a program's execution can be deterministically examined, paused, stepped through, and repeated as often as needed. Since a running program is mostly just a processor and segments of memory, examining memory is the first way to look at what's really going on.

The examine command in GDB can be used to look at a certain address of memory in a variety of ways. This command expects two arguments when it's used: the location in memory to examine and how to display that memory.

The display format also uses a single-letter shorthand, which is optionally preceded by a count of how many items to examine. Some common format letters are as follows:

o Display in octal.

x Display in hexadecimal.

u Display in unsigned, standard base-10 decimal.

t Display in binary.

These can be used with the examine command to examine a certain memory address. In the following example, the current address of the EIP register is used. Shorthand commands are often used with GDB, and even info register eip can be shortened to just i r eip.

(gdb) i r eip

eip 0x8048384 0x8048384 <main+16>

(gdb) x/o 0x8048384

0x8048384 <main+16>: 077042707

(gdb) x/x $eip

0x8048384 <main+16>: 0x00fc45c7

(gdb) x/u $eip

0x8048384 <main+16>: 16532935

(gdb) x/t $eip

0x8048384 <main+16>: 00000000111111000100010111000111

(gdb)

The memory the EIP register is pointing to can be examined by using the address stored in EIP. The debugger lets you reference registers directly, so $eip is equivalent to the value EIP contains at that moment. The value 077042707 in octal is the same as 0x00fc45c7 in hexadecimal, which is the same as 16532935 in base-10 decimal, which in turn is the same as 00000000111111000100010111000111 in binary. A number can also be prepended to the format of the examine command to examine multiple units at the target address.

(gdb) x/2x $eip

0x8048384 <main+16>: 0x00fc45c7 0x83000000

(gdb) x/12x $eip

0x8048384 <main+16>: 0x00fc45c7 0x83000000 0x7e09fc7d 0xc713eb02

0x8048394 <main+32>: 0x84842404 0x01e80804 0x8dffffff 0x00fffc45

0x80483a4 <main+48>: 0xc3c9e5eb 0x90909090 0x90909090 0x5de58955

(gdb)

The default size of a single unit is a four-byte unit called a word. The size of the display units for the examine command can be changed by adding a size letter to the end of the format letter. The valid size letters are as follows:

b A single byte

h A halfword, which is two bytes in size

w A word, which is four bytes in size

g A giant, which is eight bytes in size

This is slightly confusing, because sometimes the term word also refers to 2-byte values. In this case a double word or DWORD refers to a 4-byte value. In this book, words and DWORDs both refer to 4-byte values. If I'm talking about a 2-byte value, I'll call it a short or a halfword. The following GDB output shows memory displayed in various sizes.

(gdb) x/8xb $eip

0x8048384 <main+16>: 0xc7 0x45 0xfc 0x00 0x00 0x00 0x00 0x83

(gdb) x/8xh $eip

0x8048384 <main+16>: 0x45c7 0x00fc 0x0000 0x8300 0xfc7d 0x7e09 0xeb02 0xc713

(gdb) x/8xw $eip

0x8048384 <main+16>: 0x00fc45c7 0x83000000 0x7e09fc7d 0xc713eb02

0x8048394 <main+32>: 0x84842404 0x01e80804 0x8dffffff 0x00fffc45

(gdb)

If you look closely, you may notice something odd about the data above. The first examine command shows the first eight bytes, and naturally, the examine commands that use bigger units display more data in total. However, the first examine shows the first two bytes to be 0xc7 and 0x45, but when a halfword is examined at the exact same memory address, the value 0x45c7 is shown, with the bytes reversed. This same byte-reversal effect can be seen when a full four-byte word is shown as 0x00fc45c7, but when the first four bytes are shown byte by byte, they are in the order of 0xc7, 0x45, 0xfc, and 0x00.

This is because on the x86 processor values are stored in little-endian byte order, which means the least significant byte is stored first. For example, if four bytes are to be interpreted as a single value, the bytes must be used in reverse order. The GDB debugger is smart enough to know how values are stored, so when a word or halfword is examined, the bytes must be reversed to display the correct values in hexadecimal. Revisiting these values displayed both as hexadecimal and unsigned decimals might help clear up any confusion.

(gdb) x/4xb $eip

0x8048384 <main+16>: 0xc7 0x45 0xfc 0x00

(gdb) x/4ub $eip

0x8048384 <main+16>: 199 69 252 0

(gdb) x/1xw $eip

0x8048384 <main+16>: 0x00fc45c7

(gdb) x/1uw $eip

0x8048384 <main+16>: 16532935

(gdb) quit

The program is running. Exit anyway? (y or n) y

reader@hacking:~/booksrc $ bc -ql

199*(256^3) + 69*(256^2) + 252*(256^1) + 0*(256^0)

3343252480

0*(256^3) + 252*(256^2) + 69*(256^1) + 199*(256^0)

16532935

quit

reader@hacking:~/booksrc $

The first four bytes are shown both in hexadecimal and standard unsigned decimal notation. A command-line calculator program called bc is used to show that if the bytes are interpreted in the incorrect order, a horribly incorrect value of 3343252480 is the result. The byte order of a given architecture is an important detail to be aware of. While most debugging tools and compilers will take care of the details of byte order automatically, eventually you will directly manipulate memory by yourself.

In addition to converting byte order, GDB can do other conversions with the examine command. We've already seen that GDB can disassemble machine language instructions into human-readable assembly instructions. The examine command also accepts the format letter i, short for instruction, to display the memory as disassembled assembly language instructions.

reader@hacking:~/booksrc $ gdb -q ./a.out

Using host libthread_db library "/lib/tls/i686/cmov/libthread_db.so.1".

(gdb) break main

Breakpoint 1 at 0x8048384: file firstprog.c, line 6.

(gdb) run

Starting program: /home/reader/booksrc/a.out

Breakpoint 1, main () at firstprog.c:6

6 for(i=0; i < 10; i++)

(gdb) i r $eip

eip 0x8048384 0x8048384 <main+16>

(gdb) x/i $eip

0x8048384 <main+16>: mov DWORD PTR [ebp-4],0x0

(gdb) x/3i $eip

0x8048384 <main+16>: mov DWORD PTR [ebp-4],0x0

0x804838b <main+23>: cmp DWORD PTR [ebp-4],0x9

0x804838f <main+27>: jle 0x8048393 <main+31>

(gdb) x/7xb $eip

0x8048384 <main+16>: 0xc7 0x45 0xfc 0x00 0x00 0x00 0x00

(gdb) x/i $eip

0x8048384 <main+16>: mov DWORD PTR [ebp-4],0x0

(gdb)

In the output above, the a.out program is run in GDB, with a breakpoint set at main(). Since the EIP register is pointing to memory that actually contains machine language instructions, they disassemble quite nicely.

The previous objdump disassembly confirms that the seven bytes EIP is pointing to actually are machine language for the corresponding assembly instruction.

8048384: c7 45 fc 00 00 00 00 mov DWORD PTR [ebp-4],0x0

This assembly instruction will move the value of 0 into memory located at the address stored in the EBP register, minus 4. This is where the C variable i is stored in memory; i was declared as an integer that uses 4 bytes of memory on the x86 processor. Basically, this command will zero out the variable i for the for loop. If that memory is examined right now, it will contain nothing but random garbage. The memory at this location can be examined several different ways.

(gdb) i r ebp

ebp 0xbffff808 0xbffff808

(gdb) x/4xb $ebp - 4

0xbffff804: 0xc0 0x83 0x04 0x08

(gdb) x/4xb 0xbffff804

0xbffff804: 0xc0 0x83 0x04 0x08

(gdb) print $ebp - 4

$1 = (void *) 0xbffff804

(gdb) x/4xb $1

0xbffff804: 0xc0 0x83 0x04 0x08

(gdb) x/xw $1

0xbffff804: 0x080483c0

(gdb)

The EBP register is shown to contain the address 0xbffff808, and the assembly instruction will be writing to a value offset by 4 less than that, 0xbffff804. The examine command can examine this memory address directly or by doing the math on the fly. The print command can also be used to do simple math, but the result is stored in a temporary variable in the debugger. This variable named $1 can be used later to quickly re-access a particular location in memory. Any of the methods shown above will accomplish the same task: displaying the 4 garbage bytes found in memory that will be zeroed out when the current instruction executes.

Let's execute the current instruction using the command nexti, which is short for next instruction. The processor will read the instruction at EIP, execute it, and advance EIP to the next instruction.

(gdb) nexti

0x0804838b 6 for(i=0; i < 10; i++)

(gdb) x/4xb $1

0xbffff804: 0x00 0x00 0x00 0x00

(gdb) x/dw $1

0xbffff804: 0

(gdb) i r eip

eip 0x804838b 0x804838b <main+23>

(gdb) x/i $eip

0x804838b <main+23>: cmp DWORD PTR [ebp-4],0x9

(gdb)

As predicted, the previous command zeroes out the 4 bytes found at EBP minus 4, which is memory set aside for the C variable i. Then EIP advances to the next instruction. The next few instructions actually make more sense to talk about in a group.

(gdb) x/10i $eip

0x804838b <main+23>: cmp DWORD PTR [ebp-4],0x9

0x804838f <main+27>: jle 0x8048393 <main+31>

0x8048391 <main+29>: jmp 0x80483a6 <main+50>

0x8048393 <main+31>: mov DWORD PTR [esp],0x8048484

0x804839a <main+38>: call 0x80482a0 <printf@plt>

0x804839f <main+43>: lea eax,[ebp-4]

0x80483a2 <main+46>: inc DWORD PTR [eax]

0x80483a4 <main+48>: jmp 0x804838b <main+23>

0x80483a6 <main+50>: leave

0x80483a7 <main+51>: ret

(gdb)

The first instruction, cmp, is a compare instruction, which will compare the memory used by the C variable i with the value 9. The next instruction, jle, stands for jump if less than or equal to. It uses the results of the previous comparison (which are actually stored in the EFLAGS register) to jump EIP to point to a different part of the code if the destination of the previous comparison operation is less than or equal to the source. In this case the instruction says to jump to the address 0x8048393 if the value stored in memory for the C variable i is less than or equal to the value 9. If this isn't the case, the EIP will continue to the next instruction, which is an unconditional jump instruction. This will cause the EIP to jump to the address 0x80483a6. These three instructions combine to create an if-then-else control structure: If i is less than or equal to 9, then go to the instruction at address 0x8048393; otherwise, go to the instruction at address 0x80483a6. The first address of 0x8048393 is simply the instruction found after the fixed jump instruction, and the second address of 0x80483a6 is located at the end of the function.

Since we know the value 0 is stored in the memory location being compared with the value 9, and we know that 0 is less than or equal to 9, EIP should be at 0x8048393 after executing the next two instructions.

(gdb) nexti

0x0804838f 6 for(i=0; i < 10; i++)

(gdb) x/i $eip

0x804838f <main+27>: jle 0x8048393 <main+31>

(gdb) nexti

8 printf("Hello, world!\n");

(gdb) i r eip

eip 0x8048393 0x8048393 <main+31>

(gdb) x/2i $eip

0x8048393 <main+31>: mov DWORD PTR [esp],0x8048484

0x804839a <main+38>: call 0x80482a0 <printf@plt>

(gdb)

As expected, the previous two instructions let the program execution flow down to 0x8048393, which brings us to the next two instructions. The first instruction is another mov instruction that will write the address 0x8048484 into the memory address contained in the ESP register. But what is ESP pointing to?

(gdb) i r esp

esp 0xbffff800 0xbffff800

(gdb)

Currently, ESP points to the memory address 0xbffff800, so when the mov instruction is executed, the address 0x8048484 is written there. But why? What's so special about the memory address 0x8048484? There's one way to find out.

(gdb) x/2xw 0x8048484

0x8048484: 0x6c6c6548 0x6f57206f

(gdb) x/6xb 0x8048484

0x8048484: 0x48 0x65 0x6c 0x6c 0x6f 0x20

(gdb) x/6ub 0x8048484

0x8048484: 72 101 108 108 111 32

(gdb)

A trained eye might notice something about the memory here, in particular the range of the bytes. After examining memory for long enough, these types of visual patterns become more apparent. These bytes fall within the printable ASCII range. ASCII is an agreed-upon standard that maps all the characters on your keyboard (and some that aren't) to fixed numbers. The bytes 0x48, 0x65, 0x6c, and 0x6f all correspond to letters in the alphabet on the ASCII table shown below. This table is found in the man page for ASCII, available on most Unix systems by typing man ascii.

ASCII Table

Oct Dec Hex Char Oct Dec Hex Char

------------------------------------------------------------

000 0 00 NUL '\0' 100 64 40 @

001 1 01 SOH 101 65 41 A

002 2 02 STX 102 66 42 B

003 3 03 ETX 103 67 43 C

004 4 04 EOT 104 68 44 D

005 5 05 ENQ 105 69 45 E

006 6 06 ACK 106 70 46 F

007 7 07 BEL '\a' 107 71 47 G

010 8 08 BS '\b' 110 72 48 H

011 9 09 HT '\t' 111 73 49 I

012 10 0A LF '\n' 112 74 4A J

013 11 0B VT '\v' 113 75 4B K

014 12 0C FF '\f' 114 76 4C L

015 13 0D CR '\r' 115 77 4D M

016 14 0E SO 116 78 4E N

017 15 0F SI 117 79 4F O

020 16 10 DLE 120 80 50 P

021 17 11 DC1 121 81 51 Q

022 18 12 DC2 122 82 52 R

023 19 13 DC3 123 83 53 S

024 20 14 DC4 124 84 54 T

025 21 15 NAK 125 85 55 U

026 22 16 SYN 126 86 56 V

027 23 17 ETB 127 87 57 W

030 24 18 CAN 130 88 58 X

031 25 19 EM 131 89 59 Y

032 26 1A SUB 132 90 5A Z

033 27 1B ESC 133 91 5B [

034 28 1C FS 134 92 5C \ '\\'

035 29 1D GS 135 93 5D ]

036 30 1E RS 136 94 5E ^

037 31 1F US 137 95 5F _

040 32 20 SPACE 140 96 60 `

041 33 21 ! 141 97 61 a

042 34 22 " 142 98 62 b

043 35 23 # 143 99 63 c

044 36 24 $ 144 100 64 d

045 37 25 % 145 101 65 e

046 38 26 & 146 102 66 f

047 39 27 ' 147 103 67 g

050 40 28 ( 150 104 68 h

051 41 29 ) 151 105 69 i

052 42 2A * 152 106 6A j

053 43 2B + 153 107 6B k

054 44 2C , 154 108 6C l

055 45 2D - 155 109 6D m

056 46 2E . 156 110 6E n

057 47 2F / 157 111 6F o

060 48 30 0 160 112 70 p

061 49 31 1 161 113 71 q

062 50 32 2 162 114 72 r

063 51 33 3 163 115 73 s

064 52 34 4 164 116 74 t

065 53 35 5 165 117 75 u

066 54 36 6 166 118 76 v

067 55 37 7 167 119 77 w

070 56 38 8 170 120 78 x

071 57 39 9 171 121 79 y

072 58 3A : 172 122 7A z

073 59 3B ; 173 123 7B {

074 60 3C < 174 124 7C |

075 61 3D = 175 125 7D }

076 62 3E > 176 126 7E ~

077 63 3F ? 177 127 7F DEL

Thankfully, GDB's examine command also contains provisions for looking at this type of memory. The c format letter can be used to automatically look up a byte on the ASCII table, and the s format letter will display an entire string of character data.

(gdb) x/6cb 0x8048484

0x8048484: 72 'H' 101 'e' 108 'l' 108 'l' 111 'o' 32 ' '

(gdb) x/s 0x8048484

0x8048484: "Hello, world!\n"

(gdb)

These commands reveal that the data string "Hello, world!\n" is stored at memory address 0x8048484. This string is the argument for the printf() function, which indicates that moving the address of this string to the address stored in ESP (0xbffff800) has something to do with this function. The following output shows the data string's address being moved into the address ESP is pointing to.

(gdb) x/2i $eip

0x8048393 <main+31>: mov DWORD PTR [esp],0x8048484

0x804839a <main+38>: call 0x80482a0 <printf@plt>

(gdb) x/xw $esp

0xbffff800: 0xb8000ce0

(gdb) nexti

0x0804839a 8 printf("Hello, world!\n");

(gdb) x/xw $esp

0xbffff800: 0x08048484

(gdb)

The next instruction actually calls the printf() function, which prints the data string. The previous instruction was setting up for the function call, and the results of the function call (the Hello, world! line) can be seen in the output below.

(gdb) x/i $eip

0x804839a <main+38>: call 0x80482a0 <printf@plt>

(gdb) nexti

Hello, world!

6 for(i=0; i < 10; i++)

(gdb)

Continuing to use GDB to debug, let's examine the next two instructions. Once again, they make more sense to look at in a group.

(gdb) x/2i $eip

0x804839f <main+43>: lea eax,[ebp-4]

0x80483a2 <main+46>: inc DWORD PTR [eax]

(gdb)

These two instructions basically just increment the variable i by 1. The name of the lea instruction is an acronym for Load Effective Address; the instruction loads the familiar address of EBP minus 4 into the EAX register. The execution of this instruction is shown below.

(gdb) x/i $eip

0x804839f <main+43>: lea eax,[ebp-4]

(gdb) print $ebp - 4

$2 = (void *) 0xbffff804

(gdb) x/x $2

0xbffff804: 0x00000000

(gdb) i r eax

eax 0xd 13

(gdb) nexti

0x080483a2 6 for(i=0; i < 10; i++)

(gdb) i r eax

eax 0xbffff804 -1073743868

(gdb) x/xw $eax

0xbffff804: 0x00000000

(gdb) x/dw $eax

0xbffff804: 0

(gdb)

The following inc instruction will increment the value found at this address (now stored in the EAX register) by 1. The execution of this instruction is also shown below.

(gdb) x/i $eip

0x80483a2 <main+46>: inc DWORD PTR [eax]

(gdb) x/dw $eax

0xbffff804: 0

(gdb) nexti

0x080483a4 6 for(i=0; i < 10; i++)

(gdb) x/dw $eax

0xbffff804: 1

(gdb)

The end result is the value stored at the memory address EBP minus 4 (0xbffff804), incremented by 1. This behavior corresponds to a portion of C code in which the variable i is incremented in the for loop.

The next instruction is an unconditional jump instruction.

(gdb) x/i $eip

0x80483a4 <main+48>: jmp 0x804838b <main+23>

(gdb)

When this instruction is executed, it will send the program back to the instruction at address 0x804838b. It does this by simply setting EIP to that value.

Looking at the full disassembly again, you should be able to tell which parts of the C code have been compiled into which machine instructions.

(gdb) disass main

Dump of assembler code for function main:

0x08048374 <main+0>: push ebp

0x08048375 <main+1>: mov ebp,esp

0x08048377 <main+3>: sub esp,0x8

0x0804837a <main+6>: and esp,0xfffffff0

0x0804837d <main+9>: mov eax,0x0

0x08048382 <main+14>: sub esp,eax

0x08048384 <main+16>: mov DWORD PTR [ebp-4],0x0

0x0804838b <main+23>: cmp DWORD PTR [ebp-4],0x9

0x0804838f <main+27>: jle 0x8048393 <main+31>

0x08048391 <main+29>: jmp 0x80483a6 <main+50>

0x08048393 <main+31>: mov DWORD PTR [esp],0x8048484

0x0804839a <main+38>: call 0x80482a0 <printf@plt>

0x0804839f <main+43>: lea eax,[ebp-4]

0x080483a2 <main+46>: inc DWORD PTR [eax]

0x080483a4 <main+48>: jmp 0x804838b <main+23>

0x080483a6 <main+50>: leave

0x080483a7 <main+51>: ret

End of assembler dump.

(gdb) list

1 #include <stdio.h>

3 int main()

4 {

5 int i;

6 for(i=0; i < 10; i++)

7 {

8 printf("Hello, world!\n");

9 }

10 }

(gdb)

The instructions from main+16 through main+29, together with those from main+43 through main+48, make up the for loop, and the two instructions at main+31 and main+38 are the printf() call found within the loop. The program execution will jump back to the compare instruction, continue to execute the printf() call, and increment the counter variable until it finally equals 10. At this point the conditional jle jump won't be taken; instead, the instruction pointer will continue to the unconditional jump instruction, which exits the loop and ends the program.

Back to Basics

Now that the idea of programming is less abstract, there are a few other important concepts to know about C. Assembly language and computer processors existed before higher-level programming languages, and many modern programming concepts have evolved through time. In the same way that knowing a little about Latin can greatly improve one's understanding of the English language, knowledge of low-level programming concepts can assist the comprehension of higher-level ones. When continuing to the next section, remember that C code must be compiled into machine instructions before it can do anything.

Strings

The value "Hello, world!\n" passed to the printf() function in the previous program is a string—technically, a character array. In C, an array is simply a list of n elements of a specific data type. A 20-character array is simply 20 adjacent characters located in memory. Arrays are also referred to as buffers. The char_array.c program is an example of a character array.

char_array.c

#include <stdio.h>

int main()

{

char str_a[20];

str_a[0] = 'H';

str_a[1] = 'e';

str_a[2] = 'l';

str_a[3] = 'l';

str_a[4] = 'o';

str_a[5] = ',';

str_a[6] = ' ';

str_a[7] = 'w';

str_a[8] = 'o';

str_a[9] = 'r';

str_a[10] = 'l';

str_a[11] = 'd';

str_a[12] = '!';

str_a[13] = '\n';

str_a[14] = 0;

printf(str_a);

}

The GCC compiler can also be given the -o switch to define the output file to compile to. This switch is used below to compile the program into an executable binary called char_array.

reader@hacking:~/booksrc $ gcc -o char_array char_array.c

reader@hacking:~/booksrc $ ./char_array

Hello, world!

reader@hacking:~/booksrc $

In the preceding program, a 20-element character array is defined as str_a, and each element of the array is written to, one by one. Notice that the numbering begins at 0, as opposed to 1. Also notice that the last character is a 0. (This is also called a null byte.) The character array was defined, so 20 bytes are allocated for it, but only 15 of these bytes are actually used (14 characters plus the null byte). The null byte at the end is used as a delimiter character to tell any function that is dealing with the string to stop operations right there. The remaining extra bytes are just garbage and will be ignored. If a null byte were stored at index 5 of the character array (its sixth element), only the characters Hello would be printed by the printf() function.

Since setting each character in a character array is painstaking and strings are used fairly often, a set of standard functions was created for string manipulation. For example, the strcpy() function will copy a string from a source to a destination, iterating through the source string and copying each byte to the destination (and stopping after it copies the null termination byte). The order of the function's arguments is similar to Intel assembly syntax: destination first and then source. The char_array.c program can be rewritten using strcpy() to accomplish the same thing using the string library. The next version of the char_array program shown below includes string.h since it uses a string function.

char_array2.c

#include <stdio.h>

#include <string.h>

int main() {

char str_a[20];

strcpy(str_a, "Hello, world!\n");

printf(str_a);

}

Let's take a look at this program with GDB. In the output below, the compiled program is opened with GDB and breakpoints are set before, in, and after the strcpy() call shown in bold. The debugger will pause the program at each breakpoint, giving us a chance to examine registers and memory. The strcpy() function's code comes from a shared library, so the breakpoint in this function can't actually be set until the program is executed.

reader@hacking:~/booksrc $ gcc -g -o char_array2 char_array2.c

reader@hacking:~/booksrc $ gdb -q ./char_array2

Using host libthread_db library "/lib/tls/i686/cmov/libthread_db.so.1".

(gdb) list

1 #include <stdio.h>

2 #include <string.h>

4 int main() {

5 char str_a[20];

7 strcpy(str_a, "Hello, world!\n");

8 printf(str_a);

9 }

(gdb) break 6

Breakpoint 1 at 0x80483c4: file char_array2.c, line 6.

(gdb) break strcpy

Function "strcpy" not defined.

Make breakpoint pending on future shared library load? (y or [n]) y

Breakpoint 2 (strcpy) pending.

(gdb) break 8

Breakpoint 3 at 0x80483d7: file char_array2.c, line 8.

(gdb)

When the program is run, the strcpy() breakpoint is resolved. At each breakpoint, we're going to look at EIP and the instructions it points to. Notice that the memory location for EIP at the middle breakpoint is different.

(gdb) run

Starting program: /home/reader/booksrc/char_array2

Breakpoint 4 at 0xb7f076f4

Pending breakpoint "strcpy" resolved

Breakpoint 1, main () at char_array2.c:7

7 strcpy(str_a, "Hello, world!\n");

(gdb) i r eip

eip 0x80483c4 0x80483c4 <main+16>

(gdb) x/5i $eip

0x80483c4 <main+16>: mov DWORD PTR [esp+4],0x80484c4

0x80483cc <main+24>: lea eax,[ebp-40]

0x80483cf <main+27>: mov DWORD PTR [esp],eax

0x80483d2 <main+30>: call 0x80482c4 <strcpy@plt>

0x80483d7 <main+35>: lea eax,[ebp-40]

(gdb) continue

Continuing.

Breakpoint 4, 0xb7f076f4 in strcpy () from /lib/tls/i686/cmov/libc.so.6

(gdb) i r eip

eip 0xb7f076f4 0xb7f076f4 <strcpy+4>

(gdb) x/5i $eip

0xb7f076f4 <strcpy+4>: mov esi,DWORD PTR [ebp+8]

0xb7f076f7 <strcpy+7>: mov eax,DWORD PTR [ebp+12]

0xb7f076fa <strcpy+10>: mov ecx,esi

0xb7f076fc <strcpy+12>: sub ecx,eax

0xb7f076fe <strcpy+14>: mov edx,eax

(gdb) continue

Continuing.

Breakpoint 3, main () at char_array2.c:8

8 printf(str_a);

(gdb) i r eip

eip 0x80483d7 0x80483d7 <main+35>

(gdb) x/5i $eip

0x80483d7 <main+35>: lea eax,[ebp-40]

0x80483da <main+38>: mov DWORD PTR [esp],eax

0x80483dd <main+41>: call 0x80482d4 <printf@plt>

0x80483e2 <main+46>: leave

0x80483e3 <main+47>: ret

(gdb)

The address in EIP at the middle breakpoint is different because the code for the strcpy() function comes from a loaded library. In fact, the debugger shows EIP for the middle breakpoint in the strcpy() function, while EIP at the other two breakpoints is in the main() function. I'd like to point out that EIP is able to travel from the main code to the strcpy() code and back again. Each time a function is called, a record is kept on a data structure simply called the stack. The stack lets EIP return through long chains of function calls. In GDB, the bt command can be used to backtrace the stack. In the output below, the stack backtrace is shown at each breakpoint.

(gdb) run

The program being debugged has been started already.

Start it from the beginning? (y or n) y

Starting program: /home/reader/booksrc/char_array2

Error in re-setting breakpoint 4:

Function "strcpy" not defined.

Breakpoint 1, main () at char_array2.c:7

7 strcpy(str_a, "Hello, world!\n");

(gdb) bt

#0 main () at char_array2.c:7

(gdb) cont

Continuing.

Breakpoint 4, 0xb7f076f4 in strcpy () from /lib/tls/i686/cmov/libc.so.6

(gdb) bt

#0 0xb7f076f4 in strcpy () from /lib/tls/i686/cmov/libc.so.6

#1 0x080483d7 in main () at char_array2.c:7

(gdb) cont

Continuing.

Breakpoint 3, main () at char_array2.c:8

8 printf(str_a);

(gdb) bt

#0 main () at char_array2.c:8

(gdb)

At the middle breakpoint, the backtrace of the stack shows its record of the strcpy() call. Also, you may notice that the strcpy() function is at a slightly different address during the second run. This is due to an exploit protection method that is turned on by default in the Linux kernel since 2.6.11. We will talk about this protection in more detail later.

Signed, Unsigned, Long, and Short

By default, numerical values in C are signed, which means they can be both negative and positive. In contrast, unsigned values don't allow negative numbers. Since it's all just memory in the end, all numerical values must be stored in binary, and unsigned values make the most sense in binary. A 32-bit unsigned integer can contain values from 0 (all binary 0s) to 4,294,967,295 (all binary 1s). A 32-bit signed integer is still just 32 bits, which means it can only be in one of 2³² possible bit combinations. This allows 32-bit signed integers to range from –2,147,483,648 to 2,147,483,647. Essentially, one of the bits is a flag marking the value positive or negative. Positively signed values look the same as unsigned values, but negative numbers are stored differently using a method called two's complement. Two's complement represents negative numbers in a form suited for binary adders—when a negative value in two's complement is added to a positive number of the same magnitude, the result will be 0. This is done by first writing the positive number in binary, then inverting all the bits, and finally adding 1. It sounds strange, but it works and allows negative numbers to be added in combination with positive numbers using simple binary adders.

This can be explored quickly on a smaller scale using pcalc, a simple programmer's calculator that displays results in decimal, hexadecimal, and binary formats. For simplicity's sake, 8-bit numbers are used in this example.

reader@hacking:~/booksrc $ pcalc 0y01001001

73 0x49 0y1001001

reader@hacking:~/booksrc $ pcalc 0y10110110 + 1

183 0xb7 0y10110111

reader@hacking:~/booksrc $ pcalc 0y01001001 + 0y10110111

256 0x100 0y100000000

reader@hacking:~/booksrc $

First, the binary value 01001001 is shown to be positive 73. Then all the bits are flipped, and 1 is added to result in the two's complement representation for negative 73, 10110111. When these two values are added together, the result of the original 8 bits is 0. The program pcalc shows the value 256 because it's not aware that we're only dealing with 8-bit values. In a binary adder, that carry bit would just be thrown away because the end of the variable's memory would have been reached. This example might shed some light on how two's complement works its magic.

In C, variables can be declared as unsigned by simply prepending the keyword unsigned to the declaration. An unsigned integer would be declared with unsigned int. In addition, the size of numerical variables can be extended or shortened by adding the keywords long or short. The actual sizes will vary depending on the architecture the code is compiled for. The C language provides an operator called sizeof that can determine the size of certain data types. Although it is an operator built into the language rather than a true function or macro, it is used like a function that takes a data type as its input and returns the size, in bytes, of a variable declared with that data type for the target architecture. The datatype_sizes.c program explores the sizes of various data types, using sizeof().

datatype_sizes.c

#include <stdio.h>

int main() {

printf("The 'int' data type is\t\t %d bytes\n", sizeof(int));

printf("The 'unsigned int' data type is\t %d bytes\n", sizeof(unsigned int));

printf("The 'short int' data type is\t %d bytes\n", sizeof(short int));

printf("The 'long int' data type is\t %d bytes\n", sizeof(long int));

printf("The 'long long int' data type is %d bytes\n", sizeof(long long int));

printf("The 'float' data type is\t %d bytes\n", sizeof(float));

printf("The 'char' data type is\t\t %d bytes\n", sizeof(char));

}

This piece of code uses the printf() function in a slightly different way. It uses something called a format specifier to display the value returned from the sizeof() function calls. Format specifiers will be explained in depth later, so for now, let's just focus on the program's output.

reader@hacking:~/booksrc $ gcc datatype_sizes.c

reader@hacking:~/booksrc $ ./a.out

The 'int' data type is 4 bytes

The 'unsigned int' data type is 4 bytes

The 'short int' data type is 2 bytes

The 'long int' data type is 4 bytes

The 'long long int' data type is 8 bytes

The 'float' data type is 4 bytes

The 'char' data type is 1 bytes

reader@hacking:~/booksrc $

As previously stated, both signed and unsigned integers are four bytes in size on the x86 architecture. A float is also four bytes, while a char only needs a single byte. The long keyword can also be used with the double floating-point type (long double) to extend its size; the short keyword has no floating-point counterpart.

Pointers

The EIP register is a pointer that "points" to the current instruction during a program's execution by containing its memory address. The idea of pointers is used in C, also. Since the physical memory cannot actually be moved, the information in it must be copied. It can be very computationally expensive to copy large chunks of memory to be used by different functions or in different places. This is also expensive from a memory standpoint, since space for the new destination copy must be saved or allocated before the source can be copied. Pointers are a solution to this problem. Instead of copying a large block of memory, it is much simpler to pass around the address of the beginning of that block of memory.

Pointers in C can be defined and used like any other variable type. Since memory on the x86 architecture uses 32-bit addressing, pointers are also 32 bits in size (4 bytes). Pointers are defined by prepending an asterisk (*) to the variable name. Instead of defining a variable of that type, a pointer is defined as something that points to data of that type. The pointer.c program is an example of a pointer being used with the char data type, which is only 1 byte in size.

pointer.c

#include <stdio.h>

#include <string.h>

int main() {

char str_a[20]; // A 20-element character array

char *pointer; // A pointer, meant for a character array

char *pointer2; // And yet another one

strcpy(str_a, "Hello, world!\n");

pointer = str_a; // Set the first pointer to the start of the array.

printf(pointer);

pointer2 = pointer + 2; // Set the second one 2 bytes further in.

printf(pointer2); // Print it.

strcpy(pointer2, "y you guys!\n"); // Copy into that spot.

printf(pointer); // Print again.

}

As the comments in the code indicate, the first pointer is set at the beginning of the character array. When the character array is referenced like this, it is actually a pointer itself. This is how this buffer was passed as a pointer to the printf() and strcpy() functions earlier. The second pointer is set to the first pointer's address plus two, and then some things are printed (shown in the output below).

reader@hacking:~/booksrc $ gcc -o pointer pointer.c

reader@hacking:~/booksrc $ ./pointer

Hello, world!

llo, world!

Hey you guys!

reader@hacking:~/booksrc $

Let's take a look at this with GDB. The program is recompiled, and a breakpoint is set on line 11 of the source code. This will stop the program after the "Hello, world!\n" string has been copied into the str_a buffer and the pointer variable is set to the beginning of it.

reader@hacking:~/booksrc $ gcc -g -o pointer pointer.c

reader@hacking:~/booksrc $ gdb -q ./pointer

Using host libthread_db library "/lib/tls/i686/cmov/libthread_db.so.1".

(gdb) list

1 #include <stdio.h>

2 #include <string.h>

4 int main() {

5 char str_a[20]; // A 20-element character array

6 char *pointer; // A pointer, meant for a character array

7 char *pointer2; // And yet another one

9 strcpy(str_a, "Hello, world!\n");

10 pointer = str_a; // Set the first pointer to the start of the array.

(gdb)

11 printf(pointer);

13 pointer2 = pointer + 2; // Set the second one 2 bytes further in.

14 printf(pointer2); // Print it.

15 strcpy(pointer2, "y you guys!\n"); // Copy into that spot.

16 printf(pointer); // Print again.

17 }

(gdb) break 11

Breakpoint 1 at 0x80483dd: file pointer.c, line 11.

(gdb) run

Starting program: /home/reader/booksrc/pointer

Breakpoint 1, main () at pointer.c:11

11 printf(pointer);

(gdb) x/xw pointer

0xbffff7e0: 0x6c6c6548

(gdb) x/s pointer

0xbffff7e0: "Hello, world!\n"

(gdb)

When the pointer is examined as a string, it's apparent that the given string is there and is located at memory address 0xbffff7e0. Remember that the string itself isn't stored in the pointer variable—only the memory address 0xbffff7e0 is stored there.

In order to see the actual data stored in the pointer variable, you must use the address-of operator. The address-of operator is a unary operator, which simply means it operates on a single argument. This operator is just an ampersand (&) prepended to a variable name. When it's used, the address of that variable is returned, instead of the variable itself. This operator exists both in GDB and in the C programming language.

(gdb) x/xw &pointer

0xbffff7dc: 0xbffff7e0

(gdb) print &pointer

$1 = (char **) 0xbffff7dc

(gdb) print pointer

$2 = 0xbffff7e0 "Hello, world!\n"

(gdb)

When the address-of operator is used, the pointer variable is shown to be located at the address 0xbffff7dc in memory, and it contains the address 0xbffff7e0.

The address-of operator is often used in conjunction with pointers, since pointers contain memory addresses. The addressof.c program demonstrates the address-of operator being used to put the address of an integer variable into a pointer. This line is shown in bold below.

addressof.c

#include <stdio.h>

int main() {

int int_var = 5;

int *int_ptr;

int_ptr = &int_var; // put the address of int_var into int_ptr

}

The program itself doesn't actually output anything, but you can probably guess what happens, even before debugging with GDB.

reader@hacking:~/booksrc $ gcc -g addressof.c

reader@hacking:~/booksrc $ gdb -q ./a.out

Using host libthread_db library "/lib/tls/i686/cmov/libthread_db.so.1".

(gdb) list

1 #include <stdio.h>

3 int main() {

4 int int_var = 5;

5 int *int_ptr;

7 int_ptr = &int_var; // Put the address of int_var into int_ptr.

8 }

(gdb) break 8

Breakpoint 1 at 0x8048361: file addressof.c, line 8.

(gdb) run

Starting program: /home/reader/booksrc/a.out

Breakpoint 1, main () at addressof.c:8

8 }

(gdb) print int_var

$1 = 5

(gdb) print &int_var

$2 = (int *) 0xbffff804

(gdb) print int_ptr

$3 = (int *) 0xbffff804

(gdb) print &int_ptr

$4 = (int **) 0xbffff800

(gdb)

As usual, a breakpoint is set and the program is executed in the debugger. At this point the majority of the program has executed. The first print command shows the value of int_var, and the second shows its address using the address-of operator. The next two print commands show that int_ptr contains the address of int_var, and they also show the address of the int_ptr for good measure.

An additional unary operator called the dereference operator exists for use with pointers. This operator will return the data found in the address the pointer is pointing to, instead of the address itself. It takes the form of an asterisk in front of the variable name, similar to the declaration of a pointer. Once again, the dereference operator exists both in GDB and in C. Used in GDB, it can retrieve the integer value int_ptr points to.

(gdb) print *int_ptr

$5 = 5

A few additions to the addressof.c code (shown in addressof2.c) will demonstrate all of these concepts. The added printf() functions use format parameters, which I'll explain in the next section. For now, just focus on the program's output.

addressof2.c

#include <stdio.h>

int main() {

int int_var = 5;

int *int_ptr;

int_ptr = &int_var; // Put the address of int_var into int_ptr.

printf("int_ptr = 0x%08x\n", int_ptr);

printf("&int_ptr = 0x%08x\n", &int_ptr);

printf("*int_ptr = 0x%08x\n\n", *int_ptr);

printf("int_var is located at 0x%08x and contains %d\n", &int_var, int_var);

printf("int_ptr is located at 0x%08x, contains 0x%08x, and points to %d\n\n",

&int_ptr, int_ptr, *int_ptr);

}

The results of compiling and executing addressof2.c are as follows.

reader@hacking:~/booksrc $ gcc addressof2.c

reader@hacking:~/booksrc $ ./a.out

int_ptr = 0xbffff834

&int_ptr = 0xbffff830

*int_ptr = 0x00000005

int_var is located at 0xbffff834 and contains 5

int_ptr is located at 0xbffff830, contains 0xbffff834, and points to 5

reader@hacking:~/booksrc $

When the unary operators are used with pointers, the address-of operator can be thought of as moving backward, while the dereference operator moves forward in the direction the pointer is pointing.

Format Strings

The printf() function can be used to print more than just fixed strings. This function can also use format strings to print variables in many different formats. A format string is just a character string with special escape sequences that tell the function to insert variables printed in a specific format in place of the escape sequence. The way the printf() function has been used in the previous programs, the "Hello, world!\n" string technically is the format string; however, it is devoid of special escape sequences. These escape sequences are also called format parameters, and for each one found in the format string, the function is expected to take an additional argument. Each format parameter begins with a percent sign (%) and uses a single-character shorthand very similar to formatting characters used by GDB's examine command.

Parameter	Output Type
%d	Decimal
%u	Unsigned decimal
%x	Hexadecimal

All of the preceding format parameters receive their data as values, not pointers to values. There are also some format parameters that expect pointers, such as the following.

Parameter	Output Type
%s	String
%n	Number of bytes written so far

The %s format parameter expects to be given a memory address; it prints the data at that memory address until a null byte is encountered. The %n format parameter is unique in that it actually writes data. It also expects to be given a memory address, and it writes the number of bytes that have been written so far into that memory address.

For now, our focus will just be the format parameters used for displaying data. The fmt_strings.c program shows some examples of different format parameters.

fmt_strings.c

#include <stdio.h>

#include <string.h>

int main() {

char string[10];

int A = -73;

unsigned int B = 31337;

strcpy(string, "sample");

// Example of printing with different format string

printf("[A] Dec: %d, Hex: %x, Unsigned: %u\n", A, A, A);

printf("[B] Dec: %d, Hex: %x, Unsigned: %u\n", B, B, B);

printf("[field width on B] 3: '%3u', 10: '%10u', '%08u'\n", B, B, B);

printf("[string] %s Address %08x\n", string, string);

// Example of unary address-of operator and a %x format string

printf("variable A is at address: %08x\n", &A);

}

In the preceding code, additional variable arguments are passed to each printf() call for every format parameter in the format string. The final printf() call uses the argument &A, which will provide the address of the variable A. The program's compilation and execution are as follows.

reader@hacking:~/booksrc $ gcc -o fmt_strings fmt_strings.c

reader@hacking:~/booksrc $ ./fmt_strings

[A] Dec: -73, Hex: ffffffb7, Unsigned: 4294967223

[B] Dec: 31337, Hex: 7a69, Unsigned: 31337

[field width on B] 3: '31337', 10: '     31337', '00031337'

[string] sample Address bffff870

variable A is at address: bffff86c

reader@hacking:~/booksrc $

The first two calls to printf() demonstrate the printing of variables A and B, using different format parameters. Since there are three format parameters in each line, the variables A and B need to be supplied three times each. The %d format parameter allows for negative values, while %u does not, since it is expecting unsigned values.

When the variable A is printed using the %u format parameter, it appears as a very high value. This is because A is a negative number stored in two's complement, and the format parameter is trying to print it as if it were an unsigned value. Since two's complement flips all the bits and adds one, the very high bits that used to be zero are now one.

The third line in the example, labeled [field width on B], shows the use of the field-width option in a format parameter. This is just an integer that designates the minimum field width for that format parameter. However, this is not a maximum field width—if the value to be outputted is greater than the field width, the field width will be exceeded. This happens when 3 is used, since the output data needs 5 bytes. When 10 is used as the field width, 5 bytes of blank space are outputted before the output data. Additionally, if a field width value begins with a 0, this means the field should be padded with zeros. When 08 is used, for example, the output is 00031337.

The fourth line, labeled [string], simply shows the use of the %s format parameter. Remember that the variable string is actually a pointer containing the address of the string, which works out wonderfully, since the %s format parameter expects its data to be passed by reference.

The final line just shows the address of the variable A, obtained with the unary address-of operator. This value is displayed as eight hexadecimal digits, padded by zeros.

As these examples show, you should use %d for decimal, %u for unsigned, and %x for hexadecimal values. Minimum field widths can be set by putting a number right after the percent sign, and if the field width begins with 0, it will be padded with zeros. The %s parameter can be used to print strings and should be passed the address of the string. So far, so good.

Format strings are used by an entire family of standard I/O functions, including scanf(), which basically works like printf() but is used for input instead of output. One key difference is that the scanf() function expects all of its arguments to be pointers, so the arguments must actually be variable addresses—not the variables themselves. This can be done using pointer variables or by using the unary address operator to retrieve the address of the normal variables. The input.c program and execution should help explain.

input.c

#include <stdio.h>

#include <string.h>

int main() {

char message[20];

int count, i;

strcpy(message, "Hello, world!");

printf("Repeat how many times? ");

scanf("%d", &count);

for(i=0; i < count; i++)

printf("%3d - %s\n", i, message);

}

In input.c, the scanf() function is used to set the count variable. The output below demonstrates its use.

reader@hacking:~/booksrc $ gcc -o input input.c

reader@hacking:~/booksrc $ ./input

Repeat how many times? 3

0 - Hello, world!

1 - Hello, world!

2 - Hello, world!

reader@hacking:~/booksrc $ ./input

Repeat how many times? 12

0 - Hello, world!

1 - Hello, world!

2 - Hello, world!

3 - Hello, world!

4 - Hello, world!

5 - Hello, world!

6 - Hello, world!

7 - Hello, world!

8 - Hello, world!

9 - Hello, world!

10 - Hello, world!

11 - Hello, world!

reader@hacking:~/booksrc $

Format strings are used quite often, so familiarity with them is valuable. In addition, the ability to output the values of variables allows for debugging in the program, without the use of a debugger. Having some form of immediate feedback is fairly vital to the hacker's learning process, and something as simple as printing the value of a variable can allow for lots of exploitation.

Typecasting

Typecasting is simply a way to temporarily change a variable's data type, despite how it was originally defined. When a variable is typecast into a different type, the compiler is basically told to treat that variable as if it were the new data type, but only for that operation. The syntax for typecasting is as follows:

(typecast_data_type) variable

This can be used when dealing with integers and floating-point variables, as typecasting.c demonstrates.

typecasting.c

#include <stdio.h>

int main() {

int a, b;

float c, d;

a = 13;

b = 5;

c = a / b; // Divide using integers.

d = (float) a / (float) b; // Divide integers typecast as floats.

printf("[integers]\t a = %d\t b = %d\n", a, b);

printf("[floats]\t c = %f\t d = %f\n", c, d);

}

The results of compiling and executing typecasting.c are as follows.

reader@hacking:~/booksrc $ gcc typecasting.c

reader@hacking:~/booksrc $ ./a.out

[integers] a = 13 b = 5

[floats] c = 2.000000 d = 2.600000

reader@hacking:~/booksrc $

As discussed earlier, dividing the integer 13 by 5 will round down to the incorrect answer of 2, even if this value is being stored into a floating-point variable. However, if these integer variables are typecast into floats, they will be treated as such. This allows for the correct calculation of 2.6.

This example is illustrative, but where typecasting really shines is when it is used with pointer variables. Even though a pointer is just a memory address, the C compiler still demands a data type for every pointer. One reason for this is to try to limit programming errors. An integer pointer should only point to integer data, while a character pointer should only point to character data. Another reason is for pointer arithmetic. An integer is four bytes in size, while a character only takes up a single byte. The pointer_types.c program will demonstrate and explain these concepts further. This code uses the format parameter %p to output memory addresses. This is shorthand meant for displaying pointers and is basically equivalent to 0x%08x.

pointer_types.c

#include <stdio.h>

int main() {

int i;

char char_array[5] = {'a', 'b', 'c', 'd', 'e'};

int int_array[5] = {1, 2, 3, 4, 5};

char *char_pointer;

int *int_pointer;

char_pointer = char_array;

int_pointer = int_array;

for(i=0; i < 5; i++) { // Iterate through the int array with the int_pointer.

printf("[integer pointer] points to %p, which contains the integer %d\n",

int_pointer, *int_pointer);

int_pointer = int_pointer + 1;

}

for(i=0; i < 5; i++) { // Iterate through the char array with the char_pointer.

printf("[char pointer] points to %p, which contains the char '%c'\n",

char_pointer, *char_pointer);

char_pointer = char_pointer + 1;

}

}

In this code, two arrays are defined in memory: one containing integer data and the other containing character data. Two pointers are also defined, one with the integer data type and one with the character data type, and they are set to point at the start of the corresponding data arrays. Two separate for loops iterate through the arrays, using pointer arithmetic to adjust the pointer to point at the next value. In the loops, when the integer and character values are actually printed with the %d and %c format parameters, notice that the corresponding printf() arguments must dereference the pointer variables. This is done using the unary * operator, as in *int_pointer and *char_pointer.

reader@hacking:~/booksrc $ gcc pointer_types.c

reader@hacking:~/booksrc $ ./a.out

[integer pointer] points to 0xbffff7f0, which contains the integer 1

[integer pointer] points to 0xbffff7f4, which contains the integer 2

[integer pointer] points to 0xbffff7f8, which contains the integer 3

[integer pointer] points to 0xbffff7fc, which contains the integer 4

[integer pointer] points to 0xbffff800, which contains the integer 5

[char pointer] points to 0xbffff810, which contains the char 'a'

[char pointer] points to 0xbffff811, which contains the char 'b'

[char pointer] points to 0xbffff812, which contains the char 'c'

[char pointer] points to 0xbffff813, which contains the char 'd'

[char pointer] points to 0xbffff814, which contains the char 'e'

reader@hacking:~/booksrc $

Even though the same value of 1 is added to int_pointer and char_pointer in their respective loops, the compiler increments the pointer's addresses by different amounts. Since a char is only 1 byte, the pointer to the next char would naturally also be 1 byte over. But since an integer is 4 bytes, a pointer to the next integer has to be 4 bytes over.

In pointer_types2.c, the pointers are swapped so that int_pointer points to the character data and char_pointer points to the integer data. The major changes to the code are the two pointer assignments and the printf() format strings.

pointer_types2.c

#include <stdio.h>

int main() {

int i;

char char_array[5] = {'a', 'b', 'c', 'd', 'e'};

int int_array[5] = {1, 2, 3, 4, 5};

char *char_pointer;

int *int_pointer;

char_pointer = int_array; // The char_pointer and int_pointer now

int_pointer = char_array; // point to incompatible data types.

for(i=0; i < 5; i++) { // Iterate through the int array with the int_pointer.

printf("[integer pointer] points to %p, which contains the char '%c'\n",

int_pointer, *int_pointer);

int_pointer = int_pointer + 1;

}

for(i=0; i < 5; i++) { // Iterate through the char array with the char_pointer.

printf("[char pointer] points to %p, which contains the integer %d\n",

char_pointer, *char_pointer);

char_pointer = char_pointer + 1;

}

}

The output below shows the warnings spewed forth from the compiler.

reader@hacking:~/booksrc $ gcc pointer_types2.c

pointer_types2.c: In function `main':

pointer_types2.c:12: warning: assignment from incompatible pointer type

pointer_types2.c:13: warning: assignment from incompatible pointer type

reader@hacking:~/booksrc $

In an attempt to prevent programming mistakes, the compiler gives warnings about pointers that point to incompatible data types. But the compiler and perhaps the programmer are the only ones that care about a pointer's type. In the compiled code, a pointer is nothing more than a memory address, so the compiler will still compile the code if a pointer points to an incompatible data type—it simply warns the programmer to anticipate unexpected results.

reader@hacking:~/booksrc $ ./a.out

[integer pointer] points to 0xbffff810, which contains the char 'a'

[integer pointer] points to 0xbffff814, which contains the char 'e'

[integer pointer] points to 0xbffff818, which contains the char '8'

[integer pointer] points to 0xbffff81c, which contains the char '

[integer pointer] points to 0xbffff820, which contains the char '?'

[char pointer] points to 0xbffff7f0, which contains the integer 1

[char pointer] points to 0xbffff7f1, which contains the integer 0

[char pointer] points to 0xbffff7f2, which contains the integer 0

[char pointer] points to 0xbffff7f3, which contains the integer 0

[char pointer] points to 0xbffff7f4, which contains the integer 2

reader@hacking:~/booksrc $

Even though the int_pointer points to character data that contains only 5 bytes, it is still typed as an integer pointer. This means that adding 1 to the pointer will increment the address by 4 each time, so after the first two reads it has moved past the end of the 5-byte array and prints whatever happens to be in the adjacent memory. Similarly, the char_pointer's address is only incremented by 1 each time, stepping through the 20 bytes of integer data (five 4-byte integers), one byte at a time. Once again, the little-endian byte order of the integer data is apparent when the 4-byte integer is examined one byte at a time. The 4-byte value of 0x00000001 is actually stored in memory as 0x01, 0x00, 0x00, 0x00.

There will be situations like this in which you are using a pointer that points to data with a conflicting type. Since the pointer type determines the size of the data it points to, it's important that the type is correct. As you can see in pointer_types3.c below, typecasting is just a way to change the type of a variable on the fly.

pointer_types3.c

#include <stdio.h>

int main() {

int i;

char char_array[5] = {'a', 'b', 'c', 'd', 'e'};

int int_array[5] = {1, 2, 3, 4, 5};

char *char_pointer;

int *int_pointer;

char_pointer = (char *) int_array; // Typecast into the

int_pointer = (int *) char_array; // pointer's data type.

for(i=0; i < 5; i++) { // Iterate through the int array with the int_pointer.

printf("[integer pointer] points to %p, which contains the char '%c'\n",

int_pointer, *int_pointer);

int_pointer = (int *) ((char *) int_pointer + 1);

}

for(i=0; i < 5; i++) { // Iterate through the char array with the char_pointer.

printf("[char pointer] points to %p, which contains the integer %d\n",

char_pointer, *char_pointer);

char_pointer = (char *) ((int *) char_pointer + 1);

}

}

In this code, when the pointers are initially set, the data is typecast into the pointer's data type. This will prevent the C compiler from complaining about the conflicting data types; however, any pointer arithmetic will still be incorrect. To fix that, when 1 is added to the pointers, they must first be typecast into the correct data type so the address is incremented by the correct amount. Then this pointer needs to be typecast back into the pointer's data type once again. It doesn't look too pretty, but it works.

reader@hacking:~/booksrc $ gcc pointer_types3.c

reader@hacking:~/booksrc $ ./a.out

[integer pointer] points to 0xbffff810, which contains the char 'a'

[integer pointer] points to 0xbffff811, which contains the char 'b'

[integer pointer] points to 0xbffff812, which contains the char 'c'

[integer pointer] points to 0xbffff813, which contains the char 'd'

[integer pointer] points to 0xbffff814, which contains the char 'e'

[char pointer] points to 0xbffff7f0, which contains the integer 1

[char pointer] points to 0xbffff7f4, which contains the integer 2

[char pointer] points to 0xbffff7f8, which contains the integer 3

[char pointer] points to 0xbffff7fc, which contains the integer 4

[char pointer] points to 0xbffff800, which contains the integer 5

reader@hacking:~/booksrc $

Naturally, it is far easier just to use the correct data type for pointers in the first place; however, sometimes a generic, typeless pointer is desired. In C, a void pointer is a typeless pointer, defined by the void keyword. Experimenting with void pointers quickly reveals a few things about typeless pointers. First, pointers cannot be dereferenced unless they have a type. In order to retrieve the value stored at the pointer's memory address, the compiler must first know what type of data it is. Second, void pointers must also be typecast before doing pointer arithmetic. These are fairly intuitive limitations, which means that a void pointer's main purpose is simply to hold a memory address.

The pointer_types3.c program can be modified to use a single void pointer by typecasting it to the proper type each time it's used. The compiler knows that a void pointer is typeless, so any type of pointer can be stored in a void pointer without typecasting. This also means a void pointer must always be typecast when dereferencing it, however. These differences can be seen in pointer_types4.c, which uses a void pointer.

pointer_types4.c

#include <stdio.h>

int main() {

int i;

char char_array[5] = {'a', 'b', 'c', 'd', 'e'};

int int_array[5] = {1, 2, 3, 4, 5};

void *void_pointer;

void_pointer = (void *) char_array;

for(i=0; i < 5; i++) { // Iterate through the char array with the void_pointer.

printf("[char pointer] points to %p, which contains the char '%c'\n",

void_pointer, *((char *) void_pointer));

void_pointer = (void *) ((char *) void_pointer + 1);

}

void_pointer = (void *) int_array;

for(i=0; i < 5; i++) { // Iterate through the int array with the void_pointer.

printf("[integer pointer] points to %p, which contains the integer %d\n",

void_pointer, *((int *) void_pointer));

void_pointer = (void *) ((int *) void_pointer + 1);

}

}

The results of compiling and executing pointer_types4.c are as follows.

reader@hacking:~/booksrc $ gcc pointer_types4.c