C++ For Dummies (2014)

Part V

Security

Chapter 28

Writing Hacker-Proof Code

In This Chapter

· How to avoid becoming a soldier in someone's botnet army

· Getting a handle on SQL injection

· Understanding buffer overflow hacks

· Defensive programming against buffer overflows

· Getting a little help from the operating system

In the interest of full disclosure, I should admit right now: I'm not sure that it's possible to write hacker-proof code. Those slippery devils always seem to find a way. But by knowing some of their tricks and how to counter them, you can write programs that are very hacker resistant.

There is more to hacker-proofing than just writing code. Program protection takes a multitude of forms, which I describe in Chapter 30. However, since this book is about writing programs, and since code writing is probably the most important component of hacker-proofing, let's start there.

Understanding the Hacker's Motives

Why would a hacker want to break into one of the lowly C++ console programs presented in this book? The short answer is, “He wouldn't.” The programs in this book are all written to be executed from the keyboard at normal user privileges. If the user can get to the keyboard to execute one of these programs, then he can execute any other command that he wants. He doesn't need to resort to hacks.

Think a little further into the future, however. After you've finished this book and sharpened your C++ skills, you land that really sweet job that you were looking for at the, hmmm, at the bank. Yeah, that's the ticket. You're a big-time programmer at the bank, and you've just finished writing the back-end code for some awesome ledger application that customers use to balance their accounts. Performance is great because it's C++, and the customers love it. You're looking forward to that big bonus that's surely coming your way. Then you get called to the Department Vice President's office. Seems that hackers have found a way to get into your program from its interface to the Internet and transferred money from other people's accounts into their own. Millions have been lost. Disaster! No bonus. No promotion. Nobody will sit with you in the cafeteria. Your kids get bullied on the playground. You'll be lucky to keep your now greatly reduced job.

The point of this story is that real-world programs, unlike the simple programs in this book, often have multiple interfaces. For example, any program that reads a port or connects to a database is susceptible to being hacked.

What is the hacker after?

· If you're lucky, the hacker is doing nothing more than exploiting some flaw in your program's logic to cause it to crash. As long as the program is crashed, no one else can use it. This is called a Denial of Service (DoS) attack because it denies the service provided by your program to everyone else.

DoS attacks can be expensive because they can cost your company lost revenue from business that doesn't get conducted or customers who give up in frustration because your program is not taking calls right now. And this doesn't even include the cost of someone going into the code to find and fix the susceptibility.

· Some hackers are trying to get access to information that your program has access to but to which the user has no right. A good example of this would be identity theft.

The loss of information is more than embarrassing as a good hacker may be able to use this information to turn around and steal. For example, armed with the proper credentials, the hacker can then call up a bank teller on the phone and order sums of money be transferred from our hacked customers' accounts to his own where he can subsequently withdraw the funds. This is commonly the case with SQL injection attacks, which I describe in the following section.

· Finally, some hackers are after remote control of your computer. If your program opens a connection to the Internet and a hacker can get your program to execute the proper system calls, that hacker can turn your program into a remote terminal into your system. From there, the hacker can download his own program onto your machine, and from then on you are said to be owned.

Perhaps the hacker wants access to your accounts, where he can steal money, or maybe he just wants your computer itself. This is the case with groups of owned computers that make up what is known as a botnet.

But how does this work? Your bank program has a very limited interface. It asks the user for his account number, his name, and the amount of his deposit. Nowhere does it say, “Would you like to take over this computer?” or “What extra code would you like this computer to execute?”

The two most common hacker tricks that you must deal with in your code are code injection and buffer overflow.


A bot-what?

The term botnet is a contraction of “robot network,” meaning a network of roboted (also called zombie) computers. A zombied computer runs along like normal as long as it's not needed. It can run spreadsheets and Code::Blocks and whatever else, but sitting deep in the background is a backdoor that's open to the person with the proper program and the passwords: the botnet master.

When the botnet master decides he needs the zombie computer, he sends commands to his slave, and it dutifully starts carrying out the master's instructions. The owner of the zombied computer may not even notice that there's anything wrong, other than the fact that his computer runs kind of slow sometimes.

Botnets can do lots of things, but one of their best tricks is to swamp legitimate Web sites with bogus requests in another form of Denial of Service attack. Suppose, for example, that you don’t like the Brotherhood of Aryan Goatherders, and you want to bring down their BAG site so that no one can read their lies. You try to swamp the site with requests from your computer, but you can't because the BAG's computer is just as fast as yours. So you buy four or five computers and have all of them hit their Web site at once. That works for a few minutes, but it doesn't take long to figure out that all these requests are coming from just a few source IP addresses, so the system administrator for the Brotherhood (very unfairly) blocks requests from your PCs!

But what if you could rent the services of a botnet army consisting of thousands of PCs all over the world? Each computer has to generate only a few requests per second in order to bring the BAG site completely to its knees. And what can the system administrator do about it? He can't block every PC he sees without blocking legitimate users of the site. The BAG might as well just give up and close the site down.


Understanding Code Injection

Code injection occurs when the user entices your program to execute some piece of user-created code. “What? My program would never do that!” you say. Consider the most common and, fortunately for us, easiest to understand variant of this little scam: SQL injection.

Examining an example SQL injection

Let me start with a few facts about SQL:

· SQL (often pronounced “sequel”) stands for Structured Query Language.

· SQL is the most common language for accessing databases.

· SQL is used almost universally in accessing relational databases.

· SQL is not the subject of this book.

This last bullet is important because I have no intent of teaching you SQL just so you can follow the examples presented here. If you don't already know SQL, it's sufficient to say that SQL is often interpreted at runtime. Very often, C++ statements will send an SQL query to a separate database server and then process and display whatever the server sends back. A typical SQL query within a C++ program might look like the following:

const char* query = "SELECT * FROM transactions WHERE accountID='123456789';";
results = submit(query);

This code says, “SELECT all of the fields FROM the transactions table WHERE the accountID (presumably one of the fields in the transaction table) is equal to 123456789 (the user's account id).” The submit() library function might send this query off to the database server. The database server would respond with all of the data it has on every transaction that the user has ever made on this account, which would get stored into the collection results. The program would then iterate through results, probably displaying the transactions in a table with each transaction on a separate row.

The user probably doesn't need that much data. Maybe just those transactions between startDate and endDate, two variables that the program reads from the user's query page. This more selective C++ program might contain a statement like the following:

string query = "SELECT * FROM transactions WHERE accountID='123456789'"
    " AND date > '" + startDate + "' AND date < '" + endDate + "';";

If the user enters 2013/10/1 for a startDate and 2013/11/1 for endDate, then the resulting query that gets sent to the database is the following:

SELECT * FROM transactions WHERE accountID='123456789' AND
date > '2013/10/1' AND date < '2013/11/1';

In other words, show all the transactions made in the month of October 2013. That makes sense. What's the problem?

The problem arises if the program just accepts whatever the user enters as start and end dates and plugs them into the query. It doesn't do any checking to make sure that the user is entering just a date and nothing but a date. This program is far too trusting.

What if a hacker were to enter 2013/10/1 for the startDate, but for the endDate he were to enter something like 2013/11/1' OR accountID='234567890. (Notice the unbalanced single quotes.) Now the combined SQL query that gets sent to the database server would look like

SELECT * FROM transactions WHERE accountID='123456789' AND
date > '2013/10/1' AND date < '2013/11/1' OR
accountID='234567890';

This says, “Show me all the transactions for the account 123456789 for the month of October 2013, plus all the transactions for some other account 234567890 that I don't own for any date.”
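To see the mechanics for yourself, here is a minimal sketch of the string pasting described above. The function name buildQuery() is hypothetical (it is not part of the bank program), and no database is involved; the function only assembles the text that would be sent to the server.

```cpp
#include <string>

// Builds the query the way the overly trusting program does:
// by pasting the raw user input between the single quotes.
// (buildQuery is a hypothetical helper used only for illustration.)
std::string buildQuery(const std::string& startDate,
                       const std::string& endDate)
{
    return "SELECT * FROM transactions WHERE accountID='123456789'"
           " AND date > '" + startDate +
           "' AND date < '" + endDate + "';";
}
```

Calling buildQuery("2013/10/1", "2013/11/1' OR accountID='234567890") returns the three-line query shown above as a single line of text. Nothing in the function can tell a date from a fragment of SQL, which is the whole problem.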

This little example may raise a few questions in the reader's mind: “How did the hacker know that he could enter SQL statements in place of dates?” He doesn't know — he just tries entering bogus SQL into every field that accepts character text and sees what happens. If the program complains, “That's not a legal date,” then the hacker knows that the program checks to make sure that input dates are valid and SQL injection won't work here. If, on the other hand, the program displays an error message like Illegal SQL statement, then the hacker knows that the program accepted the bogus input and shipped it off to the database server which then kicked it back. Success! Now all he has to do is formulate the query just right.

So how did the hacker know that the account ID was called accountID? He didn't know that either, but how long would it take to guess that one? Hackers are very persistent.

Finally, how did the hacker know that 234567890 was a valid account number? Again, he didn't — but do you really think that the hacker's going to stop there? Heck no. He's going to try every combination of digits he can think of until he finds some really big accounts with really big balances that are worth stealing from.

Let me assure you of three things:

· SQL injection was very common years ago.

· It was just this simple.

· With a better knowledge of SQL and some really tortured syntax, a good hacker can do almost anything he wants with an SQL injection like this.

So how can the programmer avoid this hack?

Avoiding code injection

The first rule of avoiding code injection is never, ever, allow user input to be processed by a general-purpose language interpreter. The error with the SQL-injection example was that the program accepted user input as if it were always a date and inserted it into an SQL query that it then shipped off to the database engine for processing.

The safest and most user-friendly approach would have been to provide the user a calendar graphic from which he could select the start and end dates. The program would then create a date based on what the user clicked. If this is not possible, then the program should have carefully checked the input to make sure that the input was in the proper format for a date, in this case yyyy/mm/dd — in other words, four digits followed by a slash followed by two digits and a slash and finally two more digits. Nothing else should be considered acceptable input.
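A format check like the one just described might look something like the following sketch. The name isDateFormat() is my own invention for illustration, and the function checks only the shape of the input (four digits, slash, two digits, slash, two digits); it makes no attempt to decide whether the month and day are in range.

```cpp
#include <cctype>
#include <string>

// Returns true only if s has the exact shape dddd/dd/dd.
// This checks the format only, not whether the date exists.
// (isDateFormat is a hypothetical name used for illustration.)
bool isDateFormat(const std::string& s)
{
    if (s.size() != 10)
    {
        return false;
    }
    for (std::string::size_type i = 0; i < s.size(); i++)
    {
        if (i == 4 || i == 7)
        {
            // positions 4 and 7 must hold the slashes
            if (s[i] != '/')
            {
                return false;
            }
        }
        else if (!std::isdigit(static_cast<unsigned char>(s[i])))
        {
            // every other position must hold a digit
            return false;
        }
    }
    return true;
}
```

With a check this strict, the user has to type 2013/10/01 rather than 2013/10/1, and the hacker's 2013/11/1' OR accountID='234567890 is rejected on length alone.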

Sometimes you can't be that specific about the format. If you must allow the user to enter flexible text, then you can at least avoid special characters. For example, it's pretty much impossible to do SQL code injection without using either a single or double quote. You can't insert HTML tags without using a less-than (<) and a greater-than (>) sign. Or you could just take the approach that anything other than ASCII letters, digits, and the underscore will not be tolerated:

// check some string 's' to make sure that it contains nothing
// but ASCII letters, digits, and underscores
string::size_type off = s.find_first_not_of(
    "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_");
if (off != string::npos)
{
    cerr << "Error\n";
}

This code searches the string s for a character that's not one of the characters A through Z, a through z, 0 through 9, or underscore. If it finds such a character, then the program rejects the input.

If you allow only the Latin characters shown here, your application will not be usable in the many markets that don't use the English character set (Arabic, Chinese, Hebrew, and Russian, to name just a few). You may have to take the opposite approach and just look for the bad characters.
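The opposite approach might be sketched as follows. The function name containsBadChar() is hypothetical, and the list of bad characters is limited to the four mentioned earlier (single quote, double quote, less than, and greater than); a real application would have to decide for itself which characters are dangerous for the interpreters it talks to.

```cpp
#include <string>

// Returns true if s contains any character from a small deny list.
// The list (quotes and angle brackets) is an assumption chosen to
// match the characters discussed in the text, not a complete list.
bool containsBadChar(const std::string& s)
{
    return s.find_first_of("'\"<>") != std::string::npos;
}
```

A deny list lets ordinary non-English text through, but it is only as good as your knowledge of which characters matter, so treat it as a fallback rather than a first choice.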

Overflowing Buffers for Fun and Profit

The second common hacker method that I present is the dreaded buffer overflow. First you'll see a very small program with a very big vulnerability. You'll see how this vulnerability comes about and how it can be exploited by a hacker. Then you'll see a number of different ways to mitigate the vulnerability.

Can I see an example?

Consider the smallest, simplest hackable program that I could devise:

// BufferOverflow - this program demonstrates how a
//                  program that reads data into a fixed
//                  length buffer without checking can be
//                  hacked
#include <cstdio>
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <cstring>
#include <string>

using namespace std;

// getString - read a string of input from the user prompt
//             and return it to the caller
char* getString(istream& cin)
{
    char buffer[64];

    // now input a string from the file
    char* pB;
    for(pB = buffer;*pB = cin.get(); pB++)
    {
        if (cin.eof())
        {
            break;
        }
    }
    *pB = '\0';

    // return a copy of the string to the caller
    pB = new char[strlen(buffer) + 1];
    strcpy(pB, buffer);
    return pB;
}

int main(int argc, char* pArgv[])
{
    // get the name of the file to read
    cout <<"This program reads input from an input file\n"
           "Enter the name of the file:";
    string sName;
    cin >> sName;

    // open the file
    ifstream c(sName.c_str());
    if (!c)
    {
        cout << "\nError opening input file" << endl;
        exit(-1);
    }

    // read the file's content into a string
    char* pB = getString(c);

    // output what we got
    cout << "\nWe successfully read in:\n" << pB << endl;

    cout << "Press Enter to continue..." << endl;
    cin.ignore(10, '\n');
    cin.get();
    printf("Done!");
    exit(0);
    return 0;
}

This program starts by prompting the user for the name of a file. The program then opens that file and passes the open file stream to the function getString(). This function does nothing more than read the contents of the file into a buffer, create a copy of that buffer in a memory block that it allocates off of the heap, and then return that chunk of heap memory to the caller.

The output from a sample run of this program appears as follows:

This program reads input from an input file
Enter the name of the file:OK_File.txt

We successfully read in:
This is benign input.
Press Enter to continue...

Here the user told the program to read the file OK_File.txt and display the results, which it did.

Code::Blocks for Windows opens the console application in the project directory, so all you need to enter is the file name OK_File.txt as shown. Code::Blocks for Macintosh opens the console window in your user directory, so you need to enter the entire path to the file: Desktop/CPP_Programs_from_Book/Chap28/BufferOverflow/OK_File.txt (assuming that you installed the source files in the default location). This same tip is applicable to every file in this chapter.

The problem with this program lies in getString(). The programmer was told that each input file contains a short string of not more than 20 characters. Not wanting to be stingy, she allocated a 64-character buffer just to make sure that there was enough room to hold the file contents. The file OK_File.txt contains the string This is benign input. which may have been a little longer than the promised 20 characters but fits comfortably within the 64-character buffer. But let's try the program again with the file Big_File.txt; the output of this run is shown in Figure 28-1.

Figure 28-1: The result of executing the BufferOverflow program on Big_File.txt.

When presented with this new file, the BufferOverflow program crashed rather than generating any reasonable output.

What you don't know is that the file Big_File.txt contains the following:

ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789
abcdefghijklmnopqrstuvwxyz0123456789
ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789
abcdefghijklmnopqrstuvwxyz0123456789
ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789
abcdefghijklmnopqrstuvwxyz0123456789
ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789
abcdefghijklmnopqrstuvwxyz0123456789

“Wait a minute!” you say. “That's not fair. That file contains more than 20 characters.” True. And it contains more than 64 characters, and for some reason that caused the program to crash. Hackers don't play fair.

How does a call stack up?

This entire section is fairly technical. You can skip it if you're not into the details of computer memory.

Consider how computer memory is laid out: There are variables known as global variables that are accessible to all functions. These variables reside at fixed memory locations so that everyone can find them. But most variables are declared within the scope of a single function. The memory for these variables is allocated when the function is called and is deallocated when the function returns. Computers do this through a mechanism known as the stack.

The stack pointer (which in assembly language parlance normally carries the name ESP) points to the top of the stack, that is, to the most recently pushed value. A function can invoke a PUSH instruction to save the value in a register onto the stack. This automatically decrements the ESP so that the memory isn't used for something else. A corresponding POP instruction restores the value to the register and increments the ESP back to its original pre-PUSH location.
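If PUSH and POP are new to you, the following toy model may help. It is only a sketch: ToyStack is a made-up type, the "addresses" are simply indexes into a vector of 32-bit words rather than real byte addresses, and there is no checking for pushing too much or popping too often.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// A toy model of a stack that grows downward, the way the 80x86
// stack does. (ToyStack is hypothetical and exists only to
// illustrate PUSH and POP; it performs no bounds checking.)
struct ToyStack
{
    std::vector<std::uint32_t> memory;
    std::size_t esp;   // index of the most recently pushed word

    explicit ToyStack(std::size_t words)
      : memory(words, 0), esp(words) {}

    void push(std::uint32_t value)
    {
        --esp;                 // PUSH first moves the stack pointer down
        memory[esp] = value;   // ...and then stores the value there
    }

    std::uint32_t pop()
    {
        std::uint32_t value = memory[esp];  // POP reads the value
        ++esp;                              // ...and moves the pointer back up
        return value;
    }
};
```

Create a ToyStack of, say, 8 words, push a value, and esp drops from 8 to 7. Pop it again and you get the same value back with esp restored to 8.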

Another value that gets pushed onto the stack whenever a function is called is the return address. The 80x86 instruction CALL getString pushes the address of the next instruction onto the stack and then jumps to the address of the getString() function. This is shown graphically as a busy but interesting capture from the Code::Blocks debugger in Figure 28-2.

Figure 28-2: The ESP and the stack memory immediately before the call to getString().

The program is stopped at the beginning of the call to getString(), which you can tell by the yellow arrows in Figure 28-2, both in the right source view that shows only the C++ source and in the left mixed disassembly view that shows the C++ source and the 80x86 assembly language that was generated. Notice on the left that the address of the instruction after the CALL to getString() is 0x0046AA6C. The CPU Registers window shows that the value of the ESP is 0x0028FDC0.

Figure 28-3 shows the same windows immediately after the call to getString(). Notice that the ESP has been decremented by 4 bytes (the size of a return address) to 0x0028FDBC and that the ESP now points to the value 0x0046AA6C, the address of the next instruction after the CALL. This is called the return address.

What you actually see on the stack in Figure 28-3 is 6C-AA-46-00. This is because the 80x86 processor stores all values with the least significant byte at the smallest address. This is called Little Endian.
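The byte order is easy to work out in code. The following sketch (the function name toLittleEndian() is mine) splits a 32-bit value into 4 bytes, lowest address first, using shifts so that the result doesn't depend on the machine you run it on.

```cpp
#include <array>
#include <cstdint>

// Splits a 32-bit value into the 4 bytes that a Little Endian
// machine would store, lowest address first.
// (toLittleEndian is a hypothetical helper for illustration.)
std::array<std::uint8_t, 4> toLittleEndian(std::uint32_t value)
{
    std::array<std::uint8_t, 4> bytes{};
    for (int i = 0; i < 4; i++)
    {
        // byte i holds bits 8*i through 8*i+7 of the value
        bytes[i] = static_cast<std::uint8_t>((value >> (8 * i)) & 0xFF);
    }
    return bytes;
}
```

Passing in the return address 0x0046AA6C produces the bytes 6C, AA, 46, and 00, in that order.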

Figure 28-3: The ESP and the stack memory immediately after the call to getString().

Figure 28-4 shows the situation immediately after a successful return from getString(). The small yellow arrow in the disassembly window shows that the instruction pointer is indeed pointing to the instruction immediately after the CALL and the ESP has returned to its former value of 0x0028FDC0.

That's all very nice, but so what? Well, C++ also stores locally defined variables on the stack. For example, the 64-byte buffer in getString() is stored on the stack. As long as the program writes only 64 bytes (or fewer) into this buffer, everything is fine; but if the program tries to write more data into buffer than buffer can hold, the remaining data spills over and starts overwriting other data. If the program writes far enough, it will eventually overwrite the return address. This is exactly what happened when getString() read the oversized Big_File.txt. This is shown in Figure 28-5.

Figure 28-4: The ESP and the stack memory immediately after the return from getString().

Figure 28-5: The return address on the stack is overwritten when getString() tries to read Big_File.txt.

You can see that the location 0x0028FDBC no longer contains the return address, but rather the value 0x46454443, which happens to be the ASCII characters “FEDC”, which you can also see along with many of the other characters from Big_File.txt on the right of Figure 28-5.

Remember to read the bytes from right to left since the 80x86 is Little Endian.

This doesn't cause a problem as long as the program is processing through getString(), but when the program tries to return, the return address that's on the stack is not a return address at all. Instead, it points to some illegal address, and the program crashes as soon as it executes the RET instruction at the end of getString().
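You can reproduce the damage safely by simulating the stack frame in an ordinary vector. The sketch below makes one big assumption: that the return address sits 76 bytes past the start of buffer. That distance matches the builds discussed in this chapter as far as I can tell, but it depends entirely on the compiler and its settings. The names simulateOverflow() and kReturnOffset are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Assumption for this sketch: the return address sits 76 bytes past
// the start of buffer. The real distance depends on the build.
const std::size_t kReturnOffset = 76;

// Copies 'input' into a simulated stack frame without checking the
// 64-byte buffer size, then returns whatever ends up in the 4-byte
// return-address slot (read as a Little Endian value).
std::uint32_t simulateOverflow(const std::string& input,
                               std::uint32_t returnAddress)
{
    // the simulated frame is big enough that this demo itself is safe
    std::vector<std::uint8_t> frame(kReturnOffset + 4 + input.size(), 0);

    // store the genuine return address, least significant byte first
    for (std::size_t i = 0; i < 4; i++)
    {
        frame[kReturnOffset + i] =
            static_cast<std::uint8_t>((returnAddress >> (8 * i)) & 0xFF);
    }

    // the unchecked copy: nothing stops at byte 64
    for (std::size_t i = 0; i < input.size(); i++)
    {
        frame[i] = static_cast<std::uint8_t>(input[i]);
    }

    // read back the return-address slot
    std::uint32_t result = 0;
    for (std::size_t i = 0; i < 4; i++)
    {
        result |= static_cast<std::uint32_t>(frame[kReturnOffset + i])
                  << (8 * i);
    }
    return result;
}
```

Feed this function the benign text and the return address 0x0046AA6C comes back untouched. Feed it the first few lines of Big_File.txt (each line followed by a single newline character) and the slot ends up holding the characters C, D, E, and F, which read as the Little Endian value 0x46454443.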

Hacking BufferOverflow

The BufferOverflow program crashed because the contents of Big_File.txt overflowed buffer and overwrote the return address within the function getString(). When the function attempted to execute a return instruction, control passed to some garbage address, and the program crashed.

But what if you could engineer the text file so that it overwrote the return address not with crazy ASCII characters, but with the address of some code that you wanted to force the program to execute? When getString() executed a RET, it wouldn't crash, it would go off and execute the code you want it to.

But where could you put this extra code? What better place than within the text that's already been read into buffer? So the hack goes like this:

1. Create a machine language program that does whatever you want the program to do and insert it into the input file first.

2. Make sure that the input overflows the buffer just far enough that the return address gets overwritten with the address of buffer itself.

3. When the program reads the text into the buffer, it will in effect load the hacker code into buffer and then overwrite the return address.

4. When getString() tries to return to where it was called in main(), control will pass to the beginning of buffer, where the hacker code gets executed.

This sounds pretty tricky, and actually, it is. But remember that the hacker can execute your program as often as he wants. By running it under a good debugger, he can figure out how big the buffer is and what address to use for buffer.

Just to show you that such a thing is possible, check out the following run:

C:\CPP_Programs\Chap28\BufferOverflow>BufferOverflow
This program reads input from an input file
Enter the name of the file:BO_File.txt
You've been hacked!
C:\CPP_Programs\Chap28\BufferOverflow>

Here the program starts out like normal by prompting the user for an input file. This time the user entered the file name BO_File.txt. In response, the program didn't output the contents of the file as you might expect, nor did it crash. Instead, in response to this file, the program output the ominous message “You've been hacked!” and exited. Notice in particular that the program didn't output the normal “Press Enter to continue…”. This program went directly to Jail, didn't pass Go, and didn't collect $200! Control never returned from getString() back to main().

In fact, the file BO_File.txt (which stands for Buffer Overflow File, by the way) contains a small machine language program that outputs the message “You've been hacked!” and then calls exit(0) to exit normally. In addition, it's crafted in just such a way that it overwrites the return address with the beginning of the buffer to cause this program to be executed when getString() attempts to return, just as described earlier.

The details of a hack like this are very specific to exactly how the executable file is laid out in memory. This particular version of BO_File.txt works only on versions of BufferOverflow built for Windows with a particular version of gcc. This is not a limitation of the overflow hack itself; I could create a version of BO_File.txt for Linux or Macintosh and for a different version of gcc. Since you may not be using the same version of gcc that I am, I have included the .exe executable in the BufferOverflow directory right next to the source code. To execute this version, you will need to open a console in Windows, navigate to the proper directory (in my case, C:\CPP_Programs_from_Book\Chap28\BufferOverflow), and enter the command BufferOverflow.


How did this hack work?

Let me start off by saying that the point of this chapter is not to teach you how to hack other people's programs — the point is to keep you from being hacked yourself. Let me also say that the details of this hack have nothing to do with learning C++ programming, so feel free to skip this sidebar if you want. However, it seems only fair that you get to see how this hack worked in detail. If you are familiar with 80x86 assembly language, you will probably be able to follow this small program. If not, then you may want to just accept my assurances that it works and kick the can on down the road.

The Hex Editor that comes with Code::Blocks displays the contents of the BO_File.txt file as follows:

0000: 90 90 55 89 E5 31 C0 B0 F8 29 C4 90 90 EB 24 31 U 1 ) $1
0010: C0 8B 1C E4 36 88 43 13 B8 45 AA 47 01 2D 01 01 6 C U G -
0020: 01 01 FF D0 31 C0 50 B8 F9 FE 42 01 2D 01 01 01 1 P B -
0030: 01 FF D0 E8 D7 FF FF FF 59 6F 75 27 76 65 20 62 You've b
0040: 65 65 6E 20 68 61 63 6B 65 64 21 90 70 FD 28 00 een hacked! p (

That's not very enlightening. Other than the string You've been hacked!, the remainder of the file appears to be garbage. Let's try an 80x86 disassembler.

; set up a stack frame to protect our code from being
; overwritten when we make a function call below
; we do this by subtracting a big number like F8 from ESP
entryPoint:
NOP ; 90
NOP ; 90
PUSH EBP ; 55
MOV ESP,EBP ; 89 E5
XOR EAX,EAX ; 31 C0
MOV F8,AL ; B0 F8
SUB EAX,ESP ; 29 C4

; the following can be replaced by an INT 3 (0xCC) during debug and test
NOP ; 90
NOP ; 90

; put the address of the output message on the stack by jumping to a call
JMP label2 ; EB 24
label1:

; null terminate the string by writing a 0 to *ESP + 13
XOR EAX,EAX ; 31 C0
MOV [ESP],EBX ; 8B 1C E4
MOV AL,SS:[BX+13] ; 36 88 43 13

; now call print (but can't have any zeros in the address)
; this value changes every time you rebuild the program!
MOV print+01010101,EAX ; B8 45 AA 47 01
SUB 01010101,EAX ; 2D 01 01 01 01
CALL EAX ; FF D0 (calls 0046A944)

; and then call exit passing a 0 (this call doesn't return)
XOR EAX,EAX ; 31 C0
PUSH EAX ; 50
MOV exit+01010101,EAX ; B8 F9 FE 42 01
SUB 01010101,EAX ; 2D 01 01 01 01
CALL EAX ; FF D0 (calls 0041FDF8)

label2:
CALL label1 ; E8 D7 FF FF FF
"You've been hacked!"
90 ; this will be overwritten by the terminating null
address of entryPoint ; 70 FD 28 00
; this will overwrite the return address

Of course, the disassembler didn't create the comments — I've added those to help you out a bit.

The most important part of this program is the last 4 bytes. These overwrite the return address with the address 0x0028FD70, which is the address of buffer on the stack. How did I know that? I had to single-step the program with an assembly language debugger and note the address of buffer myself.

The getString() function copies this file into the fixed length buffer, dutifully overwriting its own return address before encountering the terminating NULL. It goes on to make a copy of this string out of heap memory, a process that we care nothing about. When getString() tries to return to main(), control passes to the label entryPoint.

The first couple of instructions do nothing — NOP stands for No Op or No Operation. These are there in case the hack misses the address by a few bytes.

The next few instructions are very important. After getString() executes a return, buffer is no longer in scope. This means that all of your code is vulnerable to being overwritten if an interrupt occurs or the next time a function is called. This small section of code moves the ESP around the small program so that it is not overwritten by the upcoming function calls.

The next small section of code is where I hard-coded breakpoint instructions (INT3 or 0xCC) when I was debugging this code. They appear as NOPs in the production version that you are seeing.

The next JMP instruction jumps down to the label label2. The CALL instruction located here first pushes the address of the following instruction, which is actually the address of the string You've been hacked!, and then jumps back to label1. This sleight of hand is the hacker's way of pushing the address of the string onto the stack. That done, the program then makes sure that the string is null terminated by writing a 0 at the location 0x13 (19) bytes deep into the string, just past its 19 characters. (The XOR EAX,EAX, which means EXCLUSIVE OR the contents of the EAX register with itself, puts a zero in the EAX register.)

The next block of code actually does nothing more than call the print() function, which is located at 0x0046A944. Unfortunately, the program can't call this function directly since its address contains a null byte. This null byte would cause the copying of the block to terminate before overwriting the return address. To avoid this, I added a 1 to each byte of the address stored in the file, and then I subtract it back out before I use the address. The program copies 0x0147AA45 into the EAX register and then subtracts 0x01010101 to calculate the desired address. The CALL EAX calls the resulting address contained in the EAX register. This outputs the "You've been hacked!" message.

How did I know that print() was located at 0x0046A944? By examining the call to print() in main().
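The add-one-to-every-byte arithmetic is easy to check. The following sketch uses three hypothetical helper functions of my own (hasZeroByte(), encodeAddress(), and decodeAddress()) to show that the encoded form of each address contains no null bytes and that subtracting 0x01010101 recovers the original.

```cpp
#include <cstdint>

// Returns true if any of the 4 bytes of value is zero.
bool hasZeroByte(std::uint32_t value)
{
    for (int i = 0; i < 4; i++)
    {
        if (((value >> (8 * i)) & 0xFF) == 0)
        {
            return true;
        }
    }
    return false;
}

// Adds 1 to every byte of the address so that it can be stored in
// the file. (This works only if no byte of the address is 0xFF,
// since that byte would carry into its neighbor.)
std::uint32_t encodeAddress(std::uint32_t address)
{
    return address + 0x01010101u;
}

// Undoes encodeAddress(); this is the job of the SUB instruction.
std::uint32_t decodeAddress(std::uint32_t encoded)
{
    return encoded - 0x01010101u;
}
```

For print(), encodeAddress(0x0046A944) gives 0x0147AA45, which has no zero bytes, and decodeAddress() turns it back into 0x0046A944. The same arithmetic turns the stored value 0x0142FEF9 into 0x0041FDF8 for the call to exit().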

The final block calls the exit() function using the same trick to terminate the program. Control does not return from exit().


Avoiding buffer overflow — first attempt

You can look at the hackable error in getString() as a combination of two problems: The programmer used a fixed-length buffer, and she assumed that the input would not overflow that buffer. This error can be fixed by addressing either one of these assumptions.

The following NoBufferOverflow1 program addresses the second assumption by making sure that the input does not exceed the size allocated to the fixed-size buffer:

// NoBufferOverflow1 - this program avoids being hacked by
// limiting the amount of input into a fixed buffer
#include <cstdio>
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <cstring>
#include <string>

using namespace std;

// getString - read a string of input from the user prompt
// and return it to the caller
char* getString(istream& cin)
{
    char buffer[64];

    // now input a string from the file
    // (but not more than our buffer will hold)
    int i;
    for(i = 0; i < 63; i++)
    {
        // read the next character into the buffer
        buffer[i] = cin.get();

        // exit the loop if we read a null or EOF
        if ((buffer[i] == 0) || cin.eof())
        {
            break;
        }
    }
    // make sure that the buffer is null terminated
    buffer[i] = '\0';

    // return a copy of the string to the caller
    // (new throws std::bad_alloc on failure rather than
    // returning nullptr, so no null check is needed)
    char* pB = new char[strlen(buffer) + 1];
    strcpy(pB, buffer);
    return pB;
}

This version of getString() reads input from the file until one of three things happens: the function reads a null, the function reads an End of File, or the function reads 63 bytes.

Remember to leave 1 extra byte for the terminating null.

The output of this program to all three files is as you would expect. The OK_File.txt outputs a benign message:

This program reads input from an input file
Enter the name of the file:OK_File.txt

We successfully read in:
This is benign input.
Press any key to continue...

The program outputs only the first 63 bytes of Big_File.txt but, hey, it doesn't crash:

This program reads input from an input file
Enter the name of the file:Big_File.txt

We successfully read in:
ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789
abcdefghijklmnopqrstuvwxyz
Press any key to continue...

The program also reads in the first 63 bytes of our hack program contained in BO_File.txt, but since it doesn't exceed the limits of buffer and therefore doesn't overwrite the return address, no harm is done and there is no hack.

This program reads input from an input file
Enter the name of the file:BO_File.txt

We successfully read in:
ÉÉUëσ1└░°)─ÉÉδ$1└ï∟Σ6êC‼╕U¬G☺-☺☺☺☺ ╨1└P╕∙■B☺-☺☺☺☺ ╨Φ╫ You've
Press Enter to continue...

This is the normal way to avoid buffer overflow: Make sure that you don't copy more data into the buffer than the buffer can hold, no matter what kind of garbage is contained in the input.

Avoiding buffer overflow — second attempt

An alternative approach to avoiding buffer overflow is to make sure that the buffer can grow to accommodate the size of the input. There are several flexible-size containers in the Standard Template Library. The most common is the vector class. (See Chapter 27 for details.)

// NoBufferOverflow2 - this program avoids being hacked by
// using a variable-size buffer
#include <cstdio>
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <cstring>
#include <string>
#include <vector>

using namespace std;

// getString - read a string of input from the user prompt
// and return it to the caller
char* getString(istream& cin)
{
    // create a variable-size buffer with an initial
    // capacity of 64 characters; however, this buffer can
    // grow if there are more than 64 characters in the
    // input file
    vector<char> buffer;
    buffer.reserve(64);

    // now input a string from the file
    for(;;)
    {
        // read the next character
        char c = cin.get();

        // exit the loop if we read a null or EOF
        if ((c == 0) || cin.eof())
        {
            break;
        }

        // add the character to the buffer and grow the
        // buffer if necessary to accommodate
        buffer.push_back(c);
    }
    // make sure that the buffer is null terminated
    buffer.push_back('\0');

    // return a copy of the string to the caller
    // (new throws std::bad_alloc on failure rather than
    // returning nullptr, so no null check is needed)
    char* pB = new char[buffer.size()];
    strcpy(pB, buffer.data());
    return pB;
}

This version of getString() creates a variable-size vector of char objects. The function reserves an initial capacity of 64 characters, but buffer will grow automatically if necessary. Once in the loop, the function uses the function push_back() to push each character onto the end of the vector.

The vector class overloads the bracket operator, so I could have said buffer[index] = c; however, in order to improve performance, the bracket operator does not check for buffer overflow. (The at() method does check, and it throws an exception if the index is out of range.) The push_back() method first checks that there is enough room in the buffer to handle the character being added. If not, push_back() allocates a larger buffer (typically one and a half to two times as big as the first, depending on the library) and copies the contents of the smaller buffer into the larger. It repeats this process every time it needs more room in the input buffer.
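You can watch both behaviors with a few lines of code. The following is a minimal sketch; the function names atRejectsBadIndex() and countReallocations() are hypothetical. The exact growth pattern depends on your library, so the sketch only counts how often the capacity changes rather than assuming a particular factor.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Returns true if at() refuses an index that is past the end
// of the vector by throwing std::out_of_range.
bool atRejectsBadIndex()
{
    std::vector<char> buffer(4, 'x');
    try
    {
        buffer.at(10) = 'y';   // only 4 elements exist
    }
    catch (const std::out_of_range&)
    {
        return true;
    }
    return false;
}

// Counts how many times push_back() had to move to a larger
// buffer while appending count characters to a vector that
// started with a reserved capacity of 64.
std::size_t countReallocations(std::size_t count)
{
    std::vector<char> buffer;
    buffer.reserve(64);

    std::size_t reallocations = 0;
    std::size_t lastCapacity = buffer.capacity();
    for (std::size_t i = 0; i < count; i++)
    {
        buffer.push_back('A');
        if (buffer.capacity() != lastCapacity)
        {
            reallocations++;
            lastCapacity = buffer.capacity();
        }
    }
    return reallocations;
}
```

Appending 64 characters should cause no reallocation at all, and appending thousands should cause only a handful because the buffer grows by a multiple each time.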

The output of NoBufferOverflow2 is indistinguishable from the fixed buffer version when reading small files:

This program reads input from an input file
Enter the name of the file:OK_File.txt

We successfully read in:
This is benign input.
Press Enter to continue...

However, the output differs from the fixed buffer version when reading really large files:

This program reads input from an input file
Enter the name of the file:Big_File.txt

We successfully read in:
ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789
abcdefghijklmnopqrstuvwxyz0123456789
ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789
abcdefghijklmnopqrstuvwxyz0123456789
ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789
abcdefghijklmnopqrstuvwxyz0123456789
ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789
abcdefghijklmnopqrstuvwxyz0123456789
Press Enter to continue...

You can see that this version of getString() reads the entire input file rather than chopping off input at 63 bytes.

Similarly, NoBufferOverflow2 has no problem reading the buffer overflow hack file:

This program reads input from an input file
Enter the name of the file:BO_file.txt

We successfully read in:
ÉÉUëσ1└░°)─ÉÉδ$1└ï∟Σ6êC‼╕U¬G☺-☺☺☺☺ ╨1└P╕∙■B☺-☺☺☺☺ ╨Φ╫ You've been hacked!Ép²(
Press Enter to continue...

A lot of garbage gets printed out, but no hack occurs.

Another argument for the string class

In a way, all of the buffer overflow examples in this chapter are a bit contrived. In actual practice, the safest approach would have been to read input into an object of class string. Most of the functions associated with string are designed to vary the size of the internal buffer to accommodate the amount of input.

// NoBufferOverflow3 - this program avoids being hacked by
// using the string class
#include <cstdio>
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <string>

using namespace std;

// getString - read a string of input from the user prompt
// and return it to the caller. Terminate the
// string at a null or the end-of-file
string getString(istream& cin)
{
    string s;
    getline(cin, s, '\0');
    return s;
}

The call to getline() says read from cin into the string s until either a null or an end-of-file is encountered (the EOF is implied in every call to getline()). The size of the buffer in s is not fixed but expands to hold whatever is thrown at it.
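You don't need an input file to try this call out. The following is a minimal sketch, assuming an in-memory istringstream stands in for the file; the function name readUntilNull() is hypothetical.

```cpp
#include <sstream>
#include <string>

// Hypothetical test helper: feed raw bytes to the same
// getline() call that getString() uses, without touching
// the file system.
std::string readUntilNull(const std::string& rawInput)
{
    std::istringstream in(rawInput);
    std::string s;

    // stop at the first null or at the end of the input
    std::getline(in, s, '\0');
    return s;
}
```

Passing in a string of 100,000 characters returns all 100,000 of them, while passing in a string with an embedded null returns only the part in front of the null.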

Just as before, the output of this version is indistinguishable from the others when reading a benign file:

This program reads input from an input file
Enter the name of the file:OK_File.txt

We successfully read in:
This is benign input.
Press Enter to continue...

The output of this version is also identical to NoBufferOverflow2 for the oversized cases such as Big_File.txt:

This program reads input from an input file
Enter the name of the file:Big_File.txt

We successfully read in:
ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789
abcdefghijklmnopqrstuvwxyz0123456789
ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789
abcdefghijklmnopqrstuvwxyz0123456789
ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789
abcdefghijklmnopqrstuvwxyz0123456789
ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789
abcdefghijklmnopqrstuvwxyz0123456789
Press Enter to continue...

And for the buffer overflow BO_File.txt case:

This program reads input from an input file
Enter the name of the file:BO_file.txt

We successfully read in:
ÉÉUëσ1└░°)─ÉÉδ$1└ï∟Σ6êC‼╕U¬G☺-☺☺☺☺ ╨1└P╕∙■B☺-☺☺☺☺ ╨Φ╫ You've been hacked!Ép²(
Press Enter to continue...

Again, garbage but no hack.

Why not always use string functions?

Given the relative simplicity of the NoBufferOverflow3 program compared with the other two, why wouldn't a programmer always use the string class and its associated functions? In a way, the answer is, “You should.” But you need to keep in mind that internally this program is every bit as complicated as the vector-based NoBufferOverflow2 version. The getline() function is managing a variable-size buffer, much like vector, for you. Even though you may not be making all these extra calls yourself, the calls are being made, and the performance of the function reflects that fact. The versions of getString() that rely on fixed-size buffers can be considerably faster than those that use variable-size structures.

This difference is not noticeable if the program calls getString() only once or even only a thousand times, but it can be considerable if the function is called in the middle of a very time-critical loop.

Thus, the string or vector versions of getString() are the way to go for general use, but there may be conditions that justify the use of fixed-size buffers.
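If you want to know whether the difference matters in your program, measure it. The following is a minimal timing sketch, not a rigorous benchmark; the names readFixed(), readString(), and timeCalls() are hypothetical, and the numbers you get depend on your compiler, optimization settings, and library.

```cpp
#include <chrono>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Fixed-size version: read at most 63 characters into a
// buffer on the stack and report how many were read.
std::size_t readFixed(std::istream& in)
{
    char buffer[64];
    in.get(buffer, sizeof(buffer), '\0');
    return static_cast<std::size_t>(in.gcount());
}

// Variable-size version: let the string grow as needed.
std::size_t readString(std::istream& in)
{
    std::string s;
    std::getline(in, s, '\0');
    return s.size();
}

// Calls reader on the same in-memory input the given number
// of times and returns the elapsed time in microseconds.
// totalLength accumulates the results so that the compiler
// can't discard the calls.
template <typename Reader>
long long timeCalls(Reader reader, const std::string& input,
                    int calls, std::size_t& totalLength)
{
    std::istringstream in(input);
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < calls; i++)
    {
        in.clear();    // reset EOF flag from the previous pass
        in.seekg(0);   // rewind to the start of the input
        totalLength += reader(in);
    }
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(
               stop - start).count();
}
```

Call timeCalls() once with readFixed and once with readString on the same input, and compare the two results over a large number of calls.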

There is no measurable difference in performance between the version of getString() that does not check for buffer overflow and the one that does. There is no justification for leaving yourself exposed to hacking by buffer overflow even if you're trying to shave a few instructions off of the execution time.


Let the operating system help

CPU manufacturers and operating system vendors have combined to devise ways to help avoid buffer overflow hacks. Two of the most common are Address Space Layout Randomization (ASLR) and Data Execution Prevention (DEP).

One of the Achilles heels of the preceding hack is that I had to hard-code the address of buffer on the stack. I could do this because my version of Windows always loads that particular program, BufferOverflow.exe, the same way. But what if it were to vary things a bit every time it executed the program? For example, what if the operating system added some small random offset to the stack pointer before executing the program each time? It wouldn't make any difference to the program, but it would make it impossible for the hacker to know what value to overwrite the return address with since the address of buffer would be slightly different every time the program executed. This practice of moving memory around is known as Address Space Layout Randomization (ASLR).

Another vulnerability of this hack is the fact that, at least for a short time, the processor was being asked to execute machine instructions that were stored in an area reserved for data (namely, the machine code that got loaded into buffer). Most 80x86 processors have a capability, known as Data Execution Prevention (DEP), to mark regions of memory as either executable or not executable (using a flag known as the NX flag). Operating systems that support DEP mark memory where code is stored as executable while marking areas intended only for data, such as the stack where buffer is stored, as not executable. This buffer overflow hack would have been trapped by the processor as soon as control passed to the beginning of buffer. The CPU would have thrown an exception that someone was trying to execute instructions stored in non-executable memory, a no-no of the first order. The operating system would catch the exception and immediately throw the miscreant program out of memory before any hacker harm could be done.

In Windows Vista and later, DEP is enabled for most Windows kernel processes and many applications to prevent hackers from gaining administrator privileges, but it is often not enabled for user code. The Task Manager will show you which processes have DEP enabled and which do not (you have to enable this column; it is not displayed by default). This figure shows the output of the Task Manager on my Windows 7 machine while BufferOverflow is executing. Notice that the process created is called BufferOverflow.exe*32, indicating that this is a 32-bit process. Also notice that DEP is disabled for this process. There doesn't appear to be a way to tell gcc to generate code that enables DEP on Windows.

[Figure: Windows 7 Task Manager listing BufferOverflow.exe*32 with DEP disabled]