Praise for Gray Hat Hacking: The Ethical Hacker’s Handbook, Fourth Edition (2015)
PART I. Crash Course: Preparing for the War
CHAPTER 4. Advanced Analysis with IDA Pro
In this chapter, you will be introduced to features of IDA Pro that will help you analyze binary code more efficiently and with greater confidence. Out of the box, IDA Pro is already one of the most powerful binary analysis tools available. The range of processors and binary file formats that IDA Pro can process is more than many users will ever need. Likewise, the disassembly view provides all of the capability that the majority of users will ever want. Occasionally, however, a binary will be sufficiently sophisticated or complex that you will need to take advantage of IDA Pro’s advanced features to fully comprehend what the binary does. In other cases, you may find that IDA Pro does a large percentage of what you wish to do, and you would like to pick up from there with additional automated processing.
In this chapter, we cover the following topics:
• Static analysis challenges
• Extending IDA Pro
Static Analysis Challenges
Any nontrivial binary generally presents several challenges that must be overcome before it can be analyzed effectively. Examples of challenges you might encounter include the following:
• Binaries that have been stripped of some or all of their symbol information
• Binaries that have been linked with static libraries
• Binaries that make use of complex, user-defined data structures
• Compiled C++ programs that make use of polymorphism
• Binaries that have been obfuscated in some manner to hinder analysis
• Binaries that use instruction sets with which IDA Pro is not familiar
• Binaries that use file formats with which IDA Pro is not familiar
IDA Pro is equipped to deal with all of these challenges to varying degrees, though its documentation may not indicate that. One of the first things you need to learn to accept as an IDA Pro user is that there is no user’s manual, and the help files are pretty terse. Familiarize yourself with the available online IDA Pro resources; aside from your own hunting around and poking at IDA Pro, they will be your primary means of answering questions. Some sites that have strong communities of IDA Pro users include OpenRCE (www.openrce.org), Hex Blog (www.hexblog.com), and the IDA Pro support boards at the Hex-Rays website (see the “For Further Reading” section at the end of the chapter for more details).
Stripped Binaries
The process of building software generally consists of several phases. In a typical C/C++ environment, you will encounter at a minimum the preprocessor, compilation, and linking phases before an executable can be produced. For follow-on phases to correctly combine the results of previous phases, intermediate files often contain information specific to the next build phase. For example, the compiler embeds into object files a lot of information that is specifically designed to assist the linker in doing its job of combining those object files into a single executable or library. Among other things, this information includes the names of all the functions and global variables within the object file. Once the linker has done its job, however, this information is no longer necessary. Quite frequently, all of this information is carried forward by the linker and remains present in the final executable file, where it can be examined by tools such as IDA Pro to learn what all the functions within a program were originally named. If we assume—which can be dangerous—that programmers tend to name functions and variables according to their purpose, then we can learn a tremendous amount of information simply by having these symbol names available to us.
The process of “stripping” a binary involves removing all symbol information that is no longer required once the binary has been built. Stripping is generally performed by using the command-line strip utility and, as a result of removing extraneous information, has the side effect of yielding a smaller binary. From a reverse-engineering perspective, however, stripping makes a binary slightly more difficult to analyze as a result of the loss of all the symbols. In this regard, stripping a binary can be seen as a primitive form of obfuscation. The most immediate impact of dealing with a stripped binary in IDA Pro is that IDA Pro will be unable to locate the main() function and will instead initially position the disassembly view at the program’s true entry point, generally named _start.
NOTE Contrary to popular belief, main is not the first thing executed in a compiled C or C++ program. A significant amount of initialization must take place before control can be transferred to main. Some of the startup tasks include initialization of the C libraries, initialization of global objects, and creation of the argv and envp arguments expected by main.
You will seldom desire to reverse-engineer all of the startup code added by the compiler, so locating main is a handy thing to be able to do. Fortunately, each compiler tends to have its own style of initialization code, so with practice you will be able to recognize the compiler that was used based simply on the startup sequence. Because the last thing the startup sequence does is transfer control to main, you should be able to locate main easily regardless of whether a binary has been stripped. The following code shows the _start function for a gcc-compiled binary that has not been stripped:
Notice that main is not called directly; rather, it is passed as a parameter to the library function __libc_start_main. The __libc_start_main function takes care of libc initialization, pushing the proper arguments to main, and finally transferring control to main. Note that main is the last parameter pushed before the call to __libc_start_main. The following code shows the _start function from the same binary after it has been stripped:
In this second case, we can see that IDA Pro no longer understands the name main. We also notice that two other function names have been lost as a result of the stripping operation, and that one function has managed to retain its name. It is important to note that the behavior of _start has not been changed in any way by the stripping operation. As a result, we can apply what we learned from the unstripped listing (that main is the last argument pushed to __libc_start_main) and deduce that loc_8046854 must be the start address of main; we are free to rename loc_8046854 to main as an early step in our reversing process.
One question we need to answer is why __libc_start_main has managed to retain its name while all the other functions we saw in the unstripped listing lost theirs. The answer lies in the fact that the binary we are looking at was dynamically linked (the file command would tell us so) and __libc_start_main is being imported from libc.so, the shared C library. The stripping process has no effect on imported or exported function and symbol names. This is because the runtime dynamic linker must be able to resolve these names across the various shared components required by the program. As you will see in the next section, we are not always so lucky when we encounter statically linked binaries.
Statically Linked Programs and FLAIR
When compiling programs that make use of library functions, the linker must be told whether to use shared libraries such as .dll and .so files, or static libraries such as .a files. Programs that use shared libraries are said to be dynamically linked, whereas programs that use static libraries are said to be statically linked. Each form of linking has its own advantages and disadvantages. Dynamic linking results in smaller executables and easier upgrading of library components at the expense of some extra overhead when launching the binary, and the chance that the binary will not run if any required libraries are missing. To learn which dynamic libraries an executable depends on, you can use the dumpbin utility on Windows, ldd on Linux, and otool on Mac OS X. Each will list the names of the shared libraries that the loader must find in order to execute a given dynamically linked program. Static linking results in much larger binaries because library code is merged with program code to create a single executable file that has no external dependencies, making the binary easier to distribute. As an example, consider a program that makes use of the OpenSSL cryptographic libraries. If this program is built to use shared libraries, then each computer on which the program is installed must contain a copy of the OpenSSL libraries. The program would fail to execute on any computer that does not have OpenSSL installed. Statically linking that same program eliminates the requirement to have OpenSSL present on computers that will be used to run the program, making distribution of the program somewhat easier.
From a reverse-engineering point of view, dynamically linked binaries are somewhat easier to analyze, for several reasons. First, dynamically linked binaries contain little to no library code, which means that the code you get to see in IDA Pro is just the code that is specific to the application, making it both smaller and easier to focus on application-specific code rather than library code. The last thing you want to do is spend your time reversing library code that is generally accepted to be fairly secure. Second, when a dynamically linked binary is stripped, it is not possible to strip the names of library functions called by the binary, which means the disassembly will continue to contain useful function names in many cases. Statically linked binaries present more of a challenge because they contain far more code to disassemble, most of which belongs to libraries. However, as long as the statically linked program has not been stripped, you will continue to see all the same names that you would see in a dynamically linked version of the same program. A stripped, statically linked binary presents the largest challenge for reverse engineering. When the strip utility removes symbol information from a statically linked program, it removes not only the function and global variable names associated with the program, but also the function and global variable names associated with any libraries that were linked in. As a result, it is extremely difficult to distinguish program code from library code in such a binary. Further, it is difficult to determine exactly how many libraries may have been linked into the program. IDA Pro has facilities (not well documented) for dealing with exactly this situation.
The following code shows what our _start function ends up looking like in a statically linked, stripped binary:
At this point, we have lost the names of every function in the binary and we need some method for locating the main function so that we can begin analyzing the program in earnest. Based on what we saw in the two listings from the “Stripped Binaries” section, we can proceed as follows:
• Find the last function called from _start; this should be __libc_start_main.
• Locate the first argument to __libc_start_main; this will be the topmost item on the stack, usually the last item pushed prior to the function call. In this case, we deduce that main must be sub_8048208. We are now prepared to start analyzing the program beginning with main.
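The two steps above can be sketched in a few lines of Python. This is only an illustration of the reasoning, not an IDA Pro script: the instruction list is a hand-written, hypothetical stand-in for a disassembly of _start, and every name in it other than sub_8048208 is an assumption made for the example.

```python
# Hypothetical, simplified instruction records: (mnemonic, operand).
# These values are illustrative; they are not output copied from IDA Pro.
start_listing = [
    ("xor", "ebp, ebp"),
    ("pop", "esi"),
    ("mov", "ecx, esp"),
    ("and", "esp, 0FFFFFFF0h"),
    ("push", "eax"),
    ("push", "esp"),
    ("push", "edx"),
    ("push", "offset sub_8048894"),
    ("push", "offset sub_80488E0"),
    ("push", "ecx"),
    ("push", "esi"),
    ("push", "offset sub_8048208"),
    ("call", "sub_8048440"),
    ("hlt", ""),
]


def find_main_candidate(listing):
    """Return (callee, last_pushed_operand) for the final call in the listing."""
    # Step 1: the last call in _start is assumed to be __libc_start_main.
    call_index = max(i for i, (mnem, _) in enumerate(listing) if mnem == "call")
    callee = listing[call_index][1]
    # Step 2: walk backward to the nearest push; that operand is the first
    # argument, which for __libc_start_main is the address of main.
    for mnem, operand in reversed(listing[:call_index]):
        if mnem == "push":
            return callee, operand.replace("offset ", "")
    return callee, None


callee, main_candidate = find_main_candidate(start_listing)
print(callee, main_candidate)  # sub_8048440 sub_8048208
```

The heuristic relies on the startup code following the gcc pattern described earlier; a binary built with a different compiler or runtime would need its own rule.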
Locating main is only a small victory, however. By comparing the listing from the unstripped version of the binary with the listing from the stripped version, we can see that we have completely lost the ability to distinguish the boundaries between user code and library code.
Following is an example of unstripped code with named references to library code:
Following is an example of stripped code without names referencing library code:
Comparing the previous two listings, we have lost the names of stderr, fwrite, exit, and gethostbyname, and each is indistinguishable from any other user space function or global variable. The danger we face is that, being presented with the binary in the stripped listing, we might attempt to reverse-engineer the function at loc_8048F7C. Having done so, we would be disappointed to learn that we have done nothing more than reverse a piece of the C standard library. Clearly, this is not a desirable situation for us. Fortunately, IDA Pro possesses the ability to help out in these circumstances.
Fast Library Identification and Recognition Technology (FLIRT) is the name that IDA Pro gives to its ability to automatically recognize functions based on pattern/signature matching. IDA Pro uses FLIRT to match code sequences against many signatures for widely used libraries. IDA Pro’s initial use of FLIRT against any binary is to attempt to determine the compiler that was used to generate the binary. This is accomplished by matching entry point sequences (such as the previous two listings) against stored signatures for various compilers. Once the compiler has been identified, IDA Pro attempts to match against additional signatures more relevant to the identified compiler. In cases where IDA Pro does not pick up on the exact compiler that was used to create the binary, you can force IDA Pro to apply any additional signatures from IDA Pro’s list of available signature files. Signature application takes place via the File | Load File | FLIRT Signature File menu option, which brings up the dialog box shown in Figure 4-1.
Figure 4-1 IDA Pro library signature selection dialog box
The dialog box is populated based on the contents of IDA Pro’s sig subdirectory. Selecting one of the available signature sets causes IDA Pro to scan the current binary for possible matches. For each match that is found, IDA Pro renames the matching code in accordance with the signature. When the signature files are correct for the current binary, this operation has the effect of unstripping the binary. It is important to understand that IDA Pro does not come complete with signatures for every static library in existence. Consider the number of different libraries shipped with any Linux distribution and you can appreciate the magnitude of this problem. To address this limitation, Hex-Rays ships a tool set called Fast Library Acquisition for Identification and Recognition (FLAIR). FLAIR consists of several command-line utilities used to parse static libraries and generate IDA Pro–compatible signature files.
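To make the idea of signature matching concrete, here is a minimal sketch of matching code bytes against patterns that contain wildcard positions. It is a simplification written for illustration and is not IDA Pro's actual matching algorithm; the pattern strings, byte values, and function names are all hypothetical.

```python
def matches(pattern, code):
    """Return True if code starts with the bytes described by pattern.

    pattern is a hex string in which ".." stands for a byte that may vary
    (for example, part of a relocated address).
    """
    pairs = [pattern[i:i + 2] for i in range(0, len(pattern), 2)]
    if len(code) < len(pairs):
        return False
    return all(p == ".." or int(p, 16) == b for p, b in zip(pairs, code))


# Hypothetical signature set: pattern -> library function name.
signatures = {
    "5589E58B4508": "demo_lib_func_a",
    "5589E5E8........C9C3": "demo_lib_func_b",
}


def identify(code):
    """Return the name of the first signature that matches, or None."""
    for pattern, name in signatures.items():
        if matches(pattern, code):
            return name
    return None


# The four bytes after E8 differ from build to build, so they are wildcards.
unknown_function = bytes.fromhex("5589E5E811223344C9C3")
print(identify(unknown_function))  # demo_lib_func_b
```

When a match is found, renaming the function to the matched name is what produces the "unstripping" effect described above.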
Generating IDA Pro Sig Files
Installation of the FLAIR tools is as simple as unzipping the FLAIR distribution (flair51.zip used in this section) into a working directory. Beware that FLAIR distributions are generally not backward compatible with older versions of IDA Pro, so be sure to obtain the appropriate version of FLAIR for your version of IDA Pro from the Hex-Rays IDA Pro Downloads page (see “For Further Reading”). After you have extracted the tools, you will find the entire body of existing FLAIR documentation in the three files named pat.txt, readme.txt, and sigmake.txt. You are encouraged to read through these files for more detailed information on creating your own signature files.
The first step in creating signatures for a new library involves the extraction of patterns for each function in the library. FLAIR comes with pattern-generating parsers for several common static library file formats. All FLAIR tools are located in FLAIR’s bin subdirectory. The pattern generators are named pXXX, where XXX represents various library file formats. In the following example, we will generate a sig file for the statically linked version of the standard C library (libc.a) that ships with FreeBSD 6.2. After moving libc.a onto our development system, the following command is used to generate a pattern file:
We choose the pelf tool because FreeBSD uses ELF format binaries. In this case, we are working in FLAIR’s bin directory. If you wish to work in another directory, the usual PATH issues apply for locating the pelf program. FLAIR pattern files are ASCII text files containing patterns for each exported function within the library being parsed. Patterns are generated from the first 32 bytes of a function, from some intermediate bytes of the function for which a CRC16 value is computed, and from the 32 bytes following the bytes used to compute the cyclic redundancy check (CRC). Pattern formats are described in more detail in the pat.txt file included with FLAIR. The second step in creating a sig file is to use the sigmake tool to create a binary signature file from a generated pattern file. The following command attempts to generate a sig file from the previously generated pattern file:
The -n option can be used to specify the “Library name” of the sig file as displayed in the sig file selection dialog box (refer to Figure 4-1). The default name assigned by sigmake is “Unnamed Sample Library.” The last two arguments for sigmake represent the input pattern file and the output sig file, respectively. In this example, we seem to have a problem: sigmake is reporting some collisions. In a nutshell, collisions occur when two functions reduce to the same signature. If any collisions are found, sigmake refuses to generate a sig file and instead generates an exclusions (.exc) file. The first few lines of this particular exclusions file are shown here:
In this example, we see that the functions ntohs and htons have the same signature, which is not surprising considering that they do the same thing on an x86 architecture—namely, swap the bytes in a 2-byte short value. The exclusions file must be edited to instruct sigmake how to resolve each collision. As shown earlier, basic instructions for this can be found in the generated .exc file. At a minimum, the comment lines (those beginning with a semicolon) must be removed. You must then choose which, if any, of the colliding functions you wish to keep. In this example, if we choose to keep htons, we must prefix the htons line with a + character, which tells sigmake to treat any function with the same signature as if it were htons rather than ntohs. More detailed instructions on how to resolve collisions can be found in FLAIR’s sigmake.txt file. Once you have edited the exclusions file, simply rerun sigmake with the same options. A successful run will result in no error or warning messages and the creation of the requested sig file. Installing the newly created signature file is simply a matter of copying it to the sig subdirectory under your main IDA Pro program directory. The installed signatures will now be available for use, as shown in Figure 4-2.
Figure 4-2 Selecting appropriate signatures
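The collision problem reported by sigmake can be illustrated with a small sketch. The version below is a deliberate simplification: it treats only the leading bytes of each function as its signature, whereas real FLAIR patterns also include a CRC16 and trailing bytes. The function bodies are hypothetical byte sequences chosen for the example.

```python
from collections import defaultdict


def find_collisions(functions, prefix_len=32):
    """Group function names whose leading bytes are identical."""
    groups = defaultdict(list)
    for name, code in functions.items():
        groups[code[:prefix_len]].append(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]


# Hypothetical bodies: both byte-swapping routines compile to the same code.
swap16 = bytes.fromhex("0FB744240486C4C3")
functions = {
    "htons": swap16,
    "ntohs": swap16,
    "demo_other": bytes.fromhex("5589E55DC3"),
}

print(find_collisions(functions))  # [['htons', 'ntohs']]
```

Each group returned here corresponds to one decision in the exclusions file: pick the name to keep, or keep none of them.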
Let’s apply the new signatures to the following code:
This yields the following improved disassembly in which we are far less likely to waste time analyzing any of the three functions that are called:
We have not covered how to identify exactly which static library files to use when generating your IDA Pro sig files. It is safe to assume that statically linked C programs are linked against the static C library. To generate accurate signatures, it is important to track down a version of the library that closely matches the one with which the binary was linked. Here, some file and strings analysis can assist in narrowing the field of operating systems that the binary may have been compiled on. The file utility can distinguish among various platforms, such as Linux, FreeBSD, and Mac OS X, and the strings utility can be used to search for version strings that may point to the compiler or libc version that was used. Armed with that information, you can attempt to locate the appropriate libraries from a matching system. If the binary was linked with more than one static library, additional strings analysis may be required to identify each additional library. Useful things to look for in strings output include copyright notices, version strings, usage instructions, and other unique messages that could be thrown into a search engine in an attempt to identify each additional library. By identifying as many libraries as possible and applying their signatures, you greatly reduce the amount of code you need to spend time analyzing and get to focus more attention on application-specific code.
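The strings analysis described above can be approximated in a few lines. The sketch below is a simplified stand-in for the strings utility, not a replacement for it, and the sample data and keyword list are hypothetical values invented for the example.

```python
import re


def extract_strings(data, min_len=4):
    """Return runs of printable ASCII characters at least min_len long."""
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return [m.decode("ascii") for m in re.findall(pattern, data)]


def library_hints(strings, keywords=("GCC", "Copyright", "version", "usage")):
    """Keep only strings that mention one of the given keywords."""
    return [s for s in strings if any(k.lower() in s.lower() for k in keywords)]


# Hypothetical fragment of a binary; the embedded strings are made up.
blob = (
    b"\x7fELF\x01\x01\x01\x00"
    b"GCC: (GNU) 0.0.0 demo\x00"
    b"\x90\x90\x01"
    b"usage: demo [-v] host\x00"
    b"ab\x00"
)

found = extract_strings(blob)
print(found)                 # ['GCC: (GNU) 0.0.0 demo', 'usage: demo [-v] host']
print(library_hints(found))  # both strings contain a keyword
```

Strings that survive the filter are the ones worth pasting into a search engine when trying to identify a library and its version.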
Data Structure Analysis
One consequence of compilation being a lossy operation is that we lose access to data declarations and structure definitions, which makes it far more difficult to understand the memory layout in disassembled code. IDA Pro provides the capability to define the layout of data structures and then to apply those structure definitions to regions of memory. Once a structure template has been applied to a region of memory, IDA Pro can utilize structure field names in place of integer offsets within the disassembly, making the disassembly far more readable. There are two important steps in determining the layout of data structures in compiled code. The first step is to determine the size of the data structure. The second step is to determine how the structure is subdivided into fields and what type is associated with each field. The following is a sample program that will be used to illustrate several points about disassembling structures:
The following is an assembly representation for the compiled code in the previous listing:
There are two methods for determining the size of a structure. The first and easiest method is to find locations at which a structure is dynamically allocated using malloc or new. The assembly listing shows a call to malloc with 96 as the argument. Malloc’ed blocks of memory generally represent either structures or arrays. In this case, we learn that this program manipulates a structure whose size is 96 bytes. The resulting pointer is transferred into the esi register and used to access the fields in the structure for the remainder of the function. Three later references to this structure can be used to further examine fields of the structure.
The second method of determining the size of a structure is to observe the offsets used in every reference to the structure and to compute the maximum size required to house the data that is referenced. In this case, the first reference covers the 80 bytes at the beginning of the structure (based on the maxlen argument that is pushed), the second references 4 bytes (the size of eax) starting at offset 80 into the structure ([esi + 80]), and the third references 8 bytes (a quad word/qword) starting at offset 88 ([esi + 88]) into the structure. Based on these references, we can deduce that the structure is 88 (the maximum offset we observe) plus 8 (the size of data accessed at that offset), or 96 bytes long. Thus, we have derived the size of the structure via two different methods. The second method is useful in cases where we can’t directly observe the allocation of the structure, perhaps because it takes place within library code.
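The second method amounts to a one-line calculation over the observed accesses. As a minimal sketch, using the three (offset, size) pairs from the discussion above:

```python
def infer_struct_size(accesses):
    """Smallest size that can hold every observed (offset, size) access."""
    return max(offset + size for offset, size in accesses)


# (offset, size) pairs observed in the disassembly discussed above.
accesses = [(0, 80), (80, 4), (88, 8)]
print(infer_struct_size(accesses))  # 96
```

The result is a lower bound: fields that the function never touches beyond offset 96 would not show up in this estimate.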
To understand the layout of the bytes within a structure, we must determine the types of data used at each observable offset within the structure. In our example, the first access uses the beginning of the structure as the destination of a string copy operation, limited in size to 80 bytes. We can conclude, therefore, that the first 80 bytes of the structure comprise an array of characters. Next, the 4 bytes at offset 80 in the structure are assigned the result of the function atol, which converts an ASCII string to a long value. Here, we can conclude that the second field in the structure is a 4-byte long. Finally, the 8 bytes at offset 88 into the structure are assigned the result of the function atof, which converts an ASCII string to a floating-point double value, so the third field is an 8-byte double.
You may have noticed that the bytes at offsets 84–87 of the structure appear to be unused. There are two possible explanations for this. The first is that there is a structure field between the long and the double that is simply not referenced by the function. The second possibility is that the compiler has inserted some padding bytes to achieve some desired field alignment. Based on the actual definition of the structure in the C source code listing, we conclude that padding is the culprit in this particular case. If we wanted to see meaningful field names associated with each structure access, we could define a structure in the IDA Pro Structures window. IDA Pro offers an alternative method for defining structures that you may find far easier to use than its structure-editing facilities. IDA Pro can parse C header files via the File | Load File menu option. If you have access to the source code or prefer to create a C-style struct definition using a text editor, IDA Pro will parse the header file and automatically create structures for each struct definition that it encounters in the header file. The only restriction you must be aware of is that IDA Pro only recognizes standard C data types. For any nonstandard types (uint32_t, for example), the header file must contain an appropriate typedef, or you must edit the header file to convert all nonstandard types to standard types.
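The padding at offsets 84–87 follows from ordinary alignment rules, which can be reproduced with a short calculation. This sketch assumes the compiler aligns each field to the alignment given in the table (in particular, an 8-byte alignment for the double); the field names are hypothetical, since only the sizes and offsets are known from the disassembly.

```python
def align_up(value, alignment):
    """Round value up to the next multiple of alignment."""
    return (value + alignment - 1) // alignment * alignment


def layout(fields):
    """Compute field offsets and total size for (name, size, alignment) fields."""
    offsets = {}
    offset = 0
    max_align = 1
    for name, size, alignment in fields:
        offset = align_up(offset, alignment)  # insert padding if needed
        offsets[name] = offset
        offset += size
        max_align = max(max_align, alignment)
    # The whole structure is padded to a multiple of its strictest alignment.
    return offsets, align_up(offset, max_align)


# Hypothetical field names; sizes match the types deduced from the listing.
fields = [("name", 80, 1), ("id", 4, 4), ("balance", 8, 8)]
offsets, total = layout(fields)
print(offsets)  # {'name': 0, 'id': 80, 'balance': 88}
print(total)    # 96
```

With a 4-byte alignment for the double instead, the same calculation places the last field at offset 84, so the observed gap is itself evidence of the compiler's alignment choice.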
Access to stack or globally allocated structures looks quite different from access to dynamically allocated structures. The C source code listing shows that main contains a local, stack-allocated structure. Two statements in main reference fields in this locally allocated structure, and these correspond to two references in the assembly listing. Although we can see that the second reference accesses memory that is 80 bytes ([ebp-96+80] == [ebp-16]) after the first, we don’t get a sense that the two references belong to the same structure. This is because the compiler can compute the address of each field (as an absolute address in a global variable, or a relative address within a stack frame) at compile time, making access to fields less obvious. Access to fields in dynamically allocated structures must always be computed at runtime because the base address of the structure is not known at compile time, which has the effect of showing the field boundaries inside the structure.
Using IDA Pro Structures to View Program Headers
In addition to enabling you to declare your own data structures, IDA Pro contains a large number of common data structure templates for various build environments, including standard C library structures and Windows API structures. An interesting example use of these predefined structures is to use them to examine the program file headers, which by default are not loaded into the analysis database. To examine file headers, you must perform a manual load when initially opening a file for analysis. Manual loads are selected via a check box on the initial load dialog box, as shown in Figure 4-3.
Figure 4-3 Forcing a manual load with IDA Pro
Manual loading forces IDA Pro to ask you whether you wish to load each section of the binary into IDA Pro’s database. One of the sections that IDA Pro will ask about is the header section, which will allow you to see all the fields of the program headers, including structures such as the MSDOS and NT file headers. Another section that gets loaded only when a manual load is performed is the resource section that is used on the Windows platform to store dialog box and menu templates, string tables, icons, and the file properties. You can view the fields of the MSDOS header by scrolling to the beginning of a manually loaded Windows PE file and placing the cursor on the first address in the database, which should contain the “M” value of the MSDOS “MZ” signature. No layout information will be displayed until you add the IMAGE_DOS_HEADER to your Structures window. This is accomplished by switching to the Structures tab, clicking Insert, entering IMAGE_DOS_HEADER as the Structure Name, as shown in Figure 4-4, and clicking OK.
Figure 4-4 Importing the IMAGE_DOS_HEADER structure
This will pull IDA Pro’s definition of the IMAGE_DOS_HEADER from its type library into your local Structures window and make it available to you. Finally, you need to return to the disassembly window, position the cursor on the first byte of the DOS header, and press ALT-Q to apply the IMAGE_DOS_HEADER template. The structure may initially appear in its collapsed form, but you can view all of the struct fields by expanding the struct with the numeric keypad + key. This results in the display shown next:
A little research on the contents of the DOS header will tell you that the e_lfanew field holds the offset to the PE header struct. In this case, we can go to address 00400000 + 200h (00400200) and expect to find the PE header. The PE header fields can be viewed by repeating the process just described and using IMAGE_NT_HEADERS as the structure you wish to select and apply.
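The same lookup can be done by hand on the raw bytes. In the sketch below, the 64-byte header is a hypothetical buffer built for the example, containing only the “MZ” signature and an e_lfanew value of 200h; the field offset of 3Ch is the position of e_lfanew within IMAGE_DOS_HEADER.

```python
import struct

E_LFANEW_OFFSET = 0x3C  # position of e_lfanew within IMAGE_DOS_HEADER


def pe_header_address(image, image_base):
    """Return the address of the PE header given the start of a loaded image."""
    if image[:2] != b"MZ":
        raise ValueError("missing MZ signature")
    # e_lfanew is a 32-bit little-endian offset from the start of the file.
    (e_lfanew,) = struct.unpack_from("<i", image, E_LFANEW_OFFSET)
    return image_base + e_lfanew


# Hypothetical DOS header: only the signature and e_lfanew are filled in.
dos_header = bytearray(64)
dos_header[0:2] = b"MZ"
struct.pack_into("<i", dos_header, E_LFANEW_OFFSET, 0x200)

print(hex(pe_header_address(bytes(dos_header), 0x400000)))  # 0x400200
```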
Quirks of Compiled C++ Code
C++ is a somewhat more complex language than C, offering member functions and polymorphism, among other things. These two features require implementation details that make compiled C++ code look rather different from compiled C code when they are used. First, all nonstatic member functions require a this pointer; second, polymorphism is implemented through the use of vtables.
NOTE In C++, a this pointer is available in all nonstatic member functions. This points to the object for which the member function was called and allows a single function to operate on many different objects merely by providing different values for this each time the function is called.
The means by which this pointers are passed to member functions vary from compiler to compiler. Microsoft compilers take the address of the calling object and place it in the ecx register prior to calling a member function. Microsoft refers to this calling convention as thiscall. Other compilers, such as Borland and g++, push the address of the calling object as the first (leftmost) parameter to the member function, effectively making this an implicit first parameter for all nonstatic member functions. C++ programs compiled with Microsoft compilers are very recognizable as a result of their use of thiscall. Here’s a simple example:
Because Borland and g++ pass this as a regular stack parameter, their code tends to look more like traditional compiled C code and does not immediately stand out as compiled C++.
Virtual tables (or vtables) are the mechanism underlying virtual functions and polymorphism in C++. For each class that contains virtual member functions, the C++ compiler generates a table of pointers called a vtable. A vtable contains an entry for each virtual function in a class, and the compiler fills each entry with a pointer to the virtual function’s implementation. Subclasses that override any virtual functions receive their own vtable. The compiler copies the superclass’s vtable, replacing the pointers of any functions that have been overridden with pointers to their corresponding subclass implementations. The following is an example of superclass and subclass vtables:
As can be seen, the subclass overrides func3 and func4, but inherits the remaining virtual functions from its superclass. The following features of vtables make them stand out in disassembly listings:
• Vtables are usually found in the read-only data section of a binary.
• Vtables are referenced directly only from object constructors and destructors.
• By examining similarities among vtables, it is possible to understand inheritance relationships among classes in a C++ program.
• When a class contains virtual functions, all instances of that class will contain a pointer to the vtable as the first field within the object. This pointer is initialized in the class constructor.
• Calling a virtual function is a three-step process. First, the vtable pointer must be read from the object. Second, the appropriate virtual function pointer must be read from the vtable. Finally, the virtual function can be called via the retrieved pointer.
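The vtable mechanics listed above can be modeled in a few lines of Python. This is a conceptual sketch of what the compiled code does, not real C++ object layout: the four-slot table and the function names are hypothetical, with the subclass overriding func3 and func4 as in the discussion above.

```python
# Each "virtual function" takes the object (the this pointer) as its argument.
def super_func1(this): return "Super::func1"
def super_func2(this): return "Super::func2"
def super_func3(this): return "Super::func3"
def super_func4(this): return "Super::func4"
def sub_func3(this): return "Sub::func3"
def sub_func4(this): return "Sub::func4"


# One vtable per class: a table of pointers to implementations.
super_vtable = [super_func1, super_func2, super_func3, super_func4]

# The subclass table starts as a copy, then overridden slots are replaced.
sub_vtable = list(super_vtable)
sub_vtable[2] = sub_func3
sub_vtable[3] = sub_func4


def construct(vtable):
    """Model a constructor: the vtable pointer is the object's first field."""
    return {"vptr": vtable}


def call_virtual(obj, slot):
    vtable = obj["vptr"]   # step 1: read the vtable pointer from the object
    target = vtable[slot]  # step 2: read the function pointer from the vtable
    return target(obj)     # step 3: call through the retrieved pointer


sub_object = construct(sub_vtable)
print(call_virtual(sub_object, 0))  # Super::func1 (inherited)
print(call_virtual(sub_object, 2))  # Sub::func3 (overridden)
```

Comparing the two tables slot by slot shows which entries are shared, which is the same comparison used to infer inheritance relationships from vtables in a disassembly.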
Extending IDA Pro
Although IDA Pro is an extremely powerful disassembler on its own, it is rarely possible for a piece of software to meet every need of its users. To provide as much flexibility as possible to its users, IDA Pro was designed with extensibility in mind. These features include a custom scripting language for automating simple tasks, and a plug-in architecture that allows for more complex, compiled extensions.
IDA Pro has support for writing plug-ins and automation scripts in one of these languages: IDC, Python, or C++. Although the three mentioned languages are the most prevalent ones, there are some projects that expose some of the IDA API to languages such as Ruby and OCaml.
IDC is a C-like language that is interpreted rather than compiled. Like many scripting languages, IDC is dynamically typed, and it can be run in something close to an interactive mode or as complete stand-alone scripts contained in .idc files. IDA Pro does provide some documentation on IDC in the form of help files that describe the basic syntax of the language and the built-in API functions available to the IDC programmer.
IDAPython is an IDA Pro plug-in that allows running Python code in IDA. The project was started by Gergely Erdelyi, and due to its popularity it was merged into the standard IDA Pro release and is currently maintained by IDA developers. Python has proven itself as one of the prevalent languages in the reverse-engineering community, so it doesn’t come as a surprise that most select it as the tool of choice when scripting in IDA.
IDA comes with a software development kit (SDK) that exposes most internal functions and allows them to be called from C++ code. Using the SDK used to be the only way to write more advanced plug-ins. IDC and Python didn’t have access to functions necessary to develop things like processor modules. Every new version of IDA exposes more functions to the supported scripting languages, so since version 5.7 it is possible to develop processor modules in IDC and Python.
Scripting in IDAPython
For those familiar with IDA’s IDC language, scripting in Python will be an easy transition. All IDC functions are available in IDAPython, along with all the native Python functions and libraries.
NOTE In this chapter, we will be using Microsoft’s Portable Executable (PE) format as an example. Presented information is still applicable to other formats such as Unix/Linux Executable and Linkable Format (ELF) and Mac OS Mach Object (Mach-O).
Functions in IDA
To start things off, let’s analyze the following problem. There are many ways to perform deep binary analysis, but unless you possess extraordinary memory you will want to rename and annotate as many functions as possible. A good way to start is to rename the functions that appear in disassembly very often. Renaming these functions will save you much time when looking at the disassembly. This process can be partially automated by scripting steps 1–3 in the following list:
1. Find all functions in the program.
2. Count how many times each function is called.
3. Sort the functions by number of times they are called.
4. Manually analyze the top called functions and give them meaningful names.
Functions in IDA can be identified by looking at the Functions window, which is available via View | Open subviews | Functions. Another way to open this window is to use the SHIFT-F3 hotkey. Using hotkeys is probably the fastest way to navigate the IDA interface, so keep note of all the combinations and slowly learn to adopt them into your workflow.
The Functions window contains information about each function recognized by IDA. The following information is displayed:
• Function name
• Name of the segment the function is located in
• Start of function
• Length of function
• Size of local variables and function arguments
• Various function options and flags
Functions can be identified in the disassembly window by the color of the font for the section name and address. Disassembly code that is not associated with a function will appear in a red font, whereas code that belongs to a non-library function will appear in a black font. Functions that are recognized by IDA to come from a known library and are statically linked will be shown in a light blue color. Another way to distinguish functions from regular code is by location names.
NOTE Sometimes a portion of code that should be a function is not recognized as such by IDA, and the code will appear in red. In such cases, it is possible to manually make that part of code into a function by pressing keyboard shortcut P. If after that the code appears in the usual black font and the name label in blue, it means that a function has been successfully created.
The function start will get assigned either a known library function name, which is known to IDA, from the Imports or Exports section, or a generic function name starting with “sub_”. An example of a generic function name is given in the following code snippet:
Following is a list of basic API functions that can be used to get and set various function information and parameters:
Detailed information about all exposed functions can be found in the IDAPython documentation listed in the “For Further Reading” section.
Sorting Functions by Call Reference
Let’s analyze the following script, which outputs the information about functions (address and names) ordered by the number of times they were called:
The script is structured based on the steps outlined at the beginning of this section. The first two steps are implemented in the function BuildFuncsDict. The function iterates over all functions recognized by IDA and gathers information about them. One of the flags every function has is FUNC_LIB. This bit field is used by IDA to mark functions that have been identified to come from a known library. We are not necessarily interested in these functions because they would be renamed by IDA to a matched library’s function name, so we can use this flag to filter out all library functions.
We can count the number of times a function is called by counting the number of code references to that specific function. IDA maintains information about references (connections) between functions that can be used to build a directed graph of function interactions.
NOTE An important thing to keep in mind when talking about function references is that there are two different types of references: code and data. A code reference is when a specific location or address is used (or referenced) from a location that has been recognized as a code by IDA. For the data references, the destination (for example, a function address) is stored in a location that has been classified as a data by IDA. Typically, most function references will be from code locations, but in the case of object-oriented languages, data references from class data tables are fairly common.
After information about functions has been collected, the SortNonLibFuncsByCallCount function is responsible for ordering functions based on the call count and filtering out all the library functions. Finally, PrintResults will output (by default) the top 10 most-called functions. By modifying the limit parameter of the PrintResults function, it is possible to output an arbitrary number of the most-referenced functions, or you can specify None as the limit to output all the referenced functions.
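The three scripted steps can be sketched as follows. This is not the chapter's script: the helper names and the tuple layout are hypothetical. The sorting and formatting logic is plain Python so that it can be tried outside IDA on sample data, whereas collect_from_ida uses IDA 6.x-era idc and idautils names and only works inside IDA (newer IDA releases rename several of these calls).

```python
def collect_from_ida():
    """Gather (ea, name, is_lib, call_count) tuples; only usable inside IDA."""
    import idautils  # these modules exist only within IDA
    import idc
    funcs = []
    for ea in idautils.Functions():
        is_lib = bool(idc.GetFunctionFlags(ea) & idc.FUNC_LIB)
        calls = len(list(idautils.CodeRefsTo(ea, 0)))  # code references only
        funcs.append((ea, idc.GetFunctionName(ea), is_lib, calls))
    return funcs


def sort_non_lib_by_calls(funcs):
    """Drop library functions, then order by call count, highest first."""
    non_lib = [f for f in funcs if not f[2]]
    return sorted(non_lib, key=lambda f: f[3], reverse=True)


def format_results(sorted_funcs, limit=10):
    """Return printable rows; limit=None keeps every function."""
    rows = sorted_funcs if limit is None else sorted_funcs[:limit]
    return ["0x%08X %-20s %d" % (ea, name, calls) for ea, name, _, calls in rows]


# Made-up sample data standing in for an IDA database.
SAMPLE = [
    (0x401000, "sub_401000", False, 3),
    (0x401050, "_memcpy", True, 40),
    (0x4010A0, "sub_4010A0", False, 17),
]
```

Inside IDA, passing collect_from_ida() instead of SAMPLE to sort_non_lib_by_calls would produce the ranked list for the loaded binary. Counting data references as well (for example, from vtables) would need an additional pass that this sketch leaves out.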
Renaming Wrapper Functions
One common usage of the IDA scripting API is function renaming. In the previous section, we touched on the importance of function renaming, but we haven’t explored any options for automating the process. Wrapper functions are a good example of when it is possible to programmatically rename functions and thus improve the readability of the IDA database. A wrapper function can be defined as a function whose only purpose is to call another function. These functions are usually created to perform additional error checking on the called function and make the error handling more robust. One special case of wrapper functions that we are interested in is wrappers of non-dummy functions. IDA dummy names (for example, the sub_ prefix) are generic IDA names that are used when there is no information about the original name. Our goal is to rename dummy functions that point to functions with meaningful names (that is, non-dummy functions).
The following script can be used to rename the wrappers and reduce the number of functions that need to be analyzed:
The function find_wrappers iterates over all defined functions in the IDA database and first checks whether the function has a dummy name. We are only interested in renaming the dummy names; renaming other name types would overwrite valuable information stored as a function name. A function size check is then used as a heuristic filter to see whether the function has more than 200 instructions and is therefore too big to be relevant. We generally expect wrapper functions to implement simple functionality and be short, so this heuristic filters out big functions and improves the speed of the script. A loop then iterates over all instructions in the function and looks for “call” instructions. For every call instruction, we check that the destination function has a meaningful name (it’s a library function or not a dummy function). Finally, the function rename_func is called to rename a wrapper function if only one called function with a non-dummy call destination was found.
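The filtering logic just described can be tried outside IDA with a small model. The sketch below is an assumption-laden illustration, not the chapter's script: functions are represented as a hypothetical dictionary mapping a name to a list of (mnemonic, call target) pairs, only the sub_ prefix is treated as a dummy name, and "only one called function" is read as exactly one call whose destination has a non-dummy name.

```python
MAX_WRAPPER_INSNS = 200  # size cutoff taken from the text


def is_dummy(name):
    # Assumption: only the sub_ prefix counts as an IDA dummy name here.
    return name.startswith("sub_")


def find_wrappers(funcs):
    """Return {wrapper_name: destination_name} for likely wrappers.

    funcs maps a function name to its instructions as (mnemonic, target)
    pairs; target is None for anything that is not a call.
    """
    wrappers = {}
    for name, insns in funcs.items():
        if not is_dummy(name):
            continue  # never overwrite a meaningful name
        if len(insns) > MAX_WRAPPER_INSNS:
            continue  # too big to be a simple wrapper
        named = [t for m, t in insns if m == "call" and not is_dummy(t)]
        if len(named) == 1:
            wrappers[name] = named[0]
    return wrappers


# Made-up sample standing in for an IDA database.
SAMPLE = {
    "sub_401000": [("push", None), ("call", "CreateFileW"), ("retn", None)],
    "sub_401020": [("call", "sub_401000"), ("retn", None)],
    "sub_401040": [("call", "malloc"), ("call", "free"), ("retn", None)],
    "main": [("call", "printf"), ("retn", None)],
}
```

With this sample, only sub_401000 qualifies: sub_401020 calls a dummy function, sub_401040 calls two named functions, and main already has a meaningful name.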
The rename_func function renames wrapper functions using the following naming convention: WrapperDestination + _w. The WrapperDestination is the name of the function that is called from the wrapper, and the _w suffix is added to symbolize a wrapper function. Additionally, after the _w suffix, a number might appear. This number is a counter that increases every time a wrapper for the same function is created. This is necessary because IDA doesn’t support multiple functions having the same name. One special case where _w is omitted is for wrappers of C++ mangled names. To get nice unmangled names in a wrapper, we can’t append _w because it would break the mangle; so in this case we only append _DIGIT, where DIGIT is a wrapper counter.
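The naming convention can be expressed as a small helper. The sketch below makes several assumptions that the text does not spell out: a mangled name is recognized by a leading question mark (the MSVC style), the first wrapper of a regular function gets a bare _w while later ones get _w1, _w2, and so on, and the counter for mangled names starts at 0. The set named taken is a hypothetical stand-in for the names already present in the IDA database.

```python
def is_mangled(name):
    # Assumption: MSVC-style mangled names start with '?'.
    return name.startswith("?")


def wrapper_name(dest, taken):
    """Pick an unused wrapper name for dest and record it in taken."""
    counter = 0
    while True:
        if is_mangled(dest):
            # Only _DIGIT is appended for mangled names, as the text describes.
            candidate = "%s_%d" % (dest, counter)
        elif counter == 0:
            candidate = dest + "_w"
        else:
            candidate = "%s_w%d" % (dest, counter)
        if candidate not in taken:
            taken.add(candidate)
            return candidate
        counter += 1
```

Calling wrapper_name twice for the same destination yields two distinct names, which is the property needed because IDA rejects duplicate function names.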
Decrypting Strings in IDB
One of the common obfuscation techniques employed by malware authors is encrypting cleartext data. This technique makes static analysis harder and thwarts static antivirus signatures to some extent. There are different ways and algorithms used by malware for string encryption, but all of them share a few things in common:
• Generally only one algorithm is used to encrypt/decrypt all the strings used by malware.
• The decryption function can be identified by following cross-references to referenced binary data in the .data section.
• Decryption functions are usually small and have an xor instruction somewhere in the loop.
To illustrate the previous points, we will take a look at a component of the infamous Flamer malware. More specifically, we will be analyzing the mssecmgr.ocx (md5: bdc9e04388bda8527b398a8c34667e18) sample.
After opening the sample with IDA, we first go to the .data section by pressing SHIFT-F7 and double-clicking the appropriate segment. Scrolling down the .data segment, you will start seeing data references to what seems like random data. Following is an example of such a reference at address 0x102C9FD4:
Looking at the location of the reference at sub_101C06B0+1Ao, it becomes evident that this location (unk_102C9FD4 at 0x102C9FD4) is pushed as an argument to an unknown function:
Looking at the called function sub_1000E477, it becomes evident that this is only a wrapper for another function and that the interesting functionality is performed in the sub_1000E3F5 function:
Moving along, we examine sub_1000E3F5, and the first thing we should notice is a jnz short loc_1000E403 loop. Inside this loop are several indicators that this could be some kind of decryption function. First of all, the loop contains several xor instructions that operate on data that is ultimately written to memory by the instruction at address 0x1000E427:
.text:1000E427 sub [esi], cl
After closer inspection of the code, we can assume that this function is indeed decrypting data, so we can proceed with understanding its functionality. The first thing we should do is to identify function arguments and their types and then give them appropriate names. To improve the readability of the assembly, we will add an appropriate function definition by pressing Y at the function start (address 0x1000E3F5) and enter the following as the type declaration:
Next, we change the name of the function by pressing N at the function start at 0x1000E3F5 and enter Decrypt as the function name.
We have already determined that we are dealing with a decryption function, so we will continue by renaming the sub_1000E477 wrapper function as Decrypt_w.
A good habit to have is checking for all locations where the decryption function is used. To do that, first jump to the Decrypt function by pressing G and entering Decrypt. To get all cross-references, press CTRL-X. This shows there is another function calling Decrypt that hasn’t been analyzed so far. If you take a look, it seems very similar to the previously analyzed decryption wrapper Decrypt_w. For now, we will also rename the second wrapper as Decrypt_w2 by pressing N at the function name location.
Performing the analysis and decompilation of the decryption function is left as an exercise for the reader. The decryption function is sufficiently short and straightforward to serve as good training for these types of scenarios. Instead, a solution will be presented as an IDA script that decrypts and comments all encrypted strings. This script should be used to validate the analysis results of the decryption function.
Following is a representative example of how to approach the task of decrypting strings by using a static analysis approach and implementing the translation of the decryption function to a high-level language:
The decryption script consists of the following functions:
• DecryptXrefs() This is the main function, whose responsibility is to locate all encrypted string location addresses and call the decompiled decryption function: DecryptStruct. This function represents both wrapper functions and needs three arguments to correctly process data. The first argument, pEncStruct, is an address of the structure that represents the encrypted string. The following two arguments, iJunkSize and iBoolSize, define the two variables that are different in the two wrapper functions. iJunkSize is the length of junk data in the structure, and iBoolSize defines the size of the variable that is used to define whether or not the structure has been decrypted.
The following IDA APIs are used to fetch the address of the decryption function, find all cross-references, and walk the disassembly listing: LocByName, CodeRefsTo, and PrevHead.
Useful APIs for parsing the disassembly include GetMnem, GetOpType, GetOperandValue, and OpHex.
• DecryptStruct() This is a high-level representation of the wrapper function that calls the decryption functionality. It first checks whether or not the structure that represents the encrypted string has already been processed (decrypted). It does this by calling IsEncrypted(), which checks the specific field of the structure representing this information. If the data hasn’t been decrypted, it will proceed with fetching the size of the encrypted string from a field in the structure and then read the encrypted content. This content is then passed to the DecryptData() function, which returns the decrypted data. The function proceeds with patching the IDB with a decrypted string and updating the field denoting the status of decryption for the structure in PatchString() and PatchIsEncrypted(). Finally, a comment is added to the IDB at the location of the encrypted string.
Useful APIs for reading data from IDB are Byte, Word, Dword, and Qword.
• PatchString() and PatchIsEncrypted() These functions modify the state of IDB by changing content of the program. The important thing to notice is that changes are made only to the IDB and not to the original program, so changes will not influence the original binary that is being analyzed.
Useful APIs for patching data in IDB are PatchByte, PatchWord, PatchDword, and PatchQword.
• AddComment() This adds a comment in the IDB at a specific location. The data to be written is first stripped of any null bytes and then written as an ASCII string to the desired location.
Useful APIs for manipulating comments are MakeComm, Comments, and CommentEx.
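The check, decrypt, patch, and comment flow described for DecryptStruct() can be rehearsed outside IDA with a stand-in for the database. Everything in the sketch below is hypothetical and labeled as such: FakeIdb replaces the IDB and the Byte/PatchByte/MakeComm family of calls, the structure layout (junk bytes, then the flag, then a 2-byte little-endian size, then the data) is an assumption, a nonzero flag is assumed to mean "still encrypted", and decrypt_data is a placeholder single-byte XOR, not the algorithm used by the sample. It is written for Python 3.

```python
import struct


class FakeIdb:
    """Stand-in for the IDB: a byte buffer plus a comment map."""
    def __init__(self, base, data):
        self.base, self.mem, self.comments = base, bytearray(data), {}

    def read(self, ea, n):
        off = ea - self.base
        return bytes(self.mem[off:off + n])

    def patch(self, ea, data):
        off = ea - self.base
        self.mem[off:off + len(data)] = data


def decrypt_data(blob, key=0x5A):
    # Placeholder transform only; NOT the real decryption routine.
    return bytes(b ^ key for b in blob)


def decrypt_struct(idb, p_enc, junk_size, bool_size):
    flag_ea = p_enc + junk_size
    size_ea = flag_ea + bool_size
    data_ea = size_ea + 2
    if not any(idb.read(flag_ea, bool_size)):   # IsEncrypted(): zero means done
        return None
    (size,) = struct.unpack("<H", idb.read(size_ea, 2))
    plain = decrypt_data(idb.read(data_ea, size))
    idb.patch(data_ea, plain)                   # PatchString()
    idb.patch(flag_ea, b"\x00" * bool_size)     # PatchIsEncrypted()
    text = plain.replace(b"\x00", b"").decode("ascii", "replace")
    idb.comments[p_enc] = text                  # AddComment()
    return text
```

Because the flag is cleared after patching, running the function twice on the same structure is harmless: the second call returns None instead of "decrypting" already-decrypted bytes.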
Example 4-1: Decrypting Strings in Place
NOTE This exercise is provided as an example rather than as a lab due to the fact that in order to perform the exercise, malicious code is needed.
This example exercise covers a static analysis technique that is commonly used when dealing with malware samples. As mentioned previously, malware authors commonly encrypt strings and other cleartext data to evade static signatures and make the analysis more difficult.
In this example, we will look at the mssecmgr.ocx (md5: bdc9e04388bda8527b398a8c34667e18) component of the infamous Flamer malware and decrypt all strings used by this threat. Follow these steps:
1. Open the sample mssecmgr.ocx (md5: bdc9e04388bda8527b398a8c34667e18) with IDA Pro.
2. Jump to the decryption function by pressing G and entering the following address: 0x1000E477.
3. Rename the function by pressing N and entering DecryptData_w as the function name.
4. Jump to the decryption function by pressing G and entering the following address: 0x1000E431.
5. Rename the function by pressing N and entering DecryptData_w2 as the function name.
6. Download IDAPython_DecryptFlameStrings.py from the lab repository and run the script from the IDA menu (File | Script file…). When the script finishes, it will print “All done!” to the output window.
Here’s an example of the script output:
IDAPython is a powerful tool for programmatically modifying IDB files. The ability to automate the manual tasks of annotating disassembly is of great importance when analyzing big and complex malware samples. Investing time into getting familiar with the IDA API and learning ways to control the IDB information will greatly improve your analysis capabilities and speed.
Executing Python Code
You have several ways to execute scripts in IDA:
• You can execute script commands from the IDA menu via File | Script command.
The hotkey command (for IDA 6.4+; might vary between versions) is SHIFT-F2.
• You can execute script files from the IDA menu via File | Script file.
The hotkey command (for IDA 6.4+; might vary between versions) is ALT-F2.
• You can also execute script files from the command line using the -S command-line switch.
From the scripting point of view, IDA batch mode execution is probably the most interesting. If you need to analyze a bunch of files and want to perform a specific set of actions over them (for example, running a script to rename functions), then batch mode is the answer. Batch mode is invoked by the -B command-line switch and will create an IDA database (IDB), run the default IDA auto-analysis, and exit. This is handy because it won’t show any dialog boxes or require user interaction.
To create IDBs for all .exe files in a directory, we can run the following command from the Windows command-line prompt:
C:\gh_test> for %f in (*.exe) do idaw.exe -B %f
In this case, we invoke idaw.exe, which is a regular IDA program, but with a text interface only (no GUI). IDAW is preferred when running in batch mode because the text interface is much faster and more lightweight than the GUI version. The previous command will create .idb and .asm files for every .exe file that it successfully analyzed. After IDB files are created, we can run any of the IDA-supported script files: IDC or Python. The -S command-line switch is used to specify a script name to be executed and any parameters required by the script, as in the following example:
C:\gh_test> for %f in (.\*.idb) do idaw.exe -S"script_name.py argument" %f
NOTE Keep in mind that IDA will not exit when you are executing scripts from the command line. When you are running a script over many IDB files, remember to call the Exit() function at the end of the script so that IDA quits and saves the changes made to IDB.
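The same two batch passes can be driven from Python instead of a cmd.exe loop. The helper below is a hypothetical sketch that only builds the argument lists, using the -B and -S switches shown above; it does not launch IDA itself, and the idaw.exe name is taken from the text and may differ between IDA versions.

```python
def batch_commands(paths, ida="idaw.exe", script=None, script_args=""):
    """Return one argument list per input file; nothing is executed here."""
    cmds = []
    for path in paths:
        if script is None:
            # First pass: create the IDB, run auto-analysis, and exit.
            cmds.append([ida, "-B", path])
        else:
            # Second pass: run a script (plus its arguments) on an existing IDB.
            switch = "-S" + " ".join(p for p in (script, script_args) if p)
            cmds.append([ida, switch, path])
    return cmds
```

Each returned list can be handed to subprocess.call, with the file list produced by something like glob.glob(r"C:\gh_test\*.exe"). Passing the switch and its script arguments as a single list element plays the role of the quotes in the command-line form.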
IDA Pro is the most popular and advanced reverse-engineering tool for static analysis. It is used for vulnerability research, malware analysis, exploit development, and many other tasks. Taking time to fully understand all the functionality offered by IDA will pay off in the long run and make reverse-engineering tasks easier. One of the greatest advantages of IDA is its extensible architecture. IDA plug-in extensions can be written in one of many supported programming languages, making it even easier to start experimenting. Additionally, the great IDA community has released numerous plug-ins, extending its capabilities even more and making it a part of every reverse engineer’s toolkit.
For Further Reading
FLIRT reference www.hex-rays.com/idapro/flirt.htm.
Hex-Rays IDA PRO download page (FLAIR) www.hex-rays.com/idapro/idadown.htm.
Hex blog www.hexblog.com.
Hex-Rays forum www.hex-rays.com/forum.
“Introduction to IDAPython” (Ero Carrera) www.offensivecomputing.net/papers/IDAPythonIntro.pdf.
IDAPython plug-in code.google.com/p/idapython/.
IdaRub plug-in www.metasploit.com/users/spoonm/idarub/.
ida-x86emu plug-in sourceforge.net/projects/ida-x86emu/.
IDAPython docs https://www.hex-rays.com/products/ida/support/idapython_docs/.
IDA plug-in contest https://www.hex-rays.com/contests/.
OpenRCE forums www.openrce.org/forums/.