The Antivirus Hacker's Handbook (2015)

Part I. Antivirus Basics

Chapter 4. Understanding Antivirus Signatures

Signatures are a key part of any antivirus engine. The signatures are typically hashes or byte-streams that are used to determine whether a file or buffer contains a malicious payload.

All antivirus engines, since their inception, have used a signature scheme. Although various kinds exist, the signatures are typically small hashes or byte-streams that contain enough information to determine whether a file or a buffer matches a known-malware pattern. When hashes are used for signatures, they are generated with algorithms such as CRC or MD5, which are typically fast and can be calculated many times per second with a negligible performance penalty. This is the most typical and preferred method for antivirus engineers to detect a specific piece of malicious software, because the algorithms are easy to implement and tend to be fast.

This chapter covers the various signature database types, their strengths and weaknesses, when they are best used, and how they can be circumvented.

Typical Signatures

Even though each AV engine uses a different set of algorithms to generate its signatures, and almost all of them have algorithms of their own, various algorithms are shared among AV products. Some algorithms that are used to generate signatures can have a high false-positive ratio but are extremely fast. Other more complex (and naturally more expensive) signatures exhibit a lower rate of false positives but take a very long time (from a desktop antivirus point of view) to match. The following sections will cover the most notable signatures and discuss the advantages and disadvantages of each one.

Byte-Streams

The simplest form of an antivirus signature is a byte-stream that is specific to a malware file and that does not normally appear on non-malicious files. For example, to detect the European Institute for Computer Anti-Virus Research (EICAR) antivirus testing file, an antivirus engine may simply search for this entire string:

X5O!P%@AP[4\PZX54(P^)7CC)7}$EICAR-STANDARD-ANTIVIRUS-TEST-FILE!$H+H*

This is, naturally, the easiest approach for detecting malware; it is fast and easy to implement, as there are many robust and efficient algorithms for string matching (such as Aho-Corasick, Knuth-Morris-Pratt, Boyer-Moore, and so on) that are available to anyone. However, this approach is error prone for the same reason that it is easy to implement: if a goodware file contains the byte-string, a false positive is generated, which means that a healthy file is interpreted as a malicious one. Indeed, it is difficult to predict the actual number of antivirus products that will detect an electronic file containing the text in this chapter as malicious because it contains the entire EICAR signature.

Checksums

The most typical signature-matching algorithm is used by almost all existing AV engines and is based on calculating CRCs. The Cyclic Redundancy Check (CRC) algorithm is an error-detection code that is commonly used in storage devices to detect damage, the accidental change of data, transmission errors, and so on. This algorithm takes a buffer as input and generates an output hash in the form of a checksum, which is typically just four bytes (32 bits when the CRC32 algorithm is used). Then, specific malware is compared with the file or buffer under analysis by calculating the CRC checksum of the entire buffer or selected parts of it. Using the example from the previous section, the EICAR test file has the following CRC32 checksum: 0x6851CF3C. An antivirus engine may detect this testing file by calculating the CRC32 checksum of the entire buffer against chunks of data (that is, the first 2Kb block, the last 2Kb block, and so on) or by analyzing the specific parts of a file format that can be divided (that is, by checking the CRC32 hash of a specific section of a PE or ELF file).

As with the previous example, the CRC algorithm is fast but generates a large number of false positives. It was not created with the aim of detecting malicious payloads but, rather, of detecting erroneous transfers of data over unreliable channels or detecting media damage. Therefore, finding “collisions” with a particular CRC32 hash is easy, causing it to generate a lot of false positives with goodware. Some antivirus engines add additional checks to their implementation; for example, they may first find a small string (a prefix) and then apply the entire CRC32 function to the buffer, starting from the prefixed string up to some determined size. But, again, the number of false positives that this approach can generate is greater than with other ones. As a simple example, both the words “petfood” and “eisenhower” have the same CRC32 hash (0xD0132158). As another example, the file with MD5 hash 7f80e21c3d249dd514565eed459548c7, available for download, outputs the same CRC32 hash that the EICAR test file does, causing false positives with a number of antiviruses, as shown in the following report from VirusTotal:
https://www.virustotal.com/file/83415a507502e5052d425f2bd3a5b16f25eae3613554629769ba06b4438d17f9/analysis/.

Modified CRC algorithms

All the antivirus engines that have been analyzed so far use the CRC32 algorithm. However, in some cases, the original CRC32 algorithm is not used, but is replaced by a modified version. For example, the tables of constants used by the original algorithm may be changed or the number of rounds may be changed. This is something that you must consider when analyzing the signatures of the antivirus product being targeted. CRC32 hashes can differ from the original CRC32 algorithm and may cause you some headaches.

Custom Checksums

Most antivirus engines create their own set of CRC-like signatures. For example, some antivirus kernels use the CRCs of some Windows PE executables sections, perform an XOR operation with all of them, and use the output as the hash for some PE files; other antivirus engines perform arithmetic calculations and displacements over blocks of data, generating a small DWORD or QWORD that is used as the signature. Some antivirus kernels generate various CRC32 checksums of some parts of the file (such as the CRC32 of the header and the footer) and use the resulting hashes as a multi-checksum signature.

The list of custom checksums is really too large to enumerate in this book. The interesting point is that such custom checksums do not offer any benefit to antivirus developers (other than using a hashing function that is unknown, which forces a reverse-engineer analyzing the targeted AV engine to discover where that function is, analyze it, and, likely, implement it). Such checksums are prone to false positives, as are the original CRC32 algorithm's checksum-based signatures. This is the reason the antivirus industry decided some time ago to use a more robust form of function hashes: cryptographic hashes.

Cryptographic Hashes

A cryptographic hash function generates a “signature” that univocally identifies one buffer and just one buffer, which thus reduces the odds of producing a false positive (because of fewer “collisions”). An ideal cryptographic hash function has four properties, as extracted from Wikipedia:

· It is easy to compute the hash value for any given message.

· It is infeasible to find a message that has a given hash.

· It is infeasible to modify a message without changing its hash.

· It is infeasible to find two different messages with the same hash.

The antivirus industry decided to use such hash functions because they do not produce false positives. However, there are disadvantages to using cryptographic hash functions. One is that it is typically more expensive to calculate, say, an MD5 or SHA1 hash than a CRC32 hash. A second disadvantage is that when a malware developer changes just one bit of data, the cryptographic hash functions return a different hash value, thus rendering the file or buffer undetectable when such algorithms are used for detection. Indeed, this is the purpose of a cryptographic hash function: it must be infeasible to modify a message without changing the resulting hash. A typical example of how to bypass such signatures is by adding one byte at the end of the file. In the case of executable files, a byte addition at the end of the file is either ignored or considered garbage and does not cause the targeted operating system to consider the file malformed or damaged when it tries to execute it.

It may seem at first that such signatures are not frequently used in today's antivirus products, but the reality is otherwise. For example, as of January 2015, ClamAV contained more than 48,000 signatures based on the MD5 hash of the file. The daily.cvd file (a file with the daily signatures) contains more than 1,000 MD5 hashes. Cryptographic hashes are often used by antivirus products only for recently discovered malwares that are considered critical, such as the droppers and dropped executables in attacks discovered in the wild. Meanwhile, stronger signatures are being developed, for which more time is required. Using cryptographic hashes in antivirus products as signatures, except in the last case mentioned, does not make any sense; this approach will just detect the given file (as their hashes were originally added into the signature database) if not modified, but changing a single bit will “bypass” detection.

Advanced Signatures

Many signature types are implemented in AV engines that are not as simple as the CRC32 algorithm. Most of them are specific to each AV product, and some of them are expensive and, thus, are used only after other signatures are matched. Most of these signatures are created with the aim of reducing the number of false positives while at the same time maximizing the possibility that an AV engineer will detect a malware family, instead of a single file such as in the previous cases in this chapter. One typical advanced signature, the bloom filter, is discussed in Chapter 3. The next section will discuss some of the most common advanced signature types that are found in various AV products.

Fuzzy Hashing

A fuzzy hash signature is the result of a hash function that aims to detect groups of files instead of just a single file, like the cryptographic hash functions' counterparts do. A fuzzy hash algorithm is not affected by the same rules as a cryptographic hash; instead it has the following properties:

· Minimal or no diffusion at all—A minimal change in the input should minimally affect the generated output and only to the corresponding block of output, if it affects it at all. In a good cryptographic hash, a minimal change in the input must change the complete hash.

· No confusion at all—The relationship between the key and the generated fuzzy hash is easy to identify, corresponding one to one. For example, a tiny change in the first block should change only the first generated output byte (if at all).

· A good collision rate—The collision rate must be defined by the actual application. For example, a high collision rate may be acceptable for spam detection, but it may not be suitable for malware detection (because of the high number of false positives it generates).

Various free public implementations of cryptographic hashes are available, including SpamSum, by Dr. Andrew Tridgell; ssdeep, by Jesse Kornblum; and DeepToad, by Joxean Koret. However, as far as can be determined, none of the antivirus products use any of these publicly available fuzzy hashing algorithms; instead they create their own. In any case, all of them are based on the same ideas and have the same three properties discussed in the previous list.

The number of false positives of such signatures—depending on the collision rate configured by the antivirus developers and the quality of the implemented algorithm—is usually lower than the number of false positives that other more basic signatures cause (such as simple pattern matching or checksums). However, because of the intrinsic nature of such hashes, false positives will happen, and such algorithms cannot be used alone. In some cases, these algorithms are used to match malware files after they pass a bloom filter, thus reducing the odds of causing false positives.

Bypassing such antivirus signatures is not as easy as in the previous cases. Bypassing a cryptographic or checksum-based hash function or a simple pattern-matching algorithm is a matter of changing just one bit in the right place (either in the specific string being matched or anywhere in the buffer). In the case of fuzzy hashes, an attacker needs to change many parts of the file because small changes to the buffer do not cause a big diffusion, if at all. The following example uses the ssdeep tool to demonstrate how such an algorithm works. Say that you want to detect the /bin/ls executable from Ubuntu Linux in your hypothetical antivirus engine using the ssdeep algorithm. Such a file will generate the following signature:

$ md5sum ls

fa97c59cc414e42d4e0e853ddf5b4745 ls

$ ssdeep ls

ssdeep,1.1--blocksize:hash:hash,filename

1536:MW9/IqY+yF00SZJVWCy62Rnm1lPdOHRXSoyZ03uawcfXN4qMlkW:MW9/ZL/T6ilPdotHaqMlkW

," ls"

The first command calculates the MD5 hash of the given file. The last command calculates its ssdeep hash. The last line is the entire signature generated by ssdeep: the block size, the hash, and the hash plus the filename. Now add one more byte at the end of the file, the character “A,” and calculate both hashes:

$ cp ls ls.mod

$ echo "A" >> ls.mod

$ ssdeep ls.mod

ssdeep,1.1--blocksize:hash:hash,filename

1536:MW9/IqY+yF00SZJVWCy62Rnm1lPdOHRXSoyZ03uawcfXN4qMlkWP:MW9/ZL/T6ilPdotHaqMlk

WP,"/home/joxean/Documentos/research/books/tahh/chapter4/ls.mod"

$ md5sum ls.mod

369f8025d9c99bf16652d782273a4285 ls.mod

The MD5 hash has changed completely, but the ssdeep hash has just changed one byte (notice the extra P at the end of the hash). If developers using this signature approach calculate the edit distance, they will discover that the file is similar to a known one, and thus detect it as part of some malware family. In order to completely change the hash when using fuzzy hash algorithms, you need to modify many other parts of this file. Try another example, this time, appending the file cp from Ubuntu Linux to the original ls file:

$ cp ls ls.mod

$ cat /bin/cp >> ls.mod

$ ssdeep ls.mod

ssdeep,1.1--blocksize:hash:hash,filename

3072:MW9/ZL/T6ilPdotHaqMlkWSP9GCr/vr/oWwGqP7WiyJpGjTO:3xZLL1doYplkWoUGqP7WiyJpG

,"ls.mod"

$ ssdeep ls

ssdeep,1.1--blocksize:hash:hash,filename

1536:MW9/IqY+yF00SZJVWCy62Rnm1lPdOHRXSoyZ03uawcfXN4qMlkW:MW9/ZL/T6ilPdotHaqMlkW

," ls"

Now, almost the entire hash has changed, and thus you have bypassed this signature. However, the number of changes required to bypass a fuzzy signature depends on the block size: if the block size depends on the size of the given buffer and is not fixed, bypassing such signatures is easier. For example, try again, this time with the DeepToad tool, which allows you to configure the block size. Select a block size of 512 bytes and hash the two files, the original /bin/ls and the modified one:

$ deeptoad -b=512 ls

NTWPj4+PiIiIiLm5ubklJSUl2tra2gMD;j4+IiLm5JSXa2gMDDAxpaTw81dUJCSQk;c3P29pqaZWU/P

7q6GBhSUtDQ4OBCQqSk;ls

$ deeptoad -b=512 ls.mod

NTWPj4+PiIiIiLm5ubklJSUl2tra2gMD;j4+IiLm5JSXa2gMDDAxpaTw81dUJCSQk;jIyhoXV1bW2Fh

aamsrKwsN7eZWVpaezs;ls.mod

This time, you cannot trick this tool by making such a change. This is for two reasons: first, because the block size is fixed, instead of being dynamically chosen, which is the case with ssdeep; and second, because DeepToad calculates three different hashes, separated by the semicolon character (;), and the first two hashes completely match. So, in short, the number of changes required to bypass a fuzzy hash algorithm depends on the block size and how the block size is chosen.

Graph-Based Hashes for Executable Files

Some advanced antivirus products contain signatures for program graphs. A software program can be divided into two different kinds of graphs:

· Call graph—A directed graph showing the relationships between all the functions in a program (that is, a graph displaying all callers and callees of each function in the software piece)

· Flow graph—A directed graph showing the relationships between basic blocks (a portion of code with only one entry point and only one exit point) of some specific function

An antivirus engine that implements a code analysis engine may use the signatures in the form of graphs using the information extracted from the call graph (a graph with all the functions in a program) or the flow graphs (a graph with all the basic blocks and relations for each function). Naturally, this operation can be quite expensive; a tool such as IDA can take anywhere from seconds to minutes to analyze an entire piece of software. An antivirus kernel cannot expend seconds or minutes analyzing a single file, so the code analysis engines implemented in AV products are limited to some instructions, basic blocks, or a configured time-out value so the analysis engine does not take longer than the specified maximum amount of time.

Graph-based signatures are powerful tools for detecting malware families that are polymorphic; while the actual instructions will be different between different evolutions, the call graph and flow graphs usually remain stable. Therefore, an AV engineer may decide to take a graph signature of the basic blocks of a particular function used to unpack the code of a malware, for example, to detect the unpacking or decryption layers.

This approach—in addition to the performance problems it may cause if no limits are set or are set inappropriately—can also cause false positives like any other approach for creating signatures. For example, if a malware author knows that his piece of software is being detected by an antivirus engine using a signature created out of the flow graph of a specific function, he may decide to change the layout (read, the flow graph) of that function to the layout of a function from goodware; this could be a function from thenotepad.exe Windows operating system tool or any other goodware software. The AV engineers will discover that they need to create a new signature for this new family instead of adapting the previous one or adding a modification to it, because the graphs used in this new evolution can be found in other, goodware, software pieces.

From the viewpoint of an attacker who wants to evade such signatures, a variety of approaches are available:

· Change the layout of flow graphs or the layout of the call graph so they look like “common” graphs extracted from any goodware software, as explained previously.

· Implement anti-disassembly tricks so the AV's code analysis engine cannot disassemble the whole function because it does not understand an instruction or set of instructions.

· Mix anti-disassembly tricks with opaque predicates so the analysis engine cannot decide correctly whether or not a jump is taken and will fail at analyzing either the “true” or the “false” path because invalid instructions or code are simply put there to fool the code analysis engine.

· Use time-out tricks to make the flow graph of the malware so complex that the code analysis engine of the antivirus kernel must stop the code analysis step before it can be considered finished because it timed out; timing out would cause it to have a partial and unreliable view of the flow graph of some or all functions.

An example open-source tool that builds and uses graph-based signatures that can be used as a testing tool is GCluster, an example script from the bigger project Pyew, available at http://github.com/joxeankoret/pyew.

This tool analyzes the program building the call graph and each function's flow graph for the list of binaries given to the tool and then compares both elements, the call graph and the flow graphs, in order to give a similarity level. The following is an example execution of this tool against two malware samples from the same family that at binary level are completely different but at structural level (the call graph and flow graphs) are exactly equal:

$ /home/joxean/pyew/gcluster.py HGWC.ex_ BypassXtrap.ex_

[+] Analyzing file HGWC.ex_

[+] Analyzing file BypassXtrap.ex_

Expert system: Programs are 100% equals

Primes system: Programs are 100% equals

ALists system: Programs are 100% equals

If you check the cryptographic hash of the files, you will see that they are actually different files:

$ md5sum HGWC.ex_ BypassXtrap.ex_

e1acaf0572d7430106bd813df6640c2e HGWC.ex_

73be87d0dbcc5ee9863143022ea62f51 BypassXtrap.ex_

Also, you can check that other advanced signatures, like fuzzy hashing at binary levels, don't work for such binaries, as in the following example run of ssdeep:

$ ssdeep HGWC.ex_ BypassXtrap.ex_ ssdeep,1.1--

blocksize:hash:hash,filename12288:faWzgMg7v3qnCiMErQohh0F4CCJ8lnyC8rm2NY:

CaHMv6CorjqnyC8

rm2NY,"/home/joxean/pyew/test/graphs/HGWC.ex_"

49152:C1vqjdC8rRDMIEQAePhBi70tIZDMIEQAevrv5GZS/ZoE71LGc2eC6JI/Cfnc:

C1vqj9fAxYmlfACr5GZAVETeDI/Cvc,"/home/joxean/pyew/test/graphs/BypassXtrap.ex_"

Clearly, graph-based signatures are much more powerful than signatures based exclusively in the bytes. However, for performance reasons their use is often prohibitive. This is why antivirus companies did not adopt this approach massively: it is not practical.

Summary

Antivirus signatures play an integral part in malware detection. They have been used since the inception of the AV software. Essentially, signatures are databases of some sort that are used in conjunction with various matching algorithms to detect malware or a family of malware. For each of the signature database types, this chapter also showed various methods for circumventing detections based on them. Various types of signature databases are mentioned in this chapter:

· Byte-streams, as the name suggests, are used in conjunction with string matching algorithms to match a sequence of bytes in the malicious file.

· Checksums, such as the CRC32 checksum algorithm, are applied on a byte-stream to generate a unique identifier that is then looked up in the signature. Checksums are usually weak against collision attacks and prone to generating false positives.

· Cryptographic hash functions, unlike checksum algorithms, are resilient against collision attacks and do not cause a lot of false positives. However, they take a long time to compute. Malware writers can easily evade those algorithms because a simple change in the input file can generate a totally different hash value.

· Fuzzy hash functions are used to detect a group of files, typically malware files belonging to the same family. Unlike cryptographic hashes, it is somewhat acceptable to have collisions. If collisions occur, it is usually because the malware with the fuzzy hash belong to the same family.

· Finally, graph-based hashes are computed from either the call graphs or the flow graph of a malicious executable. Calculating graph-based hashes is more time-consuming than all other hashing methods and requires that the AV engine has disassembling ability so it can build such graphs. Nonetheless, graph-based hashes are very good for detecting different iterations of the same malware, because they rely not on the bytes-stream sequence but on the relationship of basic blocks or functions call graphs.

The next chapter introduces the update services, discusses how they work, and then walks you through a practical example of how to dissect and understand a real-world update service of a popular AV software.