Digital Archaeology (2014)

10. E-mail Forensics

Throughout history, people have used various methods of exchanging messages over long distances. In the days of the Roman Empire, long-distance runners acted as couriers, carrying messages back and forth. Trained pigeons have transported notes attached to their legs, and more recently, the U.S. Postal Service has served as an agent for transporting letters back and forth. In the Electronic Age, it is the wonderful world of electronic mail—or, more simply, e-mail—that serves as the transport mechanism for the vast majority of today’s written communications.

The relationship of e-mail to crime is convoluted and appears in many forms. Direct criminal activities involving e-mail include phishing attempts, e-mail fraud, and extortion. E-mail can also provide evidence of other crime. Later in the chapter there will be an account of a murder that was solved in Vermont using e-mail correspondence to tie the co-conspirators together with the victim. This same example shows how e-mail evidence can be used to prove a conspiracy. This chapter focuses on finding e-mail, back tracing a message to its source, and analyzing e-mail contents.

E-mail Technology

An investigator who doesn’t understand the basic technology behind e-mail communications is going to be at a distinct disadvantage when it comes to using e-mail as evidence of malfeasance. Understanding the technology assists in locating e-mail messages thought to be destroyed and helps prove the source of a message.

An e-mail message at its heart is a simple text message that starts at one computer and traverses a network or the Internet, finally arriving at the destination computer (Figure 10.1). Modern e-mail protocols allow users to send messages more complex than plain text. Graphics and advanced formatting become possible with HTML. Files can be sent by e-mail as attachments.

Figure 10.1 A simplified version of conventional e-mail traffic

How E-mail Travels

E-mail transport happens at three different levels. The mail user agent (MUA) is the application with which the user interfaces. Most computer professionals simply refer to this as the client. The mail transfer agent (MTA) is responsible for getting the message from the sender to the recipient. The mail delivery agent (MDA) sorts out all the e-mail that arrives at a specific location and gets each message to the correct recipient. Dedicated mail server applications, such as Microsoft Exchange Server or the Linux-based Sendmail application, provide MTA and MDA services without the average administrator needing to know what part of the application is performing what function. Some of the open-source solutions have separate applications for MTA and MDA.

To get from Point A to Point B, the e-mail starts at the sending computer’s client application, or more simply, the client. A client is a program running either as a local application on the user’s computer or as a Web application from the e-mail provider that supports the transfer back and forth of electronic messages. Once the e-mail’s composer clicks on the SEND button, the message begins its path across several e-mail servers until the recipient’s client sends a request to the server that provides e-mail services for that person to download all new messages. At that point the server sends all the messages that have accumulated. Depending on the client configuration and the policies of the user’s ISP, the messages may or may not be deleted from the queue at that point. Every step of this process is a function of the e-mail protocols in use by the various clients and servers in the path.

E-mail Addresses

E-mail addresses typically consist of three elements. The first of these is the user name. This is assigned to the individual by the e-mail service provider used by that person. The “at sign” (@) acts as a pointer that indicates of what domain this user is a member. Following the @ is the domain name that hosts the user account. Therefore, if William Robert acquires an e-mail address provided by an organization called ringbox.com, his e-mail address might be displayed as billybob@ringbox.com—or generaluser@somewhere.com, as seen in Figure 10.2. This combination of elements separates him from users billybob@rodeo.com or billybob@whitehouse.org, which are likely to represent different users, although William may have set up accounts with both of those other organizations.

Figure 10.2 The syntax of a simple e-mail address

The user name in an e-mail address is not necessarily indicative of the user’s actual name. Most services allow the user to request a user ID and, if it is available, will provide that ID. For example, the author could request michaelgraves@yahoo.com. However, if that user ID is already taken (which it is), then he must select another ID that is available. Therefore, in the example presented here, if a person is trying to send an e-mail to the author of this book, and assumes that michaelgraves@yahoo.com is the correct e-mail address, the e-mail is likely to be delivered, but not to the person intended. Likewise, if an investigator is searching for e-mails sent by the author, any e-mails carrying that user ID are not going to be his, unless the author has deliberately spoofed the user ID. Spoofing is a method of altering information to make it appear as if it originated from somewhere else. There will be more on that later.
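The address structure described above can be sketched in a few lines of Python. This is a minimal illustration, not a full address validator: the function name split_address is hypothetical, and the sketch assumes a plain user@domain address, optionally wrapped in a display name.

```python
from email.utils import parseaddr


def split_address(raw):
    """Return (user_name, domain) for an address such as 'billybob@ringbox.com'."""
    # parseaddr drops any display name, e.g. 'William Robert <billybob@ringbox.com>'
    _display_name, address = parseaddr(raw)
    # Split on the last @ so that everything after it is treated as the domain
    user, at_sign, domain = address.rpartition("@")
    if not at_sign or not user or not domain:
        raise ValueError(f"not a user@domain address: {raw!r}")
    # Domain names are case-insensitive, so normalize them for comparison
    return user, domain.lower()
```

Comparing the returned pairs makes the point of the section concrete: billybob@ringbox.com and billybob@rodeo.com share a user name but are different accounts, because the domain differs.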

E-mail Protocols

E-mail transfer requires separate protocols for sending and for receiving messages. As of this writing, Simple Mail Transfer Protocol (SMTP) is used for all outbound transmissions of e-mail. Two protocols used for receiving messages are Post Office Protocol, Version 3 (POP3) and the Internet Message Access Protocol (IMAP).

SMTP and ESMTP

Sending e-mail requires the support of SMTP. Extended SMTP (ESMTP) is used as both a mail submission protocol and a mail relay protocol. For the purposes of this book, the differences between the two protocols are inconsequential. (A more detailed description of e-mail protocols can be found later in the chapter.)

The client connects to the server application over TCP/IP port 25 by default. However, due to the onslaught of malware targeting port 25, many ISPs are redirecting traffic to port 587. The client starts out by sending a simple handshaking command, HELO (or EHLO under ESMTP), which identifies the client to the server. The client then informs the server that user@domain.com wants to send a message to user@otherdomain.com. The server examines both addresses and determines two things. First, the transmitting e-mail address is valid and is authorized to use the services provided by the server. Second, the recipient address is a valid address. Once the transmitting address is validated, and the recipient address is determined to have valid syntax, the server accepts the message and attempts to transmit it.
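In SMTP as specified in RFC 5321, the greeting and the sender and recipient addresses travel as separate commands. The sketch below only builds the client side of such an exchange as text; it does not open a network connection, and it omits server replies, authentication, and the dot-stuffing a real client performs. The function name smtp_commands and the host name used in the example are hypothetical.

```python
def smtp_commands(client_host, sender, recipient, message_text):
    """Build the client side of a minimal SMTP exchange as a list of lines."""
    return [
        f"HELO {client_host}",     # the client identifies itself to the server
        f"MAIL FROM:<{sender}>",   # envelope sender address
        f"RCPT TO:<{recipient}>",  # envelope recipient address
        "DATA",                    # the message itself follows this command
        message_text,              # header lines, a blank line, then the body
        ".",                       # a line holding only a dot ends the message
        "QUIT",                    # close the session
    ]


if __name__ == "__main__":
    for line in smtp_commands(
        "client.example",
        "user@domain.com",
        "user@otherdomain.com",
        "Subject: Test\r\n\r\nHello.",
    ):
        print(line)
```

Note that the addresses in MAIL FROM and RCPT TO form the envelope and are separate from the FROM: and TO: lines inside the message, which is one reason header fields alone cannot prove where a message originated.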

The client will probably not send the e-mail message immediately unless the user uses a Send Now option or if the client has been specifically configured to transmit messages immediately. Most e-mail clients default to a queued transmission schedule by which e-mails build up and are all transmitted at specific intervals, such as every ten minutes. Likewise, the client, when attached to the Internet, will check the server’s inbox at similar intervals looking for new incoming messages to download.

POP3

POP3 is the third incarnation of the venerable Post Office Protocol. It is the age-old standard for managing incoming e-mail messages. POP3 allows for standard text messages, attachments, and HTML encoded messages. Messages can be configured to remain on the server after download or be deleted. Once a message transfer is completed, the recipient can disconnect from the Internet and read the messages at leisure. POP3 transfers data over TCP/IP port 110.

IMAP

The IMAP (Internet Message Access Protocol) is a more modern protocol than POP3. There are two differences worth noting between IMAP and POP3. A key difference to the investigator is that, unless specifically configured otherwise, IMAP leaves all messages on the server after downloading. Another critical difference is that multiple users can administer the same mailbox. Therefore, just because a message is contained within an IMAP mailbox does not explicitly point to the owner of that mailbox as the person responsible for the message. IMAP uses port 143 for data transfer. IMAP is frequently used by people who use offline storage of e-mail messages or when multiple people administer the same mail box.

E-mail Clients

Virtually anyone who owns a computer these days has some form of e-mail client installed on the computer, whether that software is used or not. Nearly every operating system that ships or that is available for download provides an e-mail client (Figure 10.3). A wide variety of clients are available either as a free download or as a part of a commercial software suite. Additionally, e-mail clients may reside on a host Web site and open as a Web page for the user. Some Web-based e-mail services allow the user to interface with a host client, while others offer only Web page support. Table 10.1 lists a representative collection of different e-mail clients, although it is far from being a comprehensive list.

Figure 10.3 Microsoft Outlook is a commonly used e-mail client that is part of the Office suite.

Table 10.1 Common E-mail Clients

Whichever e-mail client a user has selected, there are some main functions that are always present:

• Create and transmit messages

• Receive messages

• Display list of messages in inbox by header

• Open a message (and associated attachments)

• Add attachments to outgoing messages

• Receive attachments with incoming messages

The latter two functions are frequently limited or controlled by the e-mail service provider. In order to more efficiently utilize available storage and bandwidth, it is common for e-mail providers to limit the size of the attachments they will allow.

E-mail clients also control where saved messages are stored and how deleted messages are handled. A typical Windows user with Outlook Express installed and configured as an e-mail client will have one file containing their personal address book and one containing their mail folders. The address book will typically have a .wab extension, while the mail folders have a .mbx extension. For example, a user whose login ID is robchild will have an address book stored in the robchild.wab file and a set of mail folders in a robchild.mbx file. Users of the dedicated Outlook application will have a single file with a .pst extension.

Information Stores

The various e-mail clients each have specific mechanisms for storing information. The information typically stored by an e-mail client includes the messages and the folders created by the end user for organizing messages, calendars, address books, and other information (such as notes, reminders, and task lists). Insomuch as there are literally hundreds of e-mail clients, it would be far beyond the scope of this book to cover them all. However, the vast majority of e-mail accounts are accessed by relatively few clients. In the corporate environment, there are even fewer. The different architectures this chapter will cover include

• Microsoft Outlook Express

• Microsoft Live Mail

• Microsoft Outlook

• Microsoft Entourage

• Lotus Notes

• Evolution

• KMail

• OS-X Mail

While there are notable differences in how each of the applications stores data, there are also significant similarities. An effective e-mail client stores messages and schedules in one file and address books in another. The reason for that is that the address book is frequently made available to other applications. For example, the personal address book (PAB) in Outlook can provide addresses, feed mail merges, and do other tasks for Microsoft Word.

There are two major Microsoft e-mail clients—Outlook and Outlook Express (OE). The two applications are somewhat different in how they store information, so a discussion about each one is in order.

Outlook Express

OE was the default mail client in Windows versions 98 through XP (Windows Vista replaced it with Windows Mail). Its lengthy reign means that it will appear as the client of choice on a large number of computers still in use. Versions of OE, along with the associated Windows versions, appear in Table 10.2.

Source: PeoplePC 2011.

Table 10.2 Outlook Express Version Evolution

OE 4.0 stored data in a file with a .mbx extension. In OE versions using the MBX file format, three files retained user data:

• MBX files: Contain text from all messages stored on system.

• IDX files: Index files for each MBX file. MBX and IDX files occur in pairs.

• NCH files: Retain the folder structure created by the user.

Since OE 4.0 is obsolete and rarely seen in use, there are few (if any) tools available for examining the system.

Subsequent versions all store messaging information in a database file with a .dbx extension. Thus, it is generally referred to as the DBX file. Other files used by other applications also use the same extension; therefore, the .dbx extension is not a positive indicator that the file is a data archive for OE. Two identifiers link a file with the .dbx extension to OE.

When viewing the file in raw format (as one would see it in a disk editor or if capturing the file with a carving utility), the header will begin with 0xCF 0xAD 0x12 0xFE. Following the header, a class identifier (CLSID) identifies the type of DBX file it is. A CLSID is a string at the beginning of certain types of files that the OS uses to associate the file with a specific application and to define the file as an object within the OS. In this particular case, the CLSID tells Windows that this particular DBX file is associated with Outlook Express and that it is a database object. Versions of OE that use the more common DBX file are more readily examined. The files generally possess user-friendly names, so they are easily identified. Typical DBX files include

• INBOX.DBX: The user inbox.

• SENT ITEMS.DBX: Messages sent by the default user.

• DELETED ITEMS.DBX: Contains messages deleted from the inbox.

• DRAFTS.DBX: Messages begun, but not finished, may be stored here, as well as messages waiting for further attention prior to transmission.

• OFFLINE.DBX: Exists on systems where the user has configured Webmail services, such as Hotmail. Does not exist on systems where Webmail accounts have not been configured.

• POP3UIDL.DBX: Tracks messages left on the POP server.

• <Generic name>.DBX: Database for a user-created mail folder. For example, if the user has a folder called EDUCATION, a file called EDUCATION.DBX will be created.

• <Newsgroup name>.DBX: If the user subscribes to a news group, a folder will be created for that news group.

The user’s address book is stored in a Windows Address Book (WAB) file.
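Because other applications also use the .dbx extension, the four header bytes given earlier offer a quick first check. The following is a minimal sketch that tests only those four bytes; it does not inspect the CLSID that follows them, and the function names are hypothetical.

```python
# First four bytes of an Outlook Express DBX file, as given in the text
DBX_MAGIC = bytes([0xCF, 0xAD, 0x12, 0xFE])


def has_dbx_signature(data):
    """Return True if a block of raw bytes begins with the OE DBX header."""
    return data[:4] == DBX_MAGIC


def file_has_dbx_signature(path):
    """Read only the first four bytes of a file and test them."""
    # Open in binary mode; on a real case, work from a read-only forensic copy
    with open(path, "rb") as handle:
        return has_dbx_signature(handle.read(4))
```

A file named INBOX.DBX that fails this check deserves a closer look, as does a file with an unrelated name that passes it.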

Outlook

The full version of Outlook was designed to be a complete personal organizer, rather than simply an e-mail client. There have been several versions over the years (Table 10.3). Calendar events, contact information, notes, and messages were all stored in databases accessed by a common interface.

Table 10.3 Microsoft Outlook Versions

Outlook stores data in Personal Folder Files (PST, due to the file extension). By default, PST files are located in the user’s Documents and Settings folder created when the user’s profile is generated by Windows. However, this is a setting easily changed by the user, and therefore, a general search for PST files should be performed any time Outlook is the preferred application on a computer.
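A general search of this kind can be sketched as a directory walk. The example below assumes the evidence is available as a mounted, read-only copy of the file system. It also relies on one fact not stated in this chapter: Microsoft's published PST file format begins each file with the four ASCII bytes !BDN. The function name find_pst_candidates is hypothetical.

```python
import os

# Signature at the start of a PST file per Microsoft's published format
PST_MAGIC = b"!BDN"


def find_pst_candidates(root):
    """Yield (path, has_magic) for every file ending in .pst under root."""
    for folder, _subfolders, files in os.walk(root):
        for name in files:
            # Match the extension without regard to case
            if not name.lower().endswith(".pst"):
                continue
            path = os.path.join(folder, name)
            try:
                with open(path, "rb") as handle:
                    has_magic = handle.read(4) == PST_MAGIC
            except OSError:
                # Unreadable files are still reported, flagged as unverified
                has_magic = False
            yield path, has_magic
```

This finds only files that still carry the .pst extension. A renamed archive would require searching file contents for the signature instead, which is the job of a carving utility.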

E-mail Servers

For two e-mail clients to communicate with one another, there must be at least one e-mail server, but there will most likely be two or more. In a corporate environment, where two employees exchange messages, it is possible for a single server to do all the work, acting as both an SMTP server for outgoing messages and a POP3 or IMAP server for incoming. Two people across the world from each other will each be connecting to their own server, and along the way, the message will bounce across numerous relay servers. Relays don’t act as actual e-mail servers but can provide information valuable to an investigation, as will be examined later in this chapter.

SMTP Servers

The SMTP server handles all outgoing messages. In a typical e-mail system, the SMTP server might have an address such as smtp.mwgraves.com. The e-mail client must be configured to connect to the SMTP server whenever it needs to transmit messages. The client connects to the server across port 25 or port 587, depending on how it was configured. The server verifies that the sending account is a valid one and looks at the target address. That address is split into two components—the user ID and the domain name. If the target domain name is the same as the sender’s, a subroutine in the e-mail software called a delivery agent hands the message off to the POP3 or IMAP server in the domain. If one server is performing both functions, the agent will simply paste a copy of the message into the recipient’s inbox folder.

If the message is intended for an external domain, the SMTP server sends a request to a Domain Name System (DNS) server to resolve the domain name to the IP address of a recipient SMTP server. Once the address is resolved, the server transmits the message to its destination, and then sits back and waits for a response. If the address is good and the message is delivered, an acknowledgment (ACK) packet will be returned to the transmitting server. If the address cannot be resolved, or if the user is not a valid user on the target system, a nonacknowledgment (NACK) packet will be returned. The latter will initiate a delivery failure message by the SMTP server that will go to the sender’s POP3 or IMAP server, eventually to be delivered to the user.
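The decision the SMTP server makes, local delivery versus relay to an external domain, can be sketched as follows. This is a simplified illustration under stated assumptions: the function name next_hop and its return labels are hypothetical, and the DNS lookup itself is left out because it requires a live resolver.

```python
def next_hop(server_domain, recipient):
    """Decide what an SMTP server does with a message for one recipient."""
    # Split the target address into its user ID and domain name
    user, _at_sign, domain = recipient.rpartition("@")
    if domain.lower() == server_domain.lower():
        # Same domain: hand the message to the local delivery agent
        return ("local-delivery", user)
    # Different domain: the server would now ask DNS for that domain's mail server
    return ("relay", domain.lower())
```

For a server handling mwgraves.com, a message addressed to a user in that domain stays local, while a message addressed to any other domain is queued for relay.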

The routing protocols used by devices along the way determine the fastest path between source and destination. A message is quite likely to be relayed by numerous intermediate SMTP servers before it reaches the recipient. Each server will add a Received: line to the top of the header.

POP and IMAP Servers

POP3 and IMAP act as the post office for the network. Incoming messages are stored on these servers, waiting for end users to access and download them. With POP3, SMTP retrieves the message from the Internet, copies it to the message queue, and then notifies the SMTP delivery agent that there is a new message. The delivery agent transfers the message to the mail storage folder for distribution by POP3. With IMAP, the “mailbox” resides on the server and not on the user’s computer.

This is the root of one key difference between the protocols that is particularly relevant to forensic investigation. While it can be configured to save messages until deleted by the user, the default configuration for a POP server is to delete a message once it is downloaded. IMAP defaults to saving the message on the server until deleted. IMAP also allows a single user to maintain multiple inbox folders, or for several users to share a single inbox.

The ramifications of these differences are significant. If the investigator sees that the suspect uses IMAP, then a strong possibility exists that the ISP still has copies of pertinent messages. However, additional corroboration will be required to verify that the message can be connected to a specific individual. In a shared environment, someone can use the defense that another user is responsible for the message. Conversely, even if a user is the sole owner of an account, there is no reason that the same user can’t have other accounts on the system. Deeper digging will be required.

The Anatomy of an E-mail

The fundamental structure of an e-mail is based on a standard called the Multipurpose Internet Mail Extensions (MIME). A key functionality of MIME is to define the format of an e-mail. At the very minimum, an e-mail consists of a header and the body.

The header contains the control information used by servers to identify and direct the journey of the message. It is broken down into numerous fields to be discussed in this section. The body is the text of the message as composed by the author of the message.

Optionally, a message might also contain one or more attachments. An attachment is a separate file that is “paper-clipped” to the message and transmitted alongside of it. Some e-mails contain additional content even when the author did not intentionally add an attachment. Custom signatures added by e-mail clients transfer as embedded images. These images are sent as an attachment and displayed via a hidden link.

Standard Header Information

E-mail headers are metadata fields contained in every SMTP message transmitted over the Internet. They consist of multiple fields that contain useful information about the message, from source to destination. There are four header elements common to all e-mail messages. These are all fields that are frequently indexed by popular e-mail clients so that a user can sort messages using any one of these fields. These standard fields are

• TO:

• FROM:

• SUBJECT:

• DATE:

While superficially these seem to be self-explanatory, a little discussion is in order. The TO: field contains the name of the addressee(s) as seen in the current state of the message. Multiple addressees can be defined for a single e-mail. Each address is separated by either a comma or a semicolon (depending on the e-mail client used). Names in the TO: field should not be considered as definitive. This information can be overwritten and the message retransmitted, or it can be spoofed. When a message arrives in a user’s inbox, the mere fact that her name is in the TO: field is no indication that she was the original intended recipient—or the only recipient, for that matter—even if her name is the only one that appears in the TO: field. Mass-mailing software sends the same message to every e-mail address in a database and can even customize the messages to extract information from lists and insert it into the message.

The FROM: field is even more likely to contain bogus information. E-mail spoofing changes the information in various MIME fields in order to conceal the origin of the message. Some viruses scour the hard disk of the computer where they have just landed and forward themselves to every e-mail address they can locate on the machine. To the recipients of the relayed message, it would appear that this user was the person who originated the message.

The SUBJECT: line is optional and may be left empty. It is typically filled in by the original author. If a reply is sent, the e-mail client will automatically prepend the prefix RE: to the subject line, which derives from the Latin word res and is read as “pertaining to.” A forwarded message typically receives a prefix such as FW: instead.

The DATE: field specifies on what date the message was sent. This metadata element is generated by the e-mail client that originates the message. Therefore, the time/date stamps on the message will be dependent on the clock set on the client machine. A sender can easily set the clock on her computer to any time and date that she pleases, compose and send a message, and the time stamp on the message will read accordingly. However, the time/date stamps found in other header fields generated by intermediate transport servers will reveal the correct time/date stamp. Time/date stamps on replies will be different from those on original messages, since, to the system, a reply is a new message being sent. As with the other fields, this is easily spoofed.
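The four standard fields can be pulled from a raw message with Python's standard email package. The message below is invented for illustration; its names, addresses, and date are not taken from any real case. The function name standard_fields is likewise hypothetical.

```python
from email import message_from_string
from email.utils import getaddresses, parsedate_to_datetime

# A made-up message used only to demonstrate the parsing
SAMPLE = """\
From: John Example <john@example.com>
To: Sally Example <sally@example.org>, pat@example.net
Subject: RE: Quarterly numbers
Date: Tue, 04 Mar 2014 09:15:00 -0500

Body text goes here.
"""


def standard_fields(raw_message):
    """Return the TO, FROM, SUBJECT, and DATE fields of a raw message."""
    msg = message_from_string(raw_message)
    date_text = msg["Date"]
    return {
        # Each address comes back as a (display name, address) pair
        "from": getaddresses(msg.get_all("From", [])),
        "to": getaddresses(msg.get_all("To", [])),
        "subject": msg.get("Subject", ""),
        # The date is whatever the sending client claimed, time zone included
        "date": parsedate_to_datetime(date_text) if date_text else None,
    }
```

The values returned are exactly what the message claims about itself. Nothing in this step verifies them, which is why these fields are treated as leads rather than proof.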

MIME Header Information

In addition to the obvious fields, there are a number of metadata fields populated by e-mail clients as well as servers along the path of the message. The header information in a typical e-mail message can be extracted easily from the e-mail client (Figure 10.4).

Figure 10.4 Internet headers can be easily extracted from most e-mail clients.

On the Webmail application of my ISP (hosting www.mwgraves.com), the header can be extracted as follows (Figure 10.5):

1. Open the message.

2. Click Mail.

3. Select Show Details.

4. Select All from the Header field.

5. Cut and paste into a word processing application.

Figure 10.5 Finding header information in a Webmail application

The header information can be extracted from Microsoft Outlook as follows (refer to Figure 10.4):

1. Open the message.

2. Click View.

3. Click Options.

4. Select All from the Internet headers field.

5. Cut and paste into a word processing application.

With Microsoft Entourage (for Macintosh OS-X), the header information can be extracted as follows (Figure 10.6):

1. Open the message.

2. Click View.

3. Click Internet Headers. (A new window will open with the Header Information.)

Figure 10.6 Finding header information in Microsoft Entourage

Header information is read from the bottom line up. The following is an extract from a typical e-mail, along with explanations of the content for specific fields. I have interjected descriptions of key lines in bold that were not originally part of the header. Server names and IP addresses of the network servers have been masked for obvious reasons. Time and date stamps should always be verified through external sources. Spoofing software can set times and dates to anything the creator wants them to be.

Tracing the Source of an E-mail

As has been mentioned, one of the key points for the investigator to remember is that much of the information stored in a header can be spoofed. However, the attacker has no control over intermediate servers that exist between the source and target machines. When an e-mail is first generated on the transmitting client computer, the first of the headers is created.

When John sends a message to Sally, John’s e-mail client will generate a FROM: header line containing John’s e-mail address and a TO: header line with Sally’s e-mail address. If John is using a cleverly designed bulk e-mail application, he can configure it to put into the header any FROM: address he chooses to invent. John’s client application is configured with the server name and IP address of his SMTP server. He can also tell it to inform the SMTP server that his computer is any name or IP address that he wants it to be.

The SMTP server will take the information provided by John’s client and generate the first Received: line, appending his information to the header. The server uses the Domain Name System (DNS) to locate the target mail server identified in Sally’s address, which returns the IP address of the target mail server. A mail transfer agent packages the message and transmits it to that IP.

In a perfect world, where there are express routes from every point to every other point, the mail would arrive and the recipient’s mail server could add its information to the header and that would be it. However, in the real world there are usually multiple hops along the way. These intermediate mail servers add their signature in the form of a Received: line. Each line identifies the server from which the e-mail was received and, typically in a by clause, the server that received it, so each hop can be checked against the line added by the next server along the path. None of the intermediate servers change the all-important message ID.

Each server maintains activity logs. If these activity logs can be obtained from the service provider, they can be used to ascertain certain things. For example, it is quite possible for users to forge message headers to make it appear that messages came from places they didn’t or arrived at times of their choice. Comparing the SMTP logs to questionable messages can provide evidence of such tampering.
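One simple tampering check is to compare the DATE: field, which the sender controls, with the time stamp on the earliest Received: line, which the first server wrote. The sketch below uses an invented message with documentation-only host names and addresses. It assumes each Received: line ends with a semicolon followed by a time stamp, which is the conventional layout. The function names are hypothetical.

```python
from email import message_from_string
from email.utils import parsedate_to_datetime

# A made-up message; the newest Received: line is at the top
SAMPLE = """\
Received: from relay.example.net (relay.example.net [198.51.100.7])
\tby mx.example.org with ESMTP id ABC123;
\tTue, 04 Mar 2014 14:16:10 +0000
Received: from johns-pc (unknown [203.0.113.25])
\tby smtp.example.com with ESMTP id XYZ789;
\tTue, 04 Mar 2014 14:15:58 +0000
From: john@example.com
To: sally@example.org
Subject: Hello
Date: Mon, 03 Mar 2014 09:15:00 -0500
Message-ID: <1234@example.com>

Body.
"""


def received_hops(raw_message):
    """Return the Received: hops oldest first, as (details, timestamp) pairs."""
    msg = message_from_string(raw_message)
    hops = []
    # Headers are read from the bottom up, so reverse the order found in the file
    for line in reversed(msg.get_all("Received", [])):
        details, _semicolon, stamp = line.rpartition(";")
        # Collapse folded whitespace so each hop reads as a single line
        hops.append((" ".join(details.split()), parsedate_to_datetime(stamp.strip())))
    return hops


def date_gap_seconds(raw_message):
    """Seconds between the claimed Date: and the first server's time stamp."""
    msg = message_from_string(raw_message)
    claimed = parsedate_to_datetime(msg["Date"])
    first_server_time = received_hops(raw_message)[0][1]
    return (first_server_time - claimed).total_seconds()
```

In the sample, the client claims the message was written more than a day before the first server saw it. A gap like that is not proof of tampering by itself, since clocks drift and messages sit in queues, but it tells the investigator which server logs to request.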

An Approach to E-mail Analysis

When beginning a forensic examination of a body of e-mail, it is critical for the investigator to remember a few key points.

This is as much a legal matter as it is a technical matter. If working from a warrant, it is likely that the warrant spells out specific subjects or specific identities that the investigation can target. Exceeding the boundaries of the warrant can result in numerous problems. Evidence can be disqualified, the investigator can lose credibility, or an unintended privacy breach can result in a lawsuit.

In spite of the massive volume of correspondence a typical e-mail server might yield, only a few of those messages are of any relevance. The case will dictate the legal limitations the investigator faces, while the budget allowed will dictate the scope of the search.

Large volumes of data to sort through will not make the courts, the lawyers, or the clients more forgiving of the investigator who fails to meet deadlines. Time is the constant delimiter.

When the investigator does stumble across the proverbial “smoking gun” that solves the case, he or she will have to explain to anybody that matters just how the evidence was obtained, why it is relevant, and what makes it authentic.

The opposing party is going to do everything in its power to discredit all the work that is done on this side of the investigation. Document everything to make that as difficult as possible.

The Search

The first step the investigator must take when searching e-mail is to identify all sources of e-mail used on the target system. As seen in the previous sections of this chapter, different e-mail systems store messages (and therefore evidence) in different locations on the computer. In some cases, the evidence will not be on the computer, but somewhere out in the cloud. Web-based systems such as Hotmail, Yahoo Mail, and such typically provide each user with a fixed amount of storage space on which he or she can store messages. Client-based applications such as Outlook and Entourage store them in a database on the local system.

Once the archives have been identified, the search for relevant message threads begins. This of course raises the question: How does one find the few relevant e-mails among hundreds of thousands of messages in perhaps dozens of archives? And once a relevant message is found, how does the investigator keep the thread together? A thread is the complete string of e-mail messages, beginning with the initial message and following through each reply or response, all the way to the final message. Keeping the thread intact is critical if the outcome of the case depends on the reliability of the chain.

Searching the Archives for Strings

Typical command-line text search utilities, such as GREP, fall short in this area. If an investigator is searching for any message that contains the term “infiltrate,” after a lengthy search, the utility will return any message that contains the word but will not segregate based on who sent the message. Care must be taken in the selection of search strings as well. Just asking GREP to look for “bright” will return every instance of that string, including words like “brighter,” “brightest,” “Albright,” and “brighten,” along with potentially hundreds of variations of the string. The investigator must use manual methods to locate and connect the entire e-mail chain from an archive.
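The difference between a plain substring search and a whole-word search is easy to demonstrate. The following sketch uses Python's re module rather than GREP itself, and the function names are hypothetical.

```python
import re


def substring_hits(term, texts):
    """Return every text that contains the term anywhere, even inside a word."""
    return [text for text in texts if term.lower() in text.lower()]


def whole_word_hits(term, texts):
    """Return only texts where the term stands alone as a complete word."""
    # \b marks a word boundary; re.escape guards against special characters
    pattern = re.compile(rf"\b{re.escape(term)}\b", re.IGNORECASE)
    return [text for text in texts if pattern.search(text)]
```

Given the texts "The future looks bright.", "Ask Albright about it.", and "A brighter screen", the substring search for "bright" returns all three, while the whole-word search returns only the first. Neither approach says anything about who sent the message, which remains the larger limitation.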

Tools such as those by Clearwell, Paraben, and others facilitate this task. Dedicated e-mail analysis software such as this offers far more sophisticated search strategies. The investigator can search multiple keywords simultaneously. Boolean tools allow fine-tuning searches even further. Many offer deduping capabilities, eliminating wasteful duplicate e-mails. Filters that automatically compensate for time zone changes prevent mistakes in time line analysis.
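As an illustration of the deduping idea only, the sketch below treats two messages as duplicates when they carry the same Message-ID header, falling back to a hash of the raw text when that header is missing. This is an assumption made for the example, not a description of how any commercial product works, and the function name dedupe is hypothetical.

```python
import hashlib
from email import message_from_string


def dedupe(raw_messages):
    """Keep the first copy of each message, comparing by Message-ID."""
    seen = set()
    unique = []
    for raw in raw_messages:
        message_id = message_from_string(raw).get("Message-ID")
        if message_id is not None:
            key = message_id.strip()
        else:
            # No Message-ID present: fall back to a hash of the whole message
            key = hashlib.sha256(raw.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(raw)
    return unique
```

Keying on the message ID means that two copies collected from different mailboxes count as one message even though their Received: lines differ. An examiner would normally log the discarded copies rather than destroy them, since the fact that a second person received the message can itself be evidence.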

Boolean operators are used in a refined search in order to limit results. A Boolean operator is a term that the user adds to a search phrase that is not a term that is searched but rather one that defines the search on a more granular level. Boolean words must be typed in all capital letters. Boolean operations include

• AND: The search must include both words (or both phrases enclosed in quotes).

• OR: The search must include either of the words or quoted phrases, but not necessarily both.

• NEAR: Both words or phrases must be included in the entity, and they must occur in close proximity to one another within the entity.

• + or “”: Search for the phrase exactly as typed (do not put a space between + and first term of search string).

• - or NOT: Do not include any entity that contains the following string along with the defined search. For example, typing [car -Ford] (without the brackets) will bring up every entity within the database that includes the word “car” but does NOT include “Ford.”

Most databases automatically assume the AND operator between the words of a search. They also do not assume any relationship between words in a phrase. Therefore, the search string [Ford Motor Company] with no quotes will bring up every document that contains “Ford,” “motor,” and “company” anywhere in its text, in any order and however far apart. Enclosing the same three words in quotation marks will return only documents that contain the exact phrase “Ford Motor Company.”
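The difference between AND, OR, a quoted phrase, and NOT can be seen in a toy example. The Python sketch below runs each kind of query against four invented documents. The helper names (has_all, has_any, has_phrase) are hypothetical and stand in for whatever query engine a real tool provides.

```python
import re

# Hypothetical four-document "archive"
DOCS = {
    1: "Ford Motor Company recalled the car.",
    2: "The company bought a motor from Ford.",
    3: "My car is a Toyota.",
    4: "Harrison Ford drove a car.",
}

def words(text):
    """Lower-cased set of the words in a document."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def has_all(terms):       # AND: every term must appear somewhere
    return {d for d, t in DOCS.items() if set(terms) <= words(t)}

def has_any(terms):       # OR: at least one term must appear
    return {d for d, t in DOCS.items() if set(terms) & words(t)}

def has_phrase(phrase):   # quoted phrase: the exact sequence must appear
    return {d for d, t in DOCS.items() if phrase.lower() in t.lower()}

and_hits = has_all(["ford", "motor", "company"])   # documents 1 and 2
or_hits = has_any(["ford", "motor", "company"])    # documents 1, 2 and 4
phrase_hits = has_phrase("Ford Motor Company")     # document 1 only
not_hits = has_all(["car"]) - has_any(["ford"])    # [car -Ford]: document 3 only

print(and_hits, or_hits, phrase_hits, not_hits)
```

Document 2 satisfies the AND query even though it has nothing to do with the Ford Motor Company as an entity, which is exactly why the quoted phrase matters.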

A clear understanding of Boolean search strings facilitates an efficient search. Jason Baron was an investigator involved in many of the massive lawsuits involving the tobacco industry. He reported that the companies involved in the lawsuits generated over 1,700 electronic discovery requests involving e-mail, directed at various government organizations (Baron 2010). One of the search strings he used incorporated ten carefully chosen keywords combined with the names of the litigants and then modified by 35 different combinations of Boolean operators. This one search narrowed a field of over 32 million e-mails down to a “mere” 320,000. The difficult part is that while the search eliminated 99% of the pool, the remaining 1% had to be manually examined for evidentiary value.

A second problem inherent in using only keyword searches as a data mining technique is that such searches also cull out a large number of potentially relevant documents that simply did not contain one of the keyword phrases. Blair and Maron (1985) produced a study that indicated only 20% of relevant documents were located by keyword searches when dealing with large volumes of data. An organization called the Text Retrieval Conference (TReC) wanted to know if this ratio was still relevant after several decades and repeated the study. Their results were nearly identical (TReC 2010).

Analyzing the Results

A large search of document archives, whether e-mail or otherwise, will result in four basic categories of result. False positives are those documents that were in no way relevant to the subject of the search, but were nonetheless retrieved during the search process. The evil cousin of the false positive is the false negative. This is the document that is relevant but that the search scheme failed to locate. On the positive side of the discovery ledger are the true positives (relevant documents that were retrieved) and the true negatives (irrelevant documents that were ignored).

There are two terms with which the investigator should be familiar. Precision is the proportion of retrieved documents that are true positives rather than false positives. For example, if an investigator’s search of an archive of 32 million files yields 32,000 hits, those hits would be sorted into relevant and irrelevant. The percentage of true positives among the hits is the precision. If 80% of those 32,000 files are true positives, then the precision is 80%. The other term is recall. This value is much more difficult to ascertain, because it requires knowing how many relevant documents the search missed, and yet it is the source of much of an investigator’s pain. Recall is the percentage of all relevant documents in the initial mass that the search retrieved. If the search technique that yielded 32,000 records actually found only 20% of the files that were relevant, then the recall was only 20%. Multiplying the two gives a rough overall measure of how well the search performed: Precision × Recall = Accuracy. In this example, 80% × 20% = 16%. (Accuracy is used loosely here; the formal statistical measure of the same name also counts true negatives.) These are all concepts to remember in Chapter 16, “Litigation and Electronic Discovery,” as well, when the discussion about document searches focuses on a discovery request.
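The arithmetic in this example can be worked through in a few lines of Python. The counts below are derived from the figures in the text: 80% of 32,000 hits gives 25,600 true positives and 6,400 false positives, and if those 25,600 documents are only 20% of everything relevant, then 128,000 relevant documents exist and 102,400 were missed. The variable names are illustrative only.

```python
# Counts derived from the example in the text
true_positives = 25_600    # 80% of the 32,000 hits
false_positives = 6_400    # the remaining 20% of the hits
false_negatives = 102_400  # relevant documents the search never found

def precision(tp, fp):
    """Share of retrieved documents that are relevant."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Share of relevant documents that were retrieved."""
    return tp / (tp + fn)

p = precision(true_positives, false_positives)
r = recall(true_positives, false_negatives)
combined = p * r  # the chapter's combined figure

print(f"precision = {p:.0%}, recall = {r:.0%}, combined = {combined:.0%}")
```

In practice the false negative count is the hard part. It is usually estimated by manually reviewing a random sample of the documents the search did not return.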

Advanced Search Methods

If it is indeed true that a Boolean search has such a low success rate, is there another option? Several attempts have been made at a more targeted approach. For a while, Columbia University worked on a forensics tool called the E-mail Mining Toolkit (EMT). EMT attempted to collect messages into groups that evidenced similar behavioral characteristics. Development has since been discontinued, but the ideas behind it generated a great deal of interest. Among the techniques developed by Columbia were the following (Hershkop 2006):

• Stationary User Profiles: Previous user activity was analyzed using known user accounts. An algorithm developed by the programmers created a histogram of account activity. This histogram was used to compare to other accounts (to ascertain whether a single user made use of multiple accounts) or against unknown message threads to attempt user identification.

• Similar Users: This analysis protocol collected multiple users with similar behavioral characteristics and created composite histograms. Accounts that dramatically deviated from the “norms” identified by these histograms could be considered suspect accounts, worthy of more intense inspection.

• Attachment Statistics: Every e-mail in the target repository was examined for attachments. Proliferation of attachments (retransmission to multiple recipients) provided an indication of certain behavioral traits. The software calculated a number of metrics regarding the cumulative collection of attachments. Among these metrics, incidence rate and spread were usable as a measure of threat.

• Recipient Frequency: The underlying philosophy supporting this approach is that certain types of users receive certain types of e-mails consistently and other types rarely. For example, a medical office or a real estate agency is likely to receive more image files than a library. Some users receive many e-mails from a small number of sources, while others receive large numbers from large numbers of sources. For example, a law office is likely to receive a lot of e-mails, but from a select group of transmitting accounts. A state agency will receive one or two messages each from thousands of individual users. These behavioral characteristics can be used to identify the type of account. An account that transmits millions of messages but receives very few is a strong candidate for investigation as a spam originator.

• Group Communications: Large numbers of messages sent to identifiable groups of users can be used to identify the type of user. Groups of individuals with pronounced similarity in last names are likely family groups. Collections of seemingly unrelated accounts that receive similar types of messages might indicate a club or business organization.
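The histogram comparison behind the user profile techniques can be illustrated with a small Python sketch. This is not EMT's algorithm, which is not reproduced in this chapter. It assumes a simple profile made of the hours at which an account sends mail and swaps in cosine similarity as the comparison measure. All account data is invented.

```python
from collections import Counter
from math import sqrt

def hourly_profile(send_hours):
    """Histogram of sending activity: fraction of messages sent in each hour 0-23."""
    counts = Counter(send_hours)
    total = len(send_hours)
    return [counts.get(hour, 0) / total for hour in range(24)]

def cosine_similarity(a, b):
    """1.0 means identical shape, 0.0 means no overlap at all."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Hypothetical send times (hour of day) for three accounts
known_user = hourly_profile([9, 9, 10, 10, 11, 14, 15, 16])
suspected_alias = hourly_profile([9, 10, 10, 11, 15, 16])
night_account = hourly_profile([1, 2, 2, 3, 23, 23])

alias_score = cosine_similarity(known_user, suspected_alias)
night_score = cosine_similarity(known_user, night_account)
print(f"alias: {alias_score:.2f}, night account: {night_score:.2f}")
```

A high score does not prove that two accounts belong to one person. It only flags the pair as worth a closer look, in the same spirit as the techniques listed above.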

While there has been little progress made in the development of commercial tools that employ these techniques, other researchers have pursued the “concept searching” model, which does not rely on exact keyword matches. Rather, it looks for words related to a particular idea. ContentAnalyst is a company that uses this approach. It provides components that can be integrated into other applications to add advanced search capabilities based on concept searching. Investigators start with a blank data set and feed the application terms that define the search. As the application receives data, it begins to build up the “language” of the project.

An example of this would be a search for messages relating to an automobile accident. Using a Boolean search that included the terms “car,” “automobile,” “crash,” and “accident” would bring up a large number of e-mails at an insurance agency. However, it would miss a message with the subject line “April Incident” and the message “I trashed my ride yesterday.” A well-designed concept search would find these messages as well. It works by creating clusters of terms that mean the same thing. For “wreck,” the analyst could tell the application that related terms, such as “accident,” “crash,” or any other synonym (either noun or verb), are valid equivalents. The concept search takes any of the words that have been identified as part of the general cluster and finds messages that include those terms. The well-versed analyst will include related slang terms as well as formal terminology.
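The cluster idea can be sketched in Python with hand-built synonym sets. This is only an illustration of the principle described above, not the technology used by ContentAnalyst or any other vendor. The concept names, cluster contents, and messages are all hypothetical.

```python
import re

# Analyst-defined clusters: each concept maps to terms treated as equivalent
CONCEPTS = {
    "vehicle": {"car", "automobile", "ride", "truck"},
    "collision": {"crash", "accident", "wreck", "wrecked", "trashed", "totaled"},
}

MESSAGES = {
    "m1": "April Incident: I trashed my ride yesterday.",
    "m2": "Please file the accident report for the car.",
    "m3": "Lunch on Friday?",
}

def concept_hits(text):
    """Return the names of every concept with at least one term in the text."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return {name for name, terms in CONCEPTS.items() if tokens & terms}

def search(required_concepts):
    """Messages that touch every required concept, whatever words they used."""
    return [mid for mid, text in MESSAGES.items()
            if set(required_concepts) <= concept_hits(text)]

matches = search(["vehicle", "collision"])
print(matches)  # ['m1', 'm2']
```

Message m1 contains none of the four Boolean keywords from the example, yet it is found because “trashed” and “ride” were added to the clusters. The quality of the result depends entirely on how well the analyst populates them.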

ContentAnalyst technology can be found in several document management solutions and is used by several software developers as an advanced search engine for their products. Companies employing this technology include

• Agilex (intelligence applications)

• AnyDoc (document management)

• Datacap (document processing)

• dtSearch (text retrieval)

• eIVia (information management)

• eLumicor (electronic discovery)

• Fastline Technologies (data mining software)

• H&A eDiscovery (litigation support)

• iConect (litigation support)

• kCura (electronic discovery)

• Planet Data (litigation support)

• SAIC (intelligence applications)

This list may not be all-inclusive and is likely to expand as more companies recognize the value of the technology.

Tracing the Source of an E-mail

Often it is important to locate the source of a message. First, the investigator confirms that the IP address recorded in the message header is valid. The nslookup command run from a command line will report the host name registered for a valid IP address. If the utility cannot resolve the IP address, that is a warning sign that the address in the header may not belong to a legitimate SMTP server, although some legitimate hosts simply have no reverse DNS record. Figure 10.7 shows the results of an nslookup query.

Figure 10.7 Nslookup is a utility that will resolve the host name when provided an IP address.

The first query in Figure 10.7 shows a successful lookup of an IP address found in the header of a message that arrived from a trusted vendor. As expected, the host name reported was valid and familiar. The second queried address came from an unsolicited offer for a popular drug enhancing male stamina. Unsurprisingly, the query was unsuccessful. The sender had most likely spoofed the IP address.
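The same check can be scripted when there are many headers to examine. The Python sketch below pulls bracketed IPv4 addresses out of Received lines, discards any that are not syntactically valid, and offers a reverse lookup comparable to the nslookup query described above. The function names and sample headers are hypothetical, and the bracketed-address pattern is an assumption about how the receiving servers formatted their Received lines.

```python
import ipaddress
import re
import socket

def received_ips(header_text):
    """Extract syntactically valid IPv4 addresses written as [a.b.c.d] in the headers."""
    found = []
    for candidate in re.findall(r"\[(\d{1,3}(?:\.\d{1,3}){3})\]", header_text):
        try:
            found.append(str(ipaddress.ip_address(candidate)))
        except ValueError:
            pass  # e.g. 999.1.1.1 cannot be a real address
    return found

def reverse_lookup(ip):
    """Return the host name registered for ip, or None if the lookup fails."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return None

# Hypothetical headers using documentation-only addresses and domains
SAMPLE_HEADERS = (
    "Received: from mail.example.com (mail.example.com [192.0.2.10])\n"
    "    by mx.example.net; Mon, 03 Oct 2011 09:15:00 -0400\n"
    "Received: from unknown (HELO bogus) [999.1.1.1] by relay.example.org\n"
)

candidates = received_ips(SAMPLE_HEADERS)
print(candidates)
# reverse_lookup(candidates[0]) would then be called on a machine with network access
```

A result of None from reverse_lookup means only that no host name could be found for the address. It is a reason for suspicion rather than proof of spoofing.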

If an address is valid, the investigator can learn more about it using a WHOIS lookup. The site www.whois.com searches a database of domain registrations that provides a significant amount of information. WHOIS can query both by domain name and by IP address. Therefore, an investigator can run a WHOIS query on an IP address and find out the name of the domain or organization to which it is registered. A WHOIS query of the author’s domain identified the author as the owner of that domain (along with some personal information that was redacted from Figure 10.8) and provided the IP addresses of the servers that hosted the domain. It also provided the contact information for the host provider for the domain, including address and telephone number.

Figure 10.8 The WHOIS database provides a great deal of information about domains and/or IP addresses.

Note that none of these queries identified a specific individual. That is rarely possible without help from the ISP or organization hosting the mail server from which the message originated. In a criminal investigation, it might be possible to subpoena the mail server logs and message archives. During civil litigation, this information can be demanded during the discovery process.

Chapter Review

1. Explain the headers that are used in a standard e-mail message and why they are relevant to an investigation. Why is it dangerous to use information you find without some form of corroborating evidence?

2. What information can be extracted from an e-mail header that will allow you to trace a message back to its originating ISP?

3. Explain the concept of information stores. Why is an understanding of how different clients store messaging information critical to the success of an e-mail search?

4. Explain the concepts of precision and recall. How are they used in e-mail analysis, and what is each one’s relevance?

5. What are two network tools of value to the investigator that are freely available on any machine with network access?

Chapter Exercises

1. Find an example of spam on one of your machines. Extract copies of the headers and analyze the message. See if you can figure out the originating server and, if possible, the ISP that initially forwarded the message.

2. Locate and make a copy of the e-mail store on your personal machine. Use the strings utility to search for several keywords or phrases.

References

Baron, J. 2010. How do you find anything when you have a billion emails? Author Blog. http://e-discoveryteam.com/2009/03/04/jason-baron-on-search-how (accessed October 22, 2011).

Blair, D., and M. Maron. 1985. An evaluation of retrieval effectiveness for a full-text document retrieval system. Communications of the ACM 28(3):293.

Hershkop, S. 2006. Behavior based email analysis with application to spam detection. PhD diss., Columbia University.

PeoplePC. 2011. Common Internet software versions. http://psc.peoplepc.com/articles/email/common-internet-software-versions-noimages.php (accessed December 3, 2011).

TReC. 2010. National Institute of Standards and Technology TREC Legal Track. http://trec-legal.umiacs.umd.edu/ (accessed November 1, 2011).