16. Litigation and Electronic Discovery

Much of the discussion in this book so far has treated evidence as though it were to be presented in a criminal investigation. Digital detectives employed by a corporation are more likely to spend the majority of their time involved in civil litigation than in criminal action. In any legal case, one of the first steps taken is for both sides to meet and discuss what evidence is relevant to the proceedings and come to an agreement on how the information controlled by each litigant will be made available to the other party. The presentation of all documents relevant to a case is a process known as discovery. In virtually every case filed in U.S. courts today, the presentation of electronic documents is crucial to success or failure. This is electronic discovery, or as it is more commonly referred to, e-discovery.

Consider a few statistics, and it will become clear just how critical a good discovery strategy really is. According to the Association of Corporate Counsel, a third of 485 companies responding to their survey reported e-discovery was a key component in 70% or more of their cases. More than half indicated that over 90% of their cases involved electronic documents. Only 5% indicated that they had not been involved in e-discovery (Association of Corporate Counsel 2010).

With increasing emphasis on sanctions against organizations that fail to properly comply with an e-discovery motion, it is more critical than ever that IT staffs and legal departments have a solid understanding of the processes and requirements involved. This chapter will discuss the anatomy and function of an e-discovery motion and outline the steps an organization will take in order to comply.

What Is E-Discovery?

In the old days (about ten years ago), when a lawsuit was filed, lawyers from both sides got together and debated over what kinds of paper records each side would have to produce. Virtually all records existed primarily in hard copy. Computer systems were used as a means of facilitating work. They were not the primary repository for records. Acres of file cabinets that could be navigated by only a select few housed a company’s records.

In today’s corporate world, the paper record is the “backup.” The real world exists inside of silicon-based life forms known as computers. The document search of today does not look for letters—it searches e-mail records. The file cabinets consist of a database-driven engine that sorts documents based on metadata. Those documents might be scattered across a global network made up of thousands of computers. Finding all of the documents related to a single event is now more challenging than ever. Dozens of different versions of the same document might exist, with only minor variations between some of them. In some cases, similarly named documents may share no more resemblance than the name they were given. As long ago as 2003, experts were estimating that as much as 95% of information generated by a corporate entity never appeared in printed form (Lyman and Varian 2003).

The courts and the legislatures have responded to changing technology. Rule 34 of the Federal Rules of Civil Procedure (FRCP) defines the role of electronic documentation in civil procedure where evidence exists only in electronic form. This rule revises the definition of the term document to include electronically stored information and calls for respondents to provide that information in readable form (Committee on the Judiciary 2008). Simply put, e-discovery is the preservation, identification, and production of all documentary evidence related to a specific piece of litigation.

A Roadmap of E-Discovery

E-discovery begins long before there is ever a scent of a lawsuit. Company policies, such as document retention and deletion policies, e-mail rules and retention, and so forth, all play into discovery. Many companies rush to destroy incriminating evidence as soon as there is anticipation of litigation, although some recent penalties have certainly provided a strong disincentive for such behavior. Sanctions include having the court issue a default judgment against the offending party or forcing that party to pay attorney’s fees. Additionally, the court can order the violating party to pay compensation to the victim. This will be discussed in more detail later in the chapter.

While every case carries with it unique situations, it is safe to say that the majority of discovery requests will follow a certain order. Figure 16.1 illustrates the Electronic Discovery Reference Model (EDRM), developed by the group of the same name. The EDRM represents nine steps in six phases that most cases will go through from beginning to end.


Figure 16.1 The Electronic Discovery Reference Model (reproduced under Creative Commons Attribution, based on EDRM 2010)

It is no coincidence that the discovery model so closely resembles the forensic investigation model presented in Chapter 1, “The Anatomy of a Digital Investigation.” To summarize the model, the individual steps are

1. Information management

2. Identification

3. Collection

4. Preservation

5. Processing

6. Review

7. Analysis

8. Production

9. Presentation

The next few pages will provide additional detail for each step.

Information Management

Information management is the one step of the discovery model that occurs long before litigation is ever contemplated. Every organization should have established policies in place that determine the basic rules of information control. This information should be included in a detailed document, two versions of which should exist. The first should be a general version that can be made available to members of the public who may have a need to know, but that does not include sensitive information such as passwords and Directory Services (including Active Directory) information. The second, complete version should be restricted to internal use. Several aspects of this policy should be included:

• Document storage policies

• Designated custodians

• Controlled storage locations

• Data loss prevention policies

• Document retention policies

• Retention period

• Document deletion policies

• E-mail retention policies

• Application map

• Types of documents used by organization

• Applications used to generate documents

• Applications used to archive and manage documents

• Network architecture map

• Server names and IP addresses

• SAN volume information

• Directory Services schema

• Backup schema

• DR architecture

An organization with detailed documentation going into litigation is better protected in several ways. The forensic investigator is likely to spend far less time searching for documents. Knowing what applications and operating systems are in use facilitates recovering deleted data. In the event critical data proves to be unrecoverable, being able to represent that the information was deleted in accordance with defined policy makes it easier to defend against accusations of spoliation.

Identification

Today’s corporate network contains more information than a team of humans could ever hope to sort through in several lifetimes if forced to read each page of documentation individually. Because of the extreme volume of data to be searched and due to the sensitive nature of much of that information, there are several steps involved in identifying the data to be produced. Consider the identification process to occur in two stages. The first stage consists of critical decisions to be made before anyone touches the hardware. The second stage occurs during the search.

Pre-search Processes

One of the first steps in litigation, as defined by the FRCP, is disclosure. Rule 26(f) states that all parties to the litigation will agree to a conference to discuss the nature of the claim and the possibilities of settlement and to develop a discovery plan. During this conference, representatives for each party will discuss the following issues:

• The nature of information determined to be discoverable

• Names and contact information of all parties likely to possess discoverable information

• Details to take into consideration for a suitable discovery plan of action:

• The timing of initial discovery

• The scope of discovery

• Issues regarding disclosure of information

• The form of production of discoverable information

• Limitations of discovery

• Claims of privilege

• A procedure for invoking privilege on documents identified during discovery

• The scope of documentation each disclosing party has in possession that supports its claim or its defenses

• Witnesses expected to present testimony

Once the scope of the search has been agreed upon by both legal teams, it is time to start planning for data acquisition. If it has not already been done, a litigation hold must be placed on all data hosted by the target network or devices. A litigation hold, sometimes known as a preservation order or a hold order, is a directive ordering the party to suspend all document deletion or destruction. Under litigation hold, information must be retained in its exact form, with no alteration.

The duty to preserve is defined by two parameters: trigger and scope. The trigger is a point in time at which a party is legally under an obligation to preserve evidence. The scope refers to the types of materials that must be preserved pursuant to this obligation. The trigger is that magical moment when there is reasonable anticipation of litigation.

What constitutes reasonable anticipation? This question was addressed in two Florida cases. Hagopian v. Publix Supermarkets, Inc. (2001) involved a situation where evidence was collected in an accident case before there was anticipation of a lawsuit. No lawsuit was filed for several years, and in the meantime, the evidence was discarded. The court found Publix to be guilty of spoliation because, after making the effort to preserve evidence, they allowed it to be destroyed.

Royal & Sunalliance a/s/o R.R. & L. R. Corp, Inc. v. Lauderdale Marine Center (2004) took the discussion to the next step. In this case, a Florida judge stated that mere anticipation of litigation does not automatically establish the necessity to preserve evidence. Evidence of a fire was discarded, and yet the court stated that there was no reasonable expectation of legal action. The difference between the two cases, as explained by the court, was that Publix had collected the evidence, expecting a lawsuit, and then destroyed it. In Lauderdale Marine, the destroyed material was collected merely as part of a routine investigation with no effort to preserve it as evidence. No explanation was offered as to how one should determine intent to preserve.

In Zubulake v. UBS Warburg in 2003, the court had a different opinion. The court issued a decision in this trial that the defendant should have known as much as six months prior to a lawsuit being filed that the plaintiff was likely to sue. As such, they were required to preserve evidence. Additionally, the court noted that UBS Warburg destroyed the evidence in violation of its own published retention policy.

Spencer (2006) noted several events that can lead the courts to determine that spoliation occurred. One of these events is the destruction of any document in violation of an existing statute. The example he uses is that of personnel records. Such records should be retained for either one year beyond when the record was made, or one year beyond such time as an action was taken using the record, whichever comes later. Early destruction is spoliation. A second form of violation is more obvious. If documents are destroyed in direct violation of a litigation hold order, then spoliation has occurred. The final example is the violation of a court order. Beyond these obvious violations, courts have the ability to issue sanctions any time there is evidence that a party destroyed evidence with the intent of preventing the court from ever seeing it.

The obvious conclusion is that the instant there is a whiff of litigation in the air, a litigation hold should be immediately imposed that prevents any documents from being deleted. As a digital investigator, it is your job to locate and identify any evidence that documents pertaining to the investigation at hand have been deleted. The inherent problem is that there has been no clear definition of a good retention policy for documents and document types. The only thing that is really clear is that if your organization violates its own published policy, it is very likely to face sanctions should the court determine that it is guilty of spoliation.

Search Processes

Once both parties are in agreement on the scope of the discovery request, it falls on the legal team and document management specialists to identify what documents fulfill the criteria defined in the request. This is the process of data mapping. Identifying individual items of data is not always as easy as it sounds. Many people are likely to have been involved in the case. Any one of them may have created, modified, or copied documents relevant to the case. It is a virtual certainty that e-mails will be a search target as well as written documents, images, and possibly sound clips. An organized approach is required for success.

The first thing to do is develop a strategy. Just like a good general doesn’t go into battle without a detailed plan, neither should the digital investigator. Before the first computer is turned on (figuratively speaking), several details should be considered:

• Identify stakeholders. Who all is likely to be involved in this case, beyond those named in the complaint?

• Corporate legal

• Outside legal

• IT/records management

• Employees and their managers

• HR personnel

• Determine what document and data types are relevant. Aside from standard text documents and e-mails, are there images or sound files needed? How are relevant records from databases to be separated from protected or privileged records? In what format are files to be presented?

• Identify any data custodians who would have control over the data. This might include IT personnel or office managers as well as the stakeholders.

• Identify all data repositories where information might be held:

• Local disks

• SAN

• Servers

• Portable storage

• Employees’ personal hardware

• Retired systems

• Disaster recovery systems

• Tape backups

• Optical disks

• Internet storage locations

• Prepare a list of key contacts, including outside counsel, IT personnel for the opposite team, and any third-party service providers that might be involved.

• Locate and prepare copies of all corporate documentation relevant to the investigation. This might include employee manuals, DR plans, backup/recovery strategies, and document/e-mail retention policies.

• Assign a time frame to use as a search parameter.

• Analyze the discovery request to assemble a qualified list of key words to use in a keyword search.

A proactive company already has much of this information in place. It is just a matter of collecting it all into a unified structure. Once this has been completed, the process of data mapping is finished. A document should be prepared and signed by both parties acknowledging the parameters of the search as defined in this phase. The team is now ready to begin the document recovery.

Collection

Data collection is undoubtedly the most challenging (and probably the most fun) part of the process. It will not be as easy as having someone present the team with a list of files to locate. More likely, the request will bounce back and forth between parties several times before it is considered specific enough to be accomplished and yet generic enough to cover all bases. The initial request may be for “all documents, files, messages, and transactions related to Acme Industries.” The refined order might read “all electronic information related to real estate transactions between Acme Industries and Ima Landlord between January 1, 2010, and January 1, 2011.” If Ima Landlord is a real estate broker, even the more specific request may be refined even further.

Not every relevant document is going to have a file name clearly identifying it as relevant (an e-mail stored as a message file might have a file name such as RE Tuesday Meeting.msg). Content searching utilities are in order, and if they are to be used successfully, careful consideration must be given to what search strings to employ.

Data searches cannot be conducted randomly. The e-discovery team needs to plan out the collection strategy very carefully, decide on a specific method by which data will be collected, and then document every step of the collection process. Evidence that is collected must be packaged in the manner agreed upon by all parties, and a chain of custody maintained for all materials collected throughout the course of the investigation.

Using Search Strings

Careful selection of text strings to use in a keyword search will determine the success or failure of the search. As Magistrate Judge Peck points out in William A. Gross Constr. Assocs., Inc. v. Am. Mfrs. Mut. Ins. Co. (2009), a request that is too loosely defined will yield a collection of materials substantially larger than required, placing an undue burden on the producing party. Conversely, a search that is too restrictive in nature is likely to miss critical evidence.

Koutrika et al. (2009) point out that any data search is basically a quest to find one or more “entities.” The entity is a document or file that meets a specific set of requirements. Yet any given entity can be accurately described in a number of ways. For example, searching for a digital image of storm damage caused by Katrina can result in different degrees of success, depending on your selection of search criteria.

Typing in “Katrina NEAR JPG” in Yahoo can be most disappointing. On Google, the request on one particular day resulted in over 28 million hits. While many of those hits were relevant images, it also offered links to a vast collection of images that would land the average employee in HR if they were ever displayed in an office cubicle. Adding the words “Storm Damage” to the search string narrows the results to 104,000 hits, and (for the first several pages, anyway) none of them feature scantily clad females. While this example is Web-centric, the same principle holds when searching any document collection.
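The same principle can be demonstrated in a few lines of code. Below is a minimal Python sketch of AND-narrowing across a collection of text files; the directory name and search terms are illustrative only and not drawn from any real matter.

from pathlib import Path

def search(root, terms, match_all=False):
    """Return files under root containing the terms (case-insensitive)."""
    hits = []
    for path in Path(root).rglob("*.txt"):
        try:
            text = path.read_text(errors="ignore").lower()
        except OSError:
            continue  # unreadable file; a real collection would log this
        matches = [term.lower() in text for term in terms]
        if all(matches) if match_all else any(matches):
            hits.append(path)
    return hits

broad = search("evidence/", ["katrina"])                        # over-inclusive
narrow = search("evidence/", ["katrina", "storm damage"], match_all=True)
print(len(broad), "broad hits;", len(narrow), "narrowed hits")

Each additional required term trades recall for precision, which is exactly the balance described above.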

In view of all this, is it ever possible to know whether a particular search has resulted in finding everything that is relevant? Of course not. However, as long as an organization can demonstrate that it made a good faith effort to retrieve and present all requested materials and can document the process used to collect those materials—and as long as that organization has done all it can to prevent the loss of any material it possesses—then they should be safe from sanctions. In SafeCard Services, Inc. v. SEC (1991), the court stated, “mere speculation that as yet uncovered documents may exist does not undermine the finding that the agency conducted a reasonable search.” Their job may not be finished, because opposing counsel may redefine how they want the search to be accomplished using a set of keywords of their own. But the possibilities of legal repercussions are minimized.

Forms of Data

The job of data collection would be much easier if everything requested existed in a uniform time-space continuum. However, data exists in a wide variety of locations, including some that are typically inaccessible to the average user. Zubulake (2003) identified two primary categories of data, each of which is further divided into subcategories. Each form of data gets increasingly distant from the user interface (see Figure 16.2).


Figure 16.2 Data falls into different subcategories of either accessible or inaccessible data.

The first category is accessible data. This is any information readily retrieved by the average user. There are three subcategories of accessible data. Active online data consists of information stored on installed media, such as the hard disk, that is easily accessible via the file system. Near-line data exists on user media, such as rewritable CDs or DVDs, external USB drives, and such. Off-line storage or archives hold information that is not directly stored or read from local drives. This would include network storage locations, Internet storage services, and so forth. Off-line storage may be directly linked to the computer system via logical mappings, such as mapped drives, or it may require logging into separate systems.

The second category is inaccessible data. The most commonly requested subcategory of inaccessible data is the information stored on backup tapes. While, technically speaking, data found on tapes is not truly inaccessible, the degree of difficulty in locating specific information on a tape allows it to fall into this category. Accessing backup data requires that the tapes be restored to active systems or that the backup software used to create them be used to search for files. Backup tapes are very time-consuming and costly to search. Even more difficult is the second subcategory of erased, fragmented, or damaged data. Recovery of this form of data requires specialized tools and people with specialized skills. This is where the forensic specialists are called upon for their services.

Data Collection Tools

Most of the tools that will be discussed in Chapter 18, “Tools of the Digital Investigator,” are as relevant to the litigation support professional as they are to the forensic specialist. The basic goal is the same. Find existing target files as quickly as possible and locate as many deleted files as possible. On a corporate network it is hardly likely that an investigator will be asked to provide a forensic copy of a SAN. Therefore, live searches are the order of the day. It is possible that certain individual computers or storage devices might need to be imaged, and therefore, the tool kit needs to contain the appropriate utilities.

Great care must be taken in the selection of these tools. The courts have recognized a relatively small collection of hardware devices and software utilities that are considered acceptable for extraction, analysis, and presentation of evidence. While it is not forbidden to use tools not on this list, doing so is likely to result in extensive questioning regarding the validity of results. An investigator not prepared to assume the role of expert witness is better served by using accepted tools. While the following list is not all-inclusive, tools that have been proven in court include

• AccessData Forensic Toolkit (FTK)

• EnCase eDiscovery

• ProDiscover

• X-Ways Forensics

• Computer Online Forensic Evidence Extractor (COFEE)

• Any tool tested by NIST under its Computer Forensics Tool Testing (CFTT) program

Preservation

By now, the litigation hold should be in full effect and all personnel notified to suspend destruction of data. If the organization reaches this point in the discovery process without instituting a hold, it is facing a serious risk of sanction. The discovery team must now develop a plan by which the requested information will be collected, stored, and prepared for delivery. This is not simply a matter of copying files to media.

Files must be presented exactly as they exist on the host system. Additionally, on the vast majority of networked systems today, files will exist in multiple locations, and possibly in multiple versions. While it is unlikely that several identical copies are desirable, there is a good chance that different versions or iterations of a file might be required. Develop a plan of action that includes de-duplicating (deduping) files to minimize the quantity of data collected while still presenting what is required (see side bar entitled “Duplicates versus Near Duplicates”).

Larger organizations employ some form of document management system for managing electronically stored information (ESI). These systems store metadata separately from the files. The metadata is critical for establishing audit trails and time lines. Most discovery orders will require that the metadata be preserved. It is also possible that dedicated software might be required for presenting information in a readable format.

Determine how data selected for review is going to be stored. There is increasing emphasis on the use of online document reviews in litigation (to be described in more detail later in this chapter). If this is the method selected, the site must be prepared, accounts configured, and security established before the first byte of data is copied from the source media.

If physical media is to be used, the form, format, and density of media must be agreed upon in advance. Whenever possible, new media should be employed. If existing drives must be used, then it is essential that they be wiped with a Department of Defense (DOD) certified data wiping utility and reformatted prior to use.

Duplicates versus Near Duplicates

A problem that every e-discovery project encounters is the existence of files that are virtually identical, but not exact duplicates of one another. Examples of such files would be evolutionary versions of a file, a PDF version of a Microsoft Word document, partial files recovered from slack space, or a TIF file of a scanned document. The conventional method for identifying duplicate files is to use a utility that compares the hashes of different files. This method has the problem that even the minutest change will drastically alter the hash of a file. Figure 16.3 shows the result for a document that had a single character changed from uppercase to lowercase.


Figure 16.3 Changing a single character in a Word document completely alters the hash of the file.
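This effect is simple to reproduce. The short Python sketch below hashes two byte strings that differ only in the case of a single character; the sample text is invented for illustration.

import hashlib

original = b"The quarterly report is attached."
altered  = b"the quarterly report is attached."  # one character: T -> t

print(hashlib.md5(original).hexdigest())
print(hashlib.md5(altered).hexdigest())

The two digests bear no resemblance to each other, which is why a whole-file hash can find only exact duplicates, never near duplicates.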

Finding exact duplicates is good because getting rid of them eliminates dead weight. Finding near duplicates is good because you can find altered documents, build a timeline of the evolution of a document, or find remnants of destroyed files in the unallocated space of a hard disk. However, finding near duplicates can be very tedious and time consuming without some form of automation.

A technique called context triggered piecewise hashing (CTPH) allows this magic to be performed on a system (Kornblum 2006). CTPH uses an algorithm that calculates hashes of smaller blocks of the file rather than of the entire file, using a rolling hash to determine the block boundaries. File similarity is calculated by measuring the number of data blocks that are identical. Kornblum implemented this algorithm in a utility called ssdeep.

By itself, ssdeep can compare individual files and report whether or not they are near duplicates. Forensic Toolkit by AccessData uses ssdeep as its fuzzy hashing algorithm, but builds on it to allow full-directory searches. EnCase by Guidance Software and X-Ways Forensics both have near-duplicate detection capabilities, but it is unclear what underlying technology they use, as the companies are protective of their intellectual property.
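For readers who want to experiment, the sketch below uses the third-party Python bindings for ssdeep (pip install ssdeep); the sample data is fabricated, and the function names belong to that package rather than to anything prescribed in Kornblum's paper.

import ssdeep

doc_v1 = b"Acme lease agreement, January 2010. " * 200
doc_v2 = doc_v1.replace(b"January", b"March")   # a near duplicate

sig1 = ssdeep.hash(doc_v1)   # piecewise signature, not a fixed-length digest
sig2 = ssdeep.hash(doc_v2)

# compare() returns a 0-100 match score; a high score flags a near
# duplicate that an MD5 comparison would call a completely different file.
print(ssdeep.compare(sig1, sig2))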

Processing, Review, and Analysis

The next three steps are going to be the most labor intensive and, after the collection techniques, possibly the most heavily scrutinized. In a complex case inside of a large organization, the search process very likely generated a massive volume of data to be processed, reviewed, and analyzed. Only a relatively small percentage of that material is going to end up as evidentiary material. The court simply does not want to see several million pages of evidence. Only evidence that is directly incriminating or exculpatory is going to be allowed.

Processing

The processing stage is where the weeding-out process is accomplished. There are several stages to processing and reviewing data:

• Assessment

• Assign roles and responsibilities to team members.

• Determine what data archives need to be searched.

• What type of media is involved?

• What tools are required for searching the types of data targeted?

• Decide on the target format and media to be used for the collection.

• Establish the processing steps that will be used.

• Determine what possible problems the team might face in extracting data streams.

• Define the criteria for “acceptable” data.

• Determine a policy for audit trails, error reporting, and chain of custody.

• Establish a credible and measurable definition of success.

• Preparation

• Convert any legacy data formats into a format readable by all parties.

• Restore any relevant backups to live systems.

• Run a deduping utility against the data set to eliminate excess baggage (a minimal sketch of this step follows the list).

• Some of the network duplication detection utilities are not considered forensically sound.

• Some of the more advanced applications detect near duplicates as well as duplicates.

• Identify and extract any cabinet files or e-mail mailbox or post office files.

• Run a text indexing utility against each archive to identify possible data sources.

• Selection and review

• Eliminate identical duplicates.

• Review near duplicates to ascertain relevancy.

• Work with the legal team to identify protected data.

• Determine the feasibility of running a concept extraction utility against the archives (see sidebar “Concept Extraction at Work”).
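As a point of reference for the deduping step above, here is a minimal Python sketch that groups files by content hash and reports each set of exact duplicates; the directory name is illustrative, and near duplicates would require fuzzy hashing instead (see the earlier sidebar).

import hashlib
from collections import defaultdict
from pathlib import Path

groups = defaultdict(list)
for path in Path("collected/").rglob("*"):
    if path.is_file():
        digest = hashlib.sha1(path.read_bytes()).hexdigest()
        groups[digest].append(path)

for digest, paths in groups.items():
    if len(paths) > 1:
        # the first copy is retained; the rest are exact duplicates
        print(digest, "->", [str(p) for p in paths])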


Concept Extraction at Work

A significant part of the job of the e-discovery specialist is identifying documents that respond to the request (responsive documents) as well as information that is protected from disclosure to the opposing party (privileged documents). Wading through several terabytes of corporate data to find the exact collection of data to present is tedious and time consuming. Fortunately, technology can help in this regard as well.

Deshpande et al. (2000) described a technique of software-assisted discovery that mines relevant data out of mountains of generic bits and bytes. The authors define two categories of search functions that assist with the sorting. The first category is a set of focus items that target the type of information sought. Focus categories look for specific types of behavior. An investigation into corporate malfeasance might target documents or e-mails with content-specific terms related to legal terminology. Terms such as contract, litigation, legal, attorney, and so forth are examples. Next, the search focuses on filter categories. Items such as private communications, corporate documents, and so forth constitute filter categories.

A typical Boolean search brings up every piece of information that meets the criteria defined. Concept extraction software compares the focus and filter categories to relevant topics and identifies documents specific to the search. Well-designed concept extraction utilities recognize that a paper about elephants is not relevant when looking for proof that evidence was hidden in the culprit’s trunk. Artificial intelligence algorithms work with rules of language processing to bring up only relevant documents.
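As a toy illustration only, and nothing like the linguistic analysis the commercial packages perform, the following Python fragment shows the two-stage focus/filter sorting in its crudest form; every term and category name here is invented.

FOCUS = {"contract", "litigation", "legal", "attorney"}             # behavior sought
FILTERS = {"email": {"from:", "subject:"}, "memo": {"memorandum"}}  # document classes

def classify(text):
    words = set(text.lower().split())
    focus_hits = FOCUS & words
    category = next((name for name, cues in FILTERS.items() if cues & words),
                    "other")
    return focus_hits, category

doc = "Subject: draft contract -- attorney review needed from: counsel"
print(classify(doc))   # roughly ({'contract', 'attorney'}, 'email')

A production tool would replace the keyword sets with language models and rules of grammar, which is how it knows an elephant's trunk from a culprit's.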

Tools that employ concept extraction are not common, and they are not cheap. One example is ZyLAB eDiscovery. The text mining capabilities of this package include linguistic analysis, content clustering, concept and pattern extraction, and even e-mail chain analysis. Kazeon Systems’ Analysis and Review package provides similar capabilities.


Review

In larger litigation events, data will be presented in a rolling review. This is a process by which data is produced in incremental stages, rather than all at once. There are two advantages to this method. First of all, it lets the technical staff working on data extraction identify and resolve any ghosts in the machine. Unwieldy search strings or poorly defined objectives are usually picked up early on. It’s better to figure out the problems before inflicting them on the project as a whole. Second, data review by postproduction personnel (doctors and lawyers and such) can begin before all production is complete. Meeting difficult deadlines is made easier this way.

Another form of review that is becoming increasingly popular is the online review. In this format, all documents from both sides are stored on a mutually accessible platform with controlled access. Documents must be checked out for review and checked back in when the session is completed. Audit logs are maintained to verify times and dates of access. Typically, an online review is useful when there are large amounts of data to be reviewed by several people. The online platform helps reduce costs and (if properly configured) increase security. There are several concerns that need to be addressed if the parties agree on using online review as a platform.

How will the data be viewed? If documents are to be viewed in native format, do both parties have all the necessary applications required to open the files? A critical consideration is making sure that the data is online whenever either party requires it. As a security precaution, it may be decided that only certain “windows” of time are available for review. The Web host might open access to the data from 9:00 a.m. to 5:00 p.m., and then lock it down to prohibit after-hours access.

Speaking of security, who is in charge of that critical element? Typically, a third-party service provider will be selected to host the data, and it falls upon the third party to provide security. This third party may be an entity agreed upon by the two primary litigants, or it may be one assigned by the court. Once that has been decided, the service provider must ensure that the confidentiality and privacy of all data is maintained at all times. To that end, there must be some agreement on how privileged information will be redacted from documents put up for review.

Once the review has been completed, it is essential that all information be handled in the manner dictated by the court. It is likely that the review, along with all of the audit logs generated in its lifetime, will be archived for possible use in future appeals. If this is the last leg of the journey, it is likely that the data will be ordered destroyed. It is up to both parties to verify proper disposition of data.

Analysis

Analysis occurs in two stages of the discovery process. Initially, the team must perform an analysis of the data archive to determine what to search. Once that decision has been made, it becomes necessary to decide what to produce. Not everything that the recovery team finds is going to be used in litigation. As mentioned earlier, duplicate files can be eliminated during the review process. Near duplicates must be analyzed and a decision made as to whether the differences are sufficient to qualify each document as a separate item. Therefore, near duplicates must be moved into the next phase. Each document retained for possible presentation will eventually be analyzed to ascertain whether it is truly relevant or whether it might contain privileged information. Any preliminary screening at this stage will reduce later efforts.

During analysis, the team will compile an inventory of all information collected. A detailed chain of custody for each document archive must be created and maintained. Lists of all document sources complete with network or computer location, the name of the data custodian, and any information relevant to the retrieval of the data must be assembled. This would include such information as what tools were used to retrieve the information, whether the information had to be extracted through less than conventional methods (data carving, tape backup recovery, etc.), and the name of the technician who retrieved the data.

During this time, the team needs to maintain certain metrics. These metrics would include information such as

• Percentage of requested data collected

• Percentage from each custodian collected

• Percentage of requested data that was not found (or not retrievable)

• Total number of items collected

• Total number of items per custodian collected

• Overall volume (in bytes)

• What data was collected by what technician

Identify each tool used in collection, and be prepared to describe how it was used. It may be necessary to defend the techniques used by each member of the team. Opposing counsel will leverage any weakness they can find to fray the seams in your presentation.

Production and Presentation

The final stages of the e-discovery process involve packaging the information the investigators uncover into a usable bundle. Data is preserved in the agreed-upon format and storage schema and delivered in the manner determined during the initial conference. This may be bundles of CDs, boxes of printed documents (highly unlikely these days), or, with increasing frequency, an online repository for review.

During the initial conference, the form of data production will have been determined. Presentation of data in native format means that the potential viewer of the information must have the proper software installed on the review computer. For example, an Excel spreadsheet in native format can only be opened in Excel or an Excel viewer. In order to see embedded code, Excel will be required. An example of non-native production would be if Excel documents were printed out in PDF format. In this format, only the information entered into the spreadsheet can be viewed. The formulae used to calculate values are not visible, and macros cannot be examined. Generally speaking, native format is the desirable option.

Near native format is a desirable option in certain cases. E-mails extracted from e-mail servers may be presented in MSG or EML format. Databases are typically too cumbersome to present in their entirety, and doing so would most likely expose protected information along with the discoverable data. The discovery team will work with the legal team and business managers to determine what records need to be produced and what the best production format is going to be.

Documents stored in document management systems are usually managed by the application that stores the document and not by the one that created it. As such, the document management software is most likely going to have numerous indexing and management files that control the information, while the information itself is stored in a condensed archive. Figure 16.4 illustrates the file structure of a typical document management application.


Figure 16.4 Document management applications store massive amounts of information about the documents they manage in separate index files. Each folder holds hundreds of documents, and the database files maintain the metadata.

The second phase of analysis begins now. The team wants to present the minimum amount of information that will satisfy the discovery order. Each document will be examined by the legal team and business group to determine relevance and privilege level. In extreme cases, it might be necessary to examine particularly incriminating documents for authenticity. A study of the metadata may be able to determine if a document was subjected to additional editing since its creation. Speaking of metadata, here is where it becomes necessary to package it for presentation.

It will be necessary to prove authenticity of each document that is selected for presentation. A simple MD5 or SHA-1 hash of the file is a generally acceptable method of proving that two copies of a file are identical. Once files are produced, an auditable chain of custody must be maintained. Generally, the best way to track specific documents is to create a unique identifier for each file. That identifier and the hash of the file accompany the document on its path from the data archives to the courtroom. A database or spreadsheet holds the following information (a minimal sketch of such a record follows the list):

• Unique identifier

• Original file name and path

• Hash value of file

• File name and path of file in review storage

• Type of file

• Technician ID of team member who extracted file
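A minimal sketch of such a tracking record, kept here in SQLite, might look like the following; the table and column names are illustrative rather than a prescribed standard.

import hashlib, sqlite3, uuid
from pathlib import Path

con = sqlite3.connect("production_log.db")
con.execute("""CREATE TABLE IF NOT EXISTS produced_files (
    uid TEXT PRIMARY KEY,   -- unique identifier assigned at production
    orig_path TEXT,         -- original file name and path
    sha1 TEXT,              -- hash value of the file as collected
    review_path TEXT,       -- file name and path in review storage
    file_type TEXT,
    technician TEXT         -- ID of the team member who extracted the file
)""")

def log_file(path, review_path, technician):
    data = Path(path).read_bytes()
    uid = uuid.uuid4().hex
    con.execute("INSERT INTO produced_files VALUES (?, ?, ?, ?, ?, ?)",
                (uid, str(path), hashlib.sha1(data).hexdigest(),
                 str(review_path), Path(path).suffix, technician))
    con.commit()
    return uid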

Some references suggest renaming the file, using its unique identifier as the file name. Two potential issues might arise with this technique. First, if the identifier is embedded into the document itself rather than simply used as a file name, the hash value of the file will change, making it more difficult to verify that the file is actually identical in every aspect save the file name. Second, file system information becomes more difficult to use as a method of confirming the authenticity of the file.

Every discovery project will involve files that must be modified prior to presentation. Part of the discovery process might involve labeling each page with a unique identifier. This is the process of Bates numbering. As each document is processed (or perhaps each page of each document), a unique identifier is attached. This can be embedded in the document header or placed as a visible stamp, depending on what the prediscovery conference arranged.
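A minimal sketch of the numbering itself appears below; the “ACME” prefix, the padding width, and the page counts are all illustrative.

def bates_labels(prefix, docs, width=6):
    """Yield (document, page, label) for every page across all documents."""
    counter = 0
    for doc, page_count in docs:
        for page in range(1, page_count + 1):
            counter += 1
            yield doc, page, f"{prefix}{counter:0{width}d}"

for doc, page, label in bates_labels("ACME", [("lease.pdf", 3), ("email.msg", 1)]):
    print(doc, page, label)   # lease.pdf 1 ACME000001 ... email.msg 1 ACME000004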

Another necessary modification is redaction. Many discoverable documents contain both discoverable and protected information. The process of redaction is simply a method of blocking protected information in a document before releasing it. This is usually not possible in native format. Documents that must be redacted will be converted to an image file, with the protected information blocked. This is the digital equivalent of blocking out letters on a page with a black marker.

Simply blocking the visible text with a black rectangle does not completely hide the information. The hidden data remains in the document and can be extracted by a knowledgeable person. The black marker is simply a graphical overlay that coexists with the text. To effectively redact a document, you must first create a valid copy of the original data. It is certainly not a good idea to permanently alter the only copy of a document. Next, it is necessary to make sure that no metadata exists within the document that contains protected information. For example, it might be necessary to conceal the identity of an individual, but that individual was the creator of a document. The document’s creator is identified in the metadata.

Other pieces of information that can be identified from metadata include user information of those who may have edited the document, to whom the software is registered, dates of creation and editing of the file, and the file path to where the file was originally stored. If the user happened to use any of the custom fields available in most software packages, there is a wide array of information that might be extracted.
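As one illustration of how much a common format carries, the Python sketch below reads the core properties of a .docx file, which is simply a ZIP archive whose docProps/core.xml part records the creator, the last editor, and the revision dates; the file name is invented.

import zipfile
import xml.etree.ElementTree as ET

NS = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dcterms": "http://purl.org/dc/terms/",
}

with zipfile.ZipFile("responsive_document.docx") as z:
    root = ET.fromstring(z.read("docProps/core.xml"))

for tag in ("dc:creator", "cp:lastModifiedBy", "dcterms:created", "dcterms:modified"):
    element = root.find(tag, NS)
    if element is not None:
        print(tag, "=", element.text)   # any of these may need to be redacted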

Conclusion

As long and as detailed as this chapter may seem, it really only scratches the surface of the complexity of electronic discovery. In truth, the subject deserves a dedicated book of its own. In most cases, the forensic investigator is only going to play a limited role in the process. Still, the more he or she knows about the subject, the easier the task will be for everyone concerned.

Chapter Review

1. What is the significance of Rule 26(f) in the Federal Rules of Civil Procedure? How does it impact a civil case?

2. Under what circumstances would it be advisable to issue a litigation hold in a corporate environment? Who all should be affected by such a hold?

3. Explain two circumstances that would constitute “spoliation” in the eyes of the court, should evidentiary material conveniently disappear. What is a mitigating circumstance that might convince the court to overlook the destruction of critical evidence?

4. Differentiate between near-line data and inaccessible data. Where does offline storage fit into the equation?

5. Several sources cite the review of potential data as being the most expensive aspect of the e-discovery process. Explain why this is the case.

Chapter Exercises

1. Locate at least one civil case that involved spoliation, and describe what happened in that case. Were there any sanctions imposed? If so, what were they, and if not, why was the destruction of evidence overlooked?

2. Go online and find three examples of an enterprise-level document management system. Read what the vendors say about their products in regard to e-discovery and regulatory compliance. Can you think of why the two concepts might be related?

References

Association of Corporate Counsel. 2010. Civil litigation survey of the chief legal officers and general counsel. University of Denver, Association of Corporate Counsel. Denver: Institute for the Advancement of the American Legal System.

Committee on the Judiciary. 2008. The federal rules of civil procedure. www.law.cornell.edu/rules/frcp (accessed May 12, 2011).

Deshpande, M., J. Srivasta, R. Cooley, and P. Tan. 2000. Web usage mining: Discovery and applications of usage patterns from Web data. ACM SIGKDD Explorations Newsletter 1(2).

EDRM. 2010. The electronic discovery reference model. www.edrm.net/archives/2998 (accessed May 13, 2011).

Hagopian v. Publix Supermarkets, Inc., 788 So.2d 1088 (2001).

Kornblum, J. 2006. Identifying almost identical files using context triggered piecewise hashing. Digital Investigation. www.dfrws.org/2006/proceedings/12-Kornblum.pdf (accessed June 6, 2011).

Koutrika, G., Z. Zadeh, and H. Garcia-Molina. 2009. CourseCloud: Summarizing and refining keyword searches over structured data. Presented at the International Conference on Extending Database Technology (EDBT). http://academic.research.microsoft.com/Paper/4706513.aspx (accessed November 30, 2010).

Lyman, P., and H. Varian. 2003. How much information? www.sims.berkeley.edu/research/projects/how-much-info-2003/printable_report.pdf (accessed May 12, 2011).

Royal & Sunalliance a/s/o R.R. & L. R. Corp, Inc. v. Lauderdale Marine Center, 877 So. 2d 843 (Fla. 4th DCA 2004).

SafeCard Services, Inc. v. SEC, 288 U.S. App. D.C. 324, 926 F.2d 1197, 1201 (D.C. Cir. 1991).

Spencer, B. 2006. The preservation obligation: Regulating and sanctioning pre-litigation spoliation in federal court. Fordham Law Review 79:17–18.

William A. Gross Constr. Assocs., Inc. v. Am. Mfrs. Mut. Ins. Co., 256 F.R.D. 134 (S.D.N.Y. 2009).

Zubulake v. UBS Warburg, 220 F.R.D. 212 (S.D.N.Y. 2003).