Applying Predictive Coding to DOJ Data Dumps

Mar 25, 2012
Edward J. Page, Catherine Salinas Acree, and Rebecca Shwayri

Using Computer Assisted Review to More Accurately and Quickly Locate Electronically Stored Information

Many attorneys working opposite the Department of Justice find themselves in the unenviable position of being the recipient of a “Data Dump” — having terabytes of electronically stored information (“ESI”) produced by the Government. Such productions can contain literally millions of documents and emails, and the search for relevant information, exculpatory or inculpatory or otherwise, may seem like searching for a needle in a haystack. Your instincts tell you that you want “everything” and that you do not want the Government limiting the production, yet the reality of receiving “everything” may be overwhelming. There is some good news, however. Just as technology has contributed to the snowballing amount of ESI, technology can also assist in sorting through that information to locate those crucial “hot docs.”

A relatively new technique for searching for relevant information is called predictive coding or computer-assisted coding. This coding system was discussed at length in a recent opinion in an employment discrimination case from the United States District Court for the Southern District of New York. United States Magistrate Judge Andrew J. Peck approved the use of computer-assisted coding to cull approximately three million electronic documents in Moore v. Publicis Groupe & MSL Group, No. 11-Civ.-1279, 2012 WL 607412 (S.D.N.Y. Feb. 24, 2012). Much of the Moore opinion deals with the parties’ disputes about the development of an agreed-upon protocol to govern the production of ESI. In the white collar context, there may be no negotiation about protocol: the prosecutors simply determine whether production is required, based on a variety of factors including the Due Process Clause of the Fifth Amendment, the Federal Rules of Criminal Procedure, and the ethical and professional obligations of individual federal prosecutors. Yet Moore is still instructive because it illustrates how computer-assisted coding is becoming an accepted and reliable way to search large volumes of information. Although studies have shown that predictive coding is more accurate and cost-effective than traditional types of review, including manual review, the legal community has been hesitant to adopt the technology. The Moore decision may help the legal community adopt this new methodology more readily.

When a criminal defense attorney receives a large data dump, he is left to sort out the garbage from the relevant documents. Without the resources to review each document, the criminal defense attorney may elect to run keyword searches in an effort to identify the relevant information. Unfortunately, keyword searches can leave out close to 80 percent of information relevant to the case. The Sedona Conference Best Practices Commentary on the Use of Search and Information Retrieval Methods in E-Discovery, at 206, August 2007. While lawyers believe that keyword searches are capturing 75 to 80 percent of relevant evidence, relevant studies demonstrate otherwise. Id.

In his opinion, Judge Peck cited case studies showing that keyword searches are largely ineffective. When using keywords, lawyers have to guess at what words might be contained in the relevant information, resulting in a search protocol similar to a child’s game of “Go Fish.” Moore, 2012 WL 607412 at *10. Moreover, searching for keywords often leads to over-inclusive results and can be thwarted by abbreviations and misspellings. Id. Judge Peck also rejected the notion that the best option is to have an attorney review each document, noting “while some lawyers still consider manual review to be the ‘gold standard,’ that is a myth, as statistics clearly show that computerized searches are at least as accurate, if not more so, than manual review.” Id. at *9. Writing with an obvious passion for the subject matter and command of the data, Judge Peck concluded that computer-assisted coding appears to be “better than the available alternatives.” Id. at *11.
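
The weakness of keyword searching described above can be illustrated with a short Python sketch. The four sample documents, the function names, and the 0.8 similarity cutoff are all hypothetical choices made for illustration; they are not drawn from Moore or from any review platform. The sketch compares an exact keyword match with a tolerant match built on the standard library's difflib module.

```python
import difflib
import re

# Hypothetical mini-corpus: one correct spelling, one misspelling,
# one abbreviation, and one unrelated message.
DOCS = [
    "Please approve the invoice before Friday.",
    "Attached is the invoce you asked about.",
    "Send the inv. to accounting.",
    "Lunch on Thursday?",
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def exact_hits(docs, keyword):
    """Return indexes of documents containing the keyword verbatim."""
    return [i for i, doc in enumerate(docs) if keyword in tokenize(doc)]


def fuzzy_hits(docs, keyword, cutoff=0.8):
    """Return indexes of documents containing a word similar to the keyword."""
    hits = []
    for i, doc in enumerate(docs):
        if difflib.get_close_matches(keyword, tokenize(doc), n=1, cutoff=cutoff):
            hits.append(i)
    return hits


if __name__ == "__main__":
    print("exact:", exact_hits(DOCS, "invoice"))
    print("fuzzy:", fuzzy_hits(DOCS, "invoice"))
```

In this toy corpus the exact search finds only the correctly spelled document, and the tolerant search also recovers the misspelling. Neither finds the abbreviation, which is the kind of gap that motivates the concept-based approach discussed in the opinion.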

Explaining how the predictive coding technology works, Judge Peck stated that predictive coding uses “sophisticated algorithms to enable the computer to determine relevance, based on interaction with (i.e. training) by a human reviewer.” Moore, 2012 WL 607412 at *2. While manual review is typically performed by junior staff members, predictive coding involves an expert on the case, typically a senior partner, reviewing and coding a “seed set” of documents. Id. Based on the expert’s choices, the computer identifies the properties of the documents and uses those properties to code other documents. Id. The computer also codes some documents and asks the senior reviewer for feedback. Id. Compared to manual review, which involves putting eyeballs on tens of thousands or even hundreds of thousands of documents at a cost of several dollars per page, predictive coding will typically require reviewing only a few thousand documents from the corpus. The overall cost of the review is substantially reduced, to a few cents per page.
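
The seed-set workflow just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the opinion does not identify the algorithm any vendor uses, so a simple naive Bayes word model stands in for it here, and the seed documents, function names, and the "closest to 50 percent" rule for choosing feedback documents are all hypothetical. The sketch also assumes the seed set contains at least one relevant and one non-relevant document.

```python
import math
import re
from collections import Counter

# Hypothetical seed set: (document text, expert's relevance call).
SEED_SET = [
    ("wire the payment offshore before the audit", True),
    ("delete the ledger entries about the offshore account", True),
    ("team lunch is on friday", False),
    ("the parking garage closes early friday", False),
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def train(seed_set):
    """Count words per label in the expert-coded seed set."""
    word_counts = {True: Counter(), False: Counter()}
    doc_counts = Counter()
    for text, label in seed_set:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = set(word_counts[True]) | set(word_counts[False])
    return word_counts, doc_counts, vocab


def relevance_probability(model, text):
    """Estimate P(relevant) with naive Bayes and add-one smoothing."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    log_score = {}
    for label in (True, False):
        total_words = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / total_docs)
        for word in tokenize(text):
            if word in vocab:  # words never seen in training are ignored
                count = word_counts[label][word] + 1
                score += math.log(count / (total_words + len(vocab)))
        log_score[label] = score
    top = max(log_score.values())  # subtract the max for numeric stability
    rel = math.exp(log_score[True] - top)
    non = math.exp(log_score[False] - top)
    return rel / (rel + non)


def most_uncertain(model, docs, k=1):
    """Pick the k documents whose score is closest to 0.5 for expert feedback."""
    return sorted(docs, key=lambda d: abs(relevance_probability(model, d) - 0.5))[:k]
```

Calling train(SEED_SET) builds the model, relevance_probability scores an unreviewed document, and most_uncertain returns the documents the model is least sure about, which would go back to the senior reviewer. Each round of feedback would be added to the seed set before retraining.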

Judge Peck notes that sampling and quality control tests are an important part of the application of the predictive coding technology. Moore, 2012 WL 607412 at *2-3. Predictive coding is not a license to simply have the computer review all of your documents. The technology involves an interaction of man and machine. In order to ensure that the technology is capturing the relevant documents based on the expert’s coding of the initial seed set, it is important for counsel to review a randomly selected, statistically significant portion of the overall corpus of the documents. This may mean reviewing as little as a few hundred documents for a large data set. If the review of the random sample demonstrates that the technology found the relevant documents in that set or that there were no relevant documents in that set, then counsel can rest assured based on the application of statistical methodologies that the technology has captured the relevant documents within the corpus.
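
To see why a few hundred documents can suffice for a very large collection, consider the textbook sample-size formula for estimating a proportion, sketched below in Python. The opinion does not prescribe this formula; the 95 percent confidence level, the 5 percent margin of error, the worst-case proportion of 0.5, and the function names are assumptions chosen for illustration.

```python
import math
import random

# Standard normal critical values for common confidence levels.
Z_SCORES = {0.90: 1.645, 0.95: 1.96, 0.99: 2.576}


def sample_size(population, confidence=0.95, margin=0.05, proportion=0.5):
    """Documents to sample so an estimated rate falls within +/- margin.

    Uses n0 = z^2 * p * (1 - p) / margin^2 and then applies the
    finite population correction for a collection of known size.
    """
    z = Z_SCORES[confidence]
    n0 = (z ** 2) * proportion * (1 - proportion) / margin ** 2
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)


def draw_sample(population, n, seed=None):
    """Pick n distinct document positions uniformly at random."""
    rng = random.Random(seed)
    return rng.sample(range(population), n)


if __name__ == "__main__":
    size = sample_size(3_000_000)
    print(size, "documents to review")
    print(draw_sample(3_000_000, size, seed=1)[:5])
```

Under these assumptions, a collection of three million documents calls for a random sample of 385. Tightening the margin of error to 2 percent raises the figure to 2,400, which is still a small fraction of the corpus.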

Predictive coding technology is thus a partnership of humans and technology. Mathematical models and algorithms are used to find words linked to primary terms; the computer searches for relationships between words and synonyms and is able to search for concepts, rather than just words. While computer-assisted coding is not inexpensive, its price pales in comparison to that of manual review and keyword searches. Furthermore, the results are better than those of keyword searches and manual review. Properly designed predictive coding protocols can retrieve up to 90 percent of all relevant electronically stored information in a case. This technology has the potential to extract relevant documents quickly from huge document productions, allowing for an assessment of costs, liability, and potential for plea negotiations at an earlier stage than might otherwise occur. Thus, as Judge Peck noted, computer-assisted review is an available tool and should be seriously considered for use in large-data-volume cases.
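
A figure such as "90 percent of all relevant electronically stored information" is a measure of recall: the share of the truly relevant documents that the review actually retrieved. The short Python sketch below shows how recall, and its companion measure precision, are computed. The document identifiers and the function name are hypothetical, and in practice the set of truly relevant documents is not known in full and is estimated from samples.

```python
def recall_and_precision(retrieved, relevant):
    """Return (recall, precision) for a set of retrieved document IDs.

    recall    = relevant documents retrieved / all relevant documents
    precision = relevant documents retrieved / all documents retrieved
    """
    retrieved, relevant = set(retrieved), set(relevant)
    found = len(retrieved & relevant)
    recall = found / len(relevant) if relevant else 0.0
    precision = found / len(retrieved) if retrieved else 0.0
    return recall, precision


# Hypothetical example: 10 relevant documents exist (IDs 0 through 9).
RELEVANT_IDS = set(range(10))
# The review retrieved 9 of them plus 3 irrelevant documents.
RETRIEVED_IDS = set(range(9)) | {20, 21, 22}

if __name__ == "__main__":
    recall, precision = recall_and_precision(RETRIEVED_IDS, RELEVANT_IDS)
    print(f"recall={recall:.2f} precision={precision:.2f}")
```

In this example the review finds 9 of the 10 relevant documents, for a recall of 90 percent, while 9 of the 12 documents it returned are relevant, for a precision of 75 percent. A review method should be judged on both numbers, since retrieving everything would yield perfect recall and very poor precision.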

Edward J. Page is a shareholder in the Tampa office of Carlton Fields, P.A. Mr. Page, a white collar criminal defense lawyer and commercial litigator, is a former federal and state prosecutor with over 30 years of experience.

Catherine Salinas Acree is a shareholder in the Atlanta office of Carlton Fields, P.A. Ms. Salinas Acree handles civil and criminal litigation in federal and state court, as well as corporate internal investigations and responses to governmental subpoenas and investigations.

Rebecca Shwayri is an associate in the Tampa office of Carlton Fields, P.A. Ms. Shwayri is a business litigator and information technology lawyer who focuses on e-discovery, privacy, and technology related issues in the law.
