In a previous article on the topic of electronic discovery, I began the conversation about a term we use called “processing”. Now we will discuss one of the stages within the workflow of processing electronic discovery referred to as “deduplication”.
Remember, the goal of processing is to get the electronic documents we collected into a format that we can easily load into a document database so that the attorneys can review them for relevance.
There is a good chance that when electronic discovery data is initially collected, it will include duplicate copies of the same content. Duplicates occur within one custodian, especially if we collected multiple years of data for the same custodian. Duplicates also occur across multiple custodians when custodians share the same file with each other or when one custodian sends an email to multiple recipients.
The concern with duplicate data is the significant increase in the number of documents that the attorneys will need to review. So, in almost every electronic discovery matter, we perform a deduplication process. Typically this step will reduce the document volume by 30-40%.
In order to remove the duplicates, we generate a hash value or a “compact digital fingerprint” for each file. A hash value is a fixed-length value computed from the contents of the file, usually written in hexadecimal. Identical files always produce the same hash value, and it is extremely unlikely that two non-identical files will have the same hash value. There are several types of hash algorithms.
The first is called MD5 (Message-Digest Algorithm 5). It produces a 128-bit value that is usually written as 32 hexadecimal characters.
The next is called SHA (Secure Hash Algorithm), a family that includes five commonly used algorithms for computing a condensed digital representation. SHA-1 and SHA-2 were designed by the National Security Agency (NSA) and published by the National Institute of Standards and Technology (NIST) as a U.S. government standard. The family includes SHA-1 as well as the SHA-2 algorithms (SHA-224, SHA-256, SHA-384, SHA-512).
Hash values are fairly easy to generate. To see how it works, you can test it out using a website like this one: http://www.miraclesalad.com/webtools/md5.php
For example, if we use the text string “The quick brown fox jumps over the lazy dog”, the hash values would be:
MD5 = 9e107d9d372bb6826bd81d3542a419d6
SHA-1 = 2fd4e1c6 7a2d28fc ed849ee1 bb76e739 1b93eb12
SHA-256 = d7a8fbb3 07d78094 69ca9abc b0082e4f 8d5651e4 6d3cdb76 2d02d0bf 37c9e592
SHA-512 = 07e547d9 586f6a73 f73fbac0 435ed769 51218fb7 d0c8d788 a309d785 436bbb64 2e93a252 a954f239 12547d1e 8a3b5ed6 e1bfd709 7821233f a0538f3d b854fee6
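If you would rather generate these values yourself than rely on a website, the short sketch below uses Python's standard hashlib module to hash the same sentence. The hash_file helper is a hypothetical name added for illustration, not part of any processing tool. It shows one way a file could be hashed in chunks so that large files do not need to fit in memory.

```python
import hashlib

text = "The quick brown fox jumps over the lazy dog"
data = text.encode("utf-8")  # hash functions operate on bytes, not on text

# Compute the digest of the same bytes with each algorithm.
digests = {
    name: hashlib.new(name, data).hexdigest()
    for name in ("md5", "sha1", "sha256", "sha512")
}

for name, value in digests.items():
    print(f"{name:<7} {value}")


def hash_file(path, algorithm="sha256", chunk_size=1024 * 1024):
    """Hash a file's contents in chunks (hypothetical helper for illustration)."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        # Read fixed-size chunks until read() returns empty bytes at end of file.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

The printed values are continuous hexadecimal strings. The spaces in the listing above are only there for readability.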
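For emails, we typically do not hash the raw message file, because copies of the same email stored in different mailboxes are rarely byte-for-byte identical. Instead, we combine several fields together to generate a hash value. For instance, we might combine these fields: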
To, CC, BCC, From, Date, Subject, AttachmentCount, MessageBody
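As a rough illustration of how such a combined email hash could be built, here is a minimal Python sketch. The field order, the whitespace normalization, the separator character, and the choice of MD5 are all assumptions made for this example. Each processing tool defines its own rules, so hashes from different tools are generally not comparable.

```python
import hashlib

# Assumed field order for this sketch; a real processing tool defines its own.
FIELDS = ("To", "CC", "BCC", "From", "Date", "Subject",
          "AttachmentCount", "MessageBody")


def normalize(value):
    """Collapse runs of whitespace so trivial formatting differences do not matter."""
    return " ".join(str(value).split())


def email_hash(message):
    """Hash selected email fields (hypothetical helper; message is a plain dict)."""
    parts = [normalize(message.get(field, "")) for field in FIELDS]
    # Join with a separator that is unlikely to occur in the fields, so that
    # ("ab", "c") and ("a", "bc") do not produce the same input string.
    joined = "\x1f".join(parts)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


original = {
    "To": "bob@example.com", "From": "alice@example.com",
    "Date": "2012-03-01T09:30:00Z", "Subject": "Quarterly report",
    "AttachmentCount": 1, "MessageBody": "Please see the attached report.",
}
# Same email as stored in another mailbox, with different line wrapping.
copy = dict(original, MessageBody="Please see the\r\nattached   report.")

print(email_hash(original) == email_hash(copy))
```

How aggressively to normalize is a design choice. Looser rules catch more duplicates but raise the risk of treating two genuinely different messages as the same.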
Keep in mind that the hash values generated will be delivered to you as a metadata field within the load files from your processing service provider and will be a field in your document database. The hash value is usually included later on within the production load files as well.
Once the hash values exist, the next step is to perform the deduplication process. A decision needs to be made at this point and it will depend on the matter. Do you want to remove duplicates within each custodian only or do you want to remove duplicates across all custodians? This is a question for the lead attorney or partner and it may be subject to an agreement with opposing counsel.
Deduplication within a custodian is also referred to as “vertical” deduplication.
Deduplication across custodians is also referred to as “horizontal” or “global” deduplication.
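The difference between the two approaches comes down to what counts as "already seen". The sketch below is a simplified illustration, not how any particular processing tool works. The record layout (id, custodian, hash) and the function name are assumptions, and the first copy encountered is the one that is kept.

```python
def deduplicate(documents, scope="global"):
    """Split documents into (kept, dropped), keeping the first copy seen.

    scope="custodian" compares hashes within each custodian only.
    scope="global" compares hashes across all custodians.
    """
    if scope not in ("custodian", "global"):
        raise ValueError("scope must be 'custodian' or 'global'")

    seen = set()
    kept, dropped = [], []
    for doc in documents:
        # Within-custodian dedup keys on the (custodian, hash) pair,
        # while global dedup keys on the hash alone.
        key = doc["hash"] if scope == "global" else (doc["custodian"], doc["hash"])
        if key in seen:
            dropped.append(doc)
        else:
            seen.add(key)
            kept.append(doc)
    return kept, dropped


# Hypothetical sample records; "h1" and "h2" stand in for real hash values.
docs = [
    {"id": "A-001", "custodian": "Alice", "hash": "h1"},
    {"id": "A-002", "custodian": "Alice", "hash": "h1"},
    {"id": "B-001", "custodian": "Bob", "hash": "h1"},
    {"id": "B-002", "custodian": "Bob", "hash": "h2"},
]

for scope in ("custodian", "global"):
    kept, dropped = deduplicate(docs, scope)
    print(scope, [d["id"] for d in kept], [d["id"] for d in dropped])
```

With global deduplication, the order in which custodians are processed determines whose copy survives. In practice it is common to keep a record of which custodians held each dropped duplicate so that information is not lost.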
Lastly, there is a process called “reduplication”. In some instances, the data is deduplicated prior to document review, but the service provider is then asked to put all of the duplicates back in during the production stage. This is a ploy. I have never been asked to do this step, but I have colleagues who have mentioned it to me.