Litigators can cut both review time and production costs by de-duplicating e-mail so that each message in a population is reviewed only once. As with other document types, e-mail is best de-duplicated by computing a message digest with an algorithm such as MD5 or one of the SHA family and treating items that produce identical digest values as duplicates.
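As a rough illustration of digest-based de-duplication, the sketch below hashes each item's bytes and keeps only the first item seen for each digest value. It assumes Python's standard hashlib module and uses SHA-256 by default; the document labels and the deduplicate helper are hypothetical names chosen for this example, not part of any particular review platform.

```python
import hashlib


def digest(data: bytes, algorithm: str = "sha256") -> str:
    """Return the hex digest of data under the named hashlib algorithm."""
    return hashlib.new(algorithm, data).hexdigest()


def deduplicate(items):
    """Keep the first item seen for each digest value.

    items is an iterable of (label, content_bytes) pairs. The labels are
    hypothetical identifiers used only for this illustration.
    """
    first_seen = {}
    for label, content in items:
        # setdefault stores the label only if this digest is new.
        first_seen.setdefault(digest(content), label)
    return list(first_seen.values())


population = [
    ("doc-001", b"Quarterly results attached."),
    ("doc-002", b"Quarterly results attached."),  # byte-for-byte copy of doc-001
    ("doc-003", b"Quarterly results attached!"),  # differs by one character
]
review_set = deduplicate(population)
```

Even a one-character difference produces a different digest, so only exact copies are collapsed.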
De-duplicating e-mail differs from de-duplicating other document types, however, because an e-mail is a compound document. In other words, it consists of a message and, in many cases, one or more attachments.
Identifying duplicate e-mails at the compound level, with the message and its attachments treated as a single unit, can reduce the population, but even more efficiency can be gained when the message and the attachments are considered separately.
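One way to picture the difference is to hash the message and each attachment on their own. The sketch below assumes Python's standard email and hashlib modules; the build_message helper, the addresses, and the sample attachment bytes are hypothetical and exist only to produce test data. It shows two different messages that carry the same attachment: at the compound level they are distinct, but the attachment digests match.

```python
import hashlib
from email import message_from_bytes, policy
from email.message import EmailMessage


def sha256(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def component_digests(raw: bytes):
    """Return (body_digest, [attachment_digests]) for one raw e-mail."""
    msg = message_from_bytes(raw, policy=policy.default)
    body = msg.get_body(preferencelist=("plain", "html"))
    body_text = body.get_content() if body is not None else ""
    attachment_digests = [
        sha256(part.get_payload(decode=True) or b"")
        for part in msg.iter_attachments()
    ]
    return sha256(body_text.encode("utf-8")), attachment_digests


def build_message(recipient: str, text: str, attachment: bytes) -> bytes:
    """Hypothetical helper that builds a sample e-mail with one attachment."""
    msg = EmailMessage()
    msg["From"] = "alice@example.com"
    msg["To"] = recipient
    msg["Subject"] = "Quarterly report"
    msg.set_content(text)
    msg.add_attachment(attachment, maintype="application",
                       subtype="octet-stream", filename="report.bin")
    return msg.as_bytes()


first = build_message("bob@example.com", "Draft for your review.", b"sample-report-bytes")
second = build_message("carol@example.com", "Forwarding the same file.", b"sample-report-bytes")

first_body, first_attachments = component_digests(first)
second_body, second_attachments = component_digests(second)
```

Here the two message digests differ while the attachment digests are identical, so the attachment would need to be reviewed only once.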
In addition, the elements considered for de-duplication within the e-mail message itself should be narrowed in order to gain the most efficiency. In other words, only specific metadata properties within the e-mail should be selected for de-duplicating the message. Otherwise, copies of the same message sent to various parties could still appear to be unique messages simply because they passed through different servers when sent over the internet.
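The idea of hashing only selected properties can be sketched as follows. The choice of fields here (From, To, Cc, Subject, Date, and the body text) is an assumption made for illustration; the text above does not prescribe a specific list, and a real project would settle on its own. The sketch uses Python's standard email and hashlib modules, and the two sample copies are invented data that differ only in their Received header.

```python
import hashlib
from email import message_from_string, policy

# Assumed field selection for this sketch; transport headers such as
# Received are deliberately left out.
SELECTED_HEADERS = ("From", "To", "Cc", "Subject", "Date")


def message_digest(raw: str) -> str:
    """Hash only the selected headers and the body of one e-mail."""
    msg = message_from_string(raw, policy=policy.default)
    parts = []
    for name in SELECTED_HEADERS:
        value = str(msg.get(name, ""))
        # Collapse runs of whitespace so header folding does not matter.
        parts.append(f"{name.lower()}:{' '.join(value.split())}")
    body = msg.get_body(preferencelist=("plain", "html"))
    text = body.get_content() if body is not None else ""
    parts.append("body:" + " ".join(text.split()))
    return hashlib.sha256("\n".join(parts).encode("utf-8")).hexdigest()


COMMON = (
    "From: alice@example.com\n"
    "To: bob@example.com, carol@example.com\n"
    "Subject: Contract draft\n"
    "Date: Mon, 1 Jan 2024 09:00:00 +0000\n"
    "\n"
    "Please review the attached draft.\n"
)
# Same message as delivered to two mailboxes via different servers.
COPY_A = "Received: from mx1.example.com by inbox-a.example.net\n" + COMMON
COPY_B = "Received: from mx7.example.org by inbox-b.example.net\n" + COMMON

same_message = message_digest(COPY_A) == message_digest(COPY_B)
same_raw_bytes = (hashlib.sha256(COPY_A.encode()).hexdigest()
                  == hashlib.sha256(COPY_B.encode()).hexdigest())
```

Hashing the raw text treats the two copies as different items, while the narrowed digest treats them as one message.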
Although de-duplication can increase efficiency, it must be paired with the ability to track where every duplicate was found. Simply eliminating duplicates without tracking their locations prevents the litigator from fully recognizing the significance of key documents once they are identified. Thus, tracking where the documents have been is just as important as the de-duplication itself.
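A minimal way to keep that tracking is to record every location under the digest instead of discarding the later copies. The sketch below assumes Python's standard hashlib module; the custodian paths and sample contents are hypothetical, and a production system would more likely store this index in a database.

```python
import hashlib
from collections import defaultdict


def build_location_index(items):
    """Map each digest to the list of every location where it was found.

    items is an iterable of (location, content_bytes) pairs.
    """
    index = defaultdict(list)
    for location, content in items:
        index[hashlib.sha256(content).hexdigest()].append(location)
    return dict(index)


collected = [
    ("custodian-a/inbox/0001.eml", b"Key memo text"),
    ("custodian-b/sent/0417.eml", b"Key memo text"),
    ("custodian-c/inbox/0032.eml", b"Unrelated note"),
]
index = build_location_index(collected)

# Once a document is flagged as key, look up everywhere it appeared.
key_digest = hashlib.sha256(b"Key memo text").hexdigest()
locations_of_key_memo = index[key_digest]
```

Only one copy per digest needs to be reviewed, yet the index still shows which custodians held the document.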