Another topic to cover when training a litigation support newbie is how we obtain searchable text for the documents in our document databases.
One of the most important reasons we provide litigators with document databases is to let the legal team search across all of the documents and find the (a) relevant documents, (b) harmful documents and (c) helpful documents.
Since searching the documents is such a high priority, we need to find ways to make sure there is “searchable text” for most of the documents.
The first way we accomplish this is to simply “extract the text” from the native electronic files. There are many software programs that perform this task. Extracted text is considered nearly 100% accurate for search purposes.
The second way is a little more work. Some native electronic files do not play nice with the text extraction software. Sometimes there simply isn't any text to extract. Other times a technical issue with the file prevents the text from being extracted. We have workarounds for some of these technical issues, and we eventually succeed in extracting text from some of these files.
Other electronic files require a process referred to as “optical character recognition” (OCR) before we can get searchable text for them. The OCR software can be run against a batch of files, but the process can take a while to complete. In the end, we get OCR text that we consider to be about 85% accurate, so a search can miss a term that the OCR misread.
We must not forget that some of the documents collected in litigation matters are hardcopy documents. They need to go through the process of being scanned and then OCR'd. There are several quality check stages during this process. Dealing with the conversion of hardcopy documents into a format we can use in a document database can take a good deal of time.
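For the newbie who thinks in code, the overall flow might be sketched roughly as follows: try text extraction first, and fall back to OCR when nothing usable comes back. This is only an illustration, not how any particular processing tool works. The function name `get_searchable_text` and the `extract_text` and `run_ocr` callables are hypothetical stand-ins for whatever extraction and OCR software is actually in use.

```python
from typing import Callable, Optional, Tuple


def get_searchable_text(
    path: str,
    extract_text: Callable[[str], Optional[str]],  # hypothetical extraction tool
    run_ocr: Callable[[str], str],                 # hypothetical OCR tool
) -> Tuple[str, str]:
    """Return (text, source), where source is 'extracted' or 'ocr'."""
    # First choice: pull the text straight out of the native file.
    text = extract_text(path)
    if text and text.strip():
        return text, "extracted"
    # Nothing usable was extracted (for example, a scanned image),
    # so fall back to OCR and record that the text came from OCR.
    return run_ocr(path), "ocr"
```

Recording the source alongside the text is the design choice that matters here, because it lets us tell the legal team later how much of the database relies on OCR.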
When we advise the legal team, we make sure they are aware of what percentage of the database contains extracted text versus OCR text. They need to understand any potential limitations of their search results.
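As a rough illustration of that reporting step, the percentages could be computed with a few lines of Python. This is a minimal sketch that assumes each document has already been labeled with a text source such as "extracted" or "ocr"; the function name `text_source_breakdown` and the labels are made up for the example.

```python
from collections import Counter
from typing import Dict, Iterable


def text_source_breakdown(sources: Iterable[str]) -> Dict[str, float]:
    """Return the percentage of documents for each text source label."""
    counts = Counter(sources)
    total = sum(counts.values())
    if total == 0:
        return {}  # empty database: nothing to report
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


# Hypothetical database of four documents: three extracted, one OCR'd.
print(text_source_breakdown(["extracted", "extracted", "extracted", "ocr"]))
# prints {'extracted': 75.0, 'ocr': 25.0}
```

A breakdown like this gives the legal team a quick sense of how much of their search results depend on the less accurate OCR text.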