OCR Text vs. Extracted Text


Another topic to cover when training a litigation support newbie is how we go about making the documents in our document databases searchable.

One of the most important reasons we provide litigators with document databases is to enable the legal team to search across all of the documents in order to find (a) relevant documents, (b) harmful documents and (c) helpful documents.

Since searching the documents is such a high priority, we need to find ways to make sure there is “searchable text” for most of the documents.

The first way we accomplish this is to simply “extract the text” from the native electronic files. There are many software programs that perform this task. Extracted text is considered nearly 100% accurate in terms of performing searches.

The second way is a little more work. There will be native electronic files that do not play nice with the text extraction software. Sometimes it is because there simply isn't any text to be extracted. Other times it is because there is a technical reason why the file is prohibiting the text from being extracted. We have some workarounds for these technical reasons and we eventually succeed in extracting text from some of these files.

There are other electronic files that will require a process referred to as “optical character recognition” (OCR) before we can get searchable text for the file. The OCR software can be run against a batch of files, but it can also take a while to complete the process. In the end, we get OCR text that we consider to be about 85% accurate.
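To make the routing decision concrete, here is a minimal Python sketch of how a file might be sent down the extracted-text path or the OCR path. It is only an illustration: the function name `choose_text_source`, the `TextSource` labels, and the `min_chars` threshold are all hypothetical and do not come from any particular processing tool.

```python
from enum import Enum


class TextSource(Enum):
    """Where a document's searchable text will come from."""
    EXTRACTED = "extracted"
    OCR = "ocr"


def choose_text_source(extracted_text, min_chars=25):
    """Decide whether extracted text is usable or the file should go to OCR.

    extracted_text is whatever the extraction step returned (may be None
    or empty). min_chars is a hypothetical cutoff, not an industry standard.
    """
    # Count only non-whitespace characters so a file of blank lines
    # is not mistaken for a file with real text.
    usable = "".join(extracted_text.split()) if extracted_text else ""
    if len(usable) >= min_chars:
        return TextSource.EXTRACTED
    return TextSource.OCR
```

A file whose extraction comes back empty, or with only whitespace, would be queued for OCR, while a file with a reasonable amount of text keeps its extracted text. In practice the cutoff would need tuning against real documents.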

We must not forget that some of the documents collected in litigation matters are hardcopy documents. They need to go through the process of being scanned and then OCR'd. There are several quality check stages during this process. Dealing with the conversion of hardcopy documents into a format we can use in a document database can take a good deal of time.

When we advise the legal team, we make sure they are aware of what percentage of the database contains extracted text versus OCR text. They need to understand any potential limitations of their search results.
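As a rough sketch of that reporting step, the breakdown can be calculated from a list that records where each document's text came from. The labels "extracted", "ocr" and "none" and the function name `text_source_breakdown` are made up for illustration; a real review platform would store this information in its own fields.

```python
from collections import Counter


def text_source_breakdown(sources):
    """Return the percentage of documents per text source, rounded to one decimal."""
    counts = Counter(sources)
    total = sum(counts.values())
    if total == 0:
        # An empty database has no breakdown to report.
        return {}
    return {source: round(100 * n / total, 1) for source, n in counts.items()}


# Hypothetical database of ten documents.
sources = ["extracted"] * 7 + ["ocr"] * 2 + ["none"]
print(text_source_breakdown(sources))
# prints {'extracted': 70.0, 'ocr': 20.0, 'none': 10.0}
```

In this made-up example, the legal team would know that searches are highly reliable for 70% of the database, less reliable for 20%, and will miss the remaining 10% entirely.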

 


    • mgolab

      Thanks, Amy.

      To me this is a key issue when clients conduct their own collection, in that you have to weigh up the risk posed by the percentage of image-based attachments versus searchable ones. With the [theoretical at least] trend of companies generating less paper, there is a pretty big trend of scanners emailing you scanned docs, which then sit in mailboxes and email archives as inert image-based files that can’t be searched in their native environments.

      There is also the slight ‘elephant in the room’ issue of handwriting and the [to me at least] relatively poor capability of the standard OCR tools to handle handwriting.

      Finally there is also the consideration of the original language of the image and the capabilities of the OCR system in terms of language identification.

      It would be interesting to hear your thoughts on the OCR tools out there such as Adobe Acrobat and ABBYY etc.

      • Thanks for sharing your wisdom, Matthew. I have found Adobe’s OCR results to be better than some other tools and I use it often. It is also a good solution for a small firm that can’t afford to purchase litigation support processing tools. I haven’t used ABBYY in a while, but I know some people do.

    • Patt F.

      Another problem is when you get a production in PDF format and the Bates numbers have been applied using Adobe. In some processing software, when you try to OCR those documents, your text file will only contain the Bates number! 🙁