SharePoint and OCR 2.0: Out with the Old


Using Adaptive OCR Technology & Analytics to Drive SharePoint Efficiency and Adoption

Optical Character Recognition technology, or OCR, has been around for quite some time. It became mainstream back in the ’70s, when Ray Kurzweil developed a technology to help the visually impaired. He quickly realized the broad commercial implications of his invention, and so did Xerox, which purchased his company. From there, OCR saw broad adoption across all types of use cases.

At its simplest, OCR is a means to take an image and convert recognized characters to text. In the Enterprise Content Management (ECM) world, it’s this technology that provides a broad range of metadata and content collection methods as documents are scanned and processed. Here are the basic legacy forms of OCR that can be leveraged with SharePoint:

  • Full Text OCR – converts the entire document image to text, allowing full text search capabilities. With this form of OCR, documents bound for SharePoint are typically converted to an Image+Text PDF, which can be crawled so that the content becomes fully searchable.
  • Zone OCR – zoning provides the ability to extract text from a specific location on the page. In this form of “templated” processing, specific OCR metadata can be extracted and mapped to a SharePoint column. This method is appropriate for structured documents that carry the data in the same location every time.
  • Pattern Matching OCR – pattern matching is purely a method to filter or match patterns within OCR text. This technique can provide some capabilities when it comes to extracting data from unstructured or non-homogeneous documents. For example, you could extract a Social Security Number pattern (XXX-XX-XXXX) from the OCR text and map it to a SharePoint column.
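
To make the zone and pattern-matching forms above a little more concrete, here is a minimal Python sketch. It assumes the OCR engine has already returned words with page coordinates; the Word type, the coordinates, the sample values and the function names are all hypothetical and do not come from any particular OCR product or from SharePoint itself.

```python
import re
from dataclasses import dataclass


@dataclass
class Word:
    """One OCR'd word with its position on the page (hypothetical structure)."""
    text: str
    x: float
    y: float


# Social Security Number pattern (XXX-XX-XXXX), as in the example above.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def extract_zone(words, left, top, right, bottom):
    """Zone OCR: keep only the words that fall inside a fixed rectangle."""
    inside = [w for w in words if left <= w.x <= right and top <= w.y <= bottom]
    return " ".join(w.text for w in inside)


def extract_pattern(text, pattern=SSN_PATTERN):
    """Pattern matching: return every match of the pattern in the OCR text."""
    return pattern.findall(text)


# Made-up OCR output for a single page.
words = [
    Word("Invoice", 50, 40), Word("INV-1001", 420, 40),
    Word("SSN:", 50, 200), Word("123-45-6789", 110, 200),
]

# Zone: the invoice number is assumed to sit in the top-right corner.
invoice_number = extract_zone(words, left=400, top=20, right=600, bottom=60)

# Pattern: scan the full text, wherever the value happens to sit.
full_text = " ".join(w.text for w in words)
ssns = extract_pattern(full_text)

print(invoice_number, ssns)
```

Either value could then be written to a SharePoint column by whatever capture tool or integration you use. Note that both functions work on raw text and position only, which is the limitation the rest of this post is about.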

These forms of OCR are considered legacy methods of extraction. Although they can provide some value in any document process that involves SharePoint, they are purely data driven and operate only at the text level.

In steps OCR 2.0. Today, innovators like Ephesoft use OCR as the very bottom layer of their document analytics and intelligence stack. The OCR text is now pushed through algorithms that create meaning out of all types of dimensions: location, size, font, patterns, values, zones, numbers, and more. (You can read about this patented technology here: Document Analytics and Why It Matters in Capture and OCR.) So rather than being completely data-centric, or functioning only at the text layer, we now create a high-functioning intelligence layer that can be used beyond just text searching and metadata. And the best part? This technology has been extended to non-scanned files like Office documents. Examples? See below:

  • Multi-dimensional Classification – using that analysis capability (with OCR as algorithm input) and all the collected dimensions of the document, the document type or content type can now be accurately identified. As documents are fed into SharePoint, they can be intelligently classified, and that information is then actionable with workflows, retention policies, security restrictions and more. You can see more on this topic in this video on Multi-dimensional Classification Technology: Machine Learning and Classification of Documents
  • Machine Learning – legacy OCR technology provided no means or method to “get smarter” as documents were processed. Looking at pure text, it either recognized a character or it did not. With a machine learning layer, you now have a system that gets more efficient the more you use it. The key here is that learned intelligence must span documents; it cannot be tied to any one item. It’s this added efficiency that can drive SharePoint usage and adoption through ease of use. You can see more on machine learning in the videos below:

Machine Learning and OCR Data Extraction

Machine Learning and External Data

  • Document Analytics, Accuracy and Extraction – with legacy OCR, extracting the information you need can be problematic at best. How do you raise confidence that the information you have is accurate? With an analysis engine, we look not just at the text, but at where it sits, what surrounds it, and known patterns or libraries. This added layer provides the ability to express higher confidence in data extraction, and makes sure you are putting the right data into SharePoint.
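
As a rough illustration of the last point, the Python sketch below scores a candidate value using more than its raw OCR confidence: it also checks whether the text matches a known pattern and whether an expected label sits just to its left on the same line. This is only a toy model under stated assumptions. The Token type, the weights, the label list and the distance thresholds are invented for the example, and it is not a description of Ephesoft's patented algorithms.

```python
import re
from dataclasses import dataclass


@dataclass
class Token:
    """One OCR'd token with position and engine confidence (hypothetical)."""
    text: str
    x: float
    y: float
    ocr_confidence: float  # 0.0 to 1.0


DATE_PATTERN = re.compile(r"^\d{2}/\d{2}/\d{4}$")  # known pattern for the field
LABELS = {"date", "due"}                            # labels expected nearby


def score_candidate(token, tokens, max_distance=150.0, line_tolerance=5.0):
    """Combine text confidence, pattern match and surrounding context."""
    score = token.ocr_confidence * 0.5           # what the text layer says
    if DATE_PATTERN.match(token.text):
        score += 0.25                            # matches a known pattern
    for other in tokens:
        if other is token:
            continue
        same_line = abs(other.y - token.y) <= line_tolerance
        to_left = 0 < token.x - other.x <= max_distance
        if same_line and to_left and other.text.lower().rstrip(":") in LABELS:
            score += 0.25                        # expected label sits beside it
            break
    return min(score, 1.0)


# Made-up page: the same date string appears twice, once next to its label.
tokens = [
    Token("Date:", 50, 100, 0.99),
    Token("03/15/2016", 120, 100, 0.80),
    Token("03/15/2016", 400, 600, 0.80),
]

labeled_score = score_candidate(tokens[1], tokens)
unlabeled_score = score_candidate(tokens[2], tokens)
print(labeled_score, unlabeled_score)
```

Both candidates have identical text and identical OCR confidence, yet the one beside the "Date:" label scores higher. That is the basic idea behind looking at where a value sits and what surrounds it before committing it to SharePoint.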

This was just a quick overview of the benefits of moving away from legacy OCR and embracing OCR 2.0 for SharePoint. Thoughts?