Front-end Document Analytics for Capture, Classification and Extraction
One of our folks over in Europe built out a new Box export integration (thanks, OBH!). It provides some great features through the Box API and uses our machine learning engine to automate document additions, enabling the following:
Addition of any document through Ephesoft to Box
Machine learning for Box documents and data
Document Analytics, Separation, Classification and Data Extraction
Using Adaptive OCR Technology & Analytics to Drive SharePoint Efficiency and Adoption
Optical Character Recognition technology, or OCR, has been around for quite some time. It really became mainstream back in the ’70s when a man named Ray Kurzweil developed a technology to help the visually impaired. He quickly realized the broad commercial implications of his invention, and so did Xerox, who purchased his company. From there, OCR experienced broad adoption across all types of use cases.
At its simplest, OCR is a means to take an image and convert recognized characters to text. In the Enterprise Content Management (ECM) world, it’s this technology that provides a broad range of metadata and content collection methods as documents are scanned and processed. Here are the basic legacy forms of OCR that can be leveraged with SharePoint:
Full Text OCR – converts the entire document image to text, allowing full text search capabilities. Using this OCR with SharePoint, documents are typically converted to an Image+Text PDF, which can be crawled, and the content made fully searchable.
Zone OCR – Zoning provides the ability to extract text from a specific location on the page. In this form of “templated” processing, specific OCR metadata can be extracted and mapped to a SharePoint column. This method is appropriate for structured documents that have the data in the same location.
Pattern Matching OCR – pattern matching is purely a method to filter, or match patterns within OCR text. This technique can provide some capabilities when it comes to extracting data from unstructured, or non-homogeneous documents. For example, you could extract a Social Security Number pattern (XXX-XX-XXXX) from the OCR text and map it to a SharePoint column.
These forms of OCR are considered legacy methods of extraction, and although they can provide some value in any document process that involves SharePoint, they are purely data driven at the text level.
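To show how thin that text layer is, here is a minimal sketch of pattern-matching extraction, using a generic Python regex rather than any particular product’s engine:

```python
import re

def extract_ssn(ocr_text):
    """Find the first SSN-like pattern (XXX-XX-XXXX) in raw OCR text."""
    match = re.search(r"\b\d{3}-\d{2}-\d{4}\b", ocr_text)
    return match.group(0) if match else None

sample = "Applicant: Jane Doe  SSN: 123-45-6789  Date: 01/02/2020"
print(extract_ssn(sample))  # 123-45-6789
```

The match could then be mapped to a SharePoint column; notice the method knows nothing about where the number sits on the page or what surrounds it, which is exactly the limitation OCR 2.0 addresses.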
In steps OCR 2.0. Today, innovators like Ephesoft use OCR as the very bottom of their document analytics and intelligence stack. The OCR text is now pushed through algorithms that create meaning out of all types of dimensions: location, size, font, patterns, values, zones, numbers and more. (You can read about this patented technology here: Document Analytics and Why It Matters in Capture and OCR.) So rather than being purely data-centric, functioning only at the text layer, we now create a high-functioning intelligence layer that can be used beyond just text searching and metadata. And the best part? This technology has been extended to non-scanned files like Office documents. Examples? See below:
Multi-dimensional Classification – using that analysis capability (with OCR as algorithm input), and all the collected dimensions of the document, document type or content type can now be accurately identified. As documents are fed into SharePoint, they can be intelligently classified, and that information is now actionable with workflows, retention policies, security restrictions and more. You can see more on this topic in this video on Multi-dimensional Classification Technology: Machine Learning and Classification of Documents
Machine Learning – legacy OCR technology provided no means to “get smarter” as documents were processed. Looking purely at text, it either recognized it or not. With a machine learning layer, you now have a system that gets more efficient the more you use it. The key here is that learned intelligence must span documents; it cannot be tied to any one item. It’s this added efficiency that can drive SharePoint usage and adoption through ease of use. You can see more on machine learning in the videos below:
Document Analytics, Accuracy and Extraction – with legacy OCR, extracting the information you need can be problematic at best. How do you raise confidence that the information you have is accurate? With an analysis engine, we look not just at the text, but where it sits, what surrounds it, and known patterns or libraries. This added layer provides the ability to express higher confidence in data extraction, and ensures you are putting the right data into SharePoint.
This was just a quick overview of the benefits from moving away from legacy OCR, and embracing OCR 2.0 for SharePoint. Thoughts?
Web Services for OCR, Data Extraction and Document Classification
Continuing on my themes of open, web-service-enabled document capture and analytics, as well as the notion of “In App” document capture through APIs, I thought I would share a demo one of our fantastic SEs built to show the automation capabilities within Salesforce. It shows background document classification and extraction, all initiated through a file upload in Salesforce, and leverages OCR technology and our machine learning algorithms to auto-populate data in Salesforce.
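For a rough sense of how such an integration might hang together, here is a minimal Python sketch of the last step: taking the JSON a capture web service might return and merging it into a CRM record. The response shape, field names and CRM column names are all hypothetical, invented for illustration; they are not Ephesoft’s or Salesforce’s actual API:

```python
import json

# Hypothetical response from a capture web service that has classified
# a document and extracted its fields (shape invented for illustration).
SAMPLE_RESPONSE = json.dumps({
    "documentType": "Invoice",
    "fields": {"InvoiceNumber": "INV-1042", "Total": "1,250.00"},
})

def apply_to_record(record, response_json):
    """Merge extracted fields into a CRM record dict and tag the
    classified document type."""
    payload = json.loads(response_json)
    record = dict(record)  # copy; don't mutate the caller's record
    record["Document_Type__c"] = payload["documentType"]
    for name, value in payload["fields"].items():
        record[name] = value
    return record

updated = apply_to_record({"Id": "0015000000XyZ"}, SAMPLE_RESPONSE)
print(updated["Document_Type__c"])  # Invoice
```

In a real deployment this merge would run server-side, triggered by the file upload, so the end user never sees the capture layer at all.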
Document Capture Through Intelligent Learning and Analytics
One of our regional reps produced this video to help show how we differ from other document capture and analytics platforms on the market. This is a great expansion of one of my earlier posts – Analytics and Document Capture – Why it Matters. The video gives a great overview of the many dimensions of a document, and how Ephesoft leverages its patented technology to enhance accuracy, analyze large volumes of documentation, and process unstructured information.
In the world of document capture and analytics, our typical value proposition is around efficiency, reduction in required headcount and the reduction in turnaround time. Of course, there is true value and cost savings for any organization processing a significant volume of documents if you focus on these value points. Lately, we have been having some great conversations both internally and externally on the true cost of errors in data entry, and I wanted to dig deep into my past, and present a key topic for discussion.
Back in my Navy days, I found myself in the center of a focus on quality, and we had morphed Deming’s Total Quality Management (TQM) into a flavor that would serve us well. In a nutshell, it was an effort to increase quality through a systematic analysis of process, and a continuous improvement cycle. It focused on reducing “defects” in process, with the ultimate goal of eliminating them altogether. Defects impose a high cost on the organization, and can lead to failures across the board. Today, all these concepts can be applied to the processing of documents and their associated data. What is the true value of preventing defects in data?
In my education in this topic, I remember a core concept on quality, and defects: the 1-10-100 rule.
The rule gives us a graphic representation of the escalating cost of errors (or failures), with prevention costing $1, correction $10 and failure $100. So, in terms of data:
Prevention Cost – Preventing an error in data at the point of extraction will cost you $1.
Correction Cost – Having someone correct an error post extraction will cost you $10.
Failure Cost – Letting bad data run through a process to its end resting place will cost you $100.
So, an ounce of prevention is worth a pound of cure. In this case, lacking technology to prevent data errors will cost the business 100x the cost of acquiring an automated technology that prevents those errors in the first place.
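The rule is easy to put into numbers. Here is a toy cost model following the 1-10-100 weights above (the document volume, error rate and catch rates are made up for illustration):

```python
def process_cost(n_docs, error_rate, caught_at_extraction, caught_in_review):
    """Rough 1-10-100 cost model: errors prevented at extraction cost $1,
    errors corrected in review cost $10, and failures that reach the end
    of the process cost $100."""
    errors = n_docs * error_rate
    prevented = errors * caught_at_extraction
    corrected = errors * caught_in_review
    failed = errors - prevented - corrected
    return prevented * 1 + corrected * 10 + failed * 100

# 10,000 documents, 5% error rate: catch 80% at extraction, 15% in review
print(process_cost(10_000, 0.05, 0.80, 0.15))  # 3650.0
```

Shift those same 500 errors downstream (say, catch nothing until failure) and the bill becomes $50,000, which is the whole argument for prevention.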
In document capture today, we focus on the top rung of the pyramid: prevention. Below are the core benefits of an intelligent capture platform:
Errors in data can be prevented through the automated extraction of data, setting of business rules, and the elimination of hand-keying data.
Existing data sources in the organization can be used to enhance prevention, and ensure data validation through the use of Fuzzy DB technology.
Adding review and validation capabilities prevents bad data from slipping through the process and contaminating your data store. This is invaluable, and prevents the ripple effect bad information can have on a process and the organization as a whole.
With machine learning technology, if correction is required, the system learns, and can prevent future corrections, reducing costs.
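To make the Fuzzy DB idea above concrete, here is a minimal sketch using Python’s standard-library difflib as a stand-in for a real fuzzy-matching engine; the vendor list and similarity threshold are invented for illustration:

```python
import difflib

# Stand-in for an existing data source, e.g. a vendor master table.
KNOWN_VENDORS = ["Acme Corporation", "Globex Inc", "Initech LLC"]

def validate_vendor(ocr_value, cutoff=0.8):
    """Validate a noisy OCR'd vendor name against an existing data
    source; returns the canonical name, or None if nothing is close
    enough and the value should go to human review."""
    matches = difflib.get_close_matches(ocr_value, KNOWN_VENDORS,
                                        n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(validate_vendor("Acme Corporatlon"))  # OCR misread 'i' as 'l'
```

A correction here costs $1, not $10: the bad value never enters the downstream system, and only genuinely unrecognizable values reach a human.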
The benefits of intelligent document capture are well documented, and its impact on efficiency can be a quick win for any business processing a decent volume of documents. Capture 2.0 adds a new dimension of automation, and provides capture functionality to any application or system. So now, the ERP, CRM or Document Management System can be capture enabled through its own interface, without the need to switch windows, open a new application or “send files” to a processing location. This “hidden” automation layer requires no end-user expertise, and as far as they know, they are doing business as usual. Below are some core benefits to leveraging Capture 2.0 Web Services:
Minimal Impact to Operations – Current process: the end-user uploads a received file through the CRM interface, and then enters some notes and metadata about the file: customer name, date of contract, salesperson, and region. Capture 2.0 enabled process: the end-user uses the same upload process, but in the background, the hidden capture automation layer classifies the document, extracts all the data and enters it into the CRM automatically. This is the power of document capture web services. With no impact to current process or operations, you gain efficiency and reduce errors, driving faster transactions, shorter response times and less costly fixing of erroneous data.
Low to No Training – because end users are using the applications they use every day, there is almost zero training required. That’s the beauty of the hidden capture layer. If deployed correctly, users just perform their process the same way, and manual data entry and steps are eliminated. The lack of a training requirement minimizes any lost work days spent on costly training, and provides value starting day one.
Solve the Plague of Windows – more and more, IT and Business staff alike are looking to streamline and consolidate, and reduce the number of applications required to do business. Creating that single interactive interface, or that single source of truth, is the end goal. And a web services automation layer can provide functionality that would normally require the addition of an interactive app.
Maximum Efficiency – it is said that 50% of document-intensive process labor is spent on 5% of your documents. Why? The cost of problems and fixing errors. Automating any data extraction and document classification process, coupled with data validation techniques, can reduce errors to almost zero and drive the maximum efficiency possible in your organization.
Just a brief note on some thoughts and trends I am seeing in the marketplace. Thoughts? If you want to see Capture 2.0 Web Services in action, take a peek at Ephesoft Transact Web Services API.
Leveraging Intelligent Capture to Break Down Repository Silos
Every organization has them, in both its technical realm and its organizational/departmental structure: the old silo. But the elephant in the room is usually the document repository, that terabyte nightmare no one wants to address for fear of what lies within. Compounding the issue is the fact that most organizations have numerous document silos, usually the result of years of acquisitions, changing technical staff with new ideas, or new line-of-business systems that house their own documents. Repository silos usually take one of the following forms:
The File Share – when was the last time someone looked at that behemoth? Usually laden with layer upon layer of departmental and personal folder structures, a complete lack of file naming standards, and a plethora (my $2 word 😉 ) of file types. They continue to be backed up, and most IT departments that initiate cleanup projects find a minefield, and departments that are fearful to purge anything.
The Legacy ECM System – Hey, who manages Documentum/FileNet now? Do I continue to put items in X? How do I change the metadata in Y? As time goes on, legacy Enterprise Content Management becomes a huge burden on IT staff, imposing massive costs for maintenance, support and development. Many of these systems were put in place a decade or so ago, and file tagging and metadata needs have changed, with users struggling to find what they need through standard search. Some of these systems have simply become expensive file shares, due to a lack of required functionality or unsupported features.
The Line of Business Repository – just about every system nowadays has a “Document Management” plugin: the Accounting System that stores invoices, the Human Resource Info System that houses employee documents, or the Contracts Management system in legal that maintains contracts. These “niche” systems have created document sprawl within organizations, and a major headache for IT staff.
The SharePoint Library – The SharePoint phenomenon hit pretty hard over the last 8 or so years, and most organizations jumped on the train. Although most organizations we see in the field did not truly standardize on SharePoint as their sole repository, many started using it for niche solutions focused on departmental document needs. Now, many years into their usage, they have massive content databases housed on expensive storage.
The New Kids on the Block – now enter the new kids: Alfresco, Dropbox, Box, OneDrive and Google Drive. New is a relative term here, but organizations now have broad and extensive content on cloud-based, file-sync technologies. Spanning personal and business accounts, these technologies have created new silos and management challenges for many organizations.
So, how can we leverage intelligent document capture and analytics to break down silos and make life easier? Here are some core “silo breaking” uses:
Data Capture and Extraction – for projects where you want to “peer” into that document repository and extract and/or analyze the content, there are two solutions. Intelligent capture applications, like Ephesoft Transact, can consume repository content, classify document types, and extract pertinent data. Transact has a whole set of extraction technologies that can pull out valuable unstructured data:
Key Value Extraction – this method can parse document contents for information. Take for example a repository of patient records where you want to glean patient name and date of birth. This technology will look for patterns, and pull out required data.
Paragraph Extraction – let’s say you want to find a specific paragraph, perhaps in a lease document, and then extract important information. You can easily identify paragraphs of interest across differing documents, and get what you need.
Cross-section Extraction – say you want to process 10 years of annual reports and pull off a specific bit of data from a table. Say the liabilities number from the financial section. You can specify the row and column header, and pluck just what you need.
Table Extraction – what if you want to extract tables of data from within a repository of documents? Take for example lab results from a set of medical records. You can extract the entire table across thousands of reports and export it to a database.
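The cross-section idea above can be sketched in a few lines, assuming the table has already been OCR’d into rows; the annual-report figures here are invented for illustration:

```python
def cross_section(table, row_label, col_label):
    """Pull one cell from a parsed table by its row and column headers,
    mimicking cross-section extraction. The table is a list of rows,
    with the first row holding the column headers."""
    header = table[0]
    col = header.index(col_label)
    for row in table[1:]:
        if row[0] == row_label:
            return row[col]
    return None

annual_report = [
    ["Item",        "2019",  "2020"],
    ["Assets",      "1,200", "1,350"],
    ["Liabilities", "400",   "380"],
]
print(cross_section(annual_report, "Liabilities", "2020"))  # 380
```

Because the lookup is driven by the headers rather than fixed coordinates, the same specification works across ten years of reports even when the table moves around on the page.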
Managed Migration from X to Y – we are seeing the desire to consolidate repositories and drive “scrubbed” content to a new, central repository. Through advanced document capture, you can consume content from any of the above sources, reclassify, extract new metadata and confirm legacy data as you migrate to a new location.
Single Unified Capture Platform – providing a single, unified platform that can tie into all your existing repositories can save money, and add a layer of automation to older, legacy capture and scanning technology. This repository “spanning” strategy provides a single path for documents which enhances reporting, provides powerful audit capabilities, and minimizes support costs and IT management burden.
Advanced Document Analytics (DA) – with the advent of document analytics platforms, like Ephesoft Insight, you can make that repository useful from a big data perspective through supervised machine learning. These platforms take the integration of capture and analytics to the next level, and provide extensive near real-time processing. DA is focused on processing large volumes of documents and extracting meaning, where there seems to be absolutely no structure. So you can point, consume and analyze any repository for a wide variety of purposes. You can read some great use cases for this technology here: Notes From the Field: What’s Hiding in Your Documents.
Just a quick brain dump on breaking down silos with intelligent document capture and analytics. Thoughts? Did I miss anything?
Email Attachments to a SharePoint Online Document Library
This is a demo request I had from a partner, to show the ability of Ephesoft to pick up email attachments, separate and classify them, and then extract data. The final resting place is a SharePoint Online document library, making for a seamless email-to-SharePoint Online solution.
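The first step of such a flow, pulling attachments out of a message, can be sketched with Python’s standard email library. The message here is built in place rather than fetched from a mailbox, and the classification and SharePoint upload steps are not shown:

```python
from email.message import EmailMessage

def attachments(msg):
    """Yield (filename, bytes) for each attachment in a parsed email,
    ready to hand off to a capture workflow."""
    for part in msg.iter_attachments():
        yield part.get_filename(), part.get_payload(decode=True)

# Build a sample message in place of one pulled from a monitored inbox.
msg = EmailMessage()
msg["Subject"] = "Invoices"
msg.set_content("See attached.")
msg.add_attachment(b"%PDF-1.4 ...", maintype="application",
                   subtype="pdf", filename="invoice.pdf")

for name, data in attachments(msg):
    print(name, len(data))
```

From there each attachment would be fed through separation, classification and extraction before landing in the document library.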
Our SEs built out some really cool demos for our Innovate Conference in October. One of our core focus areas was Microsoft SharePoint scanning and capture, and enabling workflow processes with web services. The video below shows our SharePoint integration working with Nintex workflow.
You can read about our Ephesoft Transact 4.1 release here: