Machine Learning and Distributed Document Capture and Scanning

Machine Learning for Copier Scanning

Using Copiers to Machine Learn Documents

I have been working with several of our MFP/Copier partners, and wanted to put together a video demo on how to use copiers to train Ephesoft’s machine learning engine.  This demo shows how you can use our document analytics engine to train the system on HR documents.



Document Analytics: Machine Learning and Document Dimensions Video

Document Analytics and Machine Learning

Document Capture Through Intelligent Learning and Analytics

One of our regional reps produced this video to help show how we differ from other document capture and analytics platforms on the market.  It expands on one of my earlier posts, Analytics and Document Capture – Why it Matters.  The video gives a great overview of the many dimensions of a document, and how Ephesoft leverages its patented technology to enhance accuracy, analyze large volumes of documentation, and process unstructured information.

Capture, Data Quality and the 1-10-100 Rule

The True Cost of Bad Data

In the world of document capture and analytics, our typical value proposition is around efficiency, reduction in required headcount and the reduction in turnaround time.  Of course, there is true value and cost savings for any organization processing a significant volume of documents if you focus on these value points.  Lately, we have been having some great conversations both internally and externally on the true cost of errors in data entry, and I wanted to dig deep into my past, and present a key topic for discussion.

Back in my Navy days, I found myself in the center of a focus on quality, and we had morphed Deming’s Total Quality Management (TQM) into a flavor that would serve us well.  In a nutshell, it was an effort to increase quality through a systematic analysis of process, and a continuous improvement cycle.  It focused on reducing “Defects” in process, with the ultimate goal of eliminating them altogether.  Defects impose a high cost on the organization, and can lead to failures across the board.  Today, all these concepts can be applied to the processing of documents and their associated data.  What is the true value of preventing defects in data?

From my education on this topic, I remember a core concept about quality and defects: the 1-10-100 rule.

Document Capture and OCR

The rule gives us a graphic representation of the escalating cost of errors (or failures), with prevention costing $1, correction $10 and failure $100.  So, in terms of data:

  • Prevention Cost – Preventing an error in data at the point of extraction will cost you $1.
  • Correction Cost – Having someone correct an error post extraction will cost you $10.
  • Failure Cost – Letting bad data run through a process to its end resting place will cost you $100.

So, an ounce of prevention is worth a pound of cure.  By this rule, an error that runs all the way through the process costs the business 100x what it would have cost to prevent it at the point of extraction with automated technology.
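
To make the arithmetic concrete, here is a minimal sketch of the rule applied to a batch of errors.  The dollar figures come straight from the rule; the batch size of 1,000 and the function name are made up for illustration and are not part of any Ephesoft product.

```python
# Unit costs from the 1-10-100 rule (dollars per error).
PREVENTION_COST = 1
CORRECTION_COST = 10
FAILURE_COST = 100


def error_cost(prevented: int, corrected: int, failed: int) -> int:
    """Total cost when errors are handled at each of the three stages."""
    return (
        prevented * PREVENTION_COST
        + corrected * CORRECTION_COST
        + failed * FAILURE_COST
    )


# Hypothetical batch: the same 1,000 errors, handled at different stages.
all_prevented = error_cost(prevented=1000, corrected=0, failed=0)
all_corrected = error_cost(prevented=0, corrected=1000, failed=0)
all_failed = error_cost(prevented=0, corrected=0, failed=1000)

print(all_prevented, all_corrected, all_failed)  # 1000 10000 100000
```

The same 1,000 errors cost $1,000 when prevented, $10,000 when corrected after extraction, and $100,000 when they reach the end of the process.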

In document capture today, we focus on the top rung of the pyramid, and on prevention.  Below are the core benefits of an intelligent capture platform:

  • Errors in data can be prevented through the automated extraction of data, setting of business rules, and the elimination of hand-keying data.
  • Existing data sources in the organization can be used to enhance prevention, and ensure data validation through the use of Fuzzy DB technology.
  • Adding review and validation capabilities prevents bad data from slipping through the process and contaminating your data store.  This is invaluable and prevents the ripple effect bad information can have on a process and the organization as a whole.
  • With machine learning technology, if correction is required, the system learns, and can prevent future corrections, reducing costs.
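
The idea of validating extracted values against existing data can be sketched with Python's standard difflib module.  This is only an illustration of fuzzy matching in general, not how Ephesoft's Fuzzy DB feature is implemented; the vendor names, the function name and the 0.8 cutoff are all assumptions.

```python
from difflib import get_close_matches

# Hypothetical vendor names, standing in for rows from an existing ERP table.
KNOWN_VENDORS = [
    "Acme Industrial Supply",
    "Northwind Traders",
    "Globex Corporation",
]


def validate_vendor(extracted, cutoff=0.8):
    """Return the closest known vendor, or None if nothing is similar enough."""
    matches = get_close_matches(extracted, KNOWN_VENDORS, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# A typical OCR slip: lowercase "l" read in place of capital "I".
print(validate_vendor("Acme lndustrial Supply"))
# A value with no close counterpart in the table is left for a person to review.
print(validate_vendor("Unknown Vendor LLC"))
```

A match above the cutoff lets the system correct the value on its own; a miss is a signal to route the document to review rather than pass bad data along.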

See more features for ensuring high-quality document processing and data extraction here:  Ephesoft Document Capture and Data Extraction.

Just some thoughts…more to come on this topic.


Ephesoft Transact 4.1: New Document Capture Features

OCR, Scanning and Capture Features

Intelligent Document Capture and Scanning

Ephesoft has just released version 4.1 of our advanced capture platform, with a ton of new features.  Below is just a quick list; you can watch the video below for more details:

Accuracy in Capture Enhancements

  • Enhanced Interactive Machine Learning
  • Paragraph Data Extraction
  • Multi-dimensional Classification
  • Enhanced Table Extraction
  • Cross Section Data Extraction
  • Progressive Barcode Reader

Productivity in Capture Enhancements

  • Auto-regex Creation
  • Line Item Matching (ERP integration)
  • Fuzzy Database Redesign
  • Format Extraction

Connectivity in Capture Enhancements

Security in Capture Enhancements

  • HTML 5 Web Scanner Service
  • Cluster Configuration Enhancements
  • Data Encryption in Linux
  • Single Sign On – SAML v2
  • PIV/CAC Authentication

Video overview of features:


Machine Learning Part Deux: ERP Data Intelligence

Document processing machine learning

Using Existing Data to Drive System Intelligence and Automation

This is Part II in a series of videos showing Ephesoft’s Document & Process Machine Learning capabilities (Part I here: Machine Learning and Data Extraction).  In this video, I will show how you can add intelligence to any document capture process by having the system learn from external data tables.   This allows for leveraging pre-existing ERP and financial system information to make the Ephesoft System smarter.



Data Extraction and Machine Learning

Document Capture and OCR

Intelligent Document Capture, OCR and Machine Learning

In my previous post, I outlined some of the premises of machine learning in document capture, and how it can drive unseen levels of efficiency and productivity (See here: Rise of the Machines: Machine Learning and Capture).   I always like to follow up with a video.  This is the first of two videos focusing on Ephesoft’s Machine Learning.  This one is focused on end-user input driving document understanding for improved data extraction and less setup time.

Rise of the Machines: Machine Learning, Capture and Analytics

Document Automation and Machine Learning

Teaching Machines to Understand Documents

I remember when I first started out in the document capture and ECM world, I was sitting across from a CIO, presenting our technology, and he started asking pointed questions about configuration and services.  We talked for about 15 minutes, and he stopped, and I could see the gears were turning.  He looked at me and said: “Why do you guys make it so damn hard?”  I looked at him and said, “What do you mean?”  He responded with: “Why all the configuration and setup time?  Why can’t it just understand my documents, and what I am trying to accomplish?  I know that current technology is capable.”  At that time, the trend in the industry was a heavy reliance on regular expressions, basically a pattern matching language that originated in 1956, born through mathematical theory.  So essentially, the CIO hit the nail on the head: We were using 1950s math theory to provide automation and value, but it came with a deep cost in the form of expertise and services.  So here we are 10 years later, and the majority of the industry still uses that same method to analyze, classify and extract data.
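
To give a feel for that configuration burden, here is a small sketch of the regular-expression approach using Python's re module.  The pattern, the field and the sample strings are hypothetical, written only to show how a hand-tuned rule handles one layout and misses the next.

```python
import re

# Hypothetical hand-tuned pattern for one invoice-number layout.
INVOICE_NUMBER = re.compile(
    r"Invoice\s*(?:No\.?|Number|#)\s*:?\s*([A-Z]{2,4}-\d{4,8})"
)


def find_invoice_number(text):
    """Return the invoice number if the text fits the expected layout."""
    match = INVOICE_NUMBER.search(text)
    return match.group(1) if match else None


# The layout the pattern was written for is matched...
print(find_invoice_number("Invoice No: INV-10423"))
# ...but a vendor with a different layout needs yet another pattern.
print(find_invoice_number("Inv. ref 10423/INV"))
```

Every new document layout means another pattern to write, test and maintain, which is where the expertise and services cost comes from.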

Rise of the Machines

In the document automation space, we typically present a magic world to the end users, one where they just hit the button, or upload their document, and stuff just happens “automagically”.  But in reality, behind the scenes, there was a lot of work to get to that point, with the burden placed on IT in the form of education, configuration, service costs and testing.  Machine learning strives to eliminate that burden through simple efforts to train the system, and I think the goal, although lofty, is to reduce or eliminate configuration to a point that any user can create a workable system.

So, in document capture, what can machine learning provide?  In modern document automation technologies, like Ephesoft’s Capture Platform, machine learning can be leveraged in several ways:

  • Classifying Documents – If I had Ephesoft back in the day, I could have really made an impression on the CIO.  With Ephesoft’s training interface, I can take my different types of documents and train the system.  As I drag and drop new types of documents into the system, it “learns” all the nuances of the document.  It understands the structure, the words, their proximity, typeface and other information and uses that as key identifiers in the process.  For more on the extent of our Document Analytics/Analysis engine, see this post: Document Analysis, Analytics and Capture.
  • Intelligence & Confidence – Just like people, good machines know when to ask questions or admit when they are wrong.  In a machine learning environment, having a mechanism to ask questions is key.  In document capture, this comes in the form of an established confidence level, and a voting algorithm that can call attention to documents or data in question.  When these questions are answered, the machine gets smarter, and learns.
    [Figure: OCR confidence levels.  Caption: Confidence Flagging Aids in Understanding]
  • Gathering Information – Just as we learned through experience growing up, the machine needs to learn from every interaction.  Any form of human input needs to add to understanding, and overall document intelligence.  Click on a missed piece of data, and now the system knows its location, and its format.  It also knows the proximity of other words, and has an enhanced understanding of new dimensions of that document.
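
The confidence mechanism described in the list above can be sketched in a few lines.  This is a simplified illustration under my own assumptions, not Ephesoft's implementation: the 0.85 threshold, the field names and the ExtractedField type are all hypothetical, and the voting algorithm is left out.

```python
from dataclasses import dataclass

# Hypothetical threshold; a real deployment would tune this per field.
REVIEW_THRESHOLD = 0.85


@dataclass
class ExtractedField:
    name: str
    value: str
    confidence: float  # 0.0 to 1.0, as reported by the extraction step


def needs_review(fields):
    """Return the fields whose confidence falls below the review threshold."""
    return [f for f in fields if f.confidence < REVIEW_THRESHOLD]


fields = [
    ExtractedField("invoice_number", "INV-10423", 0.97),
    ExtractedField("total", "1,204.50", 0.62),
]

# Only the low-confidence field is put in front of a person.
for flagged in needs_review(fields):
    print(f"Review {flagged.name}: {flagged.value!r} ({flagged.confidence:.0%})")
```

Fields above the threshold flow straight through, while the flagged ones become the questions the system asks, and the answers are what it can learn from.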

These are just a few examples of machine learning, and what it brings to the document capture industry.  More to come when we release our new version, 4.1, at our conference next week.