Capture, Data Quality and the 1-10-100 Rule

The True Cost of Bad Data

In the world of document capture and analytics, our typical value proposition centers on efficiency, reduced headcount requirements and faster turnaround times.  For any organization processing a significant volume of documents, focusing on these value points yields real cost savings.  Lately, we have been having some great conversations, both internally and externally, about the true cost of errors in data entry, and I wanted to dig into my past and present a key topic for discussion.

Back in my Navy days, I found myself at the center of a focus on quality, where we had morphed Deming’s Total Quality Management (TQM) into a flavor that served us well.  In a nutshell, it was an effort to increase quality through systematic analysis of process and a continuous improvement cycle.  It focused on reducing “defects” in a process, with the ultimate goal of eliminating them altogether.  Defects impose a high cost on an organization and can lead to failures across the board.  Today, all of these concepts can be applied to the processing of documents and their associated data.  What is the true value of preventing defects in data?

From my education on this topic, I remember a core concept regarding quality and defects: the 1-10-100 rule.

The rule gives us a graphic representation of the escalating cost of errors (or failures), with prevention costing $1, correction $10 and failure $100.  So, in terms of data:

  • Prevention Cost – Preventing an error in data at the point of extraction will cost you $1.
  • Correction Cost – Having someone correct an error post-extraction will cost you $10.
  • Failure Cost – Letting bad data run through a process to its final resting place will cost you $100.

So, an ounce of prevention is worth a pound of cure.  In this case, failing to prevent data errors up front will cost the business 100x the cost of acquiring an automated technology that can catch those errors at the source.
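To make the rule concrete, here is a minimal sketch of the arithmetic applied to a batch of extracted fields. The per-field dollar figures are simply the canonical 1-10-100 ratios, and the error rates are invented for illustration:

```python
# Illustrative model of the 1-10-100 rule applied to a document batch.
# Assumed figures: $1 per field to prevent an error at extraction,
# $10 to correct it downstream, $100 if it reaches the end of the process.

PREVENTION_COST = 1
CORRECTION_COST = 10
FAILURE_COST = 100

def batch_cost(fields: int, corrected_pct: float, failed_pct: float) -> int:
    """Total cost when some fields are caught late or not at all."""
    corrected = int(fields * corrected_pct)
    failed = int(fields * failed_pct)
    prevented = fields - corrected - failed
    return (prevented * PREVENTION_COST
            + corrected * CORRECTION_COST
            + failed * FAILURE_COST)

# 10,000 extracted fields: catching 5% late and missing 1% entirely
# costs well over twice what full prevention would.
print(batch_cost(10_000, 0.05, 0.01))  # -> 24400
print(batch_cost(10_000, 0.0, 0.0))    # -> 10000
```

Even with modest error rates, the late-correction and failure terms dominate the total, which is the whole point of the rule.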

In document capture today, we focus on the top of the pyramid: prevention.  Below are the core benefits of an intelligent capture platform:

  • Errors in data can be prevented through automated extraction, business rules and the elimination of hand-keying.
  • Existing data sources in the organization can be used to enhance prevention and ensure data validation through the use of Fuzzy DB technology.
  • Review and validation capabilities prevent bad data from slipping through the process and contaminating your data store.  This is invaluable, stopping the ripple effect bad information can have on a process and the organization as a whole.
  • With machine learning technology, if a correction is required, the system learns and can prevent future corrections, reducing costs.
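The fuzzy database idea above can be sketched in a few lines. This is not Ephesoft’s matching engine; it is an illustrative analogue using Python’s standard-library `difflib`, with an invented vendor master list, to show how an OCR’d value can be validated against existing data:

```python
# Illustrative sketch of fuzzy validation against a master data source.
# Uses stdlib difflib; the vendor list and cutoff are invented examples.
from difflib import get_close_matches

VENDOR_MASTER = ["Acme Corporation", "Globex Inc", "Initech LLC"]

def validate_vendor(extracted: str, cutoff: float = 0.8):
    """Return the best master-data match for an OCR'd vendor name,
    or None if nothing is close enough (flag for human review)."""
    matches = get_close_matches(extracted, VENDOR_MASTER, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(validate_vendor("Acme Corporaton"))  # OCR dropped a letter -> "Acme Corporation"
print(validate_vendor("Unknown Vendor"))   # -> None, route to a reviewer
```

The key design point is the two outcomes: a confident match is corrected automatically (prevention), and everything else is routed to review rather than allowed into the data store.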

See more features for ensuring high-quality document processing and data extraction here:  Ephesoft Document Capture and Data Extraction.

Just some thoughts…more to come on this topic.


Ephesoft Transact 4.1: New Document Capture Features

OCR, Scanning and Capture Features

Intelligent Document Capture and Scanning

Ephesoft has just released version 4.1 of our advanced capture platform, with a ton of new features.  Below is a quick list; you can watch the video below for more details:

Accuracy in Capture Enhancements

  • Enhanced Interactive Machine Learning
  • Paragraph Data Extraction
  • Multi-dimensional Classification
  • Enhanced Table Extraction
  • Cross Section Data Extraction
  • Progressive Barcode Reader

Productivity in Capture Enhancements

  • Auto-regex Creation
  • Line Item Matching (ERP integration)
  • Fuzzy Database Redesign
  • Format Extraction

Connectivity in Capture Enhancements

  • HTML 5 Web Scanner Service
  • Cluster Configuration Enhancements

Security in Capture Enhancements

  • Data Encryption in Linux
  • Single Sign On – SAML v2
  • PIV/CAC Authentication

Video overview of features:


Notes From The Field: What’s Hiding in Your Documents?

Document Capture OCR & Scanning

Advanced Capture for Document Mining, Extraction and Analytics

I wanted to write a post about some trends we are seeing in the market, mostly focused on leveraging intelligent document capture (Ephesoft) to mine existing document repositories.  So what constitutes a repository?  Well, it could be 100,000 scanned TIFFs in a network folder.  It could be a legacy document management system like Documentum housing terabytes of documents.  Or, as in many larger organizations, it could be a massive set of 10 separate repositories spanning acquisitions, offices and countries.  With content growing exponentially, organizations are quickly realizing that this information can be a treasure trove, or it can be hiding something sinister that needs to be identified.

So what are the key use cases and industries?  Here are two below:

Financial Services – Anti-Money Laundering (AML) – A number of regulations govern how financial institutions detect and report the flow of “dirty money” in and out of their institutions.  The Bank Secrecy Act has been around since the 1970s, but has been amended with some key requirements through the Patriot Act, with a focus on terrorism and its funding.  The onus is on financial firms to quickly identify, track and report suspicious transactions or face massive fines.  Much of this data resides in documents, and finding and extracting this critical information can be impossible without the right technology.  How do you tie new account ID information to another account opened and closed 3 years ago when all you have is a scan of a passport/ID and the original new account form in scanned PDF?  It gets more complex with trade-based money laundering, where several red flags require evaluation of documents, such as:

  • Payments to a vendor by unrelated third parties
  • False reporting, such as commodity mis-classification, commodity over- or under-valuation
  • Repeated importation and exportation of the same high-value commodity, known as carousel transactions
  • Commodities being traded that do not match the business involved
  • Unusual shipping routes or transshipment points
  • Packaging inconsistent with the commodity or shipping method
  • Double-invoicing

An Example of Trade-based Money Laundering (Image from CTTS Office)

As you can imagine, you need all the components of an advanced capture and classification engine to identify key documents, extract core data, and place that information into an analytics engine for processing.
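The cross-account matching question posed above can be sketched in miniature. Everything here (the field names, the normalization, the sample record) is invented for illustration; production AML entity resolution is far richer, but the core move is the same: normalize extracted identity fields so that documents captured years apart can be joined.

```python
# Hypothetical sketch: linking identity data extracted from a new-account
# document to historical account records. Fields and data are invented.
import re

def normalize(name: str) -> str:
    """Strip punctuation, collapse whitespace, upper-case for matching."""
    return " ".join(re.sub(r"[^\w\s]", "", name).upper().split())

HISTORICAL_ACCOUNTS = [
    {"name": "JOHN Q DOE", "dob": "1970-01-01",
     "account": "A-1001", "status": "closed"},
]

def find_prior_accounts(extracted_name: str, extracted_dob: str):
    """Match a freshly OCR'd name/DOB against historical records."""
    key = normalize(extracted_name)
    return [a for a in HISTORICAL_ACCOUNTS
            if normalize(a["name"]) == key and a["dob"] == extracted_dob]

# "John Q. Doe" from a scanned passport still finds the closed account.
print(find_prior_accounts("John Q. Doe", "1970-01-01"))
```

Without the capture step that turns the scanned passport and account form into structured fields, there is nothing to join on in the first place.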

Healthcare – The Quest for a Cure – Imagine the value of being able to go back and consume 30 years of cancer patients’ lab reports: size of tumors, type of treatment, type of cancer and all the metabolic information.  The challenge lies in the fact that the majority of patient records still exist in paper format, or at least those created prior to the rise of the Electronic Medical Record (EMR).  These labs are buried deep within the mess of the typical medical record.  What if you could process it all, automatically identify all the lab reports and pull out everything you need to map trends and results?  What if you could easily identify and extract the typical lab report table?

What if the paper lab report became data?

We have some customers today processing records for this very purpose.
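To give a flavor of what “the paper lab report became data” means, here is a toy sketch that parses an OCR’d lab table into records. The column layout, test names and values are invented; real reports vary widely, which is exactly the variance a capture engine is built to absorb:

```python
# Hypothetical sketch: turning an OCR'd lab-report table into records.
# The report text, layout and regex are invented for illustration.
import re

OCR_TEXT = """\
Test            Result   Units    Reference
Hemoglobin      13.5     g/dL     12.0-15.5
WBC             6.2      10^3/uL  4.0-11.0
"""

ROW = re.compile(r"^(\S+)\s+([\d.]+)\s+(\S+)\s+([\d.]+-[\d.]+)$")

def parse_lab_table(text: str):
    """Extract structured rows; lines that don't fit (e.g. the header)
    are simply skipped."""
    records = []
    for line in text.splitlines():
        m = ROW.match(line.strip())
        if m:
            test, result, units, ref = m.groups()
            records.append({"test": test, "result": float(result),
                            "units": units, "reference": ref})
    return records

for rec in parse_lab_table(OCR_TEXT):
    print(rec)
```

Once the rows are structured records rather than pixels on a scan, mapping trends across 30 years of reports becomes an analytics problem instead of a paper problem.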


The list can go on and on:

  • Oil and Gas Land Leases
  • Invoice Analysis for Identifying Trends
  • Claims Analysis for Fraud Identification

Anything I missed?  Thoughts?


Data Extraction and Machine Learning

Document Capture and OCR

Intelligent Document Capture, OCR and Machine Learning

In my previous post, I outlined some of the premises of machine learning in document capture, and how it can drive new levels of efficiency and productivity (see here: Rise of the Machines: Machine Learning and Capture).  I always like to follow up with a video.  This is the first of two videos focusing on Ephesoft’s machine learning.  This one is focused on end-user input driving document understanding for improved data extraction and less setup time.