Notes From The Field: What’s Hiding in Your Documents?


Advanced Capture for Document Mining, Extraction and Analytics

I wanted to write a post about some trends we are seeing in the market, mostly focused on leveraging intelligent document capture (Ephesoft) to mine existing document repositories. So what constitutes a repository? It could be 100,000 scanned TIFFs in a network folder. It could be a legacy document management system like Documentum housing terabytes of documents. Or, as in many larger organizations, it could be a sprawling set of 10 separate repositories that span acquisitions, offices and countries. With content growing exponentially, organizations are quickly realizing that this information can be a treasure trove, or it can be hiding something sinister that needs to be identified.

So what are the key use cases and industries? Here are two:

Financial Services – Anti-money Laundering (AML) – A number of regulations govern how financial institutions detect and report the flow of “dirty money” in and out of their institutions. The Bank Secrecy Act has been around since the 1970s, but it was amended with some key requirements through the Patriot Act, with a focus on terrorism and its funding. The onus is on financial firms to quickly identify, track and report suspicious transactions or face massive fines. Much of this data lives in documents, and finding and extracting this critical information can be impossible without the right technology. How do you tie new account ID information to another account opened and closed 3 years ago when all you have is a scan of a passport/ID and the original new account form as a scanned PDF? It gets more complex with trade-based money laundering, where several red flags require evaluation of documents, such as:

  • Payments to a vendor by unrelated third parties
  • False reporting, such as commodity misclassification or commodity over- or under-valuation
  • Repeated importation and exportation of the same high-value commodity, known as carousel transactions
  • Commodities being traded that do not match the business involved
  • Unusual shipping routes or transshipment points
  • Packaging inconsistent with the commodity or shipping method
  • Double-invoicing (list from
An Example of Trade-based Money Laundering (Image from CTTS Office)

As you can imagine, you need all the components of an advanced capture and classification engine to identify key documents, extract core data, and place that information into an analytics engine for processing.
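Once capture has turned scanned invoices into structured records, a couple of the red flags listed above reduce to simple checks. Here is a minimal Python sketch, assuming hypothetical field names (doc, invoice_no, vendor, payer) and a hypothetical list of expected payers per vendor. None of this reflects an actual Ephesoft output format, and a real AML analytics engine would apply far richer rules.

```python
from collections import defaultdict

def _norm(name):
    """Normalize a party name so OCR casing/spacing differences do not hide matches."""
    return " ".join(name.lower().split())

def find_double_invoicing(invoices):
    """Group records that share a vendor and invoice number; return groups of 2+.

    Treating a repeated invoice number as a proxy for double-invoicing is an
    assumption made for this sketch.
    """
    groups = defaultdict(list)
    for inv in invoices:
        key = (_norm(inv["vendor"]), inv["invoice_no"].strip().upper())
        groups[key].append(inv)
    return [group for group in groups.values() if len(group) > 1]

def find_third_party_payments(invoices, known_payers):
    """Return invoices whose payer is not an expected payer for that vendor.

    known_payers maps a normalized vendor name to a set of normalized payer names.
    """
    flagged = []
    for inv in invoices:
        expected = known_payers.get(_norm(inv["vendor"]), set())
        if _norm(inv["payer"]) not in expected:
            flagged.append(inv)
    return flagged

# Made-up records standing in for data extracted from scanned invoices.
invoices = [
    {"doc": "scan_001.pdf", "invoice_no": "INV-100",
     "vendor": "Acme Metals", "payer": "Globex Trading"},
    {"doc": "scan_047.pdf", "invoice_no": "inv-100",
     "vendor": "ACME  Metals", "payer": "Globex Trading"},
    {"doc": "scan_102.pdf", "invoice_no": "INV-233",
     "vendor": "Acme Metals", "payer": "Unrelated Holdings Ltd"},
]
known_payers = {"acme metals": {"globex trading"}}

for group in find_double_invoicing(invoices):
    print("Possible double-invoicing:", [inv["doc"] for inv in group])
for inv in find_third_party_payments(invoices, known_payers):
    print("Possible third-party payment:", inv["doc"])
```

The point of the sketch is that the hard part is upstream: these checks are only possible once classification and extraction have produced clean vendor, payer and invoice number fields from the scans.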

Healthcare – The Quest for a Cure – Imagine the value of being able to go back and consume 30 years of cancer patient lab reports: size of tumors, type of treatment, type of cancer, and all the metabolic information. The challenge is that the majority of patient records created before the rise of the Electronic Medical Record (EMR) still exist only on paper. These lab reports are buried deep in the mess of the typical medical record. What if you could process it all, automatically identify all the lab reports and pull out everything you need to map trends and results? What if you could easily identify and extract the typical lab report table?

What if the paper lab report became data?

We have some customers today processing records for this very purpose.
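To make the idea of a lab report becoming data a little more concrete, here is a minimal Python sketch that turns OCR text from a lab report table into records. The row layout (test name, result, units, reference range) and the sample values are assumptions made for illustration. Real reports vary widely, and a capture product would handle table detection itself rather than rely on a single regular expression.

```python
import re

# Assumed row layout: <test name> <numeric result> <units> <low> - <high>
ROW = re.compile(
    r"^(?P<test>[A-Za-z][A-Za-z .%/-]*?)\s+"
    r"(?P<value>\d+(?:\.\d+)?)\s+"
    r"(?P<unit>\S+)\s+"
    r"(?P<low>\d+(?:\.\d+)?)\s*-\s*(?P<high>\d+(?:\.\d+)?)\s*$"
)

def parse_lab_table(ocr_text):
    """Return one dict per line that looks like a lab result row; skip everything else."""
    rows = []
    for line in ocr_text.splitlines():
        match = ROW.match(line.strip())
        if match is None:
            continue  # headers, titles and free text do not fit the row pattern
        value = float(match["value"])
        low = float(match["low"])
        high = float(match["high"])
        rows.append({
            "test": match["test"].strip(),
            "value": value,
            "unit": match["unit"],
            "low": low,
            "high": high,
            "in_range": low <= value <= high,
        })
    return rows

# Made-up OCR output for illustration only.
sample = """\
PATIENT LAB REPORT
Test          Result  Units   Reference
Glucose       112     mg/dL   70 - 99
Hemoglobin    13.8    g/dL    12.0 - 15.5
Sodium        140     mmol/L  135 - 145
"""

for row in parse_lab_table(sample):
    print(row)
```

Once rows like these exist for decades of reports, mapping trends becomes an ordinary analytics problem instead of a paper problem.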


The list can go on and on:

  • Oil and Gas Land Leases
  • Invoice Analysis for Identifying Trends
  • Claims Analysis for Fraud Identification

Anything I missed?  Thoughts?