There are typically 3 types of invoice scanning software solutions that companies implement. If you need a quick primer before you read on, see our terminology page here: Invoice Scanning Software. Below are the 3 types of solutions:
1. Scan Invoices and Extract the Vendor and Total with OCR – As a baby-step solution, many Accounts Payable departments look to start with a minimal-impact solution that adds some level of automation and efficiency. They scan their invoices with a copier or scanner, and pass them through an application like Ephesoft Transact for AP Invoices. Users can point and click in the interface to populate fields, manually type in information, or use ERP or accounting system lookups for their vendors, and then output their scanned invoices to structured folders, an accounting Microsoft SharePoint Library, or another content management system like Alfresco.
2. Capture All Invoice Header Information – The next step is usually to capture (with document analytics + OCR) all the scanned invoice header information. Here is the typical information:
Invoice Vendor Name
This information can be extracted automatically through data extraction, and modern solutions use an analytics algorithm to improve accuracy and reduce end-user interaction with the invoice scanning process. Once again, all accounts payable invoice information is sent to an end resting place, and typically in this level of solution, the information is sent to an accounting system automatically.
3. Scan to Capture Invoice Header and Line Item Data – Top tier invoice scanning applications provide the ability to extract all the information on the invoice automatically. They do all of the above, and also provide line item/table extraction of invoice data. This extracted information can be used to compare the invoice to the purchase order, and provide line item matching capability. In addition, all OCR invoice data can be routed to an ERP or accounting system for update.
There are tools, like Ephesoft Transact, that can provide all 3 invoice scanning solutions without the need to add additional modules or components.
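To make the line item matching from the third tier a little more concrete, here is a minimal sketch in Python. It is not how Ephesoft Transact or any particular ERP does it; the `LineItem` structure, the `match_line_items` function and the price tolerance are all assumptions made up for illustration. It simply compares extracted invoice lines against purchase order lines by SKU.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class LineItem:
    """One extracted line (hypothetical structure, not a vendor schema)."""
    sku: str
    quantity: int
    unit_price: Decimal


def match_line_items(invoice, purchase_order, price_tolerance=Decimal("0.01")):
    """Return a list of discrepancies; an empty list means a clean match."""
    po_by_sku = {item.sku: item for item in purchase_order}
    problems = []
    for item in invoice:
        ordered = po_by_sku.get(item.sku)
        if ordered is None:
            problems.append(f"{item.sku}: not on purchase order")
        elif item.quantity > ordered.quantity:
            problems.append(
                f"{item.sku}: billed {item.quantity}, ordered {ordered.quantity}"
            )
        elif abs(item.unit_price - ordered.unit_price) > price_tolerance:
            problems.append(
                f"{item.sku}: price {item.unit_price} differs from PO {ordered.unit_price}"
            )
    return problems


# Example data (invented for illustration).
po = [LineItem("A-100", 5, Decimal("9.99")), LineItem("B-200", 2, Decimal("4.50"))]
invoice = [
    LineItem("A-100", 5, Decimal("9.99")),
    LineItem("B-200", 3, Decimal("4.50")),
    LineItem("C-300", 1, Decimal("1.00")),
]
discrepancies = match_line_items(invoice, po)
```

An invoice with no discrepancies could be routed straight to the accounting system, while anything else goes to a person for review.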
I have been working with several of our MFP/Copier partners, and wanted to put together a video demo on how to use copiers to train Ephesoft's machine learning engine. This demo shows how you can use our document analytics engine and train it on HR documents.
I spent this week at the AppianWorld Conference with one of our great partners, GxP Partners. GxP has done some really cool things with Ephesoft Transact, leveraging our open capture platform and OCR & Extraction web services to create ApnCapture. The solution is tightly tied to a previous article I wrote, Document Capture+BPM, and below is a quick summary of the value of using intelligent document capture with any workflow tool:
With the rise of Document Capture as a Platform (CaaP), there is an enormous opportunity for organizations to leverage the power of capture as an intelligent document automation component to any business process or workflow solution. Here are the core use areas of document capture and automation with any Business Process Management System (BPMS):
1. The “Pre” – The logical fit is to use document capture software to “feed the beast”, or in other terms, as a front-end processor for inbound documents destined for workflows. You might ask, “Why? My BPM/Workflow solution has the capability to import documents.” Modern capture platforms add another dimension of automation through features like separation, document classification and automatic data extraction. Imagine a mortgage banking process where a single inbound PDF file houses 12 different document types. The power of capture is to auto-split the PDF, classify each document, extract information and then pass all of that in a neatly formatted package to the workflow engine. Now, the workflow has a second dimension of intelligence, and it can use that to branch, route and execute. Platforms like Ephesoft Enterprise have the ability to ingest documents from email, folders, legacy document management systems, fax and also legacy capture (like Kofax and Captiva).
2. Mid-stream – What about activities during the workflow? Ones that are necessary mid-process? This is where the true power of a “platform” comes into play, and it requires a web services API (See other requirements of a Capture Platform in this article: 6 Key Components of a Document Capture Platform). Some examples of activities that can be accomplished through a capture platform API in workflow:
Value Extraction – pass the engine a document and return extracted information.
Read Barcodes – pass the engine an image, and read and return the value of a barcode.
Classify a Document – pass a document and identify what it is.
Create OCR – pass a non-searchable PDF and return a searchable file.
As you can imagine, this can provide extreme customization in any process that requires document automation, and can reduce end-user input, create added efficiency, and once again add that second dimension of intelligence after the workflow has begun. You can see an extensive list of API operations here: Document Capture API Guide
3. The “Post” – Depending on the process and requirements, a “post-process” capture may be in order. Most capture platforms have extensive integrations with 3rd party ECM systems like SharePoint, Filebound and Onbase, and can be leveraged as an integration point to these systems. In addition, there is a new wave in the big data and analytics world, with a focus on data contained within documents. Routing documents and data to analytics repositories can help organizations glean important insight into their operations. If you choose a capture platform with a tied-in document analytics component, this can be accomplished automatically.
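As a rough illustration of how a workflow could use that "second dimension of intelligence" from the "Pre" stage, here is a minimal Python sketch. Everything in it is hypothetical: the `CapturedDocument` structure, the queue names and the confidence threshold are assumptions for the example, and none of it reflects the actual Ephesoft or Appian APIs. The idea is only that each split-and-classified document arrives with a type, a confidence and extracted fields, and the workflow branches on them.

```python
from dataclasses import dataclass, field


@dataclass
class CapturedDocument:
    """One document split out of an inbound file (hypothetical structure)."""
    doc_type: str      # label assigned by classification
    confidence: float  # classification confidence, 0.0 to 1.0
    fields: dict = field(default_factory=dict)  # extracted values


# Hypothetical mapping of document types to workflow queues.
QUEUES = {
    "W-2": "income-verification",
    "Appraisal": "property-review",
}


def route(doc, min_confidence=0.8):
    """Pick a workflow queue; low-confidence documents go to a person."""
    if doc.confidence < min_confidence:
        return "manual-validation"
    return QUEUES.get(doc.doc_type, "general-intake")


# A package of documents as it might arrive from the capture layer.
package = [
    CapturedDocument("W-2", 0.95, {"employer": "Example Corp"}),
    CapturedDocument("Appraisal", 0.55),
    CapturedDocument("Bank Statement", 0.91),
]
destinations = [route(doc) for doc in package]
```

The low-confidence branch corresponds to the review and validation step a capture platform would normally offer before data reaches the workflow.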
ApnCapture: Capture and OCR for Appian
So, how did GxP implement ApnCapture and integrate with Appian? Below is a series of screen shots as an overview from start to finish:
The capture process is initiated from any document source Ephesoft supports:
Those documents are processed, and if there are no confidence issues, they pass right through the process. If there are issues that require end-user correction or validation, users can access document batches through the ApnCapture Batch Report.
Clicking any line in the Appian-produced form interacts with Ephesoft Web Services to open a validation and review screen.
Extracted data is then sent into Appian and can be used for all types of purposes: adding intelligence to workflows, enhancing business rules with data, and leveraging documents for approval and review.
Finally, all the extracted data can provide a deeper view of any process that is capture enabled.
To find out more about Intelligent Capture and OCR for Appian, contact GxP Partners.
What is your privacy strategy for documents and content repositories?
The new General Data Protection Regulation (GDPR) is set to replace the older Data Protection Directive in the EU on May 25, 2018. This new roll out of privacy protections for EU nations has broad and expansive implications for any company within the realm of the EU, or those that process EU citizen information and data. Here is a summary of the major changes:
GDPR jurisdiction now applies to all organizations that process EU subject personal data, regardless of the
A breach of GDPR can draw fines of up to 4% of annual global turnover or 20M Euros (whichever is larger).
Consent when providing personal information must be clear and easy to understand.
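The "whichever is larger" wording in the fine ceiling is easy to misread, so here is the arithmetic as a small Python sketch. The function name is made up for illustration, and it only computes the upper limit described above; actual fines are set case by case by the authorities.

```python
def max_gdpr_fine_eur(global_turnover_eur: int) -> int:
    """Upper limit of a fine: 4% of global turnover or 20M Euros, whichever is larger."""
    four_percent = global_turnover_eur * 4 // 100  # integer math avoids float rounding
    return max(four_percent, 20_000_000)


# A company with 1 billion Euros in turnover: the 4% figure is the larger one.
large_company_cap = max_gdpr_fine_eur(1_000_000_000)

# A company with 100 million Euros in turnover: the 20M floor is the larger one.
small_company_cap = max_gdpr_fine_eur(100_000_000)
```

Note that for any organization with turnover under 500M Euros, the 20M figure is the one that applies as the ceiling.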
There is a set of core subject rights that apply, and below is a quick summary:
Breach Notification – any data breach requires notification within 72 hours.
Right to Access – subjects can request an electronic copy of all private data at any time.
Right to be Forgotten – aka Data Erasure, a subject at any time can request to have all private data removed from a controlling organization's systems.
Data Portability – subjects can request to have their information transferred to another organization at any time. This will go hand in hand with the “right to be forgotten”.
Privacy by Design – now a legal requirement, organizations must show proof of “…appropriate technical and organizational measures…” within any system or process.
Data Protection Officers (DPOs) – organizations will now require DPOs. This individual will be responsible for interfacing with EU nations and authorities, and will carry the heavy burden of responsibility for all data protection efforts.
So, with that quick outline, imagine the implications of millions of application documents with personal information that are breached. What about the accidental scan of medical records to an insecure document sync folder? Or the directory of millions of scanned documents that have a few documents with private information?
Organizations need a two-pronged approach to navigate the document minefield. To get this under control and mitigate risk, there are really two types of technologies that need to work hand in hand.
First, a document and content capture technology that works as an ingestion point for new content and existing document-centric processes. This form of enterprise input management can be placed as a non-invasive automation layer to flag/identify suspect content and provide reporting capabilities around private information for compliance. Once again, this is focused on day-forward transactions.
Second, a solution to crawl existing repositories to classify, extract and identify documents that pose a risk. This technology can work hand in hand with the transactional layer to build machine learning profiles, and establish analytical libraries of document and data profiles so the analytical side can become proactive and preemptive. This can be a critical step in identifying possible legacy documents that house private information that could be subject to GDPR fines.
So, where does Ephesoft fit? We have two products that span the transactional and analytical requirements to help organizations capture, classify, identify and visualize their documents in a broad sense, and comply with GDPR privacy rules.
Capture is the New Intelligent Document Transport Layer
As Enterprise infrastructure gets more and more complex, especially with a move to cloud content and line of business systems, organizations struggle with creating what I will call an “Intelligent Document Transport” layer. The ability to move documents from system to system and maintain data integrity and standardization is paramount to driving organizational efficiency. With 72% of larger organizations having 3 or more repositories, and 25% having 5 or more, allowing a seamless interchange of documents and data seems more like a dream than an actual reality. In addition, legacy, in-place capture systems just lack the modern web service oriented architecture to allow the adaptability and flexibility required to work with modern cloud infrastructure. These “Fat” client applications are often laden with complex, host-based SDKs and legacy code, requiring extensive development cycles and specialized skill sets to extend and integrate. Here are some of the core challenges in organizations lacking this transport/integration layer:
Lack of Document “Intelligence” – many organizations move documents throughout their systems as a closed entity. They may know it is a PDF or a Word document, and that it came from accounting, but beyond that, it is a digital mystery. They usually have limited data or information, and this usually requires human intervention or hand keying of info.
Lost in Translation – As documents move from department to department, person to person and system to system, things get lost in translation. Information may be misinterpreted, data may be lost or the interpretation may be different.
Lack of Standardization and Normalization – Without a standardized transport layer, problems begin to arise. Take this simple example: the difference in file naming. Maybe one department calls it W4, another W_4, and yet a third W-4. As documents flow back and forth between systems, think of the headaches this minor difference can create in reporting, workflow and overall system operation.
Unified Security – the ability for users and integrations to span the on-premise world and the cloud ether is reliant on complex authentication and authorization. In this day and age, having centralized reporting and audit capability on document transactions can be critical, and a single sign-on capability is required.
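The W4 / W_4 / W-4 naming problem above is the kind of thing a standardized layer can normalize in one place. Here is a minimal Python sketch of the idea. The `CANONICAL_NAMES` table and the `normalize_doc_type` function are hypothetical, just one simple way to do it: strip punctuation and case, then look the result up in a table of agreed names.

```python
import re

# Hypothetical table of agreed-upon names, keyed by the stripped-down form.
CANONICAL_NAMES = {
    "W4": "W-4",
    "W2": "W-2",
}


def normalize_doc_type(name: str) -> str:
    """Map department-specific spellings of a document type to one standard name."""
    # Drop everything except letters and digits, then upper-case.
    key = re.sub(r"[^A-Za-z0-9]", "", name).upper()
    # Unknown types fall back to the stripped key so they are at least consistent.
    return CANONICAL_NAMES.get(key, key)


variants = ["W4", "W_4", "W-4", "w 4"]
normalized = [normalize_doc_type(v) for v in variants]
```

With every variant reduced to the same value, reporting and workflow rules only ever have to deal with one name per document type.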
So, what is required to eliminate these challenges and create an efficient document transport layer to connect people, departments and systems? As I noted in my previous post about the new keys to digital transformation, companies are realizing the benefits of new age application architecture: open modular platforms, cloud adaptive technology, scale up and scale down, and rapid deployment. New age document capture and analytics platforms, like Ephesoft Transact, encompass these modern traits, and help create a smooth and efficient document transport layer through the following:
Bundling both the document and metadata in an intelligent “suitcase”. When documents enter the capture layer, they are immediately classified into document types, and appropriate data is extracted. All of this information travels with the document until it reaches its destination, and the document and data are translated into the required format.
Breaking down the barriers that exist between on-premise and cloud systems. With a platform built for cloud adaptation, there are now no barriers between systems in corporate data centers and cloud-based services. New age capture platforms can now reside anywhere, and inter-operate with all types of repositories and applications.
Creation of standard processing workflows and business rules. Creating repeatable processes that are standardized regardless of the user, device or system reduces errors and streamlines operations. Document processing becomes predictable, more efficient and agile when the need for change arises.
Security is enforced and an audit trail created. With a single system that is the epicenter of document traffic, all transactions can be tracked and logged. With authentication that spans all systems (through single sign-on), access can be granted to only the documents and systems that are in a user's security realm.
New age capture technologies break down the barriers that exist and create a “borderless” Enterprise, allowing the exchange of documents and their associated data to improve efficiency and productivity. Thoughts?
Using Adaptive OCR Technology & Analytics to Drive SharePoint Efficiency and Adoption
Optical Character Recognition technology, or OCR, has been around for quite some time. It really became mainstream back in the ’70s when a man named Ray Kurzweil developed a technology to help the visually impaired. He quickly realized the broad commercial implications of his invention, and so did Xerox, who purchased his company. From there, OCR experienced broad adoption across all types of use cases.
At its simplest, OCR is a means to take an image and convert recognized characters to text. In the Enterprise Content Management (ECM) world, it’s this technology that provides a broad range of metadata and content collection methods as documents are scanned and processed. Here are the basic legacy forms of OCR that can be leveraged with SharePoint:
Full Text OCR – converts the entire document image to text, allowing full text search capabilities. Using this OCR with SharePoint, documents are typically converted to an Image+Text PDF, which can be crawled, and the content made fully searchable.
Zone OCR – Zoning provides the ability to extract text from a specific location on the page. In this form of “templated” processing, specific OCR metadata can be extracted and mapped to a SharePoint column. This method is appropriate for structured documents that have the data in the same location.
Pattern Matching OCR – pattern matching is purely a method to filter, or match patterns within OCR text. This technique can provide some capabilities when it comes to extracting data from unstructured, or non-homogeneous documents. For example, you could extract a Social Security Number pattern (XXX-XX-XXXX) from the OCR text and map it to a SharePoint column.
These forms of OCR are deemed as legacy methods of extraction, and although they can provide some value when utilized with any document process that involves SharePoint, they are purely data driven at the text level.
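To show what pattern matching at the text level looks like in practice, here is a minimal Python sketch of the Social Security Number example above, using the standard `re` module. The sample text and the column name `EmployeeSSN` are invented for illustration; the sketch only filters OCR text for the XXX-XX-XXXX shape and has no idea whether the match really is an SSN, which is exactly the limitation of this legacy approach.

```python
import re

# Three digits, two digits, four digits, separated by hyphens (XXX-XX-XXXX).
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def extract_ssns(ocr_text: str) -> list:
    """Return every substring of the OCR text that matches the SSN pattern."""
    return SSN_PATTERN.findall(ocr_text)


# Sample OCR output (invented); the phone number should not match.
ocr_text = "Employee SSN: 123-45-6789  Phone: 555-123-4567"
matches = extract_ssns(ocr_text)

# Hypothetical mapping of the first match to a SharePoint column name.
column_values = {"EmployeeSSN": matches[0]} if matches else {}
```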
In steps OCR 2.0. Today, innovators like Ephesoft leverage OCR as the very bottom of their document analytics and intelligence stack. The OCR text is now pushed through algorithms that create meaning out of all types of dimensions: location, size, font, patterns, values, zones, numbers, and more (You can read about this patented technology here: Document Analytics and Why It Matters in Capture and OCR). So rather than just being completely data-centric, or functioning at the text layer, we now create a high-functioning intelligence layer that can be used beyond just text searching and metadata. And the best part? This technology has been extended to non-scanned files like Office documents. Examples? See below:
Multi-dimensional Classification – using that analysis capability (with OCR as algorithm input), and all the collected dimensions of the document, document type or content type can now be accurately identified. As documents are fed into SharePoint, they can be intelligently classified, and that information is now actionable with workflows, retention policies, security restrictions and more. You can see more on this topic in this video on Multi-dimensional Classification Technology: Machine Learning and Classification of Documents
Machine Learning – legacy OCR technology provided no means or method to “get smarter” as documents were processed. Just looking at pure text, it either recognized it, or not. With a machine learning layer, you now have a system that gets more efficient the more you use it. The key here is that learned intelligence must span documents; it cannot be tied to any one item. It’s this added efficiency that can drive SharePoint usage and adoption through ease of use. You can see more on machine learning in the videos below:
Document Analytics, Accuracy and Extraction – with legacy OCR, extracting the information you need can be problematic at best. How do you raise confidence that the information you have is accurate? With an analysis engine, we look not just at the text, but where it sits, what surrounds it, and known patterns or libraries. This added layer provides the ability to express higher confidence in data extraction, and makes sure you are putting the right data into SharePoint.
This was just a quick overview of the benefits from moving away from legacy OCR, and embracing OCR 2.0 for SharePoint. Thoughts?
Web Services for OCR, Data Extraction and Document Classification
Continuing on my themes of open, web service enabled document capture and analytics, as well as this notion of “In App” Document Capture through APIs, I thought I would share a demo one of our fantastic SEs built to show the automation capabilities within Salesforce. This shows background document classification and extraction, all initiated through a file upload in Salesforce. This leverages OCR technology and our machine learning algorithms to auto-populate data in Salesforce.
Document Capture Through Intelligent Learning and Analytics
One of our regional reps produced this video to help show how we differ from other document capture and analytics platforms on the market. This is a great expansion to one of my earlier posts – Analytics and Document Capture – Why it Matters. The video gives a great overview of the many dimensions of a document, and how Ephesoft leverages its patented technology to enhance accuracy, analyze large volumes of documentation, and process unstructured information.
In the world of document capture and analytics, our typical value proposition is around efficiency, reduction in required headcount and the reduction in turnaround time. Of course, there is true value and cost savings for any organization processing a significant volume of documents if you focus on these value points. Lately, we have been having some great conversations both internally and externally on the true cost of errors in data entry, and I wanted to dig deep into my past, and present a key topic for discussion.
Back in my Navy days, I found myself in the center of a focus on quality, and we had morphed Deming’s Total Quality Management (TQM) into a flavor that would serve us well. In a nutshell, it was an effort to increase quality through a systematic analysis of process, and a continuous improvement cycle. It focused on reducing “Defects” in process, with the ultimate goal of eliminating them all together. Defects impose a high cost on the organization, and can lead to failures across the board. Today, all these concepts can be applied to the processing of documents and their associated data. What is the true value of preventing defects in data?
In my education in this topic, I remember a core concept on quality, and defects: the 1-10-100 rule.
The rule gives us a graphic representation of the escalating cost of errors (or failures), with prevention costing $1, correction $10 and failure $100. So, in terms of data:
Prevention Cost – Preventing an error in data at the point of extraction will cost you $1.
Correction Cost – Having someone correct an error post extraction will cost you $10.
Failure Cost – Letting bad data run through a process to its end resting place will cost you $100.
So, an ounce of prevention is worth a pound of cure. In this case, an error that runs all the way through to failure costs the business 100x what it would have cost to prevent it with automated technology at the point of extraction.
In document capture today, we focus on the top rung of the pyramid, and on prevention. Below are the core benefits of an intelligent capture platform:
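To put numbers on the 1-10-100 rule, here is a small worked example in Python. The per-error costs come straight from the rule; the error counts and the function name are made up for illustration.

```python
# Cost per error at each stage, per the 1-10-100 rule (in dollars).
COST_PER_ERROR = {"prevention": 1, "correction": 10, "failure": 100}


def total_error_cost(errors_by_stage: dict) -> int:
    """Sum the cost of errors, given how many were handled at each stage."""
    return sum(COST_PER_ERROR[stage] * count for stage, count in errors_by_stage.items())


# Hypothetical month with 1,000 data errors, handled three different ways.
all_prevented = total_error_cost({"prevention": 1000})
all_corrected = total_error_cost({"correction": 1000})
all_failed = total_error_cost({"failure": 1000})

# A more realistic mix: most caught up front, some corrected, a few slip through.
mixed = total_error_cost({"prevention": 900, "correction": 80, "failure": 20})
```

Even in the mixed case, the 20 errors that reach their end resting place cost more than the 900 that were prevented.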
Errors in data can be prevented through the automated extraction of data, setting of business rules, and the elimination of hand-keying data.
Existing data sources in the organization can be used to enhance prevention, and ensure data validation through the use of Fuzzy DB technology.
Adding review and validation capabilities prevents bad data from slipping through the process and contaminating your data store. This is invaluable and prevents the ripple effect bad information can have on a process and the organization as a whole.
With machine learning technology, if correction is required, the system learns, and can prevent future corrections, reducing costs.
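The idea behind validating extracted data against existing data sources can be sketched with fuzzy string matching. The Python example below uses the standard library's `difflib` as a stand-in; it is not Ephesoft's Fuzzy DB implementation, and the vendor list, the `match_vendor` function and the 0.8 cutoff are assumptions for illustration. It takes an OCR'd vendor name with a misread character and finds the closest entry in a known vendor list.

```python
import difflib


def match_vendor(ocr_name, known_vendors, cutoff=0.8):
    """Return the closest known vendor name, or None if nothing is similar enough."""
    # Compare in lower case, but return the vendor name as stored.
    lookup = {vendor.lower(): vendor for vendor in known_vendors}
    hits = difflib.get_close_matches(ocr_name.lower(), list(lookup), n=1, cutoff=cutoff)
    return lookup[hits[0]] if hits else None


# Hypothetical vendor master list, e.g. pulled from an accounting system.
vendors = ["Acme Supply Co.", "Globex Corporation", "Initech LLC"]

# OCR misread the letter "l" as the digit "1" and dropped the trailing period.
corrected = match_vendor("Acme Supp1y Co", vendors)

# A name with no close match is left for a person to review.
unknown = match_vendor("Qwerty Holdings", vendors)
```

A match above the cutoff can be accepted automatically, while a `None` result is a natural trigger for the review and validation step.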
Email Attachments to a SharePoint Online Document Library
This is a demo request I had from a partner to see the ability of Ephesoft to pick up email attachments, separate and classify them, and then extract data. The final resting place is a SharePoint Online document library. This is a seamless Email to SharePoint Online solution.