Front-end Document Analytics for Capture, Classification and Extraction
One of our folks over in Europe built out a new Box export integration (Thanks OBH!). It provides some great features through the Box API, uses our machine learning engine to automate document additions, and allows the following:
Addition of any document through Ephesoft to Box
Machine learning on Box documents and data
Document Analytics, Separation, Classification and Data Extraction
The benefits of intelligent document capture are well documented, and its impact on efficiency can be a quick win for any business processing a decent volume of documents. Capture 2.0 adds a new dimension of automation, and provides capture functionality to any application or system. So now, the ERP, CRM or Document Management System can be capture enabled through its own interface, without the need to switch windows, open a new application or “send files” to a processing location. This “hidden” automation layer requires no end-user expertise, and as far as they know, they are doing business as usual. Below are some core benefits to leveraging Capture 2.0 Web Services:
Minimal Impact to Operations – Current process: the end-user uploads a received file through the CRM interface, and then enters some notes and metadata about the file: customer name, date of contract, salesperson, and region. Capture 2.0 enabled process: the end-user uses the same upload process, but in the background, the hidden capture automation layer classifies the document, extracts all the data and enters it into the CRM automatically. This is the power of document capture web services. With no impact to current process or operations, you gain efficiency and reduce errors, which speeds up transactions, shortens response times and avoids the costly fixing of erroneous data.
Low to No Training – because end users are using the applications they use every day, there is almost zero training required. That’s the beauty of the hidden capture layer. If deployed correctly, users just perform their process the same way, and manual data entry and steps are eliminated. The lack of a training requirement minimizes any lost work days spent on costly training, and provides value starting day one.
Solve the Plague of Windows – more and more, IT and Business staff alike are looking to streamline and consolidate, and reduce the number of applications required to do business. Creating that single interactive interface, or that single source of truth, is the end goal. And a web services automation layer can provide functionality that would normally require the addition of an interactive app.
Maximum Efficiency – it is said that 50% of document intensive process labor is spent on 5% of your documents. Why? The cost of problems and fixing errors. Automating any data extraction and document classification process, coupled with data validation techniques, can reduce errors to almost zero and drive the maximum efficiency possible in your organization.
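To make the hidden capture layer a bit more concrete, here is a minimal Python sketch of the CRM upload example above. Everything in it is an assumption for illustration: `capture_document`, `handle_upload` and the field names are hypothetical, and the stub uses simple regular expressions where a real capture engine would do the classification and extraction.

```python
import re


# Hypothetical stand-in for a call to a capture web service.
# A real deployment would send the file to the capture engine instead.
def capture_document(text):
    """Classify the document and extract metadata from its text."""
    doc_type = "contract" if "agreement" in text.lower() else "unknown"
    fields = {}
    customer = re.search(r"Customer:\s*(.+)", text)
    signed_on = re.search(r"Date:\s*(\d{4}-\d{2}-\d{2})", text)
    if customer:
        fields["customer_name"] = customer.group(1).strip()
    if signed_on:
        fields["contract_date"] = signed_on.group(1)
    return {"doc_type": doc_type, "fields": fields}


def handle_upload(filename, text, user_fields=None):
    """Same upload step the end user already performs; capture runs behind it."""
    result = capture_document(text)
    record = {"filename": filename, "doc_type": result["doc_type"]}
    record.update(result["fields"])
    # Anything the user typed by hand wins over the extracted value.
    record.update(user_fields or {})
    return record


# Made-up sample document text, as if OCR had already run.
sample = "Service Agreement\nCustomer: Acme Corp\nDate: 2016-03-01\n"
record = handle_upload("acme.pdf", sample, {"region": "EMEA"})
print(record)
```

The point of the design is that `handle_upload` keeps the same signature the CRM already calls, so the end user sees no change; only the metadata fields arrive pre-filled.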
Just a brief note on some thoughts and trends I am seeing in the marketplace. Thoughts? If you want to see Capture 2.0 Web Services in action, take a peek at Ephesoft Transact Web Services API.
Leveraging Intelligent Capture to Break Down Repository Silos
Every organization has them in both their technical realm and organizational/departmental structure: the Old Silo. But the elephant in the room is usually the document repository. That terabyte nightmare no one wants to address for fear of what lies within. Compounding the issue is the fact that most organizations have numerous document silos, usually the result of years of acquisitions, changing technical staffs with new ideas, or new line of business systems that house their own documents. Repository silos usually take the form of one of the below:
The File Share – when was the last time someone looked at that behemoth? Usually laden with layer upon layer of departmental and personal folder structures, a complete lack of file naming standards, and a plethora (my $2 word 😉 ) of file types. They continue to be backed up, and most IT departments that initiate projects for cleaning these up find a minefield, and departments that are fearful to purge anything.
The Legacy ECM System – Hey, who manages Documentum/FileNet now? Do I continue to put items in X? How do I change the metadata in Y? As time goes on, legacy Enterprise Content Management becomes a huge burden on IT staff, and imposes a massive cost burden for maintenance, support and development. Many of these systems were put in place a decade or so ago, and the file tagging and metadata needs have changed, with users struggling to find what they need through standard search. Some of these systems have just become expensive file shares, due to lack of required functionality or non-supported features.
The Line of Business Repository – just about every system nowadays has a “Document Management” plugin: the Accounting System that stores invoices, the Human Resource Info System that houses employee documents, or the Contracts Management system in legal that maintains contracts. These “niche” systems have created document sprawl within organizations, and a major headache for IT staff.
The SharePoint Library – The SharePoint phenomenon hit pretty hard over the last 8 or so years, and most organizations jumped on the train. Although most organizations we see in the field did not truly standardize on SharePoint as their sole repository, many started using it for niche solutions, focused on departmental document needs. Now, many years into their usage, they have massive content databases housed on expensive storage.
The New Kids on the Block – now enter the new kids: Alfresco, Dropbox, Box, OneDrive and Google Drive. New is a relative term here, but organizations now have broad and extensive content on cloud-based, file-sync technologies. Spanning personal and business accounts, these technologies have created new silos and management challenges for many organizations.
So, how can we leverage intelligent document capture and analytics to break down silos and make life easier? Here are some core “silo breaking” uses:
Data Capture and Extraction – for projects where you want to “peer” into that document repository and extract and/or analyze the content, there are two solutions. Intelligent capture applications, like Ephesoft Transact, can consume repository content, classify document types, and extract pertinent data. Transact has a whole set of extraction technologies that can pull out valuable unstructured data:
Key Value Extraction – this method can parse document contents for information. Take for example a repository of patient records where you want to glean patient name and date of birth. This technology will look for patterns, and pull out required data.
Paragraph Extraction – let’s say you want to find a specific paragraph, perhaps in a lease document, and then extract important information. You can easily identify paragraphs of interest across differing documents, and get what you need.
Cross-section Extraction – say you want to process 10 years of annual reports and pull off a specific bit of data from a table. Say the liabilities number from the financial section. You can specify the row and column header, and pluck just what you need.
Table Extraction – what if you want tables of data within a repository of documents? Take for example lab results from a set of medical records. You can extract the entire table and export it to a DB across thousands of reports.
Managed Migration from X to Y – we are seeing the desire to consolidate repositories and drive “scrubbed” content to a new, central repository. Through advanced document capture, you can consume content from any of the above sources, reclassify, extract new metadata and confirm legacy data as you migrate to a new location.
Single Unified Capture Platform – providing a single, unified platform that can tie into all your existing repositories can save money, and add a layer of automation to older, legacy capture and scanning technology. This repository “spanning” strategy provides a single path for documents which enhances reporting, provides powerful audit capabilities, and minimizes support costs and IT management burden.
Advanced Document Analytics (DA) – with the advent of document analytics platforms, like Ephesoft Insight, you can make that repository useful from a big data perspective through supervised machine learning. These platforms take the integration of capture and analytics to the next level, and provide extensive near real-time processing. DA is focused on processing large volumes of documents and extracting meaning, where there seems to be absolutely no structure. So you can point, consume and analyze any repository for a wide variety of purposes. You can read some great use cases for this technology here: Notes From the Field: What’s Hiding in Your Documents.
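Going back to the extraction methods listed under Data Capture and Extraction, two of them are easy to sketch in a few lines of Python. This is only an illustration of the ideas, not how Transact implements them: the function names, the regular expressions and the sample data are all made up, and the text is assumed to have already been through OCR.

```python
import re


def extract_key_values(text, patterns):
    """Key value extraction: find each labelled value by its pattern."""
    found = {}
    for key, pattern in patterns.items():
        match = re.search(pattern, text, flags=re.IGNORECASE)
        if match:
            found[key] = match.group(1).strip()
    return found


def extract_cross_section(table, row_header, column_header):
    """Cross-section extraction: the cell where a row and column header meet."""
    header = table[0]
    column = header.index(column_header)  # raises ValueError if the column is absent
    for row in table[1:]:
        if row[0] == row_header:
            return row[column]
    return None


# Made-up patient record text.
record_text = "Patient Name: Jane Doe\nDate of Birth: 04/12/1975\nNotes: none"
patterns = {
    "patient_name": r"Patient Name:\s*([^\n]+)",
    "date_of_birth": r"Date of Birth:\s*(\d{2}/\d{2}/\d{4})",
}
print(extract_key_values(record_text, patterns))

# Made-up table, as it might look after being pulled from an annual report.
annual_report_table = [
    ["Item", "2014", "2015"],
    ["Total assets", "1,200", "1,350"],
    ["Total liabilities", "800", "860"],
]
print(extract_cross_section(annual_report_table, "Total liabilities", "2015"))
```

Real documents are far messier than this, which is where a capture engine that learns layouts earns its keep; the sketch only shows what "look for a pattern" and "specify the row and column header" mean in practice.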
Just a quick brain dump on breaking down silos with intelligent document capture and analytics. Thoughts? Did I miss anything?
Adding The Next Generation of Document Capture Automation
Over the past decade, the document capture industry has become quite stagnant and ripe for disruption. The acquisition of just about every capture company by larger, behemoth organizations has created a stagnation in innovation and a lack of modernization. IT executives are yearning for a refresh to their legacy capture solutions, and they expect standards of the modern tech world:
Service/Platform based architecture
Web/browser-based user interfaces
Web services APIs
With that said, many organizations have made massive investments in document capture technology, and a “rip and replace” strategy comes with a serious impact to business operations. But there can be exponential benefits to a modernization of document automation and capture technologies. This comes from key new developments from innovative capture startups:
Machine Learning – in the legacy capture world, long expensive services engagements are the norm, with deep custom development and configuration. Isn’t it 2016? Aren’t computers supposed to take that pain away with intelligence? In steps machine learning. The modern capture platform provides a core learning engine that understands your documents, their layouts and data. As you use the system, it gets smarter, improving accuracy and reducing user intervention, with a true end goal of autonomous operations.
Capture Web Services – providing capture functionality to any application in the organization can be a huge boost to efficiency and productivity. Want a customer document upload page to validate the uploaded documents are of the correct type? Need to check the date of a document, or that it has been signed? Document capture services can give your development teams a tool set they have never had in the past.
Document Analytics and Analysis – taking a holistic view of the whole document capture process is essential to the modern capture platform. Seeing the document as pure words will not further understanding, nor provide additional benefit. With a true Analytics/Analysis frame of mind, every single characteristic of the document becomes important: font, font size, location, surrounding words and overall layout (for a deep look at the facets of document analytics, see my previous post: Document Analytics and Capture ).
Open Architecture – Having a capture platform that has been built from the ground up with openness and extensibility in mind is absolutely critical. Adding this as an afterthought creates a clunky, difficult environment for developers, and leads to workarounds and lack of desired functionality.
The great benefit here is that without a “rip and replace” event, modern capture platforms can be added as a non-disruptive, transparent automation and efficiency layer.
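As a rough illustration of the upload-page checks mentioned under Capture Web Services (correct type, document date, signature), here is a small Python sketch. The result dictionary and its keys (`doc_type`, `doc_date`, `signed`) are hypothetical; a real capture service would define its own response format, and the 90-day age limit is just an example value.

```python
from datetime import date


def validate_upload(capture_result, expected_type, today, max_age_days=90):
    """Return a list of problems found with an uploaded document."""
    problems = []
    if capture_result.get("doc_type") != expected_type:
        problems.append("wrong document type")
    doc_date = capture_result.get("doc_date")
    if doc_date is None:
        problems.append("no date found")
    elif (today - doc_date).days > max_age_days:
        problems.append("document too old")
    if not capture_result.get("signed", False):
        problems.append("signature missing")
    return problems


# Hypothetical result, shaped as if it came back from a capture service.
result = {"doc_type": "proof_of_address", "doc_date": date(2016, 1, 15), "signed": False}
print(validate_upload(result, "proof_of_address", today=date(2016, 3, 1)))
```

An empty list means the upload passed every check, so the page can accept the file; anything else can be shown to the customer right away instead of being discovered days later in the back office.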
By adding a centralized capture engine, you can glean the following benefits:
Any scanning device becomes an input device
BPM and Workflow systems can take advantage of capture with minimal dev (See an example here: Notes From the Field)
Services like fax and email can easily be designated as a source for capture
Legacy capture processes with bar code sheets and manual data entry can be automated
Mobile devices can now leverage mobile capture SDKs and the centralized automation engine
Legacy ECM systems now have a new automation dimension
Cloud-based enterprise services can be capture-enabled
So a tech refresh on the capture front becomes a viable initial project, and current capture components can be left in place. In this case, Ephesoft becomes a new layer of automation and a catalyst for process improvement and efficiency.
This has been a consistent theme in our experience out in the market, with existing legacy capture customers and new prospects looking for a minimal impact refresh for their ailing and aged capture infrastructures. Thoughts? Comments?
The document capture industry has seen a transformation over the last 3 or so years, and a migration to providing Capture as a Platform (CaaP) or Capture as a Service (CaaS). If you look at Enterprise Capture Platforms, they typically have a core set of features that provide not only product-based functionality, but also platform APIs to integrate, extend and allow usage at the application level. Here are critical features every platform must provide:
A Web User Interface – Let’s face it, for any application today, a fully functional web interface is an absolute requirement, and document capture is no exception. The web provides simplicity for IT, and removes installation headaches and support pain. It also gives end users an easy way to launch the application from anywhere, on any device. The UI should provide not only end-user functionality, but also administrative capabilities.
Cloud and On-premise – With many organizations looking to streamline IT and move core services to the cloud, all Capture 2.0 platforms must have a true, dedicated cloud offering. Cloud enabled platforms can provide services to other cloud-based apps (like Salesforce and O365) without alteration of on-premise security or infrastructure.
A Learning Engine – the days of extensive manual configuration are long gone, and a core learning engine within the capture platform drives ease of setup, and agility when changes need to be made. Classification of documents should be as easy as a quick drag and drop into the learning engine for auto-configuration.
Extensive Web Services API – The power of any platform is to provide a standardized processing engine to perform specific related tasks. With a capture platform, the ability to perform just about any document processing task through the API is a must. Some examples: passing a document for classification, creating OCR text for a passed image and extracting key metadata from a document. For an example of an extensive Capture Platform Services API, see Ephesoft’s Capture Web Services.
Mobile Client and SDK – with the rise of mobile, there is demand within organizations to enable mobile solutions. Any capture platform should have a mobile client, as well as an extensive SDK, including on-board OCR capabilities.
Analytics Engine and BI – Going beyond the basics of reporting, document analytics is a new hot topic within the capture industry. How can you parse your unstructured document repositories, and extract meaning from all types of files? The answer is document analytics. Your capture platform should have all the plumbing for analytics, or have an add-on engine to enable this area of functionality.
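To give a feel for what "drop in some samples and let the engine configure itself" means for the learning engine above, here is a deliberately tiny Python sketch. It is a toy word-overlap classifier of my own, not the technique any real capture platform uses; the class name, method names and sample texts are all made up.

```python
from collections import Counter


class TinyLearningEngine:
    """Toy classifier: learns word counts per document type from samples."""

    def __init__(self):
        self.word_counts = {}

    def learn(self, doc_type, text):
        """Add one sample document for a document type."""
        counts = self.word_counts.setdefault(doc_type, Counter())
        counts.update(text.lower().split())

    def classify(self, text):
        """Return the best matching document type, or None if nothing matches."""
        words = text.lower().split()
        best_type, best_score = None, 0
        for doc_type, counts in self.word_counts.items():
            # Score = how many words of the new document were seen for this type.
            score = sum(1 for word in words if word in counts)
            if score > best_score:
                best_type, best_score = doc_type, score
        return best_type


engine = TinyLearningEngine()
engine.learn("invoice", "invoice number total amount due payment terms")
engine.learn("contract", "agreement between parties term signature effective date")
print(engine.classify("Amount due on this invoice"))
print(engine.classify("Signature of both parties"))
```

There is no configuration step here beyond feeding samples in, and adding a new document type is just another call to `learn`. A production engine would also weigh layout, fonts and position, but the workflow for the person setting it up is the same idea.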
Obviously, there are many more areas we could cover: Linux and Windows support, clustering capabilities, and on and on. But in my opinion, these core areas are a must. Thoughts? Did I miss any?