Efficient document management is essential in any industry.

Client

One of the biggest problems in document management is the diversity of paperwork formats, structures and origins. As a rule, dealing with the data stored in non-standardized PDFs and on paper requires time and effort.

There are tools that help optimize document workflow. For example, systems based on optical character recognition (OCR) technology can extract data from various digitized documents. However, there is no one-size-fits-all tool that can process any document in any format. Guaranteeing the accuracy of data extraction often requires costly manual verification.

An agriculture analytics and research provider in the UK was facing exactly this problem. The company asked Digiteum to build a custom web application that automates the processing of PDF invoices with different structures and extracts meaningful information from them, such as the invoice number, date, company name, and total amount.

Automated recognition and processing of the data stored in multiformat PDF invoices.

Tesseract OCR for smart PDF text recognition

The Digiteum team analyzed the company’s business process and suggested building an automated PDF processing platform based on two major modules: a data extractor and an OCR service.

First, the data extractor performs a preliminary analysis of a PDF invoice to determine the basic characteristics of the document: whether it is a scanned image, an original PDF, or plain text. If the system detects images, it engages the OCR service for PDF text recognition and extraction.
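The routing step described above can be sketched roughly as follows. This is only an illustration, not the project's actual logic: the `PageStats` summary, the `classify` heuristic, and the `min_chars` threshold are all hypothetical, and the per-page statistics are assumed to come from whatever PDF parser the system uses.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Sequence


class DocKind(Enum):
    SCANNED_IMAGE = "scanned_image"  # page content is a raster scan
    ORIGINAL_PDF = "original_pdf"    # digital PDF with a text layer and graphics
    PLAIN_TEXT = "plain_text"        # text layer only


@dataclass
class PageStats:
    """Hypothetical per-page summary produced by a PDF parser."""
    text_chars: int   # characters found in the embedded text layer
    image_count: int  # raster images placed on the page


def classify(pages: Sequence[PageStats], min_chars: int = 20) -> DocKind:
    """Guess the document kind; min_chars is an assumed threshold."""
    if not pages:
        raise ValueError("document has no pages")
    # A page that carries an image but almost no text is treated as a scan.
    if any(p.image_count > 0 and p.text_chars < min_chars for p in pages):
        return DocKind.SCANNED_IMAGE
    # Text plus embedded graphics (logos, stamps) suggests a digital PDF.
    if any(p.image_count > 0 for p in pages):
        return DocKind.ORIGINAL_PDF
    return DocKind.PLAIN_TEXT


def needs_ocr(kind: DocKind) -> bool:
    """Only scanned documents are routed to the OCR service."""
    return kind is DocKind.SCANNED_IMAGE
```

With this heuristic, a single scanned page inside an otherwise digital invoice is enough to send the document through OCR, which errs on the side of not missing text.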

To choose a cost-efficient and reliable OCR service for this purpose, the Digiteum team tested major cloud-based OCR services (ABBYY, Google, Azure, OCR Space) and the open-source offline engine Tesseract. After the analysis, the team selected Tesseract version 4 as the most advanced option, as it showed the highest precision in PDF text recognition.

In addition to its established computer vision algorithms, library, and configuration options, this version of Tesseract offers deep learning methods for image understanding. These methods make it possible to experiment with and train neural networks, improve symbol recognition, enhance accuracy and, for example, teach the system to read handwriting. These benefits allowed the Tesseract OCR service to meet the objectives of the project and perform PDF text recognition on multiformat invoices with a high level of accuracy.
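As an illustration of how the neural-network engine in Tesseract 4 is selected, here is a minimal sketch that calls the `tesseract` command-line tool. It assumes the binary is installed and on the PATH and that each PDF page has already been rendered to an image file, since Tesseract reads images rather than PDFs. The function names, the example file name, and the chosen page segmentation mode are illustrative assumptions, not the project's configuration.

```python
import subprocess
from typing import List


def build_tesseract_command(image_path: str, lang: str = "eng", psm: int = 6) -> List[str]:
    """Build a Tesseract 4 command line that prints recognized text to stdout."""
    return [
        "tesseract", image_path, "stdout",
        "-l", lang,          # recognition language
        "--oem", "1",        # OCR engine mode 1: LSTM neural network only
        "--psm", str(psm),   # page segmentation mode (6: single uniform block of text)
    ]


def ocr_image(image_path: str) -> str:
    """Run Tesseract on one page image and return the recognized text."""
    result = subprocess.run(
        build_tesseract_command(image_path),
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

For example, `ocr_image("page-1.png")` would return the text of a page image with that hypothetical name. Invoices with multi-column layouts may need a different `--psm` value.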

Custom algorithms to perform the detailed analysis and extract information from PDFs.

Data extraction algorithms provide up to 80% accuracy

The other part of the system, the data extractor engine, uses custom algorithms to analyze documents in detail and extract information from readable PDFs and from the documents prepared by the OCR service. The Digiteum team tested 15 algorithms. Some of them reach up to 80% recognition accuracy, while others provide 60% and can be improved in the future. These algorithms allow the system to identify font style and geometry, recognize tables and their structures, and parse the data against a number of validation rules to find certain classes of data, such as account number, product name, and contact information.
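The rule-based parsing step can be illustrated with a small sketch. The patterns below are hypothetical examples written for this illustration, not the algorithms the team tested. Each rule maps a data class to a regular expression with one capture group.

```python
import re
from typing import Dict, Optional

# Hypothetical validation rules: data class -> pattern with one capture group.
RULES = {
    "invoice_number": re.compile(
        r"invoice\s*(?:no\.?|number|#)\s*[:\-]?\s*([A-Z0-9\-]+)", re.IGNORECASE
    ),
    "date": re.compile(r"\b(\d{1,2}[./-]\d{1,2}[./-]\d{4})\b"),
    # \b keeps "Subtotal" from being mistaken for the total line.
    "total": re.compile(r"\btotal\b[^\d\n]*(\d[\d,]*\.\d{2})", re.IGNORECASE),
}


def extract_fields(text: str) -> Dict[str, Optional[str]]:
    """Apply every rule to the text; a class with no match maps to None."""
    fields: Dict[str, Optional[str]] = {}
    for name, pattern in RULES.items():
        match = pattern.search(text)
        fields[name] = match.group(1) if match else None
    return fields


SAMPLE = """ACME Farm Supplies Ltd
Invoice No: INV-2041
Date: 14/03/2019
Subtotal: 1,000.00
Total: 1,200.00
"""

print(extract_fields(SAMPLE))
# {'invoice_number': 'INV-2041', 'date': '14/03/2019', 'total': '1,200.00'}
```

Fields that come back as `None` are natural candidates for manual verification, which is where the remaining share of errors would be handled.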

Finally, the extracted information is sent to an AWS cloud service, which provides secure and reliable data storage.

In the future, this project will grow into a full-fledged document management system. The team can train the OCR service with deep learning to better recognize text in scanned and other image-based PDF invoices, improve data validation rules, introduce new pattern-based algorithms and, as a result, reduce the number of errors.

Broad space for the development and scalability of the system.

Highlights

Custom web system for automated text recognition in multiformat PDF invoices of various quality.
Deployment on cloud-based AWS for scalability and strong data storage security.
Advanced OCR service for text recognition based on proven computer vision technology and smart Deep Learning methods.
Custom data extraction algorithms that enable up to 80% accuracy.
Data validation rules that classify data against specific requirements.
Broad space for system development and scalability – improving algorithms, enhancing text recognition precision, introducing new data validation rules, training OCR service.
