Building an accurate OCR receipt engine is an interesting engineering challenge because the problem cannot be solved deterministically. In practice, there are too many uncertainties (e.g., receipt format, language, country of origin, picture quality, receipt angle) in what a receipt scanner API receives.
However, like most engineering problems, the path you take ultimately shapes the quality of the solution.
This article outlines what I have found to be one of the most effective ways to build a receipt scanner API that is accurate, automatic, real-time, multilingual, and adaptive.
I will explain this solution based on our experience in building TAGGUN, a Receipt and Invoice Scanning API powered by machine learning.
There are many ways that businesses capitalize on digitizing receipt processing. To name a few:
Tech giants like AWS, Microsoft, Google, and IBM are actively competing with each other to offer the best machine learning and computer vision on the market. So, instead of reinventing the wheel and training a Tesseract OCR model ourselves, we take advantage of this healthy competition and select the best computer vision OCR solutions to convert the image of a receipt into raw text.
The true crux of a modern OCR receipt engine, then, is its ability to convert syntactic data into semantic information. That task belongs to NLP (Natural Language Processing).
NLP is the field of Machine Learning that allows computers to digest and understand written and spoken texts (ref. 1).
NLP and OCR are therefore the two foundations of TAGGUN's engine. I will now paint the picture of how we used them to build the scanner.
Based on our testing, Microsoft Cognitive and Google Vision are two of the best OCR providers on the market, and the latest version of Microsoft Cognitive actually outperforms Google Vision. So we recently switched to Microsoft as our main computer vision provider, after three years with Google. Each has its own benefits and trade-offs, and we set up our engine to optimize the result from both providers.
After the file is processed by the OCR provider, the output is the classic computer vision OCR result: raw text with coordinates and bounding boxes.
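The shape of that output is roughly the following — a simplified sketch with illustrative field names, not any provider's exact schema:

```python
# A simplified sketch of a typical OCR provider response: each recognized
# word comes with its text and a bounding box (x, y, width, height).
# Field names here are illustrative, not Microsoft's or Google's schema.
ocr_result = {
    "words": [
        {"text": "TOTAL",  "box": {"x": 40,  "y": 610, "w": 90, "h": 22}},
        {"text": "$12.50", "box": {"x": 210, "y": 612, "w": 80, "h": 22}},
    ]
}

def words_on_same_line(a, b, tolerance=8):
    """Treat two words as one visual line if their y-coordinates are close."""
    return abs(a["box"]["y"] - b["box"]["y"]) <= tolerance

# Group words into a line by vertical position -- the first step from a
# "bag of boxes" toward readable text.
line = " ".join(
    w["text"] for w in ocr_result["words"]
    if words_on_same_line(w, ocr_result["words"][0])
)
```

Turning those boxes back into coherent lines is what makes the later extraction stages possible.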
To improve data extraction, contextual awareness should be built around the file and the request, in order to predict the file's metadata. E.g., predicting the:
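Even simple signals in the raw text can feed such predictions — currency symbols hint at country and currency, character ranges hint at language. The following is a naive sketch of the idea, not TAGGUN's actual model:

```python
def guess_metadata(raw_text):
    """Naive heuristics for predicting file metadata from OCR text.
    A real system would use trained classifiers; this only illustrates the idea."""
    meta = {}
    if "€" in raw_text:
        meta["currency"] = "EUR"
    elif "£" in raw_text:
        meta["currency"] = "GBP"
    elif "$" in raw_text:
        meta["currency"] = "USD"  # ambiguous: AUD, CAD, etc. also use "$"
    # CJK characters suggest a Chinese-language receipt (e.g., a fapiao)
    if any("\u4e00" <= ch <= "\u9fff" for ch in raw_text):
        meta["language"] = "zh"
    return meta
```

Each predicted field then narrows the patterns and validations the later stages apply.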
This stage detects and extracts the most basic information from the text.
For example:
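A minimal sketch of this stage: regular expressions that pull amount-like and date-like tokens out of the raw text. Real receipts need many more patterns per locale; these two are only illustrative:

```python
import re

AMOUNT_RE = re.compile(r"\d+[.,]\d{2}")           # e.g. 12.50 or 12,50
DATE_RE = re.compile(r"\d{1,2}/\d{1,2}/\d{2,4}")  # e.g. 23/04/2021

def extract_basics(raw_text):
    """Pull every amount-like and date-like token from the OCR text."""
    return {
        "amounts": [float(m.replace(",", ".")) for m in AMOUNT_RE.findall(raw_text)],
        "dates": DATE_RE.findall(raw_text),
    }
```

At this point the engine has candidates, not answers — deciding which amount is the total is the job of the next phase.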
This phase of the Scan Receipt API is where the more complex information is identified and extracted.
If there are five distinct amounts, how do you know which is the total amount? Or the tax amount?
As you can imagine, it becomes increasingly tricky as the content, format, and language of invoices and receipts become more variable.
Several different algorithms are run to determine the best result for each of the entities.
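One simple algorithm in such an ensemble might score each candidate amount by the keywords on its line — an illustrative heuristic only, not TAGGUN's production logic:

```python
import re

def pick_total(lines):
    """Score each amount-bearing line: keywords like 'total' boost a candidate,
    while 'subtotal'/'tax' penalize it; ties go to the larger amount.
    Production engines combine many such signals, not just one."""
    best, best_key = None, None
    for line in lines:
        m = re.search(r"(\d+\.\d{2})", line)
        if not m:
            continue
        amount = float(m.group(1))
        text = line.lower()
        score = 0
        if "total" in text:
            score += 2
        if "subtotal" in text or "tax" in text or "gst" in text:
            score -= 3
        key = (score, amount)
        if best_key is None or key > best_key:
            best, best_key = amount, key
    return best
```

Running several such scorers and comparing their answers is what makes the extraction robust to layout variation.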
Each extracted number is also validated against the receipt's official sum — the amounts must add up — which improves accuracy.
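That validation can be as simple as checking that the extracted parts add up, with a small tolerance for rounding and OCR noise — a sketch:

```python
def sum_is_consistent(subtotal, tax, total, tolerance=0.01):
    """Accept the extraction only if subtotal + tax equals the total
    (within a small tolerance to absorb rounding and OCR noise)."""
    return abs((subtotal + tax) - total) <= tolerance
```

If the check fails, the engine can fall back to the next-best candidate from the scoring step instead of returning a number that contradicts the receipt itself.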
Recognizing patterns in the text is also required, so that grouped information (such as tax rate, gross tax amount, and net tax amount) can be accurately extracted.
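For instance, a line like "GST 10% 2.50 25.00" groups a rate with its amounts. One pattern sketch follows; the field order and labels vary widely by country, so a real engine carries many variants:

```python
import re

# Matches lines like "GST 10% 2.50 25.00": a tax label, a rate,
# a tax amount, and a net amount. Illustrative -- real layouts vary widely.
TAX_LINE_RE = re.compile(
    r"(?P<label>[A-Za-z]+)\s+(?P<rate>\d+(?:\.\d+)?)%\s+"
    r"(?P<tax>\d+\.\d{2})\s+(?P<net>\d+\.\d{2})"
)

def parse_tax_line(line):
    """Extract one grouped tax entry (label, rate, tax amount, net amount)."""
    m = TAX_LINE_RE.search(line)
    if not m:
        return None
    return {
        "label": m.group("label"),
        "rate": float(m.group("rate")),
        "tax_amount": float(m.group("tax")),
        "net_amount": float(m.group("net")),
    }
```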
This can be trained (or fed back) per account, so accuracy improves for each individual account over time.
- Total Amount
- Tax Amount
- ABN
- Multi Tax Line Items
- Merchant Verification, Merchant Name
- Receipt Number, Invoice Number
- IBAN
- Payment Type (e.g., credit card, cash, Visa, MC)
- Fapiao Invoice Number and Code
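Some of these entities can be validated deterministically. An Australian ABN, for example, carries a published weighted checksum (the ATO's mod-89 rule), so an extracted candidate can be checked before it is trusted:

```python
# Validate an Australian Business Number (ABN) with its official checksum:
# subtract 1 from the first digit, multiply the 11 digits by the weights
# below, and the weighted sum must be divisible by 89.
ABN_WEIGHTS = [10, 1, 3, 5, 7, 9, 11, 13, 15, 17, 19]

def is_valid_abn(abn):
    digits = [int(c) for c in abn if c.isdigit()]
    if len(digits) != 11:
        return False
    digits[0] -= 1
    return sum(d * w for d, w in zip(digits, ABN_WEIGHTS)) % 89 == 0
```

IBANs have a comparable mod-97 check digit, so the same "validate before trusting" approach applies to several entities on the list.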
Public APIs are called as needed to acquire supplementary information.
Examples of these are:
The result is output in JSON format. Because JSON is a universal data format, developers can easily integrate the receipt OCR API into any software, in any programming language.
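A consumer of such an API therefore just decodes JSON. The sketch below shows what handling a response might look like — the field names and payload are illustrative, not TAGGUN's exact schema:

```python
import json

# An illustrative JSON payload -- field names are hypothetical,
# not any particular API's exact schema.
response_body = """
{
  "totalAmount":  {"data": 12.50, "confidenceLevel": 0.97},
  "taxAmount":    {"data": 1.14,  "confidenceLevel": 0.95},
  "merchantName": {"data": "Example Cafe", "confidenceLevel": 0.88}
}
"""

result = json.loads(response_body)
total = result["totalAmount"]["data"]
# Per-field confidence scores let the caller decide when a value
# should fall back to manual review instead of being accepted blindly.
needs_review = any(field["confidenceLevel"] < 0.9 for field in result.values())
```

Returning a confidence score per field, not just a value, is what lets downstream software automate the easy cases and escalate only the uncertain ones.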
It is also recommended to build the engine so that it returns the result immediately in the API response. TAGGUN has this feature, so developers don't need to make additional polling requests or build webhook endpoints.