Hello world. This tutorial is a gentle introduction to building a modern text recognition system using deep learning in 15 minutes. It will teach you the main ideas of how to use Keras and Supervisely for this problem. This guide is for anyone who is interested in using deep learning for text recognition in images but has no idea where to start.
We are going to consider a simple real-world example: number plate recognition. This is a good starting point, and you can easily customize it for your task. A simple tutorial on how to detect number plates can be found here.
When we dove into this field, we faced a lack of materials on the internet. Through long research and reading many papers, we developed an understanding of the main principles behind creating effective recognition systems. We have shared this understanding with the community in two short video lectures (part1 and part2) that explain how it works in plain language. We feel that this content is extremely valuable, because it is almost impossible to find a nice and simple explanation of how to build modern recognition systems. We highly recommend watching them before you start, because they will give you a lot of intuition about this topic.
To follow this tutorial without problems, you will need Ubuntu, a GPU, and Docker.
All sources are available on GitHub. The source code is in a single Jupyter notebook with comments and useful visualizations.
For this tutorial we have generated an artificial dataset of more than 10K images that are very similar to real number plates. The images look like this.
You can easily get this dataset from Supervisely. Let us say a few words about it. We at DeepSystems do a lot of computer vision development, such as self-driving cars, receipt recognition systems, road defect detection and so on. As data scientists, we spend a lot of time working with training data: creating custom image annotations, merging our data with public datasets, making data augmentations and so on. Supervisely simplifies the way you work with training data and automates many routine tasks. We believe you'll find it useful in your everyday work.
The first step is to register in Supervisely. The next step is to go to the "Import" -> "Datasets library" tab and click on the "anpr_ocr" project.
After that, type the name "anpr_ocr" and click the "Next" button.
Then click the "Upload" button. That's all. Now the project "anpr_ocr" is added to your account.
It consists of two datasets: "train" and "test".
If you want to preview the images, just click on a dataset and you will instantly get into the annotation tool. For each image we have a text description that will be used as ground truth to train our system. To view it, just click on the small icon opposite the selected image (marked in red).
Now we have to download it in a specific format. To do so, go to the "DTL" page and insert this config into the text area. It will look like this.
In the screenshot above you can see the scheme illustrating the export steps. We will not dig into the technical details (you can read the documentation if needed), but we will try to explain the process below. In our "anpr_ocr" project we have two datasets. The "test" dataset is exported as is (all images will be tagged as "test"). The "train" dataset is split into two sets: "train" and "val". A random 95 percent of the images will be tagged as "train", and the remaining 5 percent as "val".
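To make the 95/5 split concrete, here is a minimal Python sketch of the same idea. It is not Supervisely's implementation: the function name `tag_train_val`, the seed and the file names are all made up for illustration, and the real split is done for you by the export config.

```python
import random

def tag_train_val(image_names, train_fraction=0.95, seed=0):
    """Randomly tag each image name as 'train' or 'val' (illustrative helper)."""
    names = list(image_names)
    rng = random.Random(seed)          # fixed seed so the split is reproducible
    rng.shuffle(names)
    n_train = round(len(names) * train_fraction)
    tags = {name: "train" for name in names[:n_train]}
    tags.update({name: "val" for name in names[n_train:]})
    return tags

# Hypothetical file names, for illustration only.
example_names = [f"plate_{i:04d}.png" for i in range(1000)]
example_tags = tag_train_val(example_names)
n_train = sum(tag == "train" for tag in example_tags.values())
n_val = sum(tag == "val" for tag in example_tags.values())
print(n_train, n_val)  # 950 50
```

The "val" part is held out during training so that you can monitor how well the model generalizes before touching the "test" dataset.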
Now you can click the "Start" button and wait about two minutes while the system prepares the archive for download. Click "DTL" -> "Task status" -> "Three vertical dots" -> "Download" to get the training data (marked in red).
We have prepared everything you need in our git repository. Clone it with the following commands:
git clone https://github.com/DeepSystems/supervisely-tutorials.git
cd supervisely-tutorials/anpr_ocr
The directory structure will be the following:
.
├── data
├── docker
│   ├── build.sh
│   ├── Dockerfile
│   └── run.sh
└── src
    ├── architecture.png
    ├── export_config.json
    └── image_ocr.ipynb
Put the downloaded zip archive into the "data" directory and run the command below:
unzip <archive name>.zip -d .
In my case the command was
unzip anpr_ocr.zip -d .
Now let's build and run the Docker container with the prepared working environment (TensorFlow and Keras). Just go to the "docker" directory and run the following commands:
./build.sh
./run.sh
After that you will be inside the container. Run the next command to start Jupyter Notebook:
jupyter notebook
In the terminal you will see something like this:
You have to copy the selected link and paste it into a web browser. Note that your link will be slightly different from mine.
The last step is to run the whole "image_ocr.ipynb" notebook. Click "Cell" -> "Run all".
The notebook consists of a few main parts: data loading and visualization, model training, and model evaluation on the test set. On average, the training process for this dataset takes around 30 minutes.
If everything is OK, you'll see the following output:
As you can see, the predicted string is the same as the ground truth. Thus we have constructed a modern OCR system in one pretty clear Jupyter notebook. In the next chapter of this tutorial we will cover and explain all the main principles of how it works.
For us, understanding the neural network architecture is the key. Please don't be lazy and take 15 minutes to watch our short and simple video with a high-level overview of the NN architecture, which was mentioned at the beginning. It will give you a general understanding. If you have already done so, bravo! :-)
Here I will try to give you a short explanation. The high-level overview is the following:
First, the image is fed to a CNN to extract image features. The next step is to apply a recurrent neural network to these features, followed by a special decoding algorithm. This decoding algorithm takes the LSTM outputs from each time step and produces the final labeling.
The detailed architecture is the following. FC: fully connected layer, SM: softmax layer.
The image has the following shape: height equal to 64, width equal to 128, and number of channels equal to three.
As you have seen before, we feed this image tensor to the CNN feature extractor and it produces a tensor with shape 4*8*4. We put the image "apple" onto the feature tensor so you can understand how to interpret it. Height equals 4, width equals 8 (these are the spatial dimensions) and the number of channels equals 4. Thus we transform the input image with 3 channels into a 4-channel tensor. In practice the number of channels should be much larger, but we constructed a small demo network only so that everything fits on the slide.
Next we do a reshape operation. After that we obtain a sequence of 8 vectors of 16 elements. We then feed these 8 vectors to the LSTM network and get its output, which is also a sequence of 16-element vectors. Then we apply a fully connected layer followed by a softmax layer and get a vector of 6 elements. This vector contains the probability distribution over the alphabet symbols at each LSTM step.
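If you want to check the shapes yourself, here is a small pure-Python sketch of the reshape and the FC + softmax steps for the demo sizes above. It is only a shape walkthrough under several assumptions: the feature values and weights are dummy numbers, the order in which the 16 elements of each column are concatenated is our choice, and the LSTM is replaced by a pass-through because only the shapes matter here. It is not the code from the notebook.

```python
import math

HEIGHT, WIDTH, CHANNELS = 4, 8, 4   # CNN feature tensor shape from the text
ALPHABET_SIZE = 6                   # size of the softmax output from the text

# Dummy feature tensor indexed as features[row][col][channel].
features = [[[float(r + c + ch) for ch in range(CHANNELS)]
             for c in range(WIDTH)]
            for r in range(HEIGHT)]

def to_sequence(feature_tensor):
    """Turn an H x W x C tensor into W vectors of H*C elements, one per column."""
    height = len(feature_tensor)
    width = len(feature_tensor[0])
    return [[value
             for row in range(height)
             for value in feature_tensor[row][col]]
            for col in range(width)]

def fully_connected(vector, weights, bias):
    """Plain matrix-vector product plus bias."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, bias)]

def softmax(scores):
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

sequence = to_sequence(features)    # 8 vectors of 16 elements
lstm_outputs = sequence             # placeholder: a real LSTM goes here

# Dummy FC weights: 6 rows of 16 values, chosen only to be deterministic.
weights = [[0.01 * ((i + j) % 5 - 2) for j in range(HEIGHT * CHANNELS)]
           for i in range(ALPHABET_SIZE)]
bias = [0.0] * ALPHABET_SIZE

probs = [softmax(fully_connected(v, weights, bias)) for v in lstm_outputs]

print(len(sequence), len(sequence[0]))  # 8 16
print(len(probs), len(probs[0]))        # 8 6
```

Each of the 8 output vectors sums to one, so it can be read as a probability distribution over the 6 alphabet symbols at that step.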
In practice, the number of CNN output vectors can reach 32, 64 or more. The choice will depend on the specific task. Also, in production it is better to use a multilayered bidirectional LSTM. But this simple example explains only the most important concepts.
But how does the decoding algorithm work? In the diagram above we have eight vectors of probabilities, one at each LSTM time step. Let's take the most probable symbol at each time step. As a result we obtain a string of eight characters: the most probable letter at each time step. Then we have to glue all consecutive repeating characters into one. In our example, two "e" letters are glued into a single one. A special blank character allows us to split symbols that are repeated in the original labeling. We added the blank symbol to the alphabet to teach our neural network to predict a blank between such repeated symbols. Then we remove all blank symbols. Look at the illustration below.
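The decoding steps described above can be sketched in a few lines of Python. This is a minimal illustration, not the code from the notebook: the 6-symbol alphabet, the "-" sign used for blank, and the eight-step path "ap-plee-" are assumptions chosen to match the description (eight steps, a repeated "e", and a blank between the two "p" letters).

```python
from itertools import groupby

BLANK = "-"                                  # assumption: blank drawn as "-"
ALPHABET = ["a", "e", "l", "p", "s", BLANK]  # hypothetical 6-symbol alphabet

def best_path(prob_steps, alphabet):
    """Take the most probable symbol at each time step."""
    return "".join(alphabet[max(range(len(p)), key=lambda i: p[i])]
                   for p in prob_steps)

def collapse(path, blank=BLANK):
    """Glue consecutive repeats into one symbol, then drop the blanks."""
    merged = [symbol for symbol, _ in groupby(path)]
    return "".join(symbol for symbol in merged if symbol != blank)

def peaked_distribution(symbol):
    """Toy distribution: 0.9 for the given symbol, 0.02 for the others."""
    return [0.9 if s == symbol else 0.02 for s in ALPHABET]

# Eight toy probability vectors, one per time step.
prob_steps = [peaked_distribution(s) for s in "ap-plee-"]

path = best_path(prob_steps, ALPHABET)
print(path)            # ap-plee-
print(collapse(path))  # apple
```

Note that without the blank between the two "p" letters, the path "applee" would collapse to "aple". This is exactly why the blank symbol is needed.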
When we train our network, we replace the decoding algorithm with a CTC loss layer. It is explained in our second video lecture. For now it is available only in Russian, sorry about that. But the good news is that we have English slides and we will publish an English version soon.
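Until the English lecture is out, here is a rough sketch of what the CTC loss computes, based on its standard definition rather than on our notebook code: the probability of a label is the sum of the probabilities of all per-step paths that decode to this label, and the loss is the negative logarithm of that sum. The brute-force enumeration below is only practical for tiny examples (real implementations use dynamic programming), and the function names and toy numbers are ours.

```python
import math
from itertools import groupby, product

BLANK = "-"  # assumption: blank drawn as "-"

def collapse(symbols, blank=BLANK):
    """Glue consecutive repeats into one symbol, then drop the blanks."""
    merged = [symbol for symbol, _ in groupby(symbols)]
    return "".join(symbol for symbol in merged if symbol != blank)

def ctc_loss_brute_force(prob_steps, alphabet, label):
    """-log of the total probability of all paths that decode to `label`.

    Assumes at least one path can produce the label (otherwise log(0) fails).
    """
    total = 0.0
    for path in product(range(len(alphabet)), repeat=len(prob_steps)):
        if collapse([alphabet[i] for i in path]) == label:
            path_prob = 1.0
            for step, index in zip(prob_steps, path):
                path_prob *= step[index]
            total += path_prob
    return -math.log(total)

# Toy case: two time steps, alphabet of "a" and blank, uniform probabilities.
toy_alphabet = ["a", BLANK]
toy_steps = [[0.5, 0.5], [0.5, 0.5]]

# Paths "aa", "a-" and "-a" all decode to "a": 3 * 0.25 = 0.75.
loss = ctc_loss_brute_force(toy_steps, toy_alphabet, "a")
print(round(loss, 4))  # 0.2877
```

During training the network is pushed to lower this value, which means raising the total probability of all paths that decode to the ground truth string.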
A slightly more complex NN architecture is used in our implementation. The architecture is the following, but the main principles are still the same.
After training the model, we apply it to images from the test set and get really high accuracy. We also visualize the probability distributions from each RNN step as a matrix. Here is an example.
The rows of this matrix correspond to all alphabet symbols plus "blank". The columns correspond to RNN steps.
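To show how such a matrix is laid out, here is a tiny Python sketch that turns per-step probability vectors into a matrix with one row per symbol and one column per step, and prints it as text. The three-symbol alphabet and the probabilities are made-up toy values, and the notebook draws the matrix as an image rather than as text.

```python
ALPHABET = ["a", "b", "-"]   # hypothetical alphabet; "-" stands for blank

# One probability vector per RNN step (toy numbers).
prob_steps = [
    [0.8, 0.1, 0.1],
    [0.1, 0.1, 0.8],
    [0.2, 0.7, 0.1],
    [0.1, 0.8, 0.1],
]

# Transpose: rows become alphabet symbols, columns become RNN steps.
matrix = [list(row) for row in zip(*prob_steps)]

def render(matrix, alphabet):
    """Format the matrix as text, one labeled row per alphabet symbol."""
    lines = []
    for symbol, row in zip(alphabet, matrix):
        cells = " ".join(f"{p:.1f}" for p in row)
        lines.append(f"{symbol} | {cells}")
    return "\n".join(lines)

print(render(matrix, ALPHABET))
# a | 0.8 0.1 0.2 0.1
# b | 0.1 0.1 0.7 0.8
# - | 0.1 0.8 0.1 0.1
```

Reading down each column and picking the largest value gives the most probable symbol at that step.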
We are happy to share our experience with the community. We believe that the video lectures, this tutorial, ready-to-use artificial data and source code will help you get a basic intuition, and that everyone can build a modern OCR system from scratch.
Feel free to ask any questions! Thank you!