Hello world. This tutorial is a gentle introduction to building a modern text recognition system using deep learning in 15 minutes. It will teach you the main ideas of how to use Keras and Supervisely for this problem. This guide is for anyone who is interested in using deep learning for text recognition in images but has no idea where to start.

We are going to consider a simple real-world example: number plate recognition. This is a good starting point, and you can easily customize it for your own task. A simple tutorial on how to detect number plates can be found here.

When we dove into this field, we faced a lack of materials on the internet. Through long research and reading many papers we developed an understanding of the main principles behind creating effective recognition systems. We have shared that understanding with the community in two small video lectures (part1 and part2) that explain how it works in plain language. We feel this content is extremely valuable, because it is hard to find a nice and simple explanation of how to build modern recognition systems. We highly recommend watching them before you start, because they will give you a lot of intuition about the topic.

To follow this tutorial without problems, you will need Ubuntu, a GPU and Docker.

All sources are available at github. The source code is located in a single Jupyter notebook with comments and useful visualizations.

Where to get training data?

For this tutorial we have generated an artificial dataset of more than 10K images that are very similar to real number plates. The images look like this. You can easily get this dataset from Supervisely. Let us say a few words about it. We at DeepSystems do a lot of computer vision development, such as self-driving cars, road defect detection, a receipt recognition system, and so on. As data scientists we spend a lot of time working with training data: creating custom image annotations, merging our data with public datasets, making data augmentations and so on.
Supervisely simplifies the way you work with training data and automates many routine tasks. We believe you’ll find it useful in your everyday work.

The first step is to register in Supervisely. The next step is to go to the “Import” -> “Datasets library” tab and click on the “anpr_ocr” project. After that, type the name “anpr_ocr” and click the “Next” button. Then click the “Upload” button. That’s all. Now the project “anpr_ocr” is added to your account. It consists of two datasets: “train” and “test”.

If you want to preview images, just click on a dataset and you will instantly get into the annotation tool. For each image we have a text description that will be used as ground truth to train our system. To view it, click the small icon opposite the selected image (marked in red).

Now we have to download the data in a specific format. To do so, open the “DTL” page and insert this config into the text area. It will look like this.

In the screenshot above you can see the scheme illustrating the export steps. We will not dig into technical details (you can read the documentation if needed) but will try to explain the process below. In our “anpr_ocr” project we have two datasets. The “test” dataset is exported as is (all images are tagged as “test”). The “train” dataset is split into two sets, “train” and “val”: a random 95 percent of the images are tagged as “train”, and the remaining 5 percent as “val”.

Now you can click the “Start” button and wait about two minutes while the system prepares the archive for download. Click “DTL” -> “Task status” -> “Three vertical dots” -> “Download” to get the training data (marked in red).

Let’s start our experiment

We prepared all you need in our git repository.
Clone it with the following commands:

    git clone https://github.com/DeepSystems/supervisely-tutorials.git
    cd supervisely-tutorials/anpr_ocr

The directory structure will be the following:

    .
    ├── data
    ├── docker
    │   ├── build.sh
    │   ├── Dockerfile
    │   └── run.sh
    └── src
        ├── architecture.png
        ├── export_config.json
        └── image_ocr.ipynb

Put the downloaded zip archive into the “data” directory and run the command below:

    unzip <archive name>.zip -d .

In my case the command was:

    unzip anpr_ocr.zip -d .

Now let’s build and run the Docker container with the prepared working environment (TensorFlow and Keras). Go to the “docker” directory and run the following commands:

    ./build.sh
    ./run.sh

After that you will be inside the container. Run the next command to start Jupyter notebook:

    jupyter notebook

In the terminal you will see something like this. You have to copy the selected link and paste it into a web browser. Note that your link will be slightly different from mine.

The last step is to run the whole “image_ocr.ipynb” notebook. Click “Cell” -> “Run all”. The notebook consists of a few main parts: data loading and visualisation, model training, and model evaluation on the test set. On average, training on this dataset takes around 30 minutes.

If everything goes well, you’ll see the following output. As you can see, the predicted string is the same as the ground truth. Thus we have constructed a modern OCR system in one pretty clear Jupyter notebook. In the next chapter of this tutorial we will cover and explain the main principles of how it works.

How it works

For us, understanding the neural network architecture is the key. Please don’t be lazy and take 15 minutes to watch our small and simple video lecture giving a high-level overview of the NN architecture, which was mentioned at the beginning. It will give you a general understanding. If you have already done so, bravo! :-)

Here I will try to give you a short explanation. The high-level overview is the following. First, the image is fed to a CNN to extract image features.
The next step is to apply a Recurrent Neural Network to these features, followed by a special decoding algorithm. This decoding algorithm takes the LSTM outputs from each time step and produces the final labeling.

The detailed architecture is the following (FC: fully connected layer, SM: softmax layer). The image has the following shape: height equals 64, width equals 128 and the number of channels equals three. As you have seen before, we feed this image tensor to the CNN feature extractor, and it produces a tensor with shape 4×8×4. We put the image “apple” onto the feature tensor so you can understand how to interpret it. Height equals 4, width equals 8 (these are the spatial dimensions) and the number of channels equals 4. Thus we transform the input image with 3 channels into a 4-channel tensor. In practice the number of channels should be much larger, but we constructed a small demo network so that everything fits on the slide.

Next we do a reshape operation. After that we obtain a sequence of 8 vectors of 16 elements each. We feed these 8 vectors to the LSTM network and get its output, which is also a sequence of vectors of 16 elements. Then we apply a fully connected layer followed by a softmax layer and get a vector of 6 elements. This vector contains the probability distribution over the alphabet symbols at each LSTM step.

In practice, the number of CNN output vectors can reach 32, 64 or more. The choice depends on the specific task. In production it is also better to use a multilayered bidirectional LSTM. This simple example, however, explains only the most important concepts.

But how does the decoding algorithm work? On the diagram above we have eight vectors of probabilities, one for each LSTM time step. Let’s take the most probable symbol at each time step. As a result we obtain a string of eight characters, one most probable letter per time step. Then we have to glue all consecutive repeating characters into one. In our example the two “e” letters are glued into a single one.
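
As a rough illustration of the reshape and decoding steps described above (including the blank handling that the following paragraph explains), here is a minimal NumPy sketch. It is not the code from the notebook: the function names, the six-symbol alphabet and the `-` blank marker are assumptions made up for this example, and the probabilities are hand-built one-hot vectors rather than real network outputs.

```python
import numpy as np

# Hypothetical alphabet for illustration: five letters plus a blank marker,
# which gives the 6-element softmax vector mentioned in the text.
ALPHABET = ["a", "e", "l", "p", "s", "-"]
BLANK_INDEX = len(ALPHABET) - 1


def features_to_sequence(features):
    """Turn a (height, width, channels) tensor into `width` vectors.

    Each image column becomes one time step holding height * channels
    values, e.g. (4, 8, 4) -> (8, 16).
    """
    height, width, channels = features.shape
    return features.transpose(1, 0, 2).reshape(width, height * channels)


def greedy_ctc_decode(probs, alphabet, blank_index):
    """Greedy decoding of a (time_steps, num_symbols) probability matrix."""
    # Most probable symbol at each time step.
    best = probs.argmax(axis=1)
    # Glue consecutive repeating symbols into one.
    collapsed = [k for i, k in enumerate(best) if i == 0 or k != best[i - 1]]
    # Remove all blank symbols.
    return "".join(alphabet[k] for k in collapsed if k != blank_index)


# Shape walk-through with a dummy CNN feature tensor.
features = np.zeros((4, 8, 4))
sequence = features_to_sequence(features)
print(sequence.shape)  # (8, 16)

# Hand-built one-hot "probabilities" for the eight-step path a p p - p l e e.
path = "app-plee"
probs = np.eye(len(ALPHABET))[[ALPHABET.index(ch) for ch in path]]
print(greedy_ctc_decode(probs, ALPHABET, BLANK_INDEX))  # apple
```

Without the blank between the second and third “p”, the glue step would merge all three and the result would be “aple”.
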
A special blank character allows us to split symbols that are repeated in the original labeling. We added the blank symbol to the alphabet to teach our neural network to predict a blank between such repeated symbols. Then we remove all blank symbols. Look at the illustration below.

When we train our network we replace the decoding algorithm with a CTC Loss layer. It is explained in our second video lecture. Right now it is available only in Russian, sorry about that. But the good news is that we have English slides and we will publish an English version soon.

A slightly more complex NN architecture is used in our implementation. The architecture is the following, but the main principles are still the same.

After training the model, we apply it to images from the test set and get really high accuracy. We also visualize the probability distributions from each RNN step as a matrix. Here is an example. The rows of this matrix correspond to all alphabet symbols plus “blank”. The columns correspond to RNN steps.

Conclusion

We are happy to share our experience with the community. We believe that the video lectures, this tutorial, the ready-to-use artificial data and the source code will help you get a basic intuition, and that everyone can build a modern OCR system from scratch. Feel free to ask any questions! Thank you!

🔥 Latest Deep Learning OCR with Keras and Supervisely in 15 minutes
