I have played with the Keras official example image_ocr.py for a while and want to share my takeaways in this post. The official example only covers training the model and misses the prediction part; my final source code is available both on my GitHub as well as in a runnable Google Colab notebook. More technical detail of OCR (optical character recognition), including the model structure and the CTC loss, will also be explained briefly in the following sections.

OCR task declaration

The input is an image containing a single line of text, and the text can be located anywhere in the image. The task for the model is to output the actual text given this image. For example,

OCR example input & output

The official image_ocr.py example source code is quite long and may look daunting. It can be broken down into several parts:

- The generator for the training samples. This part of the source code generates vivid text images resembling scanned documents, with artificial speckles, random text locations, and a variety of fonts.
- The model callback that saves the model weights and visualizes the performance of the current model on some generated text images after each training epoch.
- The model construction and training part. We will elaborate more on this part in the next section.

Model structure

The model input is image data. We first feed the data to two convolutional networks to extract the image features, followed by a Reshape and a Dense layer to reduce the dimensions of the feature vectors before letting the bidirectional GRU process the sequential data. The sequential data fed to the GRU is the horizontally divided image features. The final output Dense layer transforms the output for a given image into an array with the shape (32, 28), representing (# of horizontal steps, # of character labels).

Base model structure

And here is the part of the code to construct the Keras model.
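As a rough sketch of that construction (the exact filter counts and the way the GRU directions are merged may differ slightly from the official image_ocr.py, so treat the numbers below as assumptions): two Conv2D/MaxPooling2D blocks downsample the 128x64 image by a factor of 4, a Reshape plus Dense shrinks each of the 32 horizontal steps to a small feature vector, bidirectional GRUs process the sequence, and a final softmax Dense maps each step to the 28 labels.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

img_w, img_h = 128, 64   # input image size used by the example
num_classes = 28         # a-z, space, and the CTC blank token

inputs = layers.Input(shape=(img_w, img_h, 1), name="image")

# Two convolutional blocks; each MaxPooling2D halves both dimensions,
# so the total downsample factor is 4.
x = layers.Conv2D(16, (3, 3), padding="same", activation="relu")(inputs)
x = layers.MaxPooling2D(pool_size=(2, 2))(x)
x = layers.Conv2D(16, (3, 3), padding="same", activation="relu")(x)
x = layers.MaxPooling2D(pool_size=(2, 2))(x)

# (128/4, 64/4, 16) -> 32 horizontal steps, each a 16*16 feature vector.
x = layers.Reshape(target_shape=(img_w // 4, (img_h // 4) * 16))(x)
x = layers.Dense(32, activation="relu")(x)  # cut down the feature dimension

# Bidirectional GRUs read the 32 horizontal steps as a sequence.
x = layers.Bidirectional(layers.GRU(512, return_sequences=True),
                         merge_mode="sum")(x)
x = layers.Bidirectional(layers.GRU(512, return_sequences=True),
                         merge_mode="concat")(x)

# Per-timestep softmax over the 28 labels -> output shape (None, 32, 28).
outputs = layers.Dense(num_classes, activation="softmax")(x)
model = Model(inputs, outputs)
```

The key property to verify is the output shape: 32 timesteps by 28 label probabilities, matching the (32, 28) array described above.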
CTC Loss

As we can see in the example image, the text can be located anywhere. How does the model align the input with the output to locate each character in the image and turn them into text? That is where CTC comes into play; CTC stands for connectionist temporal classification.

input -> softmax output

Notice that the output of the model has 32 timesteps, but the output text might not have 32 characters. The CTC cost function allows the RNN to generate output like:

Output sequence - "a game" with CTC blanks inserted

CTC introduces the "blank" token, which itself doesn't translate into any character; what it does is separate individual characters so that we can collapse repeated characters that are not separated by a blank. So the decoded output for the previous sequence will be "a game". Let's take a look at another example, the text "speed".

Output sequence - "speed"

According to the decoding principle, we first collapse repeating characters that are not separated by the blank token, and then we remove the blank tokens themselves. Notice that if there were no blank token to separate the two "e"s, they would be collapsed into one.

In Keras, the CTC decoding can be performed in a single function, K.ctc_decode. The out argument is the model output, which consists of 32 timesteps of 28 softmax probability values, one for each of the 28 tokens: a~z, space, and the blank token. We set the greedy parameter to perform the greedy search, which means the function will only return the most likely output token sequence.

Alternatively, if we want the CTC decoder to return the top N most likely output sequences, we can ask it to perform beam search with a given beam width. One thing worth mentioning, if you are new to the beam search algorithm: the top_paths parameter can be no greater than the beam_width parameter, since the beam width tells the beam search algorithm exactly how many top results to keep track of while iterating over all the timesteps.
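Both decoding modes can be sketched as follows; the model output here is random data of the right shape, just to show how K.ctc_decode is called:

```python
import numpy as np
from tensorflow.keras import backend as K

# Stand-in for the model output: batch of 1, 32 timesteps,
# 28 softmax probabilities per step (a-z, space, blank).
out = np.random.rand(1, 32, 28).astype("float32")
out = out / out.sum(axis=-1, keepdims=True)

# Greedy search: returns only the most likely token sequence.
decoded, log_probs = K.ctc_decode(out, input_length=np.ones(1) * 32,
                                  greedy=True)

# Beam search: track 5 candidates per step, return the top 3 sequences.
decoded_beam, _ = K.ctc_decode(out, input_length=np.ones(1) * 32,
                               greedy=False, beam_width=5, top_paths=3)

# Translate the numerical classes back to characters
# (indices 0-25 -> a-z, 26 -> space; padding values are skipped).
alphabet = "abcdefghijklmnopqrstuvwxyz "
text = "".join(alphabet[int(c)] for c in decoded[0].numpy()[0]
               if 0 <= c < len(alphabet))
```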
Right now the output of the decoder is a sequence of tokens, and we just need to translate the numerical classes back to characters.

So far we have only talked about the decoding part of CTC. You may wonder how the model is trained with the CTC loss. Computing the CTC loss requires more than the true labels and the predicted outputs of the model; it also needs the output sequence length and the length of each of the true labels:

- y_true: a sample could look like [0, 1, 2, 3, 4, 26, 25], which stands for the text sequence 'abcde z'.
- y_pred: the output of the softmax layer; a sample has the shape (32, 28), i.e. 32 timesteps and 28 categories ('a'-'z', space, and the blank token).
- input_length: the output sequence length, img_w // downsample_factor - 2 = 128 / 4 - 2 = 30; the 2 stands for the first two discarded RNN output timesteps, since the first couple of outputs of the RNN tend to be garbage.
- label_length: the length of the y_true sample, which is 7 for the previous one.

In Keras, the CTC loss is packaged in one function, K.ctc_batch_cost.

Conclusion

Checkpoint results after training the model for 25 epochs.

Model performance check after 25 training epochs

If you have read this far and experimented along on the Google Colab, you should now have a Keras OCR demo running. If you are still eager for more information about CTC and beam search, feel free to check out the following resources.

- Sequence Modeling With CTC: an in-depth elaboration of the CTC algorithm and other applications where CTC can be applied, such as speech recognition, lip reading from video, and so on.
- Coursera Beam search video lecture. Quick and easy to understand.

Don't forget to get the source code from my GitHub as well as the runnable Google Colab notebook.

Originally published at www.dlology.com.
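The four inputs described above map directly onto the arguments of K.ctc_batch_cost; the sketch below uses a random softmax output as a stand-in for y_pred, just to show the shapes involved:

```python
import numpy as np
from tensorflow.keras import backend as K

batch_size, timesteps, num_classes = 1, 30, 28  # 30 = 128 // 4 - 2

# y_true: the label sequence [0,1,2,3,4,26,25] encodes "abcde z".
y_true = np.array([[0, 1, 2, 3, 4, 26, 25]], dtype="float32")

# y_pred: stand-in softmax output with shape (batch, timesteps, classes).
y_pred = np.random.rand(batch_size, timesteps, num_classes).astype("float32")
y_pred = y_pred / y_pred.sum(axis=-1, keepdims=True)

# One (length) entry per sample, each with shape (batch, 1).
input_length = np.array([[timesteps]])  # usable RNN output timesteps
label_length = np.array([[7]])          # length of the y_true sample

loss = K.ctc_batch_cost(y_true, y_pred, input_length, label_length)
# loss has shape (batch_size, 1): one CTC loss value per sample.
```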