Reading 100K newspapers in 20 vernacular languages in India

Written by ashishkumar_96311 | Published 2018/06/25
Tech Story Tags: machine-learning | ai | fake-news | deep-learning | reading-100k-newspapers


A deep learning approach to digitizing vernacular newspapers: analyzing hyperlocal news across the country to identify issues and sentiment, and to combat fake news.

Facebook has an estimated user base of 270 million in India [more]. Facebook’s active users in India also skew young, with more than half of the country’s users below the age of 25.

A social listening tool built on social media alone is therefore going to be heavily biased toward this demographic; add to that the growing problem of unreliable web news.

Vernacular Newspapers

Local newspapers in India still enjoy huge readership: according to the Registrar of Newspapers for India [link], the claimed readership of all vernacular newspapers combined is 488 million, spread across more than 20 vernacular languages.

These local newspapers are extremely popular in their respective areas and act as hyperlocal news sources. They also cater to a different category of readers than social media does.

Fractional Readership across languages

A listening tool for governments and brands, built on recognizing the issues and sentiments in local newspapers, can therefore be an important cue in policy and strategy discussions. It can also be used to measure the effectiveness of offline campaigns.

Another important use case is fact-checking: suspected fake news on the web can be corroborated or refuted against digitized newspaper reports.

Digitizing Newspapers

Data Collection: Most daily newspapers publish e-papers, usually in PDF/PNG format. I collected 500 such page images across languages, with article bounding boxes marked. [example]
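To make the collection step concrete, here is a minimal sketch of rendering an e-paper PDF into page images ready for annotation. The pdf2image library (a wrapper around poppler) is my assumption for illustration; the article does not name a specific tool.

```python
# Sketch: convert a downloaded e-paper PDF into per-page PNGs for annotation.
# pdf2image is an assumed tool choice, not one named in the article.
from pdf2image import convert_from_path

def epaper_to_pngs(pdf_path, out_dir, dpi=200):
    """Render every page of an e-paper PDF as a PNG file."""
    pages = convert_from_path(pdf_path, dpi=dpi)
    paths = []
    for i, page in enumerate(pages):
        out_path = f"{out_dir}/page_{i:03d}.png"
        page.save(out_path, "PNG")
        paths.append(out_path)
    return paths
```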

Object Detection [NODM]: I collected article data from multiple languages. Using the TensorFlow Object Detection API and a pre-trained SSD model, I fine-tuned a news-article detector. The detected bounding boxes were classified into three classes: News, Advertisements, and Others. [more]

Article Detection
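For illustration, below is a hedged sketch of running inference with such a fine-tuned detector, assuming a TF 1.x frozen graph exported by the Object Detection API. The label map {1: news, 2: advertisement, 3: other} and the file paths are placeholders, not the article's actual artifacts.

```python
# Sketch: inference with a fine-tuned SSD article detector exported as a
# TF 1.x frozen graph by the TensorFlow Object Detection API.
import numpy as np
import tensorflow as tf
from PIL import Image

LABELS = {1: "news", 2: "advertisement", 3: "other"}  # assumed label map

def detect_articles(frozen_graph_pb, image_path, min_score=0.5):
    graph = tf.Graph()
    with graph.as_default():
        graph_def = tf.GraphDef()
        with tf.gfile.GFile(frozen_graph_pb, "rb") as f:
            graph_def.ParseFromString(f.read())
        tf.import_graph_def(graph_def, name="")

    image = np.array(Image.open(image_path).convert("RGB"))
    with tf.Session(graph=graph) as sess:
        boxes, scores, classes = sess.run(
            ["detection_boxes:0", "detection_scores:0", "detection_classes:0"],
            feed_dict={"image_tensor:0": image[None, ...]})

    # Boxes are normalized [ymin, xmin, ymax, xmax]; keep confident ones.
    return [(LABELS[int(c)], box)
            for box, score, c in zip(boxes[0], scores[0], classes[0])
            if score >= min_score]
```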

OCR: I currently use the Google Cloud Vision OCR for text detection in an image. As Google's OCR [more] supports only a few Indian languages (Hindi, Marathi, Punjabi, Telugu, Malayalam, Tamil, Bengali, Assamese), a deep learning OCR can be trained for the remaining vernacular languages. [more]

Vernacular OCR
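A minimal sketch of the OCR call on a single article crop, using the Google Cloud Vision Python client (google-cloud-vision >= 2.0 call style); the Hindi language hint is just an example.

```python
# Sketch: OCR one detected article crop with the Google Cloud Vision API.
from google.cloud import vision

def ocr_article(image_path, language_hint="hi"):
    client = vision.ImageAnnotatorClient()
    with open(image_path, "rb") as f:
        image = vision.Image(content=f.read())
    # document_text_detection is tuned for dense text such as newsprint.
    response = client.document_text_detection(
        image=image,
        image_context={"language_hints": [language_hint]})
    if response.error.message:
        raise RuntimeError(response.error.message)
    return response.full_text_annotation.text
```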

Text Cleanup: The OCR output detects characters rather than words and returns a blob of characters, so a post-processing model is used to split the blob into its constituent words. I use a dynamic programming algorithm that maximizes the total reward over an existing dictionary of the language's most frequent words.

Text Cleanup
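Here is a minimal sketch of such a dynamic-programming segmenter, assuming a word-frequency dictionary and log-frequency as the per-word reward; the article's exact scoring may differ.

```python
# Sketch: split an unspaced OCR blob into dictionary words, maximizing
# total reward. Log-frequency as the reward is an assumption here.
import math

def segment(blob, freq, max_word_len=20):
    """freq: dict mapping word -> corpus count."""
    n = len(blob)
    best = [float("-inf")] * (n + 1)   # best[i] = max reward for blob[:i]
    best[0] = 0.0
    back = [0] * (n + 1)               # backpointer: start of last word
    for i in range(1, n + 1):
        for j in range(max(0, i - max_word_len), i):
            word = blob[j:i]
            if word in freq and best[j] > float("-inf"):
                reward = best[j] + math.log(freq[word])
                if reward > best[i]:
                    best[i], back[i] = reward, j
    if best[n] == float("-inf"):
        return None                    # blob cannot be fully segmented
    words, i = [], n
    while i > 0:
        words.append(blob[back[i]:i])
        i = back[i]
    return list(reversed(words))
```

For example, segment("thisisatest", {"this": 100, "is": 500, "a": 1000, "test": 50}) returns ["this", "is", "a", "test"].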

Once an article is in text form, a multitude of APIs are available to extract entities, sentiment, and so on. Examples using the Google APIs are shown below, followed by a sketch of the pipeline; Google's language-processing APIs expect the text to be translated to English first.

Translated text

Document Classification

Sentiment Analysis
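A sketch of the translate-then-analyze pipeline using the Google Cloud client libraries (google-cloud-translate v2 and google-cloud-language v1 call styles; exact signatures vary by library version):

```python
# Sketch: translate a vernacular article to English, then run sentiment
# analysis and document classification on it.
from google.cloud import translate_v2 as translate
from google.cloud import language_v1 as language

def analyze_article(vernacular_text):
    # 1. Translate to English, as the Language API usage here expects.
    translated = translate.Client().translate(
        vernacular_text, target_language="en")["translatedText"]

    client = language.LanguageServiceClient()
    doc = language.Document(
        content=translated, type_=language.Document.Type.PLAIN_TEXT)

    # 2. Document-level sentiment: score in [-1, 1], magnitude >= 0.
    sentiment = client.analyze_sentiment(
        request={"document": doc}).document_sentiment

    # 3. Classification into Google's predefined content categories.
    categories = client.classify_text(request={"document": doc}).categories

    return {
        "text": translated,
        "sentiment": (sentiment.score, sentiment.magnitude),
        "categories": [(c.name, c.confidence) for c in categories],
    }
```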

This is a prototype that uses Google APIs for OCR, translation, sentiment analysis, and classification. All of these models can be replaced with in-house ones to improve scalability across languages and unit economics.

Reach out to me on LinkedIn

