A deep learning approach to digitizing Vernacular Newspapers

Analyzing hyper-local news across the country to identify issues/sentiment and combat fake news

Facebook has an estimated user base of 270 million in India. Facebook's active users in India also skew young, with more than half of the country's users below the age of 25. A social listening tool built on social media is thus going to be highly biased; add to that the problem of the decreasing authenticity of web news.

Vernacular Newspapers

Local newspapers in India still enjoy huge readership: according to the 'Registrar of Newspapers for India', the claimed readership of all vernacular newspapers combined is 488 million, spread across more than 20 vernacular languages. These local newspapers are extremely popular in their respective areas and act as hyper-local news sources. They also cater to a different category of users than social media does.

[Image: Fractional readership across languages]

A listening tool for Governments/Brands built on recognizing the issues and sentiments of local newspapers can therefore be an important cue in policy/strategy discussions. It can also be used to measure the effectiveness of offline campaigns. An important use case can be to corroborate fake news on the web by using digitized newspapers.

Digitizing Newspapers

Most daily newspapers publish their e-papers, which are mostly in PDF/PNG format.

Data Collection: I collected 500 such e-paper images across multiple languages, with bounding boxes marked for each article.

Object detection [NODM]: Using the TensorFlow object detection code and a pre-trained SSD model, I fine-tuned a news article detection model. The bounding boxes for articles were classified into three classes: News, Advertisements, and Others. (A sketch of running inference with such a model is in the appendix below.)

[Image: Article Detection]

OCR: I currently use the Google Vision OCR for text detection in an image. As Google OCR works for only a few Indian languages (Hindi, Marathi, Punjabi, Telugu, Malayalam, Tamil, Bengali, Assamese), a deep learning OCR can be trained for the other vernacular languages. (See the appendix for an OCR sketch.)

Text Cleanup: The OCR output detects characters instead of words and emits a blob of characters, hence a post-processing model is used to clean the blob into its constituent words. I use a dynamic programming algorithm that maximizes the total reward against an existing dictionary of the most frequent words of the language. (A sketch of this segmentation algorithm is in the appendix.)

[Image: Text Cleanup]

Once the article is in text form, a multitude of APIs are available to extract entities, sentiments, etc. Examples of using the Google APIs are shown below; Google's language processing APIs expect the text to be translated to English first.

[Image: Translated text]
[Image: Document Classification]
[Image: Sentiment Analysis]

This is a prototype that uses Google APIs for OCR, Translation, Sentiment Analysis and Classification. All of these models can be replaced internally to help with scalability across languages and unit economics.

Reach out to me on LinkedIn.
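Appendix: Code Sketches

The sketches below illustrate the pipeline steps described above. They are illustrative rather than the exact code behind the prototype; all file paths, model names and thresholds are hypothetical. First, article detection: assuming the fine-tuned SSD has been exported as a SavedModel with the TensorFlow Object Detection API, inference over an e-paper page might look like this.

```python
import numpy as np
import tensorflow as tf
from PIL import Image

# Hypothetical path to an SSD fine-tuned on News/Advertisement/Other
# article boxes and exported via the TF Object Detection API.
detect_fn = tf.saved_model.load("exported_ssd/saved_model")
LABELS = {1: "News", 2: "Advertisement", 3: "Other"}

# Exported detection models expect a uint8 batch of shape [1, H, W, 3].
page = np.array(Image.open("epaper_page.png").convert("RGB"))
detections = detect_fn(tf.convert_to_tensor(page)[tf.newaxis, ...])

boxes = detections["detection_boxes"][0].numpy()    # normalised [ymin, xmin, ymax, xmax]
classes = detections["detection_classes"][0].numpy().astype(int)
scores = detections["detection_scores"][0].numpy()

# Keep only confident article boxes; these crops feed the OCR stage.
for box, cls, score in zip(boxes, classes, scores):
    if score > 0.5:
        print(LABELS.get(cls, "?"), round(float(score), 2), box)
```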
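Next, OCR on a cropped article image through the Google Cloud Vision client library. This assumes Google Cloud credentials are configured; the input file name is a placeholder for a crop produced by the detector.

```python
# pip install google-cloud-vision
from google.cloud import vision

def ocr_article(image_path: str) -> str:
    """Run Google Cloud Vision document OCR on a cropped article image."""
    client = vision.ImageAnnotatorClient()
    with open(image_path, "rb") as f:
        image = vision.Image(content=f.read())
    # document_text_detection is geared towards dense text such as newsprint.
    response = client.document_text_detection(image=image)
    if response.error.message:
        raise RuntimeError(response.error.message)
    return response.full_text_annotation.text

text = ocr_article("article_crop.png")  # hypothetical detector output
```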
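The text-cleanup step segments the OCR character blob back into words. Here is a minimal sketch of the dynamic programming idea, choosing the split that maximizes the total reward over a frequency dictionary; toy English words with made-up rewards stand in for a real vernacular dictionary.

```python
# Toy rewards; a real system would load corpus frequencies for the
# most common words of the target vernacular language.
WORD_REWARD = {"news": 5.0, "paper": 4.0, "newspaper": 12.0, "s": 0.1}

def segment(blob: str, max_word_len: int = 20) -> list:
    """Split an unspaced character blob into words so that the total
    dictionary reward is maximised (unknown substrings score zero)."""
    n = len(blob)
    best = [(-1.0, 0)] * (n + 1)   # best[i] = (reward of blob[:i], split point)
    best[0] = (0.0, 0)
    for i in range(1, n + 1):
        for j in range(max(0, i - max_word_len), i):
            reward = best[j][0] + WORD_REWARD.get(blob[j:i], 0.0)
            if reward > best[i][0]:
                best[i] = (reward, j)
    words, i = [], n               # walk the split points backwards
    while i > 0:
        j = best[i][1]
        words.append(blob[j:i])
        i = j
    return words[::-1]

print(segment("newspapers"))       # -> ['newspaper', 's']
```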
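Finally, translation followed by sentiment analysis and document classification. This sketch assumes the google-cloud-translate (v2 client) and google-cloud-language libraries; note that classify_text needs a reasonably long English input to return categories.

```python
# pip install google-cloud-translate google-cloud-language
from google.cloud import translate_v2 as translate
from google.cloud import language_v1

def analyze_article(vernacular_text: str) -> dict:
    """Translate to English, then run sentiment and classification."""
    # Google's language processing models work best on English text.
    english = translate.Client().translate(
        vernacular_text, target_language="en")["translatedText"]

    client = language_v1.LanguageServiceClient()
    doc = language_v1.Document(
        content=english, type_=language_v1.Document.Type.PLAIN_TEXT)

    sentiment = client.analyze_sentiment(
        request={"document": doc}).document_sentiment
    categories = client.classify_text(request={"document": doc}).categories

    return {
        "sentiment": (sentiment.score, sentiment.magnitude),
        "categories": [(c.name, c.confidence) for c in categories],
    }
```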