I am disseminating 2 datasets:
Kannada-MNIST dataset: 28 × 28 grayscale images: 60k Train | 10k Test
Dig-MNIST: 28 × 28 grayscale images: 10240 (1024 × 10) {See pic below}
The Kannada-MNIST dataset is meant to be a drop-in replacement for the MNIST dataset 🙏, albeit for the numeral symbols in the Kannada language.
Also, I am disseminating an additional dataset of 10k handwritten digits in the same language (written predominantly by non-native users of the language) called Dig-MNIST that can be used as an additional test set.
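To make the "drop-in replacement" idea concrete, here is a minimal sketch of a sanity check that an image/label pair follows the usual MNIST layout (28 × 28 grayscale images, ten digit classes). The function name `is_mnist_compatible` is hypothetical and the arrays are synthetic stand-ins; how the real files are loaded into memory is not covered here and is left as an assumption.

```python
import numpy as np

def is_mnist_compatible(images, labels, n_expected):
    """Return True if the arrays follow the usual MNIST layout."""
    images = np.asarray(images)
    labels = np.asarray(labels)
    return bool(
        images.shape == (n_expected, 28, 28)          # 28 x 28 grayscale images
        and labels.shape == (n_expected,)             # one label per image
        and labels.min() >= 0 and labels.max() <= 9   # ten digit classes, 0-9
    )

# Synthetic stand-in data (assumption: the real arrays are already in memory).
rng = np.random.default_rng(0)
fake_images = rng.integers(0, 256, size=(100, 28, 28), dtype=np.uint8)
fake_labels = rng.integers(0, 10, size=100)
print(is_mnist_compatible(fake_images, fake_labels, n_expected=100))
```

With the real data, the same check would be called with `n_expected=60000` for the training split and `n_expected=10000` for the test split.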
Resource-list:
GitHub 👉: https://github.com/vinayprabhu/Kannada_MNIST
Kaggle 👉: https://www.kaggle.com/higgstachyon/kannada-mnist
ArXiv 👉: https://arxiv.org/pdf/1908.01242.pdf
If you use Kannada-MNIST in a peer reviewed paper, we would appreciate referencing it as:
Prabhu, Vinay Uday. “Kannada-MNIST: A new handwritten digits dataset for the Kannada language.” arXiv preprint arXiv:1908.01242 (2019).
Bibtex entry:
@article{prabhu2019kannada,
title={Kannada-MNIST: A new handwritten digits dataset for the Kannada language},
author={Prabhu, Vinay Uday},
journal={arXiv preprint arXiv:1908.01242},
year={2019}
}
Kannada is the official and administrative language of the state of Karnataka in India, with nearly 60 million speakers worldwide. Also, as per articles 344(1) and 351 of the Indian Constitution, Kannada holds the status of being one of the 22 scheduled languages of India.
The language is written using the official Kannada script, which is an abugida of the Brahmic family and traces its origins to the Kadamba script (325–550 AD).
Kannada stone inscriptions: Source: https://karnatakaitihasaacademy.org/karnataka-epigraphy/inscriptions/
The numerals 0–9 are represented in the language by glyphs that are distinct from the modern Hindu-Arabic numerals in vogue in much of the world today.
Unlike some of the other archaic numeral systems, these numerals are very much used in day-to-day affairs in Karnataka, as is evinced by the prevalence of these glyphs on vehicle license plates, such as the one captured in the pic below:
Fig: A vehicle license plate with Kannada numeral glyphs
This figure below captures the MNIST-ized renderings of the variations of the glyphs across the following modern fonts: Kedage, Malige-i, Malige-n, Malige-b, Kedage-n, Malige-t, Kedage-t, Kedage-i, Lohit-Kannada, Sampige and Hubballi-Regular.
Kannada-MNIST:
65 volunteers, all native speakers of the language as well as day-to-day users of the numeral script, were recruited in Bangalore, India. Each volunteer filled out an A3 sheet containing a 32 × 40 grid. Each filled-out A3 sheet thus contains 128 instances of each numeral, which we posit is large enough to capture most of the natural intra-volunteer variations of the glyph shapes.
All of the sheets thus collected were scanned at 600 dots-per-inch resolution using the Konica Accurio-Press-C6085 scanner, which yielded 65 PNG images of 4963 × 3509 pixels each.
Volunteers helping curate the Kannada-MNIST dataset
For Dig-MNIST, 8 volunteers aged 20 to 40 were recruited to each generate a 32 × 40 grid of Kannada numerals (akin to the Kannada-MNIST sheets described above), all written with a black-ink Zebra Z-Grip Series pen on a commercial Mead Cambridge Quad Writing Pad (8-1/2" x 11", quad ruled, white, 80 sheets/pad).
We then scanned the sheets using a Dell S3845cdn scanner with the following settings:
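As a quick arithmetic check, the counts quoted so far fit together: a 32 × 40 grid has 1280 cells, which gives 128 instances of each of the 10 numerals per sheet, and 8 such sheets give the 10240 (1024 × 10) images of Dig-MNIST. The snippet below simply replays that arithmetic; the variable names are mine.

```python
GRID_DIMS = (32, 40)      # grid size quoted in the text
N_CLASSES = 10            # numerals 0-9
DIG_VOLUNTEERS = 8        # one sheet per Dig-MNIST volunteer

cells_per_sheet = GRID_DIMS[0] * GRID_DIMS[1]        # cells on one sheet
per_digit_per_sheet = cells_per_sheet // N_CLASSES   # instances of each numeral per sheet
dig_total = DIG_VOLUNTEERS * cells_per_sheet         # all Dig-MNIST images
dig_per_digit = dig_total // N_CLASSES               # Dig-MNIST images per numeral

print(cells_per_sheet, per_digit_per_sheet, dig_total, dig_per_digit)
# prints: 1280 128 10240 1024
```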
• Output color: Grayscale
• Original type: Text
• Lighten/Darken: Darken+3
• Size: Auto-detect
The reduced size of the sheets used for writing the digits (US-letter vis-a-vis A3) resulted in smaller scan (.tif) images that were all approximately 1600×2000.
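For readers who want to try something similar, here is a rough sketch of how a scanned sheet could be cut into individual grid cells with NumPy. This is not the exact pipeline used for the dataset. It assumes a scan that is 2000 pixels high and 1600 pixels wide with 40 rows and 32 columns (so each cell is 50 × 50 pixels), and it assumes the grid is perfectly aligned with the image borders, which a real scan would not be. The helper name `slice_grid` is hypothetical.

```python
import numpy as np

def slice_grid(scan, n_rows, n_cols):
    """Split a 2-D grayscale scan into an (n_rows, n_cols, h, w) array of equal cells."""
    cell_h = scan.shape[0] // n_rows
    cell_w = scan.shape[1] // n_cols
    # Crop any remainder pixels so the image divides evenly into cells.
    cropped = scan[: cell_h * n_rows, : cell_w * n_cols]
    # Reshape to (rows, cell_h, cols, cell_w), then move the two cell axes last.
    cells = cropped.reshape(n_rows, cell_h, n_cols, cell_w)
    return cells.transpose(0, 2, 1, 3)

# Synthetic blank scan (assumed orientation: 2000 high, 1600 wide).
scan = np.zeros((2000, 1600), dtype=np.uint8)
scan[50:100, 100:150] = 255          # paint the cell at row 1, column 2
cells = slice_grid(scan, n_rows=40, n_cols=32)
print(cells.shape)                   # (40, 32, 50, 50)
```

Each 50 × 50 cell would still need to be cleaned up and resized to 28 × 28 to match the MNIST format; that step is left out of this sketch.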
1: Mean pixel-intensities distribution:
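A minimal sketch of how such a distribution could be computed, assuming the images are held in a NumPy array of shape (N, 28, 28) with integer labels 0–9. The function name `mean_intensity_by_class` is hypothetical and the data below is synthetic, so this illustrates the idea rather than reproducing the figure.

```python
import numpy as np

def mean_intensity_by_class(images, labels, n_classes=10):
    """Return {class: array of per-image mean pixel intensities}."""
    # Flatten each image and average over its pixels.
    per_image = images.reshape(len(images), -1).mean(axis=1)
    return {c: per_image[labels == c] for c in range(n_classes)}

# Synthetic data: every image of class c is a constant image of value 10 * c.
labels = np.repeat(np.arange(10), 5)
images = np.ones((50, 28, 28)) * (labels * 10)[:, None, None]
by_class = mean_intensity_by_class(images, labels)
print(by_class[3])
```

Plotting a histogram of each entry of `by_class` would give one distribution per numeral class.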
2: Morphological properties:
Code source: https://github.com/dccastro/Morpho-MNIST
3: PCA-analysis:
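For reference, a PCA projection of the flattened images can be sketched in plain NumPy via the singular value decomposition. This is a generic illustration on random stand-in data, not the exact analysis behind the figure, and the helper name `pca_project` is hypothetical.

```python
import numpy as np

def pca_project(images, n_components=2):
    """Project flattened images onto their top principal components."""
    X = images.reshape(len(images), -1).astype(np.float64)  # astype makes a copy
    X -= X.mean(axis=0)                                      # centre each pixel
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:n_components].T

# Random stand-in for a batch of 28 x 28 images.
rng = np.random.default_rng(0)
fake_batch = rng.random((200, 28, 28))
proj = pca_project(fake_batch, n_components=2)
print(proj.shape)   # (200, 2)
```

A scatter plot of the two columns of `proj`, coloured by class label, gives the usual 2-D PCA view of a digit dataset.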
4: UMAP visualizations:
I used a standard MNIST-CNN architecture to get some basic accuracy benchmarks (see fig below).
The CNN architecture used for the benchmarks
We propose the following open challenges to the machine learning community at large.
[1] Prabhu, Vinay Uday, Sanghyun Han, Dian Ang Yap, Mihail Douhaniaris, Preethi Seshadri, and John Whaley. “Fonts-2-Handwriting: A Seed-Augment-Train framework for universal digit classification.” arXiv preprint arXiv:1905.08633 (2019). [ https://arxiv.org/abs/1905.08633 ]