Africa has over 2000 languages; however, these languages are not well represented in the existing natural language processing (NLP) ecosystem. One of the challenges is the lack of useful African language datasets that can be used to solve different social and economic problems.

In this article, I have compiled a list of African language datasets from across the web. These datasets can be used in numerous NLP tasks such as text classification, named entity recognition, machine translation, sentiment analysis, speech recognition, and topic modeling.

This collection of datasets has been made public to give you an opportunity to use your skills and help solve different challenges.

Text Classification

Text classification datasets are categorized or organized into different groups based on their contents. Below is the list of African language datasets for text classification.

1. Swahili News Dataset

The Swahili news dataset contains more than 31,000 news articles from different news categories such as local, international, business or financial, health, sports, and entertainment. Swahili is one of the most spoken languages in Africa; it is spoken by 100-150 million people across East Africa.

The data was collected from different news publication platforms inside and outside of Tanzania. The dataset can be used to develop a multi-class classification model that classifies news content according to its category. Such a model can be used by Swahili online news platforms to automatically group news by category and help readers find the specific news they want to read.

You can also download this dataset from the datasets Python library:

    from datasets import load_dataset

    dataset = load_dataset("swahili_news")

Note: The Swahili news dataset has an imbalanced category distribution. It contains few news articles in the following categories: International News (6.2%), Health News (4.9%), and Business News (4.3%).

2. Chichewa News Dataset

This dataset consists of news articles in Chichewa. Chichewa is a Bantu language spoken in much of Southern, Southeast, and East Africa, namely the countries of Malawi and Zambia, where it is an official language.

The dataset contains a collection of 3,482 articles, containing over 930,000 words and over 48,000 sentences. The Chichewa news articles have been categorized into 19 categories such as education, law/order, politics, culture, arts and crafts, farming, economy, and wildlife.

You can also download this dataset from the AI4D Malawi News Classification Zindi Challenge.

Named-entity Recognition

Named-entity recognition (NER) datasets are used to extract information by locating and classifying named entities mentioned in unstructured text. Examples of entities are person names, organizations, locations, times, and dates. NER is an essential component of numerous applications including spellcheckers, conversational agents, and localization of voice and dialogue systems. Below is the list of African language datasets for named-entity recognition.

3. Masakhane-ner Datasets

Masakhane is a grassroots NLP community for Africa, by Africans, with a mission to strengthen and spur NLP research in African languages. The community created the first large, publicly available, high-quality dataset for named entity recognition (NER) in ten African languages: Amharic, Hausa, Igbo, Kinyarwanda, Luganda, Luo, Naija Pidgin, Swahili, Wolof, and Yorùbá.

The community has published a research paper on this work, and the ten NER datasets are available for download.
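To make the NER task more concrete, here is a minimal sketch of how tagged tokens can be grouped into entities. It assumes the data uses CoNLL-style BIO tags (for example B-PER, I-PER, O), which is an assumption about the file format rather than something stated in this article. The function name `extract_entities` and the Swahili sample sentence are made up for illustration.

```python
from typing import List, Optional, Tuple


def extract_entities(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity_text, entity_type) pairs."""
    entities: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type: Optional[str] = None

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new entity, closing any open one.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens = [token]
            current_type = tag[2:]
        elif tag.startswith("I-") and current_tokens and tag[2:] == current_type:
            # An I- tag of the same type continues the open entity.
            current_tokens.append(token)
        else:
            # An "O" tag (or an I- tag with no matching open entity)
            # closes whatever entity was open and is otherwise skipped.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens = []
            current_type = None

    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities


# Hypothetical example sentence, not taken from the real dataset.
sample_tokens = ["Amina", "Juma", "anaishi", "Dar", "es", "Salaam"]
sample_tags = ["B-PER", "I-PER", "O", "B-LOC", "I-LOC", "I-LOC"]
print(extract_entities(sample_tokens, sample_tags))
```

A helper like this is useful for inspecting annotations or model predictions as readable entity spans instead of raw per-token tags.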
The Masakhane research paper on these NER datasets is titled "MasakhaNER: Named Entity Recognition for African Languages".

Machine Translation

Machine translation (MT) is the task of translating text or speech in a source language to a different target language. Machine translation can be used to translate large volumes of text quickly without any human input.

Machine translation datasets can be used to create MT models for different purposes, such as translating internal emails and other written or oral communication, or documentation and instructions for products or services. Below is the list of African language datasets for machine translation.

4. French to Ewe and French to Fongbe Machine Translation Dataset

This is a parallel corpus dataset for machine translation from French to Ewe and French to Fongbe. Fongbe and Ewe are Niger-Congo languages. Fongbe is spoken in Benin by approximately 4.1 million speakers, while Ewe is spoken in Togo and southeastern Ghana by approximately 4.5 million speakers.

This dataset contains roughly 23,000 French to Ewe and 53,000 French to Fongbe parallel sentences, collected from blogs, tales, newspapers, daily conversations, and webpages, and annotated for neural machine translation.

5. Yorùbá to English Machine Translation Dataset

This is a parallel sentence corpus dataset for machine translation from the Yorùbá language to the English language. Yorùbá is a Niger-Congo language spoken in West Africa (southwestern Nigeria). The number of Yorùbá speakers is estimated at between 45 and 55 million.

The dataset consists of 10,054 parallel Yorùbá-English sentences from different domains such as news, Yorùbá proverbs, movie transcripts, localization translation, and books.

6. English to Luganda Machine Translation Dataset

This is a parallel sentence corpus dataset for machine translation from the English language to the Luganda language. Luganda is a Bantu language and it is one of the major languages in Uganda.
It is spoken by more than 8.5 million Baganda and other people in Kampala (the capital city of Uganda).

The dataset consists of 15,022 parallel English-Luganda sentences. It was created by a team of researchers from the AI & Data Science Research Lab at Makerere University together with a team of Luganda teachers, students, and freelancers.

Sentiment Analysis

Sentiment analysis datasets are used for the interpretation and classification of emotions (positive, negative, and neutral) within text data using different text analysis methods. Sentiment analysis has found applications in various fields such as social media monitoring, brand monitoring, customer service, and market research. Below is the list of African language datasets for sentiment analysis.

7. Tunizi Dataset

Tunizi is the first Tunisian Arabizi sentiment analysis dataset. Tunisian Arabizi represents the Tunisian dialect written in Latin characters and numbers rather than Arabic letters.

iCompass gathered comments from social media platforms that express sentiment about popular topics. They extracted 100k comments using public streaming APIs.

The collected comments were manually annotated using an overall polarity: Positive (1), Negative (-1), and Neutral (0). The annotators were diverse in gender, age, and social background.

You can also download this dataset from the datasets Python library:

    from datasets import load_dataset

    dataset = load_dataset("tunizi")

Speech Recognition

Speech recognition, also known as Automatic Speech Recognition (ASR), can be defined as a technology that analyzes human speech and formulates an output, often a written transcription, in real time. The process is sometimes referred to as "speech to text." Don't confuse this with voice recognition, which only seeks to identify an individual user's voice. Below is the list of African language datasets for speech recognition.

8. Speech Recognition Dataset in Wolof

Wolof is a language of Senegal, the Gambia, and Mauritania. It is spoken by more than 10 million people, and about 40 percent of Senegal's population (approximately 5 million people) speak Wolof as their native language.

The ASR dataset has a total of 6,683 audio files and transcriptions. It was created by a team of researchers from Baamtu Datamation, a company in Senegal.

9. Speech Recognition Dataset in Kinyarwanda

Kinyarwanda is a Bantu language and an official language of Rwanda. It is spoken by at least 12 million people in Rwanda, the eastern Democratic Republic of the Congo, and southern Uganda.

The dataset was created by 895 speakers of different genders and ages on the Common Voice platform. The dataset has a total of 1,183 hours of validated speech. The current dataset size is 40 GB.

Topic Modeling

Topic modeling uses unsupervised learning techniques to extract the main topic or set of topics that occur in a collection of text documents. Below is the list of African language datasets for topic modeling.

10. South African News Dataset

This is a news dataset from South Africa. The news data were collected from SABC4 Facebook pages. The SABC is the public broadcaster of South Africa.

The dataset contains news headlines (i.e., short text) in the Setswana and Sepedi languages.
Setswana is a Bantu language spoken in Southern Africa by about 8.2 million people, while Sepedi is mainly spoken in the northern parts of South Africa by 4.7 million people.

Since the dataset is not annotated, you can use it to create a topic model that clusters news data into different news topics such as sports, politics, culture, and entertainment.

Final Thoughts on African Language Datasets

I hope you found this list of African language datasets useful and that you can use them in your next data science project. I will be happy to see what applications and solutions you create from these datasets.

If you couldn't find the dataset you need, please check out the following resources: Zenodo: African Natural Language Processing (AfricaNLP), and GitHub: Masakhane.

Congratulations 👏👏, you have made it to the end of this article! I hope you have learned something new that will help you on your next data science project.

If you learned something new or enjoyed reading this article, please share it so that others can see it. Until then, see you in the next post!

You can also find me on Twitter @Davis_McDavid.
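As a postscript to the topic modeling suggestion for the South African news dataset, here is a rough sketch of clustering unlabeled headlines. It uses a small hand-written k-means over TF-IDF vectors with cosine similarity, which is a simple stand-in for a real topic model such as LDA, not the method used by the dataset's authors. The function names and the English placeholder headlines are assumptions made up for illustration; real input would be the Setswana and Sepedi headlines.

```python
import math
from collections import Counter
from typing import Dict, List

Vector = Dict[str, float]


def tfidf_vectors(docs: List[str]) -> List[Vector]:
    """Build sparse TF-IDF vectors using whitespace tokenization."""
    tokenized = [doc.lower().split() for doc in docs]
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    n_docs = len(docs)
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append(
            {t: c * (math.log(n_docs / doc_freq[t]) + 1.0) for t, c in counts.items()}
        )
    return vectors


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mean_vector(vectors: List[Vector]) -> Vector:
    """Average a non-empty list of sparse vectors."""
    total: Vector = {}
    for vec in vectors:
        for term, weight in vec.items():
            total[term] = total.get(term, 0.0) + weight
    return {term: weight / len(vectors) for term, weight in total.items()}


def cluster_headlines(docs: List[str], k: int, max_iter: int = 20) -> List[int]:
    """Assign each headline a cluster id in range(k)."""
    if not 1 <= k <= len(docs):
        raise ValueError("k must be between 1 and the number of documents")
    vectors = tfidf_vectors(docs)

    # Deterministic seeding: start with the first headline, then repeatedly
    # pick the headline least similar to the seeds chosen so far.
    seeds = [0]
    while len(seeds) < k:
        candidates = [i for i in range(len(docs)) if i not in seeds]
        seeds.append(
            min(candidates, key=lambda i: max(cosine(vectors[i], vectors[s]) for s in seeds))
        )
    centroids = [dict(vectors[s]) for s in seeds]

    labels = [-1] * len(docs)
    for _ in range(max_iter):
        new_labels = [
            max(range(k), key=lambda c: cosine(vec, centroids[c])) for vec in vectors
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = [vectors[i] for i, lab in enumerate(labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = mean_vector(members)
    return labels


# Placeholder headlines for illustration only.
headlines = [
    "team wins football match",
    "football team loses final match",
    "parliament passes budget vote",
    "president addresses parliament budget",
]
print(cluster_headlines(headlines, k=2))
```

On real data you would likely want proper tokenization for each language and a library implementation, but the same idea applies: represent each headline as a vector, then group similar vectors and inspect the top terms of each group to name the topic.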

10 Best African Language Datasets for Data Science Projects
