Swahili (also known as Kiswahili) is one of the most widely spoken languages in Africa, with 100–150 million speakers across East Africa. It is widely used as a second language across the African continent and is taught in schools and universities. In Tanzania, it is the national language and, alongside English, one of the country's two official languages.
News in Swahili is an important part of the media sphere in Tanzania and other countries in East Africa. News contributes to education, technology, and the economic growth of a country, and news in local languages plays an important cultural role in many African countries.
In the modern age, African languages in news and other spheres are at risk of being lost as English becomes the dominant language in online spaces.
Open-source Swahili text datasets are rarely available, which means Tanzania risks being left behind in the creation of NLP technologies that solve African challenges.
The goal of this project was to build an open-source Swahili text dataset focused on news articles. I mainly focused on collecting news in different categories, such as local, international, business/finance, health, sports, and entertainment.
The dataset is open-source, so NLP practitioners can access it and learn from it.
To achieve this objective, I implemented the project in the following phases.
(a) Collect websites with Swahili news
The first phase of the project was to find and collect different websites that provide news in the Swahili language. Some of the websites I found publish news in Swahili only, while others publish in several languages, including Swahili.
(b) Understand policies and copyright
In this phase, I focused on understanding each website's policies and copyright terms, i.e. what I could and could not do with its content. AI4D helped me understand this process by providing Data Protection Guidelines to consider for data collection and data mining.
(c) Understand the structure of the news websites
Each news website is built with different web technologies, such as PHP, Python, WordPress, Django, JavaScript, etc. The main task was to analyze each website's source code using a web browser tool (view page source). I looked at the different HTML tags to find news titles, categories, and the links that lead to the full content of each title.
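As a rough sketch of this step (the URL and the tag names below are placeholders, since every site I scraped used its own markup), inspecting a page with Python might look like this:

from bs4 import BeautifulSoup
import requests

# Placeholder URL; each news site has its own address and markup.
page = requests.get("https://example.com/habari")
soup = BeautifulSoup(page.text, "html.parser")

# Assume headlines are anchors inside <h2> tags; adjust per site.
for heading in soup.find_all("h2"):
    link = heading.find("a")
    if link is not None:
        print(link.get_text(strip=True), "->", link.get("href"))

The same kind of inspection had to be repeated for every site, since titles, categories, and links live in different tags on each one.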
(d) Data Collection
News articles were collected using different tools and programming languages. These tools are as follows:
The collected news articles were saved in a CSV file containing the content (text) and the category (label) of each news item, e.g. sports.
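As an illustration of this step (the example rows below are mine, not actual dataset records, and the column names are assumptions), writing articles to such a CSV file in Python could look like this:

import csv

# Illustrative (content, category) rows; the real rows were scraped articles.
articles = [
    ("Timu ya taifa imefuzu kucheza fainali ...", "sports"),
    ("Bei ya mafuta imepanda wiki hii ...", "business"),
]

with open("swahili_news.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["content", "category"])  # text and label columns
    writer.writerows(articles)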
(e) Analyzing and Cleaning
The collected news articles were analyzed and cleaned to remove irrelevant information, such as HTML tags and symbols picked up during the scraping process.
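A minimal sketch of such a cleaning function is shown below; the exact rules I applied differed per site, so the regular expressions here are illustrative assumptions.

import re
from html import unescape

def clean_text(raw: str) -> str:
    text = unescape(raw)                       # decode entities such as &amp;
    text = re.sub(r"<[^>]+>", " ", text)       # remove leftover HTML tags
    text = re.sub(r"[^\w\s.,!?']", " ", text)  # remove stray symbols
    return re.sub(r"\s+", " ", text).strip()   # normalise whitespace

print(clean_text("<p>Habari za leo &amp; karibu!</p>"))  # Habari za leo karibu!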
At the end of this project, I was able to achieve the following milestones:
The main challenge is the imbalance of the collected news across categories. For example, there are relatively few articles in the international, business, and health categories.
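A quick way to see this imbalance is to count the labels in the collected CSV file (the column name follows the sketch above and is an assumption):

import pandas as pd

df = pd.read_csv("swahili_news.csv")
print(df["category"].value_counts())  # number of articles per topic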
The dataset is available in two different versions. The first version (v0.1) was released on December 1, 2020, and you can download it from the Zenodo platform here.
Another way is to use the datasets Python library from Hugging Face.
from datasets import load_dataset
dataset = load_dataset("swahili_news")
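Once loaded, you can inspect the available splits and a sample record; the exact feature names (I expect a text field and a label field) are worth checking against the version you download:

print(dataset)               # shows the splits and the number of rows in each
print(dataset["train"][0])   # one record with its text and label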
The second version (v0.2) of the dataset was released on September 18, 2021; it contains both train and test sets for topic classification. You can download it from the Zenodo platform here.
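If you download the v0.2 files from Zenodo, you can load the train and test sets with pandas; the file names below are placeholders, so adjust them to match the actual files in the release:

import pandas as pd

train_df = pd.read_csv("train.csv")  # placeholder file name
test_df = pd.read_csv("test.csv")    # placeholder file name
print(train_df.shape, test_df.shape)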
I'm also planning to make sure this version is available in the datasets Python library for easy access.
The collected news dataset has an imbalanced topic distribution: as noted above, it contains relatively few articles on international, business, and health news.
Therefore, my plan is to find more Swahili-language news sources and collect more articles on these topics in order to bring more balance among the news topics in the dataset.
This will help AI practitioners to create useful machine learning models that perform well in test environments.
If you learned something new or enjoyed reading this article, please share it so that others can see it. Until then, see you in the next post!
You can also find me on Twitter @Davis_McDavid.
And you can read more articles like this here.