Text classification datasets are used to categorize natural language texts according to content. Examples include classifying news articles by topic, or classifying book reviews as positive or negative. Text classification is also helpful for language detection, organizing customer feedback, and fraud detection. Though time-consuming when done manually, this process can be automated with machine learning models, which saves companies time and surfaces useful insights from their data.
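To make the idea of automating classification concrete, here is a minimal sketch of a word-count (naive Bayes) classifier in plain Python. The four training sentences are made up for illustration and do not come from any dataset in this list, and the function names `train` and `predict` are my own. A real project would use one of the datasets below and an established library.

```python
import math
from collections import Counter, defaultdict

def train(examples):
    """Count words per label from (text, label) pairs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, label_counts

def predict(text, word_counts, label_counts):
    """Return the label with the highest log-probability (add-one smoothing)."""
    vocab = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        # Start from the log prior: how common this label is in training.
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            # Add-one smoothing keeps unseen words from zeroing the score.
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy, made-up training data for illustration only.
examples = [
    ("a wonderful and moving story", "positive"),
    ("great characters and a great plot", "positive"),
    ("dull plot and a boring story", "negative"),
    ("a terrible and boring book", "negative"),
]
model = train(examples)
prediction = predict("a great and moving book", *model)
print(prediction)
```

With only four training sentences this is a toy, but the same shape (labeled text in, predicted label out) applies to every dataset listed here.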
Below, I’ve compiled datasets from across the web, including product reviews, online content evaluation, news classification, and dataset repositories. I hope this list provides a comprehensive look at available open-source datasets, and a starting point for machine learning projects!
Recommender Systems Datasets: This repository collects recommender systems datasets that have been used in the research of Julian McAuley, an associate professor in the computer science department at UCSD. The datasets include social networks, product reviews, social circles data, and question/answer data.
TREC Data Repository: The Text REtrieval Conference was started with the purpose of supporting research in the information retrieval community. Their data repository is a collection of research papers related to NLP with their corresponding datasets. Datasets include news articles, question/answer sets, spam, and more. Please note: the website is quite old and sometimes difficult to navigate, but the datasets are there for those willing to dig!
Kaggle Text Classification Datasets: Kaggle is home to code and data for data science work, and contains 19,000 public datasets for a variety of use cases. There’s no shortage of text classification datasets here! Still, you’ll want to utilize their search and sorting functions to narrow your search to exactly what you’re looking for. Kaggle also hosts competitions with monetary prizes to encourage specific text classification projects and research.
GroupLens Datasets: GroupLens is a research lab specializing in recommender systems, online communities, mobile and ubiquitous technologies, digital libraries, and geographic information systems. Available datasets include rating data from the MovieLens website, recommendation data from WikiLens, book ratings from BookCrossing, and more.
OpinRank Review Dataset: This dataset contains two sets of reviews: hotel reviews from TripAdvisor, and car reviews from Edmunds. The TripAdvisor data includes 259,000 hotel reviews in 10 cities around the world, with roughly 80 to 700 hotels in each city. The Edmunds car review data covers 2007 to 2009, and includes dates, author names, and full textual reviews.
Large Movie Review Dataset: Released by the Stanford AI Laboratory, this text classification dataset contains 25,000 highly polar movie reviews for training and another 25,000 for testing. The dataset is useful for sentiment analysis experiments. It also includes unlabeled data which can be used for further training or testing.
Twitter US Airline Sentiment Dataset: This dataset contains a collection of Twitter data in which contributors classified tweets as positive, negative, or neutral. The reasons behind negative tweets were also categorized under labels such as “late flight” or “rude service”. In total there are around 15,000 tweets across six airlines.
Stop Clickbait Dataset: This dataset was used in a paper titled “Stop Clickbait: Detecting and Preventing Clickbaits in Online News Media”. It contains 16,000 article headlines categorized as “clickbait” and “non-clickbait”. The clickbait headlines were pulled from websites including Buzzfeed and Upworthy, while the non-clickbait headlines come from sites including Wikinews, The New York Times, and The Guardian.
Spambase Dataset: Spambase is a spam email database with 4,601 email messages, of which 1,813 are spam. The dataset is useful for constructing a personal spam filter, but the authors also state that a wider collection of data would be needed to build a general-purpose spam filter.
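Those two counts are enough to work out the class balance, which is worth checking before training anything. The short sketch below uses only the figures quoted above; the "always predict non-spam" baseline is my own illustration, not something from the dataset's authors.

```python
# Counts quoted in the Spambase description above.
total_messages = 4_601
spam_messages = 1_813

non_spam_messages = total_messages - spam_messages
spam_fraction = spam_messages / total_messages

# A filter that labels every message "not spam" would be right this often,
# so a trained model should beat this accuracy to be worth using.
majority_baseline = max(spam_messages, non_spam_messages) / total_messages

print(f"non-spam messages: {non_spam_messages}")
print(f"spam fraction: {spam_fraction:.1%}")
print(f"majority-class baseline accuracy: {majority_baseline:.1%}")
```

In other words, roughly two in five messages are spam, and a do-nothing filter already scores about 61% accuracy on this data.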
Hate Speech and Offensive Language Dataset: This dataset was originally used to research hate-speech detection by separating hate-speech from other instances of offensive language on social media. The text was taken from tweets, and each tweet is classified as containing hate-speech, containing only offensive language, or containing neither. Please note: due to the nature of the research, the dataset includes text that is racist, sexist, homophobic, and offensive.
The Blog Authorship Corpus: The Blog Authorship Corpus is a collection of 681,288 posts gathered from blogger.com in 2004. The posts are written by 19,320 bloggers, and in total the dataset contains more than 140 million words. This text categorization dataset is useful for sentiment analysis, summarization, and other NLP-based machine learning experiments.
AG’s News Topic Classification Dataset: The AG’s News Topic Classification dataset is based on the AG dataset, a collection of 1,000,000+ news articles gathered from more than 2,000 news sources by an academic news search engine. It is built from the 4 largest classes of the AG corpus, with 30,000 training samples and 1,900 testing samples per class. In total, that gives 120,000 training samples and 7,600 testing samples.
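The per-class and total figures fit together if the 30,000 and 1,900 counts are read as per class. A quick check, using only the numbers quoted above:

```python
# Figures from the AG's News description above, read as per-class counts.
num_classes = 4
train_per_class = 30_000
test_per_class = 1_900

total_train = num_classes * train_per_class
total_test = num_classes * test_per_class

print(f"training samples: {total_train:,}")
print(f"testing samples: {total_test:,}")
```

Because every class contributes the same number of samples, the dataset is balanced, which makes plain accuracy a reasonable first metric.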
Reuters Text Categorization Dataset: This dataset contains 21,578 documents that appeared on the Reuters newswire in 1987. The dataset is split into a training set of 13,625 documents and a testing set of 6,188 documents. Each document is tagged according to date, topic, place, people, organizations, companies, and more.
The 20 Newsgroups Dataset: The 20 Newsgroups Dataset is a popular dataset for experimenting with text applications of machine learning techniques, including text classification. The dataset collates approximately 20,000 newsgroup documents partitioned across 20 different newsgroups, each corresponding to a different topic. The website offers three versions of the dataset for slightly different purposes.