14 Open Datasets for Text Classification in Machine Learning

Text classification datasets are used to categorize natural language texts according to their content. Examples include classifying news articles by topic, or classifying book reviews as positive or negative. Text classification is also helpful for language detection, organizing customer feedback, and fraud detection. Though time-consuming when done manually, this process can be automated with machine learning models, which saves companies time while also providing valuable data insights.

Below, I’ve compiled datasets from across the web, including product reviews, online content evaluation, news classification, and dataset repositories. I hope it provides a comprehensive look at available open-source datasets, and a starting point for machine learning projects!

Text Classification Dataset Repositories

Recommender Systems Datasets: This dataset repository contains a collection of recommender systems datasets that have been used in the research of Julian McAuley, an associate professor in the computer science department at UCSD. The datasets contain social networks, product reviews, social circles data, and question/answer data.

TREC Data Repository: The Text REtrieval Conference was started with the purpose of supporting research in the information retrieval community. Its data repository is a collection of research papers related to NLP with their corresponding datasets. Datasets include news articles, question/answer sets, spam, and more. Please note: the website is quite old and sometimes difficult to navigate, but the datasets are there for those willing to dig!

Kaggle Text Classification Datasets: Kaggle is home to code and data for data science work, and contains 19,000 public datasets for a variety of use cases. There’s no shortage of text classification datasets here! Still, you’ll want to use the search and sorting functions to narrow things down to exactly what you’re looking for. Kaggle also hosts competitions with monetary prizes to encourage specific text classification projects and research.

GroupLens Datasets: GroupLens is a research lab specializing in recommender systems, online communities, mobile and ubiquitous technologies, digital libraries, and geographic information systems. Available datasets include rating data from the MovieLens website, recommendation data from WikiLens, book ratings from BookCrossing, and more.

Review Datasets

Opin-Rank Review Dataset: This dataset contains two sets of reviews: hotel reviews from TripAdvisor and car reviews from Edmunds. The TripAdvisor data includes 259,000 hotel reviews in 10 cities around the world, with roughly 80 to 700 hotels in each city. The Edmunds car review data covers 2007 to 2009, and includes dates, author names, and full textual reviews.

Large Movie Review Dataset: From the Stanford AI Laboratory, this text classification dataset contains 25,000 highly polar movie reviews for training and another 25,000 for testing. The dataset is useful for sentiment analysis experiments. It also includes unlabeled data which can be used for further training or testing.

Twitter US Airline Sentiment Dataset: This dataset contains a collection of Twitter data in which contributors classified tweets as positive, negative, or neutral. Reasons for negative sentiment were also categorized under titles such as “late flight” or “rude service”. In total there are around 15,000 tweets across six airlines.

Online Content Evaluation Datasets

Stop Clickbait Dataset: This dataset was used in a paper titled “Stop Clickbait: Detecting and Preventing Clickbaits in Online News Media”. It contains 16,000 article headlines categorized as “clickbait” and “non-clickbait”. The clickbait headlines were pulled from websites including Buzzfeed and Upworthy, while the non-clickbait headlines come from sites including Wikinews, The New York Times, and The Guardian.

Spambase Dataset: Spambase is a spam email database with 4,601 email messages, of which 1,813 (about 39%) are spam. The dataset is useful for constructing a personal spam filter, but the authors also state that a wider collection of data is necessary for attempting a general-purpose spam filter.

Hate Speech and Offensive Language Dataset: This dataset was originally used to research hate-speech detection by separating hate speech from other instances of offensive language on social media. The text was taken from tweets, and each tweet is classified as containing hate speech, containing only offensive language, or containing neither. Please note: due to the nature of the task, the dataset contains content that is racist, sexist, homophobic, and offensive.

The Blog Authorship Corpus: The Blog Authorship Corpus is a collection of 681,288 posts gathered from blogger.com in 2004. The posts are written by 19,320 bloggers, and in total the dataset contains more than 140 million words. This text categorization dataset is useful for sentiment analysis, summarization, and other NLP-based machine learning experiments.

News Datasets

AG’s News Topic Classification Dataset: The AG’s News Topic Classification dataset is based on the AG dataset, a collection of 1,000,000+ news articles gathered from more than 2,000 news sources by an academic news search engine. This dataset contains 30,000 training samples and 1,900 testing samples from each of the 4 largest classes of the AG corpus, for a total of 120,000 training samples and 7,600 testing samples.

Reuters Text Categorization Dataset: This dataset contains 21,578 Reuters documents that appeared on the Reuters newswire in 1987. The dataset is split into a training set of 13,625 and a testing set of 6,188. Each document is tagged according to date, topic, place, people, organizations, companies, and more.

The 20 Newsgroups Dataset: The 20 Newsgroups Dataset is a popular dataset for experimenting with text applications of machine learning techniques, including text classification. The dataset collates approximately 20,000 newsgroup documents partitioned across 20 different newsgroups, each corresponding to a different topic. The website offers three versions of the dataset for slightly different purposes.

Also published on: https://lionbridge.ai/datasets/14-best-text-classification-datasets-for-machine-learning/
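
To give a rough idea of what automating text classification with a machine learning model can look like, here is a minimal sketch of a sentiment classifier. It is an illustration under stated assumptions, not a method taken from any of the datasets in this article: the choice of multinomial naive Bayes is mine, the names `tokenize` and `NaiveBayesTextClassifier` are hypothetical, and the four training sentences are made up for the example. It uses only the Python standard library.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayesTextClassifier:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.label_counts = Counter(labels)      # documents per label
        self.word_counts = defaultdict(Counter)  # word frequencies per label
        self.vocabulary = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        return self

    def predict(self, text):
        tokens = tokenize(text)
        total_docs = sum(self.label_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, doc_count in self.label_counts.items():
            # Log prior: share of training documents with this label.
            score = math.log(doc_count / total_docs)
            total_words = sum(self.word_counts[label].values())
            for token in tokens:
                # Add-one smoothing keeps unseen words from zeroing the score.
                count = self.word_counts[label][token]
                score += math.log((count + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up toy reviews, for illustration only (not drawn from any dataset above).
train_texts = [
    "a wonderful film with a great cast",
    "great story and wonderful acting",
    "a boring plot and terrible acting",
    "terrible film, boring from start to finish",
]
train_labels = ["positive", "positive", "negative", "negative"]

classifier = NaiveBayesTextClassifier().fit(train_texts, train_labels)
prediction = classifier.predict("wonderful cast and a great story")
print(prediction)  # prints: positive
```

With a real dataset, the toy lists would be replaced by the texts and labels loaded from the files, and a held-out test split would be used to measure accuracy. Four sentences are far too few to say anything about how well the approach works in practice.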
