In this post, I want to share a list of Reddit datasets that gained a lot of traction on social media when it was first posted. Known as “the front page of the internet,” Reddit is part forum, part social media site, where users can post virtually anything. Unlike Facebook, Twitter, or Instagram, the majority of Reddit users remain anonymous. Reddit moderators strictly censor and curate the subforums, known as subreddits. However, anonymity allows people to say what they want in whatever manner they wish. Therefore, Reddit comments and posts are well suited to testing and training natural language processing (NLP) models.

Warning: Some of the datasets below were compiled specifically for the training of content moderation models. Therefore, the data may include explicit content.

Reddit Comments Datasets

1. Cryptocurrency Reddit Comments Dataset: This dataset contains comments from the subreddit r/cryptocurrency. The data consists of comments posted over five months, from November 2017 to March 2018.

2. Donald Trump Comments on Reddit: A simple dataset containing thousands of comments crawled from Reddit that mention Donald Trump.

3. Reddit Comment Score Prediction: This dataset was built to help create a model that can predict whether a Reddit comment will receive upvotes or downvotes. The dataset includes 4 million Reddit comments: 2 million poor-performing (downvoted) and 2 million high-performing (upvoted).

Reddit News Datasets

4. Daily News for Stock Market Prediction: As the title suggests, this dataset was originally made to create models that could predict stock market fluctuations. The data consists of news crawled from r/worldnews from June 2008 to July 2016, as well as Dow Jones Industrial Average stock data.

5. World News on Reddit: Taken from the r/worldnews subreddit, this dataset contains information about all of the news posted on the subreddit dating back to 2008. For each post, the dataset includes the date created, upvotes and downvotes, title, author, and whether or not the news contains mature content.

Other Data from Reddit

6. Reddit’s Top 1000: This dataset contains the top 1,000 posts of all time, ranked by upvotes, from 18 subreddits. For each post, the CSV files contain the title of the post and the username of the poster. The number of upvotes and downvotes, subreddit name, URL, and other metadata are also included.

7. Reddit Usernames: A simple dataset containing a CSV file of 26 million Reddit usernames. The dataset also includes the total number of comments each user has made.

8. SARC: Self-Annotated Reddit Corpus for Sarcasm: This dataset consists of over 1.3 million sarcastic comments and posts crawled from Reddit. The dataset creator has labeled the sarcasm in each statement. The username of the poster, the topic, and the context are also included with each statement.

9. Science and Tech Acronyms from Reddit: This dataset contains over 140,000 acronyms found on subreddits about science, biology, technology, and futurology. The data is a CSV file that includes the comment ID, time, username, subreddit name, and the acronym mentioned.

10. Things on Reddit (products): This product dataset is a collection of the top 100 Amazon products from every subreddit that has ever posted an Amazon product from 2015 to 2017. Each CSV file in the dataset includes the name of the product, its category, and the URL of the product. The total mentions on Reddit and total subreddit mentions are also included.

The datasets above could be used to help train sentiment analysis models, text classifiers, predictive models, and other NLP algorithms. For more datasets, please view our related resources.

Also published on: https://lionbridge.ai/datasets/top-10-reddit-datasets-for-machine-learning/

Lead image via Erik Mclean on Unsplash

10 Best Reddit Datasets for NLP and Other ML Projects
