While shaping the idea of your data science project, you probably dreamed of writing variants of algorithms, estimating model performance on training data, and discussing prediction results with colleagues . . . But before you live the dream, you not only have to get the right data, you also must check if it’s labeled according to your task. Even if you don’t need to collect specific data, you can spend a good chunk of time looking for a dataset that will work best for the project.
Thousands of public datasets on different topics — from top fitness trends and beer recipes to pesticide poisoning rates — are available online. To spend less time on the search for the right dataset, you must know where to look for it.
This article is aimed at helping you find the best publicly available dataset for your machine learning project. We’ve grouped the article sections according to dataset sources, types, and a number of topics.
So, let’s dive deep into this ocean of data.
While you can find separate portals that collect datasets on various topics, there are large dataset aggregators and catalogs that mainly do two things:
1. Provide links to other specific data portals. Examples of such catalogs are DataPortals and OpenDataSoft, described below. These services don’t directly provide access to data. Instead, they let users browse existing portals with datasets on a map and then use those portals to drill down to the desired datasets.
2. Aggregate datasets from various providers. This allows users to find health, population, energy, education, and many more datasets from open providers in one convenient place.
Let’s have a look at the most popular representatives of this group.
This website’s domain name says it all. DataPortals has links to 524 data portals around the globe.
The homepage contains a zoomable map, so specialists can easily find their portal of interest. They can also use the search panel or go to a page where data portals are listed and described.
Users can contribute to the meta-database, whether a contribution entails adding a new feature and data portal, reporting a bug on GitHub, or joining the project team as an editor.
The open data portals register by OpenDataSoft is impressive: the company’s team has gathered more than 2,600 of them. The list looks like an interactive map, similar to the one on DataPortals. Data portals are also grouped by country, so users can choose between two search options.
OpenDataSoft provides data management services by building data portals. With its platform, clients publish, maintain, process, and analyze their data.
Knoema is a search engine specifically designed for numeric data with limited metadata, the type of data specialists need for their machine learning projects. According to its representatives, Knoema has the biggest collection of publicly available data and statistics on the web. Users have access to nearly 2.5 billion time series on 1,000 topics obtained from more than 1,000 sources, and the information is updated daily.
Knoema provides efficient data exploration tools with datasets clustered by sources and topics. The search by topic can be narrowed down with Sources and Indicators filters. Datasets are also listed in alphabetical order.
Data scientists can study data online in tables and charts or download it as an Excel file, for example. However, the export isn’t free and requires a premium account.
Searching for a public dataset on data.gov, “the home of the US Government’s open data,” is fast and simple. Users are free to choose the appropriate dataset among more than 237,545 related to 14 topics. When looking for a dataset of a specific domain, users can apply extra filters like topic category, location, tags, file format, organizations and their types, and publishers.
Eurostat, the statistics office of the EU, provides high-quality stats about numerous industries and areas of life. Datasets are open and free of charge, so everyone can study them online via the data explorer or download them in TSV format.
The data navigation tree helps users find the way and understand data hierarchy. Databases and tables are grouped by themes, and some have metadata. There are also tables on EU policies, the ones grouped in cross-cutting themes. New and recently updated items are located in the corresponding folders.
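A downloaded TSV file can be read with Python's standard library alone. The sketch below is only an illustration: the column names and values are made-up placeholders rather than real statistics, and the layout of an actual downloaded file may differ.

```python
import csv
import io

# Made-up sample standing in for a downloaded TSV file (not real statistics).
SAMPLE_TSV = "country\tyear\tvalue\nAA\t2016\t10.5\nBB\t2016\t7.25\n"


def read_tsv(text):
    """Parse tab-separated text into a list of dicts keyed by the header row."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return list(reader)


rows = read_tsv(SAMPLE_TSV)
# Cells arrive as strings, so numeric columns need an explicit conversion.
values = [float(row["value"]) for row in rows]
```

For a file on disk, the same `csv.DictReader` call works on a file object opened with `newline=""`.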
If you want to get more data by state institutions, agencies, and bodies, you can surf such websites as the UK’s Office for National Statistics and Data.Gov.UK, European Data Portal, EU Open Data Portal, and OpenDataNI. Data portals of the Australian Bureau of Statistics, the Government of Canada, and the Queensland Government are also rich in open source datasets. Search engines at these websites are similar: Users can browse datasets by topics, and use filters and tags to narrow down the search.
Datasets that you can find within this source category can partly intersect with government and social data described above. However, here we focused mostly on science-related portals and datasets.
Those looking for research data may find this source useful. Re3Data contains information on more than 2,000 data repositories. The catalog developers paid attention to its usability. It allows for searching data repositories by subject, content type, country of origin, and “any combination of 41 different attributes.” Users can choose between graphical and text forms of subject search. Every repository is marked with icons providing a short description of its characteristics and explaining terms of access and use.
Research Pipeline is a Wikipedia-style website. If we were to describe this resource with a single phrase, it could be “everything in one place.” Users can look for free datasets, data processing software, data science-related content sources, or statistical organizations on its numerous Wiki pages. The data sources are grouped by topics and can also be reached through a search panel. Website creator Lyndie Chiou welcomes users to upload datasets and leave comments on the blog.
FAIRsharing is another place to hunt for open research data. With 1058 databases listed on the source, specialists have a big choice. Users can search for data among catalogs of databases and data use policies, as well as collections of standards and/or databases grouped by similarities.
Users can also specify the search by clicking on checkboxes with domains, taxonomies, countries of data origin, and the organizations that created it. To speed up the process, a user can select a record type.
A really useful way to look for datasets for machine learning is to turn to sources that data scientists themselves suggest. These datasets weren’t necessarily gathered by machine learning specialists, but they gained wide popularity due to their machine learning-friendly nature. Usually, data science communities share their favorite public datasets via popular engineering and data science platforms like Kaggle and GitHub.
DataHub is not only a data management and automation platform but also a community for data scientists. The project founders created the Awesome section with high-quality public datasets on various topics and dataset collections. Users have plenty of options to choose from: datasets for machine learning, and datasets about climate change, property prices, armed conflicts, well-being in the US, even football. Besides linking datasets by other providers, the DataHub team made 333 datasets available for download on the website.
The UCI Machine Learning Repository is one of the oldest collections of databases, domain theories, and test data generators on the Internet. The website (current version developed in 2007) contains 427 datasets, the oldest dated 1987, the year when machine learning practitioner David Aha and his graduate students created the repository as an FTP archive.
UCI allows for filtering datasets by the type of machine learning task, number of attributes and their types, number of instances, data type (i.e. time-series, multivariate, text), research area, and format type (matrix and non-matrix).
Most of the datasets are clean enough not to require additional preprocessing and can be used for model training right after the download. What’s also great about the UCI repository is that users don’t need to register prior to upload.
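Even with a dataset that is reported to be clean, a quick check for missing values before training is cheap. Here is a minimal sketch using only the standard library; the sample data is made up, and the set of strings treated as "missing" is an assumption that should be adjusted to each dataset's documentation.

```python
import csv
import io

# Strings treated as "missing" are an assumption; adjust them per dataset.
MISSING_MARKERS = frozenset({"", "?", "NA", "NaN"})

# Made-up sample standing in for a downloaded CSV file.
SAMPLE_CSV = "age,income,label\n34,52000,1\n?,48000,0\n29,,1\n"


def count_missing(text, missing_markers=MISSING_MARKERS):
    """Return a dict mapping each column name to its number of missing cells."""
    reader = csv.DictReader(io.StringIO(text))
    counts = {name: 0 for name in reader.fieldnames}
    for row in reader:
        for name in counts:
            cell = row.get(name)
            # A short row yields None for the columns it lacks.
            if cell is None or cell.strip() in missing_markers:
                counts[name] += 1
    return counts


missing = count_missing(SAMPLE_CSV)
```

Columns with a nonzero count are the ones that would need imputation or filtering before model training.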
data.world is a platform where data scientists can upload their data to collaborate with colleagues and other members, and search for data added by other community members (filters are also available).
data.world offers tools simplifying data processing and analysis. Users can write SQL and SPARQL queries to explore numerous files at once and join multiple datasets. The platform also provides SDKs for R and Python to make it easier to upload, export, and work with data.
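To give a feel for what joining datasets with SQL looks like, here is a minimal sketch that uses Python's built-in `sqlite3` module. It does not use data.world's query service or SDKs, and the table names, columns, and values are hypothetical.

```python
import sqlite3


def join_population_and_income():
    """Join two hypothetical tables on a shared key and return the rows."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE population (region TEXT, people INTEGER)")
        conn.execute("CREATE TABLE income (region TEXT, median_income REAL)")
        conn.executemany(
            "INSERT INTO population VALUES (?, ?)",
            [("north", 1200), ("south", 800), ("east", 500)],
        )
        conn.executemany(
            "INSERT INTO income VALUES (?, ?)",
            [("north", 41000.0), ("south", 36500.0)],
        )
        # Inner join: regions missing from either table are dropped.
        query = """
            SELECT p.region, p.people, i.median_income
            FROM population AS p
            JOIN income AS i ON i.region = p.region
            ORDER BY p.region
        """
        return conn.execute(query).fetchall()
    finally:
        conn.close()


joined = join_population_and_income()
```

Note that the "east" region disappears from the result because it has no matching income row, which is the kind of silent data loss worth checking for when combining datasets from different sources.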
It would be surprising if GitHub, a large community for software developers, didn’t have a page dedicated to datasets. Its Awesome Public Datasets list contains sources with datasets on 30 topics. Although most of the datasets won’t cost you a dime, be ready to pay for some of them. As contributors have to comply with format guidelines for the data they add to the Awesome list, its high quality and uniformity are guaranteed.
Kaggle, a place to go for data scientists who want to refine their knowledge and maybe participate in machine learning competitions, also has a dataset collection. Registered users can choose among 13,321 high-quality themed datasets.
A search box with filters (size, file types, licenses, tags) makes it easy to find the datasets you need. Another nifty feature: users can bookmark and preview the ones they like.
When it comes to working with data, there are two options. Users can download datasets or analyze them in Kaggle Kernels — a free platform that allows for running Jupyter notebooks in a browser — and share the results with the community.
The Kaggle team welcomes everyone to contribute to the collection by publishing their datasets.
A trusted site in scientific and business communities, KDnuggets, maintains a list of links to numerous data repositories with their brief descriptions. Data from international government agencies, exchanges, and research centers, data published by users on data science community sites — this collection has it all.
Reddit is a social news site with user-contributed content and discussion boards called subreddits. These boards are organized around specific subjects. Their members communicate with each other by sharing content related to their common interests, answering questions, and leaving feedback.
Browsing the Datasets subreddit is like rummaging through a treasure chest because you never know what unique dataset you may come across.
Datasets subreddit members write requests about datasets they are looking for, recommend sources of quality datasets, or publish the data they collected. All requests and shared datasets can be sorted as hot, new, rising, and top. There is also a wiki section and a search bar.
Media outlets generally gather a lot of social and political data for their work. Sometimes they share it with the public. We suggest looking at these two companies first.
The media company BuzzFeed shares public data, analytic code, libraries, and tools its journalists used in their investigative articles. They advise users to read the pieces before exploring the data to understand the findings better. Datasets are available on GitHub.
Journalists from FiveThirtyEight, famous for its sports pieces as well as news on politics, economics, and other spheres of life, also publish the data and code they gather in their work. Like BuzzFeed, FiveThirtyEight chose GitHub as a platform for dataset sharing.
Quandl is a source of financial and economic data. The main feature of this platform is that it also provides alternative or untapped data from “non-traditional publishers” that has “never been exposed to Wall Street.” Acquiring such data has become possible thanks to digitalization. Alternative data is generated from IoT. Analysis of transactional data can give valuable insights into consumer behavior.
Clients can filter datasets by type, region, publisher, accessibility, and asset class.
Quandl shares some free data, but most of it comes at a price. Registered users can choose a format for data they get. They can access data via API and the web interface.
The International Monetary Fund (IMF) and The World Bank share insights on the international economy. On the IMF website, datasets are listed alphabetically and classified by topics. The World Bank users can narrow down their search by applying such filters as license, data type, country, supported language, frequency of publication, and rating.
The World Health Organization (WHO) collects and shares data on global health for its 194 member countries under the Global Health Observatory (GHO) initiative.
Users can browse data by theme, category, indicator (e.g., the existence of a national child-restraint law under Road Safety), and country. The metadata section explains how data is organized. Datasets are available online or can be downloaded in CSV, HTML, Excel, JSON, and XML formats.
The CDC is a rich source of US health-related data. It maintains Wide-ranging OnLine Data for Epidemiologic Research (WONDER), a web application system aimed at sharing healthcare information with a general audience and medical professionals.
With CDC WONDER, users access public data hosted by different state sources, sorted alphabetically and by topic. Data can be used in desktop applications and is ready for download in CSV and Excel formats.
Medicare is another website with healthcare data. It hosts 143 datasets focused on a comparison of services provided by its health institutions.
The website of the US National Travel and Tourism Office is a trusted source of travel data.
It runs several statistical programs aimed at capturing the big picture of the US tourism industry. For example, the office provides the latest statistics on inbound and outbound travel, cities and states visited by overseas travelers, and so on. Users can save datasets as Excel and PDF files.
The Federal Highway Administration of the US Department of Transportation researches the nation’s travel preferences under the National Household Travel Survey (NHTS) initiative. The resulting insights show, for example, what vehicles Americans use when traveling, the correlation between family income and the number of vehicle trips, as well as trip length.
Survey data is available for online exploration and for downloading as CSV or SAS Transport files. Users can also work with it in dBase, SPSS, and SAS Windows binary applications.
Amazon hosts large public datasets on its AWS platform. Specialists can practice their skills on various data, for example financial, statistical, geospatial, and environmental.
Registered users can access and download data for free. However, AWS provides cloud-based tools for data analysis and processing (Amazon EC2, Amazon EMR, Amazon Athena, and AWS Lambda). Those who prefer to analyze datasets with these tools online are charged for the computational power and storage they used.
Google also shares open source datasets for data science enthusiasts. Datasets are stored in its cloud hosting service, Google Cloud Platform (GCP), and can be examined with the BigQuery tool. To start working with datasets, users must register a GCP account and create a project. While Google maintains storage of data and gives access to it, users pay for the queries they perform on it for analysis. The first terabyte of processed data per month is free, which sounds encouraging.
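The pricing rule described here comes down to simple arithmetic: only the data processed beyond the free monthly terabyte is billed. The sketch below assumes a per-terabyte price supplied by the caller; the function name and the example rate are hypothetical, so check Google's current pricing for the real figure.

```python
def monthly_query_cost(processed_tb, price_per_tb, free_tb=1.0):
    """Estimate a monthly query bill from terabytes processed.

    price_per_tb is a caller-supplied assumption, not an official rate.
    """
    if processed_tb < 0 or price_per_tb < 0 or free_tb < 0:
        raise ValueError("inputs must be non-negative")
    # Only usage above the free allowance is billable.
    billable_tb = max(0.0, processed_tb - free_tb)
    return billable_tb * price_per_tb


# Example with a hypothetical rate of 5.0 currency units per terabyte.
light_usage = monthly_query_cost(0.5, 5.0)
heavy_usage = monthly_query_cost(3.0, 5.0)
```

With these inputs, a month that stays under one terabyte costs nothing, while three terabytes are billed as two.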
Cloud provider Microsoft Azure has a list of public datasets adapted for testing and prototyping. As it provides descriptions and groups data by general topics, the search won’t take much time.
As so many owners share their datasets on the web, you may wonder how to start your search or struggle to make a good dataset choice.
When looking for specific data, first browse catalogs of data portals. Then decide what continent and country information must come from. Finally, explore data portals of that geographic area to pinpoint the right dataset.
You can speed up the search by surfing websites of organizations and companies that focus on researching a certain industry. If you’re interested in governmental and official data, you can find it on the numerous sources we mentioned in the corresponding section. Besides that, data science communities are good sources of quality user-contributed datasets and data collections from different publishers.
It’s important to consider the overall quality of published content and make extra time for dataset preparation if needed. Sources like data.gov, data.world, and Reddit contain datasets from multiple publishers, and they may lack citation and be collected according to different format rules.
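One common preparation step when combining data from several publishers is to bring column names to a single convention. The sketch below shows one possible normalization rule; the rule itself, the function names, and the sample rows are illustrative assumptions, not a standard.

```python
import re


def normalize_header(name):
    """Lowercase a column name and replace runs of non-alphanumerics with '_'."""
    return re.sub(r"[^a-z0-9]+", "_", name.strip().lower()).strip("_")


def normalize_rows(rows):
    """Return copies of the row dicts with normalized column names."""
    return [{normalize_header(k): v for k, v in row.items()} for row in rows]


# Hypothetical rows from two publishers that name the same columns differently.
publisher_a = [{"Country Name": "AA", "Year": "2016"}]
publisher_b = [{"country_name": "BB", "YEAR ": "2017"}]

combined = normalize_rows(publisher_a) + normalize_rows(publisher_b)
```

After normalization, both sources share the keys `country_name` and `year`, so the rows can be processed together. Differences in units, encodings, or date formats would still need separate handling.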
At the same time, data scientists note that most of the datasets at UCI, Kaggle, and Quandl are clean.
Open datasets on every possible topic are published on numerous data portals, included in data portal catalog listings, and shared by government agencies, private companies, and data science buffs.
Most of the information is free of charge, but some of it, especially financial and economic data, requires payment.
With an abundance of open data on the web, finding an appropriate dataset isn’t a problem. However, it’s important to consider the project type when choosing the dataset.
Originally published at AltexSoft’s blog: “Best Public Datasets for Machine Learning and Data Science: Sources and Advice on the Choice”