While shaping the idea of your data science project, you probably dreamed of writing variants of algorithms, estimating model performance on training data, and discussing prediction results with colleagues... But before you live the dream, you not only have to get the right data, you also must check if it’s labeled according to your task. Even if you don’t need to collect specific data, you can spend a good chunk of time looking for a dataset that will work best for the project.

Thousands of public datasets on different topics, from top fitness trends and beer recipes to pesticide poisoning rates, are available online. To spend less time on the search for the right dataset, you must know where to look for it.

This article is aimed at helping you find the best publicly available dataset for your machine learning project. We’ve grouped the article sections according to dataset sources, types, and a number of topics:

Catalogs of data portals and aggregators
Government and official data
Scientific research data
Verified datasets from data science communities
Political and social datasets from media outlets
Finance and economic data
Healthcare data
Travel and transportation data
Other sources

So, let’s dive deep into this ocean of data.

Catalogs of data portals and aggregators

While you can find separate portals that collect datasets on various topics, there are large dataset aggregators and catalogs that mainly do two things:

1. Provide links to other specific data portals. Examples of such catalogs are DataPortals and OpenDataSoft, described below. Such a service doesn’t directly provide access to data. Instead, it allows users to browse existing portals with datasets on the map and then use those portals to drill down to the desired datasets.

2. Aggregate datasets from various providers. This allows users to find health, population, energy, education, and many more datasets from open providers in one convenient place.
Let’s have a look at the most popular representatives of this group.

DataPortals: meta-database with 524 data portals

This website’s domain name says it all. DataPortals has links to 524 data portals around the globe. The homepage contains a zoomable map, so specialists can easily find their portal of interest. They can also use the search panel or go to a page where data portals are listed and described. Users can contribute to the meta-database, whether a contribution entails adding a new feature and data portal, reporting a bug on GitHub, or joining the project team as an editor.

OpenDataSoft: a map with more than 2600 data portals

The open data portals register by OpenDataSoft is impressive: the company team has gathered more than 2600 of them. The list looks like an interactive map, similar to the one on DataPortals. Data portals are also grouped by countries, so users can choose between two search options.

OpenDataSoft provides data management services by building data portals. With its platform, clients publish, maintain, process, and analyze their data.

Knoema: home to nearly 2.5-billion time series data of 1000 topics from more than 1000 sources

This search engine was specifically designed for numeric data with limited metadata, the type of data specialists need for their machine learning projects. Knoema has the biggest collection of publicly available data and statistics on the web, its representatives state. Users have access to nearly 2.5 billion time series on 1000 topics obtained from more than 1000 sources, with the information updated daily.

Knoema provides efficient data exploration tools with datasets clustered by sources and topics. The search by topic can be narrowed down with Sources and Indicators filters. Datasets are also listed in alphabetical order.

[Figure: Knoema’s data explorer]

Data scientists can study data online in tables and charts or download it as an Excel file, for example.
However, the export isn’t free and requires a premium account.

Government and official data

Data.gov: 237,545 datasets of 14 topics from the US government

Searching for a public dataset on data.gov, “the home of the US Government’s open data,” is fast and simple. Users are free to choose the appropriate dataset among more than 237,545 related to 14 topics. When looking for a dataset of a specific domain, users can apply extra filters like topic category, location, tags, file format, organizations and their types, and publishers.

[Figure: Various filters are available on data.gov]

Eurostat: open data from the EU statistical office

The statistics office of the EU provides high-quality stats about numerous industries and areas of life. Datasets are open and free of charge, so everyone can study them online via the data explorer or download them in TSV format.

The data navigation tree helps users find their way and understand the data hierarchy. Databases and tables are grouped by themes, and some have metadata. There are also tables on EU policies, grouped in cross-cutting themes. New and recently updated items are located in the corresponding folders.

[Figure: Data navigation tree of Eurostat database]

If you want to get more data from state institutions, agencies, and bodies, you can surf such websites as the UK’s Office for National Statistics and Data.Gov.UK, European Data Portal, EU Open Data Portal, and OpenDataNI. Data portals of the Australian Bureau of Statistics, the Government of Canada, and the Queensland Government are also rich in open source datasets. Search engines at these websites are similar: users can browse datasets by topics and use filters and tags to narrow down the search.

Scientific research data

Datasets that you can find within this source category can partly intersect with government and social data described above. However, here we focused mostly on science-related portals and datasets.
Re3data: 2000 research data repositories with flexible search

Those looking for research data may find this source useful. Re3Data contains information on more than 2,000 data repositories. The catalog developers paid attention to its usability. It allows for searching data repositories by subject, content type, country of origin, and “any combination of 41 different attributes.” Users can choose between graphical and text forms of subject search. Every repository is marked with icons providing a short description of its characteristics and explaining terms of access and use.

[Figure: Re3data provides two options of subject search]

Research Pipeline: Wiki pages with datasets and other data science-related content

Research Pipeline is a Wikipedia-style website. If we were to describe this resource with a single phrase, it could be “everything in one place.” Users can look for free datasets, data processing software, data science-related content sources, or statistical organizations on its numerous Wiki pages. The data sources are grouped by topics and can also be reached through a search panel. Website creator Lyndie Chiou welcomes users to upload datasets and leave comments on the blog.

FAIRsharing: “resource on data and metadata standards, inter-related to databases and data policies”

FAIRsharing is another place to hunt for open research data. With 1058 databases listed on the source, specialists have a big choice. Users can search for data among catalogs of databases and data use policies, as well as collections of standards and/or databases grouped by similarities.

Users can also refine the search by clicking on checkboxes with domains, taxonomies, countries of data origin, and the organizations that created it. To speed up the process, a user can select a record type.

Verified datasets from data science communities

A really useful way to look for datasets for machine learning is to turn to sources that data scientists suggest themselves.
These datasets weren’t necessarily gathered by machine learning specialists, but they gained wide popularity due to their machine learning-friendly nature. Usually, data science communities share their favorite public datasets via popular engineering and data science platforms like Kaggle and GitHub.

DataHub: high-quality datasets shared by data scientists for data scientists

DataHub is not only a data management and automation platform but also a community for data scientists. The project founders created the Awesome section with high-quality public datasets on various topics and dataset collections. Datasets for machine learning, datasets about climate change, property prices, armed conflicts, well-being in the US, even football: users have plenty of options to choose from. Besides linking datasets by other providers, the DataHub team made 333 datasets available for download on the website.

UCI Machine Learning Repository: one of the oldest sources with 427 datasets

It’s one of the oldest collections of databases, domain theories, and test data generators on the Internet. The website (current version developed in 2007) contains 427 datasets, the oldest dated 1987, the year when machine learning practitioner David Aha with his graduate students created the repository as an FTP archive.

UCI allows for filtering datasets by the type of machine learning task, number of attributes and their types, number of instances, data type (e.g., time-series, multivariate, text), research area, and format type (matrix and non-matrix).

Most of the datasets, clean enough not to require additional preprocessing, can be used for model training right after the download. What’s also great about the UCI repository is that users don’t need to register prior to upload.

data.world: open data community

data.world is the platform where data scientists can upload their data to collaborate with colleagues and other members, and search for data added by other community members (filters are also available).
data.world offers tools simplifying data processing and analysis. Users can write SQL and SPARQL queries to explore numerous files at once and join multiple datasets. The platform also provides SDKs for R and Python to make it easier to upload, export, and work with data.

GitHub: a list of awesome datasets made by the software development community

It would be surprising if GitHub, a large community for software developers, didn’t have a page dedicated to datasets. Its Awesome Public Datasets list contains sources with datasets of 30 topics. Although most of the datasets won’t cost you a dime, be ready to pay for some of them. As contributors have to comply with format guidelines for the data they add to the Awesome list, its high quality and uniformity are guaranteed.

Kaggle: 13,321 themed datasets on “Facebook for data people”

Kaggle, a place to go for data scientists who want to refine their knowledge and maybe participate in machine learning competitions, also has a dataset collection. Registered users can choose among 13,321 high-quality themed datasets.

A search box with filters (size, file types, licenses, tags) makes it easy to find needed datasets. Another nifty feature: users can bookmark and preview the ones they liked.

[Figure: Searching for datasets on Kaggle is simple]

When it comes to working with data, there are two options. Users can download datasets or analyze them in Kaggle Kernels, a free platform that allows for running Jupyter notebooks in a browser, and share the results with the community.

The Kaggle team welcomes everyone to contribute to the collection by publishing their datasets.

KDnuggets: a comprehensive list of data repositories on a famous data science website

A trusted site in scientific and business communities, KDnuggets, maintains a list of links to numerous data repositories with their brief descriptions.
Data from international government agencies, exchanges, and research centers, data published by users on data science community sites: this collection has it all.

Reddit: datasets and requests of data on a dedicated discussion board

Reddit is a social news site with user-contributed content and discussion boards called subreddits. These boards are organized around specific subjects. Their members communicate with each other by sharing content related to their common interests, answering questions, and leaving feedback.

Browsing the Datasets subreddit is like rummaging through a treasure chest because you never know what unique dataset you may come across. Datasets subreddit members write requests about datasets they are looking for, recommend sources of quality datasets, or publish the data they collected. All requests and shared datasets can be sorted as hot, new, rising, and top. There is also a wiki section and a search bar.

Political and social datasets from media outlets

Media outlets generally gather a lot of social and political data for their work. Sometimes they share it with the public. We suggest looking at these two companies first.

BuzzFeed: datasets and related content by a media company

The BuzzFeed media company shares public data, analytic code, libraries, and tools journalists used in their investigative articles. They advise users to read the pieces before exploring the data to understand the findings better. Datasets are available on GitHub.

FiveThirtyEight: datasets from data-driven pieces

Journalists from FiveThirtyEight, famous for its sports pieces as well as news on politics, economics, and other spheres of life, also publish data and code they gathered while they work. Like BuzzFeed, FiveThirtyEight chose GitHub as a platform for dataset sharing.

Finance and economic data

Quandl: Alternative Financial and Economic Data

Quandl is a source of financial and economic data.
The main feature of this platform is that it also provides alternative or untapped data from “non-traditional publishers” that has “never been exposed to Wall Street.” Acquiring such data has become possible thanks to digitalization. Alternative data is generated from IoT. Analysis of transactional data can give valuable insights into consumer behavior.

Clients can filter datasets by type, region, publisher, accessibility, and asset class.

Quandl shares some free data, but most of it comes at a price. Registered users can choose a format for data they get. They can access data via API and the web interface.

International Monetary Fund and The World Bank: International Economy Stats

The International Monetary Fund (IMF) and The World Bank share insights on the international economy. On the IMF website, datasets are listed alphabetically and classified by topics. The World Bank users can narrow down their search by applying such filters as license, data type, country, supported language, frequency of publication, and rating.

Healthcare data

World Health Organization: Global Health Records from 194 Countries

The World Health Organization (WHO) collects and shares data on global health for its 194 member countries under the Global Health Observatory (GHO) initiative.

Source users have options to browse for data by theme, category, indicator (e.g., existence of a national child-restraint law (Road Safety)), and by country. The metadata section allows for learning how data is organized. Datasets are available online or can be downloaded in CSV, HTML, Excel, JSON, and XML formats.

The Center for Disease Control (CDC): Searching for data is easy with an online database

The CDC is a rich source of US health-related data. It maintains Wide-ranging OnLine Data for Epidemiologic Research (WONDER), a web application system aimed at sharing healthcare information with a general audience and medical professionals.

With CDC WONDER, users access public data hosted by different state sources, sorted alphabetically and by topic. Data can be used in desktop applications and is ready for download in CSV and Excel formats.

Medicare: data from the US health insurance program

Medicare is another website with healthcare data. It hosts 143 datasets focused on a comparison of services provided by its health institutions.

Travel and transportation data

National Travel and Tourism Office: overview of the US travel landscape

The website of the US National Travel and Tourism Office is a trusted source of travel data. It runs several statistical programs aimed at capturing the big picture of the US tourism industry. For example, the office provides the latest statistics on inbound and outbound travel, cities and states visited by overseas travelers, and so on. Users can save datasets as Excel and PDF files.

Federal Highway Administration: US road transportation data

The Federal Highway Administration of the US Department of Transportation researches the nation’s travel preferences under the National Household Travel Survey (NHTS) initiative. The resulting insights show, for example, what vehicles Americans use when traveling, the correlation between family income and the number of vehicle trips, as well as trip length, etc.

Survey data is available for online exploration and for downloading as CSV and SAS Transport files. Users can also work with it in dBase, SPSS, and SAS Windows binary applications.

Other sources

Amazon Web Services: free public datasets and paid machine learning tools

Amazon hosts large public datasets on its AWS platform. Specialists can practice their skills on various data, for example financial, statistical, geospatial, and environmental.

Registered users can access and download data for free. However, AWS provides cloud-based tools for data analysis and processing (Amazon EC2, Amazon EMR, Amazon Athena, and AWS Lambda). Those who prefer to analyze datasets with these tools online are charged for the computational power and storage they used.

Google Public datasets: data analysis with the BigQuery tool in the cloud

Google also shares open source datasets for data science enthusiasts. Datasets are stored in its cloud hosting service, Google Cloud Platform (GCP), and can be examined with the BigQuery tool. To start working with datasets, users must register a GCP account and create a project. While Google maintains storage of data and gives access to it, users pay for the queries they perform on it for analysis. The first terabyte of processed data per month is free, which sounds attractive.

Cloud provider Microsoft Azure has a list of public datasets adapted for testing and prototyping. As it provides descriptions and groups data by general topics, the search won’t take much time.

Advice on the dataset choice

As so many owners share their datasets on the web, you may wonder how to start your search or struggle to make a good dataset choice.

When looking for specific data, first browse catalogs of data portals. Then decide what continent and country the information must come from. Finally, explore data portals of that geographic area to pinpoint the right dataset.

You can speed up the search by surfing websites of organizations and companies that focus on researching a certain industry. If you’re interested in governmental and official data, you can find it on numerous sources we mentioned in that section. Besides that, data science communities are good sources of quality user-contributed datasets and data collections from different publishers.

It’s important to consider the overall quality of published content and make extra time for dataset preparation if needed. Sources like data.gov, data.world, and Reddit contain datasets from multiple publishers, and they may lack citation and be collected according to different format rules.
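To get a rough feel for how much preparation a downloaded dataset may need, a quick first-pass check can help. The sketch below is only an illustration, not a method from any of the sources above: it uses Python's standard library alone, and the sample data, the `quality_report` function name, and the set of "missing" markers are all assumptions made up for this example.

```python
import csv
import io
from collections import Counter

# Hypothetical sample standing in for a downloaded CSV file; the column
# names and values are invented for illustration only.
SAMPLE_CSV = """country,year,value
Canada,2017,10.5
Canada,2017,10.5
Australia,2017,
Australia,2018,n/a
"""

# Strings treated as "missing"; this set is an assumption, extend it as needed.
MISSING_MARKERS = {"", "na", "n/a", "null", "-"}


def quality_report(csv_text):
    """Count rows, exact duplicate rows, and missing values per column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    columns = reader.fieldnames or []

    # Tally cells whose normalized content matches a missing-value marker.
    missing = Counter()
    for row in rows:
        for column in columns:
            value = (row.get(column) or "").strip().lower()
            if value in MISSING_MARKERS:
                missing[column] += 1

    # A row counts as a duplicate if an identical row appeared earlier.
    seen = Counter(tuple(row.get(c) for c in columns) for row in rows)
    duplicates = sum(count - 1 for count in seen.values())

    return {
        "rows": len(rows),
        "duplicate_rows": duplicates,
        "missing_by_column": {c: missing[c] for c in columns},
    }


if __name__ == "__main__":
    print(quality_report(SAMPLE_CSV))
```

For the invented sample, the report flags one duplicate row and two missing entries in the value column. A real file would be read from disk instead of a string, and datasets gathered from several publishers may need a longer list of missing-value markers.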
At the same time, data scientists note that most of the datasets at UCI, Kaggle, and Quandl are clean, so they typically need less preparation.

Final word

Open datasets on every possible topic are published on numerous data portals, included in data portal catalog listings, and shared by government agencies, private companies, and data science buffs. Most of the information is free of charge, but some of it, especially financial and economic data, requires payment.

With an abundance of open data on the web, finding appropriate data isn’t a problem. However, it’s important to consider the project type when choosing the dataset.

Originally published at AltexSoft’s blog: “Best Public Datasets for Machine Learning and Data Science: Sources and Advice on the Choice”