Have you ever been in a situation where you had to guess another person’s age? Well, maybe yes! How about playing games like finding things in minimum time? Or deciphering the characters your doctor wrote in a prescription when you were sick?
Everyone has faced these problems in real life. How about asking your machine, your favorite computer, to do the task for you? Isn’t it great? Computers actually can, using Machine Learning. To do this, we need to train the machine on some powerful datasets.
The key to getting better in most fields of life is practice: practice on a variety of problems, from image processing to speech recognition. Each of these problems has its own unique technique and approach. But how do you get this data?
We have listed a collection of high-quality datasets that every Machine Learning enthusiast should work on to apply and improve their skills. Working on these datasets will make you a better data expert, and the learning you gain will be invaluable in your career.
Has around 500 images of cars on roads and streets, with license plates marked as rectangular bounding boxes.
Link to the dataset.
A database of around 2500 images of celebrity faces, with important key-points such as the eyes and nose marked.
Link to the dataset.
Images from e-commerce sites with bounding boxes drawn around shirts, jackets, sunglasses, etc.
Has around 500 images manually tagged for item detection.
Link to the dataset.
Around 300 medical surgery images with bounding boxes drawn around wounds.
Link to the dataset.
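Several of the datasets above provide rectangular bounding boxes as ground truth. Predictions against such boxes are commonly scored with intersection over union (IoU). A minimal sketch in plain Python, assuming boxes are given as (x_min, y_min, x_max, y_max) tuples (the function name is my own, not from any of these datasets):

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    # Corners of the overlapping rectangle, if the boxes overlap at all.
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0
```

A prediction is typically counted as a correct detection when its IoU with a ground-truth box exceeds a threshold such as 0.5.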
IMDb, an abbreviation of Internet Movie Database, is an online database of information related to world films, television programs, home videos, video games, and Internet streams, including cast, production crew and personnel biographies, plot summaries, trivia, and fan reviews and ratings. An additional fan feature, message boards, was abandoned in February 2017. Originally a fan-operated website, the database is owned and operated by IMDb.com, Inc., a subsidiary of Amazon. Not very rare, but the granddaddy of all image datasets.
This is an image classification dataset for classifying room images as bedroom, kitchen, bathroom, living room, exterior, etc. Images from different houses are collected and kept together for training and testing. This dataset helps in finding which part of a house an image belongs to.
Visual Genome is a dataset, a knowledge base, and an ongoing effort to connect structured image concepts to language.
This dataset is for classifying cracks on walls. It consists of wall images with and without cracks.
Some images also contain shadows of wires that look exactly like cracks, so the system must be trained carefully to differentiate between cracks and shadows. This is a very challenging dataset that will sharpen your skills.
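One hedged starting point for separating cracks from wire shadows: cracks tend to have sharp edges (a large jump between neighboring pixel intensities), while shadows fade gradually. A toy sketch on small grayscale grids, just to illustrate the feature; a real detector would learn this from the dataset rather than use a hand-set rule:

```python
def max_horizontal_gradient(img):
    """Largest absolute difference between horizontally adjacent pixels."""
    return max(abs(row[i + 1] - row[i])
               for row in img for i in range(len(row) - 1))

# A sharp dark line (crack-like) vs a gradual darkening (shadow-like).
crack = [[200, 200, 20, 200],
         [200, 200, 20, 200]]
shadow = [[200, 150, 100, 50],
          [200, 150, 100, 50]]
```

On these toy grids the crack-like image has a much larger maximum gradient than the shadow-like one, which is the kind of cue a classifier could pick up.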
Has 5K labeled images of street signs, cropped to just the portion that contains the text. Quite a difficult dataset, with even the best vision algorithms sitting at around 80% accuracy. (Read: a comparison of the Google, AWS, and Microsoft OCR APIs on this dataset.)
Link to the dataset.
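Accuracy on text-recognition datasets like this one is often reported at the character level, typically via edit distance between the predicted and ground-truth strings. A sketch in plain Python (the function names and the normalization choice are my own, not the metric used in the comparison above):

```python
def edit_distance(pred, truth):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(truth) + 1))
    for i, pc in enumerate(pred, 1):
        cur = [i]
        for j, tc in enumerate(truth, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (pc != tc)))   # substitution
        prev = cur
    return prev[-1]

def char_accuracy(pred, truth):
    """1 minus normalized edit distance; a common OCR score."""
    if not truth:
        return 1.0 if not pred else 0.0
    return max(0.0, 1.0 - edit_distance(pred, truth) / len(truth))
```

For example, reading "ST0P" where the sign says "STOP" is one substitution, so the character accuracy is 0.75 rather than 0.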
This dataset is for identifying cars in images. The set contains images that do and do not have cars in them. The main objective is to identify even small parts of a car in the images. It is a human-labeled dataset.
The FERET Dataset
The Face Recognition Technology (FERET) program is managed by the Defense Advanced Research Projects Agency (DARPA) and the National Institute of Standards and Technology (NIST).
The Department of Defense (DoD) Counterdrug Technology Development Program Office sponsored the Face Recognition Technology (FERET) program. The goal of the FERET program was to develop automatic face recognition capabilities that could be employed to assist security, intelligence, and law enforcement personnel in the performance of their duties. The FERET database was collected in 15 sessions between August 1993 and July 1996.
The database contains 1564 sets of images, for a total of 14,126 images, covering 1199 individuals and including 365 duplicate sets. A duplicate set is a second set of images of a person already in the database, usually taken on a different day.
Has around 1300 faces marked as rectangular bounding boxes in images. The images range from portrait pictures to random people on streets.
Link to the dataset.
Caltech 101 is a dataset of digital images. It has been used to train and test several machine learning and computer vision recognition and classification algorithms. A set of annotations is provided for each image, containing two pieces of information: the general bounding box in which the object is located and a detailed human-specified outline enclosing the object.
A MATLAB script is provided with the annotations. It loads an image and its corresponding annotation file and displays them as a MATLAB figure.
The Caltech 101 dataset aims to alleviate many of the common problems that affect other image datasets.
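The annotation format described above pairs a bounding box with a human-specified outline, and the two are related: the tightest box around an outline is just the min/max of its coordinates. A small Python sketch of that relationship (variable and function names are my own, not Caltech's):

```python
def outline_to_bbox(points):
    """Tightest (x_min, y_min, x_max, y_max) box around an outline.

    `points` is a sequence of (x, y) vertices tracing the object outline.
    """
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (min(xs), min(ys), max(xs), max(ys))
```

This is handy when a dataset ships only outlines and your detector expects rectangular boxes.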
This dataset is for classifying uxbot pictures into dark, professional, minimalist, glamorous, etc. uxbot is a popular chatting platform nowadays. This dataset is used to train computers with new technical skills. It is a human-labeled dataset.
LabelMe is a project created by the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL) which provides a dataset of digital images with annotations. The dataset is dynamic, free to use, and open to public contribution. The motivation behind creating LabelMe comes from the history of publicly available data for computer vision researchers. Most available data was tailored to a specific research group’s problems, forcing new researchers to collect additional data to solve their own problems. LabelMe was created to solve several common shortcomings of the available data.
You can find thousands of such open datasets here.