Rare Datasets for Computer Vision Every Machine Learning Expert Must Work With, by @dataturks

DataTurks: Data Annotations Made Super Easy

Have you ever had to guess another person's age? Probably yes! How about playing games where you have to find things in as little time as possible? Or trying to decipher the characters your doctor wrote on a prescription when you were sick?

Everyone has faced these problems in real life. How about asking your machine, your favorite computer, to do the task for you? Wouldn't that be great? Computers can actually do this using machine learning. To get there, we need to train the machine on some powerful datasets.

The key to getting better in most fields in life is practice. Practice on a variety of problems, from image processing to speech recognition. Each of these problems has its own unique techniques and approaches. But how do you get the data?

We have listed a collection of high-quality datasets that every machine learning enthusiast should work on to apply and improve their skills. Working on these datasets will make you a better data expert, and what you learn along the way will be invaluable in your career.


Car License Plate Detection

Has around 500 images of cars on roads and streets, with the license plates marked as rectangular bounding boxes.


Link to dataset.

Celebrity Face Key-Points

A database of around 2,500 images of celebrity faces, with important key-points such as the eyes and nose marked.


Link to the Dataset.

E-commerce Tagging for clothing

Images from e-commerce sites with bounding boxes drawn around shirts, jackets, sunglasses, etc.

Has around 500 images manually tagged for item detection.


Link to Dataset.

Wound Dataset

Around 300 medical surgery images with bounding boxes drawn around wounds.


Link to Dataset.

IMDB-WIKI dataset


IMDb, an abbreviation of Internet Movie Database, is an online database of information related to world films, television programs, home videos and video games, and internet streams, including cast, production crew and personnel biographies, plot summaries, trivia, and fan reviews and ratings. An additional fan feature, message boards, was abandoned in February 2017. Originally a fan-operated website, the database is owned and operated by IMDb.com, Inc., a subsidiary of Amazon. This dataset is not very rare, but it is the granddaddy of all image datasets.

  • Description: IMDB and Wikipedia face images with gender and age labels.
  • Instances: 523,051
  • Format: images
  • Default task: Gender classification, face detection, face recognition, age estimation
  • Created: 2015 by R. Rothe, R. Timofte, L. Van Gool
  • Download link: https://data.vision.ee.ethz.ch/cvl/rrothe/imdb-wiki/



This is an image classification dataset for labeling room images as bedroom, kitchen, bathroom, living room, exterior, etc. Images from different houses have been collected into a single dataset for training and testing. It helps identify which part of a house an image shows.

  • Description: The dataset has 20,001 items, of which 4,404 have been manually labeled.
  • Categories: bedroom, kitchen, bathroom, exterior, living room, other
  • Default task: image classification, image captioning.
  • Format: images
  • Created by: DataTurks
  • Download link: https://dataturks.com/projects/sheerun/rooms
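For an image classification task like this one, the category names usually have to be turned into integer class ids before training. Here is a minimal sketch of that step, using the six categories listed above. The id order and the helper name `encode_label` are assumptions made for illustration; they are not part of the dataset itself.

```python
# Category names copied from the list above. The order (and therefore
# the integer ids) is an arbitrary choice for this sketch.
CATEGORIES = ["bedroom", "kitchen", "bathroom", "exterior", "living room", "other"]

LABEL_TO_ID = {name: idx for idx, name in enumerate(CATEGORIES)}
ID_TO_LABEL = {idx: name for name, idx in LABEL_TO_ID.items()}


def encode_label(label: str) -> int:
    """Map a category name to its integer id, ignoring case and padding.

    Hypothetical helper; raises KeyError for an unknown category.
    """
    return LABEL_TO_ID[label.strip().lower()]


# Figures quoted in the description above.
TOTAL_ITEMS = 20001
LABELED_ITEMS = 4404
labeled_fraction = LABELED_ITEMS / TOTAL_ITEMS

if __name__ == "__main__":
    print(encode_label("Living Room"))
    print(f"{labeled_fraction:.1%} of the items are manually labeled")
```

Since only about a fifth of the items carry a manual label, you will likely train on the labeled subset first and treat the rest as unlabeled data.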

Visual Genome dataset


Visual Genome is a dataset, a knowledge base, an ongoing effort to connect structured image concepts to language.


  • 108,077 Images
  • 5.4 Million Region Descriptions
  • 1.7 Million Visual Question Answers
  • 3.8 Million Object Instances
  • 2.8 Million Attributes
  • 2.3 Million Relationships
  • Everything Mapped to WordNet Synsets
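To get a feel for how densely each image is annotated, you can divide the totals above by the number of images. This is a quick back-of-the-envelope sketch. The totals are the rounded figures from the list, so the per-image averages are approximate.

```python
# Totals as quoted in the list above (rounded to the nearest 0.1 million).
IMAGES = 108_077
COUNTS = {
    "region descriptions": 5_400_000,
    "visual question answers": 1_700_000,
    "object instances": 3_800_000,
    "attributes": 2_800_000,
    "relationships": 2_300_000,
}

# Approximate average number of annotations of each kind per image.
per_image = {name: total / IMAGES for name, total in COUNTS.items()}

if __name__ == "__main__":
    for name, avg in per_image.items():
        print(f"{name}: about {avg:.1f} per image")
```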

CRACK Classification dataset

This dataset is for classifying cracks in walls. It consists of wall images with and without cracks.


Some images also contain shadows of wires that look exactly like cracks in the wall, so the system has to be trained carefully to tell cracks from shadows. This dataset is very challenging and will sharpen your coding skills.


IIIT-5K OCR dataset

Has 5K labeled images of street signs, cropped to contain only the portion that has the text. It is quite a difficult dataset: even the best vision algorithms reach accuracy rates of only about 80%. (Read: comparison of Google, AWS, Microsoft OCR APIs on this dataset)


Link to the dataset.
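The article does not say how the 80% accuracy figure is measured. One common choice for cropped-word OCR is exact-match word accuracy, and that is the assumption made in the sketch below. The function name `word_accuracy` and the sample strings are made up for illustration and are not taken from the dataset.

```python
from typing import Sequence


def word_accuracy(predictions: Sequence[str],
                  ground_truth: Sequence[str],
                  case_sensitive: bool = False) -> float:
    """Fraction of images whose predicted text exactly matches the label."""
    if len(predictions) != len(ground_truth):
        raise ValueError("predictions and ground_truth must have the same length")
    if not ground_truth:
        raise ValueError("need at least one example")

    def normalize(text: str) -> str:
        text = text.strip()
        return text if case_sensitive else text.lower()

    correct = sum(
        normalize(pred) == normalize(truth)
        for pred, truth in zip(predictions, ground_truth)
    )
    return correct / len(ground_truth)


# Made-up labels and OCR outputs, for illustration only.
truth = ["STOP", "EXIT", "PARKING", "HOTEL", "OPEN"]
preds = ["stop", "EXIT", "PARKINC", "HOTEL", "OPEN"]

if __name__ == "__main__":
    print(word_accuracy(preds, truth))
    print(word_accuracy(preds, truth, case_sensitive=True))
```

Whether case counts as an error changes the score noticeably, so it is worth stating the matching rule whenever you report a number.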

CARS dataset


This dataset is for identifying cars in images. It contains images that do and do not include cars. The main objective is to identify even small parts of a car in an image. The dataset is human-labeled.


The FERET Dataset

The Face Recognition Technology (FERET) program is managed by the Defense Advanced Research Projects Agency (DARPA) and the National Institute of Standards and Technology (NIST).


The Department of Defense (DoD) Counterdrug Technology Development Program Office sponsored the Face Recognition Technology (FERET) program. The goal of the FERET program was to develop automatic face recognition capabilities that could be employed to assist security, intelligence, and law enforcement personnel in the performance of their duties. The FERET database was collected in 15 sessions between August 1993 and July 1996.

The database contains 1,564 sets of images for a total of 14,126 images, covering 1,199 individuals and 365 duplicate sets of images. A duplicate set is a second set of images of a person already in the database and was usually taken on a different day.
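The numbers above fit together if you read them as one set of images per individual plus the duplicate sets. That reading is an assumption, but the arithmetic checks out, as this small sketch shows.

```python
# Figures quoted in the paragraph above.
INDIVIDUALS = 1199
DUPLICATE_SETS = 365
TOTAL_SETS = 1564
TOTAL_IMAGES = 14_126

# Assumed reading: one set per individual, plus the duplicate sets.
sets_accounted_for = INDIVIDUALS + DUPLICATE_SETS
assert sets_accounted_for == TOTAL_SETS

# Average number of images in a set.
images_per_set = TOTAL_IMAGES / TOTAL_SETS

if __name__ == "__main__":
    print(f"{sets_accounted_for} sets, about {images_per_set:.1f} images per set")
```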


Face Detection

Has around 1,300 faces marked as rectangular bounding boxes in images. Images range from party pics to random people on streets.


Link to Dataset.

CALTECH-101 dataset


Caltech 101 is a data set of digital images. The Caltech 101 data set was used to train and test several machine learning, computer vision recognition and classification algorithms. A set of annotations is provided for each image. Each set of annotations contains two pieces of information: the general bounding box in which the object is located and a detailed human-specified outline enclosing the object.

A MATLAB script is provided with the annotations. It loads an image and its corresponding annotation file and displays them as a MATLAB figure.
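If you prefer Python to MATLAB, the two pieces of information in each annotation can be modeled with a small data structure. The sketch below is hypothetical: the class `ObjectAnnotation`, its field names, and the sample coordinates are illustrative assumptions, not the dataset's on-disk format. It shows how a bounding box relates to an outline: the box is the tightest axis-aligned rectangle that encloses every outline point.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]              # (x, y) in pixel coordinates
Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


@dataclass
class ObjectAnnotation:
    """Hypothetical in-memory form of one annotation (not the file format)."""
    box: Box
    outline: List[Point]


def box_from_outline(outline: List[Point]) -> Box:
    """Tightest axis-aligned rectangle containing every outline point."""
    if not outline:
        raise ValueError("outline must contain at least one point")
    xs = [x for x, _ in outline]
    ys = [y for _, y in outline]
    return (min(xs), min(ys), max(xs), max(ys))


def outline_inside_box(ann: ObjectAnnotation) -> bool:
    """Sanity check: every outline point lies within the bounding box."""
    x_min, y_min, x_max, y_max = ann.box
    return all(x_min <= x <= x_max and y_min <= y <= y_max
               for x, y in ann.outline)


# Made-up outline, for illustration only.
outline = [(12.0, 30.0), (40.0, 18.0), (55.0, 42.0), (28.0, 60.0)]
ann = ObjectAnnotation(box=box_from_outline(outline), outline=outline)

if __name__ == "__main__":
    print(ann.box)
    print(outline_inside_box(ann))
```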


The Caltech 101 data set aims to alleviate several problems that are common in image data sets:

  1. The images are cropped and re-sized.
  2. Many categories are represented, which suits both single and multiple class recognition algorithms.
  3. Detailed object outlines are marked.
  4. Available for general use, Caltech 101 acts as a common standard by which to compare different algorithms without bias due to different data sets.

  • Description: Pictures of objects, detailed object outlines marked.
  • Instances: 9,146 images, split between 101 different object categories, as well as an additional background/clutter category.
  • Format: Images
  • Default task: Classification, object recognition.
  • Created: September 2003 and compiled by Fei-Fei Li
  • Download link: http://www.vision.caltech.edu/Image_Datasets/Caltech101/

UXBOT Dataset 


This dataset is for classifying UXBot pictures into styles such as dark, professional, minimalist, glamorous, etc. UXBot is a chat platform. The dataset can be used to train a model to recognize these visual styles. It is a human-labeled dataset.

  • Description: The dataset has 129 items, all of which have been manually labeled.
  • Format: images
  • Categories: elegant, clean, fresh, light, airy, corporate, funky, retro, edgy, fun, etc.
  • Default task: image classification
  • Created by: DataTurks
  • Download link: https://dataturks.com/projects/briannaorg/UXBot



LabelMe is a project created by the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL) which provides a dataset of digital images with annotations. The dataset is dynamic, free to use, and open to public contribution. The motivation behind creating LabelMe comes from the history of publicly available data for computer vision researchers. Most available data was tailored to a specific research group's problems, which forced new researchers to collect additional data to solve their own problems. LabelMe was created to solve several common shortcomings of available data.


You can find thousands of such open datasets here.