Computer vision enables computers to understand the content of images and videos. The goal in computer vision is to automate tasks that the human visual system can do.
Computer vision tasks include image acquisition, image processing, and image analysis. The image data can come in many forms, such as video sequences, views from multiple cameras at different angles, or multi-dimensional data from a medical scanner.
Note: This article was originally written by Meiryum Ali and published on Lionbridge AI.
Labelme: A large dataset created by the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL) containing 187,240 images, 62,197 annotated images, and 658,992 labeled objects.
Lego Bricks: Approximately 12,700 images of 16 different Lego bricks classified by folders and computer rendered using Blender.
ImageNet: The de facto image dataset for benchmarking new algorithms. It is organized according to the WordNet hierarchy, in which each node of the hierarchy is depicted by hundreds to thousands of images.
LSUN: Scene understanding with many ancillary tasks (room layout estimation, saliency prediction, etc.)
MS COCO: COCO is a large-scale object detection, segmentation, and captioning dataset containing over 200,000 labeled images. It can be used for object segmentation, recognition in context, and many other use cases.
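COCO distributes its labels as a single JSON file with parallel `images` and `annotations` lists joined on `image_id`, plus a `categories` list mapping category IDs to names. A minimal sketch of grouping annotations by image, using a tiny inline dictionary as a stand-in for a real annotation file (field names follow the published COCO format; the filenames and IDs here are made up):

```python
from collections import defaultdict

# Tiny inline stand-in for a real COCO instances_*.json file
# (in practice you would json.load() the downloaded annotation file).
coco = {
    "images": [{"id": 1, "file_name": "000000000001.jpg"}],
    "annotations": [
        {"id": 10, "image_id": 1, "category_id": 18, "bbox": [10, 20, 50, 40]},
        {"id": 11, "image_id": 1, "category_id": 1, "bbox": [0, 0, 30, 60]},
    ],
    "categories": [{"id": 1, "name": "person"}, {"id": 18, "name": "dog"}],
}

# Index annotations by image so each image's objects can be looked up directly.
anns_by_image = defaultdict(list)
for ann in coco["annotations"]:
    anns_by_image[ann["image_id"]].append(ann)

# Resolve category IDs to human-readable names.
names = {c["id"]: c["name"] for c in coco["categories"]}
for img in coco["images"]:
    labels = [names[a["category_id"]] for a in anns_by_image[img["id"]]]
    print(img["file_name"], labels)  # 000000000001.jpg ['dog', 'person']
```

For real work, the official `pycocotools` library wraps this same indexing behind an API, but the underlying JSON structure is as simple as shown above.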
Columbia University Image Library: COIL-100 is a dataset featuring 100 different objects imaged at every angle in a 360-degree rotation.
Visual Genome: Visual Genome is a dataset and knowledge base created in an effort to connect structured image concepts to language. It features a detailed visual knowledge base with captions for 108,077 images.
Google’s Open Images: A collection of 9 million URLs to images “that have been annotated with labels spanning over 6,000 categories” under Creative Commons.
Annotated images from the Open Images dataset. Left: Ghost Arches by Kevin Krejci. Right: Some Silverware by J B. Both images used under CC BY 2.0 license.
Youtube-8M: A large-scale labeled video dataset consisting of millions of YouTube video IDs, with annotations spanning more than 3,800 visual entities.
Labeled Faces in the Wild: 13,000 labeled images of human faces, for use in developing applications that involve facial recognition.
Stanford Dogs Dataset: Contains 20,580 images and 120 different dog breed categories, with about 150 images per class.
Places: Scene-centric database with 205 scene categories and 2.5 million images with a category label.
CelebFaces: Face dataset with more than 200,000 celebrity images, each with 40 attribute annotations.
Sample images from the CelebFaces Dataset.
Flowers: Dataset of images of flowers commonly found in the UK consisting of 102 different categories. Each flower class consists of between 40 and 258 images with different pose and light variations.
Plant Image Analysis: A collection of datasets spanning over 1 million images of plants, covering 11 plant species.
Home Objects: A dataset of common household objects, mostly from the kitchen, bathroom, and living room, split into training and test sets.
CIFAR-10: A large image dataset of 60,000 32×32 colour images split into 10 classes. The dataset is divided into five training batches and one test batch, each containing 10,000 images.
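Each CIFAR-10 batch stores every image as a flat row of 3,072 values: 1,024 red, then 1,024 green, then 1,024 blue pixels in row-major order. A minimal sketch of decoding one such row into a 32×32×3 image with NumPy (the synthetic array below stands in for a row read from a real batch file):

```python
import numpy as np

def decode_cifar_row(row):
    """Convert a flat CIFAR-10 row (3072 values: R, G, B planes)
    into a 32x32x3 height-width-channel image array."""
    # Reshape to (channels, height, width), then move channels last.
    return row.reshape(3, 32, 32).transpose(1, 2, 0)

# Synthetic stand-in for one image row from a real batch file.
row = np.arange(3072, dtype=np.uint8)
img = decode_cifar_row(row)
print(img.shape)  # (32, 32, 3)
```

Libraries such as torchvision and Keras perform this decoding for you when loading CIFAR-10, but the layout above is what sits inside the downloaded pickle batches.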
CompCars: Contains 163 car makes with 1,716 car models, with each car model labeled with five attributes, including maximum speed, displacement, number of doors, number of seats, and type of car.
Indoor Scene Recognition: A very specific dataset, useful because most scene recognition models perform better outdoors. Contains 67 indoor categories and a total of 15,620 images.
VisualQA: VQA is a dataset containing open-ended questions about 265,016 images. These questions require an understanding of vision and language. For each image, there are at least 3 questions and 10 answers per question.