The term ‘curation’ is commonly associated with museums or libraries, not data science. However, much like the work that’s done on rare paintings or books, data curation tools make the most important data easily accessible to engineers as they build complex machine learning models. Without curation, data is difficult to find, analyze, and interpret. Data curation tools provide meaningful insights and enduring access to all your data in one place.

In this article, we’ll dive into the importance of data curation for computer vision specifically, as well as review the top data curation tools on the market today.

What is data curation?

Data curation is the act of organizing, enhancing, and preserving data for future use. In machine learning, data curation describes the management of data throughout its lifecycle: from its collection and initial storage to the time it is archived for future reuse.

This process is all the more important for computer vision engineers, who deal with massive amounts of visual data on a daily basis. Instead of relying on manual methods such as writing ETL jobs to extract insights, engineers can use data curation tools as a streamlined way to access the right data whenever they need it.

The importance of data curation for machine learning

Under the hood, data curation tools directly influence computer vision model performance. Using data curation tools, engineers can get a better understanding of the data they’ve collected, identify the most important subsets and edge cases, and curate custom training datasets to feed back into their models.

The best data curation tools:

- Visualize large-scale data: They make it easy to obtain insights on key metrics, as well as the general distribution and diversity of your datasets, regardless of sensor type and format.
- Enable data discovery and retrieval: You can quickly search, filter, and sort through the entire data lake because all features are queryable and easily accessible.
- Curate diverse scenarios: You can identify the most interesting segments within your dataset and manipulate them within the tool to create completely customized training sets.
- Seamlessly integrate: The tool should fit well within your existing workflows and toolset.

What are the best data curation tools for computer vision?

With an overwhelming number of AI products and platforms popping up year after year, how do you know which will provide the most value? Read on to find out which data curation tool is the best fit for your computer vision project.

1. Aquarium Learning

Aquarium is a data management platform that aims to make it easy to identify labeling errors and model failures. With Aquarium, users can version and combine model predictions with their ground truth.

Aquarium is especially focused on curating and maintaining training datasets, catering less to raw data management use cases. This is because data exploration in Aquarium is predominantly tied to model predictions and ground truth labels.

Users can access Aquarium via its cloud platform or API. However, Aquarium currently does not offer on-premise or VPC deployments, and there are no external integrations.

- Wide range of use cases: Aquarium supports image, 3D, audio, and text data. It also supports multiple annotation types, such as classification, detection, and segmentation.
- Interactive model evaluation: Users can adjust evaluation thresholds and use interactive visualizations to find the samples they need quickly.
- Collaborative features: Users can collaborate with each other on the Aquarium platform to build data subsets, associate them with issues, and identify new data for annotation.

2. FiftyOne

Developed by Voxel51, FiftyOne is an open-source tool to visualize and interpret computer vision datasets. The tool is made up of three components: the Python library, the web app (GUI), and the Brain.
FiftyOne does not contain any auto-tagging capabilities, and therefore works best with datasets that have previously been annotated. Furthermore, the tool only supports image and video data, and does not work for multimodal sensor datasets at this time.

Unlike other tools, FiftyOne is designed to be used by individual developers rather than teams, functioning like a programming IDE. Today, the platform lacks collaborative features; for example, a single instance cannot host multiple user accounts.

- Model & dataset zoo: FiftyOne taps into the TensorFlow and PyTorch dataset zoos to provide access to a variety of open datasets and open-source models.
- Advanced data analysis: Via the Brain, a separate closed-source Python package, users can quantitatively assess the uniqueness, mistakenness, and hardness of data.
- External integrations: FiftyOne directly integrates with popular annotation tools such as Labelbox. It also has tight integrations with Jupyter and Colab notebooks, making it easy for users to run FiftyOne through Python notebooks.

3. Scale Nucleus

Launched in late 2020 by Scale, Nucleus is one of the newest data curation tools to hit the market. The Nucleus platform allows users to collaboratively search through image data for model failures. As of now, Nucleus only supports image data, with no support for 3D sensor fusion, video, or text data.

Users can access Nucleus via its cloud platform, API, or Python SDK. Currently, Nucleus does not support on-premise deployment.

- Visual similarity: Users can search for visually similar images based on one or multiple base samples and associate custom tags with them.
- Metadata schemas: Using the Nucleus SDK, users can create flexible metadata schemas. Nucleus provides smart methods to detect and create schemas using the annotation format provided.
- Model versioning: Users can create model entities and associate corresponding runs with them, so models can be versioned based on runs (dataset & predictions).

4. Clarifai

Clarifai is an end-to-end solution for labeling, searching & modeling unstructured data. One of the first AI startups, Clarifai provides a platform for modeling image, video, and text data.

While Clarifai’s original focus was enabling users to build custom models, the company has recently added several data curation features, including auto-tagging, visual search, and annotations. Ultimately, Clarifai is more of a modeling platform and less of a developer tool. It is best suited for relatively inexperienced teams getting started with ML use cases.

- Ready-to-use model gallery: Clarifai offers a broad library of pre-built AI models, ranging from food recognition to facial recognition.
- Wide range of data types: The platform supports image, video, and text data.
- Model customization: With the platform, users can customize or retrain existing models or create new ones from scratch.
- Data annotation: In addition to its modeling platform, Clarifai offers fully managed annotation services through its Scribe LabelForce data labeling service.

5. SiaSearch

SiaSearch is a data management platform for computer vision data. Consisting of a scalable metadata catalog and query engine, SiaSearch enables developers to easily search through visual data, add metadata to frames and sequences, and assemble custom subsets of data for training or testing.

With deep roots in autonomous driving, the SiaSearch platform is used by many OEMs, Tier 1s, and tech companies. Aside from autonomous driving, SiaSearch also has solutions for robotics, retail, and more.

- Specialized in sensor data: One of the few tools that can support 3D sensor fusion data, SiaSearch can analyze large volumes of unstructured sensor data, providing insights at the frame and sequence level.
- Auto-tagging capabilities: SiaSearch employs a large catalog of pre-trained extractors to automatically add frame-level, contextual metadata to raw data. Additionally, SiaSearch provides a toolbox for quick extractor development, allowing developers to integrate their own extractors.
- Fast performance: The SiaSearch platform features a unique, proprietary architecture that combines numeric and sequence-based queries to enable noticeably faster performance.
- Flexible workflows & integrations: Users can access SiaSearch via its web-based GUI or programmatic API. SiaSearch also supports cloud or on-premise deployment for enterprise users.

Interested in data curation?

The right data curation tool can dramatically reduce the time spent on manual processes, allowing engineers to focus on what really matters: building great models.

Lead image via Tobias Fischer on Unsplash

Originally published by Clemens Viernickel at https://www.siasearch.io/blog/best-data-curation-tools-for-computer-vision and reposted with permission.