In the spring of 1993, a Harvard statistics professor named Donald Rubin sat down to write a paper. Rubin’s paper would go on to change the way that artificial intelligence is researched and practiced, but its stated goal was more modest: to analyze data from the 1990 U.S. census while preserving the anonymity of its respondents.
Simply anonymizing the data wasn’t feasible: individuals could still be identified by their home address, phone number, or social security number, yet those fields were crucial to the analyses that Rubin’s colleagues wanted to perform. To solve the problem, Rubin generated a set of anonymized census responses whose population statistics mirrored those of the original data set. This way, Rubin’s colleagues could draw valid statistical inferences about the complexion of the United States without compromising the identity of its citizens.
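To make the principle concrete, here is a minimal Python sketch of the idea: fit a simple statistical model to real records, then sample entirely new records from it. The columns, distributions, and single-Gaussian fit are invented for illustration and are far cruder than Rubin’s actual proposal, which was based on multiple imputation.

```python
# Minimal sketch of synthetic microdata (not Rubin's multiple-imputation procedure):
# fit a simple model to the real records, then sample new records that reproduce
# the population-level statistics. Columns and distributions are illustrative.
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for real census responses: age and household income for 10,000 people.
real_age = rng.normal(38, 12, size=10_000).clip(0, 95)
real_income = rng.lognormal(mean=10.8, sigma=0.6, size=10_000)

# "Fit" the simplest possible model: the joint mean and covariance (on log income).
features = np.column_stack([real_age, np.log(real_income)])
mu, cov = features.mean(axis=0), np.cov(features, rowvar=False)

# Sample synthetic respondents from the fitted model. No synthetic row corresponds
# to a real person, but aggregate statistics are preserved by construction.
synthetic = rng.multivariate_normal(mu, cov, size=10_000)
synth_age, synth_income = synthetic[:, 0], np.exp(synthetic[:, 1])

print(real_income.mean(), synth_income.mean())            # close
print(np.corrcoef(real_age, np.log(real_income))[0, 1],
      np.corrcoef(synth_age, np.log(synth_income))[0, 1])  # also close
```

The synthetic rows support the same population-level inferences while exposing no real respondent, which is the property Rubin was after.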
Rubin’s solution was original. He had produced synthetic data, and in doing so contributed the term to our academic vocabulary. His approach was popularized by statisticians, economists, and medical researchers.
Synthetic Data in Machine Learning
Decades later, synthetic data found a new use as an accelerant to machine learning. Machine learning systems are predictive, and most require data - the more, the better.
For example, the accuracy of a supervised machine learning model that predicts election outcomes will improve with more data. But elections are infrequent events, which means the data-derived predictive power of such a model is limited (changes to the model’s architecture could yield small performance improvements but would be dwarfed by the impact of doubling its training dataset).
To achieve more predictive power, the model needs more data. It must also account for changes to the mechanisms that determine election results, so that valid inferences can be drawn about the relationship between those mechanisms and the outcomes they produce.
Generating synthetic data whose properties enable valid inference was the original purpose of Rubin’s work. Inspired by it, researchers at Caltech and UC Irvine created synthetic electoral data that might have been recorded at the ballot box, but was not.
In that study, synthetic data was used to overcome data scarcity, but data privacy is another grave concern. Industries that traffic in highly sensitive personal information, like healthcare, are zealous advocates of synthetic data because regulations often preclude their data scientists from working with real patient records.
Privacy and scarcity are important data access problems, and solving them makes models more performant. But in a different corner of the machine learning community, synthetic data is being used to give models new capabilities — the ability to see things they otherwise wouldn’t, and to make novel predictions.
Synthetic Images
The subset of machine learning that processes images is called computer vision. Like models for predicting elections, most computer vision models improve with data.
The dominant approach to data acquisition in computer vision relies on humans sitting in a room, labelling images according to their contents. This is a crucial but labor-intensive process (a now-famous collection of photos called ImageNet contains roughly 14 million hand-annotated images).
Labels are important because they are the method by which we encode our semantic understanding of the world into a computer. For example, the people sitting in that room might annotate photos as Cat or Dog to show a computer how to recognize the difference. But labels need not be constrained to things that are discernible by the human eye.
Synthetic images created by computers can carry labels for properties that humans cannot reliably quantify - parameters like depth or transparency.
Imagine trying to measure the relative depth of thousands of individual plastic bottles in an image. Now measure their transparency, and the angles at which they reflect light. The task is impossible for a human, but photos with these attributes broaden the inference possibilities for a computer vision model.
In the retail and waste management industries, for example, robots can pick stock and sort plastic bottles for recycling with far greater dexterity when they are trained on synthetic datasets that include depth and transparency labels (researchers demonstrated this earlier this year). Using synthetic data, the robots became more intelligent.
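To make the label structure concrete, here is a minimal Python sketch of what one sample from such a dataset might look like. The SyntheticSample fields and the render_scene() stand-in are illustrative assumptions, not any particular team’s pipeline; a real pipeline would use a renderer such as Blender or Unity rather than random arrays.

```python
# Illustrative sketch only: a hypothetical rendering pipeline that emits per-pixel
# ground-truth labels "for free" alongside each synthetic image. The structure and
# render_scene() function are assumptions, not a real library API.
from dataclasses import dataclass
import numpy as np

@dataclass
class SyntheticSample:
    image: np.ndarray          # (H, W, 3) rendered RGB image
    depth_map: np.ndarray      # (H, W) distance from camera, in meters
    transparency: np.ndarray   # (H, W) per-pixel alpha of the bottle material
    class_mask: np.ndarray     # (H, W) integer object IDs for segmentation

def render_scene(num_bottles: int, seed: int) -> SyntheticSample:
    """Stand-in for a renderer: in practice a 3D tool would place bottle assets,
    randomize lighting, and rasterize the image and its ground-truth buffers
    in a single pass. Random arrays are used here purely as placeholders."""
    rng = np.random.default_rng(seed)
    h, w = 480, 640
    return SyntheticSample(
        image=rng.random((h, w, 3)),
        depth_map=rng.uniform(0.3, 2.0, (h, w)),
        transparency=rng.uniform(0.0, 1.0, (h, w)),
        class_mask=rng.integers(0, num_bottles + 1, (h, w)),
    )

# Each call yields an image plus labels no human annotator could measure.
dataset = [render_scene(num_bottles=30, seed=i) for i in range(1000)]
```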
A Contrarian Bet
It’s true that computers have been generating images for decades, but doing so photorealistically, with aesthetic diversity, and at scale, is very difficult. Generative Adversarial Networks, or GANs, are a sophisticated solution. A GAN pits two neural networks against each other: a generator that turns random inputs into candidate images, and a discriminator that learns to tell those candidates from real photos. Because every new random input yields a new image, a trained generator can provide effectively unlimited variation, yet requires no more human guidance than other deep learning models do. For those who know how to use them, GANs have enabled an advantaged data supply chain.
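To ground the mechanism, here is a minimal GAN sketch in PyTorch. The layer sizes, data dimension, and hyperparameters are illustrative assumptions, and a real image generator would use convolutional networks rather than these toy fully connected layers.

```python
# Minimal GAN sketch (illustrative, not a production image generator): a generator
# maps random noise to fake samples while a discriminator learns to tell them apart;
# training the two adversarially is what yields endless variation.
import torch
import torch.nn as nn

latent_dim, data_dim = 16, 784   # e.g. 28x28 grayscale images, flattened

generator = nn.Sequential(
    nn.Linear(latent_dim, 128), nn.ReLU(),
    nn.Linear(128, data_dim), nn.Tanh(),
)
discriminator = nn.Sequential(
    nn.Linear(data_dim, 128), nn.LeakyReLU(0.2),
    nn.Linear(128, 1), nn.Sigmoid(),
)

opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
bce = nn.BCELoss()

def train_step(real_batch: torch.Tensor) -> None:
    batch_size = real_batch.size(0)
    noise = torch.randn(batch_size, latent_dim)
    fake_batch = generator(noise)

    # Discriminator: label real samples 1, generated samples 0.
    opt_d.zero_grad()
    d_real = bce(discriminator(real_batch), torch.ones(batch_size, 1))
    d_fake = bce(discriminator(fake_batch.detach()), torch.zeros(batch_size, 1))
    (d_real + d_fake).backward()
    opt_d.step()

    # Generator: try to make the discriminator label fakes as real.
    opt_g.zero_grad()
    g_loss = bce(discriminator(fake_batch), torch.ones(batch_size, 1))
    g_loss.backward()
    opt_g.step()

# Usage: call train_step() on batches of real images; afterwards,
# generator(torch.randn(n, latent_dim)) yields n new synthetic samples.
```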
Still, synthetic data is, for now, a contrarian bet because the conventional wisdom assumes that models trained with human-labelled images are more performant than those trained with synthetic images.
But evidence from the academic community suggests that the conventional wisdom is wrong. In many cases, models that are trained on, or augmented with, synthetic data are more performant than models trained on real-world data alone, and they can perceive things that other models cannot.
This is already evident in the autonomous vehicles industry, where real-world uncertainty and dynamism have created unprecedented demand for synthetic data. Uber, Tesla, Waymo, and Zoox won’t put cars on the road unless they are safe, but how can they anticipate every driving scenario that might occur?
Capturing millions of hours of rainy, nighttime, mountainous driving scenarios with a real driver in a real car is impractical. It would take too long and put people in unnecessary danger. A better solution is an image generation pipeline that can provide unlimited scenic diversity. It’s likely that all major autonomous vehicle companies have incorporated synthetic data into their computer vision systems.
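One way such a pipeline achieves scenic diversity is by randomizing scene parameters and rendering each sampled combination, an approach often called domain randomization. The sketch below is a hedged illustration; the parameter categories, ranges, and the notion of a downstream render step are invented, not any company’s actual configuration.

```python
# Hedged sketch of "unlimited scenic diversity" via domain randomization.
# Parameter ranges are illustrative assumptions.
import random

WEATHER = ["clear", "rain", "fog", "snow"]
TIME_OF_DAY = ["dawn", "noon", "dusk", "night"]
TERRAIN = ["urban", "highway", "mountain", "rural"]

def sample_scenario(seed: int) -> dict:
    """Draw one driving scenario from a combinatorially large parameter space."""
    rng = random.Random(seed)
    return {
        "weather": rng.choice(WEATHER),
        "time_of_day": rng.choice(TIME_OF_DAY),
        "terrain": rng.choice(TERRAIN),
        "num_pedestrians": rng.randint(0, 40),
        "road_friction": round(rng.uniform(0.3, 1.0), 2),  # wet vs. dry surface
    }

# A simulator would render each sampled scenario into labelled camera frames;
# rare combinations (rainy, night, mountain) are as cheap to produce as common ones.
scenarios = [sample_scenario(i) for i in range(100_000)]
rare = [s for s in scenarios
        if s["weather"] == "rain"
        and s["time_of_day"] == "night"
        and s["terrain"] == "mountain"]
print(len(rare), "rainy nighttime mountain scenarios generated on demand")
```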
Another argument for synthetic images is economic. Like other digital goods, their marginal cost of production is near zero. As long as the alternative is humans labelling images, synthetic data will be cheaper, or so the argument goes.
In reality, the unit economics are more complicated. Many high-value use cases require custom 3D assets, which must be purchased or drawn by CGI artists using animation software. Competitive advantage amongst the first wave of synthetic data startups may come from their ability to spread the fixed cost of those artists. The returns to doing so would be large, but capturing them requires a consistent, recurring use case across customers, which doesn’t yet exist in all synthetic image markets.
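A back-of-the-envelope sketch shows how that amortization argument works. Every number below is an invented assumption for illustration, not market data.

```python
# Illustrative unit economics: fixed asset costs amortize over rendered images,
# human labelling does not. All figures are hypothetical.
def cost_per_image_synthetic(fixed_asset_cost: float, images_rendered: int,
                             compute_cost_per_image: float) -> float:
    """Custom 3D assets are a fixed cost spread across every image rendered."""
    return fixed_asset_cost / images_rendered + compute_cost_per_image

def cost_per_image_human(label_cost_per_image: float) -> float:
    """Human labelling is a variable cost that never amortizes."""
    return label_cost_per_image

# Hypothetical numbers: $50,000 of CGI work, $0.01 of compute per render,
# versus $0.35 per human-labelled image.
for n in (10_000, 100_000, 1_000_000):
    synthetic = cost_per_image_synthetic(50_000, n, 0.01)
    human = cost_per_image_human(0.35)
    print(f"{n:>9,} images: synthetic ${synthetic:.3f} vs. human ${human:.2f}")
# The fixed cost dominates at low volume; scale, or reuse across customers,
# is what makes the near-zero marginal cost argument hold.
```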
As the performance advantage of synthetic data becomes more dramatic, it will be appreciated outside of the academic and startup communities. Google, Amazon, and Microsoft will incorporate synthetic data into their end-to-end machine learning pipelines. They may choose to expose the actual data synthesis tools to users, but will more likely monitor inference and automatically retrain models in response to confounding data (and this will not be limited to computer vision).
Still, startups that can identify differentiated and high-value use cases and build predictable revenue streams around them will enjoy enviable market positions.
For startups and incumbents alike, one thing is clear: Society’s performance expectations of machine learning systems are rising, and synthetic data is being used to meet them.