
Data Quality is All You Need: Why Synthetic Data Is Not A Replacement For High-Quality Data

by yaw.etse, August 20th, 2024


As a follow-up to “Hashing, Synthetic Data, Enterprise Data Leakage, and the Reality of Privacy Risks,” it’s important to address the limitations of synthetic data beyond just privacy concerns.


Synthetic data is not a replacement for high-quality original data, especially because of the risk of model collapse. This issue came up in a conversation with Jack Fitzsimons from Oblivious, who shared the Nature article “AI Models Collapse When Trained on Recursively Generated Data.” Transformers, known for their ability to capture long-range dependencies through self-attention mechanisms, may also be vulnerable to this failure mode.


The paper points out that “When models are trained on data that has been generated recursively, they can become increasingly biased towards the synthetic data, leading to a degradation in performance when exposed to real-world data.”

Is the Transformer Architecture More Susceptible to Model Collapse?

The “Attention is All You Need” paper highlights that transformers focus on different parts of the input data through attention mechanisms. But is the transformer architecture more susceptible to model collapse due to its reliance on self-attention?


Understanding this susceptibility is crucial since transformers are widely used in machine learning applications. More research is needed to determine if the architecture itself contributes to model collapse or if the issue is primarily with the quality of synthetic data.
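For readers who want a concrete picture of the self-attention operation mentioned above, here is a minimal sketch of scaled dot-product attention in plain Python. It is an illustration only: the helper names (`matmul`, `softmax`, `self_attention`, `random_matrix`) and the toy sizes are my own assumptions, the weights are random rather than learned, and a real transformer would use a tensor library with multiple heads, masking, and trained projections.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over a single list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention for one sequence (single head)."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    k_transposed = [list(col) for col in zip(*k)]
    scores = matmul(q, k_transposed)  # scores[i][j] = q_i . k_j
    # Each row of weights says how much token i attends to every token j.
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights


rng = random.Random(0)


def random_matrix(rows, cols):
    """Hypothetical stand-in for learned parameters or token embeddings."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


x = random_matrix(4, 8)  # 4 tokens, 8-dimensional embeddings
w_q, w_k, w_v = (random_matrix(8, 8) for _ in range(3))
output, weights = self_attention(x, w_q, w_k, w_v)
```

Each row of `weights` sums to 1, so every token's output is a weighted average over all tokens in the sequence. The open question raised in this section is whether that data-dependent weighting makes it easier to lock onto patterns that exist only in synthetic data.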


Both sides of the debate make compelling points about whether synthetic data is suitable for model training, how transformers work, and why they might be affected by recursive training on synthetic data. This article also highlights data quality, lineage, observability, and monitoring as essential components for avoiding these pitfalls.

Measuring Model Collapse

Model collapse is usually measured by evaluating the model’s performance on real data after it has been trained on synthetic data. Typical signals include:


  • Performance Degradation: A noticeable drop in accuracy or other performance metrics when the model is applied to real data.
  • Bias Amplification: Increasing divergence between the synthetic data patterns and the real data patterns.
  • Recurrence of Errors: Repeated exposure to synthetic data with inherent biases can reinforce these errors over time.
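The first two signals above can be turned into simple numbers. The sketch below is one possible way to do it, not a standard: `relative_performance_drop`, `histogram`, and `kl_divergence` are hypothetical helper names, the scores and samples are made-up toy values, and the smoothing constant `eps` is an arbitrary choice to avoid dividing by zero.

```python
import math


def relative_performance_drop(baseline_score, current_score):
    """Fraction of real-data performance lost relative to a baseline model."""
    if baseline_score <= 0:
        raise ValueError("baseline_score must be positive")
    return (baseline_score - current_score) / baseline_score


def histogram(samples, bins, low, high):
    """Normalized histogram; out-of-range values are clipped to the edge bins."""
    width = (high - low) / bins
    counts = [0] * bins
    for s in samples:
        index = min(max(int((s - low) / width), 0), bins - 1)
        counts[index] += 1
    return [c / len(samples) for c in counts]


def kl_divergence(p, q, eps=1e-9):
    """Smoothed KL(p || q): how badly q (synthetic) explains p (real)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


# Toy values: accuracy on a real hold-out set before and after synthetic training.
drop = relative_performance_drop(baseline_score=0.90, current_score=0.72)

# Toy samples: the synthetic data has lost the tails of the real data.
real_hist = histogram([0.1, 0.3, 0.5, 0.5, 0.7, 0.9], bins=5, low=0.0, high=1.0)
synthetic_hist = histogram([0.45, 0.5, 0.5, 0.5, 0.55, 0.5], bins=5, low=0.0, high=1.0)
divergence = kl_divergence(real_hist, synthetic_hist)
```

Tracking `drop` and `divergence` across successive model generations, rather than at a single point in time, is what makes compounding visible.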


The Nature article also highlights that model collapse worsens with each additional generation of recursive training. As models are repeatedly trained on synthetic data generated by other models, the biases and inaccuracies compound, leading to significant performance degradation.

Example of Model Collapse

From the Nature paper: a model can be trained to generate synthetic images of handwritten digits, such as those from the MNIST dataset. Initially, the model performs well, creating images that are nearly indistinguishable from real ones. However, if this model is then used to generate a new training dataset, and subsequent models are trained on this recursively generated data, the quality of the images deteriorates. Over multiple generations, the images become increasingly distorted, losing the original characteristics of the handwritten digits. This recursive training amplifies the errors, leading to a collapse in the model’s ability to produce realistic images.
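The same dynamic can be reproduced without images. The sketch below is a deliberately tiny stand-in for the experiment described above, not a reproduction of it: the "model" is just a Gaussian fitted to a handful of samples, and each generation is trained only on samples drawn from the previous fit. The function names, sample size, generation count, and seed are all assumptions chosen for illustration.

```python
import math
import random


def fit_gaussian(samples):
    """Maximum-likelihood mean and standard deviation of a sample."""
    mean = sum(samples) / len(samples)
    variance = sum((s - mean) ** 2 for s in samples) / len(samples)
    return mean, math.sqrt(variance)


def simulate_recursive_fitting(generations=200, sample_size=10, seed=0):
    """Refit a Gaussian on its own samples and record the std per generation."""
    rng = random.Random(seed)
    mean, std = 0.0, 1.0  # generation 0 is the "real" distribution
    history = [std]
    for _ in range(generations):
        # Each generation sees only data produced by the previous model.
        samples = [rng.gauss(mean, std) for _ in range(sample_size)]
        mean, std = fit_gaussian(samples)
        history.append(std)
    return history


history = simulate_recursive_fitting()
print(f"std at generation 0: {history[0]:.3f}")
print(f"std at generation {len(history) - 1}: {history[-1]:.3e}")
```

With small samples, each refit loses a little of the spread, and the losses compound until the fitted distribution has almost no variance left. That is the one-dimensional analogue of generated digits drifting toward a few blurry shapes.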

The Value of Data Quality

The crux of the issue with synthetic data and model collapse is that synthetic data is not a substitute for high-quality data. The papers discussed above repeatedly highlight that the quality of data used in training is critical to maintaining model performance and avoiding collapse. This is why data quality tooling around lineage, observability, and monitoring is so important.


  • Data Lineage: Understanding the origins and transformations of data is crucial in assessing its quality. Lineage tools help track data flow through the pipeline, ensuring that any issues with synthetic data generation can be traced back to their source.
  • Observability: Monitoring the behavior of models during training and inference is essential for detecting early signs of model collapse. Observability tools provide insights into how models interact with data, allowing for timely interventions.
  • Monitoring: Monitoring model performance on both real and synthetic data is necessary to ensure that the model remains aligned with the true data distribution. Monitoring tools can detect when a model begins to drift, providing an opportunity to retrain or adjust the data mix before collapse occurs.
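To make these three ideas less abstract, here is a minimal sketch of what a pre-training check might look like. It is an assumption-laden illustration rather than any particular tool's API: the `Record` fields, the `check_batch` function, and both thresholds are hypothetical, and the drift test is a hand-rolled two-sample Kolmogorov-Smirnov statistic on a single numeric feature.

```python
import bisect
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    value: float
    source: str      # lineage tag: "real" or "synthetic"
    generation: int  # 0 for real data, n for output of the n-th model


def synthetic_fraction(records):
    """Share of a training batch that lineage marks as synthetic."""
    if not records:
        return 0.0
    return sum(r.source == "synthetic" for r in records) / len(records)


def ks_statistic(reference, candidate):
    """Largest gap between the two empirical CDFs (0 = identical, 1 = disjoint)."""
    ref, cand = sorted(reference), sorted(candidate)
    gap = 0.0
    for v in ref + cand:
        cdf_ref = bisect.bisect_right(ref, v) / len(ref)
        cdf_cand = bisect.bisect_right(cand, v) / len(cand)
        gap = max(gap, abs(cdf_ref - cdf_cand))
    return gap


def check_batch(records, reference_values, max_synthetic=0.5, max_ks=0.3):
    """Return alert messages; thresholds are illustrative, not recommendations."""
    alerts = []
    fraction = synthetic_fraction(records)
    if fraction > max_synthetic:
        alerts.append(f"synthetic share {fraction:.0%} exceeds {max_synthetic:.0%}")
    drift = ks_statistic(reference_values, [r.value for r in records])
    if drift > max_ks:
        alerts.append(f"distribution drift {drift:.2f} exceeds {max_ks:.2f}")
    return alerts


reference = [i / 10 for i in range(100)]  # trusted real data
healthy = [Record(v, "real", 0) for v in reference]
collapsed = [Record(5.0 + i / 1000, "synthetic", 3) for i in range(100)]

healthy_alerts = check_batch(healthy, reference)
collapsed_alerts = check_batch(collapsed, reference)
```

The lineage fields are what make the first check possible at all: without a `source` and `generation` recorded at ingestion time, there is no reliable way to know how much of a batch is recursively generated.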

Better Uses of Synthetic Data

Despite its limitations in training, synthetic data has valuable applications, especially when combined with Privacy Enhancing Technologies (PETs). For example:


  • Data Sharing: Enabling secure data sharing across organizations without compromising privacy.
  • Software/Model Testing: Providing realistic test cases for software development and QA without exposing sensitive data.
  • Scenario Simulation: Useful for simulating rare or hypothetical scenarios to test models’ robustness and response.


I plan to follow up on this topic with a balanced view on leveraging synthetic data and PETs, exploring their best uses and offering practical ideas for integrating these technologies into a comprehensive data strategy.

Appendix: Further Reading

  1. “Is Model Collapse Inevitable? Breaking the Curse of Recursion by Accumulating Real and Synthetic Data”
    This paper examines whether model collapse, where models trained on synthetic data degrade in performance over time, is inevitable. The authors suggest a strategy to mitigate this risk by mixing real and synthetic data during training. Their approach, which involves preserving some real data while adding synthetic data, helps maintain model performance over successive generations. The paper emphasizes the mathematical basis of this method, showing how the inclusion of real data prevents the drift away from the original data distribution that typically leads to collapse.


  2. “The Curse of Recursion: Training on Generated Data Makes Models Forget”
    This paper analyzes model collapse when models are trained recursively on data generated by previous models. The authors identify several factors contributing to collapse.


  3. “Attention Is All You Need”
    This seminal paper introduced the transformer architecture, which relies on self-attention mechanisms to capture long-range dependencies in data. The paper itself does not address model collapse; the concern raised in this article is that the same strength may become a weakness when training on synthetic data. Self-attention may focus on patterns that are artifacts of synthetic data rather than true features of the original data distribution, which could result in overfitting to non-representative patterns and contribute to model collapse.


  4. “AI Models Collapse When Trained on Recursively Generated Data”
    This Nature article highlights the long-term risks of training models on recursively generated data. The study finds that models progressively lose information about the true data distribution, particularly at the distribution’s tails, eventually converging to a distribution with reduced variance. The paper presents a theoretical framework explaining this collapse, showing it as a universal phenomenon across generative models. Even without estimation errors, the compounding of small inaccuracies over generations leads to collapse, emphasizing the need for access to original, human-generated data to prevent this outcome.


  5. “Addressing Concerns of Model Collapse from Synthetic Data in AI”
    Alexander Watson’s article in Towards Data Science presents a counterargument to the concerns about model collapse. He acknowledges the risks but argues that these can be mitigated by strategically combining synthetic and real data during training. Watson suggests using differential privacy techniques and carefully curating synthetic datasets to ensure they reflect real-world data diversity. While synthetic data alone might lead to collapse, thoughtful integration with real data can preserve model performance and reduce the risk of degradation.


  6. “LoRA Learns Less and Forgets Less”
    This paper examines Low-Rank Adaptation (LoRA) as a parameter-efficient finetuning method for large language models. LoRA trains only low-rank perturbations to selected weight matrices, saving memory and computational resources. The study finds that while LoRA underperforms compared to full finetuning, it offers a desirable form of regularization by maintaining the base model’s performance on tasks outside the target domain. LoRA helps mitigate the “forgetting” of the source domain, a key issue in model collapse. The authors provide a detailed analysis of LoRA’s performance across different domains and propose best practices for finetuning with LoRA.
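The contrast drawn in the first entry of this list, between replacing data each generation and accumulating real and synthetic data, can be illustrated with a toy simulation. This is a rough sketch under strong assumptions, not the experiment from any of the papers above: the "model" is a one-dimensional Gaussian fit, and the function names, sample size, generation count, and seed are hypothetical choices.

```python
import math
import random


def fit_gaussian(samples):
    """Maximum-likelihood mean and standard deviation of a sample."""
    mean = sum(samples) / len(samples)
    variance = sum((s - mean) ** 2 for s in samples) / len(samples)
    return mean, math.sqrt(variance)


def run(strategy, generations=200, sample_size=10, seed=0):
    """Return the fitted std after recursive training under a data strategy."""
    rng = random.Random(seed)
    # Generation 0: a small set of real samples from N(0, 1).
    training_set = [rng.gauss(0.0, 1.0) for _ in range(sample_size)]
    for _ in range(generations):
        mean, std = fit_gaussian(training_set)
        synthetic = [rng.gauss(mean, std) for _ in range(sample_size)]
        if strategy == "replace":
            # Discard everything and train only on the newest synthetic data.
            training_set = synthetic
        elif strategy == "accumulate":
            # Keep the real data and all earlier generations alongside the new data.
            training_set = training_set + synthetic
        else:
            raise ValueError(f"unknown strategy: {strategy}")
    return fit_gaussian(training_set)[1]


replace_std = run("replace")
accumulate_std = run("accumulate")
print(f"replace:    final std = {replace_std:.3e}")
print(f"accumulate: final std = {accumulate_std:.3f}")
```

In this toy setting the replace strategy loses nearly all of its variance, while the accumulate strategy stays on the same order as the original spread because the real samples never leave the training set.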