OpenAI recently released DALL·E 3 in ChatGPT, a successor to the impressive DALL·E 2, and it's nothing short of a technological marvel. This week, they also shared a paper on how they built this gem. Here are the takeaways...
DALL·E 3 is trained on highly descriptive, generated image captions, marking a departure from the previous model's training methodology. It doesn't just create images; it offers a depth of prompt understanding and creativity that goes well beyond previous approaches. The model's ability to interpret prompts has been elevated thanks to a new component at its core: a robust image captioner.
The journey from DALL·E 2 to DALL·E 3 is marked by significant advancements in the quality and complexity of generated images, mainly thanks to the quality of the training data. As always! The new model is trained on a blend of 95% synthetic captions and 5% ground-truth captions. These captions are not just short texts; they are detailed narratives, offering descriptions that are as rich as they are accurate. In other words, the training data isn't simply Instagram pictures scraped along with their captions. OpenAI used another model to generate detailed descriptions of images, building millions of much higher-quality image-caption pairs.
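To make that 95/5 blend concrete, here is a minimal, hypothetical sketch of how such a caption mix could be sampled when assembling training pairs. The field names and example record are made up for illustration and are not OpenAI's actual data format or pipeline:

```python
import random

# Hypothetical records: each image has a scraped (ground-truth) caption and a
# detailed synthetic caption produced by the image captioner model.
dataset = [
    {
        "image": "img_001.png",
        "ground_truth": "my dog at the beach",
        "synthetic": (
            "A golden retriever running along a sandy beach at sunset, "
            "waves breaking in the background and wet sand reflecting the sky."
        ),
    },
    # ... millions more pairs in the real dataset
]

SYNTHETIC_RATIO = 0.95  # the paper reports a 95% synthetic / 5% ground-truth blend


def pick_caption(record):
    """Sample one caption per training example according to the blend ratio."""
    if random.random() < SYNTHETIC_RATIO:
        return record["synthetic"]
    return record["ground_truth"]


training_pairs = [(r["image"], pick_caption(r)) for r in dataset]
```

Keeping a small share of ground-truth captions in the mix helps the model stay grounded in the kind of shorter, noisier prompts real users actually write.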
The image captioner model is akin to a language model like ChatGPT but tailored for images. It works with tokens, numerical representations that the model interprets and processes to generate coherent, contextually relevant sentences. Approaches like CLIP convert both text and images into a shared compressed space, which keeps the generated captions consistent with and relevant to the image content.
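Here is a minimal sketch of that shared-space idea using OpenAI's open-source CLIP library, embedding one image and a few candidate captions and scoring how well they match. The file path and caption strings are placeholders, and this is only an illustration of the CLIP concept, not DALL·E 3's actual captioning pipeline:

```python
# pip install git+https://github.com/openai/CLIP.git
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Placeholder image and candidate captions.
image = preprocess(Image.open("photo.jpg")).unsqueeze(0).to(device)
captions = [
    "a dog on a beach",
    "a golden retriever running on wet sand at sunset",
    "a city skyline at night",
]
text = clip.tokenize(captions).to(device)

with torch.no_grad():
    # Both modalities are projected into the same embedding space.
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

    # Normalize, then use cosine similarity to score image-caption alignment.
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    similarity = image_features @ text_features.T

for caption, score in zip(captions, similarity.squeeze(0).tolist()):
    print(f"{score:.3f}  {caption}")
```

The more descriptive caption should score higher for a matching image, which is exactly the property that makes detailed synthetic captions valuable training signal.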
In evaluations, DALL·E 3 consistently outperforms DALL·E 2. It's not just about aesthetics but about the harmony of style, coherence, and prompt adherence that DALL·E 3 has mastered. However, like any deep learning-based solution, it has its limitations: spatial awareness is still incomplete, and text rendering still has room for improvement.
Watch the full video for a deep dive into what sets DALL·E 3 apart: