
DALL·E 3: Improving Image Generation with Better Captions

by Louis Bouchard, October 23rd, 2023

Too Long; Didn't Read

OpenAI recently released DALL·E 3 in ChatGPT, a successor to the impressive DALL·E 2. The new model is trained on a blend of 95% synthetic captions and 5% ground-truth captions. These captions are not just text; they are detailed narratives, offering rich and accurate descriptions.

OpenAI recently released DALL·E 3 in ChatGPT, a successor to the impressive DALL·E 2, and it's nothing short of a technological marvel. This week, they also shared a paper on how they built this gem. Here are the takeaways...


DALL·E 3 is trained on highly descriptive generated image captions, marking a departure from the previous model's training methodology. It doesn't just create images; it offers a depth of understanding and creativity that goes well beyond previous approaches. The model's ability to interpret prompts has been elevated thanks to a new model at its core: a robust image captioner.


The journey from DALL·E 2 to DALL·E 3 is marked by significant advancements in the quality and complexity of generated images, mainly thanks to the quality of the training data. As always! The new model is trained on a blend of 95% synthetic captions and 5% ground-truth captions. These captions are not just text; they are detailed narratives, offering descriptions that are as rich as they are accurate. In other words, they are not simply Instagram pictures with their scraped captions: OpenAI used another model to generate detailed descriptions of images, building millions of image-caption pairs of much higher quality.
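To make that 95/5 blend concrete, here is a minimal sketch of what sampling captions during training could look like. The ratio comes from the paper; the dataset records, filenames, and helper function are hypothetical illustrations, not OpenAI's actual pipeline:

```python
import random

# Hypothetical caption records: each image has a short human-written
# (ground-truth) caption and a long, model-generated (synthetic) one.
dataset = [
    {
        "image": "beach_001.jpg",
        "ground_truth": "A dog on a beach.",
        "synthetic": (
            "A golden retriever stands at the water's edge on a sandy "
            "beach at sunset, waves curling behind it under an orange sky."
        ),
    },
    # ... millions more pairs ...
]

SYNTHETIC_RATIO = 0.95  # blend ratio reported in the DALL·E 3 paper

def sample_caption(record: dict) -> str:
    """Pick the synthetic caption 95% of the time, ground truth otherwise."""
    if random.random() < SYNTHETIC_RATIO:
        return record["synthetic"]
    return record["ground_truth"]

for record in dataset:
    caption = sample_caption(record)
    # The (image, caption) pair would then be fed to the text-to-image model.
    print(record["image"], "->", caption)
```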


The image captioner model is akin to a language model like ChatGPT but tailored for images. It works on tokens, numerical representations that the model interprets and processes to generate coherent and contextually relevant sentences. The integration of approaches like CLIP facilitates the conversion of both text and images into a shared, compressed space, ensuring consistency and relevance in the generated content.
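To see the shared "compressed space" idea in action, here is a small sketch using the publicly released CLIP checkpoint via Hugging Face's transformers library. This is not OpenAI's internal captioner, just an illustration of how text and images land in the same embedding space where their similarity can be measured (the image filename is a placeholder):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Load a public CLIP checkpoint; DALL·E 3's internal models differ,
# but the shared text/image embedding idea is the same.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("beach_001.jpg")  # any local image
captions = ["A dog on a beach.", "A city skyline at night."]

inputs = processor(text=captions, images=image,
                   return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# Scaled cosine similarities between the image and each caption:
# the better-matching caption receives a higher probability.
print(outputs.logits_per_image.softmax(dim=-1))
```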


In evaluations, DALL·E 3 consistently outperforms DALL·E 2. It's not just about aesthetics but about the harmony of style, coherence, and prompt adherence that DALL·E 3 has mastered. However, like any deep learning-based solution, it has its limitations: its spatial awareness is incomplete, and text rendering within images still has room for improvement.
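One hedged way to approximate a prompt-adherence comparison yourself is a CLIP score: embed a prompt and each model's output, and see which image sits closer to the prompt. The output filenames below are hypothetical, and this automated metric is only a rough proxy for the human evaluations the paper relies on:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

prompt = "A red cube stacked on top of a blue sphere"
# Hypothetical output files generated by each model for the same prompt.
images = [Image.open("dalle2_output.png"), Image.open("dalle3_output.png")]

inputs = processor(text=[prompt], images=images,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)

# logits_per_text: similarity of the prompt to each image; the image
# with the higher score follows the prompt more closely, by CLIP's measure.
scores = out.logits_per_text.squeeze().tolist()
for name, score in zip(["DALL·E 2", "DALL·E 3"], scores):
    print(f"{name}: {score:.2f}")
```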


Watch the full video for a deep dive into what sets DALL·E 3 apart: