Image Captioning and Fine-tuning Stable Diffusion XL for unCLIP

Written by imagerecognition | Published 2025/04/09
Tech Story Tags: image-recognition | image-captioning | stable-diffusion | generativeimage2text | openclip | stable-diffusion-xl | multimodal-contrastive-model | clip-image-embeddings

TLDRThe study was conducted using fMRI-to-Image Reconstruction. The results were published in the journal Human Experiments. The study was published on November 13, 2013.via the TL;DR App

Abstract and 1 Introduction

2 MindEye2 and 2.1 Shared-Subject Functional Alignment

2.2 Backbone, Diffusion Prior, & Submodules

2.3 Image Captioning and 2.4 Fine-tuning Stable Diffusion XL for unCLIP

2.5 Model Inference

3 Results and 3.1 fMRI-to-Image Reconstruction

3.2 Image Captioning

3.3 Image/Brain Retrieval and 3.4 Brain Correlation

3.5 Ablations

4 Related Work

5 Conclusion

6 Acknowledgements and References

A Appendix

A.1 Author Contributions

A.2 Additional Dataset Information

A.3 MindEye2 (not pretrained) vs. MindEye1

A.4 Reconstruction Evaluations Across Varying Amounts of Training Data

A.5 Single-Subject Evaluations

A.6 UnCLIP Evaluation

A.7 OpenCLIP BigG to CLIP L Conversion

A.8 COCO Retrieval

A.9 Reconstruction Evaluations: Additional Information

A.10 Pretraining with Less Subjects

A.11 UMAP Dimensionality Reduction

A.12 ROI-Optimized Stimuli

A.13 Human Preference Experiments

2.3 Image Captioning

To predict image captions from brain activity we convert the diffusion prior’s predicted ViT-bigG/14 embeddings to CLIP ViT/L-14 space and then feed through a frozen pretrained GenerativeImage2Text (GIT) model (Wang et al., 2022). The use of GIT to caption images from brain activity in the Natural Scenes Dataset was previously shown to be viable by Ferrante et al. (2023b). We independently trained a linear model to convert from OpenCLIP ViT-bigG/14 embeddings to CLIP ViT-L/14 embeddings (see Appendix A.7), which was necessary because there was no existing GIT model that accepted OpenCLIP ViT-bigG/14 embeddings as inputs. Image caption prediction from brain activity lends further flexibility to such decoding approaches and can help refine image reconstructions to match desired semantic content.

2.4 Fine-tuning Stable Diffusion XL for unCLIP

CLIP (Radford et al., 2021) is an example of a multimodal contrastive model that maps images and text captions to a shared embedding space. unCLIP (or image variations) models go from this shared embedding space back to pixel space, and have been used for the creative application of returning variations of a given reference image (Xu et al., 2023; Ye et al., 2023; Pinkney, 2022). As such, previous unCLIP models prioritized replication of high-level semantics over low-level structures. These models can be trained by fine-tuning a base image generation model to accept CLIP image embeddings instead of, or in addition to, text embeddings. Outputs are diffused from pure noise just like the base model, unlike image-to-image models (Meng et al., 2022) that start the diffusion process from a reference image mixed with noise.

Contrary to previous unCLIP models, our goal was to train a model that returns images as close as possible to the reference image across both low-level structure and high-level semantics. This is because our use-case was to exactly return the original image given its CLIP image embedding predicted from the brain.

The base Stable Diffusion XL (SDXL) (Podell et al., 2023) model uses text conditionings from both OpenCLIP ViTbigG/14 and CLIP ViT-L/14. They condition cross-attention layers on the penultimate text encoder outputs and additionally condition on pooled text embeddings from OpenCLIP ViT-bigG/14 by adding it to the timestep embedding. Here, we fine-tuned the cross-attention layers using the OpenCLIP ViT-bigG/14 image embeddings corresponding to all 256 patch tokens and we dropped the additional conditioning on pooled text embeddings. We opted to only condition on image embeddings because we observed that incorporating any text conditioning worsened the fidelity of the unCLIP reconstructions.

We evaluate the fidelity of our SDXL unCLIP model to reconstruct images from ground truth OpenCLIP ViT-bigG/14 image embeddings in Appendix A.6, showing that reconstructions are nearly identical to the original images. We fine-tuned SDXL on one 8xA100 80GB GPU node using an internal dataset for 110, 000 optimization steps at a resolution of 256 × 256 pixels and a batch size of 8 with offsetnoise (Lin et al., 2024; Guttenberg, 2023) set to 0.04. All other settings were identical to those used with base Stable Diffusion XL. Like Stable Diffusion XL, this unCLIP model can output different aspect ratios, however, we observed best results with 768 × 768 resolution.

This paper is available on arxiv under CC BY 4.0 DEED license.

Authors:

(1) Paul S. Scotti, Stability AI and Medical AI Research Center (MedARC);

(2) Mihir Tripathy, Medical AI Research Center (MedARC) and a Core contribution;

(3) Cesar Kadir Torrico Villanueva, Medical AI Research Center (MedARC) and a Core contribution;

(4) Reese Kneeland, University of Minnesota and a Core contribution;

(5) Tong Chen, The University of Sydney and Medical AI Research Center (MedARC);

(6) Ashutosh Narang, Medical AI Research Center (MedARC);

(7) Charan Santhirasegaran, Medical AI Research Center (MedARC);

(8) Jonathan Xu, University of Waterloo and Medical AI Research Center (MedARC);

(9) Thomas Naselaris, University of Minnesota;

(10) Kenneth A. Norman, Princeton Neuroscience Institute;

(11) Tanishq Mathew Abraham, Stability AI and Medical AI Research Center (MedARC).


Written by imagerecognition | Image Recognition
Published by HackerNoon on 2025/04/09