Table of Links
2 MindEye2 and 2.1 Shared-Subject Functional Alignment
2.2 Backbone, Diffusion Prior, & Submodules
2.3 Image Captioning and 2.4 Fine-tuning Stable Diffusion XL for unCLIP
3 Results and 3.1 fMRI-to-Image Reconstruction
3.3 Image/Brain Retrieval and 3.4 Brain Correlation
6 Acknowledgements and References
A Appendix
A.2 Additional Dataset Information
A.3 MindEye2 (not pretrained) vs. MindEye1
A.4 Reconstruction Evaluations Across Varying Amounts of Training Data
A.5 Single-Subject Evaluations
A.7 OpenCLIP BigG to CLIP L Conversion
A.9 Reconstruction Evaluations: Additional Information
A.10 Pretraining with Less Subjects
A.11 UMAP Dimensionality Reduction
A.13 Human Preference Experiments
Abstract
Reconstructions of visual perception from brain activity have improved tremendously, but the practical utility of such methods has been limited. This is because such models are trained independently per subject, and each subject requires dozens of hours of expensive fMRI training data to attain high-quality results. The present work showcases high-quality reconstructions using only 1 hour of fMRI training data. We pretrain our model across 7 subjects and then fine-tune on minimal data from a new subject. Our novel functional alignment procedure linearly maps all brain data to a shared-subject latent space, followed by a shared non-linear mapping to CLIP image space. We then map from CLIP space to pixel space by fine-tuning Stable Diffusion XL to accept CLIP latents as inputs instead of text. This approach improves out-of-subject generalization with limited training data and also attains state-of-the-art image retrieval and reconstruction metrics compared to single-subject approaches. MindEye2 demonstrates how accurate reconstructions of perception are possible from a single visit to the MRI facility. All code is available on GitHub.
1 Introduction
Spurred by the open releases of deep learning models such as CLIP (Radford et al., 2021) and Stable Diffusion (Rombach et al., 2022), along with large-scale functional magnetic resonance imaging (fMRI) datasets such as the Natural Scenes Dataset (Allen et al., 2022) where human participants were scanned viewing tens of thousands of images, there has been an influx of research papers demonstrating the ability to reconstruct visual perception from brain activity with high fidelity (Takagi and Nishimoto, 2022; 2023; Ozcelik et al., 2022; Ozcelik and VanRullen, 2023; Gaziv et al., 2022; Gu et al., 2023; Scotti et al., 2023; Kneeland et al., 2023a;b;c; Ferrante et al., 2023a; Thual et al., 2023; Chen et al., 2023a;b; Sun et al., 2023; Mai and Zhang, 2023; Xia et al., 2023). fMRI indirectly measures neural activity by detecting changes in blood oxygenation. These patterns of fMRI brain activity are translated into embeddings of pretrained deep learning models and used to visualize internal mental representations (Beliy et al., 2019; Shen et al., 2019a;b; Seeliger et al., 2018; Lin et al., 2019).
Visualization of internal mental representations, and more generally the ability to map patterns of brain activity to the latent space of rich pretrained deep learning models, has the potential to enable novel clinical assessment approaches and brain-computer interface applications. However, despite all the recent research demonstrating high-fidelity reconstructions of perception, the practical adoption of such approaches in these settings has been limited if not entirely absent. A major reason for this is that the high-quality results shown in these papers use single-subject models that do not generalize across people and that have only been shown to work well if each subject contributes dozens of hours of expensive fMRI training data. MindEye2 introduces a novel functional alignment procedure that addresses these barriers by pretraining a shared-subject model that can be fine-tuned using limited data from a held-out subject and then generalizes to held-out data from that subject. This approach yields similar reconstruction quality to a single-subject model trained using 40× the training data. See Figure 1 for selected samples of reconstructions obtained from just 1 hour of data from subject 1 compared to their full 40 hours of training data in the Natural Scenes Dataset.
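The shared-subject idea summarized in the abstract (a linear map per subject into a shared latent space, followed by a shared non-linear map to CLIP image space) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the dimensions, the voxel counts, the two-layer ReLU stand-in for the shared mapping, and the helper names `make_linear`, `apply_linear`, `predict_clip`, and `add_new_subject`. The weights are random and no training is performed, so this shows only the data flow, not the authors' implementation.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible

SHARED_DIM = 8  # hypothetical size of the shared-subject latent space
CLIP_DIM = 4    # hypothetical stand-in for the CLIP image embedding size


def make_linear(n_in, n_out):
    """Random (n_out x n_in) matrix standing in for a learned linear layer."""
    scale = 1.0 / n_in ** 0.5
    return [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]


def apply_linear(weights, x):
    """Matrix-vector product: one output value per weight row."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


# One linear layer per subject, because voxel counts differ across brains.
voxel_counts = {"subj01": 20, "subj02": 17}  # hypothetical voxel counts
subject_layers = {s: make_linear(n, SHARED_DIM) for s, n in voxel_counts.items()}

# Weights shared by every subject (a tiny two-layer stand-in for the backbone).
shared_hidden = make_linear(SHARED_DIM, SHARED_DIM)
shared_out = make_linear(SHARED_DIM, CLIP_DIM)


def predict_clip(subject, voxels):
    """Subject-specific linear map, then the shared non-linear map."""
    z = apply_linear(subject_layers[subject], voxels)
    h = [max(0.0, v) for v in apply_linear(shared_hidden, z)]  # ReLU
    return apply_linear(shared_out, h)


def add_new_subject(subject, n_voxels):
    """A new subject only needs a fresh input layer; shared weights are reused."""
    subject_layers[subject] = make_linear(n_voxels, SHARED_DIM)
```

In this sketch, inputs of different lengths from different subjects all end up as vectors of the same size, which is what lets a held-out subject reuse the pretrained shared mapping after calling `add_new_subject`.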
In addition to a novel approach to shared-subject alignment, MindEye2 builds upon the previous SOTA approach introduced by MindEye1 (Scotti et al., 2023). Both approaches map flattened spatial patterns of fMRI activity across voxels (3-dimensional cubes of cortical tissue) to the image embedding latent space of a pretrained CLIP (Radford et al., 2021) model with the help of a residual MLP backbone, diffusion prior, and retrieval submodule. The diffusion prior (Ramesh et al., 2022) is used for reconstruction and is trained from scratch to take in the outputs from the MLP backbone and produce aligned embeddings suitable as inputs to any pretrained image generation model that accepts CLIP image embeddings (hereafter referred to as unCLIP models). The retrieval submodule is contrastively trained and produces CLIP-fMRI embeddings that can be used to find the original (or nearest neighbor) image in a pool of images, but is not used to reconstruct a novel image. Both MindEye2 and MindEye1 also map brain activity to the latent space of Stable Diffusion’s (Rombach et al., 2022) variational autoencoder (VAE) to obtain blurry reconstructions that lack high-level semantic content but perform well on low-level image metrics (e.g., color, texture, spatial position). These are combined with the semantically rich outputs from the diffusion prior to return reconstructions that perform well across perceptual and semantic features.
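The retrieval step described above, finding the original (or nearest neighbor) image in a pool, can be sketched as a nearest-neighbor search over embeddings. This is a minimal illustration that assumes cosine similarity as the similarity measure and non-zero embedding vectors; the function names `cosine_similarity` and `retrieve` and the toy three-dimensional vectors are hypothetical and are not taken from the paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(brain_embedding, image_pool):
    """Return the index of the pool embedding most similar to the query."""
    scores = [cosine_similarity(brain_embedding, img) for img in image_pool]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy pool of three image embeddings (assumed values for illustration only).
pool = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
query = [0.1, 0.9, 0.2]  # stands in for an embedding predicted from fMRI
best = retrieve(query, pool)  # index 1: the second image is the closest match
```

Retrieval of this kind only selects among existing images, which matches the text's point that the retrieval submodule is not used to generate a novel reconstruction.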
MindEye2 innovates upon MindEye1 in the following ways: (1) Rather than the whole pipeline being independently trained per subject, MindEye2 is pretrained on data from other subjects and then fine-tuned on the held-out target subject. (2) We map from fMRI activity to a richer CLIP space provided by OpenCLIP ViT-bigG/14 (Schuhmann et al., 2022; Ilharco et al., 2021), and reconstruct images via a fine-tuned Stable Diffusion XL unCLIP model that supports inputs from this latent space. (3) We merge the previously independent high- and low-level pipelines into a single pipeline through the use of submodules. (4) We additionally predict the text captions of images to be used as conditional guidance during a final image reconstruction refinement step.
The above changes support the following main contributions of this work: (1) Using the full fMRI training data from the Natural Scenes Dataset, we achieve state-of-the-art performance across image retrieval and reconstruction metrics. (2) Our novel multi-subject alignment procedure enables competitive decoding performance even with only 2.5% of a subject’s full dataset (i.e., 1 hour of scanning).
This paper is available on arXiv under a CC BY 4.0 DEED license.
Authors:
(1) Paul S. Scotti, Stability AI and Medical AI Research Center (MedARC);
(2) Mihir Tripathy, Medical AI Research Center (MedARC), core contribution;
(3) Cesar Kadir Torrico Villanueva, Medical AI Research Center (MedARC), core contribution;
(4) Reese Kneeland, University of Minnesota, core contribution;
(5) Tong Chen, The University of Sydney and Medical AI Research Center (MedARC);
(6) Ashutosh Narang, Medical AI Research Center (MedARC);
(7) Charan Santhirasegaran, Medical AI Research Center (MedARC);
(8) Jonathan Xu, University of Waterloo and Medical AI Research Center (MedARC);
(9) Thomas Naselaris, University of Minnesota;
(10) Kenneth A. Norman, Princeton Neuroscience Institute;
(11) Tanishq Mathew Abraham, Stability AI and Medical AI Research Center (MedARC).
