Holistic Evaluation of Text-to-Image Models: Author contributions, Acknowledgments and References

Researchers at Stanford University, Microsoft, Aleph Alpha, POSTECH, Adobe and CMU present a framework for the holistic evaluation of text-to-image models, organized around aspects, scenarios and metrics. The paper also includes a datasheet and a description of the human evaluation procedure.
(1) Tony Lee, Stanford (equal contribution);

(2) Michihiro Yasunaga, Stanford (equal contribution);

(3) Chenlin Meng, Stanford (equal contribution);

(4) Yifan Mai, Stanford;

(5) Joon Sung Park, Stanford;

(6) Agrim Gupta, Stanford;

(7) Yunzhi Zhang, Stanford;

(8) Deepak Narayanan, Microsoft;

(9) Hannah Benita Teufel, Aleph Alpha;

(10) Marco Bellagente, Aleph Alpha;

(11) Minguk Kang, POSTECH;

(12) Taesung Park, Adobe;

(13) Jure Leskovec, Stanford;

(14) Jun-Yan Zhu, CMU;

(15) Li Fei-Fei, Stanford;

(16) Jiajun Wu, Stanford;

(17) Stefano Ermon, Stanford;

(18) Percy Liang, Stanford.

Abstract and 1 Introduction

2 Core framework

3 Aspects

4 Scenarios

5 Metrics

6 Models

7 Experiments and results

8 Related work

9 Conclusion

10 Limitations

Author contributions, Acknowledgments and References

A Datasheet

B Scenario details

C Metric details

D Model details

E Human evaluation procedure

Author contributions

Tony Lee: Co-led the project. Designed the core framework (aspects, scenarios, metrics). Implemented scenarios, metrics and models. Conducted experiments. Contributed to writing.

Michihiro Yasunaga: Co-led the project. Designed the core framework (aspects, scenarios, metrics). Wrote the paper. Conducted analysis. Implemented models.

Chenlin Meng: Designed the core framework (aspects, scenarios, metrics). Contributed to writing.

Yifan Mai: Implemented the evaluation infrastructure. Contributed to project discussions.

Joon Sung Park: Designed human evaluation questions.

Agrim Gupta: Implemented the detection scenario and metrics.

Yunzhi Zhang: Implemented the detection scenario and metrics.

Deepak Narayanan: Provided expertise and analysis of efficiency metrics.

Hannah Teufel: Provided model expertise and inference.

Marco Bellagente: Provided model expertise and inference.

Minguk Kang: Provided model expertise and inference.

Taesung Park: Provided model expertise and inference.

Jure Leskovec: Provided advice on human evaluation and paper writing.

Jun-Yan Zhu: Provided advice on human evaluation and paper writing.

Li Fei-Fei: Provided advice on the core framework.

Jiajun Wu: Provided advice on the core framework.

Stefano Ermon: Provided advice on the core framework.

Percy Liang: Provided overall supervision and guidance throughout the project.

Statement of neutrality. The authors of this paper affirm their commitment to maintaining a fair and independent evaluation of the image generation models. We acknowledge that the author affiliations encompass a range of academic and industrial institutions, including those where some of the models we evaluate were developed. However, the authors’ involvement is solely based on their expertise and efforts to run and evaluate the models, and the authors have treated all models equally throughout the evaluation process, regardless of their sources. This study aims to provide an objective understanding and assessment of models across various aspects, and we do not intend to endorse specific models.


Acknowledgments

We thank Robin Rombach, Yuhui Zhang, members of the Stanford P-Lambda, CRFM, and SNAP groups, as well as our anonymous reviewers for providing valuable feedback. We thank Josselin Somerville for assisting with the human evaluation infrastructure. This work is supported in part by the AI2050 program at Schmidt Futures (Grant G-22-63429), a Google Research Award, and ONR N00014-23-1-2355. Michihiro Yasunaga is supported by a Microsoft Research PhD Fellowship.


This paper is available on arXiv under a CC BY 4.0 DEED license.