Authors:

(1) Tony Lee, Stanford (equal contribution);
(2) Michihiro Yasunaga, Stanford (equal contribution);
(3) Chenlin Meng, Stanford (equal contribution);
(4) Yifan Mai, Stanford;
(5) Joon Sung Park, Stanford;
(6) Agrim Gupta, Stanford;
(7) Yunzhi Zhang, Stanford;
(8) Deepak Narayanan, Microsoft;
(9) Hannah Benita Teufel, Aleph Alpha;
(10) Marco Bellagente, Aleph Alpha;
(11) Minguk Kang, POSTECH;
(12) Taesung Park, Adobe;
(13) Jure Leskovec, Stanford;
(14) Jun-Yan Zhu, CMU;
(15) Li Fei-Fei, Stanford;
(16) Jiajun Wu, Stanford;
(17) Stefano Ermon, Stanford;
(18) Percy Liang, Stanford.

Table of Links

Abstract and 1 Introduction
2 Core framework
3 Aspects
4 Scenarios
5 Metrics
6 Models
7 Experiments and results
8 Related work
9 Conclusion
10 Limitations
Author contributions, Acknowledgments and References
A Datasheet
B Scenario details
C Metric details
D Model details
E Human evaluation procedure

9 Conclusion

We introduced Holistic Evaluation of Text-to-Image Models (HEIM), a new benchmark that assesses 12 important aspects of text-to-image generation: alignment, quality, aesthetics, originality, reasoning, knowledge, bias, toxicity, fairness, robustness, multilinguality, and efficiency. Our evaluation of 26 recent text-to-image models reveals that different models excel in different aspects, opening up research avenues to study whether and how to develop models that excel across multiple aspects. To enhance transparency and reproducibility, we release our evaluation pipeline, along with the generated images and human evaluation results. We encourage the community to consider these different aspects when developing text-to-image models.

This paper is available on arXiv under a CC BY 4.0 DEED license.