Authors:

(1) Tony Lee, Stanford (Equal contribution);
(2) Michihiro Yasunaga, Stanford (Equal contribution);
(3) Chenlin Meng, Stanford (Equal contribution);
(4) Yifan Mai, Stanford;
(5) Joon Sung Park, Stanford;
(6) Agrim Gupta, Stanford;
(7) Yunzhi Zhang, Stanford;
(8) Deepak Narayanan, Microsoft;
(9) Hannah Benita Teufel, Aleph Alpha;
(10) Marco Bellagente, Aleph Alpha;
(11) Minguk Kang, POSTECH;
(12) Taesung Park, Adobe;
(13) Jure Leskovec, Stanford;
(14) Jun-Yan Zhu, CMU;
(15) Li Fei-Fei, Stanford;
(16) Jiajun Wu, Stanford;
(17) Stefano Ermon, Stanford;
(18) Percy Liang, Stanford.

Table of Links

Abstract and 1 Introduction
2 Core framework
3 Aspects
4 Scenarios
5 Metrics
6 Models
7 Experiments and results
8 Related work
9 Conclusion
10 Limitations
Author contributions, Acknowledgments and References
A Datasheet
B Scenario details
C Metric details
D Model details
E Human evaluation procedure

8 Related work

Holistic benchmarking. Benchmarks drive advances in AI by orienting the community toward directions for improvement [20, 56, 57, 58]. In natural language processing (NLP) in particular, the adoption of meta-benchmarks [59, 60, 61, 62] and holistic evaluation [1] across multiple scenarios or tasks has enabled comprehensive assessments of models and accelerated model improvements. However, despite the growing popularity of image generation and the increasing number of models being developed, a comparably holistic evaluation of these models has been lacking. Furthermore, image generation carries a range of technological and societal implications, including alignment, quality, originality, toxicity, bias, and fairness, which necessitates comprehensive evaluation. Our work fills this gap by holistically evaluating image generation models across 12 important aspects.

Benchmarks for image generation. Existing benchmarks primarily focus on assessing image quality and alignment using automated metrics [23, 36, 24, 63, 64, 65]. Widely used benchmarks such as MS-COCO [21] and ImageNet [20] have been employed to evaluate the quality and alignment of generated images, and metrics such as Fréchet Inception Distance (FID) [23, 66], Inception Score [36], and CLIPScore [24] are commonly used for quantitative assessment of these two aspects. To better capture human perception, crowdsourced human evaluation has been explored in recent years [25, 6, 35, 67], but these evaluations have likewise been limited to alignment and quality. Building on these crowdsourcing techniques, we extend the evaluation to additional aspects such as aesthetics, originality, reasoning, and fairness. As the ethical and societal impacts of image generation models gain prominence [19], researchers have also begun to evaluate these aspects [33, 29, 8]; however, such evaluations have covered only a select few models, leaving most models unassessed. Our standardized evaluation addresses this gap by enabling every model to be evaluated on every aspect, including ethical and societal dimensions.
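To make the automated alignment metrics mentioned above concrete, the following is a minimal sketch of how CLIPScore [24] can be computed with an off-the-shelf CLIP model. The checkpoint name and the rescaling weight w = 2.5 follow Hessel et al.'s original formulation and are illustrative assumptions; this is not the exact configuration used in our benchmark.

```python
# Minimal CLIPScore sketch: cosine similarity between CLIP text and image
# embeddings, rescaled by w = 2.5 (per Hessel et al., 2021). Illustrative only.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image: Image.Image, prompt: str, w: float = 2.5) -> float:
    """Return w * max(cos(text_emb, image_emb), 0); higher means better alignment."""
    inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = model.get_text_features(
            input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
        )
    # Normalize embeddings before taking the dot product (cosine similarity).
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    cos = (img_emb * txt_emb).sum(dim=-1).item()
    return w * max(cos, 0.0)

# Example usage (hypothetical file name):
# score = clip_score(Image.open("generated.png"), "a red cube on top of a blue sphere")
```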
Art and design. Our assessment of image generation incorporates aesthetic evaluation and design principles. Aesthetic evaluation considers factors such as composition, color harmony, balance, and visual complexity [68, 69]. Design principles, such as clarity, legibility, hierarchy, and consistency of design elements, also inform our evaluation [70]. Combining these insights, we aim to determine whether generated images are visually pleasing, with thoughtful compositions, harmonious colors, balanced elements, and an appropriate level of visual complexity. We employ both objective metrics and subjective human ratings for a comprehensive assessment of aesthetic quality.

This paper is available on arXiv under a CC BY 4.0 DEED license.