New Dimensions in Text-to-Image Model Evaluation

by @autoencoder


Too Long; Didn't Read

This section discusses the importance of holistic benchmarking in AI, particularly for image generation models. Existing benchmarks have largely focused on image quality and alignment, relying mainly on automated metrics. While recent efforts have incorporated human evaluation, they often fall short of assessing a wider range of aspects, including aesthetics, originality, reasoning, and ethical implications. Our work addresses these gaps by implementing a comprehensive evaluation framework that assesses all models across critical dimensions, ensuring a thorough understanding of their performance and societal impacts.

Authors:

(1) Tony Lee, Stanford (equal contribution);

(2) Michihiro Yasunaga, Stanford (equal contribution);

(3) Chenlin Meng, Stanford (equal contribution);

(4) Yifan Mai, Stanford;

(5) Joon Sung Park, Stanford;

(6) Agrim Gupta, Stanford;

(7) Yunzhi Zhang, Stanford;

(8) Deepak Narayanan, Microsoft;

(9) Hannah Benita Teufel, Aleph Alpha;

(10) Marco Bellagente, Aleph Alpha;

(11) Minguk Kang, POSTECH;

(12) Taesung Park, Adobe;

(13) Jure Leskovec, Stanford;

(14) Jun-Yan Zhu, CMU;

(15) Li Fei-Fei, Stanford;

(16) Jiajun Wu, Stanford;

(17) Stefano Ermon, Stanford;

(18) Percy Liang, Stanford.

Abstract and 1 Introduction

2 Core framework

3 Aspects

4 Scenarios

5 Metrics

6 Models

7 Experiments and results

8 Related work

9 Conclusion

10 Limitations

Author contributions, Acknowledgments and References

A Datasheet

B Scenario details

C Metric details

D Model details

E Human evaluation procedure

Holistic benchmarking. Benchmarks drive progress in AI by giving the community concrete targets to improve on [20, 56, 57, 58]. In natural language processing (NLP) in particular, the adoption of meta-benchmarks [59, 60, 61, 62] and holistic evaluation [1] across multiple scenarios or tasks has enabled comprehensive assessment of models and accelerated their improvement. However, despite the growing popularity of image generation and the increasing number of models being developed, these models have not yet been evaluated holistically. Image generation also raises both technical and societal concerns, including alignment, quality, originality, toxicity, bias, and fairness, all of which call for comprehensive evaluation. Our work fills this gap by conducting a holistic evaluation of image generation models across 12 important aspects.


Benchmarks for image generation. Existing benchmarks primarily focus on assessing image quality and alignment, using automated metrics [23, 36, 24, 63, 64, 65]. Widely used benchmarks such as MS-COCO [21] and ImageNet [20] have been employed to evaluate the quality and alignment of generated images. Metrics like Fréchet Inception Distance (FID) [23, 66], Inception Score [36], and CLIPScore [24] are commonly used for quantitative assessment of image quality and alignment.
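To make two of the automated metrics named above concrete, here is a minimal Python sketch of the standard Inception Score and CLIPScore formulas. It is an illustration, not the evaluation code used in this work. It assumes the classifier probabilities and the image and text embeddings have already been computed by the usual pretrained models (an Inception classifier and CLIP, respectively). The function names `inception_score` and `clip_score` are hypothetical, and the default rescaling weight of 2.5 follows the CLIPScore paper's convention.

```python
import math
from typing import Sequence


def inception_score(class_probs: Sequence[Sequence[float]]) -> float:
    """Inception Score: exp of the mean KL(p(y|x) || p(y)) over images.

    class_probs[i] is the predicted class distribution p(y|x_i) for
    generated image i (assumed precomputed; each row should sum to 1).
    """
    n = len(class_probs)
    num_classes = len(class_probs[0])
    # Marginal class distribution p(y), averaged over all images.
    marginal = [sum(row[k] for row in class_probs) / n for k in range(num_classes)]
    total_kl = 0.0
    for row in class_probs:
        # Classes with p(y|x) = 0 contribute nothing to the KL divergence.
        total_kl += sum(
            p * math.log(p / marginal[k]) for k, p in enumerate(row) if p > 0
        )
    return math.exp(total_kl / n)


def clip_score(
    image_emb: Sequence[float], text_emb: Sequence[float], w: float = 2.5
) -> float:
    """CLIPScore: w * max(cosine(image_emb, text_emb), 0).

    Both embeddings are assumed to come from the same pretrained CLIP model.
    """
    dot = sum(a * b for a, b in zip(image_emb, text_emb))
    norm = math.sqrt(sum(a * a for a in image_emb)) * math.sqrt(
        sum(b * b for b in text_emb)
    )
    if norm == 0.0:
        return 0.0  # Degenerate zero embedding: treat as no alignment.
    return w * max(dot / norm, 0.0)


if __name__ == "__main__":
    # Toy inputs only: confident, diverse predictions raise the Inception Score.
    print(inception_score([[1.0, 0.0], [0.0, 1.0]]))  # approximately 2.0
    print(inception_score([[0.5, 0.5], [0.5, 0.5]]))  # approximately 1.0
    # Aligned image and caption embeddings give the maximum CLIPScore of w.
    print(clip_score([1.0, 0.0], [2.0, 0.0]))
```

Both functions score only quality or alignment from model outputs, which illustrates why such metrics on their own say nothing about aspects like originality, bias, or toxicity.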


To better capture human perception in image evaluation, crowdsourced human evaluation has been explored in recent years [25, 6, 35, 67]. However, these evaluations have been limited to assessing aspects such as alignment and quality. Building upon these crowdsourcing techniques, we extend the evaluation to include additional aspects such as aesthetics, originality, reasoning, and fairness.


As the ethical and societal impacts of image generation models gain prominence [19], researchers have also started evaluating these aspects [33, 29, 8]. However, these evaluations have covered only a handful of models, leaving most models unevaluated on these aspects. Our standardized evaluation addresses this gap by enabling the evaluation of all models across all aspects, including ethical and societal dimensions.


Art and design. Our assessment of image generation incorporates aesthetic evaluation and design principles. Aesthetic evaluation considers factors like composition, color harmony, balance, and visual complexity [68, 69]. Design principles, such as clarity, legibility, hierarchy, and consistency in design elements, also influence our evaluation [70]. Combining these insights, we aim to determine whether generated images are visually pleasing, with thoughtful compositions, harmonious colors, balanced elements, and an appropriate level of visual complexity. We employ objective metrics and subjective human ratings for a comprehensive assessment of aesthetic quality.


This paper is available on arXiv under the CC BY 4.0 DEED license.