Authors:
(1) Tony Lee, Stanford (equal contribution);
(2) Michihiro Yasunaga, Stanford (equal contribution);
(3) Chenlin Meng, Stanford (equal contribution);
(4) Yifan Mai, Stanford;
(5) Joon Sung Park, Stanford;
(6) Agrim Gupta, Stanford;
(7) Yunzhi Zhang, Stanford;
(8) Deepak Narayanan, Microsoft;
(9) Hannah Benita Teufel, Aleph Alpha;
(10) Marco Bellagente, Aleph Alpha;
(11) Minguk Kang, POSTECH;
(12) Taesung Park, Adobe;
(13) Jure Leskovec, Stanford;
(14) Jun-Yan Zhu, CMU;
(15) Li Fei-Fei, Stanford;
(16) Jiajun Wu, Stanford;
(17) Stefano Ermon, Stanford;
(18) Percy Liang, Stanford.

Table of Links

Abstract and 1 Introduction
2 Core framework
3 Aspects
4 Scenarios
5 Metrics
6 Models
7 Experiments and results
8 Related work
9 Conclusion
10 Limitations
Author contributions, Acknowledgments and References
A Datasheet
B Scenario details
C Metric details
D Model details
E Human evaluation procedure

3 Aspects

We evaluate 12 diverse aspects crucial for deploying text-to-image models, as detailed in Table 1. For each aspect, we provide a rationale for its inclusion and discuss its corresponding scenarios and metrics (refer to Figure 1 for an illustration). Further details regarding all scenarios and metrics are presented in §4 and §5.

Text-image alignment and image quality are commonly studied aspects in existing efforts to evaluate text-to-image models [23, 24, 35]. Since these are general aspects, we can assess them for any scenario. For alignment, we use metrics like CLIPScore [24] and human-rated alignment score. For quality, we use metrics such as FID [23], Inception Score [36], and human-rated photorealism. While automated metrics are useful, they may not always capture the nuances of human perception and judgment [25, 26, 27], so we also rely on human metrics.

We introduce aesthetics and originality as new aspects, motivated by the recent surge in using text-to-image models for visual art creation [13, 15]. In particular, originality is crucial for addressing copyright infringement concerns in generative AI [37, 38, 39]. For these aspects, we introduce new scenarios related to art generation, such as MS-COCO Oil painting / Vector graphics and Landing page / Logo design. For aesthetics, we employ metrics like LAION aesthetics [40], fractal coefficient [41], human-rated subject clarity, and human-rated overall aesthetics. For originality, we employ metrics such as watermark detection [40] and human-rated originality scores.
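To make the alignment metric mentioned above more concrete, the following is a minimal sketch of CLIPScore as it is commonly defined: the cosine similarity between a caption embedding and an image embedding, clipped at zero and rescaled by a constant weight. The weight of 2.5 follows the usual definition of the metric and is an assumption about the exact setup used here; the function names and the toy vectors are hypothetical, and in practice the embeddings would come from a CLIP text encoder and image encoder.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("a zero vector has no direction")
    return dot / (norm_u * norm_v)


def clip_score(text_embedding, image_embedding, weight=2.5):
    """Rescaled, zero-clipped cosine similarity (weight=2.5 is an assumption)."""
    similarity = cosine_similarity(text_embedding, image_embedding)
    return weight * max(similarity, 0.0)


if __name__ == "__main__":
    # Toy embeddings standing in for real CLIP outputs (hypothetical values).
    caption = [3.0, 4.0]
    well_aligned_image = [4.0, 3.0]
    unrelated_image = [-4.0, 3.0]
    print(clip_score(caption, well_aligned_image))
    print(clip_score(caption, unrelated_image))
```

A higher value indicates that the generated image is closer to the prompt in the embedding space. The clipping at zero means that an image pointing away from the caption is scored the same as an orthogonal one.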
Knowledge and reasoning are crucial for generating precise images that fulfill user requirements [7, 29]. For knowledge, we introduce scenarios involving specific entities, such as Historical Figures. For reasoning, we use scenarios involving visual composition, such as PaintSkills [29]. For both aspects, we use CLIPScore and human-rated alignment scores as metrics.

Considering the ethical and societal impact of image generation models [19], we incorporate aspects of toxicity, bias, fairness, multilinguality, and robustness. Our definitions, outlined in Table 1, align with [1]. These aspects have been underexplored for existing text-to-image models (Figure 2 top), yet they are crucial for real-world model deployment. They can be used to monitor the generation of toxic and biased content (toxicity and bias) and ensure reliable performance across variations in inputs, such as different social groups (fairness), languages (multilinguality), and perturbations (robustness).

For toxicity, the scenarios can be prompts that are likely to produce inappropriate images [8], and the metric is the percentage of generated images that are deemed inappropriate (e.g., NSFW, nude, or blacked out). For bias, the scenarios can be prompts that may trigger stereotypical associations [33], and the metrics are the demographic biases in generated images, such as gender bias and skin tone bias. For fairness, multilinguality, and robustness, we introduce modified MS-COCO captions as new evaluation scenarios. Changes involve gender/dialect variations (fairness), translation into different languages (multilinguality), or the introduction of typos and misspellings (robustness). We then measure the performance change (e.g., in CLIPScore) compared to the unmodified MS-COCO scenario.

Lastly, efficiency holds practical importance for the usability of models [1]. Inference time serves as the metric, and any scenario can be employed, as efficiency is a general aspect.
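Two of the metrics described above can be sketched in a few lines: the percentage of generated images flagged as inappropriate (toxicity) and the performance change between a modified scenario and the unmodified one (fairness, multilinguality, robustness). The sketch below is illustrative only. The function names, the adjacent-letter swap used to imitate typos, and the toy scores are all assumptions, not the perturbations or numbers used in the evaluation itself.

```python
import random


def add_typos(caption, rate=0.1, seed=0):
    """Swap adjacent letters at random to imitate typos (hypothetical perturbation)."""
    rng = random.Random(seed)  # seeded so the perturbation is reproducible
    chars = list(caption)
    i = 0
    while i < len(chars) - 1:
        both_letters = chars[i].isalpha() and chars[i + 1].isalpha()
        if both_letters and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip the swapped pair so each letter moves at most once
        else:
            i += 1
    return "".join(chars)


def mean(values):
    """Arithmetic mean of a non-empty sequence of numbers."""
    values = list(values)
    if not values:
        raise ValueError("need at least one value")
    return sum(values) / len(values)


def performance_change(original_scores, modified_scores):
    """Mean score on the modified scenario minus mean score on the unmodified one."""
    return mean(modified_scores) - mean(original_scores)


def inappropriate_percentage(flags):
    """Percentage of generated images flagged as inappropriate."""
    flags = list(flags)
    if not flags:
        raise ValueError("need at least one image")
    return 100.0 * sum(1 for flagged in flags if flagged) / len(flags)


if __name__ == "__main__":
    print(add_typos("a photo of a cat sitting on a couch", rate=0.3))
    # Toy per-prompt alignment scores (made-up numbers).
    unmodified = [0.80, 0.70, 0.90]
    with_typos = [0.70, 0.60, 0.80]
    print(performance_change(unmodified, with_typos))
    print(inappropriate_percentage([False, True, False, False]))
```

Under this convention, a negative performance change means the model does worse on the modified captions than on the original ones, and a value near zero suggests the model is insensitive to that kind of variation.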
This paper is available on arxiv under CC BY 4.0 DEED license.
