Authors:
(1) Tony Lee, Stanford (equal contribution);
(2) Michihiro Yasunaga, Stanford (equal contribution);
(3) Chenlin Meng, Stanford (equal contribution);
(4) Yifan Mai, Stanford;
(5) Joon Sung Park, Stanford;
(6) Agrim Gupta, Stanford;
(7) Yunzhi Zhang, Stanford;
(8) Deepak Narayanan, Microsoft;
(9) Hannah Benita Teufel, Aleph Alpha;
(10) Marco Bellagente, Aleph Alpha;
(11) Minguk Kang, POSTECH;
(12) Taesung Park, Adobe;
(13) Jure Leskovec, Stanford;
(14) Jun-Yan Zhu, CMU;
(15) Li Fei-Fei, Stanford;
(16) Jiajun Wu, Stanford;
(17) Stefano Ermon, Stanford;
(18) Percy Liang, Stanford.
Our work identifies 12 important aspects of real-world deployments of text-to-image generation models: alignment, quality, aesthetics, originality, reasoning, knowledge, bias, toxicity, fairness, robustness, multilinguality, and efficiency. While we have made substantial progress toward a holistic evaluation of models across these aspects, our work has certain limitations that should be acknowledged.
Firstly, our identified 12 aspects may not be exhaustive; there could be other important aspects of text-to-image generation that we have not considered. This is an ongoing area of research, and future studies may uncover additional dimensions that are critical for evaluating image generation models. We welcome further exploration in this direction to ensure a comprehensive understanding of the field.
Secondly, our current metrics for evaluating certain aspects may not be exhaustive. For instance, our assessment of bias focuses on binary gender and skin tone representations, yet other demographic factors warrant consideration. Additionally, our assessment of efficiency relies on measuring wall-clock time, which directly captures latency but serves only as a surrogate for the models' actual energy consumption. In future work, we intend to expand our metrics to enable a more comprehensive evaluation of each aspect.
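As a concrete illustration of this efficiency proxy, the sketch below times a text-to-image call with a wall clock. The `generate_image` function and the prompts are hypothetical placeholders (not our actual evaluation code); the point is that this kind of measurement reports latency, not energy use.

```python
import statistics
import time

def generate_image(prompt: str) -> None:
    """Hypothetical stand-in for a text-to-image model call (placeholder, not a real API)."""
    time.sleep(0.05)  # simulate inference work

def measure_latency(prompts, repeats=3):
    """Record wall-clock time per generation; a latency proxy, not an energy measurement."""
    timings = []
    for prompt in prompts:
        for _ in range(repeats):
            start = time.perf_counter()
            generate_image(prompt)
            timings.append(time.perf_counter() - start)
    return {"mean_s": statistics.mean(timings), "max_s": max(timings)}

if __name__ == "__main__":
    print(measure_latency(["a red cube on a blue table", "a photo of a corgi"]))
```

Estimating energy consumption would additionally require hardware-level instrumentation (e.g., GPU power draw over the measured interval), which wall-clock timing alone does not provide.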
Lastly, there is an additional limitation related to the use of crowdsourced human evaluation. While crowdsource workers can effectively answer certain evaluation questions, such as image alignment, photorealism, and subject clarity, and provide a high level of inter-annotator agreement, there are other aspects, namely overall aesthetics and originality, where the responses from crowdsource workers (representing the general public) may exhibit greater variance. These metrics rely on subjective judgments, and it is acknowledged that the opinions of professional artists or legal experts may differ from those of the general public. Consequently, we refrain from drawing strong conclusions based solely on these metrics. However, we do believe there is value in considering the judgments of the general public, as it is reasonable to desire generated images to be visually pleasing and exhibit a sense of originality to a wide audience.
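To make the notion of inter-annotator agreement concrete, the sketch below computes Fleiss' kappa over a toy set of annotator ratings using statsmodels. The ratings are fabricated purely for illustration, and Fleiss' kappa is just one common agreement statistic, not necessarily the one used in our analysis.

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Toy ratings (illustrative only): rows are images, columns are crowd annotators,
# values are 1-5 Likert scores for a subjective question such as "overall aesthetics".
ratings = np.array([
    [5, 4, 5],
    [2, 3, 2],
    [4, 4, 5],
    [1, 2, 1],
    [3, 3, 4],
])

# Convert per-annotator labels into per-image category counts, then compute Fleiss' kappa.
counts, _categories = aggregate_raters(ratings)
kappa = fleiss_kappa(counts, method="fleiss")
print(f"Fleiss' kappa: {kappa:.3f}")  # lower values indicate noisier, more subjective judgments
```

Because Fleiss' kappa treats ratings as nominal categories, an ordinal-aware statistic such as Krippendorff's alpha may be preferable for Likert-style questions like aesthetics and originality.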
This paper is available on arxiv under CC BY 4.0 DEED license.