
Evaluating AI Models with HEIM Metrics for Fairness, Robustness, and More

by Auto Encoder: How to Ignore the Signal Noise

October 12th, 2024
Too Long; Didn't Read

HEIM introduces a diverse set of metrics, combining human and automated evaluations to assess text-to-image models across 12 aspects. Human metrics help capture nuanced qualities like photorealism, originality, and alignment, while new metrics close gaps for fairness, robustness, multilinguality, and efficiency. Crowdsourcing ensures realistic and representative evaluations.

STORY’S CREDIBILITY

Academic Research Paper

Part of HackerNoon's growing list of open-source research papers, promoting free access to academic material.

Authors:

(1) Tony Lee, Stanford (equal contribution);

(2) Michihiro Yasunaga, Stanford (equal contribution);

(3) Chenlin Meng, Stanford (equal contribution);

(4) Yifan Mai, Stanford;

(5) Joon Sung Park, Stanford;

(6) Agrim Gupta, Stanford;

(7) Yunzhi Zhang, Stanford;

(8) Deepak Narayanan, Microsoft;

(9) Hannah Benita Teufel, Aleph Alpha;

(10) Marco Bellagente, Aleph Alpha;

(11) Minguk Kang, POSTECH;

(12) Taesung Park, Adobe;

(13) Jure Leskovec, Stanford;

(14) Jun-Yan Zhu, CMU;

(15) Li Fei-Fei, Stanford;

(16) Jiajun Wu, Stanford;

(17) Stefano Ermon, Stanford;

(18) Percy Liang, Stanford.

Abstract and 1 Introduction

2 Core framework

3 Aspects

4 Scenarios

5 Metrics

6 Models

7 Experiments and results

8 Related work

9 Conclusion

10 Limitations

Author contributions, Acknowledgments and References

A Datasheet

B Scenario details

C Metric details

D Model details

E Human evaluation procedure

5 Metrics

To evaluate the 12 aspects (§3), we also curate a diverse and realistic set of metrics. Table 3 presents an overview of all the metrics and their descriptions.


Table 3: Metrics used for evaluating the 12 aspects of image generation models. We use realistic, human metrics as well as automated and commonly-used existing metrics.


Table 4: Models evaluated in the HEIM effort.
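A commonly used automated metric for text-image alignment is a CLIP-based similarity score between the prompt and the generated image. The sketch below is a minimal illustration of that idea using the Hugging Face transformers CLIP model; the helper name clip_alignment_score and the checkpoint choice are assumptions for illustration, not HEIM's exact implementation.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

def clip_alignment_score(image_path: str, prompt: str) -> float:
    """Cosine similarity between CLIP text and image embeddings (illustrative)."""
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)

    with torch.no_grad():
        outputs = model(**inputs)

    # text_embeds / image_embeds are the projected (and normalized) embeddings.
    score = torch.nn.functional.cosine_similarity(
        outputs.text_embeds, outputs.image_embeds
    )
    # Higher values indicate better text-image alignment.
    return score.item()
```

Automated scores like this are cheap to compute at scale, which is why human ratings are used alongside them to capture qualities such scores miss.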


Compared to previous metrics, our metrics are more realistic and broader. First, in addition to automated metrics, we use human metrics (top rows in Table 3) to perform realistic evaluation that reflects human judgment [25, 26, 27]. Specifically, we employ human metrics for overall text-image alignment and photorealism, which are used for many evaluation aspects, including alignment, quality, knowledge, reasoning, fairness, robustness, and multilinguality. We also employ human metrics for overall aesthetics and originality, for which capturing the nuances of human judgment is important. To conduct human evaluation, we employ crowdsourcing following the methodology described in Otani et al. [35]. Concrete English definitions are provided for each human evaluation question and rating choice, and a minimum of 5 participants evaluate each image. We use at least 100 image samples for each aspect. For more details about the crowdsourcing procedure, please refer to Appendix E.
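As a rough illustration of how such crowdsourced ratings might be aggregated, the sketch below averages at least five ratings per image and then averages across images to get an aspect-level score. The record format and function name are assumptions made for this example, not the paper's actual pipeline.

```python
from collections import defaultdict
from statistics import mean

def aggregate_ratings(records, min_raters=5):
    """records: iterable of (image_id, rater_id, rating) on a fixed rating scale."""
    by_image = defaultdict(list)
    for image_id, _rater_id, rating in records:
        by_image[image_id].append(rating)

    # Keep only images that received the minimum number of ratings (>= 5 in HEIM).
    per_image = {
        image_id: mean(ratings)
        for image_id, ratings in by_image.items()
        if len(ratings) >= min_raters
    }
    # Aspect-level score: average of per-image mean ratings.
    aspect_score = mean(per_image.values()) if per_image else float("nan")
    return per_image, aspect_score
```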


Our second contribution is new metrics for aspects that have received limited attention in existing evaluation efforts, namely fairness, robustness, multilinguality, and efficiency, as discussed in §3. These new metrics aim to close those evaluation gaps.
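For instance, a robustness-style metric can be framed as the drop in alignment when the model is prompted with a perturbed (e.g., typo-injected) version of the original prompt. The sketch below is a hypothetical illustration of that idea; generate_fn and score_fn stand in for any text-to-image model and any alignment metric (human or automated), and this is not HEIM's exact procedure.

```python
import random
from typing import Any, Callable

def perturb_with_typos(prompt: str, n_swaps: int = 1, seed: int = 0) -> str:
    """Inject simple typos by swapping adjacent characters (illustrative only)."""
    chars = list(prompt)
    if len(chars) < 2:
        return prompt
    rng = random.Random(seed)
    for _ in range(n_swaps):
        i = rng.randrange(len(chars) - 1)
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def robustness_gap(prompt: str,
                   generate_fn: Callable[[str], Any],
                   score_fn: Callable[[Any, str], float]) -> float:
    """Compare alignment of images from the clean vs. perturbed prompt,
    both scored against the ORIGINAL prompt."""
    clean_image = generate_fn(prompt)
    noisy_image = generate_fn(perturb_with_typos(prompt))
    # A small gap means generations stay faithful to the intended prompt
    # even when the input text contains errors.
    return score_fn(clean_image, prompt) - score_fn(noisy_image, prompt)
```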


This paper is available on arXiv under a CC BY 4.0 DEED license.

