Meta-evaluation refers to the process of evaluating the fairness, reliability, and validity of evaluation protocols themselves. We incorporate several meta-evaluation methods into FreeEval.
Data Contamination occurs when an LLM is exposed to test data during training, leading to inflated performance scores and an inaccurate assessment of the model's true capabilities (Schaeffer, 2023; Sainz et al., 2023; Zhu et al., 2023a). Because contamination directly undermines evaluation fairness, it should be checked for as a matter of course. We implement data contamination detection methods such as Min-K% Prob (Shi et al., 2023) and average loss (Wei et al., 2023) as FreeEval modules, making contamination detection a fundamental step when evaluating LLMs or creating a new evaluation protocol.
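To make the detection idea concrete, below is a minimal sketch of the Min-K% Prob statistic from Shi et al. (2023), assuming a Hugging Face-style causal language model; the function name and model choice are illustrative, and this is not FreeEval's actual implementation. The statistic averages the log-likelihood of the k% least-likely tokens in a passage; unusually high values suggest the passage was seen during training.

```python
# Sketch of Min-K% Prob (Shi et al., 2023) for contamination detection.
# Assumes a Hugging Face causal LM; not FreeEval's implementation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def min_k_prob(text: str, model, tokenizer, k: float = 0.2) -> float:
    """Average log-likelihood of the k% least-likely tokens in `text`.
    Higher scores suggest the text appeared in the training data."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Log-probability the model assigns to each actual next token.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    token_log_probs = log_probs.gather(1, ids[0, 1:].unsqueeze(-1)).squeeze(-1)
    # Aside: -token_log_probs.mean() is the "average loss" statistic
    # (Wei et al., 2023) that FreeEval also supports.
    n_lowest = max(1, int(token_log_probs.numel() * k))
    return torch.topk(token_log_probs, n_lowest, largest=False).values.mean().item()

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
print(min_k_prob("The quick brown fox jumps over the lazy dog.", model, tokenizer))
```

In practice, the score for a benchmark example is compared against a threshold (or against scores on text known to be unseen) to flag likely contamination.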
Human Evaluation is the gold standard for meta-evaluation (Chang et al., 2023), as it directly reflects human preferences over generated text. This is particularly important for LLM-based evaluators, which judge output quality subjectively, much like human experts. However, the lack of standardized platforms or guidelines for human annotation can lead to biased, inconsistent, and unfair judgments. To address this, we incorporate the meta-evaluation protocols of Wang et al. (2023c), Zeng et al. (2023), and Zheng et al. (2023b), which capture the preferences of human experts across different scenarios. Additionally, we provide a user-friendly interface for human experts to create new preference datasets, facilitating the collection of high-quality human judgments for meta-evaluation.
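As an illustration of what meta-evaluating an LLM-based evaluator against human preferences involves, here is a minimal sketch; the function name and label encoding ("A", "B", "tie") are hypothetical, not FreeEval's API. It reports raw agreement between the judge's verdicts and human labels, plus Cohen's kappa, which corrects for agreement expected by chance.

```python
def meta_evaluate(judge_verdicts, human_verdicts):
    """Compare an LLM judge's pairwise verdicts (e.g., "A", "B", "tie")
    against human preference labels for the same comparisons."""
    assert len(judge_verdicts) == len(human_verdicts)
    n = len(judge_verdicts)
    # Raw agreement: fraction of comparisons where judge and human concur.
    agreement = sum(j == h for j, h in zip(judge_verdicts, human_verdicts)) / n
    # Cohen's kappa: subtract out the agreement expected from label frequencies alone.
    labels = set(judge_verdicts) | set(human_verdicts)
    p_chance = sum(
        (judge_verdicts.count(l) / n) * (human_verdicts.count(l) / n)
        for l in labels
    )
    kappa = (agreement - p_chance) / (1 - p_chance) if p_chance < 1 else 1.0
    return agreement, kappa

# Example: a judge that matches human labels on 4 of 5 comparisons.
print(meta_evaluate(["A", "B", "A", "tie", "B"],
                    ["A", "B", "A", "A",   "B"]))
```

A judge with high chance-corrected agreement across diverse scenarios is more trustworthy as a stand-in for human annotators.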
This paper is available on arXiv under a CC BY 4.0 DEED license.
Authors:
(1) Zhuohao Yu, Peking University;
(2) Chang Gao, Peking University;
(3) Wenjin Yao, Peking University;
(4) Yidong Wang, Peking University;
(5) Zhengran Zeng, Peking University;
(6) Wei Ye, Peking University and a corresponding author;
(7) Jindong Wang, Microsoft Research;
(8) Yue Zhang, Westlake University;
(9) Shikun Zhang, Peking University.