Meta-evaluation refers to the process of evaluating the fairness, reliability, and validity of evaluation protocols themselves. We incorporate several meta-evaluation methods into FreeEval.
Data Contamination occurs when an LLM is exposed to test data during training, leading to inflated performance scores and an inaccurate assessment of the model's true capabilities (Schaeffer, 2023; Sainz et al., 2023; Zhu et al., 2023a). Because contamination directly undermines evaluation fairness, it should be checked for as a matter of course. We implement data contamination detection methods such as Min-K% Prob (Shi et al., 2023) and average loss (Wei et al., 2023) as FreeEval modules, making contamination detection a fundamental step when evaluating LLMs or creating a new evaluation protocol.
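To make the detection idea concrete, below is a minimal sketch of the Min-K% Prob statistic from Shi et al. (2023), assuming a Hugging Face-style causal language model; the function name and model choice are illustrative, and this is not FreeEval's actual implementation. The statistic averages the log-likelihood of the k% least-likely tokens in a passage; unusually high values suggest the passage was seen during training.

```python
# Sketch of Min-K% Prob (Shi et al., 2023) for contamination detection.
# Assumes a Hugging Face causal LM; not FreeEval's implementation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def min_k_prob(text: str, model, tokenizer, k: float = 0.2) -> float:
    """Average log-likelihood of the k% least-likely tokens in `text`.
    Higher scores suggest the text appeared in the training data."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Log-probability the model assigns to each actual next token.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    token_log_probs = log_probs.gather(1, ids[0, 1:].unsqueeze(-1)).squeeze(-1)
    # Aside: -token_log_probs.mean() is the "average loss" statistic
    # (Wei et al., 2023) that FreeEval also supports.
    n_lowest = max(1, int(token_log_probs.numel() * k))
    return torch.topk(token_log_probs, n_lowest, largest=False).values.mean().item()

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
print(min_k_prob("The quick brown fox jumps over the lazy dog.", model, tokenizer))
```

In practice, the score for a benchmark example is compared against a threshold (or against scores on text known to be unseen) to flag likely contamination.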
Human Evaluation is the gold standard for meta-evaluation (Chang et al., 2023), as it directly reflects human preferences over generated text. This is particularly important for LLM-based evaluators, which judge output quality subjectively, much like human experts. However, the lack of standardized platforms or guidelines for human annotation can lead to biased, inconsistent, and unfair judgments. To address this, we incorporate the meta-evaluation protocols of Wang et al. (2023c), Zeng et al. (2023), and Zheng et al. (2023b), which capture the preferences of human experts across different scenarios. Additionally, we provide a user-friendly interface for human experts to create new preference datasets, facilitating the collection of high-quality human judgments for meta-evaluation.
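As an illustration of what meta-evaluating an LLM-based evaluator against human preferences involves, here is a minimal sketch; the function name and label encoding ("A", "B", "tie") are hypothetical, not FreeEval's API. It reports raw agreement between the judge's verdicts and human labels, plus Cohen's kappa, which corrects for agreement expected by chance.

```python
def meta_evaluate(judge_verdicts, human_verdicts):
    """Compare an LLM judge's pairwise verdicts (e.g., "A", "B", "tie")
    against human preference labels for the same comparisons."""
    assert len(judge_verdicts) == len(human_verdicts)
    n = len(judge_verdicts)
    # Raw agreement: fraction of comparisons where judge and human concur.
    agreement = sum(j == h for j, h in zip(judge_verdicts, human_verdicts)) / n
    # Cohen's kappa: subtract out the agreement expected from label frequencies alone.
    labels = set(judge_verdicts) | set(human_verdicts)
    p_chance = sum(
        (judge_verdicts.count(l) / n) * (human_verdicts.count(l) / n)
        for l in labels
    )
    kappa = (agreement - p_chance) / (1 - p_chance) if p_chance < 1 else 1.0
    return agreement, kappa

# Example: a judge that matches human labels on 4 of 5 comparisons.
print(meta_evaluate(["A", "B", "A", "tie", "B"],
                    ["A", "B", "A", "A",   "B"]))
```

A judge with high chance-corrected agreement across diverse scenarios is more trustworthy as a stand-in for human annotators.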
This paper is available on arXiv under a CC BY 4.0 DEED license.
Authors:
(1) Zhuohao Yu, Peking University;
(2) Chang Gao, Peking University;
(3) Wenjin Yao, Peking University;
(4) Yidong Wang, Peking University;
(5) Zhengran Zeng, Peking University;
(6) Wei Ye, Peking University and a corresponding author;
(7) Jindong Wang, Microsoft Research;
(8) Yue Zhang, Westlake University;
(9) Shikun Zhang, Peking University.