How FreeEval Incorporates A Range of Metaevaluation Modules

by Modularizing, March 18th, 2025

Abstract and 1 Introduction

2 Background and 2.1 Automatic Evaluation Methods for LLMs

2.2 Meta-Evaluation of LLMs

3 Design and Implementation and 3.1 Design Principles

3.2 FreeEval Architecture Overview and 3.3 Extensible Modular Design

3.4 Trustworthy Evaluation

3.5 Efficient Inference Backends

4 Conclusion, Ethical Considerations, and References

3.4 Trustworthy Evaluation

FreeEval prioritizes trustworthiness and fairness in evaluations by incorporating a range of meta-evaluation modules that validate the evaluation results and processes.


As human preferences remain the gold standard for measuring the effectiveness of evaluation protocols, FreeEval models human preference as two types: pairwise comparison and direct scoring. We incorporate existing meta-evaluation datasets from PandaLM (Wang et al., 2023c), MT-Bench (Zheng et al., 2023b), LLMBar (Guo et al., 2023), and AlpacaEval (Li et al., 2023b), and provide a user-friendly interface for annotating and curating new human evaluation datasets.


Figure 2: Config for an example pipeline, evaluating LLaMA-2 70B (Touvron et al., 2023b) on the ARC-Challenge (Clark et al., 2018) dataset and then detecting data contamination with Min-K% Prob (Shi et al., 2023).
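

To make the two preference formats concrete, here is a minimal sketch of how pairwise-comparison and direct-scoring meta-evaluation examples might be represented, and how agreement with human labels can be computed for an LLM-based evaluator. The class and function names are illustrative, not FreeEval's actual interfaces.

```python
# Hypothetical data structures for the two human-preference formats; not
# FreeEval's actual API, just an illustration of the underlying idea.
from dataclasses import dataclass
from typing import List


@dataclass
class PairwiseExample:
    instruction: str
    response_a: str
    response_b: str
    human_label: str      # "A", "B", or "tie"


@dataclass
class ScoringExample:
    instruction: str
    response: str
    human_score: float    # e.g. a 1-10 rating


def pairwise_agreement(examples: List[PairwiseExample],
                       evaluator_labels: List[str]) -> float:
    """Fraction of pairs where the evaluator picks the same winner as the human annotator."""
    hits = sum(ex.human_label == label
               for ex, label in zip(examples, evaluator_labels))
    return hits / len(examples)
```

For direct scoring, meta-evaluation would instead report a correlation (e.g., Pearson or Spearman) between the evaluator's scores and the human scores.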


To ensure the trustworthiness of the evaluation results, we also implement the data contamination detection methods introduced in subsection 2.2 as steps in our toolkit. Understanding whether the tested dataset appeared in the training phase of the evaluated models helps users assess the validity and reliability of evaluation results. We also provide bias evaluation modules and visualization tools specifically for LLM-based evaluators, as previous studies have reported that they exhibit position bias and length bias (Zheng et al., 2023b; Wang et al., 2023c). These meta-evaluation modules can be easily integrated into existing evaluation pipelines, allowing researchers to understand the effectiveness of their results and the fairness of the evaluation process, and to study bad cases that lead to unexpected evaluation results.
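

As a concrete illustration of the contamination-detection step, the following is a minimal sketch of Min-K% Prob (Shi et al., 2023), assuming a Hugging Face causal language model; the model name and decision threshold are placeholders, and FreeEval's own implementation wraps this logic as a configurable pipeline step.

```python
# Minimal Min-K% Prob sketch: average log-probability of the k% least likely
# tokens in a text under the evaluated model. Higher values suggest the text
# was likely seen during training.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def min_k_prob(text: str, model, tokenizer, k: float = 0.2) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Log-probability assigned to each actual next token.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    token_log_probs = log_probs.gather(1, ids[0, 1:].unsqueeze(-1)).squeeze(-1)
    # Average over the k% lowest-probability tokens only.
    n_keep = max(1, int(token_log_probs.numel() * k))
    lowest = torch.topk(token_log_probs, n_keep, largest=False).values
    return lowest.mean().item()


# Illustrative usage with a small public model; in practice the evaluated
# model (e.g., LLaMA-2 70B) would be loaded instead, and the threshold would
# be calibrated on data known to be seen or unseen by that model.
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
score = min_k_prob("An example benchmark question and its answer.", model, tokenizer)
print("possibly contaminated" if score > -2.5 else "likely unseen")
```

Similarly, a simple position-bias probe for an LLM-based evaluator can judge each pair twice with the response order swapped and report how often the verdict flips; the `judge` callable below is a stand-in for any pairwise evaluator, not a FreeEval function.

```python
from typing import Callable, List, Tuple


def position_flip_rate(pairs: List[Tuple[str, str, str]],
                       judge: Callable[[str, str, str], str]) -> float:
    """`pairs` holds (instruction, response_a, response_b); `judge` returns "A" or "B"."""
    flips = 0
    for instruction, a, b in pairs:
        first = judge(instruction, a, b)
        second = judge(instruction, b, a)  # same responses, swapped order
        # A position-consistent judge picks the same underlying response both times.
        if (first == "A") != (second == "B"):
            flips += 1
    return flips / len(pairs)
```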


This paper is available on arXiv under the CC BY 4.0 DEED license.

Authors:

(1) Zhuohao Yu, Peking University;

(2) Chang Gao, Peking University;

(3) Wenjin Yao, Peking University;

(4) Yidong Wang, Peking University;

(5) Zhengran Zeng, Peking University;

(6) Wei Ye, Peking University and a corresponding author;

(7) Jindong Wang, Microsoft Research;

(8) Yue Zhang, Westlake University;

(9) Shikun Zhang, Peking University.