2 Background and 2.1 Automatic Evaluation Methods for LLMs
3 Design and Implementation and 3.1 Design Principles
3.2 FreeEval Architecture Overview and 3.3 Extensible Modular Design
3.5 Efficient Inference Backends
4 Conclusion, Ethical Considerations, and References
FreeEval prioritizes trustworthiness and fairness in evaluations by incorporating a range of meta-evaluation modules that validate the evaluation results and processes.
As human preference remains the gold standard for measuring the effectiveness of evaluation protocols, FreeEval models human preference in two forms: pairwise comparison and direct scoring. We incorporate existing meta-evaluation datasets from PandaLM (Wang et al., 2023c), MT-Bench (Zheng et al., 2023b), LLMBar (Guo et al., 2023), and AlpacaEval (Li et al., 2023b), and provide a user-friendly interface for annotating and curating new human evaluation datasets.
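To make the two forms of human preference concrete, the following is a minimal sketch of how an evaluator's outputs could be compared against human annotations: an agreement rate for pairwise comparison and a Pearson correlation for direct scoring. The function names, label values ("A", "B", "tie"), and toy data are assumptions made for illustration; they are not part of FreeEval's API, and the choice of Pearson correlation is ours rather than the paper's.

```python
from math import sqrt
from typing import Sequence


def pairwise_agreement(human: Sequence[str], judge: Sequence[str]) -> float:
    """Fraction of comparisons where the evaluator's label matches the human label."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between human scores and evaluator scores."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("score lists must have equal length of at least 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


# Hypothetical annotations: which response ("A" or "B") is preferred, or "tie".
human_prefs = ["A", "B", "tie", "A"]
judge_prefs = ["A", "B", "A", "A"]
agreement = pairwise_agreement(human_prefs, judge_prefs)

# Hypothetical direct scores for the same five responses.
human_scores = [1, 2, 3, 4, 5]
judge_scores = [2, 4, 6, 8, 10]
correlation = pearson(human_scores, judge_scores)

print(f"pairwise agreement: {agreement:.2f}")
print(f"score correlation:  {correlation:.2f}")
```

A rank-based statistic such as Spearman or Kendall correlation may be a better fit when human and evaluator scores use different scales; the structure of the comparison stays the same.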
To ensure the trustworthiness of the evaluation results, we also implement the data contamination detection methods introduced in subsection 2.2 as steps in our toolkit. Understanding whether the tested dataset appeared during the training phase of the evaluated models helps users assess the validity and reliability of evaluation results. We also provide bias evaluation modules and visualization tools specifically for LLM-based evaluators, as previous studies have reported that they exhibit position bias and length bias (Zheng et al., 2023b; Wang et al., 2023c). These meta-evaluation modules can be easily integrated into existing evaluation pipelines, allowing researchers to understand the effectiveness of their results, assess the fairness of the evaluation process, and study bad cases that lead to unexpected evaluation results.
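As an illustration of what a position-bias check for an LLM-based evaluator might look like, the sketch below queries a judge twice per sample, once in each response order, and reports how often the same underlying response wins. The `Judge` callable signature, the return labels ("first", "second", "tie"), and the two toy judges are hypothetical stand-ins for a real LLM call; this is not FreeEval's implementation.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical judge interface: (prompt, first_response, second_response) -> verdict,
# where the verdict is "first", "second", or "tie".
Judge = Callable[[str, str, str], str]


def position_consistency(judge: Judge, samples: Iterable[Tuple[str, str, str]]) -> float:
    """Fraction of samples whose winner is unchanged when the response order is swapped."""
    samples = list(samples)
    if not samples:
        raise ValueError("at least one sample is required")
    consistent = 0
    for prompt, resp_a, resp_b in samples:
        forward = judge(prompt, resp_a, resp_b)   # resp_a shown first
        backward = judge(prompt, resp_b, resp_a)  # resp_b shown first
        # Map positional verdicts back to the identity of the winning response.
        winner_forward = {"first": "a", "second": "b", "tie": "tie"}[forward]
        winner_backward = {"first": "b", "second": "a", "tie": "tie"}[backward]
        consistent += winner_forward == winner_backward
    return consistent / len(samples)


def always_first(prompt: str, first: str, second: str) -> str:
    """Toy judge with extreme position bias: always prefers the first slot."""
    return "first"


def longer_wins(prompt: str, first: str, second: str) -> str:
    """Toy judge that ignores position but always prefers the longer response."""
    if len(first) == len(second):
        return "tie"
    return "first" if len(first) > len(second) else "second"


samples = [
    ("Q1", "short", "a much longer answer"),
    ("Q2", "detailed reply here", "ok"),
]
biased_rate = position_consistency(always_first, samples)
stable_rate = position_consistency(longer_wins, samples)

print(f"always_first consistency: {biased_rate:.2f}")
print(f"longer_wins consistency:  {stable_rate:.2f}")
```

Note that `longer_wins` is perfectly consistent under order swapping yet is an extreme case of length bias, so a swap-consistency check alone does not cover both biases; length bias would need a separate test, for example comparing verdicts against human labels on length-controlled pairs.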
This paper is available on arXiv under a CC BY 4.0 DEED license.
Authors:
(1) Zhuohao Yu, Peking University;
(2) Chang Gao, Peking University;
(3) Wenjin Yao, Peking University;
(4) Yidong Wang, Peking University;
(5) Zhengran Zeng, Peking University;
(6) Wei Ye, Peking University and a corresponding author;
(7) Jindong Wang, Microsoft Research;
(8) Yue Zhang, Westlake University;
(9) Shikun Zhang, Peking University.