FreeEval: A Modular Framework for Trustworthy and Efficient Evaluation of Large Language Models

by Modularizing, March 17th, 2025



Abstract and 1 Introduction

2 Background and 2.1 Automatic Evaluation Methods for LLMs

2.2 Meta-Evaluation of LLMs

3 Design and Implementation and 3.1 Design Principles

3.2 FreeEval Architecture Overview and 3.3 Extensible Modular Design

3.4 Trustworthy Evaluation

3.5 Efficient Inference Backends

4 Conclusion, Ethical Considerations, and References

Abstract

The rapid development of large language model (LLM) evaluation methodologies and datasets has led to a profound challenge: integrating state-of-the-art evaluation techniques cost-effectively while ensuring reliability, reproducibility, and efficiency. Currently, there is a notable absence of a unified and adaptable framework that seamlessly integrates various evaluation approaches. Moreover, the reliability of evaluation findings is often questionable due to potential data contamination, while evaluation efficiency is commonly overlooked despite the substantial costs associated with LLM inference. In response to these challenges, we introduce FreeEval, a modular and scalable framework crafted to enable trustworthy and efficient automatic evaluations of LLMs. Firstly, FreeEval's unified abstractions simplify the integration and improve the transparency of diverse evaluation methodologies, encompassing dynamic evaluations that demand sophisticated LLM interactions. Secondly, the framework integrates meta-evaluation techniques like human evaluation and data contamination detection, which, along with dynamic evaluation modules in the platform, enhance the fairness of the evaluation outcomes. Lastly, FreeEval is designed with a high-performance infrastructure, including distributed computation and caching strategies, enabling extensive evaluations across multi-node, multi-GPU clusters for open-source and proprietary LLMs. We open-source all our code at https://github.com/WisdomShell/FreeEval.

1 Introduction

Large Language Models (LLMs) have revolutionized the field of Natural Language Processing (NLP) with their impressive performance on a wide range of tasks (Brown et al., 2020; Zhang et al., 2022; Bubeck et al., 2023; OpenAI, 2023). As LLMs play a critical role in both academia and industry, understanding their capabilities and evaluating their performance have become essential topics (Guo et al., 2023). Accordingly, automatic evaluation methodologies that harness benchmark datasets (Clark et al., 2018; Zellers et al., 2019; Cobbe et al., 2021; Bang et al., 2023) have been proposed for objective assessments, complemented by LLM-based subjective evaluation tools (Wang et al., 2023c; Zheng et al., 2023b; Li et al., 2023b; Chan et al., 2023).


With the continuous emergence of evaluation data and methods for LLMs, the challenge of incorporating the latest cutting-edge evaluation techniques cost-effectively, while conducting faster and more reliable evaluations, has intensified. In response, several open-source evaluation platforms and toolkits for LLMs have been proposed, each with its own features and focus. Table 1 provides a comprehensive comparison of these frameworks. Specifically, Eval-Harness (Gao et al., 2021) proposes a framework for evaluating LLMs on a variety of benchmark datasets. HELM (Liang et al., 2022) provides a collection of metrics beyond accuracy on custom datasets and models. OpenAI Evals (Contributors, 2023) implements an interface for creating LLM judges, which leverage LLMs to evaluate other models, and for meta-evaluation of these judges. OpenCompass (Contributors, 2023b) introduces distributed inference with SLURM (Yoo et al., 2003) in cluster environments. PromptBench (Zhu et al., 2023b) introduces prompt attacks during inference and integrates DyVal (Zhu et al., 2023a) into its framework.


Despite these promising efforts, current evaluation platforms still face three bottlenecks.


First, a unified and extensible framework that seamlessly integrates different evaluation methods is still lacking, which limits evaluation flexibility and transparency. Evaluation results may depend heavily on deployment settings and prompting techniques, since LLMs are not robust enough to handle these intricate settings (Zheng et al., 2023a). For example, Table 2 demonstrates that these settings can significantly influence results, confirming the need for standardized implementations of evaluation methods to ensure transparent and consistent assessment.


Table 1: Comparison of popular evaluation toolkits on features.


Table 2: Comparison of different inference implementations. We report 25-shot accuracy of Llama-2-7b-chat on ARC-Challenge (Clark et al., 2018), 5-shot accuracy on MMLU (Hendrycks et al., 2020) and HellaSwag (Zellers et al., 2019). 'CP' and 'MCP' denote Cloze Prompt and Multiple Choice Prompt from Robinson et al. (2022).
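

To make the 'CP' and 'MCP' distinction concrete, here is a minimal sketch of how the two prompt formats might be constructed for an ARC-style question. The item format and helper functions are hypothetical, introduced only for illustration, and are not FreeEval's actual API.

```python
# Illustrative sketch of the two prompt styles compared in Table 2 (CP vs. MCP).
# The item format and helper names are hypothetical, not FreeEval's actual API.

ARC_STYLE_ITEM = {
    "question": "Which property of a mineral can be determined just by looking at it?",
    "choices": ["luster", "mass", "weight", "hardness"],
}

def cloze_prompts(item: dict) -> list:
    """Cloze Prompt (CP): each choice is scored as a continuation of the question.

    An evaluator compares the model's likelihood of each completed string and
    takes the highest-scoring continuation as the prediction.
    """
    return [
        f"Question: {item['question']}\nAnswer: {choice}"
        for choice in item["choices"]
    ]

def multiple_choice_prompt(item: dict) -> str:
    """Multiple Choice Prompt (MCP): labeled options, model asked for a letter."""
    labels = "ABCD"
    options = "\n".join(
        f"{labels[i]}. {choice}" for i, choice in enumerate(item["choices"])
    )
    return f"Question: {item['question']}\n{options}\nAnswer:"

if __name__ == "__main__":
    print(cloze_prompts(ARC_STYLE_ITEM)[0])
    print()
    print(multiple_choice_prompt(ARC_STYLE_ITEM))
```

Because the same benchmark item yields different inputs, and different answer-extraction logic, under CP and MCP, reported accuracies can diverge unless the prompt format is fixed and documented.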


Second, the reliability of empirical results on these platforms cannot always be guaranteed. Automatic evaluation of LLMs remains a complex and challenging task (Chang et al., 2023) due to their open-ended nature and the presence of data contamination in training datasets, which can lead to inflated performance metrics (Schaeffer, 2023; Sainz et al., 2023; Yu et al., 2024).


Third, the efficiency of previous evaluation toolkits is often neglected. LLM inference can be a significant challenge for researchers since it requires powerful GPUs or paid APIs, especially for large-scale evaluations (Wang et al., 2023c). Optimizing the inference process and reducing computational costs are crucial for making LLM evaluation more accessible to the research community.
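

One common way to cut these costs, in the spirit of the caching strategies mentioned in the abstract, is to memoize model responses keyed by the model name, prompt, and decoding parameters, so repeated or resumed runs never pay for the same inference twice. The following is a minimal sketch of that idea under our own assumptions; it is not FreeEval's actual cache implementation.

```python
import hashlib
import json
import sqlite3
from typing import Optional

class ResponseCache:
    """Minimal sketch of response caching for LLM evaluation runs.

    Responses are keyed by a hash of (model, prompt, decoding parameters), so
    repeated or resumed evaluations avoid paying for the same inference twice.
    This illustrates the idea only; it is not FreeEval's actual implementation.
    """

    def __init__(self, path: str = "eval_cache.sqlite"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, response TEXT)"
        )

    @staticmethod
    def _key(model: str, prompt: str, params: dict) -> str:
        payload = json.dumps(
            {"model": model, "prompt": prompt, "params": params}, sort_keys=True
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get(self, model: str, prompt: str, params: dict) -> Optional[str]:
        row = self.conn.execute(
            "SELECT response FROM cache WHERE key = ?",
            (self._key(model, prompt, params),),
        ).fetchone()
        return row[0] if row else None

    def put(self, model: str, prompt: str, params: dict, response: str) -> None:
        self.conn.execute(
            "INSERT OR REPLACE INTO cache VALUES (?, ?)",
            (self._key(model, prompt, params), response),
        )
        self.conn.commit()
```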


To address these challenges, we propose FreeEval, a modular and extensible framework for trustworthy and efficient automatic evaluation of LLMs. The main features of FreeEval are as follows:


FreeEval offers a unified abstraction and modular implementation of various evaluation methods. We present the concepts of step, dataset, and config to uniformly describe dataset-based, classic reference-based, and LLM-based evaluators. Dataset-based evaluators include task-specific datasets along with dataset operations such as custom prompting, data augmentation, and data generation. LLM-based evaluators, such as MT-Bench (Zheng et al., 2023b), AlpacaEval (Li et al., 2023b), PandaLM (Wang et al., 2023c) and KIEval (Yu et al., 2024), are also integrated to provide subjective assessment. Complementing these are Classic Judges, which utilize reference-based evaluation metrics like ROUGE (Lin, 2004) and BERTScore (Zhang et al., 2019) to examine model output. FreeEval's modular design allows for easy implementation of new evaluation protocols and supports evaluating both open-source and proprietary models. These abstractions also bring transparency to the evaluation process, since all evaluation details and settings are open to users.
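

As a rough illustration of how step, dataset, and config abstractions can compose into an evaluation pipeline, the sketch below chains a dataset-loading step, a prompting step, and a judging step. The class and field names are hypothetical and used only for exposition; they are not FreeEval's actual interfaces.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict

# Hypothetical sketch of a step / dataset / config style pipeline.
# Names and interfaces are illustrative assumptions, not FreeEval's actual code.

@dataclass
class Step:
    """A single evaluation stage: takes a context dict and returns an updated one."""
    name: str
    run: Callable[[Dict[str, Any]], Dict[str, Any]]
    config: Dict[str, Any] = field(default_factory=dict)

def load_dataset(ctx):
    # A real framework would load e.g. ARC-Challenge from disk or a dataset hub.
    ctx["dataset"] = [{"question": "2 + 2 = ?", "answer": "4"}]
    return ctx

def apply_prompting(ctx):
    ctx["prompts"] = [f"Q: {x['question']}\nA:" for x in ctx["dataset"]]
    return ctx

def reference_judge(ctx):
    # Placeholder "model": echoes the reference so the sketch runs end to end.
    predictions = [x["answer"] for x in ctx["dataset"]]
    ctx["accuracy"] = sum(
        p == x["answer"] for p, x in zip(predictions, ctx["dataset"])
    ) / len(ctx["dataset"])
    return ctx

pipeline = [
    Step("load", load_dataset, {"dataset_name": "arc_challenge", "split": "test"}),
    Step("prompt", apply_prompting, {"style": "multiple_choice", "n_shots": 25}),
    Step("judge", reference_judge, {"metric": "accuracy"}),
]

context: Dict[str, Any] = {}
for step in pipeline:
    context = step.run(context)
print(f"accuracy = {context['accuracy']:.2f}")
```

Because every step carries an explicit config, the full evaluation setting can be serialized and shared, which is what makes results reproducible and transparent across runs.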


FreeEval pioneers the integration of several practical meta-evaluation modules to ensure trustworthiness. The meta-evaluation methods we support include contamination detection, human judgment, case analysis and visualization, and bias evaluation, helping to mitigate overfitting risks and provide interpretability in model evaluation. FreeEval also includes a user-friendly interface for human annotation to facilitate meta-evaluation and improve the explainability and reliability of evaluation results.
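

Contamination detectors differ in their details, but a common baseline is to flag benchmark items whose word n-grams overlap heavily with the training corpus. The sketch below illustrates only that baseline idea; it is not the detection method FreeEval ships.

```python
def word_ngrams(text, n=13):
    """Word-level n-grams of a text (13-grams are a common choice for this check)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contamination_score(benchmark_item, training_docs, n=13):
    """Fraction of the item's n-grams that also appear in the training corpus.

    A score near 1.0 suggests the item was likely seen during training, so its
    evaluation result should be treated with caution.
    """
    item_grams = word_ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0
    corpus_grams = set()
    for doc in training_docs:
        corpus_grams |= word_ngrams(doc, n)
    return len(item_grams & corpus_grams) / len(item_grams)
```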


This paper is available on arxiv under CC BY 4.0 DEED license.

Authors:

(1) Zhuohao Yu, Peking University;

(2) Chang Gao, Peking University;

(3) Wenjin Yao, Peking University;

(4) Yidong Wang, Peking University;

(5) Zhengran Zeng, Peking University;

(6) Wei Ye, Peking University (corresponding author);

(7) Jindong Wang, Microsoft Research;

(8) Yue Zhang, Westlake University;

(9) Shikun Zhang, Peking University.