FreeEval Architecture Overview and Extensible Modular Design

by Modularizing, March 18th, 2025

Too Long; Didn't Read

FreeEval’s architecture features a modular design that can be divided into three components: Evaluation Methods, Meta-Evaluation, and LLM Inference Backends.


Abstract and 1 Introduction

2 Background and 2.1 Automatic Evaluation Methods for LLMs

2.2 Meta-Evaluation of LLMs

3 Design and Implementation and 3.1 Design Principles

3.2 FreeEval Architecture Overview and 3.3 Extensible Modular Design

3.4 Trustworthy Evaluation

3.5 Efficient Inference Backends

4 Conclusion, Ethical Considerations, and References

3.2 FreeEval Architecture Overview

FreeEval’s architecture, illustrated in Figure 1, features a modular design that can be divided into three components: Evaluation Methods, Meta-Evaluation, and LLM Inference Backends. The Evaluation Methods module contains the datasets and implementations of the evaluation methods. The Meta-Evaluation module safeguards the integrity and fairness of assessments by providing data-contamination detection methods and implementations of popular meta-evaluation methods. The LLM Inference Backends form the computational backbone, providing distributed and concurrent LLM inference with performance-optimization techniques.

3.3 Extensible Modular Design

FreeEval’s modular architecture is designed to accommodate the rapidly evolving landscape of LLM evaluation. To let users implement new evaluation methods without unnecessary complexity, FreeEval is built around the concepts of step, dataset, and config, which serve as the building blocks for creating flexible and extensible evaluation pipelines:


step: A step encapsulates a specific evaluation method, data augmentation technique, or metric-calculation logic. Each step has three phases: preprocess loads or initializes the required datasets or models; run executes the actual logic; postprocess parses the outputs, collects evaluation results, and frees up resources. A minimal sketch of this interface appears after this list.


dataset: Data used by the evaluators is defined as a dataset. Each dataset handles the preprocessing required to load data, few-shot settings, prompting, augmentation of instances, and postprocessing of inference results.


config: A config file composes evaluation pipelines from steps and datasets and contains all of their details and settings. The steps defined in the config are executed sequentially and share the same context, which stores intermediate results.
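
As a rough illustration of the step abstraction, the sketch below outlines a minimal base class with the three phases described above. It is a sketch under stated assumptions: the class name, method signatures, and the shared context dict are illustrative guesses, not FreeEval’s actual API.

```python
# Illustrative sketch only: names and signatures are assumptions, not FreeEval's API.
from abc import ABC, abstractmethod


class Step(ABC):
    """A self-contained unit of an evaluation pipeline."""

    def __init__(self, config: dict):
        self.config = config  # settings for this step, parsed from the pipeline config file

    def preprocess(self, context: dict) -> None:
        """Load or initialize the datasets and models this step needs."""

    @abstractmethod
    def run(self, context: dict) -> None:
        """Execute the step's core logic (evaluation, augmentation, or metric calculation)."""

    def postprocess(self, context: dict) -> None:
        """Parse outputs, collect results into the shared context, and free resources."""

    def execute(self, context: dict) -> dict:
        # Steps in a pipeline run sequentially and share the same context object.
        self.preprocess(context)
        self.run(context)
        self.postprocess(context)
        return context
```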


These abstractions improve transparency by giving users full access to the configuration details of each evaluation pipeline; the config file also serves as a complete record of the evaluation process, including all necessary hyperparameters and settings. The modular design also allows data to be reused in different scenarios without extra effort. For example, GSM8K (Cobbe et al., 2021) is an evaluation dataset: we can simply compute model perplexity on it, or we can use a data-generation step to have GPT-4 produce new data from the same distribution to detect data contamination, following Wei et al. (2023). The modular approach lets researchers easily add new evaluation methods or modify existing ones without disrupting the overall structure of the framework, and by defining each evaluator as a self-contained unit, FreeEval promotes code reusability and maintainability.
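
To make this reuse concrete, a pipeline configuration along these lines could reference a single dataset from two different steps, one computing perplexity and one generating in-distribution data for contamination detection. The structure and key names below are illustrative assumptions, written as a Python dict rather than FreeEval’s actual config schema.

```python
# Hypothetical pipeline configuration illustrating dataset reuse.
# Key names and step types are assumptions, not FreeEval's actual config schema.
pipeline_config = {
    "dataset": {
        "name": "gsm8k",   # a single dataset definition...
        "split": "test",
    },
    "steps": [
        {
            "type": "perplexity_eval",   # ...consumed by a perplexity step
            "model": "local-llm",
        },
        {
            "type": "synthetic_data_generation",    # ...and by a contamination check that
            "generator": "gpt-4",                   # generates new in-distribution data,
            "purpose": "contamination_detection",   # following Wei et al. (2023)
        },
    ],
}
```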


This configuration-driven approach eliminates the need for users to write Python code when defining and running an evaluation pipeline. All settings and parameters for each step and dataset are specified within the config file, making the evaluation process highly customizable and accessible to researchers with varying levels of programming expertise. Figure 2 shows an example config for a pipeline that evaluates LLaMA-2 70B (Touvron et al., 2023b) on the ARC-Challenge dataset (Clark et al., 2018) with a fixed seed for sampling 25-shot examples and a custom prompt. The model can be deployed locally or on a remote machine. The pipeline also includes data-contamination detection with Min-K% Prob (Shi et al., 2023).
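
Since Figure 2 is not reproduced here, the sketch below approximates the kind of pipeline that paragraph describes. Every key, step name, and value is an illustrative assumption, not FreeEval’s actual config format.

```python
# Hypothetical approximation of the described pipeline, not FreeEval's actual config format.
pipeline_config = {
    "inference_backend": {
        "model": "llama-2-70b",
        "endpoint": "http://remote-host:8000",   # the model may be deployed locally or remotely
    },
    "dataset": {
        "name": "arc_challenge",
        "num_few_shot": 25,
        "few_shot_seed": 42,   # fixed seed for sampling the 25-shot examples
        "prompt_template": "Question: {question}\nAnswer:",   # custom prompt
    },
    "steps": [
        {"type": "multiple_choice_eval"},
        {"type": "min_k_prob_contamination", "k_percent": 20},   # Min-K% Prob (Shi et al., 2023)
    ],
}
```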


This paper is available on arXiv under a CC BY 4.0 DEED license.

Authors:

(1) Zhuohao Yu, Peking University;

(2) Chang Gao, Peking University;

(3) Wenjin Yao, Peking University;

(4) Yidong Wang, Peking University;

(5) Zhengran Zeng, Peking University;

(6) Wei Ye, Peking University (corresponding author);

(7) Jindong Wang, Microsoft Research;

(8) Yue Zhang, Westlake University;

(9) Shikun Zhang, Peking University.