FreeEval Architecture Overview and Extensible Modular Design

by Modularizing, March 18th, 2025

Too Long; Didn't Read

FreeEval’s architecture features a modular design that can be divided into three components: Evaluation Methods, Meta-Evaluation, and LLM Inference Backends.


Abstract and 1 Introduction

2 Background and 2.1 Automatic Evaluation Methods for LLMs

2.2 Meta-Evaluation of LLMs

3 Design and Implementation and 3.1 Design Principles

3.2 FreeEval Architecture Overview and 3.3 Extensible Modular Design

3.4 Trustworthy Evaluation

3.5 Efficient Inference Backends

4 Conclusion, Ethical Considerations, and References

3.2 FreeEval Architecture Overview

FreeEval’s architecture, illustrated in Figure 1, features a modular design that can be divided into three components: Evaluation Methods, Meta-Evaluation, and LLM Inference Backends. The Evaluation Methods module contains the datasets and implementations of the evaluation methods. The Meta-Evaluation module safeguards the integrity and fairness of assessments by providing data-contamination detection methods and implementations of popular meta-evaluation methods. The LLM Inference Backends form the computational backbone, providing distributed and concurrent LLM inference with performance-optimization techniques.

3.3 Extensible Modular Design

FreeEval’s modular architecture is designed to accommodate the rapidly evolving landscape of LLM evaluation. To let users implement new evaluation methods without unnecessary complexity, FreeEval is built around the concepts of step, dataset, and config, which serve as the building blocks for creating flexible and extensible evaluation pipelines:


step: A step encapsulates a specific evaluation method, data augmentation technique, or metric-calculation logic. Each step has three phases: preprocess loads or initializes the required datasets or models; run executes the actual logic; postprocess parses the outputs, collects evaluation results, and frees up resources. A minimal sketch of this interface appears after this list.


dataset: Data used by the evaluators is defined as a dataset. Each dataset handles the preprocessing required to load data, few-shot settings, prompting, augmentation of instances, and postprocessing of inference results.


config: A config file composes evaluation pipelines from steps and datasets and contains all of their details and settings. The steps defined in the config are executed sequentially and share the same context, which stores intermediate results.
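
As a rough illustration of the step abstraction, the sketch below outlines a minimal base class with the three phases described above. It is a sketch under stated assumptions: the class name, method signatures, and the shared context dict are illustrative guesses, not FreeEval’s actual API.

```python
# Illustrative sketch only: names and signatures are assumptions, not FreeEval's API.
from abc import ABC, abstractmethod


class Step(ABC):
    """A self-contained unit of an evaluation pipeline."""

    def __init__(self, config: dict):
        self.config = config  # settings for this step, parsed from the pipeline config file

    def preprocess(self, context: dict) -> None:
        """Load or initialize the datasets and models this step needs."""

    @abstractmethod
    def run(self, context: dict) -> None:
        """Execute the step's core logic (evaluation, augmentation, or metric calculation)."""

    def postprocess(self, context: dict) -> None:
        """Parse outputs, collect results into the shared context, and free resources."""

    def execute(self, context: dict) -> dict:
        # Steps in a pipeline run sequentially and share the same context object.
        self.preprocess(context)
        self.run(context)
        self.postprocess(context)
        return context
```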


These abstractions improve transparency by giving users full access to the configuration details of each evaluation pipeline; the config file also serves as a complete record of the evaluation process, including all necessary hyperparameters and settings. The modular design also allows data to be reused in different scenarios without extra effort. For example, GSM8K (Cobbe et al., 2021) is an evaluation dataset: we can simply compute model perplexity on it, or we can use a data-generation step to have GPT-4 produce new data from the same distribution to detect data contamination, following Wei et al. (2023). The modular approach lets researchers easily add new evaluation methods or modify existing ones without disrupting the overall structure of the framework, and by defining each evaluator as a self-contained unit, FreeEval promotes code reusability and maintainability.
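
To make this reuse concrete, a pipeline configuration along these lines could reference a single dataset from two different steps, one computing perplexity and one generating in-distribution data for contamination detection. The structure and key names below are illustrative assumptions, written as a Python dict rather than FreeEval’s actual config schema.

```python
# Hypothetical pipeline configuration illustrating dataset reuse.
# Key names and step types are assumptions, not FreeEval's actual config schema.
pipeline_config = {
    "dataset": {
        "name": "gsm8k",   # a single dataset definition...
        "split": "test",
    },
    "steps": [
        {
            "type": "perplexity_eval",   # ...consumed by a perplexity step
            "model": "local-llm",
        },
        {
            "type": "synthetic_data_generation",    # ...and by a contamination check that
            "generator": "gpt-4",                   # generates new in-distribution data,
            "purpose": "contamination_detection",   # following Wei et al. (2023)
        },
    ],
}
```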


This configuration-driven approach eliminates the need for users to write Python code when defining and running an evaluation pipeline. All settings and parameters for each step and dataset are specified within the config file, making the evaluation process highly customizable and accessible to researchers with varying levels of programming expertise. Figure 2 shows an example config for a pipeline that evaluates LLaMA-2 70B (Touvron et al., 2023b) on the ARC-Challenge dataset (Clark et al., 2018) with a fixed seed for sampling 25-shot examples and a custom prompt. The model can be deployed locally or on a remote machine. The pipeline also includes data-contamination detection with Min-K% Prob (Shi et al., 2023).
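
Since Figure 2 is not reproduced here, the sketch below approximates the kind of pipeline that paragraph describes. Every key, step name, and value is an illustrative assumption, not FreeEval’s actual config format.

```python
# Hypothetical approximation of the described pipeline, not FreeEval's actual config format.
pipeline_config = {
    "inference_backend": {
        "model": "llama-2-70b",
        "endpoint": "http://remote-host:8000",   # the model may be deployed locally or remotely
    },
    "dataset": {
        "name": "arc_challenge",
        "num_few_shot": 25,
        "few_shot_seed": 42,   # fixed seed for sampling the 25-shot examples
        "prompt_template": "Question: {question}\nAnswer:",   # custom prompt
    },
    "steps": [
        {"type": "multiple_choice_eval"},
        {"type": "min_k_prob_contamination", "k_percent": 20},   # Min-K% Prob (Shi et al., 2023)
    ],
}
```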


This paper is available on arXiv under a CC BY 4.0 DEED license.

Authors:

(1) Zhuohao Yu, Peking University;

(2) Chang Gao, Peking University;

(3) Wenjin Yao, Peking University;

(4) Yidong Wang, Peking University;

(5) Zhengran Zeng, Peking University;

(6) Wei Ye, Peking University (corresponding author);

(7) Jindong Wang, Microsoft Research;

(8) Yue Zhang, Westlake University;

(9) Shikun Zhang, Peking University.