
Enhancing Data Quality: Proposed Workflow for Data Quality Control


Too Long; Didn't Read

In this study, researchers propose a data quality control pipeline for federated fine-tuning of foundation models.

Authors:

(1) Wanru Zhao, University of Cambridge and Shanghai AI Laboratory (equal contribution);

(2) Yaxin Du, Shanghai Jiao Tong University (equal contribution);

(3) Nicholas D. Lane, University of Cambridge and Flower Labs;

(4) Siheng Chen, Shanghai AI Laboratory and Shanghai Jiao Tong University;

(5) Yanfeng Wang, Shanghai AI Laboratory and Shanghai Jiao Tong University.

3 PROPOSED WORKFLOW FOR DATA QUALITY CONTROL

3.1 OVERVIEW


Figure 2: The overall workflow consists of two phases. Phase I: each client computes a quality score for every local sample with the scoring functions, using the public validation set and the global model; the server then aggregates the scores and derives a global threshold from the anchor data. Phase II: clients filter their data according to the global threshold and start federated learning on the selected high-quality data.
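This two-phase structure can be summarized in a short Python sketch. Everything below (function names, data structures, and the `score_fn` callable, which is assumed to wrap the scoring function, public validation set, and current global model) is an illustrative placeholder rather than the paper's implementation:

```python
# Illustrative sketch of the two-phase workflow; all names are placeholders,
# not the paper's released code.

def phase_one(client_data, anchor_data, score_fn):
    """Phase I: clients score their local samples; the server derives a
    single global threshold from a small anchor set."""
    local_scores = {cid: [score_fn(s) for s in samples]
                    for cid, samples in client_data.items()}
    global_threshold = sum(score_fn(a) for a in anchor_data) / len(anchor_data)
    return local_scores, global_threshold

def phase_two(client_data, local_scores, global_threshold):
    """Phase II: each client keeps only samples scoring at or above the
    global threshold; federated fine-tuning then runs on the selected data."""
    selected = {cid: [s for s, sc in zip(samples, local_scores[cid])
                      if sc >= global_threshold]
                for cid, samples in client_data.items()}
    return selected  # pass to a standard FL loop (e.g., FedAvg) for fine-tuning
```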

3.2 LOCAL DATA SCORING AND QUALITY CONTROL

3.3 GLOBAL STANDARD WITH ANCHOR DATA SCORING

On the server, we select only a small amount of data (10 samples in our paper) as anchor data and use the aforementioned scoring method to compute the average score of these 10 data points as the global threshold. This establishes a unified standard for dividing low- and high-quality data across heterogeneous clients, allowing their local data to be further filtered.
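As a toy numeric illustration of this anchor-based threshold (all score values below are made up, not taken from the paper), the same averaged anchor score is applied by every client to its own score distribution:

```python
# Toy illustration of the anchor-based global threshold; values are invented.
anchor_scores = [0.82, 0.75, 0.91, 0.68, 0.79, 0.85, 0.73, 0.88, 0.70, 0.77]  # 10 anchor samples
global_threshold = sum(anchor_scores) / len(anchor_scores)  # unified standard for all clients

# Each client applies the same threshold to its own (heterogeneous) scores.
client_a_scores = [0.95, 0.40, 0.81]
client_b_scores = [0.60, 0.88, 0.79, 0.30]
keep_a = [i for i, s in enumerate(client_a_scores) if s >= global_threshold]
keep_b = [i for i, s in enumerate(client_b_scores) if s >= global_threshold]
print(global_threshold, keep_a, keep_b)  # threshold ~0.788; A keeps [0, 2], B keeps [1, 2]
```

Because the threshold comes from a single shared anchor set rather than from each client's own data, clients with very different local score distributions still apply a consistent notion of "high quality."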


This paper is available on arXiv under a CC BY 4.0 DEED license.