Enhancing Data Quality: Ablation Study of Unified Scoring With Anchor Data
by @computational

Too Long; Didn't Read

In this study, researchers propose a data quality control pipeline for federated fine-tuning of foundation models.

Authors:

(1) Wanru Zhao, University of Cambridge and Shanghai AI Laboratory (equal contribution);

(2) Yaxin Du, Shanghai Jiao Tong University (equal contribution);

(3) Nicholas D. Lane, University of Cambridge and Flower Labs;

(4) Siheng Chen, Shanghai AI Laboratory and Shanghai Jiao Tong University;

(5) Yanfeng Wang, Shanghai AI Laboratory and Shanghai Jiao Tong University.

D ABLATION STUDY OF UNIFIED SCORING WITH ANCHOR DATA

We compare selection principles by the number of selected samples and by the proportion of low-quality data that remains after selection. In Figure 4, the fourth bar shows selection based on a unified score derived from anchor data, while the second and third bars show two methods that instead set the cutoff from the overall proportion of low-quality data, which is assumed to be known. In the second bar, each client sorts its samples by score and discards the lowest-scoring ones, with the discarded fraction set by the proportion of low-quality data in the entire FL training set. This approach needs no score threshold and relies solely on the known proportion of low-quality data, but it proves impractical and less effective under heterogeneous data quality, because the composition of data quality varies across clients. The third bar shows the result of applying a global score threshold aligned with the exact global proportion of low-quality data. This method requires the server to aggregate all scores from the clients in the FL framework and to know the proportion of low-quality data, which may raise privacy concerns. The results indicate that selection by anchor score consistently yields a lower proportion of low-quality data than the other two methods.
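To make the three selection principles concrete, here is a minimal Python sketch with toy numbers. It is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, higher scores are assumed to mean higher quality, and the anchor threshold is assumed to be the mean score of a small anchor set. The scores are made up and do not come from the paper.

```python
import numpy as np

# Assumption: a higher score means higher quality. All names are hypothetical.

def select_by_proportion(client_scores, low_quality_ratio):
    """Second bar: each client drops its own lowest-scoring fraction."""
    selected = []
    for scores in client_scores:
        n_drop = int(round(low_quality_ratio * len(scores)))
        order = np.argsort(scores)  # ascending, so lowest scores come first
        selected.append(np.sort(order[n_drop:]))
    return selected

def select_by_global_threshold(client_scores, low_quality_ratio):
    """Third bar: the server pools every score and cuts at the global ratio."""
    pooled = np.sort(np.concatenate(client_scores))
    n_drop = int(round(low_quality_ratio * len(pooled)))
    threshold = pooled[n_drop] if n_drop < len(pooled) else np.inf
    return [np.flatnonzero(scores >= threshold) for scores in client_scores]

def select_by_anchor(client_scores, anchor_scores):
    """Fourth bar: clients keep samples that reach the anchor threshold.

    Assumption: the threshold is the mean score of the anchor set.
    """
    threshold = float(np.mean(anchor_scores))
    return [np.flatnonzero(scores >= threshold) for scores in client_scores]

# Toy data: client A holds mostly good samples, client B mostly poor ones.
client_a = np.array([0.9, 0.8, 0.85, 0.2])
client_b = np.array([0.1, 0.3, 0.15, 0.7])
clients = [client_a, client_b]
anchor = np.array([0.5, 0.6, 0.7])  # scores of a small trusted anchor set

for name, picks in [
    ("proportion", select_by_proportion(clients, 0.5)),
    ("global threshold", select_by_global_threshold(clients, 0.5)),
    ("anchor", select_by_anchor(clients, anchor)),
]:
    print(name, [p.tolist() for p in picks])
```

In this toy case the per-client proportion rule discards a good sample from client A and keeps a poor one from client B, since both clients must drop the same fraction. The global threshold avoids that but needs every client's scores on the server. The anchor rule needs only the shared threshold.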


Figure 4: Number of selected data and proportion of low-quality data across different selection principles (select by proportion, score threshold, or anchor set score) in the federated NIID1 setting, using the ConPro score.


This paper is available on arXiv under the CC BY 4.0 DEED license.