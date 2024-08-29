
    Enhancing Data Quality: Ablation Study of Unified Scoring With Anchor Data

    by Computational Technology for All, August 29th, 2024
    In this study, researchers propose a data quality control pipeline for federated fine-tuning of foundation models.
    Authors:

    (1) Wanru Zhao, University of Cambridge and Shanghai AI Laboratory (equal contribution);

    (2) Yaxin Du, Shanghai Jiao Tong University (equal contribution);

    (3) Nicholas D. Lane, University of Cambridge and Flower Labs;

    (4) Siheng Chen, Shanghai AI Laboratory and Shanghai Jiao Tong University;

    (5) Yanfeng Wang, Shanghai AI Laboratory and Shanghai Jiao Tong University.

    D ABLATION STUDY OF UNIFIED SCORING WITH ANCHOR DATA

    We compare selection principles along two measures: the number of selected samples and the proportion of low-quality data among them. The comparison contrasts selection based on unified scoring with anchor data (the fourth bar in Figure 4) with two methods that rely on knowing the overall proportion of low-quality data. In the method shown by the second bar, each client sorts its samples by score and removes the lowest-scored ones, with the number removed set by the proportion of low-quality data in the entire FL training set. This approach needs no score threshold, only the known proportion of low-quality data, but it is impractical and less effective when data quality is heterogeneous, because the share of low-quality data varies across clients. The third bar shows the result of applying a global score threshold that matches the exact proportion of low-quality data globally. This method requires the server to aggregate all scores from the clients in the FL framework and to know the proportion of low-quality data, which may raise privacy concerns. The results indicate that selection by anchor score consistently yields a lower proportion of low-quality data than the other two methods.


    Figure 4: Number of selected data and the proportion of low-quality data across different selection principles (select by proportion, score threshold, or anchor set score) in federated NIID1 setting, employing the ConPro score.
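
    The three selection principles compared above can be sketched in a few lines of NumPy. This is an illustrative sketch only, not the authors' implementation. It assumes that a higher score means higher quality (consistent with "removing the lowest-scored samples" in the text), and it assumes that the anchor threshold is the mean score of the anchor set; the text does not state how the threshold is derived from the anchor scores. All function and parameter names are hypothetical.

```python
import numpy as np


def select_by_proportion(client_scores, low_quality_fraction):
    """Each client drops its own lowest-scored fraction of samples.

    Uses only the known global proportion of low-quality data,
    so no score threshold is needed.
    """
    selected = []
    for scores in client_scores:
        n_drop = int(round(low_quality_fraction * len(scores)))
        order = np.argsort(scores, kind="stable")  # lowest scores first
        selected.append(np.sort(order[n_drop:]))   # indices that are kept
    return selected


def select_by_global_threshold(client_scores, low_quality_fraction):
    """Server pools every client's scores and cuts at the global quantile.

    Requires sharing all scores with the server, which is the
    privacy concern raised in the text.
    """
    pooled = np.concatenate(client_scores)
    threshold = np.quantile(pooled, low_quality_fraction)
    return [np.flatnonzero(scores > threshold) for scores in client_scores]


def select_by_anchor_threshold(client_scores, anchor_scores):
    """Clients keep samples scoring at or above a threshold from anchor data.

    ASSUMPTION: the threshold is the mean anchor score; the source
    does not specify the exact rule.
    """
    threshold = float(np.mean(anchor_scores))
    return [np.flatnonzero(scores >= threshold) for scores in client_scores]


def low_quality_share(selected, is_low):
    """Fraction of selected samples that are flagged as low quality."""
    kept = np.concatenate([flags[idx] for flags, idx in zip(is_low, selected)])
    return float(kept.mean()) if kept.size else 0.0
```

    With one clean client and one mostly low-quality client, the per-client proportion rule discards good samples from the clean client and keeps bad ones from the noisy client, which is the failure under heterogeneous data quality described above. The two threshold-based rules do not share that problem, but only the anchor-based rule avoids pooling client scores on the server.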


    This paper is available on arXiv under the CC BY 4.0 DEED license.


