
Enhancing Data Quality: How Low-Quality Data Affects the Performance of Collaborative Training


Too Long; Didn't Read

In this study, researchers propose a data quality control pipeline for federated fine-tuning of foundation models.

Authors:

(1) Wanru Zhao, University of Cambridge and Shanghai AI Laboratory (equal contribution);

(2) Yaxin Du, Shanghai Jiao Tong University (equal contribution);

(3) Nicholas D. Lane, University of Cambridge and Flower Labs;

(4) Siheng Chen, Shanghai AI Laboratory and Shanghai Jiao Tong University;

(5) Yanfeng Wang, Shanghai AI Laboratory and Shanghai Jiao Tong University.

2 MOTIVATION AND SETUP: HOW LOW-QUALITY DATA AFFECTS THE PERFORMANCE OF COLLABORATIVE TRAINING

In our paper, we identify two challenges unique to federated fine-tuning of LLMs in terms of data quality. 1) Real low-quality data. First, we highlight three prevalent patterns of low-quality data observed in real-world corpora: cut, deletion, and exchange. The cut category covers content that is truncated by word-limit constraints, deletion covers instances where critical terminology is missing from the corpus, and exchange refers to examples containing entirely incorrect information. We provide specific examples of these categories in Appendix E.
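To make the discussion concrete, here is a minimal sketch of how these three corruption patterns could be simulated on clean text samples for controlled experiments. The function names, parameters, and corruption heuristics are illustrative assumptions, not the paper's implementation.

```python
import random

def corrupt_cut(text: str, keep_ratio: float = 0.5) -> str:
    """Simulate the 'cut' pattern: truncate the sample as if a word
    limit chopped off the trailing content."""
    tokens = text.split()
    return " ".join(tokens[: max(1, int(len(tokens) * keep_ratio))])

def corrupt_deletion(text: str, key_terms: set) -> str:
    """Simulate the 'deletion' pattern: drop tokens that appear in a
    set of critical domain terminology (supplied by the user)."""
    return " ".join(t for t in text.split() if t.lower() not in key_terms)

def corrupt_exchange(text: str, wrong_pool: list) -> str:
    """Simulate the 'exchange' pattern: discard the original content
    and substitute an entirely incorrect sample from a mismatch pool."""
    return random.choice(wrong_pool)
```

In practice, `key_terms` and `wrong_pool` would be built from the target corpus, e.g. medical terminology and answers drawn from unrelated questions.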


2) Quality heterogeneity. Quality heterogeneity refers to variability in data-collection quality across clients in federated learning. Because federated learning often involves a vast number of clients, each with different data-synthesis capabilities, it is unrealistic to assume uniform data quality among participants: some clients hold a larger proportion of low-quality data than others, and no single quality standard applies across all of them. We provide two Non-IID settings in Appendix B.
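One way to realize such heterogeneity in simulation is to draw a different low-quality-data ratio for each client. The Beta-distribution scheme below is an illustrative assumption of ours; the paper's actual two Non-IID settings are specified in Appendix B.

```python
import numpy as np

def assign_noise_ratios(num_clients: int, alpha: float = 0.5,
                        max_ratio: float = 0.8, seed: int = 0) -> np.ndarray:
    """Draw a per-client proportion of low-quality data.

    For alpha < 1, Beta(alpha, alpha) is U-shaped, so most clients end
    up either nearly clean or heavily corrupted, mimicking the uneven
    data-collection capabilities described above.
    """
    rng = np.random.default_rng(seed)
    return max_ratio * rng.beta(alpha, alpha, size=num_clients)

ratios = assign_noise_ratios(num_clients=10)
print({f"client_{i}": round(float(r), 2) for i, r in enumerate(ratios)})
```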


In our preliminary experiments, we consider the two factors above and adjust the proportion of low-quality data in the composition of the PMC-LLaMA (Wu et al., 2023) and MedAlpaca-flashcards (Han et al., 2023) datasets during federated training, as shown in Figure 1. Higher scores indicate better performance (for more details about the metrics, see Appendix C.2). The key observation is that the quality of the training data has a significant effect on the performance of collaborative training: low-quality data consistently degrades all of the metrics.
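As a rough illustration of this setup, the sketch below mixes clean and corrupted samples at a controllable ratio for a single client. The truncation corruption and the sweep values are placeholders, not the exact configuration used for Figure 1.

```python
import random

def truncate_half(text: str) -> str:
    """Toy 'cut'-style corruption: keep only the first half of the tokens."""
    tokens = text.split()
    return " ".join(tokens[: max(1, len(tokens) // 2)])

def build_client_dataset(clean_samples, noise_ratio, corrupt, seed=0):
    """Corrupt a `noise_ratio` fraction of one client's samples."""
    rng = random.Random(seed)
    return [corrupt(s) if rng.random() < noise_ratio else s
            for s in clean_samples]

# Hypothetical sweep over corruption levels, echoing the Figure 1 setup.
for ratio in (0.0, 0.2, 0.4, 0.6, 0.8):
    data = build_client_dataset(["an example training sample ..."] * 100,
                                noise_ratio=ratio, corrupt=truncate_half)
    # ... run federated fine-tuning on `data` and evaluate the metrics ...
```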


Figure 1: The impact of low-quality data on the performance of federated fine-tuning of LLMs.


This paper is available on arXiv under the CC BY 4.0 DEED license.