Enhancing Data Quality: How Low-Quality Data Affects the Performance of Collaborative Training



Too Long; Didn't Read

In this study, researchers propose a data quality control pipeline for federated fine-tuning of foundation models.

Authors:

(1) Wanru Zhao, University of Cambridge and Shanghai AI Laboratory (equal contribution);

(2) Yaxin Du, Shanghai Jiao Tong University (equal contribution);

(3) Nicholas D. Lane, University of Cambridge and Flower Labs;

(4) Siheng Chen, Shanghai AI Laboratory and Shanghai Jiao Tong University;

(5) Yanfeng Wang, Shanghai AI Laboratory and Shanghai Jiao Tong University.

2 MOTIVATION AND SETUP: HOW LOW-QUALITY DATA AFFECTS THE PERFORMANCE OF COLLABORATIVE TRAINING

In our paper, we identify two unique challenges for federated fine-tuning of LLMs in terms of data quality. 1) Real low-quality data. First, we highlight three prevalent patterns of low-quality data observed in real-world corpora: cut, deletion, and exchange. The cut category covers scenarios where content is truncated due to word-limit constraints; deletion refers to instances where critical terminology is absent from the corpus; and exchange denotes examples containing entirely incorrect information. We provide specific examples of these categories in Appendix E.
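To make these patterns concrete, the three corruptions can be sketched as simple text transformations. This is a minimal illustration of the categories described above; the function names, parameters, and example sentence are our own, not taken from the paper:

```python
def cut(text, keep_ratio=0.5):
    # "Cut": truncate the sample, mimicking content lost to word-limit constraints.
    words = text.split()
    return " ".join(words[: max(1, int(len(words) * keep_ratio))])

def deletion(text, critical_terms):
    # "Deletion": drop critical terminology from the sample.
    words = [w for w in text.split() if w.lower() not in critical_terms]
    return " ".join(words)

def exchange(text, wrong_text):
    # "Exchange": replace the sample with entirely incorrect information.
    return wrong_text

sample = "Aspirin inhibits cyclooxygenase enzymes reducing inflammation"
print(cut(sample))                                    # truncated version
print(deletion(sample, {"cyclooxygenase"}))           # key term removed
print(exchange(sample, "Aspirin is an antibiotic"))   # incorrect content
```

Applying these transformations to a controlled fraction of an otherwise clean corpus yields training sets with a known proportion of low-quality samples.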


2) Quality heterogeneity. Quality heterogeneity refers to the variability in data quality across different clients in federated learning. Because federated learning often involves a vast number of clients, each with different data-collection capabilities, it is unrealistic to assume uniform data quality among all participants. Consequently, some clients may hold a much higher proportion of low-quality data than others, and no uniform standard of sample quality applies across participants. We describe two Non-IID settings in Appendix B.
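One simple way to simulate this heterogeneity is to assign each client a different corruption ratio. The scheme below is a hypothetical sketch (a linear ramp of ratios from 0% to 80% across clients), not the paper's actual Non-IID settings, which are described in Appendix B:

```python
import random

def assign_quality(num_clients, num_samples, max_ratio=0.8, seed=0):
    # Give each client a different fraction of low-quality samples,
    # modelling quality heterogeneity across federated clients.
    # Hypothetical scheme: ratios spread evenly from 0 to max_ratio.
    rng = random.Random(seed)
    ratios = [max_ratio * i / max(1, num_clients - 1) for i in range(num_clients)]
    clients = []
    for r in ratios:
        # Flag each sample as low-quality with probability r.
        flags = [rng.random() < r for _ in range(num_samples)]
        clients.append({"low_quality_ratio": r, "is_low_quality": flags})
    return clients

clients = assign_quality(num_clients=5, num_samples=200)
for i, c in enumerate(clients):
    actual = sum(c["is_low_quality"]) / len(c["is_low_quality"])
    print(f"client {i}: target ratio {c['low_quality_ratio']:.2f}, actual {actual:.2f}")
```

Each client's flagged samples would then be corrupted with one of the three patterns above, producing a federation in which data quality varies systematically across participants.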


In our preliminary experiments, we account for both factors above and vary the proportion of low-quality data in the composition of the PMC-LLaMA (Wu et al., 2023) and MedAlpaca-flashcards (Han et al., 2023) datasets during federated training, as shown in Figure 1. Higher scores indicate better performance (for more details about the metrics, see Appendix C.2). The key observation is that the quality of the training data has a significant effect on the performance of collaborative training: low-quality data consistently degrades all the metrics.


Figure 1: The impact of low-quality data on the performance of federated fine-tuning of LLMs.


This paper is available on arxiv under CC BY 4.0 DEED license.