Authors:
(1) Wanru Zhao, University of Cambridge and Shanghai AI Laboratory (equal contribution);
(2) Yaxin Du, Shanghai Jiao Tong University (equal contribution);
(3) Nicholas D. Lane, University of Cambridge and Flower Labs;
(4) Siheng Chen, Shanghai AI Laboratory and Shanghai Jiao Tong University;
(5) Yanfeng Wang, Shanghai AI Laboratory and Shanghai Jiao Tong University.
To model real-world scenarios, we design three types of low-quality data generation rules.
Cut. Cut-type data simulates scenarios where an answer exceeds a word limit, truncating its last few words. In practice, we set a threshold of 100 tokens and retain only the tokens up to this limit, as demonstrated in Table 5.
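The truncation rule above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name and the whitespace tokenization are assumptions, since the paper does not specify its tokenizer.

```python
def cut_answer(answer: str, limit: int = 100) -> str:
    # Split the answer into tokens (whitespace split is an
    # illustrative stand-in for the paper's unspecified tokenizer),
    # then keep only the first `limit` tokens.
    tokens = answer.split()
    return " ".join(tokens[:limit])


# Example: a 150-word answer is cut down to its first 100 words.
long_answer = " ".join(f"word{i}" for i in range(150))
truncated = cut_answer(long_answer, limit=100)
```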
Deletion. Deletion-type data simulates answers that are missing important terminology. In practice, we delete a portion of the tokens in the answer to mimic missing terminology or nouns, as shown in Table 4.
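The deletion rule can be sketched like this. The 30% drop ratio and the uniform-random choice of positions are assumptions for illustration; the paper only says that part of the tokens are deleted.

```python
import random


def delete_tokens(answer: str, drop_ratio: float = 0.3, seed: int = 0) -> str:
    # Tokenize by whitespace (stand-in for the paper's tokenizer),
    # pick a random subset of positions, and drop those tokens
    # while preserving the order of the remaining ones.
    tokens = answer.split()
    rng = random.Random(seed)
    n_drop = int(len(tokens) * drop_ratio)
    drop = set(rng.sample(range(len(tokens)), n_drop))
    return " ".join(t for i, t in enumerate(tokens) if i not in drop)


# Example: 3 of 10 tokens are removed at drop_ratio=0.3.
corrupted = delete_tokens("the patient presents with acute chest pain and shortness of breath")
```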
Exchange. Exchange-type low-quality data is characterized by complete irrelevance between a question and its answer. We swap answers among question-answer pairs to produce this form of low-quality data. Table 3 shows examples of this exchange: the first two samples swap answers within the Medalpaca-flashcards dataset (Han et al., 2023), while the remaining two, drawn from PMC-Llama (Wu et al., 2023), have their rationales for the answer choices exchanged and a different, incorrect choice selected to ensure a clear distinction in the option outputs.
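The answer-swapping rule can be sketched as a derangement of the answers across pairs, so that no question keeps its original answer. The derangement-by-reshuffling strategy is an assumption; the paper says only that answers are swapped among pairs.

```python
import random


def exchange_answers(pairs, seed: int = 0):
    # `pairs` is a list of (question, answer) tuples. Shuffle the
    # answers, and re-shuffle until no answer remains with its
    # original question (a derangement exists for any n >= 2).
    rng = random.Random(seed)
    answers = [a for _, a in pairs]
    while True:
        rng.shuffle(answers)
        if all(a != orig for a, (_, orig) in zip(answers, pairs)):
            break
    return [(q, a) for (q, _), a in zip(pairs, answers)]


# Example: every question is re-paired with an unrelated answer.
qa = [("q1", "a1"), ("q2", "a2"), ("q3", "a3")]
swapped = exchange_answers(qa)
```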
This paper is available on arxiv under CC BY 4.0 DEED license.