Authors:
(1) Thuat Nguyen, Dept. of Computer Science, University of Oregon, OR, USA;
(2) Chien Van Nguyen, Dept. of Computer Science, University of Oregon, OR, USA;
(3) Viet Dac Lai, Dept. of Computer Science, University of Oregon, OR, USA;
(4) Hieu Man, Dept. of Computer Science, University of Oregon, OR, USA;
(5) Nghia Trung Ngo, Dept. of Computer Science, University of Oregon, OR, USA;
(6) Franck Dernoncourt, Adobe Research, USA;
(7) Ryan A. Rossi, Adobe Research, USA;
(8) Thien Huu Nguyen, Dept. of Computer Science, University of Oregon, OR, USA.

Table of Links

Abstract and Introduction
Multilingual Dataset Creation
Data Analysis and Experiments
Related Work
Conclusion and References

3 Data Analysis and Experiments

After completing all the cleaning and deduplication steps, our final dataset comprises 6.3 trillion tokens spanning 167 languages. Table 1 provides an overview of the number of documents and tokens for the top 42 languages in CulturaX after each processing stage. As can be seen, our data cleaning pipeline substantially reduces the number of documents in the original mC4 and OSCAR datasets for each language. In total, the removed documents account for 46.48% of our initial documents, suggesting the effectiveness of our approach in filtering noisy information from multilingual datasets.

This paper is available on arxiv under CC BY 4.0 DEED license.
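The removal percentage quoted in this section is a ratio of removed documents to initial documents, aggregated over languages. The following is a minimal sketch of that arithmetic, not the authors' code: the function names and the per-language counts are hypothetical and are not taken from Table 1.

```python
from typing import Dict, Tuple


def removal_rate(initial_docs: int, remaining_docs: int) -> float:
    """Return the percentage of documents removed by a cleaning pipeline."""
    if initial_docs <= 0:
        raise ValueError("initial_docs must be positive")
    if not 0 <= remaining_docs <= initial_docs:
        raise ValueError("remaining_docs must lie between 0 and initial_docs")
    return 100.0 * (initial_docs - remaining_docs) / initial_docs


def overall_removal_rate(counts: Dict[str, Tuple[int, int]]) -> float:
    """Aggregate removal rate over all languages.

    `counts` maps a language code to (initial_docs, remaining_docs).
    The totals are summed first, so large languages weigh more than
    small ones; this is not an average of per-language rates.
    """
    total_initial = sum(initial for initial, _ in counts.values())
    total_remaining = sum(remaining for _, remaining in counts.values())
    return removal_rate(total_initial, total_remaining)


# Hypothetical counts for illustration only (NOT values from Table 1).
example_counts = {
    "lang_a": (1_000_000, 600_000),
    "lang_b": (250_000, 100_000),
}

if __name__ == "__main__":
    for lang, (initial, remaining) in example_counts.items():
        print(f"{lang}: {removal_rate(initial, remaining):.2f}% removed")
    print(f"overall: {overall_removal_rate(example_counts):.2f}% removed")
```

Summing the totals before dividing matches the phrasing "the removed documents account for ... of our initial documents"; averaging the per-language rates would instead give every language equal weight and generally yield a different figure.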

CulturaX: A High-Quality, Multilingual Dataset for LLMs - Data Analysis and Experiments
