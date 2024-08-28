
    CulturaX: A High-Quality, Multilingual Dataset for LLMs - Data Analysis and Experiments

    by Auto Encoder: How to Ignore the Signal Noise, August 28th, 2024
    Authors:

    (1) Thuat Nguyen, Dept. of Computer Science, University of Oregon, OR, USA;

    (2) Chien Van Nguyen, Dept. of Computer Science, University of Oregon, OR, USA;

    (3) Viet Dac Lai, Dept. of Computer Science, University of Oregon, OR, USA;

    (4) Hieu Man, Dept. of Computer Science, University of Oregon, OR, USA;

    (5) Nghia Trung Ngo, Dept. of Computer Science, University of Oregon, OR, USA;

    (6) Franck Dernoncourt, Adobe Research, USA;

    (7) Ryan A. Rossi, Adobe Research, USA;

    (8) Thien Huu Nguyen, Dept. of Computer Science, University of Oregon, OR, USA.

    Abstract and Introduction

    Multilingual Dataset Creation

    Data Analysis and Experiments

    Related Work

    Conclusion and References

    3 Data Analysis and Experiments

    After completing all the cleaning and deduplication steps, our final dataset comprises 6.3 trillion tokens spanning 167 languages. Table 1 provides an overview of the number of documents and tokens for the top 42 languages in CulturaX following each processing stage. As can be seen, our data cleaning pipeline substantially reduces the number of documents in the original mC4 and OSCAR datasets for each language. In total, the removed documents account for 46.48% of our initial documents, suggesting the effectiveness of our approaches to filtering noisy information for multilingual datasets.
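
    The removal rate quoted above is an aggregate over all documents, not an average of per-language rates. A minimal sketch of that arithmetic follows; the function name `removal_rate` and the toy document counts are assumptions made purely for illustration and are not values from Table 1.

```python
def removal_rate(initial_docs: int, remaining_docs: int) -> float:
    """Return the percentage of documents removed by a cleaning pipeline."""
    if initial_docs <= 0:
        raise ValueError("initial_docs must be positive")
    if not 0 <= remaining_docs <= initial_docs:
        raise ValueError("remaining_docs must lie between 0 and initial_docs")
    return 100.0 * (initial_docs - remaining_docs) / initial_docs


# Hypothetical (initial, remaining) document counts per language.
# These are toy numbers for illustration only, NOT figures from the paper.
toy_counts = {
    "lang_a": (1_000, 750),
    "lang_b": (500, 150),
}

# Per-language removal rates.
per_language = {
    lang: removal_rate(initial, remaining)
    for lang, (initial, remaining) in toy_counts.items()
}

# Aggregate rate: sum the counts first, then take the ratio, so that
# larger languages weigh more than smaller ones.
total_initial = sum(initial for initial, _ in toy_counts.values())
total_remaining = sum(remaining for _, remaining in toy_counts.values())
overall = removal_rate(total_initial, total_remaining)

print(per_language)
print(f"overall removal rate: {overall:.2f}%")
```

    With these toy counts the aggregate rate differs from the plain mean of the two per-language rates, which is why a single overall percentage should be read alongside the per-language breakdown in Table 1.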


    This paper is available on arXiv under the CC BY 4.0 DEED license.


