Authors: (1) Thuat Nguyen, Dept. of Computer Science, University of Oregon, OR, USA; (2) Chien Van Nguyen, Dept. of Computer Science, University of Oregon, OR, USA; (3) Viet Dac Lai, Dept. of Computer Science, University of Oregon, OR, USA; (4) Hieu Man, Dept. of Computer Science, University of Oregon, OR, USA; (5) Nghia Trung Ngo, Dept. of Computer Science, University of Oregon, OR, USA; (6) Franck Dernoncourt, Adobe Research, USA; (7) Ryan A. Rossi, Adobe Research, USA; (8) Thien Huu Nguyen, Dept. of Computer Science, University of Oregon, OR, USA.

5 Conclusion

We present CulturaX, a novel multilingual dataset with text data for 167 languages. Our dataset is cleaned and deduplicated via a comprehensive pipeline, producing 6.3 trillion tokens. CulturaX is thus a large-scale and high-quality dataset, which can be readily used to train high-performing LLMs for multiple languages. Our data is openly accessible to the public to promote further research and applications of multilingual learning.

This paper is available on arXiv under the CC BY 4.0 DEED license.



