This paper is available on arXiv under a CC 4.0 license.
Authors:
(1) Jinrui Yang, School of Computing & Information Systems, The University of Melbourne (Email: [email protected]);
(2) Timothy Baldwin, School of Computing & Information Systems, The University of Melbourne and Mohamed bin Zayed University of Artificial Intelligence, UAE (Email: (tbaldwin,trevor.cohn)@unimelb.edu.au);
(3) Trevor Cohn, School of Computing & Information Systems, The University of Melbourne.
We present Multi-EuP, a new multilingual benchmark dataset comprising 22K multilingual documents collected from the European Parliament and spanning 24 languages. The dataset is designed to investigate fairness in multilingual information retrieval (IR), supporting analysis of both language bias and demographic bias in ranking. It offers an authentic multilingual corpus, with topics translated into all 24 languages, as well as cross-lingual relevance judgments. Furthermore, it provides rich demographic information associated with its documents, facilitating the study of demographic bias. We report the effectiveness of Multi-EuP for benchmarking both monolingual and multilingual IR. We also conduct a preliminary experiment on language bias caused by the choice of tokenization strategy.
Information retrieval (IR) classically uses a retrieval model to query a document collection and return a ranked list of documents which are predicted to be (decreasingly) relevant to the query. Retrieval models have increasingly been based on supervised learning, involving the annotation of documents with relevance scores relative to a given query, and the training of models to predict the relative association between a query and document (Karpukhin et al., 2020; Khattab and Zaharia, 2020).
In parallel with these advances, the democratisation of the internet has led to a surge of individual contributors serving as information disseminators, hailing from various countries and regions, and posting in different languages. This has created opportunities to explore cross-lingual and multilingual text retrieval. Cross-lingual retrieval covers scenarios where queries are formulated in one language but documents are retrieved from another language. Multilingual retrieval, by contrast, involves a query in one language and retrieval of documents across multiple languages simultaneously. An important consideration in any such work is robustness and fairness across different combinations of languages: for instance, are results in one language consistently ranked higher than results in another for certain types of query?
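To make the fairness question concrete, the sketch below compares how often each language appears in the top-k of a multilingual ranking with how often it appears among the relevant documents. This is only an illustrative measure, not the metric used in this paper; the function names `language_share_at_k` and `over_representation` and the toy language labels are assumptions introduced here.

```python
from collections import Counter


def language_share_at_k(ranked_langs, k):
    """Return the fraction of the top-k results written in each language."""
    top = ranked_langs[:k]
    if not top:
        return {}
    counts = Counter(top)
    return {lang: n / len(top) for lang, n in counts.items()}


def over_representation(ranked_langs, relevant_langs, k):
    """Top-k share minus share among relevant documents, per language.

    A positive value means the language is over-represented at the top of
    the ranking relative to its share of the relevant documents.
    """
    top_share = language_share_at_k(ranked_langs, k)
    rel_share = language_share_at_k(relevant_langs, len(relevant_langs))
    return {
        lang: top_share.get(lang, 0.0) - share
        for lang, share in rel_share.items()
    }


# Hypothetical ranking: language of each retrieved document, best first.
ranked = ["en", "en", "en", "de", "fr", "de"]
# Hypothetical language labels of all documents judged relevant.
relevant = ["en", "de", "fr", "de", "fr", "en"]

gaps = over_representation(ranked, relevant, k=4)
for lang, gap in sorted(gaps.items()):
    print(f"{lang}: {gap:+.3f}")
```

In this toy ranking, English takes three of the top four positions although it accounts for only a third of the relevant documents, which is the kind of systematic skew that the question above asks about.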
While progress has been made towards multilingual retrieval through the release of datasets such as Mr. TYDI (Zhang et al., 2021) and mMARCO (Bonifacio et al., 2021), both are limited in that they evaluate monolingual retrieval for a range of languages, rather than true multilingual retrieval over multiple languages simultaneously. Additionally, mMARCO was created by machine translation of MS MARCO (Nguyen et al., 2016), which introduces translation errors as a confounding factor.
We present a multilingual dataset based on the European Parliament debate archive, with queries in 24 distinct languages and relevance judgments across all 24 languages. This ensures the “multilingual” nature of the dataset in terms of both query-to-document and document-to-query associations. We additionally augment each document with comprehensive metadata about its author, including gender, nationality, political affiliation, and age, for use in exploring fairness with respect to protected attributes.
Our work contributes to the field in three main ways: (1) we construct and release the Multi-EuP dataset, a resource for multilingual retrieval over 24 languages, effectively capturing the multilingual nature of both queries and documents; (2) we explore language bias within the realm of multilingual retrieval, revealing that multilingual IR using BM25 indeed exhibits notable language bias; and (3) we supplement the dataset with rich author metadata to enable research on fairness and demographic bias in IR.[1]
[1] The Multi-EuP dataset is available for download from https://github.com/jrnlp/Multi-EuP.
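For readers unfamiliar with BM25, the following is a minimal sketch of Okapi BM25 scoring over a small mixed-language collection. It is not the experimental setup of this paper: the whitespace tokenizer, the parameter values `k1=1.2` and `b=0.75`, the IDF variant, and the three toy documents are all assumptions chosen for brevity.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization, used only for brevity.
    return text.lower().split()


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document in `docs` against `query` with Okapi BM25."""
    if not docs:
        return []
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    query_terms = tokenize(query)
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Smoothed IDF that stays non-negative for frequent terms.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term frequency saturation with document length normalisation.
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy collection: one short document per language (illustrative only).
docs = [
    "the parliament debates the climate policy",      # English
    "das parlament debattiert die klimapolitik",      # German
    "le parlement débat de la politique climatique",  # French
]
for doc, score in zip(docs, bm25_scores("climate policy", docs)):
    print(f"{score:.3f}  {doc}")
```

Because BM25 rewards only exact token overlap, the English query scores the English document and gives the German and French documents a score of zero, even though all three discuss the same topic. This illustrates one way a purely lexical ranker can treat languages unevenly in a multilingual collection; it is not a claim about the specific cause of the bias reported in the paper.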