Multi-EuP: Analysis of Bias in Information Retrieval - Abstract and Intro

Too Long; Didn't Read

Explore language bias in multilingual information retrieval with the Multi-EuP dataset, revealing insights into fairness and demographic factors.

This paper is available on arXiv under CC 4.0 license.

Authors:

(1) Jinrui Yang, School of Computing & Information Systems, The University of Melbourne (Email: [email protected]);

(2) Timothy Baldwin, School of Computing & Information Systems, The University of Melbourne and Mohamed bin Zayed University of Artificial Intelligence, UAE (Email: (tbaldwin,trevor.cohn)@unimelb.edu.au);

(3) Trevor Cohn, School of Computing & Information Systems, The University of Melbourne.

Abstract and Intro

Background and Related Work

Multi-EuP

Experiments and Findings

Language Bias Discussion

Conclusion, Limitations, Ethics Statement, Acknowledgements, References, and Appendix

Abstract

We present Multi-EuP, a new multilingual benchmark dataset, comprising 22K multilingual documents collected from the European Parliament, spanning 24 languages. The dataset is designed to investigate fairness in multilingual information retrieval (IR), supporting analysis of both language and demographic bias in a ranking context. It offers an authentic multilingual corpus, featuring topics translated into all 24 languages, as well as cross-lingual relevance judgments. Furthermore, it provides rich demographic information associated with its documents, facilitating the study of demographic bias. We report the effectiveness of Multi-EuP for benchmarking both monolingual and multilingual IR. We also conduct a preliminary experiment on language bias caused by the choice of tokenization strategy.

1 Introduction

Information retrieval (IR) classically uses a retrieval model to query a document collection and return a ranked list of documents which are predicted to be (decreasingly) relevant to the query. Retrieval models have increasingly been based on supervised learning, involving the annotation of documents with relevance scores relative to a given query, and the training of models to predict the relative association between a query and document (Karpukhin et al., 2020; Khattab and Zaharia, 2020).
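To make the idea of a ranked list concrete, the following is a minimal sketch of a classical, unsupervised scorer (BM25) applied to a toy collection. It is an illustration only, not the authors' experimental setup: the function name `bm25_rank`, the whitespace tokenizer, the parameter values `k1=1.2` and `b=0.75`, and the three example documents are all assumptions made for this example.

```python
import math
from collections import Counter


def bm25_rank(query, docs, k1=1.2, b=0.75):
    """Return (doc_index, score) pairs sorted by decreasing BM25 score.

    Hypothetical helper for illustration; tokenization is naive
    lowercase whitespace splitting.
    """
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs

    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))

    scored = []
    for i, toks in enumerate(tokenized):
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Smoothed inverse document frequency (always positive).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturation with document length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scored.append((i, score))

    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy collection (invented for this example).
docs = [
    "the parliament debated the fisheries policy",
    "members voted on the budget",
    "fisheries policy reform was discussed at length in the parliament",
]
ranking = bm25_rank("fisheries policy", docs)
for doc_index, score in ranking:
    print(doc_index, round(score, 3))
```

Both matching documents contain each query term once, so the shorter one ranks first because of length normalization; the document with no query terms scores zero and comes last. A supervised retrieval model would replace this hand-designed scoring function with one learned from annotated query-document pairs.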


In parallel with these advances, the democratisation of the internet has led to a surge of individual contributors serving as information disseminators, hailing from various countries and regions, and posting in different languages. This has created possibilities for exploration of cross-lingual and multilingual text retrieval. Cross-lingual retrieval pertains to scenarios where queries are formulated in one language but documents are retrieved from another language. Multilingual retrieval, on the other hand, involves a query in one language but retrieval of documents across multiple languages simultaneously. An important consideration in any such work is both robustness and fairness across different combinations of languages – for instance, are results from one language consistently ranked higher than those from another for certain types of query?
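One simple way to probe the question of whether one language is consistently ranked above another is to compare the average rank position of documents per language over a set of queries. The sketch below is a hypothetical diagnostic, not a metric taken from the paper: the function name `mean_rank_by_language` and the toy rankings are invented for illustration, and a real analysis would also need to control for how many relevant documents exist in each language.

```python
from collections import defaultdict


def mean_rank_by_language(rankings):
    """Average rank position (1 = top) of documents in each language.

    `rankings` is a list of ranked result lists, one per query; each
    entry is the language code of the document at that rank.
    Hypothetical helper for illustration only.
    """
    rank_totals = defaultdict(int)
    doc_counts = defaultdict(int)
    for ranked_langs in rankings:
        for rank, lang in enumerate(ranked_langs, start=1):
            rank_totals[lang] += rank
            doc_counts[lang] += 1
    return {lang: rank_totals[lang] / doc_counts[lang] for lang in rank_totals}


# Toy multilingual result lists for two queries (invented data).
rankings = [
    ["en", "en", "fr", "pl"],
    ["en", "fr", "en", "pl"],
]
mean_ranks = mean_rank_by_language(rankings)
print(mean_ranks)  # {'en': 1.75, 'fr': 2.5, 'pl': 4.0}
```

In this toy data, English documents sit near the top on average while Polish documents are always last. A gap of that kind, if it persisted across many queries with comparable relevance, would be one signal of language bias in the ranker.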


While progress has been made towards multilingual retrieval through the release of datasets such as Mr. TYDI (Zhang et al., 2021) and mMARCO (Bonifacio et al., 2021), both are limited in that they evaluate monolingual retrieval for a range of languages, rather than true multilingual retrieval over multiple languages simultaneously. Additionally, mMARCO was created by machine translation of MS MARCO (Nguyen et al., 2016), which introduces translation errors as a confounding factor.


We present a multilingual dataset based on the European Parliament debate archive with queries in 24 distinct languages, and relevance judgements also across all 24 languages. This ensures the “multilingual” nature of the dataset in terms of both query-to-document and document-to-query associations. We additionally augment each document with comprehensive metadata of the author, including gender, nationality, political affiliation, and age, for use in exploring fairness with respect to protected attributes.


Our work contributes to the field in three main ways: (1) we construct and release the Multi-EuP dataset, a resource for multilingual retrieval over 24 languages, effectively capturing the multilingual nature of both queries and documents; (2) we explore language bias within the realm of multilingual retrieval, revealing that multilingual IR using BM25 indeed exhibits notable language bias; and (3) we supplement the dataset with rich author metadata to enable research on fairness and demographic bias in IR.[1]




[1] The Multi-EuP dataset is available for download from https://github.com/jrnlp/Multi-EuP.