
    Multi-EuP: Analysis of Bias in Information Retrieval - Abstract and Intro

    by Media Bias [Deeply Researched Academic Papers], May 1st, 2024

    Too Long; Didn't Read

    Explore language bias in multilingual information retrieval with the Multi-EuP dataset, revealing insights into fairness and demographic factors.

    This paper is available on arXiv under a CC 4.0 license.

    Authors:

    (1) Jinrui Yang, School of Computing & Information Systems, The University of Melbourne (Email: [email protected]);

    (2) Timothy Baldwin, School of Computing & Information Systems, The University of Melbourne and Mohamed bin Zayed University of Artificial Intelligence, UAE (Email: (tbaldwin,trevor.cohn)@unimelb.edu.au);

    (3) Trevor Cohn, School of Computing & Information Systems, The University of Melbourne.

    Abstract and Intro

    Background and Related Work

    Multi-EuP

    Experiments and Findings

    Language Bias Discussion

    Conclusion, Limitations, Ethics Statement, Acknowledgements, References, and Appendix

    Abstract

    We present Multi-EuP, a new multilingual benchmark dataset comprising 22K multilingual documents collected from the European Parliament, spanning 24 languages. The dataset is designed to investigate fairness in multilingual information retrieval (IR), supporting analysis of both language and demographic bias in a ranking context. It boasts an authentic multilingual corpus, featuring topics translated into all 24 languages, as well as cross-lingual relevance judgments. Furthermore, it offers rich demographic information associated with its documents, facilitating the study of demographic bias. We report the effectiveness of Multi-EuP for benchmarking both monolingual and multilingual IR. We also conduct a preliminary experiment on language bias caused by the choice of tokenization strategy.

    1 Introduction

    Information retrieval (IR) classically uses a retrieval model to query a document collection and return a ranked list of documents which are predicted to be (decreasingly) relevant to the query. Retrieval models have increasingly been based on supervised learning, involving the annotation of documents with relevance scores relative to a given query, and the training of models to predict the relative association between a query and document (Karpukhin et al., 2020; Khattab and Zaharia, 2020).
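
To make the idea of a ranked list concrete, the following is a minimal sketch of scoring and ranking with BM25, the unsupervised baseline named later in this introduction, rather than a learned model. The function name `bm25_rank`, the whitespace tokenizer, and the parameter defaults `k1=1.2` and `b=0.75` are assumptions chosen for illustration; they are not taken from the paper.

```python
import math
from collections import Counter


def bm25_rank(query, docs, k1=1.2, b=0.75):
    """Return (doc_index, score) pairs sorted by decreasing BM25 score.

    Assumption: lowercased whitespace tokenization, which is a deliberate
    simplification and not the tokenization used by the authors.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs

    # Document frequency: number of documents containing each term.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    query_terms = query.lower().split()
    scored = []
    for idx, tokens in enumerate(tokenized):
        term_freq = Counter(tokens)
        score = 0.0
        for term in query_terms:
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            df = doc_freq[term]
            # Smoothed IDF variant that is always non-negative.
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            length_norm = 1 - b + b * len(tokens) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scored.append((idx, score))

    # sorted() is stable, so tied documents keep their input order.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Calling `bm25_rank("fishing quotas", docs)` on a small mixed-language collection returns every document with its score, best match first. Documents that share no surface token with the query score zero, which is one way a lexical ranker can favour documents written in the query's own language.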


    In parallel with these advances, the democratisation of the internet has led to a surge of individual contributors serving as information disseminators, hailing from various countries and regions, and posting in different languages. This has created possibilities for exploration of cross-lingual and multilingual text retrieval. Cross-lingual retrieval pertains to scenarios where queries are formulated in one language but documents are retrieved in another language. Multilingual retrieval, on the other hand, involves a query in one language but retrieval of documents across multiple languages simultaneously. Important considerations in any such work are robustness and fairness across different combinations of languages: for instance, are results in one language consistently ranked higher than those in another for certain types of query?
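
One simple way to probe the question of whether one language is consistently ranked higher is to compare each language's share of the top-k results with its share of the whole collection. The sketch below is an illustrative diagnostic only, not the metric used in the paper; the function names `language_share` and `exposure_ratio` are hypothetical.

```python
from collections import Counter


def language_share(langs):
    """Map each language code to its fraction of the given list."""
    if not langs:
        return {}
    counts = Counter(langs)
    return {lang: n / len(langs) for lang, n in counts.items()}


def exposure_ratio(ranked_langs, collection_langs, k):
    """Ratio of top-k share to collection share, per language.

    ranked_langs: language of each retrieved document, best rank first.
    collection_langs: language of every document in the collection.
    A ratio above 1 means the language is over-represented in the top k;
    a ratio below 1 means it is under-represented.
    """
    top_share = language_share(ranked_langs[:k])
    base_share = language_share(collection_langs)
    return {
        lang: top_share.get(lang, 0.0) / base
        for lang, base in base_share.items()
    }
```

For a single query the ratio is noisy; averaging it over many queries, and over queries issued in different languages, gives a rough picture of whether a ranker systematically favours some languages.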


    While progress has been made towards multilingual retrieval through the release of datasets such as Mr. TYDI (Zhang et al., 2021) and mMARCO (Bonifacio et al., 2021), both are limited in that they evaluate monolingual retrieval for a range of languages, rather than true multilingual retrieval over multiple languages simultaneously. Additionally, mMARCO was created by machine translation of MS MARCO (Nguyen et al., 2016), introducing translation errors as a confounding factor.


    We present a multilingual dataset based on the European Parliament debate archive with queries in 24 distinct languages, and relevance judgements also across all 24 languages. This ensures the “multilingual” nature of the dataset in terms of both query-to-document and document-to-query associations. We additionally augment each document with comprehensive metadata of the author, including gender, nationality, political affiliation, and age, for use in exploring fairness with respect to protected attributes.
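
As a rough picture of what a document with author metadata could look like in code, here is a minimal sketch. The class and field names are hypothetical, chosen to mirror the attributes listed above (gender, nationality, political affiliation, age); they are not the schema of the released dataset.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Speaker:
    # Hypothetical field names mirroring the attributes named in the text.
    gender: str
    nationality: str
    political_affiliation: str
    age: int


@dataclass(frozen=True)
class Document:
    doc_id: str
    language: str
    text: str
    speaker: Speaker


def group_by_attribute(docs, attribute):
    """Group document ids by one speaker attribute, e.g. 'nationality'."""
    groups = defaultdict(list)
    for doc in docs:
        groups[getattr(doc.speaker, attribute)].append(doc.doc_id)
    return dict(groups)
```

Grouping documents by a protected attribute in this way is a typical first step before comparing how often each group appears among highly ranked results.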


    Our work contributes to the field in three main ways: (1) we construct and release the Multi-EuP dataset, a resource for multilingual retrieval over 24 languages, effectively capturing the multilingual nature of both queries and documents; (2) we explore language bias within the realm of multilingual retrieval, revealing that multilingual IR using BM25 indeed exhibits notable language bias; and (3) we supplement the dataset with rich author metadata to enable research on fairness and demographic bias in IR.[1]



    [1] The Multi-EuP dataset is available for download from https://github.com/jrnlp/Multi-EuP.
