Multi-EuP: Analysis of Bias in Information Retrieval - Language Bias Discussion
by @mediabias



Too Long; Didn't Read

Explore language bias in multilingual information retrieval with the Multi-EuP dataset, revealing insights into fairness and demographic factors.

This paper is available on arxiv under CC 4.0 license.

Authors:

(1) Jinrui Yang, School of Computing & Information Systems, The University of Melbourne (Email: [email protected]);

(2) Timothy Baldwin, School of Computing & Information Systems, The University of Melbourne and Mohamed bin Zayed University of Artificial Intelligence, UAE (Email: (tbaldwin,trevor.cohn)@unimelb.edu.au);

(3) Trevor Cohn, School of Computing & Information Systems, The University of Melbourne.

Abstract and Intro

Background and Related Work

Multi-EuP

Experiments and Findings

Language Bias Discussion

Conclusion, Limitations, Ethics Statement, Acknowledgements, References, and Appendix

5 Language Bias Discussion

In light of our findings in a one-vs-many setting, we were keen to delve further into the underlying causes of the disparity between languages.

5.1 Bias Detection

Language bias is likely if the query language aligns better with one document language than another. As mentioned earlier, Pyserini supports different tokenizers: language-specific tokenizers or simple whitespace tokenization. In the one-vs-many setting, we therefore analyze the language composition of the top-100 rankings for the 100 topics. When indexing the document collection, we used the simple whitespace tokenizer, given the multilingual nature of the collection. For the queries at retrieval time, however, we employed two different tokenizers: a language-specific tokenizer and the whitespace tokenizer.


We conducted a correlation analysis between the language of the topics and the language of the top-100 relevant documents. From Table 2, we can see that relevance judgments in our test cases are consistent across languages, ensuring uniformity in the correlation matrix within the test set. However, Figure 2 reveals that both approaches generate strong language bias: in both cases, the query language aligns better with documents in its own language than with documents in other languages. The right plot appears to show that languages from the same family are strongly correlated (e.g., PL and CS, or IT and ES), possibly because they share some vocabulary.
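The composition analysis described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `language_composition`, the input layout (`rankings`, `doc_lang`), and the toy document IDs are all assumptions made for the example.

```python
from collections import Counter


def language_composition(rankings, doc_lang, k=100):
    """Return {query_lang: {doc_lang: share of the top-k results}}.

    rankings: {query_lang: {topic_id: [doc_id, ...]}}, ranked best-first
    doc_lang: {doc_id: language code}
    Both structures are hypothetical; adapt them to the actual run files.
    """
    matrix = {}
    for q_lang, runs in rankings.items():
        counts = Counter()
        for ranked in runs.values():
            # Count the language of every document in the top-k cut-off.
            counts.update(doc_lang[doc_id] for doc_id in ranked[:k])
        total = sum(counts.values())
        matrix[q_lang] = (
            {lang: n / total for lang, n in counts.items()} if total else {}
        )
    return matrix


# Toy data, invented purely for illustration.
doc_lang = {"d1": "EN", "d2": "EN", "d3": "PL", "d4": "CS"}
rankings = {
    "EN": {"t1": ["d1", "d2", "d3"], "t2": ["d2", "d1", "d4"]},
    "PL": {"t1": ["d3", "d4", "d1"]},
}
composition = language_composition(rankings, doc_lang, k=3)
```

Each row of `composition` sums to 1, so a row dominated by its own language code would indicate the kind of same-language preference discussed here.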

5.2 Collection Distribution Factors

Initially, we hypothesized that the disparity in the number of documents for each language may be a contributing factor to this bias. Figure 3 presents the regression line between the number of documents in a given language and MRR, which explains much of the variation across languages.
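A regression of this kind can be sketched with ordinary least squares. The snippet assumes MRR is regressed on document count (the axis orientation in Figure 3 may differ), and the numbers are placeholders, not the values reported in the paper; `fit_line` and `residuals` are names chosen for this example.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept, plus R^2."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared


def residuals(xs, ys, slope, intercept):
    """Observed minus fitted value for every point."""
    return [y - (slope * x + intercept) for x, y in zip(xs, ys)]


# Placeholder values only; NOT the figures reported in the paper.
doc_counts = [1000, 2000, 3000, 4000, 5000]
mrr_scores = [10.0, 12.0, 14.0, 9.0, 18.0]

slope, intercept, r2 = fit_line(doc_counts, mrr_scores)
res = residuals(doc_counts, mrr_scores, slope, intercept)

# Index of the point that falls furthest short of the fitted line.
worst = min(range(len(res)), key=res.__getitem__)
```

Under this orientation, a language with many documents but a low MRR shows up as a large negative residual, which is one way to flag outliers.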


However, note the outlier above the regression line (Polish: PL), which has a substantial number of documents but surprisingly low MRR performance. We refer to this phenomenon as a “BM25 unfriendly” language. According to Wojtasik et al. (2023), the main reason for the low performance of Polish lies in its highly inflected morphology, which gives rise to a multitude of word forms per lexeme, including inflections of proper names, and to complex morphological structure. In such cases, lexical matching is less effective than in morphologically simpler languages. Furthermore, the LUCENE 8.5.1 API does not have a language-specific tokenizer for Polish. Conversely, languages below the regression line can be termed “BM25 friendly” languages, as they require fewer documents to achieve higher MRR in retrieval.
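The effect of inflection on lexical matching can be illustrated with a small sketch. It uses exact token overlap after whitespace splitting and lowercasing as a stand-in for the term matching that BM25 relies on; no BM25 scores are computed, and the example sentences and the helper `whitespace_tokens` are invented for illustration.

```python
def whitespace_tokens(text):
    """Lowercase and split on whitespace; no stemming or lemmatization."""
    return text.lower().split()


# The city name "Warszawa" appears in three different case forms.
query = "Warszawa"
documents = {
    "doc_nominative": "Warszawa jest stolicą Polski",  # nominative: Warszawa
    "doc_genitive": "Jadę do Warszawy",                # genitive: Warszawy
    "doc_locative": "Mieszkam w Warszawie",            # locative: Warszawie
}

# A document "matches" only if it contains the exact query token.
matches = {
    name: query.lower() in whitespace_tokens(text)
    for name, text in documents.items()
}
print(matches)
```

Only the nominative form matches the query token exactly, even though all three sentences mention the same city, which is the kind of mismatch that makes purely lexical retrieval harder for highly inflected languages.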

5.3 Language Tokenizer Factors

Secondly, we speculated that the choice of language-specific Analyzer in LUCENE might be a contributing factor, as it influences word tokenization, token filtering, synonym expansion, and other processing. [7] To investigate this, we conducted a controlled experiment in the one-vs-many setting. When indexing the collection, we employed whitespace as the tokenizer, given the multilingual nature of the collection. For the queries, however, we experimented with either a language-specific tokenizer or the whitespace tokenizer. We then compared the linear regression of MRR against the number of documents in Figure 3. On the right side of the plot, we can see a strong correlation when using whitespace tokenization for both the collection and the queries, reducing language bias.
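To make the difference between the two query-side settings concrete, here is a toy sketch. It does not call LUCENE or Pyserini: `toy_english_analyze` is a hypothetical stand-in for a language-specific analysis chain (lowercasing, stopword removal, crude plural stripping), and the stopword list and suffix rule are assumptions chosen only to show how the resulting query terms can diverge from whitespace tokens.

```python
# Hypothetical stopword list, far smaller than any real analyzer's.
ENGLISH_STOPWORDS = {"the", "of", "in", "a", "and"}


def whitespace_analyze(text):
    """Split on whitespace only; tokens are otherwise left untouched."""
    return text.split()


def toy_english_analyze(text):
    """Toy language-specific chain: lowercase, drop stopwords, strip a plural 's'."""
    tokens = []
    for raw in text.split():
        token = raw.strip(".,;:!?").lower()
        if not token or token in ENGLISH_STOPWORDS:
            continue
        if token.endswith("s") and len(token) > 3:
            token = token[:-1]  # crude stand-in for a real stemmer
        tokens.append(token)
    return tokens


query = "The rights of workers"
print(whitespace_analyze(query))   # ['The', 'rights', 'of', 'workers']
print(toy_english_analyze(query))  # ['right', 'worker']
```

When the index holds whitespace tokens, query terms produced by a different analysis chain may no longer line up with the indexed terms in the same way for every language. This is offered as one possible reading of the experiment, not as a claim about how LUCENE's analyzers behave.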


Furthermore, when transitioning from language-specific tokenizers to whitespace tokenizers, the overall MRR across all languages declined modestly, from 15.02 to 14.18. That is, the original performance level was largely preserved, but language bias was diminished by using simple whitespace tokenization.
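For reference, MRR (mean reciprocal rank) averages, over all topics, the reciprocal of the rank at which the first relevant document appears. The sketch below uses invented run and judgment data, and the function names are illustrative. The scores quoted above appear to be on a 0-100 scale, which is an assumption here, so the fraction returned would need to be multiplied by 100 to compare.

```python
def reciprocal_rank(ranked_doc_ids, relevant_ids):
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs, qrels):
    """Average reciprocal rank over all topics in `runs`.

    runs:  {topic_id: [doc_id, ...]} ranked best-first
    qrels: {topic_id: set of relevant doc_ids}
    """
    scores = [
        reciprocal_rank(ranked, qrels.get(topic, set()))
        for topic, ranked in runs.items()
    ]
    return sum(scores) / len(scores) if scores else 0.0


# Toy data, invented for illustration.
runs = {"t1": ["a", "b", "c"], "t2": ["d", "e", "f"], "t3": ["g", "h"]}
qrels = {"t1": {"a"}, "t2": {"e"}, "t3": {"z"}}

mrr = mean_reciprocal_rank(runs, qrels)  # (1 + 1/2 + 0) / 3
```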




[7] https://lucene.apache.org/core/8_0_0/core/org/apache/lucene/analysis/package-summary.html#package.description