This paper is available on arXiv under a CC 4.0 license.


(1) Jinrui Yang, School of Computing & Information Systems, The University of Melbourne (Email: [email protected]);

(2) Timothy Baldwin, School of Computing & Information Systems, The University of Melbourne and Mohamed bin Zayed University of Artificial Intelligence, UAE (Email: (tbaldwin,trevor.cohn);

(3) Trevor Cohn, School of Computing & Information Systems, The University of Melbourne.

6 Conclusion

In this paper, we introduce Multi-EuP, a novel dataset for multilingual information retrieval across 24 languages, collected from European Parliament debates. The demographic information provided with Multi-EuP serves a dual purpose: it supports multilingual retrieval tasks, and it also enables research on fairness and bias. In particular, the dataset can be used to investigate equitable representation and bias mitigation in document ranking settings.

Figure 2: Language correlation matrix between topics and the top 100 documents in the ranking output, in a one-vs-many setting. Rows are topic languages and columns are document languages. The left matrix shows results using a language-specific tokenizer, while the right matrix shows results using a simple whitespace tokenizer. Both show a strong language bias between the language of the topic and the language of the retrieved documents.

Multi-EuP supports diverse information retrieval (IR) scenarios, encompassing one-vs-one, one-vs-many, and many-vs-many settings. We demonstrated the utility of Multi-EuP as a benchmark for evaluating both monolingual and multilingual IR. Our study reveals the presence of language bias in multilingual IR when employing BM25. We further examine mitigating this bias by using a simple whitespace tokenizer in place of language-specific tokenizers.
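To make the language-bias measurement concrete, the following is a minimal sketch of BM25 ranking with a whitespace tokenizer, followed by the share of each document language among the top-ranked results. It is a toy illustration, not the experimental setup used in this paper: the function names (`whitespace_tokenize`, `bm25_rank`, `language_shares`), the parameter values `k1=1.2` and `b=0.75`, and the four example documents are all assumptions introduced here. One row of a matrix like the one in Figure 2 corresponds to such a distribution, aggregated over all topics in one language.

```python
import math
from collections import Counter


def whitespace_tokenize(text):
    # Language-agnostic tokenizer: lowercase, then split on whitespace only.
    return text.lower().split()


def bm25_rank(query, docs, k1=1.2, b=0.75):
    """Return document indices sorted by BM25 score, highest first."""
    tokenized = [whitespace_tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    query_terms = whitespace_tokenize(query)

    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return sorted(range(n_docs), key=lambda i: scores[i], reverse=True)


def language_shares(ranked, doc_langs, top_k):
    """Fraction of the top_k ranked documents written in each language."""
    counts = Counter(doc_langs[i] for i in ranked[:top_k])
    total = sum(counts.values())
    return {lang: n / total for lang, n in counts.items()}


# Hypothetical mixed-language collection (one-vs-many setting).
docs = [
    "the parliament debates fisheries policy",
    "das parlament debattiert die fischereipolitik",
    "le parlement débat de la politique de la pêche",
    "fisheries policy reform in the union",
]
doc_langs = ["en", "de", "fr", "en"]

ranked = bm25_rank("fisheries policy", docs)
shares = language_shares(ranked, doc_langs, top_k=2)
```

In this toy collection the English query shares surface forms only with the English documents, so the top two results are both English. This is the kind of lexical-overlap effect that a topic-language versus document-language matrix makes visible.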

We propose to conduct future work in three main areas. First, we intend to expand our investigation of language bias to encompass a broader range of ranking methods, including neural methods such as mDPR (Zhang et al., 2021), mColBERT (Lawrie et al., 2023) and PLAID-X (Santhanam et al., 2022). Second, we will expand the dataset by developing an automated API to retrieve data published by the European Parliament (EP), thereby keeping our dataset synchronized in real time. Lastly, our current experiments have explored language bias only, but we plan to further investigate gender bias, age bias, and nationality bias.


Limitations

The limitations of the Multi-EuP dataset are notable but navigable. Primarily, the temporal coverage of the dataset is confined to the past three years. This constraint arises because, before 2020, documents released by the EU were predominantly available in monolingual versions only. A potential remedy is to merge in the Europarl (Koehn, 2005) collection, which would yield a more comprehensive Multi-EuP dataset.

Furthermore, the dataset is skewed in domain, in that Multi-EuP inevitably centers on political matters. This presents challenges, particularly given the nuances of political language, but the dataset still provides a solid foundation for studying multilingual retrieval. We also believe that it can serve as a starting point for broader explorations of cross-domain and open-domain transfer learning, thus contributing to the broader landscape of language understanding and retrieval.

Ethics Statement

The dataset contains publicly available EP data that does not include personal or sensitive information, with the exception of information relating to public officeholders, e.g., the names of the active members of the European Parliament, European Council, or other official administration bodies. The collected data is licensed under the Creative Commons Attribution 4.0 International licence.


Acknowledgements

This research was funded by a Melbourne Research Scholarship and undertaken using the LIEF HPC-GPGPU Facility hosted at the University of Melbourne. This facility was established with the assistance of LIEF Grant LE170100200. We would like to thank George Buchanan for providing valuable feedback.


