This paper is available on arXiv under a CC 4.0 license.
Authors:
(1) Jinrui Yang, School of Computing & Information Systems, The University of Melbourne (Email: [email protected]);
(2) Timothy Baldwin, School of Computing & Information Systems, The University of Melbourne and Mohamed bin Zayed University of Artificial Intelligence, UAE (Email: (tbaldwin,trevor.cohn)@unimelb.edu.au);
(3) Trevor Cohn, School of Computing & Information Systems, The University of Melbourne.
2 Background and Related Work
The European Parliament (EP) serves as an important forum for political debate and decision-making at the European Union level. Members of the European Parliament (MEPs) are elected in direct elections across the EU. European Parliament debates are presided over by the President, who guides MEPs in discussing specific subjects.
EP debates have been the source of three key datasets. First, Europarl-2005 was constructed by Koehn (2005) by collecting EP debate documents from 1996 to 2011 and extracting translations as a parallel corpus for statistical machine translation, enriched with attributes including debate date, chapter ID, MEP ID, language, MEP name, and MEP party.
Later, Rabinovich et al. (2017) built Europarl-2017 upon Europarl-2005 by introducing additional demographic attributes: MEP gender and MEP age. These were obtained from sources such as Wikidata (Vrandečić and Krötzsch, 2014) and automatic annotation tools such as Genderize[2] and AlchemyVision.[3] However, Europarl-2017 is limited to only two language pairs: English–German and English–French. Europarl-2018 (Vanmassenhove and Hardmeier, 2018) expanded upon Europarl-2017 to add twenty additional language pairs, based on the manual translations in the EP archives. These corpora have been used primarily for machine translation research.
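To illustrate how such automatic demographic annotation might be performed, the sketch below queries the public Genderize API for an MEP's first name. The helper function and the MEP record are hypothetical, and the exact pipeline used for Europarl-2017 may differ; this is a minimal sketch under those assumptions, not the authors' actual procedure.

```python
import requests

def infer_gender(first_name: str) -> dict:
    """Query the public Genderize API (https://genderize.io/) for a
    predicted gender. Hypothetical helper, illustrative only; the
    Europarl-2017 annotation pipeline may have worked differently."""
    resp = requests.get("https://api.genderize.io", params={"name": first_name})
    resp.raise_for_status()
    data = resp.json()
    # Documented response fields: name, gender ("male"/"female"/None),
    # probability, count
    return {"name": data["name"],
            "gender": data.get("gender"),
            "probability": data.get("probability", 0.0)}

# Toy usage: attach a predicted gender attribute to a (made-up) MEP record
mep = {"name": "Anna Schmidt", "party": "EPP"}
mep["gender"] = infer_gender(mep["name"].split()[0])["gender"]
print(mep)
```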
Since 2020, the EU has publicly released raw debates in the form of transcribed source-language speeches with rich multilingual topic index data, along with the original video and audio recordings. This forms the basis of the Multi-EuP dataset, with additional attributes for each speaking MEP such as an image, birthplace, and nationality.
Zhang et al. (2021) introduced Mr. TYDI, an evaluation benchmark for dense retrieval over 11 languages. It is constructed from TYDI (Clark et al., 2020), a question answering dataset: for each language, annotators provide relevance judgements for questions over Wikipedia articles. Notably, the questions for different languages are crafted independently, and relevance judgements are provided in-language only. Based on the dataset, the authors evaluate monolingual retrieval for non-English languages using BM25 and mDPR as zero-shot baselines. However, Mr. TYDI is not truly multilingual: queries in a given language are only run over documents in that same language. This is part of the gap our work aims to address.
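For concreteness, a monolingual BM25 baseline of the kind reported for Mr. TYDI can be sketched as follows. The snippet uses the rank_bm25 package with naive whitespace tokenisation and made-up documents; it is illustrative only and is not the toolkit or tokenisation the Mr. TYDI authors used.

```python
# Minimal monolingual BM25 retrieval sketch (illustrative assumptions:
# rank_bm25 package, whitespace tokenisation, toy corpus).
from rank_bm25 import BM25Okapi

documents = [
    "The European Parliament debated the budget proposal.",
    "MEPs voted on the new climate regulation.",
    "The committee discussed fisheries policy reform.",
]
tokenized_docs = [doc.lower().split() for doc in documents]
bm25 = BM25Okapi(tokenized_docs)

query = "parliament budget debate"
scores = bm25.get_scores(query.lower().split())

# Rank documents by descending BM25 score
ranking = sorted(range(len(documents)), key=lambda i: scores[i], reverse=True)
for rank, idx in enumerate(ranking, start=1):
    print(f"{rank}. score={scores[idx]:.2f}  {documents[idx]}")
```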
MS MARCO (Nguyen et al., 2016) is a widely-used dataset sourced from Bing's search query logs, but it contains English queries and documents only. To address this, Bonifacio et al. (2021) introduced mMARCO, a multilingual variant of the MS MARCO passage ranking dataset spanning 13 languages, created through machine translation based on one open-source system (Tiedemann and Thottingal, 2020) and one commercial system, Google Translate.[4] Analysis of the authors' results reveals a positive correlation between translation quality and retrieval performance, with higher translation BLEU scores yielding higher retrieval MRR. However, similar to Mr. TYDI, mMARCO supports only in-language retrieval across multiple languages, rather than multilingual retrieval.
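To make the reported metric concrete, MRR is the mean over queries of the reciprocal rank at which the first relevant document appears. The sketch below uses made-up rankings and relevance sets purely for illustration; it is not code from any of the cited works.

```python
def mean_reciprocal_rank(rankings, relevant):
    """Compute MRR given, per query, a ranked list of document IDs and
    the set of relevant document IDs. A query whose relevant documents
    are never retrieved contributes 0 to the mean."""
    total = 0.0
    for ranked_docs, rel_docs in zip(rankings, relevant):
        for rank, doc_id in enumerate(ranked_docs, start=1):
            if doc_id in rel_docs:
                total += 1.0 / rank
                break
    return total / len(rankings)

# Toy example: three queries with hypothetical ranked results
rankings = [["d3", "d1", "d7"], ["d2", "d5", "d9"], ["d8", "d4", "d6"]]
relevant = [{"d1"}, {"d2"}, {"d0"}]  # third query retrieves nothing relevant
print(mean_reciprocal_rank(rankings, relevant))  # (1/2 + 1/1 + 0) / 3 = 0.5
```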
Over the past few decades, numerous datasets and tasks for multilingual retrieval evaluation have been developed through efforts such as CLEF, TREC, and FIRE, each contributing standardized document collections and evaluation procedures. These evaluation datasets facilitate genuine multilingual IR research, such as Rahimi et al. (2015) and Lawrie et al. (2023). However, these datasets generally contain only a small number of queries; in the case of CLEF 2001–2003, for example, each edition comprises only a few dozen queries. This limitation confines research predominantly to evaluation and does not provide a resource for training a multilingual ranking model. Our dataset is of a scale that accommodates both large-scale training and evaluation of multilingual retrieval methods.
Compared with the related work above, our dataset offers a richer multilingual mixture of queries and documents than Mr. TYDI, preserves the authenticity of multilingual content rather than relying on translation as mMARCO does, and surpasses the query counts of evaluation campaigns such as CLEF.
[2] https://genderize.io/
[3] https://www.ibm.com/smarterplanet/us/en/ibmwatson/developercloud/alchemy-vision.html
[4] https://cloud.google.com/translate