
Transforming LLMs into Cross-modal and Cross-lingual Retrieval Systems: Data and Tasks


Authors:

(1) Frank Palma Gomez, Boston University; work done during internships at Google Research and Google DeepMind;

(2) Ramon Sanabria, The University of Edinburgh; work done during internships at Google Research and Google DeepMind;

(3) Yun-hsuan Sung, Google Research;

(4) Daniel Cer, Google Research;

(5) Siddharth Dalmia, Google DeepMind (equal advising contribution);

(6) Gustavo Hernandez Abrego, Google Research (equal advising contribution).

Table of Links

Abstract and 1 Introduction

2 Method

3 Data and Tasks

4 Model

5 Experiments

6 Related Work

7 Conclusion

8 Acknowledgements and References

A Appendix

3 Data and Tasks

Appendix A.3 details our training and evaluation datasets along with the number of languages in each dataset, the split we used, and the size of each dataset. We focus on the following retrieval tasks:


Speech-to-Text Retrieval (S2T) involves retrieving the corresponding transcription from a database given a speech sample. For S2T, we train on CoVoST-2 (Wang et al., 2021) speech utterances and their transcriptions. CoVoST-2 is a large multilingual speech corpus derived from Common Voice that spans 21 languages and provides translations to and from English. We use FLEURS (Conneau et al., 2023) to evaluate S2T performance on 102 languages. FLEURS is an n-way parallel dataset containing speech utterances for the human translations of FLoRES-101 (Goyal et al., 2021). To evaluate S2T, we report the recall at 1 (R@1) rate for retrieving the correct transcription for every speech sample, along with the word error rate (WER).
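As a minimal sketch of how the R@1 metric can be computed from a dual-encoder's outputs: given row-aligned speech and text embedding matrices (row i of each matrix is a matching pair), R@1 is the fraction of speech samples whose nearest text embedding under cosine similarity is the correct transcription. The function name and the toy embeddings below are illustrative, not from the paper.

```python
import numpy as np

def recall_at_1(speech_emb, text_emb):
    """Fraction of speech rows whose top-ranked text row
    (by cosine similarity) is the matching transcription."""
    # L2-normalize so dot products equal cosine similarities.
    s = speech_emb / np.linalg.norm(speech_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    sims = s @ t.T                        # (num_speech, num_text)
    predictions = sims.argmax(axis=1)     # index of the top-ranked text
    targets = np.arange(len(speech_emb))  # row i matches column i
    return float((predictions == targets).mean())

# Toy check: speech embeddings that are near-duplicates of their texts.
rng = np.random.default_rng(0)
text = rng.normal(size=(3, 8))
speech = text + 0.01 * rng.normal(size=(3, 8))
print(recall_at_1(speech, text))  # → 1.0
```

The same similarity matrix supports recall at higher cutoffs (R@5, R@10) by checking whether the target index appears among the top-k columns instead of only the argmax.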


Speech-to-Text Translation Retrieval (S2TT) attempts to retrieve the corresponding text translation of a speech sample. We use S2TT to measure the cross-lingual capabilities of our multi-modal DE retrieval system. We evaluate this capability zero-shot on the X → En S2TT data of FLEURS and explore whether we can further improve it by training on readily available machine translation data from WikiMatrix (Schwenk et al., 2019). We pick the French, German, Dutch, and Polish to English pairs, which are common to WikiMatrix and FLEURS, and discuss the amount of machine translation data used in Appendix A.3. For S2TT, we report 4-gram corpus BLEU (Post, 2018).


Table 1: PaLM 2 DE results for R@1 and WER compared against the mSLAM DE on 102 languages from FLEURS for speech-to-text retrieval (S2T).


This paper is available on arxiv under CC BY 4.0 DEED license.

