Authors:
(1) Frank Palma Gomez, Boston University; work done during an internship at Google Research and Google DeepMind;
(2) Ramon Sanabria, The University of Edinburgh; work done during an internship at Google Research and Google DeepMind;
(3) Yun-hsuan Sung, Google Research;
(4) Daniel Cer, Google Research;
(5) Siddharth Dalmia, Google DeepMind and Equal Advising Contributions;
(6) Gustavo Hernandez Abrego, Google Research and Equal Advising Contributions.
A Appendix
A.1 Training Setup
Ni et al. (2022) showed that applying a contrastive loss to sentence encoders improves retrieval performance on downstream tasks. After initializing our model from PaLM 2, we train it with a contrastive loss (Hadsell et al., 2006).
Using Equation 1, our multi-modal DE learns from paired speech and text embeddings (x_i, y_i), where y_i is the positive example for x_i and all other examples y_j with j ≠ i are negatives. The model learns to bring the positive transcription closer to its corresponding speech sample while pushing away all other, negative transcriptions. The positive/negative distinction is made within the training batch, so we apply an in-batch softmax as part of the loss computation. Lastly, sim() is a similarity function defined as the dot product between the speech and transcription embeddings.
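As a minimal sketch (not the authors' implementation), the in-batch softmax with dot-product similarity described above could look like the following NumPy snippet; the function and array names, and the batch layout, are assumptions made for illustration.

```python
import numpy as np

def in_batch_softmax_loss(speech_emb: np.ndarray, text_emb: np.ndarray) -> float:
    """Contrastive loss over a batch of paired (speech, text) embeddings.

    Both inputs have shape (batch, dim). Row i of each array forms the
    positive pair (x_i, y_i); every other row y_j (j != i) is a negative.
    """
    # sim(): dot product between speech and transcription embeddings.
    logits = speech_emb @ text_emb.T                     # (batch, batch)
    # In-batch softmax over each row; the diagonal entries are the positives.
    logits -= logits.max(axis=1, keepdims=True)          # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))
```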
To train our model, we use the sum of a contrastive loss and a spreadout loss (Zhang et al., 2017) over both the speech and text embeddings. We compute the contrastive loss (Yang et al., 2019) bidirectionally, adding the losses in the speech-to-text and text-to-speech directions.
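Continuing the sketch above, the bidirectional contrastive term and a spreadout regularizer on each modality might be combined as follows. The spreadout formulation shown (penalizing squared off-diagonal similarities within the batch) is one common variant and is an assumption; the paper's exact expression is not reproduced here.

```python
def spreadout_loss(emb: np.ndarray) -> float:
    """Spreadout-style regularizer: penalize similarity between distinct
    embeddings in the batch (assumed formulation, not the paper's exact one)."""
    sims = emb @ emb.T                                   # (batch, batch)
    off_diag = sims - np.diag(np.diag(sims))             # drop self-similarities
    n = emb.shape[0]
    return float((off_diag ** 2).sum() / (n * (n - 1)))

def training_loss(speech_emb: np.ndarray, text_emb: np.ndarray) -> float:
    # Bidirectional contrastive loss: speech-to-text plus text-to-speech.
    contrastive = (in_batch_softmax_loss(speech_emb, text_emb)
                   + in_batch_softmax_loss(text_emb, speech_emb))
    # Add the spreadout terms for both the speech and text embeddings.
    return contrastive + spreadout_loss(speech_emb) + spreadout_loss(text_emb)
```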
A.2 Expressing Tasks
For training and inference, we found that using a prefix improves speech-to-text retrieval performance. Therefore, we prepend a prefix containing the language and modality, as shown in Table 3. For a speech utterance, the prefix is tokenized with the LLM's tokenizer and the remainder of the input is converted to audio tokens.
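As a hedged illustration of this input construction (the exact prefix strings from Table 3 are not reproduced here, and the tokenizer's `.encode` interface is assumed), a speech or text input might be assembled like this:

```python
from typing import List, Optional

def build_input_ids(language: str, modality: str, llm_tokenizer,
                    text: Optional[str] = None,
                    audio_tokens: Optional[List[int]] = None) -> List[int]:
    """Prepend a language/modality prefix to the content tokens.

    The prefix is tokenized with the LLM's tokenizer; for speech inputs the
    remainder of the sequence is the audio-token ids, and for text inputs it
    is the tokenized transcription. The prefix format below is a placeholder,
    not the one given in Table 3.
    """
    prefix_ids = llm_tokenizer.encode(f"[{language}] [{modality}]: ")
    if modality == "speech":
        return prefix_ids + list(audio_tokens or [])
    return prefix_ids + llm_tokenizer.encode(text or "")
```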
A.3 Data
Table 4 shows the training and evaluation datasets we used throughout our experiments. We used the 21 languages of CoVoST-2 to train our model on speech-to-text retrieval, which amounts to approximately 900 hours of speech. To assess our model's speech-to-text retrieval capabilities, we evaluate on the FLEURS speech-to-text test split across 102 languages. We use the FLEURS speech-to-text translation test split to evaluate our model's abilities on tasks that require cross-lingual and cross-modal knowledge, evaluating on 4 languages: German, Polish, French, and Dutch.
We find that combining speech-to-text retrieval data with readily available translation data improves our model's cross-lingual and cross-modal abilities. Table 5 shows the number of parallel sentences we used during training from X → En.
This paper is available on arXiv under a CC BY 4.0 DEED license.