
Transforming LLMs into Cross-modal and Cross-lingual Retrieval Systems: Method


Authors:

(1) Frank Palma Gomez, Boston University; work done by him and Ramon during their internships at Google Research and Google DeepMind, respectively;

(2) Ramon Sanabria, The University of Edinburgh; work done by him and Frank during their internships at Google Research and Google DeepMind, respectively;

(3) Yun-hsuan Sung, Google Research;

(4) Daniel Cer, Google Research;

(5) Siddharth Dalmia, Google DeepMind (equal advising contribution);

(6) Gustavo Hernandez Abrego, Google Research (equal advising contribution).

Table of Links

Abstract and 1 Introduction

2 Method

3 Data and Tasks

4 Model

5 Experiments

6 Related Work

7 Conclusion

8 Acknowledgements and References

A Appendix

2 Method

We train a transformer-based dual encoder (DE) model that encodes both speech and text, given a dataset D = {(x_i, y_i)}, where x_i is a speech utterance and y_i is its transcription. We denote the speech and text embeddings as E(x_i) and E(y_i), respectively, where E is the shared transformer-based DE.
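The sketch below illustrates this setup in Python/PyTorch: a single shared encoder E maps token sequences for speech and text into pooled, unit-norm embeddings, and a dot-product between the two gives retrieval scores. The encoder architecture, mean pooling, and all sizes here are illustrative assumptions, not the paper's actual model.

import torch
import torch.nn.functional as F

# Minimal sketch of a shared dual encoder E. The layer sizes and the
# mean-pooling readout are assumptions for illustration only; the paper
# just states that one transformer-based DE encodes both speech and text.
class SharedDualEncoder(torch.nn.Module):
    def __init__(self, vocab_size: int, dim: int = 256):
        super().__init__()
        self.embed = torch.nn.Embedding(vocab_size, dim)
        layer = torch.nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = torch.nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        h = self.encoder(self.embed(token_ids))      # (batch, seq, dim)
        return F.normalize(h.mean(dim=1), dim=-1)    # pooled, unit-norm embedding

E = SharedDualEncoder(vocab_size=32_000 + 1024)      # text vocab + audio tokens
speech_emb = E(torch.randint(0, 33_024, (8, 50)))    # audio-token sequences x_i
text_emb = E(torch.randint(0, 32_000, (8, 20)))      # transcription token ids y_i
similarity = speech_emb @ text_emb.T                 # retrieval scores between pairs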

2.1 Generating Audio Tokens

We convert raw speech into discrete tokens using the process in Lakhotia et al. (2021); Borsos et al. (2023). The process converts a speech query x_i into an embedding using a pre-trained speech encoder. The output embedding is then discretized into a set of tokens using k-means clustering. We refer to the resulting tokens as audio tokens. We use the 2B variant of the Universal Speech Model (USM) encoder (Zhang et al., 2023) as the speech encoder and take the middle layer as the embedding for x_i. Additionally, we generate audio tokens at 25Hz using k-means clustering, resulting in a set of 1024 possible audio tokens. We refer to this set as our audio token vocabulary.
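A minimal sketch of this discretization step is shown below, using scikit-learn's k-means in place of the paper's pipeline. The frame embeddings are random placeholders standing in for middle-layer USM features, and the embedding width and number of frames are assumptions; only the 1024-cluster codebook size and the 25Hz frame rate come from the text.

import numpy as np
from sklearn.cluster import KMeans

# Placeholder frame embeddings standing in for the middle layer of a
# pre-trained speech encoder (e.g. USM): shape (num_frames, dim), 25Hz frames.
frame_embeddings = np.random.randn(5_000, 256).astype(np.float32)

# Fit a 1024-centroid codebook, matching the audio token vocabulary size.
codebook = KMeans(n_clusters=1024, random_state=0).fit(frame_embeddings)

# Discretize a new utterance: each 25Hz frame becomes one audio token in [0, 1024).
new_utterance = np.random.randn(250, 256).astype(np.float32)  # ~10 seconds of speech
audio_tokens = codebook.predict(new_utterance)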

2.2 Supporting Text and Audio Tokens

To support text and audio tokens in our LLM, we follow the formulation of Rubenstein et al. (2023). We extend the embedding layer of a transformer decoder by a additional tokens, where a is the size of our audio token vocabulary. This modification results in an embedding layer of size (t + a) × m, where t is the number of tokens in the text vocabulary and m is the dimension of the embedding vectors. In our implementation, the first t tokens represent text and the remaining a tokens are reserved for audio. We initialize the embedding layer from scratch when training our model.
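The following sketch shows one way to realize this extended embedding table in PyTorch. The concrete values of t, a, and m are placeholders, not the paper's model sizes; the audio_to_vocab_id helper is a hypothetical name introduced here to show how an audio token id is offset past the text vocabulary.

import torch

# Assumed sizes: t text tokens, a = 1024 audio tokens, embedding width m.
t, a, m = 32_000, 1024, 1024

# New (t + a) x m embedding table, initialized from scratch as described above.
embedding = torch.nn.Embedding(num_embeddings=t + a, embedding_dim=m)

# Ids [0, t) address text tokens; ids [t, t + a) address audio tokens,
# so audio token k from the k-means codebook maps to row t + k.
def audio_to_vocab_id(k: int) -> int:
    return t + k

text_ids = torch.tensor([[101, 2054, 2003]])                       # example text ids
audio_ids = torch.tensor([[audio_to_vocab_id(k) for k in (5, 512, 1023)]])
text_vecs, audio_vecs = embedding(text_ids), embedding(audio_ids)  # (1, 3, m) each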


This paper is available on arxiv under CC BY 4.0 DEED license.
