Authors:
(1) Frank Palma Gomez, Boston University; work done by him and Ramon during their internships at Google Research and Google DeepMind, respectively;
(2) Ramon Sanabria, The University of Edinburgh; work done by him and Frank during their internships at Google Research and Google DeepMind, respectively;
(3) Yun-hsuan Sung, Google Research;
(4) Daniel Cer, Google Research;
(5) Siddharth Dalmia, Google DeepMind, equal advising contribution;
(6) Gustavo Hernandez Abrego, Google Research, equal advising contribution.
Table of Links
8 Acknowledgements and References
2 Method
We train a transformer-based dual encoder (DE) model that encodes speech and text given a dataset D = {(x_i, y_i)}, where x_i is a speech utterance and y_i is its transcription. We denote the speech and text embeddings as E(x_i) and E(y_i), respectively, where E is a transformer-based DE that encodes both speech and text.
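To make the setup concrete, the following is a minimal sketch (not the authors' implementation) of a dual encoder that embeds speech-token and text-token sequences with a single shared transformer and compares them by dot product; the class name, mean pooling, layer sizes, and vocabulary sizes are illustrative assumptions.

```python
# Minimal dual-encoder sketch (illustrative assumptions throughout; not the
# authors' code). A single shared transformer E maps a tokenized speech
# utterance x_i and its transcription y_i to fixed-size, L2-normalized vectors.
import torch
import torch.nn.functional as F

class SpeechTextEncoder(torch.nn.Module):
    def __init__(self, vocab_size: int, dim: int = 512):
        super().__init__()
        self.embed = torch.nn.Embedding(vocab_size, dim)
        layer = torch.nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = torch.nn.TransformerEncoder(layer, num_layers=4)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        h = self.encoder(self.embed(token_ids))     # (batch, seq, dim)
        return F.normalize(h.mean(dim=1), dim=-1)   # mean-pool to one vector per input

# Illustrative vocabulary: 32k text tokens plus 1,024 audio tokens (Section 2.1).
E = SpeechTextEncoder(vocab_size=32_000 + 1024)
speech_ids = torch.randint(32_000, 33_024, (2, 50))   # audio-token ids for x_i
text_ids = torch.randint(0, 32_000, (2, 12))          # text-token ids for y_i
x_emb, y_emb = E(speech_ids), E(text_ids)             # embeddings E(x_i), E(y_i)
similarity = x_emb @ y_emb.T                          # in-batch speech-text similarity
```

The similarity matrix is included only to show how paired speech and text embeddings can be compared in a batch; it is not a reproduction of the paper's training objective.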
2.1 Generating Audio Tokens
We convert raw speech into discrete tokens using the process in Lakhotia et al. (2021); Borsos et al. (2023). The process converts a speech query x_i into an embedding using a pre-trained speech encoder. The output embedding is then discretized into a set of tokens using k-means clustering. We refer to the resulting tokens as audio tokens. We use the 2B variant of the Universal Speech Model (USM) encoder (Zhang et al., 2023) as the speech encoder and take the middle layer as the embedding for x_i. Additionally, we generate audio tokens at 25 Hz using k-means clustering, resulting in a set of 1024 possible audio tokens. We will refer to this as our audio token vocabulary.
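As a rough illustration of this pipeline (a sketch, not the authors' tooling): `usm_middle_layer` below is a placeholder for the USM 2B encoder's middle-layer features, scikit-learn's MiniBatchKMeans stands in for the clustering step, and the feature dimension and corpus size are arbitrary; only the 25 Hz frame rate and the 1024-entry codebook come from the text.

```python
# Sketch of audio-token generation (placeholder components; see lead-in above).
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def usm_middle_layer(waveform: np.ndarray, sample_rate: int = 16_000) -> np.ndarray:
    """Placeholder for the USM 2B middle layer: (num_frames, dim) features at 25 Hz."""
    num_frames = int(len(waveform) / sample_rate * 25)
    return np.random.randn(num_frames, 128)  # 128 is an arbitrary stand-in dimension

# 1) Fit a 1024-entry codebook on frame embeddings pooled from a training corpus.
corpus_frames = np.concatenate(
    [usm_middle_layer(np.random.randn(16_000)) for _ in range(256)]
)
codebook = MiniBatchKMeans(n_clusters=1024, batch_size=2048).fit(corpus_frames)

# 2) Tokenize an utterance: each 25 Hz frame maps to an audio token id in [0, 1024).
audio_tokens = codebook.predict(usm_middle_layer(np.random.randn(48_000)))
```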
2.2 Supporting Text and Audio Tokens
To support text and audio tokens in our LLM, we follow the formulation of Rubenstein et al. (2023). We extend the embedding layer of a transformer decoder by a tokens, where a is the size of our audio token vocabulary. This modification leads to an embedding layer of size (t + a) × m, where t is the number of tokens in the text vocabulary and m is the dimension of the embedding vectors. In our implementation, the first t tokens represent text and the remaining a tokens are reserved for audio. We initialize the embedding layer from scratch when training our model.
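A minimal sketch of this vocabulary extension is shown below, assuming illustrative values for t and m (only a = 1024 comes from Section 2.1); the layout mirrors the description above, with the first t rows addressing text tokens, the last a rows addressing audio tokens, and the whole table initialized from scratch.

```python
# Sketch of the extended embedding layer; t and m are illustrative values.
import torch

t = 32_000  # text vocabulary size (illustrative)
a = 1024    # audio token vocabulary size (Section 2.1)
m = 1024    # embedding dimension (illustrative)

# (t + a) x m embedding table, freshly initialized rather than loaded from a checkpoint.
embedding = torch.nn.Embedding(num_embeddings=t + a, embedding_dim=m)

def audio_token_to_id(k: int) -> int:
    """Offset an audio token k in [0, a) into the last a rows of the table."""
    return t + k

text_vec = embedding(torch.tensor([5]))                       # a text token
audio_vec = embedding(torch.tensor([audio_token_to_id(17)]))  # audio token 17
```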
This paper is available on arxiv under CC BY 4.0 DEED license.