Authors:
(1) Tamay Aykut, Sureel, Palo Alto, California, USA;
(2) Markus Hofbauer, Technical University of Munich, Germany;
(3) Christopher Kuhn, Technical University of Munich, Germany;
(4) Eckehard Steinbach, Technical University of Munich, Germany;
(5) Bernd Girod, Stanford University, Stanford, California, USA.
In this paper, we presented a virtual audience framework for online conferences. Performers such as actors, comedians, or musicians rely heavily on the feedback of their audience. This work addressed the problem of accumulating noise from multiple simultaneous audio inputs, which is currently avoided by requiring the audience to remain muted. The proposed virtual audience framework enables all participants to experience audience feedback without transmitting individual audio streams and without the associated synchronization issues. We collect abstract audience state information, such as the number of clapping and laughing participants, on a central server and synthesize a unified audience sound locally on every client. Every user contributes to the overall audience state and thus directly influences the synthesized audience sound.
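As an illustration of this client-server exchange, the following Python sketch shows how binary per-client reactions could be aggregated into an abstract audience state that each client then uses for local synthesis. The class and function names (ClientState, AudienceState, aggregate, synthesize_locally) are hypothetical and not taken from the paper; this is a minimal sketch of the idea, not the authors' implementation.

```python
# Minimal sketch (hypothetical names): each client reports binary reaction
# flags, the server aggregates them into counts, and every client synthesizes
# the audience sound locally from those counts instead of receiving audio.
from dataclasses import dataclass
from typing import List


@dataclass
class ClientState:
    """Binary reaction flags reported by one participant."""
    clapping: bool = False
    laughing: bool = False


@dataclass
class AudienceState:
    """Aggregated state broadcast by the server instead of audio streams."""
    total: int = 0
    clapping: int = 0
    laughing: int = 0


def aggregate(states: List[ClientState]) -> AudienceState:
    """Collapse all client reports into the abstract audience state."""
    return AudienceState(
        total=len(states),
        clapping=sum(s.clapping for s in states),
        laughing=sum(s.laughing for s in states),
    )


def synthesize_locally(state: AudienceState) -> None:
    """Placeholder for client-side synthesis: the clapping and laughing
    counts parametrize a locally generated audience sound."""
    print(f"Audience of {state.total}: "
          f"{state.clapping} clapping, {state.laughing} laughing")


if __name__ == "__main__":
    reports = [ClientState(clapping=True), ClientState(laughing=True), ClientState()]
    synthesize_locally(aggregate(reports))
```

Because only these few integers travel between server and clients, the bandwidth and synchronization burden of per-participant audio streams disappears, while every participant still shapes the joint audience sound.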
In future work, reactions such as laughter can be detected locally using methods such as deep neural networks [34], and the resulting abstract audience state data can then be shared with the server. Furthermore, the abstract state information is not restricted to binary values. The field of audio synthesis offers promising ideas such as acoustic unit discovery [35], [36]. The acoustic units present in the acoustic feedback of an audience member could serve as more informative state data, and the joint audience sound could be composed of the same abstract units to more closely resemble the actual audience sound. Such improved synthesis implementations can easily be added to the proposed modular framework to continually improve the virtual audience sound synthesis.
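To make the non-binary extension concrete, the sketch below (again with hypothetical names, not from the paper) replaces the binary flags with per-client histograms over acoustic-unit IDs, as they might be produced by a local laughter detector or an acoustic unit discovery front end; the server sums the histograms into a joint descriptor that the clients can use for synthesis.

```python
# Hypothetical sketch of a non-binary audience state: each client reports a
# histogram of detected acoustic units, and the server sums the histograms
# into one audience-level descriptor for local synthesis.
from collections import Counter
from typing import Dict, List


def aggregate_unit_histograms(client_histograms: List[Dict[int, int]]) -> Counter:
    """Sum per-client acoustic-unit counts into a joint audience histogram."""
    joint: Counter = Counter()
    for histogram in client_histograms:
        joint.update(histogram)
    return joint


# Example: three clients report counts over unit IDs from a (hypothetical)
# local detection / unit discovery model; one client is silent.
joint_state = aggregate_unit_histograms([
    {3: 5, 7: 2},   # client A
    {3: 1, 12: 4},  # client B
    {},             # client C
])
print(joint_state)  # Counter({3: 6, 12: 4, 7: 2})
```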
This work has been supported by the Max Planck Center for Visual Computing and Communication.
[1] John Smith, “11 comedians reflect on what they’ve learned from months of Zoom and outdoor shows during COVID-19,” Accessed on: 2021-05-17.
[2] Radosław Niewiadomski, Jennifer Hofmann, Jérôme Urbain, Tracey Platt, Johannes Wagner, and Bilal Piot, “Laugh-aware virtual agent and its impact on user amusement,” 2013.
[3] Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu, “WaveNet: A Generative Model for Raw Audio,” arXiv:1609.03499 [cs], Sept. 2016, arXiv: 1609.03499.
[4] Ryan Prenger, Rafael Valle, and Bryan Catanzaro, “WaveGlow: A Flow-based Generative Network for Speech Synthesis,” arXiv:1811.00002 [cs, eess, stat], Oct. 2018, arXiv: 1811.00002.
[5] Jesse Engel, Kumar Krishna Agrawal, Shuo Chen, Ishaan Gulrajani, Chris Donahue, and Adam Roberts, “GANSynth: Adversarial Neural Audio Synthesis,” 2019.
[6] Noé Tits, Kevin El Haddad, and Thierry Dutoit, “Laughter Synthesis: Combining Seq2seq modeling with Transfer Learning,” arXiv:2008.09483 [cs, eess], Aug. 2020, arXiv: 2008.09483.
[7] Jerome Urbain, Elisabetta Bevacqua, Thierry Dutoit, Alexis Moinet, Radoslaw Niewiadomski, Catherine Pelachaud, Benjamin Picart, Joelle Tilmanne, and Johannes Wagner, “The AVLaughterCycle Database,” p. 6.
[8] Jade Copet, Felix Kreuk, Itai Gat, Tal Remez, David Kant, Gabriel Synnaeve, Yossi Adi, and Alexandre Défossez, “Simple and controllable music generation,” arXiv preprint arXiv:2306.05284, 2023.
[9] Julius O Smith III, “Viewpoints on the history of digital synthesis,” in Proceedings of the International Computer Music Conference. International Computer Music Association, 1991, pp. 1–1.
[10] Curtis Roads, “Introduction to granular synthesis,” Computer Music Journal, vol. 12, no. 2, pp. 11–13, 1988.
[11] Ian Simon, Sumit Basu, David Salesin, and Maneesh Agrawala, “Audio analogies: Creating new music from an existing performance by concatenative synthesis,” in ICMC. Citeseer, 2005.
[12] Stefan Bilbao, Numerical sound synthesis: finite difference schemes and simulation in musical acoustics, John Wiley & Sons, 2009.
[13] RL Harrison-Harsley and S Bilbao, “Separability of wave solutions in nonlinear brass instrument modelling,” The Journal of the Acoustical Society of America, vol. 143, no. 6, pp. 3654–3657, 2018.
[14] Craig J Webb and Stefan Bilbao, “On the limits of real-time physical modelling synthesis with a modular environment,” in Proceedings of the International Conference on Digital Audio Effects, 2015, p. 65.
[15] Xavier Serra and Julius Smith, “Spectral modeling synthesis: A sound analysis/synthesis system based on a deterministic plus stochastic decomposition,” Computer Music Journal, vol. 14, no. 4, pp. 12–24, 1990.
[16] Xavier Amatriain, Jordi Bonada, Alex Loscos, and Xavier Serra, “Spectral processing,” in Zölzer U, editor, DAFX: Digital Audio Effects, Chichester: John Wiley & Sons, 2002.
[17] John M Chowning, “The synthesis of complex audio spectra by means of frequency modulation,” Journal of the Audio Engineering Society, vol. 21, no. 7, pp. 526–534, 1973.
[18] M. Huzaifah and L. Wyse, “Deep generative models for musical audio synthesis,” arXiv:2006.06426 [cs, eess, stat], June 2020, arXiv: 2006.06426.
[19] Siddique Latif, Rajib Rana, Sara Khalifa, Raja Jurdak, Junaid Qadir, and Björn W. Schuller, “Deep Representation Learning in Speech Processing: Challenges, Recent Advances, and Future Trends,” arXiv:2001.00378 [cs, eess], Jan. 2020, arXiv: 2001.00378.
[20] Fanny Roche, Thomas Hueber, Samuel Limier, and Laurent Girin, “Autoencoders for music sound modeling: a comparison of linear, shallow, deep, recurrent and variational models,” arXiv:1806.04096 [cs, eess], May 2019, arXiv: 1806.04096.
[21] Prafulla Dhariwal, Heewoo Jun, Christine Payne, Jong Wook Kim, Alec Radford, and Ilya Sutskever, “Jukebox: A Generative Model for Music,” arXiv:2005.00341 [cs, eess, stat], Apr. 2020, arXiv: 2005.00341.
[22] Mikołaj Bińkowski, Jeff Donahue, Sander Dieleman, Aidan Clark, Erich Elsen, Norman Casagrande, Luis C. Cobo, and Karen Simonyan, “High Fidelity Speech Synthesis with Adversarial Networks,” arXiv:1909.11646 [cs, eess], Sept. 2019, arXiv: 1909.11646.
[23] Yi Ren, Chenxu Hu, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu, “FastSpeech 2: Fast and High-Quality End-to-End Text to Speech,” arXiv:2006.04558 [cs, eess], June 2020, arXiv: 2006.04558.
[24] Geng Yang, Shan Yang, Kai Liu, Peng Fang, Wei Chen, and Lei Xie, “Multi-band MelGAN: Faster Waveform Generation for High-Quality Text-to-Speech,” arXiv:2005.05106 [cs, eess], May 2020, arXiv: 2005.05106.
[25] Jan Vainer and Ondřej Dušek, “SpeedySpeech: Efficient Neural Speech Synthesis,” arXiv:2008.03802 [cs, eess], Aug. 2020, arXiv: 2008.03802.
[26] Felix Kreuk, Gabriel Synnaeve, Adam Polyak, Uriel Singer, Alexandre Défossez, Jade Copet, Devi Parikh, Yaniv Taigman, and Yossi Adi, “AudioGen: Textually guided audio generation,” arXiv preprint arXiv:2209.15352, 2022.
[27] Leevi Peltola, Cumhur Erkut, Perry R Cook, and Vesa Välimäki, “Synthesis of hand clapping sounds,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 3, pp. 1021–1029, 2007.
[28] Mikio Mori, Mitsuhiro Ogihara, Tomoya Minamimoto, Shuji Taniguchi, Shozo Kato, and Chikahiro Araki, “A method to synthesize whistling sounds using frequency modulation for musical whistling certificate examination system,” IEEJ Transactions on Electronics, Information and Systems, vol. 130, no. 1, pp. 156–157, 2010.
[29] Marc Cardle, Stephen Brooks, and Peter Robinson, “Audio and user directed sound synthesis,” in ICMC. Citeseer, 2003.
[30] Hiroki Mori, Tomohiro Nagata, and Yoshiko Arimoto, “Conversational and social laughter synthesis with WaveNet,” in INTERSPEECH, 2019, pp. 520–523.
[31] Yoshiko Arimoto, Hiromi Kawatsu, Sumio Ohno, and Hitoshi Iida, “Naturalistic emotional speech collection paradigm with online game and its psychological and acoustical assessment,” p. 11, 2012.
[32] Kundan Kumar, Rithesh Kumar, Thibault de Boissiere, Lucas Gestin, Wei Zhen Teoh, Jose Sotelo, Alexandre de Brébisson, Yoshua Bengio, and Aaron Courville, “MelGAN: Generative adversarial networks for conditional waveform synthesis,” arXiv preprint arXiv:1910.06711, 2019.
[33] Kushal Lakhotia, Eugene Kharitonov, Wei-Ning Hsu, Yossi Adi, Adam Polyak, Benjamin Bolte, Tu-Anh Nguyen, Jade Copet, Alexei Baevski, Abdelrahman Mohamed, et al., “On generative spoken language modeling from raw audio,” Transactions of the Association for Computational Linguistics, vol. 9, pp. 1336–1354, 2021.
[34] Gerhard Hagerer, Nicholas Cummins, Florian Eyben, and Björn Schuller, ““Did you laugh enough today?” – Deep Neural Networks for Mobile and Wearable Laughter Trackers,” p. 2.
[35] Ryan Eloff, André Nortje, Benjamin van Niekerk, Avashna Govender, Leanne Nortje, Arnu Pretorius, Elan van Biljon, Ewald van der Westhuizen, Lisa van Staden, and Herman Kamper, “Unsupervised acoustic unit discovery for speech synthesis using discrete latent-variable neural networks,” arXiv:1904.07556 [cs, eess], June 2019, arXiv: 1904.07556.
[36] Jan Chorowski, Ron J. Weiss, Samy Bengio, and Aäron van den Oord, “Unsupervised speech representation learning using WaveNet autoencoders,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 12, pp. 2041–2053, Dec. 2019, arXiv: 1901.08810.
This paper is available on arxiv under CC 4.0 license.