Deep Learning for Modeling Audio-Visual Correspondences

Written by kraken, Data Scientist and Visual Computing Researcher | Published 2020/09/16
Tech Story Tags: deeplearning | self-supervision | audio-visualiser | machinelearning | artificial-intelligence | ai | mathematics | computervision

TLDR Humans can ‘hear faces’ and ‘see voices’ by cultivating a mental picture or an acoustic memory of a person. The natural synchronization between sound and vision can provide a rich self-supervisory signal for grounding auditory signals in visual ones. Inspired by our ability to infer sound sources from how objects move visually, we can build models that learn this correspondence on their own. We will use a simple architecture that relies on static visual information to learn the cross-modal context, although motion signals are also of crucial importance for learning audio-visual correspondences.
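The architecture itself is introduced later in the article; as a rough, hypothetical sketch of the idea, a two-stream correspondence network embeds a single video frame (the static visual information mentioned above) and an audio spectrogram, then classifies whether the pair comes from the same clip. Everything below is illustrative and assumed, not the author's exact model: the framework (PyTorch), the `AVCorrespondenceNet` name, the layer sizes, and the input shapes.

```python
import torch
import torch.nn as nn

class AVCorrespondenceNet(nn.Module):
    """Two-stream sketch: one branch embeds a single RGB frame, the
    other embeds a log-mel spectrogram; a fusion head predicts whether
    the frame and the audio come from the same clip."""

    def __init__(self, embed_dim=128):
        super().__init__()
        # Vision branch: small conv stack over a 3-channel frame.
        self.vision = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embed_dim),
        )
        # Audio branch: same shape of stack over a 1-channel spectrogram.
        self.audio = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embed_dim),
        )
        # Fusion head: one logit for corresponds / does-not-correspond.
        self.head = nn.Sequential(
            nn.Linear(2 * embed_dim, 64), nn.ReLU(), nn.Linear(64, 1)
        )

    def forward(self, frame, spectrogram):
        v = self.vision(frame)
        a = self.audio(spectrogram)
        return self.head(torch.cat([v, a], dim=1)).squeeze(1)

# Self-supervised labels come for free: a frame paired with its own
# audio is a positive; rolling the batch pairs each frame with audio
# from a different clip, giving guaranteed negatives.
model = AVCorrespondenceNet()
frames = torch.randn(8, 3, 64, 64)              # batch of single frames
specs = torch.randn(8, 1, 64, 64)               # matching spectrograms
neg_specs = torch.roll(specs, shifts=1, dims=0)  # mismatched pairs
logits = torch.cat([model(frames, specs), model(frames, neg_specs)])
labels = torch.cat([torch.ones(8), torch.zeros(8)])
loss = nn.functional.binary_cross_entropy_with_logits(logits, labels)
loss.backward()
```

Because the labels are produced by the pairing itself rather than by human annotation, this is exactly the kind of self-supervisory signal the TLDR refers to.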
