Too Long; Didn't Read
Humans can ‘hear faces’ and ‘see voices’ by conjuring a mental picture or an acoustic memory of a person. The natural synchronization between sound and vision provides a rich self-supervisory signal for grounding auditory signals in visual ones. Inspired by our ability to infer sound sources from how objects move visually, we can build learning models that acquire this interpretation on their own. A simple architecture can rely on static visual information to learn the cross-modal context; motion signals, however, are crucially important for learning audio-visual correspondences.
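As a rough illustration of such a synchronization objective (a minimal sketch, not the authors' actual architecture), the PyTorch snippet below embeds a static frame and an audio spectrogram with two separate branches and trains a binary head to predict whether the pair is in sync. All names, layer sizes, and input shapes (AudioVisualNet, embed_dim, the 112x112 frame and 64x100 spectrogram) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AudioVisualNet(nn.Module):
    """Two-stream network (hypothetical): one branch embeds a video frame,
    the other embeds an audio spectrogram; a classifier on the fused
    embeddings predicts whether audio and vision come from the same moment."""

    def __init__(self, embed_dim=128):
        super().__init__()
        # Visual branch: a single static frame, shape (B, 3, 112, 112).
        # Stacking T frames (3*T input channels) would expose motion cues.
        self.visual = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embed_dim),
        )
        # Audio branch: a log-mel spectrogram, shape (B, 1, 64, 100).
        self.audio = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embed_dim),
        )
        # Binary head: aligned (1) vs. misaligned (0) audio-visual pair.
        self.classifier = nn.Linear(2 * embed_dim, 1)

    def forward(self, frames, spectrograms):
        v = self.visual(frames)
        a = self.audio(spectrograms)
        return self.classifier(torch.cat([v, a], dim=1)).squeeze(1)

# Self-supervised labels come for free: pairs sampled from the same clip
# are positives; pairing a frame with audio from another clip (or a
# shifted moment) gives negatives. Random tensors stand in for real data.
model = AudioVisualNet()
frames = torch.randn(8, 3, 112, 112)        # batch of static frames
spectrograms = torch.randn(8, 1, 64, 100)   # corresponding audio clips
labels = torch.randint(0, 2, (8,)).float()  # 1 = in sync, 0 = shuffled
loss = nn.BCEWithLogitsLoss()(model(frames, spectrograms), labels)
loss.backward()
```

Replacing the single-frame input with a short stack of frames is one plausible way to let the visual branch exploit the motion cues the section argues are crucial, rather than static appearance alone.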