Published on Mon Dec 21 2020

Semantic Audio-Visual Navigation

Changan Chen, Ziad Al-Halah, Kristen Grauman

Abstract

Recent work on audio-visual navigation assumes a constantly-sounding target and restricts the role of audio to signaling the target's position. We introduce semantic audio-visual navigation, where objects in the environment make sounds consistent with their semantic meaning (e.g., toilet flushing, door creaking) and acoustic events are sporadic or short in duration. We propose a transformer-based model to tackle this new semantic AudioGoal task, incorporating an inferred goal descriptor that captures both spatial and semantic properties of the target. Our model's persistent multimodal memory enables it to reach the goal even long after the acoustic event stops. In support of the new task, we also expand the SoundSpaces audio simulations to provide semantically grounded sounds for an array of objects in Matterport3D. Our method strongly outperforms existing audio-visual navigation methods by learning to associate semantic, acoustic, and visual cues.
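
Architecturally, the abstract suggests a policy that (i) infers a goal descriptor from the audio signal, capturing the sounding object's category (semantic) and relative displacement (spatial), and (ii) attends over a persistent memory of fused audio-visual observations so it can keep navigating after the sound stops. The following is a minimal PyTorch sketch of that idea; the feature dimensions, additive fusion, class count, and memory handling are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code) of a transformer-based
# audio-visual navigation policy with an inferred goal descriptor
# and a persistent multimodal memory. Shapes and fusion are assumptions.
import torch
import torch.nn as nn

class SemanticAVNavPolicy(nn.Module):
    def __init__(self, d_model=512, n_heads=8, n_layers=4, n_actions=4,
                 n_classes=21, memory_size=150):
        super().__init__()
        # Per-modality projections (assumes pre-pooled CNN features).
        self.visual_proj = nn.Linear(2048, d_model)   # e.g. ResNet feature
        self.audio_proj = nn.Linear(1024, d_model)    # e.g. spectrogram CNN feature
        # Goal descriptor inferred from audio: object category (semantic)
        # plus relative 2D displacement of the source (spatial).
        self.goal_class = nn.Linear(d_model, n_classes)
        self.goal_loc = nn.Linear(d_model, 2)
        self.goal_proj = nn.Linear(n_classes + 2, d_model)
        # Transformer over a persistent memory of fused observations,
        # letting the agent act long after the acoustic event ends.
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.action_head = nn.Linear(d_model, n_actions)
        self.memory_size = memory_size
        self.memory = []  # fused embeddings from previous steps

    def reset(self):
        # Clear the episodic memory at the start of a new episode.
        self.memory = []

    def forward(self, visual_feat, audio_feat):
        v = self.visual_proj(visual_feat)             # (B, d_model)
        a = self.audio_proj(audio_feat)               # (B, d_model)
        # Estimate the goal descriptor from the current audio frame.
        cls_logits = self.goal_class(a)
        loc = self.goal_loc(a)
        g = self.goal_proj(torch.cat([cls_logits.softmax(-1), loc], dim=-1))
        fused = v + a + g                             # additive fusion (assumption)
        # Append to the persistent memory and attend over the history.
        self.memory.append(fused)
        self.memory = self.memory[-self.memory_size:]
        seq = torch.stack(self.memory, dim=1)         # (B, T, d_model)
        h = self.encoder(seq)[:, -1]                  # latest step's encoding
        return self.action_head(h), cls_logits, loc

# Example usage with random features for one agent step (hypothetical shapes):
policy = SemanticAVNavPolicy()
action_logits, goal_cls, goal_loc = policy(torch.randn(1, 2048), torch.randn(1, 1024))
```

In this sketch the descriptor is predicted from audio alone and carried forward through the memory, which is what allows the policy to act after the sound has stopped, mirroring the behavior the abstract describes.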

Fri Aug 21 2020
Artificial Intelligence
Learning to Set Waypoints for Audio-Visual Navigation
In audio-visual navigation, an agent travels through a complex, unmapped 3D environment using both sights and sounds to find a sound source. Existing models learn to act at a fixed granularity of agent motion and rely on simple recurrent aggregations of audio observations.
Wed Dec 25 2019
Machine Learning
Look, Listen, and Act: Towards Audio-Visual Embodied Navigation
A crucial ability of mobile intelligent agents is to integrate the evidence from multiple sensory inputs in an environment. Here we describe an approach to audio-visual embodied navigation that takes advantage of both the visual and audio pieces of evidence.
Tue Dec 24 2019
Computer Vision
SoundSpaces: Audio-Visual Navigation in 3D Environments
Thu Jun 11 2020
Computer Vision
Telling Left from Right: Learning Spatial Correspondence of Sight and Sound
Fri Mar 23 2018
Computer Vision
Audio-Visual Event Localization in Unconstrained Videos
An audio-visual event is both visible and audible in a video segment. Joint modeling of auditory and visual modalities outperforms independent modeling. Strong correlations between the two modalities enable cross-modality localization.
Sat May 15 2021
Computer Vision
Move2Hear: Active Audio-Visual Source Separation
We introduce the active audio-visual source separation problem and a reinforcement learning approach that trains movement policies. We demonstrate our model's ability to find minimal movement sequences with maximal payoff for audio source separation.