Published on Tue Sep 01 2020

Multimodal Aggregation Approach for Memory Vision-Voice Indoor Navigation with Meta-Learning

Liqi Yan, Dongfang Liu, Yaoxian Song, Changbin Yu

Memory Vision-Voice Indoor Navigation (MVV-IN) receives voice commands and analyzes multimodal information of visual observation.

Abstract

Vision and voice are two vital keys for agents' interaction and learning. In this paper, we present a novel indoor navigation model called Memory Vision-Voice Indoor Navigation (MVV-IN), which receives voice commands and analyzes multimodal information from visual observations in order to enhance the robot's understanding of its environment. We make use of single RGB images taken by a first-person-view monocular camera, and we apply a self-attention mechanism to keep the agent focused on key areas. Memory is important for the agent both to avoid repeating certain tasks unnecessarily and to adapt adequately to new scenes; we therefore make use of meta-learning. We have experimented with various functional features extracted from the visual observations. Comparative experiments show that our method outperforms state-of-the-art baselines.
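The abstract describes the aggregation at a high level only; as an illustration, here is a minimal PyTorch sketch of that kind of multimodal fusion: self-attention over the spatial features of a single first-person RGB frame, combined with an embedding of the voice command. The module name, feature dimensions, and fusion scheme below are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the paper's code): self-attended visual features from one
# RGB frame fused with a voice-command embedding. Requires PyTorch >= 1.9.
import torch
import torch.nn as nn


class MultimodalAggregator(nn.Module):
    def __init__(self, visual_dim=512, voice_dim=128, hidden_dim=256, num_heads=4):
        super().__init__()
        # Self-attention over the spatial grid of visual features keeps the
        # agent focused on key regions of the observation.
        self.visual_proj = nn.Linear(visual_dim, hidden_dim)
        self.self_attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.voice_proj = nn.Linear(voice_dim, hidden_dim)
        self.fuse = nn.Linear(2 * hidden_dim, hidden_dim)

    def forward(self, visual_feats, voice_emb):
        # visual_feats: (B, N, visual_dim) spatial features from a CNN backbone
        # voice_emb:    (B, voice_dim) embedding of the voice command
        v = self.visual_proj(visual_feats)
        attended, _ = self.self_attn(v, v, v)   # attend over image regions
        pooled = attended.mean(dim=1)           # aggregate attended regions
        c = self.voice_proj(voice_emb)
        return torch.relu(self.fuse(torch.cat([pooled, c], dim=-1)))


if __name__ == "__main__":
    agg = MultimodalAggregator()
    visual = torch.randn(2, 49, 512)   # e.g. a 7x7 feature grid from one RGB frame
    voice = torch.randn(2, 128)
    state = agg(visual, voice)
    print(state.shape)                 # torch.Size([2, 256])
```

The fused state vector would then feed the navigation policy; how the paper's meta-learning and memory components consume it is not specified in the abstract.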

Wed Dec 25 2019
Machine Learning
Look, Listen, and Act: Towards Audio-Visual Embodied Navigation
A crucial ability of mobile intelligent agents is to integrate evidence from multiple sensory inputs in an environment. Here we describe an approach to audio-visual embodied navigation that takes advantage of both visual and audio evidence.
Wed Feb 26 2020
Artificial Intelligence
From Seeing to Moving: A Survey on Learning for Visual Indoor Navigation (VIN)
The Visual Indoor Navigation (VIN) task has drawn increasing attention from the data-driven machine learning communities. This survey first summarizes the representative work of learning-based approaches for the VIN task and then identifies and discusses lingering issues.
Tue Mar 30 2021
Artificial Intelligence
Diagnosing Vision-and-Language Navigation: What Really Matters
Mon Dec 10 2018
Machine Learning
Vision-based Navigation with Language-based Assistance via Imitation Learning with Indirect Intervention
Vision-based Navigation with Language-based Assistance (VNLA) is a vision-language task in which an agent with visual perception is guided via language to find objects in photorealistic indoor environments. Empirical results show that this approach significantly improves the success rate of the learning agent.
Fri Apr 09 2021
NLP
The Road to Know-Where: An Object-and-Room Informed Sequential BERT for Indoor Vision-Language Navigation
Vision-and-Language Navigation (VLN) requires an agent to find a path to a remote location on the basis of natural-language instructions. Most existing methods take the words in the VLN instructions and the discrete views of each panorama as the minimal unit of encoding. We propose an object-and-room informed sequential BERT for this task.
Wed Jul 15 2020
Computer Vision
Active Visual Information Gathering for Vision-Language Navigation
Vision-language navigation (VLN) is the task of requiring an agent to carry out navigational instructions inside photo-realistic environments. One of the key challenges in VLN is how to conduct robust navigation by mitigating the uncertainty caused by ambiguous instructions and insufficient observation.