Vision and voice are both vital to how agents interact and learn. In
this paper, we present a novel indoor navigation model called Memory
Vision-Voice Indoor Navigation (MVV-IN), which receives voice commands and
analyzes multimodal information from visual observations to improve the
robot's understanding of its environment. We use single RGB images taken by a
first-person monocular camera. We also apply a self-attention mechanism to keep
the agent focused on key areas. Because memory is important for the agent to
avoid repeating tasks unnecessarily and to adapt adequately to new scenes, we
make use of meta-learning. We have experimented
with various functional features extracted from visual observations. Comparative
experiments show that our methods outperform state-of-the-art baselines.