Visual question answering (VQA) is a task that combines techniques from computer
vision and natural language processing: a model must answer a text-based question
using the information contained in an image. In recent years, the scope of VQA
research has broadened, with growing attention to questions that probe models'
reasoning ability and to VQA over scientific diagrams. At the same time, a variety
of multimodal feature fusion mechanisms have been proposed. This paper reviews and
analyzes existing datasets, evaluation metrics, and models proposed for the VQA task.