Published on Mon May 13 2019

Quantifying and Alleviating the Language Prior Problem in Visual Question Answering

Yangyang Guo, Zhiyong Cheng, Liqiang Nie, Yibing Liu, Yinglong Wang, Mohan Kankanhalli

VQA aims to answer questions about an image or a video. Several studies have pointed out that current VQA models are heavily affected by the language prior problem. This paper proposes a score regularization module that adopts a pair-wise learning strategy.

Abstract

Benefiting from advances in computer vision, natural language processing, and information retrieval, visual question answering (VQA), which aims to answer questions about an image or a video, has received considerable attention over the past few years. Although some progress has been achieved so far, several studies have pointed out that current VQA models are heavily affected by the language prior problem: they tend to answer questions based on co-occurrence patterns between question keywords (e.g., how many) and answers (e.g., 2) rather than by understanding the images and questions. Existing methods attempt to solve this problem either by balancing the biased datasets or by forcing models to better understand the images. However, the first solution yields only marginal gains, and the second can even degrade performance. Another important issue is the lack of a metric to quantitatively measure the extent of the language prior effect, which severely hinders the advancement of related techniques. In this paper, we address these problems from two perspectives. First, we design a metric to quantitatively measure the language prior effect of VQA models; our empirical studies demonstrate its effectiveness. Second, we propose a regularization method (i.e., a score regularization module) that enhances current VQA models by alleviating the language prior problem while also boosting backbone model performance. The proposed score regularization module adopts a pair-wise learning strategy, which pushes the VQA model to answer a question by reasoning over the image rather than by relying on question-answer patterns observed in the biased training set. The score regularization module can be flexibly integrated into various VQA models.
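The abstract does not include an implementation, but the pair-wise idea can be sketched concisely. The snippet below is a minimal illustration, not the authors' released code: all names (`score_regularization_loss`, `scores_qi`, `scores_q`) are hypothetical, and it assumes a PyTorch backbone that exposes answer scores both from the full question+image branch and from a question-only branch that can exploit only language priors.

```python
import torch
import torch.nn.functional as F

def score_regularization_loss(scores_qi, scores_q, answer_idx, margin=0.5):
    """Hypothetical pair-wise ranking regularizer (a sketch, not the paper's code).

    scores_qi : (batch, num_answers) answer scores from the question+image branch.
    scores_q  : (batch, num_answers) answer scores from a question-only branch
                that can only exploit language priors.
    answer_idx: (batch,) index of the ground-truth answer.
    """
    # Score the ground-truth answer under each branch.
    s_qi = scores_qi.gather(1, answer_idx.unsqueeze(1)).squeeze(1)
    s_q = scores_q.gather(1, answer_idx.unsqueeze(1)).squeeze(1)
    # Hinge: the image-grounded branch must beat the prior-only branch
    # by at least `margin`, so correct answers cannot come from priors alone.
    return F.relu(margin - (s_qi - s_q)).mean()
```

A hinge of this form rewards the model only when its image-grounded confidence in the correct answer exceeds what the prior-only branch achieves, which is one way to operationalize "answering from the image rather than from question-answer patterns."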

Mon Oct 08 2018
Computer Vision
Overcoming Language Priors in Visual Question Answering with Adversarial Regularization
Modern Visual Question Answering (VQA) models rely heavily on superficial correlations between question and answer words. We introduce a question-only model that takes as input the question encoding from the VQA model and must leverage language biases in order to succeed. We then pose training as an adversarial game between the VQA model and this question-only adversary, discouraging the VQA model from capturing language biases in its question encoding.
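For context, this adversarial setup is commonly realized with a gradient-reversal layer. The sketch below is an illustrative reconstruction under assumed dimensions (a 1024-d question encoding, 3000 candidate answers), not the paper's released code:

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; reverses and scales gradients on the backward pass."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

# Question-only adversary: predicts the answer from the question encoding
# alone, i.e. purely from language priors. Dimensions are placeholders.
adversary = nn.Linear(1024, 3000)

def adversarial_prior_loss(q_encoding, answer_idx, lam=1.0):
    # The adversary minimizes this loss; through gradient reversal, the
    # question encoder is pushed to increase it, stripping prior signal
    # out of the question encoding.
    logits = adversary(GradReverse.apply(q_encoding, lam))
    return nn.functional.cross_entropy(logits, answer_idx)
```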
Sat May 29 2021
Computer Vision
LPF: A Language-Prior Feedback Objective Function for De-biased Visual Question Answering
VQA systems tend to rely overly on language bias and hence fail to reason from visual clues. We propose a novel Language-Prior Feedback (LPF) objective function to re-balance the proportion of each answer's loss value in the total VQA loss.
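One common way to realize such re-balancing is a focal-loss-style down-weighting driven by a question-only branch. The sketch below assumes that form with hypothetical names; it may differ from the paper's exact objective:

```python
import torch
import torch.nn.functional as F

def lpf_style_loss(logits, prior_logits, answer_idx, gamma=2.0):
    """Illustrative prior-feedback re-weighting (an assumed focal-loss-like form).

    Examples the question-only branch already answers confidently are
    down-weighted, so the total loss is dominated by samples that
    actually require visual reasoning.
    """
    ce = F.cross_entropy(logits, answer_idx, reduction="none")
    with torch.no_grad():
        # Probability the prior-only branch assigns to the correct answer.
        p_prior = F.softmax(prior_logits, dim=1).gather(
            1, answer_idx.unsqueeze(1)).squeeze(1)
    return ((1.0 - p_prior) ** gamma * ce).mean()
```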
Wed Oct 05 2016
Artificial Intelligence
Visual Question Answering: Datasets, Algorithms, and Future Challenges
Visual Question Answering (VQA) is a recent problem in computer vision and natural language processing. In VQA, an algorithm needs to answer text-based questions about images. Since the release of the first VQA dataset in 2014, additional datasets have been released.
Mon Dec 21 2020
Computer Vision
Learning content and context with language bias for Visual Question Answering
VQA is a challenging multimodal task of answering questions about an image. Many works concentrate on how to reduce language bias. However, reducing language bias also weakens the ability of VQA models to learn the context prior. We propose a novel learning strategy named CCB.
Thu Dec 17 2020
Computer Vision
Overcoming Language Priors with Self-supervised Learning for Visual Question Answering
Most Visual Question Answering (VQA) models suffer from the language prior problem, which is caused by inherent data biases. Our method can compensate for the data biases by generating balanced data without introducing external annotations.
Thu Sep 14 2017
Computer Vision
Robustness Analysis of Visual QA Models by Basic Questions
Visual Question Answering (VQA) models should have both high robustness and high accuracy. There are two main modules in our algorithm. Given a natural language question about an image, the first module takes the question as input and then outputs the ranked basic questions of the given question.
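As an illustration only, ranking basic questions could be done by embedding similarity to the given question; the paper's actual ranking criterion may differ, and all names here are hypothetical:

```python
import torch
import torch.nn.functional as F

def rank_basic_questions(query_emb, basic_embs, top_k=3):
    """Stand-in for the first module: rank a pool of basic-question
    embeddings by cosine similarity to the given question's embedding.

    query_emb : (dim,) embedding of the given question.
    basic_embs: (num_basic, dim) embeddings of candidate basic questions.
    Returns the indices of the top_k most similar basic questions.
    """
    sims = F.cosine_similarity(query_emb.unsqueeze(0), basic_embs, dim=1)
    return torch.topk(sims, k=top_k).indices
```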