Collaborative Modality Fusion for Mitigating Language Bias in Visual Question Answering

J Imaging. 2024 Feb 23;10(3):56. doi: 10.3390/jimaging10030056.

Abstract

Language bias is a significant concern in visual question answering (VQA): models tend to rely on spurious correlations between questions and answers when making predictions. This reliance prevents models from generalizing effectively and degrades performance. To address this bias, we propose a novel collaborative modality fusion de-biasing algorithm (CoD). In our approach, bias is treated as the model's neglect of information from a particular modality during prediction. We employ collaborative training to facilitate mutual modeling between the modalities, achieving efficient feature fusion and enabling the model to fully leverage multimodal knowledge for prediction. Experiments on several datasets, including VQA-CP v2, VQA v2, and VQA-VS, under different validation strategies demonstrate the effectiveness of our approach. Notably, with a basic baseline model, our method achieves an accuracy of 60.14% on VQA-CP v2.
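The abstract does not specify how CoD is implemented, so the following is only a minimal sketch of one way collaborative, multi-branch training over fused and unimodal predictions could be set up in PyTorch. The ToyVQAModel architecture, the collaborative_loss function, and the alpha weighting are hypothetical illustrations, not the authors' method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyVQAModel(nn.Module):
    """Hypothetical VQA-style model with a fused head and two unimodal heads.
    Dimensions and encoders are placeholders, not the CoD architecture."""
    def __init__(self, v_dim=2048, q_dim=300, hid=512, n_answers=1000):
        super().__init__()
        self.v_enc = nn.Linear(v_dim, hid)       # stands in for a visual encoder
        self.q_enc = nn.Linear(q_dim, hid)       # stands in for a question encoder
        self.fused_head = nn.Linear(hid, n_answers)
        self.v_head = nn.Linear(hid, n_answers)  # vision-only branch
        self.q_head = nn.Linear(hid, n_answers)  # question-only branch

    def forward(self, v_feat, q_feat):
        v = torch.relu(self.v_enc(v_feat))
        q = torch.relu(self.q_enc(q_feat))
        fused = v * q                            # simple multiplicative fusion
        return self.fused_head(fused), self.v_head(v), self.q_head(q)

def collaborative_loss(logits_fused, logits_v, logits_q, target, alpha=0.5):
    """Cross-entropy on every branch plus KL terms that pull each unimodal
    branch toward the fused distribution, so neither modality is neglected.
    The alpha weighting is an assumed hyperparameter."""
    ce = (F.cross_entropy(logits_fused, target)
          + F.cross_entropy(logits_v, target)
          + F.cross_entropy(logits_q, target))
    p_fused = F.softmax(logits_fused.detach(), dim=-1)
    kl_v = F.kl_div(F.log_softmax(logits_v, dim=-1), p_fused, reduction="batchmean")
    kl_q = F.kl_div(F.log_softmax(logits_q, dim=-1), p_fused, reduction="batchmean")
    return ce + alpha * (kl_v + kl_q)

# Toy usage with random tensors standing in for image and question features.
model = ToyVQAModel()
v = torch.randn(8, 2048)
q = torch.randn(8, 300)
y = torch.randint(0, 1000, (8,))
loss = collaborative_loss(*model(v, q), y)
loss.backward()
```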

Keywords: collaborative learning; language bias; visual question answering.

Grants and funding

This research was funded by the Key Scientific and Technological Project of Henan Province of China (No. 232102211013).