Learning to Reason on Tree Structures for Knowledge-Based Visual Question Answering

Qifeng Li; Xinyi Tang; Yi Jian

doi:10.3390/s22041575

Sensors (Basel). 2022 Feb 17;22(4):1575. doi: 10.3390/s22041575.

Authors

Qifeng Li¹ ² ³, Xinyi Tang¹ ³, Yi Jian¹ ³

Affiliations

¹ Shanghai Institute of Technical Physics of the Chinese Academy of Sciences, Shanghai 200083, China.
² School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing 100049, China.
³ Key Laboratory of Infrared System Detection and Imaging Technology, Chinese Academy of Sciences, Shanghai 200083, China.

Abstract

Collaborative reasoning for knowledge-based visual question answering is challenging but vital for understanding the features of images and questions efficiently. Previous methods either jointly fuse all kinds of features with an attention mechanism or use handcrafted rules to generate a layout for compositional reasoning; these approaches lack an explicit visual reasoning process and introduce a large number of parameters for predicting the correct answer. To conduct visual reasoning on all kinds of image-question pairs and address these problems, we propose a novel reasoning model, a question-guided tree structure with a knowledge base (QGTSKB). Our model consists of four neural module networks: the attention model, which locates attended regions from the image features and question embeddings using an attention mechanism; the gated reasoning model, which forgets and updates the fused features; the fusion reasoning model, which mines high-level semantics of the attended visual features and the knowledge base; and the knowledge-based fact model, which compensates for missing visual and textual information with external knowledge. The model thus performs visual analysis and reasoning based on tree structures, a knowledge base and four neural module networks. Experimental results show that our model achieves superior performance over existing methods on the VQA v2.0 and CLEVR datasets, and visual reasoning experiments demonstrate the interpretability of the model.
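
As a rough illustration of how question-guided attention and gated merging could be composed over a tree, the following minimal Python sketch may help. It is not the authors' implementation: the abstract gives no equations, so the dot-product attention, the fixed scalar gate (a learned gate is likely in the actual model), the dictionary tree format, and the names `attend`, `gated_merge`, and `reason` are all assumptions made for illustration. The fusion reasoning and knowledge-based fact modules are omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(regions, query):
    """Attention step (assumed form): weight each image region by the
    softmax of its dot product with a question-derived query vector."""
    weights = softmax([dot(r, query) for r in regions])
    dim = len(regions[0])
    attended = [sum(w * r[i] for w, r in zip(weights, regions)) for i in range(dim)]
    return weights, attended


def gated_merge(own, child, gate):
    """Gated step (assumed form): keep a fraction `gate` of the node's own
    state and take the remainder from the child's state."""
    return [gate * o + (1.0 - gate) * c for o, c in zip(own, child)]


def reason(node, regions, gate=0.5):
    """Attend at this node, then fold in each child's recursively computed state."""
    _, state = attend(regions, node["query"])
    for child in node.get("children", []):
        state = gated_merge(state, reason(child, regions, gate), gate)
    return state


# Toy input: two region feature vectors and a two-node tree (hypothetical data).
regions = [[1.0, 0.0], [0.0, 1.0]]
tree = {"query": [1.0, 0.0], "children": [{"query": [0.0, 1.0]}]}

weights, _ = attend(regions, tree["query"])
answer_state = reason(tree, regions)
```

Here the root query attends mostly to the first region and the child query mostly to the second, and the gate blends the two attended vectors on the way up the tree. In a full model the final state would feed an answer classifier.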

Keywords: attention mechanism; compositional reasoning; knowledge base; neural module network; tree structure.

MeSH terms

Knowledge Bases*
Learning
Neural Networks, Computer*
Problem Solving
Semantics

Grants and funding

No. 104040402/The National Pre-Research Foundation of China