REXUP: I REason, I EXtract, I UPdate with Structured Compositional Reasoning for Visual Question Answering

July 27, 2020 · Declared Dead · 🏛 International Conference on Neural Information Processing

"No code URL or promise found in abstract"

Evidence collected by the PWNC Scanner

Authors: Siwen Luo, Soyeon Caren Han, Kaiyuan Sun, Josiah Poon
arXiv ID: 2007.13262
Category: cs.CV (Computer Vision)
Cross-listed: cs.AI
Citations: 4
Venue: International Conference on Neural Information Processing
Last Checked: 3 months ago

Abstract

Visual question answering (VQA) is a challenging multi-modal task that requires not only the semantic understanding of both images and questions, but also the sound perception of a step-by-step reasoning process that would lead to the correct answer. So far, most successful attempts in VQA have focused on only one aspect: either the interaction between visual pixel features of images and word features of questions, or the reasoning process of answering a question about an image with simple objects. In this paper, we propose a deep reasoning VQA model with explicit visual structure-aware textual information, which works well at capturing the step-by-step reasoning process and detecting complex object relationships in photo-realistic images. The REXUP network consists of two branches, image object-oriented and scene graph-oriented, which work jointly with a super-diagonal fusion compositional attention network. We quantitatively and qualitatively evaluate REXUP on the GQA dataset and conduct extensive ablation studies to explore the reasons behind REXUP's effectiveness. Our best model, which delivers 92.7% on the validation set and 73.1% on the test-dev set, significantly outperforms the previous state of the art.