Enhancing HOI Detection with Contextual Cues from Large Vision-Language Models

November 26, 2023 · Declared Dead · 🏛 ACM Multimedia

"No code URL or promise found in abstract"

Evidence collected by the PWNC Scanner

Authors Yu-Wei Zhan, Fan Liu, Xin Luo, Xin-Shun Xu, Liqiang Nie, Mohan Kankanhalli arXiv ID 2311.16475 Category cs.CV: Computer Vision Citations 3 Venue ACM Multimedia Last Checked 3 months ago

Abstract

Human-Object Interaction (HOI) detection aims at detecting human-object pairs and predicting their interactions. However, conventional HOI detection methods often struggle to fully capture the contextual information needed to accurately identify these interactions. While large Vision-Language Models (VLMs) show promise in tasks involving human interactions, they are not tailored for HOI detection. The complexity of human behavior and the diverse contexts in which these interactions occur make it further challenging. Contextual cues, such as the participants involved, body language, and the surrounding environment, play crucial roles in predicting these interactions, especially those that are unseen or ambiguous. Moreover, large VLMs are trained on vast image and text data, enabling them to generate contextual cues that help in understanding real-world contexts, object relationships, and typical interactions. Building on this, in this paper we introduce ConCue, a novel approach for improving visual feature extraction in HOI detection. Specifically, we first design specialized prompts to utilize large VLMs to generate contextual cues within an image. To fully leverage these cues, we develop a transformer-based feature extraction module with a multi-tower architecture that integrates contextual cues into both instance and interaction detectors. Extensive experiments and analyses demonstrate the effectiveness of using these contextual cues for HOI detection. The experimental results show that integrating ConCue with existing state-of-the-art methods significantly enhances their performance on two widely used datasets.