SAC: Semantic Attention Composition for Text-Conditioned Image Retrieval

September 03, 2020 · Declared Dead · 🏛 IEEE Workshop/Winter Conference on Applications of Computer Vision

"No code URL or promise found in abstract"

Evidence collected by the PWNC Scanner

Authors Surgan Jandial, Pinkesh Badjatiya, Pranit Chawla, Ayush Chopra, Mausoom Sarkar, Balaji Krishnamurthy arXiv ID 2009.01485 Category cs.CV: Computer Vision Cross-listed cs.AI Citations 55 Venue IEEE Workshop/Winter Conference on Applications of Computer Vision Last Checked 3 months ago

Abstract

The ability to efficiently search for images is essential for improving the user experiences across various products. Incorporating user feedback, via multi-modal inputs, to navigate visual search can help tailor retrieved results to specific user queries. We focus on the task of text-conditioned image retrieval that utilizes support text feedback alongside a reference image to retrieve images that concurrently satisfy constraints imposed by both inputs. The task is challenging since it requires learning composite image-text features by incorporating multiple cross-granular semantic edits from text feedback and then applying the same to visual features. To address this, we propose a novel framework SAC which resolves the above in two major steps: "where to see" (Semantic Feature Attention) and "how to change" (Semantic Feature Modification). We systematically show how our architecture streamlines the generation of text-aware image features by removing the need for various modules required by other state-of-art techniques. We present extensive quantitative, qualitative analysis, and ablation studies, to show that our architecture SAC outperforms existing techniques by achieving state-of-the-art performance on 3 benchmark datasets: FashionIQ, Shoes, and Birds-to-Words, while supporting natural language feedback of varying lengths.