VaLiD: Mitigating the Hallucination of Large Vision Language Models by Visual Layer Fusion Contrastive Decoding

November 24, 2024 · Declared Dead · 🏛 arXiv.org

Authors Jiaqi Wang, Yifei Gao, Jitao Sang · arXiv ID 2411.15839 · Category cs.CV: Computer Vision · Citations 11 · Venue arXiv.org · Repository https://github.com/RicardoLuL/VaLiD_LVLMs_hallucinations · Last Checked 1 month ago

Abstract

Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities in multimodal task reasoning. However, they often generate responses that appear plausible yet do not accurately reflect the visual content, a phenomenon known as hallucination. Recent approaches have introduced training-free methods that mitigate hallucinations by adjusting the decoding strategy at inference time, typically attributing hallucinations to the language model itself. Our analysis, however, reveals that distortions in the visual encoding process significantly affect the model's reasoning capabilities. Specifically, earlier visual layers may retain key features, but these features are gradually distorted as the information propagates toward the output layer. Building on these insights, we propose a novel hallucination-mitigation method from the visual encoding perspective: Visual Layer Fusion Contrastive Decoding (VaLiD). This method uses uncertainty to guide visual layer selection, correcting distortions in the visual encoding process and thereby enhancing the reliability of the generated content. Experimental results demonstrate the effectiveness of VaLiD in mitigating hallucinations across various benchmarks, achieving state-of-the-art performance compared with baseline methods. Code is available at https://github.com/RicardoLuL/VaLiD_LVLMs_hallucinations.
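
The abstract does not spell out the selection rule or the contrastive formula, so the following is only a minimal sketch of the general idea, not the VaLiD implementation. It assumes that next-token logits are available for the standard (final) visual layer and for a few earlier visual layers, that "uncertainty" is measured as the entropy of the next-token distribution, and that the contrast takes the common form (1 + alpha) * final - alpha * reference. The function names, the highest-entropy rule, the direction of the contrast, and alpha are all hypothetical choices made for illustration.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_layer_by_uncertainty(candidate_logits):
    """Return the index of the candidate with the highest-entropy distribution.

    ASSUMPTION: picking the most uncertain candidate is an illustrative rule,
    not the selection criterion from the paper.
    """
    entropies = [entropy(softmax(logits)) for logits in candidate_logits]
    return max(range(len(entropies)), key=entropies.__getitem__)


def contrastive_logits(final_logits, reference_logits, alpha=1.0):
    """Generic contrastive adjustment: (1 + alpha) * final - alpha * reference.

    ASSUMPTION: this form and the default alpha are borrowed from generic
    contrastive decoding, not taken from the paper.
    """
    return [(1 + alpha) * f - alpha * r
            for f, r in zip(final_logits, reference_logits)]


if __name__ == "__main__":
    # Toy logits over a three-token vocabulary (made-up numbers).
    final_logits = [2.0, 1.0, 0.1]
    early_layer_logits = [[1.9, 1.8, 0.0], [3.0, 0.0, 0.0]]

    idx = select_layer_by_uncertainty(early_layer_logits)
    adjusted = contrastive_logits(final_logits, early_layer_logits[idx])
    next_token = max(range(len(adjusted)), key=adjusted.__getitem__)
    print(idx, next_token)
```

In this toy setup the contrast widens the gap between the top token and a runner-up that the selected reference distribution also rates highly. A real system would obtain the per-layer logits by running the language model on visual features taken from different encoder layers, which is outside the scope of this sketch.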