QEVA: A Reference-Free Evaluation Metric for Narrative Video Summarization with Multimodal Question Answering

April 27, 2026 ยท Grace Period ยท ๐Ÿ› Findings of EMNLP 2025

โณ Grace Period
This paper is less than 90 days old. We give authors time to release their code before passing judgment.
Authors Woojun Jung, Junyeong Kim arXiv ID 2604.24052 Category cs.CV: Computer Vision Cross-listed cs.AI Citations 0 Venue Findings of EMNLP 2025
Abstract
Video-to-text summarization remains underexplored in terms of comprehensive evaluation methods. Traditional n-gram overlap-based metrics and recent large language model (LLM)-based approaches depend heavily on human-written reference summaries, limiting their practicality and sensitivity to nuanced semantic aspects. In this paper, we propose QEVA, a reference-free metric evaluating candidate summaries directly against source videos through multimodal question answering. QEVA assesses summaries along three clear dimensions: Coverage, Factuality, and Chronology. We also introduce MLVU(VS)-Eval, a new annotated benchmark derived from the MLVU dataset, comprising 800 summaries generated from 200 videos using state-of-the-art video-language multimodal models. This dataset establishes a transparent and consistent framework for evaluation. Experimental results demonstrate that QEVA shows higher correlation with human judgments compared to existing approaches, as measured by Kendall's $ฯ„_b$, $ฯ„_c$, and Spearman's $ฯ$. We hope that our benchmark and metric will facilitate meaningful progress in video-to-text summarization research and provide valuable insights for the development of future evaluation methods.
Community shame:
Not yet rated
Community Contributions

Found the code? Know the venue? Think something is wrong? Let us know!

๐Ÿ“œ Similar Papers

In the same crypt โ€” Computer Vision