CinePile: A Long Video Question Answering Dataset and Benchmark

May 14, 2024 · Entered Twilight · 🏛 arXiv.org

"No code URL or promise found in abstract"
"Derived repo from GitHub Pages (backfill)"

Evidence collected by the PWNC Scanner

Repo contents: .DS_Store, README.md, index.html, static

Authors Ruchit Rawal, Khalid Saifullah, Miquel Farré, Ronen Basri, David Jacobs, Gowthami Somepalli, Tom Goldstein arXiv ID 2405.08813 Category cs.CV: Computer Vision Cross-listed cs.LG, cs.MM Citations 98 Venue arXiv.org Repository https://github.com/ruchitrawal/cinepile Last Checked 11 days ago

Abstract

Current datasets for long-form video understanding often fall short of providing genuine long-form comprehension challenges, as many tasks derived from these datasets can be successfully tackled by analyzing just one or a few random frames from a video. To address this issue, we present a novel dataset and benchmark, CinePile, specifically designed for authentic long-form video understanding. This paper details our innovative approach for creating a question-answer dataset, utilizing advanced LLMs with human-in-the-loop and building upon human-generated raw data. Our comprehensive dataset comprises 305,000 multiple-choice questions (MCQs), covering various visual and multimodal aspects, including temporal comprehension, understanding human-object interactions, and reasoning about events or actions within a scene. Additionally, we fine-tuned open-source Video-LLMs on the training split and evaluated both open-source and proprietary video-centric LLMs on the test split of our dataset. The findings indicate that although current models underperform compared to humans, fine-tuning these models can lead to significant improvements in their performance.