CinePile: A Long Video Question Answering Dataset and Benchmark
May 14, 2024 Β· Entered Twilight Β· π arXiv.org
"No code URL or promise found in abstract"
"Derived repo from GitHub Pages (backfill)"
Evidence collected by the PWNC Scanner
Repo contents: .DS_Store, README.md, index.html, static
Authors
Ruchit Rawal, Khalid Saifullah, Miquel FarrΓ©, Ronen Basri, David Jacobs, Gowthami Somepalli, Tom Goldstein
arXiv ID
2405.08813
Category
cs.CV: Computer Vision
Cross-listed
cs.LG,
cs.MM
Citations
98
Venue
arXiv.org
Repository
https://github.com/ruchitrawal/cinepile
Last Checked
11 days ago
Abstract
Current datasets for long-form video understanding often fall short of providing genuine long-form comprehension challenges, as many tasks derived from these datasets can be successfully tackled by analyzing just one or a few random frames from a video. To address this issue, we present a novel dataset and benchmark, CinePile, specifically designed for authentic long-form video understanding. This paper details our innovative approach for creating a question-answer dataset, utilizing advanced LLMs with human-in-the-loop and building upon human-generated raw data. Our comprehensive dataset comprises 305,000 multiple-choice questions (MCQs), covering various visual and multimodal aspects, including temporal comprehension, understanding human-object interactions, and reasoning about events or actions within a scene. Additionally, we fine-tuned open-source Video-LLMs on the training split and evaluated both open-source and proprietary video-centric LLMs on the test split of our dataset. The findings indicate that although current models underperform compared to humans, fine-tuning these models can lead to significant improvements in their performance.
Community Contributions
Found the code? Know the venue? Think something is wrong? Let us know!
π Similar Papers
In the same crypt β Computer Vision
π
π
Old Age
π
π
Old Age
Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
R.I.P.
π»
Ghosted
You Only Look Once: Unified, Real-Time Object Detection
π
π
Old Age
SSD: Single Shot MultiBox Detector
π
π
Old Age
Squeeze-and-Excitation Networks
R.I.P.
π»
Ghosted