DocSum: Domain-Adaptive Pre-training for Document Abstractive Summarization

December 11, 2024 · Declared Dead · 🏛 2025 IEEE/CVF Winter Conference on Applications of Computer Vision Workshops (WACVW)

"No code URL or promise found in abstract"

Evidence collected by the PWNC Scanner

Authors: Phan Phuong Mai Chau, Souhail Bakkali, Antoine Doucet
arXiv ID: 2412.08196
Category: cs.CL (Computation & Language)
Cross-listed: cs.CV
Citations: 1
Venue: 2025 IEEE/CVF Winter Conference on Applications of Computer Vision Workshops (WACVW)
Last Checked: 3 months ago

Abstract

Abstractive summarization has made significant strides in condensing and rephrasing large volumes of text into coherent summaries. However, summarizing administrative documents presents unique challenges due to domain-specific terminology, OCR-generated errors, and the scarcity of annotated datasets for model fine-tuning. Existing models often struggle to adapt to the intricate structure and specialized content of such documents. To address these limitations, we introduce DocSum, a domain-adaptive abstractive summarization framework tailored for administrative documents. Leveraging pre-training on OCR-transcribed text and fine-tuning with an innovative integration of question-answer pairs, DocSum enhances summary accuracy and relevance. This approach tackles the complexities inherent in administrative content, ensuring outputs that align with real-world business needs. To evaluate its capabilities, we define a novel downstream task setting, Document Abstractive Summarization, which reflects the practical requirements of business and organizational settings. Comprehensive experiments demonstrate DocSum's effectiveness in producing high-quality summaries, showcasing its potential to improve decision-making and operational workflows across the public and private sectors.