Let's Go Real Talk: Spoken Dialogue Model for Face-to-Face Conversation

June 12, 2024 · 🏛 Annual Meeting of the Association for Computational Linguistics

Authors: Se Jin Park, Chae Won Kim, Hyeongseop Rha, Minsu Kim, Joanna Hong, Jeong Hun Yeo, Yong Man Ro
arXiv ID: 2406.07867
Category: cs.CV (Computer Vision)
Cross-listed: cs.AI, cs.HC
Citations: 29
Venue: Annual Meeting of the Association for Computational Linguistics
Repository: https://huggingface.co/datasets/IVLLab/MultiDialog

Abstract

In this paper, we introduce a novel Face-to-Face spoken dialogue model. It processes audio-visual speech from the user and generates audio-visual speech in response, marking an initial step towards an avatar chatbot system that does not rely on intermediate text. To this end, we introduce MultiDialog, the first large-scale multimodal (i.e., audio and visual) spoken dialogue corpus, containing 340 hours of approximately 9,000 dialogues recorded based on the open-domain dialogue dataset TopicalChat. MultiDialog contains parallel audio-visual recordings of conversation partners acting according to the given script with emotion annotations, which we expect to open up research opportunities in multimodal synthesis. Our Face-to-Face spoken dialogue model builds on a textually pretrained large language model and adapts it to the audio-visual spoken dialogue domain through speech-text joint pretraining. Through extensive experiments, we validate the effectiveness of our model in facilitating face-to-face conversation. Demo and data are available at https://multidialog.github.io and https://huggingface.co/datasets/IVLLab/MultiDialog, respectively.