SpeechLM: Enhanced Speech Pre-Training with Unpaired Textual Data

September 30, 2022 · Entered Twilight · 🏛 IEEE/ACM Transactions on Audio Speech and Language Processing

"No code URL or promise found in abstract"
"Code repo scraped from project page (backfill)"

Evidence collected by the PWNC Scanner

Repo contents: .gitignore, CODE_OF_CONDUCT.md, CONTRIBUTING.md, LICENSE, README.md, get_covost_splits.py, get_tt_speech.py, overview.png, stats.png, stats2.png

Authors: Ziqiang Zhang, Sanyuan Chen, Long Zhou, Yu Wu, Shuo Ren, Shujie Liu, Zhuoyuan Yao, Xun Gong, Lirong Dai, Jinyu Li, Furu Wei
arXiv ID: 2209.15329
Category: cs.CL (Computation & Language)
Cross-listed: cs.AI, eess.AS
Citations: 69
Venue: IEEE/ACM Transactions on Audio Speech and Language Processing
Repository: https://github.com/facebookresearch/covost ⭐ 395
Last Checked: 22 days ago

Abstract

How to boost speech pre-training with textual data remains an unsolved problem because speech and text are very different modalities with distinct characteristics. In this paper, we propose a cross-modal Speech and Language Model (SpeechLM) to explicitly align speech and text pre-training with a pre-defined unified discrete representation. Specifically, we introduce two alternative discrete tokenizers to bridge the speech and text modalities, including phoneme-unit and hidden-unit tokenizers, which can be trained using a small amount of paired speech-text data. Based on the trained tokenizers, we convert the unlabeled speech and text data into tokens of phoneme units or hidden units. The pre-training objective is designed to unify the speech and the text into the same discrete semantic space with a unified Transformer network. We evaluate SpeechLM on various spoken language processing tasks, including speech recognition, speech translation, and the universal representation evaluation framework SUPERB, demonstrating significant improvements on content-related tasks. Code and models are available at https://aka.ms/SpeechLM.
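To make the idea of a shared discrete representation concrete, here is a toy sketch of how speech and text could both be mapped into one unit vocabulary. It is not the authors' implementation: the lexicon, the 2-D "acoustic" centroids, the repeat-collapsing step, and every function name below are hypothetical stand-ins chosen for illustration. A real tokenizer would be trained on paired speech-text data, as the abstract describes.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical shared vocabulary of phoneme-like units.
UNITS: List[str] = ["HH", "AY", "SIL"]
UNIT_TO_ID: Dict[str, int] = {u: i for i, u in enumerate(UNITS)}

# Hypothetical pronunciation lexicon standing in for a text tokenizer.
LEXICON: Dict[str, List[str]] = {"hi": ["HH", "AY"]}

# Hypothetical 2-D centroids standing in for a trained speech tokenizer.
CENTROIDS: Dict[str, Tuple[float, float]] = {
    "HH": (0.0, 1.0),
    "AY": (1.0, 0.0),
    "SIL": (0.0, 0.0),
}


def tokenize_text(words: Sequence[str]) -> List[int]:
    """Map words to unit ids through the toy lexicon."""
    ids: List[int] = []
    for word in words:
        for unit in LEXICON[word.lower()]:
            ids.append(UNIT_TO_ID[unit])
    return ids


def tokenize_speech(frames: Sequence[Tuple[float, float]]) -> List[int]:
    """Assign each frame to the unit with the nearest centroid."""
    ids: List[int] = []
    for frame in frames:
        nearest = min(
            CENTROIDS,
            key=lambda u: sum((a - b) ** 2 for a, b in zip(frame, CENTROIDS[u])),
        )
        ids.append(UNIT_TO_ID[nearest])
    return ids


def collapse_repeats(ids: Sequence[int]) -> List[int]:
    """Merge consecutive duplicate ids (an illustrative assumption,
    used here so frame-level and text-level sequences are comparable)."""
    out: List[int] = []
    for i in ids:
        if not out or out[-1] != i:
            out.append(i)
    return out
```

In this sketch, `tokenize_text(["hi"])` and `collapse_repeats(tokenize_speech(frames))` for frames lying near the `HH` and then the `AY` centroids yield the same id sequence, which is the property that would let a single Transformer consume either modality.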