Deep Contextualized Acoustic Representations For Semi-Supervised Speech Recognition

December 03, 2019 · Declared Dead · 🏛 IEEE International Conference on Acoustics, Speech, and Signal Processing

⏳ CAUSE OF DEATH: Coming Soon™
Promised but never delivered

"Paper promises code 'coming soon'"

Evidence collected by the PWNC Scanner

Authors: Shaoshi Ling, Yuzong Liu, Julian Salazar, Katrin Kirchhoff
arXiv ID: 1912.01679
Category: eess.AS (Audio & Speech)
Cross-listed: cs.CL, cs.LG, cs.SD
Citations: 145
Venue: IEEE International Conference on Acoustics, Speech, and Signal Processing
Last Checked: 1 month ago

Abstract
We propose a novel approach to semi-supervised automatic speech recognition (ASR). We first exploit a large amount of unlabeled audio data via representation learning, where we reconstruct a temporal slice of filterbank features from past and future context frames. The resulting deep contextualized acoustic representations (DeCoAR) are then used to train a CTC-based end-to-end ASR system using a smaller amount of labeled audio data. In our experiments, we show that systems trained on DeCoAR consistently outperform ones trained on conventional filterbank features, giving 42% and 19% relative improvement over the baseline on WSJ eval92 and LibriSpeech test-clean, respectively. Our approach can drastically reduce the amount of labeled data required; unsupervised training on LibriSpeech then supervision with 100 hours of labeled data achieves performance on par with training on all 960 hours directly. Pre-trained models and code will be released online.
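
The reconstruction setup described in the abstract can be illustrated with a minimal sketch in plain Python. It only shows how a temporal slice of filterbank frames is separated from its past and future context, and how a prediction for that slice might be scored. Everything named here is an assumption for illustration: the function names, the mean-absolute-error loss, and especially `naive_predict`, which is a trivial stand-in for the neural encoder and is not the model the paper trains. The WER values in the usage lines are hypothetical and chosen only to show how a relative improvement figure such as 42% is computed.

```python
from typing import List, Tuple

Frame = List[float]  # one frame of filterbank features


def split_context_and_slice(
    features: List[Frame], start: int, slice_len: int
) -> Tuple[List[Frame], List[Frame], List[Frame]]:
    """Split an utterance into (past context, target slice, future context)."""
    if slice_len <= 0 or start < 0 or start + slice_len > len(features):
        raise ValueError("slice is out of range")
    past = features[:start]
    target = features[start:start + slice_len]
    future = features[start + slice_len:]
    return past, target, future


def naive_predict(past: List[Frame], future: List[Frame], slice_len: int) -> List[Frame]:
    """Hypothetical stand-in for the learned model: predict every frame in the
    slice as the average of the nearest past and nearest future frame."""
    if not past or not future:
        raise ValueError("need at least one frame of context on each side")
    left, right = past[-1], future[0]
    avg = [(a + b) / 2.0 for a, b in zip(left, right)]
    return [list(avg) for _ in range(slice_len)]


def reconstruction_loss(predicted: List[Frame], target: List[Frame]) -> float:
    """Mean absolute error over all frames and feature dimensions
    (the choice of loss is an assumption made for this sketch)."""
    diffs = [abs(p - t) for pf, tf in zip(predicted, target) for p, t in zip(pf, tf)]
    return sum(diffs) / len(diffs)


def relative_improvement(baseline_wer: float, new_wer: float) -> float:
    """Relative reduction in word error rate with respect to the baseline."""
    return (baseline_wer - new_wer) / baseline_wer


# Toy utterance: 6 frames with 2 feature dimensions each.
features = [[float(t), float(t) + 0.5] for t in range(6)]
past, target, future = split_context_and_slice(features, start=2, slice_len=2)
predicted = naive_predict(past, future, slice_len=2)
loss = reconstruction_loss(predicted, target)

# Hypothetical WERs, not results from the paper.
improvement = relative_improvement(baseline_wer=10.0, new_wer=5.8)
```

In a real system the stand-in predictor would be replaced by a trained encoder, and the learned internal representations, rather than the reconstructed frames, would be fed to the downstream ASR model.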