Deep Contextualized Acoustic Representations For Semi-Supervised Speech Recognition

December 03, 2019 · Declared Dead · 🏛 IEEE International Conference on Acoustics, Speech, and Signal Processing

"Paper promises code 'coming soon'"

Evidence collected by the PWNC Scanner

Authors: Shaoshi Ling, Yuzong Liu, Julian Salazar, Katrin Kirchhoff
arXiv ID: 1912.01679
Category: eess.AS (Audio & Speech)
Cross-listed: cs.CL, cs.LG, cs.SD
Citations: 145
Venue: IEEE International Conference on Acoustics, Speech, and Signal Processing
Last Checked: 1 month ago

Abstract

We propose a novel approach to semi-supervised automatic speech recognition (ASR). We first exploit a large amount of unlabeled audio data via representation learning, where we reconstruct a temporal slice of filterbank features from past and future context frames. The resulting deep contextualized acoustic representations (DeCoAR) are then used to train a CTC-based end-to-end ASR system using a smaller amount of labeled audio data. In our experiments, we show that systems trained on DeCoAR consistently outperform ones trained on conventional filterbank features, giving 42% and 19% relative improvement over the baseline on WSJ eval92 and LibriSpeech test-clean, respectively. Our approach can drastically reduce the amount of labeled data required; unsupervised training on LibriSpeech then supervision with 100 hours of labeled data achieves performance on par with training on all 960 hours directly. Pre-trained models and code will be released online.
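The slice-reconstruction objective described in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the abstract does not give the slice length, the context size, the loss, or the network, so `slice_len`, `context_len`, the L1 error, and the `interpolate_baseline` stand-in predictor are all assumptions chosen for illustration. In the paper a learned model plays the role of the predictor.

```python
from typing import List, Tuple

Frame = List[float]  # one filterbank frame (hypothetical 2-dim toy features below)


def slice_examples(
    features: List[Frame], slice_len: int, context_len: int
) -> List[Tuple[List[Frame], List[Frame], List[Frame]]]:
    """Build (past context, target slice, future context) triples.

    slice_len and context_len are illustrative assumptions, not values
    taken from the paper.
    """
    examples = []
    last_start = len(features) - slice_len - context_len
    for start in range(context_len, last_start + 1):
        past = features[start - context_len:start]
        target = features[start:start + slice_len]
        future = features[start + slice_len:start + slice_len + context_len]
        examples.append((past, target, future))
    return examples


def l1_error(predicted: List[Frame], target: List[Frame]) -> float:
    """Mean absolute error over all frames and feature dimensions (assumed loss)."""
    total, count = 0.0, 0
    for p_frame, t_frame in zip(predicted, target):
        for p, t in zip(p_frame, t_frame):
            total += abs(p - t)
            count += 1
    return total / count


def interpolate_baseline(
    past: List[Frame], future: List[Frame], slice_len: int
) -> List[Frame]:
    """Stand-in predictor: linear interpolation between the last past frame
    and the first future frame. A trained network would replace this."""
    left, right = past[-1], future[0]
    predicted = []
    for k in range(1, slice_len + 1):
        w = k / (slice_len + 1)
        predicted.append([(1 - w) * a + w * b for a, b in zip(left, right)])
    return predicted


# Toy "filterbank" sequence: 10 frames whose values grow linearly in time.
features = [[float(t), float(2 * t)] for t in range(10)]
examples = slice_examples(features, slice_len=3, context_len=2)

past, target, future = examples[0]
prediction = interpolate_baseline(past, future, slice_len=3)
error = l1_error(prediction, target)
```

Because the toy features are linear in time, the interpolation stand-in reconstructs the slice exactly here; on real speech features the error would be nonzero, and reducing it is what would drive the representation learning.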

📜 Similar Papers

In the same crypt — Audio & Speech

R.I.P. 👻 Ghosted

Robust Speech Recognition via Large-Scale Weak Supervision

Alec Radford, Jong Wook Kim, ... (+4 more)

eess.AS 🏛 ICML 📚 6.1K cites 3 years ago

R.I.P. 👻 Ghosted

SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition

Daniel S. Park, William Chan, ... (+5 more)

eess.AS 🏛 Interspeech 📚 3.9K cites 6 years ago

R.I.P. 👻 Ghosted

DiffWave: A Versatile Diffusion Model for Audio Synthesis

Zhifeng Kong, Wei Ping, ... (+3 more)

eess.AS 🏛 ICLR 📚 1.8K cites 5 years ago

R.I.P. 👻 Ghosted

FastSpeech 2: Fast and High-Quality End-to-End Text to Speech

Yi Ren, Chenxu Hu, ... (+5 more)

eess.AS 🏛 ICLR 📚 1.7K cites 5 years ago

R.I.P. 👻 Ghosted

MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis

Kundan Kumar, Rithesh Kumar, ... (+7 more)

eess.AS 🏛 NeurIPS 📚 1.1K cites 6 years ago

R.I.P. 👻 Ghosted

Generalized End-to-End Loss for Speaker Verification

Li Wan, Quan Wang, ... (+2 more)

eess.AS 🏛 ICASSP 📚 1.0K cites 8 years ago

Died the same way — ⏳ Coming Soon™

R.I.P. ⏳ Coming Soon™

Exploring Simple Siamese Representation Learning

Xinlei Chen, Kaiming He

cs.CV 🏛 CVPR 📚 4.8K cites 5 years ago

R.I.P. ⏳ Coming Soon™

An Analysis of Scale Invariance in Object Detection - SNIP

Bharat Singh, Larry S. Davis

cs.CV 🏛 CVPR 📚 795 cites 8 years ago

R.I.P. ⏳ Coming Soon™

Class-balanced Grouping and Sampling for Point Cloud 3D Object Detection

Benjin Zhu, Zhengkai Jiang, ... (+3 more)

cs.CV 🏛 arXiv 📚 556 cites 6 years ago

R.I.P. ⏳ Coming Soon™

FSRNet: End-to-End Learning Face Super-Resolution with Facial Priors

Yu Chen, Ying Tai, ... (+3 more)

cs.CV 🏛 CVPR 📚 542 cites 8 years ago