Listen and Fill in the Missing Letters: Non-Autoregressive Transformer for Speech Recognition
November 10, 2019 · Declared Dead
"Paper promises code 'coming soon'"
Evidence collected by the PWNC Scanner
Authors
Nanxin Chen, Shinji Watanabe, Jesús Villalba, Najim Dehak
arXiv ID
1911.04908
Category
eess.AS: Audio & Speech
Cross-listed
cs.LG, cs.SD, stat.ML
Citations
16
Last Checked
1 month ago
Abstract
Recently, very deep transformers have outperformed conventional bi-directional long short-term memory networks by a large margin in speech recognition. However, inference computation cost remains a serious concern when putting them into production. In this paper, we study two different non-autoregressive transformer structures for automatic speech recognition (ASR): A-CMLM and A-FMLM. During training, for both frameworks, input tokens fed to the decoder are randomly replaced by special mask tokens. The network is required to predict the tokens corresponding to those mask tokens by taking both the unmasked context and the input speech into consideration. During inference, we start from all mask tokens and the network iteratively predicts missing tokens based on partial results. We show that this framework can support different decoding strategies, including traditional left-to-right decoding. A new decoding strategy is proposed as an example, which proceeds from the easiest predictions to the most difficult ones. Results on Mandarin (Aishell) and Japanese (CSJ) ASR benchmarks show that it is possible to train such a non-autoregressive network for ASR. On Aishell in particular, the proposed method outperforms the Kaldi ASR system and matches the performance of the state-of-the-art autoregressive transformer with a 7x speedup. Pretrained models and code will be made available after publication.
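Since no official code is linked here, the following is only a minimal sketch of the two ideas the abstract describes: random masking of decoder inputs during training, and easy-first iterative decoding that starts from all mask tokens. It is not the authors' implementation. The names `mask_tokens`, `easy_first_decode`, and `toy_predict` are hypothetical, the fixed per-iteration budget is an assumed reading of "easiest first", and `toy_predict` is a stand-in for a network that would also condition on the input speech.

```python
import random

MASK = "<mask>"

def mask_tokens(tokens, rng, mask_prob=0.5):
    """Training-side corruption: randomly replace decoder inputs with MASK."""
    return [MASK if rng.random() < mask_prob else t for t in tokens]

def easy_first_decode(predict, length, num_iters):
    """Start from all MASK and commit the most confident predictions first.

    predict(tokens) must return one (token, confidence) pair per position.
    num_iters must be at least 1 (assumed fixed iteration budget).
    """
    tokens = [MASK] * length
    per_iter = -(-length // num_iters)  # ceiling division
    while MASK in tokens:
        guesses = predict(tokens)
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        # Easiest (highest-confidence) masked positions are filled first.
        masked.sort(key=lambda i: guesses[i][1], reverse=True)
        for i in masked[:per_iter]:
            tokens[i] = guesses[i][0]
    return tokens

def toy_predict(tokens):
    """Hypothetical stand-in for the model: fixed tokens and confidences."""
    target = ["a", "b", "c", "d"]
    confidence = [0.9, 0.4, 0.7, 0.2]
    return list(zip(target, confidence))

decoded = easy_first_decode(toy_predict, length=4, num_iters=2)
```

With two iterations over four positions, this sketch commits the two most confident positions in the first pass and the remaining two in the second. A real model would recompute its confidences after each pass using the tokens already committed.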
📜 Similar Papers
In the same crypt — Audio & Speech
SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition (R.I.P. 👻 Ghosted)
DiffWave: A Versatile Diffusion Model for Audio Synthesis (R.I.P. 👻 Ghosted)
FastSpeech 2: Fast and High-Quality End-to-End Text to Speech (R.I.P. 👻 Ghosted)
MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis (R.I.P. 👻 Ghosted)
Generalized End-to-End Loss for Speaker Verification (R.I.P. 👻 Ghosted)
Died the same way — ⏳ Coming Soon™
Exploring Simple Siamese Representation Learning (R.I.P. ⏳ Coming Soon™)
An Analysis of Scale Invariance in Object Detection - SNIP (R.I.P. ⏳ Coming Soon™)
Class-balanced Grouping and Sampling for Point Cloud 3D Object Detection (R.I.P. ⏳ Coming Soon™)