Replay attack detection with complementary high-resolution information using end-to-end DNN for the ASVspoof 2019 Challenge
April 23, 2019 · Entered Twilight · Interspeech
"Last commit was 6.0 years ago (≥5 year threshold)"
Evidence collected by the PWNC Scanner
Repo contents: .gitignore, README.md, g0-Extract_spectrograms_torch_asvspoof2019.py, g1-train_model.py, g1-train_model.yaml, g2-train_model_CNN_GRU_wPretrn.py, g2-train_model_CNN_GRU_wPretrn.yaml, g_spec_CNN_GRU.py, g_spec_CNN_c_bc.py
Authors
Jee-weon Jung, Hye-jin Shim, Hee-Soo Heo, Ha-Jin Yu
arXiv ID
1904.10134
Category
eess.AS: Audio & Speech
Cross-listed
cs.CR, cs.SD
Citations
51
Venue
Interspeech
Repository
https://github.com/Jungjee/ASVspoof2019_PA
⭐ 23
Last Checked
1 month ago
Abstract
In this study, we focus on replacing the extraction of hand-crafted acoustic features with an end-to-end DNN that takes complementary high-resolution spectrograms as input. As audio devices advance, the typical characteristics of replayed speech assumed by conventional knowledge alter or diminish under unknown replay configurations, making it increasingly difficult to detect spoofed speech with a knowledge-based approach. To detect unrevealed characteristics that reside in replayed speech, we input spectrograms directly into an end-to-end DNN without knowledge-based intervention. The explorations in this study that differentiate it from existing spectrogram-based systems are twofold: complementary information and high resolution. Spectrograms carrying different information are explored, and it is shown that additional information, such as phase, can be complementary. High-resolution spectrograms are employed under the assumption that the difference between bona-fide and replayed speech lies in the details. Additionally, to verify whether other features are complementary to spectrograms, we also examine a raw-waveform and an i-vector based system. Experiments on the ASVspoof 2019 physical access challenge show promising results, with a t-DCF of 0.0570 and an equal error rate of 2.45% on the evaluation set.
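The "complementary high-resolution" idea in the abstract boils down to computing an STFT with a large FFT size and keeping both the magnitude and the phase as separate feature maps. The repository's actual extraction lives in g0-Extract_spectrograms_torch_asvspoof2019.py and uses PyTorch; the NumPy sketch below is only an illustration of the general technique, and the function name, window choice, and hop size are assumptions, not the paper's exact configuration.

```python
import numpy as np

def complementary_spectrograms(wav, n_fft=2048, hop=128):
    """Illustrative sketch: high-resolution magnitude and phase spectrograms.

    A large n_fft yields the fine frequency resolution the paper argues is
    needed to expose replay artifacts; magnitude and phase are returned as
    two complementary channels for a downstream DNN.
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wav) - n_fft) // hop
    # Frame the waveform and apply the analysis window.
    frames = np.stack([wav[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    spec = np.fft.rfft(frames, axis=1)   # (n_frames, n_fft // 2 + 1)
    log_mag = np.log1p(np.abs(spec))     # compressed magnitude spectrogram
    phase = np.angle(spec)               # phase spectrogram in [-pi, pi]
    return log_mag, phase

# Example: one second of noise at 16 kHz -> two (frames, 1025) feature maps.
wav = np.random.randn(16000)
mag, ph = complementary_spectrograms(wav)
```

Feeding both channels to the network lets it learn from phase cues that knowledge-based features (which typically discard phase) cannot represent.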
Community Contributions
Found the code? Know the venue? Think something is wrong? Let us know!
Similar Papers
In the same crypt · Audio & Speech
SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition · 👻 Ghosted
DiffWave: A Versatile Diffusion Model for Audio Synthesis · 👻 Ghosted
FastSpeech 2: Fast and High-Quality End-to-End Text to Speech · 👻 Ghosted
MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis · 👻 Ghosted