FusDom: Combining In-Domain and Out-of-Domain Knowledge for Continuous Self-Supervised Learning

December 20, 2023 · Entered Twilight · 🏛 IEEE International Conference on Acoustics, Speech, and Signal Processing

💤 TWILIGHT: Eternal Rest
Repo abandoned since publication

Repo contents: .circleci, .github, .gitignore, .gitmodules, .pre-commit-config.yaml, CODE_OF_CONDUCT.md, CONTRIBUTING.md, LICENSE, MANIFEST.in, README.md, RELEASE.md, docs, examples, fairseq, fairseq_cli, hubconf.py, hydra_plugins, pyproject.toml, release_utils.py, scripts, setup.cfg, setup.py, tests, train.py

Authors: Ashish Seth, Sreyan Ghosh, S. Umesh, Dinesh Manocha
arXiv ID: 2312.13026
Category: eess.AS: Audio & Speech
Cross-listed: cs.AI, cs.CL, cs.SD
Citations: 1
Venue: IEEE International Conference on Acoustics, Speech, and Signal Processing
Repository: https://github.com/cs20s030/fusdom (⭐ 3)
Last Checked: 1 month ago

Abstract
Continued pre-training (CP) offers multiple advantages, such as target-domain adaptation and the ability to exploit the continuous stream of unlabeled data available online. However, continued pre-training on out-of-domain distributions often causes catastrophic forgetting of previously acquired knowledge, which results in sub-optimal ASR performance. This paper presents FusDom, a simple and novel methodology for SSL-based continued pre-training. FusDom learns speech representations that are robust and adaptive yet not forgetful of concepts seen in the past. Instead of solving the SSL pre-text task on the output representations of a single model, FusDom leverages two identical pre-trained SSL models, a teacher and a student, with a modified pre-training head to solve the CP SSL pre-text task. This head employs a cross-attention mechanism between the representations of both models, and only the student receives gradient updates; the teacher stays frozen. Finally, the student is fine-tuned for ASR. In practice, FusDom significantly outperforms all our baselines across settings, with WER improvements ranging from 0.2 to 7.3 in the target domain while retaining performance in the earlier domain.
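
To make the teacher/student head described in the abstract more concrete, here is a minimal, dependency-free sketch of single-head cross-attention between two sets of frame-level representations. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not say which model supplies the queries, so the sketch assumes student frames are queries and teacher frames are keys and values, and it omits learned projections, multiple heads, and the SSL loss. The names `cross_attention`, `student_out`, and `teacher_out` are hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(student_frames, teacher_frames):
    """Single-head scaled dot-product cross-attention.

    Assumption (not stated in the abstract): student frames act as
    queries, teacher frames act as keys and values, and learned
    projection matrices are omitted for brevity.
    Returns (fused_frames, attention_weights).
    """
    dim = len(student_frames[0])
    scale = 1.0 / math.sqrt(dim)
    fused, weights = [], []
    for query in student_frames:
        # Similarity of this student frame to every teacher frame.
        scores = [scale * sum(q * k for q, k in zip(query, key))
                  for key in teacher_frames]
        row = softmax(scores)
        weights.append(row)
        # Weighted sum of teacher frames (the values) for this query.
        fused.append([sum(w * value[d] for w, value in zip(row, teacher_frames))
                      for d in range(dim)])
    return fused, weights


# Toy stand-ins for encoder outputs: 4 frames of 8-dim features each.
rng = random.Random(0)
student_out = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(4)]
# The teacher is treated as frozen: its features are only read, never updated.
teacher_out = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(4)]

fused_out, attn = cross_attention(student_out, teacher_out)
```

In a real training setup this would be written in a tensor library with automatic differentiation, and keeping the teacher fixed would amount to excluding its parameters from the optimizer so that only the student (and the head) are updated.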