Wav2Pix: Speech-conditioned Face Generation using Generative Adversarial Networks

March 25, 2019 · Entered Twilight · 🏛 IEEE International Conference on Acoustics, Speech, and Signal Processing

"No code URL or promise found in abstract"
"Derived repo from GitHub Pages (backfill)"

Evidence collected by the PWNC Scanner

Repo contents: LICENSE, Readme.md, assets, config.yaml, dataset, models, runtime.py, scripts, wav2pix-2019-icassp.pdf

Authors: Amanda Duarte, Francisco Roldan, Miquel Tubau, Janna Escur, Santiago Pascual, Amaia Salvador, Eva Mohedano, Kevin McGuinness, Jordi Torres, Xavier Giro-i-Nieto
arXiv ID: 1903.10195
Category: cs.MM (Multimedia)
Cross-listed: cs.CV
Citations: 84
Venue: IEEE International Conference on Acoustics, Speech, and Signal Processing
Repository: https://github.com/imatge-upc/wav2pix (⭐ 56)
Last checked: 7 days ago

Abstract

Speech is a rich biometric signal that contains information about the identity, gender and emotional state of the speaker. In this work, we explore its potential to generate face images of a speaker by conditioning a Generative Adversarial Network (GAN) on raw speech input. We propose a deep neural network that is trained from scratch in an end-to-end fashion, generating a face directly from the raw speech waveform without any additional identity information (e.g., a reference image or one-hot encoding). Our model is trained in a self-supervised manner by exploiting the audio and visual signals naturally aligned in videos. To enable training from video data, we present a novel dataset collected for this work, with high-quality videos of YouTubers with notable expressiveness in both the speech and visual signals.
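
The self-supervised setup described in the abstract relies on audio and frames that are already time-aligned in a video, so no manual labels are needed. A minimal sketch of how such (frame, speech window) pairs could be indexed is given below. The function name `aligned_pairs`, the one-second window, and the 25 fps / 16 kHz values are assumptions chosen for illustration; the abstract does not state the preprocessing actually used.

```python
from typing import List, Tuple


def aligned_pairs(
    num_frames: int,
    fps: float,
    sample_rate: int,
    num_samples: int,
    window_s: float = 1.0,  # assumed window length, not taken from the paper
) -> List[Tuple[int, int, int]]:
    """Pair each video frame with the audio window centred on its timestamp.

    Returns (frame_index, start_sample, end_sample) triples; end is exclusive.
    Frames whose window would run past either end of the audio are skipped.
    """
    half = int(round(window_s * sample_rate / 2))
    pairs = []
    for i in range(num_frames):
        # Timestamp of frame i, expressed as an audio sample index.
        centre = int(round(i / fps * sample_rate))
        start, end = centre - half, centre + half
        if start >= 0 and end <= num_samples:
            pairs.append((i, start, end))
    return pairs


# Hypothetical clip: 3 s of video at 25 fps with 16 kHz audio.
pairs = aligned_pairs(num_frames=75, fps=25.0, sample_rate=16000, num_samples=48000)
```

In a pipeline built on this sketch, the waveform slice `audio[start:end]` would serve as the conditioning input and the frame at `frame_index` as the real image shown to the discriminator.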