SSGVS: Semantic Scene Graph-to-Video Synthesis

November 11, 2022 · Declared Dead · 🏛 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)

⏳ CAUSE OF DEATH: Coming Soon™
Promised but never delivered

"Paper promises code 'coming soon'"

Evidence collected by the PWNC Scanner

Authors: Yuren Cong, Jinhui Yi, Bodo Rosenhahn, Michael Ying Yang
arXiv ID: 2211.06119
Category: cs.CV (Computer Vision)
Citations: 8
Venue: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)
Last Checked: 1 month ago
Abstract
As a natural extension of the image synthesis task, video synthesis has attracted considerable interest recently. Many image synthesis works utilize class labels or text as guidance. However, neither labels nor text can provide explicit temporal guidance, such as when an action starts or ends. To overcome this limitation, we introduce semantic video scene graphs as input for video synthesis, as they represent the spatial and temporal relationships between objects in the scene. Since video scene graphs are usually temporally discrete annotations, we propose a video scene graph (VSG) encoder that not only encodes the existing video scene graphs but also predicts graph representations for unlabeled frames. The VSG encoder is pre-trained with different contrastive multi-modal losses. A semantic scene graph-to-video synthesis framework (SSGVS), built on the pre-trained VSG encoder, a VQ-VAE, and an auto-regressive Transformer, is proposed to synthesize a video given an initial scene image and a variable number of semantic scene graphs. We evaluate SSGVS and other state-of-the-art video synthesis models on the Action Genome dataset and demonstrate the benefit of video scene graphs for video synthesis. The source code will be released.
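Since the promised release never surfaced, the pipeline can only be sketched from the abstract. Below is a minimal PyTorch illustration of how the three pieces it names (a pre-trained VSG encoder producing per-frame graph embeddings, a frozen VQ-VAE supplying discrete frame codes, and an auto-regressive Transformer predicting the codes of later frames) could plug together. Every class, shape, and hyperparameter here (`VSGEncoder`, `SSGVSSynthesizer`, `tokens_per_frame`, the GRU-plus-interpolation stand-in for unlabeled-frame prediction) is an assumption for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class VSGEncoder(nn.Module):
    """Stand-in for the pre-trained VSG encoder: embeds the temporally sparse
    annotated scene graphs and yields one representation per video frame,
    including the unlabeled ones (here via naive temporal interpolation)."""

    def __init__(self, graph_dim=128, embed_dim=256, num_frames=16):
        super().__init__()
        self.proj = nn.Linear(graph_dim, embed_dim)
        self.temporal = nn.GRU(embed_dim, embed_dim, batch_first=True)
        self.num_frames = num_frames

    def forward(self, graph_feats):                   # (B, T_annotated, graph_dim)
        h, _ = self.temporal(self.proj(graph_feats))  # (B, T_annotated, embed_dim)
        # Densify to one embedding per frame; the paper instead *predicts*
        # representations for the unlabeled frames.
        return F.interpolate(h.transpose(1, 2), size=self.num_frames,
                             mode="linear").transpose(1, 2)


class SSGVSSynthesizer(nn.Module):
    """Auto-regressive Transformer over VQ-VAE code indices, conditioned on
    the initial frame's codes and the per-frame graph embeddings."""

    def __init__(self, vocab_size=1024, embed_dim=256, tokens_per_frame=64):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, embed_dim)
        layer = nn.TransformerEncoderLayer(embed_dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(embed_dim, vocab_size)
        self.tokens_per_frame = tokens_per_frame

    def forward(self, codes, graph_embeds):
        # codes: (B, T * tokens_per_frame) indices from a frozen VQ-VAE;
        # graph_embeds: (B, T, embed_dim) from the VSG encoder.
        x = self.token_embed(codes)
        # Broadcast each frame's graph embedding over that frame's codes.
        cond = graph_embeds.repeat_interleave(self.tokens_per_frame, dim=1)
        x = x + cond[:, : x.size(1)]
        # Causal mask so each code attends only to earlier codes.
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        return self.head(self.backbone(x, mask=mask))  # (B, L, vocab_size)


# Toy forward pass: 4 annotated scene graphs steer a 16-frame synthesis.
B, T, P = 2, 16, 64
vsg = VSGEncoder(num_frames=T)
model = SSGVSSynthesizer(tokens_per_frame=P)
graphs = torch.randn(B, 4, 128)              # pooled scene-graph features
codes = torch.randint(0, 1024, (B, T * P))   # VQ-VAE indices of the frames
logits = model(codes, vsg(graphs))           # next-code logits, (B, T*P, 1024)
```

At inference, one would tokenize only the initial scene image with the VQ-VAE, sample the remaining codes frame by frame from these logits, and decode them back to pixels; the frozen VQ-VAE decoder itself is omitted above for brevity.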
