Robust Photo-Realistic Hand Gesture Generation: from Single View to Multiple View

May 14, 2025 · Declared Dead · 🏛 ACM Multimedia

Repo contents: LICENSE

Authors Qifan Fu, Xu Chen, Muhammad Asad, Shanxin Yuan, Changjae Oh, Gregory Slabaugh arXiv ID 2505.10576 Category cs.GR: Graphics Citations 0 Venue ACM Multimedia Repository https://github.com/fuqifan/MUFEN Last Checked 1 month ago

Abstract

High-fidelity hand gesture generation represents a significant challenge in human-centric generation tasks. Existing methods typically employ a single-view mesh-rendered image prior to enhancing gesture generation quality. However, the spatial complexity of hand gestures and the inherent limitations of single-view rendering make it difficult to capture complete gesture information, particularly when fingers are occluded. The fundamental contradiction lies in the loss of 3D topological relationships through 2D projection and the incomplete spatial coverage inherent to single-view representations. Diverging from single-view prior approaches, we propose a multi-view prior framework, named Multi-Modal UNet-based Feature Encoder (MUFEN), to guide diffusion models in learning comprehensive 3D hand information. Specifically, we extend conventional front-view rendering to include rear, left, right, top, and bottom perspectives, selecting the most information-rich view combination as training priors to address occlusion. This multi-view prior with a dedicated dual stream encoder significantly improves the model's understanding of complete hand features. Furthermore, we design a bounding box feature fusion module, which can fuse the gesture localization features and multi-modal features to enhance the location-awareness of the MUFEN features to the gesture-related features. Experiments demonstrate that our method achieves state-of-the-art performance in both quantitative metrics and qualitative evaluations. The source code is available at https://github.com/fuqifan/MUFEN.