An Implementation of Multimodal Fusion System for Intelligent Digital Human Generation

October 31, 2023 · Declared Dead · 🏛 arXiv.org

Repo contents: LICENCE, README.md, framework_v1.png

Authors: Yingjie Zhou, Yaodong Chen, Kaiyue Bi, Lian Xiong, Hui Liu
arXiv ID: 2310.20251
Category: cs.MM (Multimedia)
Citations: 14
Venue: arXiv.org
Repository: https://github.com/zyj-2000/CUMT_2D_PhotoSpeaker (⭐ 20)
Last Checked: 1 month ago

Abstract

With the rapid development of artificial intelligence (AI), digital humans have attracted increasing attention and are expected to find wide application across several industries. However, most existing digital humans still rely on manual modeling by designers, which is a cumbersome process with a long development cycle. As digital humans become more widespread, there is an urgent need for an AI-based digital human generation system that improves development efficiency. In this paper, an implementation scheme of an intelligent digital human generation system with multimodal fusion is proposed. Specifically, text, speech, and image are taken as inputs, and interactive speech is synthesized using a large language model (LLM), voiceprint extraction, and text-to-speech conversion techniques. Next, the input image is age-transformed and a suitable image is selected as the driving image. The digital human video content is then generated and modified through digital human driving, novel view synthesis, and intelligent dressing techniques. Finally, we enhance the user experience through style transfer, super-resolution, and quality evaluation. Experimental results show that the system can effectively realize digital human generation. The related code is released at https://github.com/zyj-2000/CUMT_2D_PhotoSpeaker.
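
The stage ordering described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: `PipelineState`, `make_stage`, `run_pipeline`, and the stage names are hypothetical, and every stage is a placeholder that only records its name instead of running a model. The sketch only shows how the eleven steps might be chained in the order the abstract lists them.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class PipelineState:
    """Hypothetical container for the three input modalities."""
    text: str    # user text prompt
    speech: str  # placeholder for a reference speech clip (e.g. a file path)
    image: str   # placeholder for the input portrait (e.g. a file path)
    log: List[str] = field(default_factory=list)  # stages applied so far


Stage = Callable[[PipelineState], PipelineState]


def make_stage(name: str) -> Stage:
    """Build a placeholder stage that only records its own name."""
    def stage(state: PipelineState) -> PipelineState:
        state.log.append(name)
        return state
    return stage


# Stage order follows the abstract; the identifiers are assumptions.
STAGE_NAMES = [
    # interactive speech synthesis
    "llm_response", "voiceprint_extraction", "text_to_speech",
    # driving image preparation
    "age_transformation", "driving_image_selection",
    # video generation and modification
    "digital_human_driving", "novel_view_synthesis", "intelligent_dressing",
    # user experience enhancement
    "style_transfer", "super_resolution", "quality_evaluation",
]


def run_pipeline(state: PipelineState, stages: List[Stage]) -> PipelineState:
    """Apply each stage in order, passing the state along."""
    for stage in stages:
        state = stage(state)
    return state


result = run_pipeline(
    PipelineState(text="Hello", speech="reference.wav", image="portrait.png"),
    [make_stage(name) for name in STAGE_NAMES],
)
print(result.log)
```

In a real system each placeholder would be replaced by a call into the corresponding model, and the state would carry intermediate audio, image, and video artifacts instead of a log of names.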