Introduction

The script-driven multimodal collaboration technology for hyper-realistic digital humans, released in April 2025, has made breakthroughs on key challenges such as real-time multimodal collaboration and comprehensive dynamic interaction. It delivers stable, efficient, hyper-realistic digital human live-streaming characterized by high expressiveness, strong emotional impact, and free interaction among people, objects, and environments. To date, over 100,000 digital humans have been produced using this technology, covering industries such as e-commerce, education, and law. It has helped reduce the cost of starting a live stream by 80% and increased conversion rates by 31%.

Script-driven multimodal collaboration hyper-realistic digital human technology

To address challenges such as multi-style script generation, high-fidelity voice synthesis, dual-anchor interaction, and keeping a digital human's voice, appearance, and language consistent over extended durations, the technology combines five innovative techniques: (1) script-driven multimodal collaboration, which ensures highly realistic voice, appearance, and language; (2) script generation integrating Multimodal Planning and Deep Reasoning, which makes live-stream content more engaging and more knowledgeable; (3) real-time interaction with dynamic decision-making, which keeps interaction with users natural and smooth; (4) text-controllable speech synthesis, which produces more natural-sounding voices; and (5) high-consistency, hyper-realistic long video generation, which addresses long-term identity preservation and complex interactions among humans, objects, and environments. Together, these techniques have broken through industry bottlenecks and raised digital human technology to a higher level.
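As a rough illustration of what "script-driven" coordination could look like, the Python sketch below derives speech, expression, and motion cues from a single shared script so that all three modalities cover the same time window. It is a minimal sketch under stated assumptions, not the released system: the `ScriptSegment` and `ModalityCue` schemas, the `plan_cues` function, and the speaking-rate constant are all hypothetical names and values introduced here for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ScriptSegment:
    """One unit of a live-stream script (hypothetical schema)."""
    text: str      # what the digital human says
    emotion: str   # e.g. "excited", "calm"
    action: str    # e.g. "wave", "hold_product"


@dataclass
class ModalityCue:
    """A cue for one modality, tied to a script segment and a time window."""
    modality: str        # "speech", "expression", or "motion"
    segment_index: int
    start_s: float
    end_s: float
    payload: str


def estimate_duration_s(text: str, chars_per_second: float = 15.0) -> float:
    """Rough speaking-time estimate; the rate is an illustrative assumption."""
    return max(len(text) / chars_per_second, 0.5)


def plan_cues(script: List[ScriptSegment]) -> List[ModalityCue]:
    """Derive speech, expression, and motion cues from one shared script.

    All three cues for a segment share the same time window, so voice,
    face, and body stay aligned by construction.
    """
    cues: List[ModalityCue] = []
    clock = 0.0
    for i, seg in enumerate(script):
        end = clock + estimate_duration_s(seg.text)
        cues.append(ModalityCue("speech", i, clock, end, seg.text))
        cues.append(ModalityCue("expression", i, clock, end, seg.emotion))
        cues.append(ModalityCue("motion", i, clock, end, seg.action))
        clock = end
    return cues


if __name__ == "__main__":
    demo_script = [
        ScriptSegment("Welcome to tonight's live stream!", "excited", "wave"),
        ScriptSegment("Let me show you this book.", "calm", "hold_product"),
    ]
    for cue in plan_cues(demo_script):
        print(f"{cue.start_s:6.2f}-{cue.end_s:6.2f}s  "
              f"{cue.modality:<10} {cue.payload}")
```

The design point is that the script is the single source of timing: each downstream generator (speech, face, body) would consume cues with identical windows rather than synchronizing after the fact. A production system would presumably take durations from the synthesized audio rather than a character count.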

Producing over 100,000 digital humans and leading the development of the digital human industry

Based on this script-driven multimodal collaboration technology, over 100,000 digital humans have been produced and applied in industries such as e-commerce, education, and law.

It has helped reduce live-streaming startup costs by 80% and increase conversion rates by 31%.

Notably, a live stream on Baidu Youxuan featuring digital versions of two entrepreneurs, produced with this technology, ran as a 6-hour hyper-realistic broadcast that attracted 13 million viewers and achieved a GMV exceeding 55 million RMB. This result indicates that the technology can attract a large audience, boost live-stream traffic and sales, and bring significant economic benefits to e-commerce platforms and merchants.

Currently, this technology has been adopted in Baidu's e-commerce live streams in categories such as celebrity, books, and health. It has delivered outstanding online performance, outperforming human anchors in live-streaming and demonstrating broad application prospects and market potential.

Overcoming the challenge of multimodal coordination and leading the development of digital human technology

In terms of technological progress, this technology has overcome the challenge of multimodal coordination for digital humans in live-streaming, achieving a high level of synchronization and coordination across language, voice, actions, and expressions. This script-driven multimodal coordination provides new ideas and methods for digital human research, promoting the integration of perception, understanding, and expression to achieve better content generation and user interaction.

In terms of industrial restructuring, this technology has reduced e-commerce live-streaming's heavy reliance on human anchors, whose availability is limited and whose streaming time is constrained. It has made digital human anchors a new force in the live-streaming industry, enabling uninterrupted 24/7 live-streaming free of time and location constraints. Furthermore, digital human anchors can be switched freely across scenarios to meet user demands and improve the user experience. The technology is also changing the employment structure. On the one hand, it creates new positions in digital human R&D, promotion, operation, and maintenance; on the other hand, it gives human anchors opportunities to transition and upgrade their skills to keep pace with the industry's rapid growth.


The World Internet Conference (WIC) was established as an international organization on July 12, 2022, and is headquartered in Beijing, China. It was jointly initiated by the Global System for Mobile Communications Association (GSMA), the National Computer Network Emergency Response Technical Team/Coordination Center of China (CNCERT), the China Internet Network Information Center (CNNIC), Alibaba Group, Tencent, and Zhijiang Lab.