Chinese company Tencent Cloud has launched a Deepfakes-as-a-Service (DFaaS) platform that creates realistic digital copies of people, with high visual detail and close voice similarity.
Creating a digital likeness requires three minutes of live video of the original person and 100 phrases spoken by that person.
A copy costs $145, and generating the “deepfake” takes 24 hours. In the future, customers will be able to choose the background and adjust their character's appearance, for example by adding a tan.
Tencent Cloud has unveiled its small-sample digital human production platform for the first time. The platform needs only a small set of training samples, offers high production efficiency and automated production, and enables low-cost, self-service creation of digital humans.
While digital humans are becoming increasingly popular with businesses and advertisers, their high production and operating costs have yet to be addressed. In the past, the complexity of collecting training samples made customization slow and expensive, which to some extent limited the rapid adoption of digital humans.
Drawing on Tencent’s self-developed AI capabilities and technical experience, the platform launched by Tencent Cloud Intelligence requires only 3 minutes of real-person broadcast video and 100 sentences of voice material. From multimodal audio and text input, it can model and generate high-definition portraits in real time and produce a digital human resembling the real person within 24 hours.

With this platform, producing a digital human can cost as little as 1,000 yuan and take only a matter of hours, greatly lowering the barrier to using digital humans.
Chen Lei, General Manager of Tencent Cloud Intelligent Digital Human Products, said that Tencent Cloud Intelligence hopes to build an automated “AI + Digital Human Factory” that relies on a one-stop “production, sales and service” platform to let customers buy, produce and deploy digital humans on a self-service basis.
Tencent Cloud Intelligence relies on a self-developed small-sample digital human driving framework and a general multimodal model based on a self-supervised mechanism, so users need to submit only a small amount of sample data for AI training.
For example, 3 minutes of real-person broadcast video and 100 sentences of voice material are enough to produce a digital human that looks and sounds like the real person, shortening the production cycle to about a day, at a price as low as <>,<> yuan.
Chen Lei said that small-sample digital humans support both half-body and full-body display, that gestures and movements adjust flexibly to the content, and that the recording background can be changed at will, making them suitable for a wider range of commercial scenarios such as live-stream selling.
Compared with premium 2D live-action digital humans, small-sample digital humans do not require material recorded in a professional studio, so they cost less. Compared with digital humans that are generated from photos and show only the face, small-sample digital humans can perform gestures designed from the text and reproduce a real person's style through lip movements, mouth shapes, and expressions.
Taking knowledge-sharing talking-head videos as an example, a small-sample digital human can appear on camera on behalf of doctors, lawyers and other professionals, greatly reducing the time spent recording video.
To accelerate the adoption of digital human services, Tencent Cloud Intelligence has also proposed an automated “AI + Digital Human Factory.” Its out-of-the-box digital human production service is built on the Tencent Cloud TI platform and includes more than 10 built-in AI algorithm capabilities. In the future, users without any algorithm or R&D experience will be able to customize a digital human's appearance and voice timbre at scale through self-service, simply by importing video and voice training material into the platform.
For operating digital humans, Tencent also provides a broadcast digital human platform and an interactive digital human platform. The broadcast platform quickly generates digital human videos from text and voice input. The interactive platform can create digital employees, customize dedicated Q&A libraries, provide 24/7 two-way human-computer interaction, and run digital human live streams in which a real person's voice can take over at any time to answer user questions.
Tencent has invested in digital human R&D and services since 2018, making it one of the earliest enterprises in China to enter the field, and has published hundreds of related conference and journal papers and nearly <> patents.

On the technical characteristics of Tencent's digital humans, Wang Chengjie, research director of Tencent Youtu Lab, said that 3D technology lies behind the 2D small-sample technology.
“A small-sample digital human looks like a 2D video, but it is actually supported by a 3D portrait. The pipeline goes from ‘text/audio’ information to ‘3D portrait driving’ and then to ‘2D portrait video’. By introducing prior information about 3D face structure, the digital human's mouth shapes and expressions become more accurate,” Wang Chengjie said.
In addition, after training on large-scale data, the general multimodal model based on the self-supervised mechanism can correlate speech and text with the portrait's expressions and mouth shapes.
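The three-stage flow Wang describes (text/audio, then 3D portrait driving, then 2D portrait video) can be pictured with a toy sketch. Everything in it is a hypothetical stand-in rather than Tencent's implementation: the names `DriveFrame`, `VideoFrame`, `text_to_drive`, `render_2d` and `generate_video` are invented for illustration, and a simple vowel count takes the place of the trained multimodal model.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DriveFrame:
    """Hypothetical per-frame 3D driving parameters (not Tencent's format)."""
    mouth_open: float  # 0.0 (closed) to 1.0 (fully open)
    expression: str


@dataclass
class VideoFrame:
    """Hypothetical rendered 2D frame, described as text for illustration."""
    index: int
    description: str


def text_to_drive(text: str) -> List[DriveFrame]:
    """Stage 1 to 2: turn text into 3D driving parameters.

    Assumption: one frame per word, with mouth opening derived from a
    vowel count. A real system would use a trained model here.
    """
    frames = []
    for word in text.split():
        vowels = sum(ch in "aeiou" for ch in word.lower())
        frames.append(DriveFrame(mouth_open=min(1.0, vowels / 3),
                                 expression="neutral"))
    return frames


def render_2d(drive: List[DriveFrame]) -> List[VideoFrame]:
    """Stage 2 to 3: project the 3D driving parameters into 2D frames."""
    return [
        VideoFrame(i, f"mouth={f.mouth_open:.2f}, expr={f.expression}")
        for i, f in enumerate(drive)
    ]


def generate_video(text: str) -> List[VideoFrame]:
    """Full pipeline: text -> 3D portrait driving -> 2D portrait video."""
    return render_2d(text_to_drive(text))


if __name__ == "__main__":
    for frame in generate_video("hello digital human"):
        print(frame.index, frame.description)
```

The point of the intermediate `DriveFrame` stage is the one made in the quote: the 2D output is not generated directly from text or audio but from an explicit 3D representation, which is where prior knowledge about face structure can be applied.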
Wang Chengjie said that although the barrier to entry and the cost of small-sample digital humans have been greatly reduced, Tencent still hopes to improve their quality through the combined use of several visual AI technologies, including high-precision portrait segmentation, lighting optimization, portrait beautification, and gaze correction.
For voice reproduction, small-sample digital humans use Tencent’s self-developed new generation of small-sample timbre customization technology together with deep learning acoustic models and neural network vocoders. This reduces the monotonous rhythm and flat intonation typical of traditional acoustic models and makes the synthesized speech more refined.
In addition, by building a pre-trained base model on large-scale, high-quality timbre data, small-sample digital humans will in the future let users synthesize English and dialect speech after recording only Mandarin.
At present, Tencent Cloud's intelligent digital humans cover five visual styles: 3D realistic, 3D semi-realistic, 3D cartoon, 2D real person, and 2D cartoon. They can render very subtle facial expressions and hundreds of body movements, and the platform supports image asset management, business service configuration, and content production services. Reportedly, dozens of partners already offer digital human live-streaming SaaS and knowledge talking-head SaaS applications to industries including healthcare, media and finance.
