Create Realistic Lip-Sync Videos with ComfyUI's Infinite Talk!

AICU media

Hello everyone! This time, I'd like to introduce a wonderful ComfyUI feature: "Infinite Talk". It is a technology that can generate surprisingly natural lip-sync videos from just one image and an audio file. Based on the explanations by Purrs, Phil, and Julian on the Comfy Org stream, let's take a closer look at what it can do and how to use it.


What is Infinite Talk?

Infinite Talk is an excellent lip-sync model that, from a single still image, generates not only lip movement but the character's entire motion in time with songs and speech.

What makes this model impressive is that it goes beyond moving the mouth: it also reproduces physically plausible secondary motion. For example, when a character moves while singing, earrings sway and clothes flutter naturally with the body's movement. This makes it possible to create very lively videos.

πŸ—£οΈ Since their release, WAN InfiniteTalk/ MultiTalk / S2V have become the backbone of some of the coolest ComfyUI experiments β€” extended lipsync videos, and entire performances synced with AI voices. Seeing the community take it this far has been nothing short of magic. 🌌πŸ”₯ pic.twitter.com/YF7RbkHJ52

β€” ComfyUI (@ComfyUI)

InfiniteTalk in ComfyUI. It seems to work quite well with Japanese.

For now, I just used Kijai's sample. https://t.co/fphtcui6ug https://t.co/U2eCSFU1Rc pic.twitter.com/sLk3r9gK3a

β€” Baku (@bk_sakurai)

Explanation of the Workflow in ComfyUI

Let's take a look at the basic flow and points for actually using Infinite Talk in ComfyUI.

What You Need

  1. Input Image: The image of the character you want to animate.
  2. Audio File: Audio such as songs or lines.
  3. Prompt: Simple instructions for the video you want to generate.

Workflow Mechanism

This workflow is cleverly designed.

  1. First, only the vocal part is extracted from the input audio file.
  2. In time with those vocals, the lip movements and facial expressions are generated for the image.
  3. Finally, the generated video is recombined with the original audio file (the full mix, including the music, not just the vocals).

As a result, the number of frames (the length of the video) exactly matches the original audio, so there is no need to worry about the sound drifting out of sync.
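The frame-matching step above comes down to simple arithmetic: the clip's frame count is derived from the audio's duration. The helper below is a hypothetical illustration of that idea, not part of the actual workflow code.

```python
# Illustrative sketch (not the actual workflow code) of frame matching:
# deriving the frame count from the audio duration keeps the finished
# video lined up with the original sound.

def frames_for_audio(duration_seconds: float, fps: int = 24) -> int:
    """Number of video frames needed to cover the full audio duration."""
    return round(duration_seconds * fps)

# A 20.5-second vocal track rendered at 24 FPS needs 492 frames.
print(frames_for_audio(20.5, 24))
```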

Recommended Environment

During the stream, the "WanVideoWrapper" custom node pack was recommended, since it makes extensions easy to manage and supports more models. The sample workflow "Infinite Talk example 3" included in this wrapper is the basis for this explanation.

About Hardware and Generation Time

  1. VRAM: On a GPU with 24GB of VRAM, a video of about 500 frames (roughly 20 seconds at 24 FPS) is a good guideline. With a higher-performance GPU, you can of course generate even longer videos. During the demo, VRAM usage between 16GB and 34GB was observed.
  2. Rendering Time: Long videos take a while to generate, so it is efficient to use the "Audio Crop" node to cut out only your favorite part of the song.
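To make the Audio Crop tip concrete, the snippet below works out how many seconds of song fit into a given frame budget. It is just illustrative arithmetic, assuming the 500-frame / 24 FPS figures mentioned above; the helper name is hypothetical.

```python
# Sketch: convert a GPU frame budget into an audio crop window (seconds).
# Assumes the 24 FPS output rate discussed in the article.

def crop_window(start_seconds: float, frame_budget: int, fps: int = 24):
    """Return (start, end) in seconds for an audio crop that fills the
    given frame budget at the given frame rate."""
    return start_seconds, start_seconds + frame_budget / fps

# A 500-frame budget at 24 FPS covers about 20.8 seconds of audio,
# so a crop starting at 0:30 would end near 0:50.8.
start, end = crop_window(30.0, 500, 24)
print(start, round(end, 1))
```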

Comparison with Other Lip-Sync Technologies

During the stream, several audio-to-video technologies besides Infinite Talk were also introduced.

Animate: A model that can achieve very high-precision lip-sync without using audio.

Humo: Produces great results in music video production, but has the constraint that it can only generate short videos at a time (small context window).

Infinite Talk is particularly superior to these technologies in that it can generate longer videos.

Examples and Tips for Actual Use

During the stream, several demos were shown, and some very creative videos came out of them.

Julian created a fantastic and beautiful video based on an image of a woman singing while drawing. In the demo using an image of a claymation-style punk rocker screaming, dynamic expressions beyond the prompt were added, such as a drummer appearing in the background.

Secrets to Success

Input image is important: In particular, starting with an image where the mouth is clearly open makes it easier for the model to recognize the shape of the mouth, enabling consistent teeth expression throughout the video.

Enjoy trial and error: You won't necessarily get perfect results the first time. Sometimes, changing the seed value and trying it many times can give you the ideal results.

Combine short clips: Instead of sticking to long one-shot videos, you can create more fast-paced, professional works by generating multiple short clips and connecting them with editing.
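One simple way to join such clips outside ComfyUI is ffmpeg's concat demuxer. The sketch below builds the required file list; the clip filenames are hypothetical placeholders for your generated segments.

```python
# Build a file list for ffmpeg's concat demuxer.
# Clip names are placeholders; substitute your own generated segments.
clips = ["clip_01.mp4", "clip_02.mp4", "clip_03.mp4"]
concat_list = "".join(f"file '{name}'\n" for name in clips)

with open("concat_list.txt", "w") as f:
    f.write(concat_list)

# Then join them losslessly (same codec and resolution assumed):
#   ffmpeg -f concat -safe 0 -i concat_list.txt -c copy combined.mp4
print(concat_list)
```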

Caution: Because this model was trained mainly on Chinese data, the mouth shapes for some vowels may look slightly unnatural in other languages such as English. This is a limitation of the current model, but it is still capable of very high-quality lip-sync.

Summary

Infinite Talk is a very powerful and fun lip-sync tool available in ComfyUI. There are some caveats and "quirks", but the possibilities are endless for creating lively singing or conversation scenes from a single image.

The workflow introduced here can be downloaded from Kijai's "ComfyUI-WanVideoWrapper" GitHub repository, so please give it a try!