[LTX-2] ComfyUI Standard Support! Another Dimension of Video Generation Enabling Simultaneous Generation of Video and Audio (Part 2)

AICU media

On January 6, 2026, the open-source audio/video generation AI model "LTX-2" gained native support in ComfyUI. The biggest feature of this model is that it generates dialogue, environmental sounds, and BGM in a single pass, simultaneously with the video. It has even been shown to sing in Japanese. In this second part, AICU AIDX Lab shares a technical explanation of LTX-2 and shows how to build a working environment on Google Colab. There is valuable information at the end, so please read through!

Sing in Japanese! #ComfyUI #LTX2 #SakiNoire https://t.co/mCmV8LmO3F pic.twitter.com/Rss5ci74L0

— AICU - Creating Creators (@AICUai)

The era of video generation AI in which "only the picture moves" seems to be ending quickly. Native ComfyUI support for LTX-2, which generates sound, dialogue, sound effects, and visuals together in a harmonized way, feels like a major turning point for creators. The workflow and common sense of video production may be fundamentally rewritten from here on.

[LTX-2] ComfyUI Standard Support! Another Dimension of Video Generation Enabling Simultaneous Generation of Video and Audio (Part 1) | AICU


https://j.aicu.ai/260107

Appearing at the Top Even in Official Templates

It is also being prominently featured in ComfyUI's official templates.

However, recent ComfyUI is full of technical jargon and tends to assume hard-to-obtain high-end hardware, so despite being open video generation, it is undeniably difficult for beginners. And since reading alone will not make video generation click ("you won't understand video generation just by reading about it!"), AICU plans to gradually expand its presence on YouTube. The prompts for each example are also included in the video description, so please watch it first.

https://www.youtube.com/watch?v=d0P40hEY8HU

Let's Learn the [Time/Visual/Sound] New Generation Prompt Format

LTX-2 differs from conventional image generation AI in that the structure of the prompt itself has changed: it is recommended to describe the following three elements. AICU summarizes the "Unified Audio-Visual Prompting" described in the LTX-2 paper, in our own words, as "Time / Visual / Sound," which is easy to remember. The prompt below is a translation of the one used in the video above.

① Time Passage: Write how events and actions change over time.

In a cinematic, photorealistic style, a woman sits in a hospital waiting room, holding a medical report in her hands. Intense fluorescent lights cast a cold, medical glow on her pale face. She silently reads the report, frowning with each line.

② Visual Details: Describe all the visual elements you want to appear on the screen.

The camera starts with a wide shot, showing an empty waiting room with sterile white walls and patterned teal wallpaper. It then slowly zooms in to a close-up of her face.

③ Audio: Describe the "sounds" and "dialogue" needed for the scene.

A melancholic piano-based melody plays softly as background music. The gentle, emotional melody is punctuated by the sound of strings, echoing the weight of the moment. The hustle and bustle of the hospital—muffled voices, the beeping of machines, footsteps in the hallway, the rustling of the medical report—fades in and out, blending with the music. After a long silence, she raises her eyes from the report, looking forward with tears in her eyes. "I have to tell him," she says quietly, her voice slightly choked. As she carefully folds the paper, stands up, and walks toward the hallway, the camera pulls back, her figure shrinking as she turns the corner and disappears.
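To make the three-part structure concrete, here is a trivial helper that concatenates the Time/Visual/Sound sections into one flowing prompt paragraph. This function is purely illustrative and is not part of any LTX-2 or ComfyUI API:

```python
def build_ltx2_prompt(time_passage: str, visual: str, audio: str) -> str:
    """Join the Time / Visual / Sound elements into a single flowing prompt."""
    parts = (time_passage, visual, audio)
    return " ".join(p.strip() for p in parts if p.strip())

prompt = build_ltx2_prompt(
    "A woman sits in a hospital waiting room, silently reading a medical report.",
    "The camera starts with a wide shot of sterile white walls, then zooms to her face.",
    "A melancholic piano melody plays softly; muffled hospital sounds fade in and out.",
)
print(prompt)
```

In practice you would paste the joined string into the text-prompt node of the workflow; keeping the three elements as separate variables just makes them easier to iterate on.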

What's New with LTX-2?

First of all, LTX-2 is not a U-Net architecture like conventional Stable Diffusion, but a model based on a Diffusion Transformer (DiT). Its defining feature is that a single model is designed to generate images, video, and even audio in a unified, synchronized way. Long-range data dependencies (the context of a video) are handled more accurately by the transformer's attention mechanism. LTX-2 is not a single monolithic network but a huge "asymmetric dual-stream" structure combining a 14-billion-parameter video stream with a 5-billion-parameter audio stream.
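The exact layer layout of LTX-2 is not reproduced here, but the general idea of an asymmetric dual-stream design, two token streams of different widths projected into a shared space and mixed by joint attention, can be sketched in a few lines of NumPy. All dimensions below are made up for illustration and are not the model's real sizes:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Asymmetric streams: the video stream is wider than the audio stream.
n_video, d_video = 16, 64   # video tokens (the larger stream)
n_audio, d_audio = 8, 32    # audio tokens (the smaller stream)
d_attn = 32                 # shared attention width

video = rng.standard_normal((n_video, d_video))
audio = rng.standard_normal((n_audio, d_audio))

# Each stream projects into the shared attention space...
Wv = rng.standard_normal((d_video, d_attn)) / np.sqrt(d_video)
Wa = rng.standard_normal((d_audio, d_attn)) / np.sqrt(d_audio)
tokens = np.concatenate([video @ Wv, audio @ Wa], axis=0)  # (24, 32)

# ...then joint self-attention lets every video token attend to
# every audio token and vice versa, which is what keeps the two
# modalities synchronized.
scores = tokens @ tokens.T / np.sqrt(d_attn)
mixed = softmax(scores) @ tokens

video_out, audio_out = mixed[:n_video], mixed[n_video:]
print(video_out.shape, audio_out.shape)
```

The real model alternates many such blocks (plus per-stream projections back out), but the cross-modal mixing step is the conceptual core.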

Incidentally, the reason "model_type FLUX" appears in the ComfyUI log is that LTX-2's architecture shares a structure (DiT block arrangement, etc.) very close to FLUX.1. The paper "LTX-2: Efficient Joint Audio-Visual Foundation Model" (https://arxiv.org/abs/2601.03233) states that it inherits the design principles of the earlier "LTX-Video." Like Black Forest Labs (BFL)'s "FLUX," it adopts a DiT architecture and uses "Rectified Flow" for efficiency. According to the paper, despite this huge configuration, Rectified Flow dramatically reduces the number of sampling steps, and the model runs roughly 18 times faster than models such as Wan 2.2 on an H100. On the implementation side, ComfyUI's backend model management reuses and extends the FLUX inference code to run LTX-2, which is why the shared identifier is displayed.
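The key idea behind Rectified Flow is that the model is trained so the probability-flow paths from noise to data are nearly straight, which means a handful of Euler steps suffice. The toy sketch below integrates an analytically straight velocity field in 8 steps. It is not the actual LTX-2 sampler, only the numerical scheme it relies on, with a hypothetical velocity function standing in for the network:

```python
import numpy as np

def rectified_flow_sample(x0, velocity, steps=8):
    """Few-step Euler integration of the probability-flow ODE dx/dt = v(x, t)."""
    x, dt = x0.copy(), 1.0 / steps
    for i in range(steps):
        t = i * dt
        x = x + velocity(x, t) * dt
    return x

# Toy stand-in for the network: on a perfectly straight path toward
# `target`, the marginal velocity at time t is (target - x) / (1 - t),
# so Euler integration lands exactly on the target in few steps.
rng = np.random.default_rng(0)
target = np.array([1.0, -2.0, 0.5])
x0 = rng.standard_normal(3)
v = lambda x, t: (target - x) / (1.0 - t)

sample = rectified_flow_sample(x0, v, steps=8)
```

A trained model only approximates straight paths, so real samplers need a few more steps (hence the distilled 8-step variant discussed below), but the straighter the paths, the fewer steps are wasted.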

Can It Handle Japanese Because It Uses Gemma?

Since Gemma 3 12B IT is used as the internal text encoder that interprets prompts, language understanding has improved dramatically. Because Gemma 3 is a multilingual instruct model, it can grasp the "concept" even when the prompt is written in Japanese and reflect it in the video. The final quality of the "drawing" still depends on the video model's own training data, but Japanese passes through far more easily than with conventional CLIP. Note, however, that the Gemma 3 12B IT used in the workflow is about 12 GB, comparable by itself to the main body of a typical video generation model. In fact, raw weights without FP8 or other quantization would require more than 24 GB of VRAM for a 12B model; it fits in around 12 GB only with quantized versions (FP8, 4-bit, etc.). In other words, the workflow specifies a particular quantized version, a choice likely made so the whole pipeline fits in roughly 20 GB of VRAM.
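The VRAM figures above follow from simple arithmetic: weight memory is roughly parameter count times bits per weight. A quick sanity check (weights only; activations, KV cache, and the video model itself add more on top):

```python
def model_memory_gb(params_billion: float, bits_per_param: float) -> float:
    """Approximate weight memory: params * bits / 8 bits-per-byte, in GB."""
    return params_billion * 1e9 * bits_per_param / 8 / 1e9

# Gemma 3 12B at different precisions:
print(model_memory_gb(12, 16))  # FP16: full precision, exceeds a 20 GB card
print(model_memory_gb(12, 8))   # FP8: matches the ~12 GB file in the workflow
print(model_memory_gb(12, 4))   # 4-bit: leaves room for the video model
```

The same formula explains the 19B video model's file sizes in the next section: halving the bits per weight halves the file.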

Providing a "Distilled Version" in the Official Template

Distilled models have not commonly appeared in ComfyUI's official templates, but this time LTX-2's distilled variant is supported from day zero.

Many "Parts (Models)" Are Required

LTX-2 does not run as a single file; it works by combining multiple huge parts. The subgraph of the basic text-to-video (t2v) workflow looks like this.

  • Checkpoints (Main Body) : The main engine for video generation (19B model).

  • Text Encoder (Language Understanding) : The brain for understanding prompts. Uses the latest Gemma 3.

  • LoRAs (Additional Control) :

    • Distilled : Accelerator to generate quickly with fewer steps.

    • Camera Control : A control device for specifying camera work such as "dolly left."

  • Upscale Models : Function to improve the image quality (double the resolution) of generated videos.

These parts must be placed in specific folders in ComfyUI to be recognized.

  • checkpoints/ : ltx-2-19b-dev-fp8.safetensors etc.

  • text_encoders/ : gemma_3_12B_it.safetensors

  • loras/ : ltx-2-19b-distilled-lora-384.safetensors etc.

  • latent_upscale_models/ : ltx-2-spatial-upscaler-x2-1.0.safetensors
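As a sanity check before launching ComfyUI, a small script can verify that every expected file is in place. The filenames below mirror the folder list above; the helper function itself is our own hypothetical utility, not part of ComfyUI:

```python
from pathlib import Path

# Expected layout under ComfyUI's models/ directory
# (filenames as listed in the official LTX-2 workflow).
EXPECTED = {
    "checkpoints": ["ltx-2-19b-dev-fp8.safetensors"],
    "text_encoders": ["gemma_3_12B_it.safetensors"],
    "loras": ["ltx-2-19b-distilled-lora-384.safetensors"],
    "latent_upscale_models": ["ltx-2-spatial-upscaler-x2-1.0.safetensors"],
}

def missing_models(models_root: str) -> list[str]:
    """Return the relative paths of expected model files that are absent."""
    root = Path(models_root)
    return [
        str(Path(folder) / name)
        for folder, names in EXPECTED.items()
        for name in names
        if not (root / folder / name).exists()
    ]
```

Running `missing_models("ComfyUI/models")` before a session avoids discovering a missing 19 GB checkpoint only when the first node errors out.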

What Is Being Distilled in "Distillation"?

"Distillation" usually refers to a technique to reduce the number of generation steps (Step Distillation), but the LTX-2 "Distilled Version (35GB)" and "Standard Version (19GB/FP8)" that are distributed this time are a little different. Generally, the larger file size (about 35GB) is FP16 precision, and the smaller file size (about 19GB) is FP8 precision. The double file size is due to the difference in precision (number of bits). If you don't understand well that it is a contrast between "high-quality FP16 (distilled) version" and "lightweight FP8 version", you will misunderstand the opposite. ltx-2-19b-distilled used by the official workflow seems to be a model that has distilled (compressed) the "number of generation steps". It has been re-trained to complete a high-quality generation process that would normally take 20 to 50 steps in only about 8 steps. Speaking of which, Alibaba's "Z-Image- Turbo" recently released also uses a similar method, but it accelerates the generation speed by several times while maintaining the quality.

Too Fast!? "Z-Image-Turbo" from Alibaba Is Born!!

www.aicu.jp Alibaba's latest image generation AI "Z-Image-Turbo" announced. Lightweight and fast, it runs on consumer GPUs. Technical details and usage are explained.

https://www.aicu.jp/post/z-image-turbo-20251127

AICU's ComfyUI Operating Environment Under Development

Regarding the operating environment, even 20 GB of VRAM was described as tight, so AICU AIDX Lab is tuning and verifying operation on the latest Google Colab. This time, however, not only VRAM but also disk capacity (at most 112 GB) is a constraint. Even with the environment fully set up, trying to run the 19B LTX-2 model with only about 12 GB of disk remaining is extremely risky (the disk is nearly full). The distilled version introduced in the official workflow is about 35 GB, the standard version about 19 GB, and the text encoder (Gemma 3) about 12 GB; simply downloading these can fill the disk, and files may fail to save correctly or end up corrupted. Within the 112 GB limit, keeping both the huge 35 GB distilled model and the 19 GB standard (FP8) model quickly leads, once libraries and custom nodes are added, to a state where even temporary files cannot be created. Meanwhile, the H100 "Hopper" GPU recently introduced on Google Colab is said to be 6 to 30 times more capable than the previously strongest A100 "Ampere" GPU, depending on the workload. To take advantage of that power, we concluded that a lightweight, fast dedicated notebook limited to the "19 GB FP8 version" strikes the best balance between disk management and generation speed.
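The disk budget can be checked with the article's own numbers: the three large files alone consume well over half of Colab's 112 GB before any libraries, custom nodes, or outputs are counted.

```python
# Rough Colab disk budget, sizes in GB as cited in this article.
DISK_LIMIT = 112
files = {
    "ltx-2 distilled (FP16)": 35,
    "ltx-2 standard (FP8)": 19,
    "gemma 3 text encoder": 12,
}
used = sum(files.values())
print(f"models: {used} GB, remaining for everything else: {DISK_LIMIT - used} GB")
```

Dropping the 35 GB distilled model from the set, as the FP8-only notebook does, is by far the biggest single saving available.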

With models of this size, downloading itself requires some ingenuity, since multiple huge files must be fetched. Traditional methods such as wget and aria2 are no longer practical, so we set a token (HF_TOKEN) and use Hugging Face's official download tooling. Such tokens have been common since Stable Diffusion 3 (SD3) introduced license management. The method we are developing checks whether each file already exists at its expected location, and also includes a "model relocation & cleanup script" that reorganizes files using symbolic links. Besides speed and stability, we have also implemented a convenience feature: completion notifications from ComfyUI to Discord.
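The existence check and symlink relocation described above can be sketched as follows. This is our own illustrative helper, not AICU's published script: the `download` argument is an injection point, which in practice would wrap `huggingface_hub.hf_hub_download` with HF_TOKEN configured.

```python
import os
from pathlib import Path

def ensure_model(filename: str, cache_dir: str, models_dir: str, download) -> Path:
    """Download a model only if it is missing, then expose it in the ComfyUI
    models folder via a symlink so the file is never duplicated on a small disk.

    `download(filename, cache_dir)` is any callable returning the cached path,
    e.g. a thin wrapper around huggingface_hub.hf_hub_download (hypothetical here).
    """
    cached = Path(cache_dir) / filename
    if not cached.exists():                    # skip the download if already cached
        cached = Path(download(filename, cache_dir))
    link = Path(models_dir) / filename
    link.parent.mkdir(parents=True, exist_ok=True)
    if not link.exists():
        os.symlink(cached.resolve(), link)     # link instead of copy: saves ~19-35 GB
    return link
```

Because the function is idempotent, it is safe to re-run the whole setup cell after a Colab disconnect; already-present files are simply skipped.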

I want to publish it soon, but there is still room for improvement.

(Pre-release for subscribers)

AICU Lab+ Study Session [ComfyJapan] Next Preview: 1/17 (Sat) 20:00~ "LTX-2" Thorough Strategy!

To another dimension of video generation where video and sound are born simultaneously.

At the next AICU Lab+ study session, we will feature [ComfyJapan] "LTX-2: Simultaneous Generation of Video and Sound." Lecturer Hakase Shirai (AICU representative) will directly and thoroughly explain how to use this latest model, which can synchronously generate dialogue, environmental sounds, and BGM in a single pass, with ComfyUI.

  • Google Colab Support : Optimized workflow to eliminate VRAM shortage.

  • Japanese Singing Experiment : Let's generate Japanese songs with LTX-2.

  • Discord Notification Integration : Don't be afraid of long generation! Notify completion on Discord.

Date : January 17, 2026 (Sat) 20:00~ Reservation here : https://j.aicu.ai/LabYoyaku

[Session Participation Fee] Spot participation: 5,500 yen (tax included). With the subscription plan "AICU Lab+", participation is free, with unlimited viewing of archives and the monthly "IQ Magazine" PDF included! https://j.aicu.ai/LabPlus

A notebook that runs on Google Colab is introduced at the end of this article, but AICU Lab+ members also get a shared ComfyUI environment and access to the video archive, so I strongly recommend joining the study session.

What is AICU Lab+

AICU's official community where you can learn, share, and present your works at the forefront of generative AI. In conjunction with magazines, contests, and events, we offer member-only study sessions and benefits.

Summary: Reasons to Know ComfyUI and AICU Lab+ Now

The new year has begun and CES 2026 is underway in the United States. "ComfyOrg", which develops ComfyUI, is also accelerating development in cooperation with NVIDIA and other model makers. In Japan, GPUs are hard to procure and the environment is tough for newcomers to image and video generation on local PCs, so it is valuable to use Google Colab and similar tools and to explore together in study sessions, rather than struggling with settings and costs alone. Join AICU Lab+ to share the latest know-how and turn ComfyUI from a mere "tool" into your "weapon".

Use the initial free code Lab26Jan (valid until the end of January 2026) at https://j.aicu.ai/LabPlus and knock on the door of the community. We look forward to your participation!

Hashtags

#ComfyUI #AICULab #GenerativeAI #AIStudySession #IQMagazine #ComfyJapan #LTX2 #AIVideoGeneration #GoogleColab

As preparation, we recommend studying the ComfyUI "Purple Book" or the SD "Yellow Book".

Image Generation AI Stable Diffusion Start Guide (Generative AI Illustration), 2,640 yen (as of December 31, 2025), j.aicu.ai / Amazon.co.jp

Image/Video Generation AI ComfyUI Master Guide (Generative AI Illustration), 3,850 yen (as of December 7, 2025), j.aicu.ai / Amazon.co.jp

Shortcut to this article https://j.aicu.ai/260108

Beyond the paywall below, I will share a Google Colab workflow for "getting LTX-2 working for now" (still under development and improvement). We plan to cover more advanced workflow construction at the AICU Lab+ study session, so let's work hands-on together then.