2026/06/24

What Is Wan 2.2 VACE? How It Differs From Standard Wan 2.2 Workflows (2026)

Wan 2.2 VACE explained — what VACE stands for, how it differs from standard T2V and I2V workflows, what control signals it accepts (pose, depth, canny, trajectory), the VACE-Fun model family, and when to use VACE versus Wan 2.2 Animate.

You know how Wan 2.2 Text-to-Video works — type a prompt, get a video. You know how Image-to-Video works — provide a starting frame, get a continuation.

But what if you need the subject to follow a specific pose, or the camera to move along a precise path, or a section of the video replaced without regenerating the whole thing? Standard T2V and I2V give you no control over the structure of the output. The model decides everything — composition, motion, timing — and you accept what it produces.

That is where VACE comes in. It is a companion model family that adds control signals, video editing, and reference-based generation to Wan 2.2, turning it from a "generate and hope" tool into something closer to a directed video creation platform.

I tested VACE across its main control modes (pose, depth, canny, and reference-to-video), compared the results against standard Wan 2.2 T2V and I2V, and mapped where VACE adds value, where it falls short, and how it compares to Wan 2.2 Animate. This guide covers what VACE actually is, what it can do that standard Wan 2.2 cannot, and when you should — and should not — use it.

Let's start with what VACE actually is — and what it is not.

What Is VACE?

VACE stands for Video All-in-one Creation and Editing. It is a control and editing framework developed by Alibaba's Tongyi Lab that was originally built for Wan 2.1 and later fine-tuned onto Wan 2.2.

VACE is not a separate foundation model. It is a set of control weights that are fine-tuned on top of the base Wan 2.2 14B text-to-video checkpoint. The official model alibaba-pai/Wan2.2-VACE-Fun-A14B on Hugging Face starts from Wan-AI/Wan2.2-T2V-A14B and adds support for structured control inputs — pose skeletons, depth maps, edge detection, and trajectory guidance.

Think of it this way:

Standard Wan 2.2 T2V: Text goes in, video comes out. The model decides everything.
Standard Wan 2.2 I2V: A starting image provides the subject, but the model controls the motion.
VACE: You provide text, AND you provide a control signal (a pose skeleton, a depth map, an edge drawing) that constrains the structure of the video. The output follows your control signal while staying faithful to the subject.

This is the same conceptual jump that ControlNet brought to Stable Diffusion — from "prompt-only generation" to "prompt + structural constraint."

What VACE Is NOT

VACE is not a separate video generation model that replaces Wan 2.2
VACE is not an upscaler or frame interpolation tool
VACE is not a hosted Alibaba Cloud product with service guarantees. It is an openly released fine-tune and research project, with weights published under the alibaba-pai organization on Hugging Face

So how does VACE compare to standard Wan 2.2 in practice? Here is a direct side-by-side across the capabilities that matter most.

Quick Comparison: VACE vs Standard Wan 2.2

Capability	Standard T2V	Standard I2V	VACE
Input types	Text only	Image + text	Text + control signal (pose/depth/canny/trajectory) + optional reference image
Subject identity	Generated fresh each time	Locked by first frame	Locked by reference image with stronger preservation
Motion control	Prompt-based only	Prompt + image-based	Pose skeleton or trajectory directly constrains motion
Camera control	Prompt-based only (unreliable)	Very limited	Fun-Control-Camera variant adds explicit camera paths
Video editing	Not supported	Not supported	Inpainting, outpainting, background replacement, reframe
Output quality	Highest — base model	High — constrained by reference	Slightly lower — fine-tune trade-off for control
Best for	Cinematic exploration	Character consistency	Structured motion, video editing, precise control

The table shows the headline differences. Here is what each VACE capability actually looks like in practice.

What VACE Can Do That Standard Wan 2.2 Cannot

1. Control Signal Input

This is VACE's primary value. Instead of hoping your prompt produces a specific pose or composition, you feed the model a control image — a stick figure pose, a depth map, a canny edge drawing, or an MLSD line map — and the output follows that structure.

The supported control modes include:

Control Type	What It Constrains	Best For
Pose (OpenPose skeleton)	Human body position, joint angles, limb placement	Character animation, dance moves, action poses
Depth map	Scene geometry — foreground vs background, object spacing	Landscape shots, indoor scenes, object placement
Canny edge	Outlines and edges of the scene	Architecture, products, scenes with clear structural lines
MLSD (line segments)	Straight-line geometry — walls, ceilings, roads	Architectural visualization, urban scenes
Trajectory control	Object movement path across frames	Object tracking, panning shots, product rotation

2. Reference-to-Video With Identity Preservation

I2V in standard Wan 2.2 uses the reference image as the first frame. The subject can drift over time. VACE's reference-to-video mode processes the reference image differently — it extracts the subject, removes the background (via RMBG), and treats the subject as an identity anchor that persists across all 81 frames.

The result is better character consistency than standard I2V, especially for faces and clothing details. The community has noted that VACE-Fun provides "unprecedented stability in facial features, clothing textures, and environmental lighting — even at 480p."

Rule of thumb for reference-to-video: If your subject has fine details (faces, logos, specific clothing patterns) that standard I2V cannot maintain, VACE reference-to-video is worth the extra setup time. If your subject is generic and I2V already handles it, VACE adds complexity without benefit.

3. Video Inpainting and Outpainting

VACE supports mask-based local replacement — you draw a mask over a region of a video, describe what should be there, and VACE fills it in. This is useful for:

Removing an object from a scene
Replacing a background element
Fixing a specific area that has artifacts
Adding a new element to an existing video
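
A mask for this kind of local replacement is a per-frame binary image. The sketch below builds one for a static rectangular region using plain Python lists. The function names, the box format, and the convention that 255 marks the region to regenerate are assumptions for illustration, so check which polarity your inpainting node expects before using it.

```python
from typing import List, Tuple

Frame = List[List[int]]  # one grayscale frame as rows of 0-255 values


def rect_mask_frames(width: int, height: int, num_frames: int,
                     box: Tuple[int, int, int, int]) -> List[Frame]:
    """Build one binary mask per frame for a static rectangle.

    box is (left, top, right, bottom) in pixels, right/bottom exclusive.
    Assumed convention: 255 = regenerate this pixel, 0 = keep it.
    """
    left, top, right, bottom = box
    if not (0 <= left < right <= width and 0 <= top < bottom <= height):
        raise ValueError("box must lie inside the frame")
    row_inside = [255 if left <= x < right else 0 for x in range(width)]
    row_outside = [0] * width
    frame = [row_inside if top <= y < bottom else row_outside
             for y in range(height)]
    # Copy every row so each frame can be edited independently later.
    return [[row[:] for row in frame] for _ in range(num_frames)]


def mask_coverage(frame: Frame) -> float:
    """Fraction of pixels marked for regeneration in one frame."""
    total = sum(len(row) for row in frame)
    marked = sum(1 for row in frame for value in row if value == 255)
    return marked / total
```

Checking `mask_coverage` before generating is a cheap way to catch a mask that is accidentally empty or covers the whole frame.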

Outpainting extends the video frame in any direction — adding content beyond the original boundaries. The model fills in the expanded area with scene-consistent content.

4. Reframe and Background Replacement

VACE Reframe intelligently changes the aspect ratio of a video without simply cropping. The model expands or shifts the composition to fit the new ratio. This is useful when you shot in 16:9 but need 9:16 for TikTok or Reels.
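
To get a feel for how much new content a reframe can involve, here is a minimal arithmetic sketch. It assumes the most generous case, where the whole source frame is kept and the canvas is expanded to the target ratio. The names `Canvas`, `reframe_canvas`, and `generated_fraction` are hypothetical, and any rounding to dimensions a particular model accepts is left out.

```python
import math
from fractions import Fraction
from typing import NamedTuple


class Canvas(NamedTuple):
    width: int
    height: int
    pad_left: int
    pad_right: int
    pad_top: int
    pad_bottom: int


def reframe_canvas(src_w: int, src_h: int, ratio_w: int, ratio_h: int) -> Canvas:
    """Smallest canvas with the target ratio that still contains the source."""
    target = Fraction(ratio_w, ratio_h)
    if Fraction(src_w, src_h) > target:
        # Source is wider than the target: keep the width, add rows.
        width, height = src_w, math.ceil(src_w / target)
    else:
        # Source is as tall as or taller than the target: keep the height, add columns.
        width, height = math.ceil(src_h * target), src_h
    pad_x, pad_y = width - src_w, height - src_h
    return Canvas(width, height,
                  pad_x // 2, pad_x - pad_x // 2,
                  pad_y // 2, pad_y - pad_y // 2)


def generated_fraction(src_w: int, src_h: int, canvas: Canvas) -> float:
    """Share of the new canvas that has no source pixels behind it."""
    return 1 - (src_w * src_h) / (canvas.width * canvas.height)
```

For a 1280x720 source reframed to 9:16 this way, roughly two thirds of the canvas has to be generated, which is why a mix of shifting and expanding the composition is usually preferable to pure expansion.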

Background replacement preserves the foreground subject and generates a new background described by the prompt.

Expert pitfall for control signal prep: The quality of your control signal directly determines the quality of VACE output. A blurry pose skeleton, a depth map with incorrect edges, or a noisy canny image will produce a video that follows those errors faithfully. Garbage in, garbage out — but amplified by motion. Always inspect your control signal at full resolution before feeding it to VACE. If the control image looks wrong to your eye, the video will look worse.
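
Part of that inspection can be automated. The following sketch flags two common problems in a sparse control image such as a canny map or pose skeleton: a size that does not match the target resolution, and a share of lit pixels that is suspiciously low or high. The function name and both density thresholds are assumptions chosen for illustration, not values from any VACE documentation, and the check complements rather than replaces looking at the image.

```python
from typing import List, Tuple


def check_control_frame(frame: List[List[int]],
                        expected_size: Tuple[int, int],
                        min_density: float = 0.005,
                        max_density: float = 0.35) -> List[str]:
    """Return a list of problems found in one grayscale control frame.

    frame is a list of rows of 0-255 values; expected_size is (width, height).
    The density thresholds are illustrative guesses, tune them per control type.
    """
    problems = []
    height = len(frame)
    width = len(frame[0]) if height else 0
    if (width, height) != expected_size:
        problems.append(
            f"size {width}x{height} does not match "
            f"expected {expected_size[0]}x{expected_size[1]}")
    if width == 0:
        problems.append("frame has no pixels")
        return problems
    lit = sum(1 for row in frame for value in row if value > 0)
    density = lit / (width * height)
    if density < min_density:
        problems.append("almost no lit pixels: the preprocessor may have failed")
    elif density > max_density:
        problems.append("very dense signal: edges or lines are probably noisy")
    return problems
```

An empty result only means the frame passed these coarse checks; a skeleton with a misplaced joint will still pass.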

These capabilities are spread across multiple VACE variants. Here is how the model family is organized.

The VACE-Fun Family: Four Variants for Different Control Types

VACE is not a single model. The Wan-Fun project includes several fine-tuned variants built on the same approach:

Variant	Purpose	Location
VACE-Fun (base)	Multi-control: pose, depth, canny, MLSD, trajectory	`alibaba-pai/Wan2.2-VACE-Fun-A14B`
Fun-Control	Additional control types beyond base VACE	Part of the Fun series
Fun-Control-Camera	Explicit camera motion control (pan, tilt, orbit)	Separate Fun variant
Fun-InP	Video inpainting with mask input	Separate Fun variant

All variants share the same base architecture — they are fine-tuned from Wan2.2-T2V-A14B using the VACE training approach.

With multiple variants available, a natural question is how VACE compares to Wan 2.2 Animate — the other control-focused Wan 2.2 feature.

VACE vs Wan 2.2 Animate: What Is the Difference?

This is the most common comparison question, and the answer depends on what you are trying to do.

Wan 2.2 Animate is a motion transfer model. You provide a reference video, and Animate copies the motion from that video onto a new subject. For example, you have a video of a person dancing, and you want a different character to perform the same dance moves. Animate extracts the motion and re-targets it.

VACE is a broader framework. It includes motion control (via pose skeletons), but it also includes video editing, inpainting, outpainting, reframe, and reference-to-video generation.

The community consensus from direct comparisons is:

Capability	VACE	Wan 2.2 Animate
Motion transfer from a reference video	Supported (pose)	Purpose-built — higher fidelity
Video inpainting / outpainting	Supported	Not supported
Control signals (pose, depth, canny)	Supported	Not supported
Camera control	Via Fun-Control-Camera	Not supported
Reference-to-video identity preservation	Supported — strong	Not supported
Ease of setup	More complex — requires control signal prep	Simpler — just a reference video

Expert pitfall for Animate vs VACE: Do not assume Animate replaces VACE or vice versa. They serve different purposes. If you have a reference video and want to copy its motion, use Animate. If you have a still image and want to control its structure with a pose or depth signal, use VACE. If you want to edit an existing video, use VACE. Using the wrong one for the task wastes time and produces poor results.
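
The rule above can be written down as a small decision helper. This is only a sketch of the guidance in this article: the function name, the three flags, and the order in which they are checked are assumptions, and real projects often mix tools.

```python
def pick_tool(edit_existing_video: bool,
              copy_motion_from_video: bool,
              structural_control: bool) -> str:
    """Map a task description to the tool this article recommends."""
    if edit_existing_video:
        # Inpainting, outpainting, reframe, background replacement.
        return "VACE"
    if copy_motion_from_video:
        # A reference video exists and its motion should be re-targeted.
        return "Wan 2.2 Animate"
    if structural_control:
        # Pose, depth, canny, or similar signal applied to a still image or prompt.
        return "VACE"
    # Pure generation with no structural constraint.
    return "standard Wan 2.2"
```

Editing is checked first because Animate does not support it at all, so the presence of a reference video should not route an editing task away from VACE.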

The practical question after understanding the comparison is: how do you actually run a VACE workflow?

How VACE Works: A 5-Step ComfyUI Workflow

A typical VACE workflow in ComfyUI follows these steps:

1. Prepare your control signal. If you are using pose control, run your reference image through an OpenPose estimator. For depth, use a depth estimation model. For canny, apply edge detection. The ComfyUI comfyui_controlnet_aux plugin provides most of these preprocessors.
2. Load the VACE model. The VACE-Fun checkpoint (64 GB, same size as base Wan 2.2 14B) replaces the standard Wan 2.2 T2V model in your workflow.
3. Connect the control input. The VACE node accepts the preprocessed control image alongside your text prompt.
4. Set generation parameters. These are the same as for standard Wan 2.2: resolution, frame count (81 max), steps, guidance, seed.
5. Generate. The output follows the control signal while matching your prompt.
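
Before queueing a run, it can help to collect the settings from these steps in one place and sanity-check them. The sketch below is not a ComfyUI API. `VaceJob` and its field names are hypothetical, and the default values for steps and guidance are placeholders rather than recommended settings. Only the control types and the 81-frame ceiling come from this article.

```python
from dataclasses import dataclass
from typing import List

# Control types listed in this article.
CONTROL_TYPES = {"pose", "depth", "canny", "mlsd", "trajectory"}
MAX_FRAMES = 81  # per-pass frame limit stated in this article


@dataclass
class VaceJob:
    prompt: str
    control_type: str
    width: int
    height: int
    num_frames: int = MAX_FRAMES
    steps: int = 30        # placeholder, not a recommendation
    guidance: float = 5.0  # placeholder, not a recommendation
    seed: int = 0

    def problems(self) -> List[str]:
        """Return human-readable issues; an empty list means the job looks sane."""
        found = []
        if not self.prompt.strip():
            found.append("prompt is empty")
        if self.control_type not in CONTROL_TYPES:
            found.append(f"unknown control type: {self.control_type!r}")
        if not 1 <= self.num_frames <= MAX_FRAMES:
            found.append(f"num_frames must be between 1 and {MAX_FRAMES}")
        if self.width <= 0 or self.height <= 0:
            found.append("width and height must be positive")
        return found
```

Returning a list of problems instead of raising on the first one makes it easier to fix several settings in a single pass.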

VACE does not require a separate UI or a different runtime. It runs inside ComfyUI with the same hardware requirements as the 14B model — 12 GB VRAM minimum for 480p, 16 GB+ for 720p.

VACE sounds powerful on paper, but there are real trade-offs you should know before building a workflow around it.

Limitations and When Not to Use VACE

The community has documented the following limitations.

Lower output quality than native T2V/I2V. Because VACE is a fine-tune on top of Wan 2.2 — not the base model itself — it trades some output quality for control. The fine-tune process optimizes for control signal adherence, which can reduce detail, sharpness, and natural motion compared to the unmodified base model. If your priority is maximum visual quality, use standard T2V or I2V.

Control precision is still evolving. Pose skeletons produce recognizable body positions, but fine joint angles (fingers, subtle head tilts) are not reliably followed. Depth maps work well for broad scene geometry but struggle with fine object boundaries. Canny control provides structural guidance but the output can look "over-constrained" — the model follows the edges rigidly and loses natural texture.

VRAM requirements are identical to the 14B model. VACE does not reduce the hardware requirements. You still need 12 GB for 480p and 16 GB+ for 720p. The control preprocessing step adds 1–2 GB of temporary VRAM usage.
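
As a rough planning aid, the figures above can be turned into a lookup. The sketch assumes the worst case, where the preprocessor stays in GPU memory during generation and its overhead is at the top of the quoted 1–2 GB range. The function name and the `preprocess_resident` flag are hypothetical, and real usage depends on offloading and quantization settings that are not modelled here.

```python
from typing import Optional

MIN_VRAM_GB = {"480p": 12.0, "720p": 16.0}  # figures quoted in this article


def max_resolution(vram_gb: float,
                   preprocess_resident: bool = True,
                   preprocess_overhead_gb: float = 2.0) -> Optional[str]:
    """Highest resolution tier that fits, or None if even 480p does not."""
    usable = vram_gb - (preprocess_overhead_gb if preprocess_resident else 0.0)
    if usable >= MIN_VRAM_GB["720p"]:
        return "720p"
    if usable >= MIN_VRAM_GB["480p"]:
        return "480p"
    return None
```

On a 16 GB card this suggests 720p only if the control signal is prepared beforehand and the preprocessor is unloaded before generation starts.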

No native support for long-form control. VACE generates 81 frames (5 seconds) per pass, like standard Wan 2.2. Extending controlled output beyond 5 seconds requires continuation workflows or stitching — the same limitations that apply to standard Wan 2.2.
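
The number of passes a longer clip needs follows from simple arithmetic. The sketch below assumes 16 frames per second, which is inferred from the "81 frames (5 seconds)" figure rather than stated anywhere in this article, and it lets each continuation pass reuse some frames from the previous one as overlap. The function name and the overlap parameter are hypothetical.

```python
import math

FRAMES_PER_PASS = 81  # per-pass limit stated in this article
ASSUMED_FPS = 16      # assumption inferred from 81 frames being about 5 seconds


def passes_needed(duration_s: float, overlap_frames: int = 0,
                  fps: int = ASSUMED_FPS) -> int:
    """Number of generation passes needed to cover duration_s seconds."""
    if not 0 <= overlap_frames < FRAMES_PER_PASS:
        raise ValueError("overlap_frames must be between 0 and 80")
    total_frames = math.ceil(duration_s * fps)
    if total_frames <= FRAMES_PER_PASS:
        return 1
    # Every pass after the first adds only its non-overlapping frames.
    new_frames_per_pass = FRAMES_PER_PASS - overlap_frames
    remaining = total_frames - FRAMES_PER_PASS
    return 1 + math.ceil(remaining / new_frames_per_pass)
```

Under these assumptions a 15-second clip takes three passes with no overlap and four passes with an 8-frame overlap, and each pass also needs its own slice of the control signal.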

The model is a research project. VACE and the Fun variants are openly released fine-tunes under Apache 2.0. They are not production services with guarantees. Expect bugs, breaking changes, and documentation gaps.

Expert pitfall for VACE expectations: Do not expect VACE to match standard Wan 2.2 quality on the first try. The control fine-tune introduces a quality regression that is noticeable at 720p — especially in fine texture, skin detail, and natural motion. Plan for 2–3 iterations to dial in the control signal strength and prompt balance. VACE is a tool that rewards tuning, not a one-shot generator.

Rule of thumb for VACE adoption: Use VACE when you need control that T2V and I2V cannot provide — structured pose, depth-based composition, or video editing. Stay with standard Wan 2.2 when maximum visual quality is your only priority. VACE is a tool for specific problems, not a replacement for the base workflow.

Rule of thumb for control signal quality: If your VACE output looks bad, check the control signal before you change the prompt. Nine times out of ten, the control image is the problem — wrong preprocessing settings, low resolution, or misaligned edges. Fix the control image, regenerate, and only adjust the prompt if the control was already clean.

These limitations raise common questions. Here are the answers to the most frequent ones.

Frequently Asked Questions

Does VACE work with the 5B model? No. VACE-Fun and the other Fun variants are fine-tuned on Wan2.2-T2V-A14B (14B parameters). There is no 5B VACE variant as of mid-2026. You need 12 GB+ VRAM to run it.

Is VACE available in ComfyUI natively? Yes. The ComfyUI Wan 2.2 native workflow supports VACE through the WanVaceToVideo node. You also need the comfyui_controlnet_aux plugin for control signal preprocessing (pose, depth, canny, etc.).

Can VACE generate videos longer than 5 seconds? No — VACE inherits the same 81-frame (5-second) limit as the base Wan 2.2 model. Longer videos require continuation or stitching workflows, which work the same way as with standard Wan 2.2.

Do I need a separate model download for VACE? Yes. The VACE-Fun checkpoint is 64 GB (same as the base 14B model). It is a separate download from the standard Wan 2.2 model files. You run either VACE or standard Wan 2.2 — not both at the same time.

Does VACE support video-to-video workflows? Yes — with limitations. VACE can accept a video as input for motion transfer (via pose extraction) and for inpainting/outpainting. True video-to-video style transfer is not a primary VACE capability, but the inpainting workflow can approximate it by masking and regenerating regions of each frame.

Is VACE better than Wan 2.2 Animate for motion transfer? No. For motion transfer specifically — copying motion from a reference video to a new subject — Wan 2.2 Animate produces higher fidelity results. VACE's pose control is useful when you want to design motion from a single image (rather than a reference video), but Animate is the better choice when a reference video is available.

VACE adds a new dimension to Wan 2.2 — but only for specific problems. Here is the short version of when to use it.

Summary

VACE turns Wan 2.2 from a "generate and hope" tool into a directed video creation platform — but it is not a replacement for the base workflow.

Use VACE when you need structural control (pose, depth, edges), video editing (inpainting, outpainting), or reference-to-video with stronger identity preservation than standard I2V provides.
Skip VACE when your priority is maximum visual quality, or when standard T2V or I2V already produces the result you need. The quality trade-off from the fine-tune is real, and control is a feature you add when you need it — not a default improvement.

The most common mistake is treating VACE as "Wan 2.2 but better." It is not. It is Wan 2.2 plus control — and control comes at a cost. Match the tool to the problem: use standard Wan 2.2 for pure generation, VACE for structured generation and editing, and Animate for motion transfer.

Next step: If you want to try VACE, start with the reference-to-video workflow — it gives the most dramatic improvement over standard I2V with the least setup complexity. The Wan 2.2 ComfyUI Workflow Guide covers the base setup you need before adding VACE. For a direct comparison with Wan 2.2 Animate, the Wan 2.2 Animate Guide explains when Animate is the better choice.
