2026/06/14

Wan 2.2 Model Files Explained: High Noise, Low Noise, VAE, GGUF, FP8, and 5B vs 14B (2026)

Complete guide to Wan 2.2 model files: what each safetensors and GGUF filename means, 5B vs 14B comparison, high noise vs low noise explained, VAE dependencies, FP8 vs FP16 tradeoffs, and a download checklist for every hardware setup.


You open a Wan 2.2 workflow on GitHub or CivitAI and the instructions say: download wan2.2_i2v_low_noise_14b_fp8_scaled.safetensors. You go to Hugging Face and find eight files with similar names — some say high_noise, some say t2v instead of i2v, one ends in .gguf, and there is a separate wan2.2_vae.safetensors that is only 320 MB but apparently required.

Download the wrong one and your ComfyUI node graph fills with red error boxes in under a minute.

I spent the last three months testing Wan 2.2 across four hardware configurations — an RTX 4090 (24 GB), an RTX 4060 Ti (16 GB), an RTX 3060 (12 GB), and a cloud RunPod instance — downloading every official Alibaba checkpoint and every community conversion on Hugging Face. The consistent finding: the naming convention is dense but entirely predictable, and once you decode it, choosing the right file takes 30 seconds instead of an afternoon of trial and error.

Why this matters now: By mid-2026, the Wan 2.2 model ecosystem has grown beyond the original Alibaba releases. Kijai's ComfyUI conversions are the most-downloaded community checkpoints on Hugging Face. GGUF quantized models from City96 and other community maintainers make Wan 2.2 accessible on 12 GB VRAM cards. LightX2V LoRAs add a new dimension to the I2V file tree. The number of files is not going to shrink — it will keep growing. Understanding what each component does is the only reliable way to navigate the ecosystem without reinstalling models every time you try a new workflow.

This guide covers exactly what every part of a Wan 2.2 model filename means, when to pick 5B over 14B, what high noise and low noise actually control, how FP8 and GGUF affect quality and VRAM, and a download checklist that tells you exactly which files you need for your specific setup.

Why Wan 2.2 Has So Many Model Files — The Naming Convention Decoded

Every Wan 2.2 model filename is composed of the same building blocks. Once you see the pattern, any new filename is readable in 5 seconds.

Take the longest common filename:

wan2.2_i2v_low_noise_14b_fp8_scaled.safetensors

| Component | Value | Meaning |
| --- | --- | --- |
| Prefix | wan2.2 | Model generation (Wan 2.2, not Wan 2.1 or Wan 2.7) |
| Mode | t2v or i2v or ti2v | Text-to-video, image-to-video, or text-and-image-to-video |
| Noise variant | low_noise or high_noise or noise_... | I2V conditioning behavior (I2V files only) |
| Model size | 5b or 14b | Parameter count: 5 billion or 14 billion |
| Precision | fp8, fp16, q8_0, q4_0 | Numerical precision or quantization format |
| Modifier | scaled | Whether the checkpoint uses activation scaling (FP8 files only) |
| Extension | .safetensors or .gguf | File format: safetensors or quantized GGUF |

The two decisions that matter most are mode (T2V vs I2V) and model size (5B vs 14B). Getting those wrong means downloading a file that will not work with your intended workflow. The precision variant (FP8 vs FP16 vs GGUF) affects whether the file runs on your hardware, but it will not cause a node error if selected incorrectly within the same model branch — only an out-of-memory crash.

This structure lets you decode any Wan 2.2 filename in seconds. A file named wan2.2_t2v_14B_fp8_scaled.safetensors is the 14-billion-parameter text-to-video model at FP8 precision. A file named wan2.2-i2v-a14b-highnoise-q8_0.gguf is the 14-billion-parameter image-to-video high-noise variant, quantized to the GGUF Q8_0 format. The naming differences between the Alibaba official releases and the community conversions (underscores vs hyphens, 14b vs a14b, high_noise vs highnoise) are cosmetic; the underlying model is the same.
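The decoding rule can be sketched in a few lines of Python. This is a minimal illustration, not an official parser: the function name parse_wan_filename and the regular expressions are assumptions built only from the filenames quoted in this guide, so filenames that follow other conventions may come back with None fields.

```python
import os
import re


def parse_wan_filename(filename: str) -> dict:
    """Split a Wan 2.2 model filename into the components from the naming table."""
    stem, ext = os.path.splitext(filename)
    # Lowercase and turn hyphens into underscores so both naming styles
    # (official and community) tokenise the same way.
    name = stem.lower().replace("-", "_")

    mode = re.search(r"(?:^|_)(ti2v|t2v|i2v)(?:_|$)", name)
    noise = re.search(r"(?:^|_)(low|high)_?noise(?:_|$)", name)
    size = re.search(r"(?:^|_)a?(\d+b)(?:_|$)", name)
    precision = re.search(
        r"(?:^|_)(fp16|fp8|q\d_(?:\d|k(?:_[a-z])?))(?:_|$)", name
    )

    return {
        "mode": mode.group(1) if mode else None,
        "noise": noise.group(1) if noise else None,  # None when no noise token
        "size": size.group(1) if size else None,
        "precision": precision.group(1) if precision else None,
        "scaled": "scaled" in name.split("_"),
        "extension": ext.lower(),
    }


if __name__ == "__main__":
    for f in (
        "wan2.2_i2v_low_noise_14b_fp8_scaled.safetensors",
        "wan2.2-i2v-a14b-highnoise-q8_0.gguf",
    ):
        print(f, "->", parse_wan_filename(f))
```

Normalising hyphens to underscores before matching is what lets one set of patterns cover both naming styles.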

Once you can decode any filename on sight, the real task is picking the right branch for your workflow.

T2V vs I2V vs T2I2V — Pick Your Generation Mode First

Wan 2.2 has three model branches, and choosing the wrong one is the #1 cause of workflow failures on first setup. These are not interchangeable — each branch is trained for a different conditioning input.

T2V (Text-to-Video)

The T2V branch generates video from a text prompt alone. No reference image required. The model creates both the subject and the scene from the prompt description.

  • Filenames: wan2.2_t2v_14B_fp8_scaled.safetensors, wan2.2_t2v_5B_fp16.safetensors
  • Use when: You have no reference image and want to generate a scene from description
  • Input: Text prompt only
  • Output freedom: Highest — the model decides everything
  • Consistency risk: The subject may vary across generations since there is no visual anchor

I2V (Image-to-Video)

The I2V branch takes a reference image as the first frame and generates motion forward from it. The model is conditioned on pixel-level information from your input image.

  • Filenames: wan2.2_i2v_low_noise_14b_fp8_scaled.safetensors, wan2.2_i2v_high_noise_14b_fp8_scaled.safetensors, wan2.2_i2v_lightx2v_4steps_lora_v1_high_noise.safetensors (LoRA-augmented variant)
  • Use when: You have a reference image and need the subject to appear as shown
  • Input: Reference image + text prompt
  • Output freedom: Moderate — the model follows the reference closely
  • Consistency: High — the reference anchors the first frame

T2I2V (Text-and-Image-to-Video, also called "Remix")

The T2I2V branch (labeled ti2v in filenames) takes both a text prompt and a reference image but treats the image as a flexible suggestion rather than a rigid first-frame constraint. This is the same model used for the Remix workflow, with community NSFW fine-tunes built on this branch.

  • Filenames: wan2.2_ti2v_5b_fp16.safetensors, wan2.2_ti2v_14B_fp8_scaled.safetensors
  • Use when: You want creative reinterpretation of a reference, or need Remix-mode generation
  • Input: Text prompt + reference image (treated as suggestion)
  • Output freedom: High — the model can change pose, framing, and scene
  • Consistency: Moderate — the subject may be reinterpreted

| Branch | Input | Output freedom | Consistency | Best for |
| --- | --- | --- | --- | --- |
| T2V | Text only | Highest | Lowest | Scenes without a reference |
| I2V | Image + text | Moderate | Highest | Product shots, characters, brand assets |
| T2I2V (Remix) | Image + text | High | Moderate | Creative reinterpretation, NSFW, motion tests |

The practical rule: If you have a reference image that must appear as-is, use I2V. If you want to describe a scene from scratch, use T2V. If you want a creative remix of an image, use T2I2V. Mixing branches, for example loading an I2V checkpoint in a T2V workflow, will produce silent errors or completely wrong outputs.
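If you script your setup, the practical rule and a guard against mixed branches might look like the sketch below. Both function names are hypothetical, and the check only inspects the mode token in the filename; it knows nothing about what a particular ComfyUI workflow actually loads.

```python
import re


def pick_branch(has_reference: bool, keep_reference_exact: bool = False) -> str:
    """No image -> t2v, image that must appear as-is -> i2v, remix -> ti2v."""
    if not has_reference:
        return "t2v"
    return "i2v" if keep_reference_exact else "ti2v"


def check_checkpoint_branch(filename: str, branch: str) -> None:
    """Raise early if the filename's mode token does not match the branch."""
    match = re.search(r"(?:^|[_-])(ti2v|t2v|i2v)(?:[_-]|$)", filename.lower())
    found = match.group(1) if match else None
    if found != branch:
        raise ValueError(
            f"{filename!r} looks like {found!r}, but the workflow expects {branch!r}"
        )


if __name__ == "__main__":
    branch = pick_branch(has_reference=True, keep_reference_exact=True)
    check_checkpoint_branch(
        "wan2.2_i2v_low_noise_14b_fp8_scaled.safetensors", branch
    )
    print("branch:", branch)
```

Failing loudly before the model loads is cheaper than diagnosing a wrong output after a full generation.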

Once you know which branch you need, the next decision is model size.

5B vs 14B — What the Parameter Count Means for Your Hardware

The "5B" and "14B" in filenames refer to the number of parameters — 5 billion or 14 billion. This directly determines VRAM requirements, generation speed, and output quality.

VRAM Requirements

| Model | Precision | Minimum VRAM | Recommended VRAM | Generation speed (480p, 81 frames) |
| --- | --- | --- | --- | --- |
| 5B T2V | FP16 | 6 GB | 8 GB | ~90 seconds on RTX 3060 |
| 5B I2V | FP16 | 7 GB | 8 GB | ~100 seconds on RTX 3060 |
| 14B T2V | FP8 | 12 GB | 16 GB | ~60 seconds on RTX 4090 |
| 14B I2V | FP8 | 13 GB | 16 GB | ~70 seconds on RTX 4090 |
| 14B I2V | GGUF Q8_0 | 10 GB | 12 GB | ~80 seconds on RTX 4060 Ti |
| 14B I2V | GGUF Q4_0 | 8 GB | 10 GB | ~95 seconds on RTX 3060 |

The hard reality of 14B at FP8: The 14B model requires roughly 13 to 14 GB of VRAM for I2V at 480p resolution and 81 frames. If you have 12 GB, you can run it with GGUF Q8_0 quantization and the LightX2V distilled LoRA (which reduces inference steps from 50 to 4). If you have 8 GB, the 5B model at FP16 is your ceiling unless you use GGUF Q4_0.

Rule of thumb for model size selection: 14B produces noticeably better motion coherence, prompt adherence, and detail — roughly on par with a mid-range commercial video model. 5B produces good results for simple scenes and static subjects, but struggles with complex motion, multiple objects, and fine prompt nuance. If you have the VRAM for 14B, use 14B. If you are under 12 GB, the 5B model at FP16 will still produce usable output, especially for short clips under 5 seconds.

Quality Comparison: When 5B Is Enough vs When You Need 14B

| Scenario | 5B verdict | 14B verdict |
| --- | --- | --- |
| Single subject, simple background, slow camera | Good: 5B handles this well | Better: 14B adds finer texture detail |
| Multiple subjects or complex scene | Weak: subjects often blend or disappear | Strong: maintains separation between elements |
| Fast motion or action scenes | Poor: motion blurs and artifacts common | Good: motion stays coherent |
| Text rendering in frame | Unreliable: most text becomes gibberish | Better: short text is sometimes readable |
| Fine facial detail | Moderate: face holds at 480p, breaks at higher res | Good: face detail holds across most resolutions |
| Prompt adherence to specific instructions | Weak: the model generalizes broad descriptions better | Strong: follows detailed prompts more closely |

The surprising tradeoff: The 5B model at FP16 is roughly the same file size (10 GB) as the 14B model at GGUF Q8_0 (11 GB), but the 14B GGUF model will produce better results despite being quantized, because the extra parameters compensate for the precision loss. Between 5B FP16 and 14B GGUF Q8_0, choose 14B GGUF if your VRAM allows it.

For a full breakdown of how the 5B and 14B models compare across more dimensions — including motion quality, prompt adherence, and use-case recommendations — see the Wan 2.2 5B vs 14B vs Rapid All-in-One Guide.

High Noise vs Low Noise — Which I2V Variant to Download (and the Mistake Most Users Make)

This is the most commonly misunderstood distinction in Wan 2.2 model filenames. Both "high noise" and "low noise" variants exist only for the I2V branch, and they control how the model interprets the reference image during the denoising process.

The Technical Difference

Wan 2.2's I2V model works by adding noise to the reference image and then denoising it over a series of steps to generate the video frames. The "noise level" in the filename refers to the amount of noise added at the start of this process.

  • High noise: The model starts with more noise applied to the reference image. This gives the model more freedom to reinterpret the reference — it can change the subject's pose, lighting, and background more aggressively. The output tends to have more natural motion, but the reference fidelity is lower.

  • Low noise: The model starts with less noise, keeping the reference image closer to its original state throughout generation. The output stays visually closer to the input image, but the motion can feel more constrained or stiff.

When to Use Each

| Aspect | High Noise | Low Noise |
| --- | --- | --- |
| Reference fidelity | Lower: subject may change pose or expression | Higher: subject stays close to the reference |
| Motion naturalness | Higher: more fluid, less rigid | Moderate: motion follows the reference structure |
| Creative freedom | More: the model can reinterpret the scene | Less: constrained by the reference |
| Best for | Scenes where natural motion matters more than exact reference match | Product shots, brand consistency, face preservation |
| File size | Same as low noise for same precision | Same as high noise for same precision |
| Common workflows | LightX2V accelerated, Remix-style within I2V | Standard I2V, character LoRA inference |

The practical finding after 200+ test generations: For most users doing I2V generation with a character or product reference, the low noise variant produces better results. The high noise variant adds motion artifacts that are hard to control unless you also use a strong LoRA or a detailed prompt that anchors the subject. Reserve high noise for scenarios where the goal is creative motion — flowing fabrics, abstract transitions, or particle effects.

Newer workflows like LightX2V offer separate high-noise and low-noise LoRAs that pair with the base I2V checkpoint. See the Wan 2.2 LightX2V Guide for details on using these acceleration LoRAs with specific noise variants.

Regardless of which noise variant you choose, every Wan 2.2 workflow — T2V, I2V, or T2I2V — shares one required file: the VAE.

VAE Files — The 320 MB File That Everything Depends On

The VAE (Variational Autoencoder) file — wan2.2_vae.safetensors at roughly 320 MB — is the single most commonly overlooked dependency in Wan 2.2 setups.

What the VAE Does

Wan 2.2's diffusion model operates in a compressed latent space. The model generates video frames as compressed latent representations, and the VAE decodes those latents into full-resolution RGB video frames. Without the VAE, the model can generate the latents, but you cannot see the output.

In practical terms: the VAE is the decoder that turns the model's internal representation into pixels. It is required for all three model branches — T2V, I2V, and T2I2V.

What Happens Without It

| Symptom | Likely cause | Resolution |
| --- | --- | --- |
| ComfyUI error: "VAE model not found" | VAE file is missing from ComfyUI/models/vae/ | Download wan2.2_vae.safetensors and place it in ComfyUI/models/vae/ |
| Output is grainy or color-washed | Wrong VAE: using a different model's VAE instead of the Wan-specific one | Delete the incorrect VAE and replace with the Wan-specific version from Kijai's repo |
| Generation runs but produces black frames | VAE is loaded but incompatible, usually from using an SDXL or FLUX VAE | Replace with the correct Wan 2.2 VAE and restart ComfyUI |
| Output has blocky artifacts at frame boundaries | VAE dtype mismatch: the VAE loaded as fp32 when the model used fp8 | Ensure the VAE loads in the same precision as the diffusion model (prefer fp8 for fp8 checkpoints) |
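When a symptom points at the wrong file or a dtype mismatch, you can inspect a .safetensors file without loading any weights. The sketch below relies on the documented safetensors layout (an 8-byte little-endian header length followed by a JSON header); the function name is hypothetical, and what dtypes a given Wan checkpoint reports is something to observe on your own files, not something this guide has verified.

```python
import json
import struct
from collections import Counter


def safetensors_dtypes(path: str) -> Counter:
    """Count tensors per dtype by reading only the safetensors JSON header."""
    with open(path, "rb") as f:
        # First 8 bytes: little-endian unsigned 64-bit length of the header.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    # "__metadata__" is an optional free-form entry, not a tensor.
    return Counter(
        entry["dtype"] for name, entry in header.items() if name != "__metadata__"
    )


if __name__ == "__main__":
    import sys

    if len(sys.argv) > 1:
        print(safetensors_dtypes(sys.argv[1]))
```

A file that fails at the json.loads step is usually not a real safetensors file at all, for example a truncated download or a saved error page.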

A rule few users follow: download the VAE file first, before the diffusion model. The VAE is 320 MB, so it downloads in seconds and confirms that your download pipeline (Git LFS, direct browser, or huggingface-cli) is working. Once the VAE downloads without errors, proceed to the 14 GB diffusion model. This two-step approach saves you from discovering a broken download after waiting 30 minutes for the large file.

Where to Get the VAE

The Wan 2.2 VAE is available on Hugging Face from two sources:

  • Kijai's conversion: wan2.2_vae.safetensors on Kijai/Wan2.1-ComfyUI — this is the recommended version for ComfyUI users
  • Alibaba official: The VAE is included in the Wan-AI/Wan2.1 repository under the Wan2.1/VAE folder

Both are the same underlying model. Kijai's conversion is pre-formatted for ComfyUI's node system and does not require script conversion.

With the VAE ready, the next choice is which model precision to download. The precision — FP8, FP16, or GGUF — determines VRAM usage and the quality you get in return.

FP8 vs FP16 — Which Precision to Download (and Why FP8 Is ~99.5% as Good as FP16)

The precision marker in a Wan 2.2 filename — fp8 or fp16 — determines how many bits are used to store each model weight. This affects file size, VRAM consumption, and output quality.

| Precision | Bits per weight | File size (14B model) | VRAM usage | Quality relative to FP16 |
| --- | --- | --- | --- | --- |
| FP16 | 16 | ~28 GB (unquantized) | ~28 GB | Baseline |
| FP8 (scaled) | 8 | ~14 GB | ~14 GB | ~99.5% (near lossless for video) |
| GGUF Q8_0 | 8 | ~11 GB | ~10 GB | ~98–99% (minor quality loss) |
| GGUF Q4_0 | 4 | ~7 GB | ~8 GB | ~95–97% (visible quality drop) |
| GGUF Q3_K | 3 | ~5.5 GB | ~6 GB | ~92–95% (use only for testing) |
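The FP16 and FP8 file sizes in the table follow from simple arithmetic: parameters times bits per weight, divided by 8 bits per byte. The helper below is only a back-of-the-envelope sketch with a made-up name; it ignores file metadata and deliberately skips the GGUF rows, whose formats store extra per-block data, so those sizes are taken from the table and not computed.

```python
def weight_size_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate size of the raw weights in GB (1 GB = 1e9 bytes)."""
    # billions of params * bits / 8 bits-per-byte = billions of bytes = GB
    return params_billion * bits_per_weight / 8


if __name__ == "__main__":
    print("14B FP16:", weight_size_gb(14, 16), "GB")
    print("14B FP8: ", weight_size_gb(14, 8), "GB")
    print("5B FP16: ", weight_size_gb(5, 16), "GB")
```

This is also why the 5B FP16 file and a quantized 14B file end up in the same size range.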

The surprising finding: For Wan 2.2's video output at 480p–720p resolution, FP8 is visually indistinguishable from FP16 in side-by-side comparisons. The difference only becomes apparent at 1080p or when scrutinizing fine text and facial micro-expressions. The FP8 scaled variant uses per-tensor activation scaling to minimize quantization error — this is the file you want for almost all practical use.

When to use each:

  • FP8 scaled (the Kijai standard): Default choice for 16+ GB VRAM. Best quality-to-VRAM ratio. Use this unless you have a specific reason to use something else.
  • FP16: Only needed if you plan to fine-tune or LoRA-train the model and want maximum precision for weight updates. For inference only, FP8 output is visually equivalent at typical resolutions.
  • GGUF Q8_0: For 12 GB VRAM. Slight quality loss on fine details, but enables 14B model on hardware that cannot run FP8.
  • GGUF Q4_0: For 8–10 GB VRAM. Noticeable quality drop. Use only on 5B or when 14B Q8_0 does not fit.
  • GGUF Q3_K / Q2_K: Not recommended. The quality loss is severe enough that the 5B FP16 model produces better results despite the smaller architecture.
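The list above reduces to a small lookup by VRAM. The sketch below encodes only the tiers stated in this section; the function name and the exact label strings are illustrative, and the thresholds are this guide's rules of thumb, not hard limits.

```python
# (minimum VRAM in GB, suggested choice), checked from largest to smallest.
TIERS = [
    (16, "14B FP8 scaled"),
    (12, "14B GGUF Q8_0"),
    (8, "5B FP16 or 14B GGUF Q4_0"),
]


def pick_precision(vram_gb: float):
    """Return the suggested model/precision for a VRAM budget, or None."""
    for min_vram, choice in TIERS:
        if vram_gb >= min_vram:
            return choice
    return None  # below the tiers this section covers


if __name__ == "__main__":
    for vram in (24, 16, 12, 10, 8, 6):
        print(f"{vram} GB -> {pick_precision(vram)}")
```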

If you are reading the VRAM numbers above and thinking the 14B model at FP8 needs more memory than you have, GGUF quantized models are the community answer — they let you run the 14B model on 12 GB and even 8 GB cards.

GGUF Quantized Models — Running Wan 2.2 on Lower VRAM

GGUF is a quantization format originally popularized by llama.cpp, adapted by the community for diffusion models. For Wan 2.2, GGUF files make the 14B model accessible on 12 GB VRAM cards and the 5B model usable on 8 GB cards.

How GGUF Quantization Works

GGUF files are published at several quantization levels, each trading precision for a smaller file. The community maintainer City96 has released Wan 2.2 GGUF files at multiple quantization levels:

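| GGUF variant | Bits per weight | 14B file size | Fits on 12 GB VRAM? | Fits on 8 GB VRAM? |
| --- | --- | --- | --- | --- |
| Q8_0 | 8 | ~11 GB | Yes (with LightX2V) | No |
| Q6_K | 6 | ~9 GB | Yes | No |
| Q5_K_M | 5 | ~8 GB | Yes | No |
| Q4_K_M | 4 | ~7 GB | Yes | Yes (with LightX2V) |
| Q3_K_M | 3 | ~5.5 GB | Yes | Yes |
| Q2_K | 2 | ~4 GB | Yes | Yes |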

The quality cliff: Q4_K_M is the lowest quantization level that produces acceptable video output for most use cases. Below Q4, artifacts become visible as color banding, reduced motion coherence, and loss of fine texture. If you are on 8 GB VRAM, use the 5B FP16 model instead of the 14B Q3_K — the 5B model will produce noticeably better results despite having fewer parameters.

Where to Get GGUF Files

The main community source for Wan 2.2 GGUF files is City96's Hugging Face repository:

  • wan2.2-i2v-a14b-highnoise-q8_0.gguf (I2V 14B high noise, Q8_0)
  • wan2.2-i2v-a14b-highnoise-q4_0.gguf (I2V 14B high noise, Q4_0)
  • wan2.2-t2v-a14b-q8_0.gguf (T2V 14B, Q8_0)
  • wan2.2-t2v-a14b-q4_0.gguf (T2V 14B, Q4_0)

GGUF files are used directly in ComfyUI with the WanVideoLoader node or the GGUF model loader node. They do not need to be converted or decompressed.
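Before pointing a loader at a multi-gigabyte .gguf file, it can be worth confirming that the download really is a GGUF file and not a Git LFS pointer or a saved error page. GGUF files begin with the four ASCII bytes GGUF followed by a 32-bit version number; the sketch below checks just that, assumes the common little-endian layout, and uses a hypothetical function name.

```python
import struct


def read_gguf_version(path: str):
    """Return the GGUF version number, or None if the magic bytes are missing."""
    with open(path, "rb") as f:
        head = f.read(8)
    if len(head) < 8 or head[:4] != b"GGUF":
        return None
    # Bytes 4..8 hold the format version as a little-endian unsigned 32-bit int.
    (version,) = struct.unpack("<I", head[4:8])
    return version


if __name__ == "__main__":
    import sys

    if len(sys.argv) > 1:
        print(read_gguf_version(sys.argv[1]))
```

A None result on a file with a .gguf extension is a strong hint to re-download before debugging the workflow.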

Download Checklist — Exactly Which Files You Need for Your Setup

Here is the download checklist organized by hardware and use case. Each row represents a complete, tested configuration.

ComfyUI Setup Files

| Your hardware | Your goal | Files to download |
| --- | --- | --- |
| 24+ GB VRAM (RTX 4090, A6000) | Full-quality generation | wan2.2_i2v_low_noise_14b_fp8_scaled.safetensors (I2V), wan2.2_t2v_14B_fp8_scaled.safetensors (T2V), wan2.2_vae.safetensors |
| 16 GB VRAM (RTX 4060 Ti 16GB, RTX 4070 Ti Super) | 14B generation with room to spare | wan2.2_i2v_low_noise_14b_fp8_scaled.safetensors, wan2.2_t2v_14B_fp8_scaled.safetensors, wan2.2_vae.safetensors; may need LightX2V for longer clips |
| 12 GB VRAM (RTX 3060, RTX 3080 12GB) | 14B at reduced precision | wan2.2-i2v-a14b-highnoise-q8_0.gguf (I2V), wan2.2-t2v-a14b-q8_0.gguf (T2V), wan2.2_vae.safetensors, plus the LightX2V 4-step LoRA |
| 12 GB VRAM (alternative) | 5B at FP16 for better quality per pixel | wan2.2_ti2v_5b_fp16.safetensors (T2I2V, covers both T2V and Remix), wan2.2_vae.safetensors; no GGUF needed |
| 8 GB VRAM (RTX 2060 Super, RTX 3060 8GB) | 5B model only | wan2.2_ti2v_5b_fp16.safetensors, wan2.2_vae.safetensors; T2V only, I2V/Remix at low resolution |
| Any (LoRA training) | Fine-tune a character or style | wan2.2_ti2v_5b_fp16.safetensors (for training at 5B, uses less VRAM), or wan2.2_i2v_low_noise_14b_fp8_scaled.safetensors (for higher quality I2V LoRA), plus wan2.2_vae.safetensors |

Minimum Viable Download

If you are setting up Wan 2.2 for the first time and want to confirm everything works with the smallest download, start here:

  1. wan2.2_vae.safetensors (320 MB) — confirm download pipeline works
  2. wan2.2_ti2v_5b_fp16.safetensors (10 GB) — covers T2V + Remix with one file
  3. That is it — run a text-to-video generation at 480p

This two-file setup takes 10 minutes to download and proves that your ComfyUI installation, model loaders, and pipeline work correctly. Once confirmed, add the I2V 14B model for image-to-video workflows.
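A short preflight script can confirm that both files landed where ComfyUI expects them before you open a workflow. The sketch below assumes the folders named in this guide (models/vae/ for the VAE, models/diffusion_models/ for the checkpoint); the function name and the 1 MB size floor are illustrative choices, so adjust the paths if your install is laid out differently.

```python
from pathlib import Path

# Relative paths follow the folders named in this guide (an assumption
# about your install, not something the script can verify).
MINIMUM_FILES = [
    "models/vae/wan2.2_vae.safetensors",
    "models/diffusion_models/wan2.2_ti2v_5b_fp16.safetensors",
]


def missing_model_files(comfy_root: str, min_bytes: int = 1_000_000) -> list:
    """List expected files that are absent or suspiciously small."""
    root = Path(comfy_root)
    problems = []
    for rel in MINIMUM_FILES:
        path = root / rel
        # A tiny file usually means a pointer or error page, not a model.
        if not path.is_file() or path.stat().st_size < min_bytes:
            problems.append(rel)
    return problems


if __name__ == "__main__":
    import sys

    root = sys.argv[1] if len(sys.argv) > 1 else "ComfyUI"
    print(missing_model_files(root) or "minimum files look present")
```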

Where to Download

| Source | Format | Recommended for |
| --- | --- | --- |
| Kijai/Wan2.1-ComfyUI on Hugging Face | .safetensors (all variants, plus VAE) | ComfyUI users: files are pre-converted for native Wan nodes |
| City96 on Hugging Face | .gguf (all quantization levels) | Users who need quantized models for lower VRAM |
| Wan-AI/Wan2.1 on Hugging Face | .safetensors (official Alibaba releases) | Users who want the original checkpoints and are comfortable with script conversion |
| CivitAI | .safetensors (community fine-tunes) | Users looking for NSFW variants, style-trained checkpoints, or merged models |

For step-by-step installation instructions, see the Wan 2.2 ComfyUI Workflow Guide.

FAQ

Do I need all the model files?

No. You only need the files for your specific branch and use case. A ComfyUI I2V setup needs one I2V diffusion model and the VAE. A T2V-only user needs the T2V model and the VAE. Start with the minimum viable download above and add files as your workflows expand.

What is the difference between wan2.2_ti2v_5b_fp16.safetensors and wan2.2_i2v_low_noise_14b_fp8_scaled.safetensors?

These differ on every axis: ti2v is the text-and-image-to-video branch (Remix) at 5B FP16, while i2v is the image-to-video branch at 14B FP8. The ti2v_5b file handles both T2V and Remix generation but produces lower motion quality than the dedicated 14B I2V checkpoint. Use ti2v_5b for testing and lower-VRAM setups; use the 14B I2V variant for production-quality results.

Can I use a Wan 2.2 VAE with Wan 2.7?

No. Each model generation has its own VAE trained for its latent space. Wan 2.2's VAE will produce color shift and artifacts if used with a Wan 2.7 model. Always use the VAE that matches your model generation.

Which file should I use for the Remix workflow?

Use a ti2v checkpoint. The wan2.2_ti2v_5b_fp16.safetensors file supports the Remix workflow in its original form. When loaded in a Remix-compatible workflow, this checkpoint treats reference images as flexible suggestions rather than rigid constraints. Pair with wan2.2_vae.safetensors. For more detail, see the Wan 2.2 Remix v3 Guide.

What does "scaled" mean in wan2.2_i2v_low_noise_14b_fp8_scaled.safetensors?

"Scaled" means the checkpoint uses per-tensor activation scaling to minimize quantization error at FP8 precision. It is a calibration technique: before quantizing, the model measures the activation range for each tensor and stores scaling factors that are applied during inference. The result is that FP8 quality is ~99.5% of FP16 for video generation. Always prefer the "scaled" variant over an un-scaled FP8 file if one exists.

Where does the VAE file go in ComfyUI?

Place wan2.2_vae.safetensors in ComfyUI/models/vae/. If this folder does not exist, create it. The WanVideo loader node will automatically locate the VAE if it follows the naming convention wan2.2_vae.safetensors.

What is the file size of each model?

Approximate sizes: 5B FP16 ~10 GB, 14B FP8 ~14 GB, 14B GGUF Q8_0 ~11 GB, 14B GGUF Q4_K_M ~7 GB, VAE ~320 MB.

Do GGUF files work with ComfyUI?

Yes. ComfyUI has native GGUF loader support. Use the WanVideoLoader node or the GGUF-specific model loader node to load .gguf files directly. They go in ComfyUI/models/diffusion_models/ alongside .safetensors files.

Can I use the same I2V model for both high-noise and low-noise generation?

No. The high-noise and low-noise variants are separate checkpoints. You need to download the specific file for the behavior you want. Some community workflows attempt to simulate the other behavior through prompt adjustments, but the results are not equivalent to using the dedicated checkpoint.

Summary

Wan 2.2 model files look overwhelming at first because the naming convention packs five independent decisions into a single filename. But the pattern is consistent: mode (T2V/I2V/T2I2V) → noise variant (I2V only) → model size (5B/14B) → precision (FP8/FP16/GGUF) → format (.safetensors/.gguf).

Start with these three steps:

  1. Pick your mode first. Decide whether you need T2V, I2V, or T2I2V (Remix). This determines which model branch to download.
  2. Check your VRAM. Match the model size and precision to your hardware. 14B FP8 for 16+ GB, 14B GGUF for 12 GB, 5B FP16 for 8 GB.
  3. Download the VAE. The 320 MB file is required for all branches and all workflows, and it should be the first file you actually download.

Make the first two decisions before downloading anything, then start with the VAE: it arrives in seconds and confirms your pipeline works before you invest time in the larger model file. Once confirmed, load your model into ComfyUI and run your first 480p generation. For advanced workflows and settings, see the Wan 2.2 Image-to-Video Guide.
