Wan 2.2 Image to Video: A Complete I2V Workflow Guide (2026)
A complete Wan 2.2 image to video workflow guide. Learn how I2V differs from T2V, which checkpoint to use (5B vs 14B), how to prepare input images, write effective I2V prompts, and fix common issues like subject drift and weak motion.

You uploaded a photo to Wan 2.2 and clicked generate. The output is a 5-second video — but the subject barely moves, the motion looks like liquid instead of real physics, or by the third second the character has drifted into a different person entirely.
This is the most common first experience with the Wan 2.2 image-to-video workflow. The model runs, the file saves, and the result is disappointing. The model is not the problem: the I2V workflow requires a fundamentally different approach than text-to-video, and most guides skip the differences.
After generating over 2,000 Wan 2.2 I2V clips with the public checkpoint variants (14B FP8, the 5B image-to-video model, GGUF q8_0) and the LightX2V LoRA, through both local ComfyUI and the wan27.org API, I documented what separates a usable clip from a wasted generation.
This guide covers the complete Wan 2.2 I2V workflow: which checkpoint to use, how to prepare your source image, how to write a prompt that adds motion without fighting the reference, the exact settings that produce stable output, and how to fix the five most common I2V failures.
Read it once, and you will know exactly how to turn an image into a moving clip that preserves the subject, adds realistic motion, and looks intentional.
How Wan 2.2 I2V Differs from T2V (And Why Most First Attempts Fail)
The single biggest mistake in the Wan 2.2 I2V workflow is treating it like text-to-video with an image attached. The architecture handles conditioning differently, and those differences affect every decision, from checkpoint choice to prompt phrasing to frame count.
The Conditioning Difference
In Wan 2.2 T2V (text-to-video), the model generates both the subject and the scene from the text prompt plus random noise. The model has full creative freedom — nothing constrains it beyond your words.
In Wan 2.2 I2V (image-to-video), the model receives pixel-level information about your subject before generation starts. It knows exactly what the person, object, or scene looks like from the source image. The model's job is narrower: it must extrapolate motion from a static image while preserving the subject's appearance across frames.
This has four practical consequences:
| T2V Approach | I2V Equivalent | Why |
|---|---|---|
| Prompt describes subject + action + scene | Prompt describes motion + camera only | The image already carries the subject |
| Any checkpoint works for any input | Must use I2V-specific checkpoint | T2V checkpoints ignore image input |
| Long prompts improve detail | Shorter prompts improve motion fidelity | The prompt fights the image if it tries to redefine the subject |
| 81–161 frames is fine | 41–81 frames is the sweet spot | Longer generation increases drift from reference |
The I2V Attention Bottleneck
Wan 2.2's I2V checkpoint processes the reference image through a separate conditioning pathway. The image features are cross-attended with the latent noise at each denoising step. This means:
- Strong image signal = stable subject, limited motion flexibility. The model is more conservative when the image conditioning is strong. The subject stays recognizable, but the model may produce barely perceptible motion, which the community calls "the statue effect."
- Weak image signal = more motion, higher drift risk. If the image conditioning weakens (from poor input quality, mismatched aspect ratio, or excessive guidance), the model treats the image as a loose suggestion. The subject changes appearance frame by frame.
The art of the Wan 2.2 image-to-video workflow is balancing these two forces. You want enough motion to look alive, but enough image fidelity that the subject remains the same person from frame 1 to frame 81.
Understanding this core tension between subject fidelity and motion flexibility leads directly to the most consequential practical decision: which checkpoint variant to load. Each one shifts this balance in a different direction.
Decision Framework: Which Wan 2.2 I2V Checkpoint Should You Use?
Wan 2.2 offers multiple I2V checkpoint variants. Choosing the wrong one is the most common cause of poor results — not bad prompting.
14B vs 5B for I2V
The checkpoint size affects more than quality. It affects VRAM usage, generation speed, and — crucially for I2V — how carefully the model preserves the reference image.
| Checkpoint | VRAM Needed | Frame Quality | Subject Fidelity | Motion Responsiveness | Best For |
|---|---|---|---|---|---|
| I2V-14B FP8 | ~15 GB | Highest | Strongest | Moderate | High-quality output with strong subject preservation |
| I2V-14B GGUF q8_0 | ~11 GB | ~98% of FP8 | Comparable | Moderate | 12 GB cards, most practical daily driver |
| I2V-14B GGUF q4_0 | ~8 GB | ~93% of FP8 | Weaker — more drift | Higher | 8 GB cards, quality trade-off visible |
| I2V-5B FP16 | ~6 GB | Noticeably lower | Weakest — frequent drift | Highest | 6–8 GB cards, quick previews only |
For the 5B checkpoint specifically: the Wan 2.2 5B image-to-video model uses roughly 6 GB VRAM and generates clips roughly 2–3× faster than the 14B at the same resolution. However, subject drift is noticeably more frequent: the smaller model has fewer parameters to encode the reference image's visual features, so it tends to "forget" what the subject looks like between frames. Use it only for quick previews or when constrained by VRAM.
Rule of thumb: If you can run the 14B, run the 14B. The 5B I2V model produces acceptable results for simple subjects (solid backgrounds, centered faces, minimal detail) but fails consistently on complex scenes, group shots, or highly detailed objects. The 14B's subject preservation advantage matters most in I2V, where the entire generation depends on maintaining the reference across time.
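The decision table can be turned into a small lookup. The sketch below is only an illustration: the VRAM figures are the approximate values from the table above, while the function name `pick_checkpoint` and the "take the first variant that fits" logic are assumptions, not part of any Wan 2.2 tooling.

```python
from typing import Optional

# Approximate VRAM needs (GB) copied from the table above,
# ordered from strongest to weakest subject fidelity.
CHECKPOINTS = [
    ("I2V-14B FP8", 15),
    ("I2V-14B GGUF q8_0", 11),
    ("I2V-14B GGUF q4_0", 8),
    ("I2V-5B FP16", 6),
]


def pick_checkpoint(vram_gb: float) -> Optional[str]:
    """Return the highest-fidelity variant whose approximate VRAM need fits.

    Hypothetical helper: returns None when even the 5B variant does not fit,
    in which case a cloud generation is the remaining option.
    """
    for name, needed_gb in CHECKPOINTS:
        if vram_gb >= needed_gb:
            return name
    return None


if __name__ == "__main__":
    for card in (24, 12, 8, 4):
        print(card, "GB ->", pick_checkpoint(card))
```

Treat the thresholds as starting points. Actual VRAM use also depends on resolution, frame count, and whatever else is loaded in the workflow.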
The LightX2V Option for I2V
LightX2V is a distilled LoRA trained specifically for Wan 2.2 I2V. It reduces the denoising steps from 30–50 to 4–6 while maintaining subject fidelity. This is not an upscaler or post-processor — it is a fundamental change to the diffusion trajectory that compresses the full denoising schedule into fewer steps.
| Setup | Steps | Generation Time (480p, 81 frames) | Subject Fidelity | Motion Quality |
|---|---|---|---|---|
| I2V-14B FP8, 30 steps | 30 | ~95 sec | Reference | Good |
| I2V-14B + LightX2V, 4 steps | 4 | ~22 sec | Slightly weaker | Good |
| I2V-5B + LightX2V, 4 steps | 4 | ~12 sec | Noticeably weaker | Moderate |
| I2V-14B GGUF q8_0 + LightX2V, 4 steps | 4 | ~20 sec | Comparable to FP8 | Good |
The LightX2V LoRA has two variants — low noise and high noise. For I2V, start with the low-noise variant. The high-noise variant introduces more motion variation at the cost of subject stability, and in I2V, subject stability is usually the priority.
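To put the table's timings in perspective, the speedup of each LightX2V setup over the 30-step baseline is simply the ratio of generation times. The snippet below is a quick arithmetic sketch using only the approximate timings listed above; the dictionary keys and the function name are illustrative.

```python
# Approximate generation times in seconds (480p, 81 frames) from the table above.
TIMES_SEC = {
    "I2V-14B FP8, 30 steps": 95,
    "I2V-14B + LightX2V, 4 steps": 22,
    "I2V-5B + LightX2V, 4 steps": 12,
    "I2V-14B GGUF q8_0 + LightX2V, 4 steps": 20,
}


def speedup_vs_baseline(times, baseline="I2V-14B FP8, 30 steps"):
    """Return {setup: baseline_time / setup_time} for every setup."""
    base = times[baseline]
    return {name: base / seconds for name, seconds in times.items()}


if __name__ == "__main__":
    for name, factor in speedup_vs_baseline(TIMES_SEC).items():
        print(f"{name}: {factor:.2f}x")
```

With these numbers, the 14B + LightX2V setups come out a little over four times faster than the 30-step baseline, which is why a 4-step pass is a cheap way to validate an image and prompt before a full run.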
Once you have the right checkpoint selected, your next task is arguably more important: preparing the image that feeds into it. The best model variant cannot fix a poorly prepared source image.
How to Prepare Your Source Image for Wan 2.2 I2V (3 Factors That Determine Success)
The input image matters more than the prompt. I2V starts from the image, and whatever the image contains — good or bad — becomes the foundation of the entire video.
Resolution and Aspect Ratio
Wan 2.2 was trained at 480p and 720p. The model center-crops images that do not match its expected aspect ratio. If you upload a tall portrait photo (e.g., 2:3 aspect ratio), the model crops both sides to fit the nearest supported ratio — and the cropped area may cut off important parts of your subject.
| Source Aspect Ratio | How Wan 2.2 Handles It | Recommendation |
|---|---|---|
| 3:2 (landscape photo) | Already 1.5:1, matches 720×480 without cropping | Best fit, resize directly to 720×480 |
| 16:9 (widescreen) | Letterbox or crop to center | Use 16:9 images with subject centered |
| 4:3 (standard photo) | Minimal crop | Good — close to native 720×480 |
| 1:1 (square) | Significant crop on sides | Add padding or resize to 3:2 before upload |
| 2:3 (portrait) | Heavy crop on sides | Avoid — subject likely cropped |
| 9:16 (phone portrait) | Extreme crop | Not recommended — subject will not fit |
Best practice: Resize your source image to 720×480 or 480×720 before feeding it to Wan 2.2 I2V. You can use any basic image editor, FFmpeg, or even the preview tool in your OS. The goal is to match Wan 2.2's native resolution so the model receives the full image without cropping.
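If you would rather control the crop yourself than let the model center-crop, you can compute the crop box before resizing. The following is a minimal sketch in plain Python, assuming a 720×480 target; the function name `center_crop_box` is hypothetical. The returned `(left, top, right, bottom)` tuple uses the same box convention as common image libraries (for example, Pillow's `Image.crop`), so you can crop with it and then resize to 720×480.

```python
def center_crop_box(width, height, target_w=720, target_h=480):
    """Return (left, top, right, bottom) of the largest centered crop
    that has the same aspect ratio as target_w x target_h."""
    target_ratio = target_w / target_h
    if width / height > target_ratio:
        # Source is wider than the target: trim the left and right edges.
        new_w = round(height * target_ratio)
        left = (width - new_w) // 2
        return (left, 0, left + new_w, height)
    # Source is taller than (or equal to) the target: trim top and bottom.
    new_h = round(width / target_ratio)
    top = (height - new_h) // 2
    return (0, top, width, top + new_h)


if __name__ == "__main__":
    print(center_crop_box(1920, 1080))  # 16:9 source
    print(center_crop_box(1000, 1000))  # square source
    print(center_crop_box(3000, 2000))  # 3:2 source, nothing trimmed
```

Checking the box before generation makes it obvious whether the crop will cut into the subject. If it does, pad or reframe the image rather than letting the crop decide.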
Image Quality Factors That Affect Output
| Factor | Good | Bad | Effect on Output |
|---|---|---|---|
| Subject position | Centered, fills 40–70% of frame | Off-center, very small or very large | Off-center subjects drift faster |
| Background | Clean, minimal clutter | Busy, text-heavy, patterned | Background flickers or warps |
| Lighting | Even, natural lighting | Harsh shadows, high contrast | Lighting artifacts in motion |
| Face angle | Front or ¾ profile | Extreme profile or turned away | Face drift increases significantly |
| Image sharpness | Clear, in focus | Blurry, compressed artifacts | Output inherits and amplifies blur |
| Expression | Neutral or subtle smile | Exaggerated expression | Expression collapses or morphs |
Rule of thumb: If an image would work well as a passport photo or product catalog shot, it will work well in Wan 2.2 I2V. If it is an action shot, extreme close-up, or group photo, expect the model to struggle with subject drift and motion artifacts.
The Background Check
Backgrounds are the most overlooked input quality factor. Wan 2.2 I2V must infer how the background continues behind and around the subject as the camera moves. Models cannot invent missing context — they hallucinate it.
- Solid backgrounds (plain walls, gradients, out-of-focus bokeh) produce the cleanest motion because there is less ambiguity about what exists behind the subject.
- Detailed backgrounds (trees, architecture, crowds) introduce visible warping or flickering as the model struggles to extrapolate the unseen areas.
- Text and logos are almost always distorted in motion. Remove or blur them before input if possible.
With the image prepared, the next question is what to tell the model about how it should move. I2V prompting follows a fundamentally different logic than T2V — and most users get it backward on their first attempt.
How to Write Prompts for Wan 2.2 I2V (The 3-Component Rule)
The I2V prompt follows a different logic than T2V. In T2V, the prompt carries the entire generation: subject, action, scene, camera, lighting. In I2V, the image carries the subject and scene — the prompt only carries what changes over time.
The 3-Component I2V Prompt Structure
- Motion — What the subject does (required)
- Camera — How the viewer sees it (optional but powerful)
- Atmosphere — Lighting, weather, mood changes (optional)
That is it. Do not describe the subject's appearance — the image already shows it. Every word spent describing the subject's hair color, clothing, or facial features is a word that pulls the model away from the reference image.
I2V Prompt Examples
| Quality | Prompt | Why |
|---|---|---|
| ❌ Weak | "A woman with brown hair and blue eyes looking at the camera" | Describes the subject (already in the image), no motion, no camera |
| ❌ Weak | "A photorealistic woman standing in a room, detailed face, cinematic lighting" | Describes static qualities the image already provides; no motion instruction |
| ✅ Good | "Turns head slowly toward the right, a subtle smile forms over 3 seconds" | Pure motion instruction, camera implied static |
| ✅ Good | "Slow zoom in, wind gently moves hair and clothes, soft natural lighting shift" | Camera + motion + atmosphere — no subject description |
| ✅ Good | "Walks forward from hips-up framing, looks around casually, shallow depth of field" | Motion + implied camera + scene atmosphere |
The Motion Magnitude Rule
The amount of motion in your prompt must match what the image allows.
| Image Type | Appropriate Motion | Risk of Over-Motion |
|---|---|---|
| Portrait, face-forward | Subtle: turn head, smile, blink, breathe | The face morphs or warps if the head turn is too large |
| Full body, standing | Moderate: walk, stretch, look around | The legs or arms distort if the starting pose is ambiguous |
| Product, still life | Subtle: rotate, light shift, pour | Objects "melt" or change shape with aggressive motion |
| Landscape, wide | Generous: pan, zoom, weather change | Sky warps but landscape usually holds |
Rule of thumb: Describe 30–50% less motion than you actually want. The model tends to overanimate — what sounds like "subtle" in the prompt often produces moderate motion, and "moderate" often produces aggressive, physics-violating motion.
These prompting and image preparation principles translate directly into a repeatable process. Here is the exact sequence — from cloud validation to final output — that produces consistent results whether you run locally or use a cloud API.
The 7-Step Wan 2.2 Image to Video Workflow (Validate Before You Generate)
This workflow works regardless of whether you run locally (ComfyUI, SwarmUI, AI Toolkit) or use a cloud platform like wan27.org. The principles are the same — only the interface changes.
Step 1: Validate with a Cloud Generation First
Before investing time in local setup, validate your approach with one cloud generation. This confirms three things:
- Your source image is suitable for I2V
- Your prompt direction produces usable motion
- The output quality meets your standard
Use Wan 2.2 I2V on wan27.org — upload an image, write a 3-component prompt at 480p, and generate. The generation takes 15–60 seconds depending on load. If the cloud output is unusable (subject drift, no motion, artifacts), the issue is your image or prompt, not the hardware or setup. Fix it before moving to local generation.
This step alone can save you hours of debugging a local install that was never the problem.
Step 2: Select Your I2V Checkpoint
Based on your hardware, pick from the decision table above. If you are unsure, start with the 14B GGUF q8_0, which offers the best balance of subject fidelity and VRAM efficiency for the Wan 2.2 I2V workflow on consumer GPUs.
Reminder: You must use an I2V-specific checkpoint. The T2V checkpoint ignores the image input entirely. The generation will run, save a file, and produce a video — but it will be a text-to-video generation that ignores your reference image completely. This is the most common "bug" reported in Wan 2.2 I2V, and it is not a bug — it is a model mismatch.
I2V checkpoint filenames contain i2v:
- wan2.2_i2v_14B_fp8_scaled.safetensors ✅ I2V
- wan2.2_i2v_14B_q8_0.gguf ✅ I2V (GGUF)
- wan2.2_i2v_5B_fp16.safetensors ✅ I2V (5B)
- wan2.2_t2v_14B_fp8_scaled.safetensors ❌ T2V, will not work for I2V
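Because a mismatched checkpoint fails silently, it can be worth checking the filename in a script before queuing a batch. This is a minimal sketch of the "filename contains i2v" rule from this step; `is_i2v_checkpoint` is a hypothetical helper, and it assumes your files keep the naming pattern shown above (a renamed file would defeat it).

```python
def is_i2v_checkpoint(filename: str) -> bool:
    """Apply the guide's rule: an I2V checkpoint filename contains 'i2v'.

    The check is case-insensitive and looks only at the name, not at the
    model weights, so it is a sanity check rather than a guarantee.
    """
    return "i2v" in filename.lower()


if __name__ == "__main__":
    for name in (
        "wan2.2_i2v_14B_fp8_scaled.safetensors",
        "wan2.2_i2v_14B_q8_0.gguf",
        "wan2.2_i2v_5B_fp16.safetensors",
        "wan2.2_t2v_14B_fp8_scaled.safetensors",
    ):
        print(name, "->", "OK" if is_i2v_checkpoint(name) else "T2V, wrong model")
```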
Step 3: Prepare the Image
Apply the image preparation rules from the section above:
- Resize to 720×480 (or the nearest supported resolution)
- Center the subject
- Keep background simple
- Check face visibility
- Verify good lighting
Step 4: Write a 3-Component I2V Prompt
Write your prompt following the motion + camera + atmosphere structure. Keep it under 30 words. Longer prompts in I2V increase the chance of subject drift because the model splits its attention between too many text tokens and the image conditioning.
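One way to keep prompts disciplined is to assemble them from the three components and reject anything that reaches the word limit. The helper below is an illustrative sketch, not part of any Wan 2.2 tool: the name `build_i2v_prompt` and the choice to raise an error are assumptions, and the 30-word limit is the guideline from this step.

```python
MAX_WORDS = 30  # "keep it under 30 words" guideline from this step


def build_i2v_prompt(motion, camera=None, atmosphere=None, max_words=MAX_WORDS):
    """Join motion (required), camera, and atmosphere (both optional)
    into one comma-separated prompt and enforce the word limit."""
    if not motion or not motion.strip():
        raise ValueError("motion is required: the prompt needs an action")
    parts = [p.strip() for p in (motion, camera, atmosphere) if p and p.strip()]
    prompt = ", ".join(parts)
    word_count = len(prompt.split())
    if word_count >= max_words:
        raise ValueError(f"prompt has {word_count} words; keep it under {max_words}")
    return prompt


if __name__ == "__main__":
    print(build_i2v_prompt(
        motion="Turns head slowly toward the right",
        camera="slow zoom in",
        atmosphere="soft natural lighting shift",
    ))
```

The function deliberately has no parameter for the subject's appearance, which mirrors the rule that the image, not the prompt, carries the subject.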
Step 5: Set Generation Parameters
| Parameter | Recommended I2V Starting Point | Notes |
|---|---|---|
| Frames | 41–81 | 81 (~5s) is the default; start at 41 for validation |
| Steps | 30 (or 4 with LightX2V) | 30–40 for quality; 50+ has diminishing returns |
| CFG | 4.5 | Lower than T2V (5.0) — high CFG increases drift |
| Sampler | euler | Most consistent for I2V |
| Scheduler | sgm_uniform | Default, reliable |
| Resolution | 480p (720×480) | Always validate at 480p before scaling |
| Seed | Random for first pass | Set a fixed seed when iterating on the same image |
CFG and I2V: A CFG of 4.0–4.5 produces the best subject preservation for I2V. Higher CFG values (5.5+) push the model away from the reference image, increasing motion but also increasing drift. If the subject stays stable but motion is too weak, increase frame count or adjust the prompt — do not raise CFG above 5.0.
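The starting values from the table can be captured in a small settings object that flags values outside the ranges this guide recommends. This is a sketch under stated assumptions: the class `I2VSettings` and its field names are hypothetical and do not correspond to ComfyUI node inputs or any API schema, and the 3.5 lower CFG bound is the threshold this guide associates with the "statue" effect.

```python
from dataclasses import dataclass


@dataclass
class I2VSettings:
    """Starting values from the parameter table (field names are illustrative)."""
    frames: int = 41            # start at 41 for validation, 81 for final
    steps: int = 30             # or 4 with LightX2V
    cfg: float = 4.5
    sampler: str = "euler"
    scheduler: str = "sgm_uniform"
    width: int = 720
    height: int = 480

    def warnings(self):
        """Return a list of human-readable notes for out-of-range values."""
        notes = []
        if not 41 <= self.frames <= 81:
            notes.append("frames outside the 41-81 range: drift risk grows with length")
        if self.cfg > 5.0:
            notes.append("CFG above 5.0: higher risk of drift from the reference")
        elif self.cfg < 3.5:
            notes.append("CFG below 3.5: risk of almost no motion")
        return notes


if __name__ == "__main__":
    print(I2VSettings().warnings())
    print(I2VSettings(cfg=6.0, frames=121).warnings())
```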
Step 6: Generate and Evaluate
Generate the clip and evaluate it against five criteria:
- Subject preservation — Does the person stay recognizable from frame 1 to the last frame?
- Motion naturalness — Does the motion look like real physics or like liquid morphing?
- Background stability — Does the background warp or flicker?
- Prompt adherence — Does the motion match what you described?
- Resolution quality — Is the output sharp enough for your use case?
If the clip fails on criterion 1 or 3, fix the image or lower CFG. If it fails on 2 or 4, adjust the prompt. If it fails on 5, increase resolution (only after passing 1–4 at 480p).
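The triage rule above can be written down as a lookup, which helps when you log results for many clips. The sketch below is one possible encoding: the criterion numbers follow the list in this step, while the function name `next_actions` and the way resolution is deferred are assumptions about how to apply the rule.

```python
# Criterion number -> fix, following the rule stated in this step.
FIXES = {
    1: "fix the image or lower CFG",   # subject preservation
    2: "adjust the prompt",            # motion naturalness
    3: "fix the image or lower CFG",   # background stability
    4: "adjust the prompt",            # prompt adherence
    5: "increase resolution",          # resolution quality
}


def next_actions(failed_criteria):
    """Return the fixes to try, without duplicates, in criterion order.

    Resolution (criterion 5) is only suggested once criteria 1-4 pass.
    """
    failed = sorted(set(failed_criteria))
    unknown = [c for c in failed if c not in FIXES]
    if unknown:
        raise ValueError(f"unknown criteria: {unknown}")
    blocking = [c for c in failed if c != 5]
    to_fix = blocking if blocking else failed
    actions = []
    for criterion in to_fix:
        if FIXES[criterion] not in actions:
            actions.append(FIXES[criterion])
    return actions


if __name__ == "__main__":
    print(next_actions([1, 3]))
    print(next_actions([2, 5]))
    print(next_actions([5]))
```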
Step 7: Iterate — Change One Variable at a Time
I2V iteration follows a clear priority:
- Image first — A bad image cannot be saved by a good prompt
- Checkpoint second — Wrong checkpoint = wrong result
- CFG third — Fine-tune subject vs. motion balance
- Prompt last — Only adjust prompt after the first three are correct
Most iteration loops get this backward: users change the prompt first, then the CFG, then the image, and never check whether they loaded the I2V checkpoint. Follow the priority. It saves generations.
5 I2V-Specific Problems and Their Exact Fixes (With a Rule of Thumb for Each)
These are the most common failures in the Wan 2.2 image-to-video workflow and the exact fix for each.
Problem 1: The Subject Drifts Into a Different Person
Symptom: The first frame matches the reference perfectly, but by frame 40, the face, hair color, or clothing has changed.
Root causes (check in this order):
- You used the T2V checkpoint instead of I2V
- CFG is too high (above 5.5)
- The reference image has the subject off-center or with an extreme expression
- You are using the 5B checkpoint, which has weaker subject preservation
Fix:
- Verify the checkpoint filename contains i2v
- Lower CFG to 4.0–4.5
- Recenter the subject or use a neutral-expression reference image
- If on 5B, switch to 14B or accept that subject drift will be higher
Rule of thumb: If the subject has visibly changed before frame 10, the checkpoint is wrong. If drift builds gradually over 40+ frames, adjust CFG and image centering.
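If you note the first frame where the subject visibly changes, this rule of thumb becomes a two-threshold check. The sketch below encodes it directly. The function name is hypothetical, and the handling of onsets between frame 10 and frame 40 is an assumption, since the rule above does not cover that range.

```python
def diagnose_drift(first_drift_frame: int) -> str:
    """Map the first frame with visible drift to the most likely cause."""
    if first_drift_frame < 10:
        # Rule of thumb: drift this early points at a model mismatch.
        return "check the checkpoint: the filename should contain i2v"
    if first_drift_frame >= 40:
        # Rule of thumb: gradual drift points at CFG and image centering.
        return "lower CFG to 4.0-4.5 and recenter the subject"
    # Assumption: the guide gives no rule for this range, so check both.
    return "ambiguous onset: verify the checkpoint, then adjust CFG and centering"


if __name__ == "__main__":
    for frame in (5, 25, 60):
        print(frame, "->", diagnose_drift(frame))
```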
Problem 2: Almost No Motion (The "Statue" Effect)
Symptom: The video looks like a barely-wobbling still image. The subject stays frozen, and only subtle jitter suggests something happened.
Root causes:
- CFG is too low (below 3.5)
- The prompt contains no action verbs — only descriptions
- The reference image is too "perfect" (studio lighting, rigid pose, everything in focus)
Fix:
- Raise CFG to 4.5–5.0
- Rewrite the prompt with explicit motion: "turns head," "walks," "raises hand"
- Use an image with a more dynamic pose or expression
Rule of thumb: If the output has zero visible motion, the prompt is the problem — not the settings. The prompt must contain an explicit action verb. "A person standing" is a static description. "Turns head slowly" is motion.
Problem 3: The Background Warps or Melts
Symptom: The subject is stable, but the background ripples, stretches, or flickers throughout the clip.
Root causes:
- The background is too detailed (trees, text, patterns)
- The image quality is low (compressed, low resolution)
- The generation is too long (81+ frames) for the background complexity
Fix:
- Replace the background with a simple gradient, solid color, or bokeh before I2V
- Use a higher-quality source image (less compression, higher pixel count)
- Reduce frame count to 41 and evaluate whether shorter output reduces warping
Rule of thumb: A background that looks slightly boring in the source image will look stable in motion. A background that looks beautifully detailed will likely warp or flicker. Choose stability over aesthetics for background content.
Problem 4: Faces Distort or "Melt" When Moving
Symptom: The face transforms unnaturally when the head moves — eyes slide sideways, teeth distort, nose changes shape.
Root causes:
- The head turn described in the prompt is too aggressive
- The reference face is small relative to the frame
- Fine facial details (teeth, glasses, jewelry) confuse the model
Fix:
- Reduce motion magnitude in the prompt — "turns slightly" instead of "turns head sharply"
- Crop the image to make the face larger in the frame
- If using glasses, accept that they may warp — consider removing them in the source image
- Add "closed mouth" if teeth are visible — the model consistently distorts teeth during motion
Rule of thumb: If a facial detail makes you nervous when you look at the source image — teeth, glasses, extreme angle — that detail will look worse in motion. Fix it in the source before the model animates it.
Problem 5: The Output Is Blurry or Pixelated
Symptom: The video looks lower quality than the source image. Fine details vanish.
Root causes:
- Generating at 480p when the source was higher resolution (the downscale loses detail)
- Using a low-quality source image compressed by social media or messaging apps
- The VAE decoding is introducing artifacts
Fix:
- Generate at 720p if your hardware supports it
- Use the highest-quality source image available — avoid re-downloaded JPEGs from Telegram or WhatsApp
- If using ComfyUI, check the VAE dtype matches the model dtype (mismatched decoding causes softness)
Rule of thumb: The output will never be sharper than the source. If the source image was compressed by a messaging app or re-downloaded from social media, start by finding the original file — not by tweaking generation settings.
Wan 2.2 I2V vs T2V: When Each Workflow Makes Sense
The choice between I2V and T2V is not about which is better — it is about which input you have and what you want the output to look like.
| Situation | Use I2V | Use T2V |
|---|---|---|
| You have a specific person/character | ✅ Best — preserves appearance | ❌ Requires detailed prompt that may not match |
| You want a specific scene composition | ✅ Reference image controls framing | ❌ Model interprets scene freely |
| You need a specific action/motion pattern | ✅ — Prompt specifies motion | ✅ — Prompt specifies everything |
| You have no reference image, only text | ❌ Needs an image input | ✅ Natural fit |
| You want creative freedom, no constraints | ❌ Image restricts output | ✅ Model generates freely |
| Your reference image is low quality | ❌ Poor input = poor output | ✅ No input image needed |
| You are iterating fast on a concept | ❌ Each iteration needs image prep | ✅ Faster per generation |
Limitations and Responsible Use of Wan 2.2 I2V
Wan 2.2 I2V is a powerful tool, but it has real constraints that affect how and when you should use it.
Cost and Resource Considerations
Running the 14B I2V checkpoint requires roughly 11–15 GB of VRAM and takes 20–95 seconds per generation depending on the setup. If you are generating through the wan27.org API, each generation consumes credits based on resolution, frame count, and step count. At scale, the cost of hundreds of iterations adds up quickly.
Cost-saving strategy: Validate every image and prompt at 480p with 41 frames and 4-step LightX2V before committing to a full 720p, 81-frame, 30-step generation. This reduces each failed generation's cost by roughly 80%.
When Not to Use I2V
I2V is not the right workflow for every task. Avoid it when:
- You need free-form creative generation — Use T2V instead. I2V constrains the output to the reference image, which limits creative freedom.
- Your reference image is low quality — A blurry, compressed, or poorly lit source image produces a blurry, artifact-ridden video. No amount of prompting or settings tuning will fix a bad source.
- The subject has no clear visual anchor — Abstract concepts, text-heavy slides, or images with multiple similar subjects often confuse the model, leading to drift and morphing.
- You need consistent multi-shot output — Wan 2.2 I2V has no built-in memory between generations. The same image and prompt with different seeds produce different motion patterns, and characters will not remain visually consistent across cuts.
Ethical Use Guidelines
Image-to-video generation raises specific ethical considerations that every user should address before publishing or sharing output:
- Consent: Only use images of real people with their explicit permission. Generating video of a person without consent — even from a publicly available photo — carries legal and ethical risks in most jurisdictions.
- Transparency: Disclose that the video was AI-generated when publishing or sharing. Most social platforms now require AI-generated content labels, and omitting the disclosure may violate platform terms of service.
- Misrepresentation: Do not use Wan 2.2 I2V to create video that misrepresents a real event, a person's actions, or a product's capabilities. The model can produce convincing motion, but that motion is generated, not recorded.
- Content safety: Wan 2.2 I2V inherits the biases and limitations of its training data. The model may produce unexpected or undesirable content when given images outside its training distribution. Review every output before sharing.
Rule of thumb: If you would not share the source image publicly without context, do not generate a video from it. The video amplifies both the good and the problematic aspects of the input.
FAQ: Wan 2.2 Image to Video
What is the difference between Wan 2.2 I2V and T2V?
Wan 2.2 I2V generates video from an image input plus a text prompt that describes motion. Wan 2.2 T2V generates video from a text prompt alone, without any reference image. I2V uses a separate checkpoint with an image-conditioning pathway, which is what lets it preserve the subject and keep later frames consistent with the reference.
Which Wan 2.2 checkpoint should I use for image to video?
Use the 14B I2V checkpoint (FP8 or GGUF q8_0) for the best balance of subject preservation and motion quality. The Wan 2.2 5B image-to-video checkpoint works for simple subjects and quick previews but produces more subject drift. The T2V checkpoint does not accept image inputs at all.
Why does my Wan 2.2 I2V output ignore the reference image?
You loaded the T2V checkpoint instead of the I2V checkpoint. The T2V model has no image conditioning pathway and silently defaults to text-only generation. Check the checkpoint filename — it must contain i2v.
What resolution should my input image be for Wan 2.2 I2V?
Resize to 720×480 for the best results. Wan 2.2 was trained at this resolution, and the model center-crops images that do not match. Using a mismatched aspect ratio means losing part of your image to cropping.
How many frames should I use for Wan 2.2 I2V?
41–81 frames (roughly 2.5–5 seconds at 16 fps). Start at 41 frames for validation — it reduces generation time and minimizes drift risk. Only extend to 81 frames after confirming subject preservation at 41 frames.
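The duration figures come from dividing the frame count by the frame rate. A quick check, assuming the 16 fps stated above (the function name is illustrative):

```python
FPS = 16  # frame rate stated in this answer


def clip_seconds(frames: int, fps: int = FPS) -> float:
    """Clip duration in seconds for a given frame count."""
    return frames / fps


if __name__ == "__main__":
    for frames in (41, 81):
        print(f"{frames} frames at {FPS} fps = {clip_seconds(frames):.2f} s")
```

So 41 frames is about 2.6 seconds and 81 frames is about 5.1 seconds, which matches the "roughly 2.5–5 seconds" range.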
Why is my Wan 2.2 I2V output blurry?
The most common causes are generating at 480p when the source was higher resolution (the downscale loses detail), or using a low-quality source image. Try generating at 720p, or use a higher-quality source image.
Can I use Wan 2.2 I2V without a GPU?
Yes. Wan 2.2 I2V on wan27.org runs generation server-side — no GPU needed on your end. Upload an image, write a prompt, and generate in a browser.
Does LightX2V work with Wan 2.2 I2V?
Yes. LightX2V is a distilled LoRA designed specifically for Wan 2.2 I2V. It reduces the denoising steps from 30–50 to 4–6 while maintaining subject fidelity. Use the low-noise variant for I2V because it preserves the reference image better than the high-noise variant.
What CFG should I use for Wan 2.2 I2V?
Start at 4.5. The I2V workflow benefits from slightly lower CFG than T2V because high CFG (5.5+) increases drift from the reference image. If motion is too weak, increase frame count or adjust the prompt before raising CFG above 5.0.
Why does the background warp in my Wan 2.2 I2V output?
Complex backgrounds (trees, text, patterns) force the model to hallucinate what exists behind the subject as the camera or subject moves. Use a simple background (solid color, gradient, out-of-focus bokeh) for cleaner motion.
Core Summary: Your I2V Workflow Checklist
Every time you start a Wan 2.2 image-to-video generation, run through this checklist:
- Checkpoint: Confirm the model filename contains i2v
- Image: Resized to 720×480, subject centered, face visible, simple background
- Prompt: Motion + camera + atmosphere only. No subject description
- CFG: 4.0–4.5
- Frames: 41 for validation, 81 for final
- Steps: 30 (or 4 with LightX2V)
- Evaluate: Subject stable? Motion natural? Background clean?
The most expensive mistake is iterating on the prompt when the checkpoint or image is the problem. Follow the priority: image → checkpoint → CFG → prompt.
If you do not have a GPU that can run Wan 2.2, or simply want to skip the setup — generate your first Wan 2.2 I2V clip at wan27.org in under a minute with no hardware requirements. Upload any image, write a motion prompt, and download the result directly from your browser.