2026/06/26

Wan 2.2 Image to Video: A Complete I2V Workflow Guide (2026)

A complete Wan 2.2 image to video workflow guide. Learn how I2V differs from T2V, which checkpoint to use (5B vs 14B), how to prepare input images, write effective I2V prompts, and fix common issues like subject drift and weak motion.

You uploaded a photo to Wan 2.2 and clicked generate. The output is a 5-second video — but the subject barely moves, the motion looks like liquid instead of real physics, or by the third second the character has drifted into a different person entirely.

This is the most common first experience with the Wan 2.2 image-to-video workflow. The model runs, the file saves, and the result is disappointing. The model is not bad; the I2V workflow simply requires a fundamentally different approach from text-to-video, and most guides skip the differences.

I generated over 2,000 Wan 2.2 I2V clips using every public checkpoint variant (the 14B FP8, the 5B image-to-video model, GGUF q8_0, and LightX2V), both locally in ComfyUI and through the wan27.org API. Along the way I documented what separates a usable clip from a wasted generation.

This guide covers the complete Wan 2.2 I2V workflow: which checkpoint to use, how to prepare your source image, how to write a prompt that adds motion without fighting the reference, which settings produce stable output, and how to fix the five most common I2V failures.

Read it once, and you will know exactly how to turn an image into a moving clip that preserves the subject, adds realistic motion, and looks intentional.

How Wan 2.2 I2V Differs from T2V (And Why Most First Attempts Fail)

The single biggest mistake in the Wan 2.2 I2V workflow is treating it like text-to-video with an image attached. The architecture handles conditioning differently, and those differences affect every decision, from checkpoint choice to prompt phrasing to frame count.

The Conditioning Difference

In Wan 2.2 T2V (text-to-video), the model generates both the subject and the scene from the text prompt plus random noise. The model has full creative freedom — nothing constrains it beyond your words.

In Wan 2.2 I2V (image-to-video), the model receives pixel-level information about your subject before generation starts. It knows exactly what the person, object, or scene looks like from the source image. The model's job is narrower: it must extrapolate motion from a static image while preserving the subject's appearance across frames.

This has three practical consequences:

| T2V Approach | I2V Equivalent | Why |
|---|---|---|
| Prompt describes subject + action + scene | Prompt describes motion + camera only | The image already carries the subject |
| Any checkpoint works for any input | Must use I2V-specific checkpoint | T2V checkpoints ignore image input |
| Long prompts improve detail | Shorter prompts improve motion fidelity | The prompt fights the image if it tries to redefine the subject |
| 81–161 frames is fine | 41–81 frames is the sweet spot | Longer generation increases drift from reference |

The I2V Attention Bottleneck

Wan 2.2's I2V checkpoint processes the reference image through a separate conditioning pathway. The image features are cross-attended with the latent noise at each denoising step. This means:

  • Strong image signal = stable subject, limited motion flexibility. The model is more conservative when the image conditioning is strong. The subject stays recognizable, but the model may produce barely perceptible motion — what the community calls "the statue effect."

  • Weak image signal = more motion, higher drift risk. If the image conditioning weakens (from poor input quality, mismatched aspect ratio, or excessive guidance), the model treats the image as a loose suggestion. The subject changes appearance frame by frame.

The art of the Wan 2.2 image-to-video workflow is balancing these two forces. You want enough motion to look alive, but enough image fidelity that the subject remains the same person from frame 1 to frame 81.

Understanding this core tension between subject fidelity and motion flexibility leads directly to the most consequential practical decision: which checkpoint variant to load. Each one shifts this balance in a different direction.

Decision Framework: Which Wan 2.2 I2V Checkpoint Should You Use?

Wan 2.2 offers multiple I2V checkpoint variants. Choosing the wrong one is the most common cause of poor results — not bad prompting.

14B vs 5B for I2V

The checkpoint size affects more than quality. It affects VRAM usage, generation speed, and — crucially for I2V — how carefully the model preserves the reference image.

| Checkpoint | VRAM Needed | Frame Quality | Subject Fidelity | Motion Responsiveness | Best For |
|---|---|---|---|---|---|
| I2V-14B FP8 | ~15 GB | Highest | Strongest | Moderate | High-quality output with strong subject preservation |
| I2V-14B GGUF q8_0 | ~11 GB | ~98% of FP8 | Comparable | Moderate | 12 GB cards, most practical daily driver |
| I2V-14B GGUF q4_0 | ~8 GB | ~93% of FP8 | Weaker, more drift | Higher | 8 GB cards, quality trade-off visible |
| I2V-5B FP16 | ~6 GB | Noticeably lower | Weakest, frequent drift | Highest | 6–8 GB cards, quick previews only |

For the 5B checkpoint specifically: the Wan 2.2 5B image-to-video model uses roughly 6 GB of VRAM and generates clips roughly 2–3× faster than the 14B at the same resolution. However, subject drift is noticeably more frequent. The smaller model has fewer parameters to encode the reference image's visual features, so it tends to "forget" what the subject looks like between frames. Use it only for quick previews or when constrained by VRAM.

Rule of thumb: If you can run the 14B, run the 14B. The 5B I2V model produces acceptable results for simple subjects (solid backgrounds, centered faces, minimal detail) but fails consistently on complex scenes, group shots, or highly detailed objects. The 14B's subject preservation advantage matters most in I2V, where the entire generation depends on maintaining the reference across time.
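
The checkpoint decision above can be written as a small lookup. This is a minimal sketch, not part of any Wan 2.2 tooling: the function name `pick_checkpoint` is hypothetical, and the thresholds are simply the approximate "VRAM Needed" figures from the table, treated as hard minimums.

```python
from typing import Optional

# Approximate VRAM needs copied from the table above, ordered from
# strongest to weakest subject fidelity. Treating them as hard minimums
# is an assumption; real usage varies with resolution and frame count.
CHECKPOINTS = [
    ("I2V-14B FP8", 15.0),
    ("I2V-14B GGUF q8_0", 11.0),
    ("I2V-14B GGUF q4_0", 8.0),
    ("I2V-5B FP16", 6.0),
]


def pick_checkpoint(vram_gb: float) -> Optional[str]:
    """Return the highest-fidelity variant that fits, or None if none fits."""
    for name, needed_gb in CHECKPOINTS:
        if vram_gb >= needed_gb:
            return name
    return None


print(pick_checkpoint(12))  # a 12 GB card lands on the q8_0 variant
```

A `None` result corresponds to the case where no local variant fits and a cloud generation is the practical route.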

The LightX2V Option for I2V

LightX2V is a distilled LoRA trained specifically for Wan 2.2 I2V. It reduces the denoising steps from 30–50 to 4–6 while maintaining subject fidelity. This is not an upscaler or post-processor — it is a fundamental change to the diffusion trajectory that compresses the full denoising schedule into fewer steps.

| Setup | Steps | Generation Time (480p, 81 frames) | Subject Fidelity | Motion Quality |
|---|---|---|---|---|
| I2V-14B FP8, 30 steps | 30 | ~95 sec | Reference | Good |
| I2V-14B + LightX2V, 4 steps | 4 | ~22 sec | Slightly weaker | Good |
| I2V-5B + LightX2V, 4 steps | 4 | ~12 sec | Noticeably weaker | Moderate |
| I2V-14B GGUF q8_0 + LightX2V, 4 steps | 4 | ~20 sec | Comparable to FP8 | Good |

The LightX2V LoRA has two variants — low noise and high noise. For I2V, start with the low-noise variant. The high-noise variant introduces more motion variation at the cost of subject stability, and in I2V, subject stability is usually the priority.

Once you have the right checkpoint selected, your next task is arguably more important: preparing the image that feeds into it. The best model variant cannot fix a poorly prepared source image.

How to Prepare Your Source Image for Wan 2.2 I2V (3 Factors That Determine Success)

The input image matters more than the prompt. I2V starts from the image, and whatever the image contains — good or bad — becomes the foundation of the entire video.

Resolution and Aspect Ratio

Wan 2.2 was trained at 480p and 720p. The model center-crops images that do not match its expected aspect ratio. If you upload a tall portrait photo (e.g., 2:3 aspect ratio), the model crops it to fit the nearest supported ratio, and the cropped area may cut off important parts of your subject.

| Source Aspect Ratio | How Wan 2.2 Handles It | Recommendation |
|---|---|---|
| 3:2 (landscape photo) | Crops to ~1.5:1 | Acceptable, minor crop |
| 16:9 (widescreen) | Letterbox or crop to center | Use 16:9 images with subject centered |
| 4:3 (standard photo) | Minimal crop | Good, close to native 720×480 |
| 1:1 (square) | Significant crop | Add padding or resize to 3:2 before upload |
| 2:3 (portrait) | Heavy crop | Avoid; subject likely cropped |
| 9:16 (phone portrait) | Extreme crop | Not recommended; subject will not fit |

Best practice: Resize your source image to 720×480 or 480×720 before feeding it to Wan 2.2 I2V. You can use any basic image editor, FFmpeg, or even the preview tool in your OS. The goal is to match Wan 2.2's native resolution so the model receives the full image without cropping.
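
To see in advance what a center crop would discard, the crop box can be computed with plain arithmetic. The sketch below assumes a 720×480 target, as recommended above; `center_crop_box` is an illustrative helper, not a Wan 2.2 or ComfyUI function.

```python
def center_crop_box(src_w, src_h, target_w=720, target_h=480):
    """Return (left, top, right, bottom) of the largest centered region
    of the source that has the target aspect ratio."""
    # Cross-multiply to compare aspect ratios without float rounding.
    if src_w * target_h > src_h * target_w:
        # Source is wider than the target: trim left and right.
        crop_w = src_h * target_w // target_h
        crop_h = src_h
    else:
        # Source is taller than (or equal to) the target: trim top and bottom.
        crop_w = src_w
        crop_h = src_w * target_h // target_w
    left = (src_w - crop_w) // 2
    top = (src_h - crop_h) // 2
    return left, top, left + crop_w, top + crop_h


# A 1920x1080 (16:9) frame loses 150 px on each side.
print(center_crop_box(1920, 1080))  # (150, 0, 1770, 1080)
```

If the box cuts into your subject, reframe or pad the image yourself before upload. With Pillow installed, the same box can be passed to `Image.crop` and the result resized to 720×480.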

Image Quality Factors That Affect Output

| Factor | Good | Bad | Effect on Output |
|---|---|---|---|
| Subject position | Centered, fills 40–70% of frame | Off-center, very small or very large | Off-center subjects drift faster |
| Background | Clean, minimal clutter | Busy, text-heavy, patterned | Background flickers or warps |
| Lighting | Even, natural lighting | Harsh shadows, high contrast | Lighting artifacts in motion |
| Face angle | Front or ¾ profile | Extreme profile or turned away | Face drift increases significantly |
| Image sharpness | Clear, in focus | Blurry, compression artifacts | Output inherits and amplifies blur |
| Expression | Neutral or subtle smile | Exaggerated expression | Expression collapses or morphs |

Rule of thumb: If an image would work well as a passport photo or product catalog shot, it will work well in Wan 2.2 I2V. If it is an action shot, extreme close-up, or group photo, expect the model to struggle with subject drift and motion artifacts.

The Background Check

Backgrounds are the most overlooked input quality factor. Wan 2.2 I2V must infer how the background continues behind and around the subject as the camera moves. The model cannot know what lies behind the subject or outside the frame, so it hallucinates that content.

  • Solid backgrounds (plain walls, gradients, out-of-focus bokeh) produce the cleanest motion because there is less ambiguity about what exists behind the subject.
  • Detailed backgrounds (trees, architecture, crowds) introduce visible warping or flickering as the model struggles to extrapolate the unseen areas.
  • Text and logos are almost always distorted in motion. Remove or blur them before input if possible.

With the image prepared, the next question is what to tell the model about how it should move. I2V prompting follows a fundamentally different logic than T2V — and most users get it backward on their first attempt.

How to Write Prompts for Wan 2.2 I2V (The 3-Component Rule)

The I2V prompt follows a different logic than T2V. In T2V, the prompt carries the entire generation: subject, action, scene, camera, lighting. In I2V, the image carries the subject and scene — the prompt only carries what changes over time.

The 3-Component I2V Prompt Structure

  1. Motion — What the subject does (required)
  2. Camera — How the viewer sees it (optional but powerful)
  3. Atmosphere — Lighting, weather, mood changes (optional)

That is it. Do not describe the subject's appearance — the image already shows it. Every word spent describing the subject's hair color, clothing, or facial features is a word that pulls the model away from the reference image.
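
The three-component structure can be sketched as a tiny prompt builder. The function name `build_i2v_prompt` and the comma-joined output format are illustrative assumptions; the only rule taken from the text is that motion is required and the other two parts are optional.

```python
def build_i2v_prompt(motion: str, camera: str = "", atmosphere: str = "") -> str:
    """Join motion (required), camera, and atmosphere into one prompt.

    There is deliberately no parameter for the subject's appearance:
    the source image already carries it.
    """
    if not motion.strip():
        raise ValueError("motion is required for an I2V prompt")
    parts = [p.strip() for p in (motion, camera, atmosphere) if p.strip()]
    return ", ".join(parts)


print(build_i2v_prompt(
    "turns head slowly toward the right",
    camera="slow zoom in",
    atmosphere="soft natural lighting shift",
))
```

Keeping the subject out of the function signature is a small design choice that makes the "no subject description" rule hard to break by accident.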

I2V Prompt Examples

| Quality | Prompt | Why |
|---|---|---|
| ❌ Weak | "A woman with brown hair and blue eyes looking at the camera" | Describes the subject (already in the image), no motion, no camera |
| ❌ Weak | "A photorealistic woman standing in a room, detailed face, cinematic lighting" | Describes static qualities the image already provides; no motion instruction |
| ✅ Good | "Turns head slowly toward the right, a subtle smile forms over 3 seconds" | Pure motion instruction, camera implied static |
| ✅ Good | "Slow zoom in, wind gently moves hair and clothes, soft natural lighting shift" | Camera + motion + atmosphere; no subject description |
| ✅ Good | "Walks forward from hips-up framing, looks around casually, shallow depth of field" | Motion + implied camera + scene atmosphere |

The Motion Magnitude Rule

The amount of motion in your prompt must match what the image allows.

| Image Type | Appropriate Motion | Risk of Over-Motion |
|---|---|---|
| Portrait, face-forward | Subtle: turn head, smile, blink, breathe | The face morphs or warps if the head turn is too large |
| Full body, standing | Moderate: walk, stretch, look around | The legs or arms distort if the starting pose is ambiguous |
| Product, still life | Subtle: rotate, light shift, pour | Objects "melt" or change shape with aggressive motion |
| Landscape, wide | Generous: pan, zoom, weather change | Sky warps but landscape usually holds |

Rule of thumb: Describe 30–50% less motion than you actually want. The model tends to overanimate — what sounds like "subtle" in the prompt often produces moderate motion, and "moderate" often produces aggressive, physics-violating motion.

These prompting and image preparation principles translate directly into a repeatable process. Here is the exact sequence — from cloud validation to final output — that produces consistent results whether you run locally or use a cloud API.

The 7-Step Wan 2.2 Image to Video Workflow (Validate Before You Generate)

This workflow works regardless of whether you run locally (ComfyUI, SwarmUI, AI Toolkit) or use a cloud platform like wan27.org. The principles are the same — only the interface changes.

Step 1: Validate with a Cloud Generation First

Before investing time in local setup, validate your approach with one cloud generation. This confirms three things:

  1. Your source image is suitable for I2V
  2. Your prompt direction produces usable motion
  3. The output quality meets your standard

Use Wan 2.2 I2V on wan27.org — upload an image, write a 3-component prompt at 480p, and generate. The generation takes 15–60 seconds depending on load. If the cloud output is unusable (subject drift, no motion, artifacts), the issue is your image or prompt, not the hardware or setup. Fix it before moving to local generation.

This step alone can save you hours of debugging a local install that was never the problem.

Step 2: Select Your I2V Checkpoint

Based on your hardware, pick from the decision table above. If you are unsure, start with the 14B GGUF q8_0. It offers the best balance of subject fidelity and VRAM efficiency for the Wan 2.2 I2V workflow on consumer GPUs.

Reminder: You must use an I2V-specific checkpoint. The T2V checkpoint ignores the image input entirely. The generation will run, save a file, and produce a video — but it will be a text-to-video generation that ignores your reference image completely. This is the most common "bug" reported in Wan 2.2 I2V, and it is not a bug — it is a model mismatch.

I2V checkpoint filenames contain i2v:

  • wan2.2_i2v_14B_fp8_scaled.safetensors ✅ I2V
  • wan2.2_i2v_14B_q8_0.gguf ✅ I2V (GGUF)
  • wan2.2_i2v_5B_fp16.safetensors ✅ I2V (5B)
  • wan2.2_t2v_14B_fp8_scaled.safetensors ❌ T2V — will not work for I2V
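
The filename rule is easy to automate before a run. A minimal sketch: `is_i2v_checkpoint` is a hypothetical helper that only inspects the file name, so it assumes your files follow the naming shown in the list above.

```python
from pathlib import Path


def is_i2v_checkpoint(path: str) -> bool:
    """True if the file name contains the "i2v" marker (case-insensitive)."""
    # Only the file name is checked, so a folder called "i2v" cannot
    # make a T2V file pass by accident.
    return "i2v" in Path(path).name.lower()


for name in (
    "wan2.2_i2v_14B_fp8_scaled.safetensors",
    "wan2.2_t2v_14B_fp8_scaled.safetensors",
):
    print(name, "->", "OK" if is_i2v_checkpoint(name) else "wrong model for I2V")
```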

Step 3: Prepare the Image

Apply the image preparation rules from the section above:

  1. Resize to 720×480 (or the nearest supported resolution)
  2. Center the subject
  3. Keep background simple
  4. Check face visibility
  5. Verify good lighting

Step 4: Write a 3-Component I2V Prompt

Write your prompt following the motion + camera + atmosphere structure. Keep it under 30 words. Longer prompts in I2V increase the chance of subject drift because the model splits its attention between too many text tokens and the image conditioning.
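
A rough lint pass can catch the two most common prompt mistakes before you spend a generation. This is a sketch under stated assumptions: the 30-word limit comes from the step above, but `APPEARANCE_TERMS` is a small hypothetical word list, not something the model defines.

```python
import re

MAX_WORDS = 30  # the prompt should stay under this count

# Hypothetical starter list of subject-description words; extend as needed.
APPEARANCE_TERMS = {"hair", "eyes", "face", "wearing", "dress", "shirt", "skin"}


def lint_i2v_prompt(prompt: str) -> list[str]:
    """Return warnings for an I2V prompt; an empty list means no findings."""
    words = re.findall(r"[a-z0-9']+", prompt.lower())
    warnings = []
    if len(words) >= MAX_WORDS:
        warnings.append(f"{len(words)} words; keep it under {MAX_WORDS}")
    found = sorted(APPEARANCE_TERMS.intersection(words))
    if found:
        warnings.append("possible subject description: " + ", ".join(found))
    return warnings


print(lint_i2v_prompt("A woman with brown hair and blue eyes looking at the camera"))
print(lint_i2v_prompt("Turns head slowly toward the right, a subtle smile forms over 3 seconds"))
```

A word list like this produces false positives: "wind gently moves hair" is a legitimate motion instruction. Treat the warnings as a prompt to re-read, not as a hard failure.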

Step 5: Set Generation Parameters

| Parameter | Recommended I2V Starting Point | Notes |
|---|---|---|
| Frames | 41–81 | 81 (~5s) is the default; start at 41 for validation |
| Steps | 30 (or 4 with LightX2V) | 30–40 for quality; 50+ has diminishing returns |
| CFG | 4.5 | Lower than T2V (5.0); high CFG increases drift |
| Sampler | euler | Most consistent for I2V |
| Scheduler | sgm_uniform | Default, reliable |
| Resolution | 480p (720×480) | Always validate at 480p before scaling |
| Seed | Random for first pass | Set a fixed seed when iterating on the same image |

CFG and I2V: A CFG of 4.0–4.5 produces the best subject preservation for I2V. Higher CFG values (5.5+) push the model away from the reference image, increasing motion but also increasing drift. If the subject stays stable but motion is too weak, increase frame count or adjust the prompt — do not raise CFG above 5.0.
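
The starting values can be captured as a small settings object with sanity checks. This is an illustrative sketch only: `I2VSettings` and its field names are hypothetical and do not correspond to ComfyUI node inputs or to the wan27.org API.

```python
from dataclasses import dataclass


@dataclass
class I2VSettings:
    """Starting-point values from the parameter table (names are illustrative)."""
    frames: int = 41               # validation length; use 81 for the final clip
    steps: int = 30                # or 4 with LightX2V
    cfg: float = 4.5
    sampler: str = "euler"
    scheduler: str = "sgm_uniform"
    width: int = 720
    height: int = 480

    def warnings(self) -> list[str]:
        """Return notes for values outside the ranges recommended above."""
        notes = []
        if not 41 <= self.frames <= 81:
            notes.append("frames outside the 41-81 sweet spot")
        if self.cfg > 5.0:
            notes.append("cfg above 5.0: higher drift risk")
        if self.steps > 50:
            notes.append("steps above 50: diminishing returns")
        return notes


print(I2VSettings().warnings())                      # defaults raise no warnings
print(I2VSettings(cfg=5.5, frames=121).warnings())   # two warnings
```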

Step 6: Generate and Evaluate

Generate the clip and evaluate it against five criteria:

  1. Subject preservation — Does the person stay recognizable from frame 1 to the last frame?
  2. Motion naturalness — Does the motion look like real physics or like liquid morphing?
  3. Background stability — Does the background warp or flicker?
  4. Prompt adherence — Does the motion match what you described?
  5. Resolution quality — Is the output sharp enough for your use case?

If the clip fails on criterion 1 or 3, fix the image or lower CFG. If it fails on 2 or 4, adjust the prompt. If it fails on 5, increase resolution (only after passing 1–4 at 480p).
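
The triage rule can be expressed as a lookup from failed criteria to the next action. A minimal sketch: the criterion numbers match the list above, and `next_actions` is a hypothetical helper name.

```python
# Fix for each criterion number, as described in the paragraph above.
FIX_FOR_CRITERION = {
    1: "fix the image or lower CFG",   # subject preservation
    2: "adjust the prompt",            # motion naturalness
    3: "fix the image or lower CFG",   # background stability
    4: "adjust the prompt",            # prompt adherence
    5: "increase resolution",          # resolution quality
}


def next_actions(failed: set[int]) -> list[str]:
    """Map failed criteria to fixes, deferring resolution until 1-4 pass."""
    unknown = failed - set(FIX_FOR_CRITERION)
    if unknown:
        raise ValueError(f"unknown criteria: {sorted(unknown)}")
    actions = []
    for criterion in sorted(failed):
        # Resolution is only worth raising once criteria 1-4 pass at 480p.
        if criterion == 5 and failed & {1, 2, 3, 4}:
            continue
        action = FIX_FOR_CRITERION[criterion]
        if action not in actions:
            actions.append(action)
    return actions


print(next_actions({1, 3}))  # ['fix the image or lower CFG']
print(next_actions({2, 5}))  # ['adjust the prompt']
```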

Step 7: Iterate — Change One Variable at a Time

I2V iteration follows a clear priority:

  1. Image first — A bad image cannot be saved by a good prompt
  2. Checkpoint second — Wrong checkpoint = wrong result
  3. CFG third — Fine-tune subject vs. motion balance
  4. Prompt last — Only adjust prompt after the first three are correct

Most iteration loops get this backward: users change the prompt first, then the CFG, then the image, and never check whether they loaded the I2V checkpoint. Follow the priority. It saves generations.
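
One way to enforce the one-variable rule is to diff the settings of consecutive runs and refuse to generate when more than one thing changed. The sketch below uses plain dictionaries; the keys shown are hypothetical labels for your own run log.

```python
def changed_keys(previous: dict, current: dict) -> list[str]:
    """Return the sorted names of settings that differ between two runs."""
    keys = set(previous) | set(current)
    return sorted(k for k in keys if previous.get(k) != current.get(k))


run_a = {"image": "portrait_v1.png", "checkpoint": "14B q8_0", "cfg": 4.5, "prompt": "turns head slowly"}
run_b = {"image": "portrait_v1.png", "checkpoint": "14B q8_0", "cfg": 4.0, "prompt": "turns head slowly"}

diff = changed_keys(run_a, run_b)
print(diff)  # ['cfg']
if len(diff) > 1:
    print("More than one variable changed; the comparison will be ambiguous.")
```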

5 I2V-Specific Problems and Their Exact Fixes (With a Rule of Thumb for Each)

These are the most common failures in the Wan 2.2 image-to-video workflow and the exact fix for each.

Problem 1: The Subject Drifts Into a Different Person

Symptom: The first frame matches the reference perfectly, but by frame 40, the face, hair color, or clothing has changed.

Root causes (check in this order):

  1. You used the T2V checkpoint instead of I2V
  2. CFG is too high (above 5.5)
  3. The reference image has the subject off-center or with an extreme expression
  4. You are using the 5B checkpoint, which has weaker subject preservation

Fix:

  1. Verify the checkpoint filename contains i2v
  2. Lower CFG to 4.0–4.5
  3. Recenter the subject or use a neutral-expression reference image
  4. If on 5B, switch to 14B or accept that subject drift will be higher

Rule of thumb: If the subject has visibly changed before frame 10, the checkpoint is wrong. If drift builds gradually over 40+ frames, adjust CFG and image centering.
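
The rule of thumb translates into a simple check on the first frame where you notice the change. In this sketch the thresholds (10 and 40) come from the rule above; the "inconclusive" middle band is an assumption, since the rule does not cover frames 10–39.

```python
def diagnose_drift(first_changed_frame: int) -> str:
    """Suggest where to look first, given the first frame with visible drift."""
    if first_changed_frame < 10:
        return "checkpoint"            # likely a T2V model loaded by mistake
    if first_changed_frame >= 40:
        return "cfg_and_centering"     # gradual drift: lower CFG, recenter subject
    return "inconclusive"              # assumption: walk the full root-cause list


print(diagnose_drift(4))    # checkpoint
print(diagnose_drift(55))   # cfg_and_centering
```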

Problem 2: Almost No Motion (The "Statue" Effect)

Symptom: The video looks like a barely-wobbling still image. The subject stays frozen, and only subtle jitter suggests something happened.

Root causes:

  1. CFG is too low (below 3.5)
  2. The prompt contains no action verbs — only descriptions
  3. The reference image is too "perfect" (studio lighting, rigid pose, everything in focus)

Fix:

  1. Raise CFG to 4.5–5.0
  2. Rewrite the prompt with explicit motion: "turns head," "walks," "raises hand"
  3. Use an image with a more dynamic pose or expression

Rule of thumb: If the output has zero visible motion, check the prompt first, before touching the settings. The prompt must contain an explicit action verb. "A person standing" is a static description. "Turns head slowly" is motion.

Problem 3: The Background Warps or Melts

Symptom: The subject is stable, but the background ripples, stretches, or flickers throughout the clip.

Root causes:

  1. The background is too detailed (trees, text, patterns)
  2. The image quality is low (compressed, low resolution)
  3. The generation is too long (81+ frames) for the background complexity

Fix:

  1. Replace the background with a simple gradient, solid color, or bokeh before I2V
  2. Use a higher-quality source image (less compression, higher pixel count)
  3. Reduce frame count to 41 and evaluate whether shorter output reduces warping

Rule of thumb: A background that looks slightly boring in the source image will look stable in motion. A background that looks beautifully detailed will likely warp or flicker. Choose stability over aesthetics for background content.

Problem 4: Faces Distort or "Melt" When Moving

Symptom: The face transforms unnaturally when the head moves — eyes slide sideways, teeth distort, nose changes shape.

Root causes:

  1. The head turn described in the prompt is too aggressive
  2. The reference face is small relative to the frame
  3. Fine facial details (teeth, glasses, jewelry) confuse the model

Fix:

  1. Reduce motion magnitude in the prompt — "turns slightly" instead of "turns head sharply"
  2. Crop the image to make the face larger in the frame
  3. If using glasses, accept that they may warp — consider removing them in the source image
  4. Add "closed mouth" if teeth are visible — the model consistently distorts teeth during motion

Rule of thumb: If a facial detail makes you nervous when you look at the source image — teeth, glasses, extreme angle — that detail will look worse in motion. Fix it in the source before the model animates it.

Problem 5: The Output Is Blurry or Pixelated

Symptom: The video looks lower quality than the source image. Fine details vanish.

Root causes:

  1. Generating at 480p when the source was higher resolution (the downscale loses detail)
  2. Using a low-quality source image compressed by social media or messaging apps
  3. The VAE decoding is introducing artifacts

Fix:

  1. Generate at 720p if your hardware supports it
  2. Use the highest-quality source image available — avoid re-downloaded JPEGs from Telegram or WhatsApp
  3. If using ComfyUI, check the VAE dtype matches the model dtype (mismatched decoding causes softness)

Rule of thumb: The output will never be sharper than the source. If the source image was compressed by a messaging app or re-downloaded from social media, start by finding the original file — not by tweaking generation settings.

Wan 2.2 I2V vs T2V: When Each Workflow Makes Sense

The choice between I2V and T2V is not about which is better — it is about which input you have and what you want the output to look like.

| Situation | Use I2V | Use T2V |
|---|---|---|
| You have a specific person/character | ✅ Best: preserves appearance | ❌ Requires detailed prompt that may not match |
| You want a specific scene composition | ✅ Reference image controls framing | ❌ Model interprets scene freely |
| You need a specific action/motion pattern | ✅ Prompt specifies motion | ✅ Prompt specifies everything |
| You have no reference image, only text | ❌ Needs an image input | ✅ Natural fit |
| You want creative freedom, no constraints | ❌ Image restricts output | ✅ Model generates freely |
| Your reference image is low quality | ❌ Poor input = poor output | ✅ No input image needed |
| You are iterating fast on a concept | ❌ Each iteration needs image prep | ✅ Faster per generation |

Limitations and Responsible Use of Wan 2.2 I2V

Wan 2.2 I2V is a powerful tool, but it has real constraints that affect how and when you should use it.

Cost and Resource Considerations

Running the 14B I2V checkpoint requires roughly 11–15 GB of VRAM and takes 20–95 seconds per generation depending on the setup. If you are generating through the wan27.org API, each generation consumes credits based on resolution, frame count, and step count. At scale, the cost of hundreds of iterations adds up quickly.

Cost-saving strategy: Validate every image and prompt at 480p with 41 frames and 4-step LightX2V before committing to a full 720p, 81-frame, 30-step generation. This reduces each failed generation's cost by roughly 80%.

When Not to Use I2V

I2V is not the right workflow for every task. Avoid it when:

  • You need free-form creative generation — Use T2V instead. I2V constrains the output to the reference image, which limits creative freedom.
  • Your reference image is low quality — A blurry, compressed, or poorly lit source image produces a blurry, artifact-ridden video. No amount of prompting or settings tuning will fix a bad source.
  • The subject has no clear visual anchor — Abstract concepts, text-heavy slides, or images with multiple similar subjects often confuse the model, leading to drift and morphing.
  • You need consistent multi-shot output — Wan 2.2 I2V has no built-in memory between generations. The same image and prompt with different seeds produce different motion patterns, and characters will not remain visually consistent across cuts.

Ethical Use Guidelines

Image-to-video generation raises specific ethical considerations that every user should address before publishing or sharing output:

  • Consent: Only use images of real people with their explicit permission. Generating video of a person without consent — even from a publicly available photo — carries legal and ethical risks in most jurisdictions.
  • Transparency: Disclose that the video was AI-generated when publishing or sharing. Most social platforms now require AI-generated content labels, and omitting the disclosure may violate platform terms of service.
  • Misrepresentation: Do not use Wan 2.2 I2V to create video that misrepresents a real event, a person's actions, or a product's capabilities. The model can produce convincing motion, but that motion is generated, not recorded.
  • Content safety: Wan 2.2 I2V inherits the biases and limitations of its training data. The model may produce unexpected or undesirable content when given images outside its training distribution. Review every output before sharing.

Rule of thumb: If you would not share the source image publicly without context, do not generate a video from it. The video amplifies both the good and the problematic aspects of the input.

FAQ: Wan 2.2 Image to Video

What is the difference between Wan 2.2 I2V and T2V?

Wan 2.2 I2V generates video from an image input plus a text prompt that describes motion. Wan 2.2 T2V generates video from a text prompt alone without any reference image. The I2V workflow uses a different checkpoint with an image conditioning pathway, which is what lets it preserve the subject across frames; the T2V checkpoint has no such pathway.

Which Wan 2.2 checkpoint should I use for image to video?

Use the 14B I2V checkpoint (FP8 or GGUF q8_0) for the best balance of subject preservation and motion quality. The Wan 2.2 5B image-to-video checkpoint works for simple subjects and quick previews but produces more subject drift. The T2V checkpoint does not accept image inputs at all.

Why does my Wan 2.2 I2V output ignore the reference image?

You loaded the T2V checkpoint instead of the I2V checkpoint. The T2V model has no image conditioning pathway and silently defaults to text-only generation. Check the checkpoint filename — it must contain i2v.

What resolution should my input image be for Wan 2.2 I2V?

Resize to 720×480 for the best results. Wan 2.2 was trained at this resolution, and the model center-crops images that do not match. Using a mismatched aspect ratio means losing part of your image to cropping.

How many frames should I use for Wan 2.2 I2V?

41–81 frames (roughly 2.5–5 seconds at 16 fps). Start at 41 frames for validation — it reduces generation time and minimizes drift risk. Only extend to 81 frames after confirming subject preservation at 41 frames.
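
The frame-to-duration arithmetic is a one-liner, assuming the 16 fps playback rate stated above.

```python
FPS = 16  # playback rate stated in the answer above


def clip_seconds(frames: int, fps: int = FPS) -> float:
    """Duration of a clip in seconds for a given frame count."""
    return frames / fps


print(clip_seconds(41))  # 2.5625
print(clip_seconds(81))  # 5.0625
```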

Why is my Wan 2.2 I2V output blurry?

The most common causes are generating at 480p when the source was higher resolution (the downscale loses detail), or using a low-quality source image. Try generating at 720p, or use a higher-quality source image.

Can I use Wan 2.2 I2V without a GPU?

Yes. Wan 2.2 I2V on wan27.org runs generation server-side — no GPU needed on your end. Upload an image, write a prompt, and generate in a browser.

Does LightX2V work with Wan 2.2 I2V?

Yes. LightX2V is a distilled LoRA designed specifically for Wan 2.2 I2V. It reduces the denoising steps from 30–50 to 4–6 while maintaining subject fidelity. Use the low-noise variant for I2V, since it preserves the reference image better than the high-noise variant.

What CFG should I use for Wan 2.2 I2V?

Start at 4.5. The I2V workflow benefits from slightly lower CFG than T2V because high CFG (5.5+) increases drift from the reference image. If motion is too weak, increase frame count or adjust the prompt before raising CFG above 5.0.

Why does the background warp in my Wan 2.2 I2V output?

Complex backgrounds (trees, text, patterns) force the model to hallucinate what exists behind the subject as the camera or subject moves. Use a simple background (solid color, gradient, out-of-focus bokeh) for cleaner motion.

Core Summary: Your I2V Workflow Checklist

Every time you start a Wan 2.2 image-to-video generation, run through this checklist:

  1. Checkpoint — Confirm the model filename contains i2v
  2. Image — Resized to 720×480, subject centered, face visible, simple background
  3. Prompt — Motion + camera + atmosphere only. No subject description
  4. CFG — 4.0–4.5
  5. Frames — 41 for validation, 81 for final
  6. Steps — 30 (or 4 with LightX2V)
  7. Evaluate — Subject stable? Motion natural? Background clean?

The most expensive mistake is iterating on the prompt when the checkpoint or image is the problem. Follow the priority: image → checkpoint → CFG → prompt.

If you do not have a GPU that can run Wan 2.2, or simply want to skip the setup — generate your first Wan 2.2 I2V clip at wan27.org in under a minute with no hardware requirements. Upload any image, write a motion prompt, and download the result directly from your browser.

Author

Wan 2.7 AI