How to Use Our Image & Video Generator — A Practical Guide
From uploading files to generating studio-quality videos, this guide covers every tool on the platform with prompt templates, pro tips, and common pitfalls to avoid.
Whether you're generating your first image or trying to get the perfect 15-second video, this guide walks you through every tool on the platform. Each section covers what a model can do, how to write prompts that actually work, and the mistakes to skip.
1. Before You Start — File Uploads
These limits apply everywhere, no matter which tool you're using.
| Media Type | Accepted Formats | Max Size |
|---|---|---|
| Images | jpg, jpeg, png, webp | 10 MB |
| Videos | mp4, mov, webm, mkv | 100 MB (as reference) / 80 MB (direct upload) |
| Audio | aac, m4a, mp3, ogg, wav, webm | 100 MB |
Practical tips:
- Converting PNG to JPEG can cut file size by 60–80% if you're close to the 10 MB limit.
- For video references, use the shortest clip that captures the motion you want — don't upload a 3-minute video when 10 seconds is enough.
- The Video References and Motion Control tools accept images, videos, and audio. AI Reframe only accepts videos.
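The limits in the table can be checked locally before uploading. The following is a minimal sketch, not the platform's own validation: the function name `check_upload`, the `media_type` keys, and the reading of "MB" as 2^20 bytes are all assumptions made for illustration.

```python
from pathlib import Path

# Assumption: the guide does not say whether "MB" means 10^6 or 2^20 bytes.
MB = 1024 * 1024

# Formats and size caps copied from the table above.
# "video" is a direct upload (80 MB); "video_reference" is a reference clip (100 MB).
UPLOAD_LIMITS = {
    "image": ({"jpg", "jpeg", "png", "webp"}, 10 * MB),
    "video": ({"mp4", "mov", "webm", "mkv"}, 80 * MB),
    "video_reference": ({"mp4", "mov", "webm", "mkv"}, 100 * MB),
    "audio": ({"aac", "m4a", "mp3", "ogg", "wav", "webm"}, 100 * MB),
}


def check_upload(filename: str, size_bytes: int, media_type: str) -> list[str]:
    """Return a list of problems; an empty list means the file fits the table."""
    formats, max_bytes = UPLOAD_LIMITS[media_type]
    problems = []
    extension = Path(filename).suffix.lower().lstrip(".")
    if extension not in formats:
        problems.append(f"unsupported {media_type} format: .{extension}")
    if size_bytes > max_bytes:
        problems.append(f"file exceeds {max_bytes // MB} MB")
    return problems
```

Because `webm` appears under both video and audio, the media type is passed explicitly instead of being guessed from the extension.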
2. Image Generator — General Purpose
This is your go-to tool for creating still images.
What you can control
| Setting | Range | Tip |
|---|---|---|
| Prompt | 10 – 1,000 characters | Stay between 50–200 chars for best results |
| Number of images | 1–4 (text-to-image: 1, 2, or 4) | Generate 4 at a time to compare variations |
| Reference images | Up to 5 URLs | Use for style reference, not subject copying |
| Negative prompt | Up to 800 characters | "blurry, distorted, low quality, extra limbs" |
| Inference steps | 10 – 64 | 30 is a good default; go higher for complex scenes |
| Guidance scale | 1 – 15 | 7–9 works for most prompts |
How to write prompts that work
The best image prompts follow a simple formula:
Subject + Scene + Lighting + Style
Example:
"A young woman with wavy auburn hair in a vintage floral dress, standing in a sunlit garden, soft golden hour light, cinematic shallow depth of field"
Do's and Don'ts:
| ✅ Do This | ❌ Don't Do This |
|---|---|
| Describe what the subject looks like (hair, clothes, expression) | Name specific real people by name |
| Specify the lighting (golden hour, neon glow, soft window light) | Stack tech terms ("8K, hyperrealistic, 4K, ultra-detailed") — they dilute each other |
| Use atmosphere words (cinematic, painterly, documentary-style) | Use camera model names ("shot on Sony A7III") — the model doesn't know what that looks like |
| Be specific about composition ("close-up of face", "full body shot") | Write run-on sentences — break into clear phrases |
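If you assemble prompts in a script, the Subject + Scene + Lighting + Style formula maps naturally onto a small helper. This is an illustrative sketch only; `build_image_prompt` is a hypothetical name, and the 10 to 1,000 character check simply mirrors the range in the settings table.

```python
def build_image_prompt(subject: str, scene: str, lighting: str, style: str) -> str:
    """Join the four formula parts into one comma-separated prompt."""
    parts = [part.strip() for part in (subject, scene, lighting, style) if part.strip()]
    prompt = ", ".join(parts)
    # 10 to 1,000 characters is the prompt range listed in the settings table.
    if not 10 <= len(prompt) <= 1000:
        raise ValueError(f"prompt is {len(prompt)} characters; allowed range is 10-1000")
    return prompt


prompt = build_image_prompt(
    subject="A young woman with wavy auburn hair in a vintage floral dress",
    scene="standing in a sunlit garden",
    lighting="soft golden hour light",
    style="cinematic shallow depth of field",
)
print(prompt)
```

Keeping the four parts as separate arguments makes it easy to change one variable at a time while iterating.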
Image-edit mode
If you're editing an existing image:
- You can only upload 1 reference image (not 5).
- You can generate 2, 3, or 4 variations (not just 1).
- Describe what you want to change in the prompt, not what's already there.
3. V1 Pro Video
An older model, but it supports the widest range of aspect ratios.
| Setting | Range |
|---|---|
| Prompt | 1 – 10,000 characters |
| Reference image | Optional — 1 URL |
| Aspect ratio | 1:1, 4:3, 3:4, 16:9, 9:16, 21:9 — the full set |
| Resolution | 480p, 720p, or 1080p |
| Duration | 5 or 10 seconds |
| Fixed lens / Generate audio | On/off toggles |
Tips
- With 10,000 characters of prompt space, you can write very detailed scene descriptions. Use time-coded segments for multi-beat action.
- The 21:9 aspect ratio is unique to this model — useful for cinematic widescreen.
- For faster iteration, test at 480p first, then render final at 1080p.
4. Kling 2.6
Great for image-to-video workflows — supports up to 10 reference images.
| Setting | Range |
|---|---|
| Prompt | 1 – 2,500 characters |
| Mode | Text-to-video or image-to-video |
| Reference images | Up to 10 URLs (image-to-video) |
| Sound | On/off toggle |
| Aspect ratio | 1:1, 16:9, 9:16 |
| Duration | 5 or 10 seconds |
Prompt formula
Scene setting + Subject description + Motion + Stylistic guidance
Image-to-video tips:
- Your prompt should describe motion, not the image itself — the model can already see it.
- Focus on camera movement and what changes: "Camera slowly tracks right, subtle wind affecting hair and clothing"
- Use high-resolution source images (1080p or higher) for best results.
- Keep reference images free of text overlays and watermarks.
Multi-shot storytelling (new in 2.6): You can describe sequential events in a single prompt and the model will generate transitions:
"A man walks into a coffee shop, orders a drink, then sits by the window as rain starts"
Native audio tips (2.6):
- Put dialogue in quotes to trigger lip-sync: "...saying 'Let's begin.'"
- Describe ambient sounds explicitly: "coffee shop chatter, espresso machine hissing, rain on windows"
- Audio doubles the credit cost — enable only when needed.
Chain generations trick
Generate a 10-second clip, then use its last frame as the input for the next generation. You can build sequences up to 3 minutes this way.
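As a rough planning aid, the number of chained generations for a target length is simple arithmetic. The sketch below assumes every clip contributes its full duration (the shared boundary frame is ignored); `clips_needed` is a hypothetical helper name.

```python
import math


def clips_needed(target_seconds: float, clip_seconds: float = 10) -> int:
    """Number of chained generations needed to cover the target length."""
    return math.ceil(target_seconds / clip_seconds)


# A 3-minute sequence built from 10-second clips:
print(clips_needed(180))  # 18
```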
5. Wan 2.6
| Setting | Range |
|---|---|
| Prompt | 1 – 5,000 characters |
| Modes | Text-to-video, image-to-video (1 image required), video-to-video (1 video required) |
| Resolution | 720p or 1080p |
| Duration | 5, 10, or 15 seconds |
A solid all-rounder. The prompt tips for Wan 2.7 (below) apply here too, since they share the same architecture.
6. Grok Imagine 1.5
Image-to-video only — you must provide a starting image. There's no text-to-video mode on 1.5.
| Setting | Range |
|---|---|
| Prompt | 1 – 2,000 characters |
| Reference image | Required |
| Aspect ratio | 16:9, 9:16, 1:1 |
| Resolution | 480p or 720p |
| Duration | 5, 8, 10, or 15 seconds |
The golden rule: describe motion, not the image
The model already sees your input image. Your job is to tell it what should change.
✅ "The woman slowly turns her head to the right and smiles, soft breeze moving her hair, gentle camera push-in"
❌ "A beautiful woman with long hair stands in a garden" (the model can already see this)
Prompt formula
[Subject action] + [Camera movement] + [Lighting changes] + [Audio cues]
Camera vocabulary that works:
- Pan left/right, tilt up/down, zoom in/out, dolly in/out
- Tracking/follow shot, orbit/surround, aerial/drone
- Handheld, slow push-in, static/tripod
Pro tip: Always name at least one camera move. "Cinematic" alone tells the model nothing.
Audio prompting
Grok generates audio in the same pass as video. Add an AUDIO: block at the end of your prompt:
"Close-up of hands pulling apart a warm cinnamon roll, steam rising, soft morning window light, slow camera push-in. AUDIO: soft room tone, faint kettle hiss, gentle pastry tear sound"
Common mistakes
| Mistake | Fix |
|---|---|
| Re-describing the image | Focus only on motion — the model already sees it |
| Contradicting the source image | If the image has a man, don't prompt "a woman dances" |
| Negative prompts | Don't use them — they're ignored on this model |
| Tag stacking ("knight, castle, epic, 8K") | Write a natural sentence with intent |
| Too many simultaneous actions | Stick to 1 subject + 1 action + 1 camera move |
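The prompt formula and the AUDIO: convention from this section can be combined in a small builder. This is a sketch under stated assumptions: `build_grok_prompt` is a hypothetical helper, the comma-joined layout is one reasonable reading of the formula, and the 2,000 character cap comes from the settings table.

```python
GROK_MAX_PROMPT_CHARS = 2000  # upper bound from the settings table


def build_grok_prompt(action: str, camera: str, lighting: str = "",
                      audio_cues: tuple[str, ...] = ()) -> str:
    """Build '[action], [camera], [lighting]. AUDIO: cue, cue' in formula order."""
    if not camera.strip():
        # Mirrors the pro tip: always name at least one camera move.
        raise ValueError("name at least one camera move")
    visual = ", ".join(p.strip() for p in (action, camera, lighting) if p.strip())
    prompt = visual + "."
    if audio_cues:
        prompt += " AUDIO: " + ", ".join(audio_cues)
    if len(prompt) > GROK_MAX_PROMPT_CHARS:
        raise ValueError(f"prompt is {len(prompt)} characters; limit is {GROK_MAX_PROMPT_CHARS}")
    return prompt
```

Taking a single `action` and a single `camera` argument nudges you toward the "1 subject + 1 action + 1 camera move" fix from the table.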
Iteration workflow
- Generate 3–5 variations at 480p first (cheaper, faster).
- Pick the best one.
- Render that version at 720p for final output.
7. Happy Horse 1.1
Alibaba's latest video model. Three modes, each with different input needs.
| Setting | Text to Video | Image to Video | Reference to Video |
|---|---|---|---|
| Prompt | Up to 10,000 chars | Up to 10,000 chars | Up to 10,000 chars |
| Resolution | 720p / 1080p | 720p / 1080p | 720p / 1080p |
| Duration | 5 / 10 / 15s | 5 / 10 / 15s | 5 / 10 / 15s |
| Inputs required | Just a description | Exactly 1 primary image | At least 1 reference image. Primary optional. Up to 8 extra images. |
How to prompt Happy Horse
Happy Horse is unusual: brevity wins. Most shots only need about 20 words.
Subject → Action → Setting → One camera cue
✅ Good: "A young woman in a red coat walks down a wet city street at night, neon reflections, slow dolly-in"
❌ Bad: "A beautiful stunning gorgeous young woman in a detailed amazing dress walks slowly through a lovely park with incredible lighting" — those extra adjectives actually hurt quality.
The anti-slop rule: Cut every adjective that isn't specific. Drop "beautiful", "stunning", "amazing", "masterpiece", "epic", "breathtaking". Replace them with concrete details: "overcast daylight", "wet asphalt", "neon reflections", "warm amber backlight".
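The anti-slop rule is easy to automate as a pre-flight check. The sketch below flags only the six words named above and counts words against the "about 20 words" guideline; the function names are hypothetical and the word list is not exhaustive.

```python
import re

# The six filler words named in the anti-slop rule.
VAGUE_WORDS = {"beautiful", "stunning", "amazing", "masterpiece", "epic", "breathtaking"}


def find_vague_words(prompt: str) -> list[str]:
    """Return the filler words found in the prompt, in order of appearance."""
    words = re.findall(r"[a-z]+", prompt.lower())
    return [word for word in words if word in VAGUE_WORDS]


def word_count(prompt: str) -> int:
    """Count whitespace-separated words."""
    return len(prompt.split())


good = "A young woman in a red coat walks down a wet city street at night, neon reflections, slow dolly-in"
print(find_vague_words(good), word_count(good))
```

Matching whole words avoids false positives such as "epicenter" being flagged for "epic".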
Camera language pays off
Happy Horse is unusually good at camera moves. Put the camera cue at the end of the prompt for maximum weight:
- "Steadicam push", "slow dolly-in", "lateral orbit with parallax", "helicopter aerial", "rack focus"
For longer prompts: use shot lists
If you need more than one sentence, don't write a paragraph — use a shot list with timecodes:
Shot 1 (wide establishing, 0-1s): Camera pulls into a rain-slicked street at night.
Shot 2 (mid tracking, 1-4s): The woman enters frame from right, walking briskly.
Shot 3 (slow push-in close, 4-5s): Slow dolly-in onto her face, raindrops in her hair.
Reference to Video (R2V): best for commercial use
- Use clear, sharp reference images of your subject.
- Upload 3–9 multi-angle refs for character consistency — this prevents "face-changing" across shots.
- The reference defines who, the prompt defines what happens.
- Describe motion, not appearance — the reference already shows the look.
Dialogue timing formula
Single line duration = (character count ÷ 4) × 1.2
Keep dialogue to 40 characters or fewer per 15-second clip, with a maximum of 2 lines per shot, for clean lip-sync.
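The timing formula can be applied directly in code. In this sketch the result is assumed to be in seconds and the character count is assumed to include spaces and punctuation; the guide states neither explicitly, and the helper names are hypothetical.

```python
def line_duration(text: str) -> float:
    """Estimated duration of one spoken line: (character count / 4) * 1.2."""
    return len(text) / 4 * 1.2


def dialogue_fits(lines: list[str], clip_seconds: float = 15) -> bool:
    """Check the 2-lines-per-shot guideline and the total estimated duration."""
    total = sum(line_duration(line) for line in lines)
    return len(lines) <= 2 and total <= clip_seconds


# A 40-character line works out to (40 / 4) * 1.2 = 12, inside a 15-second clip.
print(round(line_duration("x" * 40), 2))
```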
What Happy Horse does well vs. struggles with
| ✅ Excels At | ❌ Struggles With |
|---|---|
| Camera moves (Steadicam, dolly, aerial) | Multi-step sequences in plain prose (use shot lists) |
| Atmospheric lighting (blue hour, neon noir) | Extreme slow-motion cues ("1000fps slow-mo") |
| Vehicles and large rigid objects | Wardrobe details under heavy motion |
| Cloth in wind (capes, flags, hair) | Booru tags, JSON, weighted parentheses |
| Fire and embers | Rendering specific on-screen text requested in the prompt |
| Mirrors and reflections (geometrically consistent) | Multiple simultaneous complex actions |
8. Wan 2.7
The latest Wan model, with the most modes. Good for both video and audio.
| Setting | Text to Video | Image to Video | Reference to Video | Video Edit |
|---|---|---|---|---|
| Prompt | Up to 5,000 | Up to 5,000 | Up to 5,000 | Up to 5,000 |
| Resolution | 720p / 1080p | 720p / 1080p | 720p / 1080p | 720p / 1080p |
| Duration | 5 / 10 / 15s | 5 / 10 / 15s | 5 / 10 / 15s | 5 / 10 / 15s |
| Extra inputs | Optional audio | Image, end frame, video, audio (all optional). Need image OR continuation clip. Max 1 video URL. | At least one of: primary image, extra images, video, or audio. Max 4 extra images. Max 1 video URL. | Source Video required. Ref image optional. Max 1 video URL. |
How to prompt Wan 2.7
Wan rewards structured, screenplay-like prompts. Use this formula:
Subject + Scene + Motion + Lighting + Camera + Style
Example:
"A golden retriever running through autumn leaves in a park, warm afternoon light, camera tracking from the side, cinematic shallow depth of field"
First & last frame control (standout feature)
This is Wan 2.7's killer feature for Image-to-Video:
- Upload a start frame and an end frame — the model generates the motion between them.
- Keep both frames aligned in aspect ratio, lighting direction, and subject placement.
- Inconsistent lighting between frames causes mid-clip light-source jumps.
- Think of the pair as defining a verb: open → closed, before → after, assembling → complete.
Audio tips
Wan 2.7 supports native audio — describe it in your prompt:
- Dialogue: Include spoken lines with tone and pace: "A man says 'Hello', tone warm, medium pace"
- Sound effects: "Ice cube drops into glass, sharp clink"
- Background music: "Upbeat synthwave background track"
- If you don't want audio, explicitly say: "No dialogue. No background music."
Known limitations
- Complex multi-character scenes with specific interactions can be inconsistent.
- Text rendering within generated videos is unreliable.
- Longer durations (10+ seconds) may show motion degradation.
- Hand and finger consistency can be imperfect — reduce simultaneous actions and avoid extreme foreshortening.
Iteration workflow
- Treat the first output as a draft — refine, don't regenerate from scratch.
- Lock a good seed once you get a clean output for reproducibility.
- Validate motion at 0.5x speed — look for mid-clip jitter around 50–70% through.
- Build scene-by-scene: generate one clip at a time, combine in editing.
9. Wan 2.5
| Setting | Range |
|---|---|
| Prompt | Up to 5,000 characters |
| Mode | Text-to-video or image-to-video (exactly 1 image) |
| Resolution | 720p or 1080p |
| Duration | 5 or 10 seconds |
A simpler, shorter-duration version of Wan 2.7. Use this when you need quick, clean 5–10 second clips. Same prompt principles apply — structured, descriptive, with camera direction at the end.
10. Wan 2.2
Older generation. Lower resolution ceiling (max 720p), but still useful for specific tasks.
| Mode | Prompt | Inputs | Resolution | Duration |
|---|---|---|---|---|
| Text | Up to 5,000 | None | 480p / 580p / 720p | 5s (fixed) |
| Image | Up to 5,000 | Exactly 1 image | 480p / 580p / 720p | 5s (fixed) |
| Speech | Up to 5,000 | 1 portrait + 1 audio | 480p / 580p / 720p | 5 or 10s |
| Animate Move / Replace | No prompt needed | 1 image + 1 video | 480p / 580p / 720p | 1s (fixed) |
The Animate Move / Replace mode is unique — it creates a 1-second motion loop from an image and a reference video. Use it for subtle motion effects like a character blinking, an object rotating, or a flag waving. No prompt writing needed.
The Speech mode requires a portrait photo (a clear face shot) plus an audio file. Make sure the photo is well-lit with the face clearly visible for best results.
11. Seedance 2.0
ByteDance's model with flexible duration (any length from 4 to 15 seconds).
| Mode | Prompt | Resolution | Duration | Inputs |
|---|---|---|---|---|
| Text to Video | Up to 10,000 chars | 480p / 720p / 1080p | 4–15s | None |
| Image to Video | Up to 10,000 chars | 480p / 720p / 1080p | 4–15s | Primary image required, secondary optional |
| Fast Text | Up to 10,000 chars | 480p / 720p | 4–15s | None |
| Fast Image | Up to 10,000 chars | 480p / 720p | 4–15s | Primary image required, secondary optional |
Prompt formula
[Subject] + [Action] + [Environment] + [Camera] + [Lighting] + [Style] + [Audio]
Seedance is reference-driven — upload images or video clips and assign them roles with @ notation:
"Use @Image1 for the character's appearance. Use @Video1 for camera movement only."
The golden rule for Seedance
The first 20–30 words of your prompt carry the most weight. Lead with who/what is in frame, then what they do. Save style and lighting for after the subject is locked.
✅ Strong: "A clear glass perfume bottle sits on a black stone pedestal in a dark studio. Condensation rolls slowly down the glass. Medium close-up, slow circular dolly, soft side lighting, luxury beauty-ad look."
❌ Weak: "A cool cinematic ad for a perfume bottle, dramatic, stylish, premium, beautiful lighting. Wide camera, realistic."
Multi-shot sequences
Label each shot clearly, adding timecodes when timing matters:
Shot 1: Wide establishing shot of a futuristic train station at sunrise.
Shot 2: Medium shot of a traveler stepping onto the platform, coat moving in wind.
Shot 3: Close-up of a glowing ticket in their hand.
Keep each shot to one primary action. If a shot tries to do running + camera pan + lightning + mood shift, the model will pick what to prioritize.
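A labeled shot list can be generated from structured data so the numbering and timecodes stay consistent. The sketch below is an assumption-laden illustration: the parenthetical "(framing, start-end s)" layout is borrowed from the Happy Horse shot-list example earlier in this guide, and the `Shot` class and `format_shot_list` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Shot:
    framing: str  # e.g. "wide establishing"
    start: int    # start time in seconds
    end: int      # end time in seconds
    action: str   # keep this to one primary action


def format_shot_list(shots: list[Shot]) -> str:
    """Render 'Shot N (framing, start-end s): action' lines with contiguous timecodes."""
    lines = []
    previous_end = 0
    for number, shot in enumerate(shots, start=1):
        if shot.start != previous_end or shot.end <= shot.start:
            raise ValueError(f"Shot {number} must start at {previous_end}s and end later")
        lines.append(f"Shot {number} ({shot.framing}, {shot.start}-{shot.end}s): {shot.action}")
        previous_end = shot.end
    return "\n".join(lines)


print(format_shot_list([
    Shot("wide", 0, 2, "A futuristic train station at sunrise."),
    Shot("medium", 2, 4, "A traveler steps onto the platform, coat moving in wind."),
]))
```

The contiguity check catches gaps or overlaps between shots before you spend credits on a generation.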
Audio direction
Seedance generates synchronized audio. Describe sounds explicitly:
"Soft rain ambience, distant traffic, subtle tyre splash, warm analog music begins to play"
Cost-savvy workflow
| Stage | Best Choice | Why |
|---|---|---|
| Concept exploration | Fast mode / lower resolution | Cheap iteration |
| Prompt tuning | Fast or 480p | You're testing composition |
| Internal review | Standard | Enough quality to judge continuity |
| Final delivery | 720p or 1080p | Reserve premium spend for approved shots |
12. Wan 2.7 Image / Image Pro
Separate image-only tools (not the video generator).
Wan 2.7 Image:
| Setting | Range |
|---|---|
| Prompt | 1 – 3,000 characters |
| Aspect ratio | Pick from a preset list |
| Quality | Basic or High |
Wan 2.7 Image Pro:
| Setting | Range |
|---|---|
| Prompt | 1 – 3,000 characters |
| Reference images | Up to 9 URLs |
| Aspect ratio | Same preset list |
| Quality | Basic or High |
Prompt formula for Wan 2.7 Image
Entity (detailed appearance) + Scene (environment) + Aesthetic (lighting, shot size) + Stylization
Tips for best quality:
- Layer lighting elements as a sequence: "Twilight, warm interior lights, deep blue sky transitioning to orange"
- Use atmosphere words over technical jargon: "cinematic", "atmospheric", "painterly photoreal" work better than "shot on 50mm f/1.4"
- Use reference images to control color palette for brand consistency.
- Image Pro allows up to 9 reference images — use them to lock in character appearance, product details, or scene style.
Quality tiers
- Basic: Fast iteration, good for testing prompts.
- High: Production quality. Use for final output.
13. Tattoo Generator
Specialized for tattoo designs, with simpler controls than the general image tool.
| Setting | What you can do |
|---|---|
| Mode | Text-to-image (describe your design) or image-edit (modify an existing design) |
| Prompt | 1 – 500 characters — shorter than the general image tool, so be concise |
| Style, Complexity, Line Weight, Image Size, AI Model | All fixed presets — just pick from what's shown |
| Reference image | Required if you use image-edit mode |
Tips:
- 500 characters is enough for a detailed description of placement, style (tribal, watercolor, geometric, etc.), and subject matter.
- For image-edit, upload a clean photo of the existing design or placement area.
14. Something to Watch Out For
There's one limit that isn't enforced by the interface yet: Wan 2.7 custom audio clips must be under 30 seconds on the model side. The page shows a reminder, but if you upload a longer audio file, it will fail when the model tries to process it. Keep your audio under 30 seconds for now.
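Since the interface does not enforce the limit yet, you can check a clip's length yourself before uploading. This sketch uses Python's standard `wave` module, so it only handles uncompressed WAV files; other formats in the upload table would need a different tool. The helper names are hypothetical.

```python
import wave

MAX_AUDIO_SECONDS = 30  # limit stated above for Wan 2.7 custom audio


def wav_duration_seconds(path: str) -> float:
    """Duration of a WAV file: frame count divided by sample rate."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def audio_is_short_enough(path: str) -> bool:
    """True when the clip is strictly under the 30-second limit."""
    return wav_duration_seconds(path) < MAX_AUDIO_SECONDS
```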
Quick Reference: Which Model Should You Use?
| Your Goal | Best Model | Why |
|---|---|---|
| Best overall video quality | Wan 2.7 or Happy Horse 1.1 | Up to 15s, 1080p, best prompt adherence |
| Start from an image | Happy Horse 1.1 Image to Video or Wan 2.7 Image to Video | Both accept starting images with motion control |
| Edit an existing video | Wan 2.7 Video Edit | Built specifically for this |
| Many reference images (up to 10) | Kling 2.6 | Most generous image limit |
| A short motion loop (~1 second) | Wan 2.2 Animate Move / Replace | Unique 1-second fixed duration mode |
| Very specific duration (7s, 12s, etc.) | Seedance 2.0 | Any duration from 4 to 15 seconds |
| Add audio / lip-sync to video | Wan 2.7 Text to Video or Happy Horse 1.1 | Both support native audio generation |
| Wide aspect ratio (21:9) | V1 Pro | The only model with ultrawide support |
| High-quality still images | Wan 2.7 Image Pro | Up to 9 reference images, 2K+ output |
| Quick drafts / fast iteration | Seedance 2.0 Fast or Wan 2.5 | Lower cost per generation |
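If you pick models from a script, the duration and resolution limits from the per-model tables can be turned into a lookup. This is a partial sketch, not an official capability list: Kling 2.6 is left out because this guide lists no resolution for it, Wan 2.2 is left out because its limits vary by mode, and Seedance durations are assumed to be whole seconds.

```python
# Durations (seconds) and resolutions copied from the per-model tables in this guide.
MODELS = {
    "V1 Pro": {"durations": {5, 10}, "resolutions": {"480p", "720p", "1080p"}},
    "Wan 2.6": {"durations": {5, 10, 15}, "resolutions": {"720p", "1080p"}},
    "Grok Imagine 1.5": {"durations": {5, 8, 10, 15}, "resolutions": {"480p", "720p"}},
    "Happy Horse 1.1": {"durations": {5, 10, 15}, "resolutions": {"720p", "1080p"}},
    "Wan 2.7": {"durations": {5, 10, 15}, "resolutions": {"720p", "1080p"}},
    "Wan 2.5": {"durations": {5, 10}, "resolutions": {"720p", "1080p"}},
    # Standard (non-Fast) modes; the Fast modes stop at 720p.
    "Seedance 2.0": {"durations": set(range(4, 16)), "resolutions": {"480p", "720p", "1080p"}},
}


def models_supporting(duration: int, resolution: str) -> list[str]:
    """Return the models whose tables list both the duration and the resolution."""
    return sorted(
        name for name, spec in MODELS.items()
        if duration in spec["durations"] and resolution in spec["resolutions"]
    )


print(models_supporting(7, "720p"))
```

Limits may change as the platform updates, so treat the dictionary as a snapshot of this guide and defer to what the interface shows.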
Production Workflow — Putting It All Together
For serious projects (AI short dramas, commercials, storytelling), here's a recommended pipeline used by experienced creators:
- Script breakdown → Convert your script into visual shot descriptions.
- Character design → Establish core appearance using Wan 2.7 Image Pro with reference images.
- Style board → Set overall art direction, color palette, and lighting.
- Storyboard → Plan camera language, shot types, and composition.
- Generate keyframes → Create high-quality still images of each shot.
- Generate video → Animate stills using your chosen video model. Use first/last frame control for smooth transitions.
- Chain generations → Use the last frame of clip A as the input for clip B to maintain continuity.
- Edit & assemble → Refine pacing, fix details, output final cut.
Pro tip for consistency across shots: When you switch models mid-project, keep the prompt structure the same — change only the visual vocabulary. The model influences the result, but the prompt structure determines the shot.
General Prompt Writing Checklist
Before you hit generate, run through this:
- Does the prompt start with the subject?
- Is there one clear action or motion?
- Did I specify camera position or movement?
- Did I describe lighting?
- Did I drop vague adjectives (beautiful, stunning, amazing)?
- Is the prompt no longer than roughly 50–70 words (unless I need a multi-shot sequence)?
- If multi-shot, are scenes labeled with timecodes?
- Are negative prompts used only where the model supports them?
- Did I test at lower resolution first?
- Am I changing one variable at a time during iteration?
Model availability and limits may change as the platform updates. The interface always shows what's available for your current selection.