2026/06/29

How to Use Our Image & Video Generator — A Practical Guide

From uploading files to generating studio-quality videos, this guide covers every tool on the platform with prompt templates, pro tips, and common pitfalls to avoid.

Whether you're generating your first image or trying to get the perfect 15-second video, this guide walks you through every tool on the platform. Each section covers what a model can do, how to write prompts that actually work, and the mistakes to skip.

1. Before You Start — File Uploads

These limits apply everywhere, no matter which tool you're using.

Media Type	Accepted Formats	Max Size
Images	`jpg`, `jpeg`, `png`, `webp`	10 MB
Videos	`mp4`, `mov`, `webm`, `mkv`	100 MB (as reference) / 80 MB (direct upload)
Audio	`aac`, `m4a`, `mp3`, `ogg`, `wav`, `webm`	100 MB

Practical tips:

Converting PNG to JPEG can cut file size by 60–80% if you're close to the 10 MB limit.
For video references, use the shortest clip that captures the motion you want — don't upload a 3-minute video when 10 seconds is enough.
The Video References and Motion Control tools accept images, videos, and audio. AI Reframe only accepts videos.
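If you prepare files in bulk, it can help to check them against the table before uploading. The sketch below is only an illustration: the function name `check_upload`, the `LIMITS` layout, and the choice to count 1 MB as 1,048,576 bytes are assumptions, not part of the platform.

```python
from pathlib import Path

MB = 1024 * 1024  # assumption: 1 MB is counted as 1,048,576 bytes

# Formats and size limits copied from the table above; the layout is illustrative.
LIMITS = {
    "image": ({"jpg", "jpeg", "png", "webp"}, 10 * MB),
    "video": ({"mp4", "mov", "webm", "mkv"}, 80 * MB),  # direct-upload limit
    "audio": ({"aac", "m4a", "mp3", "ogg", "wav", "webm"}, 100 * MB),
}
VIDEO_REFERENCE_LIMIT = 100 * MB  # videos used as a reference get the higher limit


def check_upload(filename, size_bytes, media_type, as_reference=False):
    """Return a list of problems; an empty list means the file fits the limits."""
    formats, max_size = LIMITS[media_type]
    if media_type == "video" and as_reference:
        max_size = VIDEO_REFERENCE_LIMIT
    problems = []
    ext = Path(filename).suffix.lower().lstrip(".")
    if ext not in formats:
        problems.append(f"unsupported {media_type} format: {ext or '(none)'}")
    if size_bytes > max_size:
        problems.append(f"file is {size_bytes / MB:.1f} MB, limit is {max_size // MB} MB")
    return problems


print(check_upload("clip.mp4", 90 * MB, "video"))                     # over the direct-upload limit
print(check_upload("clip.mp4", 90 * MB, "video", as_reference=True))  # fits as a reference
```

The media type is passed in explicitly because `webm` appears in both the video and audio rows, so the extension alone can't tell you which limit applies.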

2. Image Generator — General Purpose

This is your go-to tool for creating still images.

What you can control

Setting	Range	Tip
Prompt	10 – 1,000 characters	Stay between 50–200 chars for best results
Number of images	1–4 (text-to-image: 1, 2, or 4)	Generate 4 at a time to compare variations
Reference images	Up to 5 URLs	Use for style reference, not subject copying
Negative prompt	Up to 800 characters	"blurry, distorted, low quality, extra limbs"
Inference steps	10 – 64	30 is a good default; go higher for complex scenes
Guidance scale	1 – 15	7–9 works for most prompts
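If you script your generations, you can catch out-of-range settings before spending credits. This is a minimal sketch of the ranges in the table above. The function name, parameter names, and the `text_to_image` flag are hypothetical and don't describe a real API.

```python
def check_image_settings(prompt, num_images=1, reference_urls=(), negative_prompt="",
                         steps=30, guidance=7.5, text_to_image=True):
    """Return a list of settings that fall outside the documented ranges."""
    problems = []
    if not 10 <= len(prompt) <= 1000:
        problems.append("prompt must be 10 to 1,000 characters")
    # The table lists 1, 2, or 4 images for text-to-image and 1 to 4 otherwise.
    allowed_counts = {1, 2, 4} if text_to_image else {1, 2, 3, 4}
    if num_images not in allowed_counts:
        problems.append(f"number of images must be one of {sorted(allowed_counts)}")
    if len(reference_urls) > 5:
        problems.append("at most 5 reference images")
    if len(negative_prompt) > 800:
        problems.append("negative prompt must be 800 characters or fewer")
    if not 10 <= steps <= 64:
        problems.append("inference steps must be 10 to 64")
    if not 1 <= guidance <= 15:
        problems.append("guidance scale must be 1 to 15")
    return problems


print(check_image_settings("A red fox in fresh snow at dawn", steps=80))
```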

How to write prompts that work

The best image prompts follow a simple formula:

Subject + Scene + Lighting + Style

Example:

"A young woman with wavy auburn hair in a vintage floral dress, standing in a sunlit garden, soft golden hour light, cinematic shallow depth of field"

Do's and Don'ts:

✅ Do This	❌ Don't Do This
Describe what the subject looks like (hair, clothes, expression)	Name specific real people by name
Specify the lighting (golden hour, neon glow, soft window light)	Stack tech terms ("8K, hyperrealistic, 4K, ultra-detailed") — they dilute each other
Use atmosphere words (cinematic, painterly, documentary-style)	Use camera model names ("shot on Sony A7III") — the model doesn't know what that looks like
Be specific about composition ("close-up of face", "full body shot")	Write run-on sentences — break into clear phrases
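The Subject + Scene + Lighting + Style formula and the "don't stack tech terms" rule can be sketched as two small helpers. The names `build_image_prompt` and `stacked_tech_terms` are made up for this example, and the term list is just the four terms quoted in the table.

```python
# The four stacked terms quoted in the Do's and Don'ts table.
TECH_TERMS = ("8k", "4k", "hyperrealistic", "ultra-detailed")


def build_image_prompt(subject, scene, lighting, style):
    """Join the four formula parts into one comma-separated prompt."""
    parts = [p.strip().rstrip(",") for p in (subject, scene, lighting, style)]
    return ", ".join(p for p in parts if p)


def stacked_tech_terms(prompt):
    """Return the listed tech terms found in the prompt (simple substring match)."""
    lowered = prompt.lower()
    return [term for term in TECH_TERMS if term in lowered]


prompt = build_image_prompt(
    "A young woman with wavy auburn hair in a vintage floral dress",
    "standing in a sunlit garden",
    "soft golden hour light",
    "cinematic shallow depth of field",
)
print(prompt)
print(stacked_tech_terms("castle, 8K, hyperrealistic, 4K, ultra-detailed"))
```

If `stacked_tech_terms` returns more than one entry, that's a hint to trim the prompt down to a single quality cue or replace the terms with an atmosphere word.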

Image-edit mode

If you're editing an existing image:

You can only upload 1 reference image (not 5).
You can generate 2, 3, or 4 variations (not just 1).
Describe what you want to change in the prompt, not what's already there.

3. V1 Pro Video

An older model, but it supports the widest range of aspect ratios.

Setting	Range
Prompt	1 – 10,000 characters
Reference image	Optional — 1 URL
Aspect ratio	1:1, 4:3, 3:4, 16:9, 9:16, 21:9 — the full set
Resolution	480p, 720p, or 1080p
Duration	5 or 10 seconds
Fixed lens / Generate audio	On/off toggles

Tips

With 10,000 characters of prompt space, you can write very detailed scene descriptions. Use time-coded segments for multi-beat action.
The 21:9 aspect ratio is unique to this model — useful for cinematic widescreen.
For faster iteration, test at 480p first, then render final at 1080p.

4. Kling 2.6

Great for image-to-video workflows — supports up to 10 reference images.

Setting	Range
Prompt	1 – 2,500 characters
Mode	Text-to-video or image-to-video
Reference images	Up to 10 URLs (image-to-video)
Sound	On/off toggle
Aspect ratio	1:1, 16:9, 9:16
Duration	5 or 10 seconds

Prompt formula

Scene setting + Subject description + Motion + Stylistic guidance

Image-to-video tips:

Your prompt should describe motion, not the image itself — the model can already see it.
Focus on camera movement and what changes: "Camera slowly tracks right, subtle wind affecting hair and clothing"
Use high-resolution source images (1080p or higher) for best results.
Keep reference images free of text overlays and watermarks.

Multi-shot storytelling (new in 2.6): You can describe sequential events in a single prompt and the model will generate transitions:

"A man walks into a coffee shop, orders a drink, then sits by the window as rain starts"

Native audio tips (2.6):

Put dialogue in quotes to trigger lip-sync: "...saying 'Let's begin.'"
Describe ambient sounds explicitly: "coffee shop chatter, espresso machine hissing, rain on windows"
Audio doubles the credit cost — enable only when needed.

Chain generations trick

Generate a 10-second clip, then use its last frame as the input for the next generation. You can build sequences up to 3 minutes this way.
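To see what the chain trick means in practice, here is a rough planning sketch. It only does the arithmetic (a 3-minute sequence is 18 clips of 10 seconds each) and ignores the single frame shared between consecutive clips. The function names are hypothetical.

```python
import math


def clips_needed(target_seconds, clip_seconds=10):
    """How many generations it takes to cover the target length."""
    return math.ceil(target_seconds / clip_seconds)


def chain_plan(target_seconds, clip_seconds=10):
    """List (clip number, start, end, input source) for each generation in the chain."""
    plan = []
    for i in range(clips_needed(target_seconds, clip_seconds)):
        start = i * clip_seconds
        end = min(start + clip_seconds, target_seconds)
        source = "your starting image or text prompt" if i == 0 else f"last frame of clip {i}"
        plan.append((i + 1, start, end, source))
    return plan


print(clips_needed(180))  # clips for a 3-minute sequence
for row in chain_plan(25):
    print(row)
```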

5. Wan 2.6

Setting	Range
Prompt	1 – 5,000 characters
Modes	Text-to-video, image-to-video (1 image required), video-to-video (1 video required)
Resolution	720p or 1080p
Duration	5, 10, or 15 seconds

A solid all-rounder. The prompt tips for Wan 2.7 (below) apply here too, since they share the same architecture.

6. Grok Imagine 1.5

Image-to-video only — you must provide a starting image. There's no text-to-video mode on 1.5.

Setting	Range
Prompt	1 – 2,000 characters
Reference image	Required
Aspect ratio	16:9, 9:16, 1:1
Resolution	480p or 720p
Duration	5, 8, 10, or 15 seconds

The golden rule: describe motion, not the image

The model already sees your input image. Your job is to tell it what should change.

✅ "The woman slowly turns her head to the right and smiles, soft breeze moving her hair, gentle camera push-in"

❌ "A beautiful woman with long hair stands in a garden" (the model can already see this)

Prompt formula

[Subject action] + [Camera movement] + [Lighting changes] + [Audio cues]

Camera vocabulary that works:

Pan left/right, tilt up/down, zoom in/out, dolly in/out
Tracking/follow shot, orbit/surround, aerial/drone
Handheld, slow push-in, static/tripod

Pro tip: Always name at least one camera move. "Cinematic" alone tells the model nothing.

Audio prompting

Grok generates audio in the same pass as video. Add an AUDIO: block at the end of your prompt:

"Close-up of hands pulling apart a warm cinnamon roll, steam rising, soft morning window light, slow camera push-in. AUDIO: soft room tone, faint kettle hiss, gentle pastry tear sound"

Common mistakes

Mistake	Fix
Re-describing the image	Focus only on motion — the model already sees it
Contradicting the source image	If the image has a man, don't prompt "a woman dances"
Negative prompts	Don't use them — they're ignored on this model
Tag stacking ("knight, castle, epic, 8K")	Write a natural sentence with intent
Too many simultaneous actions	Stick to 1 subject + 1 action + 1 camera move
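The prompt formula, the "always name a camera move" tip, and the AUDIO: block from this section can be combined into one small helper. This is a sketch under assumptions: `build_grok_prompt` is a hypothetical name, and the comma-joined layout is one reasonable way to write the prompt, not a required syntax.

```python
def build_grok_prompt(action, camera, lighting="", audio=""):
    """Assemble [Subject action] + [Camera movement] + [Lighting changes] + [Audio cues]."""
    if not camera.strip():
        raise ValueError("name at least one camera move")
    # Visual parts in formula order; empty parts are skipped.
    parts = [p.strip().rstrip(".,") for p in (action, camera, lighting) if p.strip()]
    prompt = ", ".join(parts) + "."
    if audio.strip():
        prompt += " AUDIO: " + audio.strip()
    if len(prompt) > 2000:
        raise ValueError("prompt exceeds the 2,000-character limit")
    return prompt


print(build_grok_prompt(
    "The woman slowly turns her head to the right and smiles",
    "gentle camera push-in",
    lighting="soft morning window light",
    audio="soft room tone, faint breeze",
))
```

The helper takes one action and one camera move on purpose, which matches the "1 subject + 1 action + 1 camera move" fix in the table.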

Iteration workflow

Generate 3–5 variations at 480p first (cheaper, faster).
Pick the best one.
Render that version at 720p for final output.

7. Happy Horse 1.1

Alibaba's latest video model. Three modes, each with different input needs.

Setting	Text to Video	Image to Video	Reference to Video
Prompt	Up to 10,000 chars	Up to 10,000 chars	Up to 10,000 chars
Resolution	720p / 1080p	720p / 1080p	720p / 1080p
Duration	5 / 10 / 15s	5 / 10 / 15s	5 / 10 / 15s
Inputs required	Just a description	Exactly 1 primary image	At least 1 reference image. Primary optional. Up to 8 extra images.

How to prompt Happy Horse

Happy Horse is unusual: brevity wins. Most shots only need about 20 words.

Subject → Action → Setting → One camera cue

✅ Good: "A young woman in a red coat walks down a wet city street at night, neon reflections, slow dolly-in"

❌ Bad: "A beautiful stunning gorgeous young woman in a detailed amazing dress walks slowly through a lovely park with incredible lighting" — those extra adjectives actually hurt quality.

The anti-slop rule: Cut every adjective that isn't specific. Drop "beautiful", "stunning", "amazing", "masterpiece", "epic", "breathtaking". Replace them with concrete details: "overcast daylight", "wet asphalt", "neon reflections", "warm amber backlight".
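A quick way to apply the anti-slop rule is to scan a draft for the six words named above and count its length. The sketch below assumes nothing beyond that word list. The helper names are made up, and the check won't catch other vague adjectives such as "gorgeous" or "lovely".

```python
import re

# The six words the anti-slop rule says to drop.
VAGUE_WORDS = {"beautiful", "stunning", "amazing", "masterpiece", "epic", "breathtaking"}


def find_vague_words(prompt):
    """Return the listed vague words in the order they appear in the prompt."""
    words = re.findall(r"[a-z]+", prompt.lower())
    return [w for w in words if w in VAGUE_WORDS]


def word_count(prompt):
    """Count whitespace-separated words, to compare against the ~20-word target."""
    return len(prompt.split())


good = "A young woman in a red coat walks down a wet city street at night, neon reflections, slow dolly-in"
bad = ("A beautiful stunning gorgeous young woman in a detailed amazing dress walks slowly "
       "through a lovely park with incredible lighting")
print(word_count(good), find_vague_words(good))
print(word_count(bad), find_vague_words(bad))
```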

Camera language pays off

Happy Horse is unusually good at camera moves. Put the camera cue at the end of the prompt for maximum weight:

"Steadicam push", "slow dolly-in", "lateral orbit with parallax", "helicopter aerial", "rack focus"

For longer prompts: use shot lists

If you need more than one sentence, don't write a paragraph — use a shot list with timecodes:

Shot 1 (wide establishing, 0-1s): Camera pulls into a rain-slicked street at night.
Shot 2 (mid tracking, 1-4s): The woman enters frame from right, walking briskly.
Shot 3 (slow push-in close, 4-5s): Slow dolly-in onto her face, raindrops in her hair.
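If you write shot lists often, a small formatter keeps the timecodes consistent. This sketch takes a framing label, a length in seconds, and a description for each shot, then works out the running timecodes. The function name and tuple layout are illustrative choices.

```python
def format_shot_list(shots):
    """shots: list of (framing, seconds, description) tuples, in playback order."""
    lines = []
    start = 0
    for number, (framing, seconds, description) in enumerate(shots, start=1):
        end = start + seconds
        lines.append(f"Shot {number} ({framing}, {start}-{end}s): {description}")
        start = end  # the next shot begins where this one ends
    return "\n".join(lines)


print(format_shot_list([
    ("wide establishing", 1, "Camera pulls into a rain-slicked street at night."),
    ("mid tracking", 3, "The woman enters frame from right, walking briskly."),
    ("slow push-in close", 1, "Slow dolly-in onto her face, raindrops in her hair."),
]))
```

The last timecode should match the duration you select (5, 10, or 15 seconds).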

Reference to Video (R2V) — best for commercial use

Use clear, sharp reference images of your subject.
Upload 3–9 multi-angle refs for character consistency — this prevents "face-changing" across shots.
The reference defines who, the prompt defines what happens.
Describe motion, not appearance — the reference already shows the look.

Dialogue timing formula

Single line duration (seconds) = (character count ÷ 4) × 1.2

Keep dialogue to 40 characters or fewer per 15-second clip, with at most 2 lines per shot, for clean lip-sync.
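Here is the timing formula worked through in code. It assumes the result is in seconds and that spaces and punctuation count as characters, neither of which the formula states outright. The function names are hypothetical.

```python
def line_duration_seconds(line):
    """(character count / 4) * 1.2, assumed to give seconds of speech."""
    return len(line) / 4 * 1.2


def fits_clip(lines, max_chars=40, max_lines=2):
    """Check the 40-character and 2-line guidance for one 15-second clip."""
    total_chars = sum(len(line) for line in lines)
    return len(lines) <= max_lines and total_chars <= max_chars


line = "Let's begin."  # 12 characters
print(round(line_duration_seconds(line), 2))      # about 3.6
print(round(line_duration_seconds("x" * 40), 2))  # the 40-character ceiling, about 12
print(fits_clip(["Let's begin.", "Follow me."]))
```

At the 40-character ceiling the formula gives roughly 12 seconds, which leaves a little room inside a 15-second clip.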

What Happy Horse does well vs. struggles with

✅ Excels At	❌ Struggles With
Camera moves (Steadicam, dolly, aerial)	Multi-step sequences in plain prose (use shot lists)
Atmospheric lighting (blue hour, neon noir)	Extreme slow-motion cues ("1000fps slow-mo")
Vehicles and large rigid objects	Wardrobe details under heavy motion
Cloth in wind (capes, flags, hair)	Booru tags, JSON, weighted parentheses
Fire and embers	Writing specific text in prompts
Mirrors and reflections (geometrically consistent)	Multiple simultaneous complex actions

8. Wan 2.7

The latest Wan model, with the most modes. Good for both video and audio.

Setting	Text to Video	Image to Video	Reference to Video	Video Edit
Prompt	Up to 5,000	Up to 5,000	Up to 5,000	Up to 5,000
Resolution	720p / 1080p	720p / 1080p	720p / 1080p	720p / 1080p
Duration	5 / 10 / 15s	5 / 10 / 15s	5 / 10 / 15s	5 / 10 / 15s
Extra inputs	Optional audio	Image, end frame, video, audio (all optional). Need image OR continuation clip. Max 1 video URL.	At least one of: primary image, extra images, video, or audio. Max 4 extra images. Max 1 video URL.	Source Video required. Ref image optional. Max 1 video URL.
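The "Extra inputs" row packs a lot of rules into one line, so here is one way to read it as code. This is a sketch of the table only: the mode keys (`"text"`, `"image"`, `"reference"`, `"edit"`), the function name, and the parameter names are invented for the example.

```python
def check_wan27_inputs(mode, image=None, extra_images=(), videos=(), audio=None):
    """Return a list of problems with the inputs for one Wan 2.7 mode."""
    problems = []
    if len(videos) > 1:
        problems.append("at most 1 video URL is allowed")
    if mode == "text":
        pass  # audio is optional and nothing else is required
    elif mode == "image":
        # Needs a starting image OR a continuation clip.
        if image is None and not videos:
            problems.append("provide a starting image or a continuation clip")
    elif mode == "reference":
        if image is None and not extra_images and not videos and audio is None:
            problems.append("provide at least one image, video, or audio reference")
        if len(extra_images) > 4:
            problems.append("at most 4 extra images")
    elif mode == "edit":
        if not videos:
            problems.append("a source video is required")
    else:
        raise ValueError(f"unknown mode: {mode}")
    return problems


print(check_wan27_inputs("image"))
print(check_wan27_inputs("reference", extra_images=["a.png", "b.png"]))
```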

How to prompt Wan 2.7

Wan rewards structured, screenplay-like prompts. Use this formula:

Subject + Scene + Motion + Lighting + Camera + Style

Example:

"A golden retriever running through autumn leaves in a park, warm afternoon light, camera tracking from the side, cinematic shallow depth of field"

First & last frame control (standout feature)

This is Wan 2.7's killer feature for Image-to-Video:

Upload a start frame and an end frame — the model generates the motion between them.
Keep both frames aligned in aspect ratio, lighting direction, and subject placement.
Inconsistent lighting between frames causes mid-clip light-source jumps.
Think of the pair as defining a verb: open → closed, before → after, assembling → complete.
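Aspect ratio is the one alignment check you can automate before uploading a frame pair. The sketch below compares two (width, height) sizes with a small tolerance. The function name and the 0.01 tolerance are assumptions. Lighting direction and subject placement still need a visual check.

```python
def same_aspect_ratio(start_size, end_size, tolerance=0.01):
    """Compare two (width, height) pairs; True if their ratios match within tolerance."""
    start_ratio = start_size[0] / start_size[1]
    end_ratio = end_size[0] / end_size[1]
    return abs(start_ratio - end_ratio) <= tolerance


print(same_aspect_ratio((1920, 1080), (1280, 720)))   # both 16:9
print(same_aspect_ratio((1920, 1080), (1080, 1920)))  # landscape vs. portrait
```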

Audio tips

Wan 2.7 supports native audio — describe it in your prompt:

Dialogue: Include spoken lines with tone and pace: "A man says 'Hello', tone warm, medium pace"
Sound effects: "Ice cube drops into glass, sharp clink"
Background music: "Upbeat synthwave background track"
If you don't want audio, explicitly say: "No dialogue. No background music."

Known limitations

Complex multi-character scenes with specific interactions can be inconsistent.
Text rendering within generated videos is unreliable.
Longer durations (10+ seconds) may show motion degradation.
Hand and finger consistency can be imperfect — reduce simultaneous actions and avoid extreme foreshortening.

Iteration workflow

Treat the first output as a draft — refine, don't regenerate from scratch.
Lock a good seed once you get a clean output for reproducibility.
Validate motion at 0.5x speed — look for mid-clip jitter around 50–70% through.
Build scene-by-scene: generate one clip at a time, combine in editing.

9. Wan 2.5

Setting	Range
Prompt	Up to 5,000 characters
Mode	Text-to-video or image-to-video (exactly 1 image)
Resolution	720p or 1080p
Duration	5 or 10 seconds

A simpler, shorter-duration version of Wan 2.7. Use this when you need quick, clean 5–10 second clips. The same prompt principles apply: structured, descriptive, with camera direction toward the end.

10. Wan 2.2

Older generation. Lower resolution ceiling (max 720p), but still useful for specific tasks.

Mode	Prompt	Inputs	Resolution	Duration
Text	Up to 5,000	None	480p / 580p / 720p	5s (fixed)
Image	Up to 5,000	Exactly 1 image	480p / 580p / 720p	5s (fixed)
Speech	Up to 5,000	1 portrait + 1 audio	480p / 580p / 720p	5 or 10s
Animate Move / Replace	No prompt needed	1 image + 1 video	480p / 580p / 720p	1s (fixed)

The Animate Move / Replace mode is unique — it creates a 1-second motion loop from an image and a reference video. Use it for subtle motion effects like a character blinking, an object rotating, or a flag waving. No prompt writing needed.

The Speech mode requires a portrait photo (a clear face shot) plus an audio file. Make sure the photo is well-lit with the face clearly visible for best results.

11. Seedance 2.0

ByteDance's model with flexible duration (any length from 4 to 15 seconds).

Mode	Prompt	Resolution	Duration	Inputs
Text to Video	Up to 10,000 chars	480p / 720p / 1080p	4–15s	None
Image to Video	Up to 10,000 chars	480p / 720p / 1080p	4–15s	Primary image required, secondary optional
Fast Text	Up to 10,000 chars	480p / 720p	4–15s	None
Fast Image	Up to 10,000 chars	480p / 720p	4–15s	Primary image required, secondary optional
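The main differences between the four modes are the resolution ceiling and whether a primary image is required. A rough check, with hypothetical mode keys and function name, might look like this:

```python
# Mode -> (allowed resolutions, primary image required), copied from the table above.
SEEDANCE_MODES = {
    "text": ({"480p", "720p", "1080p"}, False),
    "image": ({"480p", "720p", "1080p"}, True),
    "fast_text": ({"480p", "720p"}, False),
    "fast_image": ({"480p", "720p"}, True),
}


def check_seedance(mode, resolution, duration_seconds, prompt, primary_image=None):
    """Return a list of problems for one Seedance 2.0 request."""
    resolutions, needs_image = SEEDANCE_MODES[mode]
    problems = []
    if resolution not in resolutions:
        problems.append(f"{mode} does not offer {resolution}")
    if not 4 <= duration_seconds <= 15:
        problems.append("duration must be between 4 and 15 seconds")
    if len(prompt) > 10_000:
        problems.append("prompt must be 10,000 characters or fewer")
    if needs_image and primary_image is None:
        problems.append(f"{mode} requires a primary image")
    return problems


print(check_seedance("fast_text", "1080p", 7, "A train leaves the station at sunrise"))
```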

Prompt formula

[Subject] + [Action] + [Environment] + [Camera] + [Lighting] + [Style] + [Audio]

Seedance is reference-driven — upload images or video clips and assign them roles with @ notation:

"Use @Image1 for the character's appearance. Use @Video1 for camera movement only."

The golden rule for Seedance

The first 20–30 words of your prompt carry the most weight. Lead with who/what is in frame, then what they do. Save style and lighting for after the subject is locked.

✅ Strong: "A clear glass perfume bottle sits on a black stone pedestal in a dark studio. Condensation rolls slowly down the glass. Medium close-up, slow circular dolly, soft side lighting, luxury beauty-ad look."

❌ Weak: "A cool cinematic ad for a perfume bottle, dramatic, stylish, premium, beautiful lighting. Wide camera, realistic."

Multi-shot sequences

Label each shot clearly:

Shot 1: Wide establishing shot of a futuristic train station at sunrise.
Shot 2: Medium shot of a traveler stepping onto the platform, coat moving in wind.
Shot 3: Close-up of a glowing ticket in their hand.

Keep each shot to one primary action. If a shot tries to do running + camera pan + lightning + mood shift, the model will pick what to prioritize.

Audio direction

Seedance generates synchronized audio. Describe sounds explicitly:

"Soft rain ambience, distant traffic, subtle tyre splash, warm analog music begins to play"

Cost-savvy workflow

Stage	Best Choice	Why
Concept exploration	Fast mode / lower resolution	Cheap iteration
Prompt tuning	Fast or 480p	You're testing composition
Internal review	Standard	Enough quality to judge continuity
Final delivery	720p or 1080p	Reserve premium spend for approved shots

12. Wan 2.7 Image / Image Pro

Separate image-only tools (not the video generator).

Wan 2.7 Image:

Setting	Range
Prompt	1 – 3,000 characters
Aspect ratio	Pick from a preset list
Quality	Basic or High

Wan 2.7 Image Pro:

Setting	Range
Prompt	1 – 3,000 characters
Reference images	Up to 9 URLs
Aspect ratio	Same preset list
Quality	Basic or High

Prompt formula for Wan 2.7 Image

Entity (detailed appearance) + Scene (environment) + Aesthetic (lighting, shot size) + Stylization

Tips for best quality:

Layer lighting elements as a sequence: "Twilight, warm interior lights, deep blue sky transitioning to orange"
Use atmosphere words over technical jargon: "cinematic", "atmospheric", "painterly photoreal" work better than "shot on 50mm f/1.4"
Use reference images to control color palette for brand consistency.
Image Pro allows up to 9 reference images — use them to lock in character appearance, product details, or scene style.

Quality tiers

Basic: Fast iteration, good for testing prompts.
High: Production quality. Use for final output.

13. Tattoo Generator

Specialized for tattoo designs, with simpler controls than the general image tool.

Setting	What you can do
Mode	Text-to-image (describe your design) or image-edit (modify an existing design)
Prompt	1 – 500 characters — shorter than the general image tool, so be concise
Style, Complexity, Line Weight, Image Size, AI Model	All fixed presets — just pick from what's shown
Reference image	Required if you use image-edit mode

Tips:

500 characters is enough for a detailed description of placement, style (tribal, watercolor, geometric, etc.), and subject matter.
For image-edit, upload a clean photo of the existing design or placement area.

14. Something to Watch Out For

There's one limit the interface doesn't enforce yet: custom audio clips for Wan 2.7 must be under 30 seconds, and that check happens on the model side. The page shows a reminder, but a longer audio file will still upload and then fail when the model tries to process it. Keep your audio under 30 seconds for now.
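Until the interface enforces this, you can check clip length yourself before uploading. The sketch below uses Python's standard `wave` module, so it only reads uncompressed WAV files; other formats such as `mp3` or `m4a` would need a different tool. The function names are made up for the example.

```python
import wave

MAX_AUDIO_SECONDS = 30  # Wan 2.7 custom audio must be under this length


def is_under_audio_limit(seconds):
    """True if a clip of this length stays under the 30-second limit."""
    return seconds < MAX_AUDIO_SECONDS


def wav_duration_seconds(path):
    """Length of a WAV file in seconds: frame count divided by frame rate."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


print(is_under_audio_limit(29.5))
print(is_under_audio_limit(30))
```

For a real file, pass the result of `wav_duration_seconds("voiceover.wav")` to `is_under_audio_limit`, where `voiceover.wav` stands in for your own clip.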

Quick Reference: Which Model Should You Use?

Your Goal	Best Model	Why
Best overall video quality	Wan 2.7 or Happy Horse 1.1	Up to 15s, 1080p, best prompt adherence
Start from an image	Happy Horse 1.1 Image to Video or Wan 2.7 Image to Video	Both accept starting images with motion control
Edit an existing video	Wan 2.7 Video Edit	Built specifically for this
Many reference images (up to 10)	Kling 2.6	Most generous image limit
A short motion loop (~1 second)	Wan 2.2 Animate Move / Replace	Unique 1-second fixed duration mode
Very specific duration (7s, 12s, etc.)	Seedance 2.0	Any duration from 4 to 15 seconds
Add audio / lip-sync to video	Wan 2.7 Text to Video or Happy Horse 1.1	Both support native audio generation
Wide aspect ratio (21:9)	V1 Pro	The only model with ultrawide support
High-quality still images	Wan 2.7 Image Pro	Up to 9 reference images, 2K+ output
Quick drafts / fast iteration	Seedance 2.0 Fast or Wan 2.5	Lower cost per generation

Production Workflow — Putting It All Together

For serious projects (AI short dramas, commercials, storytelling), here's a recommended pipeline used by experienced creators:

Script breakdown → Convert your script into visual shot descriptions.
Character design → Establish core appearance using Wan 2.7 Image Pro with reference images.
Style board → Set overall art direction, color palette, and lighting.
Storyboard → Plan camera language, shot types, and composition.
Generate keyframes → Create high-quality still images of each shot.
Generate video → Animate stills using your chosen video model. Use first/last frame control for smooth transitions.
Chain generations → Use the last frame of clip A as the input for clip B to maintain continuity.
Edit & assemble → Refine pacing, fix details, output final cut.

Pro tip for consistency across shots: When you switch models mid-project, keep the prompt structure the same — change only the visual vocabulary. The model influences the result, but the prompt structure determines the shot.