2026/06/29

How to Use Our Image & Video Generator — A Practical Guide

From uploading files to generating studio-quality videos, this guide covers every tool on the platform with prompt templates, pro tips, and common pitfalls to avoid.

Whether you're generating your first image or trying to get the perfect 15-second video, this guide walks you through every tool on the platform. Each section covers what a model can do, how to write prompts that actually work, and the mistakes to skip.


1. Before You Start — File Uploads

These limits apply everywhere, no matter which tool you're using.

| Media Type | Accepted Formats | Max Size |
| --- | --- | --- |
| Images | jpg, jpeg, png, webp | 10 MB |
| Videos | mp4, mov, webm, mkv | 100 MB (as reference) / 80 MB (direct upload) |
| Audio | aac, m4a, mp3, ogg, wav, webm | 100 MB |

Practical tips:

  • Converting PNG to JPEG can cut file size by 60–80% if you're close to the 10 MB limit.
  • For video references, use the shortest clip that captures the motion you want — don't upload a 3-minute video when 10 seconds is enough.
  • The Video References and Motion Control tools accept images, videos, and audio. AI Reframe only accepts videos.
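If you prepare files in bulk, a small pre-flight check can catch rejected uploads before you reach the platform. The sketch below copies the formats and sizes from the table above; the function name `check_upload`, the choice of the 80 MB direct-upload cap for video, and the 1 MB = 1,048,576 bytes conversion are assumptions for illustration, not the platform's actual validation.

```python
from pathlib import Path

# Limits copied from the table above; the direct-upload video cap is used here.
UPLOAD_LIMITS = {
    "image": ({"jpg", "jpeg", "png", "webp"}, 10),
    "video": ({"mp4", "mov", "webm", "mkv"}, 80),
    "audio": ({"aac", "m4a", "mp3", "ogg", "wav", "webm"}, 100),
}
MB = 1024 * 1024  # assumption: the platform may count 1 MB as 1,000,000 bytes


def check_upload(filename: str, size_bytes: int, media_type: str) -> list[str]:
    """Return a list of problems; an empty list means the file looks acceptable."""
    extensions, max_mb = UPLOAD_LIMITS[media_type]
    problems = []
    ext = Path(filename).suffix.lower().lstrip(".")
    if ext not in extensions:
        problems.append(f"{ext or 'no extension'} is not an accepted {media_type} format")
    if size_bytes > max_mb * MB:
        problems.append(f"file exceeds the {max_mb} MB {media_type} limit")
    return problems


print(check_upload("portrait.png", 12 * MB, "image"))
```

A 12 MB PNG is flagged for size, which is the case where converting to JPEG first is worth trying.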

2. Image Generator — General Purpose

This is your go-to tool for creating still images.

What you can control

| Setting | Range | Tip |
| --- | --- | --- |
| Prompt | 10 – 1,000 characters | Stay between 50–200 chars for best results |
| Number of images | 1–4 (text-to-image: 1, 2, or 4) | Generate 4 at a time to compare variations |
| Reference images | Up to 5 URLs | Use for style reference, not subject copying |
| Negative prompt | Up to 800 characters | "blurry, distorted, low quality, extra limbs" |
| Inference steps | 10 – 64 | 30 is a good default; go higher for complex scenes |
| Guidance scale | 1 – 15 | 7–9 works for most prompts |

How to write prompts that work

The best image prompts follow a simple formula:

Subject + Scene + Lighting + Style

Example:

"A young woman with wavy auburn hair in a vintage floral dress, standing in a sunlit garden, soft golden hour light, cinematic shallow depth of field"

Do's and Don'ts:

| ✅ Do This | ❌ Don't Do This |
| --- | --- |
| Describe what the subject looks like (hair, clothes, expression) | Name specific real people by name |
| Specify the lighting (golden hour, neon glow, soft window light) | Stack tech terms ("8K, hyperrealistic, 4K, ultra-detailed"); they dilute each other |
| Use atmosphere words (cinematic, painterly, documentary-style) | Use camera model names ("shot on Sony A7III"); the model doesn't know what that looks like |
| Be specific about composition ("close-up of face", "full body shot") | Write run-on sentences; break them into clear phrases |
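The Subject + Scene + Lighting + Style formula can be treated as four fields joined into one prompt. The helper below is only a sketch: `build_image_prompt` is a made-up name, and the length check simply mirrors the 10 to 1,000 character range listed in the settings table.

```python
def build_image_prompt(subject: str, scene: str, lighting: str, style: str) -> str:
    """Join the four parts of the formula into one comma-separated prompt."""
    parts = [p.strip() for p in (subject, scene, lighting, style) if p.strip()]
    prompt = ", ".join(parts)
    # The settings table lists 10 to 1,000 characters for the prompt field.
    if not 10 <= len(prompt) <= 1000:
        raise ValueError(f"prompt is {len(prompt)} characters; expected 10 to 1,000")
    return prompt


print(build_image_prompt(
    "A young woman with wavy auburn hair in a vintage floral dress",
    "standing in a sunlit garden",
    "soft golden hour light",
    "cinematic shallow depth of field",
))
```

Keeping the parts separate makes it easy to change one of them at a time while iterating.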

Image-edit mode

If you're editing an existing image:

  • You can only upload 1 reference image (not 5).
  • You can generate 2, 3, or 4 variations (not just 1).
  • Describe what you want to change in the prompt, not what's already there.

3. V1 Pro Video

An older model, but it supports the widest range of aspect ratios.

| Setting | Range |
| --- | --- |
| Prompt | 1 – 10,000 characters |
| Reference image | Optional, 1 URL |
| Aspect ratio | 1:1, 4:3, 3:4, 16:9, 9:16, 21:9 (the full set) |
| Resolution | 480p, 720p, or 1080p |
| Duration | 5 or 10 seconds |
| Fixed lens / Generate audio | On/off toggles |

Tips

  • With 10,000 characters of prompt space, you can write very detailed scene descriptions. Use time-coded segments for multi-beat action.
  • The 21:9 aspect ratio is unique to this model — useful for cinematic widescreen.
  • For faster iteration, test at 480p first, then render final at 1080p.

4. Kling 2.6

Great for image-to-video workflows — supports up to 10 reference images.

| Setting | Range |
| --- | --- |
| Prompt | 1 – 2,500 characters |
| Mode | Text-to-video or image-to-video |
| Reference images | Up to 10 URLs (image-to-video) |
| Sound | On/off toggle |
| Aspect ratio | 1:1, 16:9, 9:16 |
| Duration | 5 or 10 seconds |

Prompt formula

Scene setting + Subject description + Motion + Stylistic guidance

Image-to-video tips:

  • Your prompt should describe motion, not the image itself — the model can already see it.
  • Focus on camera movement and what changes: "Camera slowly tracks right, subtle wind affecting hair and clothing"
  • Use high-resolution source images (1080p or higher) for best results.
  • Keep reference images free of text overlays and watermarks.

Multi-shot storytelling (new in 2.6): You can describe sequential events in a single prompt and the model will generate transitions:

"A man walks into a coffee shop, orders a drink, then sits by the window as rain starts"

Native audio tips (2.6):

  • Put dialogue in quotes to trigger lip-sync: "...saying 'Let's begin.'"
  • Describe ambient sounds explicitly: "coffee shop chatter, espresso machine hissing, rain on windows"
  • Audio doubles the credit cost — enable only when needed.

Chain generations trick

Generate a 10-second clip, then use its last frame as the input for the next generation. You can build sequences up to 3 minutes this way.
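The chaining trick is a simple loop: each call receives the previous clip's last frame. The sketch below shows only the control flow. `chain_clips` and `fake_generate` are hypothetical stand-ins (nothing here calls the platform), and it assumes every clip runs the full 10 seconds.

```python
import math
from typing import Callable, Optional


def chain_clips(
    generate: Callable[[Optional[str]], tuple[str, str]],
    target_seconds: int,
    clip_seconds: int = 10,
) -> list[str]:
    """Generate clips back to back, feeding each last frame into the next call."""
    clips = []
    start_frame = None  # the first clip has no previous frame
    for _ in range(math.ceil(target_seconds / clip_seconds)):
        clip, start_frame = generate(start_frame)
        clips.append(clip)
    return clips


# Stand-in generator so the sketch runs without any platform access.
def fake_generate(start_frame: Optional[str]) -> tuple[str, str]:
    index = 0 if start_frame is None else int(start_frame.split("_")[1]) + 1
    return f"clip_{index}", f"frame_{index}"


print(len(chain_clips(fake_generate, target_seconds=180)))  # 18 clips for 3 minutes
```

At 10 seconds per clip, the 3-minute ceiling mentioned above works out to 18 chained generations.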


5. Wan 2.6

| Setting | Range |
| --- | --- |
| Prompt | 1 – 5,000 characters |
| Modes | Text-to-video, image-to-video (1 image required), video-to-video (1 video required) |
| Resolution | 720p or 1080p |
| Duration | 5, 10, or 15 seconds |

A solid all-rounder. The prompt tips for Wan 2.7 (below) apply here too, since they share the same architecture.


6. Grok Imagine 1.5

Image-to-video only — you must provide a starting image. There's no text-to-video mode on 1.5.

| Setting | Range |
| --- | --- |
| Prompt | 1 – 2,000 characters |
| Reference image | Required |
| Aspect ratio | 16:9, 9:16, 1:1 |
| Resolution | 480p or 720p |
| Duration | 5, 8, 10, or 15 seconds |

The golden rule: describe motion, not the image

The model already sees your input image. Your job is to tell it what should change.

Good: "The woman slowly turns her head to the right and smiles, soft breeze moving her hair, gentle camera push-in"

Bad: "A beautiful woman with long hair stands in a garden" (the model can already see this)

Prompt formula

[Subject action] + [Camera movement] + [Lighting changes] + [Audio cues]

Camera vocabulary that works:

  • Pan left/right, tilt up/down, zoom in/out, dolly in/out
  • Tracking/follow shot, orbit/surround, aerial/drone
  • Handheld, slow push-in, static/tripod

Pro tip: Always name at least one camera move. "Cinematic" alone tells the model nothing.

Audio prompting

Grok generates audio in the same pass as video. Add an AUDIO: block at the end of your prompt:

"Close-up of hands pulling apart a warm cinnamon roll, steam rising, soft morning window light, slow camera push-in. AUDIO: soft room tone, faint kettle hiss, gentle pastry tear sound"

Common mistakes

| Mistake | Fix |
| --- | --- |
| Re-describing the image | Focus only on motion; the model already sees it |
| Contradicting the source image | If the image has a man, don't prompt "a woman dances" |
| Negative prompts | Don't use them; they're ignored on this model |
| Tag stacking ("knight, castle, epic, 8K") | Write a natural sentence with intent |
| Too many simultaneous actions | Stick to 1 subject + 1 action + 1 camera move |

Iteration workflow

  1. Generate 3–5 variations at 480p first (cheaper, faster).
  2. Pick the best one.
  3. Render that version at 720p for final output.
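To keep the AUDIO: block described in this section consistent across variations, it can help to build the prompt from a visual part and a list of audio cues. This is a sketch under assumptions: `with_audio_block` is a made-up helper, and the 2,000 character cap comes from the settings table for this model.

```python
def with_audio_block(visual_prompt: str, audio_cues: list[str], max_chars: int = 2000) -> str:
    """Append a trailing AUDIO: block and check the prompt length limit."""
    prompt = visual_prompt.strip()
    if audio_cues:
        prompt = f"{prompt} AUDIO: {', '.join(audio_cues)}"
    if len(prompt) > max_chars:
        raise ValueError(f"prompt is {len(prompt)} characters; the limit is {max_chars}")
    return prompt


print(with_audio_block(
    "Close-up of hands pulling apart a warm cinnamon roll, steam rising, "
    "soft morning window light, slow camera push-in.",
    ["soft room tone", "faint kettle hiss", "gentle pastry tear sound"],
))
```

Because the audio cues live in their own list, you can rerun the 480p drafts with the same visuals and different sound.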

7. Happy Horse 1.1

Alibaba's latest video model. Three modes, each with different input needs.

| Setting | Text to Video | Image to Video | Reference to Video |
| --- | --- | --- | --- |
| Prompt | Up to 10,000 chars | Up to 10,000 chars | Up to 10,000 chars |
| Resolution | 720p / 1080p | 720p / 1080p | 720p / 1080p |
| Duration | 5 / 10 / 15s | 5 / 10 / 15s | 5 / 10 / 15s |
| Inputs required | Just a description | Exactly 1 primary image | At least 1 reference image. Primary optional. Up to 8 extra images. |

How to prompt Happy Horse

Happy Horse is unusual: brevity wins. Most shots only need about 20 words.

Subject → Action → Setting → One camera cue

Good: "A young woman in a red coat walks down a wet city street at night, neon reflections, slow dolly-in"

Bad: "A beautiful stunning gorgeous young woman in a detailed amazing dress walks slowly through a lovely park with incredible lighting" — those extra adjectives actually hurt quality.

The anti-slop rule: Cut every adjective that isn't specific. Drop "beautiful", "stunning", "amazing", "masterpiece", "epic", "breathtaking". Replace them with concrete details: "overcast daylight", "wet asphalt", "neon reflections", "warm amber backlight".

Camera language pays off

Happy Horse is unusually good at camera moves. Put the camera cue at the end of the prompt for maximum weight:

  • "Steadicam push", "slow dolly-in", "lateral orbit with parallax", "helicopter aerial", "rack focus"

For longer prompts: use shot lists

If you need more than one sentence, don't write a paragraph — use a shot list with timecodes:

Shot 1 (wide establishing, 0-1s): Camera pulls into a rain-slicked street at night.
Shot 2 (mid tracking, 1-4s): The woman enters frame from right, walking briskly.
Shot 3 (slow push-in close, 4-5s): Slow dolly-in onto her face, raindrops in her hair.
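Running timecodes are easy to get wrong by hand, so a shot list like the one above can be generated from per-shot durations. In this sketch `format_shot_list` is a hypothetical helper and the line format simply copies the example; the model does not require this exact layout as far as this guide states.

```python
def format_shot_list(shots: list[tuple[str, int, str]]) -> str:
    """Build a shot list from (framing, duration in seconds, description) tuples."""
    lines = []
    start = 0
    for number, (framing, seconds, description) in enumerate(shots, start=1):
        end = start + seconds  # each shot begins where the previous one ended
        lines.append(f"Shot {number} ({framing}, {start}-{end}s): {description}")
        start = end
    return "\n".join(lines)


print(format_shot_list([
    ("wide establishing", 1, "Camera pulls into a rain-slicked street at night."),
    ("mid tracking", 3, "The woman enters frame from right, walking briskly."),
    ("slow push-in close", 1, "Slow dolly-in onto her face, raindrops in her hair."),
]))
```

The durations add up to the clip length, so changing one shot automatically shifts the timecodes after it.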

Reference to Video (R2V) — best for commercial use

  • Use clear, sharp reference images of your subject.
  • Upload 3–9 multi-angle refs for character consistency — this prevents "face-changing" across shots.
  • The reference defines who, the prompt defines what happens.
  • Describe motion, not appearance — the reference already shows the look.

Dialogue timing formula

Single line duration (in seconds) = (character count ÷ 4) × 1.2

Keep dialogue to 40 characters or fewer per 15-second clip (about 12 seconds of speech by this formula), with at most 2 lines per shot, for clean lip-sync.
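The timing formula is quick to check in code before you commit a line of dialogue to a shot. This sketch assumes the character count includes spaces and punctuation, which the guide does not specify; `line_duration_seconds` is an illustrative name.

```python
def line_duration_seconds(line: str) -> float:
    """Estimate spoken duration: (character count / 4) * 1.2."""
    # Assumption: spaces and punctuation count as characters.
    return len(line) / 4 * 1.2


for line in ["Let's begin.", "x" * 40]:
    print(len(line), "characters ->", round(line_duration_seconds(line), 2), "seconds")
```

A 12-character line comes to about 3.6 seconds, and the 40-character ceiling comes to about 12 seconds, which leaves a little room inside a 15-second clip.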

What Happy Horse does well vs. struggles with

| ✅ Excels At | ❌ Struggles With |
| --- | --- |
| Camera moves (Steadicam, dolly, aerial) | Multi-step sequences in plain prose (use shot lists) |
| Atmospheric lighting (blue hour, neon noir) | Extreme slow-motion cues ("1000fps slow-mo") |
| Vehicles and large rigid objects | Wardrobe details under heavy motion |
| Cloth in wind (capes, flags, hair) | Booru tags, JSON, weighted parentheses |
| Fire and embers | Writing specific text in prompts |
| Mirrors and reflections (geometrically consistent) | Multiple simultaneous complex actions |

8. Wan 2.7

The latest Wan model, with the most modes. Good for both video and audio.

| Setting | Text to Video | Image to Video | Reference to Video | Video Edit |
| --- | --- | --- | --- | --- |
| Prompt | Up to 5,000 | Up to 5,000 | Up to 5,000 | Up to 5,000 |
| Resolution | 720p / 1080p | 720p / 1080p | 720p / 1080p | 720p / 1080p |
| Duration | 5 / 10 / 15s | 5 / 10 / 15s | 5 / 10 / 15s | 5 / 10 / 15s |
| Extra inputs | Optional audio | Image, end frame, video, audio (all optional). Need image OR continuation clip. Max 1 video URL. | At least one of: primary image, extra images, video, or audio. Max 4 extra images. Max 1 video URL. | Source Video required. Ref image optional. Max 1 video URL. |

How to prompt Wan 2.7

Wan rewards structured, screenplay-like prompts. Use this formula:

Subject + Scene + Motion + Lighting + Camera + Style

Example:

"A golden retriever running through autumn leaves in a park, warm afternoon light, camera tracking from the side, cinematic shallow depth of field"

First & last frame control (standout feature)

This is Wan 2.7's killer feature for Image-to-Video:

  • Upload a start frame and an end frame — the model generates the motion between them.
  • Keep both frames aligned in aspect ratio, lighting direction, and subject placement.
  • Inconsistent lighting between frames causes mid-clip light-source jumps.
  • Think of the pair as defining a verb: open → closed, before → after, assembling → complete.
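Of the three alignment points above, aspect ratio is the one you can verify mechanically before uploading. The check below is a sketch: `same_aspect_ratio` is a made-up helper that takes pixel dimensions you already know, and it demands an exact ratio match, which may be stricter than the model needs.

```python
from fractions import Fraction


def same_aspect_ratio(start_size: tuple[int, int], end_size: tuple[int, int]) -> bool:
    """Compare (width, height) pairs as exact fractions, so 1920x1080 equals 1280x720."""
    return Fraction(*start_size) == Fraction(*end_size)


print(same_aspect_ratio((1920, 1080), (1280, 720)))   # both are 16:9
print(same_aspect_ratio((1920, 1080), (1080, 1920)))  # landscape vs. portrait
```

Lighting direction and subject placement still need a visual check; this only rules out the mismatch that is easiest to miss.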

Audio tips

Wan 2.7 supports native audio — describe it in your prompt:

  • Dialogue: Include spoken lines with tone and pace: "A man says 'Hello', tone warm, medium pace"
  • Sound effects: "Ice cube drops into glass, sharp clink"
  • Background music: "Upbeat synthwave background track"
  • If you don't want audio, explicitly say: "No dialogue. No background music."

Known limitations

  • Complex multi-character scenes with specific interactions can be inconsistent.
  • Text rendering within generated videos is unreliable.
  • Longer durations (10+ seconds) may show motion degradation.
  • Hand and finger consistency can be imperfect — reduce simultaneous actions and avoid extreme foreshortening.

Iteration workflow

  1. Treat the first output as a draft — refine, don't regenerate from scratch.
  2. Lock a good seed once you get a clean output for reproducibility.
  3. Validate motion at 0.5x speed — look for mid-clip jitter around 50–70% through.
  4. Build scene-by-scene: generate one clip at a time, combine in editing.

9. Wan 2.5

| Setting | Range |
| --- | --- |
| Prompt | Up to 5,000 characters |
| Mode | Text-to-video or image-to-video (exactly 1 image) |
| Resolution | 720p or 1080p |
| Duration | 5 or 10 seconds |

A simpler, shorter-duration version of Wan 2.7. Use this when you need quick, clean 5–10 second clips. Same prompt principles apply — structured, descriptive, with camera direction at the end.


10. Wan 2.2

Older generation. Lower resolution ceiling (max 720p), but still useful for specific tasks.

| Mode | Prompt | Inputs | Resolution | Duration |
| --- | --- | --- | --- | --- |
| Text | Up to 5,000 | None | 480p / 580p / 720p | 5s (fixed) |
| Image | Up to 5,000 | Exactly 1 image | 480p / 580p / 720p | 5s (fixed) |
| Speech | Up to 5,000 | 1 portrait + 1 audio | 480p / 580p / 720p | 5 or 10s |
| Animate Move / Replace | No prompt needed | 1 image + 1 video | 480p / 580p / 720p | 1s (fixed) |

The Animate Move / Replace mode is unique — it creates a 1-second motion loop from an image and a reference video. Use it for subtle motion effects like a character blinking, an object rotating, or a flag waving. No prompt writing needed.

The Speech mode requires a portrait photo (a clear face shot) plus an audio file. Make sure the photo is well-lit with the face clearly visible for best results.


11. Seedance 2.0

ByteDance's model with flexible duration (any length from 4 to 15 seconds).

| Mode | Prompt | Resolution | Duration | Inputs |
| --- | --- | --- | --- | --- |
| Text to Video | Up to 10,000 chars | 480p / 720p / 1080p | 4–15s | None |
| Image to Video | Up to 10,000 chars | 480p / 720p / 1080p | 4–15s | Primary image required, secondary optional |
| Fast Text | Up to 10,000 chars | 480p / 720p | 4–15s | None |
| Fast Image | Up to 10,000 chars | 480p / 720p | 4–15s | Primary image required, secondary optional |

Prompt formula

[Subject] + [Action] + [Environment] + [Camera] + [Lighting] + [Style] + [Audio]

Seedance is reference-driven — upload images or video clips and assign them roles with @ notation:

"Use @Image1 for the character's appearance. Use @Video1 for camera movement only."

The golden rule for Seedance

The first 20–30 words of your prompt carry the most weight. Lead with who/what is in frame, then what they do. Save style and lighting for after the subject is locked.

Strong: "A clear glass perfume bottle sits on a black stone pedestal in a dark studio. Condensation rolls slowly down the glass. Medium close-up, slow circular dolly, soft side lighting, luxury beauty-ad look."

Weak: "A cool cinematic ad for a perfume bottle, dramatic, stylish, premium, beautiful lighting. Wide camera, realistic."
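The @ notation and the subject-first rule can be combined in a small prompt builder. Everything in this sketch is illustrative: the function names are made up, and placing the role assignments after the description is an assumption, since this guide does not say where they must appear.

```python
def assign_roles(roles: dict[str, str]) -> str:
    """Turn {"Image1": "the character's appearance"} into @ role sentences."""
    return " ".join(f"Use @{ref} for {role}." for ref, role in roles.items())


def seedance_prompt(subject_action: str, details: str, roles: dict[str, str]) -> str:
    """Lead with subject and action, then camera/lighting/style, then @ roles."""
    parts = (subject_action.strip(), details.strip(), assign_roles(roles))
    return " ".join(part for part in parts if part)


print(seedance_prompt(
    "A clear glass perfume bottle sits on a black stone pedestal in a dark studio.",
    "Medium close-up, slow circular dolly, soft side lighting.",
    {"Image1": "the bottle's appearance", "Video1": "camera movement only"},
))
```

Keeping the subject sentence as its own argument makes it hard to bury it behind style words by accident.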

Multi-shot sequences

Label each shot clearly, and add timecodes where timing matters:

Shot 1: Wide establishing shot of a futuristic train station at sunrise.
Shot 2: Medium shot of a traveler stepping onto the platform, coat moving in wind.
Shot 3: Close-up of a glowing ticket in their hand.

Keep each shot to one primary action. If a shot tries to do running + camera pan + lightning + mood shift, the model will pick what to prioritize.

Audio direction

Seedance generates synchronized audio. Describe sounds explicitly:

"Soft rain ambience, distant traffic, subtle tyre splash, warm analog music begins to play"

Cost-savvy workflow

| Stage | Best Choice | Why |
| --- | --- | --- |
| Concept exploration | Fast mode / lower resolution | Cheap iteration |
| Prompt tuning | Fast or 480p | You're testing composition |
| Internal review | Standard | Enough quality to judge continuity |
| Final delivery | 720p or 1080p | Reserve premium spend for approved shots |

12. Wan 2.7 Image / Image Pro

Separate image-only tools (not the video generator).

Wan 2.7 Image:

| Setting | Range |
| --- | --- |
| Prompt | 1 – 3,000 characters |
| Aspect ratio | Pick from a preset list |
| Quality | Basic or High |

Wan 2.7 Image Pro:

| Setting | Range |
| --- | --- |
| Prompt | 1 – 3,000 characters |
| Reference images | Up to 9 URLs |
| Aspect ratio | Same preset list |
| Quality | Basic or High |

Prompt formula for Wan 2.7 Image

Entity (detailed appearance) + Scene (environment) + Aesthetic (lighting, shot size) + Stylization

Tips for best quality:

  • Layer lighting elements as a sequence: "Twilight, warm interior lights, deep blue sky transitioning to orange"
  • Use atmosphere words over technical jargon: "cinematic", "atmospheric", "painterly photoreal" work better than "shot on 50mm f/1.4"
  • Use reference images to control color palette for brand consistency.
  • Image Pro allows up to 9 reference images — use them to lock in character appearance, product details, or scene style.

Quality tiers

  • Basic: Fast iteration, good for testing prompts.
  • High: Production quality. Use for final output.

13. Tattoo Generator

Specialized for tattoo designs, with simpler controls than the general image tool.

| Setting | What you can do |
| --- | --- |
| Mode | Text-to-image (describe your design) or image-edit (modify an existing design) |
| Prompt | 1 – 500 characters (shorter than the general image tool, so be concise) |
| Style, Complexity, Line Weight, Image Size, AI Model | All fixed presets; just pick from what's shown |
| Reference image | Required if you use image-edit mode |

Tips:

  • 500 characters is enough for a detailed description of placement, style (tribal, watercolor, geometric, etc.), and subject matter.
  • For image-edit, upload a clean photo of the existing design or placement area.

14. Something to Watch Out For

There's one limit that isn't enforced by the interface yet: Wan 2.7 custom audio clips must be under 30 seconds on the model side. The page shows a reminder, but if you upload a longer audio file, it will fail when the model tries to process it. Keep your audio under 30 seconds for now.
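Since the interface does not enforce the 30-second limit, you can check clip length yourself before uploading. The sketch below uses Python's standard `wave` module, so it only handles WAV files; other accepted formats such as mp3 or m4a would need a different tool. The helper names are made up, and `silent_wav` exists only so the example runs without a real file.

```python
import io
import wave

MAX_AUDIO_SECONDS = 30  # model-side limit described above


def wav_duration_seconds(source) -> float:
    """Return the length of a WAV file (path or file-like object) in seconds."""
    with wave.open(source, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_short_enough(source) -> bool:
    """True when the clip is strictly under the 30-second limit."""
    return wav_duration_seconds(source) < MAX_AUDIO_SECONDS


def silent_wav(seconds: int, rate: int = 8000) -> io.BytesIO:
    """Build an in-memory mono 16-bit WAV of silence, for demonstration only."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * (rate * seconds))
    buffer.seek(0)
    return buffer


print(is_short_enough(silent_wav(10)))  # a 10-second clip passes
print(is_short_enough(silent_wav(35)))  # a 35-second clip would fail on the model side
```

For a real file, pass its path instead of the in-memory buffer, for example `is_short_enough("voiceover.wav")`.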


Quick Reference: Which Model Should You Use?

| Your Goal | Best Model | Why |
| --- | --- | --- |
| Best overall video quality | Wan 2.7 or Happy Horse 1.1 | Up to 15s, 1080p, best prompt adherence |
| Start from an image | Happy Horse 1.1 Image to Video or Wan 2.7 Image to Video | Both accept starting images with motion control |
| Edit an existing video | Wan 2.7 Video Edit | Built specifically for this |
| Many reference images (up to 10) | Kling 2.6 | Most generous image limit |
| A short motion loop (~1 second) | Wan 2.2 Animate Move / Replace | Unique 1-second fixed duration mode |
| Very specific duration (7s, 12s, etc.) | Seedance 2.0 | Any duration from 4 to 15 seconds |
| Add audio / lip-sync to video | Wan 2.7 Text to Video or Happy Horse 1.1 | Both support native audio generation |
| Wide aspect ratio (21:9) | V1 Pro | The only model with ultrawide support |
| High-quality still images | Wan 2.7 Image Pro | Up to 9 reference images, 2K+ output |
| Quick drafts / fast iteration | Seedance 2.0 Fast or Wan 2.5 | Lower cost per generation |
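When clip length is the deciding factor, the duration rows from the model sections can be turned into a lookup. This is a sketch: the durations are copied from this guide, Seedance 2.0 is assumed to accept whole seconds only, and Wan 2.2 is left out because its duration depends on the mode. `models_for` is an illustrative name.

```python
# Supported clip durations in seconds, copied from the model sections of this guide.
DURATIONS = {
    "V1 Pro": {5, 10},
    "Kling 2.6": {5, 10},
    "Wan 2.6": {5, 10, 15},
    "Grok Imagine 1.5": {5, 8, 10, 15},
    "Happy Horse 1.1": {5, 10, 15},
    "Wan 2.7": {5, 10, 15},
    "Wan 2.5": {5, 10},
    "Seedance 2.0": set(range(4, 16)),  # assumption: whole seconds from 4 to 15
}


def models_for(seconds: int) -> list[str]:
    """List the models whose duration options include the requested length."""
    return [name for name, options in DURATIONS.items() if seconds in options]


print(models_for(7))   # only the flexible-duration model
print(models_for(15))
```

Limits may change as the platform updates, so treat the table in the code as a snapshot rather than a source of truth.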

Production Workflow — Putting It All Together

For serious projects (AI short dramas, commercials, storytelling), here's a recommended pipeline used by experienced creators:

  1. Script breakdown → Convert your script into visual shot descriptions.
  2. Character design → Establish core appearance using Wan 2.7 Image Pro with reference images.
  3. Style board → Set overall art direction, color palette, and lighting.
  4. Storyboard → Plan camera language, shot types, and composition.
  5. Generate keyframes → Create high-quality still images of each shot.
  6. Generate video → Animate stills using your chosen video model. Use first/last frame control for smooth transitions.
  7. Chain generations → Use the last frame of clip A as the input for clip B to maintain continuity.
  8. Edit & assemble → Refine pacing, fix details, output final cut.

Pro tip for consistency across shots: When you switch models mid-project, keep the prompt structure the same — change only the visual vocabulary. The model influences the result, but the prompt structure determines the shot.


General Prompt Writing Checklist

Before you hit generate, run through this:

  • Does the prompt start with the subject?
  • Is there one clear action or motion?
  • Did I specify camera position or movement?
  • Did I describe lighting?
  • Did I drop vague adjectives (beautiful, stunning, amazing)?
  • Is the prompt under 50–70 words (unless I need a multi-shot sequence)?
  • If multi-shot, are scenes labeled with timecodes?
  • Are negative prompts used only where the model supports them?
  • Did I test at lower resolution first?
  • Am I changing one variable at a time during iteration?
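A few items on this checklist can be checked automatically. The linter below is a rough sketch, not a platform feature: `lint_prompt` is a made-up name, the vague-adjective list comes from the Happy Horse section, the camera terms are a partial list drawn from this guide, and word matching is deliberately crude.

```python
import re

VAGUE = {"beautiful", "stunning", "amazing", "masterpiece", "epic", "breathtaking"}
# Partial list of camera words used in this guide; extend it as needed.
CAMERA = {"pan", "tilt", "zoom", "dolly", "tracking", "orbit", "aerial",
          "handheld", "push", "static", "steadicam"}


def lint_prompt(prompt: str, max_words: int = 70) -> list[str]:
    """Return checklist warnings for a single-shot prompt; empty means no findings."""
    words = re.findall(r"[a-z']+", prompt.lower())  # "dolly-in" becomes "dolly", "in"
    warnings = []
    vague = sorted(VAGUE.intersection(words))
    if vague:
        warnings.append("vague adjectives: " + ", ".join(vague))
    if len(words) > max_words:
        warnings.append(f"{len(words)} words; aim for {max_words} or fewer")
    if not CAMERA.intersection(words):
        warnings.append("no camera cue found")
    return warnings


print(lint_prompt("A beautiful stunning gorgeous young woman in a detailed amazing "
                  "dress walks slowly through a lovely park with incredible lighting"))
print(lint_prompt("A young woman in a red coat walks down a wet city street at night, "
                  "neon reflections, slow dolly-in"))
```

The first prompt is flagged for vague adjectives and a missing camera cue; the second passes. Lighting, subject order, and the one-variable-at-a-time rule still need a human read.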

Model availability and limits may change as the platform updates. The interface always shows what's available for your current selection.
