Wan 2.7 Reference-to-Video (R2V): Character Consistency Across Every Shot
How to use Wan 2.7 reference-to-video (R2V) for consistent characters across multi-shot video. Covers subject references, voice references, multi-character scenes, and how to get clean results.

Character drift is the most frustrating problem in AI video. Generate the same character across five shots and you get five versions of the same character — slightly different face, slightly different proportions, slightly different coloring. Each shot works individually. None of them match.
Wan 2.7 Reference-to-Video (R2V) is the direct solution. Supply reference images or video for a character, assign a voice reference if needed, and the model maintains visual identity and voice across generated shots.
At launch, Wan 2.7's official team put it directly: "We've built a director's suite — character customization with up to 5 reference inputs and voice profiles."

What Reference-to-Video (R2V) Does
R2V takes one or more reference inputs — images, video clips, or audio — and uses them as conditioning signals during generation. The model generates new footage in which the referenced subjects keep a consistent identity rather than drifting into loose approximations of the reference.
Single-subject R2V: supply one character reference, generate a clip featuring that character in any prompt-specified scenario. The face, body proportions, clothing style (if specified), and overall appearance stay locked to the reference.
Multi-subject R2V: supply up to 5 character references simultaneously. Each character gets its own visual identity in the generated scene, and they can interact with each other within the same shot.
Voice references: assign an audio reference to any subject, and the model maintains that voice timbre when the character speaks — even as the dialogue changes.
This is the most ambitious character consistency system in any open-weight video model released to date.
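The hosted tool at wan27.org drives all of this through a UI, but the shape of a request is easy to picture. The sketch below is purely illustrative: Wan 2.7 publishes no API schema for this, and every field name here is an assumption, but it captures the three reference types and the five-subject ceiling.

```python
# Hypothetical request shape for illustration only; Wan 2.7 exposes no
# public API schema here, and every field name is an assumption.
from dataclasses import dataclass, field

@dataclass
class CharacterReference:
    name: str                                        # label used in the scene prompt
    images: list[str] = field(default_factory=list)  # paths to reference photos
    video: str | None = None                         # optional short motion clip
    voice: str | None = None                         # optional clean speech sample

@dataclass
class R2VRequest:
    prompt: str                           # scene description using the names above
    characters: list[CharacterReference]  # up to 5 active character references

    def __post_init__(self) -> None:
        if len(self.characters) > 5:
            raise ValueError("Wan 2.7 R2V supports at most 5 character references")
```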
How R2V Works in Practice
Go to wan27.org and open the Wan 2.7 reference-to-video tool.

Step 1: Prepare your reference material
For each character you want to maintain:
- Image references: clean, well-lit photos showing the character's face and general appearance. Multiple angles improve consistency. Avoid heavily filtered or stylized references; the model works better from neutral, natural ones. A quick automated quality check is sketched after this list.
- Video references: short clips of the character work well for capturing motion style, expression range, and voice in a single input.
- Voice references: a short audio clip of the character speaking. Clean audio without background noise produces better voice matching.
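Reference quality is checkable before you generate anything. Below is a minimal pre-flight sketch assuming the opencv-python package is installed; the resolution and sharpness thresholds are illustrative starting points, not values published for Wan 2.7.

```python
# Minimal pre-flight check for image references. Assumes opencv-python is
# installed; the thresholds are illustrative, not published Wan 2.7 values.
import cv2

def check_reference(path: str, min_side: int = 512, min_sharpness: float = 100.0) -> list[str]:
    issues = []
    img = cv2.imread(path)
    if img is None:
        return [f"{path}: file could not be read"]
    h, w = img.shape[:2]
    if min(h, w) < min_side:
        issues.append(f"{path}: {w}x{h} is low-resolution for a face reference")
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    sharpness = cv2.Laplacian(gray, cv2.CV_64F).var()  # variance of Laplacian; low means blurry
    if sharpness < min_sharpness:
        issues.append(f"{path}: sharpness score {sharpness:.0f} suggests blur")
    return issues
```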
Step 2: Assign references to characters
In the R2V interface, assign each reference to a named character slot. You can have up to 5 active character references in a single generation.
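Continuing the hypothetical sketch from earlier, slot assignment amounts to pairing each reference bundle with the label you will use in the prompt. All names and file paths below are placeholders.

```python
# Placeholder names and file paths; continues the hypothetical sketch above.
cast = [
    CharacterReference(name="Mira", images=["mira_front.jpg", "mira_side.jpg"],
                       voice="mira_voice.wav"),
    CharacterReference(name="Tomas", images=["tomas_front.jpg"],
                       video="tomas_walk.mp4"),
]
request = R2VRequest(prompt="...", characters=cast)
```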
Step 3: Write your scene prompt
Reference characters in your prompt by their assigned names or labels. Describe the scene, their positions, interactions, and what they say if relevant. The more explicit the spatial and behavioral direction, the more control you have over how the characters appear relative to each other.
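For instance, a scene prompt for the two placeholder characters above might read:

```text
Mira sits at a cafe table by the window, left of frame. Tomas enters from
the right, places a cup on her table, and says "You're early." Mira looks
up, smiles, and replies "The train was on time for once."
```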
Step 4: Generate and evaluate consistency
Check the output for identity drift — does the generated character match the reference? For faces, the hairline, eye shape, and jawline are the areas to evaluate. Minor variations in lighting are expected and normal. Structural facial inconsistency indicates the reference was ambiguous or the prompt conflicted with the reference.
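If you want more than an eyeball comparison, face embeddings give a rough drift signal. This sketch assumes the face_recognition and opencv-python packages; the 0.6 distance cutoff is that library's conventional match threshold, not a number published for Wan 2.7.

```python
# Rough identity-drift check. File names are placeholders; the 0.6 cutoff
# is face_recognition's conventional threshold, not a Wan 2.7 value.
import cv2
import face_recognition

ref = face_recognition.load_image_file("mira_front.jpg")
ref_enc = face_recognition.face_encodings(ref)[0]   # assumes one face in the reference

cap = cv2.VideoCapture("generated_clip.mp4")
frame_idx, distances = 0, []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if frame_idx % 24 == 0:  # sample roughly one frame per second at 24 fps
        rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        for enc in face_recognition.face_encodings(rgb):
            distances.append(face_recognition.face_distance([ref_enc], enc)[0])
    frame_idx += 1
cap.release()

if distances and max(distances) > 0.6:
    print("possible identity drift: some sampled faces sit far from the reference")
```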
Multi-Character Scenes
Support for up to five simultaneous characters is where R2V becomes genuinely novel. The official demo from Alibaba shows a generated scene with five characters, each with locked visual and vocal identity, interacting in a shared space with explicit dialogue:
"Character 2 holds Character 4 and plays a soothing folk song on the guitar in Character 5's chair, saying 'The sunshine is really nice today.' Character 1 carries Character 3, walks past, places it on the table, and says 'That sounds great, can you play it again?'"
The model tracks all five identities through spatial movement, object interaction, and simultaneous dialogue. This level of multi-subject control has no equivalent in other open-weight models.
For production teams, this opens up:
- Episodic content with a consistent cast
- Scripted dialogue scenes without live performance
- Brand mascot interactions across campaign content
- Localized variants in which a different character set appears
9-Grid Image Input for R2V
Wan 2.7 supports a 3×3 grid of reference images (9 images) as structured input for character and scene consistency. This is particularly useful for storyboard-driven workflows — feed the 9-grid as a reference panel and the model interprets each panel as a directorial context signal for character appearance, angle consistency, and scene continuity.
For R2V specifically, a 9-grid populated with multi-angle character references (front, three-quarter, side, expressions) gives the model significantly more to work from than a single reference image. If consistency across complex poses and angles matters for your output, the 9-grid approach is worth the extra preparation time.
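Assembling the grid itself is straightforward. Here is a sketch using Pillow; the 512 px cell size is an illustrative choice, as Wan 2.7 does not document a required cell resolution.

```python
# Tile nine reference images into a 3x3 grid. Assumes Pillow is installed;
# the 512 px cell size is an illustrative choice, not a documented requirement.
from PIL import Image, ImageOps

def make_9grid(paths: list[str], cell: int = 512, out: str = "grid.png") -> None:
    assert len(paths) == 9, "a 9-grid needs exactly nine images"
    grid = Image.new("RGB", (cell * 3, cell * 3))
    for i, path in enumerate(paths):
        img = ImageOps.fit(Image.open(path).convert("RGB"), (cell, cell))  # center-crop to square
        grid.paste(img, ((i % 3) * cell, (i // 3) * cell))
    grid.save(out)
```

Populate the cells with front, three-quarter, and side views plus expressions, matching the angle coverage described above.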
Voice Reference Best Practices
Voice references condition the model on timbre, speaking style, and vocal character. A few guidelines:
Keep it short and clean. A 5–15 second clip of clean speech is enough. Longer clips do not improve conditioning and introduce more opportunities for background noise to interfere.
Match the emotional register. If the character will be delivering calm narration, use a calm voice reference. If the character is energetic and conversational, use a reference clip that matches that energy. The model carries the vocal character of the reference into the generated dialogue.
Single speaker only. Voice reference clips with multiple speakers confuse the conditioning. Use clips where only the target character is speaking.
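All three guidelines are easy to enforce in preprocessing. A sketch using pydub, which requires ffmpeg; the file names are placeholders and the 10-second cut follows the 5–15 second guideline above.

```python
# Trim a voice reference to a clean ten-second mono clip. Assumes pydub and
# ffmpeg are installed; file names are placeholders.
from pydub import AudioSegment

clip = AudioSegment.from_file("mira_interview.wav")
clip = clip.set_channels(1)   # single speaker, mono is enough
clip = clip[:10_000]          # keep the first 10 seconds (pydub slices in milliseconds)
clip = clip.normalize()       # even out levels before upload
clip.export("mira_voice.wav", format="wav")
```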
R2V vs Standard I2V: When to Use Each
| Situation | Use |
|---|---|
| One shot, one subject, specific opening frame | Standard I2V |
| Multi-shot series with the same character | R2V |
| Scene with dialogue and lip sync | R2V with voice reference |
| Multiple characters interacting | R2V with multiple references |
| Storyboard with established character designs | R2V with 9-grid input |
| Product or object animation | Standard I2V |
Current Limitations
R2V is the most technically ambitious feature in Wan 2.7 and accordingly the one with the most edge cases at launch:
Reference quality matters significantly. Blurry, low-res, or heavily filtered reference images produce inconsistent character output. The model is conditioning on the visual information you provide — garbage in, drift out.
Spatial complexity scales the difficulty. A two-character scene with clear spatial separation between subjects is more reliable than a five-character scene with overlapping positions and simultaneous actions. Start simple to validate the reference quality, then add complexity.
Voice reference drift. On longer generated clips, voice timbre can drift slightly from the reference toward the end of the clip. This is more noticeable in long single takes than in shorter, edited sequences.
The Bigger Picture
The community analysis at launch captured the core argument for R2V well: "This is no longer just about generating clips, but about maintaining continuity. The model holds subject identity, poses, composition, and multi-angle consistency through its reference systems. That is exactly what makes it feel more like a workflow tool than a prompt demo."
That is the right framing. R2V is not a feature for generating impressive single shots. It is infrastructure for building content at scale with consistent characters — something that has required significant manual work or live production budgets until now.
Try reference-to-video at wan27.org.