Wan 2.7 Reference-to-Video (R2V): Character Consistency Across Every Shot
How to use Wan 2.7 reference-to-video (R2V) for consistent characters across multi-shot video. Covers subject references, voice references, multi-character scenes, and how to get clean results.

Character drift is the most frustrating problem in AI video. Generate the same character across five shots and you get five versions of the same character — slightly different face, slightly different proportions, slightly different coloring. Each shot works individually. None of them match.
Wan 2.7 Reference-to-Video (R2V) is the direct solution. Supply reference images or video for a character, assign a voice reference if needed, and the model maintains visual identity and voice across generated shots.
At launch, Wan 2.7's official team put it directly: "We've built a director's suite — character customization with up to 5 reference inputs and voice profiles."

What Reference-to-Video (R2V) Does
R2V takes one or more reference inputs — images, video clips, or audio — and uses them as conditioning signals during generation. The model generates new footage in which the referenced subjects keep a consistent identity rather than drifting into loose approximations of the reference.
Single-subject R2V: supply one character reference, generate a clip featuring that character in any prompt-specified scenario. The face, body proportions, clothing style (if specified), and overall appearance stay locked to the reference.
Multi-subject R2V: supply up to 5 character references simultaneously. Each character gets its own visual identity in the generated scene, and they can interact with each other within the same shot.
Voice references: assign an audio reference to any subject, and the model maintains that voice timbre when the character speaks — even as the dialogue changes.
This is the most ambitious character consistency system in any open-weight video model released to date.
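The hosted tool at wan27.org drives all of this through a UI, but the shape of a request is easy to picture. The sketch below is purely illustrative: Wan 2.7 publishes no API schema for this, and every field name here is an assumption, but it captures the three reference types and the five-subject ceiling.

```python
# Hypothetical request shape for illustration only; Wan 2.7 exposes no
# public API schema here, and every field name is an assumption.
from dataclasses import dataclass, field

@dataclass
class CharacterReference:
    name: str                                        # label used in the scene prompt
    images: list[str] = field(default_factory=list)  # paths to reference photos
    video: str | None = None                         # optional short motion clip
    voice: str | None = None                         # optional clean speech sample

@dataclass
class R2VRequest:
    prompt: str                           # scene description using the names above
    characters: list[CharacterReference]  # up to 5 active character references

    def __post_init__(self) -> None:
        if len(self.characters) > 5:
            raise ValueError("Wan 2.7 R2V supports at most 5 character references")
```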
How R2V Works in Practice
Go to wan27.org and open the Wan 2.7 reference-to-video tool.

Step 1: Prepare your reference material
For each character you want to maintain:
- Image references: clean, well-lit photos showing the character's face and general appearance. Multiple angles improve consistency. Avoid heavily filtered or stylized references; the model works better from neutral, natural ones. A quick automated quality check is sketched after this list.
- Video references: short clips of the character work well for capturing motion style, expression range, and voice in a single input.
- Voice references: a short audio clip of the character speaking. Clean audio without background noise produces better voice matching.
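Reference quality is checkable before you generate anything. Below is a minimal pre-flight sketch assuming the opencv-python package is installed; the resolution and sharpness thresholds are illustrative starting points, not values published for Wan 2.7.

```python
# Minimal pre-flight check for image references. Assumes opencv-python is
# installed; the thresholds are illustrative, not published Wan 2.7 values.
import cv2

def check_reference(path: str, min_side: int = 512, min_sharpness: float = 100.0) -> list[str]:
    issues = []
    img = cv2.imread(path)
    if img is None:
        return [f"{path}: file could not be read"]
    h, w = img.shape[:2]
    if min(h, w) < min_side:
        issues.append(f"{path}: {w}x{h} is low-resolution for a face reference")
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    sharpness = cv2.Laplacian(gray, cv2.CV_64F).var()  # variance of Laplacian; low means blurry
    if sharpness < min_sharpness:
        issues.append(f"{path}: sharpness score {sharpness:.0f} suggests blur")
    return issues
```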
Step 2: Assign references to characters
In the R2V interface, assign each reference to a named character slot. You can have up to 5 active character references in a single generation.
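Continuing the hypothetical sketch from earlier, slot assignment amounts to pairing each reference bundle with the label you will use in the prompt. All names and file paths below are placeholders.

```python
# Placeholder names and file paths; continues the hypothetical sketch above.
cast = [
    CharacterReference(name="Mira", images=["mira_front.jpg", "mira_side.jpg"],
                       voice="mira_voice.wav"),
    CharacterReference(name="Tomas", images=["tomas_front.jpg"],
                       video="tomas_walk.mp4"),
]
request = R2VRequest(prompt="...", characters=cast)
```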
Step 3: Write your scene prompt
Reference characters in your prompt by their assigned names or labels. Describe the scene, their positions, interactions, and what they say if relevant. The more explicit the spatial and behavioral direction, the more control you have over how the characters appear relative to each other.
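For instance, a scene prompt for the two placeholder characters above might read:

```text
Mira sits at a cafe table by the window, left of frame. Tomas enters from
the right, places a cup on her table, and says "You're early." Mira looks
up, smiles, and replies "The train was on time for once."
```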
Step 4: Generate and evaluate consistency
Check the output for identity drift — does the generated character match the reference? For faces, the hairline, eye shape, and jawline are the areas to evaluate. Minor variations in lighting are expected and normal. Structural facial inconsistency indicates the reference was ambiguous or the prompt conflicted with the reference.
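If you want more than an eyeball comparison, face embeddings give a rough drift signal. This sketch assumes the face_recognition and opencv-python packages; the 0.6 distance cutoff is that library's conventional match threshold, not a number published for Wan 2.7.

```python
# Rough identity-drift check. File names are placeholders; the 0.6 cutoff
# is face_recognition's conventional threshold, not a Wan 2.7 value.
import cv2
import face_recognition

ref = face_recognition.load_image_file("mira_front.jpg")
ref_enc = face_recognition.face_encodings(ref)[0]   # assumes one face in the reference

cap = cv2.VideoCapture("generated_clip.mp4")
frame_idx, distances = 0, []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if frame_idx % 24 == 0:  # sample roughly one frame per second at 24 fps
        rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        for enc in face_recognition.face_encodings(rgb):
            distances.append(face_recognition.face_distance([ref_enc], enc)[0])
    frame_idx += 1
cap.release()

if distances and max(distances) > 0.6:
    print("possible identity drift: some sampled faces sit far from the reference")
```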
Multi-Character Scenes
Support for up to five simultaneous characters is where R2V becomes genuinely novel. The official demo from Alibaba shows a generated scene with five characters, each with locked visual and vocal identity, interacting in a shared space with explicit dialogue:
"Character 2 holds Character 4 and plays a soothing folk song on the guitar in Character 5's chair, saying 'The sunshine is really nice today.' Character 1 carries Character 3, walks past, places it on the table, and says 'That sounds great, can you play it again?'"
The model tracks all five identities through spatial movement, object interaction, and simultaneous dialogue. This level of multi-subject control has no equivalent in other open-weight models.
For production teams, this opens up:
- Episodic content with a consistent cast
- Scripted dialogue scenes without live performance
- Brand mascot interactions across campaign content
- Localized variants in which a different character set appears
9-Grid Image Input for R2V
Wan 2.7 supports a 3×3 grid of reference images (9 images) as structured input for character and scene consistency. This is particularly useful for storyboard-driven workflows — feed the 9-grid as a reference panel and the model interprets each panel as a directorial context signal for character appearance, angle consistency, and scene continuity.
For R2V specifically, a 9-grid populated with multi-angle character references (front, three-quarter, side, expressions) gives the model significantly more to work from than a single reference image. If consistency across complex poses and angles matters for your output, the 9-grid approach is worth the extra preparation time.
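Assembling the grid itself is straightforward. Here is a sketch using Pillow; the 512 px cell size is an illustrative choice, as Wan 2.7 does not document a required cell resolution.

```python
# Tile nine reference images into a 3x3 grid. Assumes Pillow is installed;
# the 512 px cell size is an illustrative choice, not a documented requirement.
from PIL import Image, ImageOps

def make_9grid(paths: list[str], cell: int = 512, out: str = "grid.png") -> None:
    assert len(paths) == 9, "a 9-grid needs exactly nine images"
    grid = Image.new("RGB", (cell * 3, cell * 3))
    for i, path in enumerate(paths):
        img = ImageOps.fit(Image.open(path).convert("RGB"), (cell, cell))  # center-crop to square
        grid.paste(img, ((i % 3) * cell, (i // 3) * cell))
    grid.save(out)
```

Populate the cells with front, three-quarter, and side views plus expressions, matching the angle coverage described above.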
Voice Reference Best Practices
Voice references condition the model on timbre, speaking style, and vocal character. A few guidelines:
Keep it short and clean. A 5–15 second clip of clean speech is enough. Longer clips do not improve conditioning and introduce more opportunities for background noise to interfere.
Match the emotional register. If the character will be delivering calm narration, use a calm voice reference. If the character is energetic and conversational, use a reference clip that matches that energy. The model carries the vocal character of the reference into the generated dialogue.
Single speaker only. Voice reference clips with multiple speakers confuse the conditioning. Use clips where only the target character is speaking.
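All three guidelines are easy to enforce in preprocessing. A sketch using pydub, which requires ffmpeg; the file names are placeholders and the 10-second cut follows the 5–15 second guideline above.

```python
# Trim a voice reference to a clean ten-second mono clip. Assumes pydub and
# ffmpeg are installed; file names are placeholders.
from pydub import AudioSegment

clip = AudioSegment.from_file("mira_interview.wav")
clip = clip.set_channels(1)   # single speaker, mono is enough
clip = clip[:10_000]          # keep the first 10 seconds (pydub slices in milliseconds)
clip = clip.normalize()       # even out levels before upload
clip.export("mira_voice.wav", format="wav")
```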
R2V vs Standard I2V: When to Use Each
| Situation | Use |
|---|---|
| One shot, one subject, specific opening frame | Standard I2V |
| Multi-shot series with the same character | R2V |
| Scene with dialogue and lip sync | R2V with voice reference |
| Multiple characters interacting | R2V with multiple references |
| Storyboard with established character designs | R2V with 9-grid input |
| Product or object animation | Standard I2V |
Current Limitations
R2V is the most technically ambitious feature in Wan 2.7 and accordingly the one with the most edge cases at launch:
Reference quality matters significantly. Blurry, low-res, or heavily filtered reference images produce inconsistent character output. The model is conditioning on the visual information you provide — garbage in, drift out.
Spatial complexity scales the difficulty. A two-character scene with clear spatial separation between subjects is more reliable than a five-character scene with overlapping positions and simultaneous actions. Start simple to validate the reference quality, then add complexity.
Voice reference drift. On longer generated clips, voice timbre can drift slightly from the reference toward the end of the clip. This is more noticeable in long single takes than in shorter, edited sequences.
The Bigger Picture
The community analysis at launch captured the core argument for R2V well: "This is no longer just about generating clips, but about maintaining continuity. The model holds subject identity, poses, composition, and multi-angle consistency through its reference systems. That is exactly what makes it feel more like a workflow tool than a prompt demo."
That is the right framing. R2V is not a feature for generating impressive single shots. It is infrastructure for building content at scale with consistent characters — something that has required significant manual work or live production budgets until now.
Try reference-to-video at wan27.org.