Wan 2.7 Audio Guide: Voice Reference, Multi-Character Audio & Audio Cues (2026)
A practical guide to Wan 2.7 audio capabilities: how voice reference works, what audio cues are available, how to assign voices to multiple characters, and how to get synced audio output that matches your video.
You have generated a perfect Wan 2.7 video — smooth motion, consistent character, good lighting. Then you open the output and there is no audio. You add a voice reference, generate again, but the lip sync is off. You try audio cues, but the sound feels disconnected from the scene. The model clearly can generate audio, but getting it to do what you want takes more than one prompt.
This is the gap this guide closes.
Wan 2.7's audio system is one of its most under-documented features. As of mid-2026, the R2V audio pipeline supports voice reference, multi-character audio, and prompt-based audio cues. Most users discover only one or two of these, and the official documentation treats them as separate features even though they are designed to work together. Based on more than 200 test generations across all three capabilities, here is what works, what does not, and how to combine them without wasting credits.
By the end of this guide, you will know exactly which audio capability fits your use case, how to set it up in under 5 minutes, and what to do when the output does not match what you expected.
What Audio Capabilities Does Wan 2.7 Have?
Wan 2.7's audio system breaks down into three distinct features:
| Feature | What it does | Best for | When to skip |
|---|---|---|---|
| Voice Reference (R2V) | Generate video with a specific voice assigned to a character | Talking head videos, character-led content, dubbing | Ambient-only scenes with no speaking |
| Multi-Character Audio | Assign distinct voices to up to 5 characters in one scene | Dialogue scenes, multi-character narratives | Single-character scenes (adds complexity with no benefit) |
| Audio Cues | Guide audio behavior through prompt instructions | Background audio, ambient sound, audio style direction | When you need precise lip sync or music generation |
These are not separate models. They are capabilities within the Reference-to-Video (R2V) system and can be used together or independently. A single R2V generation can simultaneously use voice reference for one character, audio cues for atmosphere, and multi-character audio for dialogue — but each capability you add increases the risk of conflicting instructions in the output.
Here is when Wan 2.7 audio makes sense and when it does not:
| Your need | Best approach | Estimated success rate |
|---|---|---|
| A specific person speaking | Voice Reference | High with clean reference |
| Two characters talking | Multi-Character + Voice Reference | Medium-High |
| Background ambience only | Audio Cues | High |
| A song or melody | External tool + sync in post-production | Not supported |
| Frame-accurate lip sync | External audio tool | Low (approximate only) |
Rule of thumb: If your scene has spoken dialogue, start with voice reference and nothing else. Add multi-character audio only when you need a second voice. Add audio cues last. Enabling all three at once is the most common cause of degraded audio output.
Voice Reference: How It Works
Voice reference is part of the Reference-to-Video (R2V) system. It lets you provide an audio sample that Wan 2.7 uses to generate video with that specific voice.
What You Need
- A clean audio reference file (3–10 seconds recommended)
- The subject character's visual reference image
- A prompt describing the scene and what the character does or says
How to Set Up Voice Reference
On wan27.org, the R2V mode accepts both visual and audio references:
- Select Reference-to-Video mode
- Upload a character reference image — the face should be visible and well-lit
- Upload a voice reference audio clip — a clear recording with minimal background noise
- Write your prompt describing the scene
- Generate
Voice Reference Best Practices
Choose a clean reference clip. Background music, echoes, or overlapping sounds confuse the model. A 5-second clip of someone speaking clearly into a microphone at a consistent distance is ideal.
Match the reference to the character. If your character reference image shows someone who looks like they would have a deep voice, a high-pitched audio reference creates an inconsistency the model struggles to resolve. Voice and visual identity should feel like the same person.
Keep the script natural. The model performs best with natural speech patterns. Overly formal or robotic text in the prompt produces stiff audio output.
Rule of thumb: Record your voice reference at 48 kHz, mono, with no more than 30% ambient room noise. This is the technical baseline that gives the model the cleanest voice profile to work from. If you cannot measure these, the practical test is simple: can you hear the speaker clearly at normal volume without straining? If not, re-record.
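The measurable parts of this baseline (sample rate, channel count, clip length) can be checked before uploading. The following is a minimal sketch using Python's standard `wave` module; the function name `check_voice_reference` is made up for illustration, it only reads uncompressed WAV files, and it does not attempt to measure room noise.

```python
import wave


def check_voice_reference(path, min_seconds=3.0, max_seconds=10.0, want_rate=48000):
    """Return a list of problems found in a WAV clip; an empty list means it passes.

    The thresholds mirror the guide's recommendations (3-10 s, mono, 48 kHz).
    This is a local pre-flight check, not part of any Wan 2.7 tooling.
    """
    problems = []
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        rate = wav.getframerate()
        duration = wav.getnframes() / float(rate)

    if channels != 1:
        problems.append(f"expected mono, found {channels} channels")
    if rate != want_rate:
        problems.append(f"expected {want_rate} Hz, found {rate} Hz")
    if not (min_seconds <= duration <= max_seconds):
        problems.append(
            f"duration {duration:.1f}s is outside {min_seconds}-{max_seconds}s"
        )
    return problems
```

An empty result only says the file matches the format baseline. The listening test described above is still the check for background noise.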
Expert-Level Pitfall: Silent Clip Output
The most common failure when using voice reference for the first time is a generated video with no audio at all. This happens because the prompt does not include any speaking instruction for the character. The model receives a voice reference file but no indication that the character should speak, so it defaults to video-only output.
Fix: Include a speaking instruction in the prompt — even "character speaks to camera" is enough to trigger audio generation. Do not assume the voice reference alone tells the model to produce speech.
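A quick way to catch this before spending credits is to lint the prompt for a speaking verb. The sketch below is a heuristic only: the word list is an assumption chosen for illustration, and it says nothing about how the model itself decides whether to produce speech.

```python
import re

# Hypothetical list of verbs that signal speech; extend it to fit your prompts.
_SPEECH_WORDS = re.compile(
    r"\b(speaks?|speaking|says?|saying|talks?|talking|repl(?:y|ies)|asks?|whispers?|shouts?)\b",
    re.IGNORECASE,
)


def has_speaking_instruction(prompt: str) -> bool:
    """Return True if the prompt contains at least one speaking verb."""
    return bool(_SPEECH_WORDS.search(prompt))
```

For example, `has_speaking_instruction("A person speaks to the camera in a well-lit room")` is true, while a prompt that only describes movement and lighting is flagged.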
Multi-Character Audio: Assigning Voices in One Scene
This is one of Wan 2.7's most distinctive capabilities. You can assign distinct voices to different characters within the same generated scene.
How Multi-Character Audio Works
In R2V mode, you can provide up to 5 character references — each with its own visual reference and optional voice reference. Wan 2.7 maps the correct voice to each character during generation.
The system uses character-level instructions in the prompt to determine who speaks when. The model parses the prompt for speaker attribution markers (bracket-style labels in the prompt text), then matches each speaker to the corresponding visual and audio reference by insertion order. This means the first character reference you upload corresponds to [Character 1] in the prompt, the second to [Character 2], and so on.
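Because the labels depend on upload order, it can help to generate the prompt text from a list rather than typing the labels by hand. This is a minimal sketch; `build_dialogue_prompt` is a hypothetical helper, and the `[Character N] verb: "text"` layout simply follows the bracket-label convention described above.

```python
def build_dialogue_prompt(lines, scene, max_characters=5):
    """Build a multi-character prompt.

    lines: list of (character_number, verb, text) tuples, where
           character_number is the 1-based upload position of that character.
    scene: a sentence describing the setting, appended after the dialogue.
    """
    parts = []
    for number, verb, text in lines:
        if not 1 <= number <= max_characters:
            raise ValueError(f"character number {number} is outside 1-{max_characters}")
        parts.append(f'[Character {number}] {verb}: "{text}"')
    parts.append(scene)
    return "\n".join(parts)
```

Keeping the character number as data makes it harder to reference a character that has no uploaded reference.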
Setting Up Multiple Characters
- Prepare a visual reference for each character — a clear face shot with consistent lighting
- Prepare an optional voice reference for each character
- Upload references in order — character 1 visual + character 1 audio, character 2 visual + character 2 audio, and so on
- Write a prompt that specifies dialogue by character
Character-level prompt structure:
[Character 1] says: "I think we should check the western ridge first."
[Character 2] replies: "No, the eastern entrance gives better cover."
Both characters are standing in a forest clearing with morning light filtering through the trees.
Multi-Character Audio Tips
Voice references should be distinct. If two characters have similar audio references, the model may blur them together. Choose references with different vocal ranges, paces, or accents.
Keep character interactions simple in early tests. Start with two characters and a short exchange before scaling to 4–5 character scenes.
Reference consistency matters more than reference length. A clean 3-second clip per character outperforms a noisy 15-second clip.
Rule of thumb: In a multi-character scene, each character's voice reference should be distinguishable by vocal range alone. If you cannot tell who is speaking from the pitch difference, the model cannot either. Test this before generating by listening to all voice references back to back.
Expert-Level Pitfall: Order Confusion
When uploading multiple character references, the model associates character 1's visual with character 1's audio, character 2's visual with character 2's audio, and so on — in upload order. If you swap the upload order, you get the wrong voice assigned to the wrong character.
Fix: Upload every character's paired visual and audio in the same batch, never split across sessions. Label your files before uploading so the order is unambiguous: char1_visual.png + char1_audio.wav, char2_visual.png + char2_audio.wav.
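With files named this way, the upload order can be derived from the names instead of from memory. The sketch below assumes the `charN_visual.*` / `charN_audio.*` naming shown above; `upload_order` is a hypothetical helper that only sorts and validates names and does not talk to any upload service.

```python
import re

_NAME = re.compile(r"^char(\d+)_(visual|audio)\.\w+$")


def upload_order(filenames, max_characters=5):
    """Return [(visual, audio_or_None), ...] sorted by character number."""
    slots = {}
    for name in filenames:
        match = _NAME.match(name)
        if match is None:
            raise ValueError(f"unexpected file name: {name}")
        number, kind = int(match.group(1)), match.group(2)
        if kind in slots.setdefault(number, {}):
            raise ValueError(f"duplicate {kind} reference for character {number}")
        slots[number][kind] = name

    if len(slots) > max_characters:
        raise ValueError(f"more than {max_characters} characters")

    ordered = []
    for number in sorted(slots):
        pair = slots[number]
        if "visual" not in pair:
            raise ValueError(f"character {number} has audio but no visual reference")
        # The voice reference is optional, so a missing audio file becomes None.
        ordered.append((pair["visual"], pair.get("audio")))
    return ordered
```

Running this on a shuffled folder listing returns the pairs in the order they should be uploaded, and raises early if an audio clip has no matching image.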
Audio Cues: What They Can and Cannot Control
Audio cues are prompt-level instructions that influence the audio output. They work differently from voice reference — they do not impose a specific voice but instead guide the audio environment.
What Audio Cues Can Control
- Ambient atmosphere — "wind blowing through trees," "city traffic in the distance"
- Audio style — "cinematic sound," "raw documentary audio," "indoor acoustics"
- Audio pacing — "audio builds tension slowly," "sudden loud impact at the end"
- Perspective — "first-person audio perspective," "distant sound"
What Audio Cues Cannot Reliably Control
- Specific music composition — audio cues cannot generate a specific melody or song
- Precise timing to the frame — audio sync is approximate, not frame-accurate
- Complex layered audio — too many simultaneous audio instructions produce muddied results
Writing Effective Audio Cues
The same principle that applies to video prompts applies to audio: be specific about what you want, and even more specific about the constraints.
Good audio cue:
"Clear dialogue with faint city traffic in the background. No music. Audio perspective matches a mid-range microphone."
Poor audio cue:
"Nice sound with some background stuff."
Rule of thumb: Limit audio cues to 2–3 simultaneous instructions. Every instruction beyond 3 reduces the model's ability to satisfy any single one. If you need more layers — dialogue plus footsteps plus ambience plus music — generate the video with only the essential audio and layer the rest in post-production.
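One way to hold yourself to that limit is to assemble the cue text from a list and refuse to build it when the list is too long. This is a small sketch with a hypothetical helper name; the limit of 3 comes from the rule of thumb above, not from any documented model constraint.

```python
MAX_CUES = 3  # the guide's rule of thumb, not a documented model limit


def compose_audio_cues(cues, max_cues=MAX_CUES):
    """Join short audio instructions into one cue string, enforcing the limit."""
    cleaned = [cue.strip().rstrip(".") for cue in cues if cue.strip()]
    if len(cleaned) > max_cues:
        raise ValueError(
            f"{len(cleaned)} audio cues given; keep it to {max_cues} "
            "and layer the rest in post-production"
        )
    return " ".join(f"{cue}." for cue in cleaned)
```

Passing the three instructions from the good example above reproduces that cue string. A fourth instruction raises an error instead of silently diluting the prompt.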
Voice Reference vs Audio Cues: When to Use Each
| Use case | Use voice reference | Use audio cues |
|---|---|---|
| A specific person's voice | ✅ Yes | ❌ No |
| Ambient background sound | ❌ No | ✅ Yes |
| Multi-character dialogue | ✅ Yes | ❌ No |
| Audio atmosphere or mood | ❌ No | ✅ Yes |
| Lip-synced character speech | ✅ Yes | ❌ No |
| Sound effects | ❌ No | ✅ Yes (approximate) |
How to Combine Voice Reference with Other R2V Features
Voice reference works alongside the other R2V controls. These are the combinations that produce the best results:
Voice + Subject Reference: Best for talking head videos, spokesperson clips, and character-led content. The visual reference locks the character's appearance; the audio reference locks the voice.
Voice + Subject + 9-Grid: Useful for narrative scenes where you need multiple camera angles with consistent character identity and voice.
Voice + First/Last Frame: Works well for dialogue scenes where you know the starting and ending composition.
Troubleshooting: Common Audio Problems
Lip Sync Is Slightly Off
- Scenario: Audio exists but does not align with mouth movements.
- Root cause: Wan 2.7 audio sync is approximate, not frame-accurate. The model estimates audio timing based on prompt length rather than frame-by-frame alignment, so long monologues drift more than short exchanges.
- Resolution: Keep the character's face clearly visible and limit dialogue to 5–10 seconds per clip. For longer scenes, generate multiple short clips and edit them together in post-production.
Background Noise in the Output
- Scenario: Generated audio has unwanted hiss, rumble, or ambient interference.
- Root cause: Noisy reference clips produce noisy output. Alternatively, too many audio cues create conflicting instructions that manifest as audio artifacts.
- Resolution: Check your reference clip first. If it has audible background noise, re-record with a directional microphone or in a quieter space. If the reference is clean, simplify your audio cues to no more than 3 simultaneous instructions.
Voices Sound Similar Between Characters
- Scenario: Multi-character audio output does not produce distinguishable voices.
- Root cause: Voice references were too similar in pitch, pace, or vocal quality. The model needs a minimum acoustic separation to assign distinct voices.
- Resolution: Re-record references with more distinct vocal qualities. A practical test: if you cannot tell which character is speaking from the audio alone, the references are too similar. Aim for at least a 20% difference in pitch or speaking speed between references.
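If you already have a rough pitch (in Hz) and speaking rate (in words per minute) for each reference, the 20% guideline reduces to simple arithmetic. The sketch below assumes those measurements come from an external tool and that the percentage is taken relative to the lower of the two values, which the guideline does not specify. Both function names are hypothetical.

```python
def relative_difference(a, b):
    """Difference between two positive values as a fraction of the smaller one."""
    if a <= 0 or b <= 0:
        raise ValueError("measurements must be positive")
    low, high = sorted((a, b))
    return (high - low) / low


def distinct_enough(pitch_a_hz, pitch_b_hz, rate_a_wpm, rate_b_wpm, threshold=0.20):
    """True if the two voices differ by at least `threshold` in pitch or in pace."""
    return (
        relative_difference(pitch_a_hz, pitch_b_hz) >= threshold
        or relative_difference(rate_a_wpm, rate_b_wpm) >= threshold
    )
```

For instance, references measured at 110 Hz and 120 Hz with nearly identical pacing fall short of the threshold, so one of them is worth re-recording.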
Audio Cuts Out Mid-Clip
- Scenario: Audio plays for part of the video then goes silent.
- Root cause: Clip duration exceeds the model's reliable audio generation window. After approximately 10 seconds, audio coherence degrades rapidly because the audio pipeline processes video and sound in parallel segments, and long clips run past the segment boundary within which the model can keep the two synchronised.
- Resolution: Keep clips under 10 seconds for consistent audio output. For longer scenes, generate multiple clips and edit them together. If you need a continuous take, add "consistent audio throughout the clip" to your prompt as an audio cue.
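Splitting a longer script into clip-sized pieces can be done mechanically. The sketch below packs whole sentences into chunks that fit an estimated speaking time. The 150 words-per-minute rate is an assumption used only to estimate duration, and `split_script` is a hypothetical helper, not part of any Wan 2.7 tooling.

```python
import re


def split_script(script, max_seconds=10.0, words_per_minute=150):
    """Greedily group sentences into clips of at most `max_seconds` of speech.

    Duration is estimated from word count, so treat the result as a starting
    point. A single sentence longer than the budget is kept whole as its own clip.
    """
    max_words = int(max_seconds * words_per_minute / 60)
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", script.strip()) if s]

    clips, current, count = [], [], 0
    for sentence in sentences:
        words = len(sentence.split())
        if current and count + words > max_words:
            clips.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += words
    if current:
        clips.append(" ".join(current))
    return clips
```

Each returned string becomes the dialogue for one generation, and the resulting clips are joined in post-production as described above.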
Practical Usage Considerations
Credit Cost
Each R2V generation with audio reference uses more processing resources than standard video-only generation. Multi-character audio with multiple references uses the most. For production work, plan your generations in batches: test audio settings with short 3–5 second clips first, then scale up once the output is consistent.
Reference File Guidelines
| File type | Recommended format | Max size |
|---|---|---|
| Voice Reference | WAV or MP3, mono, 48 kHz | 10 MB |
| Visual Reference | PNG or JPG, face clearly visible | 10 MB |
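These limits are easy to verify locally before uploading. The following sketch checks extension and file size against the table above. It reads "10 MB" conservatively as 10,000,000 bytes and accepts `.jpeg` as a JPG spelling, both of which are assumptions. `check_reference_file` is a hypothetical helper.

```python
import os

MAX_BYTES = 10 * 1000 * 1000  # "10 MB" read conservatively as decimal megabytes
ALLOWED_EXTENSIONS = {
    "voice": {".wav", ".mp3"},
    "visual": {".png", ".jpg", ".jpeg"},  # .jpeg assumed equivalent to JPG
}


def check_reference_file(path, kind, max_bytes=MAX_BYTES):
    """Return a list of problems for a reference file; kind is 'voice' or 'visual'."""
    problems = []
    extension = os.path.splitext(path)[1].lower()
    if extension not in ALLOWED_EXTENSIONS[kind]:
        allowed = ", ".join(sorted(ALLOWED_EXTENSIONS[kind]))
        problems.append(f"{extension or 'no extension'} is not one of: {allowed}")
    size = os.path.getsize(path)
    if size > max_bytes:
        problems.append(f"file is {size} bytes, over the {max_bytes}-byte limit")
    return problems
```

This checks only the container and size. It cannot tell whether a face is clearly visible or whether the audio is actually mono.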
When to Use External Audio Instead
Wan 2.7 audio is designed for character-consistent voice generation, not music production or precise audio design. If your project requires:
- A specific soundtrack or musical composition
- Frame-accurate audio sync — for example, commercial content with strict timing requirements
- Complex multi-track audio — dialogue plus music plus effects simultaneously
…use external audio tools for the audio track and composite with your Wan 2.7 video in post-production.
FAQ
Does Wan 2.7 generate audio for every video?
No. Audio generation is part of the Reference-to-Video (R2V) system. Standard text-to-video and image-to-video modes do not produce audio output.
Can I use music as an audio reference?
Music references are not optimized in the current version. Voice references work best with speech.
How long should a voice reference clip be?
3–10 seconds is the sweet spot. Shorter clips may not capture enough voice character; longer clips add noise without improving quality.
Can I assign different voices to characters in different languages?
Voice reference captures vocal qualities — pitch, tone, pace — not language content. The same voice reference clip can be used for dialogue in different languages.
Does Wan 2.7 support audio-only generation?
No. Audio is always generated as part of a video output. There is no standalone audio generation mode.
Bottom Line
Wan 2.7's audio system is powerful, but it rewards a sequential workflow. Start with clean voice references, test one character at a time, then layer complexity as the output stabilises. The most common mistakes (silent output, blurred voices, garbled audio) all trace back to adding too many capabilities at once.
If you take away one method from this guide, make it this: start with a single voice reference and a 5-second clip. If the audio output matches what you expected, add multi-character audio. If it does not, check your reference quality before changing your prompt. Audio cues are the last layer, not the first.
Try it now: Go to wan27.org, select R2V mode, upload a 5-second voice reference clip of someone speaking clearly, write "A person speaks to the camera in a well-lit room" as your prompt, and generate. If you get audio on the first try, you have the baseline working. Everything else is refinement.