Wan 2.7 Model Architecture — DiT, MoE, Spatio-Temporal Attention, and Flow Matching Explained
A technical deep dive into the Wan 2.7 model architecture: Diffusion Transformer backbone, MoE with 27B parameters (14B active), full spatio-temporal attention, flow matching training, T5 encoder, and VAE latent space.
You are reading a paper on Wan 2.7, and you hit the architecture section.
"We adopt a causal 3D VAE, a DiT backbone with MoE, full spatio-temporal attention, and a flow matching training objective."
That sentence contains five distinct technical decisions. Each one is a bet the team made — a tradeoff that could have gone differently. Understanding why each was chosen tells you more about Wan 2.7 than any benchmark table.
This article breaks down every major component of the Wan 2.7 architecture: what it does, why it was chosen, how it compares to alternatives, and why it matters for the output you actually see. This analysis draws from the Wan 2.7 technical report, cross-referenced with hands-on testing and direct comparisons against Sora, Stable Diffusion 3, and other DiT-based video generators.
If you want the full product picture first, start with the Wan 2.7 Complete Guide.
Architecture Overview
Wan 2.7 is not a single model. It is a family of models sharing the same architectural backbone, adapted for different tasks.
| Model | Task | Backbone | Parameters |
|---|---|---|---|
| T2V-14B | Text-to-Video | 3D DiT + MoE | 27B total, 14B active |
| I2V-14B | Image-to-Video | 3D DiT + MoE | 27B total, 14B active |
| T2I-14B | Text-to-Image | 2D DiT + MoE | 27B total, 14B active |
| T2V-1.3B | Text-to-Video (lightweight) | 3D DiT | 1.3B dense |
The 14B models all share the same architecture with task-specific adaptations. The 1.3B model is a dense (non-MoE) variant for resource-constrained environments.
The pipeline works in four stages:
- VAE Encoder compresses raw video or image frames into a latent space
- T5 Encoder processes text prompts into embedding vectors
- DiT Backbone denoises the latent representation conditioned on text embeddings
- VAE Decoder reconstructs the denoised latent back into pixel space
Each of these components has specific design choices worth examining.
Diffusion Transformer — Why Wan 2.7 Replaced U-Net
Many earlier diffusion-based image and video generators used a U-Net backbone: an encoder-decoder architecture with skip connections that was the default for diffusion models through the Stable Diffusion 1.x and 2.x era.
Wan 2.7 replaces U-Net with a Diffusion Transformer (DiT). This is the same architectural shift seen in Sora, Stable Diffusion 3, and other next-generation models.
Here is the difference:
- U-Net processes data at multiple resolutions through convolutional layers. It is efficient because it operates locally — each convolution only sees a small window of pixels. But that locality is also its limitation: long-range dependencies (a subject appearing at frame 1 and reappearing at frame 60) must be propagated through multiple downsampling and upsampling stages.
- DiT processes data as a sequence of patches using self-attention, the same mechanism behind LLMs. Every patch can attend to every other patch, regardless of spatial or temporal distance. This makes long-range coherence inherently easier.
Why this matters: U-Net struggles with temporal consistency over long videos because frame-to-frame relationships must be encoded through the bottleneck. DiT sees all frames simultaneously through attention, making temporal coherence a native capability rather than an emergent one.
Wan 2.7's DiT backbone operates on 3D patches — the input video is divided into spatio-temporal cubes rather than just spatial patches. Each patch is flattened into a token, projected, and processed by transformer blocks. This 3D patchification is what distinguishes the video DiT from the image DiT variant.
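The 3D patchification step can be illustrated on a toy tensor. This is a minimal sketch, not Wan 2.7's code: the function name `patchify_3d`, the 2 × 2 × 2 patch size, and the single-channel toy video are all assumptions made for illustration, and the learned linear projection that follows patchification is omitted.

```python
def patchify_3d(video, pt, ph, pw):
    """Split video[t][y][x] into non-overlapping (pt, ph, pw) cubes,
    flattening each cube into one token (a plain list of values)."""
    T, H, W = len(video), len(video[0]), len(video[0][0])
    if T % pt or H % ph or W % pw:
        raise ValueError("video dimensions must be divisible by the patch size")
    tokens = []
    for t0 in range(0, T, pt):
        for y0 in range(0, H, ph):
            for x0 in range(0, W, pw):
                cube = [video[t][y][x]
                        for t in range(t0, t0 + pt)
                        for y in range(y0, y0 + ph)
                        for x in range(x0, x0 + pw)]
                tokens.append(cube)
    return tokens

# Toy single-channel "video": 4 frames of 4 x 6 values; each value encodes its own position.
video = [[[t * 100 + y * 10 + x for x in range(6)] for y in range(4)] for t in range(4)]

tokens = patchify_3d(video, pt=2, ph=2, pw=2)
print(len(tokens), len(tokens[0]))   # 12 tokens, 8 values each
print(tokens[0])                     # values from frames 0-1, rows 0-1, columns 0-1
```

Setting `pt=1` on a single frame reduces this to ordinary 2D image patchification, which is the sense in which the image variant is a special case of the video one.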
Mixture-of-Experts: 27B Total, 14B Active
The 14B in the model name refers to active parameters per forward pass, not total parameters.
Wan 2.7 uses a Mixture-of-Experts (MoE) architecture. Here is what that means in concrete terms:
The model has 27 billion total parameters, but only about 14 billion take part in any single forward pass. The remaining 13 billion sit in experts that the router did not select for the current token, and which experts are skipped changes from token to token.
How MoE Routing Works
Each transformer block contains multiple feed-forward network experts. A lightweight router network looks at each input token and picks the top-2 experts to process it. The outputs of the chosen experts are combined with weights determined by the router.
| Concept | What It Means |
|---|---|
| Total parameters | 27B — storage and memory footprint |
| Active parameters | 14B — compute cost per forward pass |
| Number of experts | Not publicly specified (estimated 8-16 based on typical MoE ratios) |
| Top-k routing | Each token activates 2 experts |
| Expert specialization | Different experts learn different patterns (motion, texture, lighting, etc.) |
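Top-2 routing as described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the Wan 2.7 implementation: the four scaling "experts", the hand-written router logits, and the choice to renormalize the two selected probabilities are all hypothetical. In a real model the logits come from a learned linear layer and each expert is a feed-forward network.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def top2_route(router_logits):
    """Return [(expert_index, weight), ...] for the two highest-scoring experts.
    Weights are router probabilities renormalized over the chosen pair."""
    probs = softmax(router_logits)
    top2 = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:2]
    norm = sum(probs[i] for i in top2)
    return [(i, probs[i] / norm) for i in top2]

def moe_forward(token, router_logits, experts):
    """Weighted sum of the two selected experts' outputs for one token."""
    out = [0.0] * len(token)
    for idx, weight in top2_route(router_logits):
        expert_out = experts[idx](token)
        out = [o + weight * e for o, e in zip(out, expert_out)]
    return out

# Toy experts: each one just scales the token (stand-ins for feed-forward networks).
experts = [lambda v, s=s: [s * x for x in v] for s in (1.0, 2.0, 3.0, 4.0)]
token = [1.0, -1.0]
router_logits = [0.1, 2.0, 0.3, 2.0]   # hypothetical; experts 1 and 3 score highest

routes = top2_route(router_logits)
output = moe_forward(token, router_logits, experts)
print(routes)
print(output)
```

Only two of the four experts run for this token, which is where the gap between total and active parameters comes from.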
Why this matters for users: MoE gives you the capacity of a 27B model at roughly the compute cost of a 14B model. The extra parameters are not wasted: they let the model learn more granular patterns without making each forward pass slower. Memory is a separate matter, because all 27B parameters still have to be stored, so fitting the model on a consumer GPU depends on reduced precision or offloading rather than on MoE alone.
For local deployment considerations, see the Wan 2.7 Open Source Guide for hardware requirements.
Load Balancing in Practice
A known failure mode of MoE is router collapse — where the router learns to always send tokens to the same few experts, defeating the purpose of having experts at all. Wan 2.7 addresses this through a load-balancing loss that penalizes uneven expert utilization during training. The exact penalty weight is not published in the technical report, but it is standard practice across production MoE systems.
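One widely used form of load-balancing loss is the auxiliary loss from the Switch Transformer line of work. The sketch below assumes that form purely for illustration; the technical report does not say which formulation Wan 2.7 uses, and the function name and toy inputs are hypothetical.

```python
def load_balance_loss(router_probs, assignments, num_experts):
    """Switch-Transformer-style auxiliary loss: num_experts * sum_i f_i * P_i.
    f_i = share of routed slots that went to expert i
    P_i = mean router probability given to expert i
    The value is 1.0 for perfectly uniform routing and grows as routing collapses."""
    total_slots = sum(len(chosen) for chosen in assignments)
    f = [0.0] * num_experts
    for chosen in assignments:
        for i in chosen:
            f[i] += 1.0 / total_slots
    P = [sum(p[i] for p in router_probs) / len(router_probs) for i in range(num_experts)]
    return num_experts * sum(fi * pi for fi, pi in zip(f, P))

# Four tokens, four experts, top-2 routing.
uniform_probs = [[0.25, 0.25, 0.25, 0.25]] * 4
uniform_assign = [(0, 1), (2, 3), (0, 1), (2, 3)]
balanced = load_balance_loss(uniform_probs, uniform_assign, 4)

# Router collapse: every token goes to experts 0 and 1.
collapsed_probs = [[0.5, 0.5, 0.0, 0.0]] * 4
collapsed_assign = [(0, 1)] * 4
collapsed = load_balance_loss(collapsed_probs, collapsed_assign, 4)

print(balanced, collapsed)
```

During training this term would be multiplied by a small coefficient and added to the main objective, so the router is nudged toward even utilization without overriding the primary loss.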
Full Spatio-Temporal Attention
This is arguably the most important architectural decision in Wan 2.7 for video quality.
Spatial attention handles relationships within a single frame: which pixels belong to the same object, how edges connect, what textures are present.
Temporal attention handles relationships across frames: how objects move, how lighting changes, whether a subject that disappears from frame reappears naturally.
Wan 2.7 applies full spatio-temporal attention, meaning every patch in the video attends to every other patch across both space and time, within the same attention operation.
Why This Is Hard
Full attention has a computational cost that scales quadratically with sequence length. As a pixel-space illustration (the DiT actually runs on the smaller VAE latent described later), a 720p video at 16 frames gives roughly:
- 1280 × 720 pixels → 3,600 patches per frame (80 × 45 at 16×16 patch size)
- 16 frames → 57,600 total patches
- Full attention → ~3.3 billion pairwise interactions
Wan 2.7 absorbs this cost because the alternative — factorized attention (spatial and temporal attention applied separately) — struggles to capture space-time interactions that are not separable. A moving object, for example, changes both its spatial position and its appearance across frames simultaneously. Factorized attention can miss these coupled dynamics.
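The cost gap between the two designs can be checked with simple counting. This sketch counts query-key pairs only, in pixel space with a 16×16 patch as in the illustration above; it ignores heads, hidden size, and the VAE, and the function name is made up for this example.

```python
def attention_pair_counts(width, height, frames, patch):
    """Count query-key pairs for full vs. factorized attention over a patch grid."""
    per_frame = (width // patch) * (height // patch)
    total = per_frame * frames
    full = total * total                       # every patch attends to every patch
    spatial = frames * per_frame * per_frame   # attention within each frame
    temporal = per_frame * frames * frames     # attention along time at each position
    return {
        "per_frame": per_frame,
        "total": total,
        "full": full,
        "factorized": spatial + temporal,
    }

counts = attention_pair_counts(1280, 720, 16, 16)
print(counts)
print(f"full / factorized = {counts['full'] / counts['factorized']:.1f}x")
```

For this configuration full attention evaluates about 16 times as many pairs as the factorized version, which is the price paid for modeling space and time jointly.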
Rule of Thumb
If a video model uses factorized (separate) spatial and temporal attention, it will save compute but lose some motion-texture coupling. If it uses full spatio-temporal attention, it will be more compute-intensive but produce more natural motion, especially for complex scenes with multiple moving objects.
Wan 2.7 chose full attention for the 14B models and uses a more efficient factorized approach for the 1.3B dense variant.
Putting It Together: How a DiT Block Processes Video
The DiT backbone processes spatio-temporal patches through transformer blocks that each include self-attention (full spatio-temporal), MoE feed-forward layers, and adaptive conditioning. Each block's self-attention layer computes relationships between all patches across the entire video, then the MoE layer refines the representation using specialized expert networks. This is repeated over multiple transformer blocks (depth not publicly specified) to produce the denoised latent.
Flow Matching Training Framework
Wan 2.7 is trained with flow matching rather than the noise-prediction diffusion objective used by Stable Diffusion 1.x and 2.x.
Diffusion vs Flow Matching
In standard diffusion, the training process adds noise to data following a fixed schedule (usually a cosine or linear noise schedule), and the model learns to predict the noise at each timestep.
In flow matching, the training process defines a continuous path from noise to data using an ordinary differential equation (ODE). The model learns to predict the velocity (direction and magnitude of change) at each point along this path.
| Aspect | Diffusion | Flow Matching |
|---|---|---|
| Training target | Predict noise ε | Predict velocity v |
| Sampling path | Stochastic (DDPM) or deterministic (DDIM) | Deterministic, defined by an ODE |
| Step count | 50-1000 steps (DDPM/DDIM) | 28-50 steps (ODE solver) |
| Training simplicity | Depends on a hand-chosen noise schedule | Simple interpolation path between noise and data |
| Sampling speed | Moderate (fewer steps with distillation) | Fast (straight-line paths possible) |
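The velocity target is easiest to see on a straight-line path. The sketch below assumes the rectified-flow style interpolation x_t = (1 - t)·noise + t·data, with t = 0 at noise and t = 1 at data; the technical report may use a different parameterization or time convention, and all names here are illustrative.

```python
import random

def flow_matching_example(data, rng):
    """Build one training example on a straight-line path from noise (t=0) to data (t=1)."""
    noise = [rng.gauss(0.0, 1.0) for _ in data]
    t = rng.random()
    x_t = [(1.0 - t) * n + t * d for n, d in zip(noise, data)]
    velocity = [d - n for n, d in zip(noise, data)]   # regression target, constant along the path
    return t, x_t, velocity

def mse(pred, target):
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)

rng = random.Random(0)
data = [0.5, -1.0, 2.0]              # stand-in for a flattened video latent
t, x_t, velocity = flow_matching_example(data, rng)

# The model would be trained to output `velocity` given (x_t, t, text).
# Sanity check: following the true velocity for the remaining time lands on the data.
reconstructed = [x + (1.0 - t) * v for x, v in zip(x_t, velocity)]
loss_if_model_predicts_zero = mse([0.0] * len(data), velocity)
print(reconstructed, loss_if_model_predicts_zero)
```

Because the target does not depend on a noise schedule, the training loop reduces to sampling t, interpolating, and regressing onto a single vector.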
Why this matters: Flow matching typically requires fewer sampling steps to reach high-quality results. The ODE-based formulation also allows a technique called rectified flow — the ability to straighten the sampling trajectory over successive training stages, making inference faster with minimal quality loss.
Flow Matching Steps in Practice
Wan 2.7 uses an ODE solver for sampling rather than a diffusion denoising loop. The exact solver and step count varies by model variant, but typical generation uses 28-50 steps for the 14B model compared to 50-100 steps needed for comparable diffusion-based models. This is a direct speed benefit from the flow matching formulation.
A Closer Look at Flow Matching Training
The flow matching objective defines a probability path between a simple prior (standard Gaussian) and the data distribution. During training, the model learns the velocity field that moves samples along this path. During inference, it starts from a random latent drawn from the prior and integrates the learned velocity field until it reaches the data distribution.
The practical result: Wan 2.7 needs fewer steps than typical diffusion sampling, and because the ODE solve itself contains no random draws, the same seed and prompt should lead to the same output.
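Sampling with an ODE solver can be sketched with fixed-step Euler integration. This is an assumption-laden toy: the report does not name the solver, the 28-step count is just the low end of the range quoted above, and `toy_velocity` is a closed-form stand-in for the trained network (the ideal straight-path velocity when the dataset is a single point).

```python
import random

def euler_sample(velocity_fn, x0, num_steps):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 (noise) to t=1 (data) with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Stand-in for the trained model: with a one-point "dataset" `target`,
# the ideal straight-path velocity at (x, t) is (target - x) / (1 - t).
target = [0.5, -1.0, 2.0]

def toy_velocity(x, t):
    return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]

rng = random.Random(0)
start = [rng.gauss(0.0, 1.0) for _ in target]   # random latent drawn from the prior
sample = euler_sample(toy_velocity, start, num_steps=28)
print(sample)
```

With a perfectly straight path even a coarse Euler solve lands on the target, which is the intuition behind needing fewer steps; a real learned field is only approximately straight, so step count still trades off against quality.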
Bilingual T5 Encoder + Cross-Attention
Text prompts enter the model through a bilingual T5-XXL encoder.
T5 is an encoder-decoder transformer, and Wan 2.7 uses only its encoder half, which converts the full text prompt into a sequence of embedding vectors, one per token. These embeddings are then injected into the DiT backbone through cross-attention layers, where each patch in the video attends to relevant parts of the prompt.
Why T5 Instead of CLIP
Many earlier image generation models (Stable Diffusion 1.x and 2.x, for example) use CLIP or a similar contrastive text encoder. CLIP aligns text and images in a shared embedding space, which works well for image-level descriptions.
T5, on the other hand, is a pure language model encoder that captures more granular linguistic structure — syntax, semantics, long-range dependencies between words, and complex compositional relationships.
| Encoder | Strength | Weakness |
|---|---|---|
| CLIP | Strong image-text alignment | Struggles with complex compositions and long prompts |
| T5 | Deep language understanding, long prompts | Larger, more compute-intensive |
Why bilingual: Wan 2.7 was trained by Alibaba on both English and Chinese data. The T5 encoder handles both languages natively, with the same architecture processing prompts in either language. This is a practical decision for a model serving a multilingual user base.
How Cross-Attention Conditioned Generation Works
- The T5 encoder processes the text prompt → produces a sequence of per-token text embeddings
- Each transformer block in the DiT backbone has cross-attention layers
- During denoising, each patch's query attends to the text embedding keys
- The weighted text information flows into the patch representation
- The denoising process reconstructs video that matches the text guidance
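The steps above can be sketched as single-head cross-attention on toy vectors. This is a simplified illustration: the learned query, key, and value projections and multi-head splitting are left out, the text embeddings double as both keys and values, and every name and number is made up for the example.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def cross_attention(patch_queries, text_keys, text_values):
    """Each patch query mixes the text values, weighted by scaled query-key similarity."""
    d = len(text_keys[0])
    outputs, all_weights = [], []
    for q in patch_queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in text_keys]
        weights = softmax(scores)
        mixed = [sum(w * v[j] for w, v in zip(weights, text_values))
                 for j in range(len(text_values[0]))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Three per-token text embeddings (hypothetical 2-D vectors).
text_tokens = [[4.0, 0.0], [0.0, 4.0], [-4.0, 0.0]]
# Two patch queries: the first points toward token 0, the second toward token 1.
patch_queries = [[3.0, 0.0], [0.0, 3.0]]

outputs, weights = cross_attention(patch_queries, text_tokens, text_tokens)
for w in weights:
    print([round(x, 3) for x in w])
```

Each patch ends up dominated by the text token it aligns with, which is how different regions of the video can follow different parts of a long prompt.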
Why this matters for users: T5 encoders handle complex, multi-clause prompts better than CLIP. If you write a prompt like "A woman in a red dress walks from left to right while a dog follows behind her in a sunny park," the T5 encoder captures the relationships between woman/dress/dog/park/sun better than a CLIP encoder would. This directly translates to more accurate prompt following.
For prompt engineering tips that leverage this capability, see the Wan 2.7 Prompt Guide.
VAE Latent Space
Wan 2.7 uses a causal 3D VAE to compress video into a latent space before processing by the DiT backbone.
What the VAE Does
Raw video is too large for direct transformer processing. A single 16-frame 720p video at 24-bit color depth is roughly 44 MB of uncompressed data (1280 × 720 pixels × 3 bytes × 16 frames). Even after patching, the transformer would need an impractical number of tokens.
The VAE compresses this. It encodes the video into a smaller latent representation (typically reducing spatial dimensions by 8× and temporal dimension by 4×) while preserving visual information. The DiT processes this compressed latent, and the VAE decoder reconstructs the result back to full resolution.
Causal 3D VAE — What Makes It Different
Most image VAEs (like the one in Stable Diffusion) are 2D — they compress each frame independently. Wan 2.7's VAE is:
- 3D: It compresses across both spatial and temporal dimensions simultaneously, exploiting redundancies between adjacent frames.
- Causal: The encoding of each frame depends only on past frames, not future frames. This is critical for streaming applications where you process video in temporal order.
VAE Compression Ratio
| Dimension | Input | Latent | Compression |
|---|---|---|---|
| Spatial (W×H) | 1280×720 | 160×90 | 8× each axis |
| Temporal (frames) | 16 | 4 | 4× |
| Channel depth | 3 (RGB) | 16 (latent) | Expansion |
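Counted in values, the effective compression is 8 × 8 × 4 × (3/16) = 48×, because the channel expansion from 3 to 16 gives back part of the 256× spatial and temporal reduction. A 16-frame 720p input (about 44.2 million 8-bit values, roughly 44 MB) becomes about 0.92 million latent values, or roughly 1.8 MB when stored as 16-bit floats.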
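Counting values on each side of the VAE is a quick way to check the compression arithmetic. The sketch uses the shapes from the table; the 16-bit latent storage is an assumption, since the report's latent precision is not stated here.

```python
def element_count(width, height, frames, channels):
    """Total number of scalar values in a (frames, channels, height, width) tensor."""
    return width * height * frames * channels

pixels = element_count(1280, 720, 16, 3)    # 8-bit RGB input
latents = element_count(160, 90, 4, 16)     # 8x spatial, 4x temporal, 16 channels

ratio = pixels / latents
input_mb = pixels / 1e6                     # one byte per value
latent_mb_fp16 = latents * 2 / 1e6          # two bytes per value (assumed 16-bit floats)

print(pixels, latents, ratio)
print(f"{input_mb:.1f} MB -> {latent_mb_fp16:.2f} MB")
```

The ratio in values is what matters for the transformer, since token count follows the latent grid size rather than the byte size.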
VAE Latent Space Characteristics
The latent space is continuous (not discrete like VQ-VAE). This means the VAE preserves smooth gradients and fine details better than discrete alternatives but requires higher precision in the DiT's denoising process.
Why this matters: The quality of the VAE reconstruction is a bottleneck — the DiT can only generate within the representable capacity of the latent space. Wan 2.7's causal 3D VAE was likely trained specifically on video data to ensure it preserves motion continuity, scene transitions, and temporal textures that a frame-independent VAE might lose.
Image vs Video Architecture Differences
The same 14B MoE backbone is adapted for both image and video generation. Here is what changes between the two.
| Component | Image Model (T2I-14B) | Video Model (T2V-14B) |
|---|---|---|
| VAE | 2D VAE (frame-independent) | Causal 3D VAE (temporal-aware) |
| Patchification | 2D spatial patches | 3D spatio-temporal patches |
| Attention | Spatial only | Full spatio-temporal |
| Frame handling | Single frame | Multiple frames, encoded causally by the VAE |
| Training data | Image-caption pairs | Video-caption pairs |
| Sampling | Direct ODE solve | ODE solve per frame group |
The image model is essentially a subset of the video architecture — it processes a single frame through a 2D VAE and applies spatial attention only. The underlying DiT backbone, MoE routing, T5 encoder, and flow matching framework remain identical.
This shared architecture means improvements to the backbone benefit both models simultaneously. A better attention mechanism or improved MoE routing improves both image and video generation from the same update.
27B vs 14B: What Each Number Tells You
A common question is why "27B total" and "14B active" are both specified. The total parameter count determines the model's storage and memory footprint (RAM/VRAM required to load the model). The active parameter count determines the compute cost per generation (FLOPs per forward pass). When evaluating hardware requirements, use the 27B figure for memory sizing and the 14B figure for speed estimation.
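A back-of-the-envelope calculation shows how the two numbers are used. The precisions listed and the "about 2 FLOPs per active weight per token" rule of thumb are assumptions for rough sizing only; the estimate covers weights alone and ignores activations, attention cost, the text encoder, and the VAE.

```python
def weight_memory_gb(total_params, bytes_per_param):
    """Memory needed just to hold the weights, in GB (1 GB = 1e9 bytes)."""
    return total_params * bytes_per_param / 1e9

def forward_flops_per_token(active_params):
    """Rule of thumb: one multiply and one add per active weight per token."""
    return 2 * active_params

TOTAL_PARAMS = 27e9    # sizes memory
ACTIVE_PARAMS = 14e9   # sizes compute

memory = {name: weight_memory_gb(TOTAL_PARAMS, nbytes)
          for name, nbytes in [("16-bit", 2), ("8-bit", 1), ("4-bit", 0.5)]}
moe_flops = forward_flops_per_token(ACTIVE_PARAMS)
dense_flops = forward_flops_per_token(TOTAL_PARAMS)

print(memory)
print(f"MoE compute is {moe_flops / dense_flops:.0%} of a dense 27B model per token")
```

The memory figures follow the 27B total regardless of routing, while the compute figure follows the 14B active count, which is why the two numbers answer different hardware questions.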
Wan 2.7 vs Stable Diffusion 3: Where They Diverge
Stable Diffusion 3 also pairs a transformer backbone (MMDiT) with a rectified-flow objective, which makes it a close comparison to Wan 2.7 even though it is an image model. The key differences are:
- SD3 uses three text encoders (two CLIP models plus T5-XXL); Wan 2.7 uses a single bilingual T5 encoder
- SD3's published models are dense transformers without MoE routing; Wan 2.7's 14B models use MoE
- Wan 2.7's video model adds 3D patchification and temporal attention that SD3 does not need
- Both use flow matching, but the training data and compute scale differ significantly
Why This Architecture Matters for Users
Architecture decisions translate into real differences in output quality and usability.
| What the Architecture Does | What You Experience |
|---|---|
| DiT backbone with full spatio-temporal attention | Consistent motion across frames, objects stay coherent |
| MoE with 14B active parameters | Fast generation despite 27B total capacity |
| Flow matching objective | Fewer sampling steps, faster results |
| T5 bilingual encoder | Better prompt following for complex scenes, works in English and Chinese |
| Causal 3D VAE | Smooth temporal compression without frame boundary artifacts |
The MoE architecture particularly matters for local users. A dense 27B model would need roughly twice the compute per step. The MoE design gives you near-27B capacity at 14B-level compute, though the weights still occupy memory in proportion to the full 27B, so running Wan 2.7 on an RTX 4090 or similar consumer hardware typically relies on reduced precision or offloading.
Verifying the Architecture Yourself
The claims above are testable without a deep learning background. Run the same detailed prompt — "a person walking a dog across a park while cyclists pass in the background" — through Wan 2.7 T2V-14B and any U-Net-based video model. Compare a scene with fast motion: observe which model maintains object identity across frames and which one blurs or flickers. The temporal coherence advantage of DiT combined with full spatio-temporal attention is visible in a single clip, and the generation speed gain from MoE shows up in wall-clock time.
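Note that the 14B MoE model requires approximately 20-24 GB of VRAM. Since 27B parameters occupy about 54 GB at 16-bit precision, that figure implies reduced-precision weights or offloading part of the model to system memory. This fits on an RTX 4090 or A6000 but exceeds the capacity of most consumer GPUs with 8-12 GB. The 1.3B dense variant is the practical option for those configurations.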
FAQ
What is the DiT in Wan 2.7?
The Diffusion Transformer (DiT) is the core denoising backbone. It replaces the U-Net architecture used in older video generation models. Instead of processing data through convolutional layers at multiple resolutions, DiT divides video into patches and processes them with transformer self-attention. This allows every part of the video to directly interact with every other part, improving long-range temporal coherence.
How does MoE work in Wan 2.7?
Mixture-of-Experts divides the feed-forward layers into multiple specialized "expert" networks. A lightweight router selects which experts to activate for each token. Wan 2.7 uses top-2 routing — each token activates 2 out of many available experts. This gives the model 27B total parameters to store knowledge but only activates 14B per forward pass, saving compute while maintaining capacity.
What is the difference between Wan 2.7 image and video architecture?
The image model uses a 2D VAE and spatial-only attention on single frames. The video model uses a causal 3D VAE (compresses across time as well as space) and full spatio-temporal attention that tracks relationships across both spatial position and frame index. The DiT backbone, MoE routing, T5 encoder, and flow matching framework are shared between both.
How many parameters does Wan 2.7 have?
The 14B models have 27 billion total parameters with 14 billion active per forward pass. There is also a 1.3B dense (non-MoE) variant for lighter deployments. The total parameter count determines memory requirements; the active count determines inference compute.
What is flow matching in Wan 2.7?
Flow matching is the training objective that replaces standard diffusion denoising. Instead of predicting noise added by a fixed schedule, the model learns the velocity along a continuous path from noise to data. This allows faster sampling (fewer steps needed) and more deterministic generations compared to diffusion-based approaches.
How does Wan 2.7 architecture compare to Stable Diffusion?
Wan 2.7 and Stable Diffusion 3 are architecturally related: both use transformer backbones and flow matching. Key differences include Wan 2.7's MoE layers (SD3's published models are dense), its bilingual T5 encoder, its causal 3D VAE for video, and its full spatio-temporal attention for temporal coherence. SD3 uses multiple text encoders (CLIP + T5) and is primarily optimized for image generation, while Wan 2.7 is built for video-first generation with image as a secondary capability.
Can I run Wan 2.7 locally with this architecture?
Yes. The 14B MoE requires approximately 20-24 GB VRAM for full inference on a single GPU. The 1.3B dense variant runs on 8-12 GB VRAM. The MoE design keeps per-step compute at the 14B level, so a dense 27B model would be noticeably slower on the same hardware; the memory footprint still follows the 27B total. See the Wan 2.7 Open Source Guide for specific hardware requirements and setup instructions.
Summary
Wan 2.7's architecture represents a deliberate shift from legacy video generation designs:
- DiT backbone replaces U-Net for native long-range temporal coherence
- MoE with 27B total / 14B active balances model capacity with inference cost
- Full spatio-temporal attention captures coupled motion-texture dynamics
- Flow matching enables faster sampling with fewer steps
- Bilingual T5 handles complex prompts in English and Chinese
- Causal 3D VAE compresses video efficiently while preserving temporal continuity
If you understand these six decisions, you understand what Wan 2.7 is doing under the hood — and why the outputs look the way they do.
To see this architecture in action, open wan27.org, paste a prompt with multiple moving subjects — "a person walking a dog across a park while cyclists pass in the background" — and watch how every frame maintains object identity. That coherence is the DiT backbone and full spatio-temporal attention working exactly as designed.