
Fast-WAM: Do World Action Models Need Test-time Future Imagination?

Method name: Fast-WAM

Authors: Tianyuan Yuan, Zibin Dong, Yicheng Liu, Hang Zhao

arXiv: 2603.16666; v1 submitted on 2026-03-17, v2 updated on 2026-03-23; Areas: Computer Vision and Pattern Recognition, Artificial Intelligence

Links: arXiv summary page | PDF | HTML | Project page | official code

1. Quick overview of the paper

One-sentence summary: Fast-WAM's conclusion is sharp: the main benefit of a World Action Model may not be actually "imagining future videos" at inference time, but learning a better world representation through video co-training; accordingly, it keeps future-video modeling during training yet drops the future-video branch at inference, generating actions directly from the latent world representation of the current observation.
What problem does the paper address? Existing WAMs often follow imagine-then-execute: iteratively generate future frames first, then predict actions conditioned on them. This brings high test-time latency, and it is unclear whether the useful part is "video modeling during training" or "explicit future imagination at inference".
The authors' approach: construct Fast-WAM and three controlled variants (Fast-WAM-Joint, Fast-WAM-IDM, Fast-WAM w/o video co-train) and decouple video co-training from test-time future generation under a unified backbone and training recipe.
Most important results: Fast-WAM reaches 91.8% on RoboTwin 2.0 and a 97.6% LIBERO average without embodied pretraining; removing video co-training drops these to 83.8% and 93.5%. On the real towel-folding task, Fast-WAM's latency is 190 ms versus 810 ms for Fast-WAM-IDM, and the no-video-co-train variant succeeds only 10% of the time.
Things to note when reading: Fast-WAM does not reject the world model; it repositions its role. The world model may act more as a representation-shaping signal during training than as a video plan that must be explicitly generated at deployment.

Difficulty rating: ★★★★☆. Requires familiarity with WAM/VLA, video diffusion transformers, flow matching, action-chunk diffusion, attention masks that prevent future leakage, and simulation/real-robot evaluation.

Keywords: World Action Model, video co-training, test-time future imagination, flow matching, Mixture-of-Transformer, Wan2.2-5B, LIBERO, RoboTwin, towel folding.

Three WAM paradigms
Figure 1. Three WAM paradigms: (A) joint denoising of future video and actions; (B) generate the future video first, then apply an inverse dynamics model / action predictor; (C) Fast-WAM keeps video co-training during training and, at inference, generates actions directly from the current latent world representation.

2. Motivation

2.1 The attraction and cost of WAM

A standard VLA maps visual observations and language directly to actions, mainly inheriting the semantic priors of web-scale vision-language pretraining. But robot control also requires understanding how the physical world evolves under actions. The appeal of WAM is that it puts future visual prediction and action modeling into one framework, exposing the policy explicitly to task-relevant temporal structure.

The problem is that most WAMs iteratively denoise future videos at inference and then generate actions conditioned on this imagined future. Iterative sampling from a video diffusion model is expensive, and every extra few hundred milliseconds in a real-robot closed loop can slow down and blunt the policy.

2.2 The real question the paper asks

Core question: where do WAM's gains come from? Does predicting future videos during training let the backbone learn physical/motion representations, or does explicitly generating future observations at inference genuinely provide the foresight needed for action prediction?

This question has been hard to answer because many WAMs tie the two factors together: the same model both learns video prediction during training and generates future videos at test time. Fast-WAM's contribution is to separate the two and determine experimentally, with controlled variants, which one matters more.

4. Detailed explanation of method

4.1 Formalization of the problem

Suppose the current observation is $o$, the task language is $l$, and the action chunk is $a_{1: H}$. Standard visuomotor policy learning:

$$p(a_{1: H}\mid o, l).$$

A typical imagine-then-execute WAM introduces future vision $v_{1: T}$, written as:

$$p(a_{1: H}\mid o, l)=\int p(v_{1: T}\mid o, l)\, p(a_{1: H}\mid o, l, v_{1: T})\, dv_{1: T}.$$

Intuition: first imagine the future, then generate actions conditioned on that imagination. The price is sampling/denoising $v_{1: T}$ at inference.

Fast-WAM is changed to a direct policy interface:

$$p_\theta(a_{1: H}\mid o, l)=p_\theta(a_{1: H}\mid z(o, l)), $$

where $z(o, l)$ is the latent world representation obtained from a single forward pass of the video backbone over the current context. Key difference: $z(o, l)$ is not a future video generated at inference, but a representation of the present that was shaped by video co-training during training.
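
The contrast between the two interfaces can be sketched in plain Python; every function here is a hypothetical stand-in (the denoiser, feature sizes, the action head, the update rule), intended only to show where the test-time cost sits:

```python
import numpy as np

rng = np.random.default_rng(0)

def denoise_future_video(z_obs, num_steps=30):
    """Stand-in for iterative video diffusion sampling -- the expensive part."""
    v = rng.standard_normal(z_obs.shape)
    for _ in range(num_steps):             # many sequential denoising calls
        v = 0.9 * v                        # placeholder update rule
    return v

def action_head(features):
    """Stand-in for the action expert mapping features to an action chunk."""
    return np.tanh(features)[:8]

def imagine_then_execute(z_obs):
    # p(a | o, l) = ∫ p(v | o, l) p(a | o, l, v) dv : pays for video sampling.
    v = denoise_future_video(z_obs)
    return action_head(z_obs + v)

def fast_wam_style(z_obs):
    # p(a | o, l) = p(a | z(o, l)) : one forward pass, no future-video loop.
    return action_head(z_obs)
```

The imagine-then-execute path pays for `num_steps` sequential denoiser calls on every control cycle; the Fast-WAM-style path does not.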

4.2 Architecture: Wan2.2 video DiT + action expert DiT

Fast-WAM architecture
Figure 2a. Fast-WAM architecture: the Wan2.2-5B video DiT serves as the world-modeling backbone, T5 encodes the language, and a video VAE encodes multi-camera images into latents; the action-expert DiT generates action chunks.

Model input tokens are divided into three groups: first-frame (current observation) tokens, future-video tokens, and action tokens.

All tokens attend to the language embedding via cross-attention. A Mixture-of-Transformers structure is used between the video and action branches, with a structured attention mask controlling the information flow.

4.3 Attention mask: share context and prevent future leakage

Training and inference masks
Figure 2b. Training/inference masks: action tokens can see the clean first-frame tokens but not the noisy future-video tokens; first-frame tokens attend to no other tokens, so future information cannot contaminate the current anchor.

During training, future-video tokens have bidirectional attention within the video branch and can attend to first-frame tokens; action tokens have bidirectional attention within the action branch and can also attend to first-frame tokens. The most critical constraint: action tokens cannot attend to future-video tokens. Video modeling and action prediction thus share the same current visual context, but the action never peeks at the ground-truth future.

At inference, the future-video branch is removed entirely: only the first-frame latent tokens are kept, the video backbone produces latent world features in a single forward pass, and the action expert then performs action denoising.
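
The mask described above can be written down as a boolean matrix; the token counts below are hypothetical placeholders, not the paper's actual layout:

```python
import numpy as np

# Hypothetical token counts: first-frame, future-video, action tokens.
N_FIRST, N_VID, N_ACT = 4, 8, 3
n = N_FIRST + N_VID + N_ACT
first = slice(0, N_FIRST)
vid = slice(N_FIRST, N_FIRST + N_VID)
act = slice(N_FIRST + N_VID, n)

mask = np.zeros((n, n), dtype=bool)   # mask[i, j]: token i may attend to token j

mask[first, first] = True             # first-frame tokens see only themselves
mask[vid, vid] = True                 # bidirectional within the video branch
mask[vid, first] = True               # video tokens may read the first frame
mask[act, act] = True                 # bidirectional within the action branch
mask[act, first] = True               # action tokens may read the first frame
# Deliberately absent: mask[act, vid] -- actions never see the noisy future.

assert not mask[act, vid].any()       # no direct future leakage into actions
```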

4.4 Training goal: action loss + video co-training loss

Fast-WAM uses the same flow-matching form for both actions and videos. For any target variable $y$, which can be the action chunk $a_{1: H}$ or future-video latents $z_{1: T}$, sample noise $\epsilon\sim\mathcal{N}(0, I)$ and a time step $t\in(0, 1)$:

$$y_t=(1-t)y+t\epsilon.$$

Train the model to predict the velocity field from data to noise:

$$\mathcal{L}_{\mathrm{FM}}(y)= \mathbb{E}_{y, \epsilon, t} \left[ \left\|f_\theta(y_t, t, o, l)-(\epsilon-y)\right\|_2^2 \right].$$

For actions and videos respectively:

$$\mathcal{L}_{\mathrm{act}}=\mathcal{L}_{\mathrm{FM}}(a_{1: H}), \qquad \mathcal{L}_{\mathrm{vid}}=\mathcal{L}_{\mathrm{FM}}(z_{1: T}).$$

Total loss:

$$\mathcal{L}=\mathcal{L}_{\mathrm{act}}+\lambda\mathcal{L}_{\mathrm{vid}}.$$

Intuition: $\mathcal{L}_{\mathrm{vid}}$ is not necessarily meant to generate video at inference; it serves as a world-representation regularizer / co-training signal.
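
The two losses can be sketched in a few lines of numpy. The interpolation and velocity target follow the equations above; `model`, the shapes, and $\lambda$ are placeholders, since the paper's main text does not specify $\lambda$:

```python
import numpy as np

def flow_matching_loss(y, model, rng):
    """L_FM(y) = E ||f(y_t, t) - (eps - y)||^2, with y_t = (1 - t) y + t eps."""
    eps = rng.standard_normal(y.shape)      # Gaussian noise
    t = rng.uniform(0.0, 1.0)               # time step in (0, 1)
    y_t = (1.0 - t) * y + t * eps           # interpolate data toward noise
    target = eps - y                        # velocity field from data to noise
    pred = model(y_t, t)                    # model predicts the velocity
    return float(np.mean((pred - target) ** 2))

def total_loss(action_chunk, video_latents, model, lam, rng):
    """L = L_act + lambda * L_vid; the paper does not state lambda's value."""
    return (flow_matching_loss(action_chunk, model, rng)
            + lam * flow_matching_loss(video_latents, model, rng))
```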

4.5 Three controlled variants

| Variant | Video co-training | Test-time future imagination | Role |
|---|---|---|---|
| Fast-WAM | Yes | None | Main method: keep the training signal, remove the inference cost. |
| Fast-WAM-Joint | Yes | Yes, joint video/action denoising | Simulates joint-modeling WAMs by letting video and action tokens attend to each other. |
| Fast-WAM-IDM | Yes | Yes, video-then-action | Generates future video first, then predicts actions from the future representation; following LingBot-VA, applies ground-truth video-token noise augmentation with $p=0.5$. |
| Fast-WAM w/o video co-train | None | None | Removes only $\mathcal{L}_{\mathrm{vid}}$, to isolate the contribution of video co-training. |

5. Experiment

5.1 Implementation details

5.2 Benchmark settings

| Benchmark | Data and training | Evaluation |
|---|---|---|
| LIBERO | Four suites: Spatial, Object, Goal, Long; 10 tasks and 500 demos per suite; trained for 20k steps. | 40 tasks with different random seeds, 2000 trials in total; success rate reported. |
| RoboTwin 2.0 | 50+ dual-arm tasks; 2500 clean demos + 25000 heavy-randomization demos; trained for 30k steps. | 100 trials per task; clean and randomized average success rates reported. |
| Real-world towel folding | Galaxea R1 Lite platform, 60 hours of teleoperated demonstrations; trained for 30k steps. | Success rate and average completion time reported; towel folding tests deformable-object dynamics, long-horizon planning, and closed-loop efficiency. |
Real-world towel folding task
Figure 3. Real-world towel-folding task. The authors stress that completion time is as important as success rate, because succeeding through slow trial and error does not indicate a good policy.

5.3 RoboTwin 2.0 Main Results

| Method | Embodied PT. | Clean | Rand. | Average |
|---|---|---|---|---|
| $\pi_0$ | Yes | 65.92 | 58.40 | 62.2 |
| $\pi_{0.5}$ | Yes | 82.74 | 76.76 | 79.8 |
| Motus | Yes | 88.66 | 87.02 | 87.8 |
| Motus from WAN2.2 | No | 77.56 | 77.00 | 77.3 |
| LingBot-VA | Yes | 92.90 | 91.50 | 92.2 |
| LingBot-VA from WAN2.2 | No | 80.60 | -- | 80.6 |
| Fast-WAM | No | 91.88 | 91.78 | 91.8 |

Fast-WAM uses no embodied pretraining yet reaches 91.8%, clearly above Motus from WAN2.2 (77.3) and LingBot-VA from WAN2.2 (80.6), which also lack embodied pretraining, and close to LingBot-VA with embodied pretraining (92.2). Appendix Table 3 gives per-task RoboTwin clean/rand results; overall, Fast-WAM matches the strongest baselines on many tasks, while the no-video-co-train average is markedly lower.

5.4 LIBERO main results

| Method | Embodied PT. | Spatial | Object | Goal | Long | Average |
|---|---|---|---|---|---|---|
| OpenVLA | Yes | 84.7 | 88.4 | 79.2 | 53.7 | 76.5 |
| $\pi_0$ | Yes | 96.8 | 98.8 | 95.8 | 85.2 | 94.1 |
| $\pi_{0.5}$ | Yes | 98.8 | 98.2 | 98.0 | 92.4 | 96.9 |
| LingBot-VA | Yes | 98.5 | 99.6 | 97.2 | 98.5 | 98.5 |
| Motus | Yes | 96.8 | 99.8 | 96.6 | 97.6 | 97.7 |
| Fast-WAM | No | 98.2 | 100.0 | 97.0 | 95.2 | 97.6 |

Fast-WAM averages 97.6% on LIBERO, above $\pi_{0.5}$'s 96.9 and close to Motus/LingBot-VA, despite using no embodied pretraining; the authors highlight this data efficiency.

5.5 Controlled comparison: future imagination vs video co-training

| Variant | RoboTwin Avg. | LIBERO Avg. | Notes |
|---|---|---|---|
| Fast-WAM | 91.8 | 97.6 | Video co-training at training time; no future imagination at inference. |
| Fast-WAM-Joint | 90.6 | 98.5 | Joint denoising of future video and actions; explicit imagination at inference. |
| Fast-WAM-IDM | 91.3 | 98.0 | Generates future video first, then predicts actions. |
| Fast-WAM w/o video co-train | 83.8 | 93.5 | Same inference as Fast-WAM, but the video modeling objective is removed from training. |

This is the paper's key evidence: the gap between Fast-WAM and the two imagine-then-execute variants is small, while the drop after removing video co-training is much larger: RoboTwin falls from 91.8 to 83.8 and LIBERO from 97.6 to 93.5, with LIBERO Spatial/Long hit especially hard. On this basis the authors argue that WAM's main value more likely comes from the video prediction objective during training than from actually generating future videos at inference.

5.6 Real Towel Folding: Performance and Latency

Real-world results and latency
Figure 4. Real towel-folding results: in the left plot, closer to the upper left is better; the right plot shows inference latency: Fast-WAM 190 ms, Fast-WAM-IDM 810 ms, Fast-WAM-Joint 580 ms. The figure also shows that no-video-co-train succeeds only about 10% of the time and has the longest completion time.

On the real task, pretrained $\pi_{0.5}$ remains the strongest method, with the highest success rate and shortest completion time. Within the Fast-WAM family, performance is similar: Fast-WAM-IDM has the highest success rate, while Fast-WAM has the better completion time. More importantly, all Fast-WAM variants with video co-training are clearly stronger than $\pi_{0.5}$ without pretraining, while no-video-co-train collapses to 10% success. This again points to video co-training as the main driver.

On latency, Fast-WAM runs at 190 ms, matching no-video-co-train's 190 ms; Fast-WAM-Joint takes 580 ms and Fast-WAM-IDM 810 ms. Fast-WAM is therefore the better deployment trade-off: it retains most of WAM's performance while avoiding the overhead of explicit future-video sampling.

6. Reproducibility audit

6.1 Components required to reproduce

| Component | Paper information | Reproduction notes |
|---|---|---|
| Backbone | Wan2.2-5B video DiT + T5 text encoder + video VAE. | Requires a loadable Wan2.2-5B checkpoint; large memory/parameter footprint. |
| Action expert | DiT, isomorphic to the video branch, hidden dim 1024, about 1B parameters. | Carefully align action tokens, time-step embeddings, and cross-attention with the video branch. |
| Training objective | $\mathcal{L}_{act}+\lambda\mathcal{L}_{vid}$. | The main text does not give the value of $\lambda$; confirm it from the code or default configuration. |
| Data | LIBERO, RoboTwin 2.0, and 60 hours of real Galaxea R1 Lite towel-folding data. | Simulation is highly reproducible; real data and hardware are hard to reproduce exactly. |
| Latency measurement | Single NVIDIA RTX 5090D V2 32GB. | Latency is not directly comparable across GPUs; action denoising steps and batch settings are reported. |

6.2 Minimal reproduction route

  1. First implement Fast-WAM w/o video co-train on LIBERO: run action flow matching using only the current first-frame latent + language + action DiT.
  2. Add the future-video latent branch and $\mathcal{L}_{vid}$, keeping the constraint that action tokens cannot see future-video tokens, and verify whether Fast-WAM's improvement appears.
  3. Implement Fast-WAM-Joint: release the mutual attention between action and video tokens and test whether it stays close to Fast-WAM.
  4. Implement Fast-WAM-IDM: generate the future-video representation first, then condition the actions on it; note the $p=0.5$ ground-truth video-token noise augmentation.
  5. Reproduce the LIBERO table, then move to RoboTwin multi-task training; finally consider real towel folding.
  6. Evaluate latency separately: Fast-WAM has no future branch but still runs 10 action denoising steps; the extra overhead of IDM/Joint comes from future-video generation / joint sampling.
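
Step 6's 10-step action denoising can be sketched as Euler integration of the learned flow from noise ($t=1$) back to data ($t=0$). The oracle below is a hypothetical stand-in for the trained action expert; because the interpolation $y_t=(1-t)y+t\epsilon$ has a velocity that is constant along each path, Euler steps invert it exactly:

```python
import numpy as np

def euler_denoise(velocity_model, eps, num_steps=10):
    """Integrate dy/dt = v from t = 1 (pure noise) down to t = 0 (data)."""
    y, dt = eps.copy(), 1.0 / num_steps
    for k in range(num_steps):
        t = 1.0 - k * dt                      # current time along the flow
        y = y - dt * velocity_model(y, t)     # one Euler step toward the data
    return y

rng = np.random.default_rng(0)
true_action = rng.standard_normal(8)          # hypothetical action chunk
eps = rng.standard_normal(8)

# Oracle velocity: for y_t = (1 - t) y + t eps, (y_t - y) / t equals eps - y.
oracle = lambda y_t, t: (y_t - true_action) / t
recovered = euler_denoise(oracle, eps)        # recovers true_action
```

In the real model the oracle is replaced by the action-expert DiT conditioned on $z(o, l)$; with 10 steps this loop is essentially the entire test-time cost of Fast-WAM beyond the single backbone forward pass.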

6.3 Reproduction risk points

| Risk | Why it matters | Suggestion |
|---|---|---|
| $\lambda$ is not given in the text | The strength of video co-training directly affects the conclusion. | Check the official code configuration first; otherwise run a $\lambda$ sweep. |
| The mask implementation can leak the future | If action tokens can see the ground-truth future video, results are artificially inflated. | Write unit tests checking attention-mask reachability. |
| Multi-camera image-tiling details | Multiple cameras are concatenated into a single image before the VAE, which affects token layout. | Keep camera order, resolution, and crop/resize consistent. |
| Real towel data cannot be publicly verified | The 60 hours of teleoperation and the hardware platform matter a lot. | Treat real-robot results as deployment evidence; draw structural conclusions mainly from the simulated controlled variants. |
| Pretraining fairness | Baselines with and without embodied pretraining are mixed in the same tables. | Compare in groups: methods without embodied PT against each other is the fairest comparison. |
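
The suggested mask unit test is worth making multi-hop: across stacked attention layers, information flows along paths in the attention graph, so a reachability check catches indirect leaks (e.g. future video reaching actions through a shared intermediate token). A minimal sketch with hypothetical token counts:

```python
import numpy as np

# Hypothetical layout: 4 first-frame, 8 future-video, 3 action tokens.
n_first, n_vid, n_act = 4, 8, 3
n = n_first + n_vid + n_act
first, vid, act = slice(0, 4), slice(4, 12), slice(12, 15)

allowed = np.zeros((n, n), dtype=bool)    # allowed[i, j]: i may attend to j
allowed[first, first] = True              # first frame attends only to itself
allowed[vid, vid] = allowed[vid, first] = True
allowed[act, act] = allowed[act, first] = True

# Reachability over stacked layers: can info at token j ever reach token i?
reach = allowed | np.eye(n, dtype=bool)
for _ in range(4):                        # 4 hypothetical attention layers
    hop = (reach.astype(int) @ allowed.astype(int)) > 0
    reach = reach | hop                   # extend paths by one attention hop

# No path, direct or indirect, from any future-video token into an action token.
assert not reach[act, vid].any()
```

Because first-frame tokens attend to no other tokens, they cannot act as a relay for future information, which is exactly what this check verifies.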

7. Analysis, Limitations and Boundaries

7.1 The most valuable part of this paper

The most valuable part is the question itself. Many WAM papers treat "generating future videos" as a necessary step by default; Fast-WAM splits this into two factors, the training objective and the inference mechanism, and verifies them with variants inside the same framework. This experimental design is more instructive than simply proposing a new model: it suggests the world model's value may lie mainly in the representation it shapes during training, rather than in explicit imagination at deployment time.

The second value point is deployment orientation. A gap of 190 ms versus 580/810 ms matters on real robots. Fast-WAM brings WAM close to the inference interface of a VLA while retaining the WAM training signal, a very practical compromise.

7.2 Why the results hold up

7.3 Limitations and points that need to be questioned

| Question | Impact |
|---|---|
| The conclusion is "future imagination is not that critical", not "completely useless" | Joint/IDM are still slightly higher than Fast-WAM on LIBERO, and Fast-WAM-IDM has the highest real-task success rate. Whether those gains are worth the latency cost depends on the deployment scenario. |
| Only a single action chunk is studied | The authors omit outer autoregressive rollout to control variables; whether explicit future imagination helps more on longer tasks remains open. |
| The only real task is towel folding | Deformable objects are challenging, but one real-world task cannot cover all robot manipulation. |
| The model is very large | A 6B model plus the Wan2.2-5B backbone sets a high bar for reproduction and deployment. |
| Training details still depend on the code | The text gives most optimization parameters, but key configurations such as $\lambda$ require checking the official code. |

7.4 Questions that can be asked in group meetings

  1. If a task requires explicit intermediate subgoals, such as complex assembly or navigation, does test-time future imagination become important again?
  2. What exactly does the representation learned through Fast-WAM's video co-training encode? Could probing, attention analysis, or feature prediction show that it captures physical dynamics?
  3. Action tokens cannot see future-video tokens, but the video and action branches share the first-frame anchor; is this mask optimal, or is there a tighter causal mask?
  4. Wan2.2 is a general-purpose video generation backbone; after switching to a robot-video-pretrained backbone, would the gap between Fast-WAM and Joint/IDM grow or shrink?
  5. Fast-WAM beats IDM on completion time in the real task, but its success rate is not necessarily the highest. How should a deployed system trade off latency, success rate, and action stability?

Appendix: coverage check for this report

Covered: Abstract, Introduction, Related Work, Method, Experiment, Conclusion, and RoboTwin per-task results in Appendix.

Figure handling: PNG figures rendered from the arXiv HTML are saved in figures/; key tables have been rebuilt in HTML.

Residual risk: the real towel-folding training data is not a public benchmark; full reproduction still relies on the official code configuration, especially $\lambda$ and the specific data processing.