
Unified World Models: Coupling Video and Action Diffusion for Pretraining on Large Robotic Datasets

Authors: Chuning Zhu, Raymond Yu, Siyuan Feng, Benjamin Burchfiel, Paarth Shah, Abhishek Gupta

Organization: University of Washington, Toyota Research Institute

Version: arXiv v3, updated 2025-05-23

Links: arXiv: 2504.02792 | PDF | Project Page | Code

Source note: the arXiv source includes a complete appendix covering implementation details, real-robot experiment details, simulation experiment details, additional ablations, and Internet-video cotraining.

1. Quick overview of the paper

One-sentence summary: Unified World Models (UWM) uses a diffusion transformer with independent timesteps for the action and the future observation to model actions and videos jointly; by setting the two timesteps at inference, the same model switches between four modes (policy, forward dynamics, inverse dynamics, and video prediction) and can additionally be cotrained on action-free videos.
| Reading focus | Takeaway |
| --- | --- |
| What problem does the paper solve? | Large-scale imitation learning relies on high-quality action demonstrations, while abundant video data lacks action annotations; conventional policies do not explicitly learn environment dynamics. |
| The authors' approach | Couple action diffusion and video diffusion in the same transformer, with $t_a$ and $t_{o'}$ sampled/controlled independently; $t=T$ approximates "mask/marginalize" and $t=0$ approximates "condition". |
| Most important results | On five real-robot tasks, UWM beats DP, PAD, and GR1 in both ID and OOD, and action-free DROID video cotraining improves it further; on LIBERO the average success rate is 0.79, versus 0.71 (DP), 0.57 (PAD), and 0.58 (GR1). |
| Things to note when reading | UWM is not discrete-token diffusion but continuous latent image/action diffusion; the core innovation is the conditioning/marginalization control and action-free video training enabled by independent per-modality timesteps. |

Difficulty rating: ★★★★☆. Requires understanding of DDPM, conditional diffusion, Diffusion Policy, world models, inverse dynamics, latent diffusion, and robotic pretraining/finetuning experimental protocols.

Keywords: Unified World Model · Action Diffusion · Video Diffusion · Independent Timesteps · Action-Free Videos

Core contribution list

UWM teaser
Figure 1. UWM can switch between policy, forward dynamics, inverse dynamics and video prediction by controlling action/video diffusion timesteps.

2. Motivation

2.1 Problems to be solved

Behavioral cloning and Diffusion Policy have shown that reliable manipulation policies can be learned from high-quality expert demonstrations. But robot data collection is expensive, and policies usually learn only the mapping from observations to actions, without explicitly modeling how actions change future observations. Conversely, world models and video models can learn dynamics, but videos typically lack action labels, making them hard to use directly for policy training.

2.2 Limitations of existing methods

2.3 Solution ideas of this article

The authors' key insight is to treat the diffusion timestep as a continuous form of masking: the closer $t$ is to $T$, the closer the variable is to fully masked; the closer $t$ is to $0$, the closer it is to the clean input. Given independent timesteps for the action and the future observation, a variable can therefore be "masked out" for marginalization at inference time, or pinned to a clean value for conditioning.
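To make the masking analogy concrete, here is a minimal numpy sketch of standard DDPM forward noising (the cosine schedule is illustrative, not necessarily the paper's): at $t=0$ the noised variable is essentially the clean input, and at $t=T$ it is indistinguishable from pure Gaussian noise.

```python
import numpy as np

T = 100                                   # training diffusion steps, as in the appendix
t = np.arange(T + 1)
alpha_bar = np.cos((t / T + 0.008) / 1.008 * np.pi / 2) ** 2  # illustrative cosine schedule

def noised(x, step, rng):
    # DDPM forward noising: x_t = sqrt(abar_t) * x + sqrt(1 - abar_t) * eps
    eps = rng.standard_normal(x.shape)
    return np.sqrt(alpha_bar[step]) * x + np.sqrt(1 - alpha_bar[step]) * eps

rng = np.random.default_rng(0)
action = rng.standard_normal(16)          # a clean "action chunk"
print(alpha_bar[0])   # ~1.0: at t=0 the variable stays clean  -> acts as a condition
print(alpha_bar[T])   # ~0.0: at t=T all signal is destroyed   -> acts as a mask
```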

4. Detailed explanation of method

4.1 Problem setting

There are two types of data: action-labeled expert data $\mathcal{D}_e=\{(o_i, a_i, o'_i)\}_{i=1}^N$ and action-free video data $\mathcal{D}_{af}=\{(o_i, o_{i+1})\}_{i=1}^M$. The same model is expected to represent four distributions:

| Mode | Distribution | Purpose |
| --- | --- | --- |
| Policy | $p(a|o)$ | Given the current observation, sample the action to execute. |
| Forward dynamics | $p(o'|o, a)$ | Given an action, predict the next observation. |
| Inverse dynamics | $p(a|o, o')$ | Given a target next observation, infer the action that reaches it. |
| Video prediction | $p(o'|o)$ | Predict future observations without specifying actions. |

4.2 Coupled Video-Action Diffusion

UWM's score/noise-prediction network takes the current observation $o$, the noised action $a_{t_a}$, the noised next observation $o'_{t_{o'}}$, the action timestep $t_a$, and the next-observation timestep $t_{o'}$, and outputs two noise predictions $\epsilon_a^\theta, \epsilon_{o'}^\theta$.

During training the model sees all noise-level combinations of action and future observation, so it learns to predict actions from images, predict future images from actions, and share information when both are partially noised.

$$\ell(\theta)= \mathbb{E}\left[ w_a\|\epsilon_a^\theta-\epsilon_a\|_2^2+ w_{o'}\|\epsilon_{o'}^\theta-\epsilon_{o'}\|_2^2 \right]$$ $$\epsilon_a^\theta, \epsilon_{o'}^\theta=s_\theta(o, a_{t_a}, o'_{t_{o'}}, t_a, t_{o'})$$
- $a_{t_a}$: the noised action chunk, $a_{t_a}=\sqrt{\bar\alpha_{t_a}}a+\sqrt{1-\bar\alpha_{t_a}}\epsilon_a$.
- $o'_{t_{o'}}$: the noised latent of the next observation, $o'_{t_{o'}}=\sqrt{\bar\alpha_{t_{o'}}}o'+\sqrt{1-\bar\alpha_{t_{o'}}}\epsilon_{o'}$.
- $w_a, w_{o'}$: weights of the action loss and the image loss; both are 1.0 in the appendix hyperparameter table.
UWM method
Figure 2. UWM training and inference: robot data samples $t_a, t_{o'}$ randomly; video data fixes $t_a=T$; at inference, the timesteps control marginal/conditional behavior.

4.3 Four inference modes

| Inference mode | Timestep setting | Meaning |
| --- | --- | --- |
| Policy $p(a|o)$ | $t_{o'}=T$, $o'_T\sim\mathcal{N}(0, I)$; reverse-diffuse $a$. | Marginalize the future observation and generate only the action. |
| Video prediction $p(o'|o)$ | $t_a=T$, $a_T\sim\mathcal{N}(0, I)$; reverse-diffuse $o'$. | Marginalize the action and predict only future video. |
| Forward dynamics $p(o'|o, a)$ | $t_a=0$, $a_0=a$; reverse-diffuse $o'$. | Condition on the clean action to predict the next observation. |
| Inverse dynamics $p(a|o, o')$ | $t_{o'}=0$, $o'_0=o'$; reverse-diffuse $a$. | Condition on the clean target next observation to generate the action. |
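Expressed as code, the table above reduces to a small dispatch over initial timesteps; a sketch with names of my own choosing, not taken from the released code:

```python
T = 100  # training diffusion steps (appendix)

def uwm_mode(mode, action=None, next_obs=None):
    """Timestep settings for UWM's four inference modes.

    Returns ((t_a, a_init), (t_o, o_init)). A modality pinned at t = T is
    held as pure noise (marginalized); pinned at t = 0 it is a clean
    condition; "sample" means it is generated by reverse diffusion T -> 0.
    """
    settings = {
        "policy":  (("sample", None), (T, "noise")),   # p(a | o)
        "video":   ((T, "noise"), ("sample", None)),   # p(o' | o)
        "forward": ((0, action), ("sample", None)),    # p(o' | o, a)
        "inverse": (("sample", None), (0, next_obs)),  # p(a | o, o')
    }
    return settings[mode]
```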

4.4 Architecture

UWM uses a diffusion transformer with AdaLN conditioning. The current observation is encoded by a ResNet-18; $t_a$ and $t_{o'}$ are encoded with sinusoidal timestep embeddings; image diffusion runs in latent space. The original $224\times224\times3$ image is encoded into a $28\times28\times4$ latent by a frozen SDXL VAE and then cut into patch embeddings by a $(4, 4, 2)$ spatiotemporal patchifier. Each step of the action chunk is encoded by a shallow MLP, concatenated with the image patch embeddings and learnable register tokens, and fed into the transformer.
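A shape-level sketch of how the token sequence comes together, using the numbers above; the action dimension (7 = 6-DoF delta pose + gripper), the embedding layers, and the memory layout are my guesses, not the released code:

```python
import torch

B, F, C, H, W = 1, 2, 4, 28, 28        # batch, latent frames, SDXL-VAE latent C/H/W
ph, pw, pt = 4, 4, 2                   # the (4, 4, 2) spatiotemporal patchifier
D, h_a, act_dim, n_reg = 768, 16, 7, 8 # embed dim, action chunk, action dim, registers

latent = torch.randn(B, F, C, H, W)    # VAE-encoded (noisy) next observation
# Fold each (pt, ph, pw) neighborhood into one token:
x = latent.reshape(B, F // pt, pt, C, H // ph, ph, W // pw, pw)
x = x.permute(0, 1, 4, 6, 2, 3, 5, 7).reshape(B, -1, pt * C * ph * pw)
obs_tokens = torch.nn.Linear(pt * C * ph * pw, D)(x)       # (1, 49, 768)

act_mlp = torch.nn.Sequential(torch.nn.Linear(act_dim, D), torch.nn.GELU())
act_tokens = act_mlp(torch.randn(B, h_a, act_dim))         # one token per action step
registers = torch.zeros(1, n_reg, D).expand(B, -1, -1)     # learnable in practice

tokens = torch.cat([obs_tokens, act_tokens, registers], dim=1)
print(tokens.shape)                    # torch.Size([1, 73, 768]): 49 + 16 + 8 tokens
```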

UWM transformer
Figure 3. UWM block: AdaLN injects the current observation and two timesteps; register tokens are used for cross-modal information exchange.
forward dynamics visualization
Figure 4. Forward dynamics visualization: given the current observation and action, the predicted next observation is close to the true next observation.

4.5 Training and implementation details

The appendix gives the key hyperparameters: observation horizon $h_o=2$, action horizon $h_a=16$, execution horizon $h'_a=8$, embedding dim 768, 12 transformer layers, 12 heads, 8 registers; 100 training diffusion steps, 10 inference steps with a DDIM sampler; batch size 36×4 for pretraining and 36×2 for finetuning; AdamW with learning rate $10^{-4}$ and weight decay $10^{-6}$.
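Collected in one place as a config sketch (key names are mine; the values are from the appendix):

```python
UWM_CONFIG = {
    "obs_horizon": 2, "action_horizon": 16, "exec_horizon": 8,
    "embed_dim": 768, "depth": 12, "num_heads": 12, "num_registers": 8,
    "train_diffusion_steps": 100, "inference_steps": 10, "sampler": "DDIM",
    "batch_size_pretrain": 36 * 4, "batch_size_finetune": 36 * 2,
    "optimizer": "AdamW", "lr": 1e-4, "weight_decay": 1e-6,
    "loss_weights": {"action": 1.0, "image": 1.0},
}
```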

DROID pretraining for 100K steps takes about 24 hours on 4 NVIDIA A100s. At deployment, the policy executes the first $h'_a=8$ predicted actions and then replans.
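The deployment loop is standard receding-horizon control; a minimal sketch in which `sample_actions` and `env` are stand-ins for the UWM policy mode and the robot interface, not the released code:

```python
def deploy(env, sample_actions, episode_len=200, h_exec=8):
    """sample_actions(obs) -> action chunk of length h_a = 16 (policy mode)."""
    obs = env.reset()
    for _ in range(episode_len // h_exec):
        chunk = sample_actions(obs)       # reverse-diffuse a with t_o' = T
        for action in chunk[:h_exec]:     # execute only the first h'_a = 8 actions
            obs = env.step(action)        # then replan from the new observation
    return obs
```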

Algorithm: UWM training step
Input: robot tuple (o, a, o') or action-free video tuple (o, o')
If robot tuple:
    sample t_a ~ U(0, T) and t_o' ~ U(0, T)
    noise the action: a -> a[t_a]
    noise the future observation: o' -> o'[t_o']
    optimize the action-noise loss + image-noise loss
If action-free video tuple:
    set t_a = T and impute the missing action with Gaussian noise
    sample t_o' ~ U(0, T)
    optimize the same UWM objective
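A hedged PyTorch sketch of this training step, combining both branches. `model` stands for $s_\theta$, `alpha_bar` is a $(T{+}1,)$ tensor of cumulative alphas like the schedule sketched in Section 2.3, and the action shape is a guess; this is not the released implementation.

```python
import torch
import torch.nn.functional as F

def uwm_loss(model, alpha_bar, obs, next_obs, action=None, T=100, act_shape=(16, 7)):
    """One UWM training step; next_obs is the (B, C, H, W) VAE latent of o'."""
    B = next_obs.shape[0]
    # --- action branch: robot data gets a random t_a, video data gets t_a = T ---
    if action is not None:
        t_a = torch.randint(0, T + 1, (B,))
        eps_a = torch.randn_like(action)
        ab = alpha_bar[t_a].view(B, 1, 1)
        a_t = ab.sqrt() * action + (1 - ab).sqrt() * eps_a
    else:
        t_a = torch.full((B,), T)                   # fully "masked" action
        a_t = eps_a = torch.randn(B, *act_shape,    # impute with pure noise:
                                  device=next_obs.device)  # abar_T ~ 0, so a_T ~ eps
    # --- future-observation branch: always an independent random t_o' ---
    t_o = torch.randint(0, T + 1, (B,))
    eps_o = torch.randn_like(next_obs)
    ab = alpha_bar[t_o].view(B, 1, 1, 1)
    o_t = ab.sqrt() * next_obs + (1 - ab).sqrt() * eps_o
    # --- two epsilon-prediction losses, weights w_a = w_o' = 1.0 ---
    eps_a_hat, eps_o_hat = model(obs, a_t, o_t, t_a, t_o)
    return F.mse_loss(eps_a_hat, eps_a) + F.mse_loss(eps_o_hat, eps_o)
```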

5. Experiments and results

5.1 Experimental setup

Real-robot experiments use a Franka Panda on the DROID manipulation platform. Observations come from two scene cameras and a wrist camera, plus an overhead evaluation camera used to align initializations; the control frequency is 10 Hz; the action space is delta end-effector pose plus a continuous gripper state.

DROID dataset
Figure 5. Real-robot pretraining/finetuning data: 2000 action-labeled DROID trajectories are used for pretraining, and another 2000 action-free trajectories serve as cotraining videos.
robot setup
Figure 6. Real robot camera setup: two scene cameras, a wrist camera, and an evaluation camera.

The five real tasks are Stack-Bowls, Block-Cabinet, Paper-Towel, Hang-Towel, and Rice-Cooker. The first four are evaluated over 50 initializations each; Rice-Cooker, being much harder, is evaluated over only 20 initializations close to the data distribution. Each initialization gives each method 3 attempts.

real tasks
Figure 7. Start/end/OOD settings for the five real tasks. OOD mainly adds visual disturbances such as unseen distractor objects and opened drawers/cabinet doors.

5.2 Real Robot Results

| Task | UWM pretrain / cotrain (ID) | UWM pretrain / cotrain (OOD) | Best baseline overview |
| --- | --- | --- | --- |
| Stack-Bowls | 0.86 / 0.92 | 0.76 / 0.84 | GR1 pretrain 0.66/0.48; DP 0.48/0.36; PAD very low. |
| Block-Cabinet | 0.76 / 0.84 | 0.60 / 0.72 | GR1 cotrain 0.74/0.64 is closest; DP OOD 0.26. |
| Paper-Towel | 0.78 / 0.86 | 0.78 / 0.84 | GR1 pretrain 0.60/0.60; DP 0.52/0.48. |
| Hang-Towel | 0.82 / 0.86 | 0.64 / 0.76 | GR1 0.66/0.48; DP 0.64/0.28. |
| Rice-Cooker (ID only) | 0.60 / 0.65 | — | GR1 pretrain 0.40, cotrain 0.25; DP 0.35; PAD 0. |
average real robot results
Figure 8. Average real-robot success rates. UWM pretraining already beats the baselines, and action-free video cotraining improves it further.

The paper explains the real-robot results as follows: by sharing action/video features and training on different conditional/marginal distributions, UWM learns the causal relationship between actions and image observations. GR1 is a strong autoregressive baseline, but video cotraining sometimes dilutes its action-learning signal; PAD's poor performance is attributed to its raw-pixel concatenation conditioning, which is hard to learn at the same model capacity.

5.3 LIBERO Simulation

The simulation experiments pretrain on 4500 trajectories from the LIBERO-90 split of LIBERO-100. Five LIBERO-10 tasks are randomly selected for evaluation, each finetuned with 50 expert demos. For OOD, the authors widen the object initialization range to 0.03 and remove background objects.

LIBERO datasets
Figure 9. LIBERO pre-training and fine-tuning tasks: Book-Caddy, Soup-Cheese, Bowl-Drawer, Moka-Moka, Mug-Mug.
| Method | Book-Caddy | Soup-Cheese | Bowl-Drawer | Moka-Moka | Mug-Mug | Average |
| --- | --- | --- | --- | --- | --- | --- |
| UWM | 0.91 ± 0.07 | 0.93 ± 0.01 | 0.80 ± 0.02 | 0.68 ± 0.02 | 0.65 ± 0.01 | 0.79 ± 0.11 |
| DP | 0.73 ± 0.10 | 0.88 ± 0.02 | 0.77 ± 0.02 | 0.65 ± 0.03 | 0.53 ± 0.05 | 0.71 ± 0.12 |
| PAD | 0.78 ± 0.04 | 0.47 ± 0.04 | 0.74 ± 0.05 | 0.59 ± 0.08 | 0.25 ± 0.04 | 0.57 ± 0.19 |
| GR1 | 0.77 ± 0.03 | 0.65 ± 0.05 | 0.62 ± 0.03 | 0.46 ± 0.04 | 0.38 ± 0.05 | 0.58 ± 0.14 |

5.4 Analysis and Ablations

| Experiment | Setting | Result / conclusion |
| --- | --- | --- |
| Forward dynamics | Fix $t_a=0$, use ground-truth actions, diffuse future observations. | Predicted next observations closely match the true ones, showing conditional dynamics modeling works. |
| Inverse dynamics tracking | Given expert future observations, generate actions with inverse dynamics to track the trajectory. | Under a limited trajectory length, Book-Caddy policy 0.47 vs inverse 0.65; Soup-Cheese policy 0.26 vs inverse 0.55. |
| Categorized OOD | Stack-Bowls / Block-Cabinet under lighting, background, and clutter shifts. | Stack-Bowls: UWM cotrain 21/30, pretrain 15/30, DP 12/30; Block-Cabinet: 15/30, 8/30, 6/30. |
| Registers | 8 registers / 4 registers / no registers / cross-attention UWM. | Book-Caddy 0.88 / 0.83 / 0.81 / 0.78; Soup-Cheese 0.90 / 0.86 / 0.85 / 0.86. |
| Learning objective | Future-obs reconstruction vs current-obs reconstruction vs no reconstruction. | Stack-Bowls 0.86/0.70/0.48; Block-Cabinet 0.76/0.66/0.60: dynamics prediction is more useful than pure current-image reconstruction. |
| Internet videos | Robot data + robot videos / robot data + Internet videos / robot data only. | Stack-Bowls 0.92/0.88/0.86; Block-Cabinet 0.84/0.80/0.76. Internet videos help, but less than in-domain robot videos. |
OOD settings
Figure 10. Categorized OOD: lighting, background, clutter.
scratch vs pretraining
Figure 11. From scratch vs pretrained: trained from scratch, UWM and DP perform similarly, but UWM benefits more from pretraining.
robot camera views
Figure 12. Robot perspective: camera observations under ID, standard OOD, and categorized OOD conditions.
internet videos
Figure 13. Internet video dataset from Kinetics-400 and Something-Something-v2.
evaluation tracker
Figure 14. Real-robot evaluation tracker: overlaying a reference initial frame on the current frame to systematically align random initializations.

6. Analysis and discussion within the paper

6.1 Core explanation given by the author

6.2 Failure modes

The appendix on real-world failure modes points out that even with multiple cameras, difficult viewpoints remain where an object is visible to only one camera, and the behavior of objects themselves can cause failures: the paper towel can tip over when placed on the wooden platform at a bad angle, and in Stack-Bowls the baselines confuse the blue bowl with distractors after picking up the pink bowl.

7. Analysis, Limitations and Boundaries

7.1 The most valuable part of this paper

The most valuable contribution is combining "learning dynamics from videos" and "learning actions from demonstrations" into a single controllable continuous-diffusion probabilistic model, rather than bolting on an independent video predictor. Through independent $t_a, t_{o'}$, policy, forward dynamics, inverse dynamics, and video prediction become different inference slices of the same training objective. This directly explains why action-free video can enter training: in the diffusion sense, a missing action is simply a fully noised variable with $t_a=T$.

7.2 Why the results hold up

The evidence chain is fairly complete: the real-robot tasks cover rigid bodies, deformable objects, and long-horizon tasks; every task has ID/OOD splits; DROID robot-video cotraining and Internet-video cotraining are verified independently; LIBERO provides a standard simulation benchmark; and the forward/inverse dynamics results show UWM does not merely improve policy scores but genuinely operates in its different inference modes. Ablations further quantify the contributions of future-observation reconstruction, registers, and AdaLN conditioning.

7.3 Author's statement of limitations

7.4 Applicable boundaries

UWM suits scenarios where you have action-labeled robot trajectories and want to additionally exploit action-free robot/Internet video. It assumes that observations, actions, and future observations can be modeled with continuous diffusion, and that future images provide useful dynamics supervision. The paper does not yet provide sufficient evidence for long-horizon planning, cross-embodiment transfer, high-precision contact-force control, or long-range transfer from human video to robot actions.

8. Reproducibility Audit

| Element | Information given in the paper/project | Audit status |
| --- | --- | --- |
| Code | The project page links the official GitHub: WEIRDLabUW/unified-world-model. | Found |
| Source and figures | arXiv provides the LaTeX source and 14 standalone PDF figures; all converted to PNG for this report. | Verifiable |
| Data | DROID, LIBERO, Kinetics-400, and Something-Something-v2 are public; the real-task finetuning demos were collected by the authors. | Partially reproducible |
| Model structure | ResNet-18, SDXL VAE, DiT/AdaLN, 12 layers, 12 heads, 8 registers, and action/image latent shapes are given. | Largely complete |
| Training configuration | 100K pretraining, 10K/20K/50K finetuning, DDIM with 10 inference steps, AdamW, batch size/lr/weight decay are given. | Largely complete |
| Real-robot evaluation | Task definitions, demo counts, evaluation conditions, 3 attempts per initialization, OOD settings, and failure modes are given. | Largely complete |
Coverage self-check: Abstract, Introduction, Preliminaries, Methods, Experiments, Related Work, Discussion, Limitations, Acknowledgments, and the four appendix files are covered; the appendix implementation details, real/simulation setups, additional ablations, and Internet-video experiments have been integrated into the corresponding sections.