
World Action Models: The Next Frontier in Embodied AI

Siyin Wang1,2,*,‡, Junhao Shi1,2,*, Zhaoyang Fu1,*, Xinzhe He1,*, Feihong Liu1,*,
Chenchen Yang1,2, Yikang Zhou2, Zhaoye Fei1, Jingjing Gong2, Jinlan Fu1,
Mike Zheng Shou3, Xuanjing Huang1,2, Xipeng Qiu1,2, Yu-Gang Jiang1,†

1Fudan University 2Shanghai Innovation Institute 3National University of Singapore

*Equal Contribution, ‡Project Lead, †Corresponding Author

This is the first systematic survey on World Action Models (WAMs) — embodied foundation models that unify predictive world modeling with action generation. We formalize the definition of WAMs, trace how Vision-Language-Action (VLA) models and world models converge, and organize existing methods into Cascaded and Joint architectures. We also review the training data ecosystem, synthesize evaluation protocols, and discuss open challenges, offering a comprehensive roadmap for this emerging field.

Temporal evolution and taxonomy of representative World Action Model methods
Temporal evolution and taxonomy of representative World Action Models. The figure organizes the field into Joint and Cascaded WAM branches and highlights the dominant architectural directions explored across recent work.

definition

Definitions and Formalism

World Action Models are defined by forward predictive modeling and coupled action generation. The key boundary is that future-state prediction must be part of the policy, not just an auxiliary backbone or external simulator.

Conceptual definition of VLA, WM, and WAM
Conceptual definition and comparison of VLA, WM, and WAM. The figure contrasts their input-output formulations and highlights that WAMs jointly predict future observations and executable actions rather than modeling either side alone.
01

Vision-Language-Action (VLA)

VLAs are embodied foundation models that frame robot control as a multimodal sequence modeling task. They process the current observation o and a language instruction l to generate an action a under the objective p(a | o, l).

In this formulation, observation and language map directly to action. Semantic grounding is strong, but the model remains fundamentally reactive because future physical evolution is not explicitly represented.
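To make that interface concrete, here is a minimal sketch of the p(a | o, l) mapping; the module, feature dimensions, and action dimensionality are illustrative placeholders rather than any specific VLA.

```python
import torch
import torch.nn as nn

class ToyVLAPolicy(nn.Module):
    """Illustrative VLA interface p(a | o, l): observation + instruction -> action."""

    def __init__(self, obs_dim=512, text_dim=512, action_dim=7):
        super().__init__()
        # Fused observation/language features map directly to an action; no future
        # state is ever predicted, which is why this mapping stays reactive.
        self.head = nn.Sequential(
            nn.Linear(obs_dim + text_dim, 256), nn.ReLU(), nn.Linear(256, action_dim)
        )

    def forward(self, obs_feat, text_feat):
        # a ~ p(a | o, l)
        return self.head(torch.cat([obs_feat, text_feat], dim=-1))
```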

02

World Models (WM)

World models are predictive transition functions that internalize environment dynamics. Their role is to model how a current state evolves under intervention, typically as p(o' | o, a), so they can simulate future observations rather than output a policy by themselves.

This makes them predictive rather than directly executable: they imagine what the world will look like after an action, but do not by themselves define a robot policy.
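For contrast, a minimal sketch of the transition interface p(o' | o, a), again with purely illustrative names and sizes: the model consumes an action and returns a predicted next state, but never emits an action.

```python
import torch
import torch.nn as nn

class ToyLatentWorldModel(nn.Module):
    """Illustrative transition model p(o' | o, a): rolls the state forward, emits no action."""

    def __init__(self, latent_dim=128, action_dim=7):
        super().__init__()
        self.transition = nn.Sequential(
            nn.Linear(latent_dim + action_dim, 256), nn.ReLU(), nn.Linear(256, latent_dim)
        )

    def forward(self, obs_latent, action):
        # Predict the next latent observation given the current latent and an intervention.
        return self.transition(torch.cat([obs_latent, action], dim=-1))
```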

03

World Action Models (WAMs)

WAMs unify environmental dynamics modeling with motor control. A model qualifies as a WAM only when it performs forward predictive modeling of future state and couples action generation to that anticipated future, targeting p(o', a | o, l).

The defining shift is that future-state synthesis and executable action are learned together inside one embodied policy framework rather than being treated as unrelated outputs.
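One way to read the joint objective p(o', a | o, l) is as a shared trunk with two coupled heads, one for the imagined next state and one for the action conditioned on it. The sketch below is an illustrative assumption, not a prescription from the surveyed systems.

```python
import torch
import torch.nn as nn

class ToyWAM(nn.Module):
    """Illustrative WAM for p(o', a | o, l): one trunk, coupled future-state and action heads."""

    def __init__(self, obs_dim=512, text_dim=512, latent_dim=128, action_dim=7):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim + text_dim, 256), nn.ReLU())
        self.future_head = nn.Linear(256, latent_dim)                # anticipated next state o'
        self.action_head = nn.Linear(256 + latent_dim, action_dim)  # action conditioned on o'

    def forward(self, obs_feat, text_feat):
        h = self.trunk(torch.cat([obs_feat, text_feat], dim=-1))
        future = self.future_head(h)
        action = self.action_head(torch.cat([h, future], dim=-1))
        return future, action  # both outputs are supervised jointly
```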

vlas and wms

VLAs and World Models: Foundations and Early Integration

WAMs emerge from the convergence of Vision-Language-Action policies and predictive world models. The survey positions VLA policies, world models, and WM-for-VLA integration as the background that made predictive embodied policies possible.

01

Vision-Language-Action Models

VLAs scale robot control through language-conditioned foundation policies

VLAs evolved from task-specific imitation learning into language-conditioned policies that fuse visual observations and task prompts. Modern systems inherit LVLM priors, then generate actions through autoregressive tokenization or diffusion-style action heads.

Key point
VLA models bring internet-scale semantic understanding into robot control and support open-vocabulary, long-horizon manipulation.
Limitation
Even with 3D, depth, force, and tactile inputs, most VLAs remain reactive image-to-action mappings without explicit world dynamics.
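As a concrete illustration of the autoregressive action-tokenization route mentioned above, the sketch below discretizes continuous actions into a fixed vocabulary of bins. The bin count, ranges, and function names are assumptions; real systems differ in their exact schemes.

```python
import numpy as np

NUM_BINS = 256  # illustrative vocabulary size per action dimension

def tokenize_action(action, low, high):
    """Map each continuous action dimension to a discrete bin index."""
    scaled = (np.clip(action, low, high) - low) / (high - low)   # normalize to [0, 1]
    return np.minimum((scaled * NUM_BINS).astype(int), NUM_BINS - 1)

def detokenize_action(tokens, low, high):
    """Recover an approximate continuous action from bin centers."""
    return low + (tokens + 0.5) / NUM_BINS * (high - low)

# Example: a 7-DoF end-effector delta plus gripper command, all bounded in [-1, 1].
low, high = np.full(7, -1.0), np.full(7, 1.0)
tokens = tokenize_action(np.array([0.1, -0.3, 0.0, 0.5, -0.9, 0.2, 1.0]), low, high)
approx = detokenize_action(tokens, low, high)
```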
02

World Models

World models provide predictive structure over future state evolution

World models learn internal representations of environment dynamics and predict the consequences of actions, language instructions, or multimodal context. The survey distinguishes action-conditioned, language-conditioned, and embodied world models.

Role
World models enable simulation, planning, and decision-making by forecasting future states before real execution.
Design
Explicit models predict pixels or videos, while implicit models learn compact latent dynamics for efficiency and abstraction.
03

World Models for VLA

World models extend VLAs beyond direct policy learning

World models let VLA agents imagine future observations, generate trajectories, estimate outcomes, and test policies before physical execution. The survey frames their contribution through two routes: improving learning and enabling scalable evaluation.

Learning
World models augment imitation data, support model-based reinforcement learning, and derive reward signals from predicted futures.
Evaluation
World models act as data-driven simulators for reproducible, safety-aware rollout testing with less real-world deployment cost.
World models for VLA overview diagram
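A minimal sketch of the learning route above: scoring candidate action sequences by imagined rollouts inside a learned world model. Here `predict_next` and `estimate_reward` are hypothetical stand-ins for a trained transition model and an outcome scorer.

```python
import numpy as np

def select_action(obs_latent, candidates, predict_next, estimate_reward, horizon=3):
    """Pick the first action of the best-scoring imagined rollout; nothing is
    executed on the real robot while candidates are being evaluated."""
    best_score, best_first_action = -np.inf, None
    for seq in candidates:                        # each seq has shape (horizon, action_dim)
        state, score = obs_latent, 0.0
        for t in range(horizon):
            state = predict_next(state, seq[t])   # imagine the next state
            score += estimate_reward(state)       # value the imagined outcome
        if score > best_score:
            best_score, best_first_action = score, seq[0]
    return best_first_action
```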

architecture

Architecture

The decisive design question is how world prediction and action generation are structurally coupled. The survey organizes methods into Cascaded WAMs and Joint WAMs, each with different training regimes, representations, and latency trade-offs.

Cascaded WAM

future plan -> action

A world model first synthesizes the anticipated future state, then a separate action model decodes executable commands from that plan. This factorization gives clear modular structure, but makes coupling quality between the two stages the central bottleneck.

Explicit planning: pixel-space carriers such as video, flow, depth, normals, or 4D structure
Implicit planning: latent features, future tokens, masks, and hidden-state conditioning
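One common realization of this cascade is sketched below with placeholder modules and sizes: a planner first synthesizes a future-state representation, and a separate decoder then recovers the action from the current observation and that predicted plan. This is an assumption-level illustration, not any specific method from the survey.

```python
import torch
import torch.nn as nn

class ToyCascadedWAM(nn.Module):
    """Illustrative two-stage cascade: world model -> predicted plan -> action decoder."""

    def __init__(self, obs_dim=512, text_dim=512, plan_dim=256, action_dim=7):
        super().__init__()
        # Stage 1: synthesize the anticipated future (explicit pixels or implicit latents).
        self.planner = nn.Sequential(
            nn.Linear(obs_dim + text_dim, 512), nn.ReLU(), nn.Linear(512, plan_dim)
        )
        # Stage 2: decode executable commands from the current observation and the plan.
        # The two stages can be trained separately, which is what makes the design modular.
        self.action_decoder = nn.Sequential(
            nn.Linear(obs_dim + plan_dim, 256), nn.ReLU(), nn.Linear(256, action_dim)
        )

    def forward(self, obs_feat, text_feat):
        plan = self.planner(torch.cat([obs_feat, text_feat], dim=-1))
        action = self.action_decoder(torch.cat([obs_feat, plan], dim=-1))
        return plan, action
```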

Joint WAM

future + action

Future states and actions are predicted within one shared model and trained under joint supervision. The main question becomes how coupling is realized across discrete tokenization, autoregressive sequence modeling, diffusion, flow matching, or parallel generation.

Autoregressive generation: shared token spaces, multi-head routing, unified vocabularies
Diffusion-based generation: single-engine or multi-engine diffusion and non-autoregressive generation
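As a sketch of the autoregressive joint route, the snippet below interleaves future-observation tokens and action tokens into one sequence over a shared vocabulary so a single decoder can supervise both. The vocabulary sizes and offset scheme are illustrative assumptions.

```python
OBS_VOCAB, ACT_VOCAB = 8192, 256   # illustrative vocabulary sizes

def interleave(obs_tokens_per_step, act_tokens_per_step):
    """Build one training sequence in which imagined-frame tokens and action tokens
    share a single vocabulary, so one next-token objective supervises both."""
    sequence = []
    for obs_toks, act_toks in zip(obs_tokens_per_step, act_tokens_per_step):
        sequence.extend(obs_toks)                           # visual tokens for the next frame
        sequence.extend(t + OBS_VOCAB for t in act_toks)    # action bins, offset into the vocab
    return sequence

# Example: two timesteps, four visual tokens and seven action tokens per step.
seq = interleave([[5, 17, 301, 42], [8, 9, 10, 11]], [[3] * 7, [250] * 7])
```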
Cascaded World Action Model architecture diagram
Cascaded WAM architectures first predict future state representations and then derive executable actions from that predicted plan.
Joint World Action Model architecture diagram
Joint WAM architectures model future world state and action generation inside one shared predictive system with tighter coupling.

training data

Training Data

WAM data is a mixture-design problem, not just scale. The survey groups data into robot-centric teleoperation, portable human demonstration, simulation, and human or egocentric video.

Overview of embodied data sources for training World Action Models
Robot-Centric Teleoperation

Aligned state-action supervision

High-frequency robot trajectories give aligned state-action pairs, kinematics, proprioception, and contact-rich control with low sim-to-real gap.

Portable Human Demonstration

Low-cost real-world diversity

UMI-style collection brings retargeted human demonstrations from in-the-wild scenes, adding low-cost diversity, dexterous motion, and multi-view interaction.

Simulation

Privileged physics supervision

Simulators provide scalable variation with privileged depth, pose, collision, 3D structure, and controllable physics for dynamics learning.

Human and Ego-Centric Data

Massive-scale world priors

Web and egocentric video supply passive dynamics, long-horizon activity, semantic diversity, and open-world priors beyond robot labs.

evaluation

Evaluation

Evaluation has two axes: world modeling capability and action policy capability. WAMs cannot be judged by visual plausibility alone, and they also cannot be judged by task success alone without testing whether imagined futures are physically and causally meaningful.

world modeling capability

How to Evaluate World Modeling Capability?

The survey treats world-model evaluation as a three-part question: whether generated futures are visually faithful, physically plausible, and still informative enough to recover executable control.

Visual fidelity

Reconstruction, perception, and realism

PSNR and SSIM cover low-level fidelity; LPIPS, DreamSim, and DINO test perceptual and semantic consistency; FVD measures distribution-level realism and temporal quality.
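A minimal example of the low-level fidelity metrics using scikit-image; LPIPS, DreamSim, DINO, and FVD require learned networks and are omitted here. The random frames are placeholders for a predicted and a ground-truth observation.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

# Placeholder frames standing in for a ground-truth and a predicted future observation.
gt = np.random.randint(0, 256, (128, 128, 3), dtype=np.uint8)
pred = np.clip(gt.astype(int) + np.random.randint(-10, 10, gt.shape), 0, 255).astype(np.uint8)

psnr = peak_signal_noise_ratio(gt, pred, data_range=255)                 # low-level fidelity
ssim = structural_similarity(gt, pred, data_range=255, channel_axis=-1)  # structural fidelity
print(f"PSNR: {psnr:.2f} dB  SSIM: {ssim:.3f}")
```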

Physical commonsense

Object dynamics and trajectory plausibility

Physical commonsense asks whether generated worlds obey object continuity, material behavior, causal order, and plausible motion. Benchmarks such as VideoPhy, PhyGenBench, VBench-2.0, WorldModelBench, Physics-IQ, WorldScore, and EWMBench cover both physical interactions and long-horizon motion consistency.

Action plausibility

Can imagined futures support control?

WorldSimBench and the IDM Turing Test ask whether generated futures preserve enough action-relevant information to infer correct controls and support downstream execution.

action policy capability

How to Evaluate Action Policy?

The survey reviews benchmark families by robot morphology and manipulation setting, emphasizing generalization, long-horizon control, dexterity, sim-to-real transfer, and real-device performance.

General manipulation

Multi-task and broad policy capability

Meta-World, RLBench, ManiSkill, LIBERO, RoboCasa, GemBench, CALVIN, and related suites test multi-task learning, scaling, language conditioning, robustness, and long-horizon execution.

Bimanual and humanoid form

Higher-DoF coordinated control

RoboTwin, BiGym, HumanoidBench, and HumanoidGen raise the difficulty through dual-arm coordination, humanoid locomotion, tactile sensing, and large action spaces.

Mobile manipulation

Navigation plus manipulation

ManipulaTHOR, HomeRobot, and BEHAVIOR-1K evaluate policies that must combine scene navigation, open-vocabulary perception, and object interaction inside larger environments.

Contact and deformation

Physics beyond rigid-body control

SoftGym, PlasticineLab, DaXBench, TacSL, and ManiFeel assess cloth, liquid, deformable objects, and tactile-driven fine manipulation where contact modeling becomes central.

Real-device evaluation

Deployment in physical environments

RoboArena, RoboChallenge, and Maniparena measure whether policy performance survives the reality gap and remains reliable when moved onto real robots.

open challenges

Open Challenges and Opportunities

WAMs are still a nascent paradigm. The next phase depends on resolving problems of architectural coupling, multimodal state representation, data-mixture design, long-horizon planning, inference efficiency, evaluation, and safe deployment.

01

Architectural Coupling

Systematic comparisons of Cascaded and Joint designs under matched scale, data, and evaluation protocols are still missing.

02

Multimodal Physical State

RGB-only prediction misses tactile, force, acoustic, and deformation cues that matter most in contact-rich manipulation.

03

Data Utilization and Mixture Design

The marginal role of robot data, simulation, and egocentric human video is still poorly understood.

04

Long-Horizon Planning

Distribution drift, compounding action error, and weak temporal abstraction still block sustained predictive control.

05

Inference Latency and Efficiency

Diffusion and autoregressive prediction remain too slow for many closed-loop settings without aggressive compression.

06

Evaluation and Safe Deployment

The field still lacks joint metrics for causal consistency between imagined futures and executed actions, plus robust safety checks.

library

Paper Library

This library is populated from the survey repository and keeps the paper list searchable. For quick browsing, the top-level filters focus only on the two architectural families emphasized by the paper: Cascaded WAM and Joint WAM.
