
WorldVLA: Towards Autoregressive Action World Model

Authors: Jun Cen, Chaohui Yu, Hangjie Yuan, Yuming Jiang, Siteng Huang, Jiayan Guo, Xin Li, Yibing Song, Hao Luo, Fan Wang, Deli Zhao, Hao Chen

Organization: DAMO Academy, Alibaba Group; Hupan Lab; Zhejiang University

arXiv: 2506.21539, submission date 2025-06-26; topic: Robotics and Artificial Intelligence

PDF: arXiv PDF; Code: GitHub link given in the paper, which currently redirects to RynnVLA-002; Model page: Hugging Face WorldVLA

1. Quick overview of the paper

One-sentence summary: WorldVLA places a discrete autoregressive VLA action model and an action-conditioned world model in the same Chameleon-style token-generation framework, so the model simultaneously learns to "look at the image and act" and to "predict the next frame given the action", and it uses an action attention mask to mitigate error accumulation in autoregressive action chunks.
What the paper sets out to solve: Existing VLA models usually treat actions only as outputs and lack the ability to take actions as inputs to understand environment dynamics; world models can predict future visual states from actions but cannot directly output robot actions. WorldVLA attempts to unify these two capabilities.
The authors' approach: Discretize text, images, and 7-dimensional robot actions into tokens, mix the action loss and world loss when training a single autoregressive LLM, and design an attention mask for action-chunk generation that attends only to text and images, not to previous action tokens.
Most important results: On LIBERO, WorldVLA 512×512 reaches an 81.8% average success rate without large-scale robot pre-training, higher than OpenVLA's 76.5%. Relative to the action model alone, action-world joint training raises the average SR from 62.8% to 67.2%, and with action chunking and the new mask it rises from 76.6% to 78.1%.
Things to note when reading: The paper has no appendix, and many implementation hyperparameters appear only in the main text; "WorldVLA" in the tables sometimes refers to action-world joint training and sometimes to the 256/512-resolution model, so it must be disambiguated using the table column headers.

keywords

Vision-Language-Action · World Model · Autoregressive Modeling · Action Chunking · LIBERO


2. Motivation and related work

2.1 Problems to be solved

The paper divides robot models into two complementary but individually incomplete capabilities: the VLA/action model outputs actions from images and language, while the world model predicts future visual states from the current visual state and actions. The former can perform tasks but mostly treats actions as final outputs; the latter can model the environment changes actions cause but cannot directly provide a control policy.

The motivation for WorldVLA is that a model required both to generate actions and to imagine the next frame given those actions is forced, during training, to learn visual semantics, action semantics, and the physics of state transitions simultaneously. The authors call this the mutual enhancement of the action model and the world model.

Action model, world model and action world model comparison
Figure 1. The teaser of the paper: action model is image/text to action, world model is image/action to image, and action world model attempts to unify the two.

2.2 Related work context

| Technical line | Representative methods | Positioning in the paper | WorldVLA's difference |
|---|---|---|---|
| Vision-Language-Action models | RT-1/RT-2, OpenVLA, $\pi_0$, diffusion VLAs | Use an MLLM, or a visual backbone plus an action head, to map observations and instructions to actions. | WorldVLA not only predicts actions but also takes actions as input to train next-frame prediction, so the model learns action-conditioned dynamics. |
| Video / world models | MAGVIT, SVD, Cosmos, iVideoGPT, DWS | Video generation can "imagine the future"; world models further use actions to control future states. | WorldVLA keeps both world prediction and action output, instead of only generating video or only selecting actions. |
| Unified understanding and generation | Chameleon, Emu3, Transfusion, Janus, UVA | Unify understanding and generation in one model or system. | WorldVLA takes the discrete-token + autoregressive-LLM route; UVA is the continuous diffusion-head counterpart in the action-world direction. |

2.3 Comparison table of method types

| Model type | Discrete example | Continuous example | Input | Output |
|---|---|---|---|---|
| Action model | OpenVLA | $\pi_0$ | T + V | A |
| Video prediction model | MAGVIT | SVD | T + V | V |
| World model | iVideoGPT | DWS | T + V + A | V |
| Action world model | WorldVLA | UVA | T + V + A | V + A |

3. Detailed explanation of method

3.1 Overall architecture

WorldVLA is initialized from Chameleon, because Chameleon itself is a discrete token model that unifies image understanding and image generation. On this basis, the paper adds three types of tokenizers: image tokenizer, text tokenizer, and action tokenizer. All modalities end up in the same sequence of autoregressive tokens.

WorldVLA overview
Figure 2. Left, the action model: text and historical images in, a chunk of $K$ actions out. Right, the world model: text, image, and action in, the next image frame out, repeated for $N$ rounds.
| Component | Settings given in the paper | Function |
|---|---|---|
| Image tokenizer | VQ-GAN; compression ratio 16; codebook size 8192; a 256×256 image yields 256 tokens, a 512×512 image yields 1024 tokens. | Discretizes images so the LLM can generate image tokens like text tokens. |
| Text tokenizer | BPE tokenizer; vocabulary size 65,536, including the 8192 image tokens and 256 action tokens. | Provides a single vocabulary covering text, image, and action tokens. |
| Action tokenizer | Each continuous action dimension is discretized into 256 bins; an action is 7 tokens: 3 relative positions, 3 relative angles, and 1 absolute gripper state. | Turns robot control variables into discrete tokens the autoregressive model can predict. |
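A minimal sketch of the per-dimension 256-bin discretization described above. The paper derives bin edges from the training-data range, which is not published, so the `ACTION_LOW`/`ACTION_HIGH` ranges and function names here are invented for illustration:

```python
import numpy as np

# Hypothetical per-dimension ranges (3 relative positions, 3 relative
# angles, 1 absolute gripper state); the real bin edges come from the
# training-data statistics.
ACTION_LOW = np.array([-0.05] * 3 + [-0.5] * 3 + [0.0])
ACTION_HIGH = np.array([0.05] * 3 + [0.5] * 3 + [1.0])
NUM_BINS = 256

def tokenize_action(action: np.ndarray) -> np.ndarray:
    """Map a 7-D continuous action to 7 discrete token ids in [0, 255]."""
    scaled = (action - ACTION_LOW) / (ACTION_HIGH - ACTION_LOW)
    return np.clip((scaled * NUM_BINS).astype(int), 0, NUM_BINS - 1)

def detokenize_action(tokens: np.ndarray) -> np.ndarray:
    """Map token ids back to bin centers in the continuous action space."""
    centers = (tokens + 0.5) / NUM_BINS
    return ACTION_LOW + centers * (ACTION_HIGH - ACTION_LOW)
```

Round-tripping through the tokenizer loses at most one bin width per dimension, which is the quantization error the discrete action model pays relative to continuous (e.g. diffusion) policies.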

3.2 Action Model Data

The task of the action model is to generate actions based on language instructions and image observations. The text prompt form used in the paper is:

What action should the robot take to + task instruction + ?

The sequence form is:

$$\texttt{[BOS]\{text\}} \underbrace{\texttt{[BOI]\{image\}\dots\{image\}[EOI]}}_{\times M} \texttt{[EOS]} \overbrace{ \underbrace{\texttt{[BOA]\{action\}\dots\{action\}[EOA]}}_{\times K} \texttt{[EOS]}}^{\mathcal{L}_{action}}$$

Here $M$ is the number of input history images and $K$ is the length of the action chunk generated at once. During training, only the cross-entropy loss $\mathcal{L}_{action}$ on the action tokens is computed.
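The sequence layout above can be sketched as token-list assembly. The special-token ids and the helper name `build_action_sequence` are hypothetical; the ordering and the loss mask follow the formula, with $\mathcal{L}_{action}$ covering the `[BOA]…[EOA]` chunks and the trailing `[EOS]`:

```python
# Placeholder ids for the special markers; real ids come from the
# Chameleon vocabulary and are not given in the paper.
BOS, EOS, BOI, EOI, BOA, EOA = range(6)

def build_action_sequence(text, images, action_chunks):
    """Assemble the action-model training sequence and its loss mask.

    text: list of text token ids
    images: list of M image-token lists (one per history frame)
    action_chunks: list of K action-token lists (7 tokens each)
    """
    seq = [BOS] + list(text)
    for img in images:                      # x M history frames
        seq += [BOI] + list(img) + [EOI]
    seq += [EOS]
    loss_mask = [0] * len(seq)              # no loss on the prompt part
    for act in action_chunks:               # x K actions in the chunk
        chunk = [BOA] + list(act) + [EOA]
        seq += chunk
        loss_mask += [1] * len(chunk)       # L_action covers these tokens
    seq += [EOS]
    loss_mask += [1]                        # ...and the final [EOS]
    return seq, loss_mask
```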

3.3 World Model Data

The world model's task is to generate the next frame from the current image and the current action. The paper emphasizes that it needs no task instruction, because given the action, the next state is mainly determined by the current state and that action. The text prompt used is:

Generate the next frame based on the current image and the action.
$$\texttt{[BOS]\{text\}} \underbrace{ \texttt{[BOI]\{image\}[EOI][BOA]\{action\}[EOA][EOS]} \overbrace{\texttt{[BOI]\{image\}[EOI][EOS]}}^{\mathcal{L}_{world}} }_{\times N}$$

$N$ is the number of rounds of consecutive next-frame prediction; the paper defaults to $N=1$ to save computation. During training, only the loss on the generated image tokens is computed.

3.4 Action Attention Mask

Ordinary autoregressive causal masks let the current token attend to all past tokens. This is natural for text and image generation, but it creates a problem for action chunks: later action tokens come to depend on earlier predicted action tokens. Since the base MLLM is pre-trained mainly on text and images, its generalization in the action modality is weaker than in text and images, so once an earlier action is predicted wrongly, the error propagates within the chunk.

The mask strategy proposed by the author is: only text and image input are allowed to be viewed when generating the current action, and previous actions are not allowed to be viewed. In this way, $K$ actions are semantically closer to parallel prediction, and each action is directly determined by visual observation.

Attention mask comparison
Figure 3. (a) Default action model causal mask; (b) WorldVLA's action mask, which blocks previous actions; (c) World model still uses conventional causal mask.
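A sketch of how such a mask could be materialized. The exact per-action token count (here `[BOA]` + 7 action tokens + `[EOA]` = 9) and the function name are assumptions; the essential move is zeroing attention from each action's rows to all earlier actions' columns while keeping the text/image prefix visible:

```python
import numpy as np

def action_attention_mask(prefix_len: int, num_actions: int,
                          tokens_per_action: int = 9) -> np.ndarray:
    """Build a WorldVLA-style mask for a chunk of actions.

    Rows/cols: prefix (text + image) tokens first, then K actions of
    `tokens_per_action` tokens each. mask[i, j] = 1 means token i may
    attend to token j.
    """
    total = prefix_len + num_actions * tokens_per_action
    mask = np.tril(np.ones((total, total), dtype=int))   # standard causal mask
    for i in range(num_actions):
        a_start = prefix_len + i * tokens_per_action
        # Block attention from this action to all *earlier* action tokens;
        # the prefix and the action's own (causal) tokens stay visible.
        mask[a_start:a_start + tokens_per_action, prefix_len:a_start] = 0
    return mask
```

Under this mask each of the $K$ actions is conditioned only on the visual/text prefix, so the chunk behaves like parallel prediction even though decoding remains sequential.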

3.5 Forward process pseudocode

```python
# Action branch
tokens = concat(BOS, text_tokens, image_tokens[1:M], EOS)
action_tokens = autoregressive_decode(tokens, K * 7, mask="no_prior_action")
loss_action = cross_entropy(action_tokens, target_action_tokens)

# World branch
tokens = concat(BOS, world_prompt, image_t, action_t, EOS)
next_image_tokens = autoregressive_decode(tokens, image_token_count, mask="causal")
loss_world = cross_entropy(next_image_tokens, target_image_tokens)

# Joint training
loss = loss_action + alpha * loss_world
```

4. Mathematical forms and training objectives

4.1 Problem Formulation

The action model answers: given historical observations and a language instruction, which action should be executed now.

$$a_t = \pi_\theta(a_t \mid o_{t-h: t}, l)$$
$a_t$: robot action at time $t$.
$o_{t-h: t}$: historical image observations from $t-h$ to $t$.
$l$: natural-language task instruction.
$\pi_\theta$: the policy model, i.e. the action model.

The world model answers: given past observations and past actions, what the next frame will look like.

$$o_t = f_\phi(o_t \mid o_{t-h: t-1}, a_{t-h: t-1})$$
$f_\phi$: the world model.
$o_{t-h: t-1}$: sequence of image observations before the current frame.
$a_{t-h: t-1}$: the corresponding historical action sequence.

WorldVLA puts both capabilities into the same parametric model $M_\psi$.

$$M_\psi: \begin{cases} a_t = M_\psi^{\text{policy}}(a_t \mid o_{t-h: t}, l), \\ o_t = M_\psi^{\text{world}}(o_t \mid o_{t-h: t-1}, a_{t-h: t-1}). \end{cases}$$

These are not two completely separate models, but share an autoregressive token backbone; the differences mainly come from the input sequence format, loss position and attention mask.

4.2 Joint Loss

$$\mathcal{L} = \mathcal{L}_{action} + \alpha \mathcal{L}_{world}$$
$\mathcal{L}_{action}$: cross-entropy loss on action tokens.
$\mathcal{L}_{world}$: cross-entropy loss on generated image tokens.
$\alpha$: world-loss weight, fixed at 0.04 in the paper's experiments.

The reason for $\alpha$ is practical: an action is only 7 tokens, while an image is 256 or 1024 tokens, so without weighting the world loss would dominate the total loss by sheer token count.
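A back-of-the-envelope check of that imbalance at 256×256 resolution (illustrative arithmetic only; the paper tunes $\alpha$ rather than deriving it from this ratio):

```python
# Token counts per training sample, from the paper's tokenizer settings.
action_tokens_per_sample = 7     # one action = 7 tokens
image_tokens_per_sample = 256    # 256x256 frame; 1024 at 512x512

# If per-token losses were comparable, the unweighted world term would
# outweigh the action term by roughly the token ratio (~36.6x here).
ratio = image_tokens_per_sample / action_tokens_per_sample

# With alpha = 0.04, the world term's effective weight relative to the
# action term shrinks to about 1.46x instead of ~36.6x.
alpha = 0.04
effective = alpha * ratio
```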

5. Experiments and results

5.1 Experimental setup

| Item | Settings |
|---|---|
| Benchmark | LIBERO, comprising LIBERO-Spatial, -Object, -Goal, -Long, and LIBERO-90. Spatial tests spatial relations; Object tests object recognition, grasping, and placement; Goal tests different goal procedures; Long contains 10 long-horizon tasks; LIBERO-90 is used for pre-training. |
| Data processing | Failed trajectories and no-op actions are filtered out. World-model evaluation needs paired video/action ground truth, so 90% of trajectories form the training set and 10% the validation set; the benchmark comparison in Table 3 is trained fairly on all available data. |
| Default hyperparameters | The action model takes $M=2$ input images by default; the action chunk size is $K=10$ for LIBERO-Long and $K=5$ for the other three suites; the world model defaults to $N=1$; $\alpha=0.04$. |
| Action metrics | Each task is rolled out 50 times; success rate (SR) is reported in percent. |
| World metrics | FVD, PSNR, SSIM, and LPIPS on the validation set. |
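Of these world-model metrics, PSNR is simple enough to state exactly. A reference implementation for uint8-range frames (my own sketch, not the paper's evaluation code):

```python
import numpy as np

def psnr(pred: np.ndarray, target: np.ndarray, max_val: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two frames in [0, max_val]."""
    mse = np.mean((pred.astype(np.float64) - target.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")                 # identical frames
    return 10.0 * np.log10(max_val ** 2 / mse)
```

Higher is better; FVD and LPIPS instead compare deep-feature distributions and are lower-is-better, which is why the arrows in the tables differ.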

5.2 Main benchmark results

| Continuous Action Model | Pretraining | Spatial | Object | Goal | Long | Average |
|---|---|---|---|---|---|---|
| Diffusion Policy | No | 78.3 | 92.5 | 68.3 | 50.5 | 72.4 |
| Octo | Yes | 78.9 | 85.7 | 84.6 | 51.1 | 75.1 |
| DiT Policy | Yes | 84.2 | 96.3 | 85.4 | 63.8 | 82.4 |
| OpenVLA-OFT | Yes | 96.9 | 98.1 | 95.5 | 91.1 | 95.4 |

| Discrete Action Model | Pretraining | Spatial | Object | Goal | Long | Average |
|---|---|---|---|---|---|---|
| OpenVLA | Yes | 84.7 | 88.4 | 79.2 | 53.7 | 76.5 |
| WorldVLA 256×256 | No | 85.6 | 89.0 | 82.6 | 59.0 | 79.1 |
| WorldVLA 512×512 | No | 87.6 | 96.2 | 83.4 | 60.0 | 81.8 |

The authors' reading is that WorldVLA surpasses discrete OpenVLA without any large-scale robot pre-training, and that 512×512 beats 256×256 for two reasons: Chameleon's image tokenizer and LLM components are closer to their 512×512 pre-training setting, and the higher resolution provides finer visual detail, which matters for grasping tasks.

5.3 World Model helps Action Model

| Index | Action Model | World Model | Action Chunking | New Mask | Goal | Object | Spatial | Long | Average |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Yes | No | No | No | 67.3 | 82.9 | 77.8 | 23.0 | 62.8 |
| 2 | Yes | Yes | No | No | 73.1 | 88.0 | 80.2 | 27.3 | 67.2 |
| 3 | Yes | No | Yes | No | 79.6 | 82.9 | 36.7 | 16.9 | 54.0 |
| 4 | Yes | No | Yes | Yes | 84.4 | 90.9 | 81.8 | 49.3 | 76.6 |
| 5 | Yes | Yes | Yes | Yes | 85.1 | 90.9 | 84.0 | 52.4 | 78.1 |

Compared with Row 1, Row 2 raises the average success rate from 62.8 to 67.2; compared with Row 4, Row 5 raises it from 76.6 to 78.1. The paper's explanation: because the world model must learn the state changes caused by actions, it pushes the shared backbone to better understand environment physics and action semantics, which benefits the action model.

Action model visualization
Figure 4. Qualitative examples from the paper: the pure action model moves straight to the target location but fails to grasp the object, while the action world model retries the grasp and only moves on once it succeeds.

5.4 Action Model helps World Model

| Model | FVD ↓ (10f) | PSNR ↑ (10f) | SSIM ↑ (10f) | LPIPS ↓ (10f) | FVD ↓ (50f) | PSNR ↑ (50f) | SSIM ↑ (50f) | LPIPS ↓ (50f) |
|---|---|---|---|---|---|---|---|---|
| World Model | 250.0 | 29.62 | 90.73 | 11.97 | 718.6 | 23.98 | 83.41 | 15.60 |
| Action World Model | 255.1 | 29.77 | 90.40 | 11.94 | 674.1 | 24.30 | 83.55 | 15.44 |

At 10 frames the two are close, with the pure world model slightly better on FVD and SSIM; at 50 frames the action world model is better on all four metrics. The authors' takeaway is that action-generation training also strengthens understanding of visual and behavioral patterns, especially for longer-horizon generation.

World model visualization
Figure 5. The pure world model has inconsistencies or objects disappear in scenes such as opening drawers, moving plates, and putting bowls on the stove; the predictions of the action world model are more consistent with the action results.

5.5 Action Chunking and Attention Mask

The paper observes that when the ordinary autoregressive method generates multiple continuous actions, the longer the action chunk, the easier it is to degrade performance. The reason is that subsequent actions rely too much on previously generated actions rather than being directly rooted in visual input. The new mask allows each action to rely only on text and images to avoid error propagation within the chunk.

Action chunk length ablation
Figure 6. Action chunk length ablation: Ordinary masks degrade significantly under long chunks, while new masks maintain a better success rate under longer chunks; chunks that are too long will still decline because the policy cannot adapt in time.

5.6 World Model vs. Video Prediction Model

The video prediction model's input is the current frame and the task instruction, without actions; the world model's input includes actions. The authors compared how each helps the action model: the world model improves every evaluation suite, while the video prediction model helps on only two and hurts one. The paper's explanation: without action conditioning, the same initial frame can correspond to multiple plausible futures, so the training signal is blurrier; with action conditioning, the future state is far more determined.

World model versus video prediction model
Figure 7. Comparison of action world model and action video prediction model; the paper attributes the difference to whether action conditioning is clear.

5.7 Historical image input length

| Setting | SR ↑ (1f) | FPS ↑ (1f) | SR ↑ (2f) | FPS ↑ (2f) | SR ↑ (4f) | FPS ↑ (4f) |
|---|---|---|---|---|---|---|
| w/o Action Chunking | 58.4 | 2.27 | 67.3 | 1.77 | 78.7 | 1.22 |
| w/ Action Chunking | 74.0 | 3.67 | 84.4 | 3.13 | 84.7 | 2.78 |

More history frames give the model stronger visual context at the cost of FPS. With action chunking, SR rises only from 84.4 (2 frames) to 84.7 (4 frames), so the authors default to $M=2$ as a compromise between performance and speed.

5.8 World Model Pretraining

| Setting | Goal | Object | Spatial | Long | Average |
|---|---|---|---|---|---|
| w/o World Model Pretrain | 67.3 | 82.9 | 77.8 | 23.0 | 62.8 |
| w/ World Model Pretrain | 73.1 | 84.0 | 79.8 | 30.2 | 66.8 |

After world-model pretraining, the average SR rises from 62.8 to 66.8, and Long from 23.0 to 30.2. The authors' explanation: world-model pre-training forces the model to learn the physics linking visual inputs, actions, and state transitions before it transfers to the action model.

6. Reproducibility audit

6.1 Code and model

Available information: both the arXiv page and the source provide the GitHub link alibaba-damo-academy/WorldVLA. This link currently redirects to RynnVLA-002; the README shows WorldVLA was upgraded to RynnVLA-002 on 2025-11-10, while the WorldVLA model, training code, and LIBERO evaluation code were released on 2025-06-23. The Hugging Face WorldVLA model card still lists the 256- and 512-resolution checkpoints.

6.2 Data and preprocessing

6.3 Training key hyperparameters

| Hyperparameter | Value | Remarks |
|---|---|---|
| Backbone initialization | Chameleon | A discrete autoregressive model unifying image understanding and generation. |
| Image resolution | 256×256 or 512×512 | A 512×512 image corresponds to 1024 image tokens: higher performance, higher cost. |
| Image tokenizer codebook | 8192 | VQ-GAN tokenizer. |
| Text tokenizer vocabulary | 65,536 | Includes the image/action tokens. |
| Action bins | 256 bins per dimension | Bin widths determined by the training-data range. |
| Action dimension | 7 tokens | 3 relative positions + 3 relative angles + 1 gripper state. |
| Historical images $M$ | Default 2 | Determined by the history-frame ablation. |
| Action chunk $K$ | 10 for Long, 5 otherwise | Default configuration. |
| World prediction rounds $N$ | 1 | To save computation. |
| Loss weight $\alpha$ | 0.04 | Balances image tokens far outnumbering action tokens. |

6.4 Reproduction gaps

7. Analysis, Limitations and Boundaries

7.1 The most valuable part of this paper

Judging from the paper's own evidence, the value concentrates in two interconnected conclusions: world-model training improves action generation, and action-model training in turn improves world prediction at longer horizons. These conclusions rest not on conceptual diagrams alone but on the action ablation table, the world-model ablation table, and the visualized cases.

7.2 Why the results hold up

The paper's three types of evidence align with one another: the main benchmark shows WorldVLA beating OpenVLA within the discrete action-model group; the ablations cleanly separate the effects of the action model, world model, action chunking, and attention mask; and the visualizations show the typical failure modes of the pure action model and the pure world model. In particular, Rows 3 and 4 of Table 4 directly show that naive autoregressive action chunking degrades sharply, while the new mask restores performance.

7.3 Limitations and future directions described by the author

7.4 Applicable boundaries

The paper's experimental evidence comes mainly from the LIBERO simulation benchmark; whether the model transfers directly to real robots, other hardware action spaces, longer task chains, or more complex interactive environments is not experimentally verified. The reported conclusions on generalization are therefore limited to the LIBERO settings the paper evaluated.

7.5 Appendix status

The arXiv source contains no usable appendix: `\beginappendix` and `\input{sec/appendix}` in `paper.tex` are both commented out, and there is no `sec/appendix.tex` in the source tree. This report therefore has no appendix proofs, appendix hyperparameters, or supplementary appendix experiments to integrate.