
RynnVLA-002: A Unified Vision-Language-Action and World Model

Authors: Jun Cen, Siteng Huang, Yuqian Yuan, Kehan Li, Hangjie Yuan, Chaohui Yu, Yuming Jiang, Jiayan Guo, Xin Li, Hao Luo, Fan Wang, Deli Zhao, Hao Chen

Organization: DAMO Academy, Alibaba Group; Hupan Lab; Zhejiang University

Version: arXiv: 2511.17502 v2, submitted on 2025-11-21, revised on 2025-11-24

Links: arXiv | PDF | official code

1. Quick overview of the paper

One-sentence summary: RynnVLA-002 puts a VLA action policy and an action-conditioned world model into the same autoregressive multimodal LLM, so the model can both generate actions from images and language and predict future images from images and actions, with joint training of the two objectives reinforcing each other.

Difficulty rating: ★★★★☆. Reading requires familiarity with VLA, MLLM tokenization, VQ-GAN image tokens, discrete action tokens, continuous action heads, world-model video prediction metrics, and the LIBERO / LeRobot evaluation protocols.

Keywords: Vision-Language-Action · World Model · Unified Vocabulary · Action Chunking · Action Transformer · LIBERO

| Reading target | Short answer |
|---|---|
| What problem does the paper solve? | Standard VLA models place actions only on the output side, lacking explicit internal modeling of action dynamics and physical evolution; standard world models can predict future observations but cannot directly output robot actions. The paper attempts to unify action planning and future visual prediction. |
| The authors' approach | Build on Chameleon's unified image-and-text generation architecture; introduce image/text/state/action tokenizers and mix VLA data with world-model data during training; keep discrete action-token training while adding a continuous Action Transformer head to address real-robot generalization and speed. |
| Most important results | On LIBERO, RynnVLA-002-Continuous reaches a 97.4% average success rate without additional pre-training; on the real LeRobot SO100, adding world-model data raises the block-task success rate from under 30% to over 80%. The paper's abstract reports an overall success-rate gain of 50%. |
| Things to note when reading | "Unification" is not just a shared backbone name: VLA queries and world-model queries are organized into the same token sequences with shared parameters. However, discrete actions fail on real robots, so the continuous action head is an essential patch for deploying this method. |

Core contribution list

Action world model concept
Fig. 1 / Action World Model: a VLA generates actions from images; a world model generates future images from images and actions; RynnVLA-002 attempts to unify the two.

2. Motivation

2.1 What problem should be solved?

The VLA model maps language goals and visual observations to actions and is the mainstream form of current robot foundation policies. The paper argues this architecture has three flaws: actions sit only on the output side, so the model has no explicit representation of action dynamics; the model does not predict how the world will change if a certain action is performed, so it lacks imagination and counterfactual capability; and the model does not directly learn physical dynamics, making it hard to internalize laws such as contact, stability, and object interaction.

The world model fills in the other half: it predicts future states from current images and actions and thereby learns environment dynamics. However, traditional world models do not directly output actions, so they cannot complete action planning on their own. RynnVLA-002's problem definition is to put both into the same queryable model: ask it "What action should the robot take to ...?" and it is a VLA; ask it "Generate the next frame ..." and it is a world model.

2.2 Where are the existing methods stuck?

VLM-based VLAs typically rely on the visual-language understanding of a large-scale MLLM coupled with action heads or action experts. Discrete action tokens are convenient for cross-entropy training of language models, but suffer from quantization error and autoregressive error accumulation in fine-grained control. Continuous action heads output smoother trajectories, but without modeling how the world evolves, the model may still only learn short-sighted image-to-action correlations.

Visual-generation-based VLAs and world models can predict future frames, but still often face problems of visual fidelity, cross-domain transfer, computational efficiency, and how to actually translate predicted dynamics into action improvements. This paper's positioning is to have the same MLLM consume VLA data and world-model data simultaneously, so that action understanding and visual dynamics prediction provide training signals to each other.

2.3 Solution ideas of this article

RynnVLA-002 uses a unified token vocabulary to organize images/text/actions/states into token sequences processable by the same language model. The VLA side generates action chunks from language, state, and historical dual-view images; the world-model side generates the next-frame image from images and action tokens. Discrete actions are trained with cross-entropy, world image tokens are also trained with cross-entropy, and the continuous action head uses L1 regression.

4. Detailed explanation of method

RynnVLA-002 overview
Fig. 2 / Overview: VLA query on the left, world-model query on the right; both share the RynnVLA-002 backbone.

4.1 Unified modeling goals

The paper first writes VLA and world model as two conditional generation problems:

VLA problem: Given language, state, and historical observations, generate actions.

$$a_t \sim \pi(a_t \mid l, s_{t-1}, o_{t-h: t})$$

World model problem: Given historical observations and actions, predict the next frame of observations.

$$\hat{o}_t \sim f(o_t \mid o_{t-h: t-1}, a_{t-h: t-1})$$

RynnVLA-002 supports both queries with a single parameter set $\psi$. Different tasks change only the organization of the input tokens and the text prefix; the backbone is shared.

4.2 Data Tokenization

The model is initialized from Chameleon because Chameleon natively supports unified image understanding and generation. RynnVLA-002 involves four tokenizers:

| Tokenizer | Function | Key details |
|---|---|---|
| Image tokenizer | Discretizes images into visual tokens | VQ-GAN; compression ratio 16; codebook size 8192; a $256\times256$ image yields 256 tokens, $512\times512$ yields 1024 tokens. |
| Text tokenizer | Processes language prompts | BPE tokenizer inherited from Chameleon. |
| State tokenizer | Discretizes the proprioceptive state | Each continuous dimension is divided into 256 bins based on the training-data range. |
| Action tokenizer | Discretizes robot actions | Each continuous action dimension is divided into 256 bins; the continuous Action Transformer outputs raw actions without tokenization. |

Implementation note: image, text, action, and state tokens share a single 65,536-entry vocabulary. This lets actions not only carry numerical values but also enter the same autoregressive token language as images and text.
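
To make the binning concrete, here is a minimal sketch of the 256-bin min-max discretization, assuming (as the paper states) that per-dimension ranges come from the training data; the token offset and all helper names are hypothetical, not from the official code.

```python
import numpy as np

NUM_BINS = 256
ACTION_TOKEN_OFFSET = 60_000  # hypothetical slot inside the 65,536-token vocab

def fit_ranges(train_actions: np.ndarray):
    """Per-dimension (min, max) over the training data, as the paper specifies."""
    return train_actions.min(axis=0), train_actions.max(axis=0)

def tokenize(action: np.ndarray, lo: np.ndarray, hi: np.ndarray) -> np.ndarray:
    """Map each continuous dimension to one of 256 uniform bins, then offset."""
    normed = (action - lo) / np.maximum(hi - lo, 1e-8)            # -> [0, 1]
    bins = np.clip((normed * NUM_BINS).astype(int), 0, NUM_BINS - 1)
    return bins + ACTION_TOKEN_OFFSET

def detokenize(tokens: np.ndarray, lo: np.ndarray, hi: np.ndarray) -> np.ndarray:
    """Invert via bin centers; quantization error is at most half a bin width."""
    bins = tokens - ACTION_TOKEN_OFFSET
    return lo + (bins + 0.5) / NUM_BINS * (hi - lo)

train = np.random.uniform(-1.0, 1.0, size=(1000, 7))  # fake 7-DoF actions
lo, hi = fit_ranges(train)
roundtrip = detokenize(tokenize(train[0], lo, hi), lo, hi)
assert np.abs(roundtrip - train[0]).max() <= ((hi - lo) / NUM_BINS).max()
```

The round-trip error bound above is exactly the quantization error the paper cites as a weakness of discrete actions in fine-grained control.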

4.3 VLA Model Data

The sequence structure of the VLA training sample is:

$$\texttt{\{text\}}\ \texttt{\{state\}}\ \underbrace{\texttt{\{image-front-wrist\}}}_{\times M}\ \underbrace{\texttt{\{action\}}}_{\times K}$$
- $M$: number of historical image observations; the VLA uses $M=2$ in experiments.
- $K$: action chunk size; LIBERO-Long and LIBERO-Spatial use $K=10$, LIBERO-Object and LIBERO-Goal use $K=5$.
- $\mathcal{L}_{dis\_action}$: cross-entropy loss on discrete action tokens.

The text prompt has the form "What action should the robot take to <task>?". Input images include the front and wrist cameras, and the state is the proprioceptive state.
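
A sketch of how such a VLA sample could be flattened into one token sequence. The tokenizer callables and the -100 label convention are stand-ins (assumptions), not the official implementation.

```python
def build_vla_sequence(task, state, history_images, action_chunk,
                       text_tok, state_tok, image_tok, action_tok):
    """Lay out {text}{state}{image x M}{action x K} as one token sequence."""
    prompt = f"What action should the robot take to {task}?"
    seq, labels = [], []

    prefix = text_tok(prompt) + state_tok(state)
    for img in history_images:          # M pairs of front + wrist observations
        prefix = prefix + image_tok(img)
    seq += prefix
    labels += [-100] * len(prefix)      # prefix positions carry no loss

    for action in action_chunk:         # K future actions
        ids = action_tok(action)
        seq += ids
        labels += ids                   # cross-entropy targets: L_dis_action

    return seq, labels

# Toy usage with stub tokenizers (all hypothetical):
seq, labels = build_vla_sequence(
    task="pick up the red block",
    state=[0.1] * 7,
    history_images=["front+wrist_0", "front+wrist_1"],   # M = 2
    action_chunk=[[0.0] * 7] * 5,                        # K = 5
    text_tok=lambda s: [1] * 8,
    state_tok=lambda s: [2] * len(s),
    image_tok=lambda im: [3] * 256,
    action_tok=lambda a: [4] * len(a),
)
assert len(seq) == len(labels)
```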

4.4 World Model Data

The sequence structure of the world model is:

$$\texttt{\{text\}}\ \underbrace{\texttt{\{images-front-wrist\}}\texttt{\{action\}}\overbrace{\texttt{\{images-front-wrist\}}}^{\mathcal{L}_{img}}}_{\times N}$$

All world-model samples use the same text prefix. The paper's main text uses "Generate the next frame based on the current image and the action.", while the official README data samples use "Generate the next image based on the provided sequence of historical images and corresponding actions."

- $N$: number of autoregressive prediction rounds; set to $N=1$ in experiments for efficiency.
- $\mathcal{L}_{img}$: cross-entropy loss on the future image's discrete tokens.

The discrete training objectives are:

$$\mathcal{L}_{dis}=\mathcal{L}_{dis\_action}+\mathcal{L}_{img}$$

This means action prediction and image prediction are optimized jointly in the same training phase.
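
Since both terms are next-token cross-entropy over the shared vocabulary, a mixed batch can be optimized with a single loss call. The sketch below assumes the common practice of masking unsupervised positions with -100 (an assumption, not something the paper spells out).

```python
import torch
import torch.nn.functional as F

def discrete_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """logits: (B, T, V); labels: (B, T) with -100 on unsupervised positions.

    For VLA samples only the K action tokens carry labels (L_dis_action);
    for world-model samples only the future image tokens do (L_img).
    """
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),  # predict token t+1 from t
        labels[:, 1:].reshape(-1),
        ignore_index=-100,
    )

# Toy shapes for a mixed batch of VLA and world-model sequences.
V = 1024  # small toy vocabulary; the real shared vocab has 65,536 entries
logits = torch.randn(4, 128, V)
labels = torch.full((4, 128), -100)
labels[:2, -10:] = torch.randint(0, V, (2, 10))    # VLA: action-token labels
labels[2:, -64:] = torch.randint(0, V, (2, 64))    # WM: image-token labels
print(discrete_loss(logits, labels).item())
```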

4.5 Attention Mask of Discrete Action Chunk

When generating multiple actions autoregressively, default causal attention lets later action tokens see earlier ones, so early action errors enter the conditions of subsequent actions and accumulate. The authors modify the attention mask so that each action token attends only to the text / visual / state inputs, not to earlier actions in the same chunk.

Attention mask
Fig. 3 / Attention Mask: Comparison of default VLA, this article's VLA mask and world model mask.

This design makes the action tokens conditionally independent given the visual context, reducing error propagation. But the cost also shows up in the paper: on a real robot, such discrete action chunks tend to be unsmooth, because actions within a chunk are isolated from one another and trajectory continuity is not guaranteed.
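
A minimal sketch of such a mask, assuming a token layout of [prefix tokens][K action tokens]; the index bookkeeping is hypothetical, and Fig. 3 is the authoritative reference.

```python
import torch

def chunk_attention_mask(prefix_len: int, num_action_tokens: int) -> torch.Tensor:
    """True = attention allowed. Layout: [prefix tokens][action chunk]."""
    T = prefix_len + num_action_tokens
    mask = torch.tril(torch.ones(T, T, dtype=torch.bool))  # default causal mask
    act = slice(prefix_len, T)
    mask[act, act] = False                 # cut links between actions in a chunk
    idx = torch.arange(prefix_len, T)
    mask[idx, idx] = True                  # each action token still sees itself
    return mask

m = chunk_attention_mask(prefix_len=4, num_action_tokens=3)
# Row i shows what token i may attend to: action rows see only the prefix + self.
print(m.int())
```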

4.6 Continuous Action Transformer Head

To improve real-robot generalization and inference speed, the authors add a small Action Transformer alongside the discrete joint modeling. It reads the complete context, including language, image, and state tokens, and uses learnable action queries to output the entire continuous action chunk in parallel.
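
A minimal sketch of a parallel action head in this spirit: K learnable queries cross-attend to the backbone's hidden states and are projected to raw actions in one forward pass. The use of nn.TransformerDecoder and all layer sizes are assumptions; the paper does not pin down the head's internals at this level.

```python
import torch
import torch.nn as nn

class ActionTransformerHead(nn.Module):
    def __init__(self, hidden_dim=1024, action_dim=7, chunk_size=10, layers=2):
        super().__init__()
        # K learnable action queries, one per action in the chunk.
        self.queries = nn.Parameter(torch.randn(chunk_size, hidden_dim) * 0.02)
        layer = nn.TransformerDecoderLayer(
            d_model=hidden_dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=layers)
        self.proj = nn.Linear(hidden_dim, action_dim)

    def forward(self, context: torch.Tensor) -> torch.Tensor:
        """context: (B, T, H) hidden states of the full multimodal prefix."""
        q = self.queries.unsqueeze(0).expand(context.size(0), -1, -1)
        return self.proj(self.decoder(q, memory=context))  # (B, K, action_dim)

head = ActionTransformerHead()
actions = head(torch.randn(2, 300, 1024))        # all K actions in one pass
target = torch.zeros_like(actions)               # placeholder ground truth
l1_loss = (actions - target).abs().mean()        # the paper's L1 supervision
```

Because the K outputs come from one forward pass rather than K×D autoregressive steps, this head is also what makes the high control frequencies in Section 5.7 possible.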

The final training objective combines three types of supervision: discrete actions, world images, and continuous actions.

$$\mathcal{L}=\mathcal{L}_{dis}+\alpha\mathcal{L}_{conti} =\mathcal{L}_{dis\_action}+\mathcal{L}_{img}+\alpha\mathcal{L}_{conti\_action}$$
- $\mathcal{L}_{conti\_action}$: L1 regression loss on continuous actions.
- $\alpha$: weight of the continuous action loss, set to 10 in experiments.
Training batch construction
1. Sample VLA data:
   prompt = "What action should the robot take to ?"
   input = text + state + M history images from front/wrist cameras
   target_discrete = K action tokens
   target_continuous = K raw actions through Action Transformer

2. Sample world-model data:
   prompt = "Generate the next frame based on the current image and the action."
   input = current front/wrist images + action tokens
   target = next front/wrist image tokens

3. Optimize:
   L = CE(discrete actions) + CE(image tokens) + alpha * L1(continuous actions)

5. Experiment

5.1 Experimental setup

| Experiment | Settings | Metrics |
|---|---|---|
| LIBERO simulation | Four suites: Spatial, Object, Goal, Long. Data cleaning removes unsuccessful trajectories and no-op actions; the world model uses a 90% / 10% train-val split. | VLA: success rate over 50 rollouts with different initial states per task; world model: FVD, PSNR, SSIM, LPIPS. |
| LeRobot SO100 real-world | Two pick-and-place tasks: block inside circle (248 demos), strawberries into cup (249 demos); both are human-teleoperated expert demonstrations. | Each task and scenario is tested 10 times; success rate is reported. |
| Ablation | Remove world-model data, action chunking, the proposed attention mask, the wrist camera, and the proprioceptive state respectively; compare discrete vs. continuous actions. | LIBERO success rate, real-robot success rate, world-model generation metrics, inference frequency (Hz). |

5.2 LIBERO main results

| Method | Pretraining | Action Type | Spatial | Object | Goal | Long | Average |
|---|---|---|---|---|---|---|---|
| UniVLA | Yes | Discrete | 96.5 | 96.8 | 95.6 | 92.0 | 95.2 |
| OpenVLA-OFT | Yes | Continuous | 97.6 | 98.4 | 97.9 | 94.5 | 97.1 |
| RynnVLA-002-Discrete | No | Discrete | 94.2 | 96.8 | 94.6 | 87.6 | 93.3 |
| RynnVLA-002-Continuous | No | Continuous | 99.0 | 99.8 | 96.4 | 94.4 | 97.4 |

The key point the authors emphasize: RynnVLA-002-Continuous achieves a 97.4% average success rate without large-scale robot pre-training, comparable to or higher than several strong pre-trained baselines. The discrete version also reaches 93.3%, indicating that the unified discrete action/world-token scheme works in simulation, while the continuous head further improves overall performance.

5.3 Real robot results

Real robot settings
Fig. 4 / Real robot settings: Single-target, multi-target and SO100 desktop operation settings with distractors.
| Task / Scenario | GR00T N1.5 | $\pi_0$ | RynnVLA-002 |
|---|---|---|---|
| Block / Single-target | 90.0 | 100.0 | 90.0 |
| Block / Multi-target | 60.0 | 70.0 | 90.0 |
| Block / w/ Distractors | 50.0 | 50.0 | 80.0 |
| Strawberries / Single-target | 50.0 | 80.0 | 80.0 |
| Strawberries / Multi-target | 50.0 | 70.0 | 80.0 |
| Strawberries / w/ Distractors | 70.0 | 40.0 | 50.0 |

Real-robot evaluation better exposes the limitations of discrete actions. The paper states that RynnVLA-002 is highly competitive in cluttered environments, especially the multi-target and distractor scenarios of the block task, where it is 10% to 30% higher than the baselines. However, GR00T N1.5 remains higher than RynnVLA-002 in the strawberries + distractors scenario, showing that the real-robot results are not uniformly leading across all scenarios.

5.4 World Model Benefits VLA

| Discrete action setting | World Model | Action Chunking | Proposed Attention Mask | Average |
|---|---|---|---|---|
| VLA only | No | No | No | 62.8 |
| + World Model | Yes | No | No | 67.2 |
| + Chunking, default mask | No | Yes | No | 54.0 |
| + Chunking + proposed mask | No | Yes | Yes | 76.6 |
| Complete discrete model | Yes | Yes | Yes | 78.1 |

This ablation makes two points: adding world-model data alone raises the average from 62.8 to 67.2; action chunking with the default causal mask drops it to 54.0, but with the proposed mask it rises to 76.6. The complete discrete model reaches 78.1.

VLA visualization
Fig. 5 / VLA visualization: When there is no world model data, the model moves directly towards the target position; after adding the world model, it continues to try to grab when it fails.

5.5 Ablation of Continuous Action

| Setting | World Model | Wrist Camera | Proprioceptive State | Goal | Object | Spatial | Long | Average |
|---|---|---|---|---|---|---|---|---|
| Basic continuous VLA | No | No | No | 90.2 | 92.4 | 88.4 | 67.0 | 84.5 |
| + Wrist Camera | No | Yes | No | 91.4 | 95.4 | 98.2 | 81.4 | 91.6 |
| + World Model | Yes | Yes | No | 96.0 | 97.4 | 99.0 | 85.8 | 94.6 |
| Complete continuous model | Yes | Yes | Yes | 96.4 | 99.8 | 99.0 | 94.4 | 97.4 |

The strongest evidence for the continuous model: the wrist camera lifts the average from 84.5 to 91.6, world-model data lifts it from 91.6 to 94.6, and the proprioceptive state pulls Long from 85.8 to 94.4. The real-robot ablation is even starker: success is 0 when the wrist camera or proprioceptive state is missing; without world-model data, Single / Multi / Distractors reach only 30.0 / 10.0 / 0, while the complete continuous model reaches 80.0 / 80.0 / 50.0.

5.6 VLA Enhances World Model

| Suite | Model | FVD ↓ | PSNR ↑ | SSIM ↑ | LPIPS ↓ |
|---|---|---|---|---|---|
| Goal | World Model | 370.0 | 22.25 | 77.84 | 19.70 |
| Goal | Action World Model | 336.8 | 22.13 | 78.13 | 19.43 |
| Object | World Model | 1141.6 | 20.31 | 59.59 | 27.30 |
| Object | Action World Model | 877.2 | 22.18 | 65.03 | 22.60 |
| Spatial | World Model | 405.4 | 22.32 | 79.15 | 20.28 |
| Spatial | Action World Model | 373.1 | 23.88 | 82.41 | 16.33 |
| Long | World Model | 557.73 | 18.24 | 69.16 | 31.60 |
| Long | Action World Model | 427.86 | 19.36 | 72.19 | 27.78 |

The world-model metrics show that the Action World Model trained on mixed VLA data outperforms the standalone world model on most metrics. The supplementary material's world-model visualization adds further evidence: in two examples, the baseline world model cannot predict a successful grasp from the front-camera view, while the Action World Model generates videos containing successful grasps; the baseline also suffers from inconsistent predictions between the front and wrist views.

World model visualization
Fig. 6 / World model visualization: Comparison of world model visualization in supplementary material.

5.7 Efficiency and action form

The paper's efficiency ablation shows a clear trend: parallel generation of continuous actions is much faster than discrete autoregressive actions. For example, the continuous model without wrist camera and history runs at 24.94 Hz with chunk size 5 and 48.20 Hz with chunk size 10; with wrist camera and history it still reaches 7.75 / 15.78 Hz. The discrete model, even with action chunking, reaches only about 2.74 to 3.69 Hz. The paper also reports that discrete action tokens accelerate the convergence of continuous action generation, especially early in training.
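
A back-of-envelope illustration of why parallel decoding scales better, assuming (hypothetically) that each autoregressive step costs one forward pass of roughly constant latency; the numbers are illustrative, not the paper's measurements.

```python
# Assumed: one transformer forward pass costs t seconds, whether it decodes a
# single discrete token or the whole continuous chunk in parallel.
t = 0.025          # seconds per forward pass (illustrative)
K, D = 10, 7       # chunk size x action dimensions

discrete_steps = K * D                 # one autoregressive step per action token
discrete_hz = K / (discrete_steps * t)
continuous_hz = K / (1 * t)            # the whole chunk in one parallel pass

print(f"discrete:   {discrete_hz:6.2f} actions/s")
print(f"continuous: {continuous_hz:6.2f} actions/s")
```

The gap grows linearly with K×D, which matches the paper's observation that larger chunk sizes widen the continuous model's throughput advantage.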

Discrete vs continuous performance
Fig. 7 / Discrete vs Continuous: continuous actions converge faster, and the gap is more obvious on real robots.

6. Reproducibility audit

6.1 Code and model resources

Already published: the official GitHub repository is alibaba-damo-academy/RynnVLA-002. The README (dated 2025-11-10) announces released models, training code, and evaluation code, covering the LIBERO simulation benchmark and real-world LeRobot experiments.

Model Zoo: the README provides four LIBERO checkpoints for the 256×256 VLA model, plus per-suite checkpoints for the 512×512 World Model / Action World Model. The values in its table are consistent with the paper's main table.

Dependency cost: you need to install this repository, flash-attn, and LIBERO, and download the Chameleon tokenizer, base model, and starting-point weights. Reproduction involves two training pipelines: pretokenize and no-pretokenize.

6.2 Key hyperparameters

| Item | Paper setting |
|---|---|
| VLA history images | $M=2$, front + wrist camera historical observations. |
| Action chunk size | LIBERO-Long / Spatial use $K=10$; LIBERO-Object / Goal use $K=5$. |
| World-model prediction rounds | $N=1$; for computational efficiency, only the next frame is predicted per round. |
| Continuous action loss weight | $\alpha=10$. |
| Image tokenization | VQ-GAN, compression ratio 16, codebook size 8192. |
| State/action discretization | 256 bins per dimension; range determined by training-data min/max. |
| Data cleaning | Unsuccessful trajectories and no-op actions removed, following OpenVLA style. |
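
For convenience, the same hyperparameters collected into one config sketch; field names are hypothetical and do not match the official rynnvla-002/configs/ files.

```python
# Hypothetical consolidated config mirroring the table above.
CONFIG = {
    "history_images": 2,                   # M: front + wrist observations
    "action_chunk_size": {                 # K, per LIBERO suite
        "libero_long": 10, "libero_spatial": 10,
        "libero_object": 5, "libero_goal": 5,
    },
    "world_model_rounds": 1,               # N: predict one next frame
    "continuous_loss_weight": 10.0,        # alpha on the L1 term
    "image_tokenizer": {"type": "vq-gan", "compression": 16,
                        "codebook_size": 8192},
    "state_action_bins": 256,              # ranges from training-data min/max
}
```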

6.3 Official reproducibility path

  1. Install dependencies: pip install -r requirements.txt, install flash-attn, pip install -e ., and install LIBERO.
  2. Download the Chameleon tokenizer, base model, and starting-point weights and place them in the rynnvla-002/ckpts/... path specified in the README.
  3. Filter LIBERO no-op actions: run regenerate_libero_dataset_filter_no_op.py.
  4. Choose Pretokenize or NoPretokenize. Pretokenize first saves images/actions/states, then generates VLA conversations and world-model conversations, and finally tokenizes the conversations and splices the records.
  5. Configure the data paths in rynnvla-002/configs/libero_goal/..., then run the training scripts under exps_pretokenize or exps_nopretokenize.
  6. Evaluate LIBERO: set checkpoint_path under evals_libero/ and run a continuous or discrete evaluation script.
  7. Real LeRobot: the README provides data generation, state/action min-max computation, tokenization, training, and the inference entry point eval_solver_lerobot_action_head_state.py.

6.4 Reproduction risks

7. Analysis, Limitations and Boundaries

7.1 The most valuable part of this paper

According to the paper's own evidence, the most valuable part is changing "action as output" into "action as a modality". Discrete action tokens let actions enter the same autoregressive vocabulary as images and text; world-model training forces the model to predict the visual consequences of actions. So instead of merely learning a supervised image-to-action mapping, the VLA is forced to learn how actions change objects and viewpoints. Both the LIBERO and real-robot ablations show that world-model data improves the VLA, and the supplementary visualization shows that VLA data improves world-model generation of the grasping process.

7.2 Why the results hold up

The paper's claims are supported by several kinds of mutually reinforcing evidence: the main table shows continuous RynnVLA-002 reaching 97.4% on LIBERO; the discrete ablation shows that the world model, action chunking, and attention mask each contribute; the continuous ablation shows that the wrist camera, state, and world model are critical for Long and for real robots; the world-model metric table shows the Action World Model improving FVD/SSIM/LPIPS; and the real-robot ablation shows significant failure without the world model, wrist camera, or proprioceptive state. This evidence covers both directions of the "unified model, mutual reinforcement" claim.

7.3 Limitations stated by the authors and exposed by experiments

7.4 Applicable boundaries

RynnVLA-002 is best suited to manipulation tasks with paired image, action, and state data, where future visual states reflect task progress. It requires stable interfaces for the image tokenizer, action/state discretization ranges, front/wrist cameras, and proprioceptive state. For tasks with heavy contact, force control, severe occlusion, visual states that poorly express task success, or high-speed closed-loop control requirements, the evidence in this paper is insufficient.