VideoVLA: Video Generators Can Be Generalizable Robot Manipulators

Method name: VideoVLA

Authors: Yichao Shen, Fangyun Wei, Zhiying Du, Yaobo Liang, Yan Lu, Jiaolong Yang, Nanning Zheng, Baining Guo

Organization: IAIR, Xi'an Jiaotong University; Microsoft Research Asia; Fudan University

Publication: NeurIPS 2025; arXiv v1 submitted on 2025-12-07

Links: arXiv: 2512.06963 | PDF | Project page | official code | model

1. Quick overview of the paper

One-sentence summary: VideoVLA turns a pre-trained video generator such as CogVideoX-5B into a VLA robot policy: given a language instruction and the current image, a multi-modal Diffusion Transformer simultaneously predicts the future action chunk and the future visual outcome of executing those actions.

What problem does the paper address?	Existing VLAs mostly rely on pre-trained understanding models. They can complete tasks within the training distribution, but generalization to new tasks, new objects, and skills from other embodiments remains limited; the authors attempt to transfer the physical imagination and future-state prediction capabilities of video generators to robot manipulation.
The authors' approach	Unify video generation and action generation as a single diffusion denoising problem: language tokens and the current image latent serve as conditions, and future video latents and the 7-D action sequence serve as joint denoising targets.
Most important results	SIMPLER in-domain average of 63.0, slightly higher than CogACT's 62.6; SIMPLER novel objects average 65.2 and new skills average 48.6; on the real Realman robot, in-domain average 64.6, novel objects 50.6, and cross-embodiment skills 58.0. All are the best in their respective tables.
Things to note when reading	The point is not to generate an extra video for display but to use joint video-action prediction as a training constraint. The ablation shows that removing the video loss or predicting only actions sharply reduces in-domain and generalization performance.

Difficulty rating: ★★★★☆. Requires familiarity with VLA, Diffusion Transformers, CogVideoX and its causal video VAE, action chunking, SIMPLER evaluation, and cross-embodiment generalization in robotics.

Keywords: VideoVLA, video generation, VLA, Diffusion Transformer, visual imagination, action chunk, CogVideoX, SIMPLER, Realman robot.

Core contribution list

Paradigm Shift: From "using a visual language understanding model as a VLA backbone" to "using a video generation model as a VLA manipulator".
Unified modeling: Put language, current visual latent, future visual latent, and action vectors into the same DiT sequence to jointly predict action and visual consequences.
Generalization evidence: Demonstrates stronger average performance than OpenVLA, SpatialVLA, $\pi_0$, CogACT in novel objects and unseen skills.
Imagination-execution correlation analysis: Uses motion similarity and human judgment to show that better visual imagination is associated with higher execution success rates.

Figure 1. Given a language instruction and the current visual observation, VideoVLA simultaneously predicts the next actions and the future visual effect of those actions on the environment.

2. Motivation

2.1 What problem should be solved?

The long-term goal of robotic manipulation is to generalize to tasks, objects, and environments not seen during training. Existing VLAs reduce task-specific robot data requirements by building on large-scale vision, language, or vision-language understanding models, but the authors argue that this route still struggles to achieve true generalization, especially on novel objects and unseen skills.

Video generation models show strong generalization under novel text/image conditions and generate physically plausible future videos. The authors observe that this aligns closely with robotic manipulation: a robot also needs to predict physical consequences from new instructions and new visual observations, and to organize its actions accordingly.

2.2 Key assumptions

If a model can generate "visual imagery" that is consistent with actual execution, then it is more likely to predict actions that will accomplish the task. In other words, future visual outcomes are not additional by-products but implicit supervisory and diagnostic signals of action reliability.

3. Related work context

Research line	Representative works cited in the paper	Positioning of VideoVLA
Vision-Language-Action Models	Octo, Open X-Embodiment, OpenVLA, CogACT, HPT, GR-2, RDT, etc.	Previous works mostly relied on understanding-oriented foundation models; VideoVLA uses a generation-oriented video model instead, and predicts actions and visual consequences at the same time.
Video Generation Models	SVD, OpenSora, CogVideoX, HunyuanVideo, Wan, etc.	VideoVLA is based on CogVideoX and follows the two-stage generation paradigm of video VAE + DiT, but adds actions to the same diffusion sequence.
Video Generation for Robot Manipulation	UniPi, UniSim, RoboDreamer, Gen2Act, VidMan, GR-1, GR-2, etc.	Most methods treat video generation as modular visual planning or auxiliary features; VideoVLA is an end-to-end unified VLA, directly adapted from a pre-trained video generator.

4. Detailed explanation of method

4.1 Formalization of the problem

The input is a text instruction $\mathcal{T}$ and the current visual observation $\mathcal{O}$; the model outputs two types of future quantities:

Action output: an action chunk of $K$ future steps that can be executed sequentially.

$$\mathcal{A}=\{\boldsymbol{a}_i\in \mathbb{R}^{7}\}_{i=1}^{K}$$

$\boldsymbol{a}_i[1: 3]$	wrist rotation.
$\boldsymbol{a}_i[4: 6]$	wrist translation.
$\boldsymbol{a}_i[7]$	gripper state, 0 is closed, 1 is open.
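
The layout above can be made concrete with a small sketch. It assumes the paper's indices are 1-based and inclusive, so $\boldsymbol{a}_i[1:3]$ corresponds to the Python slice `a[0:3]`; the helper name `split_action` and the sample values are hypothetical.

```python
from typing import List, Tuple

ACTION_DIM = 7  # 3 rotation + 3 translation + 1 gripper, per the layout above


def split_action(a: List[float]) -> Tuple[List[float], List[float], float]:
    """Split one 7-D action into (rotation, translation, gripper).

    Assumes the paper's 1-based inclusive ranges [1:3], [4:6], [7],
    which become the 0-based half-open slices [0:3], [3:6], [6] in Python.
    """
    if len(a) != ACTION_DIM:
        raise ValueError(f"expected {ACTION_DIM} values, got {len(a)}")
    rotation = a[0:3]
    translation = a[3:6]
    gripper = a[6]  # 0 is closed, 1 is open
    return rotation, translation, gripper


# A hypothetical action chunk with K = 6 steps (values are made up).
K = 6
chunk = [[0.0, 0.0, 0.1, 0.01, 0.0, -0.02, 1.0] for _ in range(K)]
rotation, translation, gripper = split_action(chunk[0])
print(rotation, translation, gripper)
```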

Visual output: the future video frames expected after the actions are executed; in practice the model predicts their latents rather than pixels.

$$\mathcal{F}=\{\boldsymbol{F}_j\}_{j=1}^{N}$$

Action frequency and video frame frequency are not required to be the same: each action may correspond to multiple future frames. After executing an action chunk, the robot obtains new observations and repeatedly predicts the next action chunk.

4.2 Overall architecture

VideoVLA mainly consists of two encoders and a DiT backbone. The T5 text encoder converts the language instruction into a fixed-length sequence of 226 tokens; CogVideoX's 3D-causal VAE encoder encodes video clips into frame latents. Since the VAE is causal, the first frame latent $\boldsymbol{V}_1$ encodes only the first frame, which is the current observation.

Figure 2. Language and video are first encoded into tokens/latents; DiT uses language tokens and current frame latent as conditions to jointly predict future action chunks and future frame latents. The pink video decoder is only used when visualizing the future.

During training, the complete video clip enters the video encoder, so the model can obtain the current latent $\boldsymbol{V}_1$ and the future target latents $\{\boldsymbol{V}_j\}_{j=2}^{n}$; during inference, only the current observation is encoded to obtain $\boldsymbol{V}_1$.
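
The frame and latent counts reported in the paper's settings (49 frames as 13 latents, 13 frames as 4 latents) are consistent with a causal VAE that encodes the first frame on its own and then compresses every 4 subsequent frames into one latent. The sketch below encodes that relation. The 4x temporal factor is taken from CogVideoX's VAE, and the function name is hypothetical.

```python
TEMPORAL_FACTOR = 4  # temporal compression assumed from CogVideoX's 3D causal VAE


def num_latents(num_frames: int, factor: int = TEMPORAL_FACTOR) -> int:
    """Number of frame latents for a clip under a causal video VAE.

    The first frame gets its own latent; every further group of
    `factor` frames shares one latent.
    """
    if num_frames < 1 or (num_frames - 1) % factor != 0:
        raise ValueError("num_frames must be 1 plus a multiple of the temporal factor")
    return 1 + (num_frames - 1) // factor


for frames in (1, 13, 25, 49):
    print(frames, "frames ->", num_latents(frames), "latents")
```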

4.3 Data preprocessing and token sequence

For each visual latent, VideoVLA flattens the spatial dimensions in raster order. Let $\boldsymbol{V}'_1$ be the flattened current observation latent, and $\{\boldsymbol{V}'_j\}_{j=2}^{n}$ the flattened future frame latents. The multimodal sequence is concatenated from the following parts:

Input/target sequence: T5 language tokens T + current observation latent V'_1 + noisy future frame latents {V'_j}_{j=2..n} + noisy action chunk A
Condition: T and V'_1
Diffusion targets: future visual latents and action chunk

All modalities are first projected to a common embedding dimension. Gaussian noise is added to the future video latents and the actions, and the model learns to denoise them with a DDPM diffusion loss. The noise timestep embedding is injected via adaptive LayerNorm, as in DiT. The backbone is initialized from pretrained CogVideoX.
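
As a rough illustration of the sequence described above, the sketch below lays out the segments in order and records which are clean conditions and which are noised denoising targets. The 226 text tokens come from the notes; the per-latent token count `tokens_per_latent`, the choice of one token per action, and all names are assumptions for illustration, not values from the paper.

```python
from dataclasses import dataclass
from typing import List

TEXT_TOKENS = 226  # fixed-length T5 output reported in the notes


@dataclass
class Segment:
    name: str
    start: int    # inclusive token index
    end: int      # exclusive token index
    noised: bool  # True for diffusion targets, False for conditions


def build_layout(n_latents: int, k_actions: int, tokens_per_latent: int) -> List[Segment]:
    """Order: text, current latent V'_1, future latents V'_2..n, action chunk."""
    parts = [
        ("text", TEXT_TOKENS, False),
        ("current_latent", tokens_per_latent, False),
        ("future_latents", (n_latents - 1) * tokens_per_latent, True),
        ("actions", k_actions, True),  # assumption: one token per action
    ]
    layout, cursor = [], 0
    for name, length, noised in parts:
        layout.append(Segment(name, cursor, cursor + length, noised))
        cursor += length
    return layout


# Hypothetical sizes: 13 latents, 6 actions, 300 tokens per flattened latent.
layout = build_layout(n_latents=13, k_actions=6, tokens_per_latent=300)
for seg in layout:
    role = "denoise" if seg.noised else "condition"
    print(f"{seg.name:15s} [{seg.start:5d}, {seg.end:5d})  {role}")
```

Keeping the conditions at the front and the targets at the back is only one possible ordering; what matters for the method is that all four segments share one sequence so that attention can mix them.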

4.4 Unified Future Modeling

VideoVLA's unified future modeling treats "future actions" and "future visual consequences" as two modalities of the same future process, modeled simultaneously in one transformer. Unlike modular video planning, VideoVLA does not first generate a video and then recover actions through inverse dynamics; instead, action tokens and visual tokens interact directly during denoising.

Training objective: apply the diffusion denoising loss to the actions and the future video latents simultaneously.

$$\mathcal{L}_{\text{dual}}=\mathcal{L}_{\text{video denoise}}+\mathcal{L}_{\text{action denoise}}$$

The main text of the paper does not write the total loss as a separate formula, but the dual-prediction ablation states clearly that the default setting applies denoising losses to both modalities at the same time.
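
A minimal sketch of the summed objective, assuming a standard DDPM forward process $x_t=\sqrt{\bar\alpha_t}\,x_0+\sqrt{1-\bar\alpha_t}\,\epsilon$, a noise-prediction MSE, equal weighting of the two terms, and one shared noise level for both modalities. The paper may use a different parameterization or weighting, and `dummy_model` is a stand-in, not the VideoVLA network.

```python
import math
import random
from typing import Callable, List, Tuple

Vector = List[float]


def add_noise(x0: Vector, eps: Vector, alpha_bar: float) -> Vector:
    """DDPM forward step: x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * x + b * e for x, e in zip(x0, eps)]


def mse(pred: Vector, target: Vector) -> float:
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def dual_loss(
    model: Callable[[Vector, Vector, float], Tuple[Vector, Vector]],
    video0: Vector,
    action0: Vector,
    alpha_bar: float,
    rng: random.Random,
) -> float:
    """Sum of video and action denoising losses at one shared noise level."""
    eps_v = [rng.gauss(0.0, 1.0) for _ in video0]
    eps_a = [rng.gauss(0.0, 1.0) for _ in action0]
    video_t = add_noise(video0, eps_v, alpha_bar)
    action_t = add_noise(action0, eps_a, alpha_bar)
    pred_v, pred_a = model(video_t, action_t, alpha_bar)
    return mse(pred_v, eps_v) + mse(pred_a, eps_a)


def dummy_model(video_t: Vector, action_t: Vector, alpha_bar: float):
    """Stand-in network that always predicts zero noise."""
    return [0.0] * len(video_t), [0.0] * len(action_t)


loss = dual_loss(dummy_model, [0.5] * 8, [0.1] * 7, alpha_bar=0.5, rng=random.Random(0))
print(f"dual loss with a zero-predicting stand-in: {loss:.3f}")
```

In this framing, the "no video loss" ablation corresponds to dropping the first term of the sum while keeping the video tokens in the sequence.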

The appendix further compares attention direction and diffusion schedule: the default bidirectional interaction and synchronous diffusion schedule both outperform a causal mask and an asynchronous schedule (Appendix: More Analysis).

4.5 Training and inference details

Item	Settings given in the paper
Pre-training data	Open X-Embodiment subset, 22.5M frames; OXE originally included 60 datasets, 22 robot embodiments, and more than 1M real-world robot trajectories.
Real robot fine-tuning data	Realman robot teleoperation, 5824 samples, covering three types of tasks: pick, stack, and place.
Backbone	CogVideoX-5B.
Default horizon	At each inference step the model predicts 13 future frame latents (49 frames) and 6 future actions.
Deployment execution	Predict 6 actions each time, but execute only the first 3.
Training	Pre-training for 100K iterations; real-robot fine-tuning for 15K iterations; 32 AMD MI300X GPUs; batch size 256.
Optimizer	AdamW, learning rate 1e-5, weight decay 1e-4.
Sampling	DDIM sampling. The main text states 50 denoising steps for inference; the appendix's discussion of real-deployment limitations states 10 denoising steps, about 1.1 s per prediction on an H100, roughly 3 Hz.
Efficiency settings	Simulation predicts 13 latents (49 frames); the real-robot experiments predict 4 latents (13 frames) for efficiency.
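
The deployment setting in the table (predict 6 actions, execute the first 3, then re-observe) is a receding-horizon loop. The sketch below shows the control flow only; `predict_chunk`, `ToyEnv`, `fake_policy`, and the step budget are hypothetical placeholders rather than the official API.

```python
from typing import Callable, List

PREDICT_HORIZON = 6  # actions predicted per model call (from the table)
EXECUTE_STEPS = 3    # actions actually executed before re-planning


class ToyEnv:
    """Placeholder environment that just records executed actions."""

    def __init__(self) -> None:
        self.executed: List[List[float]] = []

    def observe(self) -> int:
        return len(self.executed)  # stand-in for a camera image

    def step(self, action: List[float]) -> None:
        self.executed.append(action)


def run_episode(
    env: ToyEnv,
    predict_chunk: Callable[[int, str], List[List[float]]],
    instruction: str,
    max_steps: int,
) -> int:
    """Re-plan every EXECUTE_STEPS actions; return the number of model calls."""
    calls = 0
    while len(env.executed) < max_steps:
        chunk = predict_chunk(env.observe(), instruction)
        calls += 1
        for action in chunk[:EXECUTE_STEPS]:
            if len(env.executed) >= max_steps:
                break
            env.step(action)
    return calls


def fake_policy(observation: int, instruction: str) -> List[List[float]]:
    """Hypothetical policy returning a chunk of zero actions."""
    return [[0.0] * 7 for _ in range(PREDICT_HORIZON)]


env = ToyEnv()
calls = run_episode(env, fake_policy, "pick up the cup", max_steps=12)
print(calls, "model calls for", len(env.executed), "executed actions")
```

Discarding the last 3 predicted actions trades extra model calls for fresher observations, which is the usual motivation for executing only a prefix of each chunk.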

5. Experiment

5.1 Experimental scope and evaluation protocol

The paper evaluates in both simulation and the real world, with in-domain and generalization settings in each. Generalization covers two abilities: performing learned skills on novel objects, and transferring to the target robot skills that were demonstrated only by other embodiments.

Evaluation category	Number of tasks/trials
Google Robot SIMPLER-VM	Pick Up Coke Can 300; Move Near 240; Open/Close Drawer 216; Open Top Drawer and Place Apple 108.
Google Robot SIMPLER-VA	Pick Up Coke Can 825; Move Near 600; Open/Close Drawer 378; Open Top Drawer and Place Apple 189.
WidowX SIMPLER-VM	Four tasks with 24 trials each.
Novel objects / new skills simulation	25 trials per novel object; 20 trials per new skill.
Real-world	Pick Up 24; Stack 48; Place 24; 12 for each novel object; 16 for each new skill.

5.2 SIMPLER in-domain

In SIMPLER, the Google robot is evaluated under Visual Matching (VM) and Variant Aggregation (VA), while WidowX has only VM. VideoVLA has the highest WidowX VM average (tied with $\pi_0$ at 53.1), the highest Google VA average, and the highest overall average across all 12 tasks; its Google VM average is second to CogACT.

Method	WidowX VM Avg	Google VM Avg	Google VA Avg	Avg All
RT-1-X	1.1	42.7	30.5	24.8
OpenVLA	4.2	34.3	39.4	26.0
SpatialVLA	34.4	54.6	52.4	47.1
$\pi_0$	53.1	53.5	43.4	50.0
CogACT	51.3	75.2	61.4	62.6
VideoVLA	53.1	73.1	62.8	63.0
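
The Avg All column can be cross-checked from the three per-setting averages. Because each setting contributes four tasks, the unweighted mean of the three averages equals the mean over all 12 tasks. The numbers below are copied from the table; only the variable names are new.

```python
from statistics import mean

# (WidowX VM Avg, Google VM Avg, Google VA Avg, reported Avg All), copied from the table.
rows = {
    "RT-1-X": (1.1, 42.7, 30.5, 24.8),
    "OpenVLA": (4.2, 34.3, 39.4, 26.0),
    "SpatialVLA": (34.4, 54.6, 52.4, 47.1),
    "pi_0": (53.1, 53.5, 43.4, 50.0),
    "CogACT": (51.3, 75.2, 61.4, 62.6),
    "VideoVLA": (53.1, 73.1, 62.8, 63.0),
}

# Recompute the overall average from the three setting averages.
recomputed = {name: round(mean(vals[:3]), 1) for name, vals in rows.items()}

for name, vals in rows.items():
    print(f"{name:10s} recomputed={recomputed[name]:.1f} reported={vals[3]:.1f}")
```

All six recomputed values agree with the reported column to one decimal place.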

5.3 SIMPLER novel objects

The Google robot's Pick Up skill is evaluated on 10 unseen objects from YCB and GSO. VideoVLA averages 65.2, well above SpatialVLA (50.8) and CogACT (42.4), and is the best on 8 of the 10 objects.

Method	Avg	Main points
OpenVLA	6.4	Several objects are close to 0.
SpatialVLA	50.8	Second-highest average; highest on cleaner bottle (56.0).
$\pi_0$	28.8	Moderate, but 0 for toy airplane.
CogACT	42.4	green cube 84.0, carrot 72.0, but several objects are lower.
VideoVLA	65.2	green cube 96.0, carrot 84.0, eggplant 88.0, plum 80.0, tennis ball 68.0.

5.4 SIMPLER new skills / cross-embodiment transfer

The new skills appear in the WidowX training data but not in the Google robot training set. VideoVLA is the best on every skill, with an average of 48.6, 28.2 points above second-place CogACT (20.4).

Method	Put Spoon	Put Carrot	Stack Block	Take Out Apple	Flip Cup	Pour Coke	Slide	Avg
OpenVLA	0.0	12.5	0.0	26.7	0.0	4.0	0.0	6.2
CogACT	20.8	41.7	5.0	43.8	4.0	20.0	8.0	20.4
VideoVLA	56.3	58.3	20.0	93.8	20.0	52.0	40.0	48.6

5.5 Real Realman Experiment

The real robot is a 7-DoF Realman arm with a gripper. All models are first pre-trained on OXE and then fine-tuned on the Realman data collected by the authors. In-domain tasks include pick, stack, and place. Place requires picking the object up first and then placing it, so success rates are reported for the two stages separately.

Method	Pick Up Avg	Stack Avg	Place Avg	Task Avg
OpenVLA	8.3	6.3	14.6	9.7
SpatialVLA	37.5	20.8	10.4	22.9
$\pi_0$	66.7	54.2	31.3	50.7
CogACT	75.0	64.6	35.5	58.4
VideoVLA	70.8	66.7	56.3	64.6

On real novel objects, VideoVLA achieves a non-zero success rate on all 12 unseen objects, averaging 50.6; CogACT is second at 26.9. In real cross-embodiment skill transfer, VideoVLA averages 58.0, well above CogACT's 35.1.

Real generalization setting	OpenVLA	SpatialVLA	$\pi_0$	CogACT	VideoVLA
Novel objects Avg	9.6	14.1	21.8	26.9	50.6
New skills Avg	8.3	13.5	28.5	35.1	58.0

5.6 Ablation experiment

Backbone	Pick Up Coke Can	Move Near	Open/Close Drawer	Avg
OpenSora-1.1	67.7	57.1	25.9	50.2
CogVideoX-5B trained from scratch	18.6	10.8	9.2	12.6
CogVideoX-5B pretrained	92.3	82.9	66.2	80.4

Future frames	Pick Up Coke Can	Move Near	Open/Close Drawer	Avg
13 frames	88.7	75.4	61.6	75.2
25 frames	90.0	79.2	63.0	77.4
49 frames	92.3	82.9	66.2	80.4

Dual-prediction variant	In-domain Avg	Novel Objects	New Skills
Default	80.4	65.2	48.6
No video loss	27.0	12.7	4.4
Action only	25.5	11.3	2.1

Appendix ablation	Pick Up Coke Can	Move Near	Open/Close Drawer	Avg
Default bidirectional	92.3	82.9	66.2	80.4
Causal mask	89.3	76.2	61.1	75.5
Async train, sync inference	87.3	74.1	60.2	73.8
Async train, async inference	84.7	70.8	57.4	71.0

These ablations support three points: a pre-trained video generation backbone is critical; longer future video horizons help the model reason about action consequences; and bidirectional, synchronous joint denoising of actions and video is better than staged or unidirectional information flow (Appendix: More Analysis).

5.7 Imagination-Execution Correlation

The authors record the real video frames while executing the predicted actions, and pass the predicted video latents through the VAE decoder to obtain imagination frames. They then extract keypoints in the first frame with SIFT, segment the foreground with SAM so that only the robot and object regions are retained, and track the keypoint trajectories with SAM-PT. After the imagination and execution trajectories are aligned through Hungarian matching, the normalized cosine similarity of the trajectory vectors is computed and averaged to obtain the robot motion similarity.
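
The pipeline above ends in a matching step and a cosine-similarity score. The sketch below covers only that last stage on toy 2-D tracks and makes several assumptions the notes do not confirm: keypoints are matched on the distance between their starting positions, a trajectory vector is the concatenation of per-step displacements, and cosine similarity is rescaled from [-1, 1] to [0, 1] as the normalization. For brevity it finds the optimal one-to-one assignment by brute force over permutations, which yields the same matching as the Hungarian algorithm on small sets but does not scale; `scipy.optimize.linear_sum_assignment` would be the usual choice in practice.

```python
import math
from itertools import permutations
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Track = List[Point]  # one keypoint's positions over time


def displacement_vector(track: Track) -> List[float]:
    """Concatenate per-step (dx, dy) displacements into one vector."""
    out: List[float] = []
    for (x0, y0), (x1, y1) in zip(track, track[1:]):
        out.extend([x1 - x0, y1 - y0])
    return out


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0  # assumption: a static track carries no motion evidence
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)


def match_tracks(imagined: List[Track], executed: List[Track]) -> List[int]:
    """Optimal assignment minimizing start-point distance (brute force).

    Assumes both lists contain the same number of tracks.
    """
    def cost(perm: Tuple[int, ...]) -> float:
        return sum(math.dist(imagined[i][0], executed[j][0]) for i, j in enumerate(perm))
    return list(min(permutations(range(len(executed))), key=cost))


def motion_similarity(imagined: List[Track], executed: List[Track]) -> float:
    """Mean cosine similarity over matched tracks, rescaled to [0, 1]."""
    assignment = match_tracks(imagined, executed)
    sims = [
        cosine(displacement_vector(imagined[i]), displacement_vector(executed[j]))
        for i, j in enumerate(assignment)
    ]
    return sum((s + 1.0) / 2.0 for s in sims) / len(sims)


# Toy data: the executed tracks show the same motion in shuffled order.
imagined = [[(0, 0), (1, 0), (2, 0)], [(5, 5), (5, 6), (5, 7)]]
executed = [[(5, 5), (5, 6), (5, 7)], [(0, 0), (1, 0), (2, 0)]]
score = motion_similarity(imagined, executed)
print(f"motion similarity: {score:.3f}")
```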

Figure 3a. On the Google robot, the higher the motion similarity between visual imagination and real execution, the higher the probability of successful execution.

Figure 3b. The same trend is observed on the WidowX robot.

Metric	Novel Objects	New Skills
Visual Imagination Success Rate	84.0	63.4
Actual Execution Success Rate	65.2	48.6

Visual imagination is judged manually: a success must follow the instruction semantically and show no significant geometric distortion or violation of physical plausibility. Actual execution success is lower than imagination success, which the authors attribute to the additional difficulty of precise physical grounding, actuation noise, and perception errors.

Figure 4. Examples of visual imagination and real execution of VideoVLA predictions.

Figure 5. Real-robot visualization (appendix).

Figure 6. Simulation visualization (appendix).

6. Reproducibility audit

Code and models

The official code repository is VideoVLA-Project/VideoVLA. The README covers environment preparation with `build.sh`, CogVideo T5/VAE checkpoint configuration, and the `sample_video_action.py` inference command. The project page links to the model on Hugging Face.

Reproducibility item	Information given in the paper/project	Status
Model structure	CogVideoX-5B backbone; T5 text encoder with 226 tokens; 3D-causal VAE video encoder; 7-D action vector; DiT unified token sequence.	Relatively sufficient
Training hyperparameters	100K pretraining iterations, 15K fine-tuning iterations, 32 AMD MI300X, batch size 256, AdamW, LR 1e-5, WD 1e-4, DDIM sampling.	Relatively sufficient
Data	OXE subset with 22.5M frames; 5824 real Realman samples. The README does not make clear whether the real data is public.	Public reproducibility limited
Evaluation protocol	The main text and appendix give SIMPLER/real-world trial counts, the task list, and the main results.	Sufficient
Full training cost	32 AMD MI300X and the large CogVideoX backbone.	High cost

Official inference skeleton:

    bash build.sh
    # Download CogVideo T5 and VAE checkpoints.
    # Update paths in: config_use/action_config/videovla_config.yaml
    python sample_video_action.py \
      --base config_use/action_config/videovla_config.yaml \
      config_use/action_config/inference_config/inference.yaml

7. Analysis, Limitations and Boundaries

7.1 The most valuable part of this paper

Based on the paper's own evidence, the most valuable contribution is turning the video generator's future visual modeling capability into a training signal for the robot policy, rather than using the video generator as an external planner. The dual-prediction ablation shows that keeping only the action branch or removing the video loss sharply reduces both in-domain and generalization performance, which directly supports the claim that having visual imagination participate in training is the core factor.

7.2 Why the results hold up

The results span simulation and the real world, in-domain and generalization settings, and show large margins on both novel objects and cross-embodiment skill transfer. The backbone, future-frame, dual-prediction, causal-mask, and diffusion-schedule ablations jointly rule out alternative explanations: the gains come neither from model size alone nor from action diffusion alone, but from the pre-trained video generation backbone combined with joint future modeling.

7.3 Limitations clearly stated by the author

The appendix states that the main limitation is inference speed. In real deployment, VideoVLA predicts 4 future latents (13 frames) and 6 future actions (executing the first 3) with 10 DDIM denoising steps, which takes about 1.1 seconds on a single H100, so the effective control frequency is about 3 Hz. The authors attribute the bottleneck to the large pre-trained video generator CogVideoX-5B and suggest future acceleration through smaller robot-oriented video generators, one-step denoising (such as ShortCut), and distillation (Appendix: Limitations and Broader Impacts).
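
One reading that reproduces the quoted figure, assuming the control frequency counts executed actions per second of model latency and ignores execution and sensing time:

$$f_{\text{control}} \approx \frac{3\ \text{actions}}{1.1\ \text{s}} \approx 2.7\ \text{Hz} \approx 3\ \text{Hz}$$

Under the same assumption, executing all 6 predicted actions per call would give about 5.5 Hz, at the cost of re-planning less often.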

7.4 Applicable boundaries

The method suits tabletop manipulation tasks that benefit from predicting visual consequences; for high-frequency closed-loop control, fast contact, and fine force control, the current 3 Hz control frequency may be insufficient.
Full training relies on large-scale OXE data, CogVideoX-5B, and 32 MI300X GPUs; ordinary labs are more likely to reproduce inference or small-scale fine-tuning.
The real-robot dataset has 5824 samples, and the tasks are concentrated in pick, stack, and place; generalization in the open real world still needs to be verified at larger scale.
Imagination success relies on manual judgment, which shows that the quality of visual imagination still lacks fully automated, standardized evaluation metrics.