
VideoVLA: Video Generators Can Be Generalizable Robot Manipulators

Method name: VideoVLA

Authors: Yichao Shen, Fangyun Wei, Zhiying Du, Yaobo Liang, Yan Lu, Jiaolong Yang, Nanning Zheng, Baining Guo

Organization: IAIR, Xi'an Jiaotong University; Microsoft Research Asia; Fudan University

Publication: NeurIPS 2025; arXiv v1 submitted on 2025-12-07

Links: arXiv: 2512.06963 | PDF | Project page | official code | model

1. Quick overview of the paper

One-sentence summary: VideoVLA turns a pre-trained video generator (CogVideoX-5B) into a VLA robot policy: given a language instruction and the current image, a multimodal Diffusion Transformer simultaneously predicts the future action chunk and the future visual outcome of executing those actions.
Problem addressed: Existing VLAs mostly rely on pre-trained understanding models. They can complete tasks within the training distribution, but generalization to new tasks, new objects, and new embodiments/skills remains limited. The authors attempt to transfer the physical imagination and future-state prediction abilities of video generators to robot manipulation.
Approach: Unify video generation and action generation into a single diffusion denoising problem: language tokens and the current image latent serve as conditions, while the future video latents and the 7-D action sequence are denoised jointly.
Key results: SIMPLER in-domain average 63.0, slightly above CogACT's 62.6; SIMPLER novel-object average 65.2 and new-skill average 48.6; real Realman in-domain average 64.6, novel objects 50.6, cross-embodiment skills 58.0, all of them best in their respective tables.
Reading note: The core is not generating an extra video for display, but using joint video-action prediction as a training constraint. Ablations show that removing the video loss, or predicting actions only, significantly degrades both in-domain and generalization performance.

Difficulty rating: ★★★★☆. Requires familiarity with VLAs, Diffusion Transformers, CogVideoX and causal video VAEs, action chunking, SIMPLER evaluation, and cross-embodiment robot generalization.

Keywords: VideoVLA, video generation, VLA, Diffusion Transformer, visual imagination, action chunk, CogVideoX, SIMPLER, Realman robot.

Core contribution list

VideoVLA illustration
Figure 1. Given a language instruction and the current visual observation, VideoVLA simultaneously predicts the next actions and the future visual consequences of executing them in the environment.

2. Motivation

2.1 What problem should be solved?

The long-term goal of robotic manipulation is to generalize to tasks, objects, and environments unseen during training. Existing VLAs reduce task-specific robot data requirements by building on large-scale vision, language, or vision-language understanding models, but the authors argue that this route still falls short of true generalization, especially on novel objects and unseen skills.

Video generation models, by contrast, generalize strongly under novel text/image conditions and produce physically plausible future videos. The authors observe that this aligns closely with robotic manipulation: a robot likewise has to predict physical consequences from a new instruction and new visual observations, and organize its actions accordingly.

2.2 Key assumptions

If a model can generate a visual imagination consistent with actual execution, it is more likely to predict actions that accomplish the task. In other words, future visual outcomes are not a by-product but an implicit supervisory and diagnostic signal for action reliability.

4. Detailed explanation of method

4.1 Formalization of the problem

The input is a text instruction $\mathcal{T}$ and the current visual observation $\mathcal{O}$; the output consists of two kinds of future quantities:

Action output: an action chunk of $K$ future steps that is executed sequentially.

$$\mathcal{A}=\{\boldsymbol{a}_i\in \mathbb{R}^{7}\}_{i=1}^{K}$$
- $\boldsymbol{a}_i[1{:}3]$: wrist rotation.
- $\boldsymbol{a}_i[4{:}6]$: wrist translation.
- $\boldsymbol{a}_i[7]$: gripper state; 0 is closed, 1 is open.

Visual output: the future video frames expected after the actions are executed; in practice the model predicts their latents.

$$\mathcal{F}=\{\boldsymbol{F}_j\}_{j=1}^{N}$$

The action frequency and the video frame rate need not match: each action may correspond to multiple future frames. After executing an action chunk, the robot acquires a new observation and predicts the next chunk, repeating until the task ends.
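
To make this input/output contract concrete, below is a minimal Python sketch of the interface the formalization implies. The names (`ActionChunk`, `predict_step`) and the latent shape are illustrative assumptions, not the official API.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class ActionChunk:
    """K future actions; each row is one 7-D vector a_i as defined above."""
    actions: np.ndarray  # shape (K, 7): [rotation(3), translation(3), gripper(1)]

    def gripper_open(self, i: int) -> bool:
        # a_i[7] in the paper's 1-based indexing: 0 = closed, 1 = open.
        return self.actions[i, 6] >= 0.5

def predict_step(instruction: str, observation: np.ndarray):
    """Illustrative stub: one model call maps (T, O) to a chunk of K = 6
    actions plus N = 13 future frame latents (latent shape is a placeholder)."""
    return ActionChunk(np.zeros((6, 7))), np.zeros((13, 16, 60, 90))
```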

4.2 Overall architecture

VideoVLA consists mainly of two encoders and a DiT backbone. A T5 text encoder converts the language instruction into a fixed-length sequence of 226 tokens; CogVideoX's 3D-causal VAE encoder encodes video clips into per-frame latents. Because the VAE is causal, the first latent $\boldsymbol{V}_1$ encodes only the first frame, i.e., the current observation.

VideoVLA overview
Figure 2. Language and video are first encoded into tokens/latents; the DiT uses the language tokens and the current-frame latent as conditions to jointly predict the future action chunk and future frame latents. The pink video decoder is used only when visualizing the imagined future.

During training, the complete video clip is fed to the video encoder, so the model obtains both the current latent $\boldsymbol{V}_1$ and the future target latents $\{\boldsymbol{V}_j\}_{j=2}^{n}$; during inference, only the current observation is encoded, yielding $\boldsymbol{V}_1$.
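
A small sketch of why the causal VAE matters operationally: since $\boldsymbol{V}_1$ depends only on the first frame, encoding a one-frame clip at inference yields the same conditioning latent as the training-time encoding of the full clip. Here `vae_encode` is a placeholder callable, not the CogVideoX API.

```python
import torch

def encode_for_training(vae_encode, clip: torch.Tensor):
    """clip: (T, C, H, W) -> per-frame-group latents (n, c, h, w).
    Causality means latents[0] depends only on clip[0], the current observation."""
    latents = vae_encode(clip)
    v1, future_targets = latents[:1], latents[1:]  # condition vs. denoising targets
    return v1, future_targets

def encode_for_inference(vae_encode, observation: torch.Tensor):
    """observation: (C, H, W). A one-frame clip gives the same V_1 as above."""
    return vae_encode(observation[None])[:1]
```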

4.3 Data preprocessing and token sequence

For each visual latent, VideoVLA flattens the spatial dimensions in raster order. Let $\boldsymbol{V}'_1$ be the flattened current-observation latent, and $\{\boldsymbol{V}'_j\}_{j=2}^{n}$ be the flattened future-frame latents. The multimodal sequence is concatenated from the following parts:

- Input sequence: T5 language tokens $\mathcal{T}$ + current observation latent $\boldsymbol{V}'_1$ + noisy future frame latents $\{\boldsymbol{V}'_j\}_{j=2}^{n}$ + noisy action chunk $\mathcal{A}$.
- Conditions: $\mathcal{T}$ and $\boldsymbol{V}'_1$.
- Diffusion targets: the future visual latents and the action chunk.

All modalities are first projected to a common embedding dimension; the future video latents and actions are perturbed with Gaussian noise, and the model learns to denoise them under a DDPM diffusion loss. The noise timestep embedding is injected via adaptive LayerNorm, as in DiT. The backbone is initialized from pretrained CogVideoX.
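
Below is a minimal sketch of how the sequence assembly and forward noising might look, assuming standard DDPM epsilon-noising with a cumulative-alpha schedule. The function name, projection layers, and dimensions are illustrative, not the official implementation.

```python
import torch

def build_sequence(text_feats, v1_feats, future_feats, actions,
                   t, alphas_cumprod, proj):
    """Assemble the DiT input [T | V'_1 | noisy {V'_j}_{j>=2} | noisy A].

    text_feats:   (B, 226, d_text)  T5 outputs
    v1_feats:     (B, L1, d_lat)    flattened current-observation latent (kept clean)
    future_feats: (B, Lf, d_lat)    flattened future-frame latents (diffusion target)
    actions:      (B, K, 7)         action chunk (diffusion target)
    proj:         dict of nn.Linear layers mapping each modality to d_model
    """
    a_bar = alphas_cumprod[t].view(-1, 1, 1)  # (B, 1, 1)
    eps_v, eps_a = torch.randn_like(future_feats), torch.randn_like(actions)
    # DDPM forward noising: x_t = sqrt(a_bar) * x_0 + sqrt(1 - a_bar) * eps
    noisy_v = a_bar.sqrt() * future_feats + (1.0 - a_bar).sqrt() * eps_v
    noisy_a = a_bar.sqrt() * actions + (1.0 - a_bar).sqrt() * eps_a
    seq = torch.cat([proj["text"](text_feats),   # condition
                     proj["frame"](v1_feats),    # condition (kept clean)
                     proj["frame"](noisy_v),     # denoising target, modality 1
                     proj["action"](noisy_a)],   # denoising target, modality 2
                    dim=1)
    return seq, eps_v, eps_a
```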

4.4 Unified Future Modeling

VideoVLA's unified future modeling treats future actions and future visual consequences as two modalities of the same future process, modeled simultaneously in one transformer. Unlike modular video planning, VideoVLA does not first generate a video and then recover actions via inverse dynamics; instead, action tokens and visual tokens interact directly during denoising.

Training goal: apply diffusion denoising loss to action and future video latent simultaneously.

$$\mathcal{L}_{\text{dual}}=\mathcal{L}_{\text{video-denoise}}+\mathcal{L}_{\text{action-denoise}}$$

The paper's main text does not write the total loss as a standalone formula, but the dual-prediction ablation makes clear that the default applies denoising losses to both modalities simultaneously.
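
Continuing that sketch, the dual objective can be read as two ordinary epsilon-prediction MSE terms over the corresponding token spans; the `dit` call signature and output heads below are assumptions for illustration.

```python
import torch.nn.functional as F

def dual_denoise_loss(dit, heads, seq, t, eps_v, eps_a, n_v, n_a):
    """L_dual = L_video + L_action, both plain DDPM epsilon-MSE terms.
    `heads` maps d_model back to each modality's dimension (illustrative)."""
    out = dit(seq, timestep=t)  # (B, L, d_model), bidirectional attention
    pred_v = heads["frame"](out[:, -(n_v + n_a):-n_a])  # video-token span
    pred_a = heads["action"](out[:, -n_a:])             # action-token span
    return F.mse_loss(pred_v, eps_v) + F.mse_loss(pred_a, eps_a)
```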

The appendix further compares attention direction and diffusion schedules: the default bidirectional interaction and synchronous diffusion schedule both beat a causal mask and asynchronous schedules (Appendix, More Analysis).
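
For intuition on that attention-direction comparison, here is a toy sketch of the two mask choices over the four modality blocks; the block-causal reading of the causal variant is my assumption, and block sizes are illustrative.

```python
import torch

def modality_mask(sizes, causal: bool) -> torch.Tensor:
    """sizes: token counts for the [text, V'_1, future video, action] blocks.
    Returns True where attention is allowed. Bidirectional (the default) lets
    action and video tokens attend to each other in both directions; the
    causal variant lets each block attend only to itself and earlier blocks."""
    total = sum(sizes)
    if not causal:
        return torch.ones(total, total, dtype=torch.bool)
    mask = torch.zeros(total, total, dtype=torch.bool)
    start = 0
    for s in sizes:
        end = start + s
        mask[start:end, :end] = True  # self block + all earlier blocks
        start = end
    return mask
```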

4.5 Training and inference details

| Item | Setting given in the paper |
| --- | --- |
| Pre-training data | Open X-Embodiment (OXE) subset, 22.5M frames; OXE originally comprises 60 datasets, 22 robot embodiments, and more than 1M real-world robot trajectories. |
| Real-robot fine-tuning data | Realman teleoperation, 5,824 samples, covering three task types: pick, stack, and place. |
| Backbone | CogVideoX-5B. |
| Default horizon | Each inference step predicts 13 future frame latents (49 frames) and 6 future actions. |
| Deployment execution | Predict 6 actions per step, but execute only the first 3. |
| Training | 100K pre-training iterations; 15K real-world fine-tuning iterations; 32 AMD MI300X GPUs; batch size 256. |
| Optimizer | AdamW, learning rate 1e-5, weight decay 1e-4. |
| Sampling | DDIM; the main text reports inference with 50 denoising steps, while the appendix's real-deployment discussion uses 10 steps (about 1.1 s on an H100, roughly 3 Hz). |
| Efficiency | Simulation predicts 13 latents / 49 frames; the real-robot experiments predict 4 latents / 13 frames for efficiency. |
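
The horizon and deployment rows imply a receding-horizon control loop; a hedged sketch follows. The `env` and `predict_chunk` interfaces are placeholders, and the 10-step DDIM sampler from the appendix would live inside `predict_chunk`.

```python
def control_loop(env, predict_chunk, max_steps=120, execute=3):
    """Receding-horizon deployment: predict a chunk of 6 actions, execute only
    the first `execute`, then re-observe. Per the appendix, one predict_chunk
    call runs 10 DDIM denoising steps, about 1.1 s on one H100 (roughly 3 Hz)."""
    obs = env.reset()
    for _ in range(max_steps // execute):
        actions, _imagined_latents = predict_chunk(env.instruction, obs)  # (6, 7)
        for a in actions[:execute]:  # the remaining predicted actions are discarded
            obs, done = env.step(a)
            if done:
                return True
    return False
```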

5. Experiments

5.1 Experimental scope and evaluation protocol

The paper covers both simulation and the real world, each with in-domain and generalization evaluations. Generalization focuses on two abilities: performing learned skills on novel objects, and transferring skills that were learned by other embodiments but never seen by the target embodiment to the target robot.

| Evaluation category | Tasks / trials |
| --- | --- |
| Google Robot SIMPLER-VM | Pick Up Coke Can 300; Move Near 240; Open/Close Drawer 216; Open Top Drawer and Place Apple 108. |
| Google Robot SIMPLER-VA | Pick Up Coke Can 825; Move Near 600; Open/Close Drawer 378; Open Top Drawer and Place Apple 189. |
| WidowX SIMPLER-VM | Four tasks, 24 trials each. |
| Novel objects / new skills (simulation) | 25 trials per novel object; 20 trials per new skill. |
| Real world | Pick Up 24; Stack 48; Place 24; 12 per novel object; 16 per new skill. |

5.2 SIMPLER in-domain

In SIMPLER, the Google robot has Visual Matching (VM) and Variant Aggregation (VA) settings, while WidowX has only VM. VideoVLA ranks first on the WidowX VM average, the Google VA average, and the overall average across all 12 tasks, and second on the Google VM average.

| Method | WidowX VM Avg | Google VM Avg | Google VA Avg | Avg (All) |
| --- | --- | --- | --- | --- |
| RT-1-X | 1.1 | 42.7 | 30.5 | 24.8 |
| OpenVLA | 4.2 | 34.3 | 39.4 | 26.0 |
| SpatialVLA | 34.4 | 54.6 | 52.4 | 47.1 |
| $\pi_0$ | 53.1 | 53.5 | 43.4 | 50.0 |
| CogACT | 51.3 | 75.2 | 61.4 | 62.6 |
| VideoVLA | 53.1 | 73.1 | 62.8 | 63.0 |

5.3 SIMPLER novel objects

The Google robot's Pick Up skill is evaluated on 10 unseen objects from YCB and GSO. VideoVLA averages 65.2, clearly above SpatialVLA's 50.8 and CogACT's 42.4, and is best on 8 of the 10 objects.

| Method | Avg | Notes |
| --- | --- | --- |
| OpenVLA | 6.4 | Several objects near 0. |
| SpatialVLA | 50.8 | Second highest; its best object is the cleaner bottle at 56.0. |
| $\pi_0$ | 28.8 | Moderate, but 0 on the toy airplane. |
| CogACT | 42.4 | Green cube 84.0, carrot 72.0, but several objects much lower. |
| VideoVLA | 65.2 | Green cube 96.0, carrot 84.0, eggplant 88.0, plum 80.0, tennis ball 68.0. |

5.4 SIMPLER new skills / cross-embodiment transfer

The new skills appear in the WidowX robot's training data but not in the Google robot's training set. VideoVLA is best on every skill, averaging 48.6, which is 28.2 points above the runner-up CogACT at 20.4.

| Method | Put Spoon | Put Carrot | Stack Block | Take Out Apple | Flip Cup | Pour Coke | Slide | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| OpenVLA | 0.0 | 12.5 | 0.0 | 26.7 | 0.0 | 4.0 | 0.0 | 6.2 |
| CogACT | 20.8 | 41.7 | 5.0 | 43.8 | 4.0 | 20.0 | 8.0 | 20.4 |
| VideoVLA | 56.3 | 58.3 | 20.0 | 93.8 | 20.0 | 52.0 | 40.0 | 48.6 |

5.5 Real Realman Experiment

The real robot is a Realman 7-DoF arm with a gripper. All models are first pre-trained on OXE and then fine-tuned on the Realman data collected by the authors. In-domain tasks are pick, stack, and place; place requires first picking up and then placing, so the success rates of the two stages are reported separately.

| Method | Pick Up Avg | Stack Avg | Place Avg | Task Avg |
| --- | --- | --- | --- | --- |
| OpenVLA | 8.3 | 6.3 | 14.6 | 9.7 |
| SpatialVLA | 37.5 | 20.8 | 10.4 | 22.9 |
| $\pi_0$ | 66.7 | 54.2 | 31.3 | 50.7 |
| CogACT | 75.0 | 64.6 | 35.5 | 58.4 |
| VideoVLA | 70.8 | 66.7 | 56.3 | 64.6 |

On real novel objects, VideoVLA achieves a non-zero success rate on all 12 unseen objects, averaging 50.6; CogACT is second at 26.9. In real cross-embodiment skill transfer, VideoVLA averages 58.0, clearly above CogACT's 35.1.

| Real generalization setting | OpenVLA | SpatialVLA | $\pi_0$ | CogACT | VideoVLA |
| --- | --- | --- | --- | --- | --- |
| Novel objects Avg | 9.6 | 14.1 | 21.8 | 26.9 | 50.6 |
| New skills Avg | 8.3 | 13.5 | 28.5 | 35.1 | 58.0 |

5.6 Ablation experiments

| Backbone | Pick Up Coke Can | Move Near | Open/Close Drawer | Avg |
| --- | --- | --- | --- | --- |
| OpenSora-1.1 | 67.7 | 57.1 | 25.9 | 50.2 |
| CogVideoX-5B trained from scratch | 18.6 | 10.8 | 9.2 | 12.6 |
| CogVideoX-5B pretrained | 92.3 | 82.9 | 66.2 | 80.4 |

| Future frames | Pick Up Coke Can | Move Near | Open/Close Drawer | Avg |
| --- | --- | --- | --- | --- |
| 13 frames | 88.7 | 75.4 | 61.6 | 75.2 |
| 25 frames | 90.0 | 79.2 | 63.0 | 77.4 |
| 49 frames | 92.3 | 82.9 | 66.2 | 80.4 |

| Dual-prediction variant | In-domain Avg | Novel Objects | New Skills |
| --- | --- | --- | --- |
| Default | 80.4 | 65.2 | 48.6 |
| No video loss | 27.0 | 12.7 | 4.4 |
| Action only | 25.5 | 11.3 | 2.1 |

| Appendix ablation | Pick Up | Move | Open/Close | Avg |
| --- | --- | --- | --- | --- |
| Default bidirectional | 92.3 | 82.9 | 66.2 | 80.4 |
| Causal mask | 89.3 | 76.2 | 61.1 | 75.5 |
| Async train, sync inference | 87.3 | 74.1 | 60.2 | 73.8 |
| Async train, async inference | 84.7 | 70.8 | 57.4 | 71.0 |

These ablations support three points: a pretrained video-generation backbone is critical; longer future video horizons help the model infer action consequences; and bidirectional, simultaneous joint denoising of actions and video beats staged or unidirectional information flow (Appendix, More Analysis).

5.7 Imagination-Execution Correlation

The authors record real video frames while executing the predicted actions, and decode the predicted video latents with the VAE decoder to obtain imagination frames. SIFT extracts keypoints in the first frame, SAM segments the foreground so that only robot and object regions are kept, and SAM-PT tracks the keypoint trajectories. After aligning imagined and executed trajectories via Hungarian matching, the normalized cosine similarity of the trajectory vectors is computed and averaged to give a robot motion similarity.
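
As a hedged reimplementation sketch of this metric, assume the keypoint trajectories have already been extracted (the SIFT/SAM/SAM-PT stages are omitted); the Hungarian step can be realized with `scipy.optimize.linear_sum_assignment`. The paper does not spell out its normalization, so the mapping to [0, 1] below is an assumption.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def motion_similarity(imag_tracks: np.ndarray, exec_tracks: np.ndarray) -> float:
    """imag_tracks: (M, T, 2) and exec_tracks: (N, T, 2) keypoint trajectories.
    Returns the mean cosine similarity over Hungarian-matched pairs, in [0, 1]."""
    def unit_displacements(tracks):
        # Represent each trajectory by its frame-to-frame displacement vector.
        d = np.diff(tracks, axis=1).reshape(tracks.shape[0], -1)  # (., (T-1)*2)
        return d / (np.linalg.norm(d, axis=1, keepdims=True) + 1e-8)

    cos = unit_displacements(imag_tracks) @ unit_displacements(exec_tracks).T
    rows, cols = linear_sum_assignment(-cos)  # maximize total similarity
    return 0.5 * (cos[rows, cols].mean() + 1.0)  # assumed [0, 1] normalization
```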

Google robot similarity
Figure 3a. On the Google robot, the higher the motion similarity between visual imagination and real execution, the higher the probability of successful execution.
WidowX robot similarity
Figure 3b. The same trend is observed on the WidowX robot.
| Metric | Novel Objects | New Skills |
| --- | --- | --- |
| Visual imagination success rate | 84.0 | 63.4 |
| Actual execution success rate | 65.2 | 48.6 |

Visual imagination is judged manually: success requires semantically following the instruction without significant geometric distortion or violations of physical plausibility. Actual execution scores lower than imagination, which the authors attribute to the additional difficulty of precise physical grounding, actuation noise, and perception errors.

VideoVLA visualization
Figure 4. Examples of VideoVLA's predicted visual imagination alongside the real execution.
Real-world supplementary visualization
Figure 5. Additional real-robot visualizations (appendix).
Simulation supplementary visualization
Figure 6. Additional simulation visualizations (appendix).

6. Reproducible auditing

Code and models

There is an official code repository: VideoVLA-Project/VideoVLA. The README provides `build.sh` for environment setup, CogVideo T5/VAE checkpoint configuration, and a `sample_video_action.py` inference command. The project page links to models on Hugging Face.

| Item | Information given in the paper/project | Status |
| --- | --- | --- |
| Model structure | CogVideoX-5B backbone; T5 text encoder, 226 tokens; 3D-causal VAE video encoder; 7-D action vector; unified DiT token sequence. | Relatively sufficient |
| Training hyperparameters | 100K pretraining, 15K finetuning, 32 AMD MI300X, batch size 256, AdamW, LR 1e-5, WD 1e-4, DDIM sampling. | Relatively sufficient |
| Data | OXE subset, 22.5M frames; real Realman, 5,824 samples. The README does not say whether the real data is public. | Public reproducibility limited |
| Evaluation protocol | Main text and appendix give SIMPLER/real-world trial counts, task lists, and main results. | Full |
| Full training cost | 32 AMD MI300X GPUs and a large CogVideoX backbone. | High |
Official inference skeleton:

```bash
bash build.sh
# Download CogVideo T5 and VAE checkpoints.
# Update paths in: config_use/action_config/videovla_config.yaml
python sample_video_action.py \
    --base config_use/action_config/videovla_config.yaml \
           config_use/action_config/inference_config/inference.yaml
```

7. Analysis, Limitations and Boundaries

7.1 The most valuable part of this paper

Judged by the paper's own evidence, the most valuable contribution is converting the video generator's capacity for future visual modeling into a training signal for the robot policy, rather than using the generator as an external planner. The dual-prediction ablation shows that keeping only the action loss, or removing the video loss, sharply reduces both in-domain and generalization performance, directly supporting visual imagination in training as the core factor.

7.2 Why the results hold up

The results span simulation and the real world, in-domain and generalization settings, with strong baselines compared on both novel objects and cross-embodiment skill transfer. The backbone, future-frames, dual-prediction, causal-mask, and diffusion-schedule ablations jointly rule out alternative explanations: the gains do not come merely from model size, nor from action diffusion alone, but from the pre-trained video-generation backbone combined with joint future modeling.

7.3 Limitations clearly stated by the author

The appendix states that the main limitation is inference speed. In real deployment, VideoVLA predicts 4 future latents (13 frames) and 6 future actions (executing the first 3) with 10 DDIM denoising steps, taking about 1.1 seconds on a single H100, for an effective control frequency of roughly 3 Hz. The authors attribute the bottleneck to the large pre-trained video generator CogVideoX-5B and suggest future speedups via robot-oriented small video generators, one-step denoising (such as ShortCut), and distillation (Appendix, Limitations and Broader Impacts).

7.4 Applicable boundaries