
Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations

arXiv ID: 2412.14803, v2, submitted 2024-12-19, revised 2025-05-04

Authors: Yucheng Hu*; Yanjiang Guo*; Pengchao Wang; Xiaoyu Chen; Yen-Jen Wang; Jianke Zhang; Koushil Sreenath; Chaochao Lu; Jianyu Chen

Affiliations: Tsinghua University; UC Berkeley; Shanghai AI Lab; Shanghai Qi Zhi Institute; RobotEra

Publication: ICML 2025 Spotlight

Project page: video-prediction-policy.github.io · Code: roboterax/video-prediction-policy (MIT license)

Sources: arXiv abstract, PDF, native LaTeX source, project page, and GitHub README.

One-sentence summary: VPP turns the video diffusion model from a "slow module that generates complete future videos" into a "predictive visual encoder that runs a single forward pass": it first fine-tunes Stable Video Diffusion into a manipulation-domain text-guided video prediction (TVP) model, then extracts that model's internal future representations, aggregates them with a Video Former, and feeds the result to a diffusion transformer policy, so the robot policy learns implicit inverse dynamics by tracking predicted future visual trajectories.

1. Quick overview of the paper

| Overview item | What the paper answers | Where to focus when reading |
| --- | --- | --- |
| Problem addressed | Existing robot vision encoders mostly learn static information via single-image reconstruction, two-image contrast, or image-text contrast, and lack explicit future dynamics; VPP turns the future-predictive representations inside a video diffusion model into the visual condition of a generalist robot policy. | The introduction's hypothesis on predictive visual representations, and Fig. 1's comparison of current vs. future representations. |
| Approach | Two stages: first fine-tune SVD into a manipulation TVP model, then aggregate multi-layer features from a single TVP forward pass into predictive representations, compress them into tokens with Video Former, and finally generate actions with a diffusion transformer policy. | The method's TVP objective, feature aggregation, Video Former spatial-temporal attention, and the diffusion action loss. |
| Key results | CALVIN ABC-D Avg. Len reaches 4.33; MetaWorld 50-task average success rate is 0.682; Panda seen/unseen is 0.85/0.73; XHand seen/unseen/tool-use is 0.75/0.60/0.68. | Table 1, the MetaWorld table, Table 6, and the appendix Panda/XHand breakdown and ablation tables. |
| Caveats | VPP does not rely on full 30-step video denoising to control the robot; it uses only the internal representation of a single TVP forward pass. This is both the key to high-frequency closed-loop control and the main difference from SuSIE/Uni-Pi/GR-1. | Policy roll-out details, the one-step representations in Fig. 5, Video Former, and the latency comparison in the ablations. |

Difficulty rating: ★★★★☆. Requires familiarity with video diffusion, the latent/up-sampling features of Stable Video Diffusion, Diffusion Policy/DiT, and multi-view spatio-temporal attention. The paper has no complex theorems, but the method's implementation chain is long. In a reading group, the most likely question is "why is one rough forward-pass representation of the future enough to guide actions?"

Note on the two headline numbers: the arXiv abstract states that CALVIN ABC-D improves by 18.6% over the previous SOTA, corresponding to VPP's 4.33 in Table 1 versus RoboUniview's 3.65; the introduction and project page state 41.5%, corresponding to the improvement over GR-1's 3.06. This note keeps both bases of comparison explicit alongside the tables.
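The arithmetic behind the two percentages, for quick verification:

```python
# Both headline numbers are relative improvements in Avg. Len,
# just measured against different baselines:
print((4.33 - 3.65) / 3.65)  # ~0.186 -> "18.6% over previous SOTA" (RoboUniview)
print((4.33 - 3.06) / 3.06)  # ~0.415 -> "41.5% over GR-1"
```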

1.1 Contribution list

2. Motivation and related work

2.1 Why static visual representations are not enough

Robot policies need to infer from images "how the world will change next." Visual representation learning methods such as R3M, VIP, VC-1, and Voltron learn semantic and spatial information from video or image-text data, but their training objectives are typically single-image reconstruction, two-frame contrast, MAE, or language generation; neither input nor output explicitly requires predicting a continuous future. These representations are therefore biased toward the current state rather than future dynamics.

2.2 Why a video diffusion model is a suitable candidate

A video diffusion model (VDM) directly models complete video sequences, and a text-guided video prediction (TVP) model can predict future frames conditioned on the current observation and language. The authors' hypothesis is that even without fully denoising future videos to pixel-level clarity, the intermediate features inside the VDM already contain coarse-grained information about how objects and the robot will move. VPP calls these intermediate features predictive visual representations.

Predictive visual representations
Figure 1: VDM internal representations explicitly span current and future time steps, while traditional visual encoders usually represent only the current observation. VPP's core assumption is that future representations can guide action learning.

2.3 Differences from future-predictive control methods

SuSIE first generates a future goal image and then has the policy track it; Uni-Pi learns inverse dynamics between two frames; GR-1 autoregressively generates future frames and actions. The paper argues these methods have two shortcomings: either they use only a single future prediction and cannot cover complex continuous dynamics, or they require full denoising/autoregression and thus run at low control frequency. VPP's difference is to use the VDM's intermediate representations directly instead of waiting for final pixel-level video generation to finish.

3. Preliminaries

3.1 Video Diffusion Models

$$q(x_t|x_{t-1})=\mathcal{N}(x_t;\sqrt{\alpha_t}x_{t-1},(1-\alpha_t)\mathbb{I}),$$ $$x_t=\sqrt{\bar{\alpha}_t}x_0+\sqrt{1-\bar{\alpha}_t}\epsilon_t,\quad \bar{\alpha}_t=\prod_{i=1}^t\alpha_i.$$

Intuitive understanding:

The forward process keeps adding noise to the real video $x_0$ until it approaches Gaussian noise. Training the video generation model means learning the reverse process that gradually restores clean video from noisy video.

$$p(x_{t-1}|x_t)=\mathcal{N}\big(x_{t-1};\sqrt{\bar{\alpha}_{t-1}}\mu_\theta(x_t,t),(1-\bar{\alpha}_{t-1})\mathbb{I}\big),$$ $$\mu_\theta(x_t,t)=\frac{x_t-\sqrt{1-\bar{\alpha}_t}\epsilon_\theta(x_t,t)}{\sqrt{\bar{\alpha}_t}}.$$

In text-guided video generation, the model also learns $\epsilon_\theta(x_t,c)$ conditioned on the initial frame and language prompt. VPP subsequently uses this conditional video prediction model as a visual encoder.
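As a concrete reference, here is a minimal PyTorch sketch of the forward noising process above, with an illustrative linear beta schedule; the names and schedule values are ours, not from the VPP codebase.

```python
import torch

def make_alpha_bar(num_steps: int = 1000) -> torch.Tensor:
    # Linear beta schedule; alpha_bar_t = prod_{i<=t} alpha_i with alpha_i = 1 - beta_i.
    betas = torch.linspace(1e-4, 2e-2, num_steps)
    return torch.cumprod(1.0 - betas, dim=0)

def q_sample(x0: torch.Tensor, t: int, alpha_bar: torch.Tensor):
    # x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps
    eps = torch.randn_like(x0)
    xt = alpha_bar[t].sqrt() * x0 + (1.0 - alpha_bar[t]).sqrt() * eps
    return xt, eps

alpha_bar = make_alpha_bar()
x0 = torch.randn(16, 3, 64, 64)                     # stand-in for a 16-frame video
xt, eps = q_sample(x0, t=999, alpha_bar=alpha_bar)  # t near T: xt is close to white noise
```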

3.2 Diffusion Policy

Diffusion policy treats the action sequence $a_i=(\hat{a}_i,\ldots,\hat{a}_{i+m})$ as the object being denoised. Compared with unimodal regression, a diffusion policy can express multi-modal action distributions. VPP uses diffusion transformer blocks in its action head, with the predictive visual tokens output by Video Former as the cross-attention condition.
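To make the mechanism concrete, here is a minimal DDIM-style sampler for an action chunk, assuming a network that predicts the clean chunk directly (the parameterization VPP's loss in Sec. 4.4 uses); the schedule, names, and dummy model are ours, not the official implementation.

```python
import torch

@torch.no_grad()
def sample_actions(model, cond, alpha_bar, horizon=10, act_dim=7, steps=10):
    # Start the action chunk from pure Gaussian noise and denoise it.
    a = torch.randn(1, horizon, act_dim)
    ks = torch.linspace(len(alpha_bar) - 1, 0, steps).long()
    for i, k in enumerate(ks):
        a0_hat = model(a, cond, k)                 # predicted clean chunk
        if i + 1 < len(ks):
            # Deterministic DDIM update toward the next (less noisy) level.
            eps_hat = (a - alpha_bar[k].sqrt() * a0_hat) / (1 - alpha_bar[k]).sqrt()
            ab_prev = alpha_bar[ks[i + 1]]
            a = ab_prev.sqrt() * a0_hat + (1 - ab_prev).sqrt() * eps_hat
        else:
            a = a0_hat
    return a

# Smoke test with a dummy "model" that always predicts zeros.
alpha_bar = torch.cumprod(1 - torch.linspace(1e-4, 2e-2, 1000), dim=0)
print(sample_actions(lambda a, c, k: torch.zeros_like(a), None, alpha_bar).shape)
```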

4. Method in detail

VPP method
Figure 2: The two-stage VPP pipeline. First fine-tune the TVP model, then aggregate the TVP's internal predictive representations with Video Former, and finally output robot actions with the diffusion transformer policy.

4.1 Stage 1: Training the manipulation TVP model

The authors build on the open-source Stable Video Diffusion (SVD) with 1.5B parameters. The original SVD is conditioned mainly on the initial frame $s_0$; VPP adds the CLIP language feature $l_{emb}$, injecting the language condition through cross-attention, and sets the output video to $16\times256\times256$ to improve training and inference efficiency. Apart from these changes, the main components of SVD are retained.

$$\mathcal{L}_D=\mathbb{E}_{x_0\sim D,\epsilon,t}\left\|V_\theta(x_t,l_{emb},s_0)-x_0\right\|^2,$$

Here $x_0=s_{0:T}$ is the complete video sequence, and $x_t=\sqrt{\bar{\alpha}_t}x_0+\sqrt{1-\bar{\alpha}_t}\epsilon$ is the noise-added video. The model learns to reconstruct the entire future sequence based on initial observations and language.

$$\mathcal{L}_{video}=\lambda_H\mathcal{L}_{D_H}+\lambda_R\mathcal{L}_{D_R}+\lambda_C\mathcal{L}_{D_C}.$$

The three data types are internet human manipulation videos, internet robot manipulation data, and self-collected/downstream data. The appendix gives the sampling proportion of each dataset; these weights balance datasets of different sizes and quality.

| TVP training data | Trajectories | Sampling ratio |
| --- | --- | --- |
| Something-Something-v2 | 191,642 | 0.30 |
| RT-1 | 87,212 | 0.15 |
| Bridge | 23,377 | 0.15 |
| BC-Z | 43,264 | 0.08 |
| Taco-Play / Jaco-Play | 3,603 / 1,085 | 0.01 / 0.01 |
| CALVIN-ABC / MetaWorld | 18,033 / 2,500 | 0.10 / 0.05 |
| Panda Arm / Dexterous Hand | 2,000 / 2,476 | 0.05 / 0.10 |
| Total | 375,192 | 1.00 |

Appendix note: 5,558 trajectories from Bridge and 2,048 from Something-Something-v2 are held out for validation following the Seer setup; 3% of each other dataset is held out for validation.
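A weighted multi-source objective like $\mathcal{L}_{video}$ is commonly implemented by drawing each batch from dataset $j$ with probability $\lambda_j$ rather than summing separate losses; a hedged sketch under that assumption, with dataset keys abbreviated from the table above:

```python
import random

# Per-dataset sampling ratios from the table above (keys are our shorthand).
SAMPLING = {"ssv2": 0.30, "rt1": 0.15, "bridge": 0.15, "bcz": 0.08,
            "taco": 0.01, "jaco": 0.01, "calvin": 0.10, "metaworld": 0.05,
            "panda": 0.05, "dexhand": 0.10}

def next_batch(loaders):
    # loaders: dict name -> iterator yielding (video, text, first_frame) batches.
    name = random.choices(list(SAMPLING), weights=list(SAMPLING.values()))[0]
    return name, next(loaders[name])
```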

4.2 Stage 2: Treating the TVP model as a single-forward-pass vision encoder

Fully denoising to high-quality video is slow and forces open-loop or low-frequency control. VPP's key engineering choice is to run only a single forward pass of the TVP model and extract features from its internal up-sampling layers, without waiting for the final pixel-level video.

$$L_m=V_\theta(x_{t'},l_{emb},s_0)_{(m)},\quad L_m\in\mathbb{R}^{T\times C_m\times W_m\times H_m}.$$

$m$ indexes an up-sampling layer in the TVP model. The input is the concatenation of the current image $s_0$ and the noisy latent $q(x_{t'}|x_0)$; $t'$ usually corresponds to white noise or a latent close to white noise.

$$L'_m=\mathrm{Interpolation}(L_m),\quad L'_m\in\mathbb{R}^{T\times C_m\times W_p\times H_p},$$ $$F_p=\mathrm{concat}\big((L'_0,L'_1,\ldots,L'_m),\mathrm{dim}=1\big).$$

Why aggregate multiple layers:

The final pixel-level layer contains much texture detail irrelevant to control, while intermediate up-sampling features tend to better preserve motion and spatial structure. VPP does not hand-pick a single layer; it interpolates all layers to a common resolution and concatenates them along the channel dimension.
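A minimal sketch of the interpolation-and-concatenation step, assuming the up-sampling activations $L_m$ have already been captured from the TVP forward pass (e.g. with forward hooks); the layer shapes are illustrative, not the repo's:

```python
import torch
import torch.nn.functional as F

def aggregate_features(layer_feats, target_hw=(32, 32)):
    # layer_feats: list of L_m tensors with shape (T, C_m, H_m, W_m).
    resized = [
        F.interpolate(f, size=target_hw, mode="bilinear", align_corners=False)
        for f in layer_feats
    ]
    return torch.cat(resized, dim=1)   # F_p: concat along the channel dim

# Dummy example: three up-sampling layers at different resolutions.
feats = [torch.randn(16, c, r, r) for c, r in [(1280, 8), (640, 16), (320, 32)]]
print(aggregate_features(feats).shape)  # torch.Size([16, 2240, 32, 32])
```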

For multi-view robots, VPP predicts future representations separately for the static camera and the wrist camera, obtaining $F_p^{static}$ and $F_p^{wrist}$. This keeps the TVP input format simple and lets Video Former handle multiple views uniformly.

4.3 Video Former: Compressing the spatio-temporal multi-view representation

The predictive representation is a high-dimensional feature sequence of shape $T\times C\times W\times H$. Video Former aggregates it with a fixed number of learnable tokens $Q_{[0:T,0:L]}$: each frame first undergoes spatial attention, followed by temporal attention across time.

$$Q'=\{\mathrm{Spat\text{-}Attn}(Q[i],(F_p^{static}[i],F_p^{wrist}[i]))\}_{i=0}^{T},$$ $$Q''=\mathrm{FFN}(\mathrm{Temp\text{-}Attn}(Q')).$$

The output $Q''$ is a fixed-length set of tokens, subsequently used as the cross-attention condition of the diffusion policy.
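A hedged single-block sketch of this spatial-then-temporal attention pattern; the actual Video Former stacks 6 such layers with learnable queries (see the hyperparameter table in Sec. 6.2), and flattening both camera views into one key/value sequence is our simplification:

```python
import torch
import torch.nn as nn

class VideoFormerBlock(nn.Module):
    def __init__(self, dim=512, feat_dim=1280, heads=8):
        super().__init__()
        self.proj = nn.Linear(feat_dim, dim)       # project F_p channels to token dim
        self.spat = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temp = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, q, f_static, f_wrist):
        # q: (T, L, dim) learnable queries; f_*: (T, H*W, feat_dim) per-frame features.
        kv = self.proj(torch.cat([f_static, f_wrist], dim=1))
        q = self.spat(q, kv, kv)[0]                # spatial cross-attn, frame by frame
        q = q.permute(1, 0, 2)                     # (L, T, dim): sequences over time
        q = self.temp(q, q, q)[0]                  # temporal self-attn across frames
        q = q.permute(1, 0, 2)
        return q + self.ffn(q)                     # Q'' tokens for the action head

block = VideoFormerBlock()
q = torch.randn(16, 14, 512)                       # T=16 frames, L=14 tokens per frame
f = torch.randn(16, 32 * 32, 1280)
print(block(q, f, f).shape)                        # torch.Size([16, 14, 512])
```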

4.4 Diffusion transformer action head

The action head takes the noised action sequence as input and uses DiT/decoder blocks to progressively recover the actions. The aggregated $Q''$ is injected into each transformer block through cross-attention. The action-denoising objective is:

$$a_k=\sqrt{\bar{\beta}_k}a_0+\sqrt{1-\bar{\beta}_k}\epsilon,$$ $$\mathcal{L}_{\mathrm{diff}}(\psi;A)=\mathbb{E}_{a_0,\epsilon,k}\left\|D_\psi(a_k,l_{emb},Q'')-a_0\right\|^2.$$

Here $a_0$ is the ground-truth action sequence, and $D_\psi$ directly predicts the denoised actions. VPP also uses action chunking, outputting multiple action steps at once to increase control frequency.
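A minimal sketch of this training objective, assuming `policy` stands in for $D_\psi$ (in VPP, DiT blocks with cross-attention to $Q''$) and reusing a DDPM-style cumulative schedule for $\bar{\beta}_k$; the names are ours, not the repo's:

```python
import torch
import torch.nn.functional as F

def action_diffusion_loss(policy, a0, l_emb, q_tokens, beta_bar):
    # a0: (B, m, d) ground-truth action chunks; beta_bar: cumulative noise schedule.
    k = torch.randint(0, len(beta_bar), (a0.shape[0],))
    bb = beta_bar[k].view(-1, 1, 1)                 # broadcast over (m, d)
    eps = torch.randn_like(a0)
    a_k = bb.sqrt() * a0 + (1 - bb).sqrt() * eps    # noised action chunk
    a0_hat = policy(a_k, l_emb, q_tokens, k)        # predict the clean chunk
    return F.mse_loss(a0_hat, a0)
```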

5. Close reading of experiments and figures

5.1 Simulation settings

Simulations include CALVIN ABC$\rightarrow$D and MetaWorld. In CALVIN the training environments are ABC and the test environment is the unseen D; following the GR-1 setup, only language-annotated ABC data is used for training. MetaWorld contains 50 Sawyer tasks, with 50 trajectories per task collected using the official oracle policy. TVP fine-tuning takes 2-3 days on 8 A100s; policy training takes 6-12 hours on 4 A100s.

CALVIN and MetaWorld
Figure 3: CALVIN and MetaWorld task environments, and MetaWorld 50-task success rate table.

5.2 CALVIN ABC-D long-horizon

| Category | Method | Data | Task 1 | Task 3 | Task 5 | Avg. Len |
| --- | --- | --- | --- | --- | --- | --- |
| Direct action | RT-1 | 100% ABC | 0.533 | 0.094 | 0.013 | 0.90 |
| Direct action | Diffusion Policy | 100% ABC | 0.402 | 0.026 | 0.000 | 0.56 |
| Future prediction | SuSIE | 100% ABC | 0.870 | 0.490 | 0.260 | 2.69 |
| Future prediction | GR-1 | 100% ABC | 0.854 | 0.596 | 0.401 | 3.06 |
| Future prediction | VidMan | 100% ABC | 0.915 | 0.682 | 0.467 | 3.42 |
| 3D method | RoboUniview | 100% ABC | 0.942 | 0.734 | 0.507 | 3.65 |
| Ours | VPP | 100% ABC | 0.965 | 0.866 | 0.769 | 4.33 |
| Data efficiency | GR-1 | 10% ABC | 0.672 | 0.198 | 0.069 | 1.41 |
| Data efficiency | VPP | 10% ABC | 0.878 | 0.632 | 0.453 | 3.25 |

CALVIN evaluation chains 5 language instructions per episode; higher Avg. Len is better. VPP's success rate on the fifth task is still 0.769, showing much smaller decay over long chains than GR-1 and VidMan. In the 10% data setting, VPP's 3.25 also beats several full-data future-prediction methods.
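For reference, Avg. Len is simply the mean count of consecutively completed instructions per 5-task chain; a toy sketch of the metric (not CALVIN's official evaluator):

```python
def avg_len(rollouts):
    # rollouts: list of per-episode lists of 5 booleans; an episode
    # stops counting at the first failed instruction.
    total = 0
    for tasks in rollouts:
        for ok in tasks:
            if not ok:
                break
            total += 1
    return total / len(rollouts)

# One episode completes 5/5, another fails at task 3 -> (5 + 2) / 2 = 3.5
print(avg_len([[True] * 5, [True, True, False, False, False]]))
```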

5.3 MetaWorld 50-task

| Method | Easy (28) | Middle (11) | Hard (11) | Average |
| --- | --- | --- | --- | --- |
| RT-1 | 0.605 | 0.042 | 0.015 | 0.346 |
| Diffusion Policy | 0.442 | 0.062 | 0.095 | 0.279 |
| SuSIE | 0.560 | 0.196 | 0.255 | 0.410 |
| GR-1 | 0.725 | 0.327 | 0.451 | 0.574 |
| VPP | 0.818 | 0.493 | 0.526 | 0.682 |

The MetaWorld results show that VPP is not limited to CALVIN's long-horizon language tasks. On hard tasks VPP reaches 0.526 versus GR-1's 0.451; the average success rate rises from GR-1's 0.574 to 0.682, which the paper reports as a 10.8-percentage-point gain over the strongest GR-1 baseline.

5.4 Visualization of the one-step forward predictive representation

One-step predictive representations
Figure 4: A one-step forward representation compared with the fully denoised 30-step prediction. The one-step result has inaccurate textures but already reflects the motion direction of the object and the robot arm.

This figure is key to understanding VPP: the claim is not that a single forward pass can generate beautiful videos, but that its intermediate representation encodes dynamic trends sufficient to support action learning. Control does not need a pixel-perfect future; it only needs enough signal for the implicit inverse-dynamics policy to track future motion.

5.5 Ablations

| Ablation | Avg. Len | Latency | Explanation |
| --- | --- | --- | --- |
| VPP | 4.33 | ~140 ms | Complete model |
| w/o Internet data | 3.97 | ~140 ms | Remove internet manipulation co-training |
| w/o CALVIN video | 3.31 | ~140 ms | Do not fine-tune TVP on downstream CALVIN videos |
| w/o Internet data + w/o SVD pretrain | 1.63 | ~140 ms | Train the video prediction model from scratch; performance drops sharply |
| w/o Video Former | 3.86 | ~450 ms | Accuracy drops and inference slows down |
| w/o Feature Aggregation | 3.60 | ~140 ms | Use only final-layer features instead of multi-layer aggregation |

| Visual encoder | Pre-training type | Avg. Len |
| --- | --- | --- |
| VDM (VPP) | Video generation | 4.33 |
| Stable-VAE | VAE reconstruction | 2.58 |
| VC-1 | MAE reconstruction | 1.23 |
| Voltron | MAE reconstruction + language generation | 1.54 |

| More appendix ablations | Result |
| --- | --- |
| Feature layer | Layer-3 3.72; Layer-6 3.88; Layer-9 4.29; Layer-12 4.05; VPP 4.33 |
| Diffusion time-step | Time-step 10: 4.21; 20: 4.33; 30: 4.25 |
| Single view | Avg. Len 3.58 using only the static view; the authors note this still exceeds 3D Diffuser Actor's 3.35 |
| w/o temporal attention | Remove Video Former's Temporal-attn: Avg. Len 4.18 |
| 2-step denoising | Extract TVP features after 2-step denoising: Avg. Len 4.19, with nearly doubled latency |

5.6 Real-world results

Real-world platforms
Figure 5: Real hardware platforms: a Franka Panda arm and an XArm with a 12-DoF XHand dexterous hand, with example tasks.

The Panda platform collects 2,000 trajectories covering 30+ tasks in 6 categories: picking, placing, pressing, routing, opening, and closing. The XHand platform uses Vision Pro to capture human hand joint motion and retarget it to the 12-DoF dexterous hand. The main text reports 4,000 trajectories, 100+ tasks, and 13 categories, while the appendix reports 2.5k trajectories, 100+ tasks, and 10 categories plus additional tool-use tasks; this report presents the main-text and appendix figures separately, so note the discrepancy.

| Platform / setting | Diffusion Policy | SuSIE | GR-1 | VPP |
| --- | --- | --- | --- | --- |
| Panda seen | 0.42 | 0.56 | 0.52 | 0.85 |
| Panda unseen | 0.25 | 0.46 | 0.38 | 0.73 |
| XHand seen | 0.28 | 0.45 | 0.32 | 0.75 |
| XHand unseen | 0.11 | 0.28 | 0.15 | 0.60 |
| XHand tool-use | 0.05 | 0.23 | 0.15 | 0.68 |

The appendix breakdown shows that pick/place/press/route/drawer in Panda unseen reach 0.80/0.72/0.80/0.70/0.60 respectively. XHand tool-use covers Spoon 0.9, Hammer 0.6, Drill 0.8, and Pipette 0.4, averaging 0.68.

Predictions and executions on unseen tasks
Figure 6: Prediction and execution on unseen tasks. Red is the predicted future and green is the actually executed trajectory; the paper uses this to show that executed actions stay close to the future predicted by the TVP model.

5.7 Prediction quality

The appendix reports FVD on the Bridge validation set: VideoFusion 501.2, Tune-A-Video 515.7, Seer 246.3, VPP 41.4. The authors attribute this improvement to building on pre-trained SVD, whereas earlier TVP models such as Seer did not exploit such a video foundation model.

Human manipulation prediction
Appendix: 30-step denoising predictions on the human manipulation validation set.
Robot prediction
Appendix: TVP predictions on robot datasets; green is ground truth, red is the predicted future.
Predictive representations
Appendix: one-step predictive representations, visualized as blue frames. The authors stress that the textures are inaccurate but the key physical evolution is captured.

6. Reproduction checklist and project details

6.1 Official code and checkpoints

The GitHub README provides a PyTorch implementation; the main directories include video_models, policy_models, policy_training, policy_evaluation, video_dataset, video_conf, and policy_conf. To reproduce CALVIN results you must first install the official CALVIN environment and download the ABC-D dataset; the README estimates the data at about 500 GB.

| Step | Entry point | Notes |
| --- | --- | --- |
| Environment | conda create -n vpp python==3.10, then pip install -r requirements.txt | CALVIN must be installed per mees/calvin; the README says the torch version warning can be ignored. |
| Video model | step1_prepare_latent.py / step1_train_svd.py | The README calls these the main entry points; the repo also contains the file step1_prepare_latent.py. |
| Action model | step2_train_action_calvin.py / step2_train_action_xbot.py | Train the CALVIN and XBot/XHand policies separately. |
| CALVIN evaluation | policy_evaluation/calvin_evaluate.py | Requires paths for svd-robot-calvin, dp-calvin, CLIP, and the CALVIN ABC dataset. |
| Video prediction demo | make_prediction.py --eval --config video_conf/val_svd.yaml ... | Tests prediction on the provided video_dataset_instance samples. |

The README provides Hugging Face checkpoints: CLIP text encoder ~600 MB; svd-robot ~8 GB; svd-robot-calvin ~8 GB; dp-calvin ~1 GB.

6.2 Architecture hyperparameters

| Type | CALVIN | MetaWorld | Franka Panda | XHand |
| --- | --- | --- | --- | --- |
| Video length | 16 | 8 | 16 | 16 |
| Action shape | 10 x 7 | 4 x 4 | 10 x 7 | 10 x 18 |
| Language shape | 20 x 512 | 20 x 512 | 20 x 512 | 20 x 512 |
| Image shape | 256 x 256 | 256 x 256 | 256 x 256 | 256 x 256 |
| Video Former token shape | 16 x 14 x 384 | 8 x 28 x 384 | 14 x 16 x 384 | 14 x 16 x 384 |
| Video Former input/latent dim | 1280 / 512 | 1280 / 512 | 1280 / 512 | 1280 / 512 |
| Video Former heads/layers | 8 / 6 | 8 / 6 | 8 / 6 | 8 / 6 |
| Diffusion Transformer latent dim | 384 | 384 | 384 | 384 |
| Condition shape | 225 x 384 | 225 x 384 | 225 x 384 | 225 x 384 |
| Encoder/decoder layers | 4 / 4 | 4 / 4 | 4 / 4 | 4 / 4 |
| Sampling steps | 10 | 10 | 10 | 10 |
| TVP batch / policy batch | 4 / 76 | 4 / 64 | 4 / 128 | 4 / 128 |
| Epochs / LR | 12 / 1e-4 | 30 / 5e-5 | 30 / 1e-4 | 40 / 1e-4 |

6.3 Reproduction risk points

7. Analysis, Limitations and Boundaries

7.1 The most valuable part of this paper

Judging by the paper's own experimental evidence, VPP's core value is recasting "future prediction" from expensive pixel-level video generation into a representation-extraction problem deployable in a closed loop. Table 1, Fig. 5, and the w/o Video Former / 2-step denoising ablations jointly show that the control policy does not need to wait for a clear future video: one TVP forward pass for predictive representations, plus Video Former compression, is enough to improve performance while sustaining a 7-10 Hz control frequency on CALVIN, MetaWorld, and real-robot tasks.

7.2 Why the results hold up

The paper supports its claim with four kinds of evidence. First, replacing the encoder with Stable-VAE, VC-1, or Voltron drops Avg. Len from 4.33 to 2.58/1.23/1.54, showing that not just any visual encoder achieves this effect. Second, removing internet data, CALVIN video, or SVD pretraining all hurt performance, showing that both the video foundation prior and manipulation-domain adaptation contribute. Third, the layer / time-step / Video Former ablations explain which designs matter. Fourth, seen/unseen/tool-use results on the real Panda and XHand check cross-platform generalization.

7.3 Boundaries the paper states or that the experiments expose

7.4 Discussion question 1: What is the essential difference between a predictive representation and a world model?

VPP's TVP model predicts future visual sequences, but the policy neither rolls out nor plans in the predicted world; it uses the internal representation of one forward pass as a conditional input to learn inverse dynamics. Worth discussing: is this approach closer to representation learning or to implicit planning? If it were combined with MPC or world-model RL in the future, what state, action, reward, or uncertainty structure would need to be added?

7.5 Discussion question 2: Where does the information in the one-step predictive representation come from?

The appendix layer/time-step ablations show Layer-9 and time-step 20 work best; Fig. 5 shows the one-step representation is blurry yet expresses motion. The discussion point: does this information come from physical priors in SVD pre-training, manipulation TVP fine-tuning, the language condition, or multi-layer feature aggregation? The answer bears on whether VPP can transfer to tasks with more complex contact, occlusion, or longer horizons.

Suggested closing for the reading group: the paper's main line can be summed up in one sentence: VPP does not ask the video model to "imagine beautifully" but to "imagine controllably enough" in its internal representation; Video Former and the diffusion action head turn this controllable future representation into closed-loop actions.