
Video Generators are Robot Policies

Authors: Junbang Liang, Pavel Tokmakov, Ruoshi Liu, Sruthi Sudhakar, Paarth Shah, Rares Ambrus, Carl Vondrick

Organization: Columbia University; Toyota Research Institute

Paper: arXiv: 2508.00795 | PDF | Project home page

Keywords: Behavior Cloning, Video Generation, Diffusion Policy, Robot Manipulation, Action-free Video

One-sentence summary: This paper proposes Video Policy, which uses a pre-trained video generation model as the "behavioral imagination" backbone of a robot policy and uses an action diffusion head to decode executable actions from the intermediate features of the generated video, improving the generalization of manipulation policies when action demonstrations are scarce and the test distribution shifts.

1. Quick overview of the paper

| Reading positioning | Content |
| --- | --- |
| What problem does the paper address? | Current visuomotor policies generalize poorly under perceptual or behavioral distribution shifts such as new objects, new backgrounds, and new tasks, and collecting real-robot action demonstrations is expensive. The paper aims to exploit the dynamics priors learned by large-scale video generation models to reduce reliance on action-labeled demonstration data. |
| The authors' approach | A video generation model first generates future multi-view videos of the robot completing the task; the intermediate features of the video U-Net are then fed to an action U-Net that predicts future robot actions. The key training choice is a two-stage scheme: first fine-tune the video generator, then freeze the video U-Net and train the action head, blocking the action loss from backpropagating into the video network. |
| Most important results | On RoboCasa, Video Policy reaches an average success rate of 0.63 with 50 demos, higher than DP-VLA (0.57), GR00T (0.50), and UVA (0.50); with 300 demos it reaches 0.66. On Libero10 the average success rate is 0.94, above UVA (0.90) and $\pi_0$ (0.85). Ablations give 0.63 for two-stage training, 0.57 for joint training, and only 0.09 for the untuned video model. |
| Things to note when reading | Do not read this as "the video model directly controls the robot". Actions are still produced by a dedicated action diffusion head; the video generation model mainly provides intermediate representations of future dynamics. Also note the high computational cost: the appendix reports about two weeks of training on 8 A100s, and generating one 25-frame video takes about 9 seconds on an A100. |

Core contribution list

Video Policy teaser
Figure 1: Given an initial observation and a language task, the model simultaneously generates a video and an action sequence of the robot performing the task. This figure is the entry point to the paper's claim: video generation is not a side-channel visualization, but the core source of the policy representation.

2. Background and problem setting

2.1 Core contradictions to be resolved

Robot behavior cloning already works on many manipulation tasks, but a common weakness is distribution shift: the objects, backgrounds, locations, and task combinations seen during training are limited, and even small changes at test time can cause failures. Computer vision and NLP can cover the long tail with ever-larger datasets, but collecting robot action demonstrations is costly, and real-world demonstrations are especially expensive.

The authors treat the video generation model as an exploitable intermediate resource: the Internet and robot datasets contain large amounts of video without action labels, which can help the model learn a dynamics prior for how the current scene evolves into the future task execution. The question the paper asks is not simply how to improve video quality, but whether this kind of pixel-level future prediction can reliably serve action generation.

2.2 Where previous approaches got stuck

2.3 High-level idea of this paper

This paper splits the policy into two roles: the video generator $f$ is responsible for "imagining" the task execution process, and the action model $g$ is responsible for decoding the intermediate features of $f$ into robot actions. The authors' core hypothesis is that as long as the video generation model can accurately synthesize future videos of the robot performing the task, the action decoder can be small and mainly responsible for interface conversion, rather than relearning the complete task policy.

3. Related work context

| Technical line | Positioning in the paper | Difference from this paper |
| --- | --- | --- |
| Behavior Cloning | Supervised learning of actions from demonstrations; diffusion policies are now commonly used to handle multi-modal actions. | Instead of only encoding from vision to action, this paper explicitly trains a video diffusion model to predict future pixels and then decodes actions with an action diffusion head. |
| Visual Pretraining for Policy Learning | Video prediction, contrastive learning, MAE, etc. are used to obtain more robust visual representations. | This paper treats video prediction as the proxy objective of policy learning and validates the effect through success rates, prediction horizons, and action-free-video ablations. |
| Video Models for Decision-Making | Video generative models can be used for world simulation, long-horizon planning, or joint pixel-action generation. | This paper emphasizes a systematic comparison of video/action training objectives within the same framework and provides RoboCasa, Libero10, and real-robot evaluations. |

4. Method details

4.1 Overall formalization

The input is an initial scene image $v_0$ and a language task description $c$. The model must output a sequence of end-effector actions $a_t \in \mathbb{R}^k$. The paper writes the policy as:

Intuition: First generate a future execution video, and then read the action from the intermediate representation of the video generator.

$$ \{\hat v_t\}=f(v_0, c), \qquad \{a_t\}=g(\psi_0, \ldots, \psi_i), \quad \psi_i=f_i(v_0, c) $$

Here $f$ is the video generator, $f_i$ denotes the hidden features of the $i$-th layer of the video generator, and $g$ is the action decoding model. The key is not the final pixels themselves, but the spatiotemporal features formed during the video generation process.
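
To make the decomposition concrete, the following is a minimal, runnable toy sketch of the $f$/$g$ split. The class names, toy layers, and feature dimensions are illustrative assumptions of this note, not the paper's architecture.

```python
import torch
import torch.nn as nn

class ToyVideoGenerator(nn.Module):
    """Stand-in for f: maps a latent of (v0, c) toward future-frame latents,
    exposing intermediate features psi_i along the way."""
    def __init__(self, dim=64, n_layers=3):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(dim, dim) for _ in range(n_layers)])

    def forward(self, z0):                       # z0: (B, dim)
        feats, h = [], z0
        for layer in self.layers:
            h = torch.relu(layer(h))
            feats.append(h)                      # psi_i = f_i(v0, c)
        return h, feats                          # "video" latent + hidden features

class ToyActionDecoder(nn.Module):
    """Stand-in for g: decodes an action chunk from the generator's features."""
    def __init__(self, dim=64, n_feats=3, horizon=16, act_dim=7):
        super().__init__()
        self.head = nn.Linear(dim * n_feats, horizon * act_dim)
        self.horizon, self.act_dim = horizon, act_dim

    def forward(self, feats):
        h = torch.cat(feats, dim=-1)
        return self.head(h).view(-1, self.horizon, self.act_dim)

f, g = ToyVideoGenerator(), ToyActionDecoder()
_, psis = f(torch.randn(1, 64))
actions = g(psis)                                # {a_t}: shape (1, 16, 7)
```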

4.2 Architecture: Video U-Net + Action U-Net

The authors build on image-to-video Stable Video Diffusion (SVD). The video U-Net $\mu_\theta$ receives two kinds of conditioning: the CLIP embedding $\phi(c)$ of the task text $c$, injected via cross-attention, and the latent $z_0=\mathrm{VAE}(v_0)$ of the input image $v_0$ obtained from SVD's frozen VAE, concatenated along the channel dimension with the latents of the noisy future frames.

The action side is a 1D CNN U-Net $\alpha_\theta$ adapted from Diffusion Policy. At each denoising step $i$, hidden features are taken from five layers of the video U-Net decoder; the paper gives the layer indices as 9, 14, 17, 20, and 23. These spatiotemporal features are compressed into a vector $h_i$ by a CNN adapter, which serves as the global conditioning of the action U-Net.

Rather than looking at the final video frame and then doing post-processing, action generation happens simultaneously with video generation at each denoising step.

$$ \{a_t\}=\alpha_\theta(a_i, i, h_i) $$

$a_i$ is the noisy action, $i$ is the diffusion denoising step, and $h_i$ is the intermediate feature of the video U-Net. This design makes the action head rely on the future dynamics representation that the video model is building.

Video Policy method
Figure 2: Method structure. The initial image, future-frame noise, and action noise enter the system together; the video U-Net generates the future video, and the action U-Net uses the video U-Net's intermediate representations to denoise the action.
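
As a sketch of how the decoder features could be turned into the global conditioning $h_i$, the snippet below pools each spatiotemporal feature map and concatenates the results. The adapter architecture, feature shapes, and pooling choice are assumptions of this note, not the paper's exact design.

```python
import torch
import torch.nn as nn

class FeatureAdapter(nn.Module):
    """Compresses one spatiotemporal video U-Net feature map into a vector.
    The paper uses a CNN adapter over five decoder layers (indices 9, 14, 17,
    20, 23); the 1x1x1 conv + global pooling here is only illustrative."""
    def __init__(self, in_ch, out_dim=256):
        super().__init__()
        self.conv = nn.Conv3d(in_ch, out_dim, kernel_size=1)

    def forward(self, feat):                     # feat: (B, C, T, H, W)
        x = self.conv(feat)                      # (B, out_dim, T, H, W)
        return x.mean(dim=(2, 3, 4))             # global average pool -> (B, out_dim)

def build_global_cond(feats, adapters):
    """h_i: concatenation of adapted features from the chosen decoder layers,
    used as the global conditioning of the action U-Net at denoising step i."""
    return torch.cat([ad(f) for ad, f in zip(adapters, feats)], dim=-1)

# Dummy feature maps standing in for three decoder layers (25 frames).
feats = [torch.randn(1, 320, 25, 8, 8),
         torch.randn(1, 640, 25, 16, 16),
         torch.randn(1, 320, 25, 32, 32)]
adapters = nn.ModuleList([FeatureAdapter(c) for c in (320, 640, 320)])
h_i = build_global_cond(feats, adapters)         # (1, 3 * 256)
```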

4.3 Training objectives

The training data $D=\{d_1, \ldots, d_n\}$ consists of demonstrations, each containing video observations $\{v_t\}$, task text $c$, and actions $\{a_t\}$. The video model's training target is standard diffusion noise prediction:

$$ L_{\mathrm{video}}=\mathbb{E}_{z_0, \epsilon, i} \left[\left\|\epsilon-\mu_\theta(z_i, i, \phi(c), z_{i, 0})\right\|^2\right] $$

$z_i$ is the noisy video latent, and $z_{i, 0}$ is the noisy latent embedding corresponding to the first frame. The goal is to let the video U-Net predict the noise $\epsilon$.

$$ L_{\mathrm{action}}=\mathbb{E}_{a_0, \epsilon, i} \left[\left\|\epsilon-\alpha_\theta(a_i, i, h_i)\right\|^2\right] $$

The action head is also trained with diffusion noise prediction. The authors explicitly block gradients from $L_{\mathrm{action}}$ to the video U-Net $\mu_\theta$, so the video network is driven primarily by the pixel-space future-prediction objective.
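
A minimal PyTorch-style sketch of the two losses and the gradient blocking. `video_unet`, `action_unet`, `adapter`, and the batch layout are placeholders rather than the authors' interfaces, and the noise schedule is simplified.

```python
import torch
import torch.nn.functional as F

def add_noise(x0, eps, alpha_bar):
    """Standard DDPM forward noising: x_i = sqrt(a_bar) * x0 + sqrt(1 - a_bar) * eps."""
    return alpha_bar.sqrt() * x0 + (1 - alpha_bar).sqrt() * eps

def training_step(video_unet, action_unet, adapter, batch, alphas_cumprod):
    """One step with both objectives; the action gradient is blocked from the video U-Net."""
    z0, text_emb, z_first, a0 = batch       # future-frame latents, CLIP(c), first-frame latent, actions
    i = torch.randint(0, len(alphas_cumprod), (1,))
    alpha_bar = alphas_cumprod[i]

    # L_video: predict the noise added to the future-frame latents.
    eps_v = torch.randn_like(z0)
    z_i = add_noise(z0, eps_v, alpha_bar)
    eps_v_pred, hidden = video_unet(z_i, i, text_emb, z_first, return_features=True)
    loss_video = F.mse_loss(eps_v_pred, eps_v)

    # L_action: condition on the video features, detached so that L_action
    # never backpropagates into the video U-Net (the paper's stop-gradient choice).
    h_i = adapter([f.detach() for f in hidden])
    eps_a = torch.randn_like(a0)
    a_i = add_noise(a0, eps_a, alpha_bar)
    eps_a_pred = action_unet(a_i, i, h_i)
    loss_action = F.mse_loss(eps_a_pred, eps_a)

    return loss_video + loss_action
```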

4.4 Why is two-stage training important?

The paper compares joint training with two-stage training. The two-stage version first fine-tunes SVD for video generation on the RoboCasa training set, then freezes the video diffusion U-Net and trains the action denoising head. Experiments show the two-stage variant reaches an average success rate of 0.63, above 0.57 for joint training; when the video model is not fine-tuned and only vanilla SVD features are used, the success rate is only 0.09. This supports the authors' explanation: the pixel-space future-video objective is more general than the action objective, and the video model must first adapt to the domain of videos of the robot executing the task.
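
A sketch of the two-stage schedule under the same placeholder interfaces; the loss callables stand in for $L_{\mathrm{video}}$ and $L_{\mathrm{action}}$ from Section 4.3, and the learning rate follows the hyperparameter table in Section 6.2.

```python
import torch

def stage1_video_finetune(video_unet, video_loss_fn, dataloader, steps):
    """Stage 1: fine-tune SVD on robot execution videos with L_video only."""
    opt = torch.optim.AdamW(video_unet.parameters(), lr=1e-5)
    for _, batch in zip(range(steps), dataloader):
        loss = video_loss_fn(video_unet, batch)
        opt.zero_grad(); loss.backward(); opt.step()

def stage2_action_head(video_unet, action_unet, adapter, action_loss_fn, dataloader, steps):
    """Stage 2: freeze the video U-Net; train only the action head and adapter with L_action."""
    video_unet.requires_grad_(False)
    video_unet.eval()
    params = list(action_unet.parameters()) + list(adapter.parameters())
    opt = torch.optim.AdamW(params, lr=1e-5)
    for _, batch in zip(range(steps), dataloader):
        loss = action_loss_fn(video_unet, action_unet, adapter, batch)
        opt.zero_grad(); loss.backward(); opt.step()
```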

5. Experiments and results

5.1 Experimental setup

The simulation experiments cover RoboCasa and Libero10, 34 manipulation tasks in total. Both benchmarks provide 50 human demonstrations per task. RoboCasa follows the official protocol of 50 evaluation rollouts per task across 5 RoboCasa scenes; Libero10 follows the evaluation protocol used by UVA.

The action space is $a_t\in\mathbb{R}^7$: a 6-DoF gripper pose plus an open/close scalar. The visual input comes from three cameras: a gripper-mounted camera and left and right side cameras. During training, 8 frames are predicted per camera, 24 frames in total; to fit SVD's 25-frame input format, a pad frame is added at the beginning of the sequence.
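
The frame layout can be illustrated with a small helper. The camera ordering and the zero pad frame are this note's assumptions about how the 3 × 8 predicted frames are packed into SVD's fixed 25-frame format.

```python
import torch

def assemble_svd_frames(cam_frames):
    """cam_frames: list of 3 tensors, each (8, C, H, W), one per camera view.
    Returns a (25, C, H, W) sequence: one pad frame followed by the 24 frames."""
    frames = torch.cat(cam_frames, dim=0)        # (24, C, H, W)
    pad = torch.zeros_like(frames[:1])           # pad frame at the start of the sequence
    return torch.cat([pad, frames], dim=0)       # (25, C, H, W)

seq = assemble_svd_frames([torch.randn(8, 3, 256, 256) for _ in range(3)])
assert seq.shape[0] == 25
```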

5.2 RoboCasa main results

| Method | Average task success rate | Remarks |
| --- | --- | --- |
| 3DA | 0.06 | Explicit 3D representation baseline |
| DP3 | 0.23 | 3D diffusion policy baseline |
| DP-ResNet | 0.41 | Reproduced by the authors; ImageNet-pretrained ResNet |
| DP-CLIP | 0.43 | Variant using CLIP vision-language representations |
| GR00T | 0.50 | Uses 300 demos |
| FPV | 0.51 | Front-view / 3D-style strong baseline |
| DP-VLA | 0.57 | Uses 3000 MimicGen automated demonstrations |
| UVA | 0.50 | Concurrent work on joint video-action generation |
| Video Policy, 50 demos | 0.63 | Main model of this paper, 50 demos per task |
| Video Policy, 300 demos | 0.66 | Further improvement with additional MimicGen demonstrations |

The most interesting details are the pick-and-place tasks. The authors point out that these tasks have a significant shift in object location/category distribution between training and testing, and Video Policy's improvement on them is particularly clear. For example, with 50 demos PnPStoveToCounter reaches 0.64 and PnPSinkToCounter 0.64, well above GR00T's 0.29 and 0.33 with 300 demos.

5.3 Libero10 Main Results

| Model | DP-C | DP-T | OpenVLA | UniPi | $\pi_0$ | $\pi_0$-FAST | UVA | Ours |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Average success rate | 0.53 | 0.58 | 0.54 | 0.00 | 0.85 | 0.60 | 0.90 | 0.94 |

The appendix gives task-by-task results, with Video Policy averaging 0.94 across 10 Libero10 tasks, with 4 tasks at or near 1.00, and the lowest being 0.80 on the KITCHEN SCENE8 task.

5.4 Ablation: Are video targets useful?

| Variant | RoboCasa average success rate | Explanation |
| --- | --- | --- |
| Joint | 0.57 | End-to-end joint training of video and action objectives |
| 2-Stage | 0.63 | First train video generation, then freeze the video U-Net and train the action head |
| No Video Tuning | 0.09 | SVD is not fine-tuned on robot execution videos; only the action head is trained |
| Half Tasks | 0.41 | Action head trained on only half of the tasks, while the video model sees videos of all tasks |
| DP Half Tasks | 0.21 | ResNet Diffusion Policy trained on only half of the tasks |

This table is the key link in the paper's argument chain. No Video Tuning's 0.09 shows that directly taking pre-trained SVD features is not enough; 2-Stage's 0.63 shows that the video model must first be adapted to robot execution trajectories, and that the action head works best as a decoder on top of the frozen video representation.

5.5 Prediction horizon and action-free video

The authors fix the action prediction to 1.6 seconds into the future and vary the video prediction horizon. The protocol used for the horizon analysis in the appendix differs from standard RoboCasa: the authors sample the MimicGen environments to isolate distribution-shift effects. The per-task table shows an average of 0.67 with a 32-step video horizon, 0.55 with 16 steps, and 0.30 with 0 steps; the gap is larger on distribution-shift tasks such as pick-and-place.

Prediction horizon plot
Figure 3: The longer the video prediction horizon, the higher the success rate; for tasks with distribution shifts, the improvement is more obvious. This supports the conclusion that learning environment dynamics contributes to generalization.
Unseen tasks with action-free videos
Figure 4: The action head is trained on only the left 12 tasks, while the video generation model can use videos from all 24 tasks. On the right-hand tasks, which have no action supervision, Video Policy significantly outperforms DP-ResNet trained on only 12 tasks.

5.6 Real robot results

The real-robot experiments comprise 5 tasks: Open Drawer, Pick and Place, M&Ms to Cup, Upright Object, and Stack Cups. Each task has 200 collected demonstrations, and three categories of generalization are tested: varied object positions, unseen objects, and unseen backgrounds. Success rates are computed over 10 rollouts per condition.

| Task | Vary Object Location | Unseen Objects | Unseen Background |
| --- | --- | --- | --- |
| Open Drawer | 0.8 | 1.0 | 0.9 |
| Pick and Place | 1.0 | 0.9 | 0.8 |
| M&Ms to Cup | 0.8 | 0.9 | 0.2 |
| Upright Object | 0.3 | 0.7 | 0.8 |
| Stack Cups | 0.3 | 0.2 | 0.2 |

The failures in the real experiments are also informative. The authors explicitly point out that failures on Upright Object and Stack Cups often stem from unrealistic video predictions, such as failing to generate a correct upright placement or generating gripper trajectories that tip the cups over. M&Ms to Cup drops to 0.2 on the unseen background because the background color change hurts precise localization of small objects.

Real-world qualitative results
Figure 5: Qualitative results for real Pick and Place, covering position, object appearance, and background color changes.

6. Key points of reproducibility and implementation

6.1 Video model implementation

6.2 Training hyperparameters

| Model | Resolution | Learning rate | Batch | Steps | Precision |
| --- | --- | --- | --- | --- | --- |
| Joint Training | 256×256 | 1e-5 | 32 | 368866 | 16-mixed |
| 2-Stage Training | 256×256 | 1e-5 | 32 | 368866 × 2 | 16-mixed |
| No Video Tuning | 256×256 | 1e-5 | 32 | 368866 | 16-mixed |
| 2-Stage Libero10 | 256×256 | 1e-5 | 32 | 170000 + 140000 | 16-mixed |
| Real World | 256×192 → 448×320 | 1e-5 | 32 | 331500 + 92960 | 16-mixed |

The appendix states that the RoboCasa model was fine-tuned on 8 A100 GPUs for about two weeks, and that continued training did not bring further improvement. The real-world model is first trained at low resolution for speed and then fine-tuned at high resolution to improve quality.
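
For quick reference, the table can be captured as a config dictionary. The key names are this note's invention rather than the authors' configuration schema, and the step values are copied verbatim from the table.

```python
# Illustrative configs mirroring the hyperparameter table above.
TRAIN_CONFIGS = {
    "joint_robocasa":  {"resolution": "256x256", "lr": 1e-5, "batch": 32,
                        "steps": "368866", "precision": "16-mixed"},
    "2stage_robocasa": {"resolution": "256x256", "lr": 1e-5, "batch": 32,
                        "steps": "368866 x 2", "precision": "16-mixed"},
    "no_video_tuning": {"resolution": "256x256", "lr": 1e-5, "batch": 32,
                        "steps": "368866", "precision": "16-mixed"},
    "2stage_libero10": {"resolution": "256x256", "lr": 1e-5, "batch": 32,
                        "steps": "170000 + 140000", "precision": "16-mixed"},
    "real_world":      {"resolution": "256x192 -> 448x320", "lr": 1e-5, "batch": 32,
                        "steps": "331500 + 92960", "precision": "16-mixed"},
}
```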

6.3 Baseline reproducibility details

6.4 Real robot setup

Real demonstrations were collected by humans using a modified handheld gripper. The left and right side cameras are Intel RealSense D435s and the gripper-mounted camera is a Basler fisheye camera; the gripper pose is tracked by a RealSense T265, the opening width is estimated with an ArUco marker, and the gripping force is measured by a single-axis force sensor, with all sensors running at 30 Hz.

The model takes the three RGB camera streams as input and predicts the relative gripper pose, relative gripper opening, and absolute grip force for the next 32 steps. At deployment, the robot executes 24 of the 32 predicted steps using impedance control. If the predicted grip force exceeds the measured value by more than 300 g, the system adds a small gripper-closing correction to prevent insufficient grip force.
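
A sketch of the deployment loop implied by this paragraph. The 32-step prediction, 24 executed steps, and 300 g threshold follow the text; the robot interface and the correction magnitude are placeholders.

```python
EXECUTE_STEPS = 24          # execute 24 of the 32 predicted steps
FORCE_THRESHOLD_G = 300     # grip-force mismatch threshold in grams
CLOSE_CORRECTION = 0.002    # small extra gripper closing (illustrative magnitude)

def execute_chunk(robot, prediction):
    """prediction: 32 dicts with keys 'relative_pose', 'gripper_opening', 'grip_force'."""
    for step in prediction[:EXECUTE_STEPS]:
        opening = step["gripper_opening"]
        # If the model expects much more grip force than is measured, close slightly more.
        if step["grip_force"] - robot.read_grip_force() > FORCE_THRESHOLD_G:
            opening -= CLOSE_CORRECTION
        robot.impedance_step(step["relative_pose"], opening)
```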

Real robot setup
Appendix figure: Data collection and real-robot experimental setup. The figure shows that the real-world data is not collected by teleoperating the robot; instead, human demonstrations are collected with a handheld gripper that matches the robot's end-effector.

7. Analysis, Limitations and Boundaries

7.1 The most valuable part of this paper

The most valuable part is that it turns "whether video generation can serve as a proxy objective for policy learning" into a testable engineering question, rather than a qualitative demonstration. Using the same architecture, the paper compares joint, 2-stage, no-video-tuning, half-tasks, and different video horizons, and concludes from success rates: 2-stage beats joint, the untuned video model is nearly useless, longer video horizons are better, and action-free videos help tasks without action supervision. These ablations directly serve the core claim.

7.2 Why the results hold up

  • Multiple benchmarks: the simulation experiments cover RoboCasa and Libero10, 34 tasks in total, not just one or two demo tasks.
  • Strong baselines: on RoboCasa the comparisons include DP-ResNet, DP-CLIP, GR00T, DP-VLA, UVA, etc.; on Libero10 they include $\pi_0$, $\pi_0$-FAST, UVA, etc.
  • Ablations along the causal chain: No Video Tuning at 0.09 rules out the explanation that "just taking pre-trained SVD features is enough"; 2-Stage at 0.63 versus Joint at 0.57 supports the design of letting the video objective dominate the representation; a 32-step horizon at 0.67 versus 0.55 at 16 steps and 0.30 at 0 steps indicates that the length of future dynamics prediction is tied to policy generalization.
  • Real-robot verification of the boundary: the real experiments show not only successes but also failed tasks and failure reasons, e.g., Stack Cups and M&Ms to Cup fail under certain distribution shifts.

7.3 Limitations given by the author

7.4 Applicable boundaries

Judging from the paper's evidence, Video Policy is best suited to manipulation tasks where the visual distribution shift is significant but the task can still be expressed through short-horizon future videos, such as pick-and-place, opening and closing doors, or pressing buttons. For scenarios requiring extremely precise small-object localization, heavy contact physics, real-time response, or cross-embodiment transfer, the paper's current evidence remains weak.

7.5 Group meeting reading reminder