
VILP: Imitation Learning with Latent Video Planning

Reading Report: VILP embeds a video generation model inside an imitation learning policy. The focus is not on generating larger RGB videos, but on quickly generating future robot videos in latent space and converting those video plans into actions in real time, enabling receding-horizon planning.

arXiv: 2502.01784 · Keywords: Imitation Learning, Latent Video Diffusion, Robot Video Planning, Receding Horizon
Authors: Zhengtong Xu, Qiang Qiu, Yu She
Institution: Purdue University
Code: https://github.com/ZhengtongXu/VILP
Local output: Report/2502.01784/

1. Quick overview of the paper

What does the paper set out to solve? Existing video-planning robot methods usually generate videos in pixel space, which is too slow for real-time replanning, and long-horizon open-loop execution accumulates errors. VILP asks: how can video generation serve as part of the robot policy while remaining fast enough to support receding-horizon control, and how can video data reduce the reliance on high-quality action-annotated data?
The authors' approach: Use a VQGAN to compress multi-view/RGBD images into a latent space and train a 3D-UNet DDIM video diffusion model in that latent space; encode visual observations with ResNet-18 and condition video generation globally via cross-attention; then use a goal-conditioned low-level policy to map adjacent predicted frames into action sequences, executing only the first \(N_e\) steps before replanning.
Most important results: Compared with UniPi, VILP significantly reduces training memory and inference time while maintaining or improving video quality and policy success rate across multiple tasks. For example, VILP-8 on Arrange-Blocks-Hybrid achieves a max/mean success rate of 84.0/77.6, while UniPi-16 reaches only 18.0/13.2; on the real-world Real-Arrange-Blocks task, VILP-16 completes the two-block arrangement in 7/15 trials, UniPi-4/16 in 0/15, and VILP-16's inference time is 0.238 s versus 1.422 s for UniPi-16.
Things to note when reading: VILP's advantages come mainly from the speed of latent planning, the cross-attention conditioning mechanism, multi-view alignment, and receding-horizon execution. However, it is not a fully action-data-free method: a low-level policy still has to be trained to map videos to actions. When reading the experiments, distinguish the FID/FVD/speed experiments that evaluate video planning alone from the success-rate experiments that roll out the complete policy.
VILP generated video plans
Figure 1: Predicted robot video frames generated by VILP. These video plans are then mapped into robot actions.

2. Background and problem setting

Why video planning is suitable for robots

Video generative models naturally learn temporal consistency and future evolution: given current observations, they can "imagine" the future frames of the robot performing a task. If these future frames can be converted into actions, the result is a video-planning policy. Video is easier to collect and scale than action data, so video generation may help robotics move toward a more scalable data paradigm.

Three key questions raised by the author

3. Method details

3.1 Video data format

The paper writes the video dataset as \(\mathcal{D}^{v}=\{(o_0^i, o_1^i, \ldots, o_{T_d^i}^i)\}_{i=1}^{E}\), where \(o\) is a combination of multi-view or RGBD images: \(o=\{f^j\}_{j=1}^{M}\), with \(M\) the number of views. In other words, VILP treats multi-camera input, a common robotic data format, as first-class from the start.
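The data format above can be sketched as a small data structure. This is an illustrative sketch only: the `view_j` keys, episode length, and image sizes are assumptions for the example, not names from the official code release.

```python
import numpy as np

# Hypothetical sketch of the video dataset D^v described above.
# Each episode i is a sequence of observations o_t; each observation is a
# set of M per-view images f^j. Names and sizes are illustrative.
M = 2               # number of camera views
T = 50              # episode length
H, W, C = 96, 160, 3

def make_episode(T, M):
    # one episode: T + 1 observations, each a dict of M view images
    return [{f"view_{j}": np.zeros((H, W, C), dtype=np.uint8)
             for j in range(M)} for _ in range(T + 1)]

E = 4  # number of episodes in D^v
dataset = [make_episode(T, M) for _ in range(E)]
```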

3.2 VQGAN compressed into latent space

Given an image \(f\in\mathbb{R}^{\tilde{H}\times\tilde{W}\times\tilde{C}}\), VQGAN encoder \(\mathcal{E}\) compresses it into \(z=\mathcal{E}(f)\in\mathbb{R}^{H\times W\times C}\), and decoder \(\mathcal{D}\) can reconstruct the image from \(z\). Both RGB images and depth images can be compressed; VQGAN is fixed after training and is not updated with diffusion training.
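The VQGAN step can be sketched with its two essential pieces: shape bookkeeping for the spatial compression, and the nearest-codebook quantization that makes it a *VQ*-GAN. The downsampling factor (8, consistent with the paper's 96x160 → 12x20 example) and the latent channel count (4) are assumptions here, not values confirmed by the paper.

```python
import numpy as np

def latent_shape(h, w, factor=8, channels=4):
    """Shape of z = E(f) for an h x w image under the assumed downsampling."""
    assert h % factor == 0 and w % factor == 0
    return (h // factor, w // factor, channels)

def vector_quantize(z, codebook):
    """Snap each C-dim latent vector to its nearest codebook entry (VQ step)."""
    # z: (H, W, C); codebook: (K, C)
    d = ((z[..., None, :] - codebook[None, None]) ** 2).sum(-1)  # (H, W, K)
    idx = d.argmin(-1)                                           # (H, W)
    return codebook[idx], idx

H, W, C = latent_shape(96, 160)   # (12, 20, 4), matching the paper's example
rng = np.random.default_rng(0)
z = rng.standard_normal((H, W, C))
codebook = rng.standard_normal((64, C))
zq, idx = vector_quantize(z, codebook)
```

The real encoder/decoder are convolutional networks trained with reconstruction and adversarial losses; only the shape arithmetic and quantization logic are shown here.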

The latent space is the source of speed. The paper's example: 96x160 images are compressed to a 12x20 latent, greatly reducing the computation of video diffusion.

3.3 Latent video diffusion planner

Sample \(N\) future frames from the video:

$$\mathbf{f}_t=[f_{t+\Delta t}, f_{t+2\Delta t}, \ldots, f_{t+N\Delta t}]$$

Compress to get latent sequence:

$$\mathbf{z}_t=[z_{t+\Delta t}, z_{t+2\Delta t}, \ldots, z_{t+N\Delta t}]$$
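The two formulas above amount to indexing an episode at a stride of \(\Delta t\) and stacking the encoded frames. A minimal sketch, with toy per-frame latents standing in for the frozen VQGAN encoder's outputs:

```python
import numpy as np

def sample_latent_sequence(latents, t, N, dt):
    """Build z_t = [z_{t+dt}, ..., z_{t+N*dt}] from per-frame latents."""
    idx = [t + n * dt for n in range(1, N + 1)]
    assert idx[-1] < len(latents), "sequence runs past the episode end"
    return np.stack([latents[i] for i in idx])   # (N, H, W, C)

# toy per-frame latents; in practice these come from the frozen VQGAN encoder
episode_latents = [np.random.randn(12, 20, 4) for _ in range(60)]
z_t = sample_latent_sequence(episode_latents, t=0, N=5, dt=2)
```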

The video diffusion model is trained on latent sequences with the goal of:

$$\mathcal{L}(\theta)=\mathbb{E}_{\mathbf{z}_t^0, \epsilon^k, k}\left[\|\epsilon^k-\epsilon_\theta(\mathbf{z}_t^k, k)\|^2\right]$$

The network is a UNet with 3D convolutions that captures spatial and temporal features jointly. To turn it into a planner, the model is additionally conditioned on the current observation \(o_t\), learning \(p(\mathbf{z}_t|o_t)\).
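One training step of the objective \(\mathcal{L}(\theta)\) above can be sketched as follows. The noise schedule is a generic linear DDPM schedule (not the paper's exact hyperparameters), and `eps_theta` is a stand-in for the conditional 3D UNet:

```python
import numpy as np

rng = np.random.default_rng(0)
K = 100                                   # number of diffusion steps (assumed)
betas = np.linspace(1e-4, 0.02, K)        # generic linear noise schedule
alpha_bar = np.cumprod(1.0 - betas)

def eps_theta(z_k, k, cond):
    # stand-in denoiser; the paper uses a 3D UNet conditioned on o_t
    return np.zeros_like(z_k)

def diffusion_loss(z0, cond):
    """One sample of E[ || eps - eps_theta(z^k, k) ||^2 ]."""
    k = rng.integers(K)
    eps = rng.standard_normal(z0.shape)
    z_k = np.sqrt(alpha_bar[k]) * z0 + np.sqrt(1 - alpha_bar[k]) * eps
    return np.mean((eps - eps_theta(z_k, k, cond)) ** 2)

z0 = rng.standard_normal((5, 12, 20, 4))  # clean latent video (N, H, W, C)
loss = diffusion_loss(z0, cond=None)
```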

Latent video planning pipeline
Figure 2: VILP video planning pipeline. Images are first compressed into the latent space, and a 3D UNet performs DDIM denoising there to generate future videos.

3.4 Observation conditioning and multi-view generation

VILP's conditioning proceeds in three steps:

  1. Use a modified ResNet-18 to encode each view's observation into a low-dimensional vector; each view gets its own encoder, and depth maps are repeated three times to form three-channel input.
  2. Concatenate all view embeddings into \(c_t\) and inject it into the 3D UNet's intermediate layers through cross-attention: \(\text{Attention}(Q, K, V)=\text{softmax}(QK^\top/\sqrt{d})V\).
  3. Each view trains its own diffusion model to generate that view's video, while all models share the fused multi-view observation embedding, keeping the generated videos from different views temporally aligned.
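Step 2's cross-attention can be sketched in a few lines. This is a minimal sketch: Q stands for 3D-UNet feature tokens and K = V for the condition embedding \(c_t\); the learned Q/K/V projection matrices of a real attention layer are omitted for brevity.

```python
import numpy as np

def cross_attention(q, c, d):
    """softmax(Q K^T / sqrt(d)) V, with K = V = condition embedding c."""
    scores = q @ c.T / np.sqrt(d)                       # (tokens, cond_rows)
    w = np.exp(scores - scores.max(-1, keepdims=True))  # stable softmax
    w = w / w.sum(-1, keepdims=True)
    return w @ c                                        # (tokens, d)

d = 16
q = np.random.randn(40, d)   # 40 spatio-temporal UNet feature tokens
c = np.random.randn(3, d)    # embeddings of 3 views, stacked as rows of c_t
out = cross_attention(q, c, d)
```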

3.5 From predicted videos to actions

A low-level policy uses adjacent predicted frames to generate action sequences:

$$\hat{\mathbf{a}}_{t+n\Delta t}=\pi(\hat{o}_{t+n\Delta t}, \hat{o}_{t+(n+1)\Delta t}), \quad n=0, \ldots, N-1$$

\(\pi\) consists of two CNN encoders and an MLP head. It maps two adjacent predicted observations into a continuous action segment \([a_{t+n\Delta t}, \ldots, a_{t+(n+1)\Delta t-1}]\). During execution, not all generated actions are executed: only the first \(N_e\) steps run, after which the system re-observes, re-generates the plan, and re-decodes actions. This is receding-horizon control.
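The receding-horizon loop described above can be sketched as follows. `plan`, `pi`, and `env` are placeholder names for this sketch, not the official VILP API:

```python
# Hedged sketch of receding-horizon execution: generate an N-frame video plan,
# decode an action chunk between each pair of adjacent predicted frames with
# the low-level policy pi, execute only the first N_e actions, then replan.
def receding_horizon_rollout(env, plan, pi, N, N_e, replans=10):
    o = env.reset()
    for _ in range(replans):
        future = plan(o, N)                          # N predicted observations
        actions = []
        for n in range(N - 1):
            actions += pi(future[n], future[n + 1])  # chunk per adjacent pair
        for a in actions[:N_e]:                      # execute only first N_e
            o = env.step(a)
    return o
```

Executing only a prefix of each plan is what keeps the policy closed-loop: errors in the later, less reliable predicted frames never reach the robot.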

Low-level policy
Figure 3: The low-level policy maps adjacent predicted frames into action sequences, executed with a receding-horizon strategy.

4. Experiments and results

4.1 Video planning experiment

The authors evaluate video planning alone on Move-the-Stack, Push-T, and Towers-of-Hanoi, without policy rollout. Episodes are split 9:1 (90% training, 10% unseen testing); metrics include FID, FVD, and single-pass inference time. Both VILP and UniPi use DDIM, and the number after a method name denotes its denoising steps.

| Task | Key observations |
| --- | --- |
| Move-the-Stack | UniPi-64 achieves the best FID/FVD but takes 2.5 s; VILP-4 reaches similar quality at 0.058 s, much faster. |
| Push-T | VILP significantly outperforms UniPi at small step counts: VILP-8 FID 14.65; VILP-16 FVD 447.56 vs. UniPi-16 FVD 744.06. |
| Towers-of-Hanoi | VILP-8/16 has better FID/FVD than UniPi, keeping a better quality/speed trade-off on this long, structured task. |
Experiment tasks
Figure 4: The tasks used for video planning and policy rollout.

4.2 Training memory and speed

VILP generates in latent space, so its training memory is far lower than UniPi's. For example, on Arrange-Blocks VILP needs only 10.0 GB to generate 5 frames at 96x160, while UniPi needs 82.5 GB for the same setting; on Towers-of-Hanoi, VILP needs 8.7 GB versus UniPi's 68.2 GB. For inference, VILP reaches 0.058 s to 0.231 s across settings, supporting near-real-time replanning.
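A back-of-envelope calculation makes the memory gap plausible. Using the paper's 96x160 → 12x20 example and assuming 4 latent channels (an assumption, not a paper value), a 5-frame plan shrinks by a large constant factor before the diffusion model ever sees it:

```python
# Element counts for a 5-frame plan: pixel space vs. the 12x20 latent from
# the paper's example. Latent channel count (4) is an assumption.
pixel_elems = 5 * 96 * 160 * 3    # RGB video tensor
latent_elems = 5 * 12 * 20 * 4    # latent video tensor
ratio = pixel_elems / latent_elems
print(ratio)  # 48.0
```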

4.3 Complete policy rollout: Nut-Assembly and Arrange-Blocks

| Task | Diffusion Policy | VILP | UniPi | Conclusion |
| --- | --- | --- | --- | --- |
| Nut-Assembly-Small | 28.0/24.3 | 26.7/23.3 | 20.0/17.6 | With little action data, VILP is close to Diffusion Policy and better than UniPi. |
| Nut-Assembly-Hybrid | 48.0/43.1 | 56.7/53.2 | 35.3/27.2 | VILP is strongest after adding off-target/hybrid action data. |
| Arrange-Blocks-Small | 8.9/5.9 | 46.0/40.4 | 14.7/8.9 | Video data supplies rich information about task structure. |
| Arrange-Blocks-Hybrid | 22.7/17.1 | 84.0/77.6 | 18.7/16.2 | VILP's advantage is largest in this setting. |

The values in the table are max/mean success rates. This experiment supports the authors' claim that VILP can exploit the task knowledge in video generation models when video data is plentiful but high-quality action annotations are limited or heterogeneous.

Hybrid datasets
Figure 5: Hybrid data configuration. The low-level action data may involve different targets or different objects, and need not exactly match the target task.

4.4 Comparison with other imitation learning methods

| Task | UniPi | DiffusionPolicy-C | DiffusionPolicy-T | LSTM-GMM | IBC | VILP w/o low dim. | VILP w/ low dim. |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Sim Push-T | 82.0 | 84 | 66 | 54 | 64 | 82.6 | 88.0 |
| Can-PickPlace | 37.4 | 97 | 98 | 88 | 1 | 95.7 | 92.2 |

VILP is best on Sim Push-T, which the authors take as evidence that it can represent multi-modal action distributions; it is close to Diffusion Policy-C/T on Can-PickPlace, showing that the video-planning route is meaningful beyond low-data scenarios.

4.5 Module ablation and horizon ablation

Module ablation shows that conditioning by concatenation is clearly weaker than VILP's cross-attention/global conditioning, and that multi-view fusion is critical on Can-PickPlace. Horizon ablation shows that the video-planning horizon and action horizon cannot be increased blindly: too short and the policy sees too little of the future; too long and training/inference become expensive and generation harder. The authors' empirically preferred combinations are video horizon 6 with action horizon 8, or video horizon 12 with action horizon 16.

Horizon ablation
Figure 6: Video planning horizon and action horizon ablation on Sim Push-T.

4.6 Real-world Real-Arrange-Blocks

The real-world task asks a Franka Panda to line up an L block and a T block in the "LT" configuration on a blue line. Initial positions and orientations are randomized, the action space is delta motion in x/y, and training uses 220 human demonstrations. Results: VILP-16 completed the two-block arrangement in 7/15 trials and one block in another 6/15; UniPi-4 and UniPi-16 completed neither two blocks nor one block in any trial (0/15). Inference time is 0.238 s for VILP-16 versus 1.422 s for UniPi-16.

Real Arrange Blocks rollout
Figure 7: A fragment of a VILP rollout on the real-world Real-Arrange-Blocks task.

5. Key points of implementation and diagrams

Key implementation choices

The paper has no independent appendix

The arXiv source contains no separate appendix files; all methods, experiments, and discussion are in `vilp.tex`. This report integrates all major figures and experiment tables from that source.

6. Key points of reproducibility and implementation

Minimum recurrence path

  1. Prepare the multi-view/RGBD video dataset \(\mathcal{D}^{v}\), plus a small amount of action-labeled data for the low-level policy.
  2. Train a VQGAN autoencoder to compress each frame to a latent; freeze it after training.
  3. Sample sequences of future frames from the videos according to \(N, \Delta t\) and encode them as \(\mathbf{z}_t\).
  4. Train the conditional latent video diffusion model: given the current observation \(o_t\), the noisy latent video, and the denoising step, predict the noise.
  5. Encode each view's observation with a modified ResNet-18 and inject the embeddings into the 3D UNet via cross-attention.
  6. Train the low-level goal-conditioned policy \(\pi(\hat{o}_t, \hat{o}_{t+\Delta t})\rightarrow\hat{\mathbf{a}}\).
  7. At deployment, generate the future video latents, decode them (or use them directly) for adjacent-frame action prediction, execute only the first \(N_e\) steps, and replan in a loop.

The most important things to get right when reproducing are VQGAN compression quality, the choice of video and action horizons, the action-data distribution for the low-level policy, and multi-view temporal alignment. VILP's speed advantage relies on a small latent resolution and few DDIM steps.

7. Analysis, Limitations and Boundaries

The most valuable part of this paper

It advances "video generation as a robot planner" from a slow conceptual route to an engineering form that can replan in real time. Two points stand out: first, latent video diffusion makes video planning fast enough to support receding horizons instead of long open loops; second, it clearly demonstrates that video data and action data can be partially decoupled, with the video model learning the future evolution of the task and a small amount of mixed action data bridging video to action.

Why does the result stand?

Main limitations

Questions to ask while reading