
ARDuP: Active Region Video Diffusion for Universal Policies

Reading Report: This paper applies video generation to general robot policy learning, but its key idea is not to "generate a complete video that looks as real as possible". Instead, it first explicitly generates the task-relevant active regions, so that the video planner and action decoder attend to the objects/regions that will actually be interacted with.

arXiv: 2406.13301v2 | Tags: Video Diffusion, Robot Policy, Active Region, CLIPort / BridgeData v2
Authors: Shuaiyi Huang, Mara Levy, Zhenyu Jiang, Anima Anandkumar, Yuke Zhu, Linxi Fan, De-An Huang, Abhinav Shrivastava
Institution: University of Maryland, UT Austin, Caltech, NVIDIA
Local output: Report/2406.13301/
Note: The arXiv source code does not contain independent appendices; this report integrates all chapters of the text and all charts in the source code.

1. Quick overview of the paper

What problem does the paper address? Existing video-as-planner methods generate future frames from a text goal and an initial image, then decode actions with inverse dynamics. However, these models usually treat all pixels equally, tend to focus on the wrong objects, and can even exploit visual shortcuts such as "repainting an object's color" to lower the generation loss while producing a wrong plan. ARDuP asks: how can the video planner be made to reliably focus on the regions the task actually requires interacting with, and how can that focus be turned into a higher robot task success rate?
The authors' approach: introduce the active region as an explicit condition. During training, Co-Tracker finds moving points in the demonstration video and SAM turns them into pseudo active region masks, without manual annotation. During inference, latent active region diffusion predicts the active region from the initial frame and language instruction, the region is fed as a condition into the latent video diffusion planner, and latent inverse dynamics decodes actions directly from the generated latent sequence.
Most important results: On the three unseen CLIPort tasks, ARDuP improves over UniPi* by +21.3% (Place Bowl), +17.2% (Pack Object), and +15.7% (Pack Pair). The better the active region quality, the higher the success rate; with GT active regions at test time, the gains grow to +16.6%, +24.4%, and +24.8%. On BridgeData v2, ARDuP's generated video plans are qualitatively better at selecting the correct objects and placements.
Things to note when reading: This paper does not directly tackle closed-loop control of real robots; it is an offline, data-driven video planning + action decoding framework. When reading, distinguish three things: where the active region pseudo-labels come from, how the active region enters latent diffusion, and whether the gain in task success comes from the active region condition or from the latent inverse dynamics design.
ARDuP teaser
Figure 1: Without the active region condition, the model grasps the wrong purple block; with it, the generated plan unfolds around the white block.

2. Background and problem setting

From MDP to UPDP

A traditional MDP requires defining states, actions, rewards, and environment dynamics; for cross-task, cross-environment robot policies, these definitions are hard to unify. UniPi introduces the Unified Predictive Decision Process (UPDP): the state space becomes video frames, the goal becomes language text, and planning becomes conditional video generation.

A UPDP is defined as \(\mathcal{G}=(\mathcal{X}, \mathcal{C}, H, \rho)\), where \(\mathcal{X}\) is the RGB frame space, \(\mathcal{C}\) is the set of task texts, \(H\) is the planning horizon, and \(\rho(\cdot|x_0, c)\) is the video generator that produces \(H\) future frames from the initial frame \(x_0\) and task text \(c\). An action prediction algorithm \(\mu(\cdot|\{x_h\}_{h=0}^{H}, c)\) then predicts the action sequence from the generated video.
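The UPDP definition can be read as a two-stage interface: \(\rho\) plans frames, \(\mu\) decodes actions. Below is a minimal sketch of that interface with toy stand-ins; all names and shapes are illustrative, not from the paper's code.

```python
import numpy as np

H = 4                    # planning horizon
FRAME_SHAPE = (8, 8, 3)  # toy RGB frame

def rho(x0, c, H):
    """Toy video generator rho(.|x0, c): returns H future frames (here copies of x0)."""
    return np.stack([x0] * H)

def mu(frames, c):
    """Toy action predictor mu(.|{x_h}, c): one action per consecutive frame pair."""
    return np.zeros((len(frames) - 1, 7))  # 7-DoF manipulation actions

x0 = np.zeros(FRAME_SHAPE)
c = "place the red block in the bowl"
plan = rho(x0, c, H)                                   # H generated frames
actions = mu(np.concatenate([x0[None], plan]), c)      # H actions
print(plan.shape, actions.shape)  # (4, 8, 8, 3) (4, 7)
```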

Problems with UPDP

The quality of UPDP actions depends on the quality of the generated frames. However, if the generative model optimizes all pixels equally, a failure mode that is particularly dangerous in robotic tasks arises: a frame can look "close to the goal" without the robot ever interacting with the correct object. For example, the model can grasp the wrong object first and then, in later frames, repaint it to the color mentioned in the text.

The core judgment of ARDuP is that what matters for control is not every pixel in the image but the active regions, i.e. the objects/regions most likely to be interacted with. Explicitly exposing these regions to the video generator reduces misplaced attention and visual shortcuts.

3. Method details

3.1 LUPDP-AR formalization

The author changed UPDP to Latent Unified Predictive Decision Process conditioned on Active Region (LUPDP-AR):

$$\hat{\mathcal{G}}=(\hat{\mathcal{X}}, \mathcal{C}, \hat{\mathcal{O}}, H, \phi)$$

\(\hat{\mathcal{X}}\) is the RGB frame latent space, and \(\hat{\mathcal{O}}\) is the active region latent space. A Stable-Diffusion-style encoder \(\mathcal{E}\) maps RGB frames and active region frames to latents. The video generator changes from the original \(\rho(\cdot|x_0, c)\) to:

$$\phi(\cdot|\hat{x}_0, c, \hat{o})$$

That is, it conditions not only on the initial frame and text but also on the active region latent \(\hat{o}\). An active region generator is also defined:

$$\psi(\hat{o}|\hat{x}_0, c): \hat{\mathcal{X}}\times\mathcal{C}\rightarrow\hat{\mathcal{O}}$$

The final action prediction is also performed on the latent sequence: \(\pi(\cdot|\{\hat{x}_h\}_{h=0}^{H}, c)\rightarrow \Delta(\mathcal{A}^{H})\). This avoids decoding each latent back to RGB before running inverse dynamics.

ARDuP overview
Figure 2: ARDuP overview. Co-Tracker + SAM generates pseudo active region and trains active region diffusion; the predicted active region is conditioned on latent video diffusion, and then the action is decoded by latent inverse dynamics.

3.2 Automatically construct active region supervision from video

There is no manual active region annotation during training. The author uses the demonstration video \(\mathbf{V}=\{x_h\}_{h=0}^{H}\) to automatically generate a pseudo mask:

  1. Use Co-Tracker to place an \(M\times M\) grid of points on the initial frame and track all points through subsequent frames, obtaining the trajectory set \(\mathcal{P}=\mathcal{F}_p(\mathbf{V})\).
  2. Compute the per-step displacement \(\Delta \mathbf{p}_h=\|\mathbf{p}_h-\mathbf{p}_{h-1}\|_2\) for each point trajectory.
  3. Compute the average displacement \(\Delta\bar{\mathbf{p}}=\frac{1}{H}\sum_{h=1}^{H}\Delta\mathbf{p}_h\).
  4. Keep the moving points \(\mathcal{P}_m=\{\mathbf{p}\in\mathcal{P}|\Delta\bar{\mathbf{p}}>\tau\}\).
  5. Feed these points' initial-frame positions as prompts into SAM to obtain the active region mask \(\mathbf{M}\).
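Steps 1-4 above reduce to a small displacement-thresholding routine. A minimal numpy sketch, assuming Co-Tracker's output is an array of trajectories with shape `(num_points, H+1, 2)`; the threshold `tau` is a hypothetical hyperparameter, and the SAM prompting of step 5 is omitted:

```python
import numpy as np

def filter_moving_points(trajectories: np.ndarray, tau: float) -> np.ndarray:
    """Return initial-frame positions of points whose mean per-step
    displacement exceeds tau. trajectories: (num_points, H+1, 2)."""
    # per-step displacement ||p_h - p_{h-1}||_2, shape (num_points, H)
    step = np.linalg.norm(np.diff(trajectories, axis=1), axis=-1)
    mean_disp = step.mean(axis=1)            # average over the H steps
    return trajectories[mean_disp > tau, 0]  # (x, y) in the initial frame

# Toy example: one static point, one point moving 1 px per frame.
traj = np.zeros((2, 5, 2))
traj[1, :, 0] = np.arange(5)                 # moves along x
prompts = filter_moving_points(traj, tau=0.5)
print(prompts)  # [[0. 0.]] -> only the moving point's initial position
```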

To form the active region frame, the authors keep the original image inside the mask and whiten everything outside it:

$$o=x_0\circ \mathbf{M}+x_b\circ(1-\mathbf{M})$$

Then use encoder \(\mathcal{E}\) to get \(\hat{o}\).
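The masking formula above is a pixel-wise blend. A minimal sketch, assuming images in \([0,1]\) and a binary \((H, W)\) mask:

```python
import numpy as np

def active_region_frame(x0: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """o = x0 * M + x_b * (1 - M): keep pixels inside the mask,
    replace the outside with a white background x_b."""
    x_b = np.ones_like(x0)                 # white background
    m = mask[..., None].astype(x0.dtype)   # broadcast (H, W) mask over RGB
    return x0 * m + x_b * (1.0 - m)

x0 = np.full((4, 4, 3), 0.2)
mask = np.zeros((4, 4))
mask[1:3, 1:3] = 1                         # 2x2 active region
o = active_region_frame(x0, mask)
print(o[1, 1], o[0, 0])  # [0.2 0.2 0.2] [1. 1. 1.]
```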

Pseudo active regions
Figure 3: Pseudo active regions obtained from point trajectories and SAM on BridgeData v2. They serve as the supervision signal for training the active region generator.

3.3 Latent active region diffusion

There is no future video at inference time, so Co-Tracker can no longer extract active regions from a full video. Instead, a conditional latent diffusion model \(\psi(\hat{o}|\hat{x}_0, c; \theta_\psi)\) is trained, taking the initial frame latent and the task text as input and outputting the active region latent. The authors stress that both conditions are indispensable: without the image we do not know where the objects are, and without the text we do not know which objects the task interacts with.

3.4 Active-region-conditioned latent video planner

The video planner \(\phi(\cdot|\hat{x}_0, c, \hat{o}; \theta_\phi)\) generates the future latent sequence. In implementation, the authors concatenate the active region latent with each frame's latent, as well as with the initial frame's latent, so the denoising process is always constrained by "where to attend". The decoder \(\mathcal{R}\) can restore RGB for visualization, but action decoding does not rely on RGB decoding.
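The conditioning described above amounts to channel-wise concatenation before denoising. A minimal sketch with Stable-Diffusion-style 4-channel latents; shapes are illustrative, and the exact fusion in ARDuP may differ:

```python
import numpy as np

C, H_lat, W_lat, T = 4, 8, 8, 6                # latent shape and plan length
noisy_frames = np.random.randn(T, C, H_lat, W_lat)  # latents being denoised
x0_latent = np.random.randn(C, H_lat, W_lat)        # initial frame latent
ar_latent = np.random.randn(C, H_lat, W_lat)        # active region latent

# Each frame latent is concatenated with both conditioning latents along the
# channel axis, so every denoising step "sees" the active region.
cond = np.stack([np.concatenate([f, x0_latent, ar_latent], axis=0)
                 for f in noisy_frames])
print(cond.shape)  # (6, 12, 8, 8): own latent + two conditioning latents
```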

CLIPort qualitative comparison
Figure 4: Qualitative comparison of CLIPort. Models without active regions are prone to grasping the wrong object or generating the wrong robot arm position; ARDuP focuses more on the target object.

3.5 Latent inverse dynamics and execution

Given adjacent generated frame latents \(\hat{x}_h, \hat{x}_{h+1}\), the latent inverse dynamics module predicts the action \(a_h\). It consists of a convolutional layer with a skip connection plus a linear layer, and outputs a 7-dimensional manipulation action. At execution time, starting from \(x_0, c\), it first generates \(\hat{o}\), then the \(H=6\)-step latent sequence, and finally decodes all \(H\) actions at once and executes them open-loop.
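The interface of this module is simple: two adjacent latents in, one 7-D action out, applied over the whole plan. A toy stand-in with a random linear map in place of the real conv-plus-skip head, just to show the shapes:

```python
import numpy as np

rng = np.random.default_rng(0)
C, H_lat, W_lat = 4, 8, 8
# Hypothetical stand-in for the trained head: a linear map over the
# flattened pair of latents; the real module is conv + skip + linear.
W = rng.standard_normal((7, 2 * C * H_lat * W_lat)) * 0.01

def inverse_dynamics(z_h: np.ndarray, z_next: np.ndarray) -> np.ndarray:
    feat = np.concatenate([z_h, z_next]).ravel()  # adjacent latent pair
    return W @ feat                               # 7-DoF action

# x0 latent plus H=6 generated latents -> 7 latents, 6 actions.
plan = rng.standard_normal((7, C, H_lat, W_lat))
actions = np.stack([inverse_dynamics(plan[h], plan[h + 1]) for h in range(6)])
print(actions.shape)  # (6, 7): one action per adjacent latent pair
```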

4. Experiments and results

4.1 Implementation settings

4.2 CLIPort multi-environment migration

CLIPort is a language-conditioned robot manipulation simulation benchmark based on Ravens. The authors trained on 11 tasks with 110k demos, tested on 3 unseen tasks, and followed UniPi's multi-environment transfer setting.

| Model | Place Bowl | Pack Object | Pack Pair |
|---|---|---|---|
| State + Transformer BC | 9.8 | 21.7 | 1.3 |
| Image + Transformer BC | 5.3 | 5.7 | 7.8 |
| Image + TT | 4.9 | 19.8 | 2.3 |
| Diffuser | 14.8 | 15.9 | 10.5 |
| UniPi* | 65.4 | 51.8 | 30.9 |
| ARDuP | 86.7 (+21.3) | 69.0 (+17.2) | 46.6 (+15.7) |

This result shows that under the same video planning paradigm, explicit active region conditions significantly help the success rate of unseen tasks.

4.3 BridgeData v2 real data migration

BridgeData v2 contains 60,096 real robot trajectories with natural language instructions across 24 environments. The authors use 95% of the data for training and the rest for evaluation. The paper mainly gives qualitative results: ARDuP selects the correct object and places it in the correct location in complex real scenes, while UniPi* more often selects the wrong object or places it in the wrong location.

BridgeData qualitative comparison
Figure 5: BridgeData v2 qualitative results. ARDuP can better select task-related objects such as sushi, cup, and colander; UniPi* often selects or misplaces objects.

4.4 Ablation: active region quality

| AR training | AR test | Planner | Place Bowl | Pack Object | Pack Pair |
|---|---|---|---|---|---|
| - | - | w/o Active Region | 83.4 | 67.7 | 38.0 |
| Unsupervised | Predicted | + Active Region | 86.7 (+3.3) | 69.0 (+1.3) | 46.6 (+8.6) |
| Supervised | Predicted | + Active Region | 93.3 (+9.9) | 79.6 (+11.9) | 51.7 (+13.7) |
| - | GT | + Active Region | 100.0 (+16.6) | 92.1 (+24.4) | 62.8 (+24.8) |

This table is critical: the active region does not merely look good in qualitative figures; its quality is positively correlated with the final task success rate. The more accurate the predicted active region, the more likely the task is to succeed.

4.5 Ablation: task loss

The authors propose a task loss to measure how well generated video serves the control task: feed the generated video latents into the pre-trained latent inverse dynamics, decode actions, and compute the L1 error against the GT actions. The task loss of the model with active regions is significantly lower on the CLIPort test set, indicating that the generated video is not only visually more plausible but also better matches what the action decoder needs.
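The task loss described above is just an L1 error over decoded actions. A minimal sketch, where `decode_actions` is a hypothetical stand-in for the pre-trained latent inverse dynamics:

```python
import numpy as np

def task_loss(gen_latents, gt_actions, decode_actions):
    """L1 error between actions decoded from generated latents and GT."""
    pred = decode_actions(gen_latents)       # (H, 7) predicted actions
    return np.abs(pred - gt_actions).mean()  # mean absolute error

# Toy check: an all-zero prediction against all-one ground truth.
gt = np.ones((6, 7))
loss = task_loss(None, gt, lambda z: np.zeros((6, 7)))
print(loss)  # 1.0
```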

Task loss ablation
Figure 6: The active region condition reduces task loss, indicating that the generated latent is more consistent with the real action sequence.

5. Key points of implementation and diagrams

Source code structure

`root.tex` in the arXiv source code is an IEEE template sample. The main file of the actual paper is `iros24_vid_main.tex`, and the text is under `data/`. The source code does not provide a separate appendix. Image resources include `teaser.jpg`, `overview.jpg`, `vis_cliport.jpg`, `bridge_vis.jpg`, `pseudo_bridge.jpg`, `ablation_active.pdf`, etc.

Why do it in latent space?

The authors keep both video planning and action decoding in latent space as much as possible: this reduces the computational cost of RGB video generation, and it lets inverse dynamics consume the generator's internal representation directly, avoiding a decode-to-RGB then re-encode/perceive round trip. Visualization is only for human viewing; control itself does not require RGB decoding.

Two roles of active region

6. Key points of reproducibility and implementation

Minimal reproduction path

  1. Prepare datasets with language instructions, image sequences, action sequences, such as CLIPort or BridgeData v2.
  2. Run Co-Tracker on the training video to obtain the full temporal trajectory of the initial frame grid points.
  3. Filter moving points according to the average displacement threshold, and use SAM to generate a pseudo active region mask.
  4. Encode initial frame, active region frame to Stable Diffusion VAE latent.
  5. Train active region diffusion \(\psi\): input \(\hat{x}_0, c\), output \(\hat{o}\).
  6. Train latent video diffusion \(\phi\), input \(\hat{x}_0, c, \hat{o}\), and output future latent sequences.
  7. Train latent inverse dynamics \(\pi\), input adjacent latents, and output 7-dimensional actions.
  8. At inference, generate the \(H=6\)-step plan, decode the \(H\) actions, and execute them open-loop.
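The inference part of the path above (steps 5-8, as used at test time) can be sketched as a single composition; every function here is a hypothetical placeholder for the corresponding trained model:

```python
def ardup_inference(x0, c, encode, psi, phi, pi, H=6):
    """Compose encoder, active region diffusion, video planner, and latent
    inverse dynamics into one open-loop inference pass."""
    x0_lat = encode(x0)                   # Stable Diffusion VAE encoder
    o_lat = psi(x0_lat, c)                # active region diffusion
    plan = phi(x0_lat, c, o_lat, H)       # H future frame latents
    frames = [x0_lat] + list(plan)        # H+1 latents including x0
    return [pi(frames[h], frames[h + 1])  # latent inverse dynamics
            for h in range(H)]            # H actions, executed open-loop

# Toy placeholders just to exercise the control flow.
acts = ardup_inference(
    "img", "task",
    encode=lambda x: 0,
    psi=lambda z, c: 0,
    phi=lambda z, c, o, H: [0] * H,
    pi=lambda a, b: [0.0] * 7)
print(len(acts), len(acts[0]))  # 6 7
```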

Common pitfalls

7. Analysis, Limitations and Boundaries

The most valuable part of this paper

It proposes a very practical correction: do not expect the video generator to naturally learn "what matters for control" at the full-image pixel level; model the interaction region explicitly instead. This idea brings video generation back from "looking good" to "being useful for action". At the same time, the Co-Tracker + SAM pipeline constructs pseudo labels automatically, so active regions need no manual annotation, which is the key to scaling to large video datasets.

Why do the results hold up?

Main limitations

Questions to ask while reading