
Learning to Act from Actionless Videos through Dense Correspondences

Authors: Po-Chen Ko, Jiayuan Mao, Yilun Du, Shao-Hua Sun, Joshua B. Tenenbaum

Organization: National Taiwan University; MIT CSAIL; MIT BCS; CBMM

Version: arXiv: 2310.08576v1, submitted on 2023-10-12

Links: arXiv | PDF | Project page | Code

Keywords: actionless videos; dense correspondence; video diffusion; optical flow; robot policy; learning from observation

1. Quick overview of the paper

One-sentence summary: This paper proposes AVDC: first use a text-conditioned video diffusion model to "imagine" a future video of the robot completing the task, then use inter-frame dense correspondences together with the initial depth to recover the 3D rigid-body transformation of the object or camera, and finally map that transformation to robot actions, so training relies mainly on RGB videos without action labels.
| Reading target | Content |
| --- | --- |
| What problem does the paper solve? | Learn executable robot policies from a small number of video demonstrations without action annotations, avoiding re-collecting action-labeled trajectories for every robot and every task. |
| The authors' approach | Split "acting" into two reusable intermediate quantities: a text-conditioned future video and inter-frame dense correspondences. The video represents future state changes; optical flow and depth lift pixel changes to an $SE(3)$ transformation. |
| Most important results | Meta-World average success rate of 43.1%, versus 16.2% for BC-Scratch, 15.4% for BC-R3M, and 6.1% for UniPi; iTHOR average success rate of 31.3%, versus only 2.1% and 0.4% for the two BC baselines; 90% zero-shot success over 40 runs transferring human pushing videos to robot execution on Visual Pusher. |
| Things to note when reading | The method does not output actions end-to-end from video; it relies heavily on optical flow, masks, depth, a rigid-body motion assumption, and hand-designed action primitives. The appendix reports that the real Panda experiment failed 8 out of 10 trials, which should be read alongside the qualitative results in the main text. |

Difficulty rating: ★★★★☆. The reader needs to simultaneously understand diffusion-based video generation, optical flow / dense correspondence, camera projection geometry, robot motion primitives, and learning-from-observation experimental protocols.

Core contribution list

Diverse task execution without action labels
Figure 1: Paper teaser. The authors show that AVDC, relying only on synthesized video and dense correspondences, can act in manipulation, navigation, and real-world manipulation.

2. Motivation

2.1 What problem should be solved?

A common bottleneck in robot learning is that state and action spaces are highly dependent on embodiment. Folding cloth, pouring water, pick-and-place, and navigation require different state representations and action interfaces; if policy learning requires expert action sequences for each task, the cost of data collection will increase rapidly with the number of robots and scenarios.

The authors exploit the versatility of video data: RGB videos record "how states change" and are easy to collect both on the Internet and in the lab. But a video by itself does not tell the robot which joint trajectories or end-effector actions to perform. The question of this paper: can we learn an executable policy from RGB video alone, and translate video changes into the robot's current actions at deployment time?

2.2 Limitations of existing methods

2.3 The solution ideas of this article

AVDC's high-level idea is "imagine the future first, then geometrize the action." Given the current RGBD observation and a text goal, the model generates the next 8 frames of video; GMFlow outputs dense correspondences between adjacent generated frames; the initial depth and camera intrinsics lift pixels to 3D; finally, the rigid-body transformation of the object or scene is recovered by optimization and converted into actions using off-the-shelf grasp, push, IK, and navigation action mappings.

4. Detailed explanation of method

4.1 Overall pipeline

AVDC framework
Figure 2: AVDC's four-step framework: input RGBD + text target, generate imagined execution video, estimate optical flow of adjacent frames, convert optical flow and initial depth into the $SE(3)$ transformation of the target object, and then map it to robot commands.
  1. Video generation: a conditional diffusion model learns $p(\textit{img}_{1:T}\mid \textit{img}_0, \textit{txt})$, with $T=8$ in the experiments. It takes the current frame and a text description and outputs a future execution video.
  2. Optical flow estimation: GMFlow predicts optical flow for each pair of adjacent generated frames. The flow at each pixel is a dense correspondence indicating where that point moves in the next frame.
  3. Geometry recovery: the initial depth map and camera intrinsics lift initial pixels to 3D points; then a rigid-body transformation $T_t$ is found so that the projected positions of the transformed 3D points match the 2D points tracked by optical flow as closely as possible.
  4. Action execution: in fixed-camera scenes, the recovered object transformation is converted into grasp/push subgoals; in navigation scenes, the scene transformation is inverted to obtain camera/robot motion, which is then mapped to MoveForward, RotateLeft, RotateRight, or Done.

4.2 Text conditional video diffusion model

The goal of the diffusion model is to generate future frames from initial image and text conditions. The training loss written in the paper is:

Intuition: The model learns to denoise the future video after adding noise; the condition is the current frame and task text.

$$ \mathcal{L}_{\mathrm{MSE}} = \left\|\epsilon - \epsilon_\theta\left(\sqrt{1-\beta_t}\, \textit{img}_{1:T} + \sqrt{\beta_t}\, \epsilon, \ t \mid \textit{img}_0, \textit{txt}\right) \right\|^2. $$
| Symbol | Meaning |
| --- | --- |
| $\textit{img}_0$ | The current observation frame, used as the conditioning input. |
| $\textit{img}_{1:T}$ | The future $T$ frames; $T=8$ in the experiments. |
| $\textit{txt}$ | The natural-language task description, encoded by a frozen CLIP-Text encoder followed by Perceiver pooling. |
| $\epsilon_\theta$ | The video U-Net denoising network, trained with a noise-prediction-style target. |
| $\beta_t$ | The diffusion noise schedule; the appendix lists train/inference timesteps=100, beta_schedule=cosine, objective=predict_v. |
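The training loss above can be sketched numerically. This is a minimal numpy illustration of the forward-noising and denoising MSE as written in the text, with a toy zero-output stand-in for $\epsilon_\theta$; the actual model uses a predict_v objective and a space-time U-Net, and the shapes and schedule constants here are illustrative, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy shapes: T future frames of a tiny 8x8 RGB video (the paper uses T=8
# at up to 128x128 resolution).
T, H, W, C = 8, 8, 8, 3
img_future = rng.standard_normal((T, H, W, C))   # img_{1:T}, normalized pixels

# Cosine-style noise schedule over 100 timesteps (appendix: timesteps=100,
# beta_schedule=cosine).
num_steps = 100
s = 0.008
steps = np.arange(num_steps + 1) / num_steps
alpha_bar = np.cos((steps + s) / (1 + s) * np.pi / 2) ** 2
alpha_bar /= alpha_bar[0]

t = 42                                            # a sampled diffusion timestep
eps = rng.standard_normal(img_future.shape)       # target noise epsilon

# Forward noising exactly as in the displayed loss:
# noisy = sqrt(1 - beta_t) * img_{1:T} + sqrt(beta_t) * eps,
# with beta_t playing the role of the cumulative noise level at step t.
beta_t = 1.0 - alpha_bar[t]
noisy = np.sqrt(1.0 - beta_t) * img_future + np.sqrt(beta_t) * eps

def eps_theta(noisy_video, t, cond_frame=None, txt=None):
    """Stand-in denoiser: the real model is a factorized space-time U-Net
    conditioned on img_0 and CLIP-pooled text; here it just returns zeros."""
    return np.zeros_like(noisy_video)

loss_mse = np.mean((eps - eps_theta(noisy, t)) ** 2)
```

With the zero stand-in, the loss reduces to the mean squared noise, which is what the real network is trained to drive down.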

Architecturally, the authors start from Dhariwal & Nichol's image-diffusion U-Net and extend it to video. To strengthen consistency with the initial frame, they concatenate the conditioning frame $\textit{img}_0$ to each future frame along the RGB channel dimension, rather than merely prepending one frame on the time axis. The ResNet blocks use factorized spatial-temporal convolutions: a spatial convolution is applied at each time step, then a temporal convolution at each spatial position, replacing expensive full 3D convolutions.
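The two conditioning layouts described above (channel-wise concatenation versus a prepended frame) differ only in tensor shape. A minimal numpy sketch, with tensor names and the tiny 64x64 resolution chosen purely for illustration:

```python
import numpy as np

T, H, W, C = 8, 64, 64, 3
img0 = np.zeros((H, W, C))                # conditioning frame img_0
noisy_future = np.zeros((T, H, W, C))     # noised img_{1:T}

# cat_c (the paper's choice): tile img_0 over time and concatenate along the
# RGB channel axis, so every future frame sees the condition directly.
cat_c = np.concatenate(
    [np.broadcast_to(img0, (T, H, W, C)), noisy_future], axis=-1
)

# cat_t (the alternative): prepend img_0 as one extra frame on the time axis.
cat_t = np.concatenate([img0[None], noisy_future], axis=0)

# cat_c keeps T frames but doubles channels; cat_t adds a frame instead.
```

Under cat_c the U-Net input has 6 channels per frame, which is why the appendix ablation can compare the two layouts without changing anything else in the architecture.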

U-Net architecture
Figure 3: U-Net architecture for video diffusion model. The dashed line is the residual connection.

4.3 Recovering action from optical flow and depth

In fixed-camera object manipulation tasks, let the initial 3D point set of the target object be $\{x_i\}$, the camera intrinsics $K$, and $T_t$ the rigid-body transformation of the object in frame $t$ relative to the initial frame. The projection relation is $K T_t x_i = (u_t, v_t, d_t)$, with corresponding 2D point $(u_t/d_t, v_t/d_t)$. GMFlow gives the tracked pixel $(u_t^i, v_t^i)$ of point $x_i$ in frame $t$, so the authors optimize:

$$ \mathcal{L}_{\text{Trans}} = \sum_i \left\|u_t^i - \frac{(K T_t x_i)_1}{(K T_t x_i)_3}\right\|_2^2 + \left\|v_t^i - \frac{(K T_t x_i)_2}{(K T_t x_i)_3}\right\|_2^2. $$

This step only requires the initial frame depth, not future frame depth. Because $T_t$ is assumed to be a rigid body transformation, the future 3D depth is implicitly determined through the projected geometry.

Derivation: why this loss suffices to recover $T_t$

The initial points $x_i$ are already determined by the initial RGBD frame and camera intrinsics. Given a candidate rigid-body transformation $T_t$, each $x_i$ can be placed in the object's pose at frame $t$ and projected to the image plane via $K$. Optical flow provides the 2D position of the same physical point in generated frame $t$, so minimizing the projection error finds the 6DoF transformation that best explains all dense correspondences. In the actual implementation, the Meta-World appendix first uses RANSAC to find inliers among the 2D correspondences, then estimates the 3D transformation from those inliers (Appendix: Meta-World setup).
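The projection-error objective $\mathcal{L}_{\text{Trans}}$ can be checked numerically: build synthetic 3D points, project them through a known rigid transform to get "tracked" 2D correspondences, and confirm that the true transform yields zero reprojection error while a wrong candidate does not. All numbers below (intrinsics, the point cloud, the test motion) are made up for illustration, and the real system minimizes this residual over $T_t$ with RANSAC-filtered correspondences.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical pinhole intrinsics K and N object points x_i about 1 m away.
K = np.array([[200.0,   0.0, 64.0],
              [  0.0, 200.0, 64.0],
              [  0.0,   0.0,  1.0]])
N = 50
pts = rng.uniform([-0.2, -0.2, 0.8], [0.2, 0.2, 1.2], size=(N, 3))

def project(K, T, pts):
    """Apply a 4x4 rigid transform T, then pinhole projection via K:
    K T x_i = (u, v, d), returning the 2D points (u/d, v/d)."""
    homo = np.concatenate([pts, np.ones((len(pts), 1))], axis=1)
    cam = (T @ homo.T).T[:, :3]          # T_t x_i
    uvd = (K @ cam.T).T                  # (u, v, d)
    return uvd[:, :2] / uvd[:, 2:3]      # perspective divide

# Ground-truth object motion: small rotation about z plus a translation.
theta = 0.1
T_true = np.eye(4)
T_true[:3, :3] = [[np.cos(theta), -np.sin(theta), 0.0],
                  [np.sin(theta),  np.cos(theta), 0.0],
                  [0.0,            0.0,           1.0]]
T_true[:3, 3] = [0.05, -0.02, 0.1]

tracked_2d = project(K, T_true, pts)     # stands in for GMFlow correspondences

def reprojection_loss(T_cand):
    """L_Trans: summed squared 2D error between tracked points and the
    projections of the transformed 3D points."""
    return np.sum((tracked_2d - project(K, T_cand, pts)) ** 2)

err_true = reprojection_loss(T_true)      # the true transform explains the flow
err_identity = reprojection_loss(np.eye(4))  # a wrong candidate does not
```

Minimizing `reprojection_loss` over the 6 pose parameters (e.g. with a nonlinear least-squares solver) is exactly the optimization the method performs per generated frame.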

4.4 Action mapping in different environments

| Environment | Recovered geometry | Action mapping | Key implementation details |
| --- | --- | --- | --- |
| Meta-World | Rigid-body transformation of the target object | Choose grasp or push depending on whether the vertical displacement exceeds 10 cm; grasp mode closes the gripper and then follows subgoals, push mode places the manipulator along the pushable direction and then follows subgoals. | Sample $N=500$ mask points, using the object centroid as the contact point; filter correspondence outliers with RANSAC (Appendix 4.1). |
| iTHOR | Static scene transformation, inverted to obtain camera motion | Observe an imagined point 1 m in front of the robot: Done if its displacement is under 1 mm, MoveForward if its horizontal displacement is under 25 cm, otherwise RotateLeft/RotateRight toward its direction. | No single object is tracked; the whole-frame flow is used and keypoints that move out of the image are filtered; replan when inliers drop below 10% of the initially sampled points (Appendix: iTHOR setup). |
| Real Panda | Initial and target pose of the target object | Assuming the target is top-graspable and needs no reorientation, an IK solver generates the robot trajectory. | Hardware is a Franka Emika Panda + Intel RealSense D435; the appendix notes that target segmentation is specified manually, and small objects / low resolution make 3D rotation estimation unstable (Appendix: real-world setup). |
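The thresholds in the table translate directly into code. This is a minimal sketch of the two decision rules; the function names are hypothetical, and the left/right sign convention for rotation is an arbitrary illustrative choice the paper does not specify.

```python
def choose_metaworld_primitive(vertical_disp_m):
    """Meta-World rule: grasp if the recovered vertical motion of the object
    exceeds 10 cm, otherwise push it along the recovered direction."""
    return "grasp" if vertical_disp_m > 0.10 else "push"

def choose_ithor_action(disp_m, horizontal_disp_m):
    """iTHOR rule for an imagined point 1 m ahead of the robot:
    Done if it barely moves (< 1 mm), MoveForward if it stays roughly
    centered (< 25 cm sideways), otherwise rotate toward the side it moved
    to (sign convention chosen here for illustration)."""
    if disp_m < 0.001:
        return "Done"
    if abs(horizontal_disp_m) < 0.25:
        return "MoveForward"
    return "RotateLeft" if horizontal_disp_m < 0 else "RotateRight"
```

When the inlier count drops below 10% of the sampled points, the system skips this mapping entirely and replans from the new observation.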

4.5 Key points of training and inference

First-frame conditioning: the authors concatenate the initial frame to the noisy future video along the RGB channel dimension (cat_c) rather than along the time dimension (cat_t). Appendix MSE experiments show cat_c outperforms cat_t early in Bridge training.
Text encoder: the CLIP-Text encoder is frozen; Perceiver pooling aggregates the tokens and the diffusion time embedding is added afterwards; no text cross-attention is used. The MSE difference between CLIP and T5-base is not significant (Appendix: text encoder).
Inference speed: each round of action planning in Meta-World takes about 18 seconds, with at most 5 replans, so the total planning cost is about 18-108 seconds. DDIM with 10 steps speeds up video generation by about 10x, while overall Meta-World success drops from 43.1% to 37.5%, a drop of 5.6 percentage points (Appendix: complexity/DDIM).
Algorithm: AVDC execution loop. Input: current RGBD image, task text, camera intrinsics, optional object mask.
  1. Generate future frames img_{1:T} with a text-conditioned video diffusion model.
  2. For each adjacent frame pair, estimate optical flow with GMFlow.
  3. Track dense correspondences through the generated video.
  4. Lift initial pixels to 3D using the initial depth and K.
  5. Estimate object or scene rigid transforms by minimizing the projection error.
  6. Convert the transforms to task-specific robot commands using primitives or navigation rules.
  7. Execute; if progress stalls or correspondences are lost, replan from the new observation.
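The loop above can be sketched as a closed-loop driver. Everything here is a hypothetical stub, not the paper's code; it only shows the control flow of generate → flow → transform → command → execute with replanning.

```python
def avdc_execute(obs, task_text, generate_video, estimate_flow,
                 recover_transforms, to_commands, execute, max_replans=5):
    """Closed-loop driver mirroring the AVDC execution loop. Every callable
    is a stand-in for the real module (diffusion model, GMFlow, pose
    optimization, primitives, low-level controller)."""
    for _ in range(max_replans + 1):
        frames = generate_video(obs, task_text)      # imagined img_{1:T}
        flows = estimate_flow(frames)                # dense correspondences
        transforms = recover_transforms(obs, flows)  # per-frame rigid transforms
        commands = to_commands(transforms)
        obs, done = execute(commands)
        if done:
            return True
    return False                                     # replan budget exhausted

# Toy stubs: the "world" reports success on the second execution round,
# so the loop must replan exactly once.
state = {"rounds": 0}
def _gen(obs, txt): return [f"frame{i}" for i in range(8)]
def _flow(frames): return ["flow"] * (len(frames) - 1)
def _recover(obs, flows): return ["T"] * len(flows)
def _cmds(transforms): return ["subgoal"] * len(transforms)
def _exec(cmds):
    state["rounds"] += 1
    return "new_obs", state["rounds"] >= 2

success = avdc_execute("rgbd", "open the door", _gen, _flow,
                       _recover, _cmds, _exec)
```

The `max_replans=5` default matches the Meta-World budget mentioned above; each iteration corresponds to one ~18-second planning round.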

5. Experiment

5.1 Baselines and variants

| Method | Training signal | Role |
| --- | --- | --- |
| BC-Scratch | ResNet-18 + CLIP text + MLP trained from scratch on expert action labels. | Measures how hard plain behavioral cloning is in this multi-task setting. |
| BC-R3M | Also uses action labels, but the visual backbone is initialized with R3M. | Tests whether robot-pretrained visual representations help action prediction. |
| UniPi baseline | Uses AVDC-generated videos to train an inverse-dynamics model that outputs actions. | Represents the "video planning + learned inverse dynamics" route, which still requires action labels. |
| AVDC (Flow) | Directly predicts inter-frame optical flow. | Tests whether generating flow directly beats generating RGB video and then running GMFlow. |
| AVDC (No Replan) | Full geometric motion recovery, but open-loop execution. | Isolates the effect of closed-loop replanning. |
| AVDC (Full) | RGB video generation + GMFlow + geometric motion recovery + replanning. | The paper's main method. |

5.2 Meta-World desktop operation

Setup: 11 Sawyer arm manipulation tasks, 3 camera positions per task, 5 demonstrations per camera position, 165 videos in total. AVDC and its variants use ground-truth target object segmentation masks; the BC baselines use action labels. Evaluation: average success rate over 25 seeds per task per camera position.

| Method | Overall | Key phenomena |
| --- | --- | --- |
| BC-Scratch | 16.2% | Even with 15,216 action labels, multi-task generalization is weak. |
| BC-R3M | 15.4% | R3M initialization did not improve the overall results. |
| UniPi (With Replan) | 6.1% | Requires inverse dynamics; overall lower than BC. |
| AVDC (Flow) | 13.7% | Performs well on button-press-topdown, faucet-close, and handle-press, but poorly on most tasks, supporting the authors' two-stage design. |
| AVDC (No Replan) | 19.6% | Exceeds BC, but clearly below the closed-loop version. |
| AVDC (Full) | 43.1% | Best overall across the 11 tasks; door-open 72.0%, door-close 89.3%, handle-press 81.3%. |
Meta-World qualitative result
Figure 4: Meta-World qualitative result. The authors demonstrate generating video, optical flow, current/next subgoal, and execution trajectory.
Number of replanning steps versus success rate
Figure 5: The larger the maximum number of replans, the higher the success rate across camera viewpoints. The authors use this to support the closed-loop replanning strategy.

5.3 iTHOR target navigation

Setup: 12 target objects distributed across 4 room types; at each time step the agent can execute MoveForward, RotateLeft, RotateRight, or Done. Success requires the target object to come into view within 1.5 m, or Done to be predicted in the correct state. 3 objects per room, 20 episodes per object.

| Room | BC-Scratch | BC-R3M | AVDC |
| --- | --- | --- | --- |
| Kitchen | 1.7% | 0.0% | 26.7% |
| Living Room | 3.3% | 0.0% | 23.3% |
| Bedroom | 1.7% | 1.7% | 38.3% |
| Bathroom | 1.7% | 0.0% | 36.7% |
| Overall | 2.1% | 0.4% | 31.3% |
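Since each room contributes the same number of episodes (3 objects x 20 episodes = 60), the Overall row should be the unweighted mean of the four rooms; a quick numpy check of the AVDC column confirms the reported figure:

```python
import numpy as np

# AVDC per-room success rates from the table above
# (Kitchen, Living Room, Bedroom, Bathroom); equal episode counts per room
# mean the overall rate is the unweighted room mean.
avdc_rooms = np.array([26.7, 23.3, 38.3, 36.7])
overall = avdc_rooms.mean()   # 31.25, reported as 31.3
```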

The authors' explanation: BC-R3M is worse than BC-Scratch, likely because R3M is pretrained on robot manipulation and does not suit visual navigation. AVDC's intermediate videos show the agent navigating toward the target, and the optical flow maps naturally to moving or rotating; when there is no flow, the target has been reached and Done is chosen.

iTHOR qualitative result
Figure 6: iTHOR qualitative result. Generating video, flow and action inference corresponding to TV navigation tasks.

5.4 Cross-embodiment: from human videos to robot execution

The Visual Pusher experiment trains the video diffusion model on only 198 actionless human pushing videos. The U-Net architecture is the same as for Meta-World, trained for 10k steps; the model is then tested zero-shot on the simulated robot pushing task without fine-tuning, achieving a 90% success rate over 40 runs.

Visual Pusher cross embodiment result
Figure 7: Visual Pusher. AVDC generates video plans from human demonstrations, and then turns the dense correspondence into robot actions.

5.5 Bridge and the real-world Franka Panda

The Bridge dataset contains 33,078 teleoperated WidowX 250 kitchen-task videos, without depth. The authors first trained the video generation model on Bridge, then fine-tuned it with 20 human hand demonstration videos in their own real tabletop environment. The real setup uses a Franka Emika Panda and a fixed Intel RealSense D435 RGBD camera.

The main text emphasizes that the model can generate videos, predict optical flow, identify targets, and infer actions; the appendix gives a more specific failure analysis: the real-world experiment failed 8 out of 10 trials, with 75% of failures caused by wrong plans from the video diffusion model (picking the wrong object or the wrong placement target) and 25% by discontinuities in the generated video, such as objects disappearing in intermediate frames (Appendix: real-world failure modes).

Bridge qualitative result
Figure 8: Bridge predicts the current and next subgoal.
Franka Panda qualitative result
Figure 9: Franka Panda execution example showing current and next subgoal at bottom.

5.6 Appendix Supplementary Experiments

| Appendix content | Result | Implication |
| --- | --- | --- |
| Object masks from segmentation models | Using Language Segment-Anything instead of GT masks, the average success rate over the 11 Meta-World tasks is 34.5%, 8.6 percentage points below the 43.1% with GT masks. | The method is sensitive to segmentation quality; the paper's main Meta-World results use GT masks. |
| First-frame conditioning | cat_c has lower last-frame MSE than time-dimension concatenation (cat_t) early in Bridge training; each point is the mean of 4000 samples, with standard-error bars. | RGB channel-wise conditioning is part of the efficiency design. |
| Text encoder | The video-generation MSE difference between CLIP-Text (63M) and T5-base (110M) is not significant. | The text encoder is not the main bottleneck; the authors use frozen CLIP-Text + Perceiver. |
| Bridge zero-shot real scenes | The Bridge model generates plausible videos for complex real kitchen images beyond the toy kitchen, but the native 48x64 resolution makes the videos blurry. | The video generation model transfers across scenes to some degree, but low resolution hurts motion recovery. |
Segmentation mask qualitative result
Figure 10: Language Segment-Anything success and failure mask examples.
Ablation MSE plot
Figure 11: First-frame conditioning and text encoder ablations. The paper uses last-frame MSE to evaluate generation quality.

6. Summary of recurrence information

6.1 Data and Evaluation Protocol

| Scene | Training data | Evaluation |
| --- | --- | --- |
| Meta-World | 11 tasks x 3 cameras x 5 demonstrations = 165 videos; BC baselines use 15,216 frame-action pairs. | 25 seeds per task per camera position; average success rate reported. |
| iTHOR | 240 videos; BC baselines use 5,757 frame-action pairs. | 12 object-navigation tasks, 4 room types, 20 episodes per object. |
| Visual Pusher | 198 human pushing videos, no action labels; trained for 10k steps. | Simulated robot pushing, zero-shot without fine-tuning, 40 runs. |
| Bridge/Panda | Bridge: 33,078 WidowX videos; fine-tuned with 20 human demonstrations in the real environment. | Real tabletop pick-and-place; the appendix provides a failure analysis over 10 trials. |

6.2 Model hyperparameters

Common settings for all models: dropout=0, num_head_channels=32, train/inference timesteps=100, training objective=predict_v, beta_schedule=cosine, loss_function=l2, min_snr_gamma=5, learning_rate=1e-4, ema_update_steps=10, ema_decay=0.999 Appendix: Video Diffusion Model.

| Parameter | Meta-World | iTHOR | Bridge |
| --- | --- | --- | --- |
| num_parameters | 201M | 109M | 166M |
| resolution | 128 x 128 | 64 x 64 | 48 x 64 |
| base_channels | 128 | 128 | 160 |
| num_res_block | 2 | 3 | 3 |
| attention_resolutions | 8, 16 | 4, 8 | 4, 8 |
| channel_mult | 1, 2, 3, 4, 5 | 1, 2, 4 | 1, 2, 4 |
| batch_size | 16 | 32 | 32 |
| training_timesteps | 60k | 80k | 180k |

6.3 Perceiver text aggregator

| Parameter | Value |
| --- | --- |
| layers | 2 |
| num_attn_heads | 8 |
| num_head_channels | 64 |
| num_output_tokens | 64 |
| num_output_tokens_from_pooled | 4 |
| max_seq_len | 512 |
| ff_expansion_factor | 4 |

6.4 Hardware and time


7. Analysis, Limitations and Boundaries

7.1 The most valuable part of this paper

Judging from the paper's own experiments and method design, the core value is to explicitly supply the missing link between "videos without action labels" and "executable actions": the video model is responsible only for generating future visual states, while dense correspondence and geometric optimization convert state changes into executable transformations. This split lets the method train on RGB-only demonstrations, while delegating motion recovery to more mature modules such as depth, flow, masks, IK, and motion primitives.

7.2 Why the results hold up

The support for the results comes at three levels. First, the Meta-World table includes multiple action-label baselines and multiple AVDC variants; the Full version beats the No Replan and Flow versions, supporting both replanning and the two-stage RGB-then-flow design. Second, when the task switches to iTHOR navigation, the action space and object interactions are completely different, yet the same correspondence-to-transform logic still beats both BC baselines. Third, the appendix ablations expose the impact of engineering variables such as segmentation, first-frame conditioning, and the number of DDIM steps, rather than reporting only the main method's successes.

7.3 Author's statement of limitations

7.4 Applicable boundaries

| Suitable use cases | Not suitable, or requires additional modules |
| --- | --- |
| The key changes in the task can be approximated as rigid-body transformations of an object or the camera. | Deformable objects, complex contacts, force-control tasks, or tasks where contact surfaces must be inferred. |
| Initial depth is available at deployment, or can be replaced by monocular depth estimation. | Environments without reliable depth, camera calibration, or object/scene masks. |
| There is a clear mapping between the language goal and visual change, e.g. pick up fruit, navigate to object, push object. | Goals requiring hidden state, long-horizon multi-stage symbolic planning, or non-visual feedback. |
| Planning latency of tens of seconds is acceptable, or sampling acceleration such as DDIM is used. | Tasks with hard real-time constraints; even 10-step DDIM sacrifices some success rate. |