
Flow as the Cross-Domain Manipulation Interface

Authors: Mengda Xu, Zhenjia Xu, Yinghao Xu, Cheng Chi, Gordon Wetzstein, Manuela Veloso, Shuran Song

Organization: Stanford University; Columbia University; J.P. Morgan AI Research; Carnegie Mellon University

Publication: Conference on Robot Learning 2024; arXiv v1 2024-07-21, v2 2024-10-04

Links: arXiv | PDF | Project page

Keywords: Im2Flow2Act; object flow; cross-domain manipulation; cross-embodiment learning; sim-to-real; flow-conditioned policy

1. Quick overview of the paper

One-sentence summary: This paper proposes Im2Flow2Act, which uses object flow as a unified interface among human videos, simulation exploration data, and real robot execution: a flow generation network, trained on real human-hand videos, first generates a complete object flow describing how the object should move from the initial image and task language; a flow-conditioned policy, trained only on simulated robot play data, then converts that flow into real UR5e actions.
| Reading target | Content |
|---|---|
| What problem does the paper solve? | Real robot teleoperation data is expensive, and human-hand videos and the robot action space differ greatly; the authors want the robot to learn real manipulation skills from real human demonstration videos and simulated exploration data, without real-robot training data. |
| The authors' approach | Use object flow as a cross-embodiment, cross-environment interface. A flow generation network generates the complete task flow from the initial image and task description; a flow-conditioned imitation policy outputs robot end-effector actions from the task flow, current keypoint tracking, and proprioception. |
| Most important results | Real-world language-conditioned success rates on four tasks: Pick&Place 90%, Pouring 80%, Drawer 85%, Cloth 70%, averaging about 81%; simulated language-conditioned success rates are 90%, 85%, 90%, and 35%; the long-horizon multi-object task reaches 85%. |
| Things to note when reading | The method relies on 2D flow, Grounding DINO, SAM/motion/depth filters, TAPIR online point tracking, and camera calibration; 2D flow is ambiguous for 3D actions, and Cloth reaches only 35% in language-conditioned simulation. These are the boundaries to focus on when reading the experiments. |

Difficulty rating: ★★★★☆. Requires understanding point tracking / object-flow representations, latent-diffusion / AnimateDiff conditional generation, transformer-based temporal alignment, Diffusion Policy, and the sim-to-real and cross-embodiment experimental design.

Core contribution list

Flow as the cross-domain manipulation interface
Figure 1: Core diagram of Im2Flow2Act. Object flow simultaneously connects action-less human videos and task-less simulated robot data, ultimately forming a language-conditioned real-world manipulation system.

2. Motivation

2.1 What problem should be solved?

To extend robot learning to multi-task real-world scenarios, the key difficulty is not whether demonstrations exist for a single task, but that the sources of demonstration data are incompatible with one another: real robot data is expensive; simulation differs from reality in appearance and contact dynamics; human videos contain no robot actions; and the embodiments of the human hand and the UR5e gripper are completely different. The authors' starting point: if one only looks at "how the manipulated object should move", this information is easier to reuse across embodiments than joint motions or human-hand trajectories.

2.2 Limitations of existing methods

2.3 This paper's approach

Im2Flow2Act splits the system into two separately trainable modules. First, the flow generation network takes the initial RGB image, task description, and initial object keypoints as input and outputs the complete task flow. Second, the flow-conditioned imitation policy takes the task flow, currently tracked keypoints, and proprioception as input and outputs the end-effector action sequence for the next 16 steps. At deployment, the complete flow is generated only once at the start of the task; during execution, online point tracking and temporal alignment identify the remaining flow corresponding to the current progress, and actions are output in closed loop.

4. Detailed explanation of method

4.1 Overall pipeline

  1. Determine the target object: use Grounding DINO to detect object bounding boxes from manually given keywords. Real-task keywords include "green cup", "yellow drawer", and "checker cloth" (Appendix: Grounding DINO).
  2. Form a rectangular flow image: uniformly sample keypoints within the object bounding box to form $\mathcal{F}_0\in\mathbb{R}^{3\times H\times W}$; the three channels are the image-space $u, v$ coordinates and visibility (see the sampling sketch after this list).
  3. Generate the complete task flow: use TAPIR to extract $\mathcal{F}_{1: T}\in\mathbb{R}^{3\times T\times H\times W}$ from human-hand videos to train the flow generator; at deployment, the AnimateDiff-style model generates future flow from the initial frame and language.
  4. Filter object-centric flow: motion, SAM, and depth filters remove background points, points on overly large segments, and points with missing depth; $N=128$ keypoints are then randomly selected as policy input (Appendix: Motion Filters).
  5. Closed-loop execution: use TAPIR to track the current keypoints online at 5 Hz; the policy aligns the remaining flow with the current progress and outputs a 16-step action sequence.
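
As a concrete illustration of step 2, here is a minimal sketch (assuming a pixel-coordinate `bbox` from Grounding DINO; `initial_flow_image` is a hypothetical helper, not the authors' code):

```python
# Uniformly sample an H x W grid of keypoints inside a detected bounding box
# and pack them into the initial flow image F_0 with channels (u, v, visibility).
import numpy as np

def initial_flow_image(bbox, H=32, W=32):
    """bbox = (x_min, y_min, x_max, y_max) in pixel coordinates."""
    x_min, y_min, x_max, y_max = bbox
    us = np.linspace(x_min, x_max, W)   # W columns of u coordinates
    vs = np.linspace(y_min, y_max, H)   # H rows of v coordinates
    uu, vv = np.meshgrid(us, vs)        # each (H, W)
    vis = np.ones((H, W))               # every point visible at t = 0
    return np.stack([uu, vv, vis])      # F_0: (3, H, W) = 1024 keypoints

F0 = initial_flow_image((120, 80, 260, 200))
assert F0.shape == (3, 32, 32)
```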
Flow generation network
Figure 3: Flow generation network. Grounding DINO detects targets, keypoints are sampled within the box, AnimateDiff generates future flow, and motion filters yield the object-centric task flow.

4.2 Flow generation network

The authors' key representation choice is to organize permutation-invariant object keypoints into a structured rectangular flow image. This allows reusing the convolution and attention structures of the image/video generation model instead of directly processing unordered point sets.

Intuition: Instead of generating an RGB video, the flow generator generates "where these object points should be in the next 32 time steps."

$$ \mathcal{F}_0 \in \mathbb{R}^{3\times H\times W}, \qquad \mathcal{F}_{1: T}\in \mathbb{R}^{3\times T\times H\times W}. $$
| Symbol | Meaning |
|---|---|
| $H=W=32$ | Flow image spatial resolution given in the appendix training details, corresponding to 1024 keypoints. |
| $T=32$ | Number of time steps for the flow generation network and the task-flow horizon. |
| 3 channels | Each keypoint's $u$ coordinate, $v$ coordinate, and visibility. |
| $E_\phi, D_\theta$ | Encoder/decoder of the Stable Diffusion autoencoder; the encoder is frozen and the decoder is fine-tuned to adapt to flow images. |

To reduce the cost of high-resolution flow generation, the authors compress flow images into the Stable Diffusion latent space: $x_{1: T}=\{E_\phi(\mathcal{F}_i)\mid i\in[1, T]\}$, where the spatial dimension is downsampled 8x relative to the input flow. When training AnimateDiff, LoRA with rank 128 is inserted into the Stable Diffusion U-Net and the motion module is trained from scratch.
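
The downsampling can be shape-checked with a minimal sketch; `TinyEncoder` is a hypothetical stand-in for the frozen Stable Diffusion encoder $E_\phi$ (three stride-2 convolutions give the 8x spatial downsampling; the real encoder is far larger):

```python
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):  # stand-in for the frozen E_phi
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(  # three stride-2 convs => 8x downsampling
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(64, 4, 3, stride=2, padding=1),  # 4 latent channels
        )

    def forward(self, x):
        return self.net(x)

E_phi = TinyEncoder().eval()
T = 32
flows = torch.randn(T, 3, 32, 32)  # flow frames F_1..F_T
with torch.no_grad():
    latents = torch.stack([E_phi(f[None])[0] for f in flows])
print(latents.shape)  # torch.Size([32, 4, 4, 4]): 32x32 -> 4x4 spatially
```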

4.3 Flow-conditioned imitation policy

The policy's conditional distribution is written as:

$$ p(\mathcal{A}_{t: t+L}\mid \mathcal{F}_{0: T}, s_t, \rho_t), \qquad \mathcal{A}_{t: t+L}=\{a_t, \ldots, a_{t+L}\}. $$

Action $a_t$ contains the 6-DoF end-effector Cartesian pose and a 1-DoF gripper open/close. $s_t$ encodes the currently tracked keypoint image locations $f_t$ and the initial 3D coordinates $x_0$. $\rho_t$ is the robot proprioception.

The policy contains three modules:

| Module | Input/Output | Function |
|---|---|---|
| State encoder $\phi$ | Input: $N=128$ current 2D keypoints and the corresponding initial 3D coordinates. Output: 384-d state representation. | Compresses the current object state into a permutation-invariant representation. Implemented as a 4-layer transformer encoder + CLS token (Appendix: Policy training). |
| Temporal alignment $\psi$ | Input: complete task flow, current state, proprioception. Output: latent $z_t$ of the remaining task flow. | Estimates the current execution progress so the policy is not always conditioned on the complete flow; also the key module enabling demonstration-conditioned execution. |
| Diffusion action head | Input: $z_t, s_t, \rho_t$. Output: action sequence. | Diffusion Policy with a DDIM scheduler; 50 training diffusion steps, 16 inference steps (see the sampling sketch after this table). |
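
As a hedged illustration of the action head's sampling loop, here is a minimal sketch using the diffusers DDIMScheduler with the 50 train / 16 inference steps quoted above; `denoiser`, the crude timestep embedding, and the conditioning vector are hypothetical stand-ins for the paper's network:

```python
import torch
import torch.nn as nn
from diffusers import DDIMScheduler

L, ACT_DIM, COND_DIM = 16, 7, 512  # 16-step horizon; 6-DoF pose + 1-DoF gripper

denoiser = nn.Sequential(  # hypothetical noise-prediction network
    nn.Linear(L * ACT_DIM + 1 + COND_DIM, 256), nn.Mish(),
    nn.Linear(256, L * ACT_DIM),
)

scheduler = DDIMScheduler(num_train_timesteps=50)
scheduler.set_timesteps(16)  # 16 DDIM inference steps

cond = torch.randn(1, COND_DIM)        # stands in for (z_t, s_t, rho_t)
actions = torch.randn(1, L * ACT_DIM)  # start from Gaussian noise
for t in scheduler.timesteps:
    t_in = t.float().view(1, 1) / 50.0  # crude timestep embedding
    eps = denoiser(torch.cat([actions, t_in, cond], dim=-1))
    actions = scheduler.step(eps, t, actions).prev_sample
actions = actions.view(1, L, ACT_DIM)  # (1, 16, 7) action sequence
```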
Why is Temporal alignment necessary?

The training data comes from unstructured simulated exploration. The complete task flow only describes the target motion from start to finish, but once the policy has executed partway, it needs to know which flow segments remain unfinished. The authors therefore first use an encoder $\xi$ to encode the ground-truth remaining flow $f_{t: T'}$ into a supervision target $\hat{z}_t$, then train $\psi(\mathcal{F}_{0: T}, s_t, \rho_t)$ to predict $z_t$, supervised with the $L_2$ loss $\|\hat{z}_t-z_t\|^2$. Both the main paper and the real-world experiments show that execution becomes less smooth when alignment is removed, with particularly clear drops on real Pick&Place and Pouring.
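
A minimal training sketch of this loss follows (single-layer GRUs stand in for the paper's 4-layer $\xi$ and 8-layer $\psi$ transformers, with a simplified per-keypoint flow layout; $\hat{z}_t$ is detached here purely for illustration):

```python
import torch
import torch.nn as nn

D = 256
xi = nn.GRU(3, D, batch_first=True)        # encodes remaining flow f_{t:T'}
psi = nn.GRU(3 + D, D, batch_first=True)   # predicts z_t from full flow + state

def encode_remaining(remaining_flow):      # (B, T', 3) -> supervision target
    _, h = xi(remaining_flow)
    return h[-1]

def predict_latent(full_flow, state):      # (B, T, 3), (B, D) -> z_t
    cond = state[:, None].expand(-1, full_flow.shape[1], -1)
    _, h = psi(torch.cat([full_flow, cond], dim=-1))
    return h[-1]

full_flow = torch.randn(8, 32, 3)          # one keypoint's flow, simplified
remaining = full_flow[:, 10:]              # suppose execution is at step 10
state = torch.randn(8, D)                  # current state s_t (and rho_t)

z_hat = encode_remaining(remaining).detach()
z = predict_latent(full_flow, state)
loss = ((z_hat - z) ** 2).mean()           # L2 alignment loss
loss.backward()
```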

4.4 Implementation points

Flow generator conditional injection: the text passes through the CLIP text encoder, the initial frame through the CLIP image encoder, and the initial keypoints use a fixed 2D sinusoidal encoding; all conditions are injected into the denoising process via cross-attention.
Training hyperparameters: Stable Diffusion decoder fine-tuned for 400 epochs, learning rate $5\times10^{-5}$; AnimateDiff LoRA rank 128; motion module trained for 4000 epochs, learning rate $1\times10^{-4}$, AdamW weight decay $10^{-2}$.
Policy input format: each sample is $(\rho_t, f_t, \mathcal{A}_{t: t+L}, \mathcal{F}_{0: T})$; task-flow horizon $T=32$, action sequence length 16, and $N=128$ keypoints are randomly selected from the available keypoints.
Real execution: UR5e + WSG-50 gripper; the policy sends end-effector pose commands at 2.5 Hz; end-effector speed is limited to $<0.2$ m/s and the position stays at least 1 cm above the tabletop (see the clamp sketch below); RealSense D415 at 720p/30 Hz; online tracking input downsampled to 256x256 at 5 Hz.
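
The execution limits above can be made concrete with a small sketch (`clamp_command` is a hypothetical helper; placing the tabletop at z = 0 in the robot frame is an assumption, not from the paper):

```python
import numpy as np

CMD_HZ, MAX_SPEED = 2.5, 0.2          # 2.5 Hz commands, < 0.2 m/s end speed
TABLE_Z, MIN_CLEARANCE = 0.0, 0.01    # tabletop height, 1 cm minimum clearance

def clamp_command(curr_xyz, target_xyz):
    dt = 1.0 / CMD_HZ
    step = np.asarray(target_xyz) - np.asarray(curr_xyz)
    speed = np.linalg.norm(step) / dt
    if speed > MAX_SPEED:             # cap end-effector speed
        step *= MAX_SPEED / speed
    clamped = np.asarray(curr_xyz) + step
    clamped[2] = max(clamped[2], TABLE_Z + MIN_CLEARANCE)  # stay above table
    return clamped

print(clamp_command([0.3, 0.0, 0.02], [0.3, 0.0, -0.05]))  # z clamped to 0.01
```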
```
Algorithm: Im2Flow2Act inference
Input: initial RGB-D observation, task text, object keyword, calibrated camera
1. Use Grounding DINO to detect the object of interest.
2. Uniformly sample H x W keypoints in the object bounding box.
3. Generate a complete task flow F_{0:T} with the flow generation network.
4. Apply motion, SAM, and depth filters; keep N = 128 object keypoints.
5. During execution, track the same keypoints online with TAPIR.
6. Encode current keypoints and initial 3D coordinates into state s_t.
7. Temporal alignment predicts the remaining-flow latent z_t.
8. Diffusion action head outputs a 16-step end-effector action sequence.
9. Execute in closed loop until the task success criterion is met or the episode ends.
```

5. Experiment

5.1 Experimental setup

The authors evaluate four categories of tasks: Pick&Place, Pouring, Drawer Opening, and Cloth Folding, covering rigid, articulated, and deformable objects. The training data has two parts: robot exploration data collected in simulation with a UR5e and predefined random heuristic primitives, and real-world human demonstration videos collected for each task to train the flow generation model. Evaluation covers both demonstration-conditioned and language-conditioned execution modes.

| Evaluation mode | Meaning | What does it isolate? |
|---|---|---|
| Demonstration-conditioned | The policy extracts the object flow from a single human demonstration video or a simulated spherical-agent demonstration. | Mainly tests whether the low-level flow-to-action policy can follow a given flow, factoring out flow-generation errors. |
| Language-conditioned | The complete system first generates a flow from the task description and initial image, then executes with the policy. | Tests the end-to-end capability of flow generator + policy. |

5.2 Baselines

| Baseline | Design purpose | Fairness / additional information |
|---|---|---|
| ATM | Closed-loop grid flow generation that only predicts the next few steps. | Tests whether the full object-centric task flow is better suited to cross-embodiment transfer than grid flow. |
| Heuristic | Selects a contact point and derives motion from the future object flow via RANSAC / pose estimation. | The authors give it ground-truth 3D flow and the optimal grasp pose; used to test whether a learned policy is necessary. |
| GridFlow | Replaces Im2Flow2Act's object keypoints with uniform grid keypoints. | Examines the effect of object-centric keypoint sampling. |
| No alignment | Removes the temporal alignment module $\psi$ and the complete-task-flow condition. | Checks whether alignment helps retrieve the action corresponding to the current progress from unstructured exploration data. |
Tasks and results
Figure 4: The four real-world tasks and execution results. The initial scene, flow-generation visualization, and online point tracking come from the RealSense; robot execution is recorded from another viewpoint.

5.3 Simulation results

| Method | Demo: Pick&Place | Demo: Pouring | Demo: Drawer | Demo: Cloth | Lang: Pick&Place | Lang: Pouring | Lang: Drawer | Lang: Cloth |
|---|---|---|---|---|---|---|---|---|
| Im2Flow2Act | 100 | 95 | 95 | 90 | 90 | 85 | 90 | 35 |
| ATM | / | / | / | / | 50 | 30 | 85 | 30 |
| Heuristic | 70 | 50 | 30 | 0 | / | / | / | / |
| GridFlow | 30 | 25 | 35 | 45 | / | / | / | / |
| No alignment | 80 | 85 | 90 | 90 | / | / | / | / |

The authors' interpretation: object flow can connect different data sources. Although Heuristic is given ground-truth 3D flow and the optimal grasp pose, it only works on rigid-body tasks and fails clearly on drawer / cloth, indicating that flow-to-action requires a learned policy; GridFlow is significantly worse than object flow, indicating that excluding embodiment/background motion is critical.

5.4 Real-world results

| Method | Demo: Pick&Place | Demo: Pouring | Demo: Drawer | Demo: Cloth | Lang: Pick&Place | Lang: Pouring | Lang: Drawer | Lang: Cloth |
|---|---|---|---|---|---|---|---|---|
| Im2Flow2Act | 95 | 80 | 90 | 70 | 90 | 80 | 85 | 70 |
| Heuristic | 70 | 50 | 30 | 0 | / | / | / | / |
| No alignment | 55 | 0 | 80 | 60 | / | / | / | / |

The real-world average success rate is about 81%. The authors note that the average drop from simulation to real is only 15% and attribute this to flow attending to motion rather than appearance, which reduces the sim-to-real gap. No alignment drops sharply on real Pick&Place/Pouring, supporting the temporal alignment design.

Typical failure cases of baselines
Figure 5: Baseline failure cases. Heuristic may first push the drawer back even with human ground-truth flow; No Alignment causes unnecessary rotation, mimicking random behavior in the exploration data.

5.5 Appendix: ATM, long-horizon and keypoint ablation

| Appendix experiment | Result | Conclusion |
|---|---|---|
| ATM comparison | On Pick&Place/Pouring/Drawer, Im2Flow2Act language-conditioned reaches 90/85/90; ATM cross-embodiment 50/30/85; ATM same-embodiment 90/90/95. | ATM can be strong in the same embodiment, but its visual input goes OOD across embodiments; object-centric flow is more robust to the UR5 appearing in the image. |
| Long-horizon multi-object task | The task is open drawer -> pick blue cube -> place into drawer -> close drawer. Im2Flow2Act demonstration-conditioned 90%, language-conditioned 85%; ATM language-conditioned 45%. | Complete task flow + temporal subsampling can compress long-horizon multi-object tasks. |
| Initial 3D keypoints | Demo-conditioned without 3D: 100/90/90; full: 100/95/95. Language-conditioned without 3D: 85/85/80; full: 90/85/90. | Initial 3D keypoints help identify noisy keypoints in generated flow input, especially drawer handle positions. |
ATM comparison
Figure 6: Comparison of flow generation between ATM and Im2Flow2Act during pouring deployment.
Long horizon task
Figure 7: Long-horizon multi-object task. The system first generates flows for multiple objects; the policy then executes them in sequence.
ATM full grid comparison
Figure 8: ATM full-grid supplementary comparison. The authors show that when the robot is closer to the object, ATM cross-embodiment still generates noisy flow; Im2Flow2Act's flow has been processed by motion filters, and its initial points are uniformly sampled from the bounding box rather than manually selected.

5.6 Appendix: Pre-training, simulation flow and deformable failure

| Experiment | Value / phenomenon | Explanation |
|---|---|---|
| Stable Diffusion pre-training ablation | Pretrained U-Net: 90/85/90/35; U-Net from scratch: 90/90/95/30. | Pretraining has little impact on final success rates, but LoRA + pretrained SD improves training efficiency; the AE latent space may still help diffusion learning. |
| Generated flow in simulation | In simulation, a spherical agent replaces the UR5 to collect cross-embodiment demonstrations; a motion filter is used to visualize the high-variance flow. | Verifies that the flow generator works not only on real human-hand videos but also learns from simulated cross-embodiment demonstrations. |
| Deformable failure | In the cloth-folding simulation, small 1cm x 1cm x 9cm blocks are attached to the cloth corners so the robot can grasp the cloth; the policy often reaches the corners but cannot precisely grasp the blocks. | Explains why simulated language-conditioned Cloth is only 35%; in the real world the cloth can be grasped anywhere near a corner, so the real success rate is more representative. |
Pretrain versus scratch loss
Figure 9: Training loss of the pretrained vs. from-scratch U-Net; the authors show that pretraining converges faster.
Deformable failure
Figure 10: Deformable-environment failure. The generated flow points to a reasonable grasp point, but the policy fails to grasp the small block, so the task fails.
Generated flow in simulation
Figure 11: Visualization of generated flow in simulation. The authors use the motion filter to remove high-variance flow and downsample keypoints for display; the real-world flow generator is trained on human demonstrations.

6. Summary of recurrence information

6.1 Data collection

| Data source | How it is collected / used |
|---|---|
| Simulated exploration data | Collected in simulation with a UR5e and a set of predefined random heuristic primitives, covering rigid, articulated, and deformable objects; used to train the multi-task flow-conditioned policy. |
| Real-world human demonstrations | Four tasks: pick & place, pouring, opening a drawer, folding cloth; recorded with a RealSense at 30 FPS; used to train the flow generation model. |
| Simulated cross-embodiment demos | A sphere agent stands in for human demonstrations, collected with the same primitives; the long-horizon task uses 150 sphere trajectories and 100 UR5 demonstrations. |

6.2 Real-world evaluation protocol

6.3 Key hyperparameters

| Module | Configuration |
|---|---|
| Flow image | $H=W=32$, $T=32$, 1024 keypoints, 3 channels: $u$, $v$, visibility. |
| Flow generator | Stable Diffusion AE encoder frozen; decoder fine-tuned for 400 epochs, lr $5\times10^{-5}$; AnimateDiff U-Net with LoRA rank 128; motion module trained from scratch for 4000 epochs, lr $1\times10^{-4}$, AdamW weight decay $10^{-2}$. |
| CLIP conditions | CLIP text encoder for the text; final ViT-layer patch embeddings of the CLIP image encoder for the initial image; openai/clip-vit-large-patch14, weights frozen. |
| Policy sample | $(\rho_t, f_t, \mathcal{A}_{t: t+L}, \mathcal{F}_{0: T})$; task-flow horizon 32; action sequence length 16; $N=128$ keypoints randomly selected per sample during training. |
| State encoder | Initial 3D coordinates projected to 192-d, 2D keypoint location encoding 192-d, concatenated into a 384-d descriptor; 4-layer transformer encoder + CLS token (see the sketch after this table). |
| Temporal alignment | Remaining-task-flow encoder $\xi$ is a 4-layer transformer; the alignment model is an 8-layer transformer; fixed 1D sinusoidal positional encoding and CLS token. |
| Action head | Diffusion Policy; DDIM scheduler; 50 training diffusion steps, 16 inference steps; policy trained for 500 epochs, lr $1\times10^{-4}$, AdamW weight decay $10^{-2}$. |
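
A minimal sketch of the state encoder row above (head count and other unquoted details are assumptions, not the authors' code):

```python
import torch
import torch.nn as nn

class StateEncoder(nn.Module):
    def __init__(self, d=384, n_layers=4):
        super().__init__()
        self.proj3d = nn.Linear(3, 192)  # initial 3D coordinates x_0
        self.proj2d = nn.Linear(2, 192)  # currently tracked 2D keypoints f_t
        self.cls = nn.Parameter(torch.zeros(1, 1, d))
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=6, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, kp2d, kp3d):  # (B, N, 2), (B, N, 3)
        tokens = torch.cat([self.proj2d(kp2d), self.proj3d(kp3d)], dim=-1)
        cls = self.cls.expand(tokens.shape[0], -1, -1)
        return self.encoder(torch.cat([cls, tokens], dim=1))[:, 0]  # (B, 384)

s_t = StateEncoder()(torch.randn(8, 128, 2), torch.randn(8, 128, 3))
assert s_t.shape == (8, 384)  # permutation-invariant state representation
```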

6.4 Inference perception module

| Module | Configuration / thresholds |
|---|---|
| Grounding DINO | grounding-dino-base, input 480x640; keywords: green cup, yellow drawer, checker cloth. |
| Motion filter | Removes points in 256x256 image space whose motion falls below a threshold; threshold 20 for Pick&Place/Pouring/Drawer, 10 for Cloth (see the combined filter sketch after this table). |
| SAM filter | Initial frame resized to 256x256, finest segmentation used; points whose segment area exceeds the threshold of 10,000 are filtered out. |
| Depth filter | Removes keypoints with missing depth, i.e., a depth value of 0. |
| Online tracking | TAPIR online point tracking; visual observations resized to 256x256, 5 Hz. |
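
Combining the three filters, a hedged sketch (thresholds from the table; the array layout, helper name, and max-displacement motion statistic are assumptions):

```python
import numpy as np

def filter_keypoints(tracks, seg_areas, depths,
                     motion_thresh=20, area_thresh=10_000, n_keep=128):
    """tracks: (N, T, 2) trajectories in 256x256 image space;
    seg_areas: (N,) pixel area of each point's SAM segment;
    depths: (N,) initial depth values (0 = missing)."""
    motion = np.linalg.norm(tracks - tracks[:, :1], axis=-1).max(axis=1)
    keep = motion >= motion_thresh     # motion filter: drop near-static points
    keep &= seg_areas <= area_thresh   # SAM filter: drop overly large segments
    keep &= depths > 0                 # depth filter: drop missing depth
    idx = np.flatnonzero(keep)
    if len(idx) > n_keep:              # randomly keep N = 128 for the policy
        idx = np.random.choice(idx, n_keep, replace=False)
    return idx

idx = filter_keypoints(np.random.rand(1024, 32, 2) * 256,
                       np.random.randint(100, 20_000, size=1024),
                       np.random.rand(1024))
```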

7. Analysis, Limitations and Boundaries

7.1 The most valuable part of this paper

According to the paper's own claims and experiments, the most valuable contribution is decomposing cross-domain robot learning into a relatively clean interface problem: object flow expresses the task knowledge, human demonstrations are only responsible for providing "how the object moves", and simulated exploration is only responsible for learning "how the robot achieves this motion". This decomposition avoids transferring directly from human actions to the UR5e and avoids collecting robot teleoperation data for each real task.

7.2 Why the results hold up

Experimental support comes mainly from three types of controls. First, the Heuristic baseline, even when given ground-truth 3D flow and the optimal grasp pose, still fails on drawer / cloth, indicating that flow-to-action cannot be reduced to simple pose estimation. Second, the drops of GridFlow and ATM cross-embodiment indicate that an object-centric complete flow suits cross-embodiment transfer better than grid flow that contains embodiment motion. Third, No alignment drops sharply on real tasks, indicating that temporal alignment is essential for grounding a policy learned from unstructured exploration data in real human flow.

7.3 The authors' stated limitations

7.4 Applicable boundaries

Suitable for use casesNot suitable or requires additional modules
Target tasks can be described by object point motion, such as pick/place, pour, open drawer, fold cloth.The critical success factors are force, tactile, hidden state or non-visual feedback rather than visible object motion.
Target objects can be stably positioned by Grounding DINO/SAM/depth filtering and can be tracked online.Scenes with severe occlusion, transparent/reflective objects, no depth or tracking easily lost.
Simulation can generate exploration data that sufficiently covers the action-object motion relationship.Real contact dynamics differ greatly from simulation, especially for deformable, granular or high-precision force control tasks.
The camera view can see the action and target objects, and the simulated/real view can be calibrated.Multiple viewing angles change significantly, hand-eye occlusion is strong, and actions occur in areas invisible to the camera.