
3DFlowAction: Learning Cross-Embodiment Manipulation from 3D Flow World Model

arXiv: 2506.06199 · Keywords: 3D optical flow, cross-embodiment manipulation, flow world model, optimization policy
Authors: Hongyan Zhi, Peihao Chen, Siyuan Zhou, Yubo Dong, Quanxi Wu, Lei Han, Mingkui Tan
Organization: South China University of Technology; Tencent Robotics X; HKUST; Pazhou Laboratory
Code/data: github.com/Hoyyyaard/3DFlowAction
Material note: the arXiv source package contains only the figure PDFs, and the text was extracted from the local PDF; this report retains the PDF and the available PNG images.

1. Quick overview of the paper

What does the paper try to solve? Robot data lack a uniform action space: joint angles, end-effector poses, coordinate frames, and hardware form factors all differ, which makes cross-robot learning difficult. Meanwhile, generating future RGB states directly from video is easily disturbed by the background, the robot arm's appearance, and the limits of the 2D image plane. The paper asks whether an object-centric, 3D, cross-embodiment motion representation can extract motion cues from human and robot videos that transfer to real robot manipulation.
The authors' approach: the core representation is 3D optical flow. They build ManiFlow-110k by extracting the motion of manipulated objects in open-source human/robot videos as 3D optical flow; train an AnimateDiff-based 3D flow world model that predicts the manipulated object's future 3D motion from language and the current scene; then close the loop with rendering plus GPT-4o verification, and use the 3D flow as an optimization constraint to solve for end-effector actions.
Most important results: the overall success rate on four real manipulation tasks reaches 70.0%, above AVDC (20.0%), ReKep (20.0%), and Im2Flow2Act* (25.0%); against imitation learning methods, 3DFlowAction's 70.0% beats PI0 (50.0%) and Im2Flow2Act (27.5%). Cross-platform: Franka 67.5% and XTrainer 70.0%, without any hardware-specific fine-tuning.
What to watch while reading: the point is not "yet another world model" but whether 3D flow can truly serve as a unified action interface. Pay special attention to the difference between 2D flow and 3D flow, whether the GPT-4o rendering verification is reliable, whether the optimization policy can handle contact, occlusion, and non-rigid bodies, and the fact that each task has only 10 trials, so statistical stability needs caution.
One-sentence version: 3DFlowAction does not let the model predict robot motions directly. Instead, it predicts how the manipulated object should move in 3D space, then renders, verifies, and converts this 3D optical-flow trajectory into a robot end-effector trajectory.
3DFlowAction teaser
Teaser: 3DFlowAction learns the flow world model and uses 3D optical flow as an action guide for downstream real robot operations.
3DFlowAction pipeline
Flow-guided action generation pipeline: First generate 3D flow in a closed loop, then select task-aware grasp pose, and finally use 3D flow constraints to optimize the action.

2. Motivation and problem definition

2.1 Why 3D flow is needed

The authors start from an intuition about human manipulation: before performing "hang the cup on the mug rack", people usually first imagine how the object moves through space, not how their joint angles change. This object motion trajectory is largely shared between humans and different robots, so the paper treats "the object's future 3D motion" as an embodiment-agnostic action representation.

Compared with an RGB video world model, 3D flow is more object-centric and reduces interference from the background and the robot's appearance. Compared with 2D optical flow, 3D flow can express genuinely 3D trajectories such as motion along the depth axis, rotation, and pouring, which is especially important for tasks like pouring tea, inserting pens, and hanging cups.

2.2 Problem setting

The input is the current RGB observation, task language and initial object point set, and the output is a temporal 3D optical flow:

$$F \in \mathbb{R}^{T\times H\times W\times4}, $$

The first two channels are the 2D coordinates of the image plane, the third channel is the depth, and the fourth channel is visibility. This flow describes the 3D position changes of the surface points of the manipulated object at various times in the future.

Afterwards, the system does not train a hardware-bound action head; it turns the flow into a constraint function and solves for a sequence of end-effector poses in SE(3) via optimization.
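To make the representation concrete, here is a minimal numpy sketch (my own illustration, not the paper's code) that lifts the (u, v, depth, visibility) channels into 3D camera-frame points with a pinhole model. The paper's tensor is dense over the image grid ($T\times H\times W\times 4$); the sketch uses a flattened point set of shape $T\times N\times 4$ for compactness, and the intrinsics values are placeholders.

```python
import numpy as np

def flow_to_points(flow: np.ndarray, fx: float, fy: float,
                   cx: float, cy: float) -> np.ndarray:
    """Lift a (T, N, 4) flow of (u, v, depth, visibility) tuples
    into (T, N, 3) camera-frame XYZ points via the pinhole model."""
    u, v, z, vis = flow[..., 0], flow[..., 1], flow[..., 2], flow[..., 3]
    x = (u - cx) / fx * z
    y = (v - cy) / fy * z
    pts = np.stack([x, y, z], axis=-1)
    pts[vis < 0.5] = np.nan  # mark occluded points as invalid
    return pts

# Toy usage; the intrinsics are made-up placeholder values.
points_3d = flow_to_points(np.random.rand(16, 128, 4),
                           fx=600.0, fy=600.0, cx=320.0, cy=240.0)
print(points_3d.shape)  # (16, 128, 3)
```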

2.3 Contribution positioning

4. Detailed explanation of method

4.1 ManiFlow-110k: Extract 3D flow from original video

The paper first requires a large amount of data about "how objects move". Open-source robot/human videos often contain cluttered backgrounds, similar-looking objects, and robot grippers, all of which standard detectors easily mis-detect. The authors therefore propose a moving-object detection pipeline:

  1. Segment the gripper mask in the first frame using Grounding-SAM2.
  2. Sample points across the entire first frame and exclude those falling inside the gripper mask.
  3. Track these points with CoTracker3 to find points with significant motion over the video.
  4. Use the maximal bounding box of these motion points to localize the manipulated object.
  5. Run CoTracker3 again to extract the object's 2D optical flow.
  6. Remove camera-motion flow if necessary.
  7. Use DepthAnythingV2 to predict depth and lift the 2D flow to 3D.

The authors verified that this pipeline's moving-object detection accuracy exceeds 80% on BridgeV2. In total, 110k 3D flow instances were generated from open-source datasets such as BridgeV2, RT1, AgiWorld, Libero, RH20T-Human, HOI4D, and DROID.
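Once the segmenter and tracker have run, the detector logic in steps 2–4 reduces to simple array filtering. Below is a minimal numpy sketch under that assumption; the function name, threshold, and array layouts are mine, not the paper's (`tracks` stands in for point-tracker output, `gripper_mask` for the segmentation mask).

```python
import numpy as np

def detect_moving_object(tracks: np.ndarray, gripper_mask: np.ndarray,
                         motion_thresh: float = 5.0):
    """Core filtering logic of steps 2-4, assuming the heavy models have
    already run: `tracks` is (T, N, 2) pixel trajectories in (x, y) from
    a point tracker (e.g. CoTracker3), `gripper_mask` an (H, W) boolean
    mask from a segmenter (e.g. Grounding-SAM2)."""
    # Step 2: drop points whose first-frame location lies on the gripper.
    h, w = gripper_mask.shape
    first = tracks[0].round().astype(int)
    x = first[:, 0].clip(0, w - 1)
    y = first[:, 1].clip(0, h - 1)
    tracks = tracks[:, ~gripper_mask[y, x]]
    # Step 3: keep points with significant total displacement.
    displacement = np.linalg.norm(tracks[-1] - tracks[0], axis=-1)
    moving = tracks[:, displacement > motion_thresh]
    if moving.shape[1] == 0:
        return None  # no significant motion found in this clip
    # Step 4: the maximal bounding box of the moving points localizes
    # the manipulated object for a second, denser tracking pass.
    pts = moving.reshape(-1, 2)
    return pts.min(axis=0), pts.max(axis=0)
```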

3D flow generation pipeline
3D flow generation pipeline: Detect moving objects from multi-source human/robot videos, build ManiFlow-110k, and pre-train the flow world model.

4.2 3D Flow World Model

The model's goal is to generate the time-varying 3D flow $F$ from the initial RGB observation, the task prompt, and the initial points $F_0$. The authors follow the idea of Im2Flow2Act and use AnimateDiff as the flow generator, with key adjustments.
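The extracted text does not enumerate those adjustments, but the model's input/output contract is clear from the description. A stub of that interface (names and shapes are my assumptions; a real implementation would run AnimateDiff diffusion denoising inside):

```python
import numpy as np

def flow_world_model(rgb: np.ndarray, prompt: str,
                     F0: np.ndarray, T: int = 16) -> np.ndarray:
    """Interface stub for the flow world model: condition on the current
    RGB frame (H, W, 3), a language prompt, and initial object points
    F0 (N, 4), and return a temporal 3D flow (T, N, 4) of
    (u, v, depth, visibility). This stub just repeats F0 in time."""
    return np.repeat(F0[None], T, axis=0)
```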

3D flow generation and experiments
3D flow generation / experiment settings. The figure shows the core processes of ManiFlow-110k construction, flow world model and task settings.
3DFlowAction qualitative visualization
Flow world model visualization: The paper shows the prediction and execution effects of 3DFlowAction in four tasks.

4.3 Closed-Loop Motion Planning

3D flow predictions can still be wrong. To improve stability, the authors design an object-centric target-state rendering machine. Let the flow points at the first time step be $P_1$ and at the last time step be $P_2$; the rigid transformation between them is estimated via SVD:

$$T=\mathrm{SVD}(P_2, P_1).$$

$T$ is then applied to the manipulated object's initial point cloud to obtain the predicted target state, which is placed back into the current 3D scene point cloud and reprojected to a 2D image. Finally, the task instruction and the rendered image are fed to GPT-4o to judge whether the flow is aligned with the task; on failure, the flow is re-predicted. This mechanism turns one-shot world-model sampling into closed-loop planning with verification.
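Both steps are easy to sketch: the SVD step is the classical Kabsch/least-squares rigid alignment, and the verification step is a retry loop around it. A minimal numpy version (my illustration; `predict_flow`, `render_goal`, and `vlm_accepts` are placeholder callables, not real APIs):

```python
import numpy as np

def estimate_rigid_transform(P2: np.ndarray, P1: np.ndarray):
    """Least-squares rigid transform (R, t) with P2 ~ R @ P1 + t,
    via the standard SVD (Kabsch) solution; P1, P2 are (N, 3)."""
    c1, c2 = P1.mean(axis=0), P2.mean(axis=0)
    H = (P1 - c1).T @ (P2 - c2)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))   # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = c2 - R @ c1
    return R, t

def plan_with_verification(predict_flow, render_goal, vlm_accepts,
                           max_tries=3):
    """Closed-loop wrapper: sample a flow (as (T, N, 3) points after
    depth lifting), render the implied goal state, ask the VLM judge,
    and re-sample on rejection. All three callables are placeholders
    for the paper's components."""
    for _ in range(max_tries):
        flow = predict_flow()
        R, t = estimate_rigid_transform(flow[-1], flow[0])
        if vlm_accepts(render_goal(R, t)):
            return flow, (R, t)
    return None
```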

4.4 Task-Aware Grasp Pose Generation

A wrong grasp pose can make the target location unreachable or the task impossible to complete. The authors ask GPT-4o to output, given the task, which part of the object should be grasped, and then use AnyGrasp to generate candidate grasp poses around that part. Since AnyGrasp itself is task-agnostic, the system applies the previously estimated transformation $T$ to each candidate grasp to obtain its pose at the target state and checks reachability with the robot's IK, thereby selecting a task-relevant and reachable grasp pose.
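The feasibility check reduces to composing the estimated object transform with each candidate grasp and querying IK at both ends. A minimal sketch (the `ik_reachable` predicate is a placeholder for the robot's IK solver, not a real API):

```python
import numpy as np

def select_task_aware_grasp(candidate_grasps, R, t, ik_reachable):
    """Filter candidate grasps (4x4 pose matrices generated near the
    GPT-4o-selected part) by whether both the grasp pose and its pose
    after the predicted object motion are reachable. `ik_reachable`
    is a user-supplied predicate wrapping the robot's IK solver."""
    T_goal = np.eye(4)
    T_goal[:3, :3], T_goal[:3, 3] = R, t  # object motion from the flow
    for g in candidate_grasps:
        if ik_reachable(g) and ik_reachable(T_goal @ g):
            return g  # first candidate feasible at start and goal
    return None

# Toy usage: accept any pose whose position lies inside a 0.8 m box.
ik_ok = lambda g: bool(np.all(np.abs(g[:3, 3]) < 0.8))
grasp = select_task_aware_grasp([np.eye(4)], np.eye(3), np.zeros(3), ik_ok)
```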

4.5 Flow-Based Action Generation

Action generation is formulated as constrained optimization. First, $N$ key points are selected from the object surface via farthest point sampling, and the predicted 3D flow at these points is retrieved. At each time step $t$, the goal is to drive the current object key points to the positions predicted by the flow world model, i.e. to minimize over the end-effector pose the cost

$$f^{(t)}(k_{\mathrm{initial}})=\sum_{i=1}^{N}\left\|k^i_{\mathrm{initial}}-k^i_{\mathrm{pred}}(t)\right\|_2.$$

IK and collision-detection constraints are also added for the real scene. For the single-arm robot, the decision variable is the end-effector pose, a 6-dimensional vector of position plus Euler angles; the position is limited by workspace bounds and the rotation to the lower hemisphere. The optimization implementation follows ReKep: in the first iteration, Dual Annealing performs a global search followed by SLSQP local refinement; subsequent iterations are warm-started from the previous solution and run only the local optimization.
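The two-stage scheme maps directly onto SciPy. A minimal sketch of one solve step, assuming the keypoint cost $f^{(t)}$ is available as a Python callable (the toy cost below is a stand-in, not the paper's):

```python
import numpy as np
from scipy.optimize import dual_annealing, minimize

def solve_step(cost, bounds, x_prev=None):
    """One solve in the ReKep-style scheme: global Dual Annealing on the
    first iteration, then SLSQP local refinement; later iterations are
    warm-started from the previous solution. `cost` maps a 6-D
    end-effector pose (xyz + Euler angles) to the keypoint-to-flow
    distance; `bounds` encode workspace and rotation limits."""
    if x_prev is None:
        x0 = dual_annealing(cost, bounds=bounds, maxiter=100).x
    else:
        x0 = x_prev
    return minimize(cost, x0, method="SLSQP", bounds=bounds).x

# Toy usage: drive the pose toward a fixed target (a stand-in for f^(t)).
target = np.array([0.4, 0.0, 0.3, 0.0, 0.0, 0.0])
cost = lambda x: float(np.linalg.norm(x - target))
bounds = [(-1.0, 1.0)] * 3 + [(-np.pi, np.pi)] * 3
pose = solve_step(cost, bounds)          # first step: global + local
pose = solve_step(cost, bounds, pose)    # next step: local only
```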

5. Experiments and results

5.1 Experimental setup

The experiments cover four basic but spatially demanding tasks: pouring tea from a teapot into a cup, inserting a pen into a pen holder, hanging a cup on a mug rack, and opening the top drawer. Each setting is run 10 times with randomized object poses, and the success rate is reported. The real hardware is a Dobot XTrainer; perception uses a Femto Bolt camera placed opposite the robot, providing a third-person view.

For fine-tuning data, the authors manually collected 30 human-hand demonstrations per task (about 10 minutes of collection per task), without any robot action labels, to fine-tune 3DFlowAction.

5.2 Comparison with manipulation world models

| Task | AVDC | ReKep | Im2Flow2Act* | 3DFlowAction |
| --- | --- | --- | --- | --- |
| Pour tea from teapot to cup | 1/10 | 2/10 | 2/10 | 6/10 |
| Insert pen in holder | 2/10 | 1/10 | 2/10 | 7/10 |
| Hang cup to mug rack | 0/10 | 3/10 | 0/10 | 5/10 |
| Open top drawer | 5/10 | 2/10 | 6/10 | 10/10 |
| Total | 20.0% | 20.0% | 25.0% | 70.0% |

Im2Flow2Act* is a 2D-flow baseline that replaces the original method's learned action policy with the same optimization-based action generation. 3DFlowAction leads on every task, indicating that 3D flow is more expressive for tasks involving depth and rotation, such as pouring tea, inserting pens, and hanging cups.

Planning and execution comparison for pouring tea
Visualization of planning and execution on the tea-pouring task: baseline 2D/code-based planning struggles to fully express object motion in 3D space.

5.3 Cross-embodiment experiments

| Task | Franka | XTrainer |
| --- | --- | --- |
| Pour tea from teapot to cup | 7/10 | 6/10 |
| Insert pen in holder | 7/10 | 7/10 |
| Hang cup to mug rack | 4/10 | 5/10 |
| Open top drawer | 9/10 | 10/10 |
| Total | 67.5% | 70.0% |

The authors emphasize that this experiment involved no robot-specific fine-tuning. Since 3D flow describes object motion rather than robot actions, in principle any new robot that can realize the object motion through IK and optimization can execute the task, which enables cross-platform transfer.

5.4 Comparison with imitation learning methods

| Task | PI0 | Im2Flow2Act | 3DFlowAction |
| --- | --- | --- | --- |
| Pour tea from teapot to cup | 5/10 | 4/10 | 6/10 |
| Insert pen in holder | 5/10 | 2/10 | 7/10 |
| Hang cup to mug rack | 4/10 | 0/10 | 5/10 |
| Open top drawer | 6/10 | 5/10 | 10/10 |
| Total | 50.0% | 27.5% | 70.0% |

PI0, the VLA/flow-based action model baseline, performs well but still falls below 3DFlowAction. The key point of this comparison: 3DFlowAction uses no robot teleoperation action labels, whereas methods like PI0 and Im2Flow2Act require action data or simulation data.

5.5 OOD Object and Background Generalization

| Task | AVDC (obj.) | PI0 (obj.) | 3DFlowAction (obj.) | AVDC (bg.) | PI0 (bg.) | 3DFlowAction (bg.) |
| --- | --- | --- | --- | --- | --- | --- |
| Pour tea | 0/10 | 3/10 | 4/10 | 0/10 | 4/10 | 4/10 |
| Insert pen | 2/10 | 6/10 | 6/10 | 0/10 | 1/10 | 4/10 |
| Hang cup | 0/10 | 2/10 | 4/10 | 0/10 | 3/10 | 4/10 |
| Open drawer | 4/10 | 5/10 | 8/10 | 0/10 | 5/10 | 8/10 |
| Total | 15.0% | 40.0% | 55.0% | 0.0% | 32.5% | 50.0% |

(obj. = object generalization, bg. = background generalization)

AVDC drops to 0 under background generalization, indicating that RGB future-state generation is heavily disturbed by background changes. 3DFlowAction also degrades but stays around 50%, consistent with the expectation that an object-centric representation is more robust to background changes.

Object generalization visualization
OOD object generalization visualization: shows flow prediction and execution results under different target object conditions.
Background generalization visualization
Background generalization visualization: Shows the prediction and execution performance of 3DFlowAction under background changes.

5.6 Ablation: Closed-loop planning and large-scale pre-training

| Method | Large-scale Pretrain | Rendering Machine | Success Rate | Pour tea | Insert pen | Hang cup | Open drawer |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Variant 1 | Yes | No | 50.0% | 3/10 | 5/10 | 3/10 | 9/10 |
| Variant 2 | No | Yes | 30.0% | 3/10 | 3/10 | 2/10 | 4/10 |
| 3DFlowAction | Yes | Yes | 70.0% | 6/10 | 7/10 | 5/10 | 10/10 |

Turning off the rendering machine drops the average by 20 points, showing that GPT-4o verification and re-prediction genuinely help; removing ManiFlow-110k large-scale pre-training drops it by 40 points, showing that the 10 to 30 human demonstrations per downstream task are not enough to learn a stable 3D flow from scratch.

6. Reproducibility key points

6.1 Data and annotation process

ManiFlow rendering visualization
ManiFlow-110k visualization of in-domain flow generation and target state rendering.

6.2 Training details

| Parameter | Value |
| --- | --- |
| Model base | AnimateDiff + Stable Diffusion v1.5 layers + motion module + LoRA |
| Training data | ManiFlow-110k |
| Learning rate | 0.0001 |
| Batch size | 512 |
| Epochs | 500 |
| Optimizer | AdamW |
| Weight decay | 0.01 |
| Epsilon | 1e-8 |
| Compute | 8x8 V100, about 2 days |

6.3 Optimization process


7. Analysis, Limitations and Boundaries

7.1 The most valuable part of this paper

The most valuable aspect of this paper is that it reframes cross-robot action learning from "unifying the robot action space" to "unifying the object's 3D motion space". 3D flow is both more object-centric than RGB video and more expressive of depth and rotation than 2D flow, so it genuinely has the potential to become a common interface between human video, robot video, and different hardware.

The second value is the completeness of the system: data construction, the flow world model, closed-loop flow verification, grasp pose selection, and action optimization are all connected. The paper does not merely show flow-prediction visualizations; it uses flow to achieve real-robot task success.

7.2 Why the results hold up

7.3 Main limitations

7.4 Boundary conditions

| Applicable conditions | Conditions that require caution |
| --- | --- |
| The manipulated object is approximately rigid and key surface points can be tracked. | Cloth, rope, liquid, or other strongly non-rigid deformation. |
| The task can be expressed via object target poses/trajectories. | The task relies on force control, touch, hidden states, or complex contact patterns. |
| A single-arm robot can realize the object trajectory through IK. | The end-effector is unreachable, the grasp point is unstable, or re-grasping is required. |
| GPT-4o can be called for plan verification and re-prediction. | Real-time, safety-critical, or offline deployments, or systems that cannot call an external VLM. |

8. Preparation for group meeting Q&A

Q1: What is the biggest difference between 3DFlowAction and Im2Flow2Act?

Im2Flow2Act mainly uses 2D optical flow, which cannot fully express motion along the depth direction or 3D rotation. 3DFlowAction forms a 3D flow from 2D coordinates, depth, and visibility, and uses the 3D flow directly as an optimization constraint.

Q2: Why not use RGB video world model?

An RGB video world model must generate the background, robot appearance, and irrelevant objects, which is computationally expensive and easily disturbed by OOD backgrounds. 3D flow focuses only on the manipulated object's motion trajectory, which is more object-centric and better suited to direct conversion into action constraints.

Q3: What does GPT-4o do in the system?

Two key uses: first, after flow-guided rendering, GPT-4o judges whether the predicted final state satisfies the task and decides whether the flow must be re-predicted; second, it outputs which part of the object should be grasped according to the task description, assisting task-aware grasp pose selection.

Q4: Why can methods span embodiments?

Because the world model outputs the movement of objects in 3D space, not the specific actions of a certain robot. Different robots only need to implement the same object trajectory through their own IK, workspace bounds and optimizers.

Q5: What is the strongest evidence?

The cross-platform experiments and the ablations are the most critical evidence: Franka 67.5% and XTrainer 70.0% support the cross-embodiment claim, and removing large-scale pre-training drops performance from 70% to 30%, showing that ManiFlow-110k is the core ingredient.

Q6: What is the most likely place to be questioned?

Only 10 trials per task; reliance on GPT-4o; reliance on monocular depth and on the rigid, trackable-object assumptions. In real deployment, any of these links may become a source of failure.