
Robotic Manipulation by Imitating Generated Videos Without Physical Demonstrations

Reading Report: RIGVid asks a direct but bold question: can a robot complete manipulation tasks in the real world without watching any real human or robot demonstration, using only a task video synthesized by a video generation model?

arXiv: 2507.00990v2 · CVPR 2025 · Robot Manipulation · Video Generation · 6D Pose Tracking
Authors: Shivansh Patel, Shraddhaa Mohan, Hanlin Mai, Unnat Jain, Svetlana Lazebnik, Yunzhu Li
Institutions: UIUC, UC Irvine, Columbia University
Project page: https://rigvid-robot.github.io/
Local output: Report/2507.00990/

1. Quick overview of the paper

What problem does the paper address? Traditional video-based robot imitation methods rely on real human videos, robot demonstrations, or offline robot data. RIGVid sets out to verify that, given only the initial RGB-D scene and a language command, a generated video can serve as the sole supervision for a real robot to perform manipulation tasks such as pouring water, lifting a lid, placing a spatula, and sweeping garbage.
The authors' approach: convert the generated video into an executable 6D object-pose trajectory. The pipeline is: generate a task video and use a VLM to filter out failed generations; estimate per-frame depth and align its scale/offset against the initial ground-truth depth; track the active object's 6D pose with FoundationPose; after grasping the object, retarget the object-pose trajectory to an end-effector trajectory, with closed-loop tracking to recover from disturbances.
Most important results: filtered Kling v1.6 videos achieve performance similar to real human videos on four real tasks; RIGVid reaches an average success rate of 85%, above ReKep's 50%, Gen2Act's 67.5%, AVDC's 32.5%, 4D-DPM's 35.0%, and Track2Act's 7.5%.
Things to note when reading: the key to the result is not that the video generation model "directly understands robots", but that the authors compress the generated video into an object-level SE(3) trajectory and back it with substantial engineering: real RGB-D input, an object mesh, 6D tracking, VLM filtering, and closed-loop execution. Do not read this as pure prompt-to-action, and do not overlook the two main bottlenecks of depth estimation and mesh pre-construction.
RIGVid overview
Figure 1: RIGVid overview. The language command and initial RGB-D scene enter the video generation model; the robot execution trajectory is then obtained via depth estimation, object 6D pose tracking, and trajectory retargeting.

2. Background and problem setting

Core problem

The core problem can be stated as: given the initial scene image \(I_0\), the initial depth \(D_0\), and a language command \(c\), directly predict the robot's 6-DoF end-effector trajectory \(\{\mathbf{T}^{WE}_t\}_{t=1}^T\) without any real demonstration or task-specific policy training. The intermediate representation the authors choose is not action tokens or keypoint constraints, but the object's 6D pose trajectory \(\{\mathbf{T}^{WO}_t\}_{t=1}^T\) in world coordinates.

This recasts the problem from "generating robot motions" to "recovering how the object moves from a generated video". The reframing matters because current video generation models produce rich visual sequences but no executable actions, while robot execution requires geometrically consistent trajectories.

Differences from existing work

3. Method details

3.1 Video generation and VLM filtering

The inputs are the initial RGB image, its corresponding depth, and a free-form language command. The authors use an image-to-video generative model to produce candidate task videos. Because a generated video may fail to execute the command, swap objects, violate physics, or remain static, the authors apply a GPT-4o/o1-class VLM as an automatic filter: uniformly sample 4 frames from the video, splice them vertically into a video summary, and ask the VLM whether the video shows the given action being completed. Failed videos are regenerated for up to 5 attempts; if all attempts fail, the last generated video is used.
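A minimal sketch of this generate-filter loop; `generate_video` and `ask_vlm` are hypothetical callables standing in for the image-to-video model and a GPT-4o/o1-style multimodal API, not the authors' code:

```python
import numpy as np
from PIL import Image

def summarize_video(frames, n=4):
    """Uniformly sample n frames (HxWx3 uint8) and splice them vertically into one image."""
    idx = np.linspace(0, len(frames) - 1, n).astype(int)
    return Image.fromarray(np.concatenate([frames[i] for i in idx], axis=0))

def generate_until_pass(command, image, generate_video, ask_vlm, max_tries=5):
    """Regenerate until the VLM accepts the video; fall back to the last attempt."""
    for _ in range(max_tries):
        video = generate_video(image, command)
        summary = summarize_video(video)
        if ask_vlm(summary, f"Does this video show '{command}' being completed?"):
            return video          # accepted by the filter
    return video                  # all attempts failed: use the last generation
```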

The filtering is not a decorative module. Experiments show that Sora fails the filter on 100% of these tasks; Kling v1.6 is better, but its pass rates for spatula placement and garbage sweeping are only 55% and 45%. Without filtering, the subsequent geometric tracking would "faithfully" turn an erroneous video into an erroneous trajectory.

3.2 Depth estimation and scale alignment

The generated video is RGB-only, while RIGVid needs per-frame depth to recover 3D poses. The authors predict depth with a monocular depth estimator, but monocular depth carries scale and offset ambiguities. They therefore fit an affine transformation from the predicted first-frame depth \(\hat{D}^{mono}_0\) to the ground-truth initial depth \(D_0\) within the first frame's active-object region:

$$D_t = a \hat{D}^{mono}_t + b$$

The \(a, b\) fitted on the first frame are then applied to every frame of the video. This design uses the real RGB-D initial observation as a "scale anchor"; without it, the 3D motion in the generated video could not be placed in real robot coordinates.
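A minimal sketch of this fit, assuming depth maps are float numpy arrays and `mask0` is the boolean first-frame active-object mask (names are mine, not the authors'):

```python
import numpy as np

def fit_affine(d_mono0, d_true0, mask0):
    """Least-squares fit of D_0 ~ a * D_mono_0 + b inside the object mask."""
    x = d_mono0[mask0]
    y = d_true0[mask0]
    A = np.stack([x, np.ones_like(x)], axis=1)
    (a, b), *_ = np.linalg.lstsq(A, y, rcond=None)
    return a, b

def align_video_depth(mono_depths, a, b):
    """Apply the first-frame fit to every predicted frame: D_t = a * D_mono_t + b."""
    return [a * d + b for d in mono_depths]
```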

3.3 Active object recognition and 6D pose trajectory

The system needs to know which object is being manipulated. The authors first have GPT-4o determine the active object category from the command, then use Grounding DINO to produce a bounding box and SAM-2 to refine it into a mask. FoundationPose, combined with the aligned depth, then tracks the object's 6D pose trajectory.

FoundationPose is a model-based tracker and requires an object mesh. The authors pre-capture a short RGB-D video of the object being rotated and reconstruct the mesh with BundleSDF. An appendix note: BundleSDF can also do mesh-free joint reconstruction and tracking, but the official implementation takes about 30 minutes per video and is unsuitable for a real-time closed loop.
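A minimal sketch of this grounding-and-tracking chain; `name_object`, `detect_box`, `segment_mask`, and `track_6d` are hypothetical stand-ins for GPT-4o, Grounding DINO, SAM-2, and FoundationPose, whose real APIs differ:

```python
def track_active_object(command, frames, depths, mesh, cam_K,
                        name_object, detect_box, segment_mask, track_6d):
    obj = name_object(command)               # GPT-4o: e.g. "kettle" for a pouring command
    box = detect_box(frames[0], prompt=obj)  # Grounding DINO: coarse 2D bounding box
    mask = segment_mask(frames[0], box)      # SAM-2: refined segmentation mask
    # Model-based 6D tracking over the whole video using the object mesh and the
    # scale-aligned depth; the fixed camera extrinsics map the resulting
    # camera-frame poses into world coordinates.
    return track_6d(frames, depths, mask, mesh, cam_K)
```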

3.4 Trajectory retargeting to the robot

Grasping is done by AnyGrasp. After the grasp, the authors assume the rigid transformation between the end effector and the object stays fixed. If \(\mathbf{T}^{WO}_t\) is the object's pose in world coordinates and \(\mathbf{T}^{EO}\) is the fixed end-effector-to-object transform after grasping, the target end-effector trajectory is:

$$\mathbf{T}^{WE}_t = \mathbf{T}^{WO}_t(\mathbf{T}^{EO})^{-1}$$

The exact coordinate conventions may vary across implementations, but the core idea is not to predict actions: a fixed grasp transform converts the object trajectory into an end-effector trajectory. This is why the method transfers to ALOHA or a dual-arm setup: switching robots mainly means changing the end-to-object transform.
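A minimal sketch of the retargeting with 4x4 homogeneous matrices in numpy; `T_WO` is the tracked object trajectory and `T_EO` the end-to-object transform recorded at the grasp moment (names are mine, not the authors'):

```python
import numpy as np

def retarget(T_WO, T_EO):
    """Object trajectory -> end-effector trajectory: T_WE_t = T_WO_t @ inv(T_EO)."""
    T_EO_inv = np.linalg.inv(T_EO)
    return [T @ T_EO_inv for T in T_WO]

# At the grasp moment, T_EO follows from the current end-effector and object
# poses and is then held fixed: T_EO = inv(T_WE_grasp) @ T_WO_grasp
```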

Retargeting
Figure 2: Retargeting the object's 6D pose trajectory to the robot's end-effector trajectory. Orange is the object trajectory; blue is the robot execution trajectory.

3.5 Closed-loop execution and disturbance recovery

During execution, the system keeps tracking the object's 6D pose in real time. If the current object pose deviates from the precomputed trajectory by more than 3 cm or 20 degrees, the robot backs up to the last successfully reached trajectory point and resumes execution. This closed-loop mechanism lets RIGVid handle real-world disturbances such as a person pushing the arm or the object slipping after the grasp.
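A minimal sketch of the deviation check with the paper's 3 cm / 20 degree thresholds; the geodesic-angle rotation error is a standard choice, not something the paper specifies:

```python
import numpy as np

def deviation(T_current, T_target):
    """Translation (m) and rotation (deg) error between two 4x4 poses."""
    dt = np.linalg.norm(T_current[:3, 3] - T_target[:3, 3])
    R = T_current[:3, :3].T @ T_target[:3, :3]           # relative rotation
    cos = np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)
    return dt, np.degrees(np.arccos(cos))

def needs_recovery(T_current, T_target, t_max=0.03, r_max=20.0):
    dt, dr = deviation(T_current, T_target)
    return dt > t_max or dr > r_max   # true -> back up to the last good waypoint
```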

Robustness
Figure 3: Example of disturbance recovery. After detecting object deviation, the robot moves back and realigns with the trajectory.

4. Experiments and results

4.1 Task settings

The experiments use an xArm7 arm and a fixed Orbbec Femto Bolt RGB-D camera. The four main tasks cover a range of difficulties: pouring water, lifting a pot lid, placing a spatula in a pan, and sweeping garbage into a dustpan. Success is judged manually against per-task criteria, and all baselines use the same batch of generated videos.

Evaluation tasks
Figure 4: The four real-world manipulation tasks, with increasing difficulty ranging from smooth motion and depth changes to thin-object occlusion and fine contact.

4.2 Video generation quality and filtering

| Video source | Authors' observation | Impact on robot execution |
| --- | --- | --- |
| Sora | Visually appealing, but the camera, object size, object identity, and scene layout are often changed; the filter pass rate is 0% on all four tasks. | Not suitable for direct imitation; unfiltered execution success rate is 0%. |
| Kling v1.5 | More consistent with the language and scene, but still physically implausible at times, e.g. water flowing out of the top of a pot, or the action never happening. | Better than Sora, but less stable the harder the task. |
| Kling v1.6 | Best command following and physical plausibility; filter pass rates are 83% (pouring water), 66% (lifting the lid), 55% (placing the spatula), and 45% (sweeping garbage). | After VLM filtering, the generated videos perform close to real demonstration videos. |
Video generation comparison
Figure 5: Comparison of generation quality between Sora, Kling v1.5, and Kling v1.6. The paper finds Kling v1.6 the most suitable for current robot imitation tasks.

4.3 Can generated videos replace real videos?

The authors compare unfiltered Sora, unfiltered Kling v1.5, unfiltered Kling v1.6, filtered Kling v1.6, and real human videos. Filtering significantly raises Kling v1.6's execution success rates: pouring water from 80% to 100%, lifting the lid from 60% to 80%, placing the spatula from 50% to 90%, and sweeping garbage from 20% to 70%. The paper's key conclusion is that, after filtering, current strong video generation models can already serve as an effective source of visual demonstrations.

Performance vs video quality
Figure 6: Robot performance improves with video quality; filtered Kling v1.6 approaches real-video performance.

4.4 Comparison with VLM keypoint/constraint method

ReKep represents the route of "have the VLM directly generate sparse keypoint relational constraints". The average success rate is 85% for RIGVid versus 50% for ReKep. The authors' explanation: although video is an expensive representation, it preserves continuous visual detail throughout the task; compact representations like ReKep's are error-prone in local details such as grasp points, motion constraints, and pouring constraints.

RIGVid vs ReKep
Figure 7: Success-rate advantage of RIGVid over ReKep. The appendix gives examples of ReKep keypoint failures on the water-pouring task.

4.5 Comparison with trajectory extraction baselines

| Method | Intermediate representation | Average success rate | Main failure modes |
| --- | --- | --- | --- |
| Track2Act | 2D point tracks between the initial and goal images | 7.5% | Predicted tracks do not follow real object motion; the initial and final images alone cannot recover the full process. |
| AVDC | Optical flow over the entire video | 32.5% | Frame-by-frame flow errors accumulate, causing object position drift. |
| 4D-DPM | 3D feature field / Gaussian field | 35.0% | Tracking is unstable and jittery, especially under large rotations of a single object. |
| Gen2Act (adapted) | Sparse tracks on the generated video + PnP | 67.5% | Occlusion and large rotations lose visible points, making PnP unstable. |
| RIGVid | FoundationPose object-level 6D pose trajectory | 85.0% | Remaining failures mostly come from monocular depth estimation and occasional grasp slippage. |
Main baseline comparison
Figure 8: RIGVid is the strongest overall across the four tasks, with a larger advantage on harder tasks.

4.6 Generalization and extension

The authors demonstrate three kinds of generalization. First, pouring water on ALOHA succeeds 80% of the time versus 100% on the default xArm setup, with the degradation mainly attributed to harder ALOHA camera calibration. Second, the dual-arm ALOHA can place a pair of shoes into a box. Third, the method extends to wiping, stirring, ironing, righting a ketchup bottle, unplugging a charger, rotating a spoon to pour beans, and more. These results are qualitative or preliminary, but they show that object-centric retargeting does have cross-embodiment potential.

Demos and embodiment transfer
Figure 9: Examples on ALOHA, dual arms, and more open-ended tasks. These should be read as a demonstration of the capability boundary, not as quantitative conclusions at the same level as the main experiments.

5. Appendix key information

Video generation practice

The appendix summarizes the conditions for more reliable video generation: a clean background, few distractors, sufficiently large objects, a viewpoint close to a natural human perspective, a single clear task, and concise prompts; use a relevance factor of 0.7 and add the negative prompt "fast motion". These details make clear that the results do not transfer effortlessly to arbitrary tabletop scenes.

Filtering prompts and filtering metrics

The authors use GPT o1/4o on the 4-frame spliced image to judge whether the video completes the command. Compared with VBench++'s video-text consistency and I2V subject consistency, VLM filtering correlates best with human judgment: the per-task correlation coefficients are 0.91, 0.91, 0.91, and 0.66, averaging 0.84. The VBench++ metrics only roughly reflect video quality and cannot reliably determine whether the task was actually completed.

Prompting example
Figure 10: Example of the VLM filtering prompt from the appendix. The filter takes the video-frame summary and the language command as input and outputs success or failure.

Depth estimation error is the main bottleneck

On filtered Kling v1.6 videos, failures mainly come from monocular depth estimation, aside from one grasp slip. The appendix further isolates the factors: real video + real depth succeeds 100% of the time; real video + predicted depth, 85%; Kling-generated video + predicted depth, also 85%. Video generation itself is therefore not the bottleneck; depth estimation error directly contaminates the 6D pose trajectory.

Depth ablation
Figure 11: Impact of depth estimation on execution success rate. Ground-truth depth is significantly more stable.

MegaPose vs FoundationPose

The authors compare pose-trajectory jitter: MegaPose has an average translation RMS jitter of 0.0045 m and rotation RMS jitter of 37.47 degrees, versus 0.0029 m and 14.31 degrees for FoundationPose. FoundationPose is smoother and runs in real time, making it the method of choice.
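The paper does not spell out its jitter formula; a plausible reading, used in this hypothetical sketch, is the RMS of frame-to-frame translation distances and geodesic rotation angles along the pose trajectory:

```python
import numpy as np

def rms_jitter(poses):
    """poses: list of 4x4 transforms -> (translation RMS in m, rotation RMS in deg)."""
    dts, drs = [], []
    for T0, T1 in zip(poses[:-1], poses[1:]):
        dts.append(np.linalg.norm(T1[:3, 3] - T0[:3, 3]))
        R = T0[:3, :3].T @ T1[:3, :3]                        # relative rotation
        cos = np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)
        drs.append(np.degrees(np.arccos(cos)))
    rms = lambda v: float(np.sqrt(np.mean(np.square(v))))
    return rms(dts), rms(drs)
```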

Why do point-trajectory methods fail?

The appendix shows that even when Gen2Act uses BootsTAP or CoTracker, visible surface points get occluded under large rotations, 2D-3D correspondences become insufficient, and PnP drifts or jumps. RIGVid relies on a complete object model and SE(3)-level tracking, so it is more stable under large rotations, thin objects, and occlusion.

Point tracking limitation
Figure 12: Point tracks lose correspondence when the object rotates significantly, causing pose estimation to fail.

6. Key points of reproducibility and implementation

Inputs and dependencies

Minimal reproduction path

  1. Fix the task scene, capture the initial RGB-D image, and pre-build a mesh for the active object.
  2. Generate up to 5 candidate videos from the language command and initial image, and let the VLM filter for a successful one.
  3. Estimate depth for the generated video and affine-align it to the true depth within the first frame's active-object mask.
  4. Estimate the object's 6D pose in every frame with FoundationPose and smooth the pose trajectory.
  5. Grasp the active object with AnyGrasp and record the fixed end-to-object transform at the grasp moment.
  6. Retarget the object trajectory to the end-effector trajectory; during execution, track the object in real time and recover from deviations (see the end-to-end sketch after this list).
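As an end-to-end summary, a sketch tying the six steps together. It reuses the helpers sketched in Section 3; `generate_video`, `ask_vlm`, `estimate_mono_depth`, `object_mask`, `track_object_6d`, and the `robot` interface are hypothetical stand-ins, not the authors' code:

```python
# Hypothetical driver for the minimal reproduction path; every helper is
# either one of the earlier sketches or an assumed stand-in.
def rigvid_episode(rgb0, depth0, command, mesh, cam_K, robot):
    video = generate_until_pass(command, rgb0, generate_video, ask_vlm)  # steps 1-2
    mono = [estimate_mono_depth(f) for f in video]                       # step 3
    a, b = fit_affine(mono[0], depth0, object_mask(rgb0, command))
    depths = align_video_depth(mono, a, b)
    T_WO = track_object_6d(video, depths, mesh, cam_K)                   # step 4
    T_EO = robot.grasp_and_record_transform(T_WO[0])                     # step 5 (AnyGrasp)
    for T_WO_t, T_WE_t in zip(T_WO, retarget(T_WO, T_EO)):               # step 6
        robot.move_to(T_WE_t)
        while needs_recovery(robot.tracked_object_pose(), T_WO_t):
            robot.back_up_one_waypoint()                                 # closed-loop recovery
            robot.move_to(T_WE_t)
```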
The steps with the highest reproduction risk are the video generation API/model version, object mesh quality, camera calibration, monocular depth stability, and FoundationPose real-time performance. The paper's conclusion is best read as: in a fairly controlled tabletop setting, a strong generative model plus strong geometric tracking can close the loop.

7. Analysis, Limitations and Boundaries

The most valuable part of this paper

It takes a question that easily stays at the conceptual level, "can generated videos serve as robot supervision", and lands it on real robots, real tasks, and multiple baselines. The most valuable contribution is not showing that Kling v1.6 alone is good, but demonstrating an executable decomposition: the generative model provides a visual prior of the task process, the VLM filters out obviously failed samples, 6D tracking converts the visual process into geometric trajectories, and closed-loop control handles real-world disturbances. This chain connects, for the first time, the visual knowledge of generative models solidly to physical execution.

Why does the result stand?

Main limitations

Questions to ask during reading group meetings