EN 中文

Veo-Act: How Far Can Frontier Video Models Advance Generalizable Robot Manipulation?

Authors: Zhongru Zhang, Chenghan Yang, Qingzhou Lu, Yanjiang Guo, Jianke Zhang, Yucheng Hu, Jianyu Chen

Organization: Tsinghua University

Publication: arXiv preprint, 2026

arXiv: 2604.04502 | PDF: Download

1. Quick overview of the paper

One-sentence summary: This article studies whether cutting-edge video generation models such as Veo-3 can be used as high-level visual planners for robots, and proposes Veo-Act: use Veo-3 to generate future visual trajectories, use a multi-head inverse dynamics model to restore action and switching signals, and then switch to the VLA low-level strategy during the contact interaction stage.

2. Motivation

2.1 What problem should be solved?

The paper focuses on generalized robot operations in open environments: robots need to grasp and place in different object geometries, semantic instructions, and distractor scenarios. Existing VLA models rely on large-scale robot data, and action modalities need to be introduced when converting from VLM to VLA; the paper points out that this will make it difficult to completely retain the generalization knowledge of the pre-trained VLM.

The failure scenarios constructed in this paper include: the target object is not within the field of view of the wrist camera, there are interference objects with similar appearance, and non-target objects are located on the path of the robotic arm. These scenarios do not simply examine "whether it can be caught", but examine whether the strategy can still follow the target instructions under visual, semantic and path interference.

2.2 Limitations of existing methods

2.3 The solution ideas of this article

At the high level, Veo-3 generates a video trajectory to complete the task based on the initial image and language prompt; in the middle, the multi-head IDM converts the inter-frame changes into action clips and predicts the interaction gate; during low-level execution, the system follows the video planning action by default. When the gate indicates that it has entered the contact interaction stage, it switches to the VLA strategy to perform detailed operations, and then switches back to the remaining planning queue.

三种控制管线对比
Figure 1: Comparison of three control pipelines: VLA, Video model + IDM, and Veo-Act.

3. Summary of related work

3.1 Related work of the thesis self-description

Related work linesrepresentative workPositioning in the paper
Video generative models for policy learningSora, Veo, V-JEPA 2, Unified World Models, Vidar, VPP, AnyPos, TC-IDMVideo models contain physical priors such as object persistence, motion continuity, and coarse-grained collisions to generate visual plans; however, performance is constrained by video fidelity, especially in contact-rich dexterity operations.
Learning from Observation / IDMLfO, VPT, Vidar, AnyPos, TC-IDMIDM can map state or image transitions into actions, and can even be learned from unlabeled videos or play data; this article uses self-supervised random play to train multi-head IDM.
Vision-Language-Action ModelsRT-2, OpenVLA, $\pi_0$, $\pi_{0.5}$VLA is an important route for current general robot strategies, but transforming VLM into an action output model will bring knowledge retention issues; this article uses $\pi_{0.5}$ as the baseline and low-level executor.

3.2 Direct comparison with previous works

DimensionsVLA / $\pi_{0.5}$Video Model + IDM/VPPVeo-Act
Core ideaExport actions directly from images, words and states.A future video is generated and the action is restored by IDM or inverse model.The video model is responsible for high-level trajectories, VLA is responsible for contact interaction, and IDM outputs actions and switching gates at the same time.
key assumptionsThe trained action modality still retains the semantic generalization of VLM/VLA.The video generated trajectories are physically consistent enough that IDM can reliably restore motion.Video trajectories serve as task-level priors, but the exposure phase requires reactive low-level strategies.
Applicable scenariosClosed-loop operations with sufficient training coverage, visible targets, and weak semantic confusion.Simple interaction or coarse-grained task planning.Semantic or perspective confounder obvious while requiring deft pick-and-place grabbing.
Experimental performanceThe overall success of all simulation + real experiments is 102/228 = 0.45.As the video baseline VPP in simulation, several conditions are lower than Veo-Act.The overall success of all simulation + real experiments is 182/228 = 0.80.

4. Detailed explanation of method

4.1 Method overview

The data flow of Veo-Act is: initial observation $I_0$ and task prompt → Veo-3 generates future frame sequence $I^{*}_{0: n}$ → multi-head IDM performs inverse dynamics on adjacent frames, outputs action fragment $a^{*}_{0: n-1}$ and gate signal → smoother obtains $\bar{a}^{*}_{0: n-1}$ → gradually pops from the action queue during execution; if real-time gate $G_t$ If it continues to be higher than the threshold $\tau$, switch to the low-level VLA policy $\pi_{\mathrm{VLA}}$; after the gate is lowered, cut off the interaction interval and restore the planning queue.

Veo-Act pipeline
Figure 2: Hierarchical planning and control pipeline. The original PDF image has been converted to PNG.
三种 inference paradigms
Figure 3: Comparison of three execution trajectories of pure IDM, one-time switching, and complete Veo-Act.

4.2 Method evolution

VLA direct control → Input the current observation and language, and directly output the action; the problem is that it is easy to follow the wrong object or path by mistake under confounder.

Video model + IDM → First generate a video of the task completion, and then resume the action; the advantage is that the semantics and physical priors of the video model are retained, but the problem is that the contact control accuracy is insufficient.

Veo-Act → Use the video model to plan "where to go and where to pass", use VLA to complete "how to stabilize the grasp and release during contact", and use IDM gate to connect the two control modes.

4.3 Core design and mathematical derivation

Formula 1: Video model generates future visual trajectories.

$$I^{*}_{0: n}=\{I^{*}_1, I^{*}_2, \ldots, I^{*}_n\}$$

Here $I_0$ is the initial observation, $I^{*}_k$ is the $k$th synthesized future frame, and $n$ is the number of sampled frames. The function of this formula is not to directly control the robot, but to convert the language target into a visuo-motor prior that can be digested by IDM.

Formula 2: Multi-head IDM predicts actions and interaction gates simultaneously.

$$(a_t, G_t)=\pi_{\mathrm{IDM}}(I_{t-1}, I_t, s_t)$$

$I_{t-1}, I_t$Two adjacent frames of images represent changes in visual status.
$s_t$Robot state characteristics; 21-dimensional single-arm state was recorded in the experiment.
$a_t$The recovered executable action; running the entire generated video can result in $a^{*}_{0: n-1}$.
$G_t\in[0, 1]$The output of the interaction detector, indicating whether the current should be handed to the lower VLA.

Intuitively, the action head is responsible for "how the robot should move between these two frames", and the gate head is responsible for "whether it has entered the grabbing/contact stage that requires reactive strategies." It is divided into two MLP heads because the numerical distributions of action regression and binary classification gate are different.

Formula 3: IDM training objectives.

$$\mathcal{L}=\lambda_{\text{act}}\mathcal{L}_{\text{act}}(a_t, \hat a_t)+\lambda_{\text{gate}}\mathcal{L}_{\text{gate}}(G_t, \hat g_t)$$

The text states that the action head uses Huber loss and the gate head uses Binary Cross Entropy. The appendix further gives the IDM training target: $\mathcal{L}_{\mathrm{IDM}}=\lambda_{\mathrm{act}}\sum_d d(a_{t, d}, \hat a_{t, d})+\lambda_{\mathrm{gate}}\mathrm{BCE}(G_t, \hat g_t)$, where $d(\cdot)$ is the weighted Smooth L1.Appendix: IDM Training

Supplementary derivation: Why weighted Smooth L1 is suitable for motion regression
When the error is $|x-\hat{x}|<\beta$, Smooth L1 uses the quadratic term $0.5w(x-\hat{x})^2/\beta$, and the gradient changes linearly with the error, which is conducive to small error refinement; when the error is large, the loss becomes $w(|x-\hat{x}|-0.5\beta)$, and the gradient amplitude is controlled, reducing the dominance of abnormal action samples on training. The appendix of the paper sets $\beta$ to 0.1, and gives a weight of 2 to dimensions 4 and 11, and a weight of 1 to the remaining dimensions.
Formula 4: Action smoothing and queue execution.

$$\bar a^{*}_{0: n-1}=\mathrm{Smoother}(a^{*}_{0: n-1}), \qquad \mathcal Q=\{\bar a^{*}_1, \ldots, \bar a^{*}_m\}$$

The default action during execution is $a_t=\bar a^{*}_{k+1}$. The appendix specifies the smoother as follows: detecting local extreme key points in key dimensions, and doing centered moving-average in the adjacent key point interval; key points remain unchanged; key point actions can be held for an additional $H$ step; and finally, lower bound clamp, $m=0.13$ is made for the specified dimension $d_c=2$.Appendix: Action Smoother

IDM training
Figure 4: Multi-head IDM training process.
smoother
Appendix figure: Example of how smoother handles an action dimension.

4.4 Implementation Points (For reproducibility)

moduleImplementation detailsSource
Video promptprompt clearly requires the robot to distinguish between target objects and similar objects, non-target objects to remain motionless, the target to remain in place before being grabbed, the viewing angle to remain unchanged, the movement to be smooth and natural, and the container not to move.Appendix: Video-Generation Prompt Construction
visual encoderDINOv3 ViT-B/16, embedding 768, depth 12, heads 12, patch size 16, using patch token features of two frames to splice into 1536 dimensions.Appendix: IDM Training
IDM dataThe main text contains 300k simulated frame-pairs, plus 100k simulated random-motion and 150k real samples; the appendix writes about using 550k frame-pairs for training.Text §5.1 + Appendix
noise enhancementRunning Stem-OB inversion on the RGB trace produces noisy stem observations: 50 denoising steps, 10 inversion steps.Appendix: IDM Training
switching logicMaintain gate history $\mathcal H_G$; enable VLA after stable_high; close VLA after stable_low, and use truncate/drop operations to skip the queue segment corresponding to the high gate.Appendix: Algorithmic Variants
Hierarchical Veo-Act inference
1. reset environment and sample task
2. build video-generation prompt from task
3. generate future frames I*_{0:n} with Veo-3
4. run IDM.action over generated frames to get a*_{0:n-1}
5. smooth actions and enqueue them into Q
6. for each real timestep:
   - observe current image/state
   - compute gate G_t = IDM.gate(I_{t-1}, I_t)
   - if gate stays low: execute next planned action from Q
   - if gate stays high: execute low-level VLA action
   - if gate returns low: prune high-gate queue segment and resume plan

5. Experiment

5.1 Experimental setup

Projectcontent
platform7-DoF robotic arm + 12-DoF dexterous hand; two RGB cameras: global camera and wrist camera. Video generation and IDM use the global camera; low-level policies can use the wrist camera after switching.
SimulationUse IsaacLab to build high-fidelity simulation environments that correspond to real platforms.
TaskGrab the specified target object and put it into the specified container.
Simulation founderInvisible wrist camera, interference from similar objects, pass-by interaction.
real robot cofounderSimilar object interference, pass-by interaction, and more complex semantic instructions.
Baselines$\pi_{0.5}$ is used as VLA baseline; the video method VPP is also compared in the simulation. Veo-Act uses $\pi_{0.5}$ as the low-level policy.
code repositorySearching by "Veo-Act code github" and "2604.04502 github", no clear official repository was found.

training configuration

Hyperparameters/ConfigurationvalueSource
IDM training sample550k frame-pair samplesAppendix: IDM Training
IDM training iteration85, 000 iterationsAppendix: IDM Training
IDM hardware4 NVIDIA Ampere-series 80 GPUs, ~10 hoursAppendix: IDM Training
IDM batch size8 per GPU, total batch size 32Appendix: Table IDM Training
IDM optimizerAdamW, $\beta=(0.9, 0.999)$, $\epsilon=0.01$, weight decay 0.01Appendix: Table IDM Training
IDM learning rateDINO: $5\times10^{-5}$; other modules: $5\times10^{-4}$Appendix: Table IDM Training
IDM schedulerCosine scheduler, warmup 8, 500 stepsAppendix: Table IDM Training
$\pi_{0.5}$ trainingbatch size 32, initial LR $2.5\times10^{-5}$, training 40K iterations, using official implementation and LR schedulerAppendix: Pi0.5 Training

5.2 Main results

settingsBaseline overallVeo-Act overallThe conclusion given by the paper
Simulation: wrist-camera invisible / Experimental10/30 = 0.3320/30 = 0.67Veo-Act improves to ~2x when the target is out of view of the wrist camera.
Simulation: similar-object distractors / Experimental12/30 = 0.4028/30 = 0.93Veo-Act alleviates semantic confusion of similar objects.
Simulation: pass-by interaction / Experimental0/30 = 0.0014/30 = 0.47Baseline nearly fails, Veo-Act regains partial mission capability.
True: similar-object distractors8/16 = 0.5012/16 = 0.75Improvements are also achieved on real platforms.
True: pass-by interaction2/13 = 0.1511/13 = 0.85Overall success increased by 5.7 times.
Reality: richer semantics2/19 = 0.1115/19 = 0.79The overall success under complex semantic instructions is increased by 7.2 times.

Paper summary: Under the Experimental conditions of the simulation experiment, the baseline overall success is 22/90 = 0.24, and the Veo-Act is 62/90 = 0.69; the summary of all overall successes in simulation + real life, the baseline is 102/228 = 0.45, and the Veo-Act is 182/228 = 0.80.

simulation results
Simulation success rate: yellow is instruction-following, green is overall.
real robot results
Real robot success rate.

5.3 Ablation experiment

VariantsInstruction-followingOverallVerification purpose
ResNet backbone22/30 = 0.7317/30 = 0.57Testing whether DINOv3 visual representation is important.
Without noise22/30 = 0.7316/30 = 0.53Testing the effect of STEM-OB noise enhancement on robustness.
Single head25/30 = 0.8317/30 = 0.57Testing the impact of a single action head and not learning the interaction detector separately.
Ours25/30 = 0.8320/30 = 0.67The multi-head design improves overall success without compromising instruction-following.
ablation
Ablation results on wrist camera invisible settings.

5.4 Supplementary experiments (from appendix)

invisible controlinvisible ours
Appendix setting 1: control and ours for invisible object condition.
pass by controlpass by ours
Appendix Setting 2: pass-by interaction.
similar controlsimilar ours
Appendix Setting 3: similar-object distractors.

6. Analysis and Discussion

6.1 Analysis and explanation of the results given in the paper

6.2 Limitations of the author's statement

It is clearly pointed out in the text and conclusion of the paper that the current video model cannot accurately complete most low-level contact-rich operations: Veo-3 + IDM can generate generally correct trajectories, but the low-level control accuracy is insufficient, especially in the physical contact stage. The paper also points out that this route is constrained by the fidelity of the underlying video generation model, and the limitations are more obvious in high-dimensional action spaces and multi-point contact dynamics scenarios such as dexterous operation.

The appendix failure analysis is included here: failures mainly correspond to target identification, path interference, contact execution and low-level switching in the confounded setting. Since the paper does not expand these failures into new method suggestions in the main text, this report only objectively records them as applicable boundaries.

6.3 Applicable boundaries and discussions clearly stated in the paper

Self-acceptance

Completed: Analyze the title, author, abstract, text structure, methods, experiments and appendices.

Completed: The prompt, algorithm variants, pure IDM, failure analysis, smoother, IDM/$\pi_{0.5}$ training details in the appendix have been integrated into the corresponding chapters.

Completed: Standalone PNG/JPG image copied; main PDF image converted to PNG and embedded.

Note: No clear official code repository was found; the report has been marked according to the search results.