
S-VAM: Shortcut Video-Action Model by Self-Distilling Geometric and Semantic Foresight

Authors: Haodong Yan, Zhide Zhong, Jiaguan Zhu, Junjie He, Weilin Yuan, Wenxuan Song, Xin Gong, Yingjie Cai, Guanyi Zhao, Xu Yan, Bingbing Liu, Ying-Cong Chen, Haoang Li

Organization: The Hong Kong University of Science and Technology (Guangzhou); Huawei Foundation Model Department

Publication: arXiv preprint, 2026

arXiv: 2603.16195 | Project: haodong-yan.github.io/S-VAM | Code: github.com/Haodong-Yan/S-VAM-Code

1. Quick overview of the paper

One-sentence summary: S-VAM addresses the tension in video-action models between multi-step video generation (high-quality foresight but too slow) and single-step diffusion features (fast but noisy/entangled): it uses self-distillation to compress the VFM geometric/semantic representations of multi-step generated videos into shortcut foresight that can be predicted in a single forward pass, and then feeds this foresight to the action expert to output actions.

Difficulty rating: ★★★★☆. Requires familiarity with VLA/VAM, Stable Video Diffusion, diffusion-model denoising dynamics, Vision Foundation Model representations, token condensation (QFormer/Perceiver), and diffusion policy.

Keywords: Video-Action Model, Self-Distillation, Geometric Foresight, Semantic Foresight, Vision Foundation Models, Diffusion Policy.

| Reading positioning question | Answer |
| --- | --- |
| What problem does the paper solve? | Existing VAMs either rely on slow multi-step video generation or use noisy one-step diffusion features, making it difficult to satisfy real-time control and high-quality future prediction at the same time. |
| The author's approach | Extract VFM teacher targets from the multi-step videos the diffusion model generates itself, then train lightweight geometric/semantic decouplers to predict these teacher representations directly from single-step denoising features. |
| Most important results | Average sequence length of 4.16 on CALVIN and average success rate of 72.8% on MetaWorld; on the real dual-arm Cobot, S-VAM outperforms VPP on four tasks, with an effective control frequency of 25 Hz at an action chunk length of 8. |
| Things to note when reading | The core is not simply injecting VFM features into the policy, but taking teacher targets from multi-step generated videos on the same diffusion trajectory, avoiding the trajectory mismatch between GT future frames and one-step features. |

Core contribution list

2. Motivation

2.1 What problem should be solved?

VLA models usually attach an action head to a pre-trained VLM and fine-tune it on robot action data. The problem is that VLMs come mainly from static image-text pre-training and lack the spatiotemporal foresight required for physical interaction; learning dynamics entirely from robot action data would be very costly.

The VAM route instead uses a video diffusion model to generate a visual plan and lets the action expert predict controls conditioned on that plan. This exploits dynamics priors from Internet videos and reduces reliance on robot action data.

Figure (teaser). Motivation and overview: the efficiency vs. foresight trade-off of existing VAMs and the shortcut idea of S-VAM.

2.2 Limitations of existing methods

2.3 The solution ideas of this article

This paper treats the stable, structured representations of slowly generated multi-step videos as the teacher and fast single-step denoising features as the student input. The geometric branch learns DPAv3 representations and the semantic branch learns DINOv2 representations. At inference, instead of running full multi-step video generation, a single denoising feature extraction plus the decouplers is enough to obtain geometric and semantic foresight usable by the action expert.

4. Detailed explanation of method

4.1 Method overview

The S-VAM pipeline is: take the current observation $I$ and task description $P$ as input; SVD extracts multi-layer up-sampling features in the first denoising step; the geometric and semantic decouplers map these noisy/entangled features to DPAv3-like and DINOv2-like future representations, respectively; the Uni-Perceiver aggregates the two kinds of foresight together with the original diffusion features into compact tokens; a diffusion policy outputs action sequences conditioned on these tokens and the text embedding.

Figure (method overview). S-VAM architecture: the core is a shortcut from one-step diffusion features to geometric/semantic foresight; actions are then generated by the Uni-Perceiver and the diffusion policy.
```text
Training stage 1: fine-tune the SVD backbone when needed
Training stage 2: freeze SVD
    generate multi-step videos V_hat
    extract teacher targets:
        Y_geo = DPAv3(Interpolate(V_hat))
        Y_sem = DINOv2(Interpolate(V_hat))
    extract one-step denoising features F
    train geo/sem decouplers: F -> Y_geo, Y_sem
Training stage 3: freeze SVD + decouplers
    C = Concat(F_geo, F_sem, F_raw)
    F_agg = UniPerceiver(C)
    train the diffusion policy to denoise action sequences
Inference:
    one SVD denoising forward pass -> F
    decouplers -> geometric/semantic foresight
    Uni-Perceiver + diffusion policy -> action chunk
```

4.2 Method evolution

| Stage | Form | Improvement motivation |
| --- | --- | --- |
| Direct VLA | Map the current image/language directly to actions. | Lacks explicit spatiotemporal foresight; high robot-data requirements. |
| VAM with multi-step video generation | A video diffusion model generates future visual plans in multiple steps, then actions are predicted. | Foresight is high-fidelity, but multi-step denoising makes inference slow. |
| One-step VAM | Guide actions with single-step diffusion internal features. | Good real-time performance, but the features are noisy/entangled. |
| S-VAM | Self-distill VFM representations of multi-step generated videos into single-step features. | Preserves single-step efficiency while obtaining more stable geometric/semantic future representations. |

4.3 Core design and mathematical derivation

4.3.1 Stable Video Diffusion Basics

SVD generates video by stepwise denoising from noise in latent space; multi-step sampling is high quality but slow.
$$z_{s-1}=\frac{1}{\sqrt{\alpha_s}}\left(z_s-\frac{1-\alpha_s}{\sqrt{1-\bar{\alpha}_s}}\epsilon_\theta(z_s, s, P, I)\right)+\sigma_s\epsilon.$$

$z_s$ is the latent of step $s$, $\epsilon_\theta$ is the noise prediction network, and the conditions are observation frame $I$ and task description $P$. S-VAM does not perform full multi-step sampling at control time, but instead uses first-step denoising features.
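For concreteness, here is a minimal sketch of one such denoising update, written directly from the formula above; the schedule scalars and the predicted noise are passed in, so this is illustrative rather than SVD's actual sampler.

```python
import torch

def denoise_step(z_s, eps_pred, alpha_s, alpha_bar_s, sigma_s):
    """One DDPM-style update z_s -> z_{s-1}, matching the equation above.
    eps_pred stands for epsilon_theta(z_s, s, P, I); schedule terms are scalars."""
    mean = (z_s - (1 - alpha_s) / (1 - alpha_bar_s) ** 0.5 * eps_pred) / alpha_s ** 0.5
    return mean + sigma_s * torch.randn_like(z_s)  # add fresh noise except at the last step
```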

4.3.2 VFM teacher target

The teacher target comes from the video that the diffusion model itself generates via multi-step sampling, rather than from ground-truth future frames.
$$Y=\Phi(\operatorname{Interpolate}(\hat{V})), \qquad Y\in\mathbb{R}^{T\times C_{\mathrm{VFM}}\times h\times w}.$$

$\hat{V}$ is the SVD multi-step generated video; $\Phi$ is a frozen VFM encoder; interpolation aligns the spatial resolution. The paper uses DPAv3 as the geometric teacher and DINOv2 as the semantic teacher.
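A minimal sketch of extracting such a teacher target with a frozen VFM encoder. `vfm_encoder` stands in for DPAv3 or DINOv2 and is assumed to return a dense feature map of shape (T, C_VFM, h, w); `size` is whatever input resolution the encoder expects.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()  # the teacher is frozen; no gradients flow into the VFM
def vfm_teacher_target(v_hat, vfm_encoder, size):
    """Y = Phi(Interpolate(V_hat)) for a generated video v_hat of shape (T, 3, H, W)."""
    frames = F.interpolate(v_hat, size=size, mode="bilinear", align_corners=False)
    return vfm_encoder(frames)  # (T, C_VFM, h, w)
```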

4.3.3 One-step denoising feature aggregation

Single-step features come from the multi-layer up-sampling blocks of the first denoising step and need to be aligned to a common resolution and concatenated.
$$F'_l=\operatorname{Interpolate}(F_l), \qquad F'_l\in\mathbb{R}^{T\times C_l\times h\times w}.$$ $$F=\operatorname{Concat}((F'_0, \dots, F'_L), \mathrm{dim}=1), \qquad F\in\mathbb{R}^{T\times C_\Sigma\times h\times w}.$$

$F_l$ is the feature map of the $l$-th up-sampling layer, and $C_\Sigma=\sum_l C_l$. These features are fast to obtain but inherently noisy and entangled.
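A sketch of this alignment-and-concatenation step; `feature_maps` is the list of up-sampling block outputs from the first denoising step and `hw` the common spatial size (names are illustrative).

```python
import torch
import torch.nn.functional as F

def concat_onestep_features(feature_maps, hw):
    """Interpolate each F_l (shape (T, C_l, h_l, w_l)) to the common (h, w) and
    concatenate along channels, giving F with sum_l C_l channels."""
    aligned = [F.interpolate(f, size=hw, mode="bilinear", align_corners=False)
               for f in feature_maps]
    return torch.cat(aligned, dim=1)  # (T, sum_l C_l, h, w)
```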

4.3.4 Geometric/Semantic decouplers

The decoupler does not rely on the noisy features alone; it also takes the VFM representation of the current observation as a reference anchor.
$$Y_i^{\mathrm{ref}}=\Phi_i(\operatorname{Interpolate}(I_{\mathrm{obs}}))\in\mathbb{R}^{1\times C_i\times h\times w}, \quad i\in\{\mathrm{geo}, \mathrm{sem}\}.$$ $$\tilde{F}_i^0=\operatorname{Concat}((F, \operatorname{Repeat}(Y_i^{\mathrm{ref}})), \mathrm{dim}=1).$$ $$\tilde{F}_i^k=\mathcal{T}_i^k(\mathcal{S}_i^k(\tilde{F}_i^{k-1})), \quad 1\le k\le K.$$

$\mathcal{S}$ and $\mathcal{T}$ are spatial and temporal transformer layers respectively. Each branch is finally projected back to the corresponding VFM dimension.
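A compact sketch of one decoupler branch under these definitions: the one-step features are concatenated with the repeated VFM reference of the current observation, passed through $K$ alternating spatial/temporal transformer layers, and projected to the VFM channel dimension. The layer widths, depth, and the use of torch's stock TransformerEncoderLayer are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class DecouplerSketch(nn.Module):
    """One geometric or semantic decoupler branch (illustrative)."""
    def __init__(self, c_in, c_ref, c_vfm, d_model=512, num_blocks=4, heads=8):
        super().__init__()
        self.proj_in = nn.Conv2d(c_in + c_ref, d_model, kernel_size=1)
        self.spatial = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_model, heads, batch_first=True)
             for _ in range(num_blocks)])
        self.temporal = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_model, heads, batch_first=True)
             for _ in range(num_blocks)])
        self.proj_out = nn.Linear(d_model, c_vfm)

    def forward(self, feats, ref):
        # feats: (T, C_in, h, w); ref: (1, C_ref, h, w) from the current observation
        T, _, h, w = feats.shape
        x = torch.cat([feats, ref.expand(T, -1, -1, -1)], dim=1)
        x = self.proj_in(x).flatten(2).transpose(1, 2)            # (T, h*w, d)
        for s_layer, t_layer in zip(self.spatial, self.temporal):
            x = s_layer(x)                                        # attend within each frame
            x = t_layer(x.transpose(0, 1)).transpose(0, 1)        # attend across time per location
        y = self.proj_out(x)                                      # (T, h*w, C_vfm)
        return y.transpose(1, 2).reshape(T, -1, h, w)
```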

Self-distillation loss: push the decoupler output toward the teacher VFM representation extracted from the multi-step generated video.
$$\mathcal{L}_i=\|\tilde{F}_i^K-Y_i\|_2^2, \qquad i\in\{\mathrm{geo}, \mathrm{sem}\}.$$

The paper emphasizes that the teacher videos come from the same diffusion trajectory, which avoids the trajectory misalignment that GT future frames would introduce.
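Putting stage 2 together, here is a hedged sketch of one self-distillation step: the teacher target and the one-step features share the same SVD conditioning, and only the decoupler receives gradients. `svd.sample_video` and `svd.first_step_features` are assumed interfaces, not real APIs.

```python
import torch
import torch.nn.functional as F

def self_distill_step(svd, vfm, decoupler, optimizer, obs, prompt, size):
    """One training step for a geometric or semantic decoupler (stage 2)."""
    with torch.no_grad():                                   # teacher path: frozen SVD + frozen VFM
        v_hat = svd.sample_video(obs, prompt)               # multi-step generated video (assumed API)
        y_teacher = vfm(F.interpolate(v_hat, size=size, mode="bilinear", align_corners=False))
        feats = svd.first_step_features(obs, prompt)        # one-step denoising features F (assumed API)
        ref = vfm(F.interpolate(obs, size=size, mode="bilinear", align_corners=False))
    loss = F.mse_loss(decoupler(feats, ref), y_teacher)     # L_i = ||F_i^K - Y_i||^2
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```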

4.3.5 Uni-Perceiver and diffusion policy

The geometric, semantic, and original diffusion features are concatenated into a holistic context and then compressed into a small number of tokens.
$$\mathcal{C}=\operatorname{Concat}(\tilde{F}_{\mathrm{geo}}^K, \tilde{F}_{\mathrm{sem}}^K, F)\in\mathbb{R}^{T\times C_{\mathrm{hol}}\times h\times w}.$$ $$F_{\mathrm{agg}}=\operatorname{FFN}(\operatorname{SelfAttn}(\operatorname{CrossAttn}(\mathcal{Q}, \mathcal{C}))).$$

$\mathcal{Q}$ are $N$ learnable latent queries. Cross-attention extracts information from high-dimensional spatiotemporal context, and self-attention models the internal relationships of compact tokens.
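A sketch of this condensation step with learnable queries; the latent width, query count, and single cross/self-attention pair mirror the formula above but are otherwise illustrative choices.

```python
import torch
import torch.nn as nn

class UniPerceiverSketch(nn.Module):
    """N learnable queries cross-attend to the flattened holistic context C,
    then self-attend and pass through an FFN, yielding compact tokens F_agg."""
    def __init__(self, c_hol, d_model=512, num_queries=64, heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, d_model))
        self.proj = nn.Linear(c_hol, d_model)
        self.cross = nn.MultiheadAttention(d_model, heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(d_model, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, context):
        # context: (T, C_hol, h, w) -> one sequence of T*h*w context tokens
        T, C, h, w = context.shape
        ctx = self.proj(context.flatten(2).transpose(1, 2).reshape(1, T * h * w, C))
        q = self.queries.unsqueeze(0)            # (1, N, d)
        x, _ = self.cross(q, ctx, ctx)           # queries read the spatiotemporal context
        x, _ = self.self_attn(x, x, x)           # relations among the compact tokens
        return self.ffn(x)                       # (1, N, d) aggregated foresight tokens
```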

The action expert is essentially a diffusion policy: it predicts the noise added to noisy actions, conditioned on the aggregated foresight tokens and the text embedding.
$$\mathcal{L}_A=\mathbb{E}_{j, a_j, \epsilon}\left[\|\epsilon-\epsilon_\phi(a_j, F_{\mathrm{agg}}, E, j)\|_2^2\right].$$

$a_j$ is the noisy action of diffusion timestep $j$, $E$ is the task text embedding, and $\epsilon_\phi$ is the action noise prediction network.
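A minimal sketch of this training objective; `eps_net` is a stand-in for $\epsilon_\phi$ with an assumed call signature, `alphas_bar` is the cumulative noise schedule, and `actions` is a clean action chunk of shape (B, chunk, action_dim).

```python
import torch
import torch.nn.functional as F

def action_diffusion_loss(eps_net, actions, f_agg, text_emb, alphas_bar):
    """Noise the clean action chunk at a random timestep j and regress the noise,
    conditioned on aggregated foresight tokens and the task text embedding."""
    B = actions.shape[0]
    j = torch.randint(0, alphas_bar.shape[0], (B,), device=actions.device)
    eps = torch.randn_like(actions)
    ab = alphas_bar[j].view(B, *([1] * (actions.dim() - 1)))  # broadcast over chunk/action dims
    a_j = ab.sqrt() * actions + (1 - ab).sqrt() * eps         # forward diffusion of the actions
    return F.mse_loss(eps_net(a_j, f_agg, text_emb, j), eps)
```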

4.4 Implementation points

Three training stages: first fine-tune the SVD backbone when needed; then freeze the video generation model and train the geometric/semantic decouplers; finally freeze SVD and the decouplers and train only the action expert.
Training steps: SVD fine-tuning uses 100k steps on MetaWorld and 40k steps for the real tasks; CALVIN directly reuses the VPP fine-tuned model; the decouplers are trained for 50k steps on every benchmark; the action expert is trained for 60k steps on CALVIN and 40k steps on the other benchmarks.
Hardware: SVD fine-tuning uses 4 NVIDIA H100 GPUs; decoupler self-distillation uses a single H100; the action expert uses 4 NVIDIA H100 GPUs; inference runs on a single NVIDIA RTX 3090 24GB.
Teacher target selection: DINOv2 for semantics and DPAv3 for geometry. Ablation shows an average length of 4.16 for DINOv2+DPAv3, higher than 4.06 for SigLIP+DPAv3 and 4.04 for DINOv2+VGGT.
Real-time performance: in the real-robot experiment, one forward pass takes 307.6 ms, comprising 231.0 ms for the video diffusion backbone, 40.1 ms for the decouplers, and 36.5 ms for the action expert; each predicted action chunk has length 8, giving an effective control frequency of about 25 Hz (see the quick check below).
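As a quick consistency check on these timing numbers (pure arithmetic on the values above):
$$231.0 + 40.1 + 36.5 = 307.6\ \mathrm{ms}, \qquad \frac{8\ \text{actions}}{0.3076\ \mathrm{s}} \approx 26\ \text{actions/s},$$
which is consistent with the reported effective control frequency of about 25 Hz.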

5. Experiment

5.1 Experimental setup

| Item | Settings |
| --- | --- |
| CALVIN | ABC → D: train in environments ABC, evaluate in the unseen environment D, examining generalization on consecutive long-horizon tasks. Metrics are the success rate of the $i$-th task and Avg. Len. |
| MetaWorld | 50 Sawyer manipulation tasks, grouped into Easy/Middle/Hard. The training set has 50 demonstrations per task. |
| Real robot | AgileX Robotics Cobot dual-arm platform with a Mobile ALOHA design; 7 DoF per arm plus parallel gripper; only front-camera monocular RGB observations are used. |
| Real tasks | Place-to-Pot, Place-to-Pot (Hard, transparent object), Pour-Water, Lift-Pot. A single multi-task model, about 50 human demonstrations per task, evaluated over 25 trials per task. |
| Baselines | Direct action learning: RT-1, Diffusion Policy, OpenVLA, CLOVER, $\pi_0$, Spatial Forcing; predictive methods: SuSIE, VPP, GR-1, Uni-VLA, HiF-VLA, etc. |
| Code repository | The official code is linked from the project page: https://github.com/Haodong-Yan/S-VAM-Code. |

5.2 Main results

CALVIN

| Method | 1st | 2nd | 3rd | 4th | 5th | Avg. Len. |
| --- | --- | --- | --- | --- | --- | --- |
| Spatial Forcing | 93.6 | 85.8 | 78.4 | 72.0 | 64.6 | 3.94 |
| VPP | 90.9 | 81.5 | 71.3 | 62.0 | 51.8 | 3.58 |
| Uni-VLA | 95.5 | 85.8 | 74.8 | 66.9 | 56.5 | 3.80 |
| HiF-VLA | 93.5 | 87.4 | 81.4 | 75.9 | 69.4 | 4.08 |
| S-VAM | 95.8 | 90.7 | 83.7 | 77.0 | 68.9 | 4.16 |

The paper highlights that S-VAM's Avg. Len. is 0.58 higher than the most direct baseline, VPP. In the qualitative figure, VPP's entangled one-step features produce attention trajectories that drift away from the instruction, while S-VAM's decoupled foresight keeps the attention trajectory consistent with the language instruction.

Figure (CALVIN qualitative comparison): shows the differences in attention trajectories and rollout execution.

MetaWorld

| Method | Easy | Middle | Hard | Average |
| --- | --- | --- | --- | --- |
| Spatial Forcing | 0.737 | 0.436 | 0.451 | 0.609 |
| HiF-VLA | 0.729 | 0.364 | 0.404 | 0.577 |
| VPP | 0.818 | 0.493 | 0.526 | 0.682 |
| S-VAM | 0.793 | 0.607 | 0.684 | 0.728 |

S-VAM's average success rate is 72.8%, with 68.4% on Hard tasks; the paper states that this clearly exceeds VPP's 52.6% on Hard tasks. The authors attribute the advantage to the decoupled geometric/semantic foresight providing more stable target localization and finer geometric constraints in complex object interactions.

Figure (MetaWorld qualitative comparison): VPP's attention trajectory deviates from the target nut; S-VAM's geometric/semantic foresight helps the action expert locate the target.

real robot

Figure (real world): real dual-arm Cobot multi-task experiment; the paper reports that S-VAM outperforms VPP on all four tasks without sacrificing real-time control.

Key details of the real-robot experiment: about 50 demonstrations and 25 evaluation trials per task. Place-to-Pot (Hard) involves a transparent object; VPP's success rate is 16%, which S-VAM improves to 32%. The paper explains that the geometric decoupler mitigates depth ambiguity on transparent surfaces, while the semantic decoupler helps maintain a consistent representation of the transparent object.

5.3 Ablation experiment

| Variant | 1st | 2nd | 3rd | 4th | 5th | Avg. Len. | Purpose |
| --- | --- | --- | --- | --- | --- | --- | --- |
| w/o Geometric Distillation | 94.1 | 87.1 | 79.3 | 73.5 | 66.5 | 4.01 | Verifies the effect of geometric foresight. |
| w/o Semantic Distillation | 94.0 | 87.1 | 80.4 | 73.2 | 64.1 | 3.99 | Verifies the role of semantic foresight in object-identity consistency. |
| w/o Self-Distillation | 94.2 | 85.2 | 75.9 | 67.8 | 59.0 | 3.82 | Replaces the self-generated video teacher with GT future frames. |
| w/o Uni-Perceiver | 94.0 | 84.6 | 74.7 | 65.1 | 53.8 | 3.72 | Verifies the necessity of compact token condensation. |
| w/o Original Diffusion Feature | 95.3 | 86.6 | 77.6 | 70.9 | 62.5 | 3.93 | Verifies that the original diffusion features provide residual global context. |
| S-VAM Full | 95.8 | 90.7 | 83.7 | 77.0 | 68.9 | 4.16 | Complete model. |

5.4 Supplementary experiment: VFM teacher target selection

| Type | Representation | Avg. Len. | Paper explanation |
| --- | --- | --- | --- |
| Semantic | CLIP / SigLIP | 3.72 / 3.77 | Global semantics carries scene-level information but lacks the dense patch-level affordance needed for low-level control. |
| Semantic | DINOv2 / DINOv3 | 4.01 / 3.95 | Dense patch-level representations outperform CLIP/SigLIP. |
| Geometric | DPAv3 / VGGT | 3.99 / 3.74 | DPAv3 adapts to dynamic video streams; VGGT prefers static scene reconstruction. |
| Motion-aware | VideoMAEv2 / V-JEPA2 | 3.90 / 3.74 | The authors argue that current video models are still weaker than specialized image models in fine-grained feature fidelity. |
| Synergistic | SigLIP+DPAv3 / DINOv2+VGGT / DINOv2+DPAv3 | 4.06 / 4.04 / 4.16 | DINOv2's dense semantics complements DPAv3's dynamic geometry. |

6. Analysis and Discussion

6.1 Analysis and explanation of the results given in the paper

6.2 Limitations of the author's statement

The Conclusion in the LaTeX source does not explicitly list limitations or failure cases, and there is no independent appendix. This report therefore does not add the report writer's own subjective limitations; gaps in reproducibility are listed in §6.4 based only on what the paper publishes.

6.3 Applicable boundaries and discussions clearly stated in the paper

6.4 Reproducibility audit

| Item | Status | Description |
| --- | --- | --- |
| Source code structure | Obtained | The arXiv e-print contains main.tex, secs/, the bibliography, style files, and PDF figures. |
| Figures | Extracted | All PDF figures have been converted to PNG and placed in figures/. |
| Code repository | Found | The project page links to Haodong-Yan/S-VAM-Code. |
| Training configuration | Partially complete | The paper gives the three-stage training step counts and GPU configuration, but the LaTeX does not list a full hyperparameter table (batch size, learning rate, etc.). |
| Data settings | Largely clear | CALVIN, MetaWorld, and real Cobot demos/trials are all described. |
| Appendix | No independent appendix | No appendix was found in the source, so there are no additional proofs, failure cases, or full hyperparameter tables to incorporate. |