
S-VAM: Shortcut Video-Action Model by Self-Distilling Geometric and Semantic Foresight

Authors: Haodong Yan, Zhide Zhong, Jiaguan Zhu, Junjie He, Weilin Yuan, Wenxuan Song, Xin Gong, Yingjie Cai, Guanyi Zhao, Xu Yan, Bingbing Liu, Ying-Cong Chen, Haoang Li

Organization: The Hong Kong University of Science and Technology (Guangzhou); Huawei Foundation Model Department

Publication: arXiv preprint, 2026

arXiv: 2603.16195 | Project: haodong-yan.github.io/S-VAM | Code: github.com/Haodong-Yan/S-VAM-Code

1. Quick overview of the paper

One-sentence summary: S-VAM addresses the tension in video-action models between multi-step video generation (high-quality foresight but too slow) and single-step diffusion features (fast but noisy/entangled): it uses self-distillation to compress the VFM geometric/semantic representations of multi-step generated videos into shortcut foresight that can be predicted in a single forward pass, and then feeds this foresight to the action expert to output actions.

Difficulty rating: ★★★★☆. Requires familiarity with VLA/VAM, Stable Video Diffusion, diffusion-model denoising dynamics, Vision Foundation Model representations, token condensation (QFormer/Perceiver), and diffusion policy.

Keywords: Video-Action Model, Self-Distillation, Geometric Foresight, Semantic Foresight, Vision Foundation Models, Diffusion Policy.

| Reading positioning question | Answer |
| --- | --- |
| What problem does the paper solve? | Existing VAMs either rely on slow multi-step video generation or use noisy one-step diffusion features, making it difficult to satisfy real-time control and high-quality future prediction at the same time. |
| The author's approach | Extract VFM teacher targets from the multi-step videos the diffusion model generates itself, then train lightweight geometric/semantic decouplers to predict these teacher representations directly from single-step denoising features. |
| Most important results | Average sequence length of 4.16 on CALVIN and average success rate of 72.8% on MetaWorld; on the real dual-arm Cobot, S-VAM outperforms VPP on four tasks, with an effective control frequency of 25 Hz at an action chunk length of 8. |
| Things to note when reading | The core is not simply injecting VFM features into the policy, but taking teacher targets from multi-step generated videos on the same diffusion trajectory, avoiding the trajectory mismatch between GT future frames and one-step features. |

Core contribution list

2. Motivation

2.1 What problem should be solved?

VLA models usually attach an action head to a pre-trained VLM and fine-tune it on robot action data. The problem is that VLMs come mainly from static image-text pre-training and lack the spatiotemporal foresight required for physical interaction; learning dynamics entirely from robot action data would be very costly.

The VAM route instead uses a video diffusion model to generate a visual plan and lets the action expert predict controls conditioned on that plan. This exploits dynamics priors from Internet videos and reduces reliance on robot action data.

Figure (teaser). Motivation and overview: the efficiency vs. foresight trade-off of existing VAMs and the shortcut idea of S-VAM.

2.2 Limitations of existing methods

2.3 The solution ideas of this article

This paper treats the stable, structured representations of slowly generated multi-step videos as the teacher and fast single-step denoising features as the student input. The geometric branch learns DPAv3 representations and the semantic branch learns DINOv2 representations. At inference, instead of running full multi-step video generation, a single denoising feature extraction plus the decouplers is enough to obtain geometric and semantic foresight usable by the action expert.

4. Detailed explanation of method

4.1 Method overview

The S-VAM pipeline is: take the current observation $I$ and task description $P$ as input; SVD extracts multi-layer up-sampling features in the first denoising step; the geometric and semantic decouplers map these noisy/entangled features to DPAv3-like and DINOv2-like future representations, respectively; the Uni-Perceiver aggregates the two kinds of foresight together with the original diffusion features into compact tokens; a diffusion policy outputs action sequences conditioned on these tokens and the text embedding.

Figure (method overview). S-VAM architecture: the core is a shortcut from one-step diffusion features to geometric/semantic foresight; actions are then generated by the Uni-Perceiver and the diffusion policy.
```text
Training stage 1: fine-tune the SVD backbone when needed
Training stage 2: freeze SVD
    generate multi-step videos V_hat
    extract teacher targets:
        Y_geo = DPAv3(Interpolate(V_hat))
        Y_sem = DINOv2(Interpolate(V_hat))
    extract one-step denoising features F
    train geo/sem decouplers: F -> Y_geo, Y_sem
Training stage 3: freeze SVD + decouplers
    C = Concat(F_geo, F_sem, F_raw)
    F_agg = UniPerceiver(C)
    train the diffusion policy to denoise action sequences
Inference:
    one SVD denoising forward pass -> F
    decouplers -> geometric/semantic foresight
    Uni-Perceiver + diffusion policy -> action chunk
```

4.2 Method evolution

| Stage | Form | Improvement motivation |
| --- | --- | --- |
| Direct VLA | Map the current image/language directly to actions. | Lacks explicit spatiotemporal foresight; high robot-data requirements. |
| VAM with multi-step video generation | A video diffusion model generates future visual plans in multiple steps, then actions are predicted. | Foresight is high-fidelity, but multi-step denoising makes inference slow. |
| One-step VAM | Guide actions with single-step diffusion internal features. | Good real-time performance, but the features are noisy/entangled. |
| S-VAM | Self-distill VFM representations of multi-step generated videos into single-step features. | Preserves single-step efficiency while obtaining more stable geometric/semantic future representations. |

4.3 Core design and mathematical derivation

4.3.1 Stable Video Diffusion Basics

SVD generates video by stepwise denoising from noise in latent space; multi-step sampling is high quality but slow.
$$z_{s-1}=\frac{1}{\sqrt{\alpha_s}}\left(z_s-\frac{1-\alpha_s}{\sqrt{1-\bar{\alpha}_s}}\epsilon_\theta(z_s, s, P, I)\right)+\sigma_s\epsilon.$$

$z_s$ is the latent of step $s$, $\epsilon_\theta$ is the noise prediction network, and the conditions are observation frame $I$ and task description $P$. S-VAM does not perform full multi-step sampling at control time, but instead uses first-step denoising features.
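For concreteness, here is a minimal sketch of one such denoising update, written directly from the formula above; the schedule scalars and the predicted noise are passed in, so this is illustrative rather than SVD's actual sampler.

```python
import torch

def denoise_step(z_s, eps_pred, alpha_s, alpha_bar_s, sigma_s):
    """One DDPM-style update z_s -> z_{s-1}, matching the equation above.
    eps_pred stands for epsilon_theta(z_s, s, P, I); schedule terms are scalars."""
    mean = (z_s - (1 - alpha_s) / (1 - alpha_bar_s) ** 0.5 * eps_pred) / alpha_s ** 0.5
    return mean + sigma_s * torch.randn_like(z_s)  # add fresh noise except at the last step
```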

4.3.2 VFM teacher target

The teacher target comes from the video that the diffusion model itself generates via multi-step sampling, rather than from ground-truth future frames.
$$Y=\Phi(\operatorname{Interpolate}(\hat{V})), \qquad Y\in\mathbb{R}^{T\times C_{\mathrm{VFM}}\times h\times w}.$$

$\hat{V}$ is the SVD multi-step generated video; $\Phi$ is a frozen VFM encoder; interpolation aligns the spatial resolution. The paper uses DPAv3 as the geometric teacher and DINOv2 as the semantic teacher.
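A minimal sketch of extracting such a teacher target with a frozen VFM encoder. `vfm_encoder` stands in for DPAv3 or DINOv2 and is assumed to return a dense feature map of shape (T, C_VFM, h, w); `size` is whatever input resolution the encoder expects.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()  # the teacher is frozen; no gradients flow into the VFM
def vfm_teacher_target(v_hat, vfm_encoder, size):
    """Y = Phi(Interpolate(V_hat)) for a generated video v_hat of shape (T, 3, H, W)."""
    frames = F.interpolate(v_hat, size=size, mode="bilinear", align_corners=False)
    return vfm_encoder(frames)  # (T, C_VFM, h, w)
```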

4.3.3 One-step denoising feature aggregation

Single-step features come from the multi-layer up-sampling blocks of the first denoising step and need to be aligned to a common resolution and concatenated.
$$F'_l=\operatorname{Interpolate}(F_l), \qquad F'_l\in\mathbb{R}^{T\times C_l\times h\times w}.$$ $$F=\operatorname{Concat}((F'_0, \dots, F'_L), \mathrm{dim}=1), \qquad F\in\mathbb{R}^{T\times C_\Sigma\times h\times w}.$$

$F_l$ is the feature map of the $l$-th up-sampling layer, and $C_\Sigma=\sum_l C_l$. These features are fast to obtain but inherently noisy and entangled.
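A sketch of this alignment-and-concatenation step; `feature_maps` is the list of up-sampling block outputs from the first denoising step and `hw` the common spatial size (names are illustrative).

```python
import torch
import torch.nn.functional as F

def concat_onestep_features(feature_maps, hw):
    """Interpolate each F_l (shape (T, C_l, h_l, w_l)) to the common (h, w) and
    concatenate along channels, giving F with sum_l C_l channels."""
    aligned = [F.interpolate(f, size=hw, mode="bilinear", align_corners=False)
               for f in feature_maps]
    return torch.cat(aligned, dim=1)  # (T, sum_l C_l, h, w)
```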

4.3.4 Geometric/Semantic decouplers

The decoupler does not rely on the noisy features alone; it also takes the VFM representation of the current observation as a reference anchor.
$$Y_i^{\mathrm{ref}}=\Phi_i(\operatorname{Interpolate}(I_{\mathrm{obs}}))\in\mathbb{R}^{1\times C_i\times h\times w}, \quad i\in\{\mathrm{geo}, \mathrm{sem}\}.$$ $$\tilde{F}_i^0=\operatorname{Concat}((F, \operatorname{Repeat}(Y_i^{\mathrm{ref}})), \mathrm{dim}=1).$$ $$\tilde{F}_i^k=\mathcal{T}_i^k(\mathcal{S}_i^k(\tilde{F}_i^{k-1})), \quad 1\le k\le K.$$

$\mathcal{S}$ and $\mathcal{T}$ are spatial and temporal transformer layers respectively. Each branch is finally projected back to the corresponding VFM dimension.
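A compact sketch of one decoupler branch under these definitions: the one-step features are concatenated with the repeated VFM reference of the current observation, passed through $K$ alternating spatial/temporal transformer layers, and projected to the VFM channel dimension. The layer widths, depth, and the use of torch's stock TransformerEncoderLayer are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class DecouplerSketch(nn.Module):
    """One geometric or semantic decoupler branch (illustrative)."""
    def __init__(self, c_in, c_ref, c_vfm, d_model=512, num_blocks=4, heads=8):
        super().__init__()
        self.proj_in = nn.Conv2d(c_in + c_ref, d_model, kernel_size=1)
        self.spatial = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_model, heads, batch_first=True)
             for _ in range(num_blocks)])
        self.temporal = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_model, heads, batch_first=True)
             for _ in range(num_blocks)])
        self.proj_out = nn.Linear(d_model, c_vfm)

    def forward(self, feats, ref):
        # feats: (T, C_in, h, w); ref: (1, C_ref, h, w) from the current observation
        T, _, h, w = feats.shape
        x = torch.cat([feats, ref.expand(T, -1, -1, -1)], dim=1)
        x = self.proj_in(x).flatten(2).transpose(1, 2)            # (T, h*w, d)
        for s_layer, t_layer in zip(self.spatial, self.temporal):
            x = s_layer(x)                                        # attend within each frame
            x = t_layer(x.transpose(0, 1)).transpose(0, 1)        # attend across time per location
        y = self.proj_out(x)                                      # (T, h*w, C_vfm)
        return y.transpose(1, 2).reshape(T, -1, h, w)
```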

Self-distillation loss: push the decoupler output toward the teacher VFM representation extracted from the multi-step generated video.
$$\mathcal{L}_i=\|\tilde{F}_i^K-Y_i\|_2^2, \qquad i\in\{\mathrm{geo}, \mathrm{sem}\}.$$

The paper emphasizes that the teacher videos come from the same diffusion trajectory, which avoids the trajectory misalignment that GT future frames would introduce.
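Putting stage 2 together, here is a hedged sketch of one self-distillation step: the teacher target and the one-step features share the same SVD conditioning, and only the decoupler receives gradients. `svd.sample_video` and `svd.first_step_features` are assumed interfaces, not real APIs.

```python
import torch
import torch.nn.functional as F

def self_distill_step(svd, vfm, decoupler, optimizer, obs, prompt, size):
    """One training step for a geometric or semantic decoupler (stage 2)."""
    with torch.no_grad():                                   # teacher path: frozen SVD + frozen VFM
        v_hat = svd.sample_video(obs, prompt)               # multi-step generated video (assumed API)
        y_teacher = vfm(F.interpolate(v_hat, size=size, mode="bilinear", align_corners=False))
        feats = svd.first_step_features(obs, prompt)        # one-step denoising features F (assumed API)
        ref = vfm(F.interpolate(obs, size=size, mode="bilinear", align_corners=False))
    loss = F.mse_loss(decoupler(feats, ref), y_teacher)     # L_i = ||F_i^K - Y_i||^2
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```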

4.3.5 Uni-Perceiver and diffusion policy

The geometric, semantic, and original diffusion features are concatenated into a holistic context and then compressed into a small number of tokens.
$$\mathcal{C}=\operatorname{Concat}(\tilde{F}_{\mathrm{geo}}^K, \tilde{F}_{\mathrm{sem}}^K, F)\in\mathbb{R}^{T\times C_{\mathrm{hol}}\times h\times w}.$$ $$F_{\mathrm{agg}}=\operatorname{FFN}(\operatorname{SelfAttn}(\operatorname{CrossAttn}(\mathcal{Q}, \mathcal{C}))).$$

$\mathcal{Q}$ are $N$ learnable latent queries. Cross-attention extracts information from high-dimensional spatiotemporal context, and self-attention models the internal relationships of compact tokens.
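A sketch of this condensation step with learnable queries; the latent width, query count, and single cross/self-attention pair mirror the formula above but are otherwise illustrative choices.

```python
import torch
import torch.nn as nn

class UniPerceiverSketch(nn.Module):
    """N learnable queries cross-attend to the flattened holistic context C,
    then self-attend and pass through an FFN, yielding compact tokens F_agg."""
    def __init__(self, c_hol, d_model=512, num_queries=64, heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, d_model))
        self.proj = nn.Linear(c_hol, d_model)
        self.cross = nn.MultiheadAttention(d_model, heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(d_model, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, context):
        # context: (T, C_hol, h, w) -> one sequence of T*h*w context tokens
        T, C, h, w = context.shape
        ctx = self.proj(context.flatten(2).transpose(1, 2).reshape(1, T * h * w, C))
        q = self.queries.unsqueeze(0)            # (1, N, d)
        x, _ = self.cross(q, ctx, ctx)           # queries read the spatiotemporal context
        x, _ = self.self_attn(x, x, x)           # relations among the compact tokens
        return self.ffn(x)                       # (1, N, d) aggregated foresight tokens
```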

The action expert is essentially a diffusion policy: it predicts the noise added to noisy actions, conditioned on the aggregated foresight tokens and the text embedding.
$$\mathcal{L}_A=\mathbb{E}_{j, a_j, \epsilon}\left[\|\epsilon-\epsilon_\phi(a_j, F_{\mathrm{agg}}, E, j)\|_2^2\right].$$

$a_j$ is the noisy action of diffusion timestep $j$, $E$ is the task text embedding, and $\epsilon_\phi$ is the action noise prediction network.
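A minimal sketch of this training objective; `eps_net` is a stand-in for $\epsilon_\phi$ with an assumed call signature, `alphas_bar` is the cumulative noise schedule, and `actions` is a clean action chunk of shape (B, chunk, action_dim).

```python
import torch
import torch.nn.functional as F

def action_diffusion_loss(eps_net, actions, f_agg, text_emb, alphas_bar):
    """Noise the clean action chunk at a random timestep j and regress the noise,
    conditioned on aggregated foresight tokens and the task text embedding."""
    B = actions.shape[0]
    j = torch.randint(0, alphas_bar.shape[0], (B,), device=actions.device)
    eps = torch.randn_like(actions)
    ab = alphas_bar[j].view(B, *([1] * (actions.dim() - 1)))  # broadcast over chunk/action dims
    a_j = ab.sqrt() * actions + (1 - ab).sqrt() * eps         # forward diffusion of the actions
    return F.mse_loss(eps_net(a_j, f_agg, text_emb, j), eps)
```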

4.4 Implementation points

Three training stages: first fine-tune the SVD backbone when needed; then freeze the video generation model and train the geometric/semantic decouplers; finally freeze SVD and the decouplers and train only the action expert.
Training steps: SVD fine-tuning uses 100k steps on MetaWorld and 40k steps for the real tasks; CALVIN directly reuses the VPP fine-tuned model; the decouplers are trained for 50k steps on every benchmark; the action expert is trained for 60k steps on CALVIN and 40k steps on the other benchmarks.
Hardware: SVD fine-tuning uses 4 NVIDIA H100 GPUs; decoupler self-distillation uses a single H100; the action expert uses 4 NVIDIA H100 GPUs; inference runs on a single NVIDIA RTX 3090 24GB.
Teacher target selection: DINOv2 for semantics and DPAv3 for geometry. Ablation shows an average length of 4.16 for DINOv2+DPAv3, higher than 4.06 for SigLIP+DPAv3 and 4.04 for DINOv2+VGGT.
Real-time performance: in the real-robot experiment, one forward pass takes 307.6 ms, comprising 231.0 ms for the video diffusion backbone, 40.1 ms for the decouplers, and 36.5 ms for the action expert; each predicted action chunk has length 8, giving an effective control frequency of about 25 Hz (see the quick check below).
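As a quick consistency check on these timing numbers (pure arithmetic on the values above):
$$231.0 + 40.1 + 36.5 = 307.6\ \mathrm{ms}, \qquad \frac{8\ \text{actions}}{0.3076\ \mathrm{s}} \approx 26\ \text{actions/s},$$
which is consistent with the reported effective control frequency of about 25 Hz.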

5. Experiment

5.1 Experimental setup

| Item | Settings |
| --- | --- |
| CALVIN | ABC → D: train in environments ABC, evaluate in the unseen environment D, examining generalization on consecutive long-horizon tasks. Metrics are the success rate of the $i$-th task and Avg. Len. |
| MetaWorld | 50 Sawyer manipulation tasks, grouped into Easy/Middle/Hard. The training set has 50 demonstrations per task. |
| Real robot | AgileX Robotics Cobot dual-arm platform with a Mobile ALOHA design; 7 DoF per arm plus parallel gripper; only front-camera monocular RGB observations are used. |
| Real tasks | Place-to-Pot, Place-to-Pot (Hard, transparent object), Pour-Water, Lift-Pot. A single multi-task model, about 50 human demonstrations per task, evaluated over 25 trials per task. |
| Baselines | Direct action learning: RT-1, Diffusion Policy, OpenVLA, CLOVER, $\pi_0$, Spatial Forcing; predictive methods: SuSIE, VPP, GR-1, Uni-VLA, HiF-VLA, etc. |
| Code repository | The official code is linked from the project page: https://github.com/Haodong-Yan/S-VAM-Code. |

5.2 Main results

CALVIN

| Method | 1st | 2nd | 3rd | 4th | 5th | Avg. Len. |
| --- | --- | --- | --- | --- | --- | --- |
| Spatial Forcing | 93.6 | 85.8 | 78.4 | 72.0 | 64.6 | 3.94 |
| VPP | 90.9 | 81.5 | 71.3 | 62.0 | 51.8 | 3.58 |
| Uni-VLA | 95.5 | 85.8 | 74.8 | 66.9 | 56.5 | 3.80 |
| HiF-VLA | 93.5 | 87.4 | 81.4 | 75.9 | 69.4 | 4.08 |
| S-VAM | 95.8 | 90.7 | 83.7 | 77.0 | 68.9 | 4.16 |

The paper highlights that S-VAM's Avg. Len. is 0.58 higher than the most direct baseline, VPP. In the qualitative figure, VPP's entangled one-step features produce attention trajectories that drift away from the instruction, while S-VAM's decoupled foresight keeps the attention trajectory consistent with the language instruction.

Figure (CALVIN qualitative comparison): shows the differences in attention trajectories and rollout execution.

MetaWorld

| Method | Easy | Middle | Hard | Average |
| --- | --- | --- | --- | --- |
| Spatial Forcing | 0.737 | 0.436 | 0.451 | 0.609 |
| HiF-VLA | 0.729 | 0.364 | 0.404 | 0.577 |
| VPP | 0.818 | 0.493 | 0.526 | 0.682 |
| S-VAM | 0.793 | 0.607 | 0.684 | 0.728 |

S-VAM's average success rate is 72.8%, with 68.4% on Hard tasks; the paper states that this clearly exceeds VPP's 52.6% on Hard tasks. The authors attribute the advantage to the decoupled geometric/semantic foresight providing more stable target localization and finer geometric constraints in complex object interactions.

Figure (MetaWorld qualitative comparison): VPP's attention trajectory deviates from the target nut; S-VAM's geometric/semantic foresight helps the action expert locate the target.

real robot

Figure (real world): real dual-arm Cobot multi-task experiment; the paper reports that S-VAM outperforms VPP on all four tasks without sacrificing real-time control.

Key details of the real-robot experiment: about 50 demonstrations and 25 evaluation trials per task. Place-to-Pot (Hard) involves a transparent object; VPP's success rate is 16%, which S-VAM improves to 32%. The paper explains that the geometric decoupler mitigates depth ambiguity on transparent surfaces, while the semantic decoupler helps maintain a consistent representation of the transparent object.

5.3 Ablation experiment

| Variant | 1st | 2nd | 3rd | 4th | 5th | Avg. Len. | Purpose |
| --- | --- | --- | --- | --- | --- | --- | --- |
| w/o Geometric Distillation | 94.1 | 87.1 | 79.3 | 73.5 | 66.5 | 4.01 | Verifies the effect of geometric foresight. |
| w/o Semantic Distillation | 94.0 | 87.1 | 80.4 | 73.2 | 64.1 | 3.99 | Verifies the role of semantic foresight in object-identity consistency. |
| w/o Self-Distillation | 94.2 | 85.2 | 75.9 | 67.8 | 59.0 | 3.82 | Replaces the self-generated video teacher with GT future frames. |
| w/o Uni-Perceiver | 94.0 | 84.6 | 74.7 | 65.1 | 53.8 | 3.72 | Verifies the necessity of compact token condensation. |
| w/o Original Diffusion Feature | 95.3 | 86.6 | 77.6 | 70.9 | 62.5 | 3.93 | Verifies that the original diffusion features provide residual global context. |
| S-VAM Full | 95.8 | 90.7 | 83.7 | 77.0 | 68.9 | 4.16 | Complete model. |

5.4 Supplementary experiment: VFM teacher target selection

| Type | Representation | Avg. Len. | Paper explanation |
| --- | --- | --- | --- |
| Semantic | CLIP / SigLIP | 3.72 / 3.77 | Global semantics carries scene-level information but lacks the dense patch-level affordance needed for low-level control. |
| Semantic | DINOv2 / DINOv3 | 4.01 / 3.95 | Dense patch-level representations outperform CLIP/SigLIP. |
| Geometric | DPAv3 / VGGT | 3.99 / 3.74 | DPAv3 adapts to dynamic video streams; VGGT prefers static scene reconstruction. |
| Motion-aware | VideoMAEv2 / V-JEPA2 | 3.90 / 3.74 | The authors argue that current video models are still weaker than specialized image models in fine-grained feature fidelity. |
| Synergistic | SigLIP+DPAv3 / DINOv2+VGGT / DINOv2+DPAv3 | 4.06 / 4.04 / 4.16 | DINOv2's dense semantics complements DPAv3's dynamic geometry. |

6. Analysis and Discussion

6.1 Analysis and explanation of the results given in the paper

6.2 Limitations of the author's statement

The Conclusion in the LaTeX source does not explicitly list limitations or failure cases, and there is no independent appendix. This report therefore does not add the report writer's own subjective limitations; gaps in reproducibility are listed in §6.4 based only on what the paper publishes.

6.3 Applicable boundaries and discussions clearly stated in the paper

6.4 Reproducibility audit

| Item | Status | Description |
| --- | --- | --- |
| Source code structure | Obtained | The arXiv e-print contains main.tex, secs/, the bibliography, style files, and PDF figures. |
| Figures | Extracted | All PDF figures have been converted to PNG and placed in figures/. |
| Code repository | Found | The project page links to Haodong-Yan/S-VAM-Code. |
| Training configuration | Partially complete | The paper gives the three-stage training step counts and GPU configuration, but the LaTeX does not list a full hyperparameter table (batch size, learning rate, etc.). |
| Data settings | Largely clear | CALVIN, MetaWorld, and real Cobot demos/trials are all described. |
| Appendix | No independent appendix | No appendix was found in the source, so there are no additional proofs, failure cases, or full hyperparameter tables to incorporate. |