AIM: Intent-Aware Unified world action Modeling with Spatial Value Maps

Authors: Liaoyuan Fan, Zetian Xu, Chen Cao, Wenyao Zhang, Mingqi Yuan, Jiayu Chen

Affiliations: INFIFORCE Intelligent Technology Co., Ltd.; The University of Hong Kong; Shanghai Jiao Tong University

Venue: arXiv preprint (NeurIPS 2025 template); source version dated 2026-03-23

arXiv: 2604.11135

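1. Paper at a Glance

One-sentence summary: AIM is a unified generative world-action model over future RGB, spatial value maps, and actions. It turns the future-imagination ability of video generation models into an explicit spatial cue for robot control ("where to interact") and reaches 94.0% / 92.1% average success rate on the Easy / Hard settings of RoboTwin 2.0.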

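Reading guide	Content
Problem addressed	Existing unified world action models can predict future frames, but action decoding must still implicitly recover contact locations and manipulation intent from dense RGB latents, which makes adaptation to the robot domain costly.
Key idea	Introduce an action-based spatial value map, aligned with the future frames, as an explicit spatial interface between future visual prediction and action decoding.
Most important result	On the 50 simulated tasks of RoboTwin 2.0, AIM reaches 94.0% (Easy) and 92.1% (Hard), 93.1% on average, above Stage1 (92.5%) and the external baselines.
What to watch for	The core is not "one more heatmap supervision term" but intent-causal attention, which forces the action branch to read future information only through the value map.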

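Difficulty: ★★★★☆. Requires familiarity with video diffusion / flow matching, Transformer attention masks, VLA / world-action models, and PPO/GRPO-style objectives in RL post-training.

Keywords: Unified World Action Model; Spatial Value Map; Intent-Causal Attention; Mixture-of-Transformers; GRPO Post-Training; RoboTwin 2.0.

Core contributions

Intent-aware unified world action model: adds a spatial value-map interface between future prediction and action decoding, so the model explicitly represents task-relevant interaction structure.
Spatially grounded control training framework: combines joint frame-value generation, intent-causal attention, and self-distillation RL post-training in a single framework.
30K simulated trajectory dataset: contains multi-view videos, action sequences, task identifiers, and per-step value-map annotations, used for training and evaluation.
RoboTwin 2.0 performance: 94.0% / 92.1% average SR on the Easy / Hard settings of 50 simulated tasks.

Appendix status: no Appendix / Supplementary section was found in the source; everything in this report comes from the main text, the figures and tables, and the bibliography file.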

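2. Motivation

2.1 What problem is being solved

The core question is how to reliably convert the visual-dynamics prior learned by large-scale video generation models into continuous actions for contact-rich robot manipulation. Video models are good at answering "what will the scene look like next", but robot control also has to answer "where should the end effector make contact, and why is that location useful for the task". In tasks such as grasping, placing, pressing, scanning, and switching, this information is typically a sparse contact region, whereas an RGB future latent carries a large amount of appearance detail.

The authors point out that existing unified world-action models usually let the action head decode actions directly from the shared future visual representation. This forces the model to recover manipulation intent implicitly from a dense visual representation; in cluttered scenes or contact-sensitive tasks, this inverse-dynamics problem becomes harder.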

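2.2 Limitations of existing methods

Traditional VLA: maps observation + language directly to action. It can imitate demonstrations but does not explicitly model how actions change future observations.
Two-stage WAM: first predicts future states, then hands them to an action model; there is no controllable, explicit interaction interface between the future state and action decoding.
One-stage WAM: future-state prediction and action prediction share the observation stream; the action head still has to extract sparse action cues from the dense RGB future.
Spatial grounding methods: prior work shows that actionable regions, voxel-aligned action spaces, and similar representations help manipulation, but they are usually used as a standalone policy head or perception module rather than embedded directly in a generative world-action model.

Figure 1: A typical unified WAM decodes actions directly from the future visual representation; AIM inserts a spatial value-map interface in between.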

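2.3 The paper's approach

AIM's high-level insight is to have the model first compress future visual dynamics into task-relevant spatial interaction regions, and then let the action head generate actions from that spatial representation. In other words, future RGB models how the world evolves, the value map expresses the future interaction intent, and the action head receives future information only through the value map.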

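3. Related Work

3.1 Related work as described by the paper

Line of work	How the paper positions it	How AIM differs
Video generation for robot learning	DreamZero, VPP, Video Generators, and related work use a pretrained video generator as a visual-dynamics prior for robot learning.	AIM uses Wan2.2-TI2V-5B as the video backbone but adds value-map prediction and an action branch.
Unified world action model	LingBot-VA, GigaWorld-Policy, Fast-WAM, DreamZero, and others put future observations and actions into a unified architecture.	AIM does not let actions depend directly on future RGB; it adds an explicit spatial intermediate interface.
Spatially grounded representations	Where2Act, PerAct, CLIPort, CALAMARI, and others emphasize the role of interaction regions / spatial grounding in manipulation.	AIM integrates spatial value prediction directly into the generative world-action model instead of using a separate perception or policy head.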

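3.2 Comparison with direct predecessors

Dimension	LingBot-VA	GigaWorld-Policy / Fast-WAM	AIM
Core idea	Learns video prediction and action generation in a shared latent space.	Emphasizes action-centered or efficient world-action modeling.	Jointly predicts future RGB, future value maps, and future actions.
Key assumption	A shared visual latent is sufficient for action decoding.	World-action co-training improves the policy.	Actions need an explicit spatial intent representation and cannot rely on dense future RGB alone.
Information flow	Actions can draw information from the shared future representation.	Depends on the specific model design; typically no value-map gating.	Intent-causal attention prevents the action branch from seeing future RGB directly.
Experimental performance	RoboTwin average SR 92.2%.	Fast-WAM 91.8%, Giga-World 86.0%.	RoboTwin average SR 93.1%.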

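4. Method

4.1 Overview

AIM takes a history window $\mathcal{H}_t=\{o_{t-k:t}, a_{t-k:t-1}\}$ as input, where $o_t$ is a synchronized multi-view observation and $a_t$ is a robot action. The model outputs horizon-$h$ future RGB frames $X^+$, future value maps $M^+$, and future actions $A^+$.

Figure 2: AIM Stage I jointly learns future frame generation, action prediction, and spatial value map estimation; Stage II runs GRPO with sparse + dense rewards.

The factorization below says: first predict the future world and the spatial intent, then generate actions from the spatial intent.

$$p(X^+, M^+, A^+ \mid \mathcal{H}_t)=p(X^+, M^+ \mid \mathcal{H}_t)\,p(A^+ \mid \mathcal{H}_t,M^+)$$

$X^+$	Future RGB frame sequence, i.e. the future visual state imagined by the model.
$M^+$	Value map spatially aligned with the future frames, encoding task-relevant interaction regions.
$A^+$	Future chunk of continuous robot actions.
$\mathcal{H}_t$	Context made up of past observations, past actions, and the language instruction.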
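
Compared with the general chain rule, the factorization above drops $X^+$ from the action term. The worked step below makes that explicit; it is a reading of the equation above, not an additional claim taken from the paper.

$$\begin{aligned} p(X^+,M^+,A^+\mid\mathcal{H}_t) &= p(X^+,M^+\mid\mathcal{H}_t)\,p(A^+\mid\mathcal{H}_t,X^+,M^+) && \text{(chain rule, always valid)} \\ &= p(X^+,M^+\mid\mathcal{H}_t)\,p(A^+\mid\mathcal{H}_t,M^+) && \text{(assuming } A^+ \perp X^+ \mid \mathcal{H}_t,M^+\text{)} \end{aligned}$$

The second line holds only if the value map carries all the future information the action needs. The intent-causal attention mask of Section 4.3.3 is the architectural counterpart of this conditional-independence assumption.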

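4.2 Evolution of the method

Traditional VLA: $o_t,c \rightarrow a_t$, with no explicit future world modeling.

Unified WAM: $\mathcal{H}_t \rightarrow (X^+,A^+)$, which predicts future observations and actions together, but the action head still has to extract intent from the RGB future.

AIM: $\mathcal{H}_t \rightarrow (X^+,M^+) \rightarrow A^+$, which uses the spatial value map as the interface through which action decoding receives future information.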

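4.3 Core design and mathematical formulation

4.3.1 Tokenization and prefix construction

The three input streams are packed RGB, packed ASVM (the action-based spatial value map), and continuous actions. The paper follows LingBot-VA's multi-view packing: the head camera is placed on top and the left and right wrist cameras on the two sides, forming a T-pose canvas.

The equation below feeds both the RGB image and the value map through the same Wan2.2 VAE, so the two are geometrically aligned in latent space.

$$z_t^o = E_{\mathrm{vae}}(\tilde x_t), \qquad z_t^m = E_{\mathrm{vae}}(\tilde m_t)$$

$\tilde x_t$	RGB observation after stitching the three views.
$\tilde m_t$	The ASVM rendered as an RGB image in the same T-pose layout.
$z_t^o,z_t^m$	VAE latent tokens; sharing the VAE avoids building a new visual tokenizer.

The next equation turns actions and language into tokens as well, so they can enter the Transformer.

$$z_t^a=E_a(a_t),\qquad z^\ell=E_{\mathrm{t5}}(c)$$

Here $a_t\in\mathbb{R}^{d_a}$ is the bimanual continuous action vector, $E_a$ is a lightweight MLP, and $c$ is the language instruction. Language tokens are injected into the video model only through cross-attention and are not injected directly into the action branch.

The prefix below defines the history context visible to the model during rollout.

$$\mathcal{H}^{\mathrm{tok}}_t=[z_{t-k:t}^o,\,z_{t-k:t-1}^a,\,z^\ell]$$

It contains recent observations, recent actions, and the task language, and is used to estimate the robot state and predict the future chunk.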

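4.3.2 Mixture-of-Transformers three-stream architecture

The model consists of a video generation model and an action head. The video branch is initialized from Wan2.2 and handles future RGB and value-map generation; the action head has the same depth but a smaller hidden width and handles action denoising.

At the start of a rollout, all three future token streams are initialized from noise and then denoised step by step.

$$\hat z_0^x,\hat z_0^m,\hat z_0^a\sim\mathcal{N}(0,I)$$

The value stream additionally receives a learned value noise token $n^m$, so its actual input is $[\hat z_0^m,n^m]$. RGB and the value map are denoised along the same flow-matching trajectory, while action tokens are denoised by the action head.

Each stream's output is mapped back to an interpretable object by its own decoder.

$$\hat X^+=D_x(z^x),\qquad \hat M^+=D_m(z^m),\qquad \hat A^+=D_a(z^a)$$

$D_x$	Future RGB decoder.
$D_m$	Value-map decoder.
$D_a$	Continuous action decoder.

The key to MoT is that attention is shared across streams while the FFNs stay branch-specific.

$$Q_s^\ell=h_s^\ell W_{Q,s}^\ell,\quad K_s^\ell=h_s^\ell W_{K,s}^\ell,\quad V_s^\ell=h_s^\ell W_{V,s}^\ell,\quad s\in\{x,m,a\}$$

Each stream first computes Q/K/V with its own projections (the superscript $\ell$ here is the layer index, not the language token $z^\ell$), maps them into a common attention dimension for masked shared self-attention, then projects back to its own hidden space and passes through a branch-specific feed-forward network.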
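
The shared-attention step can be illustrated with a minimal pure-Python sketch. Everything concrete here is an assumption made for illustration: the function names (`mot_shared_attention`, `toy_weights`), the deterministic placeholder weights, the single attention head, and the tiny dimensions. The branch-specific feed-forward network is omitted, and the paper's actual layer may differ in all of these respects.

```python
import math

def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]

def toy_weights(rows, cols, seed):
    """Deterministic placeholder for a learned weight matrix (hypothetical values)."""
    return [[0.1 * (((i * 7 + j * 3 + seed) % 5) - 2) for j in range(cols)]
            for i in range(rows)]

def mot_shared_attention(hidden, d_attn, mask):
    """One shared masked attention step over several streams.

    hidden: {stream: list of token vectors}; each stream may have its own width.
    mask:   mask[q][k] is True when query token q may attend to key token k,
            with tokens ordered stream by stream. Every row needs one True entry.
    Returns {stream: list of output vectors in that stream's own width}.
    """
    streams = list(hidden)
    q, k, v, owner, w_out = [], [], [], [], {}
    for n, s in enumerate(streams):
        width = len(hidden[s][0])
        # Stream-specific projections into the common attention dimension.
        q += matmul(hidden[s], toy_weights(width, d_attn, 3 * n))
        k += matmul(hidden[s], toy_weights(width, d_attn, 3 * n + 1))
        v += matmul(hidden[s], toy_weights(width, d_attn, 3 * n + 2))
        owner += [s] * len(hidden[s])
        # Stream-specific projection back to the stream's hidden width.
        w_out[s] = toy_weights(d_attn, width, 11 + n)
    out = {s: [] for s in streams}
    for i, qi in enumerate(q):
        visible = [j for j in range(len(k)) if mask[i][j]]
        scores = [sum(a * b for a, b in zip(qi, k[j])) / math.sqrt(d_attn)
                  for j in visible]
        weights = softmax(scores)
        mixed = [sum(w * v[j][c] for w, j in zip(weights, visible))
                 for c in range(d_attn)]
        out[owner[i]].append(matmul([mixed], w_out[owner[i]])[0])
    return out

# Two video tokens (width 4), one value token (width 4), one action token (width 2).
hidden = {"x": [[1.0, 0.5, -0.5, 0.2], [0.3, -0.1, 0.8, 0.0]],
          "m": [[0.2, 0.1, 0.0, -0.3]],
          "a": [[0.5, -0.5]]}
mask = [[True] * 4, [True] * 4, [True] * 4, [False, False, True, True]]
out = mot_shared_attention(hidden, d_attn=3, mask=mask)
```

Each stream keeps its own hidden width (4 for `x` and `m`, 2 for `a` in this toy setup) while all tokens meet in one attention computation. The last mask row hides the two video tokens from the action token, which is the kind of restriction described in Section 4.3.3.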

The training objective jointly constrains future visuals, spatial intent, and actions.

$$\mathcal{L}=\mathcal{L}_{\mathrm{rgb}}+\lambda_m\mathcal{L}_{\mathrm{map}}+\lambda_a\mathcal{L}_{\mathrm{act}}$$

$\mathcal{L}_{\mathrm{rgb}}$ and $\mathcal{L}_{\mathrm{map}}$ supervise the flow-matching velocity field; $\mathcal{L}_{\mathrm{act}}$ supervises inverse-dynamics action prediction.
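
As a rough illustration of what "supervising a flow-matching velocity field" means, the sketch below uses the common linear-interpolation parameterization (noise at time 0, data at time 1, target velocity equal to data minus noise). This parameterization, the mean-squared-error form, the Euler sampler, and all function names are assumptions; the report does not state which exact flow-matching variant or which loss weights the paper uses.

```python
def interpolate(noise, data, tau):
    """Point on the straight path from noise (tau=0) to data (tau=1)."""
    return [(1.0 - tau) * n + tau * d for n, d in zip(noise, data)]

def target_velocity(noise, data):
    """Velocity of the straight path; constant in tau for this parameterization."""
    return [d - n for n, d in zip(noise, data)]

def flow_matching_loss(predicted_v, noise, data):
    """Mean squared error between a predicted velocity and the target velocity."""
    target = target_velocity(noise, data)
    return sum((p - t) ** 2 for p, t in zip(predicted_v, target)) / len(target)

def total_loss(l_rgb, l_map, l_act, lam_m, lam_a):
    """Weighted sum L = L_rgb + lam_m * L_map + lam_a * L_act (weights are placeholders)."""
    return l_rgb + lam_m * l_map + lam_a * l_act

def euler_denoise(velocity_fn, noise, steps):
    """Integrate dz/dtau = velocity_fn(z, tau) from tau=0 to tau=1 with Euler steps."""
    z = list(noise)
    dt = 1.0 / steps
    for i in range(steps):
        v = velocity_fn(z, i * dt)
        z = [zi + dt * vi for zi, vi in zip(z, v)]
    return z

noise = [0.0, 1.0, -1.0]
data = [1.0, 3.0, 0.5]
oracle_v = target_velocity(noise, data)
# With a perfect velocity predictor, Euler integration transports noise to data.
recovered = euler_denoise(lambda z, tau: oracle_v, noise, steps=4)
```

In AIM the RGB and value-map latents would play the role of `data`, and the video model would play the role of `velocity_fn`; the action stream is denoised by the separate action head.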

4.3.3 Intent-Causal Self-Attention

This is AIM's structural constraint: action tokens may not attend to future RGB tokens directly and can access future information only through future value tokens.

Three visible token sets define what each stream can see in the shared attention.

$$\begin{aligned} \mathcal{V}_x&=[z_t^o,z_{t-k:t-1}^o,z_{t-k:t-1}^a,z^\ell,z^x],\\ \mathcal{V}_m&=[z_t^o,z_{t-k:t-1}^o,z^x,z^m],\\ \mathcal{V}_a&=[z_t^o,z_{t-k:t-1}^a,z^m,z^a]. \end{aligned}$$

$\mathcal{V}_x$	Future video can see the current observation, past observations/actions, language, and its own future video tokens.
$\mathcal{V}_m$	The future value map can see current/past observations and future video, which ties the value map to the sampled future state.
$\mathcal{V}_a$	Actions can see the current observation, past actions, the future value map, and their own action tokens, but not future RGB.

Masked attention takes K/V only from the corresponding visible set.

$$\tilde h_s^\ell=\mathrm{Attn}(Q_s^\ell,K(\mathcal{V}_s),V(\mathcal{V}_s))$$

As a result, task semantics enter the video branch first, future-state information then flows into the value stream, and the action branch finally receives future information only through the value representation.
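
The three visible sets translate directly into a boolean attention mask. The sketch below is one way to build it; the group labels (`obs_now`, `obs_hist`, `act_hist`, `lang`) and function names are hypothetical, and only the rows for future-stream queries are built because the visible sets above are defined only for those streams.

```python
# Key groups each future stream may attend to, copied from V_x, V_m, V_a above.
VISIBLE = {
    "x": {"obs_now", "obs_hist", "act_hist", "lang", "x"},
    "m": {"obs_now", "obs_hist", "x", "m"},
    "a": {"obs_now", "act_hist", "m", "a"},
}

def can_attend(query_stream, key_group):
    """True if a query token of `query_stream` may read a key token of `key_group`."""
    return key_group in VISIBLE[query_stream]

def build_mask(query_streams, key_groups):
    """mask[q][k] is True where query q may attend to key k."""
    return [[can_attend(q, k) for k in key_groups] for q in query_streams]

# One label per key token, in sequence order (a toy layout).
keys = ["obs_hist", "act_hist", "obs_now", "lang", "x", "x", "m", "a"]
queries = ["x", "x", "m", "a"]
mask = build_mask(queries, keys)

# The action row never sees future RGB ("x") or language; it does see the value map.
action_row = mask[3]
```

A quick sanity check on any real implementation is the same as here: every entry where an action query meets a future RGB key must be `False`, while the path from RGB to value map and from value map to action stays open.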

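4.3.4 Self-Distillation RL Post-Training

Supervised learning in Stage I makes the action head imitate dataset actions. Stage II updates only the action head in the closed-loop environment and freezes the video generator and value-map head, which avoids drift in future-frame / value-map prediction.

The dense reward scores whether the action's landing point falls in a high-value region predicted by the model itself.

$$r_t=\lambda_d r_t^{\mathrm{dense}}+\lambda_s r_t^{\mathrm{sparse}},\qquad r_t^{\mathrm{dense}}=M_t(\Pi(p_t))$$

$r_t^{\mathrm{sparse}}$	Environment-level task success or completion signal.
$p_t$	Predicted action landing point or end-effector target.
$\Pi(\cdot)$	Camera projection function that maps a 3D target onto the image plane.
$M_t$	Value map predicted by the frozen value head.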
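
A minimal sketch of the reward computation follows. It assumes a pinhole camera with the target already expressed in camera coordinates, nearest-pixel lookup, and zero dense reward when the target projects outside the image or lies behind the camera. None of these details, nor the function names, come from the paper, which only specifies $r_t^{\mathrm{dense}}=M_t(\Pi(p_t))$ and the weighted sum.

```python
def project_pinhole(point_cam, fx, fy, cx, cy):
    """Project a 3D point in camera coordinates to pixel coordinates (u, v).

    Returns None when the point is not in front of the camera.
    """
    x, y, z = point_cam
    if z <= 0:
        return None
    return (fx * x / z + cx, fy * y / z + cy)

def dense_reward(value_map, point_cam, intrinsics):
    """Look up the value map at the projected target: r_dense = M(Pi(p))."""
    uv = project_pinhole(point_cam, *intrinsics)
    if uv is None:
        return 0.0
    col, row = int(round(uv[0])), int(round(uv[1]))
    if not (0 <= row < len(value_map) and 0 <= col < len(value_map[0])):
        return 0.0  # assumption: targets outside the image earn no dense reward
    return value_map[row][col]

def combined_reward(r_dense, r_sparse, lam_d, lam_s):
    """r = lam_d * r_dense + lam_s * r_sparse (the weights are placeholders)."""
    return lam_d * r_dense + lam_s * r_sparse

# Toy 4x4 value map with a single high-value pixel at row 2, column 1.
value_map = [[0.0] * 4 for _ in range(4)]
value_map[2][1] = 1.0
intrinsics = (2.0, 2.0, 1.0, 2.0)  # fx, fy, cx, cy (hypothetical)

r_dense = dense_reward(value_map, (0.0, 0.0, 1.0), intrinsics)
r_total = combined_reward(r_dense, r_sparse=1.0, lam_d=0.5, lam_s=2.0)
```

Because the value map comes from the frozen value head, the dense term needs no extra labels; only the sparse term depends on the environment's success signal.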

GRPO uses a clipped ratio to limit the size of each policy update to the action head.

$$\mathcal{L}_{\mathrm{GRPO}}(\phi)=\mathbb{E}_t\left[\min\left(\rho_t(\phi)\hat A_t,\mathrm{clip}(\rho_t(\phi),1-\epsilon,1+\epsilon)\hat A_t\right)\right]$$

$$\rho_t(\phi)=\frac{\pi_\phi(a_t\mid\mathcal{H}_t,m_{t+1:t+h})}{\pi_{\phi_{\mathrm{old}}}(a_t\mid\mathcal{H}_t,m_{t+1:t+h})}$$

$\hat A_t$ is the advantage computed from the combined reward. The authors call this self-distillation because the frozen value head guides the action head online without extra human labels.
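
The clipped objective can be sketched in a few lines. The group-relative advantage used here (reward minus group mean, divided by group standard deviation) is the usual GRPO recipe and is an assumption: the report only says that $\hat A_t$ is based on the combined reward. The function names and the value of `eps` are likewise illustrative.

```python
import math

def group_advantages(rewards):
    """Normalize rewards within one group of rollouts (assumed GRPO-style advantage)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + 1e-8) for r in rewards]

def clipped_surrogate(logp_new, logp_old, advantages, eps):
    """Average of min(rho * A, clip(rho, 1 - eps, 1 + eps) * A) over samples."""
    total = 0.0
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)  # rho = pi_new(a) / pi_old(a)
        clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)

# Two rollouts of the same task: one failed (reward 0), one succeeded (reward 2).
adv = group_advantages([0.0, 2.0])
objective = clipped_surrogate(
    logp_new=[-1.0, -0.5], logp_old=[-1.0, -1.0], advantages=adv, eps=0.2
)
```

When the ratio moves past `1 + eps` on a positive-advantage sample, the clipped term takes over and the sample stops contributing extra incentive, which is what limits the update size for the action head.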

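4.4 Implementation notes (for reproduction)

Visual input: the three views are composited into a T-pose canvas, which keeps the input interface compatible with the Wan2.2 video VAE.
Value-map input: initialized as a pure black image, which acts as a null value prior; the task-relevant spatial structure is learned through denoising.
Language injection: T5 language tokens cross-attend only into the video model; language conditioning reaches the action head only indirectly, through the world/value representation.
Attention mask: the implementation must keep future RGB tokens invisible to action tokens, otherwise the paper's intent-causal constraint is broken.
Inference efficiency: autoregressive chunk-wise rollout supports a KV cache, so attention is recomputed only for newly added real observations and predicted tokens.
RL stability: Stage II freezes the video generation model and the value-map head and updates only the action head.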

Algorithm: AIM rollout
Input: history observations o[t-k:t], actions a[t-k:t-1], instruction c
1. Pack multi-view RGB into T-pose canvas x_tilde
2. Encode RGB/value/action/language tokens: z_o, z_m, z_a, z_l
3. Initialize future tokens z_x, z_m, z_a from Gaussian noise
4. Denoise RGB and value streams with video model
5. Apply intent-causal attention mask: action stream sees history action + current observation + future value, not future RGB
6. Decode X+, M+, A+
7. Execute action chunk; append new observation/action to KV-cached prefix

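5. Experiments

5.1 Experimental setup

Item	Setting
Dataset	30K RoboTwin 2.0 simulation trajectories; each contains synchronized multi-view video, an action sequence, a task ID, and per-step value-map annotations.
Tasks	The 50 simulated manipulation tasks of RoboTwin 2.0, under both Easy and Hard settings.
Backbone	Video generation model initialized from Wan2.2-TI2V-5B.
Baselines	$\pi_0$, $\pi_{0.5}$, X-VLA, Motus, Fast-WAM, Giga-World, LingBot-VA; Stage1 is also reported as the supervised model before RL.
Metric	Success Rate (SR), computed per task.
RL post-training	The action head is initialized from the Stage1 checkpoint; the video generation model and value-map head are frozen.
Hardware / hyperparameters	The main text does not give GPU type, training time, batch size, learning rate, or the specific values of $\lambda_m$, $\lambda_a$, $\lambda_d$, $\lambda_s$, and the GRPO clipping $\epsilon$.
Code repository	Neither the arXiv page nor the source provides an official GitHub / project URL.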

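5.2 Value-map annotation pipeline

Task type	Annotation source	Generation method	Meaning
Pick	Contact-surface point cloud at the moment the gripper makes a valid grasp contact with the target object.	Projected onto the image plane with the camera calibration matrix, then Gaussian-smoothed; the kernel width is adjusted dynamically with camera parameters and depth.	Grasp affordance region, i.e. the spatial region where the end effector successfully interacts physically with the target object.
Place	Contact region between the grasped object and the target support surface once the object reaches a stable placed state.	Placement completion is detected with a small threshold on center-of-mass velocity; the contact region is then projected to produce a heat map.	Placement contact region, i.e. where the object should contact the environment when the placement goal is met.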
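
The projection-then-smoothing step can be sketched as follows. This is an illustrative reconstruction under stated assumptions: contact points are taken as already projected to pixel coordinates, overlapping Gaussians are merged with a per-pixel maximum, and `sigma_for_depth` is a hypothetical rule (focal length times a physical radius divided by depth) standing in for the paper's unspecified "kernel width adjusted with camera parameters and depth".

```python
import math

def sigma_for_depth(fx, depth, radius_m):
    """Hypothetical kernel-width rule: pixel size of a fixed physical radius at a given depth."""
    return fx * radius_m / depth

def gaussian_heatmap(height, width, points_uv, sigma):
    """Render projected contact points (u=column, v=row) as a Gaussian heat map in [0, 1]."""
    heat = [[0.0] * width for _ in range(height)]
    for u, v in points_uv:
        for row in range(height):
            for col in range(width):
                d2 = (col - u) ** 2 + (row - v) ** 2
                value = math.exp(-d2 / (2.0 * sigma ** 2))
                # Merge overlapping blobs with max so the map stays within [0, 1].
                heat[row][col] = max(heat[row][col], value)
    return heat

# One contact point at the center of a 5x5 image.
heat = gaussian_heatmap(5, 5, [(2, 2)], sigma=1.0)
peak = heat[2][2]
```

A closer contact region (smaller depth) gets a wider kernel under this rule, so the annotated region keeps a roughly constant physical size across the head and wrist cameras.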

5.3 Main results

Setting	$\pi_0$	$\pi_{0.5}$	X-VLA	Motus	Fast-WAM	Giga-World	LingBot-VA	Stage1	AIM
Easy	65.9%	82.7%	72.8%	88.7%	91.9%	87.0%	92.9%	93.0%	94.0%
Hard	58.4%	76.8%	72.8%	87.0%	91.8%	85.0%	91.6%	92.0%	92.1%
Average	62.2%	79.8%	72.8%	87.8%	91.8%	86.0%	92.2%	92.5%	93.1%

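Column by column, AIM exceeds Motus by +5.3 / +5.1 percentage points on Easy / Hard, and $\pi_{0.5}$ by +11.3 / +15.3 percentage points. Stage1 already reaches 93.0% / 92.0%, and Stage II RL raises this to 94.0% / 92.1%, which indicates that most of the gain comes from the spatial interface and supervised training, with RL post-training adding a further small improvement.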
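
The margins can be recomputed directly from the Easy / Hard rows of the summary table. The short script below does only that arithmetic; the dictionary keys are shorthand labels for the table columns, and the helper name `margin` is arbitrary.

```python
# (Easy SR, Hard SR) in percent, copied from the summary table.
summary = {
    "pi_0": (65.9, 58.4),
    "pi_0.5": (82.7, 76.8),
    "X-VLA": (72.8, 72.8),
    "Motus": (88.7, 87.0),
    "Fast-WAM": (91.9, 91.8),
    "Giga-World": (87.0, 85.0),
    "LingBot-VA": (92.9, 91.6),
    "Stage1": (93.0, 92.0),
    "AIM": (94.0, 92.1),
}

def margin(model, baseline):
    """Difference in percentage points (Easy, Hard), rounded to one decimal."""
    return tuple(round(a - b, 1) for a, b in zip(summary[model], summary[baseline]))

vs_motus = margin("AIM", "Motus")     # (5.3, 5.1)
vs_pi05 = margin("AIM", "pi_0.5")     # (11.3, 15.3)
vs_stage1 = margin("AIM", "Stage1")   # (1.0, 0.1)
```

The Stage1 comparison shows that the RL stage adds 1.0 point on Easy but only 0.1 point on Hard.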

Per-task SR on the 50 tasks

Task	$\pi_{0.5}$ Easy	$\pi_{0.5}$ Hard	X-VLA Easy	X-VLA Hard	Motus Easy	Motus Hard	Stage1 Easy	Stage1 Hard	AIM Easy	AIM Hard
Adjust Bottle	100%	99%	100%	99%	89%	93%	98%	99%	100%	100%
Beat Block Hammer	96%	93%	92%	88%	95%	88%	98%	100%	100%	100%
Blocks Ranking RGB	92%	85%	83%	83%	99%	97%	91%	77%	92%	77%
Blocks Ranking Size	49%	26%	67%	74%	75%	63%	47%	44%	47%	43%
Click Alarmclock	98%	89%	99%	99%	100%	100%	98%	99%	100%	100%
Click Bell	99%	66%	100%	100%	100%	100%	98%	99%	100%	100%
Dump Bin Bigbin	92%	97%	79%	77%	95%	91%	98%	100%	100%	100%
Grab Roller	100%	100%	100%	100%	100%	100%	98%	99%	100%	100%
Handover Block	66%	57%	73%	37%	86%	73%	92%	89%	93%	90%
Handover Mic	98%	97%	0%	0%	78%	63%	82%	82%	83%	81%
Hanging Mug	18%	17%	23%	27%	38%	38%	43%	43%	43%	42%
Lift Pot	96%	85%	99%	100%	96%	99%	98%	100%	100%	100%
Move Can Pot	51%	55%	89%	86%	34%	74%	99%	97%	100%	98%
Move Pillbottle Pad	84%	61%	73%	71%	93%	96%	97%	99%	97%	98%
Move Playingcard Away	96%	84%	93%	98%	100%	96%	98%	100%	100%	100%
Move Stapler Pad	56%	42%	78%	73%	83%	85%	91%	83%	92%	84%
Open Laptop	90%	96%	93%	100%	95%	91%	98%	100%	100%	100%
Open Microwave	34%	77%	79%	71%	95%	91%	83%	80%	83%	79%
Pick Diverse Bottles	81%	71%	58%	36%	90%	91%	99%	97%	100%	98%
Pick Dual Bottles	93%	63%	47%	36%	96%	90%	92%	90%	93%	91%
Place A2B Left	87%	82%	48%	49%	82%	79%	93%	91%	94%	92%
Place A2B Right	87%	84%	36%	36%	90%	87%	89%	89%	90%	88%
Place Bread Basket	77%	64%	81%	71%	91%	94%	92%	90%	93%	91%
Place Bread Skillet	85%	66%	77%	67%	86%	83%	98%	100%	100%	100%
Place Burger Fries	94%	87%	94%	94%	98%	98%	98%	100%	100%	100%
Place Can Basket	62%	62%	49%	52%	81%	76%	78%	77%	78%	76%
Place Cans Plasticbox	94%	84%	97%	98%	98%	94%	98%	100%	100%	100%
Place Container Plate	99%	95%	97%	95%	98%	99%	100%	96%	100%	97%
Place Dual Shoes	75%	75%	79%	88%	93%	87%	100%	99%	100%	98%
Place Empty Cup	100%	99%	100%	98%	99%	98%	98%	100%	100%	100%
Place Fan	87%	85%	80%	75%	91%	87%	93%	89%	93%	90%
Place Mouse Pad	60%	39%	70%	70%	66%	68%	97%	96%	97%	95%
Place Object Basket	80%	76%	44%	39%	81%	87%	93%	88%	93%	89%
Place Object Scale	86%	80%	52%	74%	88%	85%	100%	97%	100%	98%
Place Object Stand	91%	85%	86%	88%	98%	97%	98%	100%	100%	100%
Place Phone Stand	81%	81%	88%	87%	87%	86%	82%	81%	82%	80%
Place Shoe	92%	93%	96%	95%	99%	97%	98%	100%	100%	100%
Press Stapler	87%	83%	92%	98%	93%	98%	96%	95%	96%	94%
Put Bottles Dustbin	84%	79%	74%	77%	81%	79%	80%	75%	80%	74%
Put Object Cabinet	80%	79%	46%	48%	88%	71%	81%	75%	81%	74%
Rotate QRcode	89%	87%	34%	33%	89%	73%	98%	99%	100%	98%
Scan Object	72%	65%	14%	36%	67%	66%	98%	97%	100%	98%
Shake Bottle Horizontally	99%	99%	100%	100%	100%	98%	98%	100%	100%	100%
Shake Bottle	99%	97%	99%	100%	100%	97%	98%	100%	100%	100%
Stack Blocks Three	91%	76%	6%	10%	91%	95%	100%	99%	100%	98%
Stack Blocks Two	97%	100%	92%	87%	100%	98%	98%	100%	100%	100%
Stack Bowls Three	77%	71%	76%	86%	79%	87%	100%	99%	100%	98%
Stack Bowls Two	95%	96%	96%	93%	98%	98%	100%	97%	100%	98%
Stamp Seal	79%	55%	76%	82%	93%	92%	100%	100%	100%	100%
Turn Switch	62%	54%	40%	61%	84%	78%	100%	99%	100%	98%

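5.4 Ablations and additional results

The ablation the paper reports explicitly is Stage1 vs AIM: Stage1 is the supervised model before RL post-training, and AIM is the model after self-distillation RL. Average SR goes from 92.5% to 93.1%, Easy from 93.0% to 94.0%, and Hard from 92.0% to 92.1%. RL post-training therefore yields a real additional gain, but a smaller one than the gap to the external baselines.

The authors note that the gains are most visible on contact-sensitive and stage-dependent manipulation tasks: Place Mouse Pad reaches 97% / 95%, Scan Object 100% / 98%, and Turn Switch 100% / 98%. These tasks require accurately localizing the task-relevant interaction region.

Figure 3: Execution of representative RoboTwin 2.0 tasks, including place mouse pad, press stapler, scan object, turn switch, and open laptop; the left column is Easy and the right column is Hard.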

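7. Analysis, Limitations, and Scope

7.1 What is most valuable about this paper

Based on the paper's own statements and experiments, AIM's core value is that it splits the implicit coupling between future visual prediction and action decoding into an inspectable spatial interface: future frames handle scene evolution, the value map handles the task-relevant interaction region, and the action head reads future information only through the value map. This structure lets the performance gains be lined up against the value-map localization and projected action targets shown in the visualizations.

7.2 Why the results hold up

The paper offers three layers of evidence. First, in the average SR table AIM is above the external baselines on all three summary rows (Easy / Hard / Average). Second, the Stage1-to-AIM comparison isolates the contribution of RL post-training. Third, the authors' analysis and visualizations of contact-sensitive tasks show that future frames, value maps, and projected actions stay consistent across manipulation stages, supporting the paper's explanation that the gains come from the spatial bridge rather than shortcut correlations.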

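7.3 Result analysis and explanations given in the paper

The authors explain that AIM's gains are most visible on contact-sensitive and stage-dependent tasks because these tasks depend on accurately localizing the interaction region.
The authors state that value maps concentrate on meaningful interaction regions rather than generic saliency, and that projected action targets fall within high-value areas.
The authors attribute Stage1's strong performance to joint frame-value generation and intent-causal attention, with RL post-training bringing only a further improvement.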

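7.4 Limitations stated by the authors

Neither the main text nor the Conclusion lists limitations separately, and no failure cases are stated explicitly. The coverage boundary can be summarized objectively from the experimental setup: experiments are run in the RoboTwin 2.0 simulation environment, and value-map annotation relies on the simulator's contact API, camera calibration, and physical state; the main text reports no real-robot experiments, cross-dataset generalization, training cost, hyperparameter sensitivity, or failed-task analysis. These are coverage boundaries derived from the scope of the paper's experiments, not additional performance judgments.

7.5 Applicability and discussion

Applicable settings: multi-view, language-conditioned, contact-rich manipulation where value-map supervision is available or can be generated automatically.
Key prerequisites: the value map must be spatially aligned with future RGB, and the intent-causal mask must block the action branch's direct access to future RGB.
Data prerequisites: training relies on 30K RoboTwin trajectories with multi-view video, actions, and per-step value-map annotations.
Not covered: the source has no Appendix and gives no complete hyperparameters, hardware, training time, code repository, or real-robot deployment details.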

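Verification record

Section coverage: Abstract, Introduction, Related Work, Overview, Method, Dataset and Value-Map Annotation, Experiments, and Conclusion are all mapped to sections of this report.
Figure and table coverage: the three standalone PNG figures are all copied and embedded; the two result tables are rebuilt as HTML tables, with the per-task table placed in a collapsible region.
Appendix coverage: the source has no Appendix; the report states this explicitly.
Objectivity: the analysis and limitations are based on the paper's experiments, figures and tables, and Conclusion, with no additional suggestions for improvement.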