
World-Value-Action Model: Implicit Planning for Vision-Language-Action Systems

Authors: Runze Li, Hongyin Zhang, Junxi Jin, Qixin Zeng, Zifeng Zhuang, Yiqi Tang, Shangke Lyu, Donglin Wang

Organization: Westlake University; Nanjing University Suzhou Campus

Publication: arXiv preprint, 2026

arXiv: 2604.14732 | Project Page: win-commit.github.io/wavpage

1. Quick overview of the paper

One-sentence summary: WAV reframes VLA action generation from "directly predicting actions" to "implicit planning in the latent space of future trajectories": it first generates candidate future visual trajectories, then uses a trajectory value to evaluate their long-term benefit, and finally decodes high-value, dynamically feasible latent trajectories into actions.

Difficulty rating: ★★★★☆. Requires familiarity with VLA, world models, flow matching, MPC/MPPI, trajectory value, and basic high-dimensional probability / covering arguments.

Keywords: Vision-Language-Action, World Model, Trajectory Value, Latent Planning, Flow Matching, MPPI-style Inference.

Reading positioning

What problem does the paper solve? Most existing VLAs predict actions directly from the current observation and instruction, lacking any reasoning about or evaluation of long-term future trajectories, so errors tend to accumulate in compositional / long-horizon manipulation.
The authors' approach: Construct three World-Value-Action modules: the world/video generator predicts future visual features, the value module evaluates the long-term utility of each trajectory, and the action decoder generates actions from the optimized latent video/value features.
Most important results: The LIBERO average success rate is 98.1; removing latent trajectory planning drops it to 96.4, and the Long suite drops from 94.4 to 91.8. On real-robot tasks, the average success rate rises from 35.6% for GE-ACT to 75.6% for WAV.
Things to note when reading: The core is not simply "one more world model", but moving the action search into the learned latent trajectory distribution and repeatedly reweighting it with value/SNR/elite selection.

Core contribution list

2. Motivation

2.1 What problem should be solved?

The paper focuses on language-conditioned robotic manipulation. At each step, the model receives a visual observation $o_t$, a language instruction $g$, and optionally a proprioceptive state $p_t$, and outputs an action $a_t$. Existing VLAs obtain semantic generalization through VLM pre-training, but most methods still treat decision-making as direct action prediction: given the current context, output the current action or a short action window.

This setting works in short tasks, but two problems arise in long-horizon tasks: first, the model has no explicit mechanism to evaluate what future states the current action will lead to; second, small early errors in multi-step tasks accumulate into later failures. In a real-robot example, the drawer task requires opening a drawer, placing objects, and closing the drawer; if the first step misaligns with the drawer handle, subsequent steps fail even if their action form is correct.

2.2 Limitations of existing methods

2.3 The solution ideas of this article

The high-level insight of WAV is that planning can be used not as an external optimizer, but as an inference process within a structured generative model. The model learns a latent generator that produces plausible future trajectories, and then learns a trajectory value function to evaluate these futures. The inference stage repeatedly moves the latent noise distribution toward areas of high value and low uncertainty.

4. Detailed explanation of method

4.1 Method overview

The data flow of WAV can be summarized as follows: the inputs are multi-view observations, a language instruction, and the robot state; the video generation module produces candidate future visual features; the trajectory value module estimates the long-term return and its stability for each candidate future; latent planning selects elite samples in the video/value latent distributions and updates their means and variances; the action decoder fuses the optimized video/value features to output actions.

Figure: WAV pipeline, covering the video/world module, the trajectory value module, and the action decoding module (the image is a standalone PNG in the paper source).

Input: observation o_t, proprioception p_t, language g
Encode g with frozen T5-XXL -> T(g)
Initialize video latent Gaussian f_vid^(0), value latent Gaussian f_val^(0)
for k = 1..K:
    sample M video noises z_vid ~ f_vid^(k-1)
    generate future visual features x = W(z_vid, observations, T(g))
    for each x:
        sample N value noises z_val ~ f_val^(k-1)
        estimate trajectory values v = V(x, z_val)
    score by SNR(v)
    update f_vid^(k) from top-K1 video samples
    update f_val^(k) from top-K2 value samples
    smooth mean/std with alpha, beta
sample optimized z_vid*, z_val*
decode action a_t = A(video features, value features)

4.2 Method evolution

The evolution logic of the paper can be summarized as:

Stage | Method form | Improvement motivation
Direct VLA | $\pi_\theta(a_{t: t+H}\mid o_t, p_t, g)$ directly predicts short action sequences. | Exploits language and visual pre-training, but lacks trajectory-level evaluation of the future.
MPC/MPPI | Sample candidate future action sequences, roll them out, and select the best by reward/value. | Provides long-horizon reasoning, but action-space search hits feasible trajectories with very low probability over high-dimensional long horizons.
WAV | Sample and reweight in the learned latent trajectory space, then decode the action. | Uses the generative model to concentrate probability mass on physically and semantically feasible future trajectories.

4.3 Core design and mathematical derivation

4.3.1 Basic definitions of VLA and MPC

VLA direct action prediction: given historical observations, states, actions, and language, predict actions step by step.
$$\pi_\theta(a_{1: T}\mid o_{1: T}, p_{1: T}, g)=\prod_{t=1}^{T}\pi_\theta(a_t\mid o_{1: t}, p_{1: t}, a_{1: t-1}, g).$$

Here $o_t$ is the visual observation, $p_t$ the proprioceptive state, $g$ the language instruction, and $a_t$ the action. In practice, implementations predict a short action window $a_{t: t+H}$ at a time.

MPC objective: select the action sequence that maximizes the expected cumulative discounted reward over a finite horizon.
$$a_{t: t+H}^\star=\arg\max_{a_{t: t+H}}\mathbb{E}\left[\sum_{i=0}^{H}\gamma^iR(s_{t+i}, a_{t+i})\right].$$

This explains why the paper needs both future prediction and trajectory value; but searching directly in $\mathcal{A}^H$ runs into a feasibility bottleneck.
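To make the contrast concrete, here is a minimal sketch of a generic MPPI-style search directly in action space. This is not WAV's method; `rollout_fn`, the horizon, and the hyperparameters are placeholder assumptions, and the sketch only illustrates the loop that the feasibility argument in the next subsection targets.

```python
import numpy as np

def mppi_action_search(rollout_fn, act_dim, horizon=16, n_samples=256,
                       n_iters=3, temperature=1.0, seed=0):
    """Generic MPPI-style search over action sequences (baseline, not WAV).

    rollout_fn(actions) -> scalar return of one (horizon, act_dim) sequence;
    it stands in for a simulator or learned dynamics model plus reward.
    """
    rng = np.random.default_rng(seed)
    mean = np.zeros((horizon, act_dim))
    std = np.ones((horizon, act_dim))
    for _ in range(n_iters):
        # Sample candidate action sequences around the current mean.
        noise = rng.standard_normal((n_samples, horizon, act_dim))
        candidates = mean + std * noise
        returns = np.array([rollout_fn(a) for a in candidates])
        # Softmax-weight candidates by return and update the sampling mean.
        weights = np.exp((returns - returns.max()) / temperature)
        weights /= weights.sum()
        mean = np.einsum("m,mha->ha", weights, candidates)
    return mean  # receding-horizon control executes the first step(s)
```

The search space here is $\mathcal{A}^H$ itself: every additional horizon step inflates the volume the sampling Gaussian must cover, which is exactly the exponentially small feasible-mass problem formalized next.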

4.3.2 Why action-space search is difficult

Core claim: the long-horizon trajectory space is enormous, but the set of feasible trajectories satisfying physical, contact, and semantic constraints is only a "thin manifold".
$$\mathcal{X}=\mathcal{S}^{H}\times\mathcal{A}^{H}, \qquad D=H(\dim\mathcal{S}+\dim\mathcal{A}).$$ $$\frac{\mu(\mathcal{N}_\epsilon(\mathcal{M}_{\mathrm{traj}}))}{\mu(\mathcal{X})}\le \exp(-cH).$$

$\mathcal{M}_{\mathrm{traj}}$ is a feasible trajectory set; $\mathcal{N}_\epsilon$ is its $\epsilon$-neighborhood. The conclusion is: if you randomly search for candidates in the entire trajectory/action space, the probability of encountering an approximately feasible trajectory will decrease exponentially with $H$.

Appendix Proof Integration: Covering Number Proof Ideas [Appendix A.1]

In the appendix, the authors treat $\mathcal{M}_{\mathrm{traj}}$ as a compact subset of intrinsic dimension $d$. Covering it with balls of radius $\epsilon$, the covering number satisfies $N_\epsilon\le C_1\epsilon^{-d}$. Each ball in $D$-dimensional space has volume proportional to $\epsilon^D$, so the neighborhood volume is at most of order $\epsilon^{D-d}$. Since $D=H(\dim\mathcal{S}+\dim\mathcal{A})$, and assuming $d\le\lambda H$ with $\lambda<\dim\mathcal{S}+\dim\mathcal{A}$, we get $D-d\ge\kappa H$, and hence $\epsilon^{D-d}=\exp((D-d)\log\epsilon)\le\exp(-cH)$.
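Written out, the chain of inequalities is (constants absorbed into $C_1$ and $c$; this only restates the appendix argument in one line):
$$\mu(\mathcal{N}_\epsilon(\mathcal{M}_{\mathrm{traj}}))\;\lesssim\; N_\epsilon\cdot\epsilon^{D}\;\le\; C_1\,\epsilon^{-d}\,\epsilon^{D}\;=\; C_1\,\epsilon^{D-d},\qquad D-d\;\ge\; H(\dim\mathcal{S}+\dim\mathcal{A})-\lambda H\;=\;\kappa H,$$
$$\epsilon^{D-d}=\exp\big((D-d)\log\epsilon\big)\le\exp\big(-\kappa H\,\lvert\log\epsilon\rvert\big)=\exp(-cH)\qquad(\epsilon<1).$$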

4.3.3 How to redistribute probability mass in latent planning

If the learned generator already tends to produce feasible trajectories, then sampling in the latent space yields feasible trajectories far more readily than blind sampling in the action space.
$$\tau_{t: t+H}=\mathcal{W}_\theta(z), \qquad z\sim f_\theta(s_t), \qquad P_{\mathrm{latent}}=(\mathcal{W}_\theta)_\# f_\theta.$$ $$P_{\mathrm{latent}}(\mathcal{M}_{\mathrm{traj}})=\Pr_{z\sim f_\theta(s_t)}[\mathcal{W}_\theta(z)\in\mathcal{M}_{\mathrm{traj}}]\ge 1-\delta.$$ $$\frac{\Pr_{z\sim f_\theta(s_t)}[\mathcal{W}_\theta(z)\in\mathcal{M}_{\mathrm{traj}}]}{\Pr_{a_{t: t+H}\sim\mathrm{Unif}(\mathcal{A}^{H})}[\Phi(a_{t: t+H})\in\mathcal{M}_{\mathrm{traj}}]}\ge \exp(cH)(1-\delta).$$

$\Phi$ is the rollout map induced by the system dynamics. The proposition does not say that latent planning is guaranteed to be optimal; it says that, provided the learned latent generator approximately covers the feasible manifold, the probability of producing a feasible trajectory is exponentially higher than under uniform action-space sampling.
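A toy Monte-Carlo experiment makes the proposition tangible. The "feasible manifold" below is just a curve in the plane, the "learned generator" is a sampler whose mass already sits near that curve, and uniform box sampling plays the role of blind action-space search; the curve, noise scale, and box are illustrative assumptions, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def is_feasible(x, eps=0.05):
    """Toy 'thin manifold': points within eps of the curve y = sin(x) on [0, 2*pi]."""
    return (0.0 <= x[0] <= 2 * np.pi) and abs(x[1] - np.sin(x[0])) <= eps

def uniform_sampler(n):
    # Blind search over a large box, standing in for Unif(A^H) pushed through the rollout map.
    return rng.uniform([-5.0, -5.0], [5.0 + 2 * np.pi, 5.0], size=(n, 2))

def latent_sampler(n):
    # 'Learned generator': samples concentrated near the feasible curve,
    # i.e. probability mass already placed close to M_traj.
    t = rng.uniform(0, 2 * np.pi, size=n)
    pts = np.stack([t, np.sin(t)], axis=1)
    return pts + 0.03 * rng.standard_normal(pts.shape)

n = 20_000
p_uniform = np.mean([is_feasible(x) for x in uniform_sampler(n)])
p_latent = np.mean([is_feasible(x) for x in latent_sampler(n)])
print(f"feasible-hit rate: uniform={p_uniform:.4f}, latent={p_latent:.4f}")
# The latent sampler hits the feasible set far more often; in high-dimensional,
# long-horizon trajectory spaces the gap grows roughly like exp(cH), as in the proposition.
```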

Appendix Proving Integration: Why Is Iterative Inference Still Needed? [Appendix A.2]

The appendix further points out that feasible does not mean high-value. Define the trajectory return $V(\tau)=\sum_{h=0}^{H-1}\gamma^h r(s_{t+h}, a_{t+h})$ and the $\varepsilon$-optimal set $\mathcal{M}_\varepsilon=\{\tau\in\mathcal{M}_{\mathrm{traj}}\mid V(\tau)\ge V^\star-\varepsilon\}$. Even if $P_{\mathrm{latent}}(\mathcal{M}_{\mathrm{traj}})$ is large, it does not follow that $P_{\mathrm{latent}}(\mathcal{M}_\varepsilon)$ has a constant lower bound. One-shot latent sampling with a fixed sample budget is therefore not guaranteed to find a near-optimal trajectory; WAV instead uses iterative inference, pushing the latent distribution toward high-value regions via value/SNR feedback.

4.3.4 Three modules and training objectives

Video generation module: generates future visual feature chunks conditioned on the language instruction and the multi-view history.
$$\hat{x}_{t: t+N}=\mathcal{W}(\{v_0^{(i)}, v_{\hat{t}}^{(i)}, z^{(i)}\}_{i}, \mathcal{T}(g)).$$

$\mathcal{T}(g)\in\mathbb{R}^{L_g\times d_t}$ comes from a frozen T5-XXL; $i\in\{h, l, r\}$ indexes the different camera views; $z^{(i)}\sim\mathcal{N}(0, I)$ is view-specific latent noise.
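Since the text encoder is a frozen, off-the-shelf component, obtaining $\mathcal{T}(g)$ can be sketched with the Hugging Face `transformers` API; the checkpoint name is an assumption, as the report only states "frozen T5-XXL".

```python
import torch
from transformers import AutoTokenizer, T5EncoderModel

# Checkpoint name is an assumption; the paper only says "frozen T5-XXL".
tokenizer = AutoTokenizer.from_pretrained("google/t5-v1_1-xxl")
text_encoder = T5EncoderModel.from_pretrained("google/t5-v1_1-xxl").eval()
for p in text_encoder.parameters():
    p.requires_grad_(False)  # the encoder stays frozen throughout training

@torch.no_grad()
def encode_instruction(g: str) -> torch.Tensor:
    """Return T(g) of shape (L_g, d_t): one embedding per instruction token."""
    tokens = tokenizer(g, return_tensors="pt")
    return text_encoder(**tokens).last_hidden_state.squeeze(0)

T_g = encode_instruction("open the drawer and place the bowl inside")
# T_g is then used as cross-attention context for the video generator W,
# together with per-view first/current frame features and view-specific noise z^(i).
```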

Action decoder: action tokens first cross-attend to the video features and then to the value embedding.
$$\mathbf{z}_{\mathrm{act}}^{(i)}=\mathcal{B}_{i}^{\mathrm{act}}(\mathbf{z}_{\mathrm{act}}^{(i-1)}, \operatorname{CrossAttn}(\mathbf{z}_{\mathrm{act}}^{(i-1)}, \mathbf{x}_i)), $$ $$\mathbf{a}_i=\mathcal{B}_{i}^{\mathrm{act}}(\mathbf{z}_{\mathrm{act}}^{(i)}, \operatorname{CrossAttn}(\mathbf{z}_{\mathrm{act}}^{(i)}, \mathbf{u}_i)).$$

$\mathbf{x}_i$ denotes the video tokens from the $i$-th visual transformer block; $\mathbf{u}_i$ is the trajectory value embedding.
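A minimal PyTorch sketch of one such block, matching the order of the two equations (video first, then value); the dimensions, head count, norm/residual placement, and MLP are illustrative assumptions rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class ActionDecoderBlock(nn.Module):
    """One block B_i^act: action tokens attend to video tokens x_i, then to value embedding u_i."""

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.attn_video = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.attn_value = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, z_act, x_i, u_i):
        # z_act: (B, T_act, d), x_i: (B, T_vid, d), u_i: (B, T_val, d)
        h, _ = self.attn_video(self.norm1(z_act), x_i, x_i)  # cross-attend to video tokens
        z_act = z_act + h
        h, _ = self.attn_value(self.norm2(z_act), u_i, u_i)  # then to the value embedding
        z_act = z_act + h
        return z_act + self.mlp(z_act)                        # refined action representation
```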

Training uses three stages of flow matching.
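For reference, the generic conditional flow-matching loss underlying such training looks like the sketch below; the linear interpolation path and uniform timestep sampling are standard choices and not necessarily WAV's exact recipe.

```python
import torch

def flow_matching_loss(model, x1, cond):
    """Generic (rectified-flow style) conditional flow-matching loss.

    x1   : clean target features (e.g. future video latents), shape (B, ...).
    cond : conditioning inputs (observations, T(g), proprioception, ...).
    """
    x0 = torch.randn_like(x1)                                 # source noise sample
    t = torch.rand(x1.shape[0], *([1] * (x1.dim() - 1)), device=x1.device)
    xt = (1.0 - t) * x0 + t * x1                              # point on the straight path
    target_velocity = x1 - x0                                 # velocity of that path
    pred_velocity = model(xt, t.flatten(), cond)              # network predicts the velocity field
    return torch.mean((pred_velocity - target_velocity) ** 2)
```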

4.3.5 Iterative latent inference

WAV maintains two Gaussians: video latent noise distribution and value latent noise distribution.
$$f_{\mathrm{vid}}^{(k)}=\mathcal{N}(\boldsymbol{\mu}_{\mathrm{vid}}^{(k)}, \operatorname{diag}((\boldsymbol{\sigma}_{\mathrm{vid}}^{(k)})^2)), $$ $$f_{\mathrm{val}}^{(k)}=\mathcal{N}(\boldsymbol{\mu}_{\mathrm{val}}^{(k)}, \operatorname{diag}((\boldsymbol{\sigma}_{\mathrm{val}}^{(k)})^2)).$$

$k$ is the iteration number. $M$ video noises are sampled in each round, and $N$ value noises are sampled for each video hypothesis.

Scoring function: the signal-to-noise ratio of the value prediction is used to measure whether a candidate future is both high-return and stable.
$$\operatorname{SNR}^{(m, n)}=\frac{\mathbb{E}[\mathbf{v}^{(m, n)}]}{\operatorname{Std}[\mathbf{v}^{(m, n)}]+\epsilon}, \qquad \phi^{(m)}=\max_{n\in\{1, \dots, N\}}\operatorname{SNR}^{(m, n)}.$$

$\epsilon$ is a numerical stability constant. The score of each video candidate is the most reliable one among its $N$ value estimates.
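A small NumPy helper for this scoring rule; the shape convention (each value estimate $\mathbf{v}^{(m,n)}$ is a vector over the horizon, with mean and standard deviation taken over its components) is an assumption made for illustration.

```python
import numpy as np

def snr_scores(values, eps=1e-6):
    """Score video candidates by the SNR of their value estimates.

    values : array of shape (M, N, H) -- for each of M video candidates and
             N value-noise samples, a predicted value profile over horizon H.
    Returns phi of shape (M,): each candidate's best (most reliable) SNR.
    """
    mean = values.mean(axis=-1)      # E[v^(m,n)]
    std = values.std(axis=-1)        # Std[v^(m,n)]
    snr = mean / (std + eps)         # SNR^(m,n)
    return snr.max(axis=1)           # phi^(m) = max_n SNR^(m,n)
```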

Elite update: Update the mean and variance with latent samples of top-$K_1$ / top-$K_2$.
$$\boldsymbol{\mu}_{\mathrm{vid}}^{(k)}=\frac{1}{K_1}\sum_{m\in\mathcal{E}_{\mathrm{vid}}}\mathbf{z}_{\mathrm{vid}}^{(m)}, \quad \boldsymbol{\sigma}_{\mathrm{vid}}^{(k)}=\sqrt{\frac{1}{K_1}\sum_{m\in\mathcal{E}_{\mathrm{vid}}}(\mathbf{z}_{\mathrm{vid}}^{(m)}-\boldsymbol{\mu}_{\mathrm{vid}}^{(k)})^2}.$$ $$\boldsymbol{\mu}^{(k)}\leftarrow\alpha\boldsymbol{\mu}^{(k)}+(1-\alpha)\boldsymbol{\mu}^{(k-1)}, \quad \boldsymbol{\sigma}^{(k)}\leftarrow\beta\boldsymbol{\sigma}^{(k)}+(1-\beta)\boldsymbol{\sigma}^{(k-1)}.$$

The update of value distribution is the same, except that the elite set is $\mathcal{E}_{\mathrm{val}}$, and top-$K_2$ is selected from $M\times N$ value samples.
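Putting the pieces together, here is a minimal NumPy sketch of the iterative latent inference loop, reusing the `snr_scores` helper above; `world_model` and `value_model` are stand-in callables, and all default hyperparameter values are illustrative assumptions (only $K=3$ is suggested by the paper's ablation).

```python
import numpy as np

def latent_update(mu, sigma, noises, scores, k_elite, alpha, beta):
    """Refit a diagonal Gaussian from the top-k_elite latent samples, then smooth."""
    elite = noises[np.argsort(scores)[-k_elite:]]
    mu_new, sigma_new = elite.mean(axis=0), elite.std(axis=0)
    mu = alpha * mu_new + (1 - alpha) * mu          # smooth the mean
    sigma = beta * sigma_new + (1 - beta) * sigma   # smooth the std
    return mu, sigma

def plan_one_step(world_model, value_model, mu_vid, sig_vid, mu_val, sig_val,
                  K=3, M=8, N=4, K1=3, K2=6, alpha=0.7, beta=0.7, seed=0):
    """K rounds of sampling, SNR scoring, and elite refitting of the two Gaussians."""
    rng = np.random.default_rng(seed)
    for _ in range(K):
        z_vid = mu_vid + sig_vid * rng.standard_normal((M,) + mu_vid.shape)
        z_val = mu_val + sig_val * rng.standard_normal((M, N) + mu_val.shape)
        futures = [world_model(z) for z in z_vid]                       # M candidate futures
        values = np.stack([[value_model(x, z) for z in zv]              # (M, N, H) value profiles
                           for x, zv in zip(futures, z_val)])
        phi = snr_scores(values)                                        # (M,) per-video scores
        mu_vid, sig_vid = latent_update(mu_vid, sig_vid, z_vid, phi, K1, alpha, beta)
        snr_flat = (values.mean(-1) / (values.std(-1) + 1e-6)).reshape(-1)
        mu_val, sig_val = latent_update(mu_val, sig_val,
                                        z_val.reshape(-1, *mu_val.shape),
                                        snr_flat, K2, alpha, beta)
    return mu_vid, sig_vid, mu_val, sig_val  # the action decoder consumes samples from these
```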

4.4 Implementation points

Real robot data: the appendix states that drawer-opening and bowl-organization each have about 300 successful trajectories, and towel-flattening has 2,000 successful trajectories; each trajectory contains two wrist cameras ($240\times320\times3$), a third-person top camera ($240\times424\times3$), and 14-dimensional robot joint states [Appendix B].
Reward design: follows the dense-reward idea of ReinboT but omits the sub-goal achievement term, retaining terms for task progress, behavioral smoothness, and task completion; there are 9 reward types and 16 scalar components in the two-arm setting [Appendix B].
Numerical stability: $\epsilon$ is added to the SNR denominator; after each latent distribution update, the mean and variance are smoothed with $\alpha, \beta$ to avoid distribution collapse.
Training resources: the paper reports 8 NVIDIA A100-SXM4-80GB GPUs with an Intel Xeon Platinum 8358 @ 2.60GHz CPU. Full-parameter fine-tuning on LIBERO takes about 5 days; the real Piper takes about 3 days per task [Appendix B].

5. Experiment

5.1 Experimental setup

Simulation dataset: the LIBERO benchmark with four suites (Spatial, Object, Goal, Long), testing spatial generalization, object generalization, goal-conditioned behavior, and long-horizon compositional tasks respectively.
Real robot: Piper dual-arm platform; tasks include bowl organization, towel flattening, and a long-horizon drawer task.
Baselines: on LIBERO, comparisons include Diffusion Policy, Octo, OpenVLA, SpatialVLA, the $\pi_0$ series, OpenVLA-OFT, VLA-Adapter, WorldVLA, CoT-VLA, FlowVLA, DreamVLA, UniVLA, GE-ACT, etc.; real-robot experiments mainly compare against GE-ACT.
Evaluation metric: success rate. Real-robot experiments use a strict binary success metric: a trial counts as successful only when the task is fully completed, with no partial credit.
Code / project page: the paper source gives the project page https://win-commit.github.io/wavpage/; no explicit GitHub URL appears in the text.

Training hyperparameters [Appendix B]

Module | Gradient clip | Steps | Warm-up | Batch | Learning rate | Weight decay | Caption dropout | Optimizer
Video training | 1.0 | 40000 | 1000 | 128 | $3e-4$ | $1e-5$ | 0.06 | Adam ($\beta_1=0.9, \beta_2=0.95, \beta_3=0.999$)
Value & action training | 1.0 | 30000 | 1000 | 128 | $5e-5$ | $1e-5$ | 0 | Adam ($\beta_1=0.9, \beta_2=0.95, \beta_3=0.999$)

Dense reward terms [Appendix B]

Wrist-view MSE: $c_{1, t}^b=\exp(-0.01\cdot\mathrm{MSE}(I_t^b, I_T^b))$; weight $+1/16$ each.
Wrist-view SSIM: $c_{2, t}^b=\exp(\mathrm{SSIM}(I_t^b, I_T^b)-1)$; weight $+1/16$ each.
Top-view MSE / SSIM: the corresponding MSE and SSIM similarities to the target frame for the top camera; weight $+1/16$ each.
Joint-state proximity: $c_{5, t}^b=\exp(-\|s_t^b-s_T^b\|_2)$; weight $+1/16$ each.
Joint/action velocity and acceleration penalties: $\sum_j|\Delta s_{t, j}^b|$, $\sum_j|\Delta^2 s_{t, j}^b|$, $\sum_j|\Delta a_{t, j}^b|$, $\sum_j|\Delta^2 a_{t, j}^b|$; joint penalties weighted $-1/16$ each, action penalties $-0.1/16$ each.
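As an illustration, here is a NumPy sketch of a subset of these terms (wrist-view MSE, joint-state proximity, and the smoothness penalties) for one arm; the SSIM terms would follow the same pattern via, e.g., `skimage.metrics.structural_similarity`, and anything beyond the definitions and weights listed above is an assumption.

```python
import numpy as np

def dense_reward_subset(img_t, img_T, s_t, s_T, ds, d2s, da, d2a):
    """Per-timestep reward from a subset of the dense-reward terms for one arm b.

    img_t, img_T : current and final wrist-view images, float arrays (H, W, 3).
    s_t, s_T     : current and final joint states.
    ds, d2s      : joint velocity and acceleration (first/second differences).
    da, d2a      : action velocity and acceleration.
    """
    c1 = np.exp(-0.01 * np.mean((img_t - img_T) ** 2))   # wrist-view MSE similarity
    c5 = np.exp(-np.linalg.norm(s_t - s_T))               # joint-state proximity
    joint_pen = np.sum(np.abs(ds)) + np.sum(np.abs(d2s))  # joint smoothness penalties
    act_pen = np.sum(np.abs(da)) + np.sum(np.abs(d2a))    # action smoothness penalties
    # Weights follow the list above: +1/16 per similarity term,
    # -1/16 per joint penalty term, -0.1/16 per action penalty term.
    return (c1 + c5) / 16.0 - joint_pen / 16.0 - 0.1 * act_pen / 16.0
```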

5.2 Main results

LIBERO

Model | Params | Spatial | Object | Goal | Long | Avg.
GE-ACT | 2B | 98.2 | 97.6 | 95.8 | 94.4 | 96.5
VLA-Adapter | 0.5B | 97.8 | 99.2 | 97.2 | 95.0 | 97.3
WAV (Ours) | 2.2B | 99.6 | 100.0 | 98.6 | 94.4 | 98.1
WAV w/o Latent Trajectory Planning | - | 99.0 | 99.6 | 95.0 | 91.8 | 96.4

The paper's key interpretation is that the average improvement comes from multiple suites, and the contribution of latent planning is most visible on the Long suite: removing latent trajectory planning drops the average by 1.7 points, with Long falling from 94.4 to 91.8.

real robot

Figure (real-robot quantitative results): WAV vs. GE-ACT, each result averaged over 15 trials. The main text reports the average success rate rising from 35.6% to 75.6%.

Figure (real-robot qualitative comparison): the paper notes that common GE-ACT failures include misalignment with the drawer handle, unstable grasping, and weak spatial grounding, while WAV's multi-step behavior is more coherent.

5.3 Ablation experiment

Figure ($K, M, N$ ablation): increasing $K$ from 1 to 5 brings a clear gain, with diminishing returns by $K=10$; $M$ has a stronger impact on performance, and $N$ saturates earlier.

Figure (smoothing and elite counts): left, smoothing parameters $\alpha, \beta$; right, elite counts $K_1, K_2$. Too little smoothing leads to instability, and a very small elite count also reduces stability.

Figure (performance-efficiency trade-off): increasing $K$ raises the success rate but also increases inference time and GPU memory; the paper regards $K=3$ as a good compromise.

5.4 Supplementary experiments and appendix figures

Figure (appendix ablations): additional trend plots for different $K, M, N$ [Appendix B].

Figure (value trajectories): comparison of inferred state-value trajectories against ground truth on the real robot and in LIBERO [Appendix B].

Figure (predicted vs. ground-truth videos): qualitative comparison of predicted videos and ground truth on two LIBERO tasks [Appendix B].

6. Analysis and Discussion

6.1 Analysis and explanation of the results given in the paper

6.2 Limitations stated by the authors

The main limitations clearly stated in Conclusion are deployment time and storage overhead. The paper does not expand on failure taxonomy or safety boundaries in the main text, so this report does not add additional subjective limitations.

6.3 Applicable boundaries and future work

6.4 Reproducibility audit

Item | Status | Description
Source code structure | Obtained | The arXiv e-print contains the main tex, bib, style files, and figures.
Figures | Extracted | PNGs were copied and PDF figures converted to PNG for inclusion in this report's figures/ directory.
Training hyperparameters | Mostly complete | The appendix gives the main hyperparameters for video and value/action training.
Hardware / training time | Clear | 8× A100-SXM4-80GB; LIBERO ~5 days, real Piper ~3 days per task.
Real-robot data | To be released | The data scale and sensor configuration are documented, but the paper says the dataset will be made public after publication.
Official code | Not given in the source text | The source provides the project page but no GitHub repository URL in the LaTeX text.