
World-Value-Action Model: Implicit Planning for Vision-Language-Action Systems

Authors: Runze Li, Hongyin Zhang, Junxi Jin, Qixin Zeng, Zifeng Zhuang, Yiqi Tang, Shangke Lyu, Donglin Wang

Organization: Westlake University; Nanjing University Suzhou Campus

Publication: arXiv preprint, 2026

arXiv: 2604.14732 | Project Page: win-commit.github.io/wavpage

1. Quick overview of the paper

One-sentence summary: WAV reframes VLA action generation from "directly predicting actions" to "implicit planning in the latent space of future trajectories": it first generates candidate future visual trajectories, then uses a trajectory value to evaluate their long-term benefit, and finally decodes high-value, dynamically feasible latent trajectories into actions.

Difficulty rating: ★★★★☆. Requires familiarity with VLA, world models, flow matching, MPC/MPPI, trajectory value, and basic high-dimensional probability / covering arguments.

Keywords: Vision-Language-Action, World Model, Trajectory Value, Latent Planning, Flow Matching, MPPI-style Inference.

Reading positioning

What problem does the paper solve? Most existing VLAs predict actions directly from the current observation and instruction, lacking any reasoning about or evaluation of long-term future trajectories, so errors tend to accumulate in compositional / long-horizon manipulation.
The authors' approach: Construct three World-Value-Action modules: the world/video generator predicts future visual features, the value module evaluates the long-term utility of each trajectory, and the action decoder generates actions from the optimized latent video/value features.
Most important results: The LIBERO average success rate is 98.1; removing latent trajectory planning drops it to 96.4, and the Long suite drops from 94.4 to 91.8. On real-robot tasks, the average success rate rises from 35.6% for GE-ACT to 75.6% for WAV.
Things to note when reading: The core is not simply "one more world model", but moving the action search into the learned latent trajectory distribution and repeatedly reweighting it with value/SNR/elite selection.

Core contribution list

2. Motivation

2.1 What problem should be solved?

The paper focuses on language-conditioned robotic manipulation. At each step, the model receives a visual observation $o_t$, a language instruction $g$, and optionally a proprioceptive state $p_t$, and outputs an action $a_t$. Existing VLAs obtain semantic generalization through VLM pre-training, but most methods still treat decision-making as direct action prediction: given the current context, output the current action or a short action window.

This setting works in short tasks, but two problems arise in long-horizon tasks: first, the model has no explicit mechanism to evaluate what future states the current action will lead to; second, small early errors in multi-step tasks accumulate into later failures. In a real-robot example, the drawer task requires opening a drawer, placing objects, and closing the drawer; if the first step misaligns with the drawer handle, subsequent steps fail even if their action form is correct.

2.2 Limitations of existing methods

2.3 The solution ideas of this article

The high-level insight of WAV is that planning can be used not as an external optimizer, but as an inference process within a structured generative model. The model learns a latent generator that produces plausible future trajectories, and then learns a trajectory value function to evaluate these futures. The inference stage repeatedly moves the latent noise distribution toward areas of high value and low uncertainty.

4. Detailed explanation of method

4.1 Method overview

The data flow of WAV can be summarized as follows: the inputs are multi-view observations, a language instruction, and the robot state; the video generation module produces candidate future visual features; the trajectory value module estimates the long-term return and its stability for each candidate future; latent planning selects elite samples in the video/value latent distributions and updates their means and variances; the action decoder fuses the optimized video/value features to output actions.

Figure: WAV pipeline, covering the video/world module, the trajectory value module, and the action decoding module (the image is a standalone PNG in the paper source).

Input: observation o_t, proprioception p_t, language g
Encode g with frozen T5-XXL -> T(g)
Initialize video latent Gaussian f_vid^(0), value latent Gaussian f_val^(0)
for k = 1..K:
    sample M video noises z_vid ~ f_vid^(k-1)
    generate future visual features x = W(z_vid, observations, T(g))
    for each x:
        sample N value noises z_val ~ f_val^(k-1)
        estimate trajectory values v = V(x, z_val)
    score by SNR(v)
    update f_vid^(k) from top-K1 video samples
    update f_val^(k) from top-K2 value samples
    smooth mean/std with alpha, beta
sample optimized z_vid*, z_val*
decode action a_t = A(video features, value features)

4.2 Method evolution

The evolution logic of the paper can be summarized as:

Stage | Method form | Improvement motivation
Direct VLA | $\pi_\theta(a_{t: t+H}\mid o_t, p_t, g)$ directly predicts short action sequences. | Exploits language and visual pre-training, but lacks trajectory-level evaluation of the future.
MPC/MPPI | Sample candidate future action sequences, roll them out, and select the best by reward/value. | Provides long-horizon reasoning, but action-space search hits feasible trajectories with very low probability over high-dimensional long horizons.
WAV | Sample and reweight in the learned latent trajectory space, then decode the action. | Uses the generative model to concentrate probability mass on physically and semantically feasible future trajectories.

4.3 Core design and mathematical derivation

4.3.1 Basic definitions of VLA and MPC

VLA direct action prediction: given historical observations, states, actions, and language, predict actions step by step.
$$\pi_\theta(a_{1: T}\mid o_{1: T}, p_{1: T}, g)=\prod_{t=1}^{T}\pi_\theta(a_t\mid o_{1: t}, p_{1: t}, a_{1: t-1}, g).$$

Here $o_t$ is the visual observation, $p_t$ the proprioceptive state, $g$ the language instruction, and $a_t$ the action. In practice, implementations predict a short action window $a_{t: t+H}$ at a time.

MPC objective: select the action sequence that maximizes the expected cumulative discounted reward over a finite horizon.
$$a_{t: t+H}^\star=\arg\max_{a_{t: t+H}}\mathbb{E}\left[\sum_{i=0}^{H}\gamma^iR(s_{t+i}, a_{t+i})\right].$$

This explains why the paper needs both future prediction and trajectory value; but searching directly in $\mathcal{A}^H$ runs into a feasibility bottleneck.
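To make the contrast concrete, here is a minimal sketch of a generic MPPI-style search directly in action space. This is not WAV's method; `rollout_fn`, the horizon, and the hyperparameters are placeholder assumptions, and the sketch only illustrates the loop that the feasibility argument in the next subsection targets.

```python
import numpy as np

def mppi_action_search(rollout_fn, act_dim, horizon=16, n_samples=256,
                       n_iters=3, temperature=1.0, seed=0):
    """Generic MPPI-style search over action sequences (baseline, not WAV).

    rollout_fn(actions) -> scalar return of one (horizon, act_dim) sequence;
    it stands in for a simulator or learned dynamics model plus reward.
    """
    rng = np.random.default_rng(seed)
    mean = np.zeros((horizon, act_dim))
    std = np.ones((horizon, act_dim))
    for _ in range(n_iters):
        # Sample candidate action sequences around the current mean.
        noise = rng.standard_normal((n_samples, horizon, act_dim))
        candidates = mean + std * noise
        returns = np.array([rollout_fn(a) for a in candidates])
        # Softmax-weight candidates by return and update the sampling mean.
        weights = np.exp((returns - returns.max()) / temperature)
        weights /= weights.sum()
        mean = np.einsum("m,mha->ha", weights, candidates)
    return mean  # receding-horizon control executes the first step(s)
```

The search space here is $\mathcal{A}^H$ itself: every additional horizon step inflates the volume the sampling Gaussian must cover, which is exactly the exponentially small feasible-mass problem formalized next.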

4.3.2 Why action-space search is difficult

Core claim: the long-horizon trajectory space is enormous, but the set of feasible trajectories satisfying physical, contact, and semantic constraints is only a "thin manifold".
$$\mathcal{X}=\mathcal{S}^{H}\times\mathcal{A}^{H}, \qquad D=H(\dim\mathcal{S}+\dim\mathcal{A}).$$ $$\frac{\mu(\mathcal{N}_\epsilon(\mathcal{M}_{\mathrm{traj}}))}{\mu(\mathcal{X})}\le \exp(-cH).$$

$\mathcal{M}_{\mathrm{traj}}$ is a feasible trajectory set; $\mathcal{N}_\epsilon$ is its $\epsilon$-neighborhood. The conclusion is: if you randomly search for candidates in the entire trajectory/action space, the probability of encountering an approximately feasible trajectory will decrease exponentially with $H$.

Appendix Proof Integration: Covering Number Proof Ideas [Appendix A.1]

In the appendix, the authors treat $\mathcal{M}_{\mathrm{traj}}$ as a compact subset of intrinsic dimension $d$. Covering it with balls of radius $\epsilon$, the covering number satisfies $N_\epsilon\le C_1\epsilon^{-d}$. Each ball in $D$-dimensional space has volume proportional to $\epsilon^D$, so the neighborhood volume is at most of order $\epsilon^{D-d}$. Since $D=H(\dim\mathcal{S}+\dim\mathcal{A})$, and assuming $d\le\lambda H$ with $\lambda<\dim\mathcal{S}+\dim\mathcal{A}$, we get $D-d\ge\kappa H$, and hence $\epsilon^{D-d}=\exp((D-d)\log\epsilon)\le\exp(-cH)$.
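Written out, the chain of inequalities is (constants absorbed into $C_1$ and $c$; this only restates the appendix argument in one line):
$$\mu(\mathcal{N}_\epsilon(\mathcal{M}_{\mathrm{traj}}))\;\lesssim\; N_\epsilon\cdot\epsilon^{D}\;\le\; C_1\,\epsilon^{-d}\,\epsilon^{D}\;=\; C_1\,\epsilon^{D-d},\qquad D-d\;\ge\; H(\dim\mathcal{S}+\dim\mathcal{A})-\lambda H\;=\;\kappa H,$$
$$\epsilon^{D-d}=\exp\big((D-d)\log\epsilon\big)\le\exp\big(-\kappa H\,\lvert\log\epsilon\rvert\big)=\exp(-cH)\qquad(\epsilon<1).$$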

4.3.3 How to redistribute probability mass in latent planning

If the learned generator already tends to produce feasible trajectories, then sampling in the latent space yields feasible trajectories far more readily than blind sampling in the action space.
$$\tau_{t: t+H}=\mathcal{W}_\theta(z), \qquad z\sim f_\theta(s_t), \qquad P_{\mathrm{latent}}=(\mathcal{W}_\theta)_\# f_\theta.$$ $$P_{\mathrm{latent}}(\mathcal{M}_{\mathrm{traj}})=\Pr_{z\sim f_\theta(s_t)}[\mathcal{W}_\theta(z)\in\mathcal{M}_{\mathrm{traj}}]\ge 1-\delta.$$ $$\frac{\Pr_{z\sim f_\theta(s_t)}[\mathcal{W}_\theta(z)\in\mathcal{M}_{\mathrm{traj}}]}{\Pr_{a_{t: t+H}\sim\mathrm{Unif}(\mathcal{A}^{H})}[\Phi(a_{t: t+H})\in\mathcal{M}_{\mathrm{traj}}]}\ge \exp(cH)(1-\delta).$$

$\Phi$ is the rollout map induced by the system dynamics. The proposition does not say that latent planning is guaranteed to be optimal; it says that, provided the learned latent generator approximately covers the feasible manifold, the probability of producing a feasible trajectory is exponentially higher than under uniform action-space sampling.
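A toy Monte-Carlo experiment makes the proposition tangible. The "feasible manifold" below is just a curve in the plane, the "learned generator" is a sampler whose mass already sits near that curve, and uniform box sampling plays the role of blind action-space search; the curve, noise scale, and box are illustrative assumptions, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def is_feasible(x, eps=0.05):
    """Toy 'thin manifold': points within eps of the curve y = sin(x) on [0, 2*pi]."""
    return (0.0 <= x[0] <= 2 * np.pi) and abs(x[1] - np.sin(x[0])) <= eps

def uniform_sampler(n):
    # Blind search over a large box, standing in for Unif(A^H) pushed through the rollout map.
    return rng.uniform([-5.0, -5.0], [5.0 + 2 * np.pi, 5.0], size=(n, 2))

def latent_sampler(n):
    # 'Learned generator': samples concentrated near the feasible curve,
    # i.e. probability mass already placed close to M_traj.
    t = rng.uniform(0, 2 * np.pi, size=n)
    pts = np.stack([t, np.sin(t)], axis=1)
    return pts + 0.03 * rng.standard_normal(pts.shape)

n = 20_000
p_uniform = np.mean([is_feasible(x) for x in uniform_sampler(n)])
p_latent = np.mean([is_feasible(x) for x in latent_sampler(n)])
print(f"feasible-hit rate: uniform={p_uniform:.4f}, latent={p_latent:.4f}")
# The latent sampler hits the feasible set far more often; in high-dimensional,
# long-horizon trajectory spaces the gap grows roughly like exp(cH), as in the proposition.
```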

Appendix Proving Integration: Why Is Iterative Inference Still Needed? [Appendix A.2]

The appendix further points out that feasible does not mean high-value. Define the trajectory return $V(\tau)=\sum_{h=0}^{H-1}\gamma^h r(s_{t+h}, a_{t+h})$ and the $\varepsilon$-optimal set $\mathcal{M}_\varepsilon=\{\tau\in\mathcal{M}_{\mathrm{traj}}\mid V(\tau)\ge V^\star-\varepsilon\}$. Even if $P_{\mathrm{latent}}(\mathcal{M}_{\mathrm{traj}})$ is large, it does not follow that $P_{\mathrm{latent}}(\mathcal{M}_\varepsilon)$ has a constant lower bound. One-shot latent sampling with a fixed sample budget is therefore not guaranteed to find a near-optimal trajectory; WAV instead uses iterative inference, pushing the latent distribution toward high-value regions via value/SNR feedback.

4.3.4 Three modules and training objectives

Video generation module: generates future visual feature chunks conditioned on the language instruction and the multi-view history.
$$\hat{x}_{t: t+N}=\mathcal{W}(\{v_0^{(i)}, v_{\hat{t}}^{(i)}, z^{(i)}\}_{i}, \mathcal{T}(g)).$$

$\mathcal{T}(g)\in\mathbb{R}^{L_g\times d_t}$ comes from a frozen T5-XXL; $i\in\{h, l, r\}$ indexes the different camera views; $z^{(i)}\sim\mathcal{N}(0, I)$ is view-specific latent noise.
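Since the text encoder is a frozen, off-the-shelf component, obtaining $\mathcal{T}(g)$ can be sketched with the Hugging Face `transformers` API; the checkpoint name is an assumption, as the report only states "frozen T5-XXL".

```python
import torch
from transformers import AutoTokenizer, T5EncoderModel

# Checkpoint name is an assumption; the paper only says "frozen T5-XXL".
tokenizer = AutoTokenizer.from_pretrained("google/t5-v1_1-xxl")
text_encoder = T5EncoderModel.from_pretrained("google/t5-v1_1-xxl").eval()
for p in text_encoder.parameters():
    p.requires_grad_(False)  # the encoder stays frozen throughout training

@torch.no_grad()
def encode_instruction(g: str) -> torch.Tensor:
    """Return T(g) of shape (L_g, d_t): one embedding per instruction token."""
    tokens = tokenizer(g, return_tensors="pt")
    return text_encoder(**tokens).last_hidden_state.squeeze(0)

T_g = encode_instruction("open the drawer and place the bowl inside")
# T_g is then used as cross-attention context for the video generator W,
# together with per-view first/current frame features and view-specific noise z^(i).
```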

Action decoder: action tokens first cross-attend to the video features and then to the value embedding.
$$\mathbf{z}_{\mathrm{act}}^{(i)}=\mathcal{B}_{i}^{\mathrm{act}}(\mathbf{z}_{\mathrm{act}}^{(i-1)}, \operatorname{CrossAttn}(\mathbf{z}_{\mathrm{act}}^{(i-1)}, \mathbf{x}_i)), $$ $$\mathbf{a}_i=\mathcal{B}_{i}^{\mathrm{act}}(\mathbf{z}_{\mathrm{act}}^{(i)}, \operatorname{CrossAttn}(\mathbf{z}_{\mathrm{act}}^{(i)}, \mathbf{u}_i)).$$

$\mathbf{x}_i$ denotes the video tokens from the $i$-th visual transformer block; $\mathbf{u}_i$ is the trajectory value embedding.
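A minimal PyTorch sketch of one such block, matching the order of the two equations (video first, then value); the dimensions, head count, norm/residual placement, and MLP are illustrative assumptions rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class ActionDecoderBlock(nn.Module):
    """One block B_i^act: action tokens attend to video tokens x_i, then to value embedding u_i."""

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.attn_video = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.attn_value = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, z_act, x_i, u_i):
        # z_act: (B, T_act, d), x_i: (B, T_vid, d), u_i: (B, T_val, d)
        h, _ = self.attn_video(self.norm1(z_act), x_i, x_i)  # cross-attend to video tokens
        z_act = z_act + h
        h, _ = self.attn_value(self.norm2(z_act), u_i, u_i)  # then to the value embedding
        z_act = z_act + h
        return z_act + self.mlp(z_act)                        # refined action representation
```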

Training uses three stages of flow matching.
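For reference, the generic conditional flow-matching loss underlying such training looks like the sketch below; the linear interpolation path and uniform timestep sampling are standard choices and not necessarily WAV's exact recipe.

```python
import torch

def flow_matching_loss(model, x1, cond):
    """Generic (rectified-flow style) conditional flow-matching loss.

    x1   : clean target features (e.g. future video latents), shape (B, ...).
    cond : conditioning inputs (observations, T(g), proprioception, ...).
    """
    x0 = torch.randn_like(x1)                                 # source noise sample
    t = torch.rand(x1.shape[0], *([1] * (x1.dim() - 1)), device=x1.device)
    xt = (1.0 - t) * x0 + t * x1                              # point on the straight path
    target_velocity = x1 - x0                                 # velocity of that path
    pred_velocity = model(xt, t.flatten(), cond)              # network predicts the velocity field
    return torch.mean((pred_velocity - target_velocity) ** 2)
```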

4.3.5 Iterative latent inference

WAV maintains two Gaussians: video latent noise distribution and value latent noise distribution.
$$f_{\mathrm{vid}}^{(k)}=\mathcal{N}(\boldsymbol{\mu}_{\mathrm{vid}}^{(k)}, \operatorname{diag}((\boldsymbol{\sigma}_{\mathrm{vid}}^{(k)})^2)), $$ $$f_{\mathrm{val}}^{(k)}=\mathcal{N}(\boldsymbol{\mu}_{\mathrm{val}}^{(k)}, \operatorname{diag}((\boldsymbol{\sigma}_{\mathrm{val}}^{(k)})^2)).$$

$k$ is the iteration number. $M$ video noises are sampled in each round, and $N$ value noises are sampled for each video hypothesis.

Scoring function: the signal-to-noise ratio of the value prediction is used to measure whether a candidate future is both high-return and stable.
$$\operatorname{SNR}^{(m, n)}=\frac{\mathbb{E}[\mathbf{v}^{(m, n)}]}{\operatorname{Std}[\mathbf{v}^{(m, n)}]+\epsilon}, \qquad \phi^{(m)}=\max_{n\in\{1, \dots, N\}}\operatorname{SNR}^{(m, n)}.$$

$\epsilon$ is a numerical stability constant. The score of each video candidate is the most reliable one among its $N$ value estimates.
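A small NumPy helper for this scoring rule; the shape convention (each value estimate $\mathbf{v}^{(m,n)}$ is a vector over the horizon, with mean and standard deviation taken over its components) is an assumption made for illustration.

```python
import numpy as np

def snr_scores(values, eps=1e-6):
    """Score video candidates by the SNR of their value estimates.

    values : array of shape (M, N, H) -- for each of M video candidates and
             N value-noise samples, a predicted value profile over horizon H.
    Returns phi of shape (M,): each candidate's best (most reliable) SNR.
    """
    mean = values.mean(axis=-1)      # E[v^(m,n)]
    std = values.std(axis=-1)        # Std[v^(m,n)]
    snr = mean / (std + eps)         # SNR^(m,n)
    return snr.max(axis=1)           # phi^(m) = max_n SNR^(m,n)
```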

Elite update: Update the mean and variance with latent samples of top-$K_1$ / top-$K_2$.
$$\boldsymbol{\mu}_{\mathrm{vid}}^{(k)}=\frac{1}{K_1}\sum_{m\in\mathcal{E}_{\mathrm{vid}}}\mathbf{z}_{\mathrm{vid}}^{(m)}, \quad \boldsymbol{\sigma}_{\mathrm{vid}}^{(k)}=\sqrt{\frac{1}{K_1}\sum_{m\in\mathcal{E}_{\mathrm{vid}}}(\mathbf{z}_{\mathrm{vid}}^{(m)}-\boldsymbol{\mu}_{\mathrm{vid}}^{(k)})^2}.$$ $$\boldsymbol{\mu}^{(k)}\leftarrow\alpha\boldsymbol{\mu}^{(k)}+(1-\alpha)\boldsymbol{\mu}^{(k-1)}, \quad \boldsymbol{\sigma}^{(k)}\leftarrow\beta\boldsymbol{\sigma}^{(k)}+(1-\beta)\boldsymbol{\sigma}^{(k-1)}.$$

The update of value distribution is the same, except that the elite set is $\mathcal{E}_{\mathrm{val}}$, and top-$K_2$ is selected from $M\times N$ value samples.
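Putting the pieces together, here is a minimal NumPy sketch of the iterative latent inference loop, reusing the `snr_scores` helper above; `world_model` and `value_model` are stand-in callables, and all default hyperparameter values are illustrative assumptions (only $K=3$ is suggested by the paper's ablation).

```python
import numpy as np

def latent_update(mu, sigma, noises, scores, k_elite, alpha, beta):
    """Refit a diagonal Gaussian from the top-k_elite latent samples, then smooth."""
    elite = noises[np.argsort(scores)[-k_elite:]]
    mu_new, sigma_new = elite.mean(axis=0), elite.std(axis=0)
    mu = alpha * mu_new + (1 - alpha) * mu          # smooth the mean
    sigma = beta * sigma_new + (1 - beta) * sigma   # smooth the std
    return mu, sigma

def plan_one_step(world_model, value_model, mu_vid, sig_vid, mu_val, sig_val,
                  K=3, M=8, N=4, K1=3, K2=6, alpha=0.7, beta=0.7, seed=0):
    """K rounds of sampling, SNR scoring, and elite refitting of the two Gaussians."""
    rng = np.random.default_rng(seed)
    for _ in range(K):
        z_vid = mu_vid + sig_vid * rng.standard_normal((M,) + mu_vid.shape)
        z_val = mu_val + sig_val * rng.standard_normal((M, N) + mu_val.shape)
        futures = [world_model(z) for z in z_vid]                       # M candidate futures
        values = np.stack([[value_model(x, z) for z in zv]              # (M, N, H) value profiles
                           for x, zv in zip(futures, z_val)])
        phi = snr_scores(values)                                        # (M,) per-video scores
        mu_vid, sig_vid = latent_update(mu_vid, sig_vid, z_vid, phi, K1, alpha, beta)
        snr_flat = (values.mean(-1) / (values.std(-1) + 1e-6)).reshape(-1)
        mu_val, sig_val = latent_update(mu_val, sig_val,
                                        z_val.reshape(-1, *mu_val.shape),
                                        snr_flat, K2, alpha, beta)
    return mu_vid, sig_vid, mu_val, sig_val  # the action decoder consumes samples from these
```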

4.4 Implementation points

Real robot data: the appendix states that drawer-opening and bowl-organization each have about 300 successful trajectories, and towel-flattening has 2,000 successful trajectories; each trajectory contains two wrist cameras ($240\times320\times3$), a third-person top camera ($240\times424\times3$), and 14-dimensional robot joint states [Appendix B].
Reward design: follows the dense-reward idea of ReinboT but omits the sub-goal achievement term, retaining terms for task progress, behavioral smoothness, and task completion; there are 9 reward types and 16 scalar components in the two-arm setting [Appendix B].
Numerical stability: $\epsilon$ is added to the SNR denominator; after each latent distribution update, the mean and variance are smoothed with $\alpha, \beta$ to avoid distribution collapse.
Training resources: the paper reports 8 NVIDIA A100-SXM4-80GB GPUs with an Intel Xeon Platinum 8358 @ 2.60GHz CPU. Full-parameter fine-tuning on LIBERO takes about 5 days; the real Piper takes about 3 days per task [Appendix B].

5. Experiment

5.1 Experimental setup

Simulation dataset: the LIBERO benchmark with four suites (Spatial, Object, Goal, Long), testing spatial generalization, object generalization, goal-conditioned behavior, and long-horizon compositional tasks respectively.
Real robot: Piper dual-arm platform; tasks include bowl organization, towel flattening, and a long-horizon drawer task.
Baselines: on LIBERO, comparisons include Diffusion Policy, Octo, OpenVLA, SpatialVLA, the $\pi_0$ series, OpenVLA-OFT, VLA-Adapter, WorldVLA, CoT-VLA, FlowVLA, DreamVLA, UniVLA, GE-ACT, etc.; real-robot experiments mainly compare against GE-ACT.
Evaluation metric: success rate. Real-robot experiments use a strict binary success metric: a trial counts as successful only when the task is fully completed, with no partial credit.
Code / project page: the paper source gives the project page https://win-commit.github.io/wavpage/; no explicit GitHub URL appears in the text.

Training hyperparameters [Appendix B]

Module | Gradient clip | Steps | Warm-up | Batch | Learning rate | Weight decay | Caption dropout | Optimizer
Video training | 1.0 | 40000 | 1000 | 128 | $3e-4$ | $1e-5$ | 0.06 | Adam ($\beta_1=0.9, \beta_2=0.95, \beta_3=0.999$)
Value & action training | 1.0 | 30000 | 1000 | 128 | $5e-5$ | $1e-5$ | 0 | Adam ($\beta_1=0.9, \beta_2=0.95, \beta_3=0.999$)

Dense reward terms [Appendix B]

Wrist-view MSE: $c_{1, t}^b=\exp(-0.01\cdot\mathrm{MSE}(I_t^b, I_T^b))$; weight $+1/16$ each.
Wrist-view SSIM: $c_{2, t}^b=\exp(\mathrm{SSIM}(I_t^b, I_T^b)-1)$; weight $+1/16$ each.
Top-view MSE / SSIM: the corresponding MSE and SSIM similarities to the target frame for the top camera; weight $+1/16$ each.
Joint-state proximity: $c_{5, t}^b=\exp(-\|s_t^b-s_T^b\|_2)$; weight $+1/16$ each.
Joint/action velocity and acceleration penalties: $\sum_j|\Delta s_{t, j}^b|$, $\sum_j|\Delta^2 s_{t, j}^b|$, $\sum_j|\Delta a_{t, j}^b|$, $\sum_j|\Delta^2 a_{t, j}^b|$; joint penalties weighted $-1/16$ each, action penalties $-0.1/16$ each.
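As an illustration, here is a NumPy sketch of a subset of these terms (wrist-view MSE, joint-state proximity, and the smoothness penalties) for one arm; the SSIM terms would follow the same pattern via, e.g., `skimage.metrics.structural_similarity`, and anything beyond the definitions and weights listed above is an assumption.

```python
import numpy as np

def dense_reward_subset(img_t, img_T, s_t, s_T, ds, d2s, da, d2a):
    """Per-timestep reward from a subset of the dense-reward terms for one arm b.

    img_t, img_T : current and final wrist-view images, float arrays (H, W, 3).
    s_t, s_T     : current and final joint states.
    ds, d2s      : joint velocity and acceleration (first/second differences).
    da, d2a      : action velocity and acceleration.
    """
    c1 = np.exp(-0.01 * np.mean((img_t - img_T) ** 2))   # wrist-view MSE similarity
    c5 = np.exp(-np.linalg.norm(s_t - s_T))               # joint-state proximity
    joint_pen = np.sum(np.abs(ds)) + np.sum(np.abs(d2s))  # joint smoothness penalties
    act_pen = np.sum(np.abs(da)) + np.sum(np.abs(d2a))    # action smoothness penalties
    # Weights follow the list above: +1/16 per similarity term,
    # -1/16 per joint penalty term, -0.1/16 per action penalty term.
    return (c1 + c5) / 16.0 - joint_pen / 16.0 - 0.1 * act_pen / 16.0
```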

5.2 Main results

LIBERO

Model | Params | Spatial | Object | Goal | Long | Avg.
GE-ACT | 2B | 98.2 | 97.6 | 95.8 | 94.4 | 96.5
VLA-Adapter | 0.5B | 97.8 | 99.2 | 97.2 | 95.0 | 97.3
WAV (Ours) | 2.2B | 99.6 | 100.0 | 98.6 | 94.4 | 98.1
WAV w/o Latent Trajectory Planning | - | 99.0 | 99.6 | 95.0 | 91.8 | 96.4

The paper's key interpretation is that the average improvement comes from multiple suites, and the contribution of latent planning is most visible on the Long suite: removing latent trajectory planning drops the average by 1.7 points, with Long falling from 94.4 to 91.8.

real robot

Figure (real-robot quantitative results): WAV vs. GE-ACT, each result averaged over 15 trials. The main text reports the average success rate rising from 35.6% to 75.6%.

Figure (real-robot qualitative comparison): the paper notes that common GE-ACT failures include misalignment with the drawer handle, unstable grasping, and weak spatial grounding, while WAV's multi-step behavior is more coherent.

5.3 Ablation experiment

Figure ($K, M, N$ ablation): increasing $K$ from 1 to 5 brings a clear gain, with diminishing returns by $K=10$; $M$ has a stronger impact on performance, and $N$ saturates earlier.

Figure (smoothing and elite counts): left, smoothing parameters $\alpha, \beta$; right, elite counts $K_1, K_2$. Too little smoothing leads to instability, and a very small elite count also reduces stability.

Figure (performance-efficiency trade-off): increasing $K$ raises the success rate but also increases inference time and GPU memory; the paper regards $K=3$ as a good compromise.

5.4 Supplementary experiments and appendix figures

Figure (appendix ablations): additional trend plots for different $K, M, N$ [Appendix B].

Figure (value trajectories): comparison of inferred state-value trajectories against ground truth on the real robot and in LIBERO [Appendix B].

Figure (predicted vs. ground-truth videos): qualitative comparison of predicted videos and ground truth on two LIBERO tasks [Appendix B].

6. Analysis and Discussion

6.1 Analysis and explanation of the results given in the paper

6.2 Limitations stated by the authors

The main limitations clearly stated in Conclusion are deployment time and storage overhead. The paper does not expand on failure taxonomy or safety boundaries in the main text, so this report does not add additional subjective limitations.

6.3 Applicable boundaries and future work

6.4 Reproducibility audit

Item | Status | Description
Source code structure | Obtained | The arXiv e-print contains the main tex, bib, style files, and figures.
Figures | Extracted | PNGs were copied and PDF figures converted to PNG for inclusion in this report's figures/ directory.
Training hyperparameters | Mostly complete | The appendix gives the main hyperparameters for video and value/action training.
Hardware / training time | Clear | 8× A100-SXM4-80GB; LIBERO ~5 days, real Piper ~3 days per task.
Real-robot data | To be released | The data scale and sensor configuration are documented, but the paper says the dataset will be made public after publication.
Official code | Not given in the source text | The source provides the project page but no GitHub repository URL in the LaTeX text.