villa-X: Enhancing Latent Action Modeling in Vision-Language-Action Models

Reading report: villa-X is a paper on latent action modeling in Vision-Language-Action (VLA) models. Its thesis is that latent actions should not only compress visual changes but also be aligned with real physical actions through robot proprioception, and that the policy should explicitly plan latent actions first and then generate robot actions conditioned on them.

arXiv: 2507.23682v3 | VLA | Latent Action | Joint Diffusion | SIMPLER / LIBERO / Real Robots

Authors: Xiaoyu Chen, Hangxing Wei, Pushi Zhang, Chuheng Zhang, Kaixin Wang, et al.

Institutions: Microsoft Research, Tsinghua University, Wuhan University, HKUST, Nanjing University

Code: https://github.com/microsoft/villa-x

Project page: https://aka.ms/villa-x

1. Quick overview of the paper

What problem does the paper address?	Existing VLAs exploit action-free video data through latent actions, but latent actions are usually learned only from visual reconstruction and therefore tend to miss motions that change few pixels yet are critical for control, such as end-effector rotation and gripper opening and closing. In addition, many methods use latent actions only as an intermediate pretraining task or as independent tokens, and do not fully exploit them for robot action generation. villa-X targets two questions: how to learn latent actions that are more physically grounded, and how to connect them to the VLA policy more effectively.
The authors' approach	Two components. First, add a proprioceptive FDM to the Latent Action Model (LAM), so that the latent action not only reconstructs future images but also predicts future robot proprio states and actions, with an embodiment context distinguishing different robots and control frequencies. Second, use joint diffusion in the ACT policy to predict latent actions and robot actions simultaneously, with the robot action expert explicitly conditioned on the latent action expert's plan.
Most important results	On SIMPLER, villa-X averages 77.7% on Google Robot and 62.5% on WidowX, ahead of baselines from several families (VLA, latent-action, and visual-trace methods). The four-suite LIBERO average is 90.1%, versus 81.9% for the w/o latent variant. On the real Realman and XHand platforms it is also better overall than GR00T, GR-1, and the w/o latent variant. ACT-latent additionally demonstrates zero-shot latent planning on an unseen Realman arm with open-vocabulary symbol cards.
Things to note when reading	Do not read villa-X as simply "adding one more latent token". The key points are that the latent action training objective is physically constrained by proprio-FDM, and that the policy places latent actions and robot actions in the same flow/diffusion generation process. Also note that the latent expert's zero-shot demonstration shows planning capability, verified by world/FDM rendering; it does not amount to real execution on every new platform without fine-tuning.

Figure 1: LAM overview. An ordinary LAM only performs visual future reconstruction; villa-X adds proprio-FDM so that the latent action also predicts future proprio states and actions.

2. Background and problem setting

Why latent action is important

A VLA model must generate robot actions from vision and language, but real robot action annotations are expensive and inconsistent across embodiments. Latent actions abstract an intermediate token describing "what action happened" from adjacent frames or short video clips, turning large amounts of action-free human and robot video into pseudo-action supervision for VLA pretraining.

The problem is that if the latent action only serves image reconstruction, it may capture what changes visually rather than what matters for control. For example, gripper opening and closing, end-effector rotation, and small contact motions change few pixels, yet they determine whether the task succeeds.

The core claim of villa-X is that a latent action must explain both visual changes and physical dynamics to serve as a bridge between vision and control.

Two big questions

How to better learn latent actions: move from pure visual compression to joint constraints from visual and proprioceptive dynamics.
How to better use latent actions: instead of using them only as a pretraining head, for teacher forcing, or in an independent planner, jointly model the latent action sequence and the robot action chunk within the policy.

3. Method details

3.1 Overall training process

villa-X contains two core modules: LAM extracts physically grounded latent actions from observation pairs, and ACT generates real robot actions from VLM features, latent actions, proprio state, and embodiment context. Training has three stages: LAM pretraining, ACT joint latent-robot pretraining, and embodiment-specific finetuning.

3.2 Latent Action Model (LAM)

A standard LAM usually consists of an IDM and a visual FDM. The IDM predicts latent tokens from two frames, and the FDM reconstructs the future frame from the current frame and the latent tokens:

$$z_t=\text{IDM}(o_t, o_{t+K}), \quad \hat{o}_{t+K}=\text{FDM}(o_t, z_t)$$

villa-X adds a proprio-FDM, so that the same $z_t$ must also predict the robot states and actions over the next K steps:

$$(\hat{q}_{t+1},..., \hat{q}_{t+K}, \hat{a}_{t+1},..., \hat{a}_{t+K})=\text{proprio-FDM}(q_t, z_t, c_e)$$

Here $c_e=f(\text{dataset ID}, \text{control frequency})$: the dataset ID uses a learnable embedding, and the control frequency uses sinusoidal features followed by an MLP. The purpose is to keep differences between robots out of the latent actions: embodiment-specific dynamics are handed to the context, which preserves a more consistent latent action space.
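As an illustration of how such a context vector could be assembled, here is a minimal pure-Python sketch. The class name `EmbodimentContext`, the feature dimension, the tanh hidden layer, and the choice to sum the two parts are all assumptions made for this example; the report only states that a learnable dataset embedding and sinusoidal frequency features plus an MLP are used.

```python
import math
import random


def sinusoidal_features(freq_hz, dim=8):
    """Encode a scalar control frequency as interleaved sin/cos features."""
    feats = []
    for i in range(dim // 2):
        scale = 10000 ** (2 * i / dim)
        feats.append(math.sin(freq_hz / scale))
        feats.append(math.cos(freq_hz / scale))
    return feats


def linear(x, weight, bias):
    """Plain matrix-vector product plus bias."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


class EmbodimentContext:
    """Hypothetical c_e = embedding(dataset ID) + MLP(sinusoidal(control frequency))."""

    def __init__(self, num_datasets, dim=8, seed=0):
        rng = random.Random(seed)

        def init(rows, cols):
            return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

        self.dim = dim
        self.dataset_table = init(num_datasets, dim)  # would be learned in practice
        self.w1, self.b1 = init(dim, dim), [0.0] * dim
        self.w2, self.b2 = init(dim, dim), [0.0] * dim

    def __call__(self, dataset_id, freq_hz):
        feats = sinusoidal_features(freq_hz, self.dim)
        hidden = [math.tanh(v) for v in linear(feats, self.w1, self.b1)]
        freq_code = linear(hidden, self.w2, self.b2)
        # Summing the two parts is an assumption; concatenation would also work.
        return [e + f for e, f in zip(self.dataset_table[dataset_id], freq_code)]
```

Calling `EmbodimentContext(num_datasets=3)(1, 10.0)` returns one vector per (dataset, frequency) pair, so two robots that share a latent action can still be decoded with different dynamics.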

3.3 LAM implementation details

The IDM is an ST-Transformer with a default input of an $8\times224\times224$ video clip, patch size 14, 12 ST blocks, hidden dimension 768, and 32 heads.
The VQ codebook size is 32; discrete tokens are used during LAM training, but downstream the continuous codebook center vectors are used as latent actions.
The image reconstruction FDM is a 12-layer ViT-base; the proprio-FDM is a 2-layer MLP with two output heads that predict future states and actions.
LAM training uses batch size 512, learning rate $1.5\times10^{-4}$, and a 2000-step warmup, and takes about 4 days on 128 A100 GPUs.
Human videos have no proprio labels, so for them the proprio-FDM loss is omitted and only the visual FDM objective is used.
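Two of the details above lend themselves to a short sketch: looking up the continuous codebook center for a latent, and dropping the proprio term when a sample has no proprio labels. The function names, the squared-error losses, and `proprio_weight` are hypothetical; the report does not give the loss form or weights.

```python
def nearest_center(z, codebook):
    """Return (index, center) of the codebook entry closest to z in squared L2.

    The index is the discrete token; the center is the continuous vector
    that the report says is passed downstream as the latent action.
    """
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    idx = min(range(len(codebook)), key=lambda i: sqdist(z, codebook[i]))
    return idx, codebook[idx]


def mse(pred, target):
    """Mean squared error over a flat list of values."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def lam_loss(img_pred, img_target, proprio_pred=None, proprio_target=None,
             proprio_weight=1.0):
    """Visual FDM loss, plus the proprio-FDM loss only when labels exist.

    Human videos pass proprio_target=None, so only the visual term remains.
    proprio_weight is an assumed hyperparameter, not a value from the paper.
    """
    loss = mse(img_pred, img_target)
    if proprio_target is not None:
        loss += proprio_weight * mse(proprio_pred, proprio_target)
    return loss
```

With this layout a mixed batch of robot and human clips can share one loss function, and the human samples simply contribute no proprio gradient.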

3.4 Actor Module (ACT)

ACT places latent actions and robot actions in an explicitly factorized policy:

$$\pi(a_{t: t+m-1}, z^K_{t: t+(n-1)K}|o_t, l, q_t, c_e)=\pi_{\text{robot}}(a_{t: t+m-1}|z^K_{t: t+(n-1)K}, o_t, l, q_t, c_e)\cdot\pi_{\text{latent}}(z^K_{t: t+(n-1)K}|o_t, l)$$

ACT consists of three parts: the VLM encodes the visual and language input; ACT-latent predicts a mid-level latent action plan; ACT-robot generates low-level action chunks from VLM features, predicted latent actions, proprio state, and embodiment context.

Figure 2: ACT architecture. The latent action expert first produces a mid-level plan; the robot action expert outputs real actions conditioned on it, using the embodiment context and an attention mask.
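The dependency structure in the factorization above (latent actions see only the VLM features, robot actions see both) can be pictured as a block-wise attention mask. The sketch below is one plausible layout, not the paper's exact mask: the three-block split, bidirectional attention within a block, and the function name `blockwise_mask` are assumptions.

```python
def blockwise_mask(n_vlm, n_latent, n_robot):
    """Build mask[q][k] = True when query token q may attend to key token k.

    Tokens are ordered VLM prefix, then latent-action tokens, then
    robot-action tokens. A token may attend to its own block and to any
    earlier block, never to a later one, so latent tokens cannot see robot
    tokens while robot tokens can condition on the latent plan.
    """
    block = [0] * n_vlm + [1] * n_latent + [2] * n_robot
    n = len(block)
    return [[block[k] <= block[q] for k in range(n)] for q in range(n)]


if __name__ == "__main__":
    for row in blockwise_mask(2, 2, 1):
        print("".join("x" if allowed else "." for allowed in row))
```

Proprio state and embodiment context tokens are assumed to live in the robot block here, which matches the factorization in which only the robot branch conditions on $q_t$ and $c_e$.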

3.5 Joint diffusion / flow matching

In implementation, villa-X uses conditional flow matching to model the joint distribution of latent actions and robot actions. The target action group is denoted $x_t$, the conditioning input is $O_t=(o_t, l, q_t, c_e)$, and the training objective is:

$$L_\tau(\theta)=\mathbb{E}_{p(x_t|O_t), q(x_t^\tau|x_t)}\|v_\tau^\theta(x_t^\tau, O_t)-u(x_t^\tau|x_t)\|^2$$

Here $x_t^\tau=\tau x_t+(1-\tau)\epsilon$, and the network is trained to predict the denoising vector field $u(x_t^\tau|x_t)=\epsilon-x_t$. The explicit factorization is implemented via a block-wise causal attention mask.
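To make the objective concrete, the sketch below builds the interpolated sample and regression target exactly as written in the formulas above, then checks them with an oracle field. One point is an inference rather than something the report states: since the interpolation moves from noise at $\tau=0$ to data at $\tau=1$, its derivative in $\tau$ is $x_t-\epsilon=-u$, so this sketch steps against the predicted field when sampling. All function names are hypothetical.

```python
def interpolate(x, eps, tau):
    """x^tau = tau * x + (1 - tau) * eps, element-wise."""
    return [tau * xi + (1.0 - tau) * ei for xi, ei in zip(x, eps)]


def target_field(x, eps):
    """Regression target u = eps - x, as given in the report."""
    return [ei - xi for xi, ei in zip(x, eps)]


def fm_loss(v_pred, x, eps):
    """Squared L2 distance between the predicted field and the target."""
    return sum((v - u) ** 2 for v, u in zip(v_pred, target_field(x, eps)))


def euler_sample(field_fn, eps, steps=10):
    """Integrate from tau=0 (noise) to tau=1 (data) with Euler steps.

    Assumption: because d x^tau / d tau = x - eps = -u, each step moves
    against the predicted field. field_fn(x_tau, tau) stands in for the
    trained network v_theta with the conditioning input omitted.
    """
    x_tau = list(eps)
    delta = 1.0 / steps
    for i in range(steps):
        v = field_fn(x_tau, i * delta)
        x_tau = [xt - delta * vi for xt, vi in zip(x_tau, v)]
    return x_tau


if __name__ == "__main__":
    x, eps = [1.0, -2.0, 0.5], [0.3, 0.1, -0.7]
    oracle = lambda x_tau, tau: target_field(x, eps)  # a perfect network
    print(euler_sample(oracle, eps))  # approximately recovers x
```

With a perfect field the loss is zero and the sampler returns to the clean action group, which is a quick sanity check on the sign convention before plugging in a real network.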

3.6 Masking to prevent latent shortcuts

If the robot action branch relies too heavily on latent tokens, it may learn brittle shortcuts. The authors therefore randomly mask robot-to-latent attention during training: 50% of the time the robot-to-latent attention is masked completely; otherwise, 50% of the latent tokens are randomly masked. The appendix ablation shows that both the attention mask and the embodiment context help.
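The masking rule can be sketched as a per-sample draw of which latent tokens the robot branch may see. The function name and parameters are hypothetical, and reading "50% of latent tokens" as an independent coin flip per token (rather than exactly half of them) is an assumption.

```python
import random


def sample_latent_visibility(n_latent, rng, p_full=0.5, p_token=0.5):
    """Return a list of booleans; True means the robot branch may attend.

    With probability p_full every latent token is hidden. Otherwise each
    token is hidden independently with probability p_token.
    """
    if rng.random() < p_full:
        return [False] * n_latent
    return [rng.random() >= p_token for _ in range(n_latent)]


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(4):
        vis = sample_latent_visibility(8, rng)
        print("".join("x" if v else "." for v in vis))
```

The returned visibility list would be applied only to the robot-to-latent entries of the attention mask, leaving latent-to-VLM and robot-to-VLM attention untouched.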

4. Experiments and results

4.1 Does LAM learn better latent actions?

The authors compare w/pp (with proprio-FDM) against wo/pp (without). With the LAM frozen, a 3-layer MLP is trained on LIBERO to predict robot actions from latent actions, and the maximum L1 error is recorded. w/pp has more samples in the small-error bins and fewer in the high-error bins, indicating that proprio-FDM lets latent actions carry more low-level action information.

Figure 3: Probing experiment. Latent actions trained with proprio-FDM are decoded into low-level actions more easily by the linear/MLP probe.
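The probing metric can be sketched as follows. Interpreting "maximum L1 error" as the largest absolute error across action dimensions for each sample is an assumption, and the bin edges in the example are made up for illustration.

```python
def max_l1_error(pred, target):
    """Largest absolute per-dimension error for one predicted action."""
    return max(abs(p - t) for p, t in zip(pred, target))


def bin_counts(errors, edges):
    """Histogram of errors; bin i holds values in [edges[i-1], edges[i]).

    The first bin is everything below edges[0] and the last bin is
    everything at or above edges[-1]. edges must be sorted ascending.
    """
    counts = [0] * (len(edges) + 1)
    for err in errors:
        counts[sum(err >= edge for edge in edges)] += 1
    return counts


if __name__ == "__main__":
    preds = [[0.1, 0.5, -0.2], [0.0, 0.05, 0.0]]
    targets = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
    errors = [max_l1_error(p, t) for p, t in zip(preds, targets)]
    print(bin_counts(errors, edges=[0.1, 0.5]))
```

Comparing two such histograms, one per LAM variant, gives the "more mass in low-error bins" reading described above.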

4.2 SIMPLER ablation

Method	Google Avg.	WidowX Avg.	Notes
Ours	58.5	40.8	Full LAM + ACT design.
wo/pp	57.4	32.3	Without proprio-FDM; the drop is clear on WidowX.
wo/LAM	35.0	33.1	Without latent actions; large drop on Google.
LAPA-style	43.8	1.0	Two-stage: latent-action pretraining, then swapping the action head; little carries over to the final policy.
Go-1-style	32.8	14.8	Independent latent planner plus robot action prediction; worse than joint diffusion.

This table supports two conclusions: proprio-FDM improves latent action quality, and ACT's joint latent-robot modeling is more effective than LAPA-style or Go-1-style integration.

4.3 ACT-latent's zero-shot planning capability

The authors tested ACT-latent on an unseen Realman robotic arm. Given a starting image and a language command such as "touch the corn", ACT-latent first generates a latent action sequence, which is then rendered into a video via a separately trained world/image FDM. The results show that the model can recognize open vocabulary symbol cards and generate reasonable touch/movement plans.

Figure 4: Zero-shot latent planning visualization on the unseen Realman embodiment with open-vocabulary symbol cards.

4.4 SIMPLER main results

Method	Google Avg.	WidowX Avg.	Notes
RT-1-X*	49.4	1.1	Evaluated directly after pretraining.
RoboVLMs	60.8	37.5	VLA baseline.
π0-FAST	61.9	32.1	Strong VLA/action baseline.
GR00T-N1.5	57.9	62.0	World modeling / future embedding alignment.
Magma	62.3	44.8	Visual-trace method.
MoTo	59.2	N/A	Latent-action method.
LAPA	N/A	57.3	Latent-action method.
Ours w/o latent	36.5	49.0	Latent action expert removed.
Ours	77.7	62.5	Highest average on both Google and WidowX.

The SIMPLER results show that villa-X performs strongly on both robots, and the gap to the w/o latent variant indicates that the latent-action expert contributes substantially to the final policy.
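The size of these gaps is easy to check from the table. The snippet below recomputes, for each robot, the margin over the strongest listed baseline and over the w/o latent variant; the numbers are copied from the table above, and the variable and function names are only for this example.

```python
# (Google Avg., WidowX Avg.) from the SIMPLER main results; None marks N/A.
baselines = {
    "RT-1-X*": (49.4, 1.1),
    "RoboVLMs": (60.8, 37.5),
    "pi0-FAST": (61.9, 32.1),
    "GR00T-N1.5": (57.9, 62.0),
    "Magma": (62.3, 44.8),
    "MoTo": (59.2, None),
    "LAPA": (None, 57.3),
}
ours = (77.7, 62.5)
ours_wo_latent = (36.5, 49.0)


def best_baseline(column):
    """Return (name, score) of the strongest baseline in one column."""
    scored = [(name, row[column]) for name, row in baselines.items()
              if row[column] is not None]
    return max(scored, key=lambda item: item[1])


def margins():
    """Margins of the full model, in percentage points, per robot."""
    out = {}
    for column, robot in enumerate(("Google", "WidowX")):
        name, score = best_baseline(column)
        out[robot] = {
            "best_baseline": name,
            "over_best_baseline": round(ours[column] - score, 1),
            "over_wo_latent": round(ours[column] - ours_wo_latent[column], 1),
        }
    return out
```

On these numbers the lead over the best baseline is 15.4 points on Google (Magma) but only 0.5 points on WidowX (GR00T-N1.5), while the gain over w/o latent is 41.2 and 13.5 points respectively.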

4.5 Real robot results

There are two real platforms: a Realman RM75 arm with an Inspire gripper, and an XArm with a 12-DoF XHand dexterous hand. The Realman policy is fine-tuned on 375 teleoperated trajectories, 75 for each of the five tasks.

Figure 5: Real robot platforms and tasks. Top: Realman gripper platform. Bottom: XArm + XHand dexterous hand platform.

Realman task	GR00T	Ours w/o latent	Ours
Pick in	30	40	30
Pick out	70	80	100
Push	10	30	50
Stack	10	60	50
Unstack	60	70	100
Change block color	50	40	60
Change table cover	30	30	60
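As a rough summary of the Realman table, the snippet below computes an unweighted mean per method over the seven rows and lists the rows where the full model is strictly best. This aggregation is an illustration only; the paper may summarize these results differently.

```python
# Rows copied from the Realman table: (GR00T, Ours w/o latent, Ours).
realman = {
    "Pick in": (30, 40, 30),
    "Pick out": (70, 80, 100),
    "Push": (10, 30, 50),
    "Stack": (10, 60, 50),
    "Unstack": (60, 70, 100),
    "Change block color": (50, 40, 60),
    "Change table cover": (30, 30, 60),
}
methods = ("GR00T", "Ours w/o latent", "Ours")


def column_means(table):
    """Unweighted mean per method across all rows, rounded to 0.1."""
    return {
        name: round(sum(row[i] for row in table.values()) / len(table), 1)
        for i, name in enumerate(methods)
    }


def strict_wins(table, method_index):
    """Rows where the given method has the single highest score."""
    return [
        task for task, row in table.items()
        if row[method_index] == max(row) and row.count(max(row)) == 1
    ]
```

On these rows the means are 37.1 (GR00T), 50.0 (w/o latent), and 64.3 (Ours), and the full model is strictly best on five of seven rows, with w/o latent ahead on Pick in and Stack.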

XHand task	Ours seen	Ours unseen	Key takeaway
Pick & Place	84	68	Better than GR-1, GR00T, and w/o latent.
Stack Cube	75	50	The advantage holds with unseen objects/backgrounds.
Place Cup Upright	60	30	Highest or tied for highest in both seen and unseen.
Pour Water	60	30	Still better than the baselines on a complex dexterous task.
Flick Ball	50	40	Unseen is lower than the baseline.

4.6 LIBERO results

The appendix reports LIBERO results on four suites: LIBERO-Spatial, Object, Goal, and Long. villa-X averages 90.1%, higher than π0-FAST (85.5%), OpenVLA (76.5%), Octo-base (75.1%), and its own w/o latent variant (81.9%). The full model scores 97.5 on Spatial, 97.0 on Object, 91.5 on Goal, and 74.5 on Long.
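The reported average can be checked directly from the four suite scores. The snippet is a small consistency check with names chosen for this example.

```python
# Per-suite success rates of the full model, as quoted above.
libero_full = {"Spatial": 97.5, "Object": 97.0, "Goal": 91.5, "Long": 74.5}
libero_wo_latent_avg = 81.9


def suite_average(scores):
    """Unweighted mean over the suites."""
    return sum(scores.values()) / len(scores)


def gain_over(baseline_avg, scores):
    """Difference between the rounded average and a baseline average."""
    return round(round(suite_average(scores), 1) - baseline_avg, 1)
```

The mean is 90.125, which rounds to the reported 90.1, and the gain over the w/o latent variant is 8.2 points. Long is the weakest suite by a wide margin.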

5. Appendix key information

Data size

The pre-training data contains a mix of robot data and action-free human videos. Robot data is about 1.6M trajectories / 223.5M frames, mainly from OpenX mixture and AgiBot; human videos are about 3.6M clips, from Ego4D/EgoHOD, EgoPAT3D, EGTEA, EPIC-KITCHENS, HO-Cap, HOI4D, HoloAssist, RH20T, Something-Something V2, etc.

LAM additional visualization

The appendix shows image pairs that share the same latent action, illustrating that similar latent actions correspond to similar low-level behaviors across embodiments (including humans and robots). It also shows how robot and human video demonstrations can be converted into SIMPLER robot actions through LAM + proprio-FDM and then executed, supporting the claim that latent actions can transfer from videos to robot actions.

Figure 6: Converting a robot video demonstration into SIMPLER actions via LAM and proprio-FDM (video to SIMPLER action, part 1).

Embodiment context ablation

Removing the embodiment context increases both the visual FDM and proprio-FDM validation losses. In action probing on the unseen Realman embodiment, Ours has lower probing loss than w/o context overall and on the xyz, rotation, and gripper components, indicating that the context helps the model separate embodiment-specific dynamics and generalize better to new embodiments.

Policy ablation

The appendix policy ablation reports Google Avg. 58.5 and WidowX Avg. 40.8 for Ours; w/o mask drops to 53.2 / 34.0, and w/o context drops to 49.1 / 38.5. The attention mask and embodiment context are therefore not decoration but key design choices that affect policy stability.

6. Reproducibility and implementation notes

Minimal reproduction path

Prepare robot trajectories and human videos; robot data must include observation, state, action, dataset ID, and control frequency.
Train the LAM: the ST-Transformer IDM predicts latent tokens from $T_{\text{LAM}}=8$ frames; the visual FDM reconstructs future images; the proprio-FDM predicts future states/actions; human videos skip the proprio loss.
Use the VQ codebook centers as continuous latent actions for ACT pretraining.
Build ACT: a PaliGemma 3B VLM encodes vision and language; ACT-latent and ACT-robot are each 18-layer Transformer experts.
Use conditional flow matching to jointly train the latent action sequence and the robot action chunk; set up block-wise causal attention and stochastic robot-to-latent masking.
Fine-tune the state/action projections, action decoder, and other modules for the target embodiment, adding wrist camera features if necessary.
Evaluate on SIMPLER, LIBERO, and real robots according to each benchmark's task distribution.
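The hyperparameters quoted in this report can be gathered into a small config sketch as a starting point for a reproduction. The class and field names are hypothetical and do not come from the villa-X codebase; the values are the ones stated above, and the linear warmup shape is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LAMConfig:
    """LAM settings as quoted in the report; field names are illustrative."""
    clip_frames: int = 8
    image_size: int = 224
    patch_size: int = 14
    st_blocks: int = 12
    hidden_dim: int = 768
    num_heads: int = 32
    codebook_size: int = 32
    batch_size: int = 512
    learning_rate: float = 1.5e-4
    warmup_steps: int = 2000

    @property
    def patches_per_frame(self):
        """Number of non-overlapping patches in one square frame."""
        return (self.image_size // self.patch_size) ** 2

    @property
    def head_dim(self):
        """Per-head width implied by hidden_dim and num_heads."""
        return self.hidden_dim // self.num_heads

    def lr_at(self, step):
        """Linear warmup to the peak rate (the warmup shape is assumed)."""
        return self.learning_rate * min(1.0, step / self.warmup_steps)


@dataclass(frozen=True)
class ACTConfig:
    """ACT settings as quoted in the report; field names are illustrative."""
    vlm: str = "PaliGemma 3B"
    expert_layers: int = 18  # applies to both ACT-latent and ACT-robot
```

With these values each frame yields 16 x 16 = 256 patches and each attention head is 24 dimensions wide, which is a useful sanity check when wiring up the IDM.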

The reproduction cost is very high: LAM training takes about 4 days on 128 A100 GPUs, and ACT pretraining takes about 4 days on 64 A100 GPUs. A more realistic route is to start from the open-source checkpoints and do downstream fine-tuning and ablations rather than pretraining every module from scratch.
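To put the cost in one number, the figures above convert to GPU-hours as follows. This is back-of-the-envelope arithmetic on the approximate durations quoted in the report, and it excludes fine-tuning and evaluation.

```python
def gpu_hours(num_gpus, days):
    """GPU-hours for a run that keeps num_gpus busy for the given days."""
    return num_gpus * days * 24


lam_hours = gpu_hours(128, 4)  # LAM pretraining
act_hours = gpu_hours(64, 4)   # ACT pretraining
total_hours = lam_hours + act_hours
```

That is 12,288 A100-hours for the LAM and 6,144 for ACT, about 18,432 A100-hours in total for the two pretraining stages.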

7. Analysis, Limitations and Boundaries

The most valuable part of this paper

It advances latent actions from "pseudo-action tokens compressed from video" to "an intermediate action representation calibrated by physical state and action supervision". This matters because robot control is not purely visual prediction: many key actions are barely visible in pixels but explicit in proprio/action space. The second contribution is the policy structure: rather than treating latent actions as a side auxiliary task, villa-X lets ACT-latent and ACT-robot form a clear dependency within joint diffusion, connecting the mid-level plan to low-level control.

Why do the results hold up?

The evidence chain is relatively complete: probing shows that the latent actions are better at predicting robot actions; the SIMPLER ablation shows that proprio-FDM and the latent action expert help the policy; the main results show advantages over several families of baselines.
Evaluation breadth: SIMPLER covers the Google Robot and WidowX embodiments, LIBERO covers Spatial/Object/Goal/Long, and the real-world experiments cover both a gripper and a dexterous hand.
Ablations (w/o latent, wo/pp, LAPA-style, Go-1-style, w/o mask, w/o context) isolate and verify the main design choices one by one.
The appendix gives further detail on data size, architecture, and training, which explains where the performance comes from and makes the reproduction cost explicit.

Main limitations

Training cost is extremely high: pretraining on the order of a hundred GPUs is a high bar, and ordinary labs will find a full reproduction difficult.
The latent expert's zero-shot result is not full closed-loop execution: the symbol card experiment mainly inspects the latent plan through world/FDM rendering, which is not equivalent to successful execution on an arbitrary unseen robot.
Embodiment-specific fine-tuning is still needed: both Realman and XHand are fine-tuned on platform data, so the results should not be read as data-free cross-robot generalization.
Proprio grounding relies on robot state/action annotations: human videos have no proprio branch and contribute to LAM training only through the visual objective.
Some tasks remain far from solved: Realman Pick in and XHand unseen Pour Water / Place Cup Upright, among others, still fail often, indicating that latent actions are not a universal bridge.

Questions to ask while reading

Which kinds of actions benefit most from the physical grounding learned by proprio-FDM: rotation, gripper, contact, or long-distance movement?
Why are continuous codebook centers more stable than discrete tokens in ACT?
Can the latent plans predicted by ACT-latent be evaluated, filtered, or corrected separately, for example with a VLM critic?
Could human videos be used more effectively with hand poses and end-effector keypoints in place of proprio state?
The w/o latent variant is not bad on some real tasks either; in which scenarios does the latent action expert provide the most benefit?