EN 中文

villa-X: Enhancing Latent Action Modeling in Vision-Language-Action Models

Reading Report: villa-X is a paper on latent action modeling in the Vision-Language-Action (VLA) model. Its proposition is that latent actions should not only compress visual changes, but also use robot proprioception to align them with real physical actions; and the strategy model should explicitly plan latent actions first, and then conditionally generate robot actions.

arXiv: 2507.23682v3 VLA Latent Action Joint Diffusion SIMPLER / LIBERO / Real Robots
Authors: Xiaoyu Chen, Hangxing Wei, Pushi Zhang, Chuheng Zhang, Kaixin Wang, etc.
Institution: Microsoft Research, Tsinghua University, Wuhan University, HKUST, Nanjing University
Code: https: //github.com/microsoft/villa-x
Project page: https: //aka.ms/villa-x

1. Quick overview of the paper

What should the paper solve? Existing VLA utilizes motionless video data through latent action, but latent action is often learned only from visual reconstruction, and it is easy to ignore actions with small pixel changes but critical control, such as end rotation and gripper opening and closing. At the same time, many methods only use latent action as an intermediate pre-training task or an independent token, and fail to fully use it for robot action generation. villa-X wants to solve two problems: "How to learn latent action more physically" and "How to connect to VLA policy more effectively".
The author's approach Two starting points: first, add proprioceptive FDM to the Latent Action Model (LAM), so that latent action not only reconstructs future images, but also predicts future robot proprio states and actions, and uses embodiment context to distinguish different robots/control frequencies; second, use joint diffusion in ACT policy to simultaneously predict latent actions and robot actions, and let the robot action expert explicitly condition on the latent action expert's plan.
most important results On SIMPLER, villa-X reaches an average of 77. 7% for Google Robot and 62. 5% for WidowX, which is better than the multi-category VLA/latent-action/visual-trace baseline; the average of LIBERO four suites is 90. 1%, higher than w/o latent's 81. 9%; on the real Realman and XHand platforms, it is also overall better than the GR00T, GR-1 and w/o latent versions. ACT-latent also demonstrates zero-shot latent planning for an unseen Realman arm and open vocabulary symbol cards.
Things to note when reading Don't think of villa-X as simply "adding one more latent token". The key is that the training goal of latent action is physically constrained by proprio-FDM, and the policy structure puts latent action and robot action into the same flow/diffusion generation process. Also note that latent expert's zero-shot demonstration is "planning capability" verified by world/FDM rendering and does not equate to real execution without fine-tuning on all new platforms.
LAM overview
Figure 1: Ordinary LAM only performs visual future reconstruction; villa-X adds proprio-FDM to allow latent action to also predict future proprio state/action.

2. Background and problem setting

Why latent action is important

The VLA model needs to generate robot actions from vision and language, but real robot action annotation is expensive and inconsistent across robot forms. The idea of ​​latent action is to abstract the intermediate token of "what action happened" from adjacent frames or short video clips, and convert a large number of action-free human/robot videos into pseudo-action supervision, thereby helping VLA pre-training.

The problem is: if the latent action only serves image reconstruction, what it learns may be "visually changing things" rather than "controlly important things". For example, the opening and closing of the gripper, the rotation of the end, and the small contact movements do not change much in pixels, but they determine the success or failure of the task.

The core judgment of villa-X is that latent action must have the ability to explain both visual changes and physical dynamics to be suitable as a bridge between vision and control.

Two big questions

3. Method details

3.1 Overall training process

villa-X contains two core modules: LAM is responsible for extracting physically grounded latent actions from observation pairs; ACT is responsible for generating real robot actions based on VLM features, latent actions, proprio state and embodiment context. The training is divided into three stages: LAM pretraining, ACT joint latent-robot pretraining, and embodiment-specific finetuning.

3.2 Latent Action Model (LAM)

Standard LAM usually consists of IDM + visual FDM. IDM predicts latent tokens from two frames, and FDM uses the current frame and latent tokens to reconstruct future frames:

$$z_t=\text{IDM}(o_t, o_{t+K}), \quad \hat{o}_{t+K}=\text{FDM}(o_t, z_t)$$

villa-X joins proprio-FDM, so that the same \(z_t\) must also predict the future K-step robot status and actions:

$$(\hat{q}_{t+1},..., \hat{q}_{t+K}, \hat{a}_{t+1},..., \hat{a}_{t+K})=\text{proprio-FDM}(q_t, z_t, c_e)$$

Among them, \(c_e=f(\text{dataset ID}, \text{control frequency})\), the dataset ID uses learnable embedding, and the control frequency uses sinusoidal features + MLP. The purpose of this is not to stuff the differences between different robots into latent actions, but to hand over the embodiment-specific dynamics to the context and retain a more consistent latent action space.

3.3 LAM implementation details

3.4 Actor Module (ACT)

ACT puts latent action and robot action into an explicitly decomposed policy:

$$\pi(a_{t: t+m-1}, z^K_{t: t+(n-1)K}|o_t, l, q_t, c_e)=\pi_{\text{robot}}(a_{t: t+m-1}|z^K_{t: t+(n-1)K}, o_t, l, q_t, c_e)\cdot\pi_{\text{latent}}(z^K_{t: t+(n-1)K}|o_t, l)$$

ACT consists of three parts: VLM encodes visual language input; ACT-latent predicts mid-level latent action plan; ACT-robot generates low-level action chunks based on VLM features, predicted latent actions, proprio state and embodiment context.

ACT architecture
Figure 2: ACT architecture. The latent action expert first makes a mid-level plan, and the robot action expert outputs real actions under its conditions and uses embodiment context and attention mask.

3.5 Joint diffusion / flow matching

In terms of implementation, villa-X uses conditional flow matching to model the joint distribution of latent actions and robot actions. The target action group is recorded as \(x_t\), the condition input is recorded as \(O_t=(o_t, l, q_t, c_e)\), and the training goal is:

$$L_\tau(\theta)=\mathbb{E}_{p(x_t|O_t), q(x_t^\tau|x_t)}\|v_\tau^\theta(x_t^\tau, O_t)-u(x_t^\tau|x_t)\|^2$$

Among them, \(x_t^\tau=\tau x_t+(1-\tau)\epsilon\), network prediction denoising vector field \(u(x_t^\tau|x_t)=\epsilon-x_t\). Explicit factorization is implemented via block-wise causal attention mask.

3.6 Prevent latent shortcut masking

If a robot action branch relies too heavily on latent tokens, it may learn vulnerable shortcuts. The author randomly masks robot-to-latent attention during training: 50% of the time, the robot-to-latent attention is completely masked; otherwise, 50% of latent tokens are randomly masked. Appendix ablation shows that both attention mask and embodiment context help.

4. Experiments and results

4.1 Does LAM learn better latent actions?

The author compares \texttt{w/pp} with proprio-FDM and \texttt{wo/pp} without. After freezing LAM on LIBERO, use 3-layer MLP to predict robot action from latent action, and count the maximum L1 error. The results show that \texttt{w/pp} has more samples in the small error bin and fewer samples in the high error bin, indicating that proprio-FDM allows latent actions to carry more low-level action information.

Probing experiment
Figure 3: Probing experiment. The latent action added to proprio-FDM is more easily decoded by the linear/MLP probe to detect low-level actions.

4.2 SIMPLER ablation

methodGoogle Avg.WidowX Avg.meaning
Ours58.540.8Complete LAM + ACT design.
wo/pp57.432.3Without proprio-FDM, the decrease is obvious on WidowX.
wo/LAM35.033.1Without using latent action, there is a huge drop-off on Google.
LAPA-style43.81.0Two-stage latent-action pre-training and then changing the action head, the structural transmission is weak.
Go-1-style32.814.8Independent latent planner + robot action prediction is not as good as joint diffusion.

This table corresponds to two conclusions: proprio-FDM improves latent action quality; ACT's joint latent-robot modeling is more effective than LAPA/Go-1 style access.

4.3 ACT-latent's zero-shot planning capability

The authors tested ACT-latent on an unseen Realman robotic arm. Given a starting image and a language command such as "touch the corn", ACT-latent first generates a latent action sequence, which is then rendered into a video via a separately trained world/image FDM. The results show that the model can recognize open vocabulary symbol cards and generate reasonable touch/movement plans.

Zero-shot latent plan visualization
Figure 4: Zero-shot latent planning visualization on unseen Realman embodiment + open-vocabulary symbol cards.

4.4 SIMPLER main results

methodGoogle Avg.WidowX Avg.Remarks
RT-1-X*49.41.1Assessment directly after pretraining.
RoboVLMs60.837.5VLA baseline.
π0-FAST61.932.1Strong VLA/action baseline.
GR00T-N1.557.962.0world modeling / future embedding alignment.
Magma62.344.8visual trace method.
MoTo59.2N/Alatent-action method.
LAPAN/A57.3latent-action method.
Ours w/o latent36.549.0Remove latent action expert.
Ours77.762.5Google/WidowX has the highest average.

SIMPLER results show that villa-X achieves strong performance on both types of robots, and the gap with w/o latent proves that the latent-action expert makes a substantial contribution to the final strategy.

4.5 Real robot results

There are two real platforms: Realman RM75 + Inspire gripper, and XArm + 12-DoF XHand dexterous hand. Realman uses 375 teleop trajectories for fine-tuning, 75 for each of the five tasks;

Real robot platforms
Figure 5: Real robot platform and tasks: Realman gripper platform above, XArm + XHand dexterous hand platform below.
Realman indicatorGR00TOurs w/o latentOurs
Pick in304030
Pick out7080100
Push103050
Stack106050
Unstack6070100
Change block color504060
Change table cover303060
XHand tasksOurs seenOurs unseenKey conclusions
Pick & Place8468Better than GR-1, GR00T, w/o latent.
Stack Cube7550unseen object/background still maintains its advantage.
Place Cup Upright6030Seen/unseen are both the highest or tied for the highest.
Pour Water6030It is still better than baseline in terms of complex and dexterous operation.
Flick Ball5040unseen lower than baseline.

4.6 LIBERO results

The appendix is evaluated on four suites, LIBERO-Spatial, Object, Goal, and Long. villa-X averages 90.1%, which is higher than π0-FAST's 85.5%, OpenVLA's 76.5, Octo-base's 75.1, and higher than itself w/o latent's 81.9. The full models in the four suites are Spatial 97.5, Object 97.0, Goal 91.5, and Long 74.5.

5. Appendix key information

Data size

The pre-training data contains a mix of robot data and action-free human videos. Robot data is about 1.6M trajectories / 223.5M frames, mainly from OpenX mixture and AgiBot; human videos are about 3.6M clips, from Ego4D/EgoHOD, EgoPAT3D, EGTEA, EPIC-KITCHENS, HO-Cap, HOI4D, HoloAssist, RH20T, Something-Something V2, etc.

LAM additional visualization

The appendix shows image pairs corresponding to the same latent action, illustrating that similar latent actions in different embodiments (including humans and robots) can correspond to similar low-level behaviors. It also demonstrated how to convert robot/human video demonstrations into SIMPLER robot actions through LAM + proprio-FDM and execute them, verifying that latent actions can be migrated from videos to robot actions.

Video to SIMPLER action part 1
Figure 6: Converting robot video demonstration into SIMPLER actions via LAM and proprio-FDM.

Embodiment context ablation

Removing embodiment context will increase visual FDM/proprio FDM validation loss. When doing action probing for unseen Realman embodiments, Ours has lower overall probing loss, xyz, rotation, and gripper than w/o context, indicating that context helps the model separate-specific dynamics and improve the generalization of new embodiments.

policy ablation

Appendix policy ablation shows: Ours in Google Avg. 58.5, WidowX Avg. 40.8; w/o mask dropped to 53.2 / 34.0; w/o context dropped to 49.1 / 38.5. Note that attention mask and embodiment context are not decorations, but key designs that affect policy stability.

6. Key points of reproducibility and implementation

Minimum recurrence path

  1. Prepare robot trajectories and human videos; robot data must include observation, state, action, dataset ID, and control frequency.
  2. Training LAM: ST-Transformer IDM predicts latent tokens from \(T_{\text{LAM}}=8\) frames; visual FDM reconstructs future images; proprio-FDM predicts future states/actions; human videos skip proprio loss.
  3. Use VQ codebook centers as continuous latent actions for ACT pretraining.
  4. Building ACT: PaliGemma 3B VLM encodes visual language; ACT-latent and ACT-robot are 18-layer Transformer experts respectively.
  5. Use conditional flow matching to jointly train latent action sequence and robot action chunk; set up block-wise causal attention and robot-to-latent stochastic masking.
  6. Fine-tune state/action projection, action decoder and other modules according to the target embodiment, and add wrist camera feature if necessary.
  7. Evaluation by task distribution on SIMPLER/LIBERO/real robots.
The reproducibility cost is very high: LAM requires about 128 A100 images for 4 days of training, and ACT pretraining requires about 64 A100 images for 4 days of training. A more realistic way to reproduce is to first use open source checkpoint to do downstream finetune and ablation instead of pre-training all modules from scratch.

7. Analysis, Limitations and Boundaries

The most valuable part of this paper

It advances latent action from "pseudo action token compressed from video" to "intermediate action representation calibrated by physical state and action supervision". This is important because robot control is not purely visual prediction: many key actions are hidden in pixels but very explicit in proprio/action space. The second value of villa-X is the policy structure design. It does not use latent action as a bypass auxiliary task, but allows ACT-latent and ACT-robot to form a clear dependency in joint diffusion, truly connecting the middle-level plan to the low-level control.

Why does the result stand?

Main limitations

Questions to ask while reading