
Unified Diffusion VLA: Vision-Language-Action Model via Joint Discrete Denoising Diffusion Process

Authors: Jiayi Chen, Wenxuan Song, Pengxiang Ding, Ziyang Zhou, Han Zhao, Feilong Tang, Donglin Wang, Haoang Li

Organization: The Hong Kong University of Science and Technology (Guangzhou), Westlake University, Zhejiang University, Monash University

Publication format: ICLR 2026 conference template / arXiv preprint, 2025

Links: arXiv: 2511.01718 | PDF | Project Page | Code | CALVIN checkpoint

Source note: the arXiv LaTeX source includes the complete appendix: LLM usage, future image visualization, real-world setup/tasks, loss formulations, training details, baselines, and a discrete vs. continuous diffusion comparison.

1. Quick overview of the paper

One-sentence summary: UD-VLA unifies language, current images, future images, and actions into a single discrete token space and uses the Joint Discrete Denoising Diffusion Process (JD3P) to generate future images and actions in parallel along one synchronized denoising trajectory, so that actions can draw on future visual information at every denoising step.
| Reading focus | Concise conclusion |
| --- | --- |
| What problem does the paper try to solve? | Existing unified VLAs either rely on external vision/action experts or, although they tokenize vision and action, keep image generation and action prediction in separate decoding passes, so future images give actions insufficient guidance and inference is slow. |
| The authors' approach | Use a VQ visual tokenizer and the FAST action tokenizer to convert images and actions into discrete tokens; use hybrid attention to preserve cross-modal directionality; use JD3P to denoise future image tokens and action tokens synchronously within one discrete diffusion process. |
| Most important results | CALVIN ABCD->D average length 4.64; LIBERO average success rate 92.7%; SimplerEnv-WidowX overall 62.5%; JD3P decodes at 219.3 tokens/s, about 4.3x faster than AR's 50.2 tokens/s. |
| Things to note when reading | The core is not "adding future images" per se, but letting action tokens repeatedly attend to intermediate future-image tokens across multiple denoising rounds; also note the numerical inconsistency between the SimplerEnv text and table. |

Difficulty rating: ★★★★★. Requires familiarity with VLA, visual chain-of-thought, VQ tokenization, FAST action tokenization, discrete diffusion / mask-predict, attention-mask design, and robot imitation-learning benchmarks.

Keywords: Vision-Language-Action, Discrete Diffusion, JD3P, Hybrid Attention, Future Image Generation

Core contribution list

UD-VLA overview
Figure 1. Overview of UD-VLA: architecture, two-stage training, and JD3P inference. The output includes fixed-length future image tokens and variable-length action tokens.

2. Motivation

2.1 Problems to be solved

A VLA reads natural language instructions and visual observations and outputs actions executable in the physical world. Recently, unified VLAs have begun to fold future images into the perception-action loop: first predict the future visual state, then cast action prediction as the inverse problem of "how to reach this future state."

The paper argues that true unification is not merely placing the output modalities in the same sequence, but making visual generation and action generation mutually beneficial. If future images serve only as an auxiliary training objective, or if image generation and action generation are decoded separately at inference, action tokens can absorb only limited future visual information.

2.2 Limitations of existing methods

| Paradigm | Typical approach | Issues pointed out in the paper |
| --- | --- | --- |
| External experts unify modalities | GR-1, SEER, DreamVLA, F1, UP-VLA, etc. use additional encoders/decoders or diffusion experts to generate vision/actions. | Module separation can cause misalignment and higher complexity, and couples visual generation and action prediction only weakly. |
| Unified input/output token space but separate decoding | CoT-VLA, WorldVLA, UniVLA, etc. unify images and actions at the token level. | Images and actions still do not share one joint decoding process; some methods decode only actions at inference, so the predictive value of images learned in training is not explicitly retained. |
| AR generation of images and actions | Autoregressively generate future image/action tokens in token order. | Each action token typically attends to its context only once, so images give insufficient guidance to actions; image tokens themselves are not naturally suited to a strict next-token order. |

2.3 High-level ideas of this article

The core assumption of UD-VLA is that future-image generation and action generation should be jointly optimized within one synchronized denoising process. As the future image is refined from coarse to fine, the action tokens are also gradually recovered from the masked state; at each denoising step, every action token can attend to the future image at its current stage and thereby receive continuous, sufficient visual guidance.

4. Detailed explanation of method

4.1 Unified Tokenization

UD-VLA converts language, vision, and actions into discrete tokens and concatenates them into a single sequence. The language tokens follow the Emu3 design; current and future images are discretized by a VQ tokenizer; actions are represented with the FAST tokenizer. The authors use <BOI>/<EOI> to delimit image blocks and <BOA>/<EOA> to delimit the action block.

This sequence puts the input and output in the same token space: the first half is the condition, and the second half is the future vision and action the model wants to generate.

$$[\; \text{text tokens}; \; \text{current image tokens}; \; \text{future image tokens}; \; \text{action tokens}\; ]$$
text tokens: the language instruction, the input for understanding task intent.
current image tokens: tokens of the currently observed image, the visual condition.
future image tokens: the future visual state to be generated by the model, fixed length.
action tokens: the action sequence to be generated by the model, variable length, produced by the FAST action tokenizer.
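As a concrete illustration of this layout, here is a minimal sketch of concatenating the four blocks with the special delimiter tokens; the token ids and the helper name build_sequence are hypothetical placeholders, not the authors' code.

```python
# Minimal sketch of the unified sequence layout; special-token ids are
# hypothetical placeholders, not values from the paper or the released code.
BOI, EOI, BOA, EOA = 50001, 50002, 50003, 50004

def build_sequence(text_ids, cur_img_ids, fut_img_ids, action_ids):
    """Concatenate [text; current image; future image; action] into one token list.

    text_ids     : language-instruction tokens (condition)
    cur_img_ids  : VQ tokens of the current observation (condition)
    fut_img_ids  : VQ tokens of the future image (fixed length, to be generated)
    action_ids   : FAST tokens of the action chunk (variable length, to be generated)
    """
    return (
        list(text_ids)
        + [BOI] + list(cur_img_ids) + [EOI]
        + [BOI] + list(fut_img_ids) + [EOI]
        + [BOA] + list(action_ids) + [EOA]
    )
```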

4.2 Hybrid Attention

The basic rules of hybrid attention are: the text and current image serve as understanding inputs, while the future image block and action block are outputs. Bidirectional attention is allowed within each block so that image tokens, or action dimensions, interact fully; a causal direction is kept across blocks, so the future image can only see the inputs, and actions can see the inputs and the future image, while action information is never allowed to flow back into the vision blocks.

hybrid attention
Figure 2. Hybrid attention. The paper emphasizes that action-to-vision is prohibited to avoid coarse action information leakage and error accumulation.

Intuitive understanding: the patches/tokens within an image have no strict causal order, and the different spatial dimensions of an action have no strict token order, so bidirectional attention within a block is appropriate; but there is directionality across "current observation -> future image -> action", so attention across blocks remains causal.
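A minimal sketch of how such a mask could be built, assuming a boolean mask where True means "may attend"; the torch layout, block sizes, and function name are assumptions for illustration, not the authors' implementation.

```python
import torch

def hybrid_attention_mask(L_cond: int, L_vis: int, L_act: int) -> torch.Tensor:
    """Boolean attention mask for the layout [condition | future image | action]:
    full attention inside each block, causal ordering across blocks, and
    no action -> future-image attention."""
    L = L_cond + L_vis + L_act
    mask = torch.zeros(L, L, dtype=torch.bool)
    c = slice(0, L_cond)                 # text + current image (understanding input)
    v = slice(L_cond, L_cond + L_vis)    # future image block
    a = slice(L_cond + L_vis, L)         # action block
    mask[c, c] = True   # bidirectional within the condition
    mask[v, c] = True   # future image sees the condition
    mask[v, v] = True   # bidirectional within the future image block
    mask[a, c] = True   # action sees the condition
    mask[a, v] = True   # action sees the future image
    mask[a, a] = True   # bidirectional within the action block
    # mask[v, a] stays False: action information never flows back to vision
    return mask
```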

4.3 Joint Discrete Denoising Diffusion Process

JD3P concatenates the future image tokens $\mathbf{v}_0$ and the action tokens $\mathbf{a}_0$ into one sequence. Forward noising does not add Gaussian noise; instead it replaces each token with a special mask token $\mathrm{M}$ with probability $\beta_t$. Reverse denoising predicts the original tokens at the masked positions.

"Noising" in discrete diffusion is to gradually cover the token; "denoising" is to restore the covered image/action token according to the context.

$$\mathbf{Q}_t\mathbf{e}_{t, r}=(1-\beta_t)\mathbf{e}_{t, r}+\beta_t\mathbf{e}_{\mathrm{M}}$$
$\mathbf{e}_{t, r}$: the one-hot token at position $r$, which may come from the future image or the action.
$\beta_t$: the probability that a token is replaced by the mask at step $t$.
$\mathbf{e}_{\mathrm{M}}$: the one-hot basis vector of the mask token.
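A minimal sketch of this absorbing-state forward step; the mask-token id and the per-step formulation are assumptions for illustration (in practice, training samples a mask ratio directly, see 4.4).

```python
import torch

MASK_ID = 60000  # hypothetical id of the special mask token M

def forward_mask(tokens_0: torch.Tensor, beta_t: float) -> torch.Tensor:
    """Replace each (image or action) token independently by the mask token
    with probability beta_t; keep it unchanged otherwise."""
    keep = torch.rand(tokens_0.shape, device=tokens_0.device) >= beta_t
    return torch.where(keep, tokens_0, torch.full_like(tokens_0, MASK_ID))
```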

The reverse process factorizes into two parts, visual recovery and action recovery:

$$p_\theta(\mathbf{v}_{t-1}, \mathbf{a}_{t-1}\mid\mathbf{v}_t, \mathbf{a}_t, \mathbf{c}) =p_\theta(\mathbf{v}_{t-1}\mid\mathbf{v}_t, \mathbf{c})\; p_\theta(\mathbf{a}_{t-1}\mid\mathbf{v}_t, \mathbf{a}_t, \mathbf{c})$$

where $\mathbf{c}$ is text and current image. Note that the action distribution is explicitly conditioned on $\mathbf{v}_t$, so each action recovery step can take advantage of future visual tokens at the current stage.

4.4 Loss Function

During training, the authors do not explicitly unroll the full multi-step diffusion chain. Instead they use a single-step mask-predict objective: randomly sample a mask ratio $\rho_t$, mask part of the clean future image/action tokens, and compute cross-entropy only at the masked positions.

The number of visual tokens can be much larger than the number of action tokens, so the paper uses $\omega$ to down-weight the visual loss and keep the visual term from dominating training.

$$\mathcal{L}_{\text{CE}}(\theta)= -\omega\sum_{j=1}^{L_v}\log p_\theta^{(v)}(v_{0, j}\mid\mathbf{v}_t, \mathbf{c})\,\mathbb{1}\{v_{t, j}=\mathrm{M}\} -\sum_{i=1}^{L_a}\log p_\theta^{(a)}(a_{0, i}\mid\mathbf{v}_t, \mathbf{a}_t, \mathbf{c})\,\mathbb{1}\{a_{t, i}=\mathrm{M}\}$$
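A minimal sketch of this masked cross-entropy under assumed tensor shapes; the default value of omega and the argument layout are placeholders, not the paper's configuration.

```python
import torch
import torch.nn.functional as F

def masked_ce_loss(vis_logits, vis_target, vis_masked,
                   act_logits, act_target, act_masked, omega: float = 0.1):
    """Cross-entropy only at masked positions, visual term down-weighted by omega.

    vis_logits/act_logits : (L, V) per-position logits over the respective codebook
    vis_target/act_target : (L,)  clean token ids v_0 / a_0
    vis_masked/act_masked : (L,)  bool, True where the token was replaced by M
    """
    vis_ce = F.cross_entropy(vis_logits, vis_target, reduction="none")
    act_ce = F.cross_entropy(act_logits, act_target, reduction="none")
    return omega * (vis_ce * vis_masked.float()).sum() + (act_ce * act_masked.float()).sum()
```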

The appendix Loss Formulations section also defines four types of losses in its Table 1 for comparing different VLAs: MSE, continuous diffusion, discrete diffusion, and next-token prediction. The most relevant one here is $\mathcal{L}_{\mathrm{Diff\text{-}disc}}$: masked-token prediction over a finite vocabulary, rather than noise prediction in a continuous space.

discrete vs continuous diffusion
Figure 3. Discrete diffusion versus continuous diffusion in the appendix. UD-VLA chooses discrete diffusion because both image tokens and action tokens have been mapped into a limited codebook.

4.5 Two-stage training

  1. Stage (i): world-model-style post-training. Initialized from the pretrained VLM backbone, the model is trained on large-scale video data with the sequence [text; current image; future image] to predict future images; the goal is to endow the VLA with the ability to model future states.
  2. Stage (ii): robot action fine-tuning. On downstream robot action data, the full sequence [text; current image; future image; action] is used to jointly train image generation and action generation under JD3P.

The appendix Training Details section gives the training resources: CALVIN-ABCD with action chunk 10, 8 H100s, about 24 hours; joint training on the four LIBERO suites with action chunk 10, 8 H100s, about 30 hours; SimplerEnv trained on Bridge data with action chunk 5, 8 H100s, about 30 hours; real-world experiments with 600+ collected trajectories, action chunk 8, 8 H100s, about 8 hours. The real-world task section of the main text lists another configuration: 4 H100s for 24 hours, 9000 steps, batch size 64, learning rate 8e-5, weight decay 0.1. The two GPU/duration descriptions are not fully consistent and should be checked against the official code configuration when reproducing.

4.6 Inference

Inference starts from fully masked future image/action tokens and runs a small number of iterations. Each round predicts distributions for all masked positions in parallel, fills in tokens at the most confident positions, and keeps the rest masked. The mask ratio decreases from high to low following a cosine schedule.

Algorithm: UD-VLA inference with JD3P
  Input: text tokens, current image tokens
  Initialize: future image tokens = [MASK] * Lv; action tokens = [MASK] * La
  Prefill <BOI>, <EOI>, <BOA>; cache prefix KV
  for t = T ... 1:
      predict token distributions for all masked vision/action positions in parallel
      restrict vision positions to the visual codebook and action positions to the action codebook
      compute confidence = max probability for each masked position
      fill the top (1 - rho_t) masked positions using Gumbel-Max sampling
      if <EOA> appears, fix the action length and mask the later action slots
  Output: future image tokens, action tokens
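A runnable-style sketch of one such confidence-guided decoding loop. It uses greedy argmax instead of Gumbel-Max sampling and omits codebook restriction and <EOA> handling; the `model` interface, function name, and mask id are hypothetical.

```python
import math
import torch

MASK_ID = 60000  # hypothetical mask-token id

@torch.no_grad()
def jd3p_decode(model, tokens: torch.Tensor, gen_slice: slice, T: int = 8) -> torch.Tensor:
    """Iteratively fill the masked future-image + action span `gen_slice`,
    keeping the most confident predictions each round (cosine mask schedule)."""
    span_len = gen_slice.stop - gen_slice.start
    for step in range(T):
        masked = tokens[gen_slice] == MASK_ID
        if not masked.any():
            break
        logits = model(tokens)[gen_slice]          # (span_len, vocab), assumed interface
        probs = logits.softmax(dim=-1)
        conf, pred = probs.max(dim=-1)             # confidence and argmax token per position
        conf[~masked] = -1.0                       # already-filled slots stay fixed
        rho = math.cos(math.pi / 2 * (step + 1) / T)   # fraction still masked after this round
        n_fill = int(masked.sum().item()) - int(rho * span_len)
        if n_fill <= 0:
            continue
        top = conf.topk(n_fill).indices            # most confident masked positions
        span = tokens[gen_slice]                   # view into the full sequence
        span[top] = pred[top]
    return tokens
```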

5. Experiments and results

5.1 Benchmarks

| Benchmark | Setting | Metric |
| --- | --- | --- |
| CALVIN | 4 environments A/B/C/D, 34 tasks, 1000 language instructions; each model is evaluated on 500 rollouts, each rollout containing 5 consecutive sub-tasks. | Average completed length (avg. len.), maximum 5. |
| LIBERO | Four suites: Spatial, Object, Goal, Long; 10 tasks per suite, 50 rollouts per task. | Per-suite success rate and average success rate. |
| SimplerEnv-WidowX | Real-to-sim environment; tasks include Put Spoon, Put Carrot, Stack Block, Put Eggplant; lighting, texture, color, and viewpoint are varied. | Per-task success rate and overall. |
| Real-world | UR5e + Inspire RH56E2 hand + D435i wrist camera + Gemini 336L static camera; three task types: stacking bowls, putting blocks, flipping towers. | 30 evaluations per task per method, seen/unseen settings. |

5.2 Main Results in Simulation

| Benchmark | UD-VLA results | Key comparisons |
| --- | --- | --- |
| CALVIN ABCD->D | Success rates for 1/2/3/4/5 consecutive sub-tasks are 0.992/0.968/0.936/0.904/0.840; Avg. Len. 4.64. | Higher than MDT 4.52, UP-VLA 4.42, MODE 4.39, UniVLA* 4.26, GR-1 4.21. |
| LIBERO | Spatial 94.1%, Object 95.7%, Goal 91.2%, Long 89.6%, Average 92.7%. | Average higher than DreamVLA 92.6%, FlowVLA 88.1%, and $\pi_0$-FAST 85.5%; the Long suite's 89.6% is the highest in the table. |
| SimplerEnv-WidowX | Put Spoon 58.3%, Put Carrot 62.5%, Stack Block 54.1%, Put Eggplant 75.0%, Overall 62.5%. | Overall higher than F1 59.4%, $\pi_0$-FAST 48.3%, and SpatialVLA 42.7%. |

Numerical conflict: the SimplerEnv text states "UD-VLA achieves an average success rate of 59.4%", but in the table UD-VLA's overall is 62.5% and F1's is 59.4%. This report follows the table and treats the in-text figure as a likely typo.

5.3 In-Depth Analysis / Ablation

| Question | Settings | Result | Paper's explanation |
| --- | --- | --- | --- |
| Is the attention mechanism critical? | Causal / Bidirectional / Hybrid | Avg. Len. 4.04 / 4.32 / 4.64 | Intra-block bidirectional attention suits image spatial consistency and action-dimension correlation; full cross-modal bidirectional attention leaks information, so hybrid is best. |
| Are future images more useful as a reconstruction target than current images? | Null / Current Image / Future Image | Avg. Len. 4.21 / 4.39 / 4.64 | Reconstructing the current image strengthens fine-grained perception but only learns static information; future images provide temporal dynamics and action-planning cues. |
| Is JD3P better than other decoding mechanisms? | AR / Jacobi / Independent Diffusion / JD3P | Avg. Len. 4.18 / 4.16 / 4.35 / 4.64; speed 50.2 / 101.6 / 144.4 / 219.3 tokens/s | Joint denoising lets actions repeatedly benefit from intermediate image-denoising states; independent diffusion limits information flow. |

5.4 Real-World Experiment

real world setup and results
Figure 4. Real robot system: UR5e, Inspire RH56E2 hand, D435i wrist camera, Gemini 336L static camera, and real task results.

The real-world data covers three task types: stacking bowls, putting blocks into a box, and flipping towers/towels. Each task type involves objects of different colors and shapes, and data is collected against three backgrounds; the main text reports 200 trajectories per category at 15 Hz, while the appendix reports 600+ trajectories in total. Evaluation includes seen and unseen settings; unseen covers new objects and new backgrounds.

real scene details
Figure 5. Real-world setting in Appendix: target objects, distractors, seen/unseen object split, and seen/unseen backgrounds.
real world task demos
Figure 6. Three types of real task demonstrations: put block into box, stack bowls, fold towel/tower.

The paper reports that UD-VLA outperforms GR00T N1 and UniVLA on all real tasks, with a success rate above 80% on each task. The authors explain that on seen tasks, action quantization improves action accuracy and joint denoising ensures action quality; on unseen tasks, future image generation improves visual generalization to unseen targets and backgrounds.

5.5 Future Image Generation Visualization

future image generation visualization
Figure 7. Future image generation visualization: CALVIN and real-world comparison of GT and Generation. The authors acknowledge that pixel-level fidelity is not high, but believe that task progress information is sufficient to guide control.
tiny toy demo
Figure 8. The appendix shows an example of fine manipulation of a tiny toy with a radius of approximately 1.5 cm.

6. Analysis and discussion within the paper

6.1 Mechanism explanation given by the author

6.2 Generation limitations acknowledged by the authors

The paper explicitly acknowledges in its visual analysis that the generated future frames lack visual fidelity, especially fine-grained details such as robot arms and backgrounds. This is attributed to the absence of large-scale generative pretraining and to compressing images into fewer tokens for efficiency. The authors conclude that although pixel-level accuracy is limited, the generated results still reliably convey task progress and are sufficient to serve action planning.

7. Analysis, Limitations and Boundaries

7.1 The most valuable part of this paper

According to the paper's own evidence, the most valuable contribution is advancing future visual reasoning from an auxiliary training objective to an intermediate structure that is jointly optimized during inference. With JD3P, action tokens are not generated after reading the future image once; they receive guidance from the intermediate states of the future image tokens at every step of discrete denoising. In the CALVIN ablations, Future Image improves over Null from 4.21 to 4.64, JD3P improves over AR from 4.18 to 4.64, and decoding speed rises from 50.2 to 219.3 tokens/s. This is the paper's most direct evidence for the design.

7.2 Why the results hold up

The results are supported by three kinds of simulation benchmarks, a real robot, and multiple groups of ablations: CALVIN checks long-horizon language-conditioned manipulation, LIBERO checks spatial/object/goal/long-horizon generalization, and SimplerEnv checks real-to-sim transfer; the real experiments also cover seen/unseen objects and backgrounds. The ablations do not merely compare the full model against a weak baseline; they separately isolate attention, the visual generation target, and the decoding mechanism, which makes the causal chain "hybrid attention + future image + JD3P" clearer.

7.3 Limitations

7.4 Applicable boundaries

The paper's evidence mainly covers language-conditioned manipulation benchmarks and controlled real-world tabletop tasks. It has not been shown to apply to long-horizon mobile manipulation, complex force-controlled assembly, high-speed dynamic manipulation, open-environment safety constraints, or other robot embodiments. The real hardware uses a UR5e + dexterous hand + RGB-D/static cameras; the action tokenizer, chunk size, camera views, and controller may need retuning for other embodiments.

8. Reproducibility Audit

| Reproducibility element | Information given in the paper/project | Audit status |
| --- | --- | --- |
| Source and figures | arXiv provides the LaTeX source, and all figures are standalone PDFs; this report converted them to PNG. | Verifiable |
| Code | Official GitHub linked from the project page: OpenHelix-Team/UD-VLA. | Found |
| Checkpoint | CALVIN ABCD-D Hugging Face checkpoint linked from the project page: UD-VLA_CALVIN_ABCD_D. | Found |
| Benchmark protocol | Tasks, rollouts, and metrics for CALVIN, LIBERO, and SimplerEnv are given in the main text; baselines are introduced one by one in the appendix. | Fairly complete |
| Training details | Action chunk sizes, GPU counts, training durations, and real-world batch size/lr/weight decay are given. | Locally inconsistent; needs verification against the code |
| Method formulas | JD3P forward mask noising, reverse factorization, masked CE loss, and confidence-guided decoding formulas are complete. | Fairly complete |
| Real data | Real experiments use 600+ self-collected trajectories; the paper does not mention a public release. | Real experiments are hard to fully reproduce |
Coverage self-check: Abstract, Introduction, Related Works, Methods, Experiments, Conclusion, Acknowledgment, Reproducibility statement, and Appendix A-G have all been covered; the appendix's loss formulations, training details, real-world setup/tasks, baselines, and visualizations have been folded into the corresponding sections of this report.