Unified Diffusion VLA: Vision-Language-Action Model via Joint Discrete Denoising Diffusion Process

Authors: Jiayi Chen, Wenxuan Song, Pengxiang Ding, Ziyang Zhou, Han Zhao, Feilong Tang, Donglin Wang, Haoang Li

Organization: The Hong Kong University of Science and Technology (Guangzhou), Westlake University, Zhejiang University, Monash University

Publication format: ICLR 2026 conference template / arXiv preprint, 2025

Links: arXiv: 2511.01718 | PDF | Project Page | Code | CALVIN checkpoint

Source code description: arXiv source code includes complete Appendix: LLM usage, future image visualization, real-world setup/tasks, loss formulations, training details, baselines, discrete vs continuous diffusion comparison.

1. Quick overview of the paper

One-sentence summary: UD-VLA unifies language, current images, future images and actions into a discrete token space, and uses the Joint Discrete Denoising Diffusion Process to generate future images and actions in parallel in the same synchronized denoising trajectory, allowing actions to continuously utilize future visual information in each step of denoising.

Reading targeting item	compact conclusion
What should the paper solve?	Existing unified VLA either relies on external vision/action experts, or although it tokenizes vision and action, it retains separate decoding between image generation and action prediction, resulting in insufficient guidance of action by future images and slow reasoning.
The author's approach	Use VQ visual tokenizer and FAST action tokenizer to convert images and actions into discrete tokens; use hybrid attention to maintain cross-modal directionality; use JD3P to synchronously denoise future image tokens and action tokens in the same discrete diffusion process.
most important results	The average length of CALVIN ABCD->D is 4.64; the average success rate of LIBERO is 92.7%; the SimplerEnv-WidowX table overall is 62.5%; the JD3P decoding speed is 219.3 tokens/s, which is about 4.3 times higher than AR's 50.2 tokens/s.
Things to note when reading	The core is not "adding future images" itself, but letting the action token repeatedly attend to the intermediate future image token in multiple rounds of denoising; also note that there are numerical inconsistencies between the SimplerEnv text and the table.

Difficulty rating: ★★★★★. Need to be familiar with VLA, visual chain-of-thought, VQ tokenization, FAST action tokenization, discrete diffusion / mask-predict, attention mask design and robot imitation learning benchmark.

Keywords: Vision-Language-ActionDiscrete DiffusionJD3PHybrid AttentionFuture Image Generation

Core contribution list

Unified Diffusion VLA. The paper proposes to unify understanding, generation, and acting in one transformer instead of splicing them through external image/action decoder or separation process.
JD3P. The Joint Discrete Denoising Diffusion Process puts future images and actions into the same discrete mask-denoise trajectory, allowing actions to receive guidance from the intermediate results of future images in each round.
Hybrid attention. Bidirectional attention is used inside image blocks and action blocks to maintain the causal direction across blocks to avoid action-to-vision information leakage while retaining the conditional dependence of action on future images.
Training and inference supporting design.The two-stage training first injects future image generation capabilities into VLM, and then jointly trains on the robot action data; prefix KV cache, prefilled special tokens, confidence-guided decoding and decoding space mapping are used during inference.

Figure 1. Overview of UD-VLA: architecture, two-stage training, and JD3P inference. The output includes fixed-length future image tokens and variable-length action tokens.

2. Motivation

2.1 Problems to be solved

The VLA's task is to read in natural language instructions and visual observations and output actions that can be performed in the physical world. Recently, unified VLA has begun to incorporate future images into the understanding-acting loop: first predict the future visual state, and then convert the action prediction into the inverse kinematics problem of "how to achieve this future state."

The paper believes that true unification is not just to put the output modality in the same sequence, but to make visual generation and action generation gain each other. If the future image is only used as an auxiliary task in the training stage, or if image generation and action generation are performed separately during inference, the action token can only absorb future visual information to a limited extent.

2.2 Limitations of existing methods

paradigm	Typical approach	Issues pointed out in the paper
External experts unify modalities	GR-1, SEER, DreamVLA, F1, UP-VLA, etc. use additional encoder/decoder or diffusion experts to generate vision/action.	Module separation may lead to misalignment, higher complexity, and weak coupling of vision generation and action prediction.
Unify input and output token space but separate decoding	CoT-VLA, WorldVLA, UniVLA, etc. unify images and actions at the token layer.	Images and actions are still not the same joint decoding process; some methods only decode actions during inference, and the predictive value of images in training is not explicitly retained.
AR generates images and actions	Autoregressively generate future images/actions in token order.	Each action token usually only undergoes context calculation once, and the image does not provide sufficient guidance for the action; the image token itself is not naturally suitable for strict next-token order.

2.3 High-level ideas of this article

The core assumption of UD-VLA is that generating future images and generating actions should be co-optimized in the same simultaneous denoising process. As the future image is restored from coarse to fine, the action token is also gradually restored from the mask/noise state; each action can attend to the future image in the current denoising stage, thereby obtaining continuous and sufficient visual guidance.

3. Related work context

Technical line	Positioning in the paper	UD-VLA Differences
Visual prediction for manipulation	UniPi, VLP, IPA, etc. decompose sequence decision-making into visual prediction and action/inverse dynamics.	UD-VLA does not just predict visual plans, but uses discrete diffusion within the VLM backbone to jointly generate future images and actions.
Unified VLA	GR-1, SEER, DreamVLA, F1, UP-VLA, CoT-VLA, WorldVLA, UniVLA, etc. explore the unification of future visual state and action.	The paper emphasizes its "single joint decoding process": future image/action is no longer decoded separately.
Discrete diffusion VLA	PD-VLA, Discrete Diffusion VLA, LLADA-VLA, and CEED-VLA use masked/token denoising to improve motion decoding efficiency.	These methods mainly focus on action prediction; UD-VLA incorporates both visual tokens and action tokens into joint discrete diffusion.

4. Detailed explanation of method

4.1 Unified Tokenization

UD-VLA converts language, vision, and actions into discrete tokens and spells them into a single sequence. Language token design follows Emu3; current and future images are discretized by VQ tokenizer; actions are represented by FAST tokenizer. Author's use <BOI>/<EOI> To mark an image block, use <BOA>/<EOA> Mark action block.

This sequence puts the input and output in the same token space: the first half is the condition, and the second half is the future vision and action the model wants to generate.

$$[\; \text{text tokens}; \; \text{current image tokens}; \; \text{future image tokens}; \; \text{action tokens}\; ]$$

text tokens	Verbal instructions are the input for understanding the intent of the task.
current image tokens	The current observed image token is the visual condition.
future image tokens	The future visual state to be generated by the model, fixed length.
action tokens	The action sequence to be generated by the model, variable length, obtained by FAST action tokenizer.

4.2 Hybrid Attention

The basic rules of Hybrid attention are: text/current image is used as understanding input; future image block and action block are output. Bidirectional attention is allowed within the block to allow full interaction between image tokens or action dimensions; causal direction is maintained between blocks, so that the future image can only see the input, and the action can see the input and the future image, but action information is not allowed to flow back to the vision.

Figure 2. Hybrid attention. The paper emphasizes that action-to-vision is prohibited to avoid coarse action information leakage and error accumulation.

Intuitive understanding: The patches/tokens within an image do not have strict sequence of cause and effect, and the different spatial dimensions of actions do not have strict token order, so bidirectional within a block is more appropriate; but there is directionality between "current observation - future image - action", so causality is required across blocks.

4.3 Joint Discrete Denoising Diffusion Process

JD3P combines the future image tokens $\mathbf{v}_0$ and the action tokens $\mathbf{a}_0$ into a sequence. Forward noising does not add Gaussian noise, but replaces the token with a special mask token $\mathrm{M}$ with probability $\beta_t$; reverse denoising predicts the original token at the masked position.

"Noising" in discrete diffusion is to gradually cover the token; "denoising" is to restore the covered image/action token according to the context.

$$\mathbf{Q}_t\mathbf{e}_{t, r}=(1-\beta_t)\mathbf{e}_{t, r}+\beta_t\mathbf{e}_{\mathrm{M}}$$

$\mathbf{e}_{t, r}$	The one-hot token at position $r$ may come from a future image or action.
$\beta_t$	The probability that step $t$ is replaced by mask.
$\mathbf{e}_{\mathrm{M}}$	One-hot basis of mask token.

Reverse decomposition into two parts: visual recovery and movement recovery:

$$p_\theta(\mathbf{v}_{t-1}, \mathbf{a}_{t-1}\mid\mathbf{v}_t, \mathbf{a}_t, \mathbf{c}) =p_\theta(\mathbf{v}_{t-1}\mid\mathbf{v}_t, \mathbf{c})\; p_\theta(\mathbf{a}_{t-1}\mid\mathbf{v}_t, \mathbf{a}_t, \mathbf{c})$$

where $\mathbf{c}$ is text and current image. Note that the action distribution is explicitly conditioned on $\mathbf{v}_t$, so each action recovery step can take advantage of future visual tokens at the current stage.

4.4 Loss Function

During training, the author does not explicitly expand the complete multi-step diffusion chain, but uses a single-step mask-predict objective: randomly sampling mask ratio $\rho_t$, covering part of the clean future image/action token, and only calculating cross-entropy at the blocked position.

The number of visual tokens may be much greater than that of action tokens, so the paper uses $\omega$ to reduce the visual loss weight to avoid visual items dominating the training.

$$\mathcal{L}_{\text{CE}}(\theta)= -\omega\sum_j^{L_v}\log p_\theta^{(v)}(v_{0, j}\mid\mathbf{v}_t, \mathbf{c})\mathbb{1}\{v_{t, j}=\mathrm{M}\} -\sum_i^{L_a}\log p_\theta^{(a)}(a_{0, i}\mid\mathbf{v}_t, \mathbf{a}_t, \mathbf{c})\mathbb{1}\{a_{t, i}=\mathrm{M}\}$$

Appendix Loss Formulations Four types of losses, namely MSE, continuous diffusion, discrete diffusion, and next-token prediction, are also defined in Table 1 for comparing different VLAs. The most critical thing in the report is $\mathcal{L}_{\mathrm{Diff\text{-}disc}}$: it is masked-token prediction on a limited vocabulary, rather than noise prediction in continuous space.

Figure 3. Discrete diffusion versus continuous diffusion in the appendix. UD-VLA chooses discrete diffusion because both image tokens and action tokens have been mapped into a limited codebook.

4.5 Two-stage training

Stage (i): world-model style post-training. Initialized from the pretrained VLM backbone, future image predictions are trained on large-scale video data with the sequence [text; current image; future image], The goal is to infuse VLA with the ability to model future, states.
Stage (ii): robot action fine-tuning. Use full sequences on downstream robot motion data [text; current image; future image; action], jointly train image generation and action generation according to JD3P.

Appendix Training Details Given training resources: CALVIN-ABCD action chunk 10, 8 H100s, training takes about 24 hours; LIBERO four suite joint training, action chunk 10, 8 H100s, about 30 hours; SimplerEnv uses Bridge data training, action chunk 5, 8 H100s, about 30 hours; real-world experiments collect 600+ trajectories, action chunk 8, 8 H100s, about 8 hours. Another configuration is also written in the real-world task section of the text: 4 H100 training for 24 hours, 9000 steps, batch size 64, learning rate 8e-5, weight decay 0.1. The two GPU/duration descriptions are not completely consistent, and should be further checked with the official code configuration when reproducing.

4.6 Reasoning

Inference starts from all masked future image/action tokens and repeats for a small number of iterations. Each round predicts the distribution of all mask positions in parallel, selects a part of the most reliable positions to fill in tokens based on confidence, and continues to mask the rest. The mask ratio changes from high to low using cosine schedule.

Algorithm: UD-VLA inference with JD3P Input: text tokens, current image tokens Initialize: future image tokens = [MASK] * Lv action tokens = [MASK] * La prefill <BOI>, <EOI>, <BOA>; cache prefix KV for t = T... 1: predict token distribution for masked vision/action positions in parallel restrict vision positions to visual codebook and action positions to action codebook compute confidence = max probability for each masked position fill top (1 - rho_t) masked positions using GumbelMax sampling if <EOA> appears, fix action length and mask later action slots Output: future image tokens, action tokens

5. Experiments and results

5.1 Benchmarks

Benchmark	settings	indicator
CALVIN	4 environments A/B/C/D, 34 tasks, 1000 language instructions; each model evaluates 500 rollouts, and each rollout has 5 consecutive sub-tasks.	Average finished length avg. len., maximum 5.
LIBERO	There are four suites of Spatial, Object, Goal and Long; each suite has 10 tasks and each task has 50 rollouts.	The success rate and average success rate of each suite.
SimplerEnv-WidowX	Real-to-sim environment, tasks include Put Spoon, Put Carrot, Stack Block, Put Eggplant; change lighting, texture, color, and perspective.	Single task success rate and overall.
Real-world	UR5e + Inspire RH56E2 hand + D435i wrist camera + Gemini 336L static camera; three types of tasks: stacking bowls, putting blocks, flipping towers.	Each method is evaluated 30 times per task, seen/unseen setting.

5.2 Main Results in Simulation

Benchmark	UD-VLA results	Key contrasts
CALVIN ABCD->D	1/2/3/4/5 company mission success rate is 0.992/0.968/0.936/0.904/0.840, Avg. Len. 4.64.	Higher than MDT 4.52, UP-VLA 4.42, MODE 4.39, UniVLA* 4.26, GR-1 4.21.
LIBERO	Spatial 94.1%, Object 95.7%, Goal 91.2%, Long 89.6%, Average 92.7%.	The Average in the table is higher than DreamVLA 92.6%, FlowVLA 88.1%, and $\pi_0$-FAST 85.5%. Long suite is 89.6%, the highest in the table.
SimplerEnv-WidowX	Put Spoon 58.3%, Put Carrot 62.5%, Stack Block 54.1%, Put Eggplant 75.0%, Overall 62.5%.	Overall in the table is higher than F1 59.4%, $\pi_0$-FAST 48.3%, and SpatialVLA 42.7%.

Value conflict: The text of SimplerEnv reads "UD-VLA achieves an average success rate of 59.4%", but in the table, UD-VLA overall is 62.5% and F1 is 59.4%. This report is summarized in tables and text figures are treated as potential clerical errors.

5.3 In-Depth Analysis / Ablation

question	settings	result	Paper explanation
Is the Attention mechanism critical?	Causal / Bidirectional / Hybrid	Avg. Len. 4.04 / 4.32 / 4.64	Intra-block bidirectional is suitable for image spatial consistency and action dimension correlation; cross-modal full bidirectional will leak information, so hybrid is best.
Are future images more useful for reconstruction than current images?	Null / Current Image / Future Image	Avg. Len. 4.21 / 4.39 / 4.64	Current image reconstruction enhances fine-grained perception, but only learns static information; future images provide temporal dynamics and action planning clues.
Is JD3P better than other decoding mechanisms?	AR / Jacobi / Independent Diffusion / JD3P	Avg. Len. 4.18 / 4.16 / 4.35 / 4.64; speed 50.2 / 101.6 / 144.4 / 219.3 tokens/s.	Joint denoising allows actions to repeatedly benefit from intermediate image denoising states; independent diffusion has limited information flow.

5.4 Real-World Experiment

Figure 4. Real robot system: UR5e, Inspire RH56E2 hand, D435i wrist camera, Gemini 336L static camera, and real task results.

Real-world data includes three types of tasks: stacking bowls, putting blocks into a box, and flipping towers/towels. Each type of task contains objects of different colors and shapes, and the data is collected in three backgrounds; the main text contains 200 trajectories for each category, 15 Hz, and the appendix contains a total of 600+ trajectories. The evaluation includes seen and unseen; unseen includes new objects and new backgrounds.

Figure 5. Real-world setting in Appendix: target objects, distractors, seen/unseen object split, and seen/unseen backgrounds.

Figure 6. Three types of real task demonstrations: put block into box, stack bowls, fold towel/tower.

The paper text reports that UD-VLA outperforms GR00T N1 and UniVLA on all real tasks, and the success rate of each task exceeds 80%. The author explains that in seen tasks, action quantization improves action accuracy, and joint denoising ensures action quality; in unseen tasks, future image generation improves visual generalization to unseen targets and backgrounds.

5.5 Future Image Generation Visualization

Figure 7. Future image generation visualization: CALVIN and real-world comparison of GT and Generation. The authors acknowledge that pixel-level fidelity is not high, but believe that task progress information is sufficient to guide control.

Figure 8. The appendix shows an example of fine manipulation of a tiny toy with a radius of approximately 1.5 cm.

6. Analysis and discussion within the paper

6.1 Mechanism explanation given by the author

Future image is a visual CoT.In the CALVIN results, the authors view explicit future image generation as a more efficient chain-of-thought because action prediction can be conditioned on future visual outcomes.
Multi-step diffusion facilitates visual-motor information exchange.Compared with single-step filled-token prediction, JD3P allows images and actions to continuously interact through multiple rounds of intermediate states.
Image generation requires bidirectional attention.The paper believes that the main relationship between image tokens is spatial consistency, which is not suitable for strict causal next-token; there is no natural causal order in different dimensions of actions.
Discrete diffusion balances efficiency and performance.The speed of AR is 50.2 tokens/s; JD3P is 219.3 tokens/s, and Avg. Len. is the highest.

6.2 Limitations of author's self-report generation

The paper explicitly acknowledges in the visual analysis that generating future frames lacks visual fidelity, especially fine-grained details such as robotic arms and backgrounds. This is attributed to the lack of large-scale generative pretraining and the use of compressed images with fewer tokens for efficiency. The authors concluded that although pixel-level accuracy is difficult, the generated results still reliably express task progress and are sufficient to serve action planning.

7. Analysis, Limitations and Boundaries

7.1 The most valuable part of this paper

According to the paper's own evidence, the most valuable part is to advance future visual reasoning from "auxiliary training goals" to "intermediate structures that are jointly optimized during reasoning." JD3P allows the action token not to be generated after reading the future image at once, but to receive the intermediate state guidance of the future image token in each step of discrete denoising. During CALVIN ablation, Future Image compared to Null ranged from 4.21 to 4.64, JD3P compared to AR ranged from 4.18 to 4.64, and the speed ranged from 50.2 to 219.3 tokens/s. This is the paper's most direct evidence supporting the design.

7.2 Why the results hold up

The results are supported by three types of simulation benchmarks, real robots, and multiple groups of ablations: CALVIN checks long sequence language operations, LIBERO checks spatial/object/goal/long generalization, and SimplerEnv checks real-to-sim transfer; real experiments also cover seen/unseen objects and backgrounds. The ablation does not only compare the complete model and the weak baseline, but separates the attention, visual generation target, and decoding mechanism respectively, which makes the causal chain of "hybrid attention + future image + JD3P" clearer.

7.3 Limitations

Future image quality is limited.The authors explicitly admit that the predicted frames are not detailed enough, especially the robotic arm and background. If the task requires precise geometry, occlusion relationships, or contact states, a low-fidelity future image may be insufficient.
The real world is still small.The real experiment consists of three types of tasks and 600+ trajectories. Although it includes unseen settings, it is not enough to prove large-scale universality across robots, cross-scenarios, and cross-skills.
There are inconsistencies between the text and the table.SimplerEnv overall conflicts in the main text and the table; the training resources are also described differently in the real-world section of the main text and the appendix training details. You need to check the official code when reproducing.
Depends on discrete tokenization quality.Both vision and action are compressed into discrete tokens, and quantization errors in the VQ codebook and FAST action tokenizer may limit fine-grained control.

7.4 Applicable boundaries

The paper evidence mainly covers language condition manipulation benchmark and controlled real desktop tasks. It has not yet been proven that it is suitable for long-term mobile operations, complex force-controlled assembly, high-speed dynamic operations, open environment safety constraints, or no robot form. Real hardware uses UR5e + dexterous hand + RGB-D/static camera; action tokenizer, chunk, camera views and controller may need to be retuned when changing to other embodiments.

8. Reproducibility Audit

recurring elements	Information given in the paper/project	Audit status
Source code and diagrams	arXiv provides LaTeX source code, and all figures are independent PDFs; this report has been converted to PNG.	Checkable
code	The official GitHub is available on the project page: OpenHelix-Team/UD-VLA.	Found
Checkpoint	CALVIN ABCD-D Hugging Face checkpoint is available on the project page: UD-VLA_CALVIN_ABCD_D.	Found
Benchmark protocol	The tasks, rollout and indicators of CALVIN, LIBERO and SimplerEnv are given in the main text; the baselines are introduced item by item in the appendix.	more complete
training details	Information such as action chunk, number of GPUs, training duration, real experimental batch/lr/weight decay and other information are given.	There are local inconsistencies and code verification is required.
method formula	JD3P forward mask noising, reverse factorization, masked CE loss, confidence-guided decoding formulas are complete.	more complete
real data	The real experiment uses 600+ self-collected trajectories; the paper does not specify public downloading.	Real experiments are difficult to completely reproduce

The report covers self-test: Abstract, Introduction, Related Works, Methods, Experiments, Conclusion, Acknowledgment, Reproducibility statement, and Appendix A-G have all been covered; loss, training, real-world setup/tasks, baselines, and visualization in the Appendix have been integrated into the corresponding chapters.