
Causal World Modeling for Robot Control

Method name: LingBot-VA

Authors: Lin Li, Qihang Zhang, Yiming Luo, Shuai Yang, Ruilin Wang, Fei Han, Mingrui Yu, Zelin Gao, Nan Xue, Xing Zhu, Yujun Shen, Yinghao Xu

arXiv: 2601.21998; v1 was submitted on 2026-01-29, and the current version v2 was revised on 2026-03-22; the topics are cs.CV and cs.RO

Project: Project Page; GitHub; Base Checkpoint; RoboTwin Checkpoint; LIBERO-Long Checkpoint

1. Quick overview of the paper

One-sentence summary: LingBot-VA puts the video world model and robot action inference into the same causal autoregressive diffusion framework: it first "imagines" the future in latent video space, then decodes actions via inverse dynamics, and uses KV caching, a teacher-forcing causal mask, partial denoising, and asynchronous execution to support closed-loop robot control.
What the paper addresses: Most existing VLAs output actions directly and reactively from the current image. Visual understanding, physical dynamics, and motion control are squeezed into one supervision signal, which leads to low sample efficiency and poor generalization. Existing world models typically rely on open-loop or chunk-based bidirectional diffusion, making it hard to continuously incorporate real feedback.
The authors' approach: Use the Wan2.2-5B video diffusion backbone as the video stream, add a narrower action stream, and let the two interact through a Mixture-of-Transformers; interleave video tokens and action tokens into a single causal sequence over time, and train visual dynamics and inverse dynamics jointly with flow matching.
Most important results: Average SR on the 50 RoboTwin 2.0 tasks reaches 92.93% Easy / 91.55% Hard; average SR on LIBERO reaches 98.5%; on 6 real-world tasks adapted with only 50 demos per task, the average progress score is about 79.2% and the average success rate about 59.2%, both higher than $\pi_{0.5}$'s roughly 65.4% / 39.2%.
Things to note when reading: The main text states that both real-world metrics are better overall, but the per-task table in the appendix shows that $\pi_{0.5}$'s progress score on Fold Clothes is higher than LingBot-VA's; the report below presents the original values from that table.

World Model Flow Matching Autoregressive Diffusion Mixture-of-Transformers Inverse Dynamics Asynchronous Control

LingBot-VA teaser
Figure 1. The overall positioning of the paper: LingBot-VA is pre-trained using large-scale video and robot action data, evaluated on real-world tasks and simulation benchmarks, and demonstrates few-sample adaptation, sequential memory and generalization capabilities.


2. Motivation and related work

2.1 Representation entanglement of existing VLA

The paper argues that current VLAs typically adopt a feedforward policy: mapping the current visual observations and language instructions directly to action sequences. This compresses visual semantics, physical dynamics, and low-level motion control into the same representation and the same action supervision signal. The authors call this representation entanglement and point out two resulting problems: low sample efficiency, and limited generalization to new scenes, new objects, and long-horizon tasks.

2.2 Three types of limitations of existing world model / video policy

2.3 Relationship with related work

| Direction | Positioning in the paper | How LingBot-VA differs |
| --- | --- | --- |
| Vision-Language-Action policies | $\pi_0$, $\pi_{0.5}$, GR00T-N1, OpenVLA, etc. are pre-trained from VLMs/VLAs and fine-tuned on robot demonstrations. | Does not only learn a reactive observation-to-action mapping, but explicitly trains video dynamics and inverse dynamics, and maintains a causal history during execution. |
| World models for robotic control | Three categories: latent-space, 3D point cloud, and 2D pixel/video; the paper focuses on video world models that predict future frames and generate actions conditioned on them during execution. | Uses a KV cache and causal mask to continuously ingest real observations, and avoids waiting for complete high-quality video generation through partial denoising. |
| Video-action generative policies | UVA, UWM, Motus, Gen2Act, Act2Goal, etc. demonstrate joint video-action generation or video sub-goal ideas. | Emphasizes a causal autoregressive sequence and persistent memory rather than bidirectional chunk generation or offline video subgoals. |

3. Detailed explanation of method

3.1 From reactive policy to world-model-first policy

Ordinary VLAs use $\pi_\theta(\cdot \mid o_t)$ to predict actions directly. LingBot-VA splits control into two stages: first predict the future visual state, then infer the action from the current state and the predicted future state. This decomposition lets Stage 1 learn physics priors from large-scale video data, while Stage 2 uses robot data to map visual changes onto executable actions.

LingBot-VA framework
Figure 2. LingBot-VA's video-action interleaving: given the task language and initial observation, the video stream predicts future visual latents and the action stream decodes the corresponding actions; subsequent observations and actions keep entering the same autoregressive sequence.

3.2 Preliminary knowledge of Flow Matching

The paper uses continuous latent diffusion / flow matching. Given a data sample $x_1$ and noise $\epsilon \sim \mathcal{N}(0, I)$, the model learns a continuous-time vector field that pushes the noise along the path to the data distribution.

Intuitive understanding: The model does not predict the final sample in one step, but learns which direction each noise state should move.

$$\frac{dx^{(s)}}{ds}=v_s(x^{(s)}), \quad x^{(0)}=\epsilon$$ $$\mathcal{L}_{\text{FM}}=\mathbb{E}_{s, \epsilon, x_1}\left[\|v_\theta(x^{(s)}, s)-\dot{x}^{(s)}\|^2\right]$$

Commonly used linear interpolation $x^{(s)}=(1-s)\epsilon+s x_1$, so the true speed $\dot{x}^{(s)}=x_1-\epsilon$. During inference, integrate from the noise to $s=1$.
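A minimal sketch of this objective and of Euler-step inference on the linear path (generic tensors and a toy stand-in velocity model, not the paper's networks; the `s_end` argument anticipates the partial denoising used in Section 3.6):

```python
import torch

def flow_matching_loss(model, x1):
    """One flow-matching step on the linear path x_s = (1-s)*eps + s*x1."""
    eps = torch.randn_like(x1)                             # Gaussian noise sample
    s = torch.rand(x1.shape[0], *([1] * (x1.dim() - 1)))   # per-sample time s in [0, 1]
    x_s = (1.0 - s) * eps + s * x1                         # point on the interpolation path
    v_target = x1 - eps                                    # true velocity of the linear path
    v_pred = model(x_s, s)                                 # model predicts the velocity field
    return ((v_pred - v_target) ** 2).mean()

def euler_integrate(model, shape, steps=10, s_end=1.0):
    """Inference: integrate the learned ODE from noise at s=0 toward s=s_end."""
    x = torch.randn(shape)
    ds = s_end / steps
    for i in range(steps):
        s = torch.full((shape[0],) + (1,) * (len(shape) - 1), i * ds)
        x = x + ds * model(x, s)
    return x

# Toy usage with a stand-in velocity model (a real model would be a neural network).
toy_model = lambda x, s: torch.zeros_like(x)
loss = flow_matching_loss(toy_model, torch.randn(8, 16))
sample = euler_integrate(toy_model, (8, 16), steps=3, s_end=0.6)  # partial denoising, cf. Sec. 3.6
```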

3.3 Autoregressive Video-Action World Modeling

The core idea is to put visual latent and action tokens into a causal sequence that unfolds in time. Each autoregressive step predicts a video chunk and decodes the corresponding action chunk at the same time; chunks can be generated internally in parallel, and causal dependency is maintained between chunks.

Visual dynamics prediction: The next video latent is determined by past vision and past actions.

$$z_{t+1:t+K} \sim p_\theta(\cdot \mid z_{\leq t}, a_{<t}, c)$$

where $z_t$ is the visual latent token encoded by the Wan2.2 causal VAE, $a_t$ is the action token obtained by MLP projection of the action vector, and $K$ is the video chunk length, randomly sampled during training and set to $K=4$ at deployment.

Action decoding: given the predicted future visual state, infer the actions that realize that visual transition.

$$a_{t:t+K-1} \sim g_\psi(\cdot \mid \hat{z}_{t+1:t+K}, z_{\leq t}, a_{<t}, c)$$

Here $g_\psi$ is the inverse dynamics model. It conditions not only on the current and predicted next frames, but also on the historical actions and observations, in order to preserve the embodiment state and the multi-step task context.
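Read together, the two equations describe one autoregressive step. Below is a structural sketch with hypothetical stubs standing in for $p_\theta$ and $g_\psi$ (the real models are the flow-matching transformers of Section 3.4):

```python
import torch

def video_dynamics_stub(z_hist, a_hist, text, K, latent_dim=64):
    # Hypothetical stand-in for p_theta: predict the next K visual latents.
    return torch.randn(K, latent_dim)

def inverse_dynamics_stub(z_future, z_hist, a_hist, text, act_dim=30):
    # Hypothetical stand-in for g_psi: decode the K actions realizing the predicted visual change.
    return torch.zeros(z_future.shape[0], act_dim)

def autoregressive_step(z_hist, a_hist, text, K=4):
    """One AR step: imagine a K-frame video chunk, decode the matching action chunk,
    then append both to the causal history used by the next step."""
    z_next = video_dynamics_stub(z_hist, a_hist, text, K)        # z_{t+1:t+K} ~ p_theta(.|z_<=t, a_<t, c)
    a_next = inverse_dynamics_stub(z_next, z_hist, a_hist, text)  # a_{t:t+K-1} ~ g_psi(.|z_hat, z_<=t, a_<t, c)
    return z_hist + [z_next], a_hist + [a_next]

z_hist, a_hist = [torch.randn(1, 64)], []   # initial observation latent, no actions yet
z_hist, a_hist = autoregressive_step(z_hist, a_hist, text="fold the towel")
```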

3.4 Unified architecture: asymmetric dual-stream MoT

| Module | Paper setting | Design purpose |
| --- | --- | --- |
| Video stream | Initialized from Wan2.2-5B; hidden dimension $d_v=3072$, 30 Transformer layers. | Inherits the visual dynamics priors of a large-scale video generation model. |
| Action stream | The same 30 layers, but with width $d_a=768$, 4x narrower than the video stream; about 350M additional parameters, about 5.3B total. | The action distribution is lower-dimensional than video and does not need the same capacity; an action-specific feature space is retained. |
| MoT fusion | Video and action streams compute QKV separately; action tokens are first projected to the video dimension to participate in joint self-attention, then projected back to the action dimension. | Allows cross-modal interaction while reducing contamination between modality-specific representations. |
| Video sparsification | Video time is downsampled by $\tau=4$, and each video frame is associated with $\tau$ consecutive actions. | Reduces the number of video tokens while retaining high-frequency action control. |
| Action initialization | The action stream is initialized by interpolating pretrained video weights and scaling by $\alpha=\sqrt{d_v/d_a}$. | Stabilizes early training; avoids a large gap between action and video token distributions that would damage joint attention. |
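A structural sketch of the MoT joint-attention row above (my own minimal layer, not the released implementation); the final `alpha` line only notes the initialization scale, while the actual weight interpolation is omitted:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointAttentionMoT(nn.Module):
    """Minimal sketch of the MoT joint-attention idea: each stream keeps its own QKV projection;
    action QKV is lifted to the video width for joint self-attention over the interleaved
    sequence, then the action output is mapped back to the narrow action width."""

    def __init__(self, d_v=3072, d_a=768, n_heads=24):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_v // n_heads
        self.video_qkv = nn.Linear(d_v, 3 * d_v)    # video stream QKV (wide)
        self.action_qkv = nn.Linear(d_a, 3 * d_a)   # action stream QKV (narrow)
        self.a2v = nn.Linear(d_a, d_v)              # lift action Q/K/V to the video dimension
        self.v2a = nn.Linear(d_v, d_a)              # project the action output back down

    def _split_heads(self, x):                      # (B, L, d_v) -> (B, H, L, d_head)
        b, l, _ = x.shape
        return x.view(b, l, self.n_heads, self.d_head).transpose(1, 2)

    def forward(self, z_tokens, a_tokens, attn_mask=None):
        qv, kv, vv = self.video_qkv(z_tokens).chunk(3, dim=-1)
        qa, ka, va = (self.a2v(t) for t in self.action_qkv(a_tokens).chunk(3, dim=-1))
        q = self._split_heads(torch.cat([qv, qa], dim=1))
        k = self._split_heads(torch.cat([kv, ka], dim=1))
        v = self._split_heads(torch.cat([vv, va], dim=1))
        out = F.scaled_dot_product_attention(q, k, v, attn_mask=attn_mask)
        out = out.transpose(1, 2).reshape(z_tokens.shape[0], -1, self.n_heads * self.d_head)
        n_v = z_tokens.shape[1]
        return out[:, :n_v], self.v2a(out[:, n_v:])  # video tokens stay wide; actions return to d_a

# Action-stream initialization (last table row): interpolate video weights down to d_a,
# then rescale by alpha so early activations match the video stream's scale.
alpha = math.sqrt(3072 / 768)  # = 2.0

layer = JointAttentionMoT()
z_out, a_out = layer(torch.randn(1, 192, 3072), torch.randn(1, 4, 768))
```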

3.5 Teacher Forcing and causal attention mask

During training, the entire episode is treated as an interleaved video-action sequence. The model predicts the next token at each position, but the context uses ground-truth history tokens rather than the model's own generated history. The authors argue this is reasonable for robots, since real observations also arrive continuously at deployment time.

Teacher forcing attention mask
Figure 3. teacher forcing causal mask: Each token can only see earlier tokens in time to maintain the physical causal direction.
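A minimal construction of such a mask for interleaved chunks, under the reading that tokens within a chunk may attend to each other (they are generated in parallel) while across time they attend only to earlier chunks; the chunk sizes here are illustrative:

```python
import torch

def chunkwise_causal_mask(chunk_sizes):
    """Boolean attention mask for an interleaved video-action sequence.

    chunk_sizes: number of tokens per temporal chunk, in time order
    (e.g. [video_chunk_0, action_chunk_0, video_chunk_1, ...]).
    Returns True where attention is allowed: a token sees its own chunk and
    everything earlier, never anything later (the teacher-forcing causal mask).
    """
    total = sum(chunk_sizes)
    mask = torch.zeros(total, total, dtype=torch.bool)
    start = 0
    for size in chunk_sizes:
        end = start + size
        mask[start:end, :end] = True   # own chunk plus all preceding chunks
        start = end
    return mask

# Tiny example just to visualize the block structure; a real step would use
# ~192 video tokens and 4 action tokens per frame (see Sec. 5.1).
print(chunkwise_causal_mask([3, 2, 3, 2]))
```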

3.6 Noisy History Augmentation and partial denoising

Video token generation is an inference bottleneck because the number of visual tokens is large and each token requires multiple denoising steps. The authors observe that action prediction does not necessarily require fully denoised, pixel-level vision: partially denoised latents already provide action-relevant structure. Therefore, noise is randomly added to the historical video latents during training, forcing the action decoder to tolerate noisy visual states.

$$\tilde{z}_{\leq t} = \begin{cases} (1-s_{\text{aug}})\,\epsilon + s_{\text{aug}}\, z_{\leq t}, & \text{with probability } p=0.5,\ s_{\text{aug}}\in[0.5, 1] \\ z_{\leq t}, & \text{with probability } 1-p=0.5 \end{cases}$$

At inference, the video tokens do not need to be integrated all the way to $s=1$. The main text argues that integrating to $s=0.5$ is sufficient in theory; in the experimental implementation, the video flow uses a 3-step Euler solver integrated to $s=0.6$, while the action flow still uses 10 steps integrated to $s=1.0$.
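A small sketch of the augmentation rule above (assuming $s_{\text{aug}}$ is drawn uniformly from $[0.5, 1]$, which the text does not state explicitly):

```python
import torch

def noisy_history_augmentation(z_hist, p=0.5, s_min=0.5, s_max=1.0):
    """Training-time augmentation from Sec. 3.6 (a sketch): with probability p, replace the
    clean history latents by a partially noised version (1-s)*eps + s*z, s in [s_min, s_max]."""
    if torch.rand(()) < p:
        s_aug = torch.empty(()).uniform_(s_min, s_max)
        eps = torch.randn_like(z_hist)
        return (1.0 - s_aug) * eps + s_aug * z_hist
    return z_hist

z_hist = torch.randn(10, 192, 64)             # toy history: 10 frames x 192 tokens x latent dim
z_tilde = noisy_history_augmentation(z_hist)  # what the action decoder must learn to tolerate
```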

3.7 KV Cache and asynchronous reasoning

Synchronous inference makes the robot wait while the model generates future video and actions; asynchronous inference lets the robot execute the current action chunk while the model predicts the next segment. The problem is that naive async keeps generating from stale predicted video, causing drift. LingBot-VA adds a Forward Dynamics Model (FDM) grounding step: it uses the most recent real observations and the actions actually executed to re-imagine the current outcome, and then predicts the subsequent vision and actions.

Asynchronous pipeline
Figure 4. Asynchronous pipeline: B-1 is naive async, which tends to keep rolling out stale predictions; B-2 recalibrates the cache with real feedback via FDM grounding.
KV-cache inference:
1. Encode the real observation $o_0 \to z_0$ and initialize the cache $C=\{z_0\}$.
2. Predict the future video chunk by integrating the video flow only to the partial-denoising level.
3. Decode the action chunk by integrating the action flow to $s=1$.
4. Execute the actions and collect real observations.
5. Encode the real observations and append $\{z, a\}$ to the KV cache.
6. Repeat with the persistent causal history.
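The list above can be turned into a closed loop. A minimal Python sketch of the asynchronous variant with FDM grounding, using hypothetical `ModelStub`/`RobotStub` interfaces that stand in for the released server-client API:

```python
import torch

class ModelStub:
    """Hypothetical stand-in for LingBot-VA with a persistent KV cache (not the real API)."""
    def encode(self, obs): return torch.randn(1, 192, 64)
    def predict_chunk(self, text, K): return torch.randn(K, 192, 64), torch.zeros(K, 30)
    def ground(self, z_real, executed_actions): pass  # FDM step: overwrite stale cache entries

class RobotStub:
    def observe(self): return torch.zeros(3, 224, 224)
    def execute_async(self, actions): pass
    def wait_and_observe(self): return torch.zeros(3, 224, 224)

def async_control_loop(model, robot, text, K=4, n_chunks=10):
    model.encode(robot.observe())                      # cache starts from the real first observation
    z_hat, actions = model.predict_chunk(text, K)      # imagine K frames, decode K actions
    for _ in range(n_chunks):
        robot.execute_async(actions)                   # execute the current chunk...
        z_next, a_next = model.predict_chunk(text, K)  # ...while predicting the next one
        obs = robot.wait_and_observe()                 # real feedback once the chunk finishes
        model.ground(model.encode(obs), actions)       # FDM grounding against the real state
        z_hat, actions = z_next, a_next

async_control_loop(ModelStub(), RobotStub(), "make breakfast")
```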

4. Mathematical forms and training objectives

4.1 Training loss

Visual dynamics loss: the model is supervised to predict the flow velocity of the next visual latent given the history, action, and language conditions.

$$\mathcal{L}_{\mathrm{dyn}} = \mathbb{E}_{t, s, z_{t+1}, \epsilon} \left[ \left\| v_\theta\!\left(z_{t+1}^{(s)}, s, \tilde{z}_{\leq t}, a_{<t}, c\right) - \left(z_{t+1} - \epsilon\right) \right\|^2 \right]$$

where $c$ is the language instruction, injected via cross-attention after a frozen T5 text encoder; $\tilde{z}_{\leq t}$ are the historical visual latents, possibly perturbed by noisy history augmentation; and $a_{<t}$ are the historical action tokens, which represent the embodiment trajectory and interaction history.

Action inverse dynamics loss: recover the action flow from the current and next visual latents and the historical actions.

$$\mathcal{L}_{\mathrm{inv}} = \mathbb{E}_{t, s, a_t, \epsilon} \left[ \left\| v_\psi\!\left(a_t^{(s)}, s, \tilde{z}_{\leq t+1}, a_{<t}, c\right) - \left(a_t - \epsilon\right) \right\|^2 \right]$$

The two terms are combined as $\mathcal{L} = \mathcal{L}_{\mathrm{dyn}} + \lambda\,\mathcal{L}_{\mathrm{inv}}$, with $\lambda = 1$ in the experiments. This shows that the paper does not treat the action loss as a small-weight auxiliary term, but puts action inverse dynamics and visual dynamics on equal footing.
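A compact sketch of the joint objective under the linear flow-matching path (toy callables `video_vel` and `action_vel` stand in for $v_\theta$ and $v_\psi$; shapes and helpers are illustrative, not the paper's networks):

```python
import torch

def joint_training_loss(video_vel, action_vel, z_next, a_chunk, z_hist_noisy, a_hist, text, lam=1.0):
    """Sketch of L = L_dyn + lam * L_inv with lam = 1. Both terms are flow-matching losses;
    they differ only in what is noised and what is conditioned on."""
    s = torch.rand(())
    # Visual dynamics: noise the *next* latent, condition on (noisy) history, past actions, language.
    eps_z = torch.randn_like(z_next)
    z_s = (1 - s) * eps_z + s * z_next
    l_dyn = ((video_vel(z_s, s, z_hist_noisy, a_hist, text) - (z_next - eps_z)) ** 2).mean()
    # Inverse dynamics: noise the *action* chunk, condition additionally on the next latent.
    eps_a = torch.randn_like(a_chunk)
    a_s = (1 - s) * eps_a + s * a_chunk
    l_inv = ((action_vel(a_s, s, z_next, z_hist_noisy, a_hist, text) - (a_chunk - eps_a)) ** 2).mean()
    return l_dyn + lam * l_inv

# Toy usage with zero-velocity stand-ins for both streams.
toy_v = lambda x, *args: torch.zeros_like(x)
loss = joint_training_loss(toy_v, toy_v, torch.randn(4, 192, 64), torch.randn(4, 30),
                           torch.randn(8, 192, 64), torch.randn(8, 30), "stack the cups")
```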

4.2 Forward dynamics loss for asynchronous post-training

$$\mathcal{L}_{\mathrm{fdm}} = \mathbb{E}_{t, s, z_{t+1}, \epsilon} \left[ \left\| v_\theta\!\left(z_{t+1}^{(s)}, s, z_t, a_t, \tilde{z}_{<t}, a_{<t}, c\right) - \left(z_{t+1} - \epsilon\right) \right\|^2 \right]$$

This loss corresponds to the grounding step in asynchronous deployment: the model uses the latest true state $z_t$ and the currently executed action $a_t$ to re-predict the visual outcome, reducing the open-loop drift that comes from simply rolling forward along stale, hallucinated video.

5. Experiments and results

5.1 Data and training settings

| Item | Setting |
| --- | --- |
| Pre-training data | Aggregates Agibot, RoboMind, InternData-A1, the OpenVLA subset of OXE, UMI data, RoboCOIN, and internally collected demonstrations; roughly 16K hours of robot manipulation data in total. |
| Unified action representation | Each arm is represented by a 7-dimensional EEF pose, up to 7-dimensional joint angles, and a 1-dimensional gripper state; two arms give $(7+7+1)\times2=30$ dimensions, with missing dimensions zero-filled (see the sketch after this table). |
| Video encoding | Wan2.2 causal VAE with compression ratio $4\times16\times16$; patchify further halves the spatial dimension; multi-view frames are concatenated along the width; $N=192$ spatial tokens per frame. |
| Pre-training | 1.4T tokens; AdamW, peak LR $1\times10^{-4}$, weight decay 0.01, cosine annealing with linear warmup, bf16, gradient clipping 2.0, text dropout 0.1, uniform SNR sampler. |
| Post-training | Adaptation on a small amount of task data; the text states that 50 demos per task are enough for deployment; recommended settings are 3K steps at LR $1\times10^{-5}$, or 1K steps at LR $1\times10^{-4}$. The real-world experiment section reports 500 steps, LR $1\times10^{-4}$, sequence length 150,000. |
| Inference | Video: 3-step Euler solver integrated to $s=0.6$; action: 10 steps to $s=1.0$; video CFG 5.0, action CFG 1.0; deployment chunk size $K=4$. |
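A sketch of the zero-filled packing convention from the action-representation row (hypothetical helper names and field order; the repository's exact layout may differ):

```python
import numpy as np

EEF_DIM, JOINT_DIM, GRIPPER_DIM = 7, 7, 1   # per-arm layout from the table above

def pack_arm(eef_pose=None, joints=None, gripper=None):
    """Pack one arm into the unified (7 + 7 + 1)-D slot, zero-filling whatever is missing."""
    vec = np.zeros(EEF_DIM + JOINT_DIM + GRIPPER_DIM, dtype=np.float32)
    if eef_pose is not None:
        vec[:EEF_DIM] = eef_pose
    if joints is not None:
        vec[EEF_DIM:EEF_DIM + JOINT_DIM] = joints
    if gripper is not None:
        vec[-1] = gripper
    return vec

def pack_action(left: dict, right: dict) -> np.ndarray:
    """Unified 30-D action: two 15-D arm slots concatenated."""
    return np.concatenate([pack_arm(**left), pack_arm(**right)])

# Example: a single-arm, EEF-only dataset leaves the joint slots and the whole right arm at zero.
a = pack_action({"eef_pose": np.ones(7), "gripper": 1.0}, {})
assert a.shape == (30,)
```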

5.2 Real world deployment

The authors evaluate 6 tasks on a real dual-arm platform, organized into three categories: long-horizon, precision, and deformable. Each task uses only 50 real demonstrations for adaptation; the appendix states that each method gets 20 trials per task, with the two methods tested alternately to reduce ordering bias. The Progress Score (PS) is the average step score divided by the maximum number of steps, and the Success Rate (SR) is the proportion of trials that succeed on all steps.
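A tiny sketch of how these two metrics relate, assuming binary per-step scores (my reading of the appendix definition, not the authors' evaluation code):

```python
def progress_score(step_scores, max_steps):
    """Per-trial progress: total step score divided by the maximum number of steps."""
    return sum(step_scores) / max_steps

def summarize(trials, max_steps):
    """trials: list of per-trial step-score lists. PS averages partial progress over trials;
    SR counts only the trials that complete every step."""
    ps = 100.0 * sum(progress_score(t, max_steps) for t in trials) / len(trials)
    sr = 100.0 * sum(1 for t in trials if sum(t) == max_steps) / len(trials)
    return ps, sr

# Toy example: 3 trials of a 5-step task; only the first trial completes all steps.
print(summarize([[1, 1, 1, 1, 1], [1, 1, 1, 0, 0], [1, 1, 0, 0, 0]], max_steps=5))
```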

Real-world deployment results
Figure 5. Real-world deployment tasks and metrics; the table below uses the exact values from the trial-by-trial table in the appendix.
| Task | Category | LingBot-VA PS | $\pi_{0.5}$ PS | LingBot-VA SR | $\pi_{0.5}$ SR | Remarks |
| --- | --- | --- | --- | --- | --- | --- |
| Make Breakfast | Long-horizon | 97.0 | 73.0 | 75.0 | 70.0 | 10 steps; LingBot-VA mainly fails at pour, serve, or intermediate placement. |
| Pick Screws | Precision | 82.5 | 74.0 | 70.0 | 50.0 | 5 steps; LingBot-VA is more stable overall at inverting the screws and inserting them one by one. |
| Insert Tubes | Precision | 85.8 | 79.2 | 40.0 | 30.0 | 3 grasp + 3 insert steps; grasping is close to full score, insertion is the main bottleneck. |
| Unpack Delivery | Long-horizon | 84.5 | 73.0 | 65.0 | 25.0 | Cutting the seal and opening the lid are the main failure points for $\pi_{0.5}$. |
| Fold Clothes | Deformable | 48.8 | 62.9 | 35.0 | 30.0 | LingBot-VA's full-success rate is slightly higher, but its progress score is lower than $\pi_{0.5}$'s. |
| Fold Pants | Deformable | 76.7 | 30.0 | 70.0 | 30.0 | LingBot-VA improves markedly on this three-step folding task. |
| Average | - | 79.2 | 65.4 | 59.2 | 39.2 | Simple average over the 6 tasks. |
Real-world task progressions
Figure 6. Six key execution steps for the real-world tasks. The trial-by-trial table in the appendix scores around these intermediate steps.

5.3 RoboTwin 2.0 simulation

RoboTwin 2.0 is a dual-arm manipulation benchmark. The paper uses multi-task training: 50 tasks, each with 50 clean-scene demonstrations plus 500 heavily randomized demonstrations; the video is downsampled from 50 Hz to 12.5 Hz while the action frequency stays at 50 Hz; training runs 50K steps at LR $1\times10^{-5}$.
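A small sketch of the frequency alignment these settings imply, assuming simple stride-based subsampling (my own indexing, not the repository's loader): every fourth frame is kept for the 12.5 Hz video stream while all 50 Hz actions are retained, so each kept frame is paired with 4 consecutive actions.

```python
def align_video_actions(n_raw_frames, video_stride=4):
    """Map a 50 Hz episode to 12.5 Hz video frames, each paired with `video_stride` actions."""
    kept_frames = list(range(0, n_raw_frames, video_stride))               # 12.5 Hz frame indices
    action_groups = [list(range(f, min(f + video_stride, n_raw_frames)))   # 4 actions per kept frame
                     for f in kept_frames]
    return kept_frames, action_groups

frames, groups = align_video_actions(12)
print(frames)   # [0, 4, 8]
print(groups)   # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]
```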

| Metric | X-VLA Easy | X-VLA Hard | $\pi_0$ Easy | $\pi_0$ Hard | $\pi_{0.5}$ Easy | $\pi_{0.5}$ Hard | Motus Easy | Motus Hard | Ours Easy | Ours Hard |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Horizon = 1 | 81.6 | 82.5 | 66.5 | 61.6 | 85.1 | 80.2 | 91.0 | 90.6 | 94.18 | 93.56 |
| Horizon = 2 | 59.3 | 55.9 | 66.1 | 54.7 | 79.3 | 73.0 | 85.2 | 80.9 | 90.35 | 86.95 |
| Horizon = 3 | 61.2 | 66.0 | 61.6 | 50.2 | 78.6 | 67.4 | 85.0 | 84.2 | 93.22 | 93.28 |
| Average (50 tasks) | 72.9 | 72.8 | 65.9 | 58.4 | 82.7 | 76.8 | 88.7 | 87.0 | 92.93 | 91.55 |

The appendix provides per-task results for all 50 tasks. Task by task, LingBot-VA does not rank first everywhere; on tasks such as Hanging Mug and Turn Switch there is still clear headroom. Its averages, however, exceed the runner-up under both Easy and Hard and in every horizon group.

5.4 LIBERO simulation

LIBERO uses four suites: Spatial, Object, Goal, and Long; each suite has 10 tasks and each task has 50 demos. The paper filters out failed demonstrations and fine-tunes for 4K steps at LR $1\times10^{-5}$ with sequence length $10^5$. Each suite is reported over 3 random seeds with 500 trials each, i.e. 1,500 trials per suite.

| Method | Spatial | Object | Goal | Long | Avg |
| --- | --- | --- | --- | --- | --- |
| OpenVLA | 84.7 | 88.4 | 79.2 | 53.7 | 76.5 |
| $\pi_0$ | 96.8 | 98.8 | 95.8 | 85.2 | 94.1 |
| OpenVLA-OFT | 97.6 | 98.4 | 97.9 | 94.5 | 97.1 |
| X-VLA | 98.2 | 98.6 | 97.8 | 97.6 | 98.1 |
| LingBot-VA | 98.5 ± 0.3 | 99.6 ± 0.3 | 97.2 ± 0.2 | 98.5 ± 0.5 | 98.5 |

5.5 Ablation

| Ablation | Setting | Easy all | Horizon 1 | Horizon 2 | Horizon 3 | Explanation |
| --- | --- | --- | --- | --- | --- | --- |
| Baseline | LingBot-VA | 92.9 | 94.2 | 90.4 | 93.2 | Complete method. |
| Deployment | FDM-grounded Async | 90.4 | 92.5 | 87.7 | 85.6 | Asynchronous execution with FDM grounding: faster, with only a slight SR drop. |
| Deployment | Naive Async | 74.3 | 83.3 | 70.3 | 32.9 | Without real-feedback calibration, long-horizon performance collapses. |
| Pretrain | WAN init | 80.6 | 84.9 | 76.3 | 67.6 | Wan2.2 initialization plus fine-tuning only, without video-action pre-training. |

The text also notes that asynchronous execution completes tasks roughly 2x faster than synchronous execution; in the table, FDM-grounded async stays close to the baseline, while naive async drops from 93.2 to 32.9 at Horizon = 3, which directly supports the design motivation that stale predictions cause drift.

Action initialization loss comparison
Figure 7. Action stream initialization ablation: Random initialization leads to high gradient norm and slow convergence; the strategy of copying and scaling video weights is the most stable.

5.6 Analysis experiments: few samples, memory, generalization

Few-shot comparison
Figure 8. Few-shot adaptation: with 5/10/25/50 demos on RoboTwin Easy, LingBot-VA is consistently above $\pi_{0.5}$; on the real Make Breakfast task, 10/25/50 demos give 61.1/81.7/97.0 respectively, versus $\pi_{0.5}$'s 45.5/60.0/73.0.
Temporal memory test
Figure 9. Temporal memory: Wipe Plate requires exactly 6 wipes, and Search Box requires remembering that the right box is empty before moving to the left box; LingBot-VA reaches 100% on both, versus 47% and 50% for $\pi_{0.5}$.
Generalization test
Figure 10. Generalization tests: training sees only certain object types or a local region of the workspace, while inference tests new objects and out-of-distribution positions. The paper uses qualitative examples to show that the model handles changes in shape, texture, and position.

6. Reproducibility audit

6.1 Public resources

Code and models are publicly available. The arXiv and source-code metadata point to the GitHub repository and project page. As of this report, the GitHub README shows that the base checkpoint, the RoboTwin post-training checkpoint, the LIBERO-Long post-training checkpoint, the RoboTwin/LIBERO evaluation scripts, and the post-training data description have all been released.

6.2 Environment and operating information

| Item | Public information |
| --- | --- |
| Basic environment | The GitHub README states Python 3.10.16, PyTorch 2.9.0, CUDA 12.6; dependencies include diffusers 0.36.0, transformers 4.55.2, flash-attn, etc. |
| Attention mode | The README emphasizes that training requires `attn_mode="flex"`, while inference/evaluation must use `"torch"` or `"flashattn"`, otherwise eval errors out. |
| RoboTwin inference | A server-client structure is provided; single-GPU RoboTwin evaluation in offload mode needs roughly 24 GB VRAM; the multi-GPU client pads the 50 tasks to 56 and splits them into 7 groups to fit an 8-GPU setup. |
| Image-to-video-action inference | The README gives `NGPU=1 CONFIG_NAME='robotwin_i2av' bash script/run_launch_va_server_sync.sh`; offload mode needs about 18 GB VRAM. |
| Post-training | Uses the LeRobot dataset format; the training examples are `NGPU=8 CONFIG_NAME='robotwin_train' bash script/run_va_posttrain.sh` and `CONFIG_NAME='libero_train'`. |

6.3 Key details for reproduction

6.4 Appendix Integration Instructions

The appendix is not negligible material: the complete list of the 50 RoboTwin tasks, the trial-by-trial success/step scores of the 6 real-world tasks, and the PS/SR definitions are all in the appendix. This report has integrated that information into Sections 5.2 and 5.3 rather than treating the appendix as a standalone tail.

7. Analysis, Limitations and Boundaries

7.1 The most valuable part of this paper

The most valuable part of the paper is that it turns the question of whether a video world model can serve as the basis for robot control into a deployable causal autoregressive system, rather than just showing offline video generation. Specific evidence includes: the long-horizon gains in RoboTwin's horizon grouping, the 100% vs 47%/50% memory tasks, and the ablation where naive async degrades sharply at Horizon = 3 while FDM-grounded async largely recovers.

7.2 Why the results hold up

The support chain for the results is fairly complete: the main tables show RoboTwin and LIBERO averages; the real-world tasks use 20 trials, step-wise scoring, and an alternating evaluation protocol; the ablations separately isolate async grounding, video-action pretraining, and action-stream initialization; and the analysis experiments add the perspectives of few-shot adaptation, temporal memory, and generalization. In other words, the paper does not rely on a single aggregate score to support all of its claims.

7.3 Limitations and future directions described by the author

7.4 Qualifications in the report