
AdaWorldPolicy: World-Model-Driven Diffusion Policy with Online Adaptive Learning for Robotic Manipulation

Authors: Ge Yuan, Qiyuan Qiao, Jing Zhang, Dong Xu

Organization: The University of Hong Kong; Beihang University

arXiv: 2602.20057, submitted 2026-02-23; the LaTeX source uses the CVPR 2026 template

Project page: https://AdaWorldPolicy.github.io; the paper's source provides no GitHub repository or checkpoint link

1. Quick overview of the paper

**One-sentence summary:** AdaWorldPolicy combines the Cosmos-Predict2 world model, a lightweight action expert, and a force predictor into a unified Flow Matching DiT policy, and uses world-prediction error and force-prediction error to drive LoRA online updates at test time, enabling self-supervised adaptation under visual and physical domain shifts.

**What the paper solves:** Robots face visual perturbations, object/mechanics changes, and shifts in physical contact distributions in dynamic real-world environments and contact-rich tasks; policies trained purely by offline imitation, or VLA-style reactive policies, cannot self-correct from real feedback at test time.

**The authors' approach:** Turn the world model from an offline predictor/validator into an active supervisor: first generate actions, then use the Future Imagination mode of the same network to predict the post-execution observations, and use the gap between prediction and real feedback as the test-time adaptation signal.

**Most important results:** LIBERO-10 full-multimodal success 0.96; on Variant PushT, AWP(ol) reaches 0.51 / 0.77 / 0.66 under texture / random-light / random-color OOD, all higher than AWP; CALVIN ABC→D average completion length for AWP(ol) is 3.54; the full method averages 76.3% over the real in-domain tasks in the ablation table.

**Things to note when reading:** AdaOL updates LoRA parameters online, not simple re-planning; real-world results are presented mainly as bar charts without trial-by-trial tables, but the appendix supplements the evaluation protocol, success criteria, the two-stage TTA procedure, and key hyperparameters.

Tags: World Model · Diffusion Policy · Flow Matching · Test-Time Adaptation · Force Feedback · LoRA

AdaWorldPolicy teaser
Figure 1. AdaWorldPolicy's closed loop: Mode I generates actions and executes them; Mode II predicts the future based on the same observation and action; the error between the real future and imagined observation drives LoRA online update.


2. Motivation and related work

2.1 Why ordinary VLA is not enough

The paper points out that although VLA models can combine language, vision, and action, they usually rely on large numbers of human demonstrations and generalize poorly in unseen or dynamically changing contact-rich scenes. The fundamental reason is that most of them are reactive mappings trained offline: they map the current observation directly to an action, with no mechanism to explicitly predict physical consequences and correct themselves from real feedback.

2.2 Why should we put the world model into a closed loop?

Existing world models are often used as "digital twins" or offline validators in robotics; WorldVLA, UVA, and others unify action generation and world prediction. However, the authors argue that most of these methods still train the policy offline and cannot adapt quickly to visual and dynamic changes at deployment time. The core motivation of AdaWorldPolicy is that the world model's own prediction error is a self-supervised signal that can continuously correct the action model, world model, and force predictor during testing.

| Related direction | Positioning of existing methods | AdaWorldPolicy's difference |
| --- | --- | --- |
| World Models for Robotic Control | Dreamer, Cosmos, Dino-WM, etc., for dynamics prediction, planning, or policy validation. | The world model does not just predict or verify; it actively generates the online-adaptation loss. |
| Diffusion Models for Decision Making | Diffusion Policy etc. model action trajectories as diffusion processes and excel at multi-modal action distributions. | Adds future-outcome modeling and force prediction to the diffusion policy to constrain the physical consistency of actions. |
| Online Adaptation for Robotics | TTA, LoRA, confidence maximization, etc., adjust model parameters at test time. | Uses world-model prediction error and force discrepancy as robot-specific self-supervised update signals. |

3. Detailed explanation of method

3.1 Problem setting

At each timestep the model receives the multi-modal observation history $o=\{x_{\text{static}}, x_{\text{gripper}}, f\}$: a static-camera sequence, a gripper-camera sequence, and force-torque readings, all sharing the context length $T_c$. The model outputs the future action sequence $a=a_{t:t+T_a-1}$. The goal is to use the robot's own interaction data $\{(o_t, a_t, o_{t+1})\}_{t=0}^{T}$ to update the parameters $\theta_t$ to $\theta_{t+1}$ in a test environment, without new manual annotations or demonstrations.
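A minimal sketch of this interface in PyTorch. The tensor shapes and the `generate_actions` method are illustrative assumptions; the defaults $T_c=5$, $T_a=20$ follow the LIBERO row of the hyperparameter table in Section 6.2.

```python
from dataclasses import dataclass
import torch

@dataclass
class Observation:
    x_static: torch.Tensor   # (T_c, 3, H, W) static-camera frames
    x_gripper: torch.Tensor  # (T_c, 3, H, W) gripper-camera frames
    f: torch.Tensor          # (T_c, 6) force-torque readings

def policy_step(model, obs: Observation, T_a: int = 20) -> torch.Tensor:
    """Mode I: map the multi-modal history to an action chunk a_{t:t+T_a-1}."""
    return model.generate_actions(obs, horizon=T_a)  # (T_a, action_dim)
```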

AdaWorldPolicy architecture
Figure 2. Network architecture: World Model is based on Cosmos-Predict2 2B; Force Predictor and Action Model are 0.4B Flow DiT; the three exchange features through Multi-modal Self-Attention, and are supported by LoRA for online updates.

3.2 Three modules

| Module | Input / Output | Function | Scale |
| --- | --- | --- | --- |
| World Model | Input: current static/gripper camera views and action conditions. Output: future visual states $x'_{\text{static}}, x'_{\text{gripper}}$. | Predicts the visual consequences of actions and provides the AdaOL self-supervised signal. | Cosmos-Predict2 2B. |
| Force Predictor | Input: current state and action. Output: future force-torque reading $f'$. | Supplements contact dynamics invisible to the visual world model and mitigates force shift. | 0.4B in the real world; removed in simulation without force data. |
| Action Model | Mode I: input noisy action tokens and denoise. Mode II: input known actions as conditions. | Generates actions, or serves as the action condition in Future Imagination. | 0.4B in the real world; up to 0.6B in simulation. |

3.3 Dual mode: same network, two roles

Two modes
Figure 3. Mode I generates action sequences; Mode II conditions executed actions and predicts future observations. This switchable mode is the basis of AdaOL.
Mode I: Action Generation
- input: observation history $o$, noisy action tokens $a_k$
- mask: action tokens are the target (mask = 0)
- output: action sequence $a$
- loss: $\mathcal{L}_1$, flow matching on the action vector field

Mode II: Future Imagination
- input: observation history $o$, concrete action $a$, noised future observation $o'_k$
- mask: action is a condition (mask = 1)
- output: imagined future observation $\hat{o}'$
- loss: $\mathcal{L}_2$, flow matching on the future-observation vector field
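A minimal sketch of how one network can serve both roles just by toggling the condition mask. The `[history obs | future obs | actions]` token layout and the function itself are illustrative assumptions; the 1 = condition / 0 = target convention comes from Section 3.4.

```python
import torch

def build_mask(n_hist: int, n_future: int, n_act: int, mode: str) -> torch.Tensor:
    """Binary condition mask: 1 = known condition, 0 = prediction target.
    Assumed token layout: [history obs | future obs | actions]."""
    mask = torch.ones(n_hist + n_future + n_act)
    if mode == "action_generation":       # Mode I: denoise the action tokens
        mask[n_hist + n_future:] = 0
    elif mode == "future_imagination":    # Mode II: actions stay conditions,
        mask[n_hist:n_hist + n_future] = 0  # future-observation tokens are targets
    return mask
```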

3.4 Multi-perspective expansion of World Model

The paper extends Cosmos-Predict2 from single-view video prediction to multi-view input: each camera view first obtains tokens through Cosmos VAE, and then splices along the temporal dimension; different views use independent RoPE to maintain the spatial and temporal structure across views. The input token is also equipped with a binary mask, where 1 represents the known condition and 0 represents the prediction target; after final denoising, the condition part will be replaced by the original input to ensure that the condition observation is not contaminated by the model generation results.
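A sketch of the multi-view tokenization and condition-restoration steps described above, under assumed tensor shapes and a hypothetical `vae.encode` interface (the independent per-view RoPE tables are omitted).

```python
import torch

def encode_views(vae, views: list[torch.Tensor]) -> torch.Tensor:
    """Tokenize each camera view with the Cosmos VAE, then splice the view
    token streams along the temporal dimension (Sec. 3.4). Each view would
    additionally get its own RoPE position table, not shown here."""
    tokens = [vae.encode(v) for v in views]   # each: (T, n_tok, d)
    return torch.cat(tokens, dim=0)           # views concatenated along time

def restore_conditions(denoised: torch.Tensor, original: torch.Tensor,
                       mask: torch.Tensor) -> torch.Tensor:
    """After the final denoising step, overwrite condition tokens (mask = 1)
    with the original inputs, so conditions are never contaminated by
    the model's generated content."""
    m = mask.view(-1, 1, 1).bool()            # (T, 1, 1) broadcast over tokens
    return torch.where(m, original, denoised)
```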

3.5 Multi-modal Self-Attention

MMSA is the bridge between the three modules. It is neither a simple concat nor a one-way cross-attention: world / force / action each generate their own Q/K/V, which are concatenated along the token dimension before a single joint self-attention. This way the three can query each other's information while retaining their own dedicated representation spaces.

What MMSA is doing: putting the attention requests of three experts, world, force, and action, into the same attention field to communicate with each other.

$$\text{MMSA}(Q, K, V)=A([Q_x, Q_f, Q_a], [K_x, K_f, K_a], [V_x, V_f, V_a])$$
- $x$: World Model / visual tokens.
- $f$: Force Predictor / force tokens.
- $a$: Action Model / action tokens.
- $A$: the standard self-attention operation.
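A minimal PyTorch rendering of the MMSA equation above. How each module produces its Q/K/V projections is left outside the sketch, and the `(B, N, d)` shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def mmsa(qkv_x, qkv_f, qkv_a):
    """Multi-modal Self-Attention: each module supplies its own (Q, K, V)
    triple; the three are concatenated along the token dimension and one
    joint self-attention A is run, then streams are split back per module."""
    q = torch.cat([qkv_x[0], qkv_f[0], qkv_a[0]], dim=1)  # (B, N_x+N_f+N_a, d)
    k = torch.cat([qkv_x[1], qkv_f[1], qkv_a[1]], dim=1)
    v = torch.cat([qkv_x[2], qkv_f[2], qkv_a[2]], dim=1)
    out = F.scaled_dot_product_attention(q, k, v)         # standard attention A
    n_x, n_f = qkv_x[0].shape[1], qkv_f[0].shape[1]
    # split back so each expert keeps its own representation stream
    return out[:, :n_x], out[:, n_x:n_x + n_f], out[:, n_x + n_f:]
```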

3.6 AdaOL online adaptation loop

AdaOL runs a closed loop after each execution: generate an action, execute it, receive real feedback, imagine the future outcome of that same action, compute the error, and update a small number of parameters via LoRA. The authors emphasize that only the low-rank matrices are updated, less than 0.1% of all parameters, keeping the online-update overhead manageable.

The AdaOL loss compares what the model thinks the action will cause with what actually happens in the real world.

$$\mathcal{L}_{\text{AdaOL}}=\|E(o_{t+1})-E(\hat{o}_{t+1})\|_2^2$$

$E(\cdot)$ is the Cosmos VAE encoder. This error generates the correction gradient $\Delta w$, which is used for LoRA online updates.
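A sketch of one AdaOL iteration, using the hyperparameters reported in Section 5.1 (2 gradient steps per sample, an optimizer over LoRA parameters only). `model.imagine` is a hypothetical interface for the Mode II rollout, and backpropagating through the full denoising rollout is simplified here.

```python
import torch

def adaol_step(model, vae_encode, lora_opt, obs_t, action, obs_next):
    """One AdaOL iteration: the action has already been executed; now imagine
    its future, compare against the real observation in Cosmos VAE latent
    space, and update only the LoRA matrices held by `lora_opt`
    (rank 16, first 4 layers of each backbone, LR 5e-7 per the paper)."""
    for _ in range(2):                          # 2 gradient steps per incoming sample
        o_hat = model.imagine(obs_t, action)    # Mode II: imagined future observation
        with torch.no_grad():
            z_real = vae_encode(obs_next)       # E(o_{t+1}): latent of real feedback
        z_imag = vae_encode(o_hat)              # E(ô_{t+1}): latent of imagined future
        loss = (z_real - z_imag).pow(2).sum()   # L_AdaOL = ||E(o_{t+1}) − E(ô_{t+1})||²
        lora_opt.zero_grad()
        loss.backward()                         # correction gradient Δw hits only LoRA
        lora_opt.step()
```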

4. Mathematical forms and training objectives

4.1 Action Generation loss

$$\mathcal{L}_{1}(\theta)= \mathbb{E}\left[ \left\|\mathbf{u}_{\theta}(a_k, k, o; \theta)-\mathbf{v}_k(a_k, a)\right\|^2 \right]$$

Here $a_k$ is the noised action, $\mathbf{u}_\theta$ is the flow vector field predicted by the model, and $\mathbf{v}_k$ is the Flow Matching target vector field. This loss trains the model to denoise the action $a$ conditioned on the current observation $o$.

4.2 Future Imagination loss

$$\mathcal{L}_{2}(\theta)= \mathbb{E}\left[ \left\|\mathbf{u}_{\theta}(o'_k, k, o, a; \theta)-\mathbf{v}_k(o'_k, o')\right\|^2 \right]$$

$o'_k$ is a noised future observation; here the action $a$ is a known condition. It trains the world/force/action shared system to predict "what will happen after performing this action".

4.3 Joint objective

$$\mathcal{L}_{\text{total}}(\theta)=p_a \mathcal{L}_1+(1-p_a)\mathcal{L}_2$$

Two modes are randomly switched during training: training action generation with probability $p_a$, otherwise training future imagination. This design enables the model to learn to be both a policy and an action-conditioned world model simultaneously.
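A sketch of this stochastic mode-switching training step. The linear (rectified-flow) interpolation path and the method names `predict_action_field` / `predict_obs_field` are assumptions; the paper specifies only that the Flow Matching losses $\mathcal{L}_1$ and $\mathcal{L}_2$ are mixed with probability $p_a$.

```python
import torch

def training_step(model, batch, p_a: float = 0.5):
    """Sample one mode per step with probability p_a, so in expectation the
    objective is p_a * L1 + (1 - p_a) * L2."""
    o, a, o_next = batch                      # history obs, action chunk, future obs
    k = torch.rand(())                        # flow-matching time, k ~ U(0, 1)
    if torch.rand(()) < p_a:                  # Mode I: action generation (L1)
        noise = torch.randn_like(a)
        a_k = (1 - k) * noise + k * a         # linear interpolation path
        v_target = a - noise                  # target vector field v_k
        v_pred = model.predict_action_field(a_k, k, o)
    else:                                     # Mode II: future imagination (L2)
        noise = torch.randn_like(o_next)
        o_k = (1 - k) * noise + k * o_next
        v_target = o_next - noise
        v_pred = model.predict_obs_field(o_k, k, o, a)
    return (v_pred - v_target).pow(2).mean()  # flow-matching MSE
```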

5. Experiments and results

5.1 Experimental setup

| Setting | Content |
| --- | --- |
| Simulation benchmarks | LIBERO-10 measures long-horizon compositional skills; Variant PushT measures texture, random-lighting, and random-color OOD; CALVIN uses the ABC→D cross-domain protocol, where each sequence chains 5 tasks in a row. |
| Real robot | 6-DoF robotic arm, gripper camera, wrist-mounted force-torque sensor, and third-person static camera; tasks: Sweep Beans, Pick-and-Place Eggs, Pour Water, Wipe Whiteboard. |
| Offline training | PyTorch + Cosmos-Predict2; 8× A100 80GB; AdamW; global batch size 64 to 256; LR $1\times10^{-4}$, linearly decayed to 1% over up to 20k steps after the loss plateaus. |
| Online learning | Single NVIDIA RTX 5880 48GB; LoRA rank 16, placed only in the first 4 layers of each backbone; 2 gradient steps per incoming sample, LR $5\times10^{-7}$; average TTA inference is only about 5% slower than without adaptation. |
Benchmarks
Figure 4. Three simulation benchmarks: Variant PushT, LIBERO, and CALVIN.

5.2 LIBERO-10

| Setting | Method | Static Camera | Gripper Camera | Joint States | Success |
| --- | --- | --- | --- | --- | --- |
| Static only | UVA | Yes | No | No | 0.89 |
| Static only | AWP | Yes | No | No | 0.91 |
| Full multimodal | OpenVLA | Yes | Yes | Yes | 0.54 |
| Full multimodal | MoDE | Yes | Yes | Yes | 0.94 |
| Full multimodal | OpenVLA-OFT | Yes | Yes | Yes | 0.94 |
| Full multimodal | AWP | Yes | Yes | Yes | 0.96 |

The LIBERO-10 results show that the AWP architecture is already strong without online adaptation: in the static-only setting it beats UVA, and in full multimodal it surpasses MoDE and OpenVLA-OFT.

5.3 Variant PushT: OOD robustness

| Method | Original | Texture | Rand Light | Rand Color |
| --- | --- | --- | --- | --- |
| Diffusion Policy | 0.78 | 0.18 | 0.14 | 0.11 |
| OpenVLA | 0.35 | 0.22 | 0.20 | 0.14 |
| UniPi | 0.42 | 0.35 | 0.33 | 0.18 |
| UVA | 0.94 | 0.11 | 0.54 | 0.13 |
| AWP | 0.97 | 0.47 | 0.71 | 0.61 |
| AWP (ol) | 0.98 | 0.51 | 0.77 | 0.66 |

AdaOL's value is clearest here: AWP is already stronger than most baselines, yet online learning brings further gains on every OOD variant, notably random lighting from 0.71 to 0.77 and random color from 0.61 to 0.66.

5.4 CALVIN ABC→D

| Method | Len 1 | Len 2 | Len 3 | Len 4 | Len 5 | Avg. Len. |
| --- | --- | --- | --- | --- | --- | --- |
| OpenVLA | 91.3 | 77.8 | 62.0 | 52.1 | 43.5 | 3.27 |
| MoDE | 91.5 | 79.2 | 67.3 | 55.8 | 45.3 | 3.39 |
| GR-MG | 91.0 | 79.1 | 67.8 | 56.9 | 47.7 | 3.42 |
| AWP | 91.8 | 79.2 | 68.5 | 62.8 | 48.0 | 3.51 ± 0.03 |
| AWP (ol) | 92.0 | 79.6 | 68.6 | 63.0 | 48.0 | 3.54 ± 0.04 |

The online-learning gains on CALVIN are smaller but consistent: average length 3.51 → 3.54, with length-5 success staying at 48.0. The authors attribute this to TTA fine-tuning a policy that already generalizes well.

5.5 Real robot tasks

Real-robot setup
Figure 5. Real robot setup: INOVO robotic arm, static/gripper cameras, and force sensor; tasks include sweeping beans, placing eggs, pouring water, and wiping a whiteboard; the domain shifts (tablecloth, distractor, object, lighting, etc.) are shown on the right.
Real-world results
Figure 6. Real-world results: in-domain, AWP already outperforms DP-Force and UVA on most tasks; under domain shift, AWP(ol) improves consistently over AWP. In the figure, Pour rises from 80% to 90% under Object Change, matching the example in the text.

The appendix completes the real-world protocol: 150 expert demonstrations collected in-domain per task; 30 trials per model configuration per task/distribution, each capped at 1500 execution steps. AdaOL testing has two phases: trials 1-15 update online continuously, and trials 16-30 freeze the updated model to evaluate the stability of the adapted policy.

5.6 Ablation

| Configuration | Success Rate (%) | Meaning |
| --- | --- | --- |
| AdaWorldPolicy w/ AdaOL | 76.3 | Full method, averaged over the four real in-domain tasks. |
| AdaWorldPolicy w/o AdaOL | 72.5 | No online update at test time; down 3.8 points. |
| w/o Force Predictor | 53.8 | Removing force prediction significantly degrades contact-rich tasks. |
| w/o World Model Supervision | 46.3 | Degenerates into behavioral cloning, showing world supervision is the core. |
| MMSA → Concatenation | 36.3 | Simple concatenation cannot effectively integrate the modules. |
| MMSA → Cross-Attention | 50.0 | Ordinary cross-attention is still weaker than MMSA. |

5.7 Appendix supplement: sampling steps, imagined futures and hyperparameters

| Appendix experiment | Key results | Takeaway |
| --- | --- | --- |
| Sampling steps on LIBERO | AWP at 20/10/5/2 steps scores 96.33 / 95.53 / 94.67 / 94.00; reducing steps costs little performance. | Speed and accuracy can be traded off when reproducing. |
| AdaOL on LIBERO | Improves 95.53 → 96.05 at 10 steps. | AdaOL also gives small gains under slight distribution differences. |
| MMSA fusion | MMSA 95.53, Concat 89.67, Cross-Attention 91.21. | Reinforces the MMSA ablation in the main text. |
| Imagined-future visualization | Imagined futures for PushT, CALVIN, and LIBERO broadly match real observations; blur/artifacts appear in complex real backgrounds and the egg task. | Supports the world model as supervision while revealing visual-generation limits. |
Simulation imagined future
Appendix Figure. Imagined future versus real observation in simulation: PushT, CALVIN, and LIBERO10.
Real imagined future
Appendix Figure. Imagined future in real scenes; the authors note that artifacts and blur appear in scenes with complex backgrounds and small objects, but the structural consistency remains sufficient for the policy.

6. Reproducibility audit

6.1 Public resource status

The arXiv source does not include training code. The paper gives only the project homepage AdaWorldPolicy.github.io in the abstract; there is no GitHub or checkpoint link in the LaTeX source. The appendix mentions that the supplementary zip contains a local video webpage `AdaWorldPolicy_Homepage/index.html`, but the folder is not included in the arXiv e-print.

6.2 Key hyperparameters

| Benchmark | Image Size | History Length | Action Horizon | # Imagined Frames |
| --- | --- | --- | --- | --- |
| LIBERO10 | 128 × 128 | 5 | 20 | 20 |
| PushT | 256 × 256 | 5 | 20 | 20 |
| CALVIN | 192 × 192 | 1 | 12 | 12 |
| Real-world | 112 × 160 | 1 | 32 | 4 |

Real-world deployment uses sparse prediction: only 4 imagined frames cover an action horizon of 32, spanning the action's time window with sparse future frames to reduce video-generation latency.
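For intuition, one plausible even-spacing placement; the paper gives only the counts (4 frames over 32 steps), not the exact placement, so the spacing below is an assumption.

```python
# Spread n_frames imagined frames evenly across a T_a-step action horizon.
T_a, n_frames = 32, 4
frame_steps = [round((i + 1) * T_a / n_frames) for i in range(n_frames)]
print(frame_steps)  # [8, 16, 24, 32] -> predict every 8th step instead of all 32
```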

6.3 Data processing and force data

6.4 Recurring gaps

7. Analysis, Limitations and Boundaries

7.1 The most valuable part of this paper

The most valuable point is turning the world model's error into an online learning signal, instead of using the world model merely as a rollout visualizer or offline evaluator. The PushT OOD table, the real-world domain-shift figure, and the ablations all point to the same conclusion: when visual or physical conditions change, AWP(ol) keeps correcting itself through real feedback and is more stable than a fixed offline policy.

7.2 Why the results hold up

The evidence chain of the paper covers two levels: "offline capability" and "online adaptation": LIBERO/CALVIN shows that the base AWP architecture itself is strong; PushT OOD and real domain shift show the increment brought by AdaOL; ablation shows that removing world supervision, force predictor or MMSA all decrease significantly. The sampling-step and imagined-future visualizations in the appendix further illustrate that the futures generated by the world model, while not photo-perfect, are structurally usable as supervision.

7.3 Limitations and future directions described by the author

7.4 Applicable boundaries

AdaWorldPolicy is suitable for scenarios where there is continuous observation feedback, a small amount of update overhead can be tolerated during testing, and the world prediction error is related to the success or failure of the task. For scenarios that require strict safety constraints, cannot update parameters online, or have weak correlation between visual future prediction and real control objectives, the paper does not provide sufficient verification.