AdaWorldPolicy: World-Model-Driven Diffusion Policy with Online Adaptive Learning for Robotic Manipulation

Authors: Ge Yuan, Qiyuan Qiao, Jing Zhang, Dong Xu

Organization: The University of Hong Kong; Beihang University

arXiv: 2602.20057, submission date 2026-02-23; source code uses CVPR 2026 template

Project page: https: //AdaWorldPolicy.github.io; The source code of the paper does not provide a GitHub repository or checkpoint link

1. Quick overview of the paper

One-sentence summary: AdaWorldPolicy combines the Cosmos-Predict2 world model, lightweight action expert and force predictor into a unified Flow Matching DiT strategy, and uses world prediction error and force prediction error to perform LoRA online updates during testing, thereby enabling self-supervised adaptation under visual and physical domain shifts.

What should the paper solve?	Robots will encounter visual perturbations, object/mechanics changes, and physical contact distribution shifts in dynamic real-world environments and contact-rich tasks; it is difficult to self-correct with real feedback during testing by relying solely on offline imitation or VLA-style reactive policies.
The author's approach	Turn the world model from "offline predictor/validator" to active supervisor: first generate actions, then use the Future Imagination mode of the same network to predict future observations after execution, and use the difference between predictions and real feedback as the test-time adaptation signal.
most important results	LIBERO-10 full multimodal success 0.96; Variant PushT under texture / random light / random color OOD AWP(ol) reaches 0.51 / 0.77 / 0.66, both higher than AWP; CALVIN ABC→D average completion length AWP(ol) is 3.54; real task in-domain average full method in the ablation table is 76.3%.
Things to note when reading	AdaOL updates LoRA parameters online, not simple re-planning; real-world results are mainly presented in bar graphs, without trial-by-trial tables, but the appendix supplements the evaluation protocol, success criteria, TTA two-stage process and key hyper-parameters.

World Model Diffusion Policy Flow Matching Test-Time Adaptation Force Feedback LoRA

Figure 1. AdaWorldPolicy's closed loop: Mode I generates actions and executes them; Mode II predicts the future based on the same observation and action; the error between the real future and imagined observation drives LoRA online update.

core contribution

Unified multimodal diffusion framework.The world model, action model, and force predictor are all implemented as Flow Matching DiT and exchange information via MMSA.
AdaOL was adapted online during testing.The model self-supervisedly updates a small number of LoRA parameters based on the gap between real feedback and imagined future to reduce visual and physical domain shift.
Access force-torque feedback.Force Predictor predicts future force readings for handling dynamic force shift in contact-rich tasks.
Multiple benchmark verification.Covers LIBERO-10, Variant PushT, CALVIN and 4 real robot tasks, and adds sampling step, MMSA, future frame visualization and implementation details in the appendix.

2. Motivation and related work

2.1 Why ordinary VLA is not enough

The paper points out that although the VLA model can combine language, vision and action, it usually relies on a large number of human demonstrations and has limited generalization in unseen or dynamically changing contact-rich scenes. The fundamental reason is that most of them are reactive mappings trained offline: the current observation comes in and the action is directly output, lacking a mechanism to explicitly predict the physical consequences and use real feedback to correct itself.

2.2 Why should we put the world model into a closed loop?

There are existing world models that are often used as "digital twin" or offline validators in robots; there are also WorldVLA, UVA, etc. that unify action generation and world prediction. However, the author believes that most of these methods are still offline training strategies and cannot adapt quickly when faced with visual and dynamic changes during deployment. The core motivation of AdaWorldPolicy is that the prediction error of the world model itself is a self-supervised signal, which can continuously correct the action model, world model and force predictor during testing.

Related directions	Positioning with existing methods	AdaWorldPolicy Differences
World Models for Robotic Control	Dreamer, Cosmos, Dino-WM, etc. for dynamics prediction, planning or policy validation.	The world model does not just predict or verify, but actively generates online adaptation loss.
Diffusion Models for Decision Making	Diffusion Policy etc. model action trajectories as diffusion processes and are good at multi-modal action distribution.	Add future outcome modeling and force prediction to the diffusion policy to constrain the physical consistency of the action.
Online Adaptation for Robotics	TTA, LoRA, confidence maximization, etc. adjust model parameters during testing.	Using world-model prediction error and force discrepancy as robot-specific self-supervised update signals.

3. Detailed explanation of method

3.1 Problem setting

The multi-modal historical observation $o=\{x_{\text{static}}, x_{\text{gripper}}, f\}$ is input at each moment, including static camera sequence, gripper camera sequence and force-torque readings. The three share the context length $T_c$. The model outputs the future action sequence $a=a_{t: t+T_a-1}$. The goal is to use its own interactive data $\{(o_t, a_t, o_{t+1})\}_{t=0}^{T}$ to update the parameters $\theta_t$ to $\theta_{t+1}$ in a test environment without adding new manual annotations or demonstrations.

Figure 2. Network architecture: World Model is based on Cosmos-Predict2 2B; Force Predictor and Action Model are 0.4B Flow DiT; the three exchange features through Multi-modal Self-Attention, and are supported by LoRA for online updates.

3.2 Three modules

module	input/output	function	scale
World Model	Input the current static/gripper angle of view and action conditions; output the future visual state $x'_{\text{static}}, x'_{\text{gripper}}$.	Predict the visual consequences of actions and provide AdaOL self-supervised signals.	Cosmos-Predict2 2B.
Force Predictor	Inputs current status and action; outputs future force-torque reading $f'$.	Supplement the contact dynamics invisible to the visual world model and alleviate force shift.	Real world 0.4B; removed when simulation has no force.
Action Model	Mode I inputs noise action tokens and denoise; Mode II inputs known actions as conditions.	Generates an action; or becomes an action-conditioned condition in Future Imagination.	Real world 0.4B; simulation can increase to 0.6B.

3.3 Dual mode: same network, two roles

Figure 3. Mode I generates action sequences; Mode II conditions executed actions and predicts future observations. This switchable mode is the basis of AdaOL.

Mode I: Action Generation input: observation history o, noisy action token a_k mask: action token is target (mask = 0) output: action sequence a loss: L1, flow matching on action vector field Mode II: Future Imagination input: observation history o, concrete action a, noised future observation o'_k mask: action is condition (mask = 1) output: imagined future observation o_hat' loss: L2, flow matching on future observation vector field

3.4 Multi-perspective expansion of World Model

The paper extends Cosmos-Predict2 from single-view video prediction to multi-view input: each camera view first obtains tokens through Cosmos VAE, and then splices along the temporal dimension; different views use independent RoPE to maintain the spatial and temporal structure across views. The input token is also equipped with a binary mask, where 1 represents the known condition and 0 represents the prediction target; after final denoising, the condition part will be replaced by the original input to ensure that the condition observation is not contaminated by the model generation results.

3.5 Multi-modal Self-Attention

MMSA is the bridge between the three modules. It is not a simple concat, nor is it a one-way cross-attention, but allows world / force / action to generate QKV respectively, and then performs self-attention after splicing the token dimension. In this way, the three can query each other's information while retaining their own dedicated representation space.

What MMSA is doing: putting the attention requests of three experts, world, force, and action, into the same attention field to communicate with each other.

$$\text{MMSA}(Q, K, V)=A([Q_x, Q_f, Q_a], [K_x, K_f, K_a], [V_x, V_f, V_a])$$

$x$	World Model / visual tokens.
$f$	Force Predictor / force tokens.
$a$	Action Model / action tokens.
$A$	Standard self-attention operations.

3.6 AdaOL online adaptation loop

AdaOL runs a closed loop after each execution: generate an action, execute it, receive real feedback, imagine future results of the same action, calculate the error, update a small number of parameters with LoRA. The author emphasizes that only low-rank matrices are updated, the proportion is less than 0.1%, and the online update overhead is controllable.

AdaOL loss compares: what the model thinks the action will cause and what actually happens in the real world.

$$\mathcal{L}_{\text{AdaOL}}=\|E(o_{t+1})-E(\hat{o}_{t+1})\|_2^2$$

$E(\cdot)$ is the Cosmos VAE encoder. This error generates the correction gradient $\Delta w$, which is used for LoRA online updates.

4. Mathematical forms and training objectives

4.1 Action Generation loss

$$\mathcal{L}_{1}(\theta)= \mathbb{E}\left[ \left\|\mathbf{u}_{\theta}(a_k, k, o; \theta)-\mathbf{v}_k(a_k, a)\right\|^2 \right]$$

Here $a_k$ is the noised action, $\mathbf{u}_\theta$ is the flow vector field predicted by the model, and $\mathbf{v}_k$ is the Flow Matching target vector field. This loss training model denoises the action $a$ from the current observation $o$.

4.2 Future Imagination loss

$$\mathcal{L}_{2}(\theta)= \mathbb{E}\left[ \left\|\mathbf{u}_{\theta}(o'_k, k, o, a; \theta)-\mathbf{v}_k(o'_k, o')\right\|^2 \right]$$

$o'_k$ is a noised future observation; here the action $a$ is a known condition. It trains the world/force/action shared system to predict "what will happen after performing this action".

4.3 Joint objective

$$\mathcal{L}_{\text{total}}(\theta)=p_a L_1+(1-p_a)L_2$$

Two modes are randomly switched during training: training action generation with probability $p_a$, otherwise training future imagination. This design enables the model to learn to be both a policy and an action-conditioned world model simultaneously.

5. Experiments and results

5.1 Experimental setup

settings	content
Simulation benchmark	LIBERO-10 measures long-range combination skills; Variant PushT measures texture, random lighting, and random color OOD; CALVIN uses ABC→D cross-domain protocol, and each sequence completes 5 tasks in a row.
real robot	6-DoF robotic arm, gripper camera, wrist-mounted force-torque sensor and third-person static camera; tasks include Sweep Beans, Pick-and-Place Eggs, Pour Water, Wipe Whiteboard.
Offline training	PyTorch + Cosmos-Predict2; 8 A100 80GB; AdamW; global batch size 64 to 256; LR $1\times10^{-4}$, linear attenuation to 1% in up to 20k steps after loss plateau.
online learning	Single NVIDIA RTX 5880 48GB; LoRA rank 16, only placed in the first 4 layers of each backbone; each incoming sample does 2 gradient steps, LR $5\times10^{-7}$; the average TTA inference speed is only about 5% slower than without adaptation.

Figure 4. Three simulation benchmarks: Variant PushT, LIBERO, and CALVIN.

5.2 LIBERO-10

Setting	Method	Static Camera	Gripper Camera	Joint States	Success
Static only	UVA	Yes	No	No	0.89
Static only	AWP	Yes	No	No	0.91
Full multimodal	OpenVLA	Yes	Yes	Yes	0.54
Full multimodal	MODE	Yes	Yes	Yes	0.94
Full multimodal	OpenVLA-OFT	Yes	Yes	Yes	0.94
Full multimodal	AWP	Yes	Yes	Yes	0.96

LIBERO-10 results prove that the AWP architecture itself is already strong without online adaptation: static-only super UVA, full multimodal super MODE and OpenVLA-OFT.

5.3 Variant PushT: OOD robustness

Method	Original	Texture	Rand Light	Rand Color
Diffusion Policy	0.78	0.18	0.14	0.11
OpenVLA	0.35	0.22	0.20	0.14
UniPi	0.42	0.35	0.33	0.18
UVA	0.94	0.11	0.54	0.13
AWP	0.97	0.47	0.71	0.61
AWP (ol)	0.98	0.51	0.77	0.66

AdaOL is best seen here: AWP is already stronger than most baselines, but online learning continues to bring improvements on all OOD variants, especially random lighting from 0.71 to 0.77, and random color from 0.61 to 0.66.

5.4 CALVIN ABC→D

Method	Len 1	Len 2	Len 3	Len 4	Len 5	Avg. Len.
OpenVLA	91.3	77.8	62.0	52.1	43.5	3.27
MoDE	91.5	79.2	67.3	55.8	45.3	3.39
GR-MG	91.0	79.1	67.8	56.9	47.7	3.42
AWP	91.8	79.2	68.5	62.8	48.0	3.51 ± 0.03
AWP (ol)	92.0	79.6	68.6	63.0	48.0	3.54 ± 0.04

The increments for online learning on CALVIN are smaller but consistent: average length 3.51 to 3.54, with length 5 remaining at 48.0. The author explains this as TTA being able to fine-tune policies that have generalized well.

5.5 Real robot tasks

Figure 5. Real robot setup: INOVO robotic arm, static/gripper camera and force sensor; tasks include sweeping beans, putting eggs, pouring water, and wiping whiteboards; tablecloth, distractor, object, lighting, etc. domain shift is shown on the right.

Figure 6. Real-world results: AWP under in-domain has outperformed DP-Force and UVA for most tasks; AWP(ol) under domain shift has a stable improvement compared to AWP. In the figure, Pour goes from 80% to 90% under Object Change, which is consistent with the example in the text.

The appendix complements the real-world protocol: 150 expert demonstrations collected in-domain per task; 30 trials per model configuration per task/distribution, up to 1500 execution steps. AdaOL testing is divided into two phases: Trials 1-15 are continuously updated online, and Trials 16-30 freeze the updated model to evaluate the stability of the adapted policy.

5.6 Ablation

Configuration	Success Rate	meaning
AdaWorldPolicy w/ AdaOL	76.3	Full method, real in-domain four-task averaging.
AdaWorldPolicy w/o AdaOL	72.5	Updated online without testing, down 3.8.
w/o Force Predictor	53.8	Remove force prediction, and contact-rich tasks degrade significantly.
w/o World Model Supervision	46.3	Degenerating into a behavioral cloning strategy shows that world supervision is the core.
MMSA → Concatenation	36.3	Simple splicing cannot effectively integrate modules.
MMSA → Cross-Attention	50.0	Normal cross-attention is still weaker than MMSA.

5.7 Appendix supplement: sampling steps, imagined future and super parameters

Appendix experiment	key results	Integrate location
Sampling steps on LIBERO	AWP 20/10/5/2 steps are 96.33 / 95.53 / 94.67 / 94.00 respectively; reducing steps only slightly reduces performance.	Speed and accuracy can be traded off when reproducing.
AdaOL on LIBERO	Improved from 95.53 to 96.05 in 10 steps.	It shows that AdaOL also has small gains in slight distribution differences.
MMSA fusion	MMSA 95.53, Concat 89.67, Cross-Attention 91.21.	The necessity of reinforcing the MMSA in the main text ablation.
Imagined future visualization	The imagined future of PushT, CALVIN, and LIBERO is basically consistent with real observation; there will be blur/artifacts in complex backgrounds and egg tasks in real scenes.	Supporting a world model provides supervision but also reveals vision generation limitations.

Appendix Figure. Imagined future versus real observation in simulation: PushT, CALVIN, and LIBERO10.

Appendix Figure. Imagined future in real scenes; the authors point out that artifacts and blur will appear in scenes with complex backgrounds and small objects, but the structural consistency is still sufficient for policy.

6. Repeat audit

6.1 Public resource status

The source code does not provide complete training code with arXiv.The paper only gives the project homepage in the abstract AdaWorldPolicy.github.io, there are no GitHub or checkpoint links in the LaTeX source code. The appendix mentions that the supplementary zip contains the local video web page `AdaWorldPolicy_Homepage/index.html`, but the folder is not included in the arXiv e-print.

6.2 Key hyperparameters

Benchmark	Image Size	History Length	Action Horizon	# Imagined Frames
LIBERO10	128 × 128	5	20	20
PushT	256 × 256	5	20	20
CALVIN	192 × 192	1	12	12
Real-world	112 × 160	1	32	4

Real-world use of sparse prediction: imagined frames are only 4, while the action horizon is 32, covering the action time span with sparse future frames to reduce video generation latency.

6.3 Data processing and force data

In the real task, 150 expert demonstrations were collected in each in-domain environment, using PS5 controller teleoperation, and mapping the force sensor readings to handle vibrations to provide feedback to the operator.
Force readings do not have stable upper and lower bounds like pixels or relative poses; the appendix explains the use of quantile-based normalization, that is, scaling by 1st and 99th percentile to mitigate outlier/spike effects.
DP baseline has also been enhanced to DP-Force: concat 6D force data and image observation features, and then fuse them through Transformer cross-attention to ensure fair comparison of real experiments.

6.4 Recurring gaps

The complete code, model weights and data download path are not provided; the reproducibility framework can only be described based on the paper.
The values in the real-world graph come from histograms, and the main text does not have a trial-by-trial table; the appendix only gives the protocol and success criteria, but does not have a complete real rollout table.
The specific fine-tuning configuration of Cosmos-Predict2 depends on the original implementation; the paper states "largely follow training recipe", but does not list all scheduler, augmentation, and LoRA target module details.
The online update measured about 5% slowdown on RTX 5880, but did not give details on the absolute latency and batch/sequence padding of each step.

7. Analysis, Limitations and Boundaries

7.1 The most valuable part of this paper

The most valuable point is to turn the world model error into an online learning signal, instead of just using the world model as a rollout visualizer or offline evaluator. The PushT OOD table, real-world domain shift diagram and ablation in the main article all point to the same Affiliations: when visual or physical conditions change, AWP(ol) is continuously modified through real feedback, which is more stable than a fixed offline policy.

7.2 Why the results hold up

The evidence chain of the paper covers two levels: "offline capability" and "online adaptation": LIBERO/CALVIN shows that the base AWP architecture itself is strong; PushT OOD and real domain shift show the increment brought by AdaOL; ablation shows that removing world supervision, force predictor or MMSA all decrease significantly. The sampling-step and imagined-future visualizations in the appendix further illustrate that the futures generated by the world model, while not photo-perfect, are structurally usable as supervision.

7.3 Limitations and future directions described by the author

OOD generalization of world models is still not perfect.The appendix clearly states that Cosmos-Predict2 will experience future frame quality degradation under significant domain shifts, continuously changing lighting, and complex real-world scenes.
AdaOL hyperparameters are fixed.The author does not adjust AdaOL hyperparameters for each task or environment; adaptive or meta-learning can be used to automatically adjust them in the future.
Long-range planning failures remain to be addressed.The conclusion states that long-horizon planning failures will be further addressed in the future and extended to larger network architectures.

7.4 Applicable boundaries

AdaWorldPolicy is suitable for scenarios where there is continuous observation feedback, a small amount of update overhead can be tolerated during testing, and the world prediction error is related to the success or failure of the task. For scenarios that require strict safety constraints, cannot update parameters online, or have weak correlation between visual future prediction and real control objectives, the paper does not provide sufficient verification.