
Learning Physics from Pretrained Video Models: A Multimodal Continuous and Sequential World Interaction Model for Robotic Manipulation

arXiv: 2603.00110 · Reading report on PhysGen, prepared for a junior PhD group meeting

Zijian Song, Qichang Li, Sihan Qin et al. — Sun Yat-sen University / X-Era AI Lab / GDUT. Keywords: video generation as a world interaction model, continuous physical tokens, 732M parameters.

1. Quick overview of the paper

The core idea of PhysGen: instead of using only a pretrained language model as the "brain" of a robot policy, also use a pretrained autoregressive video generation model as an implicit physics simulator. The model projects visual frames and action chunks into the same continuous physical token space, uses a causal transformer to predict the next set of "visual + action" tokens, and then uses a diffusion de-tokenizer to decode future frames and executable actions.

  • What the paper tries to solve: robot manipulation lacks large-scale action demonstration data; VLA methods rely on the symbolic knowledge of LLMs/VLMs, but manipulation requires continuous spatial, temporal, and physical interaction. The authors want to transfer the "how objects will move" physical prior from video generation models and learn control with less robot data.
  • The authors' approach: building on the NOVA autoregressive video-generation backbone, PhysGen combines video frame tokens and action-chunk tokens into physical tokens, models the visual and action distributions with continuous diffusion de-tokenizers, and adds an inverse-kinematics causal mask, Lookahead Multi-Token Prediction (L-MTP), LoRA training, and KV-cache inference.
  • Most important results: 90.8% average success rate on LIBERO, above OpenVLA (77%), WorldVLA (82%), and Pi0-Fast (86%); a 74% average on ManiSkill with 100% on PushCube; a 75% average on four real Franka Panda tasks, on par with Pi0, and 75% vs Pi0's 70% on transparent-object grasping.
  • Things to watch while reading: "a video generation model can serve as a physics simulator" is a strong claim, but the implementation does not plan directly over generated videos; it converts the video-generation backbone into joint video-action autoregression. The core evidence comes from the ablations on continuous tokens, video pretraining, AR rollout, and L-MTP.
Headline numbers: 90.8% average success rate on LIBERO; +13.8 points absolute improvement over OpenVLA on LIBERO; at most about 60 GPU hours of fine-tuning on a single A100.
PhysGen framework
PhysGen framework: The model runs synchronously with the real environment. Each step predicts the next visual state and action. After executing the action, the environment feedback is encoded back into the token stream, forming a continuous perception-planning-execution loop.

2. Motivations and issues

2.1 Why not just collect more action data?

The paper starts from the scarcity of robot data: large-scale generative pretraining brings cross-task generalization in NLP and vision, but robot action demonstrations are expensive, time-consuming, and strongly hardware-dependent. VLA methods connect an LLM/VLM to an action head and can transfer language and visual knowledge, but there is a modality gap between text and action: language is a symbolic description, while robot actions are continuous, geometric, time-sensitive control signals.

The author believes that the video generation model is closer to the knowledge required by the robot: the video model must predict future frames from historical frames, and therefore implicitly learns physical priors such as object permanence, motion trends after contact, and temporal consistency. Autoregressive video generation is particularly like a sequential decision process in control because it predicts the future in a rolling, step-by-step manner.

2.2 Problems with traditional tokenization

Many autoregressive models rely on discrete tokens. Discretization is natural for language, but image latents and actions are inherently continuous signals, and quantization introduces resolution errors. For robots, small action-quantization errors can accumulate into trajectory drift over long horizons. PhysGen therefore advocates continuous tokens: both vision and action are modeled in a continuous embedding space, and a diffusion loss is used to learn the conditional distribution.

One-sentence version: PhysGen does not make the robot "understand language and then act"; it lets the robot borrow the world-evolution prior of a video generation model to predict future frames and actions simultaneously in a continuous physical token space.

4. Preliminary knowledge

4.1 How Diffusion loss serves as a probabilistic model for continuous tokens

The ordinary diffusion model gradually adds noise to the clean sample $x_0$:

$$q(x_t \mid x_{t-1})=\mathcal{N}(x_t; \sqrt{1-\beta_t}x_{t-1}, \beta_t I).$$

The training goal is to have the network predict the injected noise:

$$L(\theta)=\mathbb{E}_{t, x_0, \epsilon}\left[\left\|\epsilon-\epsilon_\theta(\sqrt{\bar{\alpha}_t}x_0+\sqrt{1-\bar{\alpha}_t}\epsilon, t)\right\|^2\right].$$

The key inspiration of MAR/NOVA is: given the autoregressive context $z$, you can directly train the denoiser $\epsilon_\theta(x_t|t, z)$ to learn $p(x|z)$ without quantizing the continuous signal into a discrete vocabulary:

$$\mathcal{L}(z, x)=\mathbb{E}_{\epsilon, t}\left[\|\epsilon-\epsilon_\theta(x_t|t, z)\|^2\right].$$

PhysGen applies this idea to de-tokenization of frame tokens and action tokens.
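To make this concrete, here is a minimal PyTorch sketch of training a conditional denoiser $\epsilon_\theta(x_t \mid t, z)$ on a continuous token $x$ given a context vector $z$. This is my own sketch, not the authors' code: the noise schedule, dimensions, and the tiny MLP denoiser are illustrative assumptions.

```python
# A minimal sketch (not the authors' code) of the MAR/NOVA-style diffusion loss on a
# continuous token x conditioned on an autoregressive context vector z.
import torch
import torch.nn as nn

class CondDenoiser(nn.Module):
    """Predicts the injected noise eps from (noisy token, timestep, context z)."""
    def __init__(self, token_dim: int, ctx_dim: int, hidden: int = 256, n_steps: int = 1000):
        super().__init__()
        self.t_embed = nn.Embedding(n_steps, hidden)
        self.net = nn.Sequential(
            nn.Linear(token_dim + ctx_dim + hidden, hidden),
            nn.SiLU(),
            nn.Linear(hidden, token_dim),
        )

    def forward(self, x_t, t, z):
        return self.net(torch.cat([x_t, z, self.t_embed(t)], dim=-1))

def diffusion_loss(denoiser, x0, z, alphas_bar):
    """L(z, x) = E_{eps,t} || eps - eps_theta(x_t | t, z) ||^2."""
    b = x0.shape[0]
    t = torch.randint(0, alphas_bar.shape[0], (b,), device=x0.device)
    eps = torch.randn_like(x0)
    a_bar = alphas_bar[t].unsqueeze(-1)                    # \bar{alpha}_t per sample
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps     # forward noising q(x_t | x_0)
    return ((eps - denoiser(x_t, t, z)) ** 2).mean()

# usage sketch with a linear beta schedule (an assumption)
betas = torch.linspace(1e-4, 0.02, 1000)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)
denoiser = CondDenoiser(token_dim=16, ctx_dim=32)
x0, z = torch.randn(8, 16), torch.randn(8, 32)
diffusion_loss(denoiser, x0, z, alphas_bar).backward()
```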

4.2 What NOVA offers

NOVA is a continuous, non-quantized autoregressive video generation model. It models video autoregressively over time:

$$p(l, S_1, \dots, S_N)=\prod_{n=1}^{N}p(S_n|l, S_1, \dots, S_{n-1}), $$

and generates tokens set-by-set autoregressively within each frame. PhysGen retains NOVA's language and visual embedding mechanisms, so it inherits the world-dynamics knowledge from video pretraining, and then adds action tokens and an action diffusion de-tokenizer.

5. Detailed explanation of method

5.1 Input and output: predict next step from historical vision and actions

At step $N$, PhysGen takes as input the task instruction $l$, the historical images $\{O_0, \dots, O_{N-1}\}$ and the corresponding action chunks $\{A_1, \dots, A_{N-1}\}$, and outputs the next visual state $O_N$ and the next action chunk $A_N$. Each action chunk $A_n$ contains $L$ consecutive actions; $L=8$ in the experiments.

The meaning of this formulation is that the model does not only predict actions, but predicts "how the actions and the environment evolve together." This makes it more like a predictive world interaction model than a pure policy head.

5.2 Tokenizer: Put language, vision, and action into the same space

| Modality | Encoding | Frozen? | Output |
|---|---|---|---|
| Language instruction | Phi language-model tokenizer/encoder | Frozen | $E_l\in\mathbb{R}^{K_l\times d}$ |
| Visual frame | NOVA's original 3D-VAE, flattened into frame tokens | Frozen | $E_{O, n}\in\mathbb{R}^{K_O\times d}$ |
| Action chunk | MLP action tokenizer | Trained | $E_{A, n}\in\mathbb{R}^{K_A\times d}$, with $K_A=L$ |

Each physical token is composed of visual tokens and action tokens concatenated along the sequence dimension:

$$P_n=[E_{O, n}; E_{A, n}], \quad P_n\in\mathbb{R}^{(K_O+K_A)\times d}.$$

Since observation precedes action, the author adds a learnable Begin-of-Action token before the action sequence to align the length.
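A minimal sketch of how one physical token package might be assembled is below. This is not the released code: dimensions, module names, and in particular the exact way the Begin-of-Action token shifts the action stream are my assumptions.

```python
# A minimal sketch (dims and module names are assumptions) of assembling one physical token
# package P_n = [E_{O,n}; E_{A,n}]: frozen 3D-VAE frame tokens, an MLP action tokenizer, and a
# learnable Begin-of-Action (BOA) token that shifts the action stream so observation precedes action.
import torch
import torch.nn as nn

d_model, K_O, L = 768, 360, 8                      # hidden dim, frame tokens per step, chunk length

class ActionTokenizer(nn.Module):
    """MLP lifting each low-level action (e.g. a 7-DoF command) into the shared embedding space."""
    def __init__(self, action_dim: int = 7, d: int = d_model):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(action_dim, d), nn.GELU(), nn.Linear(d, d))

    def forward(self, actions):                    # (B, L, action_dim)
        return self.mlp(actions)                   # (B, L, d)  ->  K_A = L action tokens

action_tokenizer = ActionTokenizer()
boa = nn.Parameter(torch.zeros(1, 1, d_model))     # learnable Begin-of-Action token

def build_physical_token(frame_tokens, action_chunk):
    """Concatenate frame and (BOA-shifted) action tokens along the sequence dimension."""
    B = frame_tokens.shape[0]
    E_A = action_tokenizer(action_chunk)                             # (B, L, d)
    E_A_in = torch.cat([boa.expand(B, -1, -1), E_A[:, :-1]], dim=1)  # shift-by-one is an assumption
    return torch.cat([frame_tokens, E_A_in], dim=1)                  # (B, K_O + L, d)

frames = torch.randn(2, K_O, d_model)              # stand-in for frozen 3D-VAE frame tokens
actions = torch.randn(2, L, 7)                     # one chunk of L low-level actions
P_n = build_physical_token(frames, actions)        # (2, 368, 768): 360 visual + 8 action tokens
```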

5.3 Physical Autoregression

PhysGen follows LLM's token-by-token autoregression, but token now represents the joint physical state of vision and action:

$$p(E_l, P_0, \dots, P_N)=\prod_{n=0}^{N}p(P_n|E_l, P_0, \dots, P_{n-1}).$$

The conditional distribution is parameterized by a causal Transformer that replicates the NOVA architecture; the Transformer outputs a condition vector $Z_n$:

$$Z_n=\mathrm{Transformer}(l, P_0, \dots, P_{n-1}).$$

5.4 De-tokenizer: Continuous diffusion restores vision and action

Given $Z_n$, PhysGen estimates $p(P_n|Z_n)$ using the DiT-based denoising process:

$$\mathcal{L}(P_n, Z_n)=\mathbb{E}_{\epsilon, t}\left[\|\epsilon-\epsilon_\theta(P_{n, t}|t, Z_n)\|^2\right].$$

The reverse sampling form is:

$$P_{n, t-1}=\frac{1}{\sqrt{\alpha_t}}\left(P_{n, t}-\frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\epsilon_\theta(P_{n, t}|t, Z_n)\right)+\sigma_t\delta.$$

De-tokenization of vision and action is done separately: frame tokens follow NOVA's reconstruction paradigm, while action tokens use a lightweight Action-DiT, with the condition vector injected through cross-attention.
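A minimal sketch of the reverse sampling loop for the action tokens follows. The denoiser interface is a stand-in for Action-DiT (the cross-attention conditioning on $Z_n$ is abstracted into a plain argument), and $\sigma_t=\sqrt{\beta_t}$ is one common choice, not necessarily the paper's.

```python
# A minimal DDPM-style reverse-sampling sketch for the action de-tokenizer: start from Gaussian
# noise and iteratively denoise the action token package conditioned on Z_n.
import torch

@torch.no_grad()
def sample_action_tokens(denoiser, z_n, shape, betas):
    alphas = 1.0 - betas
    alphas_bar = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)                                   # P_{n,T} ~ N(0, I)
    for t in range(betas.shape[0] - 1, -1, -1):
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        eps = denoiser(x, t_batch, z_n)                      # eps_theta(P_{n,t} | t, Z_n)
        mean = (x - (1 - alphas[t]) / (1 - alphas_bar[t]).sqrt() * eps) / alphas[t].sqrt()
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + betas[t].sqrt() * noise                   # sigma_t = sqrt(beta_t) here (assumption)
    return x                                                 # denoised continuous action tokens
```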

Action diffusion network
Action diffusion network: the predicted token serves as the condition vector and is injected into the action denoiser through cross-attention.

5.5 Causal Mask and Implicit Inverse Kinematics

PhysGen's attention mask is designed for the joint "frame + action" token package; a mask-construction sketch follows the figure below.

  • Frame part uses chunk-wise full attention: patches in the same frame can pay attention to each other.
  • Action part uses temporal causal attention: early actions within the action block cannot see subsequent actions.
  • Action tokens can attend one-way to frame tokens: action planning can be conditioned on visual states, which the authors claim facilitates implicit inverse kinematics.
  • Maintain temporal causality across chunks to avoid future information leakage.
Causal attention mask
Causal attention mask: frame tokens are fully connected within a chunk, action tokens remain causal over time, and action tokens read visual tokens one-way for visually conditioned control.
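Below is a minimal sketch of how such a mask could be constructed; it encodes only the four rules listed above, and the actual token layout in the paper may differ.

```python
# A minimal boolean "may attend" mask for a sequence of physical chunks, where each chunk is
# k_o frame tokens followed by k_a action tokens. True = attention allowed.
import torch

def physgen_mask(n_chunks: int, k_o: int, k_a: int) -> torch.Tensor:
    chunk = k_o + k_a
    total = n_chunks * chunk
    mask = torch.zeros(total, total, dtype=torch.bool)
    for qc in range(n_chunks):                      # chunk containing the query token
        for kc in range(qc + 1):                    # keys only from the same or earlier chunks
            q0, k0 = qc * chunk, kc * chunk
            if kc < qc:
                # earlier chunks are fully visible (their frames and actions are in the past)
                mask[q0:q0 + chunk, k0:k0 + chunk] = True
                continue
            # same chunk:
            # 1) frame tokens attend to all frame tokens of the chunk (chunk-wise full attention)
            mask[q0:q0 + k_o, k0:k0 + k_o] = True
            # 2) action tokens attend one-way to the chunk's frame tokens (implicit IK conditioning)
            mask[q0 + k_o:q0 + chunk, k0:k0 + k_o] = True
            # 3) action tokens are causal among themselves inside the chunk
            for i in range(k_a):
                mask[q0 + k_o + i, k0 + k_o:k0 + k_o + i + 1] = True
    return mask

m = physgen_mask(n_chunks=2, k_o=4, k_a=3)          # tiny example for inspection
```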

5.6 Lookahead Multi-Token Prediction

L-MTP combines lookahead action generation with multi-token prediction. At each autoregressive step, the action de-tokenizer generates multiple future tokens in parallel (3 in the paper's implementation). All predicted tokens are supervised during training; at inference only the first token is executed, and the remaining tokens serve as lookahead information that conditions subsequent predictions.

This lets the model keep a short-term future plan while executing the current action, mitigating the short-sightedness of single-token rollout and improving long-horizon consistency.
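A receding-horizon pseudo-loop illustrating this inference scheme is sketched below; `encode_context`, `sample_actions`, and the environment interface are hypothetical stand-ins, not the paper's API.

```python
# A minimal receding-horizon sketch of L-MTP at inference time: the de-tokenizer proposes several
# future action tokens per AR step, only the first is executed, and the unexecuted lookahead is
# fed back as conditioning for the next step.
def lmtp_rollout(model, env, instruction, n_steps: int, n_lookahead: int = 3):
    obs = env.reset()
    lookahead = None                                            # unexecuted tokens from the last step
    for _ in range(n_steps):
        z = model.encode_context(instruction, obs, lookahead)   # causal transformer with KV-cache
        future = model.sample_actions(z, n_tokens=n_lookahead)  # parallel multi-token prediction
        obs = env.step(future[0])                               # execute only the first action chunk
        lookahead = future[1:]                                  # keep the rest as short-horizon plan
    return obs
```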

5.7 Training goals and implementation details

The total loss sums the diffusion losses of frames and actions over the sequence:

$$\mathcal{L}_{\text{total}}=\sum_{n=1}^{N}\mathcal{L}(P_n, Z_n)=\sum_{n=1}^{N}\left(\mathcal{L}_{obs}(E_{O, n}, Z_n)+\mathcal{L}_{act}(E_{A, n}, Z_n)\right).$$

| Item | Setting |
|---|---|
| Action-chunk length | $L=8$ |
| Transformer maximum context | 2096 tokens |
| Context composition | 256 language tokens + 5 physical token packages |
| Each physical package | 360 visual tokens + 8 action tokens |
| Multi-view input | Views stitched into one image before the VAE and AR transformer |
| Training efficiency | Teacher forcing is fully parallel; the Transformer backbone is fine-tuned with LoRA |
| Inference efficiency | KV-cache stores intermediate features of each layer |
| Hardware and time | Single NVIDIA A100-SXM4-80GB; at most 60 GPU hours of fine-tuning |
| Position encoding | RoPE; frame and action tokens use different frequency settings |
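A quick arithmetic check of the context layout in the table (no assumptions beyond the listed numbers):

```python
# 256 language tokens plus 5 physical packages of 360 visual + 8 action tokens each.
lang, packages, visual, action = 256, 5, 360, 8
context = lang + packages * (visual + action)
assert context == 2096
print(context)  # 2096, matching the reported maximum context length
```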

6. Experimentation and reproducibility

6.1 LIBERO simulation experiment

The LIBERO experiments evaluate four suites: Spatial, Object, Goal, and Long. Each suite uses about 400 demonstrations for fine-tuning and 500 rollouts for evaluation, with success rate as the metric. The authors emphasize that "no action pretraining" means the backbone is not pretrained on large-scale manipulation/action data.

| Method | Params | Spatial | Object | Goal | Long | Avg. |
|---|---|---|---|---|---|---|
| DP, w/o action pretraining | - | 78 | 93 | 68 | 51 | 72 |
| OpenVLA | 7B | 85 | 88 | 79 | 54 | 77 |
| ThinkAct | 7B | 88 | 91 | 87 | 71 | 84 |
| Pi0-Fast | 3B | 96 | 97 | 89 | 60 | 86 |
| MolmoACT | 7B | 87 | 95 | 88 | 77 | 87 |
| UniMimic, w/o action pretraining | ~200M | 71 | 79 | 67 | 29 | 62 |
| WorldVLA, w/o action pretraining | 7B | 88 | 96 | 83 | 60 | 82 |
| PhysGen, w/o action pretraining | 732M | 91.0 | 99.6 | 93.8 | 78.8 | 90.8 |

Key points in interpreting the results: PhysGen averages 90.8, exceeding not only the LLM/VLM-based baselines but also the visual-generation-pretrained baseline WorldVLA by 8.8 points; on the Long suite it is 18.8 points above WorldVLA, which supports the claim that continuous AR world-interaction modeling improves long-horizon consistency. The only weak entry is Spatial, where it trails Pi0-Fast's 96; the authors attribute this to the underlying video model's limited spatial awareness.

6.2 ManiSkill simulation experiment

ManiSkill selects three tasks: PushCube, PickCube, and StackCube. Each task uses 1000 demonstrations for fine-tuning and 125 rollouts for evaluation.

| Method | PushCube | PickCube | StackCube | Avg. |
|---|---|---|---|---|
| ACT | 76 | 20 | 30 | 42 |
| BC-T | 98 | 4 | 14 | 39 |
| DP | 88 | 40 | 80 | 69 |
| ICRT | 77 | 78 | 30 | 62 |
| RDT | 100 | 77 | 74 | 84 |
| Pi0 | 100 | 60 | 48 | 69 |
| PhysGen | 100 | 73 | 48 | 74 |

On ManiSkill, PhysGen's average is below RDT's but above ICRT and Pi0, and it reaches 100% on PushCube. This suggests the video prior helps, but does not necessarily outperform action-pretrained policies in every manipulation setting.

6.3 Real robot experiment

The real-robot experiments use a Franka Panda with two RealSense D415 cameras, one fixed and one wrist-mounted. The four tasks are Pick Cube, Press Button, Stack Cube, and Pick Transparency. 80-100 teleoperation demonstrations are collected per task, and each method is evaluated over 20 trials. ACT is trained from scratch, OpenVLA and Pi0 are fine-tuned from official checkpoints, and PhysGen is fine-tuned only on these collected data, with no large-scale action pretraining of the backbone.

Real-world manipulation visualization
Franka Panda real-world tasks: transparent-object grasping, cube picking, button pressing, and cube stacking. The paper emphasizes that transparent objects cause visual ambiguity through reflection and refraction and therefore require physical priors.
| Method | Pick Cube | Press Button | Stack Cube | Pick Transparency | Avg. |
|---|---|---|---|---|---|
| ACT | 40 | 40 | 30 | 10 | 30 |
| OpenVLA | 30 | 25 | 10 | 0 | 16.3 |
| Pi0 | 85 | 85 | 60 | 70 | 75 |
| PhysGen | 80 | 85 | 60 | 75 | 75 |

6.4 Ablation experiment

Ablation is performed on LIBERO-Object, and the metric is success rate.

VariantPretrainTokenL-MTPBackboneSRexplain
PhysGen-ZeroNocontinuous+AR86.4No video pre-training as a control to test NOVA prior
PhysGen-DiscreteNOVAdiscrete+AR94.2Actions are quantified into discrete words and the value of continuous tokens is tested.
PhysGen-NoARNOVAcontinuous-NoAR95.0Remove the autoregressive rollout, equivalent to $N=1$ single-step mapping
PhysGen-STPNOVAcontinuous-AR96.8Single token prediction, remove Lookahead-MTP
PhysGen-FullNOVAcontinuous+AR99.6complete model

The ablations support four conclusions: video pretraining contributes +13.2 points; continuous tokens versus discrete contribute +5.4; the AR architecture contributes +4.6; and L-MTP contributes +3.4.

6.5 Qualitative analysis

Predicted videos and actual executions
Predicted video vs. actual execution: each row shows PhysGen's predicted video and the corresponding execution video. The author argues that the two are highly similar in motion trajectories and the timing of key actions, suggesting the video prior has been transferred to action planning.
Attention map visualization
Attention visualization: token-level attention shows that action prediction will selectively focus on historical frames and action tokens; pixel-level attention focuses on task-critical areas such as boxes, target areas, and robotic arms.

7. Discussion and limitations

7.1 The most valuable part of this paper

The most valuable contribution is turning the claim that "video generation models contain physical knowledge" into a trainable robot policy architecture. It does not stop at the indirect route of generating videos and then extracting actions from them; it incorporates actions directly into the autoregressive video token stream, so the model predicts future vision and actions at the same time. This lets the video prior, motion control, and sequence planning interact within one model.

The second contribution is the continuous token. Continuity matters for both robot motions and visual latents: discrete tokens are natural for language but cause precision loss in control. PhysGen uses a diffusion de-tokenizer to learn continuous conditional distributions, avoiding hard quantization while retaining generative modeling capability.

7.2 Why the results hold up

  • The ablation loop is relatively complete: video pretraining, continuous tokens, autoregression, and L-MTP each have a corresponding ablation, and each shows a positive gain.
  • Strong baselines: OpenVLA, Pi0-Fast, WorldVLA, and MolmoACT on LIBERO, and Pi0 in the real-robot experiments, so the gains are not obtained against weak baselines.
  • The real transparent-object task is discriminative: Pick Transparency involves visual ambiguity from reflection and refraction, a good probe of whether video physics priors help control.

7.3 Things to be careful about

"Video model = physics simulator" is still a metaphor: What the video generation model learns is the physical laws in visual statistics, which is not necessarily equivalent to a controllable and verifiable dynamics model. It may predict visually reasonable but controllably wrong states.

The real-robot experiments are small-scale: four Franka tabletop tasks with 20 trials each demonstrate feasibility but do not establish broad real-world generalization.

ManiSkill is not ahead across the board: The average of 74 is lower than RDT's 84, and StackCube is lower than DP/RDT. This shows that video priors without action pre-training are not a silver bullet for all scenarios.

The released source does not include the actual appendix: the text says real-robot task details are in the appendix, but the appendix `\input` in the LaTeX source is commented out, so some task-definition details cannot be verified from the source.

8. Likely questions at the group meeting

Q1: What is the biggest difference between PhysGen and WorldVLA/UWM?

All of them do joint video-action prediction, but PhysGen explicitly uses a NOVA-style continuous, non-quantized autoregressive backbone that combines frame tokens and action tokens into physical tokens and predicts, step by step, the co-evolution of the world and the robot, while using diffusion de-tokenizers to model the continuous visual/action distributions.

Q2: Why is it not considered a leak if the action token can attend to future visual states?

The authors' point is that within the same physical chunk, action planning can read the corresponding visual token representation, forming a visually conditioned inverse-kinematics relationship; the temporal causal mask is still maintained across chunks. Note that these "future visual states" are better seen as structured conditioning inside the chunk, not unrestricted access to real future trajectories.

Q3: Why only the first token is executed during L-MTP inference?

Subsequent tokens serve as lookahead information that gives the current prediction a longer planning horizon; executing only the first token preserves the feedback property of receding-horizon control, and the next step re-predicts from real environment feedback.

Q4: How is continuous diffusion de-tokenizer better than MLP regression?

MLP regression gives deterministic point estimates and tends to average multi-modal actions. The diffusion de-tokenizer learns the conditional distribution $p(P_n|Z_n)$, retaining both continuous precision and the generative ability to express multi-modal outputs.
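A toy numeric illustration of the mode-averaging issue (my own example, not from the paper): if demonstrations pass an obstacle either on the left (-1) or on the right (+1), MSE regression converges to roughly 0, an action that matches neither demonstration, while sampling from the learned distribution returns one of the two valid modes.

```python
# Toy example: MSE-optimal prediction averages a bimodal action distribution; sampling does not.
import numpy as np

rng = np.random.default_rng(0)
demos = rng.choice([-1.0, +1.0], size=1000) + 0.05 * rng.standard_normal(1000)

mse_optimal = demos.mean()            # ~0.0: averages the two modes, an invalid action
sampled = rng.choice(demos)           # generative-style sample: close to -1 or +1
print(f"MSE-optimal action: {mse_optimal:+.2f}, sampled action: {sampled:+.2f}")
```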

Q5: If it happens again, where is the most likely pitfall?

The first is token organization and masking: the language tokens, the 5 physical packages, and the 360 visual + 8 action tokens per package must all be aligned. The second is the diffusion sampling of the action de-tokenizer and the multiple samplings of $t$ during training. The third is whether the input distribution of the original NOVA VAE still matches after multi-view stitching.

9. Reproducibility information

9.1 Link and code status

9.2 Minimum Reproducible Configuration Shorthand

Backbone: NOVA autoregressive video generation model
Language tokens: frozen Phi language model
Visual tokenizer: frozen 3D-VAE from NOVA
Action tokenizer: MLP, K_A = L = 8
Physical token: concat(frame tokens, action tokens)
Context length: 2096 = 256 language + 5 * (360 visual + 8 action)
De-tokenizer: frame diffusion + Action-DiT action diffusion
Training: teacher forcing, LoRA finetuning, 4 samples of t per image/action
Inference: autoregressive rollout with KV-cache
Hardware: single A100-SXM4-80GB, longest finetuning within 60 GPU hours

9.3 Coverage check

This report covers the core content of the Abstract, Introduction, Related Work, Preliminaries, Method, Experiments, and Conclusion. The LaTeX source contains `\appendix`, but the appendix file `\input` is commented out and no appendix text is actually included; this is noted above.

Generation date: 2026-05-08. The source package, PDF, and decompressed contents are left in place to facilitate later verification.