EN 中文

Motus: A Unified Latent Action World Model

Authors:Hongzhe Bi, Hengkai Tan, Shenghao Xie, Zeyuan Wang, Shuhe Huang, Haitian Liu, Ruowen Zhao, Yao Feng, Chendong Xiang, Yinze Rong, Hongyan Zhao, Hanyu Liu, Zhizhong Su, Lei Ma, Hang Su, Jun Zhu

Affiliations:Tsinghua University, Peking University, Horizon Robotics

Version:arXiv: 2512. 13030; arXiv HTML generation date is displayed as 2025-12-25

Link:arXiv | PDF | Project Page | GitHub

1. Quick overview of the paper

One-sentence summary:Motus connects video generation model, vision-language understanding expert and action expert into a unified MoT architecture, uses UniDiffuser-style scheduler to switch VLA, world model, IDM, VGM and video-action joint prediction in the same model, and uses optical-flow latent action to allow videos without action tags to participate in action expert pre-training.

Difficulty rating:★★★★☆. Reading requires understanding of VLA/action chunking, diffusion or rectified flow, video generation model, Mixture-of-Transformers, latent action and robot evaluation protocol.

Keywords:Unified Embodied Foundation Model, Latent Action, World Model, MoT, Rectified Flow, Cross-Embodiment Pretraining。

Reading positioningcontent
What should the paper solve?Existing embodied models split understanding, future prediction, inverse dynamics, video generation and action output into isolated models; at the same time, large-scale videos, human perspective data and cross-robot trajectories make it difficult to jointly train action modules due to inconsistent action spaces.
The author's approachUse MoT + Tri-model Joint Attention to connect VGM, action expert, and understanding expert; use UniDiffuser-style scheduler with different mode timesteps to switch five modeling modes; use optical flow to train latent action to compress pixel motion into a space close to the control dimension.
most important resultsThe average success rate of RoboTwin 2. 0 reaches Clean 88. 66% and Randomized 87. 02%, which is an absolute improvement of about 45 points compared to $\pi_{0.5}$; the average PSR of real AC-One is 63. 22%, and the average PSR of Agilex-Aloha-2 is 59. 30%.
Things to note when readingThe "unification" of Motus is not to splice all data directly into a common Transformer, but to achieve it through expert retention, joint attention interaction, modal noise scheduling and latent action alignment; the appendix algorithm illustrates how the five modes are switched by the same rectified-flow model.

Core contribution list

2. Motivation

2. 1 What problem should be solved?

The paper focuses on language-conditioned robotic manipulation. Given the current visual observation $\mathbf{o}_t$, the verbal instruction $\ell$, and the optional proprioception $\mathbf{p}_t$, the policy needs to predict the future $k$ step action chunk $\mathbf{a}_{t+1:t+k}$. Traditional goal writing:

This formula is saying: Use expert demonstration data to train a policy so that it gives future action blocks that the expert will do under current observations and instructions.

$$\max_\theta\ \mathbb{E}_{(\mathbf{o}_t,\mathbf{p}_t,\mathbf{a}_{t+1:t+k},\ell)\sim D_{\text{expert}}}\log p_\theta(\mathbf{a}_{t+1:t+k}\mid \mathbf{o}_t,\mathbf{p}_t,\ell).$$
$\mathbf{o}_t$The current visual observation belongs to observation space $\mathcal{O}$.
$\mathbf{p}_t$Robot ontology perception; the footnote of the paper explains that proprioception and action share the representation space in joint position control.
$\mathbf{a}_{t+1:t+k}$Future action chunks, the length of action chunking is $k$.
$\ell$Language instructions.

The paper believes that a general embodied agent not only needs this kind of VLA conditional distribution, but also needs to be able to understand the scene, generate the future, predict the consequences based on the actions, infer the actions based on the video, and generate videos and actions simultaneously. Therefore, the author writes the problem as a unified modeling of five distributions:

modeldistributedeffect
VLA$p(\mathbf{a}_{t+1:t+k}\mid \mathbf{o}_t,\ell)$Take actions directly from the current image and commands.
World Model$p(\mathbf{o}_{t+1:t+k}\mid \mathbf{o}_t,\mathbf{a}_{t+1:t+k})$Given an action, predict future visual consequences.
IDM$p(\mathbf{a}_{t+1:t+k}\mid \mathbf{o}_{t:t+k})$Given current and future videos, infer actions.
VGM$p(\mathbf{o}_{t+1:t+k}\mid \mathbf{o}_t,\ell)$Given an initial image and language, imagine the task of future videos.
Joint$p(\mathbf{o}_{t+1:t+k},\mathbf{a}_{t+1:t+k}\mid \mathbf{o}_t,\ell)$Simultaneously generate future videos and actions.

2. 2 Limitations of existing methods

The paper divides the bottlenecks of existing methods into two categories. The first category is capability separation: VLAs usually learn policy directly from vision-language input, the video/world-model series emphasizes future prediction, $\mathcal{F}_1$ combines VLA with IDM but does not cover world model or VGM, and UWM covers five types of models but does not fully utilize internet-scale multimodal priors and domain priors in large-scale robot trajectories. The second category is data heterogeneity: cross-robot action spaces are different in dimension, scope, and semantics. Internet videos and egocentric human videos do not have action labels, so it is difficult to use these data directly for action expert pretraining.

The author clearly named these two points as two challenges in Problem Formulation: Challenge 1 is how to unify multi-modal generation capabilities in a framework; Challenge 2 is how to utilize heterogeneous data, especially videos with missing action labels but rich motion information.

2. 3 The solution ideas of this article

The high-level idea of ​​Motus is to "retain expert capabilities and then allow experts to interact under unified generative scheduling. " VGM provides physical and temporal priors, VLM/understanding expert provides visual language understanding and spatial positioning priors, and action expert is responsible for controlling the output. Tri-model Joint Attention allows three types of tokens to interact; UniDiffuser-style scheduler allows the same model to execute VLA, WM, IDM, VGM or Joint mode under different combinations of conditions by setting different noise timesteps for observation/action. For the problem of data heterogeneity, the paper does not forcefully unify the real actions of all robots, but learns latent actions from optical flow, and then uses a small number of action labels and task-agnostic data to bring the latent actions closer to the real control distribution.

3. Summary of related work

3. 1 Related work of the thesis self-description

Unified Multimodal Models。The paper starts from unified multi-modal generation models such as Emu3, Chameleon, Mmada, DualDiffusion, Show-o2, and Janus, and emphasizes that Bagel shares the self-attention layer of understanding expert and generation expert through MoT. Motus draws on this "expert retention + attention interaction" structure, but the application object extends from image and text generation to the embodied setting of video, action, and language.

Embodied foundation model paradigms。OpenVLA, $\pi_{0.5}$, GR00T-N1, etc. use VLM's image and text understanding capabilities for action prediction; UniPi, RoboDreamer, Vidar, etc. use VGM or future video to support world modeling or policy; $\mathcal{F}_1$ explicitly imagines future visual states and then uses IDM to output actions; UWM unifies five robotic modeling forms in a diffusion backbone. The core difference between Motus and UWM is the introduction of universal priors for pre-training VLM/VGM, and the addition of large-scale robot trajectories and latent action pre-training.

Latent Action Models。The paper points out that early latent actions usually use IDM + FDM to reconstruct the next frame from adjacent frames to learn low-dimensional actions, but RGB reconstruction can easily bring in appearance information such as lighting and texture. Subsequent methods reduce redundancy through low-capacity autoencoder, $\beta$-VAE, DINOv2 features, object keypoints, language instructions, or a small number of action labels. Motus chooses optical flow as motion-centric supervision and understands motion as pixel-level displacement.

3. 2 Direct comparison with previous works

Dimensions$\mathcal{F}_1$UWMBagel/MoT class unified modelMotus
Core ideaVLA explicitly imagines future vision, and then outputs actions through IDM.Unify WM, VLA, IDM, VGM, Joint in diffusion backbone.Sharing attention between comprehension and production experts with MoT.Put VGM, action expert, and VLM-derived understanding expert into the MoT, and use the scheduler to switch between the five embodied modes.
key assumptionsPredicting future images helps in action inference.The five distributions can be unified by a single diffusion framework.Experts can retain expertise while blending through attention.Pretrained VGM/VLM priors can be reused, and actions can be pretrained from unlabeled videos via optical-flow latent actions.
deficiency or boundaryDoes not override world model or VGM.There is a lack of sufficient general multimodal priors and priors in large-scale robotics domains.Not designed for robot action/world model.Requires three-stage training, optical flow preprocessing, VAE latent action alignment and target robot SFT.
Paper report performanceIt is not used as the baseline of the main table of this article.Not used as main table baseline.As an architecture source rather than a robot baseline.RoboTwin 2. 0 Clean 88. 66, Randomized 87. 02; the average PSR of the real machine is 63. 22 and 59. 30 respectively on the two platforms.

4. Detailed explanation of method

4. 1 Method overview

The overall data flow of Motus can be understood separately by training and inference. During training, the model inputs the current observation, language, noisy future observation/action chunk, and the timestep of each observation/action; the model outputs two velocity fields, and uses rectified-flow loss to simultaneously supervise future videos and future actions. During inference, by setting the initial timestep of different modes to clean or noise, the same model can obtain VGM, WM, IDM, VLA or Joint mode.Appendix Training and Inference

Motus Architecture
Figure 1。Motus Architecture. On the left is the Video Generation Model, in the middle is the Action Expert, and on the right is the Understanding Expert; the three exchange information through Tri-model Joint Attention. $\tau_v$ and $\tau_a$ control the rectified-flow timestep of video and action respectively.

4. 2 Method evolution

stageRepresent ideasWhat does this article inherit or change?
Separate VLA / VGM / WM / IDMEach ability is trained independently, with different input and output distributions.Motus writes five distributions as different conditionalized forms of the same multimodal generative model.
UWMPut observation tokens and action tokens into a single diffusion backbone.Motus does not train a single backbone from scratch, but reuses pre-trained experts such as Wan 2. 2 5B and Qwen3-VL-2B.
MoT unified multimodal modelExpert-preserved FFN/blocks are fused via shared attention.Motus transformed MoT into Tri-model Joint Attention, targeting video/action/language experts.
Latent ActionExtracting low-dimensional actions from visual variations.Motus uses optical flow instead of RGB next-frame reconstruction for more direct motion supervision and aligns the distribution with a small number of ground-truth action labels.

4. 3 Core design and mathematical derivation

4.3.1 Unified rectified-flow objective

This set of three formulas is doing one thing: let the action branch and observation branch predict the velocity field "from real data to noise", and then add the two errors.

$$ l_{\text{action}}^\theta = \mathbb{E}_{\substack{(\mathbf{o}_{t:t+k}, \mathbf{a}_{t+1:t+k}, \ell)\sim\mathcal{D}\\ \tau_a\sim \mathcal{U}(0,T_\tau),\ \epsilon_a\sim\mathcal{N}(\mathbf{0},\mathbf{I})}} \left\|v_a^\theta-(\epsilon_a-\mathbf{a}_{t+1:t+k})\right\|_2^2 $$ $$ l_{\text{obs}}^\theta = \mathbb{E}_{\substack{(\mathbf{o}_{t:t+k}, \mathbf{a}_{t+1:t+k}, \ell)\sim\mathcal{D}\\ \tau_o\sim \mathcal{U}(0,T_\tau),\ \epsilon_o\sim\mathcal{N}(\mathbf{0},\mathbf{I})}} \left\|v_o^\theta-(\epsilon_o-\mathbf{o}_{t+1:t+k})\right\|_2^2,\quad l^\theta=l_{\text{action}}^\theta+l_{\text{obs}}^\theta. $$
$v_a^\theta, v_o^\theta$Action velocity field and observation velocity field predicted by the model.
$\epsilon_a,\epsilon_o$Action/video noise sampled from standard Gaussian respectively.
$\tau_a,\tau_o$Separate noise time steps for action and observation; they allow the model to set different condition strengths on different modes.
$\epsilon - x^0$rectified flow Derivative direction of linear path $x^\tau=(1-\tau)x^0+\tau\epsilon$.
Why is the target speed $\epsilon-x^0$

Appendix Algorithm 1 Explicitly write $o_{t+1:t+k}^{\tau_o}=(1-\tau_o)o^0_{t+1:t+k}+\tau_o\epsilon_o$ and $a_{t+1:t+k}^{\tau_a}=(1-\tau_a)a^0_{t+1:t+k}+\tau_a\epsilon_a$. Taking the derivative of $\tau$, we get $\frac{d x^\tau}{d\tau}=\epsilon-x^0$. The training goal is therefore to have the model predict the speed of movement along a straight path at any intermediate noise point. During inference, the noise is integrated in the reverse direction to the clean sample. The appendix algorithm uses the form $x^{\tau-1}=x^\tau+v\,d\tau$ to represent discrete updates.

4. 3. 2 How to switch between the five modes of UniDiffuser-style scheduler

Instead of training five separate models, Motus uses the noise level of observation/action to indicate "is this modality a condition or a variable to be generated". If a certain mode is to be used as a condition, its timestep is set to 0, that is, clean; if a certain mode is to be generated, it starts iterative denoising from $T_\tau$ noise.Appendix Algorithms 2-6

modelclean conditions for the modelKeep variables that are noisy or need to be denoisedoutput
VGM$o_t^0,\ell$$o_{t+1:t+k}^{T_\tau}$ denoise; $a_{t+1:t+k}^{T_\tau}$ maintain noiseFuture Video $o^0_{t+1:t+k}$
World Model$o_t^0,a^0_{t+1:t+k},\ell$$o_{t+1:t+k}^{T_\tau}$ denoisingFuture video under action conditions
IDM$o^0_{t:t+k},\ell$$a_{t+1:t+k}^{T_\tau}$ denoisingActions that explain video changes
VLA$o_t^0,\ell$$a_{t+1:t+k}^{T_\tau}$ denoise; future observation maintains noiseaction chunk
Joint$o_t^0,\ell$observation and action simultaneously denoise from noiseFuture videos and actions

4.3.3 Tri-model Joint Attention

Motus contains three experts: Wan 2. 2 5B as the generative expert, Qwen3-VL-2B as the VLM source understanding expert, and an action expert with the same depth as Wan. Each expert retains its own Transformer module and FFN, but the QKV of the self-attention layer is connected into the same Tri-model Joint Attention. The goal is to preserve expert functionality while allowing video tokens, action tokens, and understanding tokens to interact across experts.

Unlike UWM, which directly connects observation/action tokens into the same backbone, Motus' MoT design makes pre-trained VGM/VLM weights easier to utilize. The block of action expert contains AdaLN, FFN and Tri-model Joint Attention; understanding expert uses the last layer of VLM corresponding tokens as input and consists of LayerNorm, FFN and Tri-model Joint Attention.

4.3.4 Action-Dense Video-Sparse Prediction

Action-Dense Video-Sparse Prediction
Figure 2。Action-Dense Video-Sparse Prediction. The paper points out that action chunking will bring about the problems of action frame density and video frame redundancy. Therefore, the video is downsampled during training and inference to make video tokens and action tokens more balanced. For example, the paper states that the video frame rate can be set to one-sixth of the action frame rate.

This design solves three things: reduce training/inference overhead; avoid predicting a large number of highly similar future frames; avoid the number of video tokens overwhelming the action tokens in Tri-model Joint Attention, causing the model to focus too much on video prediction and weakening action prediction.

4.3.5 Optical-flow latent action

Latent Action VAE
Figure 3。Latent Action VAE. Adjacent frames first obtain the optical flow through DPFlow and then convert it into an RGB flow image; DC-AE reconstructs the flow and encodes it into four 512-dimensional tokens; the lightweight encoder projects the $4\times512$ features to the 14-dimensional latent action.

This formula is training latent action VAE: it is necessary to reconstruct the optical flow, align the latent action with the real action, and use KL terms to constrain the latent space.

$$ \mathcal{L}=\mathcal{L}_{\text{recon}}+\lambda_a\|a_{\text{real}}-a_{\text{pred}}\|^2+\beta\mathcal{L}_{\text{KL}}. $$
$\mathcal{L}_{\text{recon}}$flow reconstruction error, allowing latent to retain pixel motion information.
$a_{\text{real}}$Real actions in labeled trajectories or task-agnostic data.
$a_{\text{pred}}$Action decoded or predicted from latent action.
$\lambda_a,\beta$The appendix implementation details give $\lambda_a=1.0$, $\beta=10^{-6}$.

During training, the author mixed 90% unlabeled data for self-supervised reconstruction and 10% labeled trajectories for weak action supervision; the labeled part includes task-agnostic data and standard robot demonstrations. Task-agnostic data adopts the AnyPos idea and uses Curobo to randomly sample the target robot action space to collect image-action pairs, so that the latent space is anchored to the executable control distribution.

4. 4 Implementation points

Model size and key hyperparameters

componentsConfigurationsource
Action ExpertHidden size 1024;30 layers;24 heads;LayerNorm eps 1e-5;GELUImplementation Details
Understanding ExpertHidden size 512;30 layers;24 heads;LayerNorm eps 1e-5;GELUImplementation Details
Latent Action VAE$\lambda_a=1.0$;$\beta=1\times10^{-6}$Implementation Details
Sampling RateVideo frames: 8 @ 5Hz;Action chunk: 48 @ 30HzImplementation Details
Flow MatchingInference steps 10;sampling strategy: Logit NormalImplementation Details
Model ScaleVGM 5. 00B; VLM 2. 13B; Action Expert 641. 5M; Understanding Expert 253. 5M; Total about 8BImplementation Details

Training pseudocode

Algorithm: Motus training iteration # sample clean data and independent noise levels sample o^0_{t:t+k}, a^0_{t+1:t+k}, language l from D_expert sample tau_o, tau_a from Uniform({1, ..., T_tau}) sample eps_o, eps_a from Normal(0, I) # rectified-flow interpolation o_tau = (1 - tau_o) * o^0_{t+1:t+k} + tau_o * eps_o a_tau = (1 - tau_a) * a^0_{t+1:t+k} + tau_a * eps_a # one unified model predicts two velocity fields v_o, v_a = Model_theta(o^0_t, o_tau, a_tau, tau_o, tau_a, l) loss = ||v_a - (eps_a - a^0_{t+1:t+k})||^2 + ||v_o - (eps_o - o^0_{t+1:t+k})||^2 update theta with AdamW

Three-stage training and data pyramid

stagedatatraining objecteffect
Off-the-shelf foundation modelsLevel 1: Web DataVGM and VLMProviding general visual, linguistic and video generation priors.
Stage 1: Video GenerationLevel 2 egocentric human videos;Level 3 synthetic data;Level 5 multi-robot task trajectory dataOnly VGMAdapt VGM to videos of robot operations and future mission execution videos.
Stage 2: Unified Training with Latent ActionsLevel 2、3、4 task-agnostic data、5Motus three experts, using latent actions; VLM frozenPre-train action experts and cross-modal interactions with videos, language, and latent actions.
Stage 3: SFTLevel 6 target-robot task trajectory dataMotus three experts, using real actionsAdapt target robot specific dynamics and kinematics.
Embodied Data Pyramid
Figure 4。Embodied Data Pyramid. The data decreases in quantity and increases in quality and strategic relevance from Level 1 web data to Level 6 target-robot trajectory data. Paper Notes The order of Level 3 and Level 4 may sometimes change.

5. Experiment

5. 1 Experimental setup

projectset up
Simulation benchmarkRoboTwin 2. 0, 50 representative manipulation tasks; each task has 50 demonstrations of clean scenes and 500 demonstrations of randomized scenes, for a total of 2, 500 clean + 25, 000 randomized demonstrations.
Simulation randomizationRandom background, messy desktop, table height disturbance, random lighting.
Simulation training/evaluationAll models start from pre-training checkpoint and finetune 40k steps on RoboTwin data; each task is evaluated with 100 execution trials at success rate.
Real machine platformAC-One and Agilex-Aloha-2 are two dual-arm robot platforms.
Real machine missionSpatial understanding, flexible objects, fine fluid control, visual understanding, long-range planning and other tasks; 100 trajectories for each task training; each platform jointly trains a model for multiple tasks.
Real machine indicatorsPartial Success Rate, points are given according to subtasks, and full points are awarded for complete success.
Baselines$\pi_{0.5}$, X-VLA; ablation includes w/o Pretrain, Stage1-only/Stage1 Pretrain.

Complete data set information

DatasetSizeEmbodimentData Level
Egodex230,949HumanLevel 2: Egocentric Human Videos
Agibot728,209Genie-1 RobotLevel 5: Multi-Robot Task Trajectory Data
RDT6,083Aloha RobotLevel 5
RoboMind Franka9,589Franka RobotLevel 5
RoboMind Aloha7,272Aloha RobotLevel 5
RoboTwin27,500Aloha RobotLevel 3: Synthetic Data
Task-Agnostic Data1,000Aloha RobotLevel 4
In-house Data2,000Aloha RobotLevel 6: Target-Robot Task Trajectory Data

training configuration

ConfigurationStage 1Stage 2Stage 3
Batch Size256256256
Learning Rate$8\times10^{-5}$$5\times10^{-5}$$1\sim5\times10^{-5}$
OptimizerAdamWAdamWAdamW
Weight Decay0.010.010.01
GPU Hoursabout 8000about 10000about 400

5. 2 Main results

RoboTwin 2.0 simulation

ModelClean Avg. (%)Randomized Avg. (%)illustrate
GO-1 Appendix Table 1437.8036.24Appears only in appendix full comparison.
$\pi_{0.5}$42.9843.84Main baseline.
X-VLA72.8072.84Strong VLA baseline.
w/o Pretrain77.5677.00Without pre-training ablation; the text short table Clean writes 72. 8, and the appendix full table gives 77. 56.
Stage182.2681.86Only do Stage 1 video pretraining.
Motus88.6687.02The Clean and Randomized columns have the highest average.

The author emphasizes in the text that compared to $\pi_{0.5}$, Motus has an absolute improvement of more than 45 percentage points in RoboTwin 2. 0 randomized multi-task setting; compared to X-VLA, the summary and contribution section reports are improved by about +15%. Appendix Table 14 gives the complete step-by-step results for the 50 tasks, and Table 2 of the main text is a truncated presentation.

Real-world tasks

Task definitions and visualizations
Figure 5。Real task definition and visualization. The paper uses this figure to show the task language instructions and subtask definitions on AC-One and Agilex-Aloha-2.
Platform$\pi_{0.5}$ Avg.w/o Pretrain Avg.Motus Avg.key phenomena
AC-One14.7925.8663.22Motus tops in 8 out of 9 tasks; Touch Keyboard tops w/o Pretrain.
Agilex-Aloha-248.6026.6059.30Motus is the highest on average; Put Bread into Oven and Touch Keyboard are not the highest.

Table 4 and Table 5 of the text explain the calculation of the Partial Success Rate using two tasks: AC-One's Put Bread into Oven is scored according to sub-goals such as Open Oven, Grab Bread, Put Bread, Close Oven, Spin Button, etc. ; Agilex-Aloha-2's Get Water from Water Dispenser is scored according to Grab cup, Fill cup, Complete Success. Appendix Table 15 and Table 16 extend subgoal breakdown to all real-world missions.Appendix More Real-World Results

5. 3 Ablation experiment

Ablation in RoboTwin 2.0
Figure 6。Training phase ablation of RoboTwin 2. 0 randomized and clean. Randomized: w/o pretrain 77. 00, Stage1 81. 86, Stage2/Motus 87. 02, improvement +10. 02; Clean: w/o pretrain 77. 56, Stage1 82. 26, Stage2/Motus 88. 66, improvement +11. 10.

This ablation verifies the pre-training contribution in the three-stage training: Stage 1 is improved compared to no pre-training, and the unified latent-action pretraining of Stage 2 further improves the final success rate. The real experiment w/o Pretrain also appears in Tables 3, 15, and 16 as a control; AC-One improved from 25. 86 to 63. 22 on average, and Agilex-Aloha-2 improved from 26. 60 to 59. 30 on average.

5. 4 Supplementary experiments

Supplementary validation of five modes

modelSupplementary resultsConclusion scope
VGMAgilex-Aloha-2 and AC-One visualizations are shown in Figure 7 and Figure 9 respectively.Demonstrating video generation capabilities under language and initial image conditions.
World ModelAgilex-Aloha-2: FID 9.4571, FVD 49.2848, SSIM 0.88618, LPIPS 0.05449, PSNR 26.1021;AC-One: FID 12.9609, FVD 73.1325, SSIM 0.84605, LPIPS 0.07280, PSNR 24.0379。Demonstrating action-conditioned future video prediction quality.
IDMAction MSE: ResNet18+MLP 0.044,DINOv2+MLP 0.122,Motus 0.014。Motus' IDM mode is lower than specially trained IDM baselines on this test.
VLARoboTwin 2.0 randomized average success: Motus (VLA) 83.90,Motus (Joint) 87.02。Joint mode is higher than individual VLA mode, but VLA mode is still competitive.
Video-Action JointFigure 12 shows a visualization of simultaneous video and action generation in real-machine inference.Supplementary explanation: The same model can simultaneously predict future vision and actions.
VGM Agilex-Aloha-2
Figure 7。Motus VGM mode on Agilex-Aloha-2。
VGM AC-One
Figure 9。Motus VGM mode on AC-One。
Nine real-world task demos
Figure 8。Visualization of the execution of two types of robots and nine real machine tasks, supplementing the real experiment table in the text.
World model Agilex-Aloha-2
Figure 10。World Model mode visualization on Agilex-Aloha-2 data.
World model AC-One
Figure 11。World Model mode visualization on AC-One data.
Video-action joint prediction
Figure 12。Video-Action Joint Prediction Model mode during real-world inference。

Other benchmarks

Benchmarkset upresult
LIBERO-Long10 long-range linguistic conditional manipulation tasks in LIBERO-100.Motus 97. 6, equal to X-VLA 97. 6; higher than $\pi_0$ 85. 2, GR00T-N1 90. 6, UniVLA 94. 0, OpenVLA-OFT 94. 5.
VLABench In DistributionThree tasks: Add Condiment, Select Toy, and Select Fruit.Motus averaged 0. 48, higher than $\pi_{0.5}$'s average of 0. 43.
VLABench Cross CategorySame three tasks, evaluated across categories.Motus averaged 0. 25, higher than $\pi_{0.5}$'s average of 0. 22.

6. Result analysis and discussion

6. 1 Interpretation of the results given in the paper

For RoboTwin 2. 0, the author explains that two parts of the design work together: the unified MoT model integrates vision, language and action generation, responding to Challenge 1; latent actions allow the model to utilize labeled and large-scale unlabeled data, improve cross-embodiment generalization and capture motion priors, responding to Challenge 2. The text also emphasizes that the benchmark's random background, cluttered desktop, table height perturbation and lighting perturbation enable it to test generalization under distribution shift.

Regarding the real machine results, the author emphasizes that the tasks of the two platforms include spatial understanding, deformable objects manipulation, precision fluid control, visual understanding, and long-horizon planning. Because the task is divisible and long-term, the paper uses Partial Success Rate instead of just looking at the binary success rate to reflect whether the model completes the key sub-goals.

For the supplementary results of the five modes, the author used the visualization and generation quality indicators of VGM/WM, the action MSE of IDM, and the success rate comparison of VLA and Joint to illustrate that Motus is not just a policy, but performs multiple embodied foundation model functions in the same model.

6. 2 Anomalous or borderline results in the paper

Not every single item on the real task list is Motus-high. AC-One's Touch Instructed Keyboard has a w/o Pretrain of 100 and a Motus of 82. 5; Agilex-Aloha-2's Put Bread into Oven has a $\pi_{0.5}$ of 36 and a Motus of 34; Agilex-Aloha-2's Touch Keyboard has a w/o Pretrain of 85 and a Motus of 80. The paper does not provide additional explanations for these individual items, so the report only records the phenomena in the table without expanding on the reasons.

The appendix VLA vs Joint table shows that Motus (Joint) 87. 02 is higher than Motus (VLA) 83. 90, which shows that the final main result of the paper is stronger using video-action joint prediction mode; however, the author still uses VLA mode as a display of the switchability of the same model.

7. Analysis, Limitations and Boundaries

The most valuable part of this paper

Based on the paper's own claims and experiments, the core value of Motus is to handle the two issues of "unified model" and "available heterogeneous data" in the same technical closed loop. MoT and Tri-model Joint Attention enable VGM/VLM pre-training priors to enter the robot action model; UniDiffuser-style scheduler unifies five conditions/joint distributions into different timestep settings; optical-flow latent action allows videos without action labels to participate in action expert pre-training through motion prior. These three together support the simulation average success rate, real machine PSR, IDM MSE and WM/VGM visualization results.

Why does the result stand?

The results of the paper are supported by multiple complementary levels: RoboTwin 2. 0 uses 50 tasks, two clean/randomized scenarios, 100 trials/task, and limits all models to 40k finetuning steps; the real machine evaluation covers two dual-arm platforms and a variety of long-range/fine tasks, with 100 trajectories per task, and multi-task joint training; the appendix provides a full 50-task table, real machine sub-goal breakdown, LIBERO-Long, VLABench, WM Build quality, IDM MSE, VLA vs Joint comparison and training configuration. Therefore, the main conclusion is not supported by a single benchmark or a single indicator.

Limitations stated by the author

The text under the heading Conclusion and Limitations does not list specific failure cases or negative limitations, but summarizes that Motus unifies vision-language understanding, video generation, inverse dynamics, world modeling and video-action joint prediction, and explains that in the future, it will continue to explore more advanced unified model architectures, more universal motion priors, and learn latent actions from internet-scale general videos. The report does not expand these future work into new recommendations, but only records them as follow-up directions clearly written by the authors.

applicable boundary

8. Repeat audit

Code and models

Partially available.The main file of the paper gives the Project Page; retrieves the GitHub link https://github.com/thu-ml/Motus. Whether the complete training code, weights, and data processing scripts are included depends on the current content of the warehouse.

data

The paper lists the data sources and scale.The appendix lists the scale, embodiment, and data pyramid levels of Egodex, Agibot, RDT, RoboMind, RoboTwin, Task-Agnostic Data, and In-house Data. However, the collection details, filtering rules, and original publishing status of In-house Data will affect full reproduction.

training configuration

Key training hyperparameters are given.Batch size, learning rate, optimizer, weight decay, GPU hours, model size, latent VAE weights, sampling rate and flow matching steps are listed in the appendix.

Main steps to reproduce

  1. Prepare VGM and VLM pre-training experts: Wan 2. 2 5B, Qwen3-VL-2B.
  2. Use DPFlow to generate optical flow for video data and train latent action VAE: flow reconstruction + action alignment + KL.
  3. Stage 1 only trains/adapts the VGM to generate future mission videos on robot operation data.
  4. Stage 2 trains three experts Motus, VLM frozen, input videos, language and latent actions.
  5. Stage 3 SFT on the target robot's real action trajectories, replacing latent actions with the target robot's actions.
  6. When evaluating, set the observation/action timestep according to the target mode: the clean mode is set to 0, and the mode to be generated starts from $T_\tau$ noise.

Potential obstacles to recurrence

Computing power and data are the main bottlenecks.The appendix training configuration shows that Stage 1 takes about 8, 000 GPU hours and Stage 2 takes about 10, 000 GPU hours; the total model size is about 8B. Complete reproduction also requires obtaining large-scale multi-source data and in-house target-robot trajectories, and replicating the real robot PSR evaluation protocol.