Junior PhD Group Meeting Reading Report

GigaWorld-Policy: An Efficient Action-Centered World-Action Model

Angen Ye et al., GigaAI. arXiv: 2603.17240v2, updated 2026-03-21. The paper proposes an action-centered World-Action Model (WAM): during training, action prediction is supervised by future-video dynamics; during inference, the video branch can be switched off, leaving only low-latency action decoding.
Topic: World-Action Model · Backbone: Wan 2.2 5B diffusion Transformer · Key metric: 0.36 s/step · Official code: open-gigaai/giga-world-policy

1. Quick overview of the paper

What should the paper solve? Existing VLA models learn only from sparse action labels, so the supervision density is insufficient; existing WAMs often tightly couple "future video generation" with "action prediction", so a large number of video tokens must be sampled at inference, causing high latency, and action quality is easily corrupted by future-video prediction errors.
The authors' approach. Make the WAM action-centered: the model first predicts the action chunk and an action latent, then uses these as action conditions to predict future video. A causal attention mask ensures action tokens cannot see future video tokens, so only the action branch needs to be sampled at inference; the video branch serves as a dynamics constraint during training and an optional diagnostic output.
Most important results. On RoboTwin 2.0, GigaWorld-Policy's average Clean/Rand success rate is 0.87/0.85, close to Motus's 0.89/0.87, while inference latency drops from Motus's 3231 ms to 360 ms; the average over four real-robot tasks is 0.83, higher than Motus (0.76), $\pi_{0.5}$ (0.69), and GigaBrain-0 (0.68).
Things to note when reading. The paper's claim is not "video prediction must happen at inference" but "future video provides high-density physical supervision during training, while the action path remains independently usable at inference." The information-flow design of the causal mask is therefore the most important detail to reproduce.
One-sentence contribution. It separates "the world model provides dense visual-dynamics supervision" from "low-latency closed-loop control at deployment": video generation helps training but does not hold inference hostage.
Version/wording differences. The arXiv abstract emphasizes a $9\times$ speedup and a +7% real-task success rate over Motus; the project page carries a marketing claim of "10x/35%". This report relies mainly on the numbers in the paper's tables and text.

2. Problem background and motivation

2.1 Why VLA alone is not enough

A VLA model typically learns $a_{t: t+p-1}\sim q_{\Theta}(\cdot\mid o_t, s_t, l)$: the input is the current image, robot state, and language, and the output is a chunk of future actions. The authors argue that the weakness of this paradigm is sparse action supervision: actions are low-dimensional with repetitive patterns, while observations and language are very high-dimensional. The model may learn a shallow context-to-action template mapping without ever being forced to understand how the world changes after an action is executed.

2.2 Why ordinary WAM is not enough

Recent WAMs use video generation models to introduce temporally dense supervision, which in principle lets the policy learn physical dynamics. However, many methods tightly couple actions with future video: joint action-video prediction generates future visual trajectories at inference, while two-stage methods first generate future video and then decode actions with an inverse dynamics model (IDM). This brings two risks: first, diffusion sampling of video tokens is very slow; second, video prediction errors propagate into actions, and small errors accumulate over long horizons.

Comparison of VLA, joint WAM, two-stage WAM, and GigaWorld-Policy
Figure 1: The authors divide prior work into four categories: VLA with auxiliary future supervision, joint action-video WAM, two-stage video-to-action, and this paper's action-centered WAM.

2.3 The core turn of this article

This paper does not deny the value of future-video supervision but changes its role in the system: future video is an auxiliary task that regularizes action plausibility during training, not an intermediate product that must be completed at inference. Action tokens are designed to depend only on current observations, state, and language, while future video tokens can only be generated after actions, so video prediction can be switched off.

4. Detailed explanation of the method

GigaWorld-Policy pipeline
Figure 2: Overall training pipeline. A general video generation model is first adapted into a robot-relevant video dynamics model, then action prediction and future-video prediction are jointly trained on target-robot trajectories.

4.1 Formalization of tasks

At each timestep $t$, the robot receives multi-view RGB observations $o_t=\{o_t^v\}_{v\in S}$ with $S=\{\text{left}, \text{front}, \text{right}\}$, a language instruction $l$, and robot state $s_t$. The policy outputs an action chunk of length $p$:

$$a_{t: t+p-1}=(a_t, a_{t+1}, \ldots, a_{t+p-1}).$$

Traditional VLA learns:

$$a_{t: t+p-1}\sim q_\Theta(\cdot\mid o_t, s_t, l).$$

GigaWorld-Policy lets the unified model $g_\Theta$ parameterize two conditional distributions at the same time. Action side:

$$\big(a_{t: t+p-1}, c_t\big)\sim g_\Theta(\cdot\mid o_t, s_t, l), $$

where $c_t$ is the action latent conditioning signal used for visual prediction. Visual dynamics side:

$$ (o_{t+\Delta}, o_{t+2\Delta}, \ldots, o_{t+K\Delta}) \sim g_\Theta(\cdot\mid o_t, s_t, l, c_t), \quad K=\lfloor p/\Delta\rfloor. $$

This decomposition is critical: the action is modeled first, and future video is a downstream prediction conditioned on the action, not a prerequisite the action must depend on.

4.2 Input token and multi-view splicing

To handle the three cameras without changing the video generation backbone, the authors combine the left/front/right views into a single composite image:

$$o_t^{comp}=\mathrm{Compose}(o_t^{left}, o_t^{front}, o_t^{right}).$$

Both current and future observations are encoded into visual latents by the same pretrained VAE and then split into spatiotemporal visual tokens: current-observation tokens are denoted $T_o$ and future-video tokens $T_f$. The robot state and actions are mapped to the hidden dimension through linear layers, yielding $T_s$ and $T_a$ respectively. The language instruction is encoded by a pretrained language encoder into $T_l$ and injected via cross-attention.
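As a concrete sketch of this tokenization, the following assumes a horizontal side-by-side layout for Compose (the paper does not specify the layout) and uses hypothetical dimensions throughout:

```python
import torch
import torch.nn as nn

def compose_views(o_left, o_front, o_right):
    # Stitch the three camera views into one composite image.
    # Horizontal concatenation is an assumption; the paper does not
    # specify the exact layout of Compose().
    return torch.cat([o_left, o_front, o_right], dim=-1)  # concat along width

# Hypothetical dimensions, for illustration only.
state_dim, action_dim, hidden, chunk_len = 14, 14, 3072, 48

state_proj = nn.Linear(state_dim, hidden)    # s_t -> T_s
action_proj = nn.Linear(action_dim, hidden)  # a_{t:t+p-1} -> T_a

views = [torch.rand(1, 3, 224, 224) for _ in range(3)]
o_comp = compose_views(*views)                           # (1, 3, 224, 672)
T_s = state_proj(torch.rand(1, 1, state_dim))            # (1, 1, 3072)
T_a = action_proj(torch.rand(1, chunk_len, action_dim))  # one token per step
# o_comp is then encoded by the pretrained VAE and split into
# spatiotemporal visual tokens T_o (and T_f for future frames).
```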

4.3 Sharing Transformer and causal mask

Unlike MoE or multi-branch expert designs, this paper feeds all tokens into the same stack of Transformer blocks with shared Q/K/V projections. The unified sequence is written as:

$$T_t=[\, T_o; \, T_s; \, T_a; \, T_f\, ].$$

Causal attention mask
Figure 3: Causal attention mask. Action tokens are not allowed to see future video tokens, but future video tokens are allowed to see action tokens.

The mask imposes three dependency rules: $T_s$ and $T_o$ can attend to each other but not to actions or the future; $T_a$ can attend to $T_s, T_o$ but not $T_f$; $T_f$ can attend to $T_s, T_o, T_a$. The meaning: action prediction is determined only by the current context, while future-video prediction is determined by the current context plus the action. Visual-dynamics supervision during training therefore cannot "cheat" its way into action tokens via information leakage, and the structure guarantees that the future-video branch can be switched off at inference.
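A minimal PyTorch sketch of this block mask, with `True` meaning "may attend"; the segment lengths are placeholders, and full attention within each segment is an assumption:

```python
import torch

def build_wam_mask(n_o: int, n_s: int, n_a: int, n_f: int) -> torch.Tensor:
    """Block-level attention mask for the sequence [T_o; T_s; T_a; T_f].

    Encodes the three rules from the paper:
      - T_o and T_s attend only to each other (the current context),
      - T_a attends to the context and itself, never to T_f,
      - T_f attends to context, actions, and itself.
    """
    n = n_o + n_s + n_a + n_f
    mask = torch.zeros(n, n, dtype=torch.bool)
    ctx = slice(0, n_o + n_s)                # T_o and T_s
    act = slice(n_o + n_s, n_o + n_s + n_a)  # T_a
    fut = slice(n_o + n_s + n_a, n)          # T_f
    mask[ctx, ctx] = True   # context <-> context
    mask[act, ctx] = True   # actions see the context ...
    mask[act, act] = True   # ... and themselves, but never T_f
    mask[fut, :] = True     # future sees context, actions, and itself
    return mask

print(build_wam_mask(n_o=4, n_s=1, n_a=3, n_f=2).int())
```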

4.4 Training goals: two flow-matching losses

For either modality $x$ (action tokens or future-video latents), sample a flow time $s\sim U(0, 1)$ and noise $\epsilon\sim\mathcal N(0, I)$, and construct:

$$x^{(s)}=(1-s)\epsilon+s x, \qquad \dot{x}^{(s)}=x-\epsilon.$$

The future-video loss is defined on the VAE latent $z_f$:

$$ \mathcal L_{video}= \mathbb E_{s, \epsilon}\left[ \left\| g_\Theta(z_f^{(s)}, s\mid T_s, T_o, T_a, T_l)-\dot z_f^{(s)} \right\|^2 \right]. $$

The action loss is conditioned only on the current context and language, never on future video:

$$ \mathcal L_{action}= \mathbb E_{s, \epsilon}\left[ \left\| g_\Theta(a^{(s)}, s\mid T_s, T_o, T_l)-\dot a^{(s)} \right\|^2 \right]. $$

The pre-training stage optimizes only the video flow-matching loss; the post-training stage optimizes both jointly:

$$\mathcal L_{all}=\lambda_{video}\mathcal L_{video}+\lambda_{action}\mathcal L_{action}.$$
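A minimal sketch of one joint post-training step under these definitions; `model` stands in for $g_\Theta$, and its `cond`/`head` interface is an assumption, not the paper's API:

```python
import torch

def flow_matching_loss(model, x, cond, head):
    # x^(s) = (1 - s) * eps + s * x, target velocity = x - eps,
    # matching the linear path defined in Section 4.4.
    b = x.shape[0]
    s = torch.rand(b, *([1] * (x.dim() - 1)))   # flow time s ~ U(0, 1)
    eps = torch.randn_like(x)                   # noise ~ N(0, I)
    x_s = (1 - s) * eps + s * x
    pred = model(x_s, s, cond=cond, head=head)  # hypothetical interface
    return ((pred - (x - eps)) ** 2).mean()

def joint_loss(model, batch, lam_video=1.0, lam_action=5.0):
    # The video loss conditions on actions (T_a); the action loss never
    # sees future frames, mirroring the causal mask.
    l_v = flow_matching_loss(model, batch["z_future"],
                             cond=("T_s", "T_o", "T_a", "T_l"), head="video")
    l_a = flow_matching_loss(model, batch["actions"],
                             cond=("T_s", "T_o", "T_l"), head="action")
    return lam_video * l_v + lam_action * l_a
```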

4.5 Inference: action-only decoding

The inference-time context is $w_t=(T_l, T_s, T_o)$. The model initializes and samples only action tokens:

$$a^{(0)}\sim\mathcal N(0, I), \qquad \frac{d a^{(s)}}{ds}=g_\Theta(a^{(s)}, s\mid w_t), \ s\in[0, 1].$$

After integration, $a^{(1)}$ is decoded into the continuous action chunk $\hat a_{t: t+p-1}$. After execution, the new observation closes the loop. If visualization or diagnosis is needed, the video branch can still be opened: either jointly denoise the future-video tokens, or reuse the KV cache from action denoising to generate the video afterwards. Control itself does not depend on this step.
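A minimal sketch of the action-only sampler with a plain Euler integrator (the paper does not state its solver or step count; both, like the model interface, are assumptions here):

```python
import torch

@torch.no_grad()
def decode_actions(model, context, chunk_len=48, action_dim=14, steps=10):
    """Integrate da/ds = g_theta(a^(s), s | w_t) from s=0 to s=1,
    starting from a^(0) ~ N(0, I); the video branch is never touched."""
    a = torch.randn(1, chunk_len, action_dim)  # a^(0)
    ds = 1.0 / steps
    for i in range(steps):
        s = torch.full((1,), i * ds)
        v = model(a, s, cond=context, head="action")  # hypothetical interface
        a = a + ds * v                                # Euler step
    return a  # a^(1), decoded into the action chunk \hat a_{t:t+p-1}
```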

5. Key points of data, training and reproducibility

5.1 Pre-training data

The authors pre-train on roughly 10,000 hours of embodied data spanning real robot videos, egocentric human videos, and general interaction videos. Estimated hours per source:

| Data source | Hours | Role |
| --- | --- | --- |
| EgoDex | 800 | Hand/object interaction, daily manipulation primitives |
| Agibot | 2,500 | Real robot operation and workspace visual distribution |
| EGO4D | 3,500 | Long-horizon human egocentric activity structure |
| RoboMind | 300 | Robot manipulation video |
| RDT | 25 | Robot manipulation video |
| Open X-Embodiment | 3,500 | Cross-robot/cross-task visual coverage |
| DROID | 350 | Real robot manipulation |
| ATARA | 10 | Robot task video |
| Something-Something V2 | 200 | Object-interaction dynamics prior |

5.2 Training recipe

Reproduction-cost assessment. The paper gives the key hyperparameters and data scale but does not fully describe the training code, data cleaning details, the exact multi-view compose layout, action normalization, or sampling steps. The official project page provides a code link, but actual reproducibility depends on what the repository releases and whether weights/data are available.

6. Analysis of experimental results

6.1 Inference speed and success rate

Latency versus success comparison
Figure 4: Real-task success rate versus inference latency, measured on an A100. GigaWorld-Policy's selling point is sitting in the high-success, low-latency region.
| Method | Latency (ms) | Simulation SR | Real-world SR |
| --- | --- | --- | --- |
| $\pi_{0.5}$ | 225 | 0.48 | 0.69 |
| GigaBrain-0 | 452 | -- | 0.68 |
| Motus | 3231 | 0.88 | 0.76 |
| Cosmos-Policy | 1413 | -- | 0.58 |
| GigaWorld-Policy | 360 | 0.86 | 0.83 |

Compared to Motus, GigaWorld-Policy's simulation success rate is 0.02 lower, but latency drops from 3231 ms to 360 ms and the real-world success rate is higher. This supports the authors' core argument: the world-model training signal is useful, but fully generating future video is not necessary at inference.

6.2 RoboTwin 2.0 simulation

RoboTwin 2.0 contains 50 representative manipulation tasks, evaluated under clean and randomized scenes. The main-table averages are: $\pi_{0.5}$ 0.43/0.44, X-VLA 0.73/0.73, Motus 0.89/0.87, GigaWorld-Policy 0.87/0.85. In other words, the method comes close to the strongest WAM baseline while being far ahead in real-time performance.

6.3 Real robot tasks

The real platform is the AgileX PiPER 6-DoF robotic arm. The appendix defines four tasks:

| Method | Clean Desk | Scan QR | Sweep Trash | Stack Bowls | Avg. |
| --- | --- | --- | --- | --- | --- |
| $\pi_{0.5}$ | 0.75 | 0.55 | 0.65 | 0.80 | 0.69 |
| GigaBrain-0 | 0.70 | 0.65 | 0.60 | 0.75 | 0.68 |
| Motus | 0.80 | 0.75 | 0.70 | 0.80 | 0.76 |
| Cosmos-Policy | 0.65 | 0.50 | 0.45 | 0.70 | 0.58 |
| GigaWorld-Policy | 0.90 | 0.75 | 0.75 | 0.90 | 0.83 |
QR scanning task
Real task: QR code scanning.
Trash sweeping task
Real task: sweeping trash.
Stacking bowls and cleaning desk tasks
Appendix figure: real deployment scenes for the Stack Bowls and Clean Desk tasks.

6.4 Data efficiency and ablation

Data efficiency curve
Figure 5: Real-task success rate as the fraction of training data varies. The authors claim GigaWorld-Policy reaches the VLA baseline's maximum success rate using only 10% of the data.
Embodied pretraining data fraction
Figure 6: The higher the fraction of embodied pre-training data, the higher the real-task success rate.
| Ablation | Setting | Result | Interpretation |
| --- | --- | --- | --- |
| Pretraining combination | scratch / video init / embodied pretraining / both | SR: 0.45 / 0.57 / 0.73 / 0.83 | General video priors and embodied-data pretraining are complementary. |
| Number of future frames | $\Delta = 0, 4, 8, 12, 24, 48$ | SR: 0.60 / 0.76 / 0.78 / 0.83 / 0.80 / 0.76 | A moderate amount of future modeling helps; overly dense prediction has diminishing returns. |
| Causal mask | Self-Attn vs. Ours | SR 0.81 vs. 0.83; PSNR 27.87 vs. 28.41; SSIM 0.892 vs. 0.901 | Prevents future-token leakage while improving action-conditioned video prediction quality. |
Qualitative video prediction comparison
Figure 7: The causal mask predicts object-state changes more accurately than full self-attention. Red boxes mark the regions the authors highlight.
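To read the $\Delta$ row concretely, apply $K=\lfloor p/\Delta\rfloor$ from Section 4.1 with the chunk length $p=48$ quoted later in this report:

$$p=48:\quad \Delta=4\Rightarrow K=12,\qquad \Delta=12\Rightarrow K=4,\qquad \Delta=48\Rightarrow K=1,$$

so the best-performing setting ($\Delta=12$) supervises only four future frames per chunk, while $\Delta=0$ denotes no future-frame supervision at all.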

7. Discussion: Value, Credibility and Limitations

7.1 The most valuable part of this paper

Its greatest value is a clean structural solution to WAM's deployment bottleneck. Many prior "world models for policies" methods emphasized the inference loop: imagine the future, then extract actions from that imagination. This paper keeps future dynamics as a training signal and decouples the action inference path from the video inference path. That matters for real robots: 3 seconds of inference, however good the success rate, destroys the closed-loop control frequency; 360 ms is not yet high-frequency servoing, but it is within range for a deployable policy layer.

Another value is that it does not reduce future-video prediction to "useless visualization". The ablation shows the success rate drops sharply with no future frames ($\Delta=0$), indicating that future-dynamics supervision does provide learning signal for actions; yet the optimum at $\Delta=12$ also shows that denser video prediction is not always better. This is a more nuanced result than "the world model must be rolled out".

7.2 Why the results hold up

The chain of evidence has three levels. First, the latency table directly compares the action-only path against WAM baselines that require video at inference, and the gap is huge. Second, RoboTwin 2.0 results are paired with real-platform results, showing the method is not only effective in simulation. Third, the ablations separate pretraining, the number of future frames, and the causal mask, matching the method's three main claims: large-scale embodied pretraining helps, future-dynamics supervision helps, and the mask makes the video branch optional while preventing leakage.

For the causal-mask ablation in particular, although SR only moves from 0.81 to 0.83, PSNR/SSIM and the visualizations together show the mask improves the modeling of action-conditioned dynamics. This is consistent with the method's assumption: actions should not steal information from future frames, and future frames should be explainable by actions.

7.3 Main limitations

8. Questions to raise at the group meeting

  1. If no future video is generated at inference, through which parameter paths does the training-time $\mathcal L_{video}$ influence action prediction? Is sharing Transformer blocks enough to explain this transfer?
  2. The causal mask prevents action tokens from seeing future video tokens, but the action and video losses still share the backbone. Could there be gradient-level conflicts, and is $\lambda_{action}=5, \lambda_{video}=1$ sensitive across tasks?
  3. Is the optimal $\Delta=12$ tied to the action chunk length $p=48$, the robot's control frequency, and task duration? How should it be set for a different robot or faster tasks?
  4. The composite multi-view image sacrifices explicit inter-camera geometry. Compared with independent per-view tokens plus view embeddings, does its advantage come mainly from compatibility with the video backbone?
  5. How strongly do the video-prediction metrics PSNR/SSIM correlate with real control success? Are there cases where the video looks bad but the action is still right, or the video looks good but the action fails?
  6. Each real task has 50 demos and the model has large-scale pretraining. If only a few downstream demos were allowed, would the bottleneck be action-head adaptation or a visual-dynamics prior that is not close enough?

9. Reproduction Checklist

| Module | What must be confirmed | Information given in the paper |
| --- | --- | --- |
| Code/weights | Official repository, model weights, inference scripts, training configs | github.com/open-gigaai/giga-world-policy |
| Input processing | Three-view compose layout, image resolution, VAE latent shape | The three views are combined into one composite image at the same resolution; visual tokens are VAE-encoded. |
| Model | How Wan 2.2 5B is adapted, state/action projections, language cross-attention | Shared Transformer blocks; 2D positional encoding for vision, 1D temporal positional encoding for state/action. |
| Training | Flow steps, batching, loss weights, optimizer, data sampling ratios | AdamW, batch 256, lr 1e-4 to 1e-6; post-training $\lambda_a=5$, $\lambda_v=1$. |
| Evaluation | RoboTwin 2.0 task list, randomization settings, real-robot trial protocol | 50 tasks; clean/randomized; 20 trials per real task, at most 5 attempts per trial. |
| Inference | Action-only sampling steps, KV-cache use, control frequency | Future video is off by default; only action tokens are denoised; the video branch or KV-cache reuse is optional. |

Paper page: arXiv: 2603.17240; PDF: arxiv.org/pdf/2603.17240; Project page: GigaWorld-Policy Project.