GigaWorld-Policy: An Efficient Action-Centered World-Action Model
1. Quick overview of the paper
| What problem does the paper solve? | Existing VLA models rely only on sparse action-label supervision, so supervision density is insufficient; existing WAMs often tightly couple "future video generation" with "action prediction", forcing a large number of video tokens to be sampled at inference, which causes high latency and makes action quality vulnerable to errors in the predicted future video. |
|---|---|
| The authors' approach | Make the WAM action-centered: the model first predicts action chunks and action latents, then uses these action conditions to predict future video. A causal attention mask ensures action tokens cannot see future video tokens, so only the action branch needs to be sampled at inference; the video branch is only a dynamics constraint during training and an optional diagnostic output. |
| Most important results | On RoboTwin 2.0, GigaWorld-Policy's average success rate is 0.87/0.85 (Clean/Rand), close to Motus's 0.89/0.87, but inference latency drops from Motus's 3231 ms to 360 ms; the real-robot four-task average success rate is 0.83, higher than Motus (0.76), $\pi_{0.5}$ (0.69), and GigaBrain-0 (0.68). |
| Things to note when reading | The paper's claim is not that "video prediction must happen at inference", but that "future video serves as high-density physical supervision during training, while the action path is independently available at inference." The information-flow design of the causal mask is therefore the most important detail to reproduce. |
2. Problem background and motivation
2.1 Why VLA alone is not enough
A VLA model typically learns $a_{t: t+p-1}\sim q_{\Theta}(\cdot\mid o_t, s_t, l)$: the input is the current image, robot state, and language, and the output is a chunk of future actions. The authors argue that the weakness of this paradigm is sparse action supervision: actions are low-dimensional with repetitive patterns, while observations and language are very high-dimensional. The model may learn a shallow context-to-action template mapping without being forced to understand how the world will change once the action is executed.
2.2 Why ordinary WAM is not enough
Recent WAMs use video generation models to introduce temporally dense supervision, which in principle lets the policy learn physical dynamics. However, many methods tightly couple actions with future videos: joint action-video prediction generates future visual trajectories at inference, while two-stage methods first generate future video and then decode actions with an inverse dynamics model (IDM). This brings two risks: first, sampling diffusion video tokens is very slow; second, video prediction errors propagate into actions, and small errors accumulate over long horizons.
2.3 The core shift in this paper
This paper does not deny the value of future-video supervision but changes its role in the system: future video is an auxiliary task that regularizes action plausibility during training, not an intermediate product that must be produced at inference. Action tokens are designed to depend only on the current observation, state, and language, while future video tokens can only be generated after the actions, so video prediction can be switched off.
4. Method in detail
4.1 Task formalization
At each timestep $t$, the robot receives multi-view RGB observations $o_t=\{o_t^v\}_{v\in S}$ with $S=\{\text{left}, \text{front}, \text{right}\}$, a language instruction $l$, and a proprioceptive state $s_t$. The policy outputs an action chunk of length $p$:
$$a_{t: t+p-1}=(a_t, a_{t+1}, \ldots, a_{t+p-1}).$$
Traditional VLA learns:
$$a_{t: t+p-1}\sim q_\Theta(\cdot\mid o_t, s_t, l).$$
GigaWorld-Policy lets a unified model $g_\Theta$ parameterize two conditional distributions simultaneously. Action side:
$$\big(a_{t: t+p-1}, c_t\big)\sim g_\Theta(\cdot\mid o_t, s_t, l), $$
where $c_t$ is an action-latent conditioning signal used for visual prediction. Visual-dynamics side:
$$ (o_{t+\Delta}, o_{t+2\Delta}, \ldots, o_{t+K\Delta}) \sim g_\Theta(\cdot\mid o_t, s_t, l, c_t), \quad K=\lfloor p/\Delta\rfloor. $$
This decomposition is critical: the action is modeled first, and the future video is a downstream prediction conditioned on the action, not a precursor that the action must depend on.
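Spelled out as a single factorization (my paraphrase of the two equations above, not notation from the paper), the joint distribution splits so that the second factor can simply be dropped at inference:
$$ g_\Theta\big(a_{t: t+p-1}, c_t, o_{t+\Delta: t+K\Delta}\mid o_t, s_t, l\big) = \underbrace{g_\Theta\big(a_{t: t+p-1}, c_t\mid o_t, s_t, l\big)}_{\text{action side: sampled at inference}} \cdot \underbrace{g_\Theta\big(o_{t+\Delta: t+K\Delta}\mid o_t, s_t, l, c_t\big)}_{\text{video side: training-time supervision only}} $$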
4.2 Input tokens and multi-view composition
To process the three cameras without changing the video-generation backbone, the authors compose the left/front/right views into a single composite image:
$$o_t^{comp}=\mathrm{Compose}(o_t^{left}, o_t^{front}, o_t^{right}).$$
Both current and future observations are encoded into visual latents by the same pre-trained VAE and then cut into spatiotemporal visual tokens: the current-observation tokens are denoted $T_o$ and the future-video tokens $T_f$. The proprioceptive state and the actions are mapped to the hidden dimension through linear layers, yielding $T_s$ and $T_a$ respectively. The language instruction is encoded by a pre-trained language encoder into $T_l$ and injected via cross-attention.
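A minimal sketch of this input pipeline in PyTorch style — the horizontal tiling layout and the module names (`vae`, `proj_state`, `proj_action`) are my assumptions, since the paper only states that the views are composed into one image and projected linearly:

```python
import torch

def compose_views(o_left, o_front, o_right):
    """Tile three camera views into one composite image.

    The paper does not specify the tiling layout; horizontal
    concatenation is assumed here for illustration.
    Inputs: (B, 3, H, W) each -> output: (B, 3, H, 3 * W).
    """
    return torch.cat([o_left, o_front, o_right], dim=-1)

def build_tokens(vae, proj_state, proj_action, o_comp, o_future, s_t, a_chunk):
    """Encode raw inputs into the four token groups of Sec. 4.2.

    `vae`, `proj_state`, and `proj_action` are hypothetical modules
    standing in for the pre-trained VAE and the linear projections.
    """
    T_o = vae.encode(o_comp)      # current-observation latent tokens
    T_f = vae.encode(o_future)    # future-video latent tokens (training only)
    T_s = proj_state(s_t)         # proprioceptive state -> hidden dim
    T_a = proj_action(a_chunk)    # action chunk -> hidden dim
    return T_o, T_s, T_a, T_f
```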
4.3 Shared Transformer and causal mask
Unlike MoE or multi-branch expert designs, this paper routes all tokens through the same Transformer blocks with shared Q/K/V projections. The unified sequence is written as:
$$T_t=[\, T_o; \, T_s; \, T_a; \, T_f\, ].$$
A causal mask over this sequence imposes three dependency rules: $T_s$ and $T_o$ can attend to each other but not to the actions or the future; $T_a$ can attend to $T_s$ and $T_o$ but not $T_f$; $T_f$ can attend to $T_s$, $T_o$, and $T_a$. The meaning: action prediction is determined only by the current context, and future-video prediction by the current context plus the actions. As a result, visual-dynamics supervision during training cannot "cheat" by leaking information into the action tokens, and the structure guarantees that the future-video branch can be closed at inference.
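A minimal sketch of how such a block-wise mask could be built (the intra-group attention pattern and the True-means-allowed convention are my assumptions; the paper does not print the mask matrix):

```python
import torch

def build_block_causal_mask(n_o, n_s, n_a, n_f):
    """Boolean attention mask over the sequence [T_o; T_s; T_a; T_f].

    True = attention allowed. Encodes the three rules of Sec. 4.3:
      - T_o, T_s attend to {T_o, T_s} only;
      - T_a attends to {T_o, T_s, T_a};
      - T_f attends to everything.
    Within-group self-attention is assumed; the paper does not spell
    out intra-group masking.
    """
    sizes = [n_o, n_s, n_a, n_f]
    # group id per token position: 0=T_o, 1=T_s, 2=T_a, 3=T_f
    gid = torch.repeat_interleave(torch.arange(4), torch.tensor(sizes))
    # visibility[q_group][k_group]: may a query in q_group see k_group?
    visibility = torch.tensor([
        [1, 1, 0, 0],   # T_o sees T_o, T_s
        [1, 1, 0, 0],   # T_s sees T_o, T_s
        [1, 1, 1, 0],   # T_a sees T_o, T_s, T_a
        [1, 1, 1, 1],   # T_f sees all
    ], dtype=torch.bool)
    return visibility[gid][:, gid]   # (n, n) token-level mask
```

The resulting boolean matrix can be passed as `attn_mask` to `torch.nn.functional.scaled_dot_product_attention`, where `True` marks positions allowed to attend.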
4.4 Training objectives: two flow-matching losses
For either modality $x$ (action tokens or future-video latents), sample a flow time $s\sim U(0, 1)$ and noise $\epsilon\sim\mathcal N(0, I)$, and construct:
$$x^{(s)}=(1-s)\epsilon+s x, \qquad \dot{x}^{(s)}=x-\epsilon.$$
The future-video loss is defined on the VAE latent $z_f$:
$$ \mathcal L_{video}= \mathbb E_{s, \epsilon}\left[ \left\| g_\Theta(z_f^{(s)}, s\mid T_s, T_o, T_a, T_l)-\dot z_f^{(s)} \right\|^2 \right]. $$
The action loss is conditioned only on the current context and language, never on future video:
$$ \mathcal L_{action}= \mathbb E_{s, \epsilon}\left[ \left\| g_\Theta(a^{(s)}, s\mid T_s, T_o, T_l)-\dot a^{(s)} \right\|^2 \right]. $$
Only the video flow-matching loss is optimized in the pre-training stage; the post-training stage optimizes both jointly:
$$\mathcal L_{all}=\lambda_{video}\mathcal L_{video}+\lambda_{action}\mathcal L_{action}.$$
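A minimal sketch of both losses under this scheme, assuming a hypothetical `model(x_s, s, cond)` interface that predicts the velocity (the paper does not give the actual call signature):

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(model, x, cond):
    """Flow-matching loss for one modality (Sec. 4.4).

    x    : clean target -- the action chunk a, or the future-video latent z_f
    cond : conditioning tokens for this branch (no T_f / T_a for the action loss)
    `model(x_s, s, cond)` is a hypothetical velocity-prediction interface.
    """
    b = x.shape[0]
    s = torch.rand(b, *([1] * (x.dim() - 1)), device=x.device)  # flow time s ~ U(0, 1)
    eps = torch.randn_like(x)                                   # noise ~ N(0, I)
    x_s = (1 - s) * eps + s * x                                 # interpolant x^(s)
    target = x - eps                                            # velocity dx^(s)/ds
    return F.mse_loss(model(x_s, s, cond), target)

# Post-training objective with the reported weights (Sec. 5.2):
# loss = 5.0 * flow_matching_loss(g, actions, (T_l, T_s, T_o)) \
#      + 1.0 * flow_matching_loss(g, z_future, (T_l, T_s, T_o, T_a))
```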
4.5 Inference: action-only decoding
The inference-time context is $w_t=(T_l, T_s, T_o)$. The model initializes and samples only the action tokens:
$$a^{(0)}\sim\mathcal N(0, I), \qquad \frac{d a^{(s)}}{ds}=g_\Theta(a^{(s)}, s\mid w_t), \ s\in[0, 1].$$
After integration, $a^{(1)}$ is decoded into the continuous action chunk $\hat a_{t: t+p-1}$. After execution, the new observation closes the loop. For visualization or diagnosis, the video branch can still be opened: either jointly denoise future video tokens, or reuse the KV cache from action denoising to regenerate the video. Control itself does not depend on this step.
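A sketch of the action-only sampler under these equations, assuming plain Euler integration (the paper does not state the ODE solver or the number of steps, and `model(a_s, s, w_t)` is a hypothetical interface for $g_\Theta$):

```python
import torch

@torch.no_grad()
def sample_actions(model, w_t, action_shape, n_steps=10, device="cpu"):
    """Action-only decoding: integrate da/ds = g(a^(s), s | w_t) from s=0 to s=1."""
    a = torch.randn(action_shape, device=device)   # a^(0) ~ N(0, I)
    ds = 1.0 / n_steps
    for i in range(n_steps):
        s = torch.full((action_shape[0],), i * ds, device=device)
        a = a + ds * model(a, s, w_t)              # Euler step along the learned flow
    return a   # a^(1): decode into the action chunk \hat{a}_{t:t+p-1}
```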
5. Key points on data, training, and reproducibility
5.1 Pre-training data
The authors used about 10,000 hours of embodied data for pre-training, spanning real robot videos, egocentric human videos, and general interaction videos. The estimated hours:
| Data source | Hours | Function |
|---|---|---|
| EgoDex | 800 | Hand/object interaction, daily operation primitives |
| Agibot | 2,500 | Robot real operation and workspace visual distribution |
| EGO4D | 3,500 | Long-horizon human first-person activity structure |
| RoboMind | 300 | Robot operation video |
| RDT | 25 | Robot operation video |
| Open X-Embodiment | 3,500 | Cross-robot/cross-task visual coverage |
| DROID | 350 | Real robot manipulation |
| ATARA | 10 | Robot mission video |
| Something-Something V2 | 200 | Object interaction dynamic prior |
5.2 Training recipe
- Backbone: Wan 2.2 5B diffusion Transformer.
- Action chunk length: $p=48$.
- Default future observation stride: $\Delta=12$, so $K=\lfloor 48/12\rfloor=4$ future frames.
- Post-training loss weight: $\lambda_{action}=5$, $\lambda_{video}=1$.
- Pre-training hyperparameters (from the appendix): about 6000 GPU hours, global batch size 256, AdamW with $\beta_1=0.85, \beta_2=0.9$, learning rate cosine-decayed from $1\times10^{-4}$ to $1\times10^{-6}$.
- For real tasks, 50 demonstration trajectories are collected per task for post-training; each method is evaluated with 20 trials per task, allowing at most 5 attempts per trial.
- Each simulated task is evaluated with 100 test episodes; the training data is 50 demos per task in clean scenes and 500 demos per task in randomized scenes, totaling 2,500 clean + 25,000 randomized demonstrations.
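For reference, a sketch collecting the stated hyperparameters in one place — the field names are my own labels; only the values come from the bullets above:

```python
# Hyperparameters reported in Section 5; field names are my own,
# only the values are taken from the paper.
TRAIN_CONFIG = {
    "backbone": "Wan 2.2 5B diffusion Transformer",
    "action_chunk_p": 48,
    "future_stride_delta": 12,   # K = 48 // 12 = 4 future frames
    "lambda_action": 5.0,        # post-training loss weights
    "lambda_video": 1.0,
    "optimizer": "AdamW",
    "adam_betas": (0.85, 0.9),
    "lr_schedule": "cosine",
    "lr_start": 1e-4,
    "lr_end": 1e-6,
    "global_batch_size": 256,
    "pretrain_gpu_hours": 6000,
}
```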
6. Analysis of experimental results
6.1 Inference speed and success rate
| Method | Inference time (ms) | Simulation SR | Real-world SR |
|---|---|---|---|
| $\pi_{0.5}$ | 225 | 0.48 | 0.69 |
| GigaBrain-0 | 452 | -- | 0.68 |
| Motus | 3231 | 0.88 | 0.76 |
| Cosmos-Policy | 1413 | -- | 0.58 |
| GigaWorld-Policy | 360 | 0.86 | 0.83 |
Compared with Motus, GigaWorld-Policy's simulation success rate is 0.02 lower, but latency drops from 3231 ms to 360 ms and the real-world success rate is higher. This supports the authors' core argument: the world-model training signal is useful, but fully generating future video is unnecessary at inference.
6.2 RoboTwin 2.0 simulation
RoboTwin 2.0 contains 50 representative manipulation tasks, evaluated under clean and randomized scenes. The main-table averages are: $\pi_{0.5}$ 0.43/0.44, X-VLA 0.73/0.73, Motus 0.89/0.87, GigaWorld-Policy 0.87/0.85. In other words, the method comes close to the strongest WAM baseline while being far ahead in real-time performance.
6.3 Real robot tasks
The real platform is an AgileX PiPER 6-DoF robotic arm. Four tasks are defined in the appendix:
- Clean the Desk: move various dishes into the target basket, with the dishes required to end up under the bowl.
- Stack Bowls: nest and stack two bowls starting from arbitrary initial poses.
- Scan a QR Code: pick up the scanner, grasp the target object, align it with the QR code to read it, then place it back on the target object.
- Sweep up Trash: pick up a brush and dustpan and sweep scattered small objects into the dustpan.
| Method | Clean Desk | Scan QR | Sweep Trash | Stack Bowls | Avg. |
|---|---|---|---|---|---|
| $\pi_{0.5}$ | 0.75 | 0.55 | 0.65 | 0.80 | 0.69 |
| GigaBrain-0 | 0.70 | 0.65 | 0.60 | 0.75 | 0.68 |
| Motus | 0.80 | 0.75 | 0.70 | 0.80 | 0.76 |
| Cosmos-Policy | 0.65 | 0.50 | 0.45 | 0.70 | 0.58 |
| GigaWorld-Policy | 0.90 | 0.75 | 0.75 | 0.90 | 0.83 |
6.4 Data efficiency and ablations
| Ablation | Settings | Result | Explanation |
|---|---|---|---|
| Pre-training combination | scratch / video init / embodied pretraining / both | SR: 0.45 / 0.57 / 0.73 / 0.83 | General video priors and embodied-data pretraining complement each other. |
| Future-frame stride | $\Delta=0, 4, 8, 12, 24, 48$ | SR: 0.60 / 0.76 / 0.78 / 0.83 / 0.80 / 0.76 | Moderate future modeling helps; both denser ($\Delta=4$) and sparser ($\Delta=48$) prediction underperform $\Delta=12$. |
| Causal mask | Self-Attn vs Ours | SR 0.81 vs 0.83; PSNR 27.87 vs 28.41; SSIM 0.892 vs 0.901 | The mask prevents future-token leakage while improving action-conditioned video prediction quality. |
7. Discussion: Value, Credibility and Limitations
7.1 The most valuable part of this paper
The greatest value is a clean structural solution to the deployment bottleneck of WAMs. Many earlier "use a world model as a policy" methods put the emphasis at inference: imagine the future, then extract actions from the imagination. This paper keeps future dynamics as a training signal while decoupling the action inference path from the video path. That matters for real robots: 3 seconds of inference, even with a good success rate, destroys the closed-loop control frequency; 360 ms is not yet high-frequency servoing, but it is within range for a deployable policy layer.
Another value is that future-video prediction is not reduced to "useless visualization". The ablation shows a significant success-rate drop at $\Delta=0$ (no future frames), indicating that future-dynamics supervision does provide a learning signal for actions; yet the optimum at $\Delta=12$ also shows that denser video prediction is not automatically better. This finding is more nuanced than "the world model must be rolled out".
7.2 Why the results hold up
The chain of evidence has three levels. First, the speed table directly compares the action-only path against WAM baselines that require video inference, and the gap is large. Second, RoboTwin 2.0 results are paired with real-platform results, showing the method is not only effective in simulation. Third, the ablations separate pre-training, the future-frame stride, and the causal mask, matching the method's three main claims: large-scale embodied pretraining helps, future-dynamics supervision helps, and the mask makes the video branch optional while preventing leakage.
For the causal-mask ablation in particular, although SR only moves from 0.81 to 0.83, PSNR/SSIM and the visualizations together show that the mask improves the modeling quality of action-conditioned dynamics. This is consistent with the method's assumption: actions should not steal information from future frames, and future frames should be explained by actions.
7.3 Main limitations
- Reproduction is resource-heavy. A 5B diffusion Transformer, about 10,000 hours of data, and 6000 GPU hours of pre-training are beyond what an ordinary lab can fully reproduce.
- Some key engineering details still need the code for confirmation: the multi-view tiling layout, action representation/normalization, flow sampling steps, post-training data cleaning, and the real-robot control interface are not detailed in the main text.
- The number of real tasks is limited. Four PiPER tasks demonstrate deployment potential but do not cover contact-rich, dynamic-object, human-robot collaboration, or long-horizon tasks.
- The speed is still policy-layer speed. 360 ms is much faster than prior WAMs, but tasks requiring higher-frequency fine force control still need a low-level controller or a lightweight distilled policy.
- There are slight inconsistencies in the official numbers. The paper's table implies a 9x speedup and +7% success, while the project page claims 10x and +35%; be clear about which value you quote at the group meeting.
8. Questions to raise at the group meeting
- If no future video is generated at inference, through which parameter paths does the training-time $\mathcal L_{video}$ influence action prediction? Is sharing Transformer blocks enough to explain this transfer?
- The causal mask prevents action tokens from seeing future video tokens, but the action and video losses still share the backbone. Are there gradient-level conflicts, and is $\lambda_{action}=5, \lambda_{video}=1$ sensitive across tasks?
- Is the optimum $\Delta=12$ tied to the action-chunk length $p=48$, the robot's control frequency, and the task duration? How should it be set for a different robot or faster tasks?
- The composite multi-view image sacrifices explicit inter-camera geometry. Compared with independent per-view tokens plus view embeddings, does its advantage come mainly from compatibility with the video backbone?
- How strongly do the video-quality metrics PSNR/SSIM correlate with real control success? Are there cases where the video looks bad but the action is still right, or where the video is good but the action fails?
- Each real task has 50 demos, and the model has large-scale pre-training. If only a handful of downstream demos were allowed, would the bottleneck come more from action-head adaptation or from a visual-dynamics prior that is not close enough?
9. Reproduction checklist
| Module | What to confirm | What the paper provides |
|---|---|---|
| Code/weights | Official repository, model weights, inference scripts, training configs | github.com/open-gigaai/giga-world-policy |
| Input processing | Three-view compose rule, image resolution, VAE latent shape | The three views are combined into one composite image at the same resolution; visual tokens are encoded with a VAE. |
| Model | Wan 2.2 5B integration, state/action projections, language cross-attention | Shared Transformer blocks, 2D positional embeddings for vision, 1D temporal positional embeddings for state/actions. |
| Training | Flow steps, batching, loss weights, optimizer, data sampling ratio | AdamW, batch 256, lr 1e-4 to 1e-6; post-training $\lambda_a=5, \lambda_v=1$. |
| Evaluation | RoboTwin 2.0 task list, randomization settings, real-robot trial protocol | 50 tasks; clean/randomized; 20 real-task trials, at most 5 attempts per trial. |
| Inference | Action-only sampling steps, KV-cache usage, control frequency | Future video is off by default and only action tokens are denoised; the video branch or KV-cache reuse can optionally be enabled. |
Paper page: arXiv:2603.17240; PDF: arxiv.org/pdf/2603.17240; Project page: GigaWorld-Policy Project.