Group paper reading report (junior PhD version)

Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning

Authors: Moo Jin Kim, Yihuai Gao, Tsung-Yi Lin, Yen-Chen Lin, Yunhao Ge, Grace Lam, Percy Liang, Shuran Song, Ming-Yu Liu, Chelsea Finn, Jinwei Gu

Institutions: NVIDIA; Stanford University

arXiv: 2601.16163, submitted 2026-01-22

Project page: research.nvidia.com/labs/dir/cosmos-policy/

1. Quick overview of the paper

One-sentence summary: This paper fine-tunes a pre-trained video diffusion model, Cosmos-Predict2-2B, directly into a robot policy without changing the model architecture. It relies only on a unified interface that encodes actions, future states, and value functions as latent frames, and it supports both direct control and model-based planning.

Difficulty rating: ★★★★☆

You need to be familiar with diffusion models, latent video models, imitation learning, world models / value functions, and robot policy evaluation.

Keywords

Video Foundation Model · Latent Diffusion · Visuomotor Policy · World Model · Best-of-N Planning

Core contributions

  • Proposes Cosmos Policy: single-stage post-training that adapts the pre-trained video model directly into a robot control policy, with no extra action head, inverse dynamics model, or separate action diffuser.
  • Proposes latent frame injection: actions, robot proprioceptive states, future states, and value functions are represented as "frames" in the video latent space, so the same diffusion model can play the three roles of policy, world model, and value function.
  • Proposes a dual-model deployment: the base checkpoint produces actions, while a rollout-fine-tuned checkpoint predicts future states and values and performs best-of-N planning at test time.
  • Achieves strong results on LIBERO, RoboCasa, and real-arm ALOHA: a 98.5% average on LIBERO, a 67.1% average on RoboCasa, and a 93.6 average score across the four ALOHA tasks.
Cosmos Policy overview
Figure 1. Overview diagram: inputs are multi-view images, robot state, and a text task description; outputs are action chunks, future states, and values. Later in the report this figure is revisited in two parts: "how the inputs are organized" and "how training/inference responsibilities are split".

2. Motivation

2.1 What problem should be solved?

The paper's concern: a base video model has already learned temporal causality, implicit physics, and motion patterns. Can these priors be used directly for robot control, rather than relying only on VLAs whose backbones are pre-trained on static images and text? The authors want a single unified policy model that can output actions, imagine future states, and judge whether those futures are worth executing.

The specific pain point is not the generic "robot control is not strong enough" but two things at once: first, the action distribution of many tasks is multi-modal, e.g., when picking up candies you can start with the left hand and then the right, or vice versa; second, many tasks are extremely precision-sensitive, such as the slider of a ziploc bag, where millimeter-level differences lead to failure. The authors argue that video models have a natural advantage in modeling complex, high-dimensional, multi-modal temporal distributions, so it is worth transferring them directly to control.

2.2 Limitations of existing methods

  • The video-policy route is usually multi-stage: first adapt the video model to robot data, then train a separate action module, or add an inverse dynamics model / action diffuser. The training chain is longer, there are more modules, and there are more sources of error.
  • Unified video-action models avoid the multi-module structure, but they are often not initialized from a large-scale pre-trained video model; a custom architecture is trained from scratch, so the spatio-temporal priors learned from Internet video are unavailable.
  • VLAs are powerful, but they are mainly built on static image-text pre-training. The authors' judgment is that this kind of backbone is better at semantics than at dynamic physical processes, so it may not provide the most natural inductive bias for highly multi-modal, long-horizon, fine-grained control.
  • On the planning side, a world model and value function trained only on demonstrations see mostly successful trajectories; once real execution drifts into failure or out-of-distribution states, their generalization degrades. The authors stress, both in the text and in the planning experiments, that rollout experience must be introduced for planning to be useful. [Text §4.3]

2.3 The paper's solution (high level)

The authors' core idea: "do not bolt a new head onto the video model; instead, put everything the robot task needs back into the latent sequence the model is most familiar with." Concretely, arrange the current observation $s$, action chunk $a$, future observation $s'$, and future value $V(s')$ into a sequence of latent frames, and let the diffusion model keep doing what it already does best: recovering these frames from noise.

In this way, the same model learns three conditional distributions during training via different condition masks: the policy $p(a, s', V(s')\mid s)$, the world model $p(s', V(s')\mid s, a)$, and the value function $p(V(s')\mid s, a, s')$. At test time it can be used directly as a policy, or, after rollout fine-tuning, hooked into a best-of-N planner.

3. Summary of related work

3.1 Related work as described by the paper

  • Video-based robot policies: one line first fine-tunes a video model and then trains a separate action module; another trains a joint video-action model directly. The authors position their work as a third line: still a unified generative model, but initialized directly from a large video model with no architectural modifications.
  • Vision-language-action models: RT-2, OpenVLA, π0.5, UniVLA, and CogVLA represent the large-backbone + robot fine-tuning route. The paper's difference is not "we also have a large backbone" but that the backbone is a video generation model rather than a static image-text model.
  • World models and value functions: from classic Dyna / MBPO / TD-MPC / Dreamer to recent FLARE, SAILOR, and Latent Policy Steering, planning has repeatedly been shown to be valuable, but the policy, world model, and value function usually have to be built separately. The distinguishing point here is that one pre-trained video model takes on all three roles at once.

3.2 Direct comparison with previous works

| Dimension | Video policy / UVA class | UWM-like unified video-action model | VLA (OpenVLA / π0.5) | Cosmos Policy |
|---|---|---|---|---|
| Pre-trained backbone | Video model, often with an extra action module | Custom unified model, usually does not inherit a large video model | Static image-text model | Cosmos-Predict2-2B video diffusion model |
| Architectural changes | Usually required | The model itself is designed for actions | Action heads / adaptation strategies vary by method | Architecture unchanged; only the contents of the latent sequence are reorganized |
| Action representation | Separate action module or inverse dynamics | Unified video-action output | Direct regression or a separate action head | Action chunks encoded as latent frames |
| Unified modeling of future state / value | Usually not unified | Does not necessarily include value | Usually no explicit world model / value | Actions, future states, and values modeled jointly |
| Planning ability | Often needs extra modules | Not necessarily supported | Usually model-free | Best-of-N planning directly after rollout fine-tuning |
| Representative results cited in the paper | Video Policy: RoboCasa 66.0, LIBERO Long 94.0 | UWM: RoboCasa 60.8 | OpenVLA-OFT: LIBERO 97.1; π0.5: ALOHA average 88.6 | LIBERO 98.5, RoboCasa 67.1, ALOHA 93.6 |

4. Detailed explanation of method

4.1 Method overview

The full pipeline can be understood as "input organization -> unified training -> two deployment modes":

  • Inputs: multi-view images at the current timestep, robot proprioception, and a text task description.
  • Representation side: images are encoded into latent frames by the VAE; the action chunk, current/future robot state, and value function are then normalized, copied, and written into designated latent frame positions.
  • Training side: different batch samples use different condition masks, corresponding to the three conditional generation tasks: policy, world model, and value function.
  • Deployment side: for direct policy use, generate $(a, s', V)$ in parallel and execute only the action; for planning, generate the action first, then the future state, then the value, and use best-of-N to select the action.
latent diffusion sequence
Figure 2. Core diagram. The authors do not add an action head to the network; they insert the new modalities into the latent diffusion sequence. This decision directly enables "learning the policy / world model / value on the same network at the same time".
detailed latent diffusion sequence
Appendix figure. A more detailed implementation diagram: blank placeholder images are first encoded into latent frames by the VAE and then overwritten by the action/state/value; it also explains why there is a blank frame at the start of the sequence. [Appendix A.1]

4.2 Method evolution

Stage A
Video models are used for robot policies, but the common approach is "video model + separate action module". The upside is easy reuse of existing video backbones; the cost is multi-stage training and tighter coupling between modules.
Stage B
Unified video-action models try to model actions and future observations jointly, but many of them are not initialized from a large-scale pre-trained video model, so they miss out on strong spatio-temporal priors.
Stage C
Cosmos Policy keeps the pre-trained video model architecture as-is, integrates the new modalities through latent injection, uses different masks so the same model learns policy, world model, and value, and finally uses rollout fine-tuning to supply the out-of-distribution experience that planning needs.

4.3 Core design and mathematical derivation

What Formula 1 does: it keeps the original video diffusion training objective; the difference is that the "clean sample" is no longer just video frames but the joint latent sequence with actions, states, and values inserted.

$$ \mathcal{L}(D_\theta, \sigma) = \mathbb{E}_{\mathbf{x}_0, \mathbf{c}, \mathbf{n}} \left[ \left\| D_\theta(\mathbf{x}_0 + \mathbf{n}; \sigma, \mathbf{c}) - \mathbf{x}_0 \right\|_2^2 \right] $$

  • $\mathbf{x}_0$: the clean latent sequence, containing both image latent frames and the injected action/state/value latent frames.
  • $\mathbf{c}$: the text condition, provided by T5-XXL embeddings in the paper.
  • $\mathbf{n}$: Gaussian noise, $\mathbf{n}\sim\mathcal{N}(0, \sigma^2 I)$.
  • $D_\theta$: the Cosmos-Predict2 diffusion transformer denoiser.
  • $\sigma$: the noise level; the authors later modify its sampling distribution. [Appendix A.2.1]

Intuition: the paper does not redefine the policy-learning loss; it recasts robot control as "recover the joint latent sequence", so the base video model's existing training recipe is reused to the maximum extent.
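To make Formula 1 concrete, here is a minimal sketch of one training step under this objective, assuming a toy denoiser in place of the Cosmos-Predict2 DiT and dropping the text condition $\mathbf{c}$ for brevity; `ToyDenoiser` and all shapes are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of the denoising objective in Formula 1 (illustrative only).
# ToyDenoiser and all shapes are placeholders, not the Cosmos-Predict2 architecture,
# and the text condition c is omitted for brevity.
import torch
import torch.nn as nn

class ToyDenoiser(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Linear(dim + 1, dim)  # denoiser conditioned on the noise level sigma

    def forward(self, x_noisy, sigma):
        sigma_feat = sigma.expand(x_noisy.shape[0], 1)
        return self.net(torch.cat([x_noisy, sigma_feat], dim=-1))

def denoising_loss(denoiser, x0, sigma):
    """L = E || D_theta(x0 + n; sigma) - x0 ||^2 with n ~ N(0, sigma^2 I)."""
    noise = torch.randn_like(x0) * sigma
    return ((denoiser(x0 + noise, sigma) - x0) ** 2).mean()

# x0 stands for a flattened joint latent sequence: image latent frames plus the
# injected action / state / value frames, all treated identically by the loss.
x0 = torch.randn(16, 256)        # batch of 16 toy latent sequences
sigma = torch.tensor([[3.0]])    # one noise level for the whole batch
print(denoising_loss(ToyDenoiser(256), x0, sigma).item())
```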

What Equation 2 does: it defines the value function under sparse reward and labels each rollout transition with its Monte Carlo return.

$$ V^\pi(s)= \mathbb{E}_{\tau\sim\pi}\left[\sum_{k=t}^{H}\gamma^{k-t}R(s_k, a_k)\mid s_t=s\right] = \mathbb{E}_{\tau\sim\pi}\left[\gamma^{H-t}R(s_H, a_H)\mid s_t=s\right] $$

  • $s_t, a_t$: the state and action at time $t$.
  • $H$: the episode horizon (final timestep).
  • $\gamma$: the discount factor, which propagates the terminal reward backward in time.
  • $R(s_H, a_H)$: the terminal reward; the paper uses sparse rewards for its tasks.

Why value functions are easy to learn badly here: if the training set consists only of demonstrations, nearly all of which succeed, the model gets no supervision in failure regions. The authors therefore explicitly introduce rollout data and, during planning fine-tuning, devote 90% of each batch to the world model and value function. [Text §4.3]
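As a concrete reading of Equation 2, here is a minimal sketch of how value labels for a sparse-reward rollout could be computed: only the terminal reward is nonzero, so each timestep's label is that reward discounted back by $\gamma^{H-t}$. The function name and the $\gamma$ value are illustrative, not the paper's code.

```python
def monte_carlo_values(episode_length, success, gamma=0.99):
    """Value label per timestep under a sparse terminal reward:
    V(s_t) = gamma^(H - t) * R(s_H, a_H), with R = 1 on success and 0 otherwise."""
    terminal_reward = 1.0 if success else 0.0
    H = episode_length - 1  # index of the final step
    return [terminal_reward * gamma ** (H - t) for t in range(episode_length)]

# A successful 5-step rollout: labels decay toward earlier timesteps.
print(monte_carlo_values(5, success=True))   # [0.9606, 0.9703, 0.9801, 0.99, 1.0] (rounded)
# Failed rollouts supply the zero-value labels that demonstration-only data lacks.
print(monte_carlo_values(5, success=False))  # [0.0, 0.0, 0.0, 0.0, 0.0]
```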

Although no combined loss is written as a formula, the text effectively defines three conditional distributions, corresponding to three operating modes of the same network.

  • $p(a, s', V(s')\mid s)$: policy training: given the current state, jointly generate the action, future state, and value.
  • $p(s', V(s')\mid s, a)$: world model training: given the current state and action, predict the future state and value.
  • $p(V(s')\mid s, a, s')$: value function training: given the full prefix, predict the future value.

Key point: the authors stress the importance of the auxiliary targets. The policy does not just learn $p(a\mid s)$; it also generates the future state and value. The world model does not just learn $p(s'\mid s, a)$. This forces the intermediate representation to retain information about "what happens after the action" and "whether it is worth it".
Why the joint objective might work

If you train only $p(a\mid s)$, the model just has to output an action chunk that completes the task; the joint objective also requires it to generate future states and values, which uses "consequences of behavior" as auxiliary supervision and forces the hidden representation to be more sensitive to environment dynamics. Both the LIBERO and RoboCasa ablations in the paper support this.
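The mode switching described above can be pictured as per-frame condition masks over one joint latent sequence: frames marked as conditioning are kept clean, and the remaining frames are noised and reconstructed. The frame layout and mask dictionary below are assumptions for illustration, not the paper's exact batching code.

```python
# Illustrative condition masks over a joint latent sequence
# [current obs s, action chunk a, future obs s', value V(s')].
# True = supplied clean as conditioning; False = noised and reconstructed.
FRAMES = ["s", "a", "s_next", "value"]

CONDITION_MASKS = {
    "policy":      {"s": True, "a": False, "s_next": False, "value": False},  # p(a, s', V | s)
    "world_model": {"s": True, "a": True,  "s_next": False, "value": False},  # p(s', V | s, a)
    "value_fn":    {"s": True, "a": True,  "s_next": True,  "value": False},  # p(V | s, a, s')
}

def frames_to_denoise(mode):
    """Frames the denoiser must recover, i.e. everything not clamped as conditioning."""
    mask = CONDITION_MASKS[mode]
    return [f for f in FRAMES if not mask[f]]

for mode in CONDITION_MASKS:
    print(mode, "->", frames_to_denoise(mode))
# policy      -> ['a', 's_next', 'value']
# world_model -> ['s_next', 'value']
# value_fn    -> ['value']
```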

Latent frame injection is not written as a formal formula, but the implementation logic is clear: normalize, then copy, then reshape into a single latent frame volume.

  • Input: if the action chunk has size $K\times d_{act}$, first flatten it into a vector of length $K d_{act}$.
  • Normalization: non-image modalities are uniformly rescaled to $[-1, +1]$.
  • Copy: repeat the vector $\frac{H'W'C'}{K d_{act}}$ times until its length matches the volume of one latent frame. [Appendix A.1]
  • Decode: after generation, the copied blocks are averaged and then denormalized to recover the action/state/value predictions; non-image modalities need no VAE decoding.

Implementation intuition: the authors did not build a dedicated tokenizer for low-dimensional vectors; they simply spread them over the whole latent frame volume. The representation is redundant, but the payoff is that the original model architecture is not touched at all.
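A minimal sketch of the injection recipe above, assuming an action chunk whose flattened length divides the latent frame volume exactly; the shapes, limits, and helper names are illustrative rather than the paper's implementation. Decoding simply averages the tiled copies and undoes the normalization, with no VAE involved for non-image modalities.

```python
import numpy as np

def inject(chunk, low, high, frame_volume):
    """Normalize to [-1, 1], flatten, and tile the chunk until one latent frame is filled."""
    flat = chunk.reshape(-1)                              # K * d_act values
    normed = 2.0 * (flat - low) / (high - low) - 1.0
    repeats = frame_volume // flat.size                   # assumes exact divisibility
    return np.tile(normed, repeats)                       # shape: (frame_volume,)

def decode(latent_frame, low, high, chunk_shape):
    """Average the tiled copies, then denormalize back to the original units."""
    copies = latent_frame.reshape(-1, int(np.prod(chunk_shape)))
    normed = copies.mean(axis=0)
    return ((normed + 1.0) / 2.0 * (high - low) + low).reshape(chunk_shape)

# Toy example: a 16-step chunk of 7-dim actions packed into a 3584-element latent frame.
chunk = np.random.uniform(-0.5, 0.5, size=(16, 7))
low, high = -1.0, 1.0                                     # illustrative action limits
frame = inject(chunk, low, high, frame_volume=16 * 7 * 32)
print(np.allclose(decode(frame, low, high, (16, 7)), chunk))  # True
```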
balanced batches
Figure 12. Training distribution with a 50/25/25 batch split. It shows that the policy, world model, and value function are not three networks but three training instances of the same latent diffusion sequence under different condition masks. [Appendix A.4.1]

4.4 Implementation points

Blank first latent frame: a blank placeholder frame is placed at the start of the sequence. It carries no visual meaning; it exists to fit the video tokenizer's mechanism of "the first frame is not temporally compressed, the rest are compressed 4 frames at a time", so that the structures of current and future observations line up better. [Appendix A.1]
Action chunks are not executed in a rolling, step-by-step fashion: LIBERO uses chunks of 16 and executes all 16 steps; RoboCasa predicts 32 steps but replans after 16; ALOHA predicts and executes a full 50 steps. This design is tied directly to real inference latency. [Appendix A.2.2-A.2.4]
The noise distribution is specifically modified: training no longer samples only from the original log-normal; 70% is sampled from the original distribution and 30% from a uniform distribution over $[1, 85]$ to strengthen the training signal at high noise levels. At inference, $\sigma_{\min}$ is raised from 0.002 to 4. The authors' reasoning is that robot action generation is more sensitive to the high-noise early stage (see the sampling sketch after Figure 9). [Appendix A.2.1]
noise schedules
Figure 9. The authors change the base model's noise distribution to a noisier mixed distribution to improve test-time generation accuracy. [Appendix A.2.1]
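A minimal sampling sketch of the mixed noise-level distribution described above (70% from the base model's log-normal, 30% uniform over $[1, 85]$); the log-normal parameters here are placeholders, since the report does not restate the base model's exact values.

```python
import numpy as np

def sample_sigma(n, p_uniform=0.3, uniform_range=(1.0, 85.0),
                 lognormal_mean=0.0, lognormal_std=1.0, rng=None):
    """Mix the base model's log-normal sigma distribution with a uniform component
    over high noise levels; the log-normal parameters are placeholders."""
    rng = rng or np.random.default_rng()
    use_uniform = rng.random(n) < p_uniform
    lognormal = np.exp(rng.normal(lognormal_mean, lognormal_std, size=n))
    uniform = rng.uniform(*uniform_range, size=n)
    return np.where(use_uniform, uniform, lognormal)

sigmas = sample_sigma(100_000, rng=np.random.default_rng(0))
print((sigmas > 1.0).mean())  # noticeably more mass at high noise than the log-normal alone
```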
Inference mode switch: the direct policy uses parallel decoding, which is fast; planning uses autoregressive decoding, which is higher quality. For ALOHA planning there are 10 denoising steps for actions, 5 for the future state, and 5 for the value; each candidate action gets 3 future-state predictions and 5 value predictions per future, for 15 value estimates in total. [Appendix A.3.1]
Dual-model deployment: the policy model uses the base checkpoint, and the planning model uses the rollout-fine-tuned checkpoint. The authors' stated reason: the planning model's on-policy experience was generated by the original policy model, so this separation of duties is more stable. [Text §4.3]
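Putting the deployment pieces together, here is a minimal best-of-N sketch: the base policy checkpoint proposes candidate action chunks, and the rollout-fine-tuned planning checkpoint imagines futures and scores them with a small value ensemble. The function handles `sample_action_chunk`, `predict_future_state`, and `predict_value` are assumed interfaces for illustration, not the released API.

```python
import random

def best_of_n_plan(obs, sample_action_chunk, predict_future_state, predict_value,
                   n_candidates=8, n_futures=3, n_values_per_future=5):
    """Score each candidate action chunk by the mean of a value ensemble over
    several imagined futures (3 x 5 = 15 estimates here) and return the best one."""
    best_action, best_score = None, float("-inf")
    for _ in range(n_candidates):
        action = sample_action_chunk(obs)                  # base policy checkpoint
        values = []
        for _ in range(n_futures):
            future = predict_future_state(obs, action)     # planning checkpoint: world model
            values += [predict_value(obs, action, future)  # planning checkpoint: value
                       for _ in range(n_values_per_future)]
        score = sum(values) / len(values)
        if score > best_score:
            best_action, best_score = action, score
    return best_action

# Toy usage with random stand-ins for the two checkpoints:
chosen = best_of_n_plan(obs=None,
                        sample_action_chunk=lambda s: [random.random() for _ in range(7)],
                        predict_future_state=lambda s, a: sum(a),
                        predict_value=lambda s, a, f: f + random.gauss(0, 0.01))
print(len(chosen))  # one 7-dim toy "action chunk"
```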

5. Experiment

5.1 Experimental setup

Datasets and tasks

| Platform | Tasks | Training data | Observation and action | Additional details |
|---|---|---|---|---|
| LIBERO | Spatial / Object / Goal / Long, four task suites, single-arm Franka Panda | 500 demonstrations per suite (10 tasks × 50) | Action chunk 16, full chunk executed at test time | Failed demonstrations are filtered for policy training; the world model / value use unfiltered data |
| RoboCasa | 24 kitchen tasks, single-arm Franka Panda | Only 50 human teleoperation demonstrations per task | Action chunk 32, replan after executing 16 steps | Evaluation uses 5 scenes, 50 trials × 3 seeds per task, 3600 trials total; only unseen objects are tested, and some styles never appear in training |
| ALOHA | Four real dual-arm tasks: put X on plate / fold shirt / put candies in bowl / put candy in ziploc bag | 185 demonstrations in total | 14-dim joint angles + 3 cameras + text; chunk 50 (2 seconds), replan after full execution | 101 fixed initial-condition evaluations, covering IID and OOD |

Baseline methods

The paper compares diffusion policies trained from scratch (Diffusion Policy, Dita), the video-model route (UVA, UWM, Video Policy), and the VLA route (π0, π0.5, OpenVLA-OFT, CogVLA, UniVLA, DP-VLA, GR00T-N1.5). On real ALOHA, the comparison focuses on Diffusion Policy, OpenVLA-OFT+, π0, and π0.5.

Evaluation metrics

  • LIBERO / RoboCasa: success rate.
  • ALOHA: staged completion scores on a 0-100 scale rather than binary success. The appendix gives the scoring details; for example, "put candy in ziploc bag" is split into five 20-point stages: grab the zipper slider, grab the bag corner, open the bag mouth, grab the candy, and place it in the bag. [Appendix A.3.2]
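As a concrete reading of the rubric, here is a minimal sketch of how the staged score for "put candy in ziploc bag" could be tallied; the stage names follow the report, while the set of completed stages in the example is made up.

```python
# Five 20-point stages for "put candy in ziploc bag" (stage names from the report;
# the completed set below is a made-up example).
STAGES = ["grab zipper slider", "grab bag corner", "open bag mouth",
          "grab candy", "place candy in bag"]

def staged_score(completed):
    """Each completed stage is worth 100 / len(STAGES) = 20 points."""
    return sum(100 / len(STAGES) for stage in STAGES if stage in completed)

print(staged_score({"grab zipper slider", "grab bag corner", "open bag mouth"}))  # 60.0
```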

Training configuration and hardware

| Platform | Training steps | GPUs | Global batch | Wall-clock time | Main losses after training |
|---|---|---|---|---|---|
| LIBERO | 40K | 64 × H100 | 1920 | 48 hours | action 0.012; future proprio 0.007; future wrist latent 0.068; future third-person latent 0.036; value 0.007 |
| RoboCasa | 45K | 32 × H100 | 800 | 48 hours | action 0.016; future proprio 0.007; future wrist latent 0.084; future third-person latent 0.048; value 0.007 |
| ALOHA | 50K | 8 × H100 | 200 | 48 hours | action 0.010; future proprio 0.008; future wrist latent 0.097; future third-person latent 0.085; value 0.007 |

All non-image modalities are normalized to $[-1, +1]$. All base models are fully fine-tuned (every parameter is updated). [Appendix A.2.2-A.2.4]

Code repository and reproducibility

The paper states that model checkpoints, training data, and code will be released; the public entry point is the project page research.nvidia.com/labs/dir/cosmos-policy/. The Reproducibility Statement also says training and evaluation scripts will be provided with the release.

5.2 Main results

LIBERO main results

| Method | Spatial | Object | Goal | Long | Average |
|---|---|---|---|---|---|
| Diffusion Policy | 78.3 | 92.5 | 68.3 | 50.5 | 72.4 |
| Dita | 97.4 | 94.8 | 93.2 | 83.6 | 92.3 |
| π0 | 96.8 | 98.8 | 95.8 | 85.2 | 94.2 |
| UniVLA | 96.5 | 96.8 | 95.6 | 92.0 | 95.2 |
| π0.5 | 98.8 | 98.2 | 98.0 | 92.4 | 96.9 |
| OpenVLA-OFT | 97.6 | 98.4 | 97.9 | 94.5 | 97.1 |
| CogVLA | 98.6 | 98.8 | 96.6 | 95.4 | 97.4 |
| Cosmos Policy | 98.1 | 100.0 | 98.2 | 97.6 | 98.5 |

Interpretation: Cosmos Policy is not first on every column; π0.5 is higher on Spatial (98.8 vs 98.1), but Cosmos Policy takes the best values on Object, Goal, and Long, so its final average is the highest. In particular, Long rises from OpenVLA-OFT's 94.5 to 97.6, suggesting that the temporal-modeling advantage the authors emphasize shows up mainly on long-horizon tasks.

RoboCasa main results

| Method | Training demonstrations per task | Average SR |
|---|---|---|
| GR00T-N1 | 300 | 49.6 |
| UVA | 50 | 50.0 |
| DP-VLA | 3000 | 57.3 |
| GR00T-N1 + DreamGen | 300 (+10000 synthetic) | 57.6 |
| GR00T-N1 + DUST | 300 | 58.5 |
| UWM | 1000 | 60.8 |
| π0 | 300 | 62.5 |
| GR00T-N1.5 | 300 | 64.1 |
| Video Policy | 300 | 66.0 |
| FLARE | 300 | 66.4 |
| GR00T-N1.5 + HAMLET | 300 | 66.4 |
| Cosmos Policy | 50 | 67.1 |

Interpretation: the most important information in this RoboCasa table is not that 67.1 is only 0.7 above 66.4, but that it is achieved with only 50 human demonstrations per task. The paper uses this table to emphasize data efficiency, not just absolute accuracy.

Real ALOHA main results

| Method | put X on plate | fold shirt | put candies in bowl | put candy in ziploc bag | Average |
|---|---|---|---|---|---|
| Diffusion Policy | 63.3 | 23.5 | 32.8 | 14.6 | 33.6 |
| OpenVLA-OFT+ | 68.3 | 99.5 | 21.6 | 58.5 | 62.0 |
| π0 | 85.0 | 98.5 | 71.2 | 56.9 | 77.9 |
| π0.5 | 98.3 | 99.5 | 95.2 | 61.5 | 88.6 |
| Cosmos Policy | 100.0 | 99.5 | 89.6 | 85.4 | 93.6 |

Interpretation: Cosmos Policy's average-score advantage comes mainly from the last and hardest task, the ziploc bag: 85.4 versus π0.5's 61.5, a large margin. π0.5 is higher on candies in bowl (95.2 vs 89.6), but the ziploc bag demands finer control, and the paper uses it as the representative case for "the video model prior is better at high-precision control".

ALOHA task performance
Figure 4. Bar chart over the four ALOHA tasks; it shows that Cosmos Policy is not always far ahead on the first three tasks, but its lead is clearest in the average score and on the hardest task.
ALOHA rollouts
Figure 3. Qualitative results on the real dual-arm tasks, showing that the four tasks chosen by the authors all involve long-horizon, multi-modal grasping or high-precision contact operations.
competitor failure modes
Figure 5. Two failure modes summarized by the authors: π0.5 is unstable at the high-precision grasp on the ziploc bag; OpenVLA-OFT+ reaches between two candies on the candies task, reflecting inaccurate modeling of the action distribution.

IID / OOD breakdown for ALOHA

| Method | IID average | OOD average | Observation |
|---|---|---|---|
| OpenVLA-OFT+ | 67.1 | 51.7 | Near-perfect on the shirt, but OOD scores for candies and put-on-plate drop sharply |
| π0 | 81.3 | 71.7 | Stable overall, but no clear task-level advantage |
| π0.5 | 87.8 | 92.5 | Highest OOD average |
| Cosmos Policy | 96.3 | 89.3 | Highest overall average, though the paper is candid that π0.5 is slightly higher on the OOD average alone |

This matters: the 93.6 in the main results table does not mean Cosmos Policy is strongest in every generalization sub-scenario. The appendix makes clear that when broken down to the OOD average, π0.5 is slightly higher. [Appendix A.3.2]

IID positions
Appendix figure. ALOHA's IID initial-state configurations. [Appendix A.3.2]
OOD positions
Appendix figure. ALOHA's OOD initial-state configurations. The authors add unseen T-shirts, unseen bowls, unseen ziploc bags with higher fill levels, and so on, to the OOD set. [Appendix A.3.2]

5.3 Ablation experiments

LIBERO ablation

| Version | Spatial | Object | Goal | Long | Average |
|---|---|---|---|---|---|
| Full Cosmos Policy | 98.1 | 100.0 | 98.2 | 97.6 | 98.5 |
| Remove auxiliary losses | 97.6 | 99.8 | 96.7 | 94.0 | 97.0 |
| No video pre-training (trained from scratch) | 94.7 | 98.9 | 96.3 | 88.6 | 94.6 |

Interpretation: auxiliary supervision contributes 1.5 points; the video prior contributes 3.9 points. The biggest drops are again in the Long column, showing once more that the temporal prior mainly helps long-horizon tasks.

RoboCasa ablation and additional experiments

| Version | Average SR |
|---|---|
| Full Cosmos Policy (5 denoising steps) | 67.1 |
| Remove value function training samples | 66.6 |
| Remove world model + value function training samples | 64.0 |
| Also remove auxiliary value supervision during policy training | 62.5 |
| Also remove auxiliary future-state + value supervision during policy training (pure action policy) | 44.4 |
| Use only 1 denoising step | 66.4 |

Two conclusions follow from this table. First, the pure action policy drops from 67.1 to 44.4, showing that "having the policy also predict the future state" is not icing on the cake but a key component. Second, a single denoising step still reaches 66.4, almost unchanged, showing that the number of diffusion inference steps has clear room for compression on this benchmark. [Appendix A.4.1]

5.4 Supplementary experiments (from appendix)

Planning experiments

The authors run planning evaluations only on the two harder ALOHA tasks: put candies in bowl and put candy in ziploc bag. The rollout data pool has two parts: 505 rollouts accumulated during the earlier direct-policy evaluations, plus 143 extra rollouts collected for the ziploc bag, for 648 rollouts in total. The authors argue the extra ziploc data is critical because this task has severe self-occlusion, high environmental randomness, and millimeter-level motion differences that change the outcome.

planning rollouts
Figure 6. The base policy's world model can only imagine "a successful future"; after rollout fine-tuning, the planning model can more readily imagine the real consequences of failure and therefore avoid bad actions.
planning results
Figure 7. Comparison of no planning, model-based planning with $V(s')$, and model-free planning with $Q(s, a)$. The authors report that the model-based version improves the average score on the two hard tasks by 12.5 points.

The authors compare two forms of value training:

  • $V(s')$: the world model first predicts the future state, which is then scored. This is the model-based version.
  • $Q(s, a)$: the current state and action are scored directly, with no explicit world model. This is the model-free version.

The result is that $V(s')$ works better. The authors' explanation: when rollout data is very limited, explicitly exploiting future states and environment dynamics is more sample-efficient than learning a high-dimensional Q function directly, and less prone to overfitting.
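The two variants differ only in what gets scored, which a short sketch makes explicit; `world_model`, `value_of_state`, and `q_of_state_action` are assumed handles standing in for the corresponding conditional generations, not the paper's code.

```python
def score_model_based(obs, action, world_model, value_of_state):
    """V(s'): first imagine the future state, then score that imagined future."""
    future = world_model(obs, action)
    return value_of_state(future)

def score_model_free(obs, action, q_of_state_action):
    """Q(s, a): score the current state and action directly, with no imagined future."""
    return q_of_state_action(obs, action)
```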

Inference latency

| Setting | Latency | Notes |
|---|---|---|
| Direct policy, 5 denoising steps | 0.61 s / action chunk | LIBERO / RoboCasa |
| Direct policy, 10 denoising steps | 0.95 s / action chunk | ALOHA |
| Direct policy, 1 denoising step | 0.16 s / action chunk | RoboCasa still reaches 66.4% SR |
| ALOHA model-based planning | 4.9 s | 8 H100s in parallel, N=8; 3 future-state predictions and a 5-way value ensemble per candidate action |

This is also the paper's most practical cost: planning helps but is slow. The authors list it as the primary limitation in the Discussion.

6. Analysis and Discussion

6.1 Analysis and explanation of the results given in the paper

  • The authors argue that the video model prior is an efficient initialization, which is why it can outperform some strong VLAs on real robots even without extra large-scale action labels. The text attributes this directly to the video model having learned spatio-temporal dynamics and implicit physics, not just static semantics.
  • They use the failure cases in Figure 5 to explain why highly multi-modal, high-precision tasks are the watershed: a small deviation pulls the action between the two targets, or causes a millimeter-level miss during grasping.
  • They use Figure 6 to show that the rollout-fine-tuned planning model is better at predicting the consequences of failure, so the planning stage can avoid mistakes the base policy tends to make, such as missing the ziploc bag slider.
  • The value of the auxiliary supervision and the joint objective is verified explicitly in the ablations: not every component matters equally, but "having the policy also predict future states" is critical.

6.2 Limitations stated by the authors

Limitation 1: planning inference is slow. The paper states that model-based planning takes about 5 seconds per action chunk, which is unfriendly to dynamic tasks.

Limitation 2: planning depends on rollout data. A sufficiently accurate world model / value function cannot be learned from demonstrations alone; with few rollouts, planning quality is clearly limited.

Limitation 3: the search tree has only one level. The paper performs only best-of-N single-step search and does not extend the world model's prediction horizon into deeper planning.

6.3 Applicability boundaries explicitly discussed in the paper

  • The "state" in this article is actually closer to observation, because the real environment is not completely observable; the author specifically states this in Preliminaries for the sake of simplicity. In other words, the world model predicts future observations, not the complete state in the strict sense.
  • The benefits of planning are mainly shown in difficult tasks where the basic direct policy still has obvious room for error; for the first two easier tasks of LIBERO or ALOHA, the paper does not emphasize further planning gains.
  • Both RoboCasa and ALOHA embody methods that can do multi-perspective input, but do not build in the input history; the state only takes the current moment, and only predicts $t+K$ in the future, so the model is not a long horizon autoregressive world model.
  • The future work given at the end of the paper also strictly focuses on the original method: accelerating search, reducing rollout requirements, and extending deeper planning depth.