Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation
1. Quick overview of the paper
| What should the paper solve? | Robot demonstration data is scarce and expensive to collect, and it mixes multi-modal signals such as images, states, actions, and language; the authors aim to use larger-scale non-robot video data to improve multi-task language-conditioned visual manipulation. |
|---|---|
| The author's approach | Treat the robot trajectory as a visual sequence with actions attached: first learn "what happens next" via language-conditioned video prediction on Ego4D, then transfer this ability to robot policy learning. |
| Most important results | On CALVIN ABCD$\rightarrow$D, the 1-task success rate rises from the best baseline's 88.9% to 94.9%, and the average number of consecutively completed tasks from 3.06 to 4.21; on ABC$\rightarrow$D, the unseen-scene 1-task success rate rises from 53.3% to 85.4%. |
| Things to note when reading | The core is not simply swapping in a Transformer, but the combination of "video-prediction token + behavioral-cloning token + large-scale video pre-training"; the appendix ablation shows that video prediction without pre-training already helps, but the complete GR-1 is strongest. |
Difficulty rating: ★★★★☆. Requires familiarity with behavioral cloning, Transformer sequence modeling, visual encoders, long-term robotic evaluation, and CALVIN's continuous task protocol.
Keywords: visual robot manipulation, language-conditioned policy, video generative pre-training, GPT-style transformer, CALVIN, Ego4D.
Core contribution list
- Demonstration that video generative pre-training helps robot manipulation: the authors use human-object interaction videos from Ego4D as the pre-training source and verify the transfer effect on CALVIN and a real robot.
- The unified model GR-1: a GPT-style causal Transformer that jointly receives language, historical images, and robot states, and outputs actions and future images.
- Systematic evaluation of generalization and data efficiency: multi-task learning, unseen scenes, 10% data, unseen language, and unseen instances and categories on a real robot.
2. Motivation
2.1 What problem should be solved?
Generative pre-training in NLP and CV has shown that large-scale sequence data can provide transferable representations. The obstacles in robotics are high demonstration collection costs and sparse data; at the same time, robot data is naturally multi-modal, spanning images, language, states, and actions. The key judgment of the paper is that a robot trajectory itself contains a video sequence, so "predicting future frames from language and past frames" and "selecting actions from language and historical observations" share structural similarity.
A concrete scenario is long-horizon CALVIN manipulation: the robot must not only understand language such as "slide the red block to the left", but also visually locate objects, anticipate how the environment will change after the action, and complete up to 5 tasks in a row. Relying only on a small amount of language-annotated robot data makes it hard for the model to acquire a sufficiently robust visual-language-temporal structure.
2.2 Limitations of existing methods
The paper divides prior work into several lines: language-conditioned manipulation methods can use LLMs or CLIP for task understanding, but some predict sparse keypoints and rely on motion planners, which is less flexible than end-to-end continuous actions; hierarchical methods condition the policy on latent plans, but are not the simple, unified GPT-style trajectory modeling used here. Transformer decision-making models such as Decision Transformer, GATO, and RoboCat show that sequence models suit decision problems, but RoboCat does no video pre-training and is goal-image-conditioned rather than language-conditioned.
On the pre-training side, R3M, MVP, and similar work focus on visual representations, while VPT/VIPER use videos from within the task environment; GR-1 differs in using large-scale, non-robot, out-of-domain Ego4D videos for language-conditioned future-frame prediction, and in keeping both output heads, future image and action, inside the same model.
2.3 The solution ideas of this article
The high-level idea is to design the pre-training task as a substructure of the robot fine-tuning task: the pre-training phase takes language and past frames as input and outputs future frames; the robot phase additionally adds state inputs and an action output while continuing to predict future frames. This way, the visual-language-temporal relationships learned from video pre-training feed directly into robot policy learning, rather than serving only as independent frozen representations.
4. Detailed explanation of method
4.1 Formalization of the problem
Video pre-training task: given a language description and a historical image sequence, predict future images.

$$\pi(l, \mathbf{o}_{t-h:t}) \rightarrow \mathbf{o}_{t+\Delta t}$$

| Symbol | Meaning |
|---|---|
| $l$ | Natural language description of the video. |
| $\mathbf{o}_{t-h:t}$ | Sequence of video frames from $t-h$ to the current step $t$. |
| $\mathbf{o}_{t+\Delta t}$ | Target frame $\Delta t$ steps into the future. |
Robot fine-tuning task: in addition to video prediction, simultaneously predict the current action.

$$\pi(l, \mathbf{o}_{t-h:t}, \mathbf{s}_{t-h:t}) \rightarrow \mathbf{o}_{t+\Delta t}, \mathbf{a}_{t}$$

| Symbol | Meaning |
|---|---|
| $\mathbf{s}_{t-h:t}$ | Robot state sequence: 6D end-effector pose and binary gripper state. |
| $\mathbf{a}_{t}$ | Current action: continuous arm action plus binary gripper action. |
| $D=\{\tau_i\}_{i=1}^{N}$ | Expert trajectory dataset covering $M$ tasks; each trajectory contains language, images, states, and actions. |
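To pin down the two signatures, here is a shape-level sketch; the tensor shapes and function names are illustrative assumptions, not the authors' API.

```python
import torch

def video_pretrain_step(lang, frames):
    """pi(l, o_{t-h:t}) -> o_{t+dt}: predict a future frame from language + past frames.
    lang: (B, D) CLIP text embedding; frames: (B, h+1, 3, 224, 224) past frames."""
    B = frames.shape[0]
    return torch.zeros(B, 3, 224, 224)  # stands in for the model's predicted frame

def robot_finetune_step(lang, frames, states):
    """pi(l, o_{t-h:t}, s_{t-h:t}) -> (o_{t+dt}, a_t).
    states: (B, h+1, 7) = 6D end-effector pose + binary gripper state."""
    B = frames.shape[0]
    future = torch.zeros(B, 3, 224, 224)                 # predicted future frame
    arm, gripper = torch.zeros(B, 6), torch.zeros(B, 1)  # continuous arm + binary gripper
    return future, (arm, gripper)
```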
4.2 Model input and token organization
GR-1 uses a multi-modal encoder to map language, images, and robot states into the same Transformer dimension. Language is encoded by the CLIP text encoder; visual observations by an MAE-pre-trained ViT, whose CLS token serves as a global representation while the patch tokens are compressed by a Perceiver resampler; robot states are encoded by a linear layer (see Appendix: Network and Training Details).
The model learns two special tokens: [OBS] for future-image prediction and [ACT] for action prediction. In the pre-training stage, language, image, and [OBS] tokens alternate in the sequence; the robot fine-tuning stage adds state and [ACT] tokens. The language token is repeated at every timestep so that the instruction is not overwhelmed by the denser visual and state tokens. A learned relative time embedding is also added at each timestep, shared by all modalities at that step.
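A minimal sketch of how one timestep's tokens might be assembled, assuming hypothetical tensor names and a single shared time embedding per step as described above.

```python
import torch

def build_timestep_tokens(lang_tok, img_toks, state_tok, obs_query, act_query, time_emb):
    # lang_tok:  (B, 1, D)  repeated CLIP text token for this timestep
    # img_toks:  (B, K, D)  CLS token + Perceiver-resampled patch tokens
    # state_tok: (B, 1, D)  linear projection of the robot state
    # obs_query: (B, n, D)  learned [OBS] tokens (future-frame queries)
    # act_query: (B, 1, D)  learned [ACT] token (action query)
    # time_emb:  (D,)       relative time embedding shared by all tokens at this step
    toks = torch.cat([lang_tok, img_toks, state_tok, obs_query, act_query], dim=1)
    return toks + time_emb  # broadcast: every modality gets the same time embedding
```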
4.3 Causal Transformer and mask
GR-1 uses a GPT-style causal Transformer, but with a special mask for the prediction tokens. During pre-training, each token can attend to tokens at earlier positions, except earlier [OBS] prediction tokens; during fine-tuning, earlier [ACT] tokens are masked out as well. This prevents the model from leaking target information through the prediction tokens and keeps prediction autoregressively conditioned on history.
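The mask logic can be stated compactly. Below is a sketch, under the assumption that prediction tokens are marked by a boolean vector; it implements a standard causal mask minus attention into earlier prediction tokens.

```python
import torch

def gr1_attention_mask(is_query: torch.Tensor) -> torch.Tensor:
    """is_query: (L,) bool, True at [OBS]/[ACT] prediction-token positions.
    Returns (L, L) bool mask where mask[i, j] == True means i may attend to j."""
    L = is_query.shape[0]
    causal = torch.tril(torch.ones(L, L, dtype=torch.bool))      # usual causal mask
    # Block attention into any prediction token other than the token itself,
    # so target information cannot leak through [OBS]/[ACT] positions.
    blocked = is_query.unsqueeze(0) & ~torch.eye(L, dtype=torch.bool)
    return causal & ~blocked
```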
4.4 Output header and loss
The future image is reconstructed patch by patch by a transformer decoder from the [OBS] outputs together with mask tokens; the training target is pixel-space MSE with MAE-style patch-wise normalization. The action is decoded from the [ACT] output through a three-layer MLP that splits into arm and gripper heads: the arm uses Smooth-L1 loss and the gripper uses BCE loss. The total fine-tuning loss is

$$\mathcal{L} = \mathcal{L}_{\text{arm}} + \mathcal{L}_{\text{gripper}} + \mathcal{L}_{\text{video}}.$$

This shows that video prediction is not an auxiliary task discarded after pre-training; it remains a live training signal during robot fine-tuning.
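A sketch of the combined objective under the stated loss choices; treating the three terms as an unweighted sum is our reading of the formula above.

```python
import torch
import torch.nn.functional as F

def finetune_loss(arm_pred, arm_gt, grip_logit, grip_gt, patch_pred, patch_gt):
    l_arm = F.smooth_l1_loss(arm_pred, arm_gt)                        # continuous arm action
    l_grip = F.binary_cross_entropy_with_logits(grip_logit, grip_gt)  # binary gripper
    # MAE-style target: normalize each ground-truth patch before the pixel MSE.
    mu = patch_gt.mean(dim=-1, keepdim=True)
    sd = patch_gt.var(dim=-1, keepdim=True).add(1e-6).sqrt()
    l_video = F.mse_loss(patch_pred, (patch_gt - mu) / sd)
    return l_arm + l_grip + l_video
```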
4.5 Training details
The pre-training data comes from Ego4D, which contains more than 3500 hours of human-object interaction videos. The authors cut 3-second clips from the videos, yielding 800,000 clips and 8M frames in total. Pre-training randomly samples sequences from these clips and predicts future frames; fine-tuning randomly samples robot trajectory segments while jointly optimizing the behavioral-cloning and video-prediction losses.
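As a sanity check of the clip statistics, a tiny helper below computes frame indices spaced 1/3 s apart; 30 fps source video is our assumption, and 10 frames per 3-second clip is consistent with 800k clips totaling 8M frames.

```python
def clip_to_frame_indices(fps=30, seq_len=10):
    """Indices of seq_len frames spaced 1/3 s apart within a 3 s clip
    (assumes fps is divisible by 3; frame rate is a hypothetical value)."""
    step = fps // 3                      # number of raw frames per 1/3 second
    return [i * step for i in range(seq_len)]
```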
| Item | Pre-training | Robot fine-tuning |
|---|---|---|
| Data | Ego4D; 800k clips; 8M frames | CALVIN or real-robot data |
| Prediction step | $\Delta t=1$ over frames equally spaced 1/3 s apart | $\Delta t=3$; predicts static-camera and gripper-camera images |
| Sequence length | 10 | 10 |
| Frozen modules | CLIP text encoder; MAE image encoder | same |
| Transformer | 12 layers, 12 heads, hidden size 384; 195M total parameters, 46M trainable (see Appendix: Network and Training Details) | same |
| hyperparameters | pre-training | fine-tuning |
|---|---|---|
| batch size | 1024 | 512 |
| learning rate | 3.6e-4 | 1e-3 |
| dropout | 0.1 | 0.1 |
| optimizer | AdamW | AdamW |
| schedule | cosine decay | cosine decay |
| warmup epochs | 5 | 1 |
| training epochs | 50 | 20 |
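For concreteness, here is a sketch of the optimizer and schedule implied by the table (AdamW, linear warmup, cosine decay); the per-step granularity and the `steps_per_epoch` value are illustrative assumptions.

```python
import math
import torch

def make_optimizer(model, lr=3.6e-4, warmup_epochs=5, total_epochs=50,
                   steps_per_epoch=1000):
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    warm = warmup_epochs * steps_per_epoch
    total = total_epochs * steps_per_epoch

    def schedule(step):
        if step < warm:                                 # linear warmup
            return step / max(1, warm)
        t = (step - warm) / max(1, total - warm)
        return 0.5 * (1 + math.cos(math.pi * t))        # cosine decay to 0

    return opt, torch.optim.lr_scheduler.LambdaLR(opt, schedule)
```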
5. Experiment
5.1 Experimental questions and settings
The experiments are designed around three questions: whether GR-1 improves visual robot manipulation; whether it works on a real robot; and whether it handles challenges such as little data, unseen scenes, unseen objects, and unseen language. The main benchmarks are CALVIN and real-robot object transportation / articulated-object manipulation.
CALVIN contains 34 tasks with open-form language instructions. The training set has more than 20k expert trajectories, but to simulate realistic conditions the authors train GR-1, RT-1, and MT-R3M only on the 1% of data carrying crowd-sourced language instruction labels; MCIL and HULC use the full CALVIN data. Evaluation uses 1000 unique instruction chains, each of which can run up to 5 tasks in a row; a task not completed within 360 timesteps counts as a failure (see Appendix: CALVIN Benchmark Experiments).
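The chain protocol is easy to mis-implement, so here is a sketch of the evaluation loop; the `env` and `policy` interfaces are hypothetical stand-ins, not the CALVIN API.

```python
def evaluate_chains(env, policy, chains, max_steps=360):
    """chains: list of 1000 instruction chains, 5 language instructions each.
    Returns the Avg. Len. metric: mean number of tasks completed per chain."""
    completed = []
    for chain in chains:
        obs, n_done = env.reset(), 0
        for instruction in chain:
            for _ in range(max_steps):
                obs, success = env.step(policy.act(obs, instruction))
                if success:
                    n_done += 1
                    break
            else:
                break                    # timeout: the rest of the chain fails
        completed.append(n_done)
    return sum(completed) / len(completed)
```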
5.2 CALVIN main result
| settings | method | 1 task | 2 tasks | 3 tasks | 4 tasks | 5 tasks | Avg. Len. |
|---|---|---|---|---|---|---|---|
| ABCD$\rightarrow$D | HULC best baseline | 0.889 | 0.733 | 0.587 | 0.475 | 0.383 | 3.06 |
| ABCD$\rightarrow$D | GR-1 | 0.949 | 0.896 | 0.844 | 0.789 | 0.731 | 4.21 |
| ABC$\rightarrow$D | RT-1/MT-R3M Best Baseline | 0.533 | 0.234 | 0.105 | 0.043 | 0.018 | 0.93 |
| ABC$\rightarrow$D | GR-1 | 0.854 | 0.712 | 0.596 | 0.497 | 0.401 | 3.06 |
| 10% data | HULC best baseline | 0.668 | 0.295 | 0.103 | 0.032 | 0.013 | 1.11 |
| 10% data | GR-1 | 0.778 | 0.533 | 0.332 | 0.218 | 0.139 | 2.00 |
| unseen language | HULC best baseline | 0.715 | 0.470 | 0.308 | 0.199 | 0.130 | 1.82 |
| unseen language | GR-1 | 0.764 | 0.555 | 0.381 | 0.270 | 0.196 | 2.17 |
ABCD$\rightarrow$D measures multi-task ability when training in all environments and evaluating in environment D; GR-1 beats the baseline from the 1st through the 5th consecutive task, with 5-task success rising from HULC's 0.383 to 0.731. ABC$\rightarrow$D is zero-shot unseen-scene generalization: GR-1's 1-task success of 0.854 is far above the best baseline's 0.533. The 10% data setting uses 10% of the ABCD$\rightarrow$D training set, i.e., 66 trajectories per task and 2244 in total; GR-1 still reaches a 1-task success of 0.778.
In the unseen language experiment, the authors used GPT-4 to generate 50 synonymous instructions for each of the 34 tasks and sampled them randomly during evaluation. The appendix gives examples such as rewriting "use the switch to turn off the light bulb" as "use the switch to stop the light source" (see Appendix: More Results).
5.3 Real robot experiment
The real robot is a 7-DoF Kinova Gen2 with a RealSense camera mounted on the end effector; the static view comes from a Kinect Azure. The object transportation training scenes include a plate, eggplant, broccoli, and bell pepper, with 1775 demonstrations collected via VR; drawer opening/closing training has 2856 trajectories. Real-robot fine-tuning follows the CALVIN settings except that the batch size is 64 and training runs for 30 epochs (see Appendix: Real Robot Experiments).
| method | Seen Objects | Unseen Instances | Unseen Categories | Articulated Object Manipulation |
|---|---|---|---|---|
| RT-1 | 0.27 | 0.13 | 0.00 | 0.35 |
| MT-R3M | 0.15 | 0.13 | 0.10 | 0.30 |
| GR-1 | 0.79 | 0.73 | 0.30 | 0.75 |
The reported failure modes are specific: RT-1 and MT-R3M often pick up or place the wrong objects, and RT-1 sometimes collides with the plate or the desk; in the unseen-category setting GR-1 confuses the bell pepper with the similarly colored peach; in the drawer task, failures include not closing the drawer fully or missing the handle. See Figure 9 for real-robot rollouts.
5.4 Qualitative results of video prediction
This part supports the paper's core mechanistic claim: the future-frame prediction signal provides future-state cues for action prediction. The authors do not convert prediction quality into a standalone quantitative metric, but use qualitative figures to show that future frame structure can be recovered on CALVIN and real-robot data.
5.5 Ablation experiment
| settings | Pre-training | Video Prediction | 1 task | 5 tasks | Avg. Len. |
|---|---|---|---|---|---|
| ABCD$\rightarrow$D | No | No | 0.889 | 0.459 | 3.33 |
| ABCD$\rightarrow$D | No | Yes | 0.918 | 0.619 | 3.82 |
| ABCD$\rightarrow$D | Yes | Yes | 0.949 | 0.731 | 4.21 |
| ABC$\rightarrow$D | No | No | 0.823 | 0.225 | 2.40 |
| ABC$\rightarrow$D | No | Yes | 0.815 | 0.297 | 2.65 |
| ABC$\rightarrow$D | Yes | Yes | 0.854 | 0.401 | 3.06 |
| 10% data | No | No | 0.526 | 0.022 | 1.04 |
| 10% data | No | Yes | 0.698 | 0.052 | 1.52 |
| 10% data | Yes | Yes | 0.778 | 0.139 | 2.00 |
The ablation conclusion has two levels: first, adding video prediction without pre-training usually already improves performance, showing the future-frame auxiliary task is useful in itself; second, adding large-scale video pre-training improves results further, especially in unseen scenes and with little data. The authors' explanation is that pre-training yields a more robust video prediction model, which in turn provides hints about future states (see Appendix: Ablation Studies).
| Future Step | 1 task | 2 tasks | 3 tasks | 4 tasks | 5 tasks | Avg. Len. |
|---|---|---|---|---|---|---|
| 1 | 0.895 | 0.802 | 0.710 | 0.643 | 0.562 | 3.61 |
| 3 | 0.918 | 0.833 | 0.761 | 0.685 | 0.619 | 3.82 |
| 5 | 0.909 | 0.806 | 0.719 | 0.649 | 0.583 | 3.67 |
Prediction with $\Delta t=3$ beats both 1 and 5. The authors' explanation: frames too close together differ too little to be informative, while frames too far ahead no longer guide the current local action.
5.6 Task-by-task success rate and more visualizations
The task-by-task success rates show that video pre-training helps most on block-manipulation tasks, e.g., rotate blue block right improves from 71.2 to 94.9, stack block from 45.7 to 80.1, and lift red block table from 76.7 to 97.7. The authors note that these tasks require first grasping the correct block and then manipulating it according to the language; video generative pre-training improves exactly this class of tasks (see Appendix: Task Success Rates).
| Task | GR-1 | GR-1 w/o Video Prediction & Pre-training | GR-1 10% data |
|---|---|---|---|
| rotate blue block right | 94.9 | 71.2 | 51.6 |
| stack block | 80.1 | 45.7 | 43.2 |
| lift red block table | 97.7 | 76.7 | 36.5 |
| place in slider | 91.3 | 89.1 | 34.8 |
| open drawer | 99.4 | 100.0 | 94.2 |
| turn on/off LED | 100.0 / 100.0 | 98.7 / 100.0 | 95.6 / 95.6 |
| push red/blue block right | 54.2 / 53.6 | 49.3 / 50.0 | 43.6 / 33.9 |
6. Reproducibility audit
Code and resources
Official code is available at github.com/bytedance/GR-1. The README provides CALVIN environment installation, CALVIN data download, MAE ViT-Base weight download, GR-1 ABCD$\rightarrow$D / ABC$\rightarrow$D checkpoint download, and the `evaluate_calvin.sh` evaluation command.
| Item to reproduce | Information given by paper/code | Status |
|---|---|---|
| Model structure | 12 layers, 12 heads, hidden size 384, 195M parameters, 46M trainable; CLIP text encoder and MAE ViT image encoder frozen. | fully |
| Training hyperparameters | Batch size, learning rate, dropout, optimizer, cosine decay, warmup, and epochs are all given in the appendix. | fully |
| Pre-training data construction | Ego4D; 3-second clips; 800k clips / 8M frames; equally spaced frames 1/3 s apart. | relatively sufficient |
| CALVIN evaluation | 1000 instruction chains; up to 5 consecutive tasks; 360-timestep timeout; ABCD$\rightarrow$D and ABC$\rightarrow$D splits. | fully |
| Real robot | Robot model, cameras, tasks, number of training trajectories, scene setup, and batch/epoch changes are all given. | reproducible but hardware-dependent |
| Pre-trained weights | The official README provides GR-1 CALVIN checkpoints; the paper does not document full training logs for the Ego4D pre-trained weights. | evaluation reproducible; full training costly |
7. Analysis, Limitations and Boundaries
7.1 The most valuable part of this paper
Based on the paper's own experiments, the core value lies in advancing large-scale out-of-domain video from generic visual-representation pre-training to future-state prediction, an objective closer to robot policy learning. The evidence is not a single benchmark number: the main results and the ablations jointly show that video prediction alone already brings gains, and stacking Ego4D pre-training improves results further under multi-task, unseen-scene, and low-data settings.
7.2 Why the results hold up
The paper uses multiple complementary settings: on CALVIN, multi-task learning (ABCD$\rightarrow$D) plus the unseen scenes of ABC$\rightarrow$D, 10% data, and unseen language; on the real robot, seen objects, unseen instances, unseen categories, and articulated-object manipulation. The appendix ablations separate video prediction from pre-training and compare future prediction steps, narrowing the alternative explanation that the gains come merely from model capacity or the Transformer architecture.
7.3 Explanations and failure modes given by the author
The authors attribute unseen-scene generalization to the visual-text alignment induced by the rich human-object interactions in Ego4D, and unseen-language generalization to the diverse language exposure during pre-training plus the frozen CLIP text encoder. For the real robot they explicitly list GR-1's failure modes: similarly colored objects are confused in the unseen-category setting, such as bell pepper and peach; in the drawer task the drawer is sometimes not fully closed or the handle not grasped. In the video prediction results they also note that details such as occluded objects can be missing.
7.4 Future work as described by the author
In the Conclusion, the authors propose three directions: combining video data with and without language annotations to improve robustness and generalization; exploring the difference between pre-training on arbitrary videos versus more manipulation-relevant videos; and expanding robot data to more trajectories in different environments and more manipulation skills.
7.5 Applicable boundaries
- GR-1's experiments mainly cover language-conditioned visual manipulation tasks in which future frames can serve as action cues; the paper does not show the method works equally well for all contact-rich dynamics or high-precision force-control tasks.
- The real-robot experiments are smaller in scale than CALVIN and focus on object transportation and drawer manipulation; reproducing them requires the cameras, robot, VR collection setup, and scene layout.
- Full pre-training relies on large-scale Ego4D data and large batches; its computational cost is far higher than simply downloading the official checkpoint for evaluation.