EN 中文

Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation

Method name: GR-1

Authors: Hongtao Wu, Ya Jing, Chilam Cheang, Guangzeng Chen, Jiafeng Xu, Xinghang Li, Minghuan Liu, Hang Li, Tao Kong

Organization: ByteDance Research

Publication: ICLR 2024; arXiv v2 submitted on 2023-12-21

Links: arXiv: 2312.13139 | PDF | Project page | official code

1. Quick overview of the paper

One-sentence summary: This paper introduces large-scale language conditional video prediction pre-training into robot operation, and proposes a unified GPT-style Transformer GR-1, which enables the same model to first predict future frames on Ego4D videos, and then simultaneously predict actions and future images on robot data.
What should the paper solve?Robot demonstration data is scarce, expensive, and contains multi-modal signals such as images, states, actions, and languages; the author hopes to use larger-scale non-robot video data to improve multi-task language-conditioned visual operations.
The author's approachThink of the robot trajectory as a visual sequence with actions: first use the language conditional video prediction on Ego4D to learn "what will happen next", and then transfer this ability to robot strategy learning.
most important resultsCALVIN ABCD$\rightarrow$D's success rate increased from the best baseline of 88.9% to 94.9%, and the average number of consecutive completed tasks increased from 3.06 to 4.21; ABC$\rightarrow$D's unseen scene success rate increased from 53.3% to 85.4%.
Things to note when readingThe core is not simply replaced by Transformer, but a combination of "video prediction token + behavioral cloning token + large-scale video pre-training"; the appendix ablation shows that only video prediction without pre-training is also helpful, but the complete GR-1 is the strongest.

Difficulty rating: ★★★★☆. Requires familiarity with behavioral cloning, Transformer sequence modeling, visual encoders, long-term robotic evaluation, and CALVIN's continuous task protocol.

Keywords: visual robot manipulation, language-conditioned policy, video generative pre-training, GPT-style transformer, CALVIN, Ego4D.

Core contribution list

GR-1 overview
Figure 1. GR-1 first performs language conditional video prediction pre-training on large-scale video data, and then fine-tunes and learns multi-task visual operations on robot data.

2. Motivation

2.1 What problem should be solved?

Generative pretraining in NLP and CV has shown that large-scale sequence data can provide transferable representations. The obstacles in the field of robotics are: high demonstration collection costs and sparse data; at the same time, robotic data is naturally multi-modal, including images, language, status, actions, etc. The key judgment of the paper is that the robot trajectory itself also contains video sequences, so there are structural similarities between "predicting future pictures based on language and past pictures" and "selecting actions based on language and historical observations".

A specific scenario is a long-term CALVIN operation: the robot must not only understand language such as "slide left the red block", but also visually locate objects, predict how the environment will change after the operation, and continuously complete up to 5 tasks. If it only relies on a small amount of robot data with language annotation, it will be difficult for the model to obtain a sufficiently robust visual-language-temporal structure.

2.2 Limitations of existing methods

The paper divides the previous work into several lines: the language conditional operation method can use LLM or CLIP for task understanding, but some methods predict sparse key points and rely on motion planners, which are less flexible than end-to-end continuous actions; the hierarchical method uses the latent plan conditionalization strategy, but it is not a simple and unified GPT-style trajectory modeling. Transformer decision-making models such as Decision Transformer, GATO, and RoboCat indicate that sequence models are suitable for decision-making problems, but RoboCat does not do video pre-training and is goal-image conditioned rather than language-conditioned.

In terms of pre-training, R3M, MVP, etc. focus more on visual representation, while VPT/VIPER uses videos within the task environment; the difference between GR-1 and them is that it uses non-robot, large-scale Ego4D videos outside the domain for language-conditional future frame prediction, and retains two output heads of future image and action prediction in the same model.

2.3 The solution ideas of this article

The high-level idea is to design the pre-training task as a substructure of the robot fine-tuning task: the pre-training phase inputs language and past frames, and outputs future frames; the robot phase additionally adds state input and action output, while continuing to predict future frames. In this way, the visual-language-temporal relationships learned through video pre-training can be directly fed into robot policy learning, rather than just as independent frozen representations.

4. Detailed explanation of method

4.1 Formalization of the problem

Video pre-training task: given language description and historical image sequence, predict future images.

$$\pi(l, \mathbf{o}_{t-h: t}) \rightarrow \mathbf{o}_{t+\Delta t}$$
$l$Natural language description of video.
$\mathbf{o}_{t-h: t}$The video frame sequence from $t-h$ to the current $t$.
$\mathbf{o}_{t+\Delta t}$Target frame at step $\Delta t$ in the future.

Robot fine-tuning task: Beyond video prediction, simultaneously predicting current actions.

$$\pi(l, \mathbf{o}_{t-h: t}, \mathbf{s}_{t-h: t}) \rightarrow \mathbf{o}_{t+\Delta t}, \mathbf{a}_{t}$$
$\mathbf{s}_{t-h: t}$Robot state sequence, including end effector 6D pose and binary gripper state.
$\mathbf{a}_{t}$Current action: continuous arm action and gripper binary action.
$D=\{\tau_i\}_{i=1}^{N}$An expert trajectory data set containing $M$ tasks. Each trajectory contains language, image, status, and action.

4.2 Model input and token organization

GR-1 uses a multi-modal encoder to map language, images and robot states into the same Transformer dimension. Language is encoded by CLIP text encoder; visual observations are encoded by MAE pre-trained ViT, CLS token is used as a global representation, patch token is compressed by Perceiver resampler; robot state is encoded by linear layer.Appendix Network and Training Details

GR-1 encoders and decoders
Figure 2. Language, state, image encoder, and future image decoder and action decoder.

The model learns two types of special tokens: [OBS] for future image prediction, [ACT] for action prediction. The sequence in the pre-training stage is language, image, [OBS] Appear alternately; the robot fine-tuning phase adds status and [ACT]. The author places repeated linguistic tokens in order to prevent linguistic information from being overwhelmed by denser visual and status tokens. A learned relative time embedding is also added to each time step, and all modalities at the same time step share this time embedding.

Pre-training sequence: (l, o[t-h], [OBS], l, o[t-h+1], [OBS],..., l, o[t], [OBS]) Robot finetuning sequence: (l, s[t-h], o[t-h], [OBS], [ACT],..., l, s[t], o[t], [OBS], [ACT])

4.3 Causal Transformer and mask

GR-1 uses GPT-style causal Transformer, but uses a special mask for the prediction token. During pre-training, the token can see the token at the previous position, but not the previous one. [OBS] Prediction token; when fine-tuning, you cannot see the previous [OBS] and [ACT]. This prevents the model from leaking target information from the prediction token and maintains autoregressive historical condition prediction.

4.4 Output header and loss

The future image is transformed by the transformer decoder from [OBS] Corresponding output and mask token reconstruct patch, the training target is pixel space MSE, and MAE style patch-wise normalization is used. action by [ACT] The output goes through three layers of MLP, and is finally divided into two heads: arm and gripper: arm uses Smooth-L1 loss, and gripper uses BCE loss. The total loss in the fine-tuning phase is:

$$L_{\mathrm{finetune}} = L_{\mathrm{arm}} + L_{\mathrm{gripper}} + L_{\mathrm{video}}$$

This shows that video prediction is not an auxiliary task that is discarded after pre-training, but still exists as a training signal when the robot is fine-tuned.

4.5 Training details

The pre-training data comes from Ego4D and contains more than 3500 hours of human-object interaction videos. The author trimmed 3-second short clips from the video, resulting in a total of 800, 000 clips and 8M frames. Pretrain randomly sampled sequences and predict future frames. Randomly sample robot trajectory segments during fine-tuning while optimizing behavioral cloning and video prediction losses.

Projectpre-trainingRobot fine-tuning
dataEgo4D; 800k clips; 8M framesCALVIN or real robot data
prediction step size$\Delta t=1$, equally spaced frames 1/3 second apart$\Delta t=3$, predict static camera and gripper camera images
sequence length10
freeze moduleCLIP text encoder and MAE image encoder
Transformer12 layers, 12 heads, hidden size 384; total parameters 195M, of which 46M are trainable Appendix Network and Training Details
hyperparameterspre-trainingfine-tuning
batch size1024512
learning rate3.6e-41e-3
dropout0.10.1
optimizerAdamWAdamW
schedulecosine decaycosine decay
warmup epochs51
training epochs5020

5. Experiment

5.1 Experimental questions and settings

The author designed experiments around three questions: whether GR-1 can improve visual robot operation; whether it can work on real robots; whether it can handle challenges such as small data, unseen scenes, unseen objects, and unseen languages. The main benchmarks include the CALVIN benchmark and real robot object transportation / articulated object manipulation.

CALVIN examples
Figure 3. Multi-task learning example on CALVIN.
CALVIN environments
Figure 4. The object positions and desktop configurations of the four CALVIN environments A/B/C/D are different.

CALVIN contains 34 tasks and open language instructions. The training data has more than 20k expert trajectories, but the author simulates real scenarios and only uses 1% of the data with crowdsourced language instruction labels to train GR-1, RT-1 and MT-R3M; MCIL and HULC use the complete CALVIN data. The evaluation uses 1000 unique sequence instruction chains, and each chain can complete up to 5 tasks continuously; a single task that is not completed within 360 timesteps is considered a failure.Appendix CALVIN Benchmark Experiments

5.2 CALVIN main result

settingsmethod1 task2 tasks3 tasks4 tasks5 tasksAvg. Len.
ABCD$\rightarrow$DHULC best baseline0.8890.7330.5870.4750.3833.06
ABCD$\rightarrow$DGR-10.9490.8960.8440.7890.7314.21
ABC$\rightarrow$DRT-1/MT-R3M Best Baseline0.5330.2340.1050.0430.0180.93
ABC$\rightarrow$DGR-10.8540.7120.5960.4970.4013.06
10% dataHULC best baseline0.6680.2950.1030.0320.0131.11
10% dataGR-10.7780.5330.3320.2180.1392.00
unseen languageHULC best baseline0.7150.4700.3080.1990.1301.82
unseen languageGR-10.7640.5550.3810.2700.1962.17

ABCD$\rightarrow$D measures the multi-task ability trained in all environments and evaluated in D environment; GR-1 is higher than the baseline from the 1st to the 5th consecutive task, especially the 5-task success from 0.383 to 0.731 of HULC. ABC$\rightarrow$D is a zero-shot unseen scene generalization. The 1-task success of GR-1 is 0.854, which is significantly higher than the best baseline of 0.533. 10% data uses 10% of the total training set of ABCD$\rightarrow$D, that is, 66 per task and a total of 2244 trajectories; GR-1 still reaches a 1-task success of 0.778. success.

In the unseen language experiment, the author used GPT-4 to generate 50 synonymous instructions for each of 34 tasks, and randomly sampled them during evaluation. Appendix examples include "use the switch to turn off the light bulb" to "use the switch to stop the light source."Appendix More Results

5.3 Real robot experiment

Real robot experiments
Figure 5. Real robot experiments include object handling and drawer opening and closing.

The real robot uses a 7-DoF Kinova Gen2 with RealSense installed on the end, and the camera static view is from Kinect Azure. The object handling training scene includes plate, eggplant, broccoli, and bell pepper, with a total of 1775 VR collection demonstrations; the drawer opening and closing training has 2856 trajectories. Real robot fine-tuning basically follows the CALVIN settings, but the batch size is changed to 64 and the training epochs are changed to 30.Appendix Real Robot Experiments

methodSeen ObjectsUnseen InstancesUnseen CategoriesArticulated Object Manipulation
RT-10.270.130.000.35
MT-R3M0.150.130.100.30
GR-10.790.730.300.75

The main failure modes reported by the author are also very specific: RT-1 and MT-R3M often pick up the wrong objects and place them incorrectly, and RT-1 may also hit the plate or desk; GR-1 will confuse bell pepper with similar colors in unseen categories with peach; in the drawer task, it will appear that the handle is not fully closed or the handle is not grasped. See Figure 9 for the real robot rollout.

5.4 Qualitative results of video prediction

Video prediction results
Figure 6. The green box is the real future frame, and the blue box is the predicted future frame; the author explains that GR-1 can restore the main future state, but details such as occluded objects may still be missing.

This part supports the core mechanism explanation of the paper: future frame prediction signals can provide future state constraints for action prediction. The authors do not convert the prediction quality into an independent quantitative metric, but use qualitative figures to illustrate that future frame structure can be recovered on CALVIN and real robot data.

5.5 Ablation experiment

Ablation results
Figure 7. Ablation plot showing CALVIN's average number of consecutive completed tasks and real robot pick-and-place success rate.
settingsPre-trainingVideo Prediction1 task5 tasksAvg. Len.
ABCD$\rightarrow$DNoNo0.8890.4593.33
ABCD$\rightarrow$DNoYes0.9180.6193.82
ABCD$\rightarrow$DYesYes0.9490.7314.21
ABC$\rightarrow$DNoNo0.8230.2252.40
ABC$\rightarrow$DNoYes0.8150.2972.65
ABC$\rightarrow$DYesYes0.8540.4013.06
10% dataNoNo0.5260.0221.04
10% dataNoYes0.6980.0521.52
10% dataYesYes0.7780.1392.00

The ablation conclusion is divided into two levels: first, adding video prediction without pre-training has usually improved, indicating that the future frame auxiliary task itself is useful; second, adding large-scale video pre-training further improves, especially in unseen scenes and small data. The authors explain that pre-training helps the model learn a more robust video prediction model, thereby forming hints about future states.Appendix Ablation Studies

Future Step1 task2 tasks3 tasks4 tasks5 tasksAvg. Len.
10.8950.8020.7100.6430.5623.61
30.9180.8330.7610.6850.6193.82
50.9090.8060.7190.6490.5833.67

Prediction $\Delta t=3$ is better than 1 and 5. The explanation given by the author is: when the continuous frames are too close, the information difference is insufficient, and when they are too far away, they are not suitable for guiding the current local action.

5.6 Task-by-task success rate and more visualizations

The task-by-task success rate table shows that video pre-training has a greater improvement in tasks involving block manipulation, such as rotate blue block right from 71.2 to 94.9, stack block from 45.7 to 80.1, and lift red block table from 76.7 to 97.7. The author explains that the difficulty of these tasks lies in the need to grasp the correct square first and then operate according to the language; video generative pre-training improves the performance of such tasks.Appendix Task Success Rates

TaskGR-1GR-1 w/o Video Prediction & Pre-trainingGR-1 10% data
rotate blue block right94.971.251.6
stack block80.145.743.2
lift red block table97.776.736.5
place in slider91.389.134.8
open drawer99.4100.094.2
turn on/off LED100.0 / 100.098.7 / 100.095.6 / 95.6
push red/blue block right54.2 / 53.649.3 / 50.043.6 / 33.9
CALVIN rollouts
Figure 8. More rollouts on the CALVIN benchmark.
Real robot rollouts
Figure 9. Real robot rollout; the first five lines are object handling, and the last two lines are drawer-type operations.
CALVIN video prediction appendix
Figure 10. Supplementary future frame prediction results on CALVIN.
Real robot video prediction appendix 1
Figure 11. Supplementary future frame prediction results in object handling and drawer operations.
Real robot video prediction appendix 2
Figure 12. Future frame prediction results in ablation pick-and-place experiments.

6. Reproducible auditing

Code and resources

There is official code: github.com/bytedance/GR-1. README provides CALVIN environment installation, CALVIN data download, MAE ViT-Base weight download, GR-1 ABCD-D/ABC-D checkpoint download, and `evaluate_calvin.sh` evaluation command.

Recurring itemsInformation given by the paper/codeStatus
Model structure12 layers, 12 heads, hidden size 384, 195M parameters, 46M trainable; CLIP text encoder and MAE ViT image encoder frozen.fully
Training hyperparametersBatch size, learning rate, dropout, optimizer, cosine decay, warmup, and epochs are all given in the appendix.fully
Pre-training data constructionEgo4D; 3 seconds clips; 800k clips/8M frames; equally spaced frames, 1/3 second apart.relatively sufficient
CALVIN Review1000 instruction chains; up to 5 consecutive tasks; 360 timesteps timeout; ABCD$\rightarrow$D and ABC$\rightarrow$D split.fully
real robotThe robot model, camera, task, number of training trajectories, scene settings, and batch/epoch changes are all given.Replicate high physical dependence
Pre-trained weightsThe official README provides GR-1 CALVIN checkpoint; the paper does not describe all training logs of complete Ego4D pre-training weights.The evaluation is reproducible, but the complete training cost is high
CALVIN evaluation skeleton from official README: export CALVIN_ROOT=/path/to/calvin/directory/ export EVAL_DIR=eval_logs/ export POLICY_CKPT_PATH=logs/snapshot_ABCD.pt export MAE_CKPT_PATH=/path/to/mae/checkpoint/ bash evaluate_calvin.sh --dataset_dir /path/to/calvin/dataset/task_ABCD_D/directory/

7. Analysis, Limitations and Boundaries

7.1 The most valuable part of this paper

Based on the paper's own experiments, the core value lies in advancing "outside large-scale video" from ordinary visual representation pre-training to "future state prediction", a goal closer to robot strategy learning. The evidence is not a single benchmark number, but the main result and ablation support at the same time: only the existing gains in video prediction are retained, and after superimposing Ego4D pre-training, it is further improved under multi-tasking, unseen scenes and small data.

7.2 Why the results hold up

The paper uses multiple complementary settings: in CALVIN there are multi-task learning of ABCD$\rightarrow$D, as well as unseen scenes, 10% data, and unseen language of ABC$\rightarrow$D; in real robots there are seen objects, unseen instances, unseen categories and articulated object manipulation. Appendix ablation separates video prediction from pre-training and compares future steps, reducing the explanation space of "just model capacity or Transformer architecture causing improvement".

7.3 Explanations and failure modes given by the author

The author believes that the unseen scene generalization comes from the visual-text alignment brought by the rich human-object interaction in Ego4D; the unseen language generalization comes from the diverse language exposure and frozen CLIP text encoder in pre-training. In the real robot, the author clearly lists the failure modes of GR-1: objects with similar colors will be confused in the unseen category, such as bell pepper and peach; the drawer task will not be fully closed or the handle will not be grasped. In the video prediction results, the author also pointed out that details such as occluded objects may be missing.

7.4 Future work as described by the author

In Conclusion, the author proposes three directions: combine video data training with and without language to enhance robustness and generalization; explore the difference between "arbitrary video pre-training" and "more relevant operation video pre-training"; expand robot data to include more trajectories in different environments and more operating skills.

7.5 Applicable boundaries