
GR-MG: Leveraging Partially-Annotated Data via Multi-Modal Goal-Conditioned Policy

Authors: Peiyan Li, Hongtao Wu, Yan Huang, Chilam Cheang, Liang Wang, Tao Kong

Organization: CASIA / UCAS / ByteDance Research

Publication info: IEEE Robotics and Automation Letters (RA-L); arXiv: 2408.14368. The paper is marked as received in December 2024, and the project page cites it as a 2025 RA-L paper.

Links: arXiv | PDF | Project home page | official code

1. Quick overview of the paper

One-sentence summary: GR-MG couples a task-progress-conditioned goal image generation model, which converts the language instruction and current observation into an intermediate goal image, with a multi-modal goal-conditioned policy that attends to language, goal image, observation history, and robot state at once to predict actions; this split lets it exploit videos that lack action labels and robot trajectories that lack text labels.

Difficulty rating: ★★★★☆. Reading requires familiarity with language-conditioned imitation learning, goal-conditioned policies, diffusion-based image editing, Transformer policies, cVAE action-trajectory prediction, and CALVIN long-horizon evaluation.

Keywords: Robot manipulation · Partially annotated data · Goal image generation · Multi-modal goal-conditioned policy · Task progress

| Reading target | Short answer |
| --- | --- |
| What problem does the paper solve? | Language-conditioned robot manipulation needs trajectories with both action and text annotations, but such fully-annotated data is expensive; the paper aims to bring both "videos with text but no actions" and "robot trajectories with actions but no text" into training. |
| The authors' approach | Split the goal into two levels: a progress-guided, InstructPix2Pix-style goal image generator produces sub-goal images, and a GPT-style policy conditioned on both text and goal image predicts action trajectories, future images, and task progress. |
| Most important results | The average number of consecutive tasks completed (out of 5) on CALVIN ABC→D rises from 3.35 to 4.04; real-robot success rises from 68.7% to 78.1% in the simple setting and from 44.4% to 60.6% averaged over the generalization settings. |
| Things to watch while reading | The core is not simply "generating a good-looking goal image" but the progress conditioning, the text+image bi-conditional policy, and how the two types of partially annotated data enter the training paths of the two modules. |

Core contribution list

- A progress-guided goal image generation model: task progress is appended to the editing prompt so the generator produces stage-appropriate sub-goal images.
- A multi-modal goal-conditioned policy that conditions on language and the generated goal image simultaneously and predicts action trajectories, future images, and task progress.
- A training scheme that routes text-annotated action-free videos to the generator and action-annotated text-free robot trajectories to the policy, improving results on CALVIN ABC→D and a real robot.

GR-MG overview
Fig. 1 / Overview: How two modules and two types of partially annotated data are used.

2. Motivation

2.1 What problem should be solved?

The paper focuses on language-conditioned visual robot manipulation. A standard policy can be written as:

The policy maps language, a history of images, and robot states directly to the current action trajectory.

$$\mathbf{a}_{t} = \pi(l, \mathbf{o}_{t-h: t}, \mathbf{s}_{t-h: t})$$
- $l$: natural language task instruction.
- $\mathbf{o}_{t-h:t}$: RGB observation sequence from $t-h$ to $t$; the paper uses a static camera and a wrist-mounted camera.
- $\mathbf{s}_{t-h:t}$: sequence of end-effector 6-DoF poses and binary gripper states.
- $\mathbf{a}_{t}$: the action trajectory to be output, rather than a single-step action.
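To make the interface concrete, here is a minimal sketch of the policy signature implied by this formula; the names, shapes, and the 7-D state layout are illustrative assumptions, not the paper's code:

```python
import numpy as np

def policy(l: str,
           obs_history: np.ndarray,    # (h, 2, H, W, 3): static + wrist RGB frames
           state_history: np.ndarray,  # (h, 7): 6-DoF EE pose + binary gripper state
           ) -> np.ndarray:
    """Map language plus observation/state history to an action trajectory.

    Returns an (n, 7) array: a short trajectory of arm poses and gripper
    commands, not a single-step action.
    """
    raise NotImplementedError  # stands in for the GPT-style policy of Section 4.3
```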

The difficulty lies in the data: fully-annotated trajectories contain language, images, states, and actions all at once, and are expensive to collect and label. In contrast, text-annotated human activity videos lack actions but are easy to obtain from public video data, while robot trajectories without text labels lack language but can be collected autonomously or semi-autonomously. The goal of GR-MG is to turn both kinds of data into usable training signal.

2.2 Where are the existing methods stuck?

The paper divides existing routes into two categories: methods that use only data missing one kind of label, such as learning representations from videos or training on language-free robot trajectories, and methods that use generated goal images/future videos as conditions for a policy or inverse dynamics model. The authors point out two recurring problems. First, goal generation tends to ignore task progress, producing wrong goals in tasks where the current observation looks the same at different stages. Second, if the policy relies only on the generated image, action prediction becomes fragile once the generated image deviates from the language instruction.

2.3 Solution ideas of this article

The high-level design of GR-MG is "generative goal + multi-modal goal-conditioned policy". The generator converts the language and current observation into an intermediate goal image; the policy not only looks at this goal image but also retains the language condition, so when the generated image is inaccurate there is still a text signal constraining action prediction. Task progress is predicted by the policy during rollout and fed back to the generator, forming a closed loop.

4. Detailed explanation of method

GR-MG architecture
Fig. 2 / Network Architecture: the closed loop between the progress-guided goal image generator and the multi-modal goal-conditioned policy.

4.1 Data form and training signal distribution

The fully-annotated trajectory is written as:

$$\tau = \{ l, (\mathbf{o}_{1}, \mathbf{s}_{1}, \mathbf{a}_{1}), \ldots, (\mathbf{o}_{T}, \mathbf{s}_{T}, \mathbf{a}_{T}) \}$$

Here language $l$, observations $\mathbf{o}$, states $\mathbf{s}$, and actions $\mathbf{a}$ are all available. GR-MG routes data to the two modules according to which labels are missing:

| Data type | What it includes | Which module it trains | How it is used during training |
| --- | --- | --- | --- |
| Fully-annotated robot trajectories | language, images, states, actions | both modules | The generator uses the current frame, language, a future frame, and progress; the policy uses language, ground-truth goal images, observation/state history, and actions. |
| Data w/o action labels | videos with text, no actions | goal image generation model | No actions needed; simply sample the current image and a future goal image from the video. |
| Data w/o text labels | robot trajectories with actions, no text | multi-modal goal-conditioned policy | Use an empty string as the text condition to pre-train the policy, then finetune on fully-annotated data. |
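A minimal sketch of how the empty-string trick in the last row might look in a data loader; the dictionary keys and helper name are hypothetical, only the null-text conditioning is from the paper:

```python
def make_policy_sample(traj: dict, has_text_label: bool) -> dict:
    """Build one policy training sample from a robot trajectory.

    Text-free trajectories (data w/o text labels) are trained with an empty
    string as the language condition; fully-annotated ones keep their instruction.
    """
    return {
        "text": traj["instruction"] if has_text_label else "",
        "goal_image": traj["future_frame"],   # ground-truth goal during training
        "obs": traj["obs_history"],
        "state": traj["state_history"],
        "action": traj["action_trajectory"],  # supervision target
    }
```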

4.2 Progress-guided Goal Image Generation Model

The generator is based on InstructPix2Pix, a diffusion-based image-editing model. It takes the current observation image, the text task description, and the task progress as input, and outputs the sub-goal image $N$ steps ahead. Following SuSIE's sub-goal idea, the authors do not directly generate the final state but regularly update intermediate goals.

Implementation point: instead of adding a special architecture, progress is appended to the text, e.g., "pick up the red block. And 60% of the instruction has been completed.", which is then encoded by T5-Base. During training, progress is computed from the video/trajectory timestep; during inference, it is predicted by the policy.
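The prompt template itself is quoted from the paper; here is a one-function sketch (the training-time progress formula based on the timestep is an assumption consistent with "computed from the video/trajectory timestep"):

```python
def build_progress_prompt(instruction: str, progress_pct: int) -> str:
    """Append task progress to the instruction, following the paper's template."""
    return f"{instruction} And {progress_pct}% of the instruction has been completed."

# Training: progress from the timestep, e.g. progress_pct = round(100 * t / (T - 1)).
# Inference: progress_pct comes from the policy's progress prediction.
print(build_progress_prompt("pick up the red block.", 60))
# pick up the red block. And 60% of the instruction has been completed.
```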

The appendix (Data / Generation Model) gives reproducibility details: images are resized to $256\times256$; a training sample is $(l, o_t, o_{t+k}, p)$; the goal image is sampled from frames $k_\mathrm{min}$ to $k_\mathrm{max}$ steps in the future; the latent diffusion model uses a VAE image encoder, U-Net denoising, text cross-attention, and classifier-free guidance; inference uses 50 denoising steps.

| Dataset | $k_\mathrm{min}$ | $k_\mathrm{max}$ | Description |
| --- | --- | --- | --- |
| CALVIN | 20 | 22 | Simulation benchmark. |
| Something-Something-V2 | 11 | 14 | Text-annotated human activity videos. |
| RT-1 | 5 | 6 | Real robot data for scaling generator training. |
| Real | 30 | 35 | Real robot data collected in this work. |
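A sketch of sampling one generator training tuple $(l, o_t, o_{t+k}, p)$ under the per-dataset offsets above; the progress formula and index handling are assumptions, not quoted from the paper:

```python
import random

def sample_generator_tuple(frames: list, instruction: str, k_min: int, k_max: int):
    """Sample (l, o_t, o_{t+k}, p) for goal-image-generator training.

    k_min/k_max come from the table above (e.g., 20/22 for CALVIN).
    """
    T = len(frames)
    t = random.randrange(T - k_min)                   # ensure a future frame exists
    k = random.randint(k_min, min(k_max, T - 1 - t))  # clamp near the episode end
    p = round(100 * t / (T - 1))                      # assumed progress definition
    return instruction, frames[t], frames[t + k], p
```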

4.3 Multi-modal Goal-Conditioned Policy

The policy inherits the GPT-style Transformer structure of GR-1 but makes three key changes: it conditions on the language instruction and the generated goal image simultaneously; it predicts an action trajectory through a cVAE head rather than a single-step action; and it additionally predicts task progress, which is fed back to the goal generator.

The appendix (Multi-modal Goal-Conditioned Policy) adds: each image is first encoded into 196 patch tokens and 1 global token; the 196 patch tokens are reduced to 9 tokens by a Perceiver Resampler; language is encoded with CLIP; robot state with a linear layer; all tokens are projected to the GPT hidden size by linear layers. The GPT backbone uses hidden size 384, 12 heads, 12 layers.
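A dimension-level sketch of this token assembly; module and variable names are illustrative assumptions, only the dimensions (196→9 patch tokens via the Perceiver Resampler, hidden size 384) follow the appendix:

```python
import torch
import torch.nn as nn

HIDDEN = 384  # GPT backbone: hidden size 384, 12 heads, 12 layers

class TokenAssembler(nn.Module):
    """Project each modality to the GPT hidden size before concatenation."""

    def __init__(self, img_dim: int = 768, txt_dim: int = 512, state_dim: int = 7):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, HIDDEN)    # MAE tokens (9 patch + 1 global)
        self.txt_proj = nn.Linear(txt_dim, HIDDEN)    # CLIP text feature
        self.state_proj = nn.Linear(state_dim, HIDDEN)

    def forward(self, img_tokens, txt_feat, state):
        # img_tokens: (B, 10, img_dim) after the Perceiver Resampler (196 -> 9) + global
        # txt_feat: (B, txt_dim); state: (B, state_dim)
        return torch.cat([
            self.txt_proj(txt_feat).unsqueeze(1),
            self.img_proj(img_tokens),
            self.state_proj(state).unsqueeze(1),
        ], dim=1)  # (B, 12, HIDDEN) sequence fed to the GPT backbone
```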

Algorithm: GR-MG inference loop
Input: text instruction l, observation history o, robot state s
progress p = 0
Every n steps (goal update):
  prompt = l + " And {p}% of the instruction has been completed."
  goal_image = ProgressGuidedGenerator(current_image, prompt)
Every policy step:
  tokens = [MAE(goal_image), CLIP(l), MAE(o_{t-h:t}), Linear(s_{t-h:t}), [PROG], [OBS], [ACT]]
  action_trajectory, future_images, p = GPTPolicy(tokens)
  execute the first part of action_trajectory
  use the predicted p at the next goal update

4.4 Training objectives

The goal image generation model is trained with the standard DDPM noise-prediction objective; the policy simultaneously predicts actions, future images, and task progress. The policy loss is:

This is not a single behavior-cloning loss but a joint constraint combining action prediction, image prediction, VAE regularization, and progress regression.

$$L = l_\mathrm{arm} + 0.01 l_\mathrm{gripper} + 0.1 l_\mathrm{img} + l_\mathrm{kl} + l_\mathrm{prog}$$
- $l_\mathrm{arm}$: arm action prediction loss.
- $l_\mathrm{gripper}$: gripper action prediction loss, weight 0.01.
- $l_\mathrm{img}$: future image prediction loss, weight 0.1, following GR-1's training signal.
- $l_\mathrm{kl}$: KL divergence of the cVAE.
- $l_\mathrm{prog}$: task progress prediction loss; the predicted progress is fed back to the generator.
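A sketch of this weighted sum in code; the individual loss types (smooth L1, BCE, MSE) are assumptions, only the weights come from the formula above:

```python
import torch
import torch.nn.functional as F

def policy_loss(pred: dict, target: dict, mu: torch.Tensor, logvar: torch.Tensor):
    """L = l_arm + 0.01 * l_gripper + 0.1 * l_img + l_kl + l_prog."""
    l_arm = F.smooth_l1_loss(pred["arm"], target["arm"])                # assumed loss type
    l_gripper = F.binary_cross_entropy_with_logits(
        pred["gripper"], target["gripper"])                             # assumed loss type
    l_img = F.mse_loss(pred["future_images"], target["future_images"])  # assumed loss type
    l_kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())     # cVAE KL term
    l_prog = F.mse_loss(pred["progress"], target["progress"])           # assumed loss type
    return l_arm + 0.01 * l_gripper + 0.1 * l_img + l_kl + l_prog
```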

5. Experiment

Experimental settings
Fig. 3 / Experiments: the 34 CALVIN tasks and examples of the 58 real-robot tasks.

5.1 Experimental setup

| Experimental group | Data and settings | Question examined |
| --- | --- | --- |
| CALVIN ABC→D | Train on Env A/B/C, test on Env D; ~18k fully-annotated trajectories; evaluate 1000 5-task chains. | Multitasking and generalization to an unseen environment. |
| CALVIN data scarcity | Only 10% of the fully-annotated data, about 1.8k trajectories / 0.1M frames; additionally 1M frames of text-free trajectories from Env A/B/C to pre-train the policy. | Does data w/o text labels help when fully-annotated data is scarce? |
| Real robot | Kinova Gen-3 + Robotiq 2F-85 + static/wrist cameras; 18k demonstrations, 37 training tasks; SSV2 and RT-1 added when training the generator. | Generalization across simple, unseen distractors, unseen instructions, unseen backgrounds, and unseen objects settings. |
| Few-shot novel skills | Hold out 8 of the 37 tasks, 7 of which are novel skills; train on 29 tasks / 15k trajectories, then finetune with 10 or 30 trajectories per task. | Few-shot learning of new skills. |

5.2 CALVIN main result

| Method | 1 task | 3 tasks | 5 tasks | Avg. Len. |
| --- | --- | --- | --- | --- |
| 3D Diff Actor | 93.8% | 66.2% | 41.2% | 3.35 ± 0.04 |
| GR-MG w/o image | 91.0% | 67.8% | 47.7% | 3.42 ± 0.28 |
| GR-MG w/o text | 91.8% | 68.9% | 48.1% | 3.46 ± 0.04 |
| GR-MG w/o progress | 94.1% | 75.2% | 56.3% | 3.76 ± 0.11 |
| GR-MG | 96.8% | 81.5% | 64.4% | 4.04 ± 0.03 |

The key way to read the main table is the long-horizon metric: the single-task success rate improves only modestly (93.8% to 96.8%), but 5-task chains jump from 41.2% to 64.4% and the average completed length rises from 3.35 to 4.04, showing that the gain in robustness under error accumulation is the larger effect. w/o text and w/o image perform similarly and both fall below the full model, supporting the paper's claim that the text and image conditions are complementary.

5.3 Data scarcity and partially labeled data

| Method | Fully-annotated data | Partially-annotated data | 1 task | 5 tasks | Avg. Len. |
| --- | --- | --- | --- | --- | --- |
| GR-1 | 10% | No | 67.2% | 6.9% | 1.41 ± 0.06 |
| GR-MG w/o part. ann. data | 10% | No | 82.4% | 19.7% | 2.33 ± 0.04 |
| GR-MG | 10% | Yes | 90.3% | 37.5% | 3.11 ± 0.08 |

With only 10% of the fully-annotated data, the extra 1M frames of text-free trajectories substantially strengthen the policy. The authors observe that the w/o part. ann. data variant often generates the correct goal image but follows it poorly; hence text-free robot trajectories mainly enhance the policy, not the generator.

5.4 Progress condition ablation

| Method | MSE ↓ | PSNR ↑ | SSIM ↑ | CD-ResNet50 ↑ |
| --- | --- | --- | --- | --- |
| GR-MG w/o progress | 965.347 | 18.821 | 0.721 | 0.945 |
| GR-MG | 903.139 | 19.121 | 0.730 | 0.946 |

This experiment turns "is the progress condition just prompt decoration?" into a testable question. All four goal-image similarity metrics improve; the qualitative figure shows that w/o progress can produce goal images of high visual quality that are inconsistent with the language, while the full model stays closer to the ground truth.
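For reference, the three pixel-level metrics in the table can be computed as below; CD-ResNet50 (a ResNet50 feature-similarity score) is omitted since its exact definition is not restated in this note:

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def goal_image_metrics(pred: np.ndarray, gt: np.ndarray) -> dict:
    """MSE / PSNR / SSIM between a generated goal image and ground truth (uint8 HxWx3)."""
    pred_f, gt_f = pred.astype(np.float64), gt.astype(np.float64)
    return {
        "mse": float(np.mean((pred_f - gt_f) ** 2)),
        "psnr": peak_signal_noise_ratio(gt, pred, data_range=255),
        "ssim": structural_similarity(gt, pred, channel_axis=-1, data_range=255),
    }
```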

Generated goal image comparison
Fig. 4 / Generated goal images: the effect of progress information and additional partially annotated videos on goal image accuracy.

5.5 Real robot results

Real robot success rates
Fig. 5 / Success rates: real robot simple and four generalization settings.

A total of 58 tasks were evaluated on the real robot. The paper reports that GR-MG raises the average success rate from 68.7% to 78.1% in the simple setting and from 44.4% to 60.6% averaged over the four generalization settings. The authors also walk through typical baseline failures: OpenVLA's discrete action space and lack of history/wrist-camera input hurt grasping and gripper open/close timing; Octo has history and proprioception but generalizes weakly to unseen backgrounds/objects; GR-1 tends to pick the wrong object among unseen objects. In the w/o part. ann. data comparison, the authors attribute the improvement to the additional text-annotated videos (used without action labels) improving language understanding and OOD robustness.

5.6 Few-shot Novel Skills

| Method | 10-shot | 30-shot |
| --- | --- | --- |
| OpenVLA | 0.0% | 2.5% |
| Octo | 0.0% | 0.0% |
| GR-1 | 2.5% | 22.5% |
| GR-MG w/o part. ann. data | 10.0% | 27.5% |
| GR-MG | 17.5% | 37.5% |

An important observation in the few-shot section is that after few-shot finetuning the goal image generator produces noticeably more accurate goal images, yet the policy remains the main bottleneck. This matches the future work in the conclusion: scaling up real-world text-free trajectory training of the policy.

6. Reproducibility audit

6.1 Code and resources

Already published: the official GitHub is bytedance/GR-MG. The README provides installation, training, and CALVIN evaluation scripts for both the goal image generation model and the multi-modal goal-conditioned policy, plus download links for the policy checkpoint, the goal generation checkpoint, InstructPix2Pix, MAE, and the CALVIN data.

Key dependencies: the official README states the tested environment is CUDA 12.1 + Python 3.9; goal-generation and policy dependencies are installed separately. Reproduction requires not only CALVIN but also the Ego4D pretraining checkpoint (or pretraining it yourself).

6.2 Key hyperparameters

| Item | Goal Image Generation Model | Multi-modal Goal-Conditioned Policy |
| --- | --- | --- |
| batch size | 1024 | 512 |
| learning rate | 8e-5 | 1e-3 |
| optimizer | AdamW | AdamW |
| weight decay | 1e-2 | 0 |
| Adam beta1 / beta2 | 0.95 / 0.999 | 0.9 / 0.999 |
| epochs | 50 | 50 |

Appendix Training: the generator is trained on 16 NVIDIA A100 80GB GPUs for 50 epochs (about 18 hours for CALVIN, about 30 hours for the real robot); the policy is trained on 32 NVIDIA A800 40GB GPUs for 50 epochs (about 17 hours for CALVIN, about 7 hours for the real robot). Generator training uses CenterCrop and ColorJitter augmentation; EMA is critical for stable performance.
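The optimizer rows of the table translate directly into PyTorch; the two Linear modules below are stand-ins, only the hyperparameters come from the table:

```python
import torch
import torch.nn as nn

generator, policy = nn.Linear(8, 8), nn.Linear(8, 8)  # placeholder modules

generator_opt = torch.optim.AdamW(generator.parameters(),
                                  lr=8e-5, betas=(0.95, 0.999), weight_decay=1e-2)
policy_opt = torch.optim.AdamW(policy.parameters(),
                               lr=1e-3, betas=(0.9, 0.999), weight_decay=0.0)
```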

6.3 Reproduction path

  1. Prepare the official environment: install the dependencies of goal_gen/install.sh and policy/install.sh separately.
  2. Download the InstructPix2Pix weights to resources/IP2P/ and the MAE encoder to resources/MAE/, and prepare the CALVIN data.
  3. Train the goal image generator: edit goal_gen/config/train.json, then run bash ./goal_gen/train_ip2p.sh ./goal_gen/config/train.json.
  4. Policy pre-training: use the Ego4D-pretrained checkpoint provided by the authors, or pre-train yourself with bash ./policy/main.sh ./policy/config/pretrain.json.
  5. Train the policy: set the pretrained model path in policy/config/train.json, then run bash ./policy/main.sh ./policy/config/train.json.
  6. CALVIN evaluation: run bash ./evaluate/eval.sh ./policy/config/train.json, specifying the goal generation model and policy checkpoints in the script.

7. Analysis, Limitations and Boundaries

7.1 The most valuable part of this paper

Judging from the paper's own experimental design, the value concentrates in routing data with two different kinds of missing labels to two different modules. Videos lacking action labels cannot directly supervise actions, but are well suited to training "current image + language + progress → future goal image"; robot trajectories lacking text labels cannot train language understanding, but are well suited to teaching a goal-image-conditioned policy how to turn visual goals into actions. This division of labor makes the usage path for partially-annotated data explicit.

7.2 Why the results hold up

The paper does not rest on a single main table but supports each core design choice with corresponding evidence: the CALVIN main table verifies the long-horizon improvement of the full GR-MG; w/o text and w/o image verify the bimodal condition; w/o progress plus the goal-image similarity metrics verify the progress condition; the 10% data-scarcity study verifies that text-free robot trajectories help the policy; and the real-robot w/o part. ann. data comparison plus the generated-image visualizations show that action-free videos help the generator's goal understanding, including OOD cases. The few-shot section further identifies the policy as the bottleneck, consistent with the expansion direction stated in the conclusion.

7.3 Limitations and future directions described by the author

7.4 Applicable boundaries

GR-MG suits manipulation tasks whose intermediate goal states can be expressed visually, and assumes the generator can regularly produce sub-goal images useful to the policy. The paper offers little coverage of tasks where the goal cannot be expressed by a single RGB sub-goal image, where depth/contact information is critical, or where the policy places extreme demands on real execution dynamics. And although the real-robot experiments include non-pick-and-place tasks and multiple OOD settings, they are still validated only on the authors' own platform, camera configuration, and task set.