
FRAPPE: Infusing World Modeling into Generalist Policies via Multiple Future Representation Alignment

Authors: Han Zhao, Jingbo Wang, Wenxuan Song, Shuai Chen, Yang Liu, Yan Wang, Haoang Li, Donglin Wang

Organization: Zhejiang University, Westlake University, HKUST (GZ), South China University of Technology, ShanghaiTech University, Tsinghua University

Publishing status: arXiv v1 submitted on 2026-02-19

Links: arXiv: 2602.17259 | PDF | Project page | code | model

1. Quick overview of the paper

One-sentence summary: FRAPPE adds learnable future prefixes to a diffusion-based VLA such as RDT and trains it in two stages, first adapting it to future representation alignment and then expanding it in parallel into multiple Prefix+LoRA experts aligned with the future visual representations of CLIP, DINOv2, ViT, etc., thereby improving implicit world modeling, generalization, and data efficiency.

Reading guide

| Item | Content |
| --- | --- |
| Problem the paper addresses | Explicit world models must predict future pixels, so they tend to over-focus on pixel-level reconstruction, and relying on predicted future observations at inference time accumulates errors; aligning to a single latent representation may be limited by the inductive bias of a single vision task. |
| The authors' approach | Use a learnable future prefix to align with VFM embeddings of future observations. Mid-training adapts a single stream with full-parameter fine-tuning against a Theia-style distilled encoder; post-training uses Mixture-of-Prefix-and-LoRA to align multiple VFMs in parallel, after which a router aggregates the expert outputs into actions. |
| Most important results | RoboTwin 2.0 eight-task average: 57.5% (Easy) and 25.5% (Hard), both the highest among the compared methods. On the real AgileX bimanual tasks FRAPPE performs well in unseen settings; on the long-horizon three-stage task RDT scores 0% and FRAPPE 20%. |
| Things to note when reading | FRAPPE is similar to FLARE but emphasizes alignment with multiple future representations and progressive parallel expansion. It uses multiple teacher VFMs during training and no longer calls them at inference, but it keeps the parallel expert computation graph. |

Difficulty rating

4/5. Requires familiarity with diffusion/DiT policies, RDT, LoRA, MoE routers, future representation alignment, visual foundation model (VFM) teachers, and mixed training on robot data and human egocentric data.

Keywords

VLA; RDT; Implicit World Modeling; Future Representation Alignment; Prefix Tuning; LoRA; Mixture of Experts; Human Egocentric Videos; RoboTwin

Core contribution list

Figure 1: FRAPPE improves performance in simulation, in complex real-world scenarios, and at different levels of the training data pyramid.

2. Motivation

2.1 What problem should be solved?

VLAs and diffusion policies can already learn multi-modal action distributions, but to perform complex tasks a robot still needs to understand the dynamics of its environment, that is, world modeling. Existing methods often cast world modeling as "predicting future images" and then use the prediction for action generation or as an auxiliary training signal.

The paper identifies two problems with this approach. First, future pixel prediction spends much of its computation on redundant texture and background detail instead of task-relevant object information. Second, the inference stage relies on future observations generated by the model itself, so prediction errors accumulate over time and degrade the actions.

2.2 Limitations of existing methods

Explicit future-image methods, such as models that jointly generate future frames and actions, may produce poor image quality in out-of-distribution (OOD) scenarios. Implicit alignment methods, such as FLARE, VPP, and other representation alignment approaches, avoid explicit generation, but if they align to only a single visual representation they may inherit the bias of that vision task and may not suit all robot tasks.

The core motivation of FRAPPE is that world modeling need not predict the image itself, nor should it be limited to a single representation: the model can be aligned to the future representations of multiple vision foundation models at the same time, and gain scaling benefits from multi-stream parallel computation during inference.

2.3 Proposed approach

FRAPPE uses a two-stage recipe:

  1. Mid-training: Single-stream, full-parameter fine-tuning. A future prefix is added and aligned to a tiny Theia-style teacher encoder obtained by multi-VFM distillation, so that RDT adapts to the world-modeling objective.
  2. Post-training: The shared backbone is frozen and only the multiple future prefixes and their corresponding LoRA adapters are trained. Each expert is aligned with its own teacher encoder, and a router finally aggregates the expert outputs to generate actions.

4. Method in detail

Figure 2: Overview of training and inference. During training, multiple VFM representations are aligned progressively; during inference, the parallel expert computation graph is retained but no VFM teacher supervision is used.

4.1 Preliminaries: RDT

RDT models the conditional distribution over action sequences, $p_\theta(\mathbf{a}_t|\mathbf{o}_t, l)$. Given a language instruction $l$, an observation $\mathbf{o}_t$, a noisy action $\tilde{\mathbf{a}}_t$, and a diffusion timestep $k$, the DiT denoising network $f_\theta$ predicts the clean action chunk.

RDT's original objective is to denoise noisy action chunks into the ground-truth action chunks:

$$\mathcal{L}_{action}=\mathrm{MSE}\left(\mathbf{a}_t, f_\theta(l, \mathbf{o}_t, \tilde{\mathbf{a}}_t, k)\right)$$

$$\tilde{\mathbf{a}}_t=\sqrt{\bar{\alpha}_k}\mathbf{a}_t+\sqrt{1-\bar{\alpha}_k}\epsilon, \quad \epsilon\sim\mathcal{N}(0, I)$$

$\mathbf{o}_t$: current visual observation.
$l$: language instruction.
$\tilde{\mathbf{a}}_t$: noisy action chunk at diffusion timestep $k$.
$\bar{\alpha}_k$: cumulative noise-schedule coefficient at timestep $k$.
$f_\theta$: DiT backbone of RDT, conditioned on SigLIP visual tokens and T5 language tokens.
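The two formulas above can be illustrated with a small, dependency-free sketch. It is not the RDT implementation: `toy_denoiser` is a hypothetical stand-in for $f_\theta$, the action chunk is a flat list of four numbers, and the value $\bar{\alpha}_k=0.9$ is chosen arbitrarily.

```python
import math
import random


def add_noise(action, alpha_bar_k, rng):
    """Forward diffusion: sqrt(alpha_bar_k) * a + sqrt(1 - alpha_bar_k) * eps."""
    signal_scale = math.sqrt(alpha_bar_k)
    noise_scale = math.sqrt(1.0 - alpha_bar_k)
    return [signal_scale * a + noise_scale * rng.gauss(0.0, 1.0) for a in action]


def mse(target, prediction):
    """Mean squared error between two equal-length sequences."""
    return sum((t - p) ** 2 for t, p in zip(target, prediction)) / len(target)


def toy_denoiser(noisy_action, alpha_bar_k):
    # Hypothetical stand-in for f_theta: it only undoes the signal scaling.
    # The real network is a DiT conditioned on language, observation and k.
    return [x / math.sqrt(alpha_bar_k) for x in noisy_action]


rng = random.Random(0)
clean_action = [0.1, -0.3, 0.5, 0.0]  # toy flattened action chunk a_t
noisy_action = add_noise(clean_action, alpha_bar_k=0.9, rng=rng)
loss_action = mse(clean_action, toy_denoiser(noisy_action, alpha_bar_k=0.9))
```

With $\bar{\alpha}_k=1$ the noise term vanishes and `add_noise` returns the clean action unchanged, which is a quick sanity check on the noising formula.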

4.2 Future Prefix Alignment

FRAPPE adds a learnable future prefix $\mathbf{p}\in\mathbb{R}^{n\times d}$ to the RDT input sequence. The model then outputs not only actions but also a future representation prediction at the prefix positions, so that RDT, originally responsible only for action denoising, learns a future state representation internally:

$$\mathbf{a}_t, \mathbf{p}_t=f_\theta(l, \mathbf{o}_t, \tilde{\mathbf{a}}_t, k)$$

$$\mathbf{e}_{t+h}=\Phi(\mathbf{o}_{t+h}), \quad \mathcal{L}_\Phi=\cos(\mathbf{p}_t, \mathrm{sg}(\mathbf{e}_{t+h}))$$

$\Phi$: pretrained VFM teacher encoder.
$h$: future horizon; the appendix ablation finds $h=8$ to be best.
$\mathrm{sg}$: stop-gradient, so the teacher encoder is not updated.
$\mathbf{p}_t$: future-prefix representation output by RDT, used to align with the embedding of the future observation.
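A minimal sketch of a cosine-based alignment term follows. Two points are assumptions rather than details from the paper: the loss is written as $1-\cos$ so that minimizing it increases similarity (the notes above only write $\cos$), and the $n$ prefix outputs are matched one-to-one with $n$ teacher tokens. The teacher vectors are plain constants here, which plays the role of the stop-gradient.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def alignment_loss(pred_tokens, target_tokens):
    """Mean (1 - cos) over prefix tokens; targets are treated as constants."""
    per_token = [
        1.0 - cosine_similarity(p, e) for p, e in zip(pred_tokens, target_tokens)
    ]
    return sum(per_token) / len(per_token)


# Toy example: two prefix tokens of dimension 2.
prefix_out = [[1.0, 0.0], [1.0, 0.0]]      # p_t predicted by the policy
teacher_emb = [[2.0, 0.0], [0.0, 1.0]]     # e_{t+h} from a frozen teacher
loss_align = alignment_loss(prefix_out, teacher_emb)
```

The first token points in the same direction as its target (loss 0) and the second is orthogonal (loss 1), so the mean is 0.5; vector magnitude does not matter, only direction.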

4.3 Parallel Scaling: Mixture-of-Prefix-and-LoRA

To leverage knowledge from multiple vision foundation models, FRAPPE builds multiple future-prefix + LoRA experts on a shared RDT backbone. Each expert corresponds to one teacher encoder; the paper uses $M=3$, with CLIP (400M), DINOv2 (142M), and ViT (300M) as teachers.

$$\mathcal{L}_{align}=\sum_{i=1}^{M}\mathcal{L}_{\Phi_i}$$

Here $\mathcal{L}_{\Phi_i}$ is the future representation alignment loss for the $i$-th VFM teacher.

During inference, each expert produces a latent action representation, the router generates gating weights, and the weighted outputs are aggregated.

$$\mathbf{a}_t=\mathrm{MLP}\left(\sum_{i=1}^{M}w_i\cdot z_i\right), \quad \sum_i w_i=1$$
$z_i$: latent action representation output by the $i$-th expert.
$w_i$: weight assigned by the router to the $i$-th expert.
MLP: shared action head that maps the weighted latent representation to an executable action chunk.
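The aggregation step can be sketched as below. The softmax router and the linear `toy_action_head` are illustrative assumptions: the notes only state that the weights sum to one and that a shared MLP maps the mixed latent to actions, and the router architecture is not specified here.

```python
import math


def softmax(logits):
    """Numerically stable softmax; the outputs sum to 1."""
    shift = max(logits)
    exps = [math.exp(g - shift) for g in logits]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate(expert_latents, router_logits):
    """Return gating weights w and the mixed latent sum_i w_i * z_i."""
    weights = softmax(router_logits)
    dim = len(expert_latents[0])
    mixed = [
        sum(w * z[d] for w, z in zip(weights, expert_latents)) for d in range(dim)
    ]
    return weights, mixed


def toy_action_head(latent, scale=2.0):
    # Hypothetical stand-in for the shared MLP action head.
    return [scale * x for x in latent]


expert_latents = [[3.0, 0.0], [0.0, 3.0], [0.0, 0.0]]  # z_1..z_3, M = 3
weights, mixed = aggregate(expert_latents, router_logits=[0.0, 0.0, 0.0])
action = toy_action_head(mixed)
```

With equal router logits every expert receives weight 1/3, so the mixed latent is the plain average of the expert latents; a strongly peaked logit vector would instead let one expert dominate, which is the collapse case the load-balancing terms are meant to prevent.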

4.4 Load Balance and Label Smoothing

The authors observed mode collapse: a single stream may dominate learning while the other experts rarely receive updates. To counter this, they add a load-balancing loss and label smoothing on the gating weights.

$$\mathcal{L}_{balance}=\frac{1}{B}\sum_{j=1}^{B}\left(\log\sum_{i=1}^{M}e^{\mathbf{g}_{i, j}}\right)^2$$

$$w'_i=w_i(1-\epsilon)+\frac{\epsilon}{M}, \quad \epsilon=0.1$$

$$\mathcal{L}_{total}=\mathcal{L}_{action}+\lambda_1\mathcal{L}_{align}+\lambda_2\mathcal{L}_{balance}$$

Here $\mathbf{g}_{i, j}$ is presumably the router logit of expert $i$ for sample $j$ in a batch of size $B$, and $\epsilon$ is the smoothing coefficient (not the diffusion noise of Section 4.1). The $\lambda_1$ ablation in Appendix A shows that $\lambda_1=0.05$ is optimal; a larger value interferes with the main task of action prediction.
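The three formulas of this subsection can be written out directly, as in the sketch below. The reading of $\mathbf{g}_{i,j}$ as per-sample router logits is an assumption, and the default `lambda2=0.01` is a placeholder, since these notes do not report a value for $\lambda_2$.

```python
import math


def logsumexp(logits):
    """Stable log(sum(exp(g_i))) over one sample's router logits."""
    shift = max(logits)
    return shift + math.log(sum(math.exp(g - shift) for g in logits))


def balance_loss(batch_logits):
    """Mean over the batch of the squared log-sum-exp of the router logits."""
    return sum(logsumexp(g) ** 2 for g in batch_logits) / len(batch_logits)


def smooth_weights(weights, eps=0.1):
    """Gating label smoothing: w' = w * (1 - eps) + eps / M."""
    m = len(weights)
    return [w * (1.0 - eps) + eps / m for w in weights]


def total_loss(l_action, l_align, l_balance, lambda1=0.05, lambda2=0.01):
    # lambda2 is a placeholder value, not taken from the paper.
    return l_action + lambda1 * l_align + lambda2 * l_balance


smoothed = smooth_weights([1.0, 0.0, 0.0])      # fully collapsed gate
l_balance = balance_loss([[0.0, 0.0, 0.0]])     # equals log(3) ** 2
```

Even a fully collapsed gate `[1, 0, 0]` keeps a floor of $\epsilon/M$ on every expert after smoothing, so no expert is ever weighted exactly zero.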

4.5 Why are Mid-training and Post-training separated?

The paper emphasizes that parallel post-training cannot be applied directly to the base RDT, because the new architecture and objective deviate too far from RDT's original pre-training. Mid-training therefore first performs full-parameter fine-tuning with a single-stream future prefix and a Theia-style 86M distilled encoder, adapting the model to the world-modeling objective; post-training then freezes the backbone and uses LoRA and prefixes to align multiple teachers efficiently.

Implementation points: The main experiment starts from the official RDT-1B pretrained weights and trains for 20k steps in total (15k mid-training + 5k post-training); the training data contains only 50 task-specific trajectories per task, collected in the Easy setting.
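As a rough illustration of the schedule, the sketch below maps a global training step to its stage and to the parameter groups that would be updated. The step counts and teacher names come from the text; the group names (`backbone`, `lora_adapters`, and so on) are hypothetical labels, not identifiers from the released code.

```python
MID_STEPS = 15_000   # single-stream, full-parameter fine-tuning
POST_STEPS = 5_000   # multi-expert prefix + LoRA training
TOTAL_STEPS = MID_STEPS + POST_STEPS

# Hypothetical parameter-group names, used only for illustration.
TRAINABLE = {
    "mid": {"backbone", "future_prefix"},
    "post": {"future_prefixes", "lora_adapters", "router"},
}
TEACHERS = {
    "mid": ["Theia-style distilled encoder (86M)"],
    "post": ["CLIP", "DINOv2", "ViT"],
}


def stage_at(step):
    """Return 'mid' or 'post' for a zero-based global training step."""
    if not 0 <= step < TOTAL_STEPS:
        raise ValueError(f"step {step} is outside the {TOTAL_STEPS}-step schedule")
    return "mid" if step < MID_STEPS else "post"


def is_trainable(group, step):
    """True if the given parameter group is updated at this step."""
    return group in TRAINABLE[stage_at(step)]
```

For example, `is_trainable("backbone", 100)` holds during mid-training, while the same group is frozen at step 16,000, where only the prefixes, LoRA adapters, and router are updated.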

5. Experiments and results

5.1 Simulation Setup

The simulation experiments use RoboTwin 2.0, a real-to-sim bimanual benchmark. Each task has two settings, Easy and Hard; Hard adds domain randomization such as scene clutter, background texture, lighting, and table height. All simulation experiments cover 8 tasks, and each model is evaluated over 100 trials per task, with the average success rate reported.

Training settings: start from the official RDT-1B pretrained weights; training data is limited to 50 trajectories per task in the Easy setting; training runs for 20,000 steps on two H100 GPUs with batch size 32.

5.2 RoboTwin 2.0 Main Results

| Method | Average Easy | Average Hard | Remarks |
| --- | --- | --- | --- |
| DP | 31.3% | 0.0% | Train-from-scratch visuomotor baseline. |
| VPP | 35.8% | 4.0% | Implicit world model baseline. |
| RDT | 47.4% | 15.1% | FRAPPE's base model. |
| $\pi_0$ | 57.1% | 14.1% | RoboTwin SOTA baseline. |
| $\pi_{0.5}$ | 45.4% | 13.3% | $\pi_0$ successor. |
| FRAPPE | 57.5% | 25.5% | Highest Easy average; highest Hard average by a clear margin. |

The improvement in the Hard setting is the more important one: FRAPPE rises from RDT's 15.1% to 25.5%, which also exceeds $\pi_0$ (14.1%) and $\pi_{0.5}$ (13.3%). The authors attribute this to the model better learning the low-level dynamics behind varied visual observations instead of relying on spurious visual correlations.
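To make the size of these gaps concrete, the short script below recomputes the margins from the table values. It only restates numbers already listed above; the variable names are illustrative.

```python
# Average success rates from the RoboTwin 2.0 table: method -> (Easy %, Hard %)
results = {
    "DP": (31.3, 0.0),
    "VPP": (35.8, 4.0),
    "RDT": (47.4, 15.1),
    "pi_0": (57.1, 14.1),
    "pi_0.5": (45.4, 13.3),
    "FRAPPE": (57.5, 25.5),
}

best_easy = max(results, key=lambda m: results[m][0])
best_hard = max(results, key=lambda m: results[m][1])

# Absolute and relative Hard-setting gain of FRAPPE over its base model RDT.
hard_gain_points = round(results["FRAPPE"][1] - results["RDT"][1], 1)
hard_gain_relative = round(100.0 * hard_gain_points / results["RDT"][1], 1)

# Easy-setting margin over the strongest Easy baseline, pi_0.
easy_margin_over_pi0 = round(results["FRAPPE"][0] - results["pi_0"][0], 1)
```

The Hard gain over RDT is 10.4 points (about 68.9% relative), whereas the Easy margin over $\pi_0$ is only 0.4 points, which is why the Hard setting carries most of the evidence.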

5.3 Training Paradigm Ablation

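| No. | Method | Steps | Easy (%) | Hard (%) | Average (%) |
| --- | --- | --- | --- | --- | --- |
| 0 | RDT | 20k | 59.0 | 20.5 | 39.8 |
| 1 | mid-train full ft | 20k | 63.0 | 27.5 | 45.3 |
| 2 | mid-train prefix & LoRA ft | 20k | 48.0 | 8.5 | 28.3 |
| 3 | post-train prefix ft | 20k | 25.0 | 4.0 | 14.5 |
| 4 | post-train prefix & LoRA ft | 20k | 46.0 | 9.0 | 27.5 |
| 5 | mid full ft + post prefix ft | 15k + 5k | 68.0 | 21.5 | 44.8 |
| 6 | mid full ft + post prefix & LoRA ft | 15k + 5k | 73.5 | 32.0 | 52.3 |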

The conclusion is clear: mid-training must come first and must use full-parameter fine-tuning (row 1 vs. row 2); post-training applied directly to the base RDT falls well below the RDT baseline (rows 3 and 4 vs. row 0); the best recipe is 15k steps of full-parameter mid-training followed by 5k steps of prefix & LoRA post-training (row 6).

5.4 Inference Efficiency

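| Metric | RDT, 5 steps | mid-train, 5 steps | post-train, 5 steps | post-train, 3 steps |
| --- | --- | --- | --- | --- |
| Inference Memory | 3.7 GB | 3.7 GB | 8.0 GB | 8.0 GB |
| Latency | 0.214 s | 0.228 s | 0.235 s | 0.173 s |
| Success Rate | 39.8% | 45.3% | 52.3% | 48.5% |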

The parallel experts added in post-training raise GPU memory from 3.7 GB to 8.0 GB, but latency at the same 5 denoising steps rises by only about 20 ms (0.214 s to 0.235 s). With 3 denoising steps, latency (0.173 s) drops below that of RDT at 5 steps, while the success rate (48.5%) stays above the RDT baseline (39.8%).
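The trade-off can be checked with a few lines of arithmetic on the table values. Nothing here is measured independently; the dictionary keys are illustrative names for the four columns.

```python
# Values copied from the inference efficiency table.
runs = {
    "rdt_5": {"latency_s": 0.214, "memory_gb": 3.7, "success": 39.8},
    "mid_5": {"latency_s": 0.228, "memory_gb": 3.7, "success": 45.3},
    "post_5": {"latency_s": 0.235, "memory_gb": 8.0, "success": 52.3},
    "post_3": {"latency_s": 0.173, "memory_gb": 8.0, "success": 48.5},
}

base = runs["rdt_5"]

# Extra latency of the parallel experts at the same number of denoising steps.
overhead_ms = round((runs["post_5"]["latency_s"] - base["latency_s"]) * 1000)

# Latency of the 3-step variant relative to the 5-step RDT baseline.
relative_latency_3step = round(runs["post_3"]["latency_s"] / base["latency_s"], 2)

# Success-rate gain that remains after cutting to 3 denoising steps.
success_gain_3step = round(runs["post_3"]["success"] - base["success"], 1)

# Memory cost of keeping the parallel expert graph at inference.
memory_ratio = round(runs["post_5"]["memory_gb"] / base["memory_gb"], 2)
```

In these numbers the 3-step variant runs at roughly 0.81 of the baseline latency while keeping an 8.7-point success gain, at a little over twice the memory.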

5.5 Smaller-scale Policy Model

Figure 3: Verification on RDT-130M. The FRAPPE recipe also improves performance on the small model. The gap between LoRA post-training and full-parameter post-training is only about 2-3%.

The authors use RDT-130M to show that the training paradigm does not depend on the 1B parameter scale. RDT-130M's original generalization on hard tasks is weak, but FRAPPE improves hard-task performance substantially, approaching the level of a naively fine-tuned RDT-1B.

5.6 Real-world Experiments

Real-world experiments use a bimanual AgileX mobile manipulator; each arm has 6 DoF and a parallel gripper. The vision system consists of a high-mounted third-person main camera and two wrist-mounted egocentric cameras. Training data: 25 demonstrations per variation for basic tasks and 100 demonstrations for long-horizon tasks. Evaluation: 40 trials per basic task and 20 trials per long-horizon task.

Figure 4: Seen/unseen setup for the real-world tasks. FRAPPE performs well under lighting, height, pose, and object variations, especially in unseen settings.
Figure 5: The long-horizon task contains three temporally dependent subtasks and four interactive objects. RDT never completed the full task in the trials (0%), while FRAPPE achieved a full-task success rate of 20%.

5.7 Human Egocentric Co-training

FRAPPE proposes a data pyramid: the bottom layer is large-scale action-free human egocentric data, the middle layer is task-specific human egocentric data, and the top layer is task-specific robot teleoperation data. The authors emphasize that the task-specific human data is recorded not with GoPro/VR devices but with a static third-person camera consistent with the robot data; with this setup a novice human operator can collect more than 360 trajectories/hour, whereas skilled robot teleoperation typically yields about 120 trajectories/hour.

The co-training experiments used 5 robot action trajectories per object, 50 task-specific human egocentric trajectories, and 10k task-irrelevant human egocentric videos. For action-free samples, action loss is omitted and only alignment loss is optimized.
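The loss handling for mixed batches can be sketched as follows. This is an assumption-laden illustration: it keeps the $\lambda_1$ weight on the alignment term for action-free samples and ignores the balance loss, neither of which is specified in these notes; the dictionary keys `align` and `action` are hypothetical.

```python
def sample_loss(l_align, l_action=None, lambda1=0.05):
    """Per-sample loss; l_action is None for action-free human videos."""
    if l_action is None:
        # No action label: only the alignment term is optimized.
        return lambda1 * l_align
    return l_action + lambda1 * l_align


def batch_loss(samples, lambda1=0.05):
    """Average the per-sample losses over a mixed robot/human batch."""
    losses = [sample_loss(s["align"], s.get("action"), lambda1) for s in samples]
    return sum(losses) / len(losses)


batch = [
    {"align": 0.4, "action": 1.0},  # robot teleoperation sample
    {"align": 0.6},                 # human egocentric sample, no action label
]
loss = batch_loss(batch)
```

Treating a missing action label as "skip the action term" lets robot and human samples share one batch, so the human videos still shape the future-prefix representation without contributing any action gradient.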

Figure 6: Results with human egocentric data without action labels. Large-scale Ego(web) provides a strong inductive prior for novel objects, and Ego(task) improves spatial generalization. Combining the two works best.

6. Reproducibility audit

6.1 Key training configurations

| Item | Configuration |
| --- | --- |
| Base model | Official RDT-1B pretrained weights; small-model verification uses RDT-130M. |
| Simulation data | RoboTwin Easy setting, 50 task-specific trajectories per task. |
| Training budget | 2 NVIDIA H100 GPUs; 20,000 steps; batch size 32. |
| FRAPPE schedule | 15,000 mid-training steps + 5,000 post-training steps. |
| Post-training trainable params | Future prefixes + LoRA + lightweight router/action-aggregation modules; shared RDT backbone frozen. |
| Teacher encoders | Mid-training: 86M Theia-style distilled encoder; post-training: CLIP 400M, DINOv2 142M, ViT 300M. |
| Evaluation | 100 trials each for RoboTwin; 40 trials each for real basic tasks; 20 trials each for real long-horizon tasks. |

6.2 Appendix Hyperparameter Ablation

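| $\lambda_1$ | 0 | 0.001 | 0.02 | 0.05 | 0.1 | 0.5 |
| --- | --- | --- | --- | --- | --- | --- |
| SR | 14.0% | 18.5% | 26.4% | 32.5% | 22.0% | 23.5% |

| Alignment depth | 7 | 14 | 21 | 28 |
| --- | --- | --- | --- | --- |
| SR | 14.5% | 18.0% | 23.5% | 16.0% |

| Future horizon $h$ | 8 | 16 | 32 |
| --- | --- | --- | --- |
| SR | 35.3% | 35.0% | 29.7% |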

Appendix A uses layer 21 of the 28-layer DiT in RDT-1B for future prefix alignment, about 3/4 of the total depth; this is consistent with FLARE's observation that aligning at deeper layers is more effective.

6.3 Human Egocentric Co-training Details

Appendix B uses TASTE-Rob as Ego(web): 100,856 video sequences, about 9M frames, with high-quality language alignment. This phase trains for 1 epoch, which takes about 96 hours on 8 H100s. The authors chose TASTE-Rob because its fixed egocentric viewpoint is closer to mainstream VLA camera setups and helps transfer to downstream robot action prediction.

6.4 Reproducibility checklist

The paper and released resources give relatively sufficient information.

Clearly specified: The code and model links are public; the core formulas, teacher encoders, two-stage training steps, batch size and GPU count, RoboTwin data volume, real-robot data volume, hyperparameter ablations, and inference efficiency table are relatively clear.

Still needs practical confirmation: The specific LoRA target modules and rank, the future prefix token length $n$, router architecture details, the specific checkpoint of the Theia variant, the RoboTwin task configuration, and the real-robot control stack need to be confirmed from the code repository.

Minimal reproduction path

  1. Load the RDT-1B pretrained weights, keeping the original SigLIP/T5 conditioning interface and action decoder.
  2. Add a future prefix to the DiT input and read out the prefix representation at layer 21.
  3. Run single-stream, full-parameter mid-training for 15k steps, aligning to future observation embeddings from the Theia-style 86M distilled encoder.
  4. Build 3 Prefix+LoRA experts aligned to CLIP, DINOv2, and ViT respectively, freeze the shared backbone, and train for 5k steps.
  5. Implement the router that aggregates the experts' latent action representations, and add the load-balancing loss and gating label smoothing with $\epsilon=0.1$.
  6. Train with $\lambda_1=0.05$, $h=8$, batch size 32, on 2 H100 GPUs, and evaluate on the 8 RoboTwin tasks in both Easy and Hard settings with 100 trials per task.
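The steps above can be collected into a single configuration object, sketched below. The class and field names are hypothetical and do not come from the released code; values not reported in these notes (LoRA rank, prefix length $n$) are deliberately left as `None` rather than guessed.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class FrappeConfig:
    """Hypothetical container for the settings listed in the steps above."""
    base_model: str = "RDT-1B"
    num_dit_layers: int = 28
    align_layer: int = 21
    mid_teacher: str = "Theia-style distilled encoder (86M)"
    post_teachers: Tuple[str, ...] = ("CLIP", "DINOv2", "ViT")
    mid_steps: int = 15_000
    post_steps: int = 5_000
    batch_size: int = 32
    num_gpus: int = 2
    lambda1: float = 0.05
    horizon: int = 8
    gate_smoothing: float = 0.1
    lora_rank: Optional[int] = None   # not reported here; check the repository
    prefix_len: Optional[int] = None  # token length n; check the repository

    @property
    def total_steps(self) -> int:
        return self.mid_steps + self.post_steps

    @property
    def relative_align_depth(self) -> float:
        return self.align_layer / self.num_dit_layers

    def unresolved(self) -> List[str]:
        """Names of settings that still need confirmation from the code."""
        return [n for n in ("lora_rank", "prefix_len") if getattr(self, n) is None]


cfg = FrappeConfig()
```

Calling `cfg.unresolved()` lists the fields that must be filled in from the repository before a run, which mirrors the "still needs practical confirmation" items in the checklist.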

7. Analysis, Limitations and Boundaries

7.1 The most valuable part of this paper

Judging from the paper's own claims, FRAPPE's main value is that it combines implicit world modeling, parallel scaling, and parameter-efficient fine-tuning into one clear recipe. It does not simply add a future representation loss: it first uses mid-training to absorb the abrupt shift in training objective, then uses Mixture-of-Prefix-and-LoRA (MiPA) to absorb the future representations of multiple VFM teachers in parallel. The training paradigm table shows that skipping mid-training, or using only prefix post-training, lowers performance significantly.

7.2 Why the results hold up

The chain of evidence is relatively complete: the RoboTwin table covers 8 tasks, both Easy and Hard settings, and multiple strong baselines; the training paradigm ablation directly compares combinations of mid-/post-training with full, LoRA, and prefix fine-tuning; the efficiency table shows that parallel expansion adds only about 20 ms of latency at equal denoising steps; the RDT-130M experiment shows that the method is not limited to large models; and the real-world bimanual tasks and human egocentric co-training further support the claims of data efficiency and generalization.

7.3 Limitations stated explicitly or implicitly by the authors

7.4 Applicable boundaries

FRAPPE suits scenarios that already have a pretrained diffusion/VLA backbone and aim to improve generalization with a small amount of robot data and a large amount of action-free video, especially bimanual manipulation, visual perturbations, object changes, and small-data fine-tuning. It is less suitable for systems that must explicitly visualize future trajectories for human review, because it learns latent future representations rather than generating future images.

The other boundary is teacher selection. The core benefit of FRAPPE comes from multi-VFM alignment; if the task domain differs greatly from the visual semantic coverage of CLIP/DINOv2/ViT, teacher selection and the distillation of the mid-training teacher may become key bottlenecks.