FRAPPE: Infusing World Modeling into Generalist Policies via Multiple Future Representation Alignment

Authors: Han Zhao, Jingbo Wang, Wenxuan Song, Shuai Chen, Yang Liu, Yan Wang, Haoang Li, Donglin Wang

Organization: Zhejiang University, Westlake University, HKUST (GZ), South China University of Technology, ShanghaiTech University, Tsinghua University

Publishing status: arXiv v1 submitted on 2026-02-19

Links: arXiv: 2602.17259 | PDF | Project page | code | model

1. Quick overview of the paper

One-sentence summary: FRAPPE adds learnable future prefixes to a diffusion-based VLA such as RDT and trains it in two stages: it first adapts the model to future representation alignment, then expands it in parallel into multiple Prefix+LoRA experts that align with future visual representations from CLIP, DINOv2, and ViT, thereby improving implicit world modeling, generalization, and data efficiency.

Reading guide item	Content
What problem does the paper address?	Explicit world models must predict future pixels, so they tend to over-invest in pixel-level reconstruction, and relying on predicted future observations at inference time accumulates errors; aligning to a single latent representation may be limited by the inductive bias of one vision task.
The authors' approach	Use a learnable future prefix to align with VFM embeddings of future observations; mid-training performs single-stream, full-parameter adaptation against a Theia-style distilled encoder, post-training uses Mixture-of-Prefix-and-LoRA to align multiple VFMs in parallel, and a router then aggregates the expert outputs into actions.
Most important results	RoboTwin 2.0 eight-task average: Easy 57.5%, Hard 25.5%, both the highest among the compared methods; strong performance in unseen settings on the real AgileX bimanual tasks; on the long-horizon three-stage task, RDT scores 0% and FRAPPE 20%.
Things to note when reading	FRAPPE is similar to FLARE but emphasizes multiple future representation alignment and parallel progressive expansion; it uses multiple teacher VFMs during training and does not call them at inference, but it retains the parallel expert computation graph.

Difficulty rating

4/5. Requires familiarity with diffusion/DiT policies, RDT, LoRA, MoE routers, future representation alignment, visual foundation model (VFM) teachers, and mixed training on robot data and human egocentric data.

Keywords

VLA; RDT; Implicit World Modeling; Future Representation Alignment; Prefix Tuning; LoRA; Mixture of Experts; Human Egocentric Videos; RoboTwin

Core contribution list

Multiple future representation alignment. Instead of aligning with a single future visual representation, post-training aligns with three VFM representations (CLIP, DINOv2, and ViT) in parallel to reduce the inductive bias of any one representation.
Parallel progressive expansion. Mid-training first adapts the model to the world-modeling objective; Prefix+LoRA then expands it into multiple parallel experts, which avoids the slow convergence and poor performance of direct parallel training.
Parameter-efficient post-training. The frozen RDT backbone is shared, each expert has its own future prefix and LoRA, and a router aggregates the actions.
Action-free human video becomes usable. For samples without action labels, the action loss is omitted and only the future representation alignment loss is optimized, so human egocentric data can take part in training.
Simulation and real-world validation. Covers RoboTwin Easy/Hard, RDT-1B and RDT-130M, low-data training, a real bimanual mobile manipulator, and long-horizon tasks.

Figure 1: FRAPPE demonstrates performance improvements in simulations, real complex scenarios, and different levels of training data pyramid.

2. Motivation

2.1 What problem should be solved?

VLAs and diffusion policies can already learn multi-modal action distributions, but to perform complex tasks a robot still needs to understand the dynamics of its environment, that is, world modeling. Existing methods often frame world modeling as "predicting future images" and then use that prediction for action generation or as an auxiliary training signal.

The paper identifies two problems with this approach. First, future pixel prediction spends much of its computation on redundant texture and background detail rather than on task-relevant object information. Second, inference relies on future observations generated by the model itself, so prediction errors accumulate over time and degrade the actions.

2.2 Limitations of existing methods

Explicit future-image methods, such as models that jointly generate future frames and actions, may produce poor image quality in out-of-distribution (OOD) scenarios. Implicit alignment methods, such as FLARE, VPP, and other representation-alignment approaches, avoid explicit generation, but if they align with only a single visual representation they may inherit the bias of that vision task and may not suit every robot task.

The core motivation of FRAPPE is that world modeling need not predict the image itself, nor should it be limited to a single representation: the model can be aligned with future representations from multiple vision foundation models at once and gain scaling benefits from multi-stream parallel computation at inference time.

2.3 The solution ideas of this article

FRAPPE uses a two-stage recipe:

Mid-training: Single-stream, full-parameter fine-tuning. A future prefix is added and aligned with a tiny Theia-style teacher encoder obtained by multi-VFM distillation, so that RDT adapts to the world-modeling objective.
Post-training: The shared backbone is frozen and only the multiple future prefixes and their corresponding LoRA modules are trained. Each expert is aligned with its own teacher encoder, and a router finally aggregates the expert outputs to generate actions.

4. Detailed explanation of method

Figure 2: Overview of training and inference. During training, multiple VFM representations are aligned progressively; during inference, the parallel expert computation graph is retained, but no VFM teacher supervision is used.

4.1 Preliminaries: RDT

RDT models the conditional action-sequence distribution $p_\theta(\mathbf{a}_t|\mathbf{o}_t, l)$. Given the language instruction $l$, the observation $\mathbf{o}_t$, the noisy action $\tilde{\mathbf{a}}_t$, and the diffusion timestep $k$, the DiT denoising network $f_\theta$ predicts the clean action chunk.

The original RDT objective: denoise a noisy action chunk into the ground-truth action chunk.

$$\mathcal{L}_{action}=\mathrm{MSE}\left(\mathbf{a}_t, f_\theta(l, \mathbf{o}_t, \tilde{\mathbf{a}}_t, k)\right)$$ $$\tilde{\mathbf{a}}_t=\sqrt{\bar{\alpha}_k}\mathbf{a}_t+\sqrt{1-\bar{\alpha}_k}\epsilon, \quad \epsilon\sim\mathcal{N}(0, I)$$

$\mathbf{o}_t$	Current visual observation.
$l$	Language instructions.
$\tilde{\mathbf{a}}_t$	Noisy action chunk at diffusion timestep $k$.
$f_\theta$	DiT backbone of RDT, conditioned on SigLIP visual tokens and T5 language tokens.
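
The forward-noising equation and the MSE objective above can be made concrete with a small sketch. This is a toy illustration in plain Python, not RDT's implementation: the helper names `add_noise` and `mse`, the 4-element action chunk, and the value of $\bar{\alpha}_k$ are all assumptions chosen for readability.

```python
import math
import random

def add_noise(action_chunk, alpha_bar_k, rng):
    """Forward diffusion: a_tilde = sqrt(alpha_bar_k) * a + sqrt(1 - alpha_bar_k) * eps."""
    signal = math.sqrt(alpha_bar_k)
    noise = math.sqrt(1.0 - alpha_bar_k)
    return [signal * a + noise * rng.gauss(0.0, 1.0) for a in action_chunk]

def mse(target, prediction):
    """Mean squared error between the clean chunk and a predicted chunk."""
    return sum((t - p) ** 2 for t, p in zip(target, prediction)) / len(target)

rng = random.Random(0)
clean = [0.1, -0.3, 0.5, 0.0]            # flattened toy action chunk a_t (hypothetical values)
noisy = add_noise(clean, alpha_bar_k=0.5, rng=rng)

# A perfect denoiser would return `clean`, giving zero loss.
perfect_loss = mse(clean, clean)
# A trivial "denoiser" that just echoes its noisy input is penalized.
identity_loss = mse(clean, noisy)
```

With $\bar{\alpha}_k = 1$ the noise term vanishes and the chunk is returned unchanged; as $\bar{\alpha}_k$ approaches 0 the input becomes almost pure Gaussian noise.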

4.2 Future Prefix Alignment

FRAPPE adds a learnable future prefix $\mathbf{p}\in\mathbb{R}^{n\times d}$ to the RDT input sequence. This lets RDT, which originally only denoised actions, learn a representation of the future state internally. The model now outputs both the action and a future-representation prediction at the prefix positions:

$$\mathbf{a}_t, \mathbf{p}_t=f_\theta(l, \mathbf{o}_t, \tilde{\mathbf{a}}_t, k)$$ $$\mathbf{e}_{t+h}=\Phi(\mathbf{o}_{t+h}), \quad \mathcal{L}_\Phi=\cos(\mathbf{p}_t, \mathrm{sg}(\mathbf{e}_{t+h}))$$

$\Phi$	pretrained VFM teacher encoder.
$h$	future horizon; the appendix ablation finds $h=8$ to be best.
$\mathrm{sg}$	stop-gradient, do not update teacher encoder.
$\mathbf{p}_t$	A future-prefix representation of the RDT output, used to align future observation embeddings.
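
A minimal sketch of a cosine-based alignment loss follows. One assumption is worth stating loudly: the summary writes $\mathcal{L}_\Phi=\cos(\cdot,\cdot)$, and this sketch interprets that as a cosine distance, $1-\cos$, averaged over prefix tokens, so that minimizing the loss increases similarity. The exact form used by the authors should be checked against their code. The function names are hypothetical, and the stop-gradient is represented simply by treating the teacher values as constants.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def alignment_loss(prefix_tokens, teacher_tokens):
    """Mean (1 - cos) over token pairs (assumed form of the loss).

    `teacher_tokens` plays the role of sg(e_{t+h}): it is read, never updated.
    """
    per_token = [1.0 - cosine_similarity(p, e)
                 for p, e in zip(prefix_tokens, teacher_tokens)]
    return sum(per_token) / len(per_token)

# Toy example with n = 2 prefix tokens of dimension d = 2.
aligned = alignment_loss([[1.0, 0.0], [0.0, 2.0]], [[2.0, 0.0], [0.0, 1.0]])
orthogonal = alignment_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

Because cosine similarity ignores vector norms, the prefix output only has to match the direction of the teacher embedding, not its scale.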

4.3 Parallel Scaling: Mixture-of-Prefix-and-LoRA

To leverage knowledge from multiple vision foundation models, FRAPPE builds multiple future-prefix + LoRA experts on a shared RDT backbone. Each expert corresponds to one teacher encoder; the paper sets $M=3$, with CLIP (400M), DINOv2 (142M), and ViT (300M) as teachers.

$$\mathcal{L}_{align}=\sum_{i=1}^{M}\mathcal{L}_{\Phi_i}$$

Among them, $\mathcal{L}_{\Phi_i}$ is the future representation alignment loss of the $i$th VFM teacher.
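
To illustrate how several experts can share one frozen backbone weight while each keeps its own low-rank update, here is a toy LoRA sketch. It follows the standard LoRA form $y = Wx + s\,BAx$; the 2x2 weight, the rank of 1, the scale, and the expert values are all hypothetical, since the summary does not state FRAPPE's LoRA rank or target modules.

```python
def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def lora_forward(W_frozen, A, B, x, scale=1.0):
    """y = W x + scale * B (A x); only A and B would receive gradients."""
    base = matvec(W_frozen, x)
    low_rank = matvec(B, matvec(A, x))
    return [b + scale * d for b, d in zip(base, low_rank)]

W = [[1.0, 0.0], [0.0, 1.0]]             # shared frozen weight (toy 2x2 identity)
experts = {
    # B = 0 is the usual LoRA initialization: the expert starts identical to the backbone.
    "clip":   {"A": [[1.0, 1.0]], "B": [[0.0], [0.0]]},
    # A partially trained expert with a non-zero rank-1 update (made-up values).
    "dinov2": {"A": [[1.0, 0.0]], "B": [[0.5], [0.0]]},
}

x = [2.0, 3.0]
outputs = {name: lora_forward(W, e["A"], e["B"], x) for name, e in experts.items()}
```

The same `W` is reused by every expert, which is why adding experts costs far fewer trainable parameters than duplicating the backbone.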

During inference, each expert produces a latent action representation, the router generates gating weights, and the weighted outputs are aggregated.

$$\mathbf{a}_t=\mathrm{MLP}\left(\sum_{i=1}^{M}w_i\cdot z_i\right), \quad \sum_i w_i=1$$

$z_i$	The latent action representation output by the $i$th expert.
$w_i$	The weight the router assigns to the $i$th expert.
MLP	Shared action head that maps the weighted latent representation to an executable action chunk.
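
The aggregation formula above can be sketched as follows. The softmax gate is an assumption (the summary only states that the weights sum to 1), and the shared MLP action head is left out so the example stays focused on the weighted sum $\sum_i w_i z_i$. The function names are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax; the outputs sum to 1."""
    m = max(logits)
    exps = [math.exp(g - m) for g in logits]
    total = sum(exps)
    return [e / total for e in exps]

def aggregate(expert_latents, router_logits):
    """Return sum_i w_i * z_i and the gating weights w."""
    weights = softmax(router_logits)
    dim = len(expert_latents[0])
    mixed = [sum(w * z[d] for w, z in zip(weights, expert_latents))
             for d in range(dim)]
    return mixed, weights

# M = 3 experts with toy 2-dimensional latents; equal logits give uniform weights.
latents = [[3.0, 0.0], [0.0, 3.0], [0.0, 0.0]]
mixed, weights = aggregate(latents, router_logits=[0.0, 0.0, 0.0])
```

In the full model, `mixed` would then be passed through the shared action head to obtain the action chunk.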

4.4 Load Balance and Label Smoothing

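The authors observed mode collapse: one stream may dominate learning while the other experts are rarely updated. To counter this, they add a load-balancing loss and label smoothing on the gating weights. In the equations below, $\mathbf{g}_{i, j}$ denotes the router logit of expert $i$ for sample $j$ in a batch of size $B$.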

$$\mathcal{L}_{balance}=\frac{1}{B}\sum_{j=1}^{B}\left(\log\sum_{i=1}^{M}e^{\mathbf{g}_{i, j}}\right)^2$$ $$w'_i=w_i(1-\epsilon)+\frac{\epsilon}{M}, \quad \epsilon=0.1$$ $$\mathcal{L}_{total}=\mathcal{L}_{action}+\lambda_1\mathcal{L}_{align}+\lambda_2\mathcal{L}_{balance}$$

The $\lambda_1$ ablation in Appendix A shows that $\lambda_1=0.05$ is optimal; a larger value interferes with the main task of action prediction.
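
The three formulas in this subsection translate almost directly into code. The sketch below uses plain Python on toy numbers. The value of `lambda_2` is a placeholder assumption because the summary does not report it, and the function names are hypothetical.

```python
import math

def balance_loss(router_logits_batch):
    """Mean over the batch of (log sum_i exp(g_ij))^2."""
    total = 0.0
    for logits in router_logits_batch:
        m = max(logits)
        # Stable log-sum-exp of one sample's router logits.
        lse = m + math.log(sum(math.exp(g - m) for g in logits))
        total += lse ** 2
    return total / len(router_logits_batch)

def smooth_weights(weights, eps=0.1):
    """w'_i = w_i * (1 - eps) + eps / M."""
    num_experts = len(weights)
    return [w * (1.0 - eps) + eps / num_experts for w in weights]

def total_loss(l_action, l_align, l_balance, lambda_1=0.05, lambda_2=0.01):
    """L_total = L_action + lambda_1 * L_align + lambda_2 * L_balance.

    lambda_2=0.01 is a made-up placeholder, not a value from the paper.
    """
    return l_action + lambda_1 * l_align + lambda_2 * l_balance

collapsed = [1.0, 0.0, 0.0]              # router that always picks expert 0
smoothed = smooth_weights(collapsed)
zero_logit_balance = balance_loss([[0.0, 0.0, 0.0]])
```

Smoothing guarantees every expert a weight of at least $\epsilon/M$, so even a dominated expert keeps contributing to the output and receiving gradient.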

4.5 Why are Mid-training and Post-training separated?

The paper emphasizes that parallel post-training cannot be applied directly to the base RDT, because the new architecture and objective deviate too far from RDT's original pre-training distribution. Mid-training therefore first performs full-parameter fine-tuning with a single-stream future prefix and a Theia-style 86M distilled encoder to adapt the model to the world-modeling objective; post-training then freezes the backbone and uses LoRA and prefixes to align multiple teachers efficiently.

Implementation points: The main experiment starts from the official RDT-1B pretrained weights and trains for 20k steps in total (15k mid-training + 5k post-training); the training data consists of only 50 task-specific trajectories per task from the Easy setting.
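
The two-stage schedule can be summarized as a lookup from training step to stage settings. This is an illustrative sketch only: the dictionary keys and the function name `stage_for_step` are hypothetical, while the step counts and teacher names come from the text above.

```python
MID_STEPS = 15_000
POST_STEPS = 5_000

def stage_for_step(step):
    """Return what is trained and which teachers are used at a global step."""
    if not 0 <= step < MID_STEPS + POST_STEPS:
        raise ValueError(f"step {step} is outside the 20k-step schedule")
    if step < MID_STEPS:
        return {
            "stage": "mid-training",
            "backbone_frozen": False,          # full-parameter fine-tuning
            "trainable": ["backbone", "future_prefix"],
            "teachers": ["Theia-style distilled encoder (86M)"],
        }
    return {
        "stage": "post-training",
        "backbone_frozen": True,               # shared backbone is frozen
        "trainable": ["future_prefixes", "lora", "router"],
        "teachers": ["CLIP", "DINOv2", "ViT"],
    }

first = stage_for_step(0)
switch = stage_for_step(15_000)
```

Keeping the stage logic in one function makes it easy to check that the backbone is never updated after step 15,000.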

5. Experiments and results

5.1 Simulation Setup

The simulation experiments use RoboTwin 2.0, a real-to-sim bimanual benchmark. Each task has two settings, Easy and Hard; Hard adds domain randomization such as scene clutter, background texture, lighting, and table height. The simulation experiments cover 8 tasks, and each model is evaluated over 100 trials, with the average performance reported.

Training settings: start from the official RDT-1B pretrained weights; training data limited to 50 trajectories per task from the Easy setting; 20,000 steps on two H100 GPUs with batch size 32.

5.2 RoboTwin 2.0 Main Results

Method	Average Easy	Average Hard	Remarks
DP	31.3%	0.0%	train-from-scratch visuomotor baseline.
VPP	35.8%	4.0%	implicit world model baseline.
RDT	47.4%	15.1%	FRAPPE's base model.
$\pi_0$	57.1%	14.1%	RoboTwin SOTA baseline.
$\pi_{0.5}$	45.4%	13.3%	$\pi_0$ successor.
FRAPPE	57.5%	25.5%	Highest average on Easy, and highest on Hard by a clear margin.

The improvement on the Hard setting matters most: FRAPPE raises the average from RDT's 15.1% to 25.5%, which also exceeds $\pi_0$ and $\pi_{0.5}$. The authors attribute this to the model learning the underlying dynamics behind multiple visual observations rather than relying on spurious visual correlations.
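
The gains quoted above are simple differences between rows of the main results table. The short sketch below recomputes them; the numbers are copied from the table, and the helper name `gain_over` is hypothetical.

```python
# (Easy %, Hard %) eight-task averages, copied from the main results table.
results = {
    "DP": (31.3, 0.0),
    "VPP": (35.8, 4.0),
    "RDT": (47.4, 15.1),
    "pi_0": (57.1, 14.1),
    "pi_0.5": (45.4, 13.3),
    "FRAPPE": (57.5, 25.5),
}

def gain_over(base, method="FRAPPE"):
    """Absolute gain in percentage points (Easy, Hard) of `method` over `base`."""
    easy_b, hard_b = results[base]
    easy_m, hard_m = results[method]
    return round(easy_m - easy_b, 1), round(hard_m - hard_b, 1)

best_easy = max(results, key=lambda name: results[name][0])
best_hard = max(results, key=lambda name: results[name][1])
gain_vs_rdt = gain_over("RDT")
gain_vs_pi0 = gain_over("pi_0")
```

Against $\pi_0$ the Easy margin is small, while the Hard margin is more than eleven points, which is why the Hard column carries most of the argument.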

5.3 Training Paradigm Ablation

No.	Method	Steps	Easy	Hard	Average
0	RDT	20k	59.0	20.5	39.8
1	mid-train full ft	20k	63.0	27.5	45.3
2	mid-train prefix & LoRA ft	20k	48.0	8.5	28.3
3	post-train prefix ft	20k	25.0	4.0	14.5
4	post-train prefix & LoRA ft	20k	46.0	9.0	27.5
5	mid full ft + post prefix ft	15k + 5k	68.0	21.5	44.8
6	mid full ft + post prefix & LoRA ft	15k + 5k	73.5	32.0	52.3

The conclusion is clear: mid-training must come first and must use full-parameter fine-tuning; post-training alone performs poorly; the best recipe is 15k steps of full-parameter mid-training followed by 5k steps of prefix & LoRA post-training.

5.4 Inference Efficiency

Metric	RDT 5 steps	mid-train 5 steps	post-train 5 steps	post-train 3 steps
Inference Memory	3.7 GB	3.7 GB	8.0 GB	8.0 GB
Latency	0.214 s	0.228 s	0.235 s	0.173 s
Success Rate	39.8%	45.3%	52.3%	48.5%

The parallel experts added in post-training increase inference GPU memory from 3.7 GB to 8.0 GB, but at the same 5 denoising steps the latency rises by only about 20 ms over RDT. With 3 denoising steps, latency drops below that of RDT at 5 steps, while the success rate stays above the baseline.
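
The trade-off in the efficiency table can be checked with a few lines of arithmetic. The numbers are copied from the table; the variable names are only illustrative.

```python
# Latency in seconds and memory in GB, copied from the efficiency table.
latency_s = {"rdt_5": 0.214, "mid_5": 0.228, "post_5": 0.235, "post_3": 0.173}
memory_gb = {"rdt_5": 3.7, "post_5": 8.0}

# Extra latency of post-trained FRAPPE over RDT at 5 denoising steps.
extra_ms = (latency_s["post_5"] - latency_s["rdt_5"]) * 1000
relative_overhead = latency_s["post_5"] / latency_s["rdt_5"] - 1

# Latency saved by running FRAPPE with 3 steps instead of RDT with 5.
saving_3_steps = 1 - latency_s["post_3"] / latency_s["rdt_5"]

# Memory growth caused by the parallel experts.
memory_ratio = memory_gb["post_5"] / memory_gb["rdt_5"]

print(f"extra latency: {extra_ms:.0f} ms ({relative_overhead:.1%})")
print(f"3-step saving vs RDT 5-step: {saving_3_steps:.1%}")
print(f"memory ratio: {memory_ratio:.2f}x")
```

The latency overhead is roughly a tenth of the baseline, whereas memory slightly more than doubles, so memory rather than latency is the main deployment cost of the parallel experts.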

5.5 Smaller-scale Policy Model

Figure 3: Verification on RDT-130M. FRAPPE recipe can also improve performance on small models. The difference between LoRA post-training and full-parameter post-training is only about 2-3%.

The authors use RDT-130M to show that the training paradigm does not depend on the 1B parameter scale. The original RDT-130M generalizes weakly on hard tasks, but FRAPPE improves hard-task performance substantially and brings it close to the level of naive RDT-1B fine-tuning.

5.6 Real-world Experiments

The real-world experiments use a bimanual AgileX mobile manipulator with 6 DoF per arm and parallel grippers. The vision system consists of a high-mounted third-person main camera and two wrist-mounted egocentric cameras. Training data: 25 demonstrations per variation for basic tasks and 100 demonstrations for long-horizon tasks. Evaluation: 40 trials each for basic tasks and 20 trials each for long-horizon tasks.

Figure 4: Real task seen/unseen setup. FRAPPE performs well in lighting, height, pose, and object variations, especially unseen settings.

Figure 5: The long-horizon task contains three temporally dependent subtasks and four interactive objects. RDT did not complete the task in any trial, whereas FRAPPE achieved a full-task success rate of 20%.

5.7 Human Egocentric Co-training

FRAPPE proposes a data pyramid: the bottom layer is large-scale action-free human egocentric data, the middle layer is task-specific human egocentric data, and the top layer is task-specific robot teleoperation data. The authors emphasize that the task-specific human data is not recorded with GoPro or VR devices but with a static third-person camera consistent with the robot data. With this setup a novice human operator can collect more than 360 trajectories per hour, whereas skilled robot teleoperation typically yields about 120 trajectories per hour.

The co-training experiments used 5 robot action trajectories per object, 50 task-specific human egocentric trajectories, and 10k task-irrelevant human egocentric videos. For action-free samples, action loss is omitted and only alignment loss is optimized.
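
The rule for action-free samples amounts to masking the action term per sample. The sketch below shows one way to write it. The dictionary layout, the function name, and the choice to keep the $\lambda_1$ weight on action-free samples are assumptions rather than details confirmed by the summary; the load-balancing term is omitted for brevity.

```python
def batch_loss(samples, lambda_1=0.05):
    """Average loss over a mixed batch of robot and action-free human samples.

    Each sample is a dict with an 'align' loss and, for robot data only,
    an 'action' loss. Samples without 'action' contribute alignment loss only.
    """
    total = 0.0
    for sample in samples:
        total += lambda_1 * sample["align"]
        if sample.get("action") is not None:
            total += sample["action"]
    return total / len(samples)

mixed_batch = [
    {"align": 1.0, "action": 0.5},   # robot teleoperation sample
    {"align": 1.0},                  # human egocentric video, no action labels
]
loss = batch_loss(mixed_batch)
human_only = batch_loss([{"align": 2.0}])
```

Because the human samples still pass through the same backbone and prefix, they shape the future representation even though they never touch the action head's loss.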

Figure 6: Co-training with human egocentric data that has no action labels. Large-scale Ego(web) provides a strong inductive prior for novel objects, and Ego(task) improves spatial generalization. Combining the two works best.

6. Reproducibility audit

6.1 Key training configurations

Item	Configuration
Base model	official RDT-1B pretrained weights; small model verification is RDT-130M.
Simulation data	RoboTwin Easy setting, 50 task-specific trajectories per task.
Training budget	2 NVIDIA H100 GPUs; 20,000 steps; batch size 32.
FRAPPE schedule	15,000 mid-training steps + 5,000 post-training steps.
Post-training trainable params	future prefixes + LoRA + router/action aggregation related lightweight modules; shared RDT backbone frozen.
Teacher encoders	Mid-training: 86M Theia-style distilled encoder; Post-training: CLIP 400M, DINOv2 142M, ViT 300M.
Evaluation	100 trials each for RoboTwin; 40 trials each for real basic tasks; 20 trials each for real long-horizon.

6.2 Appendix Hyperparameter Ablation

$\lambda_1$	0	0.001	0.02	0.05	0.1	0.5
SR	14.0%	18.5%	26.4%	32.5%	22.0%	23.5%

Alignment depth	7	14	21	28
SR	14.5%	18.0%	23.5%	16.0%

Future horizon $h$	8	16	32
SR	35.3%	35.0%	29.7%

Appendix A uses layer 21 of the 28-layer DiT in RDT-1B for future prefix alignment, which is 3/4 of the total depth; this is consistent with the FLARE observation that aligning at deeper layers is more effective.
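
The chosen hyperparameters are simply the best entries of the three ablation tables above. A short sketch that reads them off (success rates copied from the tables; the dictionary and variable names are illustrative):

```python
# Success rate (%) per setting, copied from the appendix ablation tables.
ablations = {
    "lambda_1": {0: 14.0, 0.001: 18.5, 0.02: 26.4, 0.05: 32.5, 0.1: 22.0, 0.5: 23.5},
    "alignment_depth": {7: 14.5, 14: 18.0, 21: 23.5, 28: 16.0},
    "future_horizon": {8: 35.3, 16: 35.0, 32: 29.7},
}

# Pick the setting with the highest success rate in each table.
best = {name: max(table, key=table.get) for name, table in ablations.items()}

# Relative depth of the best alignment layer in the 28-layer DiT.
depth_fraction = best["alignment_depth"] / 28
```

Note that $h=8$ and $h=16$ differ by only 0.3 points, so the horizon choice is much less sensitive than the choice of $\lambda_1$ or alignment depth.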

6.3 Human Egocentric Co-training Details

Appendix B uses TASTE-Rob as Ego(web): 100,856 video sequences, about 9M frames, with high-quality language alignment. This phase trains for 1 epoch, which takes about 96 hours on 8 H100 GPUs. The authors chose TASTE-Rob because its fixed egocentric viewpoint is closer to mainstream VLA camera setups and helps transfer to downstream robot action prediction.

6.4 Reproducibility checklist

Information provided by the paper and resources

Sufficiently specified: The code and model links are public; the core formulas, teacher encoders, two-stage training steps, batch size and GPUs, RoboTwin data volume, real-robot data volume, hyperparameter ablations, and inference efficiency table are all reasonably clear.

Still needs practical confirmation: The specific LoRA target modules and rank, the future prefix token length $n$, the router architecture details, the exact checkpoint of the Theia variant, the RoboTwin task configuration, and the real-robot control stack need to be confirmed from the code repository.

Minimal reproduction path

Load the RDT-1B pretrained weights, keeping the original SigLIP/T5 conditioning interface and action decoder.
Add a future prefix to the DiT input and read out the prefix representation at layer 21.
Run single-stream, full-parameter mid-training for 15k steps, aligning with future-observation embeddings from the Theia-style 86M distilled encoder.
Build 3 Prefix+LoRA experts aligned with CLIP, DINOv2, and ViT respectively, freeze the shared backbone, and train for 5k steps.
Implement router aggregation of the latent action representations, and add the load-balancing loss and gating label smoothing with $\epsilon=0.1$.
Train with $\lambda_1=0.05$, $h=8$, batch size 32, and 2 H100 GPUs, then evaluate on the 8 RoboTwin Easy/Hard tasks with 100 trials per task.
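
The steps above can be collected into a single configuration object. This is a hypothetical sketch: the class and field names are not taken from the authors' repository, and the fields left as `None` are exactly the ones the checklist marks as needing confirmation from the code.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass(frozen=True)
class FrappeRecipe:
    """Reported settings for reproducing the main RoboTwin experiment."""
    base_weights: str = "RDT-1B"
    alignment_layer: int = 21                 # of the 28 DiT layers
    mid_steps: int = 15_000                   # full-parameter, single stream
    post_steps: int = 5_000                   # prefix + LoRA, frozen backbone
    mid_teacher: str = "Theia-style distilled encoder (86M)"
    post_teachers: Tuple[str, ...] = ("CLIP", "DINOv2", "ViT")
    lambda_1: float = 0.05                    # alignment loss weight
    gate_smoothing_eps: float = 0.1
    future_horizon: int = 8
    batch_size: int = 32
    num_gpus: int = 2                         # NVIDIA H100
    eval_trials_per_task: int = 100
    lora_rank: Optional[int] = None           # not stated in the summary
    prefix_length: Optional[int] = None       # n, not stated in the summary

    def total_steps(self) -> int:
        return self.mid_steps + self.post_steps

    def unresolved(self) -> List[str]:
        """Names of fields that must still be looked up in the code repository."""
        return [name for name in ("lora_rank", "prefix_length")
                if getattr(self, name) is None]

recipe = FrappeRecipe()
```

Calling `recipe.unresolved()` gives a quick reminder of which values to fill in before launching a run.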

7. Analysis, Limitations and Boundaries

7.1 The most valuable part of this paper

Judging from the paper's own claims, the most valuable thing about FRAPPE is that it combines "implicit world modeling + parallel scaling + parameter-efficient fine-tuning" into a clear recipe. It does not simply add a future representation loss: it first uses mid-training to absorb the abrupt shift in training objective, and then uses Mixture-of-Prefix-and-LoRA (MiPA) to absorb the future representations of multiple VFM teachers in parallel. The training paradigm table shows that skipping mid-training, or using prefix-only post-training, degrades performance significantly.

7.2 Why the results hold up

The chain of evidence is relatively complete: the RoboTwin table covers 8 tasks, two settings of Easy/Hard and multiple SOTA baselines; the training paradigm ablation directly compares the combination of mid/post/full/LoRA/prefix; the efficiency table shows that parallel expansion does not cause unacceptable delays; the RDT-130M experiment shows that the method is not only suitable for large models; the real two-arm task and human egocentric co-training further support the claims of data efficiency and generalization.

7.3 Limitations stated or implied by the authors

Absolute success rate on Hard is still low. FRAPPE has the highest Hard average, but 25.5% shows that visual generalization under strong domain randomization is far from solved.
Success on real long-horizon tasks is limited. RDT scores 0% and FRAPPE 20%, which is an improvement, but long-horizon tasks are still far from reliable.
Engineering complexity increases. Post-training requires multiple teacher encoders, multiple prefix/LoRA experts, a router, load balancing, and label smoothing; reproduction is more complex than with a single alignment loss.
GPU memory increases. Post-training inference memory rises from 3.7 GB to 8.0 GB, which is still within the range of common inference GPUs but raises deployment resource requirements.
Real-world data collection relies on a fixed viewpoint. The task-specific human egocentric data is actually recorded with a static third-person camera rather than GoPro or VR devices, which indicates that the pipeline still depends on consistent camera setups.

7.4 Applicable boundaries

FRAPPE suits scenarios that already have a pretrained diffusion/VLA backbone and aim to improve generalization with a small amount of robot data and a large amount of action-free video, especially for bimanual manipulation, visual perturbations, object changes, and low-data fine-tuning. It is less suitable for systems that must explicitly visualize predicted futures for human review, because it learns latent future representations rather than generating future images.

The other boundary is teacher selection. The core benefit of FRAPPE comes from multi-VFM alignment; if the task domain differs greatly from the visual semantics covered by CLIP, DINOv2, and ViT, teacher selection and the distillation of the mid-training teacher may become key bottlenecks.