
F1: A Vision-Language-Action Model Bridging Understanding and Generation to Actions

Authors: Qi Lv, Weijie Kong, Hao Li, Jia Zeng, Zherui Qiu, Delin Qu, Haoming Song, Qizhi Chen, Xiang Deng, Jiangmiao Pang

Organization: Shanghai AI Laboratory; Harbin Institute of Technology (Shenzhen)

Version: arXiv:2509.06951 v2, submitted 2025-09-08, revised 2025-09-09; LaTeX source uses the ICLR 2026 conference style

Links: arXiv | PDF | Project home page | official code | HuggingFace

1. Quick overview of the paper

One-sentence summary: F1 shifts VLA action prediction from "react directly to the current state" to "first generate a visual foresight of the next state, then use that predicted future image to guide inverse-dynamics action generation," connecting understanding, imagination, and execution through three Transformer experts: understanding, generation, and action.

Difficulty rating: ★★★★☆. Reading requires familiarity with VLA, predictive inverse dynamics, Mixture-of-Transformer, VAR/RQ-VAE next-scale image generation, flow matching action prediction, and LIBERO / SimplerEnv / real robot evaluation.

Keywords: Vision-Language-Action, Visual Foresight, Predictive Inverse Dynamics, Mixture-of-Transformer, Flow Matching, Long-horizon Robotics

| Reading target | Short answer |
| --- | --- |
| What problem does the paper solve? | Most existing VLAs are reactive state-to-action mappings, which are short-sighted and fragile in dynamic scenes and long-horizon tasks; visual-prediction policies often lack VLM semantic grounding. F1 aims to combine semantic understanding, future visual prediction, and action execution into a single reasoning chain. |
| The authors' approach | A Mixture-of-Transformer with three experts: the understanding expert processes language and the current observation; the generation expert uses next-scale prediction to produce a goal-conditioned foresight image; the action expert takes the foresight as an explicit target and generates action chunks via flow matching. |
| Most important results | Real-robot Genie-1 nine-task average success rate of 82.2%, vs. 65.2% for $\pi_0$; the pretrained version averages 95.7% on LIBERO, ranking first; the pretrained version reaches an overall average of 72.9% on SimplerEnv Bridge; the dynamic conveyor-belt task reaches 66.7% vs. 33.3% for $\pi_0$. |
| Things to note when reading | F1's generation module is not just a training auxiliary; the paper explicitly has the model predict future observations at inference time, turning action generation into foresight-guided inverse dynamics. The predicted image need not be pixel-perfect, but it must provide task-progress cues. |

Core contribution list

VLA paradigm comparison
Fig. 1 / VLA paradigm evolution: from pure action expert, to VLM+action expert, to visual prediction, to F1's understanding-generation-action integration.

2. Motivation

2.1 What problem should be solved?

Real robotic environments are not a static image-classification problem: objects move, scenes change, and instructions often require multiple steps to unfold. A traditional VLA maps directly from the current image and language to actions, which leads to short-sighted behavior under dynamic targets, long-horizon sequences, and distribution shift. F1 argues that the robot should explicitly predict "what the next visual state should look like if the task keeps advancing" before acting.

This problem is formalized as predictive inverse dynamics: first predict the future observation $\hat{o}_{t+1}$, then infer the action chunk needed to move from the current state to this future visual target. Actions are no longer just reactions to the current frame; they are directed toward a generated visual goal.
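A compact way to write this factorization (notation follows the surrounding text; the conditioning sets are simplified, and the symbols $g_\phi$ and $\pi_\theta$ are introduced here only for illustration):

```latex
% Reactive VLA: map current observation and instruction directly to actions
\hat{a}_{t:t+k} \sim \pi_\theta\!\left(\cdot \mid o_t, l\right)

% Predictive inverse dynamics (F1-style): first imagine, then act toward the imagination
\hat{o}_{t+1} \sim g_\phi\!\left(\cdot \mid o_{t-m:t}, l\right)
\qquad
\hat{a}_{t:t+k} \sim \pi_\theta\!\left(\cdot \mid o_t, q_t, l, \hat{o}_{t+1}\right)
```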

2.2 Limitations of existing methods

The paper divides manipulation policies into three categories. End-to-end action experts such as ACT and Diffusion Policy lack semantic grounding and cross-task generalization; VLM-integrated policies such as $\pi_0$ and gr00t-N1 understand better but remain reactive and do not model scene evolution; visual-prediction policies such as VPP and Genie Envisioner can predict future vision but do not fully integrate VLM semantic understanding and have limited control robustness.

Therefore, F1's core question is not "add a video-prediction loss" but how to find an architecture and training recipe that connects semantic understanding, future visual prediction, and action execution in a controllable information flow.

2.3 Solution ideas of this article

F1 uses three MoT experts: the understanding expert inherits VLM capabilities; the generation expert produces foresight via RQ-VAE / next-scale prediction; the action expert conditions on the current observation, language, proprioception, and the predicted future image to generate continuous action chunks via flow matching. Training proceeds in three stages: first align the generation expert to the frozen understanding expert, then jointly pretrain on large-scale robot data, and finally post-train for the target platform.

4. Detailed explanation of method

F1 architecture
Fig. 2 / F1 framework: three experts MoT, UGA progressive attention, foresight generation and action expert.

4.1 Overall pipeline

Given the language instruction $l$, the current observation $o_t$ and the historical observation $\{o_{t-m}, \ldots, o_{t-1}\}$, the calculation chain of F1 is as follows:

Input: instruction l, current observation o_t, history o_{t-m:t-1}, proprioception q_t
1. Understanding expert:
   encode language + current visual observation into semantic multimodal representation

2. Generation expert:
   use history + language + understanding representation
   generate visual foresight image: o_hat_{t+1}

3. Action expert:
   condition on l, o_t, q_t, o_hat_{t+1}
   generate action chunk a_hat_{t:t+k} using flow matching

The paper emphasizes that $\hat{o}_{t+1}$ is an explicit planning target. The action expert does not rely only on the current state; it steers toward the predicted future visual state.
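A minimal sketch of this inference chain in PyTorch-flavored pseudocode; the module names (model.understanding, model.generation, model.action), the number of scales, and the Euler integration loop are illustrative assumptions, not the repository's actual API:

```python
import torch

@torch.no_grad()
def f1_inference_step(model, instruction, obs_history, obs_t, q_t, num_denoise_steps=10):
    """One F1 control step: understand -> imagine -> act (illustrative sketch)."""
    # 1. Understanding expert: fuse language and the current observation
    #    into a semantic multimodal representation.
    sem = model.understanding(instruction, obs_t)

    # 2. Generation expert: autoregressively predict next-scale visual tokens
    #    conditioned on history + language + semantic features, then decode
    #    them into a foresight image o_hat_{t+1}.
    foresight_tokens = model.generation.predict_scales(obs_history, instruction, sem, num_scales=4)
    o_hat_next = model.generation.decode(foresight_tokens)

    # 3. Action expert: flow-matching denoising from Gaussian noise to an
    #    action chunk, conditioned on the foresight image as an explicit goal.
    actions = torch.randn(model.chunk_size, model.action_dim)
    for step in range(num_denoise_steps):
        tau = step / num_denoise_steps
        v = model.action(actions, tau, sem, obs_t, q_t, o_hat_next)  # predicted velocity field
        actions = actions + v / num_denoise_steps                    # Euler integration step
    return actions, o_hat_next
```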

4.2 Understanding Expert

The understanding expert is initialized from a pretrained vision-language model. The current observation $o_t$ is first encoded into high-level perceptual features by the SigLIP vision encoder, then fed into the decoder-only Transformer together with the language prompt. The implementation details state that the understanding expert architecture matches PaliGemma and inherits weights from $\pi_0$.

4.3 Generation Expert: Next-Scale Visual Foresight

The generation expert is responsible for producing $\hat{o}_{t+1}$. Instead of costly diffusion-based video generation, it uses VAR-style next-scale prediction. The recent observation history $\{o_{t-m}, \ldots, o_t\}$ is encoded with multi-scale residual vector quantization, decomposing each frame into multi-scale tokens $\{z_i^0, \ldots, z_i^k\}$ over a $16\times16$ patch grid. To keep the multi-frame token sequence from growing too long, the model uses a temporal convolutional network to aggregate motion-relevant features.

Residual VQ-VAE
Fig. 3 / Residual VQ-VAE: reconstruction from low to high scale for next-scale foresight token prediction.
Inference trade-off: in training Stage I, the generation expert predicts 10 resolution scales with VAR; Stage II/III and inference are limited to 4 scales for efficiency. The appendix training table likewise lists "# Num Predicted Scales" as 10 for Stage I and 4 for the remaining stages.
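The multi-scale residual quantization behind next-scale prediction can be sketched as follows; the scale list, codebook lookup, and function name are illustrative assumptions rather than the exact RQ-VAE configuration used by F1:

```python
import torch
import torch.nn.functional as F

def encode_next_scale_tokens(feat, codebook, scales=(1, 2, 4, 8, 16)):
    """Decompose a feature map into coarse-to-fine residual token maps (sketch).

    feat:     (C, H, W) latent feature map from the image encoder (e.g. a 16x16 patch grid)
    codebook: (K, C) shared quantizer codebook
    returns:  list of token index maps, one per scale
    """
    residual, tokens = feat, []
    for s in scales:
        # Downsample the residual to the current scale and quantize it.
        coarse = F.interpolate(residual.unsqueeze(0), size=(s, s), mode="area").squeeze(0)
        flat = coarse.permute(1, 2, 0).reshape(-1, coarse.shape[0])   # (s*s, C)
        idx = torch.cdist(flat, codebook).argmin(dim=-1)              # nearest codebook entries
        tokens.append(idx.view(s, s))

        # Reconstruct this scale's contribution and subtract it from the residual.
        quant = codebook[idx].view(s, s, -1).permute(2, 0, 1)
        up = F.interpolate(quant.unsqueeze(0), size=residual.shape[-2:], mode="bilinear").squeeze(0)
        residual = residual - up
    return tokens
```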

4.4 Action Expert: Foresight-Guided Flow Matching

The action expert is conditioned on language, the current observation, the foresight image, and proprioception, and outputs a short-horizon action chunk $\hat{a}_{t:t+k}$. The paper uses ACT-style chunked action prediction and flow matching to learn a vector field from Gaussian noise to expert actions in the continuous action space.

The training objective is: given the current context and a noised action $a_t^\tau$, predict the vector pointing from the noise toward the true action.

$$a_t^\tau = (1-\tau)\epsilon + \tau a_t, \quad \tau\sim\mathcal{U}(0, 1), \quad \epsilon\sim\mathcal{N}(0, I)$$

$$\mathcal{L}_{\mathrm{action}} = \mathbb{E}\left[\left\|\pi_\theta(l, \{o_i\}_{i=t-m}^{t}, q_t, a_t^\tau)-(a_t-\epsilon)\right\|^2\right]$$

- $q_t$: proprioception at time $t$.
- $a_t^\tau$: interpolation between the true action $a_t$ and noise $\epsilon$.
- $a_t-\epsilon$: flow-matching target vector, pointing from noise toward the true action.
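A minimal PyTorch sketch of this objective, assuming a generic `policy` callable that returns the predicted vector field (illustrative, not the repository code):

```python
import torch

def flow_matching_loss(policy, context, a_true):
    """Flow-matching action loss from the interpolation above (sketch).

    context: whatever the action expert conditions on (l, observations, q_t, foresight)
    a_true:  (chunk_len, action_dim) ground-truth action chunk a_t
    """
    tau = torch.rand(())                        # tau ~ U(0, 1)
    eps = torch.randn_like(a_true)              # epsilon ~ N(0, I)
    a_noisy = (1.0 - tau) * eps + tau * a_true  # a_t^tau interpolation
    target = a_true - eps                       # target vector field: noise -> action
    pred = policy(context, a_noisy, tau)        # predicted vector field
    return ((pred - target) ** 2).mean()
```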

4.5 UGA Progressive Attention

UGA stands for Understanding-Generation-Action. Within each expert, tokens interact richly, and the foresight tokens in the generation expert additionally follow a causal / scale-conditioned pattern. Across experts, information flows as a one-way hierarchy: generation attends to understanding, action attends to both, action cannot influence generation in the reverse direction, and understanding receives nothing from downstream modules.

This design has two functions: first, to prevent reverse leakage of action tokens during training, making foresight truly an intermediate representation; second, to make the model structure interpretable as "first understand, then imagine, then execute."
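One way to realize this one-way hierarchy is a block-wise attention mask over the concatenated token sequence. The layout below is an illustrative sketch; the released code may implement the constraint differently:

```python
import torch

def uga_attention_mask(n_und, n_gen, n_act):
    """Block mask enforcing understanding <- generation <- action (sketch).

    True = attention allowed. Within-expert blocks are fully visible here;
    in practice the generation block would additionally follow a causal /
    scale-conditioned pattern for next-scale prediction.
    """
    n = n_und + n_gen + n_act
    mask = torch.zeros(n, n, dtype=torch.bool)
    u = slice(0, n_und)
    g = slice(n_und, n_und + n_gen)
    a = slice(n_und + n_gen, n)

    mask[u, u] = True   # understanding sees only itself
    mask[g, u] = True   # generation sees understanding ...
    mask[g, g] = True   # ... and itself
    mask[a, u] = True   # action sees understanding ...
    mask[a, g] = True   # ... the generated foresight ...
    mask[a, a] = True   # ... and itself; nothing flows backwards
    return mask
```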

4.6 Three-stage training

| Stage | Goal | Training method | Key hyperparameters (Appendix Training Details) |
| --- | --- | --- | --- |
| Pretrain Stage I | Inject foresight capability into the generation expert and align it with the frozen understanding expert. | The understanding expert inherits $\pi_0$ weights and is frozen; the generation expert is randomly initialized and trained with teacher forcing to predict ground-truth future visual tokens. | batch 1280, lr 3e-4 cosine, 512K steps, understanding resolution 224, generation resolution 256, 10 predicted scales. |
| Pretrain Stage II | Jointly optimize all three experts on large-scale robot data to learn general visuomotor knowledge. | Joint training with autoregressively predicted foresight tokens + flow-matching action prediction. | batch 2880, lr 5e-5 constant, 100K steps, Gen:Act loss weight 0.1:1, action chunk size 30, 4 predicted scales. |
| Post-train Stage III | Adapt to downstream platforms and tasks. | Finetune on LIBERO, Simpler, Genie-1, Franka, Dynamic, Long-horizon and other task data. | batch 128, lr 5e-5 cosine; Simpler 10 epochs, Genie/Franka/Dynamic 40 epochs, Long-horizon 60 epochs; action chunk size 4/8/50 etc. depending on task. |

The overall Stage II/III objective combines future-image prediction with action generation.

$$\mathcal{L}_{\mathrm{total}}=\mathcal{L}_{\mathrm{gen}}^{\mathrm{pred}}+\lambda\cdot\mathcal{L}_{\mathrm{action}}$$

Among them, $\mathcal{L}_{\mathrm{gen}}^{\mathrm{pred}}$ is the autoregressive next-scale visual token NLL, and $\lambda$ controls the action loss weight.

4.7 Data scale

The appendix Dataset Details section gives complete statistics: 330.9K trajectories and 73.8M frames in total, covering the Genie-G1, Franka, WidowX, Google Robot, and ARX LIFT II embodiments, with multiple views (third-person, wrist/head) and frame rates of 3-30 FPS.

| Data source | Stage | Embodiment | # Trajs | # Frames |
| --- | --- | --- | --- | --- |
| Agibot-World | I + II | Genie-G1 | 187K | 66.4M |
| LIBERO | I + II + III | Franka | 1.7K | 0.3M |
| OXE-Bridge-v2 | I + II + III | WidowX | 53.2K | 1.9M |
| OXE-Fractal | I + II | Google Robot | 87.2K | 3.8M |
| In-house tasks | III | Genie-G1 / Franka / ARX LIFT II | ~1.8K | ~1.4M |

5. Experiment

Real-world tasks
Fig. 4 / Real-world robot experiments: A total of 12 types of experiments on Genie-1, Franka, and ARX LIFT II.

5.1 Real-world Genie-1: nine tasks

Each task is evaluated over 15 trials, comparing $\pi_0$, gr00t-N1, gr00t-N1.5, and F1. F1's average success rate is 82.2%, higher than $\pi_0$'s 65.2%, gr00t-N1.5's 53.3%, and gr00t-N1's 30.4%.

| Method | Pen | Flower | Chip | Tea Table | Tea Shelf | Bread | Handover | Handover R2H | Mixture | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| $\pi_0$ | 66.7 | 66.7 | 86.7 | 86.7 | 73.3 | 66.7 | 33.3 | 40.0 | 66.7 | 65.2 |
| gr00t-N1 | 46.7 | 33.3 | 33.3 | 40.0 | 13.3 | 33.3 | 26.7 | 13.3 | 33.3 | 30.4 |
| gr00t-N1.5 | 73.3 | 40.0 | 46.6 | 73.5 | 26.6 | 53.3 | 60.0 | 40.0 | 66.7 | 53.3 |
| F1 | 93.3 | 80.0 | 100.0 | 93.3 | 86.7 | 66.7 | 80.0 | 73.3 | 66.7 | 82.2 |

The paper emphasizes that Handover (R2H) requires dynamic adjustment and human-robot handover, where F1 reaches 73.3%, markedly higher than $\pi_0$'s 40.0% and gr00t-N1's 13.3%.

5.2 LIBERO

| Method | Pretrained | Spatial | Object | Goal | Long | Average | Avg Rank |
| --- | --- | --- | --- | --- | --- | --- | --- |
| $\pi_0$ | Yes | 98.0 | 96.8 | 94.4 | 88.4 | 94.4 | 2 |
| gr00t-N1 | Yes | 94.4 | 97.6 | 93.0 | 90.6 | 93.9 | 4 |
| CoT-VLA | Yes | 87.5 | 91.6 | 87.6 | 69.0 | 83.9 | 6 |
| F1 | No | 97.4 | 97.6 | 94.2 | 88.0 | 94.3 | 3 |
| F1 | Yes | 98.2 | 97.8 | 95.4 | 91.3 | 95.7 | 1 |

LIBERO-Long is the suite with the most long-horizon planning pressure. The pretrained F1 reaches 91.3%, exceeding $\pi_0$'s 88.4% and gr00t-N1's 90.6%, which supports the authors' contention that foresight helps on long-horizon tasks.

5.3 SimplerEnv Bridge

| Method | Pretrained | Carrot Success | Eggplant Success | Spoon Success | Stack Success | Overall Avg |
| --- | --- | --- | --- | --- | --- | --- |
| SpatialVLA | Yes | 25.0 | 100.0 | 16.7 | 29.2 | 47.9 |
| $\pi_0$ | Yes | 0.0 | 16.6 | 62.5 | 29.1 | 40.1 |
| $\pi_0$-Fast | Yes | 21.9 | 10.8 | 66.6 | 29.1 | 48.3 |
| F1 | No | 33.3 | 75.0 | 45.8 | 62.5 | 66.1 |
| F1 | Yes | 70.8 | 66.7 | 50.0 | 50.0 | 72.9 |

The SimplerEnv Bridge tasks emphasize fine-grained placement. F1's overall average is clearly higher than the next-strongest baseline, which the paper attributes to foresight helping the policy adapt to changes in source-object configuration and target location.

5.4 Simulated ablation

| Variant | Meaning | Avg | Main conclusion |
| --- | --- | --- | --- |
| F1 | complete model | 77.5 | Baseline. |
| Frozen-Gen | Freeze the generation expert after Stage I | 73.8 | A fixed generation expert is still useful, but subsequent end-to-end adaptation improves task alignment. |
| Cotrain-Scratch | Remove Stage II large-scale robot pretraining | 74.2 | Stage II provides a manipulation prior and stabilizes optimization. |
| No-Gen | Remove the generation expert | 60.3 | Visual foresight is the most critical module; SimplerEnv collapses noticeably when it is removed. |
| 2-Scales | Predict only 2 scales at inference | 73.4 | The foresight is too coarse and lacks planning information. |
| 6-Scales | Predict 6 scales at inference | 75.8 | More scales are not necessarily better: more compute and possible instability; the paper ultimately uses 4 scales. |

5.5 Real task ablation and dynamic generalization

Real task ablation shows that No-Gen achieves only 40.0% and 60.0% in Handover (R2H) and Mixture respectively, while F1 reaches 93.3% and 73.3%. Cotrain-Scratch is also significantly lower than the full model, indicating that Stage II's large-scale robot pre-training is helpful for real downstream generalization.

Real-world ablation
Fig. 5 / Real-world ablation: The complete three-stage training is significantly improved compared to the variant without Stage II.

The dynamic conveyor-belt experiment uses ARX LIFT II, a robot embodiment absent from the pretraining data. With only 47 post-training demonstrations, F1 reaches 66.7% on continuous dual-arm dynamic grasping versus 33.3% for $\pi_0$; on the Lettuce and Bread subtasks F1 reaches 80.0% on both, versus 53.3% and 46.7% for $\pi_0$.

Dynamic manipulation
Fig. 6 / Dynamic manipulation: Specified food grabbing on a moving conveyor belt.

5.6 Rapid adaptation and long-horizon tasks

| Experiment | $\pi_0$ | F1 | Note |
| --- | --- | --- | --- |
| Franka sweep: objects successfully swept | 4.9 / 8.0 | 7.1 / 8.0 | F1 sweeps more objects. |
| Franka sweep: attempts used (max) | 4.8 / 5.0 | 3.5 / 5.0 | F1 needs fewer attempts. |
| Franka sweep: empty sweeps | 2.4 / 5.0 | 0.8 / 5.0 | The paper attributes the fewer empty sweeps to more accurate spatial grounding. |
| Sort: 1/2/3 consecutive grasps | 100 / 86.7 / 53.3 | 100 / 100 / 66.7 | F1 holds up better across repeated interactions. |

The long-horizon ARX LIFT II task consists of 10 steps and takes roughly 2 minutes. $\pi_0$ reaches 93.3% on the first two pick-and-place steps and essentially 0% afterwards; F1 reaches 100% on the first three steps, 93.3% on steps 4/5, and still 73.3%, 60.0%, 40.0%, 40.0%, and 40.0% on the later, more complex steps. The paper acknowledges that the decline over later steps is consistent with expected long-horizon error accumulation.

5.7 Generation quality and action reliability

The paper does not use traditional FID/PSNR as the main generation metric; instead it uses Qwen2.5-VL-32B-Instruct for task-relevant evaluation. The input includes the task instruction, a four-frame history, the predicted next frame, and the ground-truth frame; the output consists of three binary scores: scene consistency, object consistency, and task-progress following. The prompt template is given in the appendix (Prompt Template).
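A sketch of this evaluation protocol; `query_vlm` is a hypothetical stand-in for a Qwen2.5-VL-32B-Instruct call, and the scoring fields are paraphrased from the text rather than taken from the appendix prompt template:

```python
def score_foresight(query_vlm, instruction, history_frames, predicted_frame, gt_frame):
    """Ask a VLM judge for three binary scores on one predicted foresight frame (sketch)."""
    prompt = (
        f"Task instruction: {instruction}\n"
        "Given four history frames, the predicted next frame, and the ground-truth next frame, "
        "answer 0 or 1 for each of: scene_consistency, object_consistency, task_progress_following."
    )
    # query_vlm is assumed to accept a text prompt plus a list of images and to return
    # three whitespace-separated 0/1 flags, e.g. "1 0 1".
    reply = query_vlm(prompt, images=[*history_frames, predicted_frame, gt_frame])
    scene, obj, progress = (int(x) for x in reply.split()[:3])
    return {
        "scene_consistency": scene,
        "object_consistency": obj,
        "task_progress_following": progress,
    }
```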

Generation quality metrics
Fig. 7 / Generation quality: scene consistency is improved first, object consistency is the most difficult, and task progress follows to improve steadily.

The authors observe that the object-consistency curve is lower because F1 is not pretrained on large-scale generative datasets; yet task-progress following often exceeds object consistency, indicating that foresight can provide action-relevant cues even when the pixels are imperfect.

Correlation between image and action accuracy
Fig. 8 / Image-action correlation: Among the four LIBERO suites, image token accuracy and action token accuracy have a stable positive correlation.

6. Reproducibility audit

6.1 Code and resources

Already published: the official GitHub repository is InternRobotics/F1-VLA, including f1_vla/, train_hf.py, and a README with the training entry point. The project homepage is F1-VLA, and the HuggingFace organization provides model and dataset entries.

Dependencies given in the README: Python ≥ 3.10, torch ≥ 2.6.0, CUDA ≥ 12.4; FFmpeg + TorchCodec are recommended to speed up video data loading.

Downloads: README points to LIBERO no-op filtered data, F1_pretrain checkpoint, lerobot/pi0_base, google/paligemma-3b-pt-224, VAR VAE weights.

6.2 Reproducibility path

  1. Clone the repository and create an environment: git clone https://github.com/InternRobotics/F1-VLA.git, conda create -n f1_vla python==3.10.
  2. Install the CUDA 12.4 build of PyTorch and the repository: pip install torch==2.6.0 ..., then enter F1-VLA/f1_vla and run pip install -e .
  3. Download the checkpoints/tokenizer/VAE/LIBERO data specified in the README and fill in their paths in the configuration.
  4. Edit the configuration: f1_vla/config/debug_test.yaml or the corresponding task configuration.
  5. Run training: python train_hf.py --config-file f1_vla/config/debug_test.yaml.

6.3 Deployment latency

Per the appendix (Deploy Platform and Latency Analysis): all experiments ran on an Intel i9 CPU + NVIDIA RTX 4090 workstation, with the robot connected over wired Ethernet. With three synchronized camera inputs, total inference time is about 235 ms.

| Module | Latency |
| --- | --- |
| image process / resize | 18 ms |
| temporal downsampling | 28 ms |
| image encoder | 18 ms |
| foresight generation | 76 ms |
| x10 action forward pass (flow) | 95 ms |
| Total | 235 ms |

6.4 Reproduction risks

7. Analysis, Limitations and Boundaries

7.1 The most valuable part of this paper

Based on the paper's own evidence, the most valuable contribution is upgrading "future image prediction" from an auxiliary training objective to an explicit planning variable at inference time. F1 has the generation expert first produce a visual foresight, then lets the action expert treat it as the target of inverse dynamics. This is closer to an "imagine first, then execute" structure than simply bolting an action head onto a backbone. In the ablations, No-Gen drops from 77.5% to 60.3%, and the real Handover (R2H) and dynamic conveyor tasks likewise show that foresight matters most under complex dynamics.

7.2 Why the results hold up

The paper's evidence chain is fairly complete: nine real tasks show that F1 raises the average success rate relative to $\pi_0$; LIBERO and SimplerEnv provide simulation benchmarks; the Frozen-Gen, Cotrain-Scratch, No-Gen, and 2/6-scale ablations respectively verify generation trainability, Stage II pretraining, the generation expert itself, and the number of inference scales; the dynamic conveyor belt, Franka rapid adaptation, and the 10-step long-horizon task cover scenarios beyond short-horizon static grasping; and the generation-quality section shows that foresight quality correlates positively with action accuracy.

7.3 Limitations and future directions described by the author

7.4 Applicable boundaries

F1 is suited to manipulation tasks where the future visual state provides a clear planning target for actions, especially dynamic targets, long-horizon steps, hand-eye coordination, and cross-platform adaptation. It depends heavily on multi-view image history, proprioception, generative visual tokenization, and a heavier inference chain. The paper offers little evidence for tasks involving unobservable contact forces, purely tactile feedback, extremely high-speed closed-loop control, or tasks whose goal state cannot be expressed in images.