
F1: A Vision-Language-Action Model Bridging Understanding and Generation to Actions

Authors: Qi Lv, Weijie Kong, Hao Li, Jia Zeng, Zherui Qiu, Delin Qu, Haoming Song, Qizhi Chen, Xiang Deng, Jiangmiao Pang

Organization: Shanghai AI Laboratory; Harbin Institute of Technology (Shenzhen)

Version: arXiv:2509.06951 v2, submitted 2025-09-08, revised 2025-09-09; LaTeX source uses the ICLR 2026 conference style

Links: arXiv | PDF | Project home page | official code | HuggingFace

1. Quick overview of the paper

One-sentence summary: F1 shifts VLA action prediction from "react directly to the current state" to "first generate a visual foresight of the next state, then use that predicted future image to guide inverse-dynamics action generation," connecting understanding, imagination, and execution through three Transformer experts: understanding, generation, and action.

Difficulty rating: ★★★★☆. Reading requires familiarity with VLA, predictive inverse dynamics, Mixture-of-Transformer, VAR/RQ-VAE next-scale image generation, flow matching action prediction, and LIBERO / SimplerEnv / real robot evaluation.

Keywords: Vision-Language-Action, Visual Foresight, Predictive Inverse Dynamics, Mixture-of-Transformer, Flow Matching, Long-horizon Robotics

| Reading target | Short answer |
| --- | --- |
| What problem does the paper solve? | Most existing VLAs are reactive state-to-action mappings, which are short-sighted and fragile in dynamic scenes and long-horizon tasks; visual-prediction policies often lack VLM semantic grounding. F1 aims to combine semantic understanding, future visual prediction, and action execution into a single reasoning chain. |
| The authors' approach | A Mixture-of-Transformer with three experts: the understanding expert processes language and the current observation; the generation expert uses next-scale prediction to produce a goal-conditioned foresight image; the action expert takes the foresight as an explicit target and generates action chunks via flow matching. |
| Most important results | Real-robot Genie-1 nine-task average success rate of 82.2%, vs. 65.2% for $\pi_0$; the pretrained version averages 95.7% on LIBERO, ranking first; the pretrained version reaches an overall average of 72.9% on SimplerEnv Bridge; the dynamic conveyor-belt task reaches 66.7% vs. 33.3% for $\pi_0$. |
| Things to note when reading | F1's generation module is not just a training auxiliary; the paper explicitly has the model predict future observations at inference time, turning action generation into foresight-guided inverse dynamics. The predicted image need not be pixel-perfect, but it must provide task-progress cues. |

Core contribution list

VLA paradigm comparison
Fig. 1 / VLA paradigm evolution: from pure action expert, to VLM+action expert, to visual prediction, to F1's understanding-generation-action integration.

2. Motivation

2.1 What problem should be solved?

Real robotic environments are not a static image-classification problem: objects move, scenes change, and instructions often require multiple steps to unfold. A traditional VLA maps directly from the current image and language to actions, which leads to short-sighted behavior under dynamic targets, long-horizon sequences, and distribution shift. F1 argues that the robot should explicitly predict "what the next visual state should look like if the task keeps advancing" before acting.

This problem is formalized as predictive inverse dynamics: first predict the future observation $\hat{o}_{t+1}$, then infer the action chunk needed to move from the current state to this future visual target. Actions are no longer just reactions to the current frame; they are directed toward a generated visual goal.
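A compact way to write this factorization (notation follows the surrounding text; the conditioning sets are simplified, and the symbols $g_\phi$ and $\pi_\theta$ are introduced here only for illustration):

```latex
% Reactive VLA: map current observation and instruction directly to actions
\hat{a}_{t:t+k} \sim \pi_\theta\!\left(\cdot \mid o_t, l\right)

% Predictive inverse dynamics (F1-style): first imagine, then act toward the imagination
\hat{o}_{t+1} \sim g_\phi\!\left(\cdot \mid o_{t-m:t}, l\right)
\qquad
\hat{a}_{t:t+k} \sim \pi_\theta\!\left(\cdot \mid o_t, q_t, l, \hat{o}_{t+1}\right)
```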

2.2 Limitations of existing methods

The paper divides manipulation policies into three categories. End-to-end action experts such as ACT and Diffusion Policy lack semantic grounding and cross-task generalization; VLM-integrated policies such as $\pi_0$ and gr00t-N1 understand better but remain reactive and do not model scene evolution; visual-prediction policies such as VPP and Genie Envisioner can predict future vision but do not fully integrate VLM semantic understanding and have limited control robustness.

Therefore, F1's core question is not "add a video-prediction loss" but how to find an architecture and training recipe that connects semantic understanding, future visual prediction, and action execution in a controllable information flow.

2.3 Solution ideas of this article

F1 uses three MoT experts: the understanding expert inherits VLM capabilities; the generation expert produces foresight via RQ-VAE / next-scale prediction; the action expert conditions on the current observation, language, proprioception, and the predicted future image to generate continuous action chunks via flow matching. Training proceeds in three stages: first align the generation expert to the frozen understanding expert, then jointly pretrain on large-scale robot data, and finally post-train for the target platform.

4. Detailed explanation of method

F1 architecture
Fig. 2 / F1 framework: three experts MoT, UGA progressive attention, foresight generation and action expert.

4.1 Overall pipeline

Given the language instruction $l$, the current observation $o_t$ and the historical observation $\{o_{t-m}, \ldots, o_{t-1}\}$, the calculation chain of F1 is as follows:

Input: instruction l, current observation o_t, history o_{t-m:t-1}, proprioception q_t
1. Understanding expert:
   encode language + current visual observation into semantic multimodal representation

2. Generation expert:
   use history + language + understanding representation
   generate visual foresight image: o_hat_{t+1}

3. Action expert:
   condition on l, o_t, q_t, o_hat_{t+1}
   generate action chunk a_hat_{t:t+k} using flow matching

The paper emphasizes that $\hat{o}_{t+1}$ is an explicit planning target. The action expert does not rely only on the current state; it steers toward the predicted future visual state.
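A minimal sketch of this inference chain in PyTorch-flavored pseudocode; the module names (model.understanding, model.generation, model.action), the number of scales, and the Euler integration loop are illustrative assumptions, not the repository's actual API:

```python
import torch

@torch.no_grad()
def f1_inference_step(model, instruction, obs_history, obs_t, q_t, num_denoise_steps=10):
    """One F1 control step: understand -> imagine -> act (illustrative sketch)."""
    # 1. Understanding expert: fuse language and the current observation
    #    into a semantic multimodal representation.
    sem = model.understanding(instruction, obs_t)

    # 2. Generation expert: autoregressively predict next-scale visual tokens
    #    conditioned on history + language + semantic features, then decode
    #    them into a foresight image o_hat_{t+1}.
    foresight_tokens = model.generation.predict_scales(obs_history, instruction, sem, num_scales=4)
    o_hat_next = model.generation.decode(foresight_tokens)

    # 3. Action expert: flow-matching denoising from Gaussian noise to an
    #    action chunk, conditioned on the foresight image as an explicit goal.
    actions = torch.randn(model.chunk_size, model.action_dim)
    for step in range(num_denoise_steps):
        tau = step / num_denoise_steps
        v = model.action(actions, tau, sem, obs_t, q_t, o_hat_next)  # predicted velocity field
        actions = actions + v / num_denoise_steps                    # Euler integration step
    return actions, o_hat_next
```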

4.2 Understanding Expert

The understanding expert is initialized from a pretrained vision-language model. The current observation $o_t$ is first encoded into high-level perceptual features by the SigLIP vision encoder, then fed into the decoder-only Transformer together with the language prompt. The implementation details state that the understanding expert architecture matches PaliGemma and inherits weights from $\pi_0$.

4.3 Generation Expert: Next-Scale Visual Foresight

The generation expert is responsible for producing $\hat{o}_{t+1}$. Instead of costly diffusion-based video generation, it uses VAR-style next-scale prediction. The recent observation history $\{o_{t-m}, \ldots, o_t\}$ is encoded with multi-scale residual vector quantization, decomposing each frame into multi-scale tokens $\{z_i^0, \ldots, z_i^k\}$ over a $16\times16$ patch grid. To keep the multi-frame token sequence from growing too long, the model uses a temporal convolutional network to aggregate motion-relevant features.

Residual VQ-VAE
Fig. 3 / Residual VQ-VAE: reconstruction from low to high scale for next-scale foresight token prediction.
Inference trade-off: in training Stage I, the generation expert predicts 10 resolution scales with VAR; Stage II/III and inference are limited to 4 scales for efficiency. The appendix training table likewise lists "# Num Predicted Scales" as 10 for Stage I and 4 for the remaining stages.
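The multi-scale residual quantization behind next-scale prediction can be sketched as follows; the scale list, codebook lookup, and function name are illustrative assumptions rather than the exact RQ-VAE configuration used by F1:

```python
import torch
import torch.nn.functional as F

def encode_next_scale_tokens(feat, codebook, scales=(1, 2, 4, 8, 16)):
    """Decompose a feature map into coarse-to-fine residual token maps (sketch).

    feat:     (C, H, W) latent feature map from the image encoder (e.g. a 16x16 patch grid)
    codebook: (K, C) shared quantizer codebook
    returns:  list of token index maps, one per scale
    """
    residual, tokens = feat, []
    for s in scales:
        # Downsample the residual to the current scale and quantize it.
        coarse = F.interpolate(residual.unsqueeze(0), size=(s, s), mode="area").squeeze(0)
        flat = coarse.permute(1, 2, 0).reshape(-1, coarse.shape[0])   # (s*s, C)
        idx = torch.cdist(flat, codebook).argmin(dim=-1)              # nearest codebook entries
        tokens.append(idx.view(s, s))

        # Reconstruct this scale's contribution and subtract it from the residual.
        quant = codebook[idx].view(s, s, -1).permute(2, 0, 1)
        up = F.interpolate(quant.unsqueeze(0), size=residual.shape[-2:], mode="bilinear").squeeze(0)
        residual = residual - up
    return tokens
```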

4.4 Action Expert: Foresight-Guided Flow Matching

The action expert is conditioned on language, the current observation, the foresight image, and proprioception, and outputs a short-horizon action chunk $\hat{a}_{t:t+k}$. The paper uses ACT-style chunked action prediction and flow matching to learn a vector field from Gaussian noise to expert actions in the continuous action space.

The training objective is: given the current context and a noised action $a_t^\tau$, predict the vector pointing from the noise toward the true action.

$$a_t^\tau = (1-\tau)\epsilon + \tau a_t, \quad \tau\sim\mathcal{U}(0, 1), \quad \epsilon\sim\mathcal{N}(0, I)$$

$$\mathcal{L}_{\mathrm{action}} = \mathbb{E}\left[\left\|\pi_\theta(l, \{o_i\}_{i=t-m}^{t}, q_t, a_t^\tau)-(a_t-\epsilon)\right\|^2\right]$$

- $q_t$: proprioception at time $t$.
- $a_t^\tau$: interpolation between the true action $a_t$ and noise $\epsilon$.
- $a_t-\epsilon$: flow-matching target vector, pointing from noise toward the true action.
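A minimal PyTorch sketch of this objective, assuming a generic `policy` callable that returns the predicted vector field (illustrative, not the repository code):

```python
import torch

def flow_matching_loss(policy, context, a_true):
    """Flow-matching action loss from the interpolation above (sketch).

    context: whatever the action expert conditions on (l, observations, q_t, foresight)
    a_true:  (chunk_len, action_dim) ground-truth action chunk a_t
    """
    tau = torch.rand(())                        # tau ~ U(0, 1)
    eps = torch.randn_like(a_true)              # epsilon ~ N(0, I)
    a_noisy = (1.0 - tau) * eps + tau * a_true  # a_t^tau interpolation
    target = a_true - eps                       # target vector field: noise -> action
    pred = policy(context, a_noisy, tau)        # predicted vector field
    return ((pred - target) ** 2).mean()
```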

4.5 UGA Progressive Attention

UGA stands for Understanding-Generation-Action. Within each expert, tokens interact richly, and the foresight tokens in the generation expert additionally follow a causal / scale-conditioned pattern. Across experts, information flows as a one-way hierarchy: generation attends to understanding, action attends to both, action cannot influence generation in the reverse direction, and understanding receives nothing from downstream modules.

This design has two functions: first, to prevent reverse leakage of action tokens during training, making foresight truly an intermediate representation; second, to make the model structure interpretable as "first understand, then imagine, then execute."
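One way to realize this one-way hierarchy is a block-wise attention mask over the concatenated token sequence. The layout below is an illustrative sketch; the released code may implement the constraint differently:

```python
import torch

def uga_attention_mask(n_und, n_gen, n_act):
    """Block mask enforcing understanding <- generation <- action (sketch).

    True = attention allowed. Within-expert blocks are fully visible here;
    in practice the generation block would additionally follow a causal /
    scale-conditioned pattern for next-scale prediction.
    """
    n = n_und + n_gen + n_act
    mask = torch.zeros(n, n, dtype=torch.bool)
    u = slice(0, n_und)
    g = slice(n_und, n_und + n_gen)
    a = slice(n_und + n_gen, n)

    mask[u, u] = True   # understanding sees only itself
    mask[g, u] = True   # generation sees understanding ...
    mask[g, g] = True   # ... and itself
    mask[a, u] = True   # action sees understanding ...
    mask[a, g] = True   # ... the generated foresight ...
    mask[a, a] = True   # ... and itself; nothing flows backwards
    return mask
```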

4.6 Three-stage training

| Stage | Goal | Training method | Key hyperparameters (Appendix Training Details) |
| --- | --- | --- | --- |
| Pretrain Stage I | Inject foresight capability into the generation expert and align it with the frozen understanding expert. | The understanding expert inherits $\pi_0$ weights and is frozen; the generation expert is randomly initialized and trained with teacher forcing to predict ground-truth future visual tokens. | batch 1280, lr 3e-4 cosine, 512K steps, understanding resolution 224, generation resolution 256, 10 predicted scales. |
| Pretrain Stage II | Jointly optimize all three experts on large-scale robot data to learn general visuomotor knowledge. | Joint training with autoregressively predicted foresight tokens + flow-matching action prediction. | batch 2880, lr 5e-5 constant, 100K steps, Gen:Act loss weight 0.1:1, action chunk size 30, 4 predicted scales. |
| Post-train Stage III | Adapt to downstream platforms and tasks. | Finetune on LIBERO, Simpler, Genie-1, Franka, Dynamic, Long-horizon and other task data. | batch 128, lr 5e-5 cosine; Simpler 10 epochs, Genie/Franka/Dynamic 40 epochs, Long-horizon 60 epochs; action chunk size 4/8/50 etc. depending on task. |

The overall Stage II/III objective combines future-image prediction with action generation.

$$\mathcal{L}_{\mathrm{total}}=\mathcal{L}_{\mathrm{gen}}^{\mathrm{pred}}+\lambda\cdot\mathcal{L}_{\mathrm{action}}$$

Among them, $\mathcal{L}_{\mathrm{gen}}^{\mathrm{pred}}$ is the autoregressive next-scale visual token NLL, and $\lambda$ controls the action loss weight.

4.7 Data scale

The appendix Dataset Details section gives complete statistics: 330.9K trajectories and 73.8M frames in total, covering the Genie-G1, Franka, WidowX, Google Robot, and ARX LIFT II embodiments, with multiple views (third-person, wrist/head) and frame rates of 3-30 FPS.

| Data source | Stage | Embodiment | # Trajs | # Frames |
| --- | --- | --- | --- | --- |
| Agibot-World | I + II | Genie-G1 | 187K | 66.4M |
| LIBERO | I + II + III | Franka | 1.7K | 0.3M |
| OXE-Bridge-v2 | I + II + III | WidowX | 53.2K | 1.9M |
| OXE-Fractal | I + II | Google Robot | 87.2K | 3.8M |
| In-house tasks | III | Genie-G1 / Franka / ARX LIFT II | ~1.8K | ~1.4M |

5. Experiment

Real-world tasks
Fig. 4 / Real-world robot experiments: A total of 12 types of experiments on Genie-1, Franka, and ARX LIFT II.

5.1 Real-world Genie-1: nine tasks

Each task is evaluated over 15 trials, comparing $\pi_0$, gr00t-N1, gr00t-N1.5, and F1. F1's average success rate is 82.2%, higher than $\pi_0$'s 65.2%, gr00t-N1.5's 53.3%, and gr00t-N1's 30.4%.

| Method | Pen | Flower | Chip | Tea Table | Tea Shelf | Bread | Handover | Handover R2H | Mixture | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| $\pi_0$ | 66.7 | 66.7 | 86.7 | 86.7 | 73.3 | 66.7 | 33.3 | 40.0 | 66.7 | 65.2 |
| gr00t-N1 | 46.7 | 33.3 | 33.3 | 40.0 | 13.3 | 33.3 | 26.7 | 13.3 | 33.3 | 30.4 |
| gr00t-N1.5 | 73.3 | 40.0 | 46.6 | 73.5 | 26.6 | 53.3 | 60.0 | 40.0 | 66.7 | 53.3 |
| F1 | 93.3 | 80.0 | 100.0 | 93.3 | 86.7 | 66.7 | 80.0 | 73.3 | 66.7 | 82.2 |

The paper emphasizes that Handover (R2H) requires dynamic adjustment and human-robot handover, where F1 reaches 73.3%, markedly higher than $\pi_0$'s 40.0% and gr00t-N1's 13.3%.

5.2 LIBERO

| Method | Pretrained | Spatial | Object | Goal | Long | Average | Avg Rank |
| --- | --- | --- | --- | --- | --- | --- | --- |
| $\pi_0$ | Yes | 98.0 | 96.8 | 94.4 | 88.4 | 94.4 | 2 |
| gr00t-N1 | Yes | 94.4 | 97.6 | 93.0 | 90.6 | 93.9 | 4 |
| CoT-VLA | Yes | 87.5 | 91.6 | 87.6 | 69.0 | 83.9 | 6 |
| F1 | No | 97.4 | 97.6 | 94.2 | 88.0 | 94.3 | 3 |
| F1 | Yes | 98.2 | 97.8 | 95.4 | 91.3 | 95.7 | 1 |

LIBERO-Long is the suite with the most long-horizon planning pressure. The pretrained F1 reaches 91.3%, exceeding $\pi_0$'s 88.4% and gr00t-N1's 90.6%, which supports the authors' contention that foresight helps on long-horizon tasks.

5.3 SimplerEnv Bridge

| Method | Pretrained | Carrot Success | Eggplant Success | Spoon Success | Stack Success | Overall Avg |
| --- | --- | --- | --- | --- | --- | --- |
| SpatialVLA | Yes | 25.0 | 100.0 | 16.7 | 29.2 | 47.9 |
| $\pi_0$ | Yes | 0.0 | 16.6 | 62.5 | 29.1 | 40.1 |
| $\pi_0$-Fast | Yes | 21.9 | 10.8 | 66.6 | 29.1 | 48.3 |
| F1 | No | 33.3 | 75.0 | 45.8 | 62.5 | 66.1 |
| F1 | Yes | 70.8 | 66.7 | 50.0 | 50.0 | 72.9 |

The SimplerEnv Bridge tasks emphasize fine-grained placement. F1's overall average is clearly higher than the next-strongest baseline, which the paper attributes to foresight helping the policy adapt to changes in source-object configuration and target location.

5.4 Simulated ablation

| Variant | Meaning | Avg | Main conclusion |
| --- | --- | --- | --- |
| F1 | complete model | 77.5 | Baseline. |
| Frozen-Gen | Freeze the generation expert after Stage I | 73.8 | A fixed generation expert is still useful, but subsequent end-to-end adaptation improves task alignment. |
| Cotrain-Scratch | Remove Stage II large-scale robot pretraining | 74.2 | Stage II provides a manipulation prior and stabilizes optimization. |
| No-Gen | Remove the generation expert | 60.3 | Visual foresight is the most critical module; SimplerEnv collapses noticeably when it is removed. |
| 2-Scales | Predict only 2 scales at inference | 73.4 | The foresight is too coarse and lacks planning information. |
| 6-Scales | Predict 6 scales at inference | 75.8 | More scales are not necessarily better: more compute and possible instability; the paper ultimately uses 4 scales. |

5.5 Real task ablation and dynamic generalization

Real task ablation shows that No-Gen achieves only 40.0% and 60.0% in Handover (R2H) and Mixture respectively, while F1 reaches 93.3% and 73.3%. Cotrain-Scratch is also significantly lower than the full model, indicating that Stage II's large-scale robot pre-training is helpful for real downstream generalization.

Real-world ablation
Fig. 5 / Real-world ablation: The complete three-stage training is significantly improved compared to the variant without Stage II.

The dynamic conveyor-belt experiment uses ARX LIFT II, a robot embodiment absent from the pretraining data. With only 47 post-training demonstrations, F1 reaches 66.7% on continuous dual-arm dynamic grasping versus 33.3% for $\pi_0$; on the Lettuce and Bread subtasks F1 reaches 80.0% on both, versus 53.3% and 46.7% for $\pi_0$.

Dynamic manipulation
Fig. 6 / Dynamic manipulation: Specified food grabbing on a moving conveyor belt.

5.6 Rapid adaptation and long-horizon tasks

| Experiment | $\pi_0$ | F1 | Note |
| --- | --- | --- | --- |
| Franka sweep: objects successfully swept | 4.9 / 8.0 | 7.1 / 8.0 | F1 sweeps more objects. |
| Franka sweep: attempts used (max) | 4.8 / 5.0 | 3.5 / 5.0 | F1 needs fewer attempts. |
| Franka sweep: empty sweeps | 2.4 / 5.0 | 0.8 / 5.0 | The paper attributes the fewer empty sweeps to more accurate spatial grounding. |
| Sort: 1/2/3 consecutive grasps | 100 / 86.7 / 53.3 | 100 / 100 / 66.7 | F1 holds up better across repeated interactions. |

The long-horizon ARX LIFT II task consists of 10 steps and takes roughly 2 minutes. $\pi_0$ reaches 93.3% on the first two pick-and-place steps and essentially 0% afterwards; F1 reaches 100% on the first three steps, 93.3% on steps 4/5, and still 73.3%, 60.0%, 40.0%, 40.0%, and 40.0% on the later, more complex steps. The paper acknowledges that the decline over later steps is consistent with expected long-horizon error accumulation.

5.7 Generation quality and action reliability

The paper does not use traditional FID/PSNR as the main generation metric; instead it uses Qwen2.5-VL-32B-Instruct for task-relevant evaluation. The input includes the task instruction, a four-frame history, the predicted next frame, and the ground-truth frame; the output consists of three binary scores: scene consistency, object consistency, and task-progress following. The prompt template is given in the appendix (Prompt Template).
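A sketch of this evaluation protocol; `query_vlm` is a hypothetical stand-in for a Qwen2.5-VL-32B-Instruct call, and the scoring fields are paraphrased from the text rather than taken from the appendix prompt template:

```python
def score_foresight(query_vlm, instruction, history_frames, predicted_frame, gt_frame):
    """Ask a VLM judge for three binary scores on one predicted foresight frame (sketch)."""
    prompt = (
        f"Task instruction: {instruction}\n"
        "Given four history frames, the predicted next frame, and the ground-truth next frame, "
        "answer 0 or 1 for each of: scene_consistency, object_consistency, task_progress_following."
    )
    # query_vlm is assumed to accept a text prompt plus a list of images and to return
    # three whitespace-separated 0/1 flags, e.g. "1 0 1".
    reply = query_vlm(prompt, images=[*history_frames, predicted_frame, gt_frame])
    scene, obj, progress = (int(x) for x in reply.split()[:3])
    return {
        "scene_consistency": scene,
        "object_consistency": obj,
        "task_progress_following": progress,
    }
```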

Generation quality metrics
Fig. 7 / Generation quality: scene consistency is improved first, object consistency is the most difficult, and task progress follows to improve steadily.

The authors observe that the object-consistency curve is lower because F1 is not pretrained on large-scale generative datasets; yet task-progress following often exceeds object consistency, indicating that foresight can provide action-relevant cues even when the pixels are imperfect.

Correlation between image and action accuracy
Fig. 8 / Image-action correlation: Among the four LIBERO suites, image token accuracy and action token accuracy have a stable positive correlation.

6. Reproducibility audit

6.1 Code and resources

Already published: the official GitHub repository is InternRobotics/F1-VLA, including f1_vla/, train_hf.py, and a README with the training entry point. The project homepage is F1-VLA, and the HuggingFace organization provides model and dataset entries.

Dependencies given in the README: Python ≥ 3.10, torch ≥ 2.6.0, CUDA ≥ 12.4; FFmpeg + TorchCodec are recommended to speed up video data loading.

Downloads: README points to LIBERO no-op filtered data, F1_pretrain checkpoint, lerobot/pi0_base, google/paligemma-3b-pt-224, VAR VAE weights.

6.2 Reproducibility path

  1. Clone the repository and create an environment: git clone https://github.com/InternRobotics/F1-VLA.git, conda create -n f1_vla python==3.10.
  2. Install the CUDA 12.4 build of PyTorch and the repository: pip install torch==2.6.0 ..., then enter F1-VLA/f1_vla and run pip install -e .
  3. Download the checkpoints/tokenizer/VAE/LIBERO data specified in the README and fill in their paths in the configuration.
  4. Edit the configuration: f1_vla/config/debug_test.yaml or the corresponding task configuration.
  5. Run training: python train_hf.py --config-file f1_vla/config/debug_test.yaml.

6.3 Deployment latency

Per the appendix (Deploy Platform and Latency Analysis): all experiments ran on an Intel i9 CPU + NVIDIA RTX 4090 workstation, with the robot connected over wired Ethernet. With three synchronized camera inputs, total inference time is about 235 ms.

| Module | Latency |
| --- | --- |
| image process / resize | 18 ms |
| temporal downsampling | 28 ms |
| image encoder | 18 ms |
| foresight generation | 76 ms |
| x10 action forward pass (flow) | 95 ms |
| Total | 235 ms |

6.4 Reproduction risks

7. Analysis, Limitations and Boundaries

7.1 The most valuable part of this paper

Based on the paper's own evidence, the most valuable contribution is upgrading "future image prediction" from an auxiliary training objective to an explicit planning variable at inference time. F1 has the generation expert first produce a visual foresight, then lets the action expert treat it as the target of inverse dynamics. This is closer to an "imagine first, then execute" structure than simply bolting an action head onto a backbone. In the ablations, No-Gen drops from 77.5% to 60.3%, and the real Handover (R2H) and dynamic conveyor tasks likewise show that foresight matters most under complex dynamics.

7.2 Why the results hold up

The paper's evidence chain is fairly complete: nine real tasks show that F1 raises the average success rate relative to $\pi_0$; LIBERO and SimplerEnv provide simulation benchmarks; the Frozen-Gen, Cotrain-Scratch, No-Gen, and 2/6-scale ablations respectively verify generation trainability, Stage II pretraining, the generation expert itself, and the number of inference scales; the dynamic conveyor belt, Franka rapid adaptation, and the 10-step long-horizon task cover scenarios beyond short-horizon static grasping; and the generation-quality section shows that foresight quality correlates positively with action accuracy.

7.3 Limitations and future directions described by the author

7.4 Applicable boundaries

F1 is suited to manipulation tasks where the future visual state provides a clear planning target for actions, especially dynamic targets, long-horizon steps, hand-eye coordination, and cross-platform adaptation. It depends heavily on multi-view image history, proprioception, generative visual tokenization, and a heavier inference chain. The paper offers little evidence for tasks involving unobservable contact forces, purely tactile feedback, extremely high-speed closed-loop control, or tasks whose goal state cannot be expressed in images.