
$\pi_{0.7}$: a Steerable Generalist Robotic Foundation Model with Emergent Capabilities

Authors: Physical Intelligence Team: Bo Ai, Ali Amin, Raichelle Aniceto, Ashwin Balakrishna, Kevin Black, . . . Ury Zhilinsky, et al.

Affiliations: Physical Intelligence

Publication: arXiv preprint, 2026

arXiv: 2604.15483 | Project: pi.website/pi07

1. Quick overview of the paper

One-sentence summary: $\pi_{0.7}$ is a 5B-parameter steerable generalist robotic foundation model. The core idea is not merely to scale up a VLA, but to let a single model absorb multi-quality, multi-source data through multi-modal context conditioning (language, subtasks, episode metadata, subgoal images, history), demonstrating out-of-the-box capability on unseen instructions, cross-embodiment transfer, and compositional tasks.

Difficulty rating: ★★★★☆. Assumes familiarity with VLAs, flow-matching action experts, VLM backbones, robot history encoders, world models/subgoal generation, cross-embodiment evaluation, and real-robot experiment evaluation.

Keywords: Robotic Foundation Model, VLA, Context Conditioning, Subgoal Images, Episode Metadata, Cross-Embodiment Transfer.

| Reading orientation | Answer |
| --- | --- |
| What problem does the paper solve? | The compositional generalization ability of robot foundation models is still weak: models often require task-specific fine-tuning, and it is hard to reuse existing skills on new tasks, new robots, and new environments from prompts alone. |
| The authors' approach | Expand the prompt/context from a single language instruction to a multi-modal context: subtask instructions, subgoal images, episode metadata, and video history, trained jointly with component dropout. |
| Most important results | Seen dexterous tasks often exceed 90% success; zero-shot/unseen tasks or unseen robot-task combinations mostly reach 60-80%; on UR5e shirt folding, human experts achieve 90.9% progress / 80.6% success, while $\pi_{0.7}$ achieves 85.6% progress / 80% success. |
| Things to note when reading | Many conclusions come from real-robot evaluation figures and supplementary explanations, not a single benchmark table; the focus is on how "steerability + diverse context" lets the model exploit failures, suboptimal data, non-robot data, and specialist rollouts. |

Core contribution list

2. Motivation

2.1 What problem should be solved?

The paper starts from the general principle behind foundation models: capability comes from training on large-scale, diverse data. Language models can combine knowledge, follow user formats, and perform reasoning, but physical intelligence in robots still struggles to achieve similar compositional generalization. Existing VLAs, while growing larger, often still rely on task-specific data or fine-tuning.

teaser
Teaser: $\pi_{0.7}$ uses richer prompts that describe both what to do and how, including language, subgoal images, and episode metadata, thereby leveraging broader data and composing skills to solve new tasks.

2.2 Limitations of existing methods

2.3 This paper's approach

The core insight is: if the context is rich enough, the model can distinguish "what task this trajectory is, what strategy is used, success or failure, whether it should be imitated or avoided", thereby turning heterogeneous data into a trainable resource. $\pi_{0.7}$ uses component dropout to allow the model to see different context combinations during training, and can flexibly use language, subgoal image or metadata to steer behavior during inference.

4. Detailed explanation of method

4.1 Method overview

$\pi_{0.7}$ is a 5B-parameter VLA. The input includes an observation history $\mathbf{o}_{t-T:t}$ and a context $\mathcal{C}_t$. Observations include multi-camera images $\mathbf{I}_t^i$ and joint state $\mathbf{q}_t$; the context can include the language command, the next semantic subtask, a subgoal image, episode metadata, the control mode, etc. The output is an action chunk $\mathbf{a}_{t:t+H}$; usually only a shorter prefix is executed before re-planning.

architecture
Architecture overview: 4B VLM backbone, MEM-style history encoder, 860M action expert; at runtime a high-level semantic policy generates language commands and a BAGEL-based world model generates subgoal images.

A training sample is (observation history $\mathbf{o}_{t-T:t}$, action chunk $\mathbf{a}_{t:t+H}$, context $\mathcal{C}_t$), where $\mathcal{C}_t$ may include:

- language command $\ell_t$
- next semantic subtask $\hat{\ell}_t$
- a subgoal image generated by the world model
- episode metadata: quality / strategy / success / source
- control mode and other fields

The VLA is trained by imitation with a flow-matching action expert. At inference: update the history tokens with the MEM encoder; optionally generate a semantic subtask and subgoal image; build the prompt/context $\mathcal{C}_t$; the action expert predicts an action chunk; execute a short horizon, then repeat.
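The inference steps above can be sketched as a receding-horizon control loop. This is a minimal, hypothetical sketch: all class and method names (`Context`, `propose_subtask`, `predict_action_chunk`, `generate_subgoal`) are illustrative placeholders, not the authors' actual API.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Context:
    """Multi-modal context C_t; any subset of fields may be present."""
    language_command: Optional[str] = None
    subtask: Optional[str] = None
    subgoal_image: Optional[Any] = None
    metadata: dict = field(default_factory=dict)


def run_episode(policy, world_model, env, command, horizon=16, execute=4, steps=100):
    """Receding-horizon control: predict a chunk, execute a short prefix, replan."""
    obs_history = [env.observe()]
    for _ in range(0, steps, execute):
        ctx = Context(language_command=command)
        # A high-level semantic policy proposes the next subtask ...
        ctx.subtask = policy.propose_subtask(obs_history, command)
        # ... and the world model renders a subgoal image for it.
        ctx.subgoal_image = world_model.generate_subgoal(obs_history, ctx.subtask)
        chunk = policy.predict_action_chunk(obs_history, ctx, horizon)
        for action in chunk[:execute]:  # execute only a short prefix, then replan
            env.step(action)
            obs_history.append(env.observe())
```

The key property is that the full horizon-$H$ chunk is predicted but only a short prefix is executed, so the context (subtask, subgoal image) can be refreshed between chunks.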

4.2 Method evolution

| Stage | Form | Improvement motivation |
| --- | --- | --- |
| $\pi_0$ / prior VLAs | Short text task instruction + observation history → action chunk. | Can control multiple tasks, but the context cannot describe strategy or data quality. |
| MEM-style VLA | Adds a video history encoder to handle long-horizon observations. | Supports tasks that require memory. |
| $\pi_{0.7}$ | Multimodal context: subtask, metadata, subgoal images, history, control mode. | Lets the model be steered, exploit heterogeneous data, and compose existing skills to solve new tasks. |

4.3 Core design and mathematical derivation

VLA Training goal: Given historical observations and rich context, maximize the conditional likelihood of future action chunks.
$$\max_{\theta}\;\mathbb{E}_{\mathcal{D}}\left[\log \pi_{\theta}(\mathbf{a}_{t:t+H}\mid \mathbf{o}_{t-T:t},\mathcal{C}_t)\right].$$

$\mathcal{D}$ is the training dataset, $\mathbf{o}_{t-T:t}$ the observation history, $\mathcal{C}_t$ the context, and $\mathbf{a}_{t:t+H}$ the future action chunk. The paper notes that the flow-matching action expert actually optimizes an approximate lower bound rather than the exact log-likelihood.
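To make the flow-matching surrogate concrete, here is a minimal single-step sketch of a conditional flow-matching loss for an action chunk, assuming straight-line (rectified-flow style) interpolation paths and an assumed expert signature `expert(x_tau, tau, cond)`; the paper's actual recipe may differ.

```python
import numpy as np


def flow_matching_loss(expert, actions, cond, rng=None):
    """One conditional flow-matching step for an action chunk (sketch).

    actions: (B, H, D) ground-truth action chunk a_{t:t+H}
    cond:    conditioning features for (o_{t-T:t}, C_t), passed to the expert

    Uses straight-line paths x_tau = (1 - tau) * noise + tau * actions,
    so the expert regresses the constant velocity (actions - noise).
    """
    rng = rng if rng is not None else np.random.default_rng(0)
    noise = rng.standard_normal(actions.shape)
    tau = rng.random((actions.shape[0], 1, 1))      # flow time in [0, 1)
    x_tau = (1.0 - tau) * noise + tau * actions     # point on the path
    target_velocity = actions - noise               # d x_tau / d tau
    pred = expert(x_tau, tau, cond)
    return float(np.mean((pred - target_velocity) ** 2))
```

At inference, the expert would integrate the learned velocity field from noise to an action chunk over a few denoising steps, conditioned on the backbone's context features.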

4.3.1 Subtask instructions

In addition to the overall task description $\ell_t$, the model also receives text $\hat{\ell}_t$ describing the next semantic subtask. For example, instead of only saying "clean the kitchen", one can also tell the model "now pick up the plate and put it in the sink". This enables step-by-step verbal coaching of the model by a human or a high-level semantic policy at inference time.

4.3.2 Subgoal images

A subgoal image tells the model the goal state in visual form. The paper uses a lightweight world model based on the BAGEL image-generation model to generate subgoal images. The supplementary material explains that the world model uses a 7B LLM backbone and a 7B generation backbone; ViT inputs are resized to 448x336 and VAE inputs to 512x384; a subgoal is generated every $\Delta = 4$ seconds during inference.
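Since generating a subgoal takes on the order of a second while the control loop keeps running, the paper uses an asynchronous strategy: the robot acts on the latest available subgoal while the next one is rendered in the background. A minimal sketch of that pattern, with a hypothetical `generate_fn` standing in for the BAGEL-based world model call:

```python
import threading


class AsyncSubgoalGenerator:
    """Background subgoal refresh (sketch, not the authors' implementation).

    The control loop reads latest() while a worker thread regenerates the
    subgoal every `interval` seconds, mirroring the Delta = 4 s refresh
    described in the paper.
    """

    def __init__(self, generate_fn, interval=4.0):
        self._generate = generate_fn    # e.g. a world-model inference call
        self._interval = interval
        self._subgoal = None
        self._lock = threading.Lock()
        self._stop = threading.Event()

    def latest(self):
        """Return the most recent subgoal (None until the first one is ready)."""
        with self._lock:
            return self._subgoal

    def _loop(self):
        while not self._stop.is_set():
            goal = self._generate()     # ~1.25 s of denoising in the real system
            with self._lock:
                self._subgoal = goal
            self._stop.wait(self._interval)

    def start(self):
        self._thread = threading.Thread(target=self._loop, daemon=True)
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()
```

The design choice here is that the policy never blocks on image generation: a slightly stale subgoal is acceptable because the action chunk is re-planned every few steps anyway.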

prompt
Prompt overview: the context consists of subtask instructions, subgoal images, episode metadata, etc.; each component is dropped out during training and can be flexibly combined at inference.

4.3.3 Episode metadata

The role of episode metadata is to let the model know the source, quality, strategy, or success of a trajectory. With it, the model can exploit lower-quality demonstrations, failures, autonomous data from prior models, and RL/SFT specialist rollouts, instead of relying only on high-quality human demonstrations. Without metadata, it is difficult for the model to distinguish "high-quality behavior that should be imitated" from "failed or suboptimal behavior that should be avoided."

4.3.4 Architecture and knowledge insulation

The model backbone is initialized from a Gemma3 4B VLM, whose vision encoder is about 400M parameters. Actions are generated by an 860M flow-matching action expert. The paper follows a knowledge-insulation recipe: the VLM backbone is trained with discrete cross-entropy on FAST tokens; the action expert attends to the backbone activations, but its gradients are not propagated back into the VLM backbone, which stabilizes training.
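The gradient-blocking part of knowledge insulation reduces to a stop-gradient on the backbone activations before the action expert reads them. A minimal PyTorch sketch, with illustrative names (`backbone`, `action_expert`) rather than the authors' actual modules:

```python
import torch


def insulated_action_loss(backbone, action_expert, obs_tokens, actions_target):
    """Knowledge-insulation sketch: the action expert reads backbone
    activations, but .detach() blocks its gradients from flowing back
    into the VLM backbone. The backbone is updated only by its own
    FAST-token cross-entropy path (not shown here)."""
    h = backbone(obs_tokens)        # VLM backbone activations
    h_frozen = h.detach()           # stop-gradient into the backbone
    pred = action_expert(h_frozen)  # expert attends to (frozen) activations
    return ((pred - actions_target) ** 2).mean()
```

After calling `.backward()` on this loss, the action expert receives gradients while the backbone's parameters do not, which is exactly the insulation property described above.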

4.4 Implementation points

Parameter scale: about 5B total; 4B VLM backbone (vision encoder ~400M) + 860M action expert.
Prompt dropout: randomly drop each context component during training, so the model accepts any subset at inference and does not depend on having all prompt items.
Inference speed: the fastest variant runs at 38 ms with 3 camera inputs, 5 denoising steps, and training-time RTC; the slowest is 127 ms with the MEM encoder and subgoal images enabled. [Supplementary: inference speed]
Subgoal world model: 14B image-generation model, nearly 10,000 tokens, 4xH100 tensor parallelism, 8-bit large matmuls, SageAttention; 25 denoising steps take about 1.25 seconds, with an asynchronous strategy that lets the robot keep executing while the next subgoal is generated. [Supplementary: training of the world model]
Control space: the supplementary experiments compare joint-space and end-effector control. EE control shows no clear advantage in cross-embodiment tasks, so the main experiments use joint-space control. [Supplementary: action spaces]
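The prompt-dropout recipe above can be sketched as independently keeping or dropping each context field when building a training sample. The `keep_prob` value is an assumption for illustration, not a number from the paper:

```python
import random


def sample_training_context(full_context, keep_prob=0.7, rng=None):
    """Component dropout sketch: each context field (subtask, subgoal
    image, metadata, ...) is independently kept with probability
    keep_prob, so the model learns to act from any subset of context
    components at inference time."""
    rng = rng if rng is not None else random.Random()
    return {k: v for k, v in full_context.items() if rng.random() < keep_prob}
```

Because every subset appears during training, inference can supply only a language command, only a subgoal image, or any richer combination, without retraining.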

5. Experiment

5.1 Experimental setup

| Category | Setup |
| --- | --- |
| Robot platforms | A variety of robots, including mobile/static/single-arm/bimanual systems; the figure shows the experimental robot collection. |
| Task types | Espresso, box building, laundry folding, peanut butter sandwich, turning a shirt inside-out, driving through a door, zucchini slicing, peeling fruits/vegetables, taking out trash, mug swapping, find object, etc. |
| Evaluation topics | Out-of-the-box dexterity, instruction following, cross-embodiment transfer, compositional task generalization. |
| Baselines | Prior $\pi_0$ models, task-specific RL/SFT specialists, human teleoperators, ablations without eval data/metadata. |
| Supplementary rubrics | The supplementary materials provide a scoring rubric for each task; e.g., espresso requires grinding/dosing/portafilter/extraction/moving cups, and take-out-trash scores up to 12 points. |
robots
Schematic diagram of the experimental robot platform.

5.2 Main results

Out-of-the-box dexterity

The paper reports that $\pi_{0.7}$, directly out of the box, achieves performance close to task-specific RL/SFT specialists on tasks such as espresso, box building, and laundry, and exceeds the specialists in throughput on laundry and box building. The authors also report that both the "no eval data" and "no metadata" ablations are significantly weaker than the full model, indicating that eval rollouts and metadata are critical for utilizing mixed-quality data.

distillation results
Out-of-the-box dexterity: Complete $\pi_{0.7}$ compared to RL/SFT specialists.
ablations
The ablation of prompt composition and evaluation data: no eval data / no metadata are weaker than the complete model.

Instruction following

The paper tests open-ended instructions, referring instructions, reversed tasks, and tasks requiring memory. Reverse Bussing requires placing trash/dishes opposite to the dataset's bias; Reverse Fridge to Microwave requires moving food from the microwave back into the fridge. The results show that $\pi_{0.7}$ overcomes dataset bias significantly better than prior models; for Reverse Fridge to Microwave, the GC variant (conditioned on generated subgoal images) is critical to success.

instruction following
Broad instruction following in novel environments.
compositional
Breaking dataset biases by following instructions.

Cross-embodiment transfer

When the target robot has no training data for a task, $\pi_{0.7}$ exhibits zero-shot transfer. In the supplementary human-control experiment, 10 of the most experienced operators (averaging about 375 hours across all robot platforms) performed a total of 30 trials of UR5e shirt folding. Humans achieved 90.9% task progress and 80.6% success; $\pi_{0.7}$ (GC) achieved 85.6% progress and 80% success.

cross embodiment
Cross-embodiment transfer: Transfer skills between different robots.
human vs pi07
Comparison of $\pi_{0.7}$ (GC) and expert human teleoperators in UR5e shirt folding.

Compositional task generalization and coaching

The paper shows that users can guide the model through new long-horizon tasks via verbal coaching, such as putting sweet potatoes into an air fryer. The authors present this as an example of compositional generalization: rather than collecting action data separately for every new task, existing skills are invoked through prompts, subgoals, and step-by-step coaching.

air fryer
Language coaching Example: air fryer long-horizon task.

5.3 Ablation experiments

| Ablation | Verification purpose | Paper conclusion |
| --- | --- | --- |
| No eval data | Removing autonomous evaluation episodes prevents distilling strong RL/specialist policy rollouts. | Weaker than the full model on the $\pi_*^{0.6}$ release tasks. |
| No metadata | Without episode metadata in the context, the model cannot differentiate data quality/policy. | Weaker than the full model on all relevant tasks, indicating metadata is critical for mixed-quality data. |
| Joint vs EE control | Compare action spaces for cross-embodiment transfer. | EE control shows no clear advantage; the main experiments use joint-space control. |
| GC vs non-GC | Check generated subgoal-image conditioning. | GC is a critical component in tasks such as Reverse Fridge to Microwave and cross-embodiment shirt folding. |

5.4 Supplementary experiments and scoring details

The supplementary materials give detailed task scoring rubrics, covering laundry, espresso, box building, peanut butter sandwich, inside-out shirt, drive through door, cut zucchini, peel fruits/vegetables, take out trash, swap mugs, find object, etc. Rather than binary success, many tasks are scored by subgoals: take out trash scores up to 12 points, and peanut butter sandwich up to 9 points.
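Subgoal-based scoring reduces to partial credit over a fixed rubric. A minimal sketch, where the subgoal names and point values are illustrative examples, not the paper's actual rubric:

```python
def rubric_progress(completed, rubric):
    """Progress score for subgoal-based rubrics (sketch).

    completed: set of subgoal names achieved in a trial
    rubric:    {subgoal_name: points} with a fixed maximum total

    Returns the fraction of maximum points earned, so partially
    successful trials still contribute a graded score.
    """
    earned = sum(points for name, points in rubric.items() if name in completed)
    return earned / sum(rubric.values())
```

This is why the paper can report both a "progress" metric (fraction of rubric points) and a "success" metric (all subgoals completed) for the same trials.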

operator stats
Operator experience statistics in the human subject study; the selected candidates are in the top 2% of the operator team by experience.

6. Analysis and Discussion

6.1 Analysis and explanation of the results given in the paper

6.2 Limitations stated by the authors

The discussion clearly states that zero-shot generalization has lower success rates than in-distribution tasks: seen tasks tend to exceed 90%, while unseen tasks or unseen task-robot combinations tend to land around 60-80%. The authors also point out that with such a large and diverse dataset, it is difficult to rigorously determine which tasks are truly unseen, since the relevant skills may exist under different labels or as sub-behaviors of other tasks.

6.3 Applicable boundaries and future work

6.4 Reproducibility audit

| Item | Status | Notes |
| --- | --- | --- |
| Source and PDF | Obtained | The arXiv e-print, abstract page, and PDF were all downloaded successfully. |
| Figures | Extracted | 22 PDF figures were converted to PNG; key figures are embedded in this report. |
| Model structure | Clear | 5B total, 4B VLM, 860M action expert, MEM encoder, BAGEL world model, and related details are all specified. |
| Training hyperparameters | Incomplete | The paper gives training recipes and inference optimizations, but no complete batch-size, learning-rate, or other reproduction tables. |
| Dataset | Well described but not directly reproducible | Task and scoring details are rich, but the training data size, a complete data list, and downloadable data are not provided. |
| Code repository | No official GitHub found | The paper provides a project page, but no official GitHub repository is specified on the e-print or abstract page. |