FRoM-W1

The Humanoid Intelligence Team from FudanNLP and OpenMOSS

FRoM-W1: Towards General Humanoid Whole-Body Control with Language Instructions

Peng Li1,2*†, Zihan Zhuang1*†, Yangfan Gao1*, Yi Dong1*, Sixian Li1,2, Changhao Jiang1, Shihan Dou1, Zhiheng Xi1, Enyu Zhou1, Jixuan Huang1, Hui Li1, Jingjing Gong2, Xingjun Ma1,2, Tao Gui1✉, Zuxuan Wu1,2, Qi Zhang1, Xuanjing Huang1, Yu-Gang Jiang1, Xipeng Qiu1,2✉
1 Fudan University
2 Shanghai Innovation Institute

*Equal contribution   †Project Lead   ✉Corresponding authors

Note: this webpage is a preview version; more demo videos are in preparation. Please stay tuned!

Whole-body Control of Humanoid Robots with Language Instructions

Akimbo

Jump Jack

Play the role of Elephant

Conduct Orchestra

Dance Hiphop

Box

Abstract

Humanoid robots can perform a wide range of actions, such as greeting, dancing, and even backflipping. However, these motions are often hard-coded or specifically trained for, which limits versatility. In this work, we present FRoM-W1, an open-source framework for general humanoid whole-body motion control driven by natural language. To understand open-ended natural language, generate the corresponding motions, and enable a variety of humanoid robots to execute those motions stably in the physical world under gravity, FRoM-W1 operates in two stages. (a) H-GPT: a large-scale language-driven human whole-body motion generation model, trained on massive human motion data, generates diverse natural behaviors; we further leverage Chain-of-Thought reasoning to improve the model's generalization in instruction understanding. (b) H-ACT: after the generated human whole-body motions are retargeted into robot-specific actions, a motion controller, pretrained and then fine-tuned with reinforcement learning in physics simulation, enables humanoid robots to perform the corresponding actions accurately and stably; it is then deployed on real robots via a simulation-to-reality (sim-to-real) module. We extensively evaluate FRoM-W1 on the Unitree H1 and G1 robots. Results show superior performance on the HumanML3D-X benchmark for human whole-body motion generation, and the introduced reinforcement-learning fine-tuning consistently improves both the motion-tracking accuracy and the task success rates of these humanoid robots. We open-source the entire FRoM-W1 framework and hope it will advance the development of humanoid intelligence.
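To make the H-ACT fine-tuning stage concrete, below is a minimal sketch of the kind of tracking reward commonly used when fine-tuning a motion controller with reinforcement learning in simulation. The function name, terms, and weights are illustrative assumptions, not the reward actually used in FRoM-W1.

import numpy as np

def tracking_reward(q, q_ref, base_rpy, w_track=1.0, w_upright=0.2):
    """Illustrative per-step reward for motion-tracking RL (not FRoM-W1's).

    q, q_ref : robot joint positions and retargeted reference targets (rad)
    base_rpy : base roll/pitch/yaw (rad), used as a simple stability signal
    """
    track = np.exp(-np.sum((q - q_ref) ** 2))                 # follow the reference motion
    upright = np.exp(-(base_rpy[0] ** 2 + base_rpy[1] ** 2))  # penalize tipping over
    return w_track * track + w_upright * upright

if __name__ == "__main__":
    q, q_ref = np.zeros(29), np.full(29, 0.05)                # 29-DoF example (G1-like)
    print(tracking_reward(q, q_ref, base_rpy=np.zeros(3)))

In practice such rewards combine many more terms (velocity tracking, action smoothness, contact and torque penalties), but the overall structure, an exponential tracking term plus stability bonuses, is typical of RL-based motion imitation.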

FRoM-W1: Foundational Humanoid Robot Model - Whole-Body Control, Version 1

Pipeline

H-GPT first translates language instructions into motion sequences with Chain-of-Thought (CoT) reasoning. H-ACT then retargets the generated motions to different robot embodiments and executes them on real robots through motion-tracking control policies.
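This data flow can be summarized in a few lines of stubbed Python. Everything below (function names, array shapes, DoF counts, and the zero-filled outputs) is an illustrative placeholder rather than the released FRoM-W1 interface.

import numpy as np

def h_gpt_generate(instruction: str, num_frames: int = 60, num_joints: int = 52) -> np.ndarray:
    """Stage (a), H-GPT: language -> human whole-body motion.
    The real model reasons over the instruction with CoT, then decodes
    motion tokens; this stub just returns zero axis-angle rotations."""
    return np.zeros((num_frames, num_joints, 3))

def retarget(human_motion: np.ndarray, robot: str = "unitree_g1") -> np.ndarray:
    """Map human joints onto a robot-specific kinematic tree.
    A real retargeter solves per-frame IK under the robot's joint limits;
    DoF counts here are approximate and configuration-dependent."""
    robot_dof = {"unitree_g1": 29, "unitree_h1": 19}[robot]
    return np.zeros((human_motion.shape[0], robot_dof))

def track(reference_frame: np.ndarray, observation: np.ndarray) -> np.ndarray:
    """Stage (b), H-ACT: an RL-finetuned tracking policy outputs joint
    targets that follow the reference while keeping the robot balanced.
    Stub: command the reference directly."""
    return reference_frame

if __name__ == "__main__":
    motion = h_gpt_generate("dance hiphop")
    reference = retarget(motion, robot="unitree_g1")
    for frame in reference:                 # control loop (simulation or real robot)
        action = track(frame, observation=frame)

Because retargeting and tracking are decoupled from motion generation, the same generated motion can drive different embodiments or different tracking policies, as the demos below show.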

One instruction for different embodiments

Play Violin

Box

Squat

Unitree G1

Unitree H1

One instruction for different policies

Box (Human2Humanoid)

Box (HugWBC)

Elephant (BeyondMimic)

Elephant (Twist2)

Motion Diversity

BibTeX

@misc{li2026fromw1generalhumanoidwholebody,
      title={FRoM-W1: Towards General Humanoid Whole-Body Control with Language Instructions}, 
      author={Peng Li and Zihan Zhuang and Yangfan Gao and Yi Dong and Sixian Li and Changhao Jiang and Shihan Dou and Zhiheng Xi and Enyu Zhou and Jixuan Huang and Hui Li and Jingjing Gong and Xingjun Ma and Tao Gui and Zuxuan Wu and Qi Zhang and Xuanjing Huang and Yu-Gang Jiang and Xipeng Qiu},
      year={2026},
      eprint={2601.12799},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2601.12799}, 
}