RoboOmni:
Proactive Robot Manipulation in Omni-modal Context

Siyin Wang1,2 , Jinlan Fu3,† , Feihong Liu1, Xinzhe He1, Huangxuan Wu1,
Junhao Shi1,2, Kexin Huang1, Zhaoye Fei1,
Jingjing Gong2, Zuxuan Wu1,2, Yugang Jiang1, See-Kiong Ng3, Tat-Seng Chua3, Xipeng Qiu1,2,†
†Corresponding Author
1Fudan University, 2Shanghai Innovation Institute, 3National University of Singapore

What are Contextual Instructions?

Traditional robot manipulation models typically rely on explicit commands to perform tasks. However, in real-life human-robot interaction, instructions are not always clear-cut. For example, a person might say "I’m thirsty" without explicitly requesting a drink. That statement, combined with environmental sounds (such as the noise of a juicer) and visual cues (such as seeing a Coke can), implies a latent intent that the robot must infer.

In RoboOmni, we introduce contextual instructions, where robots derive intent from a combination of speech, environmental sounds, and visual cues, rather than waiting for direct commands. This is a step beyond traditional approaches that rely on straightforward verbal or written instructions. RoboOmni's ability to infer context from overlapping dialogues, non-verbal sounds, and sentiment allows it to proactively ask clarifying questions, making it more intuitive and responsive in complex scenarios.
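To make this concrete, the minimal Python sketch below shows what a single contextual-instruction episode might contain. The class name, fields, and example values are illustrative assumptions for exposition, not the actual data schema used by RoboOmni.

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ContextualEpisode:
    # A contextual instruction: the command is never stated explicitly;
    # the robot must infer the latent intent from the combined cues.
    speech_transcript: str                                     # e.g. "I'm thirsty" (no direct request)
    ambient_sounds: List[str] = field(default_factory=list)    # e.g. ["juicer running"]
    visual_objects: List[str] = field(default_factory=list)    # e.g. ["Coke can", "empty glass"]
    latent_intent: str = ""                                    # what the robot should infer and do
    clarifying_question: Optional[str] = None                  # asked only when intent stays ambiguous

episode = ContextualEpisode(
    speech_transcript="I'm thirsty",
    ambient_sounds=["juicer running"],
    visual_objects=["Coke can"],
    latent_intent="bring the speaker something to drink",
)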

OmniAction

Proactive robots must infer implicit intent from audio and visual observations, yet existing datasets lack both this combination of modalities (most omit audio entirely) and the inferential instructions needed for intent reasoning. To address this gap, we introduce OmniAction, a large-scale corpus of 140k episodes spanning 5k+ speakers, 2.4k event sounds, 640 backgrounds, and six contextual instruction types, along with OmniAction-LIBERO for simulation-based evaluation.

The OmniAction dataset is constructed through a three-stage pipeline: textual scripting, auditory realization, and verification. First, tasks are sampled from the Open X-Embodiment dataset and transformed into contextual instructions using GPT-4o; this stage filters trivial samples, synthesizes multi-turn dialogues, extends interactions to simulate natural conversation, and validates intent consistency. For auditory realization, the dialogues are rendered into audio with high-fidelity TTS engines such as MOSS-TTSD, CosyVoice, and Gemini-TTS, with multi-speaker simulation and non-verbal event insertion; environmental sounds are also mixed in to reflect household conditions. Finally, data quality is verified through manual evaluation, which confirms that the task intent is recoverable with 98.7% agreement.
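A minimal sketch of how such a three-stage pipeline can be organized is given below. The helper functions are stubs returning placeholder values; the real prompts, TTS calls, and audio mixing (GPT-4o, MOSS-TTSD/CosyVoice/Gemini-TTS) are only indicated in comments, so none of these names should be read as the authors' implementation.

def textual_scripting(task_description: str) -> dict:
    """Stage 1: turn an explicit manipulation task into a contextual
    multi-turn dialogue script (the paper uses GPT-4o for this step)."""
    # Placeholder output; a real pipeline would prompt an LLM to filter
    # trivial samples, synthesize dialogue turns, extend the interaction,
    # and check that the latent intent stays consistent.
    return {
        "task": task_description,
        "dialogue": ["A: I'm so thirsty after that run.",
                     "B: The juicer is still going, give it a minute."],
        "latent_intent": task_description,
    }

def auditory_realization(script: dict) -> dict:
    """Stage 2: render the dialogue as multi-speaker audio (e.g. with
    MOSS-TTSD, CosyVoice, or Gemini-TTS), then overlay non-verbal events
    and household background sounds."""
    script["audio"] = b"\x00" * 16000   # placeholder waveform bytes
    return script

def verification(episode: dict) -> bool:
    """Stage 3: keep only episodes whose task intent is still recoverable
    (done by human annotators in the paper, with 98.7% agreement)."""
    return bool(episode.get("latent_intent")) and "audio" in episode

episodes = [textual_scripting("pick up the coke can and hand it over")]
episodes = [auditory_realization(e) for e in episodes]
episodes = [e for e in episodes if verification(e)]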

RoboOmni

At the heart of RoboOmni lies the Perceiver-Thinker-Talker-Executor architecture, which unifies vision, speech, and environmental sounds into a single framework for robot action execution; a minimal code sketch of this flow follows the component list below.

  • Perceiver: The Perceiver handles the encoding of heterogeneous input modalities (vision, speech, environmental sounds) into a unified embedding space, enabling RoboOmni to process these modalities together.
  • Thinker: The Thinker processes the unified multimodal representations and generates contextually appropriate outputs. It ensures RoboOmni understands and reasons across different input modalities to determine the best action.
  • Talker: The Talker converts the high-level representations into natural speech, enabling RoboOmni to communicate and interact through verbal responses.
  • Executor: The Executor decodes the action tokens from the Thinker and translates them into executable robot actions, enabling RoboOmni to carry out tasks grounded in the given context.
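The sketch below ties the four components together in a single forward pass. It is a toy PyTorch module written under assumed interfaces (feature dimensions, token layout, single-step output heads); the real RoboOmni model is substantially larger and its tokenizers and heads differ.

import torch
import torch.nn as nn

class RoboOmniSketch(nn.Module):
    """Toy Perceiver-Thinker-Talker-Executor flow (illustrative only)."""

    def __init__(self, d_model=512, n_actions=7, speech_vocab=4096):
        super().__init__()
        # Perceiver: modality-specific encoders projected into one shared space.
        self.vision_enc = nn.Linear(768, d_model)    # e.g. precomputed ViT patch features
        self.audio_enc = nn.Linear(128, d_model)     # speech + environmental sound features
        # Thinker: a shared backbone reasoning over the fused token sequence.
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.thinker = nn.TransformerEncoder(layer, num_layers=4)
        # Talker: maps hidden states to speech-token logits for verbal replies.
        self.talker = nn.Linear(d_model, speech_vocab)
        # Executor: decodes the Thinker's output into an executable action vector.
        self.executor = nn.Linear(d_model, n_actions)

    def forward(self, vision_feats, audio_feats):
        # Fuse both modalities into one token sequence, reason over it,
        # then read out a verbal response and a low-level action.
        tokens = torch.cat([self.vision_enc(vision_feats),
                            self.audio_enc(audio_feats)], dim=1)
        hidden = self.thinker(tokens)
        speech_logits = self.talker(hidden[:, -1])
        action = self.executor(hidden[:, -1])
        return speech_logits, action

model = RoboOmniSketch()
speech_logits, action = model(torch.randn(1, 16, 768), torch.randn(1, 32, 128))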

Experiments

Performance on OmniAction-LIBERO Benchmark

We evaluated RoboOmni across the four task suites of the OmniAction-LIBERO benchmark: Spatial, Goal, Object, and Long-Horizon. RoboOmni outperformed all baseline models, including NORA, OpenVLA, and π0, on every suite, demonstrating stronger intent recognition, task execution, and speed.


Evaluation of Proactive Assistance Capabilities

Intent Recognition - RoboOmni excels at recognizing user intent under contextual instructions. In evaluations against baselines such as Qwen2.5-Omni-3B and ASR-based cascades, RoboOmni achieved the highest accuracy at 88.9%. This underscores the value of end-to-end multimodal processing that preserves paralinguistic cues, whereas cascaded pipelines that first transcribe speech with ASR often lose this information, especially in noisy environments.
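The toy example below illustrates why a cascade loses information: once the audio is reduced to a transcript, non-verbal evidence can no longer influence the decision. Everything in it (the rules, the event format, the returned intents) is an invented illustration, not the systems evaluated in the paper.

def asr(audio_events):
    """A cascade's ASR front-end keeps only the spoken words."""
    return " ".join(text for kind, text in audio_events if kind == "speech")

def cascade_intent(audio_events, visible_objects):
    # Tone, speaker identity, and ambient sounds are already gone here.
    transcript = asr(audio_events)
    if "thirsty" in transcript and "Coke can" in visible_objects:
        return "fetch the Coke can"
    return "unclear"

def end_to_end_intent(audio_events, visible_objects):
    # An omni-modal model also hears the non-verbal events: a running
    # juicer suggests a fresh drink is already being prepared.
    sounds = [text for kind, text in audio_events if kind == "sound"]
    if "juicer running" in sounds:
        return "wait for the juice and bring a glass instead"
    return cascade_intent(audio_events, visible_objects)

audio = [("speech", "I'm thirsty"), ("sound", "juicer running")]
print(cascade_intent(audio, ["Coke can"]))      # -> fetch the Coke can
print(end_to_end_intent(audio, ["Coke can"]))   # -> a different, context-aware choice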

Qualitative Analysis - RoboOmni also stands out in its ability to interact proactively. During tests, it integrated speech, environmental sounds, and visual cues to ask clarifying questions, ensuring task execution aligned with user intent. For example, given an ambiguous, incomplete instruction like "egg dumplings," RoboOmni asks, "Would you like me to put the egg dumpling into the hot pot?", a behavior not observed in the baseline models. This proactive clarification keeps RoboOmni from acting on unverified assumptions and lets it execute tasks more accurately based on user feedback.
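A minimal sketch of such a clarify-then-act loop is shown below. The confidence scores, threshold, and dictionary format are assumptions made for illustration; they are not how RoboOmni actually decides when to ask.

def proactive_step(intent_candidates, confidence_threshold=0.8):
    """Ask a clarifying question when the inferred intent is ambiguous;
    otherwise commit to the top-ranked action."""
    best_intent, best_score = max(intent_candidates.items(), key=lambda kv: kv[1])
    if best_score < confidence_threshold:
        return {"speak": f"Would you like me to {best_intent}?", "act": None}
    return {"speak": None, "act": best_intent}

# Ambiguous "egg dumplings" utterance: two plausible intents, low confidence margin.
candidates = {
    "put the egg dumpling into the hot pot": 0.55,
    "take the egg dumplings out of the fridge": 0.45,
}
print(proactive_step(candidates))   # -> asks a clarifying question before acting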

BibTeX

@article{wang25roboomni,
    title={RoboOmni: Proactive Robot Manipulation in Omni-modal Context},
    author={Siyin Wang and Jinlan Fu and Feihong Liu and Xinzhe He and Huangxuan Wu and Junhao Shi and Kexin Huang and Zhaoye Fei and Jingjing Gong and Zuxuan Wu and Yugang Jiang and See-Kiong Ng and Tat-Seng Chua and Xipeng Qiu},
    journal={arXiv preprint arXiv:2510.23763},
    year={2025},
    url={https://arxiv.org/abs/2510.23763},
    archivePrefix={arXiv},
    primaryClass={cs.RO},
}