Proactive robots must infer implicit intent from audio and visual observations, yet existing datasets rarely combine these modalities (most omit audio entirely) or provide the inferential instructions needed for intent reasoning. To address this gap, we introduce OmniAction, a large-scale corpus of 140k episodes spanning 5k+ speakers, 2.4k event sounds, 640 backgrounds, and six contextual instruction types, together with OmniAction-LIBERO for simulation-based evaluation.
The OmniAction dataset is constructed through a three-stage pipeline: textual scripting, auditory realization, and verification. First, tasks are sampled from the Open X-Embodiment dataset and transformed into contextual instructions with GPT-4o; this stage filters trivial samples, synthesizes multi-turn dialogues, extends interactions to simulate natural conversation, and validates intent consistency. Next, auditory realization converts the dialogues into speech using high-fidelity TTS engines such as MOSS-TTSD, CosyVoice, and Gemini-TTS, adds multi-speaker simulation and non-verbal event sounds, and layers in environmental audio reflecting household conditions. Finally, data quality is verified through manual evaluation, confirming that the task intent is recoverable with 98.7% agreement. A sketch of this pipeline is shown below.
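The following is a minimal Python sketch of the three-stage construction pipeline described above. All function names and the stub return values are illustrative placeholders, not the authors' released tooling; only the stage order and the named TTS engines come from the text.

```python
# Hypothetical sketch of the OmniAction construction pipeline
# (textual scripting -> auditory realization -> verification).
import random


def textual_scripting(openx_task: str) -> dict:
    """Stage 1: turn an Open X-Embodiment task into a contextual dialogue script.

    In the paper this stage uses GPT-4o to filter trivial samples, synthesize
    multi-turn dialogues, extend interactions, and validate intent consistency;
    a stub stands in here.
    """
    return {
        "task": openx_task,
        "dialogue": [
            "Speaker A: It's getting dark in here.",
            "Speaker B: And I can't find the switch.",
        ],
        "intent": openx_task,
    }


def auditory_realization(script: dict) -> dict:
    """Stage 2: render the dialogue as audio with TTS, plus non-verbal events
    and household background sounds."""
    tts_engine = random.choice(["MOSS-TTSD", "CosyVoice", "Gemini-TTS"])
    return {
        **script,
        "audio": f"<waveform rendered by {tts_engine}>",  # placeholder
        "events": ["door_knock"],                         # non-verbal event insertion
        "background": "kitchen_ambience",                 # household environment sound
    }


def verification(episode: dict) -> bool:
    """Stage 3: keep the episode only if the task intent is recoverable.
    (The real pipeline relies on manual evaluation.)"""
    return episode.get("intent") is not None


episode = auditory_realization(textual_scripting("turn on the lamp"))
if verification(episode):
    print(episode["intent"], episode["audio"])
```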
At the heart of RoboOmni lies the Perceiver-Thinker-Talker-Executor architecture, which unifies vision, speech, and environmental sound in a single end-to-end framework for robot action execution.
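To make the division of labor concrete, here is a toy PyTorch sketch of how the four roles could be wired together. Module internals, feature dimensions, and the 7-DoF action head are assumptions for illustration only, not the released RoboOmni implementation.

```python
# Illustrative Perceiver-Thinker-Talker-Executor data flow (assumed dimensions).
import torch
import torch.nn as nn

D = 512  # shared embedding width (assumed)


class Perceiver(nn.Module):
    """Encodes vision and audio (speech + environmental sound) into one token stream."""
    def __init__(self):
        super().__init__()
        self.vision_proj = nn.Linear(768, D)  # e.g. ViT patch features (assumed)
        self.audio_proj = nn.Linear(128, D)   # e.g. mel-spectrogram frames (assumed)

    def forward(self, vision_feats, audio_feats):
        return torch.cat(
            [self.vision_proj(vision_feats), self.audio_proj(audio_feats)], dim=1
        )


class Thinker(nn.Module):
    """Reasons over the fused context to infer the implicit task intent."""
    def __init__(self):
        super().__init__()
        layer = nn.TransformerEncoderLayer(D, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, tokens):
        return self.backbone(tokens)


class Talker(nn.Module):
    """Produces a spoken response/confirmation from the inferred intent."""
    def __init__(self):
        super().__init__()
        self.to_speech = nn.Linear(D, 80)  # e.g. mel frames for a vocoder (assumed)

    def forward(self, hidden):
        return self.to_speech(hidden)


class Executor(nn.Module):
    """Decodes low-level robot actions (here a 7-DoF vector, assumed) from the same state."""
    def __init__(self):
        super().__init__()
        self.to_action = nn.Linear(D, 7)

    def forward(self, hidden):
        return self.to_action(hidden.mean(dim=1))


# Toy forward pass with random features standing in for real observations.
vision = torch.randn(1, 64, 768)   # 64 visual tokens
audio = torch.randn(1, 100, 128)   # 100 audio frames
hidden = Thinker()(Perceiver()(vision, audio))
speech, action = Talker()(hidden), Executor()(hidden)
print(speech.shape, action.shape)  # torch.Size([1, 164, 80]) torch.Size([1, 7])
```

The point of the sketch is the routing: a single fused context feeds both the speech head and the action head, so the model can confirm intent verbally while acting on it.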
@article{wang25roboomni,
title={RoboOmni: Proactive Robot Manipulation in Omni-modal Context},
author={Siyin Wang and Jinlan Fu and Feihong Liu and Xinzhe He and Huangxuan Wu and Junhao Shi and Kexin Huang and Zhaoye Fei and Jingjing Gong and Zuxuan Wu and Yugang Jiang and See-Kiong Ng and Tat-Seng Chua and Xipeng Qiu},
journal={arXiv preprint arXiv:2510.23763},
year={2025},
url={https://arxiv.org/abs/2510.23763},
archivePrefix={arXiv},
primaryClass={cs.RO},
}