Logo FutureOmni

Evaluating Future Forecasting from Omni-Modal Context for Multimodal LLMs

Qian Chen1, Jinlan Fu1,3†, Changsong Li1,2, See-Kiong Ng3, Xipeng Qiu1,2†
†Corresponding Author
1Fudan University 2Shanghai Innovation Institute 3National University of Singapore

Introduction

Although Multimodal Large Language Models (MLLMs) demonstrate strong omni-modal perception, their ability to forecast future events from audio–visual cues remains largely unexplored, as existing benchmarks focus mainly on retrospective understanding. To bridge this gap, we introduce FutureOmni, the first benchmark designed to evaluate omni-modal future forecasting from audio–visual environments. The evaluated models are required to perform cross-modal causal and temporal reasoning, as well as effectively leverage internal knowledge to predict future events. FutureOmni is constructed via a scalable LLM-assisted, human-in-the-loop pipeline and contains 919 videos and 1,034 multiple-choice QA pairs across 8 primary domains. Evaluations on 13 omni-modal and 7 video-only models show that current systems struggle with audio-visual future prediction, particularly in speech-heavy scenarios, with the best accuracy of 64.8% achieved by Gemini 3 Flash. To mitigate this limitation, we curate a 7K-sample instruction-tuning dataset and propose an Omni-Modal Future Forecasting (OFF) training strategy. Evaluations on FutureOmni and popular audio-visual and video-only benchmarks demonstrate that OFF enhances future forecasting and generalization.
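Since FutureOmni reports multiple-choice accuracy per domain and an overall average, a minimal scoring sketch may clarify how leaderboard numbers of this kind are computed. This is a hypothetical illustration, not the official evaluation script: the field names (`domain`, `answer`, `prediction`) and the micro-averaged overall accuracy are assumptions.

```python
# Hypothetical scoring sketch for a FutureOmni-style multiple-choice benchmark.
# Field names ("domain", "answer", "prediction") and the micro-averaging choice
# are assumptions, not the official evaluation code.
from collections import defaultdict

def score(samples):
    """Return per-domain accuracy (%) and micro-averaged overall accuracy (%)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for s in samples:
        total[s["domain"]] += 1
        if s["prediction"] == s["answer"]:
            correct[s["domain"]] += 1
    per_domain = {d: 100.0 * correct[d] / total[d] for d in total}
    overall = 100.0 * sum(correct.values()) / sum(total.values())
    return per_domain, overall

# Toy usage with three QA pairs from two domains.
samples = [
    {"domain": "Movie", "answer": "B", "prediction": "B"},
    {"domain": "Movie", "answer": "C", "prediction": "A"},
    {"domain": "Game",  "answer": "D", "prediction": "D"},
]
per_domain, overall = score(samples)
print(per_domain, round(overall, 2))
```

A per-domain breakdown like this is what populates the eight domain columns of the leaderboard below; whether the published Average is micro- or macro-averaged over domains is not specified here.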

Leaderboard

Overall performance on FutureOmni. The benchmark is dedicated exclusively to assessing whether models can predict future states from audio-visual causal logic.

| # | Model | Org | LLM Size | Modality | Cartoon | Education | Emergency | Surveillance | Dailylife | Movie | Game | Documentary | Average |
|---|-------|-----|----------|----------|---------|-----------|-----------|--------------|-----------|-------|------|-------------|---------|
| 1 | Gemini 3 Flash | Google | - | A+V | 62.71 | 75.00 | 58.70 | 80.28 | 68.75 | 59.03 | 65.06 | 53.47 | 64.80 |
| 2 | Gemini 2.5 Pro | Google | - | A+V | 49.15 | 75.00 | 54.35 | 69.01 | 62.50 | 51.54 | 65.06 | 46.53 | 57.93 |
| 3 | Gemini 2.5 Flash | Google | - | A+V | 50.85 | 70.00 | 47.83 | 59.15 | 58.59 | 51.54 | 60.24 | 50.00 | 55.61 |
| 4 | Qwen 3 Omni | Alibaba | 30B | A+V | 52.94 | 68.00 | 32.88 | 62.71 | 59.05 | 45.60 | 62.65 | 49.25 | 53.05 |
| 5 | Claude Haiku 4.5 | Anthropic | - | A+V | 55.08 | 66.00 | 44.57 | 57.04 | 51.56 | 48.90 | 57.83 | 41.67 | 52.03 |
| 6 | GPT-4o | OpenAI | - | V | 44.06 | 65.00 | 34.78 | 57.74 | 52.34 | 50.22 | 51.80 | 36.11 | 49.70 |
| 7 | Qwen3-VL | Alibaba | 30B | V | 41.88 | 66.00 | 43.48 | 59.15 | 53.12 | 41.85 | 61.45 | 39.58 | 49.32 |
| 8 | MiniCPM-o 2.6 | OpenBMB | 8B | A+V | 48.72 | 63.00 | 43.48 | 59.15 | 50.00 | 41.85 | 62.65 | 36.11 | 49.08 |
| 9 | Ola | Tsinghua & Tencent & NTU | 7B | A+V | 44.44 | 62.00 | 42.39 | 64.08 | 47.66 | 41.41 | 59.04 | 37.50 | 48.54 |
| 10 | Qwen 2.5 Omni | Alibaba | 7B | A+V | 47.86 | 55.00 | 35.87 | 59.86 | 48.44 | 40.09 | 61.45 | 40.28 | 47.48 |
| 11 | video-SALMONN 2+ | Tsinghua & ByteDance | 7B | A+V | 50.43 | 61.00 | 39.13 | 55.63 | 52.34 | 40.09 | 54.22 | 33.33 | 47.00 |
| 12 | VideoLLaMA3 | Alibaba | 7B | V | 42.74 | 59.00 | 33.70 | 58.16 | 42.97 | 43.61 | 67.47 | 35.66 | 46.80 |
| 13 | video-SALMONN 2 | Tsinghua & ByteDance | 7B | A+V | 43.59 | 55.00 | 39.13 | 57.04 | 48.44 | 40.97 | 57.83 | 34.72 | 46.03 |
| 14 | Qwen3-VL | Alibaba | 7B | V | 39.32 | 64.00 | 34.78 | 58.45 | 48.44 | 38.33 | 57.83 | 36.11 | 45.84 |
| 15 | Qwen2.5-VL | Alibaba | 7B | V | 43.59 | 58.00 | 30.43 | 52.82 | 48.44 | 37.00 | 53.01 | 34.72 | 43.71 |
| 16 | VideoLLaMA2 | Alibaba | 7B | A+V | 43.59 | 47.00 | 29.35 | 53.52 | 40.62 | 32.60 | 57.83 | 31.94 | 40.75 |
| 17 | LLaVA-NeXT | Wisconsin-Madison & Microsoft & ByteDance | 7B | V | 43.59 | 49.00 | 31.52 | 49.30 | 35.94 | 38.33 | 50.60 | 31.94 | 40.62 |
| 18 | Qwen 2.5 Omni | Alibaba | 3B | A+V | 37.61 | 51.00 | 29.35 | 57.75 | 35.94 | 32.16 | 51.81 | 25.00 | 38.91 |
| 19 | Video-LLaVA | Peking & Peng Cheng & PandaVilla | 7B | V | 39.32 | 47.00 | 33.70 | 41.55 | 42.19 | 32.16 | 44.58 | 29.86 | 37.72 |
| 20 | AVicuna | Rochester & Sony | 7B | A+V | 31.62 | 39.00 | 26.09 | 35.21 | 32.81 | 28.19 | 33.73 | 20.83 | 30.37 |

If you would like to add your model to our leaderboard, please contact qianchen901005@gmail.com.

FutureOmni

Overview

Our FutureOmni has the following five main features:

(1) First Omni-Modal Forecasting Benchmark, (2) Scalable Construction Pipeline, (3) Comprehensive & Original, (4) Challenging Evaluation, (5) OFF Training Strategy.

Distribution of FutureOmni.

(i) Video category hierarchy. (ii) Audio, QA count, and video duration distributions.

Benchmark Curation

Data collection and QA annotation pipelines.

Benchmark Statistics

Statistical comparison with other benchmarks.

Experiment Results

Overall Performance


Overall performance on our FutureOmni.

Fine-grained Results

Fine-grained results on video duration.

Fine-grained results on audio type.

In-depth Analysis

Modality Ablation Results

Omni-Modal Future Forecasting (OFF) Strategy

Fine-grained Audio Performance


Fine-grained Duration Performance


General Capability


Attention score difference visualization

Attention score difference visualization.

Blue denotes the attention score difference on video keyframes; yellow denotes the difference on audio keyframes.
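The visualization above compares how much attention keyframe tokens receive in two models. A minimal sketch of one way to compute such per-token differences is shown below; it is an illustrative assumption, not the paper's implementation. The attention layout (`heads × queries × keys`), the model pair being compared, and the keyframe masks are all hypothetical.

```python
# Hypothetical sketch: average the attention each context token receives in two
# models (e.g. a tuned model vs. its base), subtract, and split the differences
# into video-keyframe and audio-keyframe positions. Shapes and masks are
# illustrative assumptions, not the paper's actual pipeline.

def token_attention(attn):
    """attn: heads x queries x keys nested lists -> mean attention per key token."""
    heads, queries, keys = len(attn), len(attn[0]), len(attn[0][0])
    return [
        sum(attn[h][q][k] for h in range(heads) for q in range(queries))
        / (heads * queries)
        for k in range(keys)
    ]

def attention_difference(attn_tuned, attn_base, is_video):
    """Per-token attention difference, split by keyframe modality."""
    diff = [t - b for t, b in
            zip(token_attention(attn_tuned), token_attention(attn_base))]
    video = [d for d, v in zip(diff, is_video) if v]
    audio = [d for d, v in zip(diff, is_video) if not v]
    return video, audio

# Toy usage: 1 head, 1 query, 4 key tokens (first two assumed video keyframes).
attn_tuned = [[[0.40, 0.10, 0.30, 0.20]]]
attn_base  = [[[0.25, 0.25, 0.25, 0.25]]]
is_video = [True, True, False, False]
video_diff, audio_diff = attention_difference(attn_tuned, attn_base, is_video)
print(video_diff, audio_diff)
```

A positive difference on a keyframe position indicates the first model attends to that frame more than the second, which is the quantity the blue/yellow plot contrasts.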

Citation


@article{hong2025worldsenseevaluatingrealworldomnimodal,
  title={WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs},
  author={Jack Hong and Shilin Yan and Jiayin Cai and Xiaolong Jiang and Yao Hu and Weidi Xie},
  year={2025},
  eprint={2502.04326},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2502.04326},
}