Statistic · Benchmark · Third-party verified
94% (top score)
GPT-5.4 Pro leads the MMMU-Pro multimodal benchmark.
As of May 25, 2026, GPT-5.4 Pro holds the top position on MMMU-Pro at 94%, followed by Claude Mythos Preview at 92.7% and Gemini 3.1 Pro at 83.9%. The top tier has pulled away from the field: four models score above 83%, and the rest cluster between 60% and about 81%. The benchmark covers 30 academic disciplines and was redesigned in 2024 specifically to remove the text-only shortcuts that inflated earlier multimodal scores.
MMMU-Pro top 10 · May 25, 2026
| Rank | Model | Provider | Score | License |
|---|---|---|---|---|
| 1 | GPT-5.4 Pro | OpenAI | 94.0% | Closed |
| 2 | Claude Mythos Preview | Anthropic | 92.7% | Closed |
| 3 | Gemini 3.1 Pro | Google | 83.9% | Closed |
| 4 | Gemini 3.5 Flash | Google | 83.6% | Closed |
| 5 | Gemini 3 Flash | Google | 81.2% | Closed |
| 6 | GPT-5.5 (medium) | OpenAI | 81.0% | Closed |
| 7 | Grok 4.1 | xAI | 79.5% | Closed |
| 8 | Kimi K2 Thinking | Moonshot AI | 76.9% | Open |
| 9 | Qwen3-VL Instruct | Alibaba | 75.8% | Open |
| 10 | Claude Haiku 4.5 | Anthropic | 73.8% | Closed |
MMMU-Pro is a stricter variant of MMMU that filters out questions answerable by text alone, augments the answer set, and tests vision-only inputs. Scores on MMMU-Pro typically drop 17-27 percentage points relative to the original MMMU benchmark for the same model — see the related statistic. Source: BenchLM.ai, accessed 2026-05-26.
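The ranking in the table can be re-derived from the scores alone. The following is a minimal sketch, assuming the ten scores are copied by hand from the table; the function names `rank_by_score` and `largest_gap` are illustrative and are not part of any leaderboard API.

```python
# Top-10 scores copied from the table (percent). Input order does not matter,
# because rank_by_score sorts the entries itself.
LEADERBOARD = [
    ("GPT-5.4 Pro", 94.0),
    ("Claude Mythos Preview", 92.7),
    ("Gemini 3.1 Pro", 83.9),
    ("Gemini 3.5 Flash", 83.6),
    ("GPT-5.5 (medium)", 81.0),
    ("Gemini 3 Flash", 81.2),
    ("Grok 4.1", 79.5),
    ("Kimi K2 Thinking", 76.9),
    ("Qwen3-VL Instruct", 75.8),
    ("Claude Haiku 4.5", 73.8),
]


def rank_by_score(entries):
    """Return (rank, model, score) tuples sorted by score, highest first."""
    ordered = sorted(entries, key=lambda e: e[1], reverse=True)
    return [(i + 1, model, score) for i, (model, score) in enumerate(ordered)]


def largest_gap(ranked):
    """Return (gap, upper_model, lower_model) for the widest drop between adjacent ranks."""
    gaps = [
        (round(upper[2] - lower[2], 1), upper[1], lower[1])
        for upper, lower in zip(ranked, ranked[1:])
    ]
    return max(gaps, key=lambda g: g[0])


ranked = rank_by_score(LEADERBOARD)
gap, upper, lower = largest_gap(ranked)
print(f"Largest gap: {gap} points between {upper} and {lower}")
```

Sorting by score rather than trusting a hand-typed rank column is a cheap consistency check when a leaderboard is transcribed from another site.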
Why this benchmark matters more than the original MMMU
The original MMMU benchmark, released in 2023, became the industry default for multimodal evaluation — and within a year, every major lab had a model scoring 60-70% on it. But researchers noticed an awkward fact: many of the highest-scoring "multimodal" answers could be reached by a text-only model that just guessed cleverly from the question wording. The benchmark was leaking signal through its language.
MMMU-Pro, introduced in late 2024, fixes three failure modes:
- Text shortcut filtering. Questions answerable by a strong text-only model (without seeing the image) are removed.
- Augmented options. The original four-option multiple-choice format is expanded to as many as ten options, which makes guessing and elimination far less effective.
- Vision-only setting. The hardest variant embeds the question text inside the image — forcing the model to genuinely read and integrate visual and textual signal.
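The first two fixes can be illustrated with a short sketch. Everything in it is hypothetical: the `Question` record, the 0.5 majority threshold, and the sample data are assumptions made for illustration, not the procedure or thresholds the benchmark authors used.

```python
from dataclasses import dataclass


@dataclass
class Question:
    qid: str
    answer: str            # gold option label, e.g. "B"
    text_only_preds: list  # answers from text-only models that never saw the image


def is_text_answerable(q, min_correct_fraction=0.5):
    """Flag a question when more than the given share of text-only answers are correct.

    The 0.5 default is an illustrative assumption, not the benchmark's actual rule.
    """
    if not q.text_only_preds:
        return False
    correct = sum(pred == q.answer for pred in q.text_only_preds)
    return correct / len(q.text_only_preds) > min_correct_fraction


def filter_text_shortcuts(questions, min_correct_fraction=0.5):
    """Keep only the questions that text-only models mostly fail."""
    return [q for q in questions if not is_text_answerable(q, min_correct_fraction)]


def chance_accuracy(num_options):
    """Expected accuracy of uniform random guessing on a multiple-choice question."""
    return 1.0 / num_options


# Made-up sample data: q1 is solved by 3 of 4 text-only runs, so it is dropped.
questions = [
    Question("q1", "B", ["B", "B", "A", "B"]),
    Question("q2", "C", ["A", "D", "C", "B"]),
    Question("q3", "A", ["A", "B", "A", "C"]),
]
kept_ids = [q.qid for q in filter_text_shortcuts(questions)]
print(kept_ids)
print(chance_accuracy(4), chance_accuracy(10))  # 10 is an example expanded option count
```

The `chance_accuracy` helper shows why adding options matters: the floor a model can reach by blind guessing falls as the option count grows.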
The result is the clearest currently-public test of whether a model can actually see rather than just answer from priors. On the original MMMU, Claude 3.5 Sonnet scored 68.3%; on MMMU-Pro the same model scored 48.0%. That gap of 20.3 percentage points is a rough measure of how much of the original score rested on text shortcuts and easy guessing rather than on visual reasoning.
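The gap can be checked with a few lines of arithmetic. This is a small sketch using only the two Claude 3.5 Sonnet scores quoted above; the helper name `score_drop` is made up for illustration.

```python
def score_drop(original, stricter):
    """Return the (absolute, relative) drop between two benchmark scores.

    absolute is in percentage points; relative is a percent of the original score.
    Both values are rounded to one decimal place.
    """
    absolute = round(original - stricter, 1)
    relative = round(100 * (original - stricter) / original, 1)
    return absolute, relative


# Claude 3.5 Sonnet: 68.3% on MMMU, 48.0% on MMMU-Pro.
absolute, relative = score_drop(68.3, 48.0)
print(f"Absolute drop: {absolute} points; relative drop: {relative}% of the MMMU score")
```

The absolute figure is the one quoted in the text; the relative figure is added only to show how large the drop is compared with the starting score.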
What's actually changed since 2024
Comparing the late-2024 results with the May 2026 leaderboard, every model in the top 10 is at least one generation newer than the models tested at launch, and the top score has risen from 56% (GPT-4o 0513) to 94% (GPT-5.4 Pro), an absolute gain of 38 percentage points in roughly 18 months. The top-tier models now score above the documented human-expert "medium" rater band on the original MMMU (82.1%), although the human-expert ceiling has not been re-measured on MMMU-Pro.
Sources
- BenchLM.ai MMMU-Pro leaderboard — benchlm.ai/benchmarks/mmmuPro (accessed 2026-05-26)
- Artificial Analysis MMMU-Pro evaluation — artificialanalysis.ai/evaluations/mmmu-pro
- LLM-Stats MMMU-Pro leaderboard — llm-stats.com/benchmarks/mmmu-pro
- Original MMMU-Pro paper — arxiv.org/html/2409.02813v3 (Yue et al., 2024)
Cite this statistic
DAM LLM Research. "MMMU-Pro multimodal benchmark leaderboard, May 2026." damllm.ai, 2026. https://damllm.ai/statistics/mmmu-pro-leaderboard-may-2026/