Google DeepMind · closed · 2026-02

Gemini 3.1 Pro

#4 index 87.7
the take

The reasoning ceiling of this whole list. Highest GPQA here (94.3), a natively multimodal context of roughly a million tokens, and it holds up on real coding (SWE-bench 80.6). The 2x long-context surcharge past 200K is the tax you pay for the ceiling.

specs
Context: 1,048,576 tokens
Input: $2/M tokens
Output: $12/M tokens
Speed: 136.2 tok/s
Modality: text, image, audio, video
strengths
  • Top GPQA (94.3) and ARC-AGI-2
  • Strong SWE-bench (80.6)
  • Native text/image/audio/video, 1M context
weaknesses
  • 2x long-context surcharge past 200K
  • Slower output than Flash tiers
  • Card omits AIME/MMLU-Pro
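
To make the surcharge concrete, here is a rough cost sketch using the card's listed rates ($2/M input, $12/M output). The card only says "2x past 200K", so the billing rule here is an assumption: once the prompt exceeds 200,000 tokens, the whole request (input and output) is charged at double the base rate. The function and constant names are illustrative, not part of any official SDK.

```python
# Rates as listed on the card (USD per million tokens).
INPUT_USD_PER_M = 2.0
OUTPUT_USD_PER_M = 12.0

# ASSUMPTION: the surcharge triggers on prompt size and doubles the
# rate for the entire request. The card does not spell out the rule.
LONG_CONTEXT_THRESHOLD = 200_000
LONG_CONTEXT_MULTIPLIER = 2.0


def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one request under the assumed surcharge rule."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")

    # Base cost at the listed per-million rates.
    base = (
        input_tokens * INPUT_USD_PER_M + output_tokens * OUTPUT_USD_PER_M
    ) / 1_000_000

    # Apply the multiplier only when the prompt passes the threshold.
    if input_tokens > LONG_CONTEXT_THRESHOLD:
        return base * LONG_CONTEXT_MULTIPLIER
    return base


if __name__ == "__main__":
    # A 100K-token prompt stays under the threshold.
    print(f"{estimate_cost(100_000, 10_000):.2f}")  # 0.32
    # A 400K-token prompt is billed at the doubled rate.
    print(f"{estimate_cost(400_000, 10_000):.2f}")  # 1.84
```

Under this assumption, quadrupling the prompt from 100K to 400K tokens raises the bill by nearly 6x, not 4x, which is why the surcharge matters for workloads that lean on the full context window.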

Sources: [1][2]