Kimi K2.5 Benchmark: Complete Performance Analysis vs GPT, Claude & Gemini 2026

Jan 30, 2026

Kimi K2.5 is Moonshot AI’s open-weights, natively multimodal, agentic model. It is further pretrained on approximately 15T mixed visual and text tokens and introduces Agent Swarm (up to 100 sub-agents) as a research preview.

This article uses the official Kimi K2.5 benchmark table as the single numeric source of truth. Any benchmark not reported there is marked as “—” to avoid mixing non-verifiable or incomparable results.

Kimi K2.5 Overview: Architecture and Capabilities

Before diving into benchmark comparisons, let's understand what makes Kimi K2.5 unique:

Model Architecture

| Specification | Details |
| --- | --- |
| Architecture | Mixture-of-Experts (MoE) |
| Total Parameters | 1T |
| Activated Parameters | 32B |
| Context Window | 256K tokens (often “hundreds of pages,” depending on formatting/language) |
| Training Data | ~15T mixed visual + text tokens |
| Attention Mechanism | MLA (Multi-head Latent Attention) |
| Experts | 384 total, 8 selected per token |
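
As a rough illustration of what the expert figures in this table imply, here is a minimal NumPy sketch of top-k expert routing: a router scores every expert for each token and only the 8 highest-scoring experts (out of 384) are executed, which is why only ~32B of the 1T total parameters are active per token. The shapes, names, and `d_model` value below are illustrative assumptions, not Moonshot's implementation.

```python
import numpy as np

def topk_moe_route(token_hidden, router_weights, k=8):
    """Toy top-k MoE routing: score all experts, keep the k best for one token.

    token_hidden:   (d_model,) hidden state for a single token
    router_weights: (n_experts, d_model) router projection
    Returns the chosen expert indices and their softmax-normalized gates.
    """
    logits = router_weights @ token_hidden          # (n_experts,) one score per expert
    top_idx = np.argpartition(logits, -k)[-k:]      # indices of the k highest-scoring experts
    top_logits = logits[top_idx]
    gates = np.exp(top_logits - top_logits.max())
    gates /= gates.sum()                            # mixing weights for the selected experts
    return top_idx, gates

# Illustrative sizes only: 384 experts, 8 active per token (d_model is made up here).
d_model, n_experts = 1024, 384
rng = np.random.default_rng(0)
experts, gates = topk_moe_route(rng.normal(size=d_model),
                                rng.normal(size=(n_experts, d_model)), k=8)
print(experts, gates.round(3))  # 8 expert ids and their mixing weights
```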

Key Capabilities

  • Agent Swarm (preview): up to 100 sub-agents, parallel workflows, up to ~1,500 coordinated tool calls/steps
  • Native multimodality: text + image + video
  • Tool-augmented evaluation: official benchmarks run K2.5 with tools (search, code interpreter, web browsing) for HLE-with-tools and agentic search benchmarks (a minimal sketch of such a loop follows this list)
  • Open-weights: weights + Modified MIT License are publicly available
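
To make “run with tools” concrete, here is a minimal, self-contained sketch of a generic tool-augmented loop: the model proposes a tool call, the harness executes it, and the result is appended to the context until the model produces a final answer. The `fake_model` stub and the tool names are hypothetical placeholders, not Moonshot's evaluation harness or API.

```python
# Hypothetical tool-augmented evaluation loop (stubbed model and tools for illustration).
TOOLS = {
    "search": lambda q: f"[search results for {q!r}]",
    "python": lambda code: f"[stdout of running {code!r}]",
}

def fake_model(transcript):
    """Stand-in for the model: a real evaluation would call the K2.5 API here."""
    if "[search results" not in transcript:
        return {"tool": "search", "arg": "question keywords"}
    return {"answer": "final answer based on tool results"}

def run_with_tools(question, max_steps=10):
    transcript = question
    for _ in range(max_steps):
        step = fake_model(transcript)
        if "answer" in step:                       # model decided it is done
            return step["answer"]
        result = TOOLS[step["tool"]](step["arg"])  # harness executes the requested tool
        transcript += f"\n{step['tool']} -> {result}"
    return "no answer within step budget"

print(run_with_tools("Example HLE-style question"))
```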

Comprehensive Benchmark Results

Summary Table: Kimi K2.5 vs Top Competitors

| Benchmark | Category | Kimi K2.5 | GPT-5.2 | Claude Opus 4.5 | Gemini 3 Pro |
| --- | --- | --- | --- | --- | --- |
| HLE-Full (w/ tools) | Agentic / Tools | 50.2 | 45.5 | 43.2 | 45.8 |
| AIME 2025 | Math | 96.1 | 100.0 | 92.8 | 95.0 |
| HMMT 2025 (Feb) | Contest Math | 95.4 | 99.4 | 92.9* | 97.3* |
| IMO-AnswerBench | Math / Reasoning | 81.8 | 86.3 | 78.5* | 83.1* |
| GPQA-Diamond | Reasoning | 87.6 | 92.4 | 87.0 | 91.9 |
| MMLU-Pro | Knowledge | 87.1 | 86.7* | 89.3* | 90.1 |
| MMMU-Pro | Multimodal | 78.5 | 79.5* | 74.0 | 81.0 |
| MathVision | Vision + Math | 84.2 | 83.0 | 77.1* | 86.1* |
| SWE-Bench Verified | Coding (Agentic) | 76.8 | 80.0 | 80.9 | 76.2 |
| LiveCodeBench (v6) | Coding | 85.0 | — | 82.2* | 87.4* |
| TerminalBench | Tools / Terminal | 50.8 | 46.2 | 54.0 | 46.4 |
| OCRBench | Document OCR | 92.3 | 80.7* | 86.5* | 90.3* |
| OmniDocBench 1.5 | Document Understanding | 88.8 | 85.7 | 84.1* | 87.7* |
| VideoMMMU | Video Understanding | 86.6 | 85.9 | 84.4* | 87.6 |
| LongVideoBench | Long Video | 79.8 | — | — | — |

* “*” indicates re-evaluated / aligned scoring under the official table’s stated conditions. “—” means not reported in the official table.
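
The head-to-head sections that follow can be derived mechanically from this table. The short script below hard-codes the values above (with `None` standing in for “—”) and counts, per competitor, how many shared benchmarks each side wins. It is a convenience for readers re-checking the comparisons, not part of the official evaluation.

```python
# Scores copied from the summary table above; None marks "—" (not reported).
SCORES = {                     #  Kimi   GPT-5.2  Opus 4.5  Gemini 3 Pro
    "HLE-Full (w/ tools)":      (50.2,   45.5,    43.2,     45.8),
    "AIME 2025":                (96.1,  100.0,    92.8,     95.0),
    "HMMT 2025 (Feb)":          (95.4,   99.4,    92.9,     97.3),
    "IMO-AnswerBench":          (81.8,   86.3,    78.5,     83.1),
    "GPQA-Diamond":             (87.6,   92.4,    87.0,     91.9),
    "MMLU-Pro":                 (87.1,   86.7,    89.3,     90.1),
    "MMMU-Pro":                 (78.5,   79.5,    74.0,     81.0),
    "MathVision":               (84.2,   83.0,    77.1,     86.1),
    "SWE-Bench Verified":       (76.8,   80.0,    80.9,     76.2),
    "LiveCodeBench (v6)":       (85.0,   None,    82.2,     87.4),
    "TerminalBench":            (50.8,   46.2,    54.0,     46.4),
    "OCRBench":                 (92.3,   80.7,    86.5,     90.3),
    "OmniDocBench 1.5":         (88.8,   85.7,    84.1,     87.7),
    "VideoMMMU":                (86.6,   85.9,    84.4,     87.6),
    "LongVideoBench":           (79.8,   None,    None,     None),
}
RIVALS = {"GPT-5.2": 1, "Claude Opus 4.5": 2, "Gemini 3 Pro": 3}

for name, col in RIVALS.items():
    wins = losses = 0
    for kimi, *others in SCORES.values():
        rival = others[col - 1]
        if rival is None:
            continue                 # skip benchmarks not reported for this rival
        wins += kimi > rival
        losses += kimi < rival
    print(f"Kimi K2.5 vs {name}: {wins} wins, {losses} losses on shared benchmarks")
```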

Kimi K2.5 vs GPT-5.2

Coding

| Benchmark | Kimi K2.5 | GPT-5.2 | Winner |
| --- | --- | --- | --- |
| SWE-Bench Verified | 76.8 | 80.0 | GPT |
| TerminalBench | 50.8 | 46.2 | Kimi |
| LiveCodeBench (v6) | 85.0 | — | — |

Key Insight: In the official table, GPT-5.2 is slightly higher on SWE-Bench Verified, while Kimi K2.5 leads on TerminalBench, indicating stronger terminal/tool execution performance. LiveCodeBench (v6) is not reported for GPT-5.2 in the same table.

Math & Reasoning

| Benchmark | Kimi K2.5 | GPT-5.2 | Winner |
| --- | --- | --- | --- |
| AIME 2025 | 96.1 | 100.0 | GPT |
| HMMT 2025 (Feb) | 95.4 | 99.4 | GPT |
| IMO-AnswerBench | 81.8 | 86.3 | GPT |
| GPQA-Diamond | 87.6 | 92.4 | GPT |

Key Insight: In the official table, GPT-5.2 leads on the hardest listed math/reasoning benchmarks, while Kimi K2.5 remains close and competitive.

Agentic w/ Tools

| Benchmark | Kimi K2.5 | GPT-5.2 | Winner |
| --- | --- | --- | --- |
| HLE-Full (w/ tools) | 50.2 | 45.5 | Kimi |

Key Insight: Kimi K2.5 leads HLE-Full (w/ tools) by 4.7 points, highlighting strong tool-augmented agentic performance.

Multimodal & Docs

| Benchmark | Kimi K2.5 | GPT-5.2 | Winner |
| --- | --- | --- | --- |
| MMMU-Pro | 78.5 | 79.5* | GPT |
| MathVision | 84.2 | 83.0 | Kimi |
| OCRBench | 92.3 | 80.7* | Kimi |
| OmniDocBench 1.5 | 88.8 | 85.7 | Kimi |
| VideoMMMU | 86.6 | 85.9 | Kimi |

Key Insight: Kimi K2.5 shows clear advantages in document OCR and document understanding, and stays competitive on vision/video reasoning.

Kimi K2.5 vs Gemini 3 Pro

Google’s Gemini series emphasizes multimodality and long context. Here is how the two models compare:

Multimodal Performance

| Benchmark | Kimi K2.5 | Gemini 3 Pro | Winner |
| --- | --- | --- | --- |
| MMMU-Pro | 78.5 | 81.0 | Gemini 3 Pro |
| MathVision | 84.2 | 86.1* | Gemini 3 Pro |
| OCRBench | 92.3 | 90.3* | Kimi K2.5 |
| OmniDocBench 1.5 | 88.8 | 87.7* | Kimi K2.5 |
| VideoMMMU | 86.6 | 87.6 | Gemini 3 Pro |
| LongVideoBench | 79.8 | — | — |

Key Insight: Gemini 3 Pro leads on MMMU-Pro / MathVision / VideoMMMU, while Kimi K2.5 leads on OCRBench / OmniDocBench, making Kimi particularly strong for enterprise document workflows.

Coding and Tools

| Benchmark | Kimi K2.5 | Gemini 3 Pro | Winner |
| --- | --- | --- | --- |
| SWE-Bench Verified | 76.8 | 76.2 | Kimi K2.5 |
| LiveCodeBench (v6) | 85.0 | 87.4* | Gemini 3 Pro |
| TerminalBench | 50.8 | 46.4 | Kimi K2.5 |

Key Insight: Kimi K2.5 is slightly higher on SWE-Bench Verified and clearly higher on TerminalBench, while Gemini 3 Pro leads on LiveCodeBench (v6) in the same official table.

Reasoning and Knowledge

| Benchmark | Kimi K2.5 | Gemini 3 Pro | Winner |
| --- | --- | --- | --- |
| GPQA-Diamond | 87.6 | 91.9 | Gemini 3 Pro |
| MMLU-Pro | 87.1 | 90.1 | Gemini 3 Pro |

Key Insight: Gemini 3 Pro is higher on the official table’s GPQA-Diamond and MMLU-Pro.

Kimi K2.5 vs Claude Opus 4.5

Anthropic’s Claude models are known for strong coding and reasoning. Here is how the two models compare:

Coding and Development Tasks

| Benchmark | Kimi K2.5 | Claude Opus 4.5 | Winner |
| --- | --- | --- | --- |
| SWE-Bench Verified | 76.8 | 80.9 | Claude Opus 4.5 |
| LiveCodeBench (v6) | 85.0 | 82.2* | Kimi K2.5 |
| TerminalBench | 50.8 | 54.0 | Claude Opus 4.5 |

Key Insight: Claude Opus 4.5 leads on SWE-Bench Verified and TerminalBench, while Kimi K2.5 is higher on LiveCodeBench (v6) in the official table.

Reasoning and Knowledge

| Benchmark | Kimi K2.5 | Claude Opus 4.5 | Winner |
| --- | --- | --- | --- |
| GPQA-Diamond | 87.6 | 87.0 | Kimi K2.5 |
| MMLU-Pro | 87.1 | 89.3* | Claude Opus 4.5 |

Key Insight: Kimi K2.5 edges Claude on GPQA-Diamond, while Claude Opus 4.5 leads on MMLU-Pro (noted as re-evaluated “*” in the official table).

Tool Use and Agentic Performance

| Benchmark | Kimi K2.5 | Claude Opus 4.5 | Winner |
| --- | --- | --- | --- |
| HLE-Full (w/ tools) | 50.2 | 43.2 | Kimi K2.5 |

Key Insight: Kimi K2.5 leads Claude Opus 4.5 on HLE-Full (w/ tools), indicating stronger tool-augmented agentic behavior in this benchmark.

Specialized Capability Notes

Kimi’s technical report describes Agent Swarm as a research preview trained with PARL, enabling up to 100 sub-agents and up to ~1,500 tool calls/steps for parallel workflows. These disclosures describe capability direction and evaluation setup, but real-world outcomes can vary by task definition, tool availability, and provider implementation.
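
Purely as an illustration of the orchestration pattern such a swarm implies (a coordinator fans sub-tasks out to concurrent sub-agents and merges their results), here is a hedged Python sketch. The `run_subagent` stub, the sub-task splitting, and the budgets are assumptions for illustration; Moonshot has not published the swarm's actual orchestration code.

```python
from concurrent.futures import ThreadPoolExecutor

def run_subagent(subtask: str, tool_budget: int = 15) -> str:
    """Hypothetical sub-agent: a real swarm would run a full tool loop here
    (the report describes up to ~1,500 coordinated tool calls/steps across agents)."""
    return f"result for {subtask!r} (<= {tool_budget} tool calls)"

def agent_swarm(task: str, n_subagents: int = 8) -> str:
    # Coordinator: split the task, fan out sub-agents in parallel, merge results.
    subtasks = [f"{task} / part {i + 1}" for i in range(n_subagents)]
    with ThreadPoolExecutor(max_workers=n_subagents) as pool:
        results = list(pool.map(run_subagent, subtasks))
    return "\n".join(results)  # a real coordinator would synthesize, not just concatenate

print(agent_swarm("survey recent OCR benchmarks"))
```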

Recommendations by Use Case

Choose Kimi K2.5 When:

  • Document/OCR workflows matter: leads on OCRBench and OmniDocBench
  • Tool-augmented agentic tasks are core: leads on HLE-Full (w/ tools)
  • Open-weights deployment is required: weights + Modified MIT license are public

Choose GPT-5.2 When:

  • Top scores on hard math/reasoning benchmarks are required: leads on AIME 2025 / HMMT / IMO-AnswerBench / GPQA-Diamond
  • Top-end SWE-Bench Verified performance is critical

Choose Claude Opus 4.5 When:

  • Agentic software engineering is the top priority: highest SWE-Bench Verified in the official table
  • Terminal/tool tasks matter: higher TerminalBench in the official table

Choose Gemini 3 Pro When:

  • General multimodal strength is the priority: higher MMMU-Pro / MathVision / VideoMMMU in the official table
  • You need large-context options (validate based on your actual API/product channel)

Conclusion

To make benchmark writing withstand strict fact-checking, the most important rule is consistent sourcing. This article uses the official Kimi K2.5 benchmark table for all numbers and avoids filling gaps with unverified third-party values.

From the official table, Kimi K2.5’s standout strengths are:

  1. Tool-augmented agentic performance: HLE-Full (w/ tools) leads
  2. Document understanding: OCRBench and OmniDocBench lead
  3. Competitive coding and multimodal performance: strong SWE/LiveCode/Video results and close gaps vs top proprietary models

Sources

Kimi K2.5 Team — official Kimi K2.5 benchmark table and technical report
