Kimi K2.5 Benchmark: Complete Performance Analysis vs GPT, Claude & Gemini 2026

Jan 30, 2026

Kimi K2.5 is Moonshot AI’s open-weights, natively multimodal, agentic model. It is continually pretrained on approximately 15T mixed visual and text tokens and introduces Agent Swarm (up to 100 sub-agents) as a research preview.

This article uses the official Kimi K2.5 benchmark table as the single numeric source of truth. Any benchmark not reported there is marked as “—” to avoid mixing non-verifiable or incomparable results.

Kimi K2.5 Overview: Architecture and Capabilities

Before diving into benchmark comparisons, let's understand what makes Kimi K2.5 unique:

Model Architecture

| Specification | Details |
|---|---|
| Architecture | Mixture-of-Experts (MoE) |
| Total Parameters | 1T |
| Activated Parameters | 32B |
| Context Window | 256K tokens (often “hundreds of pages,” depending on formatting/language) |
| Training Data | ~15T mixed visual + text tokens |
| Attention Mechanism | MLA (Multi-head Latent Attention) |
| Experts | 384 total, 8 selected per token |
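
To make the sparsity numbers in the table above concrete, here is a toy sketch (not Moonshot’s actual routing code) of what “32B of 1T parameters active” and “8 of 384 experts selected per token” mean numerically; the random router logits are purely illustrative stand-ins for a learned router:

```python
import numpy as np

# Toy illustration of MoE sparse activation using K2.5's published numbers.
# This is NOT Moonshot's implementation; it only illustrates the arithmetic.
TOTAL_PARAMS = 1_000_000_000_000   # 1T total parameters
ACTIVE_PARAMS = 32_000_000_000     # 32B activated per token
NUM_EXPERTS = 384
TOP_K = 8

print(f"Activated fraction per token: {ACTIVE_PARAMS / TOTAL_PARAMS:.1%}")  # ~3.2%

# Toy router: score every expert for one token, keep the top-8.
rng = np.random.default_rng(0)
router_logits = rng.normal(size=NUM_EXPERTS)   # stand-in for learned router scores
top8 = np.argsort(router_logits)[-TOP_K:]      # indices of the 8 selected experts
weights = np.exp(router_logits[top8])
weights /= weights.sum()                       # softmax over the selected experts
print("Selected experts:", sorted(top8.tolist()))
print("Routing weights:", np.round(weights, 3))
```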

Key Capabilities

  • Agent Swarm (preview): up to 100 sub-agents, parallel workflows, up to ~1,500 coordinated tool calls/steps
  • Native multimodality: text + image + video
  • Tool-augmented evaluation: official benchmarks run K2.5 with tools (search, code interpreter, web browsing) for HLE-with-tools and agentic search benchmarks (a call sketch follows this list)
  • Open-weights: weights + Modified MIT License are publicly available
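
For readers who want to see what a tool-augmented call looks like in practice, here is a minimal sketch assuming K2.5 is served behind an OpenAI-compatible chat endpoint; the base_url, model id, and the "web_search" tool definition are placeholders, not official values, so check Moonshot’s documentation for the real identifiers:

```python
from openai import OpenAI

# Minimal sketch of calling K2.5 with a tool definition through an
# OpenAI-compatible endpoint. base_url, model id, and the tool name are
# assumptions/placeholders, not confirmed values.
client = OpenAI(base_url="https://api.moonshot.example/v1", api_key="YOUR_KEY")

tools = [{
    "type": "function",
    "function": {
        "name": "web_search",  # hypothetical tool name
        "description": "Search the web and return snippets.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

resp = client.chat.completions.create(
    model="kimi-k2.5",  # placeholder model id
    messages=[{"role": "user", "content": "Summarize the latest HLE results."}],
    tools=tools,
)

# If the model decides to call the tool, the requested arguments appear here.
print(resp.choices[0].message.tool_calls)
```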

Comprehensive Benchmark Results

Summary Table: Kimi K2.5 vs Top Competitors

| Benchmark | Category | Kimi K2.5 | GPT-5.2 | Claude Opus 4.5 | Gemini 3 Pro |
|---|---|---|---|---|---|
| HLE-Full (w/ tools) | Agentic / Tools | 50.2 | 45.5 | 43.2 | 45.8 |
| AIME 2025 | Math | 96.1 | 100.0 | 92.8 | 95.0 |
| HMMT 2025 (Feb) | Contest Math | 95.4 | 99.4 | 92.9* | 97.3* |
| IMO-AnswerBench | Math / Reasoning | 81.8 | 86.3 | 78.5* | 83.1* |
| GPQA-Diamond | Reasoning | 87.6 | 92.4 | 87.0 | 91.9 |
| MMLU-Pro | Knowledge | 87.1 | 86.7* | 89.3* | 90.1 |
| MMMU-Pro | Multimodal | 78.5 | 79.5* | 74.0 | 81.0 |
| MathVision | Vision + Math | 84.2 | 83.0 | 77.1* | 86.1* |
| SWE-Bench Verified | Coding (Agentic) | 76.8 | 80.0 | 80.9 | 76.2 |
| LiveCodeBench (v6) | Coding | 85.0 | — | 82.2* | 87.4* |
| TerminalBench | Tools / Terminal | 50.8 | 46.2 | 54.0 | 46.4 |
| OCRBench | Document OCR | 92.3 | 80.7* | 86.5* | 90.3* |
| OmniDocBench 1.5 | Document Understanding | 88.8 | 85.7 | 84.1* | 87.7* |
| VideoMMMU | Video Understanding | 86.6 | 85.9 | 84.4* | 87.6 |
| LongVideoBench | Long Video | 79.8 | — | — | — |

* Scores marked “*” are re-evaluated / aligned under the official table’s stated conditions; “—” means not reported in the official table.

Kimi K2.5 vs GPT 5.2

Coding

| Benchmark | Kimi K2.5 | GPT-5.2 | Winner |
|---|---|---|---|
| SWE-Bench Verified | 76.8 | 80.0 | GPT-5.2 |
| TerminalBench | 50.8 | 46.2 | Kimi K2.5 |
| LiveCodeBench (v6) | 85.0 | — | — |

Key Insight: GPT-5.2 is slightly higher on SWE-Bench Verified in the official table, while Kimi K2.5 leads on TerminalBench, indicating stronger terminal/tool execution performance. LiveCodeBench (v6) is not reported for GPT-5.2 in the same official table.

Math & Reasoning

| Benchmark | Kimi K2.5 | GPT-5.2 | Winner |
|---|---|---|---|
| AIME 2025 | 96.1 | 100.0 | GPT-5.2 |
| HMMT 2025 (Feb) | 95.4 | 99.4 | GPT-5.2 |
| IMO-AnswerBench | 81.8 | 86.3 | GPT-5.2 |
| GPQA-Diamond | 87.6 | 92.4 | GPT-5.2 |

Key Insight: In the official table, GPT-5.2 leads on the hardest listed math/reasoning benchmarks, while Kimi K2.5 remains close and competitive.

Agentic w/ Tools

| Benchmark | Kimi K2.5 | GPT-5.2 | Winner |
|---|---|---|---|
| HLE-Full (w/ tools) | 50.2 | 45.5 | Kimi K2.5 |

Key Insight: Kimi K2.5 leads HLE-Full (w/ tools) by 4.7 points, highlighting strong tool-augmented agentic performance.

Multimodal & Docs

| Benchmark | Kimi K2.5 | GPT-5.2 | Winner |
|---|---|---|---|
| MMMU-Pro | 78.5 | 79.5* | GPT-5.2 |
| MathVision | 84.2 | 83.0 | Kimi K2.5 |
| OCRBench | 92.3 | 80.7* | Kimi K2.5 |
| OmniDocBench 1.5 | 88.8 | 85.7 | Kimi K2.5 |
| VideoMMMU | 86.6 | 85.9 | Kimi K2.5 |

Key Insight: Kimi K2.5 shows clear advantages in document OCR and document understanding, and stays competitive on vision/video reasoning.

Kimi K2.5 vs Gemini 3 Pro

Google’s Gemini series emphasizes multimodality and long context. Here is how the two models compare on the official table:

Multimodal Performance

| Benchmark | Kimi K2.5 | Gemini 3 Pro | Winner |
|---|---|---|---|
| MMMU-Pro | 78.5 | 81.0 | Gemini 3 Pro |
| MathVision | 84.2 | 86.1* | Gemini 3 Pro |
| OCRBench | 92.3 | 90.3* | Kimi K2.5 |
| OmniDocBench 1.5 | 88.8 | 87.7* | Kimi K2.5 |
| VideoMMMU | 86.6 | 87.6 | Gemini 3 Pro |
| LongVideoBench | 79.8 | — | — |

Key Insight: Gemini 3 Pro leads on MMMU-Pro / MathVision / VideoMMMU, while Kimi K2.5 leads on OCRBench / OmniDocBench, making Kimi particularly strong for enterprise document workflows.

Coding and Tools

| Benchmark | Kimi K2.5 | Gemini 3 Pro | Winner |
|---|---|---|---|
| SWE-Bench Verified | 76.8 | 76.2 | Kimi K2.5 |
| LiveCodeBench (v6) | 85.0 | 87.4* | Gemini 3 Pro |
| TerminalBench | 50.8 | 46.4 | Kimi K2.5 |

Key Insight: Kimi K2.5 is slightly higher on SWE-Bench Verified and clearly higher on TerminalBench, while Gemini 3 Pro leads on LiveCodeBench (v6) in the same official table.

Reasoning and Knowledge

| Benchmark | Kimi K2.5 | Gemini 3 Pro | Winner |
|---|---|---|---|
| GPQA-Diamond | 87.6 | 91.9 | Gemini 3 Pro |
| MMLU-Pro | 87.1 | 90.1 | Gemini 3 Pro |

Key Insight: Gemini 3 Pro is higher on the official table’s GPQA-Diamond and MMLU-Pro.

Kimi K2.5 vs Claude Opus 4.5

Anthropic’s Claude models are known for strong coding and reasoning. Here is how the two models compare on the official table:

Coding and Development Tasks

| Benchmark | Kimi K2.5 | Claude Opus 4.5 | Winner |
|---|---|---|---|
| SWE-Bench Verified | 76.8 | 80.9 | Claude Opus 4.5 |
| LiveCodeBench (v6) | 85.0 | 82.2* | Kimi K2.5 |
| TerminalBench | 50.8 | 54.0 | Claude Opus 4.5 |

Key Insight: Claude Opus 4.5 leads on SWE-Bench Verified and TerminalBench, while Kimi K2.5 is higher on LiveCodeBench (v6) in the official table.

Reasoning and Knowledge

| Benchmark | Kimi K2.5 | Claude Opus 4.5 | Winner |
|---|---|---|---|
| GPQA-Diamond | 87.6 | 87.0 | Kimi K2.5 |
| MMLU-Pro | 87.1 | 89.3* | Claude Opus 4.5 |

Key Insight: Kimi K2.5 edges Claude on GPQA-Diamond, while Claude Opus 4.5 leads on MMLU-Pro (noted as re-evaluated “*” in the official table).

Tool Use and Agentic Performance

| Benchmark | Kimi K2.5 | Claude Opus 4.5 | Winner |
|---|---|---|---|
| HLE-Full (w/ tools) | 50.2 | 43.2 | Kimi K2.5 |

Key Insight: Kimi K2.5 leads Claude Opus 4.5 on HLE-Full (w/ tools), indicating stronger tool-augmented agentic behavior in this benchmark.

Specialized Capability Notes

Kimi’s technical report describes Agent Swarm as a research preview trained with PARL, enabling up to 100 sub-agents and up to ~1,500 tool calls/steps for parallel workflows. These disclosures describe capability direction and evaluation setup, but real-world outcomes can vary by task definition, tool availability, and provider implementation.
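
To illustrate the orchestration shape behind such parallel workflows, here is a minimal fan-out/fan-in sketch. It is not the Agent Swarm API and does not reflect PARL training; `call_subagent` is a hypothetical stand-in for whatever call actually runs a sub-agent, and the concurrency cap is arbitrary:

```python
import asyncio

# Illustrative fan-out/fan-in pattern for parallel sub-agent workflows.
# This is NOT Moonshot's Agent Swarm implementation; call_subagent is a
# placeholder for a real model/tool invocation.
async def call_subagent(task: str) -> str:
    await asyncio.sleep(0.1)  # placeholder for a real model or tool call
    return f"result for: {task}"

async def run_swarm(tasks: list[str], max_parallel: int = 10) -> list[str]:
    sem = asyncio.Semaphore(max_parallel)  # cap concurrent sub-agents

    async def bounded(task: str) -> str:
        async with sem:
            return await call_subagent(task)

    # Fan out all sub-tasks, then gather the results in order.
    return await asyncio.gather(*(bounded(t) for t in tasks))

if __name__ == "__main__":
    subtasks = [f"analyze document {i}" for i in range(25)]
    results = asyncio.run(run_swarm(subtasks))
    print(len(results), "sub-agent results collected")
```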

Recommendations by Use Case

Choose Kimi K2.5 When:

  • Document/OCR workflows matter: leads on OCRBench and OmniDocBench (see the sketch after this list)
  • Tool-augmented agentic tasks are core: leads on HLE-Full (w/ tools)
  • Open-weights deployment is required: weights + Modified MIT license are public
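
As a concrete example of the document workflow case, here is a minimal sketch of sending a scanned page to K2.5 for structured extraction, again assuming an OpenAI-compatible multimodal chat endpoint; the base_url, model id, and file name are placeholders, not confirmed values:

```python
import base64
from openai import OpenAI

# Minimal document-extraction sketch. base_url and model id are
# placeholders/assumptions; confirm the real values in Moonshot's docs.
client = OpenAI(base_url="https://api.moonshot.example/v1", api_key="YOUR_KEY")

with open("invoice_page1.png", "rb") as f:          # hypothetical input file
    image_b64 = base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="kimi-k2.5",                               # placeholder model id
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Extract vendor, date, and line items as JSON."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(resp.choices[0].message.content)
```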

Choose GPT-5.2 When:

  • Hard math and reasoning performance is the top priority: leads on AIME 2025 / HMMT / IMO-AnswerBench / GPQA-Diamond
  • Top-end SWE-Bench Verified performance is critical

Choose Claude Opus 4.5 When:

  • Agentic software engineering is the top priority: highest SWE-Bench Verified in the official table
  • Terminal/tool tasks matter: higher TerminalBench in the official table

Choose Gemini 3 Pro When:

  • General multimodal strength is the priority: higher MMMU-Pro / MathVision / VideoMMMU in the official table
  • You need large-context options (validate based on your actual API/product channel)

Conclusion

For benchmark reporting to withstand strict fact-checking, the most important rule is consistent sourcing. This article uses the official Kimi K2.5 benchmark table for all numbers and avoids filling gaps with unverified third-party values.

From the official table, Kimi K2.5’s standout strengths are:

  1. Tool-augmented agentic performance: HLE-Full (w/ tools) leads
  2. Document understanding: OCRBench and OmniDocBench lead
  3. Competitive coding and multimodal performance: strong SWE/LiveCode/Video results and close gaps vs top proprietary models

Sources

Official Kimi K2.5 benchmark table and technical report (Moonshot AI)