Kimi K2.5 is Moonshot AI’s open-weights, natively multimodal, agentic model. It extends pretraining with approximately 15T mixed visual and text tokens and introduces Agent Swarm (up to 100 sub-agents) as a research preview.
This article uses the official Kimi K2.5 benchmark table as the single numeric source of truth. Any benchmark not reported there is marked as “—” to avoid mixing non-verifiable or incomparable results.
Kimi K2.5 Overview: Architecture and Capabilities
Before diving into benchmark comparisons, let's understand what makes Kimi K2.5 unique:
Model Architecture
| Specification | Details |
|---|---|
| Architecture | Mixture-of-Experts (MoE) |
| Total Parameters | 1T |
| Activated Parameters | 32B |
| Context Window | 256K tokens (often “hundreds of pages,” depending on formatting/language) |
| Training Data | ~15T mixed visual + text tokens |
| Attention Mechanism | MLA (Multi-head Latent Attention) |
| Experts | 384 total, 8 selected per token |
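To make the routing numbers above concrete, here is a minimal, illustrative sparse-MoE layer in PyTorch that selects 8 of 384 experts per token. It is a sketch under stated assumptions, not Moonshot’s implementation: the hidden sizes, gating function, and expert FFN shape are placeholders, and MLA attention is not shown.

```python
# Illustrative sparse-MoE layer: 8 of 384 experts run per token (figures from the table above).
# NOT Moonshot's implementation: hidden sizes, gating, and expert shapes are assumptions,
# and MLA attention is not shown.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoELayer(nn.Module):
    def __init__(self, d_model=64, d_ff=128, n_experts=384, top_k=8):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)  # gating network
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                                     # x: [tokens, d_model]
        gate_logits = self.router(x)                          # [tokens, n_experts]
        weights, idx = gate_logits.topk(self.top_k, dim=-1)   # pick 8 experts per token
        weights = F.softmax(weights, dim=-1)                  # normalize over the selected experts only
        out = torch.zeros_like(x)
        for t in range(x.size(0)):        # naive per-token loop; real systems batch tokens by expert
            for slot in range(self.top_k):
                expert = self.experts[idx[t, slot].item()]
                out[t] += weights[t, slot] * expert(x[t])
        return out

# Only the selected experts' weights run for a given token, which is how a 1T-parameter
# model can activate roughly 32B parameters per forward pass.
x = torch.randn(4, 64)
print(TinyMoELayer()(x).shape)  # torch.Size([4, 64])
```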
Key Capabilities
- Agent Swarm (preview): up to 100 sub-agents, parallel workflows, up to ~1,500 coordinated tool calls/steps
- Native multimodality: text + image + video
- Tool-augmented evaluation: official scores for HLE (w/ tools) and the agentic search benchmarks are reported with tools enabled (search, code interpreter, web browsing); a minimal tool-calling sketch follows this list
- Open weights: model weights are publicly available under a Modified MIT License
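For the tool-augmented setting, the sketch below shows a bounded tool-calling loop against an OpenAI-compatible endpoint, a common way open-weights models are self-hosted. The base URL, model name (`kimi-k2.5`), and the `web_search` tool are placeholders for illustration, not official values or Moonshot’s evaluation harness.

```python
# Minimal tool-augmented loop against an OpenAI-compatible endpoint.
# The base URL, model name, and web_search tool are placeholders, not official values.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # hypothetical self-hosted server

def web_search(query: str) -> str:
    """Stand-in tool; swap in a real search backend."""
    return f"(stub) top results for: {query}"

tools = [{
    "type": "function",
    "function": {
        "name": "web_search",
        "description": "Search the web and return result snippets.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

messages = [{"role": "user", "content": "Find recent results on sparse MoE routing."}]
for _ in range(5):  # bounded agent loop: stop after 5 model turns
    resp = client.chat.completions.create(model="kimi-k2.5", messages=messages, tools=tools)
    msg = resp.choices[0].message
    if not msg.tool_calls:          # model answered directly: done
        print(msg.content)
        break
    messages.append(msg)            # keep the assistant turn that requested tools
    for call in msg.tool_calls:
        args = json.loads(call.function.arguments)
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": web_search(**args),   # execute the requested tool and return its output
        })
```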
Comprehensive Benchmark Results
Summary Table: Kimi K2.5 vs Top Competitors
| Benchmark | Category | Kimi K2.5 | GPT-5.2 | Claude Opus 4.5 | Gemini 3 Pro |
|---|---|---|---|---|---|
| HLE-Full (w/ tools) | Agentic / Tools | 50.2 | 45.5 | 43.2 | 45.8 |
| AIME 2025 | Math | 96.1 | 100.0 | 92.8 | 95.0 |
| HMMT 2025 (Feb) | Contest Math | 95.4 | 99.4 | 92.9* | 97.3* |
| IMO-AnswerBench | Math / Reasoning | 81.8 | 86.3 | 78.5* | 83.1* |
| GPQA-Diamond | Reasoning | 87.6 | 92.4 | 87.0 | 91.9 |
| MMLU-Pro | Knowledge | 87.1 | 86.7* | 89.3* | 90.1 |
| MMMU-Pro | Multimodal | 78.5 | 79.5* | 74.0 | 81.0 |
| MathVision | Vision + Math | 84.2 | 83.0 | 77.1* | 86.1* |
| SWE-Bench Verified | Coding (Agentic) | 76.8 | 80.0 | 80.9 | 76.2 |
| LiveCodeBench (v6) | Coding | 85.0 | — | 82.2* | 87.4* |
| TerminalBench | Tools / Terminal | 50.8 | 46.2 | 54.0 | 46.4 |
| OCRBench | Document OCR | 92.3 | 80.7* | 86.5* | 90.3* |
| OmniDocBench 1.5 | Document Understanding | 88.8 | 85.7 | 84.1* | 87.7* |
| VideoMMMU | Video Understanding | 86.6 | 85.9 | 84.4* | 87.6 |
| LongVideoBench | Long Video | 79.8 | — | — | — |
Notes: “*” indicates re-evaluated / aligned scoring under the conditions stated in the official table; “—” means the result is not reported there.
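As a sanity check on the summary table, the snippet below recomputes per-benchmark leaders from a subset of the rows above, treating “—” entries as missing. The numbers are copied verbatim from the table; nothing new is introduced.

```python
# Sanity check: recompute per-benchmark leaders from the summary table above.
# Numbers are copied verbatim from the table; "—" entries become None (missing).
scores = {
    "HLE-Full (w/ tools)": {"Kimi K2.5": 50.2, "GPT-5.2": 45.5, "Claude Opus 4.5": 43.2, "Gemini 3 Pro": 45.8},
    "SWE-Bench Verified":  {"Kimi K2.5": 76.8, "GPT-5.2": 80.0, "Claude Opus 4.5": 80.9, "Gemini 3 Pro": 76.2},
    "TerminalBench":       {"Kimi K2.5": 50.8, "GPT-5.2": 46.2, "Claude Opus 4.5": 54.0, "Gemini 3 Pro": 46.4},
    "LiveCodeBench (v6)":  {"Kimi K2.5": 85.0, "GPT-5.2": None, "Claude Opus 4.5": 82.2, "Gemini 3 Pro": 87.4},
    "OCRBench":            {"Kimi K2.5": 92.3, "GPT-5.2": 80.7, "Claude Opus 4.5": 86.5, "Gemini 3 Pro": 90.3},
}

for bench, row in scores.items():
    reported = {model: s for model, s in row.items() if s is not None}
    leader = max(reported, key=reported.get)
    runner_up = sorted(reported.values())[-2]
    print(f"{bench}: {leader} leads by {reported[leader] - runner_up:.1f} points over the runner-up")
```

For example, this reports Kimi K2.5 ahead on HLE-Full (w/ tools) by 4.4 points over the runner-up (Gemini 3 Pro), consistent with the 4.7-point gap to GPT-5.2 discussed below.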
Kimi K2.5 vs GPT-5.2
Coding
| Benchmark | Kimi K2.5 | GPT-5.2 | Winner |
|---|---|---|---|
| SWE-Bench Verified | 76.8 | 80.0 | GPT-5.2 |
| TerminalBench | 50.8 | 46.2 | Kimi K2.5 |
| LiveCodeBench (v6) | 85.0 | — | — |
Key Insight: GPT-5.2 is slightly higher on SWE-Bench Verified in the official table, while Kimi K2.5 leads on TerminalBench, indicating stronger terminal/tool execution. LiveCodeBench (v6) is not reported for GPT-5.2 in the same table.
Math & Reasoning
| Benchmark | Kimi K2.5 | GPT-5.2 | Winner |
|---|---|---|---|
| AIME 2025 | 96.1 | 100.0 | GPT-5.2 |
| HMMT 2025 (Feb) | 95.4 | 99.4 | GPT-5.2 |
| IMO-AnswerBench | 81.8 | 86.3 | GPT-5.2 |
| GPQA-Diamond | 87.6 | 92.4 | GPT-5.2 |
Key Insight: In the official table, GPT-5.2 leads on the hardest listed math/reasoning benchmarks, while Kimi K2.5 remains close and competitive.
Agentic w/ Tools
| Benchmark | Kimi K2.5 | GPT-5.2 | Winner |
|---|---|---|---|
| HLE-Full (w/ tools) | 50.2 | 45.5 | Kimi |
Key Insight: Kimi K2.5 leads HLE-Full (w/ tools) by 4.7 points, highlighting strong tool-augmented agentic performance.
Multimodal & Docs
| Benchmark | Kimi K2.5 | GPT-5.2 | Winner |
|---|---|---|---|
| MMMU-Pro | 78.5 | 79.5* | GPT-5.2 |
| MathVision | 84.2 | 83.0 | Kimi |
| OCRBench | 92.3 | 80.7* | Kimi |
| OmniDocBench 1.5 | 88.8 | 85.7 | Kimi |
| VideoMMMU | 86.6 | 85.9 | Kimi |
Key Insight: Kimi K2.5 shows clear advantages in document OCR and document understanding, and stays competitive on vision/video reasoning.
Kimi K2.5 vs Gemini 3 Pro
Google’s Gemini series emphasizes multimodality and long context. Here is how the two models compare on the official table:
Multimodal Performance
| Benchmark | Kimi K2.5 | Gemini 3 Pro | Winner |
|---|---|---|---|
| MMMU-Pro | 78.5 | 81.0 | Gemini 3 Pro |
| MathVision | 84.2 | 86.1* | Gemini 3 Pro |
| OCRBench | 92.3 | 90.3* | Kimi K2.5 |
| OmniDocBench 1.5 | 88.8 | 87.7* | Kimi K2.5 |
| VideoMMMU | 86.6 | 87.6 | Gemini 3 Pro |
| LongVideoBench | 79.8 | — | — |
Key Insight: Gemini 3 Pro leads on MMMU-Pro / MathVision / VideoMMMU, while Kimi K2.5 leads on OCRBench / OmniDocBench, making Kimi particularly strong for enterprise document workflows.
Coding and Tools
| Benchmark | Kimi K2.5 | Gemini 3 Pro | Winner |
|---|---|---|---|
| SWE-Bench Verified | 76.8 | 76.2 | Kimi K2.5 |
| LiveCodeBench (v6) | 85.0 | 87.4* | Gemini 3 Pro |
| TerminalBench | 50.8 | 46.4 | Kimi K2.5 |
Key Insight: Kimi K2.5 is slightly higher on SWE-Bench Verified and clearly higher on TerminalBench, while Gemini 3 Pro leads on LiveCodeBench (v6) in the same official table.
Reasoning and Knowledge
| Benchmark | Kimi K2.5 | Gemini 3 Pro | Winner |
|---|---|---|---|
| GPQA-Diamond | 87.6 | 91.9 | Gemini 3 Pro |
| MMLU-Pro | 87.1 | 90.1 | Gemini 3 Pro |
Key Insight: Gemini 3 Pro is higher on the official table’s GPQA-Diamond and MMLU-Pro.
Kimi K2.5 vs Claude Opus 4.5
Anthropic’s Claude models are known for strong coding and reasoning. Here is how the two models compare on the official table:
Coding and Development Tasks
| Benchmark | Kimi K2.5 | Claude Opus 4.5 | Winner |
|---|---|---|---|
| SWE-Bench Verified | 76.8 | 80.9 | Claude Opus 4.5 |
| LiveCodeBench (v6) | 85.0 | 82.2* | Kimi K2.5 |
| TerminalBench | 50.8 | 54.0 | Claude Opus 4.5 |
Key Insight: Claude Opus 4.5 leads on SWE-Bench Verified and TerminalBench, while Kimi K2.5 is higher on LiveCodeBench (v6) in the official table.
Reasoning and Knowledge
| Benchmark | Kimi K2.5 | Claude Opus 4.5 | Winner |
|---|---|---|---|
| GPQA-Diamond | 87.6 | 87.0 | Kimi K2.5 |
| MMLU-Pro | 87.1 | 89.3* | Claude Opus 4.5 |
Key Insight: Kimi K2.5 edges Claude on GPQA-Diamond, while Claude Opus 4.5 leads on MMLU-Pro (noted as re-evaluated “*” in the official table).
Tool Use and Agentic Performance
| Benchmark | Kimi K2.5 | Claude Opus 4.5 | Winner |
|---|---|---|---|
| HLE-Full (w/ tools) | 50.2 | 43.2 | Kimi K2.5 |
Key Insight: Kimi K2.5 leads Claude Opus 4.5 on HLE-Full (w/ tools), indicating stronger tool-augmented agentic behavior in this benchmark.
Specialized Capability Notes
Kimi’s technical report describes Agent Swarm as a research preview trained with PARL, enabling up to 100 sub-agents and up to ~1,500 tool calls/steps for parallel workflows. These disclosures describe capability direction and evaluation setup, but real-world outcomes can vary by task definition, tool availability, and provider implementation.
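As an illustration of what a parallel sub-agent workflow can look like, here is a generic fan-out/fan-in sketch in Python. It is not Kimi’s Agent Swarm or PARL: the sub-agent call is a stub, and the concurrency and step ceilings below simply mirror the figures quoted from the technical report.

```python
# Generic fan-out/fan-in pattern for parallel sub-agents.
# NOT Kimi's Agent Swarm or PARL: the sub-agent is a stub, and the limits below
# only mirror the figures quoted from the technical report.
import asyncio

MAX_SUBAGENTS = 100      # sub-agent ceiling cited in the report
MAX_TOTAL_STEPS = 1500   # approximate coordinated tool calls/steps cited in the report

async def run_subagent(task: str, step_budget: int) -> str:
    """Stub: replace with a real model + tool-calling loop bounded by step_budget."""
    await asyncio.sleep(0)          # placeholder for model and tool round-trips
    return f"[{step_budget} steps max] result for: {task}"

async def orchestrate(subtasks: list[str]) -> list[str]:
    subtasks = subtasks[:MAX_SUBAGENTS]                 # cap the swarm size
    budget = MAX_TOTAL_STEPS // max(len(subtasks), 1)   # split the global step budget
    return await asyncio.gather(*(run_subagent(t, budget) for t in subtasks))

if __name__ == "__main__":
    shards = [f"analyze document shard {i}" for i in range(8)]
    for result in asyncio.run(orchestrate(shards)):
        print(result)
```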
Recommendations by Use Case
Choose Kimi K2.5 When:
- Document/OCR workflows matter: leads on OCRBench and OmniDocBench
- Tool-augmented agentic tasks are core: leads on HLE-Full (w/ tools)
- Open-weights deployment is required: weights are publicly available under a Modified MIT License (a minimal download sketch follows this list)
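For the open-weights path, a minimal sketch of pulling the configuration and license from the Hugging Face repository linked in the sources is shown below; the choice of serving engine (and its support for this architecture) should be validated against the model card rather than assumed here.

```python
# Sketch: fetch the config and Modified MIT license ahead of a self-hosted deployment.
# The repo id comes from the model-card URL in the sources; the serving engine and
# hardware requirements are NOT addressed here and should follow the model card.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="moonshotai/Kimi-K2.5",          # repo id taken from the sources below
    allow_patterns=["*.json", "LICENSE*"],   # config + license only; weight shards are fetched separately
)
print("Config and license downloaded to:", local_dir)
```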
Choose GPT-5.2 When:
- Maximum hard math/reasoning is required: leads on AIME 2025 / GPQA-Diamond / HMMT / IMO-AnswerBench
- Near-top SWE-Bench Verified performance is needed (80.0, second only to Claude Opus 4.5 in the official table)
Choose Claude Opus 4.5 When:
- Agentic software engineering is the top priority: highest SWE-Bench Verified in the official table
- Terminal/tool tasks matter: highest TerminalBench score in the official table
Choose Gemini 3 Pro When:
- General multimodal strength is the priority: higher MMMU-Pro / MathVision / VideoMMMU in the official table
- You need large-context options (validate based on your actual API/product channel)
Conclusion
To make benchmark writing withstand strict fact-checking, the most important rule is consistent sourcing. This article uses the official Kimi K2.5 benchmark table for all numbers and avoids filling gaps with unverified third-party values.
From the official table, Kimi K2.5’s standout strengths are:
- Tool-augmented agentic performance: HLE-Full (w/ tools) leads
- Document understanding: OCRBench and OmniDocBench lead
- Competitive coding and multimodal performance: strong SWE-Bench, LiveCodeBench, and video results, with close gaps to the top proprietary models
Sources
- Official Kimi K2.5 benchmark table (NVIDIA Model Card): https://build.nvidia.com/moonshotai/kimi-k2.5/modelcard
- Hugging Face Model Card (tools/notes/license): https://huggingface.co/moonshotai/Kimi-K2.5
- Kimi K2.5 Technical Report (Agent Swarm / PARL): https://www.kimi.com/blog/kimi-k2-5.html
- OpenAI pricing: https://platform.openai.com/docs/pricing
- Kimi K2.5 LICENSE (Modified MIT): https://huggingface.co/moonshotai/Kimi-K2.5/blob/main/LICENSE