Kimi K2.6 Benchmark: Results vs GPT-5.4, Claude, Gemini, and K2.5

Apr 21, 2026


I'm sticking to Moonshot's K2.6 benchmark table for this one, and that's on purpose. Benchmark posts tend to get messy the moment you start mixing vendor tables, different tool settings, different reasoning effort, and different evaluation harnesses — the numbers stop comparing the same things to the same things.

So the rule here is simple: use the K2.6 table as the number source, and be explicit about what it does and doesn't compare.

As of April 21, 2026, Moonshot's K2.6 table includes Kimi K2.6, GPT-5.4 (xhigh), Claude Opus 4.6 (max effort), Gemini 3.1 Pro (thinking high), and Kimi K2.5.

New to Kimi K2.6? Try Kimi K2.6.

Kimi K2.6 Benchmark: Quick Take

The short version: Kimi K2.6 is strong on coding and agentic work, clearly ahead of K2.5, close to the frontier proprietary models, and it wins some benchmarks while narrowly trailing on others.

What matters most isn't "K2.6 wins every row" — it doesn't. The more useful read is that K2.6 closes most of the gap, while sitting at a meaningfully lower published API price than premium Claude or GPT-class pricing.

Benchmark Table: Selected Kimi K2.6 Results

Agentic and Tool-Augmented Tasks

| Benchmark | Kimi K2.6 | GPT-5.4 (xhigh) | Claude Opus 4.6 | Gemini 3.1 Pro | Kimi K2.5 |
|---|---|---|---|---|---|
| HLE-Full w/ tools | 54.0 | 52.1 | 53.0 | 51.4 | 50.2 |
| BrowseComp | 83.2 | 82.7 | 83.7 | 85.9 | 74.9 |
| BrowseComp (agent swarm) | 86.3 | – | – | – | 78.4 |
| DeepSearchQA (f1) | 92.5 | 78.6 | 91.3 | 81.9 | 89.0 |
| DeepSearchQA (accuracy) | 83.0 | 63.7 | 80.6 | 60.2 | 77.1 |
| Toolathlon | 50.0 | 54.6 | 47.2 | 48.8 | 27.8 |
| OSWorld-Verified | 73.1 | – | 75.0 | 72.7 | 63.3 |
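
One footnote on the DeepSearchQA rows: f1 and accuracy answer different questions, which is why the two rows diverge so sharply for some models. Moonshot's page doesn't reproduce the grading script, so the sketch below is a generic set-level F1 vs. exact-match illustration, not DeepSearchQA's actual scorer:

```python
def f1_and_accuracy(expected: set[str], returned: set[str]) -> tuple[float, float]:
    """Generic illustration only; not DeepSearchQA's real grader.

    F1 gives partial credit for overlapping answers, while exact-match
    accuracy is all-or-nothing, which is why the two metrics can differ.
    """
    hits = len(expected & returned)
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(expected) if expected else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    accuracy = 1.0 if returned == expected else 0.0
    return f1, accuracy


# A partially correct answer set scores 0.8 on F1 but 0.0 on strict exact match.
print(f1_and_accuracy({"alpha", "beta", "gamma"}, {"alpha", "beta"}))
```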

Coding Benchmarks

| Benchmark | Kimi K2.6 | GPT-5.4 (xhigh) | Claude Opus 4.6 | Gemini 3.1 Pro | Kimi K2.5 |
|---|---|---|---|---|---|
| Terminal-Bench 2.0 | 66.7 | 65.4* | 65.4 | 68.5 | 50.8 |
| SWE-Bench Pro | 58.6 | 57.7 | 53.4 | 54.2 | 50.7 |
| SWE-Bench Multilingual | 76.7 | 77.8 | 76.9* | – | 73.0 |
| SWE-Bench Verified | 80.2 | 80.8 | 80.6 | – | 76.8 |
| SciCode | 52.2 | 56.6 | 51.9 | 58.9 | 48.7 |
| OJBench (python) | 60.6 | 60.3 | – | 70.7 | 54.7 |
| LiveCodeBench (v6) | 89.6 | 88.8 | – | 91.7 | 85.0 |

Reasoning and Knowledge

| Benchmark | Kimi K2.6 | GPT-5.4 (xhigh) | Claude Opus 4.6 | Gemini 3.1 Pro | Kimi K2.5 |
|---|---|---|---|---|---|
| HLE-Full | 34.7 | 39.8 | 40.0 | 44.4 | 30.1 |
| AIME 2026 | 96.4 | 99.2 | 96.7 | 98.3 | 95.8 |
| HMMT 2026 (Feb) | 92.7 | 97.7 | 96.2 | 94.7 | 87.1 |
| IMO-AnswerBench | 86.0 | 91.4 | 75.3 | 91.0* | 81.8 |
| GPQA-Diamond | 90.5 | 92.8 | 91.3 | 94.3 | 87.6 |

Vision Benchmarks

| Benchmark | Kimi K2.6 | GPT-5.4 (xhigh) | Claude Opus 4.6 | Gemini 3.1 Pro | Kimi K2.5 |
|---|---|---|---|---|---|
| MMMU-Pro | 79.4 | 81.2 | 73.9 | 83.0* | 78.5 |
| MMMU-Pro w/ python | 80.1 | 82.1 | 77.3 | 85.3* | 77.7 |
| MathVision | 87.4 | 92.0* | 71.2* | 89.8* | 84.2 |
| MathVision w/ python | 93.2 | 96.1* | 84.6* | 95.7* | 85.0 |
| V* w/ python | 96.9 | 98.4* | 86.4* | 96.9* | 86.9 |

* Entries marked with * are noted on Moonshot’s K2.6 page as re-evaluated under its benchmark conditions.

What the Kimi K2.6 Benchmark Says

1. K2.6 is a meaningful step up from K2.5

The single most reliable conclusion in this table is the within-family one. Against K2.5, the gains are broad and not particularly subtle (there's a quick delta check right after the list):

  • HLE-Full w/ tools: 54.0 vs 50.2
  • BrowseComp: 83.2 vs 74.9
  • DeepSearchQA (f1): 92.5 vs 89.0
  • Terminal-Bench 2.0: 66.7 vs 50.8
  • SWE-Bench Pro: 58.6 vs 50.7
  • SWE-Bench Verified: 80.2 vs 76.8
  • LiveCodeBench (v6): 89.6 vs 85.0
  • GPQA-Diamond: 90.5 vs 87.6
  • MMMU-Pro: 79.4 vs 78.5
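
For the deltas themselves, here's a quick sketch in plain Python. Every number is copied verbatim from the table above; nothing is re-measured:

```python
# K2.6 vs K2.5 scores, copied verbatim from Moonshot's K2.6 table above.
pairs = {
    "HLE-Full w/ tools": (54.0, 50.2),
    "BrowseComp": (83.2, 74.9),
    "DeepSearchQA (f1)": (92.5, 89.0),
    "Terminal-Bench 2.0": (66.7, 50.8),
    "SWE-Bench Pro": (58.6, 50.7),
    "SWE-Bench Verified": (80.2, 76.8),
    "LiveCodeBench (v6)": (89.6, 85.0),
    "GPQA-Diamond": (90.5, 87.6),
    "MMMU-Pro": (79.4, 78.5),
}

for name, (k26, k25) in pairs.items():
    delta = k26 - k25            # absolute gain in points
    rel = delta / k25 * 100      # relative gain over the K2.5 score
    print(f"{name:22s} +{delta:4.1f} pts ({rel:+.1f}%)")
```

Terminal-Bench 2.0 is the biggest single jump, at nearly 16 points, with BrowseComp and SWE-Bench Pro next.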

That lines up with Moonshot's own positioning: K2.6 isn't a K2.5 repackage, it's a genuine step forward on long-horizon coding and agent behavior.

2. K2.6 is strongest on tasks that look like real engineering or real agents

The benchmarks where K2.6 pulls ahead most cleanly aren't toy prompts — they're closer to what developers and agent builders actually ship:

  • HLE-Full w/ tools
  • DeepSearchQA
  • SWE-Bench Pro
  • Terminal-Bench 2.0
  • SWE-Bench Verified

Tool calling, multi-step execution, engineering tasks, long agent chains. That lines up with the K2.6 narrative about long-horizon coding and stronger autonomous execution more closely than most benchmark stories line up with their own press releases.

3. K2.6 does not dominate the frontier models everywhere

This is the part worth being honest about. Straight from the same table:

  • Gemini 3.1 Pro leads on vision-heavy benchmarks like MMMU-Pro, as well as on LiveCodeBench (v6) and OJBench (python)
  • GPT-5.4 (xhigh) leads on several reasoning-heavy tests like AIME 2026 and HMMT 2026
  • Claude Opus 4.6 is still slightly ahead on SWE-Bench Verified and SWE-Bench Multilingual

So the K2.6 story isn't "wins everything". It's more like: highly competitive on frontier coding and agentic tasks, with clear internal-family gains over K2.5.

Kimi K2.6 vs GPT-5.4 (xhigh)

Moonshot's table suggests a pretty clean split between the two.

K2.6 leads GPT-5.4 on HLE-Full w/ tools, DeepSearchQA (both f1 and accuracy), and SWE-Bench Pro. GPT-5.4 leads on AIME 2026, HMMT 2026, IMO-AnswerBench, GPQA-Diamond, and a chunk of the vision-heavy tasks.

Practical rule of thumb: if your workload is pure high-end reasoning or contest-style math, GPT-5.4 still has stronger published numbers on Moonshot's table. If it's tool-augmented engineering and agent execution, K2.6 becomes much harder to ignore.
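
If it helps to see that rule of thumb as code, here's a purely illustrative router. The model identifiers and the task categories are placeholders of my own, not real API model IDs or anything from Moonshot's page:

```python
# Illustrative routing only: model names and task labels are placeholders,
# and the split simply mirrors the rule of thumb above.
def pick_model(task_type: str) -> str:
    reasoning_heavy = {"contest_math", "pure_reasoning"}
    agentic_engineering = {"swe_task", "tool_use", "long_horizon_agent", "deep_research"}

    if task_type in reasoning_heavy:
        return "gpt-5.4-xhigh"   # stronger published math/reasoning numbers on Moonshot's table
    if task_type in agentic_engineering:
        return "kimi-k2.6"       # stronger tool-augmented and SWE-style numbers
    return "kimi-k2.6"           # reasonable default, given the lower published API price


print(pick_model("swe_task"))       # kimi-k2.6
print(pick_model("contest_math"))   # gpt-5.4-xhigh
```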

Kimi K2.6 vs Claude Opus 4.6

One thing worth flagging: Moonshot's table compares K2.6 against Claude Opus 4.6 (max effort), not Opus 4.7.

Within that comparison, K2.6 leads on HLE-Full w/ tools, DeepSearchQA, Terminal-Bench 2.0, and SWE-Bench Pro. Claude Opus 4.6 is still slightly ahead on SWE-Bench Verified and SWE-Bench Multilingual.

Closer than most people would assume.

Kimi K2.6 vs Gemini 3.1 Pro

Gemini 3.1 Pro looks strongest on the multimodal and contest-style items: MMMU-Pro, MMMU-Pro w/ python, LiveCodeBench (v6), OJBench (python), and GPQA-Diamond.

K2.6 looks stronger where the task is closer to real agentic execution — HLE-Full w/ tools, DeepSearchQA, BrowseComp (agent swarm), and SWE-Bench Pro.

Why the Kimi K2.6 Benchmark Story Matters

What makes Moonshot's K2.6 tech blog more persuasive than a typical benchmark drop is that it doesn't stop at a table. It ties the numbers back to concrete long-horizon engineering examples: 4,000+ tool calls over 12+ hours optimizing a Zig inference engine; 13 hours of autonomous work on an open-source financial matching engine; internal and partner reports about better long-context stability, stronger tool calling, and better instruction following.

That matters. A table on its own is easy to over-sell. When the table, the case studies, and the partner reports all tell the same story — better long-horizon coding, better agent execution, better engineering follow-through — the narrative becomes a lot harder to dismiss.

Final Verdict

The clean reading of Moonshot's K2.6 benchmark is pretty simple: K2.6 is stronger than K2.5, competitive with the frontier proprietary models, especially good on coding and tool-heavy agent work, and still not the top of every reasoning or multimodal benchmark.

That's already plenty of reason to take it seriously, especially if your workload looks like software engineering, agent orchestration, long-running execution, or tool-based research and coding.
