Kimi K2.5 HuggingFace: Model Card, Usage Guide & Integration 2026

Feb 10, 2026

The Kimi K2.5 HuggingFace release marks a significant milestone for open-weight AI models. According to Moonshot AI's official Kimi-K2.5 repository and model materials, Kimi K2.5 is available on the HuggingFace Hub for download and self-hosted deployment.

Kimi K2.5 HuggingFace Model Card Overview

The official Kimi K2.5 model card on HuggingFace provides comprehensive information about the model architecture, capabilities, and usage guidelines.

Model Information

Attribute              Details
Model Name             moonshotai/Kimi-K2.5
Architecture           Mixture-of-Experts (MoE)
Total Parameters       1 trillion (1T)
Activated Parameters   32 billion (32B)
Context Window         256,000 tokens
License                Modified MIT
Languages              Multilingual
Modalities             Text, Image, Video
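As a quick sanity check on the MoE numbers above, only a small fraction of the total parameters is active for any given token:

```python
# Activated fraction of a 1T-parameter MoE with 32B active parameters
total_params = 1_000_000_000_000
active_params = 32_000_000_000

fraction = active_params / total_params
print(f"{fraction:.1%}")  # 3.2%
```

This is the core economy of MoE architectures: per-token compute scales with the activated parameters, while total capacity scales with all experts.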

Model Page

Official Model Page: https://huggingface.co/moonshotai/Kimi-K2.5

The model page includes:

  • Model weights and configuration files
  • Tokenizer files
  • Usage examples
  • Community discussions
  • Evaluation results

Downloading Kimi K2.5 from HuggingFace

Using HuggingFace Hub CLI

# Install HuggingFace Hub
pip install huggingface-hub

# Login (required for gated models)
huggingface-cli login

# Download the model
huggingface-cli download moonshotai/Kimi-K2.5 --local-dir ./kimi-k2-5

Using Python

from huggingface_hub import snapshot_download

# Download model
model_path = snapshot_download(
    repo_id="moonshotai/Kimi-K2.5",
    local_dir="./kimi-k2-5"  # recent huggingface_hub versions download real files here; local_dir_use_symlinks is deprecated
)

print(f"Model downloaded to: {model_path}")

Storage Requirements (Approximate)

Component              Size
Model Weights (FP16)   ~2 TB
Model Weights (INT8)   ~1 TB
Model Weights (INT4)   ~500 GB
Tokenizer & Config     ~10 MB

Note: These are approximate planning numbers; actual disk usage varies by format and deployment stack.
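The table follows from simple arithmetic on the parameter count. A minimal sketch, ignoring sharding overhead and non-weight files:

```python
def weight_size_gb(n_params: float, bits_per_param: int) -> float:
    """Approximate checkpoint size in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

n = 1e12  # ~1T total parameters
print(weight_size_gb(n, 16))  # FP16: 2000.0 GB (~2 TB)
print(weight_size_gb(n, 8))   # INT8: 1000.0 GB (~1 TB)
print(weight_size_gb(n, 4))   # INT4: 500.0 GB
```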

Loading Kimi K2.5 with Transformers

Moonshot's deployment guide notes a minimum transformers version of 4.57.1.

Basic Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    "moonshotai/Kimi-K2.5",
    trust_remote_code=True
)

# Load model
model = AutoModelForCausalLM.from_pretrained(
    "moonshotai/Kimi-K2.5",
    trust_remote_code=True,
    torch_dtype="auto",
    device_map="auto"
)

# Generate text
inputs = tokenizer("Explain quantum computing:", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=200)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)

Multi-GPU Loading

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load with device mapping across multiple GPUs
model = AutoModelForCausalLM.from_pretrained(
    "moonshotai/Kimi-K2.5",
    trust_remote_code=True,
    torch_dtype=torch.float16,
    device_map="auto",  # Automatically distribute across available GPUs
    max_memory={0: "80GiB", 1: "80GiB", 2: "80GiB", 3: "80GiB"}
)

Quantized Loading (4-bit)

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

# Configure 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True
)

# Load quantized model
model = AutoModelForCausalLM.from_pretrained(
    "moonshotai/Kimi-K2.5",
    trust_remote_code=True,
    quantization_config=bnb_config,
    device_map="auto"
)

Running Kimi K2.5 with vLLM

vLLM provides optimized inference for large language models with efficient attention mechanisms and continuous batching.

Installation

pip install vllm

Basic vLLM Server

# Start vLLM server
python -m vllm.entrypoints.openai.api_server \
    --model moonshotai/Kimi-K2.5 \
    --tensor-parallel-size 4 \
    --max-model-len 65536 \
    --dtype float16

vLLM Python API

from vllm import LLM, SamplingParams

# Initialize LLM
llm = LLM(
    model="moonshotai/Kimi-K2.5",
    tensor_parallel_size=4,
    max_model_len=65536,
    dtype="float16"
)

# Configure sampling
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=1000
)

# Generate
prompts = [
    "Explain machine learning:",
    "Write a Python function to sort a list:"
]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(f"Prompt: {output.prompt}")
    print(f"Response: {output.outputs[0].text}\n")

OpenAI-Compatible API with vLLM

# After starting vLLM server
import openai

client = openai.OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="dummy"
)

response = client.chat.completions.create(
    model="moonshotai/Kimi-K2.5",
    messages=[
        {"role": "user", "content": "Hello, Kimi!"}
    ]
)

print(response.choices[0].message.content)

Using Kimi K2.5 with llama.cpp

For CPU inference or edge deployment, llama.cpp with GGUF quantization enables running Kimi K2.5 on consumer hardware.

Downloading GGUF Versions

Community GGUF conversions may be available, but availability changes frequently:

# Search and verify actively maintained GGUF repos before downloading
# https://huggingface.co/models?search=Kimi-K2.5%20GGUF

Running with llama.cpp

# Basic inference (recent llama.cpp builds name the binary llama-cli; older builds use ./main)
./llama-cli \
    -m ./models/Kimi-K2.5.Q4_K_M.gguf \
    -p "Explain quantum computing:" \
    -n 512 \
    --temp 0.7

# Interactive mode
./llama-cli \
    -m ./models/Kimi-K2.5.Q4_K_M.gguf \
    --interactive \
    --temp 0.7 \
    -n 4096

Python Binding

from llama_cpp import Llama

# Load model
llm = Llama(
    model_path="./models/Kimi-K2.5.Q4_K_M.gguf",
    n_ctx=8192,
    n_threads=8
)

# Generate
output = llm(
    "Explain machine learning:",
    max_tokens=512,
    temperature=0.7
)
print(output["choices"][0]["text"])

Fine-Tuning Kimi K2.5

LoRA Fine-Tuning

from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
    Trainer
)
from peft import LoraConfig, get_peft_model
import torch

# Load tokenizer and base model
tokenizer = AutoTokenizer.from_pretrained(
    "moonshotai/Kimi-K2.5",
    trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    "moonshotai/Kimi-K2.5",
    trust_remote_code=True,
    torch_dtype=torch.float16,
    device_map="auto"
)

# Configure LoRA
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

# Apply LoRA
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# Training setup
training_args = TrainingArguments(
    output_dir="./kimi-k2-5-finetuned",
    num_train_epochs=3,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    save_steps=100,
    logging_steps=10,
    fp16=True
)

# Initialize trainer (train_dataset is assumed to be a prepared, tokenized dataset)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    tokenizer=tokenizer
)

# Train
trainer.train()
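To see why LoRA is so much cheaper than full fine-tuning, count the adapter parameters: a rank-r adapter on a d_out × d_in weight matrix adds r·(d_in + d_out) trainable parameters. The 7168 hidden size below is a hypothetical value chosen for illustration, not a confirmed Kimi K2.5 dimension:

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """LoRA adds two low-rank factors: A (r x d_in) and B (d_out x r)."""
    return r * d_in + d_out * r

# Hypothetical square 7168-dim projection with r=16, matching the config above
print(lora_param_count(7168, 7168, 16))  # 229376 trainable params per adapted matrix
```

Even multiplied across all targeted projections and layers, this stays orders of magnitude below the base model's parameter count, which is what makes adapter training tractable on a model this large.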

Deployment Options

Docker Deployment

FROM nvidia/cuda:12.1.1-devel-ubuntu22.04

WORKDIR /app

# Install Python and dependencies (the CUDA base image ships without pip)
RUN apt-get update && apt-get install -y python3 python3-pip && \
    rm -rf /var/lib/apt/lists/*
RUN pip install torch transformers vllm huggingface-hub

# Download model (for weights this large, mounting a volume at runtime
# is usually preferable to baking them into the image)
RUN huggingface-cli download moonshotai/Kimi-K2.5 --local-dir /models/kimi-k2-5

# Start vLLM server
CMD python3 -m vllm.entrypoints.openai.api_server \
    --model /models/kimi-k2-5 \
    --tensor-parallel-size 4 \
    --host 0.0.0.0 \
    --port 8000

Kubernetes Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: kimi-k2-5
spec:
  replicas: 1
  selector:
    matchLabels:
      app: kimi-k2-5
  template:
    metadata:
      labels:
        app: kimi-k2-5
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args:
            - --model
            - moonshotai/Kimi-K2.5
            - --tensor-parallel-size
            - '4'
          resources:
            limits:
              nvidia.com/gpu: '4'
          ports:
            - containerPort: 8000
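The Deployment above only creates the pods; to make the server reachable inside the cluster, a minimal companion Service (assuming the same kimi-k2-5 labels) might look like:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: kimi-k2-5
spec:
  selector:
    app: kimi-k2-5
  ports:
    - port: 8000
      targetPort: 8000
```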

Hardware Requirements

Hardware requirements depend heavily on inference engine, tensor parallel settings, context length, and quantization strategy.
Moonshot's official deployment guide currently provides reference commands for vLLM/SGLang TP8 setups (for example, single-node H200 examples), and recommends checking engine docs for latest tuning guidance.
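For a rough back-of-envelope check before committing to hardware, weights are sharded approximately evenly across tensor-parallel ranks. This sketch counts weight memory only; KV cache, activations, and engine overhead add substantially on top:

```python
def per_gpu_weight_gb(total_weight_gb: float, tensor_parallel_size: int) -> float:
    """Approximate per-GPU weight memory under tensor parallelism (weights only)."""
    return total_weight_gb / tensor_parallel_size

# ~1 TB of INT8 weights sharded across a TP8 node
print(per_gpu_weight_gb(1000, 8))  # 125.0 GB per GPU, before KV cache and activations
```

This is why the reference setups use TP8 on high-memory accelerators: even quantized, per-GPU weight memory alone exceeds what consumer cards offer.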

FAQ

How do I access the Kimi K2.5 model on HuggingFace?

Visit huggingface.co/moonshotai/Kimi-K2.5 and accept the license agreement. Some versions may require authentication.

Can I run Kimi K2.5 on consumer GPUs?

It depends on the quantization format and serving stack. Validate against the specific GGUF/checkpoint variant and your target latency/QPS requirements before committing hardware.

Is the HuggingFace version the same as the API?

Not necessarily in end-to-end behavior. The same base model family can behave differently depending on serving stack, parser/tool settings, and model mode configuration.

Should I serve with vLLM or Transformers?

vLLM is recommended for production inference due to its optimized kernels and efficient batching; Transformers is best for fine-tuning and experimentation.

How do I fine-tune Kimi K2.5?

Use PEFT with LoRA adapters for efficient fine-tuning. Full fine-tuning requires very large compute budgets, so start with pilot runs and profile memory/throughput first.

Can I use Kimi K2.5 commercially?

Review the exact terms in the official Modified MIT License before production or commercial rollout.
