The Kimi K2.5 HuggingFace release marks a significant milestone for open-weight AI models. According to Moonshot AI's official Kimi-K2.5 repository and model materials, Kimi K2.5 is available on HuggingFace Hub for download and self-hosted deployment workflows.
Kimi K2.5 HuggingFace Model Card Overview
The official Kimi K2.5 model card on HuggingFace provides comprehensive information about the model architecture, capabilities, and usage guidelines.
Model Information
| Attribute | Details |
|---|---|
| Model Name | moonshotai/Kimi-K2.5 |
| Architecture | Mixture-of-Experts (MoE) |
| Total Parameters | 1 trillion (1T) |
| Activated Parameters | 32 billion (32B) |
| Context Window | 256,000 tokens |
| License | Modified MIT |
| Languages | Multilingual |
| Modalities | Text, Image, Video |
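The MoE design means only a small fraction of the total parameters is active for any given token. A quick back-of-the-envelope check using the figures from the table above:

```python
# Activated fraction of a Mixture-of-Experts model,
# using the parameter counts from the model card table.
total_params = 1_000_000_000_000   # 1T total parameters
active_params = 32_000_000_000     # 32B activated per token

fraction = active_params / total_params
print(f"Activated fraction: {fraction:.1%}")  # → Activated fraction: 3.2%
```

This sparsity is why a 1T-parameter model can have per-token compute costs closer to a dense 32B model, even though the full weights must still be stored and loaded.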
Model Page
Official Model Page: https://huggingface.co/moonshotai/Kimi-K2.5
The model page includes:
- Model weights and configuration files
- Tokenizer files
- Usage examples
- Community discussions
- Evaluation results
Downloading Kimi K2.5 from HuggingFace
Using HuggingFace Hub CLI
```bash
# Install the HuggingFace Hub client
pip install huggingface-hub

# Log in (required for gated models)
huggingface-cli login

# Download the model
huggingface-cli download moonshotai/Kimi-K2.5 --local-dir ./kimi-k2-5
```

Using Python
```python
from huggingface_hub import snapshot_download

# Download the full model snapshot
# (local_dir_use_symlinks is deprecated in recent huggingface_hub and no longer needed)
model_path = snapshot_download(
    repo_id="moonshotai/Kimi-K2.5",
    local_dir="./kimi-k2-5",
)
print(f"Model downloaded to: {model_path}")
```

Storage Requirements (Approximate)
| Component | Size |
|---|---|
| Model Weights (FP16) | ~2TB |
| Model Weights (INT8) | ~1TB |
| Model Weights (INT4) | ~500GB |
| Tokenizer & Config | ~10MB |
Note: These are approximate planning numbers; actual disk usage varies by format and deployment stack.
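The table's figures follow directly from a bytes-per-parameter calculation. A minimal sketch (the 1T parameter count is from the model card; the rest is arithmetic, ignoring per-format metadata overhead):

```python
def weight_size_gb(num_params: int, bits_per_param: int) -> float:
    """Approximate on-disk size of model weights in GB (decimal)."""
    return num_params * bits_per_param / 8 / 1e9

total_params = 1_000_000_000_000  # 1T parameters

for fmt, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{fmt}: ~{weight_size_gb(total_params, bits) / 1000:.1f}TB")
# → FP16: ~2.0TB, INT8: ~1.0TB, INT4: ~0.5TB
```

Real checkpoints add tokenizer files, sharding metadata, and format-specific overhead, so budget some headroom beyond these numbers.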
Loading Kimi K2.5 with Transformers
Moonshot's deployment guide notes a minimum transformers version of 4.57.1.
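Before loading, it is worth verifying that the installed version meets that floor. A small sketch using only the standard library (the 4.57.1 minimum is from the deployment guide; the naive parser assumes plain X.Y.Z release strings):

```python
from importlib.metadata import PackageNotFoundError, version

MIN_TRANSFORMERS = (4, 57, 1)  # minimum noted in Moonshot's deployment guide

def parse_version(v: str) -> tuple:
    """Naive parse, sufficient for plain X.Y.Z release strings."""
    return tuple(int(p) for p in v.split(".")[:3])

try:
    installed = parse_version(version("transformers"))
    if installed < MIN_TRANSFORMERS:
        print(f"Warning: transformers {installed} is older than required {MIN_TRANSFORMERS}")
except PackageNotFoundError:
    print("transformers is not installed")
```

For anything beyond a quick check, prefer `packaging.version.parse`, which handles release candidates and post-releases correctly.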
Basic Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    "moonshotai/Kimi-K2.5",
    trust_remote_code=True,
)

# Load model
model = AutoModelForCausalLM.from_pretrained(
    "moonshotai/Kimi-K2.5",
    trust_remote_code=True,
    torch_dtype="auto",
    device_map="auto",
)

# Generate text (move inputs to the model's device)
inputs = tokenizer("Explain quantum computing:", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```

Multi-GPU Loading
```python
import torch
from transformers import AutoModelForCausalLM

# Load with device mapping across multiple GPUs
model = AutoModelForCausalLM.from_pretrained(
    "moonshotai/Kimi-K2.5",
    trust_remote_code=True,
    torch_dtype=torch.float16,
    device_map="auto",  # automatically distribute across available GPUs
    max_memory={0: "80GiB", 1: "80GiB", 2: "80GiB", 3: "80GiB"},
)
```

Quantized Loading (4-bit)
```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Configure 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

# Load quantized model
model = AutoModelForCausalLM.from_pretrained(
    "moonshotai/Kimi-K2.5",
    trust_remote_code=True,
    quantization_config=bnb_config,
    device_map="auto",
)
```

Running Kimi K2.5 with vLLM
vLLM provides optimized inference for large language models with efficient attention mechanisms and continuous batching.
Installation
```bash
pip install vllm
```

Basic vLLM Server
```bash
# Start an OpenAI-compatible vLLM server
python -m vllm.entrypoints.openai.api_server \
    --model moonshotai/Kimi-K2.5 \
    --tensor-parallel-size 4 \
    --max-model-len 65536 \
    --dtype float16
```

vLLM Python API
```python
from vllm import LLM, SamplingParams

# Initialize the engine
llm = LLM(
    model="moonshotai/Kimi-K2.5",
    tensor_parallel_size=4,
    max_model_len=65536,
    dtype="float16",
)

# Configure sampling
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=1000,
)

# Generate
prompts = [
    "Explain machine learning:",
    "Write a Python function to sort a list:",
]
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(f"Prompt: {output.prompt}")
    print(f"Response: {output.outputs[0].text}\n")
```

OpenAI-Compatible API with vLLM
```python
# After starting the vLLM server
import openai

client = openai.OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="dummy",  # vLLM does not validate the key by default
)

response = client.chat.completions.create(
    model="moonshotai/Kimi-K2.5",
    messages=[
        {"role": "user", "content": "Hello, Kimi!"},
    ],
)
print(response.choices[0].message.content)
```

Using Kimi K2.5 with llama.cpp
For CPU inference or edge deployment, llama.cpp with GGUF quantization enables running Kimi K2.5 on consumer hardware.
Downloading GGUF Versions
Community GGUF conversions may be available, but availability changes frequently:
```bash
# Search for and verify actively maintained GGUF repos before downloading:
# https://huggingface.co/models?search=Kimi-K2.5%20GGUF
```

Running with llama.cpp
```bash
# Basic inference (newer llama.cpp builds name this binary llama-cli)
./main \
    -m ./models/Kimi-K2.5.Q4_K_M.gguf \
    -p "Explain quantum computing:" \
    -n 512 \
    --temp 0.7

# Interactive mode
./main \
    -m ./models/Kimi-K2.5.Q4_K_M.gguf \
    --interactive \
    --temp 0.7 \
    -n 4096
```

Python Binding
```python
from llama_cpp import Llama

# Load model
llm = Llama(
    model_path="./models/Kimi-K2.5.Q4_K_M.gguf",
    n_ctx=8192,
    n_threads=8,
)

# Generate
output = llm(
    "Explain machine learning:",
    max_tokens=512,
    temperature=0.7,
)
print(output["choices"][0]["text"])
```

Fine-Tuning Kimi K2.5
LoRA Fine-Tuning
```python
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)
from peft import LoraConfig, get_peft_model

# Load tokenizer and base model
tokenizer = AutoTokenizer.from_pretrained(
    "moonshotai/Kimi-K2.5",
    trust_remote_code=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "moonshotai/Kimi-K2.5",
    trust_remote_code=True,
    torch_dtype=torch.float16,
    device_map="auto",
)

# Configure LoRA
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

# Apply LoRA adapters
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# Training setup
training_args = TrainingArguments(
    output_dir="./kimi-k2-5-finetuned",
    num_train_epochs=3,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    save_steps=100,
    logging_steps=10,
    fp16=True,
)

# Initialize trainer (train_dataset is assumed to be a prepared, tokenized dataset)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    tokenizer=tokenizer,
)

# Train
trainer.train()
```

Deployment Options
Docker Deployment
```dockerfile
FROM nvidia/cuda:12.1.0-devel-ubuntu22.04

WORKDIR /app

# Install Python and dependencies (the CUDA base image ships without pip)
RUN apt-get update && apt-get install -y python3 python3-pip && \
    pip install torch transformers vllm huggingface-hub

# Download model weights at build time (this bakes terabytes into the image;
# mounting a volume at runtime is usually preferable)
RUN huggingface-cli download moonshotai/Kimi-K2.5 --local-dir /models/kimi-k2-5

# Start vLLM server
CMD python -m vllm.entrypoints.openai.api_server \
    --model /models/kimi-k2-5 \
    --tensor-parallel-size 4 \
    --host 0.0.0.0 \
    --port 8000
```

Kubernetes Deployment
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kimi-k2-5
spec:
  replicas: 1
  selector:
    matchLabels:
      app: kimi-k2-5
  template:
    metadata:
      labels:
        app: kimi-k2-5
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args:
            - --model
            - moonshotai/Kimi-K2.5
            - --tensor-parallel-size
            - "4"
          resources:
            limits:
              nvidia.com/gpu: "4"
          ports:
            - containerPort: 8000
```

Hardware Requirements
Hardware requirements depend heavily on inference engine, tensor parallel settings, context length, and quantization strategy.
Moonshot's official deployment guide currently provides reference commands for vLLM/SGLang TP8 setups (for example, single-node H200 examples), and recommends checking engine docs for latest tuning guidance.
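As a rough planning aid, per-GPU weight memory under tensor parallelism is the total weight size divided by the TP degree. A minimal sketch (the ~500GB INT4 figure comes from the storage table above; KV cache, activations, and engine overhead are deliberately ignored and can be substantial at long context lengths):

```python
def weights_per_gpu_gb(total_weights_gb: float, tp_size: int) -> float:
    """Approximate weight memory per GPU under tensor parallelism.

    Ignores KV cache, activations, and engine overhead.
    """
    return total_weights_gb / tp_size

# INT4 weights (~500GB) sharded across an 8-GPU (TP8) node:
print(f"~{weights_per_gpu_gb(500, 8):.1f}GB per GPU")  # → ~62.5GB per GPU
```

Numbers like this only bound the weights; always validate against the actual engine's memory profiler before committing to a hardware configuration.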
FAQ
How do I access the Kimi K2.5 model on HuggingFace?
Visit huggingface.co/moonshotai/Kimi-K2.5 and accept the license agreement. Some versions may require authentication.
Can I run Kimi K2.5 on consumer GPUs?
It depends on the quantization format and serving stack. Validate against the specific GGUF/checkpoint variant and your target latency/QPS requirements before committing hardware.
Is the HuggingFace version the same as the API?
Not necessarily in end-to-end behavior. The same base model family can behave differently depending on serving stack, parser/tool settings, and model mode configuration.
What framework is recommended for inference?
vLLM is recommended for production inference due to its optimized kernels and efficient batching. Transformers is best for fine-tuning and experimentation.
How do I fine-tune Kimi K2.5?
Use PEFT with LoRA adapters for efficient fine-tuning. Full fine-tuning requires very large compute budgets, so start with pilot runs and profile memory/throughput first.
Can I use Kimi K2.5 commercially?
Review the exact terms in the official Modified MIT License before production or commercial rollout.