The Kimi K2.5 HuggingFace release marks a significant milestone for open-weight AI models. According to Moonshot AI's official Kimi-K2.5 repository and model materials, Kimi K2.5 is available on HuggingFace Hub for download and self-hosted deployment workflows.
Kimi K2.5 HuggingFace Model Card Overview
The official Kimi K2.5 model card on HuggingFace provides comprehensive information about the model architecture, capabilities, and usage guidelines.
Model Information
| Attribute | Details |
|---|---|
| Model Name | moonshotai/Kimi-K2.5 |
| Architecture | Mixture-of-Experts (MoE) |
| Total Parameters | 1 trillion (1T) |
| Activated Parameters | 32 billion (32B) |
| Context Window | 256,000 tokens |
| License | Modified MIT |
| Languages | Multilingual |
| Modalities | Text, Image, Video |
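The MoE design means only a small fraction of the total parameters is active for any given token. A quick back-of-the-envelope check using the figures from the table above:

```python
# Activated fraction of a Mixture-of-Experts model,
# using the parameter counts from the model card table.
total_params = 1_000_000_000_000   # 1T total parameters
active_params = 32_000_000_000     # 32B activated per token

fraction = active_params / total_params
print(f"Activated fraction: {fraction:.1%}")  # → Activated fraction: 3.2%
```

This sparsity is why a 1T-parameter model can have per-token compute costs closer to a dense 32B model, even though the full weights must still be stored and loaded.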
Model Page
Official Model Page: https://huggingface.co/moonshotai/Kimi-K2.5
The model page includes:
- Model weights and configuration files
- Tokenizer files
- Usage examples
- Community discussions
- Evaluation results
Downloading Kimi K2.5 from HuggingFace
Using HuggingFace Hub CLI
```bash
# Install the HuggingFace Hub client
pip install huggingface-hub

# Log in (required for gated models)
huggingface-cli login

# Download the model
huggingface-cli download moonshotai/Kimi-K2.5 --local-dir ./kimi-k2-5
```

Using Python
```python
from huggingface_hub import snapshot_download

# Download the full model snapshot
# (local_dir_use_symlinks is deprecated in recent huggingface_hub and no longer needed)
model_path = snapshot_download(
    repo_id="moonshotai/Kimi-K2.5",
    local_dir="./kimi-k2-5",
)
print(f"Model downloaded to: {model_path}")
```

Storage Requirements (Approximate)
| Component | Size |
|---|---|
| Model Weights (FP16) | ~2TB |
| Model Weights (INT8) | ~1TB |
| Model Weights (INT4) | ~500GB |
| Tokenizer & Config | ~10MB |
Note: These are approximate planning numbers; actual disk usage varies by format and deployment stack.
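The table's figures follow directly from a bytes-per-parameter calculation. A minimal sketch (the 1T parameter count is from the model card; the rest is arithmetic, ignoring per-format metadata overhead):

```python
def weight_size_gb(num_params: int, bits_per_param: int) -> float:
    """Approximate on-disk size of model weights in GB (decimal)."""
    return num_params * bits_per_param / 8 / 1e9

total_params = 1_000_000_000_000  # 1T parameters

for fmt, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{fmt}: ~{weight_size_gb(total_params, bits) / 1000:.1f}TB")
# → FP16: ~2.0TB, INT8: ~1.0TB, INT4: ~0.5TB
```

Real checkpoints add tokenizer files, sharding metadata, and format-specific overhead, so budget some headroom beyond these numbers.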
Loading Kimi K2.5 with Transformers
Moonshot's deployment guide notes a minimum transformers version of 4.57.1.
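Before loading, it is worth verifying that the installed version meets that floor. A small sketch using only the standard library (the 4.57.1 minimum is from the deployment guide; the naive parser assumes plain X.Y.Z release strings):

```python
from importlib.metadata import PackageNotFoundError, version

MIN_TRANSFORMERS = (4, 57, 1)  # minimum noted in Moonshot's deployment guide

def parse_version(v: str) -> tuple:
    """Naive parse, sufficient for plain X.Y.Z release strings."""
    return tuple(int(p) for p in v.split(".")[:3])

try:
    installed = parse_version(version("transformers"))
    if installed < MIN_TRANSFORMERS:
        print(f"Warning: transformers {installed} is older than required {MIN_TRANSFORMERS}")
except PackageNotFoundError:
    print("transformers is not installed")
```

For anything beyond a quick check, prefer `packaging.version.parse`, which handles release candidates and post-releases correctly.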
Basic Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    "moonshotai/Kimi-K2.5",
    trust_remote_code=True,
)

# Load model
model = AutoModelForCausalLM.from_pretrained(
    "moonshotai/Kimi-K2.5",
    trust_remote_code=True,
    torch_dtype="auto",
    device_map="auto",
)

# Generate text (move inputs to the model's device)
inputs = tokenizer("Explain quantum computing:", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```

Multi-GPU Loading
```python
import torch
from transformers import AutoModelForCausalLM

# Load with device mapping across multiple GPUs
model = AutoModelForCausalLM.from_pretrained(
    "moonshotai/Kimi-K2.5",
    trust_remote_code=True,
    torch_dtype=torch.float16,
    device_map="auto",  # automatically distribute across available GPUs
    max_memory={0: "80GiB", 1: "80GiB", 2: "80GiB", 3: "80GiB"},
)
```

Quantized Loading (4-bit)
```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Configure 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

# Load quantized model
model = AutoModelForCausalLM.from_pretrained(
    "moonshotai/Kimi-K2.5",
    trust_remote_code=True,
    quantization_config=bnb_config,
    device_map="auto",
)
```

Running Kimi K2.5 with vLLM
vLLM provides optimized inference for large language models with efficient attention mechanisms and continuous batching.
Installation
```bash
pip install vllm
```

Basic vLLM Server
```bash
# Start an OpenAI-compatible vLLM server
python -m vllm.entrypoints.openai.api_server \
    --model moonshotai/Kimi-K2.5 \
    --tensor-parallel-size 4 \
    --max-model-len 65536 \
    --dtype float16
```

vLLM Python API
```python
from vllm import LLM, SamplingParams

# Initialize the engine
llm = LLM(
    model="moonshotai/Kimi-K2.5",
    tensor_parallel_size=4,
    max_model_len=65536,
    dtype="float16",
)

# Configure sampling
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=1000,
)

# Generate
prompts = [
    "Explain machine learning:",
    "Write a Python function to sort a list:",
]
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(f"Prompt: {output.prompt}")
    print(f"Response: {output.outputs[0].text}\n")
```

OpenAI-Compatible API with vLLM
```python
# After starting the vLLM server
import openai

client = openai.OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="dummy",  # vLLM does not validate the key by default
)

response = client.chat.completions.create(
    model="moonshotai/Kimi-K2.5",
    messages=[
        {"role": "user", "content": "Hello, Kimi!"},
    ],
)
print(response.choices[0].message.content)
```

Using Kimi K2.5 with llama.cpp
For CPU inference or edge deployment, llama.cpp with GGUF quantization enables running Kimi K2.5 on consumer hardware.
Downloading GGUF Versions
Community GGUF conversions may be available, but availability changes frequently:
```bash
# Search for and verify actively maintained GGUF repos before downloading:
# https://huggingface.co/models?search=Kimi-K2.5%20GGUF
```

Running with llama.cpp
```bash
# Basic inference (newer llama.cpp builds name this binary llama-cli)
./main \
    -m ./models/Kimi-K2.5.Q4_K_M.gguf \
    -p "Explain quantum computing:" \
    -n 512 \
    --temp 0.7

# Interactive mode
./main \
    -m ./models/Kimi-K2.5.Q4_K_M.gguf \
    --interactive \
    --temp 0.7 \
    -n 4096
```

Python Binding
```python
from llama_cpp import Llama

# Load model
llm = Llama(
    model_path="./models/Kimi-K2.5.Q4_K_M.gguf",
    n_ctx=8192,
    n_threads=8,
)

# Generate
output = llm(
    "Explain machine learning:",
    max_tokens=512,
    temperature=0.7,
)
print(output["choices"][0]["text"])
```

Fine-Tuning Kimi K2.5
LoRA Fine-Tuning
```python
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)
from peft import LoraConfig, get_peft_model

# Load tokenizer and base model
tokenizer = AutoTokenizer.from_pretrained(
    "moonshotai/Kimi-K2.5",
    trust_remote_code=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "moonshotai/Kimi-K2.5",
    trust_remote_code=True,
    torch_dtype=torch.float16,
    device_map="auto",
)

# Configure LoRA
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

# Apply LoRA adapters
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# Training setup
training_args = TrainingArguments(
    output_dir="./kimi-k2-5-finetuned",
    num_train_epochs=3,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    save_steps=100,
    logging_steps=10,
    fp16=True,
)

# Initialize trainer (train_dataset is assumed to be a prepared, tokenized dataset)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    tokenizer=tokenizer,
)

# Train
trainer.train()
```

Deployment Options
Docker Deployment
```dockerfile
FROM nvidia/cuda:12.1.0-devel-ubuntu22.04

WORKDIR /app

# Install Python and dependencies (the CUDA base image ships without pip)
RUN apt-get update && apt-get install -y python3 python3-pip && \
    pip install torch transformers vllm huggingface-hub

# Download model weights at build time (this bakes terabytes into the image;
# mounting a volume at runtime is usually preferable)
RUN huggingface-cli download moonshotai/Kimi-K2.5 --local-dir /models/kimi-k2-5

# Start vLLM server
CMD python -m vllm.entrypoints.openai.api_server \
    --model /models/kimi-k2-5 \
    --tensor-parallel-size 4 \
    --host 0.0.0.0 \
    --port 8000
```

Kubernetes Deployment
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kimi-k2-5
spec:
  replicas: 1
  selector:
    matchLabels:
      app: kimi-k2-5
  template:
    metadata:
      labels:
        app: kimi-k2-5
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args:
            - --model
            - moonshotai/Kimi-K2.5
            - --tensor-parallel-size
            - "4"
          resources:
            limits:
              nvidia.com/gpu: "4"
          ports:
            - containerPort: 8000
```

Hardware Requirements
Hardware requirements depend heavily on inference engine, tensor parallel settings, context length, and quantization strategy.
Moonshot's official deployment guide currently provides reference commands for vLLM/SGLang TP8 setups (for example, single-node H200 examples), and recommends checking engine docs for latest tuning guidance.
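As a rough planning aid, per-GPU weight memory under tensor parallelism is the total weight size divided by the TP degree. A minimal sketch (the ~500GB INT4 figure comes from the storage table above; KV cache, activations, and engine overhead are deliberately ignored and can be substantial at long context lengths):

```python
def weights_per_gpu_gb(total_weights_gb: float, tp_size: int) -> float:
    """Approximate weight memory per GPU under tensor parallelism.

    Ignores KV cache, activations, and engine overhead.
    """
    return total_weights_gb / tp_size

# INT4 weights (~500GB) sharded across an 8-GPU (TP8) node:
print(f"~{weights_per_gpu_gb(500, 8):.1f}GB per GPU")  # → ~62.5GB per GPU
```

Numbers like this only bound the weights; always validate against the actual engine's memory profiler before committing to a hardware configuration.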
FAQ
How do I access the Kimi K2.5 model on HuggingFace?
Visit huggingface.co/moonshotai/Kimi-K2.5 and accept the license agreement. Some versions may require authentication.
Can I run Kimi K2.5 on consumer GPUs?
It depends on the quantization format and serving stack. Validate against the specific GGUF/checkpoint variant and your target latency/QPS requirements before committing hardware.
Is the HuggingFace version the same as the API?
Not necessarily in end-to-end behavior. The same base model family can behave differently depending on serving stack, parser/tool settings, and model mode configuration.
What framework is recommended for inference?
vLLM is recommended for production inference due to its optimized kernels and efficient batching. Transformers is best for fine-tuning and experimentation.
How do I fine-tune Kimi K2.5?
Use PEFT with LoRA adapters for efficient fine-tuning. Full fine-tuning requires very large compute budgets, so start with pilot runs and profile memory/throughput first.
Can I use Kimi K2.5 commercially?
Review the exact terms in the official Modified MIT License before production or commercial rollout.