1. Model Introduction
Kimi-K2 is a state-of-the-art Mixture-of-Experts (MoE) language model from Moonshot AI with 32B activated parameters and 1T total parameters.

Model Variants:
- Kimi-K2-Instruct: Post-trained model optimized for general-purpose chat and agentic tasks. Compatible with vLLM, SGLang, KTransformers, and TensorRT-LLM.
- Kimi-K2-Thinking: Advanced thinking model with step-by-step reasoning and tool calling. Native INT4 quantization with a 256K context window. Ideal for complex reasoning and multi-step tool use.
2. SGLang Installation
Refer to the official SGLang installation guide.

3. Model Deployment
This section provides a progressive guide from quick deployment to performance optimization, suitable for users at different levels.

3.1 Basic Configuration
Interactive Command Generator: Use the configuration selector below to automatically generate the appropriate deployment command for your hardware platform, model variant, deployment strategy, and capabilities.

3.2 Configuration Tips
- Memory: Requires 8 GPUs with ≥140GB memory each (e.g., H200/B200). Use `--context-length 128000` to conserve memory.
- Expert Parallelism (EP): Use `--ep` for better MoE throughput. See the EP docs.
- Data Parallelism (DP): Enable with `--dp 4 --enable-dp-attention` for production throughput.
- KV Cache: Use `--kv-cache-dtype fp8_e4m3` to reduce KV cache memory by 50% (CUDA 11.8+).
- Reasoning Parser: Add `--reasoning-parser kimi_k2` for Kimi-K2-Thinking to separate thinking and content.
- Tool Call Parser: Add `--tool-call-parser kimi_k2` for structured tool calls.
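Combining these tips, a complete launch command might look like the sketch below. The model path, `--tp 8`, and port are illustrative assumptions rather than values from this guide; adjust them for your hardware and chosen variant.

```shell
# Sketch: launch Kimi-K2-Thinking on 8 GPUs with the flags discussed above.
# Model path and port are illustrative; --tp 8 shards the model across 8 GPUs.
python -m sglang.launch_server \
  --model-path moonshotai/Kimi-K2-Thinking \
  --tp 8 \
  --context-length 128000 \
  --kv-cache-dtype fp8_e4m3 \
  --reasoning-parser kimi_k2 \
  --tool-call-parser kimi_k2 \
  --port 30000
```

For Kimi-K2-Instruct, drop `--reasoning-parser kimi_k2` and point `--model-path` at the Instruct weights.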
4. Model Invocation
4.1 Basic Usage
See Basic API Usage.

4.2 Advanced Usage
4.2.1 Reasoning Parser
Enable the reasoning parser for Kimi-K2-Thinking by adding `--reasoning-parser kimi_k2` to the deployment command. The server then returns the model's step-by-step thinking separately from the final response content.
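As a sketch of how a client might consume the separated output: the example below assumes a local server at `localhost:30000` and a `reasoning_content` field on the assistant message, based on the OpenAI-compatible API shape rather than on this guide.

```python
# Sketch: query a locally running SGLang server via the OpenAI-compatible
# HTTP API. With --reasoning-parser kimi_k2 enabled, the assistant message
# is assumed to carry a `reasoning_content` field alongside `content`.
import json
import urllib.request

def split_reasoning(message: dict) -> tuple:
    """Return (reasoning, answer) from a chat completion message dict."""
    return message.get("reasoning_content") or "", message.get("content") or ""

def chat(prompt: str, base_url: str = "http://localhost:30000/v1") -> dict:
    """Send one chat request and return the first choice's message dict."""
    payload = {
        "model": "moonshotai/Kimi-K2-Thinking",
        "messages": [{"role": "user", "content": prompt}],
    }
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]

# Usage against a live server (requires the deployment above):
#   reasoning, answer = split_reasoning(chat("What is 17 * 24?"))
```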
4.2.2 Tool Calling
Kimi-K2-Instruct and Kimi-K2-Thinking both support tool calling. Enable the tool call parser during deployment by adding `--tool-call-parser kimi_k2` to the launch command. With the parser enabled:
- The reasoning parser shows how the model decides to use a tool
- Tool calls are clearly marked with the function name and arguments
- You can then execute the function and send the result back to continue the conversation
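The execute-and-send-back loop can be sketched in Python. The tool schema, the `get_weather` helper, and the exact message shapes below are illustrative assumptions following the OpenAI-compatible tool-calling convention, not values from this guide.

```python
# Sketch of one tool-calling round trip against an SGLang server launched
# with --tool-call-parser kimi_k2. Schema and helper are illustrative.
import json

TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

def get_weather(city: str) -> str:
    # Stand-in implementation; a real tool would call a weather API.
    return f"Sunny, 25C in {city}"

def run_tool_call(tool_call: dict) -> dict:
    """Execute one parsed tool call and build the `tool` role message
    that is sent back to the model to continue the conversation."""
    name = tool_call["function"]["name"]
    args = json.loads(tool_call["function"]["arguments"])
    result = {"get_weather": get_weather}[name](**args)
    return {"role": "tool", "tool_call_id": tool_call["id"], "content": result}

# An assistant message with a tool call is expected to look roughly like:
#   {"role": "assistant", "tool_calls": [{"id": "...", "type": "function",
#    "function": {"name": "get_weather", "arguments": '{"city": "Paris"}'}}]}
# Append run_tool_call(...) to the messages list of the next request.
```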
5. Benchmark
5.1 Speed Benchmark
Test Environment:
- Hardware: 8x NVIDIA B200 GPUs
- Model: Kimi-K2-Instruct
- SGLang version: 0.5.6.post1
5.1.1 Latency-Sensitive Benchmark
- Model Deployment Command:
- Benchmark Command:
- Test Results:
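A latency-oriented setup might look like the following sketch. All model paths, ports, prompt counts, and benchmark flags here are assumptions, and `bench_serving` flag names can differ between sglang versions.

```shell
# Sketch of a latency-oriented run (low request rate, low concurrency).
# Deploy:
python -m sglang.launch_server \
  --model-path moonshotai/Kimi-K2-Instruct \
  --tp 8 --tool-call-parser kimi_k2 --port 30000

# Benchmark (flag names may vary by sglang version):
python -m sglang.bench_serving \
  --backend sglang \
  --num-prompts 64 \
  --request-rate 1 \
  --port 30000
```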
5.1.2 Throughput-Sensitive Benchmark
- Model Deployment Command:
- Benchmark Command:
- Test Results:
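A throughput-oriented setup might instead enable data-parallel attention, as recommended in the configuration tips above. Again, all values below are illustrative assumptions.

```shell
# Sketch of a throughput-oriented run with data-parallel attention.
# Deploy:
python -m sglang.launch_server \
  --model-path moonshotai/Kimi-K2-Instruct \
  --tp 8 --dp 4 --enable-dp-attention --port 30000

# Benchmark with many concurrent prompts (flag names may vary by version):
python -m sglang.bench_serving \
  --backend sglang \
  --num-prompts 2000 \
  --request-rate 16 \
  --port 30000
```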
5.2 Accuracy Benchmark
5.2.1 GSM8K Benchmark
- Server Command:
- Benchmark Command:
- Result:
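A GSM8K accuracy check might be run as sketched below. The few-shot GSM8K script is assumed to ship with sglang as `sglang.test.few_shot_gsm8k`; its location, flags, and the question count are assumptions that may vary by version.

```shell
# Sketch: start the server in the background, then run the GSM8K eval.
python -m sglang.launch_server \
  --model-path moonshotai/Kimi-K2-Instruct --tp 8 --port 30000 &

# Few-shot GSM8K accuracy (script path/flags may differ by sglang version):
python -m sglang.test.few_shot_gsm8k --num-questions 200 --port 30000
```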
