Documentation Index
Fetch the complete documentation index at: https://docs.sglang.io/llms.txt
Use this file to discover all available pages before exploring further.
1. Model Introduction
DeepSeek-V4 is the next-generation Mixture-of-Experts model from DeepSeek, released 2026-04-24 under an MIT License. It ships as two Instruct repos (one per variant) plus matching Base repos:| Variant | Total params | Active (MoE) | Use |
|---|---|---|---|
| DeepSeek-V4-Flash | 284B | 13B | single-node serving: B200 / GB200 / GB300 / H200 on 4 GPUs |
| DeepSeek-V4-Pro | 1.6T | 49B | high-capacity: B200 8 GPU / GB200 8 GPU (2 nodes) / GB300 4 GPU / H200 8 GPU(fp4)/16 GPU(fp8) |
DeepSeek-V4-Flash-Base, DeepSeek-V4-Pro-Base — ship pure FP8 mixed and are not for chat / tool calling.
Key Features (per the official model card):
- Hybrid Attention Architecture — combines Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA) for long-context efficiency. At 1M-token context, DeepSeek-V4-Pro uses only ~27% of per-token inference FLOPs and ~10% of KV cache compared with DeepSeek-V3.2.
- Manifold-Constrained Hyper-Connections (mHC) — strengthens residual connections, improving signal-propagation stability across layers while preserving expressivity.
- Muon optimizer — faster convergence and greater training stability.
- Context length: 1M tokens; pre-trained on 32T+ diverse, high-quality tokens.
- Three reasoning modes: Non-think (fast, intuitive responses), Think High (conscious logical analysis, slower but more accurate), Think Max (push reasoning to its fullest extent). Recommend a ≥ 384K context window when running Think Max.
- Ships with a dedicated
encoding_dsv4.encode_messagesPython encoder + DSML tool-call grammar (<|DSML|tool_calls>/<|DSML|invoke>/<|DSML|parameter>).
temperature=1.0, top_p=1.0 (per the official model card).
License: MIT.
Resources:
- HuggingFace: DeepSeek-V4-Flash, DeepSeek-V4-Pro
- ModelScope: DeepSeek-V4-Flash, DeepSeek-V4-Pro
2. SGLang Installation
SGLang offers multiple installation methods. Choose based on your hardware platform. Please refer to the official SGLang installation guide for installation instructions. Docker Image: Uselmsysorg/sglang:latest for all supported hardware platforms (B300 / B200 / GB200 / GB300 / H200 / H100).
Command
sglang serve ... with whatever the command generator below produces):
Command
3. Model Deployment
SGLang supports three main serving recipes for DeepSeek-V4 with different latency/throughput trade-offs (low-latency, balanced, max-throughput), plus specialized recipes for long-context (cp, prefill context-parallel) and prefill/decode disaggregation (pd-disagg). The interactive generator below emits the exact launch command for any (hardware, variant, recipe) combination.
3.1 Basic Configuration
Interactive Command Generator: Use the selector below to generate the deployment command for your hardware + recipe combination.3.2 Configuration Tips
Concurrency & DeepEP dispatch buffer Must hold:max-running-requests × MTP_draft_tokens ≤ SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK. Violating it blows DeepEP’s dispatch buffer at steady-state load (deep_ep.cpp:1105). When tuning, move --cuda-graph-max-bs, --max-running-requests, and the env together.
The generator currently picks values on the conservative side (mirroring an internal stress-test matrix). They run safely out of the box but likely leave throughput on the table — please tune them up toward your actual workload’s peak concurrency and report findings back so the defaults can be revised.
MTP (Multi-Token Prediction, EAGLE)
low-latency: steps=3, draft-tokens=4 → largest win at bs=1.balanced: steps=1, draft-tokens=2 → gentler MTP, reduces throughput hit at higher batch.max-throughput: MTP disabled — at saturation the verify step costs more than it saves.- MTP currently requires
SGLANG_ENABLE_SPEC_V2=1.
- Original FP4 checkpoints: To run original FP4 checkpoints, we provide two different options for w4a16 MoE kernels: Marlin (
--moe-runner-backend marlin) and Flashinfer (`—moe-runner-backend flashinfer_mxfp4). For this variant we only support Tensor Parallelism. Complete Pro model can be run on a single H200 node with this option. - Converted FP8 checkpoints: We also provide pre-converted FP8 checkpoints (
sgl-project/DeepSeek-V4-Flash-FP8,sgl-project/DeepSeek-V4-Pro-FP8), which support more parallelism and features.
docker run --privileged --ulimit memlock=-1
(or --device /dev/infiniband:/dev/infiniband --cap-add IPC_LOCK) so mooncake
can discover the IB HCAs; without IB exposure mooncake silently falls back to
TCP, which can lead to garbled KV transfer on large checkpoints.
MegaMoE
MegaMoE fuses expert dispatch + GEMM into a single kernel for higher throughput
on MoE layers. To enable it, use the MegaMoE toggle in the
command generator above — the generator will swap
--moe-a2a-backend deepep for --moe-a2a-backend megamoe and add the
relevant env vars automatically.
Two variants are exposed:
- W4A8 — default MegaMoE kernel (FP4 weights, FP8 activations).
- W4A4 — adds
SGLANG_OPT_DEEPGEMM_MEGA_MOE_USE_FP4_ACTS=1andSGLANG_OPT_DEEPGEMM_MEGA_MOE_USE_MXF4_KIND=1to run the custom W4A4 kernel (FP4 activations). Higher throughput with negligible accuracy drop (~89.5 GPQA on Pro).
- MegaMoE is not supported on Hopper (H100 / H200) nor on the
low-latency/cpsettings. When running MegaMoE, don’t set--moe-runner-backendmanually. - Adjust
SGLANG_OPT_DEEPGEMM_MEGA_MOE_NUM_MAX_TOKENS_PER_RANKbased on your workload and memory usage. Setting higher number of tokens for MegaMoE requires more HBM space. (recommended: 4096 for balanced, 8320 for max-throughput).
nvlink_transport.cpp:497 Requested address ... not found!. If
this happens, prepend MC_FORCE_MNNVL=1 NCCL_MNNVL_ENABLE=1 NCCL_CUMEM_ENABLE=1
to both prefill and decode sglang serve commands.
4. Model Invocation
4.1 Basic Usage
For basic API usage and request examples, see: Once the server is running (for example via the command generator above), send a request:Command
PD-Disagg note: if you deployed with thepd-disaggrecipe from the generator above, the prefill server is on port30000, the decode server on30001, and the router on port8000— client traffic should targethttp://localhost:8000, not:30000.
4.2 Advanced Usage
4.2.1 Reasoning Parser
Enable thedeepseek-v4 reasoning parser (check the box in the command panel above) to separate thinking from the final answer into reasoning_content vs content.
Streaming with Thinking Process:
Example
Output
4.2.2 Tool Calling
Enable thedeepseekv4 tool-call parser (check the box in the command panel above) to surface structured tool calls via message.tool_calls.
Python Example (with Thinking Process):
Example
Output
4.2.3 HiCache (Hierarchical KV Caching)
HiCache enables multi-tier KV cache offloading (GPU → CPU → Storage), significantly expanding effective context capacity for long-context and multi-turn scenarios. Combined with UnifiedRadixTree, it provides intelligent prefix caching across all tiers. To enable HiCache, use the HiCache toggle in the command generator above:- L2 (GPU + CPU): Offloads cold KV pages to CPU memory. Enables
SGLANG_ENABLE_UNIFIED_RADIX_TREE=1for intelligent hierarchical prefix caching. - L3 (GPU + CPU + Storage): Coming soon.
5. Benchmark
5.1 Speed Benchmark on Blackwell
Test Environment:- Hardware: NVIDIA B200 GPU (4x)
- Model: DeepSeek-V4-Flash (FP4)
- Tensor Parallelism: 4
- sglang version: Pending update
5.1.1 Latency-Sensitive Benchmark
- Model Deployment Command: see the command panel above.
- Benchmark Command:
Command
- Test Results:
Output
5.1.2 Throughput-Sensitive Benchmark
- Model Deployment Command: see the command panel above.
- Benchmark Command:
Command
- Test Results:
Output
5.2 Accuracy Benchmark
5.2.1 GSM8K Benchmark
- Benchmark Command:
Command
- Test Results:
- DeepSeek-V4-Flash (FP4, Blackwell)
- DeepSeek-V4-Flash (FP8, Hopper)
- DeepSeek-V4-Flash (FP4, Blackwell)
5.2.2 MMLU Benchmark
- Benchmark Command:
Command
- Test Results:
- DeepSeek-V4-Flash (FP4, Blackwell)
- DeepSeek-V4-Flash (FP8, Hopper)
- DeepSeek-V4-Flash (FP4, Blackwell)
5.3 Speed Benchmark on Hopper
Test Environment:- Hardware: NVIDIA H200 GPU (4x)
- Model: DeepSeek-V4-Flash (FP8)
- Tensor Parallelism: 4
- sglang version: Pending update
5.3.1 Latency-Sensitive Benchmark
- Model Deployment Command: see the command panel above.
- Benchmark Command:
Command
- Test Results:
Output
5.3.2 Throughput-Sensitive Benchmark
- Model Deployment Command: see the command panel above.
- Benchmark Command:
Command
- Test Results:
Output
