DeepSeek-V4 - SGLang Documentation

Deployment

Install SGLang

For all methods and hardware platforms, see the official SGLang installation guide. The two paths below match the Python / Docker toggle in the command panel.

Python (pip / uv)

Command

pip install --upgrade pip
pip install uv
uv pip install sglang

Then run the Python output of the command panel below in that environment.

Docker

For how to launch the image, see Install → Method 3: Using Docker. A minimal example (substitute the inner sglang serve ... with whatever the command generator below produces):

NVIDIA GPUs

A single image, lmsysorg/sglang:latest, covers the datacenter GPUs in this cookbook (B200 / B300 / GB200 / GB300 / H100 / H200 / RTX PRO 6000).

Command

docker pull lmsysorg/sglang:latest

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<your-hf-token>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    sglang serve <use args below>

AMD GPUs (ROCm)

AMD uses the daily-updated lmsysorg/sglang-rocm images. You can find the latest images on Docker Hub. We recommend the ROCm 7.2 version. For example:

MI355X → lmsysorg/sglang-rocm:v0.5.14-rocm720-mi35x-20260710
MI300X → lmsysorg/sglang-rocm:v0.5.13.post1-rocm720-mi30x-20260623

Command

docker pull lmsysorg/sglang-rocm:v0.5.14-rocm720-mi35x-20260710

docker run \
    --device=/dev/kfd --device=/dev/dri \
    --group-add video \
    --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
    --shm-size 32g --ipc=host \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<your-hf-token>" \
    lmsysorg/sglang-rocm:v0.5.14-rocm720-mi35x-20260710 \
    sglang serve <use args below>

Pick your hardware + recipe to generate the launch command. The three serving strategies cover the common operating points:

Low-Latency — fastest reply for a single user. Pick for chat.
Balanced — good speed with several users at once. Use for typical multi-user serving.
High-Throughput — most tokens per second across many users. Best for batch jobs.

For a runnable end-to-end example, see the DeepSeek-V4-Flash demo notebook.

Panel controls (top of the command box):

Python / Docker — bare sglang serve … for an existing SGLang env, or a docker run … sglang serve … wrap against the per-hardware image from the Install SGLang panel above.
⧉ Copy — copies the current command (with whichever framing is active) to your clipboard.
$ cURL — a sample request against localhost:30000 to confirm the server is up.
⚙ Env — edits the placeholders (HOST_IP, PORT, HF_TOKEN, NODE_RANK, NODE0_IP) the command and cURL share. Persists in localStorage across cookbooks.
Verified / Not Verified badge — green when the (hw, variant, quant, strategy, nodes) combo has been run end-to-end on real hardware; yellow when auto-derived from a neighbor and not yet re-checked.
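
As a rough Python counterpart to the $ cURL button, the sketch below builds a small probe request against the server. It assumes the server exposes the OpenAI-compatible /v1/chat/completions route used by the examples in Section 3 and that the model name matches the --model-path you launched with. build_probe and send_probe are hypothetical helper names, not part of SGLang.

```python
import json
import urllib.request


def build_probe(host="localhost", port=30000,
                model="deepseek-ai/DeepSeek-V4-Flash"):
    """Build (but do not send) a tiny chat request for a liveness check."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": "ping"}],
        "max_tokens": 8,
    }
    return urllib.request.Request(
        url=f"http://{host}:{port}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_probe(request, timeout=30):
    """Send the probe; needs a running server. Returns the HTTP status code."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Calling send_probe(build_probe()) against a running server should return 200 once the model has finished loading; a connection error usually means the server is still starting or the port differs.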

Playground

The Playground is where you experiment with SGLang features beyond the verified matrix. The Deploy panel above only emits combinations the SGLang team has signed off on; the Playground lets you turn on additional knobs on top of whichever cell the Deploy panel is currently showing. The base is read live from your Deploy selection — only your overrides change. The knobs come in two flavors:

Built-in SGLang features — parallelism overrides (TP / CP / DP-Attention — DP-Attention’s value is the DP degree, with off to disable), MoE backend + EP, reasoning / tool-call parsers, speculative-decoding presets, prefill/decode disaggregation, HiCache tiers, and HiSparse hierarchical sparse attention (decode-role only — the card appears once PD-Disagg mode is set to decode).
DeepSeek-V4 specific features — MegaMoE W4A8 / W4A4 fused kernel (Blackwell only).

Lines highlighted green are added by your overrides; lines with red strikethrough were in the verified base but stripped by an override. When no override differs from the base cell, the playground inherits the base’s Verified badge; any actual change flips it to Not Verified until the new configuration is run end-to-end and submitted back.

Panel controls reuse Python / Docker · ⧉ Copy · $ cURL · ⚙ Env from the Deploy panel, plus one extra:

Submit ↗ — opens a pre-filled GitHub issue so you can land your override combo as a new verified cookbook cell. Shown only while the badge says Not Verified; click it once you’ve actually run the command on your hardware and confirmed it works.

1. Model Introduction

DeepSeek-V4 is the next-generation Mixture-of-Experts model from DeepSeek, released 2026-04-24 under an MIT License. It ships as two Instruct repos (one per variant) plus matching Base repos:

Variant	Total params	Active params (MoE)	Typical deployment
DeepSeek-V4-Flash	284B	13B	single-node serving on B200 / B300 / GB200 / GB300 / H200 (TP=4); RTX PRO 6000 (TP=2); H100 (TP=8)
DeepSeek-V4-Pro	1.6T	49B	high-capacity: B200 / B300 (TP=8) · GB300 (TP=4) · H200 FP4 (TP=8) · GB200 (2-node, TP=8) · H200 FP8 (2-node, TP=16) · H100 (2-node, TP=16)

Both Instruct repos ship as FP4 MoE experts + FP8 attention / dense layers (one mixed-precision checkpoint covers every FP4-capable GPU). The matching *-Base repos ship in FP8 mixed precision and are for further pre-training only, not for chat or tool calling.

Highlights: hybrid CSA + HCA attention (roughly 27% of the inference FLOPs and 10% of the KV cache of DSv3.2 at 1M context), manifold-constrained hyper-connections (mHC), the Muon optimizer, a 1M-token context window (32T+ pre-training tokens), three reasoning modes (Non-think / Think High / Think Max; use ≥ 384K context for Think Max), and a dedicated encoding_dsv4.encode_messages Python encoder plus the DSML tool-call grammar.

Recommended generation settings: temperature=1.0, top_p=1.0.

Resources: HuggingFace (Flash · Pro) · ModelScope (Flash · Pro).
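
The recommended sampling settings can be kept in one place and reused across requests. The sketch below is one way to do that with the OpenAI-compatible client used later in this page; chat_kwargs is a hypothetical helper, and the thinking toggle simply mirrors the chat_template_kwargs field shown in Section 3.

```python
# Sampling values recommended in the model introduction.
RECOMMENDED_SAMPLING = {"temperature": 1.0, "top_p": 1.0}


def chat_kwargs(model, prompt, thinking=False, **overrides):
    """Return keyword arguments for client.chat.completions.create()."""
    kwargs = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        **RECOMMENDED_SAMPLING,
        # Same request field the reasoning example uses to enable thinking.
        "extra_body": {"chat_template_kwargs": {"thinking": thinking}},
    }
    kwargs.update(overrides)  # explicit overrides win over the defaults
    return kwargs
```

With a client created as in Section 3, a call would look like client.chat.completions.create(**chat_kwargs("deepseek-ai/DeepSeek-V4-Flash", "Hello", thinking=True)).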

2. Configuration Tips

Concurrency & DeepEP dispatch buffer

The following must hold: max-running-requests × MTP_draft_tokens ≤ SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK. Violating it overflows DeepEP’s dispatch buffer at steady-state load (deep_ep.cpp:1105). When tuning, move --cuda-graph-max-bs-decode, --max-running-requests, and the env var together. The generator currently picks conservative values (mirroring an internal stress-test matrix). They run safely out of the box but likely leave throughput on the table; tune them up toward your workload’s peak concurrency and report findings back so the defaults can be revised.

MTP (Multi-Token Prediction, EAGLE)

low-latency: steps=3, draft-tokens=4 → largest win at bs=1.
balanced: steps=1, draft-tokens=2 → gentler MTP, reduces throughput hit at higher batch.
high-throughput: MTP disabled — at saturation the verify step costs more than it saves.
MTP runs on the v2 speculative path.
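
The dispatch-buffer constraint and the MTP presets above combine into a quick sanity check before launching. The sketch below is a minimal illustration: the function names are hypothetical, the buffer size of 1024 in the usage note is an example value rather than a recommended setting, and treating the MTP-disabled high-throughput preset as one token per request is an assumption.

```python
# Draft tokens per request for each strategy, taken from the presets above.
# Assumption: with MTP disabled, each request contributes one token per step.
MTP_DRAFT_TOKENS = {"low-latency": 4, "balanced": 2, "high-throughput": 1}


def max_safe_running_requests(strategy, dispatch_tokens_per_rank):
    """Largest --max-running-requests that still satisfies the constraint."""
    return dispatch_tokens_per_rank // MTP_DRAFT_TOKENS[strategy]


def dispatch_buffer_ok(strategy, max_running_requests, dispatch_tokens_per_rank):
    """True when max-running-requests x draft tokens fits in the buffer."""
    needed = max_running_requests * MTP_DRAFT_TOKENS[strategy]
    return needed <= dispatch_tokens_per_rank
```

For example, with SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=1024, the low-latency preset leaves room for at most 256 running requests and the balanced preset for 512.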

Compressed attention state dtype

DeepSeek-V4 uses hybrid compressed attention for long-context efficiency. SGLANG_DSV4_COMPRESS_STATE_DTYPE controls the dtype of the C4 / C128 compressed attention state pools. Supported values are float32 / fp32 (default: float32) and bfloat16 / bf16. For BF16 on the offline compression path:

Command

SGLANG_DSV4_COMPRESS_STATE_DTYPE=bf16 \
sglang serve \
  --model-path deepseek-ai/DeepSeek-V4-Flash \
  <other args>

This BF16 setting applies only to the compressed attention state pools and reduces the GPU memory footprint of each compressed-state slot. It does not change model weight precision or the main KV cache dtype. With automatic pool sizing and no explicit capacity cap, the same memory budget holds more slots, and the startup log shows larger c4_state and c128_state pool sizes. Keep the default float32 setting for the most conservative behavior.

EPLB + Waterfill (Experimental)

For recorded/static EPLB reproduction, first record an expert-distribution file by following Capture expert selection distribution in MoE models. For reproduction runs, use the generated expert_distribution_recorder_*.pt as the initial expert location. This feature requires the latest main branch. For non-PD reproduction, use:

Command

--moe-a2a-backend deepep \
--deepep-mode auto \
--init-expert-location /path/to/expert_distribution_recorder_*.pt \
--enable-waterfill

For PD-Disagg reproduction, use normal mode on the prefill server and low_latency mode on the decode server. Add the same --init-expert-location flag to both commands:

Command

# prefill
--moe-a2a-backend deepep \
--deepep-mode normal \
--init-expert-location /path/to/expert_distribution_recorder_*.pt \
--enable-waterfill

# decode
--moe-a2a-backend deepep \
--deepep-mode low_latency \
--init-expert-location /path/to/expert_distribution_recorder_*.pt \
--enable-waterfill
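
The three flag sets above differ only in the DeepEP mode, so they can be generated from the server role. The sketch below is a small illustration of that mapping; waterfill_flags is a hypothetical helper and the path passed to it is a placeholder.

```python
# DeepEP mode per server role, as listed in the commands above.
DEEPEP_MODE_BY_ROLE = {
    "non_pd": "auto",
    "prefill": "normal",
    "decode": "low_latency",
}


def waterfill_flags(role, expert_file):
    """Return the EPLB + Waterfill launch flags for one server role."""
    if role not in DEEPEP_MODE_BY_ROLE:
        raise ValueError(f"unknown role: {role!r}")
    return [
        "--moe-a2a-backend", "deepep",
        "--deepep-mode", DEEPEP_MODE_BY_ROLE[role],
        # The same recorded file is passed to every role.
        "--init-expert-location", expert_file,
        "--enable-waterfill",
    ]
```

Joining the result with spaces, for example " ".join(waterfill_flags("decode", "/path/to/file.pt")), gives a fragment that can be appended to the generated sglang serve command.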

You can also add --ep-num-redundant-experts and --eplb-algorithm to customize EPLB placement. Waterfill also supports MegaMoE. Use --moe-a2a-backend megamoe --enable-waterfill to keep the MegaMoE backend while applying Waterfill to the fused shared expert slot.

FP4 Indexer (Experimental)

DeepSeek-V4 uses the default indexer path unless --enable-deepseek-v4-fp4-indexer is set. Enable this flag to use the experimental FP4 C4 indexer on SM100 GPUs with DeepGEMM FP4 indexer support. This path is intended for decode-heavy long-context workloads where reducing indexer cache bandwidth is beneficial.

Command

# Please use the latest main branch for this feature.
sglang serve \
  --model-path deepseek-ai/DeepSeek-V4-Flash \
  --tp 4 \
  --moe-runner-backend flashinfer_mxfp4 \
  --enable-deepseek-v4-fp4-indexer

NVFP4 Hybrid Checkpoints

The nvidia/DeepSeek-V4-Pro-NVFP4 and nvidia/DeepSeek-V4-Flash-NVFP4 checkpoints quantize MoE experts to NVFP4 while keeping attention and dense layers in FP8. They require --moe-runner-backend flashinfer_trtllm_routed, which is selected automatically if not provided.

Command

sglang serve \
  --model-path nvidia/DeepSeek-V4-Pro-NVFP4 \
  --tp 8

Command

sglang serve \
  --model-path nvidia/DeepSeek-V4-Flash-NVFP4 \
  --tp 8

Requires Blackwell (SM100+). The MTP layer in this checkpoint stays MXFP4-packed and is routed through the Mxfp4FlashinferTrtllmMoEMethod path automatically.

Hopper (H100 / H200) note

Two options are available for running DeepSeek-V4 on Hopper:

Original FP4 checkpoints — use the W4A16 MoE kernels (Marlin), which the command generator selects for Hopper cells. This path works on both H100 and H200 and is the only option for H100 (no FP8 path). It is TP-only; on H200 the Pro variant fits on a single 8-GPU node, while H100 Pro needs 2 nodes (TP=16).
Converted FP8 checkpoints (H100 and H200 only) — pre-repackaged FP8 weights at sgl-project/DeepSeek-V4-Flash-FP8 and sgl-project/DeepSeek-V4-Pro-FP8 unlock DP-attention + DeepEP and richer parallelism (e.g. Pro TP=16 across 2 nodes).

PD-Disagg recipes on H200 may require docker run --privileged --ulimit memlock=-1 (or --device /dev/infiniband:/dev/infiniband --cap-add IPC_LOCK) so mooncake can discover the IB HCAs; without IB exposure mooncake silently falls back to TCP, which can lead to garbled KV transfer on large checkpoints.

RTX PRO 6000 (SM120 / Blackwell Desktop) note

RTX PRO 6000 (96 GB) runs Flash only, with the FlashInfer MXFP4 MoE runner. V4-Pro doesn’t fit on 8× 96 GB; the Deploy panel greys out unsupported recipes. HiCache and MegaMoE are not supported on RTX PRO 6000.

AMD (MI300X / MI355X) note

Model checkpoints — for correct accuracy, the FP4 model uses the stock deepseek-ai/DeepSeek-V4-{Flash,Pro}, and the FP8 model uses the repackaged sgl-project/DeepSeek-V4-{Flash,Pro}-FP8.
Supported models — MI300X supports DeepSeek-V4-Flash in FP8; MI355X supports DeepSeek-V4-Flash / Pro in both FP4 and FP8. All recipes run single-node.
TP / DP setting — both TP=4 and TP=8 are supported. At low concurrency we recommend TP-only; at high concurrency use TP + DP, which additionally needs --dp 8 --enable-dp-attention --enable-prefill-delayer --prefill-delayer-max-delay-ms 5000.
MTP — speculative decoding is supported; add --speculative-algorithm EAGLE --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4.
Kernels — uses the Unified KV attention and the flydsl MoE.
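
The AMD settings above can be assembled programmatically. The sketch below copies the flags verbatim from the list; amd_extra_args is a hypothetical helper, and whether --dp 8 should be adjusted when running TP=4 is not stated here, so the value is kept exactly as quoted.

```python
def amd_extra_args(tp, high_concurrency=False, mtp=False):
    """Return extra sglang serve args for MI300X / MI355X recipes."""
    if tp not in (4, 8):
        raise ValueError("this section only lists TP=4 and TP=8")
    args = ["--tp", str(tp)]
    if high_concurrency:
        # TP + DP flags, quoted verbatim from the AMD note.
        args += [
            "--dp", "8",
            "--enable-dp-attention",
            "--enable-prefill-delayer",
            "--prefill-delayer-max-delay-ms", "5000",
        ]
    if mtp:
        # MTP (EAGLE) flags, quoted verbatim from the AMD note.
        args += [
            "--speculative-algorithm", "EAGLE",
            "--speculative-num-steps", "3",
            "--speculative-eagle-topk", "1",
            "--speculative-num-draft-tokens", "4",
        ]
    return args
```

A low-concurrency run would use amd_extra_args(8), while a high-concurrency run with speculative decoding would use amd_extra_args(8, high_concurrency=True, mtp=True).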

MegaMoE

MegaMoE fuses expert dispatch + GEMM into a single kernel for higher throughput on MoE layers. To enable it, use the MegaMoE chip in the Playground above; the Playground swaps --moe-a2a-backend deepep for --moe-a2a-backend megamoe and adds the relevant env vars automatically. Two variants are exposed:

W4A8 — default MegaMoE kernel (FP4 weights, FP8 activations).
W4A4 — adds SGLANG_OPT_DEEPGEMM_MEGA_MOE_USE_FP4_ACTS=1 and SGLANG_OPT_DEEPGEMM_MEGA_MOE_USE_MXF4_KIND=1 to run the custom W4A4 kernel (FP4 activations). Higher throughput with negligible accuracy drop (~89.5 GPQA on Pro).

Notes:

MegaMoE is only supported on Blackwell GPUs (B200 / B300 / GB200 / GB300). The chip is hidden when the Deploy panel’s base cell sits on Hopper (H100 / H200).
MegaMoE is only wired into the high-throughput recipe on Blackwell (per sgl-project/sglang#26451). The chip is hidden on low-latency and balanced — switch to high-throughput to expose it.
When running MegaMoE, don’t set --moe-runner-backend manually.
Adjust SGLANG_OPT_DEEPGEMM_MEGA_MOE_NUM_MAX_TOKENS_PER_RANK based on your workload and memory usage. A higher token limit for MegaMoE requires more HBM (recommended: 8320 for high-throughput).
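
For readers assembling the command by hand instead of through the Playground, the sketch below mirrors the changes described above: it swaps the all-to-all backend, drops any manual --moe-runner-backend, and collects the environment variables named in this section. Both helper names are hypothetical, and the Playground may set additional variables that are not listed here.

```python
def megamoe_env(variant="W4A8", max_tokens_per_rank=8320):
    """Environment variables named in this section for a MegaMoE variant."""
    env = {
        "SGLANG_OPT_DEEPGEMM_MEGA_MOE_NUM_MAX_TOKENS_PER_RANK": str(max_tokens_per_rank),
    }
    if variant == "W4A4":
        env["SGLANG_OPT_DEEPGEMM_MEGA_MOE_USE_FP4_ACTS"] = "1"
        env["SGLANG_OPT_DEEPGEMM_MEGA_MOE_USE_MXF4_KIND"] = "1"
    elif variant != "W4A8":
        raise ValueError(f"unknown MegaMoE variant: {variant!r}")
    return env


def to_megamoe_args(args):
    """Swap the a2a backend to megamoe and drop --moe-runner-backend."""
    out, i = [], 0
    while i < len(args):
        if args[i] == "--moe-runner-backend":
            i += 2  # skip the flag and its value
        elif args[i] == "--moe-a2a-backend" and i + 1 < len(args):
            out += ["--moe-a2a-backend", "megamoe"]
            i += 2
        else:
            out.append(args[i])
            i += 1
    return out
```

The returned dictionary can be merged into os.environ (or passed as env= to subprocess.run) before launching the rewritten argument list.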

GB300 PD-Disagg cross-pod MNNVL

On some GB300 clusters with cross-pod KV transfer over NVLink, mooncake may fail with nvlink_transport.cpp:497 Requested address ... not found!. If this happens, prepend MC_FORCE_MNNVL=1 NCCL_MNNVL_ENABLE=1 NCCL_CUMEM_ENABLE=1 to both the prefill and decode sglang serve commands.
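
When the launch commands are produced by a script, the prefix can be applied to both roles in one place. The sketch below is a minimal illustration; with_env_prefix is a hypothetical helper and the command strings are placeholders.

```python
import shlex

# Variables quoted in the GB300 note above.
MNNVL_ENV = {
    "MC_FORCE_MNNVL": "1",
    "NCCL_MNNVL_ENABLE": "1",
    "NCCL_CUMEM_ENABLE": "1",
}


def with_env_prefix(command, env=None):
    """Prepend KEY=VALUE pairs to a shell command string."""
    env = MNNVL_ENV if env is None else env
    prefix = " ".join(f"{key}={shlex.quote(value)}" for key, value in env.items())
    return f"{prefix} {command}"


# Apply the same prefix to both roles (placeholder commands).
commands = {
    role: with_env_prefix(f"sglang serve <{role} args>")
    for role in ("prefill", "decode")
}
```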

3. Advanced Usage

3.1 Reasoning

Enable the deepseek-v4 reasoning parser (toggle Reasoning Parser in the Parsers card of the Playground above) to separate thinking from the final answer into reasoning_content vs content.

Streaming with Thinking Process (Python)

Example

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY"
)

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V4-Flash",
    messages=[
        {"role": "user", "content": "Solve this problem step by step: What is 15% of 240?"}
    ],
    max_tokens=2048,
    extra_body={"chat_template_kwargs": {"thinking": True}},
    stream=True,
)

thinking_started = False
has_thinking = False
has_answer = False

for chunk in response:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta

    if getattr(delta, "reasoning_content", None):
        if not thinking_started:
            print("=============== Thinking =================", flush=True)
            thinking_started = True
        has_thinking = True
        print(delta.reasoning_content, end="", flush=True)

    if delta.content:
        if has_thinking and not has_answer:
            print("\n=============== Content =================", flush=True)
            has_answer = True
        print(delta.content, end="", flush=True)

print()

Example Output

Output

We are asked: "What is 15% of 240?" This is a simple percentage problem. I need to provide a step-by-step solution. The user wants the solution explained step by step. I'll calculate 15% of 240: 0.15 * 240 = 36. I'll break it down into steps: understand what percent means, convert percentage to decimal or fraction, then multiply. I'll present the answer clearly.</think>To find 15% of 240, follow these steps:

**Step 1: Understand the meaning of percent**
"Percent" means "per hundred," so 15% means 15 out of every 100, or \( \frac{15}{100} \).

**Step 2: Convert the percentage to a decimal or fraction**
\( 15\% = \frac{15}{100} = 0.15 \)

**Step 3: Multiply by the given number**
Multiply the decimal form by 240:
\( 0.15 \times 240 \)

**Step 4: Perform the multiplication**
\( 0.15 \times 240 = 36 \)

**Answer:** 15% of 240 is **36**.
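
The raw output above contains a literal </think> marker between the reasoning and the answer. If a response arrives as a single string in that form (for example when the reasoning parser is not enabled), a client-side split is a possible fallback. The sketch below assumes the marker appears at most once per response and that text without a marker is entirely answer; split_thinking is a hypothetical helper, not an SGLang API.

```python
def split_thinking(text, marker="</think>"):
    """Split raw model text into (reasoning, answer) at the first marker."""
    reasoning, found, answer = text.partition(marker)
    if not found:
        # No marker: treat the whole text as the final answer.
        return "", text.strip()
    return reasoning.strip(), answer.strip()


raw = "0.15 * 240 = 36.</think>15% of 240 is **36**."
reasoning, answer = split_thinking(raw)
print("Thinking:", reasoning)
print("Answer:", answer)
```

With the reasoning parser enabled, prefer the reasoning_content and content fields, since the server then performs this separation itself.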

3.2 Tool Calling

Enable the deepseekv4 tool-call parser (toggle Tool Call Parser in the Parsers card of the Playground above) to surface structured tool calls via message.tool_calls.

Python Example with Thinking Process

Example

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY"
)

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "The city name"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                },
                "required": ["location"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V4-Flash",
    messages=[{"role": "user", "content": "What's the weather in Beijing?"}],
    tools=tools,
    extra_body={"chat_template_kwargs": {"thinking": True}},
    stream=True,
)

thinking_started = False
has_thinking = False
tool_calls_accumulator = {}

for chunk in response:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta

    if getattr(delta, "reasoning_content", None):
        if not thinking_started:
            print("=============== Thinking =================", flush=True)
            thinking_started = True
        has_thinking = True
        print(delta.reasoning_content, end="", flush=True)

    if getattr(delta, "tool_calls", None):
        if has_thinking and thinking_started:
            print("\n=============== Content =================\n", flush=True)
            thinking_started = False
        for tool_call in delta.tool_calls:
            index = tool_call.index
            if index not in tool_calls_accumulator:
                tool_calls_accumulator[index] = {"name": None, "arguments": ""}
            if tool_call.function:
                if tool_call.function.name:
                    tool_calls_accumulator[index]["name"] = tool_call.function.name
                if tool_call.function.arguments:
                    tool_calls_accumulator[index]["arguments"] += tool_call.function.arguments

    if delta.content:
        print(delta.content, end="", flush=True)

for index, tool_call in sorted(tool_calls_accumulator.items()):
    print(f"Tool Call: {tool_call['name']}")
    print(f"   Arguments: {tool_call['arguments']}")

print()

Example Output

Output

The user wants to know the weather in Beijing. I'll use the get_weather function with Beijing as the location. I don't need to specify a unit, so I'll just use the default.</think>

<｜DSML｜tool_calls>
<｜DSML｜invoke name="get_weather">
<｜DSML｜parameter name="location" string="true">Beijing</｜DSML｜parameter>
</｜DSML｜invoke>
</｜DSML｜tool_calls>
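
Once the stream has been accumulated as in the example above, each entry holds a tool name and a JSON argument string. A natural next step is to decode the arguments and call a local function. The sketch below shows one way to do that; the get_weather body is a stand-in stub, and TOOL_REGISTRY and run_tool_calls are hypothetical names.

```python
import json


def get_weather(location, unit="celsius"):
    """Stand-in stub; a real tool would query a weather service."""
    return {"location": location, "unit": unit, "forecast": "unknown"}


# Maps tool names declared in `tools` to local Python callables.
TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool_calls(accumulator):
    """Execute accumulated tool calls in index order and collect results."""
    results = []
    for index, call in sorted(accumulator.items()):
        func = TOOL_REGISTRY.get(call["name"])
        if func is None:
            results.append({"index": index, "error": f"unknown tool: {call['name']}"})
            continue
        # Arguments arrive as a JSON string assembled from streamed fragments.
        kwargs = json.loads(call["arguments"] or "{}")
        results.append({"index": index, "result": func(**kwargs)})
    return results


example = {0: {"name": "get_weather", "arguments": '{"location": "Beijing"}'}}
print(run_tool_calls(example))
```

Each result would typically be sent back to the model as a message with role "tool" so it can produce the final answer.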

3.3 HiCache (Hierarchical KV Caching)

HiCache enables multi-tier KV cache offloading (GPU → CPU → Storage), significantly expanding effective context capacity for long-context and multi-turn scenarios. Combined with UnifiedRadixTree, it provides intelligent prefix caching across all tiers. To enable HiCache, open the HiCache card in the Playground above and flip Enable:

L2 (GPU + CPU) — leave Storage on auto (default). Cold KV pages spill to CPU pinned memory only.
L3 (GPU + CPU + Storage) — pick a Storage backend (file / mooncake / hf3fs / nixl); the Playground emits the canonical page_first_direct mem-layout + direct IO backend + wait_complete prefetch policy, matching the HiCache best-practices recipe.

For AMD devices:

L2 (GPU + CPU) — leave Storage on auto (default). Cold KV pages spill to CPU pinned memory only. Use the direct IO backend with the page_first_direct or layer_first mem-layout.
L3 (GPU + CPU + Storage) — pick a Storage backend (file); the Playground emits the canonical page_first_direct mem-layout + direct IO backend + wait_complete prefetch policy, matching the HiCache best-practices recipe.

The Write policy knob defaults to write_through (the upstream default); switch to write_back / write_through_selective to trade durability for write speed when the storage tier is slow. For more details, see the HiCache documentation.
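
For orientation, the sketch below shows how the L2 and L3 choices described above might translate into launch arguments. The option values (page_first_direct, direct, wait_complete, the write policies, and the storage backends) come from this section, but the flag names are assumptions that this page does not spell out; check them against the server's --help output or the HiCache documentation before use. hicache_args is a hypothetical helper.

```python
WRITE_POLICIES = ("write_through", "write_back", "write_through_selective")
STORAGE_BACKENDS = ("file", "mooncake", "hf3fs", "nixl")


def hicache_args(storage_backend=None, write_policy="write_through"):
    """Return extra args for L2 (no storage backend) or L3 (with one).

    Flag names are assumed, not taken from this page.
    """
    if write_policy not in WRITE_POLICIES:
        raise ValueError(f"unknown write policy: {write_policy!r}")
    args = ["--enable-hierarchical-cache", "--hicache-write-policy", write_policy]
    if storage_backend is not None:
        if storage_backend not in STORAGE_BACKENDS:
            raise ValueError(f"unknown storage backend: {storage_backend!r}")
        # L3: the canonical layout / IO backend / prefetch policy named above.
        args += [
            "--hicache-storage-backend", storage_backend,
            "--hicache-mem-layout", "page_first_direct",
            "--hicache-io-backend", "direct",
            "--hicache-storage-prefetch-policy", "wait_complete",
        ]
    return args
```

The Playground remains the reference for the exact command; this helper only illustrates which settings travel together for each tier.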
