
Deployment

For all methods and hardware platforms, see the official SGLang installation guide. The two paths below match the Python / Docker toggle in the command panel.
Command
pip install --upgrade pip
pip install uv
uv pip install sglang
Then run the Python output of the command panel below in that environment.
Pick your hardware + recipe to generate the launch command. The three serving strategies cover the common operating points:
  • Low-Latency — fastest reply for a single user. Pick for chat.
  • Balanced — good speed with several users at once. Use for typical multi-user serving.
  • High-Throughput — most tokens per second across many users. Best for batch jobs.
Speed numbers are measured with --random-range-ratio 1.0 and --flush-cache on main @ 09ca4fc. Cells that use speculative decoding pin the EAGLE acceptance length through the serve environment variable SGLANG_SIMULATE_ACC_LEN (3.5 for the low-latency 5-1-6 setting, 2 for the balanced 1-1-2 setting); high-throughput runs without speculative decoding.
NVFP4 (B300 / GB300) deploys on the dev image lmsysorg/sglang:dev-glm52-nvfp4 — the command panel’s Docker toggle selects it automatically. The FP8 / BF16 recipes use the release lmsysorg/sglang:latest (the docker pull in Install above).

Playground

The Playground is where you experiment with SGLang features beyond the verified matrix. The Deploy panel above only emits combinations the SGLang team has signed off on; the Playground lets you turn on additional knobs on top of whichever cell the Deploy panel is currently showing.

1. Model Introduction

GLM-5.2 is Z.ai’s flagship Mixture-of-Experts model built on DeepSeek Sparse Attention (DSA): a lightning indexer selects a sparse set of key tokens per query (top-2048), so attention cost stays near-constant as context grows. It ships in two precisions — FP8 (zai-org/GLM-5.2-FP8) and full BF16 (zai-org/GLM-5.2) — both with 78 transformer layers, 256 routed experts (8 active per token), a 1M-token context window, and a single MTP (Multi-Token Prediction) layer for built-in EAGLE-style speculative decoding. FP8 is the recommended deployment; BF16 (~1.5 TB) needs an 8×B300 node or a multi-node setup. For Blackwell, NVIDIA also publishes an NVFP4 build (nvidia/GLM-5.2-NVFP4) that quantizes only the MoE experts’ linear weights and activations to 4-bit (the shared expert stays unquantized), holding accuracy within ~1 point of the FP8 baseline on GPQA Diamond, SciCode, and IFBench.
| Model | Architecture | Context |
| --- | --- | --- |
| GLM-5.2-FP8 | MoE · DSA · 256 experts (top-8) · MTP · FP8 | 1,048,576 |
| GLM-5.2 | MoE · DSA · 256 experts (top-8) · MTP · BF16 | 1,048,576 |
| GLM-5.2-NVFP4 | MoE · DSA · 256 experts (top-8) · MTP · NVFP4 | 1,048,576 |
Recommended generation: temperature=1.0, top_p=0.95 (the checkpoint’s generation_config.json defaults; informational — do not hardcode in client code). Resources: GLM-5.2-FP8 · GLM-5.2 (BF16) · GLM-5.2-NVFP4.

2. Configuration Tips

  • DeepSeek Sparse Attention (DSA). GLM-5.2 uses the glm_moe_dsa architecture; SGLang auto-selects the DSA attention backends (flashmla_sparse prefill, fa3 decode, sgl-kernel indexer topk). No attention-backend flag is needed on the supported hardware. SGLang also auto-selects the KV-cache dtype for DSA models — fp8_e4m3 on Blackwell (B200/GB300/B300, which then routes DSA through the TensorRT-LLM backend) and bf16 on Hopper (H200) — so no --kv-cache-dtype flag is required.
  • MTP / speculative decoding. The checkpoint ships one nextn layer. Enable EAGLE MTP for lower latency (--speculative-algorithm EAGLE --speculative-num-steps 5 --speculative-eagle-topk 1 --speculative-num-draft-tokens 6 for low-latency; 1-1-2 for balanced). The config’s index_share_for_mtp_iteration reuses the DSA indexer’s topk across draft steps (effective only at --speculative-eagle-topk 1). Tune the draft length to the accept length. GLM-5.2’s MTP head is strong — accept length runs high (4+ in many workloads, near-saturating at 5–6 in low-latency runs). Watch the server’s reported accept length and adjust --speculative-num-steps / --speculative-num-draft-tokens accordingly: while accept length stays close to the draft-token count there is headroom to push them higher (more accepted tokens per step); if it falls well below, lower them — every rejected draft token is wasted verification compute.
  • Context Parallelism (CP) for long prefill. DSA prefill CP splits the long-prefill attention across --attn-cp-size ranks. On Hopper (H200) this gives a large prefill-latency win at long context — e.g. round-robin CP (--tp 8 --attn-cp-size 8 --enable-dsa-prefill-context-parallel --dsa-prefill-cp-mode round-robin-split) cut 64K-token prefill TTFT roughly 2.5–2.8× vs. plain TP8 in our testing. Trade-offs: CP partitions the KV pool (lower max context at the same --mem-fraction-static) and adds some decode-side overhead, so it pays off only for long sequences. CP is currently verified on Hopper only — the Blackwell (sm100) DSA-CP FP8 rope kernel is not yet adapted, so leave CP off on B200/B300/GB300.
  • Memory. The FP8 weights are large: the footprint is set by the total MoE parameter count, not by the parameters active per token. Start around --mem-fraction-static 0.8 on H200 (TP8) and tune up; raise it for the 4-GPU GB300 single-node layout (TP4).
  • DP-Attention + DeepEP. The balanced and high-throughput strategies spread attention across data-parallel ranks and route MoE through DeepEP.
  • BF16 weights need more GPUs. The full-precision build (zai-org/GLM-5.2, ~1.5 TB) does not fit a single 8×H200 / 8×B200 / 4×GB300 node. It fits single-node on 8×B300 (TP8, ~2.1 TB HBM) — verified; on the smaller GPUs it needs a multi-node layout (e.g. 2×8×H200 or 2×8×B200 at TP16, 2×4×GB300 at TP8), and those multi-node BF16 recipes are still proposed/inferred (verified: false). FP8 is the recommended deployment. Use the same DSA / MTP / chunked-prefill guidance as FP8. On B300, BF16 low-latency matches FP8 (the sm103 FP8 path is not yet optimized), but FP8 wins at the balanced/high-throughput points.
  • Chunked-prefill size is regime-dependent. At long input (8K+) the default --chunked-prefill-size 2048 is too small and leaves the balanced point prefill-bound (queueing dominates TTFT). Raising it to --chunked-prefill-size 32768 on the balanced recipe gave roughly +34–78% output throughput and −39–59% TTFT on 8×H200 and 8×B200 (8K-in / 1K-out) in our testing. It is neutral for high-throughput, which is decode-bound, so keep the default there. --max-running-requests should track KV capacity rather than be tuned freely: roughly 60–90 concurrent 8K+1K FP8 requests fit on a single 8-GPU node, so pin balanced near --max-running-requests 80 and let high-throughput run wider.
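The speculative-decoding and chunked-prefill tips above can be collected into a small helper. This is a minimal sketch, not the command panel's output: build_launch_args and its long_input switch are hypothetical names, it only covers the flags quoted in the tips, and it omits the DP-Attention / DeepEP flags because they are not spelled out here. Treat the command panel as the source of truth for verified recipes.

```python
import shlex

# Speculative-decoding presets quoted in the tips above, as
# (num_steps, eagle_topk, num_draft_tokens). None means no speculation.
SPEC_PRESETS = {
    "low-latency": (5, 1, 6),
    "balanced": (1, 1, 2),
    "high-throughput": None,
}


def build_launch_args(strategy, model_path="zai-org/GLM-5.2-FP8", tp=8, long_input=False):
    """Assemble an illustrative subset of launch flags for one strategy."""
    if strategy not in SPEC_PRESETS:
        raise ValueError(f"unknown strategy: {strategy!r}")

    args = ["python", "-m", "sglang.launch_server",
            "--model-path", model_path, "--tp", str(tp)]

    spec = SPEC_PRESETS[strategy]
    if spec is not None:
        steps, topk, draft_tokens = spec
        args += [
            "--speculative-algorithm", "EAGLE",
            "--speculative-num-steps", str(steps),
            "--speculative-eagle-topk", str(topk),
            "--speculative-num-draft-tokens", str(draft_tokens),
        ]

    # Assumption: long_input stands for the 8K-in / 1K-out regime on a single
    # 8-GPU node, where the tips suggest a larger chunk and a pinned batch cap.
    if strategy == "balanced" and long_input:
        args += ["--chunked-prefill-size", "32768",
                 "--max-running-requests", "80"]
    return args


print(shlex.join(build_launch_args("balanced", long_input=True)))
```

Keeping the presets in one table makes it easy to adjust the draft length after checking the server's reported accept length, without touching the rest of the command.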

3. Advanced Usage

3.1 Reasoning

GLM-5.2 is a hybrid-reasoning model. Enable the glm45 reasoning parser (toggle Reasoning Parser in the Parsers card of the Playground above) to separate thinking from the final answer: thinking lands in message.reasoning_content, the answer in message.content. Thinking is on by default; turn it off with chat_template_kwargs: {"enable_thinking": False} (the template variable is enable_thinking, not thinking).

Reasoning effort. Pass chat_template_kwargs: {"reasoning_effort": ...} to inject a Reasoning Effort: <level> system line (only while thinking is on). The template wires only two effective levels, Max and High. If you don’t pass reasoning_effort at all you get Max, the highest. "high" is the only value that lowers effort; every other value (including "low" and "medium") falls through to Max:
| reasoning_effort | Injected system line | Effect |
| --- | --- | --- |
| (not passed / unset) | Reasoning Effort: Max | default, highest reasoning |
| "high" | Reasoning Effort: High | dials reasoning down |
| "low", "medium", any other value | Reasoning Effort: Max | falls through to Max (not a distinct level) |
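The mapping in the table above is easy to get wrong, so here is a minimal sketch that models it in plain Python. It is not the chat template itself: injected_effort_line is a hypothetical helper, and it assumes the template matches the exact lowercase string "high".

```python
def injected_effort_line(reasoning_effort=None, enable_thinking=True):
    """Return the system line described in the table, or None when thinking is off."""
    if not enable_thinking:
        # The effort line is only injected while thinking is on.
        return None
    # Assumption: only the exact string "high" selects the lower level;
    # everything else, including unset, falls through to Max.
    level = "High" if reasoning_effort == "high" else "Max"
    return f"Reasoning Effort: {level}"


for value in (None, "high", "medium", "low"):
    print(repr(value), "->", injected_effort_line(value))
```

The practical consequence is that a client passing "low" to save tokens gets the highest effort instead; pass "high" to reduce reasoning.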
Example
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="zai-org/GLM-5.2-FP8",
    messages=[{"role": "user", "content": "What is 15% of 240?"}],
    extra_body={"chat_template_kwargs": {"enable_thinking": True, "reasoning_effort": "high"}},
)
msg = resp.choices[0].message
print("Reasoning:", getattr(msg, "reasoning_content", None))
print("Answer:", msg.content)
Output
Reasoning: 1.  **Identify the core question:** The user wants to find 15% of 240.
2.  **Convert the percentage to a decimal:** 15% = 0.15
3.  **Multiply by the total:** 0.15 * 240 = 36
    (Quick mental math: 10% of 240 = 24; 5% = 12; 24 + 12 = 36.)

Answer: 15% of 240 is **36**.

Here is how you can calculate it:
0.15 × 240 = 36
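To turn thinking off for a request, the same call can be built with enable_thinking set to False. The sketch below only assembles the keyword arguments, so it runs without a server; chat_request is a hypothetical helper name, and dropping reasoning_effort when thinking is off is a design choice of this sketch, based on the effort line only applying while thinking is on.

```python
def chat_request(prompt, model="zai-org/GLM-5.2-FP8",
                 enable_thinking=True, reasoning_effort=None):
    """Build keyword arguments for client.chat.completions.create()."""
    template_kwargs = {"enable_thinking": enable_thinking}
    # reasoning_effort only has an effect while thinking is on, so omit it otherwise.
    if enable_thinking and reasoning_effort is not None:
        template_kwargs["reasoning_effort"] = reasoning_effort
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "extra_body": {"chat_template_kwargs": template_kwargs},
    }


request = chat_request("What is 15% of 240?", enable_thinking=False)
print(request["extra_body"])

# With the client from the example above and a running server:
# resp = client.chat.completions.create(**request)
```

With thinking off, the answer should arrive in message.content as before; check reasoning_content on your own deployment rather than assuming it is empty.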

3.2 Tool Calling

Enable the glm47 tool-call parser (toggle Tool Call Parser in the Parsers card of the Playground above) to surface structured tool calls via message.tool_calls. GLM-5.2 emits the newer <tool_call>…<arg_key>…<arg_value>… format, so it needs the glm47 parser; the older glm45 parser does not parse it (the call would be left as raw text in content). In thinking mode the turn also fills reasoning_content, so print both fields.
Example
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]
resp = client.chat.completions.create(
    model="zai-org/GLM-5.2-FP8",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
)
msg = resp.choices[0].message
print("Reasoning:", getattr(msg, "reasoning_content", None))
print("Tool calls:", msg.tool_calls)
Output
Reasoning: The user wants to know the weather in Paris. I'll call the get_weather function with "Paris" as the city.

Tool calls: [
  {
    "id": "call_13fcd52146934b7781d06d4a",
    "type": "function",
    "function": {"name": "get_weather", "arguments": "{\"city\": \"Paris\"}"}
  }
]
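Once a tool call comes back, the client has to run the tool and return the result in a follow-up turn. The sketch below shows that step using the standard OpenAI-style tool message shape (role "tool" with a matching tool_call_id). The get_weather body and TOOL_REGISTRY are hypothetical stand-ins, and the input is the tool call from the output above written as a plain dict.

```python
import json


def get_weather(city):
    # Hypothetical stand-in: a real implementation would query a weather service.
    return {"city": city, "forecast": "unknown"}


# Maps tool names exposed to the model onto local Python callables.
TOOL_REGISTRY = {"get_weather": get_weather}


def tool_result_messages(tool_calls):
    """Run each requested tool and build the matching role="tool" messages."""
    messages = []
    for call in tool_calls:
        function = call["function"]
        arguments = json.loads(function["arguments"])  # arguments arrive as a JSON string
        result = TOOL_REGISTRY[function["name"]](**arguments)
        messages.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": json.dumps(result),
        })
    return messages


# The tool call from the output above, as a plain dict. With the OpenAI client,
# [c.model_dump() for c in msg.tool_calls] yields the same shape.
tool_calls = [{
    "id": "call_13fcd52146934b7781d06d4a",
    "type": "function",
    "function": {"name": "get_weather", "arguments": "{\"city\": \"Paris\"}"},
}]
print(tool_result_messages(tool_calls))
```

For the follow-up request, append the assistant message that carried the tool calls and then these tool messages to the conversation, and call client.chat.completions.create again with the same tools list.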

3.3 HiCache (Hierarchical KV Caching)

For long-context, prefix-heavy workloads, enable hierarchical KV caching to spill cold KV blocks to host memory (toggle the Hierarchical KV Cache card in the Playground above). This is useful given GLM-5.2’s 1M-token window; pair --hicache-ratio with a write policy that matches your reuse pattern.
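As a rough sizing aid, the sketch below assumes --hicache-ratio is the host pool size expressed as a multiple of the device KV pool, and that the feature is switched on with --enable-hierarchical-cache and tuned with --hicache-write-policy. Only --hicache-ratio is named in this guide; the other two flag names and the policy values are assumptions to confirm against python -m sglang.launch_server --help for your build. The helper names are hypothetical.

```python
# Assumed policy names; verify against your SGLang version before relying on them.
ASSUMED_WRITE_POLICIES = ("write_through", "write_through_selective", "write_back")


def host_pool_tokens(device_pool_tokens, hicache_ratio):
    """Estimate host KV capacity, assuming the ratio is host pool / device pool."""
    if hicache_ratio <= 0:
        raise ValueError("hicache_ratio must be positive")
    return int(device_pool_tokens * hicache_ratio)


def hicache_flags(hicache_ratio=2.0, write_policy="write_through"):
    """Build the HiCache portion of a launch command (flag names partly assumed)."""
    if write_policy not in ASSUMED_WRITE_POLICIES:
        raise ValueError(f"unexpected write policy: {write_policy!r}")
    return [
        "--enable-hierarchical-cache",            # assumed flag name
        "--hicache-ratio", str(hicache_ratio),
        "--hicache-write-policy", write_policy,   # assumed flag name
    ]


# Illustrative numbers only: a 500K-token device pool at ratio 2.0.
print(host_pool_tokens(500_000, 2.0))
print(" ".join(hicache_flags()))
```

Under that assumption, a ratio of 2.0 on a 500K-token device pool would give about 1M tokens of host-side capacity, which is the scale at which full-window prefixes can stay resident.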

3.4 Claude Code Integration

GLM-5.2’s strong reasoning + tool-calling makes it a good backend for Claude Code, Anthropic’s agentic CLI. SGLang exposes the Anthropic-compatible /v1/messages endpoint on every server, so Claude Code can talk to a GLM-5.2 server with only environment variables — no code change. Launch the server with --reasoning-parser glm45 --tool-call-parser glm47 (any recipe from the Deployment panel above works), then:
Command
export ANTHROPIC_BASE_URL="http://127.0.0.1:30000"
export ANTHROPIC_AUTH_TOKEN="dummy"
export API_TIMEOUT_MS="3000000"
export CLAUDE_CODE_AUTO_COMPACT_WINDOW="1000000"
export CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1
export CLAUDE_CODE_ATTRIBUTION_HEADER=0
export ANTHROPIC_DEFAULT_HAIKU_MODEL="glm-5.2[1m]"
export ANTHROPIC_DEFAULT_SONNET_MODEL="glm-5.2[1m]"
export ANTHROPIC_DEFAULT_OPUS_MODEL="glm-5.2[1m]"
claude
Two of these matter specifically for GLM-5.2:
  • CLAUDE_CODE_ATTRIBUTION_HEADER=0 — Claude Code prepends a per-request attribution block to the system prompt. GLM-5.2’s chat template renders tools before system, so that per-request hash is the first token to diverge between turns and the radix prefix cache re-prefills the whole system + history every turn. This env removes the block and restores prefix-cache reuse.
  • glm-5.2[1m] as the model name — the [1m] suffix is the client-side hint that enables Claude Code’s 1M-context beta, matching GLM-5.2’s 1,048,576-token window. Without it, context is capped well below 1M. SGLang does not validate the model field, so any name is accepted server-side.
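The prefix-cache effect of the attribution block can be seen in a toy model. The sketch below is not the real chat template or tokenizer: render, the token lists, and the hash strings are all hypothetical, and the only property it borrows from the description above is the ordering (tools first, then system, then history).

```python
def shared_prefix_len(a, b):
    """Count leading tokens two prompts have in common (what a prefix cache can reuse)."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count


def render(tools, system, history, attribution=None):
    """Toy prompt layout: tools, then system (optionally led by an attribution block), then history."""
    system_part = ([attribution] if attribution else []) + system
    return tools + system_part + history


tools = ["<tools>", "get_weather", "</tools>"]
system = ["<system>", "You", "are", "helpful", "</system>"]
turn1 = ["user:hi", "assistant:hello"]
turn2 = turn1 + ["user:weather?"]

# With a per-request hash, consecutive turns diverge right after the tools.
with_hash = shared_prefix_len(
    render(tools, system, turn1, attribution="hash-aaa"),
    render(tools, system, turn2, attribution="hash-bbb"),
)
# Without it, the whole first turn is a prefix of the second.
without_hash = shared_prefix_len(
    render(tools, system, turn1),
    render(tools, system, turn2),
)
print(with_hash, without_hash)  # 3 10
```

In the toy run only the three tool tokens are reusable when the hash changes per request, while all ten tokens of the first turn are reusable once the block is removed. That is the reuse CLAUDE_CODE_ATTRIBUTION_HEADER=0 is meant to restore.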
For the full setup (streaming, tool-use, count_tokens, persisting env in ~/.claude/settings.json, troubleshooting), see Anthropic-Compatible API.