Deployment
Install SGLang
Install SGLang
For all methods and hardware platforms, see the official SGLang installation guide. The two paths below match the Python / Docker toggle in the command panel.Then run the Python output of the command panel below in that environment.
- Python (pip / uv)
- Docker
Command
- Low-Latency — fastest reply for a single user. Pick for chat.
- Balanced — good speed with several users at once. Use for typical multi-user serving.
- High-Throughput — most tokens per second across many users. Best for batch jobs.
Speed numbers are measured with
--random-range-ratio 1.0, --flush-cache, on main @ 09ca4fc. Spec cells pin the EAGLE acceptance length via the serve env SGLANG_SIMULATE_ACC_LEN (low-latency 5-1-6 = 3.5, balanced 1-1-2 = 2); high-throughput has no spec.NVFP4 (B300 / GB300) deploys on the dev image
lmsysorg/sglang:dev-glm52-nvfp4 — the command panel’s Docker toggle selects it automatically. The FP8 / BF16 recipes use the release lmsysorg/sglang:latest (the docker pull in Install above).Playground
The Playground is where you experiment with SGLang features beyond the verified matrix. The Deploy panel above only emits combinations the SGLang team has signed off on; the Playground lets you turn on additional knobs on top of whichever cell the Deploy panel is currently showing.1. Model Introduction
GLM-5.2 is Z.ai’s flagship Mixture-of-Experts model built on DeepSeek Sparse Attention (DSA): a lightning indexer selects a sparse set of key tokens per query (top-2048), so attention cost stays near-constant as context grows. It ships in two precisions — FP8 (zai-org/GLM-5.2-FP8) and full BF16 (zai-org/GLM-5.2) — both with 78 transformer layers, 256 routed experts (8 active per token), a 1M-token context window, and a single MTP (Multi-Token Prediction) layer for built-in EAGLE-style speculative decoding. FP8 is the recommended deployment; BF16 (~1.5 TB) needs an 8×B300 node or a multi-node setup. For Blackwell, NVIDIA also publishes an NVFP4 build (nvidia/GLM-5.2-NVFP4) that quantizes only the MoE experts’ linear weights and activations to 4-bit (the shared expert stays unquantized), holding accuracy within ~1 point of the FP8 baseline on GPQA Diamond, SciCode, and IFBench.
| Model | Architecture | Context |
|---|---|---|
| GLM-5.2-FP8 | MoE · DSA · 256 experts (top-8) · MTP · FP8 | 1,048,576 |
| GLM-5.2 | MoE · DSA · 256 experts (top-8) · MTP · BF16 | 1,048,576 |
| GLM-5.2-NVFP4 | MoE · DSA · 256 experts (top-8) · MTP · NVFP4 | 1,048,576 |
temperature=1.0, top_p=0.95 (the checkpoint’s generation_config.json defaults; informational — do not hardcode in client code).
Resources: GLM-5.2-FP8 · GLM-5.2 (BF16) · GLM-5.2-NVFP4.
2. Configuration Tips
- DeepSeek Sparse Attention (DSA). GLM-5.2 uses the
glm_moe_dsaarchitecture; SGLang auto-selects the DSA attention backends (flashmla_sparseprefill,fa3decode,sgl-kernelindexer topk). No attention-backend flag is needed on the supported hardware. SGLang also auto-selects the KV-cache dtype for DSA models —fp8_e4m3on Blackwell (B200/GB300/B300, which then routes DSA through the TensorRT-LLM backend) andbf16on Hopper (H200) — so no--kv-cache-dtypeflag is required. - MTP / speculative decoding. The checkpoint ships one nextn layer. Enable EAGLE MTP for lower latency (
--speculative-algorithm EAGLE --speculative-num-steps 5 --speculative-eagle-topk 1 --speculative-num-draft-tokens 6for low-latency;1-1-2for balanced). The config’sindex_share_for_mtp_iterationreuses the DSA indexer’s topk across draft steps (effective only at--speculative-eagle-topk 1). Tune the draft length to the accept length. GLM-5.2’s MTP head is strong — accept length runs high (4+ in many workloads, near-saturating at 5–6 in low-latency runs). Watch the server’s reported accept length and adjust--speculative-num-steps/--speculative-num-draft-tokensaccordingly: while accept length stays close to the draft-token count there is headroom to push them higher (more accepted tokens per step); if it falls well below, lower them — every rejected draft token is wasted verification compute. - Context Parallelism (CP) for long prefill. DSA prefill CP splits the long-prefill attention across
--attn-cp-sizeranks. On Hopper (H200) this gives a large prefill-latency win at long context — e.g. round-robin CP (--tp 8 --attn-cp-size 8 --enable-dsa-prefill-context-parallel --dsa-prefill-cp-mode round-robin-split) cut 64K-token prefill TTFT roughly 2.5–2.8× vs. plain TP8 in our testing. Trade-offs: CP partitions the KV pool (lower max context at the same--mem-fraction-static) and adds some decode-side overhead, so it pays off only for long sequences. CP is currently verified on Hopper only — the Blackwell (sm100) DSA-CP FP8 rope kernel is not yet adapted, so leave CP off on B200/B300/GB300. - Memory. The FP8 weights are large (MoE total, not active params). Start around
--mem-fraction-static 0.8on H200 (TP8) and tune up; raise it for the 4-GPU GB300 single-node layout (TP4). - DP-Attention + DeepEP for the balanced/high-throughput strategies spreads attention across data-parallel ranks and routes MoE through DeepEP.
- BF16 weights need more GPUs. The full-precision build (
zai-org/GLM-5.2, ~1.5 TB) does not fit a single 8×H200 / 8×B200 / 4×GB300 node. It fits single-node on 8×B300 (TP8, ~2.1 TB HBM) — verified; on the smaller GPUs it needs a multi-node layout (e.g. 2×8×H200 or 2×8×B200 at TP16, 2×4×GB300 at TP8), and those multi-node BF16 recipes are still proposed/inferred (verified: false). FP8 is the recommended deployment. Use the same DSA / MTP / chunked-prefill guidance as FP8. On B300, BF16 low-latency matches FP8 (the sm103 FP8 path is not yet optimized), but FP8 wins at the balanced/high-throughput points. - Chunked-prefill size is regime-dependent. At long input (8K+) the default
--chunked-prefill-size 2048is too small and leaves the balanced point prefill-bound (queueing dominates TTFT). Raising it to--chunked-prefill-size 32768on the balanced recipe gave roughly +34–78% output throughput and −39–59% TTFT on 8×H200 and 8×B200 (8K-in / 1K-out) in our testing. It is neutral for high-throughput (decode-bound there) — keep the default.--max-running-requeststracks KV capacity, not a tuning free-for-all: ~60–90 concurrent 8K+1K FP8 requests fit on a single 8-GPU node, so pin balanced near--max-running-requests 80and let high-throughput run wider.
3. Advanced Usage
3.1 Reasoning
GLM-5.2 is a hybrid-reasoning model. Enable theglm45 reasoning parser (toggle Reasoning Parser in the Parsers card of the Playground above) to separate thinking from the final answer — thinking lands in message.reasoning_content, the answer in message.content. Thinking is on by default; turn it off with chat_template_kwargs: {"enable_thinking": False} (the template variable is enable_thinking, not thinking).
Reasoning effort. Pass chat_template_kwargs: {"reasoning_effort": ...} to inject a Reasoning Effort: <level> system line (only while thinking is on). The template wires only two effective levels — Max and High — and if you don’t pass reasoning_effort at all you get Max, the highest. "high" is the only value that lowers effort; every other value (including "low" and "medium") falls through to Max:
reasoning_effort | Injected system line | Effect |
|---|---|---|
| (not passed / unset) | Reasoning Effort: Max | default — highest reasoning |
"high" | Reasoning Effort: High | dials reasoning down |
"low", "medium", any other value | Reasoning Effort: Max | falls through to Max (not a distinct level) |
Reasoning Example (Python)
Reasoning Example (Python)
Example
Example Output
Example Output
Output
3.2 Tool Calling
Enable theglm47 tool-call parser (toggle Tool Call Parser in the Parsers card of the Playground above) to surface structured tool calls via message.tool_calls. GLM-5.2 emits the newer <tool_call>…<arg_key>…<arg_value>… format, so it needs the glm47 parser — the older glm45 parser does not parse it (the call would be left as raw text in content). On thinking mode the turn also fills reasoning_content, so print both fields.
Tool Calling Example (Python)
Tool Calling Example (Python)
Example
Example Output
Example Output
Output
3.3 HiCache (Hierarchical KV Caching)
For long-context, prefix-heavy workloads, enable hierarchical KV caching to spill cold KV blocks to host memory (toggle the Hierarchical KV Cache card in the Playground above). Useful given GLM-5.2’s 1M-token window; pair--hicache-ratio with a write policy that matches your reuse pattern.
3.4 Claude Code Integration
GLM-5.2’s strong reasoning + tool-calling makes it a good backend for Claude Code, Anthropic’s agentic CLI. SGLang exposes the Anthropic-compatible/v1/messages endpoint on every server, so Claude Code can talk to a GLM-5.2 server with only environment variables — no code change. Launch the server with --reasoning-parser glm45 --tool-call-parser glm47 (any recipe from the Deployment panel above works), then:
Command
CLAUDE_CODE_ATTRIBUTION_HEADER=0— Claude Code prepends a per-request attribution block to the system prompt. GLM-5.2’s chat template renderstoolsbeforesystem, so that per-request hash is the first token to diverge between turns and the radix prefix cache re-prefills the whole system + history every turn. This env removes the block and restores prefix-cache reuse.glm-5.2[1m]as the model name — the[1m]suffix is the client-side hint that enables Claude Code’s 1M-context beta, matching GLM-5.2’s 1,048,576-token window. Without it, context is capped well below 1M. SGLang does not validate themodelfield, so any name is accepted server-side.
~/.claude/settings.json, troubleshooting), see Anthropic-Compatible API.