Z-Image-Turbo - SGLang Documentation

1. Model Introduction

Z-Image is a powerful and highly efficient image generation model family with 6B parameters, developed by Tongyi-MAI. It adopts a Scalable Single-Stream DiT (S3-DiT) architecture, where text, visual semantic tokens, and image VAE tokens are concatenated at the sequence level to serve as a unified input stream, maximizing parameter efficiency compared to dual-stream approaches. Z-Image-Turbo is a distilled version of Z-Image that matches or exceeds leading competitors with only 8 NFEs (Number of Function Evaluations). It is powered by two core techniques: Decoupled-DMD (few-step distillation) and DMDR (fusing DMD with Reinforcement Learning).

Key Features:

Sub-second Inference Latency: Achieves sub-second inference on enterprise-grade H800 GPUs and fits comfortably within 16GB VRAM consumer devices
Photorealistic Image Generation: Excels in high-quality photorealistic image generation with rich aesthetics
Bilingual Text Rendering: Supports accurate bilingual text rendering in both English and Chinese
Robust Instruction Adherence: Strong prompt following and instruction adherence capabilities
#1 Open-Source Model: Ranked 8th overall and #1 among open-source models on the Artificial Analysis Text-to-Image Leaderboard

For more details, please refer to the Z-Image-Turbo HuggingFace page, the GitHub repository, and the technical report (arXiv).

2. SGLang-diffusion Installation

SGLang-diffusion offers multiple installation methods; choose the one that best fits your hardware platform and requirements. See the official SGLang-diffusion installation guide for instructions.

3. Model Deployment

This section provides deployment configurations optimized for different hardware platforms and use cases.

3.1 Basic Configuration

Z-Image-Turbo is optimized for high-quality image generation with only 8 inference steps. The recommended launch configuration varies by hardware. A minimal single-GPU launch, as used in the benchmark section, is:

sglang serve --model-path Tongyi-MAI/Z-Image-Turbo --port 30000

3.2 Configuration Tips

The currently supported optimization options are listed below.

--vae-path: Path to a custom VAE model or HuggingFace model ID. If not specified, the VAE will be loaded from the main model path.
--num-gpus: Number of GPUs to use
--tp-size: Tensor parallelism size (only for the encoder; should not be larger than 1 if text encoder offload is enabled, as layer-wise offload plus prefetch is faster)
--sp-degree: Sequence parallelism size (typically should match the number of GPUs)
--ulysses-degree: The degree of DeepSpeed-Ulysses-style SP in USP
--ring-degree: The degree of ring attention-style SP in USP
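
The flags above can be assembled programmatically. The following is a minimal sketch, not part of SGLang: `build_serve_command` is a hypothetical helper, and it assumes `--text-encoder-cpu-offload` is a value-less switch. The check on `--tp-size` mirrors the note above about text encoder offload.

```python
import shlex


def build_serve_command(model_path, num_gpus=1, tp_size=1, sp_degree=None,
                        ulysses_degree=None, ring_degree=None, vae_path=None,
                        text_encoder_cpu_offload=False):
    """Return an argv list for `sglang serve` (hypothetical helper)."""
    # Per the tip above: tensor parallelism should not exceed 1
    # when text encoder offload is enabled.
    if text_encoder_cpu_offload and tp_size > 1:
        raise ValueError(
            "--tp-size should not be larger than 1 with text encoder offload"
        )
    args = ["sglang", "serve", "--model-path", model_path,
            "--num-gpus", str(num_gpus), "--tp-size", str(tp_size)]
    optional = {
        "--sp-degree": sp_degree,
        "--ulysses-degree": ulysses_degree,
        "--ring-degree": ring_degree,
        "--vae-path": vae_path,
    }
    # Only emit the flags that were explicitly set.
    for flag, value in optional.items():
        if value is not None:
            args += [flag, str(value)]
    if text_encoder_cpu_offload:
        # Assumption: this flag takes no value.
        args.append("--text-encoder-cpu-offload")
    return args


command = build_serve_command("Tongyi-MAI/Z-Image-Turbo", num_gpus=2, sp_degree=2)
print(shlex.join(command))
```

Returning a list rather than a string keeps the result directly usable with `subprocess.run` without shell quoting concerns.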

4. API Usage

For complete API documentation, please refer to the official API usage guide.

4.1 Generate an Image

import base64
from openai import OpenAI

# base_url must point at the running SGLang server; adjust the port to match `sglang serve --port`
client = OpenAI(api_key="EMPTY", base_url="http://localhost:3000/v1")

response = client.images.generate(
    model="Tongyi-MAI/Z-Image-Turbo",
    prompt="Young Chinese woman in red Hanfu, intricate embroidery. Impeccable makeup, red floral forehead pattern. Elaborate high bun, golden phoenix headdress, red flowers, beads. Holds round folding fan with lady, trees, bird. Soft-lit outdoor night background, silhouetted tiered pagoda, blurred colorful distant lights.",
    size="1024x1024",
    n=1,
    response_format="b64_json",
)

# Save the generated image
image_bytes = base64.b64decode(response.data[0].b64_json)
with open("output.png", "wb") as f:
    f.write(image_bytes)
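
If a response carries more than one image, the decode-and-save step can be factored into a helper. This is a minimal sketch: `save_b64_images` is a hypothetical name, the `.png` extension follows the example above, and whether the server honors `n` greater than 1 is an assumption that has not been verified here.

```python
import base64
from pathlib import Path


def save_b64_images(b64_items, out_dir, stem="output"):
    """Decode base64 payloads and write one file per item (hypothetical helper)."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, item in enumerate(b64_items):
        # validate=True raises binascii.Error on non-base64 input,
        # e.g. if an error message was returned instead of image data.
        data = base64.b64decode(item, validate=True)
        path = out_dir / f"{stem}_{index}.png"
        path.write_bytes(data)
        paths.append(path)
    return paths


# Usage with the response object from the example above:
# save_b64_images([d.b64_json for d in response.data], "outputs")
```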

4.2 Advanced Usage

4.2.1 Cache-DiT Acceleration

SGLang integrates Cache-DiT, a caching acceleration engine for Diffusion Transformers (DiT), to achieve up to 7.4x inference speedup with minimal quality loss. Set the environment variable SGLANG_CACHE_DIT_ENABLED=true to enable it. For more details, please refer to the SGLang Cache-DiT documentation.

Basic Usage

SGLANG_CACHE_DIT_ENABLED=true sglang serve --model-path Tongyi-MAI/Z-Image-Turbo

Advanced Usage

DBCache Parameters: DBCache controls block-level caching behavior:

Parameter	Env Variable	Default	Description
Fn	`SGLANG_CACHE_DIT_FN`	1	Number of first blocks to always compute
Bn	`SGLANG_CACHE_DIT_BN`	0	Number of last blocks to always compute
W	`SGLANG_CACHE_DIT_WARMUP`	4	Warmup steps before caching starts
R	`SGLANG_CACHE_DIT_RDT`	0.24	Residual difference threshold
MC	`SGLANG_CACHE_DIT_MC`	3	Maximum continuous cached steps
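
To make the interaction between W, R, and MC concrete, here is a simplified sketch; it is not Cache-DiT's actual implementation. It assumes each denoising step yields a single relative residual difference (measured after the first Fn blocks) and that a step reuses the cache only when warmup has finished, the difference is below R, and fewer than MC steps have been cached in a row. The name `plan_cached_steps` and the input values are hypothetical.

```python
def plan_cached_steps(rel_diffs, warmup=4, rdt=0.24, max_continuous=3):
    """Return one boolean per step: True where the step would reuse the cache."""
    decisions = []
    streak = 0  # consecutive cached steps so far
    for step, diff in enumerate(rel_diffs):
        can_cache = (
            step >= warmup               # W: always compute during warmup
            and diff < rdt               # R: residual changed little
            and streak < max_continuous  # MC: force a recompute periodically
        )
        decisions.append(can_cache)
        streak = streak + 1 if can_cache else 0
    return decisions


# Eight steps whose residual difference always stays below the threshold.
plan = plan_cached_steps([0.1] * 8)
print(plan)
```

With the default values, the first four steps are computed as warmup, the next three reuse the cache, and the eighth is recomputed because the MC limit has been reached.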

TaylorSeer Configuration: TaylorSeer improves caching accuracy using Taylor expansion:

Parameter	Env Variable	Default	Description
Enable	`SGLANG_CACHE_DIT_TAYLORSEER`	false	Enable TaylorSeer calibrator
Order	`SGLANG_CACHE_DIT_TS_ORDER`	1	Taylor expansion order (1 or 2)

Combined Configuration Example:

SGLANG_CACHE_DIT_ENABLED=true \
SGLANG_CACHE_DIT_FN=2 \
SGLANG_CACHE_DIT_BN=1 \
SGLANG_CACHE_DIT_WARMUP=4 \
SGLANG_CACHE_DIT_RDT=0.4 \
SGLANG_CACHE_DIT_MC=4 \
SGLANG_CACHE_DIT_TAYLORSEER=true \
SGLANG_CACHE_DIT_TS_ORDER=2 \
sglang serve --model-path Tongyi-MAI/Z-Image-Turbo
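
The same configuration can be prepared from Python before launching the server. The sketch below is illustrative only: `cache_dit_env` is a hypothetical helper, while the variable names and defaults are copied from the tables above.

```python
import os

# Defaults as listed in the DBCache and TaylorSeer tables.
CACHE_DIT_DEFAULTS = {
    "SGLANG_CACHE_DIT_FN": 1,
    "SGLANG_CACHE_DIT_BN": 0,
    "SGLANG_CACHE_DIT_WARMUP": 4,
    "SGLANG_CACHE_DIT_RDT": 0.24,
    "SGLANG_CACHE_DIT_MC": 3,
    "SGLANG_CACHE_DIT_TAYLORSEER": False,
    "SGLANG_CACHE_DIT_TS_ORDER": 1,
}


def cache_dit_env(overrides=None, base_env=None):
    """Build an environment mapping with Cache-DiT enabled (hypothetical helper)."""
    overrides = overrides or {}
    unknown = set(overrides) - set(CACHE_DIT_DEFAULTS)
    if unknown:
        raise KeyError(f"unknown Cache-DiT variables: {sorted(unknown)}")
    env = dict(os.environ if base_env is None else base_env)
    env["SGLANG_CACHE_DIT_ENABLED"] = "true"
    for name, value in {**CACHE_DIT_DEFAULTS, **overrides}.items():
        # Environment values must be strings; booleans become "true"/"false".
        env[name] = str(value).lower() if isinstance(value, bool) else str(value)
    return env


env = cache_dit_env({"SGLANG_CACHE_DIT_RDT": 0.4,
                     "SGLANG_CACHE_DIT_TAYLORSEER": True}, base_env={})
# To launch the server with these settings:
# subprocess.run(["sglang", "serve", "--model-path", "Tongyi-MAI/Z-Image-Turbo"], env=env)
```

Rejecting unknown variable names catches typos that would otherwise be silently ignored by the server.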

4.2.2 CPU Offload

--dit-cpu-offload: Use CPU offload for DiT inference. Enable this if you run out of GPU memory.
--text-encoder-cpu-offload: Use CPU offload for text encoder inference.
--vae-cpu-offload: Use CPU offload for VAE.
--pin-cpu-memory: Pin memory for CPU offload. Use this only as a temporary workaround if the server throws “CUDA error: invalid argument”.

5. Benchmark

5.1 Speedup Benchmark

5.1.1 Generate an image

Test Environment:

Hardware: NVIDIA B300 SXM6 AC (1x)
Model: Tongyi-MAI/Z-Image-Turbo
sglang version: 0.0.0.dev1+gf4417475b
git revision: f441747

Server Command:

sglang serve --model-path Tongyi-MAI/Z-Image-Turbo --port 30000

Benchmark Command:

python3 -m sglang.multimodal_gen.benchmarks.bench_serving \
    --backend sglang-image --dataset vbench --task text-to-image --num-prompts 1 --max-concurrency 1

Result:

================= Serving Benchmark Result =================
Task:                                    text-to-image
Model:                                   Tongyi-MAI/Z-Image-Turbo
Dataset:                                 vbench
--------------------------------------------------
Benchmark duration (s):                  13.59
Request rate:                            inf
Max request concurrency:                 1
Successful requests:                     1/1
--------------------------------------------------
Request throughput (req/s):              0.07
Latency Mean (s):                        13.5904
Latency Median (s):                      13.5904
Latency P99 (s):                         13.5904
--------------------------------------------------
Peak Memory Max (MB):                    16984.45
Peak Memory Mean (MB):                   16984.45
Peak Memory Median (MB):                 16984.45
============================================================

5.1.2 Generate images with high concurrency

Server Command:

sglang serve --model-path Tongyi-MAI/Z-Image-Turbo --port 30000

Benchmark Command:

python3 -m sglang.multimodal_gen.benchmarks.bench_serving \
    --backend sglang-image --dataset vbench --task text-to-image --num-prompts 20 --max-concurrency 20

Result:

================= Serving Benchmark Result =================
Task:                                    text-to-image
Model:                                   Tongyi-MAI/Z-Image-Turbo
Dataset:                                 vbench
--------------------------------------------------
Benchmark duration (s):                  30.15
Request rate:                            inf
Max request concurrency:                 20
Successful requests:                     20/20
--------------------------------------------------
Request throughput (req/s):              0.66
Latency Mean (s):                        21.2048
Latency Median (s):                      21.1990
Latency P99 (s):                         29.9739
--------------------------------------------------
Peak Memory Max (MB):                    16984.51
Peak Memory Mean (MB):                   16984.47
Peak Memory Median (MB):                 16984.46
============================================================
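
The reports above can be parsed for comparison across runs. This is a minimal sketch: `parse_benchmark_report` is a hypothetical helper that assumes the `Key: value` layout shown in the results, and the sample text is excerpted from the high-concurrency run.

```python
def parse_benchmark_report(text):
    """Parse 'Key: value' lines into a dict, converting numeric values to float."""
    metrics = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        value = value.strip()
        if not sep or not value:
            continue  # separator rows and the banner carry no metric
        try:
            metrics[key.strip()] = float(value)
        except ValueError:
            metrics[key.strip()] = value  # e.g. "20/20" or the model name
    return metrics


sample = """\
Benchmark duration (s):                  30.15
Max request concurrency:                 20
Successful requests:                     20/20
--------------------------------------------------
Request throughput (req/s):              0.66
Latency Mean (s):                        21.2048
"""

metrics = parse_benchmark_report(sample)
# Throughput should equal completed requests divided by benchmark duration.
completed = int(metrics["Successful requests"].split("/")[0])
recomputed = completed / metrics["Benchmark duration (s)"]
print(round(recomputed, 2), metrics["Request throughput (req/s)"])
```

Recomputing throughput from the request count and duration is a quick consistency check on a report before comparing it with other runs.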
