1. Model Introduction
Z-Image is an efficient image generation model family with 6B parameters, developed by Tongyi-MAI. It adopts a Scalable Single-Stream DiT (S3-DiT) architecture in which text tokens, visual semantic tokens, and image VAE tokens are concatenated at the sequence level to serve as a unified input stream, maximizing parameter efficiency compared to dual-stream approaches.
Z-Image-Turbo is a distilled version of Z-Image that matches or exceeds leading competitors with only 8 NFEs (Number of Function Evaluations). It is powered by two core techniques: Decoupled-DMD (few-step distillation) and DMDR (fusing DMD with Reinforcement Learning).
Key Features:
- Sub-second Inference Latency: Achieves sub-second inference on enterprise-grade H800 GPUs and fits comfortably on consumer devices with 16GB of VRAM
- Photorealistic Image Generation: Excels in high-quality photorealistic image generation with rich aesthetics
- Bilingual Text Rendering: Supports accurate bilingual text rendering in both English and Chinese
- Robust Instruction Adherence: Strong prompt following and instruction adherence capabilities
- #1 Open-Source Model: Ranked 8th overall and #1 among open-source models on the Artificial Analysis Text-to-Image Leaderboard
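The single-stream idea described above can be illustrated with a toy sketch. It is not the model's implementation: the token counts, the embedding width, and the helper names `make_tokens` and `build_single_stream` are all hypothetical, and plain Python lists stand in for real tensors. It only shows what "concatenated at the sequence level" means for shapes.

```python
from typing import List

Token = List[float]  # one embedding vector of width d_model


def make_tokens(count: int, d_model: int, fill: float) -> List[Token]:
    """Create `count` placeholder embeddings (stand-ins for real encoder outputs)."""
    return [[fill] * d_model for _ in range(count)]


def build_single_stream(text: List[Token], semantic: List[Token], vae: List[Token]) -> List[Token]:
    """Concatenate the three modalities along the sequence axis into one stream."""
    widths = {len(tok) for seq in (text, semantic, vae) for tok in seq}
    if len(widths) > 1:
        raise ValueError("all tokens must share the same embedding width")
    return text + semantic + vae


# Hypothetical sizes: 4 text tokens, 2 visual semantic tokens, 16 image VAE tokens.
text_tokens = make_tokens(4, 8, 0.1)
semantic_tokens = make_tokens(2, 8, 0.2)
vae_tokens = make_tokens(16, 8, 0.3)

stream = build_single_stream(text_tokens, semantic_tokens, vae_tokens)
print(len(stream), len(stream[0]))  # 22 8
```

Because every modality ends up in the same sequence, one set of transformer blocks processes all of them, which is where the parameter-efficiency argument against dual-stream designs comes from.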
2. SGLang-diffusion Installation
SGLang-diffusion offers multiple installation methods; choose the one that best suits your hardware platform and requirements. Please refer to the official SGLang-diffusion installation guide for installation instructions.
3. Model Deployment
This section provides deployment configurations optimized for different hardware platforms and use cases.
3.1 Basic Configuration
Z-Image-Turbo is optimized for high-quality image generation with only 8 inference steps. The recommended launch configurations vary by hardware.
3.2 Configuration Tips
All currently supported optimization options are listed here.
- --vae-path: Path to a custom VAE model or HuggingFace model ID. If not specified, the VAE is loaded from the main model path.
- --num-gpus: Number of GPUs to use.
- --tp-size: Tensor parallelism size (applies only to the encoder; should not be larger than 1 if text encoder offload is enabled, as layer-wise offload plus prefetch is faster).
- --sp-degree: Sequence parallelism size (typically should match the number of GPUs).
- --ulysses-degree: The degree of DeepSpeed-Ulysses-style SP in USP.
- --ring-degree: The degree of ring attention-style SP in USP.
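As a rough illustration of how these options combine, the sketch below assembles a launch command as an argument list. The flag names come from the list above; the `sglang serve` entry point, the `--model-path` flag, and the helper name `build_launch_args` are assumptions, so check them against the installation guide for your version. The check that the Ulysses and ring degrees multiply to the SP degree reflects the usual USP convention and is also an assumption here.

```python
from typing import List, Optional


def build_launch_args(
    model_path: str,
    num_gpus: int = 1,
    tp_size: Optional[int] = None,
    sp_degree: Optional[int] = None,
    ulysses_degree: Optional[int] = None,
    ring_degree: Optional[int] = None,
    vae_path: Optional[str] = None,
) -> List[str]:
    """Assemble a launch command; only options that were set are emitted."""
    # Assumed USP convention: ulysses_degree * ring_degree == sp_degree.
    if None not in (sp_degree, ulysses_degree, ring_degree):
        if ulysses_degree * ring_degree != sp_degree:
            raise ValueError("ulysses_degree * ring_degree must equal sp_degree")

    # "sglang serve" and "--model-path" are assumptions, not taken from this page.
    args = ["sglang", "serve", "--model-path", model_path, "--num-gpus", str(num_gpus)]
    optional = {
        "--tp-size": tp_size,
        "--sp-degree": sp_degree,
        "--ulysses-degree": ulysses_degree,
        "--ring-degree": ring_degree,
        "--vae-path": vae_path,
    }
    for flag, value in optional.items():
        if value is not None:
            args += [flag, str(value)]
    return args


cmd = build_launch_args(
    "Tongyi-MAI/Z-Image-Turbo", num_gpus=2, sp_degree=2, ulysses_degree=2, ring_degree=1
)
print(" ".join(cmd))
```

Returning a list rather than a single string makes it straightforward to pass the result to `subprocess.run` without shell quoting issues.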
4. API Usage
For complete API documentation, please refer to the official API usage guide.
4.1 Generate an Image
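A minimal client sketch follows, using only the Python standard library. It assumes the server exposes an OpenAI-compatible image generation endpoint at `/v1/images/generations` on `http://localhost:30000` and returns base64-encoded image data; the endpoint path, port, payload fields, and the helper names `build_image_request` and `generate_image` are assumptions to verify against the API usage guide.

```python
import base64
import json
import urllib.request


def build_image_request(
    prompt: str,
    base_url: str = "http://localhost:30000",  # assumed server address
    size: str = "1024x1024",
) -> urllib.request.Request:
    """Build (but do not send) a POST request for one generated image."""
    payload = {
        "model": "Tongyi-MAI/Z-Image-Turbo",
        "prompt": prompt,
        "n": 1,
        "size": size,
        "response_format": "b64_json",  # assumed to be supported by the server
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/images/generations",  # assumed endpoint path
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate_image(prompt: str, out_path: str = "output.png") -> str:
    """Send the request to a running server and write the decoded image to disk."""
    request = build_image_request(prompt)
    with urllib.request.urlopen(request, timeout=300) as response:
        body = json.load(response)
    image_bytes = base64.b64decode(body["data"][0]["b64_json"])
    with open(out_path, "wb") as handle:
        handle.write(image_bytes)
    return out_path
```

With a server running, calling `generate_image("A cat holding a sign that says hello")` would save the result as `output.png`. Building the request separately from sending it makes the payload easy to inspect before any network call.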
4.2 Advanced Usage
4.2.1 Cache-DiT Acceleration
SGLang integrates Cache-DiT, a caching acceleration engine for Diffusion Transformers (DiT), to achieve up to 7.4x inference speedup with minimal quality loss. You can set SGLANG_CACHE_DIT_ENABLED=True to enable it. For more details, please refer to the SGLang Cache-DiT documentation.
Basic Usage
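One way to switch Cache-DiT on from a Python launcher is to set the variable in a copy of the environment and hand that copy to the server process. This is a small sketch: the helper name `cache_dit_env` is hypothetical, and only the `SGLANG_CACHE_DIT_ENABLED=True` setting comes from the text above.

```python
import os
from typing import Dict, Mapping, Optional


def cache_dit_env(base: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with Cache-DiT switched on."""
    env = dict(os.environ if base is None else base)
    env["SGLANG_CACHE_DIT_ENABLED"] = "True"
    return env


env = cache_dit_env({"PATH": "/usr/bin"})
print(env["SGLANG_CACHE_DIT_ENABLED"])  # True
```

The returned mapping can be passed as `env=` to `subprocess.run` or `subprocess.Popen` when starting the server, which leaves the parent process environment unchanged.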
- DBCache Parameters: DBCache controls block-level caching behavior:
| Parameter | Env Variable | Default | Description |
|---|---|---|---|
| Fn | SGLANG_CACHE_DIT_FN | 1 | Number of first blocks to always compute |
| Bn | SGLANG_CACHE_DIT_BN | 0 | Number of last blocks to always compute |
| W | SGLANG_CACHE_DIT_WARMUP | 4 | Warmup steps before caching starts |
| R | SGLANG_CACHE_DIT_RDT | 0.24 | Residual difference threshold |
| MC | SGLANG_CACHE_DIT_MC | 3 | Maximum continuous cached steps |
- TaylorSeer Configuration: TaylorSeer improves caching accuracy using Taylor expansion:
| Parameter | Env Variable | Default | Description |
|---|---|---|---|
| Enable | SGLANG_CACHE_DIT_TAYLORSEER | false | Enable TaylorSeer calibrator |
| Order | SGLANG_CACHE_DIT_TS_ORDER | 1 | Taylor expansion order (1 or 2) |
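The two tables can be captured in a small configuration object that renders the corresponding environment variables. This is a sketch under assumptions: the class name `CacheDitTuning` is hypothetical, the defaults mirror the tables above, and the lowercase `true`/`false` spelling for the TaylorSeer switch is assumed from the default shown in the table.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class CacheDitTuning:
    fn: int = 1                # first blocks to always compute
    bn: int = 0                # last blocks to always compute
    warmup: int = 4            # warmup steps before caching starts
    rdt: float = 0.24          # residual difference threshold
    mc: int = 3                # maximum continuous cached steps
    taylorseer: bool = False   # enable the TaylorSeer calibrator
    ts_order: int = 1          # Taylor expansion order (1 or 2)

    def to_env(self) -> Dict[str, str]:
        """Validate the settings and render them as environment variables."""
        if self.ts_order not in (1, 2):
            raise ValueError("Taylor expansion order must be 1 or 2")
        if min(self.fn, self.bn, self.warmup, self.mc) < 0 or self.rdt < 0:
            raise ValueError("Cache-DiT parameters must be non-negative")
        return {
            "SGLANG_CACHE_DIT_FN": str(self.fn),
            "SGLANG_CACHE_DIT_BN": str(self.bn),
            "SGLANG_CACHE_DIT_WARMUP": str(self.warmup),
            "SGLANG_CACHE_DIT_RDT": str(self.rdt),
            "SGLANG_CACHE_DIT_MC": str(self.mc),
            # Assumed spelling; the table lists the default as "false".
            "SGLANG_CACHE_DIT_TAYLORSEER": "true" if self.taylorseer else "false",
            "SGLANG_CACHE_DIT_TS_ORDER": str(self.ts_order),
        }


tuning = CacheDitTuning(taylorseer=True, ts_order=2)
for name, value in tuning.to_env().items():
    print(f"{name}={value}")
```

Keeping the defaults in one place makes it easier to see which settings a given run actually changed.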
4.2.2 CPU Offload
- --dit-cpu-offload: Use CPU offload for DiT inference. Enable this if you run out of memory.
- --text-encoder-cpu-offload: Use CPU offload for text encoder inference.
- --vae-cpu-offload: Use CPU offload for the VAE.
- --pin-cpu-memory: Pin memory for CPU offload. Use this only as a temporary workaround if you encounter “CUDA error: invalid argument”.
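The offload options can be toggled from a launcher script with a helper like the one below. The helper name `offload_flags` is hypothetical, and the sketch assumes the flags act as bare switches that take no value; if your version expects an explicit value, adjust accordingly.

```python
from typing import List


def offload_flags(
    dit: bool = False,
    text_encoder: bool = False,
    vae: bool = False,
    pin_memory: bool = False,
) -> List[str]:
    """Translate boolean choices into CLI flags (assumed to be bare switches)."""
    flags: List[str] = []
    if dit:
        flags.append("--dit-cpu-offload")
    if text_encoder:
        flags.append("--text-encoder-cpu-offload")
    if vae:
        flags.append("--vae-cpu-offload")
    if pin_memory:
        flags.append("--pin-cpu-memory")
    return flags


# Example: offload the DiT and the text encoder on a memory-constrained GPU.
print(offload_flags(dit=True, text_encoder=True))
```

The resulting list can be appended to whatever launch command you already use.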
5. Benchmark
5.1 Speedup Benchmark
5.1.1 Generate an image
Test Environment:
- Hardware: NVIDIA B300 SXM6 AC (1x)
- Model: Tongyi-MAI/Z-Image-Turbo
- sglang version: 0.0.0.dev1+gf4417475b
- git revision: f441747
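To reproduce a latency measurement on your own hardware, a generic timing harness such as the following can wrap whatever call generates an image. It is only a sketch: `measure_latency` is a hypothetical helper, the warmup and run counts are arbitrary, and the stand-in workload in the example is a placeholder for a real generation request.

```python
import statistics
import time
from typing import Callable, Dict, List


def measure_latency(fn: Callable[[], object], warmup: int = 1, runs: int = 5) -> Dict[str, float]:
    """Time `fn` over several runs after discarding warmup calls."""
    for _ in range(warmup):
        fn()  # warmup calls are not timed
    samples: List[float] = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return {
        "mean_s": statistics.mean(samples),
        "min_s": min(samples),
        "max_s": max(samples),
    }


# Placeholder workload; replace the lambda with a real image generation call.
stats = measure_latency(lambda: sum(range(10_000)), warmup=1, runs=3)
print({key: round(value, 6) for key, value in stats.items()})
```

Discarding the first call matters for diffusion serving, since the first request often includes one-time costs such as compilation or cache population.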
