SGLang integrates Cache-DiT, a caching acceleration engine for Diffusion Transformers (DiT), to achieve up to 1.69x inference speedup with minimal quality loss.

Overview

Cache-DiT uses intelligent caching strategies to skip redundant computation in the denoising loop:
  • DBCache (Dual Block Cache): Dynamically decides when to cache transformer blocks based on residual differences
  • TaylorSeer: Uses Taylor expansion for calibration to optimize caching decisions
  • SCM (Step Computation Masking): Step-level caching control for additional speedup

Basic Usage

Enable Cache-DiT by setting the SGLANG_CACHE_DIT_ENABLED environment variable when running sglang generate or sglang serve:
SGLANG_CACHE_DIT_ENABLED=true \
sglang generate --model-path Qwen/Qwen-Image \
    --prompt "A beautiful sunset over the mountains"

Diffusers Backend

Cache-DiT can load acceleration configs from a custom YAML or JSON file. With the diffusers backend, pass the file path via --cache-dit-config. This flow relies on cache_dit.load_configs and requires cache-dit >= 1.2.0.

Single GPU inference

Define a cache.yaml file that contains:
cache_config:
  max_warmup_steps: 8
  warmup_interval: 2
  max_cached_steps: -1
  max_continuous_cached_steps: 2
  Fn_compute_blocks: 1
  Bn_compute_blocks: 0
  residual_diff_threshold: 0.12
  enable_taylorseer: true
  taylorseer_order: 1
Then apply the config with:
sglang generate \
  --backend diffusers \
  --model-path Qwen/Qwen-Image \
  --cache-dit-config cache.yaml \
  --prompt "A beautiful sunset over the mountains"

Distributed inference

  • 1D Parallelism
Define a parallelism-only config file, parallel.yaml, that contains:
parallelism_config:
  ulysses_size: auto
  parallel_kwargs:
    attention_backend: native
    extra_parallel_modules: ["text_encoder", "vae"]
Here ulysses_size: auto means that Cache-DiT uses the detected world_size as the ulysses_size. Otherwise, set it manually to a specific integer, e.g., 4. Apply the distributed config with the command below, adding --num-gpus N to specify the number of GPUs for distributed inference:
sglang generate \
  --backend diffusers \
  --num-gpus 4 \
  --model-path Qwen/Qwen-Image \
  --cache-dit-config parallel.yaml \
  --prompt "A futuristic cityscape at sunset"
  • 2D Parallelism
You can also define a 2D parallelism config file, parallel_2d.yaml, that contains:
parallelism_config:
  ulysses_size: auto
  tp_size: 2
  parallel_kwargs:
    attention_backend: native
    extra_parallel_modules: ["text_encoder", "vae"]
Then apply the 2D parallelism config from the YAML file. Here tp_size: 2 enables tensor parallelism with size 2, and ulysses_size: auto means that Cache-DiT uses world_size // tp_size as the ulysses_size.
  • 3D Parallelism
You can also define a 3D parallelism config file, parallel_3d.yaml, that contains:
parallelism_config:
  ulysses_size: 2
  ring_size: 2
  tp_size: 2
  parallel_kwargs:
    attention_backend: native
    extra_parallel_modules: ["text_encoder", "vae"]
Then apply the 3D parallelism config from the YAML file. Here ulysses_size: 2, ring_size: 2, and tp_size: 2 enable Ulysses parallelism, ring parallelism, and tensor parallelism, each with size 2.
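The sizing rules for the three layouts can be summarized in a short sketch. This is illustrative only: resolve_parallel_sizes is a hypothetical helper, not part of SGLang or Cache-DiT, and it assumes that the Ulysses, ring, and tensor parallel sizes must multiply to the number of GPUs, which generalizes the world_size // tp_size rule described above.

```python
def resolve_parallel_sizes(world_size, ulysses_size="auto", ring_size=1, tp_size=1):
    """Return (ulysses, ring, tp) sizes for a given GPU count.

    Hypothetical helper. Assumption: ulysses * ring * tp == world_size.
    """
    fixed = ring_size * tp_size
    if world_size % fixed != 0:
        raise ValueError(f"world_size={world_size} is not divisible by ring*tp={fixed}")
    if ulysses_size == "auto":
        # Mirrors the documented rule: auto resolves to world_size // tp_size
        # (extended here to also divide by ring_size).
        ulysses_size = world_size // fixed
    if ulysses_size * fixed != world_size:
        raise ValueError("ulysses * ring * tp must equal world_size")
    return ulysses_size, ring_size, tp_size


print(resolve_parallel_sizes(4))                                            # 1D: (4, 1, 1)
print(resolve_parallel_sizes(4, tp_size=2))                                 # 2D: (2, 1, 2)
print(resolve_parallel_sizes(8, ulysses_size=2, ring_size=2, tp_size=2))    # 3D: (2, 2, 2)
```

Under this assumption, the 3D example config above would need --num-gpus 8.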

Hybrid Cache and Parallelism

Define a hybrid cache and parallelism config file, hybrid.yaml, that contains:
cache_config:
  max_warmup_steps: 8
  warmup_interval: 2
  max_cached_steps: -1
  max_continuous_cached_steps: 2
  Fn_compute_blocks: 1
  Bn_compute_blocks: 0
  residual_diff_threshold: 0.12
  enable_taylorseer: true
  taylorseer_order: 1
parallelism_config:
  ulysses_size: auto
  parallel_kwargs:
    attention_backend: native
    extra_parallel_modules: ["text_encoder", "vae"]
Then apply the hybrid cache and parallelism config from the YAML file:
sglang generate \
  --backend diffusers \
  --num-gpus 4 \
  --model-path Qwen/Qwen-Image \
  --cache-dit-config hybrid.yaml \
  --prompt "A beautiful sunset over the mountains"

Advanced Configuration

DBCache Parameters

DBCache controls block-level caching behavior:
| Parameter | Env Variable | Default | Description |
| --- | --- | --- | --- |
| Fn | SGLANG_CACHE_DIT_FN | 1 | Number of first blocks to always compute |
| Bn | SGLANG_CACHE_DIT_BN | 0 | Number of last blocks to always compute |
| W | SGLANG_CACHE_DIT_WARMUP | 4 | Warmup steps before caching starts |
| R | SGLANG_CACHE_DIT_RDT | 0.24 | Residual difference threshold |
| MC | SGLANG_CACHE_DIT_MC | 3 | Maximum continuous cached steps |
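To make the interaction between W, R, and MC concrete, here is a minimal sketch of a threshold-based reuse decision. It is an illustration under stated assumptions, not Cache-DiT's implementation: the function names are hypothetical, the relative L1 difference is an assumed metric, plain Python lists stand in for tensors, and Fn/Bn are not modeled beyond assuming the difference is measured on the output of the always-computed first blocks.

```python
def relative_diff(prev, curr):
    """Relative L1 difference between two residuals (assumed metric)."""
    num = sum(abs(c - p) for p, c in zip(prev, curr))
    den = sum(abs(p) for p in prev)
    return num / den if den else float("inf")


def should_reuse_cache(step, rel_diff, continuous_cached,
                       warmup_steps=4, threshold=0.24, max_continuous=3):
    """Decide whether a denoising step may reuse cached block outputs."""
    if step < warmup_steps:                   # W: always compute during warmup
        return False
    if continuous_cached >= max_continuous:   # MC: force a refresh after a cached streak
        return False
    return rel_diff < threshold               # R: reuse only if the residual barely changed


def simulate(rel_diffs, **kwargs):
    """Walk through per-step residual differences and record each decision."""
    decisions, streak = [], 0
    for step, diff in enumerate(rel_diffs):
        reuse = should_reuse_cache(step, diff, streak, **kwargs)
        streak = streak + 1 if reuse else 0
        decisions.append("cache" if reuse else "compute")
    return decisions


print(relative_diff([2.0, 2.0], [2.5, 1.5]))  # 0.25
print(simulate([0.9, 0.5, 0.3, 0.2, 0.1, 0.1, 0.1, 0.1, 0.3, 0.1]))
# ['compute', 'compute', 'compute', 'compute', 'cache', 'cache', 'cache',
#  'compute', 'compute', 'cache']
```

In this toy run, steps 0 to 3 are computed because of warmup, step 7 is computed because three consecutive steps were already cached, and step 8 is computed because its difference exceeds the threshold.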

TaylorSeer Configuration

TaylorSeer improves caching accuracy using Taylor expansion:
| Parameter | Env Variable | Default | Description |
| --- | --- | --- | --- |
| Enable | SGLANG_CACHE_DIT_TAYLORSEER | false | Enable TaylorSeer calibrator |
| Order | SGLANG_CACHE_DIT_TS_ORDER | 1 | Taylor expansion order (1 or 2) |
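The idea behind the order parameter can be sketched with a finite-difference Taylor extrapolation. This is a simplified illustration, not TaylorSeer's actual code: taylor_predict is a hypothetical function, scalars stand in for feature tensors, and the history is assumed to hold values from equally spaced, fully computed steps.

```python
def taylor_predict(history, steps_ahead, order=1):
    """Extrapolate a feature value `steps_ahead` steps past the last computed one.

    history: values from the most recent computed steps, oldest first
             (assumed equally spaced by one step).
    order:   1 uses the first backward difference; 2 adds the second difference.
    """
    if order not in (1, 2):
        raise ValueError("order must be 1 or 2")
    if len(history) < order + 1:
        raise ValueError("need at least order + 1 computed values")
    latest = history[-1]
    first_diff = history[-1] - history[-2]          # approximates f'(t)
    prediction = latest + steps_ahead * first_diff
    if order == 2:
        second_diff = history[-1] - 2 * history[-2] + history[-3]  # approximates f''(t)
        prediction += 0.5 * steps_ahead ** 2 * second_diff
    return prediction


# A linearly changing feature is predicted exactly by order 1.
print(taylor_predict([1.0, 2.0, 3.0], steps_ahead=2))           # 5.0
# A curved feature (t**2 at t = 0, 1, 2) benefits from the second-order term.
print(taylor_predict([0.0, 1.0, 4.0], steps_ahead=1, order=1))  # 7.0
print(taylor_predict([0.0, 1.0, 4.0], steps_ahead=1, order=2))  # 8.0 (true value is 9.0)
```

A higher order tracks curvature better but needs more computed history before it can be used.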

Combined Configuration Example

DBCache and TaylorSeer are complementary strategies that work together; you can configure both sets of parameters simultaneously:
SGLANG_CACHE_DIT_ENABLED=true \
SGLANG_CACHE_DIT_FN=2 \
SGLANG_CACHE_DIT_BN=1 \
SGLANG_CACHE_DIT_WARMUP=4 \
SGLANG_CACHE_DIT_RDT=0.4 \
SGLANG_CACHE_DIT_MC=4 \
SGLANG_CACHE_DIT_TAYLORSEER=true \
SGLANG_CACHE_DIT_TS_ORDER=2 \
sglang generate --model-path black-forest-labs/FLUX.1-dev \
    --prompt "A curious raccoon in a forest"
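When launching runs from a script, the same settings can be assembled as an environment mapping. The sketch below is a convenience wrapper under stated assumptions: cache_dit_env is a hypothetical helper, and only the variable names and defaults listed in the tables above are taken from this page.

```python
import os


def cache_dit_env(fn=1, bn=0, warmup=4, rdt=0.24, mc=3, taylorseer=False, ts_order=1):
    """Build the Cache-DiT environment variables (hypothetical helper).

    Defaults follow the DBCache and TaylorSeer tables above.
    """
    return {
        "SGLANG_CACHE_DIT_ENABLED": "true",
        "SGLANG_CACHE_DIT_FN": str(fn),
        "SGLANG_CACHE_DIT_BN": str(bn),
        "SGLANG_CACHE_DIT_WARMUP": str(warmup),
        "SGLANG_CACHE_DIT_RDT": str(rdt),
        "SGLANG_CACHE_DIT_MC": str(mc),
        "SGLANG_CACHE_DIT_TAYLORSEER": "true" if taylorseer else "false",
        "SGLANG_CACHE_DIT_TS_ORDER": str(ts_order),
    }


# Same values as the shell example above, merged over the current environment.
env = {**os.environ, **cache_dit_env(fn=2, bn=1, rdt=0.4, mc=4, taylorseer=True, ts_order=2)}
print(env["SGLANG_CACHE_DIT_RDT"], env["SGLANG_CACHE_DIT_TAYLORSEER"])  # 0.4 true
```

The resulting mapping can be passed as the env argument of subprocess.run together with the sglang generate command line shown above.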

SCM (Step Computation Masking)

SCM provides step-level caching control for additional speedup. It decides which denoising steps to compute fully and which to use cached results.
SCM Presets
SCM is configured with presets:
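| Preset | Compute Ratio | Speed | Quality |
| --- | --- | --- | --- |
| none | 100% | Baseline | Best |
| slow | ~75% | ~1.3x | High |
| medium | ~50% | ~2x | Good |
| fast | ~35% | ~3x | Acceptable |
| ultra | ~25% | ~4x | Lower |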
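The Speed column is roughly the reciprocal of the compute ratio. The short sketch below reproduces that arithmetic; it assumes cached steps cost nothing and that the denoising loop dominates total runtime, so measured speedups will generally be lower. The name ideal_speedup is hypothetical.

```python
# Compute ratios taken from the preset table above.
PRESET_COMPUTE_RATIO = {"none": 1.00, "slow": 0.75, "medium": 0.50, "fast": 0.35, "ultra": 0.25}


def ideal_speedup(compute_ratio):
    """Upper-bound speedup if only `compute_ratio` of the steps are computed."""
    return 1.0 / compute_ratio


for preset, ratio in PRESET_COMPUTE_RATIO.items():
    print(f"{preset:>6}: {ideal_speedup(ratio):.2f}x")
#   none: 1.00x
#   slow: 1.33x
# medium: 2.00x
#   fast: 2.86x
#  ultra: 4.00x
```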
Usage
SGLANG_CACHE_DIT_ENABLED=true \
SGLANG_CACHE_DIT_SCM_PRESET=medium \
sglang generate --model-path Qwen/Qwen-Image \
    --prompt "A futuristic cityscape at sunset"
Custom SCM Bins
For fine-grained control over which steps to compute vs cache:
SGLANG_CACHE_DIT_ENABLED=true \
SGLANG_CACHE_DIT_SCM_COMPUTE_BINS="8,3,3,2,2" \
SGLANG_CACHE_DIT_SCM_CACHE_BINS="1,2,2,2,3" \
sglang generate --model-path Qwen/Qwen-Image \
    --prompt "A futuristic cityscape at sunset"
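One way to read the two bin lists is as alternating runs of computed and cached steps. The sketch below expands them into a per-step mask under that assumption; the interleaving order (compute first, then cache, pairwise) is an assumption not stated on this page, and parse_bins and steps_mask_from_bins are hypothetical helpers rather than SGLang or Cache-DiT functions.

```python
def parse_bins(value):
    """Parse a comma-separated bin string such as "8,3,3,2,2" into integers."""
    return [int(part) for part in value.split(",") if part.strip()]


def steps_mask_from_bins(compute_bins, cache_bins):
    """Expand bins into a per-step mask: 1 = compute the step, 0 = use the cache.

    Assumption: bins alternate pairwise, starting with a compute bin.
    """
    mask = []
    for n_compute, n_cache in zip(compute_bins, cache_bins):
        mask.extend([1] * n_compute)
        mask.extend([0] * n_cache)
    return mask


compute_bins = parse_bins("8,3,3,2,2")
cache_bins = parse_bins("1,2,2,2,3")
mask = steps_mask_from_bins(compute_bins, cache_bins)

print(len(mask), sum(mask))            # 28 18
print(f"{sum(mask) / len(mask):.2f}")  # 0.64
print(mask[:12])                       # [1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1]
```

Under this reading, the example bins describe a 28-step schedule in which 18 steps are computed, a compute ratio of about 64%.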
SCM Policy
| Policy | Env Variable | Description |
| --- | --- | --- |
| dynamic | SGLANG_CACHE_DIT_SCM_POLICY=dynamic | Adaptive caching based on content (default) |
| static | SGLANG_CACHE_DIT_SCM_POLICY=static | Fixed caching pattern |

Environment Variables

All Cache-DiT parameters can be configured via environment variables. See Environment variables for the complete list.

Supported Models

SGLang Diffusion x Cache-DiT supports almost all models originally supported in SGLang Diffusion:
| Model Family | Example Models |
| --- | --- |
| Wan | Wan2.1, Wan2.2 |
| Flux | FLUX.1-dev, FLUX.2-dev |
| Z-Image | Z-Image-Turbo |
| Qwen | Qwen-Image, Qwen-Image-Edit |
| Hunyuan | HunyuanVideo |

Performance Tips

  1. Start with defaults: The default parameters work well for most models
  2. Use TaylorSeer: It typically improves both speed and quality
  3. Tune R threshold: Lower values = better quality, higher values = faster
  4. SCM for extra speed: Use the medium preset for a good speed/quality balance
  5. Warmup matters: Higher warmup = more stable caching decisions

Limitations

  • SGLang-native pipelines: Distributed support (TP/SP) is not yet validated; Cache-DiT will be automatically disabled when world_size > 1.
  • SCM minimum steps: SCM requires >= 8 inference steps to be effective
  • Model support: Only models registered in Cache-DiT’s BlockAdapterRegister are supported

Troubleshooting

SCM disabled for low step count

For models with < 8 inference steps (e.g., DMD distilled models), SCM will be automatically disabled. DBCache acceleration still works.

References