Overview
Cache-DiT uses intelligent caching strategies to skip redundant computation in the denoising loop:
- DBCache (Dual Block Cache): Dynamically decides when to cache transformer blocks based on residual differences
- TaylorSeer: Uses Taylor expansion for calibration to optimize caching decisions
- SCM (Step Computation Masking): Step-level caching control for additional speedup
Basic Usage
Enable Cache-DiT by exporting the environment variable and using sglang generate or sglang serve:
Diffusers Backend
Cache-DiT supports loading acceleration configs from a custom YAML file. For diffusers pipelines (diffusers backend), pass the YAML/JSON path via --cache-dit-config. This flow requires cache-dit >= 1.2.0 (cache_dit.load_configs).
Single GPU inference
Define a cache.yaml file that contains:
Distributed inference
- 1D Parallelism
Define a parallel.yaml file that contains:
ulysses_size: auto means that cache-dit detects the world_size automatically and uses it as the ulysses_size. Otherwise, set it manually to a specific integer, e.g., 4.
Then apply the distributed config with: (Note: add --num-gpus N to specify the number of GPUs for distributed inference.)
- 2D Parallelism
Define a parallel_2d.yaml file that contains:
tp_size: 2 means using tensor parallelism with size 2. ulysses_size: auto means that cache-dit uses world_size // tp_size as the ulysses_size.
- 3D Parallelism
Define a parallel_3d.yaml file that contains:
ulysses_size: 2, ring_size: 2, tp_size: 2 means using Ulysses parallelism, ring parallelism, and tensor parallelism, each with size 2.
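The auto rule for ulysses_size described in the 1D and 2D cases can be sketched in a few lines of Python. This only illustrates the arithmetic and is not cache-dit's own code: `resolve_ulysses_size` is a hypothetical helper, and the divisibility check is an assumption added for safety.

```python
def resolve_ulysses_size(world_size, ulysses_size="auto", tp_size=1):
    """Hypothetical helper mirroring the documented 'auto' rule.

    With tp_size == 1 (1D), auto resolves to world_size.
    With tensor parallelism (2D), auto resolves to world_size // tp_size.
    """
    if ulysses_size != "auto":
        # An explicit integer is used as given.
        return int(ulysses_size)
    if world_size % tp_size != 0:
        # Assumption: reject layouts that would leave GPUs unused.
        raise ValueError("world_size must be divisible by tp_size")
    return world_size // tp_size


# 1D: 4 GPUs, auto  -> every GPU joins the Ulysses group.
one_d = resolve_ulysses_size(world_size=4)
# 2D: 4 GPUs, tp_size=2 -> the remaining factor goes to Ulysses.
two_d = resolve_ulysses_size(world_size=4, tp_size=2)
```

Here world_size corresponds to the value passed via --num-gpus.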
Hybrid Cache and Parallelism
Define a hybrid cache and parallel acceleration config YAML file, hybrid.yaml, that contains:
Advanced Configuration
DBCache Parameters
DBCache controls block-level caching behavior:
| Parameter | Env Variable | Default | Description |
|---|---|---|---|
| Fn | SGLANG_CACHE_DIT_FN | 1 | Number of first blocks to always compute |
| Bn | SGLANG_CACHE_DIT_BN | 0 | Number of last blocks to always compute |
| W | SGLANG_CACHE_DIT_WARMUP | 4 | Warmup steps before caching starts |
| R | SGLANG_CACHE_DIT_RDT | 0.24 | Residual difference threshold |
| MC | SGLANG_CACHE_DIT_MC | 3 | Maximum continuous cached steps |
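To make the interaction between W, R, and MC concrete, here is a simplified sketch of a per-step caching decision. It is not Cache-DiT's implementation: the class name, the `should_reuse` method, and the exact way the MC limit forces a recompute are assumptions, and the residual difference is passed in as a precomputed number rather than derived from the first Fn blocks.

```python
from dataclasses import dataclass, field


@dataclass
class DBCacheSketch:
    """Toy decision rule using the documented defaults (W=4, R=0.24, MC=3)."""

    warmup_steps: int = 4      # W: steps that are always computed
    rdt: float = 0.24          # R: residual difference threshold
    max_continuous: int = 3    # MC: cap on back-to-back cached steps
    _streak: int = field(default=0, init=False, repr=False)

    def should_reuse(self, step: int, residual_diff: float) -> bool:
        """Return True if this step may reuse cached block outputs."""
        if step < self.warmup_steps:
            self._streak = 0
            return False
        if self._streak >= self.max_continuous:
            # Assumption: hitting the MC limit forces one full compute.
            self._streak = 0
            return False
        if residual_diff < self.rdt:
            self._streak += 1
            return True
        self._streak = 0
        return False


cache = DBCacheSketch()
# A constant, small residual difference over 10 denoising steps.
decisions = [cache.should_reuse(step, 0.1) for step in range(10)]
```

In this toy model, lowering `rdt` makes reuse rarer (favoring quality), while raising it makes reuse more frequent (favoring speed).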
TaylorSeer Configuration
TaylorSeer improves caching accuracy using Taylor expansion:
| Parameter | Env Variable | Default | Description |
|---|---|---|---|
| Enable | SGLANG_CACHE_DIT_TAYLORSEER | false | Enable TaylorSeer calibrator |
| Order | SGLANG_CACHE_DIT_TS_ORDER | 1 | Taylor expansion order (1 or 2) |
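The idea of forecasting with a Taylor expansion can be illustrated on scalar values. The sketch below is a toy under stated assumptions, not the TaylorSeer calibrator itself: `taylor_forecast` is a hypothetical name, derivatives are approximated with backward finite differences over evenly spaced steps, and real features would be tensors rather than floats.

```python
def taylor_forecast(history, steps_ahead, order=1):
    """Extrapolate the next value(s) from previously computed ones.

    history: values from fully computed steps, oldest first, unit spacing.
    steps_ahead: how many steps past the last computed value to predict.
    order: 1 (linear) or 2 (adds a curvature term), as in the table above.
    """
    if order not in (1, 2):
        raise ValueError("order must be 1 or 2")
    if len(history) < order + 1:
        raise ValueError("need at least order + 1 computed values")

    last = history[-1]
    # First backward difference approximates the first derivative.
    d1 = history[-1] - history[-2]
    prediction = last + steps_ahead * d1
    if order == 2:
        # Second backward difference approximates the second derivative.
        d2 = history[-1] - 2 * history[-2] + history[-3]
        prediction += 0.5 * steps_ahead ** 2 * d2
    return prediction


# A linearly evolving feature is extrapolated along the same line.
linear_guess = taylor_forecast([1.0, 3.0, 5.0], steps_ahead=1, order=1)
```

A higher order can track curvature in how features evolve, at the cost of keeping one more computed value in the history.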
Combined Configuration Example
DBCache and TaylorSeer are complementary strategies that work together. You can configure both sets of parameters simultaneously:
SCM (Step Computation Masking)
SCM provides step-level caching control for additional speedup. It decides which denoising steps to compute fully and which to use cached results.
SCM Presets
SCM is configured with presets:
| Preset | Compute Ratio | Speed | Quality |
|---|---|---|---|
| none | 100% | Baseline | Best |
| slow | ~75% | ~1.3x | High |
| medium | ~50% | ~2x | Good |
| fast | ~35% | ~3x | Acceptable |
| ultra | ~25% | ~4x | Lower |

| Policy | Env Variable | Description |
|---|---|---|
| dynamic | SGLANG_CACHE_DIT_SCM_POLICY=dynamic | Adaptive caching based on content (default) |
| static | SGLANG_CACHE_DIT_SCM_POLICY=static | Fixed caching pattern |
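As a rough illustration of what a fixed (static) step mask could look like, the sketch below turns a preset's approximate compute ratio into a per-step compute/skip pattern. The even spacing, the rounding, and the name `build_step_mask` are assumptions made for this example; the masks Cache-DiT actually generates may differ.

```python
# Approximate compute ratios taken from the preset table above.
PRESET_COMPUTE_RATIO = {
    "none": 1.00,
    "slow": 0.75,
    "medium": 0.50,
    "fast": 0.35,
    "ultra": 0.25,
}


def build_step_mask(num_steps, preset):
    """Return a list of booleans: True = compute the step, False = use cache.

    Hypothetical helper: spreads the computed steps evenly and always
    computes step 0 so that there is something to cache.
    """
    ratio = PRESET_COMPUTE_RATIO[preset]
    n_compute = max(1, round(num_steps * ratio))
    mask = [False] * num_steps
    for i in range(n_compute):
        # Evenly spaced indices; distinct because n_compute <= num_steps.
        mask[(i * num_steps) // n_compute] = True
    return mask


medium_mask = build_step_mask(8, "medium")
```

With very few steps, almost every step ends up computed or the gaps become large relative to the schedule.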
Environment Variables
All Cache-DiT parameters can be configured via environment variables. See Environment variables for the complete list.
Supported Models
SGLang Diffusion x Cache-DiT supports almost all models originally supported in SGLang Diffusion:
| Model Family | Example Models |
|---|---|
| Wan | Wan2.1, Wan2.2 |
| Flux | FLUX.1-dev, FLUX.2-dev |
| Z-Image | Z-Image-Turbo |
| Qwen | Qwen-Image, Qwen-Image-Edit |
| Hunyuan | HunyuanVideo |
Performance Tips
- Start with defaults: The default parameters work well for most models
- Use TaylorSeer: It typically improves both speed and quality
- Tune R threshold: Lower values = better quality, higher values = faster
- SCM for extra speed: Use the medium preset for good speed/quality balance
- Warmup matters: Higher warmup = more stable caching decisions
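When experimenting with these tips, it can help to build the tuning variables in one place. The helper below is a hypothetical sketch: the variable names and defaults come from the parameter tables in this document, but `cache_dit_env` itself, the string formatting of values, and the validation are assumptions. The variable that enables Cache-DiT is intentionally left out, since it is not named in this section.

```python
import os


def cache_dit_env(fn=1, bn=0, warmup=4, rdt=0.24, mc=3,
                  taylorseer=False, ts_order=1):
    """Build a dict of Cache-DiT tuning variables (hypothetical helper)."""
    if ts_order not in (1, 2):
        raise ValueError("ts_order must be 1 or 2")
    return {
        "SGLANG_CACHE_DIT_FN": str(fn),
        "SGLANG_CACHE_DIT_BN": str(bn),
        "SGLANG_CACHE_DIT_WARMUP": str(warmup),
        "SGLANG_CACHE_DIT_RDT": str(rdt),
        "SGLANG_CACHE_DIT_MC": str(mc),
        "SGLANG_CACHE_DIT_TAYLORSEER": "true" if taylorseer else "false",
        "SGLANG_CACHE_DIT_TS_ORDER": str(ts_order),
    }


# Example: enable TaylorSeer and tighten the threshold slightly,
# then merge into a copy of the current environment for a subprocess.
tuned = cache_dit_env(taylorseer=True, rdt=0.2)
launch_env = {**os.environ, **tuned}
```

The resulting `launch_env` can be passed as the `env` argument of `subprocess.run` when starting sglang generate or sglang serve.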
Limitations
- SGLang-native pipelines: Distributed support (TP/SP) is not yet validated; Cache-DiT will be automatically disabled when world_size > 1.
- SCM minimum steps: SCM requires >= 8 inference steps to be effective
- Model support: Only models registered in Cache-DiT’s BlockAdapterRegister are supported
