Overview
| Optimization | Type | Description |
|---|---|---|
| Cache-DiT | Caching | Block-level caching with DBCache, TaylorSeer, and SCM |
| TeaCache | Caching | Timestep-level caching using L1 similarity |
| Attention Backends | Kernel | Optimized attention implementations (FlashAttention, SageAttention, etc.) |
| Profiling | Diagnostics | PyTorch Profiler and Nsight Systems guidance |
Caching Strategies
SGLang supports two complementary caching approaches: Cache-DiT and TeaCache.
Cache-DiT
Cache-DiT provides block-level caching with advanced strategies. It can achieve up to 1.69x speedup.
Key strategies:
- DBCache: Dynamic block-level caching based on residual differences
- TaylorSeer: Taylor expansion-based calibration for optimized caching
- SCM: Step-level computation masking for additional speedup
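The ideas behind DBCache and TaylorSeer can be illustrated with a small, framework-free sketch. This is not Cache-DiT's implementation or API: `relative_l1`, `should_reuse`, `taylor_extrapolate`, and the threshold value are hypothetical names and numbers chosen for illustration. The sketch assumes that the residual of a few probe blocks is compared against the previous step, and that a first-order Taylor step (with a finite difference standing in for the derivative) forecasts the next residual.

```python
from typing import List, Optional, Sequence


def relative_l1(current: Sequence[float], previous: Sequence[float]) -> float:
    """Summed absolute change, normalised by the magnitude of `previous`."""
    diff = sum(abs(c - p) for c, p in zip(current, previous))
    scale = sum(abs(p) for p in previous)
    return diff / scale if scale > 0 else float("inf")


def should_reuse(
    probe_now: Sequence[float],
    probe_prev: Optional[Sequence[float]],
    threshold: float = 0.1,  # hypothetical value, not a library default
) -> bool:
    """Assumed DBCache-style decision: reuse the cached residual of the
    remaining blocks when the probe residual barely changed since last step."""
    if probe_prev is None:  # nothing cached yet, so a full compute is required
        return False
    return relative_l1(probe_now, probe_prev) < threshold


def taylor_extrapolate(
    last: Sequence[float], second_last: Sequence[float], steps_ahead: int = 1
) -> List[float]:
    """First-order Taylor forecast from the two most recent computed residuals."""
    return [a + steps_ahead * (a - b) for a, b in zip(last, second_last)]


probe_prev = [1.00, 2.00, -1.00]
probe_now = [1.02, 1.98, -1.01]  # small change relative to probe_prev
reuse = should_reuse(probe_now, probe_prev)
forecast = taylor_extrapolate(last=[2.0, 4.0], second_last=[1.0, 3.5])
print(reuse, forecast)
```

Normalising by the previous residual's magnitude keeps the threshold independent of activation scale, which is why a single value can be applied across steps in this sketch.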
TeaCache
TeaCache (Timestep Embedding Aware Cache) accelerates diffusion inference by detecting when consecutive denoising steps are similar enough that the computation can be skipped.
Quick Overview:
- Tracks the L1 distance between modulated inputs across timesteps
- When the accumulated distance is below a threshold, reuses the cached residual
- Supports CFG with separate positive/negative caches
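The three points above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not SGLang's or TeaCache's actual code: `BranchCache`, `step`, `run_blocks`, and the threshold are hypothetical, tensors are replaced by lists of floats, and any model-specific rescaling of the distance is left out.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class BranchCache:
    """Per-branch state; with CFG, keep one for positive and one for negative."""
    prev_input: Optional[List[float]] = None
    residual: Optional[List[float]] = None
    accumulated: float = 0.0


def rel_l1(cur: List[float], prev: List[float]) -> float:
    num = sum(abs(c - p) for c, p in zip(cur, prev))
    den = sum(abs(p) for p in prev)
    return num / den if den > 0 else float("inf")


def step(
    cache: BranchCache,
    modulated_input: List[float],
    hidden: List[float],
    run_blocks: Callable[[List[float]], List[float]],
    threshold: float = 0.1,  # hypothetical value
) -> Tuple[List[float], bool]:
    """Return (output, skipped) for one denoising step of one CFG branch."""
    can_skip = False
    if cache.prev_input is not None and cache.residual is not None:
        # Accumulate the distance since the last full computation.
        cache.accumulated += rel_l1(modulated_input, cache.prev_input)
        can_skip = cache.accumulated < threshold
    cache.prev_input = list(modulated_input)
    if can_skip:
        # Reuse the cached residual instead of running the transformer blocks.
        return [h + r for h, r in zip(hidden, cache.residual)], True
    out = run_blocks(hidden)
    cache.residual = [o - h for o, h in zip(out, hidden)]
    cache.accumulated = 0.0  # a full compute resets the accumulator
    return out, False


calls = []

def fake_blocks(hidden: List[float]) -> List[float]:
    """Stand-in for the expensive transformer blocks."""
    calls.append(1)
    return [h * 2.0 for h in hidden]

caches = {"positive": BranchCache(), "negative": BranchCache()}
inputs = [[1.0, 1.0], [1.02, 1.0], [1.04, 1.0], [1.5, 1.0]]
skipped = [
    step(caches["positive"], x, [1.0, 1.0], fake_blocks, threshold=0.05)[1]
    for x in inputs
]
print(skipped, len(calls))
```

In this run the two middle steps change the input only slightly, so they reuse the cached residual, while the large jump at the last step forces a full recomputation. The negative branch keeps its own cache and is unaffected by what the positive branch does.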
Attention Backends
Different attention backends offer varying performance characteristics depending on your hardware and model:
- FlashAttention: Fastest on NVIDIA GPUs with fp16/bf16
- SageAttention: Alternative optimized implementation
- xformers: Memory-efficient attention
- SDPA: PyTorch native scaled dot-product attention
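Because not every backend is installed on every machine, a common pattern is to probe for the packages and fall back to PyTorch's native SDPA. The sketch below shows that pattern only; it is not how SGLang selects a backend. `PREFERENCE`, `module_installed`, and `pick_backend` are hypothetical names, the preference order is an assumption, and the module names are the usual import names of the respective packages.

```python
import importlib.util
from typing import Callable, Tuple

# (backend label, importable module that provides it); order encodes an
# assumed preference, not a measured ranking.
PREFERENCE: Tuple[Tuple[str, str], ...] = (
    ("FlashAttention", "flash_attn"),
    ("SageAttention", "sageattention"),
    ("xformers", "xformers"),
    ("SDPA", "torch"),  # torch.nn.functional.scaled_dot_product_attention
)


def module_installed(name: str) -> bool:
    """True if a top-level module can be located, without importing it."""
    return importlib.util.find_spec(name) is not None


def pick_backend(
    preference: Tuple[Tuple[str, str], ...] = PREFERENCE,
    is_available: Callable[[str], bool] = module_installed,
) -> str:
    """Return the first backend in `preference` whose module is available."""
    for label, module in preference:
        if is_available(module):
            return label
    raise RuntimeError("no attention backend is installed")


# Simulate a machine where only xformers and torch are installed.
chosen = pick_backend(is_available=lambda m: m in {"xformers", "torch"})
print(chosen)
```

Passing the availability check in as a parameter keeps the selection logic testable on machines without a GPU stack. An installed package is only a necessary condition: whether a backend actually runs still depends on the GPU architecture and the dtype.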
Profiling
To diagnose performance bottlenecks, SGLang-Diffusion supports profiling tools:
- PyTorch Profiler: Built-in Python profiling
- Nsight Systems: GPU kernel-level analysis
