Use this page as the starting point for SGLang Diffusion performance work. It separates performance levers into two decision classes:
  • Output-preserving / lossless-style: system settings that should preserve model behavior while changing residency, parallelism, kernels, or scheduling.
  • Quality-tradeoff / lossy or approximate: techniques that can change the denoising path, numerical representation, or generated output.
The docs use “output-preserving” instead of promising bit-exact “lossless” because different kernels, GPU types, or precision paths can still introduce small numerical differences. The decision boundary is whether the optimization intentionally trades quality or output equivalence for speed.
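To make the distinction between "bit-exact" and "output-preserving" concrete, here is a minimal sketch of how two generated outputs could be compared. It is not part of SGLang: the function names, the flat list-of-pixel-values representation, and the 40 dB PSNR threshold are all illustrative assumptions, and a real acceptance check would likely use a perceptual metric suited to the model.

```python
import math


def compare_outputs(reference, candidate, max_value=255.0):
    """Compare two equal-length sequences of pixel values.

    Returns (bit_exact, max_abs_diff, psnr_db). Hypothetical helper,
    not an SGLang API.
    """
    if len(reference) != len(candidate):
        raise ValueError("outputs must have the same number of values")
    if not reference:
        raise ValueError("outputs must not be empty")
    diffs = [abs(r - c) for r, c in zip(reference, candidate)]
    max_abs_diff = max(diffs)
    mse = sum(d * d for d in diffs) / len(diffs)
    # PSNR is infinite when the outputs are identical.
    psnr_db = math.inf if mse == 0 else 10.0 * math.log10(max_value ** 2 / mse)
    return max_abs_diff == 0, max_abs_diff, psnr_db


def is_output_preserving(reference, candidate, min_psnr_db=40.0):
    """Treat small numerical drift as acceptable.

    The 40 dB default is an assumed, illustrative threshold.
    """
    _, _, psnr_db = compare_outputs(reference, candidate)
    return psnr_db >= min_psnr_db
```

A different kernel or precision path would typically fail the `bit_exact` check while still passing `is_output_preserving`; a lever that changes the denoising path may fail both.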

Start Here

  1. Pick a serving or generation mode from Deployment and Performance Modes. --performance-mode auto is the default; use speed when the model fits in GPU memory and latency matters most, memory when GPU memory is the bottleneck, and manual when every performance flag should be explicit.
  2. Choose the right attention backend from Attention Backends.
  3. Use Sequence Parallelism only when the model and video shape benefit from sequence splitting.
  4. Use Inference Batching for concurrent compatible requests during serving.
  5. Use Profiling before changing several levers at once.
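The mode guidance in step 1 can be written down as a small decision rule. The sketch below is only a restatement of that guidance: `pick_performance_mode` and its parameters are hypothetical names, and only the four mode values (`auto`, `speed`, `memory`, `manual`) come from this page.

```python
def pick_performance_mode(fits_in_gpu_memory=None,
                          latency_critical=False,
                          explicit_flags=False):
    """Suggest a value for --performance-mode.

    Hypothetical helper that mirrors the guidance in step 1;
    it is not part of SGLang.
    """
    if explicit_flags:
        # Every performance flag should be set by hand.
        return "manual"
    if fits_in_gpu_memory is False:
        # GPU memory is the bottleneck.
        return "memory"
    if fits_in_gpu_memory and latency_critical:
        # The model fits and latency matters most.
        return "speed"
    # No strong preference: keep the default.
    return "auto"
```

For example, `pick_performance_mode(fits_in_gpu_memory=True, latency_critical=True)` suggests `speed`, while calling it with no arguments keeps the default `auto`.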

Output-Preserving / Lossless-Style Levers

These settings should preserve model behavior while changing residency, parallelism, kernels, or scheduling. They are the first choices for production tuning.
| Lever | Use when | Docs |
| --- | --- | --- |
| --performance-mode | You want a safe preset for speed or memory without overriding explicit flags. | Deployment and Performance Modes |
| Offload, FSDP, CFG parallelism | GPU memory, multi-GPU residency, or CFG branch splitting is the main bottleneck. | Deployment and Performance Modes |
| Sequence parallelism | Long image/video sequences need sequence-level parallelism. | Sequence Parallelism |
| Attention backend | Kernel choice dominates DiT latency or memory. | Attention Backends |
| Dynamic batching | Serving many compatible requests concurrently. | Inference Batching |

Quality-Tradeoff / Lossy Or Approximate Levers

These techniques can change the denoising path, numerical representation, or generated output. They are useful after you have a baseline and an acceptance criterion for quality.
| Lever | Tradeoff | Docs |
| --- | --- | --- |
| Cache-DiT | Skips selected DiT block or step computation based on cache decisions. | Cache-DiT |
| TeaCache | Reuses residuals when consecutive denoising steps are similar enough. | TeaCache |
| Progressive resolution | Runs early denoising at lower latent resolution for supported pipelines. | Progressive Resolution Generation |
| Quantization | Uses lower-precision transformer weights or activations. | Quantization |

Practical Order

  1. Establish a baseline with the target model, resolution, frame count, step count, and GPU type.
  2. Select --performance-mode and explicit residency or parallelism flags.
  3. Tune attention backend and batching for the deployment pattern.
  4. Profile if the bottleneck is unclear.
  5. Add caching, progressive resolution, or quantization only after comparing output quality against your acceptance target.
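The baseline-then-compare loop in the steps above can be sketched as a small harness. This is an illustration under stated assumptions, not SGLang tooling: `generate` stands for any zero-argument callable that runs one full generation with a fixed model, resolution, frame count, step count, and seed, and the 5% minimum speedup is an arbitrary noise margin.

```python
import statistics
import time


def benchmark(generate, warmup=1, runs=5):
    """Return the median wall-clock seconds of generate().

    `generate` is a hypothetical zero-argument callable. For GPU work it
    must block until the result is ready, otherwise the timing only
    measures kernel launch.
    """
    for _ in range(warmup):
        generate()  # Warm-up runs absorb compilation and cache effects.
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        generate()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def accept_change(baseline_s, candidate_s, quality_ok, min_speedup=1.05):
    """Keep a lever only if quality passes and the speedup beats noise.

    The 1.05 default is an assumed margin, not a recommended value.
    Returns (accepted, speedup).
    """
    speedup = baseline_s / candidate_s
    return bool(quality_ok and speedup >= min_speedup), speedup
```

Changing one lever per comparison keeps the measured speedup attributable to that lever; for output-preserving levers `quality_ok` would normally be a numerical-drift check, and for quality-tradeoff levers it would be the acceptance criterion chosen before tuning.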

Diagnostics

Profiling is not an optimization technique by itself. It belongs in the performance workflow because it tells you which stage, kernel, or denoising step is worth optimizing before you change multiple levers.
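As a rough illustration of "which stage is worth optimizing", the sketch below accumulates wall-clock time per named stage and ranks the stages by total cost. It is a coarse stand-in, not the profiler described in the Profiling docs: `StageTimer` and the stage names are hypothetical, and wall-clock timing of GPU stages is only meaningful if each stage blocks until its work has finished.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StageTimer:
    """Accumulate wall-clock time per named stage (hypothetical helper)."""

    def __init__(self):
        self.totals = defaultdict(float)
        self.counts = defaultdict(int)

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the time even if the stage raised an exception.
            self.totals[name] += time.perf_counter() - start
            self.counts[name] += 1

    def ranked(self):
        """Return (stage, total_seconds) pairs, most expensive first."""
        return sorted(self.totals.items(), key=lambda kv: kv[1], reverse=True)
```

Wrapping, say, text encoding, the denoising loop, and decoding in separate `with timer.stage(...)` blocks and reading `timer.ranked()` gives a first indication of where time goes before reaching for a kernel-level profiler.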

References