- Output-preserving / lossless-style: system settings that should preserve model behavior while changing residency, parallelism, kernels, or scheduling.
- Quality-tradeoff / lossy or approximate: techniques that can change the denoising path, numerical representation, or generated output.
Start Here
- Pick a serving or generation mode from Deployment and Performance Modes.
  `--performance-mode auto` is the default; use `speed` when the model fits in GPU memory and latency matters most, `memory` when GPU memory is the bottleneck, and `manual` when every performance flag should be explicit.
- Choose the right attention backend from Attention Backends.
- Use Sequence Parallelism only when the model and video shape benefit from sequence splitting.
- Use Inference Batching for concurrent compatible requests during serving.
- Use Profiling before changing several levers at once.
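
The mode guidance above can be read as a small decision rule. The sketch below is illustrative only: `choose_performance_mode`, its parameters, and the 0.8 headroom factor are assumptions made for this example. Only the flag name and the values `auto`, `speed`, `memory`, and `manual` come from this page.

```python
def choose_performance_mode(model_gib, gpu_gib, latency_critical=False,
                            explicit_flags=False, headroom=0.8):
    """Return a value for --performance-mode (hypothetical helper).

    headroom is an assumed safety factor that leaves GPU memory
    free for activations and temporary buffers.
    """
    if explicit_flags:
        return "manual"   # every performance flag is set by hand
    fits = model_gib <= gpu_gib * headroom
    if not fits:
        return "memory"   # GPU memory is the bottleneck
    if latency_critical:
        return "speed"    # model fits and latency matters most
    return "auto"         # the default


print("--performance-mode", choose_performance_mode(24, 80, latency_critical=True))
```

Treat the result as a starting point; measured memory use and latency on the target GPU should decide the final setting.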
Output-Preserving / Lossless-Style Levers
These settings should preserve model behavior while changing residency, parallelism, kernels, or scheduling. They are the first choices for production tuning.

| Lever | Use when | Docs |
|---|---|---|
| `--performance-mode` | You want a safe preset for speed or memory without overriding explicit flags. | Deployment and Performance Modes |
| Offload, FSDP, CFG parallelism | GPU memory, multi-GPU residency, or CFG branch splitting is the main bottleneck. | Deployment and Performance Modes |
| Sequence parallelism | Long image/video sequences need sequence-level parallelism. | Sequence Parallelism |
| Attention backend | Kernel choice dominates DiT latency or memory. | Attention Backends |
| Dynamic batching | Serving many compatible requests concurrently. | Inference Batching |
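
To make "compatible requests" concrete, here is a minimal grouping sketch. This page does not define compatibility, so the example assumes that requests can share a batch only when model, resolution, frame count, and step count all match. `Request`, `batch_key`, `group_compatible`, and `max_batch_size` are hypothetical names, not the server's API.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    request_id: str
    model: str
    height: int
    width: int
    num_frames: int
    num_steps: int

    def batch_key(self):
        # Assumed compatibility rule: everything except the id must match.
        return (self.model, self.height, self.width,
                self.num_frames, self.num_steps)


def group_compatible(requests, max_batch_size=4):
    """Split requests into batches whose members share a batch_key."""
    buckets = defaultdict(list)
    for req in requests:
        buckets[req.batch_key()].append(req)
    batches = []
    for bucket in buckets.values():
        # Cap each batch so one popular shape cannot grow without bound.
        for i in range(0, len(bucket), max_batch_size):
            batches.append(bucket[i:i + max_batch_size])
    return batches


pending = [
    Request("a", "example-model", 480, 832, 81, 30),
    Request("b", "example-model", 480, 832, 81, 30),
    Request("c", "example-model", 720, 1280, 81, 30),
]
for batch in group_compatible(pending):
    print([r.request_id for r in batch])
```

A real scheduler would also bound how long a request waits for batch partners; that is omitted here.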
Quality-Tradeoff / Lossy Or Approximate Levers
These techniques can change the denoising path, numerical representation, or generated output. They are useful after you have a baseline and an acceptance criterion for quality.

| Lever | Tradeoff | Docs |
|---|---|---|
| Cache-DiT | Skips selected DiT block or step computation based on cache decisions. | Cache-DiT |
| TeaCache | Reuses residuals when consecutive denoising steps are similar enough. | TeaCache |
| Progressive resolution | Runs early denoising at lower latent resolution for supported pipelines. | Progressive Resolution Generation |
| Quantization | Uses lower-precision transformer weights or activations. | Quantization |
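
The idea of reusing work when consecutive denoising steps are similar enough can be sketched as a threshold on accumulated change. This is a deliberately simplified illustration and not the TeaCache or Cache-DiT implementation: the relative L1 measure, the `threshold` value, and the names `relative_l1` and `plan_steps` are assumptions, and step inputs are modeled as plain lists of floats.

```python
def relative_l1(current, previous):
    """Summed absolute change, relative to the previous input's magnitude."""
    diff = sum(abs(c - p) for c, p in zip(current, previous))
    norm = sum(abs(p) for p in previous)
    return diff / norm if norm > 0 else float("inf")


def plan_steps(step_inputs, threshold=0.1):
    """Return one 'compute' or 'reuse' decision per denoising step.

    Change is accumulated across reused steps and reset whenever a
    full computation runs, so drift cannot build up unnoticed.
    """
    decisions = []
    accumulated = 0.0
    previous = None
    for current in step_inputs:
        if previous is None:
            decisions.append("compute")    # first step always runs
        else:
            accumulated += relative_l1(current, previous)
            if accumulated < threshold:
                decisions.append("reuse")  # similar enough: reuse cached result
            else:
                decisions.append("compute")
                accumulated = 0.0
        previous = current
    return decisions


inputs = [[1.0, 1.0], [1.01, 1.0], [1.02, 1.0], [2.0, 2.0]]
print(plan_steps(inputs))
```

A lower threshold reuses fewer steps and stays closer to the baseline output; a higher one saves more compute at greater risk of visible drift.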
Practical Order
- Establish a baseline with the target model, resolution, frame count, step count, and GPU type.
- Select `--performance-mode` and explicit residency or parallelism flags.
- Tune attention backend and batching for the deployment pattern.
- Profile if the bottleneck is unclear.
- Add caching, progressive resolution, or quantization only after comparing output quality against your acceptance target.
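
The final step needs a measurable acceptance target. One possible check, sketched below, compares a baseline frame with the frame produced after enabling a lossy lever. PSNR and the 30 dB threshold are assumptions chosen for illustration; this page does not prescribe a metric, and perceptual metrics or human review may suit some workloads better. Frames are modeled as flat sequences of 8-bit pixel values.

```python
import math


def psnr(reference, candidate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    if len(reference) != len(candidate):
        raise ValueError("inputs must have the same length")
    if not reference:
        raise ValueError("inputs must not be empty")
    mse = sum((r - c) ** 2 for r, c in zip(reference, candidate)) / len(reference)
    if mse == 0:
        return math.inf   # identical outputs
    return 10.0 * math.log10(max_value ** 2 / mse)


def accept(reference, candidate, min_psnr_db=30.0):
    """True when the candidate stays within the assumed quality budget."""
    return psnr(reference, candidate) >= min_psnr_db


baseline_frame = [100, 100, 100, 100]
optimized_frame = [101, 99, 100, 100]
print(accept(baseline_frame, optimized_frame))
```

Run the comparison with the same seed, prompt, resolution, and step count as the baseline so that the lever under test is the only difference.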
