Runtime

| Environment Variable | Default | Description |
| --- | --- | --- |
| `SGLANG_DIFFUSION_TARGET_DEVICE` | `cuda` | Target device for inference (`cuda`, `rocm`, `xpu`, `npu`, `musa`, `mps`, `cpu`) |
| `SGLANG_DIFFUSION_ATTENTION_BACKEND` | not set | Override attention backend via env var (e.g. `fa`, `torch_sdpa`, `sage_attn`) |
| `SGLANG_DIFFUSION_ATTENTION_CONFIG` | not set | Path to attention backend configuration file (JSON/YAML) |
| `SGLANG_DIFFUSION_STAGE_LOGGING` | `false` | Enable per-stage timing logs |
| `SGLANG_DIFFUSION_SERVER_DEV_MODE` | `false` | Enable dev-only HTTP endpoints for debugging |
| `SGLANG_DIFFUSION_TORCH_PROFILER_DIR` | not set | Directory for torch profiler traces (absolute path). Enables profiling when set. |
| `SGLANG_DIFFUSION_CACHE_ROOT` | `~/.cache/sgl_diffusion` | Root directory for cache files |
| `SGLANG_DIFFUSION_CONFIG_ROOT` | `~/.config/sgl_diffusion` | Root directory for configuration files |
| `SGLANG_DIFFUSION_LOGGING_LEVEL` | `INFO` | Default logging level |
| `SGLANG_DIFFUSION_WORKER_MULTIPROC_METHOD` | `fork` | Multiprocess context for workers (`fork` or `spawn`) |
| `SGLANG_USE_RUNAI_MODEL_STREAMER` | `true` | Use Run:AI model streamer for model loading |
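
As a rough illustration of how the runtime variables above might be assembled for a launcher script, the following sketch builds an environment mapping and rejects values that the table does not list. The helper name `runtime_env` and its parameters are hypothetical, and writing booleans as `true`/`false` is an assumption based on the defaults shown, not a confirmed parsing rule.

```python
import os

# Allowed values copied from the Runtime table.
TARGET_DEVICES = {"cuda", "rocm", "xpu", "npu", "musa", "mps", "cpu"}
MULTIPROC_METHODS = {"fork", "spawn"}


def runtime_env(device="cuda", multiproc="fork", stage_logging=False,
                profiler_dir=None, base=None):
    """Return a copy of `base` (default: os.environ) with runtime overrides.

    Hypothetical helper for illustration; not part of SGLang.
    """
    if device not in TARGET_DEVICES:
        raise ValueError(f"unknown target device: {device!r}")
    if multiproc not in MULTIPROC_METHODS:
        raise ValueError(f"unknown multiprocessing method: {multiproc!r}")

    env = dict(os.environ if base is None else base)
    env["SGLANG_DIFFUSION_TARGET_DEVICE"] = device
    env["SGLANG_DIFFUSION_WORKER_MULTIPROC_METHOD"] = multiproc
    # Assumption: booleans are spelled "true"/"false", matching the defaults shown.
    env["SGLANG_DIFFUSION_STAGE_LOGGING"] = "true" if stage_logging else "false"

    if profiler_dir is not None:
        # The table asks for an absolute path; setting the variable enables profiling.
        if not os.path.isabs(profiler_dir):
            raise ValueError("profiler_dir must be an absolute path")
        env["SGLANG_DIFFUSION_TORCH_PROFILER_DIR"] = profiler_dir
    return env
```

The returned mapping can be passed as the `env` argument of `subprocess.run` when starting whatever launch command you normally use, which keeps the overrides scoped to that one process.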

Platform-Specific

Apple MPS

| Environment Variable | Default | Description |
| --- | --- | --- |
| `SGLANG_USE_MLX` | not set | Set to `1` to enable MLX fused Metal kernels for norm ops on MPS |

ROCm (AMD GPUs)

| Environment Variable | Default | Description |
| --- | --- | --- |
| `SGLANG_USE_ROCM_VAE` | `false` | Use AITer GroupNorm in VAE for improved performance on ROCm |
| `SGLANG_USE_ROCM_CUDNN_BENCHMARK` | `false` | Enable MIOpen auto-tuning for VAE conv layers on ROCm |

Quantization

| Environment Variable | Default | Description |
| --- | --- | --- |
| `SGLANG_DIFFUSION_FLASHINFER_FP4_GEMM_BACKEND` | not set | Optional FlashInfer FP4 GEMM backend override for diffusion NVFP4. When unset, SGLang defaults to `flashinfer_trtllm`. |
| `SGLANG_DIFFUSION_ENABLE_W8A8_FP8_GEMM` | `false` | Experimental opt-in for fused W8A8 FP8 GEMM in diffusion weight-only FP8 linears. When disabled, FP8 weights are dequantized to the compute dtype before matmul. Enabling this dynamically quantizes activations to FP8 and may change output quality. |

Caching Acceleration

These variables configure caching acceleration for Diffusion Transformer (DiT) models. SGLang supports multiple caching strategies; see the caching documentation for an overview.

Cache-DiT Configuration

See the Cache-DiT documentation for detailed configuration.

| Environment Variable | Default | Description |
| --- | --- | --- |
| `SGLANG_CACHE_DIT_ENABLED` | `false` | Enable Cache-DiT acceleration |
| `SGLANG_CACHE_DIT_FN` | `1` | First N blocks to always compute |
| `SGLANG_CACHE_DIT_BN` | `0` | Last N blocks to always compute |
| `SGLANG_CACHE_DIT_WARMUP` | `4` | Warmup steps before caching |
| `SGLANG_CACHE_DIT_RDT` | `0.24` | Residual difference threshold |
| `SGLANG_CACHE_DIT_MC` | `3` | Max continuous cached steps |
| `SGLANG_CACHE_DIT_TAYLORSEER` | `false` | Enable TaylorSeer calibrator |
| `SGLANG_CACHE_DIT_TS_ORDER` | `1` | TaylorSeer order (1 or 2) |
| `SGLANG_CACHE_DIT_SCM_PRESET` | `none` | SCM preset (none/slow/medium/fast/ultra) |
| `SGLANG_CACHE_DIT_SCM_POLICY` | `dynamic` | SCM caching policy |
| `SGLANG_CACHE_DIT_SCM_COMPUTE_BINS` | not set | Custom SCM compute bins |
| `SGLANG_CACHE_DIT_SCM_CACHE_BINS` | not set | Custom SCM cache bins |
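
To keep a Cache-DiT configuration in one place, the defaults listed above can be mirrored in a small settings object that renders itself as environment variables. This is only a sketch: the class `CacheDitEnv` is hypothetical, the `true`/`false` spelling of booleans is an assumption, and the SCM policy and bin variables are left out because their value formats are not described here.

```python
from dataclasses import dataclass

# Preset names copied from the SGLANG_CACHE_DIT_SCM_PRESET description.
SCM_PRESETS = {"none", "slow", "medium", "fast", "ultra"}


@dataclass
class CacheDitEnv:
    """Hypothetical container; field defaults mirror the documented defaults."""
    enabled: bool = False
    fn: int = 1            # first N blocks to always compute
    bn: int = 0            # last N blocks to always compute
    warmup: int = 4        # warmup steps before caching
    rdt: float = 0.24      # residual difference threshold
    mc: int = 3            # max continuous cached steps
    taylorseer: bool = False
    ts_order: int = 1      # TaylorSeer order (1 or 2)
    scm_preset: str = "none"

    def to_env(self):
        """Render the settings as a dict of environment variable strings."""
        if self.ts_order not in (1, 2):
            raise ValueError("ts_order must be 1 or 2")
        if self.scm_preset not in SCM_PRESETS:
            raise ValueError(f"unknown SCM preset: {self.scm_preset!r}")

        def flag(value):
            # Assumption: booleans are spelled "true"/"false".
            return "true" if value else "false"

        return {
            "SGLANG_CACHE_DIT_ENABLED": flag(self.enabled),
            "SGLANG_CACHE_DIT_FN": str(self.fn),
            "SGLANG_CACHE_DIT_BN": str(self.bn),
            "SGLANG_CACHE_DIT_WARMUP": str(self.warmup),
            "SGLANG_CACHE_DIT_RDT": str(self.rdt),
            "SGLANG_CACHE_DIT_MC": str(self.mc),
            "SGLANG_CACHE_DIT_TAYLORSEER": flag(self.taylorseer),
            "SGLANG_CACHE_DIT_TS_ORDER": str(self.ts_order),
            "SGLANG_CACHE_DIT_SCM_PRESET": self.scm_preset,
        }
```

For example, `os.environ.update(CacheDitEnv(enabled=True, rdt=0.12).to_env())` would apply a lower residual difference threshold before the engine is started in the same process.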

Cache-DiT Secondary Transformer

For dual-transformer models (e.g., Wan2.2 with high/low-noise experts), these variables configure caching for the secondary transformer. Each falls back to its primary counterpart if not set.

| Environment Variable | Default | Description |
| --- | --- | --- |
| `SGLANG_CACHE_DIT_SECONDARY_FN` | (from primary) | First N blocks to always compute |
| `SGLANG_CACHE_DIT_SECONDARY_BN` | (from primary) | Last N blocks to always compute |
| `SGLANG_CACHE_DIT_SECONDARY_WARMUP` | (from primary) | Warmup steps before caching |
| `SGLANG_CACHE_DIT_SECONDARY_RDT` | (from primary) | Residual difference threshold |
| `SGLANG_CACHE_DIT_SECONDARY_MC` | (from primary) | Max continuous cached steps |
| `SGLANG_CACHE_DIT_SECONDARY_TAYLORSEER` | (from primary) | Enable TaylorSeer calibrator |
| `SGLANG_CACHE_DIT_SECONDARY_TS_ORDER` | (from primary) | TaylorSeer order (1 or 2) |
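
The fallback rule described above (secondary value if set, otherwise the primary value, otherwise the primary default) can be previewed with a few lines of Python. This is a sketch of the documented behavior, not SGLang's own resolution code; the function name `resolve_secondary` is hypothetical, and treating an empty string as "set" is an assumption.

```python
import os

# Suffixes shared by the primary and secondary variables.
SUFFIXES = ("FN", "BN", "WARMUP", "RDT", "MC", "TAYLORSEER", "TS_ORDER")

# Primary defaults copied from the Cache-DiT Configuration table.
PRIMARY_DEFAULTS = {
    "FN": "1", "BN": "0", "WARMUP": "4", "RDT": "0.24",
    "MC": "3", "TAYLORSEER": "false", "TS_ORDER": "1",
}


def resolve_secondary(env=None):
    """Return the effective secondary-transformer settings as strings."""
    env = os.environ if env is None else env
    resolved = {}
    for suffix in SUFFIXES:
        # Step 1: primary value, or its documented default.
        primary = env.get(f"SGLANG_CACHE_DIT_{suffix}", PRIMARY_DEFAULTS[suffix])
        # Step 2: secondary value wins when present, else fall back to primary.
        resolved[suffix] = env.get(f"SGLANG_CACHE_DIT_SECONDARY_{suffix}", primary)
    return resolved
```

With `SGLANG_CACHE_DIT_RDT=0.12` and `SGLANG_CACHE_DIT_SECONDARY_WARMUP=8` set, the sketch resolves the secondary threshold to the primary's `0.12` and the secondary warmup to `8`, while every other entry takes the primary default.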

Cloud Storage

These variables configure S3-compatible cloud storage for automatically uploading generated images and videos.

| Environment Variable | Default | Description |
| --- | --- | --- |
| `SGLANG_CLOUD_STORAGE_TYPE` | not set | Set to `s3` to enable cloud storage |
| `SGLANG_S3_BUCKET_NAME` | not set | The name of the S3 bucket |
| `SGLANG_S3_ENDPOINT_URL` | not set | Custom endpoint URL (for MinIO, OSS, etc.) |
| `SGLANG_S3_REGION_NAME` | `us-east-1` | AWS region name |
| `SGLANG_S3_ACCESS_KEY_ID` | not set | AWS Access Key ID |
| `SGLANG_S3_SECRET_ACCESS_KEY` | not set | AWS Secret Access Key |
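
Before enabling uploads, it can help to check that the storage variables are consistent. The sketch below reports which variables look unset once `SGLANG_CLOUD_STORAGE_TYPE` is `s3`. The function `s3_preflight` is hypothetical, and the choice of which variables count as required is an assumption: the table does not say which ones SGLang itself insists on.

```python
import os

# Assumption: bucket name and both credentials are needed once storage is
# enabled. The endpoint is optional and the region has a documented default.
ASSUMED_REQUIRED = (
    "SGLANG_S3_BUCKET_NAME",
    "SGLANG_S3_ACCESS_KEY_ID",
    "SGLANG_S3_SECRET_ACCESS_KEY",
)


def s3_preflight(env=None):
    """Return (enabled, missing) for the cloud storage variables."""
    env = os.environ if env is None else env
    enabled = env.get("SGLANG_CLOUD_STORAGE_TYPE") == "s3"
    if not enabled:
        return False, []
    # Empty strings are reported as missing along with unset variables.
    missing = [name for name in ASSUMED_REQUIRED if not env.get(name)]
    return True, missing
```

Calling `s3_preflight()` at startup and logging the `missing` list is one way to surface a misconfiguration before the first upload is attempted.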

CUDA Crash Debugging

These variables enable kernel API logging and optional input/output dumps around diffusion CUDA kernel call boundaries. They are useful when tracking down CUDA crashes such as illegal memory access, device-side assert, or shape mismatches in custom kernels.

| Environment Variable | Default | Description |
| --- | --- | --- |
| `SGLANG_KERNEL_API_LOGLEVEL` | `0` | Controls crash-debug kernel API logging. `1` logs API names, `3` logs tensor metadata, `5` adds tensor statistics, and `10` also writes dump snapshots. |
| `SGLANG_KERNEL_API_LOGDEST` | `stdout` | Destination for crash-debug kernel API logs. Use `stdout`, `stderr`, or a file path. `%i` is replaced with the process PID. |
| `SGLANG_KERNEL_API_DUMP_DIR` | `sglang_kernel_api_dumps` | Output directory for level-10 kernel API dumps. `%i` is replaced with the process PID. |
| `SGLANG_KERNEL_API_DUMP_INCLUDE` | not set | Comma-separated wildcard patterns for kernel API names to include in level-10 dumps. |
| `SGLANG_KERNEL_API_DUMP_EXCLUDE` | not set | Comma-separated wildcard patterns for kernel API names to exclude from level-10 dumps. |
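
To get a feel for the `%i` placeholder and the include/exclude patterns before a long debugging run, the sketch below emulates both with the standard library. It is an approximation under stated assumptions: shell-style matching via `fnmatch`, an empty include list meaning "everything", and exclude taking precedence over include are all guesses about the semantics, and the kernel API names in the usage note are hypothetical.

```python
import fnmatch
import os


def expand_pid(template, pid=None):
    """Replace %i with a PID, as described for LOGDEST and DUMP_DIR."""
    return template.replace("%i", str(os.getpid() if pid is None else pid))


def split_patterns(value):
    """Split a comma-separated pattern string, dropping empty entries."""
    return [p.strip() for p in (value or "").split(",") if p.strip()]


def would_dump(api_name, include="", exclude=""):
    """Guess whether a kernel API name would be selected for a level-10 dump."""
    inc = split_patterns(include)
    exc = split_patterns(exclude)
    # Assumption: no include patterns means every API name is a candidate.
    included = not inc or any(fnmatch.fnmatchcase(api_name, p) for p in inc)
    # Assumption: an exclude match always wins over an include match.
    excluded = any(fnmatch.fnmatchcase(api_name, p) for p in exc)
    return included and not excluded
```

For instance, `expand_pid("/tmp/kernel_api_%i.log", pid=1234)` yields `/tmp/kernel_api_1234.log`, and `would_dump("attention_forward", include="attention_*", exclude="*_forward")` is `False` under the exclude-wins assumption.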