This document describes the attention backends available in sglang diffusion (sglang.multimodal_gen) and how to select them.

Overview

Attention backends are defined by AttentionBackendEnum (sglang.multimodal_gen.runtime.platforms.interface.AttentionBackendEnum) and selected via the CLI flag --attention-backend. Backend selection is performed by the shared attention layers (e.g. LocalAttention / USPAttention / UlyssesAttention in sglang.multimodal_gen.runtime.layers.attention.layer) and therefore applies to any model component that uses these layers (e.g. the diffusion transformer / DiT and the encoders). When using the diffusers backend, --attention-backend is passed through to diffusers’ set_attention_backend (e.g., flash, _flash_3_hub, sage, xformers, native).

The auto-selected default per platform is:
  • CUDA: prefers FlashAttention (FA3/FA4) when supported; otherwise falls back to PyTorch SDPA.
  • ROCm: uses FlashAttention when available; otherwise falls back to PyTorch SDPA.
  • MPS: always uses PyTorch SDPA.
  • NPU: always uses PyTorch SDPA.
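If --attention-backend is not given, auto selection picks the platform default described above. A minimal sketch, with <MODEL_PATH_OR_ID> as a placeholder:

# No --attention-backend: the platform default is auto-selected
# (FlashAttention on supported CUDA/ROCm setups, torch_sdpa otherwise).
sglang generate \
  --model-path <MODEL_PATH_OR_ID> \
  --prompt "..."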

Backend options

For SGLang-native pipelines, the CLI accepts the lowercase names of AttentionBackendEnum. The table below lists the backends implemented by the built-in platforms. fa3/fa4 are accepted as aliases for fa.
| CLI value | Enum value | Notes |
| --- | --- | --- |
| fa / fa3 / fa4 | FA | FlashAttention. fa3/fa4 are normalized to fa during argument parsing (ServerArgs.__post_init__). |
| torch_sdpa | TORCH_SDPA | PyTorch scaled_dot_product_attention. |
| sliding_tile_attn | SLIDING_TILE_ATTN | Sliding Tile Attention (STA). Requires st_attn. Configure via --attention-backend-config. |
| sage_attn | SAGE_ATTN | Requires sageattention. Upstream SageAttention CUDA extensions target SM80/SM86/SM89/SM90/SM120 (compute capability 8.0/8.6/8.9/9.0/12.0); see upstream setup.py: https://github.com/thu-ml/SageAttention/blob/main/setup.py. |
| sage_attn_3 | SAGE_ATTN_3 | Requires SageAttention3 installed per upstream instructions. |
| video_sparse_attn | VIDEO_SPARSE_ATTN | Requires vsa. Configure sparsity via --attention-backend-config. |
| vmoba_attn | VMOBA_ATTN | Requires kernel.attn.vmoba_attn.vmoba. Configure via --attention-backend-config. |
| aiter | AITER | Requires aiter. |
| sparse_video_gen_2_attn | SPARSE_VIDEO_GEN_2_ATTN | Requires svg. See installation instructions at https://github.com/svg-project/Sparse-VideoGen. |
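Selecting one of the optional backends is just a matter of passing its CLI value, provided the listed dependency is installed. A sketch using sage_attn as the example:

# Requires the sageattention package and a GPU covered by the upstream build targets.
sglang generate \
  --model-path <MODEL_PATH_OR_ID> \
  --prompt "..." \
  --attention-backend sage_attn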

Selection priority

The selection order in runtime/layers/attention/selector.py is:
  1. global_force_attn_backend(...) / global_force_attn_backend_context_manager(...)
  2. CLI --attention-backend (ServerArgs.attention_backend)
  3. Auto selection (platform capability, dtype, and installed packages)

Configuration

Some backends require additional configuration. You can pass these parameters via --attention-backend-config. This argument accepts:
  • A path to a JSON or YAML configuration file.
  • A JSON string (e.g., '{"sparsity": 0.5}').
  • Key-value pairs (e.g., "sparsity=0.5,enable_x=true").
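The sketches below show the same option in each of the three forms, using video_sparse_attn as the backend; the file path and the sparsity value are illustrative only:

# 1) Path to a JSON or YAML configuration file (illustrative path)
sglang generate \
  --model-path <MODEL_PATH_OR_ID> \
  --prompt "..." \
  --attention-backend video_sparse_attn \
  --attention-backend-config /abs/path/to/vsa_config.yaml

# 2) Inline JSON string
sglang generate \
  --model-path <MODEL_PATH_OR_ID> \
  --prompt "..." \
  --attention-backend video_sparse_attn \
  --attention-backend-config '{"sparsity": 0.5}'

# 3) Key-value pairs
sglang generate \
  --model-path <MODEL_PATH_OR_ID> \
  --prompt "..." \
  --attention-backend video_sparse_attn \
  --attention-backend-config "sparsity=0.5"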

Supported configuration parameters

Sliding Tile Attention (sliding_tile_attn)
| Parameter | Type | Description | Default |
| --- | --- | --- | --- |
| mask_strategy_file_path | str | Required. Path to the mask strategy JSON file. | - |
| sta_mode | str | Mode of STA. | STA_inference |
| skip_time_steps | int | Number of steps to use full attention before switching to sparse attention. | 15 |
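A sketch passing all three STA parameters as a single JSON string; apart from mask_strategy_file_path (which is required), the values shown are simply the documented defaults:

sglang generate \
  --model-path <MODEL_PATH_OR_ID> \
  --prompt "..." \
  --attention-backend sliding_tile_attn \
  --attention-backend-config '{"mask_strategy_file_path": "/abs/path/to/mask_strategy.json", "sta_mode": "STA_inference", "skip_time_steps": 15}'
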
Video Sparse Attention (video_sparse_attn)
| Parameter | Type | Description | Default |
| --- | --- | --- | --- |
| sparsity | float | Validation sparsity (0.0 - 1.0). | 0.0 |
V-MoBA (vmoba_attn)
| Parameter | Type | Description | Default |
| --- | --- | --- | --- |
| temporal_chunk_size | int | Chunk size for temporal dimension. | - |
| temporal_topk | int | Top-K tokens to select in temporal dimension. | - |
| spatial_chunk_size | list[int] | Chunk size for spatial dimension (H, W). | - |
| spatial_topk | int | Top-K tokens to select in spatial dimension. | - |
| st_chunk_size | list[int] | Chunk size for spatiotemporal dimension (T, H, W). | - |
| st_topk | int | Top-K tokens to select in spatiotemporal dimension. | - |
| moba_select_mode | str | Selection mode (e.g., threshold). | threshold |
| moba_threshold | float | Threshold value for selection. | 0.25 |
| moba_threshold_type | str | Type of thresholding (e.g., query_head). | query_head |
| first_full_step | int | Number of initial steps to use full attention. | 12 |
| first_full_layer | int | Number of initial layers to use full attention. | 0 |
| temporal_layer | int | Number of temporal layers. | 1 |
| spatial_layer | int | Number of spatial layers. | 1 |
| st_layer | int | Number of spatiotemporal layers. | 1 |
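For backends with this many parameters, a configuration file can be more convenient than inline key-value pairs. A sketch for V-MoBA using a YAML file; all parameter values below are illustrative, not tuned recommendations:

# vmoba_config.yaml (illustrative values)
cat > vmoba_config.yaml << 'EOF'
temporal_chunk_size: 4
temporal_topk: 2
spatial_chunk_size: [8, 8]
spatial_topk: 4
moba_select_mode: threshold
moba_threshold: 0.25
EOF

sglang generate \
  --model-path <MODEL_PATH_OR_ID> \
  --prompt "..." \
  --attention-backend vmoba_attn \
  --attention-backend-config vmoba_config.yaml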

Platform support matrix

| Backend | CUDA | ROCm | MPS | NPU | Notes |
| --- | --- | --- | --- | --- | --- |
| fa | Yes | Yes | No | No | CUDA requires SM80+ and fp16/bf16. FlashAttention is only used when the required runtime is installed; otherwise it falls back to torch_sdpa. |
| torch_sdpa | Yes | Yes | Yes | Yes | Most compatible option across platforms. |
| sliding_tile_attn | Yes | No | No | No | CUDA-only. Requires st_attn. Configure via --attention-backend-config. |
| sage_attn | Yes | No | No | No | CUDA-only (optional dependency). |
| sage_attn_3 | Yes | No | No | No | CUDA-only (optional dependency). |
| video_sparse_attn | Yes | No | No | No | CUDA-only. Requires vsa. Configure sparsity via --attention-backend-config. |
| vmoba_attn | Yes | No | No | No | CUDA-only. Requires kernel.attn.vmoba_attn.vmoba. Configure via --attention-backend-config. |
| aiter | Yes | No | No | No | Requires aiter. |
| sparse_video_gen_2_attn | Yes | No | No | No | CUDA-only. Requires svg. |

Usage

Select a backend via CLI

# FlashAttention (fa3/fa4 are accepted aliases)
sglang generate \
  --model-path <MODEL_PATH_OR_ID> \
  --prompt "..." \
  --attention-backend fa

# PyTorch scaled_dot_product_attention
sglang generate \
  --model-path <MODEL_PATH_OR_ID> \
  --prompt "..." \
  --attention-backend torch_sdpa

Using Sliding Tile Attention (STA)

# Pass the mask strategy file path via config
sglang generate \
  --model-path <MODEL_PATH_OR_ID> \
  --prompt "..." \
  --attention-backend sliding_tile_attn \
  --attention-backend-config "mask_strategy_file_path=/abs/path/to/mask_strategy.json"

Notes for ROCm / MPS

  • ROCm: use --attention-backend torch_sdpa or fa depending on what is available in your environment.
  • MPS: the platform implementation always uses torch_sdpa.