The sglang CLI provides two main subcommands for diffusion inference:
  • sglang generate — run a one-off generation without a persistent server
  • sglang serve — launch the OpenAI-compatible HTTP server

Prerequisites

A working SGLang Diffusion installation with the sglang CLI available in your $PATH. See the installation guide for setup instructions.

Generate

Run a one-off generation task without launching a persistent server. Pass both server arguments and sampling parameters after the generate subcommand:
SERVER_ARGS=(
  --model-path Wan-AI/Wan2.2-T2V-A14B-Diffusers
  --text-encoder-cpu-offload
  --pin-cpu-memory
  --num-gpus 4
  --ulysses-degree=2
  --ring-degree=2
)

SAMPLING_ARGS=(
  --prompt "A curious raccoon"
  --save-output
  --output-path outputs
  --output-file-name "A curious raccoon.mp4"
)

sglang generate "${SERVER_ARGS[@]}" "${SAMPLING_ARGS[@]}"
You can also enable Cache-DiT acceleration via an environment variable:
SGLANG_CACHE_DIT_ENABLED=true sglang generate "${SERVER_ARGS[@]}" "${SAMPLING_ARGS[@]}"
HTTP server-related arguments are ignored in generate mode. The process shuts down automatically once generation completes.

Serve

Launch the SGLang Diffusion HTTP server and interact through the OpenAI-compatible API.
SERVER_ARGS=(
  --model-path Wan-AI/Wan2.1-T2V-1.3B-Diffusers
  --text-encoder-cpu-offload
  --pin-cpu-memory
  --num-gpus 4
  --ulysses-degree=2
  --ring-degree=2
)

sglang serve "${SERVER_ARGS[@]}"
  • --model-path — which model to load (e.g. Wan-AI/Wan2.1-T2V-1.3B-Diffusers)
  • --port — HTTP port to listen on (default: 30010)
For full API usage including image/video generation and LoRA management, see the OpenAI API documentation.

Supported arguments

Server arguments

| Argument | Description |
|---|---|
| --model-path MODEL_PATH | Path to the model or HuggingFace model ID |
| --lora-path LORA_PATH | Path to a LoRA adapter (local or HuggingFace ID). If omitted, LoRA is not applied |
| --lora-nickname NAME | Nickname for the LoRA adapter (default: default) |
| --num-gpus NUM | Number of GPUs to use |
| --tp-size SIZE | Tensor parallelism size (encoder only; keep at most 1 when text encoder offload is enabled) |
| --sp-degree SIZE | Sequence parallelism size (typically matches the number of GPUs) |
| --ulysses-degree SIZE | DeepSpeed-Ulysses-style SP degree in USP |
| --ring-degree SIZE | Ring-attention-style SP degree in USP |
| --attention-backend BACKEND | Attention backend. Native pipelines: fa, torch_sdpa, sage_attn, etc. Diffusers pipelines: flash, _flash_3_hub, sage, xformers |
| --attention-backend-config CONFIG | Config for the attention backend. Accepts a JSON string, a JSON/YAML file path, or key=value pairs |
| --cache-dit-config PATH | Path to a Cache-DiT YAML/JSON config (diffusers backend only) |
| --dit-precision DTYPE | Precision for the DiT model (fp32, fp16, bf16) |
| --text-encoder-cpu-offload | Offload text encoders to CPU |
| --pin-cpu-memory | Pin CPU memory for faster transfers |

Sampling parameters

| Argument | Description |
|---|---|
| --prompt PROMPT | Text description of the image or video to generate |
| --negative-prompt PROMPT | Negative prompt to steer generation away from certain concepts |
| --num-inference-steps STEPS | Number of denoising steps |
| --seed SEED | Random seed for reproducible generation |
| --height HEIGHT | Height of the generated output |
| --width WIDTH | Width of the generated output |
| --num-frames NUM | Number of frames to generate (video only) |
| --fps FPS | Frames per second for the saved output (video only) |
| --save-output | Save the image or video to disk |
| --output-path PATH | Directory to save the generated output |
| --output-file-name NAME | File name for the saved output |
| --return-frames | Return the raw frames instead of saving |

Frame interpolation (video only)

Frame interpolation is a post-processing step that synthesizes new frames between each pair of consecutive generated frames, producing smoother motion without re-running the diffusion model. The --frame-interpolation-exp flag controls how many rounds of interpolation to apply. Each round inserts one new frame into every gap between adjacent frames, so the output frame count follows the formula:

output frames = (N − 1) × 2^exp + 1

For example, 5 original frames with exp=1 give 4 gaps × 1 new frame + 5 originals = 9 frames; with exp=2, 17 frames.
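The frame-count formula can be checked with a quick sketch (plain Python, independent of the CLI; the function name is ours, not part of SGLang):

```python
def interpolated_frame_count(n_frames: int, exp: int) -> int:
    """Frames after `exp` rounds of pairwise interpolation.

    Each round inserts one new frame into every gap between adjacent
    frames, doubling the gap count: the N - 1 original gaps become
    (N - 1) * 2**exp intervals, plus one for the first frame.
    """
    return (n_frames - 1) * 2**exp + 1

print(interpolated_frame_count(5, 1))  # 9
print(interpolated_frame_count(5, 2))  # 17
```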
| Argument | Description |
|---|---|
| --enable-frame-interpolation | Enable frame interpolation. Model weights are downloaded automatically on first use |
| --frame-interpolation-exp EXP | Interpolation exponent: 1 = 2x temporal resolution, 2 = 4x, etc. (default: 1) |
| --frame-interpolation-scale SCALE | RIFE inference scale; use 0.5 for high-resolution inputs to save memory (default: 1.0) |
| --frame-interpolation-model-path PATH | Local directory or HuggingFace repo ID containing RIFE flownet.pkl weights (default: elfgum/RIFE-4.22.lite, downloaded automatically) |
Example — generate a 5-frame video and interpolate to 9 frames ((5 − 1) × 2^1 + 1 = 9):
sglang generate \
  --model-path Wan-AI/Wan2.2-T2V-A14B-Diffusers \
  --prompt "A dog running through a park" \
  --num-frames 5 \
  --enable-frame-interpolation \
  --frame-interpolation-exp 1 \
  --save-output

Configuration files

Instead of passing every parameter on the command line, you can use a JSON or YAML config file. Command-line arguments take precedence over config values.
sglang generate --config config.json
config.json
{
  "model_path": "FastVideo/FastHunyuan-diffusers",
  "prompt": "A beautiful woman in a red dress walking down a street",
  "output_path": "outputs/",
  "num_gpus": 2,
  "sp_size": 2,
  "tp_size": 1,
  "num_frames": 45,
  "height": 720,
  "width": 1280,
  "num_inference_steps": 6,
  "seed": 1024,
  "fps": 24,
  "precision": "bf16",
  "vae_precision": "fp16",
  "vae_tiling": true,
  "vae_sp": true,
  "vae_config": {
    "load_encoder": false,
    "load_decoder": true,
    "tile_sample_min_height": 256,
    "tile_sample_min_width": 256
  },
  "text_encoder_precisions": ["fp16", "fp16"],
  "mask_strategy_file_path": null,
  "enable_torch_compile": false
}
To see all available options:
sglang generate --help
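The precedence rule (command-line arguments over config values) can be sketched as a simple dictionary merge — a hypothetical illustration of the behavior, not SGLang's actual loader:

```python
import json

def merge_args(config_path: str, cli_args: dict) -> dict:
    """Load a JSON config, then let explicitly passed CLI arguments win."""
    with open(config_path) as f:
        merged = json.load(f)
    # Only flags the user actually passed (non-None here) override config values.
    merged.update({k: v for k, v in cli_args.items() if v is not None})
    return merged
```

With the config above, passing --num-inference-steps 8 on the command line would override the file's value of 6, while unset flags keep their config values.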

Component path overrides

You can override any pipeline component (e.g. vae, transformer, text_encoder) by specifying a custom checkpoint path with --<component>-path, where <component> matches the key in the model’s model_index.json.

Example: FLUX.2-dev with Tiny AutoEncoder

Replace the default VAE with a distilled tiny autoencoder for ~3x faster decoding:
sglang serve \
  --model-path=black-forest-labs/FLUX.2-dev \
  --vae-path=fal/FLUX.2-Tiny-AutoEncoder
You can also use a local path:
sglang serve \
  --model-path=black-forest-labs/FLUX.2-dev \
  --vae-path=~/.cache/huggingface/hub/models--fal--FLUX.2-Tiny-AutoEncoder/snapshots/.../vae
The component key must match the one in the model’s model_index.json (e.g. vae). The path must be either a HuggingFace repo ID or point to a complete component folder containing config.json and safetensors files.
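To find the valid component keys for a given model, you can read its model_index.json directly; a minimal sketch (the helper name and example keys are illustrative):

```python
import json
from pathlib import Path

def component_keys(model_dir: str) -> list[str]:
    """List the overridable component keys from a model's model_index.json."""
    index = json.loads(Path(model_dir, "model_index.json").read_text())
    # Component entries are the non-underscore keys; keys like "_class_name"
    # describe the pipeline itself rather than a swappable component.
    return sorted(k for k in index if not k.startswith("_"))
```

Each returned key (e.g. vae, transformer) corresponds to a --&lt;component&gt;-path override flag.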

Diffusers backend

SGLang Diffusion supports a diffusers backend that runs any diffusers-compatible model through SGLang’s infrastructure using vanilla diffusers pipelines. This is useful for models without native SGLang implementations or models with custom pipeline classes.

Backend arguments

| Argument | Values | Description |
|---|---|---|
| --backend | auto (default), sglang, diffusers | auto: prefer native SGLang, fall back to diffusers. sglang: force native (fails if unavailable). diffusers: force the vanilla diffusers pipeline |
| --diffusers-attention-backend | flash, _flash_3_hub, sage, xformers, native | Attention backend for diffusers pipelines |
| --trust-remote-code | flag | Required for models with custom pipeline classes |
| --vae-tiling | flag | Enable VAE tiling for large image support (decodes tile by tile) |
| --vae-slicing | flag | Enable VAE slicing for lower memory usage (decodes slice by slice) |
| --dit-precision | fp16, bf16, fp32 | Precision for the diffusion transformer |
| --vae-precision | fp16, bf16, fp32 | Precision for the VAE |

Example: running Ovis-Image-7B

Ovis-Image-7B is a 7B text-to-image model optimized for high-quality text rendering.
sglang generate \
  --model-path AIDC-AI/Ovis-Image-7B \
  --backend diffusers \
  --trust-remote-code \
  --diffusers-attention-backend flash \
  --prompt "A serene Japanese garden with cherry blossoms" \
  --height 1024 \
  --width 1024 \
  --num-inference-steps 30 \
  --save-output \
  --output-path outputs \
  --output-file-name ovis_garden.png

Extra diffusers arguments

For pipeline-specific parameters not exposed via CLI, use diffusers_kwargs in a config file:
config.json
{
  "model_path": "AIDC-AI/Ovis-Image-7B",
  "backend": "diffusers",
  "prompt": "A beautiful landscape",
  "diffusers_kwargs": {
    "cross_attention_kwargs": {"scale": 0.5}
  }
}
sglang generate --config config.json

Cache-DiT acceleration

Users on the diffusers backend can leverage Cache-DiT acceleration by loading custom cache configs from a YAML file. See the Cache-DiT documentation for details.

Cloud storage support

The server supports automatically uploading generated artifacts to S3-compatible cloud storage (AWS S3, MinIO, Alibaba Cloud OSS, Tencent Cloud COS). The workflow is: Generate -> Upload -> Delete local file. The API response returns the public URL of the uploaded object.
  1. Install boto3
pip install boto3
  2. Set environment variables
export SGLANG_CLOUD_STORAGE_TYPE=s3
export SGLANG_S3_BUCKET_NAME=my-bucket
export SGLANG_S3_ACCESS_KEY_ID=your-access-key
export SGLANG_S3_SECRET_ACCESS_KEY=your-secret-key

# Optional: custom endpoint for MinIO/OSS/COS
export SGLANG_S3_ENDPOINT_URL=https://minio.example.com
  3. Launch the server
sglang serve --model-path MODEL_PATH
See the environment variables reference for all storage-related variables.
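The generate -> upload -> delete-local workflow can be sketched against any S3-compatible client (a duck-typed illustration; the real server uses boto3 configured from the variables above, and the URL format shown is the default AWS virtual-hosted style — custom endpoints differ):

```python
import os

def upload_and_clean(client, bucket: str, local_path: str, key: str) -> str:
    """Upload a generated artifact, delete the local copy, return its URL.

    `client` is any object with an S3-style upload_file method
    (e.g. boto3.client("s3")); it is duck-typed here for clarity.
    """
    client.upload_file(local_path, bucket, key)
    os.remove(local_path)  # local file is removed only after a successful upload
    # Assumed virtual-hosted-style URL; MinIO/OSS/COS endpoints are shaped differently.
    return f"https://{bucket}.s3.amazonaws.com/{key}"
```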