The `sglang` CLI provides two main subcommands for diffusion inference:

- `sglang generate` — run a one-off generation without a persistent server
- `sglang serve` — launch the OpenAI-compatible HTTP server
Prerequisites
A working SGLang Diffusion installation with the `sglang` CLI available in your `$PATH`. See the installation guide for setup instructions.
Generate
Run a one-off generation task without launching a persistent server. Pass both server arguments and sampling parameters after the `generate` subcommand:
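A minimal sketch of a one-off generation. The prompt and sampling values are illustrative, not required defaults; every flag used here is documented in the tables below:

```shell
sglang generate \
  --model-path Wan-AI/Wan2.1-T2V-1.3B-Diffusers \
  --prompt "A red panda climbing a snowy tree" \
  --num-inference-steps 30 \
  --seed 42 \
  --save-output \
  --output-path ./outputs
```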
HTTP server-related arguments are ignored in `generate` mode. The process shuts down automatically once generation completes.

Serve
Launch the SGLang Diffusion HTTP server and interact through the OpenAI-compatible API.

- `--model-path` — which model to load (e.g. `Wan-AI/Wan2.1-T2V-1.3B-Diffusers`)
- `--port` — HTTP port to listen on (default: `30010`)
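A launch sketch using only the two flags listed above:

```shell
sglang serve \
  --model-path Wan-AI/Wan2.1-T2V-1.3B-Diffusers \
  --port 30010
```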
Supported arguments
Server arguments
Server arguments reference
| Argument | Description |
|---|---|
| `--model-path MODEL_PATH` | Path to the model or HuggingFace model ID |
| `--lora-path LORA_PATH` | Path to a LoRA adapter (local or HuggingFace ID). If omitted, LoRA is not applied |
| `--lora-nickname NAME` | Nickname for the LoRA adapter (default: `default`) |
| `--num-gpus NUM` | Number of GPUs to use |
| `--tp-size SIZE` | Tensor parallelism size (encoder only; keep at most 1 when text encoder offload is enabled) |
| `--sp-degree SIZE` | Sequence parallelism size (typically should match the number of GPUs) |
| `--ulysses-degree SIZE` | DeepSpeed-Ulysses-style SP degree in USP |
| `--ring-degree SIZE` | Ring-attention-style SP degree in USP |
| `--attention-backend BACKEND` | Attention backend. Native pipelines: `fa`, `torch_sdpa`, `sage_attn`, etc. Diffusers pipelines: `flash`, `_flash_3_hub`, `sage`, `xformers` |
| `--attention-backend-config CONFIG` | Config for the attention backend. Accepts a JSON string, a JSON/YAML file path, or `key=value` pairs |
| `--cache-dit-config PATH` | Path to a Cache-DiT YAML/JSON config (diffusers backend only) |
| `--dit-precision DTYPE` | Precision for the DiT model (`fp32`, `fp16`, `bf16`) |
| `--text-encoder-cpu-offload` | Offload text encoders to CPU |
| `--pin-cpu-memory` | Pin CPU memory for faster transfers |
Sampling parameters
Generation parameters
| Argument | Description |
|---|---|
| `--prompt PROMPT` | Text description for the image or video to generate |
| `--negative-prompt PROMPT` | Negative prompt to guide generation away from certain concepts |
| `--num-inference-steps STEPS` | Number of denoising steps |
| `--seed SEED` | Random seed for reproducible generation |
Image/video configuration
| Argument | Description |
|---|---|
| `--height HEIGHT` | Height of the generated output |
| `--width WIDTH` | Width of the generated output |
| `--num-frames NUM` | Number of frames to generate (video only) |
| `--fps FPS` | Frames per second for the saved output (video only) |
Output options
| Argument | Description |
|---|---|
| `--save-output` | Save the image or video to disk |
| `--output-path PATH` | Directory to save the generated output |
| `--output-file-name NAME` | File name for the saved output |
| `--return-frames` | Return the raw frames instead of saving |
Frame interpolation (video only)
Frame interpolation is a post-processing step that synthesizes new frames between each pair of consecutive generated frames, producing smoother motion without re-running the diffusion model. The `--frame-interpolation-exp` flag controls how many rounds of interpolation to apply: each round inserts one new frame into every gap between adjacent frames, so for N generated frames the output frame count is (N - 1) * 2^exp + 1.

For example, 5 original frames with exp=1 -> 4 gaps x 1 new frame + 5 originals = 9 frames; with exp=2 -> 17 frames.
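The frame-count arithmetic can be checked with a line of shell; this just evaluates (N - 1) * 2^exp + 1 for the example above:

```shell
# Frames after interpolation: (N - 1) * 2^exp + 1
frames=5
for exp in 1 2 3; do
  echo "exp=$exp -> $(( (frames - 1) * (1 << exp) + 1 )) frames"
done
# exp=1 -> 9, exp=2 -> 17, exp=3 -> 33
```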
| Argument | Description |
|---|---|
| `--enable-frame-interpolation` | Enable frame interpolation. Model weights are downloaded automatically on first use |
| `--frame-interpolation-exp EXP` | Interpolation exponent — 1 = 2x temporal resolution, 2 = 4x, etc. (default: 1) |
| `--frame-interpolation-scale SCALE` | RIFE inference scale; use 0.5 for high-resolution inputs to save memory (default: 1.0) |
| `--frame-interpolation-model-path PATH` | Local directory or HuggingFace repo ID containing RIFE `flownet.pkl` weights (default: `elfgum/RIFE-4.22.lite`, downloaded automatically) |
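A sketch combining generation with interpolation, using only flags from the tables above; the frame count and fps here are illustrative (33 generated frames with exp=1 would yield 65 output frames):

```shell
sglang generate \
  --model-path Wan-AI/Wan2.1-T2V-1.3B-Diffusers \
  --prompt "Waves rolling onto a beach at sunset" \
  --num-frames 33 \
  --fps 32 \
  --enable-frame-interpolation \
  --frame-interpolation-exp 1 \
  --save-output
```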
Configuration files
Instead of passing every parameter on the command line, you can use a JSON or YAML config file. Command-line arguments take precedence over config values.
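A hypothetical `config.json` sketch. It assumes keys mirror the CLI flag names with dashes replaced by underscores; that key format, and the flag used to load the file, are assumptions not confirmed by this page, so consult the CLI help:

```json
{
  "model_path": "Wan-AI/Wan2.1-T2V-1.3B-Diffusers",
  "prompt": "A red panda climbing a snowy tree",
  "num_inference_steps": 30,
  "seed": 42,
  "save_output": true,
  "output_path": "./outputs"
}
```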
Component path overrides
You can override any pipeline component (e.g. `vae`, `transformer`, `text_encoder`) by specifying a custom checkpoint path with `--<component>-path`, where `<component>` matches the key in the model’s `model_index.json`.
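A sketch of the pattern, assuming a model whose `model_index.json` has a `vae` key; the checkpoint path is a placeholder:

```shell
sglang generate \
  --model-path Wan-AI/Wan2.1-T2V-1.3B-Diffusers \
  --vae-path /path/to/custom/vae \
  --prompt "A red panda climbing a snowy tree"
```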
Example: FLUX.2-dev with Tiny AutoEncoder
Replace the default VAE with a distilled tiny autoencoder for ~3x faster decoding:

Diffusers backend

SGLang Diffusion supports a diffusers backend that runs any diffusers-compatible model through SGLang’s infrastructure using vanilla diffusers pipelines. This is useful for models without native SGLang implementations or models with custom pipeline classes.

Backend arguments
| Argument | Values | Description |
|---|---|---|
| `--backend` | `auto` (default), `sglang`, `diffusers` | `auto`: prefer native SGLang, fall back to diffusers. `sglang`: force native (fails if unavailable). `diffusers`: force vanilla diffusers pipeline |
| `--diffusers-attention-backend` | `flash`, `_flash_3_hub`, `sage`, `xformers`, `native` | Attention backend for diffusers pipelines |
| `--trust-remote-code` | flag | Required for models with custom pipeline classes |
| `--vae-tiling` | flag | Enable VAE tiling for large image support (decodes tile-by-tile) |
| `--vae-slicing` | flag | Enable VAE slicing for lower memory usage (decodes slice-by-slice) |
| `--dit-precision` | `fp16`, `bf16`, `fp32` | Precision for the diffusion transformer |
| `--vae-precision` | `fp16`, `bf16`, `fp32` | Precision for the VAE |
Example: running Ovis-Image-7B
Ovis-Image-7B is a 7B text-to-image model optimized for high-quality text rendering.

Extra diffusers arguments
For pipeline-specific parameters not exposed via the CLI, use `diffusers_kwargs` in a config file:
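A hypothetical `config.json` sketch: the model path is a placeholder, and the nested keys under `diffusers_kwargs` are pipeline-specific values chosen purely for illustration (here, the common diffusers `guidance_scale` parameter):

```json
{
  "model_path": "your-org/your-model",
  "backend": "diffusers",
  "trust_remote_code": true,
  "diffusers_kwargs": {
    "guidance_scale": 5.0
  }
}
```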
Cache-DiT acceleration
Users on the diffusers backend can leverage Cache-DiT acceleration by loading custom cache configs from a YAML file. See the Cache-DiT documentation for details.

Cloud storage support
The server supports automatically uploading generated artifacts to S3-compatible cloud storage (AWS S3, MinIO, Alibaba Cloud OSS, Tencent Cloud COS). The workflow is: generate -> upload -> delete the local file. The API response returns the public URL of the uploaded object.

1. Install boto3
2. Set environment variables
3. Launch the server
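A setup sketch for the three steps. It assumes the standard boto3/AWS credential variables; the exact variable names SGLang reads for the endpoint and bucket are not documented on this page, so treat the values below as placeholders:

```shell
# 1. Install boto3
pip install boto3

# 2. Set credentials (standard boto3/AWS environment variables)
export AWS_ACCESS_KEY_ID=your-access-key
export AWS_SECRET_ACCESS_KEY=your-secret-key

# 3. Launch the server
sglang serve --model-path Wan-AI/Wan2.1-T2V-1.3B-Diffusers --port 30010
```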
