1. Model Introduction
NVIDIA Cosmos3 is a world-generation model family for text-to-image, text-to-video, and image-to-video generation. SGLang Diffusion serves the public generator checkpoints with the native Cosmos3OmniDiffusersPipeline.
| Model | Status | Notes |
|---|---|---|
| nvidia/Cosmos3-Nano | Supported | T2I, T2V, I2V |
| nvidia/Cosmos3-Super | Supported | T2I, T2V, I2V; use multi-GPU for the 64B checkpoint |
| nvidia/Cosmos3-Super-Text2Image | Supported | T2I-specialized checkpoint |
| nvidia/Cosmos3-Super-Image2Video | Supported | I2V-specialized checkpoint |
| nvidia/Cosmos3-Nano-Policy-DROID | Not supported yet | Action/policy model; planned separately from visual generation |
Requests that set generate_sound, action_mode, or video-to-video conditioning fields return a clear error instead of being silently ignored.
2. Installation
Install SGLang with the diffusion dependencies:
Command
cosmos-guardrail downloads gated NVIDIA guardrail weights, so pass a Hugging Face token if your environment needs one. If the package is not installed, SGLang skips Cosmos3 guardrails and logs a warning. To disable Cosmos3 guardrails for local experiments, set SGLANG_DISABLE_COSMOS3_GUARDRAILS=1 before starting the server.
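If you launch the server from a Python script, the guardrail settings above can be prepared as an environment mapping first. This is a minimal sketch, not part of SGLang: `build_server_env` is a hypothetical helper, and `HF_TOKEN` is assumed to be the variable your Hugging Face tooling reads for gated downloads. Only `SGLANG_DISABLE_COSMOS3_GUARDRAILS` comes from the text above.

```python
import os


def build_server_env(hf_token=None, disable_guardrails=False, base_env=None):
    """Return a copy of the environment to use when launching the server.

    Hypothetical helper for illustration; it only prepares variables.
    """
    env = dict(os.environ if base_env is None else base_env)
    if hf_token:
        # Assumption: HF_TOKEN is the variable read for gated weight downloads.
        env["HF_TOKEN"] = hf_token
    if disable_guardrails:
        # Named in the documentation above; intended for local experiments only.
        env["SGLANG_DISABLE_COSMOS3_GUARDRAILS"] = "1"
    return env
```

The returned mapping can be passed as `env=` to `subprocess.run` together with whichever launch command you use, so the variables are set before the server starts.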
3. Serve Cosmos3
Serve Cosmos3-Nano directly from the Hugging Face model ID:
Command
For Cosmos3-Super, split the model across multiple GPUs:
Command
To serve a specialized checkpoint, use the nvidia/Cosmos3-Super-Text2Image and nvidia/Cosmos3-Super-Image2Video checkpoint IDs.
4. OpenAI-Compatible Requests
Text to image
Cosmos3 text-to-image uses /v1/images/generations. The default Cosmos3 image response is b64_json, matching vLLM-Omni’s examples.
Command
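A minimal Python sketch of a text-to-image call, using only the standard library. It assumes the server listens on `http://localhost:30000` (adjust `BASE_URL` to your deployment) and that the response follows the OpenAI images shape, with the encoded image under `data[0].b64_json`. The `size` value and the helper names are illustrative assumptions, not SGLang defaults.

```python
import base64
import json
import urllib.request

BASE_URL = "http://localhost:30000"  # assumption: change to your server address


def build_image_request(prompt, size="1024x1024"):
    """Build the JSON body for POST /v1/images/generations."""
    return {
        "model": "nvidia/Cosmos3-Nano",
        "prompt": prompt,
        "size": size,  # assumption: choose a resolution your checkpoint supports
        "n": 1,
        "response_format": "b64_json",
    }


def decode_first_image(response_body):
    """Return the raw bytes of the first image in a b64_json response."""
    return base64.b64decode(response_body["data"][0]["b64_json"])


def generate_image(prompt):
    """Send the request to the server and return the decoded image bytes."""
    request = urllib.request.Request(
        f"{BASE_URL}/v1/images/generations",
        data=json.dumps(build_image_request(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return decode_first_image(json.load(response))
```

Calling `generate_image("a robot arm stacking blocks")` against a running server returns bytes that can be written to a file opened in `"wb"` mode.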
Text to video
Use /v1/videos to create an asynchronous job, then poll the job and download the completed MP4.
Command
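The create, poll, and download flow can be sketched in Python as follows. This is an illustration under stated assumptions: the job is polled at `/v1/videos/{id}` and the MP4 is fetched from `/v1/videos/{id}/content`, following the OpenAI Videos API shape, and the status strings `completed` and `failed` are assumed as well. Check these against your server before relying on them.

```python
import json
import time
import urllib.request

BASE_URL = "http://localhost:30000"  # assumption: change to your server address


def fetch_job(job_id):
    """GET the job record (path assumed from the OpenAI Videos API shape)."""
    with urllib.request.urlopen(f"{BASE_URL}/v1/videos/{job_id}") as response:
        return json.load(response)


def wait_for_video(fetch, job_id, poll_seconds=5.0, timeout_seconds=1800.0,
                   sleep=time.sleep):
    """Poll fetch(job_id) until the job completes, fails, or times out."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        job = fetch(job_id)
        status = job.get("status")
        if status == "completed":
            return job
        if status == "failed":
            raise RuntimeError(f"video job {job_id} failed: {job.get('error')}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"video job {job_id} still {status!r} at timeout")
        sleep(poll_seconds)


def download_video(job_id, out_path):
    """Stream the finished MP4 to disk (content path is an assumption)."""
    url = f"{BASE_URL}/v1/videos/{job_id}/content"
    with urllib.request.urlopen(url) as response, open(out_path, "wb") as f:
        while chunk := response.read(1 << 20):
            f.write(chunk)
    return out_path
```

Passing the fetch function as an argument keeps the polling loop independent of the HTTP client, so it can be exercised without a running server.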
Image to video
This mirrors the official nvidia/Cosmos3-Nano Hugging Face image-to-video example:
Python
5. Cosmos3 Parameters
Cosmos3 supports the standard SGLang video and image fields such as size, num_frames, fps, num_inference_steps, guidance_scale, negative_prompt, and seed.
Top-level Cosmos3 request fields:
- max_sequence_length: maximum text token length used by the Cosmos3 tokenizer.
- flow_shift: per-request scheduler flow shift. If omitted, SGLang falls back to the server's --flow-shift value, then to the checkpoint scheduler default.
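The fallback order for flow_shift can be pictured as a first-value-that-is-set lookup. The function below is a hypothetical illustration of that precedence, not SGLang's implementation, and the numbers in the usage note are made up.

```python
def resolve_flow_shift(request_value=None, server_flag=None, scheduler_default=None):
    """Return the first value that is set, checking the request first,
    then the --flow-shift server flag, then the checkpoint scheduler default."""
    for value in (request_value, server_flag, scheduler_default):
        if value is not None:
            return value
    return None
```

For example, `resolve_flow_shift(None, 5.0, 3.0)` picks the server flag because the request left the field out, while `resolve_flow_shift(7.0, 5.0, 3.0)` picks the per-request value.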
Fields passed in extra_params for video requests, or extra_args for image requests:
- use_duration_template: whether to append SGLang’s generated duration suffix to video prompts.
- use_resolution_template: accepted for vLLM-Omni request compatibility.
- use_system_prompt: whether to add the Cosmos3 system prompt to the chat template.
- guardrails or use_guardrails: per-request guardrail toggle, available when the server was started with guardrails enabled.
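To show where each group of fields goes in a request body, here is a small Python sketch. `build_cosmos3_body` is a hypothetical helper, and the values in the usage note are placeholders rather than recommended settings.

```python
def build_cosmos3_body(kind, prompt, top_level=None, cosmos3_options=None,
                       model="nvidia/Cosmos3-Nano"):
    """Assemble a JSON-serializable request body.

    kind: "video" (options go under extra_params) or
          "image" (options go under extra_args).
    top_level: standard and top-level Cosmos3 fields, e.g. seed, flow_shift.
    cosmos3_options: flags such as use_system_prompt or guardrails.
    """
    if kind not in ("video", "image"):
        raise ValueError(f"unknown request kind: {kind!r}")
    body = {"model": model, "prompt": prompt}
    body.update(top_level or {})
    if cosmos3_options:
        key = "extra_params" if kind == "video" else "extra_args"
        body[key] = dict(cosmos3_options)
    return body
```

A video request might be built with `build_cosmos3_body("video", "a drone shot of a harbor", top_level={"num_frames": 49, "flow_shift": 5.0}, cosmos3_options={"use_duration_template": False})`. The same call with `"image"` places the options under `extra_args` instead.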
