Welcome to SGLang - SGLang Documentation

Star Fork

Performance & Runtime

Designed for low-latency, high-throughput inference with RadixAttention, prefix caching, and multi-GPU parallelism.

Models & Ecosystem

Broad support for Llama, Qwen, DeepSeek, and more. Compatible with Hugging Face and OpenAI APIs.

Extensive Hardware Support

Native support across Hardware Platforms including NVIDIA, AMD, Intel Xeon, Google TPU, and Ascend NPU accelerators.

Community & Training

Open-source with widespread adoption, powering 400k+ GPUs and integrated with major RL frameworks.

SGLang powers large-scale production deployments, generating trillions of tokens each day across more than 400,000 GPUs worldwide. It is hosted under the non-profit open-source organization LMSYS.

Get Started

SGLang is an inference framework meant for production level serving. It is designed to deliver low-latency and high-throughput inference across a wide range of setups, from a single GPU to large distributed clusters.