Overview
This guide walks you through the entire flow of getting started with SGLang:
Install SGLang
Launch an inference server
Send requests using cURL, OpenAI Python client, Python requests, or the native SGLang API
By the end, you’ll have a working SGLang server responding to your prompts.
Prerequisites
Python: 3.9 or higher
GPU: NVIDIA GPU with CUDA support (sm75 and above, e.g., T4, A10, A100, L4, L40S, H100)
OS: Linux (recommended)
Installation
Choose one of the following methods: pip/uv (recommended), from source, or Docker.
We recommend using uv for faster installation:
pip install --upgrade pip
pip install uv
uv pip install sglang
# Clone and install from source
git clone https://github.com/sgl-project/sglang.git
cd sglang
pip install --upgrade pip
pip install -e "python"
The Docker images are available on Docker Hub at lmsysorg/sglang. Replace <secret> with your Hugging Face token:
docker run --gpus all \
--shm-size 32g \
-p 30000:30000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HF_TOKEN=<secret>" \
--ipc=host \
lmsysorg/sglang:latest \
python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --host 0.0.0.0 --port 30000
For production deployments, use the smaller runtime variant (~40% size reduction):
docker run --gpus all \
--shm-size 32g \
-p 30000:30000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HF_TOKEN=<secret>" \
--ipc=host \
lmsysorg/sglang:latest-runtime \
python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --host 0.0.0.0 --port 30000
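If you prefer a declarative setup, the run command above can also be expressed as a Docker Compose file. This is a sketch assuming the same image, port, cache mount, and launch command as the command-line example; adjust the model path and token for your deployment:

```yaml
services:
  sglang:
    image: lmsysorg/sglang:latest-runtime
    command: >
      python3 -m sglang.launch_server
      --model-path meta-llama/Llama-3.1-8B-Instruct
      --host 0.0.0.0 --port 30000
    ports:
      - "30000:30000"
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    environment:
      - HF_TOKEN=<secret>
    shm_size: 32g
    ipc: host
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```

Start it with `docker compose up`; the GPU reservation syntax requires a recent Docker Compose with the NVIDIA Container Toolkit installed.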
If you encounter OSError: CUDA_HOME environment variable is not set, set it with:
export CUDA_HOME=/usr/local/cuda-<your-cuda-version>
Launch a Server
Start the SGLang server with a model. Here we use qwen/qwen2.5-0.5b-instruct as a lightweight example:
python3 -m sglang.launch_server --model-path qwen/qwen2.5-0.5b-instruct --host 0.0.0.0 --port 30000
Wait until you see "The server is fired up and ready to roll!" in the terminal output.
Once the server is running, API documentation is available at:
Swagger UI: http://localhost:30000/docs
ReDoc: http://localhost:30000/redoc
OpenAPI Spec: http://localhost:30000/openapi.json
The server automatically applies the chat template from the Hugging Face tokenizer. You can override it with --chat-template when launching.
Send Requests
SGLang is fully OpenAI API-compatible, so you can use the same tools and libraries you already know.
Using cURL
curl http://localhost:30000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen/qwen2.5-0.5b-instruct",
"messages": [
{"role": "user", "content": "What is the capital of France?"}
]
}'
Using OpenAI Python Client
Install the OpenAI Python library if you haven't:
pip install openai
Then send a request:
import openai

client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="None")

response = client.chat.completions.create(
    model="qwen/qwen2.5-0.5b-instruct",
    messages=[
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
)
print(response.choices[0].message.content)
Streaming
import openai

client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="None")

response = client.chat.completions.create(
    model="qwen/qwen2.5-0.5b-instruct",
    messages=[
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
    stream=True,
)
for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
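The loop above prints each delta as it arrives; if you also want the complete reply at the end, accumulate the deltas as you go. A minimal sketch of that accumulation logic, exercised here with stand-in objects that mimic the shape of the client's streaming chunks (the `Fake*` classes are illustrative, not part of the OpenAI library):

```python
from dataclasses import dataclass
from typing import Optional

# Stand-ins mimicking the shape of streamed chat-completion chunks (illustrative).
@dataclass
class FakeDelta:
    content: Optional[str]

@dataclass
class FakeChoice:
    delta: FakeDelta

@dataclass
class FakeChunk:
    choices: list

def collect_stream(chunks):
    """Accumulate streamed delta fragments into the full assistant reply."""
    parts = []
    for chunk in chunks:
        content = chunk.choices[0].delta.content
        if content:  # some chunks (e.g., the final one) carry no content
            parts.append(content)
    return "".join(parts)

# Exercise the helper with mock chunks:
mock = [
    FakeChunk([FakeChoice(FakeDelta("Paris"))]),
    FakeChunk([FakeChoice(FakeDelta(" is the capital."))]),
    FakeChunk([FakeChoice(FakeDelta(None))]),
]
print(collect_stream(mock))  # Paris is the capital.
```

Against a live server, you would pass the real `response` iterator to `collect_stream` instead of the mock list.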
Using Python Requests
import requests

url = "http://localhost:30000/v1/chat/completions"
data = {
    "model": "qwen/qwen2.5-0.5b-instruct",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
}

response = requests.post(url, json=data)
print(response.json())
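The returned JSON follows the OpenAI chat-completions schema, so the assistant's reply lives at `choices[0].message.content`. A small sketch of pulling it out, shown against a mock response dict (the payload below is illustrative, not captured server output):

```python
def extract_reply(response_json):
    """Return the assistant message text from an OpenAI-style chat completion."""
    return response_json["choices"][0]["message"]["content"]

# Mock response in the OpenAI chat-completions shape (illustrative):
mock_response = {
    "choices": [
        {"message": {"role": "assistant", "content": "The capital of France is Paris."}}
    ]
}
print(extract_reply(mock_response))  # The capital of France is Paris.
```

With a live server you would call `extract_reply(response.json())` on the request from the example above.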
Using the Native /generate API
SGLang also provides a native /generate endpoint for more flexibility.
import requests

response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 32,
        },
    },
)
print(response.json())
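Note that `/generate` takes a raw `text` prompt plus a `sampling_params` dict, and the token limit is called `max_new_tokens` rather than the OpenAI-style `max_tokens`. If you are porting OpenAI-style code, a tiny translation helper can make that explicit; this is an illustrative sketch covering only the fields used in this guide, not an SGLang API:

```python
def to_generate_payload(prompt, temperature=0.0, max_tokens=32):
    """Build a /generate request body from OpenAI-style parameter names.

    Illustrative helper: only the parameters used in this guide are mapped.
    """
    return {
        "text": prompt,
        "sampling_params": {
            "temperature": temperature,
            "max_new_tokens": max_tokens,  # /generate's name for the token limit
        },
    }

payload = to_generate_payload("The capital of France is", temperature=0, max_tokens=32)
print(payload)
```

The resulting dict can be passed directly as the `json=` argument of `requests.post` in the example above.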
Streaming with /generate
import requests
import json

response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 32,
        },
        "stream": True,
    },
    stream=True,
)

prev = 0
for chunk in response.iter_lines(decode_unicode=False):
    chunk = chunk.decode("utf-8")
    if chunk and chunk.startswith("data:"):
        if chunk == "data: [DONE]":
            break
        data = json.loads(chunk[5:].strip("\n"))
        output = data["text"]
        print(output[prev:], end="", flush=True)
        prev = len(output)
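Each streamed `data:` event carries the cumulative text generated so far, which is why the loop keeps a `prev` offset and prints only the new suffix. That delta-extraction logic can be isolated into a helper and exercised on mock event lines (the sample payloads below are illustrative, not real server output):

```python
import json

def stream_deltas(lines):
    """Yield only the newly generated suffix from cumulative SSE 'data:' events."""
    prev = 0
    for line in lines:
        if not line.startswith("data:") or line == "data: [DONE]":
            continue
        text = json.loads(line[len("data:"):].strip())["text"]
        yield text[prev:]  # suffix not seen in earlier events
        prev = len(text)

# Mock SSE lines with cumulative "text" fields (illustrative):
mock_lines = [
    'data: {"text": "Paris"}',
    'data: {"text": "Paris, of"}',
    'data: {"text": "Paris, of course."}',
    "data: [DONE]",
]
print("".join(stream_deltas(mock_lines)))  # Paris, of course.
```

The same generator works on the decoded lines of a live streaming response.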
Offline Batch Inference (No Server)
SGLang also supports offline batch inference using the Engine class directly — no HTTP server required.
import sglang as sgl

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}\n")

llm.shutdown()
Common Troubleshooting
CUDA_HOME is not set
Set the CUDA_HOME environment variable to your CUDA install root:
export CUDA_HOME=/usr/local/cuda-<your-cuda-version>
FlashInfer issues on sm75+ devices
Switch to alternative backends by adding these flags when launching the server:
--attention-backend triton --sampling-backend pytorch
Alternatively, reinstall FlashInfer and clear its cache:
pip3 install --upgrade flashinfer-python --force-reinstall --no-deps
rm -rf ~/.cache/flashinfer
ptxas error on B300/GB300 (sm_103a)
Point Triton at the system ptxas binary:
export TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas