Install methods
- Pip or uv
- From source
- Docker
- Kubernetes
- Docker Compose
- SkyPilot
- AWS SageMaker
It is recommended to use uv for faster installation:
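A typical flow looks like the following (the `[all]` extra is illustrative; check the official docs for the exact package spec your setup needs):

```shell
# Upgrade pip, install uv, then install SGLang via uv's faster resolver
pip install --upgrade pip
pip install uv
uv pip install "sglang[all]"
```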
Quick fixes to common problems
Wrong torch version
In some cases (for example, GB200), the command above might install the wrong torch version (for example, the CPU build) due to dependency resolution. Reinstall the correct PyTorch as follows:
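One way to force the CUDA build is to reinstall from the PyTorch wheel index (the CUDA tag `cu129` below is an assumption; substitute the one matching your driver and platform):

```shell
# Reinstall torch from the CUDA wheel index, replacing the CPU build
uv pip install torch --index-url https://download.pytorch.org/whl/cu129 --force-reinstall
```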
CUDA 13 without Docker
If you do not have Docker access, install the matching sgl_kernel wheel for your architecture (aarch64 or x86_64) from the sgl-project whl releases after installing SGLang. Replace X.Y.Z with the sgl_kernel version required by your SGLang (you can find this by running `uv pip show sgl_kernel`).
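For example, to discover the version to substitute for X.Y.Z (assuming sgl_kernel is already present in your environment):

```shell
# Check which sgl_kernel version your SGLang installation expects
uv pip show sgl_kernel | grep -i version
```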
CUDA_HOME not set
Choose one of the following solutions:
- Set `CUDA_HOME` to your CUDA install root.
- Install FlashInfer first following the FlashInfer installation doc, then install SGLang as described above.
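For the first option, point `CUDA_HOME` at your CUDA toolkit root (the path `/usr/local/cuda` is the common default but may differ on your system):

```shell
# Point CUDA_HOME at the CUDA toolkit root so builds can find nvcc and headers
export CUDA_HOME=/usr/local/cuda
echo "$CUDA_HOME"
```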
Common notes
- FlashInfer is the default attention kernel backend. It only supports sm75 and above. If you encounter any FlashInfer-related issues on sm75+ devices (for example, T4, A10, A100, L4, L40S, H100), switch to other kernels by adding `--attention-backend triton --sampling-backend pytorch`, and open an issue on GitHub.
- To reinstall FlashInfer locally, use `pip3 install --upgrade flashinfer-python --force-reinstall --no-deps`, then delete the cache with `rm -rf ~/.cache/flashinfer`.
- When encountering `ptxas fatal : Value 'sm_103a' is not defined for option 'gpu-name'` on B300/GB300, fix it with `export TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas`.
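Putting the backend flags from the first note together, a server launch with the Triton kernels might look like this (the model path is a placeholder; use your own):

```shell
# Launch SGLang with the Triton attention backend and PyTorch sampling backend
# instead of the default FlashInfer kernels
python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --attention-backend triton \
  --sampling-backend pytorch
```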
