## Optimized Model List
The following LLMs have been optimized on Intel GPU, and more are on the way:

| Model Name | BF16 |
|---|---|
| Llama-3.2-3B | meta-llama/Llama-3.2-3B-Instruct |
| Llama-3.1-8B | meta-llama/Llama-3.1-8B-Instruct |
| Qwen2.5-1.5B | Qwen/Qwen2.5-1.5B |
The model identifiers listed in the table above have been verified on Intel® Arc™ B580 Graphics.
## Installation
Currently, SGLang XPU only supports installation from source. Please refer to "Getting Started on Intel GPU" to install the XPU dependencies. The installation proceeds in the following steps:
- Creation & Activation
- Install PyTorch and Dependencies
- Cloning
- Configure Build File
- Build and Install
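The steps above can be sketched as shell commands. This is a hedged sketch, not the official procedure: the environment name, Python version, PyTorch XPU wheel index, and the `all_xpu` extras name are all assumptions, so follow the linked guide for the exact commands and versions.

```shell
# 1. Creation & Activation: a fresh environment (name/version are assumptions)
conda create -n sglang-xpu python=3.10 -y
conda activate sglang-xpu

# 2. Install PyTorch and Dependencies: XPU wheels (index URL may differ per the guide)
pip install torch --index-url https://download.pytorch.org/whl/xpu

# 3. Cloning: fetch the SGLang sources
git clone https://github.com/sgl-project/sglang.git
cd sglang

# 4. Configure Build File: adjust build options here if the guide requires it

# 5. Build and Install: editable install from source (extras name is an assumption)
pip install -e "python[all_xpu]"
```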
## Launching the Serving Engine
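A minimal launch sketch, assuming the standard `sglang.launch_server` entry point; the model path, host, port, and the `--device xpu` flag are assumptions for illustration, not the verified command.

```shell
# Serve one of the verified models on the Intel GPU (all flags are assumptions)
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --device xpu \
  --host 0.0.0.0 \
  --port 30000
```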
Launch SGLang serving with your model of choice, then benchmark it from a separate terminal.

## Benchmarking with Requests
You can benchmark the performance via the `bench_serving` script. Run the command in another terminal.
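A sketch of the benchmarking step, assuming the `sglang.bench_serving` module and a server already listening on port 30000; the flag values shown are assumptions.

```shell
# Run in a second terminal while the server is up (flag values are assumptions)
python -m sglang.bench_serving \
  --backend sglang \
  --host 127.0.0.1 --port 30000 \
  --num-prompts 100
```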
Alternatively, you can send requests to the server directly (e.g. with curl) or via your own script.
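For a single ad-hoc request, a curl call against SGLang's native `/generate` endpoint can look like the sketch below; the port, prompt, and sampling parameters are assumptions.

```shell
# Send one prompt to the running server (port and parameters are assumptions)
curl -s http://127.0.0.1:30000/generate \
  -H "Content-Type: application/json" \
  -d '{"text": "What is the capital of France?", "sampling_params": {"max_new_tokens": 32}}'
```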