This document describes how to set up the SGLang environment and run LLM inference on CPU servers. SGLang is enabled and optimized for CPUs equipped with Intel® Advanced Matrix Extensions (Intel® AMX), i.e., 4th Gen or newer Intel® Xeon® Scalable Processors.

Optimized Model List

A number of popular LLMs have been optimized to run efficiently on CPU, including the most notable open-source models such as the Llama, Qwen, and DeepSeek series (e.g., DeepSeek-R1 and DeepSeek-V3.1-Terminus).
| Model Name | BF16 | W8A8_INT8 | FP8 |
| --- | --- | --- | --- |
| DeepSeek-R1 | | DeepSeek-R1-Channel-INT8 | DeepSeek-R1 |
| DeepSeek-V3.1-Terminus | | DeepSeek-V3.1-Terminus-Channel-int8 | DeepSeek-V3.1-Terminus |
| Llama-3.2-3B | Llama-3.2-3B-Instruct | Llama-3.2-3B-quantized.w8a8 | |
| Llama-3.1-8B | Llama-3.1-8B-Instruct | Llama-3.1-8B-quantized.w8a8 | |
| QwQ-32B | | QwQ-32B-quantized.w8a8 | |
| DeepSeek-Distilled-Llama | | DeepSeek-R1-Distill-Llama-70B-quantized.w8a8 | |
| Qwen3-235B | | | Qwen3-235B-A22B-FP8 |
Note: The model identifiers listed in the table above have been verified on 6th Gen Intel® Xeon® P-core platforms.

Installation

Launching the Serving Engine

Example command to launch SGLang serving:
python -m sglang.launch_server   \
    --model <MODEL_ID_OR_PATH>   \
    --trust-remote-code          \
    --disable-overlap-schedule   \
    --device cpu                 \
    --host 0.0.0.0               \
    --tp 6
Note: For running W8A8 quantized models, please add the flag --quantization w8a8_int8.
Note: The flag --tp 6 specifies that tensor parallelism will be applied with 6 ranks (TP6). On a CPU platform, each TP rank maps to a sub-NUMA cluster (SNC). You can get the SNC count using lscpu. If the specified number of TP ranks (n) is smaller than the total SNC count, the system automatically uses the first n SNCs; n cannot exceed the total SNC count. To specify the cores to be used, set the environment variable SGLANG_CPU_OMP_THREADS_BIND. For example, to use the first 40 cores of each SNC on a Xeon® 6980P server (which has 43-43-42 cores on the 3 SNCs of a socket):
export SGLANG_CPU_OMP_THREADS_BIND="0-39|43-82|86-125|128-167|171-210|214-253"
Please be aware that with SGLANG_CPU_OMP_THREADS_BIND set, the available memory per rank may not be determined in advance. You may need to set --max-total-tokens to avoid out-of-memory errors.
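The binding string is simply the per-SNC core ranges joined with "|". As a minimal sketch, the Xeon® 6980P example above can be assembled programmatically (the core ranges are taken from that example; adjust them to your own topology):

```shell
# Per-SNC core ranges: the first 40 cores of each of the 6 SNCs on a
# dual-socket Xeon 6980P (43-43-42 cores on the 3 SNCs of a socket).
bind=""
for r in 0-39 43-82 86-125 128-167 171-210 214-253; do
    bind="${bind:+$bind|}$r"   # join the ranges with '|'
done
export SGLANG_CPU_OMP_THREADS_BIND="$bind"
echo "$SGLANG_CPU_OMP_THREADS_BIND"
```

This produces the same value as the export shown above and keeps the range list easy to edit when experimenting with different core counts.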
Note: For optimizing decoding with torch.compile, add the flag --enable-torch-compile. To specify the maximum batch size, set --torch-compile-max-bs. For example, --enable-torch-compile --torch-compile-max-bs 4 uses torch.compile with a maximum batch size of 4. The maximum applicable batch size is 16.
Note: A warmup step is automatically triggered when the service is started. The server is ready when you see the log The server is fired up and ready to roll!.

Benchmarking with Requests

You can benchmark the performance via the bench_serving script. Run the command in another terminal. An example command would be:
python -m sglang.bench_serving   \
    --dataset-name random        \
    --random-input-len 1024      \
    --random-output-len 1024     \
    --num-prompts 1              \
    --request-rate inf           \
    --random-range-ratio 1.0
Detailed parameter descriptions are available via the command:
python -m sglang.bench_serving -h
Additionally, requests can be formatted using the OpenAI Completions API and sent via the command line (e.g., using curl) or through your own scripts.
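As a minimal sketch of such a request, the following assumes the server was launched on SGLang's default port 30000 and that the "model" field matches the --model used at launch (the model name and prompt here are illustrative):

```shell
# An OpenAI-style Completions request body; field names follow the
# OpenAI Completions API.
payload='{"model": "meta-llama/Llama-3.2-3B-Instruct", "prompt": "The capital of France is", "max_tokens": 32}'

# Send the request once the server reports it is ready; the '|| echo'
# keeps this from failing hard if the server is not up yet.
curl -s http://localhost:30000/v1/completions \
    -H "Content-Type: application/json" \
    -d "$payload" || echo "server not reachable"
```

Adjust the host and port if you launched the server with different --host settings or a non-default port.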

Example Usage Commands

Large Language Models can range from fewer than 1 billion to several hundred billion parameters. Dense models larger than 20B are expected to run on flagship 6th Gen Intel® Xeon® processors with dual sockets and a total of 6 sub-NUMA clusters. Dense models of approximately 10B parameters or fewer, or MoE (Mixture of Experts) models with fewer than 10B activated parameters, can run on more common 4th generation or newer Intel® Xeon® processors, or utilize a single socket of the flagship 6th Gen Intel® Xeon® processors.
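To see which of the above categories your machine falls into, you can inspect the socket and NUMA topology with lscpu; a quick sketch:

```shell
# Print socket and NUMA-node counts; with SNC enabled, each sub-NUMA
# cluster appears as a separate NUMA node in this output.
lscpu | grep -E 'Socket\(s\)|NUMA node'
```

For example, a dual-socket 6980P with SNC-3 enabled reports 2 sockets and 6 NUMA nodes, matching the TP6 configuration used in the examples below.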

Example: Running DeepSeek-V3.1-Terminus

python -m sglang.launch_server                                 \
    --model IntervitensInc/DeepSeek-V3.1-Terminus-Channel-int8 \
    --trust-remote-code                                        \
    --disable-overlap-schedule                                 \
    --device cpu                                               \
    --quantization w8a8_int8                                   \
    --host 0.0.0.0                                             \
    --enable-torch-compile                                     \
    --torch-compile-max-bs 4                                   \
    --tp 6
Note: Please set --torch-compile-max-bs to the maximum desired batch size for your deployment, which can be up to 16. The value 4 in the examples is illustrative.

Example: Running Llama-3.2-3B

python -m sglang.launch_server                     \
    --model meta-llama/Llama-3.2-3B-Instruct       \
    --trust-remote-code                            \
    --disable-overlap-schedule                     \
    --device cpu                                   \
    --host 0.0.0.0                                 \
    --enable-torch-compile                         \
    --torch-compile-max-bs 16                      \
    --tp 2
Note: The --torch-compile-max-bs and --tp settings are examples that should be adjusted for your setup. For instance, use --tp 3 to utilize 1 socket with 3 sub-NUMA clusters on an Intel® Xeon® 6980P server.
Once the server has been launched, you can test it with the bench_serving command or write your own requests and scripts following the benchmarking example.