Optimized Model List
A number of popular LLMs have been optimized to run efficiently on CPU, including notable open-source families such as the Llama, Qwen, and DeepSeek series (e.g., DeepSeek-R1 and DeepSeek-V3.1-Terminus).

| Model Name | BF16 | W8A8_INT8 | FP8 |
|---|---|---|---|
| DeepSeek-R1 | — | DeepSeek-R1-Channel-INT8 | DeepSeek-R1 |
| DeepSeek-V3.1-Terminus | — | DeepSeek-V3.1-Terminus-Channel-int8 | DeepSeek-V3.1-Terminus |
| Llama-3.2-3B | Llama-3.2-3B-Instruct | Llama-3.2-3B-quantized.w8a8 | — |
| Llama-3.1-8B | Llama-3.1-8B-Instruct | Llama-3.1-8B-quantized.w8a8 | — |
| QwQ-32B | — | QwQ-32B-quantized.w8a8 | — |
| DeepSeek-Distilled-Llama | — | DeepSeek-R1-Distill-Llama-70B-quantized.w8a8 | — |
| Qwen3-235B | — | — | Qwen3-235B-A22B-FP8 |
Note: The model identifiers listed in the table above have been verified on 6th Gen Intel® Xeon® P-core platforms.
Installation
- Docker (Recommended)
- From Source
It is recommended to use Docker for setting up the SGLang environment.
A Dockerfile is provided to facilitate the installation.
Note: Replace <secret> below with your HuggingFace access token.
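A minimal sketch of the Docker workflow (the Dockerfile path `docker/Dockerfile.xeon` and the image tag are assumptions; use the file shipped in your SGLang checkout):

```bash
# Build the CPU image from the SGLang repository root (path is illustrative).
docker build -t sglang-cpu:main -f docker/Dockerfile.xeon .

# Start an interactive container; replace <secret> with your HuggingFace token.
docker run -it --rm \
    --privileged \
    --network host \
    -v /dev/shm:/dev/shm \
    -e HF_TOKEN=<secret> \
    sglang-cpu:main /bin/bash
```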
Launch of the Serving Engine
Example command to launch SGLang serving (a minimal sketch; the model path is illustrative and the flags should be adapted to your platform):
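```bash
# Launch the SGLang server on CPU (model path is illustrative;
# `--device cpu` support depends on your SGLang version).
python -m sglang.launch_server \
    --model-path meta-llama/Llama-3.1-8B-Instruct \
    --device cpu \
    --tp 6
```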
Note: For running W8A8 quantized models, please add the flag `--quantization w8a8_int8`.
Note: The flag `--tp 6` specifies that tensor parallelism will be applied using 6 ranks (TP6). On a CPU platform, a TP rank maps to a sub-NUMA cluster (SNC). You can get the SNC count using `lscpu`. If the specified TP rank number differs from the total SNC count, the system will automatically utilize the first `n` SNCs, where `n` cannot exceed the total SNC count. To specify the cores to be used, set the environment variable `SGLANG_CPU_OMP_THREADS_BIND`. For example, to use the first 40 cores of each SNC on a Xeon® 6980P server (which has 43-43-42 cores on the 3 SNCs of a socket):
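A sketch of such a binding (the core ID ranges assume contiguous per-SNC numbering across both sockets; verify the actual layout with `lscpu`):

```bash
# One `|`-separated core range per TP rank: first 40 cores of each of the
# 6 SNCs on a dual-socket Xeon 6980P (43-43-42 cores per socket).
export SGLANG_CPU_OMP_THREADS_BIND="0-39|43-82|86-125|128-167|171-210|214-253"
```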
Please be aware that with `SGLANG_CPU_OMP_THREADS_BIND` set, the available memory amount of each rank may not be determined in advance. You may need to set `--max-total-tokens` to avoid out-of-memory errors.
Note: For optimizing decoding with `torch.compile`, add the flag `--enable-torch-compile`. To specify the maximum batch size, set `--torch-compile-max-bs`. For example, `--enable-torch-compile --torch-compile-max-bs 4` uses `torch.compile` with a maximum batch size of 4. The maximum applicable batch size is 16.
Note: A warmup step is automatically triggered when the service is started. The server is ready once you see the log `The server is fired up and ready to roll!`.
Benchmarking with Requests
You can benchmark the performance via the `bench_serving` script.
Run the command in another terminal. An example command (a sketch; the dataset choice and request counts are illustrative):
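```bash
# Benchmark the running server with random prompts (flag values are
# illustrative; see `python -m sglang.bench_serving --help`).
python -m sglang.bench_serving \
    --backend sglang \
    --dataset-name random \
    --num-prompts 10 \
    --random-input-len 1024 \
    --random-output-len 1024
```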
Alternatively, you can send requests to the server directly (e.g., with `curl`) or through your own scripts.
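For instance, a request against SGLang's native `/generate` endpoint might look like the sketch below (the default port `30000` and the request schema are assumptions to check against your version):

```bash
curl http://localhost:30000/generate \
    -H "Content-Type: application/json" \
    -d '{"text": "The capital of France is", "sampling_params": {"max_new_tokens": 16}}'
```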
Example Usage Commands
Large Language Models can range from fewer than 1 billion to several hundred billion parameters. Dense models larger than 20B are expected to run on flagship 6th Gen Intel® Xeon® processors with dual sockets and a total of 6 sub-NUMA clusters. Dense models of approximately 10B parameters or fewer, or MoE (Mixture of Experts) models with fewer than 10B activated parameters, can run on more common 4th generation or newer Intel® Xeon® processors, or utilize a single socket of the flagship 6th Gen Intel® Xeon® processors.

Example: Running DeepSeek-V3.1-Terminus
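A launch sketch (the model path and flag values are illustrative; for the W8A8 INT8 checkpoint, add `--quantization w8a8_int8` as noted above):

```bash
# Launch DeepSeek-V3.1-Terminus across all 6 SNCs of a dual-socket server.
python -m sglang.launch_server \
    --model-path deepseek-ai/DeepSeek-V3.1-Terminus \
    --device cpu \
    --tp 6 \
    --enable-torch-compile \
    --torch-compile-max-bs 4
```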
Note: Please set `--torch-compile-max-bs` to the maximum desired batch size for your deployment, which can be up to 16. The value `4` in the examples is illustrative.
Example: Running Llama-3.2-3B
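A launch sketch (the model path and flag values are illustrative):

```bash
# Launch Llama-3.2-3B on a single socket with 3 SNCs.
python -m sglang.launch_server \
    --model-path meta-llama/Llama-3.2-3B-Instruct \
    --device cpu \
    --tp 3 \
    --enable-torch-compile \
    --torch-compile-max-bs 4
```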
Note: The `--torch-compile-max-bs` and `--tp` settings are examples that should be adjusted for your setup. For instance, use `--tp 3` to utilize 1 socket with 3 sub-NUMA clusters on an Intel® Xeon® 6980P server.

Once the server has been launched, you can test it using the `bench_serving` command or create your own commands or scripts following the benchmarking example.