Ascend NPU Performance Testing

This page walks through performance testing your SGLang deployment on Ascend NPUs. We cover three model types — text generation (Qwen/Qwen2.5-7B-Instruct), multimodal vision (Qwen/Qwen2.5-VL-7B-Instruct), and embedding (Qwen/Qwen3-Embedding-8B) — in both online and offline serving modes. You can use Evalscope, AISBench, or SGLang’s built-in benchmarking tools.

The benchmark output examples in this guide are for illustration only. Actual performance depends on your hardware (e.g., Atlas 800I A2 vs A3), model version, SGLang version, and deployment configuration. Always run benchmarks on your own hardware to obtain accurate performance data.

1. Prepare

1.1 Start SGLang server

Launch the server with the appropriate flags for each model type. Make sure SGLang is installed first — see Ascend NPU Quickstart for environment setup.

Text Generation

Command

# SGLang downloads the model automatically. If the model is already downloaded, set --model-path to its local path.
sglang serve --model-path Qwen/Qwen2.5-7B-Instruct

Multimodal

Command

# SGLang downloads the model automatically. If the model is already downloaded, set --model-path to its local path.
sglang serve --model-path Qwen/Qwen2.5-VL-7B-Instruct --mm-attention-backend ascend_attn

Embedding

Command

# SGLang downloads the model automatically. If the model is already downloaded, set --model-path to its local path.
sglang serve --model-path Qwen/Qwen3-Embedding-8B --is-embedding

Add & at the end of the command to run the server in the background, or open a new terminal to run the benchmark commands in the following sections.

The server binds to http://127.0.0.1:30000 by default. All online benchmarks below assume the server is running at that address. The --is-embedding flag is required for embedding models.
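Before starting a benchmark, it can help to confirm that the server is actually reachable. The following is a minimal sketch, not part of SGLang: the helper name server_ready is hypothetical, and it assumes the server exposes the OpenAI-compatible /v1/models route at the default address.

```python
import http.client
import urllib.error
import urllib.request


def server_ready(base_url: str = "http://127.0.0.1:30000", timeout: float = 2.0) -> bool:
    """Return True if GET <base_url>/v1/models answers with HTTP 200.

    Assumption: the server exposes the OpenAI-compatible /v1/models route.
    """
    try:
        with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeout, or an HTTP error: treat as "not ready".
        return False


if __name__ == "__main__":
    print("server ready:", server_ready())
```

Calling server_ready() in a retry loop with a short sleep is one way to wait for model loading to finish before launching a benchmark script.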

1.2 Install benchmarking tools

bench_serving and bench_offline_throughput are built into SGLang and require no extra installation. For Evalscope and AISBench, set up each in its own virtual environment:

Evalscope

Command

python3 -m venv .evalscope_venv
source .evalscope_venv/bin/activate
# Quote the package spec so that shells such as zsh do not expand the brackets.
pip install -U 'evalscope[perf]'

AISBench

Command

python3 -m venv .aisbench_venv
source .aisbench_venv/bin/activate

git clone https://github.com/AISBench/benchmark.git
cd benchmark/
pip3 install -e ./ --use-pep517

pip3 install -r requirements/api.txt
pip3 install -r requirements/extra.txt

Run ais_bench -h to verify.

AISBench requires Python 3.10-3.12. After installation, all AISBench commands must be run from the benchmark/ directory (the cloned repo root). Set stream=True and ignore_eos=True in the model config for accurate results.

2. Online Service: Text Generation Model

Test Qwen/Qwen2.5-7B-Instruct via the online serving endpoint.

Before running any benchmark in this section, make sure the SGLang text-generation server is running at http://127.0.0.1:30000. See Start SGLang server for the launch command.

For performance testing, prefer random datasets (--dataset random in Evalscope, --dataset-name random in bench_serving) over real datasets. With a random dataset you can pin the prompt and output lengths to fixed values (for example, Evalscope's --min-prompt-length / --max-prompt-length and --min-tokens / --max-tokens), which produces consistent, repeatable results. Real datasets (ShareGPT, openqa, etc.) have variable input lengths that add noise and make cross-run comparisons unreliable.

2.1 Using Evalscope

Prerequisites: Evalscope installed and its virtual environment activated (source .evalscope_venv/bin/activate). SGLang server running at http://127.0.0.1:30000.

Run the following command to benchmark the server:

Command

evalscope perf \
  --parallel 10 \
  --number 20 \
  --model Qwen/Qwen2.5-7B-Instruct \
  --url http://127.0.0.1:30000/v1/chat/completions \
  --api openai \
  --dataset random \
  --max-tokens 1024 \
  --min-tokens 1024 \
  --prefix-length 0 \
  --min-prompt-length 1024 \
  --max-prompt-length 1024 \
  --tokenizer-path Qwen/Qwen2.5-7B-Instruct \
  --extra-args '{"ignore_eos": true}'

If the model has already been downloaded, you can point --tokenizer-path to the local model path instead of the model id.
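To compare several concurrency levels, the same command can be generated programmatically. This is a sketch under stated assumptions: the helper evalscope_perf_cmd and the chosen concurrency levels are hypothetical, and it only reuses the flags shown in the command above.

```python
import json
import shlex


def evalscope_perf_cmd(parallel: int, number: int,
                       model: str = "Qwen/Qwen2.5-7B-Instruct",
                       prompt_len: int = 1024, output_len: int = 1024) -> list[str]:
    """Build the argv for one fixed-length `evalscope perf` run (hypothetical helper)."""
    return [
        "evalscope", "perf",
        "--parallel", str(parallel),
        "--number", str(number),
        "--model", model,
        "--url", "http://127.0.0.1:30000/v1/chat/completions",
        "--api", "openai",
        "--dataset", "random",
        "--max-tokens", str(output_len),
        "--min-tokens", str(output_len),
        "--prefix-length", "0",
        "--min-prompt-length", str(prompt_len),
        "--max-prompt-length", str(prompt_len),
        "--tokenizer-path", model,
        "--extra-args", json.dumps({"ignore_eos": True}),
    ]


# Print one shell-ready command per concurrency level (levels are illustrative).
for parallel in (1, 10, 32):
    print(shlex.join(evalscope_perf_cmd(parallel, number=2 * parallel)))
```

Each printed line can be pasted into the shell, or the argv list can be passed to subprocess.run from inside the Evalscope virtual environment.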

Example output (for illustration only — actual results depend on your hardware and configuration):

Benchmarking summary:
┌────────────────────────────┬─────────────┐
│ Metric                     │       Value │
├────────────────────────────┼─────────────┤
│ ── General ──              │             │
│ Test Duration (s)          │       89.34 │
│ Concurrency                │          10 │
│ Request Rate (req/s)       │       -1.00 │
│ Total / Success / Failed   │ 20 / 20 / 0 │
│ Req Throughput (req/s)     │        0.22 │
│ ── Latency ──              │             │
│ Avg Latency (s)            │       44.67 │
│ TTFT (ms)                  │      578.51 │
│ TPOT (ms)                  │       43.10 │
│ ITL (ms)                   │       43.12 │
│ ── Tokens ──               │             │
│ Avg Input Tokens           │     1024.00 │
│ Avg Output Tokens          │     1024.00 │
│ Output Throughput (tok/s)  │      229.24 │
│ Total Throughput (tok/s)   │      458.49 │
│ ── Speculative Decoding ── │             │
│ Decoded Tok/Iter           │        1.00 │
│ Spec. Accept Rate          │        0.00 │
└────────────────────────────┴─────────────┘

Percentile results:
┌────────────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┐
│ Metric         │      1% │      5% │     10% │     25% │     50% │     75% │     90% │     95% │     99% │
├────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ Latency (s)    │   44.47 │   44.47 │   44.47 │   44.47 │   44.86 │   44.86 │   44.86 │   44.86 │   44.86 │
│ TTFT (ms)      │  138.12 │  142.07 │  426.17 │  426.87 │  783.67 │  785.26 │  786.85 │  787.97 │  787.97 │
│ ITL (ms)       │   41.84 │   42.14 │   42.22 │   42.36 │   42.57 │   42.80 │   42.99 │   49.24 │   49.84 │
│ TPOT (ms)      │   42.71 │   42.71 │   42.71 │   43.05 │   43.08 │   43.43 │   43.43 │   43.71 │   43.71 │
│ Input tokens   │ 1024.00 │ 1024.00 │ 1024.00 │ 1024.00 │ 1024.00 │ 1024.00 │ 1024.00 │ 1024.00 │ 1024.00 │
│ Output tokens  │ 1024.00 │ 1024.00 │ 1024.00 │ 1024.00 │ 1024.00 │ 1024.00 │ 1024.00 │ 1024.00 │ 1024.00 │
│ Output (tok/s) │   22.83 │   22.83 │   22.83 │   22.83 │   23.02 │   23.03 │   23.03 │   23.03 │   23.03 │
│ Total (tok/s)  │   45.65 │   45.65 │   45.65 │   45.65 │   46.05 │   46.05 │   46.05 │   46.05 │   46.05 │
│ Decode (tok/s) │   22.88 │   23.03 │   23.03 │   23.07 │   23.21 │   23.42 │   23.42 │   23.42 │   23.42 │
└────────────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┘
...
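The summary numbers above can be cross-checked against each other. The sketch below is an illustration rather than Evalscope's own computation: it assumes the 20 requests ran as two back-to-back waves of 10, and that throughput is total tokens divided by test duration.

```python
# Values copied from the example summary above.
number, parallel = 20, 10
avg_latency_s = 44.67
test_duration_s = 89.34
output_tokens_per_request = 1024

# Assumption: requests run in full waves of `parallel` requests each.
waves = number / parallel
estimated_duration_s = waves * avg_latency_s

# Aggregate output throughput = all generated tokens / wall-clock duration.
output_throughput = number * output_tokens_per_request / test_duration_s

# Per-request rate: what a single client sees while sharing the server.
per_request_rate = output_tokens_per_request / avg_latency_s

print(f"estimated duration: {estimated_duration_s:.2f} s")
print(f"output throughput:  {output_throughput:.2f} tok/s")
print(f"per-request rate:   {per_request_rate:.2f} tok/s")
```

The per-request rate multiplied by the concurrency of 10 is close to the aggregate output throughput, which is why the percentile table reports roughly 23 tok/s while the summary reports roughly 229 tok/s.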

See the Evalscope Performance Testing Guide for full details.

2.2 Using AISBench

Prerequisites: AISBench installed and its virtual environment activated (source .aisbench_venv/bin/activate). All commands must be run from the benchmark/ directory. SGLang server running at http://127.0.0.1:30000. Set stream=True and ignore_eos=True in the model config for accurate results.

Two files need to be configured for performance testing. First, describe the model and server settings in ais_bench/benchmark/configs/models/vllm_api/vllm_api_stream_chat.py:

vllm_api_stream_chat.py

# more details: https://ais-bench-benchmark.readthedocs.io/en/latest/base_tutorials/scenes_intro/performance_benchmark.html
from ais_bench.benchmark.models import VLLMCustomAPIChat
from ais_bench.benchmark.utils.postprocess.model_postprocessors import extract_non_reasoning_content

models = [
    dict(
        attr="service",
        type=VLLMCustomAPIChat,
        abbr="vllm-api-stream-chat",
        path="Qwen/Qwen2.5-7B-Instruct",
        model="Qwen/Qwen2.5-7B-Instruct",
        stream=True,
        request_rate=0,
        use_timestamp=False,
        retry=2,
        api_key="",
        host_ip="127.0.0.1",
        host_port=30000,
        url="",
        max_out_len=512,
        batch_size=32,
        trust_remote_code=False,
        generation_kwargs=dict(
            temperature=0.01,
            ignore_eos=True,
        ),
        pred_postprocessor=dict(type=extract_non_reasoning_content),
    )
]

If the model has already been downloaded, point path to the local model path instead of the model id.

Second, configure random prompt lengths in ais_bench/datasets/synthetic/synthetic_config.py:

synthetic_config.py

# more details: https://ais-bench-benchmark.readthedocs.io/en/latest/advanced_tutorials/synthetic_dataset.html
synthetic_config = {
    "Type":"tokenid",
    "RequestCount": 10,
    "TrustRemoteCode": False,
    "StringConfig" : {
        "Input" : {
            "Method": "uniform",
            "Params": {"MinValue": 1, "MaxValue": 200}
        },
        "Output" : {
            "Method": "gaussian",
            "Params": {"Mean": 100, "Var": 200, "MinValue": 1, "MaxValue": 100}
        }
    },
    "TokenIdConfig" : {
        "RequestSize": 10,
        "PrefixLen": 0
    }
}

Run with a synthetic dataset:

Command

ais_bench --models vllm_api_stream_chat --datasets synthetic_gen_string -m perf

Example output (for illustration only — actual results depend on your hardware and configuration):

╒══════════════════════════╤═════════╤═════════════════╤═════════════════╤═════════════════╤═════════════════╤═════════════════╤═════════════════╤═════════════════╤═════╕
│ Performance Parameters   │ Stage   │ Average         │ Min             │ Max             │ Median          │ P75             │ P90             │ P99             │  N  │
╞══════════════════════════╪═════════╪═════════════════╪═════════════════╪═════════════════╪═════════════════╪═════════════════╪═════════════════╪═════════════════╪═════╡
│ E2EL                     │ total   │ 3896.4 ms       │ 3081.6 ms       │ 4175.3 ms       │ 4013.8 ms       │ 4123.4 ms       │ 4137.1 ms       │ 4171.5 ms       │ 10  │
├──────────────────────────┼─────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────┤
│ TTFT                     │ total   │ 411.6 ms        │ 346.7 ms        │ 439.7 ms        │ 416.3 ms        │ 426.6 ms        │ 434.4 ms        │ 439.2 ms        │ 10  │
├──────────────────────────┼─────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────┤
│ TPOT                     │ total   │ 38.3 ms         │ 37.4 ms         │ 39.0 ms         │ 38.3 ms         │ 38.7 ms         │ 38.9 ms         │ 39.0 ms         │ 10  │
├──────────────────────────┼─────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────┤
│ ITL                      │ total   │ 38.7 ms         │ 0.0 ms          │ 156.5 ms        │ 38.9 ms         │ 39.0 ms         │ 39.2 ms         │ 117.1 ms        │ 10  │
├──────────────────────────┼─────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────┤
│ InputTokens              │ total   │ 123.4           │ 34.0            │ 228.0           │ 130.5           │ 170.5           │ 217.2           │ 226.92          │ 10  │
├──────────────────────────┼─────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────┤
│ OutputTokens             │ total   │ 92.1            │ 69.0            │ 100.0           │ 95.0            │ 99.75           │ 100.0           │ 100.0           │ 10  │
├──────────────────────────┼─────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────┤
│ OutputTokenThroughput    │ total   │ 23.5937 token/s │ 22.3912 token/s │ 24.2616 token/s │ 23.7399 token/s │ 23.9919 token/s │ 24.2027 token/s │ 24.2557 token/s │ 10  │
╘══════════════════════════╧═════════╧═════════════════╧═════════════════╧═════════════════╧═════════════════╧═════════════════╧═════════════════╧═════════════════╧═════╛
╒══════════════════════════╤═════════╤══════════════════╕
│ Common Metric            │ Stage   │ Value            │
╞══════════════════════════╪═════════╪══════════════════╡
│ Benchmark Duration       │ total   │ 4175.4485 ms     │
├──────────────────────────┼─────────┼──────────────────┤
│ Total Requests           │ total   │ 10               │
├──────────────────────────┼─────────┼──────────────────┤
│ Failed Requests          │ total   │ 0                │
├──────────────────────────┼─────────┼──────────────────┤
│ Success Requests         │ total   │ 10               │
├──────────────────────────┼─────────┼──────────────────┤
│ Concurrency              │ total   │ 9.3317           │
├──────────────────────────┼─────────┼──────────────────┤
│ Max Concurrency          │ total   │ 32               │
├──────────────────────────┼─────────┼──────────────────┤
│ Request Throughput       │ total   │ 2.395 req/s      │
├──────────────────────────┼─────────┼──────────────────┤
│ Total Input Tokens       │ total   │ 1234             │
├──────────────────────────┼─────────┼──────────────────┤
│ Prefill Token Throughput │ total   │ 299.8329 token/s │
├──────────────────────────┼─────────┼──────────────────┤
│ Total Generated Tokens   │ total   │ 921              │
├──────────────────────────┼─────────┼──────────────────┤
│ Input Token Throughput   │ total   │ 295.5371 token/s │
├──────────────────────────┼─────────┼──────────────────┤
│ Output Token Throughput  │ total   │ 220.5751 token/s │
├──────────────────────────┼─────────┼──────────────────┤
│ Total Token Throughput   │ total   │ 516.1122 token/s │
╘══════════════════════════╧═════════╧══════════════════╛

See the AISBench Documentation for details.

2.3 Using bench_serving

SGLang’s built-in bench_serving requires no extra installation. Make sure the server is running at http://127.0.0.1:30000 before running the benchmark.

See the Bench Serving Guide for all backends, datasets, and advanced options.

Command

python -m sglang.bench_serving \
  --backend sglang-oai \
  --base-url http://127.0.0.1:30000 \
  --model Qwen/Qwen2.5-7B-Instruct \
  --dataset-name random \
  --random-input-len 1024 \
  --random-output-len 512 \
  --random-range-ratio 1 \
  --num-prompts 100 \
  --max-concurrency 32

--dataset-name random samples token IDs from the ShareGPT dataset to generate realistic input; the first run downloads ShareGPT from Hugging Face automatically.

If Hugging Face is unreachable from your network, run export HF_ENDPOINT=https://hf-mirror.com to download through the hf-mirror.com mirror.
If the download still fails, download ShareGPT_V3_unfiltered_cleaned_split.json manually, copy it to your server, and pass its location via --dataset-path to run offline.

Set --random-range-ratio 1 for fixed input/output lengths (recommended for consistent comparisons); the default, 0, draws lengths from a uniform distribution instead. Add --request-rate to control the request rate.
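The effect of the range ratio can be illustrated with a small simulation. This is a sketch, not bench_serving's code: the function sample_lengths is hypothetical, and it assumes each length is drawn uniformly between ratio × target length (at least 1) and the target length.

```python
import random


def sample_lengths(target_len: int, range_ratio: float,
                   num_prompts: int, seed: int = 0) -> list[int]:
    """Draw request lengths under the assumed range-ratio rule (hypothetical helper)."""
    rng = random.Random(seed)
    # Assumed lower bound: ratio * target, but never below one token.
    low = max(1, int(target_len * range_ratio))
    return [rng.randint(low, target_len) for _ in range(num_prompts)]


fixed = sample_lengths(1024, 1.0, 100)
varied = sample_lengths(1024, 0.0, 100)

print(min(fixed), max(fixed))    # 1024 1024
print(min(varied), max(varied))  # spread anywhere between 1 and 1024
```

With a ratio of 1 every request carries the same load, so differences between runs reflect the server rather than the sampled workload.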

Example output (for illustration only — actual results depend on your hardware and configuration):

============ Serving Benchmark Result ============
Backend:                                 sglang-oai
Traffic request rate:                    inf
Max request concurrency:                 32
Successful requests:                     100
Benchmark duration (s):                  47.51
Total input tokens:                      102400
Total input text tokens:                 102400
Total generated tokens:                  51200
Total generated tokens (retokenized):    51195
Request throughput (req/s):              2.10
Input token throughput (tok/s):          2155.35
Output token throughput (tok/s):         1077.68
Peak output token throughput (tok/s):    1587.00
Peak concurrent requests:                64
Total token throughput (tok/s):          3233.03
Concurrency:                             26.93
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   12793.49
Median E2E Latency (ms):                 12940.17
P90 E2E Latency (ms):                    13049.86
P99 E2E Latency (ms):                    13051.61
---------------Time to First Token----------------
Mean TTFT (ms):                          1423.99
Median TTFT (ms):                        1489.29
P99 TTFT (ms):                           2325.56
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          22.25
Median TPOT (ms):                        22.22
P99 TPOT (ms):                           25.08
---------------Inter-Token Latency----------------
Mean ITL (ms):                           22.26
Median ITL (ms):                         20.74
P95 ITL (ms):                            21.40
P99 ITL (ms):                            23.62
Max ITL (ms):                            2229.30
==================================================

SGLang Serving Benchmark Result — Complete Reference

The output format is hardcoded in bench_serving.py. All formatting decisions — including column widths, alignment, and decimal precision — are statically defined in the source and cannot be changed via command-line arguments.

Test Configuration

Parameter	Description
`Backend`	The serving backend under test (e.g., `sglang`, `vllm`).
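`Traffic request rate`	Target request rate in req/s. `inf` sends requests as fast as possible, bounded only by the concurrency limit. `trace` indicates trace timestamp mode. A finite value spaces requests to average that rate, with arrival times drawn from a Poisson process rather than at constant intervals.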
`Max request concurrency`	Maximum number of concurrent requests from the client side. Displays `not set` when unspecified.

Core Statistics & Throughput Metrics

Parameter	Description	Format Specification
`Successful requests`	Total number of successfully completed requests (HTTP 200, no generation errors).	Integer, no decimal places
`Benchmark duration (s)`	Total elapsed time from first request sent to last response fully received (seconds).	2 decimal places
`Total input tokens`	Total number of input (prompt) tokens across all requests, counted by server-side tokenizer.	Integer, no decimal places
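`Total input text tokens`	Text portion of the input tokens. Equals `Total input tokens` for text-only requests; for multimodal inputs the vision tokens are reported separately, so this value is smaller.	Integer, no decimal places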
`Total generated tokens`	Total number of output tokens actually generated by the server (server-side tokenizer count).	Integer, no decimal places
`Total generated tokens (retokenized)`	Output text re-tokenized by the client using its own tokenizer. A large discrepancy indicates tokenizer mismatch or special tokens in output.	Integer, no decimal places
`Request throughput (req/s)`	Number of successful requests processed per second. Formula: `Successful requests / Benchmark duration (s)`.	2 decimal places
`Input token throughput (tok/s)`	Number of input tokens processed per second. Formula: `Total input tokens / Benchmark duration (s)`.	2 decimal places
`Output token throughput (tok/s)`	Number of output tokens generated per second. Formula: `Total generated tokens / Benchmark duration (s)`.	2 decimal places
`Peak output token throughput (tok/s)`	Observed instantaneous peak output token generation rate during the test (computed over a sliding window).	2 decimal places
`Peak concurrent requests`	Maximum number of requests being processed simultaneously on the server side. May exceed client-side `Max request concurrency` due to queueing.	Integer, no decimal places
`Total token throughput (tok/s)`	Sum of input and output token throughputs. Formula: `Input token throughput + Output token throughput`.	2 decimal places
`Concurrency`	Average number of concurrent requests during the test (Little’s Law). Formula: `Sum of all E2E latencies / Benchmark duration`.	2 decimal places
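The formulas in this table can be checked against the example output in section 2.3. The sketch below recomputes the derived metrics from the printed values; the function name derived_metrics is hypothetical, and small differences in the last digit are expected because the printed duration is rounded to two decimals.

```python
def derived_metrics(successful: int, duration_s: float, input_tokens: int,
                    output_tokens: int, mean_e2e_ms: float) -> dict[str, float]:
    """Recompute throughput and concurrency from the raw counters (hypothetical helper)."""
    input_tput = input_tokens / duration_s
    output_tput = output_tokens / duration_s
    return {
        "request_throughput": successful / duration_s,
        "input_token_throughput": input_tput,
        "output_token_throughput": output_tput,
        "total_token_throughput": input_tput + output_tput,
        # Sum of all E2E latencies = mean latency x number of requests.
        "concurrency": (mean_e2e_ms / 1000.0) * successful / duration_s,
    }


# Values copied from the text-generation example output.
metrics = derived_metrics(successful=100, duration_s=47.51, input_tokens=102400,
                          output_tokens=51200, mean_e2e_ms=12793.49)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

The recomputed concurrency of about 26.93 sits below the configured maximum of 32 because requests at the start and end of the run do not overlap fully.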

End-to-End Latency (E2E Latency)

Statistic	Description	Format
`Mean E2E Latency (ms)`	Arithmetic mean	2 decimal places
`Median E2E Latency (ms)`	50th percentile	2 decimal places
`P90 E2E Latency (ms)`	90th percentile (90% of requests have latency ≤ this value)	2 decimal places
`P99 E2E Latency (ms)`	99th percentile	2 decimal places

Time to First Token (TTFT)

Statistic	Description	Format
`Mean TTFT (ms)`	Arithmetic mean	2 decimal places
`Median TTFT (ms)`	50th percentile	2 decimal places
`P99 TTFT (ms)`	99th percentile	2 decimal places

Time per Output Token (TPOT) – Excluding First Token

Formula: (E2E Latency - TTFT) / (Number of output tokens - 1)

Statistic	Description	Format
`Mean TPOT (ms)`	Arithmetic mean	2 decimal places
`Median TPOT (ms)`	50th percentile	2 decimal places
`P99 TPOT (ms)`	99th percentile	2 decimal places
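As a worked instance of the TPOT formula, the sketch below applies it to the mean values from the text-generation example in section 2.3. The function name tpot_ms is hypothetical. Using mean latencies gives the mean TPOT here only because every request generated exactly 512 tokens; with variable output lengths, TPOT has to be computed per request and then averaged.

```python
def tpot_ms(e2e_latency_ms: float, ttft_ms: float, output_tokens: int) -> float:
    """Time per output token, excluding the first token (hypothetical helper)."""
    if output_tokens < 2:
        raise ValueError("TPOT needs at least two output tokens")
    return (e2e_latency_ms - ttft_ms) / (output_tokens - 1)


# Mean E2E latency, mean TTFT, and fixed output length from the example run.
print(round(tpot_ms(12793.49, 1423.99, 512), 2))  # 22.25
```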

Inter-Token Latency (ITL)

Statistic	Description	Format
`Mean ITL (ms)`	Average inter-token interval	2 decimal places
`Median ITL (ms)`	50th percentile inter-token interval	2 decimal places
`P95 ITL (ms)`	95th percentile (used to detect stalls)	2 decimal places
`P99 ITL (ms)`	99th percentile	2 decimal places
`Max ITL (ms)`	Maximum observed inter-token interval; useful for identifying severe blocking events	2 decimal places

3. Online Service: Multimodal Model

Test Qwen/Qwen2.5-VL-7B-Instruct for vision-language tasks.

Before running any benchmark in this section, make sure the SGLang multimodal server is running at http://127.0.0.1:30000. See Start SGLang server and use the Multimodal tab for the launch command.

For consistent, repeatable bench_serving results, set --random-range-ratio 1 to fix input/output lengths; the default, 0, draws lengths from a uniform distribution.

3.1 Using Evalscope

Prerequisites: Evalscope installed and its virtual environment activated (source .evalscope_venv/bin/activate). SGLang multimodal server running at http://127.0.0.1:30000.

Evalscope’s perf tool uses the OpenAI-compatible /v1/chat/completions endpoint. Use --dataset random_vl for randomized multimodal data with image generation:

Command

evalscope perf \
  --parallel 10 \
  --number 20 \
  --model Qwen/Qwen2.5-VL-7B-Instruct \
  --url http://127.0.0.1:30000/v1/chat/completions \
  --api openai \
  --dataset random_vl \
  --min-tokens 1024 \
  --max-tokens 1024 \
  --prefix-length 0 \
  --min-prompt-length 1024 \
  --max-prompt-length 1024 \
  --image-width 512 \
  --image-height 512 \
  --image-format RGB \
  --image-num 1 \
  --tokenizer-path Qwen/Qwen2.5-VL-7B-Instruct \
  --extra-args '{"ignore_eos": true}'

If the model has already been downloaded, you can point --tokenizer-path to the local model path instead of the model id.

3.2 Using AISBench

Prerequisites: AISBench installed and its virtual environment activated (source .aisbench_venv/bin/activate). All commands run from the benchmark/ directory. SGLang multimodal server running at http://127.0.0.1:30000. AISBench does not include a built-in multimodal dataset — you must provide your own.

First, edit ais_bench/benchmark/configs/models/vllm_api/vllm_api_stream_chat.py to configure the vision model:

vllm_api_stream_chat.py

from ais_bench.benchmark.models import VLLMCustomAPIChat
from ais_bench.benchmark.utils.postprocess.model_postprocessors import extract_non_reasoning_content

models = [
    dict(
        attr="service",
        type=VLLMCustomAPIChat,
        abbr="vllm-api-stream-chat",
        path="Qwen/Qwen2.5-VL-7B-Instruct",
        model="Qwen/Qwen2.5-VL-7B-Instruct",
        stream=True,
        request_rate=0,
        use_timestamp=False,
        retry=2,
        api_key="",
        host_ip="127.0.0.1",
        host_port=30000,
        url="",
        max_out_len=256,
        batch_size=16,
        trust_remote_code=False,
        generation_kwargs=dict(
            temperature=0.01,
            ignore_eos=True,
        ),
        pred_postprocessor=dict(type=extract_non_reasoning_content),
    )
]

If the model has already been downloaded, point path to the local model path instead of the model id.

Next, download a multimodal dataset such as mmstar:

Command

# Download the mmstar dataset (from within the benchmark/ directory)
cd ais_bench/datasets
mkdir mmstar
cd mmstar
wget https://www.modelscope.cn/datasets/evalscope/MMStar/resolve/master/MMStar.tsv

Run the performance test:

Command

ais_bench --models vllm_api_stream_chat --datasets mmstar_gen -m perf

Example output (for illustration only — actual results depend on your hardware and configuration):

╒══════════════════════════╤═════════╤═════════════════╤════════════════╤═════════════════╤═════════════════╤═════════════════╤═════════════════╤═════════════════╤══════╕
│ Performance Parameters   │ Stage   │ Average         │ Min            │ Max             │ Median          │ P75             │ P90             │ P99             │  N   │
╞══════════════════════════╪═════════╪═════════════════╪════════════════╪═════════════════╪═════════════════╪═════════════════╪═════════════════╪═════════════════╪══════╡
│ E2EL                     │ total   │ 6190.9 ms       │ 5071.4 ms      │ 8464.8 ms       │ 6126.6 ms       │ 6475.2 ms       │ 6833.5 ms       │ 7897.9 ms       │ 1500 │
├──────────────────────────┼─────────┼─────────────────┼────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼──────┤
│ TTFT                     │ total   │ 693.3 ms        │ 96.0 ms        │ 2161.5 ms       │ 747.4 ms        │ 870.9 ms        │ 1032.3 ms       │ 1620.8 ms       │ 1500 │
├──────────────────────────┼─────────┼─────────────────┼────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼──────┤
│ TPOT                     │ total   │ 21.6 ms         │ 17.8 ms        │ 32.1 ms         │ 21.3 ms         │ 23.1 ms         │ 24.5 ms         │ 29.1 ms         │ 1500 │
├──────────────────────────┼─────────┼─────────────────┼────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼──────┤
│ ITL                      │ total   │ 25.5 ms         │ 0.0 ms         │ 1951.1 ms       │ 18.8 ms         │ 19.7 ms         │ 37.3 ms         │ 121.8 ms        │ 1500 │
├──────────────────────────┼─────────┼─────────────────┼────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼──────┤
│ InputTokens              │ total   │ 0.0             │ 0.0            │ 0.0             │ 0.0             │ 0.0             │ 0.0             │ 0.0             │ 1500 │
├──────────────────────────┼─────────┼─────────────────┼────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼──────┤
│ OutputTokens             │ total   │ 256.0           │ 256.0          │ 256.0           │ 256.0           │ 256.0           │ 256.0           │ 256.0           │ 1500 │
├──────────────────────────┼─────────┼─────────────────┼────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼──────┤
│ OutputTokenThroughput    │ total   │ 41.6779 token/s │ 30.243 token/s │ 50.4791 token/s │ 41.7847 token/s │ 44.6424 token/s │ 45.6484 token/s │ 46.0932 token/s │ 1500 │
╘══════════════════════════╧═════════╧═════════════════╧════════════════╧═════════════════╧═════════════════╧═════════════════╧═════════════════╧═════════════════╧══════╛
╒═════════════════════════╤═════════╤══════════════════╕
│ Common Metric           │ Stage   │ Value            │
╞═════════════════════════╪═════════╪══════════════════╡
│ Benchmark Duration      │ total   │ 582099.6816 ms   │
├─────────────────────────┼─────────┼──────────────────┤
│ Total Requests          │ total   │ 1500             │
├─────────────────────────┼─────────┼──────────────────┤
│ Failed Requests         │ total   │ 0                │
├─────────────────────────┼─────────┼──────────────────┤
│ Success Requests        │ total   │ 1500             │
├─────────────────────────┼─────────┼──────────────────┤
│ Concurrency             │ total   │ 15.9532          │
├─────────────────────────┼─────────┼──────────────────┤
│ Max Concurrency         │ total   │ 16               │
├─────────────────────────┼─────────┼──────────────────┤
│ Request Throughput      │ total   │ 2.5769 req/s     │
├─────────────────────────┼─────────┼──────────────────┤
│ Total Input Tokens      │ total   │ 0                │
├─────────────────────────┼─────────┼──────────────────┤
│ Total Generated Tokens  │ total   │ 384000           │
├─────────────────────────┼─────────┼──────────────────┤
│ Input Token Throughput  │ total   │ 0.0 token/s      │
├─────────────────────────┼─────────┼──────────────────┤
│ Output Token Throughput │ total   │ 659.6808 token/s │
├─────────────────────────┼─────────┼──────────────────┤
│ Total Token Throughput  │ total   │ 659.6808 token/s │
╘═════════════════════════╧═════════╧══════════════════╛

See the AISBench Documentation for details.

3.3 Using bench_serving (image dataset)

Set --dataset-name image for image datasets. bench_serving will generate random prompts with image inputs. Make sure the server is running at http://127.0.0.1:30000 before running the benchmark.

See the Bench Serving Guide for the full list of image-related flags.

Command

python -m sglang.bench_serving \
  --backend sglang \
  --base-url http://127.0.0.1:30000 \
  --model Qwen/Qwen2.5-VL-7B-Instruct \
  --dataset-name image \
  --random-input-len 1024 \
  --random-output-len 512 \
  --random-range-ratio 1 \
  --num-prompts 32 \
  --max-concurrency 16 \
  --image-count 1 \
  --image-resolution 720p

Example output (for illustration only — actual results depend on your hardware and configuration):

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 16
Successful requests:                     32
Benchmark duration (s):                  51.74
Total input tokens:                      73464
Total input text tokens:                 35128
Total input vision tokens:               38336
Total generated tokens:                  16384
Total generated tokens (retokenized):    9300
Request throughput (req/s):              0.62
Input token throughput (tok/s):          1419.96
Output token throughput (tok/s):         316.68
Peak output token throughput (tok/s):    800.00
Peak concurrent requests:                32
Total token throughput (tok/s):          1736.64
Concurrency:                             15.98
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   25841.84
Median E2E Latency (ms):                 25842.85
P90 E2E Latency (ms):                    26296.42
P99 E2E Latency (ms):                    26303.13
---------------Time to First Token----------------
Mean TTFT (ms):                          12211.59
Median TTFT (ms):                        14405.77
P99 TTFT (ms):                           15837.60
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          26.67
Median TPOT (ms):                        21.75
P99 TPOT (ms):                           41.89
---------------Inter-Token Latency----------------
Mean ITL (ms):                           26.67
Median ITL (ms):                         20.34
P95 ITL (ms):                            20.85
P99 ITL (ms):                            21.70
Max ITL (ms):                            11309.91
==================================================
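For multimodal runs, it is useful to see how the input tokens split between text and vision. The sketch below only rearranges the counters from the example output above; the variable names are illustrative, and the per-image figure applies to this particular run with one 720p image per request.

```python
# Counters copied from the multimodal example output.
requests = 32
total_input_tokens = 73464
text_tokens = 35128
vision_tokens = 38336

# The two parts add up to the reported total.
parts_match = text_tokens + vision_tokens == total_input_tokens

vision_per_request = vision_tokens / requests
text_per_request = text_tokens / requests
vision_share = vision_tokens / total_input_tokens

print(parts_match)                 # True
print(vision_per_request)          # 1198.0
print(text_per_request)            # 1097.75
print(f"{vision_share:.1%}")       # share of input tokens that come from images
```

In this run more than half of the prefill work comes from image tokens, which is consistent with the much higher TTFT compared with the text-only benchmark.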

4. Online Service: Embedding Model

Test Qwen/Qwen3-Embedding-8B on the embedding API endpoint.

Before running any benchmark in this section, make sure the SGLang embedding server is running with --is-embedding at http://127.0.0.1:30000. See Start SGLang server and use the Embedding tab for the launch command. AISBench does not support embedding endpoints — use bench_serving or Evalscope instead.

4.1 Using Evalscope

Prerequisites: Evalscope installed and its virtual environment activated (source .evalscope_venv/bin/activate). SGLang embedding server running with --is-embedding at http://127.0.0.1:30000.

Evalscope's perf command can load-test the embedding API directly. The command below sends 20 requests at a concurrency of 10, each with a 1024-token random prompt:

Command

evalscope perf \
  --parallel 10 \
  --number 20 \
  --model Qwen/Qwen3-Embedding-8B \
  --url http://127.0.0.1:30000/v1/embeddings \
  --api openai_embedding \
  --dataset random_embedding \
  --min-prompt-length 1024 \
  --max-prompt-length 1024 \
  --tokenizer-path Qwen/Qwen3-Embedding-8B

If the model has already been downloaded, you can point --tokenizer-path to the local model path instead of the model id.

Evalscope’s support for embedding performance testing may vary by version. If the perf command does not accept the embeddings endpoint, fall back to bench_serving with --backend sglang-embedding, described in section 4.2.

4.2 Using bench_serving (embedding backend)

bench_serving is built into SGLang. Use --backend sglang-embedding to target the /v1/embeddings endpoint. Make sure the server is running with --is-embedding at http://127.0.0.1:30000.

Command

python -m sglang.bench_serving \
  --backend sglang-embedding \
  --base-url http://127.0.0.1:30000 \
  --model Qwen/Qwen3-Embedding-8B \
  --dataset-name random \
  --random-input-len 512 \
  --random-output-len 0 \
  --num-prompts 1000 \
  --max-concurrency 64 \
  --request-rate 32

--dataset-name random samples token IDs from the ShareGPT dataset, so the first run downloads ShareGPT from Hugging Face automatically. If Hugging Face is not reachable from your network, set export HF_ENDPOINT=https://hf-mirror.com before running the benchmark. Set --random-output-len 0 for embedding benchmarks, because no output tokens are generated.

Example output (for illustration only — actual results depend on your hardware and configuration):

============ Serving Benchmark Result ============
Backend:                                 sglang-embedding
Traffic request rate:                    32.0
Max request concurrency:                 64
Successful requests:                     1000
Benchmark duration (s):                  31.86
Total input tokens:                      257891
Total input text tokens:                 257891
Request throughput (req/s):              31.39
Input token throughput (tok/s):          8094.67
Peak concurrent requests:                62
Concurrency:                             6.67
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   212.34
Median E2E Latency (ms):                 160.97
P90 E2E Latency (ms):                    267.31
P99 E2E Latency (ms):                    1445.94
==================================================
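The summary figures in this report are related by simple arithmetic, which is a useful consistency check when comparing runs. The sketch below recomputes them from the illustrative numbers above. It assumes the reported Concurrency is the average number of in-flight requests, so that Little's law (concurrency ≈ throughput × mean latency) applies.

```python
# Figures copied from the illustrative report above; they are not measurements.
successful_requests = 1000
duration_s = 31.86
total_input_tokens = 257891
mean_e2e_ms = 212.34

# Throughput figures are totals divided by the benchmark duration.
request_throughput = successful_requests / duration_s
input_tok_per_s = total_input_tokens / duration_s

# Little's law: average in-flight requests = arrival rate * mean latency.
avg_concurrency = request_throughput * (mean_e2e_ms / 1000.0)

# Average prompt length actually sent to the server.
mean_prompt_tokens = total_input_tokens / successful_requests

print(f"Request throughput: {request_throughput:.2f} req/s")
print(f"Input token throughput: {input_tok_per_s:.0f} tok/s")
print(f"Average concurrency: {avg_concurrency:.2f}")
print(f"Mean prompt length: {mean_prompt_tokens:.1f} tokens")
```

Two observations follow from these numbers. The average concurrency is far below --max-concurrency 64, which suggests that --request-rate 32, not the concurrency cap, limited this run. The mean prompt length is roughly half of --random-input-len 512, which suggests that prompt lengths were sampled up to that value rather than fixed at it.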

5. Offline Performance Testing

SGLang’s Engine API runs inference in-process, without an HTTP server, letting you measure maximum throughput. bench_offline_throughput is built into SGLang and requires no extra installation or running server.

bench_offline_throughput currently only supports text-generation (LLM) benchmarks. Multimodal and embedding models are not supported.

5.1 Using bench_offline_throughput

bench_offline_throughput uses the Engine API internally and measures pure inference throughput without HTTP overhead:

Command

python -m sglang.bench_offline_throughput \
  --model-path Qwen/Qwen2.5-7B-Instruct \
  --dataset-name random \
  --random-input-len 1024 \
  --random-output-len 512 \
  --num-prompts 500

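--dataset-name random with --random-input-len and --random-output-len lets you control input and output token counts directly. Synthetic random data removes the variance that real datasets introduce, which makes throughput comparisons across runs more repeatable. Depending on your SGLang version and the --random-range-ratio setting, these values may act as upper bounds on sampled lengths rather than exact lengths. Check python -m sglang.bench_offline_throughput --help to confirm.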
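To illustrate what an offline throughput measurement involves, here is a minimal sketch of timing a batch of in-process generations. It is not how bench_offline_throughput is implemented. The measure_throughput helper is hypothetical, and the Engine wiring shown in the comments assumes that sglang.Engine.generate returns one dict per prompt with a meta_info["completion_tokens"] field, which you should verify for your version.

```python
import time


def measure_throughput(generate_fn, prompts):
    """Time one batched call and report output-token throughput.

    generate_fn is any callable that takes a list of prompts and returns
    one dict per prompt containing meta_info["completion_tokens"]
    (assumed output shape).
    """
    start = time.perf_counter()
    outputs = generate_fn(prompts)
    elapsed_s = time.perf_counter() - start

    output_tokens = sum(o["meta_info"]["completion_tokens"] for o in outputs)
    return {
        "requests": len(outputs),
        "output_tokens": output_tokens,
        "elapsed_s": elapsed_s,
        "output_tok_per_s": output_tokens / elapsed_s if elapsed_s > 0 else float("inf"),
    }


# Possible wiring to the Engine API (ASSUMPTION, not verified here):
#
#   import sglang as sgl
#   engine = sgl.Engine(model_path="Qwen/Qwen2.5-7B-Instruct")
#   stats = measure_throughput(
#       lambda batch: engine.generate(batch, {"max_new_tokens": 512}),
#       ["Explain what an NPU is."] * 32,
#   )
#   engine.shutdown()
#   print(stats)
```

Passing the generation call as a function keeps the timing logic independent of the engine, so the same helper can be exercised with a stub before running it on real hardware.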
