This page walks through performance testing your SGLang deployment on Ascend NPUs. We cover three model types: text generation (Qwen/Qwen2.5-7B-Instruct), multimodal vision (Qwen/Qwen2.5-VL-7B-Instruct), and embedding (Qwen/Qwen3-Embedding-8B), in both online and offline serving modes. You can use Evalscope, AISBench, or SGLang's built-in benchmarking tools.
The benchmark output examples in this guide are for illustration only. Actual performance depends on your hardware (e.g., Atlas 800I A2 vs A3), model version, SGLang version, and deployment configuration. Always run benchmarks on your own hardware to obtain accurate performance data.

1. Prepare

1.1 Start SGLang server

Launch the server with the appropriate flags for each model type. Make sure SGLang is installed first; see Ascend NPU Quickstart for environment setup.
Command
# SGLang downloads the model automatically. If the model is already downloaded, set --model-path to the local path instead.
sglang serve --model-path Qwen/Qwen2.5-7B-Instruct
Add & at the end of the command to run the server in the background, or open a new terminal to run the benchmark commands in the following sections.
The server binds to http://127.0.0.1:30000 by default. All online benchmarks below assume the server is running at that address. The --is-embedding flag is required for embedding models.
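Before starting a benchmark, it can help to confirm that the server is accepting requests. The sketch below polls a /health endpoint until it answers or a deadline passes. The endpoint path, the helper name wait_for_server, and the timeout values are assumptions for illustration; adjust them to your deployment.

```python
import time
import urllib.request


def wait_for_server(
    base_url: str,
    timeout_s: float = 60.0,
    interval_s: float = 2.0,
    request_timeout_s: float = 5.0,
) -> bool:
    """Poll base_url + '/health' until it returns HTTP 200 or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # Assumption: the server exposes GET /health and answers 200 when ready.
            with urllib.request.urlopen(f"{base_url}/health", timeout=request_timeout_s) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            # Connection refused, timeout, or HTTP error: not ready yet.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

Calling wait_for_server("http://127.0.0.1:30000") returns True once the server responds and False if the deadline passes first.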

1.2 Install benchmarking tools

bench_serving and bench_offline_throughput are built into SGLang and require no extra installation. For Evalscope and AISBench, set up each in its own virtual environment:
Command
python3 -m venv .evalscope_venv
source .evalscope_venv/bin/activate
# Quote the extra so shells such as zsh do not expand the brackets.
pip install 'evalscope[perf]' -U

2. Online Service: Text Generation Model

Test Qwen/Qwen2.5-7B-Instruct via the online serving endpoint.
Before running any benchmark in this section, make sure the SGLang text-generation server is running at http://127.0.0.1:30000. See Start SGLang server for the launch command.
For performance testing, prefer random datasets (--dataset random, --dataset-name random) over real datasets. Random datasets let you pin --min-prompt-length / --max-prompt-length and --min-tokens / --max-tokens to fixed values, producing consistent, repeatable results. Real datasets (ShareGPT, openqa, etc.) have variable input lengths that add noise and make cross-run comparisons unreliable.

2.1 Using Evalscope

Prerequisites: Evalscope installed and its virtual environment activated (source .evalscope_venv/bin/activate). SGLang server running at http://127.0.0.1:30000.
Run a performance test against the server:
Command
evalscope perf \
  --parallel 10 \
  --number 20 \
  --model Qwen/Qwen2.5-7B-Instruct \
  --url http://127.0.0.1:30000/v1/chat/completions \
  --api openai \
  --dataset random \
  --max-tokens 1024 \
  --min-tokens 1024 \
  --prefix-length 0 \
  --min-prompt-length 1024 \
  --max-prompt-length 1024 \
  --tokenizer-path Qwen/Qwen2.5-7B-Instruct \
  --extra-args '{"ignore_eos": true}'
If the model has already been downloaded, you can point --tokenizer-path to the local model path instead of the model id.
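A single run at one concurrency level says little about how the server scales. The sketch below builds the same evalscope perf command for several --parallel values, using only the flags shown above. The helper name build_evalscope_cmd and the chosen concurrency levels are hypothetical, and the script prints the commands rather than running them.

```python
import shlex


def build_evalscope_cmd(parallel: int, number: int) -> list[str]:
    """Return the argv list for one `evalscope perf` run at a given concurrency."""
    model = "Qwen/Qwen2.5-7B-Instruct"
    return [
        "evalscope", "perf",
        "--parallel", str(parallel),
        "--number", str(number),
        "--model", model,
        "--url", "http://127.0.0.1:30000/v1/chat/completions",
        "--api", "openai",
        "--dataset", "random",
        "--min-tokens", "1024",
        "--max-tokens", "1024",
        "--prefix-length", "0",
        "--min-prompt-length", "1024",
        "--max-prompt-length", "1024",
        "--tokenizer-path", model,
        "--extra-args", '{"ignore_eos": true}',
    ]


# Illustrative sweep: two requests per worker, the same ratio as
# --parallel 10 / --number 20 in the command above.
for parallel in (1, 4, 16):
    cmd = build_evalscope_cmd(parallel, number=2 * parallel)
    print(shlex.join(cmd))
    # To execute instead of printing: subprocess.run(cmd, check=True)
```

Keeping the prompt and output lengths fixed across the sweep means that differences between runs come from the concurrency level alone.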
Example output (for illustration only; actual results depend on your hardware and configuration):
Benchmarking summary:
┌────────────────────────────┬─────────────┐
│ Metric                     │       Value │
├────────────────────────────┼─────────────┤
│ ── General ──              │             │
│ Test Duration (s)          │       89.34 │
│ Concurrency                │          10 │
│ Request Rate (req/s)       │       -1.00 │
│ Total / Success / Failed   │ 20 / 20 / 0 │
│ Req Throughput (req/s)     │        0.22 │
│ ── Latency ──              │             │
│ Avg Latency (s)            │       44.67 │
│ TTFT (ms)                  │      578.51 │
│ TPOT (ms)                  │       43.10 │
│ ITL (ms)                   │       43.12 │
│ ── Tokens ──               │             │
│ Avg Input Tokens           │     1024.00 │
│ Avg Output Tokens          │     1024.00 │
│ Output Throughput (tok/s)  │      229.24 │
│ Total Throughput (tok/s)   │      458.49 │
│ ── Speculative Decoding ── │             │
│ Decoded Tok/Iter           │        1.00 │
│ Spec. Accept Rate          │        0.00 │
└────────────────────────────┴─────────────┘

Percentile results:
┌────────────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┐
│ Metric         │      1% │      5% │     10% │     25% │     50% │     75% │     90% │     95% │     99% │
├────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ Latency (s)    │   44.47 │   44.47 │   44.47 │   44.47 │   44.86 │   44.86 │   44.86 │   44.86 │   44.86 │
│ TTFT (ms)      │  138.12 │  142.07 │  426.17 │  426.87 │  783.67 │  785.26 │  786.85 │  787.97 │  787.97 │
│ ITL (ms)       │   41.84 │   42.14 │   42.22 │   42.36 │   42.57 │   42.80 │   42.99 │   49.24 │   49.84 │
│ TPOT (ms)      │   42.71 │   42.71 │   42.71 │   43.05 │   43.08 │   43.43 │   43.43 │   43.71 │   43.71 │
│ Input tokens   │ 1024.00 │ 1024.00 │ 1024.00 │ 1024.00 │ 1024.00 │ 1024.00 │ 1024.00 │ 1024.00 │ 1024.00 │
│ Output tokens  │ 1024.00 │ 1024.00 │ 1024.00 │ 1024.00 │ 1024.00 │ 1024.00 │ 1024.00 │ 1024.00 │ 1024.00 │
│ Output (tok/s) │   22.83 │   22.83 │   22.83 │   22.83 │   23.02 │   23.03 │   23.03 │   23.03 │   23.03 │
│ Total (tok/s)  │   45.65 │   45.65 │   45.65 │   45.65 │   46.05 │   46.05 │   46.05 │   46.05 │   46.05 │
│ Decode (tok/s) │   22.88 │   23.03 │   23.03 │   23.07 │   23.21 │   23.42 │   23.42 │   23.42 │   23.42 │
└────────────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┘
...
See the Evalscope Performance Testing Guide for full details.

2.2 Using AISBench

Prerequisites: AISBench installed and its virtual environment activated (source .aisbench_venv/bin/activate). All commands must be run from the benchmark/ directory. SGLang server running at http://127.0.0.1:30000. Set stream=True and ignore_eos=True in the model config for accurate results.
Two files need to be configured for performance testing. First, describe the model and server settings in ais_bench/benchmark/configs/models/vllm_api/vllm_api_stream_chat.py:
vllm_api_stream_chat.py
# more details: https://ais-bench-benchmark.readthedocs.io/en/latest/base_tutorials/scenes_intro/performance_benchmark.html
from ais_bench.benchmark.models import VLLMCustomAPIChat
from ais_bench.benchmark.utils.postprocess.model_postprocessors import extract_non_reasoning_content

models = [
    dict(
        attr="service",
        type=VLLMCustomAPIChat,
        abbr="vllm-api-stream-chat",
        path="Qwen/Qwen2.5-7B-Instruct",
        model="Qwen/Qwen2.5-7B-Instruct",
        stream=True,
        request_rate=0,
        use_timestamp=False,
        retry=2,
        api_key="",
        host_ip="127.0.0.1",
        host_port=30000,
        url="",
        max_out_len=512,
        batch_size=32,
        trust_remote_code=False,
        generation_kwargs=dict(
            temperature=0.01,
            ignore_eos=True,
        ),
        pred_postprocessor=dict(type=extract_non_reasoning_content),
    )
]
If the model has already been downloaded, point path to the local model path instead of the model id.
Second, configure random prompt lengths in ais_bench/datasets/synthetic/synthetic_config.py:
synthetic_config.py
# more details: https://ais-bench-benchmark.readthedocs.io/en/latest/advanced_tutorials/synthetic_dataset.html
synthetic_config = {
    "Type":"tokenid",
    "RequestCount": 10,
    "TrustRemoteCode": False,
    "StringConfig" : {
        "Input" : {
            "Method": "uniform",
            "Params": {"MinValue": 1, "MaxValue": 200}
        },
        "Output" : {
            "Method": "gaussian",
            "Params": {"Mean": 100, "Var": 200, "MinValue": 1, "MaxValue": 100}
        }
    },
    "TokenIdConfig" : {
        "RequestSize": 10,
        "PrefixLen": 0
    }
}
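To get a feel for the lengths that the StringConfig block describes, the sketch below draws samples from the same two distributions. It is an interpretation of the Method and Params fields, not AISBench's implementation: it assumes that "Var" is the variance and that out-of-range draws are clamped to [MinValue, MaxValue]. The helper name sample_length is hypothetical.

```python
import math
import random


def sample_length(spec: dict, rng: random.Random) -> int:
    """Draw one length from a {'Method', 'Params'} spec shaped like the config above."""
    params = spec["Params"]
    lo, hi = params["MinValue"], params["MaxValue"]
    if spec["Method"] == "uniform":
        return rng.randint(lo, hi)  # inclusive on both ends
    if spec["Method"] == "gaussian":
        # Assumption: "Var" is the variance, so the standard deviation is sqrt(Var).
        value = rng.gauss(params["Mean"], math.sqrt(params["Var"]))
        # Assumption: out-of-range draws are clamped to [MinValue, MaxValue].
        return int(min(max(round(value), lo), hi))
    raise ValueError(f"unsupported method: {spec['Method']}")


input_spec = {"Method": "uniform", "Params": {"MinValue": 1, "MaxValue": 200}}
output_spec = {
    "Method": "gaussian",
    "Params": {"Mean": 100, "Var": 200, "MinValue": 1, "MaxValue": 100},
}

rng = random.Random(0)
pairs = [(sample_length(input_spec, rng), sample_length(output_spec, rng)) for _ in range(10)]
print(pairs)
```

With Mean equal to MaxValue, roughly half of the gaussian draws land on the upper bound under this clamping assumption, so output lengths still vary from request to request. For run-to-run comparisons, fixed lengths are easier to reason about.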
Run with a synthetic dataset:
Command
ais_bench --models vllm_api_stream_chat --datasets synthetic_gen_string -m perf
Example output (for illustration only; actual results depend on your hardware and configuration):
╒══════════════════════════╤═════════╤═════════════════╤═════════════════╤═════════════════╤═════════════════╤═════════════════╤═════════════════╤═════════════════╤═════╕
│ Performance Parameters   │ Stage   │ Average         │ Min             │ Max             │ Median          │ P75             │ P90             │ P99             │  N  │
╞══════════════════════════╪═════════╪═════════════════╪═════════════════╪═════════════════╪═════════════════╪═════════════════╪═════════════════╪═════════════════╪═════╡
│ E2EL                     │ total   │ 3896.4 ms       │ 3081.6 ms       │ 4175.3 ms       │ 4013.8 ms       │ 4123.4 ms       │ 4137.1 ms       │ 4171.5 ms       │ 10  │
├──────────────────────────┼─────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────┤
│ TTFT                     │ total   │ 411.6 ms        │ 346.7 ms        │ 439.7 ms        │ 416.3 ms        │ 426.6 ms        │ 434.4 ms        │ 439.2 ms        │ 10  │
├──────────────────────────┼─────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────┤
│ TPOT                     │ total   │ 38.3 ms         │ 37.4 ms         │ 39.0 ms         │ 38.3 ms         │ 38.7 ms         │ 38.9 ms         │ 39.0 ms         │ 10  │
├──────────────────────────┼─────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────┤
│ ITL                      │ total   │ 38.7 ms         │ 0.0 ms          │ 156.5 ms        │ 38.9 ms         │ 39.0 ms         │ 39.2 ms         │ 117.1 ms        │ 10  │
├──────────────────────────┼─────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────┤
│ InputTokens              │ total   │ 123.4           │ 34.0            │ 228.0           │ 130.5           │ 170.5           │ 217.2           │ 226.92          │ 10  │
├──────────────────────────┼─────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────┤
│ OutputTokens             │ total   │ 92.1            │ 69.0            │ 100.0           │ 95.0            │ 99.75           │ 100.0           │ 100.0           │ 10  │
├──────────────────────────┼─────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────┤
│ OutputTokenThroughput    │ total   │ 23.5937 token/s │ 22.3912 token/s │ 24.2616 token/s │ 23.7399 token/s │ 23.9919 token/s │ 24.2027 token/s │ 24.2557 token/s │ 10  │
╘══════════════════════════╧═════════╧═════════════════╧═════════════════╧═════════════════╧═════════════════╧═════════════════╧═════════════════╧═════════════════╧═════╛
╒══════════════════════════╤═════════╤══════════════════╕
│ Common Metric            │ Stage   │ Value            │
╞══════════════════════════╪═════════╪══════════════════╡
│ Benchmark Duration       │ total   │ 4175.4485 ms     │
├──────────────────────────┼─────────┼──────────────────┤
│ Total Requests           │ total   │ 10               │
├──────────────────────────┼─────────┼──────────────────┤
│ Failed Requests          │ total   │ 0                │
├──────────────────────────┼─────────┼──────────────────┤
│ Success Requests         │ total   │ 10               │
├──────────────────────────┼─────────┼──────────────────┤
│ Concurrency              │ total   │ 9.3317           │
├──────────────────────────┼─────────┼──────────────────┤
│ Max Concurrency          │ total   │ 32               │
├──────────────────────────┼─────────┼──────────────────┤
│ Request Throughput       │ total   │ 2.395 req/s      │
├──────────────────────────┼─────────┼──────────────────┤
│ Total Input Tokens       │ total   │ 1234             │
├──────────────────────────┼─────────┼──────────────────┤
│ Prefill Token Throughput │ total   │ 299.8329 token/s │
├──────────────────────────┼─────────┼──────────────────┤
│ Total Generated Tokens   │ total   │ 921              │
├──────────────────────────┼─────────┼──────────────────┤
│ Input Token Throughput   │ total   │ 295.5371 token/s │
├──────────────────────────┼─────────┼──────────────────┤
│ Output Token Throughput  │ total   │ 220.5751 token/s │
├──────────────────────────┼─────────┼──────────────────┤
│ Total Token Throughput   │ total   │ 516.1122 token/s │
╘══════════════════════════╧═════════╧══════════════════╛
See the AISBench Documentation for details.

2.3 Using bench_serving

SGLang's built-in bench_serving requires no extra installation. Make sure the server is running at http://127.0.0.1:30000 before running the benchmark.
See the Bench Serving Guide for all backends, datasets, and advanced options.
Command
python -m sglang.bench_serving \
  --backend sglang-oai \
  --base-url http://127.0.0.1:30000 \
  --model Qwen/Qwen2.5-7B-Instruct \
  --dataset-name random \
  --random-input-len 1024 \
  --random-output-len 512 \
  --random-range-ratio 1 \
  --num-prompts 100 \
  --max-concurrency 32
--dataset-name random samples token IDs from the ShareGPT dataset to generate realistic input; the first run downloads ShareGPT from Hugging Face automatically. If Hugging Face is not reachable from your network, set export HF_ENDPOINT=https://hf-mirror.com to use a mirror.
Set --random-range-ratio 1 for fixed input/output lengths (recommended for consistent comparisons), or leave it at 0 (the default) to draw lengths from a uniform distribution. Add --request-rate to control the request rate.
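The effect of --random-range-ratio can be sketched in a few lines. The example assumes that each length is drawn uniformly between ratio × length and the full length; the exact sampling code in bench_serving may differ between SGLang versions, and sample_lengths is a hypothetical helper, not part of SGLang.

```python
import random


def sample_lengths(full_len: int, range_ratio: float, num_prompts: int, seed: int = 0) -> list[int]:
    """Sample per-request lengths uniformly from [full_len * range_ratio, full_len]."""
    rng = random.Random(seed)
    low = max(int(full_len * range_ratio), 1)  # never ask for fewer than 1 token
    return [rng.randint(low, full_len) for _ in range(num_prompts)]


fixed = sample_lengths(1024, range_ratio=1.0, num_prompts=5)
varied = sample_lengths(1024, range_ratio=0.0, num_prompts=5)
print(fixed)   # every entry is 1024
print(varied)  # entries fall anywhere between 1 and 1024
```

With a ratio of 1 the lower and upper bounds coincide, which is why every request in the command above uses exactly 1024 input tokens and 512 output tokens.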
Example output (for illustration only; actual results depend on your hardware and configuration):
============ Serving Benchmark Result ============
Backend:                                 sglang-oai
Traffic request rate:                    inf
Max request concurrency:                 32
Successful requests:                     100
Benchmark duration (s):                  47.51
Total input tokens:                      102400
Total input text tokens:                 102400
Total generated tokens:                  51200
Total generated tokens (retokenized):    51195
Request throughput (req/s):              2.10
Input token throughput (tok/s):          2155.35
Output token throughput (tok/s):         1077.68
Peak output token throughput (tok/s):    1587.00
Peak concurrent requests:                64
Total token throughput (tok/s):          3233.03
Concurrency:                             26.93
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   12793.49
Median E2E Latency (ms):                 12940.17
P90 E2E Latency (ms):                    13049.86
P99 E2E Latency (ms):                    13051.61
---------------Time to First Token----------------
Mean TTFT (ms):                          1423.99
Median TTFT (ms):                        1489.29
P99 TTFT (ms):                           2325.56
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          22.25
Median TPOT (ms):                        22.22
P99 TPOT (ms):                           25.08
---------------Inter-Token Latency----------------
Mean ITL (ms):                           22.26
Median ITL (ms):                         20.74
P95 ITL (ms):                            21.40
P99 ITL (ms):                            23.62
Max ITL (ms):                            2229.30
==================================================
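To compare several runs, it can be convenient to turn the printed summary into a dictionary. The sketch below assumes the "Label: value" layout shown in the example output; parse_bench_output and LINE_RE are hypothetical helper names, not part of SGLang.

```python
import re

# Matches "Label:   value" lines such as "Mean TTFT (ms):   1423.99".
LINE_RE = re.compile(r"^(?P<label>[^:]+):\s+(?P<value>.+?)\s*$")


def parse_bench_output(text: str) -> dict:
    """Map each 'Label: value' line to a float where possible, else keep the string."""
    metrics = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # banner lines made of "=" or "-" carry no value
        label, raw = match.group("label").strip(), match.group("value")
        try:
            metrics[label] = float(raw)  # "inf" parses to float infinity
        except ValueError:
            metrics[label] = raw  # e.g. the backend name
    return metrics


sample = """\
============ Serving Benchmark Result ============
Backend:                                 sglang-oai
Traffic request rate:                    inf
Successful requests:                     100
Benchmark duration (s):                  47.51
Mean TTFT (ms):                          1423.99
==================================================
"""
metrics = parse_bench_output(sample)
print(metrics["Mean TTFT (ms)"])
```

Because the layout is fixed in the source, a parser like this is fragile across SGLang versions; re-check the labels after upgrading.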

SGLang Serving Benchmark Result: Complete Reference

The output format is hardcoded in bench_serving.py. All formatting decisions, including column widths, alignment, and decimal precision, are statically defined in the source and cannot be changed via command-line arguments.
Test Configuration
| Parameter | Description |
| --- | --- |
| Backend | The serving backend under test (e.g., sglang, vllm). |
| Traffic request rate | Request generation rate in req/s. inf means maximum rate (concurrency-bounded). trace indicates trace timestamp mode. A fixed value enforces constant inter-arrival time. |
| Max request concurrency | Maximum number of concurrent requests from the client side. Displays not set when unspecified. |
Core Statistics & Throughput Metrics
| Parameter | Description | Format Specification |
| --- | --- | --- |
| Successful requests | Total number of successfully completed requests (HTTP 200, no generation errors). | Integer, no decimal places |
| Benchmark duration (s) | Total elapsed time from first request sent to last response fully received (seconds). | 2 decimal places |
| Total input tokens | Total number of input (prompt) tokens across all requests, counted by server-side tokenizer. | Integer, no decimal places |
| Total input text tokens | Same as Total input tokens. For multimodal inputs, this may differ. | Integer, no decimal places |
| Total generated tokens | Total number of output tokens actually generated by the server (server-side tokenizer count). | Integer, no decimal places |
| Total generated tokens (retokenized) | Output text re-tokenized by the client using its own tokenizer. A large discrepancy indicates tokenizer mismatch or special tokens in output. | Integer, no decimal places |
| Request throughput (req/s) | Number of successful requests processed per second. Formula: Successful requests / Benchmark duration (s). | 2 decimal places |
| Input token throughput (tok/s) | Number of input tokens processed per second. Formula: Total input tokens / Benchmark duration (s). | 2 decimal places |
| Output token throughput (tok/s) | Number of output tokens generated per second. Formula: Total generated tokens / Benchmark duration (s). | 2 decimal places |
| Peak output token throughput (tok/s) | Observed instantaneous peak output token generation rate during the test (computed over a sliding window). | 2 decimal places |
| Peak concurrent requests | Maximum number of requests being processed simultaneously on the server side. May exceed client-side Max request concurrency due to queueing. | Integer, no decimal places |
| Total token throughput (tok/s) | Sum of input and output token throughputs. Formula: Input token throughput + Output token throughput. | 2 decimal places |
| Concurrency | Average number of concurrent requests during the test (Little's Law). Formula: Sum of all E2E latencies / Benchmark duration. | 2 decimal places |
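The formulas in this table can be checked against the example bench_serving output in this guide. The sketch below recomputes the derived metrics from the reported totals; the variable names are illustrative.

```python
# Values copied from the example bench_serving output in this guide.
successful_requests = 100
duration_s = 47.51
total_input_tokens = 102400
total_generated_tokens = 51200
mean_e2e_latency_ms = 12793.49

request_throughput = successful_requests / duration_s
input_throughput = total_input_tokens / duration_s
output_throughput = total_generated_tokens / duration_s
total_throughput = input_throughput + output_throughput

# Sum of all E2E latencies = mean latency * number of requests (converted to seconds).
concurrency = (mean_e2e_latency_ms / 1000 * successful_requests) / duration_s

print(f"Request throughput: {request_throughput:.2f} req/s")
print(f"Input throughput:   {input_throughput:.2f} tok/s")
print(f"Output throughput:  {output_throughput:.2f} tok/s")
print(f"Total throughput:   {total_throughput:.2f} tok/s")
print(f"Concurrency:        {concurrency:.2f}")
```

The printed duration is rounded to two decimals, so the recomputed throughputs can differ from the reported values in the last digit.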
End-to-End Latency (E2E Latency)
| Statistic | Description | Format |
| --- | --- | --- |
| Mean E2E Latency (ms) | Arithmetic mean | 2 decimal places |
| Median E2E Latency (ms) | 50th percentile | 2 decimal places |
| P90 E2E Latency (ms) | 90th percentile (90% of requests have latency ≤ this value) | 2 decimal places |
| P99 E2E Latency (ms) | 99th percentile | 2 decimal places |
Time to First Token (TTFT)
| Statistic | Description | Format |
| --- | --- | --- |
| Mean TTFT (ms) | Arithmetic mean | 2 decimal places |
| Median TTFT (ms) | 50th percentile | 2 decimal places |
| P99 TTFT (ms) | 99th percentile | 2 decimal places |
Time per Output Token (TPOT), Excluding First Token
Formula: (E2E Latency - TTFT) / (Number of output tokens - 1)
| Statistic | Description | Format |
| --- | --- | --- |
| Mean TPOT (ms) | Arithmetic mean | 2 decimal places |
| Median TPOT (ms) | 50th percentile | 2 decimal places |
| P99 TPOT (ms) | 99th percentile | 2 decimal places |
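As a worked instance of the TPOT formula, the sketch below applies it to the mean E2E latency and mean TTFT from the example output. The function name tpot_ms is illustrative.

```python
def tpot_ms(e2e_latency_ms: float, ttft_ms: float, output_tokens: int) -> float:
    """Time per output token, excluding the first token, in milliseconds."""
    if output_tokens < 2:
        raise ValueError("TPOT needs at least two output tokens")
    return (e2e_latency_ms - ttft_ms) / (output_tokens - 1)


# Mean values from the example output; every request generated 512 tokens.
print(f"{tpot_ms(12793.49, 1423.99, 512):.2f} ms")  # prints "22.25 ms"
```

Applying the formula to mean values reproduces the reported Mean TPOT here only because every request generated the same number of tokens. With variable output lengths, compute TPOT per request and then average.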
Inter-Token Latency (ITL)
| Statistic | Description | Format |
| --- | --- | --- |
| Mean ITL (ms) | Average inter-token interval | 2 decimal places |
| Median ITL (ms) | 50th percentile inter-token interval | 2 decimal places |
| P95 ITL (ms) | 95th percentile (used to detect stalls) | 2 decimal places |
| P99 ITL (ms) | 99th percentile | 2 decimal places |
| Max ITL (ms) | Maximum observed inter-token interval; useful for identifying severe blocking events | 2 decimal places |

3. Online Service: Multimodal Model

Test Qwen/Qwen2.5-VL-7B-Instruct for vision-language tasks.
Before running any benchmark in this section, make sure the SGLang multimodal server is running at http://127.0.0.1:30000. See Start SGLang server for the launch command, using the multimodal model (Qwen/Qwen2.5-VL-7B-Instruct) as the --model-path.
For consistent, repeatable results, set --random-range-ratio 1 to fix input/output lengths, or 0 (default) for uniform distribution.

3.1 Using Evalscope

Prerequisites: Evalscope installed and its virtual environment activated (source .evalscope_venv/bin/activate). SGLang multimodal server running at http://127.0.0.1:30000.
Evalscope's perf tool uses the OpenAI-compatible /v1/chat/completions endpoint. Use --dataset random_vl for randomized multimodal data with image generation:
Command
evalscope perf \
  --parallel 10 \
  --number 20 \
  --model Qwen/Qwen2.5-VL-7B-Instruct \
  --url http://127.0.0.1:30000/v1/chat/completions \
  --api openai \
  --dataset random_vl \
  --min-tokens 1024 \
  --max-tokens 1024 \
  --prefix-length 0 \
  --min-prompt-length 1024 \
  --max-prompt-length 1024 \
  --image-width 512 \
  --image-height 512 \
  --image-format RGB \
  --image-num 1 \
  --tokenizer-path Qwen/Qwen2.5-VL-7B-Instruct \
  --extra-args '{"ignore_eos": true}'
If the model has already been downloaded, you can point --tokenizer-path to the local model path instead of the model id.

3.2 Using AISBench

Prerequisites: AISBench installed and its virtual environment activated (source .aisbench_venv/bin/activate). All commands must be run from the benchmark/ directory. SGLang multimodal server running at http://127.0.0.1:30000. AISBench does not include a built-in multimodal dataset, so you must provide your own.
First, edit ais_bench/benchmark/configs/models/vllm_api/vllm_api_stream_chat.py to configure the vision model:
vllm_api_stream_chat.py
from ais_bench.benchmark.models import VLLMCustomAPIChat
from ais_bench.benchmark.utils.postprocess.model_postprocessors import extract_non_reasoning_content

models = [
    dict(
        attr="service",
        type=VLLMCustomAPIChat,
        abbr="vllm-api-stream-chat",
        path="Qwen/Qwen2.5-VL-7B-Instruct",
        model="Qwen/Qwen2.5-VL-7B-Instruct",
        stream=True,
        request_rate=0,
        use_timestamp=False,
        retry=2,
        api_key="",
        host_ip="127.0.0.1",
        host_port=30000,
        url="",
        max_out_len=256,
        batch_size=16,
        trust_remote_code=False,
        generation_kwargs=dict(
            temperature=0.01,
            ignore_eos=True,
        ),
        pred_postprocessor=dict(type=extract_non_reasoning_content),
    )
]
If the model has already been downloaded, point path to the local model path instead of the model id.
Next, download a multimodal dataset such as mmstar:
Command
# Download the mmstar dataset (from within the benchmark/ directory)
cd ais_bench/datasets
mkdir mmstar
cd mmstar
wget https://www.modelscope.cn/datasets/evalscope/MMStar/resolve/master/MMStar.tsv
Run the performance test:
Command
ais_bench --models vllm_api_stream_chat --datasets mmstar_gen -m perf
Example output (for illustration only β€” actual results depend on your hardware and configuration):
╒══════════════════════════╀═════════╀═════════════════╀════════════════╀═════════════════╀═════════════════╀═════════════════╀═════════════════╀═════════════════╀══════╕
β”‚ Performance Parameters   β”‚ Stage   β”‚ Average         β”‚ Min            β”‚ Max             β”‚ Median          β”‚ P75             β”‚ P90             β”‚ P99             β”‚  N   β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•ͺ═════════β•ͺ═════════════════β•ͺ════════════════β•ͺ═════════════════β•ͺ═════════════════β•ͺ═════════════════β•ͺ═════════════════β•ͺ═════════════════β•ͺ══════║
β”‚ E2EL                     β”‚ total   β”‚ 6190.9 ms       β”‚ 5071.4 ms      β”‚ 8464.8 ms       β”‚ 6126.6 ms       β”‚ 6475.2 ms       β”‚ 6833.5 ms       β”‚ 7897.9 ms       β”‚ 1500 β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€
β”‚ TTFT                     β”‚ total   β”‚ 693.3 ms        β”‚ 96.0 ms        β”‚ 2161.5 ms       β”‚ 747.4 ms        β”‚ 870.9 ms        β”‚ 1032.3 ms       β”‚ 1620.8 ms       β”‚ 1500 β”‚
├──────────────────────────┼─────────┼─────────────────┼────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼──────┤
│ TPOT                     │ total   │ 21.6 ms         │ 17.8 ms        │ 32.1 ms         │ 21.3 ms         │ 23.1 ms         │ 24.5 ms         │ 29.1 ms         │ 1500 │
├──────────────────────────┼─────────┼─────────────────┼────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼──────┤
│ ITL                      │ total   │ 25.5 ms         │ 0.0 ms         │ 1951.1 ms       │ 18.8 ms         │ 19.7 ms         │ 37.3 ms         │ 121.8 ms        │ 1500 │
├──────────────────────────┼─────────┼─────────────────┼────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼──────┤
│ InputTokens              │ total   │ 0.0             │ 0.0            │ 0.0             │ 0.0             │ 0.0             │ 0.0             │ 0.0             │ 1500 │
├──────────────────────────┼─────────┼─────────────────┼────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼──────┤
│ OutputTokens             │ total   │ 256.0           │ 256.0          │ 256.0           │ 256.0           │ 256.0           │ 256.0           │ 256.0           │ 1500 │
├──────────────────────────┼─────────┼─────────────────┼────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼──────┤
│ OutputTokenThroughput    │ total   │ 41.6779 token/s │ 30.243 token/s │ 50.4791 token/s │ 41.7847 token/s │ 44.6424 token/s │ 45.6484 token/s │ 46.0932 token/s │ 1500 │
╘══════════════════════════╧═════════╧═════════════════╧════════════════╧═════════════════╧═════════════════╧═════════════════╧═════════════════╧═════════════════╧══════╛
╒═════════════════════════╤═════════╤══════════════════╕
│ Common Metric           │ Stage   │ Value            │
╞═════════════════════════╪═════════╪══════════════════╡
│ Benchmark Duration      │ total   │ 582099.6816 ms   │
├─────────────────────────┼─────────┼──────────────────┤
│ Total Requests          │ total   │ 1500             │
├─────────────────────────┼─────────┼──────────────────┤
│ Failed Requests         │ total   │ 0                │
├─────────────────────────┼─────────┼──────────────────┤
│ Success Requests        │ total   │ 1500             │
├─────────────────────────┼─────────┼──────────────────┤
│ Concurrency             │ total   │ 15.9532          │
├─────────────────────────┼─────────┼──────────────────┤
│ Max Concurrency         │ total   │ 16               │
├─────────────────────────┼─────────┼──────────────────┤
│ Request Throughput      │ total   │ 2.5769 req/s     │
├─────────────────────────┼─────────┼──────────────────┤
│ Total Input Tokens      │ total   │ 0                │
├─────────────────────────┼─────────┼──────────────────┤
│ Total Generated Tokens  │ total   │ 384000           │
├─────────────────────────┼─────────┼──────────────────┤
│ Input Token Throughput  │ total   │ 0.0 token/s      │
├─────────────────────────┼─────────┼──────────────────┤
│ Output Token Throughput │ total   │ 659.6808 token/s │
├─────────────────────────┼─────────┼──────────────────┤
│ Total Token Throughput  │ total   │ 659.6808 token/s │
╘═════════════════════════╧═════════╧══════════════════╛
See the AISBench Documentation for details.
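The summary values in the Common Metric table are related by simple arithmetic, which is a quick way to sanity-check a report. The sketch below recomputes the throughput figures from the example numbers above; the variable names are illustrative and are not AISBench field names.

```python
# Values copied from the example "Common Metric" table above.
benchmark_duration_ms = 582099.6816
total_requests = 1500
total_generated_tokens = 384000

duration_s = benchmark_duration_ms / 1000.0

# Throughput figures are totals divided by the wall-clock benchmark duration.
request_throughput = total_requests / duration_s
output_token_throughput = total_generated_tokens / duration_s
tokens_per_request = total_generated_tokens / total_requests

print(f"Request throughput:      {request_throughput:.4f} req/s")
print(f"Output token throughput: {output_token_throughput:.4f} token/s")
print(f"Output tokens/request:   {tokens_per_request:.1f}")
```

The recomputed values agree with the table (2.5769 req/s and 659.6808 token/s), and 256 output tokens per request matches the OutputTokens row.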

3.3 Using bench_serving (image dataset)

Set --dataset-name image for image datasets. bench_serving will generate random prompts with image inputs. Make sure the server is running at http://127.0.0.1:30000 before running the benchmark.
See the Bench Serving Guide for the full list of image-related flags.
Command
python -m sglang.bench_serving \
  --backend sglang \
  --base-url http://127.0.0.1:30000 \
  --model Qwen/Qwen2.5-VL-7B-Instruct \
  --dataset-name image \
  --random-input-len 1024 \
  --random-output-len 512 \
  --random-range-ratio 1 \
  --num-prompts 32 \
  --max-concurrency 16 \
  --image-count 1 \
  --image-resolution 720p
Example output (for illustration only; actual results depend on your hardware and configuration):
============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 16
Successful requests:                     32
Benchmark duration (s):                  51.74
Total input tokens:                      73464
Total input text tokens:                 35128
Total input vision tokens:               38336
Total generated tokens:                  16384
Total generated tokens (retokenized):    9300
Request throughput (req/s):              0.62
Input token throughput (tok/s):          1419.96
Output token throughput (tok/s):         316.68
Peak output token throughput (tok/s):    800.00
Peak concurrent requests:                32
Total token throughput (tok/s):          1736.64
Concurrency:                             15.98
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   25841.84
Median E2E Latency (ms):                 25842.85
P90 E2E Latency (ms):                    26296.42
P99 E2E Latency (ms):                    26303.13
---------------Time to First Token----------------
Mean TTFT (ms):                          12211.59
Median TTFT (ms):                        14405.77
P99 TTFT (ms):                           15837.60
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          26.67
Median TPOT (ms):                        21.75
P99 TPOT (ms):                           41.89
---------------Inter-Token Latency----------------
Mean ITL (ms):                           26.67
Median ITL (ms):                         20.34
P95 ITL (ms):                            20.85
P99 ITL (ms):                            21.70
Max ITL (ms):                            11309.91
==================================================
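To compare several runs, it can help to turn the printed result block into a dictionary. The following is a minimal sketch that parses the "Label: value" text layout shown above; `parse_serving_result` is a hypothetical helper, not part of SGLang, and the text layout may differ between SGLang versions.

```python
import re


def parse_serving_result(text: str) -> dict:
    """Parse 'Label: value' lines from a bench_serving result block into a dict."""
    metrics = {}
    for line in text.splitlines():
        # Banner and section-separator lines contain no colon and are skipped.
        match = re.match(r"^(?P<label>[^:]+):\s+(?P<value>\S+)\s*$", line)
        if not match:
            continue
        label, raw = match.group("label").strip(), match.group("value")
        try:
            metrics[label] = float(raw)
        except ValueError:
            metrics[label] = raw  # non-numeric values such as the backend name
    return metrics


# A few lines copied from the example output above.
sample = """\
============ Serving Benchmark Result ============
Backend:                                 sglang
Max request concurrency:                 16
Successful requests:                     32
Benchmark duration (s):                  51.74
Input token throughput (tok/s):          1419.96
Output token throughput (tok/s):         316.68
Total token throughput (tok/s):          1736.64
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   25841.84
==================================================
"""

metrics = parse_serving_result(sample)
total = metrics["Input token throughput (tok/s)"] + metrics["Output token throughput (tok/s)"]
print(f"input + output throughput = {total:.2f} tok/s")
```

In the example output, the input and output token throughputs sum to the reported total token throughput.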

4. Online Service: Embedding Model

Test Qwen/Qwen3-Embedding-8B on the embedding API endpoint.
Before running any benchmark in this section, make sure the SGLang embedding server is running with --is-embedding at http://127.0.0.1:30000. See Start SGLang server and use the Embedding tab for the launch command. AISBench does not support embedding endpoints, so use bench_serving or Evalscope instead.

4.1 Using Evalscope

Prerequisites: Evalscope installed and its virtual environment activated (source .evalscope_venv/bin/activate). SGLang embedding server running with --is-embedding at http://127.0.0.1:30000.
Evalscope supports embedding evaluation. To performance-test the embedding API directly, run:
Command
evalscope perf \
  --parallel 10 \
  --number 20 \
  --model Qwen/Qwen3-Embedding-8B \
  --url http://127.0.0.1:30000/v1/embeddings \
  --api openai_embedding \
  --dataset random_embedding \
  --min-prompt-length 1024 \
  --max-prompt-length 1024 \
  --tokenizer-path Qwen/Qwen3-Embedding-8B
If the model has already been downloaded, you can point --tokenizer-path to the local model path instead of the model id.
Evalscope's embedding performance testing support may vary by version. If the perf command does not accept the embeddings endpoint, use bench_serving with --backend sglang-embedding instead.
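Before starting a long benchmark, it can be worth confirming that the embedding endpoint responds at all. The sketch below uses only the Python standard library and assumes the server exposes the OpenAI-compatible /v1/embeddings request and response shape; the helper names are hypothetical.

```python
import json
import urllib.request


def build_embedding_request(base_url: str, model: str, text: str) -> urllib.request.Request:
    """Build a POST request for an OpenAI-compatible /v1/embeddings endpoint."""
    payload = {"model": model, "input": text}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/embeddings",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embedding_dim(response_body: bytes) -> int:
    """Return the length of the first embedding vector in an OpenAI-style response."""
    body = json.loads(response_body)
    return len(body["data"][0]["embedding"])


def check_server(base_url: str = "http://127.0.0.1:30000",
                 model: str = "Qwen/Qwen3-Embedding-8B") -> int:
    """Send one short request to the running server and return the embedding size."""
    request = build_embedding_request(base_url, model, "hello")
    with urllib.request.urlopen(request, timeout=30) as response:
        return embedding_dim(response.read())
```

Calling `check_server()` with the server from this section running should return the embedding dimension; a connection error or HTTP error usually means the server is not up or was started without --is-embedding.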

4.2 Using bench_serving (embedding backend)

bench_serving is built into SGLang. Use --backend sglang-embedding to target the /v1/embeddings endpoint. Make sure the server is running with --is-embedding at http://127.0.0.1:30000.
Command
python -m sglang.bench_serving \
  --backend sglang-embedding \
  --base-url http://127.0.0.1:30000 \
  --model Qwen/Qwen3-Embedding-8B \
  --dataset-name random \
  --random-input-len 512 \
  --random-output-len 0 \
  --num-prompts 1000 \
  --max-concurrency 64 \
  --request-rate 32
--dataset-name random samples token IDs from the ShareGPT dataset; the first run downloads ShareGPT from Hugging Face automatically. If Hugging Face is not reachable from your network, set export HF_ENDPOINT=https://hf-mirror.com to use a mirror. Set --random-output-len 0 for embedding benchmarks, since no output tokens are generated.
Example output (for illustration only; actual results depend on your hardware and configuration):
============ Serving Benchmark Result ============
Backend:                                 sglang-embedding
Traffic request rate:                    32.0
Max request concurrency:                 64
Successful requests:                     1000
Benchmark duration (s):                  31.86
Total input tokens:                      257891
Total input text tokens:                 257891
Request throughput (req/s):              31.39
Input token throughput (tok/s):          8094.67
Peak concurrent requests:                62
Concurrency:                             6.67
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   212.34
Median E2E Latency (ms):                 160.97
P90 E2E Latency (ms):                    267.31
P99 E2E Latency (ms):                    1445.94
==================================================
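The reported Concurrency can be cross-checked against the other figures with Little's law (average in-flight requests = request rate × mean latency). This is a worked check on the example numbers above, not a description of how bench_serving computes the value internally.

```python
# Values copied from the example embedding benchmark output above.
request_throughput = 31.39      # req/s
mean_e2e_latency_ms = 212.34    # ms
total_input_tokens = 257891
successful_requests = 1000
reported_concurrency = 6.67

# Little's law: average in-flight requests = throughput * mean time in system.
estimated_concurrency = request_throughput * (mean_e2e_latency_ms / 1000.0)
avg_input_tokens = total_input_tokens / successful_requests

print(f"Estimated concurrency: {estimated_concurrency:.2f}")
print(f"Average input tokens per request: {avg_input_tokens:.1f}")
```

The estimate rounds to the reported 6.67. The average of roughly 258 input tokens per request is about half of --random-input-len 512, which suggests that prompt lengths in this run were sampled from a range rather than fixed; the command in section 3.3 passes --random-range-ratio 1 to pin lengths.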

5. Offline Performance Testing

SGLang's Engine API runs inference in-process, without an HTTP server, letting you measure maximum throughput. bench_offline_throughput is built into SGLang and requires no extra installation or running server.
bench_offline_throughput currently supports only text-generation (LLM) benchmarks. Multimodal and embedding models are not supported.

5.1 Using bench_offline_throughput

bench_offline_throughput uses the Engine API internally and measures pure inference throughput without HTTP overhead:
Command
python -m sglang.bench_offline_throughput \
  --model-path Qwen/Qwen2.5-7B-Instruct \
  --dataset-name random \
  --random-input-len 1024 \
  --random-output-len 512 \
  --num-prompts 500
--dataset-name random samples token IDs from the ShareGPT dataset; the first run downloads ShareGPT from Hugging Face automatically. If Hugging Face is not reachable from your network, set export HF_ENDPOINT=https://hf-mirror.com to use a mirror.
--dataset-name random with --random-input-len and --random-output-len lets you control input and output token counts. Random data with pinned lengths removes the input-length variance of real datasets, which makes throughput comparisons across runs more repeatable.
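To compare several input/output shapes, the command above can be generated programmatically. The sketch below only builds and prints the command lines using the flags shown in this section; `offline_bench_cmd` is a hypothetical helper and the shapes are illustrative, not recommended settings.

```python
import shlex
import sys


def offline_bench_cmd(model_path: str, input_len: int, output_len: int,
                      num_prompts: int) -> list[str]:
    """Build the bench_offline_throughput command shown above as an argv list."""
    return [
        sys.executable, "-m", "sglang.bench_offline_throughput",
        "--model-path", model_path,
        "--dataset-name", "random",
        "--random-input-len", str(input_len),
        "--random-output-len", str(output_len),
        "--num-prompts", str(num_prompts),
    ]


# Hypothetical sweep over (input length, output length) pairs.
shapes = [(1024, 512), (2048, 512), (1024, 1024)]
for input_len, output_len in shapes:
    cmd = offline_bench_cmd("Qwen/Qwen2.5-7B-Instruct", input_len, output_len, 500)
    print(shlex.join(cmd))
    # To actually run a benchmark: subprocess.run(cmd, check=True)
```

Running each shape in a fresh process keeps the runs independent, at the cost of reloading the model every time.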

See also