`python -m sglang.bench_serving` is a benchmarking script for LLM serving endpoints. It supports multiple inference backends via OpenAI-compatible and native endpoints, and produces both console metrics and optional JSONL outputs.
What it does
- Generates synthetic or dataset-driven prompts and submits them to a target serving endpoint
- Measures throughput, time-to-first-token (TTFT), inter-token latency (ITL), per-request end-to-end latency, and more
- Supports streaming or non-streaming modes, rate control, and concurrency limits
Supported backends and endpoints
- `sglang`, `sglang-native`: POST `/generate`
- `sglang-oai`, `vllm`, `lmdeploy`: POST `/v1/completions`
- `sglang-oai-chat`, `vllm-chat`, `lmdeploy-chat`: POST `/v1/chat/completions`
- `trt` (TensorRT-LLM): POST `/v2/models/ensemble/generate_stream`
- `gserver`: custom server (not implemented yet in this script)
- `truss`: POST `/v1/models/model:predict`
When `--base-url` is provided, requests are sent to it. Otherwise, `--host` and `--port` are used. When `--model` is not provided, the script attempts to query `GET /v1/models` for an available model ID (OpenAI-compatible endpoints).
Prerequisites
- Python 3.8+
- Dependencies typically used by this script: `aiohttp`, `numpy`, `requests`, `tqdm`, `transformers`, and for some datasets `datasets`, `pillow`, `pybase64`. Install as needed.
- An inference server running and reachable via the endpoints above
- If your server requires authentication, set the environment variable `OPENAI_API_KEY` (used as `Authorization: Bearer <key>`)
Quick start
Run a basic benchmark against an sglang server exposing `/generate`:
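A minimal invocation might look like the following; the host, port, and prompt count below are illustrative assumptions, not values mandated by the script:

```shell
# Assumes an sglang server is already listening on 127.0.0.1:30000.
python3 -m sglang.bench_serving \
  --backend sglang \
  --host 127.0.0.1 --port 30000 \
  --num-prompts 1000
```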
Datasets
Select with `--dataset-name`:
- `sharegpt` (default): loads ShareGPT-style pairs; optionally restrict with `--sharegpt-context-len` and override outputs with `--sharegpt-output-len`
- `random`: random text lengths, sampled from the ShareGPT token space
- `random-ids`: random token ids (can lead to gibberish)
- `image`: generates images and wraps them in chat messages; supports custom resolutions, multiple formats, and different content types
- `generated-shared-prefix`: synthetic dataset with shared long system prompts and short questions
- `mmmu`: samples from MMMU (Math split) and includes images
- `--num-prompts N`: number of requests
- `--random-input-len`, `--random-output-len`, `--random-range-ratio`: for the random/random-ids/image datasets
- `--image-count`: number of images per request (for the `image` dataset)
- `--apply-chat-template`: apply the tokenizer chat template when constructing prompts
- `--dataset-path PATH`: file path to the ShareGPT JSON; if blank or missing, the dataset is downloaded and cached
Options for the `generated-shared-prefix` dataset:
- `--gsp-num-groups`
- `--gsp-prompts-per-group`
- `--gsp-system-prompt-len`
- `--gsp-question-len`
- `--gsp-output-len`
Options for the `image` dataset:
- `--image-count`: number of images per request
- `--image-resolution`: image resolution; supports presets (4k, 1080p, 720p, 360p) or a custom `heightxwidth` format (e.g., 1080x1920, 512x768)
- `--image-format`: image format (jpeg or png)
- `--image-content`: image content type (random or blank)
Examples
- To benchmark the image dataset with 3 images per request, 500 prompts, and 512-token input and output lengths:
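A plausible command for this example; the `sglang-oai-chat` backend is an assumption (the image dataset wraps images in chat messages, so a chat-style backend is needed):

```shell
# Image dataset: 3 images per request, 500 prompts, 512-token input/output.
python3 -m sglang.bench_serving \
  --backend sglang-oai-chat \
  --dataset-name image \
  --image-count 3 \
  --num-prompts 500 \
  --random-input-len 512 --random-output-len 512
```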
- To benchmark the random dataset with 3000 prompts and 1024-token input and output lengths:
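One way to express this, assuming the `sglang` backend on its default host/port:

```shell
# Random dataset: 3000 prompts, 1024-token input/output.
python3 -m sglang.bench_serving \
  --backend sglang \
  --dataset-name random \
  --num-prompts 3000 \
  --random-input-len 1024 --random-output-len 1024
```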
Choosing model and tokenizer
- `--model` is required unless the backend exposes `GET /v1/models`, in which case the first model ID is auto-selected.
- `--tokenizer` defaults to `--model`. Both can be HF model IDs or local paths.
- For ModelScope workflows, setting `SGLANG_USE_MODELSCOPE=true` enables fetching via ModelScope (weights are skipped for speed).
- If your tokenizer lacks a chat template, the script warns that token counting can be less robust for gibberish outputs.
Rate, concurrency, and streaming
- `--request-rate`: requests per second. `inf` sends all requests immediately (burst); a finite rate draws arrival times from a Poisson process.
- `--max-concurrency`: caps concurrent in-flight requests regardless of arrival rate.
- `--disable-stream`: switch to non-streaming mode when supported; TTFT then equals total latency for chat completions.
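For example, a Poisson arrival rate can be combined with a concurrency cap (the numbers below are arbitrary):

```shell
# ~32 req/s Poisson arrivals, never more than 64 requests in flight.
python3 -m sglang.bench_serving \
  --backend sglang \
  --num-prompts 2000 \
  --request-rate 32 \
  --max-concurrency 64
```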
Other key options
- `--output-file FILE.jsonl`: append JSONL results to the file; auto-named if unspecified
- `--output-details`: include per-request arrays (generated texts, errors, TTFTs, ITLs, input/output lens)
- `--extra-request-body '{"top_p":0.9,"temperature":0.6}'`: merged into the request payload (sampling params, etc.)
- `--disable-ignore-eos`: pass through EOS behavior (varies by backend)
- `--warmup-requests N`: run warmup requests with short output first (default 1)
- `--flush-cache`: call `/flush_cache` (sglang) before the main run
- `--profile`: call `/start_profile` and `/stop_profile` (requires the server to enable profiling, e.g., `SGLANG_TORCH_PROFILER_DIR`)
- `--lora-name name1 name2 ...`: randomly pick one per request and pass it to the backend (e.g., `lora_path` for sglang)
- `--tokenize-prompt`: send integer token IDs instead of text (currently supports `--backend sglang` only)
Authentication
If your target endpoint requires OpenAI-style auth, set the `OPENAI_API_KEY` environment variable. The script sends `Authorization: Bearer $OPENAI_API_KEY` automatically for OpenAI-compatible routes.
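A sketch, with a placeholder key and URL:

```shell
export OPENAI_API_KEY="<your-key>"   # placeholder; the script reads this env var
python3 -m sglang.bench_serving \
  --backend sglang-oai \
  --base-url https://your-endpoint.example.com \
  --num-prompts 100
```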
Metrics explained
Printed after each run:
- Request throughput (req/s)
- Input token throughput (tok/s) - includes both text and vision tokens
- Output token throughput (tok/s)
- Total token throughput (tok/s) - includes both text and vision tokens
- Total input text tokens and Total input vision tokens - per-modality breakdown
- Concurrency: aggregate time of all requests divided by wall time
- End-to-End Latency (ms): mean/median/std/p99 per-request total latency
- Time to First Token (TTFT, ms): mean/median/std/p99 for streaming mode
- Inter-Token Latency (ITL, ms): mean/median/std/p95/p99/max between tokens
- TPOT (ms): token processing time after the first token, i.e., (latency - ttft) / (tokens - 1)
- Accept length (sglang-only, if available): speculative decoding accept length
JSONL output format
When `--output-file` is set, one JSON object is appended per run. Base fields:
- Arguments summary: backend, dataset, request_rate, max_concurrency, etc.
- Duration and totals: completed, total_input_tokens, total_output_tokens, retokenized totals
- Throughputs and latency statistics as printed in the console
- `accept_length` when available (sglang)
With `--output-details`, an extended object also includes arrays:
- `input_lens`, `output_lens`
- `ttfts`, `itls` (per-request ITL arrays)
- `generated_texts`, `errors`
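Since each run appends a single JSON object, the latest record can be inspected with standard tools (`results.jsonl` is a placeholder name; the script auto-names the file if none is given):

```shell
# Pretty-print the most recent run's summary using the Python stdlib JSON tool.
tail -n 1 results.jsonl | python3 -m json.tool
```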
End-to-end examples
- sglang native `/generate` (streaming):
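A sketch of this invocation; the host and port are assumptions for a local deployment:

```shell
python3 -m sglang.bench_serving \
  --backend sglang \
  --host 127.0.0.1 --port 30000 \
  --dataset-name sharegpt \
  --num-prompts 1000
```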
- OpenAI-compatible Completions (e.g., vLLM):
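For an OpenAI-compatible Completions server, something like the following should work; the base URL and model ID are placeholders (with `--model` omitted, the script can auto-discover one via `GET /v1/models`):

```shell
python3 -m sglang.bench_serving \
  --backend vllm \
  --base-url http://127.0.0.1:8000 \
  --model <model-id> \
  --num-prompts 1000
```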
- OpenAI-compatible Chat Completions (streaming):
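A plausible Chat Completions run (streaming is the default; host/port are assumptions):

```shell
python3 -m sglang.bench_serving \
  --backend sglang-oai-chat \
  --host 127.0.0.1 --port 30000 \
  --num-prompts 500
```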
- Images (VLM) with chat template:
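A sketch for a vision-language model, combining the image-dataset flags documented above (the specific resolution, format, and counts are illustrative):

```shell
python3 -m sglang.bench_serving \
  --backend sglang-oai-chat \
  --dataset-name image \
  --image-count 1 --image-resolution 720p \
  --image-format jpeg --image-content random \
  --apply-chat-template \
  --num-prompts 200
```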
- Generated shared prefix (long system prompts + short questions):
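An example using the `--gsp-*` flags; the group sizes and lengths below are arbitrary assumptions:

```shell
python3 -m sglang.bench_serving \
  --backend sglang \
  --dataset-name generated-shared-prefix \
  --gsp-num-groups 8 --gsp-prompts-per-group 16 \
  --gsp-system-prompt-len 2048 --gsp-question-len 128 --gsp-output-len 256
```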
- Tokenized prompts (ids) for strict length control (sglang only):
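A sketch for strict length control; setting `--random-range-ratio 1.0` to pin the lengths is an assumption about how the range ratio behaves:

```shell
python3 -m sglang.bench_serving \
  --backend sglang \
  --tokenize-prompt \
  --dataset-name random \
  --random-input-len 1024 --random-output-len 128 \
  --random-range-ratio 1.0
```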
- Profiling and cache flush (sglang):
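A minimal profiling run; remember that the server itself must be launched with profiling enabled (e.g., `SGLANG_TORCH_PROFILER_DIR` set):

```shell
python3 -m sglang.bench_serving \
  --backend sglang \
  --flush-cache \
  --profile \
  --num-prompts 100
```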
- TensorRT-LLM streaming endpoint:
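Since the TensorRT-LLM endpoint is not OpenAI-compatible, `--model` presumably must be set explicitly; the model ID, host, and port below are placeholders:

```shell
python3 -m sglang.bench_serving \
  --backend trt \
  --model <model-id> \
  --host 127.0.0.1 --port 8000 \
  --num-prompts 500
```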
- Evaluating large-scale KVCache sharing with mooncake trace (sglang only):
Troubleshooting
- All requests failed: verify `--backend`, the server URL/port, `--model`, and authentication. Check warmup errors printed by the script.
- Throughput seems too low: adjust `--request-rate` and `--max-concurrency`; verify server batch size/scheduling; ensure streaming is enabled if appropriate.
- Token counts look odd: prefer chat/instruct models with proper chat templates; otherwise tokenization of gibberish may be inconsistent.
- Image/MMMU datasets: ensure you installed the extra deps (`pillow`, `datasets`, `pybase64`).
- Authentication errors (401/403): set `OPENAI_API_KEY` or disable auth on your server.
Notes
- The script raises the file descriptor soft limit (`RLIMIT_NOFILE`) to help with many concurrent connections.
- For sglang, `/get_server_info` is queried after the run to report the speculative decoding accept length when available.
