Ascend PyTorch Profiler
SGLang has built-in PyTorch Profiler support. Through the Ascend torch_npu
backend, you can collect NPU operator-level performance data directly. No
additional packages are required; profiling is started and stopped via API
requests.
1. Environment Setup
Launch an SGLang online service and set the SGLANG_TORCH_PROFILER_DIR
environment variable to control where performance files are saved. Once the
service starts, the profiler is on standby and can be triggered at any time.
On Ascend NPU, SGLang uses torch_npu._apply_patches() to automatically
redirect PyTorch Profiler’s CUDA activity to NPU, so
activities: ["CPU", "GPU"] actually captures NPU operator events.

| Variable | Description | Default |
|---|---|---|
| SGLANG_TORCH_PROFILER_DIR | Trace file output directory | /tmp |
| SGLANG_PROFILE_WITH_STACK | Record Python call stack (True / False) | True |
| SGLANG_PROFILE_RECORD_SHAPES | Record operator input shapes (True / False) | True |
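
As a rough illustration of the setup above, the sketch below builds the environment for a server launch with the three profiler variables set. The helper name profiler_env, the model path, and the port are hypothetical placeholders, and the launch command is an assumption about a typical sglang.launch_server invocation rather than the exact command for your deployment.

```python
import os


def profiler_env(trace_dir, with_stack=True, record_shapes=True, base_env=None):
    """Return a copy of the environment with the profiler variables set.

    Hypothetical helper: only the three variable names come from the table above.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["SGLANG_TORCH_PROFILER_DIR"] = str(trace_dir)
    env["SGLANG_PROFILE_WITH_STACK"] = "True" if with_stack else "False"
    env["SGLANG_PROFILE_RECORD_SHAPES"] = "True" if record_shapes else "False"
    return env


# Assumed launch command; the model path and port are placeholders.
launch_cmd = [
    "python", "-m", "sglang.launch_server",
    "--model-path", "/path/to/model",
    "--port", "30000",
]

# To actually start the server (left commented out because it needs SGLang installed):
# import subprocess
# subprocess.run(launch_cmd, env=profiler_env("/tmp/sglang_profile"), check=True)
```

Passing the variables through an explicit env dictionary keeps the profiler settings scoped to the server process instead of the whole shell session.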
2. Collection Methods
SGLang provides four collection methods. The core difference is whether you need to send /start_profile and /stop_profile manually. All four methods
produce identical results, so choose the most convenient one.
Method comparison:
| Method | Manual start_profile | Manual stop_profile | Notes |
|---|---|---|---|
| A: API manual start/stop | Yes | Yes | Maximum flexibility for precise control |
| B: API auto-stop | Yes | No | Set num_steps, auto-stops and generates output |
| C: bench_serving --profile | No | No | Benchmark + profiling in one command |
| D: sglang.profiler CLI | No | No | Standalone profiling CLI tool |
Method A: API Manual Start/Stop
Send /start_profile to start, send the workload requests, then send /stop_profile
to stop. After stopping, the server parses the data automatically, so there is no
need to call analyse() manually.
/stop_profile returns "Stop profiling. This will take some time." — the
server needs time to flush trace data to disk and parse it. Wait for the
response to complete.
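
The start/stop flow can be sketched in Python roughly as follows. This is a minimal sketch, assuming the endpoints accept POST requests with a JSON body and that the server listens on 127.0.0.1:30000; the helper names profile_request and send are hypothetical.

```python
import json
import urllib.request


def profile_request(base_url, action, payload=None):
    """Build (but do not send) a POST request for /start_profile or /stop_profile."""
    if action not in ("start_profile", "stop_profile"):
        raise ValueError(f"unknown action: {action}")
    body = json.dumps(payload or {}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/{action}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=600):
    """Send the request and return the response body as text."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


# Typical flow (needs a running server, so it is left commented out):
# base = "http://127.0.0.1:30000"  # assumed server address
# send(profile_request(base, "start_profile"))
# ... send workload requests here ...
# print(send(profile_request(base, "stop_profile")))  # blocks until the server responds
```

The generous timeout reflects the point above: the /stop_profile call should be allowed to run until the server has finished flushing and parsing.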
This method takes a significant amount of time to parse profiling data;
consider using Method B instead to avoid lengthy waits.

Method B: API Auto-Stop
Specify num_steps in the /start_profile request. Profiling stops
automatically after N steps and generates output, so there is no need to send
/stop_profile manually.
Method C: bench_serving --profile
Use SGLang’s built-in bench_serving with the --profile flag. It handles
/start_profile and /stop_profile automatically, so no manual API
calls are needed.
--profile-steps N sends "num_steps": N to the server’s /start_profile, so
the server auto-stops and parses data after N steps, and bench_serving skips
sending /stop_profile.

bench_serving --profile creates a timestamp subdirectory inside
--profile-output-dir (e.g. <output_dir>/<timestamp>/). The output path is
shown in the server log as Profiling done. Traces are saved to: <path>.

bench_serving --profile parameters:
| Parameter | Description |
|---|---|
| --profile | Enable auto profiling start/stop |
| --profile-steps N | Auto-stop after N steps (skips /stop_profile) |
| --profile-output-dir | Trace output directory |
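
One way to assemble such a benchmark invocation is sketched below. The three --profile* flags come from the table above; the remaining flags (--backend, --dataset-name, --num-prompts, --random-output-len) and the function name are assumptions about a typical bench_serving run and should be checked against your installed SGLang version.

```python
import shlex


def bench_serving_profile_cmd(output_dir, profile_steps=None,
                              num_prompts=8, random_output_len=64):
    """Build an argv list for a small benchmark run with profiling enabled."""
    cmd = [
        "python", "-m", "sglang.bench_serving",
        "--backend", "sglang",          # assumed backend name
        "--dataset-name", "random",     # assumed; keeps the workload synthetic
        "--num-prompts", str(num_prompts),
        "--random-output-len", str(random_output_len),
        "--profile",
        "--profile-output-dir", str(output_dir),
    ]
    if profile_steps is not None:
        # With a step count the server auto-stops, so /stop_profile is skipped.
        cmd += ["--profile-steps", str(profile_steps)]
    return cmd


print(shlex.join(bench_serving_profile_cmd("/tmp/sglang_profile", profile_steps=5)))
```

Keeping the prompt count and output length small limits the size of the resulting trace files.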
Method D: sglang.profiler CLI
Use the sglang.profiler CLI module, which automatically sends
/start_profile and waits for completion. Start sglang.profiler first,
then send inference requests (otherwise there are no steps to capture and the
profiler will wait indefinitely).
Alternatively, use bench_serving --profile, which
handles both steps automatically.
sglang.profiler is essentially a CLI wrapper around the /start_profile API.
Advanced options like --profile-by-stage are also supported. On Ascend NPU,
trace flushing is asynchronous and may take a while, so the CLI may occasionally
block waiting for the flush. If it times out, use Method B (API auto-stop) or
Method C (bench_serving --profile) instead.

sglang.profiler CLI parameters:
| Parameter | Description |
|---|---|
| --url | SGLang server address |
| --output-dir | Output directory (defaults to SGLANG_TORCH_PROFILER_DIR) |
| --num-steps | Number of steps to profile |
| --profile-by-stage | Profile prefill / decode stages separately |
| --profile-prefix | Trace filename prefix |
| --cpu / --gpu / --mem / --rpd | Activity types to collect |
3. Full Parameter Reference
All methods ultimately send a /start_profile request to the server. The
server supports the following parameters:
| Parameter | Description | Default |
|---|---|---|
| output_dir | Output directory. Falls back to SGLANG_TORCH_PROFILER_DIR or /tmp | /tmp |
| num_steps | Number of steps. If set, profiling auto-stops and no /stop_profile is needed | None |
| start_step | Step index to start profiling (inclusive), for skipping warmup | 0 |
| activities | Activity types: CPU, GPU, MEM, RPD. On Ascend NPU, primarily CPU and GPU | ["CPU", "GPU"] |
| profile_by_stage | Profile prefill and decode stages separately | false |
| with_stack | Record Python call stack. Also controllable via SGLANG_PROFILE_WITH_STACK | true |
| record_shapes | Record operator input shapes. Also controllable via SGLANG_PROFILE_RECORD_SHAPES | true |
| profile_prefix | Prefix for trace filenames | None |
| profile_stages | Stages to profile, e.g. ["prefill", "decode"]. Requires profile_by_stage | None |
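
To make the constraints in this table concrete, here is a minimal sketch of a client-side builder for the /start_profile JSON body. The function name and the validation rules are illustrative assumptions drawn from the descriptions above; the server may apply its own checks.

```python
import json

ALLOWED_ACTIVITIES = {"CPU", "GPU", "MEM", "RPD"}


def start_profile_payload(output_dir=None, num_steps=None, start_step=None,
                          activities=("CPU", "GPU"), profile_by_stage=False,
                          profile_stages=None, profile_prefix=None):
    """Build a /start_profile body, omitting parameters left at None."""
    unknown = set(activities) - ALLOWED_ACTIVITIES
    if unknown:
        raise ValueError(f"unsupported activities: {sorted(unknown)}")
    if profile_stages and not profile_by_stage:
        raise ValueError("profile_stages requires profile_by_stage=True")
    if num_steps is not None and num_steps <= 0:
        raise ValueError("num_steps must be positive")

    payload = {"activities": list(activities), "profile_by_stage": profile_by_stage}
    optional = {
        "output_dir": output_dir,
        "num_steps": num_steps,
        "start_step": start_step,
        "profile_stages": profile_stages,
        "profile_prefix": profile_prefix,
    }
    payload.update({k: v for k, v in optional.items() if v is not None})
    return payload


# Example: skip 3 warmup steps, then profile 5 steps and auto-stop.
print(json.dumps(start_profile_payload(num_steps=5, start_step=3), indent=2))
```

Omitting unset keys lets the server apply its own defaults instead of receiving explicit nulls.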
4. Finding Output Files
The server log explicitly indicates where traces are saved. You can find them via:
- When profiling starts: server log outputs Profiling starts. Traces will be saved to: <path> (with profile id: <id>)
- When profiling stops: server log outputs Profiling done. Traces are saved to: <path>
- CLI output: sglang.profiler outputs Dump profiling traces to <path>

On Ascend NPU, trace directories are named
<output_dir>/<hostname>_<pid>_<timestamp>_ascend_pt/. When using Method C
(bench_serving --profile), a timestamp subdirectory is added:
<output_dir>/<timestamp>/. Always check the server log for the exact path:
Profiling done. Traces are saved to: <path>.
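
If the server log is captured to a file, the output path can be pulled out with a small script. The sketch below assumes the path runs to the end of the log line; the surrounding log prefix (timestamps, levels) is unknown and simply ignored, and the function name is hypothetical.

```python
import re

# Matches the "profiling done" message quoted above; the path is taken to be
# everything after the colon up to the end of the line (an assumption).
DONE_PATTERN = re.compile(r"Profiling done\. Traces are saved to: (.+)")


def find_trace_dirs(log_text):
    """Return every trace path reported by a 'Profiling done' log line, in order."""
    dirs = []
    for line in log_text.splitlines():
        match = DONE_PATTERN.search(line)
        if match:
            dirs.append(match.group(1).strip())
    return dirs


sample_log = (
    "INFO Profiling starts. Traces will be saved to: /tmp/prof (with profile id: 1)\n"
    "INFO Profiling done. Traces are saved to: /tmp/prof/1700000000\n"
)
print(find_trace_dirs(sample_log))
```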
5. Viewing Results
After profiling stops (either /stop_profile returns or num_steps
auto-triggers), the server automatically parses the raw data. The
ASCEND_PROFILER_OUTPUT directory then contains the following result
files, so there is no need to call analyse() manually:
| File | Description |
|---|---|
| trace_view.json | Chrome Tracing format. Open in MindStudio Insight |
| analysis.db | Database-format performance data |
| ascend_pytorch_profiler_0.db | Database-format performance data |
| kernel_details.csv | Kernel-level data |
| operator_details.csv | Operator-level data |
| step_trace_time.csv | Step trace timing data |
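
For a quick look at trace_view.json without a viewer, the file can be summarized in a few lines of Python. This sketch assumes the standard Chrome Trace Event layout (either a bare list of events or an object with a traceEvents key, with complete events marked "ph": "X" and durations in microseconds); the function name is hypothetical.

```python
import json
from collections import defaultdict


def total_duration_by_name(trace, top=10):
    """Sum the 'dur' of complete ('X') events per event name, largest first."""
    events = trace["traceEvents"] if isinstance(trace, dict) else trace
    totals = defaultdict(float)
    for event in events:
        if event.get("ph") == "X" and "dur" in event:
            totals[event.get("name", "<unnamed>")] += float(event["dur"])
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:top]


# Usage against a real file (path is a placeholder):
# with open("ASCEND_PROFILER_OUTPUT/trace_view.json") as f:
#     for name, micros in total_duration_by_name(json.load(f)):
#         print(f"{micros / 1000:10.3f} ms  {name}")
```

Large traces can take a while to load this way, so it is best used on short captures.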
trace_view.json can also be opened using Chrome’s built-in
chrome://tracing or Perfetto UI.

If you need to merge distributed trace files in a multi-node deployment, set
"merge_profiles": true in the /start_profile request. Note: on Ascend NPU,
the merger has limited support for the *_ascend_pt format, so check
trace_view.json on each node individually. See Benchmark and Profiling
for details.

6. Re-parsing Raw Data (Optional)
If you need to re-parse existing data with different parameters, or if profiling was interrupted and ASCEND_PROFILER_OUTPUT was not auto-generated,
use torch_npu’s analyse() tool.
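
A re-parsing pass might look like the sketch below. The directory scan uses only the naming conventions described in this guide; the import path and the profiler_path argument of analyse() reflect torch_npu's offline-analysis entry point as commonly documented, but treat them as assumptions and verify them against your installed torch_npu version. The helper names are hypothetical.

```python
from pathlib import Path


def unparsed_trace_dirs(root):
    """Return *_ascend_pt directories under root that lack ASCEND_PROFILER_OUTPUT."""
    return sorted(
        d for d in Path(root).rglob("*_ascend_pt")
        if d.is_dir() and not (d / "ASCEND_PROFILER_OUTPUT").is_dir()
    )


def reparse(root):
    """Re-run offline parsing for the traces stored under root."""
    # Imported lazily so the scan above also works on machines without torch_npu.
    from torch_npu.profiler.profiler import analyse  # assumed import path

    pending = unparsed_trace_dirs(root)
    if pending:
        # Assumed call shape: analyse() is pointed at the directory that
        # contains the *_ascend_pt folders.
        analyse(profiler_path=str(root))
    return pending


# Example (placeholder path):
# for d in reparse("/tmp/sglang_profile"):
#     print("re-parsed:", d)
```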
Normally there is no need to run analyse() manually, because the server
already parses data automatically. Only use it for re-parsing or for
handling interrupted data.

Best Practices
Common Notes
- Finding output: Check the server log for Profiling starts. Traces will be saved to: <path> and Profiling done. Traces are saved to: <path>, or the sglang.profiler output for Dump profiling traces to <path>.
- Control trace file size: Reduce the number of requests and the output length using --num-prompts and --random-output-len to avoid trace files too large for browsers.
- Warmup iterations: Set start_step to skip the first few warmup steps and capture performance data under steady state.
- Profile step count: Large values for num_steps or --profile-steps can lead to lengthy profiling data parsing times. Reduce these values when you only need a quick overview.
- CUDA Graph impact: To see the full Python call stack → operator mapping in traces, add --disable-cuda-graph when starting the server. Note that this reduces decode performance, so only use it during profiling. To analyze CUDA Graph capture specifically, use --enable-profile-cuda-graph; traces are saved to SGLANG_TORCH_PROFILER_DIR/graph_capture_profile/.
- Multi-node deployment: In multi-node environments, performance data is distributed across nodes. On Ascend NPU, the merge_profiles feature has limited support, so check *_ascend_pt/ASCEND_PROFILER_OUTPUT/trace_view.json on each node individually. In PD disaggregation mode, prefill and decode workers must be profiled separately; see Profile In PD Disaggregation Mode.
See Also
- SGLang Benchmark and Profiling — General SGLang profiling guide
- Ascend NPU Quickstart — Ascend NPU environment setup
- Ascend NPU Optimization — Ascend NPU optimization parameters
- Ascend NPU Performance Testing — Ascend NPU performance benchmarking
- Ascend NPU Environment Variables — Environment variable reference
