Ascend NPU Accuracy Evaluation
This document describes how to evaluate the accuracy of SGLang models running on Ascend NPU using two tools: EvalScope and AISBench. The following scenarios are covered:
- Online Testing: Evaluate via the API after starting the SGLang server
- Text Models: Using Qwen2.5-7B-Instruct as example
- Multimodal Models: Using Qwen2.5-VL-7B-Instruct as example
Environment Setup
First, launch the SGLang environment using the provided container image:
- Atlas 800I A3
- Atlas 800I A2
Command
Using EvalScope
EvalScope is a comprehensive model evaluation framework from ModelScope, supporting both accuracy evaluation and performance stress testing.
Install EvalScope
Command
Online Text Model Testing
This section covers online evaluation scenarios where the SGLang server is already running.
Start SGLang Server
Command
Execute Accuracy Evaluation
EvalScope connects to the SGLang server via its OpenAI-compatible API. The following example uses the GSM8K dataset:
Command
Note: Output format may vary slightly across EvalScope versions. The above example is from EvalScope 1.6.x. Ensure the --model parameter matches the model name returned by the SGLang server’s /v1/models endpoint. When starting the server with an HF path (e.g., Qwen/Qwen2.5-7B-Instruct), use that path directly. For local paths, pass the full path or the model name returned by /v1/models.
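A minimal sketch of how the model name could be read from the server, assuming /v1/models returns the usual OpenAI-style list body (an object with a data array whose entries carry an id field). The helper names model_ids and fetch_model_ids are hypothetical, and the sample payload is illustrative rather than captured from a real server.

```python
import json
import urllib.request


def model_ids(payload: dict) -> list[str]:
    """Return the model IDs from an OpenAI-style /v1/models response body."""
    return [entry["id"] for entry in payload.get("data", [])]


def fetch_model_ids(base_url: str, timeout: float = 10.0) -> list[str]:
    """Query <base_url>/models on a running server and return the model IDs.

    base_url is expected to end with /v1, e.g. http://localhost:30000/v1.
    """
    url = base_url.rstrip("/") + "/models"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return model_ids(json.load(response))


# Illustrative payload in the assumed response shape (not real server output).
sample = {
    "object": "list",
    "data": [{"id": "Qwen/Qwen2.5-7B-Instruct", "object": "model"}],
}
print(model_ids(sample))
```

With a server running, fetch_model_ids("http://localhost:30000/v1") would return the names that are valid values for --model.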
Common Datasets for Online Evaluation
Command
Online Multimodal Model Testing
Start Multimodal Model Server
Command
Execute Multimodal Accuracy Evaluation
Command
Using AISBench
AISBench is an official benchmark testing tool from Ascend, supporting accuracy and performance evaluation across multiple datasets.
Install AISBench
Command
Note: When using pip install -e (development mode), the ais_bench command may not be in PATH. Use python3 -m ais_bench.benchmark.cli.main as an alternative.
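The fallback described in this note can be sketched in a wrapper script as follows. The function name ais_bench_command is hypothetical. The sketch uses the current interpreter (sys.executable) in place of a literal python3, on the assumption that AISBench is installed into the same environment that runs the script.

```python
import shutil
import sys


def ais_bench_command(which=shutil.which) -> list[str]:
    """Return the argv prefix for invoking AISBench.

    Prefers the ais_bench executable when it is on PATH; otherwise falls
    back to the module entry point ais_bench.benchmark.cli.main.
    The which parameter exists only so the lookup can be replaced in tests.
    """
    executable = which("ais_bench")
    if executable:
        return [executable]
    return [sys.executable, "-m", "ais_bench.benchmark.cli.main"]


print(ais_bench_command())
```

The returned list can be extended with the usual AISBench arguments and passed to subprocess.run.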
Configuration File Setup
Each model task, dataset task, and result presentation task corresponds to a configuration file. Modify these configuration files before running the command. To find their paths, add --search to the original AISBench command. For example:
benchmark/ais_bench/benchmark/configs/models/vllm_api/vllm_api_general_chat.py:
Note: The SGLang server defaults to port 30000 and is compatible with the OpenAI API format, so AISBench’s VLLMCustomAPIChat can connect directly to SGLang.
Important: The sum of max_out_len and the input token count must not exceed the SGLang server’s max_model_len (default 32768 for Qwen2.5-7B). We recommend setting max_out_len to 512 or 1024 to avoid 400 errors caused by exceeding the context window.
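The context-window constraint is simple arithmetic, sketched below. The function names are hypothetical, the 32768 default is the figure quoted above, and the input token count is assumed to come from whatever tokenizer the server uses.

```python
def fits_context(input_tokens: int, max_out_len: int, max_model_len: int = 32768) -> bool:
    """True when the prompt plus the requested output fits in the context window."""
    return input_tokens + max_out_len <= max_model_len


def largest_max_out_len(input_tokens: int, max_model_len: int = 32768) -> int:
    """Largest max_out_len that still fits for a prompt of the given length."""
    return max(0, max_model_len - input_tokens)


# A 32000-token prompt leaves room for 768 output tokens:
print(fits_context(32000, 512))    # 32000 + 512 = 32512 <= 32768
print(fits_context(32000, 1024))   # 32000 + 1024 = 33024 > 32768
print(largest_max_out_len(32000))
```

A request for which fits_context is false is the kind that would be rejected with a 400 error.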
Download Datasets
AISBench supports multiple common datasets that must be downloaded to a specified path before use.
Command
Online Text Model Testing
Start SGLang Server
Command
Execute Accuracy Evaluation
Command
Evaluation results are written to outputs/default/<timestamp>/ with the following structure:
Online Multimodal Model Testing
Configuration File
Edit the multimodal model configuration file (e.g., vllm_api_stream_chat_mutiturn.py):
Start Multimodal Server and Execute Evaluation
Command
Troubleshooting
SGLang Server Startup Failure
- Verify device mapping: A2 uses davinci[0-7], A3 uses davinci[0-15]
- Confirm image tag matches device type: A2 uses ...-910b, A3 uses ...-a3
- Check NPU status with npu-smi info
- First run requires model download; set HF_ENDPOINT=https://hf-mirror.com if network access is restricted
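The device-mapping check in the first bullet can be sketched as a small script run inside the container. It assumes the NPUs appear as device nodes named /dev/davinciN, which is an assumption about the path prefix; the counts of 8 and 16 come from the ranges given above. The names NPU_COUNT, expected_devices, and missing_devices are hypothetical.

```python
import os

# NPU counts per device type, from the ranges davinci[0-7] and davinci[0-15].
NPU_COUNT = {"A2": 8, "A3": 16}


def expected_devices(device_type: str) -> list[str]:
    """Device nodes that should be mapped for the given device type."""
    return [f"/dev/davinci{i}" for i in range(NPU_COUNT[device_type])]


def missing_devices(device_type: str, exists=os.path.exists) -> list[str]:
    """Expected device nodes that are not present.

    The exists parameter is only there so the check can be tested
    without real hardware.
    """
    return [path for path in expected_devices(device_type) if not exists(path)]


if __name__ == "__main__":
    for device_type in NPU_COUNT:
        print(device_type, "missing:", missing_devices(device_type))
```

An empty list for the matching device type suggests the mapping is complete; any listed paths point at devices missing from the container launch command.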
EvalScope Connection Failure to Server
- Confirm SGLang server started successfully (look for Application startup complete in logs)
- Verify --api-url points to the correct port (SGLang defaults to 30000)
- Ensure URL ends with /v1, e.g., http://localhost:30000/v1
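The URL rule in the last bullet can be enforced with a small helper before the value is passed to --api-url. This is a sketch only: the name normalize_api_url is hypothetical, and it handles just the two common slips (a missing /v1 suffix and a trailing slash), not URLs that already include a longer endpoint path.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_api_url(url: str) -> str:
    """Return url with any trailing slash removed and /v1 appended if absent."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/")
    if not path.endswith("/v1"):
        path += "/v1"
    # Query string and fragment are dropped; a base URL should not carry them.
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))


print(normalize_api_url("http://localhost:30000"))
print(normalize_api_url("http://localhost:30000/v1/"))
```

Both calls print http://localhost:30000/v1.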
EvalScope SSL certificate verification failed
When an EvalScope command is run without a local dataset or model path, EvalScope attempts to download the resource automatically, which may fail with an SSL certificate verification error. To work around this, open /usr/local/python3.11.14/lib/python3.11/site-packages/requests/sessions.py, find the Session class definition, and set self.verify to False.
