Ascend NPU Accuracy Evaluation

This document describes how to evaluate the accuracy of SGLang models running on Ascend NPU with EvalScope. The following scenarios are covered:

Online Testing: Evaluate through the OpenAI-compatible API after starting the SGLang server
Text Models: Using Qwen2.5-7B-Instruct as an example
Multimodal Models: Using Qwen2.5-VL-7B-Instruct as an example

Environment Setup

Ensure sufficient disk space before proceeding. The Docker image requires at least 30GB of free space. If you need to download model weights, check the model size at ModelScope to reserve enough space.
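The free-space requirement can be checked before pulling the image. The following is a minimal sketch: has_free_gb is a hypothetical helper name, and the path to check is an assumption (Docker's data root is commonly /var/lib/docker, but your installation may differ), so the example defaults to the root filesystem.

```shell
# Succeed if the filesystem holding PATH has at least MIN_GB gigabytes free.
has_free_gb() {
    path="$1"
    min_gb="$2"
    # df -P prints one header line and one data line; column 4 is free space in KB.
    avail_kb=$(df -Pk "$path" | awk 'NR==2 {print $4}')
    [ "$avail_kb" -ge $((min_gb * 1024 * 1024)) ]
}

# CHECK_PATH is an assumption: point it at the filesystem that stores Docker images.
has_free_gb "${CHECK_PATH:-/}" 30 || echo "Less than 30 GB free on ${CHECK_PATH:-/}"
```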

First, launch the SGLang environment using the provided container image:

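Choose the command that matches your hardware. The first command below is for Atlas 800I A3 (16 NPU devices, image tag ending in -a3); the second is for Atlas 800I A2 (8 NPU devices, image tag ending in -910b).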

Command (Atlas 800I A3)

export IMAGE=quay.io/ascend/sglang:v0.5.13.post1-cann9.0.0-a3

docker run -it --rm --privileged --network=host --ipc=host --shm-size=16g \
    --device=/dev/davinci0 --device=/dev/davinci1 --device=/dev/davinci2 --device=/dev/davinci3 \
    --device=/dev/davinci4 --device=/dev/davinci5 --device=/dev/davinci6 --device=/dev/davinci7 \
    --device=/dev/davinci8 --device=/dev/davinci9 --device=/dev/davinci10 --device=/dev/davinci11 \
    --device=/dev/davinci12 --device=/dev/davinci13 --device=/dev/davinci14 --device=/dev/davinci15 \
    --device=/dev/davinci_manager \
    --device=/dev/hisi_hdc \
    --volume /usr/local/sbin:/usr/local/sbin \
    --volume /usr/local/Ascend/driver:/usr/local/Ascend/driver \
    --volume /usr/local/Ascend/firmware:/usr/local/Ascend/firmware \
    --volume /etc/ascend_install.info:/etc/ascend_install.info \
    --volume /var/queue_schedule:/var/queue_schedule \
    --volume ~/.cache/:/root/.cache/ \
    --entrypoint=bash \
    $IMAGE

Command (Atlas 800I A2)

export IMAGE=quay.io/ascend/sglang:v0.5.13.post1-cann9.0.0-910b

docker run -it --rm --privileged --network=host --ipc=host --shm-size=16g \
    --device=/dev/davinci0 --device=/dev/davinci1 --device=/dev/davinci2 --device=/dev/davinci3 \
    --device=/dev/davinci4 --device=/dev/davinci5 --device=/dev/davinci6 --device=/dev/davinci7 \
    --device=/dev/davinci_manager \
    --device=/dev/hisi_hdc \
    --volume /usr/local/sbin:/usr/local/sbin \
    --volume /usr/local/Ascend/driver:/usr/local/Ascend/driver \
    --volume /usr/local/Ascend/firmware:/usr/local/Ascend/firmware \
    --volume /etc/ascend_install.info:/etc/ascend_install.info \
    --volume /var/queue_schedule:/var/queue_schedule \
    --volume ~/.cache/:/root/.cache/ \
    --entrypoint=bash \
    $IMAGE
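The two commands above differ mainly in how many /dev/davinciN devices are mapped (16 on A3, 8 on A2). As a sketch, the repeated --device flags can be generated with a small loop; davinci_device_flags is a hypothetical helper introduced here for illustration, not part of any Ascend tooling.

```shell
# Print "--device=/dev/davinciN" for N = 0 .. COUNT-1 on a single line.
davinci_device_flags() {
    count="$1"
    i=0
    flags=""
    while [ "$i" -lt "$count" ]; do
        flags="$flags --device=/dev/davinci$i"
        i=$((i + 1))
    done
    # Drop the leading space left by the first append.
    echo "${flags# }"
}

davinci_device_flags 8   # Atlas 800I A2; pass 16 for Atlas 800I A3
```

The output can be spliced into the docker run command in place of the hand-written flags, for example `docker run ... $(davinci_device_flags 8) ...` (left unquoted so the shell splits it into separate arguments).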

Using EvalScope

EvalScope is a comprehensive model evaluation framework from ModelScope, supporting both accuracy evaluation and performance stress testing.

Install EvalScope

Command

# Method 1: Installing via pip
pip install evalscope

# Method 2: Installing from source
git clone https://github.com/modelscope/evalscope.git
cd evalscope/
pip install -e .

Online Text Model Testing

This section covers online evaluation scenarios where the SGLang server is already running.

Start SGLang Server

Command

# Set HuggingFace mirror (if network access is restricted)
export HF_ENDPOINT=https://hf-mirror.com

# Start text model server
sglang serve --model-path /home/weights/Qwen2.5-7B-Instruct --attention-backend ascend --host 0.0.0.0 --port 30000 &

For more details about the SGLang server, refer to the Ascend NPU Quick Start.
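Because the server is started in the background, the evaluation should not begin until the server answers requests. A minimal sketch of a readiness check follows, assuming curl is available in the container; wait_for_server is a hypothetical helper name, and the 600-second default timeout is an arbitrary choice rather than a documented value.

```shell
# Poll URL every 2 seconds until it returns HTTP 2xx or TIMEOUT seconds pass.
wait_for_server() {
    url="$1"
    timeout="${2:-600}"
    waited=0
    # --noproxy '*' keeps the local request away from any configured HTTP proxy.
    until curl -sf --noproxy '*' -o /dev/null "$url"; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "Server not ready after ${timeout}s: $url" >&2
            return 1
        fi
        sleep 2
        waited=$((waited + 2))
    done
}

# Example (not run here):
#   wait_for_server http://localhost:30000/v1/models 600 && echo "server is ready"
```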

Execute Accuracy Evaluation

EvalScope connects to the SGLang server through its OpenAI-compatible API. The following example runs the GSM8K dataset and uses --limit to evaluate only the first 10 samples:

Command

evalscope eval \
 --model /home/weights/Qwen2.5-7B-Instruct \
 --api-url http://localhost:30000/v1 \
 --api-key EMPTY \
 --eval-type openai_api \
 --datasets gsm8k \
 --limit 10

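Upon completion, results similar to the following are displayed. The Num column is the number of samples evaluated, so with --limit 10 expect it to read 10; the sample below shows 5, presumably from a run with a smaller limit.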

+---------------------+-----------+----------+----------+-------+---------+---------+
| Model               | Dataset   | Metric   | Subset   |   Num |   Score | Cat.0   |
+=====================+===========+==========+==========+=======+=========+=========+
| Qwen2.5-7B-Instruct | gsm8k     | mean_acc | main     |     5 |     1.0 | default |
+---------------------+-----------+----------+----------+-------+---------+---------+

Note: The output format may vary slightly across EvalScope versions; the example above is from EvalScope 1.6.x.

The --model value must match the model name returned by the SGLang server’s /v1/models endpoint. If the server was started with an HF path (e.g., Qwen/Qwen2.5-7B-Instruct), pass that path directly. If it was started with a local path, pass the full path or the model name returned by /v1/models.
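To find the name the server reports, the /v1/models response can be inspected. The sketch below assumes the usual OpenAI-style response shape, a "data" list whose entries carry an "id" field; first_model_id is a hypothetical helper, and it uses a crude text match rather than a real JSON parser, so treat it as a quick check only. The sample JSON is illustrative, not captured server output.

```shell
# Read a /v1/models JSON response on stdin and print the first "id" value.
first_model_id() {
    grep -o '"id": *"[^"]*"' | head -n 1 | sed 's/.*"\([^"]*\)"$/\1/'
}

# Illustrative input only; a real response comes from:
#   curl -s --noproxy '*' http://localhost:30000/v1/models | first_model_id
echo '{"object":"list","data":[{"id":"example-model","object":"model"}]}' | first_model_id
```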

Common Datasets for Online Evaluation

Command

# MMLU
evalscope eval \
 --model /home/weights/Qwen2.5-7B-Instruct \
 --api-url http://localhost:30000/v1 \
 --api-key EMPTY \
 --eval-type openai_api \
 --datasets mmlu

# CEval (Chinese evaluation)
evalscope eval \
 --model /home/weights/Qwen2.5-7B-Instruct \
 --api-url http://localhost:30000/v1 \
 --api-key EMPTY \
 --eval-type openai_api \
 --datasets ceval

# MATH-500
evalscope eval \
 --model /home/weights/Qwen2.5-7B-Instruct \
 --api-url http://localhost:30000/v1 \
 --api-key EMPTY \
 --eval-type openai_api \
 --datasets math_500

# HumanEval (code generation)
evalscope eval \
 --model /home/weights/Qwen2.5-7B-Instruct \
 --api-url http://localhost:30000/v1 \
 --api-key EMPTY \
 --eval-type openai_api \
 --datasets humaneval
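The four commands above differ only in the dataset name, so they can be driven from a loop. This is a sketch under stated assumptions: run_eval and the DRY_RUN switch are hypothetical conveniences introduced here, and by default the helper only prints each command so it can be reviewed before anything is executed.

```shell
MODEL="${MODEL:-/home/weights/Qwen2.5-7B-Instruct}"
API_URL="${API_URL:-http://localhost:30000/v1}"

# Build the evalscope command for one dataset.
# Prints it by default; set DRY_RUN=0 to actually run it.
run_eval() {
    dataset="$1"
    set -- evalscope eval --model "$MODEL" --api-url "$API_URL" \
        --api-key EMPTY --eval-type openai_api --datasets "$dataset"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    else
        echo "$*"
    fi
}

for ds in mmlu ceval math_500 humaneval; do
    run_eval "$ds" || echo "Evaluation failed for $ds" >&2
done
```

Running the datasets one at a time keeps a failure in one benchmark from hiding the results of the others.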

Online Multimodal Model Testing

Start Multimodal Model Server

Command

# Start multimodal model server (Qwen2.5-VL-7B-Instruct)
# Multimodal models require both --attention-backend and --mm-attention-backend
sglang serve --model-path /home/weights/Qwen2.5-VL-7B-Instruct \
    --attention-backend ascend \
    --mm-attention-backend ascend_attn \
    --host 0.0.0.0 --port 30000 &

Execute Multimodal Accuracy Evaluation

Command

# MMBench (multimodal evaluation)
evalscope eval \
 --model /home/weights/Qwen2.5-VL-7B-Instruct \
 --api-url http://localhost:30000/v1 \
 --api-key EMPTY \
 --eval-type openai_api \
 --datasets mm_bench

# MMMU (multimodal comprehensive understanding)
evalscope eval \
 --model /home/weights/Qwen2.5-VL-7B-Instruct \
 --api-url http://localhost:30000/v1 \
 --api-key EMPTY \
 --eval-type openai_api \
 --datasets mmmu

# HallusionBench (hallucination evaluation)
evalscope eval \
 --model /home/weights/Qwen2.5-VL-7B-Instruct \
 --api-url http://localhost:30000/v1 \
 --api-key EMPTY \
 --eval-type openai_api \
 --datasets hallusion_bench

For more details, refer to the EvalScope documentation.

Troubleshooting

SGLang Server Startup Failure

Verify device mapping: A2 uses davinci[0-7], A3 uses davinci[0-15]
Confirm image tag matches device type: A2 uses ...-910b, A3 uses ...-a3
Check NPU status with npu-smi info
First run requires model download; set HF_ENDPOINT=https://hf-mirror.com if network access is restricted

EvalScope Connection Failure to Server

Confirm SGLang server started successfully (look for Application startup complete in logs)
Verify --api-url points to the correct port (SGLang defaults to 30000)
Ensure URL ends with /v1, e.g., http://localhost:30000/v1
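The URL check in the list above can be automated before launching a long evaluation. A minimal sketch, where check_api_url is a hypothetical helper that only inspects the string and does not contact the server:

```shell
# Succeed only if the URL is http(s) and ends with /v1 (no trailing slash).
check_api_url() {
    case "$1" in
        http://*/v1|https://*/v1)
            return 0
            ;;
        *)
            echo "API URL should end with /v1: $1" >&2
            return 1
            ;;
    esac
}

check_api_url http://localhost:30000/v1 && echo "API URL looks well-formed"
```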

EvalScope SSL Certificate Verification Failure

When an EvalScope command needs a dataset or model that is not available at a local path, it tries to download it automatically. Behind a TLS-intercepting proxy, the download can fail with an SSL certificate verification error:

  File "/usr/local/python3.11.14/lib/python3.11/site-packages/requests/sessions.py", line 605, in get
    return self.request("GET", url, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/python3.11.14/lib/python3.11/site-packages/requests/sessions.py", line 592, in request
    resp = self.send(prep, **send_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/python3.11.14/lib/python3.11/site-packages/requests/sessions.py", line 706, in send
    r = adapter.send(request, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/python3.11.14/lib/python3.11/site-packages/requests/adapters.py", line 676, in send
    raise SSLError(e, request=request)
requests.exceptions.SSLError: HTTPSConnectionPool(host='www.modelscope.cn', port=443): Max retries exceeded with url: /api/v1/datasets/AI-ModelScope/gsm8k (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self-signed certificate in certificate chain (_ssl.c:1016)')))
[ERROR] 2026-05-13-02:20:01 (PID:876, Device:-1, RankID:-1) ERR99999 UNKNOWN application exception

Temporary workaround (test only): Open /usr/local/python3.11.14/lib/python3.11/site-packages/requests/sessions.py, find the Session class, and in its __init__ method change self.verify = True to self.verify = False.

This disables TLS certificate validation globally for the Python requests library. Use it only as a temporary diagnostic step in isolated test environments, never in production.

Stable solution: The error is caused by a corporate TLS proxy injecting a self-signed certificate. Point requests to the proxy’s CA bundle:

# Obtain the CA certificate from your network administrator
# Then set the environment variable:
export REQUESTS_CA_BUNDLE=/path/to/your-proxy-ca-bundle.crt

This is the usual fix in corporate proxy environments. If it does not resolve your issue, consult your IT department, since proxy configurations vary across organizations.
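A wrong path or a non-PEM file in REQUESTS_CA_BUNDLE tends to produce confusing errors, so it can help to validate the file before exporting the variable. The following is a sketch only: use_ca_bundle is a hypothetical helper, and its PEM check merely looks for a certificate header rather than verifying the certificate itself.

```shell
# Export REQUESTS_CA_BUNDLE only if the file is readable and looks like PEM.
use_ca_bundle() {
    bundle="$1"
    if [ ! -r "$bundle" ]; then
        echo "Cannot read CA bundle: $bundle" >&2
        return 1
    fi
    # "--" stops grep from treating the leading dashes of the pattern as options.
    if ! grep -q -- '-----BEGIN CERTIFICATE-----' "$bundle"; then
        echo "No PEM certificate found in: $bundle" >&2
        return 1
    fi
    export REQUESTS_CA_BUNDLE="$bundle"
}

# Example (not run here):
#   use_ca_bundle /path/to/your-proxy-ca-bundle.crt && evalscope eval ...
```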

If you cannot obtain the CA certificate, download datasets manually as shown in Download Dataset Error below.

EvalScope Request Retry Timeout

EvalScope may keep retrying requests and log messages like the following:

2026-06-22 03:09:03 - evalscope - WARNING: Attempt 4 / 5 failed: ....... Retrying...
2026-06-22 03:09:14 - evalscope - INFO: Evaluating[ceval]   0%| 0/520 [Elapsed: 02:00 < Remaining: ?, ?it/s]
2026-06-22 03:09:19,557 - openai._base_client - INFO: Retrying request to /chat/completions in 0.447260 seconds
2026-06-22 03:09:26,088 - openai._base_client - INFO: Retrying request to /chat/completions in 0.992551 seconds

This is usually caused by the HTTP proxy intercepting requests to the local SGLang server. Disable the proxy with:

Command

unset http_proxy
unset https_proxy
unset HTTP_PROXY
unset HTTPS_PROXY
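Unsetting the proxy variables also removes the proxy for dataset downloads that may need it. An alternative sketch is to keep the proxy and exclude only local addresses. This assumes the HTTP client used by EvalScope honors the conventional no_proxy / NO_PROXY variables, which is common for Python HTTP libraries but is not confirmed by this document.

```shell
# Bypass the proxy for local addresses while keeping it for everything else.
# Any existing no_proxy entries are preserved after the local ones.
export no_proxy="localhost,127.0.0.1${no_proxy:+,$no_proxy}"
export NO_PROXY="$no_proxy"

echo "no_proxy=$no_proxy"
```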

Download Dataset Error

If a manual download with wget fails with a certificate error like the following:

root@localhost:/home/# wget https://www.modelscope.cn/datasets/evalscope/MMStar/resolve/master/MMStar.tsv
--2026-05-12 12:08:01--  https://www.modelscope.cn/datasets/evalscope/MMStar/resolve/master/MMStar.tsv
Connecting to <PROXY_IP>:<PROXY_PORT>... connected.
ERROR: cannot verify www.modelscope.cn's certificate, issued by ‘<CERT_ISSUER>’:
  Self-signed certificate encountered.
To connect to www.modelscope.cn insecurely, use `--no-check-certificate`.

add --no-check-certificate to skip certificate verification. Do this only on a network you trust:

wget https://www.modelscope.cn/datasets/evalscope/MMStar/resolve/master/MMStar.tsv --no-check-certificate

For additional assistance, refer to SGLang GitHub Issues.