1. Model Introduction
Qwen3.5-397B-A17B is the latest flagship model in the Qwen series developed by Alibaba, representing a significant leap forward with a unified vision-language foundation, an efficient hybrid architecture, and scalable reinforcement learning. Qwen3.5 combines Gated Delta Networks with a sparse Mixture-of-Experts architecture (397B total parameters, 17B activated), delivering high-throughput inference with minimal latency. It supports multimodal inputs (text, image, video) and natively handles context lengths of up to 262,144 tokens, extensible to over 1M tokens.

Key Features:
- Unified Vision-Language Foundation: Early fusion training on multimodal tokens achieves cross-generational parity with Qwen3 and outperforms Qwen3-VL models
- Efficient Hybrid Architecture: Gated Delta Networks + sparse MoE (397B total / 17B active) for high-throughput inference
- Hybrid Reasoning: Thinking mode enabled by default with step-by-step reasoning, can be disabled for direct responses
- Tool Calling: Built-in tool calling support with the qwen3_coder parser
- Multi-Token Prediction (MTP): Speculative decoding support for lower latency
- 201 Language Support: Expanded multilingual coverage across 201 languages and dialects
- BF16 (Full precision): Qwen/Qwen3.5-397B-A17B
2. SGLang Installation
SGLang from the main branch is required for Qwen3.5. You can install from source or use a Docker image:

Command
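As a sketch, installation typically looks like the following; the exact extras, branch, and recommended Docker tag may differ, so check the SGLang repository for the current instructions:

```shell
# Install SGLang from source (main branch)
git clone https://github.com/sgl-project/sglang.git
cd sglang
pip install -e "python[all]"

# Or pull a prebuilt Docker image
docker pull lmsysorg/sglang:latest
```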
3. Model Deployment
This section provides deployment configurations optimized for different hardware platforms and use cases.

3.1 Basic Configuration
Interactive Command Generator: Use the configuration selector below to automatically generate the appropriate deployment command for your hardware platform and capabilities.

3.2 Configuration Tips
- The model has ~397B parameters in BF16, requiring ~800GB of GPU memory for weights alone.
- H100 (80GB) requires tp=16 (2 nodes) since each rank needs ~100GB at tp=8.
- H200 (141GB) and B200 (192GB) can run with tp=8 on a single node.
- Speculative decoding (MTP) can significantly reduce latency for interactive use cases.
- The --mem-fraction-static flag is recommended for optimal memory utilization; adjust it based on your hardware and workload.
- Context length defaults to 262,144 tokens. If you encounter OOM errors, consider reducing it, but maintain at least 128K to preserve thinking capabilities.
- To speed up weight loading for this large model, add --model-loader-extra-config='{"enable_multithread_load": "true", "num_threads": 64}' to the launch command.
- CUDA IPC Transport: Add SGLANG_USE_CUDA_IPC_TRANSPORT=1 as an environment variable to use CUDA IPC for transferring multimodal features, significantly improving TTFT (Time To First Token). Note: this consumes additional memory proportional to image size, so you may need to lower --mem-fraction-static or --max-running-requests.
- Multimodal Attention Backend: Use --mm-attention-backend fa3 on H100/H200 for better vision performance, or --mm-attention-backend fa4 on B200.
- For processing large images or videos, you may need to lower --mem-fraction-static to leave room for image feature tensors.
| Hardware | TP |
|---|---|
| H100 | 16 |
| H200 | 8 |
| B200 | 8 |
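The tips above can be combined into a launch command along these lines. This is a sketch: flag values are illustrative, and the parser names are assumptions based on the features listed in Section 1, so adjust them for your hardware and SGLang version.

```shell
# Hypothetical single-node H200 launch (tp=8); tune values for your setup.
python -m sglang.launch_server \
  --model-path Qwen/Qwen3.5-397B-A17B \
  --tp 8 \
  --reasoning-parser qwen3 \
  --tool-call-parser qwen3_coder \
  --mem-fraction-static 0.85 \
  --context-length 262144 \
  --host 0.0.0.0 --port 30000
```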
4. Model Invocation
Deploy Qwen3.5-397B-A17B with the following command (H200, all features enabled):

Command
4.1 Basic Usage
For basic API usage and request examples, please refer to:

4.2 Vision Input
Qwen3.5 supports image and video inputs as a unified vision-language model. Here is an example with an image:

Example
Output
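For reference, a minimal sketch of the request body for image input, following the OpenAI-compatible chat completions format that SGLang serves; the server URL, model name, and image URL are placeholders:

```python
# Build an OpenAI-compatible chat request carrying an image plus a text prompt.
# POST this as JSON to http://localhost:30000/v1/chat/completions (placeholder URL).
payload = {
    "model": "Qwen/Qwen3.5-397B-A17B",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/cat.jpg"}},
                {"type": "text", "text": "Describe this image."},
            ],
        }
    ],
    "max_tokens": 512,
}
print(sorted(payload.keys()))
```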
4.3 Advanced Usage
4.3.1 Reasoning Parser
Qwen3.5 supports Thinking mode by default. Enable the reasoning parser during deployment to separate the thinking and content sections. The thinking process is returned via reasoning_content in the streaming response.
To disable thinking and use Instruct mode, pass chat_template_kwargs at request time:
- Thinking mode (default): The model performs step-by-step reasoning before answering. No extra parameters needed.
- Instruct mode (
{"enable_thinking": false}): The model responds directly without a thinking process.
Thinking mode (default), with the reasoning process returned in reasoning_content:
Example
Output
Instruct mode, passing {"enable_thinking": false} via chat_template_kwargs:
Example
Output
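For reference, a sketch of the Instruct-mode request body; the model name is a placeholder, and the payload assumes SGLang's OpenAI-compatible endpoint accepts chat_template_kwargs at the top level of the request (with the OpenAI Python client, pass it via extra_body instead):

```python
# Request body that switches the model to Instruct mode (no thinking block).
payload = {
    "model": "Qwen/Qwen3.5-397B-A17B",
    "messages": [{"role": "user", "content": "What is 2 + 2?"}],
    # Disables step-by-step reasoning for this request only.
    "chat_template_kwargs": {"enable_thinking": False},
}
print(payload["chat_template_kwargs"])
```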
4.3.2 Tool Calling
Qwen3.5 supports tool calling capabilities. Enable the tool call parser during deployment. Thinking mode is on by default; to disable it for tool calling requests, pass extra_body={"chat_template_kwargs": {"enable_thinking": False}}.
Python Example (with Thinking Process):
Example
Output
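As a sketch, a tool-calling request uses the standard OpenAI function-calling schema; the tool name, parameters, and model name below are illustrative, not part of the model's API:

```python
# A hypothetical tool definition in the OpenAI function-calling format.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

# Request body combining the tool schema with thinking disabled.
payload = {
    "model": "Qwen/Qwen3.5-397B-A17B",
    "messages": [{"role": "user", "content": "Weather in Paris?"}],
    "tools": tools,
    "chat_template_kwargs": {"enable_thinking": False},
}
print(payload["tools"][0]["function"]["name"])
```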
5. Benchmark
5.1 Accuracy Benchmark
5.1.1 GSM8K Benchmark
- Benchmark Command
Command
- Test Result
Output
5.1.2 MMMU Benchmark
- Benchmark Command
Command
- Test Result
Output
5.2 Speed Benchmark
Test Environment:
- Hardware: H200 (8x)
- Model: Qwen3.5-397B-A17B
- Tensor Parallelism: 8
- SGLang Version: main branch
Command
5.2.1 Latency Benchmark
Command
Output
5.2.2 Throughput Benchmark
Command
Output
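A typical invocation of SGLang's serving benchmark looks like the following sketch; the parameter values are illustrative and should be matched to your workload:

```shell
# Throughput benchmark against a running server using random prompts.
python -m sglang.bench_serving \
  --backend sglang \
  --dataset-name random \
  --num-prompts 100 \
  --random-input-len 1024 \
  --random-output-len 1024
```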
5.3 Vision Speed Benchmark
We use SGLang’s built-in benchmarking tool to conduct performance evaluation with random images. Each request has 128 input tokens, two 720p images, and 1024 output tokens.

5.3.1 Latency Benchmark
Command
Output
5.3.2 Throughput Benchmark
Command
Output
