1. Model Introduction
Qwen3-Next is an advanced large language model architecture developed by Alibaba’s Qwen team, designed to enhance efficiency and performance in handling extensive contexts and large-scale parameters. It features advanced capabilities in reasoning, function calling, and multilingual understanding. Qwen3-Next introduces several groundbreaking innovations:
- Hybrid Attention Mechanism: Replaces standard attention with a combination of Gated DeltaNet (linear attention) and Full Attention, enabling efficient processing of context lengths up to 262,144 tokens. This hybrid approach makes it ideal for analyzing lengthy documents such as entire books or contracts.
- Highly Sparse Mixture-of-Experts (MoE): Features an 80-billion parameter architecture where only 3 billion parameters are active during inference. This design reduces computational costs by up to 90% while maintaining high performance, drastically reducing FLOPs per token without compromising model capacity.
- Multi-Token Prediction (MTP): Enables generation of multiple tokens per inference step, significantly reducing latency and enhancing user experience in real-time applications. This innovation boosts both pretraining performance and inference speed.
- Multilingual Support: Natively supports 119 languages, facilitating seamless cross-lingual tasks and making it versatile for global applications.
- Enterprise-Ready Deployment: Released under the Apache 2.0 license, offering flexible deployment options including on-premises, virtual private cloud (VPC), and private cloud environments, ensuring security and compliance for enterprise use.
- Advanced Reasoning & Stability: Demonstrates clear improvements in reasoning performance, with support for tool use during inference. Includes stability optimizations such as zero-centered, weight-decayed layernorm for robust pre-training and post-training.
2. SGLang Installation
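As a quick start, a typical pip-based install looks like the following sketch (the `[all]` extras name follows the SGLang docs; your platform may require a different method, so check the guide referenced below):

```shell
# Install SGLang with optional dependencies (assumes a CUDA-capable environment)
pip install --upgrade pip
pip install "sglang[all]"
```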
SGLang offers multiple installation methods; choose the one that best fits your hardware platform and requirements. Please refer to the official SGLang installation guide for instructions.
3. Model Deployment
This section provides deployment configurations optimized for different hardware platforms and use cases.
3.1 Basic Configuration
The Qwen3-Next series comes in a single size but offers different thinking modes, and recommended starting configurations vary depending on hardware.
3.2 Configuration Tips
- `--max-mamba-cache-size`: Increase this value to enlarge the mamba cache and raise the maximum number of concurrently running requests. KV cache space decreases as a trade-off, so adjust it according to your workload.
- `--mamba-ssm-dtype`: Either `bfloat16` or `float32`. Use `bfloat16` to save mamba cache space and `float32` for more accurate results. The default is `float32`.
- `--mamba-full-memory-ratio`: Sets the ratio of mamba state memory to full KV cache memory. The default is `0.9`.
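Putting the tips above together, a launch command might look like the following sketch (the model path and `--tp-size 8` are assumptions for an 8-GPU setup; the mamba flags are the ones described above):

```shell
# Hypothetical launch: 8-way tensor parallelism with tuned mamba cache settings
python3 -m sglang.launch_server \
  --model-path Qwen/Qwen3-Next-80B-A3B-Instruct \
  --tp-size 8 \
  --mamba-ssm-dtype bfloat16 \
  --mamba-full-memory-ratio 0.9
```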
4. Model Invocation
4.1 Basic Usage
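Once a server is running, it exposes an OpenAI-compatible API (port 30000 is SGLang's default; adjust the address to your deployment). A minimal request sketch:

```shell
# Query the OpenAI-compatible chat endpoint (server address is an assumption)
curl -s http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-Next-80B-A3B-Instruct",
    "messages": [{"role": "user", "content": "Give me a short introduction to large language models."}],
    "max_tokens": 256
  }'
```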
For basic API usage and request examples, please refer to:
4.2 Advanced Usage
4.2.1 Reasoning Parser
- Streaming with Thinking Process: Qwen3-Next-80B-A3B-Thinking only supports thinking mode. Enable the reasoning parser during deployment to separate the thinking section from the content section.
Command
Example
Output
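A sketch of such a deployment command (the parser name `qwen3` is an assumption based on SGLang's `--reasoning-parser` options; verify it against your SGLang version):

```shell
# Launch the Thinking model with the reasoning parser enabled (hypothetical flags)
python3 -m sglang.launch_server \
  --model-path Qwen/Qwen3-Next-80B-A3B-Thinking \
  --tp-size 8 \
  --reasoning-parser qwen3
```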
- Turn off Thinking: Qwen3-Next-80B-A3B-Instruct only supports instruct (non-thinking) mode.
Command
Example
Output
4.2.2 Tool Calling
Both Qwen/Qwen3-Next-80B-A3B-Instruct and Qwen/Qwen3-Next-80B-A3B-Thinking support tool calling. Enable the tool call parser when starting the server. Python Example (without Thinking Process): Start the sglang server:
Command
Example
Output
Command
Example
Output
- The reasoning parser shows how the model decides to use a tool
- Tool calls are clearly marked with the function name and arguments
- You can then execute the function and send the result back to continue the conversation
Example
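For reference, enabling the tool call parser at launch might look like the following sketch (the parser name `qwen25` is an assumption; check which tool-call parsers your SGLang version supports):

```shell
# Launch with tool calling enabled (hypothetical parser name)
python3 -m sglang.launch_server \
  --model-path Qwen/Qwen3-Next-80B-A3B-Instruct \
  --tp-size 8 \
  --tool-call-parser qwen25
```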
4.2.3 Processing Ultra-Long Texts
Qwen3-Next natively supports context lengths of up to 262,144 tokens. For conversations where the total length (including both input and output) significantly exceeds this limit, we recommend using RoPE scaling techniques to handle long texts effectively. We have validated the model’s performance on context lengths of up to 1 million tokens using the YaRN method.
Qwen3-Next-80B-A3B-Instruct
Output
Output
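To apply YaRN at deployment time, SGLang allows overriding the model's RoPE scaling config at launch. The factor, context length, and override values below are assumptions (scaling the 262,144-token native window by 4x toward ~1M tokens); validate them for your workload:

```shell
# Hypothetical YaRN override for ~1M-token contexts
python3 -m sglang.launch_server \
  --model-path Qwen/Qwen3-Next-80B-A3B-Instruct \
  --tp-size 8 \
  --context-length 1010000 \
  --json-model-override-args '{"rope_scaling":{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":262144}}'
```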
5. Benchmark
5.1 Speed Benchmark
Test Environment:
- Hardware: 8x NVIDIA B200 GPUs
- Tensor Parallelism: 8
- Model: Qwen/Qwen3-Next-80B-A3B-Instruct
- sglang version: 0.5.6
5.1.1 Latency-Sensitive Benchmark
- Server Command:
Output
- Test Command:
Command
- Test Results:
Output
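As an illustration, latency-oriented runs are often driven with SGLang's `bench_serving` tool against a running server; the flag values below are assumptions, not the values used for the reported results:

```shell
# Hypothetical latency-focused benchmark against a running server
python3 -m sglang.bench_serving \
  --backend sglang \
  --dataset-name random \
  --num-prompts 16 \
  --request-rate 1
```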
5.1.2 Throughput-Sensitive Benchmark
- Server Command:
Output
- Test Command:
Command
Output
5.2 Accuracy Benchmark
5.2.1 GSM8K Benchmark
- Benchmark Command:
Command
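One way to run such a check is SGLang's few-shot GSM8K test utility (the module path and flag values are assumptions; it expects a server already running locally):

```shell
# Hypothetical GSM8K accuracy run against a local server
python3 -m sglang.test.few_shot_gsm8k \
  --num-questions 200 \
  --num-shots 8 \
  --parallel 64
```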
- Results:
- Qwen3-Next-80B-A3B-Instruct
Output
- Qwen3-Next-80B-A3B-Thinking
Output
5.2.2 MMLU Benchmark
- Benchmark Command:
Command
- Results:
- Qwen3-Next-80B-A3B-Instruct
Output
- Qwen3-Next-80B-A3B-Thinking
Output
