Qwen 3.5 is Alibaba’s latest generation LLM featuring a hybrid attention architecture, advanced MoE with shared experts, and native multimodal capabilities. Key architecture features:
- Hybrid Attention: Gated Delta Networks (linear, O(n) complexity) combined with full attention every 4th layer for high associative recall
- MoE with Shared Experts: Top-8 active out of 64 routed experts plus a dedicated shared expert for universal features
- Multimodal: DeepStack Vision Transformer with Conv3d for native image and video understanding
Launch Qwen 3.5 with SGLang
Dense Model
To serve Qwen/Qwen3.5-397B-A17B on 8 GPUs:
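A minimal sketch of such a launch, assuming the standard python3 -m sglang.launch_server entry point with 8-way tensor parallelism. The serve_qwen35 wrapper and the MODEL and TP_SIZE variable names are illustrative only and are not part of SGLang:

```shell
# Illustrative settings; MODEL and TP_SIZE are hypothetical variable names.
MODEL="Qwen/Qwen3.5-397B-A17B"
TP_SIZE=8   # tensor parallelism across the 8 GPUs

# Wrapped in a function so the settings can be reviewed before the
# long-running launch is started.
serve_qwen35() {
  python3 -m sglang.launch_server \
    --model-path "$MODEL" \
    --tp "$TP_SIZE"
}

# Uncomment to start the server (blocks until it is stopped):
# serve_qwen35
```

Keeping the model path and parallel size in variables makes it easier to reuse the same wrapper on a different checkpoint or GPU count.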
AMD GPU (MI300X / MI325X / MI35X)
On AMD Instinct GPUs, use the triton attention backend. Both the full attention layers and the Gated Delta Net (linear attention) layers use Triton-based kernels on ROCm:
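As a hedged sketch, an AMD launch could combine the triton backend with the timeout and loader settings described under Configuration Tips. The flag values come from this page; the serve_qwen35_amd wrapper and the variable names are hypothetical, and any ROCm-specific environment setup is left out:

```shell
# Values taken from the Configuration Tips on this page.
MODEL="Qwen/Qwen3.5-397B-A17B"
ATTENTION_BACKEND="triton"      # recommended on ROCm for Qwen 3.5
WATCHDOG_TIMEOUT=1200           # seconds; raise further if loading is slow
LOADER_CONFIG='{"enable_multithread_load": true}'  # parallel weight loading

# Hypothetical wrapper; not part of SGLang.
serve_qwen35_amd() {
  python3 -m sglang.launch_server \
    --model-path "$MODEL" \
    --tp 8 \
    --attention-backend "$ATTENTION_BACKEND" \
    --watchdog-timeout "$WATCHDOG_TIMEOUT" \
    --model-loader-extra-config "$LOADER_CONFIG"
}

# Uncomment to start the server:
# serve_qwen35_amd
```

The JSON loader config is single-quoted so the shell passes the inner double quotes through unchanged.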
Configuration Tips
- --attention-backend: Use triton on AMD GPUs for Qwen 3.5. The hybrid attention architecture (Gated Delta Networks + full attention) works best with the Triton backend on ROCm. The linear attention (GDN) layers always use Triton kernels internally via the GDNAttnBackend.
- --watchdog-timeout: Increase to 1200 or higher for this large model, as weight loading takes significant time.
- --model-loader-extra-config '{"enable_multithread_load": true}': Enables parallel weight loading for faster startup.
Reasoning and Tool Calling
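Parsers are selected with launch flags. The following is a sketch under assumptions rather than a confirmed configuration: --reasoning-parser and --tool-call-parser are SGLang server flags, but the parser names used here (qwen3 and qwen3_coder) are assumptions that should be checked against python3 -m sglang.launch_server --help for the installed version. The wrapper name is hypothetical:

```shell
# Assumed parser names; verify them against the server's --help output.
REASONING_PARSER="qwen3"
TOOL_CALL_PARSER="qwen3_coder"

# Hypothetical wrapper; not part of SGLang.
serve_qwen35_with_parsers() {
  python3 -m sglang.launch_server \
    --model-path Qwen/Qwen3.5-397B-A17B \
    --tp 8 \
    --reasoning-parser "$REASONING_PARSER" \
    --tool-call-parser "$TOOL_CALL_PARSER"
}

# Uncomment to start the server:
# serve_qwen35_with_parsers
```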
Qwen 3.5 supports reasoning and tool calling via the Qwen3 parsers.
Accuracy Evaluation
You can evaluate the model accuracy using lm-eval:
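One possible invocation, sketched under several assumptions: the server is already running on SGLang's default port 30000, lm-eval's local-completions model type is used against the OpenAI-compatible completions endpoint, and gsm8k with 5-shot prompting is the benchmark. The task, shot count, and concurrency are illustrative choices, not values from this page, and run_eval is a hypothetical wrapper:

```shell
# Assumes a server is already listening on the default port 30000.
BASE_URL="http://localhost:30000/v1/completions"
TASK="gsm8k"        # assumed benchmark; substitute the task you need
NUM_CONCURRENT=32   # illustrative request concurrency

# Hypothetical wrapper around the lm_eval CLI.
run_eval() {
  lm_eval \
    --model local-completions \
    --model_args "model=Qwen/Qwen3.5-397B-A17B,base_url=$BASE_URL,num_concurrent=$NUM_CONCURRENT,max_retries=3,tokenized_requests=False" \
    --tasks "$TASK" \
    --num_fewshot 5
}

# Uncomment to run the evaluation:
# run_eval
```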
