1. Model Introduction
The Qwen3-VL series comprises the most powerful vision-language models in the Qwen family to date, featuring advanced capabilities in multi-modal understanding, reasoning, and agentic applications. This generation delivers comprehensive upgrades across the board:
- Superior text understanding & generation: Qwen3-VL-235B-A22B-Instruct was ranked the #1 open model for text on lmarena.ai.
- Deeper visual perception & reasoning: Enhanced image and video understanding capabilities.
- Extended context length: Supports up to 262K tokens for processing long documents and videos.
- Enhanced spatial and video dynamics comprehension: Better understanding of spatial relationships and temporal dynamics.
- Stronger agent interaction capabilities: Improved tool use and search-based agent performance.
- Flexible deployment options: Available in Dense and MoE architectures that scale from edge to cloud, with Instruct and reasoning-enhanced Thinking editions.
2. SGLang Installation
SGLang offers multiple installation methods; choose the one that best suits your hardware platform and requirements. Please refer to the official SGLang installation guide for instructions.

3. Model Deployment
This section provides deployment configurations optimized for different hardware platforms and use cases.

3.1 Basic Configuration
The Qwen3-VL series offers models in various sizes and architectures, optimized for hardware platforms including NVIDIA and AMD GPUs. The recommended launch configuration varies by hardware and model size.

3.2 Configuration Tips
- Multimodal attention backend: `--mm-attention-backend` defaults to `fa3` on H100/H200/A100 for better performance, but to `triton_attn` on B200 for compatibility.
- TTFT optimization: Set `SGLANG_USE_CUDA_IPC_TRANSPORT=1` to use CUDA IPC for transferring multimodal features, which significantly improves TTFT. This consumes additional memory (proportional to image size × number of images in the currently running requests) and may require adjusting `--mem-fraction-static` and/or `--max-running-requests`.
- Memory management: Set a lower `--context-length` to conserve memory. A value of `128000` is sufficient for most scenarios, down from the default 262K.
- Expert parallelism: SGLang supports Expert Parallelism (EP) via `--ep`, allowing the experts in MoE models to be deployed on separate GPUs for better throughput. Note that for quantized models, `--ep` must satisfy `(moe_intermediate_size / moe_tp_size) % weight_block_size_n == 0`, where `moe_tp_size` equals `tp_size` divided by `ep_size`. EP may perform worse in low-concurrency scenarios due to the additional communication overhead; see Expert Parallelism Deployment for details.
- Kernel tuning: For MoE Triton kernel tuning on your specific hardware, refer to fused_moe_triton.
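Pulling the tips above together, a launch sketch might look as follows. This is a hedged illustration, not a recommended configuration: the model path, TP degree, and every flag value here are assumptions to be adapted to your hardware.

```shell
# Hedged sketch combining the configuration tips above; all values are
# illustrative. Enable CUDA IPC transport for multimodal features
# (improves TTFT at the cost of extra memory), cap the context length,
# and pin the multimodal attention backend explicitly.
SGLANG_USE_CUDA_IPC_TRANSPORT=1 \
python -m sglang.launch_server \
    --model-path Qwen/Qwen3-VL-235B-A22B-Instruct \
    --tp 8 \
    --context-length 128000 \
    --mm-attention-backend fa3 \
    --mem-fraction-static 0.8
```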
4. Model Invocation
4.1 Basic Usage
For basic API usage and request examples, please refer to:

4.2 Advanced Usage
4.2.1 Multi-Modal Inputs
Qwen3-VL supports both image and video inputs. Here's a basic example with image input:
Example
Output
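As a hedged sketch of what such a request body looks like (the model name, image URL, and prompt below are illustrative assumptions; the content-part layout follows the OpenAI-compatible chat API that SGLang serves):

```python
# Hedged sketch: build the JSON body for a /v1/chat/completions request
# with one image attachment. All concrete values are illustrative.

def build_image_request(model: str, image_url: str, prompt: str) -> dict:
    """Construct an OpenAI-compatible chat request with a single image."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": image_url}},
                    {"type": "text", "text": prompt},
                ],
            }
        ],
    }

body = build_image_request(
    "Qwen/Qwen3-VL-235B-A22B-Instruct",
    "https://example.com/demo.jpg",
    "Describe this image.",
)
# POST `body` to your server's /v1/chat/completions endpoint, e.g. with the
# `openai` client or `requests`; the HTTP call is omitted so the sketch
# stays self-contained.
```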
Example
Output
Example
- For video processing, ensure you have sufficient context length configured (up to 262K tokens).
- Video processing may require more memory; adjust `--mem-fraction-static` accordingly.
- You can also provide local file paths using the `file://` protocol.
Output
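As with images, the video request body can be sketched as follows. The `video_url` content type is an assumption based on common OpenAI-compatible multimodal conventions, and the local path is illustrative; verify the exact schema against your SGLang version.

```python
# Hedged sketch: request body for video input. The "video_url" content type
# is assumed from common OpenAI-compatible conventions; the local path is
# illustrative and uses the file:// protocol mentioned above.

def build_video_request(model: str, video_url: str, prompt: str) -> dict:
    """Construct a /v1/chat/completions body with a single video."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "video_url", "video_url": {"url": video_url}},
                    {"type": "text", "text": prompt},
                ],
            }
        ],
    }

body = build_video_request(
    "Qwen/Qwen3-VL-235B-A22B-Instruct",
    "file:///data/videos/demo.mp4",
    "Summarize this video.",
)
```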
4.2.2 Reasoning Parser
Qwen3-VL-Thinking supports reasoning mode. Enable the reasoning parser during deployment to separate the thinking and content sections:
Command
Example
Output
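With the reasoning parser enabled, the thinking trace is returned separately from the final answer. A minimal sketch of consuming such a response follows; the `reasoning_content` field name matches SGLang's reasoning-parser convention, but the sample payload itself is invented for illustration.

```python
# Hedged sketch: split the thinking trace from the final answer.
# The sample payload below is invented for illustration only.

sample_response = {
    "choices": [
        {
            "message": {
                "role": "assistant",
                "reasoning_content": "The image shows a red square; count its corners...",
                "content": "The image contains a red square with four corners.",
            }
        }
    ]
}

message = sample_response["choices"][0]["message"]
thinking = message.get("reasoning_content", "")  # the model's reasoning trace
answer = message["content"]                      # the user-facing answer
```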
4.2.3 Tool Calling
Qwen3-VL supports tool calling capabilities. Enable the tool call parser:
Command
Example
Output
- The reasoning parser shows how the model decides to use a tool
- Tool calls are clearly marked with the function name and arguments
- You can then execute the function and send the result back to continue the conversation
Example
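The steps above can be sketched end to end: declare a tool schema, read the parsed function name and arguments from the model's tool call, and (after executing the function) send the result back. The weather tool and the sample tool-call payload below are invented for illustration.

```python
import json

# Hedged sketch: a tool schema in the OpenAI function-calling format, plus
# parsing of a sample tool call. The get_weather tool and the payload are
# invented examples, not part of any real API.

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Look up the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

# Sample parsed tool call as it would appear in the assistant message.
tool_call = {
    "function": {"name": "get_weather", "arguments": '{"city": "Beijing"}'}
}
name = tool_call["function"]["name"]
args = json.loads(tool_call["function"]["arguments"])
# Execute the matching function with `args`, then append its result as a
# {"role": "tool", ...} message to continue the conversation.
```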
5. Benchmark
5.1 Speed Benchmark
Test Environment:
- Hardware: NVIDIA B200 GPU (8x)
- Model: Qwen3-VL-235B-A22B-Instruct
- Tensor Parallelism: 8
- sglang version: 0.5.6
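The concrete deployment and benchmark commands in the subsections below were elided during extraction. As a rough, hedged sketch of the general shape of such a run (model path and TP from the test environment above; all other values illustrative; `sglang.bench_serving` shown with a text-only random dataset, whereas the published numbers presumably used multimodal requests):

```shell
# Hedged sketch only: the exact flags behind the published numbers are elided.
# Launch the server with the test-environment settings above.
python -m sglang.launch_server \
    --model-path Qwen/Qwen3-VL-235B-A22B-Instruct \
    --tp 8 --port 30000

# Benchmark serving latency/throughput against the running server.
python -m sglang.bench_serving \
    --backend sglang --port 30000 \
    --dataset-name random \
    --num-prompts 128 \
    --random-input-len 1024 --random-output-len 512
```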
5.1.1 Latency-Sensitive Benchmark
- Model Deployment Command:
Command
- Benchmark Command:
Command
- Test Results:
Output
Setting `SGLANG_USE_CUDA_IPC_TRANSPORT=1` significantly reduces TTFT by using CUDA IPC to transfer multimodal features.
- Model Deployment Command:
Command
- Benchmark Command:
Command
- Test Results:
With `SGLANG_USE_CUDA_IPC_TRANSPORT=1`, TTFT improves significantly:
Output
5.1.2 Throughput-Sensitive Benchmark
- Model Deployment Command:
Command
- Benchmark Command:
Command
- Test Results:
Output
5.2 Accuracy Benchmark
5.2.1 MMMU Benchmark
You can evaluate the model's accuracy on the MMMU dataset with `lmms_eval`:
- Benchmark Command:
Command
- Test Results:
Output
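An `lmms_eval` invocation against a running OpenAI-compatible SGLang endpoint might be sketched as follows. The model adapter name, flag spellings, and values here are assumptions from typical `lmms_eval` usage; check the documentation of your installed `lmms_eval` version for the exact arguments.

```shell
# Hedged sketch of an MMMU evaluation against a running SGLang server.
# Adapter name, model_args keys, and task name are assumptions; verify
# against your lmms_eval version before running.
python -m lmms_eval \
    --model openai_compatible \
    --model_args model_version=Qwen/Qwen3-VL-235B-A22B-Instruct,base_url=http://localhost:30000/v1 \
    --tasks mmmu_val \
    --batch_size 8 \
    --output_path ./mmmu_results
```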
