1. Model Introduction
LLaDA 2.1 is a series of large-scale discrete diffusion language models (dLLMs) developed by the InclusionAI team at Ant Group. Unlike traditional autoregressive models, which generate text left-to-right one token at a time, LLaDA 2.1 uses a diffusion-based approach: it drafts tokens in parallel and refines them through iterative denoising, enabling self-correction during generation.
Key Features:
- Token Editing (T2T + M2T): Combines Mask-to-Token (M2T) and Token-to-Token (T2T) editing, allowing the model not only to unmask tokens but also to revise already-generated tokens mid-flight
- Dual Decoding Modes: Speed Mode (S) for maximum throughput with T2T refinement, and Quality Mode (Q) for conservative thresholds and higher benchmark scores
- MoE Architecture: Both variants use a Mixture-of-Experts (MoE) architecture for efficient scaling
- First Large-Scale RL for dLLMs: Implements the first reinforcement learning framework specifically designed for diffusion language models, improving reasoning and instruction-following
- Lightning-Fast Decoding: Up to 892 tokens/s on HumanEval+ for the 100B model
| Model | Parameters | Architecture | Context Length | HuggingFace |
|---|---|---|---|---|
| LLaDA2.1-mini | 16B | MoE (20 layers, 16 attention heads) | 32,768 tokens | inclusionAI/LLaDA2.1-mini |
| LLaDA2.1-flash | 100B | MoE | 32,768 tokens | inclusionAI/LLaDA2.1-flash |
2. SGLang Installation
SGLang offers multiple installation methods; choose the one that best fits your hardware platform and requirements. Please refer to the official SGLang installation guide for instructions.
3. Model Deployment
This section provides deployment configurations optimized for different hardware platforms and use cases.
3.1 Basic Configuration
Interactive Command Generator: Use the configuration selector below to automatically generate the appropriate deployment command for your hardware platform, model size, and decoding mode.
3.2 Configuration Tips
dLLM-Specific Parameters:
| Parameter | Description | Recommended Value |
|---|---|---|
| --dllm-algorithm | Diffusion decoding algorithm | JointThreshold |
| --trust-remote-code | Required for LLaDA model loading | Always enabled |
| --mem-fraction-static | Static memory fraction for KV cache | 0.8 |
| --max-running-requests | Maximum concurrent requests | 1 (for best quality) |
| --attention-backend | Attention computation backend | flashinfer |
| Mode | Threshold | Speed | Quality | Best For |
|---|---|---|---|---|
| Quality Mode (Q) | Conservative | Moderate | Higher benchmark scores | Accuracy-critical tasks |
| Speed Mode (S) | Aggressive | Very fast, relies on T2T editing | Slightly lower | Throughput-critical tasks |
- LLaDA2.1-mini (16B): ~47 GB VRAM, runs on a single GPU (TP=1)
- LLaDA2.1-flash (100B): Requires multi-GPU setup (TP=4 on H100/H200, TP=2 on B200)
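The parameter table and hardware notes above can be combined into a launch command. As a sketch of what the interactive generator produces, the small helper below assembles a `sglang.launch_server` invocation per model variant; the exact flag set, TP choices, and model paths are taken from the tables in this section, and anything else (e.g. that these flags suffice on your hardware) is an assumption:

```python
import shlex

# Model paths and TP sizes from the tables above.
# Note: flash uses TP=4 on H100/H200; switch to TP=2 on B200.
CONFIGS = {
    "mini": ("inclusionAI/LLaDA2.1-mini", 1),
    "flash": ("inclusionAI/LLaDA2.1-flash", 4),
}

def launch_command(variant: str) -> list[str]:
    """Assemble an SGLang launch command from the recommended dLLM parameters."""
    model_path, tp = CONFIGS[variant]
    return [
        "python", "-m", "sglang.launch_server",
        "--model-path", model_path,
        "--tp", str(tp),
        "--dllm-algorithm", "JointThreshold",
        "--trust-remote-code",
        "--mem-fraction-static", "0.8",
        "--max-running-requests", "1",
        "--attention-backend", "flashinfer",
    ]

print(shlex.join(launch_command("mini")))
```

Running the printed command from a shell starts the server; adjust the TP size and model path for your setup.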
4. Model Invocation
4.1 Deployment
Start the server using the command generated above.
4.2 Basic Usage
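As a minimal sketch, a completion request can be sent to the server's OpenAI-compatible chat endpoint. The port (30000, SGLang's default), the model name, and the prompt are assumptions for illustration:

```python
import json
import urllib.request

BASE_URL = "http://localhost:30000"  # assumed: SGLang's default port

def build_payload(prompt: str, model: str = "inclusionAI/LLaDA2.1-mini") -> dict:
    """Build an OpenAI-style chat completion request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }

def complete(prompt: str) -> str:
    """Send one chat completion request and return the model's reply text."""
    req = urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:  # requires a running server
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# With a server running:
#   print(complete("Explain discrete diffusion language models in one sentence."))
```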
For basic API usage and request examples, please refer to the SGLang documentation.
4.3 Advanced Usage
4.3.1 Streaming
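With "stream": true, the OpenAI-compatible endpoint returns server-sent events ("data: {...}" lines). The sketch below parses those lines and prints deltas as they arrive; the port and model name are assumptions:

```python
import json
import urllib.request

def parse_sse_line(line: str) -> str:
    """Pull the text delta out of one SSE line; '' for keep-alives and [DONE]."""
    line = line.strip()
    if not line.startswith("data: ") or line == "data: [DONE]":
        return ""
    chunk = json.loads(line[len("data: "):])
    return chunk["choices"][0]["delta"].get("content") or ""

def stream_chat(prompt: str, base_url: str = "http://localhost:30000",
                model: str = "inclusionAI/LLaDA2.1-mini") -> None:
    """Stream a chat completion, printing tokens as they arrive."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
    }
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:  # requires a running server
        for raw in resp:
            delta = parse_sse_line(raw.decode())
            if delta:
                print(delta, end="", flush=True)
```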
4.3.2 Code Generation
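For code generation, the request goes through the same chat endpoint; a common client-side step is extracting the fenced code block from the reply. The prompt below and the fence convention are illustrative assumptions, not part of the model's API:

```python
import re

PROMPT = "Write a Python function that checks whether a string is a palindrome."

def extract_code(reply: str) -> str:
    """Return the first fenced code block in a model reply, else the full reply."""
    match = re.search(r"```[\w+-]*\n(.*?)```", reply, re.DOTALL)
    return match.group(1).strip() if match else reply.strip()
```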
5. Benchmark
This section uses industry-standard configurations for comparable benchmark results.
5.1 Speed Benchmark
Test Environment:
- Hardware: NVIDIA B200 (4x)
- SGLang Version: 0.5.8+
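The latency and throughput runs below can be driven with SGLang's bundled bench_serving script. The helper here assembles such a command; the request count and sequence lengths are illustrative assumptions, not the configuration used for the reported numbers:

```python
import shlex

def bench_command(num_prompts: int = 64, input_len: int = 1024,
                  output_len: int = 512) -> list[str]:
    """Assemble a sglang.bench_serving command with random synthetic requests."""
    return [
        "python", "-m", "sglang.bench_serving",
        "--backend", "sglang",
        "--dataset-name", "random",
        "--num-prompts", str(num_prompts),
        "--random-input-len", str(input_len),
        "--random-output-len", str(output_len),
    ]

print(shlex.join(bench_command()))
```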
5.1.1 LLaDA2.1-mini
Model Deployment:
- Latency Benchmark
- Latency Result:
- Throughput Benchmark
- Throughput Result:
5.1.2 LLaDA2.1-flash
Model Deployment:
- Latency Benchmark
- Latency Result:
- Throughput Benchmark
- Throughput Result:
