1. Model Introduction
LLaDA 2.1 is a series of large-scale discrete diffusion language models (dLLMs) developed by the InclusionAI team at Ant Group. Unlike traditional autoregressive models, which generate text left-to-right one token at a time, LLaDA 2.1 uses a diffusion-based approach: it drafts tokens in parallel and refines them through iterative denoising, enabling self-correction during generation.
Key Features:
- Token Editing (T2T + M2T): Combines Mask-to-Token (M2T) and Token-to-Token (T2T) editing, allowing the model not only to unmask tokens but also to revise already-generated tokens mid-flight
- Dual Decoding Modes: Speed Mode (S) for maximum throughput with T2T refinement, and Quality Mode (Q) for conservative thresholds and higher benchmark scores
- MoE Architecture: Both variants use a Mixture-of-Experts (MoE) architecture for efficient scaling
- First Large-Scale RL for dLLMs: Implements the first reinforcement learning framework specifically designed for diffusion language models, improving reasoning and instruction-following
- Lightning-Fast Decoding: Up to 892 tokens/s on HumanEval+ for the 100B model
| Model | Parameters | Architecture | Context Length | HuggingFace |
|---|---|---|---|---|
| LLaDA2.1-mini | 16B | MoE (20 layers, 16 attention heads) | 32,768 tokens | inclusionAI/LLaDA2.1-mini |
| LLaDA2.1-flash | 100B | MoE | 32,768 tokens | inclusionAI/LLaDA2.1-flash |
2. SGLang Installation
SGLang offers multiple installation methods; choose the one that best fits your hardware platform and requirements. Please refer to the official SGLang installation guide for instructions.
3. Model Deployment
This section provides deployment configurations optimized for different hardware platforms and use cases.
3.1 Basic Configuration
Interactive Command Generator: Use the configuration selector below to automatically generate the appropriate deployment command for your hardware platform, model size, and decoding mode.
3.2 Configuration Tips
dLLM-Specific Parameters:
| Parameter | Description | Recommended Value |
|---|---|---|
| --dllm-algorithm | Diffusion decoding algorithm | JointThreshold |
| --trust-remote-code | Required for LLaDA model loading | Always enabled |
| --mem-fraction-static | Static memory fraction for KV cache | 0.8 |
| --max-running-requests | Maximum concurrent requests | 1 (for best quality) |
| --attention-backend | Attention computation backend | flashinfer |
| Mode | Threshold | Speed | Quality | Best For |
|---|---|---|---|---|
| Quality Mode (Q) | Conservative | Moderate | Higher benchmark scores | Accuracy-critical tasks |
| Speed Mode (S) | Aggressive | Very fast, relies on T2T editing | Slightly lower | Throughput-critical tasks |
- LLaDA2.1-mini (16B): ~47 GB VRAM, runs on a single GPU (TP=1)
- LLaDA2.1-flash (100B): Requires multi-GPU setup (TP=4 on H100/H200, TP=2 on B200)
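The parameter table and hardware notes above can be combined into a launch command. As a sketch of what the interactive generator produces, the small helper below assembles a `sglang.launch_server` invocation per model variant; the exact flag set, TP choices, and model paths are taken from the tables in this section, and anything else (e.g. that these flags suffice on your hardware) is an assumption:

```python
import shlex

# Model paths and TP sizes from the tables above.
# Note: flash uses TP=4 on H100/H200; switch to TP=2 on B200.
CONFIGS = {
    "mini": ("inclusionAI/LLaDA2.1-mini", 1),
    "flash": ("inclusionAI/LLaDA2.1-flash", 4),
}

def launch_command(variant: str) -> list[str]:
    """Assemble an SGLang launch command from the recommended dLLM parameters."""
    model_path, tp = CONFIGS[variant]
    return [
        "python", "-m", "sglang.launch_server",
        "--model-path", model_path,
        "--tp", str(tp),
        "--dllm-algorithm", "JointThreshold",
        "--trust-remote-code",
        "--mem-fraction-static", "0.8",
        "--max-running-requests", "1",
        "--attention-backend", "flashinfer",
    ]

print(shlex.join(launch_command("mini")))
```

Running the printed command from a shell starts the server; adjust the TP size and model path for your setup.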
4. Model Invocation
4.1 Deployment
Start the server using the command generated above.
4.2 Basic Usage
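As a minimal sketch, a completion request can be sent to the server's OpenAI-compatible chat endpoint. The port (30000, SGLang's default), the model name, and the prompt are assumptions for illustration:

```python
import json
import urllib.request

BASE_URL = "http://localhost:30000"  # assumed: SGLang's default port

def build_payload(prompt: str, model: str = "inclusionAI/LLaDA2.1-mini") -> dict:
    """Build an OpenAI-style chat completion request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }

def complete(prompt: str) -> str:
    """Send one chat completion request and return the model's reply text."""
    req = urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:  # requires a running server
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# With a server running:
#   print(complete("Explain discrete diffusion language models in one sentence."))
```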
For basic API usage and request examples, please refer to the SGLang documentation.
4.3 Advanced Usage
4.3.1 Streaming
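With "stream": true, the OpenAI-compatible endpoint returns server-sent events ("data: {...}" lines). The sketch below parses those lines and prints deltas as they arrive; the port and model name are assumptions:

```python
import json
import urllib.request

def parse_sse_line(line: str) -> str:
    """Pull the text delta out of one SSE line; '' for keep-alives and [DONE]."""
    line = line.strip()
    if not line.startswith("data: ") or line == "data: [DONE]":
        return ""
    chunk = json.loads(line[len("data: "):])
    return chunk["choices"][0]["delta"].get("content") or ""

def stream_chat(prompt: str, base_url: str = "http://localhost:30000",
                model: str = "inclusionAI/LLaDA2.1-mini") -> None:
    """Stream a chat completion, printing tokens as they arrive."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
    }
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:  # requires a running server
        for raw in resp:
            delta = parse_sse_line(raw.decode())
            if delta:
                print(delta, end="", flush=True)
```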
4.3.2 Code Generation
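For code generation, the request goes through the same chat endpoint; a common client-side step is extracting the fenced code block from the reply. The prompt below and the fence convention are illustrative assumptions, not part of the model's API:

```python
import re

PROMPT = "Write a Python function that checks whether a string is a palindrome."

def extract_code(reply: str) -> str:
    """Return the first fenced code block in a model reply, else the full reply."""
    match = re.search(r"```[\w+-]*\n(.*?)```", reply, re.DOTALL)
    return match.group(1).strip() if match else reply.strip()
```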
5. Benchmark
This section uses industry-standard configurations for comparable benchmark results.
5.1 Speed Benchmark
Test Environment:
- Hardware: NVIDIA B200 (4x)
- SGLang Version: 0.5.8+
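The latency and throughput runs below can be driven with SGLang's bundled bench_serving script. The helper here assembles such a command; the request count and sequence lengths are illustrative assumptions, not the configuration used for the reported numbers:

```python
import shlex

def bench_command(num_prompts: int = 64, input_len: int = 1024,
                  output_len: int = 512) -> list[str]:
    """Assemble a sglang.bench_serving command with random synthetic requests."""
    return [
        "python", "-m", "sglang.bench_serving",
        "--backend", "sglang",
        "--dataset-name", "random",
        "--num-prompts", str(num_prompts),
        "--random-input-len", str(input_len),
        "--random-output-len", str(output_len),
    ]

print(shlex.join(bench_command()))
```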
5.1.1 LLaDA2.1-mini
Model Deployment:
- Latency Benchmark
- Latency Result:
- Throughput Benchmark
- Throughput Result:
5.1.2 LLaDA2.1-flash
Model Deployment:
- Latency Benchmark
- Latency Result:
- Throughput Benchmark
- Throughput Result:
