1. Model Introduction

Ring-2.5-1T is the world’s first open-source trillion-parameter reasoning model built on a hybrid linear attention architecture, developed by InclusionAI. Building on Ring-1T, it delivers substantial improvements in generation efficiency, reasoning depth, and long-horizon task execution.

Key Features:
  • Trillion-Scale Model: ~1T total parameters with 63B activation parameters using a hybrid linear attention architecture (1:7 MLA + Lightning Linear Attention)
  • Generation Efficiency: Reduces memory access overhead by over 10x and increases generation throughput by more than 3x for sequences exceeding 32K tokens
  • Deep Reasoning: Achieves gold-medal level on both IMO 2025 and CMO 2025; trained with dense rewards that provide rigorous feedback on the reasoning process
  • Long-horizon Task Execution: Enhanced autonomous execution capability through large-scale fully-async agentic RL training
  • Tool Calling: Supports function calling with XML-style tool call format
  • Context Length: 128K natively, extensible to 256K via YaRN
Available Models: Ring-2.5-1T (License: MIT)
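To put the trillion-scale figures in perspective, the sparsity implied by the feature list can be checked with quick arithmetic: only a small fraction of the total parameters is activated for each token.

```python
# Back-of-the-envelope activation ratio, using the figures quoted above:
# ~1T total parameters, 63B activated per token.
total_params = 1.0e12
active_params = 63e9

ratio = active_params / total_params
print(f"Active fraction per token: {ratio:.1%}")  # → Active fraction per token: 6.3%
```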

2. SGLang Installation

Ring-2.5-1T requires a specific SGLang Docker image:
# For H200/B200
docker pull lmsysorg/sglang:nightly-dev-20260213-a0ebaa64

# For GB200/GB300
docker pull lmsysorg/sglang:nightly-dev-cu13-20260213-a0ebaa64
For other installation methods, please refer to the official SGLang installation guide.
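After pulling the image, the launch commands in the following sections are typically run from inside the container. A sketch of a possible `docker run` invocation is shown below; the GPU selection, shared-memory size, cache mount path, and port are illustrative assumptions that should be adjusted for your environment.

```shell
# Illustrative only: adjust GPUs, volumes, shm size, and ports for your setup.
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v /path/to/model/cache:/root/.cache/huggingface \
  -it lmsysorg/sglang:nightly-dev-20260213-a0ebaa64 \
  bash
```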

3. Model Deployment

This section provides deployment configurations optimized for different hardware platforms.

3.1 Basic Configuration

Select the deployment command that matches your hardware platform; a complete example command for H200 is shown in Section 4.

3.2 Configuration Tips

  • The --trust-remote-code flag is required for this model due to custom modeling code.
  • The model uses FP8 quantization (compressed-tensors format).

4. Model Invocation

Deploy Ring-2.5-1T with the following command (on H200, all features enabled):
python -m sglang.launch_server \
  --model-path inclusionAI/Ring-2.5-1T \
  --tp 8 \
  --trust-remote-code \
  --host 0.0.0.0 \
  --port 30000
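The 128K → 256K YaRN extension noted in the feature list is, in SGLang, typically enabled by raising the context length and overriding the rope scaling settings at launch via `--json-model-override-args`. The scaling fields below are assumptions for illustration only; consult the model's config.json for the authoritative values before using them.

```shell
# Hedged sketch: the rope_scaling fields are illustrative, not confirmed for this model.
python -m sglang.launch_server \
  --model-path inclusionAI/Ring-2.5-1T \
  --tp 8 \
  --trust-remote-code \
  --context-length 262144 \
  --json-model-override-args '{"rope_scaling": {"rope_type": "yarn", "factor": 2.0, "original_max_position_embeddings": 131072}}' \
  --host 0.0.0.0 \
  --port 30000
```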

4.1 Basic Usage

For basic API usage and request examples, please refer to the official SGLang documentation.

4.2 Advanced Usage

4.2.1 Reasoning Parser

To enable reasoning output separation, add --reasoning-parser deepseek-r1 when launching the server. The thinking process is returned via reasoning_content in the streaming response.
python -m sglang.launch_server \
  --model-path inclusionAI/Ring-2.5-1T \
  --tp 8 \
  --trust-remote-code \
  --reasoning-parser deepseek-r1 \
  --host 0.0.0.0 \
  --port 30000
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY"
)

response = client.chat.completions.create(
    model="inclusionAI/Ring-2.5-1T",
    messages=[
        {"role": "user", "content": "Solve this problem step by step: What is 15% of 240?"}
    ],
    max_tokens=2048,
    stream=True
)

for chunk in response:
    if chunk.choices and len(chunk.choices) > 0:
        delta = chunk.choices[0].delta

        if hasattr(delta, 'reasoning_content') and delta.reasoning_content:
            print(delta.reasoning_content, end="", flush=True)

        if delta.content:
            print(delta.content, end="", flush=True)

print()

4.2.2 Tool Calling

To enable tool calling, add --tool-call-parser qwen when launching the server.
python -m sglang.launch_server \
  --model-path inclusionAI/Ring-2.5-1T \
  --tp 8 \
  --trust-remote-code \
  --tool-call-parser qwen \
  --host 0.0.0.0 \
  --port 30000
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY"
)

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The city name"
                    }
                },
                "required": ["location"]
            }
        }
    }
]

response = client.chat.completions.create(
    model="inclusionAI/Ring-2.5-1T",
    messages=[
        {"role": "user", "content": "What's the weather in Beijing?"}
    ],
    tools=tools
)

print(response.choices[0].message.tool_calls)
Output Example:
[ChatCompletionMessageFunctionToolCall(id='call_770360e31d194ed79d32cd8c', function=Function(arguments='{"location": "Beijing"}', name='get_weather'), type='function', index=0)]
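Once the model returns a tool call, the application is responsible for executing the function and sending the result back as a `tool`-role message. A minimal local dispatch sketch is shown below; the `get_weather` stub and its canned response are placeholders, and the input is shaped like the output example above rather than taken from a live response.

```python
import json

# Local implementations for each tool the model may call.
def get_weather(location: str) -> str:
    # Placeholder: a real implementation would query a weather API.
    return json.dumps({"location": location, "forecast": "sunny", "temp_c": 25})

TOOL_REGISTRY = {"get_weather": get_weather}

def dispatch(tool_call: dict) -> dict:
    """Run one tool call and build the `tool`-role message to send back."""
    fn = TOOL_REGISTRY[tool_call["function"]["name"]]
    args = json.loads(tool_call["function"]["arguments"])
    return {
        "role": "tool",
        "tool_call_id": tool_call["id"],
        "content": fn(**args),
    }

# Shaped like the tool_calls entry in the output example above.
call = {
    "id": "call_770360e31d194ed79d32cd8c",
    "function": {"name": "get_weather", "arguments": '{"location": "Beijing"}'},
}
msg = dispatch(call)
print(msg["role"], json.loads(msg["content"])["location"])  # → tool Beijing
```

The resulting message is appended to the conversation (after the assistant message carrying the tool call) and sent back to the model so it can compose its final answer.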

5. Benchmark

GSM8K

  • Deployment Command
python3 -m sglang.launch_server \
  --model-path inclusionAI/Ring-2.5-1T \
  --tp-size 8 \
  --trust-remote-code
  • Benchmark Command
python3 benchmark/gsm8k/bench_sglang.py --temperature 1.2 --top-p 0.8 --max-new-tokens 32768 --num-questions 200 --tokenizer-path inclusionAI/Ring-2.5-1T --enable-thinking
  • Test Result
Accuracy: 0.955
Invalid: 0.010
Latency: 615.833 s
Output throughput: 412.360 token/s