SGLang can separate reasoning content from “normal” content in the output of reasoning models such as DeepSeek R1.
Supported Models & Parsers
| Model | Reasoning tags | Parser | Notes |
|---|---|---|---|
| DeepSeek-R1 series | `<think>` … `</think>` | `deepseek-r1` | Supports all variants (R1, R1-0528, R1-Distill) |
| DeepSeek-V3 series | `<think>` … `</think>` | `deepseek-v3` | Includes DeepSeek-V3.2. Supports the `thinking` parameter |
| Standard Qwen3 models | `<think>` … `</think>` | `qwen3` | Supports the `enable_thinking` parameter |
| Qwen3-Thinking models | `<think>` … `</think>` | `qwen3` or `qwen3-thinking` | Always generates thinking content |
| Kimi K2 Thinking | `◁think▷` … `◁/think▷` | `kimi_k2` | Uses special thinking delimiters. Also requires `--tool-call-parser kimi_k2` for tool use |
| GPT OSS | `<\|channel\|>analysis<\|message\|>` … `<\|end\|>` | `gpt-oss` | N/A |
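Each parser in the table looks for a different pair of delimiters. As a rough illustration (a sketch, not SGLang's parser code), the table can be turned into a lookup from parser name to tag pair. `REASONING_TAGS` and `split_by_tags` are hypothetical names, and the sketch only handles a single reasoning block that is either complete or cut off.

```python
# Hypothetical lookup built from the table above: parser name -> (start tag, end tag).
# The tag strings are copied from the table; this dict is not an SGLang API.
REASONING_TAGS = {
    "deepseek-r1": ("<think>", "</think>"),
    "deepseek-v3": ("<think>", "</think>"),
    "qwen3": ("<think>", "</think>"),
    "qwen3-thinking": ("<think>", "</think>"),
    "kimi_k2": ("◁think▷", "◁/think▷"),
    "gpt-oss": ("<|channel|>analysis<|message|>", "<|end|>"),
}


def split_by_tags(text: str, parser: str) -> tuple[str, str]:
    """Return (reasoning, answer) for text that wraps its reasoning in the parser's tags."""
    start, end = REASONING_TAGS[parser]
    head, found_start, rest = text.partition(start)
    if not found_start:
        # No reasoning block at all: everything is normal content.
        return "", text
    reasoning, found_end, tail = rest.partition(end)
    if not found_end:
        # Start tag but no end tag (e.g. truncated output): the rest is reasoning.
        return reasoning.strip(), head.strip()
    return reasoning.strip(), (head + tail).strip()


print(split_by_tags("◁think▷The user wants 1+3.◁/think▷1 + 3 = 4.", "kimi_k2"))
```

The GPT OSS format may wrap the final answer in further channel markup, which this sketch leaves in the answer text.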
Model-Specific Behaviors
DeepSeek-R1 Family:
- DeepSeek-R1: Emits no `<think>` start tag; the output begins directly with the thinking content
- DeepSeek-R1-0528: Generates both the `<think>` start tag and the `</think>` end tag
- Both are handled by the same `deepseek-r1` parser
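Because DeepSeek-R1 may omit the start tag, a parser for this family cannot rely on seeing `<think>` first. A minimal sketch, assuming generation always begins inside the reasoning block, shows how one function can cover both the R1 and R1-0528 output shapes. `split_r1_output` is a hypothetical helper, not the `deepseek-r1` parser itself.

```python
def split_r1_output(text: str, start: str = "<think>", end: str = "</think>") -> tuple[str, str]:
    """Split DeepSeek-R1-style output into (reasoning, answer), with or without the start tag.

    Assumption: generation starts inside the reasoning block, so everything before
    the end tag is reasoning even when the model never emitted the start tag.
    """
    body = text.lstrip()
    # Drop a leading start tag if the model produced one (R1-0528 style).
    if body.startswith(start):
        body = body[len(start):]
    reasoning, found_end, answer = body.partition(end)
    if not found_end:
        # Still thinking (e.g. the token limit was reached): no answer yet.
        return reasoning.strip(), ""
    return reasoning.strip(), answer.strip()


# R1 style (no start tag) and R1-0528 style (both tags) give the same split.
print(split_r1_output("Add 1 and 3.</think>The answer is 4."))
print(split_r1_output("<think>\nAdd 1 and 3.\n</think>\n\nThe answer is 4."))
```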
DeepSeek-V3 Family:
- DeepSeek-V3.1/V3.2: Hybrid models supporting both thinking and non-thinking modes; use the `deepseek-v3` parser and the `thinking` parameter (NOTE: not `enable_thinking`)
Qwen3 Family:
- Standard Qwen3 (e.g., Qwen3-2507): Use the `qwen3` parser; supports `enable_thinking` in chat templates
- Qwen3-Thinking (e.g., Qwen3-235B-A22B-Thinking-2507): Use the `qwen3` or `qwen3-thinking` parser; always thinks
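Hybrid models are switched between modes per request, and the two families above use different parameter names. The sketch below assumes the switch is forwarded to the chat template through a `chat_template_kwargs` field in the OpenAI client's `extra_body`; verify that field against the documentation for your SGLang version. `thinking_extra_body` is a hypothetical helper.

```python
def thinking_extra_body(parser: str, enabled: bool) -> dict:
    """Build an extra_body dict that toggles thinking for a hybrid model.

    Assumption: the flag reaches the chat template via `chat_template_kwargs`.
    """
    if parser == "deepseek-v3":
        key = "thinking"  # DeepSeek-V3.1/V3.2 use `thinking`
    elif parser == "qwen3":
        key = "enable_thinking"  # standard Qwen3 uses `enable_thinking`
    else:
        # Thinking-only models (and other parsers) have no switch listed above.
        raise ValueError(f"{parser!r} has no documented thinking switch")
    return {"chat_template_kwargs": {key: enabled}}


print(thinking_extra_body("qwen3", False))
print(thinking_extra_body("deepseek-v3", True))
```

For example, passing `extra_body=thinking_extra_body("qwen3", False)` to `client.chat.completions.create` would, under this assumption, request a non-thinking answer from a standard Qwen3 model.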
Kimi K2:
- Kimi K2 Thinking: Uses the special `◁think▷` and `◁/think▷` tags. For agentic tool use, also specify `--tool-call-parser kimi_k2`.
GPT OSS:
- GPT OSS: Uses the special `<|channel|>analysis<|message|>` and `<|end|>` tags
Usage
Launching the Server
Specify the --reasoning-parser option.
```python
import requests
from openai import OpenAI

from sglang.test.doc_patch import launch_server_cmd
from sglang.utils import wait_for_server, print_highlight, terminate_process

server_process, port = launch_server_cmd(
    "python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-R1-Distill-Qwen-7B --host 0.0.0.0 --reasoning-parser deepseek-r1 --log-level warning"
)

wait_for_server(f"http://localhost:{port}")
```
Note that `--reasoning-parser` selects the parser the server uses to split each response into reasoning content and the final answer.
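When scripting launches for several models, the parser choice and the extra Kimi K2 flag can be derived from the table. A small sketch that only uses flags shown on this page (`--model-path`, `--reasoning-parser`, `--tool-call-parser`); `build_launch_cmd` and `KNOWN_PARSERS` are hypothetical names, and the parser set may lag behind what your SGLang version supports.

```python
import shlex

# Parser names from the table above; a local convenience, not an SGLang API.
KNOWN_PARSERS = {"deepseek-r1", "deepseek-v3", "qwen3", "qwen3-thinking", "kimi_k2", "gpt-oss"}


def build_launch_cmd(model_path: str, reasoning_parser: str) -> str:
    """Return a shell command that launches the server with a reasoning parser."""
    if reasoning_parser not in KNOWN_PARSERS:
        raise ValueError(f"unknown reasoning parser: {reasoning_parser}")
    args = [
        "python3", "-m", "sglang.launch_server",
        "--model-path", model_path,
        "--reasoning-parser", reasoning_parser,
    ]
    if reasoning_parser == "kimi_k2":
        # Kimi K2 Thinking also needs the matching tool-call parser for tool use.
        args += ["--tool-call-parser", "kimi_k2"]
    return shlex.join(args)  # quotes arguments safely for a POSIX shell


print(build_launch_cmd("deepseek-ai/DeepSeek-R1-Distill-Qwen-7B", "deepseek-r1"))
```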
OpenAI Compatible API
With the OpenAI compatible API, responses follow the contract of the DeepSeek API design established with the release of DeepSeek-R1:
- `reasoning_content`: The content of the chain of thought (CoT).
- `content`: The content of the final answer.
```python
# Initialize OpenAI-like client
client = OpenAI(api_key="None", base_url=f"http://0.0.0.0:{port}/v1")
model_name = client.models.list().data[0].id

messages = [
    {
        "role": "user",
        "content": "What is 1+3?",
    }
]
```
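The two fields are plain JSON keys on the assistant message, so they can also be read without the OpenAI SDK. A minimal sketch using only the standard library; `reasoning_and_answer` is a hypothetical helper, and `sample_body` is a hand-written illustration, not captured server output.

```python
import json


def reasoning_and_answer(response_body: str) -> tuple[str, str]:
    """Return (reasoning_content, content) from a raw chat-completions JSON body."""
    message = json.loads(response_body)["choices"][0]["message"]
    # Either field may be absent or null, e.g. when reasoning separation is off.
    return message.get("reasoning_content") or "", message.get("content") or ""


# Hand-written example body (not real server output).
sample_body = json.dumps(
    {
        "choices": [
            {
                "message": {
                    "role": "assistant",
                    "reasoning_content": "1 plus 3 is 4.",
                    "content": "The answer is 4.",
                }
            }
        ]
    }
)
print(reasoning_and_answer(sample_body))
```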
Non-Streaming Request
```python
response_non_stream = client.chat.completions.create(
    model=model_name,
    messages=messages,
    temperature=0.6,
    top_p=0.95,
    stream=False,  # Non-streaming
    extra_body={"separate_reasoning": True},
)

print_highlight("==== Reasoning ====")
print_highlight(response_non_stream.choices[0].message.reasoning_content)
print_highlight("==== Text ====")
print_highlight(response_non_stream.choices[0].message.content)
```
Streaming Request
```python
response_stream = client.chat.completions.create(
    model=model_name,
    messages=messages,
    temperature=0.6,
    top_p=0.95,
    stream=True,  # Streaming
    extra_body={"separate_reasoning": True},
)

# Accumulate answer and reasoning deltas separately.
reasoning_content = ""
content = ""
for chunk in response_stream:
    if chunk.choices[0].delta.content:
        content += chunk.choices[0].delta.content
    if chunk.choices[0].delta.reasoning_content:
        reasoning_content += chunk.choices[0].delta.reasoning_content

print_highlight("==== Reasoning ====")
print_highlight(reasoning_content)
print_highlight("==== Text ====")
print_highlight(content)
```
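In streaming mode the separation happens incrementally, and an end tag such as `</think>` can arrive split across two chunks. One way a streaming splitter can cope is to hold back any trailing text that could still grow into the tag. The class below is a hypothetical sketch of that idea (not SGLang's detector), assuming the stream starts inside the reasoning block, as DeepSeek-R1 output does when it omits the start tag.

```python
class StreamingThinkSplitter:
    """Hypothetical incremental splitter: raw text chunks in, (reasoning, content) deltas out.

    Assumes the stream starts inside the reasoning block and that `end` closes it.
    """

    def __init__(self, end: str = "</think>"):
        self.end = end
        self.in_reasoning = True
        self.pending = ""  # trailing text that might be the start of the end tag

    def feed(self, chunk: str) -> tuple[str, str]:
        if not self.in_reasoning:
            return "", chunk
        text = self.pending + chunk
        self.pending = ""
        idx = text.find(self.end)
        if idx != -1:
            # End tag complete: reasoning stops here, the rest is answer text.
            self.in_reasoning = False
            return text[:idx], text[idx + len(self.end):]
        # Hold back the longest suffix that is a prefix of the end tag.
        for k in range(min(len(self.end) - 1, len(text)), 0, -1):
            if self.end.startswith(text[-k:]):
                self.pending = text[-k:]
                return text[:-k], ""
        return text, ""

    def flush(self) -> tuple[str, str]:
        """Emit held-back text once the stream ends; it can only be reasoning."""
        held, self.pending = self.pending, ""
        return held, ""


splitter = StreamingThinkSplitter()
reasoning, answer = "", ""
for piece in ["Add 1", " and 3.</th", "ink>The answer", " is 4."]:
    r, c = splitter.feed(piece)
    reasoning += r
    answer += c
r, c = splitter.flush()
reasoning += r
answer += c
print(repr(reasoning), repr(answer))
```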
Optionally, set `stream_reasoning` to `False` to buffer the reasoning content instead of streaming it; the buffered reasoning is then delivered with the last reasoning chunk (or the first chunk after the reasoning content).
```python
response_stream = client.chat.completions.create(
    model=model_name,
    messages=messages,
    temperature=0.6,
    top_p=0.95,
    stream=True,  # Streaming
    extra_body={"separate_reasoning": True, "stream_reasoning": False},
)

# Accumulate answer and reasoning deltas separately.
reasoning_content = ""
content = ""
for chunk in response_stream:
    if chunk.choices[0].delta.content:
        content += chunk.choices[0].delta.content
    if chunk.choices[0].delta.reasoning_content:
        reasoning_content += chunk.choices[0].delta.reasoning_content

print_highlight("==== Reasoning ====")
print_highlight(reasoning_content)
print_highlight("==== Text ====")
print_highlight(content)
```
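Turning `stream_reasoning` off changes when the reasoning text arrives, not what it contains, which is why the accumulation loop is identical in both streaming examples. The generator below is a hypothetical model of the "first chunk after the reasoning content" variant: it holds reasoning deltas and releases them in one piece with the first chunk that carries answer text. The name `buffer_reasoning` and the (reasoning, content) pair format are assumptions for illustration.

```python
from typing import Iterable, Iterator


def buffer_reasoning(deltas: Iterable[tuple[str, str]]) -> Iterator[tuple[str, str]]:
    """Re-emit (reasoning_delta, content_delta) pairs with all reasoning in one chunk."""
    held = []
    released = False
    for reasoning, content in deltas:
        if released:
            yield reasoning, content
            continue
        held.append(reasoning)
        if content:
            # First answer chunk: attach everything held so far.
            released = True
            yield "".join(held), content
    if not released and any(held):
        # The stream ended during reasoning: release it in a final chunk.
        yield "".join(held), ""


streamed = [("Add 1", ""), (" and 3.", ""), ("", "The answer"), ("", " is 4.")]
print(list(buffer_reasoning(streamed)))
```

Concatenating the output pairs gives the same totals as concatenating the input pairs; only the chunk boundaries differ.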
Reasoning separation is enabled by default when the server is launched with a reasoning parser.
To disable it, set the `separate_reasoning` option to `False` in the request.
```python
response_non_stream = client.chat.completions.create(
    model=model_name,
    messages=messages,
    temperature=0.6,
    top_p=0.95,
    stream=False,  # Non-streaming
    extra_body={"separate_reasoning": False},
)

print_highlight("==== Original Output ====")
print_highlight(response_non_stream.choices[0].message.content)
```
SGLang Native API
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-7B")
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, return_dict=False
)

# Generate raw text with the native /generate endpoint.
gen_url = f"http://localhost:{port}/generate"
gen_data = {
    "text": prompt,
    "sampling_params": {
        "skip_special_tokens": False,
        "max_new_tokens": 1024,
        "temperature": 0.6,
        "top_p": 0.95,
    },
}
gen_response = requests.post(gen_url, json=gen_data).json()["text"]
print_highlight("==== Original Output ====")
print_highlight(gen_response)

# Split the raw text into reasoning and answer with /separate_reasoning.
parse_url = f"http://localhost:{port}/separate_reasoning"
separate_reasoning_data = {
    "text": gen_response,
    "reasoning_parser": "deepseek-r1",
}
separate_reasoning_response_json = requests.post(
    parse_url, json=separate_reasoning_data
).json()
print_highlight("==== Reasoning ====")
print_highlight(separate_reasoning_response_json["reasoning_text"])
print_highlight("==== Text ====")
print_highlight(separate_reasoning_response_json["text"])

terminate_process(server_process)
```
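The same `/separate_reasoning` call can be made without `requests`. A minimal standard-library sketch, assuming the endpoint accepts the JSON body shown above and returns the `reasoning_text` and `text` fields; `separate_reasoning_request` and `send` are hypothetical helpers, and `send` needs a running server, so it must be called before `terminate_process`.

```python
import json
import urllib.request


def separate_reasoning_request(base_url: str, text: str, parser: str) -> urllib.request.Request:
    """Build (but do not send) a POST request to /separate_reasoning."""
    payload = {"text": text, "reasoning_parser": parser}
    return urllib.request.Request(
        f"{base_url}/separate_reasoning",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(req: urllib.request.Request) -> tuple[str, str]:
    """Send the request and return (reasoning_text, text) from the JSON response."""
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["reasoning_text"], body["text"]


req = separate_reasoning_request("http://localhost:30000", "Add 1 and 3.</think>4", "deepseek-r1")
print(req.full_url, req.get_method())
```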
Offline Engine API
```python
import sglang as sgl
from transformers import AutoTokenizer

from sglang.srt.parser.reasoning_parser import ReasoningParser
from sglang.utils import print_highlight

llm = sgl.Engine(model_path="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B")
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-7B")

# `messages` is the chat defined in the OpenAI Compatible API section.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, return_dict=False
)
sampling_params = {
    "max_new_tokens": 1024,
    "skip_special_tokens": False,
    "temperature": 0.6,
    "top_p": 0.95,
}
result = llm.generate(prompt=prompt, sampling_params=sampling_params)
generated_text = result["text"]  # Assume there is only one prompt

print_highlight("==== Original Output ====")
print_highlight(generated_text)

# Split the raw output offline with the deepseek-r1 parser.
parser = ReasoningParser("deepseek-r1")
reasoning_text, text = parser.parse_non_stream(generated_text)
print_highlight("==== Reasoning ====")
print_highlight(reasoning_text)
print_highlight("==== Text ====")
print_highlight(text)

# Release the resources held by the engine.
llm.shutdown()
```
Supporting New Reasoning Model Schemas
For future reasoning models, you can implement the reasoning parser as a subclass of `BaseReasoningFormatDetector` in `python/sglang/srt/parser/reasoning_parser.py` and map a new parser name to it, so that the new reasoning model schema can be selected with `--reasoning-parser`.
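The subclass-and-register pattern can be sketched in isolation. The code below is a self-contained mock, not SGLang code: `TagDetector`, `ParseResult`, `ReasonTagDetector`, the `<reason>` schema and the `DETECTORS` registry are all hypothetical, and the real `BaseReasoningFormatDetector` constructor and method signatures should be taken from the source file.

```python
from dataclasses import dataclass


@dataclass
class ParseResult:
    reasoning_text: str = ""
    normal_text: str = ""


class TagDetector:
    """Stand-in for a reasoning-format detector base class (hypothetical, not SGLang's)."""

    def __init__(self, start: str, end: str, force_reasoning: bool = False):
        self.start = start
        self.end = end
        # When True, text is treated as reasoning even without a start tag.
        self.force_reasoning = force_reasoning

    def detect_and_parse(self, text: str) -> ParseResult:
        body = text.lstrip()
        has_start = body.startswith(self.start)
        if not (has_start or self.force_reasoning):
            return ParseResult(normal_text=text)
        if has_start:
            body = body[len(self.start):]
        reasoning, _, normal = body.partition(self.end)
        return ParseResult(reasoning.strip(), normal.strip())


class ReasonTagDetector(TagDetector):
    """Detector for a hypothetical schema that wraps reasoning in <reason> ... </reason>."""

    def __init__(self):
        super().__init__("<reason>", "</reason>")


# Hypothetical registry mapping a parser name to its detector class.
DETECTORS = {"reason-tags": ReasonTagDetector}

detector = DETECTORS["reason-tags"]()
print(detector.detect_and_parse("<reason>Add 1 and 3.</reason>4"))
```

Keeping the tag strings in the constructor means a new schema only needs a small subclass plus one registry entry.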