1. Model Introduction
GPT-OSS is an advanced large language model developed by OpenAI, designed for powerful reasoning, agentic tasks, and versatile developer use cases. It comes in two model sizes:
- gpt-oss-120b — for production, general-purpose, high-reasoning use cases that fit on a single 80GB GPU (like NVIDIA H100 80GB or AMD MI300X 192GB); 117B total parameters, 5.1B active
- gpt-oss-20b — for lower-latency, local, or specialized use cases; 21B total parameters, 3.6B active
Key features:
- Configurable reasoning effort: Easily adjust the reasoning effort (low, medium, high) based on your specific use case and latency needs.
- Full chain-of-thought: Gain complete access to the model’s reasoning process, facilitating easier debugging and increased trust in outputs. Note that the chain of thought is not intended to be shown to end users.
- Fine-tunable: Fully customize models to your specific use case through parameter fine-tuning.
- Agentic capabilities: Use the models’ native capabilities for function calling, web browsing, Python code execution, and Structured Outputs.
- MXFP4 quantization: The models were post-trained with MXFP4 quantization of the MoE weights, making gpt-oss-120b run on a single 80GB GPU (like NVIDIA H100 80GB or AMD MI300X 192GB) and the gpt-oss-20b model run within 16GB of memory. All evals were performed with the same MXFP4 quantization.
2. SGLang Installation
SGLang offers multiple installation methods; choose the one best suited to your hardware platform and requirements. Please refer to the official SGLang installation guide for instructions.
3. Model Deployment
This section provides deployment configurations optimized for different hardware platforms and use cases.
3.1 Basic Configuration
The GPT-OSS series comes in two sizes, and the recommended starting configuration depends on your hardware platform, model size, quantization method, and desired thinking capabilities.
3.2 Configuration Tips
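Before applying further tips, a baseline launch command helps orient the flags involved. The following is a minimal sketch: the model paths are the official Hugging Face identifiers, but the port choice is an assumption, and you should add parallelism flags (e.g. `--tp`) to match your hardware.

```shell
# Minimal sketch: serve gpt-oss-120b on a single 80GB GPU (port is an assumption)
python -m sglang.launch_server \
  --model-path openai/gpt-oss-120b \
  --host 0.0.0.0 \
  --port 30000

# For the smaller model on a 16GB setup:
# python -m sglang.launch_server --model-path openai/gpt-oss-20b --port 30000
```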
For more detailed configuration tips, please refer to the GPT-OSS Usage guide.
4. Model Invocation
4.1 Basic Usage
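As a quick sketch, a basic chat-completions request against the OpenAI-compatible endpoint can be built as below. The port and the exact name of the reasoning-effort field are assumptions; consult the SGLang docs for the authoritative parameter names.

```python
import json

# SGLang exposes an OpenAI-compatible API; the port is an assumption.
BASE_URL = "http://localhost:30000/v1"

def build_chat_request(prompt, reasoning_effort="medium"):
    """Build a chat-completions payload.

    `reasoning_effort` is one of low/medium/high as described above;
    the field name used here is an assumption.
    """
    return {
        "model": "openai/gpt-oss-120b",
        "messages": [{"role": "user", "content": prompt}],
        "reasoning_effort": reasoning_effort,
    }

payload = build_chat_request("What is 2 + 2?")
print(json.dumps(payload, indent=2))

# To send it against a running server (not executed here):
# import urllib.request
# req = urllib.request.Request(
#     BASE_URL + "/chat/completions",
#     data=json.dumps(payload).encode(),
#     headers={"Content-Type": "application/json"},
# )
# print(urllib.request.urlopen(req).read().decode())
```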
For basic API usage and request examples, please refer to the SGLang documentation.
4.2 Advanced Usage
4.2.1 Reasoning Parser
GPT-OSS supports reasoning mode. Enable the reasoning parser during deployment to separate the thinking and content sections:
Command
Example
Output
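A launch command enabling the parser might look like the following sketch; the parser name `gpt-oss` is an assumption, so check `python -m sglang.launch_server --help` for the supported values.

```shell
# Sketch: enable reasoning parsing at launch (parser name is an assumption)
python -m sglang.launch_server \
  --model-path openai/gpt-oss-120b \
  --reasoning-parser gpt-oss \
  --port 30000
```

With the parser enabled, responses separate the thinking section (e.g. a `reasoning_content` field) from the final `content`, which is what makes the debugging workflow described above practical.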
4.2.2 Tool Calling
GPT-OSS supports tool calling capabilities. Enable the tool call parser when starting the server.
Python Example (without Thinking Process): first start the sglang server:
Command
Example
Output
Command
Example
Output
- The reasoning parser shows how the model decides to use a tool
- Tool calls are clearly marked with the function name and arguments
- You can then execute the function and send the result back to continue the conversation
Example
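The execute-and-send-back step above can be sketched as follows. The tool name, its schema, and the return values are illustrative only; in a real request, `TOOLS` would be passed as the `tools` field of the chat-completions payload.

```python
import json

# Hypothetical tool; a real implementation would query a weather service.
def get_current_temperature(city: str) -> dict:
    return {"city": city, "temperature_c": 21.0}

# Schema advertised to the model via the `tools` request field.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_current_temperature",
        "description": "Get the current temperature for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

def dispatch_tool_call(name, arguments_json):
    """Execute the function named in a model tool call and return its
    result as a JSON string, ready to send back as a `tool` message."""
    registry = {"get_current_temperature": get_current_temperature}
    args = json.loads(arguments_json)
    return json.dumps(registry[name](**args))

# Suppose the model replied with this tool call (name + JSON arguments):
call = {"name": "get_current_temperature", "arguments": '{"city": "Paris"}'}
result = dispatch_tool_call(call["name"], call["arguments"])
print(result)
```

The `result` string would then be appended to the conversation as a `tool` role message so the model can continue with the function's output in context.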
5. Benchmark
5.1 Speed Benchmark
- Hardware: 8× NVIDIA B200 GPUs
- Tensor Parallelism: 8
- Model: openai/gpt-oss-120b
- sglang version: 0.5.6
5.1.1 Latency-Sensitive Benchmark
- Server Command:
Output
- Test Command:
Command
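A typical invocation of SGLang's serving benchmark for a latency-sensitive case might look like the sketch below; all parameter values are assumptions and should be tuned to your setup.

```shell
# Sketch: low-concurrency run to probe per-request latency (values assumed)
python3 -m sglang.bench_serving \
  --backend sglang \
  --dataset-name random \
  --random-input-len 1024 \
  --random-output-len 1024 \
  --num-prompts 64 \
  --max-concurrency 1
```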
- Test Results:
Output
5.1.2 Throughput-Sensitive Benchmark
- Server Command:
Output
- Test Command:
Command
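For the throughput-sensitive case, the same benchmark is typically driven with many concurrent requests; the sketch below uses assumed values that should be tuned to your setup.

```shell
# Sketch: high-concurrency run to saturate the server (values assumed)
python3 -m sglang.bench_serving \
  --backend sglang \
  --dataset-name random \
  --random-input-len 1024 \
  --random-output-len 1024 \
  --num-prompts 2048 \
  --max-concurrency 256
```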
Output
5.2 Accuracy Benchmark
5.2.1 GSM8K Benchmark
- Benchmark Command:
Command
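SGLang ships a few-shot GSM8K evaluation script; the module path and flags below are assumptions based on common SGLang usage, so verify them against your installed version.

```shell
# Sketch: run few-shot GSM8K against a server already launched above
# (module path and flags are assumptions)
python3 -m sglang.test.few_shot_gsm8k \
  --num-questions 1319 \
  --parallel 128
```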
- Results:
- GPT-OSS-120b
Output
- GPT-OSS-20b
Output
