SGLang provides the following native server APIs:

- /generate (text generation model)
- /get_model_info
- /get_server_info
- /health
- /health_generate
- /flush_cache
- /update_weights
- /encode (embedding model)
- /v1/rerank (cross encoder rerank model)
- /v1/score (decoder-only scoring)
- /classify (reward model)
- /start_expert_distribution_record
- /stop_expert_distribution_record
- /dump_expert_distribution_record
- /tokenize
- /detokenize

A full list of these APIs can be found at http_server.py.
We use requests to test these APIs in the following examples. You can also use curl.
Launch A Server
Example
Generate (text generation model)
Generate completions. This is similar to the /v1/completions endpoint in the OpenAI API. Detailed parameters can be found in the sampling parameters documentation.
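As a sketch, a minimal /generate request body can be built like this. The server URL and the exact sampling-parameter names are assumptions; consult the sampling parameters documentation for the authoritative list.

```python
import json

# Minimal /generate request body. The sampling parameter names below are
# the commonly documented ones; treat them as assumptions.
payload = {
    "text": "The capital of France is",
    "sampling_params": {
        "temperature": 0.0,    # greedy decoding
        "max_new_tokens": 32,  # cap on the number of generated tokens
    },
}
print(json.dumps(payload))
```

With a running server (port 30000 assumed), send it via requests.post("http://localhost:30000/generate", json=payload).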
Example
Get Model Info
Get the information of the model.

- model_path: The path/name of the model.
- is_generation: Whether the model is used as a generation model or an embedding model.
- tokenizer_path: The path/name of the tokenizer.
- preferred_sampling_params: The default sampling params specified via --preferred-sampling-params. None is returned in this example as we did not explicitly configure it in the server args.
- weight_version: The version of the model weights. This is often used to track changes or updates to the model's trained parameters.
- has_image_understanding: Whether the model has image-understanding capability.
- has_audio_understanding: Whether the model has audio-understanding capability.
- model_type: The model type from the HuggingFace config (e.g., "qwen2", "llama").
- architectures: The model architectures from the HuggingFace config (e.g., ["Qwen2ForCausalLM"]).
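For illustration, a /get_model_info response with the fields listed above might look like the following sketch (all values are made up). is_generation is the field to branch on when deciding between the generation and embedding endpoint families.

```python
# Illustrative shape of a /get_model_info response; every value here is a
# made-up placeholder, not a real server reply.
model_info = {
    "model_path": "qwen/qwen2.5-0.5b-instruct",
    "is_generation": True,
    "tokenizer_path": "qwen/qwen2.5-0.5b-instruct",
    "preferred_sampling_params": None,
    "weight_version": "default",
    "has_image_understanding": False,
    "has_audio_understanding": False,
    "model_type": "qwen2",
    "architectures": ["Qwen2ForCausalLM"],
}

# Branch on is_generation to pick the right endpoint family.
endpoint = "/generate" if model_info["is_generation"] else "/encode"
print(endpoint)
```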
Example
Get Server Info
Gets the server information including CLI arguments, token limits, and memory pool sizes.

Note: get_server_info merges the following deprecated endpoints:

- get_server_args
- get_memory_pool_size
- get_max_total_num_tokens
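A minimal fetch of the merged endpoint could look like this standard-library sketch; the base URL and the example field name in the comment are assumptions.

```python
import json
from urllib.request import urlopen

def get_server_info(base_url: str = "http://localhost:30000"):
    """GET /get_server_info and return the decoded JSON payload."""
    with urlopen(f"{base_url}/get_server_info") as resp:
        return json.loads(resp.read())

# With a running server you could then inspect, e.g.:
# info = get_server_info()
# print(info["max_total_num_tokens"])  # field name assumed from the merged endpoint
```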
Example
Health Check
- /health: Check the health of the server.
- /health_generate: Check the health of the server by generating one token.
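Both endpoints can be probed with a small helper. This is a sketch using only the standard library, with the default port assumed.

```python
from urllib.request import urlopen
from urllib.error import URLError

def is_healthy(base_url: str = "http://localhost:30000", generate: bool = False) -> bool:
    """Return True if the health endpoint answers with HTTP 200.

    generate=True probes /health_generate, which also generates one token.
    """
    endpoint = "/health_generate" if generate else "/health"
    try:
        with urlopen(base_url + endpoint, timeout=3) as resp:
            return resp.status == 200
    except (URLError, OSError):
        return False
```

Because /health_generate exercises the model itself, it is a stronger liveness check than /health.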
Example
Example
Flush Cache
Flush the radix cache. It will be automatically triggered when the model weights are updated by the /update_weights API.
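A cache flush is a plain POST with an empty body; a standard-library sketch (port assumed):

```python
from urllib.request import Request, urlopen

def flush_cache(base_url: str = "http://localhost:30000") -> bool:
    """POST /flush_cache to drop the radix cache; True on HTTP 200."""
    req = Request(base_url + "/flush_cache", data=b"", method="POST")
    with urlopen(req, timeout=5) as resp:
        return resp.status == 200
```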
Example
Update Weights From Disk
Update model weights from disk without restarting the server. Only applicable for models with the same architecture and parameter size. SGLang supports the update_weights_from_disk API for continuous evaluation during training (save a checkpoint to disk and update weights from disk).
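The request body only needs the checkpoint path; the path below is a placeholder.

```python
import json

# /update_weights_from_disk request body. The new checkpoint must have the
# same architecture and parameter size as the currently loaded model.
payload = {"model_path": "qwen/qwen2.5-0.5b-instruct"}  # placeholder checkpoint path
print(json.dumps(payload))
```

The server replies with a success flag and a message (exact field names may vary), and the radix cache is flushed automatically after a successful update.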
Example
Example
Example
Encode (embedding model)
Encode text into embeddings. Note that this API is only available for embedding models and will raise an error for generation models. Therefore, we launch a new server to serve an embedding model.

Example
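Once you have embeddings back from /encode, a typical next step is cosine similarity. The vectors below are tiny made-up stand-ins for real model embeddings, which have the model's hidden size (hundreds or thousands of dimensions).

```python
import math

# Made-up, low-dimensional stand-ins for /encode outputs.
emb_a = [0.1, 0.3, -0.2]
emb_b = [0.1, 0.25, -0.1]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    return dot / (norm_u * norm_v)

print(round(cosine(emb_a, emb_b), 4))
```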
Example
Example
v1/rerank (cross encoder rerank model)
Rerank a list of documents given a query using a cross-encoder model. Note that this API is only available for cross-encoder models like BAAI/bge-reranker-v2-m3 with the attention backends triton and torch_native.
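A /v1/rerank request pairs one query with a list of candidate documents. The field names below follow the common rerank-API shape and should be treated as assumptions; check http_server.py for the authoritative schema.

```python
import json

# Illustrative /v1/rerank request body: one query, several candidates.
payload = {
    "query": "what is panda?",
    "documents": [
        "hi",
        "The giant panda is a bear species endemic to China.",
    ],
}
print(json.dumps(payload))
```

The response assigns each document a relevance score, which you can use to sort the candidates.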
Example
Example
Example
v1/score (decoder-only scoring)
Compute token probabilities for specified tokens given a query and items. This is useful for classification tasks, scoring responses, or computing log-probabilities.

Parameters:

- query: Query text
- items: Item text(s) to score
- label_token_ids: Token IDs to compute probabilities for
- apply_softmax: Whether to apply softmax to get normalized probabilities (default: False)
- item_first: Whether items come first in concatenation order (default: False)
- model: Model name
Returns:

- scores: a list of probability lists, one per item, each in the order of label_token_ids.
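The effect of apply_softmax can be reproduced locally: it turns the raw per-label values into probabilities that sum to 1, ordered like label_token_ids. The numbers below are made up.

```python
import math

# Made-up raw log-probabilities for two label tokens of a single item.
logprobs = [-0.3, -1.5]

# apply_softmax=True normalizes them into a probability distribution.
exp = [math.exp(x) for x in logprobs]
total = sum(exp)
probs = [e / total for e in exp]
print([round(p, 4) for p in probs])  # one probability per label token
```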
Example
Example
Example
Classify (reward model)
SGLang Runtime also supports reward models. Here we use a reward model to classify the quality of pairwise generations.

Example
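A sketch of a /classify request that scores two candidate answers with a reward model. The model name, field names, and plain-string formatting are all assumptions; conversations are normally rendered through the model's chat template before being sent.

```python
import json

# Hypothetical /classify request body for pairwise reward scoring.
# Field names and model name are assumptions, not the confirmed schema.
payload = {
    "model": "Skywork/Skywork-Reward-Llama-3.1-8B-v0.2",  # illustrative reward model
    "text": [
        "Q: What is 2+2? A: 4.",  # candidate generation 1
        "Q: What is 2+2? A: 5.",  # candidate generation 2
    ],
}
print(json.dumps(payload))
```

The response contains one score per input, which lets you rank the pair.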
Example
Example
Capture expert selection distribution in MoE models
SGLang Runtime supports recording how many times each expert is selected during a run of a MoE model. This is useful for analyzing the model's throughput and planning optimizations. Note: we only print the first 10 lines of the CSV below for better readability. Please adjust accordingly if you want to analyze the results more deeply.

Example
Example
Example
Tokenize/Detokenize Example (Round Trip)
This example demonstrates how to use the /tokenize and /detokenize endpoints together. We first tokenize a string, then detokenize the resulting IDs to reconstruct the original text. This workflow is useful when you need to handle tokenization externally but still leverage the server for detokenization.

Example
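The round trip can be sketched as two request bodies. The field names, model name, and token IDs here are placeholders and assumptions; check http_server.py for the authoritative schema.

```python
import json

prompt = "SGLang provides tokenize and detokenize endpoints."

# Step 1: hypothetical /tokenize request body (field names assumed).
tokenize_payload = {"model": "qwen/qwen2.5-0.5b-instruct", "prompt": prompt}

# Step 2: feed the returned IDs back to /detokenize. The IDs below are
# placeholders; real ones come from the tokenize response.
token_ids = [1, 2, 3]
detokenize_payload = {"model": "qwen/qwen2.5-0.5b-instruct", "tokens": token_ids}

print(json.dumps(tokenize_payload))
print(json.dumps(detokenize_payload))
```

If the round trip succeeds, the detokenize response should reconstruct the original prompt.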
Example
Example
