- Cross-encoder rerank models: run with `--is-embedding` (embedding runner).
- Decoder-only rerank models: run without `--is-embedding` and use next-token logprob scoring (yes/no).
  - Text-only (e.g. Qwen3-Reranker)
  - Multimodal (e.g. Qwen3-VL-Reranker): also supports image/video content

Some models may also require `--trust-remote-code`.
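To make "next-token logprob scoring (yes/no)" concrete, here is a minimal sketch of one common way to turn two label logprobs into a relevance score: renormalize over the two labels so the score is the probability of "yes". The exact normalization the server applies is not stated in this document, so treat the formula and the helper name `yes_no_score` as assumptions.

```python
import math


def yes_no_score(logprob_yes: float, logprob_no: float) -> float:
    """Return P(yes) renormalized over the two labels {yes, no}.

    Assumption: a two-way softmax over the label logprobs. This is an
    illustration, not necessarily the server's exact computation.
    """
    # Subtract the max before exponentiating for numerical stability.
    m = max(logprob_yes, logprob_no)
    p_yes = math.exp(logprob_yes - m)
    p_no = math.exp(logprob_no - m)
    return p_yes / (p_yes + p_no)
```

With this convention a document whose "yes" and "no" logprobs are equal scores 0.5, and the score approaches 1.0 as "yes" dominates.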
Supported rerank models
| Model Family (Rerank) | Example HuggingFace Identifier | Chat Template | Description |
|---|---|---|---|
| BGE-Reranker (BgeRerankModel) | BAAI/bge-reranker-v2-m3 | N/A | High-performance cross-encoder reranker model from BAAI. Suitable for reranking search results based on semantic relevance. Currently supports only the triton and torch_native attention backends. |
| Qwen3-Reranker (decoder-only yes/no) | Qwen/Qwen3-Reranker-8B | examples/chat_template/qwen3_reranker.jinja | Decoder-only reranker using next-token logprob scoring for labels (yes/no). Launch without --is-embedding. |
| Qwen3-VL-Reranker (multimodal yes/no) | Qwen/Qwen3-VL-Reranker-2B | examples/chat_template/qwen3_vl_reranker.jinja | Multimodal decoder-only reranker supporting text, images, and videos. Uses yes/no logprob scoring. Launch without --is-embedding. |
Cross-Encoder Rerank (embedding runner)
Launch Command
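A cross-encoder launch can be sketched as a small Python helper that assembles the command line. The `--is-embedding` and `--attention-backend` flags come from this page; the module name `sglang.launch_server`, the `--model-path` and `--port` flags, and the port value are assumptions here, so check them against your installed version before relying on them.

```python
import subprocess
from typing import List


def cross_encoder_launch_args(
    model_path: str,
    port: int = 30000,  # assumed port, pick any free one
    attention_backend: str = "triton",  # page lists triton and torch_native for BGE
) -> List[str]:
    """Build the argv for launching a cross-encoder reranker."""
    return [
        "python", "-m", "sglang.launch_server",  # assumed entry point
        "--model-path", model_path,              # assumed flag name
        "--port", str(port),                     # assumed flag name
        "--attention-backend", attention_backend,
        "--is-embedding",  # cross-encoder rerank models use the embedding runner
    ]


def launch(args: List[str]) -> subprocess.Popen:
    """Start the server process; the caller is responsible for stopping it."""
    return subprocess.Popen(args)
```

For example, `cross_encoder_launch_args("BAAI/bge-reranker-v2-m3")` yields an argument list that can be passed to `launch` or joined with spaces and pasted into a shell.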
Example Client Request
- `query` (required): The query text to rank documents against
- `documents` (required): List of documents to be ranked
- `model` (required): Model to use for reranking
- `top_n` (optional): Maximum number of documents to return. Defaults to returning all documents. If the specified value is greater than the total number of documents, all documents are returned.
- `return_documents` (optional): Whether to return documents in the response. Defaults to `True`.
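The parameters above map directly onto a JSON body for `/v1/rerank`. The following is a minimal client sketch using only the standard library; the helper names and the base URL passed by the caller are hypothetical, and the sketch assumes the endpoint accepts a JSON POST.

```python
import json
import urllib.request
from typing import Any, Dict, List, Optional


def build_rerank_payload(
    query: str,
    documents: List[str],
    model: str,
    top_n: Optional[int] = None,
    return_documents: bool = True,
) -> Dict[str, Any]:
    """Assemble the request body from the documented parameters."""
    payload: Dict[str, Any] = {
        "query": query,
        "documents": documents,
        "model": model,
        "return_documents": return_documents,
    }
    # Omit top_n entirely so the server falls back to returning all documents.
    if top_n is not None:
        payload["top_n"] = top_n
    return payload


def post_rerank(base_url: str, payload: Dict[str, Any]) -> Any:
    """POST the payload to <base_url>/v1/rerank and decode the JSON reply."""
    request = urllib.request.Request(
        f"{base_url}/v1/rerank",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

A call such as `post_rerank("http://localhost:30000", build_rerank_payload(...))` would then return the ranked list, assuming a server is listening at that address.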
Qwen3-Reranker (decoder-only yes/no rerank)
Launch Command
Do not launch Qwen3-Reranker with `--is-embedding`.
Example Client Request (supports optional instruct, top_n, and return_documents)
- `query` (required): The query text to rank documents against
- `documents` (required): List of documents to be ranked
- `model` (required): Model to use for reranking
- `instruct` (optional): Instruction text for the reranker
- `top_n` (optional): Maximum number of documents to return. Defaults to returning all documents. If the specified value is greater than the total number of documents, all documents are returned.
- `return_documents` (optional): Whether to return documents in the response. Defaults to `True`.
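Compared with the cross-encoder request, the only new field is `instruct`. A minimal sketch of a payload builder that includes it only when set is shown below; the function name is hypothetical, and any instruction text you pass is your own choice rather than a documented default.

```python
from typing import Any, Dict, List, Optional


def build_qwen3_rerank_payload(
    query: str,
    documents: List[str],
    model: str,
    instruct: Optional[str] = None,
    top_n: Optional[int] = None,
    return_documents: bool = True,
) -> Dict[str, Any]:
    """Assemble a /v1/rerank body for a decoder-only (yes/no) reranker."""
    payload: Dict[str, Any] = {
        "query": query,
        "documents": documents,
        "model": model,
        "return_documents": return_documents,
    }
    # Optional fields are left out when unset so server defaults apply.
    if instruct is not None:
        payload["instruct"] = instruct
    if top_n is not None:
        payload["top_n"] = top_n
    return payload
```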
Response Format
/v1/rerank returns a list of objects (sorted by descending score):
- `score`: float, higher means more relevant
- `document`: the original document string (only included when `return_documents` is `true`)
- `index`: the original index in the input `documents`
- `meta_info`: optional debug/usage info (may be present for some models)
The number of returned results is controlled by the `top_n` parameter. If `top_n` is not specified or is greater than the total number of documents, all documents are returned.
Example (with return_documents: true):
Example (with `return_documents: false`):
Example (with `top_n: 2`):
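The effect of `return_documents` and `top_n` on the response can be illustrated locally without a server. The sketch below mimics the documented response shape (sorted by descending score, `index` pointing back into the input); `shape_rerank_response` is a hypothetical helper, not SGLang code, and any scores you feed it are placeholders rather than real model output.

```python
from typing import Any, Dict, List, Optional, Sequence


def shape_rerank_response(
    scores: Sequence[float],
    documents: Sequence[str],
    top_n: Optional[int] = None,
    return_documents: bool = True,
) -> List[Dict[str, Any]]:
    """Build a list shaped like the documented /v1/rerank response."""
    results: List[Dict[str, Any]] = []
    for index, (score, document) in enumerate(zip(scores, documents)):
        item: Dict[str, Any] = {"score": float(score), "index": index}
        if return_documents:
            item["document"] = document  # only present when requested
        results.append(item)
    # Highest score first, as described for the endpoint.
    results.sort(key=lambda r: r["score"], reverse=True)
    # Slicing past the end is harmless, so an oversized top_n returns everything.
    if top_n is not None:
        results = results[:top_n]
    return results
```

Calling it with three documents and `top_n=2` keeps the two highest-scoring entries, while `return_documents=False` leaves only `score` and `index` in each item.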
Common Pitfalls
- If you launch Qwen3-Reranker with `--is-embedding`, `/v1/rerank` cannot compute yes/no logprob scores. Relaunch without `--is-embedding`.
- If you see a validation error like "score should be a valid number" and the backend returned a list, upgrade to a version that coerces `embedding[0]` into `score` for rerank responses.
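The second pitfall comes down to a type mismatch: a score field that should be a number arrives as a one-element list. The following sketch shows the kind of coercion described, written as a hypothetical standalone helper; it is not the server's actual code, and upgrading remains the recommended fix.

```python
from typing import Sequence, Union


def coerce_score(raw: Union[float, int, Sequence[float]]) -> float:
    """Return a single float score, unwrapping a list-shaped value if needed."""
    if isinstance(raw, (list, tuple)):
        if not raw:
            raise ValueError("empty score list")
        # Mirrors the idea of taking embedding[0] as the score.
        return float(raw[0])
    return float(raw)
```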
Qwen3-VL-Reranker (multimodal decoder-only rerank)
Qwen3-VL-Reranker extends Qwen3-Reranker to support multimodal content, allowing reranking of documents containing text, images, and videos.

Launch Command
Do not launch Qwen3-VL-Reranker with `--is-embedding`.
Text-Only Reranking (backward compatible)
Image Reranking (text query, image/mixed documents)
Multimodal Query Reranking (query with image)
Request Parameters (Multimodal)
- `query` (required): Can be a string (text-only) or a list of content parts:
  - `{"type": "text", "text": "..."}` for text
  - `{"type": "image_url", "image_url": {"url": "..."}}` for images
  - `{"type": "video_url", "video_url": {"url": "..."}}` for videos
- `documents` (required): List where each document can be a string or a list of content parts (same format as `query`)
- `instruct` (optional): Instruction text for the reranker
- `top_n` (optional): Maximum number of documents to return
- `return_documents` (optional): Whether to return documents in the response (default: `false`)
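The content-part format above is easy to get wrong by hand, so a few small constructors help. This is a minimal sketch restricted to the fields listed in this section; the helper names are hypothetical and the URLs in any call are placeholders you supply.

```python
from typing import Any, Dict, List, Optional, Union

ContentPart = Dict[str, Any]
Content = Union[str, List[ContentPart]]


def text_part(text: str) -> ContentPart:
    return {"type": "text", "text": text}


def image_part(url: str) -> ContentPart:
    return {"type": "image_url", "image_url": {"url": url}}


def video_part(url: str) -> ContentPart:
    return {"type": "video_url", "video_url": {"url": url}}


def build_multimodal_payload(
    query: Content,
    documents: List[Content],
    instruct: Optional[str] = None,
    top_n: Optional[int] = None,
    return_documents: bool = False,  # default stated in this section
) -> Dict[str, Any]:
    """Assemble a multimodal rerank body; strings and part lists can be mixed."""
    payload: Dict[str, Any] = {
        "query": query,
        "documents": documents,
        "return_documents": return_documents,
    }
    if instruct is not None:
        payload["instruct"] = instruct
    if top_n is not None:
        payload["top_n"] = top_n
    return payload
```

A text query over mixed documents would pass a plain string as `query` and a list such as `[[text_part("a caption"), image_part(url)], "a plain text document"]` as `documents`.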
Common Pitfalls
- Always use `--chat-template examples/chat_template/qwen3_vl_reranker.jinja` for Qwen3-VL-Reranker.
- Do NOT launch with `--is-embedding`.
- For best results, use `--disable-radix-cache` to avoid caching issues with multimodal content.
- Note: Currently only `Qwen3-VL-Reranker-2B` is tested and supported. The 8B model may behave differently and is not guaranteed to work with this template.
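The first three pitfalls can be checked mechanically before starting a server. The sketch below inspects a list of launch arguments and reports which of those rules it violates; `check_vl_reranker_args` is a hypothetical helper, and it only understands the space-separated `--flag value` form.

```python
from typing import List, Optional


def check_vl_reranker_args(argv: List[str]) -> List[str]:
    """Return a list of problems found in Qwen3-VL-Reranker launch arguments."""
    problems: List[str] = []

    if "--is-embedding" in argv:
        problems.append("remove --is-embedding (decoder-only yes/no scoring)")

    # Find the value following --chat-template, if any.
    template: Optional[str] = None
    if "--chat-template" in argv:
        position = argv.index("--chat-template")
        if position + 1 < len(argv):
            template = argv[position + 1]
    if template is None or not template.endswith("qwen3_vl_reranker.jinja"):
        problems.append("pass --chat-template examples/chat_template/qwen3_vl_reranker.jinja")

    if "--disable-radix-cache" not in argv:
        problems.append("add --disable-radix-cache for multimodal content")

    return problems
```

An empty result means the arguments are consistent with the pitfalls listed above; it does not verify that the model or template file actually exists.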
