SGLang can fall back to models that are available in Transformers. This works for most decoder-style language models, and support for vision-language models is coming soon!
Example Launch Command
By default, SGLang uses its own implementation of a model if one is available, and falls back to the Transformers implementation otherwise. You can force the Transformers implementation by setting --model-impl to transformers.
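For example, a server can be launched with the Transformers implementation forced via the --model-impl flag. The model path below is illustrative; substitute your own:

```shell
# Launch an SGLang server using the Transformers fallback
# (model path is illustrative; replace it with your own model)
python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-3.2-1B-Instruct \
  --model-impl transformers \
  --port 30000
```

Dropping --model-impl transformers restores the default behavior, where the native SGLang implementation is preferred when it exists.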
Supported Features
Quantization
The Transformers fallback supports most of the quantization methods available in SGLang (except GGUF). See the Quantization page for more information about supported quantization methods in SGLang.
Remote Code
This fallback also means that any model on the Hub that can be used in transformers with trust_remote_code=True, and that correctly implements attention, can be used in production!
A model just needs the following two things:
- Load the config
- Load the model class
The MyModel Python class is loaded from the auto_map in the config, and we check that the model _supports_attention_backend.
- Use the TransformersModel backend
The TransformersModel backend is used. See /srt/models/transformers, which sets self.config._attn_implementation = "sglang", hence the need to use ALL_ATTENTION_FUNCTIONS.
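The attention-backend check above can be sketched roughly as follows. This is a hypothetical illustration, not SGLang's actual internals; the names MyModel and can_use_transformers_fallback are invented for the example:

```python
# Hypothetical sketch of the fallback check described above.
# Names (MyModel, can_use_transformers_fallback) are illustrative only.

class MyModel:
    # A remote-code model declares that it supports pluggable attention
    # backends; conceptually, SGLang can then inject its own attention
    # function by setting config._attn_implementation = "sglang".
    _supports_attention_backend = True


def can_use_transformers_fallback(model_cls) -> bool:
    # Gate the Transformers fallback on the attention-backend flag
    # (illustrative check).
    return getattr(model_cls, "_supports_attention_backend", False)


print(can_use_transformers_fallback(MyModel))  # True
```

A model class that does not declare the flag would fail this check and could not be routed through the fallback.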
That’s it!