SGLang can fall back to models that are available in Transformers. This works for most decoder-style language models, and support for vision-language models is coming soon!
Example Launch Command
By default, SGLang uses its own implementation of a model if one is available, and falls back to the Transformers implementation otherwise. You can force the Transformers implementation by setting --model-impl to transformers.
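For example, a server can be launched with the Transformers implementation forced via the --model-impl flag. The model path below is illustrative; substitute your own:

```shell
# Launch an SGLang server using the Transformers fallback
# (model path is illustrative; replace it with your own model)
python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-3.2-1B-Instruct \
  --model-impl transformers \
  --port 30000
```

Dropping --model-impl transformers restores the default behavior, where the native SGLang implementation is preferred when it exists.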
Supported Features
Quantization
The Transformers fallback supports most of the quantization methods available in SGLang (except GGUF). See the Quantization page for more information about supported quantization methods in SGLang.
Remote Code
This fallback also means that any model on the Hub that can be used in transformers with trust_remote_code=True, and that correctly implements attention, can be used in production!
A model just needs the following two things:
- Load the config
- Load the model class
The MyModel Python class is loaded from the auto_map in the config, and we check that the model _supports_attention_backend.
- Use the TransformersModel backend
The TransformersModel backend is used. See /srt/models/transformers, which sets self.config._attn_implementation = "sglang", hence the need to use ALL_ATTENTION_FUNCTIONS.
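The attention-backend check above can be sketched roughly as follows. This is a hypothetical illustration, not SGLang's actual internals; the names MyModel and can_use_transformers_fallback are invented for the example:

```python
# Hypothetical sketch of the fallback check described above.
# Names (MyModel, can_use_transformers_fallback) are illustrative only.

class MyModel:
    # A remote-code model declares that it supports pluggable attention
    # backends; conceptually, SGLang can then inject its own attention
    # function by setting config._attn_implementation = "sglang".
    _supports_attention_backend = True


def can_use_transformers_fallback(model_cls) -> bool:
    # Gate the Transformers fallback on the attention-backend flag
    # (illustrative check).
    return getattr(model_cls, "_supports_attention_backend", False)


print(can_use_transformers_fallback(MyModel))  # True
```

A model class that does not declare the flag would fail this check and could not be routed through the fallback.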
That’s it!