Enabling cache for torch.compile

SGLang uses the max-autotune-no-cudagraphs mode of torch.compile. Auto-tuning can be slow, so if you deploy a model on many machines, you can ship the torch.compile cache to them and skip the compilation step. This approach follows the PyTorch compile caching tutorial: https://pytorch.org/tutorials/recipes/torch_compile_caching_tutorial.html

Generate the cache by setting TORCHINDUCTOR_CACHE_DIR and running the model once.

TORCHINDUCTOR_CACHE_DIR=/root/inductor_root_cache python3 -m sglang.launch_server --model meta-llama/Llama-3.1-8B-Instruct --enable-torch-compile

Copy the cache folder to the other machines, then launch the server there with TORCHINDUCTOR_CACHE_DIR set to the copied folder's path.
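One way to do the copy is to pack the cache folder into an archive and unpack it on each target machine. The sketch below is illustrative only: pack_cache and unpack_cache are hypothetical helper names, not part of SGLang, and the archive path is an assumption.

```shell
# Hypothetical helpers (not SGLang tooling); paths and archive name are assumptions.

# pack_cache CACHE_DIR ARCHIVE: archive the cache folder on the machine that compiled the model.
pack_cache() {
  tar -czf "$2" -C "$(dirname "$1")" "$(basename "$1")"
}

# unpack_cache ARCHIVE PARENT_DIR: restore the cache folder under PARENT_DIR on a target machine.
unpack_cache() {
  mkdir -p "$2" && tar -xzf "$1" -C "$2"
}

# On the machine that generated the cache:
#   pack_cache /root/inductor_root_cache /tmp/inductor_cache.tar.gz
# On each target machine, after transferring the archive (for example with scp):
#   unpack_cache /tmp/inductor_cache.tar.gz /root
#   TORCHINDUCTOR_CACHE_DIR=/root/inductor_root_cache python3 -m sglang.launch_server --model meta-llama/Llama-3.1-8B-Instruct --enable-torch-compile
```

Unpacking to the same path used when the cache was generated keeps the launch command identical across machines. The cache is most likely to be reused when the target machines match the original in GPU model and in PyTorch and Triton versions; otherwise compilation may simply run again.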
