max-autotune-no-cudagraphs mode of torch.compile. The auto-tuning can be slow.
If you want to deploy a model on many different machines, you can ship the torch.compile cache to these machines and skip the compilation steps.
This is based on https://pytorch.org/tutorials/recipes/torch_compile_caching_tutorial.html
- Generate the cache by setting TORCHINDUCTOR_CACHE_DIR and running the model once.
Output
- Copy the cache folder to other machines and launch the server with
TORCHINDUCTOR_CACHE_DIR.
