# Model Loading

> Control how SGLang loads model weights: load formats, model loader extra config, multithreaded loading, prefetching, and remote/streaming loaders.

`--model-path` selects the checkpoint to serve; `--load-format` and the weight-loading flags below control how those weights are read into memory. To stream weights from cloud object storage (S3/GCS/Azure), see [Loading Models from Object Storage](./object_storage).

## How loading works

SGLang picks a loader from `--load-format`, falling back to auto-detection from the checkpoint or model path. The default `auto` loader reads `safetensors` and falls back to PyTorch `.bin`.

```bash
python -m sglang.launch_server \
  --model-path Qwen/Qwen3.6-35B-A3B \
  --load-format auto
```

Some formats are auto-detected and override `auto`:

* A Mistral native checkpoint is detected and loaded with `mistral`.
* A `.gguf` model path is detected and loaded with `gguf`.
* An object storage URI (`s3://`, `gs://`, `az://`) is loaded with `runai_streamer`.
* A remote URI is loaded with `remote`.

## Load formats

Set with `--load-format`:

| Format | Description |
| --- | --- |
| `auto` | Default. Load `safetensors` if available, otherwise fall back to the PyTorch `.bin` format. |
| `safetensors` | Load weights in the safetensors format. |
| `pt` | Load weights in the PyTorch `.bin` format. |
| `npcache` | Load PyTorch-format weights and store a numpy cache to speed up subsequent loads. Only supports `.bin` checkpoints. |
| `dummy` | Initialize weights with random values, for profiling. |
| `sharded_state` | Each tensor-parallel worker reads only its own pre-sharded shard rather than the full checkpoint, giving a fast load path for large TP models. See `examples/runtime/engine/save_sharded_state.py` for creating a sharded checkpoint. |
| `fastsafetensors` | Load safetensors using the `fastsafetensors` iterator. |
| `layered` | Load weights layer by layer, so a layer can be quantized before the next is loaded, lowering peak memory use. |
| `gguf` | Load weights in the GGUF format. Auto-detected from a `.gguf` model path. |
| `bitsandbytes` | Load weights using bitsandbytes quantization. |
| `mistral` | Load a Mistral native-format checkpoint. Auto-detected for such checkpoints. |
| `flash_rl` | Load a BF16/FP16 checkpoint with native SGLang FP8 quantization for RL training. Requires `--rl-quant-profile`. |
| `runai_streamer` | Stream weights from SSDs, shared filesystems, or object storage. See [Loading Models from Object Storage](./object_storage). |
| `remote` | Load tensors from a remote KV/filesystem connector. Auto-detected for remote URIs. |
| `remote_instance` | Pull weights over the network from another running SGLang instance (the "seed") rather than from disk. Configured with the `--remote-instance-weight-loader-*` flags. |
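
The auto-detection rules listed under "How loading works" can be pictured as a small resolution function. The sketch below is illustrative only and is not SGLang's code: `resolve_load_format` and `OBJECT_STORAGE_SCHEMES` are hypothetical names, the order of the two checks is an assumption, and it assumes an explicit `--load-format` is left untouched. The Mistral and remote-URI rules are omitted because this page does not state how those are recognized.

```python
# Hypothetical sketch of two documented auto-detection rules; not SGLang's implementation.
OBJECT_STORAGE_SCHEMES = ("s3://", "gs://", "az://")


def resolve_load_format(model_path: str, load_format: str = "auto") -> str:
    """Return the load format that would apply to model_path under the documented rules."""
    # Assumption: detection only overrides the default `auto` setting.
    if load_format != "auto":
        return load_format
    # Object storage URIs are streamed with runai_streamer.
    if model_path.startswith(OBJECT_STORAGE_SCHEMES):
        return "runai_streamer"
    # A model path ending in .gguf selects the GGUF loader.
    if model_path.endswith(".gguf"):
        return "gguf"
    # Otherwise stay on `auto` (safetensors first, then PyTorch .bin).
    return "auto"


for path in ("s3://bucket/model", "models/model.gguf", "Qwen/Qwen3.6-35B-A3B"):
    print(path, "->", resolve_load_format(path))
```

Passing `--load-format` explicitly makes the choice visible in the launch command instead of relying on detection.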

## Model loader extra config

`--model-loader-extra-config` takes a JSON string passed to the loader selected by `--load-format`.

```bash
python -m sglang.launch_server \
  --model-path Qwen/Qwen3.6-35B-A3B \
  --model-loader-extra-config '{"enable_multithread_load": true, "num_threads": 16}'
```

| Load format | Key | Description | Default |
| --- | --- | --- | --- |
| `auto` / `safetensors` / `pt` / `npcache` | `enable_multithread_load` (bool) | Read weight shards with a thread pool instead of sequentially. | `true` |
| `auto` / `safetensors` / `pt` / `npcache` | `num_threads` (int) | Number of worker threads when multithreaded loading is enabled. | 8 |
| `sharded_state` | `pattern` (str) | Filename pattern for per-rank shards. | `model-rank-{rank}-part-{part}.safetensors` |
| `bitsandbytes` | `qlora_adapter_name_or_path` (str) | QLoRA adapter to apply on top of the bitsandbytes-quantized base weights. | none |
| `runai_streamer` | `distributed`, `concurrency`, `memory_limit` | Streaming controls. See [Loading Models from Object Storage](./object_storage). | See linked page |
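
Because the extra config is a JSON string, hand-quoting it in a shell is easy to get wrong (for example writing Python's `True` instead of JSON's `true`). One way to avoid that is to build the argument list programmatically. This is a minimal sketch: `build_launch_command` is a hypothetical helper, not part of SGLang, and it only assembles the command without running it. The last two lines assume the `sharded_state` pattern uses `str.format`-style placeholders, which the default value suggests but this page does not confirm.

```python
import json
import shlex


def build_launch_command(model_path, load_format="auto", extra_config=None):
    """Assemble a launch_server argument list (hypothetical helper)."""
    cmd = [
        "python", "-m", "sglang.launch_server",
        "--model-path", model_path,
        "--load-format", load_format,
    ]
    if extra_config:
        # json.dumps emits valid JSON: True -> true, double-quoted keys.
        cmd += ["--model-loader-extra-config", json.dumps(extra_config)]
    return cmd


cmd = build_launch_command(
    "Qwen/Qwen3.6-35B-A3B",
    extra_config={"enable_multithread_load": True, "num_threads": 16},
)
# shlex.join quotes the JSON so the line can be pasted into a POSIX shell.
print(shlex.join(cmd))

# Assumption: the per-rank shard pattern expands like a str.format template.
pattern = "model-rank-{rank}-part-{part}.safetensors"
print(pattern.format(rank=0, part=0))
```

The resulting list can be passed to `subprocess.run(cmd)` directly, which skips shell quoting altogether.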

## Weight-loading performance flags

Top-level arguments that tune how safetensors weights are read, independent of `--load-format`.

| Flag | Description | Default |
| --- | --- | --- |
| `--download-dir` | Directory used to download and cache Hugging Face model files. | HF default |
| `--weight-loader-disable-mmap` | Disable mmap while loading safetensors. Can help on filesystems where mmap is slow. | off |
| `--weight-loader-prefetch-checkpoints` | Prefetch checkpoint files into the OS page cache before loading. Each rank prefetches a fraction of the shards, cutting total network I/O on shared filesystems (NFS/Lustre) from N times the checkpoint size to one checkpoint size, where N is the number of ranks. Recommended for models on network storage. | off |
| `--weight-loader-prefetch-num-threads` | Threads per rank for checkpoint prefetching. | 4 |
| `--weight-loader-drop-cache-after-load` | Call `posix_fadvise(POSIX_FADV_DONTNEED)` on each safetensors shard after loading it, freeing page cache. | off |
| `--custom-weight-loader` | Import path(s) of a custom weight-loading function, e.g. `my_package.weight_load_func`. | none |
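
To make the prefetch and drop-cache descriptions concrete, here is a small self-contained sketch of the idea. It is not SGLang's implementation: the round-robin split in `shards_for_rank`, the helper names, and the placeholder shard files are all assumptions made for illustration. It shows that when each rank warms only its own slice of the shards, the bytes read across all ranks add up to one checkpoint rather than one checkpoint per rank.

```python
import os
import tempfile


def shards_for_rank(shards, rank, world_size):
    """Assumed round-robin split: rank r takes shards r, r + world_size, ..."""
    return sorted(shards)[rank::world_size]


def prefetch(path, chunk_size=1 << 20):
    """Read the file once and discard the bytes so the kernel can cache its pages."""
    total = 0
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            total += len(chunk)
    return total


def drop_page_cache(path):
    """Advise the kernel that this file's cached pages are no longer needed."""
    if not hasattr(os, "posix_fadvise"):  # not available on every platform
        return False
    fd = os.open(path, os.O_RDONLY)
    try:
        # offset=0 and len=0 cover the whole file.
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)
    return True


world_size, shard_size = 4, 4096
with tempfile.TemporaryDirectory() as tmp:
    shards = []
    for i in range(8):
        # Placeholder files filled with zeros, not real safetensors shards.
        path = os.path.join(tmp, f"shard-{i:05d}.safetensors")
        with open(path, "wb") as f:
            f.write(b"\0" * shard_size)
        shards.append(path)

    per_rank = [shards_for_rank(shards, r, world_size) for r in range(world_size)]
    bytes_read = sum(prefetch(p) for rank_shards in per_rank for p in rank_shards)
    checkpoint_bytes = shard_size * len(shards)
    for p in shards:
        drop_page_cache(p)

print("prefetched bytes:", bytes_read, "checkpoint bytes:", checkpoint_bytes)
print("bytes if every rank read everything:", world_size * checkpoint_bytes)
```

On a local disk the difference is small; the saving matters when every rank would otherwise pull the full checkpoint over the same network filesystem.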

## See also

* [Loading Models from Object Storage](./object_storage)
* [Quantization](./quantization)
* [Server Arguments](./server_arguments)