mooncake / hf3fs / nixl / file / aibrix / eic) while SGLang is already running and serving traffic, without restarting the process.
For safety and consistency, the current implementation strictly requires these operations to happen only when the service is idle:
- No running requests
- No waiting/queued requests
1. Background and implementation overview
1.1 Architecture / control path
The control path is:
- HTTP Server (python/sglang/srt/entrypoints/http_server.py)
  - Exposes PUT /hicache/storage-backend, DELETE /hicache/storage-backend, GET /hicache/storage-backend
- TokenizerManager (python/sglang/srt/managers/tokenizer_communicator_mixin.py)
  - Sends the request to the Scheduler via _Communicator
- Scheduler (python/sglang/srt/managers/scheduler.py)
  - Performs a strict idle check
  - Calls tree_cache.attach_storage_backend(...) / detach_storage_backend(...)
- HiRadixCache (python/sglang/srt/mem_cache/hiradix_cache.py)
  - Parses hicache_storage_backend_extra_config_json (supports both backend config and prefetch knobs)
  - Calls cache_controller.attach_storage_backend(...) / detach_storage_backend(...)
- HiCacheController (python/sglang/srt/managers/cache_controller.py)
  - Creates/destroys the storage backend instance (via StorageBackendFactory)
  - Starts/stops backend background threads at runtime (prefetch/backup)
2. Idle-state requirement (strict)
The Scheduler uses a stricter check, _is_idle_for_hicache_storage_op(), which requires all of the following:
- _is_no_request() is true (covers running/overlap/pp/disagg and other active states)
- waiting_queue is empty
- grammar_queue is empty (if the grammar backend is enabled)
If the scheduler is not idle, the request is rejected with a message such as:
Reject attach: scheduler is not idle. #queue-req=... #running-req=...
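The idle conditions above can be sketched as a single predicate. This is a minimal illustration, not the Scheduler's actual code: SchedulerState and check_idle_for_storage_op are hypothetical names, is_no_request() is reduced to "no running requests", and the counts in the message are an assumption about what the elided "..." fields hold.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class SchedulerState:
    """Hypothetical stand-in for the scheduler fields the idle check reads."""

    running_reqs: List[str] = field(default_factory=list)
    waiting_queue: List[str] = field(default_factory=list)
    grammar_queue: List[str] = field(default_factory=list)
    grammar_backend_enabled: bool = False

    def is_no_request(self) -> bool:
        # Simplified: the real check also covers overlap/pp/disagg states.
        return not self.running_reqs


def check_idle_for_storage_op(state: SchedulerState, op: str) -> Tuple[bool, str]:
    """Return (True, "") when idle, else (False, rejection message)."""
    idle = state.is_no_request() and not state.waiting_queue
    # The grammar queue only matters when a grammar backend is enabled.
    if state.grammar_backend_enabled:
        idle = idle and not state.grammar_queue
    if idle:
        return True, ""
    return False, (
        f"Reject {op}: scheduler is not idle. "
        f"#queue-req={len(state.waiting_queue)} "
        f"#running-req={len(state.running_reqs)}"
    )
```

Returning the reason alongside the boolean lets the caller pass the message straight back through the HTTP response.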
2.1 DP (data parallel) semantics
When dp_size > 1, the tokenizer dispatches the request to all DP scheduler instances and aggregates their responses:
- The final success is true only if all DP ranks return success
- The final message concatenates messages from all DP ranks
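The aggregation rule can be sketched as follows. The function name aggregate_dp_results and the "; " separator are assumptions for illustration; the text only states that messages are concatenated.

```python
from typing import Iterable, Tuple


def aggregate_dp_results(
    results: Iterable[Tuple[bool, str]], sep: str = "; "
) -> Tuple[bool, str]:
    """Combine per-rank (success, message) pairs into one response."""
    results = list(results)
    # Success only if every DP rank reported success.
    success = all(ok for ok, _ in results)
    # Concatenate messages in rank order; the separator is an assumption.
    message = sep.join(msg for _, msg in results)
    return success, message
```

For example, aggregate_dp_results([(True, "ok"), (False, "not idle")]) returns (False, "ok; not idle"): one failing rank makes the whole operation fail.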
Since success requires every rank to succeed, one possible outcome is:
- Overall failure even though some ranks already succeeded
To handle this case:
- Prefer to keep backend config identical across ranks
- If attach fails, immediately call detach (best-effort/idempotent), fix config, then retry attach
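The "detach on failed attach" recommendation can be sketched as a small wrapper. This assumes the caller supplies attach and detach callables that return a (success, message) pair; attach_with_cleanup is a hypothetical helper, not part of SGLang.

```python
from typing import Callable, Tuple

Result = Tuple[bool, str]


def attach_with_cleanup(
    attach: Callable[[], Result], detach: Callable[[], Result]
) -> Result:
    """Try attach; on failure, issue a best-effort detach and report the error."""
    ok, msg = attach()
    if ok:
        return True, msg
    # Some ranks may have attached already, so detach to return to a clean state.
    # Detach is best-effort here: its outcome does not change the result.
    try:
        detach()
    except Exception:
        pass
    # The caller should fix the config before retrying attach.
    return False, msg
```

The retry itself is left to the operator, because the config has to be corrected before a second attach can succeed.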
3. How to use (HTTP Admin API)
The examples below assume your SGLang HTTP server is at http://127.0.0.1:30000.
3.1 Query current storage backend status
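A minimal sketch of the status query, using only the Python standard library. The endpoint and base URL come from this document; the helper names and the assumption that the response body is JSON are illustrative.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:30000"


def build_status_request(base_url: str = BASE_URL) -> urllib.request.Request:
    """Build GET /hicache/storage-backend without sending it."""
    return urllib.request.Request(f"{base_url}/hicache/storage-backend", method="GET")


def query_storage_backend(base_url: str = BASE_URL, timeout: float = 10.0):
    """Send the query and decode the body (assumed to be JSON)."""
    with urllib.request.urlopen(build_status_request(base_url), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling query_storage_backend() against a running server returns the decoded response; its exact fields are not specified here.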
3.2 Attach (enable) a storage backend
hicache_storage_backend_extra_config_json can include both:
- Backend configuration (e.g., Mooncake master/metadata/protocol, etc.)
- Prefetch configuration (prefetch_threshold, prefetch_timeout_base, prefetch_timeout_per_ki_token, hicache_storage_pass_prefix_keys)
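A minimal sketch of building the attach request. It assumes the request body is a JSON object whose keys match the server argument names (hicache_storage_backend and hicache_storage_backend_extra_config_json), and that the extra config is passed as a JSON-encoded string; neither detail is confirmed by this document. The prefetch values in the usage note are placeholders, and backend-specific keys are omitted.

```python
import json
import urllib.request
from typing import Any, Dict, Optional

BASE_URL = "http://127.0.0.1:30000"


def build_attach_request(
    backend: str,
    extra_config: Optional[Dict[str, Any]] = None,
    base_url: str = BASE_URL,
) -> urllib.request.Request:
    """Build PUT /hicache/storage-backend without sending it."""
    # Assumed body shape: keys mirror the server argument names.
    payload: Dict[str, Any] = {"hicache_storage_backend": backend}
    if extra_config is not None:
        # Assumption: the extra config travels as a JSON-encoded string.
        payload["hicache_storage_backend_extra_config_json"] = json.dumps(extra_config)
    return urllib.request.Request(
        f"{base_url}/hicache/storage-backend",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
```

For example, build_attach_request("mooncake", {"prefetch_threshold": 256, "prefetch_timeout_base": 0.5}) produces a request that can be sent with urllib.request.urlopen once the service is idle.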
3.3 Detach (disable) the storage backend
- Detach only makes SGLang stop using the L3 storage backend and stops prefetch/backup threads
- It does not automatically delete data stored in Mooncake/HF3FS (or other remote backends)
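A minimal sketch of the detach call, assuming DELETE takes no request body. The helper names are illustrative.

```python
import urllib.request

BASE_URL = "http://127.0.0.1:30000"


def build_detach_request(base_url: str = BASE_URL) -> urllib.request.Request:
    """Build DELETE /hicache/storage-backend without sending it."""
    return urllib.request.Request(
        f"{base_url}/hicache/storage-backend", method="DELETE"
    )


def detach_storage_backend(base_url: str = BASE_URL, timeout: float = 10.0) -> str:
    """Send the detach request and return the raw response body as text."""
    with urllib.request.urlopen(build_detach_request(base_url), timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

Data already written to the remote backend remains there after this call, so any cleanup has to be done with the backend's own tooling.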
4. Behavior and caveats
- No restart required: attach/detach switches in-process at runtime
- Must be idle: otherwise the request is rejected to avoid consistency issues
- Host KV layout constraints still apply: for example, Mooncake still requires layouts like page_first/page_first_direct/page_head; if the server’s HiCache host-memory layout does not satisfy the backend requirements, attach will fail with an error
- Observability:
  - After attach, server_args.hicache_storage_backend* is updated on both the tokenizer and scheduler sides
  - If metrics are enabled, attach will create a storage metrics collector in HiRadixCache on demand
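Because a layout mismatch only surfaces as an attach error, a client-side pre-flight check can catch it earlier. The sketch below is hypothetical: the table holds only the Mooncake layouts named above (the text says "layouts like", so the real list may differ), and REQUIRED_LAYOUTS and layout_is_compatible are not SGLang names.

```python
from typing import Dict, FrozenSet

# Assumed requirement table; only the Mooncake row comes from this document.
REQUIRED_LAYOUTS: Dict[str, FrozenSet[str]] = {
    "mooncake": frozenset({"page_first", "page_first_direct", "page_head"}),
}


def layout_is_compatible(backend: str, host_layout: str) -> bool:
    """Return False only when a known constraint is violated."""
    required = REQUIRED_LAYOUTS.get(backend)
    # Backends without a recorded constraint are not rejected by this sketch.
    return required is None or host_layout in required
```

The server remains the source of truth: a passing check here does not guarantee that attach succeeds.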
