Deployment
Install SGLang
Install SGLang
For all methods and hardware platforms, see the official SGLang installation guide. The two paths below match the Python / Docker toggle in the command panel.Then run the Python output of the command panel below in that environment.
- Python (pip / uv)
- Docker
Command
LFM2.5 support — the dense / MoE / VL model classes and the
lfm2 tool-call parser — ships on SGLang main. If your installed release predates it, install from source or use the Docker dev image.lfm2 tool-call parser and each reasoning model’s --reasoning-parser are already part of the verified command.
Panel controls (top of the command box):
- Python / Docker — bare
sglang serve …for an existing SGLang env, or adocker run … sglang serve …wrap against the dev image from the Install SGLang panel above. - ⧉ Copy — copies the current command (with whichever framing is active) to your clipboard.
- $ cURL — a sample request against
localhost:30000to confirm the server is up. - ⚙ Env — edits the placeholders (
HOST_IP,PORT,HF_TOKEN) the command and cURL share. Persists in localStorage across cookbooks. - Verified / Not Verified badge — green when the
(hw, variant, quant, strategy, nodes)combo has been run end-to-end on real hardware; yellow when auto-derived from a neighbor and not yet re-checked.
Playground
The Playground is where you experiment with SGLang features beyond the verified matrix. The Deploy panel above only emits combinations that have been signed off on; the Playground lets you turn on additional knobs on top of whichever cell the Deploy panel is currently showing. The base is read live from your Deploy selection — only your overrides change. For LFM2.5 the exposed knob is the TP override (every variant is verified at TP=1; TP=2 is available for experimentation on the larger checkpoints). The reasoning and tool-call parsers are not playground toggles here — they are variant-intrinsic and already baked into each verified command. Lines highlighted green are added by your overrides; lines with red strikethrough were in the verified base but stripped by an override. When no override differs from the base cell, the playground inherits the base’s Verified badge; any actual change flips it to Not Verified until the new configuration is run end-to-end and submitted back.Panel controls reuse Python / Docker · ⧉ Copy · $ cURL · ⚙ Env from the Deploy panel, plus one extra:
- Submit ↗ — opens a pre-filled GitHub issue so you can land your override combo as a new verified cookbook cell. Shown only while the badge says Not Verified; click it once you’ve actually run the command on your hardware and confirmed it works.
1. Model Introduction
LFM2.5 is Liquid AI’s family of hybrid models for on-device deployment, released under the LFM Open License v1.0. It builds on the LFM2 architecture with extended pre-training — 10T → 28T tokens for the dense models, 12T → 38T for the 8B-A1B MoE — and large-scale reinforcement learning. The backbone interleaves gated short convolution blocks with a small minority of grouped query attention (GQA) blocks. Each convolution block applies input-dependent multiplicative gating around a depthwise short convolution, giving fast local mixing at low compute and memory cost. The GQA blocks handle global context and long-range retrieval. This minimal hybrid layout was selected by a hardware-in-the-loop architecture search under edge latency and memory budgets. On CPUs it delivers up to 2× faster prefill and decode than similarly sized models (see the LFM2 Technical Report). Key Features:- Hybrid gated short conv + GQA layout: the 1.2B / 350M dense models are 16 layers (10 conv + 6 GQA); the 8B-A1B MoE is 24 layers (18 conv + 6 GQA). With only 6 attention layers per model, the KV cache stays small even at long context.
- Block details: depthwise convolutions with kernel size 3; GQA with 8 KV groups and head size 64, plus RoPE and QK-Norm; pre-norm RMSNorm and SwiGLU MLPs throughout.
- Sparse MoE (8B-A1B): 8.3B total / 1.5B active parameters. Every layer except the first two replaces its dense MLP with a 32-expert MoE block; each token is routed to the top-4 SwiGLU experts by a normalized sigmoid router with adaptive bias load balancing.
- New in 2.5 (8B-A1B): the blocks are unchanged from LFM2-8B-A1B, but the context window grows from 32K to 128K (a RoPE base-θ increase plus long-context midtraining) and the vocabulary doubles from 65,536 to 128,000 tokens for more efficient non-Latin tokenization.
- Pythonic tool calling: function calls are emitted as a Python list between
<|tool_call_start|>and<|tool_call_end|>tokens. Thelfm2tool-call parser surfaces these as standardmessage.tool_calls. - Reasoning variants: the 8B-A1B and 1.2B-Thinking checkpoints are reasoning-only models that always emit an explicit
<think>...</think>chain-of-thought before the answer. The MoE’s 1.5B active parameters keep those reasoning tokens cheap. - Multilingual: every model except the JP checkpoints covers at least English, Arabic, Chinese, French, German, Japanese, Korean, and Spanish (some variants add more). The dedicated JP chat checkpoints focus on Japanese (Japanese + English only).
- Vision: LFM2.5-VL-1.6B pairs the 1.2B language backbone with a SigLIP2 So400M NaFlex encoder for OCR, document understanding, and multilingual vision. LFM2.5-VL-450M pairs the 350M backbone with a SigLIP2 Base-86M encoder for captioning and object detection at edge sizes; bounding-box grounding and function calling are new in the 2.5 release.
| Model | Parameters | Context | Role |
|---|---|---|---|
| LFM2.5-8B-A1B | 8.3B total / 1.5B active (MoE) | 128K | Reasoning-tuned, agentic / tool use |
| LFM2.5-1.2B-Instruct | 1.17B (dense) | 32K | General instruct, RAG, data extraction |
| LFM2.5-1.2B-Thinking | 1.17B (dense) | 32K | Reasoning (always-on chain-of-thought) |
| LFM2.5-350M | 350M (dense) | 32K | Compact instruct, structured output |
| LFM2.5-230M | 230M (dense) | 32K | Most compact; data extraction, structured output |
| LFM2.5-1.2B-JP-202606 | 1.17B (dense) | 32K | Japanese chat (latest) |
| LFM2.5-1.2B-JP | 1.17B (dense) | 32K | Japanese chat (original) |
| LFM2.5-VL-1.6B | 1.2B LM + SigLIP2 400M | 32K | Vision-language (OCR, docs, multi-image) |
| LFM2.5-VL-450M | 350M LM + SigLIP2 86M | 32K | Compact vision-language (captioning, object detection) |
| LFM2.5-1.2B-Base | 1.17B (dense) | 32K | Pre-trained base (no post-training) |
--tool-call-parser) and the Base repos (pre-trained only, no post-training — see §3.5) launch the same way with the model path swapped.
Choosing a variant:
- 8B-A1B — flagship for agentic and tool-calling workloads; the only 128K-context option.
- 1.2B-Thinking — reasoning-heavy tasks: math, tool use, programming.
- 1.2B-Instruct — the recommended pick for chat and creative writing.
- 350M — tool use, data extraction, and structured output; not recommended for math, code, or creative writing.
- 230M — the most compact checkpoint; same use as the 350M, not for math, code, or creative writing.
2. Configuration Tips
- Reasoning parser: LFM2.5 reasoning models wrap their chain-of-thought in
<think>...</think>tags. The command generator passes--reasoning-parser qwen3for 8B-A1B (it emits an explicit opening<think>) and--reasoning-parser qwen3-thinkingfor 1.2B-Thinking (always-on reasoning). This splits the thinking process intoreasoning_content; without it the chain-of-thought stays inline incontent. - Tool calling:
--tool-call-parser lfm2surfaces LFM2.5’s Pythonic<|tool_call_start|>[...]<|tool_call_end|>calls as standardmessage.tool_calls. The original 1.2B-JP does not expose tool calling; Base has no post-training (see §3.5). - Attention backend on Blackwell (B200/sm100): SGLang defaults to the
trtllm_mhabackend on sm100, which is fastest for the dense text models. The 8B-A1B uses a mamba-style state cache that runs on a page-size-1 backend, so the generator picks--attention-backend flashinferfor it. The VL language model also uses that state cache and offers two backends:--attention-backend flashinfer(keeps prefix/radix caching — what the generator emits), or--attention-backend trtllm_mha --disable-radix-cacheto run the language model on Blackwelltrtllm_mhaattention (--disable-radix-cachelifts the page-size-1 requirement, at the cost of prefix caching). Pair either with--mm-attention-backend fa4for the vision tower. - VL vision tower (
--mm-attention-backend): on sm100 thetrtllm_mhadefault is fastest for text but applies causal attention to image tokens. For the VL model, pass--mm-attention-backend fa4on B200/B300 (orfa3on H100/H200) to restore bidirectional image-token attention and full vision quality. - VL multimodal feature transport: the generator launches the VL models with
SGLANG_USE_CUDA_IPC_TRANSPORT=1 SGLANG_USE_IPC_POOL_HANDLE_CACHE=1. The first moves the processor→scheduler image-feature handoff onto CUDA IPC instead of serializing tensors between processes; the second ships the pool handle so the scheduler opens it once and caches it, instead of opening a per-item handle on every request. On the image serving workload (1 image @ 720p, measured on VL-1.6B on H100 and B200) this pair is worth roughly 30–50% higher image throughput and 30–40% lower image TTFT vs running without them (measured on VL-1.6B, H100 and B200); decode speed (TPOT) is unaffected. - VL-450M memory headroom (
--mem-fraction-static 0.8): with the default memory fraction, the 450M’s small weights make SGLang size its static KV/mamba pools to nearly the whole GPU, leaving no headroom for image-feature tensors — under sustained concurrent image load the scheduler can crash with a CUDA OOM in the radix-cache free path. The generator caps--mem-fraction-static 0.8for VL-450M; the pool is still far larger than this model ever needs. - Mamba scheduling: LFM2.5 runs on the default
no_buffermamba scheduler strategy — no--mamba-scheduler-strategyflag is needed. Theextra_bufferstrategy (an overlap-scheduling throughput optimization available for some Gated-DeltaNet hybrids) does not apply to LFM2.5, whose convolution blocks usemamba_chunk_size=1. - Hardware requirements: all LFM2.5 models run on a single GPU (TP=1) on either Hopper or Blackwell. The 1.2B / 350M dense models fit in a few GB; the 8B-A1B MoE needs roughly 16 GB for bf16 weights plus KV cache. Multi-GPU tensor parallelism is not required for any variant.
generation_config.json, so the server will not apply them for you. top_k, min_p, and repetition_penalty are not standard OpenAI chat.completions fields — pass them through extra_body and SGLang forwards them to its sampler. Do not set max_tokens unless you intend to cap output, as it can truncate a response (or a reasoning model’s chain-of-thought) mid-stream.
| Model | temperature | extra_body (sampler) |
|---|---|---|
| LFM2.5-8B-A1B | 0.2 | |
| LFM2.5-1.2B-Instruct | 0.1 | |
| LFM2.5-1.2B-Thinking | 0.05 | |
| LFM2.5-350M | 0.1 | |
| LFM2.5-230M | 0.1 | |
| LFM2.5-1.2B-JP-202606 | 0.1 | |
| LFM2.5-1.2B-JP | 0.3 | |
| LFM2.5-VL-1.6B (text) | 0.1 | |
| LFM2.5-VL-450M (text) | 0.1 | |
| LFM2.5-1.2B-Base | 0.3 |
3. Advanced Usage
3.1 Basic Usage
A single client with the recommended sampling presets applied per model (the examples in the following sections reuse thischat helper):
Example
3.2 Reasoning
The 8B-A1B and 1.2B-Thinking checkpoints emit chain-of-thought as a built-in behavior. The Deploy panel launches them with the matching--reasoning-parser, which separates the thinking process into reasoning_content:
Example
3.3 Tool Calling
LFM2.5 writes Pythonic tool calls. With--tool-call-parser lfm2 (already part of the launch command) they are surfaced as standard message.tool_calls:
Example
3.4 Vision Input
The VL models (VL-1.6B and VL-450M) accept images via standard OpenAI multimodal content blocks. Base64 data URIs (data:image/jpeg;base64,...) work in place of a URL:
Example
3.5 Base Checkpoints
Each size ships a pre-trained Base repo — LFM2.5-230M-Base, LFM2.5-1.2B-Base, LFM2.5-350M-Base, and LFM2.5-8B-A1B-Base — intended for fine-tuning and continued pre-training. The repos ship a ChatML-style chat template, sochat.completions requests format normally. The checkpoints have no post-training, though — don’t expect instruction following. For raw text continuation:
Example
