Introduction
XiaomiMiMo/MiMo-V2-Flash is a new inference-centric model from the XiaomiMiMo Team, with 309B total parameters and 15B activated parameters. It is designed to maximize decoding efficiency and is explicitly co-designed for real-world serving workloads, enabling flexible tradeoffs between throughput and latency across different hardware. The model strikes a new balance between long-context modeling capability and inference efficiency. Key features include:
- Hybrid Attention Architecture: Interleaves Sliding Window Attention (SWA) and Global Attention (GA) at a 5:1 ratio with an aggressive 128-token window. This reduces KV-cache storage by nearly 6x while maintaining long-context performance via a learnable attention sink bias.
- Multi-Token Prediction (MTP): Equipped with a lightweight MTP module (0.33B params per block) using dense FFNs. This triples output speed during inference and accelerates rollout in RL training.
- Efficient Pre-Training: Trained on 27T tokens using FP8 mixed precision and a native 32k sequence length; the context window extends up to 256k.
- Agentic Capabilities: Post-training utilizes Multi-Teacher On-Policy Distillation (MOPD) and large-scale agentic RL, achieving superior performance on SWE-Bench and complex reasoning tasks.
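The near-6x KV-cache saving quoted above follows directly from the 5:1 SWA/GA ratio and the 128-token window. A back-of-the-envelope sketch (our illustration, not from the model card):

```python
# Illustrative KV-cache arithmetic for the hybrid attention layout:
# 5 of every 6 layers use Sliding Window Attention (SWA, 128-token window),
# 1 of every 6 uses Global Attention (GA, full sequence).
# Per-layer KV storage is proportional to the number of cached tokens.

WINDOW = 128          # SWA window size
SWA_PER_GROUP = 5     # SWA layers per 6-layer group (5:1 ratio)
GA_PER_GROUP = 1      # GA layers per group

def kv_cache_reduction(seq_len: int) -> float:
    """Ratio of full-attention KV cache to hybrid KV cache for one layer group."""
    full = (SWA_PER_GROUP + GA_PER_GROUP) * seq_len
    hybrid = SWA_PER_GROUP * min(seq_len, WINDOW) + GA_PER_GROUP * seq_len
    return full / hybrid

# At the native 32k training length the saving is already close to 6x,
# and it approaches 6x asymptotically as the context grows toward 256k.
print(f"{kv_cache_reduction(32_768):.2f}x")   # ≈ 5.89x
print(f"{kv_cache_reduction(256_000):.2f}x")  # ≈ 5.99x
```

The GA layers dominate the remaining cache cost, which is why the reduction saturates just under 6x rather than growing without bound.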
Installation
MiMo-V2-Flash is currently available in SGLang via Docker image and pip install.
Docker
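A typical SGLang Docker launch looks like the sketch below. The image tag, port, and tensor-parallel degree (`--tp 8`) are illustrative assumptions, not values taken from this page; adjust them to your hardware:

```shell
# Pull the official SGLang image and launch an OpenAI-compatible server.
# NOTE: image tag, port, and --tp degree are illustrative assumptions.
docker run --gpus all --shm-size 32g -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path XiaomiMiMo/MiMo-V2-Flash \
        --tp 8 \
        --host 0.0.0.0 --port 30000
```

Mounting the Hugging Face cache avoids re-downloading the weights on every container restart.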
Pip Installation
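Assuming the standard SGLang pip package, installation and a bare-metal server launch might look like the following sketch (the `--tp 8` degree and port are assumptions for illustration):

```shell
# Install SGLang with all extras (standard install command, assumed),
# then launch the server directly on the host.
pip install "sglang[all]"
python3 -m sglang.launch_server \
    --model-path XiaomiMiMo/MiMo-V2-Flash \
    --tp 8 \
    --host 0.0.0.0 --port 30000
```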
Model Deployment
Use the configuration selector below to automatically generate the appropriate deployment command.
Testing the deployment
Once the server is running, test it with a chat completion request in another terminal:
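SGLang exposes an OpenAI-compatible API, so a minimal smoke test with curl might look like the sketch below; the port (30000) is an assumption matching the launch examples above:

```shell
# Minimal smoke test against the OpenAI-compatible chat endpoint.
# The port (30000) is an illustrative assumption.
curl http://localhost:30000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "XiaomiMiMo/MiMo-V2-Flash",
        "messages": [{"role": "user", "content": "Hello, who are you?"}],
        "max_tokens": 64
    }'
```

A JSON response containing a `choices` array indicates the server is up and generating.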
