1. Model Introduction
Chroma-1.0 is an open-source, end-to-end speech conversation model developed by FlashLabs, focusing on the following core capabilities:
- Real-time Speech Generation: Supports low-latency speech synthesis, suitable for real-time conversational scenarios.
- Customized Voice Cloning: Capable of cloning and replicating specific speaker voice characteristics.
- End-to-End Architecture: Provides a complete processing workflow from speech to speech.
- Speech Reasoning: Equipped with reasoning capabilities to understand and process speech content.
2. Architecture Overview
Chroma-1.0 utilizes a hybrid serving architecture rather than a direct SGLang deployment. This design choice is driven by:
- Complex Model Architecture: The end-to-end speech processing pipeline involves specialized components that go beyond standard text-generation loops.
- KV Cache & State Management: The model requires custom handling of KV caches that differs from standard implementations.
- Batching Limitations: The current implementation supports a batch size of 1, meaning SGLang’s advanced continuous batching capabilities are not yet fully applicable.
The serving stack therefore has two layers:
- Outer Layer: FlashLabs Server (handles audio I/O, state, and model logic)
- Inner Engine: SGLang instance (utilized for specific acceleration where applicable)
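The two-layer split above can be sketched schematically. This is an illustrative sketch only; the class and method names below are assumptions, not the actual FlashLabs or SGLang interfaces:

```python
# Illustrative sketch of the hybrid serving pattern (all names are
# assumptions, not the real FlashLabs/SGLang APIs).

class InnerEngine:
    """Stands in for the SGLang instance: used only where acceleration applies."""

    def generate(self, tokens):
        # Placeholder for SGLang-accelerated decoding.
        return [t + 1 for t in tokens]


class FlashLabsServer:
    """Outer layer: owns audio I/O, per-session state, and model logic."""

    def __init__(self):
        self.engine = InnerEngine()
        self.sessions = {}  # custom state / KV-cache handling lives out here

    def handle_turn(self, session_id, audio_tokens):
        # Batch size is 1: each call processes a single session's turn.
        history = self.sessions.setdefault(session_id, [])
        history.extend(audio_tokens)
        return self.engine.generate(history)


server = FlashLabsServer()
reply = server.handle_turn("s1", [1, 2, 3])
```

The point of the split is that conversation state stays in the outer server, so the inner engine can be swapped or upgraded independently.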
3. Installation & Setup
We recommend following these steps to set up the environment and prepare the model.
Step 1: Get the Docker Image
Pull the official pre-built image from Docker Hub to ensure all dependencies are correctly configured.
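The exact pull command was not preserved in this copy; a typical invocation, assuming a hypothetical image name `flashlabs/chroma:latest` (check Docker Hub for the real name and tag), looks like:

```shell
# Image name and tag are assumptions -- verify the official name on Docker Hub
docker pull flashlabs/chroma:latest

# Start an interactive container with GPU access and the serving port exposed
docker run --gpus all -it --rm -p 8000:8000 flashlabs/chroma:latest
```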
Step 2: Download Model Weights
Download the Chroma-4B weights from Hugging Face. You can choose one of the following methods:
Method 1: Using Python (Recommended)
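A sketch of the Python method using `huggingface_hub`; the repository id `FlashLabs/Chroma-4B` and the target directory are assumptions, so substitute the actual Hugging Face repository:

```python
# Method 1 (Python): fetch the Chroma-4B weights with huggingface_hub.
# The repo_id below is an assumption -- replace it with the real repository.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="FlashLabs/Chroma-4B",  # assumed repository id
    local_dir="./Chroma-4B",        # local directory for the weights
)
```

Equivalently, the Hugging Face CLI can download the same snapshot: `huggingface-cli download FlashLabs/Chroma-4B --local-dir ./Chroma-4B` (the same repo-id caveat applies).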
Step 3: Download the Chroma Code (SGLang version)
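The clone command was stripped from this copy; assuming the code lives in a FlashLabs GitHub repository (the URL below is a placeholder guess, not a confirmed address), the step looks like:

```shell
# Repository URL is an assumption -- substitute the official FlashLabs repo
git clone https://github.com/FlashLabs/Chroma.git
cd Chroma
```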
Step 4: Run the Server
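The launch command itself is missing here; a minimal sketch, assuming a Python entrypoint named `server.py` that accepts `--model-path` and `--port` flags (all assumptions, so consult the repository README for the real invocation):

```shell
# Entrypoint name and flags are assumptions -- see the repository README
python server.py --model-path ./Chroma-4B --port 8000
```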
4. Client Usage Example
Once the server is running, you can interact with it using HTTP requests.
Python Client
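A minimal Python client sketch, assuming a JSON API at `http://localhost:8000` with a `/v1/audio/chat` route and base64-encoded audio fields. The route, port, and payload field names are all assumptions about the server's interface:

```python
# Hypothetical Python client: route, port, and payload fields are assumptions.
import base64
import json
import urllib.request

SERVER = "http://localhost:8000"

def build_request(audio_bytes: bytes, voice: str = "default") -> dict:
    """Package raw audio into a JSON-serializable payload (base64-encoded)."""
    return {
        "audio": base64.b64encode(audio_bytes).decode("ascii"),
        "voice": voice,
    }

def chat(audio_bytes: bytes) -> bytes:
    """Send one conversational turn and return the synthesized reply audio."""
    body = json.dumps(build_request(audio_bytes)).encode()
    req = urllib.request.Request(
        f"{SERVER}/v1/audio/chat",  # assumed route
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return base64.b64decode(json.loads(resp.read())["audio"])
```

Base64-encoding the audio keeps the payload plain JSON, which is a common (though not universal) convention for speech APIs.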
OpenAI SDK Compatible Example
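Since SGLang-backed servers commonly expose an OpenAI-compatible endpoint, here is a hedged sketch using the `openai` package; the `base_url`, `api_key`, and model id are assumptions:

```python
# OpenAI-SDK-compatible sketch: base_url, api_key, and model id are
# assumptions; requires `pip install openai` and a running Chroma server.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="FlashLabs/Chroma-4B",  # assumed model id
    messages=[{"role": "user", "content": "Hello, Chroma!"}],
)
print(response.choices[0].message.content)
```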
CLI (cURL)
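An equivalent cURL sketch against the assumed OpenAI-compatible route (adjust host, port, and payload fields to the actual server API):

```shell
# Route and payload fields are assumptions -- match them to the real API
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "FlashLabs/Chroma-4B",
       "messages": [{"role": "user", "content": "Hello, Chroma!"}]}'
```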
