Query VLM with Offline Engine

This tutorial demonstrates how to use SGLang’s offline Engine API to query VLMs. We will demonstrate usage with Qwen2.5-VL and Llama 4. This section demonstrates three different calling approaches:

Basic Call: Directly pass images and text.
Processor Output: Use HuggingFace processor for data preprocessing.
Precomputed Embeddings: Pre-calculate image features to improve inference efficiency.

Understanding the Three Input Formats

SGLang supports three ways to pass visual data, each optimized for different scenarios:

1. Raw Images - Simplest approach

Pass PIL Images, file paths, URLs, or base64 strings directly
SGLang handles all preprocessing automatically
Best for: Quick prototyping, simple applications

2. Processor Output - For custom preprocessing

Pre-process images with HuggingFace processor
Pass the complete processor output dict with format: "processor_output"
Best for: Custom image transformations, integration with existing pipelines
Requirement: Must use input_ids instead of text prompt

3. Precomputed Embeddings - For maximum performance

Pre-calculate visual embeddings using the vision encoder
Pass embeddings with format: "precomputed_embedding"
Best for: Repeated queries on same images, caching, high-throughput serving
Performance gain: Avoids redundant vision encoder computation (30-50% speedup)

Key Rule: Within a single request, use only one format for all images. Don’t mix formats. The examples below demonstrate all three approaches with both Qwen2.5-VL and Llama 4 models.

Querying Qwen2.5-VL Model

Example

import nest_asyncio

nest_asyncio.apply()

import sglang.test.doc_patch  # noqa: F401

model_path = "Qwen/Qwen2.5-VL-3B-Instruct"
chat_template = "qwen2-vl"
example_image_url = "https://raw.githubusercontent.com/sgl-project/sglang/main/examples/assets/example_image.png"

Example

from io import BytesIO
import requests
from PIL import Image

from sglang.srt.parser.conversation import chat_templates

image = Image.open(BytesIO(requests.get(example_image_url).content))

conv = chat_templates[chat_template].copy()
conv.append_message(conv.roles[0], f"What's shown here: {conv.image_token}?")
conv.append_message(conv.roles[1], "")
conv.image_data = [image]

print("Generated prompt text:")
print(conv.get_prompt())
print(f"\nImage size: {image.size}")
image

Basic Offline Engine API Call

Example

from sglang import Engine

llm = Engine(model_path=model_path, chat_template=chat_template, log_level="warning")

Example

out = llm.generate(prompt=conv.get_prompt(), image_data=[image])
print("Model response:")
print(out["text"])

Call with Processor Output

Using a HuggingFace processor to preprocess text and images, and passing the processor_output directly into Engine.generate.

Example

from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained(model_path, use_fast=True)
processor_output = processor(
    images=[image], text=conv.get_prompt(), return_tensors="pt"
)

out = llm.generate(
    input_ids=processor_output["input_ids"][0].detach().cpu().tolist(),
    image_data=[dict(processor_output, format="processor_output")],
)
print("Response using processor output:")
print(out["text"])

Call with Precomputed Embeddings

You can pre-calculate image features to avoid repeated visual encoding processes.

Example

from transformers import AutoProcessor
from transformers import Qwen2_5_VLForConditionalGeneration

processor = AutoProcessor.from_pretrained(model_path, use_fast=True)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(model_path).eval()
vision = model.model.visual.cuda()

Example

processor_output = processor(
    images=[image], text=conv.get_prompt(), return_tensors="pt"
)

input_ids = processor_output["input_ids"][0].detach().cpu().tolist()

precomputed_embeddings = vision(
    processor_output["pixel_values"].cuda(), processor_output["image_grid_thw"].cuda()
)
precomputed_embeddings = precomputed_embeddings.pooler_output

multi_modal_item = dict(
    processor_output,
    format="precomputed_embedding",
    feature=precomputed_embeddings,
)

out = llm.generate(input_ids=input_ids, image_data=[multi_modal_item])
print("Response using precomputed embeddings:")
print(out["text"])

llm.shutdown()

Querying Llama 4 Vision Model

Example

model_path = "meta-llama/Llama-4-Scout-17B-16E-Instruct"
chat_template = "llama-4"

from io import BytesIO
import requests
from PIL import Image

from sglang.srt.parser.conversation import chat_templates

# Download the same example image
image = Image.open(BytesIO(requests.get(example_image_url).content))

conv = chat_templates[chat_template].copy()
conv.append_message(conv.roles[0], f"What's shown here: {conv.image_token}?")
conv.append_message(conv.roles[1], "")
conv.image_data = [image]

print("Llama 4 generated prompt text:")
print(conv.get_prompt())
print(f"Image size: {image.size}")

image

Llama 4 Basic Call

Llama 4 requires more computational resources, so it’s configured with multi-GPU parallelism (tp_size=4) and larger context length.

Example

llm = Engine(
    model_path=model_path,
    enable_multimodal=True,
    attention_backend="fa3",
    tp_size=4,
    context_length=65536,
)

out = llm.generate(prompt=conv.get_prompt(), image_data=[image])
print("Llama 4 response:")
print(out["text"])

Call with Processor Output

Using HuggingFace processor to preprocess data can reduce computational overhead during inference.

Example

from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained(model_path, use_fast=True)
processor_output = processor(
    images=[image], text=conv.get_prompt(), return_tensors="pt"
)

out = llm.generate(
    input_ids=processor_output["input_ids"][0].detach().cpu().tolist(),
    image_data=[dict(processor_output, format="processor_output")],
)
print("Response using processor output:")
print(out)

Call with Precomputed Embeddings

Example

from transformers import AutoProcessor
from transformers import Llama4ForConditionalGeneration

processor = AutoProcessor.from_pretrained(model_path, use_fast=True)
model = Llama4ForConditionalGeneration.from_pretrained(
    model_path, torch_dtype="auto"
).eval()

vision = model.vision_model.cuda()
multi_modal_projector = model.multi_modal_projector.cuda()

print(f'Image pixel values shape: {processor_output["pixel_values"].shape}')
input_ids = processor_output["input_ids"][0].detach().cpu().tolist()

# Process image through vision encoder
image_outputs = vision(
    processor_output["pixel_values"].to("cuda"),
    aspect_ratio_ids=processor_output["aspect_ratio_ids"].to("cuda"),
    aspect_ratio_mask=processor_output["aspect_ratio_mask"].to("cuda"),
    output_hidden_states=False
)
image_features = image_outputs.last_hidden_state

# Flatten image features and pass through multimodal projector
vision_flat = image_features.view(-1, image_features.size(-1))
precomputed_embeddings = multi_modal_projector(vision_flat)

# Build precomputed embedding data item
mm_item = dict(
    processor_output,
    format="precomputed_embedding",
    feature=precomputed_embeddings
)

# Use precomputed embeddings for efficient inference
out = llm.generate(input_ids=input_ids, image_data=[mm_item])
print("Llama 4 precomputed embedding response:")
print(out["text"])

Basic Usage

Advanced Features

Supported Models

Developer Guide

References

Understanding the Three Input Formats

1. Raw Images - Simplest approach

2. Processor Output - For custom preprocessing

3. Precomputed Embeddings - For maximum performance

Querying Qwen2.5-VL Model

Basic Offline Engine API Call

Call with Processor Output

Call with Precomputed Embeddings

Querying Llama 4 Vision Model

Llama 4 Basic Call

Call with Processor Output

Call with Precomputed Embeddings

​Understanding the Three Input Formats

​1. Raw Images - Simplest approach

​2. Processor Output - For custom preprocessing

​3. Precomputed Embeddings - For maximum performance

​Querying Qwen2.5-VL Model

​Basic Offline Engine API Call

​Call with Processor Output

​Call with Precomputed Embeddings

​Querying Llama 4 Vision Model

​Llama 4 Basic Call

​Call with Processor Output

​Call with Precomputed Embeddings

Understanding the Three Input Formats

1. Raw Images - Simplest approach

2. Processor Output - For custom preprocessing

3. Precomputed Embeddings - For maximum performance

Querying Qwen2.5-VL Model

Basic Offline Engine API Call

Call with Processor Output

Call with Precomputed Embeddings

Querying Llama 4 Vision Model

Llama 4 Basic Call

Call with Processor Output

Call with Precomputed Embeddings