🚀[0.3.36] Release Note: Gemma-4 Omni-Multimodal and ToolCall Improved, Qwen3.6 / Step3-VL Support, Compilation workflow optimization #112
JamePeng announced in Announcements
Quick question: can the "thinking" function now be disabled on the smaller Gemma models as well?
Release v0.3.36: Gemma-4 Omni-Multimodal, Qwen 3.6 Support, and Core Optimizations
Hi everyone, I am excited to bring you version 0.3.36 of `llama-cpp-python`. This release is packed with massive upgrades to multimodal handling, state-of-the-art model support, strict API alignments, and crucial CI pipeline fixes. Here is a breakdown of what I've included in this update:
🌟 The Highlight: Gemma-4 True Omni-Multimodal Integration
I want to put a massive spotlight on Gemma-4 in this release. While many models handle basic images, Gemma-4 (specifically the E2B and E4B variants) brings true Omni-Multimodal capabilities to the table—meaning it can natively process Vision, Audio, and Text simultaneously.
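To make the mixed-media message shape concrete, here is a minimal sketch of assembling an OpenAI-compatible user message that combines text, an image, and audio. The helper names below are illustrative only, not the library's actual API; the payload structure follows the standard OpenAI `image_url` / `input_audio` content-part format.

```python
import base64
import mimetypes

def to_data_uri(path: str) -> str:
    """Base64-encode a local file into a data: URI."""
    mime = mimetypes.guess_type(path)[0] or "application/octet-stream"
    with open(path, "rb") as f:
        return f"data:{mime};base64," + base64.b64encode(f.read()).decode()

def build_media_message(text: str, image_path: str = None, audio_path: str = None) -> dict:
    """Build one user message mixing text, image_url, and input_audio parts
    (hypothetical helper; mirrors the OpenAI content-part layout)."""
    parts = [{"type": "text", "text": text}]
    if image_path:
        parts.append({
            "type": "image_url",
            "image_url": {"url": to_data_uri(image_path)},
        })
    if audio_path:
        with open(audio_path, "rb") as f:
            parts.append({
                "type": "input_audio",
                "input_audio": {
                    "data": base64.b64encode(f.read()).decode(),
                    "format": audio_path.rsplit(".", 1)[-1],
                },
            })
    return {"role": "user", "content": parts}
```

A message built this way can be passed in the `messages` list of a chat-completion call, letting a single prompt carry both an image and an audio clip alongside the text.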
To support this, I have completely revamped the `Gemma4ChatHandler`:

- A rebuilt `build_media_payload` function dynamically routes and encodes local files into OpenAI-compatible `image_url` and `input_audio` structures. You can now mix audio and images in a single prompt seamlessly.
- Audio processing requires the `BF16` quantized `mmproj` file. Other quantizations are currently known to severely degrade audio signal quality.
- Synced the `google/gemma-4-31B-it` chat template from HuggingFace, bringing in the new `format_tool_response_block` and OpenAI-compatible forward-scan tool resolution.

🧠 Qwen 3.6 & Enhanced "Thinking" Management
I have upgraded the existing `Qwen35ChatHandler` to fully support the newly released Qwen 3.6. It also gains a `preserve_thinking` parameter: by default it is set to `False` to save your context window tokens, but you can enable it to retain the `<think>` reasoning blocks across all historical conversational turns.

👁️ Step3-VL Support
For those needing a standard, straightforward vision-language model, I have implemented the `Step3VLChatHandler`. This provides plug-and-play support for the `Step3-VL-10B` model to handle your standard image-based tasks.

⚙️ OpenAPI Spec Alignment & Engine Sync
I've rigorously updated the `llama_types` to align with the absolute latest OpenAI API specifications:

- Added `PromptTokensDetails` and `CompletionTokensDetails` to the `usage` reporting block, tracking granular token usage (perfect for tracking reasoning/cached tokens).
- Allowed `None` for content and introduced the new `refusal` field.
- Added support for the `allowed_tools` and `custom` tool behaviors.

Additionally, I have synced the underlying C++ backend with the latest `llama.cpp` upstream (up to commit `9db77a0`), which includes the highly anticipated `mtmd` API bindings.

🚀 CI / Compilation Workflow Optimizations
Finally, I spent some time fixing the GitHub Actions build pipeline.
Recently, using the `all` option for `cudaarch` on CUDA 12.4-12.6 was causing the compilation process to exceed the 6-hour maximum limit, resulting in cancelled CI jobs. To resolve this, I restricted the target architectures to explicitly support compute capabilities 7.0 through 9.0 (`70-real` to `90-real`). This ensures extremely fast build times while maintaining full support for all modern NVIDIA GPUs equipped with Tensor Cores (from Volta up to Hopper).

I also bumped all the GitHub Action runners to their latest versions for better security and speed.
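If you build from source and want to mirror the same trade-off locally, you can pin the CUDA architectures yourself. This is a sketch assuming the standard CMake-based source build; `GGML_CUDA` and `CMAKE_CUDA_ARCHITECTURES` are the usual llama.cpp / CMake options, and the exact architecture list should match your GPUs.

```shell
# Restrict the build to compute capabilities 7.0-9.0 (Volta through Hopper),
# rather than compiling for every architecture with "all".
CMAKE_ARGS="-DGGML_CUDA=on -DCMAKE_CUDA_ARCHITECTURES=70-real;80-real;86-real;89-real;90-real" \
  pip install --no-cache-dir --force-reinstall llama-cpp-python
```

Dropping unused architectures from the list is what keeps compile times short; the `-real` suffix emits only native code for each target instead of additional PTX.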
For more information, see the full changelog: e1ade17...7820677
Enjoy the new features, and happy coding!
— JamePeng