🚀[0.3.36] Release Note: Gemma-4 Omni-Multimodal and ToolCall Improved, Qwen3.6 / Step3-VL Support, Compilation workflow optimization #112
JamePeng announced in Announcements
Quick question: can the "thinking" function now be disabled on the smaller Gemma models as well?
Release v0.3.36: Gemma-4 Omni-Multimodal, Qwen 3.6 Support, and Core Optimizations
Hi everyone, I am excited to bring you version 0.3.36 of `llama-cpp-python`. This release is packed with massive upgrades to multimodal handling, state-of-the-art model support, strict API alignments, and crucial CI pipeline fixes. Here is a breakdown of what I've included in this update:
🌟 The Highlight: Gemma-4 True Omni-Multimodal Integration
I want to put a massive spotlight on Gemma-4 in this release. While many models handle basic images, Gemma-4 (specifically the E2B and E4B variants) brings true Omni-Multimodal capabilities to the table—meaning it can natively process Vision, Audio, and Text simultaneously.
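To make the mixed-media message shape concrete, here is a minimal sketch of assembling an OpenAI-compatible user message that combines text, an image, and audio. The helper names below are illustrative only, not the library's actual API; the payload structure follows the standard OpenAI `image_url` / `input_audio` content-part format.

```python
import base64
import mimetypes

def to_data_uri(path: str) -> str:
    """Base64-encode a local file into a data: URI."""
    mime = mimetypes.guess_type(path)[0] or "application/octet-stream"
    with open(path, "rb") as f:
        return f"data:{mime};base64," + base64.b64encode(f.read()).decode()

def build_media_message(text: str, image_path: str = None, audio_path: str = None) -> dict:
    """Build one user message mixing text, image_url, and input_audio parts
    (hypothetical helper; mirrors the OpenAI content-part layout)."""
    parts = [{"type": "text", "text": text}]
    if image_path:
        parts.append({
            "type": "image_url",
            "image_url": {"url": to_data_uri(image_path)},
        })
    if audio_path:
        with open(audio_path, "rb") as f:
            parts.append({
                "type": "input_audio",
                "input_audio": {
                    "data": base64.b64encode(f.read()).decode(),
                    "format": audio_path.rsplit(".", 1)[-1],
                },
            })
    return {"role": "user", "content": parts}
```

A message built this way can be passed in the `messages` list of a chat-completion call, letting a single prompt carry both an image and an audio clip alongside the text.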
To support this, I have completely revamped the `Gemma4ChatHandler`:

- A rebuilt `build_media_payload` function dynamically routes and encodes local files into OpenAI-compatible `image_url` and `input_audio` structures. You can now mix audio and images in a single prompt seamlessly.
- Audio processing requires the `BF16` quantized `mmproj` file. Other quantizations are currently known to severely degrade audio signal quality.
- Synced the `google/gemma-4-31B-it` chat template from HuggingFace, bringing in the new `format_tool_response_block` and OpenAI-compatible forward-scan tool resolution.

🧠 Qwen 3.6 & Enhanced "Thinking" Management
I have upgraded the existing `Qwen35ChatHandler` to fully support the newly released Qwen 3.6. It also gains a `preserve_thinking` parameter: by default it is set to `False` to save your context window tokens, but you can enable it to retain the `<think>` reasoning blocks across all historical conversational turns.

👁️ Step3-VL Support
For those needing a standard, straightforward vision-language model, I have implemented the `Step3VLChatHandler`. This provides plug-and-play support for the `Step3-VL-10B` model to handle your standard image-based tasks.

⚙️ OpenAPI Spec Alignment & Engine Sync
I've rigorously updated the `llama_types` to align with the absolute latest OpenAI API specifications:

- Added `PromptTokensDetails` and `CompletionTokensDetails` to the `usage` reporting block, tracking granular token usage (perfect for tracking reasoning/cached tokens).
- Allowed `None` for content and introduced the new `refusal` field.
- Added support for the `allowed_tools` and `custom` tool behaviors.

Additionally, I have synced the underlying C++ backend with the latest `llama.cpp` upstream (up to commit `9db77a0`), which includes the highly anticipated `mtmd` API bindings.

🚀 CI / Compilation Workflow Optimizations
Finally, I spent some time fixing the GitHub Actions build pipeline.
Recently, using the `all` option for `cudaarch` on CUDA 12.4-12.6 was causing the compilation process to exceed the 6-hour maximum limit, resulting in cancelled CI jobs. To resolve this, I restricted the target architectures to explicitly support compute capabilities 7.0 through 9.0 (`70-real` to `90-real`). This ensures extremely fast build times while maintaining full support for all modern NVIDIA GPUs equipped with Tensor Cores (from Volta up to Hopper).

I also bumped all the GitHub Action runners to their latest versions for better security and speed.
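If you build from source and want to mirror the same trade-off locally, you can pin the CUDA architectures yourself. This is a sketch assuming the standard CMake-based source build; `GGML_CUDA` and `CMAKE_CUDA_ARCHITECTURES` are the usual llama.cpp / CMake options, and the exact architecture list should match your GPUs.

```shell
# Restrict the build to compute capabilities 7.0-9.0 (Volta through Hopper),
# rather than compiling for every architecture with "all".
CMAKE_ARGS="-DGGML_CUDA=on -DCMAKE_CUDA_ARCHITECTURES=70-real;80-real;86-real;89-real;90-real" \
  pip install --no-cache-dir --force-reinstall llama-cpp-python
```

Dropping unused architectures from the list is what keeps compile times short; the `-real` suffix emits only native code for each target instead of additional PTX.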
For more information, see the full changelog: e1ade17...7820677
Enjoy the new features, and happy coding!
— JamePeng