Arm backend: add VGF PT2E linear quantization modes for LLM export #19029

xingguo01 wants to merge 2 commits into pytorch:main
Conversation
- add vgf_16a8w PT2E quantization mode
- add backend.vgf.quantize_scope for full vs linear VGF quantization
- wire the VGF config through the LLM export and quantizer selection path
- add coverage in export_llama_lib tests for the new VGF PT2E modes

Signed-off-by: Xingguo Li <xingguo.li@arm.com>
Change-Id: Ie8fe849b4856321308d6d526248a7a4760ddc573
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19029

CI status as of commit 93c91b6 (merge base ccaf17e): ❌ 14 new failures, 4 cancelled jobs, 2 unrelated failures. The broken-trunk failures were also present on the merge base; rebasing onto the `viable/strict` branch avoids them.
Pull request overview
Adds Arm VGF backend PT2E quantization support for LLM export, including a new 16a8w mode gated on INT16 TOSA extension support and a configurable quantization scope (full-model vs Linear-only), plus test coverage for the new behavior.
Changes:
- Add the vgf_16a8w PT2E quantization mode and enforce the INT16 compile spec extension when it is selected.
- Introduce backend.vgf.quantize_scope (full vs linear) and apply it when constructing the VGF quantizer.
- Wire the new VGF settings through the llama export CLI/config and add unit tests for scope selection and INT16 gating.
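A minimal sketch of the mode/scope selection logic described above (the function, dict, and extension names here are hypothetical illustrations, not the actual ExecuTorch API; the real logic lives in quantizer_lib.py):

```python
# Hypothetical sketch of VGF PT2E mode/scope resolution. Mode and scope
# names come from the PR description; all helper names are assumptions.

VGF_MODES = {
    "vgf_8a8w": {"act_bits": 8, "weight_bits": 8, "needs_int16_ext": False},
    "vgf_16a8w": {"act_bits": 16, "weight_bits": 8, "needs_int16_ext": True},
}


def select_vgf_config(mode: str, quantize_scope: str, compile_spec_exts: set) -> dict:
    """Resolve a quantization config and validate compile-spec requirements."""
    if mode not in VGF_MODES:
        raise ValueError(f"unknown VGF PT2E mode: {mode}")
    cfg = dict(VGF_MODES[mode])
    if cfg["needs_int16_ext"] and "int16" not in compile_spec_exts:
        # vgf_16a8w is gated on the INT16 TOSA extension being enabled.
        raise ValueError("vgf_16a8w requires the INT16 extension in the compile spec")
    if quantize_scope not in ("full", "linear"):
        raise ValueError(f"unknown quantize_scope: {quantize_scope}")
    # "linear" restricts quantization annotation to Linear modules;
    # "full" applies the quantizer globally.
    cfg["module_filter"] = None if quantize_scope == "full" else "Linear"
    return cfg
```

For example, `select_vgf_config("vgf_16a8w", "linear", {"int16"})` resolves to 16-bit activations with annotation restricted to Linear modules, while omitting the INT16 extension raises an error.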
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| extension/llm/export/quantizer_lib.py | Extends VGF quantizer selection for vgf_16a8w and adds scope-based application (global vs Linear-only). |
| extension/llm/export/config/llm_config.py | Adds the vgf_16a8w enum value and introduces VgfQuantizeScope plus config wiring from CLI args. |
| examples/models/llama/export_llama_lib.py | Exposes VGF PT2E modes and the VGF scope/compile-spec CLI flags; passes the scope into VGF quantizer creation. |
| examples/models/llama/tests/test_export_llama_lib.py | Adds coverage for the VGF linear-only scope and INT16 compile spec enforcement for vgf_16a8w. |
```python
        "vgf_8a8w",
        "vgf_16a8w",
    ],
    help="Use PT2E quantization. Comma separated options. e.g. xnnpack_dynamic (for per channel 8 bit weight), xnnpack_dynamic_qc4 (for per channel 4 bit weight), embedding.",
```
The --pt2e_quantize argparse option is defined with a fixed set of choices, so it only accepts a single value, but the help text says it supports "Comma separated options" (and even mentions embedding, which is not a valid choice). This is user-facing and likely to confuse; either update the help text to reflect single-choice behavior, or switch the argument parsing to accept a comma-separated list (and adjust LlmConfig/Pt2eQuantize parsing accordingly).
Suggested change:

```diff
-    help="Use PT2E quantization. Comma separated options. e.g. xnnpack_dynamic (for per channel 8 bit weight), xnnpack_dynamic_qc4 (for per channel 4 bit weight), embedding.",
+    help="Use a single PT2E quantization mode, e.g. xnnpack_dynamic (per-channel 8-bit weight) or xnnpack_dynamic_qc4 (per-channel 4-bit weight).",
```
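To illustrate the mismatch the comment describes, here is a sketch of the two behaviors side by side: a `choices`-constrained flag accepts exactly one value, while a custom `type` function can accept the comma-separated form the help text promises. The flag names and choice list below are simplified illustrations, not the actual parser in export_llama_lib.py:

```python
import argparse

# Simplified choice list; the real parser defines more modes.
PT2E_CHOICES = ["xnnpack_dynamic", "xnnpack_dynamic_qc4", "vgf_8a8w", "vgf_16a8w"]


def parse_pt2e_list(value: str) -> list:
    """Split a comma-separated string and validate each mode against the choices."""
    modes = [m.strip() for m in value.split(",") if m.strip()]
    bad = [m for m in modes if m not in PT2E_CHOICES]
    if bad:
        raise argparse.ArgumentTypeError(f"invalid pt2e modes: {bad}")
    return modes


parser = argparse.ArgumentParser()
# Option A: single choice, matching the current argparse definition.
parser.add_argument("--pt2e_quantize", choices=PT2E_CHOICES)
# Option B (hypothetical): comma-separated list, matching the current help text.
parser.add_argument("--pt2e_quantize_list", type=parse_pt2e_list)
```

With option A, `--pt2e_quantize vgf_8a8w,vgf_16a8w` is rejected outright because the combined string is not in `choices`; with option B it parses to `["vgf_8a8w", "vgf_16a8w"]`. Whichever direction the PR takes, the help text and the parsing behavior should agree.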
cc @digantdesai @freddan80 @per @zingo @oscarandersson8218 @mansnils @Sebastian-Larsson @robell