Arm backend: add VGF PT2E linear quantization modes for LLM export #19029

xingguo01 wants to merge 2 commits into pytorch:main
Conversation
- add vgf_16a8w PT2E quantization mode
- add backend.vgf.quantize_scope for full vs linear VGF quantization
- wire the VGF config through the LLM export and quantizer selection path
- add coverage in export_llama_lib tests for the new VGF PT2E modes

Signed-off-by: Xingguo Li <xingguo.li@arm.com>
Change-Id: Ie8fe849b4856321308d6d526248a7a4760ddc573
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19029

CI status as of commit 93c91b6 (merge base ccaf17e): ❌ 14 new failures, 4 cancelled jobs, 2 unrelated failures. The broken-trunk failures were also present on the merge base; rebasing onto the `viable/strict` branch avoids them.
Pull request overview
Adds Arm VGF backend PT2E quantization support for LLM export, including a new 16a8w mode gated on INT16 TOSA extension support and a configurable quantization scope (full-model vs Linear-only), plus test coverage for the new behavior.
Changes:
- Add the vgf_16a8w PT2E quantization mode and enforce the INT16 compile spec extension when it is selected.
- Introduce backend.vgf.quantize_scope (full vs linear) and apply it when constructing the VGF quantizer.
- Wire the new VGF settings through the llama export CLI/config and add unit tests for scope selection and INT16 gating.
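A minimal sketch of the mode/scope selection logic described above (the function, dict, and extension names here are hypothetical illustrations, not the actual ExecuTorch API; the real logic lives in quantizer_lib.py):

```python
# Hypothetical sketch of VGF PT2E mode/scope resolution. Mode and scope
# names come from the PR description; all helper names are assumptions.

VGF_MODES = {
    "vgf_8a8w": {"act_bits": 8, "weight_bits": 8, "needs_int16_ext": False},
    "vgf_16a8w": {"act_bits": 16, "weight_bits": 8, "needs_int16_ext": True},
}


def select_vgf_config(mode: str, quantize_scope: str, compile_spec_exts: set) -> dict:
    """Resolve a quantization config and validate compile-spec requirements."""
    if mode not in VGF_MODES:
        raise ValueError(f"unknown VGF PT2E mode: {mode}")
    cfg = dict(VGF_MODES[mode])
    if cfg["needs_int16_ext"] and "int16" not in compile_spec_exts:
        # vgf_16a8w is gated on the INT16 TOSA extension being enabled.
        raise ValueError("vgf_16a8w requires the INT16 extension in the compile spec")
    if quantize_scope not in ("full", "linear"):
        raise ValueError(f"unknown quantize_scope: {quantize_scope}")
    # "linear" restricts quantization annotation to Linear modules;
    # "full" applies the quantizer globally.
    cfg["module_filter"] = None if quantize_scope == "full" else "Linear"
    return cfg
```

For example, `select_vgf_config("vgf_16a8w", "linear", {"int16"})` resolves to 16-bit activations with annotation restricted to Linear modules, while omitting the INT16 extension raises an error.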
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| extension/llm/export/quantizer_lib.py | Extends VGF quantizer selection for vgf_16a8w and adds scope-based application (global vs Linear-only). |
| extension/llm/export/config/llm_config.py | Adds the vgf_16a8w enum value and introduces VgfQuantizeScope plus config wiring from CLI args. |
| examples/models/llama/export_llama_lib.py | Exposes VGF PT2E modes and the VGF scope/compile-spec CLI flags; passes the scope into VGF quantizer creation. |
| examples/models/llama/tests/test_export_llama_lib.py | Adds coverage for the VGF linear-only scope and INT16 compile spec enforcement for vgf_16a8w. |
```python
        "vgf_8a8w",
        "vgf_16a8w",
    ],
    help="Use PT2E quantization. Comma separated options. e.g. xnnpack_dynamic (for per channel 8 bit weight), xnnpack_dynamic_qc4 (for per channel 4 bit weight), embedding.",
```
The --pt2e_quantize argparse option is defined with a fixed set of choices, so it only accepts a single value, but the help text says it supports "Comma separated options" (and even mentions embedding, which is not a valid choice). This is user-facing and likely to confuse; either update the help text to reflect single-choice behavior, or switch the argument parsing to accept a comma-separated list (and adjust LlmConfig/Pt2eQuantize parsing accordingly).
Suggested change:

```diff
-    help="Use PT2E quantization. Comma separated options. e.g. xnnpack_dynamic (for per channel 8 bit weight), xnnpack_dynamic_qc4 (for per channel 4 bit weight), embedding.",
+    help="Use a single PT2E quantization mode, e.g. xnnpack_dynamic (per-channel 8-bit weight) or xnnpack_dynamic_qc4 (per-channel 4-bit weight).",
```
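To illustrate the mismatch the comment describes, here is a sketch of the two behaviors side by side: a `choices`-constrained flag accepts exactly one value, while a custom `type` function can accept the comma-separated form the help text promises. The flag names and choice list below are simplified illustrations, not the actual parser in export_llama_lib.py:

```python
import argparse

# Simplified choice list; the real parser defines more modes.
PT2E_CHOICES = ["xnnpack_dynamic", "xnnpack_dynamic_qc4", "vgf_8a8w", "vgf_16a8w"]


def parse_pt2e_list(value: str) -> list:
    """Split a comma-separated string and validate each mode against the choices."""
    modes = [m.strip() for m in value.split(",") if m.strip()]
    bad = [m for m in modes if m not in PT2E_CHOICES]
    if bad:
        raise argparse.ArgumentTypeError(f"invalid pt2e modes: {bad}")
    return modes


parser = argparse.ArgumentParser()
# Option A: single choice, matching the current argparse definition.
parser.add_argument("--pt2e_quantize", choices=PT2E_CHOICES)
# Option B (hypothetical): comma-separated list, matching the current help text.
parser.add_argument("--pt2e_quantize_list", type=parse_pt2e_list)
```

With option A, `--pt2e_quantize vgf_8a8w,vgf_16a8w` is rejected outright because the combined string is not in `choices`; with option B it parses to `["vgf_8a8w", "vgf_16a8w"]`. Whichever direction the PR takes, the help text and the parsing behavior should agree.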
cc @digantdesai @freddan80 @per @zingo @oscarandersson8218 @mansnils @Sebastian-Larsson @robell