Inline quintic extension #197
Merged
TomWambsgans merged 2 commits into leanEthereum:main on Apr 17, 2026

Conversation
LLVM was not inlining `quintic_mul` despite `#[inline]`: the monomorphized body is large enough that LLVM's cost heuristic declined to inline it. Each call site paid ~5 cycles of function-call overhead. With `quintic_mul` called millions of times per proof, this accumulated to ~2.4% of total runtime. Zen 4 (c7a.2xlarge): -2.38% on `xmss_leaf_1400sigs`, p = 0.0, revert-A/B confirmed.
…sign: Extends the previous commit's inlining pattern to additional multiplication-related functions: `quintic_square`, all platform-specific `quintic_mul_packed` variants (AVX-512/AVX2/NEON/fallback), and `MulAssign<Self>`/`MulAssign<QEF>`. Testing established the I-cache budget boundary for forced inlining on Zen 4: these 9 functions are the optimal set. Inlining more (e.g. Add/Sub/Neg) causes a regression from the expanded code size. Zen 4 (c7a.2xlarge): an additional -1.25% on `xmss_leaf_1400sigs`, p = 0.0, revert-A/B confirmed. Combined with the previous commit: ~-3.6% total.
Force-pushed from 7d9c770 to c02f10e.
perf(quintic-extension): force-inline quintic field arithmetic, ~3.6% faster `xmss_leaf_1400sigs` on Zen 4

Summary
Two stacked `#[inline(always)]` patches on the quintic extension field arithmetic, targeting the compiler's inlining cost model for large generic functions. LLVM's default heuristic was choosing NOT to inline the monomorphized `quintic_mul` (which expands to 5 `dot_product::<5>` calls, ~80 LLVM IR instructions) and related functions, causing function-call overhead on every field multiplication in the sumcheck/GKR/WHIR hot paths.

Net result on AMD EPYC Genoa (c7a.2xlarge, AVX-512 active): ~3.6% faster `xmss_leaf_1400sigs` at 1400 XMSS signatures, reproducible across runs, both changes confirmed by revert-A/B.
Diff shape

All changes are annotation-only (`#[inline]` -> `#[inline(always)]`). No algorithmic, behavioral, or API changes.
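A minimal sketch of the change shape, for readers unfamiliar with the attribute: only the inline hint is strengthened so LLVM must inline the function; the body is untouched. The toy modulus `P` and the binomial reduction `x^5 = W` below are illustrative assumptions, not the repo's actual field.

```rust
const P: u64 = 2_147_483_647; // toy prime modulus, NOT the real base field
const W: u64 = 3;             // hypothetical reduction constant (x^5 = W)

#[inline(always)] // was #[inline]; LLVM's cost model declined to inline it
fn quintic_mul(a: &[u64; 5], b: &[u64; 5]) -> [u64; 5] {
    let mut out = [0u64; 5];
    for i in 0..5 {
        for j in 0..5 {
            // degrees >= 5 wrap around and pick up an extra factor of W
            let w = if i + j >= 5 { W } else { 1 };
            let k = (i + j) % 5;
            out[k] = (out[k] + a[i] * b[j] % P * w) % P;
        }
    }
    out
}
```

`#[inline(always)]` bypasses the cost model entirely, which is exactly the intent here: the heuristic's "too big to inline" verdict was wrong for this hot path.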
Changes

(a) `quintic_mul` + packed `Mul` impls (iter 9: -2.38%)

`extension.rs`, `packed_extension.rs`

The generic `quintic_mul` function (5 dot products of 5 packed elements) and the `PackedQuinticExtensionField` `Mul<Self>` + `Mul<QuinticExtensionField>` impls were marked `#[inline]`. When monomorphized for `PackedMontyField31AVX512`, the function body is large enough (~80 IR instructions from the 5 inlined `dot_product::<5>` calls) that LLVM's cost model declined to inline it. Each call site paid ~5 cycles of function-call overhead (push/pop, indirect branch, return). With `quintic_mul` called millions of times per proof (every extension-field multiplication in every sumcheck round, GKR layer, and WHIR commitment), this overhead accumulated to ~2.4% of total runtime.

Changed to `#[inline(always)]` on:

- `quintic_mul` (the generic function in `extension.rs`)
- `Mul<Self> for PackedQuinticExtensionField` (packed x packed)
- `Mul<QuinticExtensionField> for PackedQuinticExtensionField` (packed x scalar)

Measured: -2.38%, p = 0.0, revert-A/B confirmed.
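The "5 dot products of 5 elements" shape can be sketched as follows. This is a toy scalar model of the structure the PR describes, not repo code: the modulus, the reduction constant `W` (`x^5 = W`), and the exact column layout are assumptions.

```rust
const P: u64 = 2_147_483_647; // toy prime, not the real base field
const W: u64 = 3;             // hypothetical reduction constant (x^5 = W)

// Stand-in for the repo's dot_product::<N>; once this inlines 5 times into
// quintic_mul, the combined body reaches the ~80-IR-instruction size that
// made LLVM's cost model refuse to inline quintic_mul itself.
#[inline(always)]
fn dot_product<const N: usize>(a: &[u64; N], b: &[u64; N]) -> u64 {
    let mut acc = 0u64;
    for i in 0..N {
        acc = (acc + a[i] * b[i] % P) % P;
    }
    acc
}

#[inline(always)]
fn quintic_mul(a: &[u64; 5], b: &[u64; 5]) -> [u64; 5] {
    let mut out = [0u64; 5];
    for k in 0..5 {
        // k-th output coefficient = dot(a, rotated column of b); terms that
        // wrapped past degree 4 carry the extra factor W
        let mut col = [0u64; 5];
        for i in 0..5 {
            let j = (5 + k - i) % 5;
            col[i] = if i > k { b[j] * W % P } else { b[j] };
        }
        out[k] = dot_product::<5>(a, &col);
    }
    out
}
```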
(b) `quintic_square` + `quintic_mul_packed` + `MulAssign` (iter 19: -1.25%)

`extension.rs`, `packing.rs`, `packed_extension.rs`

Same pattern applied to additional multiplication-related functions:

- `quintic_square` (used by every `square()` call; has 16 multiplications when monomorphized)
- `quintic_mul_packed` variants (AVX-512, AVX2, NEON, generic fallback; the scalar quintic multiplication path using `dot_product_2`)
- `MulAssign<Self>` and `MulAssign<QuinticExtensionField>` for `PackedQuinticExtensionField` (the `*= eq_val` pattern in `compute_sumcheck_terms`)

Measured: -1.25%, p = 0.0, revert-A/B confirmed.
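The `MulAssign` part of (b) matters because a `*= rhs` that forwards to an un-inlined `mul` reintroduces the call overhead at every hot-loop site. A hedged sketch of the pattern with a hypothetical wrapper type (the real `PackedQuinticExtensionField` and its field types differ):

```rust
use std::ops::{Mul, MulAssign};

const P: u64 = 2_147_483_647; // toy modulus, not the real base field

// Hypothetical stand-in for PackedQuinticExtensionField: 5 coefficients.
#[derive(Clone, Copy, PartialEq, Debug)]
struct Quintic([u64; 5]);

impl Mul for Quintic {
    type Output = Quintic;
    #[inline(always)] // forced inline per change (a)
    fn mul(self, rhs: Quintic) -> Quintic {
        let mut out = [0u64; 5];
        for i in 0..5 {
            for j in 0..5 {
                let w = if i + j >= 5 { 3 } else { 1 }; // toy x^5 = 3
                out[(i + j) % 5] =
                    (out[(i + j) % 5] + self.0[i] * rhs.0[j] % P * w) % P;
            }
        }
        Quintic(out)
    }
}

impl MulAssign for Quintic {
    #[inline(always)] // the `*= eq_val` hot-loop pattern from change (b)
    fn mul_assign(&mut self, rhs: Quintic) {
        *self = *self * rhs;
    }
}
```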
I-cache budget boundary

Extensive testing established a precise I-cache budget for forced inlining: beyond 9 force-inlined functions, I-cache pressure from the expanded code negates the call-overhead savings. The two keeps represent the optimal set.
Validation

- `correctness.sh` (KoalaBear unit tests + full WHIR proof integration test) passes on each change.
- … virtualized.
- `RUSTFLAGS="-C target-cpu=native"`.
- `eval_paired.sh` (builds both binaries with `cargo clean --release` between, asserts distinct md5 hashes, burn-in + paired loop).
- Both keeps confirmed by `eval_revert_ab.sh` (temporary revert reproduces >= 50% of the claimed improvement).
Benchmark results

Baseline after both keeps: 5.17s median on `xmss_leaf_1400sigs`. Pre-optimization baseline: 5.36s +/- 0.3s (calibrated).
Key architectural insight

Scalar `quintic_mul_packed` (AVX-512) packs all 25 products of a 5x5 quintic multiplication into 2 wide SIMD operations via `dot_product_2`, achieving 2 packed base muls per quintic multiplication. The packed `quintic_mul` (operating on 16-wide packed extension values) uses `dot_product::<5>` called 5 times, requiring 15 packed base muls per quintic multiplication. This 7.5x SIMD efficiency gap explains why `eval_eq_basic`'s scalar-then-transpose approach outperforms direct packed computation, and why changes to the eq polynomial structure always regress wall-clock despite reducing instruction count.
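A back-of-envelope consistency check on the stated mul counts (the numbers are taken from the paragraph above, not re-derived from the code):

```rust
/// Packed base muls per quintic multiplication on each path, as stated in
/// the PR text: the packed path issues 5 calls to dot_product::<5> (15
/// packed base muls total), the scalar path 2 wide dot_product_2 ops.
fn simd_gap() -> f64 {
    let packed_path = 15.0; // packed quintic_mul
    let scalar_path = 2.0;  // scalar quintic_mul_packed
    packed_path / scalar_path
}
```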
How to reproduce

Notes for reviewers

- Annotation-only diff (`#[inline]` -> `#[inline(always)]`). Zero behavioral difference. No algorithmic, API, or semantic changes. `#[inline(always)]` affects codegen, not correctness.
- Performance benefit validated on Zen 4 (AMD EPYC Genoa, 32 KB L1I) only; platforms with larger L1I (e.g. Apple M-series, 192 KB) may tolerate more inlining, and platforms with similar L1I (Intel Sapphire Rapids, 32 KB) should see comparable results. No platform will regress correctness.
- The 9-function set is specific to `xmss_leaf_1400sigs` on Zen 4. A different workload or microarchitecture may have a different optimal set.
- … that was below our measurement threshold. It could be included as a low-risk additional win if validated independently on a less noisy setup.
Related: quintic extension property tests (separate PR)

During this optimization work we found that `quintic_extension/` has zero direct unit tests anywhere in the codebase; the only coverage is implicit through the WHIR end-to-end proof test. We wrote 15 algebraic property tests covering:

- Scalar arithmetic (10 tests): commutativity, associativity, distributivity, multiplicative identity, add/sub roundtrip, double negation, square == self·self, inverse roundtrip, zero not invertible, base-field embedding preservation.
- Packed ↔ scalar consistency (5 tests): packed add/sub/mul/base-mul match scalar lane-by-lane, pack-unpack roundtrip.

Each test runs 200 randomized iterations with a seeded RNG. Total runtime < 1 s under `--release`. These tests may be more appropriate for Plonky3 upstream (since `quintic_extension` originates there); happy to submit to whichever repo makes sense. Available on branch `feat/quintic-extension-tests` at https://github.com/Barnadrot/leanMultisig.
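One of the scalar properties (distributivity) can be sketched as below. The toy modulus, the reduction constant `W`, and the crate-free LCG are assumptions for a self-contained example; the real tests use the repo's field types and RNG.

```rust
const P: u64 = 2_147_483_647; // toy prime, not the real base field
const W: u64 = 3;             // hypothetical reduction constant (x^5 = W)

fn add(a: &[u64; 5], b: &[u64; 5]) -> [u64; 5] {
    let mut out = [0u64; 5];
    for i in 0..5 {
        out[i] = (a[i] + b[i]) % P;
    }
    out
}

fn mul(a: &[u64; 5], b: &[u64; 5]) -> [u64; 5] {
    let mut out = [0u64; 5];
    for i in 0..5 {
        for j in 0..5 {
            let w = if i + j >= 5 { W } else { 1 };
            out[(i + j) % 5] = (out[(i + j) % 5] + a[i] * b[j] % P * w) % P;
        }
    }
    out
}

// Minimal seeded LCG so the property test reproduces without extra crates.
fn rand_elem(state: &mut u64) -> [u64; 5] {
    let mut out = [0u64; 5];
    for c in out.iter_mut() {
        *state = state.wrapping_mul(6364136223846793005).wrapping_add(1);
        *c = (*state >> 33) % P;
    }
    out
}

// Distributivity: a * (b + c) == a * b + a * c, 200 seeded iterations.
fn distributivity_holds() -> bool {
    let mut seed = 42u64;
    for _ in 0..200 {
        let a = rand_elem(&mut seed);
        let b = rand_elem(&mut seed);
        let c = rand_elem(&mut seed);
        if mul(&a, &add(&b, &c)) != add(&mul(&a, &b), &mul(&a, &c)) {
            return false;
        }
    }
    true
}
```

Note that ring identities like distributivity hold in the quotient ring whatever `W` is, so this style of test does not depend on picking the field's actual irreducible polynomial.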
Experimentally ruled out (29 iterations total)

Details

Structural changes to `eval_eq_basic` (4 variants, all regress +7-9%):

- (`#[inline(always)]`): -7.4% iai improvement, +8.1% wall-clock regression. I-cache pressure from inflating the recursive function body.
- (`#[inline(never)]`): -5.7% iai, +9.2% wall-clock. A separate function avoids I-cache bloat, but the 4-variable base case has worse ILP than the recursive 3-variable approach (longer dependency chain, less out-of-order overlap).
- … ILP degradation.
Packed `quintic_mul` (15 packed base muls via `dot_product::<5>`) is 7.5x less SIMD-efficient than scalar `quintic_mul_packed` (2 packed base muls via `dot_product_2`).

Conclusion: `eval_eq_basic`'s recursive structure is at a wall-clock local optimum for Zen 4. Any change that reduces instruction count causes a wall-clock regression through ILP/cache/branch-prediction degradation.
Compiler micro-optimizations (all 0% iai delta)

LLVM already handles: CSE of redundant multiplications in `quintic_square`, LICM of loop-invariant broadcasts, constant propagation through match arms (eliminating `assert_eq` in base cases), dead code elimination, and pre-broadcasting of fold factors.
GKR quotient accumulator combining (iter 27: -0.62%)

Combining `single * alpha + double` before the eq_lo multiplication in `compute_gkr_quotient_sumcheck_polynomial_split_eq` reduces 4 accumulators to 2, saving 2 eq_lo multiplications per b_lo block. A real -0.62% (p = 0.0), but below the 1.0% wall-clock-only threshold. Applying the same change to `fold_and_compute_gkr_quotient_split_eq` caused a +10% regression (it disrupts the complex `par_chunks_mut` optimization).

Algorithmic approaches analyzed and rejected
- Port-balance analysis on Zen 4: exactly equal throughput (47.5 cycles).
- … per element but affects only 1 round per GKR layer (~0.03% e2e).
- … at 300 cycles/element; x-side savings are ~0.003% e2e.
- … protocol restructuring.
- … per element; net negative.
- … loop (fractions cleared by cross-multiplication).
- … random points; no sharing possible.