Summary
Port Python's statistics module to Rust — averages, measures of spread, and probability distributions.
In CPython this is a pure Python implementation (Lib/statistics.py) that heavily uses fractions.Fraction internally for exact intermediate arithmetic to avoid cumulative rounding errors.
Public API (21 items)
Central tendency
mean(data) — arithmetic mean (exact via Fraction)
fmean(data, weights=None) — fast float mean (via fsum)
geometric_mean(data) — geometric mean (via log/exp)
harmonic_mean(data, weights=None) — harmonic mean
median(data) — median (average of middle two for even length)
median_low(data) / median_high(data) — low/high median
median_grouped(data, interval=1.0) — grouped data median
mode(data) — single most common value
multimode(data) — all modes
quantiles(data, *, n=4, method='exclusive') — cut points
Spread
variance(data, xbar=None) — sample variance
pvariance(data, mu=None) — population variance
stdev(data, xbar=None) — sample standard deviation
pstdev(data, mu=None) — population standard deviation
Bivariate
covariance(x, y) — sample covariance
correlation(x, y, *, method='linear') — Pearson or Spearman correlation
linear_regression(x, y, *, proportional=False) — OLS regression
Kernel density estimation
kde(data, h, kernel='normal', *, cumulative=False) — returns PDF/CDF callable
kde_random(data, h, kernel='normal', *, seed=None) — returns sampling callable
Distribution
NormalDist(mu=0.0, sigma=1.0) — normal distribution class
- Methods: pdf, cdf, inv_cdf, overlap, zscore, samples, quantiles
- Class method: from_samples(data)
- Arithmetic: +, -, *, / with scalars and other NormalDist
Exception
StatisticsError (subclass of ValueError)
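Rust has no exception hierarchy, so the "subclass of ValueError" relationship has to be expressed some other way. The following is a minimal sketch of one possible error type; the `require_non_empty` helper and the message wording are illustrative assumptions, and how the type maps onto `ValueError` at a Python boundary is left open.

```rust
use std::fmt;

/// Error type standing in for Python's `StatisticsError`.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct StatisticsError {
    message: String,
}

impl StatisticsError {
    pub fn new(message: impl Into<String>) -> Self {
        StatisticsError { message: message.into() }
    }
}

impl fmt::Display for StatisticsError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.write_str(&self.message)
    }
}

impl std::error::Error for StatisticsError {}

/// Hypothetical helper (not part of the Python API): reject empty input
/// before any averaging function runs.
pub fn require_non_empty(data: &[f64], func: &str) -> Result<(), StatisticsError> {
    if data.is_empty() {
        Err(StatisticsError::new(format!(
            "{func} requires at least one data point"
        )))
    } else {
        Ok(())
    }
}
```

Implementing `std::error::Error` lets callers use `?` and `Box<dyn Error>` with this type.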
Key design considerations
Dependency on fractions
CPython's statistics module uses Fraction internally for exact arithmetic in mean, variance, stdev, harmonic_mean, covariance, correlation, and linear_regression. This means the fractions module (#16) should be implemented first, or at minimum concurrently.
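To illustrate why the exact intermediates matter, here is a small sketch that averages ten copies of 0.1 both naively and through a rational accumulator. `Ratio` is a hypothetical, dependency-free stand-in for the Fraction type from #16. It uses `i128` with overflow checks rather than BigInt, so it only handles a narrow range of inputs.

```rust
/// Hypothetical stand-in for the Fraction type from #16 (i128, not BigInt).
#[derive(Clone, Copy, Debug, PartialEq)]
struct Ratio {
    num: i128,
    den: i128, // always positive
}

fn gcd(mut a: i128, mut b: i128) -> i128 {
    while b != 0 {
        let r = a % b;
        a = b;
        b = r;
    }
    a.abs()
}

impl Ratio {
    fn new(num: i128, den: i128) -> Ratio {
        let g = gcd(num, den).max(1);
        Ratio { num: num / g, den: den / g }
    }

    /// Exact conversion: doubling a binary float is lossless, so double until
    /// the value is an integer and record the power of two as the denominator.
    fn from_f64(x: f64) -> Option<Ratio> {
        if !x.is_finite() {
            return None;
        }
        let mut scaled = x;
        let mut shift = 0u32;
        while scaled.fract() != 0.0 {
            if shift == 60 {
                return None; // too small for this i128 sketch
            }
            scaled *= 2.0;
            shift += 1;
        }
        if scaled.abs() >= 1e18 {
            return None; // too large for this i128 sketch
        }
        Some(Ratio::new(scaled as i128, 1i128 << shift))
    }

    fn checked_add(self, other: Ratio) -> Option<Ratio> {
        let left = self.num.checked_mul(other.den)?;
        let right = other.num.checked_mul(self.den)?;
        Some(Ratio::new(left.checked_add(right)?, self.den.checked_mul(other.den)?))
    }
}

/// Mean with exact rational accumulation; rounds once, at the very end.
/// Returns None for empty input or when the i128 range is exceeded.
fn exact_mean(data: &[f64]) -> Option<f64> {
    if data.is_empty() {
        return None;
    }
    let mut total = Ratio::new(0, 1);
    for &x in data {
        total = total.checked_add(Ratio::from_f64(x)?)?;
    }
    let mean = Ratio::new(total.num, total.den.checked_mul(data.len() as i128)?);
    // Not correctly rounded in general; a real port needs a careful conversion.
    Some(mean.num as f64 / mean.den as f64)
}
```

With `[0.1; 10]`, plain left-to-right `f64` addition accumulates rounding error and the naive mean differs from `0.1`, while `exact_mean` returns `0.1`.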
Precision strategy
| Function group | CPython approach | Rust approach |
| --- | --- | --- |
| mean, variance, harmonic_mean | Fraction-exact intermediate arithmetic | Use Fraction<BigInt> from #16 |
| fmean, geometric_mean | Float with fsum/log-exp | Use math::fsum (already in pymath) |
| stdev, pstdev | _float_sqrt_of_frac(n, d) specialized sqrt | Implement equivalent |
| NormalDist.inv_cdf | Wichura's Algorithm AS241 (rational approximations) | Direct port of the piecewise approximation |
Type system
CPython statistics functions are polymorphic over int, float, Fraction, and Decimal. For the Rust port, the initial scope should focus on f64 inputs with exact Fraction-based intermediates where CPython does so, and return f64. Full type polymorphism can be added later via generics.
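One way the "f64 first, generics later" path could look is sketched below. The `Sample` trait and `fmean_generic` are hypothetical names chosen for illustration, not part of any existing crate, and the plain summation here ignores the precision strategy for brevity.

```rust
/// Hypothetical trait naming the numeric inputs a later generic API could accept.
pub trait Sample: Copy {
    fn to_f64(self) -> f64;
}

impl Sample for f64 {
    fn to_f64(self) -> f64 {
        self
    }
}

impl Sample for i64 {
    // Lossy above 2^53; an exact port would route integers through Fraction.
    fn to_f64(self) -> f64 {
        self as f64
    }
}

/// Initial scope: f64 in, f64 out.
pub fn fmean(data: &[f64]) -> Option<f64> {
    if data.is_empty() {
        return None;
    }
    Some(data.iter().sum::<f64>() / data.len() as f64)
}

/// Possible later extension: same function, generic over the input type.
pub fn fmean_generic<T: Sample>(data: &[T]) -> Option<f64> {
    let floats: Vec<f64> = data.iter().map(|&x| x.to_f64()).collect();
    fmean(&floats)
}
```

Keeping the `f64` function as the core and layering the generic wrapper on top means the first release does not have to commit to a trait design.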
Implementation plan
Phase 1: Core averages (depends on #16)
- Internal _sum() helper using Fraction for exact summation
mean, fmean, geometric_mean, harmonic_mean
StatisticsError error type
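A rough sketch of the float-based Phase 1 functions follows. It uses Neumaier compensated summation as a stand-in for `fsum` (which is more accurate), `Result<f64, String>` as a placeholder for the real error type, and omits the `weights` parameters. Rejecting zero in `geometric_mean` is a simplification that may be stricter than CPython.

```rust
/// Neumaier compensated summation, used here as a stand-in for fsum.
fn compensated_sum(data: &[f64]) -> f64 {
    let mut sum = 0.0_f64;
    let mut comp = 0.0_f64; // running compensation for lost low-order bits
    for &x in data {
        let t = sum + x;
        if sum.abs() >= x.abs() {
            comp += (sum - t) + x;
        } else {
            comp += (x - t) + sum;
        }
        sum = t;
    }
    sum + comp
}

pub fn fmean(data: &[f64]) -> Result<f64, String> {
    if data.is_empty() {
        return Err("fmean requires at least one data point".into());
    }
    Ok(compensated_sum(data) / data.len() as f64)
}

pub fn geometric_mean(data: &[f64]) -> Result<f64, String> {
    if data.is_empty() || data.iter().any(|&x| x <= 0.0) {
        return Err("geometric_mean requires non-empty, positive data".into());
    }
    let logs: Vec<f64> = data.iter().map(|x| x.ln()).collect();
    Ok((compensated_sum(&logs) / data.len() as f64).exp())
}

pub fn harmonic_mean(data: &[f64]) -> Result<f64, String> {
    if data.is_empty() {
        return Err("harmonic_mean requires at least one data point".into());
    }
    if data.iter().any(|&x| x < 0.0) {
        return Err("harmonic_mean does not support negative values".into());
    }
    if data.iter().any(|&x| x == 0.0) {
        return Ok(0.0); // any zero forces the harmonic mean to zero
    }
    let reciprocals: Vec<f64> = data.iter().map(|x| 1.0 / x).collect();
    Ok(data.len() as f64 / compensated_sum(&reciprocals))
}
```

For example, `fmean(&[3.5, 4.0, 5.25])` yields `4.25`, and `harmonic_mean(&[40.0, 60.0])` is approximately `48.0`.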
Phase 2: Median & mode
median, median_low, median_high, median_grouped
mode, multimode
quantiles (exclusive and inclusive methods)
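The three basic medians could be sketched as below, assuming `f64` input and `Option` in place of the error type. Sorting with `total_cmp` gives NaN a defined position, which is a choice made for this sketch rather than something CPython specifies.

```rust
/// Return a sorted copy; total_cmp gives a total order even with NaN present.
fn sorted(data: &[f64]) -> Vec<f64> {
    let mut v = data.to_vec();
    v.sort_by(|a, b| a.total_cmp(b));
    v
}

/// Median: middle value, or the average of the two middle values.
pub fn median(data: &[f64]) -> Option<f64> {
    let v = sorted(data);
    let n = v.len();
    if n == 0 {
        return None;
    }
    if n % 2 == 1 {
        Some(v[n / 2])
    } else {
        Some((v[n / 2 - 1] + v[n / 2]) / 2.0)
    }
}

/// Low median: always a member of the data set.
pub fn median_low(data: &[f64]) -> Option<f64> {
    let v = sorted(data);
    let n = v.len();
    if n == 0 {
        return None;
    }
    if n % 2 == 1 {
        Some(v[n / 2])
    } else {
        Some(v[n / 2 - 1])
    }
}

/// High median: always a member of the data set.
pub fn median_high(data: &[f64]) -> Option<f64> {
    let v = sorted(data);
    if v.is_empty() {
        None
    } else {
        Some(v[v.len() / 2])
    }
}
```

For `[1.0, 3.0, 5.0, 7.0]` these return `4.0`, `3.0`, and `5.0` respectively.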
Phase 3: Variance & standard deviation
- Internal _ss() helper (sum of squared deviations via Fraction)
variance, pvariance, stdev, pstdev
_float_sqrt_of_frac() for precision-preserving sqrt
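The shape of the Phase 3 functions can be sketched as follows. This version deliberately substitutes plain `f64` two-pass arithmetic for the Fraction-based `_ss()` and an ordinary `sqrt` for `_float_sqrt_of_frac()`, so it shows the structure and edge cases rather than the precision guarantees. The `xbar`/`mu` parameters are omitted and `mean_and_ss` is a hypothetical helper name.

```rust
/// Two-pass mean and sum of squared deviations (f64 stand-in for _ss()).
fn mean_and_ss(data: &[f64]) -> (f64, f64) {
    let n = data.len() as f64;
    let mean = data.iter().sum::<f64>() / n;
    let ss = data.iter().map(|&x| (x - mean) * (x - mean)).sum::<f64>();
    (mean, ss)
}

/// Sample variance: needs at least two data points.
pub fn variance(data: &[f64]) -> Option<f64> {
    if data.len() < 2 {
        return None;
    }
    let (_, ss) = mean_and_ss(data);
    Some(ss / (data.len() as f64 - 1.0))
}

/// Population variance: needs at least one data point.
pub fn pvariance(data: &[f64]) -> Option<f64> {
    if data.is_empty() {
        return None;
    }
    let (_, ss) = mean_and_ss(data);
    Some(ss / data.len() as f64)
}

pub fn stdev(data: &[f64]) -> Option<f64> {
    variance(data).map(f64::sqrt)
}

pub fn pstdev(data: &[f64]) -> Option<f64> {
    pvariance(data).map(f64::sqrt)
}
```

Note the differing minimum lengths: a single value has a population variance of zero but no sample variance.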
Phase 4: Bivariate statistics
covariance
correlation (linear and ranked methods)
linear_regression (with proportional option)
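As a rough illustration of Phase 4, the sketch below computes the three bivariate statistics from shared centered sums. It again uses plain `f64` in place of Fraction, covers only the linear (Pearson) correlation method, and returns `Option` instead of the real error type. The `Sums` struct and `sums` helper are hypothetical.

```rust
/// Means and centered sums shared by the bivariate functions.
struct Sums {
    n: f64,
    xbar: f64,
    ybar: f64,
    sxx: f64,
    syy: f64,
    sxy: f64,
}

fn sums(x: &[f64], y: &[f64]) -> Option<Sums> {
    if x.len() != y.len() || x.len() < 2 {
        return None;
    }
    let n = x.len() as f64;
    let xbar = x.iter().sum::<f64>() / n;
    let ybar = y.iter().sum::<f64>() / n;
    let (mut sxx, mut syy, mut sxy) = (0.0, 0.0, 0.0);
    for (&xi, &yi) in x.iter().zip(y) {
        let (dx, dy) = (xi - xbar, yi - ybar);
        sxx += dx * dx;
        syy += dy * dy;
        sxy += dx * dy;
    }
    Some(Sums { n, xbar, ybar, sxx, syy, sxy })
}

pub fn covariance(x: &[f64], y: &[f64]) -> Option<f64> {
    sums(x, y).map(|s| s.sxy / (s.n - 1.0))
}

/// Pearson correlation; None if either input is constant.
pub fn correlation(x: &[f64], y: &[f64]) -> Option<f64> {
    let s = sums(x, y)?;
    if s.sxx == 0.0 || s.syy == 0.0 {
        return None;
    }
    Some(s.sxy / (s.sxx * s.syy).sqrt())
}

/// Returns (slope, intercept). With `proportional`, the line is forced
/// through the origin and the sums are taken without centering.
pub fn linear_regression(x: &[f64], y: &[f64], proportional: bool) -> Option<(f64, f64)> {
    if proportional {
        if x.len() != y.len() || x.len() < 2 {
            return None;
        }
        let sxy: f64 = x.iter().zip(y).map(|(&a, &b)| a * b).sum();
        let sxx: f64 = x.iter().map(|&a| a * a).sum();
        if sxx == 0.0 {
            return None;
        }
        return Some((sxy / sxx, 0.0));
    }
    let s = sums(x, y)?;
    if s.sxx == 0.0 {
        return None;
    }
    let slope = s.sxy / s.sxx;
    Some((slope, s.ybar - slope * s.xbar))
}
```

For `x = [1, 2, 3, 4, 5]` and `y = 2x + 1`, the regression recovers slope `2.0` and intercept `1.0`.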
Phase 5: NormalDist
- Constructor, properties (mean, median, mode, stdev, variance)
pdf, cdf (via erf from math::erf)
inv_cdf (Wichura's Algorithm AS241)
overlap, zscore
- Arithmetic operators
from_samples, samples, quantiles
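A partial sketch of `NormalDist` is shown below, covering the constructor, `pdf`, `zscore`, and a few arithmetic operators via `std::ops`. It leaves out `cdf` and `inv_cdf`, since those need `erf` and the AS241 coefficients. Requiring `sigma > 0` at construction is a simplification for this sketch, and `Option` stands in for the error type.

```rust
use std::f64::consts::PI;
use std::ops::{Add, Mul};

#[derive(Debug, Clone, Copy, PartialEq)]
pub struct NormalDist {
    mu: f64,
    sigma: f64,
}

impl NormalDist {
    /// Simplified: this sketch rejects sigma <= 0 (and NaN) up front.
    pub fn new(mu: f64, sigma: f64) -> Option<Self> {
        if sigma > 0.0 {
            Some(NormalDist { mu, sigma })
        } else {
            None
        }
    }

    pub fn mean(&self) -> f64 {
        self.mu
    }

    pub fn stdev(&self) -> f64 {
        self.sigma
    }

    pub fn variance(&self) -> f64 {
        self.sigma * self.sigma
    }

    /// Probability density at x.
    pub fn pdf(&self, x: f64) -> f64 {
        let z = (x - self.mu) / self.sigma;
        (-0.5 * z * z).exp() / (self.sigma * (2.0 * PI).sqrt())
    }

    /// Number of standard deviations x lies above or below the mean.
    pub fn zscore(&self, x: f64) -> f64 {
        (x - self.mu) / self.sigma
    }
}

/// Sum of two independent normal variables: means add, sigmas combine via hypot.
impl Add for NormalDist {
    type Output = NormalDist;
    fn add(self, rhs: NormalDist) -> NormalDist {
        NormalDist { mu: self.mu + rhs.mu, sigma: self.sigma.hypot(rhs.sigma) }
    }
}

/// Adding a constant shifts the mean only.
impl Add<f64> for NormalDist {
    type Output = NormalDist;
    fn add(self, rhs: f64) -> NormalDist {
        NormalDist { mu: self.mu + rhs, sigma: self.sigma }
    }
}

/// Scaling by a constant scales the mean and |constant| scales sigma.
impl Mul<f64> for NormalDist {
    type Output = NormalDist;
    fn mul(self, rhs: f64) -> NormalDist {
        NormalDist { mu: self.mu * rhs, sigma: self.sigma * rhs.abs() }
    }
}
```

Because the struct is `Copy`, expressions such as `a + b` and `a * 2.0` do not consume their operands.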
Phase 6: KDE
kde with all kernel types (normal, logistic, rectangular, triangular, etc.)
kde_random
- Cumulative mode support
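Since `kde` returns a callable, the natural Rust counterpart is a function returning a closure. The sketch below covers only the normal kernel in PDF mode; `kde_normal` is a hypothetical name, and kernel selection and cumulative mode are left out.

```rust
use std::f64::consts::PI;

/// Build a density estimate f(x) = (1 / (n * h)) * sum K((x - x_i) / h)
/// with the standard normal kernel K. Returns None for empty data or h <= 0.
pub fn kde_normal(data: &[f64], h: f64) -> Option<impl Fn(f64) -> f64> {
    if data.is_empty() || !(h > 0.0) {
        return None;
    }
    let points = data.to_vec(); // the closure owns its own copy of the sample
    let scale = 1.0 / (points.len() as f64 * h);
    let norm = (2.0 * PI).sqrt();
    Some(move |x: f64| {
        let total: f64 = points
            .iter()
            .map(|&xi| {
                let t = (x - xi) / h;
                (-0.5 * t * t).exp() / norm
            })
            .sum();
        total * scale
    })
}
```

A caller would write `let f = kde_normal(&sample, 1.5)?;` and then evaluate `f(x)` as often as needed. Each evaluation is linear in the sample size in this sketch.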
Phase 7: Testing
- pyo3 proptest against CPython statistics module
- Edge cases: empty data, single element, identical values, NaN/Inf handling
- Precision verification for Fraction-based functions
Feature flag
[features]
statistics = ["fractions"] # depends on fractions module
Out of scope
Decimal input support (separate concern)
random module dependency for sampling (NormalDist.samples, kde_random)
References
Lib/statistics.py