@axman6 (Contributor) commented Aug 8, 2025

Well, I managed to thoroughly nerdsnipe myself with this. This work is based on my foldl-statistics package, which translated all the functions in Statistics.Sample to use the foldl package's Fold type many years ago. This has a few benefits:

  • Statistics are no longer tied to the Vector package; any Foldable can be used as input. Folds can also have their inputs contravariantly mapped and their outputs covariantly mapped, so projecting from more complex types into Double without producing intermediate vectors is simple:

     data Foo = Foo { bar :: Double, baz :: Int }

     averageBars :: Fold Foo Double
     averageBars = F.premap bar average

     averageBazs :: Fold Foo Double
     averageBazs = F.premap (fromIntegral . baz) average
  • Statistics can be computed in parallel more easily: to compute the mean over several vectors, or slices of a vector, in parallel, kbnSum :: Fold Double KBNSum can be applied to each part and the results combined monoidally before performing a single division.
  • Statistics can be computed incrementally. This seems particularly useful in the dataframes package, where statistics can be computed on each column as it's parsed (this has the added benefit that columns of Doubles could potentially store their mean, which is passed to folds over unfiltered columns, making more statistics "one pass", if we ignore the parsing pass).
  • It makes clear in the types when multiple passes over the data are needed: Folds which need to know the mean take it as an argument.
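The Fold shape behind the points above can be sketched in a few lines. This is a hypothetical minimal re-implementation for illustration only; the real foldl package provides Control.Foldl's Fold and premap with this shape:

```haskell
{-# LANGUAGE ExistentialQuantification #-}

-- Minimal sketch of foldl's Fold: a step function, an initial
-- state, and a final projection, with the state type hidden.
data Fold a b = forall x. Fold (x -> a -> x) x (x -> b)

-- Run a fold over a list (the real library accepts any Foldable).
runFold :: Fold a b -> [a] -> b
runFold (Fold step begin done) = done . foldl step begin

-- Contravariantly map the input, like F.premap.
premap :: (a -> b) -> Fold b r -> Fold a r
premap f (Fold step begin done) = Fold (\x a -> step x (f a)) begin done

-- Naive one-pass mean (the package uses KBN summation instead).
average :: Fold Double Double
average = Fold (\(s, n) x -> (s + x, n + 1)) (0, 0 :: Int)
               (\(s, n) -> s / fromIntegral n)

data Foo = Foo { bar :: Double, baz :: Int }

averageBars :: Fold Foo Double
averageBars = premap bar average

main :: IO ()
main = print (runFold averageBars [Foo 1 10, Foo 3 20, Foo 5 30])  -- 3.0
```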

This PR converts all functions in Statistics.Sample to use the new Statistics.Sample.Fold module. Across the board, things are either about the same speed, slightly faster or slower (this seems to have more to do with how busy my machine is), or significantly faster: correlation and covariance save between 40% and 50%, and I found an optimisation to the custom (^) function which makes all the central moment calculations significantly faster. See benchmarks below.

I've also added the LMVSK type, which computes the length, mean, variance, skewness and kurtosis in a single pass; I'd previously added it to the foldl-statistics package.

I also had to rename statistics-bench's main file because cabal couldn't run the benchmarks. Happy to revert this if needed.

Results from benchmarks:

| Name | master Mean | foldl Mean | % of master | master 2*Stdev | foldl 2*Stdev |
|------|------------:|-----------:|------------:|---------------:|--------------:|
| All.sample.range | 8262200 | 8169891 | 98.88 | 498610 | 452426 |
| All.sample.mean | 26335827 | 24903906 | 94.56 | 1741032 | 2233392 |
| All.sample.meanWeighted | 49147827 | 49967260 | 101.66 | 3310974 | 3697810 |
| All.sample.harmonicMean | 6342874 | 6453201 | 101.73 | 431900 | 411892 |
| All.sample.geometricMean | 47130517 | 47531176 | 100.85 | 3358250 | 3973606 |
| All.sample.variance | 41127624 | 38427368 | 93.43 | 3461374 | 3347004 |
| All.sample.varianceUnbiased | 41193457 | 38872558 | 94.36 | 3349406 | 3730782 |
| All.sample.varianceWeighted | 57257397 | 60096997 | 104.95 | 3761462 | 5424648 |
| All.sample.pearson | 99297216 | 59261547 | 59.68 | 7153452 | 3936702 |
| All.sample.covariance | 71005322 | 39155200 | 55.14 | 5025584 | 3411858 |
| All.sample.correlation | 99429296 | 59264257 | 59.60 | 7793264 | 4439162 |
| All.sample.covariance2 | 70466455 | 64897363 | 92.09 | 5640612 | 4228330 |
| All.sample.correlation2 | 98596923 | 76753466 | 77.84 | 7266128 | 6147440 |
| All.sample.stdDev | 42555541 | 40379150 | 94.88 | 3659268 | 4008108 |
| All.sample.skewness | 112101611 | 48734301 | 43.47 | 9179452 | 4167406 |
| All.sample.kurtosis | 124104443 | 50724829 | 40.87 | 6640072 | 3372990 |
| All.sample.C.M. 2 | 71539257 | 52870776 | 73.90 | 6927224 | 2400496 |
| All.sample.C.M. 3 | 78216333 | 49357958 | 63.10 | 5512322 | 3578362 |
| All.sample.C.M. 4 | 85592968 | 50596264 | 59.11 | 7322466 | 3992384 |
| All.sample.C.M. 5 | 95427246 | 54429187 | 57.03 | 6733316 | 2168810 |
| All.sample.Applicative LMVSK | – | 91811230 | – | – | 683638 |
| All.sample.fastLMVSK | – | 45109643 | – | – | 3712986 |

Master: (benchmark report screenshot)

Foldl: (benchmark report screenshot)

@axman6 axman6 force-pushed the foldl-statistics branch from 1dad1ef to e3b2003 Compare August 9, 2025 06:43
@axman6 axman6 force-pushed the foldl-statistics branch from e3b2003 to ca2a7c3 Compare August 9, 2025 07:25
@axman6 axman6 changed the title Expose sample statistics function as Folds from the foldl package Expose sample statistics functions as Folds from the foldl package. Aug 9, 2025
@axman6 axman6 marked this pull request as ready for review August 9, 2025 08:07
@axman6 (Contributor, Author) commented Aug 9, 2025

I believe this is ready for review. There are still a couple of questions to answer: which version of foldl should be the minimum (I would be surprised if this isn't compatible with all versions; IIRC this only uses Fold and length from the package), and whether the new module and/or its definitions need @since annotations.

I also haven't added LMVSK functions over Vectors, but if everyone is happy with what they're doing, I'll do that too.

@axman6 (Contributor, Author) left a comment

A few notes to hopefully clarify things.

Comment on lines 142 to 161
(^) :: Double -> Int -> Double
!x ^ 1 = x
x ^ 2 = x * x
x ^ 3 = x * x * x
x ^ 4 = (x * x) * (x * x)
x ^ 5 = (x * x) * x * (x * x)
-- x ^ n = x * (x ^ (n-1))
x0 ^ n0 = go (n0-1) x0 where
    go 0 !acc = acc
    go n acc = go (n-1) (acc*x0)
@axman6 (Contributor, Author) commented:

This change is responsible for most of the improvement in the central moment benchmarks. There is a small improvement from the unrolling compared to the recursive loop; even just using

!x ^ 1 = x
-- x ^ n = x * (x ^ (n-1))
x0 ^ n0 = go (n0-1) x0 where
    go 0 !acc = acc
    go n acc = go (n-1) (acc*x0)

is significantly better than the "fast" definition that was used before.

Having the unrolled definitions for small powers does allow the compiler to inline them when the power is known.
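As a quick sanity check (a standalone sketch, not part of the PR), the unrolled definition agrees with the Prelude's (^) for Double bases and small positive exponents:

```haskell
import Prelude hiding ((^))
import qualified Prelude as P

-- Unrolled small powers plus a strict counting loop, mirroring the PR
-- (bang patterns dropped here since laziness doesn't affect the check).
(^) :: Double -> Int -> Double
x  ^ 1 = x
x  ^ 2 = x * x
x  ^ 3 = x * x * x
x  ^ 4 = (x * x) * (x * x)
x  ^ 5 = (x * x) * x * (x * x)
x0 ^ n0 = go (n0 - 1) x0
  where
    go :: Int -> Double -> Double
    go 0 acc = acc
    go n acc = go (n - 1) (acc * x0)

main :: IO ()
main = print (and [x ^ n == x P.^ n | x <- [2, 3, 10], n <- [1 .. 8]])
-- prints True: all these products are exact in Double
```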

A collaborator commented:

BTW, could you submit this as a separate PR? It's a clear improvement and could be merged immediately.

@axman6 (Contributor, Author) replied:

Sure

yielding to boxed returns and heap checks.
-}
@axman6 (Contributor, Author) commented:

Some things never change...

Comment on lines +242 to +254
-- Counts the number of elements, using Word for the count.
doubleLength :: F.Fold a Double
doubleLength = F.Fold (\n _ -> n+1) (0::Word) fromIntegral
@axman6 (Contributor, Author) commented:

Only defined for convenience, not exported.

(\(_,y) -> square (y - muY))
{-# INLINE correlation #-}

-- $lmvsk
@axman6 (Contributor, Author) commented:

The following is ported from foldl-statistics and is based on http://www.johndcook.com/blog/skewness_kurtosis/. The benchmark I've included shows this is roughly twice as fast as computing each of these statistics individually and combining them. It's useful for a quick summary of a dataset, and is about as expensive to compute as the central moments, despite its complexity.
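For reference, the running update from that post can be sketched as below. The type and names here are illustrative only, not the PR's actual LMVSKState API; M2/M3/M4 are running sums of the 2nd/3rd/4th powers of deviations from the running mean:

```haskell
-- One-pass length/mean/variance/skewness/kurtosis state
-- (Welford-style update from John D. Cook's post).
data LMVSK = LMVSK
  { lmCount :: !Int
  , lmMean  :: !Double
  , lmM2    :: !Double  -- sum of squared deviations
  , lmM3    :: !Double
  , lmM4    :: !Double
  }

emptyLMVSK :: LMVSK
emptyLMVSK = LMVSK 0 0 0 0 0

-- Fold one new sample into the state.
stepLMVSK :: LMVSK -> Double -> LMVSK
stepLMVSK (LMVSK n mean m2 m3 m4) x = LMVSK n1 mean' m2' m3' m4'
  where
    n1      = n + 1
    nd      = fromIntegral n1
    delta   = x - mean
    deltaN  = delta / nd
    deltaN2 = deltaN * deltaN
    term1   = delta * deltaN * fromIntegral n
    mean'   = mean + deltaN
    m4'     = m4 + term1 * deltaN2 * (nd*nd - 3*nd + 3)
                 + 6 * deltaN2 * m2 - 4 * deltaN * m3
    m3'     = m3 + term1 * deltaN * (nd - 2) - 3 * deltaN * m2
    m2'     = m2 + term1

-- Unbiased sample variance from the state.
variance :: LMVSK -> Double
variance s = lmM2 s / fromIntegral (lmCount s - 1)

main :: IO ()
main = print (lmMean (foldl stepLMVSK emptyLMVSK [1, 2, 3, 4]))  -- 2.5
```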

@@ -1,5 +1,9 @@
module Main where
@axman6 (Contributor, Author) commented:

On macOS this file conflicts with the ones in bench-time and bench-papi: import Bench could refer to this file or to either of the other two.

A collaborator replied:

Well, I guess I'll need to add macOS CI to catch that.

@Shimuuar (Collaborator) commented Aug 9, 2025

Thank you!

I think the new API for computing summary statistics should solve the problems of the current API. Namely:

  1. Lack of sensible error handling. We don't have any reasonable way of figuring out when we are unable to compute a statistic due to an empty sample. Currently mean returns 0 for an empty sample, in a rather arbitrary way. BTW, your implementation changes that to return NaN. And such error handling should be ergonomic as well: in some applications an absent mean has to be handled (e.g. some long-running app), while in data analysis a user is likely to be happy with an exception.
  2. They could only work with vectors. The Fold-based API solves this.

And regarding parallelism: the Fold API only gives incrementality and doesn't expose the monoidal structure of the accumulator. I tried to implement this approach in monoid-statistics but never quite completed it. The foldl primitives could then be expressed in terms of such monoids.

RE 1: I need to remember which approaches I tried and how well they worked.

@axman6 (Contributor, Author) commented Aug 9, 2025

Thanks for the quick reply @Shimuuar! I hadn't noticed that mean with no inputs returns NaN, but my tests of the current algorithm show the same behaviour, which makes sense given the division by zero in both cases. I don't think that's unreasonable behaviour, and certainly not an unexpected one.

For parallel evaluation, I've made sure to expose folds which return monoidal types (KBNSum, LMVSKState), so that the folds can be easily split if the data is loaded in chunks - see the example in the docs.
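The chunked pattern can be sketched as below; this is an illustration with hypothetical names, using plain Double summation where the PR's folds would use KBN compensated summation:

```haskell
import Data.List (foldl')

-- Partial state for one chunk: running sum and count.
partialMean :: [Double] -> (Double, Int)
partialMean = foldl' (\(s, n) x -> (s + x, n + 1)) (0, 0)

-- The monoidal combine: add sums and add counts.
combine :: (Double, Int) -> (Double, Int) -> (Double, Int)
combine (s1, n1) (s2, n2) = (s1 + s2, n1 + n2)

-- Each partialMean could run in parallel; one division at the end.
meanOfChunks :: [[Double]] -> Double
meanOfChunks chunks = s / fromIntegral n
  where (s, n) = foldr1 combine (map partialMean chunks)

main :: IO ()
main = print (meanOfChunks [[1, 2], [3, 4, 5]])  -- 3.0
```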

Would you be able to approve the workflow again? I can't test that my fix has worked locally because I'm not on linux at the moment, so the PAPI tests won't work.

@axman6 (Contributor, Author) commented Aug 9, 2025

Urgh, I had forgotten how much fun it is supporting old GHCs! Hopefully that's all that's needed; I've tested against GHC 9.2.8 and that seems to be working fine, but I can't test older versions.

@Shimuuar (Collaborator) commented:

> For parallel evaluation, I've made sure to expose folds which return monoidal types

I think that's not quite enough. In order to compute the mean, one needs both the sum and the number of elements. Both are monoids, and their product is a monoid. In fact the fold uses such a product monoid as an accumulator, except it's not exposed.

@axman6 (Contributor, Author) commented Aug 10, 2025

That's true, but you get that by just using

meanState :: Fold Double (KBNSum, Int)
meanState = (,) <$> kbnSum <*> F.length

which is a Monoid. We could expose a data Mean = Mean !KBNSum !Int type, or incorporate monoid-statistics into this:

-- Use explicit forall so you can write stat @Mean
stat :: forall m a.  StatMonoid m a => Fold a m
stat = Fold addValue mempty id

and that could simplify some of the definitions in Statistics.Sample.Fold. Assuming all the definitions in there are appropriately INLINEd, you could nearly write:

mean :: Fold Double Double
mean = calcMean <$> monoidalStat

except for the MonadThrow constraint. I guess you could make mean :: Fold Double (Maybe Double), but I don't actually see a problem with choosing NaN as the value returned for the mean of no samples; it's more appropriate than 0 IMO. Choosing zero for the higher-order central moments when they don't have enough data makes sense: the variance of 1 sample is 0, the skewness of two samples is 0, etc.

Another choice might be

stat :: StatMonoid m a => a -> Fold a m
stat x = Fold addValue (singletonMonoid x) id

and make the user always provide at least one sample, but that still doesn't solve the problem of statistics that require at least two samples, etc.
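For completeness, the Maybe-returning mean mentioned above can be sketched as a plain function; a Fold-based version would carry the same (sum, count) state, and this is illustrative rather than a proposed API:

```haskell
import Data.List (foldl')

-- Total mean: Nothing for an empty input instead of NaN.
safeMean :: [Double] -> Maybe Double
safeMean xs = case foldl' (\(s, n) x -> (s + x, n + 1)) (0, 0 :: Int) xs of
  (_, 0) -> Nothing
  (s, n) -> Just (s / fromIntegral n)

main :: IO ()
main = print (safeMean [1, 2, 3], safeMean [])  -- (Just 2.0,Nothing)
```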
