@axman6 (Contributor) commented Aug 8, 2025

Well, I managed to thoroughly nerdsnipe myself with this. This work is based on my foldl-statistics package, which translated all the functions in Statistics.Sample to use the foldl package's Fold type many years ago. This has a few benefits:

  • Statistics are no longer tied to the Vector package; any Foldable can be used as input. Folds can also have their inputs contravariantly mapped and their outputs covariantly mapped, so projecting from more complex types into Double without producing intermediate vectors is simple:

     data Foo = Foo { bar :: Double, baz :: Int }

     averageBars :: Fold Foo Double
     averageBars = F.premap bar average

     averageBazs :: Fold Foo Double
     averageBazs = F.premap (fromIntegral . baz) average
  • Statistics can be computed in parallel more easily: to compute the mean over several vectors, or slices of a vector, in parallel, kbnSum :: Fold Double KBNSum can be applied to each part and the results combined monoidally before performing a single division.
  • Statistics can be computed incrementally. This seems particularly useful in the dataframes package, where statistics can be computed on each column as it's parsed (this has the added benefit that columns of Doubles could potentially store their mean, which is passed to folds over unfiltered columns, making more statistics "one pass", if we ignore the parsing pass).
  • It makes clear in the types when multiple passes over the data are needed: Folds which need to know the mean take it as an argument.
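The Fold shape behind the points above can be sketched in a few lines. This is a hypothetical minimal re-implementation for illustration only; the real foldl package provides Control.Foldl's Fold and premap with this shape:

```haskell
{-# LANGUAGE ExistentialQuantification #-}

-- Minimal sketch of foldl's Fold: a step function, an initial
-- state, and a final projection, with the state type hidden.
data Fold a b = forall x. Fold (x -> a -> x) x (x -> b)

-- Run a fold over a list (the real library accepts any Foldable).
runFold :: Fold a b -> [a] -> b
runFold (Fold step begin done) = done . foldl step begin

-- Contravariantly map the input, like F.premap.
premap :: (a -> b) -> Fold b r -> Fold a r
premap f (Fold step begin done) = Fold (\x a -> step x (f a)) begin done

-- Naive one-pass mean (the package uses KBN summation instead).
average :: Fold Double Double
average = Fold (\(s, n) x -> (s + x, n + 1)) (0, 0 :: Int)
               (\(s, n) -> s / fromIntegral n)

data Foo = Foo { bar :: Double, baz :: Int }

averageBars :: Fold Foo Double
averageBars = premap bar average

main :: IO ()
main = print (runFold averageBars [Foo 1 10, Foo 3 20, Foo 5 30])  -- 3.0
```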

This PR converts all functions in Statistics.Sample to use the new Statistics.Sample.Fold module. Across the board, things are either about the same speed, slightly faster or slower (this seems to have more to do with how busy my machine is), or significantly faster: correlation and covariance save between 40% and 50%, and I found an optimisation to the custom (^) function which makes all the central moment calculations significantly faster. See benchmarks below.

I've also added the LMVSK type, which computes the length, mean, variance, skewness and kurtosis in a single pass; I'd previously added it to the foldl-statistics package.

I also had to rename statistics-bench's main file because cabal couldn't run the benchmarks. Happy to revert this if needed.

Results from benchmarks:

| Name | master Mean | foldl Mean | % of master | master 2*Stdev | foldl 2*Stdev |
|------|------------:|-----------:|------------:|---------------:|--------------:|
| All.sample.range | 8262200 | 8169891 | 98.88 | 498610 | 452426 |
| All.sample.mean | 26335827 | 24903906 | 94.56 | 1741032 | 2233392 |
| All.sample.meanWeighted | 49147827 | 49967260 | 101.66 | 3310974 | 3697810 |
| All.sample.harmonicMean | 6342874 | 6453201 | 101.73 | 431900 | 411892 |
| All.sample.geometricMean | 47130517 | 47531176 | 100.85 | 3358250 | 3973606 |
| All.sample.variance | 41127624 | 38427368 | 93.43 | 3461374 | 3347004 |
| All.sample.varianceUnbiased | 41193457 | 38872558 | 94.36 | 3349406 | 3730782 |
| All.sample.varianceWeighted | 57257397 | 60096997 | 104.95 | 3761462 | 5424648 |
| All.sample.pearson | 99297216 | 59261547 | 59.68 | 7153452 | 3936702 |
| All.sample.covariance | 71005322 | 39155200 | 55.14 | 5025584 | 3411858 |
| All.sample.correlation | 99429296 | 59264257 | 59.60 | 7793264 | 4439162 |
| All.sample.covariance2 | 70466455 | 64897363 | 92.09 | 5640612 | 4228330 |
| All.sample.correlation2 | 98596923 | 76753466 | 77.84 | 7266128 | 6147440 |
| All.sample.stdDev | 42555541 | 40379150 | 94.88 | 3659268 | 4008108 |
| All.sample.skewness | 112101611 | 48734301 | 43.47 | 9179452 | 4167406 |
| All.sample.kurtosis | 124104443 | 50724829 | 40.87 | 6640072 | 3372990 |
| All.sample.C.M. 2 | 71539257 | 52870776 | 73.90 | 6927224 | 2400496 |
| All.sample.C.M. 3 | 78216333 | 49357958 | 63.10 | 5512322 | 3578362 |
| All.sample.C.M. 4 | 85592968 | 50596264 | 59.11 | 7322466 | 3992384 |
| All.sample.C.M. 5 | 95427246 | 54429187 | 57.03 | 6733316 | 2168810 |
| All.sample.Applicative LMVSK | – | 91811230 | – | – | 683638 |
| All.sample.fastLMVSK | – | 45109643 | – | – | 3712986 |

Master: (benchmark report screenshot)

Foldl: (benchmark report screenshot)

@axman6 axman6 force-pushed the foldl-statistics branch from 1dad1ef to e3b2003 Compare August 9, 2025 06:43
@axman6 axman6 force-pushed the foldl-statistics branch from e3b2003 to ca2a7c3 Compare August 9, 2025 07:25
@axman6 axman6 changed the title Expose sample statistics function as Folds from the foldl package Expose sample statistics functions as Folds from the foldl package. Aug 9, 2025
@axman6 axman6 marked this pull request as ready for review August 9, 2025 08:07
@axman6 (Contributor, Author) commented Aug 9, 2025

I believe this is ready for review. There are still a couple of questions to answer: which version of foldl should be the minimum (I would be surprised if this isn't compatible with all versions; IIRC this only uses Fold and length from the package), and whether the new module and/or its definitions need @since annotations.

I also haven't added LMVSK functions over Vectors, but if everyone is happy with what they're doing, I'll do that too.

@axman6 (Contributor, Author) left a comment

A few notes to hopefully clarify things.

Comment on lines 142 to 161
(^) :: Double -> Int -> Double
!x ^ 1 = x
x ^ 2 = x * x
x ^ 3 = x * x * x
x ^ 4 = (x * x) * (x * x)
x ^ 5 = (x * x) * x * (x * x)
-- x ^ n = x * (x ^ (n-1))
x0 ^ n0 = go (n0-1) x0 where
    go 0 !acc = acc
    go n acc = go (n-1) (acc*x0)
@axman6 (Contributor, Author) commented:

This change is responsible for most of the improvement in the central moment benchmarks. There is a small improvement from the unrolling compared to the recursive loop; even just using

!x ^ 1 = x
-- x ^ n = x * (x ^ (n-1))
x0 ^ n0 = go (n0-1) x0 where
    go 0 !acc = acc
    go n acc = go (n-1) (acc*x0)

is significantly better than the "fast" definition that was used before.

Having the unrolled definitions for small powers does allow the compiler to inline them when the power is known.
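As a quick sanity check (a standalone sketch, not part of the PR), the unrolled definition agrees with the Prelude's (^) for Double bases and small positive exponents:

```haskell
import Prelude hiding ((^))
import qualified Prelude as P

-- Unrolled small powers plus a strict counting loop, mirroring the PR
-- (bang patterns dropped here since laziness doesn't affect the check).
(^) :: Double -> Int -> Double
x  ^ 1 = x
x  ^ 2 = x * x
x  ^ 3 = x * x * x
x  ^ 4 = (x * x) * (x * x)
x  ^ 5 = (x * x) * x * (x * x)
x0 ^ n0 = go (n0 - 1) x0
  where
    go :: Int -> Double -> Double
    go 0 acc = acc
    go n acc = go (n - 1) (acc * x0)

main :: IO ()
main = print (and [x ^ n == x P.^ n | x <- [2, 3, 10], n <- [1 .. 8]])
-- prints True: all these products are exact in Double
```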

A collaborator commented:

BTW, could you submit this as a separate PR? It's a clear improvement and could be merged immediately.

@axman6 (Contributor, Author) replied:

Sure

yielding to boxed returns and heap checks.
-}
@axman6 (Contributor, Author) commented:

Some things never change...

Comment on lines +242 to +254
-- Counts the number of elements, using Word for the count.
doubleLength :: F.Fold a Double
doubleLength = F.Fold (\n _ -> n+1) (0::Word) fromIntegral
@axman6 (Contributor, Author) commented:

Only defined for convenience, not exported.

(\(_,y) -> square (y - muY))
{-# INLINE correlation #-}

-- $lmvsk
@axman6 (Contributor, Author) commented:

The following is ported from foldl-statistics and is based on http://www.johndcook.com/blog/skewness_kurtosis/. The benchmark I've included shows this is roughly twice as fast as computing each of these statistics individually and combining them. It's useful for a quick summary of a dataset, and is about as expensive to compute as the central moments, despite its complexity.
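For reference, the running update from that post can be sketched as below. The type and names here are illustrative only, not the PR's actual LMVSKState API; M2/M3/M4 are running sums of the 2nd/3rd/4th powers of deviations from the running mean:

```haskell
-- One-pass length/mean/variance/skewness/kurtosis state
-- (Welford-style update from John D. Cook's post).
data LMVSK = LMVSK
  { lmCount :: !Int
  , lmMean  :: !Double
  , lmM2    :: !Double  -- sum of squared deviations
  , lmM3    :: !Double
  , lmM4    :: !Double
  }

emptyLMVSK :: LMVSK
emptyLMVSK = LMVSK 0 0 0 0 0

-- Fold one new sample into the state.
stepLMVSK :: LMVSK -> Double -> LMVSK
stepLMVSK (LMVSK n mean m2 m3 m4) x = LMVSK n1 mean' m2' m3' m4'
  where
    n1      = n + 1
    nd      = fromIntegral n1
    delta   = x - mean
    deltaN  = delta / nd
    deltaN2 = deltaN * deltaN
    term1   = delta * deltaN * fromIntegral n
    mean'   = mean + deltaN
    m4'     = m4 + term1 * deltaN2 * (nd*nd - 3*nd + 3)
                 + 6 * deltaN2 * m2 - 4 * deltaN * m3
    m3'     = m3 + term1 * deltaN * (nd - 2) - 3 * deltaN * m2
    m2'     = m2 + term1

-- Unbiased sample variance from the state.
variance :: LMVSK -> Double
variance s = lmM2 s / fromIntegral (lmCount s - 1)

main :: IO ()
main = print (lmMean (foldl stepLMVSK emptyLMVSK [1, 2, 3, 4]))  -- 2.5
```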

@@ -1,5 +1,9 @@
module Main where
@axman6 (Contributor, Author) commented:

On macOS this file conflicts with the ones in bench-time and bench-papi: import Bench could refer to this file or to either of the other two.

A collaborator replied:

Well, I guess I'll need to add macOS CI to catch that.

@Shimuuar (Collaborator) commented Aug 9, 2025

Thank you!

I think the new API for computing summary statistics should solve the problems of the current API. Namely:

  1. Lack of sensible error handling. We don't have any reasonable way of figuring out when we are unable to compute a statistic due to an empty sample. Currently mean returns 0 for an empty sample, in a rather arbitrary way. BTW, your implementation changes that to return NaN. And such error handling should be ergonomic as well: in some applications an absent mean has to be handled (e.g. some long-running app), while in data analysis a user is likely to be happy with an exception.
  2. They could only work with vectors. The Fold-based API solves this.

And regarding parallelism: the Fold API only gives incrementality and doesn't expose the monoidal structure of the accumulator. I tried to implement this approach in monoid-statistics but never quite completed it. The foldl primitives could then be expressed in terms of such monoids.

RE 1: I need to remember which approaches I tried and how well they worked.

@axman6 (Contributor, Author) commented Aug 9, 2025

Thanks for the quick reply @Shimuuar! I hadn't noticed that mean with no inputs returns NaN, but my tests of the current algorithm show the same behaviour, which makes sense given the division by zero in both cases. I don't think that's unreasonable behaviour, and certainly not an unexpected one.

For parallel evaluation, I've made sure to expose folds which return monoidal types (KBNSum, LMVSKState), so that the folds can be easily split if the data is loaded in chunks - see the example in the docs.
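The chunked pattern can be sketched as below; this is an illustration with hypothetical names, using plain Double summation where the PR's folds would use KBN compensated summation:

```haskell
import Data.List (foldl')

-- Partial state for one chunk: running sum and count.
partialMean :: [Double] -> (Double, Int)
partialMean = foldl' (\(s, n) x -> (s + x, n + 1)) (0, 0)

-- The monoidal combine: add sums and add counts.
combine :: (Double, Int) -> (Double, Int) -> (Double, Int)
combine (s1, n1) (s2, n2) = (s1 + s2, n1 + n2)

-- Each partialMean could run in parallel; one division at the end.
meanOfChunks :: [[Double]] -> Double
meanOfChunks chunks = s / fromIntegral n
  where (s, n) = foldr1 combine (map partialMean chunks)

main :: IO ()
main = print (meanOfChunks [[1, 2], [3, 4, 5]])  -- 3.0
```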

Would you be able to approve the workflow again? I can't test that my fix has worked locally because I'm not on linux at the moment, so the PAPI tests won't work.

@axman6 (Contributor, Author) commented Aug 9, 2025

Urgh, I had forgotten how much fun it is supporting old GHCs! Hopefully that's all that's needed; I've tested against GHC 9.2.8 and that seems to be working fine, but I can't test older versions.

@Shimuuar (Collaborator) commented:

> For parallel evaluation, I've made sure to expose folds which return monoidal types

I think that's not quite enough. In order to compute the mean, one needs both the sum and the number of elements. Both are monoids, and their product is a monoid. In fact the fold uses such a product monoid as an accumulator, except it's not exposed.

@axman6 (Contributor, Author) commented Aug 10, 2025

That's true, but you get that by just using

meanState :: Fold Double (KBNSum, Int)
meanState = (,) <$> kbnSum <*> F.length

which is a Monoid. We could expose a data Mean = Mean !KBNSum !Int type, or incorporate monoid-statistics into this:

-- Use explicit forall so you can write stat @Mean
stat :: forall m a.  StatMonoid m a => Fold a m
stat = Fold addValue mempty id

and that could simplify some of the definitions in Statistics.Sample.Fold. Assuming all the definitions in there are appropriately INLINEd, you could nearly write:

mean :: Fold Double Double
mean = calcMean <$> monoidalStat

except for the MonadThrow constraint. I guess you could make mean :: Fold Double (Maybe Double), but I don't actually see a problem with choosing NaN as the value returned for the mean of no samples; it's more appropriate than 0 IMO. Choosing zero for the higher-order central moments when they don't have enough data makes sense: the variance of 1 sample is 0, the skewness of two samples is 0, etc.

Another choice might be

stat :: StatMonoid m a => a -> Fold a m
stat x = Fold addValue (singletonMonoid x) id

and make the user always provide at least one sample, but that still doesn't solve the problem of statistics that require at least two samples, etc.
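For completeness, the Maybe-returning mean mentioned above can be sketched as a plain function; a Fold-based version would carry the same (sum, count) state, and this is illustrative rather than a proposed API:

```haskell
import Data.List (foldl')

-- Total mean: Nothing for an empty input instead of NaN.
safeMean :: [Double] -> Maybe Double
safeMean xs = case foldl' (\(s, n) x -> (s + x, n + 1)) (0, 0 :: Int) xs of
  (_, 0) -> Nothing
  (s, n) -> Just (s / fromIntegral n)

main :: IO ()
main = print (safeMean [1, 2, 3], safeMean [])  -- (Just 2.0,Nothing)
```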
