-
Notifications
You must be signed in to change notification settings - Fork 74
Expose sample statistics functions as Folds from the foldl package. #220
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
1dad1ef to
e3b2003
Compare
e3b2003 to
ca2a7c3
Compare
|
I believe this is ready for review - there's still a couple of questions to answer, which version of I also haven't added LMVSK functions over |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few notes to hopefully clarify things.
| (^) :: Double -> Int -> Double | ||
| !x ^ 1 = x | ||
| x ^ 2 = x * x | ||
| x ^ 3 = x * x * x | ||
| x ^ 4 = (x * x) * (x * x) | ||
| x ^ 5 = (x * x) * x * (x * x) | ||
| -- x ^ n = x * (x ^ (n-1)) | ||
| x0 ^ n0 = go (n0-1) x0 where | ||
| go 0 !acc = acc | ||
| go n acc = go (n-1) (acc*x0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change is responsible for most of the improvement in the central moment benchmarks. There is a small improvement from the unrolling compared to the recursive loop here, just using
!x ^ 1 = x
-- x ^ n = x * (x ^ (n-1))
x0 ^ n0 = go (n0-1) x0 where
go 0 !acc = acc
go n acc = go (n-1) (acc*x0)is significantly better than the "fast" definition that was used before.
Having the unrolled definitions for small powers does allow the compiler to inline the definitions when the power is known.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW could you submit this as a separate PR? It's clear improvement and could be merged immediately?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure
| yielding to boxed returns and heap checks. | ||
| -} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some things never change...
| -- Counts the number of elementes using Word for the count. | ||
| doubleLength :: F.Fold a Double | ||
| doubleLength = F.Fold (\n _ -> n+1) (0::Word) fromIntegral |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only defined for convenience, not exported.
| (\(_,y) -> square (y - muY)) | ||
| {-# INLINE correlation #-} | ||
|
|
||
| -- $lmvsk |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The following is ported from foldl-statistics and is based on http://www.johndcook.com/blog/skewness_kurtosis/. The benchmark I've included shows this is roughly twice as fast as computing each of these statistics individually and combining them. Useful for a quick summary of a dataset, and is about as expensive to compute as the central moments, despite its complexity..
| @@ -1,5 +1,9 @@ | |||
| module Main where | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On macOS this file and the ones in bench-time and bench-papi conflict - import Bench refers to this file, and the two others.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well I guess I'll need to add macos CI to catch that
|
Thank you! I think new API for computing summary statistics should solve problems of current API. Namely:
And regarding parallelism. RE 1. I need to remember which approaches I tried and how well did it work |
|
Thanks for the quick reply @Shimuuar! I hadn't noticed that mean with no inputs returns NaN, but my tests of the current algorithm show the same behaviour, which makes sense with the division by zero in both cases. I don't think that's an unreasonable behaviour though, certainly not an unexpected one anyway. For parallel evaluation, I've made sure to expose folds which return monoidal types ( Would you be able to approve the workflow again? I can't test that my fix has worked locally because I'm not on linux at the moment, so the PAPI tests won't work. |
|
Urgh, I had forgotten how much fun it is supporting old GHCs! Hopefully that's all that's needed, I've tested against GHC 9.2.8 and that seems to be working fine, but I can't test older versions. |
I think that's not quite enough. In order to compute mean one needs both sum and number of elements. Both are monoids and their product is a monoid. In fact fold uses such product monoid as an accumulator except it's not exposed. |
|
That's true, but you get that by just using meanState :: Fold Double (KBNSum, Int)
meanState = (,) <$> kbnSum <*> F.lengthwhich is a -- Use explicit forall so you can write stat @Mean
stat :: forall m a. StatMonoid m a => Fold a m
stat = Fold addValue mempty idand that could simplify some of the definitions in mean :: Fold Double Double
mean = calcMean <$> monoidalStatexcept for the Another choice might be stat :: StatMonoid a m => a -> Fold a m
stat x = Fold addValue (singletonMonoid x) idand make the user always provide at least one sample, but that still doesn't solve the problem of statistics that require at least two samples, etc. |
Well I managed to thoroughly nerdsnipe myself with this. This work is based on my
foldl-statisticspackage, which translated all the functions inStatistics.Sampleto use thefoldlpackage'sFoldtype many years ago. This has a few benefits:meanover a several vectors, or slices of a vector in parallel,kbnSum :: Fold Double KBNSumcan be applied to each part, and combined monoidally before performing a single division.dataframespackage, where statistics can be computed on each column as it's parsed (this has the added benefit that Columns of Double's could potentially store their mean, which is passed to folds over unfiltered columns, making more statistics "one pass", if we ignore the parsing pass).This PR converts all functions in Statistics.Sample to use the new
Statistics.Sample.Foldmodule. Across the board, things are either about the same speed, slightly faster or slower (this seems to be more to do with how busy my machine is though), of significantly faster -correlationandcovariancesave between 40% and 50%, and I found an optimisation the custom(^)function which makes all the central moment calculations significantly faster. See benchmarks below.I've also added the
LMVSKtype which computes the length, mean, variance, skewness and kurtosis in a single pass which I'd added to thefoldl-statisticspackage.I also had to rename the
statistics-bench's main file name becausecabalcouldn't run the benchmarks. Happy to revert this ifResults from benchmarks:
Master:

Foldl:
