Place where sepulci live
I’m currently developing new API for estimating various statistics. Probably it’ll require several posts to cover topic. This is first one. Let start from looking at existing code and try to identify deficiencies. Here is function for calculating mean from current statistics-0.10.1.0:
What could we notice?
It’s implemented in terms of left fold and there is nothing really specific to vectors. Anything foldable could be used: vectors, list, any container or stream of numbers. Yet function accept vectors only.
We have accumulator data type which is internal for library. From now on I’ll call such data types estimators. It’s however useful to expose it. It could be used to calculate mean incrementally. Moreover we could write function to merge partial results and parallelize evaluation.
Frequently several statistics are evaluated at once. In our example
Mean carry sample mean (first field) and number of elements (second field) in the sample. Currently we just discard number of elements. More motivating example is variance. Estimator for variance carry count, mean and two variance estimates: unbiased and maximum likelihood one. Not to mention standard deviation. With simple API we have to write separate functions for each combinations and there way too many of them.
Let try to rewrite code then:
-- fold here is fold for some unspecified data structure estimateMean = foldl addElement emptyEstimator emptyEstimator :: Mean emptyEstimator = Mean 0 0 calcMean :: Mean -> Double calcMean (Mean m _) = m addElement :: Mean -> Double -> Mean addElement (Mean m n) x = Mean m' n' where m' = m + (x - m) / fromIntegral n' n' = n + 1 data Mean = Mean !Double !Int
Now we have estimator for empty sample, functions for adding element to accumulator and extracting mean. First two are generic and apply to different estimators. So they should be generalized to type class. Extraction function is specific for estimator and it will saved for later.
Now it’s good moment to stop and think a little. What are we trying to do? We trying to write generic API for statistics which could be efficiently calculated using fold. Here efficiently means less than in O(n) additional space, preferably in O(1). There are a lot of such statistics: mean, moments and central moments (barring problems with numerical instabilities), maximum etc.
Not all statictics could be calculated in such way. Prime example is median and quantiles in general. One can always choose yet unknown element in such way that any already known element becomes estimate. This means we have to keep them all or O(n) additional space.
Since we want to use various data type for sample we need to define type class for it. There are
Foldable type class but it require data type fully polymorphic in parameter so no unboxed vectors. It could be worked around using type families:
For fold we need starting value and natural choice is estimator for empty sample. Unfortunately many statistics are not defined for empty sample. In fact mean is not defined too. In example above mean of empty sample is set to zero. It was done purely for convenience. Let put such question of dealing with them aside for a moment. Here is type class:
Next ingredient is function for adding element to estimator:
Note that there is no dependency between
a. Why? For example when we count elements we want to be able to count elements of any type. Another example is mean, variance, etc. Currently they are limited to doubles. But there is no real reason why couldn’t we calculate mean for sample of
Last type class is for merging estimators. If we have estimators for two subsamples we may have way to join them and obtain estimate for union of samples:
If estimator is instance of this type class it forms semigroup and choice of the name reflect this. If it is instance of
NullEstimator too it forms monoid.
That’s all for now.