BlackWasp™

LINQ
.NET 3.5+

LINQ Style Variance and Standard Deviation Operators

In statistics, the variance and standard deviation for a set of data indicate how spread out the individual values are. Small values indicate that the elements of a set are close to the average value, whereas larger values suggest a greater spread.
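For reference, the usual textbook definitions can be written as follows. The symbols (N for the number of values, x_i for each value, mu and x-bar for the means) are notation assumed here for illustration and do not appear in the original article.

```latex
% Population variance and standard deviation over N values x_1, ..., x_N
\mu = \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad
\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2, \qquad
\sigma = \sqrt{\sigma^2}

% Sample variance and standard deviation (divisor N - 1, requires N >= 2)
\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad
s^2 = \frac{1}{N-1}\sum_{i=1}^{N} (x_i - \bar{x})^2, \qquad
s = \sqrt{s^2}
```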

Multiple Pass Implementation

Our first implementation is the multiple pass version. We'll create methods for the variance and standard deviation, with one pair of methods for the population calculations and another pair for working with samples. Add a new class named "StandardDeviationExtensions" to the console application project and change the declaration to make it a static class:

public static class StandardDeviationExtensions
{
}

For the variance of a population we need to calculate the mean first, which we can do with the Average standard query operator. We'll square the difference between the arithmetic mean and each value in our source sequence using a projection with the Select method. Finally, we'll use the Average method again to calculate the mean of the squares.

The only potential problem arises if there are no values in the sequence. In that case the Average operator throws an InvalidOperationException. To avoid this we'll check for an empty collection first and return zero.
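As a quick check of this behaviour, the following minimal sketch calls Average on an empty sequence of doubles and catches the InvalidOperationException that it throws. The class and method names (EmptyAverageDemo, AverageThrowsOnEmpty) are hypothetical and exist only for this illustration.

```csharp
using System;
using System.Linq;

public static class EmptyAverageDemo
{
    // Returns true when Average throws for an empty sequence of doubles.
    public static bool AverageThrowsOnEmpty()
    {
        try
        {
            Enumerable.Empty<double>().Average();
            return false;
        }
        catch (InvalidOperationException)
        {
            return true;
        }
    }

    public static void Main()
    {
        Console.WriteLine(AverageThrowsOnEmpty());   // True
    }
}
```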

The full method is shown below:

public static double VarianceOfPopulation(this IEnumerable<double> source)
{
    var count = source.Count();

    if (count == 0) return 0;

    var mean = source.Average();
    var squaredDiffs = source.Select(d => (d - mean) * (d - mean));
    return squaredDiffs.Average();
}

The method for the variance of a sample is very similar to that for a population. This time, at the end of the calculation, we divide the sum of the squared differences by one less than the count of the items in the input sequence. As the divisor is lower, we also need to ensure that there are at least two items in the collection.

public static double VarianceOfSample(this IEnumerable<double> source)
{
    var count = source.Count();

    if (count <= 1) return 0;

    var mean = source.Average();
    var squaredDiffs = source.Select(d => (d - mean) * (d - mean));
    return squaredDiffs.Sum() / (count - 1);
}
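To see how the two divisors change the result, here is a small worked sketch that uses plain LINQ calls rather than the extension methods. The data set and the names VarianceWorkedExample and SumOfSquaredDiffs are assumptions made for illustration. The eight values have a mean of 5 and a sum of squared differences of 32, so the population variance is 32 / 8 = 4 and the sample variance is 32 / 7, or roughly 4.571.

```csharp
using System;
using System.Linq;

public static class VarianceWorkedExample
{
    // Sum of the squared differences between each value and the mean.
    public static double SumOfSquaredDiffs(double[] values)
    {
        double mean = values.Average();
        return values.Sum(d => (d - mean) * (d - mean));
    }

    public static void Main()
    {
        double[] values = { 2, 4, 4, 4, 5, 5, 7, 9 };
        double sumSq = SumOfSquaredDiffs(values);

        // Population variance divides by the count (8).
        Console.WriteLine(sumSq / values.Length);         // 4

        // Sample variance divides by one less than the count (7).
        Console.WriteLine(sumSq / (values.Length - 1));   // roughly 4.571
    }
}
```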

We can now add the two standard deviation methods. Each calls the appropriate variance method and returns the square root of the result. We can use the square root calculation provided by the .NET framework's Math class.

public static double StdDevOfPopulation(this IEnumerable<double> source)
{
    return Math.Sqrt(source.VarianceOfPopulation());
}

public static double StdDevOfSample(this IEnumerable<double> source)
{
    return Math.Sqrt(source.VarianceOfSample());
}

Single Pass Implementation

To function as well as one of the standard query operators we need an on-line algorithm for the variance and standard deviation. This is where the data is processed item by item and the list is only read once. With such an algorithm we could calculate the variance for any sequence, be it one that may only be read once or be it a collection of values that is too large to hold in memory at one time.
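To make the cost of multiple passes visible, the sketch below counts how often a lazily generated sequence is enumerated by the three LINQ calls used in the multiple pass variance. The iterator, the counter and the class name EnumerationCountDemo are hypothetical helpers written for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class EnumerationCountDemo
{
    private static int _passes;

    // Lazily yields values and records each time an enumeration starts.
    private static IEnumerable<double> Values()
    {
        _passes++;
        yield return 2;
        yield return 4;
        yield return 9;
    }

    // Runs the same three LINQ calls as the multiple pass population
    // variance and reports how many times the source was enumerated.
    public static int CountPasses()
    {
        _passes = 0;
        IEnumerable<double> source = Values();

        int count = source.Count();
        double mean = source.Average();
        double variance = source.Select(d => (d - mean) * (d - mean)).Average();

        return _passes;
    }

    public static void Main()
    {
        Console.WriteLine(CountPasses());   // 3
    }
}
```

A source that can only be read once, such as a stream of readings, would fail or give wrong results on the second and third passes, which is the motivation for the single pass version.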

In this section we'll implement the calculations again as a single pass. The process follows the on-line variance algorithm described on Wikipedia, commonly attributed to Welford. I won't describe the calculation any further here.

To begin, create a new class called "SinglePassStandardDeviationExtensions" and make it static:

public static class SinglePassStandardDeviationExtensions
{
}

The variances for a full population or a sample are almost identical, with only the final division changing. Rather than duplicate code, let's create a private method that calculates either variance. We'll add a Boolean parameter to specify whether the division uses the count of the items in the sequence or one less than this number.

The method is as follows. Add it to the new class:

private static double CalculateVariance(IEnumerable<double> source, bool isSample)
{
    int count = 0;
    double delta = 0;
    double mean = 0;
    double sumOfDiffSquares = 0;

    foreach (double value in source)
    {
        count++;
        delta = value - mean;
        mean = mean + (delta / count);
        sumOfDiffSquares = sumOfDiffSquares + delta * (value - mean);
    }

    // Switch calculation and item minimum based upon sample or population flag
    if (isSample)
    {
        return count <= 1 ? 0 : sumOfDiffSquares / (count - 1);
    }
    else
    {
        return count == 0 ? 0 : sumOfDiffSquares / count;
    }
}
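As a sanity check, the following sketch repeats the same update steps in a standalone helper and compares the running sum of squared differences with the value from a two pass calculation. The class name SinglePassCheck, its method names and the data set are assumptions made for illustration. The two results can differ by a tiny floating-point rounding error, so the comparison uses a small tolerance.

```csharp
using System;
using System.Linq;

public static class SinglePassCheck
{
    // Same update steps as CalculateVariance, returning the running
    // sum of squared differences before the final division.
    public static double RunningSumOfSquares(double[] values)
    {
        int count = 0;
        double mean = 0;
        double sumOfDiffSquares = 0;

        foreach (double value in values)
        {
            count++;
            double delta = value - mean;                  // distance from the previous mean
            mean += delta / count;                        // updated running mean
            sumOfDiffSquares += delta * (value - mean);   // uses both old and new mean
        }

        return sumOfDiffSquares;
    }

    // Two pass equivalent: compute the mean first, then sum the squares.
    public static double TwoPassSumOfSquares(double[] values)
    {
        double mean = values.Average();
        return values.Sum(d => (d - mean) * (d - mean));
    }

    public static void Main()
    {
        double[] values = { 2, 4, 4, 4, 5, 5, 7, 9 };
        double onePass = RunningSumOfSquares(values);
        double twoPass = TwoPassSumOfSquares(values);

        Console.WriteLine(Math.Abs(onePass - twoPass) < 1e-9);   // True
    }
}
```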

To surface public methods for the variance, add the following code. In each case we check that the source sequence is not null before calling the CalculateVariance method with the correct flag.

public static double VarianceOfPopulation1P(this IEnumerable<double> source)
{
    if (source == null) throw new ArgumentNullException("source");

    return CalculateVariance(source, false);
}

public static double VarianceOfSample1P(this IEnumerable<double> source)
{
    if (source == null) throw new ArgumentNullException("source");

    return CalculateVariance(source, true);
}

The standard deviation methods follow the same pattern as the multiple pass versions, each returning the square root of the corresponding single pass variance:

public static double StdDevOfPopulation1P(this IEnumerable<double> source)
{
    return Math.Sqrt(source.VarianceOfPopulation1P());
}

public static double StdDevOfSample1P(this IEnumerable<double> source)
{
    return Math.Sqrt(source.VarianceOfSample1P());
}
2 April 2013