BlackWaspTM

This web site uses cookies. By using the site you accept the cookie policy.This message is for compliance with the UK ICO law.

LINQ
.NET 3.5+

A LINQ Style Operator to Find Items with Distinct Properties

Language-Integrated Query's (LINQ) Distinct operator removes duplicates from a sequence but does not allow a lambda expression to be used when determining if an item is unique. This article describes a custom extension method for this purpose.

Distinct

LINQ's Distinct operator processes a list of items to generate a new sequence with duplicate values removed. There are two overloaded versions of this method. The most basic requires no arguments when used as an extension method. It removes duplicate values for value types and duplicate references for reference types. Two objects with matching values that are actually different references will both appear in the output sequence.

To allow removal of similar items, perhaps those with matching values but different references, you can use the overload with an additional parameter. This receives an IEqualityComparer that is used to determine which values are unique and which are duplicated. For example, you could perform the Distinct operation upon a collection of strings using a case-insensitive comparer. In this case values that differ only in capitalisation will be deemed duplicates and all but the first of such items will be discarded.

Unfortunately, there is no overload of Distinct that allows you to specify the type of comparison using a delegate. This means you cannot use a lambda expression, as you might with other LINQ operations. You could use an equality comparer that is configured using a lambda expression, such as that described in the article, "A Generic Equality Comparer for LINQ". However, this could complicate the syntax somewhat.

In this article we will create a new extension method that behaves in a similar manner to Distinct but that performs comparisons using a Func delegate to which you can supply a lambda. As the operation filters based on the delegate, we'll name it DistinctOn. The method will allow you to perform operations using statements such as the following, which will perform a case-insensitive search.

var distinct = someStrings.DistinctOn(s => s.ToLower());

DistinctOn Operator

As with LINQ operators, it is important to validate the arguments immediately but execute the filtering functionality using deferred execution. This means that we will need two methods. The first is a public method that performs the validation, which must be added to a static class. This is shown below:

public static IEnumerable<TValue> DistinctOn<TValue, TKey>(
    this IEnumerable<TValue> sequence, Func<TValue, TKey> keySelector)
{
    if (sequence == null) throw new ArgumentNullException("sequence");
    if (keySelector == null) throw new ArgumentNullException("keySelector");

    return DistinctOnImpl(sequence, keySelector);
}

The public method is provided with the source sequence and the Func delegate. The delegate is named keySelector, as it generates a key value that will be used to determine whether values are unique or duplicates. The generic method has two type parameters. TValue defines the type of the items in the source sequence. TKey is the type of the generated keys, hence the key selector function takes a TValue and returns a TKey.

The second method is private. It uses yield return to produce the iterator that returns items using deferred execution. The code is shown below:

private static IEnumerable<TValue> DistinctOnImpl<TValue, TKey>(
    IEnumerable<TValue> sequence, Func<TValue, TKey> keySelector)
{
    var keys = new HashSet<TKey>();

    foreach (TValue value in sequence)
    {
        if (keys.Add(keySelector(value)))
        {
            yield return value;
        }
    }
}

The process is quite simple. A HashSet is used to store the keys that are generated by applying the key selector function to the values from the sequence. When you attempt to add an item to a HashSet you use the Add method. If the item does not already exist in the set, it is added and the method returns true. If the value is already present, it is not added and the method returns false.

The foreach loop iterates through all of the source items, trying to add each to the HashSet. If the Add is successful, the key has not been encountered before so the corresponding value is returned. If the Add fails, the key is not unique so the item is discarded.

Testing the Method

To try out the method we'll use something a little more interesting than case-insensitive string comparison. Instead, let's create a custom class that might be used to hold data from a database. The class below represents a person with a name and a unique integer identifier. The constructor is included to simplify initialisation and ToString is overridden to make it easier to see the results of the operation that we are about to perform.

public class Person
{
    public Person(int id, string name)
    {
        Id = id;
        Name = name;
    }

    public int Id { get; set; }

    public string Name { get; set; }

    public override string ToString()
    {
        return string.Format("{0} : {1}", Id, Name);
    }
}

We can now create a sequence and use the DistinctOn method. To do so, add the following code to the Main method. This finds all distinct items based upon the Id property being unique. If two items exist with the same identifier but different names, only the first will be included in the generated sequence. You can see this from the two items with the Id set to 3. The one with the capitalised name, "JIM", is omitted from the results.

var values = new List<Person>
{
    new Person(1, "Bob"),
    new Person(2, "Sam"),
    new Person(3, "Jim"),
    new Person(4, "Mel"),
    new Person(1, "Bob"),
    new Person(3, "JIM")
};

var distinct = values.DistinctOn(x => x.Id);

foreach (var value in distinct)
{
    Console.WriteLine(value);
}

/* OUTPUT

1 : Bob
2 : Sam
3 : Jim
4 : Mel

*/
5 September 2013