Linq: Split()

Deutsche Version

Terminator T1000

So you know the Cinderella saying “The good ones go into the pot, the bad ones go into your crop.”?

Well, .NET Linq does have a solution for either one, it’s called Where(). If you use that, your solution probably looks like this:

var evens = list.Where(number => ((number % 2) == 0));
var odds =  list.Where(number => ((number % 2) != 0));

But it does not have a solution for both. So I wrote myself one:

public static void Split<T>(this IEnumerable<T> source
    , Func<T, Boolean> predicate
    , out IEnumerable<T> trueList
    , out IEnumerable<T> falseList)
{
    trueList = source.Where(predicate);
    falseList = source.Except(trueList);
}

The performance of these two versions is exactly the same. A list containing the numbers 1..6 will be accessed 12 times (six times to check against even, six times to check against odd). Only the moment during the execution code differs slightly.

To understand what is actually happening I wrote myself a class “Number” encapsulating a simple integer. But the getter logs every read access to that number and a static field counts the overall accesses to all instances.

The Where() function output is then like this:

Starting Where

Printing evens
evens.GetType(): System.Linq.Enumerable+WhereListIterator`1[LinqSplit.Number]
Access #1. Number: 1
Access #2. Number: 2
2
Access #3. Number: 3
Access #4. Number: 4
4
Access #5. Number: 5
Access #6. Number: 6
6

Printing odds
odds.GetType(): System.Linq.Enumerable+WhereListIterator`1[LinqSplit.Number]
Access #7. Number: 1
1
Access #8. Number: 2
Access #9. Number: 3
3
Access #10. Number: 4
Access #11. Number: 5
5
Access #12. Number: 6

Finishing Where

The usage of the Except() function on the list using the evens as the to-be-filtered out results in a slightly different output:

Starting Split

Printing evens
evens.GetType(): System.Linq.Enumerable+WhereListIterator`1[LinqSplit.Number]
Access #1. Number: 1
Access #2. Number: 2
2
Access #3. Number: 3
Access #4. Number: 4
4
Access #5. Number: 5
Access #6. Number: 6
6

Printing odds
odds.GetType(): System.Linq.Enumerable+d__92`1[LinqSplit.Number]
Access #7. Number: 1
Access #8. Number: 2
Access #9. Number: 3
Access #10. Number: 4
Access #11. Number: 5
Access #12. Number: 6
1
3
5

Finishing Split

All the above code is in line with the deferred (or delayed) execution that Linq offers. The code is only actually executed when someone starts an interation over the enumerable return value.

However, there are ways to enhance the performance. For example, when you need to have the element count of such an enumerable before iterating over it, it is a bad idea to call Count() and then start iterating over the enumerable. Because Count() does the very same thing and in the end you will have iterated the code twice.

That’s why it is a good stlye approach to force the iteration once by putting the result into a List. Afterwards you can query the count property and iterate as often as you want over the List.

The Split() function can be improved in a similar way, especially if you don’t actually need deferred execution. You simply cache the results of the Where() call in a List and only then use it as an input for the Except() function and as a return value:

public static void SplitOptimized1<T>(this IEnumerable<T> source
    , Func<T, Boolean> predicate
    , out IEnumerable<T> trueList
    , out IEnumerable<T> falseList)
{
    trueList = source.Where(predicate).ToList();
    falseList = source.Except(trueList);
}

As you can see, the method signature did not actually change, because List is an IEnumerable. I just decided to force the code to execute immediately. The result is this:

Starting SplitOptimized1
Access #1. Number: 1
Access #2. Number: 2
Access #3. Number: 3
Access #4. Number: 4
Access #5. Number: 5
Access #6. Number: 6

Printing evens
evens.GetType(): System.Collections.Generic.List`1[LinqSplit.Number]
2
4
6

Printing odds
odds.GetType(): System.Linq.Enumerable+d__92`1[LinqSplit.Number]
1
3
5

Finishing SplitOptimized1

As you can see, the numer of read accesses on the actual data was cut in half by this simple optimization.

Now one could argue “If you forego deferred execution then be consistent about it” and they’d be right. So if you go down that road, it should be made clear in the method signature:

public static void SplitOptimized1<T>(this IEnumerable<T> source
    , Func<T, Boolean> predicate
    , out List<T> trueList
    , out IEnumerable<T> falseList)
{
    trueList = source.Where(predicate).ToList();
    falseList = source.Except(trueList);
}
public static void SplitOptimized2<T>(this IEnumerable<T> source
    , Func<T, Boolean> predicate
    , out List<T> trueList
    , out List<T> falseList)
{
    trueList = source.Where(predicate).ToList();
    falseList = source.Except(trueList).ToList();
}

As you can see, I made the second output a List as well to make it unmistakeably clear that there won’t be any deferred execution when using this particular method. However, this last change has no actual impact on the read accesses:

Starting SplitOptimized2
Access #1. Number: 1
Access #2. Number: 2
Access #3. Number: 3
Access #4. Number: 4
Access #5. Number: 5
Access #6. Number: 6

Printing evens
evens.GetType(): System.Collections.Generic.List`1[LinqSplit.Number]
2
4
6

Printing odds
odds.GetType(): System.Collections.Generic.List`1[LinqSplit.Number]
1
3
5

Finishing SplitOptimized2

The full sources can be found here: LinqSplit.zip