This article does not set out to persuade you, the reader, that either an iterator or a materialized collection will universally solve your problems. Both can be used to solve the scenarios that we’ll be looking at, but each comes with a different set of pros and cons that we can explore. The purpose of this article is to highlight scenarios, based on real-world experience, where either an iterator or a materialized collection was misunderstood and misused, ultimately leading to a pile of headaches.
As you read this article, if you find yourself saying “Well, sure, but they should have…”, you’re probably right. The problem is fundamentally not the use of an iterator or a materialized collection, but not understanding how to consume them effectively. So I hope that when you’re working with newer software engineers, or with people less familiar with these concepts, you’re reminded to impart your wisdom.
In order to give us some common ground as we explore an approach with a materialized collection in contrast with an iterator, let’s expand on the real world examples where I see these challenges regularly coming up. Let’s assume that you have a data access layer in your application that is responsible for getting records from a database or some data store.
You build some API that the rest of your application can use, and you’ll be using the results of that API in situations such as:
Another key ingredient to mention here is that, because this is anchored in the real world… Code bases change and evolve over time. People come up with new use cases for the data access layer. More data gets added to the data store, pushing past limits that nobody ever planned for. New or more junior developers come into the code base.
It’s real life and until we have more automated tech to police these things, we’re going to run into fun issues.
Before we focus on iterators, let’s explore the more common approach which involves materialized collections. Given the common scenario we discussed above, you have a method that might look like the following:
public List&lt;string&gt; GetEntriesFromDatabase()
{
    // incur some latency for connecting to the database
    var connection = _connectionFactory.OpenNew();
    var command = connection.CreateCommand();
    // TODO: actually create the query on the command, but this is just to illustrate
    var resultsReader = command.Execute();

    List&lt;string&gt; results = new List&lt;string&gt;();
    while (resultsReader.Read())
    {
        // TODO: pull the data off the reader... this example just uses a single field
        var value = resultsReader.GetString(0);
        results.Add(value);
    }

    return results;
}
There is nothing glaringly wrong with this example; in fact, by leaving the actual query up to your imagination, I’ve omitted where a lot of the trouble can come from. Let’s use an example that simulates the database call:
List&lt;string&gt; PretendThisGoesToADatabaseAsList()
{
    // let's simulate some exaggerated latency to the DB
    Thread.Sleep(5000);
    Console.WriteLine($"{DateTime.Now} - &lt;DB now sending back results&gt;");

    // now let's assume we run some query that pulls back 100,000 strings from
    // the database
    List&lt;string&gt; results = new List&lt;string&gt;();
    while (results.Count &lt; 100_000)
    {
        // simulate a tiny bit of latency on the "reader" that would be
        // reading data back from the database... every so often we'll
        // sleep a little bit just to slow it down
        if ((results.Count % 100) == 0)
        {
            Thread.Sleep(1);
        }

        results.Add(Guid.NewGuid().ToString());
    }

    return results;
}
Please note the delays in the example code above are artificially inflated so that if you run this in a console you can observe the different effects of changing the variables.
And now that we have the code snippet that simulates pulling from the database by building up a full collection first, let’s look at some calling code that can exercise it:
long memoryBefore = GC.GetTotalMemory(true);
Console.WriteLine($"{DateTime.Now} - Getting data from the database using List...");
List<string> databaseResultsList = PretendThisGoesToADatabaseAsList();
Console.WriteLine($"{DateTime.Now} - Got data from the database using List.");
Console.WriteLine($"{DateTime.Now} - Has Data: {databaseResultsList.Any()}");
Console.WriteLine($"{DateTime.Now} - Count of Data: {databaseResultsList.Count}");
long memoryAfter = GC.GetTotalMemory(true);
Console.WriteLine($"{DateTime.Now} - Memory Increase (bytes): {memoryAfter - memoryBefore}");
The calling code will take a snapshot of memory before we call our method and perform operations on the result. The two things we’ll be doing with the result are calling the LINQ method Any() and then calling Count directly on the list. As a side note, the Count() LINQ method actually will not require full enumeration as it has an optimization to check if there’s a known length.
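That side note about `Count()` is worth seeing in action. Here’s a minimal sketch (my own stand-in code, not from the article) showing that LINQ’s `Count()` takes a shortcut when the sequence is a collection with a known length, and that since .NET 6 you can probe for that shortcut explicitly with `TryGetNonEnumeratedCount`:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class CountOptimizationDemo
{
    // A plain iterator: it has no known length, so counting it
    // requires walking every element.
    static IEnumerable<int> Numbers()
    {
        for (int i = 0; i < 5; i++)
        {
            yield return i;
        }
    }

    static void Main()
    {
        List<int> list = Numbers().ToList();

        // Enumerable.Count() checks for ICollection<T> first, so for a
        // List<T> it reads the Count property instead of enumerating.
        Console.WriteLine(list.Count());

        // Since .NET 6, TryGetNonEnumeratedCount makes the distinction
        // explicit: it succeeds for the materialized list...
        bool listHasCheapCount = list.TryGetNonEnumeratedCount(out int n);
        Console.WriteLine($"{listHasCheapCount} {n}");

        // ...but fails for the raw iterator, which has no known length.
        bool iteratorHasCheapCount = Numbers().TryGetNonEnumeratedCount(out _);
        Console.WriteLine(iteratorHasCheapCount);
    }
}
```

`TryGetNonEnumeratedCount` is handy precisely in the ambiguous situations this article covers: it lets you grab a cheap count when one exists without accidentally triggering a full enumeration.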
With the materialized collection example, we are able to call the method and store the result set in memory. Given the two operations we are trying to use on the collection, Any() and Count, this information is quickly accessible to us because we’ve paid the performance hit one time to materialize the results into a list.
Compared to an iterator, this approach does not run the risk of allowing callers to accidentally fully re-enumerate the results. This is because the result set is materialized once. However, the implication here is that depending on the size of the results and how expensive it might be to fully materialize that full result set, you could be paying a disproportionate price for things like Any() that only need to know the existence of one element before they return true.
And if you recall what I said at the start of this article, if your mind automatically jumps to “Well, someone should build a dedicated query for that”, then… yes, that’s absolutely a solution. But what I’m here to tell you is that it’s very common for something like this to slip through the cracks of a code review because of the LINQ syntax we have available to us. Especially if someone writes something like:
CallTheMethodThatActuallyMaterializesToAList().Any()
Because in this example, if the method name wasn’t quite so obvious, you’d have no issue with an iterator but a huge concern with a heavy-handed list materialization.
And why is it so heavy-handed? Well one could argue it’s doing exactly what it was coded to do, but we need to consider how callers are going to be taking advantage of this.
If callers rarely ever need to be dealing with the full data set and they need to do things like Any(), First() or otherwise lighter weight operations that don’t necessarily need the entire result set… They don’t have a choice with this API. They will be paying the full price to materialize the entire result set when in reality maybe they just needed to walk through several elements.
In the example code above, this results in multiple megabytes of string data being allocated when we just need a count of data and to check if there was any data. Yes, it looks contrived… But this is simply to illustrate that this API design does not lend itself well to particular use cases for callers.
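To make that concrete, here’s a minimal, hypothetical sketch of the problem. `GetAllAsList` is a stand-in name for a list-returning data-access API like the one above; the counter just makes visible how much work happens before `Any()` can even look at the first element:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class ExistenceCheckDemo
{
    static int _rowsBuilt;

    // Hypothetical list-returning API: a stand-in for a materializing
    // data-access method. Every caller pays for the full result set.
    static List<string> GetAllAsList()
    {
        var results = new List<string>();
        for (int i = 0; i < 100_000; i++)
        {
            results.Add(Guid.NewGuid().ToString());
            _rowsBuilt++;
        }
        return results;
    }

    static void Main()
    {
        // The caller only asks "is there anything?", but the API has
        // already built every single row before Any() runs.
        bool any = GetAllAsList().Any();
        Console.WriteLine($"{any} after building {_rowsBuilt} rows");
    }
}
```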
Let’s go ahead and contrast the previous example with an iterator approach. We’ll start with the code:
IEnumerable<string> PretendThisGoesToADatabaseAsIterator()
{
// let's simulate some exaggerated latency to the DB
Thread.Sleep(5000);
Console.WriteLine($"{DateTime.Now} - <DB now sending back results>");
// now let's assume we run some query that pulls back 100,000 strings from
// the database
for (int i = 0; i < 100_000; i++)
{
// simulate a tiny bit of latency on the "reader" that would be
// reading data back from the database... every so often we'll
// sleep a little bit just to slow it down
if ((i % 100) == 0)
{
Thread.Sleep(1);
}
yield return Guid.NewGuid().ToString();
}
}
As you can see in the code above, we have an iterator structured almost identically, except that it returns IEnumerable&lt;string&gt; and uses yield return to hand results back one at a time instead of accumulating them into a list.
As a quick recap: an iterator cannot provide a caller with a count the way other collection types can; all it can do is let a caller step through the results item by item.
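That “step through item by item” behavior is literally all that IEnumerable exposes. Here’s a small sketch (stand-in code of my own) using the enumerator directly, which is what a foreach loop compiles down to:

```csharp
using System;
using System.Collections.Generic;

class ManualEnumerationDemo
{
    // A tiny iterator for illustration.
    static IEnumerable<string> Words()
    {
        yield return "first";
        yield return "second";
        yield return "third";
    }

    static void Main()
    {
        // No Count, no indexer: the only operations available are
        // MoveNext() to advance and Current to read the element.
        using IEnumerator<string> enumerator = Words().GetEnumerator();
        while (enumerator.MoveNext())
        {
            Console.WriteLine(enumerator.Current);
        }
    }
}
```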
We can use a similar calling code snippet over our iterator, but let’s go ahead and add in a couple of additional console writing lines:
long memoryBefore = GC.GetTotalMemory(true);
Console.WriteLine($"{DateTime.Now} - Getting data from the database using iterator...");
IEnumerable<string> databaseResultsIterator = PretendThisGoesToADatabaseAsIterator();
Console.WriteLine($"{DateTime.Now} - \"Got data\" (not actually... it's lazy evaluated) from the database using iterator.");
Console.WriteLine($"{DateTime.Now} - Has Data: {databaseResultsIterator.Any()}");
Console.WriteLine($"{DateTime.Now} - Finished checking if database has data using iterator.");
Console.WriteLine($"{DateTime.Now} - Count of Data: {databaseResultsIterator.Count()}");
Console.WriteLine($"{DateTime.Now} - Finished counting data from database using iterator.");
long memoryAfter = GC.GetTotalMemory(true);
Console.WriteLine($"{DateTime.Now} - Memory Increase (bytes): {memoryAfter - memoryBefore}");
The additional lines of console writing just provide some additional context for where our code will be spending time.
So, do iterators solve everything? The short answer: no. The long answer: iterators can make some of the earlier issues we saw with materialized collections go away, but they come with their own challenges for folks who are not familiar with working with them.
When we consider the memory footprint in this example, it’s nearly nothing in comparison to the prior example. This is the case because at no point in time in this calling code example did we need to have the entire result set materialized for us to answer the questions that we were interested in.
Will that always be the case? Absolutely not. However, one of the benefits of iterators here is that a caller now has the choice. These choices include whether they just want to do partial enumeration, full enumeration, or full enumeration to materialize the result set. The key here is flexibility in how the API is consumed.
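That flexibility can be sketched out as follows, using a stand-in iterator rather than the article’s exact method. The caller, not the API, decides how much of the sequence to pay for:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class CallerChoiceDemo
{
    // Stand-in for an iterator-based data-access method.
    static IEnumerable<string> GetEntries()
    {
        for (int i = 0; i < 100_000; i++)
        {
            yield return $"row {i}";
        }
    }

    static void Main()
    {
        // Partial enumeration: stop after the very first element.
        string first = GetEntries().First();

        // Partial enumeration: walk just a handful of elements.
        int walked = GetEntries().Take(10).Count();

        // Full enumeration with materialization: the caller explicitly
        // opts in to paying the full cost, rather than the API forcing it.
        List<string> all = GetEntries().ToList();

        Console.WriteLine($"{first} / {walked} / {all.Count}");
    }
}
```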
But of course, flexibility comes with a trade… and this is something I see far more frequently with newer C# programmers because they are not actually familiar with iterators. The example above? Sure, it doesn’t use much memory at all… but it will run PretendThisGoesToADatabaseAsIterator twice: once for Any() and once again for Count().
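You can prove the double enumeration to yourself with a counter inside a stand-in iterator (my own sketch, not the article’s code), and see that materializing once with ToList() avoids it:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class DoubleEnumerationDemo
{
    static int _enumerations;

    static IEnumerable<int> Source()
    {
        // This line runs each time someone *starts* enumerating.
        _enumerations++;
        for (int i = 0; i < 3; i++)
        {
            yield return i;
        }
    }

    static void Main()
    {
        IEnumerable<int> lazy = Source();
        _ = lazy.Any();     // first enumeration (stops after one element)
        _ = lazy.Count();   // second, full enumeration
        Console.WriteLine(_enumerations);

        _enumerations = 0;
        List<int> materialized = Source().ToList(); // one enumeration
        _ = materialized.Any();   // reads the list, no re-enumeration
        _ = materialized.Count;   // property access, no enumeration
        Console.WriteLine(_enumerations);
    }
}
```

This is the caller-side version of the trade-off: the iterator gives you the choice, but it also gives you the rope to quietly run the expensive query twice.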
And yes, keen-eyed reader, you likely already noticed this. But with a small adjustment to the naming and calling convention:
var results = GetEntriesFromDatabase();
var any = results.Any();
var count = results.Count();
And suddenly you can’t tell if you’re dealing with an iterator or a materialized collection. And before you shout “Well, this is why we never use var!”, let me tweak it once more:
IEnumerable<string> results = GetEntriesFromDatabase();
var any = results.Any();
var count = results.Count();
And the truth is, var doesn’t matter here, because you still don’t know whether GetEntriesFromDatabase() is an iterator or a materialized collection.
So without getting into the weeds of a million different ways we could try and improve this, the point that I would like to highlight to you is that people CAN and DO get this messed up in production code bases. All of the time.
A bonus round for iterators is that given the lazy nature of how they’re evaluated, I have seen layered architectures pass the enumerable all the way to a frontend to finally have it evaluated. The result was that all of the impressive asynchronous data loading support was completely foiled because the main thread ended up being the unfortunate soul that would call the iterator.
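One mitigation for that layering problem is to deliberately materialize at the boundary, off the sensitive thread, before the lazy sequence ever reaches UI code. A sketch under that assumption (`SlowIterator` is a hypothetical stand-in for a slow data-access iterator):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

class EagerBoundaryDemo
{
    // Hypothetical stand-in for a slow data-access iterator.
    static IEnumerable<string> SlowIterator()
    {
        Thread.Sleep(100); // simulated database latency
        for (int i = 0; i < 3; i++)
        {
            yield return $"row {i}";
        }
    }

    static async Task Main()
    {
        // Force the expensive enumeration onto the thread pool, so the
        // caller's thread (e.g. a UI main thread) only ever receives an
        // already-materialized list.
        List<string> rows = await Task.Run(() => SlowIterator().ToList());
        Console.WriteLine(rows.Count);
    }
}
```

Returning a materialized List&lt;T&gt; (or IReadOnlyList&lt;T&gt;) from such a boundary also documents to downstream layers that the enumeration cost has already been paid.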
So which approach should you use? It depends. And to be crystal clear, because I mentioned it at the beginning of this article: the intention of writing all of this was not to tell you that an iterator is better or worse than a materialized collection.
If you’re a more senior software engineer and you read this article feeling frustrated because you had ways to solve my examples… Good :)
I would like you to take that energy to the team you’re on and ensure you can work with more junior engineers. Help them understand where some of these issues come up and how they can avoid them.
My personal preference? I like using iterator-based APIs because I like having the flexibility to stream results. However, after many years of doing this, I’m now digging into some of the performance characteristics. Especially now that we have access to things like spans, I might be heading back out to do a bit more research!