Django's N+1 problem is a well-known topic for Django developers, and we all know how to use the .select_related() and .prefetch_related() queryset methods to avoid it.
However, over the years I've noticed that the way we typically write code, and the way projects evolve and features get developed, very often leads to situations where these traditional measures simply cannot save us from performance issues.
Let's create some very simple models to illustrate some cases.
class Topic(models.Model):
    title = models.CharField(max_length=255)
    user = models.ForeignKey(User, on_delete=models.CASCADE)

    def comments_count(self):
        return self.comments.count()

    def last_comment_date(self):
        return self.comments.last().created_at


class Comment(models.Model):
    topic = models.ForeignKey(Topic, on_delete=models.CASCADE, related_name="comments")
    text = models.TextField()
    created_at = models.DateTimeField(auto_now_add=True)
    updated_at = models.DateTimeField(auto_now=True)
See the comments_count and last_comment_date methods? This is exactly the problem I have faced many times while optimizing database queries.
The main problem here is that people initially work with some page or API where only one object is needed, add a few methods to the model or serializer, and go about their day. Indeed, as long as you only have the following code, everything works fine.
topic = Topic.objects.last()
print(topic.comments_count())
print(topic.last_comment_date())
However, when you want to work with a set of Topic model instances, you run into the N+1 problem, which simply can't be fixed with select_related or prefetch_related, because you're making queries inside model methods.
topics = Topic.objects.filter(user_id=1)

for topic in topics:
    # N+1
    print(topic.comments_count())
    # N+1
    print(topic.last_comment_date())
To fix the N+1 problem here, you need an annotated query.
from django.db.models import Count, Max

topics = Topic.objects.filter(user_id=1).annotate(
    comments_count=Count('comments'),
    last_comment_date=Max('comments__created_at')
)

for topic in topics:
    print(topic.comments_count)
    print(topic.last_comment_date)
But wait, now we have a new problem: we have to support two different implementations of the same logic, which is, in a metaphorical sense, an especially awkward case of copy-pasted code.
We have to write twice as many tests, and there is a good chance that in a couple of years someone will make a change in one place and forget to do it in the other, and then we have to deal with unexpected errors, debug several different places, and waste a lot of time.
What would be the best solution for this? Well, exactly what the story title suggests: treat your ORM instances as collections.
Basically, you can simply not write the comments_count and last_comment_date methods and always annotate the data, even if you only work with one Topic instance. Just do it the same way in both cases.
from django.db.models import Count, Max

# Work with a collection
topics = Topic.objects.filter(user_id=1).annotate(
    comments_count=Count('comments'),
    last_comment_date=Max('comments__created_at')
)

for topic in topics:
    print(topic.comments_count)
    print(topic.last_comment_date)

# Work with a single instance
topic = Topic.objects.filter(user_id=1).annotate(
    comments_count=Count('comments'),
    last_comment_date=Max('comments__created_at')
).last()

print(topic.comments_count)
print(topic.last_comment_date)
But what about code duplication, you might ask? No, you don't need to write this annotation out every time you send a query to the database.
There is a good chance that you only need this data in a certain part of your project, so create a service function, place it in the appropriate module, and use it only where you need it.
from django.db.models import Count, Max

def annotate_aggregated_comments_data(queryset):
    return queryset.annotate(
        comments_count=Count('comments'),
        last_comment_date=Max('comments__created_at')
    )

topic = annotate_aggregated_comments_data(Topic.objects.all()).last()
Doing so will benefit you greatly in the future by heading off performance issues and sparing you later refactoring and debugging. However, this may seem a bit excessive and may scare some people off.
Don't add such utilities to QuerySet classes: they usually serve special use cases and often include queries against other models that don't belong in your QuerySet. A QuerySet should stay focused on one specific model, with general-case implementations for queries.
Instead, use selector functions created for the specific domain-related parts of your project.
This is a very common situation: you initially design your solution with a single object in mind, and after a while you find yourself facing a large refactoring because your product needs to work with big sets of data and the project suffers from long page loads, heavy database load, and so on.
Think about helping yourself in the future, or perhaps your colleagues, and make some effort to avoid it beforehand.