Treat Your ORM Instances as Collections

Written by regquerlyvalueex | Published 2023/07/07
Tech Story Tags: web-development | django | django-tips | django-orm | optimization | software-engineering | software-development | django-guide

TLDRDjango's 'N+1' problem is a well-known topic for Django developers. It can't be fixed with `select_related` or `prefetch_related()` because you're making queries inside model methods. To fix the problem from here, you need to make an annotated query.via the TL;DR App

Django's N+1 problem is a well-known topic for Django developers, and we all know how to use the .select_related() and .prefetch_related() query set methods that allow us to avoid it.

However, over the years, I've noticed that the way we typically write code, the way projects evolve and features are developed, very often leads to situations where these traditional measures simply cannot help us avoid performance issues.

Let's create some very simple models to illustrate some cases.

Models


class Topic(models.Model):
    title = models.CharField(max_length=255)
    user = models.ForeignKey(User, on_delete=models.CASCADE)

    def comments_count(self):
        return self.comments.count()

    def last_comment_date(self):
        return self.comments.last().created_at


class Comment(models.Model):
    topic = models.ForeignKey(Topic, on_delete=models.CASCADE, related_name="comments")
    text = models.TextField()
    created_at = models.DateTimeField(auto_now_add=True)
    updated_at = models.DateTimeField(auto_now=True)

Problem

See the comments_count and last_comment methods? This is exactly the problem I have faced many times while working on optimizing database queries.

The main problem here is that people initially work with some page or API where only one object is needed and just add a few methods to the model or serializer and go about their day. Indeed, when you only have the following code, everything will work fine.

topic = Topic.objects.last()
print(topic.comments_count())
print(topic.last_comment_date())

However, when you want to work with a set of Topic model instances, you run into the N+1 problem that simply can't be fixed with select_related or prefetch_related because you're making queries inside model methods.

topics = Topic.objects.filter(user_id=1)
for topic in topics:
    # N+1
    print(topic.comments_count())
    # N+1
    print(topic.last_comment_date())

To fix the N+1 problem from here, you need to make an annotated query.


from django.db.models import Count, Max
topics = Topic.objects.filter(user_id=1).annotate(
    comments_count=Count('comments'),
    last_comment_date=Max('comments__created_at')
)
for topic in topics:
    print(topic.comments_count)
    print(topic.last_comment_date)

But wait, now we have a new problem: we have to support two different implementations of the same logic, which in some metaphorical sense is a more awkward case of copy-pasted code.

We have to do twice as many tests, and there is a higher chance than in a couple of years that someone will make a change in one place and forget to do it in another, and then we have to deal with unexpected errors, debug several different places, and end up wasting a lot of time.

Solution

What would be the best solution for this? Well, exactly what the story title suggests: treat your ORM instances as collections.

Basically, you can simply not write the comments_count and last_comment_date methods and always annotate the data, even if you only work with one Topic instance. Just do it in both cases.

from django.db.models import Count, Max

# Work with collection
topics = Topic.objects.filter(user_id=1).annotate(
    comments_count=Count('comments'),
    last_comment_date=Max('comments__created_at')
)
for topic in topics:
    print(topic.comments_count)
    print(topic.last_comment_date)

# Work with single instance

topic = Topic.objects.filter(user_id=1).annotate(
    comments_count=Count('comments'),
    last_comment_date=Max('comments__created_at')
).last()

print(topic.comments_count)
print(topic.last_comment_date)

But what about code duplication, you might ask. Yes, you don't need to always write this annotation every time you send a query to the database.

There is a good chance that you only need this data in a certain part of your project, so create a service function and place it in the appropriate place and only use it when you need it.

def annotate_aggregated_comments_data(queryset):
    return queryset.annotate(
        comment_count=Count('comments'),
        last_comment_created_at=Max('comments__created_at')
    )

topic = annotate_aggregated_comments_data(Topic.objects.all()).last()

Doing so will benefit you greatly in the future by reducing potential performance issues and eliminating future refactoring and debugging. However, this may seem a bit excessive and may scare some people.

Caveats

Don't add such utilities to QuerySet classes, as they are usually for special use cases and usually include queries for other models that you don't need in your QuerySet. A QuerySet should be primarily focused on one specific model with general case implementations for queries.

Instead, use some selection functions that will be created for specific domain-related parts of your project.

Conclusion

This is a very common case where you initially design your solution with a single object in mind, and after a while, you find yourself needing a lot of refactoring because your product needs to work with a large set of data, and the project suffers from long page loads, a large database load, etc.

Think about helping yourself in the future, or perhaps your colleagues, and make some effort to avoid it beforehand.


Written by regquerlyvalueex | Python developer
Published by HackerNoon on 2023/07/07