In this post, I will share the real-world experience I gained while working with hundreds of millions of records in MongoDB.

Don't Store All Data in a Single Mongo Collection

This was the worst mistake we made, and it became the root cause of all our issues. We used to store 500 million documents with a complex structure in a single collection, which caused the following side effects:

Index creation takes a very long time.
If the collection is dropped accidentally, all data is gone.
Queries take longer than usual, even with indexes.
Counting with filters times out because of the large number of documents scanned.

There will probably be even more issues as the data grows daily.

Bad Document Structure

Invest some time in defining the document structure precisely. This critical part rarely gets attention at the start of a project. Our mistake was omitting a field entirely from some documents instead of assigning it a default value. That led to slow queries, because they had to check whether the field exists. MongoDB supports partial indexes, but in our experience they do not work well when field existence is checked. The example below assigns a null value to phone instead of excluding it from the document completely.

{
  "_id": "507f191e810c19729de860ea",
  "name": "John Doe",
  "age": 30,
  "email": "john.doe@example.com",
  "phone": null, // assign a default value instead of removing the field
  "isActive": true,
  "tags": ["user", "active", "new"],
  "registeredOn": "2024-04-19T15:00:00Z"
}

A {$exists: true} filter is slow, so consider adding a default value and indexing the field for fast query results.

Use Bulk Operations

Bulk operations execute multiple write operations (inserts, updates, deletes) efficiently in a single command. This can significantly reduce the number of round trips between your application and the MongoDB server, which improves performance, especially with large volumes of data.

Since we were processing millions of messages from RabbitMQ queues, we had to handle them in batches and make one bulk DB operation per batch. The pymongo documentation describes how to use bulk operations for better write performance.

Use Indexes

Indexing plays a crucial part in DB performance. Create the indexes your collection needs based on the filters used in your codebase. This ties back to the section on document structure: instead of removing a field completely, assign a default value so that queries on it can benefit from an index.

Instead of Count, Use Aggregation Pipeline

If you need a count of documents matching a specific filter, I wouldn't recommend using count(). Instead, combine the $match and $count aggregation stages for fast performance. $match filters the documents first, so only the documents that meet the specified conditions are passed down to the $count stage. This avoids counting documents that the filter would exclude anyway.

Wrap Up

That's everything I have experienced so far with MongoDB. I will update the article as I find more hints and best practices for optimizing performance.
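
As a supplement to the "Bad Document Structure" and "Use Bulk Operations" sections, here is a minimal sketch of how batched inserts with default values might look. It is not our production code: the names DEFAULTS, with_defaults, batched, and insert_in_batches are hypothetical, the batch size of 1000 is an arbitrary assumption, and the collection argument is assumed to be a pymongo Collection (only its insert_many method is used).

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List

# Hypothetical defaults for optional fields. Only "phone" comes from the
# example document in the article; extend this for your own schema.
DEFAULTS: Dict[str, Any] = {"phone": None}


def with_defaults(doc: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of doc in which every optional field is present."""
    # Values already set in doc win over the defaults.
    return {**DEFAULTS, **doc}


def batched(items: Iterable[Dict[str, Any]], size: int) -> Iterator[List[Dict[str, Any]]]:
    """Yield lists of at most `size` items until the input is exhausted."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def insert_in_batches(collection: Any, messages: Iterable[Dict[str, Any]],
                      batch_size: int = 1000) -> int:
    """Send one insert_many call per batch and return the number of documents sent."""
    sent = 0
    for batch in batched(messages, batch_size):
        docs = [with_defaults(message) for message in batch]
        # ordered=False asks the server to attempt every document in the
        # batch instead of stopping at the first failure.
        collection.insert_many(docs, ordered=False)
        sent += len(docs)
    return sent
```

With a real connection, the call might look like insert_in_batches(db.users, messages), where db.users is a hypothetical collection and messages is whatever iterable your queue consumer produces. For mixed inserts, updates, and deletes, pymongo's bulk_write is the more general tool.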
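
As a supplement to the "Instead of Count, Use Aggregation Pipeline" section, the $match and $count combination can be sketched as follows. The helper names count_pipeline and count_matching and the output field name "total" are assumptions made for illustration, and the collection argument is assumed to be a pymongo Collection (only its aggregate method is used).

```python
from typing import Any, Dict, List


def count_pipeline(filter_doc: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Build a two-stage pipeline: filter first, then count what is left."""
    # "total" is an arbitrary name for the field that holds the count.
    return [{"$match": filter_doc}, {"$count": "total"}]


def count_matching(collection: Any, filter_doc: Dict[str, Any]) -> int:
    """Run the pipeline and return the number of matching documents."""
    cursor = collection.aggregate(count_pipeline(filter_doc))
    # $count emits no document at all when nothing matches, so fall back to 0.
    result = next(iter(cursor), None)
    return result["total"] if result else 0
```

A call such as count_matching(db.users, {"isActive": True}) would then return the count as a plain integer, with db.users again being a hypothetical collection. An index on the filtered field helps the $match stage just as it helps a regular query.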

Lessons I Learned From Managing Hundreds of Millions of Data in MongoDB
