MongoDB insights for undecided developers

by 蓮沼貴裕（Takahiro Hasunuma）August 29th, 2017

I have been building web applications around <a href="https://www.mongodb.com/">MongoDB</a> for a couple of years. In this short article I would like to answer some of the recurring questions and misunderstandings most developers have when evaluating it:

What is the licensing?
What does it mean that MongoDB is a NoSQL database?
What about MongoDB performance?

Licensing

Yes, MongoDB is licensed under the Free Software Foundation’s GNU AGPL v3.0 (at the time of writing). In practice, this means that enhancements you make to MongoDB itself must be released to the community, and the source code of any derived work has to be distributed as well. You might wonder whether your application is a derived work, and I must confess I never found a simple definition of the term (the usual criteria are things like not sharing a process address space, or being reasonably replaceable with an alternative). However, in the specific case of MongoDB, the company explicitly recognizes that applications using the database are a separate work. Moreover, the officially supported drivers are released under the Apache License v2.0, a permissive license that does not require you to publish your source code, and your application usually only talks to MongoDB through a driver.

As a consequence, you don’t need to be concerned with the licensing of MongoDB to build your app around it. They even send signed letters confirming this promise to legal departments if there are questions, and they also provide commercial licenses if the signed letter isn’t enough.

Note: although long experience makes me trust this analysis, I am not a lawyer. The view presented here is my personal understanding and is not an official one.

NoSQL

Yes, MongoDB is a NoSQL database. What this word means can be pretty confusing, so I will try to analyse the most common ideas behind it, with a focus on how they apply to MongoDB.

Document-oriented

In traditional SQL databases, data is arranged in the form of tables and rows. Each row has a fixed number of columns that can only store data of a specific type (e.g., Integer, Text, Datetime), which defines the schema of your data. In MongoDB, data is stored in the form of BSON objects that are organized into collections, and usually handled in the form of JSON objects. This makes mapping objects into the database a simple task, normally eliminating anything similar to an object-relational mapping.

Schema-less (really?)

This means you don’t have to tell the database the structure of your data and the primitive types to be used before being able to manage it. It also means you can mix documents with different structures in the same collection. One of the great benefits is that schema migrations become easier (most of the adjustments to the database are transparent and automatic) and rollbacks are unlikely to cause problems. Another great thing is that dynamically extending existing data models with custom attributes at run time is straightforward.

But none of this means you have no schema at all. If it is not explicitly declared, it emerges implicitly from your application logic, or it may be declared in other ways, for example to handle form or data validation. In any case, you still have to tell the database explicitly which indices to create to ensure good performance. Indeed, schema design is the cornerstone of a good database, whether SQL or not. If you do not understand your data and the limitations of your hardware and software, you cannot design a schema effectively.
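To make the idea of an implicit schema concrete, here is a minimal sketch of application-level validation in plain JavaScript. It is only an illustration: `measurementSchema` and `validate` are hypothetical names, not MongoDB or driver APIs, and the document shape (a numeric `value` and a `timestamp`) is an assumption.

```javascript
// Hypothetical application-level schema: one predicate per required field.
// Nothing here is a MongoDB API; it is plain JavaScript run before an insert.
const measurementSchema = {
  value: (v) => typeof v === "number" && v >= 0 && v <= 100,
  timestamp: (v) => v instanceof Date && !Number.isNaN(v.getTime()),
};

// Returns the names of the schema fields that are missing or invalid.
// Fields not listed in the schema are accepted, so documents stay extensible.
function validate(doc, schema) {
  return Object.keys(schema).filter((field) => !schema[field](doc[field]));
}

const ok = { value: 42.5, timestamp: new Date(), sensor: "custom attribute" };
const bad = { value: "42.5" };

console.log(validate(ok, measurementSchema));  // []
console.log(validate(bad, measurementSchema)); // [ 'value', 'timestamp' ]
```

The database would happily store both documents; it is the application that decides the second one is invalid, which is exactly where the schema lives in this setup.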

Non-relational (really?)

This means that you don’t always have to create a relation between two documents to handle aggregated data structures. In relational databases, the SQL JOIN clause lets you combine rows from two or more tables using a common field. Document-oriented databases such as MongoDB are instead designed to store denormalized data. Ideally, there should be no relationship between collections: if the same data is required in two or more documents, it is repeated in each of them. One of the great benefits is that a single read operation is enough to get all the data.

But you can still create relations and refer to another document if you’d like or have the need:

by ID, in which case you can “populate” it manually with a second query or use DBRefs
by any other field, in which case you can use the $lookup operator

This gives you the flexibility to choose how to handle the relations between your objects on a case-by-case basis.
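The options above can be sketched in a few lines of JavaScript. This is an illustration under assumptions: the `orders` and `customers` collections and the `customerId` field are hypothetical, the in-memory `Map` stands in for a real second query, and the pipeline is only defined here, not executed against a server.

```javascript
// Option 1: denormalized, the customer data is embedded (repeated) in the order.
const embeddedOrder = { _id: 1, total: 30, customer: { _id: 7, name: "Ada" } };

// Option 2: the order only refers to the customer through a field.
const referencedOrder = { _id: 2, total: 12, customerId: 7 };

// Stand-in for the "customers" collection (assumption, for illustration only).
const customers = new Map([[7, { _id: 7, name: "Ada" }]]);

// Manual "populate": a second query fetches the referenced document.
// findCustomerById stands in for something like
// db.customers.findOne({ _id: id }) in the shell.
function populate(order, findCustomerById) {
  return Object.assign({}, order, {
    customer: findCustomerById(order.customerId),
  });
}

const populated = populate(referencedOrder, (id) => customers.get(id));
console.log(populated.customer.name); // Ada

// Server-side alternative: an aggregation pipeline with a $lookup stage.
// The matching customers are returned as an array in the "customer" field.
const lookupPipeline = [
  {
    $lookup: {
      from: "customers",
      localField: "customerId",
      foreignField: "_id",
      as: "customer",
    },
  },
];
// In the mongo shell this would be run as: db.orders.aggregate(lookupPipeline)
```

Embedding costs some duplication but needs a single read, while referencing keeps one copy of the data at the price of a second query or a `$lookup`.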

Performance

Read/Write

Yes, MongoDB, like any other “true” database, is made to handle a huge volume of data. In a nutshell, hundreds or thousands of objects are nothing for a database, and you don’t have to worry with such numbers. You can find a lot of benchmarks around; here is a simple one to give you a rough order of magnitude. The stored documents are really simple and typically represent a time-stamped measurement:

{
    value: random(0,100),
    timestamp: date
}

Because of the way MongoDB delegates memory management to the operating system, having more complex documents (typically containing tens of attributes) does not affect the results significantly.

Both attributes have been indexed, and MongoDB automatically adds and indexes the unique document ID. I tested three requests:

find the maximum value of the collection using the aggregation framework
find the 100 greatest values greater than 99.9
get a single document by ID

The “maximum” request does not benefit from indexes because of the aggregation, while the “greater than” and “by ID” requests can use them. You will see how important this is for performance.
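For readers who want to picture the three requests, here is one possible way to express them. This is a sketch and not the scripts used for the test: the collection name `measurements` is an assumption, and the definitions below are plain objects that only become queries once passed to the shell or a driver.

```javascript
// 1. Maximum value of the collection using the aggregation framework.
const maxPipeline = [{ $group: { _id: null, max: { $max: "$value" } } }];

// 2. The 100 greatest values greater than 99.9 (can use the index on "value").
const topValues = {
  filter: { value: { $gt: 99.9 } },
  sort: { value: -1 }, // descending, greatest first
  limit: 100,
};

// 3. A single document by ID (uses the automatic index on "_id").
const byId = (id) => ({ _id: id });

console.log(JSON.stringify(maxPipeline));
console.log(JSON.stringify(topValues));

// In the mongo shell, assuming a "measurements" collection:
//   db.measurements.createIndex({ value: 1 })
//   db.measurements.createIndex({ timestamp: 1 })
//   db.measurements.aggregate(maxPipeline)
//   db.measurements.find(topValues.filter).sort(topValues.sort).limit(topValues.limit)
//   db.measurements.findOne(byId(someId))
```

Appending `.explain()` to the `find` call in the shell is a simple way to check whether a given request actually uses an index.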

The test configuration was MongoDB 3.4.1 64-bit, Windows 7 Pro SP1, Core i7-4712HQ CPU at 2.3 GHz, 16 GB RAM and an SSD. The test results were the following:

<a href="https://medium.com/media/e15a83a701c27a1a936ae11628f9b2b2/href">https://medium.com/media/e15a83a701c27a1a936ae11628f9b2b2/href</a>

So if you build the correct indices, querying a billion documents is still fast enough for most applications on a single server. If required, you can increase performance further using sharding.

Here are the scripts used to create/query the database for this test:

<a href="https://medium.com/media/a02e2ed27d18849d2b6d7efcebaf8b35/href">https://medium.com/media/a02e2ed27d18849d2b6d7efcebaf8b35/href</a>
<a href="https://medium.com/media/e9f25370e12caf8c6b19fa13234aea10/href">https://medium.com/media/e9f25370e12caf8c6b19fa13234aea10/href</a>

And the run commands:

// Launch server
./mongod --dbpath "C:\Program Files\MongoDB\Server\3.4\data" --port 27018
// Insertion example for 1e7 documents
./mongo --port 27018 --eval "var arg1=10000000" create_collection.js
// Requests
./mongo --port 27018 --eval "" query_collection.js

Memory

Yes, MongoDB often looks like it uses all available RAM. It relies on pluggable storage engines: WiredTiger is the default starting with MongoDB 3.2, and MMAPv1 was the default in earlier versions. Both behave similarly with respect to the file system cache: they automatically use all free memory that is not used by the engine cache or by other processes, which makes sense if you want maximum performance. So system resource monitors often show that MongoDB uses a lot of memory, but this usage is dynamic. If another process suddenly needs half the server’s RAM, MongoDB will yield cached memory to it.

As a consequence, the single parameter you can tune to optimize memory usage is the engine cache size. For example, by default the WiredTiger engine uses 50% of RAM minus 1 GB, which can be pretty large on servers with a lot of memory. This can even cause trouble if you use containers with limited memory, so find the right balance for your use case.
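As a rough worked example, the default cache size can be estimated as follows. This sketch assumes the rule documented for MongoDB 3.4, namely the larger of 50% of (RAM minus 1 GB) and 256 MB; the function name is hypothetical and the rule may differ in other versions.

```javascript
const MB = 1024 ** 2;
const GB = 1024 ** 3;

// Estimated default WiredTiger cache size in bytes, assuming the 3.4 rule:
// the larger of 50% of (RAM - 1 GB) and 256 MB.
function defaultWiredTigerCacheBytes(totalRamBytes) {
  return Math.max(0.5 * (totalRamBytes - GB), 256 * MB);
}

// The 16 GB test machine used in this article:
console.log(defaultWiredTigerCacheBytes(16 * GB) / GB); // 7.5

// A small container with 1 GB falls back to the 256 MB floor:
console.log(defaultWiredTigerCacheBytes(1 * GB) / MB); // 256
```

To override the default, the cache size can be set with the `--wiredTigerCacheSizeGB` option of `mongod` or with `storage.wiredTiger.engineConfig.cacheSizeGB` in the configuration file.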