credit Schema Design There is an ideal way to design databases in relational databases — 3rd normal form. In , it’s important to keep data in way that’s conducive to the application using the . You think about MongoDB data application data patterns what pieces of data are used together what pieces of data are used mostly read-only what pieces of data are written all the time In contrast, in relational DBMS — the data is organized in such a way that is agnostic to the application. supports . We can store an array of items, a value for a certain key can be an entire other document. This is going to allow us to for fast access. And that’s important because inside the kernel. Instead if we need to join, we need to join in the application itself. The reason being . This forces us to . We might wish to embed the data directly within the document. There’re . In case of , it’s as important as we think because of embedding. considers in a way. Also it , however are supported within one document. The data needs to be designed in such a way that it supports atomic operations. MongoDB rich documents pre-join/embed data **MongoDB** doesn’t support joins directly joins are very hard to scale think ahead of time about what data you want to use together with other data no constraints MongoDB MongoDB atomicity doesn’t support transactions atomic operations There’s but there’s a good chance that an application is going to have a schema. By having a schema, we mean that every single document in a particular collection is probably going to have a pretty similar structure. There might be small changes to that structure depending on the different versions of your application. Even though it’s not declared ahead of time, it’s important to think about the data structure so that the data schema itself supports all the different features of your application no declared schema Relational Normalization Let’s look at the below denormalized table for a blog posts project. It’s not the 3rd normal form, it’s broken. Let’s say there are multiple posts with same author, we may update a few rows and leave others un-updated. Leaving the table data inconsistent. Hence this violates normalization because it violates a common way to describing normalized tables in 3rd normal form, which is that . And that’s of a play on words for what you say in a US courtroom, telling the truth, the whole truth and nothing but the truth. The key in this case, is the and there is a non-key attribute which does not follow that. Because it does, in fact tell something about the author. And so it violates that 3rd normal form. every non-key attribute in the table must provide a fact about the key, the whole key and nothing but the key Post Id Author Email The above table can be represented in as: MongoDB {id: 'some id',title: 'some title',body: 'some content here',author: {name: 'author name',email: 'author email id'}} Refresher — what are the goals of normalization? — For , it looks like embedding data would mostly cause this. And in fact, we should try to avoid embedding data in documents in which possibly create these anomalies. Occasionally, we might need to duplicate data in the documents for performance reasons. However that’s not the default approach. The default is to avoid it. Frees the database from modification anomalies MongoDB MongoDB — is flexible enough because it allows addition of keys without re-designing all the documents Should minimize re-design when extending MongoDB — this is something, we’re not going to worry about when describing schema in . And one of the ideas behind the is to tune up your database to the applications that we’re trying to write and the problem we’re trying to solve. Avoid bias toward any particular access pattern MongoDB MongoDB One of the great things about relational database is that it is really good at keeping the data consistent within the database. One of the ways it does that is by using foreign keys. A foreign key constraint is that let’s say there’s a table with some column which will have a foreign key column with values from another table’s column. In , there’s no guarantee that foreign keys will be preserved. It’s upto the programmer to make sure that the data is consistent in that manner. This maybe possible in versions of but today, there’s no such option. The alternative for foreign key constraints is . MongoDB future MongoDB embedding data Living Without Transactions Transactions support properties but although there are no transactions in , we do have atomic operations. Well, atomic operations means that when you work on a single document that that work will be completed before anyone else sees the document. They’ll see all the changes we made or none of them. And using atomic operations, you can often accomplish the same thing we would have accomplished using transactions in a relational database. And the reason is that, in a relational database, we need to make changes across multiple tables. Usually tables that need to be joined and so we want to do that all at once. And to do it, since there are multiple tables, we’ll have to begin a transaction and do all those updates and then end the transaction. But with , we’re going to embed the data, since we’re going to it in documents and they’re these rich documents that have hierarchy. We can often accomplish the same thing. For instance, in the blog example, if we wanted to make sure that we updated a blog post atomically, we can do that because we can update the entire blog post at once. Where as if it were a bunch of relational tables, we’d probably have to open a transaction so that we can update the post collection and comments collection. ACID MongoDB MongoDB pre-join So what are our approaches that we can take in to overcome a lack of transactions? MongoDB — restructure the code, so that we’re working within a single document and taking advantage of the atomic operations that we offer within that document. And if we do that, then usually we’re all set. restructure — we can implement locking in software, by creating a critical section. We can build a test, test and set using find and modify. We can build semaphores, if needed. And in a way, that is the way the larger world works anyway. If we think about it, if one bank need to transfer money to another bank, they’re not living in the same relational system. And they each have their own relational databases often. And they’ve to be able to coordinate that operation even though we cannot begin transaction and end transaction across those database systems, only within one system within one bank. So there’s certainly ways in software to get around the problem. implement in software — the final approach, which often works in modern web apps and other applications that take in a tremendous amount of data is to just tolerate a bit of inconsistency. An example would, if we’re talking about a friend feed in Facebook, it doesn’t matter if everybody sees your wall update simultaneously. If okey, if one person’s a few beats behind for a few seconds and they catch up. It often isn’t critical in a lot of system designs that everything be kept perfectly consistent and that everyone have a perfectly consistent and the same view of the database. So we could simply tolerate a little bit of inconsistency that’s somewhat temporary. tolerate , , (within an update) & (within an update) operations operate atomically within a single document. Update findAndModify $addToSet $push One to one relations 1 to 1 relations are relations where each item corresponds to exactly one other item. e.g.: an employee have a resume and vice versa a building have and floor plan and vice versa a patient have a medical history and vice versa //employee{_id : '25',name: 'john doe',resume: 30} //resume{_id : '30',jobs: [....],education: [...],employee: 25} We can model the employee-resume relation by having a collection of employees and a collection of resumes and having the employee point to the resume through linking, where we have an that corresponds to an in th resume collection. Or if we prefer, we can link in another direction, where we have an employee key inside the resume collection, and it may point to the employee itself. Or if we want, we can embed. So we could take this entire resume document and we could embed it right inside the employee collection or vice versa. ID ID This embedding depends upon how the data is being accessed by the application and how frequently the data is being accessed. We need to consider: frequency of access the size of the items — what is growing all the time and what is not growing. So every time we add something to the document, there is a point beyond which the document need to be moved in the collection. If the document size goes beyond 16 , which is mostly unlikely. MB atomicity of data — there’re no transactions in , there’re atomic operations on individual documents. So if we knew that we couldn’t withstand any inconsistency and that we wanted to be able to update the entire employee plus the resume all the time, we may decide to put them into the same document and embed them one way or the other so that we can update it all at once. MongoDB One to Many Relations In this relationship, there is many, many entities or many entities that map to the one entity. e.g.: a city have many persons who live in that city. Say have 8 million people. NYC Let’s assume the below data model: //city{_id: 1,name: 'NYC',area: 30,people: [{_id: 1,name: 'name',gender: 'gender'.....},....8 million people data inside this array....]} This won’t work because that’s going to be REALLY HUGE. Let’s try to flip the head. //people{_id: 1,name: 'John Doe',gender: gender,city: {_id: 1,name: 'NYC',area: '30'.....}} Now the problem with this design is that if there are obviously multiple people living in NYC, so we’ve done a lot of duplication for city data. Probably, the best way to model this data is to use . true linking //people{_id: 1,name: 'John Doe',gender: gender,city: 'NYC'} //city{_id: 'NYC',...} In this case, collection can be linked to the collection. Knowing we don’t have foreign key constraints, we’ve to be consistent about it. So, this is a relation. It requires 2 collections. For small one to few (which is also one to many), relations like blog post to comments. Comments can be embedded inside post documents as an array. people city one to many So, if it’s truly one to many, 2 collections works best with linking. But for one to few, one single collection is generally enough. Many to Many Relations e.g.: books to authors students to teachers The books to authors is a relationship, so we can have either an array of books or authors inside another’s document. Same goes for students to teachers. We could also embed at the risk of duplication. However this will required that each student has a teacher in the system before insertion and vice versa. The application logic may always not allow it. In other words, the parent object must exist for the child object to exist. few to few Multikeys Multikey indexes is the feature because of which linking and embedding works so well. Assume that we’ve 2 schemas for student and teachers. //students{_id: 1,name: 'John Doe',teachers: [1,7,10,23]} //teachers{_id: '10',name:'Tony Stark'} Now there are 2 obvious queries. How to find all the teachers a particular student has had? — this can be searched by looking for teachers key in the students collection How to find all the students whom have had a particular teacher? — this is a little difficult query and requires to use operators. And order for that we need to be efficient, we need use . set Multikey indexes To create an index on column on collection, use . teachers students db.students.ensureIndex({ teachers : 1 }) Now to find all the students whom have had a particular teacher? use the query . Now, if we append to the above query - it shows the internal working by showing where the keys were applied etc. db.students.find( { 'teachers': {$all: [0,1]}} ) .explain() Benefits of Embedding The main reason for embedding documents in is performance. The main performance benefit comes from improved read performance. Now, why do we get read performance. The reason is the nature of the way computer systems are built, which is they often have spinning disks and those spinning disks have a very , which means they take a very long time (upto 1ms) to get to the first byte. But, when the first byte is access, each additional byte comes very quickly. So they tend to be pretty . So the idea is if we can co-locate the data to be used together in the same document, embed it and then we’re going to spin the disk, find the sector where we need this information and then we’re gonna start reading it. And we’re going to get all the information we need in one go. Also it means if we have 2 pieces of data that would normally be in 2 collections or in several relational database tables. Instead, they’re in one document, that we avoid . MongoDB high latency high bandwidth round trips to the database Trees One classic problem from the world of schema design is how do you represent a tree inside the database? Let’s look at the example problem for representing the e-commerce categories in a e-commerce site, such as Amazon. Where we have home, outdoors, winter, snow. And the idea is that we’ve the . Also we’ve a , where we can lookup and see the , some of the properties for the . products category category 7 categoryName category //products{category: 7,productName: 'ABC'} //category{_id: '7',categoryName:'outdoors',parent: 6} One way to do it is to keep a id - this might be something we can do in a simple relational database. But this doesn’t make it easy to find all the s to this . We’ve to iteratively query, find the parent of this and the parent of that until we get all the way to the top. parent parent category So an alternative way to do it in is to be able to list ancestors or children. So, let’s think about that and how that would work. So, we could decide to list all the children of this : MongoDB category //category{_id: '7',categoryName:'outdoors',children: [3,6,7,9]} That’s also fairly limiting, if we’re looking to be able to look and find the entire sub-tree that is above a certain piece of tree. Instead, what works pretty well and again, enable by the ability to put arrays inside is to list the ancestors from the top in order. MongoDB //category{_id: '7',categoryName:'outdoors',ancestors: [3,7,5,8,9]} Again the ability to structure and express rich data is one of the things that makes so interesting. This would be very difficult to do in a relational database. Now, in terms of how you represent the data for something like a hierarchy, again it all depends upon the access patterns. It depends on how we believe we’re going to need to show the data, access the data for the user. And then based on that, we know how to model it. MongoDB product category When to Denormalize One of the reasons for normalization in relational database is to avoid modification anomalies that come with the duplication of data. And when we look at and how it’s structured, allowing these rich documents, it’s easy to assume that what we’re doing is we’re denormalizing the data. And to certain extent, that’s true. As long as we don’t duplicate data, we don’t open ourselves for modification anomalies. MongoDB Generally it’s good to embed that data in case of relationships. In case of relationships, embedding can also work well without duplication of data as long as we embed from . Now, if we go from the one to the many, then linking would avoid the duplication of data. Now, if we want to embed something, even if it causes duplication of data for performance reasons to match the access patterns of the applications. That could make sense, especially if the data is rarely changing or being updated. But we can avoid it often enough, even in this relationship, if we go from the . In relation, which we looked at with and and and , there if you want to avoid the modification anomalies that come with denormalization, all we need to do is link through the arrays of in the documents. one to one one to many many to the one many to one many to many students teachers authors books object ids These are all guidelines. For a real world application, we may need to embed data for performance reasons, to match the data access patterns — — it might be needed to embed the data. Photos Originally published at xameeramir.github.io .