You can read the previous post in this series here.
Applications need to have permanent storage for user or applications specific data. In memory data structures like linked list, arrays are optimised for access by CPU via pointers. Permanent storage is optimised for read/write access by clients/processes connecting to database server. A very important aspect of permanent persistence is data modelling. I will devote this post on how to choose a good data model for your application.
The most famous and prevalent data modelling technique is using relational tables. In relational tables, data is organised into records of a table. Tables are related to one another using primary key foreign key. There are a lot of reasons why your should choose relational table.
Application development is done using object oriented programming. However when data storage is done using tables, there is a translation required from objects to “shredded” relational tables. ORM frameworks provide boilerplate code to reduce the effort required for this translation, but still there is work to be done.
NoSQL solves this problem by representing a record as a self contained JSON document. For example, let’s say we need to store patient demographic information along with his current conditions. One way to represent this record in a NoSQL database can be :
{ “first_name”: “John”, “last_name”: “Doe”, “conditions”:[ {“name” : “T2DM”, “onset”: “12–12–1990”}]}
If you need read profile data for a patient in your app, you need not issue multiple joins as all data is inside one NoSQL document. Typically, if your data model exhibits a tree like, one to many relationships, using a NoSQL database might make more sense.
In the patient example above, what if we wanted to store ICD10 standard codes for conditions instead of name of condition. This is a little troublesome in document based NoSQL databases as they have little support for joins. You can still do a join at the application layer but this will be always suboptimal compared to the joins done at a typical relational database layer. NoSQL databases become less desirable in this case.
Lastly in case your data model does not have a fixed schema, going the NoSQL route might make more sense. Consider, the patient example above and say we need to also store patient’s date of birth. In NoSQL case, we can add a new field, ‘dob’ to new documents. At the application level we can also add code to handle reading old documents without dob field. In a relational database, the solution to handle dob would be to alter schema and make data migrations. Data migrations are slow and require downtime and consequently generally avoided.
Graph database makes a lot of sense when your application’s data model needs to support many ‘many-many’ relationships. The relational model can handle a few many to many relationships, but beyond a point all the relational joins become messy and slow. Using graph databases also provides an added advantage of easily extending relationships between heterogeneous objects.
A graph consists of two kinds of objects — nodes and edges. Nodes contain description of objects or entities. Edges contain description of relationships between nodes. For example, say a person suffers from an allergic reaction because of exposure to an substance. You can model person, allergic reaction and substance as nodes. You can also model relationships between nodes in a graph database. For example a relationship between person and allergy can be a unidirectional relationship “person-has-allergy” from person node to allergy node. Relationship between allergy and and substance can be a unidirectional relationship “triggered-by-exposure-to” from allergy to substance.
Why not do this in relational tables ? Well you can. You can create three tables — person, allergy and substance and set up the appropriate Pk-FK constraints. Like I said earlier, using graph databases makes sense when we have a lot of many-many relationships. For example, let’s say you introduce a location object in the overall scheme of things so that you can capture location of the person where the allergy reaction occurred. Location can be neighbourhood, city, state, country, continent or hemisphere. Basically, location information can be available at various levels of granularity. Using SQL to create a declarative query will be messy. SQL needs to know in advance which joins will be part of the query. In graph database, on the other hand, you can traverse many nodes and edges before arriving at the target node. You can express the fact of traversing a graph once or many time quite concisely using a graph database declarative query language like Cypher, for Neo4j graph database.
Click on the link here to read Fundamentals of System Design — Part 3.