In a relational database, records are uniquely identified by the Primary Key. No two records can have the same primary key.
This concept is the same in Dynamodb as well. The Primary Key of a table is either the Partition key or a combination of the Partition key and the Sort key.
In the table given below, the student ID is the partition key. It is unique for each of the items.
student_id (Partition Key) |
name |
|
date_of_birth |
---|---|---|---|
cb15ffa1-5e86-4c51-b341 |
John Doe |
2002-01-15 | |
cbfe7f85-6b85-4d75-8030 |
Jane Smith |
2000-05-22 | |
2accafed-6a3e-4f10-b16e |
Michael Johnson |
2001-08-30 | |
93b98d8b-6de3-452e-s |
Emily Davis |
1999-11-10 | |
93b98d8b-6de3-452e-8 |
William Brown |
2003-03-25 |
In the below example, we store the content of articles in this table. Since the article content is long, it is broken into multiple parts and stored in multiple table items to prevent exceeding the 400kb limit Dynamodb imposes on item size.
ArticleId (Partition Key) |
Part (Sort Key) |
Content |
---|---|---|
101 |
1 |
<p>Introduction to DynamoDB with examples and illustrations...</p> |
101 |
2 |
<p>Setting up your DynamoDB environment with detailed steps...</p> |
101 |
3 |
<p>Basic operations in DynamoDB such as <strong>CRUD</strong>...</p> |
102 |
1 |
<p>Understanding different types of databases with pros and cons...</p> |
102 |
2 |
<p>Benefits of NoSQL databases over SQL databases...</p> |
103 |
1 |
<p>Advanced DynamoDB features like <em>global tables</em>...</p> |
103 |
2 |
<p>Performance optimization tips and tricks in DynamoDB...</p> |
A single article includes multiple items, so, the partition key cannot be unique. Therefore, a combination of the partition key and the sort key is used to uniquely identify each item. This is called a composite primary key.
Dynamodb is a distributed system. This means that unlike in traditional Relational Database Management System (RDBMS), all the data is not stored in a single server. Instead, they are spread across multiple servers.
In a Dynamodb table, items are grouped into partitions. Think of partitions as boxes within the table into which each item is placed. The partitions in a table are stored physically in different nodes (servers) within the region in which the table was created.
To simplify, you put each item of the table into different partitions (think of them as boxes) and then keep those boxes in different locations.
This is where the partition key comes into the picture. When you add a new item to the table, the partition key goes through a hash function. A hash function is an algorithm that takes in a value as an input and then outputs a hash code. In this case, the hash code is a fixed-size integer. When the same value is input into the hash function multiple times, it will always produce the same hash code.
Each partition in a table is associated with a hash code range.
To simplify, let’s say that a table has the below partitions associated with the given ranges.
We are adding this item to the table:
{
"student_id": "cb15ffa1-5e86-4c51-b341-48d1aed0f84f",
"name": "John Doe",
"email": "[email protected]",
"date_of_birth": "2002-01-15"
}
When the student_id
is input to the hash function, the output is 352.
Since this is within 302 - 402
, the item will go into Partition 3.
Unlike relational databases, Dynamodb does not offer too much flexibility on how data is queried.
You can only query data using a Partition Key. For example, if you want to query John Doe
from the Students table, you can only do so if you know his student_id
. To query by any other attribute, you need to create a Global Secondary Index (discussed later). Or perform a scan operation which is computationally very expensive.
This limitation is because items in the table are stored in different partitions that are located at different nodes. Dynamodb needs to know the partition where the item is stored so that it only has to search within that partition. In contrast, performing a scan operation is going through all the items in the table to find the one you are looking for.
Suppose we are querying for John Doe
using his student_id
which is the partition key. First, Dynamodb will input the student_id
into the hash function. The hash function will output the same hash code that was generated when the item was added to the table. By looking at the hash code, Dynamodb can know the partition in which the item is located.
Inside the partition, items are stored using a data structure called a B-Tree where the items are sorted by the partition key if it is unique. If we are using a composite primary key and the partition key is not unique, items will be sorted by the sort key. This allows Dynamodb to search the item within the partition very efficiently with time complexity of O(logn).
As mentioned earlier, the limitation of Dynamodb is that queries are not as flexible as in relational databases. When you are designing a Dynamodb table, you have to consider all querying patterns that will be used by your application and decide on the Primary Key pattern and indexes to enable them.
But on the upside, for these query patterns that are supported by your table keys, data can be fetched extremely efficiently compared to an RDBMS. This is because, unlike in an RDBMS, when the Dynamodb table grows in size, the underlying infrastructure can be scaled horizontally by spreading the data across more nodes. The query knows the precise location of each item using the partition key.
Therefore, there will be no change in performance or latency of the table as it grows in size.
Consider this table.
Department (Partition Key) |
empId (Sort Key) |
Name |
Age |
Location |
DOJ |
---|---|---|---|---|---|
HR |
1001 |
John Doe |
30 |
New York |
2020-05-15 21:35:51 |
IT |
1002 |
Jane Smith |
28 |
San Francisco |
2019-08-10 01:20:55 |
Finance |
1003 |
Mike Brown |
35 |
Chicago |
2018-03-22 08:49:11 |
Marketing |
1004 |
Linda White |
32 |
Seattle |
2021-06-30 21:22:48 |
IT |
1005 |
James Green |
29 |
Austin |
2017-11-12 07:14:55 |
HR |
1006 |
Emily Davis |
27 |
New York |
2022-02-05 21:07:59 |
There is a requirement to query the employees in a specific department who joined after a specific date.
This can be achieved by creating an LSI with DOJ as the sort key.
Department (Partition Key) |
DOJ (Sort Key) |
empId |
Name (Projected Attribute) |
---|---|---|---|
IT |
2017-11-12 07:14:55 |
1005 |
James Green |
Finance |
2018-03-22 08:49:11 |
1003 |
Mike Brown |
IT |
2019-08-10 01:20:55 |
1002 |
Jane Smith |
HR |
2020-05-15 21:35:51 |
1001 |
John Doe |
Marketing |
2021-06-30 21:22:48 |
1004 |
Linda White |
HR |
2022-02-05 21:07:59 |
1006 |
Emily Davis |
Note that the LSI has the same partition key as the associated item in the base table. Therefore, LSI items are stored in the same partition as their base table counterpart. LSI is the same items as the base table sorted differently. In this case, they are sorted by DOJ. Now, you can use this LSI to query for employees who joined within a specific date range.
Note that similar to the base table, the primary key of an LSI should be unique. There cannot be two employees who are from the same department and joined on the exact same date and time!
Consider the same table as above.
Department (Partition Key) |
empId (Sort Key) |
Name |
Age |
Location |
DOJ |
---|---|---|---|---|---|
HR |
1001 |
John Doe |
30 |
New York |
2020-05-15 21:35:51 |
IT |
1002 |
Jane Smith |
28 |
San Francisco |
2019-08-10 01:20:55 |
Finance |
1003 |
Mike Brown |
35 |
Chicago |
2018-03-22 08:49:11 |
Marketing |
1004 |
Linda White |
32 |
Seattle |
2021-06-30 21:22:48 |
IT |
1005 |
James Green |
29 |
Austin |
2017-11-12 07:14:55 |
HR |
1006 |
Emily Davis |
27 |
New York |
2022-02-05 21:07:59 |
We have another requirement to query employees by their ID without knowing their department. This query cannot be performed using an LSI since we don’t know the partition key.
So, we create a GSI where the partition key is the empId
.
empId (Partition Key) |
Department |
Name |
Age |
Location |
DOJ |
---|---|---|---|---|---|
1001 |
HR |
John Doe |
30 |
New York |
2020-05-15 21:35:51 |
1002 |
IT |
Jane Smith |
28 |
San Francisco |
2019-08-10 01:20:55 |
1003 |
Finance |
Mike Brown |
35 |
Chicago |
2018-03-22 08:49:11 |
1004 |
Marketing |
Linda White |
32 |
Seattle |
2021-06-30 21:22:48 |
1005 |
IT |
James Green |
29 |
Austin |
2017-11-12 07:14:55 |
1006 |
HR |
Emily Davis |
27 |
New York |
2022-02-05 21:07:59 |
Since the partition key of the GSI is different from the base table, the items in the GSI will be stored in a partition different from its corresponding item in the base table.
Think of the GSI as a separate copy of the table (with limited attributes that you select from the base table). Same with the LSI, the primary key should be unique.
You can use the GSI to query employees by their ID.