462 reads

Dynamodb Partition Keys: How They Work Under the Hood

by Praveen MuthukumaranaAugust 8th, 2024

Too Long; Didn't Read

Dynamodb is a distributed system. This means that unlike in traditional Relational Database Management System (RDBMS), all the data is not stored in a single server. In a Dynamodb table, items are grouped into partitions. The partitions in a table are stored physically in different nodes (servers) within the region in which the table was created.

featured image - Dynamodb Partition Keys: How They Work Under the Hood

‘piano keys’ Image created by HackerNoon AI Image Generator

Primary Key

In a relational database, records are uniquely identified by the Primary Key. No two records can have the same primary key.

This concept is the same in Dynamodb as well. The Primary Key of a table is either the Partition key or a combination of the Partition key and the Sort key.

In the table given below, the student ID is the partition key. It is unique for each of the items.

student_id (Partition Key)	name	email	date_of_birth
cb15ffa1-5e86-4c51-b341	John Doe	[email protected]	2002-01-15
cbfe7f85-6b85-4d75-8030	Jane Smith	[email protected]	2000-05-22
2accafed-6a3e-4f10-b16e	Michael Johnson	[email protected]	2001-08-30
93b98d8b-6de3-452e-s	Emily Davis	[email protected]	1999-11-10
93b98d8b-6de3-452e-8	William Brown	[email protected]	2003-03-25

In the below example, we store the content of articles in this table. Since the article content is long, it is broken into multiple parts and stored in multiple table items to prevent exceeding the 400kb limit Dynamodb imposes on item size.

ArticleId (Partition Key)	Part (Sort Key)	Content
101	1	<p>Introduction to DynamoDB with examples and illustrations...</p>
101	2	<p>Setting up your DynamoDB environment with detailed steps...</p>
101	3	<p>Basic operations in DynamoDB such as <strong>CRUD</strong>...</p>
102	1	<p>Understanding different types of databases with pros and cons...</p>
102	2	<p>Benefits of NoSQL databases over SQL databases...</p>
103	1	<p>Advanced DynamoDB features like <em>global tables</em>...</p>
103	2	<p>Performance optimization tips and tricks in DynamoDB...</p>

A single article includes multiple items, so, the partition key cannot be unique. Therefore, a combination of the partition key and the sort key is used to uniquely identify each item. This is called a composite primary key.

Partitions

Dynamodb is a distributed system. This means that unlike in traditional Relational Database Management System (RDBMS), all the data is not stored in a single server. Instead, they are spread across multiple servers.

In a Dynamodb table, items are grouped into partitions. Think of partitions as boxes within the table into which each item is placed. The partitions in a table are stored physically in different nodes (servers) within the region in which the table was created.

To simplify, you put each item of the table into different partitions (think of them as boxes) and then keep those boxes in different locations.

How Dynamodb Stores Items

This is where the partition key comes into the picture. When you add a new item to the table, the partition key goes through a hash function. A hash function is an algorithm that takes in a value as an input and then outputs a hash code. In this case, the hash code is a fixed-size integer. When the same value is input into the hash function multiple times, it will always produce the same hash code.

Each partition in a table is associated with a hash code range.

To simplify, let’s say that a table has the below partitions associated with the given ranges.

Partition 1 — 100 - 200
Partition 2 — 201 - 301
Partition 3 — 302 - 402
Partition 4 — 403 - 503

We are adding this item to the table:

{
  "student_id": "cb15ffa1-5e86-4c51-b341-48d1aed0f84f",
  "name": "John Doe",
  "email": "[email protected]",
  "date_of_birth": "2002-01-15"
}

When the student_id is input to the hash function, the output is 352.

Since this is within 302 - 402, the item will go into Partition 3.

How Dynamodb Retrieves Items

Unlike relational databases, Dynamodb does not offer too much flexibility on how data is queried.

You can only query data using a Partition Key. For example, if you want to query John Doe from the Students table, you can only do so if you know his student_id. To query by any other attribute, you need to create a Global Secondary Index (discussed later). Or perform a scan operation which is computationally very expensive.

This limitation is because items in the table are stored in different partitions that are located at different nodes. Dynamodb needs to know the partition where the item is stored so that it only has to search within that partition. In contrast, performing a scan operation is going through all the items in the table to find the one you are looking for.

Suppose we are querying for John Doe using his student_id which is the partition key. First, Dynamodb will input the student_id into the hash function. The hash function will output the same hash code that was generated when the item was added to the table. By looking at the hash code, Dynamodb can know the partition in which the item is located.

Inside the partition, items are stored using a data structure called a B-Tree where the items are sorted by the partition key if it is unique. If we are using a composite primary key and the partition key is not unique, items will be sorted by the sort key. This allows Dynamodb to search the item within the partition very efficiently with time complexity of O(logn).

Pros and Cons of This Behavior

As mentioned earlier, the limitation of Dynamodb is that queries are not as flexible as in relational databases. When you are designing a Dynamodb table, you have to consider all querying patterns that will be used by your application and decide on the Primary Key pattern and indexes to enable them.

But on the upside, for these query patterns that are supported by your table keys, data can be fetched extremely efficiently compared to an RDBMS. This is because, unlike in an RDBMS, when the Dynamodb table grows in size, the underlying infrastructure can be scaled horizontally by spreading the data across more nodes. The query knows the precise location of each item using the partition key.

Therefore, there will be no change in performance or latency of the table as it grows in size.

How Partition Key Works in a Local Secondary Index (LSI)

Consider this table.

Department (Partition Key)	empId (Sort Key)	Name	Age	Location	DOJ
HR	1001	John Doe	30	New York	2020-05-15 21:35:51
IT	1002	Jane Smith	28	San Francisco	2019-08-10 01:20:55
Finance	1003	Mike Brown	35	Chicago	2018-03-22 08:49:11
Marketing	1004	Linda White	32	Seattle	2021-06-30 21:22:48
IT	1005	James Green	29	Austin	2017-11-12 07:14:55
HR	1006	Emily Davis	27	New York	2022-02-05 21:07:59

There is a requirement to query the employees in a specific department who joined after a specific date.

This can be achieved by creating an LSI with DOJ as the sort key.

Department (Partition Key)	DOJ (Sort Key)	empId	Name (Projected Attribute)
IT	2017-11-12 07:14:55	1005	James Green
Finance	2018-03-22 08:49:11	1003	Mike Brown
IT	2019-08-10 01:20:55	1002	Jane Smith
HR	2020-05-15 21:35:51	1001	John Doe
Marketing	2021-06-30 21:22:48	1004	Linda White
HR	2022-02-05 21:07:59	1006	Emily Davis

Note that the LSI has the same partition key as the associated item in the base table. Therefore, LSI items are stored in the same partition as their base table counterpart. LSI is the same items as the base table sorted differently. In this case, they are sorted by DOJ. Now, you can use this LSI to query for employees who joined within a specific date range.

Note that similar to the base table, the primary key of an LSI should be unique. There cannot be two employees who are from the same department and joined on the exact same date and time!

How Partition Key Works in a Global Secondary Index (GSI)

Consider the same table as above.

Department (Partition Key)	empId (Sort Key)	Name	Age	Location	DOJ
HR	1001	John Doe	30	New York	2020-05-15 21:35:51
IT	1002	Jane Smith	28	San Francisco	2019-08-10 01:20:55
Finance	1003	Mike Brown	35	Chicago	2018-03-22 08:49:11
Marketing	1004	Linda White	32	Seattle	2021-06-30 21:22:48
IT	1005	James Green	29	Austin	2017-11-12 07:14:55
HR	1006	Emily Davis	27	New York	2022-02-05 21:07:59

We have another requirement to query employees by their ID without knowing their department. This query cannot be performed using an LSI since we don’t know the partition key.

So, we create a GSI where the partition key is the empId.

empId (Partition Key)	Department	Name	Age	Location	DOJ
1001	HR	John Doe	30	New York	2020-05-15 21:35:51
1002	IT	Jane Smith	28	San Francisco	2019-08-10 01:20:55
1003	Finance	Mike Brown	35	Chicago	2018-03-22 08:49:11
1004	Marketing	Linda White	32	Seattle	2021-06-30 21:22:48
1005	IT	James Green	29	Austin	2017-11-12 07:14:55
1006	HR	Emily Davis	27	New York	2022-02-05 21:07:59

Since the partition key of the GSI is different from the base table, the items in the GSI will be stored in a partition different from its corresponding item in the base table.

Think of the GSI as a separate copy of the table (with limited attributes that you select from the base table). Same with the LSI, the primary key should be unique.

You can use the GSI to query employees by their ID.

Recap

Primary Key uniquely identifies items in a Dynamodb table. It can be just the partition key if that is unique or a combination of the partition key and sort key if the partition key is not unique.

Based on the partition key of a record, Dynamodb puts it into different partitions. Partitions in a table are stored in different nodes (servers) within the region where the table was created.

In queries, Dynamodb uses the partition key to know the partition in which the item is stored and hence the exact server where the item resides.

Within a partition, items are stored in a sorted order of partition key if it is unique or else the sort key. This allows Dynamodb to efficiently search for an item within the partition.

LSI is like a copy of the table sorted using a different attribute but items still have the same partition key and are stored in the same partition as the corresponding base table item.

GSI is also like a copy of the table but the items have a different partition key, and they are stored in partitions different to the corresponding items in the base table.