Python & Data Engineering: Under the Hood of Join Operators

Written by nikagolubeva | Published 2021/09/01
Tech Story Tags: data-engineering | python | sql | learn-python | join-operators | python-tutorials | nested-loops | python-top-story

TLDRAn estimated 2.5 quintillion bytes are generated each day. This makes it difficult to comb through essential data pieces, process them, and extract insights. In order to optimize your queries to big data, you need to develop a profound understanding of how these algorithms work under the hood. In this post, I discuss the algorithms of a nested loop, hash join, and merge join in Python. Nested loop joins support only four logical join operators, including: Inner join* Left outer join, Left semi join and Left anti semi join. Merge join is touted as the most effective of all operators.via the TL;DR App

They say data is the new oil of the 21st century. And despite circulating criticism about excessive data use, there are a plethora of ways in which all this data improves the world.

However, the data creation rate is becoming greater each day. An estimated 2.5 quintillion bytes are generated each day. This makes it difficult to comb through essential data pieces, process them, and extract insights.

That is why we, data specialists, boast a great toolbox of diverse algorithms and structures to effectively get through the data and pull out the valuable pieces. But, in order to optimize your queries to big data, you need to develop a profound understanding of how these algorithms work under the hood.

In this post, I discuss the algorithms of a nested loop, hash join, and merge join in Python.

Assumptions made:

  1. Lists are sorted, it’s an important prerequisite for merge join
  2. We do not include Null values

With that said, let’s rock it with Python.

Stop 1. Nested Loops

In relational databases, the join of nested loops compare rows of different tables, looking for the ones to satisfy the join predicate. It also uses one joining table as an outer input table and the other one as the inner input table.

This process is maintained until all the output rows of the outer table are searched in the inner table. Therefore, nested loop join is the basic physical implementation of joining two tables.

The concept of nested loops is structurally similar to nested if statements. In their essence, nested loops are nothing more than one loop inside another loop. And those can be nested in Python, as they can with other programming languages.

Typically, nested loops go over 2+ loops. In practice, we usually nest up to 3 levels deep, otherwise, this is just ineffective and confusing.

Nested loops are usually constructed in the following way:

As you see, the implementation is quite simple. For each element of the first list, we will go through all the elements of the second list. If we have a match, we will write it in the resulting table.

Two nested loops are enough to pull off this algorithm, that's why they are called so. The most common use case for NL is when you need to handle multiple conditions in a program.

What types of join predicates do nested loop joins support?

Nested loop joins support all join predicates, including equality predicates and inequality predicates.

What are the logical join operators that nested loop join supports?

Nested loop joins support only four logical join operators, including:

  • Inner join
  • Left outer join
  • Left semi join
  • Left anti semi join.

Why use NL join?

Since the effectiveness of a nested loop join depends on the size of the external data multiplied by the size of the internal data, this join is a great fit for relatively small input datasets.

Implementing JL in Python

Now that we’ve studied the basics, let’s put this algorithm in practice using core python. For that, we need two tables that we’ll label as A and B respectively.

In this case, the list refers to the table, whereas lines are used as tuples. Let’s join them (here and throughout) by the first values of the rows, i.e. the first column. Now, we’ll implement the Inner join using the NL algorithm.

We can take it a step further and opt for the Left outer join, since we need to note the rows from the left table that are not included in the join.

Stop 2. Merge Join

Merge join or sort-merge join is touted as the most effective out of all operators. But unlike nested loops, this physical join operator has a number of prerequisites: • There has to be at least one equijoin (join condition containing an equality operator) • The input has to be sorted on the join keys ( the order of the columns should always be the same, e.g. A, B vs A1, B1, instead of B, A vs A1, B1)

Therefore, this algorithm presupposes a two-step process. • First, you need to sort out both outputs. • Second, you need to actually implement the operator, which is the fastest step out of the two. Hence, if both inputs are already sorted, then the merge join algorithm is a perfect choice.

How merge joins work

The merge joins simultaneously reads and compares two sorted inputs, one line per step. It continues comparing the next rows from the second input as long as the values match the first input's value.

If the values aren’t equal, the smaller of the two input values is excluded and the process continues. Since the input is sorted, it is easy to see that the excluded value will be smaller than any of the remaining values in either of the inputs and thus should not participate in the join.

When use the merge join algorithm

Relational database systems may leverage this operator when the joining relations have an index. In this case, there is no need to sort the relation, since the index helps to read the data in the required sorted order.

What are the logical join operators that the merge join supports?

The merge join operator supports all ten logical join operations, including:

  • Inner join
  • Left, right, and full outer join;
  • Left and right semi and anti semi join;
  • Concatenation
  • Union

Once again, it needs at least one equality-based join predicate. Implementing MJ in Python Let's flex our muscles and write an Inner Join using the ML algorithm.

As usual, we can step up the game with a Left outer join.

Stop 3. Hash Join

The algorithms mentioned above may render ineffective in some cases. These cases include insufficient data memory or the absence of pre-sorted data. In those instances, relational databases employ hash joins.

Hash join can match anything you have as long as:

  1. The equality predicate is run for the join
  2. There is enough space in your tempdb

This algorithm consists of two steps.

  1. During the first "Build" step, SQL Server creates a hash table in memory from one of the tables submitted for input (it’s usually the smallest of the two). Hashes are calculated based on the keys of the input data and then stored together with the row in the hash table under the hash bucket. In most cases, there is only one row of data in each bucket.
  2. After the hash table is created, the "Probe" stage begins. During the second step, SQL Server calculates a key hash for each row in the second input table and checks to see if it exists in the hash table created in the first step. If a match is found for this hash, then it checks whether the keys of the row(s) in the hash table and the row(s) from the second table actually match. This check must be performed because of possible collisions.

Why use hash joins?

The main differentiator of the hash join algorithm is that it can use temp DB for joining large datasets. If SQL Server cannot create all of the hashes in memory, it’ll keep the maximum number of buckets based on the remaining memory, while the rest is stored in temp db. What are the logical join operators that hash join supports? The hash match algorithm is one of the most powerful joining algorithms. However, its capacity goes well beyond just that. In particular, it supports a wide range of logical operations, including:

• Inner Join; • Left Outer Join; • Right Outer Join; • Full Outer Join; • Left Semi Join; • Left Anti Semi Join; • Right Semi Join; • Right Anti Semi Join; • Aggregate; • Partial aggregate; • Flow Distinct; • Union.

Implementing HJ in Python Let's look at what the Inner join implementation will look like in this case.

This time, we won’t complicate things, but actually make things easier for a Left outer join by removing verbose parts from the above implementation.

The Bottom Line

Joins are an essential class of operations in databases and data pipelines. Most intricate queries in an SQL database management system include join commands to create a relationship between the tables.

However, it’s not enough to just know the three operators by name. You have to establish a profound understanding of how those algorithms are implemented to proclaim yourself a professional.

On that note, I challenge my inquisitive reader to solve a bonus task and implement a full outer join for all algorithms.

And congrats, you’re now a step closer to mastering your Python skills!


Written by nikagolubeva | Data engineer, python teacher
Published by HackerNoon on 2021/09/01