Full Stack Developer
This article uses Exercise number 10 on the Self JOIN SQL ZOO tutorial to teach self JOINs in a beginner-friendly way.
The only way to understand SQL problems is to understand the data you're dealing with. The whole business of SQL becomes really abstract if you can't visualize the tables somehow. So, the first thing we will do is to understand how the tables in problem no. 10 work. Let's take a look at our database.
I'm not going to teach how to write a SQL query as I'm assuming you already know the basics. Rather, I'm going to show how I go about trying to make sense of databases. I divided my thought process into the following steps.
1. SELECT all items
If our table has millions of rows, a query that returns all items may take a lot of time to load. You can always use LIMIT 50 or something like that so that your table doesn't take forever to load. In our case, SQL ZOO already only returns the 50 first rows of our query, so we don't have to worry about it. Let's see what that looks like.
Alright, id and stop are the keys we used to connect the two tables, that's why they're always the same in every line. We got the name of the stop, the bus number, and the bus company. We learned from an earlier problem that we would usually refer to a bus by their num and company like '124 SMT'. So, by looking at our table we could assume that at the stop in Aberlady we can get the 124 SMT bus.
So far so good, but it took me a lot more time to understand what the pos was. At the beginning of the page, there's a link that takes us to a place where they tell us what every column in the table means. They say pos is the order of the stops, but I still couldn't visualize what it meant just by looking at the table. Now, the good news for us here is that we're not going to need to use pos to solve our problem, but I'll still tell you how you can understand it in the next step.
2. Use ORDER BY to understand what is going on
If we order by the num then by position, we'll see that the '1 LRT' bus, for example, leaves from Leith then goes on a round trip until it gets back to Leith after stopping at 9 other stops. If we go down the table, we see that not all buses go on a round trip as they won't always end up at the same stop where they started.
As I said, we won't be needing to use pos for this exercise but it could be useful if we were trying to find, for example, not only the buses available but also the shortest possible routes.
Ok, now that we got a clearer understanding of what is going on, let's see what the problem wants us to do.
Ok, so, by telling us that we need two buses, we know that there is not a bus that goes straight from Craiglockhart to Lochend. In other words, we need to find the buses that stop at Craiglockhart and have stops in common with the buses that stop at Lochend. Because that's where they can get off and change buses. Ok? I hope that's clear. As I said, it may be hard to visualize it without looking at a map. But hopefully, it's going to get clearer in a second.
As with everything, there is a number of ways we can do things. Keep that in mind as we move to the next step.
3. Never forget that a self JOIN creates a new table
When you do a self JOIN, you might be tempted to try to visualize it as a table somehow joined to itself. But the way to think about it is as two separate tables that look exactly the same. You join the new one to the side of the first one. Because they are exactly the same, any column can become a key. In other words, you can pick what your primary keys are going to be. That's where the fun starts!
Right now, there's only one stop per row, but for the next step, we need another one, our transfer stop. So, here's what we can do: we can connect a new route table to the table we already have ON num AND company, like this:
You see, our r2 table got appended to our r1 table adding up all of their columns. Both the r1 and r2 always have the same num and company (in the same lines) because that's where we joined them.
We already saw that the '1 LRT' line runs through 11 stops. That's why we can see the stop name repeat for eleven rows at each new stop. In other words, from Leith, you can get to any of those stops listed on the far-right column. Let's reduce the number of columns we're showing so we can get a better view of our data.
4. With Self JOINs, think horizontally
It's easy to get a little lost before you're about to filter your data. It helps if you visualize your tables as a horizontal map. On your left, you have the place from where you're leaving: 'Craiglockhart'. To your right, you'll have—I'll show it very soon—your transfer stop, then your destination.
So, let's filter the column name by 'Craiglockhart'.
So, what we're looking at now, if we take into consideration the stop column, is a list of all the stops you can get to leaving from Craiglockhart.
Now, imagine that we can do the same for Lochend, but instead of having it as the departure, we have it as the arrival. Let's see what that would look like.
Now, the stop column is showing all the stops from where we can get on a bus and get to Lochend.
In our query before this one, we saw all the stops we could get to leaving from Craiglockhart. In this one, we have all the stops we can leave from to get to Lochend. In other words, all we have to do is join these two tables together. Again, have that in mind: JOINs join tables horizontally—or to the side. There's something called UNION that stacks tables vertically—on top of each other, but JOINs glue them horizontally. Is that bueno?
So, how do we connect these two queries that we just saw? I will show you the long way first and then a shorter syntax.
Alright, I still haven't removed the unneeded columns, but let's take a look.
We now have four route tables and one stops table—which is showing us the transfer stop. What we have to do now is to filter the stop on r1 by 'Craiglockhart' and the stop in r4 by 'Lochend' and we'll get to our answer. To do that, we could JOIN another stops table at the beginning of our table and append another stops to the end of our table. But, instead, we can use what we learned with subqueries, like so.
We got our solution. How do we know? It says 'too many columns'. So we know we only have to adjust the number of columns being shown. Let's look at the problem header to do so.
It says some of the data is still incorrect. Let's investigate why. If we scroll down our table there's a link we can click that shows us what our answer should look like. Let's do that.
When we take it to compare our answer to theirs, we realize it's probably just a matter of ordering our columns the way they want us to. That might take a little experimenting in case you can't visualize what columns to ORDER BY right away, but in the end, this is what we should end up with.
Yay! We got the smiley face and the solution to our problem. There's only one more thing I'm going to show you which is a shorter syntax for that same solution.
Granted, it's not that much shorter and certainly not as intuitive, but there you have it. It's also important to note that when you use the JOIN ... ON syntax you can always write the conditionals on either the ON clause or using the WHERE.
That's it. I hope it was helpful. As always, please let me know if you have any questions. Thanks and have fun with SQL!