Throughout the hiring process, I had the pleasure of meeting many talented candidates for my team. Since our work involves handling complex data sets, it was important for me to gauge each candidate's ability to find smart solutions. I asked about their experience with window functions in SQLto assess their proficiency. While most of them knew about these functions, few were able to use them effectively.
Even though window functions have been around for almost 20 years, lots of SQL developers still find them difficult to grasp. It's not uncommon for even experienced developers to just copy and paste code from StackOverflow without really understanding what it does. This article is here to help! I'll explain window functions in a way that's easy to understand and provide examples to show you how they work in the real world.
Have you heard about window functions? They're awesome analytical tools that can solve a lot of problems. For instance, let's say you need to calculate a set of rows that share a common attribute, like a client ID. This is where window functions come in handy! They work like aggregate functions but let you keep the uniqueness of each row instead of grouping them together. Plus, the results of window functions show up as an extra field in the output selection. This is super helpful when you're making analytical reports, computing moving averages and running totals, or figuring out different attribution models.
Welcome to the world of SQL and window functions! If you're just starting out, you're in the right place. This article is beginner-friendly, with clear explanations and no complicated terminology or advanced concepts. You'll be able to follow along with ease, even if you're completely new to the topic.
There are three main types of functions to which window functions can be applied over a set of rows (a so-called window): these are aggregate, ranking, and value functions. In the image below, you can see the names of different functions that fall into each category.
These perform mathematical operations on a data group, resulting in a single cumulative value. They are used to calculate various aggregates, including the average, the total number of rows, the maximum or minimum values, or the total sum within each window or partition.
SUM: adds up all the values in the column
COUNT: calculates the number of values in the column, excluding NULL values
AVG: finds the average value in the column
MAX: identifies the highest value in the column
MIN: identifies the lowest value in the column
These are used to give each row in a partition a rank or order. This is done by evaluating specific criteria, like assigning sequential numbers or basing the ranking on specific values.
These make it easy to compare values between different rows in a group and also let you compare values with the first or last value in that group. This means you can easily move through different rows in a window and check out values at the beginning or at the end of the window.
To get started with the window functions, let's create a hypothetical 'salary' table and fill it with data.
Table creation:
create table salary
(
employee_id smallint,
employee_name varchar(10),
department varchar(20),
city varchar(20),
salary real
)
Filling the table:
insert into salary values ( 1 ,'Tony' ,'R&D', 'New York', 3000);
insert into salary values ( 2 ,'James' ,'Project management', 'London', 4000);
insert into salary values ( 3 ,'Dina' ,'Engineering', 'Tokyo', 5000);
insert into salary values ( 4 ,'Annie' ,'Security', 'London', 3000);
insert into salary values ( 5 ,'Tom' ,'R&D', 'New York', 3500);
insert into salary values ( 6 ,'Stan' ,'Project management', 'New York', 4200);
insert into salary values ( 7 ,'Jessa' ,'Sales', 'London', 5300);
insert into salary values ( 8 ,'Ronnie' ,'R&D', 'Tokyo', 2900);
insert into salary values ( 9 ,'Andrew' ,'Engineering', 'New York', 1500);
insert into salary values (10,'Dean' ,'Sales', 'Tokyo', 3700)
Let's check if we have filled the 'salary' table successfully:
select *
from salary
The next query will show the names and salaries of the employees from our table:
select employee_name, salary
from salary
Calculating the sum of salaries, average salary, maximum, minimum, and number of rows are some common use cases of aggregate functions:
When an aggregate function is applied, the salaries are aggregated and shown in one line.
But what if we want to display the names and salaries of the employees from the 'salary' table, and in the third column, the sum of all salaries? This value must be the same for all rows.
It’s a great opportunity to use a window function!
select
employee_name,
salary,
sum(salary) over() as sum_salary
from salary
Let's take a closer look at the window function that calculates the sum of the salaries in each row ofsum(salary) over()
.
The over()
expression defines a window or a set of rows over which the function operates. In our example, the window is the entire table, meaning the function will apply across all the rows.
The over()
expression only works when paired with functions that were requested before over()
.
For example, sum(salary) over()
, where sum()
is an aggregate function. And the whole expression sum(salary) over()
is an aggregate window function.
As I said before, all functions to which window functions are applied can be divided into three groups: aggregate, ranking, and value functions.
Aggregate functions sum()
, count()
, avg()
, min()
, max()
together with the over()
expression make a group of aggregate window functions.
In this article, we'll concentrate on this specific type of window functions.
Back to the examples!
Let's request the names of employees; their salaries; the sum of all salaries; average, maximum, and minimum salary; the number of employees.
select
employee_name,
salary,
sum(salary) over(),
avg(salary) over(),
max(salary) over(),
min(salary) over(),
count(*) over()
from salary
Now that it’s clearer what window functions are, let's explore some cases where they can be useful in your work.
select
employee_name,
salary,
sum(salary)over(),
salary/sum(salary)over() as share
from salary
order by salary/sum(salary)over() desc
We have calculated the percentage of the total salary budget for each salary in the fourth column. Jessa's salary amounts to nearly 15% of the whole budget for salaries.
Note that we also placed the formula that calculates percentages salary/sum(salary)over()
in the sort after order by
. A window function can be found not only in the output select
, but also in sorting order by
.
Another example: let's compare the salaries with the average salary for the company.
select
employee_name,
salary,
avg(salary)over(),
salary-avg(salary)over() as diff_salary
from salary
order by salary-avg(salary)over()
As we can see, Andrew's salary is 2110 less than average, and Jessa's is 1690 above the average.
Let's request three columns: employee name, department, and salary. Also, we'll sort them by department.
select
employee_name,
department,
salary
from salary
order by department
Now we'll request the same three columns, plus a column with the sum of the salaries of all employees. You already know that it can be done with a window function.
select
employee_name,
department,
salary,
sum(salary)over()
from salary
order by department
But what if we want to request not the sum of all salaries, but the sum of salaries for each department, as shown in the last column:
We can do it by adding the partition by
parameter to the over()
expression:
select
employee_name,
department,
salary,
sum(salary)over(),
sum(salary)over(partition by department)
from salary
order by department
Partition by
allows us to apply the window function not to all rows (the entire window), but to column sections.
Doesn't it look like a simple grouping? To calculate the sum of salaries for each department, we would make a grouping by departments (sections in the slang of window functions) and calculate the amount:
select department, sum(salary)
from salary
group by department
In essence, the difference between grouping and partition by
is that group by
returns one row per group, while partition by
, although the results of the function are identical to the results of an aggregate function with group by
, provides all the rows with the aggregate function based on a group.
Let's go back to the window functions:
select
employee_name,
department,
salary,
sum(salary)over(partition by department),
salary/sum(salary)over(partition by department) as shape
from salary
order by department
Using the window function, particularly the partition by
parameter, we can calculate the share of each employee's salary from the sum of the department's salaries. Or, for example, to compare salaries with the average salary in the department.
To recap:
A window function performs a calculation across a set of rows that are somehow related to the current row,
Main types of functions to which window functions are applied are aggregate, ranking and value functions,
To use a window function, you need to apply the over()
clause that defines a window (a set of rows) within a query result set. The window function then calculates a value for each row in the window,
To specify the column for which you want to perform aggregation, you need to add the partition by
clause to the over()
clause. Partition by
is somewhat similar to grouping but returns all the rows with the aggregate function applied, instead of one row per group.
This is it for now! In the next couple of articles, I’m going to explore more advanced SQL concepts with simple examples suitable for beginners, so stay tuned!