The most general way to satisfy a COUNT DISTINCT or SELECT DISTINCT clause is to scan the whole table or index and extract only the distinct values from it. There are two basic strategies for doing this: Hash Aggregation and Stream Aggregation. Of course, some database engines also support the HyperLogLog algorithm for approximate calculation, but we’ll concentrate on the more traditional approaches in this article. Also, further in this article I will use the terms “group by” or “grouping” to refer to the COUNT DISTINCT operation as well, because the following queries are semantically equivalent:
select count(distinct A) from group_by_table;
select count(*) from (select a from group_by_table group by a) as tmp;
Hash Aggregation is typically employed when a query needs to group values in a non-indexed column. During this process, a hash table is constructed, where the keys represent the unique combinations of column values. As the database engine scans through the rows, it calculates the hash value of each row’s column value and stores the actual values with the same hash in hash buckets. While this method can be memory-intensive for large datasets, it is often the fastest approach when the server has enough memory to hold the hash table.
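To make the idea concrete, here is a minimal Python sketch of the principle behind Hash Aggregation (an illustration only, not how any engine actually implements it):

# Count distinct values by collecting them into a hash table.
# A Python set plays the role of the engine's hash table here.
def count_distinct_hash(rows):
    seen = set()         # hash table keyed by column value
    for value in rows:   # full scan: every row is visited once
        seen.add(value)  # hashing deduplicates in ~O(1) per row
    return len(seen)

print(count_distinct_hash([3, 1, 3, 2, 1]))  # -> 3

The memory footprint grows with the number of distinct values, which is exactly why this strategy gets expensive on large, high-cardinality datasets.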
Stream Aggregation is used when the column to be grouped is already sorted, or nearly sorted. As the data stream comes in, the database engine compares the current row value with the previous one. If the current row belongs to the same group, the database engine continues the aggregation. When the value differs, a new group starts, and the previous group value is passed further in the execution plan. Stream Aggregation uses much less memory compared to Hash Aggregation (only one value needs to be stored), but it requires the data to be sorted, which might involve an extra sort operation if the data is not already sorted.
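Again as an illustration only, here is the same calculation done in the Stream Aggregation style: the input must already be sorted, and only the previous value is kept in memory.

# Count distinct values in a sorted stream by comparing each row
# with the previous one; a difference marks the start of a new group.
def count_distinct_stream(sorted_rows):
    count = 0
    previous = object()        # sentinel that equals no real value
    for value in sorted_rows:  # still a full scan, but O(1) memory
        if value != previous:  # value changed -> a new group starts
            count += 1
            previous = value
    return count

print(count_distinct_stream([1, 1, 2, 3, 3]))  # -> 3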
Both of these strategies, however, still require a full scan of the column. For columns with low cardinality there is a better strategy for the COUNT DISTINCT operation, one that PostgreSQL and MS SQL Server don’t have, but MySQL does.
Loose Scan is an advanced MySQL optimization technique that applies to certain GROUP BY operations, especially in scenarios where relatively few rows need to be processed compared to the total number of rows in the table. It significantly improves query performance by reducing the amount of data that has to be read from the database.
In essence, the Loose Scan technique operates on a simple principle: instead of scanning the entire index for qualifying rows (a 'tight' scan), it 'loosely' scans for the first matching row of each group and, as soon as a match is found, immediately moves on to the next group. This reduces the number of rows that need to be evaluated and thus the total execution time of the query. (When MySQL applies this optimization, EXPLAIN shows "Using index for group-by" in the Extra column.)
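Here is a small Python sketch of the Loose Scan idea. It is a simplification, of course: a real B-tree index is not a sorted Python list, but the skipping behavior is the same, with binary search standing in for the index seek that jumps to the next group.

# Count distinct values over a sorted "index" by seeking past all
# duplicates of the current value instead of reading every entry.
# With k distinct values among n entries this is roughly O(k * log n)
# work instead of O(n).
from bisect import bisect_right

def count_distinct_loose(sorted_index):
    count = 0
    position = 0
    while position < len(sorted_index):
        count += 1
        # seek: skip the rest of the current group in one jump
        position = bisect_right(sorted_index, sorted_index[position], position)
    return count

print(count_distinct_loose([1, 1, 1, 2, 2, 3]))  # -> 3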
Besides MySQL, a similar technique is implemented in some other database engines under the name “Skip Scan”.
Enough theory; let’s move on to the practical part and do a comparative analysis of Loose Scan in MySQL vs. PostgreSQL and Microsoft SQL Server. I will use the latest Docker images with MySQL 8.0.33, PostgreSQL 15.3, and MS SQL Server 2022-CU4 on my laptop. I will create one table with 1 million rows and three integer columns of different cardinality: the first column has 100 thousand unique values, the second 1 thousand, and the third only 10. I will create three separate nonclustered indexes and run COUNT DISTINCT queries against each column. Each query will be executed 5 times just to warm up the databases, then 20 more times with elapsed-time measurement, and we’ll compare the total execution times afterwards. So all database engines are put in a pretty equal situation.
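The measurement loop can be sketched in Python roughly as follows (the run_query helper is hypothetical and stands in for whatever client driver each engine is called through):

# Warm the engine up, then measure the total elapsed time of 20 runs.
import time

def benchmark(run_query, sql, warmup=5, runs=20):
    for _ in range(warmup):
        run_query(sql)  # warm-up executions, not timed
    started = time.perf_counter()
    for _ in range(runs):
        run_query(sql)  # timed executions
    return time.perf_counter() - started  # total elapsed seconds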
Here is the script I used to initialize the sample table in all databases:
-- MySQL
-- helper table with 10,000 sequential values (0..9999)
create table numbers (
    id int not null
);

insert into numbers(id)
with tmp as (
    select a.id + b.id * 10 + c.id * 100 + d.id * 1000 as id
    from (select 0 as id union all select 1 union all select 2 union all select 3 union all select 4 union all select 5 union all select 6 union all select 7 union all select 8 union all select 9) as a
    cross join (select 0 as id union all select 1 union all select 2 union all select 3 union all select 4 union all select 5 union all select 6 union all select 7 union all select 8 union all select 9) as b
    cross join (select 0 as id union all select 1 union all select 2 union all select 3 union all select 4 union all select 5 union all select 6 union all select 7 union all select 8 union all select 9) as c
    cross join (select 0 as id union all select 1 union all select 2 union all select 3 union all select 4 union all select 5 union all select 6 union all select 7 union all select 8 union all select 9) as d
)
select id from tmp;

create table group_by_table (
    id int not null,
    a int not null,
    b int not null,
    c int not null,
    primary key (id)
);

-- 1,000,000 rows; a, b and c have ~100,000, ~1,000 and 10 unique values
insert into group_by_table(id, a, b, c)
with tmp as (
    select a.id + b.id * 10000 as id
    from numbers as a
    cross join numbers as b
)
select
    id,
    floor(rand() * 100000) as a,
    floor(rand() * 1000) as b,
    floor(rand() * 10) as c
from tmp
where id < 1000000;

create index idx_group_by_table_a on group_by_table(a);
create index idx_group_by_table_b on group_by_table(b);
create index idx_group_by_table_c on group_by_table(c);
-- PostgreSQL
create table group_by_table
(
    id int not null,
    a int not null,
    b int not null,
    c int not null,
    primary key (id)
);

-- 1,000,000 rows; a, b and c have ~100,000, ~1,000 and 10 unique values
insert into group_by_table(id, a, b, c)
select
    id,
    floor(random() * 100000) as a,
    floor(random() * 1000) as b,
    floor(random() * 10) as c
from generate_series(1, 1000000, 1) as numbers(id);

create index idx_group_by_table_a on group_by_table(a);
create index idx_group_by_table_b on group_by_table(b);
create index idx_group_by_table_c on group_by_table(c);
-- MS SQL Server
create table group_by_table
(
    id int not null,
    a int not null,
    b int not null,
    c int not null,
    primary key clustered (id)
);

-- 1,000,000 rows; a, b and c have ~100,000, ~1,000 and 10 unique values
with tmp as (
    select row_number() over (order by (select 1)) - 1 as id
    from sys.all_columns as a
    cross join sys.all_columns as b
)
insert into group_by_table(id, a, b, c)
select
    id,
    floor(rand(checksum(newid())) * 100000) as a,
    floor(rand(checksum(newid())) * 1000) as b,
    floor(rand(checksum(newid())) * 10) as c
from tmp
where id < 1000000;

create nonclustered index idx_group_by_table_a on group_by_table(a);
create nonclustered index idx_group_by_table_b on group_by_table(b);
create nonclustered index idx_group_by_table_c on group_by_table(c);
The queries used to compare the performance of the database engines:

select count(*) from (select a from group_by_table group by a) as tmp; -- ~100,000 unique values
select count(*) from (select b from group_by_table group by b) as tmp; -- 1,000 unique values
select count(*) from (select c from group_by_table group by c) as tmp; -- 10 unique values
All databases perform pretty much the same on column A, where the data cardinality is high (100 thousand unique values out of 1 million rows): PostgreSQL’s total execution time was 4.72 seconds, MS SQL Server’s 1.42 seconds, and MySQL’s 3.11 seconds. But on column B, where the cardinality is only 1,000 unique values, MySQL’s total execution time drops to 58.6 milliseconds, while PostgreSQL needs 4.2 seconds and MS SQL Server 1.08 seconds. **MySQL completes the second query 70 times faster than PostgreSQL and 18 times faster than MS SQL Server.** The situation is even better on column C, with only 10 unique values: MySQL takes 12.5 ms, PostgreSQL 3.33 s, and MS SQL Server 1.02 s. **In the last example MySQL is blazingly fast and outperforms PostgreSQL by more than 250 times and MS SQL Server by more than 80 times.**
Below is a visualization of the total execution time over 20 runs, using a logarithmic scale.
While PostgreSQL and MS SQL Server currently lack such an optimization, there is a trick you can use to improve the performance of your SELECT/COUNT DISTINCT queries on these engines. The idea is to perform several small lookups in the index instead of relying on the default full Index Scan.
For example, in PostgreSQL you can write a recursive query to find all unique values. The first iteration selects the minimum value of the column, and each subsequent iteration selects the smallest value greater than the previous one.
with recursive t as (
    -- anchor: the smallest value in the column
    select min(a) as x from group_by_table
    union all
    -- recursive step: the smallest value greater than the previous one
    select (select min(a) from group_by_table where a > t.x)
    from t where t.x is not null
)
select count(*)
from (
    select x from t where x is not null
    -- "a > t.x" never matches NULL, so check for it separately
    union all
    select null where exists (select 1 from group_by_table where a is null)
) as tmp;
We can apply the same trick in MS SQL Server too. Unfortunately, SQL Server recursive CTEs do not support TOP or aggregate operators in the recursive part, so I will use a temporary table to store the results and iterate with a WHILE loop. This, of course, has more overhead, but there seems to be no other generic way to achieve this optimization in SQL Server.
create table #result (x int);
declare @current int;

-- seed: the smallest value in the column
select top (1) @current = a from group_by_table order by a;

while @@rowcount > 0
begin
    insert into #result values (@current);
    -- index seek to the next distinct value
    select top (1) @current = a from group_by_table where a > @current order by a;
end;

select count(*) from #result;
Let’s now compare how these modified queries perform against the originals and against MySQL. I will refer to the modified queries as A1, B1, and C1. Here is the table with the full results.
|               | A     | A1     | B      | B1       | C      | C1      |
|---------------|-------|--------|--------|----------|--------|---------|
| MySQL         | 3.11s |        | 58.6ms |          | 12.5ms |         |
| PostgreSQL    | 4.72s | 6.18s  | 4.2s   | 70.93ms  | 3.33s  | 15.69ms |
| MS SQL Server | 1.42s | 67.67s | 1.08s  | 771.99ms | 1.02s  | 74.66ms |
The results are pretty clear: Loose Scan is a great optimization that significantly reduces the number of rows evaluated for SELECT DISTINCT or COUNT DISTINCT queries when an index can be used. Although PostgreSQL lets you write complex recursive queries that handle low-cardinality columns as effectively as MySQL, it incurs a significant performance penalty on column A, where the cardinality is high. MS SQL Server is clearly the outsider in this example, though the workaround is still better than the original query.
I hope PostgreSQL and MS SQL Server implement a Skip Scan optimization in one of their future versions.