ActiveRecord on MySQL— Iterating over large tables with conditions

In this article, I’m going to demonstrate performance differences between two ways of iterating over the records in a MySQL database table with millions of records. In a high volume analytics system, tables with millions of records are quite common and iterating over the full table or a subset of these tables becomes often necessary — whether it’s to perform computations, run a migration, or create parallelized background jobs on the records. At AirPR, we have many database tables with 100s of millions of records, and it becomes important to write efficient code for iterations because there is often an order of magnitude difference between a good and not-so-good approach.

Find Each Method

The standard approach provided natively by ActiveRecord is the find_eachmethod.

For the purposes of this exercise, I created anemployees table to which I added about 5 million rows of data¹.

There is also a salaries table with the following columns that stores the salaries of those employees across different time ranges. This table contains about 3 million records.

Let us measure the performance of iterating through this table using find_each

DEFAULT_BATCH_SIZE = 1000
time = Benchmark.realtime do
Employee.select(:emp_no, :first_name, :last_name).
find_each(batch_size: DEFAULT_BATCH_SIZE) do |employee|
end
end
=> 100.6963519999990

The underlying queries that ActiveRecord makes look like this:

Employee Load (2.1ms)  SELECT  `employees`.`emp_no`, `employees`.`first_name`, `employees`.`last_name` FROM `employees`  ORDER BY `employees`.`emp_no` ASC LIMIT 1000
Employee Load (1.9ms) SELECT `employees`.`emp_no`, `employees`.`first_name`, `employees`.`last_name` FROM `employees` WHERE (`employees`.`emp_no` > 11000) ORDER BY `employees`.`emp_no` ASC LIMIT 1000
Employee Load (1.8ms) SELECT `employees`.`emp_no`, `employees`.`first_name`, `employees`.`last_name` FROM `employees` WHERE (`employees`.`emp_no` > 12000) ORDER BY `employees`.`emp_no` ASC LIMIT 1000

...
Employee Load (1.3ms)  SELECT  `employees`.`emp_no`, `employees`.`first_name`, `employees`.`last_name` FROM `employees` WHERE (`employees`.`emp_no` > 5127997)  ORDER BY `employees`.`emp_no` ASC LIMIT 1000

Notice how ActiveRecord keeps track of theidfrom the previous iteration and uses it in a where condition in the next one. This is called value based pagination and is generally the preferred approach for pagination (over other methods like offset based pagination)².

ID Iterator Method

I propose we try a different iterating technique now:

time = Benchmark.realtime do
first_id = Employee.first.id
last_id = Employee.last.id
(first_id..last_id).step(DEFAULT_BATCH_SIZE).each do |value|
    Employee.where('employees.emp_no >= ?', value).
where('employees.emp_no < ?', value + DEFAULT_BATCH_SIZE).
order('employees.emp_no ASC').
select(:emp_no, :first_name, :last_name).each do |employee|
end
  end
end
=> 101.34066200000234

In this method, we divvy up the total number of rows into batches using where conditions on the primary key to iterate through all the records in the table. Notice how the performance is practically the same between the two methods. This is how the underlying queries look:

Employee Load (1.1ms)  SELECT  `employees`.* FROM `employees`  ORDER BY `employees`.`emp_no` ASC LIMIT 1
Employee Load (1.1ms) SELECT `employees`.* FROM `employees` ORDER BY `employees`.`emp_no` DESC LIMIT 1
Employee Load (1.5ms) SELECT `employees`.`emp_no`, `employees`.`first_name`, `employees`.`last_name` FROM `employees` WHERE (employees.emp_no > 10001) AND (employees.emp_no <= 11001)
Employee Load (1.9ms) SELECT `employees`.`emp_no`, `employees`.`first_name`, `employees`.`last_name` FROM `employees` WHERE (employees.emp_no > 11001) AND (employees.emp_no <= 12001)
...
Employee Load (1.8ms)  SELECT `employees`.`emp_no`, `employees`.`first_name`, `employees`.`last_name` FROM `employees` WHERE (employees.emp_no >= 5128001) AND (employees.emp_no < 5129001)

This approach works best if the ids are in order because the iteration wouldn’t have to iterate & skip a lot of missing records in that case³.

Iterating with joins

Now, let’s compare performance of these two methods when we add some more complexity to the query.

In this new scenario, say, we want to iterate through all employees whose salary was above 80,000 at any point during their employment with the company. The find_each method would look something like this:

time = Benchmark.realtime do
Employee.select(:emp_no, :first_name, :last_name).
joins(:salaries).
where('salary > 80000').
find_each(batch_size: DEFAULT_BATCH_SIZE) do |employee|
end
end
=> 1181.770457000006

On the other hand, the id iterator method for performing the same operation results in an order of magnitude improvement in performance.

time = Benchmark.realtime do
first_id = Employee.first.id
last_id = Employee.last.id
(first_id..last_id).step(DEFAULT_BATCH_SIZE).each do |value|
Employee.where('employees.emp_no >= ?', value).
where('employees.emp_no < ?', value + DEFAULT_BATCH_SIZE).
joins(:salaries).
where('salary > 80000').
order('employees.emp_no ASC').
select(:emp_no, :first_name, :last_name).each do |employee|
end
end
end
=> 72.75677799998084

The above results indicate that using the find_each approach results in a much worse performance⁴. The ID iterator approach is about 15x faster than naive find_each. The reason for this becomes clear when you inspect the queries that are made by the two approaches.

The find_each method makes this type of query:

SELECT  `employees`.`emp_no`, `employees`.`first_name`, `employees`.`last_name` FROM `employees` INNER JOIN `salaries` ON `salaries`.`emp_no` = `employees`.`emp_no` WHERE (salary > 80000)  ORDER BY `employees`.`emp_no` ASC LIMIT 1000

An EXPLAIN on this query reveals the following:

1 SIMPLE salaries ALL salary,emp_no NULL NULL NULL 2837536 Using where; Using temporary; Using filesort
1 SIMPLE employees eq_ref PRIMARY PRIMARY 4 employees.salaries.emp_no 1 Using index

which indicates that neither the index on salary nor the index on emp_no is being used to filter the salaries table.

The id iterator method makes this type of query:

SELECT `employees`.`emp_no`, `employees`.`first_name`, `employees`.`last_name` FROM `employees` INNER JOIN `salaries` ON `salaries`.`emp_no` = `employees`.`emp_no` WHERE (employees.emp_no >= 5128001) AND (employees.emp_no < 5129001) AND (salary > 80000)

An EXPLAIN on this query shows that the query optimizer uses the index on emp_no in the salaries table:

1 SIMPLE salaries range salary,emp_no emp_no 4 NULL 1 Using index condition; Using where
1 SIMPLE employees eq_ref PRIMARY PRIMARY 4 employees.salaries.emp_no 1 Using index

which reveals why the find_each method is so much slower than the iterator method.

TL;DR

The lesson here is always use EXPLAINs to understand what the MySQL query optimizer actually does so that you can create the most optimized queries⁵. Based on analyzing the results of the EXPLAIN, a decision can be made on which approach needs to be taken for iterations on large tables.

  • JOINs on large tables usually results in poor performance, so it’s best to avoid them. Try to use JOINs only when the result set has been narrowed down significantly through the use of an index based condition on one of the tables.
  • Try to make the best use of indices for queries in general. Use queries that results in the MySQL query optimizer choosing to use indices that are available in the table. Add indices to the table that may help speed up queries while understanding the trade-offs in terms of write performance degradation⁶.
  • Avoid running select *, instead select only the columns that are necessary for your operation. This will reduce the amount of data that needs to be sent especially when there are many TEXT columns in the table.
  • The query optimizer might take different paths depending on a variety of factors, the same query might take a different path on a server with larger resources because, say, an index might fit into the memory. This will result in drastic differences in performances. It is best to assume the worst in these situations and write queries that don’t rely on large indices to be kept in memory.
  • The easiest way to see the queries that ActiveRecord generates would be to turn on DEBUG logging. It is recommended to turn this on in development so you can catch performance issues early.
    ActiveRecord::Base.logger = Logger.new(STDOUT)
  • Alternatively, you can use to_sql on an ActiveRecord::Relation to see beforehand what query it’s going to make. 
    Employee.where(“gender = ‘M’”).to_sql

¹ I started out from this sample dataset, and deleted everything but the employees and salaries table. And then I duplicated records in the employees table to reach 5 million rows.

² This link has a good comparison of the Value based vs Offset based pagination.

³ If AUTO_INCREMENT option is turned on for the primary key, the records are automatically in incremental order.

⁴ The performance degrades even more on larger tables. When you reach 100s of millions of rows, it becomes even more important to understand the underlying queries because it might result in 100x or 1000x difference.

⁵ Take the time to read (and master) the official MySQL documentation on EXPLAIN output format, so its clear what’s good and what’s not good.

This link has a good description on the performance impact of creating indices. It’s important to understand that writes on a table with a lot of indices will be slower, so use them wisely.

Topics of interest

More Related Stories