In this article, I'm going to demonstrate the performance difference between two ways of iterating over the records of a MySQL table with millions of rows. In a high-volume analytics system, tables with millions of records are quite common, and iterating over a full table or a subset of it often becomes necessary, whether to perform computations, run a migration, or create parallelized background jobs on the records. At [AirPR](https://www.airpr.com/), we have many database tables with hundreds of millions of records, and it's important to write efficient iteration code because there is often an order-of-magnitude difference between a good and a not-so-good approach.

### Find Each Method

The standard approach provided natively by ActiveRecord is the `find_each` method. For the purposes of this exercise, I created an `employees` table to which I added about 5 million rows of data¹. There is also a `salaries` table, keyed by `emp_no` and holding a `salary` amount with a date range, that stores the salaries of those employees across different time ranges. This table contains about 3 million records.

Let's measure the performance of iterating through this table using `find_each`:

```ruby
DEFAULT_BATCH_SIZE = 1000

time = Benchmark.realtime do
  Employee.select(:emp_no, :first_name, :last_name).
    find_each(batch_size: DEFAULT_BATCH_SIZE) do |employee|
    # no-op; we are only measuring the cost of the iteration itself
  end
end
=> 100.6963519999990
```

The underlying queries that ActiveRecord makes look like this:

```sql
Employee Load (2.1ms) SELECT `employees`.`emp_no`, `employees`.`first_name`, `employees`.`last_name` FROM `employees` ORDER BY `employees`.`emp_no` ASC LIMIT 1000
Employee Load (1.9ms) SELECT `employees`.`emp_no`, `employees`.`first_name`, `employees`.`last_name` FROM `employees` WHERE (`employees`.`emp_no` > 11000) ORDER BY `employees`.`emp_no` ASC LIMIT 1000
Employee Load (1.8ms) SELECT `employees`.`emp_no`, `employees`.`first_name`, `employees`.`last_name` FROM `employees` WHERE (`employees`.`emp_no` > 12000) ORDER BY `employees`.`emp_no` ASC LIMIT 1000
...
Employee Load (1.3ms) SELECT `employees`.`emp_no`, `employees`.`first_name`, `employees`.`last_name` FROM `employees` WHERE (`employees`.`emp_no` > 5127997) ORDER BY `employees`.`emp_no` ASC LIMIT 1000
```

Notice how ActiveRecord keeps track of the last primary-key value (`emp_no`) from the previous batch and uses it in a `WHERE` condition for the next one. This is called value-based pagination, and it is generally the preferred approach to pagination (over alternatives like offset-based pagination)².

### ID Iterator Method

Now let's try a different iteration technique:

```ruby
time = Benchmark.realtime do
  first_id = Employee.first.id
  last_id = Employee.last.id
  (first_id..last_id).step(DEFAULT_BATCH_SIZE).each do |value|
    Employee.where('employees.emp_no >= ?', value).
      where('employees.emp_no < ?', value + DEFAULT_BATCH_SIZE).
      order('employees.emp_no ASC').
      select(:emp_no, :first_name, :last_name).each do |employee|
      # no-op; we are only measuring the cost of the iteration itself
    end
  end
end
=> 101.34066200000234
```

In this method, we divvy up the total number of rows into batches using `where` conditions on the primary key and iterate through all the records one batch at a time. Notice that the performance is practically the same as with `find_each`.

This is how the underlying queries look:

```sql
Employee Load (1.1ms) SELECT `employees`.* FROM `employees` ORDER BY `employees`.`emp_no` ASC LIMIT 1
Employee Load (1.1ms) SELECT `employees`.* FROM `employees` ORDER BY `employees`.`emp_no` DESC LIMIT 1
Employee Load (1.5ms) SELECT `employees`.`emp_no`, `employees`.`first_name`, `employees`.`last_name` FROM `employees` WHERE (employees.emp_no >= 10001) AND (employees.emp_no < 11001)
Employee Load (1.9ms) SELECT `employees`.`emp_no`, `employees`.`first_name`, `employees`.`last_name` FROM `employees` WHERE (employees.emp_no >= 11001) AND (employees.emp_no < 12001)
...
Employee Load (1.8ms) SELECT `employees`.`emp_no`, `employees`.`first_name`, `employees`.`last_name` FROM `employees` WHERE (employees.emp_no >= 5128001) AND (employees.emp_no < 5129001)
```

This approach works best when the ids are roughly contiguous; if there are large gaps in the id space, many batches will match few or no rows and the iteration wastes queries on them³.
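Since this pattern comes up repeatedly, it can be handy to extract it into a reusable helper. Below is a minimal sketch of that idea; the `IdIterator` module and the `each_id_batch` method are illustrative names of my own, not part of ActiveRecord:

```ruby
# A minimal sketch: the id iterator extracted into a reusable module.
# `IdIterator` and `each_id_batch` are hypothetical, not ActiveRecord APIs.
module IdIterator
  # Yields a relation scoped to each primary-key range in turn.
  def each_id_batch(batch_size: 1000)
    first_id = minimum(primary_key)
    last_id  = maximum(primary_key)
    return if first_id.nil? # empty table

    (first_id..last_id).step(batch_size) do |start|
      yield where("#{table_name}.#{primary_key} >= ?", start).
              where("#{table_name}.#{primary_key} < ?", start + batch_size)
    end
  end
end

Employee.extend(IdIterator)

Employee.each_id_batch do |batch|
  batch.select(:emp_no, :first_name, :last_name).each do |employee|
    # process one employee
  end
end
```

Because each id range is independent of the others, the ranges can also be handed out to parallelized background jobs, as mentioned in the introduction; `find_each` offers something similar through its `start:` and `finish:` options.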
### Iterating with joins

Now let's compare the performance of these two methods when we add some complexity to the query. In this new scenario, say we want to iterate through all employees whose salary was above 80,000 at any point during their employment with the company.

The `find_each` version would look something like this:

```ruby
time = Benchmark.realtime do
  Employee.select(:emp_no, :first_name, :last_name).
    joins(:salaries).
    where('salary > 80000').
    find_each(batch_size: DEFAULT_BATCH_SIZE) do |employee|
  end
end
=> 1181.770457000006
```

The id iterator method for the same operation, on the other hand, yields an order-of-magnitude improvement:

```ruby
time = Benchmark.realtime do
  first_id = Employee.first.id
  last_id = Employee.last.id
  (first_id..last_id).step(DEFAULT_BATCH_SIZE).each do |value|
    Employee.where('employees.emp_no >= ?', value).
      where('employees.emp_no < ?', value + DEFAULT_BATCH_SIZE).
      joins(:salaries).
      where('salary > 80000').
      order('employees.emp_no ASC').
      select(:emp_no, :first_name, :last_name).each do |employee|
    end
  end
end
=> 72.75677799998084
```

The id iterator approach is about 16x faster than the naive `find_each`⁴. The reason becomes clear when you inspect the queries made by the two approaches.

The `find_each` method makes this type of query:

```sql
SELECT `employees`.`emp_no`, `employees`.`first_name`, `employees`.`last_name`
FROM `employees`
INNER JOIN `salaries` ON `salaries`.`emp_no` = `employees`.`emp_no`
WHERE (salary > 80000)
ORDER BY `employees`.`emp_no` ASC
LIMIT 1000
```

An EXPLAIN on this query reveals the following:

| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
|---|---|---|---|---|---|---|---|---|---|
| 1 | SIMPLE | salaries | ALL | salary,emp_no | NULL | NULL | NULL | 2837536 | **Using where; Using temporary; Using filesort** |
| 1 | SIMPLE | employees | eq_ref | PRIMARY | PRIMARY | 4 | employees.salaries.emp_no | 1 | Using index |

This indicates that neither the index on `salary` nor the index on `emp_no` is being used to filter the `salaries` table: MySQL scans all ~2.8 million rows of `salaries`, joins, and then sorts the result (`Using temporary; Using filesort`), and it pays that cost again for every batch of 1000.

The id iterator method makes this type of query:

```sql
SELECT `employees`.`emp_no`, `employees`.`first_name`, `employees`.`last_name`
FROM `employees`
INNER JOIN `salaries` ON `salaries`.`emp_no` = `employees`.`emp_no`
WHERE (employees.emp_no >= 10001) AND (employees.emp_no < 11001) AND (salary > 80000)
ORDER BY `employees`.`emp_no` ASC
```

An EXPLAIN on this query shows that the query optimizer uses the index on `emp_no` in the `salaries` table:

| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
|---|---|---|---|---|---|---|---|---|---|
| 1 | SIMPLE | salaries | range | salary,emp_no | emp_no | 4 | NULL | 1 | **Using index condition; Using where** |
| 1 | SIMPLE | employees | eq_ref | PRIMARY | PRIMARY | 4 | employees.salaries.emp_no | 1 | Using index |

This reveals why the `find_each` method is so much slower than the id iterator method: the range condition on `emp_no` lets the optimizer use the `emp_no` index instead of scanning the whole `salaries` table, and no temporary table or filesort is needed.
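As an aside, you don't need to copy these queries into a MySQL console to inspect their plans: `ActiveRecord::Relation` has an `explain` method that runs EXPLAIN on the SQL a relation would generate. A minimal sketch, reconstructing roughly the two queries above from the Rails console:

```ruby
# Plan for the kind of query find_each generates (first batch, LIMIT 1000).
puts Employee.select(:emp_no, :first_name, :last_name).
  joins(:salaries).
  where('salary > 80000').
  order('employees.emp_no ASC').
  limit(1000).
  explain

# Plan for a single id-range batch from the id iterator method.
puts Employee.where('employees.emp_no >= ?', 10_001).
  where('employees.emp_no < ?', 11_001).
  joins(:salaries).
  where('salary > 80000').
  order('employees.emp_no ASC').
  explain
```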
### TL;DR

The lesson here is to always use EXPLAIN to understand what the MySQL query optimizer actually does, so that you can write the most optimized queries⁵. Based on the results of the EXPLAIN, you can decide which approach to take for iterating over large tables.

* JOINs on large tables usually result in poor performance, so it's best to avoid them. Use JOINs only when the result set has already been narrowed down significantly through an index-based condition on one of the tables.
* Try to make the best use of indices for queries in general. Write queries that lead the MySQL query optimizer to use the indices available on the table, and add indices that may help speed up queries while understanding the trade-off in write-performance degradation⁶.
* Avoid `SELECT *`; instead, select only the columns that are necessary for your operation. This reduces the amount of data that needs to be sent over the wire, especially when the table has many TEXT columns.
* The query optimizer may take different paths depending on a variety of factors; the same query might be executed differently on a server with more resources because, say, an index fits into memory there, resulting in drastic differences in performance. It's best to assume the worst in these situations and write queries that don't rely on large indices being kept in memory.
* The easiest way to see the queries that ActiveRecord generates is to turn on DEBUG logging. It's worth having this on in development so you can catch performance issues early: `ActiveRecord::Base.logger = Logger.new(STDOUT)`
* Alternatively, you can call `to_sql` on an `ActiveRecord::Relation` to see beforehand what query it is going to make: `Employee.where("gender = 'M'").to_sql`

¹ I started out from [this](https://github.com/datacharmer/test_db) sample dataset, deleted everything but the `employees` and `salaries` tables, and then duplicated records in the `employees` table to reach 5 million rows.

² [This](https://blog.novatec-gmbh.de/art-pagination-offset-vs-value-based-paging/) link has a good comparison of value-based vs. offset-based pagination.

³ If the `AUTO_INCREMENT` option is turned on for the primary key, records are automatically assigned ids in increasing order.

⁴ The performance degrades even more on larger tables. When you reach hundreds of millions of rows, it becomes even more important to understand the underlying queries, because the difference can be [100x or 1000x](https://medium.com/squad-engineering/blazingly-fast-querying-on-huge-tables-by-avoiding-joins-5be0fca2f523).

⁵ Take the time to read (and master) the [official MySQL documentation](https://dev.mysql.com/doc/refman/8.0/en/explain-output.html) on the EXPLAIN output format, so it's clear what's good and what's not.

⁶ [This link](https://logicalread.com/impact-of-adding-mysql-indexes-mc12/#.W_rzeJNKi_s) has a good description of the performance impact of creating indices. It's important to understand that writes on a table with a lot of indices will be slower, so use them wisely.