I build infrastructure, platform, and data engineering solutions. I love writing about interesting things in tech.
Note: I have used MySQL and it’s storage engine InnoDB interchangeably.
A couple of days ago, someone at work asked me whether inserts are faster in a table when there’s a
primary key present in the table. My immediate answer was that they were faster. I had some idea about this from experience and previous reading but didn’t remember the reason why it was so.
For a business user unaware about InnoDB internals or InnoDB’s existence, it might seem perfectly intuitive to assume that inserts will be faster without a
primary key because a
primary key is an additional constraint on the table, i.e., MySQL has to check for uniqueness, nullability. When it comes to InnoDB, that is not the case. Also, a table without a
primary key will occupy more space than a table with a
primary key. See how.
MySQL has various
mutexes for preventing multiple threads accessing the same resource at the same time. There is a mutex called
dict_sys which locks the data dictionary, i.e., the
information_schema. Ovais Tariq has written a very insightful blogpost on this.
primary key is not defined on a table, MySQL generates
row-ids for every row that is inserted in that table. This, because InnoDB is an index-organized storage engine, it mandates a
primary key to store the contents of the table. If you don’t define a primary, MySQL will do this itself by generating a
6 byte auto increment primary key which is hidden. It is referred by the name
DB_ROW_ID in the documentation. This only has to exist if a table doesn’t have a
Now, when this
row-id is generated on every new insert, MySQL acquires
dict_sys mutex — and it becomes a system-wide contention point because what it is acquiring a mutex on is the data dictionary — which might be in use by a couple of other DDL and DML operations. This would mean that insert performance on a table might be affected by another operation taking place on another table. This contention would otherwise not exist if a
primary key is defined on the table.
To check how a similar insert workload behaves on a table with a defined
primary key and another table without a defined
primary key, I performed a simple
mysqlslap insert performance test on my
local. I have attached the scripts to perform these tests. Note: Doing this with
rand() is probably not ideal.
mysqlslap — is a
sysbench like tool for MySQL that can be used for performance and load testing of individual tables — Read more about it here.
The insert performance in this case was surprisingly not very different in both the cases — checked for 10 million records. This was with the concurrency of 4 threads
primary key, 1284.848 seconds
primary key, 1282.350 seconds
In an isolated environment and with only 4 thread, there were not many
dict_sys mutexes on
information_schema, hence, there was no significant difference between the insert performance. Although, with the same table configuration, these two tables occupy significantly different amount of space on disk — which comes from the fact that MySQL had to create
row ids for every record that was being inserted in the table without a defined
primary key, 900 MB
primary key, 1.1 GB
I conducted five more tests with increasing concurrency and a consistent load of 100K records. See the difference in insert performance here.
With increasing threads, time taken to insert into tables without
primary key increases quite a lot because of the contention point added because of
Always use a
P.S. — From Jeremy Cole’s blog about this
Given how much other code within InnoDB is protected by dict_sys->mutex I think it’s fair to say any tables with an implicit clustered key (ROW_ID) could expect to experience random insert stalls during operations like dropping (unrelated) tables. Parallel insertion into multiple tables with implicit keys could be performance-constrained, as it will be serialized on both the shared mutex and cache contention for the shared counter variable.