Quite often in my practice, I see people use varchar
fields with a small and limited number of available options.
Let's check out today what field data type is better and will be the best approach for your project.
Let's imagine us designing a database for a CMS. Just pretend we don't have any of them available at every corner, and we need to create the best example.
To begin with our database, we will start with the main table post. I'm now omitting the author
field because this is our blog, and we don't have any plans to add additional authors except for us ;)
CREATE TABLE post (
id INT NOT NULL AUTO_INCREMENT,
title VARCHAR(255) NOT NULL,
status ENUM ('archived', 'deleted', 'draft', 'published') DEFAULT 'draft',
content TEXT NOT NULL,
created_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
updated_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (id),
KEY (status)
)
ENGINE = InnoDB;
I can imagine three ways how to describe the status
field.
The first one is in the example above.
status VARCHAR(255) NOT NULL default 'draft'
The next one will be enum:
status ENUM ('archived', 'deleted', 'draft', 'published') DEFAULT 'draft',
And the last one is an integer:
status TINYINT(1) UNSIGNED NOT NULL DEFAULT '0'
Let us compare these three variants in terms of performance, easiness to use, and support for feature changes. Setting initial scores for each option.
varchar_score = 0
enum_score = 0
integer_score = 0
Varchar and enum approaches look pretty much the same for a human eye. On the other hand, the integer is not human-readable. Then you see status=1, you have no way to say what status is just by looking into the database, and you have to keep the mapping between numbers and words in your code.
Updating the scores:
varchar_score += 1
enum_score +=1
Internally, MySQL uses numbers and null to store values of enums.
Internal value |
DisplayedValue |
---|---|
null |
null |
0 |
'' |
1 |
'archived' |
2 |
'deleted' |
3 |
'draft' |
4 |
'published' |
And now imagine your enum like that:
enum('draft', 'published', 'deleted', 'archived')
Ordering of the data will work based on internal representation - first will go values with draft
next - deleted
, draft
, and published
. This behavior is confusing the first time until you realize the reason for that. The correct order of options in the enum should match your expected sorting behavior.
For alphabetic sorting, it should be like this:
ENUM ('archived', 'deleted', 'draft', 'published')
Considering that, it looks like enums might bring some problems to beginners. But you simply have to keep in mind that enum is just an integer under the hood.
Tinyint takes 1 byte. The same story with enum - it also takes 1 byte of storage. Technically, both of them might be extended to 2 bytes, but I cannot imagine a real-life case where you will need more than 255 options.
For varchar, it's a bit more complicated. It will depend on the character set. For example, latin1
will take one byte per character, and utf8mb4
can require up to four bytes per character. Plus one byte to record the length of the string.
The set of statuses archived
, deleted
, draft
, and published
will take from 6 to 10 bytes in the latin1
character set.
The final amount of storage will depend on the distribution of the values, but anyway, varchar will take much more space.
Also, you should consider not only the storage size for the data but the storage size for the indexes. And indexes for varchar will take more space as well.
enum_score += 1
integer_score +=1
This case is pretty simple. For varchar and integers, you don't need to change the schema to add a new possible option. On the other hand, adding a new option to an enum will require changes to the schema. And depending on how you are going to do it - it might lead to database downtime, which is not acceptable.
varchar_score += 1
integer_score +=1
To make some experiments let's spin up the MySQL database inside the Docker container and compare the results.
docker run --name blog_mysql -e MYSQL_ROOT_PASSWORD=secret -e MYSQL_DATABASE=blog -d mysql:8
docker exec -it blog_mysql mysql -u root -psecret blog
Creating three very similar tables. The only difference will be in the status
field.
CREATE TABLE post_enum (
id INT NOT NULL AUTO_INCREMENT,
title VARCHAR(255) NOT NULL,
status ENUM ('archived', 'deleted', 'draft', 'published') DEFAULT 'draft',
content TEXT NOT NULL,
created_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
updated_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (id),
KEY (status)
)
ENGINE = InnoDB
COLLATE utf8_bin;
CREATE TABLE post_varchar (
id INT NOT NULL AUTO_INCREMENT,
title VARCHAR(255) NOT NULL,
status VARCHAR(10) DEFAULT 'draft',
content TEXT NOT NULL,
created_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
updated_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (id),
KEY (status)
)
ENGINE = InnoDB
COLLATE utf8_bin;
CREATE TABLE post_int (
id INT NOT NULL AUTO_INCREMENT,
title VARCHAR(255) NOT NULL,
status TINYINT(1) NOT NULL DEFAULT 0,
content TEXT NOT NULL,
created_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
updated_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (id),
KEY (status)
)
ENGINE = InnoDB
COLLATE utf8_bin;
SHOW TABLES;
+----------------+
| Tables_in_blog |
+----------------+
| post_enum |
| post_int |
| post_varchar |
+----------------+
Let's have some fun, shall we? Not many people are doing MySQL procedures, but let's try some.
DROP PROCEDURE IF EXISTS add_post_varchar;
DROP PROCEDURE IF EXISTS add_post_enum;
DROP PROCEDURE IF EXISTS add_post_int;
DELIMITER $$
CREATE PROCEDURE add_post_varchar(number_of_posts INT)
BEGIN
DECLARE loop_counter INT UNSIGNED DEFAULT 0;
DECLARE post_content TEXT DEFAULT 'Lorem Ipsum is simply dummy text of the printing and typesetting industry.';
DECLARE post_title VARCHAR(255) DEFAULT 'Lorem Ipsum';
WHILE loop_counter < number_of_posts
DO
INSERT
INTO
post_varchar (title, status, content)
SELECT
post_title,
ELT(0.5 + RAND() * 4, 'archived', 'deleted', 'draft', 'published'),
post_content;
SET loop_counter = loop_counter + 1;
END WHILE;
END $$
CREATE PROCEDURE add_post_enum(number_of_posts INT)
BEGIN
DECLARE loop_counter INT UNSIGNED DEFAULT 0;
DECLARE post_content TEXT DEFAULT 'Lorem Ipsum is simply dummy text of the printing and typesetting industry.';
DECLARE post_title VARCHAR(255) DEFAULT 'Lorem Ipsum';
WHILE loop_counter < number_of_posts
DO
INSERT
INTO
post_enum (title, status, content)
SELECT
post_title,
ELT(0.5 + RAND() * 4, 'archived', 'deleted', 'draft', 'published'),
post_content;
SET loop_counter = loop_counter + 1;
END WHILE;
END $$
CREATE PROCEDURE add_post_int(number_of_posts INT)
BEGIN
DECLARE loop_counter INT UNSIGNED DEFAULT 0;
DECLARE post_content TEXT DEFAULT 'Lorem Ipsum is simply dummy text of the printing and typesetting industry.';
DECLARE post_title VARCHAR(255) DEFAULT 'Lorem Ipsum';
WHILE loop_counter < number_of_posts
DO
INSERT
INTO
post_int (title, status, content)
SELECT
post_title,
ELT(0.5 + RAND() * 4, '1', '2', '3', '4'),
post_content;
SET loop_counter = loop_counter + 1;
END WHILE;
END $$
DELIMITER ;
And create 200 000 records in each table.
call add_post_int(200000);
call add_post_enum(200000);
call add_post_varchar(200000);
SELECT COUNT(*), status FROM post_enum GROUP BY status
UNION
SELECT COUNT(*), status FROM post_int GROUP BY status
UNION
SELECT count(*), status FROM post_varchar GROUP BY status;
+----------+-----------+
| COUNT(*) | status |
+----------+-----------+
| 50058 | archived |
| 50229 | deleted |
| 50273 | draft |
| 49440 | published |
| 50123 | 1 |
| 49848 | 2 |
| 49642 | 3 |
| 50387 | 4 |
| 49885 | archived |
| 49974 | deleted |
| 50060 | draft |
| 50081 | published |
+----------+-----------+
Now we have almost even distribution.
Let's run some queries. First, enable profiling:
SET profiling=1;
Next run queries which will use only the index to obtain the data and will not go to the heap.
SELECT SQL_NO_CACHE COUNT(*), status FROM post_enum GROUP BY status;
SELECT SQL_NO_CACHE COUNT(*), status FROM post_varchar GROUP BY status;
SELECT SQL_NO_CACHE COUNT(*), status FROM post_int GROUP BY status;
And finally, let's check our performance.
Query_ID |
Duration |
Query |
---|---|---|
1 |
0.05163150 |
SELECT SQL_NO_CACHE COUNT(*), status FROM post_enum GROUP BY status |
2 |
0.05172150 |
SELECT SQL_NO_CACHE COUNT(*), status FROM post_varchar GROUP BY status |
3 |
0.05893025 |
SELECT SQL_NO_CACHE COUNT(*), status FROM post_int GROUP BY status |
Every scenario works at the same speed. Updating the scores:
varchar_score += 1
enum_score += 1
integer_score += 1
Our final scores are:
Option |
Score |
---|---|
integer_score |
3 |
varchar_score |
3 |
enum_score |
3 |
Turns out all our scores are the same. I didn't expect it to be so in the beginning. Honestly, I expect enums to win this battle. But enums have some specific behavior in sorting, which is not a big deal if you know your tools, but it might confuse a beginner.
Enums have advantages in being human-friendly and taking less space. Considering all of that, I'd recommend using enums for options like this.
And may the fast queries be with you ;)