Slow query times in large datasets are a common headache in database management. offers a neat way out of this. It's a columnar storage engine that significantly speeds up data analytics. Typically, you can improve query performance in relational databases by adding appropriate indexes. MariaDB ColumnStore However, maintaining indexes is hard, especially with ad-hoc queries where you don't really know where indexes are going to be needed. ColumnStore eases this pain. It's as if you had an index on each column but without the hassle of creating and updating them. The price to pay? Well, inserts are not as fast as with , so this is not the best option for operational/transactional databases but rather for analytical ones. are very fast, though. InnoDB Bulk inserts There's plenty of about ColumnStore, so I won't go through all the details on how it works or how to deploy it on production. Instead, in this article, I'll show you how to try MariaDB ColumnStore on your computer using . online documentation Docker Prerequisites You'll need: The command line tool mariadb Docker Setting up MariaDB ColumnStore Run a container with MariaDB + ColumnStore: docker run -d -p 3307:3306 -e PM1=mcs1 --hostname=mcs1 --name mcs1 mariadb/columnstore This command runs a new Docker container using the official ColumnStore image with several specified options: : Starts a new Docker container. docker run : Runs the container in detached mode (in the background). -d : Maps port 3307 on the host (your computer) to port 3306 inside the container. This makes the database accessible on port on the host machine. -p 3307:3306 3307 : The environment variable specifies the primary database node ( ). -e PM1=mcs1 PM1 PM1 mcs1 : Sets the hostname of the container to . --hostname=mcs1 mcs1 : Names the container . --name mcs1 mcs1 : Specifies the Docker image to use, in this case, an image for MariaDB with the ColumnStore storage engine. mariadb/columnstore Provision ColumnStore: docker exec -it mcs1 provision mcs1 The command is used to interact with a running Docker container. This is what each option does: docker exec : Executes a command in a running container. docker exec : This option ensures the command is run in interactive mode with a terminal. -it (first occurrence): This is the name of the Docker container in which the command is to be executed. mcs1 This is the specific command being executed inside the container. is a script included in the Docker image that initializes and configures the MariaDB ColumnStore environment within the container. The argument is passed to the command to specify the host for the MariaDB server within the Docker container. provision mcs1 provision mcs1 provision Connect to the MariaDB server using the default credentials defined in the MariaDB ColumnStore Docker image: mariadb -h 127.0.0.1 -P 3307 -u admin -p'C0lumnStore!' Check that ColumnStore is available as a storage engine by running the following SQL sentence: SHOW ENGINES; Setting up a demo database Create the database and its InnoDB tables: operations CREATE DATABASE operations;

CREATE TABLE operations.doctors(
    id SERIAL PRIMARY KEY,
    name VARCHAR(200) NOT NULL CHECK(TRIM(name) != '')
) ENGINE=InnoDB;

CREATE TABLE operations.appointments(
    id SERIAL PRIMARY KEY,
    name VARCHAR(200) NOT NULL CHECK(TRIM(name) != ''),
    phone_number VARCHAR(15) NOT NULL CHECK(phone_number RLIKE '[0-9]+'),
    email VARCHAR(254) NOT NULL CHECK(TRIM(email) != ''),
    time DATETIME NOT NULL,
    reason ENUM('Consultation', 'Follow-up', 'Preventive', 'Chronic') NOT NULL,
    status ENUM ('Scheduled', 'Canceled', 'Completed', 'No Show'),
    doctor_id BIGINT UNSIGNED NOT NULL,
    CONSTRAINT fk_appointments_doctors FOREIGN KEY (doctor_id) REFERENCES doctors(id)
) ENGINE=InnoDB; Create the database and its ColumnStore table: analytics CREATE DATABASE analytics;

CREATE TABLE analytics.appointments(
    id BIGINT UNSIGNED NOT NULL,
    name VARCHAR(200) NOT NULL,
    phone_number VARCHAR(15) NOT NULL,
    email VARCHAR(254) NOT NULL,
    time DATETIME NOT NULL,
    reason VARCHAR(15) NOT NULL,
    status VARCHAR(10) NOT NULL,
    doctor_id BIGINT UNSIGNED NOT NULL
) ENGINE=ColumnStore; You can use the same database (or schema. They are in MariaDB) for both the InnoDB and ColumnStore tables if you prefer. Use a different name for the ColumnStore table if you opt for this alternative. synonyms Inserting demo data Insert a few doctors: INSERT INTO operations.doctors(name)
VALUES ("Maria"), ("John"), ("Jane"); Create a new file with the name with the following content: test_data_insert.py import random
import os
import subprocess
from datetime import datetime, timedelta

# Function to generate a random date within a given range
def random_date(start, end):
    return start + timedelta(days=random.randint(0, int((end - start).days)))

# Function to execute a given SQL command using MariaDB
def execute_sql(sql):
    # Write the SQL command to a temporary file
    with open("temp.sql", "w") as file:
        file.write(sql)
    # Execute the SQL command using the MariaDB client
    subprocess.run(["mariadb", "-h", "127.0.0.1", "-P", "3307", "-u", "admin", "-pC0lumnStore!", "-e", "source temp.sql"])
    # Remove the temporary file
    os.remove("temp.sql")

print("Generating and inserting data...")

# Total number of rows to be inserted
total_rows = 4000000
# Number of rows to insert in each batch
batch_size = 10000

# Possible values for the 'reason' column and their associated weights for random selection
reasons = ["Consultation", "Follow-up", "Preventive", "Chronic"]
reason_weights = [0.5, 0.15, 0.25, 0.1]

# Possible values for the 'status' column and their associated weights for random selection
statuses = ["Scheduled", "Canceled", "Completed", "No Show"]
status_weights = [0.1, 0.15, 0.7, 0.05]

# Possible values for the 'doctor_id' column and their associated weights for random selection
doctors = [1, 2, 3]
doctors_weights = [0.4, 0.35, 0.25]

# List of patient names
names = [f"Patient_{i}" for i in range(total_rows)]

# Insert data in batches
for batch_start in range(0, total_rows, batch_size):
    batch_values = []

    # Generate data for each row in the batch
    for i in range(batch_start, min(batch_start + batch_size, total_rows)):
        name = names[i]
        phone_number = f"{random.randint(100, 999)}-{random.randint(100, 999)}-{random.randint(1000, 9999)}"
        email = f"patient_{i}@example.com"
        time = random_date(datetime(2023, 1, 1), datetime(2024, 1, 1)).strftime("%Y-%m-%d %H:%M:%S")
        reason = random.choices(reasons, reason_weights)[0]
        status = random.choices(statuses, status_weights)[0]
        doctor_id = random.choices(doctors, doctors_weights)[0]

        # Append the generated row to the batch
        batch_values.append(f"('{name}', '{phone_number}', '{email}', '{time}', '{reason}', '{status}', {doctor_id})")

    # SQL command to insert the batch of data into the 'appointments' table
    sql = "USE operations;\nINSERT INTO appointments (name, phone_number, email, time, reason, status, doctor_id) VALUES " + ", ".join(batch_values) + ";"
    # Execute the SQL command
    execute_sql(sql)
    # Print progress
    print(f"Inserted up to row {min(batch_start + batch_size, total_rows)}")

print("Data insertion complete.") Insert 4 million appointments by running the Python script: python3 test_data_insert.py Populate the ColumnStore table by connecting to the database and running: INSERT INTO analytics.appointments (
    id,
    name,
    phone_number,
    email,
    time,
    reason,
    status,
    doctor_id
)
SELECT
    appointments.id,
    appointments.name,
    appointments.phone_number,
    appointments.email,
    appointments.time,
    appointments.reason,
    appointments.status,
    appointments.doctor_id
FROM operations.appointments; Run cross-engine SQL queries MariaDB ColumnStore is designed to run in a cluster of multiple servers. It is there where you see massive performance gains in analytical queries. However, we can also see this in action with the single-node setup of this article. Run the following query and pay attention to the time it needs to complete (make sure it queries the database): operations SELECT doctors.name, status, COUNT(*) AS count
FROM operations.appointments -- use the InnoDB table
JOIN doctors ON doctor_id = doctors.id
WHERE status IN (
    'Scheduled',
    'Canceled',
    'Completed',
    'No Show'
)
GROUP BY doctors.name, status
ORDER BY doctors.name, status; On my machine, it took around 3 seconds. Now modify the query to use the ColumnStore table instead (in the database): analytics SELECT doctors.name, status, COUNT(*) AS count
FROM analytics.appointments -- use the ColumnStore table
JOIN doctors ON doctor_id = doctors.id
WHERE status IN (
    'Scheduled',
    'Canceled',
    'Completed',
    'No Show'
)
GROUP BY doctors.name, status
ORDER BY doctors.name, status; It takes less than a second. Of course, you can speed up the first query by adding an index in this simplistic example, but imagine the situation in which you have hundreds of tables—it will become harder and harder to manage indexes. ColumnStore removes this complexity. Also published . here

Walkthroughs, tutorials, guides, and tips. This story will teach you how to do something new or how to do something better.

Why Databases and SQL Matter: The Pillars of Modern App Development

Using MariaDB as a MongoDB NoSQL Database

Visit my personal Blog!

Nominated for 2022 - HackerNoon Contributor of the Year - Database

Nominated for 2022 - HackerNoon Contributor of the Year - Git

Nominated for 2022 - HackerNoon Contributor of the Year - Java

Nominated for 2022 - HackerNoon Contributor of the Year - Docker

Nominated for 2022 - HackerNoon Contributor of the Year - Sql

Nominated for 2022 - HackerNoon Contributor of the Year - Analytics

Too Long; Didn't Read

Optimizing Database Performance with MariaDB ColumnStore

Optimizing Database Performance with MariaDB ColumnStore

About Author

Comments

TOPICS

THIS ARTICLE WAS FEATURED IN

Related Stories

9 Useful Interactive CLI Tools for Linux

10 Great Articles On Data Science And Programming!

10 FinTech Trends in 2021 [Part II]

10 Financial Predictions for 2021 to Protect Your Wealth

13 API Metrics That Every Platform Team Should be Tracking

If You’re Trying to Talk to Everybody, You’re Not Reaching Anybody

9 Useful Interactive CLI Tools for Linux

10 Great Articles On Data Science And Programming!

10 FinTech Trends in 2021 [Part II]

10 Financial Predictions for 2021 to Protect Your Wealth

13 API Metrics That Every Platform Team Should be Tracking

If You’re Trying to Talk to Everybody, You’re Not Reaching Anybody

Light-Mode

Classic

Newspaper

Minty

Dark-Mode

Neon Noir

Minty

HN StartUps