Snowflake vs. BigQuery vs. ClickHouse: Mastering Cost-Effective Business Analyticsby@altinityinc
1,442 reads
1,442 reads

Snowflake vs. BigQuery vs. ClickHouse: Mastering Cost-Effective Business Analytics

by Altinity Inc.December 6th, 2023
Read on Terminal Reader
Read this story w/o Javascript

Too Long; Didn't Read

TL;DR: Comparing the cost models of analytics platforms Snowflake, BigQuery, and ClickHouse reveals diverse pricing structures. Snowflake's costs lean towards computing, BigQuery's on-demand model can rack up with extensive data scans, while ClickHouse offers real-time analytics with efficient storage. Properly aligning with each platform's strengths and understanding their cost nuances can guide businesses to more cost-effective analytics solutions.
featured image - Snowflake vs. BigQuery vs. ClickHouse: Mastering Cost-Effective Business Analytics
Altinity Inc. HackerNoon profile picture

In the dynamic landscape of data analytics, choosing an analytics platform can significantly impact your business's bottom line. In this educational article, we're on a quest to explore three heavyweight analytics contenders: Snowflake, BigQuery, and ClickHouse. We will delve into their cost models and reveal valuable strategies to help you master the art of cost-efficient analytics – it's a journey exploring insights that can transform your data game!

Understanding Analytic DBMS Cost Models

To make cost-effective decisions in data analytics, it's crucial to comprehend the cost models behind analytical database management systems (DBMS). A common starting point is to examine the cost structure of cloud-based businesses.
So, first, let's talk about cloud-based analytic databases and how they handle costs when running in the cloud. Picture this: It’s before 2013, and setting up a data warehouse could take up to several months of dealing with multiple vendors, hardware bottlenecks, and configurations. Then came Amazon Redshift in 2013, allowing you to start your data warehouse in under 20 minutes with just a credit card—quite the leap forward. This paved the way for other cloud-based data warehouses like BigQuery, Snowflake, and cloud services that operate Snowflake.

Most developers learn about the importance of cost-efficiency as soon as they start getting their cloud bills. For example, you could run a long-running query on BigQuery that ends up costing you hundreds, often thousands of dollars. Take it from the following tweet/X post; this is not an isolated occurrence.

We recently ran a webinar discussing this topic in further detail where Robert Hodges (CEO at Altinity Inc.) goes into the in-depth cost analysis of each option and shows you an alternative DIY solution at the end. Feel free to watch the recording of that webinar on-demand once you’re done reading this article!

So, let's dig into how these databases work and, more importantly, how they price their services.

We'll start with Snowflake, but before we get into its architecture, we will talk business. Snowflake, being a publicly traded company, gives us a chance to poke around its financials. In a recent report, they raked in a whopping $2 billion in total revenue. What's intriguing is the cost of delivering their service, roughly around $717 million. This cost, also known as the cost of goods sold, is what it takes Snowflake to run the show.

Understanding the cost of running Snowflake's cloud services is like a hidden treasure hunt. If you crunch the numbers, it's roughly about one-third, maybe a tad more, of their total revenue. So, in plain English, if all that cost went straight into covering their cloud expenses and nothing else, they'd be marking things up by roughly three times when they charge you.

But of course, it’s not that simple! The actual costs of running a powerhouse like Snowflake go beyond just running virtual machines and stashing data in Amazon S3. Now, if we do the math again, that markup on their costs? It's more like 5x.

That is a nifty benchmark to carry in your back pocket. If something's got a markup greater than 5x, well, it's living large in the world of Snowflake's pricing. On the flip side, if it's less than 5x, you're looking at a more budget-friendly option. It's like having a secret decoder ring for deciphering the cost puzzle.

Unpacking the Cost Models of Snowflake, BigQuery, and ClickHouse

To master cost-efficient analytics, let's dissect the cost models of Snowflake, BigQuery, and ClickHouse.

Snowflake Virtual Data Warehouse Model

Now, let's break down Snowflake's costs. They use a "virtual data warehouse" model, where your data resides in S3 object storage. When you run SQL queries, you create virtual data warehouses powered by credits, essentially hosted processing units pulling data from storage. The pricing for these virtual machines ranges from about $2-4 /hour as listed in their pricing. Keep in mind: it's not the thriftiest choice.

This would wrap up our Snowflake cost analysis if not for this intriguing twist: a recent bug in Snowflake unveiled that the credits for virtual data warehouses often translate to c5d2x large instances, costing about 38 cents per hour. Remarkably, Snowflake doesn't significantly mark up object storage, pricing it at $23-40 per terabyte per month, similar to Amazon's S3 costs. Instead, the real markup occurs in computing, which can be 5 to 10 times more expensive than storage.

BigQuery Serverless Query Model

Let's take a closer look at BigQuery, another heavyweight in cloud database analytics. BigQuery offers a distinct pricing model known as "serverless" or "on-demand." In this setup, you store your data on a unique distributed storage system, not the typical object storage used by most cloud services. However, the pricing is quite competitive, similar to object storage rates, at least for uncompressed data. Prices start at $0.016-0.023 per GB per month for storage—now that's more budget-friendly.

Here's the catch: when you run a query, BigQuery dynamically allocates compute resources as needed. It charges you $6.25 for every terabyte of data scanned during your query processing. This pricing structure means that even seemingly innocent queries can rack up costs if they involve scanning a large amount of data from distributed storage.

Now, comparing BigQuery costs to standard cloud resources isn't straightforward. Your actual expenses depend on various factors, like how often you run queries and your data storage methods. For instance, if you run queries sporadically, BigQuery might be cost-effective. But if you're running queries 24/7, it could get expensive. So, it's crucial to understand your workload and assess your real costs carefully.

AWS RedShift "Buy the Box" Model

The previous models we’ve discussed typically rely on object storage, or something quite similar, for data handling. However, there's another cloud database operating approach, one initially introduced by Redshift a decade ago. Let’s call it the "Buy-the-Box" model.

Here's the deal: you rent a virtual machine (VM), such as a dc28x large, which comes with attached SSD block storage. For instance, in Amazon us-west-2, this VM would cost you around 4.80 cents per hour. Now, let's break down the actual costs. This VM likely corresponds to an i38x large instance, an older Amazon instance type, which comes with local SSD storage. This type of instance costs approximately $2.50 per hour, offering nearly the same amount of RAM.

In terms of markup, Redshift is 92% more expensive while providing 66% less storage than if you were to set up a similar VM yourself. Interestingly, Redshift's markup is relatively lower compared to some other services, like Snowflake, which tends to charge more for computing resources.

ClickHouse Database Architecture

To improve on the previous models for cost efficiency, let us first introduce you to ClickHouse.  It is a free, renowned, open-source, real-time analytics database. ClickHouse’s architecture revolves around interconnected servers capable of seamless data replication, especially with replicated tables. This system employs efficient columnar storage, where data resides in compressed arrays, not only reducing storage costs but also boosting query performance.

Initially, ClickHouse was confined to block storage, but it has evolved to support S3-compatible object storage, making it versatile and open to integration with any S3 API-equipped object storage solution. To manage replication consensus efficiently, ClickHouse relies on either ClickHouseKeeper or ZooKeeper.

You can read more about Clickhouse here.

The Modernized "Buy-the-Box" Approach

Now, let's envision this as a cloud service paradigm, which we term "Modernized Buy-the-Box." On the left side of the image below is the traditional Redshift architecture, while on the right, we embrace innovation. We replace the old i3 instances with faster Intel-based m6is, providing a significant speed boost. The game-changer is the use of EBS (Elastic Block Storage) gp3 storage, granting control over bandwidth and throughput. This, paired with efficient VMs akin to Redshift, results in an approximate cost of 2.64 cents per hour.

The real magic happens with the separation of storage and computing. This flexible approach lets you easily adjust CPU and VM types while keeping the same storage, allowing cost scaling up or down. Our experience at Altinity.Cloud managing ClickHouse applications reflects this efficiency.

In the graph below, you'll see the all-in on-demand cost for Altinity.Cloud when running ClickHouse. For instance, opting for m6i 12x large might cost six dollars per hour, slightly more than Redshift. However, ClickHouse is so fast that smaller instance sizes get better performance than Redshift, which means lower cost without compromising performance. This showcases the tremendous benefits of separating storage and computing.

To summarize, we've explored three cloud-hosted analytic database models and their cost implications. In our comparison, we've created a wallet-sized table to make it clear how these models stack up against each other.

The "Buy-the-Box" approach offers cost-effective computing but pricier storage due to block storage usage. Snowflake and BigQuery, on the other hand, offer economical storage but can be costly in different ways. Snowflake tends to be relatively expensive in terms of computing, while BigQuery's on-demand query model can become pricey when handling extensive data scans. Each model has its strengths and weaknesses, making it essential to align them with your specific analytics needs. "Buy-the-Box" suits customer-facing analytics with unpredictable workloads, while the virtual data warehouse model and BigQuery excel in specific scenarios but require careful cost management to avoid surprises. This overview helps you navigate the landscape effectively.

In Conclusion

When it comes to analytics, cost-efficiency is paramount. Understanding the cost models of popular platforms like Snowflake, BigQuery, and ClickHouse is essential for making informed decisions. By evaluating each platform's strengths and weaknesses and considering their cost structures, organizations can craft cost-efficient analytic solutions tailored to their specific needs. Leveraging open-source solutions and educational resources can further optimize cost, ensuring organizations achieve their analytics goals while being mindful of their budgets.

This article was derived from an webinar.