In the dynamic landscape of data analytics, choosing an analytics platform can significantly impact your business's bottom line. In this article, we explore three heavyweight analytics contenders: Snowflake, BigQuery, and ClickHouse. We'll delve into their cost models and share valuable strategies to help you master the art of cost-efficient analytics.
To make cost-effective decisions in data analytics, it's crucial to comprehend the cost models behind analytical database management systems (DBMS). A common starting point is to examine the cost structure of cloud-based businesses.
So, first, let's talk about cloud-based analytic databases and how they handle costs when running in the cloud. Picture this: it's before 2013, and setting up a data warehouse could take several months of dealing with multiple vendors, hardware bottlenecks, and configurations. Then came Amazon Redshift in 2013, allowing you to start a data warehouse in under 20 minutes with just a credit card—quite the leap forward. This paved the way for other cloud-based data warehouses like BigQuery and Snowflake, as well as managed services built on top of them.
Most developers learn about the importance of cost-efficiency as soon as they start getting their cloud bills. For example, you could run a long-running query on BigQuery that ends up costing you hundreds, often thousands of dollars. Take it from the following tweet/X post; this is not an isolated occurrence.
We recently ran a webinar discussing this topic in further detail where Robert Hodges (CEO at Altinity Inc.) goes into the in-depth cost analysis of each option and shows you an alternative DIY solution at the end. Feel free to watch the recording of that webinar on-demand once you’re done reading this article!
So, let's dig into how these databases work and, more importantly, how they price their services.
We'll start with Snowflake, but before we get into its architecture, we will talk business. Snowflake, being a publicly traded company, gives us a chance to poke around its financials. In a recent report, they raked in a whopping $2 billion in total revenue. What's intriguing is the cost of delivering their service, roughly around $717 million. This cost, also known as the cost of goods sold, is what it takes Snowflake to run the show.
Understanding the cost of running Snowflake's cloud services is like a hidden treasure hunt. If you crunch the numbers, it's roughly about one-third, maybe a tad more, of their total revenue. So, in plain English, if all that cost went straight into covering their cloud expenses and nothing else, they'd be marking things up by roughly three times when they charge you.
But of course, it’s not that simple! The actual costs of running a powerhouse like Snowflake go beyond just running virtual machines and stashing data in Amazon S3. Now, if we do the math again, that markup on their costs? It's more like 5x.
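The arithmetic above is easy to sketch out yourself. The snippet below reproduces the naive revenue-over-COGS markup and then shows how the effective markup climbs toward 5x once you account for COGS covering more than raw cloud infrastructure. The 55% "cloud share" split is purely an illustrative assumption, not a figure from Snowflake's filings.

```python
# Back-of-the-envelope markup estimate from Snowflake's reported financials.
# Figures from the article: ~$2B total revenue, ~$717M cost of goods sold.
revenue_musd = 2000        # total revenue, millions of USD
cogs_musd = 717            # cost of goods sold, millions of USD

# If every COGS dollar went to cloud infrastructure, the markup would be:
naive_markup = revenue_musd / cogs_musd
print(f"Naive markup: {naive_markup:.1f}x")          # roughly 3x

# COGS also covers support staff, contracts, and so on. If, hypothetically,
# only ~55% of COGS were actual cloud resource spend, the markup on those
# resources would be:
cloud_share = 0.55         # ASSUMED split, for illustration only
effective_markup = revenue_musd / (cogs_musd * cloud_share)
print(f"Effective markup on cloud resources: {effective_markup:.1f}x")
```

Running the numbers this way is how you get from the "roughly three times" headline figure to the more realistic 5x benchmark used throughout the rest of this article.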
That is a nifty benchmark to carry in your back pocket. If something's got a markup greater than 5x, well, it's living large in the world of Snowflake's pricing. On the flip side, if it's less than 5x, you're looking at a more budget-friendly option. It's like having a secret decoder ring for deciphering the cost puzzle.
To master cost-efficient analytics, let's dissect the cost models of Snowflake, BigQuery, and ClickHouse.
Now, let's break down Snowflake's costs. They use a "virtual data warehouse" model, where your data resides in S3 object storage. When you run SQL queries, you create virtual data warehouses billed in credits—essentially hosted processing units pulling data from storage. The pricing for these virtual machines ranges from about $2–4 per hour, as listed on their pricing page. Keep in mind: it's not the thriftiest choice.
This would wrap up our Snowflake cost analysis if not for an intriguing twist: a recent bug in Snowflake revealed that the credits for virtual data warehouses often map to c5d.2xlarge instances, which cost about 38 cents per hour. Remarkably, Snowflake doesn't significantly mark up object storage, pricing it at $23–40 per terabyte per month, similar to Amazon's S3 costs. Instead, the real markup occurs in compute, which can be 5 to 10 times more expensive than the underlying instances.
Let's take a closer look at BigQuery, another heavyweight in cloud database analytics. BigQuery offers a distinct pricing model known as "serverless" or "on-demand." In this setup, you store your data on a unique distributed storage system, not the typical object storage used by most cloud services. However, the pricing is quite competitive, similar to object storage rates, at least for uncompressed data. Prices start at $0.016-0.023 per GB per month for storage—now that's more budget-friendly.
Here's the catch: when you run a query, BigQuery dynamically allocates compute resources as needed. It charges you $6.25 for every terabyte of data scanned during your query processing. This pricing structure means that even seemingly innocent queries can rack up costs if they involve scanning a large amount of data from distributed storage.
Now, comparing BigQuery costs to standard cloud resources isn't straightforward. Your actual expenses depend on various factors, like how often you run queries and your data storage methods. For instance, if you run queries sporadically, BigQuery might be cost-effective. But if you're running queries 24/7, it could get expensive. So, it's crucial to understand your workload and assess your real costs carefully.
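To make that workload sensitivity concrete, here is a small sketch that prices two hypothetical workloads under BigQuery's $6.25-per-TB-scanned on-demand model. Both workload shapes are invented for illustration; only the per-TB price comes from the article.

```python
# Hypothetical workloads under BigQuery on-demand pricing.
PRICE_PER_TB_SCANNED = 6.25  # USD per TB scanned, from the article

def monthly_scan_cost(queries_per_day, tb_scanned_per_query, days=30):
    """On-demand cost of a steady query workload over one month."""
    return queries_per_day * tb_scanned_per_query * PRICE_PER_TB_SCANNED * days

# Sporadic use: 10 queries per day, each scanning 50 GB (0.05 TB).
sporadic = monthly_scan_cost(10, 0.05)
# Heavy use: one 1 TB scan every hour, around the clock.
heavy = monthly_scan_cost(24, 1.0)

print(f"Sporadic workload: ${sporadic:,.2f}/month")   # under $100
print(f"Heavy workload:    ${heavy:,.2f}/month")      # thousands
```

A two-orders-of-magnitude spread from the same pricing model is why "understand your workload" is the operative advice here: the same database is either a bargain or a budget hazard depending on how often and how broadly you scan.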
The previous models we’ve discussed typically rely on object storage, or something quite similar, for data handling. However, there's another cloud database operating approach, one initially introduced by Redshift a decade ago. Let’s call it the "Buy-the-Box" model.
Here's the deal: you rent a virtual machine (VM), such as a dc2.8xlarge, which comes with attached SSD block storage. In Amazon us-west-2, this VM costs around $4.80 per hour. Now, let's break down the actual costs. This VM likely corresponds to an i3.8xlarge instance, an older Amazon instance type with local SSD storage. That instance costs approximately $2.50 per hour and offers nearly the same amount of RAM.
In terms of markup, Redshift is 92% more expensive while providing 66% less storage than if you were to set up a similar VM yourself. Interestingly, Redshift's markup is relatively lower compared to some other services, like Snowflake, which tends to charge more for computing resources.
To improve on the previous models for cost efficiency, let us first introduce you to ClickHouse. It is a free, renowned, open-source, real-time analytics database. ClickHouse’s architecture revolves around interconnected servers capable of seamless data replication, especially with replicated tables. This system employs efficient columnar storage, where data resides in compressed arrays, not only reducing storage costs but also boosting query performance.
Initially, ClickHouse was confined to block storage, but it has evolved to support S3-compatible object storage, making it versatile and open to integration with any S3 API-equipped object storage solution. To manage replication consensus efficiently, ClickHouse relies on either ClickHouseKeeper or ZooKeeper.
You can read more about ClickHouse in its official documentation.
Now, let's envision this as a cloud service paradigm, which we term "Modernized Buy-the-Box." On the left side of the image below is the traditional Redshift architecture, while on the right, we embrace innovation. We replace the old i3 instances with faster Intel-based m6i instances, providing a significant speed boost. The game-changer is the use of EBS (Elastic Block Storage) gp3 volumes, granting control over bandwidth and throughput. This, paired with VMs comparable to Redshift's, results in an approximate cost of $2.64 per hour.
The real magic happens with the separation of storage and computing. This flexible approach lets you easily adjust CPU and VM types while keeping the same storage, allowing cost scaling up or down. Our experience at Altinity.Cloud managing ClickHouse applications reflects this efficiency.
In the graph below, you'll see the all-in on-demand cost for each of these configurations compared side by side.
To summarize, we've explored three cloud-hosted analytic database models and their cost implications. In our comparison, we've created a wallet-sized table to make it clear how these models stack up against each other.
The "Buy-the-Box" approach offers cost-effective computing but pricier storage due to block storage usage. Snowflake and BigQuery, on the other hand, offer economical storage but can be costly in different ways. Snowflake tends to be relatively expensive in terms of computing, while BigQuery's on-demand query model can become pricey when handling extensive data scans. Each model has its strengths and weaknesses, making it essential to align them with your specific analytics needs. "Buy-the-Box" suits customer-facing analytics with unpredictable workloads, while the virtual data warehouse model and BigQuery excel in specific scenarios but require careful cost management to avoid surprises. This overview helps you navigate the landscape effectively.
When it comes to analytics, cost-efficiency is paramount. Understanding the cost models of popular platforms like Snowflake, BigQuery, and ClickHouse is essential for making informed decisions. By evaluating each platform's strengths and weaknesses and considering their cost structures, organizations can craft cost-efficient analytic solutions tailored to their specific needs. Leveraging open-source solutions and educational resources can further optimize cost, ensuring organizations achieve their analytics goals while being mindful of their budgets.
This article was derived from an Altinity.com webinar.