Understanding the tech behind Snowflake’s IPO and what’s to come

Boaz Farkash (@boazfarkash)

Data-Analytics veteran, CPO at Firebolt

By now you must have read quite a few articles about Snowflake’s absolutely mind-blowing and record-setting IPO. This article is not intended to speculate on whether the valuation makes sense or not, but rather to help you understand the technological concepts that make Snowflake so unique, and why it has proven to be so disruptive for the data space in general and the data warehousing space in particular.

Background: data warehouses start their journey to the cloud

To understand the market Snowflake operates in, it’s important to first spend a few minutes on the history of data warehouses in the cloud. Data warehouses have been around forever on-premises, but their appearance in the cloud is still relatively recent, with the release of Amazon Redshift in 2013. It was the first cloud data warehouse that seemed to be headed towards becoming mainstream.

It was only natural that AWS, as the leader of the cloud revolution, would release one of the first successful pure cloud data warehouses. Redshift quickly gained popularity and became the warehouse of choice both for companies that were transitioning from on-prem to the cloud (the name “Redshift” was a poke at Oracle: shifting away from “Red”, which is Oracle’s color), and for new cloud-first companies. The biggest benefit that came with Redshift was the fact that it is a managed cloud service. This meant that the ease of use of the cloud and SaaS finally became available to data warehousing. Provisioning a new warehouse suddenly was not a lengthy and complex project involving hardware and software, but rather a few mouse clicks with Redshift taking care of the heavy lifting behind the scenes.

But beyond reducing complexities around installations, provisioning and maintenance, Redshift does the same things in the same manner that traditional on-premises warehouses do. It also shares the same limitations.

Google and Microsoft released cloud data warehouse offerings of their own (Google’s BigQuery actually preceded Redshift), but neither was as successful as Redshift in grabbing market share and becoming the warehouse choice for on-prem to cloud transitions.

‘Decoupled storage and compute’ and simplicity as the foundation for Snowflake’s popularity

In the last few years you’ve heard the term ‘elasticity’ a million times in the context of the cloud revolution. But the story of elasticity made a significant leap specifically for data warehousing thanks to the ‘decoupled storage and compute’ concept. Although there are many product capabilities behind Snowflake’s success, we will focus on decoupled storage and compute because it’s the most important of them all.

Quick reminder — what is elasticity?

A simple explanation of elasticity in the cloud can be found in Wikipedia: “the degree to which a system is able to adapt to workload changes by provisioning and de-provisioning resources in an autonomic manner…”. However, the fact that something is in the cloud doesn’t mean it automatically enjoys elasticity. This is especially true for data warehouses.
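To make the definition concrete, here is a minimal Python sketch of elastic provisioning. It is a toy model, not how any particular cloud service works: the function name `nodes_needed`, the capacity of 10 queries per node, and the 1 to 8 node limits are all assumptions chosen for illustration.

```python
import math


def nodes_needed(queued_queries: int,
                 queries_per_node: int = 10,  # hypothetical capacity per node
                 min_nodes: int = 1,          # hypothetical lower bound
                 max_nodes: int = 8) -> int:  # hypothetical upper bound
    """Return how many nodes a toy autoscaler would run for the current load."""
    wanted = math.ceil(queued_queries / queries_per_node)
    # Clamp to the allowed range: never below min_nodes, never above max_nodes.
    return max(min_nodes, min(max_nodes, wanted))


# Simulated workload over six time steps (number of queued queries).
workload = [3, 25, 80, 120, 40, 0]

# Nodes are provisioned as load rises and de-provisioned as it falls.
plan = [nodes_needed(q) for q in workload]
print(plan)
```

The point of the sketch is only that the amount of provisioned hardware follows the workload up and down, instead of being fixed in advance.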

What is ‘decoupled storage and compute’ and what does it have to do with elasticity?

Decoupling storage and compute enables users to quickly change the type and size of compute (clusters of CPUs) that is assigned to be the processing unit over the storage layer (hard drives, SSDs, etc). The cloud made this approach easily accessible to software, but it took time for software products to leverage this architectural concept.

Snowflake was the first cloud data warehouse that made it super simple for users to enjoy the benefits of decoupled storage and compute. No matter how much data you import into Snowflake, at any point in time you can decide which compute resource you want to use for your next workload with a few mouse clicks.

For example — you might decide to use a small (and therefore relatively cheap) compute cluster to conduct some research over a sample of your data. Then you might decide to use a bigger (and therefore more expensive) cluster for production dashboards that are viewed by many people throughout the day. Since you’re typically billed per usage in the cloud, this approach provides granular control over your spend. You can literally choose on a per-query level which hardware to use.
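The example above can be sketched as a small Python model. This is an illustration under stated assumptions, not Snowflake’s API or pricing: the `Cluster` class, the `run_query` function, the cluster names, and the price of 2.0 per node-hour are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cluster:
    """A compute cluster; it holds no data of its own."""
    name: str
    nodes: int

    def cost(self, hours: float, price_per_node_hour: float = 2.0) -> float:
        # Hypothetical usage-based billing: pay only for the hours a cluster runs.
        return self.nodes * hours * price_per_node_hour


# One shared storage layer (a plain list here), independent of any cluster.
storage = list(range(1_000_000))


def run_query(cluster: Cluster, data: list[int]) -> tuple[str, int]:
    """Run the same query on whichever cluster the caller picks."""
    return cluster.name, sum(data)


small = Cluster("research", nodes=1)    # cheap, for exploring a sample
large = Cluster("dashboards", nodes=8)  # bigger, for production dashboards

sample = storage[::100]                 # every 100th row of the same data
print(run_query(small, sample))         # research on a sample, small cluster
print(run_query(large, storage))        # full data, large cluster

# The research cluster runs for an hour; the dashboard cluster runs all day.
print(small.cost(hours=1), large.cost(hours=12))
```

Both clusters read from the same `storage`, and the caller decides per query which one does the work. That choice, together with usage-based billing, is what gives the granular control over spend described above.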

This is completely different from the traditional on-premises world, where the same hardware configuration you implemented for your warehouse always runs all queries, no matter what.

Now back to elasticity — Snowflake’s approach to decoupled storage and compute, and making it easily accessible for users through no more than a few clicks, brought a new level of elasticity to data warehouses. From a product perspective, it is the key reason for Snowflake’s popularity among data professionals.

What’s next for data warehouses in the cloud

The benefit of decoupled storage and compute for data warehouses and cloud data technologies in general is undeniable. This is one of the key reasons users like Snowflake. Decoupled storage and compute has become the architecture of choice for new data technologies. Amazon Redshift recently added both Redshift Spectrum and RA3 nodes for this reason.

As we built Firebolt on a decoupled storage and compute architecture, we also came to realize what needed to change in storage and compute to solve many of the other problems with data warehouses, and there are plenty of them remaining. One of the bigger ones is being able to analyze huge data sets quickly and easily without breaking the bank. Achieving fast query response times for big data sets is still something that modern warehouses struggle with, and when it’s possible, it typically comes with a price tag that only big enterprises can afford. Looking at public benchmarks comparing Snowflake, Redshift and BigQuery, it’s clear that already at the 1TB scale of data, queries easily take 7 seconds or more on average, sometimes even minutes (see Fivetran’s benchmark post). As data keeps growing, and as being able to deliver great user experiences over data becomes increasingly important for businesses, ‘performance at scale’ limitations are what the market will be looking to the tech industry to solve.

A decade ago BI meant that a smaller group of analysts performed analytics against historical data using advanced BI tools to support decisions for the rest of the company. A decade from now we expect 100x or more employees and customers to be performing self-service analytics against historical and real-time data using newer tools that leverage machine and deep learning and other computing — all to help them make faster and better decisions on their own. Snowflake’s valuation reflects the future potential. But the data industry has a long way to go till it gets there.

This is why from a technology perspective the future of cloud data technologies belongs to those that will be able to analyze more data at higher speeds, while relying on less hardware and compute power to do so, to reduce costs. This belief led us at Firebolt to start by concentrating on technology that can get much more analytic horsepower from existing hardware, as the foundation on top of which simplicity and the benefits of the cloud can bring powerful analytics to the masses.
