CEO of Dashbird. 13y experience as a software developer & 5y of building Serverless applications.
With the volume, velocity, and variety of today’s data, we have all started to acknowledge that there is no one-size-fits-all database for all data needs. Instead, many companies shifted towards choosing the right data store for a specific use case or project.
The distribution of data across different data stores brought the challenge of consolidating data for analytics. Historically, the only viable solution was to build a data warehouse: extract data from all those different sources, clean and bring it together, and finally, load this data to polished Data Warehouse (DWH) tables in a well-defined structure. While there is nothing wrong with this approach, a combination of a data lake and a data warehouse may be just the solution you need. Let’s investigate why.
A data lake doesn’t need to be the end destination of your data. Data is constantly flowing, moving, changing its form and shape. A modern data platform should facilitate the ease of ingestion and discoverability, while at the same time allowing for a thorough and rigorous structure for reporting needs. A common emerging pattern is that a data lake serves as an immutable layer for your data ingestion. Nothing ever gets deleted from it (perhaps just overwritten by a new version, or deleted for compliance reasons). All raw data ever ingested into your data platform can be found in a data lake. This means that you can still have ELT/ETL jobs that transform and clean the data, and later ingest it into your data warehouse, while strictly following Kimball, Inmon, or Data Vault methodology, including Slowly Changing Dimension historization and schema alignment.
You don’t need to choose between a data lake or a data warehouse. You can have both: data lake as an immutable staging area and a data warehouse for BI and reporting.
Databricks coined the term data lakehouse which strives to combine the best of both worlds in a single solution. Similarly, platforms such as Snowflake allow you to leverage cloud storage buckets such as S3 as external stages, effectively leveraging data lake as a staging area.
In the end, you need to decide yourself whether a single “data lakehouse”, or a combination of data lake and data warehouse works best for your use case. Monte Carlo Data put it nicely:
“Increasingly, we’re finding that data teams are unwilling to settle for just a data warehouse, a data lake, or even a data lakehouse — and for good reason. As more use cases emerge and more stakeholders (with differing skill sets!) are involved, it is almost impossible for a single solution to serve all needs.” — Source
An audit trail is often important to satisfy regulatory requirements. Data lakes make it easy to collect metadata about when and by which user the data was ingested. This can be helpful not only for compliance reasons but also to track data ownership.
By providing an immutable layer of all data ever ingested, we make data available to all consumers immediately after obtaining that data. By providing raw data, you are enabling exploratory analysis that would be difficult to accomplish when different data teams may use the same dataset in a very different way. Often different data consumers may need different transformations based on the same raw data. Data lake allows you to dive anywhere into all sorts and flavors of data and decide on your own what might be useful for you to generate insights.
Ingesting real-time data into a data warehouse is still a challenging problem. Even though there are tools on the market that try to address it, this problem can be solved much easier when leveraging data lake as an immutable layer for ingesting all of your data. For instance, many solutions such as Kinesis Data Streams or Apache Kafka allow you to specify S3 location as a sink for your data.
With the growing volume of data from social media, sensors, logs, web analytics, it can become expensive over time to store all of your data in a data warehouse. Many traditional data warehouses tie storage and processing tightly together, making scaling of each difficult.
Data lakes scale storage and processing (queries and API requests to retrieve data) independently of each other. Some cloud data warehouses support this paradigm, as well. More on that in my previous article:
Typically, data warehouse solutions require you to manage the underlying compute clusters. Cloud vendors started realizing the pain of doing that and built either fully managed or entirely serverless analytical data stores.
For instance, when leveraging S3 with AWS Glue and Athena, your platform remains fully serverless and you pay only for what you use. You can utilize this single data platform to:
Regarding the integrations, on AWS, you can leverage:
I couldn’t find any trustworthy statistics, but my guess is that at least a third of the data that is typically stored in a data warehouse is almost never used. Such data sources are ingested, cleaned, and maintained “just in case” they might be needed later. This means that data engineers are investing a lot of time and effort into building and maintaining something that may not even yet have a clear business need.
The ELT paradigm allows you to save engineering time by building data pipelines only for use cases that are really needed, while simultaneously storing all the data in a data lake for potential future use cases. If a specific business question arises in the future, you may find the answer because the data is already there. But you don’t have to spend time cleaning and maintaining data pipelines for something that doesn’t yet have a clear business use case.
Another reason why data lakes and cloud data platforms are future proof is that if your business grows beyond your imagination, your platform is equipped for growth. You don’t need expensive migration scenarios to a larger or smaller database to accommodate your growth.
Regardless of your choice, your cloud data platform should allow you to grow your data assets with virtually no limits.
To build an event-driven ETL demo, I used this dataset and followed the Databricks bronze-silver-gold principle. In short, it means that you use the “bronze” layer for raw data, “silver” for preprocessed and clean data, and finally “gold” tables represent the final stage of polished data for reporting. To implement this, I created:
In the images below you can find the Dockerfile and Lambda function code I used for this simple demo.
Dockerfile for the lambda function — image by author
A simple event-driven ETL in AWS Lambda — image by author
The command wr.s3.to_parquet() not only loads the data to a new data lake location, but it’s also:
As a result, we can see how S3, AWS Glue, and Athena play together in the management console:
The silver dataset in S3, Glue, and Athena — image by author
Imagine that you would do something similar for many more datasets. Managing all those lambda functions would likely become challenging. Even though the compute power of AWS Lambda scales virtually infinitely, managing the state of data transformations is difficult, especially in real-time and event-driven scenarios.
When testing my function, I’ve made several mistakes. Dashbird observability was very helpful to view the state of my event-driven ETL, including all the error messages. It allowed me to dive deeper into the logs, and inspect all unsuccessful executions at a glance. Imagine how difficult this might be if you have to do it for hundreds of ETL jobs.
Using Dashbird to fix my event-driven ETL
Similarly, configuring alerts on failure is as simple as adding your email address or Slack channel to the Alerting policy:
You can also be notified based on other selected conditions such as cold starts, when duration exceeds a specific threshold (potentially a zombie task), or when there is an unusually high number of function invocations for a specific period of time.
It may be considered a strong statement, but I believe that data lakes, as well as data warehouse solutions with data lake capabilities, constitute an essential component in building any future-proof data platform. Building a relational schema in advance for all of your data is inefficient and often incompatible with today’s data needs. Also, having an immutable data ingestion layer storing all data ever ingested is highly beneficial for audit, data discovery, reproducibility, and fixing mistakes in data pipelines.
Previously published at https://dashbird.io/blog/7-reasons-why-you-should-consider-a-data-lake/
Create your free account to unlock your custom reading experience.