With the volume, velocity, and variety of today’s data, we have all started to acknowledge that there is no one-size-fits-all database for all data needs. Instead, many companies have shifted towards choosing the right data store for a specific use case or project. Distributing data across different data stores brought the challenge of consolidating data for analytics. Historically, the only viable solution was to build a data warehouse: extract data from all those different sources, clean and combine it, and finally load it into polished Data Warehouse (DWH) tables with a well-defined structure. While there is nothing wrong with this approach, a combination of a data lake and a data warehouse may be just the solution you need. Let’s investigate why.

1. Building a staging area for your data warehouse

A data lake doesn’t need to be the end destination of your data. Data is constantly flowing, moving, and changing its form and shape. A modern data platform should make ingestion and discovery easy, while at the same time allowing for a thorough and rigorous structure for reporting needs. A common emerging pattern is that a data lake serves as an immutable layer for your data ingestion. Nothing ever gets deleted from it (perhaps just overwritten by a new version, or deleted for compliance reasons). All raw data ever ingested into your data platform can be found in the data lake. This means that you can still have ELT/ETL jobs that transform and clean the data, and later ingest it into your data warehouse, while strictly following the Kimball, Inmon, or Data Vault methodology, including Slowly Changing Dimension historization and schema alignment.

You don’t need to choose between a data lake and a data warehouse. You can have both: a data lake as an immutable staging area and a data warehouse for BI and reporting. Databricks coined the term data lakehouse, which strives to combine the best of both worlds in a single solution.
Similarly, platforms such as Snowflake allow you to use cloud storage buckets such as S3 as external stages, effectively using the data lake as a staging area.

In the end, you need to decide for yourself whether a single “data lakehouse” or a combination of a data lake and a data warehouse works best for your use case. Monte Carlo Data put it nicely:

“Increasingly, we’re finding that data teams are unwilling to settle for just a data warehouse, a data lake, or even a data lakehouse, and for good reason. As more use cases emerge and more stakeholders (with differing skill sets!) are involved, it is almost impossible for a single solution to serve all needs.”

2. Audit log of all data ever ingested into your data ecosystem thanks to the immutable staging area

An audit trail is often important to satisfy regulatory requirements. Data lakes make it easy to collect metadata about when and by which user the data was ingested. This can be helpful not only for compliance reasons but also to track data ownership.

3. Shorten the time-to-value and time-to-insights

By providing an immutable layer of all data ever ingested, you make data available to all consumers immediately after obtaining it. By providing raw data, you enable exploratory analysis that would otherwise be difficult to accomplish, since different data teams may use the same dataset in very different ways. Often, different data consumers need different transformations based on the same raw data. A data lake allows you to dive anywhere into all sorts and flavors of data and decide on your own what might be useful for generating insights.

4. A single data platform for real-time and batch analytics

Ingesting real-time data into a data warehouse is still a challenging problem. Even though there are tools on the market that try to address it, this problem becomes much easier to solve when you use a data lake as an immutable layer for ingesting all of your data.
For instance, many solutions such as Kinesis Data Streams (via Kinesis Data Firehose) or Apache Kafka (via an S3 sink connector) allow you to specify an S3 location as a sink for your data.

5. Costs

With the growing volume of data from social media, sensors, logs, and web analytics, it can become expensive over time to store all of your data in a data warehouse. Many traditional data warehouses tie storage and processing tightly together, making it difficult to scale each of them. Data lakes scale storage and processing (queries and API requests to retrieve data) independently of each other. Some cloud data warehouses support this paradigm as well. More on that in my previous article: What you should consider before migrating to the cloud to make your data warehouse and data lake future-proof.

6. Convenience

Typically, data warehouse solutions require you to manage the underlying compute clusters. Cloud vendors realized the pain of doing that and built either fully managed or entirely serverless analytical data stores. For instance, when using S3 with AWS Glue and Athena, your platform remains fully serverless and you pay only for what you use. You can utilize this single data platform to:

- retrieve both relational and non-relational data,
- query historical and real-time data,
- checkpoint your ML training jobs and serve ML models,
- query data directly after ingestion, before any transformations were applied,
- combine your data from the data lake and DWH tables via external tables (available in nearly any DWH solution: Redshift Spectrum, Snowflake external tables, …),
- integrate with other services and distributed compute frameworks, such as Dask or Spark.
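To make the external-table item above more concrete, here is a minimal sketch that builds a Hive-style CREATE EXTERNAL TABLE statement pointing at Parquet files in S3. Statements of this general shape are accepted by Athena and Redshift Spectrum, but the exact syntax and supported column types vary by engine, so treat it as an illustration only. The function name, the `spectrum` schema, the `trips` table, and its columns are hypothetical; only the silver bucket name comes from the demo later in this article.

```python
def external_table_ddl(schema, table, columns, location):
    """Build a Hive-style CREATE EXTERNAL TABLE statement for Parquet data in S3.

    `columns` is a list of (name, type) pairs. Identifiers are inserted as-is
    and are not validated, so only pass trusted values.
    """
    if not location.startswith("s3://"):
        raise ValueError("location must be an s3:// URI")
    if not location.endswith("/"):
        # External tables point at a prefix (folder), not at a single file.
        location += "/"
    column_lines = ",\n".join(f"  {name} {dtype}" for name, dtype in columns)
    return (
        f"CREATE EXTERNAL TABLE {schema}.{table} (\n"
        f"{column_lines}\n"
        f")\n"
        f"STORED AS PARQUET\n"
        f"LOCATION '{location}'"
    )


# Hypothetical schema, table, and columns, used purely for illustration.
ddl = external_table_ddl(
    schema="spectrum",
    table="trips",
    columns=[("trip_id", "bigint"), ("vendor", "varchar(32)")],
    location="s3://data-lake-silver/trips",
)
print(ddl)
```

Once such a table exists, the warehouse can join it with regular DWH tables in a single query, while the files themselves stay in the data lake.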
Regarding the integrations, on AWS you can leverage:

- Lake Formation for management of access,
- awswrangler (a Python library that can be described as Pandas on AWS),
- Quicksight (the AWS BI tool),
- delta lake (an open-source platform created by Databricks providing, among other things, ACID-compliant transactions and upserts for your data lake),
- lakeFS (version control for your data),
- Upsolver (among other things, ingestion of stream and batch data, including upserts, using the Kappa architecture),
- AWS Database Migration Service, which allows you to incrementally export data from your RDS database tables (or even entire schemas) into S3 parquet files that can be crawled with AWS Glue and queried using Athena.

7. Future proof

I couldn’t find any trustworthy statistics, but my guess is that at least a third of the data that is typically stored in a data warehouse is almost never used. Such data sources are ingested, cleaned, and maintained “just in case” they might be needed later. This means that data engineers are investing a lot of time and effort into building and maintaining something that may not yet have a clear business need.

The ELT paradigm allows you to save engineering time by building data pipelines only for use cases that are really needed, while still storing all the data in a data lake for potential future use cases. You don’t have to spend time cleaning data and maintaining pipelines for something that doesn’t yet have a clear business use case. If a specific business question arises in the future, you may find the answer because the data is already there.

Another reason why data lakes and cloud data platforms are future proof is that if your business grows beyond your imagination, your platform is equipped for that growth. You don’t need expensive migrations to a larger or smaller database to accommodate it.

Regardless of your choice, your cloud data platform should allow you to grow your data assets with virtually no limits.
Demo: Serverless Event-driven ETL with Data Lake on AWS

To build an event-driven ETL demo, I used this dataset and followed the Databricks bronze-silver-gold principle. In short, it means that you use the “bronze” layer for raw data, “silver” for preprocessed and clean data, and finally “gold” tables for the final stage of polished data for reporting. To implement this, I created:

- an S3 bucket for raw data: s3://data-lake-bronze
- an S3 bucket for cleaned and transformed data: s3://data-lake-silver
- an AWS Lambda function (called event-driven-etl) which is triggered any time a new file arrives in the “bronze” S3 bucket. It transforms the new object and loads the data to the “silver” stage.

In the images below you can find the Dockerfile and Lambda function code I used for this simple demo.

[Image: Dockerfile for the lambda function (image by author)]

[Image: A simple event-driven ETL in AWS Lambda (image by author)]

The command wr.s3.to_parquet(), when given a Glue database and table name, not only loads the data to a new data lake location, but also:

- stores the data in parquet format with snappy compression,
- infers a schema from the Pandas dataframe’s data types and column names,
- stores the schema in the AWS Glue catalog,
- creates a new table that can be queried with Athena.

As a result, we can see how S3, AWS Glue, and Athena play together in the management console:

[Image: The silver dataset in S3, Glue, and Athena (image by author)]

How does the serverless ETL with Lambda scale?

Imagine that you would do something similar for many more datasets. Managing all those lambda functions would likely become challenging. Even though the compute power of AWS Lambda scales virtually infinitely, managing the state of data transformations is difficult, especially in real-time and event-driven scenarios.

When testing my function, I made several mistakes. Dashbird observability was very helpful for viewing the state of my event-driven ETL, including all the error messages. It allowed me to dive deeper into the logs and inspect all unsuccessful executions at a glance.
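The Lambda function from the demo is referenced above only as an image. As a rough idea of what such an event-driven function might look like, here is a minimal sketch, not the original code. It assumes the raw files are CSV, that a Glue database with the hypothetical name `silver` already exists, and that each file maps to one table named after the file; the cleaning step is a placeholder. Only the bucket names and the use of wr.s3.to_parquet() come from the demo description.

```python
import urllib.parse

SILVER_BUCKET = "data-lake-silver"  # bucket name from the demo
GLUE_DATABASE = "silver"            # hypothetical Glue database name (assumption)


def extract_s3_objects(event):
    """Return (bucket, key) pairs from an S3 event notification payload."""
    objects = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 notifications.
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        objects.append((bucket, key))
    return objects


def table_name_for(key):
    """Derive a lowercase table name from the file name part of an object key."""
    file_name = key.rsplit("/", 1)[-1]
    stem = file_name.rsplit(".", 1)[0]
    return "".join(c if c.isalnum() else "_" for c in stem).lower()


def lambda_handler(event, context):
    # Imported lazily so the helpers above can be tested without AWS libraries.
    import awswrangler as wr

    written = []
    for bucket, key in extract_s3_objects(event):
        df = wr.s3.read_csv(f"s3://{bucket}/{key}")  # assumes CSV input
        # Placeholder transformation: drop empty rows and exact duplicates.
        df = df.dropna(how="all").drop_duplicates()
        table = table_name_for(key)
        wr.s3.to_parquet(
            df=df,
            path=f"s3://{SILVER_BUCKET}/{table}/",
            dataset=True,  # required when passing database and table
            mode="append",
            database=GLUE_DATABASE,
            table=table,
        )
        written.append(table)
    return {"tables": written}
```

Keeping the event parsing and naming logic in small pure functions makes them easy to unit test locally, which helps catch mistakes before they show up as failed invocations.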
Imagine how difficult inspecting failed executions might be if you had to do it for hundreds of ETL jobs.

[Image: Using Dashbird to fix my event-driven ETL]

Similarly, configuring alerts on failure is as simple as adding your email address or Slack channel to the Alerting policy.

You can also be notified based on other selected conditions such as cold starts, when duration exceeds a specific threshold (potentially a zombie task), or when there is an unusually high number of function invocations for a specific period of time.

Conclusion

It may be considered a strong statement, but I believe that data lakes, as well as data warehouse solutions with data lake capabilities, constitute an essential component in building any future-proof data platform. Building a relational schema in advance for all of your data is inefficient and often incompatible with today’s data needs. Also, having an immutable data ingestion layer storing all data ever ingested is highly beneficial for audit, data discovery, reproducibility, and fixing mistakes in data pipelines.

Previously published at https://dashbird.io/blog/7-reasons-why-you-should-consider-a-data-lake/

Database Tips: 7 Reasons Why Data Lakes Could Solve Your Problems
