There is a bit of a confusion between Data Warehouse vs Data Lake or ETL vs ELT. I hear that Data Warehouses are not used anymore, that they are replaced by Data Lakes altogether, but is that true? And why do we need Data Warehouses anyway? I will go into that as well as the definitions of both pluses explain the differences between them.
A Data Warehouse, in short DWH and also known as an Enterprise Data Warehouse (EDW), is the traditional way of collecting data as we do since 31 years. The DWH serves the purpose of being the data integration from many different sources, the single point of truth and the data management meaning cleaning, historize and data joined together. It provides greater executive insight into corporate performance with management Dashboards, Reports or Ad-Hoc Analyses.
Various types of business data are analysed with Data Warehouses. The need for it often becomes evident when analytic requirements run afoul of the ongoing performance of operational databases. Running a complex query on a database requires the database to enter a temporarily fixed state. This is often untenable for transactional databases. A data warehouse is employed to do the analytic work, leaving the transactional database free to focus on transactions.
The other characteristics are the ability to analyse data from multiple origins (e.g. your Google Analytics with your CRM data), and that is highly transformed and structured due to the ETL (Extract Transform Load) process.
A Data Lake is a store full of unstructured and structured data, stored as-is, without a specific purpose in mind, that can be built on multiple technologies such as Hadoop, NoSQL, Amazon Simple Storage Service, a relational database, or various combinations and different formats (e.g. Excel, CSV, Text, Logs, etc.).
According to Hortonworks Data Lake Whitepaper, the data lake arose because new types of data needed to be captured and exploited by the enterprise. As this data became increasingly available, early adopters discovered that they could extract insight through new applications built to serve the business. The data lake supports the following capabilities:
I like the definition and comparison by James Dixon, founder and CTO of Pentaho:
If you think of a DWH as a store of bottled water — cleansed and packaged and structured for easy consumption — the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake and various users of the lake can come to examine, dive in, or take samples.
As you can see on the picture below, both Technologies are created for different purposes:
As mentioned in the introduction, companies are shifting from the Data Warehouse to the Data Lake, although it’s two different things, it still can make sense. Especially when you want real-time data, as the Data Warehouse typically works in batch processes, the Data Lake works near real time and handling Big Data. It’s made for huge data and stores them unstructured easy and fast. So when should I use what?
It is common, especially in mid or large size organisation to have both environments. The image below illustrates how you would integrate it with an Enterprise Data Warehouse and a Data Lake:
© Dell EMC
ETL (Extract Transform and Load) and ELT (Extract Load and Transform) is what has described above. ETL is what happens within a Data Warehouse and ELT within a Data Lake.
ETL is the most common method used when transferring data from a source system to a Data Warehouse. In that process, you load data to your stage-layer of your DWH, clean and transform it to the Dimensional Model (Facts and Dimensions) and at the end, you load it to a final Data Mart or a Cube for further Data Visualisations.
If you want to use ELT, that’s when you want to build a Data Lake. You extract data, mostly done by physical files, load it into your Data Lake in your Cloud Storage and only then start transforming and cleaning the data. The natural process is that you begin exploring and analysing your data and find out, that the data is dirty and then you begin cleaning it.
How is Big Data connected to this topic? Big Data is more or less gathering massive amounts of data (several million rows per second) from devices like IoT (Internet of Things), different data points from each smartphone, etc. With specific Big Data infrastructure and algorithms (e.g. map-reduce) you collect the data and store it into the Data Lake. Big data is greatest used with a Data Lakes.
As a new technology arises an old one gets replaced, that’s maybe why many technologists and thought leaders are declaring the Data Warehouse is dead. They say it’s no longer relevant in the age of Big Data. But why? The question is, are you still have a traditional Data Warehouse with an ETL tool, are you 100% happy with it? There is a great chance that you are not. Because of a significant amount of coding required in traditional ETL tools, your ETL tool was probably outdated before you were ready to deploy it.
But these prognosticators are mistaken. Big data can extend and enrich a Data Warehouse, but cannot replace it. It is not a Data Warehouses that are dead, but the traditional way of designing and building them. A better and new approach is using Data Warehouse Automation (DWA) Tools that automates the recurring parts of developing a Data Warehouse to cut down time development time by 40–60%. Find more information in the DWA Blog post series or directly why we should use DWA tools.
Another way of adapting to Big Data and fast-changing data connection points is the Data Vault modelling and methodology which enables you a more dynamic and flexible way to implement additions to your Data Warehouse. Lately, there has been an interesting move to use a Data Vault as a governed Data Lake, because it addresses the elements of the problems we identified within Data Warehousing:
This technological advancement is already proving to be highly effective and efficient. Easy to design, build, populate, and change. New Data Warehouses should be created with this methodology, also a point where Data Warehouse Automation helps you.
I don’t think so. As elaborated thoroughly above, it will still need both over a long period. Both are designed for distinctive purposes and have different advantages. And I believe we should be careful with Data Lakes as more data is always better but can also lead to more chaotic data stores where nobody knows the what’s in there and it takes everyone a lot of time just to get the needed data.
In my opinion, it makes sense to have both environments to be flexible for fast changes and analysis within the Data Lake. But still be able to make very well structured Analytics with Dashboards and Reports for the operational business user. That’s why I would suggest importing patterns and essential analysis you found in the Data Lake back to your Data Warehouse to make it easily accessible for everyone in a standard way and tools users know.
Thanks for reading that far. Please feel free to add comments or anything you don’t agree.
Originally published at www.sspaeti.com on October 26, 2017.