Databases store data in a structured form. This structure makes it possible to find and edit data. Because of it, databases are used for data management, data storage, data evaluation, and targeted processing of data.
In this sense, data is any information that is to be saved and later reused in various contexts. This can include date and time values, texts, addresses, and numbers, but also pictures. The data should be available for evaluation and processing later.
The amount of data a single database can store is limited, so enterprises tend to use data warehouses, which are designed to handle much larger volumes of data.
A data warehouse (DWH) is a prepared, structured (tabular) data source. In the simplest case, it is an SQL database that contains at least one data record. At the enterprise level, such a warehouse quickly takes on larger dimensions, so that entire business intelligence departments deal only with the data warehouse.
A data lake records both structured and unstructured data and stores it at various processing stages (raw, processed, analyzed). The wide variety of possible data sources (e.g. images, videos, text) and the availability of raw data allow data scientists to carry out further analyses that are not possible in a DWH.
In abstract terms, the data warehouse can be considered as an Excel file on the computer, while the data lake represents a file folder.
Like an Excel file, the DWH contains very structured data with named columns in a fixed schema. Adding new entries is not a problem; adding new columns is more difficult, depending on the existing content. Once the data has been typed in and saved, the originals are no longer available, so you have to rely on the file.
The content, in turn, is very easy to use for certain applications: visualizations or simple arithmetic operations for KPI calculation are very convenient.
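As a rough sketch of this Excel-like behavior, the following Python snippet uses the standard-library sqlite3 module as a small stand-in for a warehouse. The table name sales, its columns, and the sample rows are invented for illustration only; a real DWH would be far larger and would typically run on a dedicated system.

```python
import sqlite3

# Hypothetical table and column names, chosen only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales ("
    "order_date TEXT NOT NULL, region TEXT NOT NULL, revenue REAL NOT NULL)"
)

# Adding new entries is easy as long as they fit the fixed schema.
rows = [
    ("2021-01-04", "north", 120.0),
    ("2021-01-04", "south", 80.0),
    ("2021-01-05", "north", 200.0),
]
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)

# A simple KPI: revenue per region and in total, computed directly on columns.
revenue_by_region = dict(
    conn.execute("SELECT region, SUM(revenue) FROM sales GROUP BY region")
)
total_revenue = conn.execute("SELECT SUM(revenue) FROM sales").fetchone()[0]
conn.close()

print(revenue_by_region, total_revenue)
```

Because every row follows the same schema, the aggregation needs no preparation step, which is the convenience the Excel comparison is pointing at.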
If you take the data lake in the same example, it behaves like a file folder on the hard drive. You can store a large amount of (original) data without having to format, type, or structure it beforehand.
However, if you want to continue working with the data, you have to prepare it first; you cannot simply compute sums across columns. Once this processing has taken place, the results can be saved in the folder again. Even many Excel files (i.e. data warehouses) can be generated this way and saved in the folder for further processing.
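The folder analogy can be sketched in Python as well. This is a minimal illustration under stated assumptions: the raw/ and processed/ subfolders, the file names, and the sample records are hypothetical, and a temporary directory stands in for the lake.

```python
import csv
import json
import tempfile
from pathlib import Path

# Hypothetical folder layout; a temporary directory stands in for the lake.
lake = Path(tempfile.mkdtemp())
raw_dir = lake / "raw"
processed_dir = lake / "processed"
raw_dir.mkdir()
processed_dir.mkdir()

# Raw data is stored as-is: no schema, inconsistent fields and types.
raw_path = raw_dir / "orders.jsonl"
raw_path.write_text(
    '{"region": "north", "revenue": "120.0"}\n'
    '{"region": "south", "revenue": 80}\n'
    '{"region": "north", "note": "no revenue recorded"}\n',
    encoding="utf-8",
)

# Preparation step: parse each record, skip incomplete ones, coerce types.
clean_rows = []
for line in raw_path.read_text(encoding="utf-8").splitlines():
    record = json.loads(line)
    if "revenue" not in record:
        continue
    clean_rows.append(
        {"region": record["region"], "revenue": float(record["revenue"])}
    )

# The structured result is saved back into the same folder tree.
out_path = processed_dir / "orders.csv"
with out_path.open("w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["region", "revenue"])
    writer.writeheader()
    writer.writerows(clean_rows)

# Only after preparation can a column sum be computed.
total_revenue = sum(row["revenue"] for row in clean_rows)
print(total_revenue)
```

The raw file stays untouched next to the processed table, so the originals remain available for other analyses later.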
Hopefully, this simple example explains the difference between a database, data warehouse, and data lake. For companies, depending on the maturity of the data-driven business, it makes sense to use one or both of the infrastructures. A DWH allows a wide range of users quick access to structured data for analysis, while a data lake enables advanced users, for example, data engineers and data scientists, to apply machine learning and other advanced analytics methods.
However, if you do need a deeper explanation of the differences between these three terms and a deep dive into how a data lake works, read my previous article.