If you are following the trends in data science, you have likely heard the terms big data, analytics, and machine learning. These days everyone wants to jump into this area, and software giants such as Google, Amazon, and Microsoft are already leading the way.
However, it’s not that easy for a new business to enter this area of expertise, for many reasons. One of the core problems is that the data is scattered across different systems, each with its own database. These datasets often live for many years while providing hardly any value to the business.
It would be wonderful if we could create a data warehouse in the first place (check my article on things to consider before building a serverless data warehouse for more details). However, there are several practical challenges in creating a data warehouse at a very early stage of a business. One of the main reasons is that it is difficult to know exactly which data sets are important and how they should be cleaned, enriched, and transformed to solve different business problems.
Just imagine the effort needed to identify data sets upfront from all the different systems, extract them, and then clean, enrich, and transform them so they can be pushed into the data warehouse at such an early stage. Unless you have in-house data science experts who also understand the business domain in depth, and who can identify the required data sets and do the data preparation, it’s highly likely things will go wrong.
A data lake is a centralized repository for storing all of your structured and unstructured data. Its real advantage is that data can be stored as-is, so you can immediately start pushing data from different systems.

This data, whether it arrives as CSV files, Excel sheets, database query results, or log files, can be stored in the data lake along with its associated metadata, without having to structure it first.
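To make this concrete, here is a minimal sketch of pushing a raw file into a data lake as-is, assuming the lake is backed by Amazon S3 and accessed with boto3; the bucket name, key layout, and metadata fields are hypothetical examples, not a prescribed convention.

```python
import boto3

s3 = boto3.client("s3")

# Upload a raw CSV export exactly as it came from the source system.
# Descriptive metadata travels with the object, so the file does not
# need to be restructured before landing in the lake.
# "acme-data-lake" and the key layout are hypothetical.
s3.upload_file(
    Filename="employee_survey_2023.csv",
    Bucket="acme-data-lake",
    Key="raw/hr/employee-survey/2023/employee_survey_2023.csv",
    ExtraArgs={
        "Metadata": {
            "source-system": "hr-survey-tool",
            "dataset": "employee-survey",
            "ingested-by": "manual-upload",
        }
    },
)
```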
Once data has accumulated in the data lake over a period of time, it can be processed later for different types of analytics, big data processing, and data visualization. The same data can also feed machine learning and deep learning tools for better-guided decisions.
For a business, creating a data lake and making sure that different data sets are added consistently over long periods of time requires process and automation. The first step in this direction is to select a data lake technology and the relevant tools to set up the solution.
If you plan to create a data lake in the cloud, you can deploy one on AWS using serverless services underneath, avoiding a huge upfront cost. A significant portion of the cost of the data lake solution is variable and grows mainly with the amount of data you put in.
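As an illustration of how little there is to provision upfront, here is a minimal sketch of setting up the storage layer of such a lake, assuming Amazon S3 with versioning and default encryption; the bucket name and region are hypothetical.

```python
import boto3

s3 = boto3.client("s3", region_name="us-east-1")

bucket = "acme-data-lake"  # hypothetical bucket name

# An S3 bucket as the storage layer of a serverless data lake:
# you pay for what you store and request, with no servers to run.
s3.create_bucket(Bucket=bucket)

# Keep old object versions so accidental overwrites are recoverable.
s3.put_bucket_versioning(
    Bucket=bucket,
    VersioningConfiguration={"Status": "Enabled"},
)

# Encrypt everything at rest by default.
s3.put_bucket_encryption(
    Bucket=bucket,
    ServerSideEncryptionConfiguration={
        "Rules": [
            {"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"}}
        ]
    },
)
```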
Then it is important to identify the data sources and the frequency at which data will be added to the data lake. Once the data sources are identified, decide whether each data set should be added as-is or given the required level of cleaning and transformation first. It is also important to identify the metadata for each type of data set.
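One lightweight way to keep this consistent is to agree on a key layout and a small set of metadata fields for every data set. The sketch below shows one such convention; all names and fields are hypothetical, not a prescribed standard.

```python
from datetime import date

def lake_key(source_system: str, dataset: str, filename: str, day: date) -> str:
    """Build a partitioned key so each data set lands in a predictable
    place: raw/<system>/<dataset>/year=YYYY/month=MM/<file>."""
    return (
        f"raw/{source_system}/{dataset}/"
        f"year={day.year}/month={day.month:02d}/{filename}"
    )

def lake_metadata(source_system: str, dataset: str, frequency: str) -> dict:
    """Minimal metadata every object should carry, whatever its format."""
    return {
        "source-system": source_system,
        "dataset": dataset,
        "publish-frequency": frequency,  # e.g. "monthly", "annual"
    }

# Example: the Accounts department's monthly payroll extract.
print(lake_key("accounts", "payroll", "payroll_2024_01.csv", date(2024, 1, 31)))
print(lake_metadata("accounts", "payroll", "monthly"))
```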
Since the data sets come from different systems, which may even belong to different departments of the business, it’s important to establish processes for consistency.
For example, the HR department could be asked to publish employee satisfaction results to the data lake after each annual survey, while the Accounts department publishes payroll data to the data lake monthly.
For operations that require more frequent data publishing or involve time-consuming work, the data sourcing process can be automated. This could mean automating the extraction, transformation, and publishing of data to the data lake, or at least some of the individual steps.
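Here is a minimal sketch of one such automated step: an AWS Lambda handler that extracts a raw export, applies a simple cleaning transformation, and publishes the result to the lake. The bucket, key names, and the cleaning rule are all hypothetical, chosen only to illustrate the shape of the pipeline.

```python
import csv
import io

import boto3

s3 = boto3.client("s3")

BUCKET = "acme-data-lake"  # hypothetical bucket name

def handler(event, context):
    """Triggered on a schedule (for example, an EventBridge rule):
    extract a raw CSV dropped off by the source system, clean it,
    and publish the result to the lake's curated area."""
    # Extract: read the raw export from the staging prefix.
    obj = s3.get_object(Bucket=BUCKET, Key="staging/accounts/payroll.csv")
    text = obj["Body"].read().decode("utf-8")
    rows = list(csv.DictReader(io.StringIO(text)))

    # Transform: a deliberately simple rule for illustration,
    # keep only rows where every field is populated.
    cleaned = [row for row in rows if all(value.strip() for value in row.values())]

    # Load: publish the cleaned data set to the curated prefix.
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(cleaned)
    s3.put_object(
        Bucket=BUCKET,
        Key="curated/accounts/payroll/payroll_cleaned.csv",
        Body=out.getvalue().encode("utf-8"),
        Metadata={"source-system": "accounts", "dataset": "payroll"},
    )
```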
After setting up the data lake, it’s important to make sure it keeps functioning properly. It’s not only about putting data into the data lake; the lake must also facilitate data retrieval so that other systems can generate data-driven, informed business decisions. Otherwise, the data lake will end up as a data swamp, with little to no use in the long run.
Once the data lake has been properly set up and functioning for a reasonable period, you will already be collecting data with the right amount of associated metadata. Before that data can drive business decisions, you will need to implement different processes with ETL (Extract, Transform, and Load) operations. This is where data warehouses and data visualization tools come in. You can either publish the data to a data warehouse, if more processing needs to be done in correlation with data sets from other systems, or feed it directly into data visualization and analytics tools like Microsoft Power BI and Amazon QuickSight.
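For example, if the lake’s files have been catalogued (say, with AWS Glue), a minimal sketch of querying them in place with Amazon Athena before feeding a BI tool might look like this; the database, table, and output location are hypothetical.

```python
import boto3

athena = boto3.client("athena")

# Run SQL directly over the files in the data lake; Athena is
# serverless, so you pay per query rather than for a running cluster.
response = athena.start_query_execution(
    QueryString="""
        SELECT department, AVG(satisfaction_score) AS avg_score
        FROM employee_survey
        GROUP BY department
    """,
    QueryExecutionContext={"Database": "acme_lake"},  # hypothetical Glue database
    ResultConfiguration={
        # Athena writes query results back to S3 for other tools to pick up.
        "OutputLocation": "s3://acme-data-lake/athena-results/"
    },
)
print(response["QueryExecutionId"])
```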
The next and most important step is to ask the right business questions, ones that can actually be answered with the data available. Although it may seem obvious, this is one of the areas where many businesses make things unnecessarily complex.
Even with a fully functioning data lake that produces useful insights for the business, it is important not to stop there. The strength of a data lake lies in the continuous development and evaluation of the solutions built on top of it.