Architecting Serverless Data Integration Hubs on AWS for Enterprise Data Delivery: 2020 Edition
Principal at CloudData LLC (www.cloud-data.biz)
Happy New Year! If data is the new oil, why do we experience so many backfires in our systems? A prior 2-part article series titled Serverless Data Integration (published on Medium.com) discussed the importance of data integration, particularly in the serverless realm, in good detail. We concluded that data integration was a critical business need that had to be proactively and centrally addressed by every IT organization.
To reiterate, Data Integration focuses on delivering value to the business by enabling analytics, machine learning and AI initiatives fed with clean and standardized data. Without high-quality data, the business insights generated by those initiatives will be questionable at best. In case you’ve missed it, here are the links to the previous article series:
- Serverless Data Integration — Part I (2019)
- Serverless Data Integration — Part II (2019)
In this article, I’d like to clarify some additional points related to our topic of discussion. I also wish to share a sample reference architecture of a Data Integration Hub (henceforth referenced as The Hub) implemented on Amazon Web Services (AWS).
Clearing the Air
Data Integration Hubs are not Data Lakes
Data Lakes store data in near-exact native format. That may be adequate for certain bespoke analytics initiatives, but a centralized data service that cleanses, standardizes and makes data compliant with all prevailing regulatory requirements delivers better value to the business. The value proposition of a Hub is simple: write once read many (WORM), ensure data quality, standardization and compliance. Keep it current and don’t let the data go stale!
To achieve this, a Hub employs the concept of ELTn (Extract Load Transform(n)) instead of the classic ETL (Extract Transform Load). ELTn implies that transformation is not a single occurrence that occurs before loading of the data. It is a continuous process and occurs multiple times in the lifetime of the data supporting a specific business function. The goal is to eventually reach a high-level of data hygiene such that all data is CLEAN.
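One way to picture ELTn is as a load step followed by transformation passes that can be rerun whenever the rules evolve. The following is a minimal Python sketch of that idea, not the Hub's actual implementation. The names `load_raw` and `transform_pass`, the in-memory list standing in for the data store, and the two sample rules are all hypothetical.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]
Rule = Callable[[Record], Record]


def load_raw(store: List[Record], extracted: List[Record]) -> None:
    """Extract + Load: persist records as-is, before any transformation."""
    store.extend(dict(r) for r in extracted)


def transform_pass(records: List[Record], rules: List[Rule]) -> List[Record]:
    """One of the n transformation passes: apply every rule to every record."""
    result = []
    for record in records:
        for rule in rules:
            record = rule(record)  # each rule returns a new record
        result.append(record)
    return result


# Hypothetical rules, for illustration only.
def trim_name(r: Record) -> Record:
    return {**r, "name": r["name"].strip()}


def upper_country(r: Record) -> Record:
    return {**r, "country": r["country"].upper()}


raw_store: List[Record] = []
load_raw(raw_store, [{"name": "  Ada ", "country": "uk"}])

# Transformation happens after loading, and more than once as rules are added.
first_pass = transform_pass(raw_store, [trim_name])
second_pass = transform_pass(first_pass, [trim_name, upper_country])
```

The raw records are left untouched by each pass, so a later pass can always start again from the loaded data.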
The Hub enables the paradigm of continually maintaining up-to-date ‘business objects’, matching the transformation that the business undergoes.
Once data from a given ingestion process is marked as CLEAN, all downstream applications can consume data from the Hub. This minimizes the load that data extracts place on back-end systems of record, not to mention repeated and wasteful efforts in data cleansing. Plus, data quality and compliance adherence can be done once, in a centralized location. This enables high-quality and relevant enterprise data delivery. Let’s make data clean and compliant once. This is what a Hub does.
The Hub solves the problem of rampant Data Silos that have emerged over many years of bad data management. Contrasted with a Data Lake, a Hub is a living, breathing and constantly evolving centralized, authorized data distributor for the enterprise. As mentioned before, Data Lakes bring together disparate data sources and store them in near-exact original format. A Hub does much more by providing the required quality, standardization and compliance for all data initiatives within the enterprise.
Can we please walk before we try to run?
Ask any C-level executive today about the top priority (s)he is on the hook for, and the answer is almost always ‘Need to do something with Artificial Intelligence (AI)’. In some responses you may replace AI with ‘Machine Learning (ML)’ but that’s close enough, given ML is a prerequisite for AI. Everybody wants to do AI, but before we embark on that journey, we need to build a rock-solid data pipeline that creates, feeds and maintains a Hub. Ask any data scientist how much time (s)he spends cleansing and wrangling data just for their specific project. The answer will bring out the value of a centralized Hub. Plus, the Hub enables more output from the Data Science team, as they can focus purely on their core skill: data science, not data cleansing or data wrangling. Remember, WORM is good, WORO is not :)
Data Engineering is thus a fundamental requirement for all data-driven businesses. Data is one of the fundamental drivers of positive change in the business. Many IT organizations have modified their mindset to support Data Engineering as a core competency, and this is manifesting rapidly due to the criticality of this new business need.
Monica Rogati’s excellent article on Hackernoon (AI Hierarchy of Needs) from June 2017 is still very relevant to our topic of discussion. The following graphic referenced from that article is an exposition of the core idea that, in life, we need to get the basics done first before reaching for the esoteric aspects. My two customizations to Ms. Rogati’s graphic are the use of ELTn instead of ETL and outlining the Hub’s function in the pyramid.
We thank Mr. Abraham Maslow for showcasing and illustrating his work using the pyramid many moons ago. Figure 1 is a picture worth a thousand words. Thank you, Ms. Rogati!
(Figure 1: Can we please walk before we try to run?)
Structure is in the eye of the data practitioner!
Just as beauty is in the eye of the beholder, so is the inherent structure of data — it is in the eye of the data practitioner. The demarcation of what one wants to store in a relational table versus a NoSQL datastore or BLOB storage is a relevant discussion to engage in. However, the value of delineation of data as structured, semi-structured and unstructured is up for debate. Please allow me to explain this in Figures 2 & 3 as we take a look at a sample web log and a tweet, both regularly classified as unstructured data.
(Figure 2: Data Structure of a sample Web Log)
Figure 2 illustrates the contents of a publicly available web log from NASA.
One immediately wonders the reason why a web log is classified as unstructured, as the above contents can very easily be stored in a relational table. Now, we may not want to store this in a relational table for a variety of other reasons (such as volume of data and in-turn performance of retrievals) but that is a different discussion altogether. That still doesn’t change the fact that the above web log sample is well structured data.
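To make the point concrete, here is a minimal Python sketch that turns one web log line into named columns. It assumes the log follows the widely used Common Log Format (host, identity, user, timestamp, request, status, bytes); the sample line is illustrative and the function name `parse_log_line` is hypothetical.

```python
import re
from typing import Dict, Optional

# Common Log Format: host ident user [timestamp] "request" status bytes
CLF_PATTERN = re.compile(
    r'^(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)$'
)


def parse_log_line(line: str) -> Optional[Dict[str, str]]:
    """Return the line as a dict of named fields, or None if it does not match."""
    match = CLF_PATTERN.match(line.strip())
    return match.groupdict() if match else None


# Illustrative line in Common Log Format.
sample = (
    '199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] '
    '"GET /history/apollo/ HTTP/1.0" 200 6245'
)
row = parse_log_line(sample)
```

Each key of `row` maps naturally to a column of a relational table, which is the sense in which the log is already structured.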
Let’s now take a look at the structure of a sample tweet:
(Figure 3: Data Structure of a Tweet)
Figure 3 illustrates the structure of a tweet. Yes, it is in JSON format and the data is stored in a NoSQL database, but why does that necessarily make this unstructured? In the eyes of a data practitioner, this is a beautiful, flexible, structured business object.
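As a small illustration of that structure, the sketch below loads a simplified tweet-like JSON document and reads fields from it by name. The document is a made-up example with only a handful of fields, not the full payload shown in Figure 3.

```python
import json

# A simplified, made-up tweet-like document (illustration only).
tweet_json = """
{
  "id": 1,
  "text": "Clean data first, AI later.",
  "user": {"screen_name": "example_user"},
  "entities": {"hashtags": [{"text": "data"}, {"text": "serverless"}]}
}
"""

tweet = json.loads(tweet_json)

# Nested fields are addressable by path, just like attributes of a business object.
author = tweet["user"]["screen_name"]
hashtags = [tag["text"] for tag in tweet["entities"]["hashtags"]]
```

Nothing here requires a table, yet every element has a name, a type and a place in a hierarchy.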
For a data practitioner, all data possesses structure. Structure is present even in audio and video files. Now a video file may not possess the ‘required structure’ to store it in a ‘table’. But then again, we seem to inadvertently (and in my humble opinion incorrectly) equate the existence of structure to the existence of a table. For the above reasons and the purposes of this article, data shall be classified as relational and non-relational. If it makes sense to store it in a table, it is relational. If it does not, it is non-relational. There are many non-relational data stores that are fit-for-purpose, depending on the data’s characteristics. Let’s use the one that gets the job done right.
Relevance of Quiet Time & Workload Costing
In Part II of the prior article series, I introduced the concept of Quiet Time as a relevant yardstick for deciding whether or not a given workload qualifies to be serverless. It is important to note that not all workloads truly qualify for serverless computing. A Hub is a poster child for a serverless computing implementation, due to the nature of its processing. Hubs are mostly unpredictable with regard to the data volumes that need to be processed on a given day.
In addition, the timing of ingestion and other related tasks require dynamic scaling for optimal performance. More often than not, a Hub also has pockets of idle time during a 24-hr time period. All of this makes it an ideal candidate for serverless.
The following is a simple formula to calculate Quiet Time(qt) of a workload, that assists in identifying the feasibility of serverless computing:
qt = round (((x/86400) * 100),2) %
where, x = ‘workload idle time’, measured in seconds over a 24-hr period
In the above formula, Quiet Time (qt) is calculated as a percentage of the workload’s idle time (x) over the elapsed time in a day — 24 hours = 86,400 seconds. qt is important in determining the relevance of serverless computing in a given context. For example, if a given workload is idle on an average of 25,300 seconds on any given day, this implies that x=25300, allowing us to derive:
qt = round (((25300/86400) * 100),2) = 29.28%.
In the above example, we conclude that the application is quiet for 29.28% of any given day, thus making it a potential candidate for serverless computing. We will know for sure once we complete the costing exercise. The point is this: if your application does not demonstrate pockets of idle time/quiet time and is constantly processing 24x7x365, serverless may not be the right choice.
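The Quiet Time formula translates directly into a few lines of Python. This is a minimal sketch; the function name `quiet_time_pct` and the range check are additions for illustration and are not part of the original formula.

```python
SECONDS_PER_DAY = 86_400  # 24 hours


def quiet_time_pct(idle_seconds: float) -> float:
    """Quiet Time (qt): idle time as a percentage of a 24-hr period."""
    if not 0 <= idle_seconds <= SECONDS_PER_DAY:
        raise ValueError("idle_seconds must be between 0 and 86400")
    return round((idle_seconds / SECONDS_PER_DAY) * 100, 2)


# x = 25300 seconds of idle time, as in the example above.
qt = quiet_time_pct(25_300)
print(f"qt = {qt}%")
```

Feeding the function with idle time measured over several representative days, rather than a single day, gives a steadier picture of the workload.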
We now calculate the cost of the relevant serverless and server-based services to determine which one is better. The lower of the two costs will guide us to the preferred option:
- scc — serverless computing cost
This is the cost for utilizing serverless computing services. Serverless computing costs are usually comprised of number of function execution requests, CPU & RAM requested in AWS Step Functions/Lambda, API Gateway requests for such function execution requests and any other service that is specifically required to support the given workload (including database costs such as Amazon DynamoDB or Amazon Aurora Serverless). There are no compute costs associated with idle time/quiet time in serverless computing.
- tcc — traditional computing cost (server-based)
This is the cost for utilizing server-based computing services. These computing costs are usually comprised of compute engines for application and database servers, programming language software licensing, database software licensing, personnel costs associated with software maintenance of the said server-based services (install, patching, upgrades of operating system, programming language software and other 3rd party software) and actual cost of software maintenance fees (paid to vendors). In this model, there are compute and other costs associated with idle time/quiet time for all compute platforms.
If scc < tcc, it is clear that the workload qualifies for serverless even from a cost standpoint. If scc > tcc, it is obvious that serverless will cost more than the server-based model. However, there may be other reasons that drive the serverless model to be the preferred option. These may include (among others), a lack of deep technical talent within the organization to support the upgrade and maintenance of the workload’s software and database stack in the relevant compute platforms.
- It may be argued that the AWS support fees should also be factored into scc, and this could be done if there were a method to determine prorated AWS support fees, calculated only for the serverless services. This is easier said than done, so in the interest of simplicity, support fees should be considered a common cost between both models and thus excluded.
- Storage costs for services such as S3, networking, data transfer (ingress and egress) can also be deemed as common to both scc and tcc. Thus, these components can also be removed from the costing. Add these costs to scc ONLY if they are relevant to it and not to tcc.
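The comparison of scc and tcc can be sketched as a small Python helper. The class name `WorkloadCosting`, the component names and all of the monthly figures below are hypothetical placeholders; costs common to both models are assumed to be left out, as described above.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class WorkloadCosting:
    """Monthly cost components, excluding costs common to both models."""
    serverless: Dict[str, float] = field(default_factory=dict)
    server_based: Dict[str, float] = field(default_factory=dict)

    @property
    def scc(self) -> float:
        """Serverless computing cost."""
        return sum(self.serverless.values())

    @property
    def tcc(self) -> float:
        """Traditional (server-based) computing cost."""
        return sum(self.server_based.values())

    def cheaper_option(self) -> str:
        if self.scc < self.tcc:
            return "serverless"
        if self.scc > self.tcc:
            return "server-based"
        return "tie"


# Made-up monthly figures, for illustration only.
costing = WorkloadCosting(
    serverless={"function_requests": 120.0, "api_gateway": 35.0, "database": 210.0},
    server_based={"compute": 400.0, "licenses": 150.0, "maintenance_staff": 300.0},
)
print(costing.scc, costing.tcc, costing.cheaper_option())
```

The result is only the cost signal; non-cost factors such as available in-house skills still have to be weighed separately.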
Architecture of a Serverless Data Integration Hub on AWS
The 4 Layers
There are 4 Layers/Pillars in the Hub and it manifests on AWS as follows:
(Figure 4: Data Integration Hub on AWS - 4 Layers/Pillars)
Data flows through the 4 layers of the Hub. Each layer supports an independent function of the Data Integration Architecture. The 4 layers are Ingestion, Persistence, Transformation and Analytics/Machine Learning/AI.
The architecture is purpose-built utilizing independent services. This enables scaling each layer independently, supports abstraction to manage complexity, decouples services to avoid serialization, automatically provisions resources (where feasible) and facilitates an asynchronous execution model. In many ways these are fundamental tenets of cloud computing.
Drilldown into the 4 Layers/Pillars
Figure 5 illustrates the relevant services and details within each Layer/Pillar:
(Figure 5: Serverless Data Integration Hub on AWS - Detailed Architecture)
Now let’s dive a little deeper into each Layer/Pillar.
Ingestion is the first layer (Layer #1) where data from multiple sources enters the Hub. For batch extracts, the landing zone for the data is a serverless Amazon S3 bucket where the data is pre-processed for certain rudimentary checks with Amazon Athena. A microservice (AWS Step Functions/Lambda) processes the data on the event of data successfully landing in S3. For streaming data, Amazon Kinesis provides the equivalent real-time processing, handling streams that arrive at the Hub in sporadic bursts.
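The event-driven hand-off from S3 to a Lambda microservice can be sketched as follows. This is a minimal Python illustration that only reads the bucket name and object key from the S3 event notification passed to the function; the helper name `landed_objects`, the bucket name and the returned summary are hypothetical, and the real pre-processing step is left out.

```python
from typing import Any, Dict, List
from urllib.parse import unquote_plus


def landed_objects(event: Dict[str, Any]) -> List[Dict[str, str]]:
    """Extract bucket/key pairs from an S3 event notification."""
    objects = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        objects.append({
            "bucket": s3["bucket"]["name"],
            # Object keys arrive URL-encoded in S3 event notifications.
            "key": unquote_plus(s3["object"]["key"]),
        })
    return objects


def handler(event: Dict[str, Any], context: Any) -> Dict[str, Any]:
    """Lambda entry point: fires when a batch extract lands in the bucket."""
    objects = landed_objects(event)
    # A real microservice would start its checks and hand-off to Layer #2 here.
    return {"received": len(objects), "objects": objects}


# Trimmed-down sample event with a hypothetical bucket and key.
sample_event = {
    "Records": [
        {"s3": {"bucket": {"name": "hub-landing-zone"},
                "object": {"key": "incoming/daily+extract.csv"}}}
    ]
}
result = handler(sample_event, None)
```

Because the function is driven purely by the event payload, it can be exercised locally with a sample event before it is wired to a bucket.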
Persistence, the second layer (Layer #2) of the architecture, plays the crucial role of storing data in the Hub. Amazon DynamoDB plays a huge role here, as it is the centerpiece of Data Integration. Layer #2 stores, manages and maintains business objects. When data is ingested in Layer #1, it is transported to Layer #2 and stored in a near-raw format within RAWDB (a set of tables for RAW data). More complex data validation routines are performed in the Transformation Layer. Business Insights from Layer #4 are persisted in Amazon Aurora Serverless when a business function’s data consumption model is table-based rather than built on JSON-based business objects.
Transformation is the third layer (Layer #3), which handles the complexity of creating CLEAN business objects. The transformation layer’s complexity and heavy lifting relate to ensuring that data conforms to data quality rules, business rules and compliance rules/policies. The arrival of data in RAWDB of Layer #2 is the next event, which triggers the main transformation microservice and its related sub-microservices. These event-based microservices are implemented in AWS Step Functions/Lambda. Data that fully conforms to all predefined transformation rules is persisted in CLEANDB (another logical container with a set of relevant tables for CLEAN data) and erroneous/non-conformant data is stored in ERRORDB (the 3rd logical container for ERROR data).
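The routing of records to CLEANDB or ERRORDB can be pictured with a small rule-checking sketch. This is a plain Python illustration with in-memory lists standing in for the two containers; the rule names, the record fields and the `route` function are hypothetical and do not represent the Hub's actual rule set.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, object]
Rule = Tuple[str, Callable[[Record], bool]]

# Hypothetical data quality rules, for illustration only.
RULES: List[Rule] = [
    ("customer_id is present", lambda r: bool(r.get("customer_id"))),
    ("email contains @", lambda r: "@" in str(r.get("email", ""))),
]


def route(records: List[Record], rules: List[Rule] = RULES) -> Tuple[List[Record], List[Record]]:
    """Split records into (clean, errors); errors keep the names of failed rules."""
    clean: List[Record] = []
    errors: List[Record] = []
    for record in records:
        failed = [name for name, check in rules if not check(record)]
        if failed:
            errors.append({**record, "failed_rules": failed})
        else:
            clean.append(record)
    return clean, errors


raw_records = [
    {"customer_id": "C-1", "email": "ada@example.com"},
    {"customer_id": "", "email": "not-an-email"},
]
cleandb, errordb = route(raw_records)
```

Keeping the list of failed rules with each rejected record makes it easier to correct the data and reprocess it later.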
Analytics/Machine Learning (ML)/Artificial Intelligence (AI) is the fourth layer (Layer #4) where Business Insights (BI) are generated. This layer is triggered by the arrival of data in CLEANDB within Layer #2 (the next event). This layer supports the calculation and storing of Key Performance Indicators (KPIs) for the business and the management of Analytics/ML/AI jobs. The results of all jobs (BI) are stored in INSIGHTSDB (the 4th logical container for INSIGHTS within Amazon DynamoDB and Amazon Aurora Serverless as appropriate). Multiple services are utilized in this layer for generation of Business Insights (Amazon S3, Amazon Athena, AWS Fargate, Amazon SageMaker). Please note that Amazon SageMaker is not serverless and is listed here for completeness for machine learning projects. AWS Fargate is listed to facilitate the concept of serverless containers that run analytics and machine learning algorithms to generate BI and store it in INSIGHTSDB.
When data passes through all 4 layers, it completes a single data integration cycle. The cycle is repeated as and when new/modified data flows into the Hub from back-end systems, or when data is augmented, corrected and reprocessed to move from ERRORDB to CLEANDB.
In the long run, the persistence and management of historical data (RAWDB & CLEANDB) needs to be addressed via standard Information Lifecycle Management (ILM) measures that utilize various tier-based-data-persistence services. AWS supports this with a multi-tiered storage provisioning model with Amazon S3 Glacier and Amazon S3 Glacier Deep Archive. This ensures data persistence costs are maintained at optimal levels while adhering to all regulatory requirements.
Visualization also plays an important role in information delivery within the enterprise. This is where the BI Dashboards come to life and deliver value to the business. In maintaining the serverless paradigm, Amazon QuickSight is an option for the future, when direct access of data via RESTful endpoints in API Gateway is supported.
On generation of BI, applications need to be enabled to consume data from INSIGHTSDB & CLEANDB in real-time. This consumption is done via the Data Access Layer (DAL) as illustrated in Figure 5. Utilizing RESTful endpoints (API Gateway), consumption of BI and the underlying data for the various business objects is supported. Data retrieval from relational data stores such as Amazon Aurora Serverless, is achieved by utilizing the Data API for Amazon Aurora Serverless. It is important to note that API Gateway is the single doorway for data consumption.
This service virtually eliminates unwarranted external data breaches and data leaks, as it serves data only when the data request is accompanied by the required credentials. These credentials (security credentials and data entitlements) are received from an enterprise’s federated authentication systems (AD, LDAP etc.) in concert with AWS Identity and Access Management (IAM) and Amazon Cognito.
The well-defined RESTful endpoints created with API Gateway not only obviate the need for persistent database connection management to the Hub, but also ease the impact of changes to data structures. For example, an application that is receiving and processing JSON documents (business objects) from the RESTful endpoints can continue to do so regardless of data structure changes (adding or dropping of data elements). This ensures runtime flexibility of both the applications and the Data Hub in an ever-changing business environment.
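The resilience to structure changes comes from how a consumer reads the JSON document. The sketch below shows one tolerant way to do this in Python: it picks only the fields it needs and supplies a default for optional ones. The field names and the `customer_summary` function are hypothetical and are not part of the Hub's API.

```python
import json
from typing import Dict


def customer_summary(document: str) -> Dict[str, str]:
    """Read only the fields this application needs from a business object."""
    obj = json.loads(document)
    return {
        "customer_id": obj["customer_id"],  # required by this consumer
        "name": obj.get("name", ""),        # optional, with a default
    }


# Two versions of the same hypothetical business object:
# v2 drops the "fax" element and adds "loyalty_tier".
v1 = '{"customer_id": "C-1", "name": "Ada", "fax": "555-0100"}'
v2 = '{"customer_id": "C-1", "name": "Ada", "loyalty_tier": "gold"}'

summary_v1 = customer_summary(v1)
summary_v2 = customer_summary(v2)
```

A consumer written this way keeps working as elements it does not use are added or dropped, although removing an element it requires would still break it.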
Although not explicitly illustrated, all standard networking and isolation security best practices should be assumed. This is in line with the AWS Well-Architected Framework.
What does the Hub really deliver?
The goal of the Hub is to deliver up-to-date business objects that provide a 360-degree view of any facet of the business. Figure 6 illustrates an integrated Customer Business Object that is built in a ‘Hub & Spoke Model’.
(Figure 6: Sample Integrated Business Object - Customer)
AWS provides a rich set of services to architect and build Serverless Data Integration Hubs. The value of high-quality and standardized data delivered to the enterprise is a critical function that needs to be proactively addressed via data engineering. To state the obvious, good data generates meaningful business insights. And that is true value added to the business. I wish you the very best in 2020 and beyond for all of your data integration efforts. Take good care of yourself and stay in touch!
- Can we please walk before we try to run?
- Data Structure of a Web Log
- Data Structure of a Tweet
- AWS Architecture Icons
- Serverless Data Integration — Part I (2019), Gaja Vaidyanatha
- Serverless Data Integration — Part II (2019), Gaja Vaidyanatha