Introduction

Happy New Year! If data is the new oil, why do we experience so many backfires in our systems? In a prior 2-part article series titled Serverless Data Integration (published on Medium.com), the importance of data integration, particularly in the serverless realm, was discussed in good detail. We concluded that data integration is a critical business need that has to be proactively and centrally addressed by every IT organization.

To reiterate, data integration focuses on delivering value to the business by enabling analytics, machine learning and AI initiatives fed with clean and standardized data. Without high-quality data, the business insights generated by those initiatives will be questionable at best.

In case you've missed it, here are the links to the previous article series:
Serverless Data Integration — Part I (2019)
Serverless Data Integration — Part II (2019)

In this article, I'd like to clarify some additional points related to our topic of discussion. I also wish to share a sample reference architecture of a Data Integration Hub (henceforth referenced as the Hub) implemented on Amazon Web Services (AWS).

Clearing the Air

Data Integration Hubs are not Data Lakes

Data Lakes store data in near-exact native format. That may be adequate for certain bespoke analytics initiatives, but a centralized data service that cleanses, standardizes and makes data compliant with all prevailing regulatory requirements delivers better value to the business. The value proposition of a Hub is simple: write once read many (WORM), while ensuring data quality, standardization and compliance. Keep it current and don't let the data go stale!

To achieve this, a Hub employs the concept of ELTn (Extract Load Transform(n)) instead of the classic ETL (Extract Transform Load). ELTn implies that transformation is not a single occurrence that happens before the data is loaded. It is a continuous process that occurs multiple times in the lifetime of the data supporting a specific business function. The goal is to eventually reach a high level of data hygiene such that all data is CLEAN.

The Hub enables the paradigm of continually maintaining up-to-date 'business objects', matching the transformation that the business undergoes. Once data from a given ingestion process is marked as CLEAN, all downstream applications can consume data from the Hub. This minimizes the load on back-end systems of record for data extracts, not to mention the repeated and wasteful efforts in data cleansing. Plus, data quality and compliance adherence can be handled once, in a centralized location. This enables high-quality and relevant enterprise data delivery. Let's make data clean and compliant once. This is what a Hub does.

The Hub solves the problem of rampant Data Silos that have emerged over many years of bad data management. Contrasted with a Data Lake, a Hub is a living, breathing and constantly evolving centralized, authorized data distributor for the enterprise. As mentioned before, Data Lakes bring together disparate data sources and store them in near-exact original format. A Hub does much more by providing the required quality, standardization and compliance for all data initiatives within the enterprise.

Can we please walk before we try to run?

Ask any C-level executive today about the top priority (s)he is on the hook for, and the answer is almost always 'Need to do something with Artificial Intelligence (AI)'. In some responses you may replace AI with 'Machine Learning (ML)', but that's close enough, given ML is a prerequisite for AI.
Everybody wants to do AI, but before we embark on that journey, we need to build a rock-solid data pipeline that creates, feeds and maintains a Hub. Ask any data scientist how much time (s)he spends cleansing and wrangling data just for their specific project. The answer to this question will bring out the apparent value of a centralized Hub. Plus, the Hub enables more output from the Data Science team, as they can focus purely on their core skill, data science, not data cleansing or data wrangling. Remember, WORM is good, WORO is not :)

Data Engineering is thus a fundamental requirement for all data-driven businesses. Data is one of the fundamental drivers of positive change in a business. Many IT organizations have modified their mindset to support Data Engineering as a core competency, and this shift is manifesting rapidly due to the criticality of this new business need.

Monica Rogati's excellent article on Hackernoon (The AI Hierarchy of Needs) from June 2017 is still very relevant to our topic of discussion. The following graphic, referenced from that article, is an exposition of the core idea that, in life, we need to get the basics done first before reaching for the esoteric. My two customizations to Ms. Rogati's graphic are the use of ELTn instead of ETL and the outlining of the Hub's function in the pyramid. We thank Mr. Abraham Maslow for showcasing and illustrating his work using the pyramid many moons ago. Figure 1 is a picture worth a thousand words. Thank you, Ms. Rogati!

(Figure 1: Can we please walk before we try to run?)

Structure is in the eye of the data practitioner!

Just as beauty is in the eye of the beholder, so is the inherent structure of data: it is in the eye of the data practitioner. The demarcation of what one wants to store in a relational table versus a NoSQL datastore or BLOB storage is a relevant discussion to engage in. However, the value of delineating data as structured, semi-structured and unstructured is up for debate. Please allow me to explain this in Figures 2 & 3, as we take a look at a sample web log and a tweet, both regularly classified as unstructured data.

(Figure 2: Data Structure of a sample Web Log)

Figure 2 illustrates the contents of a publicly available web log from NASA. One immediately wonders why a web log is classified as unstructured, as the above contents can very easily be stored in a relational table (a small parsing sketch appears at the end of this section). Now, we may not want to store this in a relational table for a variety of other reasons (such as the volume of data and, in turn, the performance of retrievals), but that is a different discussion altogether. It still doesn't change the fact that the above web log sample is well-structured data.

Let's now take a look at the structure of a sample tweet:

(Figure 3: Data Structure of a Tweet)

Figure 3 illustrates the structure of a tweet. Yes, it is in JSON format and the data is stored in a NoSQL database, but why does that necessarily make it unstructured? In the eyes of a data practitioner, this is a beautiful, flexible, structured business object. For a data practitioner, all data possesses structure. Structure is present even in audio and video files. Now, a video file may not possess the 'required structure' to store it in a 'table'. But then again, we seem to inadvertently (and in my humble opinion incorrectly) equate the existence of structure to the existence of a table.
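To reinforce the point from Figure 2, here is a minimal Python sketch that parses a NASA-style web log line into named fields. The sample line and the regular expression assume the Apache Common Log Format; they are illustrative, not the exact record shown in the figure.

```python
import re

# Illustrative sample line in the Common Log Format (an assumption,
# not the exact record shown in Figure 2).
sample = '199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245'

# Each "unstructured" log line decomposes cleanly into named fields,
# i.e. columns that could just as well live in a relational table.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<resource>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
)

match = LOG_PATTERN.match(sample)
if match:
    print(match.groupdict())
    # {'host': '199.72.81.55', 'timestamp': '01/Jul/1995:00:00:01 -0400',
    #  'method': 'GET', 'resource': '/history/apollo/', 'protocol': 'HTTP/1.0',
    #  'status': '200', 'bytes': '6245'}
```

The point stands: the "structure" was always there; only the container (table, JSON document, flat file) differs.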
For the above reasons and for the purposes of this article, data shall be classified as relational and non-relational. If it makes sense to store it in a table, it is relational. If it does not, it is non-relational. There are many non-relational data stores that are fit for purpose, depending on the data's characteristics. Let's use the one that gets the job done right.

Relevance of Quiet Time & Workload Costing

In Part II of the prior article series, I introduced the concept of Quiet Time as a relevant yardstick to utilize while deciding whether or not a given workload qualifies to be serverless. It is important to note that not all workloads truly qualify for serverless computing.

A Hub is a poster child for a serverless computing implementation, due to the nature of its processing. Hubs are mostly unpredictable with regard to the data volumes that need to be processed on a given day. In addition, the timing of ingestion and other related tasks requires dynamic scaling for optimal performance. More often than not, a Hub also has pockets of idle time during a 24-hour period. All of this makes it an ideal candidate for serverless.

The following is a simple formula to calculate the Quiet Time (qt) of a workload, which assists in identifying the feasibility of serverless computing:

qt = round(((x / 86400) * 100), 2) %

where x = workload idle time, measured in seconds over a 24-hour period

In the above formula, Quiet Time (qt) is calculated as a percentage of the workload's idle time (x) over the elapsed time in a day (24 hours = 86,400 seconds). qt is important in determining the relevance of serverless computing in a given context. For example, if a given workload is idle for an average of 25,300 seconds on any given day, this implies that x = 25300, allowing us to derive:

qt = round(((25300 / 86400) * 100), 2) = 29.28%

In the above example, we conclude that the application is quiet for 29.28% of any given day, thus making it a potential candidate for serverless computing. We will know for sure once we complete the costing exercise. The point is this: if your application does not demonstrate pockets of idle time/quiet time and is constantly processing 24x7x365, serverless may not be the right choice.

We now calculate the cost of the relevant serverless and server-based services to determine which one is better. The lower of the two costs will guide us to the preferred option:

scc (serverless computing cost): This is the cost of utilizing serverless computing services. Serverless computing costs usually comprise the number of function execution requests, the CPU & RAM requested in AWS Step Functions/Lambda, the API Gateway requests for such function executions, and any other service that is specifically required to support the given workload (including database costs such as Amazon DynamoDB or Amazon Aurora Serverless). There are no compute costs associated with idle time/quiet time in serverless computing.

tcc (traditional computing cost, server-based): This is the cost of utilizing server-based computing services. These costs usually comprise compute engines for application and database servers, programming language software licensing, database software licensing, personnel costs associated with software maintenance of the said server-based services (installation, patching and upgrades of the operating system, programming language software and other 3rd-party software) and the actual cost of software maintenance fees paid to vendors. In this model, there are compute and other costs associated with idle time/quiet time for all compute platforms.
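Here is a small Python sketch of the Quiet Time calculation defined above, together with the cost test discussed next. The idle-time figure of 25,300 seconds is carried over from the worked example; the scc and tcc amounts are placeholders purely for illustration.

```python
SECONDS_PER_DAY = 86400

def quiet_time(idle_seconds: float) -> float:
    """Quiet Time (qt): idle time expressed as a percentage of a 24-hour day."""
    return round((idle_seconds / SECONDS_PER_DAY) * 100, 2)

def prefer_serverless(scc: float, tcc: float) -> bool:
    """The cost test described below: serverless wins when scc < tcc."""
    return scc < tcc

# Worked example from the text: a workload idle for 25,300 seconds per day.
print(f"qt = {quiet_time(25300)}%")                # qt = 29.28%

# Placeholder monthly cost figures (assumptions, not derived from any pricing).
print(prefer_serverless(scc=410.00, tcc=1275.00))  # True
```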
If scc < tcc, it is clear that the workload qualifies for serverless even from a cost standpoint. If scc > tcc, it is obvious that serverless will cost more than the server-based model. However, there may be other reasons that drive the serverless model to be the preferred option. These may include (among others) a lack of deep technical talent within the organization to support the upgrade and maintenance of the workload's software and database stack on the relevant compute platforms.

Notes:

It may be argued that AWS support fees should also be factored into scc, and this could be done if there were a method to determine prorated AWS support fees calculated only for the serverless services. This is easier said than done, so in the interest of simplicity, support fees should be considered a common cost between both models and thus excluded.

Storage costs for services such as S3, networking and data transfer (ingress and egress) can also be deemed common to both scc and tcc. Thus, these components can also be removed from the costing. Add these costs to scc ONLY if they are relevant to it and not to tcc.

Architecture of a Serverless Data Integration Hub on AWS

The 4 Layers

There are 4 Layers/Pillars in the Hub, and it manifests on AWS as follows:

(Figure 4: Data Integration Hub on AWS - 4 Layers/Pillars)

Data flows through the 4 layers of the Hub. Each layer supports an independent function of the Data Integration architecture. The 4 Layers/Pillars are:
Ingestion
Persistence
Transformation
Analytics/ML/AI

The architecture is purpose-built utilizing independent services. This enables scaling each layer independently, supports abstraction to manage complexity, decouples services to avoid serialization, automatically provisions resources (where feasible) and facilitates an asynchronous execution model. In many ways, these are fundamental tenets of cloud computing.

Drilldown into the 4 Layers/Pillars

Figure 5 illustrates the relevant services and details within each Layer/Pillar:

(Figure 5: Serverless Data Integration Hub on AWS - Detailed Architecture)

Now let's dive a little deeper into each Layer/Pillar.

Ingestion is the first layer (Layer #1), where data from multiple sources enters the Hub. For batch extracts, the landing zone for the data is a serverless Amazon S3 bucket, where the data is pre-processed for certain rudimentary checks with Amazon Athena. A microservice (AWS Step Functions/Lambda) processes the data on the event of data successfully landing in S3. The equivalent real-time data streams processing service, Amazon Kinesis, takes care of the streams of data that arrive into the Hub in sporadic bursts.

Persistence, the second layer (Layer #2) of the architecture, plays the crucial role of storing data in the Hub. Amazon DynamoDB plays a huge role here, as it is the centerpiece of Data Integration. Layer #2 stores, manages and maintains business objects. When data is ingested in Layer #1, it is transported to Layer #2 and stored in a near-raw format within RAWDB (a set of tables for RAW data). More complex data validation routines are performed in the Transformation layer. The persistence of Business Insights in Layer #4 is done using Amazon Aurora Serverless, if a given business function's data consumption model is table-based data instead of JSON-based business objects.

Transformation is the third layer (Layer #3), which handles the complexity of creating CLEAN business objects. The transformation layer's complexity and heavy lifting relate to ensuring that data conforms to data quality rules, business rules and compliance rules/policies. The arrival of data in RAWDB of Layer #2 is the next event, which triggers the main transformation microservice and its related sub-microservices to fire. These event-based microservices are implemented in AWS Step Functions/Lambda. Data that fully conforms to all predefined transformation rules is persisted in CLEANDB (another logical container with a set of relevant tables for CLEAN data), and erroneous/non-conformant data is stored in ERRORDB (the 3rd logical container, for ERROR data).
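To make Layers #1 and #2 concrete, here is a minimal sketch (Python with boto3) of a Lambda function fired by the S3 "object created" event that lands each record in RAWDB in near-raw format. The bucket, table and attribute names, and the JSON Lines file layout, are assumptions for illustration only, not prescribed by the architecture.

```python
import json
import uuid
from decimal import Decimal

import boto3

s3 = boto3.client("s3")
dynamodb = boto3.resource("dynamodb")

# Hypothetical RAWDB table for customer business objects (name is an assumption).
RAW_TABLE = dynamodb.Table("RAWDB_Customer")

def handler(event, context):
    """Layer #1 -> Layer #2: triggered by an S3 ObjectCreated event,
    persists each landed record in RAWDB in near-raw format."""
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        # Read the landed batch file (assumed here to be JSON Lines).
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")

        with RAW_TABLE.batch_writer() as batch:
            for line in filter(None, body.splitlines()):
                # DynamoDB's resource layer requires Decimal instead of float.
                item = json.loads(line, parse_float=Decimal)
                batch.put_item(Item={
                    "raw_id": str(uuid.uuid4()),            # partition key (assumed schema)
                    "source_object": f"s3://{bucket}/{key}",
                    "payload": item,                        # near-raw business object
                    "status": "RAW",
                })
    return {"status": "ingested"}
```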
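And a companion sketch for Layer #3: a transformation microservice that reacts to the "data arrived in RAWDB" event (shown here as a DynamoDB Stream trigger; the architecture above orchestrates these microservices with AWS Step Functions) and routes each business object to CLEANDB or ERRORDB. The table names and the single required-fields rule are placeholders; a real Hub would apply full data quality, business and compliance rule sets.

```python
import boto3
from boto3.dynamodb.types import TypeDeserializer

dynamodb = boto3.resource("dynamodb")
CLEAN_TABLE = dynamodb.Table("CLEANDB_Customer")   # table names are assumptions
ERROR_TABLE = dynamodb.Table("ERRORDB_Customer")
deserializer = TypeDeserializer()

REQUIRED_FIELDS = ("customer_id", "email", "country")  # placeholder rule set

def validate(payload: dict) -> list:
    """Return a list of rule violations; an empty list means the object is CLEAN."""
    return [f"missing:{field}" for field in REQUIRED_FIELDS if not payload.get(field)]

def handler(event, context):
    """Layer #3: fired by new items arriving in RAWDB."""
    for record in event.get("Records", []):
        if record.get("eventName") != "INSERT":
            continue

        # Convert the stream's attribute-value format into plain Python types.
        image = record["dynamodb"]["NewImage"]
        business_object = {k: deserializer.deserialize(v) for k, v in image.items()}

        violations = validate(business_object.get("payload", {}))
        if violations:
            # Non-conformant data goes to ERRORDB for correction and reprocessing.
            ERROR_TABLE.put_item(Item={**business_object, "status": "ERROR",
                                       "violations": violations})
        else:
            # Fully conformant data is promoted to CLEANDB.
            CLEAN_TABLE.put_item(Item={**business_object, "status": "CLEAN"})
```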
Analytics/Machine Learning (ML)/Artificial Intelligence (AI) is the fourth layer (Layer #4), where Business Insights (BI) are generated. This layer is triggered by the next event: the arrival of data in CLEANDB within Layer #2. This layer supports the calculation and storing of Key Performance Indicators (KPIs) for the business and the management of Analytics/ML/AI jobs. The results of all jobs (BI) are stored in INSIGHTSDB (the 4th logical container, for INSIGHTS, within Amazon DynamoDB and Amazon Aurora Serverless as appropriate). Multiple serverless services are utilized in this layer for the generation of Business Insights (Amazon S3, Amazon Athena, Amazon Fargate, Amazon SageMaker). Please note that Amazon SageMaker is not serverless and is listed here for completeness for machine learning projects. Amazon Fargate is listed to facilitate the concept of serverless containers that run analytics and machine learning algorithms to generate BI and store it in INSIGHTSDB.

When data passes through all 4 layers, it completes a single data integration cycle. The cycle is repeated as and when new/modified data flows into the Hub from back-end systems, or when data is augmented, corrected and reprocessed to move from ERRORDB to CLEANDB.

Note: In the long run, the persistence and management of historical data (RAWDB & CLEANDB) needs to be addressed via standard Information Lifecycle Management (ILM) measures that utilize various tier-based data persistence services. AWS supports this with a multi-tiered storage provisioning model with Amazon S3 Glacier and Amazon S3 Glacier Deep Archive. This ensures data persistence costs are maintained at optimal levels while adhering to all regulatory requirements.

Visualization also plays an important role in information delivery within the enterprise. This is where the BI dashboards come to life and deliver value to the business. In keeping with the serverless paradigm, Amazon QuickSight is an option for the future, when direct access to data via RESTful endpoints in API Gateway is supported.

Consumption Pattern

On generation of BI, applications need to be enabled to consume data from INSIGHTSDB & CLEANDB in real time. This consumption is done via the Data Access Layer (DAL), as illustrated in Figure 5. Utilizing RESTful endpoints (API Gateway), the consumption of BI and of the underlying data for the various business objects is supported. Data retrieval from relational data stores such as Amazon Aurora Serverless is achieved by utilizing the Data API for Amazon Aurora Serverless. It is important to note that API Gateway is the single doorway for data consumption.
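As a sketch of this consumption path for relational data, here is how a DAL microservice sitting behind an API Gateway endpoint might query INSIGHTSDB in Amazon Aurora Serverless through the Data API (the rds-data interface), avoiding persistent database connections. The ARNs, database, table and column names are placeholders, not values from the reference architecture.

```python
import boto3

rds_data = boto3.client("rds-data")

# Placeholder identifiers; supplied via configuration in a real deployment.
CLUSTER_ARN = "arn:aws:rds:us-east-1:123456789012:cluster:hub-insights"
SECRET_ARN = "arn:aws:secretsmanager:us-east-1:123456789012:secret:hub-insights-creds"

def get_customer_kpis(customer_id: str) -> list:
    """DAL call behind an API Gateway endpoint: fetch BI/KPIs from INSIGHTSDB."""
    response = rds_data.execute_statement(
        resourceArn=CLUSTER_ARN,
        secretArn=SECRET_ARN,
        database="insightsdb",
        sql="SELECT kpi_name, kpi_value FROM customer_kpis WHERE customer_id = :cid",
        parameters=[{"name": "cid", "value": {"stringValue": customer_id}}],
    )
    # Each record is a list of typed field values, in SELECT column order.
    return [
        {"kpi_name": row[0]["stringValue"], "kpi_value": row[1]["doubleValue"]}
        for row in response["records"]
    ]
```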
API Gateway virtually eliminates all unwarranted external data breaches and data leaks, as it serves data only when the request is accompanied by the required credentials. These required credentials (security credentials and data entitlements) are received from an enterprise's federated authentication systems (AD, LDAP etc.) in concert with AWS Identity and Access Management (IAM) and Amazon Cognito.

The well-defined RESTful endpoints created with API Gateway not only obviate the need for persistent database connection management to the Hub, but also ease the impact of changes to data structures. For example, an application that receives and processes JSON documents (business objects) from the RESTful endpoints can continue to do so regardless of data structure changes (the adding or dropping of data elements). This ensures runtime flexibility of both the applications and the Data Hub in an ever-changing business environment.

Note: Although not explicitly illustrated, all standard networking and isolation security best practices should be assumed. This is in line with the AWS Well-Architected Framework.

What does the Hub really deliver?

The goal of the Hub is to deliver up-to-date business objects that provide a 360-degree view of any facet of the business. Figure 6 illustrates an integrated Customer business object that is built in a 'Hub & Spoke' model.

(Figure 6: Sample Integrated Business Object - Customer)

Conclusion

AWS provides a rich set of services to architect and build Serverless Data Integration Hubs. The delivery of high-quality, standardized data to the enterprise is a critical function that needs to be proactively addressed via data engineering. To state the obvious, good data generates meaningful business insights. And that is true value added to the business. I wish you the very best in 2020 and beyond for all of your data integration efforts. Take good care of yourself and stay in touch!

References

Can we please walk before we try to run?
Data Structure of a Web Log
Data Structure of a Tweet
AWS Architecture Icons
Serverless Data Integration — Part I (2019), Gaja Vaidyanatha
Serverless Data Integration — Part II (2019), Gaja Vaidyanatha

Originally published on LinkedIn — https://www.linkedin.com/pulse/architecting-serverless-data-integration-hubs-aws-2020-vaidyanatha