Aggregating into data lakes is the solution of today, but are Federated Sources the solution of tomorrow?

TL;DR

Data lakes are considered to be a popular solution for uniting insights from ALL the organization's data sources. An emerging alternative is the federated data sources approach, which generates the same output with a fraction of the effort and cost, and without the data duplication. In this post, we'll explore the advantages and disadvantages of both options and where each may make sense for your team.

A perfect database. What If?

Imagine the perfect database. The best aspects of all database types, combined into one, without limitations: the unlimited data storage and processing you get with Hadoop; the fast query execution enjoyed with Redis or DynamoDB; lightning-fast CRUD (create, read, update, delete); RDBMS capabilities like Joins, Unions and aggregation; support for structured and unstructured data; querying as if you're using a data warehouse; with permissions configurable per table, column and row.

I'm not sure that, even if such a super computing database existed, all of these capabilities could logically reside in one database. Anyway, it'll be a while before we get there. Until then, organizations are using multiple databases with different capabilities for different use cases. But that comes with its own issues.

The problem

Two main issues arise from juggling data in multiple databases and database types. First, because each database uses a different API/language, having different database types requires having experts at hand for each DB: IT/admin, DevOps, Application Developers. Backups and maintenance are different for each DB as well. Second, and more importantly, the main resource of the organization (its data) is distributed, duplicated, and cannot be aggregated for global insights in a clean, unified manner.

The data lake solution

The first solution, which has become a common buzzword over the past few years, is to build a data lake.
The rationale for having a data lake is to bring all data into one place, where it can be queried using a single query engine.

However, getting the data into the lake requires many ETL (Extract, Transform and Load) processes using tools like Airflow, AWS Glue and Google Cloud Dataflow, among others. During this process, data is often duplicated and not updated as frequently as needed. Every organization has its data graveyards: huge stores of data that no one knows about or has the courage to delete. With data lakes, the graveyard's size will multiply.

Lastly, ETL processes can raise security and regulation concerns, they have high maintenance and costs associated with them, and current lake solutions are provider-locked, allowing you access to data for some of their own services only.

The Federation solution

Federated Sources bring a new approach to the table. Each database processes and stores its own data as it was meant to. Data is not transferred out. Instead, there is one engine that can query multiple types of databases and seamlessly merge the results.

What are the important features of an engine capable of querying federated sources?

This engine must have three important characteristics:

1. One common API or language to query the databases (SQL?).
2. A mechanism that can unify all the data from disparate sources so it can be queried and aggregated.
3. The security, governance, and auditing controls one has with a data lake solution.

Two important notes about federation:

1. Performance isn't the main benefit of this approach. If your output requires specific capabilities, load the data into the relevant data engine (BigQuery, DynamoDB, etc.) and use it there. But if you need to get business logic from multiple database types, federation will do the job for you.
2. Business logic output will always be relatively small.
For example, the DWH will do what it does best and respond with the aggregated output to this new engine, which will be able to join/merge/union the results with other database outputs in order to help with important business insights. This federation engine will also need an integration for each database that implements that database's API, will have to translate data formats so they can be merged/joined with other results, and will have to expose one common (SQL) syntax to query all these databases.

Who is investing in federated solutions?

Google is investing some effort in the direction of federated sources, with a caveat: Google-only services (and only three, as of now). They sometimes refer to it as "BigQuery federation" and sometimes as "external data sources". In their case, BigQuery acts as the federation engine and SQL is the common syntax used. You can query external resources: Cloud Bigtable, Cloud Storage, and Google Drive. The data sources can be MySQL tables or file types like ORC and Parquet.

Amazon AWS also invests in the federation approach with a new project called "PartiQL". It enables unified query access across multiple data stores and data formats by separating the syntax and semantics of a query from the underlying format of the data or the data store that is being accessed. It is also open source.

Wrapping up

While federated querying has its obvious advantages over data lakes, the solutions out there today aren't complete. Once we can use SQL syntax to query any data source and then Join, Union, and aggregate the outputs in a single command line, we'll have a complete solution.

It'll look something like this:

    select *
    from bigqueryTable as bqt
    join dynamoTable as dt
    join facebookApi as fbt
    join logfiles as lt
    where ….
    group by …

At superQuery, we believe that this "complete" engine is possible, and we are building for it. We call it "Data Alloy".
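To make the federation idea from earlier in this post a bit more concrete, here is a minimal Python sketch. It is only an illustration under loud assumptions: two in-memory SQLite databases stand in for separate data sources (the `orders` and `customers` tables, the sample rows and the `federated_revenue_by_country` function are all hypothetical), and a third scratch SQLite database plays the role of the federation engine. It is not how BigQuery, PartiQL or any real engine is implemented. What it shows is the pattern described above: each source does its own aggregation, and only the small outputs are joined centrally with one common SQL syntax.

```python
import sqlite3


def make_source(schema_sql, insert_sql, rows):
    """Create an in-memory SQLite database standing in for one data source."""
    conn = sqlite3.connect(":memory:")
    conn.execute(schema_sql)
    conn.executemany(insert_sql, rows)
    conn.commit()
    return conn


# Hypothetical source 1: a "warehouse" holding raw order events.
warehouse = make_source(
    "CREATE TABLE orders (customer_id INTEGER, amount REAL)",
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 20.0), (1, 30.0), (2, 5.0), (3, 12.5)],
)

# Hypothetical source 2: an operational store holding customer profiles.
crm = make_source(
    "CREATE TABLE customers (customer_id INTEGER, country TEXT)",
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "DE"), (2, "US"), (3, "US")],
)


def federated_revenue_by_country(warehouse_conn, crm_conn):
    """Push work down to each source, then join the small results centrally."""
    # The warehouse aggregates its own data and returns one row per customer.
    totals = warehouse_conn.execute(
        "SELECT customer_id, SUM(amount) FROM orders GROUP BY customer_id"
    ).fetchall()
    # The operational store returns only the columns needed for the join.
    profiles = crm_conn.execute(
        "SELECT customer_id, country FROM customers"
    ).fetchall()

    # The "engine" is a scratch database that only ever sees the small outputs.
    engine = sqlite3.connect(":memory:")
    engine.execute("CREATE TABLE totals (customer_id INTEGER, total REAL)")
    engine.execute("CREATE TABLE profiles (customer_id INTEGER, country TEXT)")
    engine.executemany("INSERT INTO totals VALUES (?, ?)", totals)
    engine.executemany("INSERT INTO profiles VALUES (?, ?)", profiles)

    # One common SQL syntax joins and aggregates results from both sources.
    result = engine.execute(
        "SELECT p.country, SUM(t.total) "
        "FROM totals AS t JOIN profiles AS p ON t.customer_id = p.customer_id "
        "GROUP BY p.country ORDER BY p.country"
    ).fetchall()
    engine.close()
    return result


revenue = federated_revenue_by_country(warehouse, crm)
print(revenue)
```

In this toy version the raw order rows never leave the "warehouse"; only per-customer totals do. A real engine would also need per-source connectors, data format translation and access controls, which this sketch leaves out.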

