Aggregating into data lakes is the solution of today — but are Federated Sources the solution of tomorrow?
TL;DR
Data lakes are considered to be a popular solution for uniting insights from ALL the organization’s data sources.
An emerging alternative is the federated data sources approach, which generates the same output with a fraction of the effort and cost — without the data duplication. In this post, we’ll explore the advantages and disadvantages of both options and where it may make sense for your team.
Imagine the perfect database. The best aspects of all databases types, combined into one — without limitations:
Unlimited data storage and processing you get with Hadoop, the fast query execution enjoyed with Redis or DynamoDB; lightning-fast CRUD (create, read, update delete); RDBMS capabilities like Joins, Unions and aggregation; support for structured and unstructured data; querying as if you’re using a data warehouse; with permissions configurable per table, column and row.
I’m not sure that even if such a super computing database existed, all of these capabilities could logically reside in one database. Anyway, it’ll be awhile before we get there. Until then, organizations are using multiple databases with different capabilities for different use cases.
But that comes with it’s own issues.
Two main issues arise from juggling data in multiple databases and database types. Because each database uses a different API/language, having different database types requires having experts at hand for each DB: IT/admin, DevOps, Application Developers. Backups and maintenance are different for each DB as well. More importantly is that the main resource of the organization — data — is distributed, duplicated, and cannot be aggregated for global insights in a clean, unified manner.
The first solution, which has become a common buzzword over the past few years, is to build a data lake. The rationale for having a data lake is to bring all data into one place, where it can be queried using a single query engine.
However, the process to get it into the lake requires many ETL (Extract, Transform and Load) processes using tools like Airflow, AWS Glue, Google Cloud Dataflow, among others. During this process, data is often duplicated and not updated as frequently as needed. Every organization has its data graveyards — huge stores of data that no one knows about or doesn’t have the courage to delete. With data lakes, the graveyard’s size will multiply.
Lastly, ETL processes can raise security and regulation concerns, they have high maintenance and costs associated with them, and current lake solutions are provider-locked, allowing you access to data for some of their own services only.
Federated Sources brings a new approach to the table. Each database processes and storing its own data as they were meant to. Data is not transferred out. Instead, there is one engine that can query multiple types of databases and seamlessly merge the results.
What are the important features of an engine capable of querying federated sources?
This engine must have three important characteristics:
Two important notes about federation:
Google BigQuery is investing some effort in the direction of federated sources, with a caveat — Google-only services (and only three, as of now). They sometimes refer to it as federation, and sometimes “external data sources”. In their case, BigQuery acts as the federation engine and SQL is the common syntax used. You can query external resources: Cloud BigTable, Cloud Storage, and Google Drive. The data sources can be MySQL tables or file types like ORC and Parquet.
Amazon AWS also invests in the federation approach with a new project called “PartiQL”. It enables unified query access across multiple data stores and data formats by separating the syntax and semantics of a query from the underlying format of the data or the data store that is being accessed — and it is open source.
While federated querying has its obvious advantages over data lakes, the solutions out there today aren’t complete.
Once we can use SQL syntax to query any data source while being able to Join, Union, and aggregate the outputs in a single command line, then we’ll have a complete solution.
It’ll look something like this:
select * from
bigqueryTable as bqt join
dynamoTable as dt join
facebookApi as fbt join
logfiles as lt
where ….
group by …
At superQuery, we believe that this “complete” engine is possible, and are building for it — we call it “Data Alloy”.