Are you using SQL engines, such as Presto, to query existing Hive data warehouse and are experiencing challenges including overloaded Hive Metastore with slow and unpredictable access, unoptimized data formats and layouts such as too many small files, or lack of influence over the existing Hive system and other Hive applications? To address these challenges, we launched in the 2.1 open source release, which includes the . Alluxio Structured Data Management Alluxio Catalog Service Here is the FAQ based on the top questions from the community: How Does the Alluxio Catalog Service Work With Hive Metastore? The manages metadata for structured data, such as schema and table information. To do so, The Alluxio catalog manages connections to external metadata services (like Hive Metastore or AWS Glue) via an abstraction called Under Database (UDB). With the UDB abstraction, the Alluxio catalog can interact with other external metadata services for database and table information. Alluxio catalog For the Alluxio catalog to manage and maintain interactions with external metadata services, the external service must be “attached” to the Alluxio catalog. This attaching process adds the relationship between the external service and the Alluxio catalog. Attaching an external metadata service is initiated via the . “attachdb” CLI command In the Alluxio 2.2 release, there is an implementation of the UDB for the Hive Metastore. This means databases in a Hive Metastore can be attached to the Alluxio catalog so that the Alluxio catalog can manage and serve this metadata. In future versions, support for the AWS Glue UDB will be added. After an external database is attached, the Alluxio catalog syncs all the information from the external database and imports the table definitions into the catalog. The Alluxio catalog can then serve the information to the client, which in this case is in the Alluxio connector in Presto. The triggers the Alluxio catalog to sync with the UDB again. “sync” CLI command Once a UDB is attached, the Alluxio catalog manages the table metadata. This means the Alluxio catalog will transparently use Alluxio file locations instead of the locations found in the Hive Metastore without modifying the Hive Metastore. This allows simple deployment of the Alluxio catalog service with existing Hive Metastores. How Does the Alluxio Catalog Service Work With Presto? Presto interacts with the Alluxio catalog via the Alluxio connector. The Alluxio connector in Presto is new functionality for the existing Hive connector. The new functionality has already been merged into the codebases of PrestoSQL and PrestoDB. The Presto catalog needs to be instead of the Hive Metastore. Once Presto is configured in this way, all of the existing Hive connector operations will interact with the Alluxio catalog for the metadata of tables and not the Hive Metastore. configured to interact with the Alluxio catalog In the Alluxio 2.2 release, only read-only queries are supported with the Alluxio connector and Alluxio catalog. In future versions, write queries like DDL and DML statements will be supported. How do Alluxio Transformations Work? are performed by an , which executes distributed, asynchronous tasks. When a transformation is initiated, this is converted into a distributed job and executed across the various Alluxio job workers. Once the job is complete, the Alluxio catalog is updated with the newest transformed locations. When the Alluxio catalog serves the table metadata, the new transformed locations are used transparently, without any modification to the UDB. Alluxio transformations Alluxio-internal job service When transformations are generated, they are written through to the underlying persistent store and not cached in Alluxio. Once the transformed data is written, it can be read from the new transformed locations. The first read of the transformed data will cache the data in Alluxio on-demand. In Alluxio 2.2, the output file format is the parquet format since it is a more compute-optimized file format. In future versions, additional file formats and parquet output options will be available. For coalescing many small files together, the default behavior is to generate files of about 2G, but that is configurable. What is the Architecture of Alluxio Structured Data Services? Alluxio Structured Data Services is a part of the and has 3 main components: Alluxio connector in Presto, Alluxio Catalog Service, and Alluxio Transformation Service. Alluxio system The Alluxio connector is part of the Presto SQL engine. It is a connector based on the existing Hive connector and is the client to the Alluxio catalog. Alluxio connector: The catalog service manages the metadata for structured data and is a part of the Alluxio master processes. Alluxio Catalog Service: The transformation service is managed by the Alluxio catalog, and the transformations are performed by the Alluxio job service, which is an internal service for executing distributed, asynchronous tasks. The jobs are executed and managed by the Alluxio job masters and Alluxio job workers. Alluxio Transformation Service: Also Published At: https://www.alluxio.io/blog/everything-you-want-to-know-about-how-to-decouple-sql-engines-from-hive-data-warehouse/

Glue

Overcoming Challenges Running the Disaggregated Analytics Stack in K8s

Accelerating Write-Intensive Data Workloads on AWS S3

Join me on Alluxio's Open Source Community! 

Nominated for 2022 - HackerNoon Contributor of the Year - Analytics

A Primer on Decoupling SQL Engines from Hive Data Warehouse

About Author

Comments

TOPICS

THIS ARTICLE WAS FEATURED IN

Related Stories

A New Approach to Solve I/O Challenges in the Machine Learning Pipeline

22 Best Tools to Use for Marketing Startups in 2022

15 Best Project Management Tools

5 Data Management Principles That Matter in 2021

5 Features To Consider When Looking For a Reliable Data Loss Prevention (DLP) Software To Buy

5 Software Documentation Tools That Make Managing Technical Documentation Easier

A New Approach to Solve I/O Challenges in the Machine Learning Pipeline

22 Best Tools to Use for Marketing Startups in 2022

15 Best Project Management Tools

5 Data Management Principles That Matter in 2021

5 Features To Consider When Looking For a Reliable Data Loss Prevention (DLP) Software To Buy

5 Software Documentation Tools That Make Managing Technical Documentation Easier

Light-Mode

Classic

Newspaper

Minty

Dark-Mode

Neon Noir

Minty

HN StartUps