Serving Structured Data in Alluxio: Example

Written by bin-fan | Published 2020/04/24

TLDR: The Structured Data Service manages the metadata of structured data components such as databases, tables, and schemas, and it also tracks the location of the stored data. This article goes through an example to demonstrate how it helps SQL and structured data workloads with Presto. Alluxio 2.2.0 has been released since the previous article, and I recommend that users trying this service for the first time start with the new version. This tutorial requires Presto and Hive to be configured together and running.

In the previous article, I described the concept and design of the Structured Data Service in the Alluxio 2.1.0 release. This article goes through an example to demonstrate how it helps SQL and structured data workloads.
Alluxio 2.2.0 has been released since the previous article. I recommend that users trying this service for the first time start with Alluxio 2.2.0. This tutorial requires Presto and Hive to be configured together and running.

Step 1: Download and Set Up Alluxio

Download and Deploy Alluxio 2.2.0
Download the Alluxio 2.2.0 release and deploy Alluxio on your local computer. Detailed instructions can be found here. The following is a summary of the commands mentioned:
$ tar xf alluxio-2.2.0-bin.tar.gz
$ cd alluxio-2.2.0 # this directory corresponds to ${ALLUXIO_HOME}
$ cp conf/alluxio-site.properties.template conf/alluxio-site.properties
$ echo "alluxio.master.hostname=localhost" >> conf/alluxio-site.properties
$ echo "alluxio.master.mount.table.root.ufs=/tmp" >> conf/alluxio-site.properties
$ ./bin/alluxio-mount.sh SudoMount
$ ./bin/alluxio format
$ ./bin/alluxio-start.sh local -f
Note that no additional configuration is needed to start the new Structured Data Service.
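Before starting Alluxio, it can help to confirm that the two properties appended above actually landed in the site properties file. The following is a minimal sketch, not part of Alluxio: get_property is a hypothetical helper name, and it only handles plain key=value lines.

```shell
# Hypothetical helper (not shipped with Alluxio): prints the value of a key
# from a Java-style .properties file. If the key is assigned more than once,
# the last assignment wins. Commented-out lines ("# key=value") are ignored
# because their first field does not equal the key exactly.
get_property() {
  local file="$1" key="$2"
  awk -F= -v k="$key" '
    $1 == k { sub(/^[^=]*=/, ""); value = $0; found = 1 }
    END { if (found) print value }
  ' "$file"
}
```

For example, after the commands above, running get_property conf/alluxio-site.properties alluxio.master.hostname from ${ALLUXIO_HOME} should print localhost.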
Install and Configure the Alluxio Presto Connector
The Alluxio Presto Connector is the client for Presto to access Alluxio’s Structured Data Service. In this developer preview version, we need to copy the connector manually to Presto.
This connector is bundled as part of the Alluxio 2.2.0 release in the directory ${ALLUXIO_HOME}/client/presto/plugins/. Copy the directory corresponding to the Presto version into Presto’s plugin directory:
$ cp -R ${ALLUXIO_HOME}/client/presto/plugins/presto-hive-alluxio-319/ \
${PRESTO_HOME}/plugin/hive-alluxio/
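The copy above uses the connector built for Presto 319. If your Presto version differs, a small check like the following can tell you whether the release bundles a matching connector. This is a sketch under one assumption: that the bundled directories follow the presto-hive-alluxio-<version> naming seen in the command above. The function name find_alluxio_connector is hypothetical.

```shell
# Hypothetical helper: prints the bundled connector directory for a given
# Presto version, or lists what is available and returns non-zero.
# Assumes directories are named presto-hive-alluxio-<version>.
find_alluxio_connector() {
  local alluxio_home="$1" presto_version="$2"
  local plugins="${alluxio_home}/client/presto/plugins"
  local dir="${plugins}/presto-hive-alluxio-${presto_version}"
  if [ -d "$dir" ]; then
    printf '%s\n' "$dir"
  else
    echo "No bundled connector for Presto ${presto_version}. Available:" >&2
    ls "$plugins" >&2
    return 1
  fi
}
```

The printed path can then be passed to cp -R in place of the hard-coded directory.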
Once the connector is installed, it can be used to configure a Presto catalog. Add a new catalog configuration to Presto by creating the following file:
$ echo "connector.name=hive-alluxio
hive.metastore=alluxio
hive.metastore.alluxio.master.address=localhost:19998" > \
${PRESTO_HOME}/etc/catalog/catalog_alluxio.properties
Restart the Presto server for the connector and configuration to take effect.
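If you script this setup, a heredoc is a less error-prone way to write the same catalog file than a multi-line echo. The sketch below writes the three properties shown above. The function name write_alluxio_catalog and the optional master-address parameter are illustrative additions, not part of Alluxio or Presto.

```shell
# Hypothetical helper: writes catalog_alluxio.properties under
# <presto_home>/etc/catalog/. The second argument is optional and defaults
# to localhost:19998, the master address used in this tutorial.
write_alluxio_catalog() {
  local presto_home="$1" master_address="${2:-localhost:19998}"
  local catalog_dir="${presto_home}/etc/catalog"
  mkdir -p "$catalog_dir"
  cat > "${catalog_dir}/catalog_alluxio.properties" <<EOF
connector.name=hive-alluxio
hive.metastore=alluxio
hive.metastore.alluxio.master.address=${master_address}
EOF
}
```

Calling write_alluxio_catalog "${PRESTO_HOME}" produces the same file as the echo command; the file name (without .properties) is what Presto uses as the catalog name.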

Step 2: Attach a Hive Metastore to Alluxio Catalog Service

The Alluxio Catalog Service manages the metadata of structured data components such as databases, tables, and schemas. It also tracks the location of the stored data. This developer preview version supports attaching a Hive Metastore as an UnderDatabase, which is an abstraction of other external catalogs and databases, into the Alluxio Catalog Service.
To attach the Hive Metastore into the Alluxio Catalog Service, use the “attachdb” command:
$ ./bin/alluxio table attachdb hive thrift://localhost:9083 hive_db_name
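The attachdb command takes the Hive Metastore address as a thrift://host:port URI, and a malformed URI is an easy mistake to make in scripts. The following sketch validates the URI and splits it into host and port before you attach. The function name parse_thrift_uri is hypothetical, and the check is purely syntactic; it does not contact the metastore.

```shell
# Hypothetical helper: validates a thrift://host:port URI and prints
# "host port" on one line. Returns non-zero on anything else.
parse_thrift_uri() {
  local uri="$1"
  case "$uri" in
    thrift://?*:?*) ;;
    *) echo "expected thrift://host:port, got: $uri" >&2; return 1 ;;
  esac
  local hostport="${uri#thrift://}"   # strip the scheme
  # Host is everything before the first colon, port everything after the last.
  printf '%s %s\n' "${hostport%%:*}" "${hostport##*:}"
}
```

With the host and port in hand, a reachability check (for example with a tool such as nc, if it is installed) can be run before attachdb, so that a stopped metastore is reported early.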

Step 3: Use Alluxio Structured Data Management with Presto

Once a database is attached, the catalog service can be used from Presto. Start the Presto CLI with the Alluxio catalog:
$ presto --catalog catalog_alluxio
Any queries run within this CLI will access the Alluxio Catalog Service via the provided connector. The Alluxio Catalog Service will automatically serve the table information from the Hive Metastore, while transparently using the Alluxio mounted locations.
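For scripted checks, the Presto CLI can also run a single statement non-interactively with its --execute option. The wrapper below is a sketch: alluxio_presto_query and the PRESTO_CLI variable are hypothetical names, and it assumes the attached database (hive_db_name in this tutorial) appears as a schema in the catalog_alluxio catalog.

```shell
# Hypothetical wrapper: runs one SQL statement against the Alluxio catalog.
# PRESTO_CLI is an assumed variable naming the Presto CLI executable
# (defaults to "presto"); set it to "echo" to print the arguments instead
# of running a query.
alluxio_presto_query() {
  local schema="$1" sql="$2"
  "${PRESTO_CLI:-presto}" --catalog catalog_alluxio --schema "$schema" --execute "$sql"
}
```

For example, alluxio_presto_query hive_db_name "SHOW TABLES" should list the tables that the Alluxio Catalog Service serves from the attached Hive database.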
Transform a Table
Data transformation is a key benefit of working with structured data in Alluxio, particularly when the underlying files that make up a table are not stored in a compute-optimized fashion. If the files are in CSV format or the table is split among many small files, the Alluxio Transformation Service can convert the format to Parquet or combine multiple small files into larger files.
To transform the test table in Hive:
$ ./bin/alluxio table transform hive_db_name test_table
For more on Data Transformations, see the documentation here.
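To judge whether a table suffers from the small-files problem described above, you can count the small files under its data directory. The sketch below is only an illustration for data on a local file system, such as the /tmp root UFS used in this tutorial. It is not how Alluxio decides what to transform, and the function name and the 1024 KB default threshold are assumptions.

```shell
# Hypothetical heuristic: counts regular files under a directory that are
# smaller than a threshold given in kilobytes (default 1024, about 1 MB).
# find rounds file sizes up to whole kilobytes before comparing.
count_small_files() {
  local dir="$1" max_kb="${2:-1024}"
  find "$dir" -type f -size -"${max_kb}k" | wc -l | tr -d ' '
}
```

A large count relative to the table's total size suggests the table is a good candidate for the transform command.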

Try it out!

Alluxio Structured Data Management is an exciting new effort that provides further benefits for SQL frameworks. Get started with the Alluxio Structured Data Service with Presto, and share feedback on features and issues in the Alluxio GitHub repository! On behalf of the entire Alluxio open source community, I invite you to ask questions in our community Slack channel whenever you encounter any issues.

Written by bin-fan | VP of Open Source and Founding Member @Alluxio
Published by HackerNoon on 2020/04/24