962 reads

Building a Large-Scale Interactive SQL Query Engine with Open Source Software

by Bin FanMarch 7th, 2020

Too Long; Didn't Read

JD.com is running a data platform with more than 40,000 servers, running more than 1 million jobs per day, managing over 650PB of data. This article describes how JD built an interactive OLAP platform combining two open-source technologies: Presto and Alluxio. The enormous scale caused issues in achieving good data locality, which significantly impacts the performance of jobs running on Presto when reading from HDFS. The platform is over-utilized to the point where YARN is unable to schedule Presto jobs on its local HDFS datanode.

featured image - Building a Large-Scale Interactive SQL Query Engine with Open Source Software

This is a collaboration between Baolong Mao's team at JD.com and my team at Alluxio. The original article was published on Alluxio's blog. This article describes how JD built an interactive OLAP platform combining two open-source technologies: Presto and Alluxio.

1. Overview

JD.com is running a data platform with more than 40,000 servers, running more than 1 million jobs per day, managing over 650PB of data, with a daily increase of over 800TB. One of the most important services on this data platform is OLAP serving more than 500,000 queries every day for data analysis across different businesses.

Engineers at JD have implemented the first large-scale enterprise level interactive query engine in China in order to meet its fast-growing business requirements. To achieve this, Baolong and his team built the largest known Presto cluster in China, based on JDPresto leveraging its low latency and high concurrency. Alluxio was deployed with Presto as a fault-tolerant, pluggable caching service to reduce network traffic for low latency data access. By combining Presto and Alluxio, Baolong's team improved ad-hoc query latency by 10x. This stack of JDPresto on Alluxio has been running on more than 400 nodes in production for more than 2 years, covering mall APP, WeChat mobile QQ, and other offline data analysis platforms. It has significantly improved the experience for over 1 million users and helped with targeted marketing across tens of thousands of merchants on JD.

2. Platform Architecture and Its Challenges

The entire platform has over 40,000 servers where over 25,000 nodes are running batch processing jobs serving more than 13,000 users;

There are more than 1 million batch processing jobs, processing greater than 40PB data daily;

The total data volume exceeds 650PB, increasing daily by at least 800 TB, and the total storage capacity reaches 1EB (1000PB);

More than 40 business products with over 450 data models are running on this platform.

Figure 1 shows the architecture of JD's Big Data Platform

On this platform, HDFS is the foundation, serving data to the entire platform with data pipelines using different computation frameworks orchestrated by YARN. The enormous scale caused issues in achieving good data locality, which significantly impacts the performance of jobs running on Presto when reading from HDFS. As a result, a bridge is needed between the Presto and HDFS that can isolate resources and failures.

Figure 2 shows how deploying Alluxio helps Presto workers read data more efficiently

As shown on left, before using Alluxio, JDPresto workers experience low I/O performance when reading the input data from HDFS for two reasons:

HDFS is serving data several magnitudes larger than the amount of data consumed by each individual Presto cluster;
The platform is over-utilized to the point where YARN is unable to schedule Presto jobs on its local HDFS datanode.

As a result, Presto queries end up reading data from a remote HDFS datanode and experience delay due to network latency.

After deploying Alluxio workers together with Presto workers (as shown on the right in Figure 2), Presto initially will read data from HDFS, likely served by a remote datanode, but the data will be cached on a local Alluxio worker on the same node. The cached data is reused upon subsequent requests, eliminating the remote read overhead from HDFS. In short, Alluxio guarantees query performance by decoupling Presto performance from HDFS because data locality is achieved.

There is also a modification to split read path in JDPresto as shown in Figure 3. Normally, Presto first checks if the input data is in Alluxio. In case Alluxio is unavailable (e.g., down for maintenance), Presto can directly access HDFS without any user intervention. We also extended Alluxio to provide explicit consistency verification against HDFS.

Figure 3 shows the modification to split read paths in JDPresto

3. Performance Evaluation

Baolong's team at JD benchmarked two Presto clusters with different configurations, running the same SQL query multiple times. The results are shown in Figure 4. On the left, JDPresto directly accesses HDFS as the baseline whereas on the right, JDPresto uses Alluxio as a distributed cache backed by HDFS.

Figure 4. Benchmark two Presto clusters comparisons

The numbers in Figure 4 in red boxes represent the execution time.For both configurations, the first query reads directly from HDFS. This initial query is slower on the right side because Alluxio is caching tables and partition files. In subsequent queries, Alluxio speeds up the execution time by at least 10x.

Figure 5 summarizes the results of the test.

We can see in Figure 5 that JDPresto on Alluxio can reduce the read time after the first read, much faster than the JDPresto cluster.

4. Summary (and special thanks)

To achieve the optimization described in this section, Baolong's team at JD has done a lot of work including extending Alluxio and JDPresto and developing some testing tools. In addition, there was some work done around App-on-Yarn.

Special thanks to JD's team of big data engineers who have made many contributions to the Alluxio open source community while building and maintaining their big data platform. Currently, Baolong's team is continuing its effort in actively researching, evaluating and applying Alluxio to other computing frameworks.