This is a collaboration between Baolong Mao's team at JD.com and my team at Alluxio. The original article was published on Alluxio's blog. This article describes how JD built an interactive OLAP platform combining two open-source technologies: Presto and Alluxio.

1. Overview

JD.com is running a data platform with more than 40,000 servers, running more than 1 million jobs per day and managing over 650PB of data, with a daily increase of over 800TB. One of the most important services on this data platform is OLAP, serving more than 500,000 queries every day for data analysis across different businesses.

Engineers at JD have implemented the first large-scale, enterprise-level interactive query engine in China in order to meet its fast-growing business requirements. To achieve this, Baolong and his team built the largest known Presto cluster in China, based on JDPresto, leveraging its low latency and high concurrency. Alluxio was deployed alongside Presto as a fault-tolerant, pluggable caching service to reduce network traffic for low-latency data access. By combining Presto and Alluxio, Baolong's team improved ad-hoc query latency by 10x. This stack of JDPresto on Alluxio has been running on more than 400 nodes in production for more than two years, covering the JD mall app, WeChat, mobile QQ, and other offline data analysis platforms. It has significantly improved the experience for over 1 million users and helped with targeted marketing across tens of thousands of merchants on JD.

2. Platform Architecture and Its Challenges

- The entire platform has over 40,000 servers, where over 25,000 nodes are running batch processing jobs serving more than 13,000 users;
- There are more than 1 million batch processing jobs, processing more than 40PB of data daily;
- The total data volume exceeds 650PB, increasing daily by at least 800TB, and the total storage capacity reaches 1EB (1000PB);
- More than 40 business products with over 450 data models are running on this platform.
Figure 1 shows the architecture of JD's big data platform.

On this platform, HDFS is the foundation, serving data to the entire platform, with data pipelines built on different computation frameworks orchestrated by YARN. The enormous scale caused issues in achieving good data locality, which significantly impacts the performance of jobs running on Presto when reading from HDFS. As a result, a bridge is needed between Presto and HDFS that can isolate resources and failures.

Figure 2 shows how deploying Alluxio helps Presto workers read data more efficiently.

As shown on the left, before using Alluxio, JDPresto workers experienced low I/O performance when reading input data from HDFS for two reasons:

- HDFS is serving data several orders of magnitude larger than the amount of data consumed by each individual Presto cluster;
- The platform is over-utilized to the point where YARN is unable to schedule Presto jobs on the nodes hosting the corresponding HDFS datanodes.

As a result, Presto queries end up reading data from remote HDFS datanodes and experience delays due to network latency. After deploying Alluxio workers together with Presto workers (as shown on the right in Figure 2), Presto initially reads data from HDFS, likely served by a remote datanode, but the data is then cached on the local Alluxio worker on the same node. The cached data is reused on subsequent requests, eliminating the remote read overhead from HDFS. In short, Alluxio guarantees query performance by decoupling Presto performance from HDFS, because data locality is achieved.

There is also a modification to the split read path in JDPresto, as shown in Figure 3. Normally, Presto first checks whether the input data is in Alluxio. In case Alluxio is unavailable (e.g., down for maintenance), Presto can directly access HDFS without any user intervention. We also extended Alluxio to provide explicit consistency verification against HDFS.

Figure 3 shows the modification to split read paths in JDPresto.
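The fallback behavior of this split read path can be sketched as follows. This is a minimal illustration of the idea, not JDPresto's actual implementation: the `read_split` function and the two client objects are hypothetical placeholders standing in for the real Presto/Alluxio/HDFS client code.

```python
def read_split(path, alluxio_client, hdfs_client):
    """Read one Presto split, preferring the Alluxio cache.

    Sketch of the modified read path described above: try the
    co-located Alluxio worker first; if Alluxio is unavailable
    (e.g., down for maintenance), transparently fall back to
    reading the same path directly from HDFS, so no user
    intervention is needed.
    """
    try:
        return alluxio_client.read(path)
    except ConnectionError:
        # Alluxio unreachable: bypass the cache layer entirely.
        return hdfs_client.read(path)
```

The key design point is that the fallback is handled inside the read path itself, so a cache outage degrades performance (remote HDFS reads) rather than failing queries.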
3. Performance Evaluation

Baolong's team at JD benchmarked two Presto clusters with different configurations, running the same SQL query multiple times. The results are shown in Figure 4. On the left, JDPresto directly accesses HDFS as the baseline, whereas on the right, JDPresto uses Alluxio as a distributed cache backed by HDFS.

Figure 4 compares the benchmarks of the two Presto clusters.

The numbers in red boxes in Figure 4 represent the execution time. For both configurations, the first query reads directly from HDFS. This initial query is slower on the right side because Alluxio is caching tables and partition files. In subsequent queries, Alluxio speeds up the execution time by at least 10x.

Figure 5 summarizes the results of the test. We can see in Figure 5 that JDPresto on Alluxio reduces the read time after the first read, running much faster than the baseline JDPresto cluster.

4. Summary (and special thanks)

To achieve the optimization described in this article, Baolong's team at JD has done a lot of work, including extending Alluxio and JDPresto and developing testing tools. In addition, there was some work done around App-on-YARN. Special thanks to JD's team of big data engineers, who have made many contributions to the Alluxio open source community while building and maintaining their big data platform. Currently, Baolong's team is continuing this effort, actively researching, evaluating, and applying Alluxio to other computing frameworks.