This article describes how engineers in the Data Service Center (DSC) at Tencent PCG (Platform and Content Business Group) leverages Alluxio to optimize the analytics performance and minimize the operating costs in building Tencent Beacon Growing, a real-time data analytics platform.
Specifically, by using Alluxio as a distributed caching layer, the HDFS scanning performance for Impala SQL queries is accelerated by 2x out of the box without further optimizations.
The original article was published on Alluxio's Engineering Blog by guest writer Honghan Tian.
Beacon Growing is an analytics platform that provides analyzers insights of user behaviors in realtime. Beacon comes with both built-in and customized analytic models, such as RETENT, FUNNEL, PATH, etc. Combining the functions of A/B testing and user profiles, developers can select promotional plans more easily.
Under the hood, Beacon is powered by a stack of open source technologies. Particularly for the context of this article, we used Impala to query data in Parquet format stored in HDFS. Here is the overview of the software stack currently being deployed by Beacon, to give the readers a better understanding of the system.
In Tencent, we build systems to handle data at the Internet-scale. For example, the scan range of a typical query can be 100 billion rows, with hundreds of concurrent queries. To speed up the queries, we have applied optimizations like page index in Parquet files to reduce the I/O cost in scan operations. In practice, there were still a few issues observed:
Based on the above observations, we decided to build a caching layer using SSDs. However, HDFS is not designed to leverage heterogeneous storage. We deployed Alluxio to leverage the speed advantage of SSDs and the capacity advantage of HDDs. The new architecture is shown in the following figure:
There are a few changes and improvements observed after integrating Alluxio into Tencent Beacon infrastructure:
One of the most valuable features of Alluxio in our use case is Unified Namespace. It helps share the same namespace across multiple clusters, making it easier for Impala to access the data resources of multiple HDFS clusters.
The following table is a comparison of some data:
There are a few on-going projects on the horizon to integrate Alluxio at Tencent PCG.
First, we plan to further reduce the operating work by deploying a large Alluxio cluster with worker processes spanning multiple Impala clusters. In other words, Alluxio and Impala processes will be co-located at the same server to reduce network traffic. We are currently experiencing some issues and working with the core team to address them.
Second, we want to leverage the Alluxio tiered locality group to further improve Impala performance and eliminate remote reads of Alluxio workers with improved data locality.
Specifically, with one but much larger Alluxio cluster deployed, we will divide Alluxio workers nodes into different ‘locality groups’. As a result, when queries read data from different HDFS clusters, data will be cached into the nearest locality group to reduce I/O cost. This feature is an analogy of the “layer” configuration” in Clickhouse.
Third, once locality grouping described above is achieved, locally deployed Impala computing groups can be used to elastically expand or shrink computing resources based on different Impala clusters for different business uses.
In our project “Beacon Growing”, we have deployed Alluxio to improve Impala performance by 2.44x for IO intensive queries and 1.20x for all queries. The query failure rate due to timeout is also reduced by 29%. In the future, we foresee it can reduce disk utilization by over 20% for our planned elastic computing on Impala.