How We Improved Spark Jobs on HDFS Up to 30 Times

Written by bin-fan, VP of Open Source and Founding Member @Alluxio | Published by HackerNoon on 2020/08/06
Tech Story Tags: e-commerce | hadoop | hdfs | open-source | data-engineering | apache-spark | targeted-advertisement | sla

TLDR Vipshop is the third-largest e-commerce site in China and processes large amounts of data collected daily to generate targeted advertisements for its consumers. The site runs tens of thousands of queries to derive insights for targeted ads from a dozen Hive tables stored in HDFS. The major challenge when running jobs in the architecture shown in Figure 1 is inconsistent performance, for several reasons. With a large number of nodes in the cluster, the data needed by a compute process is rarely served by the storage process on the same node, and the resulting remote requests to other storage processes create bottlenecks on certain DataNodes. With Alluxio, we separate storage and compute by moving HDFS to an isolated cluster, so resources on the compute cluster can be scaled independently of storage capacity.
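As a minimal sketch of what this separation looks like from a Spark job's point of view, the read path changes from an `hdfs://` URI to an `alluxio://` URI, with Alluxio caching hot data on the compute cluster. The hostnames and table path below are hypothetical placeholders, and the Alluxio master's default RPC port of 19998 is assumed:

```scala
import org.apache.spark.sql.SparkSession

object AlluxioReadExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("alluxio-read-example")
      .getOrCreate()

    // Before: Spark reads directly from the HDFS cluster, so every
    // query competes for the same set of DataNodes.
    val fromHdfs = spark.read.parquet("hdfs://namenode:8020/warehouse/ads/table1")

    // After: the same data served through Alluxio, which caches hot
    // files on the compute cluster and decouples it from HDFS capacity.
    val fromAlluxio = spark.read.parquet("alluxio://alluxio-master:19998/warehouse/ads/table1")

    fromAlluxio.groupBy("ad_id").count().show()

    spark.stop()
  }
}
```

Because only the URI scheme changes, existing queries against the Hive tables need no rewriting when the storage tier is moved behind Alluxio.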
