
How We Improved Spark Jobs on HDFS Up To 30 Times

by Bin Fan · 5 min read · August 6th, 2020

Too Long; Didn't Read

Vipshop is the third largest e-commerce site in China and processes large amounts of data collected daily to generate targeted advertisements for its consumers. The site runs tens of thousands of queries to derive insights for targeted ads from a dozen Hive tables stored in HDFS. The major challenge when running jobs on the architecture shown in Figure 1 is inconsistent performance, for several reasons. With a large number of nodes in the cluster, it is unlikely that the data needed by a computation process is served by the storage process on the same node, and the resulting remote requests to other storage processes created bottlenecks on certain data nodes. With Alluxio, we separate storage and compute by moving HDFS to an isolated cluster, so resources on the compute cluster can be scaled independently of storage capacity.
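As a concrete illustration (a minimal sketch, not code from the article): because Alluxio exposes a Hadoop-compatible FileSystem API, a Spark job typically switches from reading HDFS directly to reading through the Alluxio cache by changing the URI scheme. The hostnames and table path below are hypothetical; 8020 and 19998 are the default NameNode and Alluxio master ports.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("alluxio-read-sketch")
      .getOrCreate()

    // Before: Spark reads the Hive table files directly from the remote HDFS cluster.
    val adsBefore = spark.read.parquet("hdfs://namenode:8020/warehouse/ads_events")

    // After: the same data is read through Alluxio, which caches hot blocks on the
    // compute cluster so repeated queries avoid remote HDFS reads.
    val adsAfter = spark.read.parquet("alluxio://alluxio-master:19998/warehouse/ads_events")

No other job code needs to change; the Alluxio client jar simply has to be on Spark's classpath.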

About Author

Bin Fan (@bin-fan) is VP of Open Source and a Founding Member at Alluxio.
