Accelerating Write-Intensive Data Workloads on AWS S3

Written by bin-fan | Published 2021/09/10
Tech Story Tags: aws-s3 | apache-spark | caching | data-orchestration | performance | cloud | storage | software-development

TL;DR: We introduce Replicated Async Write, which lets users complete writes to the Alluxio file system and return quickly with high application performance.

Introduction

When running write-intensive ETL jobs in the cloud using Spark, MapReduce, or Hive, writing the output directly to AWS S3 often leads to slow or unstable performance and complicated error handling, due to throughput throttling and object storage semantics, among other reasons.
Some users choose to first write the output to an HDFS cluster and then upload the staged data to AWS S3 using tools like s3-dist-cp.
Although this second approach improves performance and avoids corner cases of object storage, it adds the cost and complexity of maintaining a staging HDFS cluster, and it increases the delay before the data is available to downstream applications.
Alluxio is an open-source data orchestration system widely used to speed up data-intensive workloads in the cloud.
Alluxio v2.0 introduced Replicated Async Write to allow users to complete writes to the Alluxio file system and return quickly with high application performance, while still providing users with peace of mind that data will be persisted in the background to the chosen under storage, such as S3.

What is Replicated Async Write

Assume that a Spark job is writing a large data set to AWS S3.
To ensure that the output files are written quickly and remain highly available before they are persisted to S3, pass the write type ASYNC_THROUGH and a target replication level to Spark (see description in docs):
alluxio.user.file.writetype.default=ASYNC_THROUGH
alluxio.user.file.replication.durable=<N>
Then set the output to the Alluxio file system client (with URI scheme “alluxio://”):
scala> myData.saveAsTextFile("alluxio://<MasterHost>:19998/Output")
This Spark job will first synchronously write a new file to the Alluxio file system with N copies before returning. In the background, the Alluxio file system will persist a copy of the new data to the Alluxio under storage like S3. 
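One common way to pass these two properties to a Spark job is as JVM system properties for both the driver and the executors. The sketch below assumes that approach; the replica count, the class name com.example.MyEtlJob, and the jar name my-etl-job.jar are hypothetical placeholders, and the DRY_RUN switch is only there so the script can be tried without a Spark installation.

```shell
#!/bin/sh
# Hypothetical values for illustration: adjust to your own job.
REPLICAS=3
ALLUXIO_OPTS="-Dalluxio.user.file.writetype.default=ASYNC_THROUGH -Dalluxio.user.file.replication.durable=${REPLICAS}"

# By default this only prints the command (echo drops the shell quoting).
# Run with DRY_RUN= (set but empty) to actually launch spark-submit.
DRY_RUN=${DRY_RUN-echo}

# The same options go to the driver and the executors, since both
# create Alluxio clients when writing the job output.
$DRY_RUN spark-submit \
  --conf "spark.driver.extraJavaOptions=${ALLUXIO_OPTS}" \
  --conf "spark.executor.extraJavaOptions=${ALLUXIO_OPTS}" \
  --class com.example.MyEtlJob my-etl-job.jar
```

Because the properties are client-side, scoping them to one job this way leaves other applications on the cluster unaffected.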
By default, the property alluxio.user.file.replication.durable is set to 1, and writing a file to Alluxio using ASYNC_THROUGH completes at memory speed if there is a colocated Alluxio worker.
But if the worker holding that single copy crashes or restarts before the data is persisted, the data can be lost.
Increasing alluxio.user.file.replication.durable to a value N greater than 1 ensures the data is replicated on N different Alluxio workers and survives N-1 worker failures before being persisted, at the cost of temporarily storing N copies of the data in Alluxio.
Note that if any of these N workers fails before the file is persisted, Alluxio handles the failure by re-replicating the under-replicated blocks to restore sufficient copies of the file. Also, the free command will not work on the file until it has been persisted.
Once the file is persisted to the under storage, its in-Alluxio copies can become evictable if other configuration allows it.
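To make the replication trade-off concrete, here is a small worked example of the arithmetic described above. The 100 GB output size and the replication level of 3 are made-up numbers chosen only for illustration.

```shell
#!/bin/sh
# Hypothetical numbers for illustration only.
FILE_SIZE_GB=100   # size of the job output
REPLICAS=3         # value of alluxio.user.file.replication.durable

# N replicas survive N-1 worker failures before the data is persisted.
TOLERATED_FAILURES=$((REPLICAS - 1))
# Until persistence completes, Alluxio holds N copies of the output.
TRANSIENT_FOOTPRINT_GB=$((FILE_SIZE_GB * REPLICAS))

echo "worker failures survived before persistence: ${TOLERATED_FAILURES}"
echo "Alluxio space held until persisted: ${TRANSIENT_FOOTPRINT_GB} GB"
```

With these numbers the job tolerates two worker failures while holding 300 GB in Alluxio until the background persist finishes.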

Enable Replicated Async Write in Alluxio

Set the Alluxio property alluxio.user.file.replication.durable to a reasonable number to provide fault tolerance, and set alluxio.user.file.writetype.default to ASYNC_THROUGH, on all client applications.
Note that these properties are client-side only, meaning they don’t require restarts of any Alluxio processes to use them!
For example, put the following lines into conf/alluxio-site.properties to write with three initial copies before data is persisted:
alluxio.user.file.writetype.default=ASYNC_THROUGH
alluxio.user.file.replication.durable=3
The Alluxio command line also supports replicated async write.
For example, the copyFromLocal command can copy data from the local file system to Alluxio using replicated async write:
$ ./bin/alluxio fs -Dalluxio.user.file.writetype.default=ASYNC_THROUGH \
  -Dalluxio.user.file.replication.durable=3 copyFromLocal <localPath> <alluxioPath>

Comparison to Other Alluxio Write Types

Replicated async write is one write type that aims to balance write speed against data persistence, at the transient cost of extra storage capacity.
Alluxio also provides a couple of additional write types realizing different trade-offs in the design space. 
For an application that writes its data through Alluxio, unless the user is writing with THROUGH or CACHE_THROUGH, the data will not yet exist in the under storage when the write completes.
The problem is that those types of writes are slower than their MUST_CACHE and ASYNC_THROUGH counterparts. When a file is written using MUST_CACHE or ASYNC_THROUGH, if the blocks of the file exist only on a single worker and that worker fails or crashes before the data reaches the under storage, then the written data will be unrecoverable.
By creating synchronous replicas of the data at the time it is written, Alluxio combines the performance of MUST_CACHE with fault tolerance.
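The four write types mentioned above can be summarized by where each one writes before the call returns. The helper below is a hypothetical lookup function, not part of the Alluxio CLI, and its output labels are an informal summary based on the documented behavior of each write type.

```shell
#!/bin/sh
# Hypothetical helper (not an Alluxio command): summarizes where each
# write type puts data. "sync" means before the write returns, "async"
# means in the background, "never"/"none" means not written there.
describe_write_type() {
  case "$1" in
    MUST_CACHE)    echo "alluxio=sync ufs=never" ;;
    CACHE_THROUGH) echo "alluxio=sync ufs=sync" ;;
    THROUGH)       echo "alluxio=none ufs=sync" ;;
    ASYNC_THROUGH) echo "alluxio=sync ufs=async" ;;
    *) echo "unknown write type: $1" >&2; return 1 ;;
  esac
}

for t in MUST_CACHE CACHE_THROUGH THROUGH ASYNC_THROUGH; do
  printf '%-14s %s\n' "$t" "$(describe_write_type "$t")"
done
```

Replicated async write corresponds to the ASYNC_THROUGH row with alluxio.user.file.replication.durable set above 1, so the synchronous Alluxio write lands on several workers.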
Also published here.

Written by bin-fan | VP of Open Source and Founding Member @Alluxio
Published by HackerNoon on 2021/09/10