Accelerating Write-Intensive Data Workloads on AWS S3

Introduction

When running write-intensive ETL jobs in the cloud using Spark, MapReduce, or Hive, directly writing the output to AWS S3 often yields slow, unstable performance or complicated error handling due to throughput throttling, object storage semantics among many other reasons.

Some users choose to first write the output to an HDFS cluster and then upload the staging data to AWS S3 using tools like

s3-dist-cp

Though optimizing the performance or eliminating corner cases for object storage, the second approach adds both cost and complexity in maintaining a staging HDFS, as well as increases the delay for the data to be available for downstream applications.

Alluxio is an open-source data orchestration system widely used to speed up data-intensive workloads in the cloud.

Alluxio v2.0 introduced

Replicated Async Write

to allow users to complete writes to the Alluxio file system and return quickly with high application performance, while still providing users with peace of mind that data will be persisted to the chosen under storage like S3 in the background.

What is Replicated Async Write

Assume that a Spark job is writing a large data set to AWS S3.

To ensure that the output files are quickly written and keep highly available before persisted to S3, pass the write type

ASYNC_THROUGH

and a target replication level to Spark (see description in docs).

alluxio.user.file.writetype.default=ASYNC_THROUGHalluxio.user.file.replication.durable=<N>

and set the output to the Alluxio file system client (with URI scheme: “

alluxio://

”)

scala> myData.saveAsTextFile("alluxio://<MasterHost>:19998/Output")

This Spark job will first synchronously write a new file to the Alluxio file system with N copies before returning. In the background, the Alluxio file system will persist a copy of the new data to the Alluxio under storage like S3.

By default, the property

alluxio.user.file.replication.durable

is set to 1, and writing a file to Alluxio using

ASYNC_THROUGH

completes at memory speed if there is a colocated Alluxio worker.

But if any worker crashes or restarts before the data is persisted, the data can be lost.

Increasing

alluxio.user.file.replication.durable

to some value greater than 1 ensures the data is replicated at N different workers in Alluxio to survive N-1 worker failures before being persisted, at the cost of temporarily storing N times more data in Alluxio.

Note that, before the file is eventually persisted, in case any of these N workers fail, Alluxio will take care of the failure by re-replicating the under-replicated blocks and ensuring sufficient copies of the file; also free command will not work before the file is persisted.

Once the file is persisted to the under storage, the in-Alluxio copies of this file can become evictable if another configuration allows this.

Enable Replicated Async Write in Alluxio

Configure the Alluxio property

alluxio.user.file.replication.durable

to a reasonable number to provide fault tolerance as well as set

alluxio.user.file.writetype.default

ASYNC_THROUGH

on all client applications.

Note that these properties are client-side only, meaning they don’t require restarts of any Alluxio processes to use them!

For example, put the following lines into

conf/alluxio-site.properties

to write with three initial copies before data is persisted:

alluxio.user.file.writetype.default=ASYNC_THROUGHalluxio.user.file.replication.durable=3

Alluxio command-line is also supported to use replicated async write.

For example, the

copyFromLocal

command can copy data from the local file system to Alluxio using replicated async write:

$ ./bin/alluxio fs -Dalluxio.user.file.writetype.default=ASYNC_THROUGH \  -Dalluxio.user.file.replication.durable=3 copyFromLocal <localPath> <alluxioPath>

Comparison to Other Alluxio Write Types

Replicated async write is one write type that aims to balance the requirements in write speed and data persistence, but also at the transient cost of more storage capacity.

Alluxio also provides a couple of additional write types realizing different trade-offs in the design space.

For an application that writes its data through Alluxio, unless the user is writing with

THROUGH

CACHE_THROUGH

, the data will not exist in the under storage when writing.

The problem is that those types of writes are slower than their

MUST_CACHE

and

ASYNC_THROUGH

counterparts. When a file is written using

MUST_CACHE

ASYNC_THROUGH

, if the blocks of the file only exist on a single worker and that worker fails or crashes, then the written data will be unrecoverable.

When Alluxio provides synchronous replicas of data at the time it is written, then it combines the performance of using

MUST_CACHE

while still providing fault tolerance.

Also published here.