Data Location Awareness: The Benefits of Implementing Tiered Locality


Bin Fan (@bin-fan)

VP of Open Source and Founding Member @Alluxio

Tiered Locality is a feature led by my colleague Andrew Audibert at Alluxio. This article dives into the details of how tiered locality helps provide optimized performance and lower costs. 

1. Data Locality and Tiered Locality

Data locality corresponds to the effort required for an application to read the data it needs. The closer or more localized the data is, the faster the application can retrieve it. As a configuration optimization, it provides significant value in big data workloads on distributed systems and enables higher performance without needing to scale more resources. In most cases, data locality is about colocating the data and application on the same node. However, in the cloud, where there are multiple networking layers, there are more tiers of locality.
Tiered locality allows a user to configure data placement policies to accommodate the cluster's network topology in order to achieve performance and cost optimizations. 

2. Tiered Locality in Alluxio OSS

Tiered locality uses awareness of network topology and configurable policies to manage data placement for performance and cost optimizations. This feature is particularly useful with cloud deployments across multiple availability zones. It can also be useful for cost savings in environments where cross-zone or cross-location traffic is more expensive than intra-zone data traffic.
Here is a simple scenario where Alluxio can use network topology information to bias towards more local reads and writes with Alluxio workers in two different AWS Availability Zones.
Using this setup with m5.xlarge EC2 instances, the application demonstrates different read performance depending on which worker the data is read from.
Unsurprisingly, reads are fastest from the local Alluxio worker and slower from non-local workers. The performance difference between worker 2 and worker 3 comes from the difference in bandwidth within and across Availability Zones (AZs). Worker 2 is in the same AZ as the application, with about 10 gigabits per second of bandwidth. Reading from worker 3 is slower because the bandwidth across AZs is only about 5 gigabits per second.
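To put those bandwidth figures in perspective, here is a back-of-the-envelope sketch in Python. It only computes a network-bound lower limit (size in gigabits divided by link speed) and ignores disk, protocol overhead, and contention. The 10 GB dataset size is a made-up value for illustration, not a measurement from the setup described here.

```python
def transfer_seconds(size_gigabytes, bandwidth_gbps):
    """Network-only lower bound: convert gigabytes to gigabits, divide by gigabits/s."""
    return size_gigabytes * 8 / bandwidth_gbps


DATASET_GB = 10  # hypothetical dataset size, chosen only for illustration

for label, gbps in [("worker 2 (same AZ)", 10), ("worker 3 (other AZ)", 5)]:
    print(f"{label}: {transfer_seconds(DATASET_GB, gbps):.1f} s")
# prints:
# worker 2 (same AZ): 8.0 s
# worker 3 (other AZ): 16.0 s
```

Halving the available bandwidth doubles the best-case read time, which is why the choice of worker matters even before cross-zone transfer charges are considered.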
Without tiered locality, the application is just as likely to read from either worker 2 or 3. Configuring tiered locality gives a preference for worker 2 for faster performance. The situation is similar for writing data. When applications write data through Alluxio, there is a preference for more-local workers.
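The preference described above can be sketched as a small matching routine. This is an illustration of the idea only, not Alluxio's actual implementation: the function name, the dictionary representation of identities, and the host names are all assumptions made for the example.

```python
def nearest_worker(client, workers, order):
    """Return (worker_name, matched_tier) for the earliest tier in `order`
    where some worker shares the client's value; (None, None) if none match."""
    for tier in order:
        client_value = client.get(tier)
        if client_value is None:
            continue  # the client has no value for this tier, try the next one
        for name, identity in workers.items():
            if identity.get(tier) == client_value:
                return name, tier
    return None, None  # no locality match; the caller may pick any worker


# Hypothetical identities; the host names are invented for this sketch.
order = ["node", "az"]
app = {"node": "host-app", "az": "us-east-1a"}
workers = {
    "worker3": {"node": "host-3", "az": "us-east-1b"},
    "worker2": {"node": "host-2", "az": "us-east-1a"},
}

print(nearest_worker(app, workers, order))  # ('worker2', 'az')
```

If a worker on the application's own host were added to the candidates, it would match in the node tier and be chosen ahead of worker 2.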

3. Configuration

To enable tiered locality and the associated performance benefit, every actor (clients and workers) must be configured to know its tiered identity. A tiered identity is a mapping from each locality tier (e.g. Availability Zone) to the actor's value for that tier (e.g. us-east-1a). In the example cluster, the application, worker 1, and worker 2 have the az value us-east-1a, while workers 3 and 4 have the az value us-east-1b.
Configure with alluxio-site.properties
The most straightforward way to configure tiered locality is to set it in alluxio-site.properties. Please refer to the Alluxio configuration settings documentation for details on this file.
Properties for Application, Worker 1, and Worker 2
alluxio-site.properties
alluxio.locality.az=us-east-1a
# custom locality hierarchy
alluxio.locality.order=node,az
Properties for Worker 3 and Worker 4
alluxio-site.properties
alluxio.locality.az=us-east-1b
# custom locality hierarchy
alluxio.locality.order=node,az
We set alluxio.locality.order to introduce the az locality tier and show its order in the locality hierarchy. By default the locality tiers are node,rack. Note that we don't need to explicitly configure node identity because it is determined automatically via localhost lookup.
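As a rough illustration of how these properties combine into a tiered identity, the Python sketch below reads the order, looks up a value for each tier, and falls back to the local host name for the node tier. It is a simplified model, not Alluxio's code: the function name is hypothetical, the properties are assumed to be already parsed into a dictionary, and allowing an explicit alluxio.locality.node override is an assumption that simply follows the alluxio.locality.<tier> naming pattern used above.

```python
import socket


def tiered_identity(props, default_order="node,rack"):
    """Build an ordered list of (tier, value) pairs from parsed properties."""
    order = props.get("alluxio.locality.order", default_order)
    identity = []
    for tier in (t.strip() for t in order.split(",")):
        # Assumption: each tier value lives under alluxio.locality.<tier>.
        value = props.get(f"alluxio.locality.{tier}")
        if value is None and tier == "node":
            value = socket.gethostname()  # stand-in for the localhost lookup
        identity.append((tier, value))
    return identity


props = {
    "alluxio.locality.order": "node,az",
    "alluxio.locality.az": "us-east-1a",
}
print(tiered_identity(props))
```

The first pair printed is the node tier with whatever host name the machine reports, followed by ('az', 'us-east-1a').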
Configure with alluxio-locality.sh
When the cluster is set up automatically or there are many workers, it can be convenient to set the locality information via script instead of using a static value in alluxio-site.properties. If a script exists at ${ALLUXIO_HOME}/conf/alluxio-locality.sh, it will be executed to determine tiered identity.
alluxio-locality.sh
#!/bin/bash
echo "az=$(curl -s http://169.254.169.254/latest/meta-data/placement/availability-zone)"
The same script can be set on every node to look up and report the availability zone. The script name can be configured by setting the alluxio.locality.script property key.
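For readers who want to sanity-check what such a script prints before deploying it, here is a small Python sketch that parses the output into a tier-to-value mapping. The script above emits a single tier=value pair; accepting several comma-separated pairs is an assumption of this sketch, and the function name is hypothetical.

```python
def parse_locality_output(output):
    """Parse 'tier=value' pairs (comma-separated) into a dict; raise on bad input."""
    identity = {}
    for pair in output.strip().split(","):
        if not pair.strip():
            continue  # tolerate empty output or a trailing comma
        tier, sep, value = pair.partition("=")
        tier, value = tier.strip(), value.strip()
        if not sep or not tier or not value:
            raise ValueError(f"malformed tier pair: {pair!r}")
        identity[tier] = value
    return identity


print(parse_locality_output("az=us-east-1a\n"))  # {'az': 'us-east-1a'}
```

A check like this catches the common failure mode where the metadata lookup returns nothing and the script prints a bare "az=".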

4. Custom Tiers

The example shown here uses the tiers node and az, but the tier configuration is fully customizable. Readers can use whatever tiers make sense for their deployment, e.g. rack, zone, or region, as long as each tier is contained within the next one in alluxio.locality.order, since locality decisions prefer to match in the earliest tier possible.
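The containment requirement can be checked mechanically before rolling out a custom hierarchy. The Python sketch below flags any tier value that appears under two different values of the next tier, for example one rack reported in two Availability Zones. The function name and the sample identities are hypothetical.

```python
def containment_violations(identities, order):
    """Return (tier, value, first_parent, conflicting_parent) tuples where a
    value of one tier is seen under two different values of the next tier."""
    violations = []
    for inner, outer in zip(order, order[1:]):
        parent_of = {}
        for identity in identities:
            inner_value, outer_value = identity.get(inner), identity.get(outer)
            if inner_value is None or outer_value is None:
                continue  # nothing to check when a tier is missing
            first_parent = parent_of.setdefault(inner_value, outer_value)
            if first_parent != outer_value:
                violations.append((inner, inner_value, first_parent, outer_value))
    return violations


# Hypothetical cluster where rack r1 is wrongly reported in two zones.
identities = [
    {"node": "h1", "rack": "r1", "az": "us-east-1a"},
    {"node": "h2", "rack": "r1", "az": "us-east-1b"},
]
print(containment_violations(identities, ["node", "rack", "az"]))
# [('rack', 'r1', 'us-east-1a', 'us-east-1b')]
```

An empty result means every tier nests cleanly inside the next, so a match in an earlier tier always implies a match in the later ones.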
For more information on Alluxio tiered locality, Alluxio community users can refer to the Alluxio documentation. It is also worth mentioning that Alluxio Enterprise Edition adds a further feature: strict locality tiers. Users can define a tier to be “strict”, meaning that no traffic is allowed between actors that don’t match in that tier. For more information on cluster partitioning with strict locality, Alluxio Enterprise Edition users can refer to the Enterprise Edition docs.

5. Conclusion

In big data workloads on distributed systems, data locality helps engineers work around the limits imposed by network traffic. In the cloud, where clusters have non-uniform networking capabilities, tiered locality can both improve performance and reduce costs. With its ability to place and manage data according to network topology, tiered locality is one of many capabilities that Alluxio offers as a virtual data layer. For more information and discussion, I encourage users to dive deeper into the Alluxio open source community.
This article was previously published on Alluxio’s engineering blog and DZone.






