Metadata Synchronization: Design, Implementation and Optimization

By: David Zhu

Metadata synchronization (sync) is a core feature in Alluxio that keeps files and directories consistent with their source of truth in under storage systems, thus making it simple for users to retrieve the latest version of data from Alluxio. Meanwhile, understanding the internal process is important in order to tune the performance. This article describes the design and the implementation in Alluxio to keep metadata synchronized.

Why is Metadata Sync Critical in Alluxio

In Alluxio, metadata refers to the information of files and directories in the Alluxio file system, including information of their owners, groups, permission, creation and modification time, etc. Metadata is independent of their content — even if a file or directory is empty, it still has associated metadata.

Alluxio maintains a copy of the file system or the object store namespace of the under storage system. Metadata consistency is important in Alluxio, especially when different clusters write/read data in the data pipeline and make changes outside of Alluxio.

A typical scenario in the above diagram is a data pipeline with the combination of Spark ETL and Presto SQL. The ETL cluster (without Alluxio) writes data, followed by the analytics cluster with Alluxio reading the transformed data. Since Alluxio maintains a copy of the metadata from under storage and manages the metadata, when data changes in under storage with the ETL step, the instance of Alluxio on the analytics cluster must be made aware and kept consistent with the metadata in the underlying storage system to proceed correctly.

How Does Metadata Sync Work in Alluxio

Alluxio provides a file system abstraction in a unified namespace on one or more under storage systems. Accessing files or directories through Alluxio is expected to get the same result as directly accessing the under storage. For example, if the under storage mounted to Alluxio root is s3://bucket/data, listing the "/" directory in Alluxio yields the same result as listing objects in s3://bucket/data and printing "/file" in Alluxio should return the same result as s3://bucket/data/file.

In Alluxio, metadata is only stored and served from Alluxio masters, whereas the content of individual files is served from Alluxio workers.

By default, Alluxio loads the metadata from the under storage on demand. In the above example, an Alluxio master starting from empty does not have any information about s3://bucket/data/file after startup. This file is only recognized when some user lists "/" directory in Alluxio or tries to access "/file". This “lazy” behavior prevents unnecessary effort and significantly improves performance because metadata operations in under storage can be slow.

Note that updating metadata can be bidirectional. If all modifications to the file system occur through Alluxio, Alluxio only needs to scan under storage once to retrieve the initial state, then apply changes in both Alluxio and under storage synchronously as a part of the file system RPC calls. This will provide users with a consistent view of under storage. In reality, however, changes are often made to the under storage outside of Alluxio. As a result, Alluxio master must discover the addition, removal, and updates to files and directions in under storage, and apply the changes in the Alluxio file system. This process to synchronize the two namespaces is called metadata sync.

How to Trigger Metadata Sync

When applications make changes to the metadata of an Alluxio file, and this file is persisted, the change will always be propagated to under storage synchronously, and there is no need to trigger metadata sync. When applications update under storage files without letting Alluxio know, there are two approaches to govern the timing of metadata sync.

1. Rely on a time-based automatic sync

One can set a sync Interval as an Alluxio property key “alluxio.user.file.metadata.sync.interval”.

When the value is -1 (current default value), Alluxio will never resync with under storage after initial load.
When its value is set to 0, Alluxio will always resync with under storage whenever metadata is accessed.
When the value is positive (default unit is ms), Alluxio will (best effort) not re-sync a path within that time interval.

Note that using this approach, if a path in Alluxio has never been accessed, it will never trigger a sync. Once the path is accessed after the interval of sync expires, Alluxio will sync with the under storage again. For example, in a Presto job, the query planning stage lists all the files necessary for the job, which will trigger a sync if those paths have not been accessed recently. However, subsequent stages of the job will not sync unless the job duration exceeds the sync interval.

So we can technically resync more frequently than the sync interval in this case.

This property key can be customized with a new global default value (when set in alluxio-site.properties) or on a directory basis recursively to apply all its children.

2. Sync manually using LoadMetadata flags

If sync metadata does not happen because of the sync interval, most Alluxio operations will continue to execute with the current metadata in the Alluxio file system with some exceptions:

For most of the users, Alluxio CLI “loadMetadata” is the easiest way to trigger sync manually. For example, one can run “bin/alluxio fs loadMetadata /path/to/sync” to force update the metadata of Alluxio path “/path/to/sync“.
For applications built on Alluxio file system SDK (Java), there are two API methods getStatus and listStatus which retrieve the metadata for a path or a directory. When calling these methods, there is an additional field of LoadMetadataPType in the option for each call, which may trigger a “loadMetadata” process by the master on the Alluxio path being inquired. This process can be considered as a simplified version of the sync by only loading file metadata from the under storage but not modifying a file’s metadata if it is already in the Alluxio. if LoadMetadataPType is set to NEVER, then nothing will be loaded, and if the file does not already exist, a FileNotFound exception will be thrown. When LoadMetadataPType is ONCE, we will only load the metadata once for each directory. This affects these two file system calls only, and this option is only considered when sync does not happen.

How is Metadata Sync Implemented

Alluxio master potentially triggers metadata sync on an Alluxio path when the Alluxio master receives an RPC request to retrieve the metadata of this path. Rather than having a dedicated service to traverse the entire file system inode tree and keep it in sync, this effort is amortized by each individual Alluxio file system operation on master. The high-level process for the syncing inside the RPC request is:

Given the Alluxio path, determine if it is not consistent with the corresponding under storage path. This means the under storage path does not exist, or has metadata that differs from Alluxio. This part is done using the RPC thread.
The steps from 1 should populate the sync queue, and we iterate through the sync queue and process each path in a worker thread from a separate thread pool. The order of traversal is a BFS order, because additional paths are added at the end of the queue. Concurrency and executors will be discussed in the parallelism section in more detail.
This part is carried out by the sync threads, and under storage, information is read using the under storage prefetch thread. The reason for this is to overlap communication with computation. Sync threads need to manipulate inode tree, and under storage, prefetch can start as soon as we determine that we need the information at some point in the future. Prefetching threads load the under storage status information into the under storage status cache, which is discussed in the cache section.

Note that the metadata sync process can be relatively expensive and block other operations if they touch the same part of the inode tree. This is because the sync process may write-lock the part of file system metadata it is updating. Particularly, when syncing a particular path in a tree, the RPC handling thread will first acquire the read lock on the entire path to the file. Because the syncing thread needs the ability to create paths as well, it needs a write lock for the sync root path. As the syncing thread processes each path under the root path, additional locks are acquired. A write lock to the file path is acquired by the syncing thread and released as soon as the path is processed.

Performance Optimization

Tuning parallelism

One can tune the level of parallelism to sync metadata by controlling three configuration parameters:

alluxio.master.metadata.sync.concurrency.level This indicates the number of individual files to sync concurrently within a single metadata sync request (e.g., on a directory).
alluxio.master.metadata.sync.executor.pool.size This indicates the number of concurrent threads for all sync operations
alluxio.master.metadata.sync.ufs.prefetch.pool.size This indicates the number of concurrent threads that can perform under storage prefetch operations for all sync operations

Caching result

There are three types of different caches, with different goals and uses in the metadata syncing process. Here is a quick summary of all of them.

AbsentCache is a negative cache, used to avoid checking the under storage for those paths that have been known to not exist. It uses prefix matching to determine if a path is in the under storage or not. For example, if the path /a/b is in the absent cache, we know /a/b/cmust not exist on the under storage either.
In addition, Absent Cache entries have a timestamp attached. So that we know the time it was last checked in the under storage. This is useful when the sync interval is some time period, we use the timestamp to determine whether we need to recheck the under storage for the existence of the file or directory.
UfsStatusCache is a cache used to prefetch under storage statuses during the sync process. Instead of fetching the path information when needed, we can often prefetch some file statuses while we process the current directory.
UfsSyncPathCache is a positive cache and contains paths that have been synced with the under storage recently. When we receive a metadata operation, we will check with this cache to determine whether we need to sync a particular path.

Summary

Metadata synchronization is one of the most important features in Alluxio. There are multiple different ways to trigger synchronization with different performance tradeoffs. Inside Alluxio master, there is a list of optimizations applied to speed up the synchronization.

More Info

Also Published Here

Metadata Synchronization: Design, Implementation and Optimization

Too Long; Didn't Read

Companies Mentioned

Why is Metadata Sync Critical in Alluxio

How Does Metadata Sync Work in Alluxio

How to Trigger Metadata Sync

1. Rely on a time-based automatic sync

2. Sync manually using LoadMetadata flags

How is Metadata Sync Implemented

Performance Optimization

Tuning parallelism

Caching result

Summary

More Info

About Author

TOPICS

THIS ARTICLE WAS FEATURED IN...

Categories

Trending Topics

Metadata Synchronization: Design, Implementation and Optimization

Too Long; Didn't Read

Companies Mentioned

Why is Metadata Sync Critical in Alluxio

How Does Metadata Sync Work in Alluxio

How to Trigger Metadata Sync

1. Rely on a time-based automatic sync

2. Sync manually using LoadMetadata flags

How is Metadata Sync Implemented

Performance Optimization

Tuning parallelism

Caching result

Summary

More Info

About Author

TOPICS

THIS ARTICLE WAS FEATURED IN...

RELATED STORIES

Categories

Trending Topics