By: David Zhu
Metadata synchronization (sync) is a core feature in Alluxio that keeps files and directories consistent with their source of truth in under storage systems, thus making it simple for users to retrieve the latest version of data from Alluxio. Meanwhile, understanding the internal process is important in order to tune the performance. This article describes the design and the implementation in Alluxio to keep metadata synchronized.
In Alluxio, metadata refers to the information about files and directories in the Alluxio file system, including their owners, groups, permissions, creation and modification times, etc. Metadata is independent of file content: even an empty file or directory still has associated metadata.
Alluxio maintains a copy of the file system or the object store namespace of the under storage system. Metadata consistency is important in Alluxio, especially when different clusters write/read data in the data pipeline and make changes outside of Alluxio.
A typical scenario in the above diagram is a data pipeline with the combination of Spark ETL and Presto SQL. The ETL cluster (without Alluxio) writes data, followed by the analytics cluster with Alluxio reading the transformed data. Since Alluxio maintains a copy of the metadata from under storage and manages the metadata, when data changes in under storage with the ETL step, the instance of Alluxio on the analytics cluster must be made aware and kept consistent with the metadata in the underlying storage system to proceed correctly.
Alluxio provides a file system abstraction in a unified namespace on one or more under storage systems. Accessing files or directories through Alluxio is expected to get the same result as directly accessing the under storage. For example, if the under storage mounted to Alluxio root is s3://bucket/data, listing the "/" directory in Alluxio yields the same result as listing objects in s3://bucket/data, and printing "/file" in Alluxio should return the same result as s3://bucket/data/file.
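The path correspondence in this example can be illustrated with a small sketch. The helper name to_ufs_path is hypothetical and is not part of Alluxio; it only shows how a path under the Alluxio root lines up with a path under the mounted location.

```python
def to_ufs_path(alluxio_path: str, mount_root: str = "s3://bucket/data") -> str:
    """Map an Alluxio path to the matching under storage path (hypothetical helper)."""
    if not alluxio_path.startswith("/"):
        raise ValueError("Alluxio paths are absolute and start with '/'")
    relative = alluxio_path.strip("/")
    root = mount_root.rstrip("/")
    # The Alluxio root maps to the mount root itself; everything else is appended.
    return f"{root}/{relative}" if relative else root


print(to_ufs_path("/"))      # s3://bucket/data
print(to_ufs_path("/file"))  # s3://bucket/data/file
```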
In Alluxio, metadata is only stored and served from Alluxio masters, whereas the content of individual files is served from Alluxio workers.
By default, Alluxio loads the metadata from the under storage on demand. In the above example, an Alluxio master starting from empty does not have any information about s3://bucket/data/file after startup. This file is only recognized when some user lists the "/" directory in Alluxio or tries to access "/file". This “lazy” behavior prevents unnecessary effort and significantly improves performance because metadata operations in under storage can be slow.
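The on-demand behavior can be pictured with a toy model. This is a minimal sketch, not Alluxio's implementation: the LazyMaster class, its fields, and the dictionary standing in for under storage are all assumptions made for illustration.

```python
class LazyMaster:
    """Toy master that starts empty and loads metadata from under storage on demand."""

    def __init__(self, ufs: dict):
        self.ufs = ufs        # stand-in for under storage: path -> metadata
        self.inodes = {}      # metadata currently known to the master
        self.ufs_calls = 0    # counts the (slow) under storage lookups

    def get_status(self, path: str) -> dict:
        if path not in self.inodes:
            # First access: fetch from under storage and remember the result.
            self.ufs_calls += 1
            if path not in self.ufs:
                raise FileNotFoundError(path)
            self.inodes[path] = dict(self.ufs[path])
        return self.inodes[path]


ufs = {"/file": {"owner": "alice", "length": 42}}
master = LazyMaster(ufs)
print(master.inodes)               # {} because nothing is known right after startup
print(master.get_status("/file"))  # first access loads from under storage
master.get_status("/file")         # second access is served by the master alone
print(master.ufs_calls)            # 1
```

Only the first access pays the cost of an under storage lookup; later accesses are answered from the master's own copy, which is also why that copy can become stale.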
Note that updating metadata can be bidirectional. If all modifications to the file system occur through Alluxio, Alluxio only needs to scan under storage once to retrieve the initial state, then apply changes in both Alluxio and under storage synchronously as a part of the file system RPC calls. This will provide users with a consistent view of under storage. In reality, however, changes are often made to the under storage outside of Alluxio. As a result, the Alluxio master must discover the addition, removal, and updates to files and directories in under storage, and apply the changes in the Alluxio file system. This process to synchronize the two namespaces is called metadata sync.
When applications make changes to the metadata of an Alluxio file, and this file is persisted, the change will always be propagated to under storage synchronously, and there is no need to trigger metadata sync. When applications update under storage files without letting Alluxio know, there are two approaches to govern the timing of metadata sync.
One can set a sync interval through the Alluxio property key “alluxio.user.file.metadata.sync.interval”.
Note that using this approach, if a path in Alluxio has never been accessed, it will never trigger a sync. Once the path is accessed after the interval of sync expires, Alluxio will sync with the under storage again. For example, in a Presto job, the query planning stage lists all the files necessary for the job, which will trigger a sync if those paths have not been accessed recently. However, subsequent stages of the job will not sync unless the job duration exceeds the sync interval.
So we can technically resync more frequently than the sync interval in this case.
This property key can be customized with a new global default value (when set in alluxio-site.properties) or on a per-directory basis, in which case it applies recursively to all of the directory's children.
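The interval-based decision can be sketched as follows. This is an illustrative model under stated assumptions, not Alluxio's code: the function names, the overrides dictionary, and the millisecond timestamps are hypothetical. The handling of -1 (never sync) and 0 (sync on every access) is an assumption based on Alluxio's documented property semantics and should be verified against the version in use.

```python
import posixpath
from typing import Dict, Optional


def effective_interval(path: str, overrides: Dict[str, int], default: int) -> int:
    """Return the interval of the nearest ancestor with an override, else the default."""
    if not path.startswith("/"):
        raise ValueError("expected an absolute path")
    current = path
    while True:
        if current in overrides:
            return overrides[current]
        if current == "/":
            return default
        current = posixpath.dirname(current)


def should_sync(last_sync_ms: Optional[int], now_ms: int, interval_ms: int) -> bool:
    """Decide whether an access to a path should trigger a metadata sync."""
    if interval_ms < 0:       # assumed: -1 disables syncing after the initial load
        return False
    if interval_ms == 0:      # assumed: 0 syncs on every access
        return True
    if last_sync_ms is None:  # never synced before
        return True
    return now_ms - last_sync_ms >= interval_ms


overrides = {"/hot": 0, "/archive": -1}  # hypothetical per-directory settings
print(effective_interval("/hot/day1/part-0", overrides, 60_000))  # 0
print(effective_interval("/other/file", overrides, 60_000))       # 60000
print(should_sync(1_000, 30_000, 60_000))                         # False
print(should_sync(1_000, 70_000, 60_000))                         # True
```

Note that nothing in this model runs in the background: the check happens only when a path is accessed, which matches the observation that a path that is never accessed never triggers a sync.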
If metadata sync does not happen because of the sync interval, most Alluxio operations will continue to execute with the current metadata in the Alluxio file system, with some exceptions:
“loadMetadata” is the easiest way to trigger sync manually. For example, one can run “bin/alluxio fs loadMetadata /path/to/sync” to force update the metadata of Alluxio path “/path/to/sync”.
getStatus and listStatus, which retrieve the metadata for a path or a directory. When calling these methods, there is an additional field of LoadMetadataPType in the option for each call, which may trigger a “loadMetadata” process by the master on the Alluxio path being inquired. This process can be considered a simplified version of the sync: it only loads file metadata from the under storage and does not modify a file’s metadata if it is already in Alluxio. If LoadMetadataPType is set to NEVER, then nothing will be loaded, and if the file does not already exist, a FileNotFound exception will be thrown. When LoadMetadataPType is ONCE, we will only load the metadata once for each directory. This affects these two file system calls only, and this option is only considered when sync does not happen.
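The difference between NEVER and ONCE can be illustrated with a toy model. This is a sketch under assumptions, not the Alluxio client or master API: the LoadMetadataType enum, the ToyMaster class, and the nested dictionary standing in for under storage are hypothetical, and only the two values described above are modeled.

```python
from enum import Enum


class LoadMetadataType(Enum):
    """Hypothetical stand-in for the two LoadMetadataPType values described above."""
    NEVER = "NEVER"
    ONCE = "ONCE"


class ToyMaster:
    def __init__(self, ufs: dict):
        self.ufs = ufs            # stand-in under storage: directory -> {name: metadata}
        self.inodes = {}          # metadata already known to the master
        self.loaded_dirs = set()  # directories that have been loaded once

    def list_status(self, directory: str, load_type: LoadMetadataType) -> list:
        if load_type is LoadMetadataType.ONCE and directory not in self.loaded_dirs:
            self.loaded_dirs.add(directory)
            if directory in self.ufs:
                known = self.inodes.setdefault(directory, {})
                for name, meta in self.ufs[directory].items():
                    # Only add missing entries; existing metadata is not modified.
                    known.setdefault(name, dict(meta))
        if directory not in self.inodes:
            raise FileNotFoundError(directory)
        return sorted(self.inodes[directory])


ufs = {"/data": {"a.csv": {"length": 1}}}
master = ToyMaster(ufs)
try:
    master.list_status("/data", LoadMetadataType.NEVER)
except FileNotFoundError:
    print("NEVER: nothing loaded, path unknown")
print(master.list_status("/data", LoadMetadataType.ONCE))  # ['a.csv']
ufs["/data"]["b.csv"] = {"length": 2}                      # change made outside the master
print(master.list_status("/data", LoadMetadataType.ONCE))  # ['a.csv']
```

The last call shows why this is weaker than a full sync: a file added to under storage after the one-time load stays invisible until a real metadata sync runs.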
Alluxio master potentially triggers metadata sync on an Alluxio path when the Alluxio master receives an RPC request to retrieve the metadata of this path. Rather than having a dedicated service to traverse the entire file system inode tree and keep it in sync, this effort is amortized by each individual Alluxio file system operation on master. The high-level process for the syncing inside the RPC request is:
Note that the metadata sync process can be relatively expensive and block other operations if they touch the same part of the inode tree. This is because the sync process may write-lock the part of file system metadata it is updating. Particularly, when syncing a particular path in a tree, the RPC handling thread will first acquire the read lock on the entire path to the file. Because the syncing thread needs the ability to create paths as well, it needs a write lock for the sync root path. As the syncing thread processes each path under the root path, additional locks are acquired. A write lock to the file path is acquired by the syncing thread and released as soon as the path is processed.
One can tune the level of parallelism to sync metadata by controlling three configuration parameters:
alluxio.master.metadata.sync.concurrency.level
This indicates the number of individual files to sync concurrently within a single metadata sync request (e.g., on a directory).
alluxio.master.metadata.sync.executor.pool.size
This indicates the number of concurrent threads for all sync operations.
alluxio.master.metadata.sync.ufs.prefetch.pool.size
This indicates the number of concurrent threads that can perform under storage prefetch operations for all sync operations.
There are three types of different caches, with different goals and uses in the metadata syncing process. Here is a quick summary of all of them.
If /a/b is in the absent cache, we know /a/b/c must not exist on the under storage either.
Metadata synchronization is one of the most important features in Alluxio. There are multiple ways to trigger synchronization, each with different performance tradeoffs. Inside the Alluxio master, a number of optimizations are applied to speed up the synchronization.