Authors:
(1) Limeng Zhang, Centre for Research on Engineering Software Technologies (CREST), The University of Adelaide, Australia;
(2) M. Ali Babar, Centre for Research on Engineering Software Technologies (CREST), The University of Adelaide, Australia.
1.1 Configuration Parameter Tuning Challenges and 1.2 Contributions
3 Overview of Tuning Framework
4 Workload Characterization and 4.1 Query-level Characterization
4.2 Runtime-based Characterization
5 Feature Pruning and 5.1 Workload-level Pruning
5.2 Configuration-level Pruning
7 Configuration Recommendation and 7.1 Bayesian Optimization
10 Discussion and Conclusion, and References
In this section, we explore automatic parameter recommendation methodologies within the realm of DBMSs, categorizing them into four principal domains: Bayesian Optimization, Reinforcement Learning, Neural Network Solutions, and Search-based Solutions. Each category encompasses a diverse array of methodologies and techniques designed to address the specific nuances and challenges encountered in varying DBMS environments and scenarios.
Bayesian Optimization (BO) leverages Bayes' theorem to direct an efficient and effective search of a global optimization problem. It minimizes or maximizes an objective function f iteratively through adaptive sampling of the search space. BO has two key components: a surrogate model and an acquisition function. The surrogate model captures the behavior of the objective function f, while the acquisition function guides the selection of the next sample. BO iteratively updates its surrogate model using samples chosen by the acquisition function, ultimately identifying the minimal or maximal value of f.
Surrogate models can be various machine learning models [33], such as Random Forest (RF) [34] and Gaussian Process (GP) models [35], Bayesian neural networks [36], or Tree Parzen Estimators [37]. Among them, GP is a popular surrogate model for objective modeling in BO due to its expressiveness, well-calibrated uncertainty estimates, and support for noisy observations. GP provides confidence bounds on its predictions to model the objective function with limited samples [3]. Regarding acquisition functions, popular choices include expected improvement (EI) [38], probability of improvement (PI) [39], GP upper confidence bound (GP-UCB) [40], and entropy search [41], with EI being the most widely utilized. Sequential Model-based Algorithm Configuration (SMAC) [34] is a popular variant of BO-based solutions that uses a random forest as its surrogate model, which is known to perform well for high-dimensional and categorical input. It supports all types of variables, including continuous, discrete, and categorical features.
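To make these components concrete, the following is a minimal sketch of one BO iteration with a GP surrogate and the EI acquisition function, using scikit-learn and SciPy; the two normalized knobs and the stand-in objective are illustrative assumptions rather than part of any surveyed system.

```python
# Minimal sketch: fit a GP surrogate on observed (configuration, latency)
# samples, then pick the next configuration by maximizing EI (minimization).
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

rng = np.random.default_rng(0)
X_obs = rng.uniform(0.0, 1.0, size=(5, 2))      # 2 normalized knobs (illustrative)
y_obs = np.sin(3 * X_obs[:, 0]) + X_obs[:, 1]   # stand-in objective (lower is better)

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
gp.fit(X_obs, y_obs)

def expected_improvement(X_cand, y_best, xi=0.01):
    """EI for minimization: expected amount a candidate improves on y_best."""
    mu, sigma = gp.predict(X_cand, return_std=True)
    sigma = np.maximum(sigma, 1e-9)
    z = (y_best - mu - xi) / sigma
    return (y_best - mu - xi) * norm.cdf(z) + sigma * norm.pdf(z)

# Score a random candidate pool and select the configuration with the highest EI.
X_cand = rng.uniform(0.0, 1.0, size=(1000, 2))
next_config = X_cand[np.argmax(expected_improvement(X_cand, y_obs.min()))]
print("next configuration to evaluate:", next_config)
```

In a full tuner, the selected configuration would be executed, its measured performance appended to the observations, and the surrogate refit before the next iteration.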
Bayesian optimization-based automatic parameter tuning solutions primarily involve designing the model components, including the acquisition function, surrogate model, model inputs (i.e., system configurations and configuration constraints), initial sample design, and model update, to meet specific tuning objectives. In this section, we provide an overview of state-of-the-art BO-based automatic tuning methods applied in DBMS tuning and BDAFs to describe how BO is adopted in automatic parameter tuning. A summary of the different solutions is provided in Table 3.
7.1.1 BO-based solutions
LOCAT [7] proposes a BO-based approach to ascertain the optimal configurations tailored for Spark SQL applications, with a particular focus on accommodating variations in input data sizes. It begins by employing a workload-level pruning model, known as Query Configuration Sensitivity Analysis (QCSA), to eliminate configuration-insensitive queries within a workload. Meanwhile, it conducts configuration-level pruning using its identification of important configuration parameters (IICP) model through a two-step process: configuration parameter selection (CPS) and configuration parameter extraction (CPE). Subsequently, LOCAT utilizes the BO model along with its proposed Datasize-Aware Gaussian Process (DAGP), which characterizes the application's performance as a distribution of functions of configuration parameters and input data size, to recommend the optimal configuration. In its BO-based solution, it employs Expected Improvement with a Markov Chain Monte Carlo (EI-MCMC) hyperparameter marginalization algorithm [42] as the acquisition function, leveraging the proposed DAGP as the surrogate model. Regarding its initial design, LOCAT incrementally constructs the GP model, initially starting with three samples generated through Latin Hypercube Sampling (LHS). Each sample includes information on execution time, configuration, and the corresponding data size. Following each execution, the updated GP model assists BO in selecting the next candidate configuration, aimed at minimizing the execution time of a Spark SQL application.
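As a hedged illustration of this initial design, the sketch below draws three Latin Hypercube samples over a handful of Spark knobs and appends the input data size as an extra surrogate-model feature, reflecting the DAGP idea; the knob names, bounds, and data size are illustrative assumptions.

```python
# Hedged sketch of a LOCAT-style initial design: three LHS samples plus the
# input data size as an additional feature for a data-size-aware GP surrogate.
import numpy as np
from scipy.stats import qmc

# Illustrative knobs: executor cores, executor memory (MB), memory fraction.
knob_lows = np.array([1.0, 64.0, 0.2])
knob_highs = np.array([8.0, 8192.0, 0.9])

sampler = qmc.LatinHypercube(d=3, seed=42)
initial_knobs = qmc.scale(sampler.random(n=3), knob_lows, knob_highs)

data_size_gb = 50.0   # current input data size, treated as a model feature
X_init = np.hstack([initial_knobs, np.full((3, 1), data_size_gb)])
# Each row (knobs + data size) is executed once; the measured execution times
# become the first observations used to fit the data-size-aware GP surrogate.
print(X_init)
```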
OtterTune [8] proposes to initiate the BO-based solution for DBMS configuration tuning with observation data from similar workloads. Specifically, it first maps unseen DBMS workloads to previously observed workloads by measuring the Euclidean distance between workloads, identifying the most similar workload in its repository based on DBMS runtime metrics. In the next step, OtterTune utilizes the observation data of the selected similar workload to train a GP surrogate model and updates it with new observations of the target workload. Meanwhile, to improve the model's generalization ability on the target workload, OtterTune incorporates noisy-data processing techniques, including increasing the variance of the noise parameter for all points in the GP model that OtterTune has not yet tried in this tuning session and adding a smaller ridge term for each configuration that OtterTune selects. Regarding the acquisition function, OtterTune adopts EI and consistently selects configurations with the greatest EI. Meanwhile, gradient descent [25] is applied to identify the local optimum on the surface predicted by the GP model, using data comprising the top-performing configurations completed in the current tuning session and randomly selected configurations.
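The workload-mapping step can be pictured with the small sketch below, which selects the most similar previously observed workload by Euclidean distance over normalized runtime metrics; the metric values and workload names are hypothetical placeholders.

```python
# Minimal sketch of OtterTune-style workload mapping: pick the repository
# workload whose (scaled) runtime metrics are closest to the target's.
import numpy as np

repository = {
    "workload_A": np.array([0.82, 0.10, 0.35]),   # e.g., buffer hit ratio, lock waits, reads (scaled)
    "workload_B": np.array([0.40, 0.55, 0.90]),
}
target_metrics = np.array([0.78, 0.15, 0.30])

closest = min(repository, key=lambda w: np.linalg.norm(repository[w] - target_metrics))
print("most similar workload:", closest)
# The observations of `closest` bootstrap the GP surrogate, which is then
# updated with new observations collected on the target workload.
```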
ResTune [6] introduces a Constrained Bayesian Optimization (CBO) solution, which extends BO into an inequality-constrained setting to optimize resource utilization without violating service-level agreement constraints during the DBMS configuration tuning process. Specifically, it models each DBMS tuning task with a multi-output GP model composed of three independent GP models, representing resource utilization, throughput, and latency requirements. Meanwhile, ResTune introduces Constrained Expected Improvement (CEI) as its acquisition function. ResTune redefines the best configuration point as the one with the lowest resource utilization while satisfying the throughput and latency constraints. CEI calculates the expected improvement of a specific configuration over the best feasible point so far, weighted by the probability that both throughput and latency satisfy the constraints. Moreover, to expedite training, ResTune adopts an ensemble framework to learn from experience by assembling pre-trained learners (GP models) from previous tuning workloads in a weighted manner. In the initial tuning stage, it assigns static weights to previous models based on workload embedding features. Once sufficient observations have been collected, ResTune assigns dynamic weights according to a ranking loss based on the similarity of the performance surfaces.
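The constrained acquisition can be sketched roughly as below: standard EI on resource utilization multiplied by the probabilities that the throughput and latency constraints hold, assuming independent GP outputs; all numbers are stand-ins for GP posterior predictions rather than values from ResTune.

```python
# Rough sketch of Constrained EI (CEI): EI on resource utilization, weighted
# by the probability that throughput and latency constraints are satisfied.
import numpy as np
from scipy.stats import norm

def cei(mu_cpu, sd_cpu, best_feasible_cpu,
        mu_tps, sd_tps, tps_min,
        mu_lat, sd_lat, lat_max):
    # EI on resource utilization (minimization), as in standard BO.
    z = (best_feasible_cpu - mu_cpu) / sd_cpu
    ei = (best_feasible_cpu - mu_cpu) * norm.cdf(z) + sd_cpu * norm.pdf(z)
    # Probability each SLA constraint holds, assuming independent GP outputs.
    p_tps = 1.0 - norm.cdf((tps_min - mu_tps) / sd_tps)   # throughput >= tps_min
    p_lat = norm.cdf((lat_max - mu_lat) / sd_lat)         # latency <= lat_max
    return ei * p_tps * p_lat

print(cei(mu_cpu=0.55, sd_cpu=0.05, best_feasible_cpu=0.60,
          mu_tps=1200.0, sd_tps=100.0, tps_min=1000.0,
          mu_lat=45.0, sd_lat=5.0, lat_max=50.0))
```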
RelM [19] introduces a BO-based solution tailored for the optimization of memory-based analytics applications. In its framework, it utilizes GP as the surrogate model with EI as the acquisition function. In terms of inputs for the surrogate model, RelM considers both the application configuration parameters and a set of empirical analysis statistics related to the objectives it defines. When searching, a combination of random sampling and standard gradient-based search [43] is carried out to find the highest EI.
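The search step can be approximated as below: random seeding followed by a gradient-based refinement of the best seed; `neg_ei` is a simple placeholder for the negated EI of a fitted surrogate (such as the `expected_improvement` function sketched earlier) so that the snippet runs on its own.

```python
# Sketch of EI maximization via random sampling plus gradient-based search.
import numpy as np
from scipy.optimize import minimize

def neg_ei(x):
    # Placeholder surface standing in for the negated EI of a fitted surrogate.
    return float((x[0] - 0.3) ** 2 + (x[1] - 0.7) ** 2)

candidates = np.random.uniform(0.0, 1.0, size=(256, 2))         # random seeding
start = candidates[np.argmin([neg_ei(c) for c in candidates])]  # best seed
result = minimize(neg_ei, start, method="L-BFGS-B", bounds=[(0, 1), (0, 1)])
print("configuration with highest EI:", result.x)
```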
ONLINETUNE [3] provides a BO-based solution for online tuning of DBMSs, addressing safety and dynamicity concerns during the tuning process by adding a dynamic context to the tuning task and narrowing the search space down to a safe configuration subspace. Specifically, it augments the GP surrogate model with dynamic context variables, including workload dynamicity characteristics based on query arrival rate and query composition, as well as the underlying data distribution, in conjunction with DBMS configurations from historical observations. Additionally, it constructs an additive kernel by combining a linear kernel, which models the dependencies among contexts, with a Matérn kernel, which captures the nonlinear performance associated with configurations, on the premise that performance correlations emerge when configurations and contextual factors are similar. ONLINETUNE adopts the Upper Confidence Bound (UCB) [44] as its acquisition function, aiming to improve the mean estimate of the underlying function and maximally reduce the uncertainty at candidate configurations. Moreover, it unifies two sampling criteria according to an epsilon-greedy policy [45]: the UCB method, which selects the safe configuration with the maximal upper confidence bound, and the Highest Uncertainty Boundary (HUB) method, which selects the safe configuration at the boundary of the safety set with the highest uncertainty.
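A hand-rolled sketch of this additive kernel idea is given below: a linear kernel over context features plus a Matérn 5/2 kernel over configuration knobs; the feature split and lengthscale are illustrative assumptions, not ONLINETUNE's actual hyperparameters.

```python
# Sketch of an additive kernel: linear over context dims + Matern 5/2 over knobs.
import numpy as np

def linear_kernel(A, B):
    return A @ B.T

def matern52_kernel(A, B, lengthscale=0.5):
    r = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    s = np.sqrt(5.0) * r / lengthscale
    return (1.0 + s + s ** 2 / 3.0) * np.exp(-s)

def additive_kernel(X1, X2, n_context):
    ctx1, cfg1 = X1[:, :n_context], X1[:, n_context:]
    ctx2, cfg2 = X2[:, :n_context], X2[:, n_context:]
    return linear_kernel(ctx1, ctx2) + matern52_kernel(cfg1, cfg2)

X = np.random.rand(4, 5)                 # 2 context features + 3 knobs per row
K = additive_kernel(X, X, n_context=2)   # covariance between observations
print(K.shape)
```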
Regarding its safe configuration search subspace, ONLINETUNE first constructs a candidate configuration set by identifying a safe configuration subspace near the initial safe configuration (the default configuration) and adjusts it iteratively, alternating between a hypercube and a line region moving towards the global optimum. It then discretizes the adapted subspace to build the candidate set. After that, the safety of each candidate is assessed based on the confidence bounds of the contextual GP (black-box knowledge) and existing domain knowledge (white-box knowledge, e.g., MySQLTuner) to further remove unsafe or unpromising configuration candidates. Finally, to address the inherent cubic computational complexity of the GP model, it employs the DBSCAN clustering algorithm [46] to classify historical observations into clusters based on their context and trains a GP model for each cluster. An SVM is then applied to learn the boundary of each model, mapping each new observation to its respective GP model.
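The clustering step can be sketched as below: DBSCAN groups historical observations by their context features, and a classifier (an SVM here) routes a new observation to the GP of its cluster; the synthetic contexts and hyperparameters are illustrative.

```python
# Sketch: cluster contexts with DBSCAN, then learn an SVM that routes new
# observations to the GP model of their cluster.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

contexts, _ = make_blobs(n_samples=200, random_state=0, cluster_std=1.0,
                         centers=[[0, 0, 0, 0], [6, 6, 6, 6], [-6, -6, -6, -6]])
labels = DBSCAN(eps=2.0, min_samples=5).fit_predict(contexts)

mask = labels != -1                        # drop DBSCAN noise points
router = SVC(kernel="rbf").fit(contexts[mask], labels[mask])
# One GP model is trained per cluster; a new observation's context decides
# which per-cluster GP it updates and queries.
new_context = np.zeros((1, 4))
print("route to GP of cluster:", router.predict(new_context)[0])
```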
LlamaTune [16] introduces a framework aimed at leveraging domain knowledge to enhance the sample efficiency of existing BO-based and RL-based automatic DBMS configuration solutions. Initially, it employs a randomized projection technique, HeSBO, to project the configuration space of all documented knobs into a lower-dimensional subspace, and the tuning process is then initialized within this constructed subspace. In the context of BO-based solutions, LlamaTune adopts two BO frameworks: SMAC [47] and GP-based BO [48]. Initially, a space-filling sampling method (e.g., LHS) is employed to generate an initial set of points for training the surrogate model. The model then suggests the candidate point maximizing the EI acquisition function. Furthermore, LlamaTune implements various strategies for knob selection and knob discretization. It introduces a biased-sampling approach that, during the random initialization phase, assigns a higher probability to special knob values, such as those that (i) disable certain features, (ii) infer a knob's value from another knob, or (iii) set a knob's value using an internal heuristic or predefined value. Additionally, it bucketizes the knob value space at fixed intervals, limiting the number of unique values any configuration knob can take to a threshold in order to reduce the configuration space.
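The projection idea can be sketched roughly as follows: each of the D original knobs is tied to one of d low-dimensional coordinates with a random sign, so BO searches the small subspace while full configurations are reconstructed for execution; the dimensions below are illustrative, and this is a simplified rendering of HeSBO rather than LlamaTune's implementation.

```python
# Rough sketch of a HeSBO-style random projection for configuration tuning.
import numpy as np

rng = np.random.default_rng(0)
D, d = 40, 8                          # original knobs vs. low-dimensional subspace
h = rng.integers(0, d, size=D)        # low-dim coordinate assigned to each knob
s = rng.choice([-1.0, 1.0], size=D)   # random sign per knob

def project_up(y):
    """Map a point y in [-1, 1]^d back to the full [-1, 1]^D knob space."""
    return np.clip(s * y[h], -1.0, 1.0)

y = rng.uniform(-1.0, 1.0, size=d)    # candidate proposed by BO in the subspace
x_full = project_up(y)                # full configuration to denormalize and run
print(x_full.shape)
```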
CGPTuner [17] provides a BO-based online tuning solution for DBMSs that considers the diversity of the IT stack (the operating system and Java virtual machine) to which the system is exposed. It addresses this diversity with the Contextual Gaussian Process Bandit Optimization framework, a contextual extension of the Bayesian Optimization framework. Specifically, it utilizes GPs as surrogate models, with a multiplicative combination of two Matérn 5/2 kernels to model the configuration space and the workload space. For the acquisition function, it follows the GP-Hedge approach [39], which, instead of committing to a specific acquisition function, adopts a portfolio of acquisition functions governed by an online multi-armed bandit strategy, aiming to progressively select the best acquisition function according to its past performance at each tuning iteration. Its workload characterisation is conducted by an external workload characterisation module, such as that provided by OtterTune [8] or Peloton [49], or even a reaction-based characterisation used in compiler autotuning [50].
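The portfolio mechanism can be pictured with the toy sketch below: each acquisition function accumulates a reward, and one is chosen per iteration with probability proportional to the exponentiated cumulative rewards; the rewards are stubbed here, whereas GP-Hedge derives them from the GP posterior at each acquisition's nominated point.

```python
# Toy sketch of a GP-Hedge-style portfolio of acquisition functions.
import numpy as np

rng = np.random.default_rng(1)
portfolio = ["EI", "PI", "UCB"]
gains = np.zeros(len(portfolio))   # cumulative reward per acquisition function
eta = 1.0

for step in range(10):
    probs = np.exp(eta * gains) / np.exp(eta * gains).sum()
    chosen = rng.choice(len(portfolio), p=probs)
    # ... evaluate the configuration nominated by portfolio[chosen] ...
    rewards = rng.normal(size=len(portfolio))   # stub: posterior mean at each nominee
    gains += rewards
print(dict(zip(portfolio, np.round(gains, 2))))
```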
7.1.2 Summary
In Bayesian Optimization-based solutions, a critical aspect lies in the careful selection and design of tailored components for BO models, encompassing the acquisition function, surrogate model, and model initialization, among others. Table 3 provides a comparative analysis along these dimensions. Concerning the acquisition function, the prevalent choice among researchers is EI, while some have also explored UCB and GP-Hedge. In the realm of surrogate models, GP emerges as the prominent selection, while SMAC exhibits capabilities in supporting diverse parameter types (continuous, categorical, and conditional) and has demonstrated efficacy in managing the intricacies of high-dimensional and heterogeneous configuration spaces. In terms of sample design, practitioners and researchers can construct samples in different ways, including established techniques such as random sampling and LHS, as well as tailoring samples to specific workload patterns or user-provided sample sets. Moreover, practitioners and researchers can also look into refining kernel designs and employing noisy-data processing strategies within the ambit of model design.
Owing to the training complexity, another crucial consideration is accelerating the tuning process in terms of both data collection and model training. For each configuration execution, acceleration can be achieved through feature pruning, encompassing workload-level pruning and configuration-level pruning. In terms of model training, BO-based solutions adopt appropriate sampling techniques and transfer knowledge from past tuning experience to accelerate convergence. Regarding task workload modeling, the task workload can be modeled by training a GP model on historical tuning observations or by building separate GP models for other workloads and ensembling them for the target task. Extensive research has been conducted on determining the weights for ensembling GP models in Bayesian optimization [33], [51]–[56].
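As a brief sketch of such ensembling, the snippet below combines the (mean, variance) predictions of per-workload base GPs with normalized weights; the base models are stubs, and the weights would in practice come from workload similarity or ranking loss as described above.

```python
# Sketch: weighted ensemble of per-workload GP surrogates (mixture-style).
import numpy as np

def ensemble_predict(x, base_models, weights):
    """Weighted combination of (mean, variance) predictions from base GPs."""
    means = np.array([m(x)[0] for m in base_models])
    variances = np.array([m(x)[1] for m in base_models])
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    mean = np.sum(w * means)
    var = np.sum(w * (variances + means ** 2)) - mean ** 2   # mixture variance
    return mean, var

# Stub base GPs mapping a configuration to a (mean, variance) prediction.
base_models = [lambda x: (0.6, 0.02), lambda x: (0.8, 0.05)]
print(ensemble_predict(np.zeros(3), base_models, weights=[0.7, 0.3]))
```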
This paper is available on arxiv under CC BY 4.0 DEED.