How Configuration-level Pruning Reduces Optimization Time in DBMS Tuning


Authors:

(1) Limeng Zhang, Centre for Research on Engineering Software Technologies (CREST), The University of Adelaide, Australia;

(2) M. Ali Babar, Centre for Research on Engineering Software Technologies (CREST), The University of Adelaide, Australia.

Table of Links

Abstract and 1 Introduction

1.1 Configuration Parameter Tuning Challenges and 1.2 Contributions

2 Tuning Objectives

3 Overview of Tuning Framework

4 Workload Characterization and 4.1 Query-level Characterization

4.2 Runtime-based Characterization

5 Feature Pruning and 5.1 Workload-level Pruning

5.2 Configuration-level Pruning

5.3 Summary

6 Knowledge from Experience

7 Configuration Recommendation and 7.1 Bayesian Optimization

7.2 Neural Network

7.3 Reinforcement Learning

7.4 Search-based Solutions

8 Experimental Setting

9 Related Work

10 Discussion and Conclusion, and References

5.2 Configuration-level Pruning

Another approach to reducing the optimization time required by machine learning-based methodologies is to shrink the configuration space that needs tuning. Configuration-level pruning aims to identify and prioritize, based on historical observations, the parameters that exert substantial influence on DBMS performance.


• Spearman Rank Correlation Coefficient (SRCC) [26]: a nonparametric measure of rank correlation that captures the strength of a monotonic relationship between paired data. It is defined as the Pearson correlation coefficient between the rank variables. By gauging monotonic relationships, SRCC helps identify and remove redundant parameters, contributing to a more efficient and simplified configuration. LOCAT [7] calculates SRCC between each configuration parameter and the workload execution time to filter out redundant configuration parameters and narrow down the parameter search space.
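
As a rough illustration of this filtering idea, the sketch below computes SRCC between each knob and a measured execution time and keeps only the knobs whose correlation exceeds a cutoff; the synthetic data, the 0.2 threshold, and the column layout are illustrative assumptions rather than LOCAT's actual settings.

```python
# Sketch: SRCC-based knob filtering (illustrative data and threshold, not LOCAT's settings).
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n_samples, n_knobs = 200, 8
configs = rng.uniform(size=(n_samples, n_knobs))      # sampled configurations
exec_time = 5.0 * configs[:, 0] - 3.0 * configs[:, 2] ** 2 \
            + 0.1 * rng.normal(size=n_samples)        # synthetic workload execution times

kept = []
for k in range(n_knobs):
    rho, _ = spearmanr(configs[:, k], exec_time)      # rank correlation with execution time
    if abs(rho) >= 0.2:                               # illustrative cutoff for "influential"
        kept.append(k)

print("knobs kept for further tuning:", kept)
```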


• Principal Component Analysis (PCA): a dimensionality reduction method that transforms correlated variables into uncorrelated ones, known as principal components. It does so by computing the covariance matrix, performing eigendecomposition, and selecting a subset of principal components according to the variance they explain. After the SRCC step, LOCAT [7] applies Gaussian-kernel Principal Component Analysis (KPCA) to the remaining configuration parameters to derive new, lower-dimensional parameters for the subsequent search for optimal settings.
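
A minimal sketch of such a KPCA step, loosely following the pipeline described above, is shown below; the component count, kernel width, and input data are assumptions for illustration only, not LOCAT's actual hyperparameters.

```python
# Sketch: Gaussian-kernel PCA over the knobs retained after SRCC filtering
# (illustrative dimensions and hyperparameters).
import numpy as np
from sklearn.decomposition import KernelPCA

rng = np.random.default_rng(1)
retained = rng.uniform(size=(200, 5))                 # configurations over the retained knobs

kpca = KernelPCA(n_components=3, kernel="rbf", gamma=1.0, fit_inverse_transform=True)
low_dim = kpca.fit_transform(retained)                # new, lower-dimensional "parameters"

# A downstream optimizer would search over points like `low_dim` and map candidates
# back to (approximate) knob settings via the learned inverse transform.
approx_knobs = kpca.inverse_transform(low_dim[:1])
print(low_dim.shape, approx_knobs.shape)
```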


• Least Absolute Shrinkage and Selection Operator (LASSO) and the Lasso path algorithm: in the context of linear regression, LASSO modifies the ordinary least-squares objective by adding an L1 penalty term equal to λ times the sum of the absolute values of the regression coefficients, where λ controls the strength of the penalty; this reduces the effect of irrelevant variables by penalizing models with large weights. A higher λ pushes more coefficients toward zero, yielding interpretable, stable, and computationally efficient feature selection. OtterTune [8] uses the Lasso path algorithm to assess the importance of configuration parameters by gradually decreasing the penalty term. The order in which parameters enter the model during this process is taken as their importance order, and OtterTune uses this ranking to prune less important parameters.
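
The ranking idea can be sketched with scikit-learn's lasso_path: as the penalty decreases, knobs whose coefficients become nonzero earlier are treated as more important. The synthetic data and the activation threshold below are illustrative, not OtterTune's implementation.

```python
# Sketch: ranking knobs by when they enter the Lasso path (illustrative data only).
import numpy as np
from sklearn.linear_model import lasso_path
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 6))                                    # knob settings
y = 4.0 * X[:, 1] - 2.0 * X[:, 4] + 0.5 * rng.normal(size=300)   # synthetic performance metric
X = StandardScaler().fit_transform(X)

alphas, coefs, _ = lasso_path(X, y)        # alphas are decreasing; coefs: (n_knobs, n_alphas)

entry_step = []
for j in range(X.shape[1]):
    nonzero = np.nonzero(np.abs(coefs[j]) > 1e-8)[0]
    entry_step.append(nonzero[0] if nonzero.size else len(alphas))  # never active -> last

ranking = np.argsort(entry_step)           # earlier entry (larger alpha) -> more important
print("knob importance order (most to least):", ranking.tolist())
```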


• Random Forest (RF): a supervised machine learning method for classification and regression problems. It can be viewed as a voting algorithm built from multiple decision trees and can be used to estimate feature importance. In the study [27], the authors use Latin Hypercube Sampling (LHS) to select a subset of configurations and then train Random Forests with 300 Classification and Regression Trees (CART) to determine the importance of each parameter. Their findings reveal that, for the YCSB workload-A on Cassandra, optimizing just five parameters can yield performance comparable to 99% of the best configurations attained through extensive parameter tuning. HUNTER [13] adopts an RF classifier built on CARTs and uses the majority-voting principle to rank knob importance; it then selects the top 20 knobs to reduce the configuration search space in its tuning framework.
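
A condensed version of that LHS-plus-Random-Forest workflow is sketched below; the stand-in benchmark function, the use of a regressor, and the top-five cutoff are illustrative assumptions rather than the exact setups of [27] or HUNTER.

```python
# Sketch: LHS sampling + Random-Forest importance ranking (illustrative benchmark stub).
import numpy as np
from scipy.stats import qmc
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(3)
n_knobs = 10
X = qmc.LatinHypercube(d=n_knobs, seed=3).random(n=256)     # LHS-sampled configurations

def run_benchmark(cfg):
    # Stand-in for running, e.g., a YCSB workload against the DBMS under `cfg`.
    return 100.0 * cfg[2] - 40.0 * cfg[5] ** 2 + rng.normal(scale=1.0)

y = np.array([run_benchmark(c) for c in X])                 # measured throughput

rf = RandomForestRegressor(n_estimators=300, random_state=0).fit(X, y)
top_knobs = np.argsort(rf.feature_importances_)[::-1][:5]   # keep the 5 most important knobs
print("knobs selected for tuning:", top_knobs.tolist())
```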


• Hashing-enhanced Subspace BO (HeSBO) [28]: a randomized low-dimensional projection approach that reduces the search space for Bayesian Optimization-based solutions (discussed in Section 7.1). Using two uniform hash functions to construct the row indices and signs of a random projection matrix, it performs a count-sketch projection that maps the original high-dimensional configuration space into a lower-dimensional subspace. This construction is designed to preserve characteristics of the original high-dimensional space, such as pairwise point distances, along the reduced (important) dimensions. LlamaTune [16] adopts it to project the configuration space of all documented knobs onto a lower-dimensional subspace, and then uses a Bayesian Optimization-based optimizer to tune this smaller subspace.
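
The count-sketch projection at the core of this approach can be sketched in a few lines: every original knob is hashed to one coordinate of the low-dimensional subspace and given a random sign, so each point the optimizer proposes in the subspace expands deterministically to a full configuration. The dimensions below are arbitrary examples, not LlamaTune's settings.

```python
# Sketch: HeSBO-style count-sketch projection (illustrative dimensions).
import numpy as np

rng = np.random.default_rng(4)
D, d = 100, 8                                     # original knob count vs. reduced subspace
bucket = rng.integers(0, d, size=D)               # hash 1: knob index -> low-dim coordinate
sign = rng.choice([-1.0, 1.0], size=D)            # hash 2: knob index -> random sign

def expand(y_low):
    """Map a point from the d-dim subspace back to the D-dim configuration space."""
    return sign * y_low[bucket]                   # x_i = s(i) * y_{h(i)}

y_candidate = rng.uniform(-1.0, 1.0, size=d)      # point proposed by the BO optimizer
x_full = expand(y_candidate)                      # corresponding full-dimensional configuration
print(x_full.shape, float(x_full.min()), float(x_full.max()))
```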


• Alternative approaches: Given the challenges of tuning in high-dimensional configuration spaces, the identification of parameters that significantly affect DBMS performance has gained considerable attention. Among these, researchers in [29] propose Carver to select a subset of storage parameters. Given a set of initial parameters and configurations, Carver first employs LHS to select a small number of configurations for evaluation. It then greedily selects the most important parameters based on their parameter importance values: the first parameter is the one with the highest importance value, after which Carver fixes its value and recomputes the importance values of the remaining parameters. During this process, importance is calculated as the difference between the variance over the original set of configurations and the sum of variances within the subsets formed when a parameter takes its different values, and the parameter with the maximum difference is selected. More recently, OpAdviser [30] proposes constructing an effective configuration region that contains promising configurations. It leverages performance similarity among historical workloads to guide the construction: it extracts a relatively compact region from similar workloads and a larger region from less similar ones, and then builds the target search space by employing a majority weighted voting strategy to aggregate suggestions from candidate tasks.
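
One possible reading of Carver's variance-based criterion is sketched below: a parameter's score is the total performance variance minus the weighted within-group variance after splitting the evaluated configurations on that parameter's values, and selection proceeds greedily. The data, the two-parameter budget, the weighting, and the way the chosen parameter's value is fixed are illustrative assumptions, not the paper's exact formulation.

```python
# Sketch: Carver-style greedy, variance-based parameter selection (illustrative data;
# one possible interpretation of the criterion described above).
import numpy as np

rng = np.random.default_rng(5)
configs = rng.integers(0, 3, size=(60, 4))                      # 4 discrete parameters
perf = 10.0 * configs[:, 1] + rng.normal(scale=0.5, size=60)    # synthetic performance

def importance(p, configs, perf):
    within = sum(perf[configs[:, p] == v].var() * np.mean(configs[:, p] == v)
                 for v in np.unique(configs[:, p]))             # weighted within-group variance
    return perf.var() - within                                  # variance explained by the split

selected, remaining = [], list(range(configs.shape[1]))
for _ in range(2):                                              # greedily pick two parameters
    best = max(remaining, key=lambda p: importance(p, configs, perf))
    selected.append(best)
    remaining.remove(best)
    mode = np.bincount(configs[:, best]).argmax()               # fix the chosen parameter's value
    mask = configs[:, best] == mode
    configs, perf = configs[mask], perf[mask]

print("parameters selected:", selected)
```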


This paper is available on arXiv under CC BY 4.0 DEED.