
Prune Your Way to Better Database Management System Performance


Authors:

(1) Limeng Zhang, Centre for Research on Engineering Software Technologies (CREST), The University of Adelaide, Australia;

(2) M. Ali Babar, Centre for Research on Engineering Software Technologies (CREST), The University of Adelaide, Australia.

Table of Links

Abstract and 1 Introduction

1.1 Configuration Parameter Tuning Challenges and 1.2 Contributions

2 Tuning Objectives

3 Overview of Tuning Framework

4 Workload Characterization and 4.1 Query-level Characterization

4.2 Runtime-based Characterization

5 Feature Pruning and 5.1 Workload-level Pruning

5.2 Configuration-level Pruning

5.3 Summary

6 Knowledge from Experience

7 Configuration Recommendation and 7.1 Bayesian Optimization

7.2 Neural Network

7.3 Reinforcement Learning

7.4 Search-based Solutions

8 Experimental Setting

9 Related Work

10 Discussion and Conclusion, and References

5 FEATURE PRUNING

Given the diversity of workloads and the high-dimensional configuration parameters of DBMSs, both data collection and configuration-space search can be time-intensive. Strategies that reduce workload execution time and streamline the search space are therefore essential. In this section, we present pruning strategies at two levels: workload-level pruning, which reduces workload execution time, and configuration-level pruning, which improves search-space efficiency.


Table 2 provides a technical comparison of the feature-pruning strategies (Section 5) and knowledge-extraction strategies (Section 6) used by existing tuning methods.

5.1 Workload-level Pruning

Workload-level pruning strategies primarily contribute to two aspects: data collection and model training. Query-level filtering removes unimportant queries from a workload, thereby reducing the execution time of each workload and accelerating both training-data collection and model validation. Workload feature selection reduces the dimensionality of each workload's feature representation, thereby decreasing the training time of auto-tuning models.


5.1.1 Query-level Filtering


Since each application consists of a number of queries, the execution time of a workload can be decreased by removing unimportant queries. Query filtering aims to remove the unimportant queries that are not sensitive to the configuration parameters.


• Coefficient of Variation (CV): a statistical measure of the relative variability of a set of values, calculated as the ratio of the standard deviation to the mean and expressed as a percentage. LOCAT [7] utilizes CV to identify and eliminate configuration-insensitive queries, i.e., queries whose performance (query execution time) is not affected by configuration changes within an application. It first constructs a matrix of the execution time of each query under different Spark SQL configurations (30 configuration instances), then calculates the CV for each query and filters out queries with small CV values, categorizing them as 'configuration-insensitive'.
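As a concrete illustration, the Python sketch below computes per-query CVs from an execution-time matrix and keeps only the configuration-sensitive queries. This is a minimal sketch, not LOCAT's implementation: the data, the matrix size, and the threshold are all hypothetical.

```python
import numpy as np

# Hypothetical data: one row per query, one column per configuration
# instance; entries are measured execution times in seconds. LOCAT uses
# 30 Spark SQL configurations; four are shown here to keep the sketch small.
exec_times = np.array([
    [12.1, 11.9, 12.0, 12.2],   # query 0: barely reacts to configuration changes
    [30.5,  8.2, 19.7, 45.0],   # query 1: strongly configuration-sensitive
    [ 5.0,  5.1,  4.9,  5.0],   # query 2: insensitive
])

# CV per query: standard deviation over mean, expressed as a percentage.
cv = exec_times.std(axis=1) / exec_times.mean(axis=1) * 100.0

# Keep only queries whose CV exceeds a threshold; the threshold value is
# illustrative, not the one used by LOCAT.
CV_THRESHOLD = 10.0
sensitive = np.where(cv > CV_THRESHOLD)[0]
print(cv.round(1))   # low CV for queries 0 and 2, high for query 1
print(sensitive)     # -> [1]
```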


5.1.2 Workload Feature Selection


Workload feature selection aims to reduce the dimensionality of workload features. Existing strategies applied in automatic parameter tuning for DBMSs include:


• k-means: a popular unsupervised machine learning algorithm for partitioning a dataset into clusters based on similarity. The primary goal of k-means is to group data points in such a way that points within the same cluster are more similar to each other than to points in other clusters. In Ottertune [8], each workload is represented by runtime metrics. When conducting workload feature selection, factor analysis (FA) is first applied to transform the (potentially) high-dimensional DBMS metric data into lower-dimensional features. k-means is then applied to cluster this lower-dimensional data into meaningful clusters (to reduce noisy data) and to select a single metric for each cluster, namely the one closest to the cluster center. During this process, the number of clusters is determined through a heuristic model proposed in [25]. The sketch below illustrates the full pipeline.
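The following scikit-learn sketch illustrates this two-step pipeline, covering both this item and the factor-analysis item below. The matrix sizes, factor count, and cluster count are assumptions for illustration; Ottertune determines the cluster count with its own heuristic [25].

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import FactorAnalysis

# Hypothetical metric matrix: one row per DBMS runtime metric, one column
# per observed workload execution. Real systems expose hundreds of metrics.
rng = np.random.default_rng(0)
metrics = rng.normal(size=(120, 40))            # 120 metrics x 40 observations

# Step 1 (factor analysis): embed each metric into a few latent factors.
fa = FactorAnalysis(n_components=5, random_state=0)
embedding = fa.fit_transform(metrics)           # shape (120, 5)

# Step 2 (k-means): cluster the embedded metrics, then keep a single
# representative per cluster -- the metric closest to its centroid.
k = 8                                           # illustrative cluster count
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(embedding)

representatives = []
for c in range(k):
    members = np.where(km.labels_ == c)[0]
    dists = np.linalg.norm(embedding[members] - km.cluster_centers_[c], axis=1)
    representatives.append(int(members[np.argmin(dists)]))
print(sorted(representatives))                  # indices of the retained metrics
```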


TABLE 2: Strategies of Feature Pruning and Knowledge From Experience


• Factor analysis (FA): a statistical technique for identifying underlying latent factors that explain the observed relationships among a set of variables. The key assumption in factor analysis is that observed variables are linear combinations of unobservable latent factors and random error terms. Ottertune [8] utilizes it to transform the (potentially) high-dimensional DBMS metric embedding vector into lower-dimensional features, as the first step of the pipeline sketched above.


• Principal Component Analysis (PCA): a popular linear dimensionality reduction method that transforms correlated variables into uncorrelated ones, known as principal components. It does so by calculating the covariance matrix, performing eigendecomposition, and selecting a subset of principal components based on their variance. HUNTER [13] applies PCA to reduce the dimensionality of the runtime metrics it collects for a workload, including the number of flush statements executed, the number of threads activated, etc.
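A minimal PCA sketch follows. The data are synthetic with a planted low-rank structure so that the reduction is visible, and the 95% variance threshold is an assumption rather than a setting reported for HUNTER.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic runtime-metric matrix: one row per observed workload, one
# column per metric (counters such as statements executed or threads
# activated). A low-rank structure is planted so PCA has signal to find.
rng = np.random.default_rng(1)
latent = rng.normal(size=(200, 10))             # 10 hidden drivers
mixing = rng.normal(size=(10, 60))              # spread across 60 metrics
metrics = latent @ mixing + 0.1 * rng.normal(size=(200, 60))

# Keep enough principal components to explain 95% of the variance.
pca = PCA(n_components=0.95)
reduced = pca.fit_transform(metrics)
print(metrics.shape, "->", reduced.shape)       # e.g., (200, 60) -> (200, 10)
```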


• Autoencoder: a neural network architecture designed to reconstruct its input as precisely as possible. It comprises two components, an encoder and a decoder: the input passes through the encoder to produce a latent code, and the decoder generates the output from that code. The latent code therefore plays a pivotal role in identifying and disentangling the sources of variation inherent in the input data. By reducing the dimension of the latent code, the autoencoder performs effective dimensionality reduction, yielding a more compact and meaningful representation of the input data. MMOT [22] utilizes the autoencoder architecture to extract inherent patterns among workloads and then uses these patterns to identify the model from similar workloads to generate configurations.
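A minimal PyTorch sketch of the idea follows; the layer sizes, training loop, and synthetic data are assumptions, not MMOT's actual architecture.

```python
import torch
from torch import nn

# A small autoencoder that compresses 64-dimensional workload feature
# vectors into an 8-dimensional latent code and reconstructs them.
class Autoencoder(nn.Module):
    def __init__(self, input_dim: int = 64, latent_dim: int = 8):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 32), nn.ReLU(),
            nn.Linear(32, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 32), nn.ReLU(),
            nn.Linear(32, input_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encoder(x))

model = Autoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

workloads = torch.randn(256, 64)        # synthetic workload feature vectors
for _ in range(200):                    # train to reconstruct the input
    optimizer.zero_grad()
    loss = loss_fn(model(workloads), workloads)
    loss.backward()
    optimizer.step()

# The latent code is the pruned workload representation; similar workloads
# can then be matched by distance in this latent space.
with torch.no_grad():
    latent = model.encoder(workloads)   # shape (256, 8)
```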


This paper is available on arXiv under CC BY 4.0 DEED.