paint-brush
Why Re-invent the Wheel? Use Past Workloads for Smarter DBMS Tuningby@configuring
110 reads

Why Re-invent the Wheel? Use Past Workloads for Smarter DBMS Tuning

by ConfiguringNovember 27th, 2024
Read on Terminal Reader
Read this story w/o Javascript
tldt arrow

Too Long; Didn't Read

Knowledge from past workloads helps optimize DBMS tuning by using similarity measures like Euclidean distance, kernel functions, and cosine similarity to guide the tuning process, reducing the need for extensive data collection and speeding up model training.
featured image - Why Re-invent the Wheel? Use Past Workloads for Smarter DBMS Tuning
Configuring HackerNoon profile picture

Authors:

(1) Limeng Zhang, Centre for Research on Engineering Software Technologies (CREST), The University of Adelaide, Australia;

(2) M. Ali Babar, Centre for Research on Engineering Software Technologies (CREST), The University of Adelaide, Australia.

Abstract and 1 Introduction

1.1 Configuration Parameter Tuning Challenges and 1.2 Contributions

2 Tuning Objectives

3 Overview of Tuning Framework

4 Workload Characterization and 4.1 Query-level Characterization

4.2 Runtime-based Characterization

5 Feature Pruning and 5.1 Workload-level Pruning

5.2 Configuration-level Pruning

5.3 Summary

6 Knowledge from Experience

7 Configuration Recommendation and 7.1 Bayesian Optimization

7.2 Neural Network

7.3 Reinforcement Learning

7.4 Search-based Solutions

8 Experimental Setting

9 Related Work

10 Discussion and Conclusion, and References

6 KNOWLEDGE FROM EXPERIENCE

As the process of finding the optimal configuration can be complex and time-consuming, a natural and common approach is to draw insights from previous workloads with similar characteristics and leverage that experience to guide the tuning algorithm in configuring the target workload. Existing aspects can be considered for learning from historical tuning experience, including:


• Euclidean distance-based workload similarity: Euclidean distance calculates the straight-line distance between two points in Euclidean space. It is commonly used in machine learning tasks, such as clustering, classification, and regression, to measure the similarity or dissimilarity. OtterTune [8] applies it to match the target DBMS’s workload with the most similar workload in its repository based on the performance measurements for the selected group of metrics and then reuses the data from the selected 7 workload to train a GP model on the target metric for configuration recommendation.


• Epanechnikov quadratic kernel-based workload similarity: The Epanechnikov quadratic kernel function, denoted as K(u), is defined as K(u) = 3 4 (1 − u 2 ) where the normalized distance u is typically computed as the difference between the coordinates of two points divided by the bandwidth parameter. ResTune [6] employs it to identify similar historical workloads in its initial tuning stage when there are only limited tuning observation data available for tuning.


• Cosine similarity: provides a straightforward and efficient method to measure the similarity between two vectors in a multidimensional space by calculating the cosine of the angle between these vectors. MMOT [22] adopts it to determine the model for the target workload by selecting from trained DDPGbased RL models based on workload similarity. • Performance surface similarity: works based on the consideration that similar tasks may exhibit comparable trends in terms of where the objective is minimized within the configuration space. It is calculated according to the misranked performance of pairs of given configurations. ResTune [6] and OpAdviser [30] identify workload similarity according to the ranking loss of two workloads on a given configuration dataset, utilizing the observed configurations and their corresponding performance as a benchmark. For every pair of configurations, if the ranking of the predicted performance pair does not match the observed pair, the rank loss (count of misranked pairs) is incremented by 1. Finally, the ranking loss for this selected model is defined as the total number of misranked pairings. The smaller the ranking loss, the more similar the selected model is to the target model.


In the realm of DBMS automatic tuning, knowledge transfer mainly serves two purposes: reducing the need for extensive training data collection and minimizing model training duration. By leveraging this approach, tuning algorithms can implicitly and explicitly utilize extracted knowledge. During the tuning phase, the tuning model can leverage historical data collected from similar workloads (implicitly) or directly apply extracted knowledge, such as ensembling trained models or selecting pre-trained models (explicitly).


This paper is available on arxiv under CC BY 4.0 DEED.