Authors:
(1) Limeng Zhang, Centre for Research on Engineering Software Technologies (CREST), The University of Adelaide, Australia;
(2) M. Ali Babar, Centre for Research on Engineering Software Technologies (CREST), The University of Adelaide, Australia.
1.1 Configuration Parameter Tuning Challenges and 1.2 Contributions
3 Overview of Tuning Framework
4 Workload Characterization and 4.1 Query-level Characterization
4.2 Runtime-based Characterization
5 Feature Pruning and 5.1 Workload-level Pruning
5.2 Configuration-level Pruning
7 Configuration Recommendation and 7.1 Bayesian Optimization
10 Discussion and Conclusion, and References
As finding an optimal configuration can be complex and time-consuming, a natural and common approach is to draw insights from previously tuned workloads with similar characteristics and use that experience to guide the tuning of the target workload. Several measures can be used to identify similar historical workloads, including:
• Euclidean distance-based workload similarity: Euclidean distance measures the straight-line distance between two points in Euclidean space. It is commonly used in machine learning tasks such as clustering, classification, and regression to measure similarity or dissimilarity between data points (see the code sketch after this list). OtterTune [8] applies it to match the target DBMS's workload with the most similar workload in its repository, based on the performance measurements for a selected group of metrics, and then reuses the data from the selected workload to train a GP model on the target metric for configuration recommendation.
• Epanechnikov quadratic kernel-based workload similarity: The Epanechnikov quadratic kernel, denoted K(u), is defined as K(u) = (3/4)(1 − u^2) for |u| ≤ 1 and 0 otherwise, where the normalized distance u is typically computed as the distance between two points divided by a bandwidth parameter. ResTune [6] employs it to identify similar historical workloads in its initial tuning stage, when only limited observation data is available.
• Cosine similarity: provides a straightforward and efficient way to measure the similarity between two vectors in a multidimensional space by computing the cosine of the angle between them. MMOT [22] adopts it to select a model for the target workload from trained DDPG-based RL models based on workload similarity.
• Performance surface similarity: is based on the observation that similar tasks tend to exhibit comparable trends in where the objective is minimized within the configuration space. It is computed from the number of mis-ranked pairs of given configurations. ResTune [6] and OpAdviser [30] measure workload similarity via the ranking loss of two workloads on a given configuration dataset, using the observed configurations and their corresponding performance as a benchmark: for every pair of configurations, if the ordering of the predicted performance does not match the ordering of the observed performance, the ranking loss (the count of mis-ranked pairs) is incremented by 1. The ranking loss for a selected model is thus the total number of mis-ranked pairs; the smaller the ranking loss, the more similar the selected model is to the target model (see the ranking-loss sketch after this list).
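To make the first three measures concrete, the following Python sketch implements each of them with NumPy. The metric vectors, repository contents, and bandwidth value are illustrative assumptions, not data from the cited systems.

```python
import numpy as np

def euclidean_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Straight-line distance between two metric vectors (OtterTune-style matching)."""
    return float(np.linalg.norm(a - b))

def epanechnikov_kernel(a: np.ndarray, b: np.ndarray, bandwidth: float = 1.0) -> float:
    """K(u) = (3/4)(1 - u^2) for |u| <= 1, else 0, with u = ||a - b|| / bandwidth."""
    u = np.linalg.norm(a - b) / bandwidth
    return 0.75 * (1.0 - u ** 2) if abs(u) <= 1.0 else 0.0

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors; 1.0 means identical direction."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy repository of historical workloads, each summarized by a metric vector.
repository = {
    "workload_a": np.array([0.80, 0.10, 0.35]),
    "workload_b": np.array([0.20, 0.90, 0.50]),
}
target = np.array([0.75, 0.15, 0.30])

# Match the target against the repository with each measure.
closest = min(repository, key=lambda name: euclidean_distance(repository[name], target))
print(f"Most similar workload by Euclidean distance: {closest}")
for name, vec in repository.items():
    print(name,
          epanechnikov_kernel(vec, target, bandwidth=2.0),
          cosine_similarity(vec, target))
```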
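The ranking-loss computation behind performance surface similarity can likewise be sketched in a few lines. The predictor callables and the observation data below are hypothetical; only the pair-wise mis-ranking count follows the description above.

```python
from itertools import combinations
from typing import Callable, Sequence

def ranking_loss(predict: Callable[[Sequence[float]], float],
                 configs: Sequence[Sequence[float]],
                 observed_perf: Sequence[float]) -> int:
    """Count configuration pairs whose predicted ordering disagrees with the
    observed ordering; a smaller count means a more similar performance surface."""
    preds = [predict(c) for c in configs]
    loss = 0
    for i, j in combinations(range(len(configs)), 2):
        # A pair is mis-ranked when prediction and observation disagree on order.
        if (preds[i] - preds[j]) * (observed_perf[i] - observed_perf[j]) < 0:
            loss += 1
    return loss

# Toy usage: pick the historical model with the lowest ranking loss.
configs = [[1.0, 64], [2.0, 128], [4.0, 256]]
observed = [120.0, 95.0, 80.0]              # e.g., observed latency in ms
models = {
    "hist_model_1": lambda c: 100.0 / c[0],  # roughly matches the observed trend
    "hist_model_2": lambda c: c[0] * 10.0,   # opposite trend
}
best = min(models, key=lambda m: ranking_loss(models[m], configs, observed))
print(f"Most similar historical model: {best}")
```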
In the realm of automatic DBMS tuning, knowledge transfer mainly serves two purposes: reducing the need for extensive training data collection and shortening model training time. Tuning algorithms can use the extracted knowledge either implicitly or explicitly: during the tuning phase, the tuning model can leverage historical data collected from similar workloads (implicit transfer) or directly apply extracted knowledge, such as ensembling trained models or selecting a pre-trained model (explicit transfer). Both modes are sketched below.
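A minimal sketch of the two transfer modes follows, under assumed interfaces: the commented `fit_gp` helper and the toy base models are illustrative names, not APIs from the surveyed systems.

```python
import numpy as np

# -- Implicit transfer: seed the surrogate's training set with observations
#    from the most similar historical workload, then refine it with fresh
#    observations from the target workload.
historical_X = np.array([[1.0, 64.0], [2.0, 128.0]])  # past configurations
historical_y = np.array([120.0, 95.0])                # past performance
target_X = np.array([[4.0, 256.0]])                   # newly observed on target
target_y = np.array([80.0])
train_X = np.vstack([historical_X, target_X])
train_y = np.concatenate([historical_y, target_y])
# surrogate = fit_gp(train_X, train_y)  # hypothetical GP-fitting helper

# -- Explicit transfer: combine pre-trained per-workload models directly,
#    e.g., a similarity-weighted ensemble prediction.
def ensemble_predict(base_models, weights, config):
    preds = np.array([m(config) for m in base_models])
    return float(np.dot(weights, preds) / np.sum(weights))

base_models = [lambda c: 100.0 / c[0], lambda c: 90.0 / c[0]]  # toy pre-trained models
weights = np.array([0.7, 0.3])  # e.g., derived from workload similarity scores
print(ensemble_predict(base_models, weights, [2.0, 128.0]))
```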
This paper is available on arXiv under a CC BY 4.0 DEED license.