Table of Links

Abstract and 1. Introduction
Motivation and design goals
Related Work
Conformal prediction
  4.1. Mondrian conformal prediction (MCP)
  4.2. Evaluation metrics
Mondrian conformal prediction for Disk Scrubbing: our approach
  5.1. System and Storage statistics
  5.2. Which disk to scrub: Drive health predictor
  5.3. When to scrub: Workload predictor
Experimental setting and 6.1. Open-source Baidu dataset
  6.2. Experimental results
Discussion
  7.1. Optimal scheduling aspect
  7.2. Performance metrics and 7.3. Power saving from selective scrubbing
Conclusion and References

2. Motivation and design goals

In data centers, a significant number of unhealthy drives go undetected due to latent failure attributes, resulting in fail-stop scenarios. One common approach to mitigate such scenarios is disk scrubbing, which verifies disk data through a background scanning process to identify bad sectors. However, this process consumes energy and can degrade performance, depending on the trigger schedule. This raises concerns in the industry, especially as disk capacities increase. We notice a missing link in addressing 'which disk to scrub' and 'when to scrub', based on the frequency of the scrub cycle, while minimizing the impact on storage array performance and maximizing reliability. In this paper, we consider the following objectives and design approaches to tackle this challenge:

• Which disk to scrub? Depending on the specific scrubbing process, scrubbing can temporarily degrade the performance of the drive. To keep the drive fast and responsive, it is crucial to minimize the frequency of scrubbing. Instead of scrubbing all disks in the storage array, our approach selectively scrubs only the disks that require it, thereby reducing the overall time required to complete the process.

• When to scrub? We can optimize the disk drive scrubbing schedule by considering factors such as the workload of the system, the importance of the data on the drive, and the availability of resources. This ensures that scrubbing is performed at the most appropriate times, minimizing the impact on overall system performance.

3. Related Work

Storage device reliability has long been a critical concern in the industry, and existing solutions often rely on failure analysis of storage systems. However, traditional methods such as accelerated life tests (Cho et al., 2015) have not proven to be reliable indicators of actual failure rates in production environments. Recent machine learning-based approaches, such as multivariate time-series (Yu, 2019) and time-series classification (Ircio et al., 2022), have focused on improving model accuracy but often lack deep integration of domain knowledge. Moreover, the multi-modal approach of Lu et al. (2020), which uses performance metrics (disk-level and server-level) and disk spatial location, focuses only on fail-stop scenarios and may not help in detecting latent failures. A more recent study (Lu et al., 2023) has addressed this issue by investigating grey failures (fail-slow drives), using a regression model to pinpoint and analyze fail-slow failures at the granularity of individual drives.

Another important factor in disk scrubbing is the implementation cost and power consumption. Mi et al. (2008) and Jiang et al. (2019) address the performance degradation caused by scrubbing and propose assigning a lower priority to the background process during idle time, i.e., when the disk drive is not actively processing data or performing any other task. Liu et al. (2010) and Oprea and Juels (2010) propose methods to mitigate power consumption and determine when to scrub in systems with inexpensive data, but these require designing a separate method to identify less critical data. Drive space management when replacing a failed disk is discussed in (Pâris et al., 2010), along with reducing the need for frequent scrubbing. A multilevel scrubbing scheme is proposed in (Zhang et al., 2020), using a Long Short-Term Memory (LSTM) model to detect latent sector errors in a binary classification setup.
However, using machine learning-based models in this way may treat healthy and relatively less healthy disks the same, leading to unnecessary scrubbing of healthy disks. To the best of our knowledge, our work is the first to adopt Mondrian conformal prediction to assign a health score to each individual disk drive and to use the resulting metrics to design a scrubbing cycle aligned with system idle time.

This paper is available on arXiv under CC BY-NC-ND 4.0 Deed (Attribution-Noncommercial-Noderivs 4.0 International) license.

Authors:
(1) Rahul Vishwakarma, California State University Long Beach, 1250 Bellflower Blvd, Long Beach, CA 90840, United States (rahuldeo.vishwakarma01@student.csullb.edu);
(2) Jinha Hwang, California State University Long Beach, 1250 Bellflower Blvd, Long Beach, CA 90840, United States (jinha.hwang01@student.csulb.edu);
(3) Soundouss Messoudi, HEUDIASYC - UMR CNRS 7253, Université de Technologie de Compiegne, 57 avenue de Landshut, 60203 Compiegne Cedex - France (soundouss.messoudi@hds.utc.fr);
(4) Ava Hedayatipour, California State University Long Beach, 1250 Bellflower Blvd, Long Beach, CA 90840, United States (ava.hedayatipour@csulb.edu).