How Predictive Algorithms Are Making Data Center Disk Scrubbing Smarter

5.2. Which disk to scrub: Drive health predictor

In a typical data center, every disk drive is classified as either healthy or unhealthy. Unhealthy disks are presumed to be dying or about to fail, so they are not marked for scrubbing, while healthy disks are.

In our approach, we propose assigning a relative "degree of health" score to each disk. Drives marked as "No-concern" are either dying/imminently failing or completely healthy, while drives marked as "Concern" have intermediate degrees of health. The conformal prediction framework separates the "No-concern" drives from the "Concern" drives and selects only the "Concern" drives for further ranking. These are the drives that concern us, and they are used as input to the scrubbing scheduler.

Our focus, as shown in Figure 2, is on identifying disks in the system that are currently of concern or may become concerning soon, and selecting only those disks for scrubbing. This approach reduces the number of disks to be scrubbed, since completely healthy drives are skipped along with failing ones, making the process more efficient and targeted. By doing so, we reduce the time, power, and energy spent on scrubbing, and with them the carbon footprint of the data center.

In a typical data center environment, disk failures are rare over any given period, which results in a highly imbalanced dataset: a small number of failed disks and a large majority of healthy ones. To handle this imbalance, we adopt a Mondrian Conformal Prediction (MCP) approach, which yields the prediction labels "0" (failed) and "1" (healthy) along with a confidence score that serves as the health score in our case. The MCP algorithm then selects disks according to their confidence score and a threshold chosen by the administrator.
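To make the label-conditional idea concrete, here is a minimal sketch of how Mondrian conformal p-values can be computed. It is not the authors' implementation: the nonconformity scores, the calibration values, and the function names `mondrian_p_values` and `predict_with_confidence` are illustrative assumptions. The key point is that each candidate label is compared only against calibration examples of that same class, so the rare "failed" class is not swamped by the healthy majority. Confidence is taken here as one minus the second-largest p-value, which is the usual conformal prediction definition; whether the paper's health score is computed exactly this way is an assumption.

```python
from bisect import bisect_left

FAILED, HEALTHY = 0, 1  # label encoding used in the text

def mondrian_p_values(calibration_scores, test_scores):
    """Label-conditional (Mondrian) conformal p-values.

    calibration_scores: {label: nonconformity scores of calibration disks
                         whose true label is `label`}
    test_scores:        {label: nonconformity score of the test disk when
                         it is hypothesised to have `label`}
    """
    p_values = {}
    for label, alpha in test_scores.items():
        scores = sorted(calibration_scores[label])
        n = len(scores)
        # Count same-class calibration scores at least as large as alpha.
        at_least_as_strange = n - bisect_left(scores, alpha)
        p_values[label] = (at_least_as_strange + 1) / (n + 1)
    return p_values

def predict_with_confidence(p_values):
    """Predict the label with the largest p-value.

    Confidence is 1 minus the second-largest p-value.
    """
    ranked = sorted(p_values.items(), key=lambda item: item[1], reverse=True)
    (label, _), (_, runner_up) = ranked[0], ranked[1]
    return label, 1.0 - runner_up

# Made-up calibration scores (assumption): few failed disks, many healthy ones.
calibration_scores = {
    FAILED: [0.1, 0.2, 0.3, 0.6],
    HEALTHY: [0.05, 0.1, 0.15, 0.2, 0.3, 0.4, 0.5, 0.7, 0.9],
}
# Hypothetical test disk, e.g. score = 1 - classifier probability of the label.
test_scores = {FAILED: 0.8, HEALTHY: 0.2}

p_values = mondrian_p_values(calibration_scores, test_scores)
label, confidence = predict_with_confidence(p_values)
print(label, round(confidence, 3))
```

In this toy example the disk is predicted healthy, and the confidence value is what would then be compared against the administrator's threshold.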

For instance, if the administrator sets a threshold of 1%, disks with health scores above 99% are excluded as either healthy or failing (depending on the predicted label), and only disks with a health score below 99% are selected for scrubbing. The selected drives can then be mapped to distinct scrubbing frequencies: drives with poor health scores may require more frequent scrubbing (every week), while those with good health scores need less frequent scrubbing (every 3 months). For the same threshold of 1%, the administrator can map disk health to a scrubbing frequency, as in Table 1.
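The selection and mapping steps can be sketched as follows. Table 1 is not reproduced in this section, so the score boundaries in `HYPOTHETICAL_CYCLES` and the monthly interval for cycle B are assumptions made purely for illustration; only the weekly and three-month intervals and the 1% threshold come from the text. The function and disk names are likewise hypothetical.

```python
# (exclusive upper bound on health score, cycle name, scrub interval in days)
# Boundaries and the 30-day cycle are illustrative assumptions, not Table 1.
HYPOTHETICAL_CYCLES = [
    (0.50, "A", 7),    # poor health: scrub weekly
    (0.80, "B", 30),   # intermediate health: assumed monthly
    (0.99, "C", 90),   # good health: scrub every 3 months
]

def select_for_scrubbing(health_scores, threshold=0.01):
    """Keep only disks whose health score is below 1 - threshold."""
    cutoff = 1.0 - threshold
    return {disk: score for disk, score in health_scores.items() if score < cutoff}

def scrub_cycle(score, cycles=HYPOTHETICAL_CYCLES):
    """Return (cycle name, interval in days), or None if no cycle applies."""
    for upper_bound, name, interval_days in cycles:
        if score < upper_bound:
            return name, interval_days
    return None

# Made-up health scores for four disks.
health_scores = {"disk-1": 0.995, "disk-2": 0.42, "disk-3": 0.75, "disk-4": 0.90}

selected = select_for_scrubbing(health_scores, threshold=0.01)
schedule = {disk: scrub_cycle(score) for disk, score in selected.items()}
print(schedule)
```

Here `disk-1` is excluded because its score exceeds 99%, and the remaining three disks fall into cycles A, B, and C respectively.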

5.3. When to scrub: Workload predictor

After the drive health predictor has identified the disks to be scrubbed, the next step is to determine the best time to scrub them using the workload predictor. This component must take into account the availability of system resources, i.e., the disk and CPU utilization information provided by the system and storage statistics subsystem.

The workload predictor employs the Probabilistic Weighted Fuzzy Time Series (PWFTS) algorithm, as detailed in (Orang et al., 2020). The algorithm forecasts system utilization n days ahead by predicting the utilization percentage for the next 12 hours at 1-hour intervals. This forecast is then combined with one of the three possible scrubbing cycles (A, B, or C, as in Table 1) obtained from the drive health predictor. Finally, the scrubbing is triggered. If the scrubbing completes within the 1-hour interval, the process stops; if not, the administrator is notified. The high-level flowchart of the system workload predictor is outlined in Figure 3.
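The scheduling logic described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the forecast values are made up and stand in for the output of a PWFTS model, and the names `pick_scrub_slot`, `run_scrub_window`, and the `max_utilization` cap are assumptions introduced for the example.

```python
def pick_scrub_slot(forecast, max_utilization=50.0):
    """Return the index of the forecast hour with the lowest utilization.

    forecast: predicted utilization percentages, one per 1-hour interval.
    Returns None if the forecast is empty or even the quietest hour is
    above max_utilization (a hypothetical cap, not from the paper).
    """
    if not forecast:
        return None
    hour = min(range(len(forecast)), key=forecast.__getitem__)
    return hour if forecast[hour] <= max_utilization else None

def run_scrub_window(hour, scrub_fn, notify_fn):
    """Trigger scrubbing in the chosen hour.

    scrub_fn(hour) should return True if scrubbing finished within the
    1-hour interval; otherwise the administrator is notified.
    """
    if scrub_fn(hour):
        return "done"
    notify_fn(f"scrub started at hour {hour} did not finish within the interval")
    return "notified"

# Placeholder 12-hour forecast (percent utilization), standing in for PWFTS output.
forecast = [62.0, 55.5, 41.0, 38.5, 47.0, 70.0, 81.0, 77.5, 66.0, 59.0, 52.0, 44.0]

hour = pick_scrub_slot(forecast)
messages = []
status = run_scrub_window(hour, scrub_fn=lambda h: False, notify_fn=messages.append)
print(hour, status, messages)
```

The stub `scrub_fn` always reports an unfinished scrub so that the notification path is exercised; a real deployment would replace both callbacks with calls into the storage system and the alerting channel.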

Figure 4 shows the n-days-ahead forecast of the system utilization percentage. The figure shows that the system is under a lower load on day 0 and a higher load on day 2. Scheduling the scrubbing operations on day 0, when the load is lower, is therefore more favorable. This approach makes better use of system resources, ensuring efficient scrubbing of the disks and leading to lower processing time, lower energy consumption, and a reduced carbon footprint for the data center.

This paper is available on arXiv under the CC BY-NC-ND 4.0 Deed (Attribution-NonCommercial-NoDerivs 4.0 International) license.

Authors:

(1) Rahul Vishwakarma, California State University Long Beach, 1250 Bellflower Blvd, Long Beach, CA 90840, United States ([email protected]);

(2) Jinha Hwang, California State University Long Beach, 1250 Bellflower Blvd, Long Beach, CA 90840, United States ([email protected]);

(3) Soundouss Messoudi, HEUDIASYC - UMR CNRS 7253, Université de Technologie de Compiègne, 57 avenue de Landshut, 60203 Compiègne Cedex - France ([email protected]);

(4) Ava Hedayatipour, California State University Long Beach, 1250 Bellflower Blvd, Long Beach, CA 90840, United States ([email protected]).