7.2. Performance metrics
We measured the effective coverage (i.e., the fraction of test examples whose prediction set contains the correct label at a chosen confidence level) and the prediction set size on the open-source Baidu dataset. Figure 6 shows a positive correlation between the confidence level and the coverage. The split-conformal method yields a higher mean coverage than the cross-validation method, indicating that calibration set selection has a considerable influence on the effective coverage. The right side of the figure displays the average prediction set size, which increases with the confidence level. The split-conformal method also yields a consistently higher mean prediction set size than the cross-validation method. Together, these metrics can be used to evaluate how well the Mondrian conformal predictor is performing.
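As a minimal sketch of how these two metrics can be computed, the snippet below assumes each prediction is available as a Python set of candidate labels. The function names `effective_coverage` and `mean_set_size` and the toy labels are illustrative assumptions, not taken from the paper's code.

```python
def effective_coverage(prediction_sets, true_labels):
    """Fraction of examples whose prediction set contains the true label."""
    hits = sum(1 for pred, y in zip(prediction_sets, true_labels) if y in pred)
    return hits / len(true_labels)


def mean_set_size(prediction_sets):
    """Average number of labels per prediction set."""
    return sum(len(pred) for pred in prediction_sets) / len(prediction_sets)


# Toy example (hypothetical data): labels 0 = healthy, 1 = failing.
toy_sets = [{0}, {0, 1}, {1}, set()]
toy_labels = [0, 1, 0, 1]

coverage = effective_coverage(toy_sets, toy_labels)  # 2 of 4 sets contain the label
avg_size = mean_set_size(toy_sets)                   # (1 + 2 + 1 + 0) / 4
print(coverage, avg_size)
```

Repeating this computation over a range of confidence levels would give the kind of coverage and set-size curves described for Figure 6.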
7.3. Power saving from selective scrubbing
Scrubbing is a resource-intensive operation that can degrade system performance while it runs. The time taken to complete a scrubbing operation depends on various factors, such as the size of the HDD being scrubbed. For instance, scrubbing a 1TB HDD may take several hours, while scrubbing an 8TB HDD could take significantly longer, potentially a day or more. Assuming an average power consumption of 7 watts during a 6-hour scrubbing operation for a single HDD, the total energy consumed would be 42 watt-hours (Wh). Power consumption during scrubbing can vary across disks in a data center, depending on factors such as disk size, manufacturer, and storage operations. Taking this average value for comparison, and based on results from the Baidu open-source dataset, selectively scrubbing 28,000 disks instead of all 120,000 disks in a data center would reduce scrubbing energy roughly in proportion to the number of disks skipped, a significant saving for the entire data center.
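The arithmetic above can be made explicit with a short sketch. It assumes, as a simplification, that every disk draws the same 7 W for the same 6 hours per scrubbing pass; the constant and function names are illustrative.

```python
# Figures quoted in the text; treating them as uniform across disks is an assumption.
POWER_W = 7.0            # average power draw while scrubbing one HDD
SCRUB_HOURS = 6.0        # duration of one scrubbing pass
TOTAL_DISKS = 120_000    # disks in the data center
SELECTED_DISKS = 28_000  # disks chosen by selective scrubbing


def scrub_energy_kwh(n_disks, power_w=POWER_W, hours=SCRUB_HOURS):
    """Energy in kWh for one scrubbing pass over n_disks identical disks."""
    return n_disks * power_w * hours / 1000.0


full_kwh = scrub_energy_kwh(TOTAL_DISKS)          # 5040.0 kWh
selective_kwh = scrub_energy_kwh(SELECTED_DISKS)  # 1176.0 kWh
saved_kwh = full_kwh - selective_kwh              # 3864.0 kWh
saved_fraction = saved_kwh / full_kwh             # about 0.767

print(f"full: {full_kwh:.0f} kWh, selective: {selective_kwh:.0f} kWh, "
      f"saved: {saved_kwh:.0f} kWh ({saved_fraction:.1%})")
```

Under these assumptions, one selective pass uses about 1,176 kWh instead of 5,040 kWh, a reduction of roughly 77%. Real savings would depend on the per-disk variation noted above.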
8. Conclusion
The complexity and uncertainty of individual storage components in large-scale data centers pose challenges to business continuity. While proactive approaches such as monitoring and failure analysis have been implemented, machine learning approaches can raise false-positive concerns in real-world deployments with numerous disk drives. In this paper, we propose a fine-grained approach to disk scrubbing using a learning framework based on Mondrian conformal prediction, evaluated on the Baidu open-source dataset.
Our method provides a modest yet effective contribution from a methodological perspective. It tackles the issue of aggressive scrubbing of the entire storage array by utilizing Mondrian conformal predictors to assign health scores to each drive and selectively targeting disks with lower scores for scrubbing. This approach generates a prioritized list for the scheduler engine, leveraging drive failure analysis and quantifying disk health across the entire storage pool. As a result, only 22.7% of the drives need to be scrubbed, leading to power savings and improved reliability.
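The general idea of scoring and prioritizing drives can be illustrated with a small sketch of class-conditional (Mondrian) conformal p-values. This is not the paper's implementation: the nonconformity scores, the label names, the use of the "healthy" p-value as a health score, and the helper names `mondrian_p_values` and `scrub_queue` are all assumptions made for illustration.

```python
from collections import defaultdict


def mondrian_p_values(cal_scores, cal_labels, test_scores):
    """Class-conditional conformal p-values for one test object.

    cal_scores[i] is the nonconformity score of calibration example i under
    its true label cal_labels[i]; test_scores maps each candidate label to
    the test object's nonconformity score under that label.
    """
    by_class = defaultdict(list)
    for score, label in zip(cal_scores, cal_labels):
        by_class[label].append(score)

    p_values = {}
    for label, score in test_scores.items():
        same_class = by_class[label]  # only calibration examples of this class
        at_least = sum(1 for s in same_class if s >= score)
        p_values[label] = (at_least + 1) / (len(same_class) + 1)
    return p_values


def scrub_queue(health_scores):
    """Disk ids ordered from lowest to highest health score (scrub first)."""
    return sorted(health_scores, key=health_scores.get)


# Toy calibration data (hypothetical scores and labels).
cal_scores = [0.1, 0.2, 0.3, 0.4, 0.7, 0.9]
cal_labels = ["healthy"] * 4 + ["failing"] * 2
p = mondrian_p_values(cal_scores, cal_labels, {"healthy": 0.35, "failing": 0.8})

# Assumption: the "healthy" p-value of each disk serves as its health score.
queue = scrub_queue({"disk-a": 0.40, "disk-b": 0.05, "disk-c": 0.90})
print(p, queue)
```

Calibrating each class separately is what distinguishes the Mondrian variant from a pooled conformal predictor. A scheduler could then scrub only the disks at the front of the queue, for example those below a chosen health threshold.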
Future work could involve incorporating Venn-Abers predictors, which offer calibrated probabilities for predictions and could further enhance the accuracy and effectiveness of our approach (Vovk and Petej, 2012). By incorporating such predictors, we could potentially refine and optimize our method for even better performance in identifying and addressing potential disk failures in large-scale data centers.
References
MAPIE - Model Agnostic Prediction Interval Estimator; MAPIE 0.6.3 documentation, mapie.readthedocs.io. https://mapie.readthedocs.io/en/latest/. [Accessed 08-Apr-2023].
Jonathan Alvarsson, Staffan Arvidsson McShane, Ulf Norinder, and Ola Spjuth. Predicting with confidence: using conformal prediction in drug discovery. Journal of Pharmaceutical Sciences, 110(1):42–49, 2021.
Lakshmi N Bairavasundaram, Garth R Goodson, Shankar Pasupathy, and Jiri Schindler. An analysis of latent sector errors in disk drives. In Proceedings of the 2007 ACM SIGMETRICS international conference on Measurement and modeling of computer systems, pages 289–300, 2007.
Nijaz Bajgorić, Lejla Turulja, and Amra Alagić. Downtime and business continuity. In Always-On Business: Aligning Enterprise Strategies and IT in the Digital Age, pages 29–50. Springer, 2022.
Henrik Boström and Ulf Johansson. Mondrian conformal regressors. In Conformal and Probabilistic Prediction and Applications, pages 114–133. PMLR, 2020.
Euy-Hyun Cho, Jeong-Kyu Park, and Jong-Gyu Chae. The accelerated life test of hard disk in the environment of pacs. Journal of Digital Contents Society, 16(1):63–70, 2015.
DrTycoon. Hdds dataset (baidu inc.), Jan 2023. URL https://www.kaggle.com/datasets/drtycoon/hdds-dataset-baidu-inc.
Greg Hamerly, Charles Elkan, et al. Bayesian approaches to failure prediction for disk drives. In ICML, volume 1, pages 202–209. Citeseer, 2001.
Ilias Iliadis, Robert Haas, Xiao-Yu Hu, and Evangelos Eleftheriou. Disk scrubbing versus intra-disk redundancy for high-reliability raid storage systems. ACM SIGMETRICS Performance Evaluation Review, 36(1):241–252, 2008.
Ilias Iliadis, Robert Haas, Xiao-Yu Hu, and Evangelos Eleftheriou. Disk scrubbing versus intra-disk redundancy for raid storage systems. ACM transactions on storage (TOS), 7(2):1–42, 2011.
Josu Ircio, Aizea Lojo, Jose A Lozano, and Usue Mori. A multivariate time series streaming classifier for predicting hard drive failures [application notes]. IEEE Computational Intelligence Magazine, 17(1):102–114, 2022.
Tianming Jiang, Ping Huang, and Ke Zhou. Scrub unleveling: Achieving high data reliability at low scrubbing cost. In 2019 Design, Automation & Test in Europe Conference & Exhibition (DATE), pages 1403–1408, 2019. doi: 10.23919/DATE.2019.8715169.
Junping Liu, Ke Zhou, Zhikun Wang, Liping Pang, and Dan Feng. Modeling the impact of disk scrubbing on storage system. J. Comput., 5(11):1629–1637, 2010.
Jie Lu, Anjin Liu, Fan Dong, Feng Gu, Joao Gama, and Guangquan Zhang. Learning under concept drift: A review. IEEE transactions on knowledge and data engineering, 31(12):2346–2363, 2018.
Ruiming Lu, Erci Xu, Yiming Zhang, Fengyi Zhu, Zhaosheng Zhu, Mengtian Wang, Zongpeng Zhu, Guangtao Xue, Jiwu Shu, Minglu Li, et al. Perseus: A Fail-Slow detection framework for cloud storage systems. In 21st USENIX Conference on File and Storage Technologies (FAST 23), pages 49–64, 2023.
Sidi Lu, Bing Luo, Tirthak Patel, Yongtao Yao, Devesh Tiwari, and Weisong Shi. Making disk failure predictions smarter! In FAST, pages 151–167, 2020.
Rachel Luo, Shengjia Zhao, Jonathan Kuck, Boris Ivanovic, Silvio Savarese, Edward Schmerling, and Marco Pavone. Sample-efficient safety assurances using conformal prediction. In Algorithmic Foundations of Robotics XV: Proceedings of the Fifteenth Workshop on the Algorithmic Foundations of Robotics, pages 149–169. Springer, 2022.
Ao Ma, Fred Douglis, Guanlin Lu, Darren Sawyer, Surendar Chandra, and Windsor Hsu. RAIDShield: Characterizing, monitoring, and proactively protecting against disk failures. In 13th USENIX Conference on File and Storage Technologies (FAST 15), pages 241–256, Santa Clara, CA, February 2015. USENIX Association. ISBN 978-1-931971-201. URL https://www.usenix.org/conference/fast15/technical-sessions/presentation/ma.
Valery Manokhin. Awesome conformal prediction, April 2022. URL https://doi.org/10.5281/zenodo.6467205.
Soundouss Messoudi, Sylvain Rousseau, and Sébastien Destercke. Deep conformal prediction for robust models. In International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems, pages 528–540. Springer, 2020.
Soundouss Messoudi, Sébastien Destercke, and Sylvain Rousseau. Class-wise confidence for debt prediction in real estate management: discussion and lessons learned from an application. In Conformal and Probabilistic Prediction and Applications, pages 211–228. PMLR, 2021.
Ningfang Mi, Alma Riska, Evgenia Smirni, and Erik Riedel. Enhancing data availability in disk drives through background activities. In 2008 IEEE International Conference on Dependable Systems and Networks With FTCS and DCC (DSN), pages 492–501. IEEE, 2008.
Alina Oprea and Ari Juels. A clean-slate look at disk scrubbing. In FAST, pages 57–70, 2010.
Omid Orang, Rodrigo Silva, Petrônio Cândido de Lima e Silva, and Frederico Gadelha Guimarães. Solar energy forecasting with fuzzy time series using high-order fuzzy cognitive maps. In 2020 IEEE international conference on fuzzy systems (FUZZ-IEEE), pages 1–8. IEEE, 2020.
Jehan-François Pâris, Thomas Schwarz, Ahmed Amer, and Darrell DE Long. Improving disk array reliability through expedited scrubbing. In 2010 IEEE Fifth International Conference on Networking, Architecture, and Storage, pages 119–125. IEEE, 2010.
Eduardo Pinheiro, Wolf-Dietrich Weber, and Luiz André Barroso. Failure trends in a large disk drive population. 2007.
Teerat Pitakrat, André van Hoorn, and Lars Grunske. A comparison of machine learning algorithms for proactive hard disk drive failure detection. In Proceedings of the 4th International ACM Sigsoft Symposium on Architecting Critical Systems, ISARCS ’13, pages 1–10, New York, NY, USA, 2013. Association for Computing Machinery. ISBN 9781450321235. doi: 10.1145/2465470.2465473. URL https://doi.org/10.1145/2465470.2465473.
Junkil Ryu and Chanik Park. Effects of data scrubbing on reliability in storage systems. IEICE TRANSACTIONS on Information and Systems, 92(9):1639–1649, 2009.
Bianca Schroeder, Sotirios Damouras, and Phillipa Gill. Understanding latent sector errors and how to protect against them. ACM Transactions on storage (TOS), 6(3):1–23, 2010.
Glenn Shafer and Vladimir Vovk. A tutorial on conformal prediction. Journal of Machine Learning Research, 9(3), 2008.
Jitendra Singh and Rahul Deo Vishwakarma. System and method for survival forecasting of disk drives using semi-parametric transfer learning, January 24 2023. US Patent 11,561,701.
Xiaoyi Sun, Krishnendu Chakrabarty, Ruirui Huang, Yiquan Chen, Bing Zhao, Hai Cao, Yinhe Han, Xiaoyao Liang, and Li Jiang. System-level hardware failure prediction using deep learning. In Proceedings of the 56th Annual Design Automation Conference 2019, pages 1–6, 2019.
Rahul Deo Vishwakarma and Bing Liu. System and method for persistent storage failure prediction, April 22 2021. US Patent App. 16/656,875.
Rahul Deo Vishwakarma and Jayanth Kumar Reddy Perneti. Method and system for reliably forecasting storage disk failure, February 4 2021. US Patent App. 16/529,499.
Vladimir Vovk and Ivan Petej. Venn-abers predictors. arXiv preprint arXiv:1211.0025, 2012.
Vladimir Vovk, David Lindsay, Ilia Nouretdinov, and Alex Gammerman. Mondrian confidence machine. Technical Report, 2003.
Vladimir Vovk, Alexander Gammerman, and Glenn Shafer. Algorithmic learning in a random world. Springer International Publishing, Cham, Switzerland, 2 edition, December 2022.
Jie Yu. Hard disk drive failure prediction challenges in machine learning for multi-variate time series. In Proceedings of the 2019 3rd International Conference on Advances in Image Processing, pages 144–148, 2019.
Ji Zhang, Yuanzhang Wang, Yangtao Wang, Ke Zhou, Schelter Sebastian, Ping Huang, Bin Cheng, and Yongguang Ji. Tier-scrubbing: An adaptive and tiered disk scrubbing scheme with improved mttd and reduced cost. In 2020 57th ACM/IEEE Design Automation Conference (DAC), pages 1–6, 2020. doi: 10.1109/DAC18072.2020.9218551.
Yuqi Zhang, Wenwen Hao, Ben Niu, Kangkang Liu, Shuyang Wang, Na Liu, Xing He, Yongwong Gwon, and Chankyu Koh. Multi-view feature-based SSD failure prediction: What, when, and why. In 21st USENIX Conference on File and Storage Technologies (FAST 23), pages 409–424, 2023.
This paper is available on arXiv under the CC BY-NC-ND 4.0 Deed (Attribution-Noncommercial-Noderivs 4.0 International) license.
Authors:
(1) Rahul Vishwakarma, California State University Long Beach, 1250 Bellflower Blvd, Long Beach, CA 90840, United States;
(2) Jinha Hwang, California State University Long Beach, 1250 Bellflower Blvd, Long Beach, CA 90840, United States;
(3) Soundouss Messoudi, HEUDIASYC - UMR CNRS 7253, Université de Technologie de Compiegne, 57 avenue de Landshut, 60203 Compiegne Cedex - France;
(4) Ava Hedayatipour, California State University Long Beach, 1250 Bellflower Blvd, Long Beach, CA 90840, United States.