Today I wanted to buy a new hard drive for my NAS because my old drive has been failing on me. At checkout with my local retailer, I saw the option to extend the warranty to a total of 5 years.
At first this seemed like a great deal: I could run the drive continuously in my NAS for 5 years, no matter when a cheap desktop drive might fail. But the price tag seemed a bit hefty. Was it worth it? How long can a drive store data while running continuously in a NAS? A few Google searches turned up nothing really useful, except the well-known Backblaze hard drive stats. But the summary of their analysis didn't really satisfy my curiosity.
Recently, I had to help my girlfriend, a medical student, create a "Kaplan-Meier" survival graph in R. This type of graph is often used to compare two different treatments and their effect on patient survival. Well, in a way you could look at hard drives as patients, so I decided to take the plunge, download all 18 GB of raw data from Backblaze and apply the same analysis principle to this new data.
After extracting the most relevant information from the dataset with a small Java program and loading the results into R, the first graph looked like this:
Because Backblaze has rigorously tracked every single hard drive, we can see that they had over 100'000 unique drives in use, which is pretty impressive in itself. The graph above is a typical Kaplan-Meier survival curve: it shows the survival probability of a hard drive as a function of the time passed since first power-up.
But how do we distinguish a drive's death from a simple removal? Maybe Backblaze didn't want their 1TB drives occupying precious data center space, so they removed them although they weren't dead. That's where the censoring principle of Kaplan-Meier comes into play. Censoring just means that we cannot tell what happened to a disk after a given time, but we know that it lived happily up until that point.
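To make the censoring idea concrete, here is a rough Python sketch of how the daily Backblaze rows could be collapsed into one record per drive. The original extraction was done in Java, so this is only an illustration of the idea, not that program. I'm assuming the columns `date`, `serial_number`, `model` and `failure` from the daily CSV files, and I'm measuring time as days between a drive's first and last appearance in the data, which may differ from how the graphs in this post were built. `summarize_drives` and the sample rows are made up for the example.

```python
import csv
import io
from datetime import date

# Tiny made-up sample in the shape of the daily Backblaze CSV files.
SAMPLE = """date,serial_number,model,failure
2016-01-01,A1,ModelX,0
2016-01-02,A1,ModelX,0
2016-01-03,A1,ModelX,1
2016-01-01,B2,ModelY,0
2016-01-02,B2,ModelY,0
"""

def summarize_drives(rows):
    """Collapse daily rows into one record per serial number."""
    seen = {}
    for row in rows:
        day = date.fromisoformat(row["date"])
        rec = seen.setdefault(
            row["serial_number"],
            {"model": row["model"], "first": day, "last": day, "failed": False},
        )
        rec["first"] = min(rec["first"], day)
        rec["last"] = max(rec["last"], day)
        if row["failure"] == "1":
            rec["failed"] = True  # an observed death
    # A drive that disappears without failure=1 is treated as censored.
    return {
        serial: {
            "model": rec["model"],
            "days": (rec["last"] - rec["first"]).days,
            "failed": rec["failed"],
        }
        for serial, rec in seen.items()
    }

if __name__ == "__main__":
    for serial, rec in summarize_drives(csv.DictReader(io.StringIO(SAMPLE))).items():
        print(serial, rec)
```

Each drive ends up with a duration and a flag saying whether it died or was merely last seen alive, which is exactly the pair a survival analysis needs.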
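With Kaplan-Meier survival estimates, a removed drive counts as a survivor up to the moment it is removed; after that it simply leaves the group of drives still "at risk". In effect we assume it would have fared like the drives that stayed in use. So if we have 10 hard drives, of which one is removed after 1 year and another one dies after 2 years, the survival probability would be 100% up until 2 years, and 8/9 (about 89%) after that, because only 9 drives were still at risk when the failure happened.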
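For the curious, the estimator itself fits in a few lines. This is a minimal Python sketch of the standard product-limit formula (at each failure time, multiply the running survival by 1 minus deaths over drives at risk), run on the ten-drive scenario: one drive removed after 1 year, one dead after 2 years, and, as an extra assumption of mine, the other eight still running when observation stops after 3 years. `kaplan_meier` is just a name I picked; in practice I used R for the real graphs.

```python
def kaplan_meier(durations, events):
    """Return [(time, survival)] for every time with at least one death.

    durations: observed time per drive
    events:    True if the drive died at that time, False if it was censored
    """
    # Count deaths and total departures (deaths + censored) per time point.
    by_time = {}
    for t, died in zip(durations, events):
        deaths, leaving = by_time.get(t, (0, 0))
        by_time[t] = (deaths + (1 if died else 0), leaving + 1)

    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(by_time):
        deaths, leaving = by_time[t]
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        # Dead and censored drives both leave the at-risk set.
        at_risk -= leaving
    return curve

if __name__ == "__main__":
    durations = [1, 2] + [3] * 8          # years
    events = [False, True] + [False] * 8  # only the second drive died
    print(kaplan_meier(durations, events))
```

The censored drive never shows up as a death; it only shrinks the denominator from 10 to 9 before the failure at year 2, so the curve steps down to 8/9 there.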
The first graph didn't give us a good intuition for the reliability of individual hard drive models; it only gave an overview of all hard drives used at Backblaze collectively. So let's draw a survival curve for each model that has more than 500 unique drives in the dataset:
Sadly this graph isn't very readable, so here are the curves split up by manufacturer:
HGST was very reliable, and consistently so. Out of over 24'000 drives in use, almost none died.
Hitachi also performed very well. All models used at Backblaze had few failures, across all sizes.
The Seagate drives have wildly different survival rates. The confidence intervals are also the largest here, because the data gets thin for some models, e.g. the 1.5TB one with only 41 drives still in use after 4 years. The worst seemed to be the 3TB model, with survival rates lower than any other hard drive at Backblaze.
Interestingly enough, the 8TB enterprise & consumer drives have not been in use for long, but they already show significantly better survival rates.
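Why do few remaining drives mean wide confidence bands? Here is a small Python sketch using Greenwood's formula with a plain (untransformed) 95% interval. That is my choice for illustration and not necessarily the interval type the plots were drawn with. The failure counts are invented; only the figure of 41 drives comes from the data. `greenwood_interval` is a hypothetical helper name.

```python
import math

def greenwood_interval(steps, z=1.96):
    """steps: [(at_risk, deaths), ...] in time order.

    Returns (survival, lower, upper) after the last step, using
    Var[S] = S^2 * sum(d / (n * (n - d))) and S +/- z * sqrt(Var[S]).
    """
    survival = 1.0
    var_sum = 0.0
    for at_risk, deaths in steps:
        survival *= 1 - deaths / at_risk
        if at_risk > deaths:  # term is undefined when everything died
            var_sum += deaths / (at_risk * (at_risk - deaths))
    half_width = z * survival * math.sqrt(var_sum)
    return survival, max(0.0, survival - half_width), min(1.0, survival + half_width)

if __name__ == "__main__":
    # Same failure fraction, very different numbers of drives at risk.
    small = greenwood_interval([(41, 4)])      # invented: 4 of 41 drives die
    large = greenwood_interval([(4100, 400)])  # invented: 400 of 4100 drives die
    print("41 drives:  ", small)
    print("4100 drives:", large)
```

Both cases land on the same survival estimate, but the band around the 41-drive estimate is about ten times wider, which is why the sparsely populated Seagate curves look so uncertain.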
WD is also very consistent in its survival rates, which suggests the same build quality & methodology across all models.
Comparing my results to the Backblaze results, I cannot really observe the bathtub shape, i.e. a high failure rate at the beginning, a low one in the middle and a high one at the end. Rather, the survival probability just declined fairly steadily.
I hope my little wannabe analysis helped you get some insight into the data Backblaze provides for free. Thanks & props to them for being so open that this kind of stuff is even possible!
And to answer my initial question about the warranty extension — I didn’t buy it. I’ll trust you guys, Mr. Kaplan and Mr. Meier.