Grease and hard drive change
Originally published 2007 in Atomic: Maximum Power Computing. Last modified 03-Dec-2011.
You know how mechanics put a little sticker in the corner of your windscreen to remind you when your car will need another service? Hard drives should come with something similar.
Because, one way or another, all hard drives are going to die.
Personally, I start feeling nervous about my drives when they hit their second birthday. Since they've then spent almost all of those two years cheek-by-jowl with other drives in the disk-farm PCs I favour, this is not entirely irrational. But, thanks to a couple of recent studies, I now know that it's less rational than I thought it was.
Google's study (PDF) of more than a hundred thousand drives over five years is useful as much for what it says about how hard it is to figure this stuff out, as for what it actually found.
It turns out that working drives hard, or running them warmer than recommended, doesn't seem to have much of an impact on their life. And the popular idea that failures follow a "bathtub curve", in which any drive that doesn't die in the first three months is likely to live for five years, also seems to be invalid. Drives actually just slowly wear out over their lives, like other mechanical devices.
The Google study found, as you'd expect, that S.M.A.R.T. errors are strong predictors of an imminent drive failure. But, overall, S.M.A.R.T. was actually close to useless; any well-used hard drive will have at least a couple of S.M.A.R.T. warning flags (just based on the accumulated on-time hour count), yet 36% of Google's dead drives hadn't shown any S.M.A.R.T. warnings at all.
I found this confusing when I first wrote this piece, since there's absolutely no technical reason why S.M.A.R.T. shouldn't be able to warn of many, if not most, imminent drive failures. Then I read this Usenet post, which alleges that the hard drive manufacturers' marketing departments just overruled the engineers and made them, in essence, turn off S.M.A.R.T.'s early warning features, without telling anyone.
I don't know whether this is actually true, but it certainly fits the evidence. Thanks a lot, marketing guys! Yet again, you've made the world just that little bit more awful!
(I wrote a bit more about this in this blog post. This other post, about failure rates for individual drives and RAID arrays, is also marginally relevant.)
Aaaanyway, a similarly huge Carnegie Mellon University hard drive study reached the same conclusion as the Google one.
And both studies also found that we're not all hallucinating - the very long lifespans indicated by hard disk manufacturers' "million hour" mean time to failure figures do not, in fact, indicate that any real drive is likely to last for anything like a million hours of operation (that's 114 years!).
MTBF numbers aren't meant to be taken as an actual lifespan estimate, as I explain in this old piece (it's also covered in the Wikipedia Mean Time Between Failures article), but that's not what we're talking about here. Even using standard MTBF analysis, drive failure rates are much higher than the manufacturers glibly allege.
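For the curious, here's a rough sketch of what that "standard MTBF analysis" looks like in Python. It assumes a constant failure rate (the usual exponential model) and a drive that runs around the clock; the function names are just made up for this illustration.

```python
import math

HOURS_PER_YEAR = 24 * 365  # 8,760 hours; leap days ignored


def mtbf_to_years(mtbf_hours):
    """Naive reading of an MTBF figure as a lifespan, in years."""
    return mtbf_hours / HOURS_PER_YEAR


def annualized_failure_rate(mtbf_hours):
    """Chance of failing within one year of continuous operation,
    assuming a constant failure rate (exponential model)."""
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


mtbf = 1_000_000  # the manufacturers' "million hour" figure
print(f"{mtbf_to_years(mtbf):.0f} years")                 # 114 years
print(f"{annualized_failure_rate(mtbf):.2%} per year")    # 0.87% per year
```

So a million-hour MTBF doesn't promise a 114-year drive; under these assumptions it promises that a bit under one drive in a hundred will fail per year.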
Annual replacement rates (ARRs) for hard drives are usually specified by the manufacturers as being well below 1%. But the real ARRs can actually be above ten per cent.
If you've got a 0.5% ARR drive (pretending, for simplicity, that the rate doesn't change over time; actually, the probability of failure rises as the drive gets older) then it's very likely (97.5%) to still be alive after five years.
If you've got a 5% drive, though, then there's a 23% chance it'll die in its first five years.
And if you're unlucky enough to have bought a 12% drive, then it's almost as likely to be dead after five years as it is to be alive (about a 47% chance of failure, even before you allow for the rate rising with age).
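The arithmetic behind these figures is plain compounding, and can be sketched in a few lines of Python. This uses the same simplification as above (a failure rate that stays constant from year to year), and the function names are invented for the example.

```python
def chance_dead(annual_rate, years=5):
    """Probability that a drive has failed within `years`,
    assuming the same failure rate applies every year."""
    return 1.0 - (1.0 - annual_rate) ** years


def break_even_rate(years=5):
    """Constant annual rate at which a drive is exactly as likely
    to be dead as alive after `years`."""
    return 1.0 - 0.5 ** (1.0 / years)


for rate in (0.005, 0.05, 0.12):
    print(f"{rate:.1%} a year -> {chance_dead(rate):.1%} dead after five years")
# 0.5% a year -> 2.5% dead after five years
# 5.0% a year -> 22.6% dead after five years
# 12.0% a year -> 47.2% dead after five years

print(f"Break-even rate: {break_even_rate():.1%} a year")  # 12.9% a year
```

With a strictly constant rate, the coin-flip point at five years sits at roughly 12.9% a year. Failure rates that climb as the drive ages would pull that threshold lower.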
You can only figure these numbers out for a given model of drive at the end of the period, of course. Different drive models vary in reliability, and despite the devout beliefs of various geeks about different manufacturers' products, there's actually no detectable relationship between brand and reliability.
(Which, yes, does mean that there's a strong case for buying drives based on price, even if you've always sworn by Brand X on account of those two Brand Ys that died on you in quick succession.)
On balance, though, I now think it's perfectly reasonable to give your drives three years before you replace them. Four years is OK for penny-pinchers, and five years or more should be fine for Aunt Nora's PC that's only turned on for two hours a week.
When you reach the end of that period, it's time to put aside an evening for plugging fresh drive(s) into the computer and imaging the old drive(s) across, using some flavour of Norton Ghost or the freeware DriveImage XML or something. XXCLONE is also free for personal use, and looks pretty neat.
This procedure gets slower with each passing year, as drives increase in capacity much faster than drive interfaces increase in bandwidth. But it's still easy enough to do.
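If you want a ballpark for how long that evening will be, the sum is just capacity divided by sustained throughput. Here's a minimal Python sketch; the drive size and transfer speed below are made-up illustrative numbers, not measurements, and real imaging runs will be slower than the ideal.

```python
def imaging_hours(capacity_gb, throughput_mb_per_s):
    """Best-case time to copy a full drive, in hours.
    Uses decimal units (1 GB = 1000 MB), as drive makers do."""
    seconds = capacity_gb * 1000 / throughput_mb_per_s
    return seconds / 3600


# Hypothetical example: a 500 GB drive copied at a sustained 60 MB/s.
print(f"{imaging_hours(500, 60):.1f} hours")  # 2.3 hours
```

Double the capacity without doubling the sustained speed, and the job takes proportionally longer.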
If you've got more than one hard drive, by the way, and your update leaves you with your drives in a different order - say you upgraded your old Parallel ATA C drive to a new SATA one, or just muddled the connectors up when you swapped new for old - then you're likely to get a terrifying boot error when you reset the computer. Windows, for instance, is likely to bleat about not being able to find HAL.DLL.
Don't panic. That error just means your new boot disk is not first in the boot order any more. Fix that in your motherboard's BIOS setup program (usually accessed by pressing Delete after the POST beep) and you'll be away.
Oh, and if you've got a multiple-hard-drive computer, it also pays to label the drives on whatever side faces out when the case is open, so you know which one is which without playing deductive games.
I use a silver pen.
Stop laughing.