Not SMART enough to predict disk failures
By Charles Miller

My friend Favi recently noticed an ominous error message when she turned on her computer: “1720 smart hard drive imminent failure.”

SMART is an acronym for “Self-Monitoring, Analysis and Reporting Technology.” Steve Gibson, one of the leading experts in the field of computer hardware, has a favorite expression: “SMART is dumb.” To understand what he means, it helps to know a little of the history of hard drives.

The earliest computer hard drives were truly dumb devices, but with the advent of Integrated Drive Electronics (IDE) they became less so. Each drive of that generation carried its own microprocessor. This allowed the drive to work faster and also to perform certain functions on its own, independent of the motherboard and the rest of the computer.

This “independence” soon became a bit of an issue as smarter and smarter hard drives were designed with capabilities such as self-correction of errors. Increasingly the hard drive was being designed to perform more functions autonomously without interaction with other components. Essentially, the circuit board on the hard drive was becoming a little self-contained computer of its own.

The computer manufacturers, Compaq being one of the leaders, wanted to tap into this little self-sufficient computer to learn what the hard drive was doing. Among the functions it performed were compensating for bad sectors on the drive, monitoring the temperature of the disk, and so on. The computer makers realized that data such as scan errors, reallocation count, offline reallocation count and probational count might give some valuable clues to the overall health of the drive.

The computer manufacturers, led by Compaq, basically forced the hard disk manufacturers of the time to create the SMART standard. The goal of SMART was to allow other software to monitor the condition of the disk and, it was hoped, predict failures.

The problem was that the idea sounded very good in theory but proved difficult to implement, given the wide and constantly changing variety of hard disk designs. A temperature reading that is dangerously high for one type of drive might be perfectly acceptable for another brand.
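The temperature problem can be sketched in a few lines of Python. This is only an illustration: the model names and the limits in the table are invented for the example and do not describe any real drive or any part of the SMART standard.

```python
# Hypothetical per-model temperature limits in degrees Celsius.
# These names and numbers are invented for illustration only.
MAX_SAFE_TEMP_C = {
    "model_a": 45,
    "model_b": 60,
}


def is_temperature_alarming(model, temp_c):
    """Return True if temp_c is above the limit listed for this model.

    Raises KeyError for a model that is not in the table, which mirrors
    the practical difficulty: new drive designs keep appearing.
    """
    return temp_c > MAX_SAFE_TEMP_C[model]


if __name__ == "__main__":
    reading = 50
    for model in MAX_SAFE_TEMP_C:
        print(model, reading, is_temperature_alarming(model, reading))
```

The same reading of 50 degrees is flagged for one made-up model and passed for the other, so monitoring software cannot interpret the number without knowing which drive it came from.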

For a time many in the computer industry held to the hope that accurate predictive failure models based on SMART telemetry could be created. If this could be accomplished, your computer could tell you that your hard disk was going to fail before it actually happened. Unfortunately, an accurate predictive model has proved elusive for more than a decade. Analyzing many years of data from hundreds of thousands of hard drives lets statisticians estimate fairly accurately how often drives fail, but not when any particular drive will fail. The figures differ owing to differences of opinion and differing kinds of hardware. Some say the annual failure rate is between one and eight percent. Others, like my friend Larry, look past the annual figures to point out that the lifetime failure rate for hard disks is really 100 percent, because eventually all disks wear out.
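To see what an annual failure rate does and does not tell you, here is a rough sketch in Python. It assumes, purely for simplicity, that a drive faces the same independent chance of failing every year; real drives do not behave so neatly, and the function name is my own.

```python
def cumulative_failure_probability(annual_failure_rate, years):
    """Chance that a drive has failed at some point within `years` years.

    Assumes the same independent failure rate every year, which is a
    simplification and not a claim about how real drives age.
    """
    return 1.0 - (1.0 - annual_failure_rate) ** years


if __name__ == "__main__":
    # The low and high ends of the annual range quoted above.
    for rate in (0.01, 0.08):
        for years in (1, 5, 10):
            chance = cumulative_failure_probability(rate, years)
            print(f"annual rate {rate:.0%}, {years:2d} years: "
                  f"{chance:.1%} chance of failure")
```

Even at the pessimistic eight percent figure, this simple model only says that somewhat more than half of such drives would be gone within ten years. It says nothing about which drives, or in which year.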

SMART has not proven to be very helpful, and the case of my friend Favi is a good example. I told her that hard disks frequently continue running for years after a SMART warning. Of course, this time the disk promptly died, but fortunately the warning had encouraged us to make a backup of all the important data, so nothing was lost.

Charles Miller is a freelance computer consultant, a frequent visitor to San Miguel since 1981 and now practically a full-time resident. He may be contacted at 044-415-101-8528 or email FAQ8 (at) SMAguru.com.