Curious SMART failure

My main computer has five physical hard drives installed.  There’s my primary drive, and the live backup that is synchronized to it every night.  There are two large drives concatenated to make a backup volume for the MythTV box.  And there’s a separate drive that holds work-related code and files.

I have had many hard drive failures over the years.  Fortunately, my backup routine has prevented me from losing anything important.

In light of the number of hard drives in the machine, I recently decided to start using SMART monitoring software on the main drive, its backup, and the work drive.  A daemon running on Linux periodically performs short and long tests of the hard drives.  Every Sunday, a report is generated and emailed to me to notify me of any developing issues with the hard drives.  The assumption is that, before outright failure, hard drives are likely to show degradation that manifests in these tests, allowing the user to prepare for the imminent loss of the drive.
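
As a rough illustration of the kind of check involved, here is a minimal Python sketch that asks a drive for its overall SMART health verdict through the `smartctl -H` command from smartmontools.  This is not the daemon's own logic, and the function names `parse_health` and `check_drive` are made up for the example.  It assumes an ATA drive, where the verdict appears on a line ending in `test result: PASSED`.  Note the third possible outcome, where no verdict comes back at all.

```python
import re
import subprocess
from typing import Optional

# Verdict line printed by `smartctl -H` for ATA drives.
HEALTH_RE = re.compile(r"overall-health self-assessment test result:\s*(\S+)")


def parse_health(report: str) -> Optional[bool]:
    """Return True for PASSED, False for any other verdict,
    and None when the report contains no verdict line at all."""
    match = HEALTH_RE.search(report)
    if match is None:
        return None
    return match.group(1) == "PASSED"


def check_drive(device: str) -> Optional[bool]:
    """Query one drive (hypothetical helper, usually needs root).
    None means the SMART query itself could not be completed."""
    try:
        result = subprocess.run(
            ["smartctl", "-H", device],
            capture_output=True,
            text=True,
            timeout=60,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    return parse_health(result.stdout)
```

Calling `check_drive("/dev/sda")` (the device name is only a placeholder) returns `True` or `False` when the drive answers, and `None` when the query cannot be performed, which is a different condition from a drive reporting that it is failing.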

In my latest Sunday report, all three drives reported normal conditions.  No signs of impending failure were noted.  Later that afternoon, the SMART daemon started issuing non-routine email messages indicating that it was unable to perform SMART queries on two drives, the work drive and my main backup drive.

It’s very unlikely that two drives would fail without warning, simultaneously, mere hours after getting a clean bill of health from SMART, so something unusual must have been happening.  Both the work drive and the backup drive were returning SATA errors on any activity.  They couldn’t be mounted, and files could not be read from them.  The emails indicated that the work drive failed first, and the primary backup drive about thirty minutes later.

My first thought was that the SATA bus was confused.  Maybe a BIOS error, a kernel bug, the infamous “cosmic ray”, or something else.  With the SATA bus unreliable, all activities, including SMART lookups, could fail.  So the first thing to try was power cycling the box.  After that, both drives were again mountable, and files could be read.

Less than an hour later, the SMART daemon started sending its messages again.  Work drive failure, followed half an hour later by backup drive failure.  The new theory was that the work drive was failing in a curious way.  When it received SMART queries, issued periodically by the daemon, the drive responded in such a way as to confuse the SATA bus or the kernel module responsible for handling it.

So, I bought a new hard drive to replace the failing work drive.  I modified my startup system to avoid activating the SMART daemon while I ran a new backup.  The work drive isn’t fully backed up every night, as many of the files are mirrored on other work computers, but just to be sure, I did a full backup of that drive.  This turned out to be straightforward: without the SMART daemon issuing queries, the drive and the SATA bus remained stable throughout the process.  Once the backup was completed, I powered down again and replaced the drive.  I then set up the encryption on the new drive, formatted it, and restored the files from the backup.

I’ve reactivated the SMART daemon, and the system has remained stable.  This leads me to believe that I had an unexpected SMART failure, where the only visible problem with the drive was that SMART queries left the SATA bus, or the kernel module handling it, in a bad state.  That made it impossible to use not only that drive, but also another, unrelated drive in the computer.
