Tag Archives: backups

Curious SMART failure

My main computer has 5 physical hard drives installed.  There’s my primary drive, and the live backup that is synchronized to it every night.  There are two large drives concatenated to make a backup volume for the MythTV box.  And there’s a separate drive that holds work-related code and files.

I have had many hard drive failures over the years; fortunately, my backup routine has prevented me from losing anything important.

In light of the number of hard drives in the machine, I recently decided to start using SMART monitoring software on the main drive, its backup, and the work drive.  A daemon running on Linux periodically performs short and long tests of the hard drives.  Every Sunday, a report is generated and emailed to me to notify me of any developing issues with the hard drives.  The assumption is that, before outright failure, hard drives are likely to show degradation that manifests in these tests, allowing the user to prepare for the imminent loss of the drive.
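
As a rough illustration (not my exact configuration), a setup like that built on smartmontools’ smartd looks something like the sketch below; the device names and address are placeholders, and the Sunday summary would come from an ordinary cron job rather than from smartd itself.

# Illustrative /etc/smartd.conf entries: -a monitors all attributes,
# -s schedules self-tests (S = short, L = long, day-of-week 7 = Sunday),
# and -m mails warnings as they happen.
/dev/sda -a -s (S/../.././02|L/../../7/03) -m me@example.com
/dev/sdb -a -s (S/../.././02|L/../../7/03) -m me@example.com
/dev/sdd -a -s (S/../.././02|L/../../7/03) -m me@example.com

# Illustrative crontab entry for the Sunday morning report.
0 8 * * 0  for d in sda sdb sdd; do smartctl -H -A /dev/$d; done | mail -s "Weekly SMART report" me@example.com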

In my latest Sunday report, all three drives reported normal conditions.  No signs of impending failure were noted.  Later that afternoon, the SMART daemon started issuing non-routine email messages indicating that it was unable to perform SMART queries on two drives, the work drive and my main backup drive.

It’s unreasonable to expect that two drives failed without warning, simultaneously, mere hours after getting a clean bill of health from SMART, so something unusual must have been happening.  Both the work drive and the backup drive were delivering SATA errors on any activity, they couldn’t be mounted, and files could not be read from them.  The emails indicated that the work drive failed first, and the primary backup drive about thirty minutes later.

My first thought was that the SATA bus was confused.  Maybe a BIOS error, a kernel bug, the infamous “cosmic ray”, or something else.  With the SATA bus unreliable, all activities, including SMART lookups, could fail.  So the first thing to try was power cycling the box.  After that, both drives were again mountable, and files could be read.

Less than an hour later, the SMART daemon started sending its messages again.  Work drive failure, followed half an hour later by backup drive failure.  The new theory was that the work drive was failing in a curious way.  When it received SMART queries, issued periodically by the daemon, the drive responded in such a way as to confuse the SATA bus or the kernel module responsible for handling it.

So, I bought a new hard drive to replace the failing work drive.  I modified my startup configuration so that the SMART daemon would not be activated while I ran a new backup.  The work drive isn’t fully backed up every night, as many of the files are mirrored on other work computers, but just to be sure, I did a full backup of that drive.  This turned out to be straightforward: without the SMART daemon issuing queries, the drive and the SATA bus remained stable throughout the process.  Once the backup was completed, I powered down again and replaced the drive, set up the encryption on the new drive, formatted it, and recovered from the backup.
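
For anyone curious, a drive swap like that boils down to something like the following sketch, shown here with LUKS via cryptsetup, ext3, and a backup tree under /backup/work purely as illustrative stand-ins; your devices, filesystem and backup tool will differ.

# Illustrative only: encrypt, format, mount, and restore the new drive.
cryptsetup luksFormat /dev/sdd        # initialize encryption on the new drive
cryptsetup luksOpen /dev/sdd work     # map it as /dev/mapper/work
mkfs.ext3 /dev/mapper/work            # create the filesystem
mount /dev/mapper/work /work
rsync -aHx /backup/work/ /work/       # pull the files back from the backup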

I’ve reactivated the SMART daemon, and the system has remained stable.  This leads me to believe that I had an unexpected kind of SMART failure, in which the only visible problem with the drive was that SMART queries put it into a bad state, making it impossible to use not only that drive, but another, unrelated drive in the computer.

Why do I have so many hard drives?

There are five hard drives in my main computer. There is no RAID setup. Why?

Hard drives fail. I’ve had the drive holding my root partition fail more than once. When that happened, I would restore from backup. I made a backup tape at least once a week, but a badly timed disk failure could still result in the loss of a lot of work.

My solution to this has been to buy my hard drives in matched pairs. I partition them equally, format them the same way, and install them both in the computer. One of them is the live disk; the other is the spare. The spare is kept unmounted and spun down. Every night around 3:00 AM, a cron job spins up the spare drives. Then, one partition at a time is fsck-ed, mounted, and copied to. The shell script uses rdist to synchronize the contents of the two partitions. Finally, I take special care to make the backup drive bootable. I use the LILO boot loader, so, when the root partition is mounted under /mnt/backup, the script executes the command:

/sbin/lilo -r /mnt/backup -b /dev/sdc

which writes the LILO boot magic to the backup boot drive, which appears as /dev/sdc while it is the spare in my system. My lilo.conf file, on both the live system and the spare, refers to the boot drive as /dev/sda, but the ‘-b’ switch overrides that: the information is written to the boot block of the current /dev/sdc, but written so that it is appropriate for booting the device at /dev/sda (which the spare will appear to be should my live boot drive fail and be removed from the system).
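
For context, the relevant part of a lilo.conf set up this way looks roughly like the sketch below (illustrative, not my actual file).

# Both the live copy and the spare's copy say boot=/dev/sda; the -r and -b
# switches above are what redirect the boot-block write to the spare.
boot=/dev/sda
prompt
timeout=50
image=/boot/vmlinuz
    label=linux
    root=/dev/sda2
    read-only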

Next, I use volume labels to mount my partitions. You can’t have duplicate labels in the system, so my spare drive has labels with the suffix “_bak” applied. That means that the /etc/fstab file suitable for the live drive would not work if the spare were booted with that fstab. To solve this problem, the copying script runs this command after it finishes copying the files in /etc:

sed -e 's|LABEL=\([^ \t]*\)\([ \t]\)|LABEL=\1_bak\2|' /etc/fstab > /mnt/backup/etc/fstab

which has the effect of renaming the labels in the fstab to their versions with the _bak suffix, so that they match the labels on the spare hard drive’s partitions.
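
Putting the pieces together, the nightly job amounts to something like the sketch below; the labels, mount points, spin-up command and rdist invocation are all illustrative rather than my exact script, and error handling is omitted.

#!/bin/sh
# Illustrative sketch of the nightly backup job.
sdparm --command=start /dev/sdc             # spin the spare up (one way to do it)
# One partition at a time: check, mount, then synchronize.
fsck -p LABEL=root_bak
mount LABEL=root_bak /mnt/backup
fsck -p LABEL=home_bak
mount LABEL=home_bak /mnt/backup/home
rdist -c /home localhost:/mnt/backup/home   # the root partition is handled the
                                            # same way, minus /proc, /mnt and friends
# Rewrite the labels in the spare's fstab and install its boot block.
sed -e 's|LABEL=\([^ \t]*\)\([ \t]\)|LABEL=\1_bak\2|' /etc/fstab > /mnt/backup/etc/fstab
/sbin/lilo -r /mnt/backup -b /dev/sdc
umount /mnt/backup/home /mnt/backup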

OK, that sounds like a lot of work, why do I do it? What does it buy me?

First of all, it gives me automatic backups. Every night, every file is backed up. When I go to the computer at the beginning of the day, the spare drive holds a copy of the filesystem as it appeared when I went to sleep the night before. Now, if I do something really unwise, like deleting a pile of important files or otherwise messing up the filesystem, I have a backup from the night before that the mistake hasn’t touched. If I were to use RAID, deleting a file would delete it immediately from my backup, which isn’t what I want. As long as I realize there’s a problem before the end of the evening, I can always recover the machine to the way it looked before I started changing things in the morning. If I don’t have enough time to verify that the things I’ve done are OK, I turn off the backup for a night by editing the script.

Another important thing it allows me to do is to test really risky operations. For instance, replacing glibc on a live box can be tricky. In recent years, the process has been improved to the point that it’s not really scary to type “make install” on a live system, but ten years ago that would almost certainly have confused the dynamic linker enough that you would be forced to go to rescue floppies. Now, though, I can test it safely. I prepare for the risky operation, and then before doing it, I run the backup script. When that completes, I mount the complete spare filesystem under a mountpoint, /mnt/chroot. I chroot into that directory, and I am now running in the spare. I can try the unsafe operation, installing a new glibc, or a new bash, or something else critical to the operation of the Linux box. If things go badly wrong, I type “exit”, and I’m back in the boot drive, with a mounted image of the damage in /mnt/chroot. I can investigate that filesystem, figure out what went wrong and how to fix it, and avoid the problem when the time comes to do the operation “for real”. Then, I unmount the partitions under /mnt/chroot and re-run my backup script, and everything on the spare drive is restored. Think of it as a sort of semi-virtual machine for investigating dangerous filesystem operations.
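
In concrete terms, setting up that semi-virtual machine looks roughly like this (labels and mount points are illustrative):

# Illustrative sketch of the chroot sandbox on the spare drive.
mount LABEL=root_bak /mnt/chroot
mount LABEL=home_bak /mnt/chroot/home
mount -t proc proc /mnt/chroot/proc         # so tools inside the chroot behave
chroot /mnt/chroot /bin/bash
# ...try the risky install here; typing "exit" drops back to the live root...
umount /mnt/chroot/proc /mnt/chroot/home /mnt/chroot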

The other thing this gives me is a live filesystem on a spare drive. When my hard drive fails (not “if”, “when”, your hard drive will fail one day), it’s a simple matter of removing the bad hardware from the box, re-jumpering the spare if necessary, and then rebooting the box. I have had my computer up and running again in less than ten minutes, having lost, at most, the things I did earlier in the same day. While you get this benefit with RAID, the other advantages listed above are not easily available with RAID.

Of course, this is fine, but it’s not enough for proper safety. The entire computer could catch fire, destroying all of my hard drives at once. So I still make periodic backups to writable DVDs. I use afio for my backups, asking it to break the archive into chunks a bit larger than 4 GB, then burn them onto DVDs formatted with the ext2 filesystem (you don’t have to use a UDF filesystem on a DVD; ext2 works just fine, and it’s certain to be available when you’re using any rescue and recovery disk). Once I’ve written the DVDs, I put them in an envelope, mark it with the date, and give it to relatives to hang onto as off-site backups.
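
One DVD’s worth of that process looks roughly like the sketch below; the afio and growisofs invocations, sizes and paths are illustrative rather than my exact commands.

# Illustrative off-site backup sketch: archive, wrap in an ext2 image, burn.
find /home -xdev -print | afio -o -v -s 4300m /backup/offsite/home.afio   # split into ~4 GB volumes
dd if=/dev/zero of=/backup/dvd.img bs=1M count=4400   # blank, DVD-sized image
mkfs.ext2 -F /backup/dvd.img                          # ext2, not UDF
mount -o loop /backup/dvd.img /mnt/dvdimg
cp /backup/offsite/home.afio /mnt/dvdimg/             # one chunk per disc
umount /mnt/dvdimg
growisofs -dvd-compat -Z /dev/dvd=/backup/dvd.img     # burn the image to DVD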

So, one pair of drives is for my /home partition, one pair for the other partitions on my system. Why do I have 5 drives? Well, the fifth one isn’t backed up. It holds large data sets related to my work. These are files I can get back by carrying them home from the office on my laptop, so I don’t have a backup for this drive. Occasionally I put things on that drive that I don’t want to risk losing, and in that case I have a script that copies the appropriate directories to one of my backed-up partitions, but everything else on that drive is expendable.
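
That copy script is nothing elaborate; it amounts to something along these lines, with the tool and paths as illustrative stand-ins:

# Illustrative: stash the directories I care about on a backed-up partition.
rsync -a /work/keep/ /home/work-keep/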

There are two problems that can appear with this scheme when large files are involved.

  • rdist doesn’t handle files larger than 2 GB. I looked through the source code to see if I could fix that shortcoming, and got a bit worried about the code. So I’m working on writing my own replacement for rdist with the features I want. In the meantime, I rarely have files that large, and when I do, they don’t change often, so I’ve been copying the files to the backup manually.
  • Sometimes root’s shells, even those spawned by cron, have ulimit settings. If you’re not careful, you’ll find that cron jobs cannot create a file in excess of some maximum size, often 1 GB. This is an inconvenient restriction, and one that I have removed on my system (see the sketch after this list).
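
As a rough illustration of that last point (where the limit actually comes from varies from system to system; /etc/security/limits.conf, PAM, or the shell’s startup files are the usual suspects), the backup script itself can check and lift the file-size limit:

# Illustrative: inspect and lift the file-size ulimit at the top of the script.
ulimit -f              # prints the current limit in blocks, or "unlimited"
ulimit -f unlimited    # lift it for this shell and its children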