Friday 18 July 2008

Time Machine fails on 10.5.4 when interrupted by sleep

Ok, so I've been using my QNAP TS-109 NAS box as a remote Time Machine server for my various home Macs for several months now. Everything's been going swimmingly and for the first time in years I've felt a little more secure knowing that my files are all backed up to several generations, without me having to do anything.

However... in the last couple of days, first one then another of the backups failed and wouldn't restart. The symptoms were Time Machine errors saying that the target volume was read-only.

I ran a couple of quick checks -- first seeing if I could open the sparse bundle used for the backup image directly and try to repair it using Disk Utility -- but it wouldn't even mount. Then I checked in the logs to see exactly what happened when.

It turns out that in both cases, the backup failed when the machine was put to sleep during a Time Machine run. When the Mac woke up, it tried to automatically reconnect to the SMB share and logged a failure message such as the following:

Jul 16 15:34:48 imac kernel[0]: smbfs_smb_reopen_file: Reopen 2b with open deny modes failed because the modify time has changed was 1216197810s 0ns now 1216197813s 0ns!

These messages were often succeeded by various other disk errors and culminated in the backupd process returning an error when it was trying to do something. I have two examples of the backupd failures:

Jul 16 15:34:53 imac /System/Library/CoreServices/backupd[72423]: Error -47 deleting backup: /Volumes/Backup of iMac/Backups.backupdb/iMac/2008-07-15-084358

and

Jul 17 21:42:05 ruth_macbook /System/Library/CoreServices/backupd[6543]: Error: (-50) copying /Users/ruth/Library/Calendars/Calendar Sync Changes to /Volumes/Backup of Ruth's MacBook/Backups.backupdb/Ruth's MacBook/2008-07-17-211219.inProgress/A11E4033-1FAB-457A-96FA-B23B3FED2579/Macintosh HD/Users/ruth/Library/Calendars

but both appear to be due to the smbfs error above.

I've also just recently upgraded all my machines to 10.5.4. Given that it's quite unlikely that two machines would start to fail so close to each other, I suspect that the latest OS upgrade may have introduced a bug.

Other people seem to have encountered similar issues, and not only with unsupported NAS boxes but with official Time Capsules too.

I tried attaching the backup image without mounting it and then running Disk Utility (as suggested here), but Disk Utility wasn't happy then either -- giving me "sibling link" errors. Running fsck by itself just results in "BAD SUPER BLOCK: MAGIC NUMBER WRONG" which is kind of worrying. I haven't yet tried using DiskWarrior, though it's probably worth the $99 for some peace of mind.

I'm going to try a couple of things next:

  1. Check the NAS disk itself for corruption, just in case.
  2. Buy DiskWarrior and see if that fares any better.

I guess after that, I may have to wipe the backup and start again. At least it's a backup and I don't need anything from it right now (famous last words...).

A few days later...

The NAS disk itself seems fine -- I successfully backed up to another image on the same disk with no trouble.

I bought DiskWarrior and ran it against one of the corrupted backup images, but this didn't work either.

Please post a comment if you've had something similar happen to you. Especially if you've been able to fix things!

What I have noticed is that the Time Machine Options (under the Time Machine Preference pane) has a checkbox labelled "Back up while on battery power". Since I only tend to put my machines to sleep when they're on battery power, avoiding backups completely during this time should avoid this corruption issue for the moment.

Backups are now going fine again, just not on battery power.


3 comments:

Anonymous said...

Plain "fsck" is for UFS file systems, which is why it failed and complained about a bad superblock.

I had an encrypted sparse bundle go bad on me when it was not cleanly unmounted. Not exactly the same situation, but perhaps what I did will help:

[The below is from memory, so I might have the options not-quite-right. Consult man pages before trying this.]

Attach the image without mounting it:

hdiutil attach foo.sparsebundle -readwrite -nomount

The hdiutil command will print what disk the image was attached as, eg "disk2".

Run fsck_hfs with the -r option:

fsck_hfs -r -d /dev/rdisk2

Kevin Peter said...

I found a solution to this. You can use a nice little utility called SleepWatcher. It executes scripts when the computer sleeps and when it wakes.

The tip linked to above shows how to add something to the SleepWatcher sleep script to automatically unmount the Time Machine disk when the computer goes to sleep.

I recently set up a Samba Time Machine share, and I tested this successfully. If you put the computer to sleep during a backup, Time Machine will just try again later after the computer wakes up.

I did get some sort of mount error once when I tried it during the initial backup, but the next time I went to back up, it worked just fine. Every time I after that, I was able to put the computer to sleep during a backup (by closing the lid) without error. Time Machine simply ran the backup again at the next scheduled time after I woke it up.

Adam Cohen-Rose said...

@Kevin: Great tip! Thanks! Does the script need admin access to unmount the Time Machine disk?