Crash-consistent backups aren't good enough

Scott Waterhouse finalized his thoughts on the whole VMware backup idea on his blog today.  One of the things he said that surprised me was that, while you can’t get application-consistent backups in VMware without the use of a host-level agent, you can get crash-consistent backups. My response is simply this:  I’m sorry, that is not even close to the MINIMUM requirement of what you need.

I’m frankly amazed that Scott would post this, as he is a backup person and knows better.  He knows that crash-consistent backups are as trustworthy as a 1978 Pinto’s gas tank.  So who cares if you can make them?

Besides being untrustworthy, they miss one of my critical points here: If you’re not doing a full VSS implementation (which they aren’t, even on Windows 2003) then the applications won’t know they’ve been backed up. Even if you get a <cough> crash-consistent backup, you’ll still need an agent to tell the application that it’s been backed up so it can dump its transaction logs. Otherwise the logs will fill up and the app will crash.

But back to crash-consistent backups…  What does that mean?  It means you are backing up an image of the VM that is EXACTLY equivalent to doing the following:
1. Yank the power cord from your favorite server (no shut down, no quiesce, just yank the power cord out of the UPS)
2. Plug its now-confused filesystem into another server
3. Don’t fsck it or anything
4. Just back up the bits

If you have to restore it, the server and its apps will need to go through a crash recovery process, and it will work most of the time.  But HEY!  The fact that servers don’t always come up after a crash is one of the reasons we take backups!  So why the (*%% would you trust that method as the backup?  I wouldn’t, never have, and never will.

So to summarize.  Without an agent in each VMware VM:
1. You’ll get a <gag> crash-consistent backup
2. The apps won’t know they’ve been backed up, so their logs will fill up and the apps will crash

Unless you get an agent.  Which was the whole point.
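
To make the log problem concrete, here is a minimal back-of-the-envelope sketch in Python. It is a toy model, not a measurement: the function name, the 50 GB log volume, and the 0.5 GB/hour growth rate are all assumptions made up for illustration. It only shows the shape of the problem: if nothing ever tells the app it has been backed up, the logs are never truncated and the volume fills on a predictable schedule.

```python
def hours_until_log_volume_full(capacity_gb, growth_gb_per_hour,
                                truncate_interval_hours=None):
    """Toy model of a transaction log volume (all numbers hypothetical).

    truncate_interval_hours is how often an application-aware backup
    tells the app it may truncate its logs. None means the app is never
    told it was backed up, which is the agentless image-backup case.
    Returns the hours until the volume fills, or None if it never fills.
    """
    if growth_gb_per_hour <= 0:
        return None  # logs are not growing, so nothing fills up

    hours_to_fill = capacity_gb / growth_gb_per_hour

    # If truncation happens before the volume would fill, the logs are
    # emptied in time on every cycle and the volume never fills.
    if (truncate_interval_hours is not None
            and truncate_interval_hours < hours_to_fill):
        return None

    return hours_to_fill


# Agentless image backup: nobody ever truncates the logs.
print(hours_until_log_volume_full(50, 0.5))

# Nightly application-aware backup that truncates the logs.
print(hours_until_log_volume_full(50, 0.5, truncate_interval_hours=24))
```

With these made-up numbers, the first call reports a finite number of hours until the volume is full, and the second reports that it never fills. Plug in your own volume size and log growth rate to see how long you have.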

Written by W. Curtis Preston (@wcpreston), four-time O'Reilly author, and host of The Backup Wrap-up podcast. I am now the Technology Evangelist at Sullivan Strickler, which helps companies manage their legacy data.

6 comments
  • W., you are a bit off on this, IMO.

    Crash-consistent in this case means that VM Tools issues a FS level quiesce to VSS (just not the extended writer quiesce for MS Apps + Oracle). The backup is then generated. This is vastly different than the case you describe involving yanking a power cord. 🙂 It is cute and descriptive, but not accurate.

    So you get a quiesced machine, just not a quiesced application. In my experience, you can restore from this. I have had *many* customers tell me that they unfailingly recover from these images all the time. My original post started with that premise, but said that as a backup guy, that didn’t make me real comfortable. So we start off by saying this works, but wouldn’t it be ideal if you could do better.

    And yes it would be ideal. However, a) guest level backup with Avamar is pretty darn good; and b) image level backup–even if it is just crash consistent–is also pretty darn good and reliable.

    We shouldn’t paint a picture of it being functionally equivalent to backing up a system that has crashed and hasn’t been fsck’ed because that just isn’t the case. It is not functionally equivalent to that at all.

  • As far as the app is concerned (which is what these posts have all been about), I maintain that the power-supply-cable analogy is totally valid. It’s WHY they’re called “crash-consistent,” which is basically an oxymoron. What crash-consistent means is “the backup looks like what happens when you crashed, so the app will have to go through a crash-recovery process when you restore — it’s nowhere NEAR consistent”

    I don’t doubt that lots of people have done it this way and they’ve restored fine. Thousands of people perform ufsdumps on mounted filesystems even though the man page says to unmount — and they’ve recovered fine. But some of these ufsdump restores failed, and I also know people whose crash-consistent restores have failed. The fact that this method requires crash recovery means it is not a consistent backup. IMO, it is therefore not a proper way to do backups of an application — period.

    And you never commented on the other issue, which is that you’ll still need an agent to truncate the logs. This is the double whammy.

  • I think I understand what he means by crash-consistent. Yes, some database products have very good recovery processes after hardware failure. They use transaction logs or similar mechanisms to recover. That means you assume everything was written correctly to disk. Not always true. Disks DO NOT read after write the way tape has for years. With disk, you only know you have a problem after you try to read the bad data blocks.

    But Curtis is 100% right. What about file system corruption? This is where recovery is not so great with some OSes and file system types…

    I know for a fact that several people recover from crashes. Windows products mainly have significant expertise in this… they learned from the blue screen era.

    Linux with open source file systems might not recover as well. That is the case with VMware. I would not trust crashing my VMware farm the way he describes.

    VMware just started to have MPIO products, right? That is after 5+ releases. Ask storage vendors how they love their certification tests and processes… you would be surprised how flaky they are.

  • I keep coming back to it cause nobody seems to respond to it. Even if you let the whole crash-consistent thing go, you still have to deal with the logs, and you’re back to agents.

  • I’ll cop to having done this… using VSS file snapshots to take backups of Lotus Notes databases on Windows. After a restore of the files, the Lotus Admin team *always* had to recover the databases… and every time I always got a little nauseous. It was a bad practice then, and still is now.

    To bring up the logging point that Curtis is really unhappy about, it’s easy… you just turn on circular logging (or equivalent) so that the logs don’t fill up because they never know that they’ve been backed up, and if you’re not keeping your transaction logs, then you don’t care about recovering those transactions.

    Congrats, your recovery point is now the time of your daily backup (hopefully all those transactions you had in the meantime didn’t matter much to you or your customers) and your recovery time is the amount of time to restore the data plus run an application recovery process.

    Like I said… nauseous.

  • I forgot about circular logging. That is a “solution” to the log issue.

    Nauseous. Funny.