Written by W. Curtis Preston
Friday, 05 February 2010 20:40
This question has come up again. For what it’s worth, I’m still firmly in the camp that says that it is possible to have a complete backup system based solely on snapshots and replications — as long as you can address the criticisms that people have against the idea. So I thought I’d throw out all the objections and see how the concept does against them.
[Although I really don't like doing this, I've taken out a few paragraphs here that were in the original post. They were really not germane to the topic and I'd rather you first read the post without them. But so it doesn't look like I'm hiding something, I'm going to attach the paragraphs to the post as a comment.]
I, for one, have a completely snapshot-based system for my home data (4 TB of documents, music, and DVD images) and love it. It’s based on Linux, rsync, and a few shell scripts. I have onsite and offsite backups and don’t use anything resembling what most people would consider “backup software.” (And I do this even with the dozens of offers I get for free backup software and hardware.) So if snapshots/copies aren’t backups, then Mr. Backup doesn’t back up his own data. And since BackupPC (which has tens of thousands of open-source users) is essentially a fancy version of what I’m doing, it should probably change its name to SnapshotPC. Oh, and as long as I’m on it, Time Machine is essentially the same thing as well. It is not a “backup” in the traditional sense; it’s just a copy with hard links, just like my system.
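For anyone wondering what "a copy with hard links" looks like in practice, here is a minimal sketch in Python. It is not my actual scripts (those are rsync and shell, and rsync's `--link-dest` option does this kind of linking for you); the function name `take_snapshot` and the directory layout are made up for illustration, and it assumes snapshot names sort chronologically (for example, ISO dates) and that the snapshot directory lives on a filesystem that supports hard links.

```python
import filecmp
import os
import shutil
from pathlib import Path


def take_snapshot(source: Path, snap_root: Path, name: str) -> Path:
    """Create snap_root/name as a full-looking copy of source.

    Files unchanged since the most recent earlier snapshot are hard-linked
    to it (no extra data blocks); new or changed files are copied.
    Assumption: snapshot names sort in chronological order.
    """
    snap_root.mkdir(parents=True, exist_ok=True)
    earlier = sorted(p for p in snap_root.iterdir() if p.is_dir())
    previous = earlier[-1] if earlier else None
    target = snap_root / name
    for dirpath, _dirnames, filenames in os.walk(source):
        rel = Path(dirpath).relative_to(source)
        (target / rel).mkdir(parents=True, exist_ok=True)
        for filename in filenames:
            src = Path(dirpath) / filename
            dst = target / rel / filename
            old = previous / rel / filename if previous else None
            if old is not None and old.is_file() and filecmp.cmp(src, old, shallow=False):
                os.link(old, dst)       # unchanged: share the existing inode
            else:
                shutil.copy2(src, dst)  # new or changed: store a fresh copy
    return target
```

Every snapshot directory looks like a complete copy you can browse with ordinary tools, but only changed files consume new space. Note that this sketch works at the file level, so a one-byte change to a large file stores the whole file again.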
Let’s look at the typical concerns about a snapshot-only backup system.
- Snapshots don’t protect against the loss of a volume
- This is correct, if you don’t replicate the snapshots. If you have only one copy and you have a double-disk failure in a RAID5 volume, your data goes bye-bye; therefore, I would only consider a snapshot-based protection system valid if the snapshots are replicated to another physical location. Some customers even have an onsite and an offsite copy. They use the onsite copy for BC/HA and the offsite copy for DR.
- A malevolent backup admin can delete all your snapshots
- A malevolent backup admin can delete your backup catalog & overwrite all your tapes with a for-loop. What’s your point? In addition, just like backup systems that have WORM capabilities to deal with this sort of thing, snapshot systems have similar protection mechanisms. But I would argue that you are never completely safe in either world from a malevolent internal person with superuser privileges. Death to all tyrants and background checks for all backup admins. Google Roger Duronio. That’s all I’m going to say about that.
- There is no backup catalog/index/history of snapshots
- The argument here is that the backup catalog/index/database of all the backups is instrumental in finding your data, but I argue that it’s instrumental in finding your data only when you’ve changed its format. When you’ve copied all your data onto tape or disk and encapsulated it into tar, open-tape, or whatever format, you must have the catalog to find your file. But what if you never change the format of the data and it’s just sitting in a directory structure? I think I can say that I’ve done my fair number of restores, so let’s talk about three types of such restores.
- The first type of restore is when you know where the file is and when it was last good. To restore this file you go to your backup software, select the appropriate file and version and press “Restore.” How does one accomplish this if using snapshots? You don't need a catalog. You change directory to the snapshot directory, then change directory again into the appropriate day (e.g. daily.3) and finally change directory to where the file(s) is/are that you are looking for. Take your file and copy it where it needs to go and you’re done. (This is the same number of steps as you would perform in a typical backup package; they’re just different steps.) In fancier snapshot-based systems (and even some of the free ones, such as BackupPC), there are also GUIs that will handle all of this for you. In addition, some systems have even integrated with the Windows “Previous Versions” tab, so a user can see the previous versions of a given file and restore it themselves. Now that’s a thing of beauty.
- The second type of restore is when you know where the file is, but are not sure when it was corrupted. With some backup software products you can easily see all the versions of a given file in one view and easily determine the last version of a given file and grab it. You don't need a catalog to do this with a snapshot system, either. Consider an example. Let’s say the file you need is /dir1/dir2/dir3/filename. First you cd to the snapshot directory, then issue the command ls -l (or dir) */dir1/dir2/dir3/filename. Now you have all the versions of the file in front of you and can easily see which one is the one you want. Again, the previous versions tab would also help here.
- What about the type of restore where you’re not even sure where the file is you’re looking for? I’d argue that many backup products don’t handle this very well, and this is where a snapshot system can really excel. Buy a best-of-breed indexing appliance (such as the Google appliance that costs less than a single tape drive) and you get far more than just “find this file.” You now get to find files based on their contents and all sorts of things (e.g. show me all the files with the word ABC in them). That’s soooo much better than what most backup products can do (with the exception here being CommVault; they do index their backups based on content).
- Update: @Storagezilla says that by adding a search box, I'm adding a catalog. I'm fine with that statement. The big reason I'm ok with Mark's statement is that the point of the original tweeter (who made the statement that became this objection) was that snapshots weren't valid as a backup system because they don't have a catalog. My point is that I don't think you need one, but if you disagree with me you can get one for a few thousand dollars. So it's not a valid criticism of snapshot-based backup.
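If you'd rather script the "show me every version of this file" lookup than type the wildcard by hand, a rough Python equivalent might look like the following. The function name `list_versions` is hypothetical, and it assumes each snapshot is a plain directory directly under one snapshot root, as in the hard-link style of system described above.

```python
from datetime import datetime
from pathlib import Path


def list_versions(snapshot_root: Path, relative_path: str):
    """Return (snapshot name, size in bytes, mtime) for each snapshot holding the file."""
    versions = []
    for snap in sorted(p for p in snapshot_root.iterdir() if p.is_dir()):
        candidate = snap / relative_path
        if candidate.is_file():
            info = candidate.stat()
            versions.append(
                (snap.name, info.st_size, datetime.fromtimestamp(info.st_mtime))
            )
    return versions
```

Calling `list_versions(Path("/snapshots"), "dir1/dir2/dir3/filename")` (with `/snapshots` standing in for wherever your snapshots live) gives you one row per snapshot that contains the file, and the size and modification time are usually enough to spot the last good version. No catalog involved, just the directory tree.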
- What about data that’s not on the snapshot-based storage?
- This idea might work fine for all the data resident on your snapshot-based storage, but what about all those internal drives out there? Those drives need software running on the host that coordinates snapshots there and replicates them to other storage. Two examples of this are Microsoft’s Volume Shadow Copy Service (VSS) with Data Protection Manager (DPM), and NetApp’s Open Systems SnapVault (OSSV). MS VSS & DPM provide a completely self-contained (and totally snapshot-based) backup system (that only works with Windows systems, of course). NetApp’s OSSV will create snapshots on most open-systems-based OSs, and then replicate those snapshots to a NetApp filer and store them there.
- There is no centralized management, scheduling, reporting
- Even if there isn’t, I still argue that this doesn’t discount it as a backup system. It’s just not a scalable, manageable backup system. Even if all you had was a snapshot-based storage system where you had to script all your snapshots (such as early NetApp systems), I’d still argue that this is good enough for environments up to a certain size. Just because you have to write some shell scripts doesn’t make it not a backup system; it just means you’re managing all these scripts yourself. And plenty of people in organizations of all kinds of sizes are doing that for their backup system. Show me a large NBU/NW/TSM shop without shell scripts of some kind.
- Having made that argument, I should say that some storage systems do have a centralized management, scheduling, and reporting mechanism for their snapshots. Two examples here are NetApp’s (relatively new) Protection Manager and Microsoft’s DPM. NetApp PM can configure, schedule, and report on the snapshot-based backups of up to 250 filers (each of which may have over 100TB of storage) and 100,000 snapshot/replication relationships (snap this volume this often and replicate it to this volume), each of which can have up to 256 snapshots. In addition, it can handle OSSV snapshots as well. I’m not saying the product is perfect or will meet everyone’s needs, but I’m saying that it will meet some people’s needs. Microsoft DPM can also handle hundreds of Windows systems, and supports Exchange, SQL Server, SharePoint, etc.
- Snapshots take up too much space
- Anyone arguing this is obviously not comparing this to how much space regular backups take on tape or disk. Snapshots are delta-based and take up far less than most backups would take! Like disk-based backup systems, these snapshots can also be stored on ATA-based volumes, making the storage much cheaper. With an average daily block-level change rate of 1%, you would only need a disk system twice as big as your primary system to store 90 days of copies. (That’s the first “seed” copy at 100%, plus 90 copies at 1% each, for a total of 190%, or roughly 2x.) That is comparable to how much deduped disk you would need to buy, if not less.
- By the way, I’m not saying that you’ll have a 1% daily block-level change rate, but I have seen that as a very common number in real environments; in fact, I’ve seen the number be lower. My Avamar friends (who are arguing against what I’m arguing here) should at least be able to agree with this. An Avamar system sees a number very similar to this as the number of new blocks in a typical backup environment.
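The sizing math above is simple enough to put in a few lines of Python so you can plug in your own numbers. This is only the back-of-the-envelope linear model used in this post (one full seed copy plus one day's worth of changed blocks per retained day); the function name is made up, and real systems will vary with how much the daily changes overlap.

```python
def snapshot_capacity_ratio(retention_days: int, daily_change_rate: float) -> float:
    """Disk needed for one snapshot copy, as a multiple of the primary volume."""
    seed = 1.0                                   # the first full copy
    deltas = retention_days * daily_change_rate  # one delta per retained day
    return seed + deltas


print(f"{snapshot_capacity_ratio(90, 0.01):.2f}x")  # 1.90x
```

Try it with a 2% change rate or 30 days of retention and you can see quickly whether the "about 2x" rule of thumb holds for your environment.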
- Having a copy onsite and offsite? Isn’t that a lot of disk?
- First, this is not a requirement. It’s only what some people do. They replicate their primary volume to an onsite recovery volume (for HA purposes) and then replicate that to an offsite volume (for DR purposes). Second, while it’s true that this is a lot of disk, consider comparing it to some other systems. How many copies of your data do you need to pull off a replicated BCV setup that needs an asynchronous copy? I’ve seen some configs that have six copies, just to have one version on disk in a remote location.
- Finally, when asking the “isn’t that a lot of disk?” question, you also have to compare it to the cost of a traditional backup system. Assuming 90-day retention, a 1% daily block-level change rate, 1 onsite replica, and 1 offsite replica, you’re talking about 4 GB of backup disk for every 1 GB of primary data (two replicas at roughly 1.9x each). With traditional tape backups, you typically have 20 GB on tape for every 1 GB of primary data. You have to buy a backup system and a large tape library, and all of that will require much more attention than a good snapshot-based system. I’m not saying it’s necessarily cheaper, but I’d say that one onsite copy and one offsite copy (both with versions) beats any tape-based system for coolness (read: better able to meet business requirements) any day of the week.
- Too many snapshots slow down the production volume
- This is true for copy-on-write (COW) snapshot systems; however, there are at least two other snapshot systems of which I’m aware. With a COW snapshot system, the longer you keep your snapshots and the more snapshots you make, the worse the performance gets on the production volume. Nobody at any COW vendor actually knows how bad your performance would get if you kept 90 days of snapshots on a COW volume (because no one would do it), but one SE from a major TLA OEM estimated to a customer of mine a 50% drop in performance if they did that. However, other snapshot systems, such as redirect-on-write and WAFL, don’t have this problem. I’m not saying there is no difference in performance between a volume with no snapshots and one with hundreds of them, but I’m saying that customers tell me that the difference is an acceptable one.
- You can’t keep snapshot history for n months or n years
- This is a variation of the above statement that basically says that you can’t have that many snapshots. Most people that I talk to that keep their backups for, say, 6 months, do it something like this: They keep daily incrementals for two or four weeks, and then weekly fulls for six months. If you did that with snapshots, you’d have 26 weekly copies + 28 daily copies for a total of 54 snapshots. Easy peasy. Most snapshot systems can handle more than that, so let’s do more. Let’s add to that hourly snapshots that we also keep for a week, or 168 additional snapshots (24 x 7). 26 + 28 + 168 gives you 222 snapshots: a system that offers many more recovery options than the typical backup system and has 6 months retention, with a snapshot count that is well within what most snapshot-based products offer as a maximum.
- How much disk would you need for that? Again, based on a 1% daily block-level change rate you’d need 280% of the original volume (the 100% seed plus about 180 days of changes at 1% each). To protect a 1 TB volume, you’d need 1.8 TB of additional disk space. Again, that may not be cheap, but it’s not out of the question either.
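Here is the same retention schedule as a small Python sketch, in case you want to check a different mix of weekly, daily, and hourly snapshots against your vendor's maximum. The function names are hypothetical, treating six months as 180 days is my rounding, and the disk figure uses the same simple linear model as the rest of this post. Note that 24 hourly snapshots a day for 7 days comes to 168.

```python
def snapshot_count(weekly_weeks: int, daily_days: int, hourly_days: int) -> int:
    """Snapshots on hand: one per week, one per day, and 24 per hourly day."""
    return weekly_weeks + daily_days + hourly_days * 24


def extra_disk_tb(primary_tb: float, retention_days: int, daily_change_rate: float) -> float:
    """Disk needed beyond the seed copy, assuming one delta per retained day."""
    return primary_tb * retention_days * daily_change_rate


print(snapshot_count(26, 28, 7))                  # 26 + 28 + 168 = 222
print(f"{extra_disk_tb(1.0, 180, 0.01):.1f} TB")  # 1.8 TB
```

The hourly snapshots dominate the count but not the space, since the space depends on how long you keep changed blocks, not on how often you snap.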
- You need to integrate with the application
- I couldn’t agree more. If Exchange, SQL Server, and Oracle are not happy with what you’re doing, then neither will I be. This is where I do have to tip my hat to NetApp and FalconStor who have both written all kinds of application-level integration apps that “do the right thing” for each app they support.
- Snapshots = a bunch of scripts
- This misconception comes from the early days of NetApp. Before the days of the SnapManager line and the existence of Protection Manager, you had to be a script master to “do the right thing.” I wrote a lot of those scripts back in the day. I’m not saying that the current world is completely script free, but it’s a lot closer than it used to be. Protection Manager takes care of a lot of that now. Again, I’d like anyone to show me a large NBU/NW/TSM environment that is script free. They don’t exist. Shoot, show me a medium-sized environment of any of those products that’s script-free. Just because you have to write scripts does not make something not a backup.
- What about rolling code failure?
- This is an objection I’ve had for a long time and was the last objection of mine to fall. What happens if the snapshot-based-storage vendor does something wrong in their code and it cascades throughout the system? This is not an easy one to answer, but I believe you can mitigate this concern using rolling patch/upgrades and delaying replication during such upgrades. You can also mitigate by using different code. For example, in a NetApp shop, you could use SnapMirror between copy A and B, and SnapVault between copies B and C.
- If you are truly unconvinced of this particular issue, then I would suggest you keep a historical copy on some other medium, but that this medium doesn’t need much retention. Now that NetApp has abandoned their deduped VTL, you can’t do this with all NetApp, so you’ll need something like a Data Domain restorer that you can do bulk dumps to. You don’t have to have a long amount of history here, because you only need to be able to restore the whole filer after the cascading code failure. The restore would have all the history in it if you backed it up in a way that kept the snapshots.
- NetApp ASIS can’t handle volumes larger than N TB
- I again say that I’m trying not to argue NetApp’s case here, but one of the EMCers sent this to me as a DM. I would argue that this is a separate issue. Using snapshots/SnapMirror/SnapVault does not mean that you have to use ASIS. It’s a completely separate decision that has its own ramifications. If it works for you, great. Turn it on. If it doesn’t, then don’t. It has nothing to do with whether or not a snapshot-only protection mechanism is valid.
- This isn’t scalable
- I will say that you probably can’t provide enterprise-wide backups for, say, a 400 TB shop with a single snapshot management system. You’d have to buy more than one of them. But just because you need multiple “brains” to control it doesn’t mean it’s not a valid backup system. If that were true, then neither is NetBackup, NetWorker, TSM, or any other backup app. Show me a large shop with a single NBU/NW/TSM backup server/catalog. They don’t exist. Avamar has a much smaller limitation. Does that mean these aren’t backup products?
I moved this one down here because it's not an objection. It's an advantage of snapshots and a disadvantage of any other way.
- Backups take too long to recover from
- I wanted to make the point that EMC sales reps try to make all the time: “If you reach for a backup, you’re dead. Any RTO is going to be too long to meet the recovery needs of your business.” They may be trying to sell BCVs, Timefinder, and more storage arrays, but their point is completely valid. The recovery needs of many applications are simply not meetable by traditional backups.
- But a snapshot-based system can meet any RTO (and all but the most aggressive RPO) requirements. Since a snapshot-based system keeps the data in its native format, you can simply switch to the replicated copy — no restore required. If you reach for a backup, you’re dead. And with a snapshot-based system, you never reach for a “backup” in the traditional sense.
So there you have it. I’ve thought about this quite a bit and I’m still in the camp that says that this is not only an acceptable method of backup and recovery, it’s actually a pretty dang good one. You never do “restores.” In many cases, you can just start using the “backup” immediately while your primary is being repaired. Your data is never put into any magical, proprietary format. There’s no backup catalog to worry about, back up, or recover in a DR scenario. And I can’t come up with any valid objection against it.
If anyone else thinks of any valid objections, I’ll be glad to add them to the list. But for now, consider me a fan.