Written by W. Curtis Preston
Friday, 05 February 2010 20:40
This question has come up again. For what it’s worth, I’m still firmly in the camp that says that it is possible to have a complete backup system based solely on snapshots and replications — as long as you can address the criticisms that people have against the idea. So I thought I’d throw out all the objections and see how the concept does against them.
[ Update: Although I really don't like doing this, I've taken out a few paragraphs here that were in the original post. They were really not germane to the topic and I'd rather you first read the post without them. But so it doesn't look like I'm hiding something, I'm going to attach the paragraphs to the post as a comment.]
I, for one, have a completely snapshot-based system for my home data (4 TB of documents, music, and DVD images) and love it. It’s based on Linux, rsync, and a few shell scripts. I have onsite and offsite backups and don’t use anything resembling what most people would consider “backup software.” (And I do this even with the dozens of offers I get for free backup software and hardware.) So if snapshots/copies aren’t backups, then Mr. Backup doesn’t back up his own data. And since BackupPC (which has tens of thousands of open-source users) is essentially a fancy version of what I’m doing, it should probably change its name to SnapshotPC. Oh, and as long as I’m on it, Time Machine is essentially the same thing as well. It is not a “backup” in the traditional sense; it’s just a copy with hard links, just like my system.
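For readers who have not seen the "copy with hard links" idea, here is a minimal sketch of it in Python rather than the rsync and shell scripts mentioned above. It is an illustration only: the function names `take_snapshot` and `_unchanged`, the snapshot names such as daily.0 and daily.1, and the rule that "same size and same modification time means unchanged" are all assumptions, not a description of any particular tool.

```python
import os
import shutil
from pathlib import Path
from typing import Optional


def _unchanged(current: Path, older: Path) -> bool:
    """Assumption: same size and same mtime (to the second) means same content."""
    a, b = current.stat(), older.stat()
    return a.st_size == b.st_size and int(a.st_mtime) == int(b.st_mtime)


def take_snapshot(source: Path, snapshot_root: Path, name: str,
                  previous: Optional[str] = None) -> Path:
    """Create snapshot_root/name as a complete directory tree of `source`.

    Files unchanged since the `previous` snapshot are hard-linked to it, so
    they use no additional data blocks; new or changed files are copied.
    """
    dest = snapshot_root / name
    prev = snapshot_root / previous if previous else None
    for dirpath, _dirnames, filenames in os.walk(source):
        rel = Path(dirpath).relative_to(source)
        (dest / rel).mkdir(parents=True, exist_ok=True)
        for filename in filenames:
            src_file = Path(dirpath) / filename
            dst_file = dest / rel / filename
            old_file = prev / rel / filename if prev else None
            if old_file is not None and old_file.is_file() and _unchanged(src_file, old_file):
                os.link(old_file, dst_file)       # share the copy already on disk
            else:
                shutil.copy2(src_file, dst_file)  # copy2 preserves the mtime
    return dest
```

Every snapshot directory looks like a full copy in its native format, yet only changed files consume new space, which is why no catalog or restore tool is needed to read it.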
Let’s look at the typical concerns about a snapshot-only backup system.
- Snapshots don’t protect against the loss of a volume
- This is correct, if you don’t replicate the snapshots. If you have only one copy and you have a double-disk failure in a RAID5 volume, your data goes bye-bye; therefore, I would only consider a snapshot-based protection system valid if the snapshots are replicated to another physical location. Some customers even have an onsite and an offsite copy. They use the onsite copy for BC/HA and the offsite copy for DR.
- A malevolent backup admin can delete all your snapshots
- A malevolent backup admin can delete your backup catalog & overwrite all your tapes with a for-loop. What’s your point? In addition, just like backup systems that have WORM capabilities to deal with this sort of thing, snapshot systems have similar protection mechanisms. But I would argue that you are never completely safe in either world from a malevolent internal person with superuser privileges. Death to all tyrants and background checks for all backup admins. Google Roger Duronio. That’s all I’m going to say about that.
- There is no backup catalog/index/history of snapshots
- The argument here is that the backup catalog/index/database of all the backups is instrumental in finding your data, but I argue that it’s instrumental in finding your data only when you’ve changed its format. When you’ve copied all your data onto tape or disk and encapsulated it into tar, open-tape, or whatever format, you must have the catalog to find your file. But what if you never change the format of the data and it’s just sitting in a directory structure? I think I can say that I’ve done my fair number of restores, so let’s talk about three types of such restores.
- The first type of restore is when you know where the file is and when it was last good. To restore this file you go to your backup software, select the appropriate file and version and press “Restore.” How does one accomplish this if using snapshots? You don't need a catalog. You change directory to the snapshot directory, then change directory again into the appropriate day (e.g. daily.3) and finally change directory to where the file(s) is/are that you are looking for. Take your file and copy it where it needs to go and you’re done. (This is the same number of steps as you would perform in a typical backup package; they’re just different steps.) In fancier snapshot-based systems (and even some of the free ones, such as BackupPC), there are also GUIs that will handle all of this for you. In addition, some systems have even integrated with the Windows “Previous Versions” tab, so a user can see the previous versions of a given file and restore it themselves. Now that’s a thing of beauty.
- The second type of restore is when you know where the file is, but are not sure when it was corrupted. With some backup software products you can easily see all the versions of a given file in one view, determine the last good version, and grab it. You don't need a catalog to do this with a snapshot system, either. Consider an example. Let’s say the file you need is /dir1/dir2/dir3/filename. First you cd to the snapshot directory, then issue the command ls -l (or dir) */dir1/dir2/dir3/filename. Now you have all the versions of the file in front of you and can easily see which one is the one you want. Again, the previous versions tab would also help here.
- What about the type of restore where you’re not even sure where the file is you’re looking for? I’d argue that many backup products don’t handle this very well, and this is where a snapshot system can really excel. Buy a best-of-breed indexing appliance (such as the Google appliance that costs less than a single tape drive) and you get far more than just “find this file.” You now get to find files based on their contents and all sorts of things (e.g. show me all the files with the word ABC in them). That’s soooo much better than what most backup products can do (with the exception here being CommVault; they do index their backups based on content).
- Update: @Storagezilla says that by adding a search box, I'm adding a catalog. I'm fine with that statement. The big reason I'm ok with Mark's statement is that the point of the original tweeter (who made the statement that became this objection) was that snapshots weren't valid as a backup system because they don't have a catalog. My point is that I don't think you need one, but if you disagree with me you can get one for a few thousand dollars. So it's not a valid criticism of snapshot-based backup.
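The second type of restore above (listing every version of one file across all snapshots) can also be scripted. The following is a minimal sketch in Python, assuming each snapshot is a subdirectory of a single snapshot root (daily.0, daily.1, and so on); the function name `list_versions` and that layout are assumptions for illustration, not features of any specific product.

```python
from datetime import datetime
from pathlib import Path
from typing import List, Tuple


def list_versions(snapshot_root: Path, relative_path: str) -> List[Tuple[str, int, datetime]]:
    """Return (snapshot name, size in bytes, modification time) for every
    snapshot under `snapshot_root` that contains `relative_path`."""
    relative_path = relative_path.lstrip("/")  # treat /dir1/... as relative to each snapshot
    versions = []
    for snapshot in sorted(snapshot_root.iterdir()):
        candidate = snapshot / relative_path
        if candidate.is_file():
            info = candidate.stat()
            versions.append((snapshot.name, info.st_size,
                             datetime.fromtimestamp(info.st_mtime)))
    return versions
```

Calling `list_versions(Path("/snapshots"), "/dir1/dir2/dir3/filename")` (with a hypothetical /snapshots root) gives one row per snapshot that holds the file, and a change in size or timestamp between rows shows roughly when the file changed.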
- What about data that’s not on the snapshot-based storage
- This idea might work fine for all the data resident on your snapshot-based storage, but what about all those internal drives out there? Those drives need some software that runs on the host, coordinates snapshots on the host, and replicates them to other storage. Two examples of this are Microsoft’s Volume Shadow Copy Service (VSS) with Data Protection Manager (DPM), and NetApp’s Open Systems SnapVault (OSSV). MS VSS & DPM provide a completely self-contained (and totally snapshot-based) backup system (that only works with Windows systems, of course). NetApp’s OSSV will create snapshots on most open-systems OSes, and then replicate those snapshots to a NetApp filer and store them there.
- There is no centralized management, scheduling, reporting
- Even if there isn’t, I still argue that this doesn’t discount it as a backup system. It’s just not a scalable, manageable backup system. Even if all you had was a snapshot-based storage system where you had to script all your snapshots (such as early NetApp systems), I’d still argue that this is good enough for environments up to a certain size. Just because you have to write some shell scripts doesn’t make it not a backup system; it just means you’re managing all these scripts yourself. And plenty of people in organizations of all kinds of sizes are doing that for their backup system. Show me a large NBU/NW/TSM shop without shell scripts of some kind.
- Having made that argument, I should say that some storage systems do have a centralized management, scheduling, and reporting mechanism for their snapshots. Two examples here are NetApp’s (relatively new) Protection Manager and Microsoft’s DPM. NetApp PM can configure, schedule, and report on the snapshot-based backups of up to 250 filers (each of which may have over 100TB of storage) and 100,000 snapshot/replication relationships (snap this volume this often and replicate it to this volume), each of which can have up to 256 snapshots. In addition, it can handle OSSV snapshots as well. I’m not saying the product is perfect or will meet everyone’s needs, but I’m saying that it will meet some people’s needs. Microsoft DPM can also handle hundreds of Windows systems, and supports Exchange, SQL Server, SharePoint, etc.
- Snapshots take up too much space
- Anyone arguing this is obviously not comparing this to how much space regular backups take on tape or disk. Snapshots are delta-based and take up far less than most backups would take! Like disk-based backup systems, these snapshots can also be stored on ATA-based volumes, making the storage much cheaper. With an average daily block-level change rate of 1%, you would only need a disk system twice as big as your primary system to store 90 days of copies. (That’s the first “seed” copy at 100%, plus 90 copies at 1% each, for a total of 190%, or about 2x.) That is comparable to how much deduped disk you would need to buy, if not less.
- By the way, I’m not saying that you’ll have a 1% daily block-level change rate, but I have seen that as a very common number in real environments; in fact, I’ve seen the number be lower. My Avamar friends (who are arguing against what I’m arguing here) should at least be able to agree with this. An Avamar system sees a number very similar to this as the number of new blocks in a typical backup environment.
- Having a copy onsite and offsite? Isn’t that a lot of disk?
- First, this is not a requirement. It’s only what some people do. They replicate their primary volume to an onsite recovery volume (for HA purposes) and then replicate that to an offsite volume (for DR purposes). Second, while it’s true that this is a lot of disk, consider comparing it to some other systems. How many copies of your data do you need to pull off a replicated BCV setup that needs an asynchronous copy? I’ve seen some configs that have six copies, just to have one version on disk in a remote location.
- Finally, when asking the “isn’t that a lot of disk?” question, you also have to compare it to the cost of a traditional backup system. Assuming 90-day retention, a 1% daily block-level change rate, 1 onsite replica, and 1 offsite replica, you’re talking about 4 GB of backup disk for every 1 GB of primary data. With traditional tape backups, you typically have 20 GB on tape for every 1 GB. You have to buy a backup system and a large tape library, and all of that will require much more attention than a good snapshot-based system. I’m not saying it’s necessarily cheaper, but I’d say that one onsite copy and one offsite copy (both with versions) beats any tape-based system for coolness (read: better able to meet business requirements) any day of the week.
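The sizing arithmetic behind the 2x and 4 GB figures in the last two answers can be sketched as follows. This is a back-of-the-envelope model in Python, assuming each replica holds one full seed copy plus one delta per retained day; the function name `backup_disk_percent` and its parameters are made up for illustration, and real change rates vary by environment.

```python
def backup_disk_percent(retention_days: int, daily_change_pct: float,
                        copies: int = 1) -> float:
    """Backup disk needed, as a percentage of primary capacity.

    Each copy is modeled as a full seed (100%) plus one block-level delta
    of `daily_change_pct` percent for every retained day.
    """
    per_copy = 100.0 + retention_days * daily_change_pct
    return per_copy * copies


one_copy = backup_disk_percent(90, 1.0)              # seed + 90 daily deltas
two_copies = backup_disk_percent(90, 1.0, copies=2)  # onsite + offsite replicas

print(f"one copy:   {one_copy:.0f}% of primary")
print(f"two copies: {two_copies:.0f}% of primary, or {two_copies / 100:.1f} GB per GB")
```

With these inputs the model gives 190% for one copy and 380% for two, which the text rounds to 2x and to 4 GB per 1 GB. Compare that with the 20 GB per 1 GB quoted for traditional tape.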
- Too many snapshots slow down the production volume
- This is true for copy-on-write (COW) snapshot systems; however, there are at least two other snapshot systems of which I’m aware. With a COW snapshot system, the longer you keep your snapshots and the more snapshots you make, the worse the performance gets on the production volume. Nobody at any COW vendor actually knows how bad your performance would get if you kept 90 days of snapshots on a COW volume (because no one would do it), but one SE from a major TLA OEM estimated to a customer of mine a 50% drop in performance if they did that. However, other snapshot systems, such as redirect-on-write and WAFL, don’t have this problem. I’m not saying there is no difference in performance between a volume with no snapshots and one with hundreds of them, but I’m saying that customers tell me that the difference is an acceptable one.
- You can’t keep snapshot history for n months or n years
- This is a variation of the above statement that basically says that you can’t have that many snapshots. Most people that I talk to that keep their backups for, say, 6 months, do it something like this: They keep daily incrementals for two or four weeks, and then weekly fulls for six months. If you did that with snapshots, you’d have 26 weekly copies + 28 daily copies for a total of 54 snapshots. Easy peasy. Most snapshot systems can handle more than that, so let’s do more. Let’s add to that hourly snapshots that we also keep for a week, or 168 additional snapshots (24 a day for 7 days). 26 + 28 + 168 gives you 222 snapshots. That is a system that offers many more recovery options than the typical backup system and has 6 months retention, with a snapshot count that is well within what most snapshot-based products offer as a maximum.
- How much disk would you need for that? Again, based on a 1% daily block-level change rate you’d need 280% of the original volume. To protect a 1 TB volume, you’d need 1.8 TB of additional disk space. Again, that may not be cheap, but it’s not out of the question either.
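The retention arithmetic above is easy to check in a few lines. The sketch below is Python with made-up names (`TIERS`, `total_snapshots`, `MAX_SNAPSHOTS`). It counts the hourly tier as 24 snapshots a day for 7 days (168), which is what yields the 222 total, and it treats six months as 180 days, an assumption that reproduces the 280% figure. The 256 limit is the per-relationship number quoted earlier in this post; check the limit of your own product.

```python
# (snapshots taken per period, number of periods retained)
TIERS = {
    "weekly, kept 26 weeks": (1, 26),
    "daily, kept 28 days": (1, 28),
    "hourly, kept 7 days": (24, 7),
}

MAX_SNAPSHOTS = 256  # figure quoted earlier in the post; an assumption for any other product


def total_snapshots(tiers: dict) -> int:
    """Sum the snapshots retained across all tiers."""
    return sum(per_period * periods for per_period, periods in tiers.values())


count = total_snapshots(TIERS)

days_retained = 180      # assumption: six months treated as 180 days
daily_change_pct = 1     # the 1% daily block-level change rate used in the post
disk_percent = 100 + days_retained * daily_change_pct

print(f"{count} snapshots, within limit: {count <= MAX_SNAPSHOTS}")
print(f"{disk_percent}% of the original volume")
```

Changing the tiers (for example, keeping hourly snapshots for only two days) shows quickly how much headroom a given schedule leaves under the product's maximum.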
- You need to integrate with the application
- I couldn’t agree more. If Exchange, SQL Server, and Oracle are not happy with what you’re doing, then neither will I be. This is where I do have to tip my hat to NetApp and FalconStor who have both written all kinds of application-level integration apps that “do the right thing” for each app they support.
- Snapshots = a bunch of scripts
- This misconception comes from the early days of NetApp. Before the days of the SnapManager line and the existence of Protection Manager, you had to be a script master to “do the right thing.” I wrote a lot of those scripts back in the day. I’m not saying that the current world is completely script free, but it’s a lot closer than it used to be. Protection Manager takes care of a lot of that now. Again, I’d like anyone to show me a large NBU/NW/TSM environment that is script free. They don’t exist. Shoot, show me a medium-sized environment of any of those products that’s script-free. Just because you have to write scripts does not make something not a backup.
- What about rolling code failure?
- This is an objection I’ve had for a long time and was the last objection of mine to fall. What happens if the snapshot-based-storage vendor does something wrong in their code and it cascades throughout the system? This is not an easy one to answer, but I believe you can mitigate this concern using rolling patch/upgrades and delaying replication during such upgrades. You can also mitigate by using different code. For example, in a NetApp shop, you could use SnapMirror between copy A and B, and SnapVault between copies B and C.
- If you are truly unconvinced of this particular issue, then I would suggest you keep a historical copy on some other medium, but that this medium doesn’t need much retention. Now that NetApp has abandoned their deduped VTL, you can’t do this with all NetApp, so you’ll need something like a Data Domain restorer that you can do bulk dumps to. You don’t have to have a long amount of history here, because you only need to be able to restore the whole filer after the cascading code failure. The restore would have all the history in it if you backed it up in a way that kept the snapshots.
- NetApp ASIS can’t handle volumes larger than N TB
- I again say that I’m trying not to argue NetApp’s case here, but one of the EMCers sent this to me as a DM. I would argue that this is a separate issue. Using snapshots/SnapMirror/SnapVault does not mean that you have to use ASIS. It’s a completely separate decision that has its own ramifications. If it works for you, great. Turn it on. If it doesn’t, then don’t. It has nothing to do with whether or not a snapshot-only protection mechanism is valid.
- This isn’t scalable
- I will say that you probably can’t provide enterprise-wide backups for, say, a 400 TB shop with a single snapshot management system. You’d have to buy more than one of them. But just because you need multiple “brains” to control it doesn’t mean it’s not a valid backup system. If that were true, then neither is NetBackup, NetWorker, TSM, or any other backup app. Show me a large shop with a single NBU/NW/TSM backup server/catalog. They don’t exist. Avamar has a much smaller limitation. Does that mean these aren’t backup products?
Update: I moved this one down here because it's not an objection. It's an advantage of snapshots and a disadvantage of any other way.
Backups take too long to recover from
- I wanted to make the point that EMC sales reps try to make all the time: “If you reach for a backup, you’re dead. Any RTO is going to be too long to meet the recovery needs of your business.” They may be trying to sell BCVs, Timefinder, and more storage arrays, but their point is completely valid. The recovery needs of many applications are simply not meetable by traditional backups.
- But a snapshot-based system can meet any RTO (and all but the most aggressive RPO) requirements. Since a snapshot-based system keeps the data in its native format, you can simply switch to the replicated copy — no restore required. If you reach for a backup, you’re dead. And with a snapshot-based system, you never reach for a “backup” in the traditional sense.
So there you have it. I’ve thought about this quite a bit and I’m still in the camp that says that this is not only an acceptable method of backup and recovery, it’s actually a pretty dang good one. You never do “restores.” In many cases, you can just start using the “backup” immediately while your primary is being repaired. Your data is never put into any magical, proprietary format. There’s no backup catalog to worry about, back up, or recover in a DR scenario. And I can’t come up with any valid objection against it.
If anyone else thinks of any valid objections, I’ll be glad to add them to the list. But for now, consider me a fan.
Comments
BEX backup jobs to the NetApp target are stored as SnapVault snapshots. On the recovery side, the data is viewed via the BEX console from a drive perspective, e.g. the D: drive of a server. The search function works across the full backup history of that drive. So, if I needed all versions of "File-x" across 100 existing backup jobs, I could search through all 100 snaps in a single operation (wildcards are supported). I could also target a specific date range to narrow the search, and/or a file size range.
Peter Eicher
Syncsort
I believe this combined solution provides many of those things you list that NetApp lacks.
Peter Eicher
Syncsort
Admittedly these snaps are not as "safe" as a third "real" copy elsewhere, but still they are there, just in case. Once you declare a DR, you do still need to protect the DR data that is now prod. Personally, I favor cascading SnapMirror or SnapVault to a 3rd filer, and/or backup to NDMP tapes from DR. Belt AND suspenders.
Just to stick with the post, however, the fourth copy could still be another "copy," not something on tape or in traditional backup format.
It's all about what you want to pay for.
You've put down some really good points to consider regarding whether or not snapshots can be used in place of traditional backups (preferably combined with replication) and how they compare/contrast to the role backup applications typically play. However, there was one point that you didn't discuss, in regards to overall backup architecture, and I was a little surprised to see it wasn't called out as a concern.
In the scenarios you've described, if the primary site goes offline for any reason and you move your operations to the DR site, you are now actively transacting on the array containing the only copy (or copies) of your critical data. You are also doing so in an inherently stressful situation that is prone to missteps. I've always viewed the complete lack of an independent copy of the data at the DR location as a serious flaw in a pure snapshot and replication architecture. I suppose you could have a fourth array deployed with a third replication copy (Primary copy@primary site, HA/BC replication copy@primary site, DR replication copy@DR site, and DoubleDR replication copy@DR site).
I'm curious to hear your stance on whether you feel this simply isn't a valid concern, was overlooked in the original points, or was simply outside the bounds of the original discussion.
Thanks.
Here's your blog entry:
storagezilla.typepad.com/storagezilla/2009/09/are-point-in-time-copies-backups.html
In comments in that post you say that PIT copies (aka snapshots) are NOT a replacement for backup -- even if they're replicated off the system. You said that they are just "a fast and non-disruptive way of giving you an image to backup."
In a later comment in that same post, you also say not to use RecoverPoint for "months of retention," which is what I'm suggesting to do in this post.
Scott Waterhouse has argued this particular point very vehemently both publicly and privately. When discussing a customer's options in comments on this blog, he said that SnapMirror and SnapVault are "not backup."
http://thebackupblog.typepad.com/thebackupblog/2009/08/something-doesnt-add-up-and-never-will.html (link removed because the article has disappeared. See comment below.)
And he blogged about it in detail here:
http://thebackupblog.typepad.com/thebackupblog/2009/04/is-a-copy-a-backup.html (link removed because the article has disappeared. See comment below.)
And...
I was at a very large customer (I say very large only to tell you that the size of this company dictates that they get the best and brightest sales reps EMC had to offer). The customer told the EMC sales reps that they wanted to do what I described in this post (90 days of snaps on one Celerra replicated to another Celerra), and the reps told the customer that this was the stupidest thing they'd ever heard of. "Why?" the customer asked. "Because it would kill the performance of the Celerra." "How much?" "We don't know, because no one does that, but I'd guess at least 50%."
As to the customer (@sharney) doing what I described in this blog, he has now explained that he has to snap both source and destination. If he's replicating application data, that's not a valid method of getting a good backup of it. You have to put the app in a backup mode, then snap it. How do you do that with a replica? This is another part of what I meant when I said "EMC can't do snapshots the way NetApp can."
In http://thebackupblog.typepad.com/thebackupblog/2009/08/something-doesnt-add-up-and-never-will.html he said:
Quote:
And in
http://thebackupblog.typepad.com/thebackupblog/2009/04/is-a-copy-a-backup.html
he said this:
Quote:It seems pretty clear to me that the author of this post believes that backup has to change forms in order for it to be called a backup. By inference, this means that the system I described above does not meet his definition of a backup.
---
I am aware that the anti-snapshot argument is often proffered by EMC folks and the pro-snapshot argument often comes from NetApp folks. While I'm sure that they all strongly believe what they're saying, it's still a point-of-view that is based partly on where they work. EMC can't do snapshots the way NetApp can -- and they sell some of the backup and reporting software that you might do without if you went down the snapshot-only route. So I highly doubt that there are any training sessions within the Hallowed Halls of Hopkinton about any advantages that such a system might have. Their employees also have no reason to learn the advantages of the other approach. So it's no surprise that anyone that works there would have a dim view of such. On the flip side, NetApp doesn't sell backup software. It doesn't sell backup reporting software. Heck, it no longer even sells a target for traditional backups that I would buy (No, Val, I don't consider FAS w/ASIS a target dedupe system), so they are pretty much out of that business as far as I'm concerned. So they have a vested interest in espousing their point of view as well. Why buy any of that EMC software when you can buy everything from them?
Do I have a frame of reference too? Of course. Having worked with hundreds of customers using both approaches, I have seen first hand what they offer, and I still see the merits of both. Usually I am defending the monolithic, central backup software world (as opposed to many, non-integrated point solutions), but today I am arguing that the snapshot-based approach is also valid.
I want to say firmly that while the EMC and NetApp guys will interpret this as a pro-NetApp post, I am NOT arguing NetApp over EMC here. I am defending the concept of snapshot-based backups, which the EMC guys are saying is absolute bollocks. (That was just for you, Mark. I could have said "pants," but I like to see UK/AUS audiences giggle when I throw that word into a preso.) Yes, I am well-aware that NetApp is the only major company that is pushing this idea, but there are other companies (e.g. Compellent, FalconStor, Dell) that also go down this route, albeit with less success than NetApp has had. In addition, Microsoft has had quite a bit of luck selling Data Protection Manager. Guess what? Totally snapshot-based. Then, of course, there's Time Machine, Mac OS's built-in backup tool. Also totally snapshot-based, although only at the file level.
I am not, nor have I ever been, an employee of NetApp, Dell, Microsoft, or FalconStor. I don't own stock in NetApp, Compellent, FalconStor, or any other snapshot-based company either. And they're not slipping me cash on the side for these posts. So anyone who wants to accuse me of blogging what I'm blogging because I'm in someone's pocket obviously has no idea who they're talking about.
FWIW, I have seen at least two people post on the other side of this issue that aren't EMCers, and they're Preston deGuise (@prestondeguise) and Stephen Foskett (@sfoskett) in his comments on my blog post. I believe that I've addressed both of their concerns in the rest of this post. Feel free to check out their blogs for any anti-responses. They're both people I respect a lot, even if Preston does talk funny and likes to live with Koalas, and Stephen likes to work for San Diego-based companies (where I live) but can't bring himself to move here.
I honestly have to say that I'm very surprised by your response, and in the end we were mainly arguing about what we thought the other person was saying. It sure doesn't seem to match what I've historically run into when talking to EMC folks. Snapshot backup discussions very quickly become discussions about NetApp, and then EMC people start seeing red, so I've never had a snapshot discussion with an EMCer that ended well.
Which is all I meant when I started out the post saying that the anti-stuff on this topic often comes from EMCers. I honestly wasn't trying to pick a fight. If I had it to do all over again, I'd take that paragraph out. It's not germane to the discussion.
Thanks for the discussion.