Can you have a backup system based solely on snapshots and replication?

This question has come up again. For what it’s worth, I’m still firmly in the camp that says that it is possible to have a complete backup system based solely on snapshots and replications — as long as you can address the criticisms that people have against the idea. So I thought I’d throw out all the objections and see how the concept does against them.

[Update: Although I really don’t like doing this, I’ve taken out a few paragraphs here that were in the original post.  They were really not germane to the topic and I’d rather you first read the post without them.  But so it doesn’t look like I’m hiding something, I’m going to attach the paragraphs to the post as a comment.]

I, for one, have a completely snapshot-based system for my home data (4 TB of documents, music, and DVD images) and love it.  It’s based on Linux, rsync, and a few shell scripts.  I have onsite and offsite backups and don’t use anything resembling what most people would consider “backup software.”  (And I do this even with the dozens of offers I get for free backup software and hardware.)  So if snapshots/copies aren’t backups, then Mr. Backup doesn’t back up his own data.  And since BackupPC (which has tens of thousands of open-source users) is essentially a fancy version of what I’m doing, it should probably change its name to SnapshotPC.  Oh, and as long as I’m on it, Time Machine is essentially the same thing as well.  It is not a “backup” in the traditional sense; it’s just a copy with hard links, just like my system.
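To make the "copy with hard links" idea concrete, here is a minimal Python sketch of one snapshot pass. It is not my actual scripts (those are rsync and shell); it only illustrates the technique, in the spirit of rsync's `--link-dest` option. The function name `take_snapshot` and the `daily.N` directory names are assumptions made for the example, and it assumes the snapshots live on one filesystem that supports hard links.

```python
import filecmp
import os
import shutil


def take_snapshot(source, snapshot_root, name, previous=None):
    """Create snapshot_root/name as a complete-looking copy of source.

    Files whose contents are unchanged since the `previous` snapshot are
    hard-linked to the older copy, so they use no additional data blocks.
    New or changed files are copied in full.
    """
    target = os.path.join(snapshot_root, name)
    for dirpath, _dirnames, filenames in os.walk(source):
        rel = os.path.relpath(dirpath, source)
        target_dir = os.path.normpath(os.path.join(target, rel))
        os.makedirs(target_dir, exist_ok=True)
        for filename in filenames:
            src = os.path.join(dirpath, filename)
            dst = os.path.join(target_dir, filename)
            old = None
            if previous is not None:
                old = os.path.normpath(
                    os.path.join(snapshot_root, previous, rel, filename))
            if old and os.path.isfile(old) and filecmp.cmp(src, old, shallow=False):
                os.link(old, dst)       # unchanged: share the existing copy
            else:
                shutil.copy2(src, dst)  # new or changed: store a fresh copy
    return target
```

Calling `take_snapshot("/data", "/snapshots", "daily.0", previous="daily.1")` would leave `daily.0` looking like a full copy of `/data` while only the changed files take up new space. Every snapshot is an ordinary directory tree, which is why no special restore tool is needed. Replicating the snapshot directory to another location is a separate step that this sketch does not cover.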

Let’s look at the typical concerns about a snapshot-only backup system.

  • Snapshots don’t protect against the loss of a volume
    • This is correct if you don’t replicate the snapshots.  If you have only one copy and you have a double-disk failure in a RAID5 volume, your data goes bye-bye; therefore, I would only consider a snapshot-based protection system valid if the snapshots are replicated to another physical location.  Some customers even have an onsite and an offsite copy. They use the onsite copy for BC/HA and the offsite copy for DR.
  • A malevolent backup admin can delete all your snapshots
    • A malevolent backup admin can delete your backup catalog & overwrite all your tapes with a for-loop.  What’s your point?  In addition, just like backup systems that have WORM capabilities to deal with this sort of thing, snapshot systems have similar protection mechanisms.  But I would argue that you are never completely safe in either world from a malevolent internal person with superuser privileges.  Death to all tyrants and background checks for all backup admins.  Google Roger Duronio.  That’s all I’m going to say about that.
  • There is no backup catalog/index/history of snapshots
    • The argument here is that the backup catalog/index/database of all the backups is instrumental in finding your data, but I argue that it’s instrumental in finding your data only when you’ve changed its format.  When you’ve copied all your data onto tape or disk and encapsulated it into tar, open-tape, or whatever format, you must have the catalog to find your file.  But what if you never change the format of the data and it’s just sitting in a directory structure?  I think I can say that I’ve done my fair number of restores, so let’s talk about three types of such restores.
    • The first type of restore is when you know where the file is and when it was last good. To restore this file you go to your backup software, select the appropriate file and version and press “Restore.”  How does one accomplish this if using snapshots?  You don’t need a catalog. You change directory to the snapshot directory, then change directory again into the appropriate day (e.g. daily.3) and finally change directory to where the file(s) is/are that you are looking for.  Take your file and copy it where it needs to go and you’re done. (This is the same number of steps as you would perform in a typical backup package; they’re just different steps.) In fancier snapshot-based systems (and even some of the free ones, such as BackupPC), there are also GUIs that will handle all of this for you.  In addition, some systems have even integrated with the Windows “Previous Versions” tab, so a user can see the previous versions of a given file and restore it themselves.  Now that’s a thing of beauty.
    • The second type of restore is when you know where the file is, but are not sure when it was corrupted. With some backup software products you can easily see all the versions of a given file in one view, determine the last good version, and grab it.  You don’t need a catalog to do this with a snapshot system, either.  Consider an example.  Let’s say the file you need is /dir1/dir2/dir3/filename.  First you cd to the snapshot directory, then issue the command ls -l (or dir) */dir1/dir2/dir3/filename.  Now you have all the versions of the file in front of you and can easily see which one is the one you want.  Again, the “Previous Versions” tab would also help here.
    • What about the type of restore where you’re not even sure where the file is you’re looking for?  I’d argue that many backup products don’t handle this very well, and this is where a snapshot system can really excel.  Buy a best-of-breed indexing appliance (such as the Google appliance that costs less than a single tape drive) and you get far more than just “find this file.”  You now get to find files based on their contents and all sorts of things (e.g. show me all the files with the word ABC in them).  That’s soooo much better than what most backup products can do (with the exception here being CommVault; they do index their backups based on content).
    • Update: @Storagezilla says that by adding a search box, I’m adding a catalog.  I’m fine with that statement.  The big reason I’m ok with Mark’s statement is that the point of the original tweeter (who made the statement that became this objection) was that snapshots weren’t valid as a backup system because they don’t have a catalog.  My point is that I don’t think you need one, but if you disagree with me you can get one for a few thousand dollars.  So it’s not a valid criticism of snapshot-based backup.
  • What about data that’s not on the snapshot-based storage?
    • This idea might work fine for all the data resident on your snapshot-based storage, but what about all those internal drives out there?  Those drives need some software that runs on the host, coordinates snapshots on the host, and replicates them to other storage. Two examples of this are Microsoft’s Volume Shadow Copy Service (VSS) and Data Protection Manager (DPM), and NetApp’s Open Systems SnapVault (OSSV). MS VSS & DPM provide a completely self-contained (and totally snapshot-based) backup system (that only works with Windows systems, of course). NetApp’s OSSV will create snapshots on most open-systems OSes, and then replicate those snapshots to, and store them on, a NetApp filer.
  • There is no centralized management, scheduling, reporting
    • Even if there isn’t, I still argue that this doesn’t discount it as a backup system.  It’s just not a scalable, manageable backup system.  Even if all you had was a snapshot-based storage system where you had to script all your snapshots (such as early NetApp systems), I’d still argue that this is good enough for environments up to a certain size.  Just because you have to write some shell scripts doesn’t make it not a backup system; it just means you’re managing all these scripts yourself.  And plenty of people in organizations of all kinds of sizes are doing that for their backup system.  Show me a large NBU/NW/TSM shop without shell scripts of some kind.
    • Having made that argument, I should say that some storage systems do have a centralized management, scheduling, and reporting mechanism for their snapshots.  Two examples here are NetApp’s (relatively new) Protection Manager and Microsoft’s DPM.  NetApp PM can configure, schedule, and report on the snapshot-based backups of up to 250 filers (each of which may have over 100TB of storage) and 100,000 snapshot/replication relationships (snap this volume this often and replicate it to this volume), each of which can have up to 256 snapshots.  In addition, it can handle OSSV snapshots as well.  I’m not saying the product is perfect or will meet everyone’s needs, but I’m saying that it will meet some people’s needs.  Microsoft DPM can also handle hundreds of Windows systems, and supports Exchange, SQL Server, SharePoint, etc.
  • Snapshots take up too much space
    • Anyone arguing this is obviously not comparing this to how much space regular backups take on tape or disk.  Snapshots are delta-based and take up far less than most backups would take!  Like disk-based backup systems, these snapshots can also be stored on ATA-based volumes, making the storage much cheaper.  With an average daily block-level change rate of 1%, you would only need a disk system twice as big as your primary system to store 90 days of copies.  (That’s the first “seed” copy at 100%, plus 90 copies at 1% each, for a total of 190%, or roughly 2x.)  That is comparable to how much deduped disk you would need to buy, if not less.
    • By the way, I’m not saying that you’ll have a 1% daily block-level change rate, but I have seen that as a very common number in real environments; in fact, I’ve seen the number be lower.  My Avamar friends (who are arguing against what I’m arguing here) should at least be able to agree with this.  An Avamar system sees a number very similar to this as the number of new blocks in a typical backup environment.
  • Having a copy onsite and offsite?  Isn’t that a lot of disk?
    • First, this is not a requirement.  It’s only what some people do.  They replicate their primary volume to an onsite recovery volume (for HA purposes) and then replicate that to an offsite volume (for DR purposes).  Second, while it’s true that this is a lot of disk, consider comparing it to some other systems.  How many copies of your data do you need to pull off a replicated BCV setup that needs an asynchronous copy? I’ve seen some configs that have six copies just to have one version on disk in a remote location.
    • Finally, when asking the “isn’t that a lot of disk?” question, you also have to compare it to the cost of a traditional backup system. Assuming 90-day retention, a 1% daily block-level change rate, one onsite replica, and one offsite replica, you’re talking about 4 GB of backup disk for every 1 GB of primary data (two replicas at roughly 1.9 GB each). With traditional tape backups, you typically have 20 GB on tape for every 1 GB of primary data.  You have to buy a backup system and a large tape library, and all of that will require much more attention than a good snapshot-based system. I’m not saying it’s necessarily cheaper, but I’d say that one onsite copy and one offsite copy (both with versions) beats any tape-based system for coolness (read: better able to meet business requirements) any day of the week.
  • Too many snapshots slow down the production volume
    • This is true for copy-on-write (COW) snapshot systems; however, there are at least two other snapshot systems of which I’m aware.  With a COW snapshot system, the longer you keep your snapshots and the more snapshots you make, the worse the performance gets on the production volume.  Nobody at any COW vendor actually knows how bad your performance would get if you kept 90 days of snapshots on a COW volume (because no one would do it), but one SE from a major TLA OEM estimated to a customer of mine a 50% drop in performance if they did that.  However, other snapshot systems, such as redirect-on-write and WAFL don’t have this problem.  I’m not saying there is no difference in performance between a volume with no snapshots and one with hundreds of them, but I’m saying that customers tell me that the difference is an acceptable one.
  • You can’t keep snapshot history for n months or n years
    • This is a variation of the above statement that basically says that you can’t have that many snapshots.  Most people that I talk to that keep their backups for, say, 6 months, do it something like this: They keep daily incrementals for two or four weeks, and then weekly fulls for six months.  If you did that with snapshots, you’d have 26 weekly copies + 28 daily copies for a total of 54 snapshots.  Easy peasy.  Most snapshot systems can handle more than that, so let’s do more.  Let’s add to that hourly snapshots that we also keep for a week, or 168 additional snapshots (24 x 7). 26 + 28 + 168 gives you 222 snapshots: a system that offers many more recovery options than the typical backup system, has 6 months of retention, and stays well within what most snapshot-based products offer as a maximum number of snapshots.
    • How much disk would you need for that?  Again, based on a 1% daily block-level change rate over roughly 180 days, you’d need 280% of the original volume: a full copy plus 180% of changed blocks.  To protect a 1 TB volume, you’d need 1.8 TB of disk for the changes on top of the 1 TB copy.  Again, that may not be cheap, but it’s not out of the question either.
  • You need to integrate with the application
    • I couldn’t agree more.  If Exchange, SQL Server, and Oracle are not happy with what you’re doing, then neither will I be.  This is where I do have to tip my hat to NetApp and FalconStor who have both written all kinds of application-level integration apps that “do the right thing” for each app they support.
  • Snapshots = a bunch of scripts
    • This misconception comes from the early days of NetApp.  Before the days of the SnapManager line and the existence of Protection Manager, you had to be a script master to “do the right thing.” I wrote a lot of those scripts back in the day.  I’m not saying that the current world is completely script free, but it’s a lot closer than it used to be. Protection Manager takes care of a lot of that now.  Again, I’d like anyone to show me a large NBU/NW/TSM environment that is script free.  They don’t exist.  Shoot, show me a medium-sized environment of any of those products that’s script-free.  Just because you have to write scripts does not make something not a backup.
  • What about rolling code failure?
    • This is an objection I’ve had for a long time and was the last objection of mine to fall.  What happens if the snapshot-based-storage vendor does something wrong in their code and it cascades throughout the system?  This is not an easy one to answer, but I believe you can mitigate this concern using rolling patch/upgrades and delaying replication during such upgrades.  You can also mitigate by using different code. For example, in a NetApp shop, you could use SnapMirror between copy A and B, and SnapVault between copies B and C.
    • If you are still not convinced on this particular issue, then I would suggest you keep a historical copy on some other medium; that medium doesn’t need much retention.  Now that NetApp has abandoned their deduped VTL, you can’t do this with an all-NetApp setup, so you’ll need something like a Data Domain restorer that you can do bulk dumps to.  You don’t have to have a long amount of history here, because you only need to be able to restore the whole filer after the cascading code failure.  The restore would have all the history in it if you backed it up in a way that kept the snapshots.
  • NetApp ASIS can’t handle volumes larger than N TB
    • I again say that I’m trying not to argue NetApp’s case here, but one of the EMCers sent this to me as a DM.  I would argue that this is a separate issue.  Using snapshots/SnapMirror/SnapVault does not mean that you have to use ASIS. It’s a completely separate decision that has its own ramifications.  If it works for you, great. Turn it on.  If it doesn’t, then don’t. It has nothing to do with whether or not a snapshot-only protection mechanism is valid.
  • This isn’t scalable
    • I will say that you probably can’t provide enterprise-wide backups for, say, a 400 TB shop with a single snapshot management system.  You’d have to buy more than one of them. But just because you need multiple “brains” to control it doesn’t mean it’s not a valid backup system.  If that were true, then neither is NetBackup, NetWorker, TSM, or any other backup app.  Show me a large shop with a single NBU/NW/TSM backup server/catalog.  They don’t exist. Avamar has a much smaller limitation.  Does that mean these aren’t backup products?
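The snapshot-count and disk-sizing arithmetic used in the objections above can be checked with a few lines of Python. This is a back-of-the-envelope sketch under the same assumptions as the text: a flat 1% daily block-level change rate, one full "seed" copy per replica, and "six months" taken as 180 days. The function names are made up for the example.

```python
def snapshot_count(weekly, daily, hourly_days=0):
    """Snapshots kept: weeklies + dailies + 24 hourlies per day of hourlies."""
    return weekly + daily + 24 * hourly_days


def backup_disk_percent(retention_days, daily_change_percent=1, copies=1):
    """Backup disk as a percentage of the primary volume.

    Each copy holds one full seed (100%) plus one day's changed blocks
    for every day of retention. Integer percentages avoid float rounding.
    """
    return copies * (100 + retention_days * daily_change_percent)


print(snapshot_count(weekly=26, daily=28, hourly_days=7))  # 222
print(backup_disk_percent(90))            # 190 -> roughly 2x primary
print(backup_disk_percent(90, copies=2))  # 380 -> roughly 4 GB per 1 GB
print(backup_disk_percent(180))           # 280 -> 1.8 TB of changes per 1 TB
```

Real change rates vary by workload, so treat the 1% figure as a planning assumption and plug in your own measured rate.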

Update: I moved this one down here because it’s not an objection.  It’s an advantage of snapshots and a disadvantage of any other way.

Backups take too long to recover from

  • I wanted to make the point that EMC sales reps try to make all the time: “If you reach for a backup, you’re dead. Any RTO is going to be too long to meet the recovery needs of your business.” They may be trying to sell BCVs, Timefinder, and more storage arrays, but their point is completely valid.  The recovery needs of many applications are simply not meetable by traditional backups.
  • But a snapshot-based system can meet any RTO (and all but the most aggressive RPO) requirements.  Since a snapshot-based system keeps the data in its native format, you can simply switch to the replicated copy — no restore required. If you reach for a backup, you’re dead.  And with a snapshot-based system, you never reach for a “backup” in the traditional sense.

So there you have it.  I’ve thought about this quite a bit and I’m still in the camp that says that this is not only an acceptable method of backup and recovery, it’s actually a pretty dang good one.  You never do “restores.”  In many cases, you can just start using the “backup” immediately while your primary is being repaired.  Your data is never put into any magical, proprietary format.  There’s no backup catalog to worry about, back up, or recover in a DR scenario.  And I can’t come up with any valid objection against it.
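As an illustration of the "no catalog" point, here is a small Python sketch of the `ls -l */dir1/dir2/dir3/filename` lookup described earlier. It assumes a hypothetical layout in which each snapshot is a subdirectory (for example `daily.0`, `daily.1`) under a single snapshot root; the function name `list_versions` is made up for the example.

```python
import glob
import os


def list_versions(snapshot_root, relative_path):
    """Return (snapshot_name, size_in_bytes, mtime) for each snapshot
    that contains relative_path, sorted by snapshot name."""
    pattern = os.path.join(glob.escape(snapshot_root), "*",
                           glob.escape(relative_path))
    versions = []
    for path in sorted(glob.glob(pattern)):
        info = os.stat(path)
        # The first path component under the root is the snapshot name.
        name = os.path.relpath(path, snapshot_root).split(os.sep)[0]
        versions.append((name, info.st_size, info.st_mtime))
    return versions
```

A call such as `list_versions("/snapshots", "dir1/dir2/dir3/filename")` lists every retained version of the file with its size and modification time, which is usually enough to spot the last good one. The filesystem itself acts as the index.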

If anyone else thinks of any valid objections, I’ll be glad to add them to the list.  But for now, consider me a fan.

Written by W. Curtis Preston (@wcpreston), four-time O'Reilly author, and host of The Backup Wrap-up podcast. I am now the Technology Evangelist at Sullivan Strickler, which helps companies manage their legacy data.

22 comments
  • I said a bunch of times that this wasn’t about EMC vs NetApp. The first EMC guy to respond (Mark Twomey, @storagezilla) obviously felt otherwise. He responded to this blog post with the following tweet:

    “@wcpreston No computer in front of me so no blog yet but I disagree. EMC has the best snapshot+rep solution on the market. RecoverPoint.”

    Look, EMC has some really good products that they should be really proud of. And I’d really rather my posts be about what works, rather than what doesn’t work. But, your honor, Mark opened the door and I must cross-examine.

    If RecoverPoint were as awesome as Mark says it is, you’d think I’d have heard more about it. (I recently surveyed over 6,000 backupcentral members and 350 of them responded. 5 were RecoverPoint customers. Other products that got a similar score are all products that Mark would consider non-players. Shoot, many of the “niche players” got better scores than RecoverPoint.) To quote another tweet from @Storagezilla today:

    “@storageio @stevedupe @wcpreston:Shut up & ship. That’s what I say to our product managers in meetings. No have customers you have nothing.”

    If you’ve got the greatest snapshot/replication solution in the industry, but no one’s buying it, you have nothing.

    But let’s take a look at the merits of his argument, shall we? IMO, even if RecoverPoint were the “best snapshot+rep solution on the market” it’s still not doing snapshots the way NetApp can (which is all I said in my post):

    a. NetApp’s snapshots are included with every system. RecoverPoint is a third-party add-on that EMC acquired, and it comes at a significant increase in price and complexity to the configuration.

    b. NetApp’s snapshots do not require a data-tap; RecoverPoint does. It watches all writes and then sends them to the recovery appliance, so that data-tap adds a load that NetApp’s snapshots don’t have.

    c. NetApp’s snapshots allow you to recover to 100s of points of time with a flick of a switch. Don’t like the current point? Flip the switch and choose another one. It’s available immediately. Last time I checked RecoverPoint takes a significant amount of time to move from NOW to sometime other than NOW. (In previous tests, the time required was deemed unacceptable, which is why the customer chose another solution.)

    d. NetApp’s snapshots have been deployed on every system they’ve shipped since day one. Think about the experience that gives them. By comparison, RecoverPoint is just beginning to walk.

    e. The EMC salesforce is so strong that it can sell ice to Eskimos. It’s so good that it sold a bunch of DL3Ds that have so many problems they’re now “swapping a lot of those boxes out at zero revenue. [They’ve] taken out about a dozen and [they’ll] continue to take out a similar number this quarter. Customers don’t want it.” (a quote from Frank Slootman, President of EMC’s Data Backup & Recovery division). With a salesforce so strong, how is it that RecoverPoint has only scratched the surface of EMC’s installed base?

  • Hi Curtis,

    I predict this is gonna be a very “popular” blog post, so figured I’d chime in with an early, concise comment. Taking your hint from the last time, the rest of it is here 🙂 http://bit.ly/9lqdZ4

    -Val.
    Cloud Czar, NetApp Office of the CTO
    Chair, SNIA Cloud Storage Initiative

  • Oh and your comment box doesn’t support Unicode.
    That’s a major bug and you should upgrade.

  • The precept that EMC is somehow against snapshot backup is ludicrous and demonstrably wrong. Indeed one of the first customer tweet responses you got was from a customer who was using snapshots on his Celerra for backup.

    I can’t speak for anyone else who works at EMC, but I’m not only fine with that if it solves his problem, I encourage its use.

    Factoring in that I’ve been using Point In Time copies for customer backup going back as far as the EDM days with SymmConnect I’ve used, designed and sold the following as parts of *larger* backup solutions in the past 12 months.

    -PowerSnap

    -Celerra CheckPoints

    -Clariion SnapView

    -Replication Manager

    -TimeFinder Snap, Clone and BCV

    -NetWorker Module for Microsoft Applications (VSS)

    And as mentioned

    -RecoverPoint

    So, EMC is against using snapshots or point in time copies? That’s a hell of a lot of tech using that functionality in the array.

    This is just some opinion of yours, like saying "The other side wants the terrorists to win" or "It’s all about the oil" rather than anything factual.

    It’s the cyclonic emotion machine at work as you need a villain for this piece and IBM, Hitachi and HP are just too boring.

    EMC would stop putting support for PIT backup into as many applications, storage hardware products, and utilities as it has if it didn’t want people to use them.

    You state you polled a couple of thousand backup customers. 5 of them were using RecoverPoint. That’s probably because Recoverpoint is a heterogeneous block level replication solution which supports multiple array types, VMware SRM, MSCS, etc without requiring you to commit to someone else’s storage array in front of what you have now.

    And it’s damn good at that but using it as a local Tivo it faces the same problem as the CDP vendors did, people don’t see why they should N+ the local storage when they have a tape or a deduplication storage appliance available as a backup target at a lower cost per GB.

    Much lower cost than replicating to an entirely different array just to get it off the box. People are still using NDMP to solve that problem and it works just fine.

    Does that change the fact that RecoverPoint is an amazing snapshot and replication solution? Hell no. The consistency group support alone is better than a lot of what you would paint as "mature" products. So mature they still don’t support consistency in replication.

    It is great tech.

    As for using a write splitter, this is correct. But the splitter exists either on the host, as a function of a storage array, or in the SAN fabric running in the switch code.

    Having demolished Topio in the heterogeneous replication market RecoverPoint is doing quite well.

    It’s not just me saying RP is a great tech either.

    http://blog.virtualtacit.com/2009/06/26/emc-recoverpoint-3-2-released-and-why-you-should-care/

    And since we’re quoting people where does EMC come out and say it’s against snapshots for backup?

    Now I, like a lot of customers, prefer to roll point in time backups off the array simply to protect against something happening to the array or the data centre. Some people use replication, assuming they’re willing to pay the cost of another array on a floor tile to keep another copy of what they have. Cost-constrained customers aren’t willing to make that investment, so they move to a tier of storage with a lower cost per GB entirely.

    I think like the other major backup software and system vendors it’s more accurate to say EMC holds the view that Snapshots are fine so long as the data is in the storage you’re snapping.

    Beyond your opinion there is no factual justification that EMC is against the use of point in time copies in backup environments, since they are designed into solutions every day where applicable.

    But let’s not square-peg-round-hole here, where everything can be solved with a Snap. That Exchange Server running on Blades using DAS isn’t going to be backed up no matter how many snaps you take on the array it’s not on.

    No Google search appliance is going to help your laptop/desktop backup or remote offices or VMware unless it can crack open VMDK files now and see what’s inside.

    A healthy backup and recovery portfolio has functionality for all the different use cases out there and not just the ones which land on their box.

    And that’s the difference between being a backup provider and an array provider.

    You can work beyond the box.

  • You’re right. I did. I said PiTs aren’t backups in that blog post on the second line. Let’s look.

    “Yes point in time copies are backups but they don’t protect you from a system failure.”

    Maybe I didn’t say it there. What about when you asked me a question in the comments. What did I say?

    “PiTs are backups”,

    OMG! I said PiTs are backups three words in! I didn’t even get three words without saying PiTs are backups! But let’s read on...

    “that not in dispute. Replication, which is usually part of a DR or HA scheme and not backup specific is a valid way of getting that data off of one system and onto another.

    The same way it is for non-backup data.

    But not everyone is going to have a DR site so then it comes down to a costing argument if you want to license replication tech to copy data between floor tiles in the same data centre or just use a backup app to roll it over across your LAN or SAN.”

    Your victory here would be stunning were it not for the fact that I state clearly that PiTs are backups; they just don’t protect you from a system failure. I also mention replication and the lack of it.

    Moving on I wouldn’t recommend anyone keep 90 days of snapshots on a system as that’s primary storage but yes Celerra and NetApp’s snapshots are different.

    Celerra for example allows you to store your snaps on different tiers of storage from the production file systems they’re taken from and it supports protected rollback. If we’re taking snapshots once an hour and you come to me at 12:00pm and say the DB was corrupted at 9:00am I can roll back to 9:00am. If you then come to me and say whoops you should have said 10:00am no problem, we can roll forward from 9:00am to the 10:00am.

    That’s not actually the case with some other technologies. Your trip to the past being one way.

    And @sharney is correct: we don’t replicate snapshots, as there’s no point in wasting the bandwidth when instead you can put the app into hot backup mode, tell the Celerra Replicator scheduler to create a replication session of the app being in hot backup mode, then snap that session when it closes on the other side.

    Giving you an application consistent snapshot for backup or whatever else you choose.

    And I typed this thing out on a Blackberry in a tent in the dark surrounded by rabid squirrels. How’s that for hardcore?

  • The whole point of this post is that I believe that "it is possible to have a complete backup system based solely on snapshots and replications." Another way to say that would be that replicated snapshots can replace backup.

    You then post a comment saying that you have no idea where I get this idea that you’re against this idea.

    I then posted comments from your blog where you say the exact opposite of the main point of the post. I think snapshots & replication can replace backup. Preston deGuise said in his comment to your post that, "snapshots … don’t replace backup." To this you replied, "I don’t think they replace backup either I think they’re a fast and non-disruptive way of giving you an image to backup."

    I think they can; you think they can’t. I’ll give you the benefit of the doubt. Perhaps you meant, "they can’t replace it in all circumstances." But that’s not what I took your post and comment (and tweets) to mean; I believe that you do not agree with the core tenet of this post: that it is possible to have a complete backup system based solely on snapshots and replication.

    And I KNOW that Scott feels that way, based on his much more verbose posts and tweets on the subject.

    You asked where I got the idea that anyone at EMC is against the idea. I answered that question. If you would like to clarify that you didn’t mean what I thought you meant, then please do so. But I don’t think you have so far. I don’t think you’re saying they’re not backups, but I still don’t believe (based on what I read in your post and what you’ve said here so far) that you agree with the basic tenet of this post. So let me ask it straight:

    Do you think it is possible to have a complete backup system that meets all the backup, recovery, and DR needs of a customer using nothing but snapshots and replication?

  • Do I think it is possible to have a complete backup system that meets all the backup, recovery, and DR needs of a customer using nothing but snapshots and replication?

    No.

    My concern as a backup administrator is I can’t tell if something has been deleted or just moved without an entry for it in a central index.

    Here’s what I’ve seen in production and the culprit is in moving virtual machines between datastores located on different volumes or different arrays.

    So long as the data is in the volume being snapped I have a backup. The admin juggles some VMs around and now it’s no longer in that snapset. It may be in another snapset, it may no longer be snapped at all. It may no longer be in a replication set as a result of the move.

    The move itself can change the protection level of the VM and you may not know about it.

    When the VM is toast I go looking for a restore but it’s not in any of the previous snaps of the volume it started out on.

    I go looking for where it was moved to. It may no longer even reside on the array it started out on as it’s point and click to move a running workload across buildings and the protection level doesn’t travel with the virtual machine.

    It inherits the protection level from where ever it’s now located.

    Moving it across volumes, systems and campus while running takes less operator interaction than protecting it.

    Going to 16 VMs per processor core, with hundreds of virtual machines on shared storage, what has to happen to use snapshots and replication as the only backup tools is this: the snaps and replicas have to integrate with the virtual machine management system the way they currently integrate with a backup app, just so you can track VM movement across volumes, snapsets, and replication sets and ensure the protection level is correct.

    And that’s a reason I say no. Workloads are no longer static, and a restore point might not be available where the workload is now but only where it has been previously. The last snapshot could have been taken in the last location or the location before that, depending on how often you’re taking snaps.

    There’s simply too much going on at any one time to synchronise movement against protection level without some form of central tracking index.

    You could go with a default policy. Snap everything, replicate everything regardless of where it is or what it does but from a data management point of view everything doesn’t have the same value and everything usually isn’t replicated.

    Workloads are now transient, and just because one was running on this volume ten minutes ago doesn’t mean it won’t be running on another array entirely ten minutes from now.

    Now, you might not agree with any of that but can you see the thought process?

  • I do follow your thought process, and I agree with your concern, but not your conclusion.

    First, I would suggest that the problem that you pointed out is just as prevalent in traditional backup systems as it is in snapshot systems. Tracking moves, adds, and changes is a big part of the backup admin’s job.

    You keep saying “without an index.” I believe what you’re saying is that you want some sort of central tracking entity, which may or may not have what NetWorker would refer to as an “index.” An “index” in my vocab (AKA catalog, database) is the part of backup software that tracks what files are in what backup. That is not the same as a job history database, or a job config database. I agree with you that you need the latter, but I believe you can get by without the former in a snapshot/replication world. (If someone disagrees with me, they can easily add one, though, as I pointed out in the post.)

    So…

    If I have a central management system that tracks what’s snapped to what, allows for different protection levels, tracks the history of what worked and didn’t work to the level of the host and app, then are we in agreement that this could be a valid backup system?

    And, BTW, regarding moves, adds, and changes, the problem of moving hosts MUST be addressed by a change management process, followed by a regular audit process. The best snapshot or backup system is never going to handle people just moving things around willy-nilly without telling anyone.
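
    To make the "central tracking plus regular audit" idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the protection level names, the `VM` record, and the idea that placement history and snapshot times are available as simple lists. It is not how any particular array or hypervisor exposes this data. It shows two checks: which VMs sit on a datastore whose protection is weaker than what they require, and which earlier locations still hold a restore point for a VM that has moved.

```python
from dataclasses import dataclass

# Hypothetical protection levels, ordered from weakest to strongest.
LEVELS = {"none": 0, "snap": 1, "snap+replica": 2}


@dataclass(frozen=True)
class VM:
    name: str
    datastore: str  # where the VM lives right now
    required: str   # protection level the business expects for it


def audit(vms, datastore_protection):
    """Return (vm, datastore, actual, required) for each under-protected VM.

    A VM inherits its protection from the datastore it currently sits on,
    so a move can silently lower it. Unknown datastores count as "none".
    """
    findings = []
    for vm in vms:
        actual = datastore_protection.get(vm.datastore, "none")
        if LEVELS[actual] < LEVELS[vm.required]:
            findings.append((vm.name, vm.datastore, actual, vm.required))
    return findings


def restore_points(history, snapshots):
    """Return (snapshot_time, datastore) pairs that captured the VM, newest first.

    history:   list of (datastore, start, end) placements; end is None
               for the current placement. Times are plain numbers here.
    snapshots: dict mapping datastore -> list of snapshot times.
    """
    points = []
    for datastore, start, end in history:
        for taken in snapshots.get(datastore, []):
            if start <= taken and (end is None or taken < end):
                points.append((taken, datastore))
    return sorted(points, reverse=True)


if __name__ == "__main__":
    protection = {"ds_a": "snap+replica", "ds_b": "snap"}
    vms = [
        VM("web01", "ds_a", "snap+replica"),
        VM("db01", "ds_b", "snap+replica"),  # moved from ds_a to ds_b
    ]
    print(audit(vms, protection))

    db01_history = [("ds_a", 0, 10), ("ds_b", 10, None)]
    snaps = {"ds_a": [4, 8, 12], "ds_b": [6]}
    print(restore_points(db01_history, snaps))
```

    In the sample data, `db01` was moved to a datastore that is snapped but not replicated, so the audit flags it, and its only usable restore points are on the datastore it left. Note that this tracks placements and jobs, not individual files, which is the distinction drawn above between a tracking database and a file-level index.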

  • While the best snapshot or backup system won’t handle people just moving things around without telling anyone, if you keep an agent in the VM and it has network connectivity, even the worst of the backup systems should get a backup done wherever the VM lands. You won’t get an hourly backup, but you’ll get a daily backup and file level restore.

    That’s not me just pushing agents. One wonders: if the machine and the apps are moving, perhaps some backup intelligence should move with them?

    Yes, operationally change management should always be implemented, but when you can create a new virtual machine orders of magnitude faster than it takes to acquire and commission a new physical server, and move it faster than you can unpack and rack a new physical server, it becomes so easy that the longest part of the process is the paperwork.

    Which is why the paperwork is starting to go into the bin even quicker than usual and the change management is what they’re looking at at the top level.

    I’m not going to mention the C word but in the C word compute and storage become blobs of CPUs and capacity as they are abstracted away from the workloads running on them.

    To answer the question: if you had a central management system that tracks what’s snapped to what, allows for different protection levels, and tracks the history of what worked and didn’t work to the level of the host and app, are we in agreement that this could be a valid backup system?

    I’d agree with that as it’s any backup app with array level integration. I’m not being snitty with that answer, but you could build something like that today, with varying degrees of completion admittedly since some have deeper levels of array integration than others, with a traditional backup package.

    If someone wants to use snapshots and replication only, that technology is available all around and it works, but like any modern backup design you need to take into account what’s static and what’s not static in your environment.

    And when a LUN or file system vanishes into a virtualised layer should it be the layer you’re snapping and not the LUNs or file systems they’re spread across?

  • In the seventh paragraph of your blog there is some weird spacing going on, and my ADD kicked in and I just couldn’t concentrate on anything else.

    Of course I’m kidding…

    See my reply on my website, http://www.backuphype.com

    -bh

  • @Mark Twomey

    I honestly have to say that I’m very surprised by your response, and in the end we were mainly arguing about what we thought the other person was saying. It sure doesn’t seem to match what I’ve historically run into when talking to EMC folks. Snapshot backup discussions very quickly become discussions about NetApp, and then EMC people start seeing red, so I’ve never had a snapshot discussion with an EMCer that ended well.

    Which is all I meant when I started out the post saying that the anti-stuff on this topic often comes from EMCers. I honestly wasn’t trying to pick a fight. If I had it to do all over again, I’d take that paragraph out. It’s not germane to the discussion.

    Thanks for the discussion.

  • These paragraphs were in the original post. I moved them down here in hopes that future readers will read the main point of the article and not get bogged down in some stuff I really wish I hadn’t posted.

    I am aware that the anti-snapshot argument is often proffered by EMC folks and the pro-snapshot argument often comes from NetApp folks. While I’m sure that they all strongly believe what they’re saying, it’s still a point-of-view that is based partly on where they work. EMC can’t do snapshots the way NetApp can — and they sell some of the backup and reporting software that you might do without if you went down the snapshot-only route. So I highly doubt that there are any training sessions within the Hallowed Halls of Hopkinton about any advantages that such a system might have. Their employees also have no reason to learn the advantages of the other approach. So it’s no surprise that anyone that works there would have a dim view of such. On the flip side, NetApp doesn’t sell backup software. It doesn’t sell backup reporting software. Heck, it no longer even sells a target for traditional backups that I would buy (No, Val, I don’t consider FAS w/ASIS a target dedupe system), so they are pretty much out of that business as far as I’m concerned. So they have a vested interest in espousing their point of view as well. Why buy any of that EMC software when you can buy everything from them?

    Do I have a frame of reference too? Of course. Having worked with hundreds of customers using both approaches, I have seen first hand what they offer, and I still see the merits of both. Usually I am defending the monolithic, central backup software world (as opposed to many, non-integrated point solutions), but today I am arguing that the snapshot-based approach is also valid.

    I want to say firmly that while the EMC and NetApp guys will interpret this as a pro-NetApp post, I am NOT arguing NetApp over EMC here. I am defending the concept of snapshot-based backups, which the EMC guys are saying is absolute bollocks. (That was just for you, Mark. I could have said "pants," but I like to see UK/AUS audiences giggle when I throw that word into a preso.) Yes, I am well aware that NetApp is the only major company that is pushing this idea, but there are other companies (e.g. Compellent, FalconStor, Dell) that also go down this route, albeit with less success than NetApp has had. In addition, Microsoft has had quite a bit of luck selling Data Protection Manager. Guess what? Totally snapshot-based. Then, of course, there’s Time Machine, Mac OS’s built-in backup tool. Also totally snapshot-based, although only at the file level.

    I am not, nor have I ever been, an employee of NetApp, Dell, Microsoft, or FalconStor. I don’t own stock in NetApp, Compellent, FalconStor, or any other snapshot-based company either. And they’re not slipping me cash on the side for these posts. So anyone who wants to accuse me of blogging what I’m blogging because I’m in someone’s pocket obviously has no idea who they’re talking about.

    FWIW, I have seen at least two people post on the other side of this issue that aren’t EMCers, and they’re Preston deGuise (@prestondeguise) and Stephen Foskett (@sfoskett) in his comments on my blog post. I believe that I’ve addressed both of their concerns in the rest of this post. Feel free to check out their blogs for any anti-responses. They’re both people I respect a lot, even if Preston does talk funny and likes to live with Koalas, and Stephen likes to work for San Diego-based companies (where I live) but can’t bring himself to move here.

  • Yes, I think we’re both on the exact same track; we just started out looking out the windows on opposite sides of the carriage.

  • Scott’s posts I linked to above appear to be gone from his site. So I grabbed them from google cache and will quote from them liberally here because I don’t want people thinking I’m making things up.

    In http://thebackupblog.typepad.com/thebackupblog/2009/08/something-doesnt-add-up-and-never-will.html he said:

    With respect to SM and SV–they aren’t backup. They are copies. At least that is my perspective. Now a copy may (or may not) be able to meet your data protection requirements. But the difference should be understood carefully first before you decide that a copy is good enough. In general, I dislike labeling them backup because they have such different characteristics.

    As to SV and SM not being a backup? I wrote about it a bit here: http://thebackupblog.typepad.com/thebackupblog/2009/04/is-a-copy-a-backup.html
    I give that some thought every now and then. There are some things about SV and SM that leave me deeply uneasy as a backup guy (my intuitive feeling is they are “different” and “not backup”). That post was my attempt to wrestle with that and other similar approaches. I am not sure I got it 100%, but I wanted to try to deal with the issues that are the basis for my intuition.

    And in
    http://thebackupblog.typepad.com/thebackupblog/2009/04/is-a-copy-a-backup.html
    he said this:

    Its creation, aging, and disposition is managed by a backup and restore application that will store the data in a format that is different than the source format (meaning either a different type of file system than the source and/or a different disk format and/or the source is encapsulated in a package as is the case with virtual tape), and with access permissions that are a subset of the permissions associated with the source data.

    It seems pretty clear to me that the author of this post believes that backup has to change forms in order for it to be called a backup. By inference, this means that the system I described above does not meet his definition of a backup.

  • You have said in your blog that "PIT copies aren’t backups." Scott has said the same in his blog, specifically pointing out SnapMirror and SnapVault. Sales reps have said the same in sales meetings.

    Here’s your blog entry:

    http://storagezilla.typepad.com/storagezilla/2009/09/are-point-in-time-copies-backups.html

    In comments in that post you say that PIT copies (aka snapshots) are NOT a replacement for backup, even if they’re replicated off the system. You said that they are just "a fast and non-disruptive way of giving you an image to backup."

    In a later comment in that same post, you also say not to use RecoverPoint for "months of retention," which is what I’m suggesting to do in this post.

    Scott Waterhouse has argued this particular point very vehemently both publicly and privately. When discussing a customer’s options in comments on this blog, he said that SnapMirror and SnapVault are "not backup."

    http://thebackupblog.typepad.com/thebackupblog/2009/08/something-doesnt-add-up-and-never-will.html (link removed because the article has disappeared. See comment below.)

    And he blogged about it in detail here:

    http://thebackupblog.typepad.com/thebackupblog/2009/04/is-a-copy-a-backup.html (link removed because the article has disappeared. See comment below.)

    And…

    I was at a very large customer (I say very large only to tell you that the size of this company dictates that they get the best and brightest sales reps EMC has to offer). The customer told the EMC sales reps that they wanted to do what I described in this post (90 days of snaps on one Celerra replicated to another Celerra), and the reps told the customer that this was the stupidest thing they’d ever heard of. "Why?" the customer asked. "Because it would kill the performance of the Celerra." "How much?" "We don’t know, because no one does that, but I’d guess at least 50%."

    As to the customer (@sharney) doing what I described in this blog, he has now explained that he has to snap both source and destination. If he’s replicating application data, that’s not a valid method of getting a good backup of it. You have to put the app in a backup mode, then snap it. How do you do that with a replica? This is another part of what I meant when I said "EMC can’t do snapshots the way NetApp can."

  • Curtis,
    You’ve put down some really good points to consider regarding whether or not snapshots can be used in place of traditional backups (preferably combined with replication) and how they compare/contrast to the role backup applications typically play. However, there was one point that you didn’t discuss, in regards to overall backup architecture, and I was a little surprised to see it wasn’t called out as a concern.

    In the scenarios you’ve described, if the primary site goes offline for any reason and you move your operations to the DR site, you are now actively transacting on the array containing the only copy (or copies) of your critical data. You are also doing so in an inherently stressful situation that is prone to missteps. I’ve always viewed the complete lack of an independent copy of the data at the DR location as a serious flaw in a pure snapshot and replication architecture. I suppose you could have a fourth array deployed with a third replication copy (Primary copy@primary site, HA/BC replication copy@primary site, DR replication copy@DR site, and DoubleDR replication copy@DR site).

    I’m curious to hear your stance on whether you feel this simply isn’t a valid concern, was overlooked in the original points, or was simply outside the bounds of the original discussion.

    Thanks.

  • Your concern is a valid one. A fourth copy would solve that. Or you can have a third copy with no local copy onsite. (Personally I’d go with the local copy onsite and forgo the fourth copy if I had to make a choice between those two.)

    Just to stick with the post, however, the fourth copy could still be another “copy,” not something on tape or in traditional backup format.

    It’s all about what you want to pay for.

  • Note – if you are using volume SnapMirror to replicate to your DR site, you also automatically have all the snapshot copies of the original production filers (up to the last mirror sync), available in your DR filers. If you use SnapVault, you can also keep snaps even further back in history than are still on the prod site. If you declare a disaster and promote DR to production, you still have those snaps to revert to, if needed, as well as any new snaps created while you run the DR filers as prod (depending on DR filer capacity, of course). This has been a real lifesaver to me on a few occasions.

    Admittedly these snaps are not as “safe” as a third “real” copy elsewhere, but still they are there, just in case. Once you declare a DR, you do still need to protect the DR data that is now prod. Personally, I favor cascading SnapMirror or SnapVault to a 3rd filer, and/or backup to NDMP tapes from DR. Belt AND suspenders. 🙂

  • Curtis, very interesting discussion. Just a quick technical note on NetApp. Syncsort integrates its BEX backup software with NetApp snaps. That is, BEX provides OS agents and leverages SnapVault on secondary storage. This combo gives you: snapshot based backups, searchable within nodes; policy based export of data to tape; application integration (including VMware hooks for P2V and V2V restore); and full backup reporting. Snapshots replicated using SnapMirror can also serve as restore sources for BEX on the far end.

    I believe this combined solution provides many of those things you list that NetApp lacks.

    Peter Eicher
    Syncsort

  • I’ve had several inquiries about the precise meaning of the statement “snapshot based backups, searchable within nodes” in my last comment. To clarify:

    BEX backup jobs to the NetApp target are stored as SnapVault snapshots. On the recovery side, the data is viewed via the BEX console from a drive perspective, e.g. the D: drive of a server. The search function works across the full backup history of that drive. So, if I needed all versions of “File-x” across 100 existing backup jobs, I could search through all 100 snaps in a single operation (wildcards are supported). I could also target a specific date range to narrow the search, and/or a file size range.

    Peter Eicher
    Syncsort
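
    For readers wondering what a search across snapshot history looks like without a file-level catalog, here is a rough sketch in Python. It does not use or describe BEX or SnapVault. It assumes a generic layout like the rsync and hard-link approach mentioned in the post: one directory per snapshot, named `YYYY-MM-DD`, under a common root. The function name, the layout, and the parameters are all hypothetical.

```python
import fnmatch
import os
from datetime import datetime


def search_snapshots(root, pattern, since=None, until=None,
                     min_size=0, max_size=None):
    """Yield (snapshot, relative_path, size) for every matching file.

    root:        directory holding one subdirectory per snapshot,
                 each named YYYY-MM-DD (an assumed convention).
    pattern:     shell-style wildcard matched against the file name.
    since/until: optional datetime.date bounds on the snapshot date.
    min_size/max_size: optional size bounds in bytes.
    """
    for snap in sorted(os.listdir(root)):
        snap_dir = os.path.join(root, snap)
        if not os.path.isdir(snap_dir):
            continue
        try:
            taken = datetime.strptime(snap, "%Y-%m-%d").date()
        except ValueError:
            continue  # not a snapshot directory, skip it
        if since is not None and taken < since:
            continue
        if until is not None and taken > until:
            continue
        # Walk the snapshot like any other directory tree.
        for dirpath, _dirs, files in os.walk(snap_dir):
            for name in files:
                if not fnmatch.fnmatch(name, pattern):
                    continue
                full = os.path.join(dirpath, name)
                size = os.path.getsize(full)
                if size < min_size:
                    continue
                if max_size is not None and size > max_size:
                    continue
                yield snap, os.path.relpath(full, snap_dir), size
```

    A call such as `list(search_snapshots("/backups/hostA", "File-x*"))` would return every version of the file across all snapshots under that (hypothetical) root. The trade-off is that each search walks the snapshots themselves rather than consulting an index, so it gets slower as history grows; a product with a catalog avoids that cost.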

  • “A malevolent backup admin can delete your backup catalog & overwrite all your tapes with a for-loop. What’s your point? […] That’s all I’m going to say about that.”

    That’s unfortunate, because you’re brushing off the critical point without addressing it. I work at a site where we consciously accepted the risk of using only SnapMirror for backup for many years, but we now feel we have to deal with the fact that a single NetApp administrator could destroy every single copy of production data in a few minutes, and we’re working on a way to do so without using tapes (which we hate with a passion).

    So as to your “What’s your point?” question: if you rotate backup media offsite (e.g. with a data protection company), a malevolent admin can’t delete the contents, no matter how many for loops he writes (and the destruction of the online copies should tip someone off before the offsite media is rotated back onsite). That’s the point. That’s *not* necessarily incompatible with using snapshots and SnapMirror, BTW. The backup media in question could just as easily be disks instead of tapes.

    I agree that the “mirroring isn’t backup” argument is silly, but nonetheless, brushing off the fact that a single admin can delete all your data (and acting as though that isn’t a major weakness of a purely online backup strategy) weakens your argument.