Is dedupe to tape crazy?

Dedupe to tape is definitely crazy.  But is it crazy good or crazy bad?  I spent two days in (lovely?) Oceanport, New Jersey surrounded by a bunch of CommVault Kool-Aid-drinking, Frank Slootman-hated, but seriously technical people who knew their product very well.  Over those two days, I had every question I had about CommVault answered, and one of the questions was: “Why the heck would you want to dedupe to tape?”


In case anyone’s interested, I don’t get paid to blog and this is no exception.  CommVault is not paying me to write this.



My position on deduping to tape has been a consistent one: unconvinced.  I’ve read Dave West’s blog entries about it and seen some of their sales presentations on it, and I’ve always responded with the following thought: if I dedupe to tape, I’m going to need multiple tapes to restore one file!  I don’t care how much money I save, that’s going to have a significant impact on restore performance and I’m just not interested.

They took a tack that I didn’t expect: they agreed with me.  No one I talked to at CommVault’s corporate headquarters wanted to do restores from deduped tape. Now that I didn’t expect.

First let me explain how their dedupe to tape works.  If you’re going to dedupe to tape, you first have to dedupe to disk.  You create what they call a silo on disk, which is a full backup and a set of deduped incrementals based on (and deduped against) that full backup. The retention on that silo should be long enough to satisfy most of your operational restore requests.  (Typically that’s 30 days, but it could be longer in your environment.)

Once the silo’s time period has passed, they migrate the previous silo to tape.  Once the silo has been migrated to tape, it is deleted from disk to make room for new backups. The idea is that most restores should come from disk, but in the rare case that you would need to restore something that you don’t have on disk, they can get it from tape.
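
Here is a minimal sketch of that lifecycle as I understood it, not CommVault’s actual implementation. The `Silo` class, the `migrate_expired` function, and the 30-day default are all names and values I made up for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Silo:
    """Hypothetical model of a silo: one full plus its deduped incrementals."""
    name: str
    closed_on: date          # assumed: the day the silo stopped taking new backups
    location: str = "disk"   # "disk" or "tape"


def migrate_expired(silos, today, disk_retention_days=30):
    """Move silos older than the disk retention window to tape.

    Returns the names of the silos migrated on this pass. Setting the
    location to "tape" stands in for "copy to tape, then delete from disk".
    """
    migrated = []
    for silo in silos:
        age = today - silo.closed_on
        if silo.location == "disk" and age > timedelta(days=disk_retention_days):
            silo.location = "tape"
            migrated.append(silo.name)
    return migrated


silos = [Silo("january", date(2024, 1, 31)), Silo("february", date(2024, 2, 29))]
print(migrate_expired(silos, today=date(2024, 3, 15)))
```

In this example only the January silo is past the 30-day window, so it alone moves to tape and the February silo stays on disk for operational restores.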

They would have to load multiple tapes to restore a single file, but they don’t have to read all those tapes.  They track a file’s locations on tape to a much more granular level than most products, so they just have to do a lot of fast-forwarding.
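
To make the “fast-forwarding” idea concrete, here is a rough sketch of what such a location index might look like. I did not see the real catalog format, so everything here is hypothetical: the `ChunkLocation` type, the tape barcodes, the offsets, and the `plan_restore` function.

```python
from collections import defaultdict
from typing import NamedTuple


class ChunkLocation(NamedTuple):
    tape: str     # hypothetical tape barcode
    offset: int   # hypothetical block offset on that tape


# Hypothetical catalog: each deduped chunk is stored once, somewhere on tape.
chunk_catalog = {
    "c1": ChunkLocation("TAPE01", 1200),
    "c2": ChunkLocation("TAPE01", 88000),
    "c3": ChunkLocation("TAPE07", 450),
    "c4": ChunkLocation("TAPE12", 30),
}

# Hypothetical manifest: the ordered chunks that make up each file.
file_manifests = {"report.doc": ["c1", "c3", "c2", "c4", "c1"]}


def plan_restore(filename):
    """Return {tape: sorted offsets} so each tape is mounted once and
    positioned straight to the blocks it holds, instead of being read
    end to end."""
    plan = defaultdict(set)
    for chunk in file_manifests[filename]:
        location = chunk_catalog[chunk]
        plan[location.tape].add(location.offset)
    return {tape: sorted(offsets) for tape, offsets in sorted(plan.items())}


print(plan_restore("report.doc"))
```

The point of the sketch is that the restore touches three tapes but only four spots on them, and the repeated chunk `c1` is read once.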

Everyone (including the CommVault folks) agrees that no one would want to do any significant portion of their restores from deduped tape.  But I also agree that if I typically do all my restores from within the last 30 days, and someone asks me for a 31-day-old file, it’s generally going to be the type of restore where the fact that it might take several minutes to complete is not going to be a huge deal.  (In the case that you did need to do a large restore from a deduped tape set, you could actually bring it back to disk in its entirety before you initiate the restore.)

Now here’s the business case. Anyone who has done consulting in this business for a while has met the customer where everyone knows that 99% of the restores come from the last 30-60 days, and yet they keep their backups for 1-7 years.  What a waste of resources.  CommVault is saying, “Hey.  If you’re going to do that, at least dedupe the tapes.”  They showed me business cases from two customers where doing this was saving them over $500K per year on their Iron Mountain bill.  WOW.

Call me convinced.

Written by W. Curtis Preston (@wcpreston), four-time O'Reilly author, and host of The Backup Wrap-up podcast. I am now the Technology Evangelist at Sullivan Strickler, which helps companies manage their legacy data.

16 comments
  • One of the big complaints against TSM with progressive backups (incremental forever) is that when you go to a DR, you have to hit a lot of tapes to restore a filesystem. And hit them again for the next filesystem. And again, and again… Wouldn’t this be that situation, but 1000x worse? It’s a LOT of tape mounts, and a lot of wear and tear on the tape cartridges themselves.

    While a half a million off their Vault vendor bill is nice, spending 2 weeks working on recovering from a disaster would be a trade-off that probably wouldn’t make business sense.

  • I agree with you and so would they. This is why a DR scenario would NEVER be set up using deduped tapes. They would replicate disk backups to the DR site for that purpose. If that wasn’t possible, then they would recommend non-deduped tapes for the DR.

  • We’re one of those operations that keeps the tapes for 1-7 years because of retention rules regarding PHI, patient health information, and yes our restores are almost always in the last 30 days if not the last week. What are folks like us supposed to be doing instead to avoid this “waste of resources”? Should we be using a disk archival appliance instead?

  • The “waste of resources” to which I refer is using backup software as archive software. Backups should NEVER, NEVER, NEVER, NEVER be kept for 7 years. I’m going to go out on a limb and say I’ve seen rare cases where backups should be kept longer than 18 months. (I go that far because some people don’t notice the file they look at once a year was damaged until NEXT year.) But if you’re retrieving a file for regulatory or eDiscovery reasons, you need archive software, not backup software.

    Archive software, unlike backup software, doesn’t keep backing up the same stuff over and over. It stores an email or file once (maybe twice, depending on how you set it up), and then never stores it again. In contrast, backup stores it over and over and over and over. If you do weekly fulls and store backups for 7 years, you have some files on tape 350+ times! Now THAT’s a waste of resources.
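
The “350+” figure is simple arithmetic. A quick sketch, assuming an unchanged file is captured by every weekly full and nothing expires early (the function name is just for illustration):

```python
def copies_on_tape(fulls_per_year: int, retention_years: int) -> int:
    """Copies of an unchanged file held by a backup scheme that
    re-stores the file with every full backup."""
    return fulls_per_year * retention_years


backup_copies = copies_on_tape(fulls_per_year=52, retention_years=7)
archive_copies = 1  # an archive stores the item once (maybe twice)

print(backup_copies, archive_copies)  # 364 copies versus 1
```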

  • Thanks so much Curtis. Very helpful. Last question…

    Would you recommend archiving with my already installed Networker infrastructure using what may be already built in to it or would you advise using a dedicated archiving application?

    Thanks!

  • I’d have to know a lot more about your environment to tell you what the best product for you to use would be. Having said that, I always recommend that people look at what they have before moving on.

  • [quote name=W. Curtis Preston]I agree with you and so would they. This is why a DR scenario would NEVER be set up using deduped tapes. They would replicate disk backups to the DR site for that purpose. If that wasn’t possible, then they would recommend non-deduped tapes for the DR.[/quote]
    Which raises the question of the “validity” of the business cases showing the Vault savings. It seems they’re saying, “We can save you $X, but we wouldn’t recommend you do it.”

  • I haven’t had time to read this fully yet – on the road right now, but you don’t mention the performance.

    What’s the performance?

  • They recommend it for a very specific use case: the tapes you know you’re making that you never plan to restore from. There are tons of people who make backups that they KNOW they’re never going to use: "We back up every day and some lawyer says we have to keep them for 7 years." I would suggest that the tapes older than 60 days and less than 7 years are never going to get read! They’re saying dedupe those tapes.

    If you ever have to read from those tapes, they’ve got three different ways to do so. One is easy but takes a while if you’re doing a LOT of files, and the other two take a little more setup time, but are much quicker once you do that setup.

    I really don’t see the problem. Every product I know has places where it works well and places where it doesn’t. They’re saying that dedupe to tape works well in that use case. If you don’t have that use case then don’t use it. What’s the problem with that?

  • David,

    If you’re asking about restore performance from deduped tape, it didn’t sound very good, but that’s why you would only use it when deduping data that you don’t plan on reading very much. Let me give you a scenario. Assume you’ve got a 30 tape set that has a file that was changed every day, so pieces of that file are on 30 different tapes. If you have to retrieve that file (notice I say retrieve and not restore, because this is probably for an ediscovery case), you’ll need to load and fast forward to the right spot on 30 tapes. That’s at least an hour just to do that (assuming it was serial), but if you’re only grabbing a few files here and there, is that so bad? You’re retrieving it for an ediscovery request that typically gives you days or weeks, not hours or minutes, to retrieve the data. Does it matter if it takes an hour to get the one file off tape?

    And, again, if you’re retrieving a whole bunch of stuff for ediscovery, there are two ways to scan a set of tapes back in en masse to make that much faster.
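
The “at least an hour” estimate works out to roughly two minutes per tape for load, positioning, and unload. That per-tape figure is my own assumption, backed out of the 30-tape scenario, not a measured number. A rough sketch of the math, including what happens if you have more than one drive:

```python
import math


def retrieval_minutes(tapes: int, minutes_per_tape: float = 2.0, drives: int = 1) -> float:
    """Estimate the time to mount and position every tape a file touches.

    minutes_per_tape is an assumed average for load + fast-forward + unload.
    With several drives, tapes are handled in rounds of `drives` at a time.
    """
    rounds = math.ceil(tapes / drives)
    return rounds * minutes_per_tape


print(retrieval_minutes(30))            # serial, one drive
print(retrieval_minutes(30, drives=4))  # four drives working in parallel
```

With one drive the estimate is 60 minutes; with four drives it drops to 16, which is still fine for a request that gives you days or weeks.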

  • Do you know if you’re able to make multiple tape copies? I’d really hate to have a tape with deduped data on it fail.

  • You were going to say “retentions significantly longer than that.” So do I. What I’m saying is a waste of resources is people who use their backup apps as archive apps. They do a lousy job of it for many reasons.

    That is what ARCHIVE APPLICATIONS are for, NOT backup applications. Backup apps are for restoring deleted files. If you want to keep data for long periods of time and recall files based on context (e.g. all the files with the word “elvis” in them) then you need an archive app.

  • An interesting topic.

    I have supported backups (Backup Exec / Netbackup) for over 10 years and I have been involved in the progression of the technologies and their growth and improvements over time. Here are the trends that I have seen:

    Although tape is not considered a reliable media for frequent backups (daily backups) due to the fact that it wears out from stretching and read / write processes, it is still heavily used. A lot of this has to do with the mindset of familiarity – it is a known media. Even though it fails, it is somewhat predictable in that failure. And it is cheap to keep spare tapes on hand to replace the failed tapes.

    Backing up to disk, whether directly or through a virtual tape library, is considered to be a new technology, despite the fact that it has been around for over 8 years.

    Deduplication has been around for 4 or 5 years, but is still largely misunderstood and untrusted (is that a proper word?).

    So discussing deduping to disk instead of tape causes CIOs to shut down! 🙂

    But here is the path that I have seen to work best and it encompasses more than just deduping:

    Backup to disk, using deduplication where possible. Deduplication is a process that can be done from either the source system (where the files are coming from) or by the target system (the system actually performing the deduplication).

    Retain the backups on disk for a period of x days (30-45 is recommended), then move the FULLS to tape. Remember that at this point you are looking to this retention for DISASTER RECOVERY, not for deleted file restore.

    If there is a need for ARCHIVAL, tape is NOT the way to go. Instead, you should look to CAS (content addressed storage). Archival is for the purpose of retaining documents in their final form for an extended period of time (some as long as 30 years).

    If there is a concern about the data being available in the event of failure, there are several options available. Replication of data to a system that is distant from the main system is the best way to go. By distant, I am referring to over 100 miles apart.

    If you do not have a facility that is distant, there are options, including "renting" the space for the system to be hosted in. It is more important to do this than most companies think. And it is a topic that most of them do not want to address, largely because they do not understand the technology.

    I hope I am not adding confusion to this topic.

    The long and short of this is that deduplication to disk is the best way to go. Moving the backups to tape is an option, but in my opinion it is not the best way to go. It is better to protect the data through replication to a distant system and to look to CAS for long term, archival storage.

  • I don’t have the negative view of tape that you have. Tape is very reliable if the system around it is designed properly. Tape also does not “stretch.”

    The problem with tape is that people don’t (or can’t) architect for it properly. They’ve got a 180 MB/s tape drive that they’re feeding 20 MB/s — and they wonder why it fails.

    As to moving backups to tape after 30-45 days for the purposes of DR, I completely disagree. DR is done from yesterday’s backups — NEVER from backups from 30-45 days ago. If someone is moving something to tape after 30-45 days, it’s for long term retention — NOT DR.

    As to CAS being the ultimate long term solution, again I have to disagree. It’s the most expensive way to store data long term — often more expensive than not archiving it in the first place. The long term power costs alone are an order of magnitude greater than tape. And replicating CAS to a second location costs even more money. (And putting it in a colo makes it cost even more!)

    As to dedupe being untrusted, I think the sales of all of the dedupe products beg to differ. How is dedupe (offered by dozens of vendors) untrusted, but CAS (offered by two vendors) is trusted? (BTW, CAS is essentially dedupe. It’s just object-level dedupe.)

    I still like tape for long term archiving, if it is handled properly. That means retensioning every year and moving it to fresher, newer media every five years. If you don’t want to have to do that, then go to optical. A 30 yr CAS archive? Holy cow would that cost a lot.