CommVault Simpana is target dedupe — for now

Every once in a while someone talks to a CommVault sales rep who seems to want to classify CommVault as source dedupe or (at the very least) as not target dedupe.  As one of those who does not like ambiguity (except for the whole near-CDP thing), I will explain why I put them firmly in the target dedupe camp — for now.

Simpana uses a hash-based dedupe approach, which requires three steps (among a whole bunch of other things; see the sketch after this list):

  1. “Chunking” the files to be backed up into segments that are typically much larger than a byte but much smaller than a file.  Each such segment is what we’ll call a “chunk.”
  2. Creating a hash for that chunk.  This is typically done using SHA-1, which creates a 160-bit value that is, for all practical purposes, unique to each chunk.
  3. Looking up the hash in a hash table to see if it has been seen before.  (If it has, the chunk is not stored again; if it hasn’t, it is stored.)
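
To make those three steps concrete, here’s a minimal sketch in Python. It’s purely illustrative: the fixed 128 KB chunk size, the in-memory dictionary standing in for the hash table, and the function name are my own assumptions (real products typically use variable-size chunking and an on-disk index), not how Simpana actually does it.

```python
import hashlib

CHUNK_SIZE = 128 * 1024  # assumed chunk size; real products vary, and many use variable-size chunks

chunk_store = {}  # stands in for the hash table / dedupe store: SHA-1 digest -> chunk


def backup_file(path):
    """Run the three steps over one file."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(CHUNK_SIZE)             # step 1: chunking
            if not chunk:
                break
            digest = hashlib.sha1(chunk).digest()  # step 2: 160-bit SHA-1 fingerprint
            if digest not in chunk_store:          # step 3: hash-table lookup
                chunk_store[digest] = chunk        # never seen before: store it
            # already seen: store nothing, just keep the digest as a reference
```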

A typical hash-based target dedupe system does all three behind the backup server (please note that not all target dedupe systems are hash-based). To be considered source dedupe, you must do all three at the client. If you are not doing all three at the client, you are not deduping at the source; you are sending un-deduped (native) data across the LAN and then deduping it, and the whole point of source dedupe is to reduce LAN traffic.
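
For contrast, here’s roughly what a true source dedupe flow looks like: the lookup happens at the client, so duplicate chunks never cross the LAN at all. This is a hypothetical sketch; `server.has_chunk()`, `server.reference_chunk()`, and `server.send_chunk()` are made-up stand-ins for whatever protocol a real product uses.

```python
import hashlib

CHUNK_SIZE = 128 * 1024  # same assumed chunk size as the sketch above


def source_dedupe_backup(path, server):
    """All three steps at the client; duplicates never cross the LAN."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(CHUNK_SIZE)             # step 1 at the client
            if not chunk:
                break
            digest = hashlib.sha1(chunk).digest()  # step 2 at the client
            if server.has_chunk(digest):           # step 3 at the client (a remote lookup)
                server.reference_chunk(digest)     # duplicate: send only the 20-byte digest
            else:
                server.send_chunk(digest, chunk)   # new data: this is the only bulk traffic
```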

CommVault Simpana does steps 1 & 2 at the client.  They can then compress the chunked and fingerprinted data and send it to the media agent, where the third step takes place.  Because they don’t do the third step at the client, they are deduping at the target; they are a target dedupe solution.
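
Sketched out, my reading of that split looks something like the following. Again, this is an illustration of the description above, not CommVault code: the client chunks, fingerprints, and compresses, but every chunk still travels to the media agent, which does the lookup.

```python
import hashlib
import zlib

CHUNK_SIZE = 128 * 1024  # same assumed chunk size as above


def simpana_style_client(path, media_agent):
    """Steps 1 & 2 (plus compression) at the client; everything still ships."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(CHUNK_SIZE)
            if not chunk:
                break
            digest = hashlib.sha1(chunk).digest()              # steps 1 & 2 at the client
            media_agent.receive(digest, zlib.compress(chunk))  # every chunk crosses the LAN, compressed


class MediaAgent:
    def __init__(self):
        self.chunk_store = {}

    def receive(self, digest, compressed_chunk):
        if digest not in self.chunk_store:               # step 3 happens here, at the target
            self.chunk_store[digest] = compressed_chunk  # duplicates are dropped only after the trip
```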

They can (and do) argue that, because they do it this way, they reduce more LAN traffic than a typical target dedupe system, since they can compress the data before sending it.  If you turned on client compression with CommVault (or any other backup product), for example, and then sent those compressed backups to a typical target dedupe system (e.g. Data Domain, SEPATON), the compression would negatively impact your dedupe ratio.  It is therefore a recommended practice NOT to compress data on a client before sending it to a target dedupe system.  The exception is CommVault’s target dedupe, where the chunking and fingerprinting happen at the client before compression.
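
A quick way to see why that ordering matters. In the toy demo below (my own construction, not modeled on any product), two versions of a file differ by a single byte. Chunk the raw data and almost every chunk dedupes; compress first and the byte streams get scrambled enough that the chunks no longer line up.

```python
import hashlib
import zlib

CHUNK = 4 * 1024  # small chunk size so the compressed streams span several chunks


def chunk_hashes(data):
    """Fixed-size chunking + SHA-1 fingerprints (same toy scheme as above)."""
    return {hashlib.sha1(data[i:i + CHUNK]).digest()
            for i in range(0, len(data), CHUNK)}


# Two "backups" of the same ~800 KB file, with one byte changed in the middle.
v1 = b"".join(b"log line %08d: nothing happened today\n" % i for i in range(20000))
v2 = bytearray(v1)
v2[len(v2) // 2] = ord("X")
v2 = bytes(v2)

# Chunk the raw data: only the one chunk containing the change differs.
h1, h2 = chunk_hashes(v1), chunk_hashes(v2)
print(len(h1 & h2), "of", len(h1), "raw chunks dedupe")  # all but one chunk shared

# Compress first, then chunk: the change ripples through the compressed
# streams, so far fewer (if any) chunks match.
c1, c2 = zlib.compress(v1), zlib.compress(v2)
print(len(chunk_hashes(c1) & chunk_hashes(c2)), "of",
      len(chunk_hashes(c1)), "compressed chunks dedupe")  # typically far fewer
```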

Until they do step 3 at the client, they are target dedupe — albeit an enhanced one.

But all they have to do to be source dedupe is add step three to their client process, so I’ve got to believe they’re working on it.  That’s why I say they are target dedupe — for now.

Written by W. Curtis Preston (@wcpreston), four-time O'Reilly author, and host of The Backup Wrap-up podcast. I am now the Technology Evangelist at Sullivan Strickler, which helps companies manage their legacy data.

6 comments
  • Actually, Curtis, if you deploy a CommVault Media Agent to the client with a LAN-attached disk target, then steps 1, 2 and 3 are done at the client – source dedupe, by your definition.

  • 1. Media agents are a very different price than clients
    2. That architecture assumes a reliable connection to a CIFS share

    So as long as I pay for much more expensive software AND make sure that my laptops have a reliable connection to a writeable CIFS share, sure…

    But that isn’t the same as what a source dedupe product does.

  • [quote name=W. Curtis Preston]1. Media agents are a very different price than clients
    2. That architecture assumes a reliable connection to a CIFS share

    So as long as I pay for much more expensive software AND make sure that my laptops have a reliable connection to a writeable CIFS share, sure…

    But that isn’t the same as what a source dedupe product does.[/quote]
    So we’re now amending your definition of source-based dedupe to incorporate cost and protocol type?

    “A source based dedupe product is one that does 1, 2, 3, costs X and transfers data via Y”?

    In that case you need to look at the overall cost of your “favourite” source-based dedupe products. By the same token, I could argue that any individual component of these is substantially more expensive than any single component of a CommVault solution.

    Similarly, your “favourite” source-based dedupe product relies on a reliable connection from client to destination/server/pool.

    I think your argument is flawed there. You clearly stated your prerequisites for source-based dedupe above, and now the goal posts are being moved.

  • I’ll plead guilty to the moving goal post claim, but I do have mitigating circumstances. I should have attacked what you suggested on its merits alone, so I’ll do that now.

    Are you really telling me that CommVault support would recommend loading their media agent onto hundreds or thousands of laptops behind a commserve? I don’t think so. It might technically fit the description I gave for source dedupe, but it was never designed to work like that. Last time I checked, a single commserve is limited to 1500-2000 CLIENTS, let alone media agents. The product simply wasn’t designed to support what you are describing.

    You have no idea what my favorite source dedupe product is, so don’t imply that you do. You also don’t seem to have any experience with good source dedupe products.

    A good source dedupe product is actually designed specifically to deal with unreliable connections. You can drop your connection, cancel your backup, etc, at any time — and things will be just fine. It’s how it transfers the data to the backup server that makes that possible: it does it in very small, manageable chunks.

    That is not the case with a media agent in CommVault. It’s assuming it has a reliable connection to its backup storage, and it is going to write typical big backup files to it the way you would if you had a reliable mount. And I’m guessing (although I’ll admit I don’t know for sure) that if you were halfway through writing one of those big files and your connection dropped, it would handle that the way a typical media agent/media server would: it doesn’t know where things barfed, so it has to resend a bunch of stuff. Backup servers really don’t like it when the thing they’re writing to is either not available or goes away while they’re using it. A good source dedupe product, on the other hand, just starts writing to a local cache until the link comes back up.
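
    In other words, something like the pattern below. This is a bare-bones sketch of the behavior I'm describing; `server.send_chunk()` is a hypothetical stand-in, and real products are far more robust than this:

    ```python
    import collections

    local_cache = collections.deque()  # chunks waiting for the link to come back

    def ship_chunk(digest, chunk, server):
        try:
            server.send_chunk(digest, chunk)     # small, independently resendable unit
        except ConnectionError:
            local_cache.append((digest, chunk))  # link dropped: spool locally, keep backing up

    def drain_cache(server):
        """Called when the link returns; nothing is lost, nothing is resent wholesale."""
        while local_cache:
            digest, chunk = local_cache[0]
            server.send_chunk(digest, chunk)     # may raise again; if so, the chunk stays queued
            local_cache.popleft()
    ```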

    CommVault is working on taking their product the final step to being source dedupe. If you’re right, why would they be doing that?

    Stop trying to convince people they have it already. Otherwise, they’ll have nothing to announce when they actually do have it.

    As to the cost issue, I’ll give you that. The (IMHO crazy) solution you’ve proposed may indeed be cheaper than SOME source dedupe products.

  • Appreciate what you’re saying, Curtis. I am certainly by no means suggesting that a Media Agent should or would be deployed to laptop clients. Certainly source based dedupe would not be the approach in that context. CommVault have a different approach to laptop backups.

    Your opinion on the "craziness" of my solution is noted. You should be aware, however, that there are a lot of organisations today happily backing up in exactly this "crazy" manner.

    I also didn’t mean to suggest there is a "favourite" dedupe product you have in mind. I just meant that if you look at the cost of "insert whatever dedupe product here", you’ll most likely find the overall cost of the solution substantially higher, and I think you agree on this point.

    The other thing to note, of course, is that at the end of the day what you end up with is just a dedupe product that addresses no advanced backup requirements, media management, DR, replication or data management via archiving or HSM without deploying a multitude of other products over the top or alongside (now who’s moving the goal posts!). So whilst there are other dedupe products out there that may do things better in terms of link management or backing up hundreds of thousands of laptops, they are essentially one-trick ponies. Data and Information Management has always been the core CommVault value proposition, not "dedupe" as a standalone product – this is just a powerful feature of Simpana, enabled via a checkbox.

    All I was saying in my original reply was that your evaluation of CV’s dedupe was inaccurate – or at the very least incomplete. And if you define source based dedupe as "chunking, hashing and lookup" at the source or client then CV can do this today.

    I’m fine with rehashing (pardon the pun) the definition of a dedupe product to incorporate cost, and link management, and laptop backups and whatever else you want to include. I’m just saying that, according to your original definition, your evaluation of the CommVault offering was inaccurate.