More Misinformation About Dedupe

The second installment of a Byte and Switch four-part series is out, and it’s full of the same untrue statements found in the first installment.  I will say the same thing I said in a comment I made on the first installment: “Is the author completely unaware [of the real facts] … Or is the author purposefully withholding information…?”  Click Read More to see both sides of the story that he is only telling one side of.


You may notice that I haven’t included a link to the article in question.  That’s because I don’t think Byte and Switch should be rewarded for publishing an article so full of factual errors by having you click on my link and read the “story.”  If you must read it, feel free to go to byte and switch dot com, click on the Industry Opinion section, and read the article “A Data Deduplication Survival Guide: Part 2.”  So you don’t have to, though, I’ve summarized the article here:

  1. A definition of inline, post-process, and “combination”
  2. Inline dedupe presents an easier to understand picture
  3. Inline backup performance is up to 1 TB/hr.  If you don’t need to go faster than that, why would you buy a post-processing system?
  4. Post processing systems might be able to restore faster, but other things will probably slow it down anyway
  5. If you want fast restores, shouldn’t you use CDP?
  6. Inline dedupe gets data offsite much faster than post-process dedupe.
  7. Dedupe is only part of the story.

I again want to state that I am not slamming the inline method or Data Domain.  I think the inline approach is completely valid and Data Domain is a great company with thousands of happy customers.  I do have to defend post-process, though, because this article (once again) fosters misconceptions about it that I feel the need to dispel.

In addition, I’ll say that I don’t care.  I don’t care why my car goes fast when I hit the gas pedal, and I don’t care how my dedupe vendor does what it does.  Does it back up fast enough?  Does it restore fast enough?  Can it replicate the data offsite in a timely fashion?  Can it store all my data?  AND can I afford it?  That’s all that matters.  Arguing over inline vs post-process is like arguing over hybrids.  I don’t care why my Prius gets 50 MPG.  I just know that it does.

Definition of Inline and Post-Process and “combination”

The article states: “There are basically three “whens” of deduplicating data: inline, post-process, or a combination of the two.”  Actually, inline vs post-process is as much a “how” as it is a “when,” because both inline and post-processing vendors can dedupe data while backups are still going on.

You either store the original (non-deduped) data on disk or you don’t.  If you store the original data on disk for one second, you’re doing post-processing.  If you never store the original data on disk, then you’re doing inline processing.  It’s as simple as that.
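To make that concrete, here’s a minimal Python sketch of the two write paths.  It’s purely illustrative (the names and data structures are mine, not any vendor’s code); the only thing that matters is whether the original block ever lands on disk.

```python
# Minimal sketch of the inline vs. post-process distinction (hypothetical code).
import hashlib

dedupe_store = {}   # fingerprint -> unique block (stands in for the dedupe pool on disk)
backup_refs = []    # the deduped "backup": just a list of fingerprints
landing_zone = []   # post-process only: original (non-deduped) blocks land here first

def inline_write(block: bytes) -> None:
    """Inline: fingerprint and dedupe the block before it ever lands on disk."""
    fp = hashlib.sha1(block).hexdigest()
    if fp not in dedupe_store:
        dedupe_store[fp] = block           # only unique blocks get stored
    backup_refs.append(fp)                 # duplicates become pointers

def post_process_write(block: bytes) -> None:
    """Post-process: store the original block as-is; dedupe comes later (or concurrently)."""
    landing_zone.append(block)

def post_process_dedupe() -> None:
    """The dedupe pass, run after backups finish or while they're still landing."""
    while landing_zone:
        inline_write(landing_zone.pop(0))  # same dedupe logic, just applied after the data hit disk
```

“Concurrent processing,” discussed next, is simply post_process_dedupe() running while post_process_write() is still being called.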

What the author calls “combination” describes vendors that do post-processing while backups are still coming in.  This is what some are calling “concurrent processing.”  I have no issue with the term, per se, but it’s still post-process dedupe.

(There is also one vendor, Quantum, that dynamically switches between inline and post-process depending on throughput, but that is not what this author was talking about.)

Inline dedupe presents an easier to understand picture

The author explains that because post-process vendors store the original data for some period of time, you have to account for that storage, and inline vendors don’t.  That’s probably the most correct statement in the entire article.  It is certainly a part of the design that you must consider when doing post-processing, and you don’t have to consider it with inline.  Score one for inline.

Inline backup performance is up to 1 TB/hr.  If you don’t need to go faster than that, why would you buy a post-processing system?

The author states that inline systems can perform up to 1 TB/hr, and “if your infrastructure can’t sustain much more than 1 Tbyte per hour, then the ease-of-use gains of an inline system outweigh the unrealized performance capabilities of a post-processed system.”    Apparently the author is unaware of any advantages that post-processing would have other than performance.  The author also stated that “no systems support de-duplication across separate appliances.”

This is where the reader should absolutely know that the author has never talked to a post-processing vendor to get their side of the story.  First, there are advantages to post-processing beyond backup performance.  As I discussed in my “inline vs post-process” blog post, there are three other advantages:

  1. It allows for a staggered implementation of dedupe
  2. It allows you to copy your backups from disk to tape before they’ve been touched by the dedupe system
  3. It allows you (if you want) to leave last night’s backup (or more) on disk in its entirety for faster restores.

The disadvantages of post-process are that it has a lot more I/O work to do than inline and that it requires the landing-zone disk.

The second reason you should know the author hasn’t talked to a post-processing vendor is the statement that there are “no systems support[ing] de-duplication across separate appliances.”  That’s complete fiction.  FalconStor, NEC, and SEPATON all have this functionality in general availability.  As to “we expect to see that capability delivered this year,” this is more evidence that he spends all his time talking to Data Domain, as their multi-node dedupe is expected at the end of this year.  Diligent’s is allegedly supposed to ship before then, but we already established he doesn’t seem to want to talk about them.  (He mentioned every major vendor but them in the first installment.)

Post processing systems might be able to restore faster, but other things will probably slow it down anyway.

The author acknowledges that “there could indeed be performance issues with recovering de-duplicated data,” but states that other bottlenecks might get in your way, like the network, the server to receive it over the network, rewriting RAID parity data, and the “fact that writes are slower than reads.”  He then simply moves on, as if restore performance doesn’t matter. Of course!  He doesn’t want to talk about that because he might have to acknowledge that this is an area where post-processing vendors might have an advantage!

So… following his logic…  If I live in a world where I’m restoring via a LAN-free connection (no network bottleneck), and my server isn’t being bogged down by that traffic (no server bottleneck), I’m not writing RAID parity data (not using RAID 4/5/6), and writes aren’t slower than reads (the only time I know they’re slower is when you’re writing parity), and restore performance is a factor, then might post-processing help me?
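Put another way, his list of “other bottlenecks” is really an argument that restore speed is governed by the slowest link in the chain, and it only holds if one of those links is slower than the dedupe rehydration.  Here’s a back-of-the-envelope model in Python; the throughput numbers are placeholders I made up, not benchmarks of any product:

```python
# Restore speed is limited by the slowest component in the path.
def effective_restore_rate(mb_per_sec: dict) -> float:
    """Restore throughput can't exceed the slowest link in the chain."""
    return min(mb_per_sec.values())

# Illustrative placeholder numbers only, not benchmarks of any product.
lan_free_restore = {
    "dedupe_read":  90.0,    # rehydrating deduped data (hypothetical rate)
    "fc_transport": 400.0,   # LAN-free (SAN) path, so no LAN bottleneck
    "client_write": 200.0,   # client disk, no RAID parity penalty assumed
}

# Same restore, but last night's backup was left on disk in native form.
native_restore = dict(lan_free_restore, dedupe_read=250.0)

print(effective_restore_rate(lan_free_restore))  # 90.0  -> rehydration is the bottleneck
print(effective_restore_rate(native_restore))    # 200.0 -> now the client is the bottleneck
```

If rehydration is the slowest link, leaving last night’s backup on disk in native form (advantage #3 above) moves the bottleneck somewhere else, which is exactly the advantage he doesn’t want to talk about.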

If you want fast restores, shouldn’t you use CDP?

I didn’t see this one coming, especially given the obvious Data Domain slant of the article.  Seriously?  You’re telling people that if restore performance matters, they should use CDP?  I’ve got nothing against CDP, but that’s just funny right there.  I don’t care who you are.  (Channeling Larry the Cable Guy.)

Inline dedupe gets data offsite much faster than post-process dedupe.

It’s again obvious that the writer has only been talking to inline vendors, because only someone who doesn’t understand the post-processing approach would describe it in this manner.  He says that post-processing dedupe “has to be completed before replication of the backup data can occur,” and “you may not have enough time to complete the backup job, run the de-duplication process, and then replicate the data.”

That’s completely untrue and nothing more than inline FUD.  The truth is that as soon as a given block of data has been examined by the post-processing dedupe engine, it can be replicated.  So dedupe must start (not finish) before replication can occur.  Combine that with the fact, mentioned above, that many post-processing dedupe vendors are deduping while backups are coming in, and post-processing replication often starts only minutes after the backup starts.
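Here’s a rough Python sketch of what that pipeline looks like.  It’s a simplification of the general concurrent-processing idea, not any particular vendor’s engine:

```python
# Rough sketch (hypothetical) of post-process dedupe feeding replication.
# Each block becomes eligible for replication as soon as the dedupe engine
# has examined it; nothing waits for the whole backup to finish deduping.
import hashlib

seen_local = set()       # fingerprints already stored on the local system
seen_remote = set()      # fingerprints already shipped to (or known at) the replica
replication_queue = []   # unique blocks still waiting to cross the WAN

def dedupe_block(block: bytes) -> None:
    """Examine one block; it can be replicated immediately afterward."""
    fp = hashlib.sha1(block).hexdigest()
    if fp not in seen_local:
        seen_local.add(fp)                     # store the unique block locally
    if fp not in seen_remote:
        replication_queue.append((fp, block))  # can go offsite right now
        seen_remote.add(fp)
```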

Therefore, the example that the author gives of a 7-hour backup followed by 15 hours of dedupe that “leaves you about 2 hours before you have to start the next backup” is complete fiction.  (The author also plants a subliminal jab in that sentence by implying that it takes more than twice as long to dedupe the data as it does to ingest it.  While that is true of some systems, it is not true of all.)

Some post-processing systems can be configured to do dedupe after all backups are done, but it’s an option, not a requirement.  Some post-processing systems take more than twice as long to dedupe the data as they do to ingest it.  (One of them can ingest data 6 times faster than it can dedupe it!)  This is one of my concerns about post-processing: a poorly educated customer combined with an unscrupulous or poorly educated sales rep can give you just enough rope to hang yourself (a system that’s fast enough to back up your data, but not fast enough to dedupe before the next backup occurs).
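This is also easy to sanity-check for yourself during an evaluation.  Here’s a quick Python sketch of the math; the sizes and rates are placeholders you’d replace with your backup sizes and the vendor’s measured (not data-sheet) numbers:

```python
# Will dedupe finish before the next backup has to start?
# All numbers below are placeholders, not measurements of any product.

def post_process_fits(backup_tb: float, ingest_tb_hr: float,
                      dedupe_tb_hr: float, window_hr: float,
                      concurrent: bool) -> bool:
    backup_hr = backup_tb / ingest_tb_hr
    dedupe_hr = backup_tb / dedupe_tb_hr
    # Concurrent processing dedupes while backups land, so the two overlap;
    # otherwise dedupe can't start until all the backups are done.
    total_hr = max(backup_hr, dedupe_hr) if concurrent else backup_hr + dedupe_hr
    return total_hr <= window_hr

# The article's example: 7 hours of backups, 15 hours of dedupe, 24-hour window.
print(post_process_fits(7.0, 1.0, 7.0 / 15.0, 24.0, concurrent=False))  # True  (22 hrs)
print(post_process_fits(7.0, 1.0, 7.0 / 15.0, 24.0, concurrent=True))   # True  (15 hrs)
# A 6x ingest-to-dedupe mismatch fits today... until the backups grow:
print(post_process_fits(20.0, 6.0, 1.0, 24.0, concurrent=True))         # True  (20 hrs)
print(post_process_fits(30.0, 6.0, 1.0, 24.0, concurrent=True))         # False (30 hrs)
```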

The post-processing systems that I’m working with can do concurrent processing of backups (dedupe while backups are still going on), can do so at rates that are fast enough to keep up with the incoming backups, and can start and finish their replication within minutes of inline vendors.  So this whole “if you want to replicate, then you should really use an inline vendor” argument is just bunk.

Dedupe is only part of the story.

Again, this is something I totally agree with.  Dedupe is a feature, not a product.  It reduces the amount of disk you need, but you need your target dedupe vendor to do a whole lot more than that.

Written by W. Curtis Preston (@wcpreston), four-time O'Reilly author, and host of The Backup Wrap-up podcast. I am now the Technology Evangelist at Sullivan Strickler, which helps companies manage their legacy data.

3 comments
  • I think it is interesting how DD is positioning themselves in the de-dupe card game now that other players have arrived at the final table. QTM, EMC, and others have DD a bit concerned, and so DD must spread the word that inline is the best and post-process is the worst. After all, they have been doing it the longest and are the best at it. Aren’t they? It is just too bad that so-called industry experts from Byte and Switch are spreading FUD for them. Let DD do their own dirty work. They are very good at it, trust me. It is sad to think that some poor backup admin somewhere is going to read that and be totally convinced that inline is the only way to go because some “expert” said so in some article somewhere. “But Curtis, how do I choose the right de-dupe solution for my business with all of this misinformation?” Simple: do your homework and TEST the product with YOUR data. After all, it still comes down to “How big is it? How fast is it? How much does it cost?” Personally speaking, I like the flexibility option when it comes to de-dupe.

  • I’m pretty sure DD pays George, and obviously George is behind it, but I doubt that DD instructed George to write this junk. Although some of their sales reps have been liable to say just about anything to win a deal, the marketing folks that would be paying George don’t seem the type.

    They’ve always positioned inline as the best way. I would expect them to do so; it’s the way they do it. I just don’t like it when an “independent” author tells only one side of the story. There are advantages and disadvantages to both approaches, and an independent article that’s trying to cover the subject should cover both.

  • I always use a simple test with the sales people: If I say “I want to do X” and you don’t ask me at least 5 questions before you suggest something, I’m going somewhere else. And only 1 of those 5 gets to be about budget. I never trust single-answer types. Your solution may be perfect for me, but if you don’t have any other answers I just can’t trust your motives.