Why don't post-process vendors publish their dedupe rates?

I’m working on an update to my Mar 09 story on dedupe performance.  While there’s still the argument over whether or not to do global dedupe, the hardest problem I’m having is that post-process vendors don’t publish their dedupe rates.

Inline vendors are a piece of cake in comparison.  Just look at their data sheet, and right up front it will say how fast the box is.  But the post-processing vendors are hiding behind their ingest numbers only.  Why?

Look at the data sheets from ExaGrid, FalconStor, Quantum, and SEPATON.  Every single one of them advertises only an ingest rate.  And every single one of them has a dedupe rate that is at best half their ingest rate — but they do not advertise it.  Why is that?  Are they trying to hide something?

I believe strongly in truth in advertising.  I think it’s bogus that Data Domain advertises a DDX “array” as one system with one throughput number when everyone knows it’s 16 completely separate systems with no dedupe knowledge of each other.  They might as well advertise a “DDY battery” that’s made up of 1000 DD 880s and say they have a system with the throughput of 5.4 PB/hr!  It would be just as truthful as the DDX array.
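
To show how empty that marketing math is, here’s a quick sketch in Python.  The 5.4 TB/hr per-box figure is simply what my hypothetical 5.4 PB/hr number implies for 1000 DD880s; it’s an illustration, not anything from a data sheet.

# Illustration only: the per-node figure is implied by the hypothetical
# above (5.4 PB/hr across 1000 DD880s), not taken from any data sheet.

def aggregate_tb_per_hr(per_node_tb_per_hr, node_count):
    # N completely independent boxes: nothing global happens here,
    # it's just multiplication, which is exactly the point.
    return per_node_tb_per_hr * node_count

print(aggregate_tb_per_hr(5.4, 16))    # a 16-head "array": 86.4 TB/hr
print(aggregate_tb_per_hr(5.4, 1000))  # the "DDY battery": 5400 TB/hr, i.e. 5.4 PB/hr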

And I think it’s wrong that post-processing vendors don’t advertise their dedupe rates, because those rates are really important for comparing and architecting systems.  If you can only dedupe 500 MB/s, it doesn’t matter if you can ingest 10,000 MB/s.  You can only dedupe about 43 TB/day, so you can only ingest about 43 TB/day.  Maybe you can ingest data at 10,000 MB/s, but you can only do it for about 1.2 hours before you’ve ingested more data than you can dedupe in a day.
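
If you want to check that math yourself, here’s the back-of-the-envelope version in Python.  The 500 MB/s and 10,000 MB/s figures are the hypothetical numbers from the paragraph above, not any vendor’s specs.

# Back-of-the-envelope sketch; 500 MB/s dedupe and 10,000 MB/s ingest are
# the hypothetical figures from the paragraph above, not vendor specs.

SECONDS_PER_DAY = 86_400

def daily_dedupe_tb(dedupe_mb_per_s):
    # Total data a post-process box can dedupe in 24 hours, in TB.
    return dedupe_mb_per_s * SECONDS_PER_DAY / 1_000_000

def full_speed_hours(ingest_mb_per_s, dedupe_mb_per_s):
    # How long you can ingest at full speed before you've taken in
    # more data than the box can dedupe in a day.
    return (dedupe_mb_per_s * SECONDS_PER_DAY) / ingest_mb_per_s / 3600

print(daily_dedupe_tb(500))            # -> 43.2 TB/day
print(full_speed_hours(10_000, 500))   # -> 1.2 hours at full ingest speed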

These numbers matter. So why don’t they publish them?

Written by W. Curtis Preston (@wcpreston), four-time O'Reilly author, and host of The Backup Wrap-up podcast. I am now the Technology Evangelist at Sullivan Strickler, which helps companies manage their legacy data.

2 comments
  • Curtis,

    We agree with your statements; advertising should be truthful. Our thought is that dedupe performance should be linear across data types, but compression is not unless it’s generic. Vendors shouldn’t shy away from publishing a “burst throughput” and a “dedupe throughput” number…users are smart, and burst performance can be meaningful. Also, a global dedupe namespace may represent a new performance bottleneck that DDUP seeks to avoid; that is something to consider. If a zero-overlap data set arrives at a DD880, will the dedupe throughput be the same as with a high-overlap data set? End-to-end dedupe improves backup performance in two ways:
    1) Less throughput required on the source(s)
    2) Less dedupe required on the target

    Curtis, we have many more thoughts on this; we’d be open for a chat sometime soon if you’d like.

  • As software dedupe starts to gain momentum over hardware-based solutions, we could also add it to the test suite.

    I have yet to see a TRUE backup benchmark where the servers, clients, and data are the same for all hardware and software tests. That way it would be an apples-to-apples comparison. So far it’s apples versus oranges with vendors. No one controls how much duplication there is across the files or data blocks, and few share the hardware or setup they used. They all say mileage will vary based on your data…

    I would also like to see a regular backup stream to recent tape (LTO-5) compared against any dedupe hardware or software. We might be surprised…

    Bottom line: they provide good indicators, but certainly not something to base your design on without adding some buffer.