I’m working on an update to my Mar 09 story on dedupe performance. While there’s still the argument of global dedupe or not, the hardest problem I’m having is with post-process vendors not publishing their dedupe rates.
Inline vendors are a piece of cake in comparison. Just look at their data sheet, and right up front it will say how fast the box is. But the post-processing vendors are hiding behind their ingest numbers only. Why?
Look at the data sheets from ExaGrid, FalconStor, Quantum, and SEPATON. Every single one of them advertises only an ingest rate. And every single one of them has a dedupe rate that is at best half their ingest rate — but they do not advertise it. Why is that? Are they trying to hide something?
I believe strongly in truth in advertising. I think it’s bogus that Data Domain advertises a DDX “array” as one system with one throughput number when everyone knows it’s 16 completely separate systems with no dedupe knowledge of each other. They might as well advertise a “DDY battery” that’s made up of 1000 DD 880s and say they have a system with the throughput of 5.4 PB/hr! It would be just as truthful as the DDX array.
And I think it’s wrong that post-processing vendors don’t advertise their dedupe rates, because it’s really important for comparing and architecting systems. If you can only dedupe 500 MB/s, it doesn’t matter if you can ingest 10000 MB/s. You can only dedupe 43 TB/day, so you can only ingest 43 TB/day. Maybe you can ingest data at 10000 MB/s, but you can only do it for about 1.2 hours (4,320 seconds) before you ingest more data than you can dedupe in a day.
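If you want to run this math against a vendor's numbers yourself, here is a minimal sketch. It assumes decimal units (1 TB = 1,000,000 MB) and a dedupe engine that runs around the clock at a constant rate. The function names are my own for illustration and don't come from any vendor's tooling.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def daily_dedupe_capacity_tb(dedupe_mb_per_s: float) -> float:
    """TB the box can dedupe in 24 hours (assumes decimal units, constant rate)."""
    return dedupe_mb_per_s * SECONDS_PER_DAY / 1_000_000


def max_ingest_hours(ingest_mb_per_s: float, dedupe_mb_per_s: float) -> float:
    """Hours of full-speed ingest before a full day's dedupe capacity is used up."""
    daily_capacity_mb = dedupe_mb_per_s * SECONDS_PER_DAY
    return daily_capacity_mb / ingest_mb_per_s / 3600


if __name__ == "__main__":
    # The example figures from the paragraph above: 500 MB/s dedupe, 10000 MB/s ingest.
    print(f"{daily_dedupe_capacity_tb(500):.1f} TB/day")       # 43.2 TB/day
    print(f"{max_ingest_hours(10_000, 500):.1f} hours ingest")  # 1.2 hours ingest
```

Plug in the ingest rate from the data sheet and whatever dedupe rate you can get the vendor to tell you, and you'll see how big your backup window really is.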
These numbers matter. So why don’t they publish them?
----- Signature and Disclaimer -----
Written by W. Curtis Preston (@wcpreston). For those of you unfamiliar with my work, I've specialized in backup & recovery since 1993. I've written the O'Reilly books on backup and have worked with a number of native and commercial tools. I am now Chief Technical Evangelist at Druva, the leading provider of cloud-based data protection and data management tools for endpoints, infrastructure, and cloud applications. These posts reflect my own opinion and are not necessarily the opinion of my employer.