It seems to me that source dedupe is the most efficient way to back up data, so why is it that so few products do it? That's what I found myself thinking about today.
Source dedupe is the way to go
This is my opinion and always has been, ever since I first learned about Avamar in 1998 (when it was called Undoo). If you can eliminate duplicate data across your enterprise – even before it's sent – why wouldn't you want to do that? It saves bandwidth and storage. Properly done, it makes backups faster and doesn't slow down restores. It's even possible to use dedupe in reverse to speed up restores.
If properly done, it also reduces the CPU load on the client. A typical incremental or full backup (without dedupe) uses far more compute cycles than it takes to generate the hashes that drive the dedupe.
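To make the mechanics concrete, here's a minimal sketch of what source dedupe does on the client: chunk the data, hash each chunk, and send only the chunks the server hasn't seen. The chunk size and the `index`/`store` calls are hypothetical, not any particular product's API, and real products generally use variable-size chunking rather than fixed-size.

```python
import hashlib

CHUNK_SIZE = 4 * 1024 * 1024  # fixed-size 4 MiB chunks for simplicity

def backup_file(path, index, store):
    """Send only chunks the backup environment hasn't seen before."""
    recipe = []  # ordered list of chunk hashes, enough to rebuild the file
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK_SIZE):
            digest = hashlib.sha256(chunk).hexdigest()
            if not index.has(digest):      # hypothetical index lookup
                store.put(digest, chunk)   # ship this chunk once, ever
                index.add(digest)
            recipe.append(digest)
    return recipe  # the "backup" of this file is just this list
```

The point of the sketch: the only work the client does per byte is a hash, and anything already in the index never crosses the wire at all.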
You save bandwidth, storage, and CPU cycles. So why don’t all products do this?
Products that have been around a while have a significant code base to maintain. Changing to source dedupe requires massive architectural changes that can't simply be bolted onto an existing product. It might require a "rip and replace" from the old architecture to the new, which isn't something you want to put an existing customer through.
Update: An earlier version of the post said some things about specific products that turned out to be out of date. I’ve removed those references. My question still remains, though.
None of the source dedupe products have torn up the market. For example, if Avamar had become so popular that it was displacing the vast majority of backup installations, competitors would have been forced to come up with an answer. (The same could be said of continuous data protection (CDP) products, which could also be described as a much better way to do backups and restores; very few true CDP products have had significant success.) But the market did not create a mandate for source dedupe, and I've often wondered why.
Many of the source dedupe implementations had limitations that made some think it wasn't the way to go. The biggest one I know of is that restore speeds for larger datasets were often slower than what you would get from traditional disk or a target dedupe appliance. Rehydrating deduped data means reassembling each file from chunks scattered across the repository, which turns what should be a sequential read into many random reads. It seemed that developers of source dedupe solutions had committed that age-old sin of making the backup faster and better at the expense of restore speed.
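You can see the problem from the restore side. A restore just walks the recipe and fetches every chunk, but each chunk lives wherever the backup that first wrote it put it. A sketch, reusing the hypothetical recipe and store from above:

```python
def restore_file(recipe, store, out_path):
    """Rehydrate a file from its chunk recipe.

    Each fetch can be a random read into the chunk store: the first
    backup that wrote a given chunk decided where it lives, so a file
    that was written sequentially comes back scattered.
    """
    with open(out_path, "wb") as out:
        for digest in recipe:
            out.write(store.get(digest))  # hypothetical chunk fetch
```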
Another limitation of both source and target dedupe – though arguably more important in source dedupe implementations – is that the typical architectures used to hold the hash index topped out at some point. The hash index could only track datasets of a certain size before it could no longer reliably keep up with the backup speeds customers needed, often because the index had to stay in memory to keep lookups fast.
The only solution to this problem was to create another hash index, which creates a dedupe island. That reduces the effectiveness of dedupe, because data backed up to one island will not dedupe against data in another. It also increases bandwidth usage and overall cost, since the same chunks end up stored in every island that sees them.
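A toy illustration of the cost: the same chunk backed up into two islands gets shipped and stored twice, where a single global index would store it once.

```python
import hashlib

island_a, island_b = {}, {}  # two independent hash indexes

def backup_to(island, digest, chunk):
    if digest not in island:   # only checks *this* island
        island[digest] = chunk # stored again if another island already has it

chunk = b"the same block of data behind two different apps"
digest = hashlib.sha256(chunk).hexdigest()
backup_to(island_a, digest, chunk)
backup_to(island_b, digest, chunk)  # duplicate: island B can't see island A
```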
This is one limitation my current employer worked around by using a massively scalable NoSQL database – DynamoDB – that is available to us in AWS. Where typical dedupe products top out at 100 TB or so, we have customers with over 10 PB of data in a single environment, all deduped against each other. And this implementation doesn't slow down backups or restores.
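I obviously can't show that implementation here, but the general pattern of a global chunk index in DynamoDB looks something like the sketch below: one item per chunk hash, with a conditional write so the first writer wins. The table name and attribute are made up for illustration.

```python
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("chunk-index")  # hypothetical table, partition key: chunk_hash

def register_chunk(digest):
    """Return True if this chunk is new to the whole environment.

    The conditional write makes check-and-insert atomic, so every client
    everywhere dedupes against one global index: no islands.
    """
    try:
        table.put_item(
            Item={"chunk_hash": digest},
            ConditionExpression="attribute_not_exists(chunk_hash)",
        )
        return True   # first writer: go ahead and upload the chunk
    except table.meta.client.exceptions.ConditionalCheckFailedException:
        return False  # someone, somewhere, already stored it
```

Because the index is a distributed database rather than an in-memory structure on one appliance, it grows with the dataset instead of topping out.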
What do you think?
Did I hit the nail on the head, or is there something else I’m missing? Why didn’t the whole world go to source dedupe?