Why doesn’t everyone do source dedupe?

It seems to me that source dedupe is the most efficient way to back up data, so why do so few products do it? That’s what I found myself thinking about today.

Source dedupe is the way to go

This is my opinion and always has been, ever since I first learned about Avamar in 1998 (when it was called Undoo). If you can eliminate duplicate data across your enterprise – even before it’s sent – why wouldn’t you want to do that? It saves bandwidth and storage. Properly done, it makes backups faster and does not slow down restores. It’s even possible to use dedupe in reverse to speed up restores.

If properly done, it also reduces the CPU load on the client. A typical incremental backup (without dedupe) and a full backup both use far more compute cycles than it takes to generate the hashes used to identify duplicate data.
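To make that concrete, here is a toy sketch of what a source dedupe client does: chunk the data, hash each chunk, and send only the chunks the backup system has not seen before. (This is just the idea, not any particular product’s code. Fixed-size chunks and an in-memory set keep it simple, where real products typically use variable-size chunking and a persistent, shared hash index.)

```python
import hashlib

CHUNK_SIZE = 64 * 1024  # fixed-size chunks keep the sketch simple

def backup_file(path, known_hashes, send_chunk):
    """Hash each chunk; send it only if the backup system hasn't stored it already."""
    sent = skipped = 0
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK_SIZE):
            digest = hashlib.sha256(chunk).hexdigest()
            if digest in known_hashes:    # duplicate: just record a reference to it
                skipped += 1
            else:                         # new data: this is all that crosses the wire
                send_chunk(digest, chunk)
                known_hashes.add(digest)
                sent += 1
    return sent, skipped
```

Hashing a chunk is cheap compared to reading, packaging, and transmitting it, which is why the net effect is less client CPU, not more.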

You save bandwidth, storage, and CPU cycles.  So why don’t all products do this?

Inertia

Products that have been around a while have a significant code base to maintain. Switching to source dedupe requires massive architectural changes that can’t easily be bolted onto an existing customer’s installation. It might require a “rip and replace” of the old product for the new one, which isn’t something you want to put a customer through.

Update: An earlier version of the post said some things about specific products that turned out to be out of date. I’ve removed those references. My question still remains, though.

No mandate

None of the source dedupe products have torn up the market. For example, if Avamar had become so popular that it was displacing the vast majority of backup installations, competitors would have been forced to come up with an answer. (The same could be said of CDP products, which could also be described as a much better way to do backups and restores. Very few true CDP products have had significant success.) But the market did not create a mandate for source dedupe, and I’ve often wondered why.

Limitations

Many source dedupe implementations had limitations that made some people think it wasn’t the way to go. The biggest one I know of is that restore speeds for larger datasets were often slower than what you would get from traditional disk or a target dedupe disk. It seemed that developers of source dedupe solutions had committed that cardinal sin of making the backup faster and better at the expense of restore speed.

Another limitation of both source and target dedupe – though arguably more important in source dedupe implementations – is that the typical architectures used to hold the hash index topped out at some point. The hash index could only handle datasets of a certain size before it could no longer reliably keep up with the backup speeds customers needed.

The only solution to this problem was to create another hash index, which creates a dedupe island. This reduces the effectiveness of dedupe, because apps backed up to one dedupe island will not dedupe against another dedupe island. That increases both bandwidth usage and overall cost, since more data gets stored as well.
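Here is a toy illustration of the island effect (the names are made up, and it reuses the set-based index idea from the earlier sketch): the same three chunks backed up through two separate indexes get stored twice, while a single shared index stores them once.

```python
def store(chunk_hashes, index):
    """Store any chunk this index hasn't seen; return how many new copies hit disk."""
    new = [h for h in chunk_hashes if h not in index]
    index.update(new)
    return len(new)

app_chunks = {"h1", "h2", "h3"}      # two apps that happen to contain identical chunks

island_a, island_b = set(), set()    # two dedupe islands: 3 + 3 = 6 copies stored
print(store(app_chunks, island_a) + store(app_chunks, island_b))

shared = set()                       # one shared index: 3 + 0 = 3 copies stored
print(store(app_chunks, shared) + store(app_chunks, shared))
```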

This is one limitation my current employer worked around by using a massively scalable NoSQL database – DynamoDB – that is available to us in AWS. Where typical dedupe products top out at 100 TB or so, we have customers with over 10 PB of data in a single environment, all being deduped against each other. And this implementation doesn’t slow down backups or restores.
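To give a rough idea of how a single shared index can work at that scale (this is a simplified sketch, not our actual implementation; the table and attribute names are made up), DynamoDB’s conditional writes let every client check for and register a chunk hash in one atomic step:

```python
import boto3
from botocore.exceptions import ClientError

# Hypothetical table and key names, for illustration only.
table = boto3.resource("dynamodb").Table("chunk-index")

def register_chunk(chunk_hash: str) -> bool:
    """Return True if this chunk is new to the environment, False if it's a duplicate."""
    try:
        table.put_item(
            Item={"chunk_hash": chunk_hash},
            ConditionExpression="attribute_not_exists(chunk_hash)",
        )
        return True   # first client anywhere to see this chunk: send the data
    except ClientError as e:
        if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # some other client already stored it: dedupe hit
        raise
```

Because every client consults the same index, there are no islands: a chunk stored by any client anywhere is a dedupe hit for every other client.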

What do you think?

Did I hit the nail on the head, or is there something else I’m missing?  Why didn’t the whole world go to source dedupe?

Written by W. Curtis Preston (@wcpreston), four-time O'Reilly author, and host of The Backup Wrap-up podcast. I am now the Technology Evangelist at Sullivan Strickler, which helps companies manage their legacy data.

10 comments
  • Your assessment of TSM seems to be outdated. Now called Spectrum Protect, it has source dedup capabilities.

    • Fair point, and I’ll update the post.

      I’ve followed the product since it was called ADSM, and was aware of the recent name change. In my defense, I had to do quite a few Google searches before I could verify your statement. Neither the product description page nor the downloadable whitepaper mentions it. But the documentation does, so point taken.

  • I run a small business, and have done so since 1986. Since around 2000, when 80 GB IDE consumer grade drives became about $220.00 each, I have been backing up my network (about 20 machines, mostly linux, but a few MS-Windows boxes) to a backup server using a 3-way RAID1 mirror.

    I connect to the client machine via ssh and tunnel rsync over that to do the backups. On Linux boxes, I run LVM and stabilize everything, then take an LVM snapshot, then I let everything run again and back up from the snapshot. On the backup server, I use a file-level dedup strategy, so identical files get hard linked from day to day instead of being copied.

    Nowadays, I use three 6 TB ATA magnetic spinner drives. Every month, the oldest drive in the array gets yanked and replaced with a nice fresh drive. The entire drive is encrypted with LUKS, so it is safe to leave the facility. I take the yanked drive to the safe deposit box at the bank. This strategy has served me well over the years. I was able to go back three years for an IRS tax audit, and the auditor was impressed.

    So you can see that I have been using source level dedup (via rsync) for 18 years. I would like to use target level dedup on the backup array, but I haven’t had time to research all the options yet.
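    For reference, one common way to get that day-to-day hard-link behavior is rsync’s --link-dest option. Here is a minimal Python sketch of such a rotation; the host, paths, and exact flags are placeholders rather than the precise setup described above.

    ```python
    import datetime
    import subprocess
    from pathlib import Path

    # Placeholder layout, for illustration only.
    BACKUP_ROOT = Path("/backup/clients/examplehost")
    SOURCE = "user@examplehost:/home/"

    def nightly_backup():
        """rsync over ssh, hard-linking unchanged files to the previous snapshot."""
        BACKUP_ROOT.mkdir(parents=True, exist_ok=True)
        today = BACKUP_ROOT / datetime.date.today().isoformat()
        previous = sorted(p for p in BACKUP_ROOT.iterdir() if p.is_dir() and p != today)
        cmd = ["rsync", "-a", "--delete", "-e", "ssh"]
        if previous:
            # Files identical to the last snapshot become hard links, not new copies.
            cmd += ["--link-dest", str(previous[-1])]
        cmd += [SOURCE, str(today)]
        subprocess.run(cmd, check=True)
    ```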

    • I’ve used that backup technique as well, and it works for relatively small setups.

      But I would not call it source dedupe. File-level dedupe helps, but not as much as block-level dedupe and cross-client block-level dedupe.

  • Hi Curtis,

    Source-side dedupe is de facto today, and has been for a long time:
    – NetWorker has had DDBoost since 2010: https://www.emc.com/about/news/press/2010/20100511-01.htm
    – NetBackup has had PureDisk integrated since 2010: https://www.veritas.com/support/en_US/article.100022314
    – TSM has had client-side deduplication since 2010, with the 6.2 release: http://www-01.ibm.com/cgi-bin/common/ssi/ssialias?infotype=an&subtype=ca&htmlfid=897/ENUS210-040&appname=isource&language=enus#descx

    So those “massive architectural changes” were done eight years ago.

    However, I’ve not noticed any of them using dedupe in reverse to speed up restores.

    /Piippu

    • Thanks for updating my information. One of the challenges of following the entire industry is not being able to go deep on every feature of every product. My intention was not to dis those products, but to ask why NEW products are coming out without source dedupe built into them. That I really don’t understand, which is why I disagree with your comment that it is de facto.

      I’ve removed my specific references to those products. But I’ll comment here.

      NetWorker does not have client-side dedupe. Data Domain offers a client-side dedupe option for NetWorker (DDBoost). There is a distinct difference. Actual client-side dedupe does not require the purchase of a target dedupe machine.

      I followed the PureDisk stuff when they first had it. I’m glad to see it made its way into the main product. (For a while there, I wasn’t sure that was going to happen.) Having said that, it’s not without its limitations. For example, I can’t use inline copying if I use source-side dedupe. It also doesn’t support NAS backups.

      I haven’t looked at TSM closely in a while, so I’m glad to see it now has source-side dedupe. The link you listed mentioned it was only for file data. Do you know if it now supports other agents?

  • Client-side anything often requires some type of footprint on the client (agent, processing, scratch space, etc.). That idea is anathema to the newer products. Customers want easy and simple rather than efficient.

    • There must always be client-side something. Even agentless configurations put the agent somewhere. For example, an “agentless” VMware setup puts the agent at the host level, not the VM level. Another “agentless” product I know simply puts the agent on automatically at time of backup, then takes it away after the backup. (It just means you don’t have to manage the agent.)

      So… if we agree there’s always an agent somewhere, really all we’re discussing is whether or not the footprint of a dedupe agent would be heavier than a non-dedupe agent. And I would argue that if the dedupe agent footprint is heavier, then something is wrong. As I wrote in another post, good dedupe makes things faster.

      https://backupcentral.com/dedupe-done-right-speeds-up-backups/