On my LinkedIn profile, I posted a link to my last article, Why good dedupe is important — and hard to do. I got some pretty good feedback on it, but one comment from my buddy Chris M. Evens (@chrismevans) got me thinking.
“Curtis, it’s worth highlighting that space optimisation may not be your only measurement of dedupe performance. The ability to do fast ingest with a poorer level of dedupe (which is then post processed) could be more attractive. Of course, you may be intending to talk about this in future posts…”
I’m glad you asked, Chris! (BTW, Chris lives over yonder across the pond, so he spells things funny.) Here’s my short answer, followed by the longer one:
If dedupe is done right, it speeds up backups and doesn’t slow them down.
Target dedupe can slow down backups
I think Chris’ comment stems primarily from viewing dedupe as something that happens in a target dedupe appliance. I have run backups to a number of these appliances over the years, and Chris is right. Depending on the architecture — especially decisions made about dedupe efficiency vs speed — a dedupe appliance can indeed slow down the backup system.
This is actually why I traditionally preferred the post-process way of doing dedupe when I was looking at target appliances. A post-process system (e.g. Exagrid) first stores all backups in their native format in a landing zone. Those backups are then deduped asynchronously. This made sure that the dedupe process — which can be very CPU, RAM, and I/O intensive — didn’t slow down the incoming backup.
An inline approach (e.g. Data Domain) dedupes the data before it is ever written to disk. Proponents of the inline approach say that it saves you from having to buy the disk for the landing zone, and that it is more efficient to dedupe the data first. They claim that the compute power required to dedupe data inline is made up for by a significant reduction in I/O.
But I generally preferred the post-process approach for two reasons. The biggest was that it left the latest backup in its native format in the landing zone, creating a significant performance advantage during restores — especially instant-recovery-style restores. The other reason I preferred post-process dedupe was the performance impact I had seen inline dedupe have on backups.
Chris’ point was that strong dedupe can impact the performance of the backup, and I have seen just that with several inline dedupe solutions. Customers who really noticed this were those that had already grown accustomed to disk-based backup performance.
If you were used to tape performance (due to the speed mismatch issue I covered here) then you didn’t really notice anything. But if you were already backing up a large database or other server to disk, and then switched that backup to a target dedupe appliance, your backup times might actually increase — sometimes by a lot. I remember one customer who told me their Exchange backups were taking three times longer after they switched from a regular disk array to a popular target dedupe appliance.
Target dedupe was — and still is — a band-aid
The goal of target dedupe was to introduce the goodness of dedupe into your backup system without requiring you to change your backup software. Just point your backups to the target dedupe appliance and magic happens. It was a band-aid, and I contend it still is.
But doing dedupe at the target is much harder — read more expensive — than doing it at the source. The biggest reason is that the dedupe appliance is not looking at your files; it’s looking at a “tar ball” of your files. It’s looking at your files inside a backup container, many of which are cryptic and difficult to parse. A lot of work has to go into deciphering and properly “chunking” the backup formats. That work translates into development cost and computing cost, all of which gets passed down to you.
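To make "chunking" a little more concrete, here is a toy sketch of content-defined chunking in Python. This is my own illustration, not any vendor's actual algorithm; the rolling checksum and all the constants are simplifications chosen for readability. The idea is that chunk boundaries come from the content itself, so a few inserted bytes don't shift every boundary downstream the way fixed-size chunks would:

```python
import hashlib

# Toy content-defined chunker. The rolling checksum and all constants are
# simplifications for illustration; real products use tuned rolling hashes
# and much larger average chunk sizes.

BOUNDARY_MASK = 0x3FF         # cut when checksum & mask == 0 (~1 KB average)
MIN_CHUNK, MAX_CHUNK = 256, 4096

def chunk_stream(data: bytes) -> list[bytes]:
    """Return the list of chunks, with boundaries driven by the content."""
    chunks, start, rolling = [], 0, 0
    for i, byte in enumerate(data):
        rolling = ((rolling << 1) ^ byte) & 0xFFFFFFFF
        size = i - start + 1
        # Cut a chunk when the checksum hits the boundary pattern (but not
        # before MIN_CHUNK), or when the chunk would exceed MAX_CHUNK.
        if (size >= MIN_CHUNK and (rolling & BOUNDARY_MASK) == 0) or size >= MAX_CHUNK:
            chunks.append(data[start:i + 1])
            start, rolling = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])   # trailing partial chunk
    return chunks

# Hashing each chunk yields the fingerprints a dedupe index would store:
fingerprints = [hashlib.sha256(c).hexdigest()
                for c in chunk_stream(b"example data" * 500)]
```

A target appliance has to do this against an opaque backup container; when the chunk boundaries don't line up with the real file boundaries inside it, dedupe suffers, which is why vendors spend so much effort parsing those formats.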
The second reason target dedupe is the wrong way to go is that it removes one of the primary benefits of dedupe: bandwidth savings. With a few exceptions (e.g. Boost), your network sees no benefit from dedupe. The entire backup — fulls and incrementals alike — is transferred across the network.
It was a band-aid, and it did a good job of introducing dedupe into the backup system. But now that we see the value of it, it’s time to do it right. It’s time to start deduping before we back up, not after.
Source dedupe is the way to go
Source dedupe is done at the very beginning of the backup process. Every new or modified file is parsed, and a hash is calculated for each chunk of its contents. If that hash has been seen before, that chunk doesn’t need to be transferred across the network.
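Here is a minimal sketch of that source-side hash check. Everything in it is illustrative: real products use content-defined chunks rather than fixed-size ones, and the hash index lives on the backup server and is queried over the network, not kept in a local set:

```python
import hashlib

# Minimal sketch of the source-dedupe idea. The index of already-seen
# hashes is a plain set here; a real product keeps a global index on the
# backup server and looks hashes up over the network.

CHUNK_SIZE = 4096          # fixed-size chunks keep the sketch simple
seen_hashes = set()        # stands in for the global dedupe index

def backup_file(data: bytes):
    """Return only the chunks that must cross the network."""
    new_chunks = []
    for off in range(0, len(data), CHUNK_SIZE):
        chunk = data[off:off + CHUNK_SIZE]
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in seen_hashes:       # unseen chunk: send it
            seen_hashes.add(digest)
            new_chunks.append((digest, chunk))
    return new_chunks

# A second "full backup" of identical data transfers nothing:
payload = b"A" * 10_000
first = backup_file(payload)    # only the unique chunks get sent
second = backup_file(payload)   # everything is already in the index
```

The point of the sketch is the last two lines: once a chunk's hash is in the index, repeating the backup moves no data at all, which is exactly why a source-dedupe product never has to send a full backup twice.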
There are multiple reasons why source dedupe is the way to go. The biggest are purchase cost, performance, and storage and bandwidth savings.
Target dedupe is expensive because it is developmentally and computationally expensive. I used to joke that a target dedupe appliance makes 10 TB look like 200 TB to the backup system, but they’d only charge you for 100 TB. Yes, target dedupe appliances make the impossible possible, but they also charge you for it.
They also charge for it over and over. Did you ever think about the fact that all the hard work of dedupe is done only by the first appliance? One could argue, then, that only the first appliance should carry that price premium. But you know that isn’t the case; you pay the dedupe premium on every target dedupe appliance you buy, right? Source dedupe systems can charge once for the dedupe, then replicate that backup to many locations without charging you for it again.
Source dedupe is also much faster. One reason for that is that it never has to dedupe a full backup ever again. Target appliances are forced to dedupe full backups all the time, because the backup software products all need to make them once in a while. A source dedupe product does one full, and block-level incrementals after that. Another reason source dedupe is faster is that it can look directly at the files being backed up, instead of having to divine the data hidden behind a cryptic backup format.
Finally, because source dedupe is looking directly at the data, it can dedupe better and get rid of more duplicate data. That saves bandwidth and storage, further reducing your costs — and speeding up the backup. The more you are using the cloud, the more important this is. Every deduped bit reduces your bandwidth cost and the bill you will pay the cloud vendor every month.
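To put some admittedly made-up numbers on that: every bit that dedupe removes is a bit you neither transfer nor pay to store each month. The dedupe ratio and the per-GB rate below are assumptions for illustration only, not any vendor's actual pricing:

```python
# Hypothetical numbers to make the math concrete; the ratio and rate
# below are assumptions, not any vendor's actual pricing.
full_backup_tb = 10            # size of one full backup, pre-dedupe
dedupe_ratio = 20              # assume 20:1 across fulls and incrementals
stored_tb = full_backup_tb / dedupe_ratio     # 0.5 TB actually sent/stored

storage_rate_per_gb = 0.023    # $/GB-month, an assumed cloud storage rate
monthly_bill = stored_tb * 1024 * storage_rate_per_gb
monthly_bill_no_dedupe = full_backup_tb * 1024 * storage_rate_per_gb

print(f"${monthly_bill:.2f}/month with dedupe "
      f"vs ${monthly_bill_no_dedupe:.2f}/month without")
```

Whatever the real numbers turn out to be for you, the shape of the math is the same: the bill scales with what you store and transfer after dedupe, not before.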
Dedupe done right speeds up backups
This is why I said to Chris that this problem of being forced to decide between dedupe ratio and backup performance really only applies to target dedupe. Source dedupe is faster, cheaper, and saves more storage than any other method. It’s been 20 years now since I was first introduced to the concept of dedupe. I think it’s time we start doing it right.
----- Signature and Disclaimer -----
Written by W. Curtis Preston (@wcpreston). For those of you unfamiliar with my work, I've specialized in backup & recovery since 1993. I've written the O'Reilly books on backup and have worked with a number of native and commercial tools. I am now Chief Technical Architect at Druva, the leading provider of cloud-based data protection and data management tools for endpoints, infrastructure, and cloud applications. These posts reflect my own opinion and are not necessarily the opinion of my employer.