The Perils of Hardware

No one likes hardware; they only like what they can do with it.  And I say this as a geek who has built plenty of PCs in my house, including a Hackintosh.  What kind of sick weirdo builds their own Mac?  Well, you know what? That Hackintosh illustrates the perils of hardware in three ways.

Hardware gets marked up

The first peril of hardware is why I did this: Apple’s crazy markup on hardware. Why did I go through the difficulty of finding and buying components that were compatible with MacOS?  Why did I go through the rigmarole necessary to fool the MacOS installer into installing on something that wasn’t a real Mac?

I wanted to run MacOS on a server powerful enough to run Adobe Premiere Pro well, and the Mac Pro I wanted was something like $4,000-5,000.  But I could build a Hackintosh for around $1500, so I did.

This is why storage customers revolted against traditional proprietary storage vendors in favor of software-defined startups that allowed them to use off-the-shelf hardware that wasn’t ridiculously marked up.  People started realizing that hardware is hardware, and rarely is hardware special enough to warrant a huge markup.

Hardware must be maintained

Hardware breaks.  Power supplies die, disks stop spinning, and fans stop blowing. This is why every production piece of hardware typically comes with a service agreement specifying how quickly the vendor should respond when a problem occurs.

At no time has this peril been more acute than in the last few weeks. The spectre of the Spectre and Meltdown vulnerabilities is wreaking havoc on hardware land. First Intel came out with a new microcode version to address the vulnerabilities, then Microsoft, Red Hat, and other Linux vendors came out with OS patches.  Then people who installed them started seeing spontaneous reboots. So they all started pulling their patches, and Microsoft even released an out-of-band update that disabled the microcode patches if you installed them.  It’s been a tough couple of weeks for those who must maintain hardware.

Meanwhile, customers who are using services like Office 365, Gmail, and yes, the Druva Cloud Platform, didn’t have to worry about maintaining the hardware underneath those systems. The service providers had plenty of work to do, for sure. The cloud is not magic. There is no such thing as the cloud; it’s only someone else’s datacenter. But people who were using true cloud services simply didn’t have to worry about maintaining the hardware behind the services they were using.

This brings me to the point of the companies in the data protection space who have now certified that their product runs in AWS. Yes, this allows them to say that they work “in the cloud.” But it’s important to distinguish this from a cloud service offering, where hardware is not your problem. Customers of such backup solutions that are “running in the cloud” are having just as many problems with their cloud backup servers as they are with their onsite servers.  Because even virtual hardware has to be maintained. It may be someone else’s hardware (i.e. you don’t own the server your cloud VM is running on), but you still have to maintain it.

Hardware is a capital expense

The Hackintosh I built was only $1500, but what if it had been $100,000?  Hardware of all kinds requires a significant amount of capital outlay.  Maybe you can finance it and maybe you need to come up with the actual cash to buy it outright.  Either way, it’s going to stay on your books for years.

Capital expenses can be really difficult to get approved. I remember working at a place where every single item over $1,000 was a capital expense, and getting capital expenses approved took months – even years.  I remember doing all sorts of things to work around that issue.

Real hardware also exists.  If you bought it for a project that changed directions, you’re stuck with that hardware.  If your project needs faster hardware, you have to upgrade – leaving the old hardware in the dust (literally). This is perhaps the most compelling thing about moving apps to the cloud.  If you change your mind, you just delete the VM.

The hardware isn’t important – the service is

This brings me full circle. The hardware isn’t what’s important; the service is what’s important. Consider my opening story of the Hackintosh. My need was to edit video. The solution to that need was Adobe Premiere Pro – which I already owned.  But I owned the MacOS version, so I needed a Mac.  I couldn’t afford a Mac Pro, so I built one. (I just found out the Hackintosh I built is running fine, BTW.)

But what if I was able to find a cloud service to do my video editing? Yes, I realize there are rules of physics that might get in my way, since raw video can be huge. But just work with me.  What if I could meet all of my business needs with a service that runs in the cloud?

Would I need the Mac?  Would I need the Hackintosh? Would I need Premiere Pro? No, I wouldn’t.  A Chromebook would probably do just fine.

But if I went to Apple and told them my business requirements, their answer to my questions would most certainly be a Mac Pro. That’s what happens when you ask a hardware vendor to help solve your problem. It’s like going into a hardware store and telling them you need a place to live. The first thing they’re going to do is sell you a hammer, nails, and wood.  Because that’s what they sell.

Why would you want hardware?

This entire blog post was inspired by another blog post by a blogger and writer I respect. The title also started with “The Perils of…” He used the hammer analogy, too. He suggested that you shouldn’t go to vendors who just sell “backup,” as there is an entire continuum of data protection requirements not met by that term.  I agree with that part.  The days of backup only are over.

But then he suggested that his company, a very large hardware and software vendor, was the right way to go because they sell all types of solutions. That’s where I’m going to have to disagree. Because almost all of their solutions are just more hardware & software.  Hardware & software get marked up.  They have to be maintained. And hardware is a large capital investment.  Why would you want to do any of that if you could meet your data protection needs with a service where none of that is an issue?  Just a thought.

----- Signature and Disclaimer -----

Written by W. Curtis Preston (@wcpreston). For those of you unfamiliar with my work, I've specialized in backup & recovery since 1993. I've written the O'Reilly books on backup and have worked with a number of native and commercial tools. I am now Chief Technical Evangelist at Druva, the leading provider of cloud-based data protection and data management tools for endpoints, infrastructure, and cloud applications. These posts reflect my own opinion and are not necessarily the opinion of my employer.

Instant recovery & dedupe are not friends

Instant recovery is the modern-day equivalent of what we used to call a hot site, as it allows you to recover immediately after some type of incident. I have personally advocated for this concept, as I strongly believe that in a true disaster (or ransomware event), time is of the essence.

As mentioned in my previous article, one company’s lack of an instant recovery system caused them to pay the ransom when they were infected with ransomware. They said recovering their entire datacenter using their backup system would have taken several days, and paying the ransom would cost less than several days of downtime. I explain in the article why I completely disagree with this reasoning, but I understand I have the luxury of Monday morning quarterbacking.

The key to being able to easily recover from a large disaster or ransomware attack is to be able to instantly spin up your entire datacenter in a hot site or an instant recovery system. This allows you to take your time addressing the cause of the incident, such as identifying and removing the ransomware itself, putting out actual fires, or replacing hardware damaged in the incident. If you can run your entire environment in a public or private cloud, you can continue your business – almost without interruption – regardless of how bad the incident is.

Dedupe is not instant recovery’s friend

Instant recovery is great, as it is allowing many to recover much quicker and better than they ever could before. Deduplication is also great, as it is the technology that enables so many wonderful things, like disk-based backup and recovery, offsite replication of backups without human intervention, and significant reductions in bandwidth usage. It’s the marriage of deduplication and instant recovery that usually doesn’t work.

Deduplication systems are very good at many things, but usually are not very good at random reads and writes. Just ask anyone who has attempted to run one or more VMs using their deduplicated backup data as the datastore. The performance might be enough to handle a single server that doesn’t require a lot of random I/O, but running several servers or an entire datacenter simply isn’t possible from a deduplicated datastore.
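One way to picture why is a toy model of a deduplicated store. This is a simplified sketch under my own assumptions, not any vendor's implementation: the class name, the tiny chunk size, and the idea of counting "seeks" are all illustrative. Each unique chunk is stored once, in the order it was first seen, so a later backup that shares chunks with an earlier one ends up pointing at locations scattered across the store.

```python
import hashlib

CHUNK_SIZE = 4  # tiny, illustrative chunk size; real systems use far larger chunks


class ToyDedupeStore:
    """Hypothetical content-addressed chunk store (illustration only)."""

    def __init__(self):
        self.chunks = []  # unique chunks, in the order they were first written
        self.index = {}   # chunk fingerprint -> position in self.chunks

    def write(self, data):
        """Store data and return its recipe: the list of chunk positions."""
        recipe = []
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            fingerprint = hashlib.sha256(chunk).hexdigest()
            if fingerprint not in self.index:
                # New chunk: append it to the end of the store.
                self.index[fingerprint] = len(self.chunks)
                self.chunks.append(chunk)
            recipe.append(self.index[fingerprint])
        return recipe

    def read(self, recipe):
        """Reassemble the original data from its recipe."""
        return b"".join(self.chunks[pos] for pos in recipe)


def seeks(recipe):
    """Count reads whose chunk is not stored right after the previous one."""
    return sum(1 for a, b in zip(recipe, recipe[1:]) if b != a + 1)


store = ToyDedupeStore()
monday = store.write(b"AAAABBBBCCCCDDDD")
tuesday = store.write(b"AAAAXXXXCCCCYYYYDDDD")

print("Monday recipe: ", monday, "seeks:", seeks(monday))
print("Tuesday recipe:", tuesday, "seeks:", seeks(tuesday))
```

In this toy model, the first backup reads back sequentially, while the second one jumps around the store on every chunk. A running VM adds its own random I/O on top of that, which is the kind of access pattern the paragraph above is describing.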

This is why post-process deduplication backup appliances make such a big deal about their native landing zone where recent backups are stored in their native, non-deduplicated format before they are deduplicated for replication or long-term storage. They advise customers who are interested in instant recovery to turn off any backup software dedupe.  Backups are sent to disk in their full, native format and are stored that way in the landing zone until they are pushed out by newer backups. This yields much better performance if you have to run multiple VMs from your backups.

But most people using the instant recovery feature tend to be using modern backup packages that already have deduplication integrated as a core part of their product, so they are typically performing their instant recovery from a deduplicated datastore. They should be able to recover one or two VMs at a time, but they will probably be very disappointed if they try to recover their entire datacenter.

There are other ways to do instant recovery

If you are going to use instant recovery to run your entire datacenter in a disaster, the latest copy of your VM backups needs to be in native format on storage that can support the performance that you need. There are a couple of ways of accomplishing that.

Continuous data protection (CDP) products are essentially replication with a back button. Some of these companies describe themselves as a TiVo for your backups. They store your backups in native format, and also store the bits needed to be able to change portions of the latest version in order to move it back in time. (A good example of such a product would be Zerto.)
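The "back button" idea can be sketched in a few lines. This is a minimal illustration under my own assumptions, not how Zerto or any other product is actually built: the class and method names are hypothetical, and a real product works on disk blocks and network replication rather than an in-memory list. The point is only that the latest copy stays in native, directly usable form, while a journal of overwritten blocks lets you step backwards.

```python
class ToyCdpReplica:
    """Hypothetical replica: latest image in native form plus an undo journal."""

    def __init__(self, num_blocks):
        self.blocks = [b""] * num_blocks  # latest version, usable as-is
        self.journal = []                 # (timestamp, block_no, previous contents)

    def replicate_write(self, timestamp, block_no, data):
        """Apply a replicated write, saving the old contents so it can be undone."""
        self.journal.append((timestamp, block_no, self.blocks[block_no]))
        self.blocks[block_no] = data

    def rewind_to(self, timestamp):
        """Undo every write made after `timestamp`, newest first."""
        while self.journal and self.journal[-1][0] > timestamp:
            _, block_no, previous = self.journal.pop()
            self.blocks[block_no] = previous


replica = ToyCdpReplica(num_blocks=3)
replica.replicate_write(1, 0, b"boot")
replica.replicate_write(2, 1, b"data-v1")
replica.replicate_write(3, 1, b"ENCRYPTED")  # e.g. ransomware strikes at t=3

replica.rewind_to(2)  # press the back button to just before the attack
print(replica.blocks)
```

Because only the changed blocks are journaled, the extra storage is the "versioning blocks" rather than a second full copy.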

These types of products tend to do well at disaster recovery, but not at operational recovery. They’re good at recovering an entire datacenter, usually not so good at recovering a single file.  The DR functionality of these products can be quite advanced, as it is their specialty. Another upside of this approach is you only have to pay for one copy of your backup – plus the versioning blocks of course. One downside to this approach is most people also purchase another product for operational recovery.

Alternatively, you can use a backup product that uses its backups to update an image stored in native format as a DR image.  (Druva offers this as part of their Data Protection as a Service offering.) Instant recoveries – especially large scale recoveries of an entire datacenter – would run from this DR image. The advantage of this approach is you get operational recovery and disaster recovery in a single system. This is both simpler and less expensive than maintaining two systems. One disadvantage is that you will need to pay for the storage the DR copy of your data uses, equivalent to one full backup.  This cost is offset by the fact that you would be able to do both operational recovery and disaster recovery with a single product.

Don’t pay ransoms! Get a better backup product!

As I mentioned in my previous blog post, please prepare now to be able to recover from a ransomware attack or other disaster. Investigate the DR plans of your company, as you might need to activate them for something you might not consider a disaster. Your entire datacenter may be fully functional, but you won’t be able to get to your data if it’s all encrypted.  So make sure you have a solid plan for how you would recover from this scenario, because the likelihood that this will happen to your company goes up every day.


Do not pay ransomware ransom!

You don’t negotiate with terrorists, and you don’t pay ransom unless you have no other choice. Even then, you should try every available avenue before you decide to pay money to the people holding your data for ransom.  It’s just a bad idea. Last week there was a news story of a company that paid several Bitcoin (each of which was worth roughly $15K) to get their data back. (I am not putting the exact amount or link to the story for reasons I will explain later.)

This kind of thing has become all too common, but this time things were a little bit different. The company disclosed that they had backups of the data that they could have used to restore their environment without paying the ransom. They chose to pay the ransom because they felt that it would restore their data quicker than their backup system would be able to do. I have two observations here: that was a really bad idea, and they should have had a better backup system.

You don’t pay ransom or blackmail!

The biggest reason you do not pay ransom or blackmail is that it says you’re open to paying ransom or blackmail. There is absolutely nothing stopping the entity who attacked you from doing it again in a few days or weeks.

Just ask Alexander Hamilton. Yes, that Alexander Hamilton. He had an affair with a married woman and was subsequently blackmailed by her husband. Mr. Reynolds started out asking for small figures, amounting to a few hundred dollars in today’s money.  But by paying a few hundred dollars, Hamilton showed that he was open to paying ransom. If he was open to paying a few hundred, he would pay a few hundred more. Reynolds came back for money several times.  By the time the event came to a conclusion, Hamilton had paid Reynolds roughly $18,000 in today’s money. (And the affair eventually came out anyway.)

By paying the Bitcoin to the black hat, this company has shown that they will pay the ransom if they are attacked. What makes matters even worse is that the event was published in the news. Now everyone knows that this company will pay a ransom if they are attacked. They might as well have put a giant “HACK US!” sign on their website. (The first version of this story included the name of the hospital and a link to the story. I took it out so as not to add insult to injury.)

They didn’t just paint a target on their back; they painted a target on every company’s back. The more companies that pay the ransom, the more black hats will attack other companies. If we all collectively refuse to pay the ransom – after ensuring that we can recover from a ransomware attack without paying the ransom – these black hats will find some other way to make money.

Another reason that you do not pay ransomware companies any money is that you are dealing with unscrupulous characters, and there is no assurance that you will get your data back. I am personally aware of multiple companies who paid the ransom and got nothing.

They need a better backup system

The backup system must not have been designed with the business needs of the company in mind, or it would have been able to help them recover from this attack without paying the ransom. According to the story, the company felt that restoring from a backup would take too long, and paying the ransom would be quicker. What this tells me is that the recovery expectation was nowhere near the recovery reality.

This company must have done a cost-benefit analysis on the cost of a few days of downtime, and decided that the amount of lost revenue was much greater than the cost of paying the ransom. Let’s say, for example, they calculated that every day of downtime would lose them one million dollars. If they used their backup system to restore their data center, they would lose two to three million dollars, since they said it would take 2-3 days. $55,000 is peanuts when compared to two or three million, so they paid the ransom. I do not agree with this logic, as I discussed previously in this article.  But this is the logic they apparently used.

If they knew that their company would lose a million dollars a day, then they should have designed their backup or disaster recovery system to be able to recover in less than a day. Technology certainly exists that is capable of doing that, and it usually costs far less than the amount of money that would be lost in an outage.
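The arithmetic behind this reasoning can be sketched as follows. The one-million-dollar daily loss, the 2-3 day restore, and the $55,000 ransom are the example figures from above; the function name and the half-day recovery target are my own illustrative assumptions, not numbers from the story.

```python
def downtime_cost(loss_per_day, days_down):
    """Revenue lost during an outage, assuming a constant daily loss."""
    return loss_per_day * days_down


LOSS_PER_DAY = 1_000_000  # example figure from the text
RANSOM = 55_000           # example figure from the text

slow_restore_low = downtime_cost(LOSS_PER_DAY, 2)   # 2-day restore
slow_restore_high = downtime_cost(LOSS_PER_DAY, 3)  # 3-day restore

# Hypothetical recovery system that brings the datacenter back in half a day.
fast_restore = downtime_cost(LOSS_PER_DAY, 0.5)

# Downtime avoided per incident, compared with the best-case slow restore.
savings_per_incident = slow_restore_low - fast_restore

print(f"Slow restore:  ${slow_restore_low:,.0f} to ${slow_restore_high:,.0f}")
print(f"Fast restore:  ${fast_restore:,.0f}")
print(f"Avoided loss:  ${savings_per_incident:,.0f} per incident")
print(f"Ransom:        ${RANSOM:,.0f}")
```

Under these assumed numbers, a faster recovery system avoids far more loss per incident than the ransom cost, without rewarding the attacker.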

Even if the system cost about as much as the money that would be lost in an outage, it still might make sense to buy such a system. The reason is that the impact to the business goes beyond a straight loss of revenue due to downtime. If your business suffers a sustained outage, you may lose more business than just the business you lost while you were down. You might lose some customers for good, and the lost revenue from that would be difficult to calculate.

Being ready for a disaster

If minimizing downtime is the key, the only way to truly be ready for a disaster is to be able to boot instantly after an outage. There are a variety of products that advertise such functionality today, but very few of them would be able to recover an entire datacenter instantly. I will discuss the various instant recovery options in my next blog post.

For now, I just want to remind you of two things: be ready for ransomware, and never pay the ransom. Make sure you are able to recover all of your critical data in a time frame that your business would find acceptable, so that you can tell any ransomware black hats to go pound sand if they come knocking on your door.


Addressing Spectre/Meltdown in your Backup System

Your backup server might be the biggest vulnerability in your datacenter, as I already discussed in my previous blog post. Which means that you should have patched it first, but I’m betting that you haven’t patched it yet. If you don’t know why I feel this is a problem, go check out the previous post.

How are you responding to the Spectre & Meltdown vulnerabilities with regards to your backup infrastructure?  What kind of week you’ve had depends on what type of backup infrastructure you have.

Bare Metal Backup Server

This includes bare-metal Linux & Windows servers, and backup servers running in VMs in the cloud. You need to find the appropriate patches for your backup server’s OS, test them, and install them.  Here’s a good list of those patches. I’m guessing you probably don’t have the time to test them to see what kind of performance impact they might have on your backup system.

Reports of the performance impact of various patches include everything from “no noticeable impact” to “50% performance loss.”  Unfortunately for you, it seems that the more I/O intensive your workload, the greater the impact on performance. So you might install (or have installed) the patches and then run/ran your next set of backups — only to find out that they don’t complete anywhere nearly as fast as they used to.
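To get a rough feel for what that means for your backup window, here is a back-of-the-envelope sketch. It assumes that a "performance loss" translates directly into a proportional drop in backup throughput, which is a simplification, and the 8-hour baseline window and the function name are hypothetical.

```python
def slowed_backup_hours(baseline_hours, throughput_loss):
    """Estimate backup duration after throughput drops by `throughput_loss`.

    `throughput_loss` is a fraction between 0.0 (no impact) and 1.0 (exclusive).
    """
    if not 0.0 <= throughput_loss < 1.0:
        raise ValueError("throughput_loss must be in [0.0, 1.0)")
    # The same amount of data at a lower rate takes proportionally longer.
    return baseline_hours / (1.0 - throughput_loss)


BASELINE_HOURS = 8  # hypothetical nightly backup window

for loss in (0.0, 0.25, 0.5):
    hours = slowed_backup_hours(BASELINE_HOURS, loss)
    print(f"{loss:.0%} throughput loss -> about {hours:.1f} hours")
```

Under that assumption, a 50% loss doubles the backup time, so an 8-hour backup would need 16 hours.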

If that’s the case for you, then you’re having to figure out how to respond to this performance loss. If your backup server is running in a VM, you might be able to just upgrade to a bigger VM.  You’ll have a little downtime, but that’s a small price to pay.

If you have a bare metal server, which is far more likely, you might find yourself in a situation of needing to do an emergency upgrade to the backup server.  Some systems run in a cluster and can be scaled by just buying another node in the cluster, but others will require a forklift upgrade of the backup server.  Either way, you may be looking at an emergency order of a new server or two. In short, you might be having a very difficult week.  It’s a good week to be a server vendor, though.

Virtualized Backup Server

If your backup server is running inside a VM, you’ve had an even more interesting week. In addition to everything mentioned above, you also need to deal with microcode updates from VMware or Microsoft.

VMware got a lot of credit for responding to Spectre/Meltdown very quickly, as they issued patches pretty quickly. Unfortunately, the patches were apparently causing spontaneous reboots, so they pulled them almost as fast. Check out this page for the latest info on this.

Once these patches are available again, you’ll need to test and install them. And, of course, you will also need to patch the guest operating systems just as you would if they were bare metal.

Hyper-V customers need to do the same thing.  Here’s the latest information from them.

The performance impact of these patches is no more known than the performance impact of the previously mentioned OS patches. Which means you might find yourself having to upgrade the underlying hardware, or at the very least increasing the power of any VMs to compensate for the performance loss.  Again, it’s a good week to sell servers, not such a good week for those buying them.

Cloud-native Backup Service

If you are using a cloud-native backup service, you don’t have to do anything.  A cloud native service means you are not responsible for the VMs offering such a service. Those VMs are not your problem.  The most you might want to do is contact your backup service vendor and ask them if they have patched their systems to address any vulnerabilities.

When the backup service installs the appropriate patches in the backend, there might indeed be an impact to the performance of each VM. But if it’s a scalable cloud service, it should be able to easily compensate for any performance loss by adding additional compute resources.  This should not be something you should have to worry about.

Cloud means never having to say you’re sorry

A true cloud service should not require you to have to worry about the infrastructure.  (Which is why I feel the word “cloud” does mean something, @mattwbaker.) There are other backup systems out there that are actually quite good – but they’re not cloud native. If your backup app requires you to create VMs in the cloud to install your backup server software in, it’s not really a cloud app.  It’s cloud washing. (Honestly, taking a product designed for physical nodes in a datacenter and installing it in VMs in the cloud is a perfect example of how not to use the cloud.)

If your backup service is actually a cloud backup service, you should not have to worry about the security of your backup system – it should be automatically taken care of.  If you’re having to take care of it, perhaps you should consider a different system.


Your onsite backup server is a security risk

Did you know there have been 7870 public data breaches since 2005?  Your company’s data is under attack. Like terrorism, the attackers only have to be successful once. You have to be successful 100% of the time.

Which is why it’s important to patch your systems regularly and keep abreast of any security vulnerabilities your company’s backup product may have.  But have you ever thought about how much of a security risk the backup server is? It’s a risk for three reasons: the value of what it has, the typical experience level of its admins, and lack of attention.

The backup system has all the marbles

Did you ever think about the fact that the backup system is the most sensitive server in your environment? It’s sensitive because it has everything and it can do everything.

First, the backup system has a copy of everything! All the data in your environment resides on disks or tapes it controls. While some data may be stored offsite and is effectively out of reach, most current data is immediately available via a few simple commands. Sometimes the backup data is available via other mechanisms, such as a web or NFS server, which is why a vulnerability in those products could give a malicious user access to anything he/she wants.

The backup system can read and write every piece of data in your datacenter. In order to back up data, it must be able to read it.  To be able to read it, the backup system is given superuser privileges.  Unix/Linux backup software runs as root, and Windows systems tend to run as Administrator. That means it can read or write any file in the environment.

Most backup software also has the ability to run scripts before and after the backup, and those scripts run as the privileged user. Combine that with the ability to back up and restore files, and you have a scary situation.  A malicious user that gains backup admin privileges can write a malicious script, back it up, restore it to the appropriate location, then execute the script as a privileged user.  Just let that sink in for a minute.

The backup admins are often very junior

My first job in tech was the “backup guy” for a huge credit card company. I barely knew how to spell Unix, and a few days into my job I was given the keys to the kingdom: the root password to the backup system and every server in the datacenter.  (We didn’t have the concept of role-based admin in those days, so anything you did with backups, you did as root.)

My story is not unique.  Backups are often given to the FNG. He or she takes the gig because it gets them the job, but it’s the job that nobody wants. As soon as they get some experience under their belt, they do their best to pass off this very difficult job to anyone else.  This has been true of backups for years, and this revolving door usually results in very junior people running the backup system.

I know I wanted to get out of backups back then, but I went from being the backup guy to being in charge of the backup team.  Three years later, I was still the main point of contact for the backup system.  Working for me were several people who were just as junior as I was when I started, all of whom had root privileges to the entire bank. Without going into details, I’ll just say that not everyone that worked for me should have been given the keys to the kingdom like that.

The most sensitive system in your environment is being handed over to the most junior person you have.  Again… let that sink in a little bit.

The backup server doesn’t receive enough attention

The security team always made sure the database servers & file servers were patched. But I don’t recall ever getting a call from them about the backup server. That meant it was up to the most junior person in the environment to make sure the most sensitive server in the environment was being regularly patched and secured against attacks.  That makes perfect sense. Not.

Another way this manifests itself is in the backup software. Many companies making backup products rely on external products (e.g. Apache) to augment their functionality (e.g. web access to your backup server). The thinking is to use publicly available tools instead of building their own. They’re a backup company, after all, not a web server company.

But unfortunately, embedded software like this often gets patched later than it should.  When an Apache vulnerability is discovered, people who know they are running Apache tend to patch it.  But what if it’s inside your backup software?  You rely on the backup vendor to know that and to patch it appropriately. But the inattention I’m referring to also sometimes applies to embedded components inside a backup system. It may take weeks or months before the vulnerability is patched in the backup software. This ArsTechnica article discusses a recently patched vulnerability in a backup software package where there was a three-month delay between the initial discovery of the vulnerability and the creation of a patch for all related systems.

Choice 1: Secure your onsite backup system

You can do a number of things to secure your onsite system, starting with recognizing how much of a vulnerability it is. You can harden the system itself, patch the backup system, and do your best to limit the powers of your backup admin.

Harden the backup system

Firewall it off, using a software firewall running in the system or an actual firewall in front of the system — preferably the latter. Make it so that you can only administer the system via a particular VPN, and that admins must authenticate to the VPN prior to administering the backup system. This also addresses another vulnerability, which is that some backup systems send their commands in plain text.

Make sure that the backup server is running the most secure version of the operating system you have.

Run the backup software via a separate privileged account, not the privileged account.  Run it with an account called backupadmin with userid 0, or with Administrator privileges.  Do not run it as root or Administrator.  Then use your ITD software to watch that account like a hawk.

If your backup admin needs root privileges on Unix systems, force them to use sudo.

Require Windows backup admins to use their non-privileged account, and “Run as administrator” when they need to do something special.

Make sure the backup system is continually updated to the latest patch level. It should be the first system you patch, not the last.

If your backup software supports two-factor authentication, use it.

If you are writing backup data to a deduplication appliance across Ethernet, you need to harden and separate that interface as well. For example, do not allow direct access to any of its data via NFS/SMB. A physically separate Ethernet connection between the backup server and any backup storage would be preferred.

Limit backup admin powers

If your backup system supports the concept of role-based admin, do whatever you can to limit the power of the backup admin.  Maybe give them the power to do backups but not restores.  Or they can run backups, but not configure backups.  Restores and configuration changes could/should be done by a separate account that requires a separate login with strong two-factor authentication.
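A separation of duties like this can be sketched as a simple permission check. The role names, action names, and the function below are hypothetical; real backup products name and enforce these differently, so treat this as an illustration of the idea rather than any product's API.

```python
# Hypothetical roles and actions, for illustration only.
ROLE_PERMISSIONS = {
    "backup_operator": {"run_backup"},
    "restore_admin": {"run_restore", "change_config"},
}

# Actions that should also require a verified second factor.
SENSITIVE_ACTIONS = {"run_restore", "change_config"}


def is_allowed(role, action, second_factor_verified=False):
    """Return True only if the role grants the action and 2FA rules are met."""
    if action not in ROLE_PERMISSIONS.get(role, set()):
        return False
    if action in SENSITIVE_ACTIONS and not second_factor_verified:
        return False
    return True


print(is_allowed("backup_operator", "run_backup"))
print(is_allowed("backup_operator", "run_restore", second_factor_verified=True))
print(is_allowed("restore_admin", "run_restore"))
print(is_allowed("restore_admin", "run_restore", second_factor_verified=True))
```

The design choice here is that the everyday backup account cannot restore or reconfigure anything, even with a second factor, so compromising it does not give an attacker the write path described earlier.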

Choice 2: Get rid of your backup server

What if you got rid of your backup server altogether?  There’s nothing more secure than something that doesn’t exist!  You could do this by using a backup system with a service-based public cloud architecture. Backup services that back up directly to the cloud offer a number of security advantages over those that use backup servers.

Front end designed for direct Internet access

Traditional backup systems are designed to be run inside an already-secure datacenter, where there is an expectation that direct attacks will be lower. Cloud backup systems are designed with harder front ends because they acknowledge they will be directly connected to the Internet. A lot of the basic security changes suggested above would be considered table stakes to any Internet-facing service.

Continuous security monitoring

Backup services run in a cloud like AWS are continually monitored for attempted intrusion.  (Again, this is table stakes for such a service.)  You get best of breed security simply by using the service.

Any embedded systems constantly & automatically patched

The operating systems and applications supporting any backup service are automatically and immediately patched to the latest available patches. The infrastructure is so huge that this has to be automated; you don’t have to do anything to make it happen.

Backup data not exposed to anyone

A good cloud backup system also segregates your actual backup data from the rest of the network, just like I was suggesting for your onsite backup server. But in this case, that’s already done. No one is getting to your backup data except through the authorized backup system.

Summary: Lock it up or give it up

Once you recognize what an incredibly vulnerable thing your backup server is, your choices are simple: lock it up very tight or get rid of it. I think most companies would be served well by the latter.  Given the advent of really good dedupe and replication, only the biggest companies are unable to use cloud-based backup systems.

----- Signature and Disclaimer -----

Written by W. Curtis Preston (@wcpreston). For those of you unfamiliar with my work, I've specialized in backup & recovery since 1993. I've written the O'Reilly books on backup and have worked with a number of native and commercial tools. I am now Chief Technical Evangelist at Druva, the leading provider of cloud-based data protection and data management tools for endpoints, infrastructure, and cloud applications. These posts reflect my own opinion and are not necessarily the opinion of my employer.

Dedupe done right speeds up backups

On my LinkedIn profile, I posted a link to my last article, Why good dedupe is important — and hard to do.  I got some pretty good feedback on it, but one comment from my buddy Chris M. Evans (@chrismevans) got me thinking.

“Curtis, it’s worth highlighting that space optimisation may not be your only measurement of dedupe performance. The ability to do fast ingest with a poorer level of dedupe (which is then post processed) could be more attractive. Of course, you may be intending to talk about this in future posts…”

I’m glad you asked, Chris! (BTW, Chris lives over yonder across the pond, so he spells things funny.) Here’s my quick and longer answer to your question:

If dedupe is done right, it speeds up backups and doesn’t slow them down.

Target dedupe can slow down backups

I think Chris’ comment stems primarily from thinking about dedupe as something that happens in a target dedupe appliance.  I have run backups to a number of these appliances over the years, and Chris is right.  Depending on the architecture (especially decisions made about dedupe efficiency vs speed), a dedupe appliance can indeed slow down the backup system.

This is actually why I traditionally preferred the post-process way of doing dedupe when I was looking at target appliances.  A post-process system (e.g. Exagrid) first stores all backups in their native format in a landing zone.  Those backups are then deduped asynchronously. This made sure that the dedupe process — which can be very CPU, RAM, and I/O intensive — didn’t slow down the incoming backup.

An inline approach (e.g. Data Domain) dedupes the data before it is ever written to disk. Proponents of the inline approach say that it saves you from having to buy the disk for the staging area, and that it is more efficient to dedupe it first.  They claim that the compute power required to dedupe data inline is made up for by a significant reduction in I/O.

But I generally preferred the post-process approach for two reasons. The biggest reason was that it left the latest backup in its native format in the landing zone, creating a significant performance advantage during restores, especially instant recovery type restores. But the other reason I generally preferred post-process dedupe was the performance impact I had seen inline dedupe have on backups.

Chris’ point was that strong dedupe can impact the performance of the backup, and I have seen just that with several inline dedupe solutions. Customers who really noticed this were those that had already grown accustomed to disk-based backup performance.

If you were used to tape performance (due to the speed mismatch issue I covered here) then you didn’t really notice anything.  But if you were already backing up a large database or other server to disk, and then switched that backup to a target dedupe appliance, your backup times might actually increase — sometimes by a lot.  I remember one customer who told me their Exchange backups were taking three times longer after they switched from a regular disk array to a popular target dedupe appliance.

Target dedupe was — and still is — a band-aid

The goal of target dedupe was to introduce the goodness of dedupe into your backup system without requiring you to change your backup software. Just point your backups to the target dedupe appliance and magic happens.  It was a band-aid, and I contend it still is.

But doing dedupe at the target is much harder — read more expensive — than doing it at the source.  The biggest reason is that the dedupe appliance is not looking at your files; it’s looking at a “tar ball” of your files.  It’s looking at your files inside a backup container, many of which are cryptic and difficult to parse.  A lot of work has to go into deciphering and properly “chunking” the backup formats. That work translates into development cost and computing cost, all of which gets passed down to you.

The second reason target dedupe is the wrong way to go is that it removes one of the primary benefits of dedupe: bandwidth savings. With a few exceptions (e.g. Boost), your network sees no benefit from dedupe.  The entire backup (fulls and incrementals) is transferred across the network.

It was a band-aid, and it did a good job of introducing dedupe into the backup system. But now that we see the value of it, it’s time to do it right.  It’s time to start deduping before we back up, not after.

Source dedupe is the way to go

Source dedupe is done at the very beginning of the backup process.  Every new or modified file is parsed into chunks, and a hash is calculated for the contents of each chunk. If that hash has been seen before, that chunk doesn’t need to be transferred across the network.
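Here is a minimal sketch of that idea, not how any particular product does it. It assumes tiny fixed-size chunks and SHA-256 purely for illustration, and the function names and the seen_hashes set are made up for the example. Real products typically use much larger (often variable-size) chunks and keep the hash index on the backup service, not in a local set.

```python
import hashlib

# Tiny chunk size so the example is easy to follow; an assumption, not a recommendation.
CHUNK_SIZE = 4


def chunks(data: bytes, size: int = CHUNK_SIZE):
    """Split data into fixed-size chunks (the last one may be shorter)."""
    for offset in range(0, len(data), size):
        yield data[offset:offset + size]


def backup(data: bytes, seen_hashes: set) -> list:
    """Return only the chunks that must cross the network.

    A chunk whose hash is already in seen_hashes is skipped; new hashes
    are added to seen_hashes as they are encountered.
    """
    to_send = []
    for chunk in chunks(data):
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in seen_hashes:
            seen_hashes.add(digest)
            to_send.append(chunk)
    return to_send
```

Calling backup(b"AAAABBBBAAAA", seen) with an empty set sends two chunks, because the third chunk repeats the first. A later call with b"AAAACCCCBBBB" and the same set sends only the one chunk that has never been seen. A real system would also record the ordered list of hashes for each file so it can be reassembled at restore time; that part is left out here.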

There are multiple reasons why source dedupe is the way to go.  The biggest reasons are purchase cost, performance and storage & bandwidth savings.

Target dedupe is expensive because it is developmentally and computationally expensive. I used to joke that a target dedupe appliance makes 10 TB look like 200 TB to the backup system, but they’d only charge you for 100 TB.  Yes, target dedupe appliances make the impossible possible, but they also charge you for it.

They also charge for it over and over.  Did you ever think about the fact that all the hard work of dedupe is done only by the first appliance?  Therefore, one could argue that only the first appliance should cost so much more.  But you know that isn’t the case; you pay the dedupe premium on every target dedupe appliance you buy, right?  Source systems can charge once for the dedupe, then replicate that backup to many locations without having to charge you for it.

Source dedupe is also much faster.  One reason for that is that it never has to dedupe a full backup ever again. Target appliances are forced to dedupe full backups all the time, because the backup software products all need to make them once in a while.  A source dedupe product does one full, and block-level incrementals after that.  Another reason source dedupe is faster is that it can look directly at the files being backed up, instead of having to divine the data hidden behind a cryptic backup format.
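To put rough numbers on the repeated-fulls point, here is a back-of-the-envelope sketch. The function name and the workload figures (a 10 TB full, 1 TB of daily change, a four-week window) are invented for illustration. It counts how much data the dedupe engine has to examine, not how much ends up stored.

```python
def data_processed_tb(full_tb, daily_change_tb, days, full_every_days=None):
    """Total TB the dedupe engine must examine over `days` days.

    full_every_days=None models one initial full followed by incrementals
    forever; otherwise a new full is taken every `full_every_days` days.
    """
    total = 0.0
    for day in range(days):
        is_full = day == 0 or (
            full_every_days is not None and day % full_every_days == 0
        )
        total += full_tb if is_full else daily_change_tb
    return total


# Hypothetical workload: 10 TB full, 1 TB of change per day, 28 days.
weekly_fulls = data_processed_tb(10, 1, 28, full_every_days=7)
one_full_only = data_processed_tb(10, 1, 28)
```

With those made-up figures, the weekly-full schedule pushes 64 TB through the dedupe engine in four weeks, while the one-full-then-incrementals schedule pushes 37 TB. The gap widens as the daily change rate shrinks relative to the size of the full.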

Finally, because source dedupe is looking directly at the data, it can dedupe better and get rid of more duplicate data. That saves bandwidth and storage, further reducing your costs — and speeding up the backup.  The more you are using the cloud, the more important this is.  Every deduped bit reduces your bandwidth cost and the bill you will pay the cloud vendor every month.

Dedupe done right speeds up backups

This is why I said to Chris that this problem of being forced to decide between dedupe ratio and backup performance really only applies to target dedupe.  Source dedupe is faster, cheaper, and saves more storage than any other method.  It’s been 20 years now since I was first introduced to the concept of dedupe.  I think it’s time we start doing it right.
