Do not pay ransomware ransom!

You don’t negotiate with terrorists, and you don’t pay ransom unless you have no other choice. Even then, you should try every available avenue before you decide to pay money to the criminals holding your data for ransom.  It’s just a bad idea. Last week there was a news story about a company that paid several bitcoins (each of which was worth roughly $15K) to get their data back. (I am not putting the exact amount or a link to the story for reasons I will explain later.)

This kind of thing has become all too common, but this time things were a little bit different. The company disclosed that they had backups of the data that they could have used to restore their environment without paying the ransom. They chose to pay the ransom because they felt it would restore their data quicker than their backup system would be able to. I have two observations here: that was a really bad idea, and they should have had a better backup system.

You don’t pay ransom or blackmail!

The biggest reason you do not pay ransom or blackmail is that it says you’re open to paying ransom or blackmail. There is absolutely nothing stopping the entity who attacked you from doing it again in a few days or weeks.

Just ask Alexander Hamilton. Yes, that Alexander Hamilton. He had an affair with a married woman and was subsequently blackmailed by her husband. Mr. Reynolds started out asking for small figures, amounting to a few hundred dollars in today’s money.  But by paying a few hundred dollars, Hamilton showed that he was open to paying ransom. If he was open to paying a few hundred, he would pay a few hundred more. Reynolds came back for money several times.  By the time the event came to a conclusion, Hamilton had paid Reynolds roughly $18,000 in today’s money. (And the affair eventually came out anyway.)

By paying the bitcoins to the black hats, this company has shown that they will pay the ransom if they are attacked. What makes matters even worse is that the event was published in the news. Now everyone knows that this company will pay a ransom if they are attacked. They might as well have put a giant “HACK US!” sign on their website. (The first version of this story included the name of the hospital and a link to the story. I took it out so as not to add insult to injury.)

They didn’t just paint a target on their back; they painted a target on every company’s back. The more companies that pay the ransom, the more black hats will attack other companies. If we all collectively refuse to pay the ransom – after ensuring that we can recover from a ransomware attack without paying the ransom – these black hats will find some other way to make money.

Another reason that you do not pay ransomware companies any money is that you are dealing with unscrupulous characters, and there is no assurance that you will get your data back. I am personally aware of multiple companies who paid the ransom and got nothing.

They need a better backup system

The backup system must not have been designed with the business needs of the company in mind, or it would have been able to help them recover from this attack without paying the ransom. According to the story, the company felt that restoring from a backup would take too long, and paying the ransom would be quicker. What this tells me is that the recovery expectation was nowhere near the recovery reality.

This company must have done a cost-benefit analysis on the cost of a few days of downtime and decided that the amount of lost revenue was much greater than the cost of paying the ransom. Let’s say, for example, they calculated that every day of downtime would lose them one million dollars. If they used their backup system to restore their data center, they would lose two to three million dollars, since they said it would take 2-3 days. $55,000 is peanuts when compared to three million, so they paid the ransom. I do not agree with this logic, as I discussed previously in this article.  But this is the logic they apparently used.
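
Here’s the back-of-the-napkin version of that logic, with the numbers from the example above plugged in. To be clear, these figures are illustrative, not from the actual story:

```python
# A rough sketch of the cost comparison the company apparently made.
# All of the figures below are illustrative, not from the actual story.
downtime_cost_per_day = 1_000_000   # assumed lost revenue per day of downtime
restore_days = 3                    # the high end of their 2-3 day estimate
ransom = 55_000                     # the ballpark ransom figure

cost_of_restoring = downtime_cost_per_day * restore_days
print(f"Restore from backup: ${cost_of_restoring:,}")   # $3,000,000
print(f"Pay the ransom:      ${ransom:,}")              # $55,000

# The math only "works" if you ignore that paying marks you as a payer,
# that criminals may not actually return your data, and that a DR system
# able to recover in under a day would have made the comparison moot.
```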

If they knew that their company would lose a million dollars a day, then they should have designed their backup or disaster recovery system to be able to recover in less than a day. Technology certainly exists that is capable of doing that, and it usually costs far less than the amount of money that would be lost in an outage.

Even if the system cost were similar to the amount of money that would be lost in an outage, it still might make sense to buy it. The reason is that the impact to the business goes beyond a straight loss of revenue due to downtime. If your business suffers a sustained outage, you may lose more business than just the business you lost while you were down. You might lose some customers for good, and the lost revenue from that would be difficult to calculate.

Being ready for a disaster

If minimizing downtime is the key, the only way to truly be ready for a disaster is to be able to boot instantly after an outage. There are a variety of products that advertise such functionality today, but very few of them would be able to recover an entire datacenter instantly. I will discuss the various instant recovery options in my next blog post.

For now, I just want to remind you of two things: be ready for ransomware, and never pay the ransom. Make sure you are able to recover all of your critical data in a time frame that your business would find acceptable, so that you can tell any ransomware black hats to go pound sand if they come knocking on your door.

----- Signature and Disclaimer -----

Written by W. Curtis Preston (@wcpreston). For those of you unfamiliar with my work, I've specialized in backup & recovery since 1993. I've written the O'Reilly books on backup and have worked with a number of native and commercial tools. I am now Chief Technical Architect at Druva, the leading provider of cloud-based data protection and data management tools for endpoints, infrastructure, and cloud applications. These posts reflect my own opinion and are not necessarily the opinion of my employer.

Addressing Spectre/Meltdown in your Backup System

Your backup server might be the biggest vulnerability in your datacenter, as I already discussed in my previous blog post. Which means that you should have patched it first, but I’m betting that you haven’t patched it yet. If you don’t know why I feel this is a problem, go check out the previous post.

How are you responding to the Spectre and Meltdown vulnerabilities with regard to your backup infrastructure?  What kind of week you’ve had depends on what type of backup infrastructure you have.

Bare Metal Backup Server

This includes bare-metal Linux & Windows servers, and backup servers running in VMs in the cloud. You need to find the appropriate patches for your backup server’s OS, test them, and install them.  Here’s a good list of those patches. I’m guessing you probably don’t have the time to test them to see what kind of performance impact they might have on your backup system.

Reports of the performance impact of various patches include everything from “no noticeable impact” to “50% performance loss.”  Unfortunately for you, it seems that the more I/O-intensive your workload, the greater the impact on performance. So you might install (or already have installed) the patches and run your next set of backups — only to find out that they don’t complete anywhere near as fast as they used to.
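
If you want at least a crude before-and-after number, something like the following sketch will give you a baseline read-throughput figure for your backup server’s disk. It is not a substitute for running a real backup, the path and size are placeholders, and the OS page cache can inflate the result if the test file fits in RAM:

```python
# Quick-and-dirty disk read throughput check, meant to be run before and
# after patching so you have a baseline to compare. The path and size are
# placeholders; use a test file larger than RAM (or drop caches) to keep
# the page cache from skewing the read number.
import os
import time

TEST_FILE = "/backup/staging/throughput_test.dat"   # hypothetical path
SIZE_MB = 2048
CHUNK = 1024 * 1024

# Write the test file once (this exercises the write path, too).
with open(TEST_FILE, "wb") as f:
    for _ in range(SIZE_MB):
        f.write(os.urandom(CHUNK))

start = time.time()
read_bytes = 0
with open(TEST_FILE, "rb") as f:
    while chunk := f.read(CHUNK):
        read_bytes += len(chunk)
elapsed = time.time() - start
mb = read_bytes / CHUNK
print(f"Read {mb:.0f} MB in {elapsed:.1f}s ({mb / elapsed:.0f} MB/s)")
```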

If that’s the case for you, then you’re having to figure out how to respond to this performance loss. If your backup server is running in a VM, you might be able to just upgrade to a bigger VM.  You’ll have a little downtime, but that’s a small price to pay.

If you have a bare metal server, which is far more likely, you might find yourself in a situation of needing to do an emergency upgrade to the backup server.  Some systems run in a cluster and can be scaled by just buying another node in the cluster, but others will require a forklift upgrade of the backup server.  Either way, you may be looking at an emergency order of a new server or two. In short, you might be having a very difficult week.  It’s a good week to be a server vendor, though.

Virtualized Backup Server

If your backup server is running inside a VM, you’ve had an even more interesting week. In addition to everything mentioned above, you also need to deal with microcode updates from VMware or Microsoft.

VMware got a lot of credit for responding to Spectre/Meltdown very quickly, issuing patches almost immediately. Unfortunately, the patches were apparently causing spontaneous reboots, so they pulled them almost as fast. Check out this page for the latest info on this.

Once these patches are available again, you’ll need to test and install them. And, of course, you will also need to patch the guest operating systems just as you would if they were bare metal.

Hyper-V customers need to do the same thing.  Here’s the latest information from them.

The performance impact of these patches is no better known than that of the previously mentioned OS patches. Which means you might find yourself having to upgrade the underlying hardware, or at the very least increase the power of any VMs to compensate for the performance loss.  Again, it’s a good week to sell servers, not such a good week for those buying them.

Cloud-native Backup Service

If you are using a cloud-native backup service, you don’t have to do anything.  A cloud-native service means you are not responsible for the VMs offering such a service. Those VMs are not your problem.  The most you might want to do is contact your backup service vendor and ask them if they have patched their systems to address any vulnerabilities.

When the backup service installs the appropriate patches in the backend, there might indeed be an impact to the performance of each VM. But if it’s a scalable cloud service, it should be able to easily compensate for any performance loss by adding additional compute resources.  This is not something you should have to worry about.

Cloud means never having to say you’re sorry

A true cloud service should not require you to worry about the infrastructure.  (Which is why I feel the word “cloud” does mean something, @mattwbaker.) There are other backup systems out there that are actually quite good – but they’re not cloud native. If your backup app requires you to create VMs in the cloud and install your backup server software in them, it’s not really a cloud app.  It’s cloud washing. (Honestly, taking a product designed for physical nodes in a datacenter and installing it in VMs in the cloud is a perfect example of how not to use the cloud.)

If your backup service is actually a cloud backup service, you should not have to worry about the security of your backup system – it should be automatically taken care of.  If you’re having to take care of it, perhaps you should consider a different system.

----- Signature and Disclaimer -----

Written by W. Curtis Preston (@wcpreston). For those of you unfamiliar with my work, I've specialized in backup & recovery since 1993. I've written the O'Reilly books on backup and have worked with a number of native and commercial tools. I am now Chief Technical Architect at Druva, the leading provider of cloud-based data protection and data management tools for endpoints, infrastructure, and cloud applications. These posts reflect my own opinion and are not necessarily the opinion of my employer.

Your onsite backup server is a security risk

Did you know there have been 7870 public data breaches since 2005?  Your company’s data is under attack. Like terrorism, the attackers only have to be successful once. You have to be successful 100% of the time.

Which is why it’s important to patch your systems regularly and keep abreast of any security vulnerabilities your company’s backup product may have.  But have you ever thought about how much of a security risk the backup server is? It’s a risk for three reasons: the value of what it has, the typical experience level of its admins, and a lack of attention.

The backup system has all the marbles

Did you ever think about the fact that the backup system is the most sensitive server in your environment? It’s sensitive because it has everything and it can do everything.

First, the backup system has a copy of everything! All the data in your environment resides on disks or tapes it controls. While some data may be stored offsite and is effectively out of reach, most current data is immediately available via a few simple commands. Sometimes the backup data is available via other mechanisms, such as a web or NFS server, which is why a vulnerability in those products could give a malicious user access to anything he/she wants.

The backup system can read and write every piece of data in your datacenter. In order to back up data, it must be able to read it.  To be able to read it, the backup system is given superuser privileges.  Unix/Linux backup software runs as root, and Windows systems tend to run as Administrator. That means it can read or write any file in the environment.

Most backup software also has the ability to run scripts before and after the backup, and those scripts run as the privileged user. Combine that with the ability to back up and restore files, and you have a scary situation.  A malicious user who gains backup admin privileges can write a malicious script, back it up, restore it to the appropriate location, then execute the script as a privileged user.  Just let that sink in for a minute.

The backup admins are often very junior

My first job in tech was the “backup guy” for a huge credit card company. I barely knew how to spell Unix, and a few days into my job I was given the keys to the kingdom: the root password to the backup system and every server in the datacenter.  (We didn’t have the concept of role-based admin in those days, so anything you did with backups, you did as root.)

My story is not unique.  Backups are often given to the FNG. He or she takes the gig because it gets them in the door, but it’s the job that nobody wants. As soon as they get some experience under their belt, they do their best to pass this very difficult job off to anyone else.  This has been true of backups for years, and this revolving door usually results in very junior people running the backup system.

I know I wanted to get out of backups back then, but I went from being the backup guy to being in charge of the backup team.  Three years later, I was still the main point of contact for the backup system.  Working for me were several people who were just as junior as I was when I started, all of whom had root privileges to the entire bank. Without going into details, I’ll just say that not everyone that worked for me should have been given the keys to the kingdom like that.

The most sensitive system in your environment is being handed over to the most junior person you have.  Again… let that sink in a little bit.

The backup server doesn’t receive enough attention

The security team always made sure the database servers & file servers were patched. But I don’t recall ever getting a call from them about the backup server. That meant it was up to the most junior person in the environment to make sure the most sensitive server in the environment was being regularly patched and secured against attacks.  That makes perfect sense. Not.

Another way this manifests itself is in the backup software. Many companies making backup products rely on external products (e.g. Apache) to augment their functionality (e.g. web access to your backup server). The thinking is to use publicly available tools instead of building their own. They’re a backup company, after all, not a web server company.

But unfortunately, embedded software like this often gets patched later than it should.  When an Apache vulnerability is discovered, people who know they are running Apache tend to patch it.  But what if it’s inside your backup software?  You rely on the backup vendor to know that and to patch it appropriately. But the inattention I’m referring to also sometimes applies to embedded components inside a backup system. It may take weeks or months before the vulnerability is patched in the backup software. This Ars Technica article discusses a recently patched vulnerability in a backup software package where there was a three-month delay between the initial discovery of the vulnerability and the creation of a patch for all related systems.

Choice 1: Secure your onsite backup system

You can do a number of things to secure your onsite system, starting with recognizing how much of a vulnerability it is. You can harden the system itself, patch the backup system, and do your best to limit the powers of your backup admin.

  • Harden the backup system
    • Firewall it off, using a software firewall running in the system or an actual firewall in front of the system — preferably the latter. Make it so that you can only administer the system via a particular VPN, and that admins must authenticate to the VPN prior to administering the backup system. This also addresses another vulnerability, which is that some backup systems send their commands in plain text.
    • Make sure that the backup server is running the most secure version of the operating system you have.
    • Run the backup software via a separate privileged account, not the default privileged account.  For example, run it as an account called backupadmin with userid 0, or with Administrator privileges. Do not run it as root or Administrator itself.  Then use your intrusion-detection software to watch that account like a hawk. (A minimal sketch of that kind of account watchdog follows this list.)
    • If your backup admin needs root privileges on Unix systems, force them to use sudo.
    • Require Windows backup admins to use their non-privileged account, and “Run as administrator” when they need to do something special.
    • Make sure the backup system is continually updated to the latest patch level. It should be the first system you patch, not the last.
    • If your backup software supports two-factor authentication, use it.
    • If you are writing backup data to a deduplication appliance across Ethernet, you need to harden and separate that interface as well. For example, do not allow direct access to any of its data via NFS/SMB. A physically separate Ethernet connection between the backup server and any backup storage would be preferred.
  • Limit backup admin powers
    • If your backup system supports the concept of role-based admin, do whatever you can to limit the power of the backup admin.  Maybe give them the power to do backups but not restores.  Or let them run backups, but not configure them.  Restores and configuration changes could/should be done by a separate account that requires a separate login with strong two-factor authentication.
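
Here is a minimal sketch of the “watch that account like a hawk” idea mentioned in the list above: flag any backup processes running as something other than the dedicated account. The process names and account name are made up; substitute whatever your backup product and environment actually use.

```python
# Minimal sketch of an account watchdog: alert on any backup processes
# running as something other than the dedicated backup account.
# The daemon names and the account name below are hypothetical.
import psutil   # third-party package: pip install psutil

BACKUP_PROCESSES = {"bpbkar", "bpcd", "backupd"}   # hypothetical daemon names
EXPECTED_USER = "backupadmin"

for proc in psutil.process_iter(["name", "username"]):
    name = (proc.info["name"] or "").lower()
    if name in BACKUP_PROCESSES and proc.info["username"] != EXPECTED_USER:
        print(f"ALERT: {name} (pid {proc.pid}) is running as "
              f"{proc.info['username']}, expected {EXPECTED_USER}")
```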

Choice 2: Get rid of your backup server

What if you got rid of your backup server altogether?  There’s nothing more secure than something that doesn’t exist!  You could do this by using a backup system with a service-based public cloud architecture. Backup services that back up directly to the cloud offer a number of security advantages over those that use backup servers.

  • Front end designed for direct Internet access
    • Traditional backup systems are designed to be run inside an already-secure datacenter, where the expectation is that direct attacks are less likely. Cloud backup systems are designed with harder front ends because they acknowledge they will be directly connected to the Internet. A lot of the basic security changes suggested above would be considered table stakes for any Internet-facing service.
  • Continuous security monitoring
    • Backup services running in a cloud like AWS are continually monitored for attempted intrusions.  (Again, this is table stakes for such a service.)  You get best-of-breed security simply by using the service.
  • Any embedded systems constantly & automatically patched
    • The operating systems and applications supporting any backup service are automatically and immediately patched to the latest available patches. The infrastructure is so huge that this has to be automated; you don’t have to do anything to make it happen.
  • Backup data not exposed to anyone
    • A good cloud backup system also segregates your actual backup data from the rest of the network, just like I was suggesting for your onsite backup server. But in this case, that’s already done. No one is getting to your backup data except through the authorized backup system.

Summary: Lock it up or give it up

Once you recognize what an incredibly vulnerable thing your backup server is, your choices are simple: lock it up very tight or get rid of it. I think most companies would be served well by the latter.  Given the advent of really good dedupe and replication, only the biggest companies are unable to take advantage of cloud-based backup systems.

----- Signature and Disclaimer -----

Written by W. Curtis Preston (@wcpreston). For those of you unfamiliar with my work, I've specialized in backup & recovery since 1993. I've written the O'Reilly books on backup and have worked with a number of native and commercial tools. I am now Chief Technical Architect at Druva, the leading provider of cloud-based data protection and data management tools for endpoints, infrastructure, and cloud applications. These posts reflect my own opinion and are not necessarily the opinion of my employer.

Dedupe done right speeds up backups

On my LinkedIn profile, I posted a link to my last article, Why good dedupe is important — and hard to do.  I got some pretty good feedback on it, but one comment from my buddy Chris M. Evans (@chrismevans) got me thinking.

“Curtis, it’s worth highlighting that space optimisation may not be your only measurement of dedupe performance. The ability to do fast ingest with a poorer level of dedupe (which is then post processed) could be more attractive. Of course, you may be intending to talk about this in future posts…”

I’m glad you asked, Chris! (BTW, Chris lives over yonder across the pond, so he spells things funny.) Here’s my quick answer to your question, with the longer one below:

If dedupe is done right, it speeds up backups and doesn’t slow them down.

Target dedupe can slow down backups

I think Chris’ comment stems primarily from thinking of dedupe as something that happens in a target dedupe appliance.  I have run backups to a number of these appliances over the years, and Chris is right.  Depending on the architecture — especially decisions made about dedupe efficiency vs. speed — a dedupe appliance can indeed slow down the backup system.


This is actually why I traditionally preferred the post-process way of doing dedupe when I was looking at target appliances.  A post-process system (e.g. Exagrid) first stores all backups in their native format in a landing zone.  Those backups are then deduped asynchronously. This made sure that the dedupe process — which can be very CPU, RAM, and I/O intensive — didn’t slow down the incoming backup.

An inline approach (e.g. Data Domain) dedupes the data before it is ever written to disk. Proponents of the inline approach say that it saves you from having to buy the disk for the staging area, and that it is more efficient to dedupe it first.  They claim that the compute power required to dedupe data inline is made up for by a significant reduction in I/O.

But I generally preferred the post-process approach for two reasons. The biggest reason was that it left the latest backup in its native format in the landing zone, creating a significant performance advantage during restores — especially instant recovery type restores. But the other reason I generally preferred post-process was the performance impact I had seen inline dedupe have on backups.

Chris’ point was that strong dedupe can impact the performance of the backup, and I have seen just that with several inline dedupe solutions. Customers who really noticed this were those that had already grown accustomed to disk-based backup performance.

If you were used to tape performance (due to the speed mismatch issue I covered here) then you didn’t really notice anything.  But if you were already backing up a large database or other server to disk, and then switched that backup to a target dedupe appliance, your backup times might actually increase — sometimes by a lot.  I remember one customer who told me their Exchange backups were taking three times longer after they switched from a regular disk array to a popular target dedupe appliance.

Target dedupe was — and still is — a band-aid

The goal of target dedupe was to introduce the goodness of dedupe into your backup system without requiring you to change your backup software. Just point your backups to the target dedupe appliance and magic happens.  It was a band-aid, and I contend it still is.

But doing dedupe at the target is much harder — read: more expensive — than doing it at the source.  The biggest reason is that the dedupe appliance is not looking at your files; it’s looking at a “tar ball” of your files.  It’s looking at your files inside a backup container, many of which are cryptic and difficult to parse.  A lot of work has to go into deciphering and properly “chunking” the backup formats. That work translates into development cost and computing cost, all of which gets passed down to you.

The second reason target dedupe is the wrong way to go is that it removes one of the primary benefits of dedupe: bandwidth savings. With a few exceptions (e.g. Boost), your network sees no benefit from dedupe.  The entire backup — fulls and incrementals — is transferred across the network.

It was a band-aid, and it did a good job of introducing dedupe into the backup system. But now that we see the value of it, it’s time to do it right.  It’s time to start deduping before we backup, not after.

Source dedupe is the way to go

Source dedupe is done at the very beginning of the backup process.  Every new or modified file is parsed, and a hash is calculated for its contents. If that hash has been seen before, that chunk doesn’t need to be transferred across the network.
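
Here’s a minimal sketch of that source-side loop. It uses fixed-size chunks and an in-memory set standing in for the backup service’s hash index, purely to keep the example short; real products use variable-size chunking and a server-side index:

```python
# Minimal sketch of source-side dedupe: hash each chunk and only "send"
# chunks the service hasn't seen before.
import hashlib

CHUNK_SIZE = 4 * 1024 * 1024     # 4 MB chunks, an arbitrary choice
known_hashes = set()             # stands in for the backup service's index

def back_up_file(path):
    sent = skipped = 0
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK_SIZE):
            digest = hashlib.sha256(chunk).hexdigest()
            if digest in known_hashes:
                skipped += 1     # already stored; send only a reference
            else:
                known_hashes.add(digest)
                sent += 1        # this is where the chunk would be uploaded
    print(f"{path}: sent {sent} chunks, skipped {skipped} duplicates")
```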

There are multiple reasons why source dedupe is the way to go.  The biggest reasons are purchase cost, performance and storage & bandwidth savings.

Target dedupe is expensive because it is developmentally and computationally expensive. I used to joke that a target dedupe appliance makes 10 TB look like 200 TB to the backup system, but they’d only charge you for 100 TB.  Yes, target dedupe appliances make the impossible possible, but they also charge you for it.

They also charge for it over and over.  Did you ever think about the fact that all the hard work of dedupe is done only by the first appliance?  Therefore, one could argue that only the first appliance should cost so much more.  But you know that isn’t the case; you pay the dedupe premium on every target dedupe appliance you buy, right?  Source systems can charge once for the dedupe, then replicate that backup to many locations without having to charge you for it.

Source dedupe is also much faster.  One reason for that is that it never has to dedupe a full backup ever again. Target appliances are forced to dedupe full backups all the time, because the backup software products all need to make them once in a while.  A source dedupe product does one full, and block-level incrementals after that.  Another reason source dedupe is faster is that it can look directly at the files being backed up, instead of having to divine the data hidden behind a cryptic backup format.

Finally, because source dedupe is looking directly at the data, it can dedupe better and get rid of more duplicate data. That saves bandwidth and storage, further reducing your costs — and speeding up the backup.  The more you are using the cloud, the more important this is.  Every deduped bit reduces your bandwidth cost and the bill you will pay the cloud vendor every month.

Dedupe done right speeds up backups

This is why I said to Chris that this problem of being forced to decide between dedupe ratio and backup performance really only applies to target dedupe.  Source dedupe is faster, cheaper, and saves more storage than any other method.  It’s been 20 years now since I was first introduced to the concept of dedupe.  I think it’s time we start doing it right.

----- Signature and Disclaimer -----

Written by W. Curtis Preston (@wcpreston). For those of you unfamiliar with my work, I've specialized in backup & recovery since 1993. I've written the O'Reilly books on backup and have worked with a number of native and commercial tools. I am now Chief Technical Architect at Druva, the leading provider of cloud-based data protection and data management tools for endpoints, infrastructure, and cloud applications. These posts reflect my own opinion and are not necessarily the opinion of my employer.

Why good dedupe is important — and hard to do

Good dedupe creates real savings in disk and bandwidth requirements.  It also makes the impossible possible by replicating even full backups offsite. Dedupe is behind many advancements in backup and replication technology over the last decade or so.


What is dedupe?

Dedupe is the practice of identifying and eliminating duplicate data. It’s most common in backup technology but is now being implemented in primary storage as well.  Just as dedupe made disk more affordable as a backup target, dedupe makes flash more affordable as a primary storage target.  It also changes the economics of secondary storage for reference data.

The most common method of accomplishing this is to first chop up the data into chunks, which are analogous to blocks. We use the term chunks because blocks typically imply a fixed size of something like 8K.  Dedupe systems often use pieces of variable size, so the term chunk was coined to refer to a piece of data to be compared.

The chunk is then run through a cryptographic hashing algorithm such as SHA-256. Initially intended for security purposes, such algorithms produce a value that is, for all practical purposes, unique to each chunk of data. In the case of SHA-256 (part of the SHA-2 family), it creates a 256-bit value we call a hash. If two chunks of data have the same hash, they are considered identical and one is discarded.
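
A tiny illustration of the fingerprint idea: identical chunks produce identical SHA-256 values, so a dedupe system can compare small hashes instead of comparing multi-megabyte chunks byte for byte:

```python
# Identical chunks produce identical SHA-256 hashes; different chunks do not.
import hashlib

chunk_a = b"some block of data" * 1000
chunk_b = b"some block of data" * 1000        # identical content
chunk_c = b"some OTHER block of data" * 1000  # different content

same = hashlib.sha256(chunk_a).digest() == hashlib.sha256(chunk_b).digest()
diff = hashlib.sha256(chunk_a).digest() == hashlib.sha256(chunk_c).digest()
print(same)   # True  -> duplicate chunk, store it once
print(diff)   # False -> new chunk, store it
```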

The more redundant data you can identify, the more money you can save and the faster you can replicate data across the network.  So what kinds of things make effective dedupe?

True global dedupe

The more data you can compare, the more duplicate data you are likely to find. Many dedupe systems create data pools that do not talk to each other and thus significantly reduce their effectiveness.

Some dedupe systems only look for duplicate data contained within the backups of a single system, for example.  They do not compare the files backed up from Apollo to the files backed up from Elvis. If your company has multiple email servers, for example, there is a very high chance of duplicate data across them, as many people will send the same attachment to several people who may be hosted on different email systems. If you’re backing up endpoints such as laptops, the chances of duplicate data are significant.

On the opposite end of the backup equation are backup appliances. Target dedupe appliances — even the most well-known ones — typically compare data stored on an individual appliance. The dedupe is not global across all appliances.  Each target dedupe appliance is a dedupe silo.

This is also true when using different backup systems. If you are using one backup system for your laptops, another to back up Office 365, and another to back up your servers, you are definitely creating dedupe silos as well.

A truly global dedupe system would compare all data to all other data. It would compare files on a mobile phone to attachments in emails. It would compare files on the corporate file server to files stored on every laptop.  It would identify a single copy of the Windows or Linux operating system and ignore all other copies.

Dedupe before backup

The most common type of dedupe today is target appliance dedupe, and it’s absolutely less effective than deduping at the source. The first reason it’s less effective is that it requires a significant amount of horsepower to crack open the backup format and look at the actual data being backed up. Even then, it’s deduping chunks of backup strings, instead of chunks of actual files. It’s deducing the underlying data rather than actually looking at it.  The closer you get to the actual files, the better dedupe you’re going to get.

The second reason it’s less effective is that you spend a lot of CPU time, I/O resources, and network bandwidth transferring data that will eventually be discarded. Some dedupe appliances have recognized this issue and created specialized drivers that try to dedupe the data before it’s sent to the dedupe appliance, which validates the idea that the backup client is the best place to dedupe data.

The final reason why dedupe should be done before the data reaches an appliance is that when you buy dedupe appliances, you pay for the dedupe multiple times. You pay for it in the initial dedupe appliance, and you may pay extra for the ability to dedupe before the data gets to the appliance. If you replicate the deduped data, you have to replicate it to another dedupe appliance that costs as much as the initial one.

Application-aware dedupe

Another reason to dedupe before you back up is that at the filesystem layer the backup software can actually understand the files it’s backing up. It can understand that it’s looking at a Microsoft Word document, or a SQL Server backup string. If it knows that, it can slice and dice the data differently based on its data type.

For example, did you know that Microsoft Office documents are actually ZIP files?  Change a .docx extension to .zip and double-click it.  It will open up as a zip file. A dedupe process running at the filesystem layer can do just that and can look at the actual contents of the zip file, rather than looking at a jumble of chunks of data at the block layer.
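
Don’t believe me? Here’s the programmatic version of that experiment; the filename is just a placeholder for any Word document you have handy:

```python
# A .docx really is a ZIP archive; list what's inside it.
import zipfile

with zipfile.ZipFile("quarterly_report.docx") as doc:   # any Word doc you have
    for name in doc.namelist()[:10]:
        print(name)   # e.g. word/document.xml, word/media/image1.png
```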

How much can you actually save?


I try to keep my blogs on backupcentral.com relatively agnostic, but in this one I feel compelled to use my employer (Druva) as an example of what I’m talking about. I remember seven years ago watching Jaspreet Singh, the CEO of Druva, introduce Druva’s first product to the US.  He talked about how good their dedupe was, and I remember thinking “Yea, yea… everybody says they have the best dedupe.”  Now that I’ve seen things on the inside, I see what he was talking about.

I’ve designed and implemented many dedupe systems throughout the years. Based on that experience, I’m comfortable using the 2X rule of thumb. Meaning that if you have a 100 TB datacenter, your dedupe system is going to need at least 200 TB of disk capacity to back it up with any kind of retention.

For clarification, when I say 100 TB, I’m talking about the size of a single full backup, not the size of all the backups.  A typical environment might create 4000 TB of backup data from a 100 TB datacenter, which gets deduped to 200 TB.  That’s why a good rule of thumb is to start with 2X the size of your original environment.

Imagine my surprise when I was told that the Druva rule of thumb was .75X.  Meaning that in order to back up 100 TB of data with a year of retention, Druva would need only 75 TB of disk capacity. That’s less than the size of a single full backup!

Since Druva customers only pay each month for the amount of deduped data that the product stores, this means that their monthly bill is reduced by more than half (62%).  Instead of paying for 200 TB, they’re paying for 75 TB.   Like I said, good dedupe saves a lot of money and bandwidth.
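
For those who like to see the arithmetic, here it is using the figures above. Only the 100 TB front end, the 4000 TB of logical backups, and the 200 TB vs. 75 TB of stored data come from this article; the ratios follow from them:

```python
# The figures above, restated as arithmetic.
front_end_tb   = 100     # size of one full backup of the datacenter
logical_tb     = 4000    # total backup data generated over the retention period
stored_2x_tb   = 200     # the traditional 2X rule of thumb
stored_075x_tb = 75      # the 0.75X figure quoted above

print(f"Dedupe ratio at 2X:    {logical_tb / stored_2x_tb:.0f}:1")     # 20:1
print(f"Dedupe ratio at 0.75X: {logical_tb / stored_075x_tb:.1f}:1")   # 53.3:1
print(f"0.75X stores less than one full backup "
      f"({stored_075x_tb} TB vs {front_end_tb} TB)")
print(f"Storage (and bill) reduction: "
      f"{(stored_2x_tb - stored_075x_tb) / stored_2x_tb:.0%}")         # 62%
```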

----- Signature and Disclaimer -----

Written by W. Curtis Preston (@wcpreston). For those of you unfamiliar with my work, I've specialized in backup & recovery since 1993. I've written the O'Reilly books on backup and have worked with a number of native and commercial tools. I am now Chief Technical Architect at Druva, the leading provider of cloud-based data protection and data management tools for endpoints, infrastructure, and cloud applications. These posts reflect my own opinion and are not necessarily the opinion of my employer.

No such thing as a “Pay as you go” appliance

I’ve never seen an appliance solution that I would call “pay as you go.”  I might call it “pay as you grow,” but never “pay as you go.”  There is a distinct difference between the two.

What is “pay as you go?”

I’ll give you a perfect example.  BackupCentral.com runs on a Cpanel-based VM. Cpanel can automatically copy the backups of my account to an S3 account.   I blogged about how to do that here.

I tell Cpanel to keep a week of daily backups, four weeks of weekly backups, and 3 months of monthly backups.  A backup of backupcentral.com is about 20 GB, and the way I store those backups in S3, I have about fifteen copies.  That’s a total of about 300 GB of data I have stored in Amazon S3 at any given time.
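
The retention arithmetic behind that “about 300 GB” figure is simple enough to show:

```python
# Copies retained by the schedule above, times the size of one backup.
backup_size_gb = 20
copies = 7 + 4 + 3                  # daily + weekly + monthly copies
total_gb = backup_size_gb * copies
print(f"{copies} copies x {backup_size_gb} GB = {total_gb} GB stored")  # 14 x 20 = 280 GB
```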

Last time I checked, Amazon bills me about $.38/month.  If I change my mind and decrease my retention, my bill drops.  If I told Cpanel to not store the three monthly backups, my monthly bill would decrease by about 20%.  If I told it to make it six months of retention, my monthly bill would increase by about 20%.

What is “pay as you grow?”


Instead of using S3 — which automatically ensures my data is copied to three locations — I could buy three FTP servers and tell Cpanel to back up to them. I would buy the smallest servers I could find. Each server would need to be capable of storing 300 GB of data.  So let’s say I buy three servers with 500 GB hard drives, to allow for some growth.

Time will pass and backupcentral.com will grow.  That is the nature of things, right?  At some point, I will need more than 500 GB of storage to hold backupcentral.com.  I’ll need to buy another hard drive to go into each server and install that hard drive.

Pay as you grow always starts with a purchase of some hardware — more than you need at the time.  This is done to allow for some growth.  Typically you buy enough hardware to hold three years of growth.  Then a few years later when you outgrow that hardware, you either replace it with a bigger one (if it’s fully depreciated) or you grow it by adding more nodes/blocks/chunks/bricks/whatever.

Every time you do this, you are buying more than you need at that moment, because you don’t want to have to keep buying and installing new hardware every month.  Even if the hardware you’re buying is the easiest to buy and install hardware in the world, pay as you grow is still a pain, so you minimize the number of times you have to do it. And that means you always buy more than you need.

What’s your point, Curtis?

The company I work for (Druva) has competitors that sell “pay as you grow” appliances, but they often refer to them as “pay as you go.”  And I think the distinction is important. All of them start with selling you a multi-node solution for onsite storage, and (usually) another multi-node solution for offsite storage. These things cost hundreds of thousands of dollars just to start backing up a few terabytes.

It is in their best interests (for multiple reasons) to over-provision and over-sell their appliance configuration.  If they do oversize it, nobody’s going to refund your money when that appliance is fully depreciated and you find out you bought way more than you needed for the last three or five years.

What if you under-provision it?  Then you’d have to deal with whatever the upgrade process is sooner than you’d like.  Let’s say you only buy enough to handle one year of growth.  The problem with that is now you’re dealing with the capital process every year for a very crucial part of your infrastructure.  Yuck.

In contrast, Druva customers never buy any appliances from us.  They simply install our software client and start backing up to our cloud-based system that runs in AWS.  There’s no onsite appliance to buy, nor do they need a second appliance to get the data offsite. (There is an appliance we can rent them to help seed their data, but they do not have to buy it.) In our design, data is already offsite.  Meanwhile, the customer only pays for the amount of storage they consume after their data has been globally deduplicated and compressed.

In a true pay as you go system, no customer ever pays for anything they don’t consume. Customers often pay up front for future consumption, just to make the purchasing process easier.  But if they buy too much capacity, anything they paid for in advance just gets applied to the next renewal.  There is no wasted capacity, no wasted compute.

In one model (pay as you grow), you have wasted money and wasted power and cooling while your over-provisioned system sits there waiting for future data.  In the other model (pay as you go), you pay only for what you consume — and you have no wasted power and cooling.

What do you think?  Is this an important difference?

 

----- Signature and Disclaimer -----

Written by W. Curtis Preston (@wcpreston). For those of you unfamiliar with my work, I've specialized in backup & recovery since 1993. I've written the O'Reilly books on backup and have worked with a number of native and commercial tools. I am now Chief Technical Architect at Druva, the leading provider of cloud-based data protection and data management tools for endpoints, infrastructure, and cloud applications. These posts reflect my own opinion and are not necessarily the opinion of my employer.

Bandwidth: Backup Design Problem #2

Getting enough bandwidth to get the job done is the second challenge of designing and maintaining a traditional backup system. Tape was the first challenge, and it is solved by not using tape for operational backups.


This is a major problem with any backup software product that does occasional full backups, which is most of the products running in today’s datacenter. Products that do full-file incremental backups also have this problem, although to a lesser degree.  (A full-file incremental backup is one that backs up an entire file when even one byte has changed.)

This is such a problem that many people would agree with the statement that backups test your network more than anything else does. This is one of the main reasons people run backups at night.

This problem has been around for a long time.  I remember one time I was testing backups over the weekend, and accidentally set things up for backups to kick off at 10 AM the next day — which happened to be Monday. The network came to a screeching halt that day until we figured out what was happening and shut the backups off.

Backup system admins spend a lot of time scheduling their backups so they even out this load.  Some perform full backups only on the weekend, but this really limits the overall capacity of the system.  I prefer to perform 1/7th of the full backups each night if I’m doing weekly full backups, or 1/28th of the full backups each night if I’m doing monthly full backups.

While this increases your system capacity, it also requires constant adjustment to even the full backups out, as the size of systems changes over time. And once you’ve divided the full backups by 28 and spread them out across the month, you’ve created a barrier that you will hit at some point. What do you do when you’re doing as many full backups each night as you can? Buy more bandwidth, of course.
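
Here’s a rough sketch of how I’d spread those fulls out, using a simple greedy balance by client size. The client names and sizes are invented; a real scheduler would pull them from the backup catalog and re-run the balancing as systems grow:

```python
# Spread full backups across a weekly cycle while keeping each night's
# full-backup load roughly even. Use 28 nights for monthly fulls.
NIGHTS = 7

clients = {                     # hypothetical client -> full backup size in GB
    "apollo": 4000, "elvis": 2500, "dbprod1": 8000, "dbprod2": 7500,
    "fileserver": 12000, "mail1": 6000, "mail2": 5500, "web1": 800,
}

# Greedy assignment: biggest clients first, each onto the lightest night so far.
load = [0] * NIGHTS
schedule = {night: [] for night in range(NIGHTS)}
for name, size in sorted(clients.items(), key=lambda kv: -kv[1]):
    night = load.index(min(load))
    schedule[night].append(name)
    load[night] += size

for night in range(NIGHTS):
    print(f"Night {night + 1}: {schedule[night]} ({load[night]} GB of fulls)")
```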

How has this been fixed?

 

Luckily, this problem has been fixed. Products and services that have switched to block-level incremental-forever backups need significantly less bandwidth than those that do not use such technology.   A typical block-level incremental uses less than a tenth of the bandwidth of a typical file-level incremental backup, and less than a hundredth of the bandwidth of a typical full backup.
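
To make those ratios concrete, here is a rough model for a single 100 GB client. The change rate and overhead factors are assumptions I’m making for illustration, chosen to land in the same ballpark as the ratios above:

```python
# Rough bandwidth model for one client. The 100 GB size, 2% daily change
# rate, and overhead factors are assumptions for illustration only.
client_gb    = 100
daily_change = 0.02

full_gb       = client_gb                      # a periodic full sends everything
file_incr_gb  = client_gb * daily_change * 3   # whole files re-sent when any byte changes
block_incr_gb = client_gb * daily_change * 0.3 # only the changed blocks within those files

for label, gb in [("Full backup", full_gb),
                  ("File-level incremental", file_incr_gb),
                  ("Block-level incremental", block_incr_gb)]:
    print(f"{label:24} ~{gb:6.1f} GB over the wire")

# With these assumptions, the block-level incremental is 10x smaller than
# the file-level incremental and roughly 170x smaller than the full.
```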

Another design element of modern backup products and services is that they use global deduplication, which only backs up blocks that have changed and haven’t been seen on any other system. If a given file is present on multiple systems, it only needs to be backed up from one of them. This significantly lowers the amount of bandwidth needed to perform a backup.

Making the impossible possible

 


Lowering the bandwidth requirement creates two previously unheard-of possibilities: Internet-based backups and round-the-clock backups. The network impact of globally deduplicated, block-level incremental backups is so small that the data can be transferred over the Internet for many environments.  In addition, the impact on the network is so small that backups can often be done throughout the day.  And all of this can be done without all of the hassle mentioned above.

The better a product can identify the blocks that have changed, and the more granular and global its deduplication, the more these things become possible. One of the best ways to determine how efficient a backup system is with bandwidth is to ask the vendor how much storage is needed to store 90-180 days of backups. There is a direct relationship between that number and the amount of bandwidth you’re going to need.

----- Signature and Disclaimer -----

Written by W. Curtis Preston (@wcpreston). For those of you unfamiliar with my work, I've specialized in backup & recovery since 1993. I've written the O'Reilly books on backup and have worked with a number of native and commercial tools. I am now Chief Technical Architect at Druva, the leading provider of cloud-based data protection and data management tools for endpoints, infrastructure, and cloud applications. These posts reflect my own opinion and are not necessarily the opinion of my employer.

First Impressions of AWS Re:Invent

We’ve come a long way, baby.  I worked at Amazon when they were just an Internet bookseller. I put in the first enterprise-wide backup system back in 1998.  I was there on the day they came out with the “universal product locator,” which is the day they sold something other than books.

Oh, and if you’re here, make sure you stop by our booth and meet Data!  We have Brent Spiner from Star Trek Next Generation in our booth and at our party. Details here.

It’s a big show

There are definitely tens of thousands of people here.  Amazon says it’s 40K, most of whom are actual customers.  That’s a refreshing change from some shows I’ve been to that are more about partner meetings than potential customer meetings.  Now that I’m viewing this show as a sponsor (since I now work at Druva), that’s really important. Almost everyone here is someone we could potentially sell something to.

Of course, AWS being what it is, there is everything from a very small company with one VM or a couple of GB in S3 to a large enterprise.  Amazon says it’s more the latter than the former, of course.  But as a company with solutions aimed at the middle enterprise, that’s the first thing we have to determine.

The show is actually too big

It’s the first large show I’ve been to in Vegas that is spread across multiple venues. And there’s a sign telling you to expect it to take 30 minutes to travel between venues.

There are plenty of cities that can host an event of this size without requiring people to travel between venues.  (I live in one of them.  San Diego hosts ComicCon, which is three times the size of this show.)  So I’m curious as to why Amazon has chosen Las Vegas.

The show is also sold out.  Druva has a large team here, but it would be larger if we were able to get more tickets. Even as a sponsor, we’re unable to buy more tickets for people just to work the booth.  Why is that?  Either it’s a marketing tactic or they’ve actually hit the agreed-upon capacity of the venues they chose. Either one is totally possible.

Remember when?

Amazon only sold books?  Remember when they only sold “stuff,” and weren’t the largest IaaS vendor on the planet?  Remember when we said no one would run production on Windows?  Remember when we said no one would move production to the cloud?  Ah, those were the days.

As a company that runs its entire world on Amazon, it’s now hard to imagine a world without them.  Their ability to scale infrastructure and applications like DynamoDB has enabled an entirely new class of production applications that simply weren’t possible before.  Druva is able to do things for our customers because we’re built as a cloud-native application.  We can dynamically and automatically scale (up and down) every part of our infrastructure as our customers’ needs demand.  This gives us unlimited scalability without any of the limits associated with typical backup apps.  This is why some of the largest organizations in the world trust us with their user data and ROBO data. And none of that would be possible without something like AWS.

Like I said, we’ve come a long way, baby.

 

----- Signature and Disclaimer -----

Written by W. Curtis Preston (@wcpreston). For those of you unfamiliar with my work, I've specialized in backup & recovery since 1993. I've written the O'Reilly books on backup and have worked with a number of native and commercial tools. I am now Chief Technical Architect at Druva, the leading provider of cloud-based data protection and data management tools for endpoints, infrastructure, and cloud applications. These posts reflect my own opinion and are not necessarily the opinion of my employer.

Why tape drives are bad for backups

Specifically, this article is about why modern tape drives are a really bad choice to store the initial copy of your backups. It’s been this way for a long time, and I’ve been saying so for at least 10 years, in case anyone thinks I’ve been swayed by my current employer.  Tape is good at some things, but receiving the first copy of your backups isn’t one of them.  There are also reasons why you don’t want to use them for your offsite copy, and I’ll look at those, too.

 

Tape drives are too fast for incremental backups

  • Tape drives are too fast
    • In case you didn’t know it, modern tape drives essentially have two speeds: stop and very fast. Yes, there are variable speed tape drives, but even the slowest speed they run at is still very fast.  For example, the slowest an LTO-7 drive can go using LTO-7 media is 79.99 MB/s native.  Add compression, and you’re at 100-200 MB/s minimum speed!
  • Incremental backups are too slow
    • Most backups are incremental backups, and incremental backups are way too slow. A file-level incremental backup supplies an unpredictable level of throughput, usually measured in single digits of megabytes per second. This number is nowhere near 100-200 MB/s.
  • The speed mismatch is the problem
    • When incoming backups are really slow and the tape drives want to go very fast, the drive has no choice but to stop, rewind, and start up again. It does this over and over, dragging the tape back and forth across the read/write head in multiple passes (a phenomenon often called shoe-shining). This wears out the tape and the drive, and is the number one reason behind tape drive failures in most companies.  Tape drives are simply not the right tool for incoming backups.  Disk drives are much better suited to the task.
  • What about multiplexing?
    • Multiplexing is simultaneously interleaving multiple backups into a single stream in order to create a stream fast enough to keep your tape drive happy. It’s better than nothing, but remember that it helps your backups and hurts your restores.  If you interleave ten backups together during backup, you have to read all ten streams during a restore — and throw away nine of them just to get the one stream you want. It literally makes your restore ten times longer.  If you don’t care about restore speed, then it’s great! (There’s a quick sketch of the speed-mismatch math right after this list.)
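
Here’s the speed-mismatch math as a quick sketch. The drive figure is the one quoted above; the client throughput is an assumption:

```python
# The speed-mismatch arithmetic behind multiplexing.
import math

drive_min_mb_s = 80   # roughly the slowest an LTO-7 drive will stream
client_mb_s = 5       # typical file-level incremental throughput (assumed)

streams_needed = math.ceil(drive_min_mb_s / client_mb_s)
print(f"Streams to multiplex to keep the drive streaming: {streams_needed}")  # 16

# The restore-side penalty: to read one stream back, you read (and throw
# away) all the other interleaved streams too.
print(f"Data read per restore: roughly {streams_needed}x the backup you actually want")
```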

What about offsite copies?

There have been many incidents involving tapes lost or exposed by offsite vaulting companies like Iron Mountain.  Even Iron Mountain’s CEO once admitted that it happens at a regular enough interval that all tape should be encrypted. I agree with this recommendation — any transported tape ought to be encrypted.

Tape is still the cheapest way to get data offsite if you are using a traditional backup and recovery system. The alternative with such a system is to buy an expensive deduplication appliance to make the daily backups small enough to replicate. These can be effective, but they are very costly, and there are a lot of limits to their deduplication abilities — many of which make them cost more to purchase and use.  This is why most people are still using tape to get backups offsite.

If you have your nightly backups stored on disk, it should be possible to get those backups copied over to tape.  That is assuming that your disk target is able to supply a stream fast enough to keep your tape drives happy, and there aren’t any other bottlenecks in the way.  Unfortunately, one or more of those things is often not the case, and your offsite tape copy process becomes as mismatched as your initial backup process.

In other words, tape is often the cheapest way to get backups offsite, but it’s also the riskiest, as tapes are often lost or exposed during transit. Secondly, it can be difficult to configure your backup system properly to be able to create your offsite tape copy in an efficient manner.

I thought you liked tape?

I do like tape.  In fact, I’m probably one of the biggest proponents of tape.  It has advantages in some areas.  You cannot beat the bandwidth of tape, for example.  There is no faster way to get petabytes of data from one side of the world to the other.  Tape is also much better at holding onto data for multiple decades, with a much lower chance of bit rot.  But none of these advantages come into play when talking about day-to-day operational backups.

I know some of you might think that I’m saying this just because I now work at a cloud-based backup company. I will remind you that I’ve been saying these exact words above at my backup seminars for almost ten years.  Tape became a bad place to store your backups the day it started getting faster than the network connection backups were traveling over — and that was a long time ago.

What do you think?  Am I being too hard on tape?

----- Signature and Disclaimer -----

Written by W. Curtis Preston (@wcpreston). For those of you unfamiliar with my work, I've specialized in backup & recovery since 1993. I've written the O'Reilly books on backup and have worked with a number of native and commercial tools. I am now Chief Technical Architect at Druva, the leading provider of cloud-based data protection and data management tools for endpoints, infrastructure, and cloud applications. These posts reflect my own opinion and are not necessarily the opinion of my employer.

Is AWS Ready for Production Workloads?

Yes, I know they’re already there.  The question is whether or not Amazon’s infrastructure is ready for them.  And by “ready for them,” I mean “ready for them to be backed up.”  Of course that’s what I meant.  This is backupcentral.com, right?

But as I prepare to go to Amazon Re:Invent after Thanksgiving, I find myself asking this question. Before we look at the protections that are available for AWS data, let’s look at why we need them in the first place.

What are we afraid of?

There is no such thing as the cloud; there is only someone else’s datacenter.  The cloud is not magic; the things that can take out your datacenter can take out the cloud.  Yes, it’s super resilient and time-tested.  I would trust Amazon’s resources over any datacenter I’ve ever been in.  But it’s not magic and it’s not impenetrable – especially by stupidity.

  • Amazon zone/site failure
    • This is probably the thing Amazon customers are most prepared for.  Amazon storage services like S3 continuously replicate data to multiple geographically dispersed facilities.  Something like 9/11, or even a massive hurricane or flood, should not affect the availability or integrity of data stored in AWS.  Caveat: replication is asynchronous, so you may lose some data.  But you should not lose your dataset.
  • Accidental deletion/corruption of a resource
    • People are, well, people. They do dumb things.  I’ve done dumb things. I can’t tell you the number of times I’ve accidentally deleted something I needed. And, no, I didn’t always have a backup.  Man, it sucks when that happens.  Admins can accidentally delete volumes, VMs, databases, and any other kind of resource you can think of.  In fact, one could argue that virtualization and the cloud make it easier to do more dumb things.  No one ever accidentally deleted a server when that meant pulling it out of the rack.  Backups protect against stupidity.
  • Malicious damage to a resource
    • Hackers suck. And they are out there. WordPress tells me how many people try to hack my server every day.  And they are absolutely targeting companies with malware, ransomware, and directed hacking attacks.  The problem that I have with many of the methods that people use to protect their Amazon resources is that they do not take this aspect into account  – and I think this danger is the most common one that would happen in a cloud datacenter.  EC2 snapshots and RDS snapshots (which are actually copies) are stored in the same account they are backing up.  It takes extra effort and extra cost to move those snapshots over to another account.  And no one seems to be thinking about that.  People think about the resiliency and protection that Amazon offers – which it does – but they forget that if a hacker takes control of their account they are in deep doodoo.  Just ask codespaces.com.  Oh wait, you can’t.  Because a hacker deleted them.
  • Catastrophic failure of Amazon itself
    • This is extremely unlikely to happen, but it could happen. What if there were some type of rolling bug (or malware) that somehow affected all instances in all AWS accounts?  Even cross-account copies of data would go bye-bye.  Like I said, this is extremely unlikely to happen, but it’s out there.

How do we protect against these things?

I’m going to write some other blog posts about how people protect their AWS data, but here’s a quick summary.

  • Automated Snapshots
    • As I said before, these aren’t snapshots in the traditional sense of the word.  These are actually backups.   You can use the AWS Ops Automator, for example, to regularly and automatically make a “snapshot” of your EC2 instance.  The first “snapshot” copies the entire EBS volume to S3.  Subsequent “snapshots” are incremental copies of blocks that have changed since the last snapshot.  I’m going to post more on these tools later.  Suffice it to say they’re better than nothing, but they leave Mr. Backup feeling a little queasy.
  • Manual copying of snapshots to another account
    • Amazon provides command-line and PowerShell tools that can be used to copy snapshots to another account.  If I were relying on snapshots for data protection, that’s exactly what I would do.  I would have a central account that is used to hold all my snapshots, and that account would be locked down tighter than any other account. (A rough sketch of this idea follows this list.) The downside to this approach is that it isn’t automated.  We’re now in scripting and manual scheduling land. For the Unix/Linux folks among us this might be no big deal. But it’s still a step backward for backup technology, to be sure.
  • Home-grown tools
    • You could use rsync or something like that to backup some of your Amazon resources to something outside of Amazon.  Besides relying on scripting and cron, these tools are often very bandwidth-heavy, and you’re likely going to pay heavy egress charges to pull that data down.
  • Third-party tools
    • For some Amazon resources, such as EC2, you could install a third-party backup tool and back up your VMs as if they were real servers.  This would be automated and reportable, and probably the best thing from a data protection perspective. The challenge here is that this is currently only available for EC2 instances.  We’re starting to see some point tools to back up other things that run in AWS, but I haven’t seen anything yet that tackles the whole thing.
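
For the “copy your snapshots to a locked-down central account” option above, here is a hedged sketch of what that looks like with boto3. The snapshot ID, regions, and account ID are placeholders, and I’ve left out error handling, scheduling, and the KMS key sharing you’d need for encrypted snapshots:

```python
# Sketch: copy an EBS snapshot to another region, then share it with a
# central, locked-down backup account. All identifiers are placeholders.
import boto3

SOURCE_REGION   = "us-east-1"
DEST_REGION     = "us-west-2"
DEST_ACCOUNT_ID = "111122223333"          # the locked-down backup account
SNAPSHOT_ID     = "snap-0123456789abcdef0"

ec2 = boto3.client("ec2", region_name=DEST_REGION)

# Copy the snapshot into another region (still in the source account).
copy = ec2.copy_snapshot(
    SourceRegion=SOURCE_REGION,
    SourceSnapshotId=SNAPSHOT_ID,
    Description="Cross-region copy for data protection",
)
new_snapshot_id = copy["SnapshotId"]

# Share the copy with the central backup account, which can then make its
# own private copy so the original account can't delete it.
ec2.modify_snapshot_attribute(
    SnapshotId=new_snapshot_id,
    Attribute="createVolumePermission",
    OperationType="add",
    UserIds=[DEST_ACCOUNT_ID],
)
print(f"Copied {SNAPSHOT_ID} -> {new_snapshot_id}, shared with {DEST_ACCOUNT_ID}")
```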

So is it ready?

As I said earlier, an AWS datacenter is probably more resilient and secure than most datacenters.  AWS is ready for your data. But I do think there is work to be done on the data protection front.  Right now it feels a little like deja vu.  When I start to think about shell scripts and cron, I start feeling like it’s the 90s.  It’s been 17 years since I last touched hostdump.sh, the tool I wrote to automatically back up filesystems on a whole bunch of Unix systems.  I really don’t want to go back to those days.

----- Signature and Disclaimer -----

Written by W. Curtis Preston (@wcpreston). For those of you unfamiliar with my work, I've specialized in backup & recovery since 1993. I've written the O'Reilly books on backup and have worked with a number of native and commercial tools. I am now Chief Technical Architect at Druva, the leading provider of cloud-based data protection and data management tools for endpoints, infrastructure, and cloud applications. These posts reflect my own opinion and are not necessarily the opinion of my employer.