Protect data wherever it lives

It’s truer now than at any other time in history: data really can be anywhere. That means you need to be able to protect it anywhere and everywhere, and the backup architecture you choose will either enable that process or hamper it.

Data, data everywhere, and not a byte to sync

Any product should be able to back up a typical datacenter.  Install a backup server, hook up some disk or tape, install the backup client, and transfer your data.  Yes, I realize I drastically simplified the situation; remember, I did build a career around the complexities of datacenter backup.  I’m just saying that we have decades of experience with this use case, and being able to back up a datacenter should be table stakes for any decent backup product or service.

The same is mostly true of a datacenter in a colocation/hosting facility.  You can use much of the same infrastructure to back it up.  One challenge is that it may be remote from the people managing the backup, which will require a hands-off way of getting data offsite, or someone to manage things like tape.  Another challenge arises if the hosted infrastructure is not big enough to warrant its own backup infrastructure.

This is similar to another data source: the ROBO, for Remote Office/Branch Office. While some of them may have enough data to warrant their own backup infrastructure, they usually don’t warrant IT personnel. Historically this meant you either trained someone in the office to swap tapes (often at your own peril), or you hired Iron Mountain to do it for you (at significant cost). Deduplication and backup appliances have changed this for many companies, but ROBOs still plague many other companies who haven’t updated their backup infrastructure.

The truly remote site is a VM – or a bunch of VMs – running in a public cloud provider like AWS, Azure, or Google Cloud. There is no backup infrastructure there, and putting any type of traditional backup infrastructure there will be very expensive.  Cloud VMs are very inexpensive – if you’re using them part time.  If you’re running them 24×7 like a typical backup server, they’re going to be very expensive indeed. This means that the cloud falls into a special category of truly remote office without backup infrastructure or personnel.  You have to have an automated remote backup system to handle this data source.

Even more remote than a public cloud VM is a public cloud SaaS app.  With a SaaS app you don’t even have the option of running an expensive VM to back up your infrastructure.  You are forced to interact with whatever APIs the provider offers for this purpose, which means you must be able to protect this data over the Internet.

Finally, there are end user devices: laptops, desktops, tablets, and phones. There is no getting around the fact that most people do the bulk of their work on such devices, and it’s also pretty easy to argue that they’re creating data they store on their laptops.  Some companies handle this problem by converting to cloud apps and telling users to do all their work in the cloud, but my experience is that most people still use desktop apps for some of their work.  Even if they’re using the cloud to store the record of authority, they’re probably going to have a locally cached copy that they work on.  And since there’s nothing forcing them to sync it online, it can often be days or weeks ahead of the protected version stored in the cloud.  This is why it’s still a good idea to protect these systems. Mobile devices are primarily data consumption devices, but they may still create some data; if it’s corporate data, it needs to be protected as well.

All things to all bytes

The key to backing up from anywhere is to reduce as much as possible the number of bytes that must be transferred to get the job done, because many of these backups will be done over slow or expensive connections. The first way to do this is to perform a block-level incremental backup, which transfers only the bytes that have changed since the last backup.  Once we’ve reduced the backup image to just the changed bytes, those bytes should be checked against other clients to see if they have the same changed bytes — before the data is sent across the network.  For example, if you’re backing up Windows systems, you should only have to back up the latest Windows patches once.

The only way to do this is source deduplication, also known as client-side deduplication. Source dedupe is done at the backup client before any data is transferred across the network. It does not require any local hardware, appliance, or virtual appliance/VM to work.  In fact, the appliance or system to which a given system is backing up can be completely on the other side of the Internet.
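To make the mechanism concrete, here’s a minimal sketch of source-side dedupe in Python. The chunk size, hashing scheme, and shared index are illustrative assumptions, not any vendor’s actual implementation:

```python
import hashlib

CHUNK_SIZE = 4 * 1024 * 1024  # assumed 4 MiB chunks; real products vary


def chunks(data, size=CHUNK_SIZE):
    """Split a backup image into fixed-size chunks."""
    for i in range(0, len(data), size):
        yield data[i:i + size]


def backup(data, known_hashes):
    """Hash each chunk on the client; transfer only chunks the
    backup service has never seen, then record them in the index."""
    new_chunks = []
    for chunk in chunks(data):
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in known_hashes:  # dedupe check happens client-side
            known_hashes.add(digest)
            new_chunks.append((digest, chunk))
    return new_chunks


# Two clients with largely identical data (think: the same Windows patch):
index = set()
sent_a = backup(b"patch" * 1_000_000 + b"unique-to-A", index)
sent_b = backup(b"patch" * 1_000_000 + b"unique-to-B", index)
# Client B transfers far less, because the shared chunks were already
# in the index before any data crossed the network.
```

The index lookup happens before transfer, which is the whole point: bandwidth is spent only on chunks no client has sent before.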

In my opinion, source-side dedupe is the way backup always should have been done.  We just didn’t have the technology. It saves bandwidth, it increases the speed of most backups, and it makes the impossible (like backing up a server across the Internet) possible.

You can back up some of the previously mentioned data sources with target dedupe (where you put a dedupe appliance close to the data and it does the deduping), but it can’t handle all of them.  Target dedupe also comes at a significant cost, as it means you have to install an appliance or virtual appliance at every location you plan on backing up. This means an appliance in every remote datacenter, even if it only has a few dozen gigabytes of data; a virtual appliance (or more) in every cloud; an appliance in every colo – and mobile data gets left out in the cold.  Source dedupe is cheaper and scales farther out to the edge than target dedupe – without the extra cost of appliances in every location.

----- Signature and Disclaimer -----

Written by W. Curtis Preston (@wcpreston). For those of you unfamiliar with my work, I've specialized in backup & recovery since 1993. I've written the O'Reilly books on backup and have worked with a number of native and commercial tools. I am now Chief Technical Architect at Druva, the leading provider of cloud-based data protection and data management tools for endpoints, infrastructure, and cloud applications. These posts reflect my own opinion and are not necessarily the opinion of my employer.

Someone else driving your car doesn’t make it an Uber

Imagine if you had to lease a car in order to use Uber. That’s the logic behind how many infrastructure software vendors sell their “service” in the cloud. These “services” come in a couple of different flavors, but both of them come down to the same thing: infrastructure set aside for your exclusive use – which means you pay for it.

Private Cloud vendors

There are a lot of “XaaS” vendors that will create a dedicated system for you in their cloud, manage it for you, and then charge for it as a service. It is definitely “Something as a Service,” as the management of it, including OS and application upgrades, hardware upgrades, and reporting, is all handled for you.  The key to this idea is that you get one bill that includes compute, storage, networking, and management costs.

This is definitely an improvement over managing your own system – from a level of effort perspective.  You don’t have to manage anything but the relationship with the vendor. Depending on your circumstances, you can even make the argument that the vendor in question is better at providing service X than you are. This is especially true of “forgotten” apps like data protection that don’t get the attention they deserve.  You could argue that using a private cloud vendor is better for your data protection than doing it yourself.

What you can’t argue is that it’s less expensive.  There are very few economies of scale in this model. Someone is still paying for servers, storage, networking, and the personnel to manage them, then marking those things up and passing the cost to you.  There is no way this is cheaper than doing it yourself.

In addition, vendors who use the private cloud model don’t come with the same security advantages as those using established public cloud vendors. I know of one vendor that sells its services exclusively via a huge network of MSPs, each of which has a completely different level of capabilities, redundancy, and security practices.  Using a private cloud model requires a customer to look very closely at the provider’s infrastructure.

Hosted Software Vendors

Suppose you want to use the public cloud for economies of scale, for an enhanced security model compared to private cloud vendors, or maybe because someone higher up simply said you needed to start using the public cloud. There are a number of infrastructure vendors that will run their software in VMs in the public cloud, and then offer you a service-type agreement for that software.

Now you are paying two bills: the “service” bill to the infrastructure software vendor, and the cloud provider bill for the compute, storage, and networking services required by this infrastructure vendor. Often in this model, the only service is that the vendor is selling you their software as a subscription.  But the moniker “as a Subscription” doesn’t sound as good as “as a Service,” so they still call this a service.

The problem with this model is that you aren’t getting any of the benefits of the cloud. Typical benefits of the cloud include partial utilization, cloud native services, automated provisioning, and paying only for what you use.  But you’re getting none of those in this model.

Infrastructure products – especially data protection products – are designed around using servers 24×7.  A backup server that isn’t performing any backups is still running 24×7, in case any backup clients request a backup.  That means those VMs you’re running the software on in the cloud have to run 24×7 – so much for partial utilization. A 24×7 cloud VM is very expensive indeed.

Such products are also written to use traditional infrastructure services, like filesystems, block devices, and SQL databases. They don’t know how to use services like S3 and the NoSQL databases available in the cloud.  In the case of backup software, they might know how to archive to S3 or Glacier, but they don’t know how to store the main set of backups there.

Such products also require manual scaling efforts when your capacity needs grow. You have to configure more VMs, configure the software to run on those VMs, and adjust your licensing as appropriate. You’re not able to take advantage of the automated scaling the public cloud offers.

Finally, because you have to provision things in advance, you are often paying for infrastructure before you need it. If you know you’re going to run out of compute, you have to configure a new VM before you do. As you start using that VM, a good portion of it is completely unused. The same is true of filesystems and block storage, especially with backup systems. If your backups and metadata are stored on a filesystem or block storage, you have to manually configure additional capacity before you need it.  This means you’re paying for it before you need it. If the product could automatically make compute available only when you needed it, and use S3 for its storage, you would only pay for compute and storage as you consume it.
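A back-of-the-envelope illustration of that difference. All of the rates below are made-up assumptions for the sake of the arithmetic, not any provider’s actual pricing:

```python
# Assumed rates -- illustrative only, not real cloud pricing.
VM_HOURLY = 0.20          # cost of a provisioned VM, per hour
OBJECT_GB_MONTH = 0.023   # object storage, per GB actually stored per month
BLOCK_GB_MONTH = 0.10     # block storage, per GB *provisioned* per month

HOURS_IN_MONTH = 730

# A backup server that must run 24x7, vs compute spun up ~2 hours per night:
always_on = VM_HOURLY * HOURS_IN_MONTH
on_demand = VM_HOURLY * 2 * 30

# 10 TB of block storage provisioned up front, vs 6 TB actually
# consumed so far as objects:
block_cost = BLOCK_GB_MONTH * 10_000
object_cost = OBJECT_GB_MONTH * 6_000

print(f"compute: ${always_on:.0f}/mo always-on vs ${on_demand:.0f}/mo on demand")
print(f"storage: ${block_cost:.0f}/mo provisioned block vs ${object_cost:.0f}/mo consumed object")
```

Even with generous assumptions, the pre-provisioned model pays for capacity and compute that sit idle most of the time.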

Don’t lease a car to take an Uber

See what I mean? In both of these models, you are leasing a car so you can take an Uber.  In the private cloud model, the cost of leasing the system is built into the price of the service, but you’re still paying for that infrastructure 24×7, since it is dedicated to you.  In the public cloud model, you’re paying for the service and you’re leasing the infrastructure 24×7 – even though the service isn’t using the infrastructure 24×7.  Examples of infrastructure products that work like this are Hosted Exchange, SAP Cloud and almost every BaaS/DRaaS/DMaaS vendor.

If you’re going to use the public cloud effectively, you need partial utilization, automated provisioning, and pay-only-for-what-you-use pricing. A true cloud-native product, such as Salesforce.com, Office365, G-Suite, or the Druva Cloud Platform, offers all of those things.  Don’t lease a car to take an Uber.

I don’t often directly push my employer’s products, but it’s World Backup Day tomorrow so I’m making an exception.  Celebrate it by checking out my employer’s announcement of the Druva Cloud Platform, the only cloud-native data management solution.  It can protect data centers, laptops, mobile devices, SaaS apps like Office 365, G-Suite, and Salesforce.com, and workloads running in the cloud – all while you gain all of the benefits of the cloud, including partial utilization, automated provisioning, and full use of cloud-native tools like S3 and DynamoDB.


GDPR Primer #2: What is personal data?

Last week I wrote the first of what will probably be a few articles about GDPR, the EU’s General Data Protection Regulation.  It governs the protection of “personal data” that your company is storing from EU citizens living in the EU.  (They must be EU citizens, and they must be currently living in the EU for the regulation to apply.)

Note: This article is one in a series about GDPR.

As mentioned in my last article, US companies are subject to the regulation if they have personal data from EU citizens. Nexus or a physical presence is not required, only that you have data from people living there.

Is Personal Data the same as PII?

In the US we have a term we like to use called Personally Identifiable Information (PII), which includes certain data types that can be used to identify a person.  Examples include social security numbers, birthdays, names, employers, physical addresses, and phone numbers.  It’s usually the combination of two data elements that makes something PII; for example, knowing someone’s name and their birthday puts you one data point away from being able to steal their identity.  All you need is the social security number and you’re off to the races.

Personal Data, as defined by the GDPR, includes what we call PII, but it includes “any information relating to an identified or identifiable natural person (‘data subject’); an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person.”  This is interpreted so far to include things like IP addresses, social media profiles, email addresses, and other types of data that we don’t think of as PII in the US.

Someone filling out a basic marketing form on your website has submitted what the GDPR considers personal data to your company. If there’s enough for the person to be identified in any way – which a marketing form would most certainly have – then it’s considered personal data as far as GDPR is concerned.

GDPR Is coming

GDPR goes into effect May 25, 2018.  If you haven’t talked to your backup company about it, it’s time to start having that conversation.


Is your data protection company worried about GDPR? They should be.

If you haven’t looked into how your data protection vendors are preparing for the General Data Protection Regulation (GDPR), you’re already behind the power curve.  It goes into effect May 25, 2018. Hopefully this article can get you up to speed so you can start pressuring your vendors about how they are going to help you comply with this incredibly big regulation.

Note: This article is one in a series about GDPR.

Disclaimer: I’m not a lawyer or a GDPR expert. My goal with this blog is to get you thinking and maybe scare you a little bit.  Nothing in this blog should be construed as legal advice.

Disclaimer #2: There is no such thing as a GDPR-compliant product, and definitely no such thing as GDPR-certified. A product can help you comply with GDPR.  A product can say “we are able to help you comply with articles 15 and 17,” but a product alone will not make you GDPR compliant.  And there is no certification body to provide a GDPR certification. Anyone who says that is making it up.

US companies must comply with GDPR if

Although this is a European Union (EU) regulation, you are subject to it if you are storing personally identifiable information (PII) (referred to by GDPR as “personal data”) about European citizens (referred to by GDPR as “data subjects”) living within the EU. Where your company is headquartered is irrelevant.

A business transaction is not required.  A marketing survey targeting EU residents appears sufficient to require your company to comply with GDPR.  An EU resident (who was not targeted specifically) filling out a form on your US website that does not have an EU domain might not trigger GDPR protection for that person.  My non-legal advice is that you should look into how you’re preparing for the requirements.

Not complying with GDPR can cost you dearly

Companies not complying with the data privacy aspects of GDPR can be fined 4% of annual revenue, or 20 million Euros, whichever is greater.  It hasn’t gone into effect yet, and no one has been fined yet, so we don’t yet know just how tough the courts are going to be. But that’s what the regulation allows.

How does GDPR affect data protection?

There are several aspects to GDPR protection, but only a few of them affect your data protection system. For example, there is a requirement to gain consent before storing personal data. That responsibility falls way outside the data protection system. But let’s look at some parts that many systems are going to have a really hard time with.

GDPR has articles that talk about general data security, but I think any modern backup system should be able to comply with those articles. The things about GDPR that I think data protection people will struggle with are articles 15, 16 and 17: the right to data access by the subject, the right to correction, and the right to erasure (AKA “right to be forgotten”).

Article 15: Right to data access by subject

If you have data on a data subject (i.e. EU citizen), and assuming that data is subject to GDPR, the subject has a right to see that data. This includes any and all data stored on primary storage, snapshots, backup storage, and archives.  Try to think about how you would comply with that request today and you see where my concern is. Archive software might be ready for this, but most backup systems are incapable of delivering information in this manner.

Article 16: Right to correction

A data subject has the right to have incorrect data corrected. This may not seem to affect the backup and archive systems directly, but depending on your architecture, it might.

Article 17, Right to erasure (AKA “the right to be forgotten”)

This one is the one that truly scares me as a data protection specialist.  If a company cannot prove they have a legitimate business reason for continued storage of a particular data subject’s personal data, the data subject has the right to have it deleted. And that means all of it.

As previously mentioned, we don’t have any case law on this yet, and we don’t yet know the degree to which the EU courts will order a company to delete someone’s data from backups and archives. But this is the article that has me the most worried.

Update: 05/29: I’ve changed a bit in how I think about this.  Make sure to check out this blog post and this one about this topic.

I told you so

The customers that are in real trouble are those that use their backup systems as archive systems, since most backup systems are simply incapable of doing these things. They will be completely incapable of complying with Articles 15-17 of GDPR.

I’ve been telling customers for years not to use their backup system as an archive system. If you are one of the ones who listened to me, and your long-term data is stored in an archive system, you’re pretty much ready for GDPR.  A good archive should be able to satisfy these requirements.

But if you’ve got data from five years ago sitting on a dedupe appliance or backup tapes, you could be in a world of hurt. There simply isn’t a way to collect in one place all the data on a given subject, and there’s definitely no way to delete it from your backups. Each record is a tiny record inside a filesystem backup stored in some kind of blob, such as a tar file or the equivalent for your backup system.
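To see why record-level erasure is so painful for tar-style backup blobs, here’s a small Python sketch (the file names and archive layout are hypothetical). There is no in-place delete: “forgetting” one subject means rewriting the entire archive without that member – and doing so for every copy of every backup:

```python
import io
import tarfile


def make_archive(files):
    """Build a tar blob in memory, standing in for a backup image."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tf:
        for name, data in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tf.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def forget_subject(archive_bytes, member_name):
    """'Erase' one record by rewriting everything except that member."""
    src = tarfile.open(fileobj=io.BytesIO(archive_bytes))
    out = io.BytesIO()
    with tarfile.open(fileobj=out, mode="w") as dst:
        for member in src.getmembers():
            if member.name == member_name:
                continue  # the record to be forgotten
            dst.addfile(member, src.extractfile(member))
    return out.getvalue()


backup_blob = make_archive({
    "crm/alice.json": b'{"name": "Alice"}',
    "crm/bob.json": b'{"name": "Bob"}',
})
cleaned = forget_subject(backup_blob, "crm/alice.json")
remaining = tarfile.open(fileobj=io.BytesIO(cleaned)).getnames()
```

Rewriting a whole archive to drop one record is tolerable for a toy example in memory; across years of deduped or tape-resident backups, it’s effectively impossible.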

What are your vendors saying?

Has anyone had any conversations about this with their data protection vendors?  What are they saying?  I’d really love to hear your stories.


How to make the cloud cheaper (or more expensive)

Depending on how you do it, the cloud can be much less expensive than using on-premises systems.  But it can also be much more expensive. It all depends on how you use the public cloud.

The expensive way: 24×7 VMs and pre-provisioned block storage

Running one or more VMs in the cloud 24×7 (like you would in a datacenter) is a great way to drive up your costs. It’s generally going to be more expensive than running the same VMs in house (if you’re running them 24×7).  It’s difficult to come up with the incremental cost of an individual VM, as this article attests. But generally speaking, you should be able to run a VM onsite for less than the cost of running that same VM in the cloud. It makes sense; it’s called markup.

Storage can also be more expensive in the cloud for the same reasons.  If you’re provisioning large chunks of block storage (e.g. EBS in AWS) before you actually consume it, your costs are going to be higher than if you pay for storage only as you use it – which is really only possible with object storage.

It’s also important to note that moving a VM to the cloud doesn’t get rid of all the typical system administration tasks associated with said VM. The OS still needs updating; the applications still need updating.  Sure, you don’t have to worry about swapping out the power supply, but most people let a vendor do that part anyway. But it’s important to understand that moving a VM to the cloud doesn’t make it magically start caring for itself.

The cheap way: Dynamically allocated VMs and object storage

In the public cloud, your costs are directly tied to how much storage, network and compute you use. That means that if you have an application that can dynamically scale up and down its use of cloud resources, you might be able to save money in the cloud, even if the per-hour costs are higher than those you would have onsite. This is because, generally speaking, you don’t save money in the datacenter by turning off a VM. The resources attached to that VM are still there, so your costs don’t go down. But if you have an app that can reduce its compute resources – especially to the point of turning off VMs – you can save a lot of money.

The same holds true for storage. If you are using object storage instead of block storage, you pay only for what you use as you use it.  As backups expire and objects are deleted from the object store, your costs decrease.  This is very different from how pre-provisioned block storage behaves, where deleting files doesn’t save you money.

Use the cloud the way it’s meant to be used

If your backup software is just running in 24×7 VMs in the cloud, and if it requires you to provision block storage for those VMs, then it’s using the cloud in a way that cloud experts generally agree drives up costs without adding much value.

Your costs will go up while your manageability stays the same. You’re still dealing with an OS and applications that need to be updated the same way they would be onsite. You still have to increase or decrease your software or storage licenses as your needs change.


Another example of why we do backups

There’s a story going around about Apple’s MacOS APFS sparse disk images occasionally losing their mind and throwing documents out the window.  Yet another example of why we do backups.

Don’t use APFS images for backups

I’ve never liked these disk images that Apple makes – as a backup method.  This is just another example of why.  For those unfamiliar with them, they’re like a fancy .ISO image: one big file that you can mount as a filesystem. The “sparse” part is what the industry would call a thin-provisioned version of this image.  That is, you tell it how big it’s allowed to grow, but it will only consume the amount of space that is actually put into the image.

The problem that was recently discovered is that if the APFS sparse image runs out of virtual space, it will just keep writing files like nothing’s wrong.  Even worse, the files will appear to have been copied, because they’ll be in RAM.  Unmount the disk image and remount it, and you’ll find that the files were never copied.  Surely Apple needs to fix this.

The one place you’ll see such a disk image is if you buy a Time Capsule, Apple’s Time Machine backup appliance.  I’m not sure why, but Apple chose to do it this way instead of just mirroring the filesystem, the way Time Machine does on a local machine.  I’m sure they had their reasons, but this is where you’ll see disk images.  (Actually, I haven’t looked into the details of Time Capsules in a while, so this could have changed.  But I can’t think of any other place where you’d see such a beast.)

I’ve never been a fan

Nine years ago I wrote an article about how I wasn’t a huge fan of Time Machine, and how I really didn’t like Time Capsules because of their disk images — and how they can get corrupted. Time Machine is nice for upgrades or a local copy, but I don’t think you should rely on it as your only backup.

This is why we do real backups.  Real backups are scheduled and happen all the time without you having to do anything. Their data is stored somewhere else, which today typically means the cloud. I simply can’t think of another viable way to back up mobile users and home users.


Protect your backups from ransomware

Don’t get your backup advice from security people. That’s how I felt reading what started out as a really good article about protecting your systems from ransomware.  It was all great until he started talking about how to configure your backup system.  He had no idea what he was talking about. But now I’m going to give you security advice about your backup system, so take it with a grain of salt.

Windows backup servers are risky

Windows-based backup products that store data in a directory are a huge security risk. In fact, many customers of such products have already reported my worst fears: their backups were encrypted with the same ransomware that infected their servers.

This isn’t an anti-Windows rant, or an anti-BackupProductX rant. It’s simply acknowledging the elephant in the room.

  1. If your backup server is accessible via the same network your computers are on, it can be attacked via the same things that attack your computers.
  2. If your backup server runs the same OS as your computers – especially if it’s the OS that most ransomware attacks happen on (Windows) – it can be infected with the same ransomware.
  3. If your backups are stored in a directory (as opposed to on a tape drive, in an S3 object, or on a smart appliance not accessible via SMB/NFS), they can be encrypted if your backup server is infected.
  4. If your backups are stored on a network mount via NFS/SMB, you’re giving the ransomware even more ways to attack you.

What should you do?

I don’t want to be guilty of doing what the security guy did, so I’ll say this: research what you can do to protect your systems from ransomware. But I’ll do my best to give some general advice.

The best advice I’ve read is to keep up to date on patches and to disable Remote Desktop on Windows.  There are also default SMB shares in Windows that should be disabled.

You can also make sure that your backups aren’t just stored in a directory. Unfortunately, that’s the default setup for most inexpensive backup software products. Investigate whether the software you’re using supports another way to store backups.  If not, it’s time to think about a different product.

The same goes for those currently storing backups on an NFS/SMB share. Investigate whether your backup software can store backups on that device without using NFS/SMB. If not, make sure you lock down that share as much as you can – and, again, start thinking about another backup product.

Consider a cloud data protection service

A true cloud-based data protection service might be the best way to do this.  In a true cloud-based system, you never see the backup servers. You don’t know what they are and never log in to them. You log in to a web-based portal, and the actual servers that make this happen are completely invisible to you.  (Similar to the way the servers that make salesforce.com happen are invisible to you.)

If your backup servers are invisible to you, they’re invisible to your attackers. If there’s no way to directly access your backups – unless you’ve specifically set up such access for a recovery or DR test – then ransomware can’t get to those backups either.

It should go without saying that this recommendation does not apply if your “cloud” data protection vendor is just putting backup software on VMs that you manage in the cloud – what many have dubbed “cloud washing.” If your backup servers are visible to you as VMs in the cloud, they’re just as much of a risk as they would be in your datacenter. It’s one of the reasons why these cloud-washing vendors aren’t really giving you the full benefit of the cloud if all they’re doing is putting VMs up there.

----- Signature and Disclaimer -----

Written by W. Curtis Preston (@wcpreston). For those of you unfamiliar with my work, I've specialized in backup & recovery since 1993. I've written the O'Reilly books on backup and have worked with a number of native and commercial tools. I am now Chief Technical Architect at Druva, the leading provider of cloud-based data protection and data management tools for endpoints, infrastructure, and cloud applications. These posts reflect my own opinion and are not necessarily the opinion of my employer.

Time to fire the man in the van

The man in the van can lose your tapes.  Any questions?

It’s the man, not the mountain

Yes, Iron Mountain has had many very public incidents of losing tapes. You can do a Google search for “Iron Mountain loses tapes” to see what I’m talking about.  When all these stories started hitting the news back in 2005 (thanks to California’s new law requiring you to report such things), Iron Mountain’s official response was, “Iron Mountain performs upwards of five million pickups and deliveries of backup tapes each year, with greater than 99.999% reliability. Nevertheless, since the beginning of the year, four events of human error at Iron Mountain resulted in the loss of a customer’s computer backup tapes. While four losses is not a large number in comparison to an annual rate of five million transportation events, any loss is important to customers and to Iron Mountain … Iron Mountain is advising its customers that current, commonly used disaster recovery processes do not address increased requirements for protecting personal information from inadvertent disclosure.”

The tape vaulting company I used back in the day lost one or two of our tapes a year.  We gave them about 50 tapes a day and retrieved about 50 more.  We tracked each individual tape, and were linked into their system to show when the tapes made it into the vault.  Every once in a while, there would be a discrepancy where one of the tapes would not show up in the vault.  This resulted in a search, and inevitably the tape would be found somewhere along the way.  Good times.

I remember one vaulting customer that received a box of tapes that weren’t theirs.  When they called their rep, he had them read the bar codes off the tapes.  The vaulting company couldn’t figure out whose tapes they were, so they told the customer to keep them!

As long as media vaulting companies employ humans to be the “man” in the van, this problem will continue.  Humans do dumb things.  Humans make mistakes. So until these companies start hiring robots to pick up and deliver tapes, we will continue to see these problems.  However, I think much of the world will have moved to electronic vaulting by then.

I’ve always liked electronic vaulting

If you’re not going to use tapes to get your data offsite, you can use electronic vaulting.  This can be accomplished via a few different methods.

Onsite & Offsite Target Dedupe Appliance

There are a number of vendors that will be happy to sell you an appliance that will dedupe any backups you send to it. Those deduped backups are then replicated to another dedupe appliance offsite. This has been the primary model for the last 15 years or so to accomplish electronic vaulting. The problem is that these appliances are very expensive, and you have to buy two of them – as well as power, cool, and maintain them. It’s the most expensive of the three options mentioned here.

Source dedupe to offsite appliance

It makes more sense to buy backup software that will dedupe the data before it’s sent to an appliance. This appliance can be offsite, so that data is immediately sent offsite.  It can even be a virtual appliance running as a VM in the cloud.  Most people exploring this option opt for an onsite copy that replicates to the offsite appliance or VM.  Most vendors selling this type of solution tend to want to charge you for both copies.

Source dedupe to a cloud service

The third option is backing up to a true cloud service (not just backup software running in some VMs in the cloud) that dedupes data before it is sent to the cloud. Vendors that use this model tend to charge you only for the cloud copy. If they support a local appliance for quick recoveries, they tend not to charge for that copy. That makes this option the least expensive of the three.

Fire the man, get a plan

Wow, I like that!  There are a number of ways you can now have onsite and offsite backups without ever touching a tape or talking to a man in the van down by the river.  Look into them and join the new millennium.


The Perils of Hardware

No one likes hardware; they only like what they can do with it.  And I say this as a geek who has built plenty of PCs in my house, including a Hackintosh.  What kind of sick weirdo builds their own Mac?  Well, you know what? That Hackintosh illustrates the perils of hardware in three ways.

Hardware gets marked up

The first peril of hardware is why I did this: Apple’s crazy markup on hardware. Why did I go through the difficulty of finding and buying components that were compatible with MacOS?  Why did I go through the rigmarole necessary to fool the MacOS installer into installing on something that wasn’t a real Mac?

I wanted to run MacOS on a machine powerful enough to run Adobe Premiere Pro well, and the Mac Pro I wanted was something like $4,000 to $5,000.  But I could build a Hackintosh for around $1,500, so I did.

This is why storage customers revolted against traditional proprietary storage vendors in favor of software-defined startups that allowed them to use off-the-shelf hardware that wasn’t ridiculously marked up.  People started realizing that hardware is hardware, and rarely is hardware special enough to warrant a huge markup.

Hardware must be maintained

Hardware breaks.  Power supplies die, disks stop spinning, and fans stop blowing. This is why every production piece of hardware typically comes with a service agreement specifying how quickly the vendor should respond when a problem occurs.

At no time has this peril been more acute than in the last few weeks. The spectre of the Spectre and Meltdown vulnerabilities is wreaking havoc in hardware land. First Intel came out with a new microcode version to address the vulnerabilities; then Microsoft, Red Hat, and other OS vendors came out with patches.  Then people who installed them started seeing spontaneous reboots. So they all started pulling their patches, and Microsoft even released an out-of-band update that disabled the microcode patches if you had installed them.  It’s been a tough couple of weeks for those who must maintain hardware.

Meanwhile, customers who are using services like Salesforce.com, Office 365, Gmail, and yes, the Druva Cloud Platform, didn’t have to worry about maintaining the hardware underneath those systems. The service providers had plenty of work to do, for sure. The cloud is not magic. There is no such thing as the cloud; it’s only someone else’s datacenter. But people who were using true cloud services simply didn’t have to worry about maintaining the hardware behind the services they were using.

This brings me to the companies in the data protection space that have now certified that their product runs in AWS. Yes, this allows them to say that they work “in the cloud.” But it’s important to distinguish this from a cloud service offering, where hardware is not your problem. Customers of such backup solutions “running in the cloud” are having just as many problems with their cloud backup servers as they are with their onsite servers, because even virtual hardware has to be maintained. It may be someone else’s hardware (i.e. you don’t own the server your cloud VM is running on), but you still have to maintain it.

Hardware is a capital expense

The Hackintosh I built was only $1,500, but what if it had been $100,000?  Hardware of all kinds requires a significant capital outlay.  Maybe you can finance it, or maybe you need to come up with the actual cash to buy it outright.  Either way, it’s going to stay on your books for years.

Capital expenses can be really difficult to get approved. I remember working at a place where every single item over $1,000 was a capital expense, and getting capital expenses approved took months – even years.  I remember doing all sorts of things to work around that issue.

Real hardware also physically exists.  If you bought it for a project that changed direction, you’re stuck with that hardware.  If your project needs faster hardware, you have to upgrade – leaving the old hardware in the dust (literally). This is perhaps the most compelling thing about moving apps to the cloud: if you change your mind, you just delete the VM.

The hardware isn’t important – the service is

This brings me full circle. The hardware isn’t what’s important; the service is. Consider my opening story of the Hackintosh. My need was to edit video. The solution to that need was Adobe Premiere Pro – which I already owned.  But I owned the MacOS version, so I needed a Mac.  I couldn’t afford a Mac Pro, so I built one. (The Hackintosh I built is still running fine, BTW.)

But what if I was able to find a cloud service to do my video editing? Yes, I realize there are rules of physics that might get in my way, since raw video can be huge. But just work with me.  What if I could meet all of my business needs with a service that runs in the cloud?

Would I need the Mac?  Would I need the Hackintosh? Would I need Premiere Pro? No, I wouldn’t.  A Chromebook would probably do just fine.

But if I went to Apple and told them my business requirements, their answer would most certainly be a Mac Pro. That’s what happens when you ask a hardware vendor to help solve your problem. It’s like walking into a hardware store and telling the clerk you need a place to live. The first thing they’re going to do is sell you a hammer, nails, and wood.  Because that’s what they sell.

Why would you want hardware?

This entire blog post was inspired by another blog post by a blogger and writer I respect. The title also started with “The Perils of…” He used the hammer analogy, too. He suggested that you shouldn’t go to vendors who just sell “backup,” as there is an entire continuum of data protection requirements not met by that term.  I agree with that part.  The days of backup only are over.

But then he suggested that his company, a very large hardware and software vendor, was the right way to go because they sell all types of solutions. That’s where I’m going to have to disagree, because almost all of their solutions are just more hardware and software.  Hardware and software get marked up.  They have to be maintained. And hardware is a large capital investment.  Why would you want any of that if you could meet your data protection needs with a service where none of it is an issue?  Just a thought.


Instant recovery & dedupe are not friends

Instant recovery is the modern-day equivalent of what we used to call a hot site, as it allows you to recover immediately after some type of incident. I have personally advocated for this concept, as I strongly believe that in a true disaster (or ransomware event), time is of the essence.

As mentioned in my previous article, one company’s lack of an instant recovery system caused them to pay the ransom when they were infected with ransomware. They said recovering their entire datacenter using their backup system would have taken several days, and paying the ransom would cost less than several days of downtime. I explain in the article why I completely disagree with this reasoning, but I understand I have the luxury of Monday morning quarterbacking.

The key to being able to easily recover from a large disaster or ransomware attack is to be able to instantly spin up your entire datacenter in a hot site or an instant recovery system. This allows you to take your time addressing the cause of the incident, such as identifying and removing the ransomware itself, putting out actual fires, or replacing hardware damaged in the incident. If you can run your entire environment in a public or private cloud, you can continue your business – almost without interruption – regardless of how bad the incident is.

Dedupe is not instant recovery’s friend

Instant recovery is great, as it allows many to recover much more quickly than they ever could before. Deduplication is also great, as it is the technology that enables so many wonderful things: disk-based backup and recovery, offsite replication of backups without human intervention, and significant reductions in bandwidth usage. It’s the marriage of deduplication and instant recovery that usually doesn’t work.

Deduplication systems are very good at many things, but usually are not very good at random reads and writes. Just ask anyone who has attempted to run one or more VMs using their deduplicated backup data as the datastore. The performance might be enough to handle a single server that doesn’t require a lot of random I/O, but running several servers or an entire datacenter simply isn’t possible from a deduplicated datastore.
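To see why, consider a toy content-addressed dedupe store: writes collapse identical chunks beautifully, but every read has to consult a chunk index and gather scattered pieces, where a native-format image would be one contiguous read. This is a simplified sketch, not how any real appliance works internally:

```python
import hashlib

CHUNK = 4096  # fixed-size chunking keeps the toy simple

class DedupeStore:
    """Toy content-addressed store: identical chunks are stored once."""
    def __init__(self):
        self.chunks = {}   # sha256 hex digest -> chunk bytes
        self.lookups = 0   # index lookups, a rough proxy for random I/O

    def write(self, data):
        """Split data into chunks; return the 'recipe' of chunk hashes."""
        recipe = []
        for i in range(0, len(data), CHUNK):
            piece = data[i:i + CHUNK]
            digest = hashlib.sha256(piece).hexdigest()
            self.chunks[digest] = piece   # duplicate chunks collapse here
            recipe.append(digest)
        return recipe

    def read_range(self, recipe, offset, length):
        """Every chunk touched costs an index lookup plus a scattered fetch."""
        first = offset // CHUNK
        last = (offset + length - 1) // CHUNK
        out = b""
        for idx in range(first, last + 1):
            self.lookups += 1
            out += self.chunks[recipe[idx]]
        skip = offset - first * CHUNK
        return out[skip:skip + length]

store = DedupeStore()
image = bytes(256 * 1024)            # a 256 KiB "VM disk" of zeros
recipe = store.write(image)
print("unique chunks stored:", len(store.chunks))   # 1 -- dedupe shines
for off in range(0, len(image), 8 * CHUNK):          # 8 scattered small reads
    store.read_range(recipe, off + 100, 50)
print("index lookups paid:", store.lookups)          # 8 -- every read pays
```

A running VM issues thousands of reads like those per second, which is why one VM may limp along on a deduplicated datastore while a whole datacenter cannot.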

This is why post-process deduplication backup appliances make such a big deal about their native landing zone where recent backups are stored in their native, non-deduplicated format before they are deduplicated for replication or long-term storage. They advise customers who are interested in instant recovery to turn off any backup software dedupe.  Backups are sent to disk in their full, native format and are stored that way in the landing zone until they are pushed out by newer backups. This yields much better performance if you have to run multiple VMs from your backups.

But most people using the instant recovery feature tend to be using modern backup packages that already have deduplication integrated as a core part of the product, which means they are typically performing their instant recovery from a deduplicated datastore. They should be able to recover one or two VMs at a time, but they will probably be very disappointed if they try to recover their entire datacenter.

There are other ways to do instant recovery

If you are going to use instant recovery to run your entire datacenter in a disaster, the latest copy of your VM backups needs to be in native format on storage that can support the performance that you need. There are a couple of ways of accomplishing that.

Continuous data protection (CDP) products are essentially replication with a back button. Some of these companies describe themselves as a TiVo for your backups. They store your backups in native format, and also store the bits needed to be able to change portions of the latest version in order to move it back in time. (A good example of such a product would be Zerto.)

These types of products tend to do well at disaster recovery, but not at operational recovery: they’re good at recovering an entire datacenter, usually not so good at recovering a single file.  The DR functionality of these products can be quite advanced, as it is their specialty. Another upside of this approach is that you only have to pay for one copy of your backup – plus the versioning blocks, of course. One downside is that most people also end up purchasing another product for operational recovery.
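The “back button” idea can be pictured with a toy journal: the replica stays in native format, and every write records the block’s previous contents so the image can be rewound to any point in time. This is a sketch of the concept, not any vendor’s actual format:

```python
class CDPJournal:
    """Toy CDP sketch: a native-format replica plus a journal of
    overwritten blocks -- 'replication with a back button'."""
    def __init__(self, nblocks):
        self.current = {i: b"\x00" for i in range(nblocks)}  # replica image
        self.journal = []   # (timestamp, block, previous_contents)

    def write(self, ts, block, data):
        # Save the block's previous contents before applying the write.
        self.journal.append((ts, block, self.current[block]))
        self.current[block] = data

    def image_at(self, ts):
        """Undo every journaled write newer than ts to rewind the image."""
        image = dict(self.current)
        for t, block, old in reversed(self.journal):
            if t <= ts:
                break
            image[block] = old
        return image

vol = CDPJournal(4)
vol.write(10, 0, b"A")
vol.write(20, 1, b"B")
vol.write(30, 0, b"XX")      # say ransomware scrambles block 0 at t=30
print(vol.image_at(25)[0])   # b'A' -- the pre-attack state of block 0
```

Because the rewound image is already in native format, an entire datacenter can run from it; single-file recovery, by contrast, means mounting and picking through a whole image.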

Alternatively, you can use a backup product that uses its backups to maintain a DR image stored in native format.  (Druva offers this as part of its Data Protection as a Service offering.) Instant recoveries – especially large-scale recoveries of an entire datacenter – would run from this DR image. The advantage of this approach is that you get operational recovery and disaster recovery in a single system, which is both simpler and less expensive than maintaining two systems. One disadvantage is that you will need to pay for the storage the DR copy of your data uses, equivalent to one full backup.  This cost is offset by the fact that you can do both operational recovery and disaster recovery with a single product.
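The mechanics of that approach can be sketched in a few lines (again a toy illustration, not any vendor’s actual implementation): the backups stay deduplicated, but each one also patches a full, native-format image that large-scale recoveries run from directly:

```python
class DRImage:
    """Toy sketch: each incremental backup's changed blocks are applied
    to a full, native-format DR image kept recovery-ready, so a large
    restore reads a ready-to-run image instead of rehydrating chunks."""
    def __init__(self, size):
        self.blocks = bytearray(size)   # the rehydrated, native-format image

    def apply_incremental(self, changed):
        """changed: {byte_offset: bytes} from the latest backup."""
        for off, data in changed.items():
            self.blocks[off:off + len(data)] = data

img = DRImage(16)
img.apply_incremental({0: b"boot"})   # blocks changed in the first backup
img.apply_incremental({4: b"data"})   # blocks changed in a later incremental
print(bytes(img.blocks[:8]))          # b'bootdata'
```

The storage cost is the one full image, but every backup keeps it current, so there is no lengthy rehydration step between disaster and recovery.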

Don’t pay ransoms! Get a better backup product!

As I mentioned in my previous blog post, please prepare now to be able to recover from a ransomware attack or other disaster. Investigate the DR plans of your company, as you might need to activate them for something you might not consider a disaster. Your entire datacenter may be fully functional, but you won’t be able to get to your data if it’s all encrypted.  So make sure you have a solid plan for how you would recover from this scenario, because the likelihood that this will happen to your company goes up every day.
