Protect data wherever it lives

It’s truer now than at any other time in history: data really can be anywhere, which means you need to be able to protect it anywhere and everywhere. And the backup architecture you choose will either enable that process or hamper it.

Data, data everywhere, and not a byte to sync

Any product should be able to back up a typical datacenter.  Install a backup server, hook up some disk or tape, install the backup client, and transfer your data.  Yes, I realize I drastically simplified the situation; remember, I did build a career around the complexities of datacenter backup.  I’m just saying that we have decades of experience with this use case, and being able to back up a datacenter should be table stakes for any decent backup product or service.

The same is mostly true of a datacenter in a colocation/hosting facility.  You can use much of the same infrastructure to back it up.  One challenge is that it may be remote from the people managing the backup, which will require a hands-off way of getting data offsite, or someone on site to manage things like tape.  Another challenge arises if the hosted infrastructure is not big enough to warrant its own backup infrastructure.

This is similar to another data source: the ROBO, for Remote Office/Branch Office. While some of them may have enough data to warrant their own backup infrastructure, they usually don’t warrant IT personnel. Historically this meant you either trained someone in the office to swap tapes (often at your own peril), or you hired Iron Mountain to do it for you (at significant cost). Deduplication and backup appliances have changed this for many companies, but ROBOs still plague many other companies who haven’t updated their backup infrastructure.

The truly remote site is a VM – or a bunch of VMs – running in a public cloud provider like AWS, Azure, or Google Cloud. There is no backup infrastructure there, and putting any type of traditional backup infrastructure there will be very expensive.  Cloud VMs are very inexpensive – if you’re using them part time.  If you’re running them 24×7 like a typical backup server, they’re going to be very expensive indeed; a VM that runs around the clock racks up roughly 730 billable hours a month, versus perhaps 60 for something you spin up for a couple of hours a day. This means the cloud falls into a special category of truly remote office without backup infrastructure or personnel.  You have to have an automated remote backup system to handle this data source.

Even more remote than a public cloud VM is a public cloud SaaS app.  With a SaaS app you don’t even have the option of running an expensive VM to back up your data.  You are forced to interact with whatever APIs the provider offers for this purpose, and you must be able to protect this data over the Internet.

Finally, there are end user devices: laptops, desktops, tablets, and phones. There is no getting around the fact that most people do the bulk of their work on such devices, and it’s also pretty easy to argue that they’re creating data that they store on their laptops.  Some companies handle this problem by converting to cloud apps and telling users to do all their work in the cloud, but my experience is that most people are still using desktop apps for some of their work.  Even if they’re using the cloud to store the record of authority, they probably have a locally cached copy that they work on.  And since there’s nothing forcing them to sync it online, that copy can often be days or weeks ahead of the protected version stored in the cloud.  This is why it’s still a good idea to protect these systems. Mobile devices are primarily data consumption devices, but they may still create some data; if it’s corporate data, it needs to be protected as well.

All things to all bytes

The key to backing up from anywhere is to reduce as much as possible the number of bytes that must be transferred to get the job done, because many of these backups will run over slow or expensive connections. The first way to do this is to perform a block-level incremental backup, which transfers only the blocks that have changed since the last backup.  Once we’ve reduced the backup image to just the changed blocks, those blocks should be checked against what other clients have already backed up, before the data is sent across the network.  For example, if you’re backing up Windows systems, you should only have to back up the latest Windows patches once.
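To make that concrete, here is a simplified sketch of the block-level incremental idea. This is illustrative Python, not any particular product’s code; the 4 MiB block size and the send_to_backup_server() call mentioned in the comments are placeholders, not real interfaces.

```python
import hashlib

BLOCK_SIZE = 4 * 1024 * 1024  # 4 MiB; real products choose their own block size

def block_hashes(path):
    """Record the SHA-256 hash of every fixed-size block in a file."""
    hashes = []
    with open(path, "rb") as f:
        while True:
            block = f.read(BLOCK_SIZE)
            if not block:
                break
            hashes.append(hashlib.sha256(block).hexdigest())
    return hashes

def changed_blocks(path, previous_hashes):
    """Yield (block_index, data) only for blocks that differ from the last backup."""
    with open(path, "rb") as f:
        index = 0
        while True:
            block = f.read(BLOCK_SIZE)
            if not block:
                break
            digest = hashlib.sha256(block).hexdigest()
            if index >= len(previous_hashes) or previous_hashes[index] != digest:
                yield index, block  # only these bytes need to cross the network
            index += 1

# Typical flow: keep the hash list from the last backup, then send only what changed.
# previous = block_hashes("/data/bigfile.db")            # saved with the last backup
# for index, data in changed_blocks("/data/bigfile.db", previous):
#     send_to_backup_server(index, data)                 # placeholder transfer call
```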

The only way to do this is source deduplication, also known as client-side deduplication. Source dedupe is done at the backup client before any data is transferred across the network, and it does not require any local hardware, appliance, or virtual appliance/VM to work.  In fact, the appliance or system a given client is backing up to can be on the other side of the Internet.
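Here is an equally simplified sketch of that client-side conversation, again in illustrative Python. The server object and its methods (unknown_hashes, store_chunk, store_recipe) are stand-ins for whatever protocol a real product uses, not an actual API.

```python
import hashlib

def source_dedupe_backup(chunks, server):
    """Hash locally, ask the service what it already has, send only new chunks."""
    # The full hash list is the "recipe" needed to reassemble this backup later.
    recipe = [hashlib.sha256(chunk).hexdigest() for chunk in chunks]

    # One small round trip: hashes cross the network, not the data itself.
    missing = set(server.unknown_hashes(recipe))   # stand-in for a real protocol call

    for digest, chunk in zip(recipe, chunks):
        if digest in missing:
            server.store_chunk(digest, chunk)      # stand-in: upload a never-seen chunk

    server.store_recipe(recipe)                    # stand-in: record how to rebuild the backup
```

The important part is the order of operations: hashes travel first, and a chunk crosses the network only if no client has ever sent it before.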

In my opinion, source-side dedupe is the way backup always should have been done.  We just didn’t have the technology. It saves bandwidth, it increases the speed of most backups, and it makes the impossible (like backing up a server across the Internet) possible.

You can back up some of the previously mentioned data sources with target dedupe (where you put a dedupe appliance close to the data and it does the deduping), but it can’t handle all of them.  Target dedupe also comes at a significant cost, because it means installing an appliance or virtual appliance at every location you plan on backing up: an appliance in every remote datacenter, even one with only a few dozen gigabytes of data, a virtual appliance (or more) in every cloud, an appliance in every colo – and mobile data gets left out in the cold.  Source dedupe is cheaper and scales farther out to the edge than target dedupe – without the extra cost of appliances in every location.

Written by W. Curtis Preston (@wcpreston), four-time O'Reilly author, and host of The Backup Wrap-up podcast. I am now the Technology Evangelist at Sullivan Strickler, which helps companies manage their legacy data.