Snapshots: Another GDPR challenge

In my continuing series on the challenges that the General Data Protection Regulation (GDPR) poses for backup, I thought I’d look at snapshots and the unique problem they present. They may be even more problematic than traditional backups.

Note: This article is one in a series about GDPR.  Here’s a list of articles so far: 

In these previous posts I have defined what the GDPR is and how it applies to your company.  I’ve also discussed whether or not backups are included when someone asks to be “forgotten” via a “right to be forgotten” request in the GDPR.  As I discussed here, here, and here, I do not believe that companies are going to be able to delete such data from their backup systems, nor do I think that the GDPR is going to require them to do it.  (But we just don’t know for sure until the ICO clarifies their position.)

The idea is two-fold.  The first part is that the backups aren’t being used to support any current decisions, nor are they accessible via standard IT systems and queries. The second part is that it’s simply not possible today to delete data from a backup like that.

But what about snapshots?

Someone asked about this on twitter.  Snapshots are visible to regular IT systems and could be used to support current decisions.  For example, NetApp snapshots appear under the ~snapshot subdirectory of the volume they are protecting.  They may not be in the normal path of things, but a user could easily search and access them.  It’s kind of the point of how snapshots work.

But guess what? Snapshots are read-only by design. You don’t want people to be able to delete data from your snapshots if you’re using them for backup.  But since they’re accessible via a standard IT process, are they now considered primary data?

Out of curiosity, I reviewed the NetApp whitepaper on how they handle this issue, and it was unclear when it got to the part of actually forgetting the data. It mentioned that you couldn’t delete something if you didn’t know where it was, but it didn’t really go into how you would selectively delete something from a snapshot once you found it.

I’m not picking on NetApp here. I’ve always been a fan.  I’m simply saying that – like backups – selectively deleting data from snapshots goes against their nature. And I’m pointing out that because they are accessible as regular IT data, they might not get the pass that I believe backups will get.

What is your plan for snapshots?

Have you discovered GDPR RTBF and how it relates to snapshots at your company? Has your storage vendor given you any guidance as to how to solve this problem?  Is there a GDPR “back door” that you can selectively use to delete data from a snapshot? Do you want to use it, considering you could corrupt the thing you are using for backup?

I’d really love to hear from you on this.


More thoughts on the GDPR & backups

I’m doubling down on my opinion that the GDPR is not going to be able to force companies to “forget” people in their backups – especially personal data found inside an RDBMS or spreadsheet.  Let me explain.

Disclaimer: I’m not a lawyer. I’m a data protection expert, not a legal expert. I’ve read the GDPR quite a bit and have consulted a lawyer about my thoughts, but this should not be considered a legal opinion.  Just thinking out loud here.

Personal data will be inside something

If a company is storing personal data on employees, customers, or prospects, that data will not be inside a discrete file for that customer.  For example, Amazon does not have a single Excel spreadsheet or database called “Curtis Preston’s data.” There is a customer database, purchasing database, browsing database, etc, and my data is many rows within these databases.

This is the way it is at all companies around the world. Personal data is stored inside files and databases that also store other personal data. It’s built into most filesystems to be able to search the content of all files, and it’s definitely built into RDBMSs to search their content. Therefore, complying with a GDPR data portability request or an RTBF request against online data should be relatively straightforward.  Difficult, yes, but it’s simply a matter of identifying where personal data is stored and providing a process for searching it and expunging it if requested.  Backups, however, are a whole different thing.

Backups just store what they’re given

This is important to understand, especially when we are talking database backups. With few exceptions, backup products are handed an object and some metadata about that object. They don’t control the content or format of the object, nor do they have any knowledge of what’s inside it.

That object could be a spreadsheet, the latest RMAN stream of data from Oracle, or the results of a dump database command from SQL Server. It’s just an object with some metadata. The backup product might be told what type of object it is, such as an RMAN stream, so it can know how to process it.  For example, it might deduplicate an Oracle RMAN stream differently than a spreadsheet file.

But – and this is the important part – rarely does a backup product know anything about what’s inside that object, beyond what the metadata says.  This is especially true of RDBMSs.  A few products have done some extra effort to scan certain file types, such as spreadsheets, so they can provide “full text search” against those files. But, this is absolutely the exception, and I’m not aware of any that do that for relational databases.  Nor am I sure they could even do that, given that the personal data is already packed.

In addition, the object that backup software is handed is often just a block or few that was changed in a file, VM, or database since the last time it was backed up.  They might not even know where that block fits inside the whole, nor do they have the info to figure that out.

MS Office files have structure

Let’s assume we solve the above problem. The backup software would have to unpack the file, extract the personal data in question, then repack the file.  For example, a Microsoft Office file is actually a .ZIP file with some XML files inside of it.  The backup software would have to unzip the .ZIP file, take the appropriate data out of the XML file, then rezip the file – all without making the file unreadable when it saves it again.
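To give a feel for what that unpack, edit, and repack cycle involves, here is a minimal Python sketch using only the standard library. It is an illustration under simplifying assumptions, not how any backup product works: the function name scrub_ooxml is made up, the XML parts are assumed to be UTF-8, and the personal data is assumed to appear as one contiguous string (real Office documents often split text across several XML elements, which a plain string replace would miss).

```python
import io
import zipfile


def scrub_ooxml(package: bytes, needle: str, replacement: str = "") -> bytes:
    """Return a copy of a zipped Office package with `needle` removed from its XML parts.

    Hypothetical helper for illustration only. Assumes UTF-8 XML parts and that
    `needle` appears as one contiguous string.
    """
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(package)) as src, \
            zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            data = src.read(info.filename)
            if info.filename.endswith(".xml"):
                # Only XML parts are edited; images and other binary parts
                # are copied through untouched.
                text = data.decode("utf-8")
                data = text.replace(needle, replacement).encode("utf-8")
            dst.writestr(info.filename, data)
    return out.getvalue()
```

Even this toy version has to read and rewrite every member of the package, and it only works because it is handed the whole file. A backup product that holds just a few changed blocks of that file has nothing it can unzip.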

Relational databases have more structure

Relational databases have the concept of referential integrity.  When the database is open, this is not a problem when you delete record X.  It will automatically delete any references to record X so there aren’t any referential integrity problems. It will also update any indices that reference those references.  Easy peasy.

That’s impossible to do when the database is a bunch of objects in a backup. First, it requires the backup software to know much more about the format of the file than it needed to know before. It then would need to be able to delete a record, any references to that record, and any indices referencing that record, and it would need to do that for every RDBMS it supports.  I just don’t see this being a good idea.

Clean on restore

The first idea – as discussed in my last few blog posts – is for there to be a process to track any data that needs to be deleted (e.g. any records of Curtis Preston w/birthday of 01/01/01, at IP address, etc.), and then delete them on restore. Today this will need to be a manual process, but as has already been mentioned, it could be built into the backup software itself.  It’s a monumental task, but it’s much easier to open, read, and write a file when it’s in a filesystem.  And it’s much easier to run some SQL commands than it is to learn the internal structure of a database.

Just stop using your backups as archives!

This whole problem is caused by people keeping their backups far longer than they should. If you used backups for restores of data from the last, say, 90 days, and then used archive software for data older than that – this would not be a problem.  Everything I mentioned above doesn’t pertain to archives.  Archives by design are given information, not blobs of data.

They are given emails and records from a database, not a backup stream from the database. Yes, they are also given files like spreadsheets, but they are expected to parse those files and get the data inside.  It’s part of the design. An archive system is also given far fewer objects to parse, so it has time to do all that extra parsing and storing.

Maybe GDPR will be my friend and help me stop people from storing their backups for 10 years.  I can dream.

What do you think?

I think this is a big problem, and I don’t see any solutions on the horizon.  Does anyone else see this as any different?  Your comments move the discussion forward.  Anyone?

GDPR: How should we keep a record of deleted data?

I’m still just thinking out loud here.  Again… not an attorney.  I have read the GDPR and done some analysis of it, primarily around the right to be forgotten (RTBF) and how it pertains to data protection systems. I just want to start the conversation about some of these topics and see what people are thinking about these very important topics.

No one is scrubbing backups

As I mentioned in my previous post, my opinion is that it is not reasonable to expect companies to delete data from their backups in order to satisfy an RTBF request.  It’s simply not technically feasible given modern technology. I do believe companies should switch to non-natural values for the primary keys of their databases. It’s the latter that I want to talk about, based on some comments I received on my last post.

I stand by my opinion about non-natural keys for databases that store personal information. This allows you to delete a record while storing the record identifier, which isn’t personal data. That way you could easily check in the future if you have data that’s supposed to be deleted, such as if you restore the database to a point before the data is deleted.

But the commenter on my last article has a good point. What if you restore the database to a point before you started using non-natural keys? Suppose you follow the suggestion and stop using natural keys today.  But you still have backups from before today that don’t have non-natural keys, and you may have to keep those backups for a long period of time.  (You shouldn’t, as you should only be keeping archives for that amount of time, but we all know that at least half of you are keeping your backups for years.  Even if you were using archives, the problem of scrubbing them is just as hard, so they could cause the same problem.)

But what about this?

So, it’s three years from now and you need to restore a database from a backup you took before you switched to non-natural keys.  In the past three years you have received hundreds of RTBF requests that you need to continue to honor, but you just restored a database that has those records in it, and it doesn’t have that non-natural key you stored in order to make sure the data stays deleted.  How are you going to find and delete those records if you didn’t keep the natural keys you were using before you switched away from them?

Again, my opinion is that you’re going to have to keep enough data to identify a unique person in order to continue to honor RTBF requests after they’ve been done. Get rid of all data about the person (other than that) and store just enough to identify them — and put that in the most secure database you have. You could then use that database in one or both of the following two ways.

One way would be to have an app that could read the data in the database, never display it to anyone, but occasionally check if any records in the database are found in one or more databases.  The main use case for this method would be after a restore from an older backup.  You could point this app to that restored database so it could clean it.  You could also use it proactively to periodically check your entire environment for deleted records and delete them if they are found.

Another way to use it would be to set it up so that you could only query it by the unique identifier; data is never exported or sent to another app.  So you could run a query to see if SSN 123-34-3222 is in it.  If a record is found, it is supposed to be forgotten, so it should be deleted.  So, again, in the case of a restored database you could check every record in the restored database against the deleted records, and delete any that are found.  It’s less efficient than the previous method, but it’s more secure.
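The query-only idea can be sketched in a few lines of Python. Everything named here is hypothetical (the ForgottenRegistry class, the is_forgotten method, the "ssn" field), and a Python object is obviously not a security boundary; in practice this would be a separate, locked-down service. The point is only the shape of the interface: the caller can ask "is this identifier on the list?" and nothing else.

```python
from typing import Dict, Iterable, List


class ForgottenRegistry:
    """Holds identifiers of people who must stay forgotten.

    Hypothetical sketch: the only operation offered is a yes/no lookup.
    There is no method to list or export the stored identifiers.
    """

    def __init__(self, identifiers: Iterable[str]) -> None:
        self.__identifiers = frozenset(identifiers)

    def is_forgotten(self, identifier: str) -> bool:
        return identifier in self.__identifiers


def scrub_restored(records: List[Dict[str, str]],
                   registry: ForgottenRegistry,
                   id_field: str = "ssn") -> List[Dict[str, str]]:
    """Check every restored record against the registry and keep only the
    ones that are not supposed to be forgotten."""
    return [r for r in records if not registry.is_forgotten(r[id_field])]
```

This makes the efficiency trade-off visible: every record in the restored database costs one lookup, because the registry never hands over its list for a bulk comparison.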

I think this is defensible. Do you?

On one hand, it looks like keeping the unique identifier – which was the whole point of the GDPR – goes against the letter of the law for an RTBF request. Yes, it does.  But the GDPR also allows you to keep information required to protect against a lawsuit.  Not honoring RTBF requests could cost your company big time, so my personal, non-legal opinion is that this is a perfectly valid thing to do after you’ve honored an RTBF request – in order to make sure they stay forgotten.

How are you going to deal with this problem?  What do you think of my idea?

Some hope about GDPR and backups.

I am not a lawyer. I’m not even what I consider a GDPR specialist. But I’ve read a lot of the text of the GDPR, and I’ve read a lot about it and watched a lot of videos. So I’d like to offer my layman’s interpretation of an important aspect of GDPR – the right to be forgotten – and whether or not it means we have to delete data from our backups.

Let’s talk about this

I have an opinion on this issue, but it’s not a legal opinion. I’d love to hear your opinion, especially if it differs from mine. Let’s see some comments on this one, shall we?  Here’s the official GDPR website where you can read it for yourself.

The easy stuff

There are all kinds of GDPR articles about making sure we have consent and a reason to store personal data, making sure we take care of it when we do, making sure we store it securely, etc. There’s even a line that says we need “the ability to restore the availability and access to personal data in a timely manner in the event of a physical or technical incident.” My personal opinion is that you should have been doing all this stuff already, which is why I’m calling it “the easy stuff.”  (The easy stuff isn’t easy if you haven’t been doing it, but it’s easy in that all the technology is there to do it.  All you have to do is implement it.)

A little bit harder

You need a way to search any systems that store personal data. You need to be able to query for any records referencing a given email address, IP address, physical address, etc.  Hopefully you have that already, but if not, that will require some work to comply with.  This is needed to satisfy the data request and right to be forgotten provisions.

If you’re using “natural keys” as the primary keys in your database, that’ll have to change. Any information that could be deemed personal information by the GDPR should not be used as the primary key in a database.

The first reason is what happens if you are asked to delete a given record that uses the primary key of the IP address where the user filled out a form, or the email address they used to do so. If you reference that primary key in other records, you’ll have to do a cascading delete of any records that reference that key, in addition to deleting the primary record.  I’ll discuss the other reason this is important later in the article. Suffice it to say this may require a significant design change in your database system.
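Here is a small sketch of what that design looks like with a non-natural (surrogate) key and a cascading delete, using Python’s built-in sqlite3 module. The table and column names (leads, form_submissions, lead_id) are made up for illustration; your schema and RDBMS will differ, and this is only one way to set it up.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database keyed on a surrogate lead_id (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript("""
        CREATE TABLE leads (
            lead_id INTEGER PRIMARY KEY,   -- surrogate key, not personal data
            email   TEXT NOT NULL UNIQUE,  -- personal data, but not the key
            ip_addr TEXT
        );
        CREATE TABLE form_submissions (
            submission_id INTEGER PRIMARY KEY,
            lead_id   INTEGER NOT NULL
                      REFERENCES leads(lead_id) ON DELETE CASCADE,
            form_name TEXT NOT NULL
        );
    """)
    conn.execute("INSERT INTO leads VALUES (1, 'steve@example.com', '192.0.2.10')")
    conn.execute("INSERT INTO form_submissions VALUES (100, 1, 'newsletter')")
    conn.execute("INSERT INTO form_submissions VALUES (101, 1, 'whitepaper')")
    conn.commit()
    return conn


def forget_lead(conn: sqlite3.Connection, lead_id: int) -> None:
    """Delete the primary record; the database removes referencing rows itself."""
    conn.execute("DELETE FROM leads WHERE lead_id = ?", (lead_id,))
    conn.commit()
```

Because other tables reference lead_id rather than the email or IP address, deleting the lead removes the personal data in one place and the cascade cleans up everything that pointed at it.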

It goes way beyond employees

I’ve heard a lot of people talking about employees as if they are the main data subjects under the GDPR.  They are covered under GDPR, but I think employees fall under the easy stuff. It’s easy to prove consent when you have an employment contract. You’re probably already securely storing that data, and you probably also have a pretty simple way of searching for those records to comply with any requests for that data. You also have a valid reason to not comply with any erasure requests, because you can say that you’re keeping it to be able to defend against any lawsuits, which is an exception to the erasure requirement. (There are several reasons you don’t have to erase data; one of them is if you are keeping it to protect against lawsuits.)  My opinion is that everything I just said also applies to customers. You have a contract with them, you have a reason to keep their information, you can easily search for it, and you have a reason to not delete it.  Easy peasy.  (Remember, I’m not a lawyer, and I’m curious about your take on this.)

The rub comes when you’re storing data about non-employees and non-customers.  You will have to prove that you got affirmative consent to store the information, you’ll need to supply it when asked, and you’ll need to delete it when asked. Now things get a little hairy. It’s out of the scope of this blog, but this means you have to do things like have an unchecked checkbox that they have to check to give you permission to store the info. And you should be storing any personal data in a system that allows you to easily search for the data if someone asks for it.

But what about backups? Do I have to delete the backups?

No one knows for sure because there’s no case law on it and the GDPR itself is somewhat unclear on the issue. We won’t know until someone gets sued under the GDPR for not deleting data from their backups.  If a court rules that backups are part of what we’re supposed to delete, we’re all in a world of hurt. If they rule in line with what I say below, then we can breathe easier. Let’s see what the GDPR says about the subject.

The GDPR seems more concerned with live copies of data

This is more a general feeling than anything I can directly quote, but it seems to be interested primarily in online, live copies of data that can be easily accessed. I’m guessing it’s because these are the copies that tend to get hacked and accidentally released to the public. You don’t really see any stories about how some hacker broke into someone’s backup system and restored a bunch of stuff to make it public. Heck, most companies can’t restore their own data properly. How’s a hacker going to do that?

The GDPR doesn’t mention backups.

Go ahead. Search the entire text of the GDPR for phrases like “backup,” or “back up.” You won’t find them. So no help on that front.

The GDPR does mention restores

The writers of the GDPR knew about backups and restores, because they mentioned that you need “the ability to restore the availability and access to personal data in a timely manner in the event of a physical or technical incident.” So they knew the concept of a backup exists, but chose not to mention it in the erasure section.

It does use the words archive and archival, but

When it uses the word archival, it seems to be referring to a large collection of information for a long period of time. And if you can prove you’re doing something like that “for the public good,” then it’s also exempt from erasure. For example, you can’t ask that CNN erase a story about you getting arrested.

The GDPR does mention copies

There’s a section that says you should take reasonable steps to erase any “links to, or copies or replications of those personal data” if you’ve made it public. But, again, this seems focused primarily on online copies of data that are replicated copies of the same data we’re trying to erase.

The GDPR uses the words “reasonable” and “excessive”

The GDPR is filled with words like “reasonable” and “excessive”. They understand that not everything is possible and that some things will require an excessive amount of effort. One example of this is in Recital 66 about Article 17 (the right to be forgotten article).  It says that if a controller has made the personal data public, it “should take reasonable steps, taking into account available technology and the means available to the controller, including technical measures.”

The GDPR doesn’t use the word “reasonable” in the erasure section

Interestingly enough, right where we’d like to see a “reasonable” section, there isn’t one.  There is one when it talks about what you have to do if you’ve already made the data public and are asked to delete it, but it doesn’t mention reasonability when talking about deleting the main source of the data or any backups of that data.

You do have to make sure data stays deleted

If you are asked to delete a particular piece of personal data, you do need to make sure it is deleted – and stays deleted.  But it’s virtually impossible (and certainly not reasonable) to delete records out of most backup systems, so how are you going to ensure a given record stays deleted if you do a restore?

Now we’re back to natural keys. You’ll need a way to find records pertaining to Steve Smith living at 123 anywhere lane, without storing the values of Steve Smith and 123 anywhere lane.  (Because doing that would be violating the deletion request.)  This is why you need to use something other than natural keys. If you’re not using natural keys, you can determine that Steve Smith at 123 anywhere lane is lead number 9303033138.  That is a unique value that is tied to his record, but is not personal data if you get rid of the other values. You can then create a separate table somewhere that tracks the lead numbers that must stay deleted from the marketing database – even if it’s restored.

If you restore the marketing database, you just need to make sure you delete lead number 9303033138 and any other leads listed in the DeletedLeads table – before you put that database back online. Because if you put the marketing database back online with Steve Smith’s address and email address still there – and then someone kicks off a marketing campaign that contacts Steve Smith after you said his records are deleted – you’re going to have a very easily provable GDPR violation on your hands.  Then we’re back to talking about those potentially huge fines.
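As a rough sketch of that post-restore step, here is what it could look like in Python with sqlite3. The DeletedLeads table name comes from the discussion above; everything else (the leads table, the lead_id column, the scrub_restored_db function, keeping the tracking table in a separate database) is an assumption made for the example.

```python
import sqlite3


def scrub_restored_db(restored: sqlite3.Connection,
                      tracking: sqlite3.Connection) -> int:
    """Delete every lead listed in DeletedLeads from a freshly restored database.

    Hypothetical sketch: `tracking` holds the DeletedLeads table and lives
    outside the marketing database, so a restore cannot roll it back.
    Returns the number of rows removed.
    """
    ids = [row[0] for row in tracking.execute("SELECT lead_id FROM DeletedLeads")]
    before = restored.total_changes
    restored.executemany("DELETE FROM leads WHERE lead_id = ?",
                         [(lead_id,) for lead_id in ids])
    restored.commit()
    return restored.total_changes - before
```

The important design choice is that the tracking table stores only lead numbers and sits somewhere the restore doesn’t touch. You would run this against the restored copy before anyone is allowed to connect to it.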

I don’t think you have to delete data from your backups

My personal non-legal opinion is that as long as you have a process for making sure that deleted records stay deleted even after a restore – and you make sure you follow that process – you have a pretty defensible position. My personal opinion would also be to be upfront about this in your notification to the data subject.

Dear Steve Smith,

We have deleted all references to your personal data in our marketing database. For technical reasons we are unable to delete this information from our backup system, but that system is only used to restore the marketing database if it is damaged. We also have a system in place to ensure that your records will be immediately deleted if the marketing database is ever restored from the backup system.



Final thoughts

Backup vendors can and should be part of this process moving forward. Maybe in a few years’ time, we’ll have the ability to surgically remove records from a backup.  That would be very nice, and would be more elegant than having to do what I’m suggesting above.  This may indeed become a competitive differentiator for one or more backup companies moving forward.

What do you think?  Am I being too hopeful here?

What it was like presenting at Cloud Field Day

Presenting at a Gestalt IT “Field Day” (Cloud Field Day in this case) was very different than being a delegate. So I thought I’d blog about it – just like a delegate.

What is Cloud Field Day?

Cloud Field Day is an event put on by Gestalt IT, a company founded by Stephen Foskett (@sfoskett). They put on a variety of “Field Day” events, originally just called Tech Field Day.  They branched out into Storage Field Day, Networking Field Day, Wireless Field Day, and Cloud Field Day.

They bring in a group of 10+ influencers from around the world, each of whom has some type of audience. Delegates, as they are called, can be anything from someone with a “day job” in IT who just blogs part time for fun, to someone who is now making money full time as a blogger, speaker, or analyst.  The one thing they have in common is that they are independent; they cannot be employed by a vendor in the space.

I’ve been a delegate to a number of Field Days, and it’s definitely easier being on that side of things. It’s easier to listen to a vendor’s pitch and ask questions than it is to be the vendor making that pitch and answering questions. It’s easy to question why they’re doing something, or to “poke holes” in their strategy.  I can remember a few times where I and my fellow delegates thought the vendor was way off base.  I can even remember one time when it was so bad that the consensus among the delegates was that the start-up in question should immediately go out of business and return any remaining investor money.  (For the record, we were right, and that actually happened to that particular company.)

In all of those field days as a delegate, I was never stressed about being there. It’s quite enjoyable as a delegate.  You’re flown in, driven around in a limo, and constantly fed and catered to.  The only time I remember feeling any “stress” (if I can call it that) is when I found myself on the delivery end of a heated discussion. Even though I felt I was very justified in what I was saying, it’s still stressful being the center of attention in a heated argument that is being streamed live.

Presenting is very different

Being a former delegate made me more nervous, not less. I knew how probing delegates could be. I knew the messages I wanted to get across, but I wondered how those messages would be received.  I also knew that the delegates often drive the presentation, and they can be difficult to “redirect” once they grasp onto a concept they want to discuss.

For example, I watched as one Cloud Field Day 3 presenter before me “lost the room” for quite a while as the delegates debated a related technical topic. I remember thinking how would I handle that if it happened to us. You don’t want to stifle discussion, but you also need to make sure you communicate your message.

We survived

Based on the coverage we received, I think we communicated the core messages we wanted to get across, although they didn’t resonate equally with every delegate. Each delegate comes with their own experiences and biases, and you can’t cater to them all.

For example, I think the delegates understood how we are the only data protection product designed for the AWS infrastructure, automatically scaling the resources we use up and down using the AWS native load balancing apps.  We are also the only ones using AWS’ DynamoDB to hold all metadata and S3 to hold all backups. (Other products can copy some backups to S3; we store all of them there.) And I think they understood how that should drastically affect costs that we pass onto the customer.

What I didn’t anticipate was that being designed for the AWS infrastructure would not resonate with those who are proponents of the Azure infrastructure. We were asked why aren’t we running there as well. The answer is two-fold.  Because we are actually designed to use AWS’ native tools (e.g. load balancer, DynamoDB, S3), we can’t just move our software over to Azure. We would need an entirely separate code stack on Azure, so the level of effort is significantly different than those who just run their software in VMs without using native tools. Their approach is more expensive to deliver; ours requires additional coding to move platforms.  Secondly, we just don’t get a big demand for native support in Azure. Most customers don’t care where we run our infrastructure – and don’t have to.  But I can understand how that would fall on deaf ears of an Azure advocate.

But the biggest challenge we ran into with this crowd is that not everyone was convinced that backing up to the cloud is the way to go for datacenters. I should have known better, being a former delegate, that technical types are going to think about these things, and we need to address them before they can think about anything else.

If I had a Do-over

I would explain the benefits of our approach before I explain what it was.  I did cover the benefits, but after the architecture. I should have done it the other way around. Cover the problem and what parts of it we solve – then cover how we solve it. Pretty standard stuff really.  But I got excited talking to a technical audience and went technical first. While the field day audience doesn’t want 10 slides on how data is growing, etc, they do want some description of the problem you’re solving before you explain how you solve it.

I would also address the “elephant in the room” first, and explain that our model of backing up everything to the cloud will probably not scale to back up a single datacenter with multiple petabytes. (We do have customers who store double-digit numbers of petabytes with us, but not all from one datacenter.)

I could then explain that we scale farther than you probably think. And since most companies don’t have datacenters like that, why force them to use an approach that is designed for that (onsite hardware). If you could meet your backup and recovery needs without any onsite hardware, why wouldn’t you?

Cloud Field Day is awesome

It’s a bit of a public trial by fire, but it’s a refining fire.  I learned a lot about how to present our solution by presenting at Cloud Field Day. I’d recommend it to everyone.  I know we’ll be back.

Protect data wherever it lives

It’s more true now than any other time in history: data really can be anywhere.  Which means you need to be able to protect it anywhere and everywhere. And the backup architecture you choose will either enable that process or hamper it.

Data data everywhere, and not a byte to sync

Any product should be able to back up a typical datacenter.  Install a backup server, hook up some disk or tape, install the backup client, and transfer your data.  Yes, I realize I drastically simplified the situation; remember, I did build a career around the complexities of datacenter backup.  I’m just saying that we have decades of experience with this use case, and being able to back up a datacenter should be table stakes for any decent backup product or service.

The same is mostly true of a datacenter in a colocation/hosting facility.  You can use much of the same infrastructure to back it up.  One challenge is that it may be remote from the people managing the backup, which will require a hands-off way of getting data offsite or someone to manage things like tape.  Another challenge can be if the hosted infrastructure is not big enough to warrant its own backup infrastructure.

This is similar to another data source: the ROBO, for Remote Office/Branch Office. While some of them may have enough data to warrant their own backup infrastructure, they usually don’t warrant IT personnel. Historically this meant you either trained someone in the office to swap tapes (often at your own peril), or you hired Iron Mountain to do it for you (at significant cost). Deduplication and backup appliances have changed this for many companies, but ROBOs still plague many other companies who haven’t updated their backup infrastructure.

The truly remote site is a VM – or a bunch of VMs – running in a public cloud provider like AWS, Azure, or Google Cloud. There is no backup infrastructure there, and putting any type of traditional backup infrastructure there will be very expensive.  Cloud VMs are very inexpensive – if you’re using them part time.  If you’re running them 24×7 like a typical backup server, they’re going to be very expensive indeed. This means that the cloud falls into a special category of truly remote office without backup infrastructure or personnel.  You have to have an automated remote backup system to handle this data source.

Even more remote than a public cloud VM is a public cloud SaaS app.  With a SaaS app you don’t even have the option of running an expensive VM to back up your infrastructure.  You are forced to interact with any APIs they provide for this purpose. You must be able to protect this data over the Internet.

Finally, there are end user devices: laptops, desktops, tablets, and phones. There is no getting around that most people do the bulk of their work on such devices, and it’s also pretty easy to argue that they’re creating data that they store on their laptops.  Some companies handle this problem by converting to cloud apps and telling users to do all their work in the cloud. But my experience is that most people are still using desktop apps to do some of their work.  Even if they’re using the cloud to store the record of authority, they’re probably going to have a locally cached copy that they work on.  And since there’s nothing forcing them to sync it online, it can often be days or weeks ahead of the protected version stored in the cloud.  This is why it’s still a good idea to protect these systems. Mobile devices are primarily data consumption devices, but they still may create some data.  If it’s corporate data, it needs to be protected as well.

All things to all bytes

The key to backing up from anywhere is to reduce as much as possible the number of bytes that must be transferred to get the job done, because many of these backups will be done over slow or expensive connections. The first way to do this is to perform a block-level incremental backup, which transfers only the bytes that have changed since the last backup.  Once we’ve reduced the backup image to just the changed bytes, those bytes should be checked against other clients to see if they have the same changed bytes — before the data is sent across the network.  For example, if you’re backing up Windows systems, you should only have to back up the latest Windows patches once.

The only way to do this is source deduplication, also known as client-side deduplication. Source dedupe is done at the backup client before any data is transferred across the network. It does not require any local hardware, appliance, or virtual appliance/VM to work.  In fact, the appliance or system to which a given system is backing up can be completely on the other side of the Internet.
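
To make the idea concrete, here is a minimal sketch of source dedupe in Python. It is not how any particular product works: the `DedupeStore` class, the `backup` and `restore` functions, and the tiny fixed 4-byte chunk size are all assumptions chosen to keep the example readable. The point it illustrates is the order of operations: the client fingerprints its chunks, asks the far end which fingerprints are new, and only then sends data.

```python
import hashlib

CHUNK_SIZE = 4  # unrealistically small, so the example is easy to follow


class DedupeStore:
    """Stands in for the remote backup service (hypothetical)."""

    def __init__(self):
        self.chunks = {}  # fingerprint -> chunk bytes

    def missing(self, fingerprints):
        """Return the fingerprints the store has not seen yet."""
        return {fp for fp in fingerprints if fp not in self.chunks}

    def upload(self, fingerprint, chunk):
        self.chunks[fingerprint] = chunk


def backup(data, store):
    """Back up `data`; return (recipe, number of bytes actually sent)."""
    chunks = [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]
    recipe = [hashlib.sha256(c).hexdigest() for c in chunks]
    # Ask which fingerprints are new *before* any chunk data is sent.
    needed = store.missing(recipe)
    sent = 0
    for fp, chunk in zip(recipe, chunks):
        if fp in needed:
            store.upload(fp, chunk)
            needed.discard(fp)  # a chunk repeated within this backup is sent once
            sent += len(chunk)
    return recipe, sent


def restore(recipe, store):
    """Rebuild the original data from its list of fingerprints."""
    return b"".join(store.chunks[fp] for fp in recipe)


store = DedupeStore()
recipe_a, sent_a = backup(b"AAAABBBBCCCC", store)  # first client: everything is new
recipe_b, sent_b = backup(b"AAAABBBBDDDD", store)  # second client: one new chunk
```

In this toy run the second client shares two of its three chunks with the first, so only the last chunk crosses the network. Real products use much larger and often variable-sized chunks, but the bandwidth saving comes from the same check-before-send step.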

In my opinion, source-side dedupe is the way backup always should have been done.  We just didn’t have the technology. It saves bandwidth, it increases the speed of most backups, and it makes the impossible (like backing up a server across the Internet) possible.

You can back up some of the previously mentioned data sources with target dedupe (where you put a dedupe appliance close to the data & it does the deduping), but it can’t do all of them.  Target dedupe also comes at a significant cost, as it means you have to install an appliance or virtual appliance at every location you plan on backing up. This means an appliance in every remote datacenter, even if it only has a few dozen gigabytes of data, a virtual appliance (or more) in every cloud, an appliance in every colo – and mobile data gets left out in the cold.  Source dedupe is cheaper and scales farther out to the edge than target dedupe – without the extra cost of appliances in every location.

Someone else driving your car doesn’t make it an Uber

Imagine if you had to lease a car in order to use Uber. That’s the logic behind how many infrastructure software vendors sell their “service” in the cloud. These “services” come in a couple of different flavors, but both of them come down to the same thing: infrastructure set aside for your exclusive use – which means you pay for it.

Private Cloud vendors

There are a lot of “XaaS” vendors that will create a dedicated system for you in their cloud, manage it for you, and then charge it to you as a service. It is definitely “Something as a Service,” as the management of it, including OS and application upgrades, hardware upgrades, and reporting, are all handled for you.  The key to this idea is that you get one bill that includes compute, storage, networking, and management costs.

This is definitely an improvement over managing your own system – from a level of effort perspective.  You don’t have to manage anything but the relationship with the vendor. Depending on your circumstances, you can even make the argument that the vendor in question is better at providing service X than you are. This is especially true of “forgotten” apps like data protection that don’t get the attention they deserve.  You could argue that using a private cloud vendor is better for your data protection than doing it yourself.

What you can’t argue is that it’s less expensive.  There are very few economies of scale in this model. Someone is still paying for one or more servers, some storage, some compute, and some personnel to manage them. They are then marking those things up and passing the cost to you.  There is no way this is cheaper than doing it yourself.

In addition, it’s also important to say that vendors who use the private cloud model don’t come with the same security advantages of those using established public cloud vendors. I know of one vendor that sells their services exclusively via a huge network of MSPs, each of which has a completely different level of capabilities, redundancies, and security practices.  Using a private cloud model requires a customer to look very closely at their infrastructure.

Hosted Software Vendors

Suppose you say you want to use the public cloud for economies of scale, an enhanced security model when compared to private cloud vendors, or maybe someone up higher simply said you needed to start using the public cloud. There are a number of infrastructure vendors that will run their software in VMs in the public cloud, and then offer you a service type of agreement for that software.

Now you are paying two bills: the “service” bill to the infrastructure software vendor, and the cloud provider bill for the compute, storage, and networking services required by this infrastructure vendor. Often in this model, the only service is that the vendor is selling you their software as a subscription.  But the moniker “as a Subscription” doesn’t sound as good as “as a Service,” so they still call this a service.

The problem with this model is that you aren’t getting any of the benefits of the cloud. Typical benefits of the cloud include partial utilization, cloud native services, automated provisioning, and paying only for what you use.  But you’re getting none of those in this model.

Infrastructure products – especially data protection products – are designed around using servers 24×7.  A backup server that isn’t performing any backups is still running 24×7, in case any backup clients request a backup.  That means those VMs you’re running the software on in the cloud have to run 24×7 – so much for partial utilization. A 24×7 cloud VM is very expensive indeed.

Such products are also written to use traditional infrastructure services, like filesystems, block devices, and SQL databases. They don’t know how to use services like S3 and NoSQL databases available in the cloud.  In the case of backup software, they might know how to archive to S3 or Glacier, but they don’t know how to store the main set of backups there.

Such products also require manual scaling efforts when your capacity needs grow. You have to configure more VMs, configure the software to run on those VMs, and adjust your licensing as appropriate. You’re not able to take advantage of the automated scaling the public cloud offers.

Finally, because you have to provision things in advance, you are often paying for infrastructure before you need it. If you know you’re going to run out of compute, you have to configure a new VM before you do. As you start using that VM, a good portion of it is completely unused. The same is true of filesystems and block storage, especially with backup systems. If your backups and metadata are stored on a filesystem or block storage, you have to manually configure additional capacity before you need it.  This means you’re paying for it before you need it. If the product could automatically make compute available only when you needed it, and use S3 for its storage, you would only pay for compute and storage as you consume it.

Don’t lease a car to take an Uber

See what I mean? In both of these models, you are leasing a car so you can take an Uber.  In the private cloud model, the cost of leasing the system is built into the price of the service, but you’re still paying for that infrastructure 24×7, since it is dedicated to you.  In the public cloud model, you’re paying for the service and you’re leasing the infrastructure 24×7 – even though the service isn’t using the infrastructure 24×7.  Examples of infrastructure products that work like this are Hosted Exchange, SAP Cloud and almost every BaaS/DRaaS/DMaaS vendor.

If you’re going to use the public cloud effectively, you need partial utilization, automated provisioning, and pay-only-for-what-you-use pricing. A true cloud-native product, such as Office 365, G-Suite, or the Druva Cloud Platform, offers all of those things.  Don’t lease a car to take an Uber.

I don’t often directly push my employer’s products, but it’s World Backup Day tomorrow so I’m making an exception.  Celebrate it by checking out my employer’s announcement of the Druva Cloud Platform, the only cloud-native data management solution.  It can protect data centers, laptops, mobile devices, SaaS apps like Office 365 and G-Suite, and workloads running in the cloud – all while you gain all of the benefits of the cloud, including partial utilization, automated provisioning, and full use of cloud-native tools like S3 and DynamoDB.

GDPR Primer #2: What is personal data?

Last week I wrote the first of what will probably be a few articles about GDPR, EU’s General Data Protection Regulation.  It governs the protection of “personal data” that your company is storing from EU citizens living in the EU.  (They must be EU citizens, and they must be currently living in the EU for the regulation to apply.)

As mentioned in my last article, US companies are subject to the regulation if they have personal data from EU citizens. Nexus or a physical presence is not required, only that you have data from people living there.

Is Personal Data the same as PII?

In the US we have a term we like to use called Personally Identifiable Information (PII), which includes certain data types that can be used to identify a person.  Examples  include social security numbers, birthdays, names, employers, physical addresses, and phone numbers.  It’s usually the combination of two data elements that makes something PII, for example knowing someone’s name and their birthday puts you one data point away from being able to steal their identity.  All you need is the social security number and you’re off to the races.

Personal Data, as defined by the GDPR, includes what we call PII, but it includes “any information relating to an identified or identifiable natural person (‘data subject’); an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person.”  This is interpreted so far to include things like IP addresses, social media profiles, email addresses, and other types of data that we don’t think of as PII in the US.

Someone filling out a basic marketing form on your website has submitted what the GDPR considers personal data to your company. If there’s enough for the person to be identified in any way – which a marketing form would most certainly have – then it’s considered personal data as far as GDPR is concerned.

GDPR Is coming

GDPR goes into effect May 25th.  If you haven’t talked to your backup company about it, it’s time to start having that conversation.

Is your data protection company worried about GDPR? They should be.

If you haven’t looked into how your data protection vendors are preparing for the General Data Protection Regulation (GDPR), you’re already behind the power curve.  It goes into effect May 25, 2018. Hopefully this article can get you up to speed so you can start pressuring your vendors about how they are going to help you comply with this incredibly big regulation.

Disclaimer: I’m not a lawyer or a GDPR expert. My goal with this blog is to get you thinking and maybe scare you a little bit.  Nothing in this blog should be construed as legal advice.

Disclaimer #2: There is no such thing as a GDPR-compliant product, and definitely no such thing as GDPR-certified. A product can help you comply with GDPR.  A product can say “we are able to help you comply with articles 15 and 17,” but a product alone will not make you GDPR compliant.  And there is no certification body to provide a GDPR certification. Anyone who says that is making it up.

US companies must comply with GDPR if

Although this is a European Union (EU) regulation, you are subject to it if you are storing personally identifiable information (PII) (referred to by GDPR as “personal data”) about European citizens (referred to by GDPR as “data subjects”) living within the EU. Where your company is headquartered is irrelevant.

A business transaction is not required.  A marketing survey targeting EU residents appears sufficient to require your company to comply with GDPR.  An EU resident (who was not targeted specifically) filling out a form on your US website that does not have an EU domain might not trigger GDPR protection for that person.  My non-legal advice is that you should look into how you’re preparing for the requirements.

Not complying with GDPR can cost you dearly

Companies not complying with the data privacy aspects of GDPR can be fined up to 4% of annual revenue, or 20 million Euros, whichever is greater.  It hasn’t gone into effect yet, and no one has been fined yet, so we don’t yet know just how tough the courts are going to be. But that’s what the regulation allows.
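
Here is that arithmetic as a few lines of Python, just to show where the crossover sits. The function name `max_gdpr_fine` is mine, and this only computes the ceiling described above, not what any regulator would actually impose.

```python
def max_gdpr_fine(annual_revenue_eur):
    """Ceiling of the fine: 4% of annual revenue or EUR 20 million, whichever is greater."""
    return max(annual_revenue_eur * 4 / 100, 20_000_000)


small_company = max_gdpr_fine(100_000_000)    # EUR 100 million in revenue
large_company = max_gdpr_fine(1_000_000_000)  # EUR 1 billion in revenue
```

For any company with less than 500 million Euros in annual revenue, 4% comes to less than 20 million, so the 20 million Euro figure is the one that sets the ceiling.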

How does GDPR affect data protection?

There are several aspects to GDPR protection, but only a few of them affect your data protection system. For example, there is a requirement to gain consent before storing personal data. That responsibility falls way outside the data protection system. But let’s look at some parts that many systems are going to have a really hard time with.

GDPR has articles that talk about general data security, but I think any modern backup system should be able to comply with those articles. The things about GDPR that I think data protection people will struggle with are articles 15, 16 and 17: the right to data access by the subject, the right to correction, and the right to erasure (AKA “right to be forgotten”).

Article 15: Right to data access by subject

If you have data on a data subject (i.e. EU citizen), and assuming that data is subject to GDPR, the subject has a right to see that data. This includes any and all data stored on primary storage, snapshots, backup storage, and archives.  Try to think about how you would comply with that request today and you see where my concern is. Archive software might be ready for this, but most backup systems are incapable of delivering information in this manner.

Article 16: Right to correction

A data subject has the right to have incorrect data corrected. This may not directly affect the backup and archive systems, but it might.

Article 17: Right to erasure (AKA “the right to be forgotten”)

This one is the one that truly scares me as a data protection specialist.  If a company cannot prove they have a legitimate business reason for continued storage of a particular data subject’s personal data, the data subject has the right to have it deleted. And that means all of it.

As previously mentioned, we don’t have any case law on this yet, and we don’t yet know the degree to which the EU courts will order a company to delete someone’s data from backups and archives. But this is the article that has me the most worried.

Update: 05/29: I’ve changed a bit in how I think about this.  Make sure to check out this blog post and this one about this topic.

I told you so

The customers that are in real trouble are those that use their backup systems as archive systems, since most backup systems are simply incapable of doing these things. They will be completely incapable of complying with Articles 15-17 of GDPR.

I’ve been telling customers for years to not use their backup system as an archive system. If you are one of the ones who listened to me, and any long term data is stored in an archive system, you’re pretty much ready for GDPR.  A good archive should be able to satisfy these requirements.

But if you’ve got data from five years ago sitting on a dedupe appliance or backup tapes, you could be in a world of hurt. There simply isn’t a way to collect in one place all data on a given subject, and there’s definitely no way to delete it from your backups. Each record is a tiny record inside a filesystem backup stored in some kind of blob, such as a tar file or the equivalent for your backup system.
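
To illustrate why a blob-style backup is so awkward here, the sketch below builds a tiny tar archive in memory with Python’s standard `tarfile` module and then tries to find and remove one person’s data. The file names, the email address, and the helper functions are all made up for the example. Notice that finding the subject means reading every member, and “erasing” means writing a brand new archive (and in this sketch it drops whole files, not individual records).

```python
import io
import tarfile


def make_backup(files):
    """Build an in-memory tar archive from a {name: bytes} mapping."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, content in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(content)
            tar.addfile(info, io.BytesIO(content))
    return buf.getvalue()


def find_subject(backup, needle):
    """Scan every member for `needle`; the archive has no index to consult."""
    hits = []
    with tarfile.open(fileobj=io.BytesIO(backup), mode="r") as tar:
        for member in tar:
            if member.isfile() and needle in tar.extractfile(member).read():
                hits.append(member.name)
    return hits


def erase_subject(backup, needle):
    """Write a whole new archive that leaves out every file containing `needle`."""
    kept = {}
    with tarfile.open(fileobj=io.BytesIO(backup), mode="r") as tar:
        for member in tar:
            if member.isfile():
                content = tar.extractfile(member).read()
                if needle not in content:
                    kept[member.name] = content
    return make_backup(kept)


backup = make_backup({
    "crm/export.csv": b"jane@example.com,Jane\n",
    "logs/app.log": b"service started\n",
})
found = find_subject(backup, b"jane@example.com")
rewritten = erase_subject(backup, b"jane@example.com")
```

That works for two small files in memory. Now picture doing it across five years of backups on tape or inside a dedupe appliance, where the archive can’t simply be read, filtered, and rewritten, and you see the problem.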

What are your vendors saying?

Has anyone had any conversations about this with their data protection vendors?  What are they saying?  I’d really love to hear your stories.

How to make the cloud cheaper (or more expensive)

Depending on how you do it, the cloud can be much less expensive than using on-premises systems.  But it can also be much more expensive. It all depends on how you use the public cloud.

The expensive way: 24×7 VMs and pre-provisioned block storage

Running one or more VMs in the cloud 24×7 (like you would in a datacenter) is a great way to drive up your costs. It’s generally going to be more expensive than running the same VMs in house (if you’re running them 24×7).  It’s difficult to come up with the incremental cost of an individual VM, as this article attests. But generally speaking, you should be able to run a VM onsite for less than the cost of running that same VM in the cloud. It makes sense; it’s called markup.

Storage can also be more expensive in the cloud for the same reasons.  If you’re provisioning large chunks of block storage (e.g. EBS in AWS) before you actually consume it, your costs are going to be higher than if you only pay for storage as you use it. This is really only possible with object storage.

It’s also important to note that moving a VM to the cloud doesn’t get rid of all the typical system administration tasks associated with said VM. The OS still needs updating; the applications still need updating.  Sure, you don’t have to worry about swapping out the power supply, but most people let a vendor do that part anyway. But it’s important to understand that moving a VM to the cloud doesn’t make it magically start caring for itself.

The cheap way: Dynamically allocated VMs and object storage

In the public cloud, your costs are directly tied to how much storage, network and compute you use. That means that if you have an application that can dynamically scale up and down its use of cloud resources, you might be able to save money in the cloud, even if the per-hour costs are higher than those you would have onsite. This is because generally speaking, you don’t save money in the datacenter by turning off a VM. The resources attached to that VM are still there, so your costs don’t go down. But if you have an app that can reduce its compute resources – especially to the point of turning off VMs, you can save a lot of money.
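
A quick back-of-the-envelope calculation shows how this plays out. The hourly rates below are invented for illustration (they are not quotes from any provider), and I’m assuming a backup workload that only needs compute for about four hours a night. Prices are in cents so the arithmetic stays in whole numbers.

```python
HOURS_PER_MONTH = 730  # average number of hours in a month


def monthly_compute_cost(cents_per_hour, hours_running):
    """Monthly cost in cents for a VM billed by the hour."""
    return cents_per_hour * hours_running


# Hypothetical rates: the cloud VM costs more per hour than the onsite one.
onsite_always_on = monthly_compute_cost(8, HOURS_PER_MONTH)
cloud_always_on = monthly_compute_cost(10, HOURS_PER_MONTH)
cloud_on_demand = monthly_compute_cost(10, 4 * 30)  # four hours a night
```

With these assumed numbers, the 24×7 cloud VM is the most expensive option, while the same VM run only when needed costs a fraction of the onsite one, even at the higher hourly rate.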

This also holds true for storage. If you are using object storage instead of block storage, you pay only for what you use as you use it.  As backups expire and objects are deleted out of the object store, your costs decrease.  This is very different than how pre-provisioned block storage behaves, where deleting files doesn’t save you money.
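
Here is a small sketch of that difference. Everything in it is an assumption made for illustration: the monthly usage figures, the 500 GB provisioning step, and a single per-GB price (in cents) applied to both kinds of storage, which real price lists won’t match. It models block storage as something you grow in steps ahead of need and never shrink, and object storage as something billed on what is stored that month.

```python
def monthly_bills(used_gb_by_month, price_per_gb, volume_step_gb):
    """Return (block_bills, object_bills), one entry per month."""
    block_bills, object_bills = [], []
    provisioned = 0
    for used in used_gb_by_month:
        # Block storage: grow in fixed steps ahead of need, never shrink.
        while provisioned < used:
            provisioned += volume_step_gb
        block_bills.append(provisioned * price_per_gb)
        # Object storage: billed only on what is stored this month.
        object_bills.append(used * price_per_gb)
    return block_bills, object_bills


# Usage grows for three months, then backups expire and usage falls.
usage = [300, 700, 1100, 600, 400]
block, obj = monthly_bills(usage, price_per_gb=2, volume_step_gb=500)
```

In this toy run the block storage bill climbs to its peak and stays there after the backups expire, while the object storage bill follows usage back down.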

Use the cloud the way it’s meant to be used.

If your backup vendor is just running software in 24×7 VMs in the cloud, and if they require you to provision block storage for said VMs, then they’re using the cloud in the way that cloud experts generally agree is a great way to drive up costs and not add a lot of value.

Your costs will go up and your manageability stays the same. You’re still dealing with an OS and application that needs to be updated in the same way it would be onsite. You still have to increase or decrease your software or storage licenses as your needs grow.