More thoughts on the GDPR & backups

I’m doubling down on my opinion that the GDPR is not going to be able to force companies to “forget” people in their backups – especially personal data found inside an RDBMS or spreadsheet.  Let me explain.

Disclaimer: I’m not a lawyer. I’m a data protection expert, not a legal expert. I’ve read the GDPR quite a bit and have consulted a lawyer about my thoughts, but this should not be considered a legal opinion.  Just thinking out loud here.

Note: This article is one in a series about GDPR.

Personal data will be inside something

If a company is storing personal data on employees, customers, or prospects, that data will not be inside a discrete file for that customer.  For example, Amazon does not have a single Excel spreadsheet or database called “Curtis Preston’s data.” There is a customer database, purchasing database, browsing database, etc., and my data is many rows within these databases.

This is the way it is at companies all around the world. Personal data is stored inside files and databases that also store other people’s personal data. Most filesystems let you search the content of files, and searching content is exactly what an RDBMS is built to do. Therefore, complying with a GDPR data portability request or a right to be forgotten (RTBF) request against online data should be relatively straightforward.  Difficult, yes, but it’s simply a matter of identifying where personal data is stored and providing a process for searching it and expunging it if requested.  Backups, however, are a whole different thing.

Backups just store what they’re given

This is important to understand, especially when we are talking about database backups. With few exceptions, backup products are handed an object and some metadata about that object. They don’t control the content or format of the object, nor do they have any knowledge of what’s inside it.

That object could be a spreadsheet, the latest RMAN stream of data from Oracle, or the results of a dump database command from SQL Server. It’s just an object with some metadata. The backup product might be told what type of object it is, such as an RMAN stream, so it can know how to process it.  For example, it might deduplicate an Oracle RMAN stream differently than a spreadsheet file.

But – and this is the important part – rarely does a backup product know anything about what’s inside that object, beyond what the metadata says.  This is especially true of RDBMSs.  A few products have made some extra effort to scan certain file types, such as spreadsheets, so they can provide “full text search” against those files. But this is absolutely the exception, and I’m not aware of any that do that for relational databases.  Nor am I sure they could even do that, given that the personal data is already packed inside the database’s own format.

In addition, the object that backup software is handed is often just a block or two that changed in a file, VM, or database since the last time it was backed up.  The backup software might not even know where that block fits inside the whole, nor does it have the info to figure that out.
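To make that concrete, here’s a minimal sketch of a block-level incremental. It’s an illustration I’m assuming for this post, not how any particular backup product works; the 4 KB block size and the function names are made up. The point is what comes out the other end: offsets and raw bytes, with nothing that says “this is row 12 of the customer table.”

```python
import hashlib

BLOCK_SIZE = 4096  # hypothetical block size, chosen only for illustration


def block_hashes(data: bytes, block_size: int = BLOCK_SIZE) -> list[str]:
    """Hash each fixed-size block of a file, VM image, or database file."""
    return [
        hashlib.sha256(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    ]


def changed_blocks(
    old: bytes, new: bytes, block_size: int = BLOCK_SIZE
) -> list[tuple[int, bytes]]:
    """Return (offset, raw bytes) for every block that differs from the last backup."""
    old_hashes = block_hashes(old, block_size)
    changed = []
    for index, digest in enumerate(block_hashes(new, block_size)):
        # A block is "changed" if it is new or its hash no longer matches.
        if index >= len(old_hashes) or digest != old_hashes[index]:
            offset = index * block_size
            changed.append((offset, new[offset:offset + block_size]))
    return changed


# Three blocks last time; only the middle one was modified since then.
previous = b"A" * 4096 + b"A" * 4096 + b"B" * 4096
current = b"A" * 4096 + b"X" * 4096 + b"B" * 4096
delta = changed_blocks(previous, current)
```

All the backup system receives in this sketch is one 4 KB chunk and its offset. Finding a person inside that chunk would require knowing the format of whatever the chunk came from.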

MS Office files have structure

Let’s assume we solve the above problem. The backup software would have to unpack the file, extract the personal data in question, then repack the file.  For example, a modern Microsoft Office file (such as .docx or .xlsx) is actually a .ZIP file with some XML files inside of it.  The backup software would have to unzip the .ZIP file, take the appropriate data out of the XML file, then rezip the file – all without making the file unreadable when it saves it again.
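Here’s a rough sketch of that unzip, edit, rezip cycle using Python’s standard zipfile module. The package it works on is a toy I build in memory, not a real Office document, and the plain text replace is a big simplification on my part: a real file can split a name across several XML elements, and a real tool would need an XML-aware edit for every file format it supports.

```python
import io
import zipfile


def scrub_zip_xml(package: bytes, needle: str, replacement: str = "") -> bytes:
    """Rewrite every .xml member of a ZIP package, replacing `needle`."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(package)) as src, \
            zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            payload = src.read(info.filename)
            if info.filename.endswith(".xml"):
                text = payload.decode("utf-8")
                payload = text.replace(needle, replacement).encode("utf-8")
            # Member names and order are kept; other metadata is not preserved.
            dst.writestr(info.filename, payload)
    return out.getvalue()


def make_toy_package() -> bytes:
    """Build a tiny stand-in for an Office file (hypothetical contents)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("xl/sharedStrings.xml", "<sst><si><t>Curtis Preston</t></si></sst>")
        zf.writestr("media/image1.bin", b"\x00\x01\x02")
    return buf.getvalue()


cleaned = scrub_zip_xml(make_toy_package(), "Curtis Preston", "REDACTED")
with zipfile.ZipFile(io.BytesIO(cleaned)) as zf:
    names = zf.namelist()
    shared_strings = zf.read("xl/sharedStrings.xml").decode("utf-8")
```

Even this toy version has to rewrite the entire package to change one string, and that is the easy case where the software was handed the whole file rather than a changed block.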

Relational databases have more structure

Relational databases have the concept of referential integrity.  When the database is open, this is not a problem when you delete record X.  With cascading deletes configured, the database will automatically delete any rows that reference record X so there aren’t any referential integrity problems. It will also update any indices that cover those rows.  Easy peasy.
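As a small illustration of how much work a live database does for you, here’s a sketch using SQLite through Python’s sqlite3 module. The table and column names are made up for the example, and I’m assuming the foreign key was declared with ON DELETE CASCADE; without that, the database would refuse the delete rather than clean up the references.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
conn.executescript("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL
    );
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL
            REFERENCES customers(customer_id) ON DELETE CASCADE,
        item        TEXT NOT NULL
    );
    CREATE INDEX idx_orders_customer ON orders(customer_id);
    INSERT INTO customers VALUES (1, 'Record X'), (2, 'Someone Else');
    INSERT INTO orders VALUES (10, 1, 'book'), (11, 1, 'lamp'), (12, 2, 'desk');
""")

# One statement against the open database: the engine removes the
# referencing orders and maintains the index on its own.
conn.execute("DELETE FROM customers WHERE customer_id = 1")
conn.commit()

remaining_orders = conn.execute("SELECT order_id FROM orders").fetchall()
remaining_customers = conn.execute("SELECT customer_id FROM customers").fetchall()
```

One DELETE statement, and the engine takes care of the two orders and the index entries. None of that machinery exists when the same database is sitting in a backup as a pile of objects.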

That’s impossible to do when the database is a bunch of objects in a backup. First, it requires the backup software to know much more about the format of the file than it needed to know before. It then would need to be able to delete a record, any references to that record, and any indices referencing that record, and it would need to do that for every RDBMS it supports.  I just don’t see this being a good idea.

Clean on restore

The first idea – as discussed in my last few blog posts – is for there to be a process to track any data that needs to be deleted (e.g. any records of Curtis Preston w/birthday of 01/01/01, at IP address 1.1.1.1, etc.), and then delete them on restore. Today this will need to be a manual process, but as has already been mentioned, it could be built into the backup software itself.  It’s a monumental task, but it’s much easier to open, read, and write a file when it’s in a filesystem.  And it’s much easier to run some SQL commands than it is to learn the internal structure of a database.

Just stop using your backups as archives!

This whole problem is caused by people keeping their backups far longer than they should. If you used backups for restores of data from the last, say, 90 days, and then used archive software for data older than that – this would not be a problem.  Everything I mentioned above doesn’t pertain to archives.  Archives by design are given information, not blobs of data.

They are given emails and records from a database, not a backup stream from the database. Yes, they are also given files like spreadsheets, but they are expected to parse those files and get the data inside.  It’s part of the design. An archive system is also given far fewer objects to parse, so it has time to do all that extra parsing and storing.

Maybe GDPR will be my friend and help me stop people from storing their backups for 10 years.  I can dream.

What do you think?

I think this is a big problem, and I don’t see any solutions on the horizon.  Does anyone else see this as any different?  Your comments move the discussion forward.  Anyone?

----- Signature and Disclaimer -----

Written by W. Curtis Preston (@wcpreston). For those of you unfamiliar with my work, I've specialized in backup & recovery since 1993. I've written the O'Reilly books on backup and have worked with a number of native and commercial tools. I am now Chief Technical Architect at Druva, the leading provider of cloud-based data protection and data management tools for endpoints, infrastructure, and cloud applications. These posts reflect my own opinion and are not necessarily the opinion of my employer.

GDPR: How should we keep a record of deleted data?

I’m still just thinking out loud here.  Again… not an attorney.  I have read the GDPR and done some analysis of it, primarily around the right to be forgotten (RTBF) and how it pertains to data protection systems. I just want to start the conversation about some of these very important topics and see what people are thinking.

No one is scrubbing backups

As I mentioned in my previous post, my opinion is that it is not reasonable to expect companies to delete data from their backups in order to satisfy an RTBF request.  It’s simply not technically feasible given modern technology. I do believe companies should switch to non-natural values for the primary keys of their databases. It’s the latter that I want to talk about, based on some comments I received on my last post.

I stand by my opinion about non-natural keys for databases that store personal information. This allows you to delete a record while storing the record identifier, which isn’t personal data. That way you could easily check in the future if you have data that’s supposed to be deleted, such as if you restore the database to a point before the data is deleted.

But the commenter on my last article has a good point. What if you restore the database to a point before you started using non-natural keys? Suppose you follow the suggestion and stop using natural keys today.  But you still have backups from before today that don’t have non-natural keys, and you may have to keep those backups for a long period of time.  (You shouldn’t, as you should only be keeping archives for that amount of time, but we all know that at least half of you are keeping your backups for years.  Even if you were using archives, the problem of scrubbing them is just as hard, so they could cause the same problem.)

But what about this?

So, it’s three years from now and you need to restore a database from a backup you took before you switched to non-natural keys.  In the past three years you have received hundreds of RTBF requests that you need to continue to honor, but you just restored a database that has those records in it, and it doesn’t have that non-natural key you stored in order to make sure the data stays deleted.  How are you going to find and delete those records if you didn’t keep the natural keys you were using before you switched away from them?

Again, my opinion is that you’re going to have to keep enough data to identify a unique person in order to continue to honor RTBF requests after they’ve been done. Get rid of all data about the person (other than that) and store just enough to identify them — and put that in the most secure database you have. You could then use that database in one or both of the following two ways.

One way would be to have an app that could read the data in this database, never display it to anyone, but occasionally check whether any of its records are found in one or more other databases.  The main use case for this method would be after a restore from an older backup.  You could point this app to that restored database so it could clean it.  You could also use it proactively to periodically check your entire environment for deleted records and delete them if they are found.

Another way to use it would be to set it up so that you could only query it by the unique identifier; data is never exported or sent to another app.  So you could run a query to see if SSN 123-34-3222 is in it.  If a record is found, it is supposed to be forgotten, so it should be deleted.  So, again, in the case of a restored database you could check every record in the restored database against the deleted records, and delete any that are found.  It’s less efficient than the previous method, but it’s more secure.
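Here’s a minimal sketch of that second, query-only approach. Everything in it is hypothetical: the ForgottenRegistry class, the people table, and the ssn column are names I made up, and an in-memory SQLite database stands in for “the most secure database you have.” The only question the registry answers is yes or no for a single identifier; it has no method that lists or exports what it holds. (In Python that’s a convention, not a security boundary. A real system would enforce it with database permissions.)

```python
import sqlite3


class ForgottenRegistry:
    """Holds identifiers of forgotten people; answers yes/no, never lists them."""

    def __init__(self) -> None:
        self._conn = sqlite3.connect(":memory:")
        self._conn.execute("CREATE TABLE forgotten (identifier TEXT PRIMARY KEY)")

    def add(self, identifier: str) -> None:
        self._conn.execute(
            "INSERT OR IGNORE INTO forgotten (identifier) VALUES (?)", (identifier,)
        )
        self._conn.commit()

    def is_forgotten(self, identifier: str) -> bool:
        row = self._conn.execute(
            "SELECT 1 FROM forgotten WHERE identifier = ?", (identifier,)
        ).fetchone()
        return row is not None


def purge_restored(restored: sqlite3.Connection, registry: ForgottenRegistry) -> int:
    """Check every restored record against the registry and delete the matches."""
    ssns = [row[0] for row in restored.execute("SELECT ssn FROM people")]
    doomed = [(ssn,) for ssn in ssns if registry.is_forgotten(ssn)]
    restored.executemany("DELETE FROM people WHERE ssn = ?", doomed)
    restored.commit()
    return len(doomed)


# A stand-in for a database restored from an old backup.
restored = sqlite3.connect(":memory:")
restored.execute("CREATE TABLE people (ssn TEXT PRIMARY KEY, name TEXT)")
restored.executemany(
    "INSERT INTO people VALUES (?, ?)",
    [("123-34-3222", "Forgotten Person"), ("999-99-9999", "Current Customer")],
)
restored.commit()

registry = ForgottenRegistry()
registry.add("123-34-3222")
deleted = purge_restored(restored, registry)
remaining = restored.execute("SELECT ssn FROM people").fetchall()
```

The one-lookup-per-record loop is why this is the less efficient option: the restored database does all the asking, and the registry never hands over its contents in bulk.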

I think this is defensible. Do you?

On one hand, it looks like keeping the unique identifier – which was the whole point of the GDPR – goes against the letter of the law for a RTBF request. Yes, it does.  But the GDPR also allows you to keep information required to protect against a lawsuit.  Not honoring RTBF requests could cost your company big time, so my personal, non-legal opinion is that this is a perfectly valid thing to do after you’ve honored a RTBF request – in order to make sure they stay forgotten.

How are you going to deal with this problem?  What do you think of my idea?

Some hope about GDPR and backups.

I am not a lawyer. I’m not even what I consider a GDPR specialist. But I’ve read a lot of the text of the GDPR, and I’ve read a lot about it and watched a lot of videos. So I’d like to offer my layman’s interpretation of an important aspect of GDPR – the right to be forgotten – and whether or not it means we have to delete data from our backups.

Let’s talk about this

I have an opinion on this issue, but it’s not a legal opinion. I’d love to hear your opinion, especially if it differs from mine. Let’s see some comments on this one, shall we?  Here’s the official GDPR website where you can read it for yourself.

The easy stuff

There are all kinds of GDPR articles about making sure we have consent and a reason to store personal data, making sure we take care of it when we do, making sure we store it securely, etc. There’s even a line that says we need “the ability to restore the availability and access to personal data in a timely manner in the event of a physical or technical incident.” My personal opinion is that you should have been doing all this stuff already, which is why I’m calling it “the easy stuff.”  (The easy stuff isn’t easy if you haven’t been doing it, but it’s easy in that all the technology is there to do it.  All you have to do is implement it.)

A little bit harder

You need a way to search any systems that store personal data. You need to be able to query for any records referencing a given email address, IP address, physical address, etc.  Hopefully you have that already, but if not, that will require some work to comply with.  This is needed to satisfy the data request and right to be forgotten provisions.
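As a sketch of what “search everything for this email address” might look like, here’s a small Python function that walks every table in a SQLite database and returns the rows where any column equals a given value. The schema is invented for the example, and a real environment would need the equivalent for each system that holds personal data; this just shows the shape of the job.

```python
import sqlite3


def find_references(conn: sqlite3.Connection, value: str) -> dict[str, list[tuple]]:
    """Return rows, grouped by table, where any column equals `value`."""
    hits: dict[str, list[tuple]] = {}
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' "
            "AND name NOT LIKE 'sqlite_%'"
        )
    ]
    for table in tables:
        # Column names come from the database's own catalog, not from user input.
        columns = [row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')]
        if not columns:
            continue
        where = " OR ".join(f'"{col}" = ?' for col in columns)
        rows = conn.execute(
            f'SELECT * FROM "{table}" WHERE {where}', [value] * len(columns)
        ).fetchall()
        if rows:
            hits[table] = rows
    return hits


# Hypothetical marketing schema, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE leads (lead_id INTEGER PRIMARY KEY, name TEXT, email TEXT);
    CREATE TABLE form_submissions (
        submission_id INTEGER PRIMARY KEY, email TEXT, ip_address TEXT
    );
    CREATE TABLE products (product_id INTEGER PRIMARY KEY, title TEXT);
    INSERT INTO leads VALUES (1, 'Steve Smith', 'steve@example.com');
    INSERT INTO form_submissions VALUES (7, 'steve@example.com', '1.1.1.1');
    INSERT INTO products VALUES (3, 'Widget');
""")

matches = find_references(conn, "steve@example.com")
```

The same call with an IP address or a physical address would cover the other kinds of lookups mentioned above, as long as the values are stored in a consistent format.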

If you’re using “natural keys” as the primary keys in your database, that’ll have to change. Any information that could be deemed personal information by the GDPR should not be used as the primary key in a database.

The first reason is what happens if you are asked to delete a given record whose primary key is the IP address where the user filled out a form, or the email address they used to do so. If you reference that primary key in other records, you’ll have to do a cascading delete of any records that reference that key, in addition to deleting the primary record.  I’ll discuss the other reason this is important later in the article. Suffice it to say this may require a significant design change in your database system.

It goes way beyond employees

I’ve heard a lot of people talking about employees as if they are the main data subjects under the GDPR.  They are covered under GDPR, but IMO employees fall under the easy stuff. It’s easy to prove consent when you have an employment contract. You’re probably already securely storing that data, and you probably also have a pretty simple way of searching for those records to comply with any requests for that data. You also have a valid reason to not comply with any erasure requests, because you can say that you’re keeping it to be able to defend against any lawsuits, which is an exception to the erasure requirement. (There are several reasons you don’t have to erase data; one of them is if you are keeping it to protect against lawsuits.)  My opinion is that everything I just said also applies to customers. You have a contract with them, you have a reason to keep their information, you can easily search for it, and you have a reason to not delete it.  Easy peasy.  (Remember, I’m not a lawyer, and I’m curious about your take on this.)

The rub comes when you’re storing data about non-employees and non-customers.  You will have to prove that you got affirmative consent to store the information, you’ll need to supply it when asked, and you’ll need to delete it when asked. Now things get a little hairy. It’s out of the scope of this blog, but this means you have to do things like have an unchecked checkbox that they have to check to give you permission to store the info. And you should be storing any personal data in a system that allows you to easily search for the data if someone asks for it.

But what about backups? Do I have to delete the backups?

No one knows for sure because there’s no case law on it and the GDPR itself is somewhat unclear on the issue. We won’t know until someone gets sued under the GDPR for not deleting data from their backups.  If a court rules that backups are part of what we’re supposed to delete, we’re all in a world of hurt. If they rule in line with what I say below, then we can breathe easier. Let’s see what the GDPR says about the subject.

The GDPR seems more concerned with live copies of data

This is more a general feeling than anything I can directly quote, but it seems to be interested primarily in online, live copies of data that can be easily accessed. I’m guessing it’s because these are the copies that tend to get hacked and accidentally released to the public. You don’t really see any stories about how some hacker broke into someone’s backup system and restored a bunch of stuff to make it public. Heck, most companies can’t restore their own data properly. How’s a hacker going to do that?

The GDPR doesn’t mention backups.

Go ahead. Search the entire text of the GDPR for phrases like “backup,” or “back up.” You won’t find it. So no help on that front.

The GDPR does mention restores

The writers of the GDPR knew about backups and restores, because they mentioned that you need “the ability to restore the availability and access to personal data in a timely manner in the event of a physical or technical incident.” So they knew the concept of a backup exists, but chose not to mention it in the erasure section.

It does use the words archive and archival, but

When it uses the word archival, it seems to be referring to a large collection of information for a long period of time. And if you can prove you’re doing something like that “for the public good,” then it’s also exempt from erasure. For example, you can’t ask that CNN erase a story about you getting arrested.

The GDPR does mention copies

There’s a section that says you should take reasonable steps to erase any “links to, or copies or replications of those personal data” if you’ve made it public. But, again, this seems focused primarily on online copies of data that are replicated copies of the same data we’re trying to erase.

The GDPR uses the words “reasonable” and “excessive”

The GDPR is filled with words like “reasonable” and “excessive”. They understand that not everything is possible and that some things will require an excessive amount of effort. One example of this is in Recital 66 about Article 17 (the right to be forgotten article).  It says that if a controller has made the personal data public, it “should take reasonable steps, taking into account available technology and the means available to the controller, including technical measures.”

The GDPR doesn’t use the word “reasonable” in the erasure section

Interestingly enough, right where we’d like to see a “reasonable” section, there isn’t one.  There is one when it talks about what you have to do if you’ve already made the data public and are asked to delete it, but it doesn’t mention reasonability when talking about deleting the main source of the data or any backups of that data.

You do have to make sure data stays deleted

If you are asked to delete a particular piece of personal data, you do need to make sure it is deleted – and stays deleted.  But it’s virtually impossible (and certainly not reasonable) to delete records out of most backup systems, so how are you going to ensure a given record stays deleted if you do a restore?

Now we’re back to natural keys. You’ll need a way to find records pertaining to Steve Smith living at 123 anywhere lane, without storing the values of Steve Smith and 123 anywhere lane.  (Because doing that would be violating the deletion request.)  This is why you need to use something other than natural keys. If you’re not using natural keys, you can determine that Steve Smith at 123 anywhere lane is lead number 9303033138.  That is a unique value that is tied to his record, but is not personal data if you get rid of the other values. You can then create a separate table somewhere that tracks the lead numbers that must stay deleted from the marketing database – even if it’s restored.

If you restore the marketing database, you just need to make sure you delete lead number 9303033138 and any other leads listed in the DeletedLeads table – before you put that database back online. Because if you put the marketing database back online with Steve Smith’s address and email address still there – and then someone kicks off a marketing campaign that contacts Steve Smith after you said his records are deleted – you’re going to have a very easily provable GDPR violation on your hands.  Then we’re back to talking about those potentially huge fines.
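Here’s a minimal sketch of that process, using SQLite from Python. The DeletedLeads table and lead number 9303033138 come from the example above; the leads schema, the function names, and the idea of keeping the tracking table in its own database are my assumptions for illustration.

```python
import sqlite3


def forget_lead(live: sqlite3.Connection, tracker: sqlite3.Connection, lead_id: int) -> None:
    """Honor an erasure request: delete the lead, remember only its number."""
    live.execute("DELETE FROM leads WHERE lead_id = ?", (lead_id,))
    live.commit()
    tracker.execute(
        "INSERT OR IGNORE INTO DeletedLeads (lead_id) VALUES (?)", (lead_id,)
    )
    tracker.commit()


def scrub_after_restore(restored: sqlite3.Connection, tracker: sqlite3.Connection) -> int:
    """Delete every tracked lead from a restored copy before it goes online."""
    ids = [(row[0],) for row in tracker.execute("SELECT lead_id FROM DeletedLeads")]
    before = restored.total_changes
    restored.executemany("DELETE FROM leads WHERE lead_id = ?", ids)
    restored.commit()
    return restored.total_changes - before  # number of rows actually removed


def make_marketing_db() -> sqlite3.Connection:
    """Build a toy marketing database (hypothetical schema)."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE leads (lead_id INTEGER PRIMARY KEY, name TEXT, address TEXT)")
    db.execute("INSERT INTO leads VALUES (9303033138, 'Steve Smith', '123 anywhere lane')")
    db.execute("INSERT INTO leads VALUES (9303033139, 'Someone Else', '456 elsewhere road')")
    db.commit()
    return db


# The tracking table holds lead numbers only, no names or addresses.
tracker = sqlite3.connect(":memory:")
tracker.execute("CREATE TABLE DeletedLeads (lead_id INTEGER PRIMARY KEY)")

live = make_marketing_db()
forget_lead(live, tracker, 9303033138)

restored = make_marketing_db()  # stands in for a restore from an old backup
removed = scrub_after_restore(restored, tracker)
leads_left = restored.execute("SELECT lead_id FROM leads").fetchall()
```

The important design point is the ordering: the scrub runs against the restored copy while it is still offline, so nobody can kick off a campaign against Steve Smith’s resurrected record.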

I don’t think you have to delete data from your backups

My personal non-legal opinion is that as long as you have a process for making sure that deleted records stay deleted even after a restore – and you make sure you follow that process – you have a pretty defensible position. My personal opinion would also be to be upfront about this in your notification to the data subject.

Dear Steve Smith,

We have deleted all references to your personal data in our marketing database. For technical reasons we are unable to delete this information from our backup system, but that system is only used to restore the marketing database if it is damaged. We also have a system in place to ensure that your records will be immediately deleted if the marketing database is ever restored from the backup system.

Sincerely,

The DPO

Final thoughts

Backup vendors can and should be part of this process moving forward. Maybe in a few years’ time, we’ll have the ability to surgically remove records from a backup.  That would be very nice, and would be more elegant than having to do what I’m suggesting above.  This may indeed become a competitive differentiator for one or more backup companies moving forward.

What do you think?  Am I being too hopeful here?
