Some hope about GDPR and backups.

I am not a lawyer. I’m not even what I consider a GDPR specialist. But I’ve read a lot of the text of the GDPR, and I’ve read a lot about it and watched a lot of videos. So I’d like to offer my layman’s interpretation of an important aspect of GDPR – the right to be forgotten – and whether or not it means we have to delete data from our backups.

Note: This article is one in a series about GDPR.

Let’s talk about this

I have an opinion on this issue, but it’s not a legal opinion. I’d love to hear your opinion, especially if it differs from mine. Let’s see some comments on this one, shall we?  Here’s the official GDPR website where you can read it for yourself.

The easy stuff

There are all kinds of GDPR articles about making sure we have consent and a reason to store personal data, making sure we take care of it when we do, making sure we store it securely, etc. There’s even a line that says we need “the ability to restore the availability and access to personal data in a timely manner in the event of a physical or technical incident.” My personal opinion is that you should have been doing all this stuff already, which is why I’m calling it “the easy stuff.”  (The easy stuff isn’t easy if you haven’t been doing it, but it’s easy in that all the technology is there to do it.  All you have to do is implement it.)

A little bit harder

You need a way to search any systems that store personal data. You need to be able to query for any records referencing a given email address, IP address, physical address, etc.  Hopefully you have that already, but if not, that will require some work to comply with.  This is needed to satisfy the data request and right to be forgotten provisions.
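To make that concrete, here is a minimal sketch of such a lookup, written in Python against an in-memory SQLite database. The table names, column names, and the `find_personal_data` helper are all hypothetical; a real system would have its own schema and probably more than one data store to check.

```python
import sqlite3

# Hypothetical schema: these table and column names are illustrative only.
# Each table is mapped to the columns that may hold a personal identifier.
PERSONAL_DATA_COLUMNS = {
    "leads": ["email", "ip_address", "street_address"],
    "form_submissions": ["email", "ip_address"],
}


def find_personal_data(conn, identifier):
    """Return {table: [rows]} for every row where a listed column equals identifier."""
    hits = {}
    for table, columns in PERSONAL_DATA_COLUMNS.items():
        # Table and column names come from the trusted constant above;
        # the searched value is always passed as a bound parameter.
        where = " OR ".join(f"{col} = ?" for col in columns)
        rows = conn.execute(
            f"SELECT * FROM {table} WHERE {where}", [identifier] * len(columns)
        ).fetchall()
        if rows:
            hits[table] = rows
    return hits


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE leads (lead_id INTEGER PRIMARY KEY, name TEXT, email TEXT,
                        ip_address TEXT, street_address TEXT);
    CREATE TABLE form_submissions (submission_id INTEGER PRIMARY KEY,
                                   email TEXT, ip_address TEXT, form_name TEXT);
    INSERT INTO leads VALUES (1, 'Steve Smith', 'steve@example.com',
                              '203.0.113.7', '123 Anywhere Lane');
    INSERT INTO leads VALUES (2, 'Jane Doe', 'jane@example.com',
                              '198.51.100.4', '9 Elsewhere Road');
    INSERT INTO form_submissions VALUES (10, 'steve@example.com',
                                         '203.0.113.7', 'newsletter');
""")

# Every table that mentions this email address shows up in the result.
results = find_personal_data(conn, "steve@example.com")
```

The same function works for an IP address or a street address, since it simply compares the value against every column listed for each table.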

If you’re using “natural keys” as the primary keys in your database, that’ll have to change. Any information that could be deemed personal information by the GDPR should not be used as the primary key in a database.

The first reason is what happens if you are asked to delete a given record that uses the primary key of the IP address where the user filled out a form, or the email address they used to do so. If you reference that primary key in other records, you’ll have to do a cascading delete of any records that reference that key, in addition to deleting the primary record.  I’ll discuss the other reason this is important later in the article. Suffice it to say this may require a significant design change in your database system.
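As a rough illustration of the alternative, the sketch below uses a surrogate key (an opaque number) as the primary key, so the value that other tables reference is not itself personal data. It uses Python with SQLite; the `leads` and `campaign_contacts` tables and the `erase_lead` helper are assumptions made up for this example, not a recommended schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys (and cascades) when this is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE leads (
        lead_id INTEGER PRIMARY KEY,   -- surrogate key, carries no personal data
        name TEXT,
        email TEXT UNIQUE,             -- still searchable, but not the key
        ip_address TEXT
    );
    CREATE TABLE campaign_contacts (
        contact_id INTEGER PRIMARY KEY,
        lead_id INTEGER NOT NULL REFERENCES leads(lead_id) ON DELETE CASCADE,
        campaign TEXT
    );
    INSERT INTO leads VALUES (9303033138, 'Steve Smith', 'steve@example.com', '203.0.113.7');
    INSERT INTO leads VALUES (9303033139, 'Jane Doe', 'jane@example.com', '198.51.100.4');
    INSERT INTO campaign_contacts VALUES (1, 9303033138, 'spring-promo');
    INSERT INTO campaign_contacts VALUES (2, 9303033139, 'spring-promo');
""")


def erase_lead(conn, email):
    """Delete the lead with this email and return its surrogate key (or None)."""
    row = conn.execute(
        "SELECT lead_id FROM leads WHERE email = ?", (email,)
    ).fetchone()
    if row is None:
        return None
    with conn:  # commits on success, rolls back on error
        # Rows in campaign_contacts that reference this lead are removed
        # by the ON DELETE CASCADE rule.
        conn.execute("DELETE FROM leads WHERE lead_id = ?", (row[0],))
    return row[0]


erased_id = erase_lead(conn, "steve@example.com")
remaining_contacts = conn.execute("SELECT lead_id FROM campaign_contacts").fetchall()
```

The number returned by `erase_lead` identifies the erased record without containing a name, email address, or IP address.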

It goes way beyond employees

I’ve heard a lot of people talking about employees as if they are the main data subjects under the GDPR. They are covered under GDPR, but in my opinion employees fall under the easy stuff. It’s easy to prove consent when you have an employment contract. You’re probably already securely storing that data, and you probably also have a pretty simple way of searching for those records to comply with any requests for that data. You also have a valid reason to not comply with any erasure requests, because you can say that you’re keeping it to be able to defend against any lawsuits, which is an exception to the erasure requirement. (There are several reasons you don’t have to erase data; one of them is if you are keeping it to protect against lawsuits.) My opinion is that everything I just said also applies to customers. You have a contract with them, you have a reason to keep their information, you can easily search for it, and you have a reason to not delete it. Easy peasy. (Remember, I’m not a lawyer, and I’m curious about your take on this.)

The rub comes when you’re storing data about non-employees and non-customers.  You will have to prove that you got affirmative consent to store the information, you’ll need to supply it when asked, and you’ll need to delete it when asked. Now things get a little hairy. It’s out of the scope of this blog, but this means you have to do things like have an unchecked checkbox that they have to check to give you permission to store the info. And you should be storing any personal data in a system that allows you to easily search for the data if someone asks for it.

But what about backups? Do I have to delete the backups?

No one knows for sure, because there’s no case law on it and the GDPR itself is somewhat unclear on the issue. We won’t know until someone gets sued under the GDPR for not deleting data from their backups. If a court rules that backups are part of what we’re supposed to delete, we’re all in a world of hurt. If it rules in line with what I say below, then we can breathe easier. Let’s see what the GDPR says about the subject.

The GDPR seems more concerned with live copies of data

This is more a general feeling than anything I can directly quote, but it seems to be interested primarily in online, live copies of data that can be easily accessed. I’m guessing it’s because these are the copies that tend to get hacked and accidentally released to the public. You don’t really see any stories about how some hacker broke into someone’s backup system and restored a bunch of stuff to make it public. Heck, most companies can’t restore their own data properly. How’s a hacker going to do that?

The GDPR doesn’t mention backups.

Go ahead. Search the entire text of the GDPR for phrases like “backup,” or “back up.” You won’t find it. So no help on that front.

The GDPR does mention restores

The writers of the GDPR knew about backups and restores, because they mentioned that you need “the ability to restore the availability and access to personal data in a timely manner in the event of a physical or technical incident.” So they knew the concept of a backup exists, but chose not to mention it in the erasure section.

It does use the words archive and archival, but

When it uses the word archival, it seems to be referring to a large collection of information for a long period of time. And if you can prove you’re doing something like that “for the public good,” then it’s also exempt from erasure. For example, you can’t ask that CNN erase a story about you getting arrested.

The GDPR does mention copies

There’s a section that says you should take reasonable steps to erase any “links to, or copies or replications of those personal data” if you’ve made it public. But, again, this seems focused primarily on online copies of data that are replicated copies of the same data we’re trying to erase.

The GDPR uses the words “reasonable” and “excessive”

The GDPR is filled with words like “reasonable” and “excessive.” The authors understand that not everything is possible and that some things will require an excessive amount of effort. One example of this is in Recital 66 about Article 17 (the right to be forgotten article). It says that if a controller has made the personal data public, it “should take reasonable steps, taking into account available technology and the means available to the controller, including technical measures.”

The GDPR doesn’t use the word “reasonable” in the erasure section

Interestingly enough, right where we’d like to see a “reasonable” section, there isn’t one.  There is one when it talks about what you have to do if you’ve already made the data public and are asked to delete it, but it doesn’t mention reasonability when talking about deleting the main source of the data or any backups of that data.

You do have to make sure data stays deleted

If you are asked to delete a particular piece of personal data, you do need to make sure it is deleted – and stays deleted. But it’s virtually impossible (and certainly not reasonable) to delete records out of most backup systems, so how are you going to ensure a given record stays deleted if you do a restore?

Now we’re back to natural keys. You’ll need a way to find records pertaining to Steve Smith living at 123 anywhere lane, without storing the values of Steve Smith and 123 anywhere lane.  (Because doing that would be violating the deletion request.)  This is why you need to use something other than natural keys. If you’re not using natural keys, you can determine that Steve Smith at 123 anywhere lane is lead number 9303033138.  That is a unique value that is tied to his record, but is not personal data if you get rid of the other values. You can then create a separate table somewhere that tracks the lead numbers that must stay deleted from the marketing database – even if it’s restored.

If you restore the marketing database, you just need to make sure you delete lead number 9303033138 and any other leads listed in the DeletedLeads table – before you put that database back online. Because if you put the marketing database back online with Steve Smith’s address and email address still there – and then someone kicks off a marketing campaign that contacts Steve Smith after you said his records are deleted – you’re going to have a very easily provable GDPR violation on your hands.  Then we’re back to talking about those potentially huge fines.
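Here is a minimal sketch of that process in Python with SQLite, assuming the DeletedLeads table lives in a separate tracking database that is not overwritten when the marketing database is restored. The `leads` table layout and the `record_erasure` and `scrub_restored_database` helpers are hypothetical names invented for this example.

```python
import sqlite3


def record_erasure(tracking_conn, lead_id):
    """Remember that this lead number must stay deleted. Only the key is stored."""
    with tracking_conn:
        tracking_conn.execute(
            "INSERT OR IGNORE INTO DeletedLeads (lead_id) VALUES (?)", (lead_id,)
        )


def scrub_restored_database(restored_conn, tracking_conn):
    """Delete every lead listed in DeletedLeads from a freshly restored database.

    Returns the number of rows removed. Meant to run before the restored
    database is put back online.
    """
    deleted_ids = [
        row[0] for row in tracking_conn.execute("SELECT lead_id FROM DeletedLeads")
    ]
    removed = 0
    with restored_conn:  # one transaction: either every listed lead goes, or none
        for lead_id in deleted_ids:
            cur = restored_conn.execute(
                "DELETE FROM leads WHERE lead_id = ?", (lead_id,)
            )
            removed += cur.rowcount
    return removed


# The tracking database holds lead numbers only, no names or addresses.
tracking = sqlite3.connect(":memory:")
tracking.execute("CREATE TABLE DeletedLeads (lead_id INTEGER PRIMARY KEY)")
record_erasure(tracking, 9303033138)

# Simulate a restore from a backup taken before the erasure request,
# so the erased lead has reappeared.
restored = sqlite3.connect(":memory:")
restored.executescript("""
    CREATE TABLE leads (lead_id INTEGER PRIMARY KEY, name TEXT, email TEXT);
    INSERT INTO leads VALUES (9303033138, 'Steve Smith', 'steve@example.com');
    INSERT INTO leads VALUES (9303033139, 'Jane Doe', 'jane@example.com');
""")

removed = scrub_restored_database(restored, tracking)
remaining = restored.execute("SELECT lead_id FROM leads").fetchall()
```

Running the scrub a second time removes nothing, so it is safe to make it a mandatory step of every restore procedure rather than something an operator has to remember to skip or repeat.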

I don’t think you have to delete data from your backups

My personal non-legal opinion is that as long as you have a process for making sure that deleted records stay deleted even after a restore – and you make sure you follow that process – you have a pretty defensible position. My personal opinion would also be to be upfront about this in your notification to the data subject.

Dear Steve Smith,

We have deleted all references to your personal data in our marketing database. For technical reasons we are unable to delete this information from our backup system, but that system is only used to restore the marketing database if it is damaged. We also have a system in place to ensure that your records will be immediately deleted if the marketing database is ever restored from the backup system.

Sincerely,

The DPO

Final thoughts

Backup vendors can and should be part of this process moving forward. Maybe in a few years’ time, we’ll have the ability to surgically remove records from a backup.  That would be very nice, and would be more elegant than having to do what I’m suggesting above.  This may indeed become a competitive differentiator for one or more backup companies moving forward.

What do you think?  Am I being too hopeful here?

Written by W. Curtis Preston (@wcpreston), four-time O'Reilly author, and host of The Backup Wrap-up podcast. I am now the Technology Evangelist at Sullivan Strickler, which helps companies manage their legacy data.

13 comments
    • Thanks! It’s a huge topic and you have to bite it off in chunks

  • I think you reasoned this out so well that this will be brought forward in any such related case, possibly as “the state of the art”. I wonder if you will get to be an expert witness on this topic in any GDPR court cases. I agree with your logic, and it matches what I’ve gotten out of everything I’ve read, so sorry I can’t be a contrasting point of view.
    As for the Subscribe Options you have pop up, I went with option #3 long ago for this blog and it still works wonderfully for me. Using RSS means I can readily manage and set the times I can read them rather than losing track of them in the fire-hoses that represent the first two options. Pity it isn’t as easy for you to track who follows you that way.

    • I’ve been an expert witness before. Happy to do it again.

      Agreed on the RSS thing.

  • Thanks, Curtis, for a very interesting and thought provoking article. I’m not a lawyer either (I work for a tape vendor), but I think the rationale for how you can use non-natural keys to scrub restores as they come back online is very logical and would hopefully intercept records that had been deleted from the live system, but remained somewhere on a backup.

    Obviously, organisations can choose whether or not to proactively search and index their backup and archive data in advance (on-prem, off-prem, cloud, tape etc) as a risk management decision: to be completely safe, you would, but to avoid undertaking a possibly herculean task, you could take the view that offline and effectively irretrievable is tantamount to being erased/forgotten. You then just rely on the intercept process you describe as and when you need to bring data in from the cold.

    Where I think the case law needs to be defined is surrounding much older backups (wherever they are) where you might not have appended your non-natural key. As these came back online, my thought process is that there would be no key within the dataset to match against the ‘Deleted Records’ table? And if you had securely erased all the natural data, you (presumably) won’t have a list of emails or other contact details to cross-refer against either (as these would have been deleted!)

    In this instance, I can’t really see any option other than restoring and indexing backups, ideally sooner rather than later before the Article 17 requests start rolling in. Then you could determine if you really need that data – in which case add a GDPR non-natural key – and re-archive the content to whichever platform fits best. Or you might simply decide the task is too onerous/you don’t need that data, and destroy the old content to put it out of harm’s way.

    Arguably, all of this is a good thing, reducing the scope of data to manage; sorting good from crud; and potentially unlocking new insights from the good.

    But it’s still going to take a lot of time and effort and I wonder how many companies are on-track to complete the task by 25th May?

    • That’s actually a really good point on the older backups. One thing I’d say is that retention periods tend to be (and should be IMO) much shorter on structured data. Why in the world would you need a backup of your purchasing database from a year ago? I know many people who keep such data for 30 days or less.

      But if someone did, I think it’s quite unreasonable to ask them to restore and scrub old database backups. But it is a conundrum. How would you scrub a restore from an older backup of a database that used natural keys? The answer is you can’t.

      I think the more reasonable thing to do would be to draw a line in the sand from, say, 60 days before you stopped using natural keys. Any structured backups from before that point should be erased. The options outside of that all come with huge consequences. You could:

      1. Keep one natural key that could be used as a relatively unique identifier. (Not a name, but something like an SSN in the US, although they don’t have that in Europe. Or you keep the IP address or street address of the person without their name.) This data would need to be special access only, maybe requiring two-person authentication or something. Then if you do a restore from that older data for whatever reason, you just scrub it of any records matching that unique key. You risk deleting somebody you didn’t mean to. This process might be considered a violation of the GDPR, but it’s a minor one IMO if that’s all you kept. AND I think you could make an argument that the only reason you kept those one or two pieces of data is to protect against lawsuits about GDPR. This one’s HAIRY.

      2. Draw a line in the sand and agree that any restores of structured data from before that time must be done in a sandbox and only for certain purposes. Data will never be copied from that sandbox into any other production area. This one’s less hairy as long as we all stick to the plan. A single breach of the plan could be very expensive.

      Good point, though!

  • I agree fully with what you are saying about backup and GDPR – do not delete any information from backups.
    But you need to have a process described and in place for the restores that might take place to ensure that under “right to be forgotten” changes or deletes made between now and the time backup was taken are reset before restored data is taken to production. (This probably means that you need to have a log file or database for the GDPR “right to be forgotten” actions taken in place at any time.)

    • This may sound like GDPR blasphemy, but the more I think about it, the more I think a log of one or two identifiers (and that’s it) may be necessary for some period of time. Just enough to ID the person, but nothing else about them. E.g. in the US you would keep just the SSN and nothing else. Not sure what unique identifier I would use in the EU.

      I’m not an app specialist, but my opinion would also be that I should be able to search that database for a unique identifier (to comply with a later info request), but any exports of that database would require super special access, like two-person authentication and would only be used in special circumstances.

      My opinion is that this is in keeping with the spirit of the GDPR, if not entirely with the letter. It’s in keeping with the spirit because it prevents the main thing the GDPR is trying to prevent, which is a treasure trove of info about me tied to my unique identifier being inappropriately accessed somehow. On the other hand, it’s NOT in keeping with part of the letter, in that the whole point is to NOT store the fact that IP address x.x.x.x and Steve Smith are related. But I could also argue that keeping that one piece of info is required to protect against lawsuits (which is an exception covered under the GDPR).

      Again, I’m not an attorney, but I do specialize in data protection. I’m speaking from my 25 years of experience with how THOSE systems work.

  • Was thinking the same thing: re-apply the “forget” process after a restore. Then the emphasis is on managing the list of requests to-be-forgotten and the unique identifier. I ran into the same problem: what’s a good identifier? I guess that’s going to be SSN in the US. Additionally, we’ll need to account for a process where a person who was forgotten now needs to be remembered because we have a new relationship with them (customer, employee) or they choose to be remembered because they gain another benefit. So we keep the previous info forgotten but the new info stays.

  • Great post Curtis! I hadn’t even considered most of this! Thanks for posting

  • Curtis, great post and an important topic. I’m going to suggest a slightly different angle to consider. That is of the ‘spirit’ of what GDPR is intended to do. It has nothing to do with distinguishing between backups, archives, local copies and the like. It has more to do with the proper protection and management of a person’s identifying data.

    Considering the spirit of GDPR, I would argue that backups are fair game. Let’s face it that backups (collectively backups, archives, etc) have been a challenge for a very long time. As you have stated, there are times when we can’t retrieve what we want from them. But then again, there is sensitive data on them too.

    Therefore, IT as a whole can’t look at backups as a fixed data storage any longer. Now we need to consider the impact of that ‘backup debt’ that we have been sitting on for years. Having a data management policy that is truly adhered to will help.

    My argument is that GDPR will push us to modernize our storage of backups on tape and cold storage to identify data and be able to manage it effectively. As such, this will cause us, however painful, to reconsider how we manage data on the whole. We have needed to do this for a long time, but I expect GDPR to push IT forward in doing it.

    In sum, backups are fair game for GDPR. Get ready.

    • I agree that this will require us to modernize our backups. I know my employer is working on that as well. But my point is about current tech, that will still be in place for many, many years, even if all backups were modernized today. The cost to a company could be in the millions of dollars to wipe out one name from backups. Then they have to do it again for the next name? Never going to happen.