GDPR: How should we keep a record of deleted data?

I’m still just thinking out loud here.  Again… not an attorney.  I have read the GDPR and done some analysis of it, primarily around the right to be forgotten (RTBF) and how it pertains to data protection systems. I just want to start the conversation about some of these very important topics and see what people are thinking.

Note: This article is one in a series about GDPR.

No one is scrubbing backups

As I mentioned in my previous post, my opinion is that it is not reasonable to expect companies to delete data from their backups in order to satisfy an RTBF request.  It’s simply not technically feasible given modern technology. I do believe companies should switch to non-natural values for the primary keys of their databases. It’s the latter that I want to talk about, based on some comments I received on my last post.

I stand by my opinion about non-natural keys for databases that store personal information. This allows you to delete a record while storing the record identifier, which isn’t personal data. That way you could easily check in the future if you have data that’s supposed to be deleted, such as if you restore the database to a point before the data is deleted.
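To make that concrete, here is a minimal sketch of the idea, not a definitive implementation. It assumes a SQLite database, and the table names (customers, forgotten_ids) and function names are hypothetical ones I made up for illustration. The customer row is keyed by a random UUID rather than anything personal, and honoring an RTBF request deletes the row while keeping only that UUID.

```python
import sqlite3
import uuid


def make_db():
    """Create an in-memory database with a surrogate (non-natural) primary key."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id TEXT PRIMARY KEY, ssn TEXT, name TEXT)")
    # Holds only the surrogate IDs of forgotten people, no personal data.
    conn.execute("CREATE TABLE forgotten_ids (id TEXT PRIMARY KEY)")
    return conn


def add_customer(conn, ssn, name):
    """Insert a customer under a random UUID and return that UUID."""
    customer_id = str(uuid.uuid4())
    conn.execute("INSERT INTO customers VALUES (?, ?, ?)", (customer_id, ssn, name))
    return customer_id


def forget(conn, customer_id):
    """Honor an RTBF request: delete the row, keep only its surrogate ID."""
    with conn:  # commits on success, rolls back on error
        conn.execute("DELETE FROM customers WHERE id = ?", (customer_id,))
        conn.execute("INSERT OR IGNORE INTO forgotten_ids VALUES (?)", (customer_id,))


def resurrected(conn):
    """Return IDs that are present again even though they were forgotten."""
    rows = conn.execute(
        "SELECT c.id FROM customers c JOIN forgotten_ids f ON c.id = f.id"
    ).fetchall()
    return [row[0] for row in rows]


def purge_resurrected(conn):
    """Delete any forgotten rows that came back, e.g. after a restore."""
    with conn:
        cur = conn.execute(
            "DELETE FROM customers WHERE id IN (SELECT id FROM forgotten_ids)"
        )
    return cur.rowcount
```

After a restore to a point before the deletion, you would run resurrected() to see what came back and purge_resurrected() to delete it again. In real life the forgotten_ids table would have to live outside the database being restored, or the restore would roll it back too.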

But the commenter on my last article has a good point. What if you restore the database to a point before you started using non-natural keys? Suppose you follow the suggestion and stop using natural keys today.  But you still have backups from before today that don’t have non-natural keys, and you may have to keep those backups for a long period of time.  (You shouldn’t, as you should only be keeping archives for that amount of time, but we all know that at least half of you are keeping your backups for years.  Even if you were using archives, the problem of scrubbing them is just as hard, so they could cause the same problem.)

But what about this?

So, it’s three years from now and you need to restore a database from a backup you took before you switched to non-natural keys.  In the past three years you have received hundreds of RTBF requests that you need to continue to honor, but you just restored a database that has those records in it, and it doesn’t have that non-natural key you stored in order to make sure the data stays deleted.  How are you going to find and delete those records if you didn’t keep the natural keys you were using before you switched away from them?

Again, my opinion is that you’re going to have to keep enough data to identify a unique person in order to continue to honor RTBF requests after they’ve been fulfilled. Get rid of all other data about the person, store just enough to identify them, and put that in the most secure database you have. You could then use that database in one or both of the following two ways.

One way would be to have an app that could read the data in the database, never display it to anyone, but occasionally check whether any of the records in it are found in one or more of your other databases.  The main use case for this method would be after a restore from an older backup.  You could point this app to that restored database so it could clean it.  You could also use it proactively to periodically check your entire environment for deleted records and delete them if they are found.
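A rough sketch of such a scrubbing app might look like the following. Everything here is an assumption for illustration: both databases are SQLite, the secure database has a table called forgotten with a natural_key column, and the restored database has a customers table keyed by ssn. The function reads the identifiers, deletes matching rows, and reports only a count, so the identifiers themselves are never displayed.

```python
import sqlite3


def scrub_restored(secure, restored):
    """Delete rows from the restored database whose natural key is on the
    forgotten list held in the secure database.

    Returns only the number of rows deleted; identifiers are never printed.
    Table and column names (forgotten.natural_key, customers.ssn) are
    hypothetical.
    """
    deleted = 0
    with restored:  # one transaction: commit if all deletes succeed
        for (key,) in secure.execute("SELECT natural_key FROM forgotten"):
            cur = restored.execute("DELETE FROM customers WHERE ssn = ?", (key,))
            deleted += cur.rowcount
    return deleted
```

You would run this once right after a restore from an old backup, or on a schedule against each database in the environment for the proactive check.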

Another way to use it would be to set it up so that you could only query it by the unique identifier; data is never exported or sent to another app.  So you could run a query to see if SSN 123-34-3222 is in it.  If a record is found, it is supposed to be forgotten, so it should be deleted.  So, again, in the case of a restored database you could check every record in the restored database against the deleted records, and delete any that are found.  It’s less efficient than the previous method, but it’s more secure.
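Here is a minimal sketch of that query-only approach, under the same made-up assumptions as before (SQLite, a forgotten table with a natural_key column, a customers table keyed by ssn). The ForgottenRegistry class is hypothetical; the point is that its only public method answers yes or no for a single identifier and never lists or exports what it holds.

```python
import sqlite3


class ForgottenRegistry:
    """Wraps the secure database and exposes a single yes/no lookup."""

    def __init__(self, conn):
        self._conn = conn

    def is_forgotten(self, identifier):
        """Return True if this identifier belongs to a forgotten person."""
        row = self._conn.execute(
            "SELECT 1 FROM forgotten WHERE natural_key = ? LIMIT 1", (identifier,)
        ).fetchone()
        return row is not None


def scrub_by_lookup(registry, restored):
    """Check every restored record against the registry, one query per record,
    and delete the ones that are supposed to be forgotten."""
    keys = [key for (key,) in restored.execute("SELECT ssn FROM customers")]
    doomed = [key for key in keys if registry.is_forgotten(key)]
    with restored:
        restored.executemany(
            "DELETE FROM customers WHERE ssn = ?", [(key,) for key in doomed]
        )
    return len(doomed)
```

The one-lookup-per-record loop is where the inefficiency comes from: the cost grows with the size of the restored database rather than with the number of RTBF requests.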

I think this is defensible. Do you?

On one hand, keeping the unique identifier looks like it goes against the letter of the law for an RTBF request, since getting rid of that kind of data was the whole point of the GDPR. Yes, it does.  But the GDPR also allows you to keep information required to protect against a lawsuit.  Not honoring RTBF requests could cost your company big time, so my personal, non-legal opinion is that this is a perfectly valid thing to do after you’ve honored an RTBF request, in order to make sure they stay forgotten.

How are you going to deal with this problem?  What do you think of my idea?

Written by W. Curtis Preston (@wcpreston), four-time O'Reilly author, and host of The Backup Wrap-up podcast. I am now the Technology Evangelist at Sullivan Strickler, which helps companies manage their legacy data.

4 comments
  • Curtis, this goes to my earlier comment to your other post. I think you’re splitting hairs here in a way that non-technical folks won’t care about.

    At the end of the day, if a company gets a request to delete data, the expectation is that it will *all* be deleted. Of course there will be exceptions…but those should be incredibly specific and rare. Anyone who leverages these loopholes can expect to get dinged at some point in the future.

    I would argue that it is best practice not to try and sidestep the regulations, but rather try to fully comply with them and keep things above-board.

    • Who said I want to sidestep? I want to comply. But complying means I need to acknowledge the old backups problem. (Until we solve the inability to delete a person from backups problem.) And I don’t see any way to fix the old backups problem w/o keeping at least one record of the ID for future scrubbing purposes. And I believe in doing it above board.

      As to no one but techy people understanding/caring… of course. But there needs to be a partnership between the tech folks and the regulation folks.

  • Curtis, here’s what I’d be more worried about; companies that are using backup systems to create long-term searchable archives. You mention the risk of restoring data where the records have been deleted. What happens if you’re doing large-scale search across data, where some of those older backups contain data from users that should be deleted?

    I guess the issue is what specifically you’re using secondary data for, whether for trending or other more detailed specific retrieval. I can see this being an issue though.

    • Chris, that is a GREAT point. My main subject of the last two blogs was, shall we say, “traditional” backups. That is, copies made for the specific purpose of restoring damaged/deleted data or servers. NOT for the purpose of ediscovery, analytics, etc. But thanks for opening another can of worms. That’ll be the subject of my next blog. 🙂

      I think the products that are going to have a hard time, though, are those that cross that boundary … backup products that are storing data the way backup products typically stored backups, but then adding archive functionality. It’s simply not possible to surgically remove a record from a database that’s inside a backup stream, that’s inside a tarball. So not sure what they’re going to do.