The GDPR is unclear about backups

No one knows for sure whether backups are going to be included in the right to be forgotten (RTBF). Even the Information Commissioner’s Office (ICO), the UK regulator that enforces the GDPR, isn’t being entirely clear about it yet. But that hasn’t stopped people from expressing very strong opinions on the subject.

I’ve now written several articles on this topic, and I’ve seen a variety of responses to my comments so far, especially in the comments on the article Chris Mellor wrote for The Register, which mentioned my articles. The ones that crack me up are the people who are absolutely sure about how GDPR works, even if the ICO isn’t. This is especially true when they’re trying to sell me something to fix a GDPR problem. Be wary of GDPR fear mongers trying to sell you something.

Just in case you’re wondering, even though I work at a data management as a service (DMaaS) vendor, I’ve got no axe to grind here. I’m aware of no products in this space that have a stronger GDPR story than Druva, so I have no reason to convince you this isn’t a problem.  (We are able to help you find files that match certain parameters, and can help you delete them from backups.  Like everyone else, however, we are not yet able to delete data from within a backup of a structured database. I am still unaware of any products that solve this problem.)

My only goal here is to start a conversation about this difficult topic. That I’ve clearly done.  So I’m going to keep talking about it until we know better.

Experts abound

There are what I would call the “GDPR purists,” whose position sounds something like “what part of forgotten do you not understand?” Clearly these people feel that backups and archives are included in the RTBF, and they really don’t care how much it would cost a company to comply.  Most importantly, they’re certain any companies not agreeing with them will be out of compliance and subject to huge fines.

I also get comments that are the complete opposite, where people are certain that backups are not (and will not be) included in RTBF requests. While my opinion is that these people are closer to what is likely to happen, their certainty is just as dangerous as the certainty of the purists on the other side.

It’s all conjecture

The only thing I know for certain is that anyone who says they “know” the answer to this question is definitely wrong. Consider what happened when Chris Mellor contacted the ICO for his article. At first, their response seems to favor the “definitely included” folks. They said, “Merely because it may be considered ‘technically difficult’ to comply with some of its requirements does not mean organisations can ignore their obligations.”

The ICO knows that the RTBF is going to be hard (even without the backup part of the problem), and they want you to know that you can’t just say “erasing every reference to a given person is really hard” as a defense for not doing it.  Don’t even think about trying that one, they’re saying.  They need everyone to know they mean business.

But I also don’t think we’re talking about technically difficult; we’re talking technically impossible. After 25 years of experience in this field, I can easily say it is technically impossible to erase data from inside a structured database inside a backup without corrupting the backup. Almost all backups are image based, not record based. Even if you could identify the blocks pertaining to a given record inside a database, deleting those blocks would corrupt the rest of the backup.  You literally cannot do that.
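Here is a toy sketch of why cutting blocks out of an image-based backup breaks it. It assumes a made-up backup format in which the whole image is protected by a single SHA-256 checksum. The block size and the function names (make_image, verify, zero_block) are hypothetical, and real backup formats are far more involved, but the basic effect is the same: change one block and the image no longer verifies.

```python
import hashlib

BLOCK_SIZE = 16  # tiny block size, for illustration only


def make_image(blocks):
    """Concatenate fixed-size blocks into one image and checksum the whole thing."""
    image = b"".join(blocks)
    return image, hashlib.sha256(image).hexdigest()


def verify(image, checksum):
    """Return True only if the image still matches the checksum taken at backup time."""
    return hashlib.sha256(image).hexdigest() == checksum


def zero_block(image, index):
    """Overwrite a single block with zeros, as a naive 'erase this record' would."""
    start = index * BLOCK_SIZE
    return image[:start] + bytes(BLOCK_SIZE) + image[start + BLOCK_SIZE:]


# Four pretend database blocks, each padded to BLOCK_SIZE bytes.
blocks = [f"record-{i}".encode().ljust(BLOCK_SIZE, b".") for i in range(4)]
image, checksum = make_image(blocks)

# "Forget" the record held in block 2 by zeroing it inside the backup image.
tampered = zero_block(image, 2)

print(verify(image, checksum))     # the untouched image verifies
print(verify(tampered, checksum))  # the edited image does not
```

In this toy model the size of the image is unchanged, yet the backup can no longer prove that it is intact, and a real database restored from it would also be missing pages it expects to find.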

The idea that you would restore every copy of the database you have, delete the record in question, and then back up the database again is simply ludicrous. And you would do that every time you got such a request? That’s just nonsense. Besides the fact that the process would be so expensive that it would be cheaper to pay the fine, there’s a huge element of risk in the process. It places you in possible violation of another part of the GDPR: that you must be able to safely restore personal data that was deleted or corrupted. (A mistake in the process could corrupt all backups of every database.)

There is another proposed solution of converting your backups to archives by scanning & indexing them, and then deleting the backup tapes. This process sounds interesting until you learn it doesn’t solve the problem I keep bringing up – personal data stored inside an RDBMS.  So it might help, but it’s not a full solution to the problem.

My opinion is that erasing all references to a given person in your production system – while also having a process in place to make sure said person never “resurrects” from the backup system – accomplishes the goal of RTBF without placing the backup data at risk or attempting to do the impossible. If Vegas had odds on this topic, that’s where I’d place my bet.  I think the ICO is going to say that as long as data is not being used to support any current business decisions, and isn’t directly accessible to production systems, it can be excluded from the RTBF process.  But you need to have a process to make sure it never comes back.
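One way such a “never comes back” process might look is a ledger of erased identifiers that gets re-applied after every restore. This is only a minimal sketch under my own assumptions, not a statement of what the ICO will accept: the ErasureLedger class, the fingerprint helper, and the use of an email address as the key are all hypothetical. It stores hashes rather than the identifiers themselves, and whether keeping even a hash is acceptable is part of the same open question.

```python
import hashlib


def fingerprint(identifier: str) -> str:
    """Hash a normalized identifier so the ledger does not hold it in clear text."""
    normalized = identifier.strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class ErasureLedger:
    """Hypothetical record of identifiers covered by honored RTBF requests."""

    def __init__(self):
        self._erased = set()

    def record_erasure(self, identifier: str) -> None:
        """Call this when a person is erased from the production system."""
        self._erased.add(fingerprint(identifier))

    def is_erased(self, identifier: str) -> bool:
        return fingerprint(identifier) in self._erased

    def scrub_restored_rows(self, rows, key="email"):
        """Drop restored rows that belong to someone who was already erased."""
        return [row for row in rows if not self.is_erased(row[key])]


ledger = ErasureLedger()
ledger.record_erasure("Alice@example.com")

# Rows as they might come back from a restore of an older backup.
restored = [
    {"email": "alice@example.com", "name": "Alice"},
    {"email": "bob@example.com", "name": "Bob"},
]

# Run the scrub before the restored data is put back into production.
clean = ledger.scrub_restored_rows(restored)
print(clean)
```

The backup itself is never modified here. The filtering happens on the way back into production, which is the only moment the old data could “resurrect.”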

No one knows for sure, though, and that’s my point.  Anyone who tries to tell you they know the answer for sure either has no idea what they’re talking about, is trying to sell you something, or both.

The ICO says they’re going to make it clearer soon

In their comment for Chris’ article, the ICO could have said “backups are included. Period.” They didn’t. They said “The key point is that organisations should be clear with individuals as to what will happen to their data when their erasure request is fulfilled, including in respect of both production environments and backup systems. We will be providing more information on backups and the right to erasure soon.”

I for one am looking forward to that guidance.

What do you think?

I’m really curious.  Anyone else want to make a guess as to how this all shakes out?

What about this? Do you think that adopting a wait-and-see approach is risky?  Should you spend millions now even if we’re not sure how this is going to end up?

Snapshots: Another GDPR challenge

In my continuing series on the challenges the General Data Protection Regulation (GDPR) poses for backup, I thought I’d look at snapshots and the unique problem they present. They may be even more problematic than traditional backups.

Note: This article is one in a series about GDPR.

In these previous posts I have defined what the GDPR is and how it applies to your company. I’ve also discussed whether or not backups are included when someone asks to be “forgotten” via a “right to be forgotten” request in the GDPR. As I discussed in those posts, I do not believe that companies are going to be able to delete such data from their backup systems, nor do I think that the GDPR is going to require them to do it. (But we just won’t know for sure until the ICO clarifies their position.)

My reasoning is two-fold. The first part is that backups aren’t being used to support any current decisions, nor are they accessible via standard IT systems and queries. The second part is that it’s simply not possible today to delete data from a backup like that.

But what about snapshots?

Someone asked about this on Twitter. Snapshots are visible to regular IT systems and could be used to support current decisions. For example, NetApp snapshots appear under the ~snapshot subdirectory of the volume they are protecting. They may not be in the normal path of things, but a user could easily search and access them. It’s kind of the point of how snapshots work.
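To show how little it takes to search snapshots that are exposed this way, here is a minimal sketch that walks a snapshot directory looking for files that mention a given person. It assumes the snapshots are reachable as an ordinary directory tree under the volume. The function name find_in_snapshots and its parameters are my own invention, and the name of the snapshot directory can differ depending on how the volume is accessed, so treat "~snapshot" as a default you may need to change.

```python
import os


def find_in_snapshots(volume_root, needle, snapshot_dir="~snapshot"):
    """Yield paths of files under the snapshot directory whose contents include needle.

    Files are read whole into memory, which is fine for a sketch but not for
    very large files.
    """
    root = os.path.join(volume_root, snapshot_dir)
    target = needle.encode("utf-8")
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as handle:
                    if target in handle.read():
                        yield path
            except OSError:
                # Skip files that cannot be read (permissions, broken links).
                continue
```

Calling list(find_in_snapshots("/mnt/vol1", "alice@example.com")) on a hypothetical mount point would return every snapshot copy of a file containing that address. Finding the data is the easy half. Because snapshots are read-only, removing it is another matter.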

But guess what? Snapshots are read-only by design. You don’t want people to be able to delete data from your snapshots if you’re using them for backup.  But since they’re accessible via a standard IT process, are they now considered primary data?

Out of curiosity, I reviewed the NetApp whitepaper on how they handle this issue, and it was unclear when it got to the part of actually forgetting the data. It mentioned that you couldn’t delete something if you didn’t know where it was, but it didn’t really go into how you would selectively delete something from a snapshot once you found it.

I’m not picking on NetApp here. I’ve always been a fan.  I’m simply saying that – like backups – selectively deleting data from snapshots goes against their nature. And I’m pointing out that because they are accessible as regular IT data, they might not get the pass that I believe backups will get.

What is your plan for snapshots?

Have you discussed the GDPR RTBF and how it relates to snapshots at your company? Has your storage vendor given you any guidance as to how to solve this problem? Is there a GDPR “back door” that you can selectively use to delete data from a snapshot? Do you want to use it, considering you could corrupt the thing you are using for backup?

I’d really love to hear from you on this.