More thoughts on the GDPR & backups

By W. Curtis Preston · May 24, 2018 · March 22, 2022

I’m doubling down on my opinion that the GDPR is not going to be able to force companies to “forget” people in their backups – especially personal data found inside an RDBMS or spreadsheet. Let me explain.

Disclaimer: I’m not a lawyer. I’m a data protection expert, not a legal expert. I’ve read the GDPR quite a bit and have consulted a lawyer about my thoughts, but this should not be considered a legal opinion. Just thinking out loud here.

Note: This article is one in a series about GDPR.

Personal data will be inside something

If a company is storing personal data on employees, customers, or prospects, that data will not be inside a discrete file for that customer. For example, Amazon does not have a single Excel spreadsheet or database called “Curtis Preston’s data.” There is a customer database, a purchasing database, a browsing database, etc., and my data is many rows within these databases.

This is the way it is at all companies around the world. Personal data is stored inside files and databases that also store other people’s personal data. Most filesystems have a built-in way to search the content of files, and RDBMSs definitely have a built-in way to search their content. Therefore, complying with a GDPR data portability request or a right-to-be-forgotten (RTBF) request against online data should be relatively straightforward. Difficult, yes, but it’s simply a matter of identifying where personal data is stored and providing a process for searching it and expunging it if requested. Backups, however, are a whole different thing.
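To make the “identify, search, expunge” idea concrete for online data, here is a minimal sketch in Python using the built-in sqlite3 module. The table names, column names, and the `PERSONAL_DATA_LOCATIONS` map are all hypothetical; a real environment would have many more systems and a much messier map.

```python
import sqlite3

# Hypothetical online database; every table and column name here is an assumption.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, email TEXT);
    CREATE TABLE purchases (id INTEGER PRIMARY KEY, customer_email TEXT, item TEXT);
    CREATE TABLE browsing  (id INTEGER PRIMARY KEY, customer_email TEXT, url TEXT);
    INSERT INTO customers VALUES (1, 'Curtis Preston', 'curtis@example.com');
    INSERT INTO customers VALUES (2, 'Jane Doe', 'jane@example.com');
    INSERT INTO purchases VALUES (1, 'curtis@example.com', 'book');
    INSERT INTO purchases VALUES (2, 'curtis@example.com', 'tape drive');
    INSERT INTO purchases VALUES (3, 'jane@example.com', 'laptop');
    INSERT INTO browsing  VALUES (1, 'curtis@example.com', '/backup-books');
""")

# The "where does personal data live" map: (table, column holding the identifier).
PERSONAL_DATA_LOCATIONS = [
    ("customers", "email"),
    ("purchases", "customer_email"),
    ("browsing", "customer_email"),
]

def export_personal_data(conn, email):
    """Data portability request: collect every row tied to this identifier."""
    report = {}
    for table, column in PERSONAL_DATA_LOCATIONS:
        query = f"SELECT * FROM {table} WHERE {column} = ?"
        report[table] = conn.execute(query, (email,)).fetchall()
    return report

def forget_person(conn, email):
    """RTBF request: delete those rows and return how many were removed."""
    removed = 0
    with conn:  # one transaction: all deletes commit together or not at all
        for table, column in PERSONAL_DATA_LOCATIONS:
            cur = conn.execute(f"DELETE FROM {table} WHERE {column} = ?", (email,))
            removed += cur.rowcount
    return removed
```

Calling `export_personal_data(conn, "curtis@example.com")` answers a portability request, and `forget_person(conn, "curtis@example.com")` handles the erasure. The hard part in practice is building and maintaining the location map, not running the queries.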

Backups just store what they’re given

This is important to understand, especially when we are talking about database backups. With few exceptions, backup products are handed an object and some metadata about that object. They don’t control the content or format of the object, nor do they have any knowledge of what’s inside it.

That object could be a spreadsheet, the latest RMAN stream of data from Oracle, or the results of a dump database command from SQL Server. It’s just an object with some metadata. The backup product might be told what type of object it is, such as an RMAN stream, so it can know how to process it. For example, it might deduplicate an Oracle RMAN stream differently than a spreadsheet file.

But – and this is the important part – rarely does a backup product know anything about what’s inside that object, beyond what the metadata says. This is especially true of RDBMSs. A few products have made the extra effort to scan certain file types, such as spreadsheets, so they can provide “full text search” against those files. But this is absolutely the exception, and I’m not aware of any that do that for relational databases. Nor am I sure they could even do that, given that the personal data is already packed into the database’s own format by the time the backup product receives it.

In addition, the object that backup software is handed is often just a block or a few blocks that changed in a file, VM, or database since the last time it was backed up. The software might not even know where that block fits inside the whole, nor does it have the information to figure that out.
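A toy model may help show why this matters. The sketch below is not how any particular backup product works; `ToyBackupStore` and its metadata fields are invented for illustration. It stores opaque payloads keyed by their hash, deduplicates identical blocks, and keeps a catalog of whatever metadata it was handed.

```python
import hashlib

class ToyBackupStore:
    """Stores opaque blocks keyed by content hash. It never parses a payload."""

    def __init__(self):
        self.blocks = {}    # sha256 hex digest -> raw bytes
        self.catalog = []   # one metadata record per object handed to us

    def put(self, payload: bytes, metadata: dict) -> str:
        digest = hashlib.sha256(payload).hexdigest()
        # Deduplication: an identical block is stored only once.
        self.blocks.setdefault(digest, payload)
        self.catalog.append({**metadata, "sha256": digest, "size": len(payload)})
        return digest

    def search_catalog(self, **criteria):
        """Search is limited to the metadata the backup client supplied."""
        return [
            entry for entry in self.catalog
            if all(entry.get(key) == value for key, value in criteria.items())
        ]

store = ToyBackupStore()
# A changed block from a database file: just bytes plus a little metadata.
store.put(b"\x00\x17\x9a\x04 packed page bytes", {"client": "db01", "type": "rman_stream"})
store.put(b"PK\x03\x04 zipped spreadsheet bytes", {"client": "fs01", "type": "file"})
```

The store can answer `store.search_catalog(client="db01")`, because that was in the metadata. It has no way to answer “which blocks contain Curtis Preston’s rows,” because nothing it was given says so.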

MS Office files have structure

Let’s assume we solve the above problem. The backup software would have to unpack the file, extract the personal data in question, then repack the file. For example, a Microsoft Office file is actually a ZIP archive with some XML files inside of it. The backup software would have to unzip the archive, take the appropriate data out of the XML file, then rezip it – all without making the file unreadable when it saves it again.
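Here is roughly what that unzip, edit, rezip cycle looks like, as a sketch using Python’s standard zipfile module. The archive built by `build_toy_document` is a stand-in with a made-up XML body, not a valid Office document, and `redact_zip_member` is a hypothetical helper. A naive string replacement like this would miss text that a real document splits across several XML elements or stores in a separate strings file.

```python
import io
import zipfile

def build_toy_document() -> bytes:
    """Build a tiny ZIP that mimics the layout of an Office file (not a valid one)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("[Content_Types].xml", "<Types/>")
        zf.writestr(
            "word/document.xml",
            "<w:document><w:t>Customer: Curtis Preston</w:t></w:document>",
        )
    return buf.getvalue()

def redact_zip_member(archive_bytes: bytes, member: str, old: str, new: str) -> bytes:
    """Return a new ZIP in which `old` is replaced by `new` inside one member."""
    out_buf = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as src, \
         zipfile.ZipFile(out_buf, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            data = src.read(info.filename)
            if info.filename == member:
                # Unpack, edit the XML text, and repack this one member.
                data = data.decode("utf-8").replace(old, new).encode("utf-8")
            # Every other member is copied through unchanged.
            dst.writestr(info.filename, data)
    return out_buf.getvalue()

original = build_toy_document()
redacted = redact_zip_member(original, "word/document.xml", "Curtis Preston", "[REDACTED]")
```

Even in this toy case the whole archive has to be rewritten to change one string. A backup product would need to do that for every file format it supports, and it often only has some changed blocks of the file rather than the whole file.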

Relational databases have more structure

Relational databases have the concept of referential integrity. When the database is open, deleting record X is not a problem. The database can automatically delete any references to record X (for example, via cascading deletes), so there aren’t any referential integrity problems. It will also update any indices that reference those rows. Easy peasy.
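As a small illustration of what the open database does for you, the sketch below uses SQLite through Python’s sqlite3 module with a hypothetical two-table schema. It assumes the schema declares `ON DELETE CASCADE`; without that (or with foreign keys switched off, which is SQLite’s default), the delete would either be rejected or leave orphaned rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is enabled on the connection.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id) ON DELETE CASCADE,
        item TEXT
    );
    CREATE INDEX idx_orders_customer ON orders(customer_id);
    INSERT INTO customers VALUES (1, 'Curtis Preston');
    INSERT INTO customers VALUES (2, 'Jane Doe');
    INSERT INTO orders VALUES (10, 1, 'book');
    INSERT INTO orders VALUES (11, 1, 'tape drive');
    INSERT INTO orders VALUES (12, 2, 'laptop');
""")

def delete_customer(conn, customer_id):
    """Delete one customer; the engine removes dependent rows and index entries."""
    with conn:
        conn.execute("DELETE FROM customers WHERE id = ?", (customer_id,))
```

One `DELETE` statement is enough because the database engine knows its own file format, its constraints, and its indices. A backup product looking at raw blocks of that same database knows none of those things.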

That’s impossible to do when the database is a bunch of objects in a backup. First, it requires the backup software to know much more about the format of the file than it needed to know before. It then would need to be able to delete a record, any references to that record, and any indices referencing that record, and it would need to do that for every RDBMS it supports. I just don’t see this being a good idea.

Clean on restore

The first idea – as discussed in my last few blog posts – is for there to be a process to track any data that needs to be deleted (e.g. any records of Curtis Preston w/birthday of 01/01/01, at IP address 1.1.1.1, etc.), and then delete them on restore. Today this will need to be a manual process, but as has already been mentioned, it could be built into the backup software itself. It’s a monumental task, but it’s much easier to open, read, and write a file when it’s in a filesystem. And it’s much easier to run some SQL commands than it is to learn the internal structure of a database.
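A manual version of clean on restore might look something like the sketch below. Everything in it is an assumption for illustration: `restore_from_backup` is a stand-in for whatever your backup product does, and `FORGET_LIST` and `LOCATIONS` are hypothetical structures that would have to be stored and maintained outside the backups themselves.

```python
import sqlite3

# Hypothetical list of identifiers you were asked to forget, kept outside the backups.
FORGET_LIST = ["curtis@example.com"]

# Assumed map of (table, column) pairs that hold that identifier.
LOCATIONS = [("customers", "email"), ("purchases", "customer_email")]

def restore_from_backup() -> sqlite3.Connection:
    """Stand-in for a real restore: returns a database that still holds old rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, email TEXT);
        CREATE TABLE purchases (id INTEGER PRIMARY KEY, customer_email TEXT, item TEXT);
        INSERT INTO customers VALUES (1, 'Curtis Preston', 'curtis@example.com');
        INSERT INTO customers VALUES (2, 'Jane Doe', 'jane@example.com');
        INSERT INTO purchases VALUES (1, 'curtis@example.com', 'book');
        INSERT INTO purchases VALUES (2, 'curtis@example.com', 'tape drive');
        INSERT INTO purchases VALUES (3, 'jane@example.com', 'laptop');
    """)
    return conn

def clean_after_restore(conn, forget_list, locations):
    """Delete forgotten people from a restored database before it goes back online."""
    removed = {}
    with conn:
        for table, column in locations:
            for identifier in forget_list:
                cur = conn.execute(
                    f"DELETE FROM {table} WHERE {column} = ?", (identifier,)
                )
                removed[table] = removed.get(table, 0) + cur.rowcount
    return removed

restored = restore_from_backup()
removed = clean_after_restore(restored, FORGET_LIST, LOCATIONS)
```

The returned counts give you something to log as evidence that the cleanup ran. Note the awkward side effect: the forget list itself contains the identifiers of the people you were asked to forget, so it needs its own protection.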

Just stop using your backups as archives!

This whole problem is caused by people keeping their backups far longer than they should. If you used backups for restores of data from the last, say, 90 days, and then used archive software for data older than that – this would not be a problem. Nothing I mentioned above pertains to archives. Archives by design are given information, not blobs of data.

They are given emails and records from a database, not a backup stream from the database. Yes, they are also given files like spreadsheets, but they are expected to parse those files and get the data inside. It’s part of the design. An archive system is also given far fewer objects to parse, so it has time to do all that extra parsing and storing.

Maybe GDPR will be my friend and help me stop people from storing their backups for 10 years. I can dream.

What do you think?

I think this is a big problem, and I don’t see any solutions on the horizon. Does anyone else see this as any different? Your comments move the discussion forward. Anyone?

Written by W. Curtis Preston (@wcpreston), four-time O'Reilly author, and host of The Backup Wrap-up podcast. I am now the Technology Evangelist at S2|DATA, which helps companies manage their legacy data

10 Comments

Rj Brown says:

May 24, 2018 at 12:40 pm

I run a rather small network. I use a simple backup system that has worked well for me for almost 20 years now, ever since 80 GB drives were available for about $200.00 US. I now use 6 TB drives for under $200.00 US. Under cover of darkness (after midnight) the backup job kicks off and runs rsync to scarf all the changes to all the filesystems on the network. Once this is done, the backup RAID mirror has an image of every system’s files. It then copies using hard links to make a new time-stamped directory for each system, so the result is a set of filesystem images for each machine every night (sort of a poor man’s dedup backup system). The advantage of this is that all the backups are directly accessible. Since the backup mirror is a RAID, I can pull one of those 6 TB drives and take it to the safe deposit box at the bank. And yes, I have 7 years of backups in those boxes.

Now, even though everything is directly accessible, most of the backup sets are off-line. I have tinkered with the idea of a cryptographic erase technique, but that would affect the entire drive, not individual files, much less individual fields within a file. The GDPR concept was obviously written by lawmakers with no knowledge of how a computer storage system works. Maybe they could stipulate that 5 years from now all new backup systems must have a way to deal with this, and 10 years from now all grandfathering of the old ways must stop, but that would hardly please those voters that want to be forgotten. And cleaning on restore doesn’t work because the raw tapes/drives could be subpoenaed and the raw dirty data extracted, so it’s not really forgotten after all.
1. W. Curtis Preston says:
  
  May 24, 2018 at 2:03 pm
  
  I actually blogged about that type of backup years ago. I’m not a big fan of it for commercial systems for a few reasons, but it seems to work for some people. And I’m REALLY not a fan of you keeping that kind of backup for seven years. Man… you ever get an ediscovery request against that and you’ll go broke fulfilling it.
  
  As to GDPR, I actually doubt we’ll be able to deal with this even in five years, because of the formatting issues I mentioned in this post. A lot of this is out of the hands of the backup people.
Linus Moses says:

May 27, 2018 at 2:44 am

Having read your position on backups and GDPR I have to say I am in agreement.
The one thing that I am stuck on is how intelligent should the backup tools themselves actually be? To address GDPR concerns there probably needs to be a degree more intelligence than we see currently to allow for deeper search and we could then look to delete the markers for data sets within the backup catalogue. However I like the idea of backup tools being (for want of a better word) dumb, simply copying data to backup media.
I think adding additional intelligence adds additional security concerns. Having a simple tool may drive more intelligent thought from our admins.
1. W. Curtis Preston says:
  
  May 27, 2018 at 10:05 am
  
  I was discussing that a bit w/a friend the other day. The more intelligence we add to backups, the more it turns from being a backup to being a copy. The more it becomes a copy, the more it will be subject to GDPR.
  1. Linus Moses says:
    
    May 28, 2018 at 12:10 am
    
    But then I suppose this is where we make better use of the technologies we have available now. I think a combination of a “dumb” backup tool and some form of version control or vault software could help make the distinction between backup data needed for recovery and archive data needed for historical purposes.
Miguel says:

May 31, 2018 at 2:02 pm

As per W. Curtis’s request:

Some years back I had to deal with this very same issue as a citizen.
I asked a company to delete my personal account, which they did. Months later I began to get some emails.
I demanded an explanation, pointing out that local regulation could get them a nice fine.
The infrastructure manager and I arranged a call to clarify the situation.
They had actually removed my email address, but following a crash they had to recover from backup. He told me that they had added a trigger in the database so that anytime my email showed up it would be deleted.
It, indeed, ended up working out for me.

… Today, with GDPR in place, it could have been far worse.
1. W. Curtis Preston says:
  
  June 1, 2018 at 10:46 am
  
  This is what I think people are going to need to do. They will have to remember who they were asked to forget, and then make sure they stay forgotten. It sounds counterintuitive, but I can’t think of a better workaround at the moment.
eric gosnell says:

June 15, 2018 at 2:37 pm

Are we as backup professionals even thinking about this correctly? Backups are created to offset risk: risk from disaster, malicious actors, etc. What is being presented, as it appears to me, is another variable in the risk calculation. We must now either accept/implement Curtis’ (and I do agree with it) idea of tracking RTBF requests, hoping that they are applied correctly in the event of a recovery and accepting the commensurate risk that, if they aren’t, I’m subject to a fine of 4% of annual turnover; or we simply stop retaining backups for more than the RPO of the device plus a minimal number of iterations, and risk some data loss in the event we want a file that was corrupted or lost weeks ago.

My guess is businesses will ultimately side with the latter, as the cost there is a decrease over the status quo, and I’m no longer sure the risk of true disaster isn’t mitigated via BC practices to the degree that there is significantly more risk in even accidentally keeping data that I’ve been told to forget.
1. W. Curtis Preston says:
  
  June 15, 2018 at 3:03 pm
  
  Thanks for the comment. A couple of thoughts… The 4% of annual revenue penalty is a MAX penalty. AFAICT, it’s not a “You accidentally restored one person you were supposed to forget” penalty. You’re going to get a 4% penalty if you completely ignore GDPR and end up with a massive breach, and it can be shown the reason you had that breach was that you ignored GDPR. (That statement is, of course, not legally binding.) So no one is really sure yet what the fine would be for a minor “infraction” like this, or even if there would be a fine at all if you rectify it. You had a design and a system to address this problem and something fell through a crack. Fix the crack if you can and apologize. I doubt the ICO will have time to create fines for things this small.
  
  Also, DR is not the only reason we back up. There are files that get restored from months ago. Because of that, I’m OK w/keeping backups for a year or so, but not for much longer than that. After that we should be using archive software.