More thoughts on the GDPR & backups

I’m doubling down on my opinion that the GDPR is not going to be able to force companies to “forget” people in their backups – especially personal data found inside an RDBMS or spreadsheet.  Let me explain.

Disclaimer: I’m not a lawyer. I’m a data protection expert, not a legal expert. I’ve read the GDPR quite a bit and have consulted a lawyer about my thoughts, but this should not be considered a legal opinion.  Just thinking out loud here.

Note: This article is one in a series about the GDPR.

Personal data will be inside something

If a company is storing personal data on employees, customers, or prospects, that data will not be inside a discrete file dedicated to each person. For example, Amazon does not have a single Excel spreadsheet or database called “Curtis Preston’s data.” There is a customer database, a purchasing database, a browsing database, etc., and my data consists of many rows spread across those databases.

This is the way it is at all companies around the world. Personal data is stored inside files and databases that also hold other people’s personal data. Most filesystems have the ability to search the content of the files they store, and an RDBMS is certainly built to search its own content. Complying with a GDPR data portability request or a right-to-be-forgotten (RTBF) request against online data should therefore be relatively straightforward. Difficult, yes, but it’s simply a matter of identifying where personal data is stored and providing a process for searching it and expunging it on request. Backups, however, are a whole different thing.
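To make that concrete, here’s a minimal sketch of what “search it and expunge it” can look like against a live database. The table and column names are hypothetical, and a real process would have to cover every system that holds personal data:

```python
# A minimal sketch of an online RTBF workflow against a hypothetical
# SQLite database. The "customers" and "orders" tables are illustrative.
import sqlite3

def expunge_subject(db_path: str, email: str) -> None:
    """Export (portability), then delete (RTBF), every row for one person."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT * FROM customers WHERE email = ?", (email,)
        ).fetchall()
        print(f"Exporting {len(rows)} customer row(s) for {email}")
        conn.execute("DELETE FROM orders WHERE customer_email = ?", (email,))
        conn.execute("DELETE FROM customers WHERE email = ?", (email,))
        conn.commit()
    finally:
        conn.close()
```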

Backups just store what they’re given

This is important to understand, especially when we are talking about database backups. With few exceptions, backup products are handed an object and some metadata about that object. They don’t control the content or format of the object, nor do they have any knowledge of what’s inside it.

That object could be a spreadsheet, the latest RMAN stream of data from Oracle, or the output of a BACKUP DATABASE (formerly DUMP DATABASE) command from SQL Server. It’s just an object with some metadata. The backup product might be told what type of object it is, such as an RMAN stream, so it knows how to process it. For example, it might deduplicate an Oracle RMAN stream differently than a spreadsheet file.

But – and this is the important part – rarely does a backup product know anything about what’s inside that object beyond what the metadata says. This is especially true of RDBMSs. A few products have put in extra effort to scan certain file types, such as spreadsheets, so they can provide “full text search” against those files. But this is absolutely the exception, and I’m not aware of any that do it for relational databases. Nor am I sure they could, given that the personal data has already been packed into the database’s internal format by the time the backup product sees it.

In addition, the object that backup software is handed is often just the block or few blocks that changed in a file, VM, or database since the last backup. The software might not even know where those blocks fit inside the whole, nor does it have the information to figure that out.
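To illustrate, here’s a sketch of the interface a backup product typically works with. Every name here is hypothetical; the shape is the point: an opaque payload plus a little metadata, and nothing that looks inside the payload.

```python
# A sketch of backup ingest: the product is handed a blob and metadata.
# All names and types here are hypothetical.
from dataclasses import dataclass
from typing import Optional

@dataclass
class BackupObject:
    payload: bytes                # the blob: spreadsheet, RMAN stream, changed blocks
    object_type: str              # e.g. "file", "rman_stream", "vm_block_delta"
    source: str                   # host/path/instance it came from
    offset: Optional[int] = None  # for block deltas: where the blocks belong, if known

def store(payload: bytes, chunk_size: int) -> None:
    """Stub: chunk and write the payload. The content is never parsed."""
    for i in range(0, len(payload), chunk_size):
        _chunk = payload[i:i + chunk_size]  # hash, dedupe, write

def ingest(obj: BackupObject) -> None:
    # The declared type can drive a dedupe strategy...
    chunk_size = 128 * 1024 if obj.object_type == "rman_stream" else 4 * 1024 * 1024
    # ...but the product never parses obj.payload: no rows, no names, no emails.
    store(obj.payload, chunk_size)
```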

MS Office files have structure

Let’s assume we solve the above problem. The backup software would have to unpack the file, extract the personal data in question, then repack the file. For example, a Microsoft Office file is actually a .ZIP file with some XML files inside it. The backup software would have to unzip the .ZIP file, take the appropriate data out of the XML, then rezip the file – all without making the file unreadable when it saves it again.
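You can see this structure for yourself. A minimal sketch, assuming a hypothetical report.docx on disk:

```python
# A modern Office file is a ZIP archive of XML parts. "report.docx"
# is a placeholder; any .docx will do.
import zipfile

with zipfile.ZipFile("report.docx") as zf:
    for name in zf.namelist():
        print(name)  # e.g. [Content_Types].xml, word/document.xml, docProps/core.xml
    xml = zf.read("word/document.xml").decode("utf-8")
    # The personal data is somewhere inside this XML. Removing it means
    # editing the XML and rebuilding the ZIP without corrupting either.
    print(xml[:200])
```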

Relational databases have more structure

Relational databases have the concept of referential integrity. When the database is open, deleting record X is not a problem. The database will automatically delete any references to record X so there are no referential integrity violations, and it will also update any indices that point to those references. Easy peasy.
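A tiny demonstration of that, using SQLite with an illustrative schema. Delete the customer, and the database cleans up the references for you:

```python
# Referential integrity in a live database: the delete cascades.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite requires opting in
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("""
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(id) ON DELETE CASCADE
    )""")
conn.execute("INSERT INTO customers VALUES (1, 'Curtis Preston')")
conn.execute("INSERT INTO orders VALUES (100, 1)")

conn.execute("DELETE FROM customers WHERE id = 1")  # delete record X
print(conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0])  # 0
```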

That’s impossible to do when the database is a bunch of objects in a backup. First, it requires the backup software to know much more about the format of the file than it ever needed to know before. It would then need to be able to delete a record, any references to that record, and any indices referencing that record – and it would need to do that for every RDBMS it supports. I just don’t see this being a good idea.

Clean on restore

The first idea – as discussed in my last few blog posts – is to have a process that tracks any data that needs to be deleted (e.g., any records of Curtis Preston with a birthday of 01/01/01, at IP address 1.1.1.1, etc.) and then deletes that data on restore. Today this would have to be a manual process, but as I’ve already mentioned, it could be built into the backup software itself. It’s a monumental task, but it’s much easier to open, read, and write a file once it’s back in a filesystem, and it’s much easier to run some SQL commands against a restored database than to learn the internal structure of a database.
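Here’s a minimal sketch of that idea: a forget-list recording each RTBF request, replayed as deletes against a freshly restored database. The file format, table, and column names are all hypothetical:

```python
# Replay recorded RTBF requests after a restore. Schema is illustrative.
import csv
import sqlite3

def clean_after_restore(db_path: str, forget_list_csv: str) -> None:
    """Re-apply every recorded RTBF request to a restored database."""
    conn = sqlite3.connect(db_path)
    with open(forget_list_csv, newline="") as f:
        # forget-list columns: email, birthday, ip_address
        for req in csv.DictReader(f):
            conn.execute(
                "DELETE FROM customers WHERE email = ? OR birthday = ? OR ip = ?",
                (req["email"], req["birthday"], req["ip_address"]),
            )
    conn.commit()
    conn.close()
```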

Just stop using your backups as archives!

This whole problem is caused by people keeping their backups far longer than they should. If you used backups to restore data from, say, the last 90 days, and used archive software for anything older than that, this would not be a problem. Nothing I mentioned above pertains to archives. Archives by design are given information, not blobs of data.

They are given emails and records from a database, not a backup stream from the database. Yes, they are also given files like spreadsheets, but they are expected to parse those files and extract the data inside; it’s part of the design. An archive system is also given far fewer objects to parse, so it has the time to do all that extra parsing and storing.
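As a sketch of the difference, here is archive-style ingest: the archive parses the file into records up front, so a later RTBF request can find exactly the entries that concern one person. The CSV schema is illustrative:

```python
# Archive ingest parses content into searchable records at ingest time.
import csv

def archive_ingest(path: str, index: dict) -> None:
    """Parse a file into records indexed by the person they concern."""
    with open(path, newline="") as f:
        for row in csv.DictReader(f):  # assumes an "email" column
            index.setdefault(row["email"], []).append(row)

index: dict = {}
archive_ingest("customers_export.csv", index)  # placeholder file name
# An RTBF request is now a dictionary lookup and a delete:
index.pop("person@example.com", None)
```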

Maybe GDPR will be my friend and help me stop people from storing their backups for 10 years.  I can dream.

What do you think?

I think this is a big problem, and I don’t see any solutions on the horizon. Does anyone see it differently? Your comments move the discussion forward. Anyone?

Written by W. Curtis Preston (@wcpreston), four-time O'Reilly author, and host of The Backup Wrap-up podcast. I am now the Technology Evangelist at Sullivan Strickler, which helps companies manage their legacy data.

10 comments
  • I run a rather small network. I use a simple backup system that has worked well for me for almost 20 years now — ever since 80 GB drives were available for about $200.00 US. I now use 6 TB drives for under $200.00 US. Under cover of darkness (after midnight) the backup job kicks off and runs rsync to scarf up all the changes to all the filesystems on the network. Once this is done, the backup RAID mirror has an image of every system’s files. It then makes a copy using hard links into a new time-stamped directory for each system, so the result is a set of filesystem images for each machine every night — sort of a poor man’s dedup backup system (a sketch of this rotation appears after this thread). The advantage of this is that all the backups are directly accessible. Since the backup mirror is a RAID, I can pull one of those 6 TB drives and take it to the safe deposit box at the bank. And yes, I have 7 years of backups in those boxes.

    Now, even though everything is directly accessible, most of the backup sets are offline. I have tinkered with the idea of a cryptographic erase technique, but that would affect the entire drive, not individual files, much less individual fields within a file. The GDPR concept was obviously written by lawmakers with no knowledge of how a computer storage system works. Maybe they could stipulate that 5 years from now all new backup systems must have a way to deal with this, and 10 years from now all grandfathering of the old ways must stop, but that would hardly please those voters who want to be forgotten. And cleaning on restore doesn’t work, because the raw tapes/drives could be subpoenaed and the raw dirty data extracted, so it’s not really forgotten after all.

    • I actually blogged about that type of backup years ago. I’m not a big fan of it for commercial systems for a few reasons, but it seems to work for some people. And I’m REALLY not a fan of you keeping that kind of backup for seven years. Man… you ever get an ediscovery request against that and you’ll go broke fulfilling it.

      As to GDPR, I actually doubt we’ll be able to deal with this even in five years, because of the formatting issues I mentioned in this post. A lot of this is out of the hands of the backup people.
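A minimal sketch of the rotation scheme the first commenter describes, assuming rsync’s --link-dest option; hosts and paths are placeholders:

```python
# Hard-link rotation with rsync: unchanged files become hard links to
# last night's copy, so each dated directory looks like a full image.
import datetime
import subprocess

def nightly_backup(host: str, backup_root: str, prev_snapshot: str) -> str:
    dest = f"{backup_root}/{host}/{datetime.date.today().isoformat()}"
    subprocess.run(
        ["rsync", "-a", "--delete",
         f"--link-dest={prev_snapshot}",  # hard-link unchanged files
         f"{host}:/", dest],
        check=True,
    )
    return dest  # becomes prev_snapshot for tomorrow's run
```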

  • Having read your position on backups and GDPR, I have to say I am in agreement.
    The one thing that I am stuck on is how intelligent the backup tools themselves should actually be. To address GDPR concerns, there probably needs to be a degree more intelligence than we see currently, to allow for deeper search; we could then look to delete the markers for data sets within the backup catalogue. However, I like the idea of backup tools being (for want of a better word) dumb, simply copying data to backup media.
    I think adding additional intelligence adds additional security concerns. Having a simple tool may drive more intelligent thought from our admins.

    • I was discussing that a bit w/a friend the other day. The more intelligence we add to backups, the more it turns from being a backup to being a copy. The more it becomes a copy, the more it will be subject to GDPR.

      • But then I suppose this is where we make better use of the technologies we have available now. I think a combination of a “dumb” backup tool and some form of version control or vault software could help make the distinction between backup data needed for recovery and archive data needed for historical purposes.

  • As per W. Curtis’ request:

    Some years back, I had to deal with this very same issue as a citizen.
    I asked a company to delete my personal account, which they did. Months later, I began to get emails from them again.
    I demanded an explanation, pointing out that local regulation could get them a nice fine.
    The infrastructure manager and I arranged a call to clarify the situation.
    They had indeed removed my email address, but following a crash they had to recover from backup. He told me they then added a trigger to the database so that anytime my email showed up, it would be deleted.
    It did indeed end up working out for me.

    … Today, with GDPR in place, it could have been far worse.

    • This is what I think people are going to need to do. They will have to remember who they were asked to forget, and then make sure they stay forgotten. It sounds counterintuitive, but I can’t think of a better workaround at the moment.
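A minimal sketch of that “stay forgotten” trick, using SQLite trigger syntax (a real deployment would use its own RDBMS’s DDL, and the schema is illustrative):

```python
# A trigger that re-deletes a forgotten address any time it reappears,
# e.g. after a restore from an old backup.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE subscribers (email TEXT)")
conn.execute("CREATE TABLE forgotten (email TEXT PRIMARY KEY)")
conn.execute("INSERT INTO forgotten VALUES ('person@example.com')")
conn.execute("""
    CREATE TRIGGER enforce_rtbf AFTER INSERT ON subscribers
    WHEN NEW.email IN (SELECT email FROM forgotten)
    BEGIN
        DELETE FROM subscribers WHERE email = NEW.email;
    END""")

# A restore re-inserts the forgotten address...
conn.execute("INSERT INTO subscribers VALUES ('person@example.com')")
# ...and the trigger removes it again.
print(conn.execute("SELECT COUNT(*) FROM subscribers").fetchone()[0])  # 0
```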

  • Are we as backup professionals even thinking about this correctly? Backups are created to offset risk: risk from disaster, malicious actors, etc. What is being presented, as it appears to me, is another variable in the risk calculation. We must now either accept and implement Curtis’ idea of tracking RTBF requests (and I do agree with it), hoping that they are applied correctly in the event of a recovery, with the commensurate risk that if they aren’t, I’m subject to 4% of annual turnover; or we simply stop retaining backups for more than the RPO of the device plus a minimal number of iterations, and risk some data loss in the event we want a file that was corrupted or lost weeks ago.

    My guess is businesses will ultimately side with the latter, as the cost is a decrease over the status quo, and I’m no longer sure the risk of true disaster isn’t mitigated via BC practices to the degree that there is significantly more risk in even accidentally keeping data that I’ve been told to forget.

    • Thanks for the comment. A couple of thoughts… The 4% of annual revenue penalty is a MAX penalty. AFAICT, it’s not a “you accidentally restored one person you were supposed to forget” penalty. You’re going to get a 4% penalty if you completely ignore the GDPR and end up with a massive breach, and it can be shown that the reason you had that breach was that you ignored the GDPR. (That statement is, of course, not legally binding.) So no one is really sure yet what the fine would be for a minor “infraction” like this, or even whether there would be a fine at all if you rectify it. You had a design and a system to address this problem, and something fell through a crack. Fix the crack if you can and apologize. I doubt the ICO will have time to create fines for things this small.

      Also, DR is not the only reason we back up. There are files that get restored from months ago. Because of that, I’m OK w/keeping backups for a year or so, but not for much longer than that. After that, we should be using archive software.