I’m doubling down on my opinion that the GDPR is not going to be able to force companies to “forget” people in their backups – especially personal data found inside an RDBMS or spreadsheet. Let me explain.
Disclaimer: I’m not a lawyer. I’m a data protection expert, not a legal expert. I’ve read the GDPR quite a bit and have consulted a lawyer about my thoughts, but this should not be considered a legal opinion. Just thinking out loud here.
Note: This article is one in a series about GDPR. Here’s a list of articles so far:
- Worried about GDPR?
- What is personal data?
- Some hope about GDPR & backups
- Keeping a copy of deleted data
- More thoughts on GDPR
Personal data will be inside something
If a company is storing personal data on employees, customers, or prospects, that data will not be inside a discrete file for that customer. For example, Amazon does not have a single Excel spreadsheet or database called “Curtis Preston’s data.” There is a customer database, a purchasing database, a browsing database, and so on, and my data is spread across many rows within these databases.
This is the way it is at virtually every company. Personal data is stored inside files and databases that also store other people’s personal data. Most operating systems provide tools to search the contents of files, and searching content is exactly what an RDBMS is built to do. Therefore, complying with a GDPR data portability request or a right to be forgotten (RTBF) request against online data should be conceptually straightforward. It takes real work, yes, but it’s a matter of identifying where personal data is stored and providing a process for searching it and expunging it if requested. Backups, however, are a whole different thing.
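To make the online case concrete, here is a minimal sketch in Python using SQLite. The table names, the column names, and the PERSONAL_DATA_MAP registry are all assumptions made up for illustration; a real environment would have many more systems to cover, and this is not how any particular company does it.

```python
import json
import sqlite3

# Hypothetical registry of where personal data lives:
# table name -> column that holds the customer id.
PERSONAL_DATA_MAP = {
    "customers": "id",
    "purchases": "customer_id",
    "browsing": "customer_id",
}

def export_subject(conn, customer_id):
    """Data portability: collect every row that belongs to one person."""
    conn.row_factory = sqlite3.Row
    result = {}
    for table, column in PERSONAL_DATA_MAP.items():
        # Table and column names come from the trusted map above, never from user input.
        rows = conn.execute(f"SELECT * FROM {table} WHERE {column} = ?", (customer_id,))
        result[table] = [dict(row) for row in rows]
    return result

def erase_subject(conn, customer_id):
    """RTBF: delete the same rows, dependent tables first. Returns rows removed."""
    removed = 0
    for table, column in reversed(list(PERSONAL_DATA_MAP.items())):
        cur = conn.execute(f"DELETE FROM {table} WHERE {column} = ?", (customer_id,))
        removed += cur.rowcount
    conn.commit()
    return removed

def demo_db():
    """Build a tiny in-memory database with made-up sample data."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE purchases (customer_id INTEGER, item TEXT);
        CREATE TABLE browsing  (customer_id INTEGER, url TEXT);
        INSERT INTO customers VALUES (1, 'Curtis Preston'), (2, 'Jane Doe');
        INSERT INTO purchases VALUES (1, 'tape drive'), (2, 'book');
        INSERT INTO browsing  VALUES (1, '/backup-software');
    """)
    return conn

if __name__ == "__main__":
    conn = demo_db()
    print(json.dumps(export_subject(conn, 1), indent=2))
    print("rows erased:", erase_subject(conn, 1))
```

The hard part in practice is building and maintaining the map, not running the queries. The point is only that a live database can answer these questions at all.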
Backups just store what they’re given
This is important to understand, especially when we are talking about database backups. With few exceptions, backup products are handed an object and some metadata about that object. They don’t control the content or format of the object, nor do they have any knowledge of what’s inside it.
That object could be a spreadsheet, the latest RMAN stream of data from Oracle, or the results of a backup database command from SQL Server. It’s just an object with some metadata. The backup product might be told what type of object it is, such as an RMAN stream, so it can know how to process it. For example, it might deduplicate an Oracle RMAN stream differently than a spreadsheet file.
But – and this is the important part – rarely does a backup product know anything about what’s inside that object, beyond what the metadata says. This is especially true of RDBMSs. A few products have made some extra effort to scan certain file types, such as spreadsheets, so they can provide “full text search” against those files. But this is absolutely the exception, and I’m not aware of any that do that for relational databases. Nor am I sure they could even do that, given that the personal data is already packed inside the database’s own backup format.
In addition, the object that backup software is handed is often just a block or a few blocks that changed in a file, VM, or database since the last time it was backed up. The backup product might not even know where that block fits inside the whole, nor does it have the information to figure that out.
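The “object plus metadata” model can be sketched in a few lines of Python. Everything here is hypothetical: BackupObject, ingest, and naive_search are names invented for illustration, and zlib compression is only a stand-in for whatever packing a real database applies to its backup stream.

```python
import hashlib
import zlib
from dataclasses import dataclass

@dataclass
class BackupObject:
    """What a hypothetical backup product keeps: metadata plus opaque bytes."""
    name: str
    object_type: str   # a hint supplied by the client, e.g. "db-dump"
    size: int
    sha256: str
    payload: bytes     # stored as handed over, never parsed

def ingest(name, object_type, stream):
    """Accept a stream exactly as the application produced it."""
    digest = hashlib.sha256(stream).hexdigest()
    return BackupObject(name, object_type, len(stream), digest, stream)

def naive_search(obj, needle):
    """The only search possible without understanding the format: a raw byte scan."""
    return needle in obj.payload

# Stand-in for a database dump: rows packed (here, simply compressed) by the application.
rows = "".join(f"{i},customer{i},order{i}\n" for i in range(500))
rows += "500,Curtis Preston,tape drive\n"
dump = zlib.compress(rows.encode())

obj = ingest("sales.dump", "db-dump", dump)
found_in_backup = naive_search(obj, b"Curtis Preston")
found_by_application = b"Curtis Preston" in zlib.decompress(obj.payload)
```

The backup side can verify, deduplicate, and return the object, but only something that understands the format (here, whoever knows to call zlib.decompress) can see the name inside it.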
MS Office files have structure
Let’s assume we solve the above problem. The backup software would then have to unpack the file, extract the personal data in question, and repack the file. For example, a modern Microsoft Office file (such as a .docx or .xlsx) is actually a ZIP file with some XML files inside of it. The backup software would have to unzip the ZIP file, take the appropriate data out of the right XML file, then rezip the file, all without making the file unreadable when it saves it again.
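Here is a rough Python sketch of that unzip, edit, rezip cycle using the standard zipfile module. The archive is a simplified stand-in built in memory, not a real Office document, and scrub_zip_member and make_sample are hypothetical helpers. A real file has more parts that can reference each other, which is exactly where things can go wrong.

```python
import io
import zipfile

def make_sample():
    """Build a tiny ZIP that loosely mimics the layout of an .xlsx file."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as z:
        z.writestr("[Content_Types].xml", "<Types/>")
        z.writestr(
            "xl/sharedStrings.xml",
            "<sst><si><t>Curtis Preston</t></si><si><t>Jane Doe</t></si></sst>",
        )
    return buf.getvalue()

def scrub_zip_member(archive_bytes, member, old, new):
    """Return a new ZIP in which `old` is replaced by `new` inside one member."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as src, \
         zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            data = src.read(info.filename)
            if info.filename == member:
                data = data.replace(old, new)
            # Every member is rewritten, even the untouched ones.
            dst.writestr(info.filename, data)
    return out.getvalue()

original = make_sample()
scrubbed = scrub_zip_member(
    original, "xl/sharedStrings.xml", b"Curtis Preston", b"[erased]"
)
```

Note that the whole archive gets rewritten to change one string, and that a plain byte replace knows nothing about the XML it is editing. Backup software would have to do this correctly for every file format it claims to support.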
Relational databases have more structure
Relational databases have the concept of referential integrity. When the database is open, deleting record X is not a problem. With cascading deletes configured, the database automatically deletes any rows that reference record X (and without them, it refuses the delete until those rows are gone), so there aren’t any referential integrity problems. It also updates any indices that cover those rows. Easy peasy.
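As a small illustration of the live-database case, here is a sketch using SQLite from Python. The schema is invented for the example, and the cascade only happens because the foreign key is declared with ON DELETE CASCADE and SQLite’s foreign key enforcement is switched on. Other databases have their own syntax and defaults.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id) ON DELETE CASCADE,
        item TEXT
    );
    CREATE INDEX idx_orders_customer ON orders(customer_id);
    INSERT INTO customers VALUES (1, 'Curtis Preston'), (2, 'Jane Doe');
    INSERT INTO orders VALUES (10, 1, 'tape drive'), (11, 1, 'book'), (12, 2, 'laptop');
""")

# One statement: the engine removes the customer, the dependent orders,
# and the matching index entries.
conn.execute("DELETE FROM customers WHERE id = ?", (1,))
conn.commit()

remaining_orders = conn.execute("SELECT id, customer_id FROM orders").fetchall()
dangling = conn.execute("PRAGMA foreign_key_check").fetchall()
```

All of the bookkeeping happens inside the database engine, which is the one piece of software that fully understands its own on-disk format.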
That’s impossible to do when the database is a bunch of objects in a backup. First, it requires the backup software to know much more about the format of the file than it needed to know before. It then would need to be able to delete a record, any references to that record, and any indices referencing that record, and it would need to do that for every RDBMS it supports. I just don’t see this being a good idea.
Clean on restore
One idea, as discussed in my last few blog posts, is to have a process that tracks any data that needs to be deleted (e.g. any records of Curtis Preston with a birthday of 01/01/01, at IP address 18.104.22.168, etc.) and then deletes it on restore. Today this will need to be a manual process, but as has already been mentioned, it could be built into the backup software itself. It’s a monumental task, but it’s much easier to open, read, and write a file once it has been restored to a filesystem than while it’s inside a backup. And it’s much easier to run some SQL commands against a restored database than it is to learn the internal structure of a database.
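A clean-on-restore process could look something like the following Python sketch. It is only an outline under stated assumptions: ERASED_CUSTOMER_IDS stands in for an erasure ledger kept outside the backup system, restore_from_backup fakes a restore with an in-memory SQLite database, and the schema is invented. None of this reflects a feature of any real backup product.

```python
import sqlite3

# Hypothetical erasure ledger: ids of people who asked to be forgotten
# after the backup was taken. It lives outside the backup system.
ERASED_CUSTOMER_IDS = {1}

def restore_from_backup():
    """Stand-in for a real restore: the database as it looked at backup time."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, item TEXT);
        INSERT INTO customers VALUES (1, 'Curtis Preston'), (2, 'Jane Doe');
        INSERT INTO orders VALUES (10, 1, 'tape drive'), (11, 1, 'book'), (12, 2, 'laptop');
    """)
    return conn

def clean_after_restore(conn, erased_ids):
    """Re-apply every recorded erasure before the restored data goes back into use."""
    removed = 0
    for customer_id in sorted(erased_ids):
        # Dependent rows first, then the customer record itself.
        for table, column in (("orders", "customer_id"), ("customers", "id")):
            cur = conn.execute(
                f"DELETE FROM {table} WHERE {column} = ?", (customer_id,)
            )
            removed += cur.rowcount
    conn.commit()
    return removed

if __name__ == "__main__":
    conn = restore_from_backup()
    print("rows removed after restore:", clean_after_restore(conn, ERASED_CUSTOMER_IDS))
```

The cleanup is safe to run more than once, since a second pass finds nothing left to delete. The ledger here holds only ids rather than names or birthdays, which is one way to keep from storing yet another copy of the data you were asked to forget.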
Just stop using your backups as archives!
This whole problem is caused by people keeping their backups far longer than they should. If you used backups only for restores of data from the last, say, 90 days, and then used archive software for data older than that, this would not be a problem. Nothing I mentioned above pertains to archives. Archives by design are given information, not blobs of data.
They are given emails and records from a database, not a backup stream from the database. Yes, they are also given files like spreadsheets, but they are expected to parse those files and get at the data inside. It’s part of the design. An archive system is also given far fewer objects to parse, so it has time to do all that extra parsing and storing.
Maybe GDPR will be my friend and help me stop people from storing their backups for 10 years. I can dream.
What do you think?
I think this is a big problem, and I don’t see any solutions on the horizon. Does anyone see this differently? Your comments move the discussion forward. Anyone?