It seems to me that source dedupe is the most efficient way to backup data, so why is it that very few products do it? This is what I found myself thinking about today.
Source dedupe is the way to go
This is my opinion and always has been, ever since I first learned about Avamar in 1998 (when it was called Undoo). If you can eliminate duplicate data across your enterprise – even before it’s sent – why wouldn’t you want to do that? It saves bandwidth and storage. Properly done, it makes backups faster and does not slow down restores. It’s even possible to use dedupe in reverse to speed up restores.
If properly done, it also reduces the CPU load on the client. A typical incremental backup (without dedupe) and a full backup both use far more compute cycles than it takes to generate the hashes used for dedupe.
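The core idea can be sketched in a few lines: hash each chunk of data, and only send chunks the server has never seen. This is a toy illustration (fixed-size chunks, an in-memory dict standing in for the server-side hash index); real products typically use variable-size chunking and persistent, distributed indexes.

```python
import hashlib

CHUNK_SIZE = 8192  # toy fixed-size chunks; real products often use variable-size chunking

def chunks(data: bytes):
    for i in range(0, len(data), CHUNK_SIZE):
        yield data[i:i + CHUNK_SIZE]

def backup(data: bytes, store: dict) -> list:
    """Send only chunks whose hash the server hasn't seen; return the 'recipe'."""
    recipe = []
    for chunk in chunks(data):
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in store:   # duplicate chunks are never sent over the wire
            store[digest] = chunk
        recipe.append(digest)
    return recipe

def restore(recipe: list, store: dict) -> bytes:
    """Reassemble the original data from its recipe of chunk hashes."""
    return b"".join(store[d] for d in recipe)

store = {}                                          # stands in for the hash index
data = (b"A" * CHUNK_SIZE) * 4 + b"B" * CHUNK_SIZE  # lots of duplicate data
recipe = backup(data, store)
# the recipe lists 5 chunks, but only 2 unique chunks were sent and stored
assert restore(recipe, store) == data
```

Hashing is cheap relative to reading and transmitting the data itself, which is the source of the CPU and bandwidth savings described above.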
You save bandwidth, storage, and CPU cycles. So why don’t all products do this?
Products that have been around a while have a significant code base to maintain. Changing to source dedupe requires massive architectural changes that can’t be easily added into the mix with an existing customer. It might require a “rip and replace” from the old to the new, which isn’t what you want to do with a customer.
Update: An earlier version of the post said some things about specific products that turned out to be out of date. I’ve removed those references. My question still remains, though.
None of the source dedupe products have torn up the market. For example, if Avamar became so popular that it was displacing the vast majority of backup installations, competitors would have been forced to come up with an answer. (The same could be true of CDP products, which could also be described as a much better way to do backups and restores. Very few true CDP products have had significant success.) But the market did not create a mandate for source dedupe, and I’ve often wondered why.
Many of the source dedupe implementations had limitations that made some think it wasn’t the way to go. The biggest one I know of is that restore speeds for larger datasets were often slower than what you would get from traditional disk or a target dedupe disk. It seemed that developers of source dedupe solutions had committed the age-old sin of making the backup faster and better at the expense of restore speed.
Another limitation of both source and target dedupe – but ostensibly more important in source dedupe implementations – is that the typical architectures used to hold the hash index topped out at some point. The hash index could only handle datasets of a certain size before it could no longer reliably keep up with the backup speed customers needed.
The only solution to this problem was to create another hash index, which creates a dedupe island. This reduces the effectiveness of dedupe, because apps backed up to one dedupe island will not dedupe against another dedupe island. This increases bandwidth usage and overall cost, since more data gets stored as well.
This is one limitation my current employer worked around by using a massively scalable NoSQL database – DynamoDB – that is available to us in AWS. Where typical dedupe products top out at 100 TB or so, we have customers with over 10 PB of data in a single environment, all being deduped against each other. And this implementation doesn’t slow down backups or restores.
What do you think?
Did I hit the nail on the head, or is there something else I’m missing? Why didn’t the whole world go to source dedupe?
Do you need to back up Office 365? The answer is absolutely yes, and anyone who thinks you don’t need to should not be put in charge of your data. Also, anyone who thinks I’m saying this just because I work for a company that backs up Office 365 should read this blog post from seven years ago, where I said basically the same thing: cloud services need to be backed up.
I was reading a Spiceworks thread on this topic and was shocked at some of the anti-backup recommendations I saw there. One person pointed to a TechEd article that talks about how redundant the storage is for Office 365. That has absolutely nothing to do with this topic. It’s the equivalent of saying “I have RAID, so I don’t need backups.”
I saw another post where someone explained that the recycle bin is sufficient for “oops” recovery needs, and that vendors just try to scare people with things like rogue admins to get them to buy their products. He/she went on to say nothing like that had ever happened to them, so… It’s not just rogue admins, people. There are all sorts of things that can corrupt your entire datastore that can only be addressed with a good third-party backup solution.
Backups aren’t included
Take a look at the feature page for Office 365. You will find that backups aren’t included. The references to data protection features are more about loss prevention and similar concerns. They have nothing to do with recovering corrupted data.
Brien Posey points out that “the Office 365 service-level agreement addresses availability, not recoverability.” So if you or someone else messes up your Office 365 data, Microsoft is under no obligation to help you.
Experts think so
Microsoft MVP Brien Posey says that “you might not have as many options for restoring your data as you might think. As such, it is critically important to understand your options for disaster recovery in an Office 365 environment.”
“Microsoft says they also perform traditional backups of Office 365 servers. However, those backups are used for internal purposes only if they experienced a catastrophic event that wiped out large volumes of customer data…”
He also points out that there is no “provision for reverting a mailbox server to an earlier point in time (such as might be necessary if a virus corrupted all the mailboxes on a server).”
You can delete your primary & secondary recycle bin
A lot of people talk about using the recycle bin to recover accidentally deleted or corrupted folders. It is true that it can keep such items for up to 90 days, depending on your settings. However, it is also true that a well-meaning or malicious person can easily clean out both the primary and secondary recycle bin. And a malicious person would indeed do just that.
Litigation hold doesn’t protect public folders
Some say that litigation hold protects you from such things. It keeps a copy of most messages forever; however, it does not protect public folders. Someone could easily delete everything in a public folder and then empty the recycle bin, and you would have no recourse if you did not have a third-party tool.
Litigation hold has no separation of powers
An important concept in many environments is the separation of powers between a person like the Exchange admin, and a backup person. That protects the organization from rogue admins doing very bad things and then covering them up by deleting the backups as well.
But litigation hold has no such protection. Office 365 administrators could (rightly or wrongly) assign themselves eDiscovery Manager rights and have full access to search and export from Exchange mailboxes, SharePoint folders, and OneDrive locations. They could even modify the Litigation Hold policies. One way to describe this is that it helps a good person to do the right thing, but it does not stop a bad or incompetent person from doing the wrong thing.
The OneDrive restore feature is all or nothing
The OneDrive restore feature is a bit puzzling. It can only restore things that are in the recycle bin, and it is all or nothing, meaning you have to restore the entire OneDrive system to a single point in time; you cannot restore just parts of it. That has to be the most worthless restore I’ve ever heard of.
You need to back up Office 365
You need to back up Exchange, OneDrive, and SharePoint. Microsoft isn’t doing it for you, and the features that protect you against accidents do not go far enough. Look into a third-party solution, such as what my employer (Druva) provides.
Disaster recovery experts do not agree whether you should have one-and-only-one recovery time objective (RTO) and recovery point objective (RPO) for each application, or two of them. What am I talking about? Let me explain.
In case you’re not familiar with RTO & RPO, I’ll define them. RTO is the amount of time it should take to restore your data and return the application to a ready state (e.g. “This server must be up within four hours”). RPO is the amount of data you can afford to lose (e.g. “You must restore this app to within one hour of when the outage occurred”).
Please note that no one is suggesting you have one RTO/RPO for your entire site. What we’re talking about is whether or not each application should have one RTO/RPO or two. We’re also not talking about whether or not to have different values for RTO and RPO (e.g. 12-hour RPO and 4-hour RTO). Most people do that. Let me explain.
In defense of two RTOs/RPOs (for each app)
If you lose a building (e.g. via a bomb blast or major fire) or a campus (e.g. via an earthquake or tsunami), it’s going to take a lot longer to get up and running than if you just have a triple-disk failure in a RAID6 array. In addition, you might have an onsite solution that gets you a nice RPO or RTO as long as the building is still intact. But when the building ceases to exist, most people are left with the latest backup tape they sent to Iron Mountain. This is why most people feel it’s acceptable to have two RTOs/RPOs: one for onsite “disasters” and another for true, site-wide disasters.
In defense of one RTO/RPO (for each app)
It is an absolute fact that RTOs and RPOs should be based on the needs of the business unit that is using any given application. Those who feel that there can only be one RTO/RPO say that the business can either be down for a day or it can’t (24-hour RTO). It can either lose a day of data or it can’t (24-hour RPO). If they can only afford to be down for one hour (1-hour RTO), it shouldn’t matter what the cause of the outage is — they can’t afford one longer than an hour.
I’m with the first team
While I agree with the second team that the business can either afford (or not) a certain amount of downtime and/or data loss, I also understand that backup and disaster recovery solutions come with a cost. The shorter the RTO & RPO, the greater the cost. In addition, solutions that are built to survive the loss of a datacenter or campus are more expensive than those that are built to survive a simple disk or server outage. They cost more in terms of the software and hardware to make it possible — and especially in terms of the bandwidth required to satisfy an aggressive RTO or RPO. You can’t do an RPO of less than 24-36 hours with trucks; you have to do it with replication.
This is how it plays out in my head. Let’s say a given business unit says that one hour of downtime costs $1M. This is after considering all of the factors, including loss of revenue, damage to the brand, etc. So they decide that they can’t afford more than one hour of downtime. No problem. Now we go and design a solution to meet a 1-hour RTO. Now suppose that the solution to satisfy that one-hour RTO costs $10M. After hearing this, the IT department looks at alternatives, and it finds that it can do a 12-hour RTO for $100K and a 6-hour RTO for $2M.
So for $10M, we are assured that we will lose only $1M in an outage. For $2M we can have a 6-hour RTO, and for $100K we can have a 12-hour RTO. That means a severe outage would cost $10M-$11M ($10M + 1 hour of downtime at $1M), $2M-$8M ($2M + up to $6M in downtime), or $100K-$12.1M ($100K + up to 12 hours of downtime).
A gambler would say that you’re looking at definitely losing (spending) $10M, $2M, or $100K, and possibly losing another $1M, $6M, or $12M in downtime. I would probably take option two or three (probably three). I’d then take the $9.9M I saved and put it to work, and hopefully make more for the company with that $9.9M than the $12M we would lose if we have a major outage.
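This kind of arithmetic is easy to mishandle in prose, so here it is written out, using the hypothetical figures from the example above ($1M per hour of downtime):

```python
# Hypothetical figures from the example above: what each solution costs,
# the RTO it delivers, and $1M per hour of downtime during an outage.
DOWNTIME_COST_PER_HOUR = 1_000_000

options = {
    "1-hour RTO":  {"solution_cost": 10_000_000, "rto_hours": 1},
    "6-hour RTO":  {"solution_cost": 2_000_000,  "rto_hours": 6},
    "12-hour RTO": {"solution_cost": 100_000,    "rto_hours": 12},
}

for name, o in options.items():
    certain = o["solution_cost"]                    # spent whether or not disaster strikes
    worst = certain + o["rto_hours"] * DOWNTIME_COST_PER_HOUR  # spend + max downtime loss
    print(f"{name}: spend ${certain:,}, worst case ${worst:,}")
```

The point of writing it out this way is that the solution cost is a certain loss, while the downtime cost is only a possible one, which is exactly the gambler’s framing.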
Now what if I told you that I could also give you an onsite 1-hour RTO for another $10K? Wouldn’t you want to spend that $10K to prevent a loss greater than $1M, knowing full well that this solution will only work if the datacenter remains intact? Of course you would.
So we’ll have a 12-hour RTO for a true disaster that takes out my datacenter, but we’ll have a 1-hour RTO as long as the outage is local and doesn’t take out the entire datacenter.
Guess what. You just agreed to have two RTOs. (All the same logic applies to RPOs, by the way.)
If everything cost the same, then I’d agree that each application should have one — and only one — RTO and RPO. However, things do not cost the same. That’s why I’m a firm believer in having two completely different sets of RTOs and RPOs. You have one that you will live up to in most situations (e.g. dead disk array) and another that you hope you never have to live up to (loss of an entire building or campus).
What do you think? Weigh in on this in the comments section.
One of the most valuable resources your company has is probably not being backed up properly – if at all. Like a lot of cloud services, the ability of Salesforce customers to recover from big mistakes or a malicious attack is a bit overstated. Let’s take a look at that.
Big, bad update
Say, for example, that someone wants to change how phone numbers are stored in Salesforce. (I know this because I wanted to do this once with a large number of records.) Let’s say they are tired of the inconsistent way phone numbers are stored and want to go to a standard format. They have chosen to get rid of all parentheses and spaces, and just use dashes. (800) 555-1212 becomes 800-555-1212.
They download a CSV of all the Salesforce IDs and accompanying phone numbers. They do their magic on the phone numbers and change everything to dashes. But they accidentally sort one column, completely disassociating the numbers from their Salesforce IDs. They then update every single one of your leads with incorrect phone numbers. Little by little, salespeople notice that some phone numbers are wrong and fix them. But it’s days before anyone realizes that it was this update that broke everything.
This would also be a great way for a salesperson to get even with your company for not giving him the bonus he wanted. Download a bunch of records, do a quick sort on only one column, then use Data Loader to upload nonsense back to Salesforce.
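For what it’s worth, the safe version of that phone-number cleanup transforms each row in place, so the ID and the number can never be separated by a stray sort. A sketch, with hypothetical column names:

```python
import csv
import io
import re

def normalize_phone(raw: str) -> str:
    """(800) 555-1212 -> 800-555-1212 (assumes 10-digit North American numbers)."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) != 10:
        return raw  # leave anything unexpected alone rather than corrupt it
    return f"{digits[0:3]}-{digits[3:6]}-{digits[6:10]}"

# Hypothetical two-column export: Salesforce ID and phone number.
export = io.StringIO("Id,Phone\n00Q001,(800) 555-1212\n00Q002,800.555.1313\n")
rows = list(csv.DictReader(export))

for row in rows:           # transform in place: Id and Phone travel together
    row["Phone"] = normalize_phone(row["Phone"])
```

The design choice that matters is never sorting or reordering one column independently; each row is read, transformed, and written as a unit.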
Recycle bin cannot fix updated records
The recycle bin contains deleted records, not updated records. So fixing even a few mistakenly (or maliciously) updated records is not possible with the recycle bin. It can only help if you accidentally delete records – and only as many records as fit in your recycle bin (25 times the number of megabytes of storage you have).
You really need to back up Salesforce
Without an external Salesforce backup, you are literally one bad update away from being forced to use their “recovery service,” which may be the worst service ever. It’s so bad they don’t want you to use it. They call it a “last resort,” and tell you it’s going to take 6-8 weeks and cost $10,000. And after those weeks, all you have is a bunch of CSV files that represent your Salesforce instance at a particular point in time. It will be your job to determine what needs to be uploaded, updated, replaced, etc. That process will be complicated and will likely take a long time as well.
Please look into an automated way to back up your Salesforce data.
No one knows for sure whether backups are going to be included in the right to be forgotten (RTBF). Even the ICO isn’t being entirely clear about it yet. But that hasn’t stopped people from expressing very strong opinions on the subject.
I’ve now written several articles on this topic, and I’ve seen a variety of responses to my comments so far, especially comments on The Register article that Chris Mellor wrote that mentioned my articles. The ones that crack me up are the people who are absolutely sure how GDPR works – even when the ICO isn’t. This is especially true when they’re trying to sell me something to fix a GDPR problem. Be wary of GDPR fearmongers trying to sell you something.
Just in case you’re wondering, even though I work at a data management as a service (DMaaS) vendor, I’ve got no axe to grind here. I’m aware of no products in this space that have a stronger GDPR story than Druva, so I have no reason to convince you this isn’t a problem. (We are able to help you find files that match certain parameters, and can help you delete them from backups. Like everyone else, however, we are not yet able to delete data from within a backup of a structured database. I am still unaware of any products that solve this problem.)
My only goal here is to start a conversation about this difficult topic. That I’ve clearly done. So I’m going to keep talking about it until we know better.
There are what I would call the “GDPR purists,” whose position sounds something like “what part of forgotten do you not understand?” Clearly these people feel that backups and archives are included in the RTBF, and they really don’t care how much it would cost a company to comply. Most importantly, they’re certain any companies not agreeing with them will be out of compliance and subject to huge fines.
I also get comments that are the complete opposite, where people are certain that backups are not (and will not be) included in RTBF requests. While my opinion is that these people are closer to what is likely to happen, their certainty makes them just as dangerous as those on the other side.
It’s all conjecture
The only thing I know for certain is that anyone who says they “know” the answer to this question is definitely wrong. Consider what happened when Chris Mellor contacted the ICO for his article. At first, their response seems to favor the “definitely included” folks. They said, “Merely because it may be considered ‘technically difficult’ to comply with some of its requirements does not mean organisations can ignore their obligations.”
The ICO knows that the RTBF is going to be hard (even without the backup part of the problem), and they want you to know that you can’t just say “erasing every reference to a given person is really hard” as a defense for not doing it. Don’t even think about trying that one, they’re saying. They need everyone to know they mean business.
But I also don’t think we’re talking about technically difficult; we’re talking technically impossible. After 25 years of experience in this field, I can easily say it is technically impossible to erase data from inside a structured database inside a backup without corrupting the backup. Almost all backups are image based, not record based. Even if you could identify the blocks pertaining to a given record inside a database, deleting those blocks would corrupt the rest of the backup. You literally cannot do that.
I also want to say that the idea that you would restore every copy of the database you have, delete the record in question, and then re-back up the database is simply ludicrous. And you would do that every time you got such a request? That’s just nonsense. Besides the fact that the process would be so expensive that it would be cheaper to pay the fine, there’s a huge risk element: it places you in possible violation of another part of the GDPR, which says you must be able to safely restore personal data that was deleted or corrupted. (A mistake in the process could corrupt all backups of every database.)
There is another proposed solution of converting your backups to archives by scanning & indexing them, and then deleting the backup tapes. This process sounds interesting until you learn it doesn’t solve the problem I keep bringing up – personal data stored inside an RDBMS. So it might help, but it’s not a full solution to the problem.
My opinion is that erasing all references to a given person in your production system – while also having a process in place to make sure said person never “resurrects” from the backup system – accomplishes the goal of RTBF without placing the backup data at risk or attempting to do the impossible. If Vegas had odds on this topic, that’s where I’d place my bet. I think the ICO is going to say that as long as data is not being used to support any current business decisions, and isn’t directly accessible to production systems, it can be excluded from the RTBF process. But you need to have a process to make sure it never comes back.
No one knows for sure, though, and that’s my point. Anyone who tries to tell you they know the answer for sure either has no idea what they’re talking about, is trying to sell you something, or both.
The ICO says they’re going to make it clearer soon
The ICO could have said “backups are included. Period.” (in their comment to Chris’ article.) They didn’t. They said “The key point is that organisations should be clear with individuals as to what will happen to their data when their erasure request is fulfilled, including in respect of both production environments and backup systems. We will be providing more information on backups and the right to erasure soon.”
I for one am looking forward to that guidance.
What do you think?
I’m really curious. Anyone else want to make a guess as to how this all shakes out?
What about this? Do you think that adopting a wait-and-see approach is risky? Should you spend millions now even if we’re not sure how this is going to end up?
In my continuing series of challenges with backup with the General Data Protection Regulation (GDPR), I thought I’d look at snapshots, and the unique problem they present. They may be even more problematic than traditional backups.
Note: This article is one in a series about GDPR.
In these previous posts I have defined what the GDPR is and how it applies to your company. I’ve also discussed whether or not backups are included when someone asks to be “forgotten” via a “right to be forgotten” request in the GDPR. As I discussed here, here, and here, I do not believe that companies are going to be able to delete such data from their backup systems, nor do I think that the GDPR is going to require them to do it. (But we just don’t know for sure until the ICO clarifies their position.)
The idea is two-fold. The first part is that backups aren’t being used to support any current decisions, nor are they accessible via standard IT systems and queries. The second part is that it’s simply not possible today to delete such data from a backup.
But what about snapshots?
Someone asked about this on Twitter. Snapshots are visible to regular IT systems and could be used to support current decisions. For example, NetApp snapshots appear under the ~snapshot subdirectory of the volume they are protecting. They may not be in the normal path of things, but a user could easily search and access them. That’s kind of the point of how snapshots work.
But guess what? Snapshots are read-only by design. You don’t want people to be able to delete data from your snapshots if you’re using them for backup. But since they’re accessible via a standard IT process, are they now considered primary data?
Out of curiosity, I reviewed the NetApp whitepaper on how they handle this issue, and it was unclear when it got to the part of actually forgetting the data. It mentioned that you couldn’t delete something if you didn’t know where it was, but it didn’t really go into how you would selectively delete something from a snapshot once you found it.
I’m not picking on NetApp here. I’ve always been a fan. I’m simply saying that – like backups – selectively deleting data from snapshots goes against their nature. And I’m pointing out that because they are accessible as regular IT data, they might not get the pass that I believe backups will get.
What is your plan for snapshots?
Have you discussed GDPR RTBF and how it relates to snapshots at your company? Has your storage vendor given you any guidance on how to solve this problem? Is there a GDPR “back door” that lets you selectively delete data from a snapshot? Would you even want to use it, considering you could corrupt the very thing you are using for backup?
I’m doubling down on my opinion that the GDPR is not going to be able to force companies to “forget” people in their backups – especially personal data found inside an RDBMS or spreadsheet. Let me explain.
Disclaimer: I’m not a lawyer. I’m a data protection expert, not a legal expert. I’ve read the GDPR quite a bit and have consulted a lawyer about my thoughts, but this should not be considered a legal opinion. Just thinking out loud here.
Note: This article is one in a series about GDPR.
If a company is storing personal data on employees, customers, or prospects, that data will not be inside a discrete file for that customer. For example, Amazon does not have a single Excel spreadsheet or database called “Curtis Preston’s data.” There is a customer database, purchasing database, browsing database, etc, and my data is many rows within these databases.
This is the way it is at all companies around the world. Personal data is stored inside files and databases that also store other personal data. Most filesystems can search the content of all files, and RDBMSs are certainly built to search their content. Therefore, complying with a GDPR data portability request or an RTBF request against online data should be relatively straightforward. Difficult, yes, but it’s simply a matter of identifying where personal data is stored and providing a process for searching it and expunging it if requested. Backups, however, are a whole different thing.
Backups just store what they’re given
This is important to understand, especially when we are talking about database backups. With few exceptions, backup products are handed an object and some metadata about that object. They don’t control the content or format of the object, nor do they have any knowledge of what’s inside it.
That object could be a spreadsheet, the latest RMAN stream of data from Oracle, or the results of a dump database command from SQL Server. It’s just an object with some metadata. The backup product might be told what type of object it is, such as an RMAN stream, so it can know how to process it. For example, it might deduplicate an Oracle RMAN stream differently than a spreadsheet file.
But – and this is the important part – rarely does a backup product know anything about what’s inside that object beyond what the metadata says. This is especially true of RDBMSs. A few products have put in extra effort to scan certain file types, such as spreadsheets, so they can provide “full text search” against those files. But this is absolutely the exception, and I’m not aware of any that do it for relational databases. Nor am I sure they could, given how the personal data is packed.
In addition, the object that backup software is handed is often just the block or blocks that changed in a file, VM, or database since the last backup. The software might not even know where those blocks fit inside the whole, nor does it have the information to figure that out.
MS Office files have structure
Let’s assume we solve the above problem. The backup software would have to unpack the file, extract the personal data in question, then repack the file. For example, a Microsoft Office file is actually a .ZIP file with some XML files inside it. The backup software would have to unzip the .ZIP file, take the appropriate data out of the XML file, then rezip the file – all without making the file unreadable when it saves it again.
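To make that concrete, here is a toy illustration: a hand-made zip archive with one XML member standing in for a real .docx. It shows the unzip, edit, and rezip steps that would all have to succeed without corrupting the container.

```python
import io
import zipfile

# Stand-in for a .docx: a zip archive containing one XML member.
original = io.BytesIO()
with zipfile.ZipFile(original, "w") as zf:
    zf.writestr("word/document.xml", "<doc><name>Curtis Preston</name></doc>")

# To remove personal data, the software would have to unzip the archive,
# edit the XML, and rezip it, all without breaking the container format.
scrubbed = io.BytesIO()
with zipfile.ZipFile(original, "r") as src, zipfile.ZipFile(scrubbed, "w") as dst:
    for member in src.namelist():
        xml = src.read(member).decode()
        dst.writestr(member, xml.replace("Curtis Preston", "[forgotten]"))

with zipfile.ZipFile(scrubbed, "r") as zf:
    result = zf.read("word/document.xml").decode()
```

Even this toy version only works because the file sits whole in memory; inside a backup stream, as described above, the software may only have a few changed blocks of the zip, which is not enough to do any of these steps.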
Relational databases have more structure
Relational databases have the concept of referential integrity. When the database is open, deleting record X is not a problem. The database will automatically delete any references to record X so there are no referential integrity problems. It will also update any indices that reference those references. Easy peasy.
That’s impossible to do when the database is a bunch of objects in a backup. First, it requires the backup software to know much more about the format of the file than it needed to know before. It then would need to be able to delete a record, any references to that record, and any indices referencing that record, and it would need to do that for every RDBMS it supports. I just don’t see this being a good idea.
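For contrast, here is a minimal sketch of why the open-database case is easy: the engine’s referential-integrity machinery does the cleanup for you. The schema is hypothetical, and SQLite is used only because it ships with Python.

```python
import sqlite3

# With a live database, referential integrity does the cleanup for you.
db = sqlite3.connect(":memory:")
db.execute("PRAGMA foreign_keys = ON")  # SQLite requires this to be enabled explicitly
db.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT)")
db.execute("""CREATE TABLE orders (id INTEGER PRIMARY KEY,
              person_id INTEGER REFERENCES person(id) ON DELETE CASCADE)""")
db.execute("INSERT INTO person VALUES (1, 'Curtis Preston')")
db.execute("INSERT INTO orders VALUES (100, 1)")

# One DELETE; the cascade removes the referencing order automatically.
db.execute("DELETE FROM person WHERE id = 1")
remaining = db.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
```

None of this machinery exists inside a backup image, which is exactly why the same deletion is effectively impossible there.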
Clean on restore
The first idea – as discussed in my last few blog posts – is to have a process that tracks any data that needs to be deleted (e.g. any records of Curtis Preston w/birthday of 01/01/01, at IP address 22.214.171.124, etc.), and then deletes it on restore. Today this would be a manual process, but as has already been mentioned, it could be built into the backup software itself. It’s a monumental task, but it’s much easier to open, read, and write a file when it’s in a filesystem. And it’s much easier to run some SQL commands than it is to learn the internal structure of a database.
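As a sketch of what that restore-time cleanup might look like (the schema and identifiers are hypothetical, and as noted, today this would be a manual or scripted step after the restore):

```python
import sqlite3

# Tombstone list of identifiers that must stay forgotten (hypothetical format).
tombstones = [("Curtis Preston", "1901-01-01")]

# A database restored from an old backup may still contain those records...
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE customer (name TEXT, birthday TEXT)")
db.executemany("INSERT INTO customer VALUES (?, ?)",
               [("Curtis Preston", "1901-01-01"), ("Jane Doe", "1980-05-05")])

# ...so a post-restore step replays the deletions with ordinary SQL,
# which is far easier than editing the database inside a backup image.
for name, birthday in tombstones:
    db.execute("DELETE FROM customer WHERE name = ? AND birthday = ?",
               (name, birthday))

survivors = [row[0] for row in db.execute("SELECT name FROM customer")]
```

The point is that once the data is back in a live database, a few SQL statements do what no backup product can do inside its own storage format.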
Just stop using your backups as archives!
This whole problem is caused by people keeping their backups far longer than they should. If you used backups for restores of data from the last, say, 90 days, and then used archive software for data older than that – this would not be a problem. Everything I mentioned above doesn’t pertain to archives. Archives by design are given information, not blobs of data.
They are given emails and records from a database, not a backup stream of the database. Yes, they are also given files like spreadsheets, but they are expected to parse those files and get at the data inside. It’s part of the design. An archive system is also given far fewer objects to parse, so it has time to do all that extra parsing and storing.
Maybe GDPR will be my friend and help me stop people from storing their backups for 10 years. I can dream.
What do you think?
I think this is a big problem, and I don’t see any solutions on the horizon. Does anyone else see this as any different? Your comments move the discussion forward. Anyone?
I’m still just thinking out loud here. Again… not an attorney. I have read the GDPR and done some analysis of it, primarily around the right to be forgotten (RTBF) and how it pertains to data protection systems. I just want to start the conversation about some of these topics and see what people are thinking about these very important topics.
Note: This article is one in a series about GDPR.
As I mentioned in my previous post, my opinion is that it is not reasonable to expect companies to delete data from their backups in order to satisfy an RTBF request. It’s simply not technically feasible given modern technology. I do believe companies should switch to non-natural values for the primary keys of their databases. It’s the latter that I want to talk about, based on some comments I received on my last post.
I stand by my opinion about non-natural keys for databases that store personal information. This allows you to delete a record while storing the record identifier, which isn’t personal data. That way you could easily check in the future if you have data that’s supposed to be deleted, such as if you restore the database to a point before the data is deleted.
But the commenter on my last article has a good point. What if you restore the database to a point before you started using non-natural keys? Suppose you follow the suggestion and stop using natural keys today. You still have backups from before today that use natural keys, and you may have to keep those backups for a long time. (You shouldn’t – you should only be keeping archives for that long – but we all know that at least half of you are keeping your backups for years. Even if you were using archives, the problem of scrubbing them is just as hard, so they could cause the same problem.)
But what about this?
So, it’s three years from now and you need to restore a database from a backup taken before you switched to non-natural keys. In the past three years you have received hundreds of RTBF requests that you must continue to honor, but you just restored a database that contains those records – and it doesn’t have the non-natural keys you stored to make sure the data stays deleted. How are you going to find and delete those records if you didn’t keep the natural keys you were using before the switch?
Again, my opinion is that you’re going to have to keep enough data to identify a unique person in order to continue to honor RTBF requests after they’ve been honored. Get rid of all other data about the person and store just enough to identify them – and put that in the most secure database you have. You could then use that database in one or both of the following ways.
One way would be to have an app that can read the data in the database, never displays it to anyone, but periodically checks whether any records in it appear in one or more of your production databases. The main use case for this method would be after a restore from an older backup: you could point the app at the restored database so it could clean it. You could also use it proactively, periodically sweeping your entire environment for deleted records and removing them wherever they are found.
Another way would be to set it up so that it can only be queried by the unique identifier; data is never exported or sent to another app. You could run a query to see if SSN 123-34-3222 is in it. If a record is found, that person is supposed to be forgotten, so the record should be deleted. Again, in the case of a restored database, you could check every record in the restored database against the list of deleted records and delete any that are found. It’s less efficient than the previous method, but it’s more secure.
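The query-only approach above can be sketched in a few lines. This is a minimal illustration, not a real implementation: the table name `forgotten`, the helper names, and the use of a salted-free SHA-256 hash (so the secure store never holds the raw identifier) are all my own assumptions, not anything the GDPR or any product prescribes.

```python
import hashlib
import sqlite3

# Hypothetical "most secure database you have" -- here just in-memory SQLite.
secure = sqlite3.connect(":memory:")
secure.execute("CREATE TABLE forgotten (id_hash TEXT PRIMARY KEY)")

def _h(identifier: str) -> str:
    # Store a hash rather than the raw SSN; an exact-match query still works.
    return hashlib.sha256(identifier.encode()).hexdigest()

def record_rtbf(identifier: str) -> None:
    """Called once an RTBF request has been honored."""
    secure.execute("INSERT OR IGNORE INTO forgotten VALUES (?)", (_h(identifier),))

def must_stay_deleted(identifier: str) -> bool:
    """Query-only interface: answers yes/no, never exports data."""
    row = secure.execute(
        "SELECT 1 FROM forgotten WHERE id_hash = ?", (_h(identifier),)
    ).fetchone()
    return row is not None

record_rtbf("123-34-3222")
print(must_stay_deleted("123-34-3222"))  # True  -> this record must be deleted
print(must_stay_deleted("555-00-1111"))  # False -> this record may stay
```

After a restore, you would loop over every record in the restored database, call `must_stay_deleted` on its identifier, and delete the matches before the database goes back online.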
I think this is defensible. Do you?
On one hand, keeping a unique identifier – the very data an RTBF request says to erase – looks like it goes against the letter of the law. Yes, it does. But the GDPR also allows you to keep information required to defend against a lawsuit. Not honoring RTBF requests could cost your company big time, so my personal, non-legal opinion is that this is a perfectly valid thing to do after you’ve honored an RTBF request – in order to make sure the person stays forgotten.
How are you going to deal with this problem? What do you think of my idea?
I am not a lawyer. I’m not even what I consider a GDPR specialist. But I’ve read much of the text of the GDPR, read a lot of commentary about it, and watched a lot of videos. So I’d like to offer my layman’s interpretation of an important aspect of GDPR – the right to be forgotten – and whether or not it means we have to delete data from our backups.
Note: This article is one in a series about GDPR. Here’s a list of articles so far:
I have an opinion on this issue, but it’s not a legal opinion. I’d love to hear your opinion, especially if it differs from mine. Let’s see some comments on this one, shall we? Here’s the official GDPR website where you can read it for yourself.
The easy stuff
There are all kinds of GDPR articles about making sure we have consent and a reason to store personal data, making sure we take care of it when we do, making sure we store it securely, etc. There’s even a line that says we need “the ability to restore the availability and access to personal data in a timely manner in the event of a physical or technical incident.” My personal opinion is that you should have been doing all this stuff already, which is why I’m calling it “the easy stuff.” (The easy stuff isn’t easy if you haven’t been doing it, but it’s easy in that all the technology is there to do it. All you have to do is implement it.)
A little bit harder
You need a way to search any systems that store personal data. You need to be able to query for any records referencing a given email address, IP address, physical address, etc. Hopefully you have that already; if not, complying will take some work. This is needed to satisfy the data request and right to be forgotten provisions.
If you’re using “natural keys” as the primary keys in your database, that’ll have to change. Any information that could be deemed personal information by the GDPR should not be used as the primary key in a database.
The first reason is what happens if you are asked to delete a record whose primary key is the IP address from which the user filled out a form, or the email address they used to do so. If other records reference that primary key, you’ll have to do a cascading delete of all of them in addition to the primary record. I’ll discuss the other reason this is important later in the article. Suffice it to say this may require a significant design change in your database system.
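Here’s what the surrogate-key design looks like in practice. This is a minimal sketch using SQLite; the table and column names (`person`, `form_submission`, etc.) are hypothetical examples, not a recommended schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Surrogate-key design: the person row owns the personal data; related
# rows reference a meaningless integer, never the email or IP itself.
cur.executescript("""
CREATE TABLE person (
    person_id INTEGER PRIMARY KEY,   -- surrogate (non-natural) key
    email     TEXT UNIQUE,           -- personal data, but NOT the key
    name      TEXT
);
CREATE TABLE form_submission (
    submission_id INTEGER PRIMARY KEY,
    person_id     INTEGER REFERENCES person(person_id),
    form_name     TEXT
);
""")
cur.execute("INSERT INTO person (email, name) VALUES (?, ?)",
            ("steve@example.com", "Steve Smith"))
person_id = cur.lastrowid
cur.execute("INSERT INTO form_submission (person_id, form_name) VALUES (?, ?)",
            (person_id, "newsletter signup"))

# An erasure request now touches ONE row: blank out the personal data and
# keep the surrogate key. No cascading delete across referencing tables.
cur.execute("UPDATE person SET email = NULL, name = NULL WHERE person_id = ?",
            (person_id,))
conn.commit()
```

Had `email` been the primary key of `person` and the foreign key in `form_submission`, the same request would have forced a cascading rewrite or delete of every referencing row.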
It goes way beyond employees
I’ve heard a lot of people talking about employees as if they are the main data subjects under the GDPR. They are covered, but employees (IMO) fall under the easy stuff. It’s easy to prove consent when you have an employment contract. You’re probably already storing that data securely, and you probably have a fairly simple way of searching for those records to comply with any requests. You also have a valid reason not to honor erasure requests: you’re keeping the data to defend against potential lawsuits, which is one of several exceptions to the erasure requirement. My opinion is that everything I just said also applies to customers. You have a contract with them, you have a reason to keep their information, you can easily search for it, and you have a reason not to delete it. Easy peasy. (Remember, I’m not a lawyer, and I’m curious about your take on this.)
The rub comes when you’re storing data about non-employees and non-customers. You will have to prove that you got affirmative consent to store the information, you’ll need to supply it when asked, and you’ll need to delete it when asked. Now things get a little hairy. It’s out of the scope of this blog, but this means you have to do things like have an unchecked checkbox that they have to check to give you permission to store the info. And you should be storing any personal data in a system that allows you to easily search for the data if someone asks for it.
But what about backups? Do I have to delete the backups?
No one knows for sure, because there’s no case law on it and the GDPR itself is somewhat unclear on the issue. We won’t know until someone gets sued under the GDPR for not deleting data from their backups. If a court rules that backups are part of what we’re supposed to delete, we’re all in a world of hurt. If they rule in line with what I say below, we can breathe easier. Let’s see what the GDPR says about the subject.
The GDPR seems more concerned with live copies of data
This is more a general feeling than anything I can directly quote, but the GDPR seems primarily interested in online, live copies of data that can be easily accessed. I’m guessing it’s because these are the copies that tend to get hacked and accidentally released to the public. You don’t really see any stories about how some hacker broke into someone’s backup system and restored a bunch of stuff to make it public. Heck, most companies can’t restore their own data properly. How’s a hacker going to do that?
The GDPR doesn’t mention backups
Go ahead. Search the entire text of the GDPR for phrases like “backup” or “back up.” You won’t find them. So no help on that front.
The GDPR does mention restores
The writers of the GDPR knew about backups and restores, because they mentioned that you need “the ability to restore the availability and access to personal data in a timely manner in the event of a physical or technical incident.” So they knew the concept of a backup exists, but chose not to mention it in the erasure section.
It does use the words archive and archival, but
When it uses the word archival, it seems to be referring to a large collection of information for a long period of time. And if you can prove you’re doing something like that “for the public good,” then it’s also exempt from erasure. For example, you can’t ask that CNN erase a story about you getting arrested.
The GDPR does mention copies
There’s a section that says you should take reasonable steps to erase any “links to, or copies or replications of those personal data” if you’ve made it public. But, again, this seems focused primarily on online copies of data that are replicated copies of the same data we’re trying to erase.
The GDPR uses the words “reasonable” and “excessive”
The GDPR is filled with words like “reasonable” and “excessive”. Its writers understood that not everything is possible and that some things would require an excessive amount of effort. One example is in Recital 66, which relates to Article 17 (the right-to-be-forgotten article): if a controller has made the personal data public, it “should take reasonable steps, taking into account available technology and the means available to the controller, including technical measures.”
The GDPR doesn’t use the word “reasonable” in the erasure section
Interestingly enough, right where we’d like to see the word “reasonable,” it isn’t there. It appears when the GDPR talks about what you must do if you’ve already made the data public and are asked to delete it, but reasonability isn’t mentioned when it talks about deleting the main source of the data or any backups of that data.
You do have to make sure data stays deleted
If you are asked to delete a particular piece of personal data, you need to make sure it is deleted – and stays deleted. But it’s virtually impossible (and certainly not reasonable) to delete records out of most backup systems, so how are you going to ensure a given record stays deleted if you do a restore?
Now we’re back to natural keys. You’ll need a way to find records pertaining to Steve Smith living at 123 Anywhere Lane, without storing the values “Steve Smith” and “123 Anywhere Lane” (because storing them would violate the deletion request). This is why you need to use something other than natural keys. If you’re not using natural keys, you can determine that Steve Smith at 123 Anywhere Lane is lead number 9303033138. That is a unique value tied to his record, but it is not personal data once you get rid of the other values. You can then create a separate table somewhere that tracks the lead numbers that must stay deleted from the marketing database – even if it’s restored.
If you restore the marketing database, you just need to make sure you delete lead number 9303033138 and any other leads listed in the DeletedLeads table – before you put that database back online. Because if you put the marketing database back online with Steve Smith’s address and email address still there – and then someone kicks off a marketing campaign that contacts Steve Smith after you said his records are deleted – you’re going to have a very easily provable GDPR violation on your hands. Then we’re back to talking about those potentially huge fines.
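The restore-scrub step described above is a single delete against the ledger table. This is a minimal sketch in SQLite; the schema (`leads`, `deleted_leads`) and the use of one database file for both tables are simplifying assumptions – in practice the ledger would live in that separate, more secure database.

```python
import sqlite3

# Stand-in for the freshly restored marketing database (hypothetical schema).
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE leads (lead_id INTEGER PRIMARY KEY, name TEXT, email TEXT);
CREATE TABLE deleted_leads (lead_id INTEGER PRIMARY KEY);  -- RTBF ledger
""")
db.executemany("INSERT INTO leads VALUES (?, ?, ?)", [
    (9303033138, "Steve Smith", "steve@example.com"),
    (9303033139, "Jane Doe",    "jane@example.com"),
])
# Lead numbers recorded when each RTBF request was honored.
db.execute("INSERT INTO deleted_leads VALUES (9303033138)")

# The scrub: run this BEFORE the restored database goes back online.
db.execute(
    "DELETE FROM leads WHERE lead_id IN (SELECT lead_id FROM deleted_leads)"
)
db.commit()

remaining = db.execute("SELECT lead_id FROM leads").fetchall()
print(remaining)  # only Jane Doe's lead survives
```

The key point is that `deleted_leads` holds nothing but meaningless lead numbers, so keeping it indefinitely doesn’t itself retain personal data.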
I don’t think you have to delete data from your backups
My personal non-legal opinion is that as long as you have a process for making sure that deleted records stay deleted even after a restore – and you make sure you follow that process – you have a pretty defensible position. My personal opinion would also be to be upfront about this in your notification to the data subject.
Dear Steve Smith,
We have deleted all references to your personal data in our marketing database. For technical reasons we are unable to delete this information from our backup system, but that system is only used to restore the marketing database if it is damaged. We also have a system in place to ensure that your records will be immediately deleted if the marketing database is ever restored from the backup system.
Backup vendors can and should be part of this process moving forward. Maybe in a few years’ time, we’ll have the ability to surgically remove records from a backup. That would be very nice, and would be more elegant than having to do what I’m suggesting above. This may indeed become a competitive differentiator for one or more backup companies moving forward.
Presenting at a Gestalt IT “Field Day” (Cloud Field Day in this case) was very different than being a delegate. So I thought I’d blog about it – just like a delegate.
What is Cloud Field Day?
Cloud Field Day is an event put on by Gestalt IT, a company founded by Stephen Foskett (@sfoskett). They put on a variety of “Field Day” events, originally just called Tech Field Day. They branched out into Storage Field Day, Networking Field Day, Wireless Field Day, and Cloud Field Day.
They bring in a group of 10+ influencers from around the world, each of whom has some type of audience. Delegates, as they are called, can be anything from someone with a “day job” in IT who blogs part time for fun to someone making a full-time living as a blogger, speaker, or analyst. The one thing they have in common is that they are independent; they cannot be employed by a vendor in the space.
I’ve been a delegate to a number of Field Days, and it’s definitely easier being on that side of things. It’s easier to listen to a vendor’s pitch and ask questions than it is to be the vendor making that pitch and answering questions. It’s easy to question why they’re doing something, or to “poke holes” in their strategy. I can remember a few times where I and my fellow delegates thought the vendor was way off base. I can even remember one time when it was so bad that the consensus among the delegates was that the start-up in question should immediately go out of business and return any remaining investor money. (For the record, we were right, and that actually happened to that particular company.)
In all of those field days as a delegate, I was never stressed about being there. It’s quite enjoyable as a delegate. You’re flown in, driven around in a limo, and constantly fed and catered to. The only time I remember feeling any “stress” (if I can call it that) is when I found myself on the delivery end of a heated discussion. Even though I felt I was very justified in what I was saying, it’s still stressful being the center of attention in a heated argument that is being streamed live.
Presenting is very different
Being a former delegate made me more nervous, not less. I knew how probing delegates could be. I knew the messages I wanted to get across, but I wondered how those messages would be received. I also knew that the delegates often drive the presentation, and they can be difficult to “redirect” once they latch onto a concept they want to discuss.
For example, I watched one Cloud Field Day 3 presenter before me “lose the room” for quite a while as the delegates debated a related technical topic. I remember wondering how I would handle that if it happened to us. You don’t want to stifle discussion, but you also need to make sure you communicate your message.
Based on the coverage we received, I think we communicated the core messages we wanted to get across, although they didn’t resonate equally with everyone. Each delegate comes with their own experiences and biases, and you can’t cater to them all.
For example, I think the delegates understood how we are the only data protection product designed for the AWS infrastructure, automatically scaling the resources we use up and down using the AWS native load balancing apps. We are also the only ones using AWS’ DynamoDB to hold all metadata and S3 to hold all backups. (Other products can copy some backups to S3; we store all of them there.) And I think they understood how that should drastically affect costs that we pass onto the customer.
What I didn’t anticipate was that being designed for the AWS infrastructure would not resonate with proponents of the Azure infrastructure. We were asked why we aren’t running there as well. The answer is two-fold. First, because we are designed around AWS’ native tools (e.g. load balancer, DynamoDB, S3), we can’t simply move our software over to Azure. We would need an entirely separate code stack on Azure, so the level of effort is very different from vendors who just run their software in VMs without using native tools. (Their approach is more expensive to deliver; ours requires additional coding to move platforms.) Second, we just don’t see much demand for native support in Azure. Most customers don’t care where we run our infrastructure – and don’t have to. But I can understand how that would fall on the deaf ears of an Azure advocate.
But the biggest challenge we ran into with this crowd was that not everyone was convinced that backing up to the cloud is the way to go for datacenters. I should have known better, being a former delegate: technical types are going to think about these things, and we need to address them before they can think about anything else.
If I had a do-over
I would explain the benefits of our approach before explaining what it is. I did cover the benefits, but after the architecture; I should have done it the other way around. Cover the problem and what parts of it we solve – then cover how we solve it. Pretty standard stuff, really. But I got excited talking to a technical audience and went technical first. While the Field Day audience doesn’t want ten slides on how data is growing, they do want some description of the problem you’re solving before you explain how you solve it.
I would also address the “elephant in the room” first, and explain that our model of backing up everything to the cloud will probably not scale to backup a single datacenter with multiple petabytes. (We do have customers who store double-digit numbers of petabytes with us, but not all from one datacenter.)
I could then explain that we scale farther than you probably think. And since most companies don’t have datacenters like that, why force them to use an approach (onsite hardware) that is designed for that scale? If you can meet your backup and recovery needs without any onsite hardware, why wouldn’t you?
Cloud Field Day is awesome
It’s a bit of a public trial by fire, but it’s a refining fire. I learned a lot about how to present our solution by presenting at Cloud Field Day. I’d recommend it to everyone. I know we’ll be back.