Dedupe, Immutability & Non-repudiation

Occasionally people ask me whether organizations subject to regulations requiring the immutability and non-repudiation of certain types of data should be concerned about data deduplication. I’ve also seen a few blog entries and articles like this one asking the same question. Does dedupe change the data? Can you use deduped storage if you have immutability and/or non-repudiation requirements?

The short version of my opinion about immutability and non-repudiation of data and dedupe is that it’s a lot of hype over nothing.  Let’s see why I think that.

  • Most customers don’t need it
    • At least according to Enterprise Strategy Group…  They did a survey in 2006 that asked customers if they have immutability requirements, and 80% of them responded saying they had no such requirements.  (Source: Digital Archiving: End-User Survey & Market Forecast 2006-2010; Enterprise Strategy Group, January 2006) This matches my anecdotal evidence that no one has asked me for it. 
  • Immutability doesn’t mean it never changed
    • “What?” you say?  That’s exactly what the definition means!  Well, yes and no.  While that is the textbook definition, it can’t possibly be both the letter and spirit of the legal definition.  If it were, then there would be no such thing as an immutable file, as everything that stores data changes it in some way.  File systems split the file into chunks and spread them around.  Tape drives compress files as they’re written.  IP splits the data into packets, then puts them back together at the other end, and Fibre Channel and SCSI do something similar.
  • Non-repudiation doesn’t mean that either
    • Repudiate means to reject; therefore, non-repudiatable data is data that cannot be rejected.  Why would someone reject data?  They would do so if the veracity/truth of the data is brought into question. If it is from an unreliable source, then it can’t be trusted, and can therefore be repudiated/rejected. Since data changes hands so many times, how can we make sure its veracity can’t be challenged? Part of this has to do with chain of custody.  How did the data get there, and how do we know it wasn’t changed before the storage system was asked to store it?
  • Meeting non-repudiation requirements
    • You have to address the entire chain of custody.  Let me give an example.  If every email that is sent or received by an email system is immediately archived and stored in an archiving system that can demonstrate to anyone concerned when and where an email came from and how long it has been stored, you could use that system to build a non-repudiatable source of data that could be used in legal proceedings.  (It’s not just about the software, of course, as you have to address access and all other kinds of issues, but that would be a start.)  BUT, IMHO, non-repudiation requirements have much more to do with proving chain of custody than they do with the content of the data, and dedupe systems are just as good at proving that as any other storage system; in other words, they don’t prove it at all.  It’s usually up to the system that put the data in there and took it out.
    • Think about a WORM tape system.  How would you prove in a court of law that the tape from which you read the data is the same as the tape you wrote the data to — that it hadn’t been swapped with a modified tape with the same barcode?  That’s all about chain of custody.
  • Meeting immutability requirements
    • What matters here is the ability to say that this piece of data you asked me to store is the same now as when you gave it to me.  The typical way to prove this is with cryptographic hashes.  If you run the object through SHA-1 (or a similar hash function) and it comes out with the same value that was recorded when the object was stored, then it can be said with a legal certainty that it is the same.   It doesn’t matter how it was stored; I can prove it’s still the same as it was when I got it.
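
To make the hash check concrete, here is a minimal Python sketch. It is not how any particular dedupe product works. ToyDedupeStore, CHUNK_SIZE, digest and is_unchanged are names I made up for illustration, the fixed 4 KB chunking is an assumption, and I used SHA-256 from Python's standard hashlib module in place of SHA-1. The toy store splits each object into chunks and keeps each unique chunk only once, yet the digest recorded at ingest still matches what comes back out.

```python
import hashlib

CHUNK_SIZE = 4096  # hypothetical fixed chunk size, chosen only for illustration


def digest(data, algorithm="sha256"):
    """Return the hex digest of data using the named hashlib algorithm."""
    return hashlib.new(algorithm, data).hexdigest()


class ToyDedupeStore:
    """Illustrative only: keeps each unique chunk once, keyed by its hash."""

    def __init__(self):
        self.chunks = {}   # chunk digest -> chunk bytes (stored once)
        self.recipes = {}  # object name -> ordered list of chunk digests

    def put(self, name, data):
        """Store data under name and return the digest of the whole object."""
        recipe = []
        for start in range(0, len(data), CHUNK_SIZE):
            chunk = data[start:start + CHUNK_SIZE]
            key = digest(chunk)
            self.chunks.setdefault(key, chunk)  # duplicate chunks are not stored again
            recipe.append(key)
        self.recipes[name] = recipe
        return digest(data)

    def get(self, name):
        """Reassemble the object from its chunks, in the original order."""
        return b"".join(self.chunks[key] for key in self.recipes[name])


def is_unchanged(store, name, recorded_digest):
    """True if the object read back hashes to the digest recorded at ingest."""
    return digest(store.get(name)) == recorded_digest


store = ToyDedupeStore()
email = b"A" * 10000 + b"quarterly numbers attached"
recorded = store.put("msg-001", email)  # digest recorded when the data arrives
store.put("msg-002", email)             # an identical copy adds no new chunks
print(is_unchanged(store, "msg-001", recorded))
print(len(store.chunks))
```

The store never keeps the object as one contiguous piece, and the check does not care. It only compares the digest taken at ingest with the digest of what is read back. In a real system the recorded digest would have to be kept somewhere the storage system itself cannot alter, which brings you back to chain of custody.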

In summary, deduped storage doesn’t make your system any more or less able to meet immutability and non-repudiation requirements.  The question is whether or not your vendor has addressed these requirements.  If a system was designed to meet these requirements, then the vendor will know everything I just wrote and will have an answer for it. If a system wasn’t designed to store immutable/non-repudiatable data, you shouldn’t store it there, deduped or not.

Written by W. Curtis Preston (@wcpreston), four-time O'Reilly author, and host of The Backup Wrap-up podcast. I am now the Technology Evangelist at Sullivan Strickler, which helps companies manage their legacy data.

5 comments
  • My two cents –

    In the case of normal storage, we have no clue how it maintains the data internally 🙂

    As long as the application is able to see the data just the way it stored it, both propositions (immutability and non-repudiation) hold good.

    It’s the mounted file system, not the disk, that determines how the data is seen.

  • Cryptographic algorithms are broken and collisions can be created easily. Unless you do a bit-by-bit comparison, no digital signatures can be legally admissible.

  • What I’m saying is that if you prove, using whatever method is deemed legally admissible, that the doc today is the same as it was when it was created, it being deduped in between those two times is irrelevant.

    Having said that, what you’re saying is your opinion and is not backed up by case law YET. In fact, while it has been proven that you can create two blocks of data with the same hash, no one that I’ve ever read about or talked to has demonstrated the ability to significantly change an object (email, document, etc.) while keeping its hash the same. The amount of work and computing power to do that for EACH document you wanted to hack would be ridiculously large. Therefore, I think that someone trying to argue that a block of emails can’t really be trusted because their hash can be hacked could be easily argued against in court. And that’s MY opinion.

    BUT, again, that wasn’t the point of this article.

  • This was a great post. We have been looking at a couple of dedup solutions…for the obvious reasons. But I’ve been getting pushback from my team leader regarding the legality of the data. After numerous web searches, nothing really has come up, unless I’m not looking for the right thing. I figure if the gov’t, much larger companies, law firms, etc. are using this technology, then the legal argument is irrelevant. My opinion as well.