I like tightly defined terms

Robin of StorageMojo replied to my blog entry about preferring the term de-duplication over compression.  I started replying in the comments, and decided to make a whole other blog entry about it.

I love a good discussion.  Robin seems to like broader definitions, and I like strict definitions.  I've actually defended a number of terms over the years. 

I've argued that a SAN needs to use a serial protocol.  Some felt that good ol' parallel SCSI was a SAN, and I said NO.  Others went for a broader definition, such as any network that carries storage traffic, which would include networks that carry NAS traffic.  People need to know what we mean when we say SAN, and it doesn't involve a parallel SCSI cable or file traffic.  As of this writing, only Fibre Channel and iSCSI meet this definition.

Another "fight" was over CDP, or continuous data protection.  Vendors such as Microsoft and Symantec are marketing snapshot-oriented products as CDP when they are not continuous; they are near-continuous.  If you're making snapshots, you're doing so at some set interval, and periodic is the antonym of continuous.  CDP allows you to recover to any point in time, not just to when you took a snapshot.  Near-CDP only allows you to recover to the points in time where you took a snapshot.
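To make the distinction concrete, here is a toy sketch of the recovery points each approach offers.  The hourly snapshot schedule and the idea of a per-second journal are invented for illustration, not taken from any particular product:

```python
import bisect

# Near-CDP: snapshots taken on a schedule -- here, once an hour.
snapshot_times = [0, 3600, 7200]


def near_cdp_restore_point(t):
    """You can only roll back to the most recent snapshot at or before t."""
    i = bisect.bisect_right(snapshot_times, t) - 1
    return snapshot_times[i]


def cdp_restore_point(t):
    """True CDP journals every write, so any point in time is recoverable."""
    return t


print(near_cdp_restore_point(5000))  # 3600: everything written in the last ~23 minutes is lost
print(cdp_restore_point(5000))       # 5000: recover to the exact moment you asked for
```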

While I agree that it would be easier if we just used the word compression for data de-duplication, I think it would cause confusion.  I think the response that Robin got from the compression community demonstrates my point.  Experts in compression don't like calling de-dupe compression.  And newbies are confused by the term as well.  It gives them the impression that they can store 20 GB of online data in 1 GB of disk, because that's the way compression works.  But that's not the way de-dupe works, as I pointed out in my last blog entry.

I therefore think it's entirely appropriate when companies come up with new terms for new ideas.  And it looks like de-dupe has stuck.  Other related terms that I like are:

Data reduction

A global term that encompasses anything that reduces the amount of stored data.  It includes data de-dupe, compression, and delta differentials.

Compression

Good ol' compression in all its forms.

Delta differentials

Slightly different from de-dupe: you identify new files based on date/time, then look inside just those files to find the bytes that are new.  Unlike de-dupe, it won't find duplicate blocks between files or databases, so it won't reduce the amount of backed-up data as much as de-dupe will, but it is faster.
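To make the difference concrete, here is a minimal sketch of the two approaches.  The helper names, the fixed 4 KB chunk size, and the SHA-256 hashing are my own assumptions for illustration, not how any particular product implements either technique:

```python
import hashlib
import os

CHUNK = 4096  # assumed fixed chunk size; real products vary


def changed_chunks(old_path, new_path):
    """Delta differential: compare a changed file against its previous
    version and keep only the chunks whose bytes actually differ."""
    deltas = []
    with open(old_path, "rb") as old, open(new_path, "rb") as new:
        offset = 0
        while True:
            a, b = old.read(CHUNK), new.read(CHUNK)
            if not b:
                break
            if a != b:
                deltas.append((offset, b))
            offset += CHUNK
    return deltas


def delta_backup(files, last_backup_time, prev_versions):
    """Delta differentials only open files modified since the last backup."""
    changed = [f for f in files if os.path.getmtime(f) > last_backup_time]
    return {f: changed_chunks(prev_versions[f], f) for f in changed}


def dedupe_backup(files, store):
    """De-dupe hashes every chunk of every file and stores a chunk only if
    that hash has never been seen before -- across all files and all backups."""
    for f in files:
        with open(f, "rb") as fh:
            while True:
                chunk = fh.read(CHUNK)
                if not chunk:
                    break
                store.setdefault(hashlib.sha256(chunk).hexdigest(), chunk)
    return store
```

The delta approach never compares one file against another, which is why it misses the duplicate blocks that de-dupe would catch; it also never has to hash and look up every chunk of every file, which is why it is faster.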

Written by W. Curtis Preston (@wcpreston), four-time O'Reilly author, and host of The Backup Wrap-up podcast. I am now the Technology Evangelist at Sullivan Strickler, which helps companies manage their legacy data.

6 comments
  • I think Robin’s point is one of “education of the marketplace”, and a more general synthesis would be “when is it appropriate to co-opt a current term and shortcut the education process vs. when is it necessary to do that education?”

    I lived this education process in the very early ’90s, when document imaging became practical on PC hardware (with special monitors). The significant price drop compared to, say, the mainframe-based imaging that USAA was famously using by then meant there was a vastly larger market of vastly less clueful sales prospects.

    The salesmen I supported had to spend most of their time educating their prospects, which largely ceased to be true by the mid-90s when I re-entered the field (a bunch of us had quit and then later formed a new company)—but by then, the healthy profits in the field had also ceased….

    In that case, there was no getting around the education process; to most people this was a “new thing on the earth.”

    Here, de-duplication vendors have options; they can at the very least analogize their products to compression products (which are now so routine they are captured in inflexible silicon). But Curtis has a strong point: de-duplication is a very different thing from pure compression, and the way it works naturally leads you to the applications for which it is good. And while I’m certainly a special case, the term plus the first two or so sentences he put in his recent book about it immediately told me what it was all about and why its special sauce was so good and unique.

    There was a genuine “ah ha!” point where I said to myself: if I were still doing that sort of thing, I’d want de-duplication per se, and bad (I am lucky that for my current needs, per-file deltas work just fine; lucky because my current budget approaches nil).

    So my questions:

    Are the 12-18 months that Robin says will be required to educate the sales prospects a necessary hit for the de-duplication vendors? And can the smaller ones even afford such a long sales cycle? (They are selling six-digit enterprise software, so they’d better be set for a long one, but does this lengthen it excessively?)

    Or would the confusion engendered by confounding “compression” with “de-duplication” cause them to not make a lot of sales in the first place? “We already do compression, why do we need your stuff?” Upon which you’ve got to do that education anyway, but first you have to clear away the confusion you inserted with said confounding….

    Putting on my engineer hat, I tend to agree with Curtis. Putting on my pre-sales support hat, I’m not sure; I’d really like to talk with some in-the-trenches de-duplication salesmen. And I would already have 2-3 diagrams and a solid explanation for non-technical people for the sales calls I’d go out on with my favorite salesmen to convince a prospect that we really knew what we were doing.

    To close on a bright note, Curtis is doing the very best thing he can by educating people about what de-duplication really is, and this is one of those things in technology where, if you have the slightest clue, I think the very basic concept can be explained (“If you have ten machines with ten copies of the same version of Windows, you only need to store one copy of Windows, and here’s how we do that….”). Problem is, a lot of the people who sign on the dotted line don’t have the background to even understand that, nor should they particularly … if they were willing and/or able to trust their IT people….

    – Harold

  • You asked:

    Are the 12-18 months that Robin talks about that will be required to educate the sales prospects a necessary hit for the de-duplication vendors?

    I think he’s saying that not having done a better job of educating people (and possibly not using the term “compression”) has set them back. _I_ think those 12-18 months have already passed, and that they mainly passed while the vendors were advertising vaporware. Several surveys have shown de-dupe to be the hottest “got to have” technology right now, so I think they’re going to do just fine.

    Data Reduction is exactly what it says it is — data reduction. Compression doesn’t actually reduce data; it just finds a more efficient way of storing it.

    So, De-Duplication would be a special case of Data Reduction, but Compression would not be included in either of these terms.

    If you want a more general term that could comprise all of these subjects that are being discussed, you might be able to talk about Data Minimization, or increased Data Storage Efficiency, or any number of other terms.

    Sorry to be pedantic, but if you’re going to be pedantic, then you should be pedantic. ;-0

    Both compression and de-dupe attempt to find repetitive blocks of data and replace them with something smaller. It’s just that when de-dupe runs against a new set of data, it’s going to see if there’s anything in the new dataset that’s the same as what it’s already stored, whereas a compression system typically only looks at the new data and compares it against itself.

    So de-dupe and compression are actually similar, but only if you can imagine a compression system that treats your entire data set as a single file and re-compresses it every time you back up. (There’s a rough sketch of the difference at the end of this comment.)

    BUT my point was that using the term compression confuses things, so I don’t like it.

    As to the data reduction term, many people in the industry are using it. It’s starting to be used the way I referred to it above, whether the pedants like it or not.
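    Here’s a rough sketch of what I mean, using zlib to stand in for the compression side and a global hash table for the de-dupe side. The chunk size and the helper names are invented for illustration; this is a toy, not any vendor’s actual method:

    ```python
    import hashlib
    import os
    import zlib

    CHUNK = 4096        # assumed chunk size, purely for illustration
    global_store = {}   # the de-dupe store survives across every backup run

    def compress_backup(data):
        """Compression only looks inside this one stream of new data."""
        return len(zlib.compress(data))

    def dedupe_backup(data):
        """De-dupe checks each chunk against everything it has ever stored."""
        new_bytes = 0
        for i in range(0, len(data), CHUNK):
            chunk = data[i:i + CHUNK]
            digest = hashlib.sha256(chunk).hexdigest()
            if digest not in global_store:
                global_store[digest] = chunk
                new_bytes += len(chunk)
        return new_bytes

    monday = os.urandom(1_000_000)                         # Monday's full backup
    print(compress_backup(monday), dedupe_backup(monday))  # both store roughly everything
    print(compress_backup(monday), dedupe_backup(monday))  # same data again: de-dupe stores ~0
    ```

    That second print is the whole point: the compressor gets the same result every run because it never remembers anything, while the de-dupe store recognizes data it has already seen and stores almost nothing new.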

  • The compression industry knows what compression is, and the dedupe industry knows what dedupe is, and the two are very different. Calling one the other confuses things and I don’t like that.

    Yes, dedupe meets the most generic sense of compression. Just like a motorcycle is also a bicycle. But if I told you I had a bicycle that could go 150 MPH, you’d call me a liar. The same is true if I told you I had a compression system that has a 20:1 compression ratio.

    Let’s allow this new industry a new term and move on.