Robin of StorageMojo replied to my blog entry about preferring the term de-duplication over compression. I started replying in the comments, and decided to make a whole other blog entry about it.
I love a good discussion. Robin seems to like broader definitions, and I like strict definitions. I've actually defended a number of terms over the years.
I've argued that a SAN needs to use a serial protocol. Some felt that good ol' parallel SCSI counted as a SAN, and I said NO. Others went for an even broader definition, such as any network that carries storage traffic, which would include networks that carry NAS traffic. People need to know what we mean when we say SAN, and it doesn't involve a parallel SCSI cable or file traffic. As of this writing, only Fibre Channel and iSCSI meet this definition.
Another "fight" was over CDP. Vendors such as Microsoft and Symantec are marketing snapshot-oriented products as CDP when they are not continuous; they are near-continuous. If you're making snapshots, you're doing that at some set period of time, and period is an antonym to continuous. CDP allows you to recover to any point in time, not just to when you took a snapshot. Near-CDP allows you to recover to points in time where you took a snapshot.
While I agree that it would be easier if we just used the word compression for data de-duplication, I think it would cause confusion. The response that Robin got from the compression community demonstrates my point: experts in compression don't like calling de-dupe compression, and newbies are confused by the term as well. It gives them the impression that they can store 20 GB of online data in 1 GB of disk, because that's the way compression works. But that's not, as I pointed out in my last blog entry, the way de-dupe works.
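A toy sketch makes the difference visible. This is purely my own illustration (the 4 KB block size and the function names are made up, and real products chunk and fingerprint data in far more sophisticated ways), but it shows where each technique's savings come from:

```python
import hashlib
import zlib

BLOCK = 4096  # illustrative fixed block size

def dedupe(blocks, store):
    """Store each unique block once; duplicates become references."""
    refs = []
    for b in blocks:
        fp = hashlib.sha256(b).hexdigest()
        if fp not in store:
            store[fp] = b   # first time we've seen this block: store it
        refs.append(fp)     # seen before: just point at the stored copy
    return refs

store = {}
monday  = [b"A" * BLOCK, b"B" * BLOCK, b"C" * BLOCK]
tuesday = [b"A" * BLOCK, b"B" * BLOCK, b"D" * BLOCK]  # one changed block

dedupe(monday, store)
dedupe(tuesday, store)
print(len(store))  # 4 blocks stored instead of 6

# Compression shrinks one stream on its own, with no memory of past data.
# It's tiny here only because this toy data is so repetitive:
print(len(zlib.compress(b"".join(monday))))
```

De-dupe only saves space when it sees a block it has stored before; compression squeezes whatever redundancy exists within a single stream. That's why a 20:1 de-dupe ratio says something about how duplicated your data is, not about how tightly any one file can be squeezed.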
I therefore think it's entirely appropriate when companies come up with new terms for new ideas. And it looks like de-dupe has stuck. Other related terms that I like are:
Data reduction: A global term that encompasses anything that reduces the amount of stored data. It includes data de-dupe, compression, and delta differentials.
Compression: Good ol' compression in all its forms.
Delta differentials: Slightly different from de-dupe. You identify new files based on date/time, then look inside just those files to find the bytes that are new. Unlike de-dupe, it won't find duplicate blocks between files or databases, so it won't reduce the amount of backed-up data as much as de-dupe will, but it is faster. A sketch of the idea follows this list.
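Here's a minimal sketch of that two-step process, assuming fixed-size chunks and a per-file hash history. The function names and chunk size are my own inventions, not any product's API:

```python
import hashlib
import os

CHUNK = 4096  # illustrative chunk size

def files_changed_since(root, last_backup):
    """Step 1: use the file system's date/time to find modified files."""
    for dirpath, _, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            if os.path.getmtime(path) > last_backup:
                yield path

def new_bytes(path, previous_hashes):
    """Step 2: within one changed file, keep only chunks whose hashes changed."""
    deltas = []
    with open(path, "rb") as f:
        for i, chunk in enumerate(iter(lambda: f.read(CHUNK), b"")):
            digest = hashlib.sha256(chunk).hexdigest()
            # Only this file's own history is consulted, so a block that
            # also exists in some other file still gets backed up again.
            if previous_hashes.get(i) != digest:
                deltas.append((i, chunk))
    return deltas
```

Because the comparison is confined to each file's own history, a block duplicated across two different files is backed up twice. Finding that kind of duplicate is exactly the extra savings de-dupe delivers and delta differentials don't.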