Data Domain: The new CAS on the block?

Data Domain has just announced that it is entering the nearline market with its latest OS release optimized for storing smaller files.  What does this mean for the big CAS players on the block?

Data Domain now has over 1000 customers in the backup & DR space, offering an average de-duplication ratio of 20:1 on backed-up images.  Until now, however, Data Domain was not targeting nearline applications such as reference data stored for litigation support and regulatory compliance.  This data has typically been stored on tape, optical, or content addressable storage (CAS).

A CAS system assigns an address to each stored object based on its content.  In more technical terms, an object is addressed by a 128-bit MD5 hash or 160-bit SHA-1 hash created by running an algorithm against its contents.  If two files are exactly the same, they will get the same hash and will be stored only once.  If they differ by even one bit, they will not have the same hash, and both will be stored in a CAS system.
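To make that concrete, here's a minimal sketch (in Python, purely for illustration, and not any vendor's actual code) of what content addressing boils down to: the hash of the object's contents is the address, so identical objects collapse to a single stored copy.

```python
# A minimal sketch of content addressing: the SHA-1 hash of an object's
# contents is its address, so identical objects are stored only once.
# The in-memory dict stands in for the storage back end; this is an
# illustration, not any vendor's actual implementation.
import hashlib

store = {}  # address -> object contents

def cas_write(data: bytes) -> str:
    address = hashlib.sha1(data).hexdigest()  # 160-bit content address
    if address not in store:                  # same content, same address: store once
        store[address] = data
    return address

def cas_read(address: str) -> bytes:
    return store[address]

a = cas_write(b"quarterly report, final version")
b = cas_write(b"quarterly report, final version")   # identical -> same address, no new copy
c = cas_write(b"quarterly report, final version!")  # one byte off -> new address, new copy
print(a == b, a == c, len(store))                   # True False 2
```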

Data Domain takes this a bit further and attempts to identify redundant data at the sub-file level.  This means that if several versions of the same file are all sent to the same Data Domain system, it will identify the blocks/pieces/fragments that the files share (and store them only once) as well as the blocks that are unique to each version, storing those too.  Depending on the application, this could result in significant space and cost savings.  In addition, since Data Domain systems were originally engineered to meet the demanding throughput requirements of backup systems, they should be able to provide higher performance than today's typical CAS system.  (I speak only theoretically.  I have not tested their new archive OS against a CAS system.  I'll be curious to see what happens out there.) 
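Here's a similarly rough sketch of the sub-file idea, assuming simple fixed-size blocks (shipping de-duplication products generally chunk more cleverly): two versions of a file that differ only at the end share all but one of their stored blocks.

```python
# A rough sketch of sub-file de-duplication using fixed-size blocks
# (an assumption for simplicity; real products typically use smarter,
# variable-length chunking). Each unique block is stored once, keyed by
# its hash, and a file is recorded as an ordered list of block addresses.
import hashlib

BLOCK_SIZE = 4096
block_store = {}   # block hash -> block contents
file_index = {}    # file name  -> ordered list of block hashes

def dedupe_write(name: str, data: bytes) -> None:
    hashes = []
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        h = hashlib.sha1(block).hexdigest()
        block_store.setdefault(h, block)   # blocks shared between versions stored once
        hashes.append(h)
    file_index[name] = hashes

def dedupe_read(name: str) -> bytes:
    return b"".join(block_store[h] for h in file_index[name])

v1 = b"A" * 8192 + b"original ending"
v2 = b"A" * 8192 + b"edited ending"       # only the last block differs
dedupe_write("report_v1.doc", v1)
dedupe_write("report_v2.doc", v2)
print(len(block_store))                   # 3 unique blocks stored for 6 logical blocks
assert dedupe_read("report_v2.doc") == v2
```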

Like CAS systems, you'll be able to replicate a Data Domain system to another Data Domain system for offsite protection.  They have also been certified with leading archive software vendors.  (It's a lot easier to qualify your system when you have a filesystem interface, as opposed to the API that a certain CAS system requires you to program to. <ahem>  I never understood why EMC couldn't have just used a standard filesystem interface.  I know they provide an optional filesystem interface now, due to popular demand, but why didn't they just do that from the beginning?  What benefit does the user derive from having a storage system that can only be read or written via an API?  I can see the obvious benefit to EMC, but what's the benefit to the user?)

I digress.  What was I talking about?

Let's see: an archive-application-supported storage system that requires less storage than a typical CAS system needs to store the same amount of data, and that's (potentially) faster and (probably) costs less.  In addition, if you want to use your extra Data Domain capacity for backups, you can.  Or if you decide to stop archiving to disk and switch to optical or something, you can reuse your Data Domain system by connecting it to your backup server.

Why should the CAS vendors worry? 

Written by W. Curtis Preston (@wcpreston), four-time O'Reilly author, and host of The Backup Wrap-up podcast. I am now the Technology Evangelist at Sullivan Strickler, which helps companies manage their legacy data.

4 comments
  • You’re telling me that people vending CAS actually think that they’re selling me something worth my employer’s dollars when they’re hashing at the *file* level?

    First, that’s absurd. I can present a file system that does that in Perl with minimal I/O overhead atop any FFS look-alike in a weekend hack. I’m not paying for that.

    Second, I don’t believe that’s actually what the serious vendors are doing. EMC, for example, started selling their CAS absent an actual FS for it, and waited for the OS vendors to catch up. As I understood it, they were hashing at least at the block level, if not doing so in finer grains. Was I misled?

  • I’m not saying that ALL that CAS vendors bring to the table is file-level de-duplication. I’m just saying that their de-duplication is only at the object level. (An object can be a file, an email, etc. It’s whatever the archiving vendor sends to them.) They do bring other things to the table. For example, your filesystem will not meet the WORM requirements of an archive system, which let you prove 7 years from now that the object/email/file you’re presenting in court is indeed the object/email/file that was written 7 years ago.

    As to your comment on EMC waiting for the other vendors to catch up, I call BS. There were dozens of archiving and records management vendors around that already knew how to write to a filesystem. EMC could have created a filesystem interface, and all those vendors would have automatically been supported. But no. They created an API and forced all these existing archiving vendors to program to that API if they wanted to be listed as supported apps on EMC’s website. Then, after increased customer pressure, they added the filesystem interface.

  • I agree that EMC chose to make their API hard when they didn’t need to, but the point of my question was that I had understood their (flimsy) technical justification for doing so was precisely that they were hashing at the block level, rather than the object level. Are they not? (Is anybody? I haven’t followed the market because it’s not something anybody I’ve worked for has needed.)

  • If you’re CAS, you’re de-duping at the object level, not at the sub-object level. Period. It’s where the name came from. Each object is given a single address (i.e. MD5 or SHA-1 hash) based on its content, and is therefore forever referenced using this address. Hence the name, Content Addressable Storage.