Data Domain has just announced that it is entering the nearline market with its latest OS release optimized for storing smaller files. What does this mean for the big CAS players on the block?
Data Domain now has over 1000 customers in the backup & DR space, offering an average de-duplication ratio of 20:1 on backed-up images. Until now, however, Data Domain was not targeting nearline applications such as reference data stored for litigation support and regulatory compliance. This data was typically stored on tape, on optical media, or on content-addressable storage (CAS).
A CAS system assigns an address to each stored object based on its content. In more technical terms, an object is addressed by a 128-bit MD5 hash or 160-bit SHA-1 hash that is created using an algorithm run against its contents. If two files are exactly the same, they will get the same hash and will be stored only once. If they differ by even one bit, they will not have the same hash and will both be stored in a CAS system.
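To make the addressing idea concrete, here is a minimal sketch in Python. It is a toy in-memory store, not any vendor's API: the names `ContentStore`, `put`, and `get` are made up for illustration, and SHA-1 is used only because it is one of the two hashes mentioned above.

```python
import hashlib


class ContentStore:
    """Toy content-addressable store (hypothetical, for illustration only).

    An object's address is the SHA-1 hash of its bytes.
    """

    def __init__(self):
        self.objects = {}  # address (hex digest) -> object bytes

    def put(self, data: bytes) -> str:
        # The address is derived purely from the content.
        address = hashlib.sha1(data).hexdigest()
        # If the address already exists, the object is not stored again.
        self.objects.setdefault(address, data)
        return address

    def get(self, address: str) -> bytes:
        return self.objects[address]


store = ContentStore()
first = store.put(b"quarterly report, final")
copy = store.put(b"quarterly report, final")     # exact duplicate
changed = store.put(b"quarterly report, finam")  # 'l' vs 'm': one bit differs

print(first == copy)       # same content, same address
print(first == changed)    # one bit different, different address
print(len(store.objects))  # number of objects actually stored
```

The duplicate costs nothing extra, but the one-bit change forces a second full copy of the object, which is the limitation of hashing whole files.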
Data Domain takes this a bit further and attempts to identify redundant data at the sub-file level. This means that if several versions of the same file are all sent to the same Data Domain system, it will identify the blocks/pieces/fragments of the files that are the same (and store them only once) and the blocks that are unique to each version, storing them as well. Depending on the application, this could result in significant space and cost savings. In addition, since Data Domain systems were originally engineered to meet the demanding throughput requirements of backup systems, they should be able to provide higher performance than today's typical CAS system. (I speak only theoretically. I have not tested their new archive OS against a CAS system. I'll be curious to see what happens out there.)
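Here is a rough sketch in Python of what sub-file de-duplication looks like in principle. It is not Data Domain's implementation, and nothing in this post says how they actually segment data. The sketch assumes fixed-size blocks (a made-up `BLOCK_SIZE` of 8 bytes so the example stays readable), and the `BlockStore` class and its methods are hypothetical.

```python
import hashlib

BLOCK_SIZE = 8  # hypothetical and deliberately tiny, for readability


class BlockStore:
    """Toy block-level de-duplicating store (illustration only)."""

    def __init__(self, block_size: int = BLOCK_SIZE):
        self.block_size = block_size
        self.blocks = {}  # block hash -> block bytes, each stored once
        self.files = {}   # file name -> ordered list of block hashes

    def put(self, name: str, data: bytes) -> None:
        recipe = []
        for offset in range(0, len(data), self.block_size):
            block = data[offset:offset + self.block_size]
            digest = hashlib.sha1(block).hexdigest()
            # A block that has been seen before is not stored again.
            self.blocks.setdefault(digest, block)
            recipe.append(digest)
        self.files[name] = recipe

    def get(self, name: str) -> bytes:
        # Rebuild the file by following its list of block hashes.
        return b"".join(self.blocks[d] for d in self.files[name])

    def stored_bytes(self) -> int:
        return sum(len(block) for block in self.blocks.values())


v1 = b"AAAAAAAA" + b"BBBBBBBB" + b"CCCCCCCC"
v2 = b"AAAAAAAA" + b"BBBBBBBB" + b"DDDDDDDD"  # only the last block changed

dedup = BlockStore()
dedup.put("report-v1", v1)
dedup.put("report-v2", v2)

print(len(v1) + len(v2))     # bytes sent to the store
print(dedup.stored_bytes())  # bytes actually kept
```

Two 24-byte versions go in, but only four unique 8-byte blocks are kept, and both versions can still be read back intact. A whole-file hash would have kept all 48 bytes, because the two versions are not identical.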
As with CAS systems, you'll be able to replicate a Data Domain system to another Data Domain system for offsite protection. Data Domain systems have also been certified with leading archive software vendors. (It's a lot easier to qualify your system when you have a filesystem interface, as opposed to the API that a certain CAS system requires you to program to. <ahem> I never understood why EMC couldn't have just used a standard filesystem interface. I know they provide an optional filesystem interface now, due to popular demand, but why didn't they just do that from the beginning? What benefit does the user derive from having a storage system that can only be read or written via an API? I can see the obvious benefit to EMC, but what's the benefit to the user?)
I digress. What was I talking about?
Let's see, an archive-application-supported storage system that requires less storage than a typical CAS system needs to store the same amount of data, and that's (potentially) faster and (probably) costs less. In addition, if you want to use your extra Data Domain capacity for backups, you can. Or if you decide to stop archiving to disk and switch to optical or something, you can reuse your Data Domain system by connecting it to your backup server.
Why should the CAS vendors worry?
----- Signature and Disclaimer -----
Written by W. Curtis Preston (@wcpreston). For those of you unfamiliar with my work, I've specialized in backup & recovery since 1993. I've written the O'Reilly books on backup and have worked with a number of native and commercial tools. I am now Chief Technical Architect at Druva, the leading provider of cloud-based data protection and data management tools for endpoints, infrastructure, and cloud applications. These posts reflect my own opinion and are not necessarily the opinion of my employer.