Deduplication maturity (updated 06-08)

Customers and audience members often ask about the maturity of deduplication. Is it mature?  Are all products mature?  Should you buy it now or wait?  I thought that would make a nice blog entry.

This blog entry is one in a series on deduplication.  The previous entry was about the real odds of hash collisions.

 

On one hand, immaturity and backups don't go well together. We like our backup systems to be stable.  They're the backup, after all. On the other hand, backup is often "pushing the envelope," as the requirements we're given are constantly changing.  I would argue that there has been much more innovation in the backup space than in the rest of the storage industry.  Technologies that really solve problems tend to get adopted rather quickly.  Think about the ease with which many customers started using multiplexing, LAN-free backups, media/device servers, centralized tape libraries, etc.

CDP didn't make as big a splash, and I'd argue that wasn't because backup people don't innovate; it was for two reasons: their requirements didn't drive them to it, and it was just too big a pill to swallow.  CDP is the only method that can get you an RPO of 0. How many people have apps that require an RPO of 0?  Not many.  One hour, 15 minutes, maybe.  But 0?  And I can do 15 minutes with more traditional methods.  As to the big-pill issue, CDP uses a completely different paradigm to get the data from here to there.

Did I digress?  De-dupe is just the latest in the list of innovative backup techniques.  So don't be afraid of it just because it's new.  Just be aware that it's new and treat it accordingly.

I'll talk about both dedupe software and hardware.  Both have companies that have been doing this for a while, and both have companies that are just now adding dedupe to their bag of tricks.  

Dedupe software 

First let's talk about dedupe software (e.g., Avamar, PureDisk, Asigra, EVault). To use these products, you uninstall your current backup software on the client you are going to protect and install the dedupe software.  That client then backs up to your dedupe backup server instead of your regular backup server.

Asigra is the first mover here, as they've been doing it for several years.  Avamar is probably the most recognized product, especially since it was acquired by EMC.  PureDisk is Symantec's offering and is the result of an acquisition of Datacenter Technologies.  Symantec's been selling it since 2007.  Its core technology is much older, but I don't think it was actually available to end users before that.  EVault has been around for several years and also has a big customer base, and they've recently added dedupe to their arsenal.

Most of these products have had many customers using them for quite some time, and their vendors can give you lots of references.  I've personally worked with customers using all of them (and have been a customer of two of them), and can say that they all appear to actually work, but each of them has limitations that you must address in your design.  If you don't address those limitations, you will be an unhappy camper.

Dedupe hardware

Now let's talk about the products that dedupe inside a disk target (NAS or VTL).  This is a bit harder, as a lot of these companies have been talking about having dedupe a lot longer than they've actually had it; some of them still don't have dedupe, and might not have it until Q408!  One vendor has said that they do not plan to do deduped storage at all.  I'll only talk about the information that I have that I can share.  (Obviously, I have some NDA information that I can't share.)  This information comes from (and will also be updated in) the disk target product directory in the Backup Central Wiki.

Dedupe Vendor

There are currently nine providers of deduplicated target storage: Data Domain, Diligent, Exagrid, FalconStor, HP, NEC, NetApp, Quantum & SEPATON. Everybody else either has no dedupe or is reselling/OEMing products from these companies.  The first mover here is, of course, Data Domain.  They have been shipping dedupe target devices for several years and have well over 2000 customers using their products.  From a dedupe installed-base perspective, everyone else pales in comparison.  The first fast follower to ship was Diligent, and they've been shipping since 2005.  A few surveys are starting to show that while Data Domain wins in total number of shipped units, Diligent is either trailing close behind (or possibly beating) Data Domain in total shipped TBs of disk.  In order of when GA dedupe product first shipped to customers, the other fast followers would be Exagrid, NEC, FalconStor, SEPATON, Quantum, then HP.  (This is a rough approximation based on multiple sources.)

IBM acquired Diligent in 5/08, so that's obviously their plan. EMC announced in 06/08 that they will be shipping dedupe VTLs/NAS based on Quantum's technology.  Sun is shipping FalconStor-based units, and HP is shipping SEPATON-based dedupe.  (HP is using SEPATON for its larger systems and its own dedupe for its smaller systems.)  HDS' product is based on Diligent, and they say this will continue, even though IBM purchased Diligent.  NetApp's WAFL-based dedupe is their own product.  As you can see, there are several dedupe VTLs to choose from.

Dedupe Domain 

If you back up less than 10 TB a night (including weekends), you don't need to worry about this category, as you can back up 10 TB in an eight-hour backup window with all but one of these solutions.  (The Overland unit is aimed at a different market and can handle about 4.3 TB in a 12-hour window.)  If you are backing up significantly more than 10 TB a night, then you should be aware of this category.

If the dedupe domain says "Single head," then data coming into a given head is only compared to other data that came into that head.  A multiple-head system will compare data that came into any head with data that came into all other heads.  If you use multiple heads of a product with a single-headed dedupe domain, you will need to direct a given set of backups to only one head in order to get a good dedupe ratio.  You should not, for example, point the backup of a given database or filesystem at two heads for performance/load-balancing reasons.  Doing that will reduce your overall deduplication ratio, as the backup sent to head A will not be compared against the backup sent to head B.  A multiple-head system would allow you to send backups to any head and have those backups compared against all other backups sent to all other heads.  The idea is that this both reduces design complexity and increases your effective deduplication ratio.

While some single-headed vendors attempt to minimize the importance of this (IMHO very important) feature, you can rest assured that any vendor who doesn't have it is currently working on adding it.  Having said that it's important, it's also important to note that the vendors that offer this feature are the latest vendors to join the party (FalconStor, SEPATON, Quantum).  So if you think this feature is important, you'll be looking at adopting some of the newer technology out there.  (I'm not saying don't do it, of course.  I'm just saying to test the heck out of it, just as you would with anything.)  If "time in service" is more important to you, you might want to figure out how not to need this feature. 😉
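To make the dedupe-domain point concrete, here's a minimal sketch in Python. The chunk counts and the 2% nightly change rate are made-up numbers for illustration and aren't modeled on any particular vendor; the sketch just shows what happens to the ratio when nightly fulls of the same data are round-robined across two heads that each keep their own dedupe index.

```python
# A minimal sketch (made-up numbers) of why load-balancing backups across heads
# that do NOT share a dedupe domain hurts the effective dedupe ratio.
import random

random.seed(1)

NUM_CHUNKS = 100_000      # chunks in one nightly full backup
NIGHTS = 14               # two weeks of nightly fulls
DAILY_CHANGE = 0.02       # ~2% of chunks change each night

def dedupe_ratio(num_heads: int) -> float:
    """Dedupe ratio when nightly fulls are round-robined across heads whose
    chunk indexes are independent (i.e., single-head dedupe domains)."""
    stores = [set() for _ in range(num_heads)]   # one chunk index per head
    data = list(range(NUM_CHUNKS))               # tonight's backup, as chunk IDs
    next_chunk = NUM_CHUNKS
    logical = 0
    for night in range(NIGHTS):
        # Simulate the daily change rate by replacing some chunks with new ones.
        for i in random.sample(range(NUM_CHUNKS), int(NUM_CHUNKS * DAILY_CHANGE)):
            data[i] = next_chunk
            next_chunk += 1
        stores[night % num_heads].update(data)   # tonight's backup lands on one head
        logical += NUM_CHUNKS
    return logical / sum(len(s) for s in stores)

print(f"One global dedupe domain:     {dedupe_ratio(1):.1f}:1")
print(f"Two independent single heads: {dedupe_ratio(2):.1f}:1")
```

In this toy example the round-robin split roughly halves the effective ratio, because each head ends up storing its own copy of the mostly unchanged data. Your real numbers depend entirely on your data and change rate, which is yet another reason to test.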

Dedupe Replication

If you're doing cross-campus replication where you have full LAN speeds, they can all replicate that.  However, if they can replicate the data after it's been deduplicated, they can also replicate your backups across a WAN.  Data Domain has had this feature for several years.  The Diligent product is a bit different, in that it doesn't do the replication for you; however, you can use any replication product to replicate its deduplicated data.  FalconStor's dedupe-based replication shipped with their dedupe software.

Most vendors replicate the same barcodes (VTL) or filenames (NAS) that the backup software writes to the target devices.  If you do this, the backup software is not aware of the second copy, as it can't understand how a single file or barcoded tape can be in two places at once.  (Think about it: the same tape cannot be in two tape libraries at the same time.)  The replicated copy can be used in a DR scenario by using an alternate master, recovering the backup software's catalog/database of backups, and telling it to inventory the NAS system or VTL.  However, it can't be used from an operational backup and recovery perspective. Each vendor has a different answer as to how they handle this particular issue. SEPATON is trailing this race, as its replication currently does not support replicating the deduplicated data (they can replicate before dedupe, which requires significantly more bandwidth).  They're saying this feature will be coming in Q208.
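To see why replicating the deduplicated data matters for WAN links, here's a back-of-the-envelope calculation. The 5 TB nightly size, 10:1 ratio, and 10-hour replication window are hypothetical inputs; plug in your own numbers.

```python
# Rough WAN bandwidth needed to replicate one night's backups, with and without
# dedupe-aware replication. All inputs below are hypothetical examples.

def mbps_needed(tb_to_move: float, hours: float) -> float:
    """Sustained megabits per second required to move tb_to_move TB in `hours` hours."""
    return tb_to_move * 1e12 * 8 / (hours * 3600) / 1e6

nightly_tb   = 5.0    # logical size of one night's backups (hypothetical)
dedupe_ratio = 10.0   # assumed effective ratio on the replicated data
window_hours = 10.0   # time available to finish replicating

print(f"Replicating raw backups:  ~{mbps_needed(nightly_tb, window_hours):,.0f} Mb/s")
print(f"Replicating deduped data: ~{mbps_needed(nightly_tb / dedupe_ratio, window_hours):,.0f} Mb/s")
# Roughly 1,100 Mb/s vs. 110 Mb/s here: the difference between needing a
# multi-gigabit pipe and fitting on a fraction of one.
```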

An added twist here is Symantec's Open Storage API, or OST for short.  If a product supports this feature, NetBackup can tell the intelligent storage device to copy an individual backup from one device to another.  The device would use deduplicated replication to copy the backup quickly and then tell NBU that the copy is done and where it was copied.  This allows NBU to control the replication and to know about the replicated copy.  Data Domain is supporting this today, and Quantum and Falconstor are expected to follow shortly.  Other vendors have announced that they will support it as well, but their support is probably farther in the future. 
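Here's a rough sketch of the control-flow difference that paragraph describes. To be clear, the class and method names below are invented for illustration only; they are not the actual OpenStorage API calls.

```python
# Hypothetical sketch of application-driven copies (an OST-style flow).
# These classes and method names are invented for illustration; they are NOT
# the real OpenStorage API.

class Appliance:
    """Stand-in for an OST-capable dedupe appliance."""
    def __init__(self, name: str):
        self.name = name

    def duplicate_to(self, backup_id: str, target: "Appliance") -> str:
        # The appliance ships only deduplicated segments over the WAN,
        # then reports back a handle for the new copy on the target.
        print(f"{self.name}: replicating deduped segments of {backup_id} to {target.name}")
        return f"{backup_id}@{target.name}"

class BackupApp:
    """Stand-in for the backup application (the piece that owns the catalog)."""
    def __init__(self):
        self.catalog: dict[str, list[str]] = {}

    def duplicate(self, backup_id: str, source: Appliance, target: Appliance) -> None:
        copy = source.duplicate_to(backup_id, target)         # device does the data movement
        self.catalog.setdefault(backup_id, []).append(copy)   # app records the second copy

app = BackupApp()
app.duplicate("client1_full_0608", Appliance("site-A"), Appliance("site-B"))
print(app.catalog)   # the copy is visible to the backup app, unlike barcode-level replication
```

The contrast with barcode- or filename-level replication is the point: there, the device makes the copy and the backup application never learns about it; in an OST-style flow, the application initiates the copy and records it in its catalog.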

That's my best attempt to summarize the state of things today (June 2008).    Here's a table summarizing these features. 

 

| Vendor/Product | Dedupe Vendor | VTL, NAS, or Local | De-dupe | Dedupe Domain | Dedupe Replication |
|---|---|---|---|---|---|
| COPAN | FalconStor | VTL & NAS | Yes | Multiple heads | Yes |
| Data Domain | Data Domain | VTL & NAS | Yes | Single head | Yes (also with Symantec OST) |
| Diligent ProtecTier | Diligent | VTL | Yes | Single head | Can replicate deduped bytes using product of your choice |
| EMC EDL | Quantum | VTL | Yes | Single head | Yes |
| Exagrid | Exagrid | NAS | Yes | Single head | Yes |
| FalconStor | FalconStor | VTL | Yes | Multiple heads | Yes |
| Gresham Clarita VTL | Will not be doing deduped storage | VTL | N/A | N/A | N/A |
| HDS VTL | Diligent | VTL | Yes | Single head | Can replicate deduped bytes using HDS replication products |
| HP VTL | HP (SMB), SEPATON (Enterprise) | VTL | Yes | Multiple heads (Enterprise only) | Does not use dedupe feature yet |
| IBM VTL | Acquired Diligent | VTL | Yes | Single head | Yes |
| NEC Hydrastor | NEC | NAS | Yes | Multiple heads | Yes |
| NetApp NearStore VTL | Not announced | VTL | No | N/A | N/A |
| NetApp NearStore | NetApp (WAFL-based) | NAS | Yes | Single flex-vol | Yes |
| Overland REO | VTL is Overland, dedupe is Diligent | VTL | Yes | Single head | No |
| Quantum DXi | Quantum | VTL | Yes | Multiple heads | Yes |
| SEPATON | SEPATON | VTL | Yes | Multiple heads | Q208 (non-deduped replication available now) |
| PureDisk Storage Unit (only works with NBU) | Symantec | LFS | Yes | Single head | No |
| Sun StorageTek VTL | FalconStor | VTL | Yes | Multiple heads | Yes |

This blog entry is one in a series on deduplication.  The previous entry was about the real odds of hash collisions.

 

Written by W. Curtis Preston (@wcpreston), four-time O'Reilly author, and host of The Backup Wrap-up podcast. I am now the Technology Evangelist at Sullivan Strickler, which helps companies manage their legacy data.

8 comments
  • Hi Curtis,

    Long time, no talk. Just curious about your claims that DD is a good solution for a 6 TB/night requirement. That would require that they are able to move data at their rated spec of 220 MB/sec, which we have never seen in the field. I just learned of a customer in Boston who just unplugged their box because it was slower than their previous tape backup!!

  • Quantum is mentioned in the table but not in the body of the article. Are they re-marketing someone else’s technology?

  • Note to readers: Although the comment reads as if “storagedoctor” and I know each other, I do not know his/her real identity, nor do I know whether he/she works for an end-user company or a Data Domain competitor. The verbiage of the comment would suggest the latter, so take his/her comments with a grain of salt.

    My response to the actual comment is this: While I haven’t seen them do 220, I have seen them do close. My original intent therefore was to say "around 6 TB." (I’ve changed it now to say 5-6 TB.) My point is that, while the throughput of some of these systems is not 1000s of megabytes per second, it’s still enough to meet a lot of people’s requirements. I know a lot of customers that back up far fewer than 6 TB a night. As with all things dedupe, your mileage may vary. Therefore you should test anything you buy.

  • Hi Curtis,

    Speaking with a lot of customers about dedupe and looking at all the products available, I think it is very important for bigger customers to focus on the possible single-stream speed.

    Putting it together, it seems that every VTL deduplication that can only work in-band (inline) offers a relatively slow single-stream speed (MB/s per virtual tape).
    The best real rate I have heard so far is around 25 MB/sec, but that system works with a huge number of FC disk drives, so it cannot be cheap.
    Most systems, at least those with a limited number of SATA disks, seem to have much lower MB/s rates per stream.

    I know of customers who already reach up to 100 MB/sec per stream (without tape multiplexing) for their large DB backups (from expensive high-end FC primary storage) to physical tape.

    But anyone who has such high single-stream speeds will be really disappointed if a VTL results in, for example, 4 times longer backup times for their large DBs.

    For these needs, a system with out-of-band (post-process) dedupe may be the right option.
    But then the 24-hour or 7-day dedupe rate for the post-process dedupe comes into play: the post-process dedupe should complete before the next window of full backups.

    And there is a need for some more disk space for intermediate storage.
    This seems to be a bit hard to calculate with the FalconStor design, as separate disk LUNs are used for the intermediate VTL storage and the final dedupe storage.

    Any other experience with that?

  • You must test single stream performance. I have seen faster numbers than you’re seeing, though, and I just saw that Data Domain is now advertising 200 MB/s single stream performance. If you’ve got very fast individual streams (without multiplexing) then I completely agree that you should test this (both backup and restore).

  • The ability to ingest fast streams is important, yes, but I have rarely seen this be a limitation of the backup target (whether it be a VTL, tape, etc.); it is normally the backup host's hardware architecture that is the problem. I have not seen a host server that can supply a single stream of data faster than a VTL or current-generation tape drive can handle.
    So is it really a valid test to measure the ingest speed of a single stream? Shouldn't we first measure the performance of the stream being delivered by the host?

  • You are right that the speed of the target is rarely the reason backups are slowed down, AND very few systems can create a single backup stream even close to 100 MB/s. BUT some CAN, and that's why we test this metric. In addition, when I say single stream I'm referring to backup, restore AND copy speed. One area where backup apps can easily create a 100+ MB/s stream is when copying a backup from disk to tape, something you may do a lot. You need to make sure that your dedupe system can keep up. What if you tested it and its single-stream throughput was only 10 MB/s? Wouldn't you be glad you tested it? (There's a quick bit of throughput math after the comments.)

    You should also test aggregate backup AND restore performance, ease of installation, replaced drive rebuild time, ease of use of user interface, and other things.
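Since several of the comments above turn on throughput math, here's a quick back-of-the-envelope sketch. The 220 MB/s figure is the rated spec quoted in the thread, the per-stream speeds are the ones discussed above, and the 2 TB database is a hypothetical size; as always, test with your own data.

```python
# Back-of-the-envelope math for the throughput points raised in the comments.
# Speeds are the figures quoted in the thread; the database size is hypothetical.

def tb_per_window(mb_per_sec: float, hours: float) -> float:
    """TB that can be ingested at a sustained MB/s over `hours` hours."""
    return mb_per_sec * 1e6 * hours * 3600 / 1e12

def hours_for(size_tb: float, mb_per_sec: float) -> float:
    """Hours to move size_tb TB as a single, non-multiplexed stream."""
    return size_tb * 1e12 / (mb_per_sec * 1e6) / 3600

# Rated ingest speed vs. backup window (the "6 TB/night" discussion).
for window in (8, 10, 12):
    print(f"220 MB/s for {window:2d} hours: ~{tb_per_window(220, window):.1f} TB")

# Single-stream speed vs. time for one large (hypothetical) 2 TB database backup.
for speed in (200, 100, 25, 10):
    print(f"2 TB at {speed:>3} MB/s per stream: ~{hours_for(2, speed):.1f} hours")
```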