|
Customers and audience members often ask about the maturity of deduplication. Is it mature? Are all products mature? Should you buy it now or wait? I thought that would make a nice blog entry.
This blog entry is one in a series on deduplication. The previous entry was about the real odds of hash collisions.

On one hand, immaturity and backups don't go well together. We like our backup systems to be stable. They're the backup, after all. On the other hand, backup is often "pushing the envelope," as the requirements we're given are constantly changing. I would argue that there has been much more innovation in the backup space than in the rest of the storage industry. Technologies that really solve problems tend to get adopted rather quickly. Think about the ease with which many customers started using multiplexing, LAN-free backups, media/device servers, centralized tape libraries, etc.

CDP didn't make as big of a splash, and I would argue that this was not because backup people don't innovate. I would argue it was for two reasons: their requirements didn't drive them to it, and it was just too big of a pill to swallow. CDP is the only method that can get you an RPO of 0. How many people have apps that have an RPO of 0? Not many. One hour, 15 mins, maybe. But 0? And I can do 15 mins with more traditional methods. As to the big pill issue, CDP uses a completely different paradigm to get the data from here to there.

Did I digress? De-dupe is just the latest in the list of innovative backup techniques. So don't be afraid of it just because it's new. Just be aware that it's new and treat it accordingly. I'll talk about both dedupe software and hardware. Both have companies that have been doing this for a while, and both have companies that are just now adding dedupe to their bag of tricks.

Dedupe software

First let's talk about dedupe software (e.g. Avamar, Puredisk, Asigra, Evault, etc.). To use these products, you uninstall your current backup software on the client you are going to protect and install the dedupe software. That client then backs up to your dedupe backup server instead of your regular backup server.
Asigra is the first mover here, as they've been doing it for several years. Avamar is probably the most recognized product, especially since it was acquired by EMC. Puredisk is Symantec's offering and is the result of an acquisition of Datacenter Technologies. Symantec's been selling it for about a year now. Its core technology is much older, but I don't think it was actually available to end users. Evault has been around for several years and also has a big customer base, and they've recently added dedupe to their arsenal.

Most of these products have had many customers using them for quite some time and can give you lots of references. I've personally worked with customers using all of them (and been a customer of two of them), and can say that they all appear to actually work, but each of them has limitations that you must address in your design. If you don't address those limitations you will be an unhappy camper.

Dedupe hardware

Now let's talk about the products that dedupe inside a disk target (NAS or VTL). This is a bit harder, as a lot of these companies have been talking about having dedupe a lot longer than they've actually had dedupe. Some of them still don't have dedupe, and might not have it until Q4 2008! One vendor has said that it does not plan to do deduped storage at all. I'll only talk about the information that I have that I can share. (Obviously, I have some NDA information that I can't share.) This information comes from (and will also be updated in) the disk target product directory in the Backup Central Wiki.

Dedupe Vendor
There are currently seven providers of deduplicated target storage: Data Domain, Diligent, Exagrid, FalconStor, NEC, NetApp, & SEPATON. Everybody else either has no dedupe or is reselling/OEMing products from these companies. The first mover here is, of course, Data Domain. They have been shipping dedupe target devices for several years, and have well over 1000 customers using their products. From a dedupe installed base perspective, everyone else pales in comparison. The first fast follower to ship was Diligent, and they've been shipping for somewhere between one and two years now. In order of when a GA dedupe product first shipped to customers, the other fast followers would be Exagrid, NEC, FalconStor & SEPATON. (This is a rough approximation based on multiple sources.)

EMC, HP, IBM, & Sun are not talking publicly about their dedupe plans, but you can bet your money that they're working on it. Sources suggest that one or more of these vendors is developing their own dedupe product (Good Luck!), while others are testing the dedupe capabilities of the product that they OEM (i.e. FalconStor or SEPATON). The only major OEMs to have a shipping dedupe product are HDS & NetApp. HDS' product is based on Diligent, and NetApp's WAFL-based dedupe is their own product. Summary: as long as you don't have to buy something from EMC, IBM, or Sun, there are several dedupe VTLs to choose from.

Dedupe Domain

If you back up less than 5-6 TB a night, you don't need to worry about this category, as you can back up 5-6 TB in an eight-hour backup window (or 8-9 TB with a 12-hour window) with all but one of these solutions. (The Overland unit is aimed at a different market and can handle about 4.3 TB in a 12-hour window.) If you are backing up significantly more than 5-9 TB a night, then you should be aware of this category. If the dedupe domain says "Single head," then data coming into a given head is only compared to other data that came into that head.
A multiple head system will compare data that came into any head with data that came into all other heads. If you use multiple heads of a product with a single-headed dedupe domain, you will need to direct a given set of backups to only one head in order to gain a good dedupe ratio. You should not, for example, point the backup of a given database or filesystem to two heads for performance/load balancing reasons. Doing that will reduce your overall deduplication ratio, as the backup sent to head A will not be compared against the backup sent to head B. A multi-head system would allow you to send backups to any head, and have those backups compared against all other backups sent to all other heads. The idea is that this both reduces the complexity of your design and increases your effective deduplication ratio.

While some single-headed vendors attempt to minimize the importance of this (IMHO very important) feature, you can rest assured that any vendor who doesn't have it is currently working on adding it. Having said that it's important, it's also important to note that the two vendors to offer this feature are the latest vendors to join the party (FalconStor & SEPATON). So if you think this feature is important, you'll be looking at adopting some of the newer technology out there. (I'm not saying don't do it, of course. I'm just saying to test the heck out of it, just as you would with anything.) If "time-in-service" is more important to you, you might want to figure out how not to need this feature. ;)

Dedupe Replication
If you're doing cross-campus replication where you have full LAN speeds, they can all replicate that. However, if they can replicate the data after it's been deduplicated, they can also replicate your backups across a WAN. Data Domain has had this feature for several years. The Diligent product is a bit different, in that it doesn't do the replication for you; however, you can use any replication product to replicate its deduplicated data. FalconStor's dedupe-based replication shipped with their dedupe software.

All current vendors replicate the same barcodes (VTL) or filenames (NAS) that the backup software writes to the target devices. The backup software product is therefore not aware of the second copy, as it can't understand how a single file or barcoded tape can be in two places at once. (Think about it: the same tape cannot be in two tape libraries at the same time.) The replicated copy can be used in a DR scenario by using an alternate master, recovering the backup software's catalog/database of backups, and telling it to inventory the NAS system or VTL. However, it can't be used for operational backup & recovery. Each vendor has a different answer as to how they handle this particular issue.

SEPATON is trailing this race, as its replication currently does not support replicating the deduplicated data (they can replicate before dedupe, which requires significantly more bandwidth). They're saying this feature will be coming in Q2 2008.

That's my best attempt to summarize the state of things today (December 2007). Here's a table summarizing these features.
| Vendor/Product | Dedupe Vendor | VTL, NAS, or Local | De-dupe | Dedupe Domain | Dedupe Replication |
| --- | --- | --- | --- | --- | --- |
| COPAN | Falconstor | VTL & NAS | Yes | Multiple heads | Yes |
| Data Domain | Data Domain | VTL & NAS | Yes | Single head | Yes |
| Diligent ProtecTier | Diligent | VTL | Yes | Single head | Can replicate deduped bytes using product of your choice |
| EMC EDL | Not announced (VTL is Falconstor) | VTL | No | N/A | N/A |
| Exagrid | Exagrid | NAS | Yes | Single head | Yes |
| Falconstor | Falconstor | VTL | Yes | Multiple heads | Yes |
| Gresham Clarita VTL | Will not be doing deduped storage | VTL | N/A | N/A | N/A |
| HDS VTL | Diligent | VTL | Yes | Single head | Can replicate deduped bytes using HDS replication products |
| HP VTL | Not announced (VTL is SEPATON) | VTL | No | N/A | N/A |
| IBM VTL | Not announced (VTL is Falconstor) | VTL | No | N/A | N/A |
| NEC Hydrastor | NEC | NAS | Yes | Multiple heads | Yes |
| NetApp Nearstore VTL | Not announced | VTL | No | N/A | N/A |
| NetApp NearStore | NetApp (WAFL based) | NAS | Yes | Single Flex-vol | Yes |
| Overland Reo | VTL is Overland, Dedupe is Diligent | VTL | Yes | Single head | No |
| Quantum DXi | Quantum | VTL | Yes | Single head | Yes |
| SEPATON | SEPATON | VTL | Yes | Multiple heads | Q208 (Non-deduped replication available now) |
| PureDisk Storage Unit (Only works with NBU) | Symantec | LFS | Yes | Single head | No |
| Sun StorageTek VTL | Not announced | VTL | Yes | Multiple heads | N/A |
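
If you want to see why the "Dedupe Domain" column matters, here's a minimal sketch in Python. It's a toy model, not how any of the products above actually work: the fixed 4 KB chunks, the SHA-1 fingerprints, the `Head` class, the `nightly_backup` helper, and the 5% nightly change rate are all assumptions I made up for illustration. Each head keeps an index of chunk fingerprints; a single-head dedupe domain means each head has a private index, and a multiple-head domain is modeled as one index shared by all heads.

```python
import hashlib

CHUNK_SIZE = 4096  # hypothetical fixed chunk size; real products differ


class Head:
    """One ingest head. A private index models a single-head dedupe
    domain; passing in a shared set models a multiple-head domain."""

    def __init__(self, index=None):
        self.index = set() if index is None else index
        self.stored_bytes = 0

    def ingest(self, data):
        # Split into fixed-size chunks and store only unseen fingerprints.
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            fingerprint = hashlib.sha1(chunk).digest()
            if fingerprint not in self.index:
                self.index.add(fingerprint)
                self.stored_bytes += len(chunk)


def nightly_backup(night):
    """Full backup of a 100-chunk 'database'; every night after the
    first, 5 more chunks have changed (changes accumulate)."""
    ids = list(range(100))
    for n in range(1, night + 1):
        for j in range(5):
            ids[(n - 1) * 5 + j] = 1000 + (n - 1) * 5 + j
    # Each chunk is its 8-byte id repeated to fill CHUNK_SIZE bytes.
    return b"".join(i.to_bytes(8, "big") * (CHUNK_SIZE // 8) for i in ids)


def dedupe_ratio(nights, heads, route):
    """Send each night's backup to heads[route(night)] and return
    bytes ingested divided by bytes actually stored."""
    ingested = 0
    for night in range(nights):
        data = nightly_backup(night)
        heads[route(night)].ingest(data)
        ingested += len(data)
    stored = sum(h.stored_bytes for h in heads)
    return ingested / stored


# Single-head domain, backups pinned to one head.
pinned = dedupe_ratio(4, [Head(), Head()], lambda n: 0)

# Single-head domain, backups alternated between heads for load balancing.
split = dedupe_ratio(4, [Head(), Head()], lambda n: n % 2)

# Multiple-head domain: both heads share one fingerprint index.
shared = set()
multi = dedupe_ratio(4, [Head(shared), Head(shared)], lambda n: n % 2)

print(f"pinned to one head: {pinned:.2f}:1")
print(f"split across heads: {split:.2f}:1")
print(f"multi-head domain:  {multi:.2f}:1")
```

With these toy numbers, four nights of backups pinned to one head dedupe at roughly 3.5:1, while alternating the same backups between two single-headed units drops that to roughly 1.8:1, because each head has to store its own copy of the unchanged data. Sharing the index gets the pinned ratio back no matter which head a backup lands on. The absolute numbers mean nothing for real workloads; the point is only the direction of the effect.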
|