Disclaimer

The opinions contained within this website, its blog(s), forums, and Wikis are those of the original poster and do not represent the position of my (or any other) employer. This blog is not owned by my employer nor does it officially represent any company.
Quantum's Dedupe: Inline or not?
Written by W. Curtis Preston   
Sunday, 24 February 2008
I kept reading stories like this one that said that Quantum's dedupe is inline.  Then I would hear from those "in the know" who said it was post-process.  Different people at Quantum would say different things.  Some would say that they run the dedupe at the same time as the ingest, so they considered it inline, although data is hitting disk before it's deduped.  They say that since it only hits disk for a few seconds, it's really inline.  I said, "No it's not."  So what's the scoop?  Read on to see.

If any data is written to any disk before it is deduplicated, then the vendor is using a post-process approach. In a response to my comment on the Byte & Switch story the Quantum spokesperson said that "de-duplication of ingested data typically will finish within seconds of ingest." That is a post-process approach.  "Within a few seconds of ingest" is not the same thing as "at the same time as ingest and before it is written to disk."
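
To make the distinction concrete, here is a minimal Python sketch of the two approaches as defined above. It is a toy model, not any vendor's implementation: the function names, the in-memory dictionary standing in for the deduplicated store, and the use of SHA-256 to identify blocks are all assumptions made for illustration. The only thing it tracks is the one that matters for the definition: how many raw, not-yet-deduplicated bytes landed on disk.

```python
import hashlib


def digest(block: bytes) -> str:
    """Identify a block by its SHA-256 hash (an illustrative choice)."""
    return hashlib.sha256(block).hexdigest()


def inline_ingest(blocks):
    """Dedupe in memory; raw data never lands on disk."""
    store = {}                 # digest -> unique block (the deduped store)
    raw_bytes_on_disk = 0      # stays at zero by construction
    for block in blocks:
        store.setdefault(digest(block), block)
    return store, raw_bytes_on_disk


def post_process_ingest(blocks):
    """Land every block on disk first, then dedupe what was staged."""
    staging = []               # hypothetical on-disk landing area
    raw_bytes_on_disk = 0
    for block in blocks:
        staging.append(block)              # raw write happens here
        raw_bytes_on_disk += len(block)
    store = {}
    for block in staging:                  # may run seconds or hours later
        store.setdefault(digest(block), block)
    return store, raw_bytes_on_disk


backup = [b"A" * 8, b"B" * 8, b"A" * 8, b"A" * 8]
inline_store, inline_raw = inline_ingest(backup)
post_store, post_raw = post_process_ingest(backup)
print(len(inline_store), inline_raw)   # 2 unique blocks, 0 raw bytes on disk
print(len(post_store), post_raw)       # 2 unique blocks, 32 raw bytes on disk
```

Both paths end up with the same deduplicated store. The difference is only whether raw data touched disk along the way, and how soon the second loop runs does not change which category the approach falls into.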

Please note: I am not saying post-process is bad. I'm merely saying that what the Quantum spokesperson is describing is not inline; it is post-process. Just because Quantum marketing calls it inline doesn't make it so.

Speaking of Quantum marketing, I kept getting different messages depending on whom I was speaking with. I was therefore given the chance to sit down and talk with Quantum's CTO about this issue, and he assures me that the DXi7500 will do true inline deduplication -- data will not hit the disk until it has been deduplicated -- but only at speeds significantly slower than the 7500's advertised ingest rates. Once it passes a certain ingest rate (~100-160 MB/s), it switches to post-process dedupe, which runs while the data is still coming in -- making it asynchronous.

Therefore, I say that if you are using the DXi7500 at anywhere near its advertised ingest rates, it is using a post-process approach. If you're staying under 150 MB/s, it would be inline -- but why would you buy a device that could go that fast and run it that slow?
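
The rate-dependent behavior described above can be summarized in a few lines of Python. This is only a sketch of the rule as it was explained to me, not Quantum's logic: the function name is hypothetical, the 150 MB/s default is an assumption taken from the middle of the ~100-160 MB/s range quoted above, and the sample rates are made up.

```python
def dedupe_mode(ingest_mb_per_s: float, inline_limit_mb_per_s: float = 150.0) -> str:
    """Return which dedupe approach applies at a given ingest rate.

    inline_limit_mb_per_s is an assumed cutoff; the real switch point
    was described only as roughly 100-160 MB/s.
    """
    if ingest_mb_per_s <= inline_limit_mb_per_s:
        return "inline"
    return "post-process"


# Hypothetical ingest rates, in MB/s
for rate in (80, 150, 400):
    print(rate, dedupe_mode(rate))
```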

 

Comments
Hummdis - Why Use DeDuplication?     | 24.56.11.xxx | 2008-04-18 22:32:11
The only main question that I have is "why would you use DeDuplication in the first place?"

So you have 100 1MB email files that are exactly the same. Only one instance is saved and the other 99 point to the first file. What happens when you lose the main file to corruption or by some other means? Instead of having 99 others you can recover from, you don't have anything at all.

I guess from a backup perspective, I don't see the logic in this process.
tburrell   | 150.228.40.xxx | 2008-04-21 11:32:02
"put all your eggs in one basket, and WATCH THAT BASKET!"- Mark Twain (Or maybe Andrew Carnegie- some debate on that)

Oversimplified, but that's the theory- you can't just turn on De-Dupe without understanding that you MUST protect your database and datastore. Like many strategies, it's best used as part of a multilayered system for "watching the basket"- I use De-Dupe for on-site storage of backups and replication of remote sites, which we then spin to tape for DR purposes, and some select datasets are spun off to tape independently to accommodate various special requirements. The point is it can really help you save bandwidth, datacenter floor space, power and tape admin overhead. We protect 35 sites all over North America with one tape library allocated at our main datacenter thanks to De-Dupe. Two of us used to have a Friday call list to get the receptionists to change tapes (which worked about 4 out of 10 times), now I do it alone in about 10 minutes a day of making sure the jobs ran. The key again is to WATCH THE BASKET!
Hummdis - Forgetting Something?     | 64.140.176.xxx | 2008-04-21 14:02:50
I understand all of that, but it still doesn't answer what happens should you lose the main file. If it's gone, all of the other files that point to it are now gone as well.

You can put all of your eggs in one basket and watch it, that makes it easy to monitor, but you still run a very high risk of losing everything in it if it falls.

I'm thinking of The Italian Job...instead of putting all of the gold in one truck, put it in three separate trucks. Then when the thieves steal the one truck, you still have 2/3 of the gold. This would have also then passed the rear tire height that tipped them off as to which truck the gold was in because the rear tires would have all been the same height.

I'm not saying that you have to use 10, 20, or 50 different methods. What I'm saying is: how do you avoid a complete catastrophe if you've loaded all of your cargo onto one ship and it sinks, instead of using two or three smaller ships? If one goes, you still have 2/3 of your data available. It's better to get back 2/3 of your data than nothing at all, isn't it?

Alright, enough metaphor talk for today. :-)
cpreston - Right. That's why you replicate!   | Super Administrator | 2008-04-21 14:19:55
Dedupe does not promote NOT copying your data. It just gets rid of duplicate data within each copy. Replicate it! Copy it to tape! Do whatever you did before! Dedupe will eliminate redundant blocks within each copy.
tburrell - And that's my point...   | 150.228.40.xxx | 2008-04-21 15:03:52
You can't just dedupe it and call it a day- you have to protect your backup too. But 3 de-duped copies are still smaller than 3 full copies. In our case, if we lose the "main file", we have a copy of the de-duped data and the database on at least 2 different tapes- that can be restored just like any other backup. Still lets me get >6 TB of protected data on 2TB of space, 3 times!
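
The exchange above (one stored instance, many pointers, and a second copy kept elsewhere) can be illustrated with a toy Python sketch. Everything here is hypothetical: the `DedupeStore` class, the 4096-byte block size, the file names, and the sample message are invented for illustration and do not describe any product. It shows both halves of the argument: losing one shared block makes every file that points to it unreadable in that store, and an independent deduped replica still restores all of them.

```python
import hashlib


class DedupeStore:
    """Toy content-addressed store: each unique block is kept once."""

    def __init__(self):
        self.blocks = {}   # digest -> block bytes
        self.files = {}    # file name -> ordered list of digests

    def put(self, name, data, block_size=4096):
        digests = []
        for i in range(0, len(data), block_size):
            block = data[i:i + block_size]
            key = hashlib.sha256(block).hexdigest()
            self.blocks.setdefault(key, block)   # store only if new
            digests.append(key)
        self.files[name] = digests

    def get(self, name):
        # Raises KeyError if any referenced block is missing.
        return b"".join(self.blocks[d] for d in self.files[name])


def restore(store, name):
    """Return the file contents, or None if the store cannot rebuild it."""
    try:
        return store.get(name)
    except KeyError:
        return None


primary, replica = DedupeStore(), DedupeStore()
message = b"quarterly report " * 1000        # the same email, sent 100 times
for i in range(100):
    primary.put(f"mail-{i}.eml", message)
    replica.put(f"mail-{i}.eml", message)

# Simulate corruption: the primary loses one shared block.
del primary.blocks[next(iter(primary.blocks))]

broken = sum(restore(primary, name) is None for name in primary.files)
intact = sum(restore(replica, name) == message for name in replica.files)
print(f"primary: {broken} of 100 files unreadable")
print(f"replica: {intact} of 100 files restored")
```

Each store holds the unique blocks of the message once rather than 100 times, so keeping two or three such copies still takes far less space than the same number of full copies.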
nspring   | Registered | 2008-04-22 10:59:45
Hummdis: I believe you're confusing single instance storage (like Exchange does with attachments) with global namespace, block-level deduplication. Think of how many C drives there are in a Windows guest OS VMware environment; wouldn't it be nice to do full image-level backups of each VM and store it with all the redundant blocks de-duped out?

In-line vs post: any inline de-dupe needs a way to examine the data stream as it's being ingested. Quantum's process is variable block with a sliding window. It needs some sort of cache to do that successfully. Compare this to what NetApp does with A-SIS wherein ALL the data must land on disk, and then a process kicks in on a scheduled basis and duplicate blocks are removed.
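
For readers unfamiliar with "variable block with a sliding window", here is a minimal Python sketch of the general technique, usually called content-defined chunking. It is not Quantum's algorithm: the 16-byte window, the `0x3F` mask, and the use of `zlib.crc32` recomputed over each window are assumptions chosen for brevity (a real implementation would use a rolling hash that updates incrementally). The idea is that a block boundary is placed wherever the hash of the last few bytes matches a pattern, so boundaries follow the content rather than fixed offsets.

```python
import random
import zlib


def chunk(data: bytes, window: int = 16, mask: int = 0x3F) -> list:
    """Split data wherever the hash of the last `window` bytes has its
    low bits all zero (about one cut per 64 bytes on random data)."""
    chunks, start = [], 0
    for end in range(window, len(data) + 1):
        if zlib.crc32(data[end - window:end]) & mask == 0:
            chunks.append(data[start:end])
            start = end
    if start < len(data):
        chunks.append(data[start:])      # trailing partial chunk
    return chunks


rng = random.Random(0)
original = bytes(rng.randrange(256) for _ in range(4096))
shifted = b"!" + original                # one byte inserted at the front

old_chunks = chunk(original)
new_chunks = chunk(shifted)
shared = set(old_chunks) & set(new_chunks)
print(f"{len(shared)} of {len(old_chunks)} variable-size chunks unchanged")
```

Because each cut depends only on the bytes inside the window, inserting a byte at the front disturbs the first chunk and leaves every later chunk identical, so those chunks still dedupe against the earlier backup. With fixed-size blocks, the same one-byte insertion would shift every block boundary. This also shows why some buffer is needed: the window has to be examined before a boundary can be decided.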
hutch - Inline vs. Post BS   | 12.175.145.xxx | 2008-05-27 13:37:51
I think the marketing from Data Domain is pretty interesting. They want you to believe that post-process deduplication should be considered part of the total backup window which is just plain wrong.

If you can back up to a VTL at 2000+ MB/s and complete the de-dupe concurrently without impacting production, it is going to be many times faster than backing up inline at 170 MB/s (their numbers).

What Data Domain doesn't want to admit is that vendors that do post-process (or concurrent) deduplication achieve a shorter backup window. And for the analysts and the Data Domain marketing people who don't understand that: your backup window is defined by the amount of time that a backup consumes production resources and network resources. Companies buy VTLs to reduce their backup window and to make recovery faster.
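
The arithmetic behind the comment above can be worked out with a short Python sketch. The two ingest rates are the ones quoted in the comment; the 10 TB dataset size is a made-up figure for illustration, and the function name is hypothetical. It counts only ingest time, which is the comment's definition of the backup window, and ignores whatever time the dedupe pass takes afterward.

```python
def ingest_hours(dataset_tb: float, rate_mb_per_s: float) -> float:
    """Hours needed to ingest dataset_tb at rate_mb_per_s (1 TB = 1,000,000 MB)."""
    seconds = dataset_tb * 1_000_000 / rate_mb_per_s
    return seconds / 3600


dataset_tb = 10                              # hypothetical nightly backup size
fast = ingest_hours(dataset_tb, 2000)        # post-process VTL rate from the comment
slow = ingest_hours(dataset_tb, 170)         # inline rate from the comment
print(f"2000 MB/s: {fast:.1f} h, 170 MB/s: {slow:.1f} h, ratio {slow / fast:.1f}x")
```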