Quantum's Dedupe: Inline or not? (Updated 7/08)

I kept reading stories like this one that said that Quantum’s dedupe is inline. Then I would hear from those “in the know” who said it was post-process. Different people at Quantum would say different things. Some would say that they run the dedupe at the same time as the ingest, so they considered it inline, even though data hits disk before it’s deduped. They say that since it only hits disk for a few seconds, it’s really inline. I said, “No, it’s not.” So what’s the scoop? Read on to see.

If any data is written to any disk before it is deduplicated, then the vendor is using a post-process approach. In a response to my comment on the Byte & Switch story, the Quantum spokesperson said that “de-duplication of ingested data typically will finish within seconds of ingest.” That is a post-process approach. “Within a few seconds of ingest” is not the same thing as “at the same time as ingest and before it is written to disk.”

Please note: I am not saying post-process is bad. I’m merely saying that what the Quantum spokesperson is describing is not inline; it is post-process. Just because Quantum marketing calls it inline doesn’t make it so.

Speaking of Quantum marketing, I kept getting different messages depending on whom I was speaking to. So I was given the chance to sit down and talk with Quantum’s CTO about this issue, and he assures me that the DXi7500 will do true inline deduplication (data will not hit the disk until it has been deduplicated), but only at speeds significantly slower than the 7500’s advertised ingest rates. Once it passes a certain ingest rate (~100-160 MB/s), it switches to post-process dedupe, and that post-process dedupe happens as the data is coming in, making it asynchronous.

Therefore, I say that if you are using the DXi7500 at anywhere near its advertised ingest rates, it is using a post-process approach. If you’re staying under 150 MB/s, it would be inline, but why would you buy a device that can go that fast and run it that slow?

Update: Some read this blog entry and thought that I was saying that the DXi7500 only dedupes at 150 MB/s. That would not match what they have told me and is NOT what this blog entry was trying to say. I was merely attempting to clarify some ambiguity in what I was being told about whether Quantum does inline dedupe or not. The short answer is that if the ingest rate is less than 150 MB/s, they are using inline dedupe. If it is greater than 150 MB/s, they use post-process dedupe. They refer to this as adaptive dedupe, as it adapts to incoming conditions. They also offer scheduled dedupe, which runs entirely after the backups; they dedupe faster when they’re deduping outside the backup window.
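
To make the distinction concrete, here is a minimal Python sketch of how an adaptive policy like the one described above might route data. This is my own illustration, not Quantum’s code; the 150 MB/s threshold is the figure Quantum gave me, and every name and structure in it is assumed:

    # Toy sketch of adaptive dedupe: inline below a threshold, post-process above it.
    INLINE_THRESHOLD_MBPS = 150   # figure quoted by Quantum; everything else is made up

    def route_chunk(chunk, ingest_rate_mbps, dedupe_store, raw_backlog):
        """Dedupe before writing when ingest is slow enough; otherwise land the
        chunk on disk first and queue it for post-process dedupe."""
        if ingest_rate_mbps <= INLINE_THRESHOLD_MBPS:
            if chunk not in dedupe_store:      # inline: only unique data ever hits disk
                dedupe_store.append(chunk)
            return "inline"
        raw_backlog.append(chunk)              # post-process: raw data hits disk first,
        return "post-process"                  # dedupe follows seconds (or hours) later

    store, backlog = [], []
    print(route_chunk(b"block-A", 100, store, backlog))   # -> inline
    print(route_chunk(b"block-A", 400, store, backlog))   # -> post-process

Scheduled dedupe, as described above, would simply defer draining that backlog until the backups finish.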

So how much data can the DXi7500 ingest in a day and still dedupe it before the next day’s backup? They say they can dedupe 1.6 TB/hr (444 MB/s) if they’re deduping during the backup window and 2 TB/hr (555 MB/s) if they’re deduping outside the backup window. If we assume a typical 12-hr backup window and calculate 1.6 TB/hr during the window and 2 TB/hr outside the window, it can dedupe 43.2 TB a day. This means they could ingest data at 3.6 TB/hr (1000 MB/s) for 12 hours and still dedupe it before the next day. Since this is less than their advertised ingest rate of 8 TB/hr, they should be able to do it. (This is, of course, leaving no room for error or maintenance, but I’m not sure what figures to put in for that.) If you ingest data at their advertised ingest rate of 8 TB/hr (2222 MB/s), you could only do that for about 5.4 hours and still dedupe it all in a day.
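
For anyone who wants to check my math, here is the arithmetic spelled out (a quick Python back-of-the-envelope using the vendor figures quoted above and my assumed 12-hour window):

    # Back-of-the-envelope check of the numbers above.
    during_window_tb_hr  = 1.6    # dedupe rate while backups are running
    outside_window_tb_hr = 2.0    # dedupe rate outside the backup window
    window_hours         = 12     # assumed backup window

    daily_dedupe_tb = (during_window_tb_hr * window_hours
                       + outside_window_tb_hr * (24 - window_hours))
    print(daily_dedupe_tb)                  # 43.2 TB/day
    print(daily_dedupe_tb / window_hours)   # 3.6 TB/hr sustainable for 12 hours
    print(daily_dedupe_tb / 8)              # ~5.4 hours at the 8 TB/hr rate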

I hope this clears up any confusion.

Written by W. Curtis Preston (@wcpreston), four-time O'Reilly author, and host of The Backup Wrap-up podcast. I am now the Technology Evangelist at Sullivan Strickler, which helps companies manage their legacy data.

10 comments
  • The main question that I have is “why would you use deduplication in the first place?”

    So you have 100 1MB email files that are exactly the same. Only one instance is saved and the other 99 point to the first file. What happens when you lose the main file to corruption or by some other means? Instead of having 99 others you can recover from, you don’t have anything at all.

    I guess from a backup perspective, I don’t see the logic in this process.

  • “put all your eggs in one basket, and WATCH THAT BASKET!”- Mark Twain (Or maybe Andrew Carnegie- some debate on that)

    Oversimplified, but that’s the theory- you can’t just turn on De-Dupe without understanding that you MUST protect your database and datastore. Like many strategies, it’s best used as part of a multilayered system for “watching the basket”- I use De-Dupe for on-site storage of backups and replication of remote sites, which we then spin to tape for DR purposes, and some select datasets are spun off to tape independently to accommodate various special requirements. The point is it can really help you save bandwidth, datacenter floor space, power and tape admin overhead. We protect 35 sites all over North America with one tape library allocated at our main datacenter thanks to De-Dupe. Two of us used to have a Friday call list to get the receptionists to change tapes (which worked about 4 out of 10 times), now I do it alone in about 10 minutes a day of making sure the jobs ran. The key again is to WATCH THE BASKET!

  • I understand all of that, but it still doesn’t answer what happens should you lose the main file. If it’s gone, all of the other files that point to it are now gone as well.

    You can put all of your eggs in one basket and watch it, which makes it easy to monitor, but you still run a very high risk of losing everything in it if it falls.

    I’m thinking of The Italian Job… instead of putting all of the gold in one truck, put it in three separate trucks. Then when the thieves steal one truck, you still have 2/3 of the gold. It would also have gotten past the rear-tire height that tipped them off as to which truck the gold was in, because the rear tires would all have been the same height.

    I’m not saying that you have to use 10, 20, or 50 different methods. What I’m saying is: how do you avoid a complete catastrophe if you’ve loaded all of your cargo onto one ship and it sinks, instead of using two or three smaller ships? If one goes, you still have 2/3 of your data available. It’s better to get back 2/3 of your data than nothing at all, isn’t it?

    Alright, enough metaphor talk for today. 🙂

  • Dedupe does not promote NOT copying your data. It just gets rid of duplicate data within each copy. Replicate it! Copy it to tape! Do whatever you did before! Dedupe will eliminate redundant blocks within each copy.

  • You can’t just dedupe it and call it a day- you have to protect your backup too. But 3 de-duped copies are still smaller than 3 full copies. In our case, if we lose the “main file”, we have a copy of the de-duped data and the database on at least 2 different tapes- that can be restored just like any other backup. Still lets me get >6 TB of protected data on 2TB of space, 3 times!

  • Hummdis: I believe you’re confusing single-instance storage (like Exchange does with attachments) with global-namespace, block-level deduplication. Think of how many C: drives there are in a VMware environment full of Windows guests; wouldn’t it be nice to do full image-level backups of each VM and store them with all the redundant blocks de-duped out?

    In-line vs. post: any inline de-dupe needs a way to examine the data stream as it’s being ingested. Quantum’s process is variable-block with a sliding window. It needs some sort of cache to do that successfully. Compare this to what NetApp does with A-SIS, wherein ALL the data must land on disk, and then a process kicks in on a scheduled basis and duplicate blocks are removed.
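
    To make that concrete, here is a toy Python sketch of variable-block, sliding-window dedupe. It is only an illustration: a real product would use a cheap rolling hash (e.g., a Rabin fingerprint) rather than re-hashing the window at every byte, and none of the names or parameters below come from Quantum or NetApp.

        # Toy content-defined (variable-block) chunking with a sliding window.
        import hashlib

        WINDOW, DIVISOR, MIN_CHUNK, MAX_CHUNK = 16, 256, 64, 4096   # made-up parameters

        def chunks(data: bytes):
            """Cut chunk boundaries based on the content seen in a sliding window."""
            start = 0
            for i in range(len(data)):
                window = data[max(0, i - WINDOW + 1):i + 1]
                fp = int.from_bytes(hashlib.sha1(window).digest()[:4], "big")
                size = i + 1 - start
                if (fp % DIVISOR == 0 and size >= MIN_CHUNK) or size >= MAX_CHUNK:
                    yield data[start:i + 1]
                    start = i + 1
            if start < len(data):
                yield data[start:]

        def dedupe(data: bytes, store: dict) -> list:
            """Keep one copy of each unique chunk; return the recipe of fingerprints."""
            recipe = []
            for chunk in chunks(data):
                digest = hashlib.sha256(chunk).hexdigest()
                store.setdefault(digest, chunk)   # only unseen chunks consume space
                recipe.append(digest)
            return recipe

    Because the boundaries depend on content rather than fixed offsets, inserting a few bytes near the front of a file only changes the chunks around the edit, and everything downstream still dedupes against what is already in the store.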

  • I think the marketing from Data Domain is pretty interesting. They want you to believe that post-process deduplication should be considered part of the total backup window, which is just plain wrong.

    If you can back up to VTL at 2000+ MB/s and complete the de-dupe concurrently where it does not impact production, it is going to be multiple times faster than backing up inline at 170 MB/s (their numbers).

    What Data Domain doesn’t want to admit is that vendors that do post-process (or concurrent) deduplication have a shorter backup window. And for the analysts and the Data Domain marketing people who don’t understand that: your backup window is defined by the amount of time that a backup consumes production resources and network resources. Companies buy VTLs to reduce their backup window and to make recovery faster.

  • [quote name=hutch]
    What Data Domain doesn’t want to admit is that vendors that do post-process (or concurrent) deduplication have a shorter backup window. And for the analysts and the Data Domain marketing people who don’t understand that: your backup window is defined by the amount of time that a backup consumes production resources and network resources. Companies buy VTLs to reduce their backup window and to make recovery faster.[/quote]
    It is all about whether you consider your data safe when it is just off the production server, or only when it is off-site. What is your RPO? Is having it replicated off-site almost as fast as it is backed up important? If not, VTL with post-process is for you.

  • A little late (yes it’s now 03/09) but I found this interesting as I’m looking at both EMC/Quantum & Data Domain as a potential data de-duplication solution. Assume “inline” and NAS deployment. Regarding the EMC solution, if the threshold is reached where the process goes to post-processing (~150 MB/s), when does the data get replicated? After it’s de-duped? If the system stays busy, is there a concern that the data may not get replicated in a timely manner (yes, dependent on requirements)? Something I also read on EMC’s Powerlink site (DL3D Best Practices Guide) is that the data is also retained in a “native” format, basically an un-de-duped (is that a word?) format, for fast recoveries. Doesn’t that go against the whole philosophy of reducing storage footprint? There is also a comment regarding a 70% (capacity-based) threshold at which a “truncate” process starts running. What is the impact of this truncate process? If I have a system continually running at 70-75%, this threshold could be met all the time.

  • What effect does the inline/post-process debate have on replication?

    Ignore the “is it inline or post-process” debate. How each product does what it does is less important than what the numbers are.

    Every dedupe device has an ingest and dedupe rate, and every appliance can only replicate data that’s been deduped. With a true inline product (like DD or IBM/Diligent), the ingest and dedupe rates are the same. With Quantum’s box, you can dedupe while data’s coming in (i.e., adaptive) or dedupe later (i.e., deferred). When you do adaptive dedupe, there’s still a max rate at which they can do that, and I’d say we can compare it side by side to an inline number. (I wouldn’t say that of the deferred number. They can ingest faster and dedupe faster if you defer dedupe processing until after you’re all done ingesting, but the overall process of ingesting plus deduping takes longer.)

    The Quantum 7500’s adaptive ingest & dedupe rate is 1.8 TB/hr, or 500 MB/s. Assuming a 20:1 dedupe ratio, a 500 MB/s ingest/dedupe rate will generate about 25 MB/s of data to replicate. I’m not sure what your link speed is, but many people can’t replicate that fast.
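
    Here is that replication math spelled out, in case you want to plug in your own numbers (decimal TB and the 20:1 ratio above assumed):

        # Quick check of the replication math above.
        adaptive_rate_tb_hr = 1.8                                # Quantum 7500 adaptive rate
        ingest_mb_s = adaptive_rate_tb_hr * 1_000_000 / 3600     # ~500 MB/s
        dedupe_ratio = 20                                        # assumed 20:1
        replicate_mb_s = ingest_mb_s / dedupe_ratio              # ~25 MB/s to push off-site
        print(round(ingest_mb_s), round(replicate_mb_s))         # 500 25

    And 25 MB/s is roughly a 200 Mbit/s link, which is more WAN bandwidth than many shops have.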

    Can you go faster? Maybe. The Data Domain 690 now has a maximum ingest/dedupe rate of 2.7 TB/hr, or 750 MB/s, but only if you use NBU & OST. If you’re not an NBU customer, then I think the DD690’s ingest/dedupe rate is about the same as the 7500’s.

    Short answer: it doesn’t look like using Quantum’s adaptive dedupe on the 7500 will affect the speed at which you replicate backups, unless you’re an NBU OST customer, in which case DD might be faster.

    What’s up with storing it in native format? Doesn’t that interfere with cost savings?

    Yes, it absolutely messes with cost savings. The real question you should ask is why they do that. The answer is that restoring from the block pool (what they call the deduped data) is often significantly slower than restoring from the native data. Depending on a number of factors, it could be 75% slower. THAT’S why they store the data in original format as long as they can: to avoid having to restore from the block pool.

    What’s the impact of truncation?

    First, truncation is an I/O process that may have an effect on the system while it’s running, but it shouldn’t run often. Second, anything truncated will have to be restored from the block pool — see my previous answer.

    BTW, they are working on this: they have made improvements this month, more are expected next month, and even more in the summer (hopefully).