What do YOU want to ask the dedupe vendors?

As I started working on making sure my information on all the dedupe vendors was up to date, I thought about you!  What have you always wanted to ask the dedupe vendors?
As this space is changing daily, I’m going to be speaking with each of the dedupe vendors to make sure my information is up to date.  I also plan on presenting them with the FUD that I hear about them and seeing how they respond to it.  (There’s nothing like frank and open dialogue to increase understanding…)

Is there something you’ve always wanted to ask a dedupe vendor?
Is there a piece of FUD you’ve heard about one of them that you’d like to know is true or not?

The best way to give me this info is to send me a private note by clicking on the “Contact Curtis” link in the menu.  If what you have is unsubstantiated FUD, PLEASE send it privately rather than posting it as a comment on the site.  I don’t want to ADD to the confusion or start a flame war, although there might be one when the stories come out.  If it’s just a basic question, then go ahead and post it as a comment.

Written by W. Curtis Preston (@wcpreston), four-time O'Reilly author, and host of The Backup Wrap-up podcast. I am now the Technology Evangelist at Sullivan Strickler, which helps companies manage their legacy data.

8 comments
  • Which vendors deduplicate primary storage?
    Can CIFS / NFS users access data directly from a Data Domain box?

    Thanks and Regards,
    Joe

  • The easy answer is that only NetApp dedupes primary storage today, but there is a vendor that compresses it and that’s Storewize. (More on these later.)

    Yes, you can point users to an NFS or CIFS mount from a Data Domain box (or any other dedupe box with a NAS head) today, and yes, that data will be deduped. BUT I’m not sure it’s designed with that use case in mind. It’s primarily designed as a target for backup and archive.

    There are two issues I see with using a backup/archive dedupe device as a target for regular user data. The first is that the user’s performance experience may change (although that might be worth testing). The second, and actually bigger, problem is the cost/benefit ratio. You probably won’t get much more than 2:1 on user data (and that comes primarily from compression), but the dedupe vendor’s pricing is typically based on the value provided by a dedupe ratio of 20:1 (or something like that). They make 1 TB look like 20 TB, but only charge you for 10, or something like that. But if they make 1 TB look like 2 TB (2:1 dedupe) and still charge you for 10, they’re not really helping, are they? (There’s some rough math on this at the end of this comment.)

    Now, on to NetApp & Storewize. NetApp uses their A-SIS technology to find duplicate blocks in data stored on NetApp filers (I believe on NFS, CIFS & SAN). While there’s a significant performance hit while the post-process dedupe session is running (which they ask you to run after hours), there’s minimal impact on performance for the user when that process isn’t running. So as long as you’re not concerned about 100% performance 24×7, it works well: dedupe at night, use during the day. The other cool thing about it is that it’s a free feature of Data ONTAP, essentially doubling the size of their filers for free, as long as you fit the right use case. (What works really well for this is VMware images.)

    Then there’s Storewize. They sit in FRONT of any kind of filer and compress & decompress the data (not dedupe it) as it’s being written to or read from the filer. The time they spend compressing it inline is made up by the reduced time it takes to write it to and read it from disk (since it’s compressed), so they’re TELLING me that you should be able to pop this in front of a filer and voila! The user never notices. It should be cheaper than buying a dedupe box (based on how their pricing works, see above), and you should get data reduction rates similar to what you’d get on primary data stored on a dedupe box. (There’s a toy sketch of the inline vs. post-process difference at the end of this comment.)

    I haven’t tested either of these, but this is what I’ve seen that answers your question.
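
    To put the cost/benefit point above into rough numbers, here’s a back-of-the-envelope sketch. Every number in it (the per-TB prices and the 20:1 and 2:1 ratios) is an assumption for illustration, not any vendor’s actual pricing:

        # Rough sketch: effective cost per LOGICAL TB stored, when a dedupe
        # appliance is priced assuming a big reduction ratio but the data you
        # put on it only reduces 2:1. All numbers are illustrative assumptions.

        def cost_per_logical_tb(price_per_physical_tb, reduction_ratio):
            """Cost to store 1 TB of pre-dedupe (logical) data."""
            return price_per_physical_tb / reduction_ratio

        plain_disk_price = 1000    # assumed $/physical TB for plain disk
        dedupe_box_price = 10000   # assumed $/physical TB for the dedupe box,
                                   # priced as if it will deliver ~20:1

        backup_data = cost_per_logical_tb(dedupe_box_price, 20)  # ~$500/TB
        user_data   = cost_per_logical_tb(dedupe_box_price, 2)   # ~$5,000/TB
        plain_disk  = cost_per_logical_tb(plain_disk_price, 1)   # $1,000/TB

        print(f"Backup data on the dedupe box: ${backup_data:,.0f} per logical TB")
        print(f"User data on the dedupe box:   ${user_data:,.0f} per logical TB")
        print(f"User data on plain disk:       ${plain_disk:,.0f} per logical TB")

    With those made-up numbers, backup data on the dedupe box is a bargain, but primary user data on the same box costs several times what plain disk would. That’s the cost/benefit problem in a nutshell.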
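
    And since I described A-SIS as post-process and Storewize as inline, here’s a toy sketch of what that difference means in general terms. It uses compression as a stand-in for any kind of data reduction and is purely a generic model, not either vendor’s implementation:

        # Toy model: inline reduction pays the cost on the write path;
        # post-process reduction writes at full speed/size and reduces later
        # (e.g., after hours). Uses zlib purely as a stand-in for dedupe or
        # compression; this is not any vendor's actual design.
        import zlib

        class InlineReducingStore:
            """Inline: data is reduced before it ever hits 'disk'."""
            def __init__(self):
                self.disk = {}
            def write(self, name, data):
                self.disk[name] = zlib.compress(data)   # cost paid at write time
            def read(self, name):
                return zlib.decompress(self.disk[name])

        class PostProcessStore:
            """Post-process: data lands at full size; a scheduled job reduces it."""
            def __init__(self):
                self.disk = {}
                self.reduced = set()
            def write(self, name, data):
                self.disk[name] = data                  # full speed during the day
            def read(self, name):
                data = self.disk[name]
                return zlib.decompress(data) if name in self.reduced else data
            def nightly_reduction(self):
                for name in list(self.disk):
                    if name not in self.reduced:
                        self.disk[name] = zlib.compress(self.disk[name])
                        self.reduced.add(name)

        # Both stores hand back the same data; they just pay the reduction
        # cost at different times.
        store = PostProcessStore()
        store.write("home.dir", b"the same old file " * 1000)
        store.nightly_reduction()
        assert store.read("home.dir") == b"the same old file " * 1000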

  • Why do you believe that compression (Storewize) will give comparable results to dedupe in terms of data reduction rates with respect to primary storage?

  • I hear this a lot, but I just don’t get it. A C: drive is how big? What’s the dedupe ratio per C: drive?

    How many drives am I really going to save? Let’s say I have 100 Windows servers that I want to VM….

    I’ve looked at the math, and it’s four-fifths of not very much at ALL…. So why would I risk additional head processing on my data just to save a couple of drives….

    Besides, it also means I am effectively reducing the number of disk IOPS, right? Why would I want that?

    Wouldn’t deduping the C: drives also become less effective over time (images change, right?) and when doing patches, etc.? You may not apply them all at the same time.

    Sorry, I just don’t get it.

  • 100 40 GB C: drives is 4 TB of disk. If you could reduce that 4 TB to 1 TB, you just saved yourself many (perhaps 10s of) thousands of dollars. If you could do that without a reduction in performance or an increase in cost, my question is why WOULDN’T you do it?
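
    To spell that math out (the 4:1 reduction matches the example above; the price per TB is an assumption I’m making for illustration, not a quote from anyone):

        # Illustrative arithmetic for 100 VMs with 40 GB C: drives.
        vm_count     = 100
        c_drive_gb   = 40
        reduction    = 4        # 4 TB down to 1 TB, per the example above
        price_per_tb = 5000     # assumed cost of enterprise disk, in dollars

        raw_tb     = vm_count * c_drive_gb / 1000   # 4.0 TB
        deduped_tb = raw_tb / reduction             # 1.0 TB
        savings    = (raw_tb - deduped_tb) * price_per_tb

        print(f"{raw_tb:.0f} TB shrinks to {deduped_tb:.0f} TB, saving roughly ${savings:,.0f}")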

  • I understand your answer… but you did not really answer the question. I asked why I would dedupe my C: drives. The volume of data in a guest OS is not in the C: drive (boot image); it’s in the application volumes. I have never seen a boot image take up 40 GB; I think your math is off.
    I think best practice in a VM environment, from what I have seen and heard, is to have the C: drive be a boot image only, so you are talking 4-8 GB, right? Application volumes tend to be separate: things like Exchange, SQL, Oracle, etc. Why would I put that on the C: drive? The C: drives are similar; however, the application volumes are going to be very different.

    If you are talking thin clients, then all boot images would be the same… so why not have one image, boot them all off it, and then have separate data-file volumes for each user? That way, 100 desktops would only consume the space of one C: drive.

    Your ‘oracle’-like statements on this matter have little substance to them and seem all gloss.

  • You read one response to your comment (which you didn’t fully explain), and suddenly I’m “oracle-like,” with “little substance” and “all gloss”? Rush to judgment much? Perhaps you should read a little more of my body of work before making such judgments. Most people say that my talks, articles, and blogs contain more real-world technical know-how and advice than most others.

    Now, on to your question. First, I don’t agree that it’s best practice to separate your OS and application BINARIES. I typically install the OS and my applications on a single drive (virtual or not) and put my application DATA on a second drive. And when I do that, a base install of Windows and common applications (and all their patches, and the undo garbage that comes with them) typically DOES come to well over 20 GB. Leave a little room for growth and play room, and you’re easily at 40 GB. That’s where the 40 GB number came from. And if I’ve installed Windows and Exchange on 10 VMs, then they’ve absolutely got a lot of common blocks among them. Shoot, they’re almost all common blocks. (There’s a little chunk-hashing sketch at the end of this comment that shows the effect.)

    Now let’s talk about the data itself. If we’re talking about Exchange, there’s also a lot of common data between the information stores of multiple Exchange servers. Even with single-instance storage, there is common data within a single Exchange instance. If someone sends the same email/attachment to 20 people, and 10 of them are in unique Exchange storage groups and/or instances, then that email/attachment will be stored 11 times (once in each of the 10 storage groups and once in the sender’s Sent Items folder). If each of the 20 people makes corrections and sends it back to the original user, many of those blocks are now stored 20 times: once in each of the 10 senders’ Sent Items folders and 10 more times in the original user’s Inbox. Run dedupe against all those storage groups and you save a lot of storage.

    I know customers that have deduped their VMware images and dropped the amount of disk they needed by 75-95%, without a significant change in performance and at no additional cost.

    As to your spindle-count comment, I don’t think it’s really an issue. Since we’re talking mainly about OS and application binaries, they’re not accessed that much anyway; performance really isn’t the issue there. And aren’t we talking about VMs? If you’re asking me about high performance and spindle count, then it probably shouldn’t be in a VM in the first place. As to the performance of Exchange data, you either notice a significant performance difference when you dedupe it or you don’t. If you do, then don’t dedupe it! If it reduces your disk consumption by 50% but costs you twice as much as not using it, then don’t do it.

    You sound as if I or someone else is trying to shove dedupe of your C: drive down your throat. No one is trying to do that. I don’t think everyone should dedupe their OS/app drives. I think you should do it if it makes sense for you.

    You gave examples where dedupe might not help. Great. Deduping standard SQL/Oracle data won’t get you much, and it would be silly to dedupe a single C: drive image that’s being used as a common boot image for multiple systems. If dedupe doesn’t help, then don’t use it! BUT real customers are using it, are being helped by it, and aren’t suffering a performance loss or an increase in cost. SO… (going back to my oracle-like statement that apparently has little substance and seems all gloss), “If you could do that without a reduction in performance or an increase in cost, my question is why WOULDN’T you do it?”
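
    To make the “almost all common blocks” point concrete, here’s a minimal, self-contained sketch of why near-identical images collapse the way they do under block-level dedupe. The data is synthetic, and the fixed-size chunking is a stand-in for whatever a real product actually does:

        # Split each "VM image" into fixed-size chunks, hash them, and count
        # how many unique chunks actually need to be stored. Synthetic data;
        # illustration only.
        import hashlib
        import os

        CHUNK = 4096

        def chunks(data):
            for i in range(0, len(data), CHUNK):
                yield data[i:i + CHUNK]

        # Ten "VM images": a shared 1 MB base (the common OS/app blocks)
        # plus a small unique tail per VM (logs, config, and so on).
        base = os.urandom(1024 * 1024)
        images = [base + os.urandom(16 * 1024) for _ in range(10)]

        unique, total = set(), 0
        for img in images:
            for c in chunks(img):
                total += 1
                unique.add(hashlib.sha256(c).hexdigest())

        print(f"Chunks stored without dedupe: {total}")
        print(f"Chunks stored with dedupe:    {len(unique)}")
        print(f"Reduction:                    {1 - len(unique)/total:.0%}")

    With ten copies of a mostly identical image, the unique chunk count works out to roughly a 90% reduction, which lands right in the 75-95% range I mentioned above.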

  • One of the questions I would want the dedupe vendors to answer has to do with how quickly data that has been deduped and truncated due to storage space requirements can be reconstituted (re-duped). I have run into a big problem in that I have reached a storage-space threshold that requires the VTL to truncate data before it can be copied to a physical tape. Because of that truncation, the data that is meant to be copied to tape now has to be reconstituted (re-duped) by the VTL before it can be copied, and from what I can tell that is an extremely slow process. I am seeing an average of 50 GB/hour across 4 streams on the tape copy. Before reaching this space threshold, I was seeing 450-500 GB/hour across the same 4 streams.

    I have the same issue with synthetic fulls. Data within a backup cycle that has been truncated takes an extremely long time to reconstitute in order for the synthetic full job to progress. I was seeing about 150 GB/hour on synthetic fulls; that has dropped to between 20 and 30 GB/hour during this re-duping process. It seems as though the vendors spend most of their time pumping up their backup speeds and performance but fail to address how efficient they are when it comes to re-duping the data.
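
    To put those rates in perspective, here’s the quick arithmetic (the 10 TB job size below is just an assumed example; the GB/hour figures are the ones I’m actually seeing):

        # Time to finish a tape copy at the two throughput levels reported above.
        copy_size_tb        = 10    # assumed size of the copy-to-tape job
        normal_gb_per_hr    = 475   # midpoint of the 450-500 GB/hour I used to get
        rehydrate_gb_per_hr = 50    # what I get once data has to be reconstituted

        normal_hours    = copy_size_tb * 1000 / normal_gb_per_hr     # ~21 hours
        rehydrate_hours = copy_size_tb * 1000 / rehydrate_gb_per_hr  # ~200 hours

        print(f"Copy before hitting the threshold: about {normal_hours:.0f} hours")
        print(f"Copy once re-duping kicks in:      about {rehydrate_hours:.0f} hours")

    That’s the difference between a copy that finishes in a day and one that takes more than a week, which is why I’d like to hear the vendors talk about reconstitution performance as much as they talk about ingest.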