My Detente With EMC's DD Archiver

When I first heard about the EMC disk archiver, I blew my stack.  I don’t remember exactly how it was presented to me, but what I heard was that EMC was coming out with a disk product that was designed to hold backups for seven years or more.  Since storing backups for seven years or more is fundamentally wrong (and no one — and I mean no one — argues with that), the idea that EMC was coming out with a product designed specifically to do that angered me.  Brian Biles, VP of Product Management for EMC’s BRS division, said with a wry smile, “So you’re saying we’ve become a tobacco company.”

I replied, “No, you’ve become a cigarette case manufacturer.  You shouldn’t smoke, kids, but here’s a really pretty gold case to hold your ciggies in.”  I had a similar conversation with Mark Twomey (@storagezilla) on Twitter.

Since that time, I have come to a detente.  I still wouldn’t buy one of these for my long-term storage needs, but I can see why some other people might want to — and I don’t think those people are wrong or committing evil or data treason. This blog post is about how I got here from there.

Here were my arguments against this product:

There’s no way that this could cost less than tape

Some of the messaging that I saw for the Archiver suggested that it was as affordable as tape.  That’s simply not possible.  First, let’s talk about what the Archiver is competing with.  (For these comparisons, I am assuming you have either a tape system or a Data Domain box, and that what we’re talking about is adding the cost of extra capacity to support long-term storage of backups or archives.)

A backup or archive that is kept for that long is not kept in the tape library; it’s put on a shelf.  (This is because chances are it’s never going to be read.)  Therefore, the cost for tape is about $.02/GB, which is the cost of an LTO-5 tape cartridge.  The daily operational cost of that tape’s existence is negligible, assuming it’s onsite.

The last time I checked target dedupe appliances, they were about $1/GB after discounting.  I also saw a slide saying that this archiver is supposed to be about 20% cheaper than a regular Data Domain.  That puts it at around $.80/GB — 40 times the cost of a tape on a shelf.  And the daily operational cost of that disk is higher than the tape’s, because it is going to be powered on.  (The Archiver does not currently support powering down unused shelves, although it may in the future.)
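If you want to see where that multiple comes from, here’s the back-of-the-envelope math as a quick Python sketch.  The prices are the rough street numbers quoted above (era-specific, and they’ll drift), not official figures from EMC or any tape vendor:

```python
# Back-of-the-envelope cost comparison using the rough numbers above.
# Illustrative 2011-era street prices, not official figures.

tape_cost_per_gb = 0.02      # ~$30 LTO-5 cartridge / ~1,500 GB native capacity
dedupe_cost_per_gb = 1.00    # typical target dedupe appliance after discounting
archiver_discount = 0.20     # the Archiver is said to be ~20% cheaper

archiver_cost_per_gb = dedupe_cost_per_gb * (1 - archiver_discount)
print(f"Archiver:        ${archiver_cost_per_gb:.2f}/GB")   # $0.80/GB
print(f"Tape on a shelf: ${tape_cost_per_gb:.2f}/GB")       # $0.02/GB
print(f"Difference:      {archiver_cost_per_gb / tape_cost_per_gb:.0f}x")  # 40x
```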

Then there is the issue of dedupe ratio.  The deduped disk price above assumes a 20:1 dedupe ratio.  Dedupe ratios do not go up over time; they actually decrease, because as time passes, more and more of what you’re storing is new data.  (The full backup you take today is going to contain quite a bit of new data when compared to the full backup from a year ago.)  Then there’s the fact that the Archiver needs to start each tier (a collection of disks) with a new full backup, thus decreasing the overall dedupe ratio of the entire unit.  (It must do this in order to keep each tier self-contained.)  The result is that you will probably get a much lower dedupe ratio on your long-term data than on your short-term data.  This increases your cost.

If you’re going to do the right thing and use archive software (instead of backup software) to store data for several years, remember that any good archive software already does single-instance storage, which leaves less duplicate data for the appliance to eliminate.  So if you’re using archive software, you’re going to get an even lower dedupe ratio.
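To put numbers on that: the quoted price is an effective price that bakes in a 20:1 ratio, so if your real ratio is lower, your effective cost scales up proportionally.  A minimal sketch, using my own illustrative ratios (not EMC’s):

```python
# Effective $/GB as the real dedupe ratio falls short of the assumed 20:1.
# The ratios in the loop are illustrative guesses, not measured numbers.

quoted_cost_per_gb = 0.80   # Archiver estimate from above, assuming 20:1
assumed_ratio = 20

def effective_cost(actual_ratio):
    # The quote prices 20 GB of backups per 1 GB of disk; if you only
    # fit actual_ratio GB per GB of disk, the price scales accordingly.
    return quoted_cost_per_gb * assumed_ratio / actual_ratio

for ratio in (20, 10, 5, 2):
    print(f"{ratio:>2}:1 dedupe -> ${effective_cost(ratio):.2f}/GB")
# 20:1 -> $0.80   10:1 -> $1.60   5:1 -> $3.20   2:1 -> $8.00
```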

Which brings me back to my belief that there is no way this can be anywhere near as inexpensive as tape.

The good news is that I didn’t hear EMC saying that the Archiver is as cheap as tape when I saw them speak about it at EMC World.  When I talked to the EMC people at the show, I told them I had heard stories of EMC sales reps showing this unit as cheaper than tape by using dedupe ratios of 100:1.  (The idea is that you’re going to store 100 copies of the same full backups.)  They told me that any sales rep quoting ratios like that is not speaking on behalf of EMC and is talking out of his …  Well, you know.

There’s nothing that this unit offers that justifies that difference in price

Disk offers a lot of advantages when used for day-to-day backups.  It’s a whole lot easier to stream during both backups and restores.  There is no question that it adds a lot of value there.  However, the whole idea of backups or archives that are stored long term is that no one reads them.  If anyone is reading them, it’s for an electronic discovery request, where the amount of time you have to retrieve the data is much greater than the time you typically have for a restore.  That more generous window is easily met with tape as your storage medium.  Disk offers no real advantage here.

When I said this, Mark Twomey pointed out that this unit offers regular data integrity checking of the backups stored on it.  I informed him that if that’s important to you, there are now two tape library manufacturers (Quantum & Spectra Logic) that will be glad to do the same thing for your tapes.

I will concede that disk does offer an advantage if you’re using backups as your archives.  Having backups that will load instantly helps mitigate the issue of how many restores you’re going to be doing to satisfy a complicated ediscovery request.

It’s just wrong to store backups for many years

You should not be using your backups as archives.  If you ever get an ediscovery request for all of Joe Smith’s emails for the last seven years — and you happen to have a weekly full for each of the 364 weeks of that time frame — you will remember what I said.

The thing is that EMC agrees. In fact, the EMC Archiver presentation starts with a few slides about how you should be doing real archiving; you should not be using your backups as archives.

They also said that they see this as a transitional device that can store both backups and archives.  Just because it can store backups doesn’t mean you have to store backups on it.  You can use proper archive software.  (But, if you do, I once again point out that your dedupe ratio will go down and therefore your effective cost per GB will go up.)

So what’s changed, then?

I had a number of good conversations with EMC folks at last week’s EMC World.  (Which, for the record, was a really big show.)  Some of those comments are above.  They know that this is not going to be cheaper than tape, and they say that anyone who claims otherwise is not being truthful.  They know that storing backups for years is wrong; they also know that more than half of the world does it that way.

The reason for the detente, however, is that I realize that many people hate tape.  I think they’re wrong, as I’ve stated more than a few times.  There are plenty of IT departments that have a “get rid of tape” edict.  If the goal is to get rid of tape, the fact that the alternatives are much more expensive is not really an issue.  And if you’re going to store backups for a really long time on disk, then at least EMC put some thought into what a disk system would need to do in order to do that right.  This includes things like fault isolation: if you lose one tier for whatever reason, you lose only the data on that tier.  It includes things like scanning data occasionally to make sure it’s still good.

Finally, Index Engines also announced an important product at EMC World that will help increase the value of the Archiver for those using it to store backups.  They already have a box that can scan tape backups and basically turn them into archives.  (One of the coolest products I’ve ever seen, BTW.)  They now support NFS, so you can point an Index Engines box at a DD Archiver and voila!  Those backups that you are storing on disk magically become fully searchable, ediscovery-ready archives.

Summary

Don’t use your backups as archives.  Use archive software instead.  Tape is still the most economical destination for long-term storage of backups or archives, and it’s a pretty reliable one, too.  However, if you’re going to store your backups or archives on disk for many years, there are worse places to put them than the EMC Data Domain Archiver.


Server virtualization does NOT cause storage explosion

Server virtualization doesn’t kill storage.  People kill storage.  That’s all I’m saying.

I get hot under the collar when I hear people say things like “server virtualization increases storage requirements by huge amounts.”  They slam server virtualization with this comment, as if changing a server from being a physical one to being a virtual one somehow magically increases its size.  They list it as a reason that you shouldn’t use server virtualization.

So I got a little irked when I heard the CEO of Symantec, Enrique Salem, say something like that in his keynote this week at Symantec Vision. (It was a great show, by the way.)  “Server virtualization increases storage use by 200% – 800%,” he said.  When we had the media Q&A with him, this was the first question out of my mouth: “What about moving a server from being physical to being virtual increases storage requirements?”  I asked a similar question of every other Symantec person I met with that day, as well as of VMware CTO Steve Herrod when I met him.

In retrospect, I was probably a little hard on Mr. Salem during my Q&A.  Even Steve Herrod from VMware verified that the typical VMware customer does see such a storage explosion.  However, I still stand by my statement that this is not VMware’s fault.  Moving to VMware does not cause your storage to magically explode.  Moving to VMware probably does “help” it happen, though.  Here are my thoughts on that.

VMware’s design actually reduces storage use

The average virtual machine image (VMDK in VMware speak) is significantly smaller than the smallest disk drive you can buy to put into a server.  The smallest hard drive I can configure in a Dell server is 250 GB. You can create a thin-provisioned VMDK and it will consume only as much storage as it actually needs, which is going to be far less than 250 GB.  I don’t know Hyper-V as well as I do VMware, but I’m guessing it’s similar.  I would also say that moving servers into VMware/Hyper-V means that you can put all those highly duplicated images on a single storage volume that supports deduplication, removing that huge storage explosion.  You can’t do that if you’re using physical servers with discrete hard drives.
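To make that concrete, here’s a toy calculation.  Only the 250 GB minimum comes from the Dell example above; the per-server data size and the dedupe ratio are numbers I made up for illustration:

```python
# Toy numbers: ten servers as physical boxes vs. thin-provisioned VMs
# on a deduplicated volume. Only the 250 GB figure comes from the post;
# the image size and dedupe ratio are invented for illustration.

servers = 10
smallest_physical_disk_gb = 250   # smallest drive in the Dell example above
data_per_server_gb = 20           # hypothetical: what each OS image really holds
dedupe_ratio = 5                  # hypothetical: near-identical images dedupe well

physical_gb = servers * smallest_physical_disk_gb   # 2,500 GB of spinning disk
thin_gb = servers * data_per_server_gb              # 200 GB actually written
deduped_gb = thin_gb / dedupe_ratio                 # ~40 GB on the deduped volume

print(f"Physical servers:       {physical_gb} GB")
print(f"Thin-provisioned VMDKs: {thin_gb} GB")
print(f"After deduplication:    {deduped_gb:.0f} GB")
```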

Many people buy their first “real” storage array when they buy VMware/Hyper-V

They may feel that this “forces” them to increase their storage costs, because they’re used to just buying discrete hard drives — often with no RAID or monitoring.  They then blame this increase in cost on VMware/Hyper-V.  I don’t buy that either.  First, they didn’t have to do that.  They could have bought a nice HP/Dell/IBM server with internal storage and run VMware on that.  The decision to buy a storage array is a second decision.  Second, if VMware “forces” them into the 21st century as far as storage management is concerned, so be it.  It’s about time they had real storage.

Server virtualization often means a lot of test/dev VMs

This was Mr. Salem’s point.  VMware/Hyper-V makes it really easy to have many, many different images of different configurations, so people create dozens or hundreds of VMs in their test/dev environment, and that causes a huge increase in storage.  I again say that you could continue to do in your dev/test lab whatever it was you did before you had VMware/Hyper-V, so it isn’t VMware/Hyper-V’s fault that your lab now uses 10 times more storage than it used to.  But it sure does make it easy, doesn’t it?  I would also say that this increase in storage is accompanied by a huge increase in the usability of the lab.

VM sprawl is evil and real and it eats up storage

This was the one comment I heard from almost everyone I talked to.  When we step out of the test/dev world, it is a reality that when you are buying physical servers, there tends to be much more of an approval process.  When all you have to do to create a new server is click the right button on your mouse, you tend to create new “servers” very quickly.  Next thing you know, you have a whole lot more servers (and images of Windows/Linux) than you ever would have had with physical servers. VM sprawl is real, and it should be addressed with process and procedure.

VMware and Hyper-V are not the problem here.  What we do with them is the problem.  Yes, they make it much easier to do dumb things like VM sprawl, but blaming VMware and Hyper-V for your storage explosion is like blaming Ferrari for your speeding tickets.  Just saying.
