Amazon Glacier changes the game

In case you missed it, Amazon just announced a new cloud storage service called Glacier.  It's designed as a target for archive and backup data at a cost of $.01/GB/mth.  That's right, one penny per month per GB.  I think my first tweet sums up my feelings on the matter: "Amazon glacier announcement today. 1c/GB per month for backup archive type data. Wow. Seriously."

I think Amazon designed and priced this service very well.  The price includes unlimited transfers of data into the service.  It also includes retrieving/restoring up to 5% of your total storage per month, and unlimited retrievals/restores from Glacier into EC2.  If you want to retrieve/restore more than 5% of your data in a given month, additional retrievals/restores are priced at $.05/GB-$.12/GB, depending on how much you're restoring.  Since most backup and archive systems store, store, store and backup, backup, backup and never retrieve or restore, it's safe to say that most people's cost will be only $.01/GB/month.  (There are some other things you can do to drive up costs, so make sure you're aware of them; as long as you take them into consideration in the design of your system, they shouldn't hit you.)
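
If you want to play with the math yourself, here's a quick back-of-the-envelope calculator.  To be clear, this is my own simplification of the pricing above: I'm assuming a flat $.05/GB for retrievals beyond the free 5%, when the real rate slides between $.05/GB and $.12/GB.

```python
# Back-of-the-envelope Glacier cost estimate based on the figures quoted above.
# The flat $0.05/GB for excess retrievals is an assumption for illustration;
# the actual rate varies between $0.05/GB and $0.12/GB.

STORAGE_PER_GB_MONTH = 0.01      # $/GB/month for data at rest
FREE_RETRIEVAL_FRACTION = 0.05   # 5% of stored data retrieves free each month
EXCESS_RETRIEVAL_PER_GB = 0.05   # assumed low end of the $.05-$.12/GB range


def monthly_cost(stored_gb, retrieved_gb=0.0):
    """Estimate one month's Glacier bill for a given storage footprint."""
    storage = stored_gb * STORAGE_PER_GB_MONTH
    free_allowance = stored_gb * FREE_RETRIEVAL_FRACTION
    excess = max(0.0, retrieved_gb - free_allowance)
    return storage + excess * EXCESS_RETRIEVAL_PER_GB


# 10 TB archived, nothing retrieved: the typical month for archive data.
print(monthly_cost(10000))         # 100.0 -> $100/month
# 10 TB archived, 1 TB retrieved (500 GB over the free 5% allowance).
print(monthly_cost(10000, 1000))   # 125.0 -> $125/month
```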

This low price comes at a cost, starting with the fact that retrievals take a while.  Each retrieval request initiates a retrieval job, and each job takes 3-5 hours to complete.  That's 3-5 hours before you can begin downloading the first byte to your datacenter.  Then it's available for download for another 24 hours.  
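
To make the retrieval workflow concrete, here's a rough sketch of what a retrieval job looks like against the Glacier API, using the boto3 Python SDK.  The vault name, archive ID, and output filename are placeholders of mine, and a real tool would obviously do something smarter than sleep in a loop.

```python
import time

import boto3

# Sketch of Glacier's two-step retrieval: start a job, wait out the 3-5 hour
# staging delay, then download the output. The vault name, archive ID, and
# output filename below are placeholders.
glacier = boto3.client("glacier")

job = glacier.initiate_job(
    vaultName="my-archives",
    jobParameters={"Type": "archive-retrieval", "ArchiveId": "EXAMPLE-ARCHIVE-ID"},
)
job_id = job["jobId"]

# Poll until Glacier has staged the archive (typically 3-5 hours).
while not glacier.describe_job(vaultName="my-archives", jobId=job_id)["Completed"]:
    time.sleep(15 * 60)  # check every 15 minutes

# Once the job completes, the output stays downloadable for about 24 hours.
output = glacier.get_job_output(vaultName="my-archives", jobId=job_id)
with open("restored-archive.bin", "wb") as f:
    f.write(output["body"].read())
```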

This is obviously not for mission-critical data that needs to be retrieved in minutes.  If that kind of turnaround doesn't meet your needs, don't use the service.  But my thinking is that it is perfectly matched to the way people use archive systems, and to a lesser degree to how they use backup systems.

It's better suited for archive, which is why Amazon uses that term first to describe this system.  It also properly uses the term retrieve instead of restore.  (A retrieve is what an archive system does; a restore is what a backup system does.)  Good on ya, Amazon!  Glacier could be used for backup, as long as your restores are small and RTOs of many, many hours are OK.  But it's perfect for archives.

We need software!  (But not from Amazon!)

Right now Glacier is just an API; there is no backup or archive software that writes to that API.  A lot of people on Twitter and on Glacier's forum seem to think this is lame and that Amazon should come out with some backup software.
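
For the curious, "just an API" looks something like this.  Here's a bare-bones sketch of writing an archive into a vault with the boto3 Python SDK; the vault name and filename are placeholders, and a real backup or archive product would layer cataloging, multipart uploads, retries, and encryption on top of it.

```python
import boto3

# Bare-bones write to a Glacier vault. The vault name and filename are
# placeholders; a real product would add cataloging, multipart uploads for
# large files, retries, and encryption on top of this.
glacier = boto3.client("glacier")
glacier.create_vault(vaultName="my-archives")

with open("weekly-full.tar", "rb") as f:
    response = glacier.upload_archive(
        vaultName="my-archives",
        archiveDescription="weekly full backup",
        body=f,
    )

# Save this ID somewhere durable; it is the only handle you get for retrieval.
print(response["archiveId"])
```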

First, let me say that this is how Amazon has always done things.  Here's where you can put some storage (S3), but it's just an API.  Here's where you can put some servers (EC2), but what you put in those virtual servers is up to you.  This is no different.

Second, let me say that I don't want Amazon to come out with backup software.  I want all commercial backup software apps and appliances to write to Glacier as a backup target.  I'm sure Jungle Disk, which currently writes to S3, will add Glacier support posthaste, and so will all the other backup products that already know how to write to S3.  They'll never do that, though, if they have to compete with Amazon's own backup app.  These apps and appliances writing to Glacier will add deduplication and compression, significantly dropping the effective price of Glacier and making archives and backups use far less bandwidth.
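
To illustrate how much that data reduction changes the math, here's a hypothetical example; the 10:1 combined reduction ratio is purely an assumption, and real ratios vary wildly by data type and backup scheme.

```python
# Hypothetical effect of dedupe and compression on Glacier's effective price.
# The 10:1 combined reduction ratio is an assumption for illustration only.
logical_gb = 100000              # 100 TB of backup/archive data as the app sees it
reduction_ratio = 10             # assumed combined dedupe + compression
stored_gb = logical_gb / reduction_ratio

bill = stored_gb * 0.01          # $.01/GB/month on what's actually stored
effective_per_gb = bill / logical_gb

print(f"${bill:,.2f}/month, effectively ${effective_per_gb:.3f}/GB of protected data")
# -> $100.00/month, effectively $0.001/GB of protected data
```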

Questions

We all have questions that the Amazon announcement did not answer.  I have asked these questions of Amazon and am awaiting an answer.  I'll let you know what they say.

  1. Is this on disk, tape, or both?  (I've heard unofficially that the official answer is no answer, but I'll wait to see what they say to me directly.)
  2. The briefing says that it distributes my data across multiple locations.  Are they saying that every archive will be in at least two locations, or are they saying they're doing some type of multiple-location redundancy?  (Think RAID across locations.)
  3. It says that downloads are available for 24 hours.  What if it takes me longer than 24 hours to download something?
  4. What about tape-based seeding for large archives, or tape-based retrieval of large archives?

ZDNet's Cost Article

Jack Clark of ZDNet wrote an article claiming that Glacier's 1c/GB/mth pricing was ten times that of tape.  Suffice it to say that I believe his numbers are way off.  I'm writing a blog post to respond to his article, but it will be a long one and a difficult read, with lots of numbers and math.  I know you can't wait.

Written by W. Curtis Preston (@wcpreston), four-time O'Reilly author, and host of The Backup Wrap-up podcast. I am now the Technology Evangelist at Sullivan Strickler, which helps companies manage their legacy data.

12 comments
  • wcpreston: Sent Amazon six questions about Glacier, got five answers back. The “what is it stored on” question was omitted. Oversight or avoidance?

    It just makes sense that it is a tape library. Why else would they have a several-hour delay? They might have removable hard drives, but if they do, they don’t buy them from us.

  • I initially thought the same thing, but the evidence is pointing to disk. The answer to your question is a forced spin-down of disks combined with an enforced maximum percentage of disks powered on at one time. Add to that the fact that they must move data from the disks they’re storing data on to the disks directly connected to the download system. Add to THAT the fact that they have to allow for massive runs on the bank.

  • Indeed this is a very interesting piece of news. I wonder, though, if security concerns will weigh heavily on companies that would like to take advantage of this service. From what I can tell, there is only a 5% uptake of disk-disk-cloud backups, and many cite security and lack of clarity about who owns the data as a gating factor (besides pricing). What is your take, Curtis?

  • Great question. I’ll be uploading a video very soon on truebit.tv that talks to this very question. Short version of the answer from the expert that I talked to: your data is more secure in the cloud than it is in the datacenter.

  • Jas, great link! I sent a comment to the spectralogic.com blogger asking her to give us the numbers and assumptions behind her math. As my algebra teacher used to say, “Show your work.” Obviously one issue is that they are comparing prices at the high end of the storage spectrum – 10,000 tapes is a whole lot of tapes. My guess is they are playing the whole “compare us compressed to the other guy uncompressed” game, too. Although, to be fair, at these sizes Glacier will incur substantial transfer fees, which are neglected in the $.01/GB numbers.

  • I know Molly and I’m going to contact her, too.

    FWIW, Glacier does not have transfer fees IN. There are only transfer fees OUT, and only if you take out more than 5% of your total allotment per month.

  • Actually, after fully reading Molly’s article, I’d have to say that we are not in disagreement. I agree with almost everything she said. I don’t agree with her costing numbers, for a bunch of reasons which I have already covered in my other post. The biggest reason is that her numbers only cover one copy, whereas Amazon Glacier data is automatically in two places. To do that with tape, you need something like Crossroads’ StrongBox, and its TCO is a lot higher than what Molly posted. But other than that, I’d agree with the rest of the blog post.

  • Yes, StrongBox is an option, but it’s certainly more expensive, though I’m sure Crossroads would argue it gives you faster access if that is what you want.

    But if (maybe a big if) tape is 10x cheaper, then surely employing a second tape library will make it just 5x cheaper.

  • Forget the Crossroads comparison, then. Just use the whitepaper that the entire tape industry (including Spectra) uses to show how cheap tape is. You can download the paper Spectra calls “Tape and Disk: What it really costs” here:
    http://www.spectralogic.com/index.cfm?fuseaction=members.docContactInfoForm&DocID=1944

    And that paper gives numbers significantly higher than what Molly wrote in her blog post.

    But if we’re going to have a pricing discussion on my blog, it should be on this post that talks about pricing:

    https://backupcentral.com/mr-backup-blog-mainmenu-47/13-mr-backup-blog/405-amazon-glacier-zdnet.html

  • From reading this article, I understand that each retrieval request initiates a retrieval job, and each job takes 3-5 hours to complete.