My first trip to VMworld

VMworld is the new industry show.  It is the show to attend and the show to exhibit at.  I was really impressed.  Here's a list of my thoughts about my trip there.

Update: I re-read this blog post this morning and felt it was too harsh and didn't contain enough of my positive reaction to the show.  I've therefore added a new paragraph or two in the beginning that explains my overall reaction to VMworld.  I also added some photos. No, I didn't get any complaints. This is just a case of writer's remorse.

VMworld is a very impressive show.  The main session was the biggest such session I've ever seen.  Attendance was around 20,000 people, which is more than last year, which was bigger than the year before, etc.  In a world of ever-shrinking tradeshows, it's nice to see one that's growing.  I liked the way they did the virtual park, and the way they had volleyball, basketball, and badminton courts (with what appeared to be pros who would play with you).  The attendance at the opening keynote was incredible.  (The content of Paul Maritz's talk, or the what-appeared-to-be-scripted "interview" of the three CIOs later… not so much.  Due to my impression of that talk, I slept in the following morning and skipped the next general session, only to be disappointed by all the tweets about how much better THAT talk was.)

VMworld and the Venetian also did a very good job of shuffling 20,000 people around the various venues, including lunch.  I never felt like I was ever waiting in line anywhere.  There was the occasional traffic jam, of course, but nothing compared to what I've seen at some shows. Food was decent, and there were healthy options if that's what you were looking for. 

The treatment of the press was very good.  We had a press-only area with meals, drinks, and snacks where we could relax, write, blog, etc. Then they had a place where press could bring non-press people for interviews.  They also had dedicated Q&A sessions for the press.  All of that was very close by, which made it all very convenient.

Overall, it's a very good show with a lot of content (if that's what you're looking for) and a lot of exhibitors (if that's what you're looking for).  You could do a lot worse.  Now, my feedback…

1. Registration was easy

That is, just getting registered.  And then….

2. Session builder was horrible

You were required to register for any sessions you wanted to attend.  That's fine.  However, the system you had to use to do that was one of the worst-designed web pages I've ever worked with.  Every mouse click resulted in a refresh of the entire page with a list of all sessions.  Many sessions were listed in multiple places, instead of being listed once with multiple times.  Registering for each session required many, many mouse clicks and a popup.  Then, of course, it was followed by a page refresh.  Yuck.

One cool feature was that you could export your schedule to your calendar.  That was nice.

3. The exhibit hall was huge, huge, huge.

It's not just that this was bigger than EMC World or Symantec Vision, or any other large industry show.  It's that it contained almost anyone who was anyone.  In the backup world, you're not going to see SyncSort or CA at EMC World, but you do see them here.  This shows how separate EMC continues to allow VMware to be.

In fact, the exhibit hall was so big, and there were so many vendors there that I hadn't seen in a long time, that I had to give up almost all the sessions I had planned to attend just to make time to see all the vendors in my space.  And that's just in the backup space!

4. The exhibit hall is a little out of control

Certain vendors (and you know who they are if you were there) send people so far out into the aisle that you can't get past them without being accosted.  They would literally stand in front of you, forcing you to interact with them.  This was regardless of how many times you went by the booth, or whether you had any interest in their technology.

Many vendors exceeded any reasonable noise level.  There should be a very definite rule that a booth's audio cannot exceed a certain number of decibels when measured from a set distance away.  Subwoofers should be outlawed altogether.  It is soooo not cool to be in the booth 30 feet away and not be able to hold a conversation because another booth is blasting away.

If you're going to hire booth babes (and there's a good argument for not doing so), can you at least have them dress professionally and not like they're going to a night club or standing on a street corner?

5. Water.  Seriously.

I was never so thirsty as when I was in the exhibit hall.  You're several minutes away from any drinks you can buy.  There are no complimentary sodas.  So there should be water dispensers everywhere — and they should be constantly monitored for fullness and cup availability.  Almost every single water dispenser I found was either out of water, out of cups, or both.  Here's an idea: how about putting the next 5-gallon water bottle next to the dispenser?  If we're thirsty and it's empty, we'll swap it in ourselves.

The first night I went to dinner after being thirsted to death in the exhibit hall.  I drank six glasses of water and — not sure how to say this delicately — my body showed me later it needed all six glasses. [Update: I heard from a few people that I put this too delicately and they didn't understand what I was saying.  I'm saying that I didn't need to go to the bathroom at all after drinking that much water.]  I was severely dehydrated just from walking around the exhibit hall.  Water was that hard to find.

Having said all of that, this is the new industry show and I will never miss it again if I can help it.


VMware passes Hyper-V up in the backup race

The title may surprise none of you, but it is actually the opposite of what I said a year and a half ago in a blog post called Hyper-V ahead of VMware in the backup race.

Back then I was concerned that VMware did not have full VSS support.  They have since rectified that. [Update: by “full VSS support,” what I mean is that it can talk properly to all versions of VSS.  Before, they did not support Windows 2008.  Now they support all versions of Windows.  There is still the problem that they only have one style of snapshot, so they aren’t telling applications that they’ve been backed up, which means that the applications aren’t truncating their logs.]

They also added changed block tracking (AKA “CBT”) in vSphere, so it is possible to perform block-level incrementals on image-level backups. And since VMware is talking properly to VSS, the applications are doing what they are supposed to be doing before a backup as well. 

Now it is Hyper-V that is behind.  There is no API within Hyper-V that can present you with a map of changed blocks in order to back them up.  You can perform an incremental backup, of course, but an incremental backup via the Hyper-V host is going to back up everything, as every .VHD file will have changed.

This changed block tracking feature of VMware makes finding which blocks have changed much faster, and backing up just the blocks that have changed (vs. the files that have changed) is the fastest way to do an incremental backup.
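To put some made-up but representative numbers on it: say a VM has a 100 GB virtual disk and about 2% of its blocks change in a given day.  A full image backup moves all 100 GB, a file-level incremental can easily move several times the changed-block total (it has to copy every changed file in its entirety), while a CBT-driven incremental moves only the changed blocks:

\[ 100\ \text{GB} \times 0.02 = 2\ \text{GB per day} \]

(The 2% change rate is purely illustrative; your mileage will vary.)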

Just like with VMware, third parties have stepped in to fill the void.  So far, I know that Veeam and Arkeia are using their source deduplication capabilities to perform sub-file incremental backups of Hyper-V machines.  I’m sure there are more as well — and if any of them mention themselves in a comment, I’ll update my post.


Moving to the cloud

I’m in the process of trying to convert my church’s use of an onsite file and email server to a cloud file synchronization service and a hosted Exchange service.  I’ve chosen SugarSync, as that appears to have a little more flexibility than Dropbox, and Sherweb for the Exchange service, as that is what we’re currently using for Exchange at Truth in IT.

For file services, the idea would be to move any folder that a given person needs access to onto their local PC, synchronize that folder to SugarSync, then share that folder out to other people who need access to it.  Those users would then synchronize the folder to their PCs and have local access to it as well.  Changes would automatically be synchronized to every computer accessing the same folder.  (This is the same way we use Dropbox at Truth in IT. We share one big folder, and it’s synchronized to all our MacBooks.)  In order for this to work at the church, I need each PC to have enough local storage space to hold the data that they need to access on a regular basis.  (They can get access to infrequently accessed files via the web.)  The good news is that most PCs today have way more storage than they need if they’re using a fileserver.

I’ve started the SugarSync pilot with three of the office workers, and selected about 20GB of folders to synchronize.  They have a 1.5 Mb/s T-1, so it took a little over 24 hours to upload that 20GB to SugarSync, and another 24+ hours to sync it to each computer that needs to have access to it.
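That timing is about what the raw math predicts, by the way.  At full line rate, a 1.5 Mb/s T-1 moves roughly

\[ 1.5\ \text{Mb/s} \times 86{,}400\ \text{s/day} = 129{,}600\ \text{Mb} \approx 16\ \text{GB per day,} \]

so 20GB taking a bit more than a day means the line was close to saturated for the whole upload.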

Besides doing away with the server (and the costs associated with maintaining it), different people have experienced different benefits.

  • One staff member who does not have an Internet connection at home can work on his files on his laptop there and have them automatically synchronized to SugarSync when he plugs into the church’s Wi-Fi.
  • One staff member who likes to work from home a lot can access all of her files at home just like she was at the church, and can stop using thumb drives to bring files back and forth, or waiting ages to download a file via the VPN.
  • Another staff member needs infrequent access to office files from the house, but doesn’t feel the need to sync any folders to his house.  He will instead download or upload any files he needs via the SugarSync website.

There you have it: different strokes for different folks.

 


Social Media and security

Social media incidents cost a typical company $4 million over the past 12 months, according to the results of a Symantec survey published today.

There have been a number of legal actions about social media in recent years, including a Financial Industry Regulatory Authority (FINRA) regulatory notice, the Romano vs Steelcase Inc and Bass vs Miss Porter’s School cases (where both defendants were granted discovery of the plaintiff’s Facebook profile), and the sexual harassment case EEOC vs Simply Storage Management LLC (where a US District Court held that social networking sites — or SNSs for short — were discoverable).  This means that what your employees do on their personal time on SNSs can open your company to embarrassment and litigation.  The survey, then, sought to find out how big this problem is in the enterprise.  Symantec hired Applied Research to interview IT professionals from 1,200+ enterprises with 1,000+ employees.

45% of respondents use SNSs for personal use, and 42% use them for company use.  IT folks are worried about employees sharing too much information (46%), the loss or exposure of confidential information (41%), damage to the brand (40%), exposure to litigation (37%), malware (37%), and violating regulatory rules (36%). 

The respondents to the survey listed 9 social media “incidents” in the past 12 months, with 94% of those incidents having consequences, including damage to the brand (28%), loss of data (27%), or lost revenue (25%).  The average cost of a social media incident was listed as $4.3M!

Most of the companies are discussing creating a social media policy, training their employees, putting processes in place to capture confidential information, and putting technology in place to stop these things from happening as well.  However, what was surprising was that, while almost 90% of respondents felt they needed to have these things in place, only 24% had a social media policy, 22% were training their employees on social media, and about 20% were using technology to control this process.

Folks, it’s happening and it isn’t going away.  The very least you can do is to create a social media policy and train your employees why it is important.  Those employees who are allowed to blog about company matters need to be continually reminded that their actions are discoverable.  Even if their personal site may not be demonstrated to be official company policy, it surely states the opinion of one of its employees — and those employees make up the company.  And if it can be shown that one of its employees was continually doing something damaging on a publicly accessible social site and the company did nothing to stop it, that can be actionable.

Just remember: It’s really easy to be a jerk on the Internet where you’re not facing the person you’re talking to.  You might want to dial it down a notch or two.  Just a thought.

Update 25 Jul 2011: I was given a briefing about this survey and didn’t read the press release until today. During the briefing, Symantec seemed to be playing down the role that technology had to play in helping to solve this problem.  However, in the press release, it seems as if they’re saying that Enterprise Vault is going to handle this by archiving social media content.  First, I have no idea why anyone who is not required to archive any content — be it email or twitter — would do such a thing.  If you’re not required to keep something and keeping it adds no value to your business — don’t keep it!  Second, even if you did archive it, I’m trying to understand how that would help you in a discovery situation.  If someone wants to see your Facebook logs, they’re going to subpoena Facebook.  That’s what happened in the cases listed in this article.  So if you did archive it, now you’re required to produce it.  So why would you do this if you weren’t being forced?  And how would doing this help you in a trial?


Is Holographic Storage the future of archive & backup?

And now for something completely different.  GE researchers have announced that they have successfully demonstrated a micro-holographic material that can support 500 GB in a DVD-style disc.  That's 20 times greater than most Blu-Ray discs (there is a Blu-Ray 100 in the works), and 100 times greater than DVDs.  So does this have backup and archive potential?  Let's look into that.

The first question is how fast this thing will be.  The article said that it supports "data recording at the same speed as Blu-ray discs."  The fastest a Blu-Ray disc can currently write is 12x, which translates into 54 MB/s.  That's slow in comparison to modern tape drives, but still not too shabby.  It's way faster than any of the Magneto-Optical formats.  Although it's not stated anywhere, I'm assuming this is a random-access format, so its access time during restores or retrievals would be very nice when compared to tape.  Due to the load/unload process, it's still not going to be as fast as a hard drive unless we're talking about leaving the disc in the drive all the time.  In a robotic setup, you'd have to add robotic time and load/unload time.  But this would all be similar to, if not better than, the speeds we have with tape.
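For perspective, filling one of those 500 GB discs at Blu-Ray speed would take

\[ \frac{500{,}000\ \text{MB}}{54\ \text{MB/s}} \approx 9{,}300\ \text{s} \approx 2.6\ \text{hours,} \]

which is fine for trickling data into an archive, but a single drive at that rate isn't going to swallow a modern backup window.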

The next question is cost, and there's nothing on that yet.  Traditionally, other optical formats have lost this race in a big way.  Only time will tell whether or not this format will change that pattern.

Finally, there's the question of long-term stability of the media itself.  I previously posted about the differences of tape vs disk in this area, and how tape is actually more stable for longer periods of time than disk is.  However, this is holographic storage and I honestly have no idea what the long term viability of data stored on such a medium would be.   I'm leaning towards the idea that it would actually be very stable, but I know that other optical formats are not as stable as one might think they would be, so…  Only time and more research will answer that question, too.

Assuming that they address the cost concerns and my hunches are right about its long term stability, I'm really leaning towards this as a long-term archival medium — as opposed to a backup and recovery medium.  While 54 MB/s may sound like a lot, it's just not enough for today's large data centers.  Throughput doesn't matter much in archival situations, but random access does, making this really well suited to archive.

For those of you ready to dump tape or disk for anything that gives you the portability and cost of tape with the random-access nature of disk, it looks like you're going to have to wait a bit.


Include All Files; Reject Some

I had a twitter chat with @JLivens the other day where the question was "what do you back up?"  My first response was to, of course, say that I back up everything – thrice – cause I'm me. If you're curious, my critical personal data is synced on multiple computers with history using Dropbox (which I'm reconsidering based on how things have been going over there lately), then it's backed up with the free version of CrashPlan to another computer that isn't at my house, AND I can't resist the urge to throw in a Time Machine backup every once in a while.  You know what?  I haven't done one of those in a week or so.  Just a second.  My little Time Machine icon is spinning now. Ah, there I feel better.

Side note: for all my talk about tape lately, you'll notice that I don't have any tape in my setup for now.  I am about to embark on a project that may make me reconsider that as I might have an archiving need soon.  Can't keep it all on spinning disk!

Alright, back to the topic at hand.  What do I back up?  I actually do back up everything, but that is not the point I wanted to get across in this post. 

It's easy to come up with a list of directories you don't want to back up.  Your /tmp folder, your "Temporary Internet Files," your folder on your work laptop that contains the illegally downloaded movies that you should have never downloaded in the first place.  Yeah, I'm talking to you.  Pay for the media/software you consume.

But what I wanted to talk about was how to make your backup selections if you want to exclude things.  What I've found is that the human tendency is to say "just back up the Documents folder," or something like that.  And that is what I really want to talk you out of.  There is too much risk in doing it this way.  You could accidentally put some important data in a directory you're not backing up.  You could create a whole other directory that contains really important data and forget to add it to the list.  The risk outweighs the benefit of excluding the other data.

If your backup software has the ability, please have it autoselect both filesystems/drives and folders/directories.  If it supports it and if you want to do so, you can also create an exclude list of the directories you definitely don't want to back up.
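If you're rolling your own backups with scripts rather than using a commercial tool, the same principle applies: point the tool at everything and list only the exclusions.  Here's a minimal sketch using rsync (the destination path and exclude patterns are just examples, not recommendations):

#!/bin/sh
# Back up everything under $HOME, excluding only what we explicitly don't want.
# Anything new that gets created under $HOME is picked up automatically.
rsync -a \
  --exclude='.cache/' \
  --exclude='Temporary Internet Files/' \
  --exclude='tmp/' \
  "$HOME/" "/mnt/backup/$USER/"

The important part is that the default is "include everything," so a new folder you forgot about still gets backed up.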

And that's what I came to say: back up everything, but exclude what you don't want.  Hopefully the title makes sense now.


Schedule Tweets from the command-line

We at Truth in IT have several events that we need to invite people to, and twitter is one of the ways we do that.  Scheduling such tweets in advance is a great way to make sure you send the right tweet at the right time, and Twuffer.com (short for Twitter Buffer) is an easy way to make such automated tweets happen.  The only problem is that each scheduled tweet in twuffer.com takes several mouse clicks, each of which is followed by a screen refresh.

I wondered if there was an easier way.  I'm proficient in old-style Bourne shell programming in Unix/Linux (never did get very good at Perl, but I rock at Bourne Shell) and I know how to use cron, so if I could just find a way to tweet from the Linux command-line I figured I could make my own twuffer.

An Internet search for "tweet from the command line" turned up this and this article.  I got all excited, then disappointed once I realized those were using basic authentication, which was disabled in June of last year.  It was replaced by OAuth authentication, allowing you to authorize an app to use your twitter account without giving them your twitter password.

A Google search for "oauth twitter commandline post" turned up this post from Joe Chung's "Nothing of Value" blog called "Twitter OAuth Example."  He explains a series of separate PHP scripts that, if run and edited in the proper order, will leave you with a script that is your own properly registered and authorized Twitter app, able to send tweets from the command line.

While I was able to figure out Joe Chung's instructions (and I'm incredibly thankful for them and the code that comes with them), I wanted to adapt his code and instructions a little bit for those who may not be as adept at coding.  And I've also added my own code around the final tweet.php script to support scheduled tweets.

Before You Start

If you want to understand more about OAuth and how it works, you should read the original blog post.  Each major step below is also a link to the original instructions from twitter.

What You'll Need

You will need a Unix/Linux command line (or something like it), PHP, and cron to make all of this code work.  If you don't have cron or something like it, you won't be able to send scheduled tweets, but you will still be able to send tweets from the command line.  You'll also need to have a basic understanding of the command line.  Unlike with the original code from Joe, though, you won't have to edit any of the PHP scripts.

Step 0: Download my modified code

You can download all of my source files here: http://www.backupcentral.com/twitterapp.zip
Unzip them into a directory, then cd into that directory.  The first six steps of my post follow the ones from the original post.   I again urge you to read the original post, as he really deserves all the credit for figuring this out.  All I did was hack his scripts to behave differently.  If you want even more information, each step is a link to the original OAuth spec from twitter.com.

Step 1: Register an application with twitter

Only registered apps can send tweets via Twitter's API.  So in order to send a tweet on the command line, you need to be your own app.  (Don't worry; the code is already written.  You just need to register the code you just downloaded as your own app.)  The first step in this process is to go to twitter.com and register your app.

Here are some pointers to help you fill out the form:

  1. Whatever you put as the name of the Twitter App is what will show up when you send tweets in the "via" column.  For example, we named ours TruthinITApp, so our scheduled tweets say "via TruthinITApp" at the end.  You can name the app whatever you want, except that the name cannot have the word "twitter" in it.
  2. It doesn't matter what you put in the rest of the fields, although you should probably put a valid website, and a description of what you're up to.
  3. I put Browser as my application type, but I'm not sure if that matters.
  4. Specify Read & Write or Read, Write & DM access.
  5. Use twitter for login.

Once you have clicked Save, you will be presented with a results page.  You need to get two values from that page: Consumer Key & Consumer Secret.  (Record these values somewhere for later.)

Step 2: Get a request token

Now you're going to do the equivalent of a user using the app for the first time.  You will log in to twitter, then try to use the app.  Twitter will ask if you authorize the app.  After you do that, it gives you another value you need.

1. Log in to twitter as the user you wish to send tweets as
2. Run the following command, substituting the two values of consumer_key and consumer_secret you got in Step 1

$ php getreqtok.php consumer_key consumer_secret

This will display a URL followed by a command.  You will use those two strings in the next two steps.

Step 3: Authenticate the user and authorize the app to tweet for the user

Cut and paste the URL from the previous step into your browser.  (This is the equivalent of using the app for the first time as the user you want to tweet as.)  Once you click Authorize App, it will display a seven-digit number that you will then append to the command displayed in the results of the previous command.  (Record the value for later.)

Step 4: Get the access token and secret

Now that the app has been authorized to tweet for the user, the app needs to establish a special key and secret (think username and password, but without actually giving them your password) that it will use each time it tweets on your behalf.  The command will look something like the following command, where consumer_key and consumer_secret are the values that you got when you registered your app, oauth_token and oauth_token_secret are the values the app was given when the app was authorized by the user, and authkey is the seven-digit value from the web page.

$ php getacctok.php consumer_key consumer_secret oauth_token oauth_token_secret authkey

This command will display the next command that must be run, which is the actual twitter.php command, along with all the arguments you need to pass to it.  It will look something like the following, where access_token and access_token_secret are the values that the previous command got that are the unique username/password combo for this app and for this user. (Notice the access token actually starts with your twitter user ID — the number, not the name.)

$ php tweet.php "Hello World…" access_token access_token_secret consumer_key consumer_secret

Step 5: Post a tweet on the command line

Start your twitter client or monitor twitter.com for the user you're going to send the tweet as.

Run the command above, and you should see a bunch of text fly by.  As long as you don't see errors like "Invalid Token" or anything like that, your tweet should have gone through.  

You just sent your first command-line tweet!

Scheduling tweets using cron and tweet.sh

In addition to the code above that was written by Joe Chung, I wrote tweet.sh, which uses tweet.conf and tweet.txt to automate the sending of tweets using cron.  The rest of this blog post is about how to use those tools, which are also in the code you downloaded in Step 0.

Step 6: Edit tweet.conf with the appropriate keys and secrets

Put the values of consumer_key and consumer_key_secret in as the second and third fields of the consumer_key line:

consumer_key:<consumer_key>:<consumer_key_secret>

Create a line for each user that you have authorized using the steps above and insert the appropriate values for:

username:<access_key>:<access_key_secret>
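For example, a filled-in tweet.conf for one authorized user might look like this (the keys and secrets below are made-up placeholders, not real values):

consumer_key:AbCdEfGhIjKlMnOp:1a2b3c4d5e6f7g8h9i0j
testuser:12345678-AbCdEfGhIjKlMnOp:9i8h7g6f5e4d3c2b1a0z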

Step 7: Put a cron job that will run tweet.sh every minute for you:

* * * * * /workingdirectory/tweet.sh workingdirectory >/tmp/tweet.out 2>&1

Where workingdirectory is the directory where you installed the code.

Step 8: Edit tweet.txt and put a tweet sometime in the near future. 

The format for tweets is as follows (where "|" is the field separator):

MON DD HH:MM|username|Tweet goes here

Here's an example.  First, get the current date

$ date
Tue Jun 21 03:20:22 EDT 2011

(Yes, I'm up a little late working on this post…)

Second, add a tweet to the file for a few minutes from now

$ echo "Jun 21 03:22|testuser|Test tweet1" >>tweet.txt

Please note that I used "|" as the field separator.  This means you cannot use the "|" character in any of your tweets.  One other note: Twitter will not let you send the same tweet twice, so you will need to change your tweet phrase if you want to do more testing.

When Jun 21, 03:22 rolls around, it will send your tweet.  If tweet.php returns successfully (indicating a successful tweet), tweet.sh removes that line from tweet.txt and appends it to completedtweets.txt.  If there was a problem sending your tweet (such as it being a duplicate), then it leaves the line in the tweet.txt file.
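For the curious, here is a minimal sketch of the kind of loop such a wrapper can implement.  This is not the actual tweet.sh from the zip file; it's just an illustration of the logic described above, and it assumes the tweet.conf and tweet.txt formats from Steps 6 and 8 (with zero-padded days in the date field):

#!/bin/sh
# Illustrative sketch only -- not the actual tweet.sh from the download.
# Usage: tweet.sh workingdirectory
cd "${1:-.}" || exit 1

NOW=`date '+%b %d %H:%M'`                        # e.g. "Jun 21 03:22"
CKEY=`grep '^consumer_key:' tweet.conf | cut -d: -f2`
CSECRET=`grep '^consumer_key:' tweet.conf | cut -d: -f3`

cp tweet.txt tweet.scan.$$                       # scan a snapshot of the file

while IFS='|' read -r WHEN WHO TEXT ; do
  [ "$WHEN" = "$NOW" ] || continue               # skips blanks, comments, other times
  AKEY=`grep "^$WHO:" tweet.conf | cut -d: -f2`
  ASECRET=`grep "^$WHO:" tweet.conf | cut -d: -f3`
  if php tweet.php "$TEXT" "$AKEY" "$ASECRET" "$CKEY" "$CSECRET" ; then
    # Success: move the line from tweet.txt to completedtweets.txt
    echo "$WHEN|$WHO|$TEXT" >> completedtweets.txt
    grep -v -F "$WHEN|$WHO|$TEXT" tweet.txt > tweet.tmp && mv tweet.tmp tweet.txt
  fi                                             # on failure the line stays put
done < tweet.scan.$$

rm -f tweet.scan.$$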

That's it.  All you need to do to send tweets in the future is to add them to tweet.txt and they will magically happen.  You can put blank lines, comments, or whatever other formatting you want in tweet.txt, as long as the actual tweet lines follow the format in step 8.

Please let me know if this post was helpful.  Also please post any suggestions on how to make the code better.  If I can make it work, I'll update the code and the post.


Other tape considerations

I've posted and talked quite a bit about tape lately.  I asked if we've put it out to pasture too soon. After participating in a LinkedIn thread from hell, I said that tape was a more reliable medium for long term storage.  I talked about that last post on Infosmack 102, which should be on The Register any day now.  I've also spoken about tape at my Backup Central Live! shows.  (Quick plug: We have announced the dates for Toronto, NYC, Seattle, Denver, Atlanta, Austin, Phoenix, Los Angeles, San Francisco, and Washington DC.  Click your favorite city to register!)

First let's talk about backup and recovery

Anyone who has heard me speak knows that I do not recommend using tape as the primary target for backups.  The main problem with tape and backups is that most backups are incremental backups and provide <1 MB/s of performance, while modern tape drives want at least 40-50 MB/s after compression, and really want much more than that.  This speed mismatch is impossible to overcome without bringing disk into the picture.  I think that disk (especially deduped disk) offers so many advantages for backup and recovery that it just makes sense to use it as your primary target.  Even if you plan to build your backup system primarily out of tape (usually due to cost), you need to solve the speed mismatch problem using disk staging: stage to disk, then destage to tape.  You don't get the recovery benefits that disk provides, but at least you solve the shoe-shining problem.
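To put numbers on that mismatch: a drive that needs 40 MB/s to keep streaming, but is being fed 1 MB/s, can spend at most

\[ \frac{1\ \text{MB/s}}{40\ \text{MB/s}} = 2.5\% \]

of its time actually writing; the rest of the time it is stopping, rewinding, and repositioning, which is exactly the shoe-shining problem.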

It also makes a lot of sense to replicate deduped backups to another device offsite, although I still believe tape is a cheaper way to accomplish the offsite requirement.  Tape also comes with the "air gap" feature.



Tape more reliable than disk for long term storage

Tape is inherently a more stable magnetic medium than disk when used to store data for long periods of time.  This is simply "recording physics 101," according to Joe Jurneke of Applied Engineering Science, Inc. 

I had heard rumblings of this before, but it was Joe that finally explained it in almost plain English in a post to this thread from hell on LinkedIn.  Here's the core of his argument:

By the way, the time dependent change in magnetization of any magnetic recording is exponentially related to a term known as KuV/kt. This relates the "blocking energy" (KuV) which attempts to keep magnetization stable, driven by particle volume (V) and particle anisotropy (Ku) to the destabilizing force (kt) the temperature in degrees kelvin (t) and Boltzmans constant (k).  Modern disk systems have KuV/kt ratios of approximately 45-60. Modern production tape systems have ratios between 80 and 150. As stated earlier, it is exponentially related. The higher the ratio, the longer the magnetization is stable, and the more difficult it is to switch state…..Recording Physics 101….

I had to call him to get more information.  He explained how this came about.  Disk drives have been pushed for greater and greater densities, which caused their vendors to create a much tighter "areal density."  Tape, on the other hand, mainly got longer and fatter to accommodate more data in the same physical space.  (Yes, it increased areal density, too, but nowhere near as much as the disk drive folks did.)  The result is that the tape folks have more room to play, allowing them to use magnetic particles with a bigger particle volume (the V in the equation).  The bigger the particle volume, the more stable the magnetism is, according to the KuV/kt equation.  In addition, tapes are generally stored outside of the drive, which means their temperature is lower than that of disk drives.  That means they have a lower t value (the temperature in kelvin), which is one of the "bad" (destabilizing) terms in the KuV/kt equation.  Having a higher V value and a lower t value is what translates into tape systems having ratios of 80-150, vs disk systems that have ratios of approximately 45-60.  While I don't have an exact cite to point to in order to show these exact values, what he's describing makes perfect sense to me.
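For those who want Joe's relationship in equation form, it is the standard Néel-Arrhenius relaxation law (my rendering, not his):

\[ \tau \;\approx\; \tau_0\, e^{K_u V / k_B T} \]

where \( \tau \) is how long a grain's magnetization stays put, \( \tau_0 \) is a tiny attempt time, \( K_u \) is the anisotropy, \( V \) the particle volume, \( k_B \) Boltzmann's constant, and \( T \) the absolute temperature.  Because the KuV/kT ratio sits in an exponent, the difference between a ratio of roughly 50 (disk) and 100 or more (tape) is astronomical, not merely a factor of two.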
 

Add to this the fact that tape drives also have a lower bit error rate than disk.  SATA disk is 1:10^14, FC disk is 1:10^15, LTO is 1:10^16, and IBM 3xx0 and Oracle T10000s are 1:10^17.
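To make those exponents concrete, convert bits to bytes:

\[ \frac{10^{14}\ \text{bits}}{8\ \text{bits/byte}} \approx 12.5\ \text{TB read per error (SATA)}, \qquad \frac{10^{17}\ \text{bits}}{8\ \text{bits/byte}} \approx 12.5\ \text{PB (enterprise tape)} \]

Each extra power of ten buys you ten times as much data read between unrecoverable errors.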

Add to this the fact that tape drives always do a read after write, where disk drives do not always do this.

Sooo…

Tape drives:

  1. Write data more reliably than disk
  2. Read it after they've written it to make sure they did (where disks often don't do that)
  3. Have significantly less "bit rot" or "bit flip" than disk drives over time.

Like I said in a previous post, I think we've put these guys out to pasture a little too soon.


My Detente With EMC's DD Archiver

When I first heard about the EMC disk archiver, I blew my stack.  I don’t remember exactly how it was presented to me, but what I heard was that EMC was coming out with a disk product that was designed to hold backups for seven years or more.  Since storing backups for seven years or more is fundamentally wrong (and no one — and I mean no one — argues with that), the idea that EMC was coming out with a product that was designed specifically to do that angered me.  Brian Biles, VP of Product Management for EMC’s BRS division, said with a wry smile, “so you’re saying we’ve become a tobacco company.”

I replied saying, “No, you’ve become a cigarette case manufacturer.  You shouldn’t smoke, kids, but here’s a really pretty gold case to hold your ciggies in.”  I had a similar conversation with Mark Twomey (@storagezilla) on Twitter.

Since that time, I have come to a detente.  I still wouldn’t buy one of these for my long term storage needs, but I can see why some other people might want to do so — and I don’t think those people are wrong or committing evil or data treason. This blog post is about how I got here from there.

Here were my arguments against this product:

There’s no way that this could cost less than tape

Some of the messaging that I saw for the Archiver suggested that it was as affordable as tape.  That’s simply not possible.  First, let’s talk about what we’re competing with. (For these comparisons, I am assuming you have either a tape system or a Data Domain box, and that what we’re talking about is adding the cost of extra capacity to support long term storage of backups or archives.)

A backup or archive that is kept for that long is not kept in the tape library; it’s put on a shelf.  (This is because chances are that it’s never going to be read from.)  Therefore, the cost for tape is about $.02/GB, which is the cost of an LTO-5 tape cartridge.  The daily operational cost of that tape’s existence is negligible, assuming it’s onsite.

The last time I checked target dedupe appliances, they were about $1/GB after discounting.  I also saw a slide that this archiver is supposed to be about 20% cheaper than a regular Data Domain.  That puts it at around $.80/GB — 40 times greater than the cost of a tape on a shelf.  And the daily operational cost of that disk is higher than the tape because it is going to be powered on.  (The Archiver does not currently support powering down unused shelves, although it may in the future.)

Then there is the issue of dedupe ratio.  The deduped disk price above is assuming a 20:1 dedupe ratio.  Dedupe ratios do not go up over time; they actually decrease.  This is because eventually we start making new data.  (The full backup you take today is going to contain quite a bit of new data when compared to the full backup from a year ago.)  Then there’s the fact that the Archiver needs to start each tier (a collection of disks) with a new full backup, thus decreasing the overall dedupe ratio of the entire unit.  (It must do this in order to keep each tier self-contained.)  The result is that you will probably get a much lower dedupe ratio on your long term data than on your short-term data.  This increases your cost.
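To see how sensitive the comparison is to that ratio, using the same rough numbers as above: if the $.80/GB figure assumes 20:1, and your long-term data only dedupes at 10:1 (a number I'm picking purely for illustration), the effective cost doubles:

\[ \$0.80/\text{GB} \times \tfrac{20}{10} = \$1.60/\text{GB} \quad \text{vs.} \quad \$0.02/\text{GB for tape} \]

That moves the gap from 40 times the cost of tape to 80 times.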

If you’re going to do the right thing and use archive software to store data for several years (instead of backup software), any good archive software has single-instance-storage.  So if you’re using archive software, you’re going to get an even lower dedupe ratio.

Which brings me back to my belief that there is no way this can be anywhere near as inexpensive as tape.

The good news is that I didn’t hear EMC saying that the Archiver is as cheap as tape when I saw them speak about it at EMC World.  When I talked to the EMC people at the show, I told them I had heard stories of EMC sales reps showing this unit as cheaper than tape by using dedupe ratios of 100:1.  (The idea is that you’re going to store 100 copies of the same full backups.)  They told me that any sales rep quoting ratios like that is not speaking on behalf of EMC and is talking out of his …  Well, you know.

There’s nothing that this unit offers that justifies that difference in price

Disk offers a lot of advantages when used for day-to-day backups.  It’s a whole lot easier to stream during both backups and restores.  There is no question that it adds a lot of value there.  However, the idea of backups or archives that are stored long term is that no one reads them.  If they are reading them, it’s for an electronic discovery request, where the amount of time you have to retrieve that is much greater than the time you typically have for a restore.  This increased amount of time is easily met with tape as your storage medium.  Disk offers no real advantage here.

When I said this, Mark Twomey pointed out that this unit offers regular data integrity checking of backups stored on it.  I informed him that if this were important, there are now two tape library manufacturers (Quantum & Spectralogic) that will be glad to do this for your tapes.

I will concede that disk does offer an advantage if you’re using backups as your archives.  Having backups that will load instantly helps mitigate the issue of how many restores you’re going to be doing to satisfy a complicated ediscovery request.

It’s just wrong to store backups for many years

You should not be using your backups as archives.  You should not be using backups as archives.  If you ever get an ediscovery request for all of Joe Smith’s emails for the last seven years — and you happen to have a weekly full for each of the 364 weeks of that time frame — you will remember what I said.

The thing is that EMC agrees. In fact, the EMC Archiver presentation starts with a few slides about how you should be doing real archiving; you should not be using your backups as archives.

They also said that they see this device as a transition device that can store both backups and archives.  Just because this device can store backups doesn’t mean you have to store backups on it.  You can use proper archive software.  (But, if you did, I once again point out that your dedupe ratio will go down and therefore your effective cost per GB will go up.)

So what’s changed, then?

I had a number of good conversations with EMC folks at last week’s EMC World.  (Which, for the record, was a really big show.)  Some of those comments are above.  They know that this is not going to be cheaper than tape, and they’re saying that anyone that is saying that is not being truthful.  They know that storing backups for years is wrong; they also know that more than half of the world does it that way.

The reason for the detente, however, is that I realize that many people hate tape.  I think they’re wrong, as I’ve stated more than a few times.  There are plenty of IT departments that have a “get rid of tape” edict.  If the goal is to get rid of tape, the fact that the alternatives are much more expensive is not really an issue.  And if you’re going to store backups for a really long time on disk, then at least EMC put some thought into what a disk system would need to do in order to do that right.  This includes things like fault isolation. If you lose one tier for whatever reason, you only lose the data on that array.  It includes things like scanning data occasionally to make sure it’s still good.

Finally, Index Engines also announced an important product at EMC World that will help increase the value of the Archiver for those using it to store backups.  They already have a box that can scan tape backups and basically turn them into archives.  (One of the coolest products I’ve ever seen, BTW.)  They now support NFS, so you can point an Index Engines box at a DD Archiver and voila!  Those backups that you are storing on disk magically become fully searchable, ediscovery-ready archives.

Summary

Don’t use your backups as archives.  Use archive software instead.  Tape is still the most economical destination for long term storage of backups or archives, and it’s a pretty reliable one, too.  However, if you’re going to store your backups or archives on disk for many years, there are worse places to put them than the EMC Data Domain Archiver.


----- Signature and Disclaimer -----

Written by W. Curtis Preston (@wcpreston). For those of you unfamiliar with my work, I've specialized in backup & recovery since 1993. I've written the O'Reilly books on backup and have worked with a number of native and commercial tools. I am now Chief Technical Architect at Druva, the leading provider of cloud-based data protection and data management tools for endpoints, infrastructure, and cloud applications. These posts reflect my own opinion and are not necessarily the opinion of my employer.