The Great VTL Debate

“My operations people want me to move off of VTLs as quickly as possible.”  So said an attendee at a Dedupe School I spoke at in Seattle last week.  This person (I’ll call him Greg) works for a very large company that uses well over 100 VTLs and well over 100 NAS-based target dedupe systems.  I’ve had the opportunity to chat with Greg several times. He is what my Boston friends would call “wicked smart”; he knows his stuff.

Why are his ops people so down on VTLs?  What does this say about vendors such as FalconStor, SEPATON, or IBM, whose products are VTL only?  Do they have a future?

I have defended the VTL industry more than most pundits I know, so I thought it only appropriate to revisit the topic in light of this feedback from a real VTL customer.  The good news is that the reason I push certain people to VTLs still exists; the bad news is that the bad things about VTLs haven’t gone away either, at least not with this guy’s particular VTL.

VTLs have one major advantage over their NAS counterparts: Fibre Channel.  Even with advancements like 10 GbE, TCP Offload Engines (TOEs), iSCSI, and FCoE, Fibre Channel is still a more efficient way to move large chunks of data than any of the alternatives.  Greg told me that they have tested all of these newer technologies and agreed with the above statement.  10 GbE may sound faster than 8 Gb Fibre Channel, but when it comes to backups it just isn’t.  You can argue that Fibre Channel is more expensive, or that it’s the protocol of the previous decade and not the next, but it’s still the best thing going for bulk transfers of data like backups.

For the last 10 years, backup experts have recommended moving larger servers to LAN-free backups, where backups are sent over Fibre Channel.  We’ve also been recommending using disk as the initial target for backups, even if you ultimately copy your backups to tape.  (It solves the shoe-shining issue by using the disk as a cache.)  For the past few years, we’ve also been recommending using deduplicated disk, but we’ll set that aside for the moment.

Assume that you want to follow the first two of these recommendations: LAN-free backups and disk as your initial backup target. You need a disk device that you can share with multiple servers over Fibre Channel, and there are two ways to do that: a global SAN-based file system and a VTL.  Greg and I agree that the VTL is by far the best choice.  The concept of a SAN-based, globally-writeable filesystem may seem nice but it has not gained market acceptance.  Even a guy whose company doesn’t like VTLs thinks they’re better than a SAN file system.  That’s something if you ask me.

When you add deduplication into the mix, it’s the final nail in the coffin of the discussion.  If you want disk as a target, Fibre Channel as a transport, and deduplication features, your only choice is a VTL.  (There are no globally-writeable SAN filesystems with dedupe yet.)

The statement above refers to the options available to a person whose backup software does not have a built-in deduplication option, or to a person who believes that their backup software’s dedupe option isn’t ready for their needs.  Products like TSM, ARCServe, NetBackup, and CommVault do offer built-in dedupe that allows customers to use both Fibre Channel and disk and have dedupe as well.

FalconStor might claim that they are the only exception to this rule, since they offer OST (NetBackup/Backup Exec Open STorage) over Fibre Channel.  (Everyone else does OST over IP.)  However, since they ultimately store that OST data on virtual tapes in their VTL, they should continue to have the same challenges that other VTLs have.

So what’s wrong with a VTL? 

When VTLs first came out, it was thought that the similarity to real tape libraries would make them easier for customers to integrate into existing backup systems.  It’s just like what you’re already using — just better!  The problem is that pretending to be tape also comes with some of its disadvantages:

  • Tape drive operations sometimes hang, requiring SCSI resets
  • Even virtual tapes get stuck in virtual tape drives (this is a variation of the previous issue)
  • 50 virtual drives do indeed take more work to manage than 10
  • You cannot read and write to a single tape at the same time

Greg told me that the first two issues caused enough hassles for his operations team that they wished they could get rid of VTLs altogether.  He said they don’t have any of those challenges with their NAS-based dedupe appliances. Because of this, he feels very strongly that he wants to limit the number of virtual tape drives that he needs to manage. But since the NAS-based systems can’t meet the performance requirements of his larger systems, he’s forced to deploy the VTLs.

I did push back a little, asking him which brands of VTLs he had seen this on.  (His experience was with one VTL that is no longer on the market, and one other mainstream product that is a VTL-only product.)  I suggested that things might have been different had he tested and deployed a different model of VTL.  He disagreed, feeling that the problem is primarily a SCSI/tape drive thing, not something any particular VTL vendor could fix.  I still can’t help but wonder if his opinion about VTLs would be different if he had over 100 VTLs of somebody else’s brand.  They aren’t all created equal, after all.  If any VTL customers want to reach out to me privately to confirm or deny that you’re experiencing the same problems as Greg, I’d really like to hear it.

The final disadvantage (not being able to read/write at the same time) is one that causes many people grief, because it causes conflicts for various processes.  Perhaps you want to copy last night’s backups to real tape, or perhaps your virtual tape needs to be read by a post-process deduplication process.  Either way, it is a pain that you can’t write and read to/from the same tape at the same time.
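To make the conflict concrete, here is a minimal sketch of the constraint, not how any particular VTL or backup product implements it. The `VirtualTape` class, the barcode, and the job names are all hypothetical; the only assumption carried over from above is that a tape can be mounted by one job at a time, so a copy job has to wait for the backup that is still writing to that tape.

```python
class VirtualTape:
    """Hypothetical model of a (virtual) tape: only one job may mount it at a time."""

    def __init__(self, barcode):
        self.barcode = barcode
        self.mounted_by = None  # name of the job currently holding the tape

    def mount(self, job):
        # A tape that is already in a drive cannot be mounted a second time,
        # whether the second job wants to read or to write.
        if self.mounted_by is not None:
            raise RuntimeError(f"{self.barcode} busy: mounted by {self.mounted_by}")
        self.mounted_by = job

    def unmount(self):
        self.mounted_by = None


tape = VirtualTape("VT0001")
tape.mount("nightly-backup")           # the backup is still writing

try:
    tape.mount("copy-to-physical-tape")  # the copy job wants to read the same tape
    conflict = None
except RuntimeError as err:
    conflict = str(err)                # the copy job has to wait

tape.unmount()                         # the backup finishes and releases the tape
tape.mount("copy-to-physical-tape")    # now the copy can proceed
```

A file on a NAS share has no such mount step, which is why the same copy or post-process dedupe job does not collide with a running backup there.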

I actually have a bet with Marc Staimer about the future of VTLs.  He said they will be on the decline in five years and I said they would be on the rise.  The weird thing is that we made it on Valentine’s Day and the bet was for dinner. 😉 We’re almost 3/5 of the way through the five years and VTL-only products continue to flourish.  They may not be perfect, but I still think they’re the best thing going if you want dedupe and Fibre Channel in a single appliance.  (They’re actually the only thing going.)  Ultimately the market will decide.


10 thoughts on “The Great VTL Debate”

  1. Mark Cooper says:

    You say, “… there are two ways to do that: a global SAN-based file system and a VTL.” Have you checked out NetBackup’s SAN Client? Basically, we wrote a target-mode driver for the HBA on the media server. Multiple clients can then write to the media server as if it were a target.

  2. cpjlboss says:

    I know about the SAN client. But in a large environment, you still need multiple REAL media servers to receive/send data to disk. Multiple media servers still create the need to have multiple disks to write to.

    Also, I’m speaking generally here to people using multiple backup products, not just to NBU customers.

  3. scsirob says:

    Instability on the SCSI/FC level is the major tip-off for ‘Greg’? Then the best thing to do is fix that instability. If the transport layer is broken or unreliable it will cause issues with tape but also with disk and replication. Better fix that..

    I know of NAS dedupe systems with frequent hanging NFS sessions as well. Does that make NAS a bad technology? Fix the cause, don’t dismiss a technology that’s a victim of poor integration.

    As for simultaneous read/write, keep the size of virtual tapes small and this will hardly ever be a problem. Worst case, abort the backup job when the restore has higher priority.

  4. cpjlboss says:

    @Rob

    IMO Greg’s problem is not that his SAN is unreliable. It is that tape drives on SCSI get hung up. Backup software, HBAs, and host processes occasionally do dumb things with tape drives (fake or not), and require a reset of something to fix it. Anybody that has spent time in a large datacenter with many tape drives (he has thousands) that operate constantly can attest to that. I know _I_ can attest to that.

    In his defense, he had similar problems with two VTLs, so he sees a trend. Having said that, I did say that it’s possible that part of the problem is the VTL he chose. (Perhaps other VTLs have developed a workaround to this problem.) I won’t go into why they chose the one they did, but suffice it to say that I didn’t agree with their selection process. They artificially dropped a number of products off because they fell into a category that they thought was a bad idea.

    Minimizing the size of virtual tapes will minimize the simultaneous read/write problem, but it doesn’t solve it.

    Assuming that you don’t get the hanging NFS sessions (which BTW I’ve never heard of until your post), the point of the post is that if you don’t NEED Fibre Channel, NAS-based systems are probably easier to deploy and use.

  5. Dom says:

    Just wondering what is the biggest VTL/OST device out there. From what I’ve seen it’s the Sepaton but I’m guessing it’s not? Actually, what are the top 3? Any info on each, like speeds etc.? Just curious.

  6. esherril says:

    NFS mounts can definitely hang. Anybody who has ever mounted NFS exports with a UNIX or Linux box as “hard” mounts (which is usually recommended vs. “soft” mounts, for data integrity, along with using TCP instead of UDP), without using the “intr” option (which lets you CTRL-C out of a process waiting on an NFS mount to respond), knows about that pain….

    Hopefully NFS v4.1 / pNFS will help some there, or at least not make it any worse. I would love to see the timetables for full support of pNFS on non-VTL-mode GLOBAL dedupe backup.

  7. schweeb says:

    I’ve been using the EMC Disk Library product (FalconStor based, obviously) for somewhere around 5 years. I’ve had very few issues with them, none like the problems he is describing.

    I’m getting my first Data Domain devices pretty quickly here, and they will be run in VTL mode as well, since our FC infrastructure is beefier than our IP infrastructure.

  8. jm7640 says:

    This is “Greg”. I wanted to clarify the issue.
    The specific issue here is how to efficiently and repeatably move data from the media server to the disks serving as the backup target and then either duplicate the backup images on disk (Copy 1) to tape (Copy 2) or replicate to storage at another data center (and do this repeatably every day).

    To write out Copy 1 from the media server to disk, I have two transports available to choose from: Fibre Channel or Ethernet.

    Where 10Gb is available my options are IP based storage using CFS or NFS. It is NOT practical to back up to a standard NAS like a NetApp NearStore and then replicate those images offsite or back them up to tape through NDMP because there are too many levels of obfuscation created between the tape and the server if a restore is needed.

    So the only practical option when using Ethernet is to back up Copy 1 to a DD. In many sites this is exactly what I do. However, there is a great deal of data that does not dedupe well, and the volume of traffic exceeds what I can replicate, so in many cases I still need to duplicate the images from DD to offsite tape. This is an expensive and IO-intensive backup model and is not ideal.

    In many data centers, 10Gb Ethernet is not available as an option but Fibre Channel is. Per standard, each media server is provisioned with at least two FC ports dedicated to backup.

    Since Fibre Channel is the transport available, I have two access methods: VTL or a standard file system.

    A benefit of VTL over the file system is that the VTL allows a number of media servers to access a large shared pool of storage concurrently, while arbitrating reads and writes through the VTL layer.

    For the storage operations staff this streamlines the provisioning process because they simply need to provision storage once to the VTL and from there on, only need to issue a zoning change to make the VTL accessible to the media server host.

    For the backup staff, they are already configuring the physical tape drive on the host so configuring one or more VTL drives on the server should only slightly increase the provisioning time. (I thought)

    In addition, VTL provides the potential ability to offload duplication onto dedicated hardware. The media servers can write the backups to VTL at night and a dedicated infrastructure media server can handle duplication to tape during the day. Adding deduplication software to the VTL also allowed me to better utilize my storage frames compared to just using a file system on a media server.

    After deploying these solutions I received a great deal of pointed feedback from the operational staff. The hot issue is the number of virtual drives that are required and the number of VTLs needed to support the environment. The VTLs added a layer of complexity that needed to be managed and required staff time.

    Each virtual drive still requires an EMM entry and the SCSI device still needs to be mapped and managed in the OS. Virtual drives do go down for a myriad of reasons, or a backup stalls for other reasons, and it leaves the virtual drive in an uncertain state such as a stuck virtual tape.

    In a small environment this is more manageable but we have thousands of virtual drives. To the operations staff it is easier for them to use a standard filesystem as the STU as opposed to using VTL.

    That is the gist of what I said to Curtis.

  9. jm7640 says:

    This is Greg again. I spoke with operations and have some additional commentary.

    Another issue operations has with VTL is the large number of virtual cartridges that must be managed.

    Regarding the virtual drives, among the myriad of reasons tapes become stuck, another is server reboots.

    In the case of HP-UX the problem is compounded by the lack of persistent SCSI device bindings. The /dev entries are not always persistent across reboots.

    On Solaris systems the devices are persistent but the tape and drive are still in a precarious “stuck” state if the reboot happened when the backup was running.

    These issues, multiplied across thousands of servers, add up to days of time spent resolving the VTL issues they create, which is why operations prefers the use of standard file systems for the storage unit.

  10. tkimball says:

    Since this has come up again, I’ll remind everyone that EMC NetWorker has an alternative to VTL: the Adv_File disk device.

    It can read/write at the same time (though not when recovering, but that may get fixed soon). And you can attach the disks in whatever way you want – in my case, several multi-TB volumes controlled by Veritas volume manager (concat of several FC RAID volumes).

    I’m still pushing for NetWorker to take what they got from Data Domain and apply it to adv_file volumes. It could make things interesting again. However, that’s about as likely to happen as the Sun/Oracle 7000 NAS series removing NDMP and acting as a backup client instead (again, something I’ve been pushing for some time).

    –TSK
