Tech Field Day Post 2: NEC HydraStor

NEC HydraStor is the coolest product you’ve never heard of. My interest in this product was renewed last week as I watched a dozen hard-core techies salivate when they heard it described, and almost swoon when they saw it demonstrated.

As I said in a previous blog, I spent two and a half days last week with a bunch of miscreants collected from around the globe (USA, Scotland, Australia, Nigeria, Holland, and — of all places — Ohio). We called it Seattle Tech Field Day, and it was organized by none other than my friend Stephen Foskett (and, of course, his right-hand, Claire Chaplais). For two exhausting days we experienced death-by-Powerpoint and listened to several vendor pitches, and we grilled said vendors about the strengths and weaknesses of their various approaches.

I was not paid to attend this event, but it did not cost me anything to attend, either. I had more than one free meal and drink on these guys, and I got a few tchotchkes, but I am under no obligation to blog about what I saw. So please take the posts I do write about this event as covering products I found genuinely interesting. I want to reiterate that NEC is not paying me to write this blog, nor are they paying me for anything. (That last part was just for you Greg K.) 😉 I wanted to reiterate this because I’m about to go all fan-boy on them. It’s not a perfect product, but it is one that is truly different and exciting.

Every NEC presentation starts out with how they’ve been around since 1899. (According to Wikipedia, they were the first company that was the result of joint US-Japanese investment.) They have $43 billion in annual revenue (FWIW, that’s about 1/3 of Hitachi’s) and 143,000 staff. They spend $3B annually on R&D and have 48,000 patents. They make servers, storage arrays, projectors, optical drives, and a whole bunch of other stuff. They spend about 15 minutes explaining this because the only major problem anyone seems to have with their product is that they’ve never heard of NEC. They want to assure you that they’re a giant company that isn’t going anywhere.

On to the product… First let’s talk about RAID and its limitations. For an excellent overview of those limitations, take a look at this article by Marc Staimer in the May 2010 issue of Storage Magazine. The limitations that I see are:

  • RAID can only protect from a small number of failures (RAID 5=1, RAID 6=2, etc.)
  • RAID rebuilds create performance degradation (a bigger issue with 1 and 2 TB drives)
  • RAID volumes must be designed (stripe size, RAID level, etc.) and managed
  • RAID arrays are not designed to be automatically replaced when obsolescent

Marc’s article discusses a number of alternatives to traditional RAID, including RAID-DP, BeyondRAID, RAID-X, and self-healing storage. He then ends the article by talking about a category he calls “paradigm shift alternative,” and discusses a concept called erasure codes, and mentions that NEC, Cleversafe, and EMC’s Atmos are the only products on the market that use this technology. More on erasure codes later.

An NEC HydraStor is built of two primary components: a storage node (SN) and an accelerator node (AN). An SN is an off-the-shelf 2-U NEC server (remember, NEC makes and sells servers, too) that holds 12 disks and Xeon 5500 processors. The AN is the same server without the storage.

A HydraStor system starts with one or more ANs and one or more SNs. The most typical configuration has two ANs and four SNs. The SNs and ANs talk to each other over a private GbE network. (They say they are not yet limited by this, and plan to upgrade to 10 GbE when they see a need.) The ANs act as NAS heads, each with at least one NFS or CIFS mount point. Their job is to funnel the data to the appropriate SN.

Files that are sent to a mount point on an AN are split into variable-sized chunks (something smaller than a file) and a SHA-1 hash is created for each chunk. The system then needs to perform a hash-table lookup to see if that chunk has ever been seen before. For scalability reasons, the hash table is automatically split up amongst all SNs. If you have four SNs, each has one-fourth of the hash table; if you have five, each one has one-fifth, and so on. The SN that will do the hash-table lookup is selected by the first few bits of the hash, which identify the SN holding that portion of the hash table. If the SN determines the chunk has been seen before, only a pointer is stored. If it has not been seen before, the chunk will be stored.
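To make the routing idea concrete, here is a minimal sketch of a hash table sharded by hash prefix. It is not NEC's implementation: the function names, the 16-bit prefix, and the way the prefix is scaled onto the node count are all assumptions made for illustration, and each shard is just an in-memory dict.

```python
import hashlib


def chunk_hash(chunk: bytes) -> bytes:
    """SHA-1 digest of a chunk (SHA-1 is the hash named in the post)."""
    return hashlib.sha1(chunk).digest()


def owning_node(digest: bytes, num_nodes: int, prefix_bits: int = 16) -> int:
    """Pick the node that owns this digest's slice of the hash table.

    Assumption: the leading prefix_bits bits are scaled onto the node range,
    so each node owns one contiguous, roughly equal slice of the hash space.
    """
    prefix = int.from_bytes(digest, "big") >> (len(digest) * 8 - prefix_bits)
    return prefix * num_nodes // (1 << prefix_bits)


def store(chunk: bytes, shards: list) -> tuple:
    """Return (node_index, is_new). Only never-seen chunks count as new."""
    digest = chunk_hash(chunk)
    node = owning_node(digest, len(shards))
    shard = shards[node]
    if digest in shard:
        shard[digest] += 1  # duplicate: just record another reference
        return node, False
    shard[digest] = 1  # first sighting: the chunk itself would be stored
    return node, True


shards = [dict() for _ in range(4)]  # hypothetical four-SN system
print(store(b"some chunk of backup data", shards))
print(store(b"some chunk of backup data", shards))  # same node, not new
```

Because the node is derived from the hash itself, any AN can work out which SN to ask without a central index.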

Except for the splitting up of the hash table, so far this is pretty normal dedupe stuff. Here’s where it gets interesting. For data resiliency reasons, the chunk is split into (by default) nine “fragments,” and erasure coding is used to create three parity fragments. According to Wikipedia, “an erasure code is a forward error correction (FEC) code for the binary erasure channel, which transforms a message of k symbols into a longer message (code word) with n symbols such that the original message can be recovered from a subset of the n symbols.” This means that we’re going to take the nine data fragments and make a longer message by adding three parity fragments. The key to erasure codes, however, is that the chunk can be reconstructed from any 9 of these 12 fragments. The number of parity fragments determines the level of resiliency; three parity fragments (the default) allow the system to survive three failures. Each mount point can have its own resiliency level, and you can select any number up to six. (A level of six will create six data fragments and six parity fragments, as there is always a total of twelve fragments.) The higher the resiliency level, the better the protection and the higher the overhead. A resiliency level of six has an overhead of 50% (like mirroring), but can survive up to six failures (unlike mirroring).
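The trade-off between resiliency level and overhead is simple arithmetic, sketched below. The fixed total of twelve fragments and the maximum level of six come from the description above; the lower bound of one and the function name are assumptions.

```python
TOTAL_FRAGMENTS = 12  # the post says there are always twelve fragments


def fragment_layout(resiliency: int) -> dict:
    """Data/parity split and overhead for a given resiliency level.

    Assumption: valid levels are 1 through 6 (the post only gives the maximum).
    """
    if not 1 <= resiliency <= 6:
        raise ValueError("resiliency level must be between 1 and 6")
    data = TOTAL_FRAGMENTS - resiliency
    return {
        "data_fragments": data,
        "parity_fragments": resiliency,
        # share of raw disk space spent on parity
        "overhead_of_raw": resiliency / TOTAL_FRAGMENTS,
        # raw bytes written per byte of unique data
        "expansion": TOTAL_FRAGMENTS / data,
    }


for level in (3, 6):
    print(level, fragment_layout(level))
```

At the default level of three, a quarter of the raw space goes to parity; at level six it is half, which matches the mirroring comparison.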

The SN’s next job is to place the 12 fragments on as many separate SNs and disks as it can. If you only have four SNs, three fragments will be placed on each of the SNs, and each of those three fragments will be placed on a separate disk within that SN. This would allow you to survive three simultaneous disk failures, regardless of where those disks reside. You could also survive one node failure. If you had 12 SNs, each fragment would get its own disk and its own SN. This configuration would allow you to survive three simultaneous disk or node failures.
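The node-failure numbers above fall out of a simple worst-case count. The sketch below assumes plain round-robin placement, which is a stand-in for illustration and not necessarily how HydraStor actually chooses locations, and it only models whole-node failures.

```python
from collections import Counter


def place_fragments(num_nodes: int, total: int = 12) -> list:
    """Assign fragment i to node i % num_nodes (assumed round-robin policy)."""
    return [i % num_nodes for i in range(total)]


def survivable_node_failures(num_nodes: int, parity: int = 3, total: int = 12) -> int:
    """Worst-case number of whole-node failures a chunk can survive.

    The chunk is readable as long as no more than `parity` fragments are
    lost, so fail the fullest nodes first and stop before exceeding that.
    """
    counts = sorted(Counter(place_fragments(num_nodes, total)).values(), reverse=True)
    lost = failures = 0
    for fragments_on_node in counts:
        if lost + fragments_on_node > parity:
            break
        lost += fragments_on_node
        failures += 1
    return failures


for nodes in (4, 8, 12):
    print(nodes, "SNs ->", survivable_node_failures(nodes), "node failure(s)")
```

With four SNs each node holds three fragments, so one node can be lost; with twelve SNs each node holds one, so three can be lost.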

When the chunk needs to be read, the AN asks for all 12 fragments, and the first nine to “show up” are used to reconstruct the chunk. (Remember, with erasure coding, any nine of the twelve fragments can be used to reconstruct the chunk.) This means that slower SNs (such as one rebuilding a failed disk drive) do not interfere with read performance. For example, if you had four SNs and one was down (or slow due to a disk rebuild), the read performance would actually stay the same. As difficult as it is to imagine, if you had 12 SNs and three were down, the read performance would still stay the same.
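The “any nine of twelve” property can be demonstrated with a toy erasure code. The post does not say which code NEC uses, so this sketch substitutes a Reed-Solomon-style construction: polynomial interpolation over the prime field of size 257. Each “fragment” here is a single symbol; a real system would apply the same idea across many bytes. It needs Python 3.8+ for the modular inverse via `pow`.

```python
P = 257  # smallest prime above 255, so every byte value is a field element


def _interpolate_at(points, x):
    """Evaluate at x the unique polynomial of degree < len(points)
    passing through the given (x, y) points, with arithmetic mod P."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, -1, P)) % P
    return total


def encode(data, n=12):
    """Return n (index, value) fragments: the data symbols themselves,
    followed by n - len(data) parity symbols."""
    points = list(enumerate(data))
    parity = [(x, _interpolate_at(points, x)) for x in range(len(data), n)]
    return points + parity


def decode(fragments, k=9):
    """Rebuild the k data symbols from the first k fragments received."""
    points = list(fragments)[:k]
    return [_interpolate_at(points, x) for x in range(k)]


data = list(b"HydraStor")  # nine data symbols
fragments = encode(data)   # nine data + three parity fragments
survivors = fragments[3:]  # pretend the first three fragments never arrive
print(bytes(decode(survivors)))
```

Since nine points pin down a polynomial of degree below nine, it makes no difference which nine fragments arrive first, which is why a slow or missing node does not have to hold up the read.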

There are no volumes or RAID groups to be managed. The only things you have to create are the mount points, and the only decisions to be made then are what to call them and what level of resiliency they are to have. (The resiliency level cannot be changed on a mount point once created.) You also need to decide on a size, but most people leave it at the default, which is something like 200 PB. The volume is thin-provisioned, so this 200 PB “volume” doesn’t create any overhead until you write data to it.

More coolness happens as you add more nodes to the system. Suppose you started with a 4×2 system and decided to add four more SNs. The system will automatically “notice” that it has more SNs and begin migrating data to those new nodes in order to increase resiliency. It knows the more SNs that data resides on, the more failure scenarios it can survive. If it only has four nodes, it will make do. If you give it more SNs later, it will make things more resilient automatically in the background as it has time and bandwidth. Pretty cool, huh? I thought so.

Now let’s talk about what happens when it’s time to switch hardware. At some point, the Xeon 5500 processors are going to be depreciated and you’re going to want to replace them with the newest, coolest processors. If you want to replace an AN, you plug it into the private network and say which AN it is replacing, and voila! All mount points are moved automatically to the new AN. Since there is no data to move, it is instantaneous. If you want to replace one or more SNs, you plug the new SN(s) into the private network and say which SN(s) is/are to be retired. All data is automatically migrated from the retiring SN(s) and moved to any new SNs that are found. Once the migration is done, you can simply unplug the retired SNs.

A single AN provides over 500 MB/s of throughput, and the system supports up to 55 ANs working together as a single unit. (They do not have a unified name space yet, but all data sent to any node is automatically deduped against all other data.) That means that a fully-configured HydraStor system can ingest and dedupe data at about 90 TB/hr, or 2 PB per day. Wow.
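As a rough sanity check on those aggregate numbers, here is the arithmetic spelled out. It assumes exactly 500 MB/s per AN, perfectly linear scaling, and decimal units, none of which the post guarantees.

```python
def aggregate_throughput(per_an_mb_s: float, num_ans: int) -> tuple:
    """Return (MB/s, TB/hr, PB/day), assuming linear scaling and decimal units."""
    mb_per_s = per_an_mb_s * num_ans
    tb_per_hr = mb_per_s * 3600 / 1_000_000
    pb_per_day = tb_per_hr * 24 / 1000
    return mb_per_s, tb_per_hr, pb_per_day


mb_s, tb_hr, pb_day = aggregate_throughput(500, 55)
print(f"{mb_s:,.0f} MB/s = {tb_hr:.0f} TB/hr = {pb_day:.2f} PB/day")
```

Under those assumptions the result is 99 TB/hr and roughly 2.4 PB/day, the same ballpark as the rounded figures quoted above.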

Let’s compare the HydraStor against the list of RAID deficiencies at the beginning of this blog post.

  • HydraStor can survive up to six simultaneous node or disk failures
  • Disk rebuilds do not create performance degradation
  • No volume design is needed, other than name and resiliency level
  • Obsolescent nodes are easily and automatically replaced

Not bad, huh?

[Update 9/10:  In addition to their NAS interface, they also support NetBackup’s OST.]

Challenges

Their biggest challenge is one of marketing and sales. They are the typical Japanese technology company that is driven by technology and struggles with marketing. I can’t tell you the lack of a warm fuzzy I feel when I hear “we don’t share revenue numbers, ever,” and “we have two references.” (And they’re the same ones they had a few years ago.) They’ve got to address this if they are to be taken seriously in the business.

Summary

A target dedupe system that can back up 2 PB in a day, and can survive up to six failures of any component.  I don’t know what else to say other than that.  Pretty impressive stuff.  Like I said, it’s the coolest product you’ve never heard of.

Written by W. Curtis Preston (@wcpreston), four-time O'Reilly author, and host of The Backup Wrap-up podcast. I am now the Technology Evangelist at Sullivan Strickler, which helps companies manage their legacy data.

9 comments
  • Way cool. Can we buy systems based on this technology for home yet? Can you use it to replace your Drobo — or mine?

  • 😮

    So who really needs to plan for (6) drive failures at the expense of R1-like overhead? Certainly seems far-fetched and better served by buying an additional, replicated unit for DR or even local recovery of an entire failed system.

    And while various types of data protection such as this are interesting, the use of these very high performance processors and technologies like ASICS which offload parity overhead handle this issue pretty handily.

    And other vendors do head replacements as newer and faster technology is introduced. Inline De-Dupe solutions rely on pure horsepower to increase and improve ingest speeds more than back-end disk or parity calculation concerns. And even without that ability, rolling from one VTL to another by introducing the new unit into the backup pool and restoring from the old until the data has expired (30-45 day retention not uncommon) is a fairly standard and simple process.

    I guess I’m failing to see this as anything revolutionary. Just me?

  • @Adam12

    I think you missed the point of the post.

    I said you COULD config it to survive 6 drive failures. Default is 3.

    Other vendors replace heads, but very rarely is this done non-disruptively. They require downtime and data migration; these guys require neither.