Does Undetectable Bit Error Rate Matter?

Magnetic devices make mistakes; the only question is how many mistakes they will make. I was presenting a slide yesterday that listed the UBERs for various magnetic media, and someone decided to take issue with it. He basically claimed that it was all bunk and that modern-day magnetic devices don’t make the kind of mistakes I was talking about. He said he had worked for several years at businesses that make disk drives, and that there was essentially no such thing as an undetectable bit error. I thought I’d throw out a few thoughts on this issue and see what others think.

He said that all modern devices (disk and tape) do read-after-write checks, and therefore they catch such errors. But my (albeit somewhat cursory) understanding of ECC technology is that the read-after-write check is not a block-for-block comparison. As I understand it, a CRC is calculated on the block before the write, the write is made, the block is read back, a CRC is calculated on what was read back, and if the two CRCs match, all is good. HOWEVER, since the CRC is so small (12-16 bits), there is a possibility that the block doesn’t match but the CRC does. The result is an undetected bit error. (This is my best attempt at understanding how ECC and UBERs work. If someone with a deep understanding of how it really works can explain it in plain English, I’m all eyes.)
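To make that concrete, here is a minimal Python sketch of a small CRC matching even though the data doesn't. It takes my 16-bit figure at face value and uses the CRC-CCITT routine from Python's standard library (binascii.crc_hqx) as a stand-in; I am not claiming that this is the check any real drive uses, and the block contents and the find_collision helper are made up for illustration. A 16-bit CRC has only 65,536 possible values, so if you try 65,537 different blocks, at least two of them must share a CRC.

```python
import binascii
import struct


def crc16(block: bytes) -> int:
    """16-bit CRC-CCITT of a block (a stand-in for a drive's check, not the real thing)."""
    return binascii.crc_hqx(block, 0)


def find_collision(prefix: bytes = b"sector-payload-"):
    """Return two different blocks of equal length that share the same 16-bit CRC.

    Only 2**16 CRC values exist, so 2**16 + 1 distinct blocks are
    guaranteed to contain at least one matching pair (pigeonhole principle).
    """
    seen = {}  # CRC value -> first block that produced it
    for counter in range(2**16 + 1):
        block = prefix + struct.pack(">I", counter)
        crc = crc16(block)
        if crc in seen:
            return seen[crc], block, crc
        seen[crc] = block
    raise AssertionError("unreachable: a collision must exist")


original, corrupted, shared_crc = find_collision()
print("blocks differ:", original != corrupted)
print("CRCs match:   ", crc16(original) == crc16(corrupted))
```

If "original" is what was written and "corrupted" is what came back, a check that compares only the CRCs would pass. For corruption that scrambles a block at random, the chance of slipping past a 16-bit check is roughly 1 in 65,536, which is small per block but not zero.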

There was a representative in the room from Exablox, and he previously worked at Data Domain.  He mentioned that both vendors do high-level checking that looks for bit errors that disk drives make, and that they had found such errors many times — which is why they do it.

Stephen Foskett has said that he thinks any modern disk array does such checking, and so the fact that disk drives have higher UBERs than tape drives is irrelevant. Any such errors would be caught by the higher-level checks performed by the array or filesystem. ZFS, for example, does such checking. But if all modern arrays do such checks, why do some vendors make sure they mention that THEY do such checking, suggesting that other vendors don’t?
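For what it's worth, the general idea behind a higher-level check can be sketched in a few lines of Python. This is only an illustration of the concept, not how ZFS or any particular array implements it: the ChecksummedStore class, its method names, and the choice of SHA-256 are all my own assumptions. The layer above the drive keeps its own, much larger checksum of every block and verifies it on every read, so it does not have to trust the drive's answer.

```python
import hashlib


class ChecksummedStore:
    """Toy block store that verifies its own checksum on every read (hypothetical)."""

    def __init__(self):
        self._blocks = {}     # block id -> data as held by the "drive"
        self._checksums = {}  # block id -> checksum kept by the layer above

    def write(self, block_id, data: bytes) -> None:
        self._blocks[block_id] = bytearray(data)
        self._checksums[block_id] = hashlib.sha256(data).digest()

    def read(self, block_id) -> bytes:
        data = bytes(self._blocks[block_id])
        if hashlib.sha256(data).digest() != self._checksums[block_id]:
            raise IOError(f"checksum mismatch on block {block_id!r}")
        return data

    def flip_bit(self, block_id, bit_index: int) -> None:
        """Simulate a silent bit error inside the drive."""
        self._blocks[block_id][bit_index // 8] ^= 1 << (bit_index % 8)


store = ChecksummedStore()
store.write("blk0", b"backup data that must survive")
store.flip_bit("blk0", 5)  # the drive returns bad data without reporting an error

try:
    store.read("blk0")
    detected = False
except IOError:
    detected = True
print("silent error caught by the higher-level check:", detected)
```

A real system also needs a good copy to repair from (a mirror, parity, or a replica); detection is only half the job.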

Unless someone can explain to me why I should, I definitely don’t agree with the person who made the comment at my show. If drives didn’t make these errors, they wouldn’t need to publish a UBER in the first place. I somewhat agree with Stephen, if we’re talking about arrays that do higher-level checks. But I don’t think all arrays do such checks, so I think that UBER still matters.

What do you think? 
