Magnetic devices make mistakes; the only question is how many mistakes they will make. I was presenting a slide a while ago that listed the Bit Error Rates (BERs) for various magnetic media, and someone decided to take issue with it. He basically claimed that it was all bunk and that modern-day magnetic devices don’t make mistakes like I was talking about. He said he had worked several years in businesses that make disk drives, and that there was essentially no such thing as a bit error. I thought I’d throw out a few other thoughts on this issue and see what others think.
Note: The first version of this post referred to Undetectable Bit Error rate vs just Bit Error Rate. The more common term is just BER. But when it’s UBER, the U apparently refers to unrecoverable, not undetectable.
He said that all modern devices (disk and tape) do read-after-write checks, and therefore they catch such errors. But my (albeit somewhat cursory) knowledge of ECC technology is that the read-after-write is not a block-for-block comparison. My basic understanding is that a CRC is calculated on the block before the write, the write is made, the block is read back, the CRC is calculated on what was read back, and if the two match, all is considered good. HOWEVER, since the CRC is so small (12–16 bits), there is a possibility that the block doesn’t match but the CRC does match. The result is an undetected bit error. (This is my best attempt at understanding how ECC and UBERs work. If someone else who has a deep understanding of how it really works can explain it in plain English, I’m all eyes.)
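To make the collision risk concrete, here’s a toy sketch in Python. It is not how a drive’s actual ECC works — real drives use much stronger codes — it just models a short 16-bit check value (by truncating CRC-32) and shows that two different blocks can share the same checksum, which is the failure mode described above:

```python
import zlib

def crc16(data: bytes) -> int:
    # Truncate CRC-32 to 16 bits to model a small per-block check value.
    # (Purely illustrative; real drive ECC is far more robust.)
    return zlib.crc32(data) & 0xFFFF

# With only 65,536 possible checksum values, distinct blocks are
# guaranteed to collide eventually (and typically within a few hundred
# tries, per the birthday paradox).
seen = {}
collision = None
for i in range(100_000):
    block = f"block payload #{i}".encode()
    c = crc16(block)
    if c in seen and seen[c] != block:
        collision = (seen[c], block)
        break
    seen[c] = block

a, b = collision
# Two different blocks, one checksum: a verify that compares only the
# checksum would pass even though the data differs.
print(a != b, crc16(a) == crc16(b))
```

The point isn’t that such collisions are common — they aren’t — but that a checksum-only verify can’t rule them out, which is exactly why a nonzero UBER gets published.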
There was a representative in the room from a target dedupe vendor, and he previously worked at another target dedupe vendor. He mentioned that both vendors do high-level checking that looks for bit errors that disk drives make, and that they had found such errors many times — which is why they do it.
I once heard Stephen Foskett (@sfoskett) say that he thinks that any modern disk array does such checking, and so the fact that some disk drives have higher UBERs than others (and all are higher than most tape drives) is irrelevant. Any such errors would be caught by the higher level checks performed by the array or filesystem.
For example, an object storage system (e.g. S3) can perform a high-level check on all objects to make sure that the various copies of the object do not change. If any of them show a change, it would be flagged via that check, and the corrupted object would be replaced. It’s a check on top of a check on top of a check. ZFS has similar checking.
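A minimal sketch of that kind of higher-level check might look like the following. This isn’t any vendor’s actual scrubber — the `scrub` function and its majority-vote repair are my own illustration — but it captures the idea: hash every replica of an object, and if one disagrees with the others, replace it from a known-good copy:

```python
import hashlib
from collections import Counter

def scrub(replicas):
    """Toy replica scrubber: checksum each copy, trust the majority,
    and repair any copy whose digest disagrees. (Illustrative only.)"""
    digests = [hashlib.sha256(r).hexdigest() for r in replicas]
    majority_digest, _ = Counter(digests).most_common(1)[0]
    good_copy = replicas[digests.index(majority_digest)]
    # Replace any replica whose digest doesn't match the majority.
    return [good_copy if d != majority_digest else r
            for d, r in zip(digests, replicas)]

good = b"original object data"
# Simulate a silent bit error in the middle replica.
copies = [good, b"original object dat\x00", good]
fixed = scrub(copies)
print(all(c == good for c in fixed))  # True
```

A cryptographic hash like SHA-256 makes an undetected mismatch between copies astronomically unlikely, which is why this layer catches errors that a drive’s own small check value can miss.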
But if all modern arrays do such checks, why do some vendors make sure they mention that THEY do such checking, suggesting that other vendors don’t do such checks? Unless someone can explain to me why I should, I definitely don’t agree with the idea that UBERs don’t matter. If drives didn’t make these errors, they wouldn’t need to publish a UBER in the first place. I somewhat agree with Stephen — if we’re talking about arrays or storage systems that do higher-level checks. But I don’t think all arrays do such checks. So I think that UBER still matters. What do you think?