Does Bit Error Rate Matter?

Magnetic devices make mistakes; the only question is how many mistakes they will make. I was presenting a slide a while ago that listed the Bit Error Rates (BERs) for various magnetic media, and someone decided to take issue with it.  He basically claimed that it was all bunk and that modern-day magnetic devices don’t make mistakes like the ones I was talking about.  He said he had worked for several years at businesses that make disk drives, and that there was essentially no such thing as a bit error.  I thought I’d throw out a few other thoughts on this issue and see what others think.

Note: The first version of this post referred to the Undetectable Bit Error Rate rather than just the Bit Error Rate.  The more common term is just BER.  But when it’s UBER, the U apparently refers to unrecoverable, not undetectable.

He said that all modern devices (disk and tape) do read-after-write checks, and that they therefore catch such errors.  But my (albeit somewhat cursory) knowledge of ECC technology is that the read after write is not a block-for-block comparison.  My basic understanding is that a CRC is calculated on the block before the write, the write is made, the block is read back, the CRC is calculated on what was read back, and if the two match, all is good.  HOWEVER, since the CRC is so small (12-16 bits), there is a possibility that the block doesn’t match but the CRC does.  The result is an undetected bit error.  (This is my best attempt at understanding how ECC and UBERs work.  If someone who has a deep understanding of how it really works can explain it in plain English, I’m all eyes.)
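
To make that concrete, here is a rough sketch in Python (my own toy illustration, not how any drive’s firmware actually works; real drives use far stronger codes than this).  A 16-bit check can only take on 65,536 possible values, so a randomly corrupted block has roughly a 1-in-65,536 chance of producing the same CRC as the original and sailing right past the check:

    import os
    import random

    def crc16_ccitt(data, crc=0xFFFF):
        """Bitwise CRC-16-CCITT (polynomial 0x1021). Slow but easy to follow."""
        for byte in data:
            crc ^= byte << 8
            for _ in range(8):
                crc = (crc << 1) ^ 0x1021 if crc & 0x8000 else crc << 1
                crc &= 0xFFFF
        return crc

    block = os.urandom(64)           # pretend this is the block we just wrote
    expected = crc16_ccitt(block)    # the check value computed before the write

    # Simulate random multi-byte corruption until a corrupted copy happens to
    # produce the same 16-bit CRC -- an error the check cannot see.
    # (Expect tens of thousands of tries; this takes a little while to run.)
    attempts = 0
    while True:
        attempts += 1
        corrupted = bytearray(block)
        for _ in range(4):           # scribble on four random bytes
            corrupted[random.randrange(len(corrupted))] = random.randrange(256)
        if bytes(corrupted) != block and crc16_ccitt(corrupted) == expected:
            print(f"Undetected corruption after {attempts:,} random corruptions")
            break

The point is just the pigeonhole principle: the check is far smaller than the data it protects, so some corruptions have to collide with it.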

There was a representative in the room from a target dedupe vendor who had previously worked at another target dedupe vendor.  He mentioned that both vendors do high-level checking that looks for the bit errors that disk drives make, and that they have found such errors many times, which is why they do it.

I once heard Stephen Foskett (@sfoskett) say that he thinks that any modern disk array does such checking, and so the fact that some disk drives have higher UBERs than others (and all are higher than most tape drives) is irrelevant.  Any such errors would be caught by the higher level checks performed by the array or filesystem.

For example, an object storage system (e.g. S3) can perform a high-level check on all objects to make sure that the various copies of an object do not change. If any of them shows a change, it is flagged by that check, and the corrupted copy is replaced.  It’s a check on top of a check on top of a check. ZFS has similar checking.
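
Here is a rough sketch of what that kind of higher-level check might look like (purely illustrative; this is not any vendor’s actual code): hash every copy of an object, compare the hashes, and repair any copy that disagrees with the majority.

    import hashlib

    def scrub(replicas):
        """Majority-vote scrub across the copies of one object (hypothetical sketch)."""
        digests = [hashlib.sha256(r).hexdigest() for r in replicas]
        good = max(set(digests), key=digests.count)      # the digest most copies agree on
        good_copy = replicas[digests.index(good)]
        repaired = []
        for replica, digest in zip(replicas, digests):
            if digest != good:
                print("corrupt copy detected; repairing from a good copy")
                replica = good_copy
            repaired.append(replica)
        return repaired

    # Three copies of an object; one has suffered a silent bit flip on disk.
    copies = [b"important data", b"important data", b"important dbta"]
    copies = scrub(copies)
    assert copies[0] == copies[1] == copies[2]

Whether a given array or filesystem actually runs a scrub like this, and how often, is exactly the question I’m raising.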

But if all modern arrays do such checks, why do some vendors make sure they mention that THEY do such checking, suggesting that other vendors don’t do such checks? 

Unless someone can explain to me why I should, I definitely don’t agree with the idea that UBERs don’t matter.  If drives didn’t make these errors, they wouldn’t need to publish a UBER in the first place.  I somewhat agree with Stephen — if we’re talking about arrays or storage systems that do higher-level checks.  But I don’t think all arrays do such checks.  So I think that UBER still matters. 

What do you think?

Written by W. Curtis Preston (@wcpreston), four-time O'Reilly author, and host of The Backup Wrap-up podcast. I am now the Technology Evangelist at Sullivan Strickler, which helps companies manage their legacy data.

20 comments
  • Undetectable means exactly that: you cannot detect it (at some layer of processing). What you have described is adding an extra layer of processing or comparison to the problem.

    We have several places where errors can creep in and flip a bit:

    1. Moving the data around in computer memory, such as from userspace to a system I/O buffer pool. I include here copying the data across a data bus (SCSI, SATA, FDDI, etc.) to a device’s built-in cache memory buffer. Every time you move the data, you risk a bus error causing a bit to flip.

    2. Holding the data in RAM. Even modern ECC RAM has an undetectable bit error rate; it’s just a lot lower than raw RAM because the error-correcting code adds redundancy bits that can correct n bits in error and detect n+1 bits in error. If more than n+1 bits experience errors, it is undetectable, and certainly uncorrectable. A stray cosmic ray can flip a bit. An electrostatic discharge can flip a bunch of bits. Lots of other things can go wrong to flip bits.

    3. Holding data in magnetic memory. I’m not an expert here, but we’ve all heard the stories of floor polishers working under the tape rack in the computer room. Other physical problems can occur, damaging the medium. Excessive heat past the Curie point can demagnetize magnets. A speck of dirt can cause a misread.

    4. I do not know much about the physical details of SSD memory, but it is essentially flash. ECC coding is used to reduce the bit error rate.

    For any information storage or communication system (such as data buses, networks, etc.) there is a certain probability of a bit error. Given the raw bit error rate of the medium, engineers use algebraic coding theory to design ECC schemes that add redundant bits and encode them in such a way as to reduce the probability of an uncorrectable bit error. Note that I said reduce, not eliminate.
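
    To put some numbers behind “correct n bits, detect n+1 bits,” here is a toy sketch (my own illustration, not what any particular DIMM or drive implements) of the classic Hamming(7,4) code plus an overall parity bit, the textbook SECDED construction: it corrects any single flipped bit and detects, but cannot correct, any two flipped bits.

        def encode(d1, d2, d3, d4):
            """Hamming(7,4) plus overall parity: 4 data bits become 8 stored bits."""
            p1 = d1 ^ d2 ^ d4
            p2 = d1 ^ d3 ^ d4
            p3 = d2 ^ d3 ^ d4
            code = [p1, p2, d1, p3, d2, d3, d4]           # codeword positions 1..7
            overall = 0
            for b in code:
                overall ^= b                              # extra parity bit over everything
            return code + [overall]

        def decode(word):
            code = list(word[:7])
            s1 = code[0] ^ code[2] ^ code[4] ^ code[6]    # covers positions 1,3,5,7
            s2 = code[1] ^ code[2] ^ code[5] ^ code[6]    # covers positions 2,3,6,7
            s3 = code[3] ^ code[4] ^ code[5] ^ code[6]    # covers positions 4,5,6,7
            syndrome = s1 + 2 * s2 + 4 * s3               # position of a single bad bit
            parity_ok = sum(word) % 2 == 0
            if syndrome and not parity_ok:
                code[syndrome - 1] ^= 1                   # single-bit error: corrected
            elif syndrome and parity_ok:
                raise ValueError("double-bit error: detected but not correctable")
            return [code[2], code[4], code[5], code[6]]   # the original data bits

        word = encode(1, 0, 1, 1)
        word[4] ^= 1                                  # one flipped bit
        assert decode(word) == [1, 0, 1, 1]           # silently corrected
        word[0] ^= 1                                  # now two bits are flipped
        try:
            decode(word)
        except ValueError as err:
            print(err)                                # detected, but the data is gone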

    Sometimes it is good enough to detect that an error occurred and request a retransmission. This works if you read a disk record and see there is an error. You can read it again and hope it was an intermittent error. If the second read is clean, you’re safe.

    Sometimes, you can’t request a retransmission. Consider a spacecraft deep in the solar system. It might take half an hour to get a message to it. You have to be able to correct an error up to some design-specified probability. The undetectable bit error rate is specified in the design of the system. Or consider if you had to re-send every memory word moved from one location to another: the whole computer memory would operate at half speed. This is why ECC RAM is used in systems where it matters, like servers.

    Basically, error-detecting codes, signature hashes, dedup hashes: these all fall under pretty much the same area of mathematics, finite fields. Claude Shannon started it all with his seminal paper on the bandwidth of a communications channel, where, incidentally, he coined the term “bit” for binary digit.

  • I’ve been running ZFS for several years on some laboratory Linux servers. These use good-quality, “enterprise-class” individual hard drives (e.g. Western Digital “Re” or Gold). ZFS frequently detects data write errors via its internal checksumming/scrub process. If data were always written correctly, or if an error were returned allowing for a rewrite, those corrupt writes could not happen.

    I suspect, as you do, that it’s a combination of a very rare event combined with lots and lots and lots of chances for that very rare event to happen.

    • It’s fascinating to hear about ZFS detecting those errors that often. I do wonder about the ones that are NOT detected.

  • Curtis, the disk drive guy was wrong. The problem is that after bits are written – and this is true of flash as well – they can flip. Then what? The ECC, depending on how deep it is, will catch the majority of such errors and correct them, but it won’t catch all of them. And you won’t know that until you attempt to read the data and – oops! – can’t.

    There’s a reason GFS keeps 3 copies, and it’s not only performance, or drive failure. Bit rot is real!

    • Hey, Robin! Long time no hear!

      You’re talking about bit rot, which can also happen.

      I’m talking about an error on the initial write. The more data we have, the more of a problem it can be.

  • One thing going on here is that disk drives have failure modes that tape drives probably don’t, like correctly writing information to a wrong location on the disk.

    Another is that disk sectors are tiny, 512 to 4096 bytes, so they can’t afford really good error checks. Per Wikipedia and common sense, starting with LTO-3 and LTO-4, tape block sizes are 1,616,940 bytes, and the drives use “strong” forward error correction coding “that makes data recovery possible when lost data is within one track.” That is very affordable with such massive tape block sizes and with 21st-century tape systems that can fairly easily compute this sort of thing at speed.

    Still, there are so many things that can go wrong in so many places, even if you assume the tape drive is perfect, that end-to-end error checking is indeed a must.

    • I’m just wondering about how much error checking is missing, whether we are talking disk or tape.

  • A parity check is the simplest kind of hash function, and as we all know a parity check can be fooled by flipping two bits instead of one. More complex hashes, and combinations of hash functions, can achieve more certainty but also cost more processing.
    Error checking for the highest certainty at the lowest processing cost is an area of advanced discrete mathematics. So it’s not so much “we do it” or “we don’t do it”; rather, I would expect the quality of error checking to be a significant brand differentiator, albeit one that is too arcane to appear on most commercial procurement checklists for storage and backup solutions.
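
    For example (a toy sketch, not any product’s actual check), a single parity bit catches any odd number of flipped bits, but an even number of flips passes straight through:

        def parity(data):
            """XOR of all the bits: 0 if the count of 1-bits is even, 1 if odd."""
            p = 0
            for byte in data:
                p ^= bin(byte).count("1") & 1
            return p

        original = b"hello"
        stored_parity = parity(original)

        one_flip = bytes([original[0] ^ 0x01]) + original[1:]    # flip one bit
        two_flips = bytes([original[0] ^ 0x03]) + original[1:]   # flip two bits

        print(parity(one_flip) == stored_parity)    # False: error detected
        print(parity(two_flips) == stored_parity)   # True: error slips through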

  • I’m pretty sure that UBER is “undetected” rather than “undetectable.” That might seem like I’m being pedantic, but it’s an important distinction, because it means that on subsequent access of the data you’ve got a good chance to fix it up if you have some additional lossless redundancy via some form of erasure coding (including parity RAID).

    Most information storage and transmission systems have a REALLY low rate of undetected errors that aren’t fixed by things like ECC, but they do happen, and a lot of “cosmic ray” incidents are probably really due to stuff like that, where the error results in a data structure inconsistency that is large enough to make something bork.

    Additional layers of resiliency to check for this kind of thing are quite commonplace in enterprise-class gear, where the levels of data integrity paranoia border on OCD .. for example, ONTAP includes not only the SCSI T10 Data Integrity Field, it also combines that with something called lost write detection (drives sometimes lie about exactly where they’ve written a series of sectors, so when you read a 4K block all the ECC lines up nicely even though the data is stale or misplaced), so it’s nice to have another layer of protection to make sure that what you’ve read is what’s actually meant to be there, and then on top of that it does dual or triple parity checks against every read to make sure that everything lines up. ZFS also has some similar kinds of end-to-end checking, as do Oracle and some other databases.
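
    As a purely hypothetical sketch of the lost-write idea (nothing like ONTAP’s actual implementation), the higher layer can keep its own small integrity record per block: the address where the block is supposed to live, a write generation number, and a content checksum. Stale or misplaced data then fails the check even when the drive’s own ECC is perfectly happy:

        import hashlib
        from dataclasses import dataclass

        @dataclass
        class BlockRecord:
            lba: int        # where the filesystem believes this block lives
            version: int    # bumped on every overwrite, to catch stale (lost) writes
            checksum: str   # hash of the block contents

        def make_record(lba, version, data):
            return BlockRecord(lba, version, hashlib.sha256(data).hexdigest())

        def verify(expected, lba_read, version_read, data):
            # All three must match: content, location, and write generation.
            return (hashlib.sha256(data).hexdigest() == expected.checksum
                    and lba_read == expected.lba
                    and version_read == expected.version)

        # A drive that acknowledged a write but never performed it returns the old
        # data; the sector's own ECC is fine, but the generation number is not.
        record = make_record(lba=42, version=7, data=b"new contents")
        print(verify(record, lba_read=42, version_read=6, data=b"old contents"))  # False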

    Statistically, even with all these extra checks it’s still possible for the wrong data to get served up, but using those techniques pushes the number of undetected errors down to such an incredibly tiny number that I suspect you may end up down in quantum-mechanical kinds of uncertainty .. and down at that level, if the gods decide to play dice with your data, you’re better off with prayer than technology.

    • Whether it’s undetected or undetectable, I think the concern is that we need additional levels of checking beyond ECC.

      Your last phrase is awesome “you’re better off with prayer than tech.” Love it.

  • Sorry, I forgot .. there’s a pretty good plain-English explanation of how ECC works to correct mistakes (especially things like undetected bit flips) during transmission and storage of data here:

    https://www.codethink.co.uk/articles/2017/error-correcting-erasure-codes/

    The trick is to have enough additional data with the payload to not only detect but also repair the missing data … object stores like S3 do that either via triple mirroring (all three copies are read; if one copy is different from the other copies it gets ignored and repaired), or via something akin to the Reed-Solomon erasure codes that underpin most people’s RAID-6 implementations. That comes at a cost to capacity, but with a reasonably sized payload (anything over about 8KiB) the additional costs are IMHO negligible and worth it ..

    One thing to be wary of, though, is the difference between an ECC field and a checksum. A checksum, such as the T10 DIF aka DIX (https://access.redhat.com/solutions/41548), can detect on read something that was originally written incorrectly, but it doesn’t contain enough information to reconstruct the data on the fly. If you depend only on checksums, then you’d best make sure you have a good backup and that you test your backups regularly, or trust in the capricious deities of IT and hope that the advertised UBER on your expensive device is indeed as low as you hope it is.
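
    To make that difference concrete (a toy sketch under simple assumptions, nothing like a real Reed-Solomon or RAID-6 implementation): a single XOR parity chunk is the simplest possible erasure code, enough to rebuild one damaged chunk, whereas a checksum alone can only tell you that a chunk is bad.

        import hashlib

        def xor_parity(chunks):
            """One XOR parity chunk over equal-sized data chunks: the simplest erasure code."""
            out = bytearray(len(chunks[0]))
            for chunk in chunks:
                for i, b in enumerate(chunk):
                    out[i] ^= b
            return bytes(out)

        data = [b"AAAA", b"BBBB", b"CCCC"]
        parity = xor_parity(data)
        checksums = [hashlib.sha256(c).hexdigest() for c in data]

        damaged = b"BxBB"                                            # chunk 1 comes back corrupted
        assert hashlib.sha256(damaged).hexdigest() != checksums[1]   # the checksum says "it's bad"
        rebuilt = xor_parity([data[0], data[2], parity])             # the erasure code says "here it is"
        assert rebuilt == b"BBBB"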

    • That’s actually a really good blog post that I hadn’t read before. Thanks! I may write a post just on that.

  • Surely if a manufacturer publishes a figure labelled Undetectable Bit Error Rate, that must mean that errors can exist, but that the drive does NOT detect them!

    The only trouble with the claim that software can detect them is that a) the software vendors don’t make the same kinds of claims, and b) if they can detect these undetectable errors, why don’t we see any evidence of these detections?

    • Actually, I think higher-level detections and repairs happen all the time. They’re only undetectable at the write level. Something looking at a higher meta level might be able to catch an error when it’s combined with something else.

  • All the known UBER values for magnetic and flash media are Uncorrectable BER, not Undetectable BER. The former means how often the ECC failed to correct for various reasons; the latter is not something I’ve ever seen reported by anyone.

    • A lot of resources just say BER. It does appear that it should be unrecoverable, not undetectable. I’ve reached out to my favorite resource on subjects like this for clarification.

  • Curtis, all decent (enterprise) platforms will do read-after-write consistency to ensure data was written correctly. AWS does it on S3 objects too – https://docs.aws.amazon.com/AmazonS3/latest/dev/Introduction.html

    It would be crazy to assume that HDDs are 100% accurate. Yes, the chances of an unrecoverable bit error are tiny, but they are still possible; otherwise vendors wouldn’t quote figures. I would imagine that if we go back in the annals of time, we can find data showing that older HDDs were many times less reliable than today’s. I would suggest that we’ve improved reliability over time by more orders of magnitude than capacities have increased, so the perception is that UBER is no longer a problem.

    Incidentally, remember the cosmic ray discussion by Intel at SFD9? (http://techfieldday.com/appearance/intel-storage-presents-at-storage-field-day-9/) Changed bits on an SSD can cause the firmware to brick the device. Which is better: having another layer of validation, or bricking the hardware for a single bit failure? ;-)