As you learned in the chapter about RAID, disks are not perfect and can fail (on occasion). In early RAID systems, the model of failure was quite simple: either the entire disk is working, or it fails completely, and the detection of such a failure is straightforward. This fail-stop model of disk failure makes building RAID relatively simple. (See “Implementing Fault-Tolerant Services Using The State Machine Approach: A Tutorial” by Fred B. Schneider, ACM Surveys, Vol. 22, No. 4, December 1990: how to build fault-tolerant services, and a must-read for those building distributed systems.)
What you didn’t learn about is all of the other failure modes that modern disks exhibit. Specifically, as Bairavasundaram et al. studied in great detail in two papers, modern disks will occasionally seem to be mostly working but have trouble successfully accessing one or more blocks. In particular, two types of single-block failures are common and worthy of consideration: latent-sector errors (LSEs) and block corruption. We’ll now discuss each in more detail. (The two studies are: 1- “An Analysis of Latent Sector Errors in Disk Drives” by L. Bairavasundaram, G. Goodson, S. Pasupathy, J. Schindler, SIGMETRICS ’07, San Diego, CA. The first paper to study latent sector errors in detail. The paper also won the Kenneth C. Sevcik Outstanding Student Paper award, named after a brilliant researcher and wonderful guy who passed away too soon. To show the OSTEP authors it was possible to move from the U.S. to Canada, Ken once sang us the Canadian national anthem, standing up in the middle of a restaurant to do so. We chose the U.S., but got this memory. 2- “An Analysis of Data Corruption in the Storage Stack” by Lakshmi N. Bairavasundaram, Garth R. Goodson, Bianca Schroeder, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, FAST ’08, San Jose, CA, February 2008. The first paper to truly study disk corruption in great detail, focusing on how often such corruption occurs over three years for over 1.5 million drives.)
Latent-sector errors (LSEs)
LSEs arise when a disk sector (or group of sectors) has been damaged in some way. For example, if the disk head touches the surface for some reason (a head crash, something that shouldn’t happen during normal operation), it may damage the surface, making the bits unreadable. Cosmic rays can also flip bits, leading to incorrect contents. Fortunately, the drive uses in-disk error-correcting codes (ECC) to determine whether the on-disk bits in a block are good and, in some cases, to fix them. If the bits are not good, and the drive does not have enough information to fix the error, the disk will return an error when a request is issued to read them.
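The detect, repair, or report-an-error behavior described above can be sketched with a toy code. What follows is only an illustration, assuming a SECDED (single-error-correcting, double-error-detecting) extended Hamming code over 4 data bits. Real drives use much stronger codes over whole sectors, and the names `encode`, `decode`, and `LatentSectorError` are hypothetical, not any drive's actual interface.

```python
from functools import reduce
from operator import xor


class LatentSectorError(OSError):
    """Hypothetical error: the toy ECC found damage it cannot repair."""


def encode(nibble):
    """Encode 4 data bits into an 8-bit extended Hamming codeword.

    Index 0 holds an overall parity bit; indices 1..7 are the classic
    Hamming(7,4) positions (parity at 1, 2, 4; data at 3, 5, 6, 7).
    """
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]  # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]  # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]  # covers positions 4, 5, 6, 7
    code[0] = reduce(xor, code[1:])        # overall parity of the other 7 bits
    return code


def decode(code):
    """Return the 4 data bits, repairing one flipped bit if needed.

    Raises LatentSectorError when two bits are flipped: the damage is
    detected, but there is not enough information to repair it.
    """
    code = list(code)
    # XOR of the positions of all set bits; zero for a valid codeword.
    syndrome = reduce(xor, (i for i in range(1, 8) if code[i]), 0)
    overall = reduce(xor, code)
    if overall == 1:
        # Odd parity: assume exactly one flipped bit, located by the
        # syndrome (a syndrome of 0 points at the overall parity bit).
        code[syndrome] ^= 1
    elif syndrome != 0:
        # Even parity but a nonzero syndrome: two flipped bits.
        raise LatentSectorError("uncorrectable error in block")
    d = [code[3], code[5], code[6], code[7]]
    return sum(bit << i for i, bit in enumerate(d))
```

With one flipped bit, `decode` quietly returns the original data; with two, it raises an error instead of returning bad bits, which is the visible, non-silent behavior of an LSE. This toy code assumes at most two flipped bits per codeword; heavier damage can be miscorrected.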
Block corruption
There are also cases where a disk block becomes corrupt in a way not detectable by the disk itself. For example, buggy disk firmware may write a block to the wrong location. In such a case, the disk ECC indicates the block contents are fine, but from the client’s perspective the wrong block is returned when subsequently accessed. Similarly, a block may get corrupted when it is transferred from the host to the disk across a faulty bus; the resulting corrupt data is stored by the disk, but it is not what the client desires. These types of faults are particularly insidious because they are silent faults. The disk gives no indication of the problem when returning the faulty data.
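The misdirected-write case can be made concrete with a small simulation. This is a sketch under stated assumptions: `MisdirectingDisk` and its `misdirect` table are hypothetical names invented for illustration, and the "firmware bug" is modeled simply as a mapping from the intended block number to the block actually written.

```python
class MisdirectingDisk:
    """Toy disk whose (hypothetical) firmware bug sends some writes
    to the wrong block. Reads never fail: every stored block is
    internally consistent, so a per-block ECC check would pass."""

    def __init__(self, num_blocks, misdirect=None):
        self.blocks = [b""] * num_blocks
        # intended block number -> block number actually written
        self.misdirect = dict(misdirect or {})

    def write(self, n, data):
        target = self.misdirect.get(n, n)  # the bug: wrong location
        self.blocks[target] = bytes(data)

    def read(self, n):
        # No error is raised; the disk cannot tell that the contents
        # are not what the client last wrote to block n.
        return self.blocks[n]


disk = MisdirectingDisk(4, misdirect={2: 3})
disk.write(3, b"old-3")
disk.write(2, b"new-2")      # lands in block 3 instead of block 2

stale = disk.read(2)         # never received the write
clobbered = disk.read(3)     # overwritten with block 2's data
```

Both reads succeed without any error, yet neither returns what the client wrote to that block: block 2 is stale and block 3 has been overwritten. Nothing in the disk's own view of each block reveals the problem.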
Prabhakaran et al. describe this more modern view of disk failure as the fail-partial disk failure model. (See “IRON File Systems” by V. Prabhakaran, L. Bairavasundaram, N. Agrawal, H. Gunawi, A. Arpaci-Dusseau, R. Arpaci-Dusseau, SOSP ’05, Brighton, England. Our paper on how disks have partial failure modes, and a detailed study of how modern file systems react to such failures. As it turns out, rather poorly! We found numerous bugs, design flaws, and other oddities in this work. Some of this has fed back into the Linux community, thus improving file system reliability. You’re welcome!) In this view, disks can still fail in their entirety (as was the case in the traditional fail-stop model). However, disks can also seemingly be working and have one or more blocks become inaccessible (i.e., LSEs) or hold the wrong contents (i.e., corruption). Thus, when accessing a seemingly-working disk, once in a while it may return an error when trying to read or write a given block (a non-silent partial fault), and once in a while it may simply return the wrong data (a silent partial fault).
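One way to picture the fail-partial model is as a disk interface with three distinct failure behaviors. The sketch below is illustrative only: `FailPartialDisk`, `DiskError`, and the `lse` and `corrupt` sets are hypothetical names, and faults are injected by hand rather than drawn from any measured distribution.

```python
class DiskError(OSError):
    """Hypothetical error type returned by the toy disk."""


class FailPartialDisk:
    """Toy model of a fail-partial disk.

    failed  -- whole-disk failure (the traditional fail-stop case)
    lse     -- blocks that return an error (non-silent partial fault)
    corrupt -- blocks that return wrong data (silent partial fault)
    """

    def __init__(self, num_blocks, block_size=16):
        self.blocks = [bytes(block_size) for _ in range(num_blocks)]
        self.failed = False
        self.lse = set()
        self.corrupt = set()

    def write(self, n, data):
        if self.failed:
            raise DiskError("disk has failed")
        self.blocks[n] = bytes(data)

    def read(self, n):
        if self.failed:
            raise DiskError("disk has failed")
        if n in self.lse:
            raise DiskError(f"latent sector error at block {n}")
        data = self.blocks[n]
        if n in self.corrupt:
            # Wrong contents are returned with no error at all.
            return bytes(b ^ 0xFF for b in data)
        return data


disk = FailPartialDisk(num_blocks=8)
disk.write(1, b"hello")
disk.write(2, b"world")
disk.lse.add(1)        # block 1 becomes unreadable
disk.corrupt.add(2)    # block 2 silently goes bad

try:
    disk.read(1)
    lse_reported = False
except DiskError:
    lse_reported = True

silently_wrong = disk.read(2) != b"world"
```

A client of this interface can catch and react to the error from block 1, but the read of block 2 looks like a success. That asymmetry is why the silent case needs help from outside the disk to be noticed at all.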
Both of these types of faults are somewhat rare, but just how rare? The figure below summarizes some of the findings from the two Bairavasundaram studies (the SIGMETRICS ’07 study of latent sector errors and the FAST ’08 study of data corruption).