* PATA/SATA Disk Reliability paper @ 2007-02-18 18:50 Richard Scobie 2007-02-19 11:26 ` Al Boldi 2007-02-20 3:03 ` H. Peter Anvin 0 siblings, 2 replies; 22+ messages in thread From: Richard Scobie @ 2007-02-18 18:50 UTC (permalink / raw) To: linux-raid Thought this paper may be of interest. A study done by Google on over 100,000 drives they have/had in service. http://labs.google.com/papers/disk_failures.pdf Regards, Richard ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: PATA/SATA Disk Reliability paper 2007-02-18 18:50 PATA/SATA Disk Reliability paper Richard Scobie @ 2007-02-19 11:26 ` Al Boldi 2007-02-19 21:42 ` Eyal Lebedinsky 2007-02-26 14:15 ` Mario 'BitKoenig' Holbe 2007-02-20 3:03 ` H. Peter Anvin 1 sibling, 2 replies; 22+ messages in thread From: Al Boldi @ 2007-02-19 11:26 UTC (permalink / raw) To: linux-raid Richard Scobie wrote: > Thought this paper may be of interest. A study done by Google on over > 100,000 drives they have/had in service. > > http://labs.google.com/papers/disk_failures.pdf Interesting link. They seem to point out that SMART does not necessarily warn of pending failure. This is probably worse than not having SMART at all, as it gives you the illusion of safety. If there is one thing to watch out for, it is "dew". I remember video machines sensing for dew, so do any drives sense for "dew"? Thanks! -- Al ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: PATA/SATA Disk Reliability paper 2007-02-19 11:26 ` Al Boldi @ 2007-02-19 21:42 ` Eyal Lebedinsky 2007-02-20 12:15 ` Al Boldi 2007-02-26 14:15 ` Mario 'BitKoenig' Holbe 1 sibling, 1 reply; 22+ messages in thread From: Eyal Lebedinsky @ 2007-02-19 21:42 UTC (permalink / raw) To: Al Boldi; +Cc: linux-raid Disks are sealed, and a desiccant is present in each to keep humidity down. If you ever open a disk drive (e.g. for the magnets, or the mirror quality platters, or for fun) then you can see the desiccant sachet. cheers Al Boldi wrote: > Richard Scobie wrote: > >>Thought this paper may be of interest. A study done by Google on over >>100,000 drives they have/had in service. >> >>http://labs.google.com/papers/disk_failures.pdf > > > Interesting link. They seem to point out that SMART does not necessarily warn of > pending failure. This is probably worse than not having SMART at all, as it > gives you the illusion of safety. > > If there is one thing to watch out for, it is "dew". > > I remember video machines sensing for dew, so do any drives sense for "dew"? > > > Thanks! > > -- > Al -- Eyal Lebedinsky (eyal@eyal.emu.id.au) <http://samba.org/eyal/> attach .zip as .dat ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: PATA/SATA Disk Reliability paper 2007-02-19 21:42 ` Eyal Lebedinsky @ 2007-02-20 12:15 ` Al Boldi 2007-02-22 22:27 ` Nix 0 siblings, 1 reply; 22+ messages in thread From: Al Boldi @ 2007-02-20 12:15 UTC (permalink / raw) To: Eyal Lebedinsky; +Cc: linux-raid Eyal Lebedinsky wrote: > Disks are sealed, and a desiccant is present in each to keep humidity > down. If you ever open a disk drive (e.g. for the magnets, or the mirror > quality platters, or for fun) then you can see the desiccant sachet. Actually, they aren't sealed 100%. On WDs at least, there is a hole with a warning printed on its side: DO NOT COVER HOLE BELOW V V V V o In contrast, older models from the last century don't have that hole. > Al Boldi wrote: > > > > If there is one thing to watch out for, it is "dew". > > > > I remember video machines sensing for dew, so do any drives sense for > > "dew"? Thanks! -- Al ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: PATA/SATA Disk Reliability paper 2007-02-20 12:15 ` Al Boldi @ 2007-02-22 22:27 ` Nix 2007-02-22 22:30 ` Nix 2007-02-22 23:30 ` Stephen C Woods 0 siblings, 2 replies; 22+ messages in thread From: Nix @ 2007-02-22 22:27 UTC (permalink / raw) To: Al Boldi; +Cc: Eyal Lebedinsky, linux-raid On 20 Feb 2007, Al Boldi outgrape: > Eyal Lebedinsky wrote: >> Disks are sealed, and a desiccant is present in each to keep humidity >> down. If you ever open a disk drive (e.g. for the magnets, or the mirror >> quality platters, or for fun) then you can see the desiccant sachet. > > Actually, they aren't sealed 100%. I'd certainly hope not, unless you like the sound of imploding drives when you carry one up a mountain. > On WDs at least, there is a hole with a warning printed on its side: > > DO NOT COVER HOLE BELOW > V V V V > > o I suspect that's for air-pressure equalization. > In contrast, older models from the last century don't have that hole. It was my understanding that disks have had some way of equalizing pressure with their surroundings for many years, but I haven't verified this so you may well be right that this is a recent thing. (Anyone know for sure?) -- `In the future, company names will be a 32-character hex string.' --- Bruce Schneier on the shortage of company names ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: PATA/SATA Disk Reliability paper 2007-02-22 22:27 ` Nix @ 2007-02-22 22:30 ` Nix 2007-02-22 23:30 ` Stephen C Woods 1 sibling, 0 replies; 22+ messages in thread From: Nix @ 2007-02-22 22:30 UTC (permalink / raw) To: Al Boldi; +Cc: Eyal Lebedinsky, linux-raid On 22 Feb 2007, nix@esperi.org.uk uttered the following: > On 20 Feb 2007, Al Boldi outgrape: >> Eyal Lebedinsky wrote: >>> Disks are sealed, and a desiccant is present in each to keep humidity >>> down. If you ever open a disk drive (e.g. for the magnets, or the mirror >>> quality platters, or for fun) then you can see the desiccant sachet. >> >> Actually, they aren't sealed 100%. > > I'd certainly hope not, unless you like the sound of imploding drives > when you carry one up a mountain. Or even exploding drives. (Oops.) -- `In the future, company names will be a 32-character hex string.' --- Bruce Schneier on the shortage of company names ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: PATA/SATA Disk Reliability paper 2007-02-22 22:27 ` Nix 2007-02-22 22:30 ` Nix @ 2007-02-22 23:30 ` Stephen C Woods 2007-02-23 18:22 ` Al Boldi 2007-02-27 19:06 ` Bill Davidsen 1 sibling, 2 replies; 22+ messages in thread From: Stephen C Woods @ 2007-02-22 23:30 UTC (permalink / raw) To: Nix; +Cc: Al Boldi, Eyal Lebedinsky, linux-raid As he leans on his cane, the old codger says.... Well, disks used to come in open canisters; that is, you took the bottom cover off, put the whole pack into the drive, and then unscrewed the top cover and took it out. Clearly ventilated. C. 1975. Later we got sealed drives, Kennedy 180 MB Winchesters they were called (they used IBM 3030 technology). They had a vent pipe with two filters; you replaced the outer one every 90 days (as part of the PM process). The inner one you didn't touch. Apparently they figured that it'd be a long time before the inner one got really clogged at 10 min exposure every 90 days. C. 1980. Still later we had a mainframe running Un*x; it used IBM 3080 drives. These had huge HDA boxes that were sealed but had vent filters that had to be changed every PM (30 days, 2 hours of down time to do them all). C. 1985. So drives do need to be ventilated, not so much for fear of exploding, but rather because of subtle distortion of the case as the atmospheric pressure changed. Does anyone remember that you had to let your drives acclimate to your machine room for a day or so before you used them? Ah, the good old days... HUH??? <scw> On Thu, Feb 22, 2007 at 10:27:43PM +0000, Nix wrote: > On 20 Feb 2007, Al Boldi outgrape: > > Eyal Lebedinsky wrote: > >> Disks are sealed, and a desiccant is present in each to keep humidity > >> down. If you ever open a disk drive (e.g. for the magnets, or the mirror > >> quality platters, or for fun) then you can see the desiccant sachet. > > > > Actually, they aren't sealed 100%. > > I'd certainly hope not, unless you like the sound of imploding drives > when you carry one up a mountain. 
> > > On WDs at least, there is a hole with a warning printed on its side: > > > > DO NOT COVER HOLE BELOW > > V V V V > > > > o > > I suspect that's for air-pressure equalization. > > > In contrast, older models from the last century don't have that hole. > > It was my understanding that disks have had some way of equalizing > pressure with their surroundings for many years; but I haven't verified > this so you may well be right that this is a recent thing. (Anyone know > for sure?) > > -- > `In the future, company names will be a 32-character hex string.' > --- Bruce Schneier on the shortage of company names -- ----- Stephen C. Woods; UCLA SEASnet; 2567 Boelter Hall; LA CA 90095; (310)-825-8614 Unless otherwise noted these statements are my own, not those of the University of California. Internet mail: scw@seas.ucla.edu ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: PATA/SATA Disk Reliability paper 2007-02-22 23:30 ` Stephen C Woods @ 2007-02-23 18:22 ` Al Boldi 2007-02-24 22:27 ` Mark Hahn 2007-02-27 19:06 ` Bill Davidsen 1 sibling, 1 reply; 22+ messages in thread From: Al Boldi @ 2007-02-23 18:22 UTC (permalink / raw) To: Stephen C Woods, Nix; +Cc: Eyal Lebedinsky, linux-raid Stephen C Woods wrote: > So drives do need to be ventilated, not so much for fear of exploding, > but rather because of subtle distortion of the case as the atmospheric > pressure changed. I have a '94 Caviar without any apparent holes; and as a bonus, the drive still works. In contrast, ever since these holes appeared, drive failures became the norm. > Does anyone remember that you had to let your drives acclimate to your > machine room for a day or so before you used them? The problem is, that's not enough; the room temperature/humidity has to be controlled too. In a desktop environment, that's not really feasible. Thanks! -- Al ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: PATA/SATA Disk Reliability paper 2007-02-23 18:22 ` Al Boldi @ 2007-02-24 22:27 ` Mark Hahn 2007-02-25 11:22 ` Al Boldi 2007-02-25 19:02 ` Richard Scobie 0 siblings, 2 replies; 22+ messages in thread From: Mark Hahn @ 2007-02-24 22:27 UTC (permalink / raw) To: Al Boldi; +Cc: linux-raid > In contrast, ever since these holes appeared, drive failures became the norm. wow, great conspiracy theory! maybe the hole is plugged at the factory with a substance which evaporates at 1/warranty-period ;) seriously, isn't it easy to imagine a bladder-like arrangement that permits equilibration without net flow? disk spec-sheets do limit this - I checked the seagate 7200.10: 10k feet operating, 40k max. amusingly -200 feet is the min either way... >> Does anyone remember that you had to let your drives acclimate to your >> machine room for a day or so before you used them? > > The problem is, that's not enough; the room temperature/humidity has to be > controlled too. In a desktop environment, that's not really feasible. 5-90% humidity, operating, 95% non-op, and 30%/hour. seems pretty easy to me. in fact, I frequently ask people to justify the assumption that a good machineroom needs tight control over humidity. (assuming, like most machinerooms, you aren't frequently handling the innards.) ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: PATA/SATA Disk Reliability paper 2007-02-24 22:27 ` Mark Hahn @ 2007-02-25 11:22 ` Al Boldi 2007-02-25 17:40 ` Mark Hahn 2007-02-25 19:02 ` Richard Scobie 1 sibling, 1 reply; 22+ messages in thread From: Al Boldi @ 2007-02-25 11:22 UTC (permalink / raw) To: Mark Hahn; +Cc: linux-raid Mark Hahn wrote: > > In contrast, ever since these holes appeared, drive failures became the > > norm. > > wow, great conspiracy theory! I think you misunderstand. I just meant plain old-fashioned mis-engineering. > maybe the hole is plugged at > the factory with a substance which evaporates at 1/warranty-period ;) Actually it's plugged with a thin paper-like filter, which does not seem to evaporate easily. And it's got nothing to do with warranty, although if you get lucky and the failure happens within the warranty period, you can probably demand a replacement drive to make you feel better. But remember, the Google report mentions a great number of drives failing for no apparent reason, not even a SMART warning, so failing within the warranty period is just pure luck. > seriously, isn't it easy to imagine a bladder-like arrangement that > permits equilibration without net flow? disk spec-sheets do limit > this - I checked the seagate 7200.10: 10k feet operating, 40k max. > amusingly -200 feet is the min either way... Well, it looks like filtered net flow on WDs. What's it look like on Seagate? > >> Does anyone remember that you had to let your drives acclimate to > >> your machine room for a day or so before you used them? > > > > The problem is, that's not enough; the room temperature/humidity has to > > be controlled too. In a desktop environment, that's not really > > feasible. > > 5-90% humidity, operating, 95% non-op, and 30%/hour. seems pretty easy > to me. in fact, I frequently ask people to justify the assumption that > a good machineroom needs tight control over humidity. (assuming, like > most machinerooms, you aren't frequently handling the innards.) 
I agree, but reality has a different opinion, and it may take down that drive, specs or no specs. A good way to deal with reality is to find the real reasons for failure. Once these reasons are known, engineering quality drives becomes, thank GOD, really rather easy. Thanks! -- Al ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: PATA/SATA Disk Reliability paper 2007-02-25 11:22 ` Al Boldi @ 2007-02-25 17:40 ` Mark Hahn [not found] ` <200702252057.22963.a1426z@gawab.com> 2007-02-27 19:21 ` Bill Davidsen 0 siblings, 2 replies; 22+ messages in thread From: Mark Hahn @ 2007-02-25 17:40 UTC (permalink / raw) To: Al Boldi; +Cc: linux-raid >>> In contrast, ever since these holes appeared, drive failures became the >>> norm. >> >> wow, great conspiracy theory! > > I think you misunderstand. I just meant plain old-fashioned mis-engineering. I should have added a smilie. but I find it dubious that the whole industry would have made a major bungle if so many failures are due to the hole... > But remember, the Google report mentions a great number of drives failing for > no apparent reason, not even a SMART warning, so failing within the warranty > period is just pure luck. are we reading the same report? I look at it and see: - lowest failures from medium-utilization drives, 30-35C. - higher failures from young drives in general, but especially if cold or used hard. - higher failures from end-of-life drives, especially > 40C. - scan errors, realloc counts, offline realloc and probation counts are all significant in drives which fail. the paper seems unnecessarily gloomy about these results. to me, they're quite exciting, and provide good reason to pay a lot of attention to these factors. I hate to criticize such a valuable paper, but I think they've missed a lot by not considering the results in a fully factorial analysis as most medical/behavioral/social studies do. for instance, they bemoan a 56% false negative rate from only SMART signals, and mention that if >40C is added, the FN rate falls to 36%. also incorporating the low-young risk factor would help. I would guess that a full-on model, especially if it incorporated utilization, age, and performance, could bring the false-negative rate down to comfortable levels. >>> The problem is, that's not enough; the room temperature/humidity has to >>> be controlled too. In a desktop environment, that's not really >>> feasible. >> >> 5-90% humidity, operating, 95% non-op, and 30%/hour. seems pretty easy >> to me. in fact, I frequently ask people to justify the assumption that >> a good machineroom needs tight control over humidity. (assuming, like >> most machinerooms, you aren't frequently handling the innards.) > > I agree, but reality has a different opinion, and it may take down that > drive, specs or no specs. why do you say this? I have my machineroom set for 35% (which appears to be its "natural" point, with a wide 20% margin on either side). I don't really want to waste cooling capacity on dehumidification, for instance, unless there's a good reason. > A good way to deal with reality is to find the real reasons for failure. > Once these reasons are known, engineering quality drives becomes, thank GOD, > really rather easy. that would be great, but depends rather much on there being a relatively small number of variables, which are manifest, not hidden. there are billions of studies (in medical/behavioral/social fields) which assume large numbers of more or less hidden variables, and which still manage good success... regards, mark hahn. ^ permalink raw reply [flat|nested] 22+ messages in thread
[parent not found: <200702252057.22963.a1426z@gawab.com>]
* Re: PATA/SATA Disk Reliability paper [not found] ` <200702252057.22963.a1426z@gawab.com> @ 2007-02-25 19:58 ` Mark Hahn 2007-02-25 21:07 ` Al Boldi 0 siblings, 1 reply; 22+ messages in thread From: Mark Hahn @ 2007-02-25 19:58 UTC (permalink / raw) To: Al Boldi; +Cc: linux-raid >>> A good way to deal with reality is to find the real reasons for failure. >>> Once these reasons are known, engineering quality drives becomes, thank >>> GOD, really rather easy. >> >> that would be great, but depends rather much on relatively small number of >> variables, which are manifest, not hidden. there are billions of studies >> (in medical/behavioral/social fields) which assume large numbers of more >> or less hidden variables, and which still manage good success... > > Interesting. Can you elaborate? I'll give it a try - my wife wears the statistical pants in the family ;) I was trying to say three things: - disks are very complicated, so their failure rates are a combination of conditional failure rates of many components. to take a fully reductionist approach would require knowing how each of ~1k parts responds to age, wear, temp, handling, etc. and none of those can be assumed to be independent. those are the "real reasons", but most can't be measured directly outside a lab and the number of combinatorial interactions is huge. - factorial analysis of the data. temperature is a good example, because both low and high temperature affect AFR, and in ways that interact with age and/or utilization. this is a common issue in medical studies, which are strikingly similar in design (outcome is subject or disk dies...) there is a well-established body of practice for factorial analysis. - recognition that the relative results are actually quite good, even if the absolute results are not amazing. for instance, assume we have 1k drives, and a 10% overall failure rate. using all SMART but temp detects 64 of the 100 failures and misses 36. essentially, the failure rate is now .036. 
I'm guessing that if utilization and temperature were included, the rate would be much lower. feedback from active testing (especially scrubbing) and performance under the normal workload would also help. in other words, I find the paper quite encouraging, even inspiring! while the raw failure rates are almost shocking, monitoring and replacement appears to give reasonable results. a particular "treatment schedule" would need to minimize false positives as well, unless disks taken out because of warning signs can be re-validated in some way. (my organization has around 6300 disks and no coherent monitoring so far.) regards, mark hahn. ^ permalink raw reply [flat|nested] 22+ messages in thread
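The back-of-the-envelope numbers above work out as a short sketch (hypothetical Python; the 1000-drive population, 10% overall failure rate, and 64-of-100 detection count are the figures assumed in the message, not values recomputed from the Google paper):

```python
# Worked version of the arithmetic in the message above. All inputs
# are the message's own assumed figures, not data from the paper.
drives = 1000
failures = drives * 10 // 100        # 100 drives fail overall (10%)
flagged_by_smart = 64                # failures preceded by a SMART signal
missed = failures - flagged_by_smart # 36 "surprise" failures
surprise_rate = missed / drives

print(missed, surprise_rate)         # 36 0.036
```

In other words, imperfect warnings still shrink the unanticipated-failure rate from 0.10 to 0.036, which is the point the message is making.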
* Re: PATA/SATA Disk Reliability paper 2007-02-25 19:58 ` Mark Hahn @ 2007-02-25 21:07 ` Al Boldi 2007-02-25 22:14 ` Mark Hahn 0 siblings, 1 reply; 22+ messages in thread From: Al Boldi @ 2007-02-25 21:07 UTC (permalink / raw) To: Mark Hahn; +Cc: linux-raid Mark Hahn wrote: > - disks are very complicated, so their failure rates are a > combination of conditional failure rates of many components. > to take a fully reductionist approach would require knowing > how each of ~1k parts responds to age, wear, temp, handling, etc. > and none of those can be assumed to be independent. those are the > "real reasons", but most can't be measured directly outside a lab > and the number of combinatorial interactions is huge. It seems to me that the biggest problem is the 7.2k+ rpm platters themselves, especially with those heads flying closely on top of them. So, we can probably forget the rest of the ~1k non-moving parts, as they have proven to be pretty reliable, most of the time. > - factorial analysis of the data. temperature is a good > example, because both low and high temperature affect AFR, > and in ways that interact with age and/or utilization. this > is a common issue in medical studies, which are strikingly > similar in design (outcome is subject or disk dies...) there > is a well-established body of practice for factorial analysis. Agreed. We definitely need more sensors. > - recognition that the relative results are actually quite good, > even if the absolute results are not amazing. for instance, > assume we have 1k drives, and a 10% overall failure rate. using > all SMART but temp detects 64 of the 100 failures and misses 36. > essentially, the failure rate is now .036. I'm guessing that if > utilization and temperature were included, the rate would be much > lower. feedback from active testing (especially scrubbing) > and performance under the normal workload would also help. 
Are you saying you are content with premature disk failure, as long as there is a SMART warning sign? If so, then I don't think that is enough. I think the sensors should trigger some kind of shutdown mechanism as a protective measure when some threshold is reached. Just like the protective measure you see for CPUs to prevent meltdown. Thanks! -- Al ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: PATA/SATA Disk Reliability paper 2007-02-25 21:07 ` Al Boldi @ 2007-02-25 22:14 ` Mark Hahn 2007-02-25 22:46 ` Benjamin Davenport 0 siblings, 1 reply; 22+ messages in thread From: Mark Hahn @ 2007-02-25 22:14 UTC (permalink / raw) To: Al Boldi; +Cc: linux-raid >> and none of those can be assumed to be independent. those are the >> "real reasons", but most can't be measured directly outside a lab >> and the number of combinatorial interactions is huge. > > It seems to me that the biggest problem is the 7.2k+ rpm platters > themselves, especially with those heads flying closely on top of them. So, > we can probably forget the rest of the ~1k non-moving parts, as they have > proven to be pretty reliable, most of the time. donno. non-moving parts probably have much higher reliability, but having so many of them makes them a concern. if a discrete resistor has a 1e9 hour MTBF, 1k of them are 1e6 and that's starting to approach the claimed MTBF of a disk. any lower (or more components) and it takes over as a dominant failure mode... the Google paper doesn't really try to diagnose, but it does indicate that metrics related to media/head problems tend to promptly lead to failure. (scan errors, reallocations, etc.) I guess that's circumstantial support for your theory that crashes of media/heads are the primary failure mode. >> - factorial analysis of the data. temperature is a good >> example, because both low and high temperature affect AFR, >> and in ways that interact with age and/or utilization. this >> is a common issue in medical studies, which are strikingly >> similar in design (outcome is subject or disk dies...) there >> is a well-established body of practice for factorial analysis. > > Agreed. We definitely need more sensors. just to be clear, I'm not saying we need more sensors, just that the existing metrics (including temp and utilization) need to be considered jointly, not independently. 
more metrics would be better as well, assuming they're direct readouts, not idiot-lights... >> and performance under the normal workload would also help. > > Are you saying, you are content with pre-mature disk failure, as long as > there is a smart warning sign? I'm saying that disk failures are inevitable. ways to reduce the chance of data loss are what we have to focus on. the Google paper shows that disks like to be at around 35C - not too cool or hot (though this is probably conflated with utilization.) the paper also shows that warning signs can indicate a majority of failures (though it doesn't present the factorial analysis necessary to tell which ones, how well, avoid false-positives, etc.) > I think the sensors should trigger some kind of shutdown mechanism as a > protective measure, when some threshold is reached. Just like the > protective measure you see for CPUs to prevent meltdown. but they already do. persistent bad reads or writes to a block will trigger its reallocation to spares, etc. for CPUs, the main threat is heat, and it's easy to throttle to cool down. for disks, the main threat is probably wear, which seems quite different - more catastrophic and less mitigatable once it starts. I'd love to hear from an actual drive engineer on the failure modes they worry about... regards, mark hahn. ^ permalink raw reply [flat|nested] 22+ messages in thread
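The resistor arithmetic earlier in this message rests on the standard assumption of independent parts with constant (exponential) failure rates, under which per-part rates add; a minimal sketch (the follow-up in this thread makes the point that MTBF alone says nothing about the shape of the failure distribution):

```python
# Series-system MTBF under the independence + constant-rate assumption:
# each part fails at rate 1/MTBF, rates add across the system, and the
# combined MTBF is the reciprocal of the combined rate.
part_mtbf_hours = 1e9      # claimed MTBF of one discrete component
n_parts = 1000

system_rate = n_parts * (1 / part_mtbf_hours)  # failures per hour
system_mtbf_hours = 1 / system_rate

print(system_mtbf_hours)   # ~1e6 hours, comparable to a disk's claimed MTBF
```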
* Re: PATA/SATA Disk Reliability paper 2007-02-25 22:14 ` Mark Hahn @ 2007-02-25 22:46 ` Benjamin Davenport 2007-02-25 23:58 ` Mark Hahn 0 siblings, 1 reply; 22+ messages in thread From: Benjamin Davenport @ 2007-02-25 22:46 UTC (permalink / raw) To: linux-raid Mark Hahn wrote: | if a discrete resistor has a 1e9 hour MTBF, 1k of them are 1e6 That's not actually true. As a (contrived) example, consider two cases. Case 1: failures occur at constant rate from hours 0 through 2e9. Case 2: failures occur at constant rate from 1e9-10 hours through 1e9+10 hours. Clearly in the former case, over 1000 components there will almost certainly be a failure by 1e8 hours. In the latter case, there will not be. Yet both have the same MTTF. MTTF says nothing about the shape of the failure curve. It indicates only where its midpoint is. To compute the MTTF of 1000 devices, you'll need to know the probability distribution of failures over time of those 1000 devices, which can be computed from the distribution of failures over time for a single device. But, although MTTF is derived from this distribution, you cannot reconstruct the distribution knowing only MTTF. In fact, the recent papers on disk failure indicate that common assumptions about the shape of that distribution (either a bathtub curve, or increasing failures due to wear-out after 3ish years) do not hold. -Ben ^ permalink raw reply [flat|nested] 22+ messages in thread
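The two contrived cases lend themselves to a quick numeric check (hypothetical Python; the seed is fixed only to make the illustration repeatable):

```python
import random

random.seed(0)          # deterministic illustration
n = 1000                # components, each with MTTF = 1e9 hours

# Case 1: failures spread at a constant rate over hours 0 .. 2e9
# (uniform lifetimes, mean 1e9).
case1 = [random.uniform(0, 2e9) for _ in range(n)]
# Case 2: failures packed into the window 1e9-10 .. 1e9+10 hours
# (uniform lifetimes, same mean 1e9).
case2 = [random.uniform(1e9 - 10, 1e9 + 10) for _ in range(n)]

# Same MTTF, wildly different time-to-first-failure across 1000 parts:
print(min(case1))       # almost certainly well before 1e8 hours
print(min(case2))       # never earlier than 1e9 - 10 hours
```

The means match, but the earliest failure in case 1 lands around 2e6 hours in expectation, while case 2 cannot produce a failure before roughly 1e9 hours, which is exactly the distinction the message draws.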
* Re: PATA/SATA Disk Reliability paper 2007-02-25 22:46 ` Benjamin Davenport @ 2007-02-25 23:58 ` Mark Hahn 0 siblings, 0 replies; 22+ messages in thread From: Mark Hahn @ 2007-02-25 23:58 UTC (permalink / raw) To: Benjamin Davenport; +Cc: linux-raid > | if a discrete resistor has a 1e9 hour MTBF, 1k of them are 1e6 > > That's not actually true. As a (contrived) example, consider two cases. if you know nothing else, it's the best you can do. it's also a conservative estimate (where conservative means to expect a failure sooner). > distribution knowing only MTTF. In fact, the recent papers on disk failure > indicate that common assumptions about the shape of that distribution (either > a bathtub curve, or increasing failures due to wear-out after 3ish years) do > not hold. the data in both the Google and Schroeder&Gibson papers are fairly noisy. yes, the "strong bathtub hypothesis" is apparently wrong (that infant mortality is an exp decreasing failure rate over the first year, that disks stay at a constant failure rate for the next 4-6 years, then have an exp increasing failure rate). both papers, though, show what you might call a "swimming pool" curve: a short period of high mortality (clock starts when the drive leaves the factory) with a minimum failure rate at about 1 year. that's the deep end of the pool ;) then increasing failures out to the end of expected service life (warranty period). what happens after is probably too noisy to conclude much, since most people prefer not to use disks which have already seen the death of ~25% of their peers. (Google's paper has, hallelujah, error bars showing high variance at >3 years.) both papers (and most people's experience, I think) agree that: - there may be an infant mortality curve, but it depends on when you start counting, conditions and load in early life, etc. - failure rates increase with age. - failure rates in the "prime of life" are dramatically higher than the vendor spec sheets. 
- failure rates in senescence (post warranty) are very bad. after all, real bathtubs don't have flat bottoms! as for models and fits, well, it's complicated. consider that in a lot of environments, it takes a year or two for a new disk array to fill. so a wear-related process will initially be focused on a small area of disk, perhaps not even spread across individual disks. or consider that once the novelty of a new installation wears off, people get more worried about failures, perhaps altering their replacement strategy... ^ permalink raw reply [flat|nested] 22+ messages in thread
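The "swimming pool" shape described above can be caricatured with a toy hazard function (all numbers are made up purely to illustrate the shape: elevated infant mortality, a minimum near one year, then a wear-related rise):

```python
# Toy annualized-failure-rate curve with the "swimming pool" shape:
# high at age 0, minimum around 1 year, rising afterwards. The
# coefficients are illustrative only, not fitted to either paper.
def afr(age_years: float) -> float:
    if age_years < 1.0:
        return 0.08 - 0.05 * age_years    # settling infant mortality
    return 0.03 + 0.02 * (age_years - 1)  # wear-related rise afterwards

print([round(afr(t), 3) for t in (0.0, 0.5, 1.0, 3.0, 5.0)])
```

The deep end of the pool is the minimum at one year; unlike the classic bathtub, the bottom is never flat.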
* Re: PATA/SATA Disk Reliability paper 2007-02-25 17:40 ` Mark Hahn [not found] ` <200702252057.22963.a1426z@gawab.com> @ 2007-02-27 19:21 ` Bill Davidsen 1 sibling, 0 replies; 22+ messages in thread From: Bill Davidsen @ 2007-02-27 19:21 UTC (permalink / raw) To: Mark Hahn; +Cc: Al Boldi, linux-raid Mark Hahn wrote: >>>> In contrast, ever since these holes appeared, drive failures became >>>> the norm. >>> >>> wow, great conspiracy theory! >> >> I think you misunderstand. I just meant plain old-fashioned >> mis-engineering. > > I should have added a smilie. but I find it dubious that the whole > industry would have made a major bungle if so many failures are due to > the hole... > >> But remember, the Google report mentions a great number of drives >> failing for no apparent reason, not even a SMART warning, so failing >> within the warranty period is just pure luck. > > are we reading the same report? I look at it and see: > > - lowest failures from medium-utilization drives, 30-35C. > - higher failures from young drives in general, but especially > if cold or used hard. > - higher failures from end-of-life drives, especially > 40C. > - scan errors, realloc counts, offline realloc and probation > counts are all significant in drives which fail. > > the paper seems unnecessarily gloomy about these results. to me, they're > quite exciting, and provide good reason to pay a lot of attention to these > factors. I hate to criticize such a valuable paper, but I think they've > missed a lot by not considering the results in a fully factorial analysis > as most medical/behavioral/social studies do. for instance, they bemoan > a 56% false negative rate from only SMART signals, and mention that if > >40C is added, the FN rate falls to 36%. also incorporating the low-young > risk factor would help. I would guess that a full-on model, especially > if it incorporated utilization, age, and performance, could bring the > false-negative rate down to comfortable levels. 
The big thing I notice is that drives with SMART errors are quite likely to fail, but drives which fail aren't all that likely to have SMART errors. So while I might proactively move a drive with errors out or to non-critical service, seeing no errors doesn't mean the drive won't fail. I haven't looked at drive temp vs. ambient; I am collecting what data I can, but I no longer have thousands of drives to monitor (I'm grateful). Interesting speculation: on drives with cyclic load, does spinning down off-shift help or hinder? I have two boxes full of WD, Seagate and Maxtor drives, all cheap commodity drives, which have about 6.8 years power-on time, 11-14 power cycles, and 2200-2500 spin-up cycles, due to spin-down nights and weekends. Does anyone have a large enough collection of similar-use drives to contribute results? -- bill davidsen <davidsen@tmr.com> CTO TMR Associates, Inc Doing interesting things with small computers since 1979 ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: PATA/SATA Disk Reliability paper
  2007-02-24 22:27 ` Mark Hahn
  2007-02-25 11:22 ` Al Boldi
@ 2007-02-25 19:02 ` Richard Scobie
  1 sibling, 0 replies; 22+ messages in thread
From: Richard Scobie @ 2007-02-25 19:02 UTC (permalink / raw)
To: Linux RAID Mailing List

Mark Hahn wrote:

> this - I checked the seagate 7200.10: 10k feet operating, 40k max.
> amusingly -200 feet is the min either way...

Which means you could not use this drive on the shores of the Dead
Sea, which is at about -1300ft.

Regards,

Richard
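A trivial check of the arithmetic, taking the spec figures exactly as
quoted in the post (not independently verified against Seagate's
datasheet):

```python
# Operating altitude range for the Seagate 7200.10 as quoted above.
OPERATING_MIN_FT, OPERATING_MAX_FT = -200, 10_000
DEAD_SEA_FT = -1300   # approximate shore elevation

in_spec = OPERATING_MIN_FT <= DEAD_SEA_FT <= OPERATING_MAX_FT
print(f"Dead Sea shore within operating spec: {in_spec}")   # False
```

The lower bound exists because drive heads fly on an air bearing, so
air density outside the designed range matters in both directions.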
* Re: PATA/SATA Disk Reliability paper
  2007-02-22 23:30 ` Stephen C Woods
  2007-02-23 18:22 ` Al Boldi
@ 2007-02-27 19:06 ` Bill Davidsen
  1 sibling, 0 replies; 22+ messages in thread
From: Bill Davidsen @ 2007-02-27 19:06 UTC (permalink / raw)
To: Stephen C Woods; +Cc: Nix, Al Boldi, Eyal Lebedinsky, linux-raid

Stephen C Woods wrote:
> As he leans on his cane, the old codger says...
> Well, disks used to come in open canisters; that is, you took the
> bottom cover off, put the whole pack into the drive, and then
> unscrewed the top cover and took it out.  Clearly ventilated.
> C. 1975.
>
> Later we got sealed drives, Kennedy 180 MB Winchesters they were
> called (they used IBM 3030 technology).  They had a vent pipe with
> two filters; you replaced the outer one every 90 days (as part of the
> PM process).  The inner one you didn't touch.  Apparently they
> figured it'd be a long time before the inner one got really clogged
> at 10 minutes' exposure every 90 days.  C. 1980.
>
> Still later we had a mainframe running Un*x; it used IBM 3080 drives.
> These had huge HDA boxes that were sealed but had vent filters that
> had to be changed every PM (30 days, 2 hours of downtime to do them
> all).  C. 1985.
>
> So drives do need to be ventilated, not so much from worry about
> exploding, but rather subtle distortion of the case as the
> atmospheric pressure changes.
>
> Does anyone remember that you had to let your drives acclimate to
> your machine room for a day or so before you used them?
>
> Ah, the good old days... HUH???
>
> <scw>

I remember the DSU-10, 16 million 36-bit words of storage, which not
only wanted to be acclimatized, but had platters so large, over a
meter in diameter, that there was a short crane mounting point on the
box.  Failure rate went WAY down after better air filters were
installed.  I think they were made for GE by CDC, but never knew for
sure.
GE was a mainframe manufacturer until 1970; their big claim to fame
was the GE-645, the development platform for MULTICS.  They sold the
computer business, mainframe and industrial control, in 1970 to put
money into nuclear energy, and haven't built a power plant since.
Then they developed a personal computer in 1978, built a plant to
manufacture it in Waynesboro, VA, and decided there was no market for
a small computer.

-- 
bill davidsen <davidsen@tmr.com>
  CTO TMR Associates, Inc
  Doing interesting things with small computers since 1979
* Re: PATA/SATA Disk Reliability paper
  2007-02-19 11:26 ` Al Boldi
  2007-02-19 21:42 ` Eyal Lebedinsky
@ 2007-02-26 14:15 ` Mario 'BitKoenig' Holbe
  2007-02-26 17:46 ` Al Boldi
  1 sibling, 1 reply; 22+ messages in thread
From: Mario 'BitKoenig' Holbe @ 2007-02-26 14:15 UTC (permalink / raw)
To: linux-raid

Al Boldi <a1426z@gawab.com> wrote:
> Interesting link.  They seem to point out that smart does not
> necessarily warn of pending failure.  This is probably worse than not
> having smart at all, as it gives you the illusion of safety.

If SMART gives you the illusion of safety, you didn't understand
SMART.  SMART hints *only* at the potential presence or occurrence of
failures in the future; it does not prove the absence of such, and
nobody ever said it does.  It would be impossible to do that anyway
(which is easy to prove by just utilizing an external damaging tool
like a hammer).  Concluding from that that not having any failure
detector at all is better than having at least an imperfect one is
IMHO completely wrong.

regards
   Mario
-- 
File names are infinite in length where infinity is set to 255
characters.
                -- Peter Collinson, "The Unix File System"
* Re: PATA/SATA Disk Reliability paper
  2007-02-26 14:15 ` Mario 'BitKoenig' Holbe
@ 2007-02-26 17:46 ` Al Boldi
  0 siblings, 0 replies; 22+ messages in thread
From: Al Boldi @ 2007-02-26 17:46 UTC (permalink / raw)
To: linux-raid

Mario 'BitKoenig' Holbe wrote:
> Al Boldi <a1426z@gawab.com> wrote:
> > Interesting link.  They seem to point out that smart not necessarily
> > warns of pending failure.  This is probably worse than not having
> > smart at all, as it gives you the illusion of safety.
>
> If SMART gives you the illusion of safety, you didn't understand
> SMART.  SMART hints *only* the potential presence or occurence of
> failures in the future, it does not prove the absence of such - and
> nobody ever said it does.  It would even be impossible to do that,
> though (which is easy to prove by just utilizing an external damaging
> tool like a hammer).  Concluding from that that not having any
> failure detector at all is better than having at least an imperfect
> one is IMHO completely wrong.

Agreed.  But would you then call it SMART?  Sounds rather DUMB.

Thanks!

--
Al
* Re: PATA/SATA Disk Reliability paper
  2007-02-18 18:50 PATA/SATA Disk Reliability paper Richard Scobie
  2007-02-19 11:26 ` Al Boldi
@ 2007-02-20  3:03 ` H. Peter Anvin
  1 sibling, 0 replies; 22+ messages in thread
From: H. Peter Anvin @ 2007-02-20 3:03 UTC (permalink / raw)
To: Richard Scobie; +Cc: linux-raid

Richard Scobie wrote:
> Thought this paper may be of interest.  A study done by Google on over
> 100,000 drives they have/had in service.
>
> http://labs.google.com/papers/disk_failures.pdf

Bastards:

"Failure rates are known to be highly correlated with drive models,
manufacturers and vintages [18].  Our results do not contradict this
fact.  For example, Figure 2 changes significantly when we normalize
failure rates per each drive model.  Most age-related results are
impacted by drive vintages.  However, in this paper, we do not show a
breakdown of drives per manufacturer, model, or vintage due to the
proprietary nature of these data."

	-hpa
end of thread, other threads: [~2007-02-27 19:21 UTC | newest]

Thread overview: 22+ messages
-- links below jump to the message on this page --

2007-02-18 18:50 PATA/SATA Disk Reliability paper Richard Scobie
2007-02-19 11:26 ` Al Boldi
2007-02-19 21:42 ` Eyal Lebedinsky
2007-02-20 12:15 ` Al Boldi
2007-02-22 22:27 ` Nix
2007-02-22 22:30 ` Nix
2007-02-22 23:30 ` Stephen C Woods
2007-02-23 18:22 ` Al Boldi
2007-02-24 22:27 ` Mark Hahn
2007-02-25 11:22 ` Al Boldi
2007-02-25 17:40 ` Mark Hahn
     [not found] ` <200702252057.22963.a1426z@gawab.com>
2007-02-25 19:58 ` Mark Hahn
2007-02-25 21:07 ` Al Boldi
2007-02-25 22:14 ` Mark Hahn
2007-02-25 22:46 ` Benjamin Davenport
2007-02-25 23:58 ` Mark Hahn
2007-02-27 19:21 ` Bill Davidsen
2007-02-25 19:02 ` Richard Scobie
2007-02-27 19:06 ` Bill Davidsen
2007-02-26 14:15 ` Mario 'BitKoenig' Holbe
2007-02-26 17:46 ` Al Boldi
2007-02-20  3:03 ` H. Peter Anvin