* PATA/SATA Disk Reliability paper
@ 2007-02-18 18:50 Richard Scobie
  2007-02-19 11:26 ` Al Boldi
  2007-02-20  3:03 ` H. Peter Anvin
  0 siblings, 2 replies; 22+ messages in thread
From: Richard Scobie @ 2007-02-18 18:50 UTC (permalink / raw)
  To: linux-raid

Thought this paper may be of interest. A study done by Google on over 
100,000 drives they have/had in service.

http://labs.google.com/papers/disk_failures.pdf

Regards,

Richard

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: PATA/SATA Disk Reliability paper
  2007-02-18 18:50 PATA/SATA Disk Reliability paper Richard Scobie
@ 2007-02-19 11:26 ` Al Boldi
  2007-02-19 21:42   ` Eyal Lebedinsky
  2007-02-26 14:15   ` Mario 'BitKoenig' Holbe
  2007-02-20  3:03 ` H. Peter Anvin
  1 sibling, 2 replies; 22+ messages in thread
From: Al Boldi @ 2007-02-19 11:26 UTC (permalink / raw)
  To: linux-raid

Richard Scobie wrote:
> Thought this paper may be of interest. A study done by Google on over
> 100,000 drives they have/had in service.
>
> http://labs.google.com/papers/disk_failures.pdf

Interesting link.  They seem to point out that SMART does not necessarily warn of 
pending failure.  This is probably worse than not having SMART at all, as it 
gives you the illusion of safety.

If there is one thing to watch out for, it is "dew".

I remember video machines sensing for dew, so do any drives sense for "dew"?


Thanks!

--
Al


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: PATA/SATA Disk Reliability paper
  2007-02-19 11:26 ` Al Boldi
@ 2007-02-19 21:42   ` Eyal Lebedinsky
  2007-02-20 12:15     ` Al Boldi
  2007-02-26 14:15   ` Mario 'BitKoenig' Holbe
  1 sibling, 1 reply; 22+ messages in thread
From: Eyal Lebedinsky @ 2007-02-19 21:42 UTC (permalink / raw)
  To: Al Boldi; +Cc: linux-raid

Disks are sealed, and a desiccant is present in each to keep humidity down.
If you ever open a disk drive (e.g. for the magnets, or the mirror-quality
platters, or for fun) then you can see the desiccant sachet.

cheers

Al Boldi wrote:
> Richard Scobie wrote:
> 
>>Thought this paper may be of interest. A study done by Google on over
>>100,000 drives they have/had in service.
>>
>>http://labs.google.com/papers/disk_failures.pdf
> 
> 
> Interesting link.  They seem to point out that SMART does not necessarily warn of 
> pending failure.  This is probably worse than not having SMART at all, as it 
> gives you the illusion of safety.
> 
> If there is one thing to watch out for, it is "dew".
> 
> I remember video machines sensing for dew, so do any drives sense for "dew"?
> 
> 
> Thanks!
> 
> --
> Al

-- 
Eyal Lebedinsky (eyal@eyal.emu.id.au) <http://samba.org/eyal/>
	attach .zip as .dat

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: PATA/SATA Disk Reliability paper
  2007-02-18 18:50 PATA/SATA Disk Reliability paper Richard Scobie
  2007-02-19 11:26 ` Al Boldi
@ 2007-02-20  3:03 ` H. Peter Anvin
  1 sibling, 0 replies; 22+ messages in thread
From: H. Peter Anvin @ 2007-02-20  3:03 UTC (permalink / raw)
  To: Richard Scobie; +Cc: linux-raid

Richard Scobie wrote:
> Thought this paper may be of interest. A study done by Google on over 
> 100,000 drives they have/had in service.
> 
> http://labs.google.com/papers/disk_failures.pdf
> 

Bastards:

"Failure rates are known to be highly correlated with drive
models, manufacturers and vintages [18]. Our results do
not contradict this fact. For example, Figure 2 changes
significantly when we normalize failure rates per each
drive model. Most age-related results are impacted by
drive vintages. However, in this paper, we do not show a
breakdown of drives per manufacturer, model, or vintage
due to the proprietary nature of these data."

	-hpa

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: PATA/SATA Disk Reliability paper
  2007-02-19 21:42   ` Eyal Lebedinsky
@ 2007-02-20 12:15     ` Al Boldi
  2007-02-22 22:27       ` Nix
  0 siblings, 1 reply; 22+ messages in thread
From: Al Boldi @ 2007-02-20 12:15 UTC (permalink / raw)
  To: Eyal Lebedinsky; +Cc: linux-raid

Eyal Lebedinsky wrote:
> Disks are sealed, and a desiccant is present in each to keep humidity
> down. If you ever open a disk drive (e.g. for the magnets, or the
> mirror-quality platters, or for fun) then you can see the desiccant sachet.

Actually, they aren't sealed 100%.  

On wd's at least, there is a hole with a warning printed on its side:

                      DO NOT COVER HOLE BELOW
                      V       V      V      V

                                  o


In contrast, older models from the last century don't have that hole.

> Al Boldi wrote:
> >
> > If there is one thing to watch out for, it is "dew".
> >
> > I remember video machines sensing for dew, so do any drives sense for
> > "dew"?


Thanks!

--
Al


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: PATA/SATA Disk Reliability paper
  2007-02-20 12:15     ` Al Boldi
@ 2007-02-22 22:27       ` Nix
  2007-02-22 22:30         ` Nix
  2007-02-22 23:30         ` Stephen C Woods
  0 siblings, 2 replies; 22+ messages in thread
From: Nix @ 2007-02-22 22:27 UTC (permalink / raw)
  To: Al Boldi; +Cc: Eyal Lebedinsky, linux-raid

On 20 Feb 2007, Al Boldi outgrape:
> Eyal Lebedinsky wrote:
>> Disks are sealed, and a desiccant is present in each to keep humidity
>> down. If you ever open a disk drive (e.g. for the magnets, or the
>> mirror-quality platters, or for fun) then you can see the desiccant sachet.
>
> Actually, they aren't sealed 100%.  

I'd certainly hope not, unless you like the sound of imploding drives
when you carry one up a mountain.

> On wd's at least, there is a hole with a warning printed on its side:
>
>                       DO NOT COVER HOLE BELOW
>                       V       V      V      V
>
>                                   o

I suspect that's for air-pressure equalization.

> In contrast, older models from the last century, don't have that hole.

It was my understanding that disks have had some way of equalizing
pressure with their surroundings for many years; but I haven't verified
this so you may well be right that this is a recent thing. (Anyone know
for sure?)

-- 
`In the future, company names will be a 32-character hex string.'
  --- Bruce Schneier on the shortage of company names

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: PATA/SATA Disk Reliability paper
  2007-02-22 22:27       ` Nix
@ 2007-02-22 22:30         ` Nix
  2007-02-22 23:30         ` Stephen C Woods
  1 sibling, 0 replies; 22+ messages in thread
From: Nix @ 2007-02-22 22:30 UTC (permalink / raw)
  To: Al Boldi; +Cc: Eyal Lebedinsky, linux-raid

On 22 Feb 2007, nix@esperi.org.uk uttered the following:

> On 20 Feb 2007, Al Boldi outgrape:
>> Eyal Lebedinsky wrote:
>>> Disks are sealed, and a desiccant is present in each to keep humidity
>>> down. If you ever open a disk drive (e.g. for the magnets, or the
>>> mirror-quality platters, or for fun) then you can see the desiccant sachet.
>>
>> Actually, they aren't sealed 100%.  
>
> I'd certainly hope not, unless you like the sound of imploding drives
> when you carry one up a mountain.

Or even exploding drives. (Oops.)

-- 
`In the future, company names will be a 32-character hex string.'
  --- Bruce Schneier on the shortage of company names

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: PATA/SATA Disk Reliability paper
  2007-02-22 22:27       ` Nix
  2007-02-22 22:30         ` Nix
@ 2007-02-22 23:30         ` Stephen C Woods
  2007-02-23 18:22           ` Al Boldi
  2007-02-27 19:06           ` Bill Davidsen
  1 sibling, 2 replies; 22+ messages in thread
From: Stephen C Woods @ 2007-02-22 23:30 UTC (permalink / raw)
  To: Nix; +Cc: Al Boldi, Eyal Lebedinsky, linux-raid

   As he leans on his cane, the old codger says....
Well, disks used to come in open canisters; that is, you took the bottom
cover off, put the whole pack into the drive, and then unscrewed the top
cover and took it out.  Clearly ventilated.  c. 1975.

  Later we got sealed drives, Kennedy 180 MB Winchesters they were
called (they used IBM 3030 technology).  They had a vent pipe with two
filters; you replaced the outer one every 90 days (as part of the PM
process).  The inner one you didn't touch.  Apparently they figured that
it'd be a long time before the inner one got really clogged at 10 minutes'
exposure every 90 days.  c. 1980.

  Still later we had a mainframe running Un*x; it used IBM 3080 drives.
These had huge HDA boxes that were sealed but had vent filters that had
to be changed every PM (30 days; 2 hours of downtime to do them
all).  c. 1985.

  So drives do need to be ventilated, not so much from worry about exploding,
but rather because of subtle distortion of the case as the atmospheric
pressure changes.

   Does anyone remember that you had to let your drives acclimate to your
machine room for a day or so before you used them?

   Ah the good old days...
     HUH???

  <scw>


On Thu, Feb 22, 2007 at 10:27:43PM +0000, Nix wrote:
> On 20 Feb 2007, Al Boldi outgrape:
> > Eyal Lebedinsky wrote:
> >> Disks are sealed, and a desiccant is present in each to keep humidity
> >> down. If you ever open a disk drive (e.g. for the magnets, or the
> >> mirror-quality platters, or for fun) then you can see the desiccant sachet.
> >
> > Actually, they aren't sealed 100%.  
> 
> I'd certainly hope not, unless you like the sound of imploding drives
> when you carry one up a mountain.
> 
> > On wd's at least, there is a hole with a warning printed on its side:
> >
> >                       DO NOT COVER HOLE BELOW
> >                       V       V      V      V
> >
> >                                   o
> 
> I suspect that's for air-pressure equalization.
> 
> > In contrast, older models from the last century, don't have that hole.
> 
> It was my understanding that disks have had some way of equalizing
> pressure with their surroundings for many years; but I haven't verified
> this so you may well be right that this is a recent thing. (Anyone know
> for sure?)
> 
> -- 
> `In the future, company names will be a 32-character hex string.'
>   --- Bruce Schneier on the shortage of company names

-- 
-----
Stephen C. Woods; UCLA SEASnet; 2567 Boelter hall; LA CA 90095; (310)-825-8614
Unless otherwise noted these statements are my own, Not those of the 
University of California.                      Internet mail:scw@seas.ucla.edu

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: PATA/SATA Disk Reliability paper
  2007-02-22 23:30         ` Stephen C Woods
@ 2007-02-23 18:22           ` Al Boldi
  2007-02-24 22:27             ` Mark Hahn
  2007-02-27 19:06           ` Bill Davidsen
  1 sibling, 1 reply; 22+ messages in thread
From: Al Boldi @ 2007-02-23 18:22 UTC (permalink / raw)
  To: Stephen C Woods, Nix; +Cc: Eyal Lebedinsky, linux-raid

Stephen C Woods wrote:
>   So drives do need to be ventilated, not so much from worry about exploding,
> but rather because of subtle distortion of the case as the atmospheric
> pressure changes.

I have a '94 Caviar without any apparent holes; and as a bonus, the drive 
still works.

In contrast, ever since these holes appeared, drive failures became the norm.

>    Does anyone remember that you had to let your drives acclimate to your
> machine room for a day or so before you used them?

The problem is, that's not enough; the room temperature/humidity has to be 
controlled too.  In a desktop environment, that's not really feasible.


Thanks!

--
Al


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: PATA/SATA Disk Reliability paper
  2007-02-23 18:22           ` Al Boldi
@ 2007-02-24 22:27             ` Mark Hahn
  2007-02-25 11:22               ` Al Boldi
  2007-02-25 19:02               ` Richard Scobie
  0 siblings, 2 replies; 22+ messages in thread
From: Mark Hahn @ 2007-02-24 22:27 UTC (permalink / raw)
  To: Al Boldi; +Cc: linux-raid

> In contrast, ever since these holes appeared, drive failures became the norm.

wow, great conspiracy theory!  maybe the hole is plugged at 
the factory with a substance which evaporates at 1/warranty-period ;)

seriously, isn't it easy to imagine a bladder-like arrangement that 
permits equilibration without net flow?  disk spec-sheets do limit
this - I checked the seagate 7200.10: 10k feet operating, 40k max.
amusingly -200 feet is the min either way...

>>    Does anyone remember that you had to let your drives acclimate to your
>> machine room for a day or so before you used them?
>
> The problem is, that's not enough; the room temperature/humidity has to be
> controlled too.  In a desktop environment, that's not really feasible.

5-90% humidity, operating, 95% non-op, and 30%/hour.  seems pretty easy
to me.  in fact, I frequently ask people to justify the assumption that 
a good machineroom needs tight control over humidity.  (assuming, like 
most machinerooms, you aren't frequently handling the innards.)

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: PATA/SATA Disk Reliability paper
  2007-02-24 22:27             ` Mark Hahn
@ 2007-02-25 11:22               ` Al Boldi
  2007-02-25 17:40                 ` Mark Hahn
  2007-02-25 19:02               ` Richard Scobie
  1 sibling, 1 reply; 22+ messages in thread
From: Al Boldi @ 2007-02-25 11:22 UTC (permalink / raw)
  To: Mark Hahn; +Cc: linux-raid

Mark Hahn wrote:
> > In contrast, ever since these holes appeared, drive failures became the
> > norm.
>
> wow, great conspiracy theory!

I think you misunderstand.  I just meant plain old-fashioned mis-engineering.

> maybe the hole is plugged at
> the factory with a substance which evaporates at 1/warranty-period ;)

Actually it's plugged with a thin paper-like filter, which does not seem to 
evaporate easily.

And it's got nothing to do with warranty, although if you get lucky and the 
failure happens within the warranty period, you can probably demand a 
replacement drive to make you feel better.

But remember, the Google report mentions a great number of drives failing for 
no apparent reason, not even a SMART warning, so failing within the warranty 
period is just pure luck.

> seriously, isn't it easy to imagine a bladder-like arrangement that
> permits equilibration without net flow?  disk spec-sheets do limit
> this - I checked the seagate 7200.10: 10k feet operating, 40k max.
> amusingly -200 feet is the min either way...

Well, it looks like filtered net flow on wd's.

What's it look like on seagate?

> >>    Does anyone remember that you had to let your drives acclimate to
> >> your machine room for a day or so before you used them?
> >
> > The problem is, that's not enough; the room temperature/humidity has to
> > be controlled too.  In a desktop environment, that's not really
> > feasible.
>
> 5-90% humidity, operating, 95% non-op, and 30%/hour.  seems pretty easy
> to me.  in fact, I frequently ask people to justify the assumption that
> a good machineroom needs tight control over humidity.  (assuming, like
> most machinerooms, you aren't frequently handling the innards.)

I agree, but reality has a different opinion, and it may take down that 
drive, specs or no specs.

A good way to deal with reality is to find the real reasons for failure.  
Once these reasons are known, engineering quality drives becomes, thank GOD, 
really rather easy.


Thanks!

--
Al


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: PATA/SATA Disk Reliability paper
  2007-02-25 11:22               ` Al Boldi
@ 2007-02-25 17:40                 ` Mark Hahn
       [not found]                   ` <200702252057.22963.a1426z@gawab.com>
  2007-02-27 19:21                   ` Bill Davidsen
  0 siblings, 2 replies; 22+ messages in thread
From: Mark Hahn @ 2007-02-25 17:40 UTC (permalink / raw)
  To: Al Boldi; +Cc: linux-raid

>>> In contrast, ever since these holes appeared, drive failures became the
>>> norm.
>>
>> wow, great conspiracy theory!
>
> I think you misunderstand.  I just meant plain old-fashioned mis-engineering.

I should have added a smilie.  but I find it dubious that the whole 
industry would have made a major bungle if so many failures are due to 
the hole...

> But remember, the google report mentions a great number of drives failing for
> no apparent reason, not even a smart warning, so failing within the warranty
> period is just pure luck.

are we reading the same report?  I look at it and see:

         - lowest failures from medium-utilization drives, 30-35C.
         - higher failures from young drives in general, but especially
         if cold or used hard.
         - higher failures from end-of-life drives, especially > 40C.
 	- scan errors, realloc counts, offline realloc and probation
 	counts are all significant in drives which fail.

the paper seems unnecessarily gloomy about these results.  to me, they're
quite exciting, and provide good reason to pay a lot of attention to these
factors.  I hate to criticize such a valuable paper, but I think they've
missed a lot by not considering the results in a fully factorial analysis
as most medical/behavioral/social studies do.  for instance, they bemoan
a 56% false negative rate from only SMART signals, and mention that if 
>40C is added, the FN rate falls to 36%.  also incorporating the low-young
risk factor would help.  I would guess that a full-on model, especially
if it incorporated utilization, age, and performance, could bring the FN
rate down to comfortable levels.

>>> The problem is, that's not enough; the room temperature/humidity has to
>>> be controlled too.  In a desktop environment, that's not really
>>> feasible.
>>
>> 5-90% humidity, operating, 95% non-op, and 30%/hour.  seems pretty easy
>> to me.  in fact, I frequently ask people to justify the assumption that
>> a good machineroom needs tight control over humidity.  (assuming, like
>> most machinerooms, you aren't frequently handling the innards.)
>
> I agree, but reality has a different opinion, and it may take down that
> drive, specs or no specs.

why do you say this?  I have my machineroom set for 35% (which appears 
to be its "natural" point), with a wide 20% margin on either side.
I don't really want to waste cooling capacity on dehumidification,
for instance, unless there's a good reason.

> A good way to deal with reality is to find the real reasons for failure.
> Once these reasons are known, engineering quality drives becomes, thank GOD,
> really rather easy.

that would be great, but depends rather much on a relatively small number of 
variables, which are manifest, not hidden.  there are billions of studies
(in medical/behavioral/social fields) which assume large numbers of more
or less hidden variables, and which still manage good success...

regards, mark hahn.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: PATA/SATA Disk Reliability paper
  2007-02-24 22:27             ` Mark Hahn
  2007-02-25 11:22               ` Al Boldi
@ 2007-02-25 19:02               ` Richard Scobie
  1 sibling, 0 replies; 22+ messages in thread
From: Richard Scobie @ 2007-02-25 19:02 UTC (permalink / raw)
  To: Linux RAID Mailing List

Mark Hahn wrote:

> this - I checked the seagate 7200.10: 10k feet operating, 40k max.
> amusingly -200 feet is the min either way...

Which means you could not use this drive on the shores of the Dead Sea, 
which is at about -1300ft.

Regards,

Richard

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: PATA/SATA Disk Reliability paper
       [not found]                   ` <200702252057.22963.a1426z@gawab.com>
@ 2007-02-25 19:58                     ` Mark Hahn
  2007-02-25 21:07                       ` Al Boldi
  0 siblings, 1 reply; 22+ messages in thread
From: Mark Hahn @ 2007-02-25 19:58 UTC (permalink / raw)
  To: Al Boldi; +Cc: linux-raid

>>> A good way to deal with reality is to find the real reasons for failure.
>>> Once these reasons are known, engineering quality drives becomes, thank
>>> GOD, really rather easy.
>>
>> that would be great, but depends rather much on relatively small number of
>> variables, which are manifest, not hidden.  there are billions of studies
>> (in medical/behavioral/social fields) which assume large numbers of more
>> or less hidden variables, and which still manage good success...
>
> Interesting.  Can you elaborate?

I'll give it a try - my wife wears the statistical pants in the family ;)
I was trying to say three things:

 	- disks are very complicated, so their failure rates are a
 	combination of conditional failure rates of many components.
 	to take a fully reductionist approach would require knowing
 	how each of ~1k parts responds to age, wear, temp, handling, etc.
 	and none of those can be assumed to be independent.  those are the
 	"real reasons", but most can't be measured directly outside a lab
 	and the number of combinatorial interactions is huge.

 	- factorial analysis of the data.  temperature is a good
 	example, because both low and high temperature affect AFR,
 	and in ways that interact with age and/or utilization.  this
 	is a common issue in medical studies, which are strikingly
 	similar in design (outcome is subject or disk dies...)  there
 	is a well-established body of practice for factorial analysis.

 	- recognition that the relative results are actually quite good,
 	even if the absolute results are not amazing.  for instance,
 	assume we have 1k drives, and a 10% overall failure rate.  using
 	all SMART but temp detects 64 of the 100 failures and misses 36.
 	essentially, the failure rate is now .036.  I'm guessing that if
 	utilization and temperature were included, the rate would be much
 	lower.  feedback from active testing (especially scrubbing)
 	and performance under the normal workload would also help.
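
A minimal Python sketch of the arithmetic in that last item; the counts are
the hypothetical figures from the message (1,000 drives, 10% failure rate,
64 of 100 failures flagged), not measured data:

    # Hypothetical figures from the message above, not measured data.
    drives = 1000
    failures = 100                 # 10% overall failure rate
    flagged_by_smart = 64          # failures preceded by some SMART signal

    missed = failures - flagged_by_smart          # 36 false negatives
    undetected_rate = missed / drives             # 36 / 1000 = 0.036

    print(f"failures missed by SMART: {missed}")
    print(f"undetected-failure rate:  {undetected_rate:.3f}")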

in other words, I find the paper quite encouraging, even inspiring!
while the raw failure rates are almost shocking, monitoring and replacement
appears to give reasonable results.  a particular "treatment schedule"
would need to minimize false positives as well, unless disks taken out 
because of warning signs can be re-validated in some way.

(my organization has around 6300 disks and no coherent monitoring so far.)

regards, mark hahn.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: PATA/SATA Disk Reliability paper
  2007-02-25 19:58                     ` Mark Hahn
@ 2007-02-25 21:07                       ` Al Boldi
  2007-02-25 22:14                         ` Mark Hahn
  0 siblings, 1 reply; 22+ messages in thread
From: Al Boldi @ 2007-02-25 21:07 UTC (permalink / raw)
  To: Mark Hahn; +Cc: linux-raid

Mark Hahn wrote:
>  	- disks are very complicated, so their failure rates are a
>  	combination of conditional failure rates of many components.
>  	to take a fully reductionist approach would require knowing
>  	how each of ~1k parts responds to age, wear, temp, handling, etc.
>  	and none of those can be assumed to be independent.  those are the
>  	"real reasons", but most can't be measured directly outside a lab
>  	and the number of combinatorial interactions is huge.

It seems to me that the biggest problem is the 7.2k+ rpm platters 
themselves, especially with those heads flying closely on top of them.  So, 
we can probably forget the rest of the ~1k non-moving parts, as they have 
proven to be pretty reliable, most of the time.

>  	- factorial analysis of the data.  temperature is a good
>  	example, because both low and high temperature affect AFR,
>  	and in ways that interact with age and/or utilization.  this
>  	is a common issue in medical studies, which are strikingly
>  	similar in design (outcome is subject or disk dies...)  there
>  	is a well-established body of practice for factorial analysis.

Agreed.  We definitely need more sensors.

>  	- recognition that the relative results are actually quite good,
>  	even if the absolute results are not amazing.  for instance,
>  	assume we have 1k drives, and a 10% overall failure rate.  using
>  	all SMART but temp detects 64 of the 100 failures and misses 36.
>  	essentially, the failure rate is now .036.  I'm guessing that if
>  	utilization and temperature were included, the rate would be much
>  	lower.  feedback from active testing (especially scrubbing)
>  	and performance under the normal workload would also help.

Are you saying you are content with premature disk failure, as long as 
there is a SMART warning sign?

If so, then I don't think that is enough.

I think the sensors should trigger some kind of shutdown mechanism as a 
protective measure, when some threshold is reached.  Just like the 
protective measure you see for CPUs to prevent meltdown.

Thanks!

--
Al


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: PATA/SATA Disk Reliability paper
  2007-02-25 21:07                       ` Al Boldi
@ 2007-02-25 22:14                         ` Mark Hahn
  2007-02-25 22:46                           ` Benjamin Davenport
  0 siblings, 1 reply; 22+ messages in thread
From: Mark Hahn @ 2007-02-25 22:14 UTC (permalink / raw)
  To: Al Boldi; +Cc: linux-raid

>>  	and none of those can be assumed to be independent.  those are the
>>  	"real reasons", but most can't be measured directly outside a lab
>>  	and the number of combinatorial interactions is huge.
>
> It seems to me that the biggest problem are the 7.2k+ rpm platters
> themselves, especially with those heads flying closely on top of them.  So,
> we can probably forget the rest of the ~1k non-moving parts, as they have
> proven to be pretty reliable, most of the time.

donno.  non-moving parts probably have much higher reliability, but 
so many of them makes them a concern.  if a discrete resistor has 
a 1e9 hour MTBF, 1k of them are 1e6 and that's starting to approach
the claimed MTBF of a disk.  any lower (or more components) and it
takes over as a dominant failure mode...
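
A minimal Python sketch of the series-MTBF arithmetic above, assuming the
usual constant-failure-rate (exponential) model in which component failure
rates simply add; the figures are the hypothetical ones from the message:

    # Constant-failure-rate assumption: MTBF_total = 1 / sum(1 / MTBF_i).
    def series_mtbf(mtbf_hours, count):
        """Combined MTBF of `count` identical parts, any one failure failing the unit."""
        return 1.0 / (count * (1.0 / mtbf_hours))

    resistor_mtbf = 1e9   # hours, hypothetical figure from the message
    print(series_mtbf(resistor_mtbf, 1000))   # 1e6 hours for 1k such parts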

the Google paper doesn't really try to diagnose, but it does indicate
that metrics related to media/head problems tend to promptly lead to failure.
(scan errors, reallocations, etc.)  I guess that's circumstantial support
for your theory that crashes of media/heads are the primary failure mode.

>>  	- factorial analysis of the data.  temperature is a good
>>  	example, because both low and high temperature affect AFR,
>>  	and in ways that interact with age and/or utilization.  this
>>  	is a common issue in medical studies, which are strikingly
>>  	similar in design (outcome is subject or disk dies...)  there
>>  	is a well-established body of practice for factorial analysis.
>
> Agreed.  We definitely need more sensors.

just to be clear, I'm not saying we need more sensors, just that the 
existing metrics (including temp and utilization) need to be considered
jointly, not independently.  more metrics would be better as well,
assuming they're direct readouts, not idiot-lights...
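
As a rough illustration of treating the metrics jointly rather than one
threshold at a time, here is a minimal Python sketch that cross-tabulates
failure rate by temperature band and age band; the records are made-up
placeholders, not data from the paper:

    from collections import defaultdict

    # Made-up per-drive records: (temp_band, age_band, failed) -- placeholders only.
    records = [
        ("<30C", "young", True),   ("<30C", "young", False),
        ("30-40C", "young", False), ("30-40C", "old", False),
        (">40C", "old", True),      (">40C", "old", True),
        ("30-40C", "old", True),    ("<30C", "old", False),
    ]

    cells = defaultdict(lambda: [0, 0])          # (temp, age) -> [failed, total]
    for temp, age, failed in records:
        cells[(temp, age)][0] += int(failed)
        cells[(temp, age)][1] += 1

    # Failure rate per (temperature, age) cell -- the joint view, not two
    # independent one-way breakdowns.
    for (temp, age), (bad, total) in sorted(cells.items()):
        print(f"{temp:>7} / {age:<5}: {bad}/{total} failed")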

>>  	and performance under the normal workload would also help.
>
> Are you saying, you are content with pre-mature disk failure, as long as
> there is a smart warning sign?

I'm saying that disk failures are inevitable.  ways to reduce the chance
of data loss are what we have to focus on.  the Google paper shows that 
disks like to be at around 35C - not too cool or hot (though this is probably
conflated with utilization.)  the paper also shows that warning signs can 
indicate a majority of failures (though it doesn't present the factorial 
analysis necessary to tell which ones, how well, avoid false-positives, etc.)

> I think the sensors should trigger some kind of shutdown mechanism as a
> protective measure, when some threshold is reached.  Just like the
> protective measure you see for CPUs to prevent meltdown.

but they already do.  persistent bad reads or writes to a block will trigger
its reallocation to spares, etc.  for CPUs, the main threat is heat, and it's 
easy to throttle to cool down.  for disks, the main threat is probably wear, 
which seems quite different - more catastrophic and less mitigatable
once it starts.

I'd love to hear from an actual drive engineer on the failure modes 
they worry about...

regards, mark hahn.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: PATA/SATA Disk Reliability paper
  2007-02-25 22:14                         ` Mark Hahn
@ 2007-02-25 22:46                           ` Benjamin Davenport
  2007-02-25 23:58                             ` Mark Hahn
  0 siblings, 1 reply; 22+ messages in thread
From: Benjamin Davenport @ 2007-02-25 22:46 UTC (permalink / raw)
  To: linux-raid

Mark Hahn wrote:
| if a discrete resistor has a 1e9 hour MTBF, 1k of them are 1e6

That's not actually true.  As a (contrived) example, consider two cases.

Case 1: failures occur at constant rate from hours 0 through 2e9.
Case 2: failures occur at constant rate from 1e9-10 hours through 1e9+10 hours.

Clearly in the former case, over 1000 components there will almost certainly be
a failure by 1e8 hours.  In the latter case, there will not be.  Yet both have
the same MTTF.


MTTF says nothing about the shape of the failure curve.  It indicates only where
its midpoint is.  To compute the MTTF of 1000 devices, you'll need to know the
probability distribution of failures over time of those 1000 devices, which can
be computed from the distribution of failures over time for a single device.
But, although MTTF is derived from this distribution, you cannot reconstruct the
distribution knowing only MTTF.  In fact, the recent papers on disk failure
indicate that common assumptions about the shape of that distribution (either a
bathtub curve, or increasing failures due to wear-out after 3ish years) do not hold.
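
A minimal Python sketch of the two contrived cases above: both failure-time
distributions have the same MTTF of 1e9 hours, yet the chance that at least
one of 1,000 components has failed by 1e8 hours is entirely different:

    # Case 1: failure times uniform on [0, 2e9] hours          (mean 1e9).
    # Case 2: failure times uniform on [1e9 - 10, 1e9 + 10] h  (mean 1e9).
    def p_fail_by(t, lo, hi):
        """P(a single component has failed by time t) for a uniform failure time."""
        return min(max((t - lo) / (hi - lo), 0.0), 1.0)

    t = 1e8
    n = 1000
    for name, lo, hi in [("case 1", 0.0, 2e9), ("case 2", 1e9 - 10, 1e9 + 10)]:
        p_one = p_fail_by(t, lo, hi)
        p_any = 1.0 - (1.0 - p_one) ** n
        print(f"{name}: P(single fails by 1e8 h) = {p_one:.3f}, "
              f"P(any of {n} fails) = {p_any:.3f}")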

-Ben

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: PATA/SATA Disk Reliability paper
  2007-02-25 22:46                           ` Benjamin Davenport
@ 2007-02-25 23:58                             ` Mark Hahn
  0 siblings, 0 replies; 22+ messages in thread
From: Mark Hahn @ 2007-02-25 23:58 UTC (permalink / raw)
  To: Benjamin Davenport; +Cc: linux-raid

> | if a discrete resistor has a 1e9 hour MTBF, 1k of them are 1e6
>
> That's not actually true.  As a (contrived) example, consider two cases.

if you know nothing else, it's the best you can do.  it's also a 
conservative estimate (where conservative means to expect a failure sooner).

> distribution knowing only MTTF.  In fact, the recent papers on disk failure
> indicate that common assumptions about the shape of that distribution (either
> a bathtub curve, or increasing failures due to wear-out after 3ish years) do
> not hold.

the data in both the Google and Schroeder&Gibson papers are fairly noisy.
yes, the "strong bathtub hypothesis" is apparently wrong (that infant
mortality is an exp decreasing failure rate over the first year, that
disks stay at a constant failure rate for the next 4-6 years, then have 
an exp increasing failure rate).

both papers, though, show what you might call a "swimming pool" curve:
a short period of high mortality (clock starts when the drive leaves 
the factory) with a minimum failure rate at about 1 year.  that's the 
deep end of the pool ;)  then increasing failures out to the end of 
expected service life (warranty period).  what happens after is probably
too noisy to conclude much, since most people prefer not to use disks 
which have already seen the death of ~25% of their peers.  (Google's 
paper has, hallelujah, error bars showing high variance at >3 years.)

both papers (and most people's experience, I think) agree that:
 	- there may be an infant mortality curve, but it depends on
 	when you start counting, conditions and load in early life, etc.
 	- failure rates increase with age.
 	- failure rates in the "prime of life" are dramatically higher
 	than the vendor spec sheets.
 	- failure rates in senescence (post warranty) are very bad.

after all, real bathtubs don't have flat bottoms!

as for models and fits, well, it's complicated.  consider that in a lot
of environments, it takes a year or two for a new disk array to fill.
so a wear-related process will initially be focused on a small area of 
disk, perhaps not even spread across individual disks.  or consider that
once the novelty of a new installation wears off, people get more worried
about failures, perhaps altering their replacement strategy...
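
One way to picture the curve shapes being discussed is the hazard rate of a
Weibull distribution, a standard textbook model (an assumption of this
sketch, not the model used by either paper): shape < 1 gives the falling
infant-mortality end, shape = 1 the flat bottom, and shape > 1 the rising
wear-out end.  A minimal Python sketch:

    def weibull_hazard(t, shape, scale):
        """Instantaneous failure rate h(t) = (shape/scale) * (t/scale)**(shape - 1)."""
        return (shape / scale) * (t / scale) ** (shape - 1)

    scale = 5.0  # characteristic life in years (illustrative value only)
    for years in (0.5, 1, 2, 3, 4, 5):
        falling = weibull_hazard(years, 0.5, scale)   # infant-mortality-like
        flat    = weibull_hazard(years, 1.0, scale)   # constant-rate "flat bottom"
        rising  = weibull_hazard(years, 3.0, scale)   # wear-out end of the tub
        print(f"{years:>4} y: falling={falling:.3f}  flat={flat:.3f}  rising={rising:.3f}")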

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: PATA/SATA Disk Reliability paper
  2007-02-19 11:26 ` Al Boldi
  2007-02-19 21:42   ` Eyal Lebedinsky
@ 2007-02-26 14:15   ` Mario 'BitKoenig' Holbe
  2007-02-26 17:46     ` Al Boldi
  1 sibling, 1 reply; 22+ messages in thread
From: Mario 'BitKoenig' Holbe @ 2007-02-26 14:15 UTC (permalink / raw)
  To: linux-raid

Al Boldi <a1426z@gawab.com> wrote:
> Interesting link.  They seem to point out that SMART does not necessarily warn of 
> pending failure.  This is probably worse than not having SMART at all, as it 
> gives you the illusion of safety.

If SMART gives you the illusion of safety, you didn't understand SMART.
SMART hints *only* at the potential presence or occurrence of failures in
the future; it does not prove the absence of such - and nobody ever said
it does. It would even be impossible to do that, though (which is easy
to prove by just utilizing an external damaging tool like a hammer).
Concluding from that that not having any failure detector at all is
better than having at least an imperfect one is IMHO completely wrong.


regards
   Mario
-- 
File names are infinite in length where infinity is set to 255 characters.
                                -- Peter Collinson, "The Unix File System"


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: PATA/SATA Disk Reliability paper
  2007-02-26 14:15   ` Mario 'BitKoenig' Holbe
@ 2007-02-26 17:46     ` Al Boldi
  0 siblings, 0 replies; 22+ messages in thread
From: Al Boldi @ 2007-02-26 17:46 UTC (permalink / raw)
  To: linux-raid

Mario 'BitKoenig' Holbe wrote:
> Al Boldi <a1426z@gawab.com> wrote:
> > Interesting link.  They seem to point out that SMART does not necessarily
> > warn of pending failure.  This is probably worse than not having SMART
> > at all, as it gives you the illusion of safety.
>
> If SMART gives you the illusion of safety, you didn't understand SMART.
> SMART hints *only* at the potential presence or occurrence of failures in
> the future; it does not prove the absence of such - and nobody ever said
> it does. It would even be impossible to do that, though (which is easy
> to prove by just utilizing an external damaging tool like a hammer).
> Concluding from that that not having any failure detector at all is
> better than having at least an imperfect one is IMHO completely wrong.

Agreed.  But would you then call it SMART?  Sounds rather DUMB.


Thanks!

--
Al


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: PATA/SATA Disk Reliability paper
  2007-02-22 23:30         ` Stephen C Woods
  2007-02-23 18:22           ` Al Boldi
@ 2007-02-27 19:06           ` Bill Davidsen
  1 sibling, 0 replies; 22+ messages in thread
From: Bill Davidsen @ 2007-02-27 19:06 UTC (permalink / raw)
  To: Stephen C Woods; +Cc: Nix, Al Boldi, Eyal Lebedinsky, linux-raid

Stephen C Woods wrote:
>    As he leans on his cane, the old codger says....
> Well, disks used to come in open canisters; that is, you took the bottom
> cover off, put the whole pack into the drive, and then unscrewed the top
> cover and took it out.  Clearly ventilated.  c. 1975.
>
>   Later we got sealed drives, Kennedy 180 MB Winchesters they were
> called (they used IBM 3030 technology).  They had a vent pipe with two
> filters; you replaced the outer one every 90 days (as part of the PM
> process).  The inner one you didn't touch.  Apparently they figured that
> it'd be a long time before the inner one got really clogged at 10 minutes'
> exposure every 90 days.  c. 1980.
>
>   Still later we had a mainframe running Un*x; it used IBM 3080 drives.
> These had huge HDA boxes that were sealed but had vent filters that had
> to be changed every PM (30 days; 2 hours of downtime to do them
> all).  c. 1985.
>
>   So drives do need to be ventilated, not so much from worry about exploding,
> but rather because of subtle distortion of the case as the atmospheric
> pressure changes.
>
>    Does anyone remember that you had to let your drives acclimate to your
> machine room for a day or so before you used them?
>
>    Ah the good old days...
>      HUH???
>
>   <scw>
I remember the DSU-10, 16 million 36-bit words of storage, which not 
only wanted to be acclimatized, but had platters so large, over a meter 
in diameter, that there was a short crane mounting point on the box. 
Failure rate went WAY down after better air filters were installed.

I think they were made for GE by CDC, but never knew for sure. GE was a 
mainframe manufacturer until 1970; their big claim to fame was the 
GE-645, the development platform for MULTICS. They sold the computer 
business, mainframe and industrial control, in 1970 to put money into 
nuclear energy, and haven't built a power plant since. Then they 
developed a personal computer in 1978, built a plant to manufacture it 
in Waynesboro VA, and decided there was no market for a small computer.

-- 
bill davidsen <davidsen@tmr.com>
  CTO TMR Associates, Inc
  Doing interesting things with small computers since 1979


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: PATA/SATA Disk Reliability paper
  2007-02-25 17:40                 ` Mark Hahn
       [not found]                   ` <200702252057.22963.a1426z@gawab.com>
@ 2007-02-27 19:21                   ` Bill Davidsen
  1 sibling, 0 replies; 22+ messages in thread
From: Bill Davidsen @ 2007-02-27 19:21 UTC (permalink / raw)
  To: Mark Hahn; +Cc: Al Boldi, linux-raid

Mark Hahn wrote:
>>>> In contrast, ever since these holes appeared, drive failures became
>>>> the norm.
>>>
>>> wow, great conspiracy theory!
>>
>> I think you misunderstand.  I just meant plain old-fashioned 
>> mis-engineering.
>
> I should have added a smilie.  but I find it dubious that the whole 
> industry would have made a major bungle if so many failures are due to 
> the hole...
>
>> But remember, the google report mentions a great number of drives failing
>> for no apparent reason, not even a smart warning, so failing within the
>> warranty period is just pure luck.
>
> are we reading the same report?  I look at it and see:
>
>         - lowest failures from medium-utilization drives, 30-35C.
>         - higher failures from young drives in general, but especially
>         if cold or used hard.
>         - higher failures from end-of-life drives, especially > 40C.
>     - scan errors, realloc counts, offline realloc and probation
>     counts are all significant in drives which fail.
>
> the paper seems unnecessarily gloomy about these results.  to me, they're
> quite exciting, and provide good reason to pay a lot of attention to these
> factors.  I hate to criticize such a valuable paper, but I think they've
> missed a lot by not considering the results in a fully factorial analysis
> as most medical/behavioral/social studies do.  for instance, they bemoan
> a 56% false negative rate from only SMART signals, and mention that if
> >40C is added, the FN rate falls to 36%.  also incorporating the low-young
> risk factor would help.  I would guess that a full-on model, especially
> if it incorporated utilization, age, and performance, could bring the FN
> rate down to comfortable levels.
The big thing I notice is that drives with SMART errors are quite likely 
to fail, but drives which fail aren't all that likely to have SMART 
errors. So while I might proactively move a drive with errors out or to 
non-critical service, seeing no errors doesn't mean the drive won't fail.
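
A minimal Python sketch of that asymmetry, with made-up counts rather than
figures from the paper: the fraction of flagged drives that later fail can
be high even while the fraction of failed drives that were ever flagged
stays low.

    # Made-up 2x2 counts for illustration only.
    failed_with_smart_error    = 60
    failed_without_smart_error = 90
    healthy_with_smart_error   = 20
    healthy_without_error      = 830   # completes the population, unused below

    p_fail_given_error = failed_with_smart_error / (
        failed_with_smart_error + healthy_with_smart_error)          # 0.75
    p_error_given_fail = failed_with_smart_error / (
        failed_with_smart_error + failed_without_smart_error)        # 0.40

    print(f"P(fail | SMART error) = {p_fail_given_error:.2f}  (worth acting on)")
    print(f"P(SMART error | fail) = {p_error_given_fail:.2f}  (no error != safe)")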

I haven't looked at drive temp vs. ambient; I am collecting what data I 
can, but I no longer have thousands of drives to monitor (I'm grateful).

Interesting speculation: on drives with cyclic load, does spinning down 
off-shift help or hinder? I have two boxes full of WD, Seagate and 
Maxtor drives, all cheap commodity drives, which have about 6.8 years 
power on time, 11-14 power cycles, and 2200-2500 spin-up cycles, due to 
spin down nights and weekends. Does anyone have a large enough 
collection of similar use drives to contribute results?
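
For anyone who does collect such records, a minimal Python sketch of the
comparison being asked for, over made-up (spin-up cycles, failed) tuples;
the threshold separating the two groups is arbitrary:

    # Made-up (spin_up_cycles, failed) records -- placeholders, not real drives.
    drives = [
        (2300, False), (2450, True), (2200, False), (2500, False),  # spun down off-shift
        (40, False), (55, True), (38, False), (60, True),           # left spinning
    ]

    CYCLED_THRESHOLD = 1000   # arbitrary split between "cycled" and "always-on"
    groups = {"cycled": [0, 0], "always-on": [0, 0]}                 # [failed, total]
    for cycles, failed in drives:
        key = "cycled" if cycles >= CYCLED_THRESHOLD else "always-on"
        groups[key][0] += int(failed)
        groups[key][1] += 1

    for name, (bad, total) in groups.items():
        print(f"{name:>9}: {bad}/{total} failed ({bad / total:.0%})")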

-- 
bill davidsen <davidsen@tmr.com>
  CTO TMR Associates, Inc
  Doing interesting things with small computers since 1979


^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2007-02-27 19:21 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-02-18 18:50 PATA/SATA Disk Reliability paper Richard Scobie
2007-02-19 11:26 ` Al Boldi
2007-02-19 21:42   ` Eyal Lebedinsky
2007-02-20 12:15     ` Al Boldi
2007-02-22 22:27       ` Nix
2007-02-22 22:30         ` Nix
2007-02-22 23:30         ` Stephen C Woods
2007-02-23 18:22           ` Al Boldi
2007-02-24 22:27             ` Mark Hahn
2007-02-25 11:22               ` Al Boldi
2007-02-25 17:40                 ` Mark Hahn
     [not found]                   ` <200702252057.22963.a1426z@gawab.com>
2007-02-25 19:58                     ` Mark Hahn
2007-02-25 21:07                       ` Al Boldi
2007-02-25 22:14                         ` Mark Hahn
2007-02-25 22:46                           ` Benjamin Davenport
2007-02-25 23:58                             ` Mark Hahn
2007-02-27 19:21                   ` Bill Davidsen
2007-02-25 19:02               ` Richard Scobie
2007-02-27 19:06           ` Bill Davidsen
2007-02-26 14:15   ` Mario 'BitKoenig' Holbe
2007-02-26 17:46     ` Al Boldi
2007-02-20  3:03 ` H. Peter Anvin
