From: Justin Piszcz <jpiszcz@lucidpixels.com>
To: David Greaves <david@dgreaves.com>
Cc: Lars Schimmer <l.schimmer@cgv.tugraz.at>,
	LinuxRaid <linux-raid@vger.kernel.org>
Subject: Re: MD Feature Request: non-degraded component replacement
Date: Tue, 16 Dec 2008 18:25:28 -0500 (EST)	[thread overview]
Message-ID: <alpine.DEB.1.10.0812161824040.31132@p34.internal.lan> (raw)
In-Reply-To: <4947A585.4030203@dgreaves.com>



On Tue, 16 Dec 2008, David Greaves wrote:

> Justin Piszcz wrote:
>> On Tue, 16 Dec 2008, Lars Schimmer wrote:
>>> Justin Piszcz wrote:
>>>> On Tue, 16 Dec 2008, David Greaves wrote:
>>>>> of course that's just one opinion after replacing about 20 flaky 1Tb
>>>>> drives in
>>>>> the past 6 months :)
>>>> What were the make/model of those drives, how did they fail?
>>>
>>> Far more important: how much do you have in production?
>>> As I've got roughly 15 Seagate 1 TB HDs here and not one of them has
>>> failed in the last year.
>>> And 20 failures out of 30 running is really bad, but 20 out of 500
>>> running is not as bad as it seems ;-)
>> Agreed, but I would still be interested in the make/model, what
>> controller they were attached to, and how they failed.
>
> This is a home environment; (MythTV doncha know).
>
> I bought 9 Samsung HD103UJ 1Tb drives in June 2008.
>
> Since June I have RMAed 5 of the original 9.
> I have then RMAed 3 of the 5 replacements.
> I have then RMAed 2 of the 3 re-replacements.
> And finally I RMAed 1 of the 2 re-re-replacements. (I think - I was confused at
> this point - I have a list of 18+ serial numbers anyway)
>
> In November (ish) Samsung did the decent thing and replaced all 9 with HE103UJ
> (enterprise) drives; no 'moaning' about using them in RAID etc.
>
> This weekend I replaced 3 of the HE models that were displaying essentially the
> same problems (all on the same machine - the vast majority of the problems were
> in this machine and, as it happens, the 3 in the md array).
> During the replication I got a real media failure.
>
> Anyhow...
>
> I am using Dell SC420 chassis (SOHO class).
> I am running 2.6.18-xen on one system, 2.6.25.4 on another. The controllers are
> cheap dual-channel Sil24 PCIe cards and the Dell onboard controller.
>
> Since I found smartctl -l scttempsts I can see that the peak temperature is 44C.
> They are running in Dell servers in a cool environment; and previously these
> servers supported many more drives.
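[For reference, the SCT temperature logs David mentions can be read as below. This is a sketch: /dev/sda is a placeholder for the drive to inspect, and the drive must actually support SCT for these logs to exist.]

```shell
DEV=/dev/sda   # placeholder: substitute the drive you want to inspect
# Only attempt this if smartmontools is installed and the device exists
if command -v smartctl >/dev/null 2>&1 && [ -b "$DEV" ]; then
    smartctl -l scttempsts "$DEV"    # current, min/max and lifetime peak temperatures
    smartctl -l scttemphist "$DEV"   # logged temperature history over time
fi
```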
>
> I had one smart DMA error which I'll attribute to a transient problem with a cable.
>
> All the other 'problems' are when SMART long self tests show eg:
> 21 # 1  Extended offline    Completed: read failure       90%       424         4239
> and
>  1 Raw_Read_Error_Rate     0x000f   100   100   051    Pre-fail  Always       -       62
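[The self-test log above is produced by smartctl's extended offline test. A sketch of the usual invocation, with /dev/sda as a placeholder device:]

```shell
DEV=/dev/sda   # placeholder: substitute the suspect drive
if command -v smartctl >/dev/null 2>&1 && [ -b "$DEV" ]; then
    smartctl -t long "$DEV"      # start an extended (long) offline self-test
    # ...wait for the drive's estimated completion time, then:
    smartctl -l selftest "$DEV"  # print the self-test log quoted above
fi
```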
>
> I have had some OS-level issues, but I've not recorded them as I'm taking
> the SMART self-test to be enough to indicate dodgy disks.
>
> I've never had any with Reallocated_Sector_Ct != 0
>
> I also note that the smart self test log does indeed show inconsistent summary
> messages:
> # 1  Short offline       Completed: read failure       20%      1236         1953517887
> # 2  Short offline       Aborted by host               20%      1212         -
> # 3  Short offline       Aborted by host               10%      1188         -
> # 4  Short offline       Aborted by host               10%      1164         -
>
> In fact each log shows "Completed: read failure" until the next log pushes it
> down the stack; at that point it shows "Aborted by host". The % remaining is
> key. Discussion on the smart list suggests that this is a firmware bug. (Indeed
> this is now fixed on some newer RMA replacements).
>
> Also note that the LBA failure has been different (but very similar) for each
> drive but consistent once it occurs. It often but not always goes away if I
> force (dd) a read/write of the reported sector.
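[The forced read/write of a reported sector can be sketched as below, using the LBA 4239 from the self-test log earlier in this mail as an example; /dev/sdX is a placeholder and the destructive write is deliberately left commented out.]

```shell
LBA=4239   # example: the failing sector reported by the self-test log
# On a 512-byte-sector drive, this LBA sits at byte offset LBA * 512
echo "byte offset: $((LBA * 512))"
# Read just that sector; a pending (unreadable) sector makes dd error out:
#   dd if=/dev/sdX of=/dev/null bs=512 skip="$LBA" count=1 iflag=direct
# Writing it back forces the drive to remap the sector if it is truly bad
# (this DESTROYS the 512 bytes of data stored there):
#   dd if=/dev/zero of=/dev/sdX bs=512 seek="$LBA" count=1 oflag=direct
```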
>
> I am in touch with a guy at Samsung who is interested in the problem but I've
> not had any tech feedback.
>
> David
> PS Thanks to Samsung's excellent advance-replacement RMA service I have been able
> to deal with these problems. No other drive maker offers this service in the UK
> AFAIK. Of course I have spent *days* just ddrescue-ing disks. But I've not had
> to use a backup yet despite *loads* of dual-drive+ failures.

Many thanks for this information. Have you run other disks in the system 
without issue?  BTW: I have seen this with the Velociraptors as well:

> In fact each log shows "Completed: read failure" until the next log pushes it
> down the stack; at that point it shows "Aborted by host". The % remaining is
> key. Discussion on the smart list suggests that this is a firmware bug. (Indeed
> this is now fixed on some newer RMA replacements).

When a drive is about to crap out it will start doing that: it will abort 
the test or run forever..

I take it you are running RAID6 with these disks?
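[The array level and per-member status can be confirmed as sketched below; /dev/md0 is a placeholder array name, and mdadm --detail needs root.]

```shell
# /proc/mdstat shows each array's RAID level and member status ([UU_U] etc.)
[ -r /proc/mdstat ] && cat /proc/mdstat || true
# For per-device detail (state, spares, degraded flag), as root:
#   mdadm --detail /dev/md0
```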

Justin.

Thread overview: 9+ messages
2008-12-16  9:36 MD Feature Request: non-degraded component replacement David Greaves
2008-12-16  9:51 ` Justin Piszcz
2008-12-16 10:55   ` Lars Schimmer
2008-12-16 11:37     ` Justin Piszcz
2008-12-16 12:56       ` David Greaves
2008-12-16 14:38         ` Lars Schimmer
2008-12-16 23:25         ` Justin Piszcz [this message]
2008-12-17  0:20           ` David Greaves
2008-12-19  4:11 ` Neil Brown
