* MD Feature Request: non-degraded component replacement
@ 2008-12-16  9:36 David Greaves
  2008-12-16  9:51 ` Justin Piszcz
  2008-12-19  4:11 ` Neil Brown
  0 siblings, 2 replies; 9+ messages in thread
From: David Greaves @ 2008-12-16  9:36 UTC (permalink / raw)
  To: Neil Brown; +Cc: LinuxRaid

Hi Neil

I brought this up in October but got no response - since you seem to be on a
roll I thought I'd try again...

Summary: Add a spare and 'mirror-fail' a device. The spare is synced with the
to-be-removed device and any read errors are corrected from the remaining raid
devices. Once synced, the  to-be-removed device is failed and the spare takes
its place. At no point is the array degraded.

IMHO this one should be high on the todo list, especially if it's a
prerequisite for other improvements to resilience.

Right now, if a drive fails or shows signs of going bad then you get into a
very risky situation.

I'm sure most here know that the risk is because removing the failing drive and
installing a good one to re-sync puts you in a very vulnerable position; if
another drive fails (even one bad block) then you lose data.
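
To put a rough, made-up number on it (drive sizes and error rates assumed,
not measured):

  data read from the survivors of a 4 x 1TB RAID5 rebuild ~ 3TB ~ 2.4e13 bits
  commonly quoted consumer unrecoverable-read-error rate  ~ 1 per 1e14 bits
  expected read errors during one rebuild                 ~ 2.4e13/1e14 ~ 0.24

So a single latent bad sector on one of the "good" drives has a very real
chance of costing you data during a conventional fail-and-resync.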

The solution involves raid1 - but it needs a twist of raid5/6 and it was
discussed ages ago; see:
  http://arctic.org/~dean/proactive-raid5-disk-replacement.txt


I think this is what was discussed:

Assume md0 has drives A B C D
D is failing
E is new

* add E as spare
* set E to mirror 'failing' drive D (with bitmap?)
* subsequent writes go to both D+E
* recover 99+% of data from D to E by simple mirroring
* any read failures on D, when reading from md0 or when mirroring D->E, are
recovered by reading A, B and C - not E, unless E is already in sync. D is not
failed out. (It's these tricks that stop users from doing all this manually.)
* any md0 sector read failure on A, B or C can still (hopefully) be satisfied
from D, even if that sector has not yet been mirrored to E (also not possible
if done manually)
* once E is fully mirrored, D is removed and the job is done (a command-level
sketch follows below)
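
In mdadm terms the interface I'm imagining might look something like this
(the --replace/--with syntax is purely hypothetical - mdadm has no such
options today - and the device names are made up):

  # add E (/dev/sde1) to md0 as a spare
  mdadm /dev/md0 --add /dev/sde1

  # copy D onto E, repairing any unreadable blocks on D from the other
  # members, and only fail D out once E is fully in sync
  mdadm /dev/md0 --replace /dev/sdd1 --with /dev/sde1

  # the array never runs degraded while the copy is in progress
  cat /proc/mdstat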

Personally I think this feature is more important than the reshaping requests;
of course that's just one opinion after replacing about 20 flaky 1TB drives in
the past 6 months :)

David

-- 
"Don't worry, you'll be fine; I saw it work in a cartoon once..."


* Re: MD Feature Request: non-degraded component replacement
  2008-12-16  9:36 MD Feature Request: non-degraded component replacement David Greaves
@ 2008-12-16  9:51 ` Justin Piszcz
  2008-12-16 10:55   ` Lars Schimmer
  2008-12-19  4:11 ` Neil Brown
  1 sibling, 1 reply; 9+ messages in thread
From: Justin Piszcz @ 2008-12-16  9:51 UTC (permalink / raw)
  To: David Greaves; +Cc: Neil Brown, LinuxRaid



On Tue, 16 Dec 2008, David Greaves wrote:

> Personally I think this feature is more important than the reshaping requests;
> of course that's just one opinion after replacing about 20 flaky 1TB drives in
> the past 6 months :)
What were the make/model of those drives, how did they fail?


* Re: MD Feature Request: non-degraded component replacement
  2008-12-16  9:51 ` Justin Piszcz
@ 2008-12-16 10:55   ` Lars Schimmer
  2008-12-16 11:37     ` Justin Piszcz
  0 siblings, 1 reply; 9+ messages in thread
From: Lars Schimmer @ 2008-12-16 10:55 UTC (permalink / raw)
  To: LinuxRaid


Justin Piszcz wrote:
> 
> 
> On Tue, 16 Dec 2008, David Greaves wrote:
> 
>> Personally I think this feature is more important than the reshaping
>> requests; of course that's just one opinion after replacing about 20 flaky
>> 1TB drives in the past 6 months :)
> What were the make/model of those drives, how did they fail?

Far more important: how many do you have in production?
I have roughly 15 Seagate 1 TB HDs here and not one of them has failed in
the last year.
And 20 failed out of 30 running is really bad, but 20 out of 500 is not as
bad as it seems ;-)

Regards,
Lars Schimmer
--
-------------------------------------------------------------
TU Graz, Institut für ComputerGraphik & WissensVisualisierung
Tel: +43 316 873-5405       E-Mail: l.schimmer@cgv.tugraz.at
Fax: +43 316 873-5402       PGP-Key-ID: 0x4A9B1723


* Re: MD Feature Request: non-degraded component replacement
  2008-12-16 10:55   ` Lars Schimmer
@ 2008-12-16 11:37     ` Justin Piszcz
  2008-12-16 12:56       ` David Greaves
  0 siblings, 1 reply; 9+ messages in thread
From: Justin Piszcz @ 2008-12-16 11:37 UTC (permalink / raw)
  To: Lars Schimmer; +Cc: LinuxRaid



On Tue, 16 Dec 2008, Lars Schimmer wrote:

> Justin Piszcz wrote:
>>
>>
>> On Tue, 16 Dec 2008, David Greaves wrote:
>>
>>> Personally I think this feature is more important than the reshaping
>>> requests; of course that's just one opinion after replacing about 20 flaky
>>> 1TB drives in the past 6 months :)
>> What were the make/model of those drives, how did they fail?
>
> Far more important: how many do you have in production?
> I have roughly 15 Seagate 1 TB HDs here and not one of them has failed in
> the last year.
> And 20 failed out of 30 running is really bad, but 20 out of 500 is not as
> bad as it seems ;-)
Agreed, but I would still be interested in the make/model, what controller
they were attached to, and how they failed.


* Re: MD Feature Request: non-degraded component replacement
  2008-12-16 11:37     ` Justin Piszcz
@ 2008-12-16 12:56       ` David Greaves
  2008-12-16 14:38         ` Lars Schimmer
  2008-12-16 23:25         ` Justin Piszcz
  0 siblings, 2 replies; 9+ messages in thread
From: David Greaves @ 2008-12-16 12:56 UTC (permalink / raw)
  To: Justin Piszcz; +Cc: Lars Schimmer, LinuxRaid

Justin Piszcz wrote:
> On Tue, 16 Dec 2008, Lars Schimmer wrote:
>> Justin Piszcz wrote:
>>> On Tue, 16 Dec 2008, David Greaves wrote:
>>>> of course that's just one opinion after replacing about 20 flaky 1TB
>>>> drives in the past 6 months :)
>>> What were the make/model of those drives, how did they fail?
>>
>> Far more important: how many do you have in production?
>> I have roughly 15 Seagate 1 TB HDs here and not one of them has failed in
>> the last year.
>> And 20 failed out of 30 running is really bad, but 20 out of 500 is not as
>> bad as it seems ;-)
> Agreed, but I would still be interested in the make/model, what controller
> they were attached to, and how they failed.

This is a home environment (MythTV, doncha know).

I bought 9 Samsung HD103UJ 1TB drives in June 2008.

Since June I have RMAed 5 of the original 9.
I then RMAed 3 of the 5 replacements.
I then RMAed 2 of the 3 re-replacements.
And finally I RMAed 1 of the 2 re-re-replacements. (I think - I was getting
confused by this point - I have a list of 18+ serial numbers anyway.)

In November (ish) Samsung did the decent thing and replaced all 9 with HE103UJ
(enterprise) drives; no 'moaning' about using them in RAID etc.

This weekend I replaced 3 of the HE models that were displaying essentially the
same problems (all on the same machine - the vast majority of the problems were
in this machine and, as it happens, in the 3 drives in the md array).
During the replication I got a real media failure.

Anyhow...

I am using Dell SC420 chassis (SOHO class).
I am running 2.6.18-xen on one system, 2.6.25.4 on another. The controllers are
cheap dual-channel Sil24 PCIe cards and the Dell onboard controller.

Since finding smartctl -l scttempsts I can see that the peak temperature is
44C. They are running in Dell servers in a cool environment, and previously
these servers supported many more drives.
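
For anyone who hasn't met it, that is simply (device name is just an example):

  smartctl -l scttempsts /dev/sda   # SCT temperature status: current and
                                    # lifetime min/max drive temperature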

I had one SMART DMA error, which I'll attribute to a transient problem with a cable.

All the other 'problems' are when SMART long self-tests show e.g.:
# 1  Extended offline    Completed: read failure       90%       424         4239
and
  1 Raw_Read_Error_Rate     0x000f   100   100   051    Pre-fail  Always       -       62
I have had some OS-level issues too, but I've not recorded them as I'm taking
the SMART self-test results to be enough to indicate dodgy disks.

I've never had any drives with Reallocated_Sector_Ct != 0.

I also note that the SMART self-test log does indeed show inconsistent summary
messages:
# 1  Short offline       Completed: read failure       20%      1236         1953517887
# 2  Short offline       Aborted by host               20%      1212         -
# 3  Short offline       Aborted by host               10%      1188         -
# 4  Short offline       Aborted by host               10%      1164         -

In fact each log entry shows "Completed: read failure" until the next test
pushes it down the stack; at that point it changes to "Aborted by host". The %
remaining is key. Discussion on the smartmontools list suggests that this is a
firmware bug. (Indeed this is now fixed on some newer RMA replacements.)

Also note that the failing LBA has been different (but very similar) for each
drive, and consistent once it occurs. It often, but not always, goes away if I
force (dd) a read/write of the reported sector.
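
By "force a read/write" I mean something along these lines - the LBA is the
one from the self-test log above, the device name is just an example, and the
write obviously destroys whatever was stored in that sector:

  # try to read the reported LBA directly, bypassing the page cache
  dd if=/dev/sdd of=/dev/null bs=512 skip=1953517887 count=1 iflag=direct

  # if the read fails, rewriting the sector usually lets the drive remap it
  dd if=/dev/zero of=/dev/sdd bs=512 seek=1953517887 count=1 oflag=direct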

I am in touch with a guy at Samsung who is interested in the problem but I've
not had any tech feedback.

David
PS Thanks to Samsung's excellent advance-replacement RMA service I have been
able to deal with these problems; no other drive maker offers this service in
the UK AFAIK. Of course I have spent *days* just ddrescue-ing disks, but I've
not had to use a backup yet despite *loads* of dual-drive+ failures.

-- 
"Don't worry, you'll be fine; I saw it work in a cartoon once..."


* Re: MD Feature Request: non-degraded component replacement
  2008-12-16 12:56       ` David Greaves
@ 2008-12-16 14:38         ` Lars Schimmer
  2008-12-16 23:25         ` Justin Piszcz
  1 sibling, 0 replies; 9+ messages in thread
From: Lars Schimmer @ 2008-12-16 14:38 UTC (permalink / raw)
  To: LinuxRaid


David Greaves wrote:

> This is a home environment (MythTV, doncha know).
>
> I bought 9 Samsung HD103UJ 1TB drives in June 2008.

OK, I thought as much.
The Samsung 160, 250 and 500 GB drives have been error-free for me, but I have
heard bad rumours about the 1 TB drives and avoid them where I can - the
Seagate Barracudas are as cheap as the Samsungs and still come with a 5-year
warranty.

Regards,
Lars Schimmer
--
-------------------------------------------------------------
TU Graz, Institut für ComputerGraphik & WissensVisualisierung
Tel: +43 316 873-5405       E-Mail: l.schimmer@cgv.tugraz.at
Fax: +43 316 873-5402       PGP-Key-ID: 0x4A9B1723


* Re: MD Feature Request: non-degraded component replacement
  2008-12-16 12:56       ` David Greaves
  2008-12-16 14:38         ` Lars Schimmer
@ 2008-12-16 23:25         ` Justin Piszcz
  2008-12-17  0:20           ` David Greaves
  1 sibling, 1 reply; 9+ messages in thread
From: Justin Piszcz @ 2008-12-16 23:25 UTC (permalink / raw)
  To: David Greaves; +Cc: Lars Schimmer, LinuxRaid



On Tue, 16 Dec 2008, David Greaves wrote:

> This is a home environment (MythTV, doncha know).
>
> I bought 9 Samsung HD103UJ 1TB drives in June 2008.
>
> Since June I have RMAed 5 of the original 9.
> [snip]

Many thanks for this information - have you run other disks in the system
without issue? BTW: I have seen this with the Velociraptors as well:

> In fact each log entry shows "Completed: read failure" until the next test
> pushes it down the stack; at that point it changes to "Aborted by host". The %
> remaining is key. Discussion on the smartmontools list suggests that this is a
> firmware bug. (Indeed this is now fixed on some newer RMA replacements.)

When a drive is about to crap out it will start doing that: it will abort the
test or run forever.

I take it you are running RAID6 with these disks?

Justin.


* Re: MD Feature Request: non-degraded component replacement
  2008-12-16 23:25         ` Justin Piszcz
@ 2008-12-17  0:20           ` David Greaves
  0 siblings, 0 replies; 9+ messages in thread
From: David Greaves @ 2008-12-17  0:20 UTC (permalink / raw)
  To: Justin Piszcz; +Cc: Lars Schimmer, LinuxRaid

Justin Piszcz wrote:
> Many thanks for this information - have you run other disks in the system
> without issue? BTW: I have seen this with the Velociraptors as well:

NP

Yes. Historically I had more, smaller disks in there. The system disks have
also been fine throughout.


FWIW, do you agree with my assessment of the feature request that started this
thread? (Especially now that you understand why I feel it's of value.)

>> In fact each log entry shows "Completed: read failure" until the next test
>> pushes it down the stack; at that point it changes to "Aborted by host".
>> The % remaining is key. Discussion on the smartmontools list suggests that
>> this is a firmware bug. (Indeed this is now fixed on some newer RMA
>> replacements.)
> 
> When a drive is about to crap out it will start doing that: it will
> abort the test or run forever.

Yes, this was just a FW bug though as it turned out.


> I take it you are running RAID6 with these disks?
No :( .... but I'm *very* good at recovering RAID5 now ;)

I'm loath to buy any more disks - the Samsungs aren't something I want to
spend money on yet, and if anything goes wrong with another brand of drive I
can't get an advance RMA.


David


-- 
"Don't worry, you'll be fine; I saw it work in a cartoon once..."


* Re: MD Feature Request: non-degraded component replacement
  2008-12-16  9:36 MD Feature Request: non-degraded component replacement David Greaves
  2008-12-16  9:51 ` Justin Piszcz
@ 2008-12-19  4:11 ` Neil Brown
  1 sibling, 0 replies; 9+ messages in thread
From: Neil Brown @ 2008-12-19  4:11 UTC (permalink / raw)
  To: David Greaves; +Cc: LinuxRaid

On Tuesday December 16, david@dgreaves.com wrote:
> Hi Neil
> 
> I brought this up in October but got no response - since you seem to be on a
> roll I thought I'd try again...
> 
> Summary: Add a spare and 'mirror-fail' a device. The spare is synced with the
> to-be-removed device and any read errors are corrected from the remaining raid
> devices. Once synced, the  to-be-removed device is failed and the spare takes
> its place. At no point is the array degraded.

Yes, I've come to the conclusion that this probably is a good idea.

See my 'road-map' that I just posted.

Thanks,
NeilBrown




Thread overview: 9+ messages
2008-12-16  9:36 MD Feature Request: non-degraded component replacement David Greaves
2008-12-16  9:51 ` Justin Piszcz
2008-12-16 10:55   ` Lars Schimmer
2008-12-16 11:37     ` Justin Piszcz
2008-12-16 12:56       ` David Greaves
2008-12-16 14:38         ` Lars Schimmer
2008-12-16 23:25         ` Justin Piszcz
2008-12-17  0:20           ` David Greaves
2008-12-19  4:11 ` Neil Brown
