* Linux Software RAID a bit of a weakness?
@ 2007-02-23 19:19 Colin Simpson
  2007-02-23 19:55 ` Steve Cousins
                   ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Colin Simpson @ 2007-02-23 19:19 UTC (permalink / raw)
  To: linux-raid

Hi, 

We had a small server here that was configured with a RAID 1 mirror,
using two IDE disks. 

Last week one of the drives failed. So we replaced the drive and
set the array to rebuild. The "good" disk then found a bad block and the
mirror failed.

Now I presume that the "good" disk must have had an underlying bad block
in either unallocated space or a file I never access. As RAID works
at the block level, you only ever see this on an array rebuild, when it's
often catastrophic. Is this a bit of a flaw?

I know there is a definite probability of two drives failing within a
short period of time. But this is a bit different, as it's the
probability of two drives failing over a much larger time scale if
one of the flaws is hidden in unallocated space (maybe a dirt particle
finds its way onto the surface or something). This would make RAID buy
you a lot less in reliability, I'd have thought.

I seem to remember seeing something in the log file for a Dell PERC
about scavenging for bad blocks. Do hardware RAID systems have a
mechanism that, at times of low activity, searches the disks for bad
blocks to help guard against this sort of failure (so a disk error is
reported early)?

On software RAID, apart from a three-way mirror (which I don't think is
at present supported), is there any merit in, say, cat'ing the whole
disk devices to /dev/null every so often to check that the whole
surface is readable? (I presume just reading the raw device won't upset
things; don't worry, I don't plan on trying it on a production system.)

Any thoughts? I presume people have thought of this before and I must
be missing something.

Colin



* Re: Linux Software RAID a bit of a weakness?
  2007-02-23 19:19 Linux Software RAID a bit of a weakness? Colin Simpson
@ 2007-02-23 19:55 ` Steve Cousins
  2007-02-23 20:08   ` Justin Piszcz
  2007-02-25 12:24   ` Colin Simpson
  2007-02-23 20:25 ` Neil Brown
  2007-02-25 20:08 ` Bill Davidsen
  2 siblings, 2 replies; 18+ messages in thread
From: Steve Cousins @ 2007-02-23 19:55 UTC (permalink / raw)
  To: Colin Simpson; +Cc: linux-raid

Colin Simpson wrote:
> Hi, 
> 
> We had a small server here that was configured with a RAID 1 mirror,
> using two IDE disks. 
> 
> Last week one of the drives failed. So we replaced the drive and
> set the array to rebuild. The "good" disk then found a bad block and the
> mirror failed.
> 
> Now I presume that the "good" disk must have had an underlying bad block
> in either unallocated space or a file I never access. As RAID works
> at the block level, you only ever see this on an array rebuild, when it's
> often catastrophic. Is this a bit of a flaw?
> 
> I know there is a definite probability of two drives failing within a
> short period of time. But this is a bit different, as it's the
> probability of two drives failing over a much larger time scale if
> one of the flaws is hidden in unallocated space (maybe a dirt particle
> finds its way onto the surface or something). This would make RAID buy
> you a lot less in reliability, I'd have thought.
> 
> I seem to remember seeing something in the log file for a Dell PERC
> about scavenging for bad blocks. Do hardware RAID systems have a
> mechanism that, at times of low activity, searches the disks for bad
> blocks to help guard against this sort of failure (so a disk error is
> reported early)?
> 
> On software RAID, apart from a three-way mirror (which I don't think is
> at present supported), is there any merit in, say, cat'ing the whole
> disk devices to /dev/null every so often to check that the whole
> surface is readable? (I presume just reading the raw device won't upset
> things; don't worry, I don't plan on trying it on a production system.)
> 
> Any thoughts? I presume people have thought of this before and I must
> be missing something.

Yes, this is an important thing to keep on top of, both for hardware 
RAID and software RAID.  For md:

	echo check > /sys/block/md0/md/sync_action

This should be done regularly. I have cron do it once a week.
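
For instance, an /etc/crontab-style entry along these lines (md0
assumed; a sketch, adjust to taste):

	# weekly md scrub: Sundays at 01:00
	0 1 * * 0 root echo check > /sys/block/md0/md/sync_action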

Check out: http://neil.brown.name/blog/20050727141521-002

Good luck,

Steve
-- 
______________________________________________________________________
  Steve Cousins, Ocean Modeling Group    Email: cousins@umit.maine.edu
  Marine Sciences, 452 Aubert Hall       http://rocky.umeoce.maine.edu
  Univ. of Maine, Orono, ME 04469        Phone: (207) 581-4302




* Re: Linux Software RAID a bit of a weakness?
  2007-02-23 19:55 ` Steve Cousins
@ 2007-02-23 20:08   ` Justin Piszcz
  2007-02-25 12:24   ` Colin Simpson
  1 sibling, 0 replies; 18+ messages in thread
From: Justin Piszcz @ 2007-02-23 20:08 UTC (permalink / raw)
  To: Steve Cousins; +Cc: Colin Simpson, linux-raid

This is the most useful thing I have found in a long time!

p34:~# echo check > /sys/block/md0/md/sync_action
p34:~# cat /sys/block/md[0-4]/md/mismatch_cnt
512
0
0
0
0

Wow!

Justin.

On Fri, 23 Feb 2007, Steve Cousins wrote:

> Colin Simpson wrote:
>> Hi, 
>> We had a small server here that was configured with a RAID 1 mirror,
>> using two IDE disks. 
>> Last week one of the drives failed. So we replaced the drive and
>> set the array to rebuild. The "good" disk then found a bad block and the
>> mirror failed.
>> 
>> Now I presume that the "good" disk must have had an underlying bad block
>> in either unallocated space or a file I never access. As RAID works
>> at the block level, you only ever see this on an array rebuild, when it's
>> often catastrophic. Is this a bit of a flaw?
>> I know there is a definite probability of two drives failing within a
>> short period of time. But this is a bit different, as it's the
>> probability of two drives failing over a much larger time scale if
>> one of the flaws is hidden in unallocated space (maybe a dirt particle
>> finds its way onto the surface or something). This would make RAID buy
>> you a lot less in reliability, I'd have thought.
>> I seem to remember seeing something in the log file for a Dell PERC
>> about scavenging for bad blocks. Do hardware RAID systems have a
>> mechanism that, at times of low activity, searches the disks for bad
>> blocks to help guard against this sort of failure (so a disk error is
>> reported early)?
>> 
>> On software RAID, apart from a three-way mirror (which I don't think is
>> at present supported), is there any merit in, say, cat'ing the whole
>> disk devices to /dev/null every so often to check that the whole
>> surface is readable? (I presume just reading the raw device won't upset
>> things; don't worry, I don't plan on trying it on a production system.)
>> Any thoughts? I presume people have thought of this before and I must
>> be missing something.
>
> Yes, this is an important thing to keep on top of, both for hardware RAID and 
> software RAID.  For md:
>
> 	echo check > /sys/block/md0/md/sync_action
>
> This should be done regularly. I have cron do it once a week.
>
> Check out: http://neil.brown.name/blog/20050727141521-002
>
> Good luck,
>
> Steve
> -- 
> ______________________________________________________________________
> Steve Cousins, Ocean Modeling Group    Email: cousins@umit.maine.edu
> Marine Sciences, 452 Aubert Hall       http://rocky.umeoce.maine.edu
> Univ. of Maine, Orono, ME 04469        Phone: (207) 581-4302
>
>


* Re: Linux Software RAID a bit of a weakness?
  2007-02-23 19:19 Linux Software RAID a bit of a weakness? Colin Simpson
  2007-02-23 19:55 ` Steve Cousins
@ 2007-02-23 20:25 ` Neil Brown
  2007-02-24  4:14   ` Richard Scobie
  2007-02-25 20:08 ` Bill Davidsen
  2 siblings, 1 reply; 18+ messages in thread
From: Neil Brown @ 2007-02-23 20:25 UTC (permalink / raw)
  To: Colin Simpson; +Cc: linux-raid

On Friday February 23, csimpson@csl.co.uk wrote:
> Hi, 
> 
> We had a small server here that was configured with a RAID 1 mirror,
> using two IDE disks. 
> 
> Last week one of the drives failed. So we replaced the drive and
> set the array to rebuild. The "good" disk then found a bad block and the
> mirror failed.
> 
> Now I presume that the "good" disk must have had an underlying bad block
> in either unallocated space or a file I never access. As RAID works
> at the block level, you only ever see this on an array rebuild, when it's
> often catastrophic. Is this a bit of a flaw?

Certainly can be unfortunate.

> 
> I know there is a definite probability of two drives failing within a
> short period of time. But this is a bit different, as it's the
> probability of two drives failing over a much larger time scale if
> one of the flaws is hidden in unallocated space (maybe a dirt particle
> finds its way onto the surface or something). This would make RAID buy
> you a lot less in reliability, I'd have thought.
> 
> I seem to remember seeing something in the log file for a Dell PERC
> about scavenging for bad blocks. Do hardware RAID systems have a
> mechanism that, at times of low activity, searches the disks for bad
> blocks to help guard against this sort of failure (so a disk error is
> reported early)?
> 

As has been mentioned, this can be done with md/raid too.  Some
distros (debian/testing at least) schedule a 'check' of all arrays
once a month.
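
The essence of such a job is just (a sketch; the distro scripts are
more careful about arrays that are already busy):

	for f in /sys/block/md*/md/sync_action; do
	    echo check > "$f"
	done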

> On software RAID, apart from a three-way mirror (which I don't think is
> at present supported), is there any merit in, say, cat'ing the whole
> disk devices to /dev/null every so often to check that the whole
> surface is readable? (I presume just reading the raw device won't upset
> things; don't worry, I don't plan on trying it on a production system.)

Three-way mirroring has always been supported.  You can do N-way
mirroring if you have N drives.

Reading the whole device would not be sufficient, as it would only read
one copy of every block rather than all copies.
The 'check' process reads all copies and compares them with one
another.  If there is a difference, it is reported.  If you use
'repair' instead of 'check', the difference is arbitrarily corrected.
If a read error is detected during the 'check', md/raid1 will attempt
to write the data from the good drive to the bad drive, then read it
back.  If this works, the drive is assumed to be fixed.  If not, the
bad drive is failed out of the array.
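
In sysfs terms (md0 assumed):

	echo check  > /sys/block/md0/md/sync_action   # read and compare all copies
	cat /sys/block/md0/md/mismatch_cnt            # mismatches found by the last check
	echo repair > /sys/block/md0/md/sync_action   # rewrite mismatched copies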

NeilBrown


* Re: Linux Software RAID a bit of a weakness?
  2007-02-23 20:25 ` Neil Brown
@ 2007-02-24  4:14   ` Richard Scobie
  0 siblings, 0 replies; 18+ messages in thread
From: Richard Scobie @ 2007-02-24  4:14 UTC (permalink / raw)
  To: linux-raid

Neil Brown wrote:

> The 'check' process reads all copies and compares them with one
> another.  If there is a difference, it is reported.  If you use
> 'repair' instead of 'check', the difference is arbitrarily corrected.
> If a read error is detected during the 'check', md/raid1 will attempt
> to write the data from the good drive to the bad drive, then read it
> back.  If this works, the drive is assumed to be fixed.  If not, the
> bad drive is failed out of the array.
> 

One thing to note here is that 'repair' was broken for RAID1 until 
recently - see

http://marc.theaimsgroup.com/?l=linux-raid&m=116951242005315&w=2

As this patch was submitted just prior to the release of 2.6.20, this 
may be the first "fixed" kernel, but I have not checked.

Regards,

Richard



* Re: Linux Software RAID a bit of a weakness?
  2007-02-23 19:55 ` Steve Cousins
  2007-02-23 20:08   ` Justin Piszcz
@ 2007-02-25 12:24   ` Colin Simpson
  2007-02-25 19:15     ` Richard Scobie
  1 sibling, 1 reply; 18+ messages in thread
From: Colin Simpson @ 2007-02-25 12:24 UTC (permalink / raw)
  To: linux-raid

On Fri, 2007-02-23 at 14:55 -0500, Steve Cousins wrote:
> Yes, this is an important thing to keep on top of, both for hardware 
> RAID and software RAID.  For md:
> 
> 	echo check > /sys/block/md0/md/sync_action
> 
> This should be done regularly. I have cron do it once a week.
> 
> Check out: http://neil.brown.name/blog/20050727141521-002
> 
> Good luck,
> 
> Steve

Thanks for all the info. 

A further search around reveals the seriousness of this issue.
So-called "Disk/Data Scrubbing" seems to be vital for keeping a modern,
large RAID healthy.

I've found a few interesting links. 

http://www.ashtech.net/~syntax/blog/archives/53-Data-Scrub-with-Linux-RAID-or-Die.html

The link of particular interest from the above is

http://www.nber.org/sys-admin/linux-nas-raid.html

The really scary item, entitled "Why do drive failures come in
pairs?", has the following:

===
Let's repeat the reliability calculation with our new knowledge of the
situation. In our experience perhaps half of drives have at least one
unreadable sector in the first year. Again assume a 6 percent chance of
a single failure. The chance of at least one of the remaining two drives
having a bad sector is 75% (1-(1-.5)^2). So the RAID 5 failure rate is
about 4.5%/year, which is .5% MORE than the 4% failure rate one would
expect from a two drive RAID 0 with the same capacity. Alternatively, if
you just had two drives with a partition on each and no RAID of any
kind, the chance of a failure would still be 4%/year but only half the
data loss per incident, which is considerably better than the RAID 5 can
even hope for under the current reconstruction policy even with the most
expensive hardware.
===

That's got my attention! My RAID 5 is worse than a two-disk RAID 0. The
article goes on about a surface scan being used to mitigate this
problem. It also suggests that on reconstruction the md driver should
perhaps not just give up if it finds bad blocks on the disk, but do
something cleverer. I don't know if that's valid or not.

But this all leaves me with a big problem: the systems I have software
RAID running on are fully supported RH 4 ES systems (running the
2.6.9-42.0.8 kernel, which I can't really change without losing RH
support).

They therefore do not have the "check" option in the kernel. Is there
anything else I can do? Would forcing a resync achieve the same result
(or is that downright dangerous, as the array is not considered
consistent for a while)? Any thoughts, apart from my plan to upgrade
them to RH 5 when that appears with a probable 2.6.18 kernel (which
will presumably have "check")?

Is this something that should be added to the "Software-RAID-HOWTO"? 

Just for reference, the current Dell PERC 5i controllers have a feature
called "Patrol Read", which goes off and does a scrub in the background.

Thanks again

Colin



* Re: Linux Software RAID a bit of a weakness?
  2007-02-25 12:24   ` Colin Simpson
@ 2007-02-25 19:15     ` Richard Scobie
  2007-02-25 20:08       ` Mark Hahn
  2007-02-26 16:56       ` David Rees
  0 siblings, 2 replies; 18+ messages in thread
From: Richard Scobie @ 2007-02-25 19:15 UTC (permalink / raw)
  To: Linux RAID Mailing List

Colin Simpson wrote:

> They therefore do not have the "check" option in the kernel. Is there
> anything else I can do? Would forcing a resync achieve the same result
> (or is that downright dangerous, as the array is not considered
> consistent for a while)? Any thoughts, apart from my plan to upgrade
> them to RH 5 when that appears with a probable 2.6.18 kernel (which
> will presumably have "check")?

You could configure smartd to do regular long selftests, which would 
notify you on failures and allow you to take the drive offline and dd, 
replace etc.
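
A minimal /etc/smartd.conf sketch (device names assumed; see
smartd.conf(5) for the schedule syntax):

	# short self-test daily at 01:00, long self-test Sundays at 02:00;
	# monitor all SMART attributes (-a) and mail root on trouble
	/dev/sda -a -s (S/../.././01|L/../../7/02) -m root
	/dev/sdb -a -s (S/../.././01|L/../../7/02) -m root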

Regards,

Richard


* Re: Linux Software RAID a bit of a weakness?
  2007-02-25 19:15     ` Richard Scobie
@ 2007-02-25 20:08       ` Mark Hahn
  2007-02-25 21:02         ` Richard Scobie
  2007-02-26 16:56       ` David Rees
  1 sibling, 1 reply; 18+ messages in thread
From: Mark Hahn @ 2007-02-25 20:08 UTC (permalink / raw)
  To: Richard Scobie; +Cc: Linux RAID Mailing List

> You could configure smartd to do regular long selftests, which would notify 
> you on failures and allow you to take the drive offline and dd, replace etc.

Is it known what a long self-test does?  For instance, ultimately you
want the disk to be scrubbed over some fairly lengthy period of time.
That is, not just read and checked, possibly with parity "fixed",
but all blocks read and rewritten (with verify, I suppose!).

This starts to get a bit hair-raising to have entirely in the kernel - 
I wonder if anyone is thinking about how to pull some such activity 
out into user-space.
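
For what it's worth, badblocks can already do a user-space version of
some of this - a sketch (the device must be idle/unmounted):

	badblocks -sv /dev/sda    # read-only surface scan, with progress
	badblocks -nsv /dev/sda   # non-destructive read-write test: reads,
	                          # rewrites and verifies every block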

regards, mark hahn.


* Re: Linux Software RAID a bit of a weakness?
  2007-02-23 19:19 Linux Software RAID a bit of a weakness? Colin Simpson
  2007-02-23 19:55 ` Steve Cousins
  2007-02-23 20:25 ` Neil Brown
@ 2007-02-25 20:08 ` Bill Davidsen
  2 siblings, 0 replies; 18+ messages in thread
From: Bill Davidsen @ 2007-02-25 20:08 UTC (permalink / raw)
  To: Colin Simpson; +Cc: linux-raid

Colin Simpson wrote:
> Hi, 
>
> We had a small server here that was configured with a RAID 1 mirror,
> using two IDE disks. 
>
> Last week one of the drives failed. So we replaced the drive and
> set the array to rebuild. The "good" disk then found a bad block and the
> mirror failed.
>
> Now I presume that the "good" disk must have had an underlying bad block
> in either unallocated space or a file I never access. As RAID works
> at the block level, you only ever see this on an array rebuild, when it's
> often catastrophic. Is this a bit of a flaw?
>
> I know there is a definite probability of two drives failing within a
> short period of time. But this is a bit different, as it's the
> probability of two drives failing over a much larger time scale if
> one of the flaws is hidden in unallocated space (maybe a dirt particle
> finds its way onto the surface or something). This would make RAID buy
> you a lot less in reliability, I'd have thought.
>
> I seem to remember seeing something in the log file for a Dell PERC
> about scavenging for bad blocks. Do hardware RAID systems have a
> mechanism that, at times of low activity, searches the disks for bad
> blocks to help guard against this sort of failure (so a disk error is
> reported early)?
>
> On software RAID, apart from a three-way mirror (which I don't think is
> at present supported), is there any merit in, say, cat'ing the whole
> disk devices to /dev/null every so often to check that the whole
> surface is readable? (I presume just reading the raw device won't upset
> things; don't worry, I don't plan on trying it on a production system.)
>
> Any thoughts? I presume people have thought of this before and I must
> be missing something.
Multi-way mirroring is supported; my boot partition is mirrored across
three drives.

How often were you running the "check" function on your array, and did 
anything show up in the S.M.A.R.T. background checks?

-- 
bill davidsen <davidsen@tmr.com>
  CTO TMR Associates, Inc
  Doing interesting things with small computers since 1979




* Re: Linux Software RAID a bit of a weakness?
  2007-02-25 20:08       ` Mark Hahn
@ 2007-02-25 21:02         ` Richard Scobie
  0 siblings, 0 replies; 18+ messages in thread
From: Richard Scobie @ 2007-02-25 21:02 UTC (permalink / raw)
  To: Linux RAID Mailing List

Mark Hahn wrote:

> is it known what a long self-test does?  for instance, ultimately you
> want the disk to be scrubbed over some fairly lengthy period of time.
> that is, not just read and checked, possibly with parity "fixed",
> but all blocks read and rewritten (with verify, I suppose!)

The smartctl man page is a little vague, but it looks like it does no 
writing.


Paraphrasing somewhat:

short selftest - The "Self" tests check the electrical and mechanical
performance as well as the read performance of the disk.

long selftest - This is a longer and more thorough version of the
Short Self Test described above.
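
To run one by hand and read back the result:

	smartctl -t long /dev/sda      # start a long self-test in the drive
	smartctl -l selftest /dev/sda  # later: show the self-test log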

Richard


* Re: Linux Software RAID a bit of a weakness?
  2007-02-25 19:15     ` Richard Scobie
  2007-02-25 20:08       ` Mark Hahn
@ 2007-02-26 16:56       ` David Rees
  2007-02-26 17:26         ` Colin Simpson
  2007-02-27  4:10         ` berk walker
  1 sibling, 2 replies; 18+ messages in thread
From: David Rees @ 2007-02-26 16:56 UTC (permalink / raw)
  To: Richard Scobie; +Cc: Linux RAID Mailing List

On 2/25/07, Richard Scobie <richard@sauce.co.nz> wrote:
> Colin Simpson wrote:
> > They therefore do not have the "check" option in the kernel. Is there
> > anything else I can do? Would forcing a resync achieve the same result
> > (or is that downright dangerous, as the array is not considered
> > consistent for a while)? Any thoughts, apart from my plan to upgrade
> > them to RH 5 when that appears with a probable 2.6.18 kernel (which
> > will presumably have "check")?
>
> You could configure smartd to do regular long selftests, which would
> notify you on failures and allow you to take the drive offline and dd,
> replace etc.

So what do you do when the drives in your array don't support SMART
self-tests for some reason?

The best solution I have thought of so far is to do a `dd if=/dev/mdX
of=/dev/null` periodically, but this isn't as nice as running a check
in the later kernels, as it's not guaranteed to read blocks from all
disks. I guess you could instead do the same thing with the
underlying disks instead of the raid device, then make sure you watch
the logs for disk read errors.
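
Something like this per member disk (device names assumed; read errors
end up in the kernel log rather than being seen by md):

	for d in /dev/sda /dev/sdb; do
	    dd if=$d of=/dev/null bs=1M || echo "read failure on $d"
	done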

-Dave


* Re: Linux Software RAID a bit of a weakness?
  2007-02-26 16:56       ` David Rees
@ 2007-02-26 17:26         ` Colin Simpson
  2007-02-26 19:40           ` Joshua Baker-LePain
                             ` (2 more replies)
  2007-02-27  4:10         ` berk walker
  1 sibling, 3 replies; 18+ messages in thread
From: Colin Simpson @ 2007-02-26 17:26 UTC (permalink / raw)
  To: David Rees; +Cc: Richard Scobie, Linux RAID Mailing List


On Mon, 2007-02-26 at 08:56 -0800, David Rees wrote:
>
> > You could configure smartd to do regular long selftests, which would
> > notify you on failures and allow you to take the drive offline and dd,
> > replace etc.
> 
> So what do you do when the drives in your array don't support SMART
> self-tests for some reason?
> 
> The best solution I have thought of so far is to do a `dd if=/dev/mdX
> of=/dev/null` periodically, but this isn't as nice as running a check
> in the later kernels, as it's not guaranteed to read blocks from all
> disks. I guess you could instead do the same thing with the
> underlying disks instead of the raid device, then make sure you watch
> the logs for disk read errors.
> 
> -Dave

SATA isn't supported on RH 4's SMART.

If I run

	dd if=/dev/sda2 of=/dev/null

where /dev/sda2 is a component of an active md device, will the RAID
subsystem get upset that someone else is fiddling with the disk (even
in just a read-only way)? And will a read error on this dd (caused by
a bad block) cause md to knock out that device?

Thanks

Colin


* Re: Linux Software RAID a bit of a weakness?
  2007-02-26 17:26         ` Colin Simpson
@ 2007-02-26 19:40           ` Joshua Baker-LePain
  2007-02-26 21:13           ` David Rees
  2007-02-26 22:38           ` Jeff Garzik
  2 siblings, 0 replies; 18+ messages in thread
From: Joshua Baker-LePain @ 2007-02-26 19:40 UTC (permalink / raw)
  To: Colin Simpson; +Cc: David Rees, Richard Scobie, Linux RAID Mailing List

On Mon, 26 Feb 2007 at 5:26pm, Colin Simpson wrote

> SATA isn't supported on RH 4's SMART.

Not true (for many SATA chipsets at least).  Just pass '-d ata' to 
smartctl.
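
For example (device name assumed):

	smartctl -d ata -a /dev/sda   # full SMART report, treating the
	                              # SCSI-presented disk as ATA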

-- 
Joshua Baker-LePain
Department of Biomedical Engineering
Duke University


* Re: Linux Software RAID a bit of a weakness?
  2007-02-26 17:26         ` Colin Simpson
  2007-02-26 19:40           ` Joshua Baker-LePain
@ 2007-02-26 21:13           ` David Rees
  2007-02-26 21:22             ` Neil Brown
  2007-02-26 22:38           ` Jeff Garzik
  2 siblings, 1 reply; 18+ messages in thread
From: David Rees @ 2007-02-26 21:13 UTC (permalink / raw)
  To: Colin Simpson; +Cc: Richard Scobie, Linux RAID Mailing List

On 2/26/07, Colin Simpson <csimpson@csl.co.uk> wrote:
> If I run
>
> 	dd if=/dev/sda2 of=/dev/null
>
> where /dev/sda2 is a component of an active md device, will the RAID
> subsystem get upset that someone else is fiddling with the disk (even
> in just a read-only way)? And will a read error on this dd (caused by
> a bad block) cause md to knock out that device?

The MD subsystem doesn't care if someone else is reading the disk, and
I'm pretty sure that rear errors will be noticed by the MD system,
either.

-Dave


* Re: Linux Software RAID a bit of a weakness?
  2007-02-26 21:13           ` David Rees
@ 2007-02-26 21:22             ` Neil Brown
  2007-02-27 20:12               ` David Rees
  0 siblings, 1 reply; 18+ messages in thread
From: Neil Brown @ 2007-02-26 21:22 UTC (permalink / raw)
  To: David Rees; +Cc: Colin Simpson, Richard Scobie, Linux RAID Mailing List

On Monday February 26, drees76@gmail.com wrote:
> On 2/26/07, Colin Simpson <csimpson@csl.co.uk> wrote:
> > If I run
> >
> > 	dd if=/dev/sda2 of=/dev/null
> >
> > where /dev/sda2 is a component of an active md device, will the RAID
> > subsystem get upset that someone else is fiddling with the disk (even
> > in just a read-only way)? And will a read error on this dd (caused by
> > a bad block) cause md to knock out that device?
> 
> The MD subsystem doesn't care if someone else is reading the disk, and

Correct.  It doesn't care if someone writes either.  The thing you
cannot do is open the device with O_EXCL.  Mounting effectively uses
O_EXCL, as does adding a swap device, fsck, and various other things
that want to think they have exclusive access.

> I'm pretty sure that rear errors will be noticed by the MD system,
> either.

:-)  Your typing is nearly as bad as mine often is, but your intent is
correct.  If you independently read from a device in an MD array and get an
error, MD won't notice.  MD only notices errors for requests that it
makes of the devices itself.

NeilBrown


* Re: Linux Software RAID a bit of a weakness?
  2007-02-26 17:26         ` Colin Simpson
  2007-02-26 19:40           ` Joshua Baker-LePain
  2007-02-26 21:13           ` David Rees
@ 2007-02-26 22:38           ` Jeff Garzik
  2 siblings, 0 replies; 18+ messages in thread
From: Jeff Garzik @ 2007-02-26 22:38 UTC (permalink / raw)
  To: Colin Simpson; +Cc: David Rees, Richard Scobie, Linux RAID Mailing List

Colin Simpson wrote:
> SATA isn't supported on RH 4's SMART.

False.  Works just fine.

	Jeff




* Re: Linux Software RAID a bit of a weakness?
  2007-02-26 16:56       ` David Rees
  2007-02-26 17:26         ` Colin Simpson
@ 2007-02-27  4:10         ` berk walker
  1 sibling, 0 replies; 18+ messages in thread
From: berk walker @ 2007-02-27  4:10 UTC (permalink / raw)
  To: David Rees; +Cc: Richard Scobie, Linux RAID Mailing List



David Rees wrote:
> On 2/25/07, Richard Scobie <richard@sauce.co.nz> wrote:
>> Colin Simpson wrote:
>> > They therefore do not have the "check" option in the kernel. Is there
>> > anything else I can do? Would forcing a resync achieve the same result
>> > (or is that downright dangerous, as the array is not considered
>> > consistent for a while)? Any thoughts, apart from my plan to upgrade
>> > them to RH 5 when that appears with a probable 2.6.18 kernel (which
>> > will presumably have "check")?
>>
>> You could configure smartd to do regular long selftests, which would
>> notify you on failures and allow you to take the drive offline and dd,
>> replace etc.
>
> So what do you do when the drives in your array don't support SMART
> self-tests for some reason?
>
> The best solution I have thought of so far is to do a `dd if=/dev/mdX
> of=/dev/null` periodically, but this isn't as nice as running a check
> in the later kernels, as it's not guaranteed to read blocks from all
> disks. I guess you could instead do the same thing with the
> underlying disks instead of the raid device, then make sure you watch
> the logs for disk read errors.
>
> -Dave

Doing a dd to each drive always seemed to work for me.

b-



* Re: Linux Software RAID a bit of a weakness?
  2007-02-26 21:22             ` Neil Brown
@ 2007-02-27 20:12               ` David Rees
  0 siblings, 0 replies; 18+ messages in thread
From: David Rees @ 2007-02-27 20:12 UTC (permalink / raw)
  To: Neil Brown; +Cc: Colin Simpson, Richard Scobie, Linux RAID Mailing List

On 2/26/07, Neil Brown <neilb@suse.de> wrote:
> On Monday February 26, drees76@gmail.com wrote:
> > I'm pretty sure that rear errors will be noticed by the MD system,
> > either.
>
> :-)  Your typing is nearly as bad as mine often is, but your intent is
> correct.  If you independently read from a device in an MD array and get an
> error, MD won't notice.  MD only notices errors for requests that it
> makes of the devices itself.

Doh, 2 errors in one line! Should have read:

I'm pretty sure that read errors will _not_ be noticed by the MD system, either.

Good thing at least Neil understood me. :-)

-Dave

