* 3-way mirrors
@ 2010-09-07 14:19 George Spelvin
  2010-09-07 16:07 ` Iordan Iordanov
                   ` (4 more replies)
  0 siblings, 5 replies; 23+ messages in thread
From: George Spelvin @ 2010-09-07 14:19 UTC (permalink / raw)
  To: linux-raid; +Cc: linux

After some frustration with RAID-5 finding mismatches and not being
able to figure out which drive has the problem, I'm setting up a rather
intricate 5-way mirrored (x 2-way striped) system.

The intention is that 3 copies will be on line at any time (dropping to
2 in case of disk failure), while copies 4 and 5 will be kept off-site.
Occasionally one will come in, be re-synced, and then removed again.
(The file system can be quiesced briefly to permit a clean split.)

Anyway, one nice property of a 2-drive redundancy (3+-way mirror or
RAID-6) is error detection: in case of a mismatch, it's possible to
finger the offending drive.

My understanding of the current code is that it just copies one mirror
(the first readable?) to the others.  Does someone have a patch to vote
on the data?  If not, can someone point me at the relevant bit of code
and orient me enough that I can create it?

(The other thing I'd love is a more advanced sync_action that can accept a
block number found by "check" as a parameter to "repair" so I don't have
to wait while the array is re-scanned.  Um... I suppose this depends on
a local patch I have that logs the sector numbers of mismatches.)


Another thing I'm a bit worried about is the kernel's tendency to
add drives in the lowest-numbered open slot in a RAID.  When used in
multiply-mirrored RAID-10, this tends to fill up the first stripe half
before starting on the second.

I'm worried that someone not paying attention will --add rather than
--re-add the off-site backup drives and create mirrors 4 and 5 of
the first stripe half, thus producing an incomplete backup.

Any suggestions on how to mitigate this risk?  And if it happens,
how do I recover?  Is there a way to force a drive to be added
as 9/10, even if 5/10 is currently empty?


Thank you very much!

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: 3-way mirrors
  2010-09-07 14:19 3-way mirrors George Spelvin
@ 2010-09-07 16:07 ` Iordan Iordanov
  2010-09-07 18:49   ` George Spelvin
  2010-09-07 18:31 ` Aryeh Gregor
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 23+ messages in thread
From: Iordan Iordanov @ 2010-09-07 16:07 UTC (permalink / raw)
  To: George Spelvin; +Cc: linux-raid

Hi George,

Due to the widely reported mismatch problems with RAID5, we also went 
with a 3-way mirror design. We have not yet developed a good way of 
dealing with the inevitable mismatches which will occur with some drive 
in a 3-way mirror, but we have some (crude) ideas.

George Spelvin wrote:
> Anyway, one nice property of a 2-drive redundancy (3+-way mirror or
> RAID-6) is error detection: in case of a mismatch, it's possible to
> finger the offending drive.

When we see a mismatch_cnt > 0, we would run a dd/cmp script which would 
detect the drive and sector which is mismatched (i.e. we would craft a 
script which runs three dd processes in parallel, reading from each 
drive, and compares the data).
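Something along these lines is what we have in mind (a rough, untested
sketch; the device names and chunk size are just placeholders, and for
simplicity it reads the components one after another rather than with
three parallel dd's):

#!/bin/sh
# Compare the three raid1 components chunk by chunk after a "check" pass
# has left mismatch_cnt > 0.  Device names and sizes are placeholders.
BS=$((1024 * 1024))                    # compare 1 MiB at a time
DEVS="/dev/sdX1 /dev/sdY1 /dev/sdZ1"   # the three mirror components
CHUNKS=$(( $(blockdev --getsize64 /dev/sdX1) / BS ))

i=0
while [ "$i" -lt "$CHUNKS" ]; do
	set -- $(for d in $DEVS; do
			dd if="$d" bs=$BS skip=$i count=1 2>/dev/null \
				| md5sum | cut -d' ' -f1
		done)
	if [ "$1" != "$2" ] || [ "$1" != "$3" ]; then
		echo "mismatch in chunk $i (component sector $(( i * BS / 512 )))"
	fi
	i=$(( i + 1 ))
done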

When an inconsistency is discovered, we would have the sector which 
doesn't match, and which drive it's on. However, even at 60MB/s, this 
would take 5 hours to perform with our 1TB drives. So, it would be much 
better if we could do this while we are up, somehow.

Once we have the drive and sector, we can take the array down, and 
quickly dd the sector from one of the drives onto the one with the mismatch.
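For example (the device names and sector number here are made up), with
the array stopped:

dd if=/dev/sdY1 of=/dev/sdX1 bs=512 skip=123456 seek=123456 count=1

where sdY1 holds the copy we want to keep and sdX1 is the component with
the mismatch.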

> My understanding of the current code is that it just copies one mirror
> (the first readable?) to the others.  Does someone have a patch to vote
> on the data?  If not, can someone point me at the relevant bit of code
> and orient me enough that I can create it?

Resyncing an entire drive is probably not necessary with a mismatch, 
because you already know the rest of the drive is synced and can simply 
manually force a particular sector to match.

> (The other thing I'd love is a more advanced sync_action that can accept a
> block number found by "check" as a parameter to "repair" so I don't have
> to wait while the array is re-scanned.  Um... I suppose this depends on
> a local patch I have that logs the sector numbers of mismatches.)

Yes, but don't you run the risk of syncing the "bad" data from the 
mismatch drive to the other two drives if you do this automatically? 
Don't you also need a parameter to specify which drive to sync from?

At any rate, if the mismatch sector(s) are also logged during the array 
check, then resyncing this sector by hand would be easy and fast with 
minimal downtime. It would be great to have this functionality to start 
with.

Cheers!
Iordan

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: 3-way mirrors
  2010-09-07 14:19 3-way mirrors George Spelvin
  2010-09-07 16:07 ` Iordan Iordanov
@ 2010-09-07 18:31 ` Aryeh Gregor
  2010-09-07 19:02   ` George Spelvin
  2010-09-07 22:01 ` Neil Brown
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 23+ messages in thread
From: Aryeh Gregor @ 2010-09-07 18:31 UTC (permalink / raw)
  To: George Spelvin; +Cc: linux-raid

On Tue, Sep 7, 2010 at 10:19 AM, George Spelvin <linux@horizon.com> wrote:
> Anyway, one nice property of a 2-drive redundancy (3+-way mirror or
> RAID-6) is error detection: in case of a mismatch, it's possible to
> finger the offending drive.
>
> My understanding of the current code is that it just copies one mirror
> (the first readable?) to the others.  Does someone have a patch to vote
> on the data?  If not, can someone point me at the relevant bit of code
> and orient me enough that I can create it?

This might be useful reading:

http://neil.brown.name/blog/20100211050355
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: 3-way mirrors
  2010-09-07 16:07 ` Iordan Iordanov
@ 2010-09-07 18:49   ` George Spelvin
  2010-09-07 19:55     ` Keld Jørn Simonsen
  0 siblings, 1 reply; 23+ messages in thread
From: George Spelvin @ 2010-09-07 18:49 UTC (permalink / raw)
  To: iordan, linux; +Cc: linux-raid

George Spelvin wrote:
>> Anyway, one nice property of a 2-drive redundancy (3+-way mirror or
>> RAID-6) is error detection: in case of a mismatch, it's possible to
>> finger the offending drive.

> When we see a mismatch_cnt > 0, we would run a dd/cmp script which would 
> detect the drive and sector which is mismatched (i.e. we would craft a 
> script which runs three dd processes in parallel, reading from each 
> drive, and compares the data).

> When an inconsistency is discovered, we would have the sector which 
> doesn't match, and which drive it's on. However, even at 60MB/s, this 
> would take 5 hours to perform with our 1TB drives. So, it would be much 
> better if we could do this while we are up, somehow.

That was my hope, for the md software to do it automatically.

>> My understanding of the current code is that it just copies one mirror
>> (the first readable?) to the others.  Does someone have a patch to vote
>> on the data?  If not, can someone point me at the relevant bit of code
>> and orient me enough that I can create it?

> Resyncing an entire drive is probably not necessary with a mismatch, 
> because you already know the rest of the drive is synced and can simply 
> manually force a particular sector to match.

Ideally, I'd like ZFS-like checksums on the data, with a mismatch triggering
a read of all mirrors and a reconstruction attempt.  With that, a silently
corrupted sector on RAID-5 can be pinpointed and fixed.

But in the meantime, I'd like check/repair passes to tell me if 2 of the 3
mirrors agree, so I can blame the third.

>> (The other thing I'd love is a more advanced sync_action that can accept a
>> block number found by "check" as a parameter to "repair" so I don't have
>> to wait while the array is re-scanned.  Um... I suppose this depends on
>> a local patch I have that logs the sector numbers of mismatches.)

> Yes, but don't you run the risk of syncing the "bad" data from the 
> mismatch drive to the other two drives if you do this automatically? 
> Don't you also need a parameter to specify which drive to sync from?

That's why I wanted the voting, so the RAID software could decide
automatically.  I don't see a practical way to identify the correct
block contents in isolation, although mapping the block back up to a
logical file may find one which can be checked for consistency.

(But debugfs takes forever to run icheck + ncheck on a large filesystem.)

> At any rate, if the mismatch sector(s) are also logged during the array 
> check, then resyncing this sector by hand would be easy and fast with 
> minimal downtime. It would be great to have this functionality to start 
> with.

I use the following patch.  Note that it reports the offset in 512-byte
sectors within a single component; multiply by the number of data drives
and divide by sectors per block to get a block offset within the RAID
array.
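(For example, with the 2-way striping here and assuming 4 KiB filesystem
blocks, a reported component sector of 1,000,000 would be array block
1,000,000 * 2 / 8 = 250,000.)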

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index d1d6891..2dcffcd 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -1363,6 +1363,8 @@ static void sync_request_write(mddev_t *mddev, r10bio_t *r10_bio)
 					break;
 			if (j == vcnt)
 				continue;
+			printk(KERN_INFO "%s: Mismatch at sector %llu\n",
+			    mdname(mddev), (unsigned long long)r10_bio->sector);
 			mddev->resync_mismatches += r10_bio->sectors;
 		}
 		if (test_bit(MD_RECOVERY_CHECK, &mddev->recovery))
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 96c6902..a0a0b08 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -2732,6 +2732,8 @@ static void handle_parity_checks5(raid5_conf_t *conf, struct stripe_head *sh,
 			 */
 			set_bit(STRIPE_INSYNC, &sh->state);
 		else {
+printk(KERN_INFO "%s: Mismatch at sector %llu\n", mdname(conf->mddev),
+	(unsigned long long)sh->sector);
 			conf->mddev->resync_mismatches += STRIPE_SECTORS;
 			if (test_bit(MD_RECOVERY_CHECK, &conf->mddev->recovery))
 				/* don't try to repair!! */

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: 3-way mirrors
  2010-09-07 18:31 ` Aryeh Gregor
@ 2010-09-07 19:02   ` George Spelvin
  2010-09-08 22:28     ` Bill Davidsen
  0 siblings, 1 reply; 23+ messages in thread
From: George Spelvin @ 2010-09-07 19:02 UTC (permalink / raw)
  To: linux, Simetrical+list; +Cc: linux-raid

> This might be useful reading:
> 
> http://neil.brown.name/blog/20100211050355

An interesting point of view, BUT...

If I am seeing repeated unexplained mismatches (despite being on a good
UPS and having no unclean shutdowns), then some part of my hardware is
failing, and I'd like to know *what part*.

Even if it doesn't help me get the current data sector back, if I see
that drive #2 keeps having one opinion on the contents of a block while
drives #1 and #3 have a different opinion, then it's a useful piece of
diagnostic information.

It certainly is true that, if my file system doesn't change too fast, I
can pull the mismatching sector out of the logs and do a manual compare
using dd.  But it's a lot nicer to avoid race conditions by placing the
code inside md.

As for an option to read the whole stripe and check it, actually you
only need to read 2 copies.  If they agree, all is well.  If they don't,
recovery is required.


The arguments about blocks magically changing under the file system
don't really hold water as long as RAID-1 distributes reads across the
component drives.  As long as that is the case, a mismatch means that
successive reads can silently return different data.  A true fix (in the
absence of a higher-level checksum
to validate the data) requires multiple reads.


As for unclean shutdowns, I expect that the RAID code holds off barriers
until all copies are written, so I still expect that a majority vote
will produce a consistent file system.

Thank you for the pointer!

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: 3-way mirrors
  2010-09-07 18:49   ` George Spelvin
@ 2010-09-07 19:55     ` Keld Jørn Simonsen
  0 siblings, 0 replies; 23+ messages in thread
From: Keld Jørn Simonsen @ 2010-09-07 19:55 UTC (permalink / raw)
  To: George Spelvin; +Cc: iordan, linux-raid

On Tue, Sep 07, 2010 at 02:49:17PM -0400, George Spelvin wrote:
> George Spelvin wrote:
> 
> But in the meantime, I'd like check/repair passes to tell me if 2 of the 3
> mirrors agree, so I can blame the third.

I would like to check the error logs of the disks to see if one of the
disagreeing blocks has had an anomaly. This would also work when you
only have 2 copies.
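(For example, assuming smartmontools is available, "smartctl -l error
/dev/sdX" plus the reallocated/pending sector counts from "smartctl -A
/dev/sdX" would be the obvious first things to look at.)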

Or some reporting to higher up levels, and the ability to then check out
manually which copy to keep.

Best regards
keld

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: 3-way mirrors
  2010-09-07 14:19 3-way mirrors George Spelvin
  2010-09-07 16:07 ` Iordan Iordanov
  2010-09-07 18:31 ` Aryeh Gregor
@ 2010-09-07 22:01 ` Neil Brown
  2010-09-08  1:33   ` Neil Brown
  2010-09-08 14:52   ` George Spelvin
  2010-09-08  9:40 ` RAID mismatches (and reporting thereof) Tim Small
  2010-09-28 16:42 ` 3-way mirrors Tim Small
  4 siblings, 2 replies; 23+ messages in thread
From: Neil Brown @ 2010-09-07 22:01 UTC (permalink / raw)
  To: George Spelvin; +Cc: linux-raid

On 7 Sep 2010 10:19:04 -0400
"George Spelvin" <linux@horizon.com> wrote:

> After some frustration with RAID-5 finding mismatches and not being
> able to figure out which drive has the problem, I'm setting up a rather
> intricate 5-way mirrored (x 2-way striped) system.
> 
> The intention is that 3 copies will be on line at any time (dropping to
> 2 in case of disk failure), while copies 4 and 5 will be kept off-site.
> Occasionally one will come in, be re-synced, and then removed again.
> (The file system can be quiesced briefly to permit a clean split.)
> 
> Anyway, one nice property of a 2-drive redundancy (3+-way mirror or
> RAID-6) is error detection: in case of a mismatch, it's possible to
> finger the offending drive.
> 
> My understanding of the current code is that it just copies one mirror
> (the first readable?) to the others.  Does someone have a patch to vote
> on the data?  If not, can someone point me at the relevant bit of code
> and orient me enough that I can create it?
> 

The relevant bit of code is in the MD_RECOVERY_REQUESTED branch of
sync_request_write() in drivers/md/raid1.c
Look for "memcmp".

This code runs when you "echo repair > /sys/block/mdXXX/md/sync_action".

It has already read all blocks and now compares them to see if they are the
same.  If not, it copies the first to any that are different.

You possibly want to factor out that code into a separate function before
trying to add any 'voting' code.


> (The other thing I'd love is a more advanced sync_action that can accept a
> block number found by "check" as a parameter to "repair" so I don't have
> to wait while the array is re-scanned.  Um... I suppose this depends on
> a local patch I have that logs the sector numbers of mismatches.)

This is already possible via the sync_min and sync_max sysfs files.
Write a number of sectors to sync_max and a lower number to sync_min.
Then write 'repair' to 'sync_action'.
When sync_completed reaches sync_max, the repair will pause.
You can then let it continue by writing a larger number to sync_max, or tell
it to finish by writing 'idle' to 'sync_action'.
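For example (md0 and the sector numbers are just placeholders):

  echo 123000 > /sys/block/md0/md/sync_min
  echo 124000 > /sys/block/md0/md/sync_max
  echo repair > /sys/block/md0/md/sync_action
  # wait for sync_completed to reach sync_max, then either raise
  # sync_max to keep going, or finish with:
  echo idle   > /sys/block/md0/md/sync_action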

If you have patches that you think are generally useful, feel free to submit
them to me for consideration for upstream inclusion.


> 
> 
> Another thing I'm a bit worried about is the kernel's tendency to
> add drives in the lowest-numbered open slot in a RAID.  When used in
> multiply-mirrored RAID-10, this tends to fill up the first stripe half
> before starting on the second.

This is controlled by raid10_add_disk in drivers/md/raid10.c.  I would
happily accept a patch which made a more balanced choice about where to add
the new disk.

> 
> I'm worried that someone not paying attention will --add rather than
> --re-add the off-site backup drives and create mirrors 4 and 5 of
> the first stripe half, thus producing an incomplete backup.

It is already on my to-do list for mdadm-3.2 to reject a --add that looks
like it should be a --re-add.  You will need --force to make it a spare, or
--zero it first.


> 
> Any suggestions on how to mitigate this risk?  And if it happens,
> how do I recover?  Is there a way to force a drive to be added
> as 9/10, even if 5/10 is currently empty?

1/ hack at mdadm or wait for mdadm-3.2, or feed people more coffee:-)
2/ You probably cannot recover with any amount of certainty.
3/ That is entirely a kernel decision - 'fix' the kernel.

NeilBrown


> 
> 
> Thank you very much!
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: 3-way mirrors
  2010-09-07 22:01 ` Neil Brown
@ 2010-09-08  1:33   ` Neil Brown
  2010-09-08 14:52   ` George Spelvin
  1 sibling, 0 replies; 23+ messages in thread
From: Neil Brown @ 2010-09-08  1:33 UTC (permalink / raw)
  To: George Spelvin; +Cc: linux-raid

On Wed, 8 Sep 2010 08:01:55 +1000 Neil Brown <neilb@suse.de> wrote:
> On 7 Sep 2010 10:19:04 -0400 "George Spelvin" <linux@horizon.com> wrote:
> > 
> > I'm worried that someone not paying attention will --add rather than
> > --re-add the off-site backup drives and create mirrors 4 and 5 of
> > the first stripe half, thus producing an incomplete backup.
> 
> It is already on my to-do list for mdadm-3.2 to reject a --add that looks
> like it should be a --re-add.  You will need --force to make it a spare, or
> --zero it first.
> 

I just realised I had this slightly wrong.

mdadm will already perform a --re-add if asked to --add a device that can be
re-added.  So you should be safe from people accidentally using --add when
they should have used --re-add.

The change on my to-do list is that if it looks like a re-add might be
possible but the re-add fails, then don't do a normal --add without extra
encouragement.

The case where this is interesting is if you have a doubly-degraded RAID5 and
the devices just had a temporary failure.  
It would seem logical to just add the disks back.  The --re-add attempt will
fail of course, so mdadm will currently make the devices spares, which isn't
what is wanted.  Rather, mdadm should fail and suggest a 'stop' followed by
'--assemble --force'.

For raid1 my planned change won't make any difference - you should be safe as
you are.

NeilBrown


> 
> > 
> > Any suggestions on how to mitigate this risk?  And if it happens,
> > how do I recover?  Is there a way to force a drive to be added
> > as 9/10, even if 5/10 is currently empty?
> 
> 1/ hack at mdadm or wait for mdadm-3.2, or feed people more coffee:-)
> 2/ You probably cannot recover with any amount of certainty.
> 3/ That is entirely a kernel decision - 'fix' the kernel.
> 
> NeilBrown
> 
> 
> > 
> > 
> > Thank you very much!
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 


^ permalink raw reply	[flat|nested] 23+ messages in thread

* RAID mismatches (and reporting thereof)
  2010-09-07 14:19 3-way mirrors George Spelvin
                   ` (2 preceding siblings ...)
  2010-09-07 22:01 ` Neil Brown
@ 2010-09-08  9:40 ` Tim Small
  2010-09-08 12:35   ` George Spelvin
  2010-09-28 16:42 ` 3-way mirrors Tim Small
  4 siblings, 1 reply; 23+ messages in thread
From: Tim Small @ 2010-09-08  9:40 UTC (permalink / raw)
  To: George Spelvin; +Cc: linux-raid

On 07/09/10 15:19, George Spelvin wrote:
> After some frustration with RAID-5 finding mismatches and not being
> able to figure out which drive has the problem, I'm setting up a rather
> intricate 5-way mirrored (x 2-way striped) system.
>    

Out of interest, what systems are you seeing mismatches on?  Most of the 
ones I've seen are on LSI1068* SAS controllers (with SATA drives, but 
not sure if that counts for anything, don't use many SAS drives) 
including the Dell SAS5* and SAS6* series.  I suspect there are some 
corner cases where they corrupt data on disk.  Should open a kernel.org 
bug really, so that LSI can ignore the issue in public...

Tim.

-- 
South East Open Source Solutions Limited
Registered in England and Wales with company number 06134732.
Registered Office: 2 Powell Gardens, Redhill, Surrey, RH1 1TQ
VAT number: 900 6633 53  http://seoss.co.uk/ +44-(0)1273-808309


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: RAID mismatches (and reporting thereof)
  2010-09-08  9:40 ` RAID mismatches (and reporting thereof) Tim Small
@ 2010-09-08 12:35   ` George Spelvin
  0 siblings, 0 replies; 23+ messages in thread
From: George Spelvin @ 2010-09-08 12:35 UTC (permalink / raw)
  To: linux, tim; +Cc: linux-raid

> Out of interest, what systems are you seeing mismatches on?  Most of the 
> ones I've seen are on LSI1068* SAS controllers (with SATA drives, but 
> not sure if that counts for anything, don't use many SAS drives) 
> including the Dell SAS5* and SAS6* series.  I suspect there are some 
> corner cases where they corrupt data on disk.  Should open a kernel.org 
> bug really, so that LSI can ignore the issue in public...

MS-7376 ("MSI K9A2 Platinum") motherboard, with 2500 MHz quad-core
Phenom & 8 GiB ECC DDR2.  There are 6 SATA ports, 4 on the SB600 and 2
on a Promise PDC42819:

00:14.1 IDE interface [0101]: ATI Technologies Inc SB600 IDE [1002:438c]
04:00.0 RAID bus controller [0104]: Promise Technology, Inc. PDC42819 [FastTrak TX2650/TX4650] [105a:3f20]

I used to have a different motherboard, with 3x SiI 3132 PCIe adapters:
01:00.0 Mass storage controller [0180]: Silicon Image, Inc. SiI 3132 Serial ATA Raid II Controller [1095:3132] (rev 01)
02:00.0 Mass storage controller [0180]: Silicon Image, Inc. SiI 3132 Serial ATA Raid II Controller [1095:3132] (rev 01)

The drives are all ST3400832AS, installed in a SuperMicro SC833 case's
hot-swap bays.

I have a clone machine (same MB, CPU, and RAM, but different case and
ST3750330AS drives) that's giving me no problems.  Thus the recent
decision to swap drives and rebuild the array.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: 3-way mirrors
  2010-09-07 22:01 ` Neil Brown
  2010-09-08  1:33   ` Neil Brown
@ 2010-09-08 14:52   ` George Spelvin
  2010-09-08 23:04     ` Neil Brown
  1 sibling, 1 reply; 23+ messages in thread
From: George Spelvin @ 2010-09-08 14:52 UTC (permalink / raw)
  To: linux, neilb; +Cc: linux-raid

> The relevant bit of code is in the MD_RECOVERY_REQUESTED branch of
> sync_request_write() in drivers/md/raid1.c
> Look for "memcmp".

Okay, so the data is in r1_bio->bios[i]->bi_io_vec[j].bv_page,
for 0 <= i < mddev->raid_disks, and 0 <= j < vcnt (the number
of 4K pages in the chunk).

Okay, so the first for() loop sets primary to the lowest disk
number that was completely readable (.bi_end_io == end_sync_read
&& test_bit(BIO_UPTODATE)).

Then the second loop compares all the data to the primary's data
and, if it doesn't match, re-initializes the mirror's sbio to
write it back.

I could probably figure this out with a lot of RTFSing, but if you
don't mind me asking:
- What does it mean if r1_bio->bios[i]->bi_end_io != end_sync_read?
  Does that case only avoid testing the primary again, or are there
  other cases where it might be true?  If there are, why not count
  them as a mismatch?
- What does it mean if !test_bit(BIO_UPTODATE, &sbio->bi_flags)?
- How does the need to write back a particular disk get communicated
  from the sbio setup code to the "schedule writes" section?

(On a tangential note, why the heck are bi_flags and bi_rw "unsigned long"
rather than "u32"?  You'd have to change "if test_bit(BIO_UPTODATE" to
"if bio_flagged(sbio, BIO_UPTODATE."... untested patch appended.)

> You possibly want to factor out that code into a separate function before
> trying to add any 'voting' code.

Indeed, the first thing I'd like to do is add some much more detailed
logging.  What part of the chunk is mismatched?  One sector, one page,
or the whole chunk?  Are just a few bits flipped, or is it a gross
mismatch?  Which disks are mismatched?

> This is controlled by raid10_add_disk in drivers/md/raid10.c.  I would
> happily accept a patch which made a more balanced choice about where to add
> the new disk.

Thank you very much for the encouragement!  The tricky cases are when
the number of drives is not a multiple of the number of data copies.
If I have -n3 and 7 drives, there are many possible subsets of 3 that will
operate.  Suppose I have U__U_U_.  What order should drives 4..7 be added?

(That's something of a rhetorical question; I expect to figure out the
answer myself, although you're welcome to chime in if you have any ideas.
I'm thinking of some kind of score where I consider the n/gcd(n,k) stripe
start positions and rank possible solutions based on the minimum redundancy
level and the number of stripes at that level.  The question is, is there
ever a case where the locations I'd like to add *two* disks differ from the
location I'd like to add one?  If there were, it would be nasty.)



Proof-of-concept patch to shrink the bi_flags field on 64-bit:

diff --git a/include/linux/bio.h b/include/linux/bio.h
index 7fc5606..8cababe 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -64,8 +64,8 @@ struct bio {
 						   sectors */
 	struct bio		*bi_next;	/* request queue link */
 	struct block_device	*bi_bdev;
-	unsigned long		bi_flags;	/* status, command, etc */
-	unsigned long		bi_rw;		/* bottom bits READ/WRITE,
+	unsigned int		bi_flags;	/* status, command, etc */
+	unsigned int		bi_rw;		/* bottom bits READ/WRITE,
 						 * top bits priority
 						 */
 
diff --git a/block/blk-barrier.c b/block/blk-barrier.c
index 0d710c9..aed45dd 100644
--- a/block/blk-barrier.c
+++ b/block/blk-barrier.c
@@ -283,8 +283,8 @@ static void bio_end_empty_barrier(struct bio *bio, int err)
 {
 	if (err) {
 		if (err == -EOPNOTSUPP)
-			set_bit(BIO_EOPNOTSUPP, &bio->bi_flags);
-		clear_bit(BIO_UPTODATE, &bio->bi_flags);
+			bio->bi_flags |= (1<<BIO_EOPNOTSUPP);
+		bio->bi_flags &= ~(1<<BIO_UPTODATE);
 	}
 	if (bio->bi_private)
 		complete(bio->bi_private);
diff --git a/block/blk-core.c b/block/blk-core.c
index f0640d7..dfca463 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -138,8 +138,8 @@ static void req_bio_endio(struct request *rq, struct bio *bio,
 
 	if (&q->bar_rq != rq) {
 		if (error)
-			clear_bit(BIO_UPTODATE, &bio->bi_flags);
-		else if (!test_bit(BIO_UPTODATE, &bio->bi_flags))
+			bio->bi_flags &= ~(1<<BIO_UPTODATE);
+		else if (bio_flagged(bio, BIO_UPTODATE))
 			error = -EIO;
 
 		if (unlikely(nbytes > bio->bi_size)) {
@@ -149,7 +149,7 @@ static void req_bio_endio(struct request *rq, struct bio *bio,
 		}
 
 		if (unlikely(rq->cmd_flags & REQ_QUIET))
-			set_bit(BIO_QUIET, &bio->bi_flags);
+			bio->bi_flags |= (1<<BIO_QUIET);
 
 		bio->bi_size -= nbytes;
 		bio->bi_sector += (nbytes >> 9);
@@ -1329,13 +1329,13 @@ static void handle_bad_sector(struct bio *bio)
 	char b[BDEVNAME_SIZE];
 
 	printk(KERN_INFO "attempt to access beyond end of device\n");
-	printk(KERN_INFO "%s: rw=%ld, want=%Lu, limit=%Lu\n",
+	printk(KERN_INFO "%s: rw=%u, want=%Lu, limit=%Lu\n",
 			bdevname(bio->bi_bdev, b),
 			bio->bi_rw,
 			(unsigned long long)bio->bi_sector + bio_sectors(bio),
 			(long long)(bio->bi_bdev->bd_inode->i_size >> 9));
 
-	set_bit(BIO_EOF, &bio->bi_flags);
+	bio->bi_flags |= (1<<BIO_EOF);
 }
 
 #ifdef CONFIG_FAIL_MAKE_REQUEST
diff --git a/block/blk-lib.c b/block/blk-lib.c
index d0216b9..ee1f2d3 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -13,8 +13,8 @@ static void blkdev_discard_end_io(struct bio *bio, int err)
 {
 	if (err) {
 		if (err == -EOPNOTSUPP)
-			set_bit(BIO_EOPNOTSUPP, &bio->bi_flags);
-		clear_bit(BIO_UPTODATE, &bio->bi_flags);
+			bio->bi_flags |= (1<<BIO_EOPNOTSUPP);
+		bio->bi_flags &= ~(1<<BIO_UPTODATE);
 	}
 
 	if (bio->bi_private)
diff --git a/drivers/block/pktcdvd.c b/drivers/block/pktcdvd.c
index 8a549db..ce4a6a0 100644
--- a/drivers/block/pktcdvd.c
+++ b/drivers/block/pktcdvd.c
@@ -1450,7 +1450,7 @@ static void pkt_finish_packet(struct packet_data *pkt, int uptodate)
 
 static void pkt_run_state_machine(struct pktcdvd_device *pd, struct packet_data *pkt)
 {
-	int uptodate;
+	bool uptodate;
 
 	VPRINTK("run_state_machine: pkt %d\n", pkt->id);
 
@@ -1480,7 +1480,7 @@ static void pkt_run_state_machine(struct pktcdvd_device *pd, struct packet_data
 			if (atomic_read(&pkt->io_wait) > 0)
 				return;
 
-			if (test_bit(BIO_UPTODATE, &pkt->w_bio->bi_flags)) {
+			if (bio_flagged(pkt->w_bio, BIO_UPTODATE)) {
 				pkt_set_state(pkt, PACKET_FINISHED_STATE);
 			} else {
 				pkt_set_state(pkt, PACKET_RECOVERY_STATE);
@@ -1497,7 +1497,7 @@ static void pkt_run_state_machine(struct pktcdvd_device *pd, struct packet_data
 			break;
 
 		case PACKET_FINISHED_STATE:
-			uptodate = test_bit(BIO_UPTODATE, &pkt->w_bio->bi_flags);
+			uptodate = bio_flagged(pkt->w_bio, BIO_UPTODATE);
 			pkt_finish_packet(pkt, uptodate);
 			return;
 
diff --git a/drivers/md/md.c b/drivers/md/md.c
index cb20d0b..58162b1 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -296,7 +296,7 @@ static void md_end_barrier(struct bio *bio, int err)
 	mdk_rdev_t *rdev = bio->bi_private;
 	mddev_t *mddev = rdev->mddev;
 	if (err == -EOPNOTSUPP && mddev->barrier != POST_REQUEST_BARRIER)
-		set_bit(BIO_EOPNOTSUPP, &mddev->barrier->bi_flags);
+		mddev->barrier->bi_flags |= (1<<BIO_EOPNOTSUPP);
 
 	rdev_dec_pending(rdev, mddev);
 
@@ -347,7 +347,7 @@ static void md_submit_barrier(struct work_struct *ws)
 
 	atomic_set(&mddev->flush_pending, 1);
 
-	if (test_bit(BIO_EOPNOTSUPP, &bio->bi_flags))
+	if (bio_flagged(bio, BIO_EOPNOTSUPP))
 		bio_endio(bio, -EOPNOTSUPP);
 	else if (bio->bi_size == 0)
 		/* an empty barrier - all done */
@@ -629,10 +629,10 @@ static void super_written(struct bio *bio, int error)
 	mdk_rdev_t *rdev = bio->bi_private;
 	mddev_t *mddev = rdev->mddev;
 
-	if (error || !test_bit(BIO_UPTODATE, &bio->bi_flags)) {
+	if (error || !bio_flagged(bio, BIO_UPTODATE)) {
 		printk("md: super_written gets error=%d, uptodate=%d\n",
-		       error, test_bit(BIO_UPTODATE, &bio->bi_flags));
-		WARN_ON(test_bit(BIO_UPTODATE, &bio->bi_flags));
+		       error, bio_flagged(bio, BIO_UPTODATE));
+		WARN_ON(bio_flagged(bio, BIO_UPTODATE));
 		md_error(mddev, rdev);
 	}
 
@@ -647,7 +647,7 @@ static void super_written_barrier(struct bio *bio, int error)
 	mdk_rdev_t *rdev = bio2->bi_private;
 	mddev_t *mddev = rdev->mddev;
 
-	if (!test_bit(BIO_UPTODATE, &bio->bi_flags) &&
+	if (!bio_flagged(bio, BIO_UPTODATE) &&
 	    error == -EOPNOTSUPP) {
 		unsigned long flags;
 		/* barriers don't appear to be supported :-( */
@@ -747,7 +747,7 @@ int sync_page_io(struct block_device *bdev, sector_t sector, int size,
 	submit_bio(rw, bio);
 	wait_for_completion(&event);
 
-	ret = test_bit(BIO_UPTODATE, &bio->bi_flags);
+	ret = bio_flagged(bio, BIO_UPTODATE);
 	bio_put(bio);
 	return ret;
 }
diff --git a/drivers/md/multipath.c b/drivers/md/multipath.c
index 410fb60..f57fc90 100644
--- a/drivers/md/multipath.c
+++ b/drivers/md/multipath.c
@@ -84,7 +84,7 @@ static void multipath_end_bh_io (struct multipath_bh *mp_bh, int err)
 
 static void multipath_end_request(struct bio *bio, int error)
 {
-	int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
+	int uptodate = bio_flagged(bio, BIO_UPTODATE);
 	struct multipath_bh *mp_bh = bio->bi_private;
 	multipath_conf_t *conf = mp_bh->mddev->private;
 	mdk_rdev_t *rdev = conf->multipaths[mp_bh->path].rdev;
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index a948da8..8e43334 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -262,7 +262,7 @@ static inline void update_head_pos(int disk, r1bio_t *r1_bio)
 
 static void raid1_end_read_request(struct bio *bio, int error)
 {
-	int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
+	bool uptodate = bio_flagged(bio, BIO_UPTODATE);
 	r1bio_t *r1_bio = bio->bi_private;
 	int mirror;
 	conf_t *conf = r1_bio->mddev->private;
@@ -285,7 +285,7 @@ static void raid1_end_read_request(struct bio *bio, int error)
 		if (r1_bio->mddev->degraded == conf->raid_disks ||
 		    (r1_bio->mddev->degraded == conf->raid_disks-1 &&
 		     !test_bit(Faulty, &conf->mirrors[mirror].rdev->flags)))
-			uptodate = 1;
+			uptodate = true;
 		spin_unlock_irqrestore(&conf->device_lock, flags);
 	}
 
@@ -308,7 +308,7 @@ static void raid1_end_read_request(struct bio *bio, int error)
 
 static void raid1_end_write_request(struct bio *bio, int error)
 {
-	int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
+	bool uptodate = bio_flagged(bio, BIO_UPTODATE);
 	r1bio_t *r1_bio = bio->bi_private;
 	int mirror, behind = test_bit(R1BIO_BehindIO, &r1_bio->state);
 	conf_t *conf = r1_bio->mddev->private;
@@ -1244,7 +1244,7 @@ static void end_sync_read(struct bio *bio, int error)
 	 * or re-read if the read failed.
 	 * We don't do much here, just schedule handling by raid1d
 	 */
-	if (test_bit(BIO_UPTODATE, &bio->bi_flags))
+	if (bio_flagged(bio, BIO_UPTODATE))
 		set_bit(R1BIO_Uptodate, &r1_bio->state);
 
 	if (atomic_dec_and_test(&r1_bio->remaining))
@@ -1253,7 +1253,7 @@ static void end_sync_read(struct bio *bio, int error)
 
 static void end_sync_write(struct bio *bio, int error)
 {
-	int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
+	bool uptodate = bio_flagged(bio, BIO_UPTODATE);
 	r1bio_t *r1_bio = bio->bi_private;
 	mddev_t *mddev = r1_bio->mddev;
 	conf_t *conf = mddev->private;
@@ -1318,7 +1318,7 @@ static void sync_request_write(mddev_t *mddev, r1bio_t *r1_bio)
 		}
 		for (primary=0; primary<mddev->raid_disks; primary++)
 			if (r1_bio->bios[primary]->bi_end_io == end_sync_read &&
-			    test_bit(BIO_UPTODATE, &r1_bio->bios[primary]->bi_flags)) {
+			    bio_flagged(r1_bio->bios[primary], BIO_UPTODATE)) {
 				r1_bio->bios[primary]->bi_end_io = NULL;
 				rdev_dec_pending(conf->mirrors[primary].rdev, mddev);
 				break;
@@ -1331,7 +1331,7 @@ static void sync_request_write(mddev_t *mddev, r1bio_t *r1_bio)
 				struct bio *pbio = r1_bio->bios[primary];
 				struct bio *sbio = r1_bio->bios[i];
 
-				if (test_bit(BIO_UPTODATE, &sbio->bi_flags)) {
+				if (bio_flagged(sbio, BIO_UPTODATE)) {
 					for (j = vcnt; j-- ; ) {
 						struct page *p, *s;
 						p = pbio->bi_io_vec[j].bv_page;
@@ -1346,7 +1346,7 @@ static void sync_request_write(mddev_t *mddev, r1bio_t *r1_bio)
 				if (j >= 0)
 					mddev->resync_mismatches += r1_bio->sectors;
 				if (j < 0 || (test_bit(MD_RECOVERY_CHECK, &mddev->recovery)
-					      && test_bit(BIO_UPTODATE, &sbio->bi_flags))) {
+					      && bio_flagged(sbio, BIO_UPTODATE))) {
 					sbio->bi_end_io = NULL;
 					rdev_dec_pending(conf->mirrors[i].rdev, mddev);
 				} else {
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 42e64e4..4ae0e20 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -255,7 +255,7 @@ static inline void update_head_pos(int slot, r10bio_t *r10_bio)
 
 static void raid10_end_read_request(struct bio *bio, int error)
 {
-	int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
+	bool uptodate = bio_flagged(bio, BIO_UPTODATE);
 	r10bio_t *r10_bio = bio->bi_private;
 	int slot, dev;
 	conf_t *conf = r10_bio->mddev->private;
@@ -297,7 +297,7 @@ static void raid10_end_read_request(struct bio *bio, int error)
 
 static void raid10_end_write_request(struct bio *bio, int error)
 {
-	int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
+	bool uptodate = bio_flagged(bio, BIO_UPTODATE);
 	r10bio_t *r10_bio = bio->bi_private;
 	int slot, dev;
 	conf_t *conf = r10_bio->mddev->private;
@@ -1230,7 +1230,7 @@ static void end_sync_read(struct bio *bio, int error)
 	update_head_pos(i, r10_bio);
 	d = r10_bio->devs[i].devnum;
 
-	if (test_bit(BIO_UPTODATE, &bio->bi_flags))
+	if (bio_flagged(bio, BIO_UPTODATE))
 		set_bit(R10BIO_Uptodate, &r10_bio->state);
 	else {
 		atomic_add(r10_bio->sectors,
@@ -1255,7 +1255,7 @@ static void end_sync_read(struct bio *bio, int error)
 
 static void end_sync_write(struct bio *bio, int error)
 {
-	int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
+	bool uptodate = bio_flagged(bio, BIO_UPTODATE);
 	r10bio_t *r10_bio = bio->bi_private;
 	mddev_t *mddev = r10_bio->mddev;
 	conf_t *conf = mddev->private;
@@ -1313,7 +1313,7 @@ static void sync_request_write(mddev_t *mddev, r10bio_t *r10_bio)
 
 	/* find the first device with a block */
 	for (i=0; i<conf->copies; i++)
-		if (test_bit(BIO_UPTODATE, &r10_bio->devs[i].bio->bi_flags))
+		if (bio_flagged(r10_bio->devs[i].bio, BIO_UPTODATE))
 			break;
 
 	if (i == conf->copies)
@@ -1333,7 +1333,7 @@ static void sync_request_write(mddev_t *mddev, r10bio_t *r10_bio)
 			continue;
 		if (i == first)
 			continue;
-		if (test_bit(BIO_UPTODATE, &r10_bio->devs[i].bio->bi_flags)) {
+		if (bio_flagged(r10_bio->devs[i].bio, BIO_UPTODATE)) {
 			/* We know that the bi_io_vec layout is the same for
 			 * both 'first' and 'i', so we just compare them.
 			 * All vec entries are PAGE_SIZE;
@@ -2027,7 +2027,7 @@ static sector_t sync_request(mddev_t *mddev, sector_t sector_nr, int *skipped, i
 			int d = r10_bio->devs[i].devnum;
 			bio = r10_bio->devs[i].bio;
 			bio->bi_end_io = NULL;
-			clear_bit(BIO_UPTODATE, &bio->bi_flags);
+			bio->bi_flags &= ~(1<<BIO_UPTODATE);
 			if (conf->mirrors[d].rdev == NULL ||
 			    test_bit(Faulty, &conf->mirrors[d].rdev->flags))
 				continue;
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 96c6902..b92baad 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -537,7 +537,7 @@ static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s)
 			set_bit(STRIPE_IO_STARTED, &sh->state);
 
 			bi->bi_bdev = rdev->bdev;
-			pr_debug("%s: for %llu schedule op %ld on disc %d\n",
+			pr_debug("%s: for %llu schedule op %u on disc %d\n",
 				__func__, (unsigned long long)sh->sector,
 				bi->bi_rw, i);
 			atomic_inc(&sh->count);
@@ -559,7 +559,7 @@ static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s)
 		} else {
 			if (rw == WRITE)
 				set_bit(STRIPE_DEGRADED, &sh->state);
-			pr_debug("skip op %ld on disc %d for sector %llu\n",
+			pr_debug("skip op %u on disc %d for sector %llu\n",
 				bi->bi_rw, i, (unsigned long long)sh->sector);
 			clear_bit(R5_LOCKED, &sh->dev[i].flags);
 			set_bit(STRIPE_HANDLE, &sh->state);
@@ -1557,7 +1557,7 @@ static void raid5_end_read_request(struct bio * bi, int error)
 	struct stripe_head *sh = bi->bi_private;
 	raid5_conf_t *conf = sh->raid_conf;
 	int disks = sh->disks, i;
-	int uptodate = test_bit(BIO_UPTODATE, &bi->bi_flags);
+	bool uptodate = bio_flagged(bi, BIO_UPTODATE);
 	char b[BDEVNAME_SIZE];
 	mdk_rdev_t *rdev;
 
@@ -1591,7 +1591,7 @@ static void raid5_end_read_request(struct bio * bi, int error)
 			atomic_set(&conf->disks[i].rdev->read_errors, 0);
 	} else {
 		const char *bdn = bdevname(conf->disks[i].rdev->bdev, b);
-		int retry = 0;
+		bool retry = false;
 		rdev = conf->disks[i].rdev;
 
 		clear_bit(R5_UPTODATE, &sh->dev[i].flags);
@@ -1619,7 +1619,7 @@ static void raid5_end_read_request(struct bio * bi, int error)
 			       "md/raid:%s: Too many read errors, failing device %s.\n",
 			       mdname(conf->mddev), bdn);
 		else
-			retry = 1;
+			retry = true;
 		if (retry)
 			set_bit(R5_ReadError, &sh->dev[i].flags);
 		else {
@@ -1639,7 +1639,7 @@ static void raid5_end_write_request(struct bio *bi, int error)
 	struct stripe_head *sh = bi->bi_private;
 	raid5_conf_t *conf = sh->raid_conf;
 	int disks = sh->disks, i;
-	int uptodate = test_bit(BIO_UPTODATE, &bi->bi_flags);
+	bool uptodate = bio_flagged(bi, BIO_UPTODATE);
 
 	for (i=0 ; i<disks; i++)
 		if (bi == &sh->dev[i].req)
@@ -2251,7 +2251,7 @@ handle_failed_stripe(raid5_conf_t *conf, struct stripe_head *sh,
 		while (bi && bi->bi_sector <
 			sh->dev[i].sector + STRIPE_SECTORS) {
 			struct bio *nextbi = r5_next_bio(bi, sh->dev[i].sector);
-			clear_bit(BIO_UPTODATE, &bi->bi_flags);
+			bi->bi_flags &= ~(1<<BIO_UPTODATE);
 			if (!raid5_dec_bi_phys_segments(bi)) {
 				md_write_end(conf->mddev);
 				bi->bi_next = *return_bi;
@@ -2266,7 +2266,7 @@ handle_failed_stripe(raid5_conf_t *conf, struct stripe_head *sh,
 		while (bi && bi->bi_sector <
 		       sh->dev[i].sector + STRIPE_SECTORS) {
 			struct bio *bi2 = r5_next_bio(bi, sh->dev[i].sector);
-			clear_bit(BIO_UPTODATE, &bi->bi_flags);
+			bi->bi_flags &= ~(1<<BIO_UPTODATE);
 			if (!raid5_dec_bi_phys_segments(bi)) {
 				md_write_end(conf->mddev);
 				bi->bi_next = *return_bi;
@@ -2290,7 +2290,7 @@ handle_failed_stripe(raid5_conf_t *conf, struct stripe_head *sh,
 			       sh->dev[i].sector + STRIPE_SECTORS) {
 				struct bio *nextbi =
 					r5_next_bio(bi, sh->dev[i].sector);
-				clear_bit(BIO_UPTODATE, &bi->bi_flags);
+				bi->bi_flags &= ~(1<<BIO_UPTODATE);
 				if (!raid5_dec_bi_phys_segments(bi)) {
 					bi->bi_next = *return_bi;
 					*return_bi = bi;
@@ -3787,7 +3787,7 @@ static void raid5_align_endio(struct bio *bi, int error)
 	struct bio* raid_bi  = bi->bi_private;
 	mddev_t *mddev;
 	raid5_conf_t *conf;
-	int uptodate = test_bit(BIO_UPTODATE, &bi->bi_flags);
+	bool uptodate = bio_flagged(bi, BIO_UPTODATE);
 	mdk_rdev_t *rdev;
 
 	bio_put(bi);
@@ -4089,7 +4089,7 @@ static int make_request(mddev_t *mddev, struct bio * bi)
 			release_stripe(sh);
 		} else {
 			/* cannot get stripe for read-ahead, just give-up */
-			clear_bit(BIO_UPTODATE, &bi->bi_flags);
+			bi->bi_flags &= ~(1<<BIO_UPTODATE);
 			finish_wait(&conf->wait_for_overlap, &w);
 			break;
 		}
diff --git a/fs/bio.c b/fs/bio.c
index e7bf6ca..76192ca 100644
--- a/fs/bio.c
+++ b/fs/bio.c
@@ -1423,8 +1423,8 @@ EXPORT_SYMBOL(bio_flush_dcache_pages);
 void bio_endio(struct bio *bio, int error)
 {
 	if (error)
-		clear_bit(BIO_UPTODATE, &bio->bi_flags);
-	else if (!test_bit(BIO_UPTODATE, &bio->bi_flags))
+		bio->bi_flags &= ~(1<<BIO_UPTODATE);
+	else if (!bio_flagged(bio, BIO_UPTODATE))
 		error = -EIO;
 
 	if (bio->bi_end_io)
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index d74e6af..c58bef8 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1759,17 +1759,17 @@ static void end_bio_extent_writepage(struct bio *bio, int err)
  */
 static void end_bio_extent_readpage(struct bio *bio, int err)
 {
-	int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
+	bool uptodate = bio_flagged(bio, BIO_UPTODATE);
 	struct bio_vec *bvec_end = bio->bi_io_vec + bio->bi_vcnt - 1;
 	struct bio_vec *bvec = bio->bi_io_vec;
 	struct extent_io_tree *tree;
 	u64 start;
 	u64 end;
-	int whole_page;
+	bool whole_page;
 	int ret;
 
 	if (err)
-		uptodate = 0;
+		uptodate = false;
 
 	do {
 		struct page *page = bvec->bv_page;
@@ -1780,9 +1780,9 @@ static void end_bio_extent_readpage(struct bio *bio, int err)
 		end = start + bvec->bv_len - 1;
 
 		if (bvec->bv_offset == 0 && bvec->bv_len == PAGE_CACHE_SIZE)
-			whole_page = 1;
+			whole_page = true;
 		else
-			whole_page = 0;
+			whole_page = false;
 
 		if (++bvec <= bvec_end)
 			prefetchw(&bvec->bv_page->flags);
@@ -1791,17 +1791,16 @@ static void end_bio_extent_readpage(struct bio *bio, int err)
 			ret = tree->ops->readpage_end_io_hook(page, start, end,
 							      NULL);
 			if (ret)
-				uptodate = 0;
+				uptodate = false;
 		}
 		if (!uptodate && tree->ops &&
 		    tree->ops->readpage_io_failed_hook) {
 			ret = tree->ops->readpage_io_failed_hook(bio, page,
 							 start, end, NULL);
 			if (ret == 0) {
-				uptodate =
-					test_bit(BIO_UPTODATE, &bio->bi_flags);
+				uptodate = bio_flagged(bio, BIO_UPTODATE);
 				if (err)
-					uptodate = 0;
+					uptodate = false;
 				continue;
 			}
 		}
@@ -1841,7 +1840,7 @@ static void end_bio_extent_readpage(struct bio *bio, int err)
  */
 static void end_bio_extent_preparewrite(struct bio *bio, int err)
 {
-	const int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
+	const bool uptodate = bio_flagged(bio, BIO_UPTODATE);
 	struct bio_vec *bvec = bio->bi_io_vec + bio->bi_vcnt - 1;
 	struct extent_io_tree *tree;
 	u64 start;
diff --git a/fs/buffer.c b/fs/buffer.c
index d54812b..94af2b9 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2997,14 +2997,14 @@ static void end_bio_bh_io_sync(struct bio *bio, int err)
 	struct buffer_head *bh = bio->bi_private;
 
 	if (err == -EOPNOTSUPP) {
-		set_bit(BIO_EOPNOTSUPP, &bio->bi_flags);
+		bio->bi_flags |= (1<<BIO_EOPNOTSUPP);
 		set_bit(BH_Eopnotsupp, &bh->b_state);
 	}
 
-	if (unlikely (test_bit(BIO_QUIET,&bio->bi_flags)))
+	if (unlikely (bio_flagged(bio, BIO_QUIET)))
 		set_bit(BH_Quiet, &bh->b_state);
 
-	bh->b_end_io(bh, test_bit(BIO_UPTODATE, &bio->bi_flags));
+	bh->b_end_io(bh, bio_flagged(bio, BIO_UPTODATE));
 	bio_put(bio);
 }
 
diff --git a/fs/direct-io.c b/fs/direct-io.c
index 7600aac..c7d3a0f 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -425,7 +425,7 @@ static struct bio *dio_await_one(struct dio *dio)
  */
 static int dio_bio_complete(struct dio *dio, struct bio *bio)
 {
-	const int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
+	const bool uptodate = bio_flagged(bio, BIO_UPTODATE);
 	struct bio_vec *bvec = bio->bi_io_vec;
 	int page_no;
 
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 377309c..f6d3216 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -2598,7 +2598,7 @@ static int ext4_ext_zeroout(struct inode *inode, struct ext4_extent *ex)
 		submit_bio(WRITE, bio);
 		wait_for_completion(&event);
 
-		if (!test_bit(BIO_UPTODATE, &bio->bi_flags)) {
+		if (!bio_flagged(bio, BIO_UPTODATE)) {
 			bio_put(bio);
 			return -EIO;
 		}
diff --git a/fs/jfs/jfs_logmgr.c b/fs/jfs/jfs_logmgr.c
index c51af2a..b37ee3e 100644
--- a/fs/jfs/jfs_logmgr.c
+++ b/fs/jfs/jfs_logmgr.c
@@ -2216,7 +2216,7 @@ static void lbmIODone(struct bio *bio, int error)
 
 	bp->l_flag |= lbmDONE;
 
-	if (!test_bit(BIO_UPTODATE, &bio->bi_flags)) {
+	if (!bio_flagged(bio, BIO_UPTODATE)) {
 		bp->l_flag |= lbmERROR;
 
 		jfs_err("lbmIODone: I/O error in JFS log");
diff --git a/fs/jfs/jfs_metapage.c b/fs/jfs/jfs_metapage.c
index 48b44bd..9222e06 100644
--- a/fs/jfs/jfs_metapage.c
+++ b/fs/jfs/jfs_metapage.c
@@ -287,7 +287,7 @@ static void metapage_read_end_io(struct bio *bio, int err)
 {
 	struct page *page = bio->bi_private;
 
-	if (!test_bit(BIO_UPTODATE, &bio->bi_flags)) {
+	if (!bio_flagged(bio, BIO_UPTODATE)) {
 		printk(KERN_ERR "metapage_read_end_io: I/O error\n");
 		SetPageError(page);
 	}
@@ -344,7 +344,7 @@ static void metapage_write_end_io(struct bio *bio, int err)
 
 	BUG_ON(!PagePrivate(page));
 
-	if (! test_bit(BIO_UPTODATE, &bio->bi_flags)) {
+	if (!bio_flagged(bio, BIO_UPTODATE)) {
 		printk(KERN_ERR "metapage_write_end_io: I/O error\n");
 		SetPageError(page);
 	}
diff --git a/fs/logfs/dev_bdev.c b/fs/logfs/dev_bdev.c
index 9bd2ce2..ea48736 100644
--- a/fs/logfs/dev_bdev.c
+++ b/fs/logfs/dev_bdev.c
@@ -41,7 +41,7 @@ static int sync_request(struct page *page, struct block_device *bdev, int rw)
 	submit_bio(rw, &bio);
 	generic_unplug_device(bdev_get_queue(bdev));
 	wait_for_completion(&complete);
-	return test_bit(BIO_UPTODATE, &bio.bi_flags) ? 0 : -EIO;
+	return bio_flagged(bio, BIO_UPTODATE) ? 0 : -EIO;
 }
 
 static int bdev_readpage(void *_sb, struct page *page)
@@ -66,7 +66,7 @@ static DECLARE_WAIT_QUEUE_HEAD(wq);
 
 static void writeseg_end_io(struct bio *bio, int err)
 {
-	const int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
+	const bool uptodate = bio_flagged(bio, BIO_UPTODATE);
 	struct bio_vec *bvec = bio->bi_io_vec + bio->bi_vcnt - 1;
 	struct super_block *sb = bio->bi_private;
 	struct logfs_super *super = logfs_super(sb);
@@ -174,7 +174,7 @@ static void bdev_writeseg(struct super_block *sb, u64 ofs, size_t len)
 
 static void erase_end_io(struct bio *bio, int err) 
 { 
-	const int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags); 
+	const bool uptodate = bio_flagged(bio, BIO_UPTODATE); 
 	struct super_block *sb = bio->bi_private; 
 	struct logfs_super *super = logfs_super(sb); 
 
diff --git a/fs/mpage.c b/fs/mpage.c
index fd56ca2..3be5895 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -42,7 +42,7 @@
  */
 static void mpage_end_io_read(struct bio *bio, int err)
 {
-	const int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
+	const bool uptodate = bio_flagged(bio, BIO_UPTODATE);
 	struct bio_vec *bvec = bio->bi_io_vec + bio->bi_vcnt - 1;
 
 	do {
@@ -64,7 +64,7 @@ static void mpage_end_io_read(struct bio *bio, int err)
 
 static void mpage_end_io_write(struct bio *bio, int err)
 {
-	const int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
+	const bool uptodate = bio_flagged(bio, BIO_UPTODATE);
 	struct bio_vec *bvec = bio->bi_io_vec + bio->bi_vcnt - 1;
 
 	do {
diff --git a/fs/nilfs2/segbuf.c b/fs/nilfs2/segbuf.c
index 2e6a272..d3ef05c 100644
--- a/fs/nilfs2/segbuf.c
+++ b/fs/nilfs2/segbuf.c
@@ -349,11 +349,11 @@ void nilfs_add_checksums_on_logs(struct list_head *logs, u32 seed)
  */
 static void nilfs_end_bio_write(struct bio *bio, int err)
 {
-	const int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
+	const bool uptodate = bio_flagged(bio, BIO_UPTODATE);
 	struct nilfs_segment_buffer *segbuf = bio->bi_private;
 
 	if (err == -EOPNOTSUPP) {
-		set_bit(BIO_EOPNOTSUPP, &bio->bi_flags);
+		bio->bi_flags |= (1<<BIO_EOPNOTSUPP);
 		bio_put(bio);
 		/* to be detected by submit_seg_bio() */
 	}
diff --git a/fs/xfs/linux-2.6/xfs_aops.c b/fs/xfs/linux-2.6/xfs_aops.c
index 34640d6..055de11 100644
--- a/fs/xfs/linux-2.6/xfs_aops.c
+++ b/fs/xfs/linux-2.6/xfs_aops.c
@@ -351,7 +351,7 @@ xfs_end_bio(
 	xfs_ioend_t		*ioend = bio->bi_private;
 
 	ASSERT(atomic_read(&bio->bi_cnt) >= 1);
-	ioend->io_error = test_bit(BIO_UPTODATE, &bio->bi_flags) ? 0 : error;
+	ioend->io_error = bio_flagged(bio, BIO_UPTODATE) ? 0 : error;
 
 	/* Toss bio and pass work off to an xfsdatad thread */
 	bio->bi_private = NULL;
diff --git a/mm/bounce.c b/mm/bounce.c
index 13b6dad..7a435fd 100644
--- a/mm/bounce.c
+++ b/mm/bounce.c
@@ -127,8 +127,7 @@ static void bounce_end_io(struct bio *bio, mempool_t *pool, int err)
 	struct bio_vec *bvec, *org_vec;
 	int i;
 
-	if (test_bit(BIO_EOPNOTSUPP, &bio->bi_flags))
-		set_bit(BIO_EOPNOTSUPP, &bio_orig->bi_flags);
+	bio->bi_flags |= bio_orig->bi_flags & (1<<BIO_EOPNOTSUPP);
 
 	/*
 	 * free up bounce indirect pages used
@@ -161,7 +160,7 @@ static void __bounce_end_io_read(struct bio *bio, mempool_t *pool, int err)
 {
 	struct bio *bio_orig = bio->bi_private;
 
-	if (test_bit(BIO_UPTODATE, &bio->bi_flags))
+	if (bio_flagged(bio, BIO_UPTODATE))
 		copy_to_high_bio_irq(bio_orig, bio);
 
 	bounce_end_io(bio, pool, err);
diff --git a/mm/page_io.c b/mm/page_io.c
index 31a3b96..11a16b0 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -42,7 +42,7 @@ static struct bio *get_swap_bio(gfp_t gfp_flags,
 
 static void end_swap_bio_write(struct bio *bio, int err)
 {
-	const int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
+	const bool uptodate = bio_flagged(bio, BIO_UPTODATE);
 	struct page *page = bio->bi_io_vec[0].bv_page;
 
 	if (!uptodate) {
@@ -68,7 +68,7 @@ static void end_swap_bio_write(struct bio *bio, int err)
 
 void end_swap_bio_read(struct bio *bio, int err)
 {
-	const int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
+	const bool uptodate = bio_flagged(bio, BIO_UPTODATE);
 	struct page *page = bio->bi_io_vec[0].bv_page;
 
 	if (!uptodate) {


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: 3-way mirrors
  2010-09-07 19:02   ` George Spelvin
@ 2010-09-08 22:28     ` Bill Davidsen
  0 siblings, 0 replies; 23+ messages in thread
From: Bill Davidsen @ 2010-09-08 22:28 UTC (permalink / raw)
  To: George Spelvin; +Cc: Simetrical+list, linux-raid

George Spelvin wrote:
>> This might be useful reading:
>>
>> http://neil.brown.name/blog/20100211050355
>>     
>
> An interesting point of view, BUT...
>
> If I am seeing repeated unexplained mismatches (despite being on a good
> UPS and having no unclean shutdowns), then some part of my hardware is
> failing, and I'd like to know *what part*.
>   

How about your disk enclosure? I would think that if vibration caused a 
silent bit flip *on write* you would get a read error (CRC) reading it 
back, but... I am going on the theory that any factor which causes 
measurable errors on read is not doing anything good for writes either.

See: http://www.zdnet.com/blog/storage/bad-bad-bad-vibrations/896

-- 
Bill Davidsen <davidsen@tmr.com>
  "We can't solve today's problems by using the same thinking we
   used in creating them." - Einstein


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: 3-way mirrors
  2010-09-08 14:52   ` George Spelvin
@ 2010-09-08 23:04     ` Neil Brown
  0 siblings, 0 replies; 23+ messages in thread
From: Neil Brown @ 2010-09-08 23:04 UTC (permalink / raw)
  To: George Spelvin; +Cc: linux-raid

On 8 Sep 2010 10:52:32 -0400
"George Spelvin" <linux@horizon.com> wrote:

> > The relevant bit of code is in the MD_RECOVERY_REQUESTED branch of
> > sync_request_write() in drivers/md/raid1.c
> > Look for "memcmp".
> 
> Okay, so the data is in r1_bio->bios[i]->bi_io_vec[j].bv_page,
> for 0 <= i < mddev->raid_disks, and 0 <= j < vcnt (the number
> of 4K pages in the chunk).
> 
> Okay, so the first for() loop sets primary to the lowest disk
> number that was completely readable (.bi_end_io == end_sync_read
> && test_bit(BIO_UPTODATE).
> 
> Then the second loop compares all the data to the primary's data
> and, if it doesn't match, re-initializes the mirror's sbio to
> write it back.
> 
> I could probably figure this out with a lot of RTFSing, but if you
> don't mind me asking:
> - What does it mean if r1_bio->bios[i]->bi_end_io != end_sync_read.
>   Does that case only avoid testing the primary again, or are there
>   other cases where it might be true.  If there are, why not count
>   them as a mismatch?

bi_end_io is set up in sync_request().
A non-NULL value means that a 'nr_pending' reference is held on the device.
If that value is end_sync_read, then a read was attempted.  If it is
end_sync_write, then no read was attempted as we would not expect the data to
be valid (typically during a rebuild).

So:
 NULL -> device is failed or doesn't exist or otherwise should be ignored
        e.g. during recovery we read from one device, write to one, and
        ignore the rest.
 end_sync_write -> device is working but is not in-sync.  Probably doesn't
        happen for check/repair cycles.
 end_sync_read -> we read this block so we need to test the content.


> - What does it mean if !test_bit(BIO_UPTODATE, &sbio->bi_flags)?

The read request failed.

> - How does the need to write back a particular disk get communicated
>   from the sbio setup code to the "schedule writes" section?

It is the other way around.  We signal "don't write this block" by setting
bi_end_io to NULL.
The default is to write to every working disk that isn't the first one we
read from and that isn't being ignored (this reflects the fact that the code
originally just did resync and recovery, and check/repair was added later).

> 
> (On a tangential note, why the heck are bi_flags and bi_rw "unsigned long"
> rather than "u32"?  You'd have to change "if test_bit(BIO_UPTODATE" to
> "if bio_flagged(sbio, BIO_UPTODATE."... untested patch appended.)

Hysterical Raisins?  You would need to take that up with Jens Axboe.


> 
> > You possibly want to factor out that code into a separate function before
> > trying to add any 'voting' code.
> 
> Indeed, the first thing I'd like to do is add some much more detailed
> logging.  What part of the chunk is mismatched?  One sector, one page,
> or the whole chunk?  Are just a few bits flipped, or is it a gross
> mismatch?  Which disks are mismatched?

Sounds good.  Keep it brief and easy to parse.  Probably for each time
memcmp fails for a requested pass, print one line that identifies the
2 devices, the sector/size of the block, the first and last byte that are
different, and the first 16 bytes of the differing range from each device ???


> 
> > This is controlled by raid10_add_disk in drivers/md/raid10.c.  I would
> > happily accept a patch which made a more balanced choice about where to add
> > the new disk.
> 
> Thank you very much for the encouragement!  The tricky cases are when
> the number of drives is not a multiple of the number of data copies.
> If I have -n3 and 7 drives, there are many possible subsets of 3 that will
> operate.  Suppose I have U__U_U_.  What order should drives 4..7 be added?

You don't need to make the code perfect, just better.
If you only change the order for adding spares in the simple/common case,
that would be enough improvement to be very worth while.
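
For the simple/common case, a scoring sketch (user space only, and it
assumes the 'near' layout maps copy j of chunk c to device
(c*n + j) % raid_disks -- worth checking against raid10.c before
trusting it):

    #include <stdio.h>
    #include <string.h>

    static int gcd(int a, int b) { return b ? gcd(b, a % b) : a; }

    /*
     * Worst-case number of live copies over the distinct chunk
     * rotations.  layout[] is 'U' for in-sync, '_' for missing,
     * e.g. "U__U_U_" with n = 3 copies.
     */
    static int min_redundancy(const char *layout, int n)
    {
            int disks = strlen(layout), step = gcd(n, disks), worst = n;

            for (int start = 0; start < disks; start += step) {
                    int live = 0;

                    for (int j = 0; j < n; j++)
                            if (layout[(start + j) % disks] == 'U')
                                    live++;
                    if (live < worst)
                            worst = live;
            }
            return worst;
    }

    /* pick the empty slot that leaves the worst rotation best off
     * (a fuller score could also break ties on how many rotations
     * sit at that minimum) */
    static int best_slot(const char *layout, int n)
    {
            int disks = strlen(layout), best = -1, best_score = -1;
            char trial[64];                 /* assumes < 64 devices */

            for (int s = 0; s < disks; s++) {
                    if (layout[s] != '_')
                            continue;
                    strcpy(trial, layout);
                    trial[s] = 'U';
                    int score = min_redundancy(trial, n);
                    if (score > best_score) {
                            best_score = score;
                            best = s;
                    }
            }
            return best;
    }

    int main(void)
    {
            printf("add the spare at slot %d\n", best_slot("U__U_U_", 3));
            return 0;
    }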

> 
> (That's something of a rhetorical question; I expect to figure out the
> answer myself, although you're welcome to chime in if you have any ideas.
> I'm thinking of some kind of score where I consider the n/gcd(n,k) stripe
> start positions and rank possible solutions based on the minimum redundancy
> level and the number of stripes at that level.  The question is, is there
> ever a case where the locations I'd like to add *two* disks differ from the
> location I'd like to add one?  If there were, it would be nasty.)
> 
> 

Thanks,
NeilBrown



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: 3-way mirrors
  2010-09-07 14:19 3-way mirrors George Spelvin
                   ` (3 preceding siblings ...)
  2010-09-08  9:40 ` RAID mismatches (and reporting thereof) Tim Small
@ 2010-09-28 16:42 ` Tim Small
  4 siblings, 0 replies; 23+ messages in thread
From: Tim Small @ 2010-09-28 16:42 UTC (permalink / raw)
  To: George Spelvin; +Cc: linux-raid

On 07/09/10 15:19, George Spelvin wrote:
> After some frustration with RAID-5 finding mismatches and not being
> able to figure out which drive has the problem, I'm setting up a rather
> intricate 5-way mirrored (x 2-way striped) system.
>    

I know that this doesn't solve your current problem, but I wondered
whether the fact that mismatch_cnt is not a reliable indication of
corruption on RAID1 and RAID10 is a problem for your proposed solution.
I don't know how difficult it would be to fix that whilst you are at it
(i.e. add a data copy in the write path, so that every mirror is
guaranteed to be written with identical data).

Whilst I think about it, perhaps mismatch_cnt should be dropped from 
RAID1 / RAID10 entirely, as it doesn't seem to be particularly useful 
as-is....

Perhaps the data-copy mode could be a runtime option, and mismatch_cnt 
would only appear when it was switched on (and a repair forced when 
making the transition from no-copy mode to copy mode?).

Cheers,

Tim.


-- 
South East Open Source Solutions Limited
Registered in England and Wales with company number 06134732.
Registered Office: 2 Powell Gardens, Redhill, Surrey, RH1 1TQ
VAT number: 900 6633 53  http://seoss.co.uk/ +44-(0)1273-808309


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: 3-way mirrors
  2010-09-08  7:01 Michael Sallaway
@ 2010-09-08  9:11 ` Tim Small
  0 siblings, 0 replies; 23+ messages in thread
From: Tim Small @ 2010-09-08  9:11 UTC (permalink / raw)
  To: Michael Sallaway; +Cc: Neil Brown, linux-raid

On 08/09/10 08:01, Michael Sallaway wrote:
> Aha -- yep, I'm just using the stock Ubuntu kernel, which for 10.04 seems to be back at 2.6.32-22. Yikes!
>    

Get Ubuntu to integrate the patches (if they haven't already).

http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=581392

Tim.

-- 
South East Open Source Solutions Limited
Registered in England and Wales with company number 06134732.
Registered Office: 2 Powell Gardens, Redhill, Surrey, RH1 1TQ
VAT number: 900 6633 53  http://seoss.co.uk/ +44-(0)1273-808309


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: 3-way mirrors
  2010-09-08  6:40 ` Neil Brown
@ 2010-09-08  9:06   ` Tim Small
  0 siblings, 0 replies; 23+ messages in thread
From: Tim Small @ 2010-09-08  9:06 UTC (permalink / raw)
  To: Neil Brown; +Cc: Michael Sallaway, linux-raid

On 08/09/10 07:40, Neil Brown wrote:
> It looks like you have an ancient kernel - older than April 2010 :-)
> A patch went in to 2.6.35 and I think some 2.6.34.y which fixed a bug that
> causes md to drop devices in a degraded RAID6 when it could have fixed the
> read error.  Commit 7b0bb5368a719
>
> So a newer kernel might fix your problem for you.
>    

FYI, at least Debian 5 and RHEL 5 have picked up these patches, AFAIK.

Tim.

-- 
South East Open Source Solutions Limited
Registered in England and Wales with company number 06134732.
Registered Office: 2 Powell Gardens, Redhill, Surrey, RH1 1TQ
VAT number: 900 6633 53  http://seoss.co.uk/ +44-(0)1273-808309


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: 3-way mirrors
@ 2010-09-08  7:01 Michael Sallaway
  2010-09-08  9:11 ` Tim Small
  0 siblings, 1 reply; 23+ messages in thread
From: Michael Sallaway @ 2010-09-08  7:01 UTC (permalink / raw)
  To: Neil Brown; +Cc: linux-raid


>  -------Original Message-------
>  From: Neil Brown <neilb@suse.de>
>  To: Michael Sallaway <michael@sallaway.com>
>  Cc: linux-raid@vger.kernel.org
>  Subject: Re: 3-way mirrors
>  Sent: 08 Sep '10 06:40

>  Yes, it is just reads.
>  It looks like you have an ancient kernel - older than April 2010 :-)
>  A patch went in to 2.6.35 and I think some 2.6.34.y which fixed a bug that
>  causes md to drop devices in a degraded RAID6 when it could have fixed the
>  read error.  Commit 7b0bb5368a719
>  
>  So a newer kernel might fix your problem for you.

Aha -- yep, I'm just using the stock Ubuntu kernel, which for 10.04 seems to be back at 2.6.32-22. Yikes!

So that would explain that. I'll see what kernel I can upgrade to, and go from there. Thanks for your help!

Cheers,
Michael
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: 3-way mirrors
  2010-09-08  6:16 Michael Sallaway
@ 2010-09-08  6:40 ` Neil Brown
  2010-09-08  9:06   ` Tim Small
  0 siblings, 1 reply; 23+ messages in thread
From: Neil Brown @ 2010-09-08  6:40 UTC (permalink / raw)
  To: Michael Sallaway; +Cc: linux-raid

On Wed, 08 Sep 2010 06:16:16 +0000
"Michael Sallaway" <michael@sallaway.com> wrote:

> 
> >  -------Original Message-------
> >  From: Neil Brown <neilb@suse.de>
> >  To: Michael Sallaway <michael@sallaway.com>
> >  Cc: linux-raid@vger.kernel.org
> >  Subject: Re: 3-way mirrors
> >  Sent: 08 Sep '10 06:02
> >  
> >  Hmm.... Drive B shouldn't be ejected from the array for a read error.  md
> >  should calculate the data for both A and B from the other devices and then
> >  write that to A and B.
> >  If the write fails, only then should it kick B from the array.  Is that what
> >  is happening?
> >  
> >  i.e. do you see messages like:
> >     read error corrected
> >     read error not correctable
> >     read error NOT corrected
> >  
> >  in the kernel logs??
> 
> 
> The logs for the relevant section are below, at the bottom -- it's a "read error not correctable". So I'm guessing it's also failing a write, although I can't see the ATA error handling mentioning any writes -- it all looks like reads??

Yes, it is just reads.
It looks like you have an ancient kernel - older than April 2010 :-)
A patch went in to 2.6.35 and I think some 2.6.34.y which fixed a bug that
causes md to drop devices in a degraded RAID6 when it could have fixed the
read error.  Commit 7b0bb5368a719

So a newer kernel might fix your problem for you.

> 
> 
> >  If the write is failing, then you want my bad-block-log patches - only they
> >  aren't really finished yet and certainly aren't tested very well.  I really
> >  should get back to those.
> 
> Interesting -- I'm not familiar with them; where would I find these patches? And what would they do -- just allow the bad blocks (even on writes), and keep the drive in the array? That's all I'm really after, in this case, I think.

I posted them to the list for review a few months ago and haven't got back to
them.

http://www.spinics.net/lists/raid/msg28813.html

I wouldn't recommend using them until they've seen more review and testing.

NeilBrown



> 
> Thanks!
> Michael
> 
> 
> 
> Syslog from the failure of the first drive:
> 
> [syslog trimmed -- the full ata13/sdm error log and RAID5 conf printouts appear in Michael's original message below]

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: 3-way mirrors
@ 2010-09-08  6:16 Michael Sallaway
  2010-09-08  6:40 ` Neil Brown
  0 siblings, 1 reply; 23+ messages in thread
From: Michael Sallaway @ 2010-09-08  6:16 UTC (permalink / raw)
  To: Neil Brown; +Cc: linux-raid


>  -------Original Message-------
>  From: Neil Brown <neilb@suse.de>
>  To: Michael Sallaway <michael@sallaway.com>
>  Cc: linux-raid@vger.kernel.org
>  Subject: Re: 3-way mirrors
>  Sent: 08 Sep '10 06:02
>  
>  Hmm.... Drive B shouldn't be ejected from the array for a read error.  md
>  should calculate the data for both A and B from the other devices and then
>  write that to A and B.
>  If the write fails, only then should it kick B from the array.  Is that what
>  is happening?
>  
>  i.e. do you see messages like:
>     read error corrected
>     read error not correctable
>     read error NOT corrected
>  
>  in the kernel logs??


The logs for the relevant section are below, at the bottom -- it's a "read error not correctable". So I'm guessing it's also failing a write, although I can't see the ATA error handling mentioning any writes -- it all looks like reads??


>  If the write is failing, then you want my bad-block-log patches - only they
>  aren't really finished yet and certainly aren't tested very well.  I really
>  should get back to those.

Interesting -- I'm not familiar with them; where would I find these patches? And what would they do -- just allow the bad blocks (even on writes), and keep the drive in the array? That's all I'm really after, in this case, I think.

Thanks!
Michael



Syslog from the failure of the first drive:

Sep  7 09:31:24 lechuck kernel: [51912.039892] ata13.00: exception Emask 0x0 SAct 0x1ff SErr 0x0 action 0x0
Sep  7 09:31:24 lechuck kernel: [51912.048227] ata13.00: irq_stat 0x40000008
Sep  7 09:31:24 lechuck kernel: [51912.056685] ata13.00: failed command: READ FPDMA QUEUED
Sep  7 09:31:24 lechuck kernel: [51912.065055] ata13.00: cmd 60/d8:08:00:20:d9/00:00:5d:00:00/40 tag 1 ncq 110592 in
Sep  7 09:31:24 lechuck kernel: [51912.065061]          res 51/40:35:a3:20:d9/00:00:5d:00:00/40 Emask 0x409 (media error) <F>
Sep  7 09:31:25 lechuck kernel: [51912.098113] ata13.00: status: { DRDY ERR }
Sep  7 09:31:25 lechuck kernel: [51912.106705] ata13.00: error: { UNC }
Sep  7 09:31:25 lechuck kernel: [51912.128027] ata13.00: configured for UDMA/133
Sep  7 09:31:25 lechuck kernel: [51912.128054] ata13: EH complete
Sep  7 09:31:28 lechuck kernel: [51915.216232] ata13.00: exception Emask 0x0 SAct 0x1ff SErr 0x0 action 0x0
Sep  7 09:31:28 lechuck kernel: [51915.224757] ata13.00: irq_stat 0x40000008
Sep  7 09:31:28 lechuck kernel: [51915.233283] ata13.00: failed command: READ FPDMA QUEUED
Sep  7 09:31:28 lechuck kernel: [51915.241660] ata13.00: cmd 60/d8:38:00:20:d9/00:00:5d:00:00/40 tag 7 ncq 110592 in
Sep  7 09:31:28 lechuck kernel: [51915.241662]          res 41/40:35:a3:20:d9/00:00:5d:00:00/40 Emask 0x409 (media error) <F>
Sep  7 09:31:28 lechuck kernel: [51915.275603] ata13.00: status: { DRDY ERR }
Sep  7 09:31:28 lechuck kernel: [51915.284267] ata13.00: error: { UNC }
Sep  7 09:31:28 lechuck kernel: [51915.305722] ata13.00: configured for UDMA/133
Sep  7 09:31:28 lechuck kernel: [51915.305746] ata13: EH complete
Sep  7 09:31:30 lechuck kernel: [51917.992164] ata13.00: exception Emask 0x0 SAct 0x1ff SErr 0x0 action 0x0
Sep  7 09:31:30 lechuck kernel: [51918.000791] ata13.00: irq_stat 0x40000008
Sep  7 09:31:30 lechuck kernel: [51918.009631] ata13.00: failed command: READ FPDMA QUEUED
Sep  7 09:31:30 lechuck kernel: [51918.018303] ata13.00: cmd 60/d8:08:00:20:d9/00:00:5d:00:00/40 tag 1 ncq 110592 in
Sep  7 09:31:30 lechuck kernel: [51918.018305]          res 41/40:35:a3:20:d9/00:00:5d:00:00/40 Emask 0x409 (media error) <F>
Sep  7 09:31:30 lechuck kernel: [51918.054117] ata13.00: status: { DRDY ERR }
Sep  7 09:31:30 lechuck kernel: [51918.062808] ata13.00: error: { UNC }
Sep  7 09:31:30 lechuck kernel: [51918.084521] ata13.00: configured for UDMA/133
Sep  7 09:31:30 lechuck kernel: [51918.084547] ata13: EH complete
Sep  7 09:31:33 lechuck kernel: [51920.956122] ata13.00: exception Emask 0x0 SAct 0x1ff SErr 0x0 action 0x0
Sep  7 09:31:33 lechuck kernel: [51920.964858] ata13.00: irq_stat 0x40000008
Sep  7 09:31:33 lechuck kernel: [51920.973829] ata13.00: failed command: READ FPDMA QUEUED
Sep  7 09:31:33 lechuck kernel: [51920.982587] ata13.00: cmd 60/d8:38:00:20:d9/00:00:5d:00:00/40 tag 7 ncq 110592 in
Sep  7 09:31:33 lechuck kernel: [51920.982589]          res 41/40:35:a3:20:d9/00:00:5d:00:00/40 Emask 0x409 (media error) <F>
Sep  7 09:31:33 lechuck kernel: [51921.017401] ata13.00: status: { DRDY ERR }
Sep  7 09:31:33 lechuck kernel: [51921.026134] ata13.00: error: { UNC }
Sep  7 09:31:33 lechuck kernel: [51921.048656] ata13.00: configured for UDMA/133
Sep  7 09:31:33 lechuck kernel: [51921.048680] ata13: EH complete
Sep  7 09:31:37 lechuck kernel: [51924.153414] ata13.00: exception Emask 0x0 SAct 0x1ff SErr 0x0 action 0x0
Sep  7 09:31:37 lechuck kernel: [51924.162178] ata13.00: irq_stat 0x40000008
Sep  7 09:31:37 lechuck kernel: [51924.162182] ata13.00: failed command: READ FPDMA QUEUED
Sep  7 09:31:37 lechuck kernel: [51924.162189] ata13.00: cmd 60/d8:08:00:20:d9/00:00:5d:00:00/40 tag 1 ncq 110592 in
Sep  7 09:31:37 lechuck kernel: [51924.162190]          res 41/40:35:a3:20:d9/00:00:5d:00:00/40 Emask 0x409 (media error) <F>
Sep  7 09:31:37 lechuck kernel: [51924.162193] ata13.00: status: { DRDY ERR }
Sep  7 09:31:37 lechuck kernel: [51924.162195] ata13.00: error: { UNC }
Sep  7 09:31:37 lechuck kernel: [51924.175348] ata13.00: configured for UDMA/133
Sep  7 09:31:37 lechuck kernel: [51924.175374] ata13: EH complete
Sep  7 09:31:39 lechuck kernel: [51927.005666] ata13.00: exception Emask 0x0 SAct 0x1ff SErr 0x0 action 0x0
Sep  7 09:31:39 lechuck kernel: [51927.014384] ata13.00: irq_stat 0x40000008
Sep  7 09:31:39 lechuck kernel: [51927.023299] ata13.00: failed command: READ FPDMA QUEUED
Sep  7 09:31:39 lechuck kernel: [51927.031949] ata13.00: cmd 60/d8:38:00:20:d9/00:00:5d:00:00/40 tag 7 ncq 110592 in
Sep  7 09:31:39 lechuck kernel: [51927.031951]          res 41/40:35:a3:20:d9/00:00:5d:00:00/40 Emask 0x409 (media error) <F>
Sep  7 09:31:39 lechuck kernel: [51927.066322] ata13.00: status: { DRDY ERR }
Sep  7 09:31:39 lechuck kernel: [51927.074946] ata13.00: error: { UNC }
Sep  7 09:31:40 lechuck kernel: [51927.096349] ata13.00: configured for UDMA/133
Sep  7 09:31:40 lechuck kernel: [51927.096393] sd 12:0:0:0: [sdm] Unhandled sense code
Sep  7 09:31:40 lechuck kernel: [51927.096396] sd 12:0:0:0: [sdm] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Sep  7 09:31:40 lechuck kernel: [51927.096401] sd 12:0:0:0: [sdm] Sense Key : Medium Error [current] [descriptor]
Sep  7 09:31:40 lechuck kernel: [51927.096406] Descriptor sense data with sense descriptors (in hex):
Sep  7 09:31:40 lechuck kernel: [51927.096409]         72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
Sep  7 09:31:40 lechuck kernel: [51927.096420]         5d d9 20 a3
Sep  7 09:31:40 lechuck kernel: [51927.096425] sd 12:0:0:0: [sdm] Add. Sense: Unrecovered read error - auto reallocate failed
Sep  7 09:31:40 lechuck kernel: [51927.096431] sd 12:0:0:0: [sdm] CDB: Read(10): 28 00 5d d9 20 00 00 00 d8 00
Sep  7 09:31:40 lechuck kernel: [51927.096442] end_request: I/O error, dev sdm, sector 1574510755
Sep  7 09:31:40 lechuck kernel: [51927.104975] raid5:md10: read error not correctable (sector 1574510752 on sdm).
Sep  7 09:31:40 lechuck kernel: [51927.104985] raid5: Disk failure on sdm, disabling device.
Sep  7 09:31:40 lechuck kernel: [51927.104989] raid5: Operation continuing on 10 devices.
Sep  7 09:31:40 lechuck kernel: [51927.122210] raid5:md10: read error not correctable (sector 1574510760 on sdm).
Sep  7 09:31:40 lechuck kernel: [51927.122214] raid5:md10: read error not correctable (sector 1574510768 on sdm).
Sep  7 09:31:40 lechuck kernel: [51927.122218] raid5:md10: read error not correctable (sector 1574510776 on sdm).
Sep  7 09:31:40 lechuck kernel: [51927.122222] raid5:md10: read error not correctable (sector 1574510784 on sdm).
Sep  7 09:31:40 lechuck kernel: [51927.122225] raid5:md10: read error not correctable (sector 1574510792 on sdm).
Sep  7 09:31:40 lechuck kernel: [51927.122229] raid5:md10: read error not correctable (sector 1574510800 on sdm).
Sep  7 09:31:40 lechuck kernel: [51927.122242] ata13: EH complete
Sep  7 09:31:40 lechuck kernel: [51927.142926] md: md10: recovery done.
Sep  7 09:31:40 lechuck mdadm[3840]: Fail event detected on md device /dev/md10, component device /dev/sdm
Sep  7 09:31:40 lechuck kernel: [51927.344026] RAID5 conf printout:
Sep  7 09:31:40 lechuck kernel: [51927.344031]  --- rd:12 wd:10
Sep  7 09:31:40 lechuck kernel: [51927.344034]  disk 0, o:1, dev:sdf
Sep  7 09:31:40 lechuck kernel: [51927.344037]  disk 1, o:1, dev:sdb
Sep  7 09:31:40 lechuck kernel: [51927.344039]  disk 2, o:1, dev:sda
Sep  7 09:31:40 lechuck kernel: [51927.344042]  disk 3, o:1, dev:sdc
Sep  7 09:31:40 lechuck kernel: [51927.344044]  disk 4, o:1, dev:sdj
Sep  7 09:31:40 lechuck kernel: [51927.344047]  disk 5, o:1, dev:sdi
Sep  7 09:31:40 lechuck kernel: [51927.344049]  disk 6, o:1, dev:sdp
Sep  7 09:31:40 lechuck kernel: [51927.344052]  disk 7, o:1, dev:sdn
Sep  7 09:31:40 lechuck kernel: [51927.344054]  disk 8, o:1, dev:sdo
Sep  7 09:31:40 lechuck kernel: [51927.344057]  disk 9, o:0, dev:sdm
Sep  7 09:31:40 lechuck kernel: [51927.344059]  disk 10, o:1, dev:sdk
Sep  7 09:31:40 lechuck kernel: [51927.344062]  disk 11, o:1, dev:sdl
Sep  7 09:31:40 lechuck kernel: [51927.344064] RAID5 conf printout:
Sep  7 09:31:40 lechuck kernel: [51927.344066]  --- rd:12 wd:10
Sep  7 09:31:40 lechuck kernel: [51927.344068]  disk 0, o:1, dev:sdf
Sep  7 09:31:40 lechuck kernel: [51927.344070]  disk 1, o:1, dev:sdb
Sep  7 09:31:40 lechuck kernel: [51927.344073]  disk 2, o:1, dev:sda
Sep  7 09:31:40 lechuck kernel: [51927.344075]  disk 3, o:1, dev:sdc
Sep  7 09:31:40 lechuck kernel: [51927.344077]  disk 4, o:1, dev:sdj
Sep  7 09:31:40 lechuck kernel: [51927.344080]  disk 5, o:1, dev:sdi
Sep  7 09:31:40 lechuck kernel: [51927.344082]  disk 6, o:1, dev:sdp
Sep  7 09:31:40 lechuck kernel: [51927.344084]  disk 7, o:1, dev:sdn
Sep  7 09:31:40 lechuck kernel: [51927.344087]  disk 8, o:1, dev:sdo
Sep  7 09:31:40 lechuck kernel: [51927.344089]  disk 9, o:0, dev:sdm
Sep  7 09:31:40 lechuck kernel: [51927.344091]  disk 10, o:1, dev:sdk
Sep  7 09:31:40 lechuck kernel: [51927.344093]  disk 11, o:1, dev:sdl
Sep  7 09:31:40 lechuck kernel: [51927.344095] RAID5 conf printout:
Sep  7 09:31:40 lechuck kernel: [51927.344097]  --- rd:12 wd:10
Sep  7 09:31:40 lechuck kernel: [51927.344100]  disk 0, o:1, dev:sdf
Sep  7 09:31:40 lechuck kernel: [51927.344102]  disk 1, o:1, dev:sdb
Sep  7 09:31:40 lechuck kernel: [51927.344104]  disk 2, o:1, dev:sda
Sep  7 09:31:40 lechuck kernel: [51927.344106]  disk 3, o:1, dev:sdc
Sep  7 09:31:40 lechuck kernel: [51927.344109]  disk 4, o:1, dev:sdj
Sep  7 09:31:40 lechuck kernel: [51927.344111]  disk 5, o:1, dev:sdi
Sep  7 09:31:40 lechuck kernel: [51927.344113]  disk 6, o:1, dev:sdp
Sep  7 09:31:40 lechuck kernel: [51927.344116]  disk 7, o:1, dev:sdn
Sep  7 09:31:40 lechuck kernel: [51927.344118]  disk 8, o:1, dev:sdo
Sep  7 09:31:40 lechuck kernel: [51927.344120]  disk 9, o:0, dev:sdm
Sep  7 09:31:40 lechuck kernel: [51927.344122]  disk 10, o:1, dev:sdk
Sep  7 09:31:40 lechuck kernel: [51927.344125]  disk 11, o:1, dev:sdl
Sep  7 09:31:40 lechuck kernel: [51927.400014] RAID5 conf printout:
Sep  7 09:31:40 lechuck kernel: [51927.400017]  --- rd:12 wd:10
Sep  7 09:31:40 lechuck kernel: [51927.400020]  disk 0, o:1, dev:sdf
Sep  7 09:31:40 lechuck kernel: [51927.400022]  disk 1, o:1, dev:sdb
Sep  7 09:31:40 lechuck kernel: [51927.400025]  disk 2, o:1, dev:sda
Sep  7 09:31:40 lechuck kernel: [51927.400027]  disk 3, o:1, dev:sdc
Sep  7 09:31:40 lechuck kernel: [51927.400029]  disk 4, o:1, dev:sdj
Sep  7 09:31:40 lechuck kernel: [51927.400032]  disk 5, o:1, dev:sdi
Sep  7 09:31:40 lechuck kernel: [51927.400034]  disk 6, o:1, dev:sdp
Sep  7 09:31:40 lechuck kernel: [51927.400036]  disk 7, o:1, dev:sdn
Sep  7 09:31:40 lechuck kernel: [51927.400039]  disk 8, o:1, dev:sdo
Sep  7 09:31:40 lechuck kernel: [51927.400041]  disk 10, o:1, dev:sdk
Sep  7 09:31:40 lechuck kernel: [51927.400043]  disk 11, o:1, dev:sdl
Sep  7 09:31:40 lechuck kernel: [51927.400138] md: recovery of RAID array md10
Sep  7 09:31:40 lechuck kernel: [51927.400141] md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
Sep  7 09:31:40 lechuck kernel: [51927.400145] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
Sep  7 09:31:40 lechuck kernel: [51927.400155] md: using 128k window, over a total of 1465138496 blocks.
Sep  7 09:31:40 lechuck kernel: [51927.400159] md: resuming recovery of md10 from checkpoint.
Sep  7 09:31:40 lechuck mdadm[3840]: RebuildFinished event detected on md device /dev/md10, component device  mismatches found: 477544
Sep  7 09:31:40 lechuck mdadm[3840]: RebuildStarted event detected on md device /dev/md10
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: 3-way mirrors
  2010-09-08  5:45 Michael Sallaway
@ 2010-09-08  6:02 ` Neil Brown
  0 siblings, 0 replies; 23+ messages in thread
From: Neil Brown @ 2010-09-08  6:02 UTC (permalink / raw)
  To: Michael Sallaway; +Cc: linux-raid

On Wed, 08 Sep 2010 05:45:41 +0000
"Michael Sallaway" <michael@sallaway.com> wrote:

> 
> >  -------Original Message-------
> >  From: Neil Brown <neilb@suse.de>
> >  To: Michael Sallaway <michael@sallaway.com>
> >  Cc: linux-raid@vger.kernel.org
> >  Subject: Re: 3-way mirrors
> >  Sent: 08 Sep '10 04:16
> 
> >  > Interesting... will this also work for a rebuild/recovery? If so, how do I start a rebuild from a particular location? (do I just write the sync_min sector before adding the replacement drive to the array, and it will start from there when I add it?)
> >  
> >  Why would you want to?
> 
> (My apologies for hijacking the email thread, I only meant it as a side question!)
> 
> The reason relates to the question I posted yesterday -- I have a 12-drive RAID6 array with 3 drives that have some bad sectors at varying locations. I planned to swap out one drive with a new one, let it rebuild, then do the same for the other 2. However, when I replace and rebuild drive A, drive B gets read errors and falls out of the array (at about 50% through), but recovery continues. At the 60% mark drive C also gets read errors and falls out, leaving only 9 working devices, so recovery is abandoned (even though drive B has valid data at that location, so the rebuild could have continued).

Hmm.... Drive B shouldn't be ejected from the array for a read error.  md
should calculate the data for both A and B from the other devices and then
write that to A and B.
If the write fails, only then should it kick B from the array.  Is that what
is happening?

i.e. do you see messages like:
   read error corrected
   read error not correctable
   read error NOT corrected

in the kernel logs??

If the write is failing, then you want my bad-block-log patches - only they
aren't really finished yet and certainly aren't tested very well.  I really
should get back to those.

NeilBrown


> 
> One solution I thought of (and please, suggest others!) was to recover 55% of the array onto the new drive (A), and then stop recovery somehow. Then forcibly add drive B back into the array, and keep recovering, so that when it hits the 60% mark, even though drive C fails, it can still get parity data and recover using drive B.
> 
> It sounds crazy, I know, but can't think of a better solution. If you have one, please suggest it! :-)
> 
> 
> > You can add a new device entirely by writing to sysfs files.  In this case
> > you can set the 'recovery_start' for that device.  This tells md that it has
> > already recovered some of the array.
> 
> Interesting, I think this is exactly what I'm after. Is this documented somewhere, or can you give me some pointers as to where to look to find more information/documentation on the sysfs files and what they do, etc.?
> 
> Thanks!
> Michael


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: 3-way mirrors
@ 2010-09-08  5:45 Michael Sallaway
  2010-09-08  6:02 ` Neil Brown
  0 siblings, 1 reply; 23+ messages in thread
From: Michael Sallaway @ 2010-09-08  5:45 UTC (permalink / raw)
  To: Neil Brown; +Cc: linux-raid


>  -------Original Message-------
>  From: Neil Brown <neilb@suse.de>
>  To: Michael Sallaway <michael@sallaway.com>
>  Cc: linux-raid@vger.kernel.org
>  Subject: Re: 3-way mirrors
>  Sent: 08 Sep '10 04:16

>  > Interesting... will this also work for a rebuild/recovery? If so, how do I start a rebuild from a particular location? (do I just write the sync_min sector before adding the replacement drive to the array, and it will start from there when I add it?)
>  
>  Why would you want to?

(My apologies for hijacking the email thread, I only meant it as a side question!)

The reason relates to the question I posted yesterday -- I have a 12-drive RAID6 array with 3 drives that have some bad sectors at varying locations. I planned to swap out one drive with a new one, let it rebuild, then do the same for the other 2. However, when I replace and rebuild drive A, drive B gets read errors and falls out of the array (at about 50% through), but recovery continues. At the 60% mark drive C also gets read errors and falls out, leaving only 9 working devices, so recovery is abandoned (even though drive B has valid data at that location, so the rebuild could have continued).

One solution I thought of (and please, suggest others!) was to recover 55% of the array onto the new drive (A), and then stop recovery somehow. Then forcibly add drive B back into the array, and keep recovering, so that when it hits the 60% mark, even though drive C fails, it can still get parity data and recover using drive B.

It sounds crazy, I know, but can't think of a better solution. If you have one, please suggest it! :-)


> You can add a new device entirely by writing to sysfs files.  In this case
> you can set the 'recovery_start' for that device.  This tells md that it has
> already recovered some of the array.

Interesting, I think this is exactly what I'm after. Is this documented somewhere, or can you give me some pointers as to where to look to find more information/documentation on the sysfs files and what they do, etc.?

Thanks!
Michael

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: 3-way mirrors
  2010-09-08  3:58 Michael Sallaway
@ 2010-09-08  4:16 ` Neil Brown
  0 siblings, 0 replies; 23+ messages in thread
From: Neil Brown @ 2010-09-08  4:16 UTC (permalink / raw)
  To: Michael Sallaway; +Cc: linux-raid

On Wed, 08 Sep 2010 03:58:52 +0000
"Michael Sallaway" <michael@sallaway.com> wrote:

> >  From: Neil Brown <neilb@suse.de>
> >  Subject: Re: 3-way mirrors
> >  Sent: 07 Sep '10 22:01
> >  
> >  This is already possible via the sync_min and sync_max sysfs files.
> >  Write a number of sectors to sync_max and a lower number to sync_min.
> >  Then write 'repair' to 'sync_action'.
> >  When sync_completed reaches sync_max, the repair will pause.
> >  You can then let it continue by writing a larger number to sync_max, or tell
> >  it to finish by writing 'idle' to 'sync_action'.
> 
> Interesting... will this also work for a rebuild/recovery? If so, how do I start a rebuild from a particular location? (do I just write the sync_min sector before adding the replacement drive to the array, and it will start from there when I add it?)

Why would you want to?

sync_min is only honoured when you request a check/repair operation.  When md
determines a resync or recovery is needed, it starts the where it needs to
start from, which is normally the beginning.

You can add a new device entirely by writing to sysfs files.  In this case
you can set the 'recovery_start' for that device.  This tells md that it has
already recovered some of the array.
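
To make the sysfs side concrete, a small sketch (md0 and sdq are
placeholders, the numbers are arbitrary, and the attribute names are
worth checking against Documentation/md.txt for your kernel):

    #include <stdio.h>
    #include <stdlib.h>

    /* write one value to one md sysfs attribute */
    static void sysfs_write(const char *path, const char *val)
    {
            FILE *f = fopen(path, "w");

            if (!f || fprintf(f, "%s\n", val) < 0 || fclose(f) != 0) {
                    perror(path);
                    exit(1);
            }
    }

    int main(void)
    {
            /* repair only a window of the array, as described above */
            sysfs_write("/sys/block/md0/md/sync_min", "1000000");
            sysfs_write("/sys/block/md0/md/sync_max", "2000000");
            sysfs_write("/sys/block/md0/md/sync_action", "repair");

            /*
             * When adding a device by hand through sysfs (the rest of
             * that sequence is omitted here), recovery_start tells md
             * how many sectors of it are already recovered.
             */
            sysfs_write("/sys/block/md0/md/dev-sdq/recovery_start",
                        "1000000000");
            return 0;
    }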

> 
> Are all sector counts in terms of a drive sector position, or an array sector position?

For raid10, the sector counts are array sector position.
For raid5/raid6, the sector counts are drive sector position with data_offset
subtracted (so they start from 0).

For raid1, both of the above descriptions produce the same answer, so both
are valid descriptions.

NeilBrown


> 
> Thanks,
> Michael


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: 3-way mirrors
@ 2010-09-08  3:58 Michael Sallaway
  2010-09-08  4:16 ` Neil Brown
  0 siblings, 1 reply; 23+ messages in thread
From: Michael Sallaway @ 2010-09-08  3:58 UTC (permalink / raw)
  To: Neil Brown; +Cc: linux-raid

>  From: Neil Brown <neilb@suse.de>
>  Subject: Re: 3-way mirrors
>  Sent: 07 Sep '10 22:01
>  
>  This is already possible via the sync_min and sync_max sysfs files.
>  Write a number of sectors to sync_max and a lower number to sync_min.
>  Then write 'repair' to 'sync_action'.
>  When sync_completed reaches sync_max, the repair will pause.
>  You can then let it continue by writing a larger number to sync_max, or tell
>  it to finish by writing 'idle' to 'sync_action'.

Interesting... will this also work for a rebuild/recovery? If so, how do I start a rebuild from a particular location? (do I just write the sync_min sector before adding the replacement drive to the array, and it will start from there when I add it?)

Are all sector counts in terms of a drive sector position, or an array sector position?

Thanks,
Michael

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2010-09-28 16:42 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-09-07 14:19 3-way mirrors George Spelvin
2010-09-07 16:07 ` Iordan Iordanov
2010-09-07 18:49   ` George Spelvin
2010-09-07 19:55     ` Keld Jørn Simonsen
2010-09-07 18:31 ` Aryeh Gregor
2010-09-07 19:02   ` George Spelvin
2010-09-08 22:28     ` Bill Davidsen
2010-09-07 22:01 ` Neil Brown
2010-09-08  1:33   ` Neil Brown
2010-09-08 14:52   ` George Spelvin
2010-09-08 23:04     ` Neil Brown
2010-09-08  9:40 ` RAID mismatches (and reporting thereof) Tim Small
2010-09-08 12:35   ` George Spelvin
2010-09-28 16:42 ` 3-way mirrors Tim Small
2010-09-08  3:58 Michael Sallaway
2010-09-08  4:16 ` Neil Brown
2010-09-08  5:45 Michael Sallaway
2010-09-08  6:02 ` Neil Brown
2010-09-08  6:16 Michael Sallaway
2010-09-08  6:40 ` Neil Brown
2010-09-08  9:06   ` Tim Small
2010-09-08  7:01 Michael Sallaway
2010-09-08  9:11 ` Tim Small
