* [BUG 2.6.32] md/raid1: barrier disabling does not work correctly in all cases
       [not found] <AANLkTimp7eGLu5UEr-wH9MVSGrQWA3PKMakW74_48gk7@mail.gmail.com>
@ 2011-01-20 16:52 ` Paul Clements
  2011-01-20 20:25   ` NeilBrown
  0 siblings, 1 reply; 8+ messages in thread
From: Paul Clements @ 2011-01-20 16:52 UTC (permalink / raw)
  To: linux-raid

I'm having a problem with the 2.6.32 kernel (RHEL 6, actually) and the
barrier disabling code in raid1.

The gist of the problem is that when an fsync is done directly to
/dev/md0 (fsck.ext4 performs an fsync), the kernel
(blkdev_issue_flush, specifically) translates this into a zero-length
write with a barrier attached. raid1 of course sends this down to its
component devices, which then return EOPNOTSUPP (in my case, a Xen
block device, which doesn't support barriers). raid1 then strips the
BIO_RW_BARRIER flag off and retries the write.

Apparently, a zero-length write (bio) without a barrier makes some/all
block drivers very unhappy (IDE, cciss, and xen-blkfront have been
tested and all failed on this). I think these zero-length writes must
normally get filtered out and discarded by the filesystem/block layer
before drivers ever see them, but in this particular case, where md is
submitting the write directly to the underlying component driver, it
doesn't get filtered and causes -EIO to be returned (the disk gets
marked failed, the mirror breaks, chaos ensues...).

I can do an obvious hack, which is to assume all the time that
barriers don't work. That is, mark mddev->barriers_work as 0 or just
disable that check:

--- drivers/md/raid1.c.PRISTINE 2011-01-14 14:51:56.959809669 -0500
+++ drivers/md/raid1.c  2011-01-20 10:17:11.001701186 -0500
@@ -819,6 +833,7 @@ static int make_request(mddev_t *mddev,
                finish_wait(&conf->wait_barrier, &w);
        }
-        if (unlikely(!mddev->barriers_work &&
+       if ((
                     bio_rw_flagged(bio, BIO_RW_BARRIER))) {
                if (rw == WRITE)
                        md_write_end(mddev);

That fixes things, but does Neil or someone else have an idea how best
to fix this? Could we specifically just not retry a zero-length
barrier write? I think that would fix it, but is kind of a hack too.

I know barriers are being pulled out of the kernel, so this isn't a
problem for recent kernels, but as 2.6.32 is a long-term support
kernel, this may be something that others run into and will want
fixed.

Thanks,
Paul
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

* Re: [BUG 2.6.32] md/raid1: barrier disabling does not work correctly in all cases
  2011-01-20 16:52 ` [BUG 2.6.32] md/raid1: barrier disabling does not work correctly in all cases Paul Clements
@ 2011-01-20 20:25   ` NeilBrown
  2011-01-20 20:44     ` Paul Clements
  2011-01-26 13:55     ` Paul Clements
  0 siblings, 2 replies; 8+ messages in thread
From: NeilBrown @ 2011-01-20 20:25 UTC (permalink / raw)
  To: Paul Clements; +Cc: linux-raid

On Thu, 20 Jan 2011 11:52:32 -0500 Paul Clements <paul.clements@us.sios.com>
wrote:

> I'm having a problem with the 2.6.32 kernel (RHEL 6, actually) and the
> barrier disabling code in raid1.
> 
> The gist of the problem is that when an fsync is done directly to
> /dev/md0 (fsck.ext4 performs an fsync), the kernel
> (blkdev_issue_flush, specifically) translates this into a zero-length
> write with a barrier attached. raid1 of course sends this down to its
> component devices, which then return EOPNOTSUPP (in my case, a Xen
> block device, which doesn't support barriers). raid1 then strips the
> BIO_RW_BARRIER flag off and retries the write.
> 
> Apparently, a zero-length write (bio) without a barrier makes some/all
> block drivers very unhappy (IDE, cciss, and xen-blkfront have been
> tested and all failed on this). I think these zero-length writes must
> normally get filtered out and discarded by the filesystem/block layer
> before drivers ever see them, but in this particular case, where md is
> submitting the write directly to the underlying component driver, it
> doesn't get filtered and causes -EIO to be returned (the disk gets
> marked failed, the mirror breaks, chaos ensues...).
> 
> I can do an obvious hack, which is to assume all the time that
> barriers don't work. That is, mark mddev->barriers_work as 0 or just
> disable that check:
> 
> --- drivers/md/raid1.c.PRISTINE 2011-01-14 14:51:56.959809669 -0500
> +++ drivers/md/raid1.c  2011-01-20 10:17:11.001701186 -0500
> @@ -819,6 +833,7 @@ static int make_request(mddev_t *mddev,
>                 finish_wait(&conf->wait_barrier, &w);
>         }
> -        if (unlikely(!mddev->barriers_work &&
> +       if ((
>                      bio_rw_flagged(bio, BIO_RW_BARRIER))) {
>                 if (rw == WRITE)
>                         md_write_end(mddev);
> 
> That fixes things, but does Neil or someone else have an idea how best
> to fix this? Could we specifically just not retry a zero-length
> barrier write? I think that would fix it, but is kind of a hack too.
> 
> I know barriers are being pulled out of the kernel, so this isn't a
> problem for recent kernels, but as 2.6.32 is a long-term support
> kernel, this may be something that others run into and will want
> fixed.

Normally the first thing that md/raid1 writes to a member device is the
metadata.  This is written with a barrier write if possible.  If that fails
then barriers_work is cleared, so all barrier writes from the filesystem
(empty or otherwise) will be rejected.

As you are getting an error here, I assume that you are using
non-persistent metadata - correct?

Nonetheless, I think the correct fix is to add a special case for
retrying a zero-length barrier.

Something like this maybe?

NeilBrown


Index: linux-2.6.32.orig/drivers/md/raid1.c
===================================================================
--- linux-2.6.32.orig.orig/drivers/md/raid1.c	2011-01-21 07:23:59.000000000 +1100
+++ linux-2.6.32.orig/drivers/md/raid1.c	2011-01-21 07:24:40.498354896 +1100
@@ -236,14 +236,18 @@ static void raid_end_bio_io(r1bio_t *r1_
 
 	/* if nobody has done the final endio yet, do it now */
 	if (!test_and_set_bit(R1BIO_Returned, &r1_bio->state)) {
+		int err = 0;
 		PRINTK(KERN_DEBUG "raid1: sync end %s on sectors %llu-%llu\n",
 			(bio_data_dir(bio) == WRITE) ? "write" : "read",
 			(unsigned long long) bio->bi_sector,
 			(unsigned long long) bio->bi_sector +
 				(bio->bi_size >> 9) - 1);
 
-		bio_endio(bio,
-			test_bit(R1BIO_Uptodate, &r1_bio->state) ? 0 : -EIO);
+		if (test_bit(R1BIO_BarrierRetry, &r1_bio->state))
+			err = -EOPNOTSUPP;
+		else if (!test_bit(R1BIO_Uptodate, &r1_bio->state))
+			err = -EIO;
+		bio_endio(bio, err);
 	}
 	free_r1bio(r1_bio);
 }
@@ -1607,6 +1611,11 @@ static void raid1d(mddev_t *mddev)
 			 */
 			int i;
 			const bool do_sync = bio_rw_flagged(r1_bio->master_bio, BIO_RW_SYNCIO);
+			if (r1_bio->master_bio->bi_size == 0) {
+				/* cannot retry empty barriers, just fail */
+				raid_end_bio_io(r1_bio);
+				continue;
+			}
 			clear_bit(R1BIO_BarrierRetry, &r1_bio->state);
 			clear_bit(R1BIO_Barrier, &r1_bio->state);
 			for (i=0; i < conf->raid_disks; i++)

* Re: [BUG 2.6.32] md/raid1: barrier disabling does not work correctly in all cases
  2011-01-20 20:25   ` NeilBrown
@ 2011-01-20 20:44     ` Paul Clements
  2011-01-26 13:55     ` Paul Clements
  1 sibling, 0 replies; 8+ messages in thread
From: Paul Clements @ 2011-01-20 20:44 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

On Thu, Jan 20, 2011 at 3:25 PM, NeilBrown <neilb@suse.de> wrote:

> Normally the first thing that md/raid1 writes to a member device is the
> metadata.  This is written with a barrier write if possible.  If that fails
> then barriers_work is cleared, so all barrier writes from the filesystem
> (empty or otherwise) will be rejected.

Yes, I noticed that, and assume that's why more people haven't seen
this problem...

> As you are getting an error here I assume that you are using non-persistent
> metadata - correct?

Yep, that's right...

> Nonetheless, I think the correct fix is to add a special case for retrying a
> zero-length barrier.

> Something like this maybe?

Thanks Neil. I will try this patch.

--
Paul

* Re: [BUG 2.6.32] md/raid1: barrier disabling does not work correctly in all cases
  2011-01-20 20:25   ` NeilBrown
  2011-01-20 20:44     ` Paul Clements
@ 2011-01-26 13:55     ` Paul Clements
  2011-02-01 20:45       ` Paul Clements
  1 sibling, 1 reply; 8+ messages in thread
From: Paul Clements @ 2011-01-26 13:55 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

[-- Attachment #1: Type: text/plain, Size: 417 bytes --]

On Thu, Jan 20, 2011 at 3:25 PM, NeilBrown <neilb@suse.de> wrote:

> Nonetheless, I think the correct fix is to add a special case for retrying a
> zero-length barrier.
>
> Something like this maybe?

Close, but no cigar...

Attached is a modified patch, which does the extra necessary work
(bitmap_endwrite, md_write_end) on the bio before failing it.

Does this look correct? It seems to work.

Thanks again Neil.

--
Paul

[-- Attachment #2: raid1-2.6.32-barrier-retry-fix-2.diff --]
[-- Type: text/x-patch, Size: 2161 bytes --]

--- linux-2.6.32-71.el6.x86_64/drivers/md/raid1.c.PRISTINE	2011-01-14 14:51:56.959809669 -0500
+++ linux-2.6.32-71.el6.x86_64/drivers/md/raid1.c	2011-01-26 08:44:20.176936238 -0500
@@ -236,14 +236,18 @@ static void raid_end_bio_io(r1bio_t *r1_
 
 	/* if nobody has done the final endio yet, do it now */
 	if (!test_and_set_bit(R1BIO_Returned, &r1_bio->state)) {
+		int err = 0;
 		PRINTK(KERN_DEBUG "raid1: sync end %s on sectors %llu-%llu\n",
 			(bio_data_dir(bio) == WRITE) ? "write" : "read",
 			(unsigned long long) bio->bi_sector,
 			(unsigned long long) bio->bi_sector +
 				(bio->bi_size >> 9) - 1);
 
-		bio_endio(bio,
-			test_bit(R1BIO_Uptodate, &r1_bio->state) ? 0 : -EIO);
+		if (test_bit(R1BIO_BarrierRetry, &r1_bio->state))
+			err = -EOPNOTSUPP;
+		else if (!test_bit(R1BIO_Uptodate, &r1_bio->state))
+			err = -EIO;
+		bio_endio(bio, err);
 	}
 	free_r1bio(r1_bio);
 }
@@ -1623,6 +1627,18 @@ static void raid1d(mddev_t *mddev)
 			 */
 			int i;
 			const bool do_sync = bio_rw_flagged(r1_bio->master_bio, BIO_RW_SYNCIO);
+			if (r1_bio->master_bio->bi_size == 0) {
+				/* cannot retry empty barriers, just fail */
+				bitmap_endwrite(r1_bio->mddev->bitmap,
+					r1_bio->sector,
+					r1_bio->sectors,
+					!test_bit(R1BIO_Degraded, &r1_bio->state),
+					test_bit(R1BIO_BehindIO, &r1_bio->state));
+				md_write_end(r1_bio->mddev);
+
+				raid_end_bio_io(r1_bio);
+				continue;
+			}
 			clear_bit(R1BIO_BarrierRetry, &r1_bio->state);
 			clear_bit(R1BIO_Barrier, &r1_bio->state);
 			for (i=0; i < conf->raid_disks; i++)
@@ -2139,7 +2155,7 @@ static int stop(mddev_t *mddev)
 	/* wait for behind writes to complete */
 	while (bitmap && atomic_read(&bitmap->behind_writes) > 0) {
 		behind_wait++;
-		printk(KERN_INFO "raid1: behind writes in progress on device %s, waiting to stop (%d)\n", mdname(mddev), behind_wait);
+		printk(KERN_INFO "raid1: behind writes (%d) in progress on device %s, waiting to stop (%d)\n", atomic_read(&bitmap->behind_writes), mdname(mddev), behind_wait);
 		set_current_state(TASK_UNINTERRUPTIBLE);
 		schedule_timeout(HZ); /* wait a second */
 		/* need to kick something here to make sure I/O goes? */

* Re: [BUG 2.6.32] md/raid1: barrier disabling does not work correctly in all cases
  2011-01-26 13:55     ` Paul Clements
@ 2011-02-01 20:45       ` Paul Clements
  2011-02-02  0:32         ` NeilBrown
  0 siblings, 1 reply; 8+ messages in thread
From: Paul Clements @ 2011-02-01 20:45 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

On Wed, Jan 26, 2011 at 8:55 AM, Paul Clements
<paul.clements@us.sios.com> wrote:

> Attached is a modified patch, which does the extra necessary work
> (bitmap_endwrite, md_write_end) on the bio before failing it.
>
> Does this look correct? It seems to work.

Well, not quite...it's more complicated. From my reading of the code,
it looks like behind writes and barrier retries just do not work
correctly together. The issue is this:

- With behind writes, we signal the master_bio complete as soon as all
non-write-behind writes are complete.

- With barrier retries, you don't know if you'll need to retry until
you've completed all legs of the write (the last leg to complete might
throw EOPNOTSUPP).

So in the case where the master_bio has been completed, we still try
to do a retry for the leg that failed the barrier (but it's really too
late to retry). In any case, raid1d is touching master_bio (looking at
bi_size and bio_cloning it) during the retry, which causes a panic if
master_bio is already being reused by someone else.

I can't think of a good way to do behind writes and barrier retries
together. Seems we've got to disable behind writes for barriers, or
we've got to disable barrier retries when doing behind writes...

Any thoughts?

--
Paul

* Re: [BUG 2.6.32] md/raid1: barrier disabling does not work correctly in all cases
  2011-02-01 20:45       ` Paul Clements
@ 2011-02-02  0:32         ` NeilBrown
  2011-02-02  1:04           ` Paul Clements
  0 siblings, 1 reply; 8+ messages in thread
From: NeilBrown @ 2011-02-02  0:32 UTC (permalink / raw)
  To: Paul Clements; +Cc: linux-raid

On Tue, 1 Feb 2011 15:45:16 -0500 Paul Clements <paul.clements@us.sios.com>
wrote:

> On Wed, Jan 26, 2011 at 8:55 AM, Paul Clements
> <paul.clements@us.sios.com> wrote:
> 
> > Attached is a modified patch, which does the extra necessary work
> > (bitmap_endwrite, md_write_end) on the bio before failing it.
> >
> > Does this look correct? It seems to work.
> 
> Well, not quite...it's more complicated. From my reading of the code,
> it looks like behind writes and barrier retries just do not work
> correctly together. The issue is this:
> 
> - With behind writes, we signal the master_bio complete as soon as all
> non-write-behind writes are complete.
> 
> - With barrier retries, you don't know if you'll need to retry until
> you've completed all legs of the write (the last leg to complete might
> throw EOPNOTSUPP).
> 
> So in the case where the master_bio has been completed, we still try
> to do a retry for the leg that failed the barrier (but it's really too
> late to retry). In any case, raid1d is touching master_bio (looking at
> bi_size and bio_cloning it) during the retry, which causes a panic if
> master_bio is already being reused by someone else.
> 
> I can't think of a good way to do behind writes and barrier retries
> together. Seems we've got to disable behind writes for barriers, or
> we've got to disable barrier retries when doing behind writes...
> 
> Any thoughts?

I suspect you are right that barriers and behind writes are deeply
incompatible.  I suspect they could be made to work together in some vaguely
sane way, but I suspect it would be a lot of work and not worth the effort.

Disabling behind-writes for all barrier requests would be quite easy, but it
might negate a lot of the value of behind writes.

We could simply ignore the barrier flag on writes to behind-write devices,
but that would risk them being even more inconsistent than they currently can
be, so I doubt that is a good direction - though it is a possibility.

I think the best option is to reject barrier writes if there are any
behind-write devices.  That would be reasonably safe and reasonably
consistent.

So maybe something like this??

NeilBrown

Index: linux-2.6.32.orig/drivers/md/bitmap.c
===================================================================
--- linux-2.6.32.orig.orig/drivers/md/bitmap.c	2009-12-03 14:51:21.000000000 +1100
+++ linux-2.6.32.orig/drivers/md/bitmap.c	2011-02-02 11:31:51.156585883 +1100
@@ -1676,6 +1676,8 @@ int bitmap_create(mddev_t *mddev)
 		pages, bmname(bitmap));
 
 	mddev->bitmap = bitmap;
+	if (bitmap->max_write_behind)
+		mddev->barriers_work = 0;
 
 	mddev->thread->timeout = bitmap->daemon_sleep * HZ;
 

* Re: [BUG 2.6.32] md/raid1: barrier disabling does not work correctly in all cases
  2011-02-02  0:32         ` NeilBrown
@ 2011-02-02  1:04           ` Paul Clements
  2011-02-24 21:04             ` Paul Clements
  0 siblings, 1 reply; 8+ messages in thread
From: Paul Clements @ 2011-02-02  1:04 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

On Tue, Feb 1, 2011 at 7:32 PM, NeilBrown <neilb@suse.de> wrote:
> On Tue, 1 Feb 2011 15:45:16 -0500 Paul Clements <paul.clements@us.sios.com>
> wrote:
> Disabling behind-writes for all barrier requests would be quite easy, but it
> might negate a lot of the value of behind writes.

Agreed...

> I think the best option is to reject barrier writes if there are any
> behind-write devices.  That would be reasonably safe and reasonably
> consistent.

Yeah, I was leaning toward this.

> So maybe something like this??

Yes, I think that would work.

Thanks,
Paul

* Re: [BUG 2.6.32] md/raid1: barrier disabling does not work correctly in all cases
  2011-02-02  1:04           ` Paul Clements
@ 2011-02-24 21:04             ` Paul Clements
  0 siblings, 0 replies; 8+ messages in thread
From: Paul Clements @ 2011-02-24 21:04 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

[-- Attachment #1: Type: text/plain, Size: 959 bytes --]

OK, so I finally got back to this. Just FYI, a slight adjustment to
the patch was needed (bitmap_info now has the max_write_behind field).
It's attached. You need both the barrier-retry-fix-2 patch that I
posted a while back and this one.

Thanks,
Paul



On Tue, Feb 1, 2011 at 8:04 PM, Paul Clements <paul.clements@us.sios.com> wrote:
> On Tue, Feb 1, 2011 at 7:32 PM, NeilBrown <neilb@suse.de> wrote:
>> On Tue, 1 Feb 2011 15:45:16 -0500 Paul Clements <paul.clements@us.sios.com>
>> wrote:
>> Disabling behind-writes for all barrier requests would be quite easy, but it
>> might negate a lot of the value of behind writes.
>
> Agreed...
>
>> I think the best option is to reject barrier writes if there are any
>> behind-write devices.  That would be reasonably safe and reasonably
>> consistent.
>
> Yeah, I was leaning toward this.
>
>> So maybe something like this??
>
> Yes, I think that would work.
>
> Thanks,
> Paul
>

[-- Attachment #2: md-disable-barriers-with-write-behind.diff --]
[-- Type: text/x-patch, Size: 1236 bytes --]

--- ./drivers/md/bitmap.c.PRISTINE	2011-02-23 17:40:26.633270814 -0500
+++ ./drivers/md/bitmap.c	2011-02-23 14:47:38.652434745 -0500
@@ -1298,8 +1298,7 @@ int bitmap_startwrite(struct bitmap *bit
 		if (bw > bitmap->behind_writes_used)
 			bitmap->behind_writes_used = bw;
 
-		PRINTK(KERN_DEBUG "inc write-behind count %d/%d\n",
-		       bw, bitmap->max_write_behind);
+		PRINTK(KERN_DEBUG "inc write-behind count %d\n", bw);
 	}
 
 	while (sectors) {
@@ -1357,8 +1356,8 @@ void bitmap_endwrite(struct bitmap *bitm
 	if (!bitmap) return;
 	if (behind) {
 		atomic_dec(&bitmap->behind_writes);
-		PRINTK(KERN_DEBUG "dec write-behind count %d/%d\n",
-		  atomic_read(&bitmap->behind_writes), bitmap->max_write_behind);
+		PRINTK(KERN_DEBUG "dec write-behind count %d\n",
+		  atomic_read(&bitmap->behind_writes));
 	}
 	if (bitmap->mddev->degraded)
 		/* Never clear bits or update events_cleared when degraded */
@@ -1713,6 +1712,10 @@ int bitmap_create(mddev_t *mddev)
 	if (err)
 		goto error;
 
+	/* write behind and barrier write retries are not compatible */
+	if (mddev->bitmap_info.max_write_behind)
+		mddev->barriers_work = 0;
+
 	bitmap->daemon_lastrun = jiffies;
 	bitmap->chunkshift = ffz(~mddev->bitmap_info.chunksize);
 
