* [md PATCH 0/4] Assorted minor improvements.
@ 2016-11-04  5:46 NeilBrown
  2016-11-04  5:46 ` [md PATCH 1/4] md: perform async updates for metadata where possible NeilBrown
                   ` (3 more replies)
  0 siblings, 4 replies; 10+ messages in thread
From: NeilBrown @ 2016-11-04  5:46 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid

There is no real pattern to these patches except that they are fairly
boring but occasionally useful.

The first allows --add and --remove commands to succeed without
waiting for a metadata write, which is just wasted time.  This can be
useful when adding/removes hundreds of devices on a large RAID10
array.

The next two abort writes that have become pointless.  If a device
fails in a way that causes long retries, this can reduce the total
time for recovery.

The last is a small correctness fix: bitmap_daemon_work() doesn't wait
for writes to complete, so they might still be pending when the next
write is sent, and two writes to the same location might not be
handled properly.  So we insert waits in the rare cases where they are
needed.

Thanks,
NeilBrown

---

NeilBrown (4):
      md: perform async updates for metadata where possible.
      md/raid1: abort delayed writes when device fails.
      md/raid10: abort delayed writes when device fails.
      md/bitmap: Don't write bitmap while earlier writes might be in-flight


 drivers/md/bitmap.c |   27 ++++++++++++++++++++++-----
 drivers/md/md.c     |   16 ++++++++++++----
 drivers/md/raid1.c  |   20 +++++++++++++++-----
 drivers/md/raid10.c |   22 ++++++++++++++++------
 4 files changed, 65 insertions(+), 20 deletions(-)

--
Signature


^ permalink raw reply	[flat|nested] 10+ messages in thread

* [md PATCH 1/4] md: perform async updates for metadata where possible.
  2016-11-04  5:46 [md PATCH 0/4] Assorted minor improvements NeilBrown
@ 2016-11-04  5:46 ` NeilBrown
  2016-11-04  5:46 ` [md PATCH 4/4] md/bitmap: Don't write bitmap while earlier writes might be in-flight NeilBrown
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 10+ messages in thread
From: NeilBrown @ 2016-11-04  5:46 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid

When adding devices to, or removing devices from, an array we need to
update the metadata.  However, we don't need to do this synchronously,
as data integrity doesn't depend on these changes being recorded
instantly.  So avoid the synchronous call to md_update_sb() and just
set a flag so that the thread will do it.

This can reduce the number of updates performed when lots of devices
are being added or removed.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 drivers/md/md.c |   16 ++++++++++++----
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 0354a92636aa..5ea8063587e9 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -2611,8 +2611,10 @@ state_store(struct md_rdev *rdev, const char *buf, size_t len)
 
 			if (err == 0) {
 				md_kick_rdev_from_array(rdev);
-				if (mddev->pers)
-					md_update_sb(mddev, 1);
+				if (mddev->pers) {
+					set_bit(MD_CHANGE_DEVS, &mddev->flags);
+					md_wakeup_thread(mddev->thread);
+				}
 				md_new_event(mddev);
 			}
 		}
@@ -6178,7 +6180,11 @@ static int hot_remove_disk(struct mddev *mddev, dev_t dev)
 		md_cluster_ops->remove_disk(mddev, rdev);
 
 	md_kick_rdev_from_array(rdev);
-	md_update_sb(mddev, 1);
+	set_bit(MD_CHANGE_DEVS, &mddev->flags);
+	if (mddev->thread)
+		md_wakeup_thread(mddev->thread);
+	else
+		md_update_sb(mddev, 1);
 	md_new_event(mddev);
 
 	return 0;
@@ -6243,7 +6249,9 @@ static int hot_add_disk(struct mddev *mddev, dev_t dev)
 
 	rdev->raid_disk = -1;
 
-	md_update_sb(mddev, 1);
+	set_bit(MD_CHANGE_DEVS, &mddev->flags);
+	if (!mddev->thread)
+		md_update_sb(mddev, 1);
 	/*
 	 * Kick recovery, maybe this spare has to be added to the
 	 * array immediately.




* [md PATCH 2/4] md/raid1: abort delayed writes when device fails.
  2016-11-04  5:46 [md PATCH 0/4] Assorted minor improvements NeilBrown
                   ` (2 preceding siblings ...)
  2016-11-04  5:46 ` [md PATCH 3/4] md/raid10: abort delayed writes when device fails NeilBrown
@ 2016-11-04  5:46 ` NeilBrown
  3 siblings, 0 replies; 10+ messages in thread
From: NeilBrown @ 2016-11-04  5:46 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid

When writing to an array with a bitmap enabled, the writes are grouped
in batches which are preceded by an update to the bitmap.

It is quite likely that, if a drive develops a problem which is not
media related, the bitmap write will be the first to report an error
and cause the device to be marked faulty (as the bitmap write is at
the start of a batch).

In this case, there is no point submitting the subsequent writes to the
failed device - that just wastes time.

So re-check the Faulty state of a device before submitting a
delayed write.

This requires keeping the 'rdev', rather than the 'bdev', in the
bio, then swapping in the bdev just before final submission.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 drivers/md/raid1.c |   20 +++++++++++++++-----
 1 file changed, 15 insertions(+), 5 deletions(-)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 86ecf57ec612..9a73634a99a7 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -739,9 +739,14 @@ static void flush_pending_writes(struct r1conf *conf)
 
 		while (bio) { /* submit pending writes */
 			struct bio *next = bio->bi_next;
+			struct md_rdev *rdev = (void*)bio->bi_bdev;
 			bio->bi_next = NULL;
-			if (unlikely((bio_op(bio) == REQ_OP_DISCARD) &&
-			    !blk_queue_discard(bdev_get_queue(bio->bi_bdev))))
+			bio->bi_bdev = rdev->bdev;
+			if (test_bit(Faulty, &rdev->flags)) {
+				bio->bi_error = -EIO;
+				bio_endio(bio);
+			} else if (unlikely((bio_op(bio) == REQ_OP_DISCARD) &&
+					    !blk_queue_discard(bdev_get_queue(bio->bi_bdev))))
 				/* Just ignore it */
 				bio_endio(bio);
 			else
@@ -1013,9 +1018,14 @@ static void raid1_unplug(struct blk_plug_cb *cb, bool from_schedule)
 
 	while (bio) { /* submit pending writes */
 		struct bio *next = bio->bi_next;
+		struct md_rdev *rdev = (void*)bio->bi_bdev;
 		bio->bi_next = NULL;
-		if (unlikely((bio_op(bio) == REQ_OP_DISCARD) &&
-		    !blk_queue_discard(bdev_get_queue(bio->bi_bdev))))
+		bio->bi_bdev = rdev->bdev;
+		if (test_bit(Faulty, &rdev->flags)) {
+			bio->bi_error = -EIO;
+			bio_endio(bio);
+		} else if (unlikely((bio_op(bio) == REQ_OP_DISCARD) &&
+				    !blk_queue_discard(bdev_get_queue(bio->bi_bdev))))
 			/* Just ignore it */
 			bio_endio(bio);
 		else
@@ -1354,7 +1364,7 @@ static void raid1_make_request(struct mddev *mddev, struct bio * bio)
 
 		mbio->bi_iter.bi_sector	= (r1_bio->sector +
 				   conf->mirrors[i].rdev->data_offset);
-		mbio->bi_bdev = conf->mirrors[i].rdev->bdev;
+		mbio->bi_bdev = (void*)conf->mirrors[i].rdev;
 		mbio->bi_end_io	= raid1_end_write_request;
 		bio_set_op_attrs(mbio, op, do_flush_fua | do_sync);
 		mbio->bi_private = r1_bio;




* [md PATCH 3/4] md/raid10: abort delayed writes when device fails.
  2016-11-04  5:46 [md PATCH 0/4] Assorted minor improvements NeilBrown
  2016-11-04  5:46 ` [md PATCH 1/4] md: perform async updates for metadata where possible NeilBrown
  2016-11-04  5:46 ` [md PATCH 4/4] md/bitmap: Don't write bitmap while earlier writes might be in-flight NeilBrown
@ 2016-11-04  5:46 ` NeilBrown
  2016-11-04  5:46 ` [md PATCH 2/4] md/raid1: " NeilBrown
  3 siblings, 0 replies; 10+ messages in thread
From: NeilBrown @ 2016-11-04  5:46 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid

When writing to an array with a bitmap enabled, the writes are grouped
in batches which are preceded by an update to the bitmap.

It is quite likely that, if a drive develops a problem which is not
media related, the bitmap write will be the first to report an error
and cause the device to be marked faulty (as the bitmap write is at
the start of a batch).

In this case, there is no point submitting the subsequent writes to the
failed device - that just wastes time.

So re-check the Faulty state of a device before submitting a
delayed write.

This requires keeping the 'rdev', rather than the 'bdev', in the
bio, then swapping in the bdev just before final submission.

Reported-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
---
 drivers/md/raid10.c |   22 ++++++++++++++++------
 1 file changed, 16 insertions(+), 6 deletions(-)

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index bf6980ad8c00..225d762d61b7 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -855,9 +855,14 @@ static void flush_pending_writes(struct r10conf *conf)
 
 		while (bio) { /* submit pending writes */
 			struct bio *next = bio->bi_next;
+			struct md_rdev *rdev = (void*)bio->bi_bdev;
 			bio->bi_next = NULL;
-			if (unlikely((bio_op(bio) ==  REQ_OP_DISCARD) &&
-			    !blk_queue_discard(bdev_get_queue(bio->bi_bdev))))
+			bio->bi_bdev = rdev->bdev;
+			if (test_bit(Faulty, &rdev->flags)) {
+				bio->bi_error = -EIO;
+				bio_endio(bio);
+			} else if (unlikely((bio_op(bio) ==  REQ_OP_DISCARD) &&
+					    !blk_queue_discard(bdev_get_queue(bio->bi_bdev))))
 				/* Just ignore it */
 				bio_endio(bio);
 			else
@@ -1033,9 +1038,14 @@ static void raid10_unplug(struct blk_plug_cb *cb, bool from_schedule)
 
 	while (bio) { /* submit pending writes */
 		struct bio *next = bio->bi_next;
+		struct md_rdev *rdev = (void*)bio->bi_bdev;
 		bio->bi_next = NULL;
-		if (unlikely((bio_op(bio) ==  REQ_OP_DISCARD) &&
-		    !blk_queue_discard(bdev_get_queue(bio->bi_bdev))))
+		bio->bi_bdev = rdev->bdev;
+		if (test_bit(Faulty, &rdev->flags)) {
+			bio->bi_error = -EIO;
+			bio_endio(bio);
+		} else if (unlikely((bio_op(bio) ==  REQ_OP_DISCARD) &&
+				    !blk_queue_discard(bdev_get_queue(bio->bi_bdev))))
 			/* Just ignore it */
 			bio_endio(bio);
 		else
@@ -1354,7 +1364,7 @@ static void __make_request(struct mddev *mddev, struct bio *bio)
 			mbio->bi_iter.bi_sector	= (r10_bio->devs[i].addr+
 					   choose_data_offset(r10_bio,
 							      rdev));
-			mbio->bi_bdev = rdev->bdev;
+			mbio->bi_bdev = (void*)rdev;
 			mbio->bi_end_io	= raid10_end_write_request;
 			bio_set_op_attrs(mbio, op, do_sync | do_fua);
 			mbio->bi_private = r10_bio;
@@ -1396,7 +1406,7 @@ static void __make_request(struct mddev *mddev, struct bio *bio)
 			mbio->bi_iter.bi_sector	= (r10_bio->devs[i].addr +
 					   choose_data_offset(
 						   r10_bio, rdev));
-			mbio->bi_bdev = rdev->bdev;
+			mbio->bi_bdev = (void*)rdev;
 			mbio->bi_end_io	= raid10_end_write_request;
 			bio_set_op_attrs(mbio, op, do_sync | do_fua);
 			mbio->bi_private = r10_bio;




* [md PATCH 4/4] md/bitmap: Don't write bitmap while earlier writes might be in-flight
  2016-11-04  5:46 [md PATCH 0/4] Assorted minor improvements NeilBrown
  2016-11-04  5:46 ` [md PATCH 1/4] md: perform async updates for metadata where possible NeilBrown
@ 2016-11-04  5:46 ` NeilBrown
  2016-11-05  0:33   ` Shaohua Li
  2016-11-04  5:46 ` [md PATCH 3/4] md/raid10: abort delayed writes when device fails NeilBrown
  2016-11-04  5:46 ` [md PATCH 2/4] md/raid1: " NeilBrown
  3 siblings, 1 reply; 10+ messages in thread
From: NeilBrown @ 2016-11-04  5:46 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid

As we don't wait for writes to complete in bitmap_daemon_work(), they
could still be in-flight when bitmap_unplug() writes again, or when
bitmap_daemon_work() tries to write again.
This can be confusing and could risk the wrong data being written last.

So make sure we wait for old writes to complete before new writes start.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 drivers/md/bitmap.c |   27 ++++++++++++++++++++++-----
 1 file changed, 22 insertions(+), 5 deletions(-)

diff --git a/drivers/md/bitmap.c b/drivers/md/bitmap.c
index 3ceb0c51891e..e224186d767f 100644
--- a/drivers/md/bitmap.c
+++ b/drivers/md/bitmap.c
@@ -416,6 +416,21 @@ static int read_page(struct file *file, unsigned long index,
  * bitmap file superblock operations
  */
 
+/*
+ * bitmap_wait_writes() should be called before writing any bitmap
+ * blocks, to ensure previous writes, particularly from
+ * bitmap_daemon_work(), have completed.
+ */
+static void bitmap_wait_writes(struct bitmap *bitmap)
+{
+	if (bitmap->storage.file)
+		wait_event(bitmap->write_wait,
+			   atomic_read(&bitmap->pending_writes)==0);
+	else
+		md_super_wait(bitmap->mddev);
+}
+
+
 /* update the event counter and sync the superblock to disk */
 void bitmap_update_sb(struct bitmap *bitmap)
 {
@@ -978,6 +993,7 @@ void bitmap_unplug(struct bitmap *bitmap)
 {
 	unsigned long i;
 	int dirty, need_write;
+	int writing = 0;
 
 	if (!bitmap || !bitmap->storage.filemap ||
 	    test_bit(BITMAP_STALE, &bitmap->flags))
@@ -992,15 +1008,15 @@ void bitmap_unplug(struct bitmap *bitmap)
 		need_write = test_and_clear_page_attr(bitmap, i,
 						      BITMAP_PAGE_NEEDWRITE);
 		if (dirty || need_write) {
+			if (!writing)
+				bitmap_wait_writes(bitmap);
 			clear_page_attr(bitmap, i, BITMAP_PAGE_PENDING);
 			write_page(bitmap, bitmap->storage.filemap[i], 0);
+			writing = 1;
 		}
 	}
-	if (bitmap->storage.file)
-		wait_event(bitmap->write_wait,
-			   atomic_read(&bitmap->pending_writes)==0);
-	else
-		md_super_wait(bitmap->mddev);
+	if (writing)
+		bitmap_wait_writes(bitmap);
 
 	if (test_bit(BITMAP_WRITE_ERROR, &bitmap->flags))
 		bitmap_file_kick(bitmap);
@@ -1282,6 +1298,7 @@ void bitmap_daemon_work(struct mddev *mddev)
 	}
 	spin_unlock_irq(&counts->lock);
 
+	bitmap_wait_writes(bitmap);
 	/* Now start writeout on any page in NEEDWRITE that isn't DIRTY.
 	 * DIRTY pages need to be written by bitmap_unplug so it can wait
 	 * for them.




* Re: [md PATCH 4/4] md/bitmap: Don't write bitmap while earlier writes might be in-flight
  2016-11-04  5:46 ` [md PATCH 4/4] md/bitmap: Don't write bitmap while earlier writes might be in-flight NeilBrown
@ 2016-11-05  0:33   ` Shaohua Li
  2016-11-06 22:53     ` NeilBrown
  0 siblings, 1 reply; 10+ messages in thread
From: Shaohua Li @ 2016-11-05  0:33 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

On Fri, Nov 04, 2016 at 04:46:03PM +1100, Neil Brown wrote:
> As we don't wait for writes to complete in bitmap_daemon_work, they
> could still be in-flight when bitmap_unplug writes again.  Or when
> bitmap_daemon_work tries to write again.
> This can be confusing and could risk the wrong data being written last.

Applied the first 3 patches, thanks!

This one doesn't seem to completely solve the race condition. It's still
possible that bitmap_daemon_work clears BITMAP_PAGE_NEEDWRITE but hasn't
dispatched the IO yet; bitmap_unplug then does nothing and thinks the
bitmap has been updated on disk. Why don't we add locking here?

Thanks,
Shaohua 


* Re: [md PATCH 4/4] md/bitmap: Don't write bitmap while earlier writes might be in-flight
  2016-11-05  0:33   ` Shaohua Li
@ 2016-11-06 22:53     ` NeilBrown
  2016-11-07 19:19       ` Shaohua Li
  0 siblings, 1 reply; 10+ messages in thread
From: NeilBrown @ 2016-11-06 22:53 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid


On Sat, Nov 05 2016, Shaohua Li wrote:

> On Fri, Nov 04, 2016 at 04:46:03PM +1100, Neil Brown wrote:
>> As we don't wait for writes to complete in bitmap_daemon_work, they
>> could still be in-flight when bitmap_unplug writes again.  Or when
>> bitmap_daemon_work tries to write again.
>> This can be confusing and could risk the wrong data being written last.
>
> Applied the first 3 patches, thanks!
>
> This one seems not completely solving the race condition. It's still possible
> bitmap_daemon_work clears BITMAP_PAGE_NEEDWRITE but hasn't dispatch the IO yet,
> bitmap_unplug then does nothing and thinks bitmap is updated to disk. Why don't
> we add locking here?

Thanks for the review!

BITMAP_PAGE_NEEDWRITE is set on pages that need to be written out in
order to clear bits that don't need to be set any more.  There is never
any urgency to do this.
BITMAP_PAGE_DIRTY is set on pages that need to be written out in order
to set bits representing regions that are about to be written to.  These
have to be flushed by bitmap_unplug().
Pages can have both bits set, in which case bitmap_daemon_work() will
leave them for bitmap_unplug() to deal with.

So if bitmap_daemon_work() clears BITMAP_PAGE_PENDING on a page, then it
is a page that bitmap_unplug() doesn't need to wait for.


Does that answer your concerns?

Thanks,
NeilBrown



* Re: [md PATCH 4/4] md/bitmap: Don't write bitmap while earlier writes might be in-flight
  2016-11-06 22:53     ` NeilBrown
@ 2016-11-07 19:19       ` Shaohua Li
  2016-11-07 20:19         ` NeilBrown
  0 siblings, 1 reply; 10+ messages in thread
From: Shaohua Li @ 2016-11-07 19:19 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

On Mon, Nov 07, 2016 at 09:53:42AM +1100, Neil Brown wrote:
> On Sat, Nov 05 2016, Shaohua Li wrote:
> 
> > On Fri, Nov 04, 2016 at 04:46:03PM +1100, Neil Brown wrote:
> >> As we don't wait for writes to complete in bitmap_daemon_work, they
> >> could still be in-flight when bitmap_unplug writes again.  Or when
> >> bitmap_daemon_work tries to write again.
> >> This can be confusing and could risk the wrong data being written last.
> >
> > Applied the first 3 patches, thanks!
> >
> > This one seems not completely solving the race condition. It's still possible
> > bitmap_daemon_work clears BITMAP_PAGE_NEEDWRITE but hasn't dispatch the IO yet,
> > bitmap_unplug then does nothing and thinks bitmap is updated to disk. Why don't
> > we add locking here?
> 
> Thanks for the review!
> 
> BITMAP_PAGE_NEEDWRITE is set for pages that need to be written out in
> order to clear bits that don't need to be set any more.  There is never
> any urgency to do this.
> BITMAP_PAGE_DIRTY is set of pages that need to be written out in order
> to set bits representing regions that are about to be written to.  These
> have to be flushed by bitmap_unplug().
> Pages can have both bits set, in which case bitmap_daemon_work() will
> leave them for bitmap_unplug() to deal with.
> 
> So if bitmap_daemon_work() clears BITMAP_PAGE_PENDING on a page, then it
> is a page that bitmap_unplug() doesn't need to wait for.

Oops, I misread the code. Yes, this is very clear; thanks for the explanation.

So I understand that the patch fixes the confusion. When is there a risk
that the wrong data is written?

Thanks,
Shaohua


* Re: [md PATCH 4/4] md/bitmap: Don't write bitmap while earlier writes might be in-flight
  2016-11-07 19:19       ` Shaohua Li
@ 2016-11-07 20:19         ` NeilBrown
  2016-11-07 22:57           ` Shaohua Li
  0 siblings, 1 reply; 10+ messages in thread
From: NeilBrown @ 2016-11-07 20:19 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid


On Tue, Nov 08 2016, Shaohua Li wrote:

> On Mon, Nov 07, 2016 at 09:53:42AM +1100, Neil Brown wrote:
>> On Sat, Nov 05 2016, Shaohua Li wrote:
>> 
>> > On Fri, Nov 04, 2016 at 04:46:03PM +1100, Neil Brown wrote:
>> >> As we don't wait for writes to complete in bitmap_daemon_work, they
>> >> could still be in-flight when bitmap_unplug writes again.  Or when
>> >> bitmap_daemon_work tries to write again.
>> >> This can be confusing and could risk the wrong data being written last.
>> >
>> > Applied the first 3 patches, thanks!
>> >
>> > This one seems not completely solving the race condition. It's still possible
>> > bitmap_daemon_work clears BITMAP_PAGE_NEEDWRITE but hasn't dispatch the IO yet,
>> > bitmap_unplug then does nothing and thinks bitmap is updated to disk. Why don't
>> > we add locking here?
>> 
>> Thanks for the review!
>> 
>> BITMAP_PAGE_NEEDWRITE is set for pages that need to be written out in
>> order to clear bits that don't need to be set any more.  There is never
>> any urgency to do this.
>> BITMAP_PAGE_DIRTY is set of pages that need to be written out in order
>> to set bits representing regions that are about to be written to.  These
>> have to be flushed by bitmap_unplug().
>> Pages can have both bits set, in which case bitmap_daemon_work() will
>> leave them for bitmap_unplug() to deal with.
>> 
>> So if bitmap_daemon_work() clears BITMAP_PAGE_PENDING on a page, then it
>> is a page that bitmap_unplug() doesn't need to wait for.
>
> Oops, I misread the code. Yes, this is very clear, thanks for the explaination.
>
> So I can understand the patch fixes the confusion. when is there risk wrong
> data is written?

The goal is to avoid the possibility that two writes to the same
location are in flight at the same time.
It isn't clear that this would cause a problem, but it isn't clear that
it is completely safe either.
In general, this doesn't happen - certainly not from the page cache - so
drivers will not be prepared for it.
As an extreme example, suppose that the target device is a multi-path
device.
One write is sent and just after it is DMAed to one path, the in-memory
page is changed and a second write is submitted, and the data is DMAed
down the other path.  Are we certain the two will be committed to
storage in the intended order?
Probably they will be.  But maybe with a more complex stack, the chance
of one IO overtaking the other (even after the first DMA) increases.

So I cannot argue very strongly for this code, but it seems like a good
idea and is reasonably simple....

Thanks,
NeilBrown



* Re: [md PATCH 4/4] md/bitmap: Don't write bitmap while earlier writes might be in-flight
  2016-11-07 20:19         ` NeilBrown
@ 2016-11-07 22:57           ` Shaohua Li
  0 siblings, 0 replies; 10+ messages in thread
From: Shaohua Li @ 2016-11-07 22:57 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

On Tue, Nov 08, 2016 at 07:19:05AM +1100, Neil Brown wrote:
> On Tue, Nov 08 2016, Shaohua Li wrote:
> 
> > On Mon, Nov 07, 2016 at 09:53:42AM +1100, Neil Brown wrote:
> >> On Sat, Nov 05 2016, Shaohua Li wrote:
> >> 
> >> > On Fri, Nov 04, 2016 at 04:46:03PM +1100, Neil Brown wrote:
> >> >> As we don't wait for writes to complete in bitmap_daemon_work, they
> >> >> could still be in-flight when bitmap_unplug writes again.  Or when
> >> >> bitmap_daemon_work tries to write again.
> >> >> This can be confusing and could risk the wrong data being written last.
> >> >
> >> > Applied the first 3 patches, thanks!
> >> >
> >> > This one seems not completely solving the race condition. It's still possible
> >> > bitmap_daemon_work clears BITMAP_PAGE_NEEDWRITE but hasn't dispatch the IO yet,
> >> > bitmap_unplug then does nothing and thinks bitmap is updated to disk. Why don't
> >> > we add locking here?
> >> 
> >> Thanks for the review!
> >> 
> >> BITMAP_PAGE_NEEDWRITE is set for pages that need to be written out in
> >> order to clear bits that don't need to be set any more.  There is never
> >> any urgency to do this.
> >> BITMAP_PAGE_DIRTY is set of pages that need to be written out in order
> >> to set bits representing regions that are about to be written to.  These
> >> have to be flushed by bitmap_unplug().
> >> Pages can have both bits set, in which case bitmap_daemon_work() will
> >> leave them for bitmap_unplug() to deal with.
> >> 
> >> So if bitmap_daemon_work() clears BITMAP_PAGE_PENDING on a page, then it
> >> is a page that bitmap_unplug() doesn't need to wait for.
> >
> > Oops, I misread the code. Yes, this is very clear, thanks for the explaination.
> >
> > So I can understand the patch fixes the confusion. when is there risk wrong
> > data is written?
> 
> The goal is to avoid the possibility that two writes to the same
> location are in flight at the same time.
> It isn't clear that this would cause a problem, but it isn't clear that
> it is completely safe either.
> In general, this doesn't happen - certainly not from the page cache - so
> drivers will not be prepared for it.
> As an extreme example, suppose that the target device is a multi-path
> device.
> One write is sent and just after it is DMAed to one path, the in-memory
> page is changed and a second write is submitted, and the data is DMAed
> down the other path.  Are we certain the two will be committed to
> storage in the intended order?
> Probably they will be.  But maybe with a more complex stack, the chance
> of one IO overtaking the other (even after the first DMA) increases.
> 
> So I cannot argue very strongly for this code, but it seems like a good
> idea and is reasonably simple....

I wasn't trying to argue against the patch - it actually makes things
clearer. I just wanted to know where the data corruption could come from,
and I think I have an answer now.

I'll add this patch.

Thanks,
Shaohua


