* [md PATCH 00/14] remove all abuse of bi_phys_segments
@ 2017-02-16  4:39 NeilBrown
  2017-02-16  4:39 ` [md PATCH 04/14] block: trace completion of all bios NeilBrown
                   ` (13 more replies)
  0 siblings, 14 replies; 27+ messages in thread
From: NeilBrown @ 2017-02-16  4:39 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, hch

This series removes all abuse of bi_phys_segments from
md/raid.
Most of the effort is for raid5 - raid1 and raid10 are
comparatively straightforward.

I've also included some optimization of md_write_start(), which is now
used more heavily.  This is the most "interesting" code, so it is at
the end, allowing it to easily be postponed.

The raid1 parts will conflict with Coly's raid1 resync changes.  The
conflicts aren't serious.  Whichever goes in last can easily be
adjusted to work with the other.

As you can see below, we save over 100 lines of code :-)

I've reviewed these a couple of times and done some light testing.
More review and testing would be welcome.

Thanks,
NeilBrown



---

NeilBrown (14):
      md/raid5: use md_write_start to count stripes, not bios
      md/raid5: simplfy delaying of writes while metadata is updated.
      md/raid5: call bio_endio() directly rather than queueing for later.
      block: trace completion of all bios.
      md/raid5: use bio_inc_remaining() instead of repurposing bi_phys_segments as a counter
      md/raid5: remove over-loading of ->bi_phys_segments.
      Revert "md/raid5: limit request size according to implementation limits"
      md/raid1,raid10: move rXbio accounting closer to allocation.
      md/raid10: stop using bi_phys_segments
      md/raid1: stop using bi_phys_segment
      md/raid5: don't test ->writes_pending in raid5_remove_disk
      md: factor out set_in_sync()
      md: close a race with setting mddev->in_sync
      MD: use per-cpu counter for writes_pending


 block/bio.c              |    3 +
 drivers/md/dm.c          |    1 
 drivers/md/md.c          |  161 ++++++++++++++++++++++++++++++++------------
 drivers/md/md.h          |    3 +
 drivers/md/raid1.c       |  106 ++++++++++-------------------
 drivers/md/raid10.c      |   82 +++++++----------------
 drivers/md/raid5-cache.c |   14 +---
 drivers/md/raid5.c       |  167 ++++++++++++----------------------------------
 drivers/md/raid5.h       |   51 +-------------
 9 files changed, 235 insertions(+), 353 deletions(-)

--
Signature



* [md PATCH 01/14] md/raid5: use md_write_start to count stripes, not bios
  2017-02-16  4:39 [md PATCH 00/14] remove all abuse of bi_phys_segments NeilBrown
  2017-02-16  4:39 ` [md PATCH 04/14] block: trace completion of all bios NeilBrown
  2017-02-16  4:39 ` [md PATCH 03/14] md/raid5: call bio_endio() directly rather than queueing for later NeilBrown
@ 2017-02-16  4:39 ` NeilBrown
  2017-02-16 17:29   ` Shaohua Li
  2017-02-16  4:39 ` [md PATCH 02/14] md/raid5: simplfy delaying of writes while metadata is updated NeilBrown
                   ` (10 subsequent siblings)
  13 siblings, 1 reply; 27+ messages in thread
From: NeilBrown @ 2017-02-16  4:39 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, hch

We use md_write_start() to increase the count of pending writes, and
md_write_end() to decrement the count.  We currently count bios
submitted to md/raid5.  Change it to count stripe_heads that a WRITE bio
has been attached to.

So now, raid5_make_request() calls md_write_start() and then
md_write_end() to keep the count elevated during the setup of the
request.

add_stripe_bio() calls md_write_start() for each stripe_head, and the
completion routines always call md_write_end(), instead of only
calling it when raid5_dec_bi_active_stripes() returns 0.
make_discard_request() also calls md_write_start() and md_write_end().

The parallel between md_write_{start,end} and use of bi_phys_segments
can be seen in that:
 Whenever we set bi_phys_segments to 1, we now call md_write_start.
 Whenever we increment it on non-read requests with
   raid5_inc_bi_active_stripes(), we now call md_write_start().
 Whenever we decrement bi_phys_segments on non-read requests with
   raid5_dec_bi_active_stripes(), we now call md_write_end().

This reduces our dependence on keeping a per-bio count of active
stripes in bi_phys_segments.
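
For illustration, the counting discipline this moves to looks roughly
like the following userspace model (invented names, not the md code):
the pending count is raised once when the request enters the driver,
once more per stripe_head the bio is attached to, and dropped
symmetrically, so it only reaches zero when every stripe_head is done.

/* Illustrative model only -- not drivers/md code. */
#include <stdatomic.h>
#include <stdio.h>

static atomic_int writes_pending;		/* models mddev->writes_pending */

static void model_write_start(void) { atomic_fetch_add(&writes_pending, 1); }
static void model_write_end(void)   { atomic_fetch_sub(&writes_pending, 1); }

int main(void)
{
	int nr_stripes = 3;
	int i;

	model_write_start();			/* raid5_make_request() entry */
	for (i = 0; i < nr_stripes; i++)
		model_write_start();		/* add_stripe_bio(), per stripe_head */
	model_write_end();			/* end of request setup */

	for (i = 0; i < nr_stripes; i++)
		model_write_end();		/* each stripe_head completion */

	printf("pending after completion: %d\n",
	       atomic_load(&writes_pending));	/* 0: array may go clean */
	return 0;
}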

Signed-off-by: NeilBrown <neilb@suse.com>
---
 drivers/md/raid5-cache.c |    2 +-
 drivers/md/raid5.c       |   27 +++++++++++++--------------
 2 files changed, 14 insertions(+), 15 deletions(-)

diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 302dea3296ba..4b211f914d17 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -265,8 +265,8 @@ r5c_return_dev_pending_writes(struct r5conf *conf, struct r5dev *dev,
 	while (wbi && wbi->bi_iter.bi_sector <
 	       dev->sector + STRIPE_SECTORS) {
 		wbi2 = r5_next_bio(wbi, dev->sector);
+		md_write_end(conf->mddev);
 		if (!raid5_dec_bi_active_stripes(wbi)) {
-			md_write_end(conf->mddev);
 			bio_list_add(return_bi, wbi);
 		}
 		wbi = wbi2;
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 3c7e106c12a2..760b726943c9 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -3075,6 +3075,7 @@ static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx,
 		bi->bi_next = *bip;
 	*bip = bi;
 	raid5_inc_bi_active_stripes(bi);
+	md_write_start(conf->mddev, bi);
 
 	if (forwrite) {
 		/* check if page is covered */
@@ -3198,10 +3199,9 @@ handle_failed_stripe(struct r5conf *conf, struct stripe_head *sh,
 			struct bio *nextbi = r5_next_bio(bi, sh->dev[i].sector);
 
 			bi->bi_error = -EIO;
-			if (!raid5_dec_bi_active_stripes(bi)) {
-				md_write_end(conf->mddev);
+			md_write_end(conf->mddev);
+			if (!raid5_dec_bi_active_stripes(bi))
 				bio_list_add(return_bi, bi);
-			}
 			bi = nextbi;
 		}
 		if (bitmap_end)
@@ -3222,10 +3222,9 @@ handle_failed_stripe(struct r5conf *conf, struct stripe_head *sh,
 			struct bio *bi2 = r5_next_bio(bi, sh->dev[i].sector);
 
 			bi->bi_error = -EIO;
-			if (!raid5_dec_bi_active_stripes(bi)) {
-				md_write_end(conf->mddev);
+			md_write_end(conf->mddev);
+			if (!raid5_dec_bi_active_stripes(bi))
 				bio_list_add(return_bi, bi);
-			}
 			bi = bi2;
 		}
 
@@ -3582,10 +3581,9 @@ static void handle_stripe_clean_event(struct r5conf *conf,
 				while (wbi && wbi->bi_iter.bi_sector <
 					dev->sector + STRIPE_SECTORS) {
 					wbi2 = r5_next_bio(wbi, dev->sector);
-					if (!raid5_dec_bi_active_stripes(wbi)) {
-						md_write_end(conf->mddev);
+					md_write_end(conf->mddev);
+					if (!raid5_dec_bi_active_stripes(wbi))
 						bio_list_add(return_bi, wbi);
-					}
 					wbi = wbi2;
 				}
 				bitmap_endwrite(conf->mddev->bitmap, sh->sector,
@@ -5268,6 +5266,7 @@ static void make_discard_request(struct mddev *mddev, struct bio *bi)
 
 	bi->bi_next = NULL;
 	bi->bi_phys_segments = 1; /* over-loaded to count active stripes */
+	md_write_start(mddev, bi);
 
 	stripe_sectors = conf->chunk_sectors *
 		(conf->raid_disks - conf->max_degraded);
@@ -5314,6 +5313,7 @@ static void make_discard_request(struct mddev *mddev, struct bio *bi)
 			sh->dev[d].towrite = bi;
 			set_bit(R5_OVERWRITE, &sh->dev[d].flags);
 			raid5_inc_bi_active_stripes(bi);
+			md_write_start(mddev, bi);
 			sh->overwrite_disks++;
 		}
 		spin_unlock_irq(&sh->stripe_lock);
@@ -5336,9 +5336,9 @@ static void make_discard_request(struct mddev *mddev, struct bio *bi)
 		release_stripe_plug(mddev, sh);
 	}
 
+	md_write_end(mddev);
 	remaining = raid5_dec_bi_active_stripes(bi);
 	if (remaining == 0) {
-		md_write_end(mddev);
 		bio_endio(bi);
 	}
 }
@@ -5373,8 +5373,6 @@ static void raid5_make_request(struct mddev *mddev, struct bio * bi)
 		do_flush = bi->bi_opf & REQ_PREFLUSH;
 	}
 
-	md_write_start(mddev, bi);
-
 	/*
 	 * If array is degraded, better not do chunk aligned read because
 	 * later we might have to read it again in order to reconstruct
@@ -5397,6 +5395,7 @@ static void raid5_make_request(struct mddev *mddev, struct bio * bi)
 	last_sector = bio_end_sector(bi);
 	bi->bi_next = NULL;
 	bi->bi_phys_segments = 1;	/* over-loaded to count active stripes */
+	md_write_start(mddev, bi);
 
 	prepare_to_wait(&conf->wait_for_overlap, &w, TASK_UNINTERRUPTIBLE);
 	for (;logical_sector < last_sector; logical_sector += STRIPE_SECTORS) {
@@ -5531,11 +5530,11 @@ static void raid5_make_request(struct mddev *mddev, struct bio * bi)
 	}
 	finish_wait(&conf->wait_for_overlap, &w);
 
+	if (rw == WRITE)
+		md_write_end(mddev);
 	remaining = raid5_dec_bi_active_stripes(bi);
 	if (remaining == 0) {
 
-		if ( rw == WRITE )
-			md_write_end(mddev);
 
 		trace_block_bio_complete(bdev_get_queue(bi->bi_bdev),
 					 bi, 0);




* [md PATCH 02/14] md/raid5: simplfy delaying of writes while metadata is updated.
  2017-02-16  4:39 [md PATCH 00/14] remove all abuse of bi_phys_segments NeilBrown
                   ` (2 preceding siblings ...)
  2017-02-16  4:39 ` [md PATCH 01/14] md/raid5: use md_write_start to count stripes, not bios NeilBrown
@ 2017-02-16  4:39 ` NeilBrown
  2017-02-16 17:37   ` Shaohua Li
  2017-02-16  4:39 ` [md PATCH 11/14] md/raid5: don't test ->writes_pending in raid5_remove_disk NeilBrown
                   ` (9 subsequent siblings)
  13 siblings, 1 reply; 27+ messages in thread
From: NeilBrown @ 2017-02-16  4:39 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, hch

If a device fails during a write, we must ensure the failure is
recorded in the metadata before the completion of the write is
acknowledged.

Commit c3cce6cda162 ("md/raid5: ensure device failure recorded before
write request returns.")  added code for this, but it was
unnecessarily complicated.  We already had similar functionality for
handling updates to the bad-block-list, thanks to Commit de393cdea66c
("md: make it easier to wait for bad blocks to be acknowledged.")

So revert most of the former commit, and instead avoid collecting
completed writes if MD_CHANGE_PENDING is set.  raid5d() will then flush
the metadata and retry the stripe_head.

We check MD_CHANGE_PENDING *after* analyse_stripe() as it could be set
asynchronously.  After analyse_stripe(), we have collected stable data
about the state of devices, which will be used to make decisions.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 drivers/md/raid5.c |   27 ++++-----------------------
 drivers/md/raid5.h |    3 ---
 2 files changed, 4 insertions(+), 26 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 760b726943c9..154593e0afbe 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -4492,7 +4492,8 @@ static void handle_stripe(struct stripe_head *sh)
 	if (test_bit(STRIPE_LOG_TRAPPED, &sh->state))
 		goto finish;
 
-	if (s.handle_bad_blocks) {
+	if (s.handle_bad_blocks ||
+	    test_bit(MD_SB_CHANGE_PENDING, &conf->mddev->sb_flags)) {
 		set_bit(STRIPE_HANDLE, &sh->state);
 		goto finish;
 	}
@@ -4822,15 +4823,8 @@ static void handle_stripe(struct stripe_head *sh)
 			md_wakeup_thread(conf->mddev->thread);
 	}
 
-	if (!bio_list_empty(&s.return_bi)) {
-		if (test_bit(MD_SB_CHANGE_PENDING, &conf->mddev->sb_flags)) {
-			spin_lock_irq(&conf->device_lock);
-			bio_list_merge(&conf->return_bi, &s.return_bi);
-			spin_unlock_irq(&conf->device_lock);
-			md_wakeup_thread(conf->mddev->thread);
-		} else
-			return_io(&s.return_bi);
-	}
+	if (!bio_list_empty(&s.return_bi))
+		return_io(&s.return_bi);
 
 	clear_bit_unlock(STRIPE_ACTIVE, &sh->state);
 }
@@ -6055,18 +6049,6 @@ static void raid5d(struct md_thread *thread)
 
 	md_check_recovery(mddev);
 
-	if (!bio_list_empty(&conf->return_bi) &&
-	    !test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags)) {
-		struct bio_list tmp = BIO_EMPTY_LIST;
-		spin_lock_irq(&conf->device_lock);
-		if (!test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags)) {
-			bio_list_merge(&tmp, &conf->return_bi);
-			bio_list_init(&conf->return_bi);
-		}
-		spin_unlock_irq(&conf->device_lock);
-		return_io(&tmp);
-	}
-
 	blk_start_plug(&plug);
 	handled = 0;
 	spin_lock_irq(&conf->device_lock);
@@ -6705,7 +6687,6 @@ static struct r5conf *setup_conf(struct mddev *mddev)
 	INIT_LIST_HEAD(&conf->hold_list);
 	INIT_LIST_HEAD(&conf->delayed_list);
 	INIT_LIST_HEAD(&conf->bitmap_list);
-	bio_list_init(&conf->return_bi);
 	init_llist_head(&conf->released_stripes);
 	atomic_set(&conf->active_stripes, 0);
 	atomic_set(&conf->preread_active_stripes, 0);
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index 1440fa26e296..1ef71273b9b2 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -621,9 +621,6 @@ struct r5conf {
 	int			skip_copy; /* Don't copy data from bio to stripe cache */
 	struct list_head	*last_hold; /* detect hold_list promotions */
 
-	/* bios to have bi_end_io called after metadata is synced */
-	struct bio_list		return_bi;
-
 	atomic_t		reshape_stripes; /* stripes with pending writes for reshape */
 	/* unfortunately we need two cache names as we temporarily have
 	 * two caches.




* [md PATCH 03/14] md/raid5: call bio_endio() directly rather than queueing for later.
  2017-02-16  4:39 [md PATCH 00/14] remove all abuse of bi_phys_segments NeilBrown
  2017-02-16  4:39 ` [md PATCH 04/14] block: trace completion of all bios NeilBrown
@ 2017-02-16  4:39 ` NeilBrown
  2017-02-16  4:39 ` [md PATCH 01/14] md/raid5: use md_write_start to count stripes, not bios NeilBrown
                   ` (11 subsequent siblings)
  13 siblings, 0 replies; 27+ messages in thread
From: NeilBrown @ 2017-02-16  4:39 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, hch

We currently gather bios that need to be returned into a bio_list
and call bio_endio() on them all together.
The original reason for this was to avoid making the calls while
holding a spinlock.
Locking has changed a lot since then, and that reason is no longer
valid.

So discard return_io() and various return_bi lists, and just call
bio_endio() directly as needed.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 drivers/md/raid5-cache.c |   13 +++++--------
 drivers/md/raid5.c       |   38 ++++++++++----------------------------
 drivers/md/raid5.h       |    3 +--
 3 files changed, 16 insertions(+), 38 deletions(-)

diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 4b211f914d17..1b972a172800 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -255,8 +255,7 @@ static void __r5l_set_io_unit_state(struct r5l_io_unit *io,
 }
 
 static void
-r5c_return_dev_pending_writes(struct r5conf *conf, struct r5dev *dev,
-			      struct bio_list *return_bi)
+r5c_return_dev_pending_writes(struct r5conf *conf, struct r5dev *dev)
 {
 	struct bio *wbi, *wbi2;
 
@@ -266,23 +265,21 @@ r5c_return_dev_pending_writes(struct r5conf *conf, struct r5dev *dev,
 	       dev->sector + STRIPE_SECTORS) {
 		wbi2 = r5_next_bio(wbi, dev->sector);
 		md_write_end(conf->mddev);
-		if (!raid5_dec_bi_active_stripes(wbi)) {
-			bio_list_add(return_bi, wbi);
-		}
+		if (!raid5_dec_bi_active_stripes(wbi))
+			bio_endio(wbi);
 		wbi = wbi2;
 	}
 }
 
 void r5c_handle_cached_data_endio(struct r5conf *conf,
-	  struct stripe_head *sh, int disks, struct bio_list *return_bi)
+				  struct stripe_head *sh, int disks)
 {
 	int i;
 
 	for (i = sh->disks; i--; ) {
 		if (sh->dev[i].written) {
 			set_bit(R5_UPTODATE, &sh->dev[i].flags);
-			r5c_return_dev_pending_writes(conf, &sh->dev[i],
-						      return_bi);
+			r5c_return_dev_pending_writes(conf, &sh->dev[i]);
 			bitmap_endwrite(conf->mddev->bitmap, sh->sector,
 					STRIPE_SECTORS,
 					!test_bit(STRIPE_DEGRADED, &sh->state),
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 154593e0afbe..a6edca65b561 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -154,17 +154,6 @@ static int raid6_idx_to_slot(int idx, struct stripe_head *sh,
 	return slot;
 }
 
-static void return_io(struct bio_list *return_bi)
-{
-	struct bio *bi;
-	while ((bi = bio_list_pop(return_bi)) != NULL) {
-		bi->bi_iter.bi_size = 0;
-		trace_block_bio_complete(bdev_get_queue(bi->bi_bdev),
-					 bi, 0);
-		bio_endio(bi);
-	}
-}
-
 static void print_raid5_conf (struct r5conf *conf);
 
 static int stripe_operations_active(struct stripe_head *sh)
@@ -1175,7 +1164,6 @@ async_copy_data(int frombio, struct bio *bio, struct page **page,
 static void ops_complete_biofill(void *stripe_head_ref)
 {
 	struct stripe_head *sh = stripe_head_ref;
-	struct bio_list return_bi = BIO_EMPTY_LIST;
 	int i;
 
 	pr_debug("%s: stripe %llu\n", __func__,
@@ -1200,15 +1188,13 @@ static void ops_complete_biofill(void *stripe_head_ref)
 				dev->sector + STRIPE_SECTORS) {
 				rbi2 = r5_next_bio(rbi, dev->sector);
 				if (!raid5_dec_bi_active_stripes(rbi))
-					bio_list_add(&return_bi, rbi);
+					bio_endio(rbi);
 				rbi = rbi2;
 			}
 		}
 	}
 	clear_bit(STRIPE_BIOFILL_RUN, &sh->state);
 
-	return_io(&return_bi);
-
 	set_bit(STRIPE_HANDLE, &sh->state);
 	raid5_release_stripe(sh);
 }
@@ -3152,8 +3138,7 @@ static void stripe_set_idx(sector_t stripe, struct r5conf *conf, int previous,
 
 static void
 handle_failed_stripe(struct r5conf *conf, struct stripe_head *sh,
-				struct stripe_head_state *s, int disks,
-				struct bio_list *return_bi)
+		     struct stripe_head_state *s, int disks)
 {
 	int i;
 	BUG_ON(sh->batch_head);
@@ -3201,7 +3186,7 @@ handle_failed_stripe(struct r5conf *conf, struct stripe_head *sh,
 			bi->bi_error = -EIO;
 			md_write_end(conf->mddev);
 			if (!raid5_dec_bi_active_stripes(bi))
-				bio_list_add(return_bi, bi);
+				bio_endio(bi);
 			bi = nextbi;
 		}
 		if (bitmap_end)
@@ -3224,7 +3209,7 @@ handle_failed_stripe(struct r5conf *conf, struct stripe_head *sh,
 			bi->bi_error = -EIO;
 			md_write_end(conf->mddev);
 			if (!raid5_dec_bi_active_stripes(bi))
-				bio_list_add(return_bi, bi);
+				bio_endio(bi);
 			bi = bi2;
 		}
 
@@ -3250,7 +3235,7 @@ handle_failed_stripe(struct r5conf *conf, struct stripe_head *sh,
 
 				bi->bi_error = -EIO;
 				if (!raid5_dec_bi_active_stripes(bi))
-					bio_list_add(return_bi, bi);
+					bio_endio(bi);
 				bi = nextbi;
 			}
 		}
@@ -3549,7 +3534,7 @@ static void break_stripe_batch_list(struct stripe_head *head_sh,
  * never LOCKED, so we don't need to test 'failed' directly.
  */
 static void handle_stripe_clean_event(struct r5conf *conf,
-	struct stripe_head *sh, int disks, struct bio_list *return_bi)
+	struct stripe_head *sh, int disks)
 {
 	int i;
 	struct r5dev *dev;
@@ -3583,7 +3568,7 @@ static void handle_stripe_clean_event(struct r5conf *conf,
 					wbi2 = r5_next_bio(wbi, dev->sector);
 					md_write_end(conf->mddev);
 					if (!raid5_dec_bi_active_stripes(wbi))
-						bio_list_add(return_bi, wbi);
+						bio_endio(wbi);
 					wbi = wbi2;
 				}
 				bitmap_endwrite(conf->mddev->bitmap, sh->sector,
@@ -4526,7 +4511,7 @@ static void handle_stripe(struct stripe_head *sh)
 		sh->reconstruct_state = 0;
 		break_stripe_batch_list(sh, 0);
 		if (s.to_read+s.to_write+s.written)
-			handle_failed_stripe(conf, sh, &s, disks, &s.return_bi);
+			handle_failed_stripe(conf, sh, &s, disks);
 		if (s.syncing + s.replacing)
 			handle_failed_sync(conf, sh, &s);
 	}
@@ -4592,10 +4577,10 @@ static void handle_stripe(struct stripe_head *sh)
 			     && !test_bit(R5_LOCKED, &qdev->flags)
 			     && (test_bit(R5_UPTODATE, &qdev->flags) ||
 				 test_bit(R5_Discard, &qdev->flags))))))
-		handle_stripe_clean_event(conf, sh, disks, &s.return_bi);
+		handle_stripe_clean_event(conf, sh, disks);
 
 	if (s.just_cached)
-		r5c_handle_cached_data_endio(conf, sh, disks, &s.return_bi);
+		r5c_handle_cached_data_endio(conf, sh, disks);
 	r5l_stripe_write_finished(sh);
 
 	/* Now we might consider reading some blocks, either to check/generate
@@ -4823,9 +4808,6 @@ static void handle_stripe(struct stripe_head *sh)
 			md_wakeup_thread(conf->mddev->thread);
 	}
 
-	if (!bio_list_empty(&s.return_bi))
-		return_io(&s.return_bi);
-
 	clear_bit_unlock(STRIPE_ACTIVE, &sh->state);
 }
 
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index 1ef71273b9b2..9051631cb7f0 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -272,7 +272,6 @@ struct stripe_head_state {
 	int dec_preread_active;
 	unsigned long ops_request;
 
-	struct bio_list return_bi;
 	struct md_rdev *blocked_rdev;
 	int handle_bad_blocks;
 	int log_failed;
@@ -776,7 +775,7 @@ extern void r5c_release_extra_page(struct stripe_head *sh);
 extern void r5c_use_extra_page(struct stripe_head *sh);
 extern void r5l_wake_reclaim(struct r5l_log *log, sector_t space);
 extern void r5c_handle_cached_data_endio(struct r5conf *conf,
-	struct stripe_head *sh, int disks, struct bio_list *return_bi);
+	struct stripe_head *sh, int disks);
 extern int r5c_cache_data(struct r5l_log *log, struct stripe_head *sh,
 			  struct stripe_head_state *s);
 extern void r5c_make_stripe_write_out(struct stripe_head *sh);




* [md PATCH 04/14] block: trace completion of all bios.
  2017-02-16  4:39 [md PATCH 00/14] remove all abuse of bi_phys_segments NeilBrown
@ 2017-02-16  4:39 ` NeilBrown
  2017-02-16  4:39 ` [md PATCH 03/14] md/raid5: call bio_endio() directly rather than queueing for later NeilBrown
                   ` (12 subsequent siblings)
  13 siblings, 0 replies; 27+ messages in thread
From: NeilBrown @ 2017-02-16  4:39 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, hch

Currently only dm and raid5 bios trigger trace_block_bio_complete.
Rather than sprinkling trace calls around all drivers that might
want them, add the trace call to bio_endio() so that all
drivers benefit.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 block/bio.c        |    3 +++
 drivers/md/dm.c    |    1 -
 drivers/md/raid5.c |    8 --------
 3 files changed, 3 insertions(+), 9 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 2b375020fc49..984c0afa935a 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1813,6 +1813,9 @@ void bio_endio(struct bio *bio)
 		goto again;
 	}
 
+	if (bio->bi_bdev)
+		trace_block_bio_complete(bdev_get_queue(bio->bi_bdev),
+					 bio, bio->bi_error);
 	if (bio->bi_end_io)
 		bio->bi_end_io(bio);
 }
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 3086da5664f3..e151aefa0917 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -807,7 +807,6 @@ static void dec_pending(struct dm_io *io, int error)
 			queue_io(md, bio);
 		} else {
 			/* done with normal IO or empty flush */
-			trace_block_bio_complete(md->queue, bio, io_error);
 			bio->bi_error = io_error;
 			bio_endio(bio);
 		}
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index a6edca65b561..2fbf939c1a15 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -4943,8 +4943,6 @@ static void raid5_align_endio(struct bio *bi)
 	rdev_dec_pending(rdev, conf->mddev);
 
 	if (!error) {
-		trace_block_bio_complete(bdev_get_queue(raid_bi->bi_bdev),
-					 raid_bi, 0);
 		bio_endio(raid_bi);
 		if (atomic_dec_and_test(&conf->active_aligned_reads))
 			wake_up(&conf->wait_for_quiescent);
@@ -5510,10 +5508,6 @@ static void raid5_make_request(struct mddev *mddev, struct bio * bi)
 		md_write_end(mddev);
 	remaining = raid5_dec_bi_active_stripes(bi);
 	if (remaining == 0) {
-
-
-		trace_block_bio_complete(bdev_get_queue(bi->bi_bdev),
-					 bi, 0);
 		bio_endio(bi);
 	}
 }
@@ -5921,8 +5915,6 @@ static int  retry_aligned_read(struct r5conf *conf, struct bio *raid_bio)
 	}
 	remaining = raid5_dec_bi_active_stripes(raid_bio);
 	if (remaining == 0) {
-		trace_block_bio_complete(bdev_get_queue(raid_bio->bi_bdev),
-					 raid_bio, 0);
 		bio_endio(raid_bio);
 	}
 	if (atomic_dec_and_test(&conf->active_aligned_reads))




* [md PATCH 05/14] md/raid5: use bio_inc_remaining() instead of repurposing bi_phys_segments as a counter
  2017-02-16  4:39 [md PATCH 00/14] remove all abuse of bi_phys_segments NeilBrown
                   ` (8 preceding siblings ...)
  2017-02-16  4:39 ` [md PATCH 09/14] md/raid10: stop using bi_phys_segments NeilBrown
@ 2017-02-16  4:39 ` NeilBrown
  2017-02-16  4:39 ` [md PATCH 06/14] md/raid5: remove over-loading of ->bi_phys_segments NeilBrown
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 27+ messages in thread
From: NeilBrown @ 2017-02-16  4:39 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, hch

md/raid5 needs to keep track of how many stripe_heads are processing a
bio so that it can delay calling bio_endio() until all stripe_heads
have completed.  It currently uses 16 bits of ->bi_phys_segments for
this purpose.

16 bits is only enough for a 256MB request, and it is possible for a
single bio to be larger than this, which causes problems.  Also, the
bio struct contains a larger counter, __bi_remaining, which has a
purpose very similar to the purpose of our counter.  So stop using
->bi_phys_segments, and instead use __bi_remaining.

This means we don't need to initialize the counter, as our caller
initializes it to '1'.  It also means we can call bio_endio() directly
as it tests this counter internally.
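
As a rough sketch of the __bi_remaining discipline this relies on
(plain userspace C with made-up names, not block/bio.c): the counter
is born as 1 for the submitter, bio_inc_remaining() adds one per extra
user, and each bio_endio() drops one, with the real completion running
only on the final put.

/* Illustrative model of the __bi_remaining discipline. */
#include <stdatomic.h>
#include <stdio.h>

struct model_bio {
	atomic_int remaining;			/* models bio->__bi_remaining */
	void (*end_io)(struct model_bio *);
};

static void model_bio_init(struct model_bio *b,
			   void (*end_io)(struct model_bio *))
{
	atomic_init(&b->remaining, 1);		/* the submitter's own count */
	b->end_io = end_io;
}

static void model_bio_inc_remaining(struct model_bio *b)
{
	atomic_fetch_add(&b->remaining, 1);	/* one count per stripe_head */
}

static void model_bio_endio(struct model_bio *b)
{
	if (atomic_fetch_sub(&b->remaining, 1) == 1)
		b->end_io(b);			/* last put completes the bio */
}

static void done(struct model_bio *b) { (void)b; puts("bio completed"); }

int main(void)
{
	struct model_bio bio;
	int i;

	model_bio_init(&bio, done);
	for (i = 0; i < 4; i++)
		model_bio_inc_remaining(&bio);	/* attached to four stripe_heads */
	for (i = 0; i < 4; i++)
		model_bio_endio(&bio);		/* each stripe_head finishes */
	model_bio_endio(&bio);			/* submitter drops its count */
	return 0;
}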

Signed-off-by: NeilBrown <neilb@suse.com>
---
 drivers/md/raid5-cache.c |    3 +--
 drivers/md/raid5.c       |   50 +++++++++++-----------------------------------
 drivers/md/raid5.h       |   17 +---------------
 3 files changed, 14 insertions(+), 56 deletions(-)

diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 1b972a172800..5c282ae71cbd 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -265,8 +265,7 @@ r5c_return_dev_pending_writes(struct r5conf *conf, struct r5dev *dev)
 	       dev->sector + STRIPE_SECTORS) {
 		wbi2 = r5_next_bio(wbi, dev->sector);
 		md_write_end(conf->mddev);
-		if (!raid5_dec_bi_active_stripes(wbi))
-			bio_endio(wbi);
+		bio_endio(wbi);
 		wbi = wbi2;
 	}
 }
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 2fbf939c1a15..905abf081acf 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -1187,8 +1187,7 @@ static void ops_complete_biofill(void *stripe_head_ref)
 			while (rbi && rbi->bi_iter.bi_sector <
 				dev->sector + STRIPE_SECTORS) {
 				rbi2 = r5_next_bio(rbi, dev->sector);
-				if (!raid5_dec_bi_active_stripes(rbi))
-					bio_endio(rbi);
+				bio_endio(rbi);
 				rbi = rbi2;
 			}
 		}
@@ -3027,14 +3026,6 @@ static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx,
 		(unsigned long long)bi->bi_iter.bi_sector,
 		(unsigned long long)sh->sector);
 
-	/*
-	 * If several bio share a stripe. The bio bi_phys_segments acts as a
-	 * reference count to avoid race. The reference count should already be
-	 * increased before this function is called (for example, in
-	 * raid5_make_request()), so other bio sharing this stripe will not free the
-	 * stripe. If a stripe is owned by one stripe, the stripe lock will
-	 * protect it.
-	 */
 	spin_lock_irq(&sh->stripe_lock);
 	/* Don't allow new IO added to stripes in batch list */
 	if (sh->batch_head)
@@ -3060,7 +3051,7 @@ static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx,
 	if (*bip)
 		bi->bi_next = *bip;
 	*bip = bi;
-	raid5_inc_bi_active_stripes(bi);
+	bio_inc_remaining(bi);
 	md_write_start(conf->mddev, bi);
 
 	if (forwrite) {
@@ -3185,8 +3176,7 @@ handle_failed_stripe(struct r5conf *conf, struct stripe_head *sh,
 
 			bi->bi_error = -EIO;
 			md_write_end(conf->mddev);
-			if (!raid5_dec_bi_active_stripes(bi))
-				bio_endio(bi);
+			bio_endio(bi);
 			bi = nextbi;
 		}
 		if (bitmap_end)
@@ -3208,8 +3198,7 @@ handle_failed_stripe(struct r5conf *conf, struct stripe_head *sh,
 
 			bi->bi_error = -EIO;
 			md_write_end(conf->mddev);
-			if (!raid5_dec_bi_active_stripes(bi))
-				bio_endio(bi);
+			bio_endio(bi);
 			bi = bi2;
 		}
 
@@ -3234,8 +3223,7 @@ handle_failed_stripe(struct r5conf *conf, struct stripe_head *sh,
 					r5_next_bio(bi, sh->dev[i].sector);
 
 				bi->bi_error = -EIO;
-				if (!raid5_dec_bi_active_stripes(bi))
-					bio_endio(bi);
+				bio_endio(bi);
 				bi = nextbi;
 			}
 		}
@@ -3567,8 +3555,7 @@ static void handle_stripe_clean_event(struct r5conf *conf,
 					dev->sector + STRIPE_SECTORS) {
 					wbi2 = r5_next_bio(wbi, dev->sector);
 					md_write_end(conf->mddev);
-					if (!raid5_dec_bi_active_stripes(wbi))
-						bio_endio(wbi);
+					bio_endio(wbi);
 					wbi = wbi2;
 				}
 				bitmap_endwrite(conf->mddev->bitmap, sh->sector,
@@ -4913,7 +4900,7 @@ static struct bio *remove_bio_from_retry(struct r5conf *conf)
 		 * this sets the active strip count to 1 and the processed
 		 * strip count to zero (upper 8 bits)
 		 */
-		raid5_set_bi_stripes(bi, 1); /* biased count of active stripes */
+		raid5_set_bi_processed_stripes(bi, 0);
 	}
 
 	return bi;
@@ -5228,7 +5215,6 @@ static void make_discard_request(struct mddev *mddev, struct bio *bi)
 	struct r5conf *conf = mddev->private;
 	sector_t logical_sector, last_sector;
 	struct stripe_head *sh;
-	int remaining;
 	int stripe_sectors;
 
 	if (mddev->reshape_position != MaxSector)
@@ -5239,7 +5225,6 @@ static void make_discard_request(struct mddev *mddev, struct bio *bi)
 	last_sector = bi->bi_iter.bi_sector + (bi->bi_iter.bi_size>>9);
 
 	bi->bi_next = NULL;
-	bi->bi_phys_segments = 1; /* over-loaded to count active stripes */
 	md_write_start(mddev, bi);
 
 	stripe_sectors = conf->chunk_sectors *
@@ -5286,7 +5271,7 @@ static void make_discard_request(struct mddev *mddev, struct bio *bi)
 				continue;
 			sh->dev[d].towrite = bi;
 			set_bit(R5_OVERWRITE, &sh->dev[d].flags);
-			raid5_inc_bi_active_stripes(bi);
+			bio_inc_remaining(bi);
 			md_write_start(mddev, bi);
 			sh->overwrite_disks++;
 		}
@@ -5311,10 +5296,7 @@ static void make_discard_request(struct mddev *mddev, struct bio *bi)
 	}
 
 	md_write_end(mddev);
-	remaining = raid5_dec_bi_active_stripes(bi);
-	if (remaining == 0) {
-		bio_endio(bi);
-	}
+	bio_endio(bi);
 }
 
 static void raid5_make_request(struct mddev *mddev, struct bio * bi)
@@ -5325,7 +5307,6 @@ static void raid5_make_request(struct mddev *mddev, struct bio * bi)
 	sector_t logical_sector, last_sector;
 	struct stripe_head *sh;
 	const int rw = bio_data_dir(bi);
-	int remaining;
 	DEFINE_WAIT(w);
 	bool do_prepare;
 	bool do_flush = false;
@@ -5368,7 +5349,6 @@ static void raid5_make_request(struct mddev *mddev, struct bio * bi)
 	logical_sector = bi->bi_iter.bi_sector & ~((sector_t)STRIPE_SECTORS-1);
 	last_sector = bio_end_sector(bi);
 	bi->bi_next = NULL;
-	bi->bi_phys_segments = 1;	/* over-loaded to count active stripes */
 	md_write_start(mddev, bi);
 
 	prepare_to_wait(&conf->wait_for_overlap, &w, TASK_UNINTERRUPTIBLE);
@@ -5506,10 +5486,7 @@ static void raid5_make_request(struct mddev *mddev, struct bio * bi)
 
 	if (rw == WRITE)
 		md_write_end(mddev);
-	remaining = raid5_dec_bi_active_stripes(bi);
-	if (remaining == 0) {
-		bio_endio(bi);
-	}
+	bio_endio(bi);
 }
 
 static sector_t raid5_size(struct mddev *mddev, sector_t sectors, int raid_disks);
@@ -5874,7 +5851,6 @@ static int  retry_aligned_read(struct r5conf *conf, struct bio *raid_bio)
 	int dd_idx;
 	sector_t sector, logical_sector, last_sector;
 	int scnt = 0;
-	int remaining;
 	int handled = 0;
 
 	logical_sector = raid_bio->bi_iter.bi_sector &
@@ -5913,10 +5889,8 @@ static int  retry_aligned_read(struct r5conf *conf, struct bio *raid_bio)
 		raid5_release_stripe(sh);
 		handled++;
 	}
-	remaining = raid5_dec_bi_active_stripes(raid_bio);
-	if (remaining == 0) {
-		bio_endio(raid_bio);
-	}
+	bio_endio(raid_bio);
+
 	if (atomic_dec_and_test(&conf->active_aligned_reads))
 		wake_up(&conf->wait_for_quiescent);
 	return handled;
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index 9051631cb7f0..3018a33693ab 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -481,8 +481,7 @@ static inline struct bio *r5_next_bio(struct bio *bio, sector_t sector)
 }
 
 /*
- * We maintain a biased count of active stripes in the bottom 16 bits of
- * bi_phys_segments, and a count of processed stripes in the upper 16 bits
+ * We maintain a count of processed stripes in the upper 16 bits
  */
 static inline int raid5_bi_processed_stripes(struct bio *bio)
 {
@@ -491,20 +490,6 @@ static inline int raid5_bi_processed_stripes(struct bio *bio)
 	return (atomic_read(segments) >> 16) & 0xffff;
 }
 
-static inline int raid5_dec_bi_active_stripes(struct bio *bio)
-{
-	atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;
-
-	return atomic_sub_return(1, segments) & 0xffff;
-}
-
-static inline void raid5_inc_bi_active_stripes(struct bio *bio)
-{
-	atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;
-
-	atomic_inc(segments);
-}
-
 static inline void raid5_set_bi_processed_stripes(struct bio *bio,
 	unsigned int cnt)
 {




* [md PATCH 06/14] md/raid5: remove over-loading of ->bi_phys_segments.
  2017-02-16  4:39 [md PATCH 00/14] remove all abuse of bi_phys_segments NeilBrown
                   ` (9 preceding siblings ...)
  2017-02-16  4:39 ` [md PATCH 05/14] md/raid5: use bio_inc_remaining() instead of repurposing bi_phys_segments as a counter NeilBrown
@ 2017-02-16  4:39 ` NeilBrown
  2017-02-16  4:39 ` [md PATCH 12/14] md: factor out set_in_sync() NeilBrown
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 27+ messages in thread
From: NeilBrown @ 2017-02-16  4:39 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, hch

When a read request, which bypassed the cache, fails, we need to retry
it through the cache.
This involves attaching it to a sequence of stripe_heads, and it may not
be possible to get all the stripe_heads we need at once.
We do what we can, and record how far we got in ->bi_phys_segments so
we can pick up again later.

There is only ever one bio which may have a non-zero offset stored in
->bi_phys_segments, the one that is either active in the single thread
which calls retry_aligned_read(), or is in conf->retry_read_aligned
waiting for retry_aligned_read() to be called again.

So we only need to store one offset value.  This can be in a local
variable passed between remove_bio_from_retry() and
retry_aligned_read(), or in the r5conf structure next to the
->retry_read_aligned pointer.

Storing it there allows the last usage of ->bi_phys_segments to be
removed from md/raid5.c.
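
A minimal sketch of the bookkeeping this leaves behind (hypothetical
names, not the raid5 code): each retry pass records how many stripes
of the bio it has already handled, and the next pass resumes from that
offset instead of digging the value out of bi_phys_segments.

/* Illustrative sketch of the retry-offset bookkeeping. */
#include <stdbool.h>
#include <stdio.h>

struct model_conf {
	int retry_offset;	/* models conf->retry_read_offset */
	bool bio_waiting;	/* models conf->retry_read_aligned != NULL */
};

/* Pretend only 'budget' stripe_heads are available per pass. */
static int retry_pass(struct model_conf *conf, int total_stripes, int budget)
{
	int handled = 0;
	int scnt;

	for (scnt = 0; scnt < total_stripes; scnt++) {
		if (scnt < conf->retry_offset)
			continue;			/* already done this stripe */
		if (handled == budget) {
			conf->retry_offset = scnt;	/* remember where to resume */
			conf->bio_waiting = true;
			return handled;
		}
		handled++;				/* "submitted" one more stripe */
	}
	conf->bio_waiting = false;			/* whole bio handled */
	return handled;
}

int main(void)
{
	struct model_conf conf = { .retry_offset = 0, .bio_waiting = true };

	while (conf.bio_waiting) {
		int n = retry_pass(&conf, 10, 4);

		printf("pass handled %d stripes, resume offset %d\n",
		       n, conf.retry_offset);
	}
	return 0;
}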

Signed-off-by: NeilBrown <neilb@suse.com>
---
 drivers/md/raid5.c |   24 ++++++++++++------------
 drivers/md/raid5.h |   30 +-----------------------------
 2 files changed, 13 insertions(+), 41 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 905abf081acf..f93e8fddbb23 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -4883,12 +4883,14 @@ static void add_bio_to_retry(struct bio *bi,struct r5conf *conf)
 	md_wakeup_thread(conf->mddev->thread);
 }
 
-static struct bio *remove_bio_from_retry(struct r5conf *conf)
+static struct bio *remove_bio_from_retry(struct r5conf *conf,
+					 unsigned int *offset)
 {
 	struct bio *bi;
 
 	bi = conf->retry_read_aligned;
 	if (bi) {
+		*offset = conf->retry_read_offset;
 		conf->retry_read_aligned = NULL;
 		return bi;
 	}
@@ -4896,11 +4898,7 @@ static struct bio *remove_bio_from_retry(struct r5conf *conf)
 	if(bi) {
 		conf->retry_read_aligned_list = bi->bi_next;
 		bi->bi_next = NULL;
-		/*
-		 * this sets the active strip count to 1 and the processed
-		 * strip count to zero (upper 8 bits)
-		 */
-		raid5_set_bi_processed_stripes(bi, 0);
+		*offset = 0;
 	}
 
 	return bi;
@@ -5835,7 +5833,8 @@ static inline sector_t raid5_sync_request(struct mddev *mddev, sector_t sector_n
 	return STRIPE_SECTORS;
 }
 
-static int  retry_aligned_read(struct r5conf *conf, struct bio *raid_bio)
+static int  retry_aligned_read(struct r5conf *conf, struct bio *raid_bio,
+			       unsigned int offset)
 {
 	/* We may not be able to submit a whole bio at once as there
 	 * may not be enough stripe_heads available.
@@ -5864,7 +5863,7 @@ static int  retry_aligned_read(struct r5conf *conf, struct bio *raid_bio)
 		     sector += STRIPE_SECTORS,
 		     scnt++) {
 
-		if (scnt < raid5_bi_processed_stripes(raid_bio))
+		if (scnt < offset)
 			/* already done this stripe */
 			continue;
 
@@ -5872,15 +5871,15 @@ static int  retry_aligned_read(struct r5conf *conf, struct bio *raid_bio)
 
 		if (!sh) {
 			/* failed to get a stripe - must wait */
-			raid5_set_bi_processed_stripes(raid_bio, scnt);
 			conf->retry_read_aligned = raid_bio;
+			conf->retry_read_offset = scnt;
 			return handled;
 		}
 
 		if (!add_stripe_bio(sh, raid_bio, dd_idx, 0, 0)) {
 			raid5_release_stripe(sh);
-			raid5_set_bi_processed_stripes(raid_bio, scnt);
 			conf->retry_read_aligned = raid_bio;
+			conf->retry_read_offset = scnt;
 			return handled;
 		}
 
@@ -6003,6 +6002,7 @@ static void raid5d(struct md_thread *thread)
 	while (1) {
 		struct bio *bio;
 		int batch_size, released;
+		unsigned int offset;
 
 		released = release_stripe_list(conf, conf->temp_inactive_list);
 		if (released)
@@ -6020,10 +6020,10 @@ static void raid5d(struct md_thread *thread)
 		}
 		raid5_activate_delayed(conf);
 
-		while ((bio = remove_bio_from_retry(conf))) {
+		while ((bio = remove_bio_from_retry(conf, &offset))) {
 			int ok;
 			spin_unlock_irq(&conf->device_lock);
-			ok = retry_aligned_read(conf, bio);
+			ok = retry_aligned_read(conf, bio, offset);
 			spin_lock_irq(&conf->device_lock);
 			if (!ok)
 				break;
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index 3018a33693ab..a4ef02176afb 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -480,35 +480,6 @@ static inline struct bio *r5_next_bio(struct bio *bio, sector_t sector)
 		return NULL;
 }
 
-/*
- * We maintain a count of processed stripes in the upper 16 bits
- */
-static inline int raid5_bi_processed_stripes(struct bio *bio)
-{
-	atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;
-
-	return (atomic_read(segments) >> 16) & 0xffff;
-}
-
-static inline void raid5_set_bi_processed_stripes(struct bio *bio,
-	unsigned int cnt)
-{
-	atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;
-	int old, new;
-
-	do {
-		old = atomic_read(segments);
-		new = (old & 0xffff) | (cnt << 16);
-	} while (atomic_cmpxchg(segments, old, new) != old);
-}
-
-static inline void raid5_set_bi_stripes(struct bio *bio, unsigned int cnt)
-{
-	atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;
-
-	atomic_set(segments, cnt);
-}
-
 /* NOTE NR_STRIPE_HASH_LOCKS must remain below 64.
  * This is because we sometimes take all the spinlocks
  * and creating that much locking depth can cause
@@ -596,6 +567,7 @@ struct r5conf {
 	struct list_head	delayed_list; /* stripes that have plugged requests */
 	struct list_head	bitmap_list; /* stripes delaying awaiting bitmap update */
 	struct bio		*retry_read_aligned; /* currently retrying aligned bios   */
+	unsigned int		retry_read_offset; /* sector offset into retry_read_aligned */
 	struct bio		*retry_read_aligned_list; /* aligned bios retry list  */
 	atomic_t		preread_active_stripes; /* stripes with scheduled io */
 	atomic_t		active_aligned_reads;




* [md PATCH 08/14] md/raid1, raid10: move rXbio accounting closer to allocation.
  2017-02-16  4:39 [md PATCH 00/14] remove all abuse of bi_phys_segments NeilBrown
                   ` (6 preceding siblings ...)
  2017-02-16  4:39 ` [md PATCH 10/14] md/raid1: stop using bi_phys_segment NeilBrown
@ 2017-02-16  4:39 ` NeilBrown
  2017-02-16  4:39 ` [md PATCH 09/14] md/raid10: stop using bi_phys_segments NeilBrown
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 27+ messages in thread
From: NeilBrown @ 2017-02-16  4:39 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, hch

When raid1 or raid10 find that they will need to allocate a new
r1bio/r10bio in order to work around a known bad block, they account
for the allocation well before the allocation is actually made.  This
separation makes the correctness less obvious and requires comments.

The accounting only needs to happen a little before the allocation:
before the first rXbio is submitted, but that is all.

So move the accounting down to where it makes more sense.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 drivers/md/raid1.c  |   23 ++++++++++-------------
 drivers/md/raid10.c |   22 +++++++++-------------
 2 files changed, 19 insertions(+), 26 deletions(-)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 7b0f647bcccb..c1d0675880fb 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1326,18 +1326,9 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
 		goto retry_write;
 	}
 
-	if (max_sectors < r1_bio->sectors) {
-		/* We are splitting this write into multiple parts, so
-		 * we need to prepare for allocating another r1_bio.
-		 */
+	if (max_sectors < r1_bio->sectors)
 		r1_bio->sectors = max_sectors;
-		spin_lock_irq(&conf->device_lock);
-		if (bio->bi_phys_segments == 0)
-			bio->bi_phys_segments = 2;
-		else
-			bio->bi_phys_segments++;
-		spin_unlock_irq(&conf->device_lock);
-	}
+
 	sectors_handled = r1_bio->sector + max_sectors - bio->bi_iter.bi_sector;
 
 	atomic_set(&r1_bio->remaining, 1);
@@ -1426,10 +1417,16 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
 	 * as it could result in the bio being freed.
 	 */
 	if (sectors_handled < bio_sectors(bio)) {
-		r1_bio_write_done(r1_bio);
-		/* We need another r1_bio.  It has already been counted
+		/* We need another r1_bio, which must be accounted
 		 * in bio->bi_phys_segments
 		 */
+		spin_lock_irq(&conf->device_lock);
+		if (bio->bi_phys_segments == 0)
+			bio->bi_phys_segments = 2;
+		else
+			bio->bi_phys_segments++;
+		spin_unlock_irq(&conf->device_lock);
+		r1_bio_write_done(r1_bio);
 		r1_bio = mempool_alloc(conf->r1bio_pool, GFP_NOIO);
 		r1_bio->master_bio = bio;
 		r1_bio->sectors = bio_sectors(bio) - sectors_handled;
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 1920756828df..9258cbe233bb 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -1383,18 +1383,8 @@ static void raid10_write_request(struct mddev *mddev, struct bio *bio,
 		goto retry_write;
 	}
 
-	if (max_sectors < r10_bio->sectors) {
-		/* We are splitting this into multiple parts, so
-		 * we need to prepare for allocating another r10_bio.
-		 */
+	if (max_sectors < r10_bio->sectors)
 		r10_bio->sectors = max_sectors;
-		spin_lock_irq(&conf->device_lock);
-		if (bio->bi_phys_segments == 0)
-			bio->bi_phys_segments = 2;
-		else
-			bio->bi_phys_segments++;
-		spin_unlock_irq(&conf->device_lock);
-	}
 	sectors_handled = r10_bio->sector + max_sectors -
 		bio->bi_iter.bi_sector;
 
@@ -1491,10 +1481,16 @@ static void raid10_write_request(struct mddev *mddev, struct bio *bio,
 	 */
 
 	if (sectors_handled < bio_sectors(bio)) {
-		one_write_done(r10_bio);
-		/* We need another r10_bio.  It has already been counted
+		/* We need another r10_bio and it needs to be counted
 		 * in bio->bi_phys_segments.
 		 */
+		spin_lock_irq(&conf->device_lock);
+		if (bio->bi_phys_segments == 0)
+			bio->bi_phys_segments = 2;
+		else
+			bio->bi_phys_segments++;
+		spin_unlock_irq(&conf->device_lock);
+		one_write_done(r10_bio);
 		r10_bio = mempool_alloc(conf->r10bio_pool, GFP_NOIO);
 
 		r10_bio->master_bio = bio;




* [md PATCH 07/14] Revert "md/raid5: limit request size according to implementation limits"
  2017-02-16  4:39 [md PATCH 00/14] remove all abuse of bi_phys_segments NeilBrown
                   ` (4 preceding siblings ...)
  2017-02-16  4:39 ` [md PATCH 11/14] md/raid5: don't test ->writes_pending in raid5_remove_disk NeilBrown
@ 2017-02-16  4:39 ` NeilBrown
  2017-02-16  4:39 ` [md PATCH 10/14] md/raid1: stop using bi_phys_segment NeilBrown
                   ` (7 subsequent siblings)
  13 siblings, 0 replies; 27+ messages in thread
From: NeilBrown @ 2017-02-16  4:39 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, hch

This reverts commit e8d7c33232e5fdfa761c3416539bc5b4acd12db5.

Now that raid5 doesn't abuse bi_phys_segments any more, we no longer
need to impose these limits.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 drivers/md/raid5.c |    9 ---------
 1 file changed, 9 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index f93e8fddbb23..c96fca2c6a98 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -7100,15 +7100,6 @@ static int raid5_run(struct mddev *mddev)
 			stripe = (stripe | (stripe-1)) + 1;
 		mddev->queue->limits.discard_alignment = stripe;
 		mddev->queue->limits.discard_granularity = stripe;
-
-		/*
-		 * We use 16-bit counter of active stripes in bi_phys_segments
-		 * (minus one for over-loaded initialization)
-		 */
-		blk_queue_max_hw_sectors(mddev->queue, 0xfffe * STRIPE_SECTORS);
-		blk_queue_max_discard_sectors(mddev->queue,
-					      0xfffe * STRIPE_SECTORS);
-
 		/*
 		 * unaligned part of discard request will be ignored, so can't
 		 * guarantee discard_zeroes_data




* [md PATCH 10/14] md/raid1: stop using bi_phys_segment
  2017-02-16  4:39 [md PATCH 00/14] remove all abuse of bi_phys_segments NeilBrown
                   ` (5 preceding siblings ...)
  2017-02-16  4:39 ` [md PATCH 07/14] Revert "md/raid5: limit request size according to implementation limits" NeilBrown
@ 2017-02-16  4:39 ` NeilBrown
  2017-02-20 10:57   ` Ming Lei
  2017-02-16  4:39 ` [md PATCH 08/14] md/raid1, raid10: move rXbio accounting closer to allocation NeilBrown
                   ` (6 subsequent siblings)
  13 siblings, 1 reply; 27+ messages in thread
From: NeilBrown @ 2017-02-16  4:39 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, hch

Change to using bio->__bi_remaining to count the number of r1bios
attached to a bio.
See the previous raid10 patch for more details.

inc_pending is a little more complicated in raid1 as we need to adjust
next_window_requests or current_window_requests.

The wait_event() call that waited for earlier r1bios when
start_next_window changed is no longer needed, as each r1_bio is
completed separately.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 drivers/md/raid1.c |   99 ++++++++++++++++++----------------------------------
 1 file changed, 34 insertions(+), 65 deletions(-)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index c1d0675880fb..5a57111c7bc9 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -241,36 +241,19 @@ static void reschedule_retry(struct r1bio *r1_bio)
 static void call_bio_endio(struct r1bio *r1_bio)
 {
 	struct bio *bio = r1_bio->master_bio;
-	int done;
 	struct r1conf *conf = r1_bio->mddev->private;
 	sector_t start_next_window = r1_bio->start_next_window;
 	sector_t bi_sector = bio->bi_iter.bi_sector;
 
-	if (bio->bi_phys_segments) {
-		unsigned long flags;
-		spin_lock_irqsave(&conf->device_lock, flags);
-		bio->bi_phys_segments--;
-		done = (bio->bi_phys_segments == 0);
-		spin_unlock_irqrestore(&conf->device_lock, flags);
-		/*
-		 * make_request() might be waiting for
-		 * bi_phys_segments to decrease
-		 */
-		wake_up(&conf->wait_barrier);
-	} else
-		done = 1;
-
 	if (!test_bit(R1BIO_Uptodate, &r1_bio->state))
 		bio->bi_error = -EIO;
 
-	if (done) {
-		bio_endio(bio);
-		/*
-		 * Wake up any possible resync thread that waits for the device
-		 * to go idle.
-		 */
-		allow_barrier(conf, start_next_window, bi_sector);
-	}
+	bio_endio(bio);
+	/*
+	 * Wake up any possible resync thread that waits for the device
+	 * to go idle.
+	 */
+	allow_barrier(conf, start_next_window, bi_sector);
 }
 
 static void raid_end_bio_io(struct r1bio *r1_bio)
@@ -923,6 +906,26 @@ static sector_t wait_barrier(struct r1conf *conf, struct bio *bio)
 	return sector;
 }
 
+static void inc_pending(struct r1conf *conf, sector_t start_next_window,
+			sector_t bi_sector)
+{
+	/* The current request requires multiple r1_bio, so
+	 * we need to increment the pending count, and the corresponding
+	 * window count.
+	 */
+	spin_lock(&conf->resync_lock);
+	conf->nr_pending++;
+	if (start_next_window == conf->start_next_window) {
+		if (conf->start_next_window + NEXT_NORMALIO_DISTANCE
+		    <= bi_sector)
+			conf->next_window_requests++;
+		else
+			conf->current_window_requests++;
+	} else
+		conf->current_window_requests++;
+	spin_unlock(&conf->resync_lock);
+}
+
 static void allow_barrier(struct r1conf *conf, sector_t start_next_window,
 			  sector_t bi_sector)
 {
@@ -1137,12 +1140,7 @@ static void raid1_read_request(struct mddev *mddev, struct bio *bio,
 		sectors_handled = (r1_bio->sector + max_sectors
 				   - bio->bi_iter.bi_sector);
 		r1_bio->sectors = max_sectors;
-		spin_lock_irq(&conf->device_lock);
-		if (bio->bi_phys_segments == 0)
-			bio->bi_phys_segments = 2;
-		else
-			bio->bi_phys_segments++;
-		spin_unlock_irq(&conf->device_lock);
+		bio_inc_remaining(bio);
 
 		/*
 		 * Cannot call generic_make_request directly as that will be
@@ -1304,7 +1302,6 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
 	if (unlikely(blocked_rdev)) {
 		/* Wait for this device to become unblocked */
 		int j;
-		sector_t old = start_next_window;
 
 		for (j = 0; j < i; j++)
 			if (r1_bio->bios[j])
@@ -1314,15 +1311,6 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
 		raid1_log(mddev, "wait rdev %d blocked", blocked_rdev->raid_disk);
 		md_wait_for_blocked_rdev(blocked_rdev, mddev);
 		start_next_window = wait_barrier(conf, bio);
-		/*
-		 * We must make sure the multi r1bios of bio have
-		 * the same value of bi_phys_segments
-		 */
-		if (bio->bi_phys_segments && old &&
-		    old != start_next_window)
-			/* Wait for the former r1bio(s) to complete */
-			wait_event(conf->wait_barrier,
-				   bio->bi_phys_segments == 1);
 		goto retry_write;
 	}
 
@@ -1417,22 +1405,18 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
 	 * as it could result in the bio being freed.
 	 */
 	if (sectors_handled < bio_sectors(bio)) {
-		/* We need another r1_bio, which must be accounted
-		 * in bio->bi_phys_segments
-		 */
-		spin_lock_irq(&conf->device_lock);
-		if (bio->bi_phys_segments == 0)
-			bio->bi_phys_segments = 2;
-		else
-			bio->bi_phys_segments++;
-		spin_unlock_irq(&conf->device_lock);
+		/* We need another r1_bio, which must be counted */
+		sector_t sect = bio->bi_iter.bi_sector + sectors_handled;
+
+		inc_pending(conf, start_next_window, sect);
+		bio_inc_remaining(bio);
 		r1_bio_write_done(r1_bio);
 		r1_bio = mempool_alloc(conf->r1bio_pool, GFP_NOIO);
 		r1_bio->master_bio = bio;
 		r1_bio->sectors = bio_sectors(bio) - sectors_handled;
 		r1_bio->state = 0;
 		r1_bio->mddev = mddev;
-		r1_bio->sector = bio->bi_iter.bi_sector + sectors_handled;
+		r1_bio->sector = sect;
 		goto retry_write;
 	}
 
@@ -1460,16 +1444,6 @@ static void raid1_make_request(struct mddev *mddev, struct bio *bio)
 	r1_bio->mddev = mddev;
 	r1_bio->sector = bio->bi_iter.bi_sector;
 
-	/*
-	 * We might need to issue multiple reads to different devices if there
-	 * are bad blocks around, so we keep track of the number of reads in
-	 * bio->bi_phys_segments.  If this is 0, there is only one r1_bio and
-	 * no locking will be needed when requests complete.  If it is
-	 * non-zero, then it is the number of not-completed requests.
-	 */
-	bio->bi_phys_segments = 0;
-	bio_clear_flag(bio, BIO_SEG_VALID);
-
 	if (bio_data_dir(bio) == READ)
 		raid1_read_request(mddev, bio, r1_bio);
 	else
@@ -2434,12 +2408,7 @@ static void handle_read_error(struct r1conf *conf, struct r1bio *r1_bio)
 			int sectors_handled = (r1_bio->sector + max_sectors
 					       - mbio->bi_iter.bi_sector);
 			r1_bio->sectors = max_sectors;
-			spin_lock_irq(&conf->device_lock);
-			if (mbio->bi_phys_segments == 0)
-				mbio->bi_phys_segments = 2;
-			else
-				mbio->bi_phys_segments++;
-			spin_unlock_irq(&conf->device_lock);
+			bio_inc_remaining(mbio);
 			trace_block_bio_remap(bdev_get_queue(bio->bi_bdev),
 					      bio, bio_dev, bio_sector);
 			generic_make_request(bio);




* [md PATCH 09/14] md/raid10: stop using bi_phys_segments
  2017-02-16  4:39 [md PATCH 00/14] remove all abuse of bi_phys_segments NeilBrown
                   ` (7 preceding siblings ...)
  2017-02-16  4:39 ` [md PATCH 08/14] md/raid1, raid10: move rXbio accounting closer to allocation NeilBrown
@ 2017-02-16  4:39 ` NeilBrown
  2017-02-16 14:26   ` Jack Wang
  2017-02-16  4:39 ` [md PATCH 05/14] md/raid5: use bio_inc_remaining() instead of repurposing bi_phys_segments as a counter NeilBrown
                   ` (4 subsequent siblings)
  13 siblings, 1 reply; 27+ messages in thread
From: NeilBrown @ 2017-02-16  4:39 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, hch

raid10 currently repurposes bi_phys_segments on each
incoming bio to count how many r10bios were used to encode the
request.

We need to know when the number of attached r10bios reaches
zero to:
1/ call bio_endio() when all IO on the bio is finished
2/ decrement ->nr_pending so that resync IO can proceed.

Now that the bio has its own __bi_remaining counter, that
can be used instead. We can call bio_inc_remaining to
increment the counter and call bio_endio() every time an
r10bio completes, rather than only when bi_phys_segments
reaches zero.

This addresses point 1, but not point 2.  bio_endio()
doesn't (and cannot) report when the last r10bio has
finished, so a different approach is needed.

So: instead of counting bios in ->nr_pending, count r10bios.
i.e. every time we attach an r10bio, increment nr_pending.
Every time an r10bio completes, decrement nr_pending.

Normally we only increment nr_pending after first checking
that ->barrier is zero, or after some other non-trivial tests and
possible waiting.  When attaching multiple r10bios to a bio,
we only need the tests and the waiting once.  After the
first increment, subsequent increments can happen
unconditionally as they are really all part of the one
request.

So introduce inc_pending() which can be used when we know
that nr_pending is already elevated.
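
A rough model of that counting scheme (userspace C, invented names,
not raid10.c): the first r10bio pays the full wait_barrier() cost, any
further r10bios for the same request just bump the already-elevated
count, and resync may only proceed once the count drains to zero.

/* Illustrative model of the nr_pending/barrier accounting. */
#include <assert.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int nr_pending;		/* models conf->nr_pending */
static atomic_int barrier_raised;	/* models conf->barrier (resync active) */

static void model_wait_barrier(void)
{
	/* The real code sleeps until the barrier drops; the model insists. */
	assert(!atomic_load(&barrier_raised));
	atomic_fetch_add(&nr_pending, 1);
}

static void model_inc_pending(void)
{
	/* Only legal while nr_pending is already elevated by wait_barrier(). */
	assert(atomic_load(&nr_pending) > 0);
	atomic_fetch_add(&nr_pending, 1);
}

static void model_allow_barrier(void)
{
	atomic_fetch_sub(&nr_pending, 1);	/* one r10bio completed */
}

int main(void)
{
	int i;

	model_wait_barrier();			/* first r10bio of the request */
	model_inc_pending();			/* second r10bio: no barrier check */
	model_inc_pending();			/* third r10bio */

	for (i = 0; i < 3; i++)
		model_allow_barrier();		/* each r10bio completes */

	printf("nr_pending now %d; resync may raise the barrier\n",
	       atomic_load(&nr_pending));
	return 0;
}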

Signed-off-by: NeilBrown <neilb@suse.com>
---
 drivers/md/raid10.c |   76 +++++++++++++++++----------------------------------
 1 file changed, 25 insertions(+), 51 deletions(-)

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 9258cbe233bb..6b4d8643c574 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -301,27 +301,18 @@ static void reschedule_retry(struct r10bio *r10_bio)
 static void raid_end_bio_io(struct r10bio *r10_bio)
 {
 	struct bio *bio = r10_bio->master_bio;
-	int done;
 	struct r10conf *conf = r10_bio->mddev->private;
 
-	if (bio->bi_phys_segments) {
-		unsigned long flags;
-		spin_lock_irqsave(&conf->device_lock, flags);
-		bio->bi_phys_segments--;
-		done = (bio->bi_phys_segments == 0);
-		spin_unlock_irqrestore(&conf->device_lock, flags);
-	} else
-		done = 1;
 	if (!test_bit(R10BIO_Uptodate, &r10_bio->state))
 		bio->bi_error = -EIO;
-	if (done) {
-		bio_endio(bio);
-		/*
-		 * Wake up any possible resync thread that waits for the device
-		 * to go idle.
-		 */
-		allow_barrier(conf);
-	}
+
+	/*
+	 * Wake up any possible resync thread that waits for the device
+	 * to go idle.
+	 */
+	allow_barrier(conf);
+	bio_endio(bio);
+
 	free_r10bio(r10_bio);
 }
 
@@ -984,6 +975,15 @@ static void wait_barrier(struct r10conf *conf)
 	spin_unlock_irq(&conf->resync_lock);
 }
 
+static void inc_pending(struct r10conf *conf)
+{
+	/* The current request requires multiple r10_bio, so
+	 * we need to increment the pending count.
+	 */
+	WARN_ON(!atomic_read(&conf->nr_pending));
+	atomic_inc(&conf->nr_pending);
+}
+
 static void allow_barrier(struct r10conf *conf)
 {
 	if ((atomic_dec_and_test(&conf->nr_pending)) ||
@@ -1161,12 +1161,8 @@ static void raid10_read_request(struct mddev *mddev, struct bio *bio,
 		sectors_handled = (r10_bio->sector + max_sectors
 				   - bio->bi_iter.bi_sector);
 		r10_bio->sectors = max_sectors;
-		spin_lock_irq(&conf->device_lock);
-		if (bio->bi_phys_segments == 0)
-			bio->bi_phys_segments = 2;
-		else
-			bio->bi_phys_segments++;
-		spin_unlock_irq(&conf->device_lock);
+		inc_pending(conf);
+		bio_inc_remaining(bio);
 		/*
 		 * Cannot call generic_make_request directly as that will be
 		 * queued in __generic_make_request and subsequent
@@ -1261,9 +1257,7 @@ static void raid10_write_request(struct mddev *mddev, struct bio *bio,
 	 * on which we have seen a write error, we want to avoid
 	 * writing to those blocks.  This potentially requires several
 	 * writes to write around the bad blocks.  Each set of writes
-	 * gets its own r10_bio with a set of bios attached.  The number
-	 * of r10_bios is recored in bio->bi_phys_segments just as with
-	 * the read case.
+	 * gets its own r10_bio with a set of bios attached.
 	 */
 
 	r10_bio->read_slot = -1; /* make sure repl_bio gets freed */
@@ -1481,15 +1475,9 @@ static void raid10_write_request(struct mddev *mddev, struct bio *bio,
 	 */
 
 	if (sectors_handled < bio_sectors(bio)) {
-		/* We need another r10_bio and it needs to be counted
-		 * in bio->bi_phys_segments.
-		 */
-		spin_lock_irq(&conf->device_lock);
-		if (bio->bi_phys_segments == 0)
-			bio->bi_phys_segments = 2;
-		else
-			bio->bi_phys_segments++;
-		spin_unlock_irq(&conf->device_lock);
+		/* We need another r10_bio and it needs to be counted */
+		inc_pending(conf);
+		bio_inc_remaining(bio);
 		one_write_done(r10_bio);
 		r10_bio = mempool_alloc(conf->r10bio_pool, GFP_NOIO);
 
@@ -1518,16 +1506,6 @@ static void __make_request(struct mddev *mddev, struct bio *bio)
 	r10_bio->sector = bio->bi_iter.bi_sector;
 	r10_bio->state = 0;
 
-	/*
-	 * We might need to issue multiple reads to different devices if there
-	 * are bad blocks around, so we keep track of the number of reads in
-	 * bio->bi_phys_segments.  If this is 0, there is only one r10_bio and
-	 * no locking will be needed when the request completes.  If it is
-	 * non-zero, then it is the number of not-completed requests.
-	 */
-	bio->bi_phys_segments = 0;
-	bio_clear_flag(bio, BIO_SEG_VALID);
-
 	if (bio_data_dir(bio) == READ)
 		raid10_read_request(mddev, bio, r10_bio);
 	else
@@ -2662,12 +2640,8 @@ static void handle_read_error(struct mddev *mddev, struct r10bio *r10_bio)
 			r10_bio->sector + max_sectors
 			- mbio->bi_iter.bi_sector;
 		r10_bio->sectors = max_sectors;
-		spin_lock_irq(&conf->device_lock);
-		if (mbio->bi_phys_segments == 0)
-			mbio->bi_phys_segments = 2;
-		else
-			mbio->bi_phys_segments++;
-		spin_unlock_irq(&conf->device_lock);
+		bio_inc_remaining(mbio);
+		inc_pending(conf);
 		generic_make_request(bio);
 
 		r10_bio = mempool_alloc(conf->r10bio_pool,



^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [md PATCH 11/14] md/raid5: don't test ->writes_pending in raid5_remove_disk
  2017-02-16  4:39 [md PATCH 00/14] remove all abuse of bi_phys_segments NeilBrown
                   ` (3 preceding siblings ...)
  2017-02-16  4:39 ` [md PATCH 02/14] md/raid5: simplfy delaying of writes while metadata is updated NeilBrown
@ 2017-02-16  4:39 ` NeilBrown
  2017-02-16  4:39 ` [md PATCH 07/14] Revert "md/raid5: limit request size according to implementation limits" NeilBrown
                   ` (8 subsequent siblings)
  13 siblings, 0 replies; 27+ messages in thread
From: NeilBrown @ 2017-02-16  4:39 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, hch

This test on ->writes_pending cannot be safe as the counter
can be incremented at any moment and cannot be locked against.

Change it to test conf->active_stripes.  This allows read requests
to interfere so it might make removal of the journal device
harder, but it is safer.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 drivers/md/raid5.c |    6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index c96fca2c6a98..4de03479cdce 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -7279,10 +7279,14 @@ static int raid5_remove_disk(struct mddev *mddev, struct md_rdev *rdev)
 		 * we can't wait pending write here, as this is called in
 		 * raid5d, wait will deadlock.
 		 */
-		if (atomic_read(&mddev->writes_pending))
+		lock_all_device_hash_locks_irq(conf);
+		if (atomic_read(&conf->active_stripes)) {
+			unlock_all_device_hash_locks_irq(conf);
 			return -EBUSY;
+		}
 		log = conf->log;
 		conf->log = NULL;
+		unlock_all_device_hash_locks_irq(conf);
 		synchronize_rcu();
 		r5l_exit_log(log);
 		return 0;



^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [md PATCH 12/14] md: factor out set_in_sync()
  2017-02-16  4:39 [md PATCH 00/14] remove all abuse of bi_phys_segments NeilBrown
                   ` (10 preceding siblings ...)
  2017-02-16  4:39 ` [md PATCH 06/14] md/raid5: remove over-loading of ->bi_phys_segments NeilBrown
@ 2017-02-16  4:39 ` NeilBrown
  2017-02-16  4:39 ` [md PATCH 13/14] md: close a race with setting mddev->in_sync NeilBrown
  2017-02-16  4:39 ` [md PATCH 14/14] MD: use per-cpu counter for writes_pending NeilBrown
  13 siblings, 0 replies; 27+ messages in thread
From: NeilBrown @ 2017-02-16  4:39 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, hch

Three separate places in md.c check if the number of active
writes is zero and, if so, set mddev->in_sync.

There are a few differences, but there shouldn't be:
- it is always appropriate to notify the change in
  sysfs_state, and there is no need to do this outside a
  spin-locked region.
- we never need to check ->recovery_cp.  The state of resync
  is not relevant for whether there are any pending writes
  or not (which is what ->in_sync reports).

So create set_in_sync() which does the correct tests and
makes the correct changes, and call this in all three
places.

Any behaviour changes here are minor and cosmetic.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 drivers/md/md.c |   57 ++++++++++++++++++++++---------------------------------
 1 file changed, 23 insertions(+), 34 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 01175dac0db6..1d5fdc4a659d 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -2250,6 +2250,24 @@ static void export_array(struct mddev *mddev)
 	mddev->major_version = 0;
 }
 
+static bool set_in_sync(struct mddev *mddev)
+{
+	bool ret = false;
+
+	WARN_ON_ONCE(!spin_is_locked(&mddev->lock));
+	if (atomic_read(&mddev->writes_pending) == 0) {
+		if (mddev->in_sync == 0) {
+			mddev->in_sync = 1;
+			set_bit(MD_SB_CHANGE_CLEAN, &mddev->sb_flags);
+			sysfs_notify_dirent_safe(mddev->sysfs_state);
+		}
+		ret = true;
+	}
+	if (mddev->safemode == 1)
+		mddev->safemode = 0;
+	return ret;
+}
+
 static void sync_sbs(struct mddev *mddev, int nospares)
 {
 	/* Update each superblock (in-memory image), but
@@ -3948,7 +3966,7 @@ static int restart_array(struct mddev *mddev);
 static ssize_t
 array_state_store(struct mddev *mddev, const char *buf, size_t len)
 {
-	int err;
+	int err = 0;
 	enum array_state st = match_word(buf, array_states);
 
 	if (mddev->pers && (st == active || st == clean) && mddev->ro != 1) {
@@ -3961,18 +3979,9 @@ array_state_store(struct mddev *mddev, const char *buf, size_t len)
 			clear_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags);
 			md_wakeup_thread(mddev->thread);
 			wake_up(&mddev->sb_wait);
-			err = 0;
 		} else /* st == clean */ {
 			restart_array(mddev);
-			if (atomic_read(&mddev->writes_pending) == 0) {
-				if (mddev->in_sync == 0) {
-					mddev->in_sync = 1;
-					if (mddev->safemode == 1)
-						mddev->safemode = 0;
-					set_bit(MD_SB_CHANGE_CLEAN, &mddev->sb_flags);
-				}
-				err = 0;
-			} else
+			if (!set_in_sync(mddev))
 				err = -EBUSY;
 		}
 		if (!err)
@@ -4030,15 +4039,7 @@ array_state_store(struct mddev *mddev, const char *buf, size_t len)
 			if (err)
 				break;
 			spin_lock(&mddev->lock);
-			if (atomic_read(&mddev->writes_pending) == 0) {
-				if (mddev->in_sync == 0) {
-					mddev->in_sync = 1;
-					if (mddev->safemode == 1)
-						mddev->safemode = 0;
-					set_bit(MD_SB_CHANGE_CLEAN, &mddev->sb_flags);
-				}
-				err = 0;
-			} else
+			if (!set_in_sync(mddev))
 				err = -EBUSY;
 			spin_unlock(&mddev->lock);
 		} else
@@ -8450,22 +8451,10 @@ void md_check_recovery(struct mddev *mddev)
 				md_reload_sb(mddev, mddev->good_device_nr);
 		}
 
-		if (!mddev->external) {
-			int did_change = 0;
+		if (!mddev->external && !mddev->in_sync) {
 			spin_lock(&mddev->lock);
-			if (mddev->safemode &&
-			    !atomic_read(&mddev->writes_pending) &&
-			    !mddev->in_sync &&
-			    mddev->recovery_cp == MaxSector) {
-				mddev->in_sync = 1;
-				did_change = 1;
-				set_bit(MD_SB_CHANGE_CLEAN, &mddev->sb_flags);
-			}
-			if (mddev->safemode == 1)
-				mddev->safemode = 0;
+			set_in_sync(mddev);
 			spin_unlock(&mddev->lock);
-			if (did_change)
-				sysfs_notify_dirent_safe(mddev->sysfs_state);
 		}
 
 		if (mddev->sb_flags)



^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [md PATCH 13/14] md: close a race with setting mddev->in_sync
  2017-02-16  4:39 [md PATCH 00/14] remove all abuse of bi_phys_segments NeilBrown
                   ` (11 preceding siblings ...)
  2017-02-16  4:39 ` [md PATCH 12/14] md: factor out set_in_sync() NeilBrown
@ 2017-02-16  4:39 ` NeilBrown
  2017-02-16  4:39 ` [md PATCH 14/14] MD: use per-cpu counter for writes_pending NeilBrown
  13 siblings, 0 replies; 27+ messages in thread
From: NeilBrown @ 2017-02-16  4:39 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, hch

If ->in_sync is being set just as md_write_start() is being called,
it is possible that set_in_sync() won't see the elevated
->writes_pending, and md_write_start() won't see the set ->in_sync.

To close this race, re-test ->writes_pending after setting ->in_sync,
and add memory barriers to ensure the increment of ->writes_pending
will be seen by the time of this second test, or the new ->in_sync
will be seen by md_write_start().

Add a spinlock to array_state_show() to ensure this temporary
instability is never visible from userspace.
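
To see why the two barriers are enough, here is an illustration (not
part of the patch) of the two racing paths:

    md_write_start()                 set_in_sync()
    atomic_inc(&writes_pending)      in_sync = 1
    smp_mb()                         smp_mb()
    test in_sync                     re-test writes_pending

This is the classic store-buffer pattern: with both barriers present
the two sides cannot both read the other's old value, so either
md_write_start() sees in_sync set and clears it under ->lock, or
set_in_sync() sees the pending write and reverts in_sync to 0.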

Signed-off-by: NeilBrown <neilb@suse.com>
---
 drivers/md/md.c |    7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 1d5fdc4a659d..f856c95ee7d5 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -2258,6 +2258,10 @@ static bool set_in_sync(struct mddev *mddev)
 	if (atomic_read(&mddev->writes_pending) == 0) {
 		if (mddev->in_sync == 0) {
 			mddev->in_sync = 1;
+			smp_mb();
+			if (atomic_read(&mddev->writes_pending))
+				/* lost a race with md_write_start() */
+				mddev->in_sync = 0;
 			set_bit(MD_SB_CHANGE_CLEAN, &mddev->sb_flags);
 			sysfs_notify_dirent_safe(mddev->sysfs_state);
 		}
@@ -3938,6 +3942,7 @@ array_state_show(struct mddev *mddev, char *page)
 			st = read_auto;
 			break;
 		case 0:
+			spin_lock(&mddev->lock);
 			if (test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags))
 				st = write_pending;
 			else if (mddev->in_sync)
@@ -3946,6 +3951,7 @@ array_state_show(struct mddev *mddev, char *page)
 				st = active_idle;
 			else
 				st = active;
+			spin_unlock(&mddev->lock);
 		}
 	else {
 		if (list_empty(&mddev->disks) &&
@@ -7769,6 +7775,7 @@ void md_write_start(struct mddev *mddev, struct bio *bi)
 		did_change = 1;
 	}
 	atomic_inc(&mddev->writes_pending);
+	smp_mb(); /* Match smp_mb in set_in_sync() */
 	if (mddev->safemode == 1)
 		mddev->safemode = 0;
 	if (mddev->in_sync) {



^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [md PATCH 14/14] MD: use per-cpu counter for writes_pending
  2017-02-16  4:39 [md PATCH 00/14] remove all abuse of bi_phys_segments NeilBrown
                   ` (12 preceding siblings ...)
  2017-02-16  4:39 ` [md PATCH 13/14] md: close a race with setting mddev->in_sync NeilBrown
@ 2017-02-16  4:39 ` NeilBrown
  2017-02-16 20:12   ` Shaohua Li
  13 siblings, 1 reply; 27+ messages in thread
From: NeilBrown @ 2017-02-16  4:39 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, hch

The 'writes_pending' counter is used to determine when the
array is stable so that it can be marked in the superblock
as "Clean".  Consequently it needs to be updated frequently
but only checked for zero occasionally.  Recent changes to
raid5 cause the count to be updated even more often - once
per 4K rather than once per bio.  This provided
justification for making the updates more efficient.

So we replace the atomic counter with a per-cpu array of
'long' counters. Incrementing and decrementing is normally
much cheaper, testing for zero is more expensive.

To meaningfully be able to test for zero we need to be able
to block further updates.  This is done by forcing the
"increment" step to take a spinlock in the rare case that
another thread is checking if the count is zero.  This is
done using a new field: "checkers".  "checkers" is the
number of threads that are currently checking whether the
count is zero.  It is usually 0, occasionally 1, and it is
not impossible that it could be higher, though this would be
rare.

If, within an rcu_read_locked section, checkers is seen to
be zero, then the local-cpu counter can be incremented
freely.  If checkers is not zero, mddev->lock must be taken
before the increment is allowed.  A decrement is always
allowed.

To test for zero, a thread must increment "checkers", call
synchronize_rcu(), then take mddev->lock.  Once this is done
no new increments can happen.  A thread may choose to
perform a quick test-for-zero by summing all the counters
without holding a lock.  If this is non-zero, then the total
count is non-zero, or was non-zero very recently, so it is
safe to assume that it isn't zero.  If the quick check does
report a zero sum, then it is worth performing the locking
protocol.

When the counter is decremented, it is no longer possible to
immediately test if the result is zero
(atomic_dec_and_test()).  We don't even really want to
perform the "quick" tests as that sums over all cpus and is
work that will most often bring no benefit.

In the "safemode==2" case, when we want to mark the array as
"clean" immediately when there are no writes, we perform the
quick test anyway, and possibly wake the md thread to do the
full test.  "safemode==2" is only used during shutdown so
the cost is not problematic.

When safemode!=2 we always set the timer, rather than only
when the counter reaches zero.

If mod_timer() is called to set the timeout to the value it
already has, mod_timer() has low overhead with no atomic
operations.  So at worst it will have a noticeable cost once
per jiffie.  To further reduce the overhead, we round the
requested delay to a multiple of ->safemode_delay.  This
might increase the delay until the timer fires a little, but
will reduce the overhead of calling mod_timer()
significantly.  If lots of requests are completing, the
timer will be updated every 200 milliseconds (by default)
and never fire.  When it does eventually fire, it will
schedule the md thread to perform the full test for
writes_pending==0, and this is quite likely to find '0'.
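
As a worked example (not from the patch, and assuming ->safemode_delay
works out to roughly 200 jiffies): a write completing at jiffies=1234
sets the timer to roundup(1234, 200) + 200 = 1600, and every completion
until jiffies passes 1400 computes that same expiry, so those
mod_timer() calls see an unchanged timeout and remain cheap.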

Signed-off-by: NeilBrown <neilb@suse.com>
---
 drivers/md/md.c |  107 +++++++++++++++++++++++++++++++++++++++++++++++--------
 drivers/md/md.h |    3 +-
 2 files changed, 93 insertions(+), 17 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index f856c95ee7d5..47bbd22e2865 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -64,6 +64,8 @@
 #include <linux/raid/md_p.h>
 #include <linux/raid/md_u.h>
 #include <linux/slab.h>
+#include <linux/percpu.h>
+
 #include <trace/events/block.h>
 #include "md.h"
 #include "bitmap.h"
@@ -2250,21 +2252,52 @@ static void export_array(struct mddev *mddev)
 	mddev->major_version = 0;
 }
 
+/*
+ * The percpu writes_pending counters are linked with ->checkers and
+ * ->lock.  If ->writes_pending can always be decremented without a
+ * lock.  It can only be incremented without a lock if ->checkers is 0
+ * and the test+incr happen in a rcu_readlocked region.
+ * ->checkers can only be changed under ->lock protection.
+ * To determine if ->writes_pending is totally zero, a quick sum without
+ * locks can be performed. If this is non-zero, then the result is final.
+ * Otherwise ->checkers must be incremented and synchronize_rcu() called.
+ * Then a sum calculated under ->lock, and the result is final until the
+ * ->checkers is decremented, or the lock is dropped.
+ *
+ */
+
+static bool __writes_pending(struct mddev *mddev)
+{
+	long cnt = 0;
+	int i;
+
+	for_each_possible_cpu(i)
+		cnt += *per_cpu_ptr(mddev->writes_pending_percpu, i);
+	return cnt != 0;
+}
+
 static bool set_in_sync(struct mddev *mddev)
 {
 	bool ret = false;
 
 	WARN_ON_ONCE(!spin_is_locked(&mddev->lock));
-	if (atomic_read(&mddev->writes_pending) == 0) {
-		if (mddev->in_sync == 0) {
+	if (!__writes_pending(mddev) && !mddev->in_sync) {
+		mddev->checkers++;
+		spin_unlock(&mddev->lock);
+		synchronize_rcu();
+		spin_lock(&mddev->lock);
+		if (!mddev->in_sync &&
+		    !__writes_pending(mddev)) {
 			mddev->in_sync = 1;
+			/*
+			 * Ensure in_sync is visible before ->checkers
+			 * is decremented
+			 */
 			smp_mb();
-			if (atomic_read(&mddev->writes_pending))
-				/* lost a race with md_write_start() */
-				mddev->in_sync = 0;
 			set_bit(MD_SB_CHANGE_CLEAN, &mddev->sb_flags);
 			sysfs_notify_dirent_safe(mddev->sysfs_state);
 		}
+		mddev->checkers--;
 		ret = true;
 	}
 	if (mddev->safemode == 1)
@@ -2272,6 +2305,29 @@ static bool set_in_sync(struct mddev *mddev)
 	return ret;
 }
 
+static void inc_writes_pending(struct mddev *mddev)
+{
+	rcu_read_lock();
+	if (mddev->checkers == 0) {
+		__this_cpu_inc(*mddev->writes_pending_percpu);
+		rcu_read_unlock();
+		return;
+	}
+	rcu_read_unlock();
+	/* Need that spinlock */
+	spin_lock(&mddev->lock);
+	this_cpu_inc(*mddev->writes_pending_percpu);
+	spin_unlock(&mddev->lock);
+}
+
+static void zero_writes_pending(struct mddev *mddev)
+{
+	int i;
+
+	for_each_possible_cpu(i)
+		*per_cpu_ptr(mddev->writes_pending_percpu, i) = 0;
+}
+
 static void sync_sbs(struct mddev *mddev, int nospares)
 {
 	/* Update each superblock (in-memory image), but
@@ -5000,6 +5056,8 @@ static void md_free(struct kobject *ko)
 		del_gendisk(mddev->gendisk);
 		put_disk(mddev->gendisk);
 	}
+	if (mddev->writes_pending_percpu)
+		free_percpu(mddev->writes_pending_percpu);
 
 	kfree(mddev);
 }
@@ -5076,6 +5134,13 @@ static int md_alloc(dev_t dev, char *name)
 	blk_queue_make_request(mddev->queue, md_make_request);
 	blk_set_stacking_limits(&mddev->queue->limits);
 
+	mddev->writes_pending_percpu = alloc_percpu(long);
+	if (!mddev->writes_pending_percpu) {
+		blk_cleanup_queue(mddev->queue);
+		mddev->queue = NULL;
+		goto abort;
+	}
+
 	disk = alloc_disk(1 << shift);
 	if (!disk) {
 		blk_cleanup_queue(mddev->queue);
@@ -5159,7 +5224,7 @@ static void md_safemode_timeout(unsigned long data)
 {
 	struct mddev *mddev = (struct mddev *) data;
 
-	if (!atomic_read(&mddev->writes_pending)) {
+	if (!__writes_pending(mddev)) {
 		mddev->safemode = 1;
 		if (mddev->external)
 			sysfs_notify_dirent_safe(mddev->sysfs_state);
@@ -5365,7 +5430,7 @@ int md_run(struct mddev *mddev)
 	} else if (mddev->ro == 2) /* auto-readonly not meaningful */
 		mddev->ro = 0;
 
-	atomic_set(&mddev->writes_pending,0);
+	zero_writes_pending(mddev);
 	atomic_set(&mddev->max_corr_read_errors,
 		   MD_DEFAULT_MAX_CORRECTED_READ_ERRORS);
 	mddev->safemode = 0;
@@ -7774,11 +7839,13 @@ void md_write_start(struct mddev *mddev, struct bio *bi)
 		md_wakeup_thread(mddev->sync_thread);
 		did_change = 1;
 	}
-	atomic_inc(&mddev->writes_pending);
+	rcu_read_lock();
+	inc_writes_pending(mddev);
 	smp_mb(); /* Match smp_mb in set_in_sync() */
 	if (mddev->safemode == 1)
 		mddev->safemode = 0;
-	if (mddev->in_sync) {
+	if (mddev->in_sync || mddev->checkers) {
+		rcu_read_unlock();
 		spin_lock(&mddev->lock);
 		if (mddev->in_sync) {
 			mddev->in_sync = 0;
@@ -7788,7 +7855,9 @@ void md_write_start(struct mddev *mddev, struct bio *bi)
 			did_change = 1;
 		}
 		spin_unlock(&mddev->lock);
-	}
+	} else
+		rcu_read_unlock();
+
 	if (did_change)
 		sysfs_notify_dirent_safe(mddev->sysfs_state);
 	wait_event(mddev->sb_wait,
@@ -7798,12 +7867,18 @@ EXPORT_SYMBOL(md_write_start);
 
 void md_write_end(struct mddev *mddev)
 {
-	if (atomic_dec_and_test(&mddev->writes_pending)) {
-		if (mddev->safemode == 2)
+	this_cpu_dec(*mddev->writes_pending_percpu);
+	if (mddev->safemode == 2) {
+		if (!__writes_pending(mddev))
 			md_wakeup_thread(mddev->thread);
-		else if (mddev->safemode_delay)
-			mod_timer(&mddev->safemode_timer, jiffies + mddev->safemode_delay);
-	}
+	} else if (mddev->safemode_delay)
+		/* The roundup() ensure this only performs locking once
+		 * every ->safemode_delay jiffies
+		 */
+		mod_timer(&mddev->safemode_timer,
+			  roundup(jiffies, mddev->safemode_delay) +
+			  mddev->safemode_delay);
+
 }
 EXPORT_SYMBOL(md_write_end);
 
@@ -8406,7 +8481,7 @@ void md_check_recovery(struct mddev *mddev)
 		test_bit(MD_RECOVERY_DONE, &mddev->recovery) ||
 		test_bit(MD_RELOAD_SB, &mddev->flags) ||
 		(mddev->external == 0 && mddev->safemode == 1) ||
-		(mddev->safemode == 2 && ! atomic_read(&mddev->writes_pending)
+		(mddev->safemode == 2 && !__writes_pending(mddev)
 		 && !mddev->in_sync && mddev->recovery_cp == MaxSector)
 		))
 		return;
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 2a514036a83d..7e41f882d33d 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -404,7 +404,8 @@ struct mddev {
 							 */
 	unsigned int			safemode_delay;
 	struct timer_list		safemode_timer;
-	atomic_t			writes_pending;
+	long				*writes_pending_percpu;
+	int				checkers;	/* # of threads checking writes_pending */
 	struct request_queue		*queue;	/* for plugging ... */
 
 	struct bitmap			*bitmap; /* the bitmap for the device */



^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [md PATCH 09/14] md/raid10: stop using bi_phys_segments
  2017-02-16  4:39 ` [md PATCH 09/14] md/raid10: stop using bi_phys_segments NeilBrown
@ 2017-02-16 14:26   ` Jack Wang
  2017-02-17  2:15     ` NeilBrown
  0 siblings, 1 reply; 27+ messages in thread
From: Jack Wang @ 2017-02-16 14:26 UTC (permalink / raw)
  To: NeilBrown; +Cc: Shaohua Li, linux-raid, Christoph Hellwig

2017-02-16 5:39 GMT+01:00 NeilBrown <neilb@suse.com>:
> raid10 currently repurposes bi_phys_segments on each
> incoming bio to count how many r10bio was used to encode the
> request.
>
> We need to know when the number of attached r10bio reaches
> zero to:
> 1/ call bio_endio() when all IO on the bio is finished
> 2/ decrement ->nr_pending so that resync IO can proceed.
>
> Now that the bio has its own __bi_remaining counter, that
> can be used instead. We can call bio_inc_remaining to
> increment the counter and call bio_endio() every time an
> r10bio completes, rather than only when bi_phys_segments
> reaches zero.
>
> This addresses point 1, but not point 2.  bio_endio()
> doesn't (and cannot) report when the last r10bio has
> finished, so a different approach is needed.
>
> So: instead of counting bios in ->nr_pending, count r10bios.
> i.e. every time we attach a bio, increment nr_pending.
> Every time an r10bio completes, decrement nr_pending.
>
> Normally we only increment nr_pending after first checking
> that ->barrier is zero, or some other non-trivial tests and
> possible waiting.  When attaching multiple r10bios to a bio,
> we only need the tests and the waiting once.  After the
> first increment, subsequent increments can happen
> unconditionally as they are really all part of the one
> request.
>
> So introduce inc_pending() which can be used when we know
> that nr_pending is already elevated.
>
> Signed-off-by: NeilBrown <neilb@suse.com>
> ---
>  drivers/md/raid10.c |   76 +++++++++++++++++----------------------------------
>  1 file changed, 25 insertions(+), 51 deletions(-)
>
> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> index 9258cbe233bb..6b4d8643c574 100644
> --- a/drivers/md/raid10.c
> +++ b/drivers/md/raid10.c
> @@ -301,27 +301,18 @@ static void reschedule_retry(struct r10bio *r10_bio)
>  static void raid_end_bio_io(struct r10bio *r10_bio)
>  {
>         struct bio *bio = r10_bio->master_bio;
> -       int done;
>         struct r10conf *conf = r10_bio->mddev->private;
>
> -       if (bio->bi_phys_segments) {
> -               unsigned long flags;
> -               spin_lock_irqsave(&conf->device_lock, flags);
> -               bio->bi_phys_segments--;
> -               done = (bio->bi_phys_segments == 0);
> -               spin_unlock_irqrestore(&conf->device_lock, flags);
> -       } else
> -               done = 1;
>         if (!test_bit(R10BIO_Uptodate, &r10_bio->state))
>                 bio->bi_error = -EIO;
> -       if (done) {
> -               bio_endio(bio);
> -               /*
> -                * Wake up any possible resync thread that waits for the device
> -                * to go idle.
> -                */
> -               allow_barrier(conf);
> -       }
> +
> +       /*
> +        * Wake up any possible resync thread that waits for the device
> +        * to go idle.
> +        */
> +       allow_barrier(conf);
> +       bio_endio(bio);
> +

Hi Neil,

Why do you switch the order of the above 2 lines?  Is there a reason
behind it?  I notice in raid1 you kept the order.

Regards,
Jack

>         free_r10bio(r10_bio);
>  }
>
> @@ -984,6 +975,15 @@ static void wait_barrier(struct r10conf *conf)
>         spin_unlock_irq(&conf->resync_lock);
>  }
>
> +static void inc_pending(struct r10conf *conf)
> +{
> +       /* The current request requires multiple r10_bio, so
> +        * we need to increment the pending count.
> +        */
> +       WARN_ON(!atomic_read(&conf->nr_pending));
> +       atomic_inc(&conf->nr_pending);
> +}
> +
>  static void allow_barrier(struct r10conf *conf)
>  {
>         if ((atomic_dec_and_test(&conf->nr_pending)) ||
> @@ -1161,12 +1161,8 @@ static void raid10_read_request(struct mddev *mddev, struct bio *bio,
>                 sectors_handled = (r10_bio->sector + max_sectors
>                                    - bio->bi_iter.bi_sector);
>                 r10_bio->sectors = max_sectors;
> -               spin_lock_irq(&conf->device_lock);
> -               if (bio->bi_phys_segments == 0)
> -                       bio->bi_phys_segments = 2;
> -               else
> -                       bio->bi_phys_segments++;
> -               spin_unlock_irq(&conf->device_lock);
> +               inc_pending(conf);
> +               bio_inc_remaining(bio);
>                 /*
>                  * Cannot call generic_make_request directly as that will be
>                  * queued in __generic_make_request and subsequent
> @@ -1261,9 +1257,7 @@ static void raid10_write_request(struct mddev *mddev, struct bio *bio,
>          * on which we have seen a write error, we want to avoid
>          * writing to those blocks.  This potentially requires several
>          * writes to write around the bad blocks.  Each set of writes
> -        * gets its own r10_bio with a set of bios attached.  The number
> -        * of r10_bios is recored in bio->bi_phys_segments just as with
> -        * the read case.
> +        * gets its own r10_bio with a set of bios attached.
>          */
>
>         r10_bio->read_slot = -1; /* make sure repl_bio gets freed */
> @@ -1481,15 +1475,9 @@ static void raid10_write_request(struct mddev *mddev, struct bio *bio,
>          */
>
>         if (sectors_handled < bio_sectors(bio)) {
> -               /* We need another r10_bio and it needs to be counted
> -                * in bio->bi_phys_segments.
> -                */
> -               spin_lock_irq(&conf->device_lock);
> -               if (bio->bi_phys_segments == 0)
> -                       bio->bi_phys_segments = 2;
> -               else
> -                       bio->bi_phys_segments++;
> -               spin_unlock_irq(&conf->device_lock);
> +               /* We need another r10_bio and it needs to be counted */
> +               inc_pending(conf);
> +               bio_inc_remaining(bio);
>                 one_write_done(r10_bio);
>                 r10_bio = mempool_alloc(conf->r10bio_pool, GFP_NOIO);
>
> @@ -1518,16 +1506,6 @@ static void __make_request(struct mddev *mddev, struct bio *bio)
>         r10_bio->sector = bio->bi_iter.bi_sector;
>         r10_bio->state = 0;
>
> -       /*
> -        * We might need to issue multiple reads to different devices if there
> -        * are bad blocks around, so we keep track of the number of reads in
> -        * bio->bi_phys_segments.  If this is 0, there is only one r10_bio and
> -        * no locking will be needed when the request completes.  If it is
> -        * non-zero, then it is the number of not-completed requests.
> -        */
> -       bio->bi_phys_segments = 0;
> -       bio_clear_flag(bio, BIO_SEG_VALID);
> -
>         if (bio_data_dir(bio) == READ)
>                 raid10_read_request(mddev, bio, r10_bio);
>         else
> @@ -2662,12 +2640,8 @@ static void handle_read_error(struct mddev *mddev, struct r10bio *r10_bio)
>                         r10_bio->sector + max_sectors
>                         - mbio->bi_iter.bi_sector;
>                 r10_bio->sectors = max_sectors;
> -               spin_lock_irq(&conf->device_lock);
> -               if (mbio->bi_phys_segments == 0)
> -                       mbio->bi_phys_segments = 2;
> -               else
> -                       mbio->bi_phys_segments++;
> -               spin_unlock_irq(&conf->device_lock);
> +               bio_inc_remaining(mbio);
> +               inc_pending(conf);
>                 generic_make_request(bio);
>
>                 r10_bio = mempool_alloc(conf->r10bio_pool,
>
>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [md PATCH 01/14] md/raid5: use md_write_start to count stripes, not bios
  2017-02-16  4:39 ` [md PATCH 01/14] md/raid5: use md_write_start to count stripes, not bios NeilBrown
@ 2017-02-16 17:29   ` Shaohua Li
  2017-02-17  2:04     ` NeilBrown
  0 siblings, 1 reply; 27+ messages in thread
From: Shaohua Li @ 2017-02-16 17:29 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid, hch

On Thu, Feb 16, 2017 at 03:39:01PM +1100, Neil Brown wrote:
> We use md_write_start() to increase the count of pending writes, and
> md_write_end() to decrement the count.  We currently count bios
> submitted to md/raid5.  Change it count stripe_heads that a WRITE bio
> has been attached to.
> 
> So now, raid5_make_request() calls md_write_start() and then
> md_write_end() to keep the count elevated during the setup of the
> request.
> 
> add_stripe_bio() calls md_write_start() for each stripe_head, and the
> completion routines always call md_write_end(), instead of only
> calling it when raid5_dec_bi_active_stripes() returns 0.
> make_discard_request also calls md_write_start/end().
> 
> The parallel between md_write_{start,end} and use of bi_phys_segments
> can be seen in that:
>  Whenever we set bi_phys_segments to 1, we now call md_write_start.
>  Whenever we increment it on non-read requests with
>    raid5_inc_bi_active_stripes(), we now call md_write_start().
>  Whenever we decrement bi_phys_segments on non-read requests with
>     raid5_dec_bi_active_stripes(), we now call md_write_end().
> 
> This reduces our dependence on keeping a per-bio count of active
> stripes in bi_phys_segments.
> 
> Signed-off-by: NeilBrown <neilb@suse.com>
> ---
>  drivers/md/raid5-cache.c |    2 +-
>  drivers/md/raid5.c       |   27 +++++++++++++--------------
>  2 files changed, 14 insertions(+), 15 deletions(-)
> 
> diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
> index 302dea3296ba..4b211f914d17 100644
> --- a/drivers/md/raid5-cache.c
> +++ b/drivers/md/raid5-cache.c
> @@ -265,8 +265,8 @@ r5c_return_dev_pending_writes(struct r5conf *conf, struct r5dev *dev,
>  	while (wbi && wbi->bi_iter.bi_sector <
>  	       dev->sector + STRIPE_SECTORS) {
>  		wbi2 = r5_next_bio(wbi, dev->sector);
> +		md_write_end(conf->mddev);
>  		if (!raid5_dec_bi_active_stripes(wbi)) {
> -			md_write_end(conf->mddev);
>  			bio_list_add(return_bi, wbi);
>  		}
>  		wbi = wbi2;
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 3c7e106c12a2..760b726943c9 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -3075,6 +3075,7 @@ static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx,
>  		bi->bi_next = *bip;
>  	*bip = bi;
>  	raid5_inc_bi_active_stripes(bi);
> +	md_write_start(conf->mddev, bi);
>  
>  	if (forwrite) {
>  		/* check if page is covered */
> @@ -3198,10 +3199,9 @@ handle_failed_stripe(struct r5conf *conf, struct stripe_head *sh,
>  			struct bio *nextbi = r5_next_bio(bi, sh->dev[i].sector);
>  
>  			bi->bi_error = -EIO;
> -			if (!raid5_dec_bi_active_stripes(bi)) {
> -				md_write_end(conf->mddev);
> +			md_write_end(conf->mddev);
> +			if (!raid5_dec_bi_active_stripes(bi))
>  				bio_list_add(return_bi, bi);
> -			}
>  			bi = nextbi;
>  		}
>  		if (bitmap_end)
> @@ -3222,10 +3222,9 @@ handle_failed_stripe(struct r5conf *conf, struct stripe_head *sh,
>  			struct bio *bi2 = r5_next_bio(bi, sh->dev[i].sector);
>  
>  			bi->bi_error = -EIO;
> -			if (!raid5_dec_bi_active_stripes(bi)) {
> -				md_write_end(conf->mddev);
> +			md_write_end(conf->mddev);
> +			if (!raid5_dec_bi_active_stripes(bi))
>  				bio_list_add(return_bi, bi);
> -			}
>  			bi = bi2;
>  		}
>  
> @@ -3582,10 +3581,9 @@ static void handle_stripe_clean_event(struct r5conf *conf,
>  				while (wbi && wbi->bi_iter.bi_sector <
>  					dev->sector + STRIPE_SECTORS) {
>  					wbi2 = r5_next_bio(wbi, dev->sector);
> -					if (!raid5_dec_bi_active_stripes(wbi)) {
> -						md_write_end(conf->mddev);
> +					md_write_end(conf->mddev);
> +					if (!raid5_dec_bi_active_stripes(wbi))
>  						bio_list_add(return_bi, wbi);
> -					}
>  					wbi = wbi2;
>  				}
>  				bitmap_endwrite(conf->mddev->bitmap, sh->sector,
> @@ -5268,6 +5266,7 @@ static void make_discard_request(struct mddev *mddev, struct bio *bi)
>  
>  	bi->bi_next = NULL;
>  	bi->bi_phys_segments = 1; /* over-loaded to count active stripes */
> +	md_write_start(mddev, bi);

md_write_start() calls wait_event(), so it can't be called with a lock held.
Maybe add a special __md_write_start() which only increases the count; we
already wait in make_request().
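
For illustration only, a minimal sketch of such a helper, assuming the
caller has already been through the full md_write_start() for this bio
and only needs another reference:

	static void __md_write_start(struct mddev *mddev)
	{
		/* No wait_event() and no in_sync handling - just take
		 * another reference, so this is safe under a lock.
		 */
		atomic_inc(&mddev->writes_pending);
	}

The name and placement are hypothetical, of course.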

>  	stripe_sectors = conf->chunk_sectors *
>  		(conf->raid_disks - conf->max_degraded);
> @@ -5314,6 +5313,7 @@ static void make_discard_request(struct mddev *mddev, struct bio *bi)
>  			sh->dev[d].towrite = bi;
>  			set_bit(R5_OVERWRITE, &sh->dev[d].flags);
>  			raid5_inc_bi_active_stripes(bi);
> +			md_write_start(mddev, bi);

Ditto.

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [md PATCH 02/14] md/raid5: simplfy delaying of writes while metadata is updated.
  2017-02-16  4:39 ` [md PATCH 02/14] md/raid5: simplfy delaying of writes while metadata is updated NeilBrown
@ 2017-02-16 17:37   ` Shaohua Li
  2017-02-17  2:10     ` NeilBrown
  0 siblings, 1 reply; 27+ messages in thread
From: Shaohua Li @ 2017-02-16 17:37 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid, hch

On Thu, Feb 16, 2017 at 03:39:01PM +1100, Neil Brown wrote:
> If a device fails during a write, we must ensure the failure is
> recorded in the metadata before the completion of the write is
> acknowledged.
> 
> Commit c3cce6cda162 ("md/raid5: ensure device failure recorded before
> write request returns.")  added code for this, but it was
> unnecessarily complicated.  We already had similar functionality for
> handling updates to the bad-block-list, thanks to Commit de393cdea66c
> ("md: make it easier to wait for bad blocks to be acknowledged.")
> 
> So revert most of the former commit, and instead avoid collecting
> completed writes if MD_CHANGE_PENDING is set.  raid5d() will then flush
> the metadata and retry the stripe_head.
> 
> We check MD_CHANGE_PENDING *after* analyse_stripe() as it could be set
> asynchronously.  After analyse_stripe(), we have collected stable data
> about the state of devices, which will be used to make decisions.
> 
> Signed-off-by: NeilBrown <neilb@suse.com>
> ---
>  drivers/md/raid5.c |   27 ++++-----------------------
>  drivers/md/raid5.h |    3 ---
>  2 files changed, 4 insertions(+), 26 deletions(-)
> 
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 760b726943c9..154593e0afbe 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -4492,7 +4492,8 @@ static void handle_stripe(struct stripe_head *sh)
>  	if (test_bit(STRIPE_LOG_TRAPPED, &sh->state))
>  		goto finish;
>  
> -	if (s.handle_bad_blocks) {
> +	if (s.handle_bad_blocks ||
> +	    test_bit(MD_SB_CHANGE_PENDING, &conf->mddev->sb_flags)) {
>  		set_bit(STRIPE_HANDLE, &sh->state);
>  		goto finish;
>  	}

This part is fragile. We don't delay the stripe but just post it to the
handle list again, so raid5d or a worker thread could keep running this
stripe indefinitely.

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [md PATCH 14/14] MD: use per-cpu counter for writes_pending
  2017-02-16  4:39 ` [md PATCH 14/14] MD: use per-cpu counter for writes_pending NeilBrown
@ 2017-02-16 20:12   ` Shaohua Li
  2017-02-17  2:34     ` NeilBrown
  0 siblings, 1 reply; 27+ messages in thread
From: Shaohua Li @ 2017-02-16 20:12 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid, hch

On Thu, Feb 16, 2017 at 03:39:03PM +1100, Neil Brown wrote:
> The 'writes_pending' counter is used to determine when the
> array is stable so that it can be marked in the superblock
> as "Clean".  Consequently it needs to be updated frequently
> but only checked for zero occasionally.  Recent changes to
> raid5 cause the count to be updated even more often - once
> per 4K rather than once per bio.  This provided
> justification for making the updates more efficient.
> 
> So we replace the atomic counter with a per-cpu array of
> 'long' counters. Incrementing and decrementing is normally
> much cheaper, testing for zero is more expensive.
> 
> To meaningfully be able to test for zero we need to be able
> to block further updates.  This is done by forcing the
> "increment" step to take a spinlock in the rare case that
> another thread is checking if the count is zero.  This is
> done using a new field: "checkers".  "checkers" is the
> number of threads that are currently checking whether the
> count is zero.  It is usually 0, occasionally 1, and it is
> not impossible that it could be higher, though this would be
> rare.
> 
> If, within an rcu_read_locked section, checkers is seen to
> be zero, then the local-cpu counter can be incremented
> freely.  If checkers is not zero, mddev->lock must be taken
> before the increment is allowed.  A decrement is always
> allowed.
> 
> To test for zero, a thread must increment "checkers", call
> synchronize_rcu(), then take mddev->lock.  Once this is done
> no new increments can happen.  A thread may choose to
> perform a quick test-for-zero by summing all the counters
> without holding a lock.  If this is non-zero, then the total
> count is non-zero, or was non-zero very recently, so it is
> safe to assume that it isn't zero.  If the quick check does
> report a zero sum, then it is worth performing the locking
> protocol.
> 
> When the counter is decremented, it is no longer possible to
> immediately test if the result is zero
> (atomic_dec_and_test()).  We don't even really want to
> perform the "quick" tests as that sums over all cpus and is
> work that will most often bring no benefit.
> 
> In the "safemode==2" case, when we want to mark the array as
> "clean" immediately when there are no writes, we perform the
> quick test anyway, and possibly wake the md thread to do the
> full test.  "safemode==2" is only used during shutdown so
> the cost is not problematic.
> 
> When safemode!=2 we always set the timer, rather than only
> when the counter reaches zero.
> 
> If mod_timer() is called to set the timeout to the value it
> already has, mod_timer() has low overhead with no atomic
> operations.  So at worst it will have a noticeable cost once
> per jiffie.  To further reduce the overhead, we round the
> requested delay to a multiple of ->safemode_delay.  This
> might increase the delay until the timer fires a little, but
> will reduce the overhead of calling mod_timer()
> significantly.  If lots of requests are completing, the
> timer will be updated every 200 milliseconds (by default)
> and never fire.  When it does eventually fire, it will
> schedule the md thread to perform the full test for
> writes_pending==0, and this is quite likely to find '0'.
> 
> Signed-off-by: NeilBrown <neilb@suse.com>

This sounds like a good place to use percpu-refcount. In set_in_sync, we switch
it to atomic, read it, then switch it back to percpu.
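
Purely as an illustrative sketch of that idea - assuming writes_pending
becomes a percpu_ref with a no-op release, md_write_start()/md_write_end()
become percpu_ref_get()/percpu_ref_put(), and a switch-and-wait helper
(called percpu_ref_switch_to_atomic_sync() here) is available, which
percpu-refcount does not currently provide:

	static bool set_in_sync(struct mddev *mddev)
	{
		bool ret = false;

		spin_unlock(&mddev->lock);
		/* Switch to atomic mode and wait for in-flight percpu
		 * updates to drain, so the count is stable.
		 */
		percpu_ref_switch_to_atomic_sync(&mddev->writes_pending);
		spin_lock(&mddev->lock);
		if (!mddev->in_sync &&
		    percpu_ref_is_zero(&mddev->writes_pending)) {
			mddev->in_sync = 1;
			set_bit(MD_SB_CHANGE_CLEAN, &mddev->sb_flags);
			sysfs_notify_dirent_safe(mddev->sysfs_state);
			ret = true;
		}
		percpu_ref_switch_to_percpu(&mddev->writes_pending);
		return ret;
	}

This is only meant to show the shape of the API usage, not a working
patch.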

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [md PATCH 01/14] md/raid5: use md_write_start to count stripes, not bios
  2017-02-16 17:29   ` Shaohua Li
@ 2017-02-17  2:04     ` NeilBrown
  0 siblings, 0 replies; 27+ messages in thread
From: NeilBrown @ 2017-02-17  2:04 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, hch

[-- Attachment #1: Type: text/plain, Size: 5258 bytes --]

On Thu, Feb 16 2017, Shaohua Li wrote:

> On Thu, Feb 16, 2017 at 03:39:01PM +1100, Neil Brown wrote:
>> We use md_write_start() to increase the count of pending writes, and
>> md_write_end() to decrement the count.  We currently count bios
>> submitted to md/raid5.  Change it count stripe_heads that a WRITE bio
>> has been attached to.
>> 
>> So now, raid5_make_request() calls md_write_start() and then
>> md_write_end() to keep the count elevated during the setup of the
>> request.
>> 
>> add_stripe_bio() calls md_write_start() for each stripe_head, and the
>> completion routines always call md_write_end(), instead of only
>> calling it when raid5_dec_bi_active_stripes() returns 0.
>> make_discard_request also calls md_write_start/end().
>> 
>> The parallel between md_write_{start,end} and use of bi_phys_segments
>> can be seen in that:
>>  Whenever we set bi_phys_segments to 1, we now call md_write_start.
>>  Whenever we increment it on non-read requests with
>>    raid5_inc_bi_active_stripes(), we now call md_write_start().
>>  Whenever we decrement bi_phys_segments on non-read requests with
>>     raid5_dec_bi_active_stripes(), we now call md_write_end().
>> 
>> This reduces our dependence on keeping a per-bio count of active
>> stripes in bi_phys_segments.
>> 
>> Signed-off-by: NeilBrown <neilb@suse.com>
>> ---
>>  drivers/md/raid5-cache.c |    2 +-
>>  drivers/md/raid5.c       |   27 +++++++++++++--------------
>>  2 files changed, 14 insertions(+), 15 deletions(-)
>> 
>> diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
>> index 302dea3296ba..4b211f914d17 100644
>> --- a/drivers/md/raid5-cache.c
>> +++ b/drivers/md/raid5-cache.c
>> @@ -265,8 +265,8 @@ r5c_return_dev_pending_writes(struct r5conf *conf, struct r5dev *dev,
>>  	while (wbi && wbi->bi_iter.bi_sector <
>>  	       dev->sector + STRIPE_SECTORS) {
>>  		wbi2 = r5_next_bio(wbi, dev->sector);
>> +		md_write_end(conf->mddev);
>>  		if (!raid5_dec_bi_active_stripes(wbi)) {
>> -			md_write_end(conf->mddev);
>>  			bio_list_add(return_bi, wbi);
>>  		}
>>  		wbi = wbi2;
>> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
>> index 3c7e106c12a2..760b726943c9 100644
>> --- a/drivers/md/raid5.c
>> +++ b/drivers/md/raid5.c
>> @@ -3075,6 +3075,7 @@ static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx,
>>  		bi->bi_next = *bip;
>>  	*bip = bi;
>>  	raid5_inc_bi_active_stripes(bi);
>> +	md_write_start(conf->mddev, bi);
>>  
>>  	if (forwrite) {
>>  		/* check if page is covered */
>> @@ -3198,10 +3199,9 @@ handle_failed_stripe(struct r5conf *conf, struct stripe_head *sh,
>>  			struct bio *nextbi = r5_next_bio(bi, sh->dev[i].sector);
>>  
>>  			bi->bi_error = -EIO;
>> -			if (!raid5_dec_bi_active_stripes(bi)) {
>> -				md_write_end(conf->mddev);
>> +			md_write_end(conf->mddev);
>> +			if (!raid5_dec_bi_active_stripes(bi))
>>  				bio_list_add(return_bi, bi);
>> -			}
>>  			bi = nextbi;
>>  		}
>>  		if (bitmap_end)
>> @@ -3222,10 +3222,9 @@ handle_failed_stripe(struct r5conf *conf, struct stripe_head *sh,
>>  			struct bio *bi2 = r5_next_bio(bi, sh->dev[i].sector);
>>  
>>  			bi->bi_error = -EIO;
>> -			if (!raid5_dec_bi_active_stripes(bi)) {
>> -				md_write_end(conf->mddev);
>> +			md_write_end(conf->mddev);
>> +			if (!raid5_dec_bi_active_stripes(bi))
>>  				bio_list_add(return_bi, bi);
>> -			}
>>  			bi = bi2;
>>  		}
>>  
>> @@ -3582,10 +3581,9 @@ static void handle_stripe_clean_event(struct r5conf *conf,
>>  				while (wbi && wbi->bi_iter.bi_sector <
>>  					dev->sector + STRIPE_SECTORS) {
>>  					wbi2 = r5_next_bio(wbi, dev->sector);
>> -					if (!raid5_dec_bi_active_stripes(wbi)) {
>> -						md_write_end(conf->mddev);
>> +					md_write_end(conf->mddev);
>> +					if (!raid5_dec_bi_active_stripes(wbi))
>>  						bio_list_add(return_bi, wbi);
>> -					}
>>  					wbi = wbi2;
>>  				}
>>  				bitmap_endwrite(conf->mddev->bitmap, sh->sector,
>> @@ -5268,6 +5266,7 @@ static void make_discard_request(struct mddev *mddev, struct bio *bi)
>>  
>>  	bi->bi_next = NULL;
>>  	bi->bi_phys_segments = 1; /* over-loaded to count active stripes */
>> +	md_write_start(mddev, bi);
>
> md_write_start() calls wait_event(), so it can't be called with a lock held.
> Maybe add a special __md_write_start() which only increases the count; we
> already wait in make_request().

I cannot see what lock is held here, but I can see that a lock is held
below while md_write_start() is called.

An "md_write_incr()" or similar would certainly make sense.  I'll add
that in.

Thanks,
NeilBrown


>
>>  	stripe_sectors = conf->chunk_sectors *
>>  		(conf->raid_disks - conf->max_degraded);
>> @@ -5314,6 +5313,7 @@ static void make_discard_request(struct mddev *mddev, struct bio *bi)
>>  			sh->dev[d].towrite = bi;
>>  			set_bit(R5_OVERWRITE, &sh->dev[d].flags);
>>  			raid5_inc_bi_active_stripes(bi);
>> +			md_write_start(mddev, bi);
>
> Ditto.
>
> Thanks,
> Shaohua

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [md PATCH 02/14] md/raid5: simplfy delaying of writes while metadata is updated.
  2017-02-16 17:37   ` Shaohua Li
@ 2017-02-17  2:10     ` NeilBrown
  0 siblings, 0 replies; 27+ messages in thread
From: NeilBrown @ 2017-02-17  2:10 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, hch

[-- Attachment #1: Type: text/plain, Size: 2134 bytes --]

On Thu, Feb 16 2017, Shaohua Li wrote:

> On Thu, Feb 16, 2017 at 03:39:01PM +1100, Neil Brown wrote:
>> If a device fails during a write, we must ensure the failure is
>> recorded in the metadata before the completion of the write is
>> acknowledged.
>> 
>> Commit c3cce6cda162 ("md/raid5: ensure device failure recorded before
>> write request returns.")  added code for this, but it was
>> unnecessarily complicated.  We already had similar functionality for
>> handling updates to the bad-block-list, thanks to Commit de393cdea66c
>> ("md: make it easier to wait for bad blocks to be acknowledged.")
>> 
>> So revert most of the former commit, and instead avoid collecting
>> completed writes if MD_CHANGE_PENDING is set.  raid5d() will then flush
>> the metadata and retry the stripe_head.
>> 
>> We check MD_CHANGE_PENDING *after* analyse_stripe() as it could be set
>> asynchronously.  After analyse_stripe(), we have collected stable data
>> about the state of devices, which will be used to make decisions.
>> 
>> Signed-off-by: NeilBrown <neilb@suse.com>
>> ---
>>  drivers/md/raid5.c |   27 ++++-----------------------
>>  drivers/md/raid5.h |    3 ---
>>  2 files changed, 4 insertions(+), 26 deletions(-)
>> 
>> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
>> index 760b726943c9..154593e0afbe 100644
>> --- a/drivers/md/raid5.c
>> +++ b/drivers/md/raid5.c
>> @@ -4492,7 +4492,8 @@ static void handle_stripe(struct stripe_head *sh)
>>  	if (test_bit(STRIPE_LOG_TRAPPED, &sh->state))
>>  		goto finish;
>>  
>> -	if (s.handle_bad_blocks) {
>> +	if (s.handle_bad_blocks ||
>> +	    test_bit(MD_SB_CHANGE_PENDING, &conf->mddev->sb_flags)) {
>>  		set_bit(STRIPE_HANDLE, &sh->state);
>>  		goto finish;
>>  	}
>
> This part is fragile. We don't delay the stripe but just post it to the
> handle list again, so raid5d or a worker thread could keep running this
> stripe indefinitely.
>

Ah yes - I forgot about the worker threads.

I think we need to get raid5_do_work() to block while MD_CHANGE_PENDING
is set.
I'll look over the interactions there and see if that really makes sense.

Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [md PATCH 09/14] md/raid10: stop using bi_phys_segments
  2017-02-16 14:26   ` Jack Wang
@ 2017-02-17  2:15     ` NeilBrown
  0 siblings, 0 replies; 27+ messages in thread
From: NeilBrown @ 2017-02-17  2:15 UTC (permalink / raw)
  To: Jack Wang; +Cc: Shaohua Li, linux-raid, Christoph Hellwig

[-- Attachment #1: Type: text/plain, Size: 3404 bytes --]

On Thu, Feb 16 2017, Jack Wang wrote:

> 2017-02-16 5:39 GMT+01:00 NeilBrown <neilb@suse.com>:
>> raid10 currently repurposes bi_phys_segments on each
>> incoming bio to count how many r10bio was used to encode the
>> request.
>>
>> We need to know when the number of attached r10bio reaches
>> zero to:
>> 1/ call bio_endio() when all IO on the bio is finished
>> 2/ decrement ->nr_pending so that resync IO can proceed.
>>
>> Now that the bio has its own __bi_remaining counter, that
>> can be used instead. We can call bio_inc_remaining to
>> increment the counter and call bio_endio() every time an
>> r10bio completes, rather than only when bi_phys_segments
>> reaches zero.
>>
>> This addresses point 1, but not point 2.  bio_endio()
>> doesn't (and cannot) report when the last r10bio has
>> finished, so a different approach is needed.
>>
>> So: instead of counting bios in ->nr_pending, count r10bios.
>> i.e. every time we attach a bio, increment nr_pending.
>> Every time an r10bio completes, decrement nr_pending.
>>
>> Normally we only increment nr_pending after first checking
>> that ->barrier is zero, or some other non-trivial tests and
>> possible waiting.  When attaching multiple r10bios to a bio,
>> we only need the tests and the waiting once.  After the
>> first increment, subsequent increments can happen
>> unconditionally as they are really all part of the one
>> request.
>>
>> So introduce inc_pending() which can be used when we know
>> that nr_pending is already elevated.
>>
>> Signed-off-by: NeilBrown <neilb@suse.com>
>> ---
>>  drivers/md/raid10.c |   76 +++++++++++++++++----------------------------------
>>  1 file changed, 25 insertions(+), 51 deletions(-)
>>
>> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
>> index 9258cbe233bb..6b4d8643c574 100644
>> --- a/drivers/md/raid10.c
>> +++ b/drivers/md/raid10.c
>> @@ -301,27 +301,18 @@ static void reschedule_retry(struct r10bio *r10_bio)
>>  static void raid_end_bio_io(struct r10bio *r10_bio)
>>  {
>>         struct bio *bio = r10_bio->master_bio;
>> -       int done;
>>         struct r10conf *conf = r10_bio->mddev->private;
>>
>> -       if (bio->bi_phys_segments) {
>> -               unsigned long flags;
>> -               spin_lock_irqsave(&conf->device_lock, flags);
>> -               bio->bi_phys_segments--;
>> -               done = (bio->bi_phys_segments == 0);
>> -               spin_unlock_irqrestore(&conf->device_lock, flags);
>> -       } else
>> -               done = 1;
>>         if (!test_bit(R10BIO_Uptodate, &r10_bio->state))
>>                 bio->bi_error = -EIO;
>> -       if (done) {
>> -               bio_endio(bio);
>> -               /*
>> -                * Wake up any possible resync thread that waits for the device
>> -                * to go idle.
>> -                */
>> -               allow_barrier(conf);
>> -       }
>> +
>> +       /*
>> +        * Wake up any possible resync thread that waits for the device
>> +        * to go idle.
>> +        */
>> +       allow_barrier(conf);
>> +       bio_endio(bio);
>> +
>
> Hi Neil,
>
> Why do you switch the order of the above 2 lines?  Is there a reason
> behind it?  I notice in raid1 you kept the order.

I cannot think why I would have done that.  It doesn't matter what
order they are in, but it does make sense to leave the order unchanged
unless there is a good reason.  I'll put it back.

Thanks,
NeilBrown


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [md PATCH 14/14] MD: use per-cpu counter for writes_pending
  2017-02-16 20:12   ` Shaohua Li
@ 2017-02-17  2:34     ` NeilBrown
  0 siblings, 0 replies; 27+ messages in thread
From: NeilBrown @ 2017-02-17  2:34 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, hch

[-- Attachment #1: Type: text/plain, Size: 4034 bytes --]

On Thu, Feb 16 2017, Shaohua Li wrote:

> On Thu, Feb 16, 2017 at 03:39:03PM +1100, Neil Brown wrote:
>> The 'writes_pending' counter is used to determine when the
>> array is stable so that it can be marked in the superblock
>> as "Clean".  Consequently it needs to be updated frequently
>> but only checked for zero occasionally.  Recent changes to
>> raid5 cause the count to be updated even more often - once
>> per 4K rather than once per bio.  This provided
>> justification for making the updates more efficient.
>> 
>> So we replace the atomic counter with a per-cpu array of
>> 'long' counters. Incrementing and decrementing is normally
>> much cheaper, testing for zero is more expensive.
>> 
>> To meaningfully be able to test for zero we need to be able
>> to block further updates.  This is done by forcing the
>> "increment" step to take a spinlock in the rare case that
>> another thread is checking if the count is zero.  This is
>> done using a new field: "checkers".  "checkers" is the
>> number of threads that are currently checking whether the
>> count is zero.  It is usually 0, occasionally 1, and it is
>> not impossible that it could be higher, though this would be
>> rare.
>> 
>> If, within an rcu_read_locked section, checkers is seen to
>> be zero, then the local-cpu counter can be incremented
>> freely.  If checkers is not zero, mddev->lock must be taken
>> before the increment is allowed.  A decrement is always
>> allowed.
>> 
>> To test for zero, a thread must increment "checkers", call
>> synchronize_rcu(), then take mddev->lock.  Once this is done
>> no new increments can happen.  A thread may choose to
>> perform a quick test-for-zero by summing all the counters
>> without holding a lock.  If this is non-zero, then the total
>> count is non-zero, or was non-zero very recently, so it is
>> safe to assume that it isn't zero.  If the quick check does
>> report a zero sum, then it is worth performing the locking
>> protocol.
>> 
>> When the counter is decremented, it is no longer possible to
>> immediately test if the result is zero
>> (atomic_dec_and_test()).  We don't even really want to
>> perform the "quick" tests as that sums over all cpus and is
>> work that will most often bring no benefit.
>> 
>> In the "safemode==2" case, when we want to mark the array as
>> "clean" immediately when there are no writes, we perform the
>> quick test anyway, and possibly wake the md thread to do the
>> full test.  "safemode==2" is only used during shutdown so
>> the cost is not problematic.
>> 
>> When safemode!=2 we always set the timer, rather than only
>> when the counter reaches zero.
>> 
>> If mod_timer() is called to set the timeout to the value it
>> already has, mod_timer() has low overhead with no atomic
>> operations.  So at worst it will have a noticeable cost once
>> per jiffie.  To further reduce the overhead, we round the
>> requested delay to a multiple of ->safemode_delay.  This
>> might increase the delay until the timer fires a little, but
>> will reduce the overhead of calling mod_timer()
>> significantly.  If lots of requests are completing, the
>> timer will be updated every 200 milliseconds (by default)
>> and never fire.  When it does eventually fire, it will
>> schedule the md thread to perform the full test for
>> writes_pending==0, and this is quite likely to find '0'.
>> 
>> Signed-off-by: NeilBrown <neilb@suse.com>
>
> This sounds like a good place to use percpu-refcount. In set_in_sync, we switch
> it to atomic, read it, then switch it back to percpu.

I did have a quick look at percpu-refcount and it didn't seem suitable.
There is no interface to set the counter to atomic and wait for that to
complete.
There is also no interface to perform a lockless sum - one which is only
expected to report '0' if the count has been zero for a little while.

I might be able to make do without those, or possibly enhance
percpu-refcount.

I'll have a look - thanks.

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [md PATCH 10/14] md/raid1: stop using bi_phys_segment
  2017-02-16  4:39 ` [md PATCH 10/14] md/raid1: stop using bi_phys_segment NeilBrown
@ 2017-02-20 10:57   ` Ming Lei
  2017-02-21  0:05     ` NeilBrown
  0 siblings, 1 reply; 27+ messages in thread
From: Ming Lei @ 2017-02-20 10:57 UTC (permalink / raw)
  To: NeilBrown
  Cc: Shaohua Li, open list:SOFTWARE RAID (Multiple Disks) SUPPORT,
	Christoph Hellwig

On Thu, Feb 16, 2017 at 12:39 PM, NeilBrown <neilb@suse.com> wrote:
> Change to use bio->__bi_remaining to count number of r1bio attached
> to a bio.
> See the previous raid10 patch for more details.
>
> inc_pending is a little more complicated in raid1 as we need to adjust
> next_window_requests or current_window_requests.
>
> The wait_event() call when start_next_window changes is no longer needed as each
> r1_bio is completed separately.
>
> Signed-off-by: NeilBrown <neilb@suse.com>
> ---
>  drivers/md/raid1.c |   99 ++++++++++++++++++----------------------------------
>  1 file changed, 34 insertions(+), 65 deletions(-)
>
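For reference, the generic counting that this relies on: bio_endio() only
completes a bio once its __bi_remaining count drops to zero, so each
additional r1_bio attached to the master bio just takes one extra
"remaining" reference.  A minimal illustration (not md code):

#include <linux/bio.h>

/* One more completion of 'master' is now required before it really ends. */
static void attach_extra_child(struct bio *master)
{
	bio_inc_remaining(master);
}

/* Drops one "remaining" count; only the final call actually ends the bio. */
static void one_child_done(struct bio *master)
{
	bio_endio(master);
}
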
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index c1d0675880fb..5a57111c7bc9 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -241,36 +241,19 @@ static void reschedule_retry(struct r1bio *r1_bio)
>  static void call_bio_endio(struct r1bio *r1_bio)
>  {
>         struct bio *bio = r1_bio->master_bio;
> -       int done;
>         struct r1conf *conf = r1_bio->mddev->private;
>         sector_t start_next_window = r1_bio->start_next_window;
>         sector_t bi_sector = bio->bi_iter.bi_sector;
>
> -       if (bio->bi_phys_segments) {
> -               unsigned long flags;
> -               spin_lock_irqsave(&conf->device_lock, flags);
> -               bio->bi_phys_segments--;
> -               done = (bio->bi_phys_segments == 0);
> -               spin_unlock_irqrestore(&conf->device_lock, flags);
> -               /*
> -                * make_request() might be waiting for
> -                * bi_phys_segments to decrease
> -                */
> -               wake_up(&conf->wait_barrier);
> -       } else
> -               done = 1;
> -
>         if (!test_bit(R1BIO_Uptodate, &r1_bio->state))
>                 bio->bi_error = -EIO;
>
> -       if (done) {
> -               bio_endio(bio);
> -               /*
> -                * Wake up any possible resync thread that waits for the device
> -                * to go idle.
> -                */
> -               allow_barrier(conf, start_next_window, bi_sector);
> -       }
> +       bio_endio(bio);
> +       /*
> +        * Wake up any possible resync thread that waits for the device
> +        * to go idle.
> +        */
> +       allow_barrier(conf, start_next_window, bi_sector);
>  }
>
>  static void raid_end_bio_io(struct r1bio *r1_bio)
> @@ -923,6 +906,26 @@ static sector_t wait_barrier(struct r1conf *conf, struct bio *bio)
>         return sector;
>  }
>
> +static void inc_pending(struct r1conf *conf, sector_t start_next_window,
> +                       sector_t bi_sector)
> +{
> +       /* The current request requires multiple r1_bio, so
> +        * we need to increment the pending count, and the corresponding
> +        * window count.
> +        */
> +       spin_lock(&conf->resync_lock);
> +       conf->nr_pending++;

Just being curious: in the current code 'nr_pending' is increased in wait_barrier(),
and it looks like this patch introduces inc_pending() to do that on each r10_bio, but
I don't see any change in wait_barrier(), so does that mean there might be an issue in
how the current implementation operates on this counter?

> +       if (start_next_window == conf->start_next_window) {
> +               if (conf->start_next_window + NEXT_NORMALIO_DISTANCE
> +                   <= bi_sector)
> +                       conf->next_window_requests++;
> +               else
> +                       conf->current_window_requests++;
> +       } else
> +               conf->current_window_requests++;
> +       spin_unlock(&conf->resync_lock);
> +}
> +
>  static void allow_barrier(struct r1conf *conf, sector_t start_next_window,
>                           sector_t bi_sector)
>  {
> @@ -1137,12 +1140,7 @@ static void raid1_read_request(struct mddev *mddev, struct bio *bio,
>                 sectors_handled = (r1_bio->sector + max_sectors
>                                    - bio->bi_iter.bi_sector);
>                 r1_bio->sectors = max_sectors;
> -               spin_lock_irq(&conf->device_lock);
> -               if (bio->bi_phys_segments == 0)
> -                       bio->bi_phys_segments = 2;
> -               else
> -                       bio->bi_phys_segments++;
> -               spin_unlock_irq(&conf->device_lock);
> +               bio_inc_remaining(bio);
>
>                 /*
>                  * Cannot call generic_make_request directly as that will be
> @@ -1304,7 +1302,6 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
>         if (unlikely(blocked_rdev)) {
>                 /* Wait for this device to become unblocked */
>                 int j;
> -               sector_t old = start_next_window;
>
>                 for (j = 0; j < i; j++)
>                         if (r1_bio->bios[j])
> @@ -1314,15 +1311,6 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
>                 raid1_log(mddev, "wait rdev %d blocked", blocked_rdev->raid_disk);
>                 md_wait_for_blocked_rdev(blocked_rdev, mddev);
>                 start_next_window = wait_barrier(conf, bio);
> -               /*
> -                * We must make sure the multi r1bios of bio have
> -                * the same value of bi_phys_segments
> -                */
> -               if (bio->bi_phys_segments && old &&
> -                   old != start_next_window)
> -                       /* Wait for the former r1bio(s) to complete */
> -                       wait_event(conf->wait_barrier,
> -                                  bio->bi_phys_segments == 1);
>                 goto retry_write;
>         }
>
> @@ -1417,22 +1405,18 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
>          * as it could result in the bio being freed.
>          */
>         if (sectors_handled < bio_sectors(bio)) {
> -               /* We need another r1_bio, which must be accounted
> -                * in bio->bi_phys_segments
> -                */
> -               spin_lock_irq(&conf->device_lock);
> -               if (bio->bi_phys_segments == 0)
> -                       bio->bi_phys_segments = 2;
> -               else
> -                       bio->bi_phys_segments++;
> -               spin_unlock_irq(&conf->device_lock);
> +               /* We need another r1_bio, which must be counted */
> +               sector_t sect = bio->bi_iter.bi_sector + sectors_handled;
> +
> +               inc_pending(conf, start_next_window, sect);
> +               bio_inc_remaining(bio);
>                 r1_bio_write_done(r1_bio);
>                 r1_bio = mempool_alloc(conf->r1bio_pool, GFP_NOIO);
>                 r1_bio->master_bio = bio;
>                 r1_bio->sectors = bio_sectors(bio) - sectors_handled;
>                 r1_bio->state = 0;
>                 r1_bio->mddev = mddev;
> -               r1_bio->sector = bio->bi_iter.bi_sector + sectors_handled;
> +               r1_bio->sector = sect;
>                 goto retry_write;
>         }
>
> @@ -1460,16 +1444,6 @@ static void raid1_make_request(struct mddev *mddev, struct bio *bio)
>         r1_bio->mddev = mddev;
>         r1_bio->sector = bio->bi_iter.bi_sector;
>
> -       /*
> -        * We might need to issue multiple reads to different devices if there
> -        * are bad blocks around, so we keep track of the number of reads in
> -        * bio->bi_phys_segments.  If this is 0, there is only one r1_bio and
> -        * no locking will be needed when requests complete.  If it is
> -        * non-zero, then it is the number of not-completed requests.
> -        */
> -       bio->bi_phys_segments = 0;
> -       bio_clear_flag(bio, BIO_SEG_VALID);
> -
>         if (bio_data_dir(bio) == READ)
>                 raid1_read_request(mddev, bio, r1_bio);
>         else
> @@ -2434,12 +2408,7 @@ static void handle_read_error(struct r1conf *conf, struct r1bio *r1_bio)
>                         int sectors_handled = (r1_bio->sector + max_sectors
>                                                - mbio->bi_iter.bi_sector);
>                         r1_bio->sectors = max_sectors;
> -                       spin_lock_irq(&conf->device_lock);
> -                       if (mbio->bi_phys_segments == 0)
> -                               mbio->bi_phys_segments = 2;
> -                       else
> -                               mbio->bi_phys_segments++;
> -                       spin_unlock_irq(&conf->device_lock);
> +                       bio_inc_remaining(mbio);
>                         trace_block_bio_remap(bdev_get_queue(bio->bi_bdev),
>                                               bio, bio_dev, bio_sector);
>                         generic_make_request(bio);
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Ming Lei

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [md PATCH 10/14] md/raid1: stop using bi_phys_segment
  2017-02-20 10:57   ` Ming Lei
@ 2017-02-21  0:05     ` NeilBrown
  2017-02-21  7:41       ` Ming Lei
  0 siblings, 1 reply; 27+ messages in thread
From: NeilBrown @ 2017-02-21  0:05 UTC (permalink / raw)
  To: Ming Lei
  Cc: Shaohua Li, open list:SOFTWARE RAID (Multiple Disks) SUPPORT,
	Christoph Hellwig

On Mon, Feb 20 2017, Ming Lei wrote:

> On Thu, Feb 16, 2017 at 12:39 PM, NeilBrown <neilb@suse.com> wrote:
>>
>> +static void inc_pending(struct r1conf *conf, sector_t start_next_window,
>> +                       sector_t bi_sector)
>> +{
>> +       /* The current request requires multiple r1_bio, so
>> +        * we need to increment the pending count, and the corresponding
>> +        * window count.
>> +        */
>> +       spin_lock(&conf->resync_lock);
>> +       conf->nr_pending++;
>
> Just be curious, in current code 'nr_pending' is increased in wait_barrier(),
> and looks this patch introduces inc_pending() to do that on each r10_bio, but
> not see any change in wait_barrier(), so that means there might be issue in
> current implementation about operating on this counter?

Did you read the more detailed description in the previous raid10.c
patch?
This patch follows the same logic as that patch.

NeilBrown

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [md PATCH 10/14] md/raid1: stop using bi_phys_segment
  2017-02-21  0:05     ` NeilBrown
@ 2017-02-21  7:41       ` Ming Lei
  2017-03-03  0:34         ` NeilBrown
  0 siblings, 1 reply; 27+ messages in thread
From: Ming Lei @ 2017-02-21  7:41 UTC (permalink / raw)
  To: NeilBrown
  Cc: Shaohua Li, open list:SOFTWARE RAID (Multiple Disks) SUPPORT,
	Christoph Hellwig

On Tue, Feb 21, 2017 at 8:05 AM, NeilBrown <neilb@suse.com> wrote:
> On Mon, Feb 20 2017, Ming Lei wrote:
>
>> On Thu, Feb 16, 2017 at 12:39 PM, NeilBrown <neilb@suse.com> wrote:
>>>
>>> +static void inc_pending(struct r1conf *conf, sector_t start_next_window,
>>> +                       sector_t bi_sector)
>>> +{
>>> +       /* The current request requires multiple r1_bio, so
>>> +        * we need to increment the pending count, and the corresponding
>>> +        * window count.
>>> +        */
>>> +       spin_lock(&conf->resync_lock);
>>> +       conf->nr_pending++;
>>
>> Just be curious, in current code 'nr_pending' is increased in wait_barrier(),
>> and looks this patch introduces inc_pending() to do that on each r10_bio, but
>> not see any change in wait_barrier(), so that means there might be issue in
>> current implementation about operating on this counter?
>
> Did you read the more detailed description in the previous raid10.c
> patch?
> This patch follows the same logic as that patch.

OK, I see the point now:

- for the 1st r1_bio, conf->nr_pending is increased in wait_barrier()
- for the others, conf->nr_pending is increased in inc_pending().

Also I have another question:

- before this patch, both window request counters
are increased only for WRITE I/O (see wait_barrier()), and decreased
for both READ and WRITE in the completion path (see allow_barrier())

- after this patch, except for the 1st r1_bio, the window request counters
are increased for both WRITE and READ I/O, and decreased
for both READ and WRITE too.

Could you explain a bit about this change?

Thanks,
Ming Lei

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [md PATCH 10/14] md/raid1: stop using bi_phys_segment
  2017-02-21  7:41       ` Ming Lei
@ 2017-03-03  0:34         ` NeilBrown
  0 siblings, 0 replies; 27+ messages in thread
From: NeilBrown @ 2017-03-03  0:34 UTC (permalink / raw)
  To: Ming Lei
  Cc: Shaohua Li, open list:SOFTWARE RAID (Multiple Disks) SUPPORT,
	Christoph Hellwig

On Tue, Feb 21 2017, Ming Lei wrote:

> On Tue, Feb 21, 2017 at 8:05 AM, NeilBrown <neilb@suse.com> wrote:
>> On Mon, Feb 20 2017, Ming Lei wrote:
>>
>>> On Thu, Feb 16, 2017 at 12:39 PM, NeilBrown <neilb@suse.com> wrote:
>>>>
>>>> +static void inc_pending(struct r1conf *conf, sector_t start_next_window,
>>>> +                       sector_t bi_sector)
>>>> +{
>>>> +       /* The current request requires multiple r1_bio, so
>>>> +        * we need to increment the pending count, and the corresponding
>>>> +        * window count.
>>>> +        */
>>>> +       spin_lock(&conf->resync_lock);
>>>> +       conf->nr_pending++;
>>>
>>> Just be curious, in current code 'nr_pending' is increased in wait_barrier(),
>>> and looks this patch introduces inc_pending() to do that on each r10_bio, but
>>> not see any change in wait_barrier(), so that means there might be issue in
>>> current implementation about operating on this counter?
>>
>> Did you read the more detailed description in the previous raid10.c
>> patch?
>> This patch follows the same logic as that patch.
>
> OK, I see the point now:
>
> - for the 1st r1_bio, conf->nr_pending is increased in wait_barrier()
> - for the others, conf->nr_pending is increased in inc_pending().
>
> Also I have another question:
>
> - before this patch, both number of requests in windows
> are increased only for WRITE I/O(see wait_barrier()), and decreased
> for both READ/WRITE in complete path(see allow_barrier())

For a READ request, ->start_next_window is zero, so allow_barrier()
doesn't decrease the window counters.
So they are only increased and decreased for WRITE requests.
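
The guard in question looks roughly like this (a simplified sketch of
allow_barrier(), with the window-rotation details omitted):

static void allow_barrier_sketch(struct r1conf *conf,
				 sector_t start_next_window,
				 sector_t bi_sector)
{
	unsigned long flags;

	spin_lock_irqsave(&conf->resync_lock, flags);
	conf->nr_pending--;
	if (start_next_window) {	/* zero for READ, so WRITE only */
		if (start_next_window == conf->start_next_window &&
		    conf->start_next_window + NEXT_NORMALIO_DISTANCE <= bi_sector)
			conf->next_window_requests--;
		else
			conf->current_window_requests--;
		/* ... window rotation when current_window_requests reaches 0 ... */
	}
	spin_unlock_irqrestore(&conf->resync_lock, flags);
	wake_up(&conf->wait_barrier);
}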

>
> - after this patch, except for the 1st r1_bio, number of requests in
> windows are increased for both WRITE/READ I/O, and decreased
> for both READ/WRITE too.

Why do you think READ requests now increase the number of requests in a
window?

NeilBrown

>
> Could you explain a bit about this change?
>
> Thanks,
> Ming Lei
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2017-03-03  0:34 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-02-16  4:39 [md PATCH 00/14] remove all abuse of bi_phys_segments NeilBrown
2017-02-16  4:39 ` [md PATCH 04/14] block: trace completion of all bios NeilBrown
2017-02-16  4:39 ` [md PATCH 03/14] md/raid5: call bio_endio() directly rather than queueing for later NeilBrown
2017-02-16  4:39 ` [md PATCH 01/14] md/raid5: use md_write_start to count stripes, not bios NeilBrown
2017-02-16 17:29   ` Shaohua Li
2017-02-17  2:04     ` NeilBrown
2017-02-16  4:39 ` [md PATCH 02/14] md/raid5: simplfy delaying of writes while metadata is updated NeilBrown
2017-02-16 17:37   ` Shaohua Li
2017-02-17  2:10     ` NeilBrown
2017-02-16  4:39 ` [md PATCH 11/14] md/raid5: don't test ->writes_pending in raid5_remove_disk NeilBrown
2017-02-16  4:39 ` [md PATCH 07/14] Revert "md/raid5: limit request size according to implementation limits" NeilBrown
2017-02-16  4:39 ` [md PATCH 10/14] md/raid1: stop using bi_phys_segment NeilBrown
2017-02-20 10:57   ` Ming Lei
2017-02-21  0:05     ` NeilBrown
2017-02-21  7:41       ` Ming Lei
2017-03-03  0:34         ` NeilBrown
2017-02-16  4:39 ` [md PATCH 08/14] md/raid1, raid10: move rXbio accounting closer to allocation NeilBrown
2017-02-16  4:39 ` [md PATCH 09/14] md/raid10: stop using bi_phys_segments NeilBrown
2017-02-16 14:26   ` Jack Wang
2017-02-17  2:15     ` NeilBrown
2017-02-16  4:39 ` [md PATCH 05/14] md/raid5: use bio_inc_remaining() instead of repurposing bi_phys_segments as a counter NeilBrown
2017-02-16  4:39 ` [md PATCH 06/14] md/raid5: remove over-loading of ->bi_phys_segments NeilBrown
2017-02-16  4:39 ` [md PATCH 12/14] md: factor out set_in_sync() NeilBrown
2017-02-16  4:39 ` [md PATCH 13/14] md: close a race with setting mddev->in_sync NeilBrown
2017-02-16  4:39 ` [md PATCH 14/14] MD: use per-cpu counter for writes_pending NeilBrown
2017-02-16 20:12   ` Shaohua Li
2017-02-17  2:34     ` NeilBrown

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.