* [md PATCH 0/5] Stop using bi_phys_segments as a counter
@ 2016-11-21  1:19 NeilBrown
  2016-11-21  1:19 ` [md PATCH 5/5] md/raid5: use bio_inc_remaining() instead of repurposing " NeilBrown
                   ` (8 more replies)
  0 siblings, 9 replies; 18+ messages in thread
From: NeilBrown @ 2016-11-21  1:19 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Konstantin Khlebnikov, Christoph Hellwig, linux-raid

There are two problems with using bi_phys_segments as a counter:
1/ we only use 16 bits, which limits bios to 256M
2/ it is poor form to reuse a field like this.  It interferes
   with other changes to bios.
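For reference, the overloading being removed can be modelled in plain
userspace C (this is a sketch, not kernel code; the helper names only
loosely mirror the raid5_*_bi_* helpers in drivers/md/raid5.c):

```c
#include <assert.h>
#include <stdatomic.h>

/* Rough model of how md/raid5 currently overloads the 32-bit
 * bi_phys_segments field: a biased count of active stripes in the
 * low 16 bits and a count of processed stripes in the high 16 bits.
 * With 4KiB stripe pages, 2^16 stripes cover only 256MiB, so a
 * larger bio overflows the low half. */

static atomic_uint segments;

static void set_stripes(unsigned cnt)   /* ~raid5_set_bi_stripes */
{
	atomic_store(&segments, cnt);
}

static void inc_active(void)            /* ~raid5_inc_bi_active_stripes */
{
	atomic_fetch_add(&segments, 1);
}

static unsigned dec_active(void)        /* ~raid5_dec_bi_active_stripes */
{
	/* atomic_fetch_sub returns the previous value */
	return (atomic_fetch_sub(&segments, 1) - 1) & 0xffff;
}

static unsigned processed(void)         /* ~raid5_bi_processed_stripes */
{
	return (atomic_load(&segments) >> 16) & 0xffff;
}
```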

We need to clean up a few things before we can switch to using the
counter which is now available inside a bio.

I have only tested this lightly.  More review and testing would be
appreciated.

NeilBrown


---

NeilBrown (5):
      md: optimize md_write_start() slightly.
      md/raid5: use md_write_start to count stripes, not bios
      md/raid5: simplify delaying of writes while metadata is updated.
      md/raid5: call bio_endio() directly rather than queuing for later.
      md/raid5: use bio_inc_remaining() instead of repurposing bi_phys_segments as a counter


 drivers/md/md.c    |   32 ++++++------
 drivers/md/raid5.c |  136 +++++++++++-----------------------------------------
 drivers/md/raid5.h |    4 --
 3 files changed, 44 insertions(+), 128 deletions(-)

--
Signature


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [md PATCH 1/5] md: optimize md_write_start() slightly.
  2016-11-21  1:19 [md PATCH 0/5] Stop using bi_phys_segments as a counter NeilBrown
                   ` (3 preceding siblings ...)
  2016-11-21  1:19 ` [md PATCH 2/5] md/raid5: use md_write_start to count stripes, not bios NeilBrown
@ 2016-11-21  1:19 ` NeilBrown
  2016-11-21  2:32 ` [md PATCH 6/5] md/raid5: remove over-loading of ->bi_phys_segments NeilBrown
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 18+ messages in thread
From: NeilBrown @ 2016-11-21  1:19 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Konstantin Khlebnikov, Christoph Hellwig, linux-raid

If md_write_start() finds that ->writes_pending is non-zero, it
should be able to avoid most of the other checks.

To ensure that a non-zero ->writes_pending really means that the other
checks have completed, move the increment down until after ->in_sync is
known to be clear.  To avoid races with places like array_state_store(),
which can possibly set ->in_sync, we need to increment ->writes_pending
inside the locked region.  As ->writes_pending is now incremented *after*
->in_sync is tested, we must take the spin_lock, but only when
->writes_pending was found to be zero.

If ->writes_pending is found to be non-zero, we still need to wait
while MD_CHANGE_PENDING is set.

In the common case, md_write_start() will now only
 - check if data_dir is WRITE
 - increment ->writes_pending
 - check that MD_CHANGE_PENDING is cleared.
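The fast path above hinges on atomic_inc_not_zero() semantics, which
can be sketched in userspace C11 as a compare-exchange loop (a model
of the kernel primitive, not the kernel implementation):

```c
#include <assert.h>
#include <stdatomic.h>

/* Increment the counter only if it is already non-zero.  A writer
 * that sees a non-zero ->writes_pending therefore never has to take
 * mddev->lock; only the first writer falls through to the slow path. */
static int inc_not_zero(atomic_int *v)
{
	int old = atomic_load(v);

	while (old != 0) {
		/* on failure, 'old' is reloaded with the current value */
		if (atomic_compare_exchange_weak(v, &old, old + 1))
			return 1;        /* fast path: increment done */
	}
	return 0;                        /* was zero: take the slow path */
}
```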

Signed-off-by: NeilBrown <neilb@suse.com>
---
 drivers/md/md.c |   32 ++++++++++++++++----------------
 1 file changed, 16 insertions(+), 16 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 1f1c7f007b68..2f21f6c7156f 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -7686,20 +7686,18 @@ void md_write_start(struct mddev *mddev, struct bio *bi)
 	int did_change = 0;
 	if (bio_data_dir(bi) != WRITE)
 		return;
-
-	BUG_ON(mddev->ro == 1);
-	if (mddev->ro == 2) {
-		/* need to switch to read/write */
-		mddev->ro = 0;
-		set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
-		md_wakeup_thread(mddev->thread);
-		md_wakeup_thread(mddev->sync_thread);
-		did_change = 1;
-	}
-	atomic_inc(&mddev->writes_pending);
-	if (mddev->safemode == 1)
-		mddev->safemode = 0;
-	if (mddev->in_sync) {
+	if (!atomic_inc_not_zero(&mddev->writes_pending)) {
+		BUG_ON(mddev->ro == 1);
+		if (mddev->ro == 2) {
+			/* need to switch to read/write */
+			mddev->ro = 0;
+			set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
+			md_wakeup_thread(mddev->thread);
+			md_wakeup_thread(mddev->sync_thread);
+			did_change = 1;
+		}
+		if (mddev->safemode == 1)
+			mddev->safemode = 0;
 		spin_lock(&mddev->lock);
 		if (mddev->in_sync) {
 			mddev->in_sync = 0;
@@ -7708,10 +7706,12 @@ void md_write_start(struct mddev *mddev, struct bio *bi)
 			md_wakeup_thread(mddev->thread);
 			did_change = 1;
 		}
+		atomic_inc(&mddev->writes_pending);
 		spin_unlock(&mddev->lock);
+
+		if (did_change)
+			sysfs_notify_dirent_safe(mddev->sysfs_state);
 	}
-	if (did_change)
-		sysfs_notify_dirent_safe(mddev->sysfs_state);
 	wait_event(mddev->sb_wait,
 		   !test_bit(MD_CHANGE_PENDING, &mddev->flags));
 }



^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [md PATCH 2/5] md/raid5: use md_write_start to count stripes, not bios
  2016-11-21  1:19 [md PATCH 0/5] Stop using bi_phys_segments as a counter NeilBrown
                   ` (2 preceding siblings ...)
  2016-11-21  1:19 ` [md PATCH 4/5] md/raid5: call bio_endio() directly rather than queuing for later NeilBrown
@ 2016-11-21  1:19 ` NeilBrown
  2016-11-21  1:19 ` [md PATCH 1/5] md: optimize md_write_start() slightly NeilBrown
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 18+ messages in thread
From: NeilBrown @ 2016-11-21  1:19 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Konstantin Khlebnikov, Christoph Hellwig, linux-raid

We use md_write_start() to increase the count of pending writes, and
md_write_end() to decrement the count.  We currently count bios
submitted to md/raid5.  Change it to count the stripe_heads that a
WRITE bio has been attached to.

So raid5_make_request() calls md_write_start() and then md_write_end(),
add_stripe_bio() calls md_write_start() for each stripe_head, and the
completion routines always call md_write_end(), instead of calling it
only when raid5_dec_bi_active_stripes() returns 0.

This reduces our dependence on keeping a per-bio count of active
stripes in bi_phys_segments.
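The new accounting can be modelled as a plain reference count (a
userspace sketch; submit_write() is a hypothetical stand-in for the
raid5_make_request()/add_stripe_bio() pairing, not a real function):

```c
#include <assert.h>
#include <stdatomic.h>

/* md_write_start() is modelled as an increment and md_write_end()
 * as a decrement of ->writes_pending.  The submitter holds one
 * reference for the duration of the submission, and each stripe_head
 * the WRITE bio is attached to holds another, so every start is
 * paired with exactly one end, independent of any per-bio count. */
static atomic_int writes_pending;

static void write_start(void) { atomic_fetch_add(&writes_pending, 1); }
static void write_end(void)   { atomic_fetch_sub(&writes_pending, 1); }

/* Hypothetical submission attaching one bio to 'nstripes' stripes. */
static int submit_write(int nstripes)
{
	int i;

	write_start();                  /* raid5_make_request() */
	for (i = 0; i < nstripes; i++)
		write_start();          /* add_stripe_bio() per stripe */
	write_end();                    /* end of raid5_make_request() */
	return nstripes;                /* stripes still to complete */
}
```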

Signed-off-by: NeilBrown <neilb@suse.com>
---
 drivers/md/raid5.c |   25 +++++++++++--------------
 1 file changed, 11 insertions(+), 14 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index df88656d8798..d07d2dce6856 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -2998,6 +2998,7 @@ static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx,
 		bi->bi_next = *bip;
 	*bip = bi;
 	raid5_inc_bi_active_stripes(bi);
+	md_write_start(conf->mddev, bi);
 
 	if (forwrite) {
 		/* check if page is covered */
@@ -3121,10 +3122,9 @@ handle_failed_stripe(struct r5conf *conf, struct stripe_head *sh,
 			struct bio *nextbi = r5_next_bio(bi, sh->dev[i].sector);
 
 			bi->bi_error = -EIO;
-			if (!raid5_dec_bi_active_stripes(bi)) {
-				md_write_end(conf->mddev);
+			md_write_end(conf->mddev);
+			if (!raid5_dec_bi_active_stripes(bi))
 				bio_list_add(return_bi, bi);
-			}
 			bi = nextbi;
 		}
 		if (bitmap_end)
@@ -3145,10 +3145,9 @@ handle_failed_stripe(struct r5conf *conf, struct stripe_head *sh,
 			struct bio *bi2 = r5_next_bio(bi, sh->dev[i].sector);
 
 			bi->bi_error = -EIO;
-			if (!raid5_dec_bi_active_stripes(bi)) {
-				md_write_end(conf->mddev);
+			md_write_end(conf->mddev);
+			if (!raid5_dec_bi_active_stripes(bi))
 				bio_list_add(return_bi, bi);
-			}
 			bi = bi2;
 		}
 
@@ -3490,10 +3489,9 @@ static void handle_stripe_clean_event(struct r5conf *conf,
 				while (wbi && wbi->bi_iter.bi_sector <
 					dev->sector + STRIPE_SECTORS) {
 					wbi2 = r5_next_bio(wbi, dev->sector);
-					if (!raid5_dec_bi_active_stripes(wbi)) {
-						md_write_end(conf->mddev);
+					md_write_end(conf->mddev);
+					if (!raid5_dec_bi_active_stripes(wbi))
 						bio_list_add(return_bi, wbi);
-					}
 					wbi = wbi2;
 				}
 				bitmap_endwrite(conf->mddev->bitmap, sh->sector,
@@ -5142,9 +5140,9 @@ static void make_discard_request(struct mddev *mddev, struct bio *bi)
 		release_stripe_plug(mddev, sh);
 	}
 
+	md_write_end(mddev);
 	remaining = raid5_dec_bi_active_stripes(bi);
 	if (remaining == 0) {
-		md_write_end(mddev);
 		bio_endio(bi);
 	}
 }
@@ -5173,8 +5171,6 @@ static void raid5_make_request(struct mddev *mddev, struct bio * bi)
 		/* ret == -EAGAIN, fallback */
 	}
 
-	md_write_start(mddev, bi);
-
 	/*
 	 * If array is degraded, better not do chunk aligned read because
 	 * later we might have to read it again in order to reconstruct
@@ -5196,6 +5192,7 @@ static void raid5_make_request(struct mddev *mddev, struct bio * bi)
 	last_sector = bio_end_sector(bi);
 	bi->bi_next = NULL;
 	bi->bi_phys_segments = 1;	/* over-loaded to count active stripes */
+	md_write_start(mddev, bi);
 
 	prepare_to_wait(&conf->wait_for_overlap, &w, TASK_UNINTERRUPTIBLE);
 	for (;logical_sector < last_sector; logical_sector += STRIPE_SECTORS) {
@@ -5324,11 +5321,11 @@ static void raid5_make_request(struct mddev *mddev, struct bio * bi)
 	}
 	finish_wait(&conf->wait_for_overlap, &w);
 
+	if (rw == WRITE)
+		md_write_end(mddev);
 	remaining = raid5_dec_bi_active_stripes(bi);
 	if (remaining == 0) {
 
-		if ( rw == WRITE )
-			md_write_end(mddev);
 
 		trace_block_bio_complete(bdev_get_queue(bi->bi_bdev),
 					 bi, 0);



^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [md PATCH 3/5] md/raid5: simplify delaying of writes while metadata is updated.
  2016-11-21  1:19 [md PATCH 0/5] Stop using bi_phys_segments as a counter NeilBrown
  2016-11-21  1:19 ` [md PATCH 5/5] md/raid5: use bio_inc_remaining() instead of repurposing " NeilBrown
@ 2016-11-21  1:19 ` NeilBrown
  2016-11-21  1:19 ` [md PATCH 4/5] md/raid5: call bio_endio() directly rather than queuing for later NeilBrown
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 18+ messages in thread
From: NeilBrown @ 2016-11-21  1:19 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Konstantin Khlebnikov, Christoph Hellwig, linux-raid

If a device fails during a write, we must ensure the failure is
recorded in the metadata before the completion of the write is
acknowledged.

Commit: c3cce6cda162 ("md/raid5: ensure device failure recorded before write request returns.")
added code for this, but it was unnecessarily complicated.  We already
have a similar mechanism for handling updates to the bad-block list, thanks to
Commit: de393cdea66c ("md: make it easier to wait for bad blocks to be acknowledged.")

So revert most of the former commit, and instead avoid collecting
completed writes if MD_CHANGE_PENDING is set.  raid5d will then flush
the metadata and retry the stripe_head.

We check MD_CHANGE_PENDING *after* analyse_stripe() as it could be set
asynchronously.  After analyse_stripe(), we have collected stable data
about the state of the devices, which will be used to make decisions.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 drivers/md/raid5.c |   27 ++++-----------------------
 drivers/md/raid5.h |    3 ---
 2 files changed, 4 insertions(+), 26 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index d07d2dce6856..e53b8f499a4c 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -4344,7 +4344,8 @@ static void handle_stripe(struct stripe_head *sh)
 	if (test_bit(STRIPE_LOG_TRAPPED, &sh->state))
 		goto finish;
 
-	if (s.handle_bad_blocks) {
+	if (s.handle_bad_blocks ||
+	    test_bit(MD_CHANGE_PENDING, &conf->mddev->flags)) {
 		set_bit(STRIPE_HANDLE, &sh->state);
 		goto finish;
 	}
@@ -4632,15 +4633,8 @@ static void handle_stripe(struct stripe_head *sh)
 			md_wakeup_thread(conf->mddev->thread);
 	}
 
-	if (!bio_list_empty(&s.return_bi)) {
-		if (test_bit(MD_CHANGE_PENDING, &conf->mddev->flags)) {
-			spin_lock_irq(&conf->device_lock);
-			bio_list_merge(&conf->return_bi, &s.return_bi);
-			spin_unlock_irq(&conf->device_lock);
-			md_wakeup_thread(conf->mddev->thread);
-		} else
-			return_io(&s.return_bi);
-	}
+	if (!bio_list_empty(&s.return_bi))
+		return_io(&s.return_bi);
 
 	clear_bit_unlock(STRIPE_ACTIVE, &sh->state);
 }
@@ -5846,18 +5840,6 @@ static void raid5d(struct md_thread *thread)
 
 	md_check_recovery(mddev);
 
-	if (!bio_list_empty(&conf->return_bi) &&
-	    !test_bit(MD_CHANGE_PENDING, &mddev->flags)) {
-		struct bio_list tmp = BIO_EMPTY_LIST;
-		spin_lock_irq(&conf->device_lock);
-		if (!test_bit(MD_CHANGE_PENDING, &mddev->flags)) {
-			bio_list_merge(&tmp, &conf->return_bi);
-			bio_list_init(&conf->return_bi);
-		}
-		spin_unlock_irq(&conf->device_lock);
-		return_io(&tmp);
-	}
-
 	blk_start_plug(&plug);
 	handled = 0;
 	spin_lock_irq(&conf->device_lock);
@@ -6490,7 +6472,6 @@ static struct r5conf *setup_conf(struct mddev *mddev)
 	INIT_LIST_HEAD(&conf->hold_list);
 	INIT_LIST_HEAD(&conf->delayed_list);
 	INIT_LIST_HEAD(&conf->bitmap_list);
-	bio_list_init(&conf->return_bi);
 	init_llist_head(&conf->released_stripes);
 	atomic_set(&conf->active_stripes, 0);
 	atomic_set(&conf->preread_active_stripes, 0);
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index 57ec49f0839e..f654f8207a44 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -482,9 +482,6 @@ struct r5conf {
 	int			skip_copy; /* Don't copy data from bio to stripe cache */
 	struct list_head	*last_hold; /* detect hold_list promotions */
 
-	/* bios to have bi_end_io called after metadata is synced */
-	struct bio_list		return_bi;
-
 	atomic_t		reshape_stripes; /* stripes with pending writes for reshape */
 	/* unfortunately we need two cache names as we temporarily have
 	 * two caches.



^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [md PATCH 4/5] md/raid5: call bio_endio() directly rather than queuing for later.
  2016-11-21  1:19 [md PATCH 0/5] Stop using bi_phys_segments as a counter NeilBrown
  2016-11-21  1:19 ` [md PATCH 5/5] md/raid5: use bio_inc_remaining() instead of repurposing " NeilBrown
  2016-11-21  1:19 ` [md PATCH 3/5] md/raid5: simplify delaying of writes while metadata is updated NeilBrown
@ 2016-11-21  1:19 ` NeilBrown
  2016-11-21  1:19 ` [md PATCH 2/5] md/raid5: use md_write_start to count stripes, not bios NeilBrown
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 18+ messages in thread
From: NeilBrown @ 2016-11-21  1:19 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Konstantin Khlebnikov, Christoph Hellwig, linux-raid

We gather bios that need to be returned into a bio_list
and call bio_endio() on them all together.
The original reason for this was to avoid making the calls while
holding a spinlock.
Locking has changed a lot since then, and that reason is no longer
valid.

So discard return_io() and various return_bi lists, and just call
bio_endio() directly as needed.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 drivers/md/raid5.c |   36 +++++++++---------------------------
 drivers/md/raid5.h |    1 -
 2 files changed, 9 insertions(+), 28 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index e53b8f499a4c..6f3154c80fbf 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -223,17 +223,6 @@ static int raid6_idx_to_slot(int idx, struct stripe_head *sh,
 	return slot;
 }
 
-static void return_io(struct bio_list *return_bi)
-{
-	struct bio *bi;
-	while ((bi = bio_list_pop(return_bi)) != NULL) {
-		bi->bi_iter.bi_size = 0;
-		trace_block_bio_complete(bdev_get_queue(bi->bi_bdev),
-					 bi, 0);
-		bio_endio(bi);
-	}
-}
-
 static void print_raid5_conf (struct r5conf *conf);
 
 static int stripe_operations_active(struct stripe_head *sh)
@@ -1178,7 +1167,6 @@ async_copy_data(int frombio, struct bio *bio, struct page **page,
 static void ops_complete_biofill(void *stripe_head_ref)
 {
 	struct stripe_head *sh = stripe_head_ref;
-	struct bio_list return_bi = BIO_EMPTY_LIST;
 	int i;
 
 	pr_debug("%s: stripe %llu\n", __func__,
@@ -1203,15 +1191,13 @@ static void ops_complete_biofill(void *stripe_head_ref)
 				dev->sector + STRIPE_SECTORS) {
 				rbi2 = r5_next_bio(rbi, dev->sector);
 				if (!raid5_dec_bi_active_stripes(rbi))
-					bio_list_add(&return_bi, rbi);
+					bio_endio(rbi);
 				rbi = rbi2;
 			}
 		}
 	}
 	clear_bit(STRIPE_BIOFILL_RUN, &sh->state);
 
-	return_io(&return_bi);
-
 	set_bit(STRIPE_HANDLE, &sh->state);
 	raid5_release_stripe(sh);
 }
@@ -3075,8 +3061,7 @@ static void stripe_set_idx(sector_t stripe, struct r5conf *conf, int previous,
 
 static void
 handle_failed_stripe(struct r5conf *conf, struct stripe_head *sh,
-				struct stripe_head_state *s, int disks,
-				struct bio_list *return_bi)
+		     struct stripe_head_state *s, int disks)
 {
 	int i;
 	BUG_ON(sh->batch_head);
@@ -3124,7 +3109,7 @@ handle_failed_stripe(struct r5conf *conf, struct stripe_head *sh,
 			bi->bi_error = -EIO;
 			md_write_end(conf->mddev);
 			if (!raid5_dec_bi_active_stripes(bi))
-				bio_list_add(return_bi, bi);
+				bio_endio(bi);
 			bi = nextbi;
 		}
 		if (bitmap_end)
@@ -3147,7 +3132,7 @@ handle_failed_stripe(struct r5conf *conf, struct stripe_head *sh,
 			bi->bi_error = -EIO;
 			md_write_end(conf->mddev);
 			if (!raid5_dec_bi_active_stripes(bi))
-				bio_list_add(return_bi, bi);
+				bio_endio(bi);
 			bi = bi2;
 		}
 
@@ -3173,7 +3158,7 @@ handle_failed_stripe(struct r5conf *conf, struct stripe_head *sh,
 
 				bi->bi_error = -EIO;
 				if (!raid5_dec_bi_active_stripes(bi))
-					bio_list_add(return_bi, bi);
+					bio_endio(bi);
 				bi = nextbi;
 			}
 		}
@@ -3457,7 +3442,7 @@ static void break_stripe_batch_list(struct stripe_head *head_sh,
  * never LOCKED, so we don't need to test 'failed' directly.
  */
 static void handle_stripe_clean_event(struct r5conf *conf,
-	struct stripe_head *sh, int disks, struct bio_list *return_bi)
+	struct stripe_head *sh, int disks)
 {
 	int i;
 	struct r5dev *dev;
@@ -3491,7 +3476,7 @@ static void handle_stripe_clean_event(struct r5conf *conf,
 					wbi2 = r5_next_bio(wbi, dev->sector);
 					md_write_end(conf->mddev);
 					if (!raid5_dec_bi_active_stripes(wbi))
-						bio_list_add(return_bi, wbi);
+						bio_endio(wbi);
 					wbi = wbi2;
 				}
 				bitmap_endwrite(conf->mddev->bitmap, sh->sector,
@@ -4378,7 +4363,7 @@ static void handle_stripe(struct stripe_head *sh)
 		sh->reconstruct_state = 0;
 		break_stripe_batch_list(sh, 0);
 		if (s.to_read+s.to_write+s.written)
-			handle_failed_stripe(conf, sh, &s, disks, &s.return_bi);
+			handle_failed_stripe(conf, sh, &s, disks);
 		if (s.syncing + s.replacing)
 			handle_failed_sync(conf, sh, &s);
 	}
@@ -4443,7 +4428,7 @@ static void handle_stripe(struct stripe_head *sh)
 			     && !test_bit(R5_LOCKED, &qdev->flags)
 			     && (test_bit(R5_UPTODATE, &qdev->flags) ||
 				 test_bit(R5_Discard, &qdev->flags))))))
-		handle_stripe_clean_event(conf, sh, disks, &s.return_bi);
+		handle_stripe_clean_event(conf, sh, disks);
 
 	/* Now we might consider reading some blocks, either to check/generate
 	 * parity, or to satisfy requests
@@ -4633,9 +4618,6 @@ static void handle_stripe(struct stripe_head *sh)
 			md_wakeup_thread(conf->mddev->thread);
 	}
 
-	if (!bio_list_empty(&s.return_bi))
-		return_io(&s.return_bi);
-
 	clear_bit_unlock(STRIPE_ACTIVE, &sh->state);
 }
 
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index f654f8207a44..799f84b26838 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -269,7 +269,6 @@ struct stripe_head_state {
 	int dec_preread_active;
 	unsigned long ops_request;
 
-	struct bio_list return_bi;
 	struct md_rdev *blocked_rdev;
 	int handle_bad_blocks;
 	int log_failed;



^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [md PATCH 5/5] md/raid5: use bio_inc_remaining() instead of repurposing bi_phys_segments as a counter
  2016-11-21  1:19 [md PATCH 0/5] Stop using bi_phys_segments as a counter NeilBrown
@ 2016-11-21  1:19 ` NeilBrown
  2016-11-21  1:19 ` [md PATCH 3/5] md/raid5: simplify delaying of writes while metadata is updated NeilBrown
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 18+ messages in thread
From: NeilBrown @ 2016-11-21  1:19 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Konstantin Khlebnikov, Christoph Hellwig, linux-raid

md/raid5 needs to keep track of how many stripe_heads are processing a
bio so that it can delay calling bio_endio() until all stripe_heads
have completed.  It currently uses 16 bits of ->bi_phys_segments for
this purpose.

16 bits is only enough for 256M requests, and it is possible for a
single bio to be larger than this, which causes problems.  Also, the
bio struct contains a larger counter, __bi_remaining, which has a
purpose very similar to the purpose of our counter.  So stop using
->bi_phys_segments, and instead use __bi_remaining.

This means we don't need to initialize the counter, as our caller
initializes it to '1'.  It also means we can call bio_endio() directly
as it tests this counter internally.
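The __bi_remaining protocol the patch switches to can be sketched in
userspace C (a model of the behaviour described above, not the block
layer's actual implementation; 'mini_bio' is a made-up name):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* The bio layer initializes the counter to 1; bio_inc_remaining()
 * adds one reference per stripe_head; bio_endio() drops one and only
 * completes the bio when the last reference is gone. */
struct mini_bio {
	atomic_int remaining;    /* models bio->__bi_remaining */
	bool completed;
};

static void mini_bio_init(struct mini_bio *b)
{
	atomic_store(&b->remaining, 1);   /* done by the bio layer */
	b->completed = false;
}

static void mini_bio_inc_remaining(struct mini_bio *b)
{
	atomic_fetch_add(&b->remaining, 1);
}

static void mini_bio_endio(struct mini_bio *b)
{
	if (atomic_fetch_sub(&b->remaining, 1) == 1)
		b->completed = true;      /* last reference dropped */
}
```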

Signed-off-by: NeilBrown <neilb@suse.com>
---
 drivers/md/raid5.c |   68 +++++++++++-----------------------------------------
 1 file changed, 14 insertions(+), 54 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 6f3154c80fbf..7169420bfde5 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -154,18 +154,6 @@ static inline int raid5_bi_processed_stripes(struct bio *bio)
 	return (atomic_read(segments) >> 16) & 0xffff;
 }
 
-static inline int raid5_dec_bi_active_stripes(struct bio *bio)
-{
-	atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;
-	return atomic_sub_return(1, segments) & 0xffff;
-}
-
-static inline void raid5_inc_bi_active_stripes(struct bio *bio)
-{
-	atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;
-	atomic_inc(segments);
-}
-
 static inline void raid5_set_bi_processed_stripes(struct bio *bio,
 	unsigned int cnt)
 {
@@ -178,12 +166,6 @@ static inline void raid5_set_bi_processed_stripes(struct bio *bio,
 	} while (atomic_cmpxchg(segments, old, new) != old);
 }
 
-static inline void raid5_set_bi_stripes(struct bio *bio, unsigned int cnt)
-{
-	atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;
-	atomic_set(segments, cnt);
-}
-
 /* Find first data disk in a raid6 stripe */
 static inline int raid6_d0(struct stripe_head *sh)
 {
@@ -1190,8 +1172,7 @@ static void ops_complete_biofill(void *stripe_head_ref)
 			while (rbi && rbi->bi_iter.bi_sector <
 				dev->sector + STRIPE_SECTORS) {
 				rbi2 = r5_next_bio(rbi, dev->sector);
-				if (!raid5_dec_bi_active_stripes(rbi))
-					bio_endio(rbi);
+				bio_endio(rbi);
 				rbi = rbi2;
 			}
 		}
@@ -2983,7 +2964,7 @@ static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx,
 	if (*bip)
 		bi->bi_next = *bip;
 	*bip = bi;
-	raid5_inc_bi_active_stripes(bi);
+	bio_inc_remaining(bi);
 	md_write_start(conf->mddev, bi);
 
 	if (forwrite) {
@@ -3108,8 +3089,7 @@ handle_failed_stripe(struct r5conf *conf, struct stripe_head *sh,
 
 			bi->bi_error = -EIO;
 			md_write_end(conf->mddev);
-			if (!raid5_dec_bi_active_stripes(bi))
-				bio_endio(bi);
+			bio_endio(bi);
 			bi = nextbi;
 		}
 		if (bitmap_end)
@@ -3131,8 +3111,7 @@ handle_failed_stripe(struct r5conf *conf, struct stripe_head *sh,
 
 			bi->bi_error = -EIO;
 			md_write_end(conf->mddev);
-			if (!raid5_dec_bi_active_stripes(bi))
-				bio_endio(bi);
+			bio_endio(bi);
 			bi = bi2;
 		}
 
@@ -3157,8 +3136,7 @@ handle_failed_stripe(struct r5conf *conf, struct stripe_head *sh,
 					r5_next_bio(bi, sh->dev[i].sector);
 
 				bi->bi_error = -EIO;
-				if (!raid5_dec_bi_active_stripes(bi))
-					bio_endio(bi);
+				bio_endio(bi);
 				bi = nextbi;
 			}
 		}
@@ -3475,8 +3453,7 @@ static void handle_stripe_clean_event(struct r5conf *conf,
 					dev->sector + STRIPE_SECTORS) {
 					wbi2 = r5_next_bio(wbi, dev->sector);
 					md_write_end(conf->mddev);
-					if (!raid5_dec_bi_active_stripes(wbi))
-						bio_endio(wbi);
+					bio_endio(wbi);
 					wbi = wbi2;
 				}
 				bitmap_endwrite(conf->mddev->bitmap, sh->sector,
@@ -4719,7 +4696,7 @@ static struct bio *remove_bio_from_retry(struct r5conf *conf)
 		 * this sets the active strip count to 1 and the processed
 		 * strip count to zero (upper 8 bits)
 		 */
-		raid5_set_bi_stripes(bi, 1); /* biased count of active stripes */
+		raid5_set_bi_processed_stripes(bi, 0);
 	}
 
 	return bi;
@@ -5036,7 +5013,6 @@ static void make_discard_request(struct mddev *mddev, struct bio *bi)
 	struct r5conf *conf = mddev->private;
 	sector_t logical_sector, last_sector;
 	struct stripe_head *sh;
-	int remaining;
 	int stripe_sectors;
 
 	if (mddev->reshape_position != MaxSector)
@@ -5047,7 +5023,7 @@ static void make_discard_request(struct mddev *mddev, struct bio *bi)
 	last_sector = bi->bi_iter.bi_sector + (bi->bi_iter.bi_size>>9);
 
 	bi->bi_next = NULL;
-	bi->bi_phys_segments = 1; /* over-loaded to count active stripes */
+	bi->bi_phys_segments = 0;
 
 	stripe_sectors = conf->chunk_sectors *
 		(conf->raid_disks - conf->max_degraded);
@@ -5093,7 +5069,7 @@ static void make_discard_request(struct mddev *mddev, struct bio *bi)
 				continue;
 			sh->dev[d].towrite = bi;
 			set_bit(R5_OVERWRITE, &sh->dev[d].flags);
-			raid5_inc_bi_active_stripes(bi);
+			bio_inc_remaining(bi);
 			sh->overwrite_disks++;
 		}
 		spin_unlock_irq(&sh->stripe_lock);
@@ -5117,10 +5093,7 @@ static void make_discard_request(struct mddev *mddev, struct bio *bi)
 	}
 
 	md_write_end(mddev);
-	remaining = raid5_dec_bi_active_stripes(bi);
-	if (remaining == 0) {
-		bio_endio(bi);
-	}
+	bio_endio(bi);
 }
 
 static void raid5_make_request(struct mddev *mddev, struct bio * bi)
@@ -5131,7 +5104,6 @@ static void raid5_make_request(struct mddev *mddev, struct bio * bi)
 	sector_t logical_sector, last_sector;
 	struct stripe_head *sh;
 	const int rw = bio_data_dir(bi);
-	int remaining;
 	DEFINE_WAIT(w);
 	bool do_prepare;
 
@@ -5167,7 +5139,7 @@ static void raid5_make_request(struct mddev *mddev, struct bio * bi)
 	logical_sector = bi->bi_iter.bi_sector & ~((sector_t)STRIPE_SECTORS-1);
 	last_sector = bio_end_sector(bi);
 	bi->bi_next = NULL;
-	bi->bi_phys_segments = 1;	/* over-loaded to count active stripes */
+	bi->bi_phys_segments = 0;
 	md_write_start(mddev, bi);
 
 	prepare_to_wait(&conf->wait_for_overlap, &w, TASK_UNINTERRUPTIBLE);
@@ -5299,14 +5271,7 @@ static void raid5_make_request(struct mddev *mddev, struct bio * bi)
 
 	if (rw == WRITE)
 		md_write_end(mddev);
-	remaining = raid5_dec_bi_active_stripes(bi);
-	if (remaining == 0) {
-
-
-		trace_block_bio_complete(bdev_get_queue(bi->bi_bdev),
-					 bi, 0);
-		bio_endio(bi);
-	}
+	bio_endio(bi);
 }
 
 static sector_t raid5_size(struct mddev *mddev, sector_t sectors, int raid_disks);
@@ -5671,7 +5636,6 @@ static int  retry_aligned_read(struct r5conf *conf, struct bio *raid_bio)
 	int dd_idx;
 	sector_t sector, logical_sector, last_sector;
 	int scnt = 0;
-	int remaining;
 	int handled = 0;
 
 	logical_sector = raid_bio->bi_iter.bi_sector &
@@ -5710,12 +5674,8 @@ static int  retry_aligned_read(struct r5conf *conf, struct bio *raid_bio)
 		raid5_release_stripe(sh);
 		handled++;
 	}
-	remaining = raid5_dec_bi_active_stripes(raid_bio);
-	if (remaining == 0) {
-		trace_block_bio_complete(bdev_get_queue(raid_bio->bi_bdev),
-					 raid_bio, 0);
-		bio_endio(raid_bio);
-	}
+	bio_endio(raid_bio);
+
 	if (atomic_dec_and_test(&conf->active_aligned_reads))
 		wake_up(&conf->wait_for_quiescent);
 	return handled;



^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [md PATCH 6/5] md/raid5: remove over-loading of ->bi_phys_segments.
  2016-11-21  1:19 [md PATCH 0/5] Stop using bi_phys_segments as a counter NeilBrown
                   ` (4 preceding siblings ...)
  2016-11-21  1:19 ` [md PATCH 1/5] md: optimize md_write_start() slightly NeilBrown
@ 2016-11-21  2:32 ` NeilBrown
  2016-11-21 14:01 ` [md PATCH 0/5] Stop using bi_phys_segments as a counter Christoph Hellwig
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 18+ messages in thread
From: NeilBrown @ 2016-11-21  2:32 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Christoph Hellwig, linux-raid



When a read request, which bypassed the cache, fails, we need to retry
it through the cache.
This involves attaching it to a sequence of stripe_heads, and it may not
be possible to get all the stripe_heads we need at once.
We do what we can, and record how far we got in ->bi_phys_segments so
we can pick up again later.

There is only ever one bio which may have a non-zero offset stored in
->bi_phys_segments, the one that is either active in the single thread
which calls retry_aligned_read(), or is in conf->retry_read_aligned
waiting for retry_aligned_read() to be called again.

So we only need to store one offset value.  This can be in a local
variable passed between remove_bio_from_retry() and
retry_aligned_read(), or in the r5conf structure next to the
->retry_read_aligned pointer.

Storing it there allows the last usage of ->bi_phys_segments to be
removed from md/raid5.c.
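The resume scheme can be sketched as follows (a userspace model;
'mini_conf', 'retry_aligned' and 'stall_at' are hypothetical names,
with 'stall_at' standing in for running out of stripe_heads):

```c
#include <assert.h>

/* Since at most one retried read is ever partially processed, the
 * resume point can live in the conf structure next to the
 * ->retry_read_aligned pointer instead of inside the bio itself. */
struct mini_conf {
	int      retry_pending;      /* models ->retry_read_aligned */
	unsigned retry_read_offset;  /* the new field */
};

static int retry_aligned(struct mini_conf *conf, unsigned offset,
			 unsigned nstripes, unsigned stall_at)
{
	unsigned scnt;

	for (scnt = offset; scnt < nstripes; scnt++) {
		if (scnt == stall_at) {              /* no stripe_head free */
			conf->retry_pending = 1;
			conf->retry_read_offset = scnt; /* resume here later */
			return 0;
		}
		/* stripe 'scnt' attached and submitted here */
	}
	conf->retry_pending = 0;
	return 1;                                    /* whole bio handled */
}
```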

Signed-off-by: NeilBrown <neilb@suse.com>
---

This applies on top of the previous set of 5, and finishes the
task of eradicating the usage of bi_phys_segments.

NeilBrown


 drivers/md/raid5.c | 46 ++++++++++++----------------------------------
 drivers/md/raid5.h |  1 +
 2 files changed, 13 insertions(+), 34 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 7169420bfde5..b7be5a097ead 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -144,28 +144,6 @@ static inline struct bio *r5_next_bio(struct bio *bio, sector_t sector)
 		return NULL;
 }
 
-/*
- * We maintain a biased count of active stripes in the bottom 16 bits of
- * bi_phys_segments, and a count of processed stripes in the upper 16 bits
- */
-static inline int raid5_bi_processed_stripes(struct bio *bio)
-{
-	atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;
-	return (atomic_read(segments) >> 16) & 0xffff;
-}
-
-static inline void raid5_set_bi_processed_stripes(struct bio *bio,
-	unsigned int cnt)
-{
-	atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;
-	int old, new;
-
-	do {
-		old = atomic_read(segments);
-		new = (old & 0xffff) | (cnt << 16);
-	} while (atomic_cmpxchg(segments, old, new) != old);
-}
-
 /* Find first data disk in a raid6 stripe */
 static inline int raid6_d0(struct stripe_head *sh)
 {
@@ -4679,12 +4657,14 @@ static void add_bio_to_retry(struct bio *bi,struct r5conf *conf)
 	md_wakeup_thread(conf->mddev->thread);
 }
 
-static struct bio *remove_bio_from_retry(struct r5conf *conf)
+static struct bio *remove_bio_from_retry(struct r5conf *conf,
+					 unsigned int *offset)
 {
 	struct bio *bi;
 
 	bi = conf->retry_read_aligned;
 	if (bi) {
+		*offset = conf->retry_read_offset;
 		conf->retry_read_aligned = NULL;
 		return bi;
 	}
@@ -4692,11 +4672,7 @@ static struct bio *remove_bio_from_retry(struct r5conf *conf)
 	if(bi) {
 		conf->retry_read_aligned_list = bi->bi_next;
 		bi->bi_next = NULL;
-		/*
-		 * this sets the active strip count to 1 and the processed
-		 * strip count to zero (upper 8 bits)
-		 */
-		raid5_set_bi_processed_stripes(bi, 0);
+		*offset = 0;
 	}
 
 	return bi;
@@ -5620,7 +5596,8 @@ static inline sector_t raid5_sync_request(struct mddev *mddev, sector_t sector_n
 	return STRIPE_SECTORS;
 }
 
-static int  retry_aligned_read(struct r5conf *conf, struct bio *raid_bio)
+static int  retry_aligned_read(struct r5conf *conf, struct bio *raid_bio,
+			       unsigned int offset)
 {
 	/* We may not be able to submit a whole bio at once as there
 	 * may not be enough stripe_heads available.
@@ -5649,7 +5626,7 @@ static int  retry_aligned_read(struct r5conf *conf, struct bio *raid_bio)
 		     sector += STRIPE_SECTORS,
 		     scnt++) {
 
-		if (scnt < raid5_bi_processed_stripes(raid_bio))
+		if (scnt < offset)
 			/* already done this stripe */
 			continue;
 
@@ -5657,15 +5634,15 @@ static int  retry_aligned_read(struct r5conf *conf, struct bio *raid_bio)
 
 		if (!sh) {
 			/* failed to get a stripe - must wait */
-			raid5_set_bi_processed_stripes(raid_bio, scnt);
 			conf->retry_read_aligned = raid_bio;
+			conf->retry_read_offset = scnt;
 			return handled;
 		}
 
 		if (!add_stripe_bio(sh, raid_bio, dd_idx, 0, 0)) {
 			raid5_release_stripe(sh);
-			raid5_set_bi_processed_stripes(raid_bio, scnt);
 			conf->retry_read_aligned = raid_bio;
+			conf->retry_read_offset = scnt;
 			return handled;
 		}
 
@@ -5788,6 +5765,7 @@ static void raid5d(struct md_thread *thread)
 	while (1) {
 		struct bio *bio;
 		int batch_size, released;
+		unsigned int offset;
 
 		released = release_stripe_list(conf, conf->temp_inactive_list);
 		if (released)
@@ -5805,10 +5783,10 @@ static void raid5d(struct md_thread *thread)
 		}
 		raid5_activate_delayed(conf);
 
-		while ((bio = remove_bio_from_retry(conf))) {
+		while ((bio = remove_bio_from_retry(conf, &offset))) {
 			int ok;
 			spin_unlock_irq(&conf->device_lock);
-			ok = retry_aligned_read(conf, bio);
+			ok = retry_aligned_read(conf, bio, offset);
 			spin_lock_irq(&conf->device_lock);
 			if (!ok)
 				break;
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index 799f84b26838..ec2be7677bfb 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -472,6 +472,7 @@ struct r5conf {
 	struct list_head	delayed_list; /* stripes that have plugged requests */
 	struct list_head	bitmap_list; /* stripes delaying awaiting bitmap update */
 	struct bio		*retry_read_aligned; /* currently retrying aligned bios   */
+	unsigned int		retry_read_offset; /* sector offset into retry_read_aligned */
 	struct bio		*retry_read_aligned_list; /* aligned bios retry list  */
 	atomic_t		preread_active_stripes; /* stripes with scheduled io */
 	atomic_t		active_aligned_reads;
-- 
2.10.2


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 800 bytes --]

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [md PATCH 0/5] Stop using bi_phys_segments as a counter
  2016-11-21  1:19 [md PATCH 0/5] Stop using bi_phys_segments as a counter NeilBrown
                   ` (5 preceding siblings ...)
  2016-11-21  2:32 ` [md PATCH 6/5] md/raid5: remove over-loading of ->bi_phys_segments NeilBrown
@ 2016-11-21 14:01 ` Christoph Hellwig
  2016-11-21 23:43 ` Shaohua Li
  2017-02-06  8:56 ` Christoph Hellwig
  8 siblings, 0 replies; 18+ messages in thread
From: Christoph Hellwig @ 2016-11-21 14:01 UTC (permalink / raw)
  To: NeilBrown
  Cc: Shaohua Li, Konstantin Khlebnikov, Christoph Hellwig, linux-raid

Hi Neil,

I only took a cursory look, but these changes look very nice to me.
Thanks for looking into this!

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [md PATCH 0/5] Stop using bi_phys_segments as a counter
  2016-11-21  1:19 [md PATCH 0/5] Stop using bi_phys_segments as a counter NeilBrown
                   ` (6 preceding siblings ...)
  2016-11-21 14:01 ` [md PATCH 0/5] Stop using bi_phys_segments as a counter Christoph Hellwig
@ 2016-11-21 23:43 ` Shaohua Li
  2016-11-22  0:25   ` NeilBrown
  2017-02-06  8:56 ` Christoph Hellwig
  8 siblings, 1 reply; 18+ messages in thread
From: Shaohua Li @ 2016-11-21 23:43 UTC (permalink / raw)
  To: NeilBrown; +Cc: Konstantin Khlebnikov, Christoph Hellwig, linux-raid

On Mon, Nov 21, 2016 at 12:19:43PM +1100, Neil Brown wrote:
> There are 2 problems with using bi_phys_segments as a counter
> 1/ we only use 16bits, which limits bios to 256M
> 2/ it is poor form to reuse a field like this.  It interferes
>    with other changes to bios.
> 
> We need to clean up a few things before we can change to use the
> counter which is now available inside a bio.
> 
> I have only tested this lightly.  More review and testing would be
> appreciated.

So without the accounting, we:
- don't do bio completion trace
- call md_write_start/md_write_end excessively, which involves atomic operations.

Not big problems. But we are actually reusing __bi_remaining, so I'm wondering
why we don't reuse it explicitly.  E.g., add bio_dec_remaining_return() and use
it like raid5_dec_bi_active_stripes.

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [md PATCH 0/5] Stop using bi_phys_segments as a counter
  2016-11-21 23:43 ` Shaohua Li
@ 2016-11-22  0:25   ` NeilBrown
  2016-11-22  1:02     ` Shaohua Li
  0 siblings, 1 reply; 18+ messages in thread
From: NeilBrown @ 2016-11-22  0:25 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Konstantin Khlebnikov, Christoph Hellwig, linux-raid

[-- Attachment #1: Type: text/plain, Size: 1385 bytes --]

On Tue, Nov 22 2016, Shaohua Li wrote:

> On Mon, Nov 21, 2016 at 12:19:43PM +1100, Neil Brown wrote:
>> There are 2 problems with using bi_phys_segments as a counter
>> 1/ we only use 16bits, which limits bios to 256M
>> 2/ it is poor form to reuse a field like this.  It interferes
>>    with other changes to bios.
>> 
>> We need to clean up a few things before we can change to use the
>> counter which is now available inside a bio.
>> 
>> I have only tested this lightly.  More review and testing would be
>> appreciated.
>
> So without the accounting, we:
> - don't do bio completion trace

Yes, but hopefully that will be added back to bio_endio() soon.

> - call md_write_start/md_write_end excessively, which involves atomic operations.

raid5_inc_bio_active_stripes() did an atomic operation.  I don't think
there is a net increase in the number of atomic operations.

>
> Not big problems. But we are actually reusing __bi_remaining, so I'm wondering
> why we don't reuse it explicitly.  E.g., add bio_dec_remaining_return() and use
> it like raid5_dec_bi_active_stripes.

Because using it exactly the same way that other places use it leads to
fewer surprises, now or later.
And I think that the effort to rearrange the code so that we could just
call bio_endio() brought real improvements in code clarity and
simplicity.

Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 800 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [md PATCH 0/5] Stop using bi_phys_segments as a counter
  2016-11-22  0:25   ` NeilBrown
@ 2016-11-22  1:02     ` Shaohua Li
  2016-11-22  2:19       ` NeilBrown
  0 siblings, 1 reply; 18+ messages in thread
From: Shaohua Li @ 2016-11-22  1:02 UTC (permalink / raw)
  To: NeilBrown; +Cc: Konstantin Khlebnikov, Christoph Hellwig, linux-raid

On Tue, Nov 22, 2016 at 11:25:04AM +1100, Neil Brown wrote:
> On Tue, Nov 22 2016, Shaohua Li wrote:
> 
> > On Mon, Nov 21, 2016 at 12:19:43PM +1100, Neil Brown wrote:
> >> There are 2 problems with using bi_phys_segments as a counter
> >> 1/ we only use 16bits, which limits bios to 256M
> >> 2/ it is poor form to reuse a field like this.  It interferes
> >>    with other changes to bios.
> >> 
> >> We need to clean up a few things before we can change to use the
> >> counter which is now available inside a bio.
> >> 
> >> I have only tested this lightly.  More review and testing would be
> >> appreciated.
> >
> > So without the accounting, we:
> > - don't do bio completion trace
> 
> Yes, but hopefully that will be added back to bio_endio() soon.
> 
> > - call md_write_start/md_write_end excessively, which involves atomic operations.
> 
> raid5_inc_bio_active_stripes() did an atomic operation.  I don't think
> there is a net increase in the number of atomic operations.

That's different. md_write_start/end uses a global atomic.
raid5_inc_bio_active_stripes uses a bio atomic. So we have more cache bouncing now.
 
> >
> > Not big problems. But we are actually reusing __bi_remaining, so I'm wondering
> > why we don't reuse it explicitly.  E.g., add bio_dec_remaining_return() and use
> > it like raid5_dec_bi_active_stripes.
> 
> Because using it exactly the same way that other places use it leads to
> fewer surprises, now or later.
> And I think that the effort to rearrange the code so that we could just
> call bio_endio() brought real improvements in code clarity and
> simplicity.

Not the same way. The return_bi list and retry list fixes are still good. We can
replace the bio_endio in your patch with something like:
if (bio_dec_remaining_return() == 0) {
	trace_block_bio_complete()
	md_write_end()
	bio_endio();
}
This will give us better control over when to end the io.

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [md PATCH 0/5] Stop using bi_phys_segments as a counter
  2016-11-22  1:02     ` Shaohua Li
@ 2016-11-22  2:19       ` NeilBrown
  2016-11-22  8:01         ` Shaohua Li
  0 siblings, 1 reply; 18+ messages in thread
From: NeilBrown @ 2016-11-22  2:19 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Konstantin Khlebnikov, Christoph Hellwig, linux-raid

[-- Attachment #1: Type: text/plain, Size: 3534 bytes --]

On Tue, Nov 22 2016, Shaohua Li wrote:

> On Tue, Nov 22, 2016 at 11:25:04AM +1100, Neil Brown wrote:
>> On Tue, Nov 22 2016, Shaohua Li wrote:
>> 
>> > On Mon, Nov 21, 2016 at 12:19:43PM +1100, Neil Brown wrote:
>> >> There are 2 problems with using bi_phys_segments as a counter
>> >> 1/ we only use 16bits, which limits bios to 256M
>> >> 2/ it is poor form to reuse a field like this.  It interferes
>> >>    with other changes to bios.
>> >> 
>> >> We need to clean up a few things before we can change to use the
>> >> counter which is now available inside a bio.
>> >> 
>> >> I have only tested this lightly.  More review and testing would be
>> >> appreciated.
>> >
>> > So without the accounting, we:
>> > - don't do bio completion trace
>> 
>> Yes, but hopefully that will be added back to bio_endio() soon.
>> 
>> > - call md_write_start/md_write_end excessively, which involves atomic operations.
>> 
>> raid5_inc_bio_active_stripes() did an atomic operation.  I don't think
>> there is a net increase in the number of atomic operations.
>
> That's different. md_write_start/end uses a global atomic.
> raid5_inc_bio_active_stripes uses a bio atomic. So we have more cache bouncing now.

Maybe.
Most md_write_start() calls are made in the context of
raid5_make_request().
We could
 - call md_write_start() once at the start
 - count how many times we want to call it in a variable local to
   raid5_make_request()
 - atomically add that to the counter at the end.

Similarly, most md_write_end() calls are in the context of raid5d.  It
could maintain a local counter and apply them all in a single update
before it sleeps.

It would be a little messy, but not too horrible I think.

>  
>> >
>> > Not big problems. But we are actually reusing __bi_remaining, so I'm wondering
>> > why we don't reuse it explicitly.  E.g., add bio_dec_remaining_return() and use
>> > it like raid5_dec_bi_active_stripes.
>> 
>> Because using it exactly the same way that other places use it leads to
>> fewer surprises, now or later.
>> And I think that the effort to rearrange the code so that we could just
>> call bio_endio() brought real improvements in code clarity and
>> simplicity.
>
> Not the same way. The return_bi list and retry list fixes are still good. We can
> replace the bio_endio in your patch with something like:
> if (bio_dec_remaining_return() == 0) {
> 	trace_block_bio_complete()
> 	md_write_end()
> 	bio_endio();
> }
> This will give us better control over when to end the io.

This isn't safe.  The bio arriving at raid5_make_request() might already
have been split and could be chained.  Then raid5 might never see
bio_dec_remaining_return() return zero.

For example, suppose there is a RAID0 made of some other device, and
this RAID5.
A write request arrives which crosses a chunk boundary.
raid0.c calls bio_split to split off a new bio that will fit in the other
device, leaving the original bio with a larger bi_sector which will get
mapped only into the raid5.
The split bio is chained into the original bio, elevating its
__bi_remaining count.
If the other device is particularly slow, or the RAID5 is particularly
fast, the RAID5 IO might complete before the split bio completes, so
raid5 will only see __bi_remaining go down to one, not zero.
When the split bio finally completes, its bi_end_io is
bio_chain_endio(), and that will call the final bio_endio() on the
original bio.  md_write_end() would then never be called.

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 800 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [md PATCH 0/5] Stop using bi_phys_segments as a counter
  2016-11-22  2:19       ` NeilBrown
@ 2016-11-22  8:01         ` Shaohua Li
  2016-11-23  2:08           ` NeilBrown
  0 siblings, 1 reply; 18+ messages in thread
From: Shaohua Li @ 2016-11-22  8:01 UTC (permalink / raw)
  To: NeilBrown; +Cc: Konstantin Khlebnikov, Christoph Hellwig, linux-raid

On Tue, Nov 22, 2016 at 01:19:07PM +1100, Neil Brown wrote:
> On Tue, Nov 22 2016, Shaohua Li wrote:
> 
> > On Tue, Nov 22, 2016 at 11:25:04AM +1100, Neil Brown wrote:
> >> On Tue, Nov 22 2016, Shaohua Li wrote:
> >> 
> >> > On Mon, Nov 21, 2016 at 12:19:43PM +1100, Neil Brown wrote:
> >> >> There are 2 problems with using bi_phys_segments as a counter
> >> >> 1/ we only use 16bits, which limits bios to 256M
> >> >> 2/ it is poor form to reuse a field like this.  It interferes
> >> >>    with other changes to bios.
> >> >> 
> >> >> We need to clean up a few things before we can change to use the
> >> >> counter which is now available inside a bio.
> >> >> 
> >> >> I have only tested this lightly.  More review and testing would be
> >> >> appreciated.
> >> >
> >> > So without the accounting, we:
> >> > - don't do bio completion trace
> >> 
> >> Yes, but hopefully that will be added back to bio_endio() soon.
> >> 
> >> > - call md_write_start/md_write_end excessively, which involves atomic operations.
> >> 
> >> raid5_inc_bio_active_stripes() did an atomic operation.  I don't think
> >> there is a net increase in the number of atomic operations.
> >
> > That's different. md_write_start/end uses a global atomic.
> > raid5_inc_bio_active_stripes uses a bio atomic. So we have more cache bouncing now.
> 
> Maybe.
> Most md_write_start() calls are made in the context of
> raid5_make_request().
> We could
>  - call md_write_start() once at the start
>  - count how many times we want to call it in a variable local to
>    raid5_make_request()
>  - atomically add that to the counter at the end.
> 
> Similarly, most md_write_end() calls are in the context of raid5d.  It
> could maintain a local counter and apply them all in a single update
> before it sleeps.
> 
> It would be a little messy, but not too horrible I think.

This is actually messy; I don't think it's justified.

> >> >
> >> > Not big problems. But we are actually reusing __bi_remaining, so I'm wondering
> >> > why we don't reuse it explicitly.  E.g., add bio_dec_remaining_return() and use
> >> > it like raid5_dec_bi_active_stripes.
> >> 
> >> Because using it exactly the same way that other places use it leads to
> >> fewer surprises, now or later.
> >> And I think that the effort to rearrange the code so that we could just
> >> call bio_endio() brought real improvements in code clarity and
> >> simplicity.
> >
> > Not the same way. The return_bi list and retry list fixes are still good. We can
> > replace the bio_endio in your patch with something like:
> > if (bio_dec_remaining_return() == 0) {
> > 	trace_block_bio_complete()
> > 	md_write_end()
> > 	bio_endio();
> > }
> > This will give us better control over when to end the io.
> 
> This isn't safe.  The bio arriving at raid5_make_request() might already
> have been split and could be chained.  Then raid5 might never see
> bio_dec_remaining_return() return zero.
> 
> For example, suppose there is a RAID0 made of some other device, and
> this RAID5.
> A write request arrives which crosses a chunk boundary.
> raid0.c calls bio_split to split off a new bio that will fit in the other
> device, leaving the original bio with a larger bi_sector which will get
> mapped only into the raid5.
> The split bio is chained into the original bio, elevating its
> __bi_remaining count.
> If the other device is particularly slow, or the RAID5 is particularly
> fast, the RAID5 IO might complete before the split bio completes, so
> raid5 will only see __bi_remaining go down to one, not zero.
> When the split bio finally completes, its bi_end_io is
> bio_chain_endio(), and that will call the final bio_endio() on the
> original bio.  md_write_end() would then never be called.
ok, this makes sense.

I still don't like that we have no control over when to do the endio, but I
don't have a better idea either. These patches work for raid5, but we need to
find a similarly tricky way to work around raid1/raid10, which reuse
bi_phys_segments too. Probably it's time MD allocated additional data and
attached it to the bio. Adding a counter would solve the issue in a
consistent/clean way for raid1/10/5.

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [md PATCH 0/5] Stop using bi_phys_segments as a counter
  2016-11-22  8:01         ` Shaohua Li
@ 2016-11-23  2:08           ` NeilBrown
  2016-11-23  8:45             ` Christoph Hellwig
  0 siblings, 1 reply; 18+ messages in thread
From: NeilBrown @ 2016-11-23  2:08 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Christoph Hellwig, linux-raid

[-- Attachment #1: Type: text/plain, Size: 802 bytes --]

On Tue, Nov 22 2016, Shaohua Li wrote:

>
> I still don't like that we have no control over when to do the endio, but I
> don't have a better idea either. These patches work for raid5, but we need to
> find a similarly tricky way to work around raid1/raid10, which reuse
> bi_phys_segments too. Probably it's time MD allocated additional data and
> attached it to the bio. Adding a counter would solve the issue in a
> consistent/clean way for raid1/10/5.

For raid1/raid10 we could do a very similar thing.  There is an
awkwardness in raid1 w.r.t waiting for bi_phys_segments to reach 1, but
that might disappear if Coly's resync changes go through.
Alternatively, it might make sense to use bio_split so there is one r1_bio
per bio.
I might try the raid10 version and see what it looks like.

Thanks,
NeilBrown


>
> Thanks,
> Shaohua

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 800 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [md PATCH 0/5] Stop using bi_phys_segments as a counter
  2016-11-23  2:08           ` NeilBrown
@ 2016-11-23  8:45             ` Christoph Hellwig
  2016-11-24  0:31               ` NeilBrown
  0 siblings, 1 reply; 18+ messages in thread
From: Christoph Hellwig @ 2016-11-23  8:45 UTC (permalink / raw)
  To: NeilBrown; +Cc: Shaohua Li, Christoph Hellwig, linux-raid

On Wed, Nov 23, 2016 at 01:08:38PM +1100, NeilBrown wrote:
> For raid1/raid10 we could do a very similar thing.  There is an
> awkwardness in raid1 w.r.t waiting for bi_phys_segments to reach 1, but
> that might disappear if Coly's resync changes go through.
> Alternatively, it might make sense to use bio_split so there is one r1_bio
> per bio.
> I might try the raid10 version and see what it looks like.

For RAID1 reads there already is one r1_bio per bio - in fact I have
a hack that doesn't allocate a r1_bio at all, but that one currently
does not handle reads from degraded arrays at all.  Do you remember
why we only mark a leg as failed for a given bio instead of on a per-device
or at least per-sector-range basis?

For writes it would make sense to allocate the new bio for each mirror
using a bio_set with front_pad for raid1-specific data, but I haven't
really looked into the details yet.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [md PATCH 0/5] Stop using bi_phys_segments as a counter
  2016-11-23  8:45             ` Christoph Hellwig
@ 2016-11-24  0:31               ` NeilBrown
  0 siblings, 0 replies; 18+ messages in thread
From: NeilBrown @ 2016-11-24  0:31 UTC (permalink / raw)
  Cc: Shaohua Li, Christoph Hellwig, linux-raid

[-- Attachment #1: Type: text/plain, Size: 1565 bytes --]

On Wed, Nov 23 2016, Christoph Hellwig wrote:

> On Wed, Nov 23, 2016 at 01:08:38PM +1100, NeilBrown wrote:
>> For raid1/raid10 we could do a very similar thing.  There is an
>> awkwardness in raid1 w.r.t waiting for bi_phys_segments to reach 1, but
>> that might disappear if Coly's resync changes go through.
>> Alternatively, it might make sense to use bio_split so there is one r1_bio
>> per bio.
>> I might try the raid10 version and see what it looks like.
>
> For RAID1 reads there already is one r1_bio per bio

unless the bad-block-log records some bad/missing blocks in the middle
of the range being read.

> - in fact I have
> a hack that doesn't allocate a r1_bio at all, but that one currently
> does not handle reads from degraded arrays at all.  Due you remember
> why we only mark a leg fail for a given bio instead of on a per-device
> or at least per sector-range?

When we detect a read error, we try to "correct" it by reading the data
from elsewhere and over-writing the "bad" block.  If that works, there
was no need to mark the whole device as faulty.  If it does fail, that
is when the device is failed.

We could have a "soft-write-mostly" flag to discourage reads from a
questionable device, until it has been properly tested.

>
> For writes it would make sense to allocate the new bio for each mirror
> using a bio_set with front_pad for raid1-specific data, but I haven't
> really looked into the details yet.

Yes, you could.  I don't know if it would be an improvement, or just a
change.

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 800 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [md PATCH 0/5] Stop using bi_phys_segments as a counter
  2016-11-21  1:19 [md PATCH 0/5] Stop using bi_phys_segments as a counter NeilBrown
                   ` (7 preceding siblings ...)
  2016-11-21 23:43 ` Shaohua Li
@ 2017-02-06  8:56 ` Christoph Hellwig
  2017-02-06 21:41   ` Shaohua Li
  8 siblings, 1 reply; 18+ messages in thread
From: Christoph Hellwig @ 2017-02-06  8:56 UTC (permalink / raw)
  To: NeilBrown
  Cc: Shaohua Li, Konstantin Khlebnikov, Christoph Hellwig, linux-raid

Did anything happen to this series?  Having it would be extremely
useful for the block layer moving forward.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [md PATCH 0/5] Stop using bi_phys_segments as a counter
  2017-02-06  8:56 ` Christoph Hellwig
@ 2017-02-06 21:41   ` Shaohua Li
  0 siblings, 0 replies; 18+ messages in thread
From: Shaohua Li @ 2017-02-06 21:41 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: NeilBrown, Konstantin Khlebnikov, linux-raid

On Mon, Feb 06, 2017 at 09:56:24AM +0100, Christoph Hellwig wrote:
> Did anything happen to this series?  Having it would be extremely
> useful for the block layer moving forward.

So I'd prefer allocating extra memory for the counter instead of playing tricky
games. I'll post some patches soon.

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2017-02-06 21:41 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-11-21  1:19 [md PATCH 0/5] Stop using bi_phys_segments as a counter NeilBrown
2016-11-21  1:19 ` [md PATCH 5/5] md/raid5: use bio_inc_remaining() instead of repurposing " NeilBrown
2016-11-21  1:19 ` [md PATCH 3/5] md/raid5: simplfy delaying of writes while metadata is updated NeilBrown
2016-11-21  1:19 ` [md PATCH 4/5] md/raid5: call bio_endio() directly rather than queuing for later NeilBrown
2016-11-21  1:19 ` [md PATCH 2/5] md/raid5: use md_write_start to count stripes, not bios NeilBrown
2016-11-21  1:19 ` [md PATCH 1/5] md: optimize md_write_start() slightly NeilBrown
2016-11-21  2:32 ` [md PATCH 6/5] md/raid5: remove over-loading of ->bi_phys_segments NeilBrown
2016-11-21 14:01 ` [md PATCH 0/5] Stop using bi_phys_segments as a counter Christoph Hellwig
2016-11-21 23:43 ` Shaohua Li
2016-11-22  0:25   ` NeilBrown
2016-11-22  1:02     ` Shaohua Li
2016-11-22  2:19       ` NeilBrown
2016-11-22  8:01         ` Shaohua Li
2016-11-23  2:08           ` NeilBrown
2016-11-23  8:45             ` Christoph Hellwig
2016-11-24  0:31               ` NeilBrown
2017-02-06  8:56 ` Christoph Hellwig
2017-02-06 21:41   ` Shaohua Li
