[RFC PATCH v2 0/9] Block/XFS: Support alternative mirror device retry

linux-block.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [RFC PATCH v2 0/9] Block/XFS: Support alternative mirror device retry
@ 2019-02-13  9:50 Bob Liu
  2019-02-13  9:50 ` [RFC PATCH v2 1/9] block: add nr_mirrors to request_queue Bob Liu
                   ` (10 more replies)
  0 siblings, 11 replies; 28+ messages in thread
From: Bob Liu @ 2019-02-13  9:50 UTC (permalink / raw)
  To: linux-block
  Cc: linux-xfs, linux-fsdevel, martin.petersen, shirley.ma,
	allison.henderson, david, darrick.wong, hch, adilger, Bob Liu

Motivation:
When fs data/metadata checksum mismatch, lower block devices may have other
correct copies. e.g. If XFS successfully reads a metadata buffer off a raid1 but
decides that the metadata is garbage, today it will shut down the entire
filesystem without trying any of the other mirrors.  This is a severe
loss of service, and we propose these patches to have XFS try harder to
avoid failure.

This patch prototype this mirror retry idea by:
* Adding @nr_mirrors to struct request_queue which is similar as
  blk_queue_nonrot(), filesystem can grab device request queue and check max
  mirrors this block device has.
  Helper functions were also added to get/set the nr_mirrors.

* Introducing bi_rd_hint just like bi_write_hint, but bi_rd_hint is a long bitmap
in order to support stacked layer case.

* Modify md/raid1 to support this retry feature.

* Adapter xfs to use this feature.
  If the read verify fails, we loop over the available mirrors and retry the read.

* Rewrite retried read
  When the read verification fails, but the retry succeedes
  write the buffer back to correct the bad mirror

* Add tracepoints and logging to alternate device retry.
  This patch adds new log entries and trace points to the alternate device retry
  error path.

Changes v2:
- No more reuse bi_write_hint
- Stacked layer support(see patch 4/9)
- Other feedback fix

Allison Henderson (5):
  Add b_alt_retry to xfs_buf
  xfs: Add b_rd_hint to xfs_buf
  xfs: Add device retry
  xfs: Rewrite retried read
  xfs: Add tracepoints and logging to alternate device retry

Bob Liu (4):
  block: add nr_mirrors to request_queue
  block: add rd_hint to bio and request
  md:raid1: set mirrors correctly
  md:raid1: rd_hint support and consider stacked layer case

 Documentation/block/biodoc.txt |   3 +
 block/bio.c                    |   1 +
 block/blk-core.c               |   4 ++
 block/blk-merge.c              |   6 ++
 block/blk-settings.c           |  24 +++++++
 block/bounce.c                 |   1 +
 drivers/md/raid1.c             | 123 ++++++++++++++++++++++++++++++++-
 fs/xfs/xfs_buf.c               |  58 +++++++++++++++-
 fs/xfs/xfs_buf.h               |  14 ++++
 fs/xfs/xfs_trace.h             |   6 +-
 include/linux/blk_types.h      |   1 +
 include/linux/blkdev.h         |   4 ++
 include/linux/types.h          |   3 +
 13 files changed, 244 insertions(+), 4 deletions(-)

-- 
2.17.1

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [RFC PATCH v2 1/9] block: add nr_mirrors to request_queue
  2019-02-13  9:50 [RFC PATCH v2 0/9] Block/XFS: Support alternative mirror device retry Bob Liu
@ 2019-02-13  9:50 ` Bob Liu
  2019-02-13 10:26   ` Andreas Dilger
  2019-02-13 16:04   ` Theodore Y. Ts'o
  2019-02-13  9:50 ` [RFC PATCH v2 2/9] block: add rd_hint to bio and request Bob Liu
                   ` (9 subsequent siblings)
  10 siblings, 2 replies; 28+ messages in thread
From: Bob Liu @ 2019-02-13  9:50 UTC (permalink / raw)
  To: linux-block
  Cc: linux-xfs, linux-fsdevel, martin.petersen, shirley.ma,
	allison.henderson, david, darrick.wong, hch, adilger, Bob Liu

When fs data/metadata checksum mismatch, lower block devices may have other
correct copies. e.g if we did raid1 for protecting fs metadata.
Then fs could try other copies of metadata instead of panic, but fs need be
awared how many mirrors the block devices have.

This patch add @nr_mirrors to struct request_queue which is similar as
blk_queue_nonrot(), filesystem can grab device request queue and check the
number of mirrors of this block device.

@nr_mirrors is 1 by default which means only one copy, drivers e.g raid1 are
responsible for setting the right value. The maximum value is
BITS_PER_LONG which is 32 or 64. That should be big enough else retry lantency
may be too high.

Also added helper functions for get/set the number of mirrors for a specific
device request queue.

Todo:
* Export nr_mirrors through /sysfs.

Signed-off-by: Bob Liu <bob.liu@oracle.com>
---
 block/blk-core.c       |  3 +++
 block/blk-settings.c   | 24 ++++++++++++++++++++++++
 include/linux/blkdev.h |  3 +++
 include/linux/types.h  |  3 +++
 4 files changed, 33 insertions(+)

diff --git a/block/blk-core.c b/block/blk-core.c
index 6b78ec56a4f2..b838c6dc5357 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -537,6 +537,9 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
 	if (blkcg_init_queue(q))
 		goto fail_ref;
 
+	/* Set queue default mirrors to 1 explicitly. */
+	blk_queue_set_mirrors(q, 1);
+
 	return q;
 
 fail_ref:
diff --git a/block/blk-settings.c b/block/blk-settings.c
index 3e7038e475ee..38e4d7e675e6 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -844,6 +844,30 @@ void blk_queue_write_cache(struct request_queue *q, bool wc, bool fua)
 }
 EXPORT_SYMBOL_GPL(blk_queue_write_cache);
 
+/*
+ * Get the number of read redundant mirrors.
+ */
+unsigned short blk_queue_get_mirrors(struct request_queue *q)
+{
+	return q->nr_mirrors;
+}
+EXPORT_SYMBOL(blk_queue_get_mirrors);
+
+/*
+ * Set the number of read redundant mirrors.
+ */
+bool blk_queue_set_mirrors(struct request_queue *q, unsigned short mirrors)
+{
+	if(q->nr_mirrors >= BLKDEV_MAX_MIRRORS) {
+		printk("blk_queue_set_mirrors: %d exceed max mirrors(%d)\n",
+				mirrors, BLKDEV_MAX_MIRRORS);
+		return false;
+	}
+	q->nr_mirrors = mirrors;
+	return true;
+}
+EXPORT_SYMBOL(blk_queue_set_mirrors);
+
 static int __init blk_settings_init(void)
 {
 	blk_max_low_pfn = max_low_pfn - 1;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 338604dff7d0..0191dc4d3f2d 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -570,6 +570,7 @@ struct request_queue {
 
 #define BLK_MAX_WRITE_HINTS	5
 	u64			write_hints[BLK_MAX_WRITE_HINTS];
+	unsigned long		nr_mirrors; /* Default value is 1 */
 };
 
 #define QUEUE_FLAG_STOPPED	1	/* queue is stopped */
@@ -1071,6 +1072,8 @@ extern void blk_queue_update_dma_alignment(struct request_queue *, int);
 extern void blk_queue_rq_timeout(struct request_queue *, unsigned int);
 extern void blk_queue_flush_queueable(struct request_queue *q, bool queueable);
 extern void blk_queue_write_cache(struct request_queue *q, bool enabled, bool fua);
+extern unsigned short blk_queue_get_mirrors(struct request_queue *q);
+extern bool blk_queue_set_mirrors(struct request_queue *q, unsigned short mirrors);
 
 /*
  * Number of physical segments as sent to the device.
diff --git a/include/linux/types.h b/include/linux/types.h
index c2615d6a019e..a29135772f3a 100644
--- a/include/linux/types.h
+++ b/include/linux/types.h
@@ -7,6 +7,9 @@
 
 #ifndef __ASSEMBLY__
 
+/* max mirrors of blkdev */
+#define BLKDEV_MAX_MIRRORS BITS_PER_LONG
+
 #define DECLARE_BITMAP(name,bits) \
 	unsigned long name[BITS_TO_LONGS(bits)]
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC PATCH v2 2/9] block: add rd_hint to bio and request
  2019-02-13  9:50 [RFC PATCH v2 0/9] Block/XFS: Support alternative mirror device retry Bob Liu
  2019-02-13  9:50 ` [RFC PATCH v2 1/9] block: add nr_mirrors to request_queue Bob Liu
@ 2019-02-13  9:50 ` Bob Liu
  2019-02-13 16:18   ` Jens Axboe
  2019-02-13  9:50 ` [RFC PATCH v2 3/9] md:raid1: set mirrors correctly Bob Liu
                   ` (8 subsequent siblings)
  10 siblings, 1 reply; 28+ messages in thread
From: Bob Liu @ 2019-02-13  9:50 UTC (permalink / raw)
  To: linux-block
  Cc: linux-xfs, linux-fsdevel, martin.petersen, shirley.ma,
	allison.henderson, david, darrick.wong, hch, adilger, Bob Liu

rd_hint is a bitmap for stacked layer support(see patch 4/9),
set a bit to 1 means already read from the corresponding mirror device.

rd_hint will be set properly recording read i/o went to which real device
during end_bio().
If the upper layer want to retry other mirrors, just preserve the returned
bi_rd_hint and resubmit bio.

The upper layer e.g fs can set bitmap_zero(rd_hint) if don't care about alt
mirror device retry feature which is also the default setting.

Signed-off-by: Bob Liu <bob.liu@oracle.com>
---
 Documentation/block/biodoc.txt | 3 +++
 block/bio.c                    | 1 +
 block/blk-core.c               | 1 +
 block/blk-merge.c              | 6 ++++++
 block/bounce.c                 | 1 +
 drivers/md/raid1.c             | 1 +
 include/linux/blk_types.h      | 1 +
 include/linux/blkdev.h         | 1 +
 8 files changed, 15 insertions(+)

diff --git a/Documentation/block/biodoc.txt b/Documentation/block/biodoc.txt
index ac18b488cb5e..c6b5dfc9314b 100644
--- a/Documentation/block/biodoc.txt
+++ b/Documentation/block/biodoc.txt
@@ -430,6 +430,7 @@ struct bio {
        struct bio          *bi_next;    /* request queue link */
        struct block_device *bi_bdev;	/* target device */
        unsigned long       bi_flags;    /* status, command, etc */
+       DECLARE_BITMAP(bi_rd_hint, BLKDEV_MAX_MIRRORS); /* bio read hint */
        unsigned long       bi_opf;       /* low bits: r/w, high: priority */
 
        unsigned int	bi_vcnt;     /* how may bio_vec's */
@@ -464,6 +465,8 @@ With this multipage bio design:
   (e.g a 1MB bio_vec needs to be handled in max 128kB chunks for IDE)
   [TBD: Should preferably also have a bi_voffset and bi_vlen to avoid modifying
    bi_offset an len fields]
+- bi_rd_hint is an in/out bitmap parameter, set a bit to 1 means already read
+  from the corresponding mirror device.
 
 (*) unrelated merges -- a request ends up containing two or more bios that
     didn't originate from the same place.
diff --git a/block/bio.c b/block/bio.c
index 4db1008309ed..0e97d75edbd4 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -606,6 +606,7 @@ void __bio_clone_fast(struct bio *bio, struct bio *bio_src)
 	bio->bi_opf = bio_src->bi_opf;
 	bio->bi_ioprio = bio_src->bi_ioprio;
 	bio->bi_write_hint = bio_src->bi_write_hint;
+	bitmap_copy(bio->bi_rd_hint, bio_src->bi_rd_hint, BLKDEV_MAX_MIRRORS);
 	bio->bi_iter = bio_src->bi_iter;
 	bio->bi_io_vec = bio_src->bi_io_vec;
 
diff --git a/block/blk-core.c b/block/blk-core.c
index b838c6dc5357..c93162b7140c 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -742,6 +742,7 @@ void blk_init_request_from_bio(struct request *req, struct bio *bio)
 	req->__sector = bio->bi_iter.bi_sector;
 	req->ioprio = bio_prio(bio);
 	req->write_hint = bio->bi_write_hint;
+	bitmap_copy(req->rd_hint, bio->bi_rd_hint, BLKDEV_MAX_MIRRORS);
 	blk_rq_bio_prep(req->q, req, bio);
 }
 EXPORT_SYMBOL_GPL(blk_init_request_from_bio);
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 71e9ac03f621..58982a80eca8 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -745,6 +745,9 @@ static struct request *attempt_merge(struct request_queue *q,
 	if (req->write_hint != next->write_hint)
 		return NULL;
 
+	if (!bitmap_equal(req->rd_hint, next->rd_hint, BLKDEV_MAX_MIRRORS))
+		return NULL;
+
 	if (req->ioprio != next->ioprio)
 		return NULL;
 
@@ -877,6 +880,9 @@ bool blk_rq_merge_ok(struct request *rq, struct bio *bio)
 	if (rq->write_hint != bio->bi_write_hint)
 		return false;
 
+	if (!bitmap_equal(rq->rd_hint, bio->bi_rd_hint, BLKDEV_MAX_MIRRORS))
+		return false;
+
 	if (rq->ioprio != bio_prio(bio))
 		return false;
 
diff --git a/block/bounce.c b/block/bounce.c
index ffb9e9ecfa7e..fba66e06b735 100644
--- a/block/bounce.c
+++ b/block/bounce.c
@@ -250,6 +250,7 @@ static struct bio *bounce_clone_bio(struct bio *bio_src, gfp_t gfp_mask,
 	bio->bi_opf		= bio_src->bi_opf;
 	bio->bi_ioprio		= bio_src->bi_ioprio;
 	bio->bi_write_hint	= bio_src->bi_write_hint;
+	bitmap_copy(bio->bi_rd_hint, bio_src->bi_rd_hint, BLKDEV_MAX_MIRRORS);
 	bio->bi_iter.bi_sector	= bio_src->bi_iter.bi_sector;
 	bio->bi_iter.bi_size	= bio_src->bi_iter.bi_size;
 
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 1d54109071cc..1e5a51f22332 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1103,6 +1103,7 @@ static void alloc_behind_master_bio(struct r1bio *r1_bio,
 	}
 
 	behind_bio->bi_write_hint = bio->bi_write_hint;
+	bitmap_copy(behind_bio->bi_rd_hint, bio->bi_rd_hint, BLKDEV_MAX_MIRRORS);
 
 	while (i < vcnt && size) {
 		struct page *page;
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index d66bf5f32610..49bdd96e2623 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -151,6 +151,7 @@ struct bio {
 	unsigned short		bi_flags;	/* status, etc and bvec pool number */
 	unsigned short		bi_ioprio;
 	unsigned short		bi_write_hint;
+	DECLARE_BITMAP(bi_rd_hint, BLKDEV_MAX_MIRRORS);
 	blk_status_t		bi_status;
 	u8			bi_partno;
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 0191dc4d3f2d..0a1e93b282c4 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -214,6 +214,7 @@ struct request {
 #endif
 
 	unsigned short write_hint;
+	DECLARE_BITMAP(rd_hint, BLKDEV_MAX_MIRRORS);
 	unsigned short ioprio;
 
 	void *special;		/* opaque pointer available for LLD use */
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC PATCH v2 3/9] md:raid1: set mirrors correctly
  2019-02-13  9:50 [RFC PATCH v2 0/9] Block/XFS: Support alternative mirror device retry Bob Liu
  2019-02-13  9:50 ` [RFC PATCH v2 1/9] block: add nr_mirrors to request_queue Bob Liu
  2019-02-13  9:50 ` [RFC PATCH v2 2/9] block: add rd_hint to bio and request Bob Liu
@ 2019-02-13  9:50 ` Bob Liu
  2019-02-13  9:50 ` [RFC PATCH v2 4/9] md:raid1: rd_hint support and consider stacked layer case Bob Liu
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 28+ messages in thread
From: Bob Liu @ 2019-02-13  9:50 UTC (permalink / raw)
  To: linux-block
  Cc: linux-xfs, linux-fsdevel, martin.petersen, shirley.ma,
	allison.henderson, david, darrick.wong, hch, adilger, Bob Liu

In stack layer case, the mirror of upper layer device is the sum of mirrors of
all lower layer devices.

Signed-off-by: Bob Liu <bob.liu@oracle.com>
---
 drivers/md/raid1.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 1e5a51f22332..0de28714e9b5 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -3050,6 +3050,7 @@ static int raid1_run(struct mddev *mddev)
 	struct md_rdev *rdev;
 	int ret;
 	bool discard_supported = false;
+	unsigned long mirrors = 0;
 
 	if (mddev->level != 1) {
 		pr_warn("md/raid1:%s: raid level not set to mirroring (%d)\n",
@@ -3084,11 +3085,15 @@ static int raid1_run(struct mddev *mddev)
 	rdev_for_each(rdev, mddev) {
 		if (!mddev->gendisk)
 			continue;
+		mirrors += blk_queue_get_mirrors(bdev_get_queue(rdev->bdev));
 		disk_stack_limits(mddev->gendisk, rdev->bdev,
 				  rdev->data_offset << 9);
 		if (blk_queue_discard(bdev_get_queue(rdev->bdev)))
 			discard_supported = true;
 	}
+	if (mddev->queue)
+		if (!blk_queue_set_mirrors(mddev->queue, mirrors))
+			return -EINVAL;
 
 	mddev->degraded = 0;
 	for (i=0; i < conf->raid_disks; i++)
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC PATCH v2 4/9] md:raid1: rd_hint support and consider stacked layer case
  2019-02-13  9:50 [RFC PATCH v2 0/9] Block/XFS: Support alternative mirror device retry Bob Liu
                   ` (2 preceding siblings ...)
  2019-02-13  9:50 ` [RFC PATCH v2 3/9] md:raid1: set mirrors correctly Bob Liu
@ 2019-02-13  9:50 ` Bob Liu
  2019-02-13  9:50 ` [RFC PATCH v2 5/9] Add b_alt_retry to xfs_buf Bob Liu
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 28+ messages in thread
From: Bob Liu @ 2019-02-13  9:50 UTC (permalink / raw)
  To: linux-block
  Cc: linux-xfs, linux-fsdevel, martin.petersen, shirley.ma,
	allison.henderson, david, darrick.wong, hch, adilger, Bob Liu

rd_hint is a bit map for stacked md layer supporting.
When submit bio to a lower md layer, the bio->bi_rd_hint should be split
according mirror number of each device of lower layer.
And merge bio->bi_rd_hint in the end path vise versa.

For a two layer stacked md case like:
                           /dev/md0
             /                |                        \
      /dev/md1-a             /dev/md1-b                /dev/md1-c
   /        \           /       |        \           /      |      \
/dev/sda /dev/sdb  /dev/sdc /dev/sdd  /dev/sde  /dev/sdf /dev/sdg /dev/sdh


- 1) First the top layer sumbit bio with bi_rd_hint = [00 000 000],
then the value of bi_rd_hint changed as below when bio goes to lower layer.
                         [00 000 000]
             /                |                       \
         [00]               [000]                    [000]
   /        \           /       |        \           /      |      \
[0]         [0]        [0]     [0]       [0]       [0]     [0]     [0]


- 2) i/o may goes to  /dev/sda at first:
[1]         [0]        [0]     [0]      [0]       [0]     [0]     [0]
  \         /           \       |        /          \      |      /
         [10]                [000]                    [000]
             \                |                       /
                         [10 000 000]
The top layer will get bio->bi_rd_hint = [10 000 000]


- 3) Fs check the data is corrupt, resumbit bio with bi_rd_hint = [10 000 000]
                         [10 000 000]
             /                |                       \
         [10]               [000]                    [000]
   /        \           /       |        \           /      |      \
[1]         [0]        [0]     [0]       [0]       [0]     [0]     [0]


- 4) i/o can go to any dev except /dev/sda(already tried), assum goes to /dev/sdg
this time.
[1]         [0]        [0]     [0]      [0]       [0]     [1]     [0]
  \         /           \       |        /          \      |      /
         [10]                [000]                    [010]
             \                |                       /
                         [10 000 010]
The top layer will get bio->bi_rd_hint = [10 000 010], which means we already
tried /dev/sda and /dev/sdg.


- 5) If the data is corrupt again, resumbit bio with
bi_rd_hint = [10 000 010].

Loop until all mirrors are tried..

Signed-off-by: Bob Liu <bob.liu@oracle.com>
---
 drivers/md/raid1.c | 117 ++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 116 insertions(+), 1 deletion(-)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 0de28714e9b5..75fde3a3fd3d 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -325,6 +325,41 @@ static int find_bio_disk(struct r1bio *r1_bio, struct bio *bio)
 	return mirror;
 }
 
+/* merge children's rd hint to master bio */
+static void raid1_merge_rd_hint(struct bio *bio)
+{
+	struct r1bio *r1_bio = bio->bi_private;
+	struct r1conf *conf = r1_bio->mddev->private;
+	struct md_rdev *tmp_rdev = NULL;
+	int i = conf->raid_disks - 1;
+	int cnt = 0;
+	int read_disk = r1_bio->read_disk;
+	DECLARE_BITMAP(tmp_bitmap, BLKDEV_MAX_MIRRORS);
+
+	if (!r1_bio->master_bio)
+		return;
+
+	/* ignore replace case now */
+	if (read_disk > conf->raid_disks - 1)
+		read_disk = r1_bio->read_disk - conf->raid_disks;
+
+	for (; i >= 0; i--) {
+		tmp_rdev = conf->mirrors[i].rdev;
+		if (i == read_disk)
+			break;
+		cnt += blk_queue_get_mirrors(bdev_get_queue(tmp_rdev->bdev));
+	}
+
+	/* init map properly from most lower layer */
+	if (blk_queue_get_mirrors(bdev_get_queue(tmp_rdev->bdev)) == 1)
+		bitmap_set(bio->bi_rd_hint, 0, 1);
+
+	bitmap_shift_left(tmp_bitmap, bio->bi_rd_hint, cnt, BLKDEV_MAX_MIRRORS);
+	bitmap_or(r1_bio->master_bio->bi_rd_hint,
+		  r1_bio->master_bio->bi_rd_hint, tmp_bitmap,
+		  BLKDEV_MAX_MIRRORS);
+}
+
 static void raid1_end_read_request(struct bio *bio)
 {
 	int uptodate = !bio->bi_status;
@@ -332,6 +367,7 @@ static void raid1_end_read_request(struct bio *bio)
 	struct r1conf *conf = r1_bio->mddev->private;
 	struct md_rdev *rdev = conf->mirrors[r1_bio->read_disk].rdev;
 
+	raid1_merge_rd_hint(bio);
 	/*
 	 * this branch is our 'one mirror IO has finished' event handler:
 	 */
@@ -539,6 +575,37 @@ static sector_t align_to_barrier_unit_end(sector_t start_sector,
 	return len;
 }
 
+static long choose_disk_from_rd_hint(struct r1conf *conf, struct r1bio *r1_bio)
+{
+	struct md_rdev *tmp_rdev;
+	unsigned long bit, cnt;
+	struct bio *bio = r1_bio->master_bio;
+	int mirror = conf->raid_disks - 1;
+
+	cnt = blk_queue_get_mirrors(r1_bio->mddev->queue);
+	/* Find a never-readed device */
+	bit = bitmap_find_next_zero_area(bio->bi_rd_hint, cnt, 0, 1, 0);
+	if (bit >= cnt)
+		/* Already tried all mirrors */
+		return -1;
+
+	/* Decide this device belongs to which mirror for stacked-layer raid
+	 * devices. */
+	cnt = 0;
+	for ( ; mirror >= 0; mirror--) {
+		tmp_rdev = conf->mirrors[mirror].rdev;
+		cnt += blk_queue_get_mirrors(bdev_get_queue(tmp_rdev->bdev));
+		/* bit start from 0, while mirrors start from 1. So should compare
+		 * with (bit + 1) */
+		if (cnt >= (bit + 1)) {
+			return mirror;
+		}
+	}
+
+	/* Should not arrive here. */
+	return -1;
+}
+
 /*
  * This routine returns the disk from which the requested read should
  * be done. There is a per-array 'next expected sequential IO' sector
@@ -566,6 +633,7 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
 	struct md_rdev *rdev;
 	int choose_first;
 	int choose_next_idle;
+	int max_disks;
 
 	rcu_read_lock();
 	/*
@@ -593,7 +661,18 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
 	else
 		choose_first = 0;
 
-	for (disk = 0 ; disk < conf->raid_disks * 2 ; disk++) {
+	if (!bitmap_empty(r1_bio->master_bio->bi_rd_hint, BLKDEV_MAX_MIRRORS)) {
+		disk  = choose_disk_from_rd_hint(conf, r1_bio);
+		if (disk < 0)
+			return -1;
+
+		/* Use the specific disk */
+		max_disks = disk + 1;
+	} else {
+		disk = 0;
+		max_disks = conf->raid_disks * 2;
+	}
+	for (; disk < max_disks; disk++) {
 		sector_t dist;
 		sector_t first_bad;
 		int bad_sectors;
@@ -1186,6 +1265,34 @@ alloc_r1bio(struct mddev *mddev, struct bio *bio)
 	return r1_bio;
 }
 
+static void raid1_split_rd_hint(struct bio *bio)
+{
+	struct r1bio *r1_bio = bio->bi_private;
+	struct r1conf *conf = r1_bio->mddev->private;
+	unsigned int cnt = 0;
+	DECLARE_BITMAP(tmp_bitmap, BLKDEV_MAX_MIRRORS);
+
+	int i = conf->raid_disks - 1;
+	struct md_rdev *tmp_rdev = NULL;
+
+	for (; i >= 0; i--) {
+		tmp_rdev = conf->mirrors[i].rdev;
+		if (i == r1_bio->read_disk)
+			break;
+		cnt += blk_queue_get_mirrors(bdev_get_queue(tmp_rdev->bdev));
+	}
+
+	bitmap_zero(tmp_bitmap, BLKDEV_MAX_MIRRORS);
+	bitmap_shift_right(bio->bi_rd_hint, r1_bio->master_bio->bi_rd_hint, cnt,
+			BLKDEV_MAX_MIRRORS);
+
+	cnt = blk_queue_get_mirrors(bdev_get_queue(tmp_rdev->bdev));
+	bitmap_set(tmp_bitmap, 0, cnt);
+
+	bitmap_and(bio->bi_rd_hint, bio->bi_rd_hint, tmp_bitmap,
+			BLKDEV_MAX_MIRRORS);
+}
+
 static void raid1_read_request(struct mddev *mddev, struct bio *bio,
 			       int max_read_sectors, struct r1bio *r1_bio)
 {
@@ -1199,6 +1306,7 @@ static void raid1_read_request(struct mddev *mddev, struct bio *bio,
 	int rdisk;
 	bool print_msg = !!r1_bio;
 	char b[BDEVNAME_SIZE];
+	bool auto_select_mirror;
 
 	/*
 	 * If r1_bio is set, we are blocking the raid1d thread
@@ -1230,6 +1338,8 @@ static void raid1_read_request(struct mddev *mddev, struct bio *bio,
 	else
 		init_r1bio(r1_bio, mddev, bio);
 	r1_bio->sectors = max_read_sectors;
+	auto_select_mirror = bitmap_empty(r1_bio->master_bio->bi_rd_hint, BLKDEV_MAX_MIRRORS);
+
 
 	/*
 	 * make_request() can abort the operation when read-ahead is being
@@ -1238,6 +1348,9 @@ static void raid1_read_request(struct mddev *mddev, struct bio *bio,
 	rdisk = read_balance(conf, r1_bio, &max_sectors);
 
 	if (rdisk < 0) {
+		if (auto_select_mirror)
+			bitmap_set(r1_bio->master_bio->bi_rd_hint, 0, BLKDEV_MAX_MIRRORS);
+
 		/* couldn't find anywhere to read from */
 		if (print_msg) {
 			pr_crit_ratelimited("md/raid1:%s: %s: unrecoverable I/O read error for block %llu\n",
@@ -1292,6 +1405,8 @@ static void raid1_read_request(struct mddev *mddev, struct bio *bio,
 	    test_bit(R1BIO_FailFast, &r1_bio->state))
 	        read_bio->bi_opf |= MD_FAILFAST;
 	read_bio->bi_private = r1_bio;
+	/* rd_hint of read_bio is a subset of master_bio. */
+	raid1_split_rd_hint(read_bio);
 
 	if (mddev->gendisk)
 	        trace_block_bio_remap(read_bio->bi_disk->queue, read_bio,
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC PATCH v2 5/9] Add b_alt_retry to xfs_buf
  2019-02-13  9:50 [RFC PATCH v2 0/9] Block/XFS: Support alternative mirror device retry Bob Liu
                   ` (3 preceding siblings ...)
  2019-02-13  9:50 ` [RFC PATCH v2 4/9] md:raid1: rd_hint support and consider stacked layer case Bob Liu
@ 2019-02-13  9:50 ` Bob Liu
  2019-02-13  9:50 ` [RFC PATCH v2 6/9] xfs: Add b_rd_hint " Bob Liu
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 28+ messages in thread
From: Bob Liu @ 2019-02-13  9:50 UTC (permalink / raw)
  To: linux-block
  Cc: linux-xfs, linux-fsdevel, martin.petersen, shirley.ma,
	allison.henderson, david, darrick.wong, hch, adilger

From: Allison Henderson <allison.henderson@oracle.com>

This patch adds b_alt_retry boolean to xfs_buf. We will use
this to enable alternate device retry when the bio completes
for single buffer bios.

At this time, we do not yet support alternate device retry for
multi buffer bio

Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
---
 fs/xfs/xfs_buf.c | 8 ++++++++
 fs/xfs/xfs_buf.h | 1 +
 2 files changed, 9 insertions(+)

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 4f5f2ff3f70f..e2683c8e868c 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -1409,6 +1409,14 @@ xfs_buf_ioapply_map(
 			flush_kernel_vmap_range(bp->b_addr,
 						xfs_buf_vmap_len(bp));
 		}
+
+		/*
+		 * At the moment, we only support alternate
+		 * device retry on single bio buffers
+		 */
+		if (size == 0)
+			bp->b_alt_retry = true;
+
 		submit_bio(bio);
 		if (size)
 			goto next_chunk;
diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
index b9f5511ea998..989b97a17486 100644
--- a/fs/xfs/xfs_buf.h
+++ b/fs/xfs/xfs_buf.h
@@ -198,6 +198,7 @@ typedef struct xfs_buf {
 	int			b_last_error;
 
 	const struct xfs_buf_ops	*b_ops;
+	bool			b_alt_retry;	/* toggle alt device retry */
 } xfs_buf_t;
 
 /* Finding and Reading Buffers */
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC PATCH v2 6/9] xfs: Add b_rd_hint to xfs_buf
  2019-02-13  9:50 [RFC PATCH v2 0/9] Block/XFS: Support alternative mirror device retry Bob Liu
                   ` (4 preceding siblings ...)
  2019-02-13  9:50 ` [RFC PATCH v2 5/9] Add b_alt_retry to xfs_buf Bob Liu
@ 2019-02-13  9:50 ` Bob Liu
  2019-02-13  9:50 ` [RFC PATCH v2 7/9] xfs: Add device retry Bob Liu
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 28+ messages in thread
From: Bob Liu @ 2019-02-13  9:50 UTC (permalink / raw)
  To: linux-block
  Cc: linux-xfs, linux-fsdevel, martin.petersen, shirley.ma,
	allison.henderson, david, darrick.wong, hch, adilger

From: Allison Henderson <allison.henderson@oracle.com>

This patch adds a new field b_rd_hint to xfs_buf.  We will
need this to properly initialize the new bio->bi_rw_hint when
submitting the read request.

Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
---
 fs/xfs/xfs_buf.c |  6 +++++-
 fs/xfs/xfs_buf.h | 12 ++++++++++++
 2 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index e2683c8e868c..6098195ecaf4 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -1338,8 +1338,11 @@ xfs_buf_bio_end_io(
 	if (!bp->b_error && xfs_buf_is_vmapped(bp) && (bp->b_flags & XBF_READ))
 		invalidate_kernel_vmap_range(bp->b_addr, xfs_buf_vmap_len(bp));
 
-	if (atomic_dec_and_test(&bp->b_io_remaining) == 1)
+	if (atomic_dec_and_test(&bp->b_io_remaining) == 1) {
+		if (bp->b_flags & XBF_RW_HINT)
+			bitmap_copy(bp->b_rd_hint, bio->bi_rd_hint, BLKDEV_MAX_MIRRORS);
 		xfs_buf_ioend_async(bp);
+	}
 	bio_put(bio);
 }
 
@@ -1385,6 +1388,7 @@ xfs_buf_ioapply_map(
 	bio->bi_iter.bi_sector = sector;
 	bio->bi_end_io = xfs_buf_bio_end_io;
 	bio->bi_private = bp;
+	bitmap_copy(bio->bi_rd_hint, bp->b_rd_hint, BLKDEV_MAX_MIRRORS);
 	bio_set_op_attrs(bio, op, op_flags);
 
 	for (; size && nr_pages; nr_pages--, page_index++) {
diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
index 989b97a17486..af9bdff29e66 100644
--- a/fs/xfs/xfs_buf.h
+++ b/fs/xfs/xfs_buf.h
@@ -40,6 +40,7 @@ typedef enum {
 #define XBF_SYNCIO	 (1 << 10)/* treat this buffer as synchronous I/O */
 #define XBF_FUA		 (1 << 11)/* force cache write through mode */
 #define XBF_FLUSH	 (1 << 12)/* flush the disk cache before a write */
+#define XBF_RW_HINT	 (1 << 13)/* Read/write hint used for alt dev retry */
 
 /* flags used only as arguments to access routines */
 #define XBF_TRYLOCK	 (1 << 16)/* lock requested, but do not wait */
@@ -65,6 +66,7 @@ typedef unsigned int xfs_buf_flags_t;
 	{ XBF_SYNCIO,		"SYNCIO" }, \
 	{ XBF_FUA,		"FUA" }, \
 	{ XBF_FLUSH,		"FLUSH" }, \
+	{ XBF_RW_HINT,		"RW_HINT" }, \
 	{ XBF_TRYLOCK,		"TRYLOCK" },	/* should never be set */\
 	{ XBF_UNMAPPED,		"UNMAPPED" },	/* ditto */\
 	{ _XBF_PAGES,		"PAGES" }, \
@@ -197,6 +199,16 @@ typedef struct xfs_buf {
 	unsigned long		b_first_retry_time; /* in jiffies */
 	int			b_last_error;
 
+	/*
+	 * Bitmask used by block device for alternate device retry
+	 *
+	 * To retry a read with the next device, resubmit the bio with
+	 * the bi_rd_hint returned from the last read.
+	 * Otherwise use bitmap_zero() if we don't care about alt mirror
+	 * device retry.
+         */
+	DECLARE_BITMAP(b_rd_hint, BLKDEV_MAX_MIRRORS);
+
 	const struct xfs_buf_ops	*b_ops;
 	bool			b_alt_retry;	/* toggle alt device retry */
 } xfs_buf_t;
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC PATCH v2 7/9] xfs: Add device retry
  2019-02-13  9:50 [RFC PATCH v2 0/9] Block/XFS: Support alternative mirror device retry Bob Liu
                   ` (5 preceding siblings ...)
  2019-02-13  9:50 ` [RFC PATCH v2 6/9] xfs: Add b_rd_hint " Bob Liu
@ 2019-02-13  9:50 ` Bob Liu
  2019-02-13  9:50 ` [RFC PATCH v2 8/9] xfs: Rewrite retried read Bob Liu
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 28+ messages in thread
From: Bob Liu @ 2019-02-13  9:50 UTC (permalink / raw)
  To: linux-block
  Cc: linux-xfs, linux-fsdevel, martin.petersen, shirley.ma,
	allison.henderson, david, darrick.wong, hch, adilger

From: Allison Henderson <allison.henderson@oracle.com>

Check to see if the _xfs_buf_read fails.  If so loop over the
available mirrors and retry the read

Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
---
 fs/xfs/xfs_buf.c | 26 +++++++++++++++++++++++++-
 1 file changed, 25 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 6098195ecaf4..2c250221cb78 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -21,6 +21,7 @@
 #include <linux/migrate.h>
 #include <linux/backing-dev.h>
 #include <linux/freezer.h>
+#include <linux/blkdev.h>
 
 #include "xfs_format.h"
 #include "xfs_log_format.h"
@@ -824,6 +825,9 @@ xfs_buf_read_map(
 	const struct xfs_buf_ops *ops)
 {
 	struct xfs_buf		*bp;
+	struct request_queue	*q;
+	unsigned long		i;
+	int			retries = 0;
 
 	flags |= XBF_READ;
 
@@ -836,7 +840,27 @@ xfs_buf_read_map(
 	if (!(bp->b_flags & XBF_DONE)) {
 		XFS_STATS_INC(target->bt_mount, xb_get_read);
 		bp->b_ops = ops;
-		_xfs_buf_read(bp, flags);
+		q = bdev_get_queue(bp->b_target->bt_bdev);
+
+		if (bp->b_alt_retry)
+			retries = blk_queue_get_mirrors(q);
+
+		for (i = 0; i <= retries; i++) {
+			bp->b_error = 0;
+			_xfs_buf_read(bp, flags);
+
+			switch (bp->b_error) {
+			case -EIO:
+			case -EFSCORRUPTED:
+			case -EFSBADCRC:
+				/* loop again */
+				continue;
+			default:
+				goto retry_done;
+			}
+
+		}
+retry_done:
 		return bp;
 	}
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC PATCH v2 8/9] xfs: Rewrite retried read
  2019-02-13  9:50 [RFC PATCH v2 0/9] Block/XFS: Support alternative mirror device retry Bob Liu
                   ` (6 preceding siblings ...)
  2019-02-13  9:50 ` [RFC PATCH v2 7/9] xfs: Add device retry Bob Liu
@ 2019-02-13  9:50 ` Bob Liu
  2019-02-13  9:50 ` [RFC PATCH v2 9/9] xfs: Add tracepoints and logging to alternate device retry Bob Liu
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 28+ messages in thread
From: Bob Liu @ 2019-02-13  9:50 UTC (permalink / raw)
  To: linux-block
  Cc: linux-xfs, linux-fsdevel, martin.petersen, shirley.ma,
	allison.henderson, david, darrick.wong, hch, adilger

From: Allison Henderson <allison.henderson@oracle.com>

If we had to try more than one mirror to get a successful
read, then write that buffer back to correct the bad mirror

Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
---
 fs/xfs/xfs_buf.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 2c250221cb78..e54dbc776d15 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -861,6 +861,14 @@ xfs_buf_read_map(
 
 		}
 retry_done:
+
+		/*
+		 * if we had to try more than one mirror to sucessfully read
+		 * the buffer, write the buffer back
+		 */
+		if (!bp->b_error && i > 0)
+			xfs_bwrite(bp);
+
 		return bp;
 	}
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC PATCH v2 9/9] xfs: Add tracepoints and logging to alternate device retry
  2019-02-13  9:50 [RFC PATCH v2 0/9] Block/XFS: Support alternative mirror device retry Bob Liu
                   ` (7 preceding siblings ...)
  2019-02-13  9:50 ` [RFC PATCH v2 8/9] xfs: Rewrite retried read Bob Liu
@ 2019-02-13  9:50 ` Bob Liu
  2019-02-18  8:08 ` [RFC PATCH v2 0/9] Block/XFS: Support alternative mirror " jianchao.wang
  2019-02-18 21:31 ` Dave Chinner
  10 siblings, 0 replies; 28+ messages in thread
From: Bob Liu @ 2019-02-13  9:50 UTC (permalink / raw)
  To: linux-block
  Cc: linux-xfs, linux-fsdevel, martin.petersen, shirley.ma,
	allison.henderson, david, darrick.wong, hch, adilger

From: Allison Henderson <allison.henderson@oracle.com>

This patch adds new log entries and trace points to the
alternate device retry error path

Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
---
 fs/xfs/xfs_buf.c   | 10 ++++++++++
 fs/xfs/xfs_buf.h   |  1 +
 fs/xfs/xfs_trace.h |  6 +++++-
 3 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index e54dbc776d15..1a0427137883 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -847,6 +847,11 @@ xfs_buf_read_map(
 
 		for (i = 0; i <= retries; i++) {
 			bp->b_error = 0;
+
+			if (i > 0)
+				xfs_alert(bp->b_target->bt_mount,
+					  "Retrying read from disk %lu",i);
+
 			_xfs_buf_read(bp, flags);
 
 			switch (bp->b_error) {
@@ -854,6 +859,11 @@ xfs_buf_read_map(
 			case -EFSCORRUPTED:
 			case -EFSBADCRC:
 				/* loop again */
+				trace_xfs_buf_ioretry(bp, _RET_IP_);
+				xfs_alert(bp->b_target->bt_mount,
+					  "Read error:%d from disk number %lu",
+					   bp->b_error, *bp->b_rd_hint);
+
 				continue;
 			default:
 				goto retry_done;
diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
index af9bdff29e66..69605a50c15e 100644
--- a/fs/xfs/xfs_buf.h
+++ b/fs/xfs/xfs_buf.h
@@ -306,6 +306,7 @@ extern void __xfs_buf_ioerror(struct xfs_buf *bp, int error,
 		xfs_failaddr_t failaddr);
 #define xfs_buf_ioerror(bp, err) __xfs_buf_ioerror((bp), (err), __this_address)
 extern void xfs_buf_ioerror_alert(struct xfs_buf *, const char *func);
+extern void xfs_buf_ioretry_alert(struct xfs_buf *, const char *func);
 
 extern int __xfs_buf_submit(struct xfs_buf *bp, bool);
 static inline int xfs_buf_submit(struct xfs_buf *bp)
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 6fcc893dfc91..4b948cf2dd65 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -276,6 +276,7 @@ DECLARE_EVENT_CLASS(xfs_buf_class,
 		__field(int, pincount)
 		__field(unsigned, lockval)
 		__field(unsigned, flags)
+		__field(unsigned short, rd_hint)
 		__field(unsigned long, caller_ip)
 	),
 	TP_fast_assign(
@@ -289,10 +290,11 @@ DECLARE_EVENT_CLASS(xfs_buf_class,
 		__entry->pincount = atomic_read(&bp->b_pin_count);
 		__entry->lockval = bp->b_sema.count;
 		__entry->flags = bp->b_flags;
+		__entry->rd_hint = bp->b_rd_hint;
 		__entry->caller_ip = caller_ip;
 	),
 	TP_printk("dev %d:%d bno 0x%llx nblks 0x%x hold %d pincount %d "
-		  "lock %d flags %s caller %pS",
+		  "lock %d flags %s rd_hint %hu caller %pS",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  (unsigned long long)__entry->bno,
 		  __entry->nblks,
@@ -300,6 +302,7 @@ DECLARE_EVENT_CLASS(xfs_buf_class,
 		  __entry->pincount,
 		  __entry->lockval,
 		  __print_flags(__entry->flags, "|", XFS_BUF_FLAGS),
+		  __entry->rd_hint,
 		  (void *)__entry->caller_ip)
 )
 
@@ -312,6 +315,7 @@ DEFINE_BUF_EVENT(xfs_buf_free);
 DEFINE_BUF_EVENT(xfs_buf_hold);
 DEFINE_BUF_EVENT(xfs_buf_rele);
 DEFINE_BUF_EVENT(xfs_buf_iodone);
+DEFINE_BUF_EVENT(xfs_buf_ioretry);
 DEFINE_BUF_EVENT(xfs_buf_submit);
 DEFINE_BUF_EVENT(xfs_buf_lock);
 DEFINE_BUF_EVENT(xfs_buf_lock_done);
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH v2 1/9] block: add nr_mirrors to request_queue
  2019-02-13  9:50 ` [RFC PATCH v2 1/9] block: add nr_mirrors to request_queue Bob Liu
@ 2019-02-13 10:26   ` Andreas Dilger
  2019-02-13 16:04   ` Theodore Y. Ts'o
  1 sibling, 0 replies; 28+ messages in thread
From: Andreas Dilger @ 2019-02-13 10:26 UTC (permalink / raw)
  To: Bob Liu
  Cc: linux-block, linux-xfs, linux-fsdevel, Martin Petersen,
	shirley.ma, allison.henderson, david, darrick.wong, hch

[-- Attachment #1: Type: text/plain, Size: 1637 bytes --]

On Feb 13, 2019, at 2:50 AM, Bob Liu <bob.liu@oracle.com> wrote:
> 
> When fs data/metadata checksum mismatch, lower block devices may have other
> correct copies. e.g if we did raid1 for protecting fs metadata.
> Then fs could try other copies of metadata instead of panic, but fs need be
> awared how many mirrors the block devices have.
> 
> This patch add @nr_mirrors to struct request_queue which is similar as
> blk_queue_nonrot(), filesystem can grab device request queue and check the
> number of mirrors of this block device.
> 
> @nr_mirrors is 1 by default which means only one copy, drivers e.g raid1 are
> responsible for setting the right value. The maximum value is
> BITS_PER_LONG which is 32 or 64. That should be big enough else retry lantency
> may be too high.
> 
> Also added helper functions for get/set the number of mirrors for a specific
> device request queue.
> 
> Todo:
> * Export nr_mirrors through /sysfs.
> 
> Signed-off-by: Bob Liu <bob.liu@oracle.com>

> diff --git a/block/blk-settings.c b/block/blk-settings.c
> index 3e7038e475ee..38e4d7e675e6 100644
> --- a/block/blk-settings.c
> +++ b/block/blk-settings.c
> @@ -844,6 +844,30 @@ void blk_queue_write_cache(struct request_queue *q, bool wc, bool fua)
> +/*
> + * Set the number of read redundant mirrors.
> + */
> +bool blk_queue_set_mirrors(struct request_queue *q, unsigned short mirrors)
> +{
> +	if(q->nr_mirrors >= BLKDEV_MAX_MIRRORS) {
> +		printk("blk_queue_set_mirrors: %d exceed max mirrors(%d)\n",
> +				mirrors, BLKDEV_MAX_MIRRORS);

Need to supply a KERN_ level here.

Cheers, Andreas






[-- Attachment #2: Message signed with OpenPGP --]
[-- Type: application/pgp-signature, Size: 873 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH v2 1/9] block: add nr_mirrors to request_queue
  2019-02-13  9:50 ` [RFC PATCH v2 1/9] block: add nr_mirrors to request_queue Bob Liu
  2019-02-13 10:26   ` Andreas Dilger
@ 2019-02-13 16:04   ` Theodore Y. Ts'o
  2019-02-14  5:57     ` Bob Liu
  1 sibling, 1 reply; 28+ messages in thread
From: Theodore Y. Ts'o @ 2019-02-13 16:04 UTC (permalink / raw)
  To: Bob Liu
  Cc: linux-block, linux-xfs, linux-fsdevel, martin.petersen,
	shirley.ma, allison.henderson, david, darrick.wong, hch, adilger

On Wed, Feb 13, 2019 at 05:50:36PM +0800, Bob Liu wrote:
> @nr_mirrors is 1 by default which means only one copy, drivers e.g raid1 are
> responsible for setting the right value. The maximum value is
> BITS_PER_LONG which is 32 or 64. That should be big enough else retry lantency
> may be too high.

This is admittedly bike-shedding, so feel free to ignore, but...

In the case of Raid 6, "mirrors" will be a bit of a misnomer.  Would
"nr_recovery" be better?

Thanks for working on this!!  I would be interested in using this for
ext4 once it's available.

				- Ted

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH v2 2/9] block: add rd_hint to bio and request
  2019-02-13  9:50 ` [RFC PATCH v2 2/9] block: add rd_hint to bio and request Bob Liu
@ 2019-02-13 16:18   ` Jens Axboe
  2019-02-14  6:10     ` Bob Liu
  0 siblings, 1 reply; 28+ messages in thread
From: Jens Axboe @ 2019-02-13 16:18 UTC (permalink / raw)
  To: Bob Liu, linux-block
  Cc: linux-xfs, linux-fsdevel, martin.petersen, shirley.ma,
	allison.henderson, david, darrick.wong, hch, adilger

On 2/13/19 2:50 AM, Bob Liu wrote:
> rd_hint is a bitmap for stacked layer support(see patch 4/9),
> set a bit to 1 means already read from the corresponding mirror device.
> 
> rd_hint will be set properly recording read i/o went to which real device
> during end_bio().
> If the upper layer want to retry other mirrors, just preserve the returned
> bi_rd_hint and resubmit bio.
> 
> The upper layer e.g fs can set bitmap_zero(rd_hint) if don't care about alt
> mirror device retry feature which is also the default setting.

You just made the bio 16 bytes bigger on my build, which is an increase
of 12.5% and spills it into a third cacheline. That's not going to work
at all. At least look at where you are placing this thing. That goes
for the request as well, you can just toss members in there at random.

Also, why is BLKDEV_MAX_MIRRORS in types.h? That makes very little sense.

Look into options of carrying this elsewhere, or (at the very least)
making it dependent on whoever needs it. This is NOT a negligible
amount of wasted space.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH v2 1/9] block: add nr_mirrors to request_queue
  2019-02-13 16:04   ` Theodore Y. Ts'o
@ 2019-02-14  5:57     ` Bob Liu
  2019-02-18 17:56       ` Theodore Y. Ts'o
  0 siblings, 1 reply; 28+ messages in thread
From: Bob Liu @ 2019-02-14  5:57 UTC (permalink / raw)
  To: Theodore Y. Ts'o
  Cc: linux-block, linux-xfs, linux-fsdevel, martin.petersen,
	shirley.ma, allison.henderson, david, darrick.wong, hch, adilger

On 2/14/19 12:04 AM, Theodore Y. Ts'o wrote:
> On Wed, Feb 13, 2019 at 05:50:36PM +0800, Bob Liu wrote:
>> @nr_mirrors is 1 by default which means only one copy, drivers e.g raid1 are
>> responsible for setting the right value. The maximum value is
>> BITS_PER_LONG which is 32 or 64. That should be big enough else retry lantency
>> may be too high.
> 
> This is admittedly bike-shedding, so feel free to ignore, but...
> 
> In the case of Raid 6, "mirrors" will be a bit of a misnomer.  Would
> "nr_recovery" be better?
> 

Now the initial/default value is 1 indicating only one copy of data.
Would nr_copy be more accurate?

> Thanks for working on this!!  I would be interested in using this for
> ext4 once it's available.
> 
> 				- Ted
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH v2 2/9] block: add rd_hint to bio and request
  2019-02-13 16:18   ` Jens Axboe
@ 2019-02-14  6:10     ` Bob Liu
  0 siblings, 0 replies; 28+ messages in thread
From: Bob Liu @ 2019-02-14  6:10 UTC (permalink / raw)
  To: Jens Axboe, linux-block
  Cc: linux-xfs, linux-fsdevel, martin.petersen, shirley.ma,
	allison.henderson, david, darrick.wong, hch, adilger

On 2/14/19 12:18 AM, Jens Axboe wrote:
> On 2/13/19 2:50 AM, Bob Liu wrote:
>> rd_hint is a bitmap for stacked layer support(see patch 4/9),
>> set a bit to 1 means already read from the corresponding mirror device.
>>
>> rd_hint will be set properly recording read i/o went to which real device
>> during end_bio().
>> If the upper layer want to retry other mirrors, just preserve the returned
>> bi_rd_hint and resubmit bio.
>>
>> The upper layer e.g fs can set bitmap_zero(rd_hint) if don't care about alt
>> mirror device retry feature which is also the default setting.
> 
> You just made the bio 16 bytes bigger on my build, which is an increase
> of 12.5% and spills it into a third cacheline. That's not going to work
> at all. At least look at where you are placing this thing. That goes
> for the request as well, you can just toss members in there at random.
> 

Are you fine with an union like?
-       unsigned short          bi_write_hint;
-       DECLARE_BITMAP(bi_rd_hint, BLKDEV_MAX_MIRRORS);
+       union {
+           unsigned short              bi_write_hint;
+           unsigned long               bi_rd_hint;
+       };

But rd_hint need to be "unsigned long" which would still make bio/request bigger.

For sure can add KCONFIG option around if necessary.

> Also, why is BLKDEV_MAX_MIRRORS in types.h? That makes very little sense.
> 

Indeed, so I plan to switch back "unsigned long bi_rd_hint".
But bi_rd_hint is still a bitmap(for stacked layer support) which means this feature can not
work if more than BITS_PER_LONG copies.

> Look into options of carrying this elsewhere, or (at the very least)
> making it dependent on whoever needs it. This is NOT a negligible
> amount of wasted space.
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH v2 0/9] Block/XFS: Support alternative mirror device retry
  2019-02-13  9:50 [RFC PATCH v2 0/9] Block/XFS: Support alternative mirror device retry Bob Liu
                   ` (8 preceding siblings ...)
  2019-02-13  9:50 ` [RFC PATCH v2 9/9] xfs: Add tracepoints and logging to alternate device retry Bob Liu
@ 2019-02-18  8:08 ` jianchao.wang
  2019-02-19  1:29   ` jianchao.wang
  2019-02-18 21:31 ` Dave Chinner
  10 siblings, 1 reply; 28+ messages in thread
From: jianchao.wang @ 2019-02-18  8:08 UTC (permalink / raw)
  To: Bob Liu, linux-block
  Cc: linux-xfs, linux-fsdevel, martin.petersen, shirley.ma,
	allison.henderson, david, darrick.wong, hch, adilger

Hi Bob

On 2/13/19 5:50 PM, Bob Liu wrote:
> Motivation:
> When fs data/metadata checksum mismatch, lower block devices may have other
> correct copies. e.g. If XFS successfully reads a metadata buffer off a raid1 but
> decides that the metadata is garbage, today it will shut down the entire
> filesystem without trying any of the other mirrors.  This is a severe
> loss of service, and we propose these patches to have XFS try harder to
> avoid failure.
> 
> This patch prototype this mirror retry idea by:
> * Adding @nr_mirrors to struct request_queue which is similar as
>   blk_queue_nonrot(), filesystem can grab device request queue and check max
>   mirrors this block device has.
>   Helper functions were also added to get/set the nr_mirrors.
> 
> * Introducing bi_rd_hint just like bi_write_hint, but bi_rd_hint is a long bitmap
> in order to support stacked layer case.

Why does we need a bitmap to know which underlying device has been tried ?
For example, the following scenario,

                    md8
                   / | \
               sda sdb sdc

If the the raid read the data from sda and fs check and find the data is corrupted.
Then we may just need to let raid1 know that the data is from sda. Then based on this
hint, raid1 could handle it with handle_read_error to try other replica and fix the
error.

If this is feasible, we just need to modify the bio as following and needn't add any
bytes in it.

struct bio {
    ...
    union {
        unsigned short bi_write_hint;
        unsigned short bi_read_hint;
    }
    ...
}

Thanks
Jianchao
> 
> * Modify md/raid1 to support this retry feature.
> 
> * Adapter xfs to use this feature.
>   If the read verify fails, we loop over the available mirrors and retry the read.
> 
> * Rewrite retried read
>   When the read verification fails, but the retry succeedes
>   write the buffer back to correct the bad mirror
> 
> * Add tracepoints and logging to alternate device retry.
>   This patch adds new log entries and trace points to the alternate device retry
>   error path.
> 
> Changes v2:
> - No more reuse bi_write_hint
> - Stacked layer support(see patch 4/9)
> - Other feedback fix
> 
> Allison Henderson (5):
>   Add b_alt_retry to xfs_buf
>   xfs: Add b_rd_hint to xfs_buf
>   xfs: Add device retry
>   xfs: Rewrite retried read
>   xfs: Add tracepoints and logging to alternate device retry
> 
> Bob Liu (4):
>   block: add nr_mirrors to request_queue
>   block: add rd_hint to bio and request
>   md:raid1: set mirrors correctly
>   md:raid1: rd_hint support and consider stacked layer case
> 
>  Documentation/block/biodoc.txt |   3 +
>  block/bio.c                    |   1 +
>  block/blk-core.c               |   4 ++
>  block/blk-merge.c              |   6 ++
>  block/blk-settings.c           |  24 +++++++
>  block/bounce.c                 |   1 +
>  drivers/md/raid1.c             | 123 ++++++++++++++++++++++++++++++++-
>  fs/xfs/xfs_buf.c               |  58 +++++++++++++++-
>  fs/xfs/xfs_buf.h               |  14 ++++
>  fs/xfs/xfs_trace.h             |   6 +-
>  include/linux/blk_types.h      |   1 +
>  include/linux/blkdev.h         |   4 ++
>  include/linux/types.h          |   3 +
>  13 files changed, 244 insertions(+), 4 deletions(-)
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH v2 1/9] block: add nr_mirrors to request_queue
  2019-02-14  5:57     ` Bob Liu
@ 2019-02-18 17:56       ` Theodore Y. Ts'o
  0 siblings, 0 replies; 28+ messages in thread
From: Theodore Y. Ts'o @ 2019-02-18 17:56 UTC (permalink / raw)
  To: Bob Liu
  Cc: linux-block, linux-xfs, linux-fsdevel, martin.petersen,
	shirley.ma, allison.henderson, david, darrick.wong, hch, adilger

On Thu, Feb 14, 2019 at 01:57:20PM +0800, Bob Liu wrote:
> 
> Now the initial/default value is 1 indicating only one copy of data.
> Would nr_copy be more accurate?
>

Well, it's at least shorter; the problem is that it's not really
another "copy" of the data, it's just that it can simply be different
(multiple) ways of reconstructing the data.  I suppose we could say
that it's a virtual copy.

In any case, I can't think of a better term, so nr_copy is probably as
good as any.

Cheers,

					- Ted

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH v2 0/9] Block/XFS: Support alternative mirror device retry
  2019-02-13  9:50 [RFC PATCH v2 0/9] Block/XFS: Support alternative mirror device retry Bob Liu
                   ` (9 preceding siblings ...)
  2019-02-18  8:08 ` [RFC PATCH v2 0/9] Block/XFS: Support alternative mirror " jianchao.wang
@ 2019-02-18 21:31 ` Dave Chinner
  2019-02-19  2:55   ` Darrick J. Wong
  2019-02-28 14:22   ` Bob Liu
  10 siblings, 2 replies; 28+ messages in thread
From: Dave Chinner @ 2019-02-18 21:31 UTC (permalink / raw)
  To: Bob Liu
  Cc: linux-block, linux-xfs, linux-fsdevel, martin.petersen,
	shirley.ma, allison.henderson, darrick.wong, hch, adilger

On Wed, Feb 13, 2019 at 05:50:35PM +0800, Bob Liu wrote:
> Motivation:
> When fs data/metadata checksum mismatch, lower block devices may have other
> correct copies. e.g. If XFS successfully reads a metadata buffer off a raid1 but
> decides that the metadata is garbage, today it will shut down the entire
> filesystem without trying any of the other mirrors.  This is a severe
> loss of service, and we propose these patches to have XFS try harder to
> avoid failure.
> 
> This patch prototype this mirror retry idea by:
> * Adding @nr_mirrors to struct request_queue which is similar as
>   blk_queue_nonrot(), filesystem can grab device request queue and check max
>   mirrors this block device has.
>   Helper functions were also added to get/set the nr_mirrors.
> 
> * Introducing bi_rd_hint just like bi_write_hint, but bi_rd_hint is a long bitmap
> in order to support stacked layer case.
> 
> * Modify md/raid1 to support this retry feature.
> 
> * Adapter xfs to use this feature.
>   If the read verify fails, we loop over the available mirrors and retry the read.

Why does the filesystem have to iterate every single posible
combination of devices that are underneath it?

Wouldn't it be much simpler to be able to attach a verifier
function to the bio, and have each layer that gets called iterate
over all it's copies internally until the verfier function passes
or all copies are exhausted?

This works for stacked mirrors - it can pass the higher layer
verifier down as far as necessary. It can work for RAID5/6, too, by
having that layer supply it's own verifier for reads that verifies
parity and can reconstruct of failure, then when it's reconstructed
a valid stripe it can run the verifier that was supplied to it from
above, etc.

i.e. I dont see why only filesystems should drive retries or have to
be aware of the underlying storage stacking. ISTM that each
layer of the storage stack should be able to verify what has been
returned to it is valid independently of the higher layer
requirements. The only difference from a caller point of view should
be submit_bio(bio); vs submit_bio_verify(bio, verifier_cb_func);

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH v2 0/9] Block/XFS: Support alternative mirror device retry
  2019-02-18  8:08 ` [RFC PATCH v2 0/9] Block/XFS: Support alternative mirror " jianchao.wang
@ 2019-02-19  1:29   ` jianchao.wang
  0 siblings, 0 replies; 28+ messages in thread
From: jianchao.wang @ 2019-02-19  1:29 UTC (permalink / raw)
  To: Bob Liu, linux-block
  Cc: linux-xfs, linux-fsdevel, martin.petersen, shirley.ma,
	allison.henderson, david, darrick.wong, hch, adilger



On 2/18/19 4:08 PM, jianchao.wang wrote:
> Hi Bob
> 
> On 2/13/19 5:50 PM, Bob Liu wrote:
>> Motivation:
>> When fs data/metadata checksum mismatch, lower block devices may have other
>> correct copies. e.g. If XFS successfully reads a metadata buffer off a raid1 but
>> decides that the metadata is garbage, today it will shut down the entire
>> filesystem without trying any of the other mirrors.  This is a severe
>> loss of service, and we propose these patches to have XFS try harder to
>> avoid failure.
>>
>> This patch prototype this mirror retry idea by:
>> * Adding @nr_mirrors to struct request_queue which is similar as
>>   blk_queue_nonrot(), filesystem can grab device request queue and check max
>>   mirrors this block device has.
>>   Helper functions were also added to get/set the nr_mirrors.
>>
>> * Introducing bi_rd_hint just like bi_write_hint, but bi_rd_hint is a long bitmap
>> in order to support stacked layer case.
> 
> Why does we need a bitmap to know which underlying device has been tried ?
> For example, the following scenario,
> 
>                     md8
>                    / | \
>                sda sdb sdc
> 
> If the the raid read the data from sda and fs check and find the data is corrupted.
> Then we may just need to let raid1 know that the data is from sda. Then based on this
> hint, raid1 could handle it with handle_read_error to try other replica and fix the
> error.

This doesn't work.
The md raid1 can only see IO success or failure, so fix_read_error won't fix this.
Sorry for the noise.

Thanks
Jianchao

> 
> If this is feasible, we just need to modify the bio as following and needn't add any
> bytes in it.
> 
> struct bio {
>     ...
>     union {
>         unsigned short bi_write_hint;
>         unsigned short bi_read_hint;
>     }
>     ...
> }
> 
> Thanks
> Jianchao
>>
>> * Modify md/raid1 to support this retry feature.
>>
>> * Adapter xfs to use this feature.
>>   If the read verify fails, we loop over the available mirrors and retry the read.
>>
>> * Rewrite retried read
>>   When the read verification fails, but the retry succeedes
>>   write the buffer back to correct the bad mirror
>>
>> * Add tracepoints and logging to alternate device retry.
>>   This patch adds new log entries and trace points to the alternate device retry
>>   error path.
>>
>> Changes v2:
>> - No more reuse bi_write_hint
>> - Stacked layer support(see patch 4/9)
>> - Other feedback fix
>>
>> Allison Henderson (5):
>>   Add b_alt_retry to xfs_buf
>>   xfs: Add b_rd_hint to xfs_buf
>>   xfs: Add device retry
>>   xfs: Rewrite retried read
>>   xfs: Add tracepoints and logging to alternate device retry
>>
>> Bob Liu (4):
>>   block: add nr_mirrors to request_queue
>>   block: add rd_hint to bio and request
>>   md:raid1: set mirrors correctly
>>   md:raid1: rd_hint support and consider stacked layer case
>>
>>  Documentation/block/biodoc.txt |   3 +
>>  block/bio.c                    |   1 +
>>  block/blk-core.c               |   4 ++
>>  block/blk-merge.c              |   6 ++
>>  block/blk-settings.c           |  24 +++++++
>>  block/bounce.c                 |   1 +
>>  drivers/md/raid1.c             | 123 ++++++++++++++++++++++++++++++++-
>>  fs/xfs/xfs_buf.c               |  58 +++++++++++++++-
>>  fs/xfs/xfs_buf.h               |  14 ++++
>>  fs/xfs/xfs_trace.h             |   6 +-
>>  include/linux/blk_types.h      |   1 +
>>  include/linux/blkdev.h         |   4 ++
>>  include/linux/types.h          |   3 +
>>  13 files changed, 244 insertions(+), 4 deletions(-)
>>
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH v2 0/9] Block/XFS: Support alternative mirror device retry
  2019-02-18 21:31 ` Dave Chinner
@ 2019-02-19  2:55   ` Darrick J. Wong
  2019-02-19  3:33     ` Dave Chinner
  2019-02-28 14:22   ` Bob Liu
  1 sibling, 1 reply; 28+ messages in thread
From: Darrick J. Wong @ 2019-02-19  2:55 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Bob Liu, linux-block, linux-xfs, linux-fsdevel, martin.petersen,
	shirley.ma, allison.henderson, hch, adilger

On Tue, Feb 19, 2019 at 08:31:50AM +1100, Dave Chinner wrote:
> On Wed, Feb 13, 2019 at 05:50:35PM +0800, Bob Liu wrote:
> > Motivation:
> > When fs data/metadata checksum mismatch, lower block devices may have other
> > correct copies. e.g. If XFS successfully reads a metadata buffer off a raid1 but
> > decides that the metadata is garbage, today it will shut down the entire
> > filesystem without trying any of the other mirrors.  This is a severe
> > loss of service, and we propose these patches to have XFS try harder to
> > avoid failure.
> > 
> > This patch prototype this mirror retry idea by:
> > * Adding @nr_mirrors to struct request_queue which is similar as
> >   blk_queue_nonrot(), filesystem can grab device request queue and check max
> >   mirrors this block device has.
> >   Helper functions were also added to get/set the nr_mirrors.
> > 
> > * Introducing bi_rd_hint just like bi_write_hint, but bi_rd_hint is a long bitmap
> > in order to support stacked layer case.
> > 
> > * Modify md/raid1 to support this retry feature.
> > 
> > * Adapter xfs to use this feature.
> >   If the read verify fails, we loop over the available mirrors and retry the read.
> 
> Why does the filesystem have to iterate every single posible
> combination of devices that are underneath it?
> 
> Wouldn't it be much simpler to be able to attach a verifier
> function to the bio, and have each layer that gets called iterate
> over all it's copies internally until the verfier function passes
> or all copies are exhausted?
> 
> This works for stacked mirrors - it can pass the higher layer
> verifier down as far as necessary. It can work for RAID5/6, too, by
> having that layer supply it's own verifier for reads that verifies
> parity and can reconstruct of failure, then when it's reconstructed
> a valid stripe it can run the verifier that was supplied to it from
> above, etc.
> 
> i.e. I dont see why only filesystems should drive retries or have to
> be aware of the underlying storage stacking. ISTM that each
> layer of the storage stack should be able to verify what has been
> returned to it is valid independently of the higher layer
> requirements. The only difference from a caller point of view should
> be submit_bio(bio); vs submit_bio_verify(bio, verifier_cb_func);

What if instead of constructing a giant pile of verifier call chain, we
simply had a return value from ->bi_end_io that would then be returned
from bio_endio()?  Stacked things like dm-linear would have to know how
to connect the upper endio to the lower endio though.  And that could
have its downsides, too.  How long do we tie up resources in the scsi
layer while upper levels are busy running verification functions...?

Hmmmmmmmmm....

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH v2 0/9] Block/XFS: Support alternative mirror device retry
  2019-02-19  2:55   ` Darrick J. Wong
@ 2019-02-19  3:33     ` Dave Chinner
  0 siblings, 0 replies; 28+ messages in thread
From: Dave Chinner @ 2019-02-19  3:33 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Bob Liu, linux-block, linux-xfs, linux-fsdevel, martin.petersen,
	shirley.ma, allison.henderson, hch, adilger

On Mon, Feb 18, 2019 at 06:55:20PM -0800, Darrick J. Wong wrote:
> On Tue, Feb 19, 2019 at 08:31:50AM +1100, Dave Chinner wrote:
> > On Wed, Feb 13, 2019 at 05:50:35PM +0800, Bob Liu wrote:
> > > Motivation:
> > > When fs data/metadata checksum mismatch, lower block devices may have other
> > > correct copies. e.g. If XFS successfully reads a metadata buffer off a raid1 but
> > > decides that the metadata is garbage, today it will shut down the entire
> > > filesystem without trying any of the other mirrors.  This is a severe
> > > loss of service, and we propose these patches to have XFS try harder to
> > > avoid failure.
> > > 
> > > This patch prototype this mirror retry idea by:
> > > * Adding @nr_mirrors to struct request_queue which is similar as
> > >   blk_queue_nonrot(), filesystem can grab device request queue and check max
> > >   mirrors this block device has.
> > >   Helper functions were also added to get/set the nr_mirrors.
> > > 
> > > * Introducing bi_rd_hint just like bi_write_hint, but bi_rd_hint is a long bitmap
> > > in order to support stacked layer case.
> > > 
> > > * Modify md/raid1 to support this retry feature.
> > > 
> > > * Adapter xfs to use this feature.
> > >   If the read verify fails, we loop over the available mirrors and retry the read.
> > 
> > Why does the filesystem have to iterate every single posible
> > combination of devices that are underneath it?
> > 
> > Wouldn't it be much simpler to be able to attach a verifier
> > function to the bio, and have each layer that gets called iterate
> > over all it's copies internally until the verfier function passes
> > or all copies are exhausted?
> > 
> > This works for stacked mirrors - it can pass the higher layer
> > verifier down as far as necessary. It can work for RAID5/6, too, by
> > having that layer supply it's own verifier for reads that verifies
> > parity and can reconstruct of failure, then when it's reconstructed
> > a valid stripe it can run the verifier that was supplied to it from
> > above, etc.
> > 
> > i.e. I dont see why only filesystems should drive retries or have to
> > be aware of the underlying storage stacking. ISTM that each
> > layer of the storage stack should be able to verify what has been
> > returned to it is valid independently of the higher layer
> > requirements. The only difference from a caller point of view should
> > be submit_bio(bio); vs submit_bio_verify(bio, verifier_cb_func);
> 
> What if instead of constructing a giant pile of verifier call chain, we
> simply had a return value from ->bi_end_io that would then be returned
> from bio_endio()?

Conceptually it acheives the same thing - getting the high level
verifier status down to the lower layer to say "this copy is bad,
try again", but I suspect all the bio chaining and cloning done in
the stack makes this much more difficult than it seems.

> Stacked things like dm-linear would have to know how to connect
> the upper endio to the lower endio though.  And that could have
> its downsides, too. 

Stacking always makes things hard :/

> How long do we tie up resources in the scsi
> layer while upper levels are busy running verification functions...?

I suspect there's a more important issue to worry about: we run the
XFS read verifiers in an async work queue context after collecting
the IO completion status from the bio, rather than running directly
in bio->bi_end_io() call chain.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH v2 0/9] Block/XFS: Support alternative mirror device retry
  2019-02-18 21:31 ` Dave Chinner
  2019-02-19  2:55   ` Darrick J. Wong
@ 2019-02-28 14:22   ` Bob Liu
  2019-02-28 21:49     ` Dave Chinner
  2019-02-28 23:28     ` Andreas Dilger
  1 sibling, 2 replies; 28+ messages in thread
From: Bob Liu @ 2019-02-28 14:22 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-block, linux-xfs, linux-fsdevel, martin.petersen,
	shirley.ma, allison.henderson, darrick.wong, hch, adilger

On 2/19/19 5:31 AM, Dave Chinner wrote:
> On Wed, Feb 13, 2019 at 05:50:35PM +0800, Bob Liu wrote:
>> Motivation:
>> When fs data/metadata checksum mismatch, lower block devices may have other
>> correct copies. e.g. If XFS successfully reads a metadata buffer off a raid1 but
>> decides that the metadata is garbage, today it will shut down the entire
>> filesystem without trying any of the other mirrors.  This is a severe
>> loss of service, and we propose these patches to have XFS try harder to
>> avoid failure.
>>
>> This patch prototype this mirror retry idea by:
>> * Adding @nr_mirrors to struct request_queue which is similar as
>>   blk_queue_nonrot(), filesystem can grab device request queue and check max
>>   mirrors this block device has.
>>   Helper functions were also added to get/set the nr_mirrors.
>>
>> * Introducing bi_rd_hint just like bi_write_hint, but bi_rd_hint is a long bitmap
>> in order to support stacked layer case.
>>
>> * Modify md/raid1 to support this retry feature.
>>
>> * Adapter xfs to use this feature.
>>   If the read verify fails, we loop over the available mirrors and retry the read.
> 
> Why does the filesystem have to iterate every single posible
> combination of devices that are underneath it?
> 
> Wouldn't it be much simpler to be able to attach a verifier
> function to the bio, and have each layer that gets called iterate
> over all it's copies internally until the verfier function passes
> or all copies are exhausted?
> 
> This works for stacked mirrors - it can pass the higher layer
> verifier down as far as necessary. It can work for RAID5/6, too, by
> having that layer supply it's own verifier for reads that verifies
> parity and can reconstruct of failure, then when it's reconstructed
> a valid stripe it can run the verifier that was supplied to it from
> above, etc.
> 
> i.e. I dont see why only filesystems should drive retries or have to
> be aware of the underlying storage stacking. ISTM that each
> layer of the storage stack should be able to verify what has been
> returned to it is valid independently of the higher layer
> requirements. The only difference from a caller point of view should
> be submit_bio(bio); vs submit_bio_verify(bio, verifier_cb_func);
> 

We already have bio->bi_end_io(), how about do the verification inside bi_end_io()?

Then the whole sequence would like:
bio_endio()
    > 1.bio->bi_end_io()
        > xfs_buf_bio_end_io()
            > verify, set bio->bi_status = "please retry" if verify fail
        
    > 2.if found bio->bi_status = retry
    > 3.resubmit bio

Is it fine to resubmit a bio inside bio_endio()?

- Thanks, Bob.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH v2 0/9] Block/XFS: Support alternative mirror device retry
  2019-02-28 14:22   ` Bob Liu
@ 2019-02-28 21:49     ` Dave Chinner
  2019-03-03  2:37       ` Bob Liu
  2019-02-28 23:28     ` Andreas Dilger
  1 sibling, 1 reply; 28+ messages in thread
From: Dave Chinner @ 2019-02-28 21:49 UTC (permalink / raw)
  To: Bob Liu
  Cc: linux-block, linux-xfs, linux-fsdevel, martin.petersen,
	shirley.ma, allison.henderson, darrick.wong, hch, adilger

On Thu, Feb 28, 2019 at 10:22:02PM +0800, Bob Liu wrote:
> On 2/19/19 5:31 AM, Dave Chinner wrote:
> > On Wed, Feb 13, 2019 at 05:50:35PM +0800, Bob Liu wrote:
> >> Motivation:
> >> When fs data/metadata checksum mismatch, lower block devices may have other
> >> correct copies. e.g. If XFS successfully reads a metadata buffer off a raid1 but
> >> decides that the metadata is garbage, today it will shut down the entire
> >> filesystem without trying any of the other mirrors.  This is a severe
> >> loss of service, and we propose these patches to have XFS try harder to
> >> avoid failure.
> >>
> >> This patch prototype this mirror retry idea by:
> >> * Adding @nr_mirrors to struct request_queue which is similar as
> >>   blk_queue_nonrot(), filesystem can grab device request queue and check max
> >>   mirrors this block device has.
> >>   Helper functions were also added to get/set the nr_mirrors.
> >>
> >> * Introducing bi_rd_hint just like bi_write_hint, but bi_rd_hint is a long bitmap
> >> in order to support stacked layer case.
> >>
> >> * Modify md/raid1 to support this retry feature.
> >>
> >> * Adapter xfs to use this feature.
> >>   If the read verify fails, we loop over the available mirrors and retry the read.
> > 
> > Why does the filesystem have to iterate every single posible
> > combination of devices that are underneath it?
> > 
> > Wouldn't it be much simpler to be able to attach a verifier
> > function to the bio, and have each layer that gets called iterate
> > over all it's copies internally until the verfier function passes
> > or all copies are exhausted?
> > 
> > This works for stacked mirrors - it can pass the higher layer
> > verifier down as far as necessary. It can work for RAID5/6, too, by
> > having that layer supply it's own verifier for reads that verifies
> > parity and can reconstruct of failure, then when it's reconstructed
> > a valid stripe it can run the verifier that was supplied to it from
> > above, etc.
> > 
> > i.e. I dont see why only filesystems should drive retries or have to
> > be aware of the underlying storage stacking. ISTM that each
> > layer of the storage stack should be able to verify what has been
> > returned to it is valid independently of the higher layer
> > requirements. The only difference from a caller point of view should
> > be submit_bio(bio); vs submit_bio_verify(bio, verifier_cb_func);
> > 
> 
> We already have bio->bi_end_io(), how about do the verification inside bi_end_io()?
> 
> Then the whole sequence would like:
> bio_endio()
>     > 1.bio->bi_end_io()
>         > xfs_buf_bio_end_io()
>             > verify, set bio->bi_status = "please retry" if verify fail
>         
>     > 2.if found bio->bi_status = retry
>     > 3.resubmit bio

As I mentioned to Darrick, this isn't cwas simple as it seems
because what XFS actually does is this:

IO completion thread			Workqueue Thread
bio_endio(bio)
  bio->bi_end_io(bio)
    xfs_buf_bio_end_io(bio)
      bp->b_error = bio->bi_status
      xfs_buf_ioend_async(bp)
        queue_work(bp->b_ioend_wq, bp)
      bio_put(bio)
<io completion done>
					.....
					xfs_buf_ioend(bp)
					  bp->b_ops->read_verify()
					.....

IOWs, XFS does not do read verification inside the bio completion
context, but instead defers it to an external workqueue so it does
not delay processing incoming bio IO completions. Hence there is no
way to get the verification status back to the bio completion (the
bio has already been freed!) to resubmit from there.

This is one of the reasons I suggested a verifier be added to the
submission, so the bio itself is wholly responsible for running it,
not an external, filesystem level completion function that may
operate outside of bio scope....

> Is it fine to resubmit a bio inside bio_endio()?

Depends on the context the bio_endio() completion is running in.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH v2 0/9] Block/XFS: Support alternative mirror device retry
  2019-02-28 14:22   ` Bob Liu
  2019-02-28 21:49     ` Dave Chinner
@ 2019-02-28 23:28     ` Andreas Dilger
  2019-03-01 14:14       ` Bob Liu
  2019-03-03 23:45       ` Dave Chinner
  1 sibling, 2 replies; 28+ messages in thread
From: Andreas Dilger @ 2019-02-28 23:28 UTC (permalink / raw)
  To: Bob Liu
  Cc: Dave Chinner, linux-block, linux-xfs, linux-fsdevel,
	Martin Petersen, shirley.ma, Allison Henderson, darrick.wong,
	hch

[-- Attachment #1: Type: text/plain, Size: 6154 bytes --]

On Feb 28, 2019, at 7:22 AM, Bob Liu <bob.liu@oracle.com> wrote:
> 
> On 2/19/19 5:31 AM, Dave Chinner wrote:
>> On Wed, Feb 13, 2019 at 05:50:35PM +0800, Bob Liu wrote:
>>> Motivation:
>>> When fs data/metadata checksum mismatch, lower block devices may have other
>>> correct copies. e.g. If XFS successfully reads a metadata buffer off a raid1 but
>>> decides that the metadata is garbage, today it will shut down the entire
>>> filesystem without trying any of the other mirrors.  This is a severe
>>> loss of service, and we propose these patches to have XFS try harder to
>>> avoid failure.
>>> 
>>> This patch prototype this mirror retry idea by:
>>> * Adding @nr_mirrors to struct request_queue which is similar as
>>>  blk_queue_nonrot(), filesystem can grab device request queue and check max
>>>  mirrors this block device has.
>>>  Helper functions were also added to get/set the nr_mirrors.
>>> 
>>> * Introducing bi_rd_hint just like bi_write_hint, but bi_rd_hint is a long bitmap
>>> in order to support stacked layer case.
>>> 
>>> * Modify md/raid1 to support this retry feature.
>>> 
>>> * Adapter xfs to use this feature.
>>>  If the read verify fails, we loop over the available mirrors and retry the read.
>> 
>> Why does the filesystem have to iterate every single posible
>> combination of devices that are underneath it?

Even if the filesystem isn't doing this iteration, there needs to be
some way to track which devices or combinations of devices have been
tried for the bio, which likely still means something inside the bio.

>> Wouldn't it be much simpler to be able to attach a verifier
>> function to the bio, and have each layer that gets called iterate
>> over all it's copies internally until the verfier function passes
>> or all copies are exhausted?
>> 
>> This works for stacked mirrors - it can pass the higher layer
>> verifier down as far as necessary. It can work for RAID5/6, too, by
>> having that layer supply it's own verifier for reads that verifies
>> parity and can reconstruct of failure, then when it's reconstructed
>> a valid stripe it can run the verifier that was supplied to it from
>> above, etc.
>> 
>> i.e. I dont see why only filesystems should drive retries or have to
>> be aware of the underlying storage stacking. ISTM that each
>> layer of the storage stack should be able to verify what has been
>> returned to it is valid independently of the higher layer
>> requirements. The only difference from a caller point of view should
>> be submit_bio(bio); vs submit_bio_verify(bio, verifier_cb_func);

I don't think the filesystem should be aware of the stacking (nor are
they in the proposed implementation).  That said, the filesystem-level
checksums should, IMHO, be checked at the filesystem level, and this
proposal allows the filesystem to tell the lower layer "this read was
bad, try something else".

One option, instead of having a bitmap, with one bit for every possible
device/combination in the system, would be to have a counter instead.
This is much denser, and even the existing "__u16 bio_write_hint" field
would be enough for 2^16 different devices/combinations of devices to
be tried.  The main difference would be that the retry layers in the
device layer would need to have a deterministic iterator for the bio.

For stacked devices it would need to use the same API to determine how
many possible combinations are below it, and do a modulus to pass down
the per-device iteration number.  The easiest would be to iterate in
numeric order, but it would also be possible to use something like a
PRNG seeded by e.g. the block number to change the order on a per-bio
basis to even out the load, if that is desirable.

> For a two layer stacked md case like:
>                              /dev/md0
>             /                  |                  \
>      /dev/md1-a             /dev/md1-b          /dev/sdf
>   /        \           /       |        \
> /dev/sda /dev/sdb  /dev/sdc /dev/sdd  /dev/sde

In this case, the top-level md0 would call blk_queue_get_copies() on each
sub-devices to determine how many sub-devices/combinations they have,
and pick the maximum (3 in this case), multiplied by the number of
top-level devices (also 3 in this case).  That means the top-level device
would return blk_queue_get_copies() == 9 combinations, but the same
could be done recursively for more/non-uniform layers if needed.

The top-level device maps md1-a = [0-2], md1-b = [3-5], md1-c = [6-8],
and can easily map an incoming bio_read_hint to the next device, either
by simple increment or by predetermining a device ordering and following
that (e.g. 0, 3, 6, 1, 4, 7, 2, 5, 8), or any other deterministic order
that hits all of the devices exactly once).  During submission bio_read_hint
is set to the modulus of the value (so that each layer in the stack sees
only values in the range [0, copies), and when the bio completes the top-level
device will set bio_read_hint to be the next sub-device to try (like the
original proposal was splitting and combining the bitmaps).  If a sub-device
gets a bad index (e.g. md1-a sees bio_read_hint == 2, or sdf sees anything
other than 0) it is a no-op and returns e.g. -EAGAIN to the upper device
so that it moves to the next device without returning to the caller.

>> I suspect there's a more important issue to worry about: we run the
>> XFS read verifiers in an async work queue context after collecting
>> the IO completion status from the bio, rather than running directly
>> in bio->bi_end_io() call chain.

In this proposal, XFS would just have to save the __u16 bio_read_hint
field from the previous bio completion and set it in the retried bio,
so that it could start at the next device/combination.  Obviously,
this would mean that the internal device iterator couldn't have any
hidden state for the bio so that just setting bio_read_hint would be
the same as resubmitting the original bio again, but that is already
a given or this whole problem wouldn't exist in the first place.

Cheers, Andreas

[-- Attachment #2: Message signed with OpenPGP --]
[-- Type: application/pgp-signature, Size: 873 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH v2 0/9] Block/XFS: Support alternative mirror device retry
  2019-02-28 23:28     ` Andreas Dilger
@ 2019-03-01 14:14       ` Bob Liu
  2019-03-03 23:45       ` Dave Chinner
  1 sibling, 0 replies; 28+ messages in thread
From: Bob Liu @ 2019-03-01 14:14 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Dave Chinner, linux-block, linux-xfs, linux-fsdevel,
	Martin Petersen, shirley.ma, Allison Henderson, darrick.wong,
	hch

On 3/1/19 7:28 AM, Andreas Dilger wrote:
> On Feb 28, 2019, at 7:22 AM, Bob Liu <bob.liu@oracle.com> wrote:
>>
>> On 2/19/19 5:31 AM, Dave Chinner wrote:
>>> On Wed, Feb 13, 2019 at 05:50:35PM +0800, Bob Liu wrote:
>>>> Motivation:
>>>> When fs data/metadata checksum mismatch, lower block devices may have other
>>>> correct copies. e.g. If XFS successfully reads a metadata buffer off a raid1 but
>>>> decides that the metadata is garbage, today it will shut down the entire
>>>> filesystem without trying any of the other mirrors.  This is a severe
>>>> loss of service, and we propose these patches to have XFS try harder to
>>>> avoid failure.
>>>>
>>>> This patch prototype this mirror retry idea by:
>>>> * Adding @nr_mirrors to struct request_queue which is similar as
>>>>  blk_queue_nonrot(), filesystem can grab device request queue and check max
>>>>  mirrors this block device has.
>>>>  Helper functions were also added to get/set the nr_mirrors.
>>>>
>>>> * Introducing bi_rd_hint just like bi_write_hint, but bi_rd_hint is a long bitmap
>>>> in order to support stacked layer case.
>>>>
>>>> * Modify md/raid1 to support this retry feature.
>>>>
>>>> * Adapter xfs to use this feature.
>>>>  If the read verify fails, we loop over the available mirrors and retry the read.
>>>
>>> Why does the filesystem have to iterate every single posible
>>> combination of devices that are underneath it?
> 
> Even if the filesystem isn't doing this iteration, there needs to be
> some way to track which devices or combinations of devices have been
> tried for the bio, which likely still means something inside the bio.
> 
>>> Wouldn't it be much simpler to be able to attach a verifier
>>> function to the bio, and have each layer that gets called iterate
>>> over all it's copies internally until the verfier function passes
>>> or all copies are exhausted?
>>>
>>> This works for stacked mirrors - it can pass the higher layer
>>> verifier down as far as necessary. It can work for RAID5/6, too, by
>>> having that layer supply it's own verifier for reads that verifies
>>> parity and can reconstruct of failure, then when it's reconstructed
>>> a valid stripe it can run the verifier that was supplied to it from
>>> above, etc.
>>>
>>> i.e. I dont see why only filesystems should drive retries or have to
>>> be aware of the underlying storage stacking. ISTM that each
>>> layer of the storage stack should be able to verify what has been
>>> returned to it is valid independently of the higher layer
>>> requirements. The only difference from a caller point of view should
>>> be submit_bio(bio); vs submit_bio_verify(bio, verifier_cb_func);
> 
> I don't think the filesystem should be aware of the stacking (nor are
> they in the proposed implementation).  That said, the filesystem-level
> checksums should, IMHO, be checked at the filesystem level, and this
> proposal allows the filesystem to tell the lower layer "this read was
> bad, try something else".
> 
> One option, instead of having a bitmap, with one bit for every possible
> device/combination in the system, would be to have a counter instead.
> This is much denser, and even the existing "__u16 bio_write_hint" field

Indeed! This way is better than a bitmap.

But as Dave mentioned, it's much better and simpler if attaching a verfier function to the bio..

-
Thanks,
Bob

> would be enough for 2^16 different devices/combinations of devices to
> be tried.  The main difference would be that the retry layers in the
> device layer would need to have a deterministic iterator for the bio.
> 
> For stacked devices it would need to use the same API to determine how
> many possible combinations are below it, and do a modulus to pass down
> the per-device iteration number.  The easiest would be to iterate in
> numeric order, but it would also be possible to use something like a
> PRNG seeded by e.g. the block number to change the order on a per-bio
> basis to even out the load, if that is desirable.
> 
>> For a two layer stacked md case like:
>>                              /dev/md0
>>             /                  |                  \
>>      /dev/md1-a             /dev/md1-b          /dev/sdf
>>   /        \           /       |        \
>> /dev/sda /dev/sdb  /dev/sdc /dev/sdd  /dev/sde
> 
> In this case, the top-level md0 would call blk_queue_get_copies() on each
> sub-devices to determine how many sub-devices/combinations they have,
> and pick the maximum (3 in this case), multiplied by the number of
> top-level devices (also 3 in this case).  That means the top-level device
> would return blk_queue_get_copies() == 9 combinations, but the same
> could be done recursively for more/non-uniform layers if needed.
> 
> The top-level device maps md1-a = [0-2], md1-b = [3-5], md1-c = [6-8],
> and can easily map an incoming bio_read_hint to the next device, either
> by simple increment or by predetermining a device ordering and following
> that (e.g. 0, 3, 6, 1, 4, 7, 2, 5, 8), or any other deterministic order
> that hits all of the devices exactly once).  During submission bio_read_hint
> is set to the modulus of the value (so that each layer in the stack sees
> only values in the range [0, copies), and when the bio completes the top-level
> device will set bio_read_hint to be the next sub-device to try (like the
> original proposal was splitting and combining the bitmaps).  If a sub-device
> gets a bad index (e.g. md1-a sees bio_read_hint == 2, or sdf sees anything
> other than 0) it is a no-op and returns e.g. -EAGAIN to the upper device
> so that it moves to the next device without returning to the caller.
> 
>>> I suspect there's a more important issue to worry about: we run the
>>> XFS read verifiers in an async work queue context after collecting
>>> the IO completion status from the bio, rather than running directly
>>> in bio->bi_end_io() call chain.
> 
> In this proposal, XFS would just have to save the __u16 bio_read_hint
> field from the previous bio completion and set it in the retried bio,
> so that it could start at the next device/combination.  Obviously,
> this would mean that the internal device iterator couldn't have any
> hidden state for the bio so that just setting bio_read_hint would be
> the same as resubmitting the original bio again, but that is already
> a given or this whole problem wouldn't exist in the first place.
> 
> Cheers, Andreas
> 



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH v2 0/9] Block/XFS: Support alternative mirror device retry
  2019-02-28 21:49     ` Dave Chinner
@ 2019-03-03  2:37       ` Bob Liu
  2019-03-03 23:18         ` Dave Chinner
  0 siblings, 1 reply; 28+ messages in thread
From: Bob Liu @ 2019-03-03  2:37 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-block, linux-xfs, linux-fsdevel, martin.petersen,
	shirley.ma, allison.henderson, darrick.wong, hch, adilger

On 3/1/19 5:49 AM, Dave Chinner wrote:
> On Thu, Feb 28, 2019 at 10:22:02PM +0800, Bob Liu wrote:
>> On 2/19/19 5:31 AM, Dave Chinner wrote:
>>> On Wed, Feb 13, 2019 at 05:50:35PM +0800, Bob Liu wrote:
>>>> Motivation:
>>>> When fs data/metadata checksum mismatch, lower block devices may have other
>>>> correct copies. e.g. If XFS successfully reads a metadata buffer off a raid1 but
>>>> decides that the metadata is garbage, today it will shut down the entire
>>>> filesystem without trying any of the other mirrors.  This is a severe
>>>> loss of service, and we propose these patches to have XFS try harder to
>>>> avoid failure.
>>>>
>>>> This patch prototype this mirror retry idea by:
>>>> * Adding @nr_mirrors to struct request_queue which is similar as
>>>>   blk_queue_nonrot(), filesystem can grab device request queue and check max
>>>>   mirrors this block device has.
>>>>   Helper functions were also added to get/set the nr_mirrors.
>>>>
>>>> * Introducing bi_rd_hint just like bi_write_hint, but bi_rd_hint is a long bitmap
>>>> in order to support stacked layer case.
>>>>
>>>> * Modify md/raid1 to support this retry feature.
>>>>
>>>> * Adapter xfs to use this feature.
>>>>   If the read verify fails, we loop over the available mirrors and retry the read.
>>>
>>> Why does the filesystem have to iterate every single posible
>>> combination of devices that are underneath it?
>>>
>>> Wouldn't it be much simpler to be able to attach a verifier
>>> function to the bio, and have each layer that gets called iterate
>>> over all it's copies internally until the verfier function passes
>>> or all copies are exhausted?
>>>
>>> This works for stacked mirrors - it can pass the higher layer
>>> verifier down as far as necessary. It can work for RAID5/6, too, by
>>> having that layer supply it's own verifier for reads that verifies
>>> parity and can reconstruct of failure, then when it's reconstructed
>>> a valid stripe it can run the verifier that was supplied to it from
>>> above, etc.
>>>
>>> i.e. I dont see why only filesystems should drive retries or have to
>>> be aware of the underlying storage stacking. ISTM that each
>>> layer of the storage stack should be able to verify what has been
>>> returned to it is valid independently of the higher layer
>>> requirements. The only difference from a caller point of view should
>>> be submit_bio(bio); vs submit_bio_verify(bio, verifier_cb_func);
>>>
>>
>> We already have bio->bi_end_io(), how about do the verification inside bi_end_io()?
>>
>> Then the whole sequence would like:
>> bio_endio()
>>     > 1.bio->bi_end_io()
>>         > xfs_buf_bio_end_io()
>>             > verify, set bio->bi_status = "please retry" if verify fail
>>         
>>     > 2.if found bio->bi_status = retry
>>     > 3.resubmit bio
> 
> As I mentioned to Darrick, this isn't cwas simple as it seems
> because what XFS actually does is this:
> 
> IO completion thread			Workqueue Thread
> bio_endio(bio)
>   bio->bi_end_io(bio)
>     xfs_buf_bio_end_io(bio)
>       bp->b_error = bio->bi_status
>       xfs_buf_ioend_async(bp)
>         queue_work(bp->b_ioend_wq, bp)
>       bio_put(bio)
> <io completion done>
> 					.....
> 					xfs_buf_ioend(bp)
> 					  bp->b_ops->read_verify()
> 					.....
> 
> IOWs, XFS does not do read verification inside the bio completion
> context, but instead defers it to an external workqueue so it does
> not delay processing incoming bio IO completions. Hence there is no
> way to get the verification status back to the bio completion (the
> bio has already been freed!) to resubmit from there.
> 
> This is one of the reasons I suggested a verifier be added to the
> submission, so the bio itself is wholly responsible for running it,

But then completion time of an i/o would be longer if calling verifier function inside bio_endio().
Would that be a problem? Since it used to be async as your mentioned xfs uses workqueue.

Thanks, -Bob


> not an external, filesystem level completion function that may
> operate outside of bio scope....
> 
>> Is it fine to resubmit a bio inside bio_endio()?
> 
> Depends on the context the bio_endio() completion is running in.
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH v2 0/9] Block/XFS: Support alternative mirror device retry
  2019-03-03  2:37       ` Bob Liu
@ 2019-03-03 23:18         ` Dave Chinner
  0 siblings, 0 replies; 28+ messages in thread
From: Dave Chinner @ 2019-03-03 23:18 UTC (permalink / raw)
  To: Bob Liu
  Cc: linux-block, linux-xfs, linux-fsdevel, martin.petersen,
	shirley.ma, allison.henderson, darrick.wong, hch, adilger

On Sun, Mar 03, 2019 at 10:37:59AM +0800, Bob Liu wrote:
> On 3/1/19 5:49 AM, Dave Chinner wrote:
> > On Thu, Feb 28, 2019 at 10:22:02PM +0800, Bob Liu wrote:
> >> On 2/19/19 5:31 AM, Dave Chinner wrote:
> >>> On Wed, Feb 13, 2019 at 05:50:35PM +0800, Bob Liu wrote:
> >>>> Motivation:
> >>>> When fs data/metadata checksum mismatch, lower block devices may have other
> >>>> correct copies. e.g. If XFS successfully reads a metadata buffer off a raid1 but
> >>>> decides that the metadata is garbage, today it will shut down the entire
> >>>> filesystem without trying any of the other mirrors.  This is a severe
> >>>> loss of service, and we propose these patches to have XFS try harder to
> >>>> avoid failure.
> >>>>
> >>>> This patch prototype this mirror retry idea by:
> >>>> * Adding @nr_mirrors to struct request_queue which is similar as
> >>>>   blk_queue_nonrot(), filesystem can grab device request queue and check max
> >>>>   mirrors this block device has.
> >>>>   Helper functions were also added to get/set the nr_mirrors.
> >>>>
> >>>> * Introducing bi_rd_hint just like bi_write_hint, but bi_rd_hint is a long bitmap
> >>>> in order to support stacked layer case.
> >>>>
> >>>> * Modify md/raid1 to support this retry feature.
> >>>>
> >>>> * Adapter xfs to use this feature.
> >>>>   If the read verify fails, we loop over the available mirrors and retry the read.
> >>>
> >>> Why does the filesystem have to iterate every single posible
> >>> combination of devices that are underneath it?
> >>>
> >>> Wouldn't it be much simpler to be able to attach a verifier
> >>> function to the bio, and have each layer that gets called iterate
> >>> over all it's copies internally until the verfier function passes
> >>> or all copies are exhausted?
> >>>
> >>> This works for stacked mirrors - it can pass the higher layer
> >>> verifier down as far as necessary. It can work for RAID5/6, too, by
> >>> having that layer supply it's own verifier for reads that verifies
> >>> parity and can reconstruct of failure, then when it's reconstructed
> >>> a valid stripe it can run the verifier that was supplied to it from
> >>> above, etc.
> >>>
> >>> i.e. I dont see why only filesystems should drive retries or have to
> >>> be aware of the underlying storage stacking. ISTM that each
> >>> layer of the storage stack should be able to verify what has been
> >>> returned to it is valid independently of the higher layer
> >>> requirements. The only difference from a caller point of view should
> >>> be submit_bio(bio); vs submit_bio_verify(bio, verifier_cb_func);
> >>>
> >>
> >> We already have bio->bi_end_io(), how about do the verification inside bi_end_io()?
> >>
> >> Then the whole sequence would like:
> >> bio_endio()
> >>     > 1.bio->bi_end_io()
> >>         > xfs_buf_bio_end_io()
> >>             > verify, set bio->bi_status = "please retry" if verify fail
> >>         
> >>     > 2.if found bio->bi_status = retry
> >>     > 3.resubmit bio
> > 
> > As I mentioned to Darrick, this isn't cwas simple as it seems
> > because what XFS actually does is this:
> > 
> > IO completion thread			Workqueue Thread
> > bio_endio(bio)
> >   bio->bi_end_io(bio)
> >     xfs_buf_bio_end_io(bio)
> >       bp->b_error = bio->bi_status
> >       xfs_buf_ioend_async(bp)
> >         queue_work(bp->b_ioend_wq, bp)
> >       bio_put(bio)
> > <io completion done>
> > 					.....
> > 					xfs_buf_ioend(bp)
> > 					  bp->b_ops->read_verify()
> > 					.....
> > 
> > IOWs, XFS does not do read verification inside the bio completion
> > context, but instead defers it to an external workqueue so it does
> > not delay processing incoming bio IO completions. Hence there is no
> > way to get the verification status back to the bio completion (the
> > bio has already been freed!) to resubmit from there.
> > 
> > This is one of the reasons I suggested a verifier be added to the
> > submission, so the bio itself is wholly responsible for running it,
> 
> But then completion time of an i/o would be longer if calling verifier function inside bio_endio().
> Would that be a problem?

No, because then we don't have to do it in the filesystem. i.e. the
filesystem doesn't complete the IO until after the verifier has run,
so it from the perspective of the waiting reader is doesn't matter
where it is run because the overall I/O latency is the same.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH v2 0/9] Block/XFS: Support alternative mirror device retry
  2019-02-28 23:28     ` Andreas Dilger
  2019-03-01 14:14       ` Bob Liu
@ 2019-03-03 23:45       ` Dave Chinner
  1 sibling, 0 replies; 28+ messages in thread
From: Dave Chinner @ 2019-03-03 23:45 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Bob Liu, linux-block, linux-xfs, linux-fsdevel, Martin Petersen,
	shirley.ma, Allison Henderson, darrick.wong, hch

On Thu, Feb 28, 2019 at 04:28:53PM -0700, Andreas Dilger wrote:
> On Feb 28, 2019, at 7:22 AM, Bob Liu <bob.liu@oracle.com> wrote:
> > 
> > On 2/19/19 5:31 AM, Dave Chinner wrote:
> >> On Wed, Feb 13, 2019 at 05:50:35PM +0800, Bob Liu wrote:
> >>> Motivation:
> >>> When fs data/metadata checksum mismatch, lower block devices may have other
> >>> correct copies. e.g. If XFS successfully reads a metadata buffer off a raid1 but
> >>> decides that the metadata is garbage, today it will shut down the entire
> >>> filesystem without trying any of the other mirrors.  This is a severe
> >>> loss of service, and we propose these patches to have XFS try harder to
> >>> avoid failure.
> >>> 
> >>> This patch prototype this mirror retry idea by:
> >>> * Adding @nr_mirrors to struct request_queue which is similar as
> >>>  blk_queue_nonrot(), filesystem can grab device request queue and check max
> >>>  mirrors this block device has.
> >>>  Helper functions were also added to get/set the nr_mirrors.
> >>> 
> >>> * Introducing bi_rd_hint just like bi_write_hint, but bi_rd_hint is a long bitmap
> >>> in order to support stacked layer case.
> >>> 
> >>> * Modify md/raid1 to support this retry feature.
> >>> 
> >>> * Adapter xfs to use this feature.
> >>>  If the read verify fails, we loop over the available mirrors and retry the read.
> >> 
> >> Why does the filesystem have to iterate every single posible
> >> combination of devices that are underneath it?
> 
> Even if the filesystem isn't doing this iteration, there needs to be
> some way to track which devices or combinations of devices have been
> tried for the bio, which likely still means something inside the bio.

I don't beleive it needs to be "in the bio". The thing that does
the iteration (i.e. the layer with multiple copies or rebuild
capability) is the one that captures the IO completion state, runs
the verifier it is supplied with and re-issues the read if the
verifier or initial IO fails.

i.e. it moves the iteration down to the thing that knows what can be
iterated, and so there's no state needed in the bio itself.

> >> Wouldn't it be much simpler to be able to attach a verifier
> >> function to the bio, and have each layer that gets called iterate
> >> over all it's copies internally until the verfier function passes
> >> or all copies are exhausted?
> >> 
> >> This works for stacked mirrors - it can pass the higher layer
> >> verifier down as far as necessary. It can work for RAID5/6, too, by
> >> having that layer supply it's own verifier for reads that verifies
> >> parity and can reconstruct of failure, then when it's reconstructed
> >> a valid stripe it can run the verifier that was supplied to it from
> >> above, etc.
> >> 
> >> i.e. I dont see why only filesystems should drive retries or have to
> >> be aware of the underlying storage stacking. ISTM that each
> >> layer of the storage stack should be able to verify what has been
> >> returned to it is valid independently of the higher layer
> >> requirements. The only difference from a caller point of view should
> >> be submit_bio(bio); vs submit_bio_verify(bio, verifier_cb_func);
> 
> I don't think the filesystem should be aware of the stacking (nor are
> they in the proposed implementation).  That said, the filesystem-level
> checksums should, IMHO, be checked at the filesystem level, and this
> proposal allows the filesystem to tell the lower layer "this read was
> bad, try something else".

After the fact, yes. I want the verification during the IO while
the layer that knows about iteration and recovery can do this
easily.

i.e. all the complexity right now is because we back out of the
layer that can do iteration before we can run the verification, and
so we have to carry some state up to a higher level and then pass it
back down in a completely separate IO context. That's where all this
"need to carry satate in the bio" stuff comes from, and that's what
I'm trying to get rid of.

> One option, instead of having a bitmap, with one bit for every possible
> device/combination in the system, would be to have a counter instead.
> This is much denser, and even the existing "__u16 bio_write_hint" field
> would be enough for 2^16 different devices/combinations of devices to
> be tried.  The main difference would be that the retry layers in the
> device layer would need to have a deterministic iterator for the bio.

The problem there is stacked layers - each layer needs a unique
ID for it's iterator function, as this complexity:

> For stacked devices it would need to use the same API to determine how
> many possible combinations are below it, and do a modulus to pass down
> the per-device iteration number.  The easiest would be to iterate in
> numeric order, but it would also be possible to use something like a
> PRNG seeded by e.g. the block number to change the order on a per-bio
> basis to even out the load, if that is desirable.
> 
> > For a two layer stacked md case like:
> >                              /dev/md0
> >             /                  |                  \
> >      /dev/md1-a             /dev/md1-b          /dev/sdf
> >   /        \           /       |        \
> > /dev/sda /dev/sdb  /dev/sdc /dev/sdd  /dev/sde
> 
> In this case, the top-level md0 would call blk_queue_get_copies() on each
> sub-devices to determine how many sub-devices/combinations they have,
> and pick the maximum (3 in this case), multiplied by the number of
> top-level devices (also 3 in this case).  That means the top-level device
> would return blk_queue_get_copies() == 9 combinations, but the same
> could be done recursively for more/non-uniform layers if needed.
>
> The top-level device maps md1-a = [0-2], md1-b = [3-5], md1-c = [6-8],
> and can easily map an incoming bio_read_hint to the next device, either
> by simple increment or by predetermining a device ordering and following
> that (e.g. 0, 3, 6, 1, 4, 7, 2, 5, 8), or any other deterministic order
> that hits all of the devices exactly once).  During submission bio_read_hint
> is set to the modulus of the value (so that each layer in the stack sees
> only values in the range [0, copies), and when the bio completes the top-level
> device will set bio_read_hint to be the next sub-device to try (like the
> original proposal was splitting and combining the bitmaps).  If a sub-device
> gets a bad index (e.g. md1-a sees bio_read_hint == 2, or sdf sees anything
> other than 0) it is a no-op and returns e.g. -EAGAIN to the upper device
> so that it moves to the next device without returning to the caller.

.... clearly demonstrates.

I'd much prefer stacking completions and running them on demand as
it should work for all stacked types and not require magic to
iterate constructs like the above.

Passing the verifier down also allows the underlying layer to repair
itself. i.e. if it gets a verifier failure, then retries and gets
success, it knows immediately which part of the mirror contains bad
data and can repair it. It can also trigger a region scrub, knowing
which device might be bad and which is likely to contain good data.
i.e. we can start to think about automated block device self-repair
if we can supply a data verifier with submit_bio()...

> >> I suspect there's a more important issue to worry about: we run the
> >> XFS read verifiers in an async work queue context after collecting
> >> the IO completion status from the bio, rather than running directly
> >> in bio->bi_end_io() call chain.
> 
> In this proposal, XFS would just have to save the __u16 bio_read_hint
> field from the previous bio completion and set it in the retried bio,
> so that it could start at the next device/combination.  Obviously,
> this would mean that the internal device iterator couldn't have any
> hidden state for the bio so that just setting bio_read_hint would be
> the same as resubmitting the original bio again, but that is already
> a given or this whole problem wouldn't exist in the first place.

It still requires code in the filesystem to iterate and retry N
times, instead of never. And we still have to re-write the data we
read to fix the underlying device issue (which the device should
already know about and have fixed by this point!) i.e. we either get
verified data returned on bio completion or we get an error to say
the data was corrupt and unrecoverable. If someone wants "fail fast"
semantics, then they simply don't provide a verifier....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2019-03-03 23:46 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-02-13  9:50 [RFC PATCH v2 0/9] Block/XFS: Support alternative mirror device retry Bob Liu
2019-02-13  9:50 ` [RFC PATCH v2 1/9] block: add nr_mirrors to request_queue Bob Liu
2019-02-13 10:26   ` Andreas Dilger
2019-02-13 16:04   ` Theodore Y. Ts'o
2019-02-14  5:57     ` Bob Liu
2019-02-18 17:56       ` Theodore Y. Ts'o
2019-02-13  9:50 ` [RFC PATCH v2 2/9] block: add rd_hint to bio and request Bob Liu
2019-02-13 16:18   ` Jens Axboe
2019-02-14  6:10     ` Bob Liu
2019-02-13  9:50 ` [RFC PATCH v2 3/9] md:raid1: set mirrors correctly Bob Liu
2019-02-13  9:50 ` [RFC PATCH v2 4/9] md:raid1: rd_hint support and consider stacked layer case Bob Liu
2019-02-13  9:50 ` [RFC PATCH v2 5/9] Add b_alt_retry to xfs_buf Bob Liu
2019-02-13  9:50 ` [RFC PATCH v2 6/9] xfs: Add b_rd_hint " Bob Liu
2019-02-13  9:50 ` [RFC PATCH v2 7/9] xfs: Add device retry Bob Liu
2019-02-13  9:50 ` [RFC PATCH v2 8/9] xfs: Rewrite retried read Bob Liu
2019-02-13  9:50 ` [RFC PATCH v2 9/9] xfs: Add tracepoints and logging to alternate device retry Bob Liu
2019-02-18  8:08 ` [RFC PATCH v2 0/9] Block/XFS: Support alternative mirror " jianchao.wang
2019-02-19  1:29   ` jianchao.wang
2019-02-18 21:31 ` Dave Chinner
2019-02-19  2:55   ` Darrick J. Wong
2019-02-19  3:33     ` Dave Chinner
2019-02-28 14:22   ` Bob Liu
2019-02-28 21:49     ` Dave Chinner
2019-03-03  2:37       ` Bob Liu
2019-03-03 23:18         ` Dave Chinner
2019-02-28 23:28     ` Andreas Dilger
2019-03-01 14:14       ` Bob Liu
2019-03-03 23:45       ` Dave Chinner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).