linux-fsdevel.vger.kernel.org archive mirror
* [RFC PATCH v3 0/3] Block/XFS: Support alternative mirror device retry
@ 2019-03-29 14:23 Bob Liu
  2019-03-29 14:23 ` [PATCH v3 1/3] block: introduce submit_bio_verify() Bob Liu
                   ` (2 more replies)
  0 siblings, 3 replies; 26+ messages in thread
From: Bob Liu @ 2019-03-29 14:23 UTC (permalink / raw)
  To: linux-block
  Cc: linux-xfs, linux-fsdevel, martin.petersen, shirley.ma,
	allison.henderson, david, darrick.wong, hch, adilger, axboe,
	tytso, Bob Liu

Motivation:
When an fs data/metadata checksum mismatch occurs, the lower block devices
may still hold other correct copies. e.g. if XFS successfully reads a
metadata buffer off a raid1 but decides that the metadata is garbage, today
it will shut down the entire filesystem without trying any of the other
mirrors.  This is a severe loss of service, and we propose these patches to
have XFS try harder to avoid failure.

This patch set prototypes the mirror retry idea by adding a verifier
callback to submit_bio.  A filesystem can use submit_bio_verify() to pass a
callback to the block layer, which can then be used to verify whether the
data read is correct.

Some of the bio-integrity code is reused; it can be separated if necessary.
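
For illustration, a minimal sketch of the intended round trip (all myfs_*
names below are made up; the real XFS hook is in patch 3):

/* Illustrative only -- not part of the patches.  The filesystem submits a
 * read together with a verifier; the completion path runs the verifier, and
 * a non-zero return tells a mirrored driver (e.g. MD raid1) to retry the
 * read from another copy before failing the I/O.
 */
static int myfs_read_verifier(struct bio *bio)
{
	struct myfs_buf *mb = bio->bi_private;	/* hypothetical fs object */

	/* fs-level checks (magic, CRC, ...); 0 means the data is good */
	return myfs_verify_buf(mb);
}

static void myfs_submit_read(struct bio *bio, struct myfs_buf *mb)
{
	bio->bi_private = mb;
	bio->bi_end_io = myfs_read_end_io;	/* normal completion handler, not shown */
	submit_bio_verify(bio, myfs_read_verifier);
}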

Changes v3:
- Totally new implementation: pass the verify function down to the block
  layer as suggested by Dave.

Bob Liu (3):
  block: introduce submit_bio_verify()
  block: verify data when endio
  fs: xfs: add read_verifier() function

 block/bio-integrity.c     | 45 +++++++++++++++++++++++++++++++++++++++
 block/bio.c               |  3 +++
 block/blk-core.c          | 17 ++++++++++++---
 block/blk.h               |  8 +++++++
 block/bounce.c            |  1 +
 drivers/md/raid1.c        |  1 +
 drivers/md/raid5-ppl.c    |  1 +
 fs/xfs/xfs_buf.c          | 23 ++++++++++++++++++--
 fs/xfs/xfs_buf.h          |  1 +
 include/linux/bio.h       |  2 ++
 include/linux/blk_types.h |  5 +++++
 11 files changed, 102 insertions(+), 5 deletions(-)

-- 
2.17.1

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH v3 1/3] block: introduce submit_bio_verify()
  2019-03-29 14:23 [RFC PATCH v3 0/3] Block/XFS: Support alternative mirror device retry Bob Liu
@ 2019-03-29 14:23 ` Bob Liu
  2019-03-29 22:22   ` Andreas Dilger
  2019-03-29 14:23 ` [PATCH v3 2/3] block: verify data when endio Bob Liu
  2019-03-29 14:23 ` [PATCH v3 3/3] fs: xfs: add read_verifier() function Bob Liu
  2 siblings, 1 reply; 26+ messages in thread
From: Bob Liu @ 2019-03-29 14:23 UTC (permalink / raw)
  To: linux-block
  Cc: linux-xfs, linux-fsdevel, martin.petersen, shirley.ma,
	allison.henderson, david, darrick.wong, hch, adilger, axboe,
	tytso, Bob Liu

This patch adds a verifier callback to submit_bio.  The filesystem
layer can use submit_bio_verify() to pass a callback to the block
layer, which can then be used to verify whether the data read is
correct.

Signed-off-by: Bob Liu <bob.liu@oracle.com>
---
 block/blk-core.c    | 13 ++++++++++---
 include/linux/bio.h |  2 ++
 2 files changed, 12 insertions(+), 3 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 6b78ec56a4f2..d265d2924c32 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1156,15 +1156,16 @@ blk_qc_t direct_make_request(struct bio *bio)
 EXPORT_SYMBOL_GPL(direct_make_request);
 
 /**
- * submit_bio - submit a bio to the block device layer for I/O
+ * submit_bio_verify - submit a bio to the block device layer for I/O
  * @bio: The &struct bio which describes the I/O
  *
- * submit_bio() is very similar in purpose to generic_make_request(), and
+ * submit_bio_verify() is very similar in purpose to generic_make_request(), and
  * uses that function to do most of the work. Both are fairly rough
  * interfaces; @bio must be presetup and ready for I/O.
  *
  */
-blk_qc_t submit_bio(struct bio *bio)
+blk_qc_t submit_bio_verify(struct bio *bio,
+		int (*verifier_cb_func)(struct bio *))
 {
 	/*
 	 * If it's a regular read/write or a barrier with data attached,
@@ -1197,6 +1198,12 @@ blk_qc_t submit_bio(struct bio *bio)
 
 	return generic_make_request(bio);
 }
+EXPORT_SYMBOL(submit_bio_verify);
+
+blk_qc_t submit_bio(struct bio *bio)
+{
+	return submit_bio_verify(bio, NULL);
+}
 EXPORT_SYMBOL(submit_bio);
 
 /**
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 7380b094dcca..ddaadab74dcf 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -398,6 +398,8 @@ static inline struct bio *bio_kmalloc(gfp_t gfp_mask, unsigned int nr_iovecs)
 	return bio_alloc_bioset(gfp_mask, nr_iovecs, NULL);
 }
 
+extern blk_qc_t submit_bio_verify(struct bio *,
+		int (*verifier_cb_func)(struct bio *));
 extern blk_qc_t submit_bio(struct bio *);
 
 extern void bio_endio(struct bio *);
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v3 2/3] block: verify data when endio
  2019-03-29 14:23 [RFC PATCH v3 0/3] Block/XFS: Support alternative mirror device retry Bob Liu
  2019-03-29 14:23 ` [PATCH v3 1/3] block: introduce submit_bio_verify() Bob Liu
@ 2019-03-29 14:23 ` Bob Liu
  2019-03-29 14:28   ` Jens Axboe
  2019-03-29 15:39   ` Ming Lei
  2019-03-29 14:23 ` [PATCH v3 3/3] fs: xfs: add read_verifier() function Bob Liu
  2 siblings, 2 replies; 26+ messages in thread
From: Bob Liu @ 2019-03-29 14:23 UTC (permalink / raw)
  To: linux-block
  Cc: linux-xfs, linux-fsdevel, martin.petersen, shirley.ma,
	allison.henderson, david, darrick.wong, hch, adilger, axboe,
	tytso, Bob Liu

Call the verify callback on endio, the same way bio integrity does.
If verification fails, drivers like MD will try other mirrors until a
correct copy is found, or return failure after all mirrors have been tried.
The MD driver already works like this, so no extra changes are needed there.

Todo:
- union with "struct bio_integrity_payload *bi_integrity" to save bio space.

Signed-off-by: Bob Liu <bob.liu@oracle.com>
---
 block/bio-integrity.c     | 45 +++++++++++++++++++++++++++++++++++++++
 block/bio.c               |  3 +++
 block/blk-core.c          |  4 ++++
 block/blk.h               |  8 +++++++
 block/bounce.c            |  1 +
 drivers/md/raid1.c        |  1 +
 drivers/md/raid5-ppl.c    |  1 +
 include/linux/blk_types.h |  5 +++++
 8 files changed, 68 insertions(+)

diff --git a/block/bio-integrity.c b/block/bio-integrity.c
index 1b633a3526d4..90a47ad31dbf 100644
--- a/block/bio-integrity.c
+++ b/block/bio-integrity.c
@@ -372,6 +372,51 @@ bool __bio_integrity_endio(struct bio *bio)
 	return true;
 }
 
+/**
+ * bio_verify_fn - Verify I/O completion worker
+ * @work:	Work struct stored in bio to be verified
+ *
+ * Description: This workqueue function is called to complete a READ
+ * request.  It calls the verifier callback that the fs passed down
+ * and then calls the original bio end_io function.
+ */
+static void bio_verify_fn(struct work_struct *work)
+{
+	struct bio *bio =
+		container_of(work, struct bio, bi_work);
+
+	bio->bi_status = bio->bi_verifier(bio);
+	/* Clear the flag if verification succeeded, to avoid the
+	 * parent bio verifying it again unnecessarily
+	 */
+	if (!bio->bi_status)
+		bio->bi_opf &= ~REQ_VERIFY;
+	bio_endio(bio);
+}
+
+/**
+ * __bio_verify_endio - Verify I/O completion function
+ * @bio:	Protected bio
+ *
+ * Description: Completion for verify I/O
+ *
+ * Normally I/O completion is done in interrupt context.  However,
+ * verifying I/O is a time-consuming task which must be run
+ * in process context.	This function postpones completion
+ * accordingly.
+ */
+bool __bio_verify_endio(struct bio *bio)
+{
+	if (bio_op(bio) == REQ_OP_READ && !bio->bi_status &&
+	    (bio->bi_opf & REQ_VERIFY) && bio->bi_verifier) {
+		INIT_WORK(&bio->bi_work, bio_verify_fn);
+		queue_work(kintegrityd_wq, &bio->bi_work);
+		return false;
+	}
+
+	return true;
+}
+
 /**
  * bio_integrity_advance - Advance integrity vector
  * @bio:	bio whose integrity vector to update
diff --git a/block/bio.c b/block/bio.c
index 4db1008309ed..8928806acda6 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -608,6 +608,7 @@ void __bio_clone_fast(struct bio *bio, struct bio *bio_src)
 	bio->bi_write_hint = bio_src->bi_write_hint;
 	bio->bi_iter = bio_src->bi_iter;
 	bio->bi_io_vec = bio_src->bi_io_vec;
+	bio->bi_verifier = bio_src->bi_verifier;
 
 	bio_clone_blkg_association(bio, bio_src);
 	blkcg_bio_issue_init(bio);
@@ -1763,6 +1764,8 @@ void bio_endio(struct bio *bio)
 		return;
 	if (!bio_integrity_endio(bio))
 		return;
+	if (!bio_verify_endio(bio))
+		return;
 
 	if (bio->bi_disk)
 		rq_qos_done_bio(bio->bi_disk->queue, bio);
diff --git a/block/blk-core.c b/block/blk-core.c
index d265d2924c32..cbec80f2d73a 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1167,6 +1167,10 @@ EXPORT_SYMBOL_GPL(direct_make_request);
 blk_qc_t submit_bio_verify(struct bio *bio,
 		int (*verifier_cb_func)(struct bio *))
 {
+	if (verifier_cb_func) {
+		bio->bi_verifier = verifier_cb_func;
+		bio->bi_opf |= REQ_VERIFY;
+	}
 	/*
 	 * If it's a regular read/write or a barrier with data attached,
 	 * go through the normal accounting stuff before submission.
diff --git a/block/blk.h b/block/blk.h
index 848278c52030..cdf30c65d4a8 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -151,6 +151,14 @@ static inline bool bio_integrity_endio(struct bio *bio)
 }
 #endif /* CONFIG_BLK_DEV_INTEGRITY */
 
+bool __bio_verify_endio(struct bio *);
+static inline bool bio_verify_endio(struct bio *bio)
+{
+	if (bio->bi_opf & REQ_VERIFY)
+		return __bio_verify_endio(bio);
+	return true;
+}
+
 unsigned long blk_rq_timeout(unsigned long timeout);
 void blk_add_timer(struct request *req);
 
diff --git a/block/bounce.c b/block/bounce.c
index ffb9e9ecfa7e..7a2c3f536030 100644
--- a/block/bounce.c
+++ b/block/bounce.c
@@ -252,6 +252,7 @@ static struct bio *bounce_clone_bio(struct bio *bio_src, gfp_t gfp_mask,
 	bio->bi_write_hint	= bio_src->bi_write_hint;
 	bio->bi_iter.bi_sector	= bio_src->bi_iter.bi_sector;
 	bio->bi_iter.bi_size	= bio_src->bi_iter.bi_size;
+	bio->bi_verifier	= bio_src->bi_verifier;
 
 	switch (bio_op(bio)) {
 	case REQ_OP_DISCARD:
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 1d54109071cc..11b29a3831e1 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1103,6 +1103,7 @@ static void alloc_behind_master_bio(struct r1bio *r1_bio,
 	}
 
 	behind_bio->bi_write_hint = bio->bi_write_hint;
+	behind_bio->bi_verifier = bio->bi_verifier;
 
 	while (i < vcnt && size) {
 		struct page *page;
diff --git a/drivers/md/raid5-ppl.c b/drivers/md/raid5-ppl.c
index 3a7c36326589..4cdaa5dabfbe 100644
--- a/drivers/md/raid5-ppl.c
+++ b/drivers/md/raid5-ppl.c
@@ -505,6 +505,7 @@ static void ppl_submit_iounit(struct ppl_io_unit *io)
 			bio = bio_alloc_bioset(GFP_NOIO, BIO_MAX_PAGES,
 					       &ppl_conf->bs);
 			bio->bi_opf = prev->bi_opf;
+			bio->bi_verifier = prev->bi_verifier;
 			bio_copy_dev(bio, prev);
 			bio->bi_iter.bi_sector = bio_end_sector(prev);
 			bio_add_page(bio, sh->ppl_page, PAGE_SIZE, 0);
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index d66bf5f32610..e9f25f162138 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -18,6 +18,7 @@ struct block_device;
 struct io_context;
 struct cgroup_subsys_state;
 typedef void (bio_end_io_t) (struct bio *);
+typedef int (bio_verifier_t) (struct bio *);
 
 /*
  * Block error status values.  See block/blk-core:blk_errors for the details.
@@ -187,6 +188,8 @@ struct bio {
 		struct bio_integrity_payload *bi_integrity; /* data integrity */
 #endif
 	};
+	bio_verifier_t		*bi_verifier;	/* verify callback when endio */
+	struct work_struct      bi_work;	/* I/O completion */
 
 	unsigned short		bi_vcnt;	/* how many bio_vec's */
 
@@ -329,6 +332,7 @@ enum req_flag_bits {
 	/* for driver use */
 	__REQ_DRV,
 	__REQ_SWAP,		/* swapping request. */
+	__REQ_VERIFY,		/* verify IO when endio is called */
 	__REQ_NR_BITS,		/* stops here */
 };
 
@@ -351,6 +355,7 @@ enum req_flag_bits {
 
 #define REQ_DRV			(1ULL << __REQ_DRV)
 #define REQ_SWAP		(1ULL << __REQ_SWAP)
+#define REQ_VERIFY		(1ULL << __REQ_VERIFY)
 
 #define REQ_FAILFAST_MASK \
 	(REQ_FAILFAST_DEV | REQ_FAILFAST_TRANSPORT | REQ_FAILFAST_DRIVER)
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v3 3/3] fs: xfs: add read_verifier() function
  2019-03-29 14:23 [RFC PATCH v3 0/3] Block/XFS: Support alternative mirror device retry Bob Liu
  2019-03-29 14:23 ` [PATCH v3 1/3] block: introduce submit_bio_verify() Bob Liu
  2019-03-29 14:23 ` [PATCH v3 2/3] block: verify data when endio Bob Liu
@ 2019-03-29 14:23 ` Bob Liu
  2 siblings, 0 replies; 26+ messages in thread
From: Bob Liu @ 2019-03-29 14:23 UTC (permalink / raw)
  To: linux-block
  Cc: linux-xfs, linux-fsdevel, martin.petersen, shirley.ma,
	allison.henderson, david, darrick.wong, hch, adilger, axboe,
	tytso, Bob Liu

This patch adds a new read_verifier() function and passes it to
the block layer to verify reads.

Signed-off-by: Bob Liu <bob.liu@oracle.com>
---
 fs/xfs/xfs_buf.c | 23 +++++++++++++++++++++--
 fs/xfs/xfs_buf.h |  1 +
 2 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 4f5f2ff3f70f..f3185dd9a557 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -1243,7 +1243,8 @@ xfs_buf_ioend(
 		xfs_buf_ioerror(bp, bp->b_io_error);
 
 	/* Only validate buffers that were read without errors */
-	if (read && !bp->b_error && bp->b_ops) {
+	if (read && !bp->b_error && bp->b_ops &&
+	    !(bp->b_flags & _XBF_NOVERIFY)) {
 		ASSERT(!bp->b_iodone);
 		bp->b_ops->verify_read(bp);
 	}
@@ -1343,6 +1344,24 @@ xfs_buf_bio_end_io(
 	bio_put(bio);
 }
 
+int read_verifier(struct bio *bio)
+{
+	struct xfs_buf *bp = (struct xfs_buf *)bio->bi_private;
+
+	if (bp && !bp->b_error && bp->b_ops && bp->b_ops->verify_read) {
+		if (xfs_buf_is_vmapped(bp))
+			invalidate_kernel_vmap_range(bp->b_addr,
+						     xfs_buf_vmap_len(bp));
+		bp->b_ops->verify_read(bp);
+		/* set bit so that endio won't verify it again */
+		if (!bp->b_error)
+			bp->b_flags |= _XBF_NOVERIFY;
+		return bp->b_error;
+	} else {
+		return 0;
+	}
+}
+
 static void
 xfs_buf_ioapply_map(
 	struct xfs_buf	*bp,
@@ -1409,7 +1428,7 @@ xfs_buf_ioapply_map(
 			flush_kernel_vmap_range(bp->b_addr,
 						xfs_buf_vmap_len(bp));
 		}
-		submit_bio(bio);
+		submit_bio_verify(bio, &read_verifier);
 		if (size)
 			goto next_chunk;
 	} else {
diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
index b9f5511ea998..499637c61001 100644
--- a/fs/xfs/xfs_buf.h
+++ b/fs/xfs/xfs_buf.h
@@ -50,6 +50,7 @@ typedef enum {
 #define _XBF_KMEM	 (1 << 21)/* backed by heap memory */
 #define _XBF_DELWRI_Q	 (1 << 22)/* buffer on a delwri queue */
 #define _XBF_COMPOUND	 (1 << 23)/* compound buffer */
+#define _XBF_NOVERIFY	 (1 << 24)/* verify_read should not be called */
 
 typedef unsigned int xfs_buf_flags_t;
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 2/3] block: verify data when endio
  2019-03-29 14:23 ` [PATCH v3 2/3] block: verify data when endio Bob Liu
@ 2019-03-29 14:28   ` Jens Axboe
  2019-03-29 14:34     ` Martin K. Petersen
  2019-03-29 14:40     ` Bob Liu
  2019-03-29 15:39   ` Ming Lei
  1 sibling, 2 replies; 26+ messages in thread
From: Jens Axboe @ 2019-03-29 14:28 UTC (permalink / raw)
  To: Bob Liu, linux-block
  Cc: linux-xfs, linux-fsdevel, martin.petersen, shirley.ma,
	allison.henderson, david, darrick.wong, hch, adilger, tytso

On 3/29/19 8:23 AM, Bob Liu wrote:
> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> index d66bf5f32610..e9f25f162138 100644
> --- a/include/linux/blk_types.h
> +++ b/include/linux/blk_types.h
> @@ -18,6 +18,7 @@ struct block_device;
>  struct io_context;
>  struct cgroup_subsys_state;
>  typedef void (bio_end_io_t) (struct bio *);
> +typedef int (bio_verifier_t) (struct bio *);
>  
>  /*
>   * Block error status values.  See block/blk-core:blk_errors for the details.
> @@ -187,6 +188,8 @@ struct bio {
>  		struct bio_integrity_payload *bi_integrity; /* data integrity */
>  #endif
>  	};
> +	bio_verifier_t		*bi_verifier;	/* verify callback when endio */
> +	struct work_struct      bi_work;	/* I/O completion */
>  
>  	unsigned short		bi_vcnt;	/* how many bio_vec's */

I told you this for the initial posting, and the objection still stands.
Adding 40 bytes to struct bio is a no-go.

So that's a big NAK on that series.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 2/3] block: verify data when endio
  2019-03-29 14:28   ` Jens Axboe
@ 2019-03-29 14:34     ` Martin K. Petersen
  2019-03-29 14:38       ` Jens Axboe
  2019-03-29 14:41       ` Bob Liu
  2019-03-29 14:40     ` Bob Liu
  1 sibling, 2 replies; 26+ messages in thread
From: Martin K. Petersen @ 2019-03-29 14:34 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Bob Liu, linux-block, linux-xfs, linux-fsdevel, martin.petersen,
	shirley.ma, allison.henderson, david, darrick.wong, hch, adilger,
	tytso


Jens,

> I told you this for the initial posting, and the objection still
> stands.  Adding 40 bytes to struct bio is a no-go.
>
> So that's a big NAK on that series.

I think you missed Bob's comment that this will go in the existing
bio_integrity field. I believe the main purpose of the series is to
solicit feedback on the callback approach.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 2/3] block: verify data when endio
  2019-03-29 14:34     ` Martin K. Petersen
@ 2019-03-29 14:38       ` Jens Axboe
  2019-03-29 14:46         ` Martin K. Petersen
  2019-03-29 14:41       ` Bob Liu
  1 sibling, 1 reply; 26+ messages in thread
From: Jens Axboe @ 2019-03-29 14:38 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Bob Liu, linux-block, linux-xfs, linux-fsdevel, shirley.ma,
	allison.henderson, david, darrick.wong, hch, adilger, tytso

On 3/29/19 8:34 AM, Martin K. Petersen wrote:
> 
> Jens,
> 
>> I told you this for the initial posting, and the objection still
>> stands.  Adding 40 bytes to struct bio is a no-go.
>>
>> So that's a big NAK on that series.
> 
> I think you missed Bob's comment that this will go in the existing
> bio_integrity field. I believe the main purpose of the series is to
> solicit feedback on the callback approach.

I didn't miss that, but it fixes nothing. That will unify the 40 bytes
with 8 bytes, we're still growing the bio by a LOT. And we can't even
nicely hide this behind some ifdef, since distros enable everything and
hence we're solving nothing by doing that.

One potential solution could be to setup a specific bio_set for this,
and allocate your read bios out of that. Growing struct bio by an
enormous amount for this is just a non-starter to begin with, end of
story.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 2/3] block: verify data when endio
  2019-03-29 14:28   ` Jens Axboe
  2019-03-29 14:34     ` Martin K. Petersen
@ 2019-03-29 14:40     ` Bob Liu
  2019-03-29 14:47       ` Jens Axboe
  2019-03-30  0:20       ` Andreas Dilger
  1 sibling, 2 replies; 26+ messages in thread
From: Bob Liu @ 2019-03-29 14:40 UTC (permalink / raw)
  To: Jens Axboe, linux-block
  Cc: linux-xfs, linux-fsdevel, martin.petersen, shirley.ma,
	allison.henderson, david, darrick.wong, hch, adilger, tytso

On 3/29/19 10:28 PM, Jens Axboe wrote:
> On 3/29/19 8:23 AM, Bob Liu wrote:
>> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
>> index d66bf5f32610..e9f25f162138 100644
>> --- a/include/linux/blk_types.h
>> +++ b/include/linux/blk_types.h
>> @@ -18,6 +18,7 @@ struct block_device;
>>  struct io_context;
>>  struct cgroup_subsys_state;
>>  typedef void (bio_end_io_t) (struct bio *);
>> +typedef int (bio_verifier_t) (struct bio *);
>>  
>>  /*
>>   * Block error status values.  See block/blk-core:blk_errors for the details.
>> @@ -187,6 +188,8 @@ struct bio {
>>  		struct bio_integrity_payload *bi_integrity; /* data integrity */
>>  #endif
>>  	};
>> +	bio_verifier_t		*bi_verifier;	/* verify callback when endio */
>> +	struct work_struct      bi_work;	/* I/O completion */
>>  
>>  	unsigned short		bi_vcnt;	/* how many bio_vec's */
> 
> I told you this for the initial posting, and the objection still stands.
> Adding 40 bytes to struct bio is a no-go.
> 

Yes, I know that.
Already list in Todo:
- union with "struct bio_integrity_payload *bi_integrity" to save bio space.

This is only a rfc patch, because not sure any other objection of this passdown-verify solution..
Will consider more about the bio space if there is agreement of the passdown-verify way.

Thanks,
Bob

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 2/3] block: verify data when endio
  2019-03-29 14:34     ` Martin K. Petersen
  2019-03-29 14:38       ` Jens Axboe
@ 2019-03-29 14:41       ` Bob Liu
  1 sibling, 0 replies; 26+ messages in thread
From: Bob Liu @ 2019-03-29 14:41 UTC (permalink / raw)
  To: Martin K. Petersen, Jens Axboe
  Cc: linux-block, linux-xfs, linux-fsdevel, shirley.ma,
	allison.henderson, david, darrick.wong, hch, adilger, tytso

On 3/29/19 10:34 PM, Martin K. Petersen wrote:
> 
> Jens,
> 
>> I told you this for the initial posting, and the objection still
>> stands.  Adding 40 bytes to struct bio is a no-go.
>>
>> So that's a big NAK on that series.
> 
> I think you missed Bob's comment that this will go in the existing
> bio_integrity field. I believe the main purpose of the series is to
> solicit feedback on the callback approach.
> 

Yes, indeed!


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 2/3] block: verify data when endio
  2019-03-29 14:38       ` Jens Axboe
@ 2019-03-29 14:46         ` Martin K. Petersen
  2019-03-29 14:50           ` Jens Axboe
  0 siblings, 1 reply; 26+ messages in thread
From: Martin K. Petersen @ 2019-03-29 14:46 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Martin K. Petersen, Bob Liu, linux-block, linux-xfs,
	linux-fsdevel, shirley.ma, allison.henderson, david,
	darrick.wong, hch, adilger, tytso


Jens,

> I didn't miss that, but it fixes nothing. That will unify the 40 bytes
> with 8 bytes, we're still growing the bio by a LOT. And we can't even
> nicely hide this behind some ifdef, sine distros enable everything and
> hence we're solving nothing by doing that.

We'll just need to handle it exactly like the integrity stuff. We only
allocate the extra bits when the underlying device indicates that it's
required and desired.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 2/3] block: verify data when endio
  2019-03-29 14:40     ` Bob Liu
@ 2019-03-29 14:47       ` Jens Axboe
  2019-03-30  0:20       ` Andreas Dilger
  1 sibling, 0 replies; 26+ messages in thread
From: Jens Axboe @ 2019-03-29 14:47 UTC (permalink / raw)
  To: Bob Liu, linux-block
  Cc: linux-xfs, linux-fsdevel, martin.petersen, shirley.ma,
	allison.henderson, david, darrick.wong, hch, adilger, tytso

On 3/29/19 8:40 AM, Bob Liu wrote:
> On 3/29/19 10:28 PM, Jens Axboe wrote:
>> On 3/29/19 8:23 AM, Bob Liu wrote:
>>> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
>>> index d66bf5f32610..e9f25f162138 100644
>>> --- a/include/linux/blk_types.h
>>> +++ b/include/linux/blk_types.h
>>> @@ -18,6 +18,7 @@ struct block_device;
>>>  struct io_context;
>>>  struct cgroup_subsys_state;
>>>  typedef void (bio_end_io_t) (struct bio *);
>>> +typedef int (bio_verifier_t) (struct bio *);
>>>  
>>>  /*
>>>   * Block error status values.  See block/blk-core:blk_errors for the details.
>>> @@ -187,6 +188,8 @@ struct bio {
>>>  		struct bio_integrity_payload *bi_integrity; /* data integrity */
>>>  #endif
>>>  	};
>>> +	bio_verifier_t		*bi_verifier;	/* verify callback when endio */
>>> +	struct work_struct      bi_work;	/* I/O completion */
>>>  
>>>  	unsigned short		bi_vcnt;	/* how many bio_vec's */
>>
>> I told you this for the initial posting, and the objection still stands.
>> Adding 40 bytes to struct bio is a no-go.
>>
> 
> Yes, I know that.
> Already list in Todo:
> - union with "struct bio_integrity_payload *bi_integrity" to save bio space.

As written in the other email, that solves _nothing_. So in terms of RFC, I'm
letting you know that adding anything to the bio for this issue is going to
be a non-starter.

> This is only a rfc patch, because not sure any other objection of this
> passdown-verify solution..  Will consider more about the bio space if
> there is agreement of the passdown-verify way.

OK, also see other mail on a potential route for the bio space that
doesn't increase the size for the bio at all.

You and Martin keep claiming this isn't an issue, but it is. Hence I'm
just making it as clear as I can that bumping the bio size for this
isn't going to fly.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 2/3] block: verify data when endio
  2019-03-29 14:46         ` Martin K. Petersen
@ 2019-03-29 14:50           ` Jens Axboe
  2019-03-29 14:51             ` Martin K. Petersen
  0 siblings, 1 reply; 26+ messages in thread
From: Jens Axboe @ 2019-03-29 14:50 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Bob Liu, linux-block, linux-xfs, linux-fsdevel, shirley.ma,
	allison.henderson, david, darrick.wong, hch, adilger, tytso

On 3/29/19 8:46 AM, Martin K. Petersen wrote:
> 
> Jens,
> 
>> I didn't miss that, but it fixes nothing. That will unify the 40 bytes
>> with 8 bytes, we're still growing the bio by a LOT. And we can't even
>> nicely hide this behind some ifdef, sine distros enable everything and
>> hence we're solving nothing by doing that.
> 
> We'll just need to handle it exactly like the integrity stuff. We only
> allocate the extra bits when the underlying device indicates that it's
> required and desired.

The integrity stuff still has that nasty pointer in there. It'd be nice
to get rid of that as well, and hide it all in a parent container of the
bio.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 2/3] block: verify data when endio
  2019-03-29 14:50           ` Jens Axboe
@ 2019-03-29 14:51             ` Martin K. Petersen
  2019-03-29 14:52               ` Jens Axboe
  0 siblings, 1 reply; 26+ messages in thread
From: Martin K. Petersen @ 2019-03-29 14:51 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Martin K. Petersen, Bob Liu, linux-block, linux-xfs,
	linux-fsdevel, shirley.ma, allison.henderson, david,
	darrick.wong, hch, adilger, tytso


Jens,

> The integrity stuff still has that nasty pointer in there. It'd be
> nice to get rid of that as well, and hide it all in a parent container
> of the bio.

That's fine with me. We really didn't have that option a decade ago.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 2/3] block: verify data when endio
  2019-03-29 14:51             ` Martin K. Petersen
@ 2019-03-29 14:52               ` Jens Axboe
  2019-03-29 15:00                 ` Bob Liu
  0 siblings, 1 reply; 26+ messages in thread
From: Jens Axboe @ 2019-03-29 14:52 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Bob Liu, linux-block, linux-xfs, linux-fsdevel, shirley.ma,
	allison.henderson, david, darrick.wong, hch, adilger, tytso

On 3/29/19 8:51 AM, Martin K. Petersen wrote:
> 
> Jens,
> 
>> The integrity stuff still has that nasty pointer in there. It'd be
>> nice to get rid of that as well, and hide it all in a parent container
>> of the bio.
> 
> That's fine with me. We really didn't have that option a decade ago.

Right, but now we do. Hence trying to nudge things in that direction,
instead of adding another pointer to struct bio for a new use case. I'd
much rather go that route, instead of unionizing with another pointer
that should be going away instead.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 2/3] block: verify data when endio
  2019-03-29 14:52               ` Jens Axboe
@ 2019-03-29 15:00                 ` Bob Liu
  2019-03-29 15:03                   ` Jens Axboe
  0 siblings, 1 reply; 26+ messages in thread
From: Bob Liu @ 2019-03-29 15:00 UTC (permalink / raw)
  To: Jens Axboe, Martin K. Petersen
  Cc: linux-block, linux-xfs, linux-fsdevel, shirley.ma,
	allison.henderson, david, darrick.wong, hch, adilger, tytso

On 3/29/19 10:52 PM, Jens Axboe wrote:
> On 3/29/19 8:51 AM, Martin K. Petersen wrote:
>>
>> Jens,
>>
>>> The integrity stuff still has that nasty pointer in there. It'd be
>>> nice to get rid of that as well, and hide it all in a parent container
>>> of the bio.
>>
>> That's fine with me. We really didn't have that option a decade ago.
> 
> Right, but now we do. Hence trying to nudge things in that direction,
> instead of adding another pointer to struct bio for a new use case. I'd
> much rather go that route, instead of unionizing with another pointer
> that should be going away instead.
> 

No problem, will update in next version.
Let me wait a while for feedback about the verify callback.
Thank you!

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 2/3] block: verify data when endio
  2019-03-29 15:00                 ` Bob Liu
@ 2019-03-29 15:03                   ` Jens Axboe
  2019-03-30  2:17                     ` Martin K. Petersen
  0 siblings, 1 reply; 26+ messages in thread
From: Jens Axboe @ 2019-03-29 15:03 UTC (permalink / raw)
  To: Bob Liu, Martin K. Petersen
  Cc: linux-block, linux-xfs, linux-fsdevel, shirley.ma,
	allison.henderson, david, darrick.wong, hch, adilger, tytso

On 3/29/19 9:00 AM, Bob Liu wrote:
> On 3/29/19 10:52 PM, Jens Axboe wrote:
>> On 3/29/19 8:51 AM, Martin K. Petersen wrote:
>>>
>>> Jens,
>>>
>>>> The integrity stuff still has that nasty pointer in there. It'd be
>>>> nice to get rid of that as well, and hide it all in a parent container
>>>> of the bio.
>>>
>>> That's fine with me. We really didn't have that option a decade ago.
>>
>> Right, but now we do. Hence trying to nudge things in that direction,
>> instead of adding another pointer to struct bio for a new use case. I'd
>> much rather go that route, instead of unionizing with another pointer
>> that should be going away instead.
>>
> 
> No problem, will update in next version.
> Let me wait a while for feedback about the verify callback.
> Thank you!

I think you're still missing my point. If you go the bio_set and
container-of-bio route, then you can do whatever you want. You will not
need a callback in the bio, you will just have a private end_io function
for that particular bio that does the verification.
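
For reference, a rough sketch of the bio_set + container_of route described
above (all names and fields here are illustrative, not from this series):

/* Illustrative sketch only: the verification state lives in a
 * filesystem-private wrapper allocated from a dedicated bio_set, so
 * struct bio itself does not grow at all.
 */
struct xfs_verify_bio {
	struct xfs_buf	*bp;	/* object the read is verified against */
	struct bio	bio;	/* must be last: allocated via front_pad */
};

static struct bio_set xfs_verify_bioset;

/* at mount/module init time:
 *	bioset_init(&xfs_verify_bioset, 64,
 *		    offsetof(struct xfs_verify_bio, bio), BIOSET_NEED_BVECS);
 */

static void xfs_verify_end_io(struct bio *bio)
{
	struct xfs_verify_bio *vbio =
		container_of(bio, struct xfs_verify_bio, bio);

	if (!bio->bi_status)
		vbio->bp->b_ops->verify_read(vbio->bp);

	/* ...then complete the buffer I/O exactly as the normal
	 * xfs_buf_bio_end_io() path would */
}

static struct bio *xfs_alloc_verify_bio(struct xfs_buf *bp, unsigned int nr_vecs)
{
	struct bio *bio = bio_alloc_bioset(GFP_NOIO, nr_vecs, &xfs_verify_bioset);

	container_of(bio, struct xfs_verify_bio, bio)->bp = bp;
	bio->bi_end_io = xfs_verify_end_io;
	return bio;
}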

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 2/3] block: verify data when endio
  2019-03-29 14:23 ` [PATCH v3 2/3] block: verify data when endio Bob Liu
  2019-03-29 14:28   ` Jens Axboe
@ 2019-03-29 15:39   ` Ming Lei
  1 sibling, 0 replies; 26+ messages in thread
From: Ming Lei @ 2019-03-29 15:39 UTC (permalink / raw)
  To: Bob Liu
  Cc: linux-block, open list:XFS FILESYSTEM, Linux FS Devel,
	Martin K. Petersen, shirley.ma, allison.henderson, Dave Chinner,
	Darrick J. Wong, Christoph Hellwig, Andreas Dilger, Jens Axboe,
	Ted Tso

Hi Bob,

On Fri, Mar 29, 2019 at 10:25 PM Bob Liu <bob.liu@oracle.com> wrote:
>
> Call verify callback same as bio integrity.
> If verify fail, drivers like MD will try other mirrors until get a correct
> one or return failure after all mirrors are tried.
> The MD driver already works like this, so no extra changed.
>
> Todo:
> - union with "struct bio_integrity_payload *bi_integrity" to save bio space.
>
> Signed-off-by: Bob Liu <bob.liu@oracle.com>
> ---
>  block/bio-integrity.c     | 45 +++++++++++++++++++++++++++++++++++++++
>  block/bio.c               |  3 +++
>  block/blk-core.c          |  4 ++++
>  block/blk.h               |  8 +++++++
>  block/bounce.c            |  1 +
>  drivers/md/raid1.c        |  1 +
>  drivers/md/raid5-ppl.c    |  1 +
>  include/linux/blk_types.h |  5 +++++
>  8 files changed, 68 insertions(+)
>
> diff --git a/block/bio-integrity.c b/block/bio-integrity.c
> index 1b633a3526d4..90a47ad31dbf 100644
> --- a/block/bio-integrity.c
> +++ b/block/bio-integrity.c
> @@ -372,6 +372,51 @@ bool __bio_integrity_endio(struct bio *bio)
>         return true;
>  }
>
> +/**
> + * bio_verify_fn - Verify I/O completion worker
> + * @work:      Work struct stored in bio to be verified
> + *
> + * Description: This workqueue function is called to complete a READ
> + * request.  The function call verifier callack that fs pass down
> + * and then calls the original bio end_io function.
> + */
> +static void bio_verify_fn(struct work_struct *work)
> +{
> +       struct bio *bio =
> +               container_of(work, struct bio, bi_work);
> +
> +       bio->bi_status = bio->bi_verifier(bio);
> +       /* Clear flag if verify succeed to avoid verifing
> +        * it unnecessary by parent bio
> +        */
> +       if (!bio->bi_status)
> +               bio->bi_opf &= ~REQ_VERIFY;
> +       bio_endio(bio);
> +}

1) why is bio->bi_verifier run from workqueue context instead of being
called directly
from bio_endio()?

2) the following parts of bio_endio(bio) may be run twice, will this
way work correctly?

        if (!bio_remaining_done(bio))
                return;
        if (!bio_integrity_endio(bio))
                return;

> +
> +/**
> + * __bio_verify_endio - Verify I/O completion function
> + * @bio:       Protected bio
> + *
> + * Description: Completion for verify I/O
> + *
> + * Normally I/O completion is done in interrupt context.  However,
> + * verifying I/O is a time-consuming task which must be run
> + * in process context. This function postpones completion
> + * accordingly.
> + */
> +bool __bio_verify_endio(struct bio *bio)
> +{
> +       if (bio_op(bio) == REQ_OP_READ && !bio->bi_status &&
> +           (bio->bi_opf & REQ_VERIFY) && bio->bi_verifier) {
> +               INIT_WORK(&bio->bi_work, bio_verify_fn);
> +               queue_work(kintegrityd_wq, &bio->bi_work);
> +               return false;
> +       }
> +
> +       return true;
> +}
> +
>  /**
>   * bio_integrity_advance - Advance integrity vector
>   * @bio:       bio whose integrity vector to update
> diff --git a/block/bio.c b/block/bio.c
> index 4db1008309ed..8928806acda6 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -608,6 +608,7 @@ void __bio_clone_fast(struct bio *bio, struct bio *bio_src)
>         bio->bi_write_hint = bio_src->bi_write_hint;
>         bio->bi_iter = bio_src->bi_iter;
>         bio->bi_io_vec = bio_src->bi_io_vec;
> +       bio->bi_verifier = bio_src->bi_verifier;

->bi_verifier is cloned along the I/O stack; is this .bi_verifier capable
of covering a cloned bio which may be just a part of the original bio,
such as after a bio split?

Given that in the 3rd patch .bi_verifier is implemented in the fs, I guess
it should only work for bios submitted from the fs, because the verifier
uses the fs's knowledge.

If so, it looks unnecessary to clone .bi_verifier here. Otherwise, could you
explain a bit how ->bi_verifier can work for the cloned bio in the stack?


Thanks,
Ming Lei

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 1/3] block: introduce submit_bio_verify()
  2019-03-29 14:23 ` [PATCH v3 1/3] block: introduce submit_bio_verify() Bob Liu
@ 2019-03-29 22:22   ` Andreas Dilger
  2019-03-30  1:43     ` Martin K. Petersen
  0 siblings, 1 reply; 26+ messages in thread
From: Andreas Dilger @ 2019-03-29 22:22 UTC (permalink / raw)
  To: Bob Liu
  Cc: linux-block, linux-xfs, linux-fsdevel, Martin Petersen,
	shirley.ma, allison.henderson, david, darrick.wong, hch, axboe,
	tytso

On Mar 29, 2019, at 8:23 AM, Bob Liu <bob.liu@oracle.com> wrote:
> 
> This patch adds a function verifier callback to submit_bio.  The
> filesystem layer can use submit_bio_verify to pass a call back
> to the block layer which can then be used to verify if the data
> read is correct.

How does this interact (if at all) with bio_integrity_verify() code?
Does it mean if e.g. XFS is on storage with T10-PI that only one or
the other can be used, or will it be possible to chain the verify
callbacks?

Cheers, Andreas

> Signed-off-by: Bob Liu <bob.liu@oracle.com>
> ---
> block/blk-core.c    | 13 ++++++++++---
> include/linux/bio.h |  2 ++
> 2 files changed, 12 insertions(+), 3 deletions(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 6b78ec56a4f2..d265d2924c32 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -1156,15 +1156,16 @@ blk_qc_t direct_make_request(struct bio *bio)
> EXPORT_SYMBOL_GPL(direct_make_request);
> 
> /**
> - * submit_bio - submit a bio to the block device layer for I/O
> + * submit_bio_verify - submit a bio to the block device layer for I/O
>  * @bio: The &struct bio which describes the I/O
>  *
> - * submit_bio() is very similar in purpose to generic_make_request(), and
> + * submit_bio_verify() is very similar in purpose to generic_make_request(), and
>  * uses that function to do most of the work. Both are fairly rough
>  * interfaces; @bio must be presetup and ready for I/O.
>  *
>  */
> -blk_qc_t submit_bio(struct bio *bio)
> +blk_qc_t submit_bio_verify(struct bio *bio,
> +		int (*verifier_cb_func)(struct bio *))
> {
> 	/*
> 	 * If it's a regular read/write or a barrier with data attached,
> @@ -1197,6 +1198,12 @@ blk_qc_t submit_bio(struct bio *bio)
> 
> 	return generic_make_request(bio);
> }
> +EXPORT_SYMBOL(submit_bio_verify);
> +
> +blk_qc_t submit_bio(struct bio *bio)
> +{
> +	return submit_bio_verify(bio, NULL);
> +}
> EXPORT_SYMBOL(submit_bio);
> 
> /**
> diff --git a/include/linux/bio.h b/include/linux/bio.h
> index 7380b094dcca..ddaadab74dcf 100644
> --- a/include/linux/bio.h
> +++ b/include/linux/bio.h
> @@ -398,6 +398,8 @@ static inline struct bio *bio_kmalloc(gfp_t gfp_mask, unsigned int nr_iovecs)
> 	return bio_alloc_bioset(gfp_mask, nr_iovecs, NULL);
> }
> 
> +extern blk_qc_t submit_bio_verify(struct bio *,
> +		int (*verifier_cb_func)(struct bio *));
> extern blk_qc_t submit_bio(struct bio *);
> 
> extern void bio_endio(struct bio *);
> --
> 2.17.1
> 


Cheers, Andreas







^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 2/3] block: verify data when endio
  2019-03-29 14:40     ` Bob Liu
  2019-03-29 14:47       ` Jens Axboe
@ 2019-03-30  0:20       ` Andreas Dilger
  1 sibling, 0 replies; 26+ messages in thread
From: Andreas Dilger @ 2019-03-30  0:20 UTC (permalink / raw)
  To: Bob Liu
  Cc: Jens Axboe, linux-block, linux-xfs, linux-fsdevel,
	Martin Petersen, shirley.ma, Allison Henderson, Dave Chinner,
	Darrick J. Wong, Christoph Hellwig, Theodore Ts'o,
	Li Dongyang, Li Xi


On Mar 29, 2019, at 8:40 AM, Bob Liu <bob.liu@oracle.com> wrote:
> 
> On 3/29/19 10:28 PM, Jens Axboe wrote:
>> On 3/29/19 8:23 AM, Bob Liu wrote:
>>> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
>>> index d66bf5f32610..e9f25f162138 100644
>>> --- a/include/linux/blk_types.h
>>> +++ b/include/linux/blk_types.h
>>> @@ -18,6 +18,7 @@ struct block_device;
>>> struct io_context;
>>> struct cgroup_subsys_state;
>>> typedef void (bio_end_io_t) (struct bio *);
>>> +typedef int (bio_verifier_t) (struct bio *);
>>> 
>>> /*
>>>  * Block error status values.  See block/blk-core:blk_errors for the details.
>>> @@ -187,6 +188,8 @@ struct bio {
>>> 		struct bio_integrity_payload *bi_integrity; /* data integrity */
>>> #endif
>>> 	};
>>> +	bio_verifier_t		*bi_verifier;	/* verify callback when endio */
>>> +	struct work_struct      bi_work;	/* I/O completion */
>>> 
>>> 	unsigned short		bi_vcnt;	/* how many bio_vec's */
>> 
>> I told you this for the initial posting, and the objection still stands.
>> Adding 40 bytes to struct bio is a no-go.
> 
> Yes, I know that.
> Already list in Todo:
> - union with "struct bio_integrity_payload *bi_integrity" to save bio space.
> 
> This is only a rfc patch, because not sure any other objection of this passdown-verify solution..
> Will consider more about the bio space if there is agreement of the passdown-verify way.

It is best if "rfc patch" is labelled with "RFC" in the subject when there are known
TODO issues with the patch, to avoid this kind of confusion... :-)


I worked on a similar patch that adds generate and verify callback functions to struct
bio_integrity_payload rather than directly in the bio, to avoid increasing bio size.

This was implemented by registering the callback functions via bio_integrity_prep_fn()
and storing them in bio_integrity_payload before submit_bio() was called, rather
than passing them down the call stack as would be needed here.

This would allow having a per-bio callback without changing the whole call chain.
Instead of XFS calling the new submit_bio_verify(), it calls bio_integrity_prep_fn(),
which doesn't seem materially different in terms of code complexity (i.e. there will
still need to be some changes to XFS but it doesn't affect other code).

I've attached the patch below.  It was based on an older kernel and I've just done a
quick port to the latest kernel (untested), so it is entirely possible that there is
some breakage but it shows the general approach.
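
As a hedged sketch of what that caller-side change might look like (these
XFS hunks are illustrative only and are not part of the attached patch):

/* Hypothetical XFS-side usage, for illustration.  The verify callback
 * wraps the existing b_ops->verify_read() and is registered on the bio
 * at submission time instead of calling submit_bio_verify().  It relies
 * on the iter->bio field added by the attached patch; a real version
 * would also need to cope with being called once per integrity interval.
 */
static blk_status_t xfs_buf_verify_iter(struct blk_integrity_iter *iter)
{
	struct xfs_buf	*bp = iter->bio->bi_private;

	bp->b_ops->verify_read(bp);
	return bp->b_error ? BLK_STS_IOERR : BLK_STS_OK;
}

	/* in xfs_buf_ioapply_map(), before submission: */
	bio_integrity_prep_fn(bio, NULL, xfs_buf_verify_iter);
	submit_bio(bio);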

Cheers, Andreas



[-- Attachment #1.2: 0001-block-add-optional-integrity-callbacks-for-bio.patch --]
[-- Type: application/octet-stream, Size: 9422 bytes --]

From 1d999a8597f93ff36d4fbe08ba3a68f81baa5f10 Mon Sep 17 00:00:00 2001
From: Andreas Dilger <adilger@dilger.ca>
Date: Fri, 29 Mar 2019 17:58:56 -0600
Subject: [RFC PATCH] block: add optional integrity callbacks for bio

Add optional integrity generate and verify callback functions for
a submitted bio.  The callbacks are passed to bio_integrity_prep_fn()
and bio_integrity_alloc_fn() and stored in struct bio_integrity_payload
rather than the bio to avoid increasing the size of struct bio when
they are not needed.

The per-bio integrity generate and verify functions take priority
over the ones registered on the block device.

Signed-off-by: Li Xi <lixi@ddn.com>
Signed-off-by: Li Dongyang <dongyangli@ddn.com>
Signed-off-by: Andreas Dilger <adilger@dilger.ca>
---
 block/bio-integrity.c  | 85 ++++++++++++++++++++++++++++++++++++--------------
 include/linux/bio.h    |  4 +++
 include/linux/blkdev.h |  2 ++
 3 files changed, 68 insertions(+), 23 deletions(-)

diff --git a/block/bio-integrity.c b/block/bio-integrity.c
index 1b633a3..5e670ad 100644
--- a/block/bio-integrity.c
+++ b/block/bio-integrity.c
@@ -39,18 +39,23 @@ void blk_flush_integrity(void)
 }
 
 /**
- * bio_integrity_alloc - Allocate integrity payload and attach it to bio
+ * bio_integrity_alloc_fn - Allocate integrity payload and attach it to bio
  * @bio:	bio to attach integrity metadata to
  * @gfp_mask:	Memory allocation mask
  * @nr_vecs:	Number of integrity metadata scatter-gather elements
+ * @generate_fn: Optional callback function to generate integrity data on write
+ * @verify_fn:	Optional callback function to verify integrity data on read
  *
  * Description: This function prepares a bio for attaching integrity
- * metadata.  nr_vecs specifies the maximum number of pages containing
- * integrity metadata that can be attached.
+ * metadata.  @nr_vecs specifies the maximum number of pages containing
+ * integrity metadata that can be attached.  If the @generate_fn and/or
+ * @verify_fn are not NULL, then they will be stored in bio_integrity_payload
+ * for use when bio_integrity_prep() and bio_integrity_verify() are called.
  */
-struct bio_integrity_payload *bio_integrity_alloc(struct bio *bio,
-						  gfp_t gfp_mask,
-						  unsigned int nr_vecs)
+struct bio_integrity_payload *bio_integrity_alloc_fn(struct bio *bio,
+					gfp_t gfp_mask, unsigned int nr_vecs,
+					integrity_processing_fn *generate_fn,
+					integrity_processing_fn *verify_fn)
 {
 	struct bio_integrity_payload *bip;
 	struct bio_set *bs = bio->bi_pool;
@@ -85,6 +90,8 @@ struct bio_integrity_payload *bio_integrity_alloc(struct bio *bio,
 	}
 
 	bip->bip_bio = bio;
+	bip->bip_generate_fn = generate_fn;
+	bip->bip_verify_fn = verify_fn;
 	bio->bi_integrity = bip;
 	bio->bi_opf |= REQ_INTEGRITY;
 
@@ -93,6 +100,14 @@ struct bio_integrity_payload *bio_integrity_alloc(struct bio *bio,
 	mempool_free(bip, &bs->bio_integrity_pool);
 	return ERR_PTR(-ENOMEM);
 }
+EXPORT_SYMBOL(bio_integrity_alloc_fn);
+
+struct bio_integrity_payload *bio_integrity_alloc(struct bio *bio,
+						  gfp_t gfp_mask,
+						  unsigned int nr_vecs)
+{
+	return bio_integrity_alloc_fn(bio, gfp_mask, nr_vecs, NULL, NULL);
+}
 EXPORT_SYMBOL(bio_integrity_alloc);
 
 /**
@@ -187,6 +202,8 @@ static blk_status_t bio_integrity_process(struct bio *bio,
 
 		iter.data_buf = kaddr + bv.bv_offset;
 		iter.data_size = bv.bv_len;
+		iter.bio = bio;
+		iter.bi_idx = proc_iter->bi_idx;
 
 		ret = proc_fn(&iter);
 		if (ret) {
@@ -200,18 +217,23 @@ static blk_status_t bio_integrity_process(struct bio *bio,
 }
 
 /**
- * bio_integrity_prep - Prepare bio for integrity I/O
- * @bio:	bio to prepare
+ * bio_integrity_prep_fn - Prepare bio for integrity I/O
+ * @bio:		bio to prepare
+ * @generate_fn:	optional function called on write to generate integrity
+ * @verify_fn:		optional function called on read to verify integrity
  *
  * Description:  Checks if the bio already has an integrity payload attached.
  * If it does, the payload has been generated by another kernel subsystem,
  * and we just pass it through. Otherwise allocates integrity payload.
  * The bio must have data direction, target device and start sector set priot
  * to calling.  In the WRITE case, integrity metadata will be generated using
- * the block device's integrity function.  In the READ case, the buffer
- * will be prepared for DMA and a suitable end_io handler set up.
+ * either the passed-in @generate_fn (if set), the payload generate_fn (if
+ * set), or the block device's integrity function.  In the READ case, the
+ * buffer will be prepared for DMA and a suitable end_io handler set up.
  */
-bool bio_integrity_prep(struct bio *bio)
+bool bio_integrity_prep_fn(struct bio *bio,
+			   integrity_processing_fn *generate_fn,
+			   integrity_processing_fn *verify_fn)
 {
 	struct bio_integrity_payload *bip;
 	struct blk_integrity *bi = blk_get_integrity(bio->bi_disk);
@@ -236,13 +258,21 @@ bool bio_integrity_prep(struct bio *bio)
 	if (bio_integrity(bio))
 		return true;
 
-	if (bio_data_dir(bio) == READ) {
-		if (!bi->profile->verify_fn ||
-		    !(bi->flags & BLK_INTEGRITY_VERIFY))
+	if (bio_data_dir(bio) == WRITE) {
+		if (!generate_fn) {
+			generate_fn = bip->bip_generate_fn;
+			if (!generate_fn)
+				generate_fn = bi->profile->generate_fn;
+		}
+		if (!generate_fn || !(bi->flags & BLK_INTEGRITY_GENERATE))
 			return true;
 	} else {
-		if (!bi->profile->generate_fn ||
-		    !(bi->flags & BLK_INTEGRITY_GENERATE))
+		if (!verify_fn) {
+			verify_fn = bip->bip_verify_fn;
+			if (!verify_fn)
+				verify_fn = bi->profile->verify_fn;
+		}
+		if (!verify_fn || !(bi->flags & BLK_INTEGRITY_VERIFY))
 			return true;
 	}
 	intervals = bio_integrity_intervals(bi, bio_sectors(bio));
@@ -261,7 +291,8 @@ bool bio_integrity_prep(struct bio *bio)
 	nr_pages = end - start;
 
 	/* Allocate bio integrity payload and integrity vectors */
-	bip = bio_integrity_alloc(bio, GFP_NOIO, nr_pages);
+	bip = bio_integrity_alloc_fn(bio, GFP_NOIO, nr_pages, generate_fn,
+				     verify_fn);
 	if (IS_ERR(bip)) {
 		printk(KERN_ERR "could not allocate data integrity bioset\n");
 		kfree(buf);
@@ -303,12 +334,11 @@ bool bio_integrity_prep(struct bio *bio)
 	}
 
 	/* Auto-generate integrity metadata if this is a write */
-	if (bio_data_dir(bio) == WRITE) {
-		bio_integrity_process(bio, &bio->bi_iter,
-				      bi->profile->generate_fn);
-	} else {
+	if (bio_data_dir(bio) == WRITE)
+		bio_integrity_process(bio, &bio->bi_iter, generate_fn);
+	else
 		bip->bio_iter = bio->bi_iter;
-	}
+
 	return true;
 
 err_end_io:
@@ -317,6 +347,12 @@ bool bio_integrity_prep(struct bio *bio)
 	return false;
 
 }
+EXPORT_SYMBOL(bio_integrity_prep_fn);
+
+bool bio_integrity_prep(struct bio *bio)
+{
+	return bio_integrity_prep_fn(bio, NULL, NULL);
+}
 EXPORT_SYMBOL(bio_integrity_prep);
 
 /**
@@ -340,7 +376,8 @@ static void bio_integrity_verify_fn(struct work_struct *work)
 	 * it's original position.
 	 */
 	bio->bi_status = bio_integrity_process(bio, &bip->bio_iter,
-						bi->profile->verify_fn);
+					       bip->bip_verify_fn ?:
+					       bi->profile->verify_fn);
 	bio_integrity_free(bio);
 	bio_endio(bio);
 }
@@ -431,6 +468,8 @@ int bio_integrity_clone(struct bio *bio, struct bio *bio_src,
 
 	bip->bip_vcnt = bip_src->bip_vcnt;
 	bip->bip_iter = bip_src->bip_iter;
+	bip->bip_generate_fn = bip_src->bip_generate_fn;
+	bip->bip_verify_fn = bip_src->bip_verify_fn;
 
 	return 0;
 }
diff --git a/include/linux/bio.h b/include/linux/bio.h
index bb6090a..79e0c2f 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -315,6 +315,8 @@ struct bio_integrity_payload {
 	struct work_struct	bip_work;	/* I/O completion */
 
 	struct bio_vec		*bip_vec;
+	integrity_processing_fn	*bip_generate_fn;  /* called on write */
+	integrity_processing_fn	*bip_verify_fn;	   /* called on read */
 	struct bio_vec		bip_inline_vecs[0];/* embedded bvec array */
 };
 
@@ -754,8 +756,10 @@ static inline bool bioset_initialized(struct bio_set *bs)
 		bip_for_each_vec(_bvl, _bio->bi_integrity, _iter)
 
 extern struct bio_integrity_payload *bio_integrity_alloc(struct bio *, gfp_t, unsigned int);
+extern struct bio_integrity_payload *bio_integrity_alloc_fn(struct bio *, gfp_t, unsigned int, integrity_processing_fn *generate_fn, integrity_processing_fn *verify_fn);
 extern int bio_integrity_add_page(struct bio *, struct page *, unsigned int, unsigned int);
 extern bool bio_integrity_prep(struct bio *);
+extern bool bio_integrity_prep_fn(struct bio *, integrity_processing_fn *generate_fn, integrity_processing_fn *verify_fn);
 extern void bio_integrity_advance(struct bio *, unsigned int);
 extern void bio_integrity_trim(struct bio *);
 extern int bio_integrity_clone(struct bio *, struct bio *, gfp_t);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 5c58a3b..be9a833 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1471,7 +1471,9 @@ struct blk_integrity_iter {
 	sector_t		seed;
 	unsigned int		data_size;
 	unsigned short		interval;
+	unsigned short		bi_idx;
 	const char		*disk_name;
+	struct bio		*bio;
 };
 
 typedef blk_status_t (integrity_processing_fn) (struct blk_integrity_iter *);
-- 
1.8.0



^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 1/3] block: introduce submit_bio_verify()
  2019-03-29 22:22   ` Andreas Dilger
@ 2019-03-30  1:43     ` Martin K. Petersen
  0 siblings, 0 replies; 26+ messages in thread
From: Martin K. Petersen @ 2019-03-30  1:43 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Bob Liu, linux-block, linux-xfs, linux-fsdevel, Martin Petersen,
	shirley.ma, allison.henderson, david, darrick.wong, hch, axboe,
	tytso


Andreas,

> How does this interact (if at all) with bio_integrity_verify() code?
> Does it mean if e.g. XFS is on storage with T10-PI that only one or
> the other can be used,

Yes. Although if your storage is sophisticated enough to be T10 PI
capable, you are probably using redundancy inside the array and
therefore not MD.

But I think there are other problems with the callback approach. See my
impending email.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 2/3] block: verify data when endio
  2019-03-29 15:03                   ` Jens Axboe
@ 2019-03-30  2:17                     ` Martin K. Petersen
  2019-03-31 22:00                       ` Dave Chinner
  0 siblings, 1 reply; 26+ messages in thread
From: Martin K. Petersen @ 2019-03-30  2:17 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Bob Liu, Martin K. Petersen, linux-block, linux-xfs,
	linux-fsdevel, shirley.ma, allison.henderson, david,
	darrick.wong, hch, adilger, tytso


Jens,

> You will not need a callback in the bio, you will just have a private
> end_io function for that particular bio that does the verification.

The saving grace for the integrity stuff is that once all the child bios
complete, we no longer care about their completion context and we have
the parent bio submitted by the filesystem we can use to verify the PI
against.

For the redundant copy use case, however, I am guessing that the
filesystem folks would want the same thing. I.e. verify the structure of
the data received once the parent bio completes. However, at that point
all the slicing and dicing completion state is lost. And thus there is
no way to know that the failure was due to mirror B two layers down the
stack. Nor is there any way to retry the I/O without having recorded a
completion breadcrumb trail for every child bio.

The other approach is the callback where each stacking layer--which
knows about redundancy--can do verification of a bio upon completion.
However, that suffers from another headache in that the I/O can get
arbitrarily sliced and diced in units of 512 bytes. And thus the
filesystem verification function would have to be able to verify an
arbitrary multiple of 512 bytes at any 512-byte offset inside a
submitted bio. That would work, but I don't think that's the pony the
filesystem folks were wishing for?

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 2/3] block: verify data when endio
  2019-03-30  2:17                     ` Martin K. Petersen
@ 2019-03-31 22:00                       ` Dave Chinner
  2019-04-01 14:04                         ` Martin K. Petersen
  0 siblings, 1 reply; 26+ messages in thread
From: Dave Chinner @ 2019-03-31 22:00 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Jens Axboe, Bob Liu, linux-block, linux-xfs, linux-fsdevel,
	shirley.ma, allison.henderson, darrick.wong, hch, adilger, tytso

On Fri, Mar 29, 2019 at 10:17:22PM -0400, Martin K. Petersen wrote:
> 
> Jens,
> 
> > You will not need a callback in the bio, you will just have a private
> > end_io function for that particular bio that does the verification.
> 
> The saving grace for the integrity stuff is that once all the child bios
> complete, we no longer care about their completion context and we have
> the parent bio submitted by the filesystem we can use to verify the PI
> against.
> 
> For the redundant copy use case, however, I am guessing that the
> filesystem folks would want the same thing. I.e. verify the structure of
> the data received once the parent bio completes. However, at that point
> all the slicing and dicing completion state is lost.

Right, that's the problem. We already run the verifier on completion
of the bio that the filesystem sends down the stack, but that then
means....

> And thus there is
> no way to know that the failure was due to mirror B two layers down the
> stack. Nor is there any way to retry the I/O without having recorded a
> completion breadcrumb trail for every child bio.

.... we have this problem when the verifier fails. i.e. the bio
needs to contain sufficient information for the filesystem to
implement some robust retry mechanism without having any clue what
lies below it or what failed.

> The other approach is the callback where each stacking layer--which
> knows about redundancy--can do verification of a bio upon completion.

*nod*

> However, that suffers from another headache in that the I/O can get
> arbitrarily sliced and diced in units of 512 bytes.

Right, but we don't need support that insane case. Indeed, if
it wasn't already obvious, we _can't support it_ because the
filesystem verifiers can't do partial verification. i.e.  part of
the verification is CRC validation of the whole bio, not to mention
that filesystem structure fragments cannot be safely parsed,
interpreted and/or verified without the whole structure first being
read in.

This means the verifier is only useful if the entire IO can be
passed down to the next layer. IOWs, if the bio has to be sliced and
diced to be issued to the next layer down, then we have a hard stop
on verifier propagation. Put simply, the verifier can only be run at
the lowest layer that sees the whole parent bio context. Hence
sliced and diced child bios won't have the parent verifier attached
to them, and so we can ignore the whole "slice and dice" problem
altogether.

Further, arguing about slicing and dicing misses the key observation
that the filesystem can largely avoid slicing and dicing for most
common cases. i.e. the IO sizes (XFS metadata!) we are talking about
here are very small and well aligned to the underlying block devices,
and so are extremely unlikely to cross multi-device
boundaries.  And, of course, if the underlying device can't verify
the bio for whatever reason, we'll still do it at the filesystem IO
completion and so detect corruption like we do now.

IOWs, we need to look at this problem from a "whole stack" point of
view, not just cry about how "bios are too flexible and so make this
too hard!". The filesystem greatly constrains the alignment and
slicing/dicing problem to the point where it should be non-existent,
we have a clearly defined hard stop where verifier propagation
terminates, and if all else fails we can still detect corruption at
the filesystem level just like we do now. The worst thing that
happens here is we give up the capability for automatic block device
recovery and repair of damaged copies, which we can't do right now,
so it's essentially status quo...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 2/3] block: verify data when endio
  2019-03-31 22:00                       ` Dave Chinner
@ 2019-04-01 14:04                         ` Martin K. Petersen
  2019-04-01 21:21                           ` Dave Chinner
  0 siblings, 1 reply; 26+ messages in thread
From: Martin K. Petersen @ 2019-04-01 14:04 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Martin K. Petersen, Jens Axboe, Bob Liu, linux-block, linux-xfs,
	linux-fsdevel, shirley.ma, allison.henderson, darrick.wong, hch,
	adilger, tytso


Hi Dave!

>> However, that suffers from another headache in that the I/O can get
>> arbitrarily sliced and diced in units of 512 bytes.
>
> Right, but we don't need to support that insane case. Indeed, if
> it wasn't already obvious, we _can't support it_ because the
> filesystem verifiers can't do partial verification. i.e.  part of
> the verification is CRC validation of the whole bio, not to mention
> that filesystem structure fragments cannot be safely parsed,
> interpreted and/or verified without the whole structure first being
> read in.

What I thought. There are some things I can verify by masking but it's
limited.

What about journal entries? Would they be validated with 512-byte
granularity or in bundles thereof? Only a problem during recovery, but
potentially a case where we care deeply about trying another copy if it
exists.

What I'm asking is if we should have a block size argument for the
verification? Or would you want to submit bios capped to the size you
care about and let the block layer take care of coalescing?

Validation of units bigger than the logical block size is an area which
our older Oracle HARD technology handles gracefully but which T10 PI has
been unable to address. So this is an area of particular interest to me,
although it's somewhat orthogonal to Bob's retry plumbing.

Another question for you wrt. retries: Once a copy has been identified
as bad and a good copy read from media, who does the rewrite? Does the
filesystem send a new I/O (which would overwrite all copies) or does the
retry plumbing own the responsibility of writing the good bio to the bad
location?

> IOWs, we need to look at this problem from a "whole stack" point of
> view, not just cry about how "bios are too flexible and so make this
> too hard!". The filesystem greatly constrains the alignment and
> slicing/dicing problem to the point where it should be non-existent,
> we have a clearly defined hard stop where verifier propagation
> terminates, and if all else fails we can still detect corruption at
> the filesystem level just like we do now. The worst thing that
> happens here is we give up the capability for automatic block device
> recovery and repair of damaged copies, which we can't do right now,
> so it's essentially status quo...

Having gone down the path of the one-to-many relationship when I did the
original heterogeneous I/O topology attempt, it's pure hell. Also dealt
with similar conundrums for the integrity stuff. So I don't like the
breadcrumb approach. Perfect is the enemy of good and all that.

And I am 100% in agreement on the careful alignment and not making
things complex for crazy use cases (although occasional straddling I/Os
are not as uncommon as we'd like to think). However, I do have concerns
about this particular feature when it comes to your status quo comment.

In order for us to build highly reliable systems, we have to have a
better building block than "this redundancy retry feature works most of
the time". So to me it is imperative that we provide hard guarantees
once a particular configuration has been set up and stacked. And if the
retry guarantee is somehow invalidated, then we really need to let the
user know about it.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 2/3] block: verify data when endio
  2019-04-01 14:04                         ` Martin K. Petersen
@ 2019-04-01 21:21                           ` Dave Chinner
  2019-04-03  2:45                             ` Martin K. Petersen
  0 siblings, 1 reply; 26+ messages in thread
From: Dave Chinner @ 2019-04-01 21:21 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Jens Axboe, Bob Liu, linux-block, linux-xfs, linux-fsdevel,
	shirley.ma, allison.henderson, darrick.wong, hch, adilger, tytso

On Mon, Apr 01, 2019 at 10:04:43AM -0400, Martin K. Petersen wrote:
> 
> Hi Dave!

Hi Martin!

> >> However, that suffers from another headache in that the I/O can get
> >> arbitrarily sliced and diced in units of 512 bytes.
> >
> > Right, but we don't need to support that insane case. Indeed, if
> > it wasn't already obvious, we _can't support it_ because the
> > filesystem verifiers can't do partial verification. i.e.  part of
> > the verification is CRC validation of the whole bio, not to mention
> > that filesystem structure fragments cannot be safely parsed,
> > interpreted and/or verified without the whole structure first being
> > read in.
> 
> What I thought. There are some things I can verify by masking but it's
> limited.
> 
> What about journal entries? Would they be validated with 512-byte
> granularity or in bundles thereof? Only a problem during recovery, but
> potentially a case where we care deeply about trying another copy if it
> exists.

Journal writes are padded to the specified mkfs alignment (i.e. the
"log stripe unit") and are checksummed individually, so they can be
explicitly configured not to cross device boundaries (e.g. 32kB
stripe unit on 64k chunk size means exactly 2 journal writes to each
chunk). I took this into account when I said "the filesystem can
constrain the alignment problem entirely at mkfs time"....

> What I'm asking is if we should have a block size argument for the
> verification? Or would you want to submit bios capped to the size you
> care about and let the block layer take care of coalescing?

Not sure what you mean by "capped to the size you care about". The
verifier attached to a bio will exactly match the size of the bio
being issued. AFAICT, coalescing with other bios in the request
queues should not affect how the completion of that bio is
handled by things like the RAID layers...

> Validation of units bigger than the logical block size is an area which
> our older Oracle HARD technology handles gracefully but which T10 PI has
> been unable to address. So this is an area of particular interest to me,
> although it's somewhat orthogonal to Bob's retry plumbing.
> 
> Another question for you wrt. retries: Once a copy has been identified
> as bad and a good copy read from media, who does the rewrite?

As far as I'm concerned, correcting bad copies is the responsibility
of the layer that manages the copies. It has nothing to do with the
filesystem.

> Does the
> filesystem send a new I/O (which would overwrite all copies) or does the
> retry plumbing own the responsibility of writing the good bio to the bad
> location?

There are so many varied storage algorithms and recovery options
(rewrite, partial rewrite, recalc parity/erasure codes and rewrite,
full stripe rewrite, rebuild onto hot spare due to too many errors,
etc) it doesn't make sense to only allow repair to be done by
completely error context-free rewriting from a higher layer. The
layer that owns the redundancy can make much better decisions about
repair.

And let's face it: we don't want to have to add complex IO retry
logic to every single filesystem.
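
As a rough sketch of that division of labour (with made-up helpers
and types, not real md/raid1 code), the layer that owns the copies
would drive both the retry and the repair, with the filesystem only
supplying the verify callback:

/*
 * Hedged sketch: struct mirror_set, num_mirrors(), read_from_mirror()
 * and repair_stale_mirrors() are hypothetical.  The redundancy layer
 * owns the retry loop and the repair decision; the filesystem never
 * sees the per-copy failures.
 */
static int read_with_verify(struct mirror_set *ms, struct bio *bio,
			    int (*verify)(struct bio *))
{
	int m, err = -EIO;

	for (m = 0; m < num_mirrors(ms); m++) {
		err = read_from_mirror(ms, m, bio);
		if (err)
			continue;               /* media error, try the next copy */
		if (!verify || verify(bio) == 0) {
			repair_stale_mirrors(ms, bio, m); /* rewrite bad copies */
			return 0;
		}
		err = -EILSEQ;                  /* copy read fine, failed verify */
	}
	return err;                             /* all copies exhausted */
}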

> > IOWs, we need to look at this problem from a "whole stack" point of
> > view, not just cry about how "bios are too flexible and so make this
> > too hard!". The filesystem greatly constrains the alignment and
> > slicing/dicing problem to the point where it should be non-existent,
> > we have a clearly defined hard stop where verifier propagation
> > terminates, and if all else fails we can still detect corruption at
> > the filesystem level just like we do now. The worst thing that
> > happens here is we give up the capability for automatic block device
> > recovery and repair of damaged copies, which we can't do right now,
> > so it's essentially status quo...
> 
> Having gone down the path of the one-to-many relationship when I did the
> original heterogeneous I/O topology attempt, it's pure hell. Also dealt
> with similar conundrums for the integrity stuff. So I don't like the
> breadcrumb approach. Perfect is the enemy of good and all that.
> 
> And I am 100% in agreement on the careful alignment and not making
> things complex for crazy use cases (although occasional straddling I/Os
> are not as uncommon as we'd like to think). However, I do have concerns
> about this particular feature when it comes to your status quo comment.

In what way? This is only one layer of defense from the filesystem
perspective. i.e. If the underlying device cannot verify or correct
the bad metadata it found from another copy/rebuild, then we still
have to fall back to the filesystem rebuilding the metadata. That's
what all the online scrubbing and repair code we're working on in
XFS does, and btrfs is already capable of doing this when CRC errors
are detected.  i.e. the post-IO metadata verifier detects the error,
the online scrub and repair code then kicks in and rebuilds the
bad metadata block from other information in the filesystem.

What we are trying to do is avoid the scrub-and-repair stage in the
filesystem (as it is costly!) by ensuring that the underlying
storage has exhausted all redundancy and repair capabilities it has
before we run the filesystem rebuild phase. Recovering a bad block
from a RAID1/5/6 stripe is much, much faster than rebuilding a
metadata block from other metadata references in the filesystem....

> In order for us to build highly reliable systems, we have to have a
> better building block than "this redundancy retry feature works most of
> the time".

But that is actually good enough when you put a self-repairing
filesystem on top of a block layer that can recover from redundant
information /most of the time/. i.e. to build a reliable storage
system, the filesystem *must* be able to tolerate storage failures
because IO errors and hardware/media corruption can and do happen.

> So to me it is imperative that we provide hard guarantees
> once a particular configuration has been set up and stacked. And if the
> retry guarantee is somehow invalidated, then we really need to let the
> user know about it.

If the storage fails (and it will) and the filesystem cannot recover
the lost metadata, then it will let the user know and potentially
shut down the filesystem to protect the rest of the filesystem from
further damage. That is the current status quo, and the presence or
absence of automatic block layer retry and repair does not change
this at all.

IOWs, I think you're still looking at the problem as a "block
storage layers need to be perfect" problem when, in fact, we do not
need that nor are we asking for it. We have redundancy, rebuild and
reporting capabilities above the block storage layers (i.e. in
filesystems like btrfs and XFS) these days which mean we can
tolerate the lower layers of the storage being less than perfect.

IOWs, the filesystem doesn't expect hard "always correct" guarantees
from the storage layers - we always have to assume IO failures will
occur because they do, even with T10 PI. Hence it makes no sense
for an automatic retry-and-recovery infrastructure for filesystems
to require hard guarantees that the block device will always return
good data.  Automatic repair doesn't guarantee the storage is free
from errors - it just provides a mechanism to detect errors and
perform optimistic, best effort recovery at the lowest possible
layer in the stack as early as possible.

Cheers,

dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 2/3] block: verify data when endio
  2019-04-01 21:21                           ` Dave Chinner
@ 2019-04-03  2:45                             ` Martin K. Petersen
  2019-04-03 22:21                               ` Dave Chinner
  0 siblings, 1 reply; 26+ messages in thread
From: Martin K. Petersen @ 2019-04-03  2:45 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Martin K. Petersen, Jens Axboe, Bob Liu, linux-block, linux-xfs,
	linux-fsdevel, shirley.ma, allison.henderson, darrick.wong, hch,
	adilger, tytso


Dave,

> Not sure what you mean by "capped to the size you care about". The
> verifier attached to a bio will exactly match the size of the bio
> being issued. AFAICT, coalescing with other bios in the request
> queues should not affect how the completion of that bio is
> handled by things like the RAID layers...

Just wanted to make sure that you wanted an interface that worked on a
bio containing a single logical entity. As opposed to an interface that
permitted you to submit 10 logical entities in one bio and have the
verify function iterate over them at completion time.

> As far as I'm concerned, correcting bad copies is the responsibility
> of the layer that manages the copies. It has nothing to do with the
> filesystem.

Good.

> There are so many varied storage algorithms and recovery options
> (rewrite, partial rewrite, recalc parity/erasure codes and rewrite,
> full stripe rewrite, rebuild onto hot spare due to too many errors,
> etc) it doesn't make sense to only allow repair to be done by
> completely error context-free rewriting from a higher layer. The
> layer that owns the redundancy can make much better decisions about
> repair.

I agree.

> If the storage fails (and it will) and the filesystem cannot recover
> the lost metadata, then it will let the user know and potentially
> shut down the filesystem to protect the rest of the filesystem from
> further damage. That is the current status quo, and the presence or
> absence of automatic block layer retry and repair does not change
> this at all.

No. But hopefully the retry logic will significantly reduce the cases
where shutdown and recovery is required. Availability is super
important.

Also, at least some storage technologies are trending towards becoming
less reliable, not more. So the reality is that recovering from block
errors could become, if not hot path, then at least relatively common
path.

> IOWs, the filesystem doesn't expect hard "always correct" guarantees
> from the storage layers - we always have to assume IO failures will
> occur because they do, even with T10 PI. Hence it makes no sense
> for an automatic retry-and-recovery infrastructure for filesystems to
> require hard guarantees that the block device will always return good
> data.

I am not expecting hard guarantees wrt. always delivering good data. But
I want predictable behavior of the retry infrastructure.

That's no different from RAID drive failures. Things keep running, I/Os
don't fail until we run out of good copies. But we notify the user that
redundancy is lost so they can decide how to deal with the situation.
Setting the expectation that an I/O failure on the remaining drive would
potentially lead to a filesystem or database shutdown. RAID1 isn't
branded as "we sometimes mirror your data". Substantial effort has gone
into making sure that the mirrors are in sync.

For the retry stuff we should have a similar expectation. It doesn't
have to be fancy. I'm perfectly happy with a check at mkfs/growfs time
that complains if the resulting configuration violates whichever
alignment and other assumptions we end up baking into this.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 2/3] block: verify data when endio
  2019-04-03  2:45                             ` Martin K. Petersen
@ 2019-04-03 22:21                               ` Dave Chinner
  0 siblings, 0 replies; 26+ messages in thread
From: Dave Chinner @ 2019-04-03 22:21 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Jens Axboe, Bob Liu, linux-block, linux-xfs, linux-fsdevel,
	shirley.ma, allison.henderson, darrick.wong, hch, adilger, tytso

On Tue, Apr 02, 2019 at 10:45:03PM -0400, Martin K. Petersen wrote:
> 
> Dave,
> 
> > Not sure what you mean by "capped to the size you care about". The
> > verifier attached to a bio will exactly match the size of the bio
> > being issued. AFAICT, coalescing with other bios in the request
> > queues should not affect how the completion of that bio is
> > handled by things like the RAID layers...
> 
> Just wanted to make sure that you wanted an interface that worked on a
> bio containing a single logical entity. As opposed to an interface that
> permitted you to submit 10 logical entities in one bio and have the
> verify function iterate over them at completion time.
> 
> > As far as I'm concerned, correcting bad copies is the responsibility
> > of the layer that manages the copies. It has nothing to do with the
> > filesystem.
> 
> Good.
> 
> > There are so many varied storage algorithms and recovery options
> > (rewrite, partial rewrite, recalc parity/erasure codes and rewrite,
> > full stripe rewrite, rebuild onto hot spare due to too many errors,
> > etc) it doesn't make sense to only allow repair to be done by
> > completely error context-free rewriting from a higher layer. The
> > layer that owns the redundancy can make much better decisions about
> > repair.
> 
> I agree.
> 
> > If the storage fails (and it will) and the filesystem cannot recover
> > the lost metadata, then it will let the user know and potentially
> > shut down the filesystem to protect the rest of the filesystem from
> > further damage. That is the current status quo, and the presence or
> > absence of automatic block layer retry and repair does not change
> > this at all.
> 
> No. But hopefully the retry logic will significantly reduce the cases
> where shutdown and recovery is required. Availability is super
> important.

*nod*

We're on the same page, I think :)

> Also, at least some storage technologies are trending towards becoming
> less reliable, not more. So the reality is that recovering from block
> errors could become, if not hot path, then at least relatively common
> path.

Yup, all the more reason for having some form of generic
verification infrastructure plumbed into the stack.

> > IOWs, the filesystem doesn't expect hard "always correct" guarantees
> > from the storage layers - we always have to assume IO failures will
> > occur because they do, even with T10 PI. Hence it makes no sense
> > for an automatic retry-and-recovery infrastructure for filesystems to
> > require hard guarantees that the block device will always return good
> > data.
> 
> I am not expecting hard guarantees wrt. always delivering good data. But
> I want predictable behavior of the retry infrastructure.
> 
> That's no different from RAID drive failures. Things keep running, I/Os
> don't fail until we run out of good copies. But we notify the user that
> redundancy is lost so they can decide how to deal with the situation.
> Setting the expectation that an I/O failure on the remaining drive would
> potentially lead to a filesystem or database shutdown. RAID1 isn't
> branded as "we sometimes mirror your data". Substantial effort has gone
> into making sure that the mirrors are in sync.

Oh, I've been assuming that the layers that can do verification
would immediately take steps to correct whatever was detected, and
if the error couldn't be corrected they take whatever action they
normally take when an uncorrectable error is encountered.  i.e.
they'd never silently violate the redundancy guarantees they are
supposed to provide. MD and DM already do this, right? (e.g.
marking arrays degraded and notifying the admin they need to replace
a drive and rebuild.)

> For the retry stuff we should have a similar expectation. It doesn't
> have to be fancy. I'm perfectly happy with a check at mkfs/growfs time
> that complains if the resulting configuration violates whichever
> alignment and other assumptions we end up baking into this.

Makes sense to me. If we can ensure that the alignment requirements
for the stack are communicated correctly in the existing geometry
info (i.e. io_min, io_opt) then we've pretty much already got
everything we need in place. We might need a few mkfs tweaks to
ensure the log is placed correctly, log write padding is set
appropriately, and inode clusters are appropriately aligned (I think
they are already), but otherwise I think we will be good here...
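
A minimal sketch of the kind of check being talked about here,
phrased against the io_min limit the stack already exports (the
function itself is made up, not part of any posted patch):

/*
 * Hedged sketch: an aligned journal write must never straddle a
 * device boundary, so the chunk size (io_min) has to be a whole
 * multiple of the log stripe unit.  With the earlier example, a 32k
 * log su on a 64k chunk gives exactly two journal writes per chunk.
 */
static bool log_su_respects_geometry(unsigned int log_su_bytes,
				     const struct queue_limits *lim)
{
	if (!lim->io_min || !log_su_bytes)
		return true;    /* no geometry communicated, nothing to check */
	return (lim->io_min % log_su_bytes) == 0;
}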

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2019-04-03 22:21 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-03-29 14:23 [RFC PATCH v3 0/3] Block/XFS: Support alternative mirror device retry Bob Liu
2019-03-29 14:23 ` [PATCH v3 1/3] block: introduce submit_bio_verify() Bob Liu
2019-03-29 22:22   ` Andreas Dilger
2019-03-30  1:43     ` Martin K. Petersen
2019-03-29 14:23 ` [PATCH v3 2/3] block: verify data when endio Bob Liu
2019-03-29 14:28   ` Jens Axboe
2019-03-29 14:34     ` Martin K. Petersen
2019-03-29 14:38       ` Jens Axboe
2019-03-29 14:46         ` Martin K. Petersen
2019-03-29 14:50           ` Jens Axboe
2019-03-29 14:51             ` Martin K. Petersen
2019-03-29 14:52               ` Jens Axboe
2019-03-29 15:00                 ` Bob Liu
2019-03-29 15:03                   ` Jens Axboe
2019-03-30  2:17                     ` Martin K. Petersen
2019-03-31 22:00                       ` Dave Chinner
2019-04-01 14:04                         ` Martin K. Petersen
2019-04-01 21:21                           ` Dave Chinner
2019-04-03  2:45                             ` Martin K. Petersen
2019-04-03 22:21                               ` Dave Chinner
2019-03-29 14:41       ` Bob Liu
2019-03-29 14:40     ` Bob Liu
2019-03-29 14:47       ` Jens Axboe
2019-03-30  0:20       ` Andreas Dilger
2019-03-29 15:39   ` Ming Lei
2019-03-29 14:23 ` [PATCH v3 3/3] fs: xfs: add read_verifier() function Bob Liu
