All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 5.20 0/4] block/dm: add bio_rewind for improving dm requeue
@ 2022-06-24 14:12 ` Ming Lei
  0 siblings, 0 replies; 76+ messages in thread
From: Ming Lei @ 2022-06-24 14:12 UTC (permalink / raw)
  To: Jens Axboe, Mike Snitzer; +Cc: linux-block, dm-devel, Ming Lei

Hello Guys,

The 1st patch adds bio_rewind which can restore bio to original position
by recording sectors between the original position to bio's end sector
if the bio's end sector won't change, which should be very common to
see.

The 2nd and 3rd patch cleans up dm code for handling requeue &
completion.

The last patch implements 2 stage dm io requeue for avoiding to allocate
one bio beforehand for just handling requeue which is one unusual event.
The 1st stage requeue is added for cloning & restoring original bio in wq
context, then 2nd stage requeue will use that as original bio for
handling requeue.


Ming Lei (4):
  block: add bio_rewind() API
  dm: add new helper for handling dm_io requeue
  dm: improve handling for DM_REQUEUE and AGAIN
  dm: add two stage requeue

 block/bio-integrity.c       |  19 ++++
 block/bio.c                 |  19 ++++
 block/blk-crypto-internal.h |   7 ++
 block/blk-crypto.c          |  23 +++++
 drivers/md/dm-core.h        |  11 ++-
 drivers/md/dm.c             | 180 ++++++++++++++++++++++++++++--------
 include/linux/bio.h         |  21 +++++
 include/linux/bvec.h        |  33 +++++++
 8 files changed, 271 insertions(+), 42 deletions(-)

-- 
2.31.1


^ permalink raw reply	[flat|nested] 76+ messages in thread

* [dm-devel] [PATCH 5.20 0/4] block/dm: add bio_rewind for improving dm requeue
@ 2022-06-24 14:12 ` Ming Lei
  0 siblings, 0 replies; 76+ messages in thread
From: Ming Lei @ 2022-06-24 14:12 UTC (permalink / raw)
  To: Jens Axboe, Mike Snitzer; +Cc: linux-block, dm-devel, Ming Lei

Hello Guys,

The 1st patch adds bio_rewind which can restore bio to original position
by recording sectors between the original position to bio's end sector
if the bio's end sector won't change, which should be very common to
see.

The 2nd and 3rd patch cleans up dm code for handling requeue &
completion.

The last patch implements 2 stage dm io requeue for avoiding to allocate
one bio beforehand for just handling requeue which is one unusual event.
The 1st stage requeue is added for cloning & restoring original bio in wq
context, then 2nd stage requeue will use that as original bio for
handling requeue.


Ming Lei (4):
  block: add bio_rewind() API
  dm: add new helper for handling dm_io requeue
  dm: improve handling for DM_REQUEUE and AGAIN
  dm: add two stage requeue

 block/bio-integrity.c       |  19 ++++
 block/bio.c                 |  19 ++++
 block/blk-crypto-internal.h |   7 ++
 block/blk-crypto.c          |  23 +++++
 drivers/md/dm-core.h        |  11 ++-
 drivers/md/dm.c             | 180 ++++++++++++++++++++++++++++--------
 include/linux/bio.h         |  21 +++++
 include/linux/bvec.h        |  33 +++++++
 8 files changed, 271 insertions(+), 42 deletions(-)

-- 
2.31.1

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply	[flat|nested] 76+ messages in thread

* [PATCH 5.20 1/4] block: add bio_rewind() API
  2022-06-24 14:12 ` [dm-devel] " Ming Lei
@ 2022-06-24 14:12   ` Ming Lei
  -1 siblings, 0 replies; 76+ messages in thread
From: Ming Lei @ 2022-06-24 14:12 UTC (permalink / raw)
  To: Jens Axboe, Mike Snitzer
  Cc: linux-block, dm-devel, Ming Lei, Eric Biggers, Kent Overstreet,
	Dmitry Monakhov, Martin K . Petersen

Commit 7759eb23fd98 ("block: remove bio_rewind_iter()") removes
the similar API because the following reasons:

    ```
    It is pointed that bio_rewind_iter() is one very bad API[1]:

    1) bio size may not be restored after rewinding

    2) it causes some bogus change, such as 5151842b9d8732 (block: reset
    bi_iter.bi_done after splitting bio)

    3) rewinding really makes things complicated wrt. bio splitting

    4) unnecessary updating of .bi_done in fast path

    [1] https://marc.info/?t=153549924200005&r=1&w=2

    So this patch takes Kent's suggestion to restore one bio into its original
    state via saving bio iterator(struct bvec_iter) in bio_integrity_prep(),
    given now bio_rewind_iter() is only used by bio integrity code.
    ```

However, it isn't easy to restore bio by saving 32 bytes bio->bi_iter, and saving
it only can't restore crypto and integrity info.

Add bio_rewind() back for some use cases which may not be same with
previous generic case:

1) most of bio has fixed end sector since bio split is done from front of the bio,
if driver just records how many sectors between current bio's start sector and
the bio's end sector, the original position can be restored

2) if one bio's end sector won't change, usually bio_trim() isn't called, user can
restore original position by storing sectors from current ->bi_iter.bi_sector to
bio's end sector; together by saving bio size, 8 bytes can restore to
original bio.

3) dm's requeue use case: when BLK_STS_DM_REQUEUE happens, dm core needs to
restore to the original bio which represents current dm io to be requeued.
By storing sectors to the bio's end sector and dm io's size,
bio_rewind() can restore such original bio, then dm core code needn't to
allocate one bio beforehand just for handling BLK_STS_DM_REQUEUE which
is actually one unusual event.

4) Not like original rewind API, this one needn't to add .bi_done, and no any
effect on fast path

Cc: Eric Biggers <ebiggers@google.com>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Dmitry Monakhov <dmonakhov@openvz.org>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/bio-integrity.c       | 19 +++++++++++++++++++
 block/bio.c                 | 19 +++++++++++++++++++
 block/blk-crypto-internal.h |  7 +++++++
 block/blk-crypto.c          | 23 +++++++++++++++++++++++
 include/linux/bio.h         | 21 +++++++++++++++++++++
 include/linux/bvec.h        | 33 +++++++++++++++++++++++++++++++++
 6 files changed, 122 insertions(+)

diff --git a/block/bio-integrity.c b/block/bio-integrity.c
index 32929c89ba8a..06c2fe81fdf2 100644
--- a/block/bio-integrity.c
+++ b/block/bio-integrity.c
@@ -378,6 +378,25 @@ void bio_integrity_advance(struct bio *bio, unsigned int bytes_done)
 	bvec_iter_advance(bip->bip_vec, &bip->bip_iter, bytes);
 }
 
+/**
+ * bio_integrity_rewind - Rewind integrity vector
+ * @bio:	bio whose integrity vector to update
+ * @bytes_done:	number of data bytes to rewind
+ *
+ * Description: This function calculates how many integrity bytes the
+ * number of completed data bytes correspond to and rewind the
+ * integrity vector accordingly.
+ */
+void bio_integrity_rewind(struct bio *bio, unsigned int bytes_done)
+{
+	struct bio_integrity_payload *bip = bio_integrity(bio);
+	struct blk_integrity *bi = blk_get_integrity(bio->bi_bdev->bd_disk);
+	unsigned bytes = bio_integrity_bytes(bi, bytes_done >> 9);
+
+	bip->bip_iter.bi_sector -= bio_integrity_intervals(bi, bytes_done >> 9);
+	bvec_iter_rewind(bip->bip_vec, &bip->bip_iter, bytes);
+}
+
 /**
  * bio_integrity_trim - Trim integrity vector
  * @bio:	bio whose integrity vector to update
diff --git a/block/bio.c b/block/bio.c
index 51c99f2c5c90..5318944b7b18 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1360,6 +1360,25 @@ void __bio_advance(struct bio *bio, unsigned bytes)
 }
 EXPORT_SYMBOL(__bio_advance);
 
+/**
+ * bio_rewind - rewind @bio by @bytes
+ * @bio: bio to rewind
+ * @bytes: how many bytes to rewind
+ *
+ * Update ->bi_iter of @bio by rewinding @bytes. Most of bio has fixed end
+ * sector, so it is easy to rewind from end of the bio and restore its
+ * original position. And it is caller's responsibility to restore bio size.
+ */
+void bio_rewind(struct bio *bio, unsigned bytes)
+{
+	if (bio_integrity(bio))
+		bio_integrity_rewind(bio, bytes);
+
+	bio_crypt_rewind(bio, bytes);
+	bio_rewind_iter(bio, &bio->bi_iter, bytes);
+}
+EXPORT_SYMBOL(bio_rewind);
+
 void bio_copy_data_iter(struct bio *dst, struct bvec_iter *dst_iter,
 			struct bio *src, struct bvec_iter *src_iter)
 {
diff --git a/block/blk-crypto-internal.h b/block/blk-crypto-internal.h
index e6818ffaddbf..b723599bbf99 100644
--- a/block/blk-crypto-internal.h
+++ b/block/blk-crypto-internal.h
@@ -114,6 +114,13 @@ static inline void bio_crypt_advance(struct bio *bio, unsigned int bytes)
 		__bio_crypt_advance(bio, bytes);
 }
 
+void __bio_crypt_rewind(struct bio *bio, unsigned int bytes);
+static inline void bio_crypt_rewind(struct bio *bio, unsigned int bytes)
+{
+	if (bio_has_crypt_ctx(bio))
+		__bio_crypt_rewind(bio, bytes);
+}
+
 void __bio_crypt_free_ctx(struct bio *bio);
 static inline void bio_crypt_free_ctx(struct bio *bio)
 {
diff --git a/block/blk-crypto.c b/block/blk-crypto.c
index a496aaef85ba..caae2f429fc7 100644
--- a/block/blk-crypto.c
+++ b/block/blk-crypto.c
@@ -134,6 +134,21 @@ void bio_crypt_dun_increment(u64 dun[BLK_CRYPTO_DUN_ARRAY_SIZE],
 	}
 }
 
+/* Decrements @dun by @dec, treating @dun as a multi-limb integer. */
+void bio_crypt_dun_decrement(u64 dun[BLK_CRYPTO_DUN_ARRAY_SIZE],
+			     unsigned int dec)
+{
+	int i;
+
+	for (i = 0; dec && i < BLK_CRYPTO_DUN_ARRAY_SIZE; i++) {
+		dun[i] -= dec;
+		if (dun[i] > inc)
+			dec = 1;
+		else
+			dec = 0;
+	}
+}
+
 void __bio_crypt_advance(struct bio *bio, unsigned int bytes)
 {
 	struct bio_crypt_ctx *bc = bio->bi_crypt_context;
@@ -142,6 +157,14 @@ void __bio_crypt_advance(struct bio *bio, unsigned int bytes)
 				bytes >> bc->bc_key->data_unit_size_bits);
 }
 
+void __bio_crypt_rewind(struct bio *bio, unsigned int bytes)
+{
+	struct bio_crypt_ctx *bc = bio->bi_crypt_context;
+
+	bio_crypt_dun_decrement(bc->bc_dun,
+				bytes >> bc->bc_key->data_unit_size_bits);
+}
+
 /*
  * Returns true if @bc->bc_dun plus @bytes converted to data units is equal to
  * @next_dun, treating the DUNs as multi-limb integers.
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 992ee987f273..4e6674f232b4 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -105,6 +105,19 @@ static inline void bio_advance_iter(const struct bio *bio,
 		/* TODO: It is reasonable to complete bio with error here. */
 }
 
+static inline void bio_rewind_iter(const struct bio *bio,
+				    struct bvec_iter *iter, unsigned int bytes)
+{
+	iter->bi_sector -= bytes >> 9;
+
+	/* No advance means no rewind */
+	if (bio_no_advance_iter(bio))
+		iter->bi_size += bytes;
+	else
+		bvec_iter_rewind(bio->bi_io_vec, iter, bytes);
+		/* TODO: It is reasonable to complete bio with error here. */
+}
+
 /* @bytes should be less or equal to bvec[i->bi_idx].bv_len */
 static inline void bio_advance_iter_single(const struct bio *bio,
 					   struct bvec_iter *iter,
@@ -119,6 +132,7 @@ static inline void bio_advance_iter_single(const struct bio *bio,
 }
 
 void __bio_advance(struct bio *, unsigned bytes);
+void bio_rewind(struct bio *, unsigned bytes);
 
 /**
  * bio_advance - increment/complete a bio by some number of bytes
@@ -699,6 +713,7 @@ extern struct bio_integrity_payload *bio_integrity_alloc(struct bio *, gfp_t, un
 extern int bio_integrity_add_page(struct bio *, struct page *, unsigned int, unsigned int);
 extern bool bio_integrity_prep(struct bio *);
 extern void bio_integrity_advance(struct bio *, unsigned int);
+extern void bio_integrity_rewind(struct bio *, unsigned int);
 extern void bio_integrity_trim(struct bio *);
 extern int bio_integrity_clone(struct bio *, struct bio *, gfp_t);
 extern int bioset_integrity_create(struct bio_set *, int);
@@ -739,6 +754,12 @@ static inline void bio_integrity_advance(struct bio *bio,
 	return;
 }
 
+static inline void bio_integrity_rewind(struct bio *bio,
+					 unsigned int bytes_done)
+{
+	return;
+}
+
 static inline void bio_integrity_trim(struct bio *bio)
 {
 	return;
diff --git a/include/linux/bvec.h b/include/linux/bvec.h
index 35c25dff651a..b56d92e939c1 100644
--- a/include/linux/bvec.h
+++ b/include/linux/bvec.h
@@ -122,6 +122,39 @@ static inline bool bvec_iter_advance(const struct bio_vec *bv,
 	return true;
 }
 
+static inline bool bvec_iter_rewind(const struct bio_vec *bv,
+				     struct bvec_iter *iter,
+				     unsigned int bytes)
+{
+	int idx;
+
+	iter->bi_size += bytes;
+	if (bytes <= iter->bi_bvec_done) {
+		iter->bi_bvec_done -= bytes;
+		return true;
+	}
+
+	bytes -= iter->bi_bvec_done;
+	idx = iter->bi_idx - 1;
+
+	while (idx >= 0 && bytes && bytes > bv[idx].bv_len) {
+		bytes -= bv[idx].bv_len;
+		idx--;
+	}
+
+	if (WARN_ONCE(idx < 0 && bytes,
+		      "Attempted to rewind iter beyond bvec's boundaries\n")) {
+		iter->bi_size -= bytes;
+		iter->bi_bvec_done = 0;
+		iter->bi_idx = 0;
+		return false;
+	}
+
+	iter->bi_idx = idx;
+	iter->bi_bvec_done = bv[idx].bv_len - bytes;
+	return true;
+}
+
 /*
  * A simpler version of bvec_iter_advance(), @bytes should not span
  * across multiple bvec entries, i.e. bytes <= bv[i->bi_idx].bv_len
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [dm-devel] [PATCH 5.20 1/4] block: add bio_rewind() API
@ 2022-06-24 14:12   ` Ming Lei
  0 siblings, 0 replies; 76+ messages in thread
From: Ming Lei @ 2022-06-24 14:12 UTC (permalink / raw)
  To: Jens Axboe, Mike Snitzer
  Cc: Martin K . Petersen, Eric Biggers, Ming Lei, linux-block,
	dm-devel, Dmitry Monakhov, Kent Overstreet

Commit 7759eb23fd98 ("block: remove bio_rewind_iter()") removes
the similar API because the following reasons:

    ```
    It is pointed that bio_rewind_iter() is one very bad API[1]:

    1) bio size may not be restored after rewinding

    2) it causes some bogus change, such as 5151842b9d8732 (block: reset
    bi_iter.bi_done after splitting bio)

    3) rewinding really makes things complicated wrt. bio splitting

    4) unnecessary updating of .bi_done in fast path

    [1] https://marc.info/?t=153549924200005&r=1&w=2

    So this patch takes Kent's suggestion to restore one bio into its original
    state via saving bio iterator(struct bvec_iter) in bio_integrity_prep(),
    given now bio_rewind_iter() is only used by bio integrity code.
    ```

However, it isn't easy to restore bio by saving 32 bytes bio->bi_iter, and saving
it only can't restore crypto and integrity info.

Add bio_rewind() back for some use cases which may not be same with
previous generic case:

1) most of bio has fixed end sector since bio split is done from front of the bio,
if driver just records how many sectors between current bio's start sector and
the bio's end sector, the original position can be restored

2) if one bio's end sector won't change, usually bio_trim() isn't called, user can
restore original position by storing sectors from current ->bi_iter.bi_sector to
bio's end sector; together by saving bio size, 8 bytes can restore to
original bio.

3) dm's requeue use case: when BLK_STS_DM_REQUEUE happens, dm core needs to
restore to the original bio which represents current dm io to be requeued.
By storing sectors to the bio's end sector and dm io's size,
bio_rewind() can restore such original bio, then dm core code needn't to
allocate one bio beforehand just for handling BLK_STS_DM_REQUEUE which
is actually one unusual event.

4) Not like original rewind API, this one needn't to add .bi_done, and no any
effect on fast path

Cc: Eric Biggers <ebiggers@google.com>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Dmitry Monakhov <dmonakhov@openvz.org>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/bio-integrity.c       | 19 +++++++++++++++++++
 block/bio.c                 | 19 +++++++++++++++++++
 block/blk-crypto-internal.h |  7 +++++++
 block/blk-crypto.c          | 23 +++++++++++++++++++++++
 include/linux/bio.h         | 21 +++++++++++++++++++++
 include/linux/bvec.h        | 33 +++++++++++++++++++++++++++++++++
 6 files changed, 122 insertions(+)

diff --git a/block/bio-integrity.c b/block/bio-integrity.c
index 32929c89ba8a..06c2fe81fdf2 100644
--- a/block/bio-integrity.c
+++ b/block/bio-integrity.c
@@ -378,6 +378,25 @@ void bio_integrity_advance(struct bio *bio, unsigned int bytes_done)
 	bvec_iter_advance(bip->bip_vec, &bip->bip_iter, bytes);
 }
 
+/**
+ * bio_integrity_rewind - Rewind integrity vector
+ * @bio:	bio whose integrity vector to update
+ * @bytes_done:	number of data bytes to rewind
+ *
+ * Description: This function calculates how many integrity bytes the
+ * number of completed data bytes correspond to and rewind the
+ * integrity vector accordingly.
+ */
+void bio_integrity_rewind(struct bio *bio, unsigned int bytes_done)
+{
+	struct bio_integrity_payload *bip = bio_integrity(bio);
+	struct blk_integrity *bi = blk_get_integrity(bio->bi_bdev->bd_disk);
+	unsigned bytes = bio_integrity_bytes(bi, bytes_done >> 9);
+
+	bip->bip_iter.bi_sector -= bio_integrity_intervals(bi, bytes_done >> 9);
+	bvec_iter_rewind(bip->bip_vec, &bip->bip_iter, bytes);
+}
+
 /**
  * bio_integrity_trim - Trim integrity vector
  * @bio:	bio whose integrity vector to update
diff --git a/block/bio.c b/block/bio.c
index 51c99f2c5c90..5318944b7b18 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1360,6 +1360,25 @@ void __bio_advance(struct bio *bio, unsigned bytes)
 }
 EXPORT_SYMBOL(__bio_advance);
 
+/**
+ * bio_rewind - rewind @bio by @bytes
+ * @bio: bio to rewind
+ * @bytes: how many bytes to rewind
+ *
+ * Update ->bi_iter of @bio by rewinding @bytes. Most of bio has fixed end
+ * sector, so it is easy to rewind from end of the bio and restore its
+ * original position. And it is caller's responsibility to restore bio size.
+ */
+void bio_rewind(struct bio *bio, unsigned bytes)
+{
+	if (bio_integrity(bio))
+		bio_integrity_rewind(bio, bytes);
+
+	bio_crypt_rewind(bio, bytes);
+	bio_rewind_iter(bio, &bio->bi_iter, bytes);
+}
+EXPORT_SYMBOL(bio_rewind);
+
 void bio_copy_data_iter(struct bio *dst, struct bvec_iter *dst_iter,
 			struct bio *src, struct bvec_iter *src_iter)
 {
diff --git a/block/blk-crypto-internal.h b/block/blk-crypto-internal.h
index e6818ffaddbf..b723599bbf99 100644
--- a/block/blk-crypto-internal.h
+++ b/block/blk-crypto-internal.h
@@ -114,6 +114,13 @@ static inline void bio_crypt_advance(struct bio *bio, unsigned int bytes)
 		__bio_crypt_advance(bio, bytes);
 }
 
+void __bio_crypt_rewind(struct bio *bio, unsigned int bytes);
+static inline void bio_crypt_rewind(struct bio *bio, unsigned int bytes)
+{
+	if (bio_has_crypt_ctx(bio))
+		__bio_crypt_rewind(bio, bytes);
+}
+
 void __bio_crypt_free_ctx(struct bio *bio);
 static inline void bio_crypt_free_ctx(struct bio *bio)
 {
diff --git a/block/blk-crypto.c b/block/blk-crypto.c
index a496aaef85ba..caae2f429fc7 100644
--- a/block/blk-crypto.c
+++ b/block/blk-crypto.c
@@ -134,6 +134,21 @@ void bio_crypt_dun_increment(u64 dun[BLK_CRYPTO_DUN_ARRAY_SIZE],
 	}
 }
 
+/* Decrements @dun by @dec, treating @dun as a multi-limb integer. */
+void bio_crypt_dun_decrement(u64 dun[BLK_CRYPTO_DUN_ARRAY_SIZE],
+			     unsigned int dec)
+{
+	int i;
+
+	for (i = 0; dec && i < BLK_CRYPTO_DUN_ARRAY_SIZE; i++) {
+		dun[i] -= dec;
+		if (dun[i] > inc)
+			dec = 1;
+		else
+			dec = 0;
+	}
+}
+
 void __bio_crypt_advance(struct bio *bio, unsigned int bytes)
 {
 	struct bio_crypt_ctx *bc = bio->bi_crypt_context;
@@ -142,6 +157,14 @@ void __bio_crypt_advance(struct bio *bio, unsigned int bytes)
 				bytes >> bc->bc_key->data_unit_size_bits);
 }
 
+void __bio_crypt_rewind(struct bio *bio, unsigned int bytes)
+{
+	struct bio_crypt_ctx *bc = bio->bi_crypt_context;
+
+	bio_crypt_dun_decrement(bc->bc_dun,
+				bytes >> bc->bc_key->data_unit_size_bits);
+}
+
 /*
  * Returns true if @bc->bc_dun plus @bytes converted to data units is equal to
  * @next_dun, treating the DUNs as multi-limb integers.
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 992ee987f273..4e6674f232b4 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -105,6 +105,19 @@ static inline void bio_advance_iter(const struct bio *bio,
 		/* TODO: It is reasonable to complete bio with error here. */
 }
 
+static inline void bio_rewind_iter(const struct bio *bio,
+				    struct bvec_iter *iter, unsigned int bytes)
+{
+	iter->bi_sector -= bytes >> 9;
+
+	/* No advance means no rewind */
+	if (bio_no_advance_iter(bio))
+		iter->bi_size += bytes;
+	else
+		bvec_iter_rewind(bio->bi_io_vec, iter, bytes);
+		/* TODO: It is reasonable to complete bio with error here. */
+}
+
 /* @bytes should be less or equal to bvec[i->bi_idx].bv_len */
 static inline void bio_advance_iter_single(const struct bio *bio,
 					   struct bvec_iter *iter,
@@ -119,6 +132,7 @@ static inline void bio_advance_iter_single(const struct bio *bio,
 }
 
 void __bio_advance(struct bio *, unsigned bytes);
+void bio_rewind(struct bio *, unsigned bytes);
 
 /**
  * bio_advance - increment/complete a bio by some number of bytes
@@ -699,6 +713,7 @@ extern struct bio_integrity_payload *bio_integrity_alloc(struct bio *, gfp_t, un
 extern int bio_integrity_add_page(struct bio *, struct page *, unsigned int, unsigned int);
 extern bool bio_integrity_prep(struct bio *);
 extern void bio_integrity_advance(struct bio *, unsigned int);
+extern void bio_integrity_rewind(struct bio *, unsigned int);
 extern void bio_integrity_trim(struct bio *);
 extern int bio_integrity_clone(struct bio *, struct bio *, gfp_t);
 extern int bioset_integrity_create(struct bio_set *, int);
@@ -739,6 +754,12 @@ static inline void bio_integrity_advance(struct bio *bio,
 	return;
 }
 
+static inline void bio_integrity_rewind(struct bio *bio,
+					 unsigned int bytes_done)
+{
+	return;
+}
+
 static inline void bio_integrity_trim(struct bio *bio)
 {
 	return;
diff --git a/include/linux/bvec.h b/include/linux/bvec.h
index 35c25dff651a..b56d92e939c1 100644
--- a/include/linux/bvec.h
+++ b/include/linux/bvec.h
@@ -122,6 +122,39 @@ static inline bool bvec_iter_advance(const struct bio_vec *bv,
 	return true;
 }
 
+static inline bool bvec_iter_rewind(const struct bio_vec *bv,
+				     struct bvec_iter *iter,
+				     unsigned int bytes)
+{
+	int idx;
+
+	iter->bi_size += bytes;
+	if (bytes <= iter->bi_bvec_done) {
+		iter->bi_bvec_done -= bytes;
+		return true;
+	}
+
+	bytes -= iter->bi_bvec_done;
+	idx = iter->bi_idx - 1;
+
+	while (idx >= 0 && bytes && bytes > bv[idx].bv_len) {
+		bytes -= bv[idx].bv_len;
+		idx--;
+	}
+
+	if (WARN_ONCE(idx < 0 && bytes,
+		      "Attempted to rewind iter beyond bvec's boundaries\n")) {
+		iter->bi_size -= bytes;
+		iter->bi_bvec_done = 0;
+		iter->bi_idx = 0;
+		return false;
+	}
+
+	iter->bi_idx = idx;
+	iter->bi_bvec_done = bv[idx].bv_len - bytes;
+	return true;
+}
+
 /*
  * A simpler version of bvec_iter_advance(), @bytes should not span
  * across multiple bvec entries, i.e. bytes <= bv[i->bi_idx].bv_len
-- 
2.31.1

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH 5.20 2/4] dm: add new helper for handling dm_io requeue
  2022-06-24 14:12 ` [dm-devel] " Ming Lei
@ 2022-06-24 14:12   ` Ming Lei
  -1 siblings, 0 replies; 76+ messages in thread
From: Ming Lei @ 2022-06-24 14:12 UTC (permalink / raw)
  To: Jens Axboe, Mike Snitzer; +Cc: linux-block, dm-devel, Ming Lei

Add helper of dm_handle_requeue() for handling dm_io requeue.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/md/dm.c | 17 ++++++++++++-----
 1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 2b75f1ef7386..a9e5e429c150 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -884,13 +884,11 @@ static int __noflush_suspending(struct mapped_device *md)
 	return test_bit(DMF_NOFLUSH_SUSPENDING, &md->flags);
 }
 
-static void dm_io_complete(struct dm_io *io)
+static void dm_handle_requeue(struct dm_io *io)
 {
-	blk_status_t io_error;
-	struct mapped_device *md = io->md;
-	struct bio *bio = io->split_bio ? io->split_bio : io->orig_bio;
-
 	if (io->status == BLK_STS_DM_REQUEUE) {
+		struct bio *bio = io->split_bio ? io->split_bio : io->orig_bio;
+		struct mapped_device *md = io->md;
 		unsigned long flags;
 		/*
 		 * Target requested pushing back the I/O.
@@ -909,6 +907,15 @@ static void dm_io_complete(struct dm_io *io)
 		}
 		spin_unlock_irqrestore(&md->deferred_lock, flags);
 	}
+}
+
+static void dm_io_complete(struct dm_io *io)
+{
+	struct bio *bio = io->split_bio ? io->split_bio : io->orig_bio;
+	struct mapped_device *md = io->md;
+	blk_status_t io_error;
+
+	dm_handle_requeue(io);
 
 	io_error = io->status;
 	if (dm_io_flagged(io, DM_IO_ACCOUNTED))
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [dm-devel] [PATCH 5.20 2/4] dm: add new helper for handling dm_io requeue
@ 2022-06-24 14:12   ` Ming Lei
  0 siblings, 0 replies; 76+ messages in thread
From: Ming Lei @ 2022-06-24 14:12 UTC (permalink / raw)
  To: Jens Axboe, Mike Snitzer; +Cc: linux-block, dm-devel, Ming Lei

Add helper of dm_handle_requeue() for handling dm_io requeue.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/md/dm.c | 17 ++++++++++++-----
 1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 2b75f1ef7386..a9e5e429c150 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -884,13 +884,11 @@ static int __noflush_suspending(struct mapped_device *md)
 	return test_bit(DMF_NOFLUSH_SUSPENDING, &md->flags);
 }
 
-static void dm_io_complete(struct dm_io *io)
+static void dm_handle_requeue(struct dm_io *io)
 {
-	blk_status_t io_error;
-	struct mapped_device *md = io->md;
-	struct bio *bio = io->split_bio ? io->split_bio : io->orig_bio;
-
 	if (io->status == BLK_STS_DM_REQUEUE) {
+		struct bio *bio = io->split_bio ? io->split_bio : io->orig_bio;
+		struct mapped_device *md = io->md;
 		unsigned long flags;
 		/*
 		 * Target requested pushing back the I/O.
@@ -909,6 +907,15 @@ static void dm_io_complete(struct dm_io *io)
 		}
 		spin_unlock_irqrestore(&md->deferred_lock, flags);
 	}
+}
+
+static void dm_io_complete(struct dm_io *io)
+{
+	struct bio *bio = io->split_bio ? io->split_bio : io->orig_bio;
+	struct mapped_device *md = io->md;
+	blk_status_t io_error;
+
+	dm_handle_requeue(io);
 
 	io_error = io->status;
 	if (dm_io_flagged(io, DM_IO_ACCOUNTED))
-- 
2.31.1

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH 5.20 3/4] dm: improve handling for DM_REQUEUE and AGAIN
  2022-06-24 14:12 ` [dm-devel] " Ming Lei
@ 2022-06-24 14:12   ` Ming Lei
  -1 siblings, 0 replies; 76+ messages in thread
From: Ming Lei @ 2022-06-24 14:12 UTC (permalink / raw)
  To: Jens Axboe, Mike Snitzer; +Cc: linux-block, dm-devel, Ming Lei

In case that BLK_STS_DM_REQUEUE is returned or BLK_STS_AGAIN is returned
for POLLED io, we requeue the original bio into deferred list and
request md->wq to re-submit it to block layer.

Improve the handling in the following way:

1) unify handling for BLK_STS_DM_REQUEUE and BLK_STS_AGAIN, and clear
REQ_POLLED for BLK_STS_DM_REQUEUE too, for the sake of simplicity,
given BLK_STS_DM_REQUEUE is very unusual

2) queue md->wq explicitly in __dm_io_complete(), so requeue handling
becomes more robust

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/md/dm.c | 58 +++++++++++++++++++++++++++++--------------------
 1 file changed, 34 insertions(+), 24 deletions(-)

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index a9e5e429c150..ee22c763873f 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -884,20 +884,39 @@ static int __noflush_suspending(struct mapped_device *md)
 	return test_bit(DMF_NOFLUSH_SUSPENDING, &md->flags);
 }
 
-static void dm_handle_requeue(struct dm_io *io)
+/* Return true if the original bio is requeued */
+static bool dm_handle_requeue(struct dm_io *io)
 {
-	if (io->status == BLK_STS_DM_REQUEUE) {
-		struct bio *bio = io->split_bio ? io->split_bio : io->orig_bio;
-		struct mapped_device *md = io->md;
+	struct bio *bio = io->split_bio ? io->split_bio : io->orig_bio;
+	bool need_requeue = (io->status == BLK_STS_DM_REQUEUE);
+	bool handle_eagain = (io->status == BLK_STS_AGAIN) &&
+		(bio->bi_opf & REQ_POLLED);
+	struct mapped_device *md = io->md;
+	bool requeued = false;
+
+	if (need_requeue || handle_eagain) {
 		unsigned long flags;
+
+		if (bio->bi_opf & REQ_POLLED) {
+			/*
+			 * Upper layer won't help us poll split bio
+			 * (io->orig_bio may only reflect a subset of the
+			 * pre-split original) so clear REQ_POLLED in case
+			 * of requeue.
+			 */
+			bio_clear_polled(bio);
+		}
+
 		/*
 		 * Target requested pushing back the I/O.
 		 */
 		spin_lock_irqsave(&md->deferred_lock, flags);
-		if (__noflush_suspending(md) &&
-		    !WARN_ON_ONCE(dm_is_zone_write(md, bio))) {
+		if ((__noflush_suspending(md) &&
+		    !WARN_ON_ONCE(dm_is_zone_write(md, bio))) ||
+				handle_eagain) {
 			/* NOTE early return due to BLK_STS_DM_REQUEUE below */
 			bio_list_add_head(&md->deferred, bio);
+			requeued = true;
 		} else {
 			/*
 			 * noflush suspend was interrupted or this is
@@ -907,6 +926,10 @@ static void dm_handle_requeue(struct dm_io *io)
 		}
 		spin_unlock_irqrestore(&md->deferred_lock, flags);
 	}
+
+	if (requeued)
+		queue_work(md->wq, &md->work);
+	return requeued;
 }
 
 static void dm_io_complete(struct dm_io *io)
@@ -914,8 +937,9 @@ static void dm_io_complete(struct dm_io *io)
 	struct bio *bio = io->split_bio ? io->split_bio : io->orig_bio;
 	struct mapped_device *md = io->md;
 	blk_status_t io_error;
+	bool requeued;
 
-	dm_handle_requeue(io);
+	requeued = dm_handle_requeue(io);
 
 	io_error = io->status;
 	if (dm_io_flagged(io, DM_IO_ACCOUNTED))
@@ -936,23 +960,9 @@ static void dm_io_complete(struct dm_io *io)
 	if (unlikely(wq_has_sleeper(&md->wait)))
 		wake_up(&md->wait);
 
-	if (io_error == BLK_STS_DM_REQUEUE || io_error == BLK_STS_AGAIN) {
-		if (bio->bi_opf & REQ_POLLED) {
-			/*
-			 * Upper layer won't help us poll split bio (io->orig_bio
-			 * may only reflect a subset of the pre-split original)
-			 * so clear REQ_POLLED in case of requeue.
-			 */
-			bio_clear_polled(bio);
-			if (io_error == BLK_STS_AGAIN) {
-				/* io_uring doesn't handle BLK_STS_AGAIN (yet) */
-				queue_io(md, bio);
-				return;
-			}
-		}
-		if (io_error == BLK_STS_DM_REQUEUE)
-			return;
-	}
+	/* We have requeued, so return now */
+	if (requeued)
+		return;
 
 	if (bio_is_flush_with_data(bio)) {
 		/*
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [dm-devel] [PATCH 5.20 3/4] dm: improve handling for DM_REQUEUE and AGAIN
@ 2022-06-24 14:12   ` Ming Lei
  0 siblings, 0 replies; 76+ messages in thread
From: Ming Lei @ 2022-06-24 14:12 UTC (permalink / raw)
  To: Jens Axboe, Mike Snitzer; +Cc: linux-block, dm-devel, Ming Lei

In case that BLK_STS_DM_REQUEUE is returned or BLK_STS_AGAIN is returned
for POLLED io, we requeue the original bio into deferred list and
request md->wq to re-submit it to block layer.

Improve the handling in the following way:

1) unify handling for BLK_STS_DM_REQUEUE and BLK_STS_AGAIN, and clear
REQ_POLLED for BLK_STS_DM_REQUEUE too, for the sake of simplicity,
given BLK_STS_DM_REQUEUE is very unusual

2) queue md->wq explicitly in __dm_io_complete(), so requeue handling
becomes more robust

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/md/dm.c | 58 +++++++++++++++++++++++++++++--------------------
 1 file changed, 34 insertions(+), 24 deletions(-)

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index a9e5e429c150..ee22c763873f 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -884,20 +884,39 @@ static int __noflush_suspending(struct mapped_device *md)
 	return test_bit(DMF_NOFLUSH_SUSPENDING, &md->flags);
 }
 
-static void dm_handle_requeue(struct dm_io *io)
+/* Return true if the original bio is requeued */
+static bool dm_handle_requeue(struct dm_io *io)
 {
-	if (io->status == BLK_STS_DM_REQUEUE) {
-		struct bio *bio = io->split_bio ? io->split_bio : io->orig_bio;
-		struct mapped_device *md = io->md;
+	struct bio *bio = io->split_bio ? io->split_bio : io->orig_bio;
+	bool need_requeue = (io->status == BLK_STS_DM_REQUEUE);
+	bool handle_eagain = (io->status == BLK_STS_AGAIN) &&
+		(bio->bi_opf & REQ_POLLED);
+	struct mapped_device *md = io->md;
+	bool requeued = false;
+
+	if (need_requeue || handle_eagain) {
 		unsigned long flags;
+
+		if (bio->bi_opf & REQ_POLLED) {
+			/*
+			 * Upper layer won't help us poll split bio
+			 * (io->orig_bio may only reflect a subset of the
+			 * pre-split original) so clear REQ_POLLED in case
+			 * of requeue.
+			 */
+			bio_clear_polled(bio);
+		}
+
 		/*
 		 * Target requested pushing back the I/O.
 		 */
 		spin_lock_irqsave(&md->deferred_lock, flags);
-		if (__noflush_suspending(md) &&
-		    !WARN_ON_ONCE(dm_is_zone_write(md, bio))) {
+		if ((__noflush_suspending(md) &&
+		    !WARN_ON_ONCE(dm_is_zone_write(md, bio))) ||
+				handle_eagain) {
 			/* NOTE early return due to BLK_STS_DM_REQUEUE below */
 			bio_list_add_head(&md->deferred, bio);
+			requeued = true;
 		} else {
 			/*
 			 * noflush suspend was interrupted or this is
@@ -907,6 +926,10 @@ static void dm_handle_requeue(struct dm_io *io)
 		}
 		spin_unlock_irqrestore(&md->deferred_lock, flags);
 	}
+
+	if (requeued)
+		queue_work(md->wq, &md->work);
+	return requeued;
 }
 
 static void dm_io_complete(struct dm_io *io)
@@ -914,8 +937,9 @@ static void dm_io_complete(struct dm_io *io)
 	struct bio *bio = io->split_bio ? io->split_bio : io->orig_bio;
 	struct mapped_device *md = io->md;
 	blk_status_t io_error;
+	bool requeued;
 
-	dm_handle_requeue(io);
+	requeued = dm_handle_requeue(io);
 
 	io_error = io->status;
 	if (dm_io_flagged(io, DM_IO_ACCOUNTED))
@@ -936,23 +960,9 @@ static void dm_io_complete(struct dm_io *io)
 	if (unlikely(wq_has_sleeper(&md->wait)))
 		wake_up(&md->wait);
 
-	if (io_error == BLK_STS_DM_REQUEUE || io_error == BLK_STS_AGAIN) {
-		if (bio->bi_opf & REQ_POLLED) {
-			/*
-			 * Upper layer won't help us poll split bio (io->orig_bio
-			 * may only reflect a subset of the pre-split original)
-			 * so clear REQ_POLLED in case of requeue.
-			 */
-			bio_clear_polled(bio);
-			if (io_error == BLK_STS_AGAIN) {
-				/* io_uring doesn't handle BLK_STS_AGAIN (yet) */
-				queue_io(md, bio);
-				return;
-			}
-		}
-		if (io_error == BLK_STS_DM_REQUEUE)
-			return;
-	}
+	/* We have requeued, so return now */
+	if (requeued)
+		return;
 
 	if (bio_is_flush_with_data(bio)) {
 		/*
-- 
2.31.1

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH 5.20 4/4] dm: add two stage requeue
  2022-06-24 14:12 ` [dm-devel] " Ming Lei
@ 2022-06-24 14:12   ` Ming Lei
  -1 siblings, 0 replies; 76+ messages in thread
From: Ming Lei @ 2022-06-24 14:12 UTC (permalink / raw)
  To: Jens Axboe, Mike Snitzer; +Cc: linux-block, dm-devel, Ming Lei

Commit 7dd76d1feec7 ("dm: improve bio splitting and associated IO accounting")
makes dm io's original bio points to same original bio from upper layer,
so more than one dm io can share one same original bio in case of
splitting. This way is fine if all dm io is completed successfully. But
if BLK_STS_DM_REQUEUE is returned from clone bio, the current code will
requeue the shared original bio, and cause the following issue:

1) the shared original bio has been trimmed and mapped to the last dm
   io, so it becomes not matched if requeuing this original bio

2) more than one dm io completion may touch the single shared original
   bio, such as the bio may have been submitted in one code path, but
   another code path may be ending it, so this way is very fragile.

Patch 'dm: fix BLK_STS_DM_REQUEUE handling when dm_io' can fix the
issue, but still need to clone one backing bio in case of split. We can
solve the issue by two stages requeue with help of new added bio_rewind,
then the bio clone can only be needed after BLK_STS_DM_REQUEUE happens:

1) requeue the dm io into the added requeue list, and schedule it via
new added requeue work, it is just for clone/allocate a mapped original
bio for requeue, and we recover the original bio by bio_rewind().

2) the 2nd stage requeue is same with original requeue, but io->orig_bio
points to new cloned bio which matches with the requeued dm io.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/md/dm-core.h |  11 +++-
 drivers/md/dm.c      | 131 ++++++++++++++++++++++++++++++++++---------
 2 files changed, 116 insertions(+), 26 deletions(-)

diff --git a/drivers/md/dm-core.h b/drivers/md/dm-core.h
index c954ff91870e..0545ce441427 100644
--- a/drivers/md/dm-core.h
+++ b/drivers/md/dm-core.h
@@ -22,6 +22,8 @@
 
 #define DM_RESERVED_MAX_IOS		1024
 
+struct dm_io;
+
 struct dm_kobject_holder {
 	struct kobject kobj;
 	struct completion completion;
@@ -91,6 +93,14 @@ struct mapped_device {
 	spinlock_t deferred_lock;
 	struct bio_list deferred;
 
+	/*
+	 * requeue work context is nedded for cloning one new bio
+	 * for representing the dm_io to be requeued since each
+	 * dm_io may point to the original bio from FS.
+	 */
+	struct work_struct requeue_work;
+	struct dm_io *requeue_list;
+
 	void *interface_ptr;
 
 	/*
@@ -272,7 +282,6 @@ struct dm_io {
 	atomic_t io_count;
 	struct mapped_device *md;
 
-	struct bio *split_bio;
 	/* The three fields represent mapped part of original bio */
 	struct bio *orig_bio;
 	unsigned int sector_offset; /* offset to end of orig_bio */
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index ee22c763873f..c2b95b931c31 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -594,7 +594,6 @@ static struct dm_io *alloc_io(struct mapped_device *md, struct bio *bio)
 	atomic_set(&io->io_count, 2);
 	this_cpu_inc(*md->pending_io);
 	io->orig_bio = bio;
-	io->split_bio = NULL;
 	io->md = md;
 	spin_lock_init(&io->lock);
 	io->start_time = jiffies;
@@ -884,10 +883,32 @@ static int __noflush_suspending(struct mapped_device *md)
 	return test_bit(DMF_NOFLUSH_SUSPENDING, &md->flags);
 }
 
+static void dm_requeue_add_io(struct dm_io *io, bool first_stage)
+{
+	struct mapped_device *md = io->md;
+
+	if (first_stage) {
+		struct dm_io *next = md->requeue_list;
+
+		md->requeue_list = io;
+		io->next = next;
+	} else {
+		bio_list_add_head(&md->deferred, io->orig_bio);
+	}
+}
+
+static void dm_requeue_schedule(struct mapped_device *md, bool first_stage)
+{
+	if (first_stage)
+		queue_work(md->wq, &md->requeue_work);
+	else
+		queue_work(md->wq, &md->work);
+}
+
 /* Return true if the original bio is requeued */
-static bool dm_handle_requeue(struct dm_io *io)
+static bool dm_handle_requeue(struct dm_io *io, bool first_stage)
 {
-	struct bio *bio = io->split_bio ? io->split_bio : io->orig_bio;
+	struct bio *bio = io->orig_bio;
 	bool need_requeue = (io->status == BLK_STS_DM_REQUEUE);
 	bool handle_eagain = (io->status == BLK_STS_AGAIN) &&
 		(bio->bi_opf & REQ_POLLED);
@@ -913,9 +934,9 @@ static bool dm_handle_requeue(struct dm_io *io)
 		spin_lock_irqsave(&md->deferred_lock, flags);
 		if ((__noflush_suspending(md) &&
 		    !WARN_ON_ONCE(dm_is_zone_write(md, bio))) ||
-				handle_eagain) {
+				handle_eagain || first_stage) {
 			/* NOTE early return due to BLK_STS_DM_REQUEUE below */
-			bio_list_add_head(&md->deferred, bio);
+			dm_requeue_add_io(io, first_stage);
 			requeued = true;
 		} else {
 			/*
@@ -928,18 +949,20 @@ static bool dm_handle_requeue(struct dm_io *io)
 	}
 
 	if (requeued)
-		queue_work(md->wq, &md->work);
+		dm_requeue_schedule(md, first_stage);
 	return requeued;
 }
 
-static void dm_io_complete(struct dm_io *io)
+static void __dm_io_complete(struct dm_io *io, bool first_stage)
 {
-	struct bio *bio = io->split_bio ? io->split_bio : io->orig_bio;
+	struct bio *bio = io->orig_bio;
 	struct mapped_device *md = io->md;
 	blk_status_t io_error;
 	bool requeued;
 
-	requeued = dm_handle_requeue(io);
+	requeued = dm_handle_requeue(io, first_stage);
+	if (requeued && first_stage)
+		return;
 
 	io_error = io->status;
 	if (dm_io_flagged(io, DM_IO_ACCOUNTED))
@@ -979,6 +1002,74 @@ static void dm_io_complete(struct dm_io *io)
 	}
 }
 
+static void dm_wq_requeue_work(struct work_struct *work)
+{
+	struct mapped_device *md = container_of(work, struct mapped_device,
+			requeue_work);
+	unsigned long flags;
+	struct dm_io *io;
+
+	/* reuse deferred lock to simplify dm_handle_requeue  */
+	spin_lock_irqsave(&md->deferred_lock, flags);
+	io = md->requeue_list;
+	md->requeue_list = NULL;
+	spin_unlock_irqrestore(&md->deferred_lock, flags);
+
+	while (io) {
+		struct dm_io *next = io->next;
+		struct bio *orig = io->orig_bio;
+		struct bio *new_orig = bio_alloc_clone(orig->bi_bdev,
+				orig, GFP_NOIO, &md->queue->bio_split);
+
+		/*
+		 * bio_rewind can restore to previous position since the end
+		 * sector is fixed for original bio, but we still need to
+		 * recover sectors manually
+		 */
+		bio_rewind(new_orig, (io->sector_offset << 9) -
+				orig->bi_iter.bi_size);
+		bio_trim(new_orig, 0, io->sectors);
+
+		bio_chain(new_orig, orig);
+		/*
+		 * __bi_remaining has been increased during split, so has to
+		 * drop the one added in bio_chain
+		 */
+		atomic_dec(&orig->__bi_remaining);
+		io->orig_bio = new_orig;
+
+		io->next = NULL;
+		__dm_io_complete(io, false);
+		io = next;
+	}
+}
+
+/*
+ * Two stages requeue:
+ *
+ * 1) io->orig_bio points to the real original bio, and we should requeue
+ * the part mapped to this io, instead of other parts of the original bio
+ */
+static void dm_io_complete(struct dm_io *io)
+{
+	bool first_requeue;
+
+	/*
+	 * only split io needs two stage requeue, otherwise we may run
+	 * into bio clone chain in long suspend, and OOM could be triggered,
+	 * pointed out by Mike.
+	 *
+	 * Also flush data io won't be marked as DM_IO_WAS_SPLIT, so they
+	 * needn't to be handled via the first stage requeue.
+	 */
+	if (dm_io_flagged(io, DM_IO_WAS_SPLIT))
+		first_requeue = true;
+	else
+		first_requeue = false;
+
+	__dm_io_complete(io, first_requeue);
+}
+
 /*
  * Decrements the number of outstanding ios that a bio has been
  * cloned into, completing the original io if necc.
@@ -1401,17 +1492,7 @@ static void setup_split_accounting(struct clone_info *ci, unsigned len)
 		 */
 		dm_io_set_flag(io, DM_IO_WAS_SPLIT);
 		io->sectors = len;
-	}
-
-	if (static_branch_unlikely(&stats_enabled) &&
-	    unlikely(dm_stats_used(&io->md->stats))) {
-		/*
-		 * Save bi_sector in terms of its offset from end of
-		 * original bio, only needed for DM-stats' benefit.
-		 * - saved regardless of whether split needed so that
-		 *   dm_accept_partial_bio() doesn't need to.
-		 */
-		io->sector_offset = bio_end_sector(ci->bio) - ci->sector;
+		io->sector_offset = bio_sectors(ci->bio);
 	}
 }
 
@@ -1711,11 +1792,9 @@ static void dm_split_and_process_bio(struct mapped_device *md,
 	 * Remainder must be passed to submit_bio_noacct() so it gets handled
 	 * *after* bios already submitted have been completely processed.
 	 */
-	WARN_ON_ONCE(!dm_io_flagged(io, DM_IO_WAS_SPLIT));
-	io->split_bio = bio_split(bio, io->sectors, GFP_NOIO,
-				  &md->queue->bio_split);
-	bio_chain(io->split_bio, bio);
-	trace_block_split(io->split_bio, bio->bi_iter.bi_sector);
+	bio_trim(bio, io->sectors, ci.sector_count);
+	trace_block_split(bio, bio->bi_iter.bi_sector);
+	bio_inc_remaining(bio);
 	submit_bio_noacct(bio);
 out:
 	/*
@@ -1991,9 +2070,11 @@ static struct mapped_device *alloc_dev(int minor)
 
 	init_waitqueue_head(&md->wait);
 	INIT_WORK(&md->work, dm_wq_work);
+	INIT_WORK(&md->requeue_work, dm_wq_requeue_work);
 	init_waitqueue_head(&md->eventq);
 	init_completion(&md->kobj_holder.completion);
 
+	md->requeue_list = NULL;
 	md->swap_bios = get_swap_bios();
 	sema_init(&md->swap_bios_semaphore, md->swap_bios);
 	mutex_init(&md->swap_bios_lock);
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [dm-devel] [PATCH 5.20 4/4] dm: add two stage requeue
@ 2022-06-24 14:12   ` Ming Lei
  0 siblings, 0 replies; 76+ messages in thread
From: Ming Lei @ 2022-06-24 14:12 UTC (permalink / raw)
  To: Jens Axboe, Mike Snitzer; +Cc: linux-block, dm-devel, Ming Lei

Commit 7dd76d1feec7 ("dm: improve bio splitting and associated IO accounting")
makes dm io's original bio points to same original bio from upper layer,
so more than one dm io can share one same original bio in case of
splitting. This way is fine if all dm io is completed successfully. But
if BLK_STS_DM_REQUEUE is returned from clone bio, the current code will
requeue the shared original bio, and cause the following issue:

1) the shared original bio has been trimmed and mapped to the last dm
   io, so it becomes not matched if requeuing this original bio

2) more than one dm io completion may touch the single shared original
   bio, such as the bio may have been submitted in one code path, but
   another code path may be ending it, so this way is very fragile.

Patch 'dm: fix BLK_STS_DM_REQUEUE handling when dm_io' can fix the
issue, but still need to clone one backing bio in case of split. We can
solve the issue by two stages requeue with help of new added bio_rewind,
then the bio clone can only be needed after BLK_STS_DM_REQUEUE happens:

1) requeue the dm io into the added requeue list, and schedule it via
new added requeue work, it is just for clone/allocate a mapped original
bio for requeue, and we recover the original bio by bio_rewind().

2) the 2nd stage requeue is same with original requeue, but io->orig_bio
points to new cloned bio which matches with the requeued dm io.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/md/dm-core.h |  11 +++-
 drivers/md/dm.c      | 131 ++++++++++++++++++++++++++++++++++---------
 2 files changed, 116 insertions(+), 26 deletions(-)

diff --git a/drivers/md/dm-core.h b/drivers/md/dm-core.h
index c954ff91870e..0545ce441427 100644
--- a/drivers/md/dm-core.h
+++ b/drivers/md/dm-core.h
@@ -22,6 +22,8 @@
 
 #define DM_RESERVED_MAX_IOS		1024
 
+struct dm_io;
+
 struct dm_kobject_holder {
 	struct kobject kobj;
 	struct completion completion;
@@ -91,6 +93,14 @@ struct mapped_device {
 	spinlock_t deferred_lock;
 	struct bio_list deferred;
 
+	/*
+	 * requeue work context is nedded for cloning one new bio
+	 * for representing the dm_io to be requeued since each
+	 * dm_io may point to the original bio from FS.
+	 */
+	struct work_struct requeue_work;
+	struct dm_io *requeue_list;
+
 	void *interface_ptr;
 
 	/*
@@ -272,7 +282,6 @@ struct dm_io {
 	atomic_t io_count;
 	struct mapped_device *md;
 
-	struct bio *split_bio;
 	/* The three fields represent mapped part of original bio */
 	struct bio *orig_bio;
 	unsigned int sector_offset; /* offset to end of orig_bio */
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index ee22c763873f..c2b95b931c31 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -594,7 +594,6 @@ static struct dm_io *alloc_io(struct mapped_device *md, struct bio *bio)
 	atomic_set(&io->io_count, 2);
 	this_cpu_inc(*md->pending_io);
 	io->orig_bio = bio;
-	io->split_bio = NULL;
 	io->md = md;
 	spin_lock_init(&io->lock);
 	io->start_time = jiffies;
@@ -884,10 +883,32 @@ static int __noflush_suspending(struct mapped_device *md)
 	return test_bit(DMF_NOFLUSH_SUSPENDING, &md->flags);
 }
 
+static void dm_requeue_add_io(struct dm_io *io, bool first_stage)
+{
+	struct mapped_device *md = io->md;
+
+	if (first_stage) {
+		struct dm_io *next = md->requeue_list;
+
+		md->requeue_list = io;
+		io->next = next;
+	} else {
+		bio_list_add_head(&md->deferred, io->orig_bio);
+	}
+}
+
+static void dm_requeue_schedule(struct mapped_device *md, bool first_stage)
+{
+	if (first_stage)
+		queue_work(md->wq, &md->requeue_work);
+	else
+		queue_work(md->wq, &md->work);
+}
+
 /* Return true if the original bio is requeued */
-static bool dm_handle_requeue(struct dm_io *io)
+static bool dm_handle_requeue(struct dm_io *io, bool first_stage)
 {
-	struct bio *bio = io->split_bio ? io->split_bio : io->orig_bio;
+	struct bio *bio = io->orig_bio;
 	bool need_requeue = (io->status == BLK_STS_DM_REQUEUE);
 	bool handle_eagain = (io->status == BLK_STS_AGAIN) &&
 		(bio->bi_opf & REQ_POLLED);
@@ -913,9 +934,9 @@ static bool dm_handle_requeue(struct dm_io *io)
 		spin_lock_irqsave(&md->deferred_lock, flags);
 		if ((__noflush_suspending(md) &&
 		    !WARN_ON_ONCE(dm_is_zone_write(md, bio))) ||
-				handle_eagain) {
+				handle_eagain || first_stage) {
 			/* NOTE early return due to BLK_STS_DM_REQUEUE below */
-			bio_list_add_head(&md->deferred, bio);
+			dm_requeue_add_io(io, first_stage);
 			requeued = true;
 		} else {
 			/*
@@ -928,18 +949,20 @@ static bool dm_handle_requeue(struct dm_io *io)
 	}
 
 	if (requeued)
-		queue_work(md->wq, &md->work);
+		dm_requeue_schedule(md, first_stage);
 	return requeued;
 }
 
-static void dm_io_complete(struct dm_io *io)
+static void __dm_io_complete(struct dm_io *io, bool first_stage)
 {
-	struct bio *bio = io->split_bio ? io->split_bio : io->orig_bio;
+	struct bio *bio = io->orig_bio;
 	struct mapped_device *md = io->md;
 	blk_status_t io_error;
 	bool requeued;
 
-	requeued = dm_handle_requeue(io);
+	requeued = dm_handle_requeue(io, first_stage);
+	if (requeued && first_stage)
+		return;
 
 	io_error = io->status;
 	if (dm_io_flagged(io, DM_IO_ACCOUNTED))
@@ -979,6 +1002,74 @@ static void dm_io_complete(struct dm_io *io)
 	}
 }
 
+static void dm_wq_requeue_work(struct work_struct *work)
+{
+	struct mapped_device *md = container_of(work, struct mapped_device,
+			requeue_work);
+	unsigned long flags;
+	struct dm_io *io;
+
+	/* reuse deferred lock to simplify dm_handle_requeue  */
+	spin_lock_irqsave(&md->deferred_lock, flags);
+	io = md->requeue_list;
+	md->requeue_list = NULL;
+	spin_unlock_irqrestore(&md->deferred_lock, flags);
+
+	while (io) {
+		struct dm_io *next = io->next;
+		struct bio *orig = io->orig_bio;
+		struct bio *new_orig = bio_alloc_clone(orig->bi_bdev,
+				orig, GFP_NOIO, &md->queue->bio_split);
+
+		/*
+		 * bio_rewind can restore to previous position since the end
+		 * sector is fixed for original bio, but we still need to
+		 * recover sectors manually
+		 */
+		bio_rewind(new_orig, (io->sector_offset << 9) -
+				orig->bi_iter.bi_size);
+		bio_trim(new_orig, 0, io->sectors);
+
+		bio_chain(new_orig, orig);
+		/*
+		 * __bi_remaining has been increased during split, so has to
+		 * drop the one added in bio_chain
+		 */
+		atomic_dec(&orig->__bi_remaining);
+		io->orig_bio = new_orig;
+
+		io->next = NULL;
+		__dm_io_complete(io, false);
+		io = next;
+	}
+}
+
+/*
+ * Two stages requeue:
+ *
+ * 1) io->orig_bio points to the real original bio, and we should requeue
+ * the part mapped to this io, instead of other parts of the original bio
+ */
+static void dm_io_complete(struct dm_io *io)
+{
+	bool first_requeue;
+
+	/*
+	 * only split io needs two stage requeue, otherwise we may run
+	 * into bio clone chain in long suspend, and OOM could be triggered,
+	 * pointed out by Mike.
+	 *
+	 * Also flush data io won't be marked as DM_IO_WAS_SPLIT, so they
+	 * needn't to be handled via the first stage requeue.
+	 */
+	if (dm_io_flagged(io, DM_IO_WAS_SPLIT))
+		first_requeue = true;
+	else
+		first_requeue = false;
+
+	__dm_io_complete(io, first_requeue);
+}
+
 /*
  * Decrements the number of outstanding ios that a bio has been
  * cloned into, completing the original io if necc.
@@ -1401,17 +1492,7 @@ static void setup_split_accounting(struct clone_info *ci, unsigned len)
 		 */
 		dm_io_set_flag(io, DM_IO_WAS_SPLIT);
 		io->sectors = len;
-	}
-
-	if (static_branch_unlikely(&stats_enabled) &&
-	    unlikely(dm_stats_used(&io->md->stats))) {
-		/*
-		 * Save bi_sector in terms of its offset from end of
-		 * original bio, only needed for DM-stats' benefit.
-		 * - saved regardless of whether split needed so that
-		 *   dm_accept_partial_bio() doesn't need to.
-		 */
-		io->sector_offset = bio_end_sector(ci->bio) - ci->sector;
+		io->sector_offset = bio_sectors(ci->bio);
 	}
 }
 
@@ -1711,11 +1792,9 @@ static void dm_split_and_process_bio(struct mapped_device *md,
 	 * Remainder must be passed to submit_bio_noacct() so it gets handled
 	 * *after* bios already submitted have been completely processed.
 	 */
-	WARN_ON_ONCE(!dm_io_flagged(io, DM_IO_WAS_SPLIT));
-	io->split_bio = bio_split(bio, io->sectors, GFP_NOIO,
-				  &md->queue->bio_split);
-	bio_chain(io->split_bio, bio);
-	trace_block_split(io->split_bio, bio->bi_iter.bi_sector);
+	bio_trim(bio, io->sectors, ci.sector_count);
+	trace_block_split(bio, bio->bi_iter.bi_sector);
+	bio_inc_remaining(bio);
 	submit_bio_noacct(bio);
 out:
 	/*
@@ -1991,9 +2070,11 @@ static struct mapped_device *alloc_dev(int minor)
 
 	init_waitqueue_head(&md->wait);
 	INIT_WORK(&md->work, dm_wq_work);
+	INIT_WORK(&md->requeue_work, dm_wq_requeue_work);
 	init_waitqueue_head(&md->eventq);
 	init_completion(&md->kobj_holder.completion);
 
+	md->requeue_list = NULL;
 	md->swap_bios = get_swap_bios();
 	sema_init(&md->swap_bios_semaphore, md->swap_bios);
 	mutex_init(&md->swap_bios_lock);
-- 
2.31.1

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* Re: [PATCH 5.20 1/4] block: add bio_rewind() API
  2022-06-24 14:12   ` [dm-devel] " Ming Lei
@ 2022-06-26 20:14     ` Kent Overstreet
  -1 siblings, 0 replies; 76+ messages in thread
From: Kent Overstreet @ 2022-06-26 20:14 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, Mike Snitzer, linux-block, dm-devel, Eric Biggers,
	Dmitry Monakhov, Martin K . Petersen

On Fri, Jun 24, 2022 at 10:12:52PM +0800, Ming Lei wrote:
> Commit 7759eb23fd98 ("block: remove bio_rewind_iter()") removes
> the similar API because the following reasons:
> 
>     ```
>     It is pointed that bio_rewind_iter() is one very bad API[1]:
> 
>     1) bio size may not be restored after rewinding
> 
>     2) it causes some bogus change, such as 5151842b9d8732 (block: reset
>     bi_iter.bi_done after splitting bio)
> 
>     3) rewinding really makes things complicated wrt. bio splitting
> 
>     4) unnecessary updating of .bi_done in fast path
> 
>     [1] https://marc.info/?t=153549924200005&r=1&w=2
> 
>     So this patch takes Kent's suggestion to restore one bio into its original
>     state via saving bio iterator(struct bvec_iter) in bio_integrity_prep(),
>     given now bio_rewind_iter() is only used by bio integrity code.
>     ```
> 
> However, it isn't easy to restore bio by saving 32 bytes bio->bi_iter, and saving
> it only can't restore crypto and integrity info.
> 
> Add bio_rewind() back for some use cases which may not be same with
> previous generic case:
> 
> 1) most of bio has fixed end sector since bio split is done from front of the bio,
> if driver just records how many sectors between current bio's start sector and
> the bio's end sector, the original position can be restored
> 
> 2) if one bio's end sector won't change, usually bio_trim() isn't called, user can
> restore original position by storing sectors from current ->bi_iter.bi_sector to
> bio's end sector; together by saving bio size, 8 bytes can restore to
> original bio.
> 
> 3) dm's requeue use case: when BLK_STS_DM_REQUEUE happens, dm core needs to
> restore to the original bio which represents current dm io to be requeued.
> By storing sectors to the bio's end sector and dm io's size,
> bio_rewind() can restore such original bio, then dm core code needn't to
> allocate one bio beforehand just for handling BLK_STS_DM_REQUEUE which
> is actually one unusual event.
> 
> 4) Not like original rewind API, this one needn't to add .bi_done, and no any
> effect on fast path

It seems like perhaps the real issue here is that we need a real bio_iter,
separate from bvec_iter, that also encapsulates iterating over integrity &
fscrypt. 

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [dm-devel] [PATCH 5.20 1/4] block: add bio_rewind() API
@ 2022-06-26 20:14     ` Kent Overstreet
  0 siblings, 0 replies; 76+ messages in thread
From: Kent Overstreet @ 2022-06-26 20:14 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, Martin K . Petersen, Mike Snitzer, Eric Biggers,
	linux-block, dm-devel, Dmitry Monakhov

On Fri, Jun 24, 2022 at 10:12:52PM +0800, Ming Lei wrote:
> Commit 7759eb23fd98 ("block: remove bio_rewind_iter()") removes
> the similar API because the following reasons:
> 
>     ```
>     It is pointed that bio_rewind_iter() is one very bad API[1]:
> 
>     1) bio size may not be restored after rewinding
> 
>     2) it causes some bogus change, such as 5151842b9d8732 (block: reset
>     bi_iter.bi_done after splitting bio)
> 
>     3) rewinding really makes things complicated wrt. bio splitting
> 
>     4) unnecessary updating of .bi_done in fast path
> 
>     [1] https://marc.info/?t=153549924200005&r=1&w=2
> 
>     So this patch takes Kent's suggestion to restore one bio into its original
>     state via saving bio iterator(struct bvec_iter) in bio_integrity_prep(),
>     given now bio_rewind_iter() is only used by bio integrity code.
>     ```
> 
> However, it isn't easy to restore bio by saving 32 bytes bio->bi_iter, and saving
> it only can't restore crypto and integrity info.
> 
> Add bio_rewind() back for some use cases which may not be same with
> previous generic case:
> 
> 1) most of bio has fixed end sector since bio split is done from front of the bio,
> if driver just records how many sectors between current bio's start sector and
> the bio's end sector, the original position can be restored
> 
> 2) if one bio's end sector won't change, usually bio_trim() isn't called, user can
> restore original position by storing sectors from current ->bi_iter.bi_sector to
> bio's end sector; together by saving bio size, 8 bytes can restore to
> original bio.
> 
> 3) dm's requeue use case: when BLK_STS_DM_REQUEUE happens, dm core needs to
> restore to the original bio which represents current dm io to be requeued.
> By storing sectors to the bio's end sector and dm io's size,
> bio_rewind() can restore such original bio, then dm core code needn't to
> allocate one bio beforehand just for handling BLK_STS_DM_REQUEUE which
> is actually one unusual event.
> 
> 4) Not like original rewind API, this one needn't to add .bi_done, and no any
> effect on fast path

It seems like perhaps the real issue here is that we need a real bio_iter,
separate from bvec_iter, that also encapsulates iterating over integrity &
fscrypt. 

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [dm-devel] [PATCH 5.20 1/4] block: add bio_rewind() API
  2022-06-24 14:12   ` [dm-devel] " Ming Lei
@ 2022-06-26 21:37     ` Eric Biggers
  -1 siblings, 0 replies; 76+ messages in thread
From: Eric Biggers @ 2022-06-26 21:37 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, Mike Snitzer, Martin K . Petersen, linux-block,
	dm-devel, Dmitry Monakhov, Kent Overstreet

On Fri, Jun 24, 2022 at 10:12:52PM +0800, Ming Lei wrote:
> diff --git a/block/blk-crypto.c b/block/blk-crypto.c
> index a496aaef85ba..caae2f429fc7 100644
> --- a/block/blk-crypto.c
> +++ b/block/blk-crypto.c
> @@ -134,6 +134,21 @@ void bio_crypt_dun_increment(u64 dun[BLK_CRYPTO_DUN_ARRAY_SIZE],
>  	}
>  }
>  
> +/* Decrements @dun by @dec, treating @dun as a multi-limb integer. */
> +void bio_crypt_dun_decrement(u64 dun[BLK_CRYPTO_DUN_ARRAY_SIZE],
> +			     unsigned int dec)
> +{
> +	int i;
> +
> +	for (i = 0; dec && i < BLK_CRYPTO_DUN_ARRAY_SIZE; i++) {
> +		dun[i] -= dec;
> +		if (dun[i] > inc)
> +			dec = 1;
> +		else
> +			dec = 0;
> +	}
> +}

This doesn't compile.  Also this doesn't handle underflow into the next limb
correctly.  A correct version would be:

		u64 prev = dun[i];

		dun[i] -= dec;
		if (dun[i] > prev)
			dec = 1;
		else
			dec = 0;

- Eric

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [dm-devel] [PATCH 5.20 1/4] block: add bio_rewind() API
@ 2022-06-26 21:37     ` Eric Biggers
  0 siblings, 0 replies; 76+ messages in thread
From: Eric Biggers @ 2022-06-26 21:37 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, Mike Snitzer, Martin K . Petersen, linux-block,
	dm-devel, Dmitry Monakhov, Kent Overstreet

On Fri, Jun 24, 2022 at 10:12:52PM +0800, Ming Lei wrote:
> diff --git a/block/blk-crypto.c b/block/blk-crypto.c
> index a496aaef85ba..caae2f429fc7 100644
> --- a/block/blk-crypto.c
> +++ b/block/blk-crypto.c
> @@ -134,6 +134,21 @@ void bio_crypt_dun_increment(u64 dun[BLK_CRYPTO_DUN_ARRAY_SIZE],
>  	}
>  }
>  
> +/* Decrements @dun by @dec, treating @dun as a multi-limb integer. */
> +void bio_crypt_dun_decrement(u64 dun[BLK_CRYPTO_DUN_ARRAY_SIZE],
> +			     unsigned int dec)
> +{
> +	int i;
> +
> +	for (i = 0; dec && i < BLK_CRYPTO_DUN_ARRAY_SIZE; i++) {
> +		dun[i] -= dec;
> +		if (dun[i] > inc)
> +			dec = 1;
> +		else
> +			dec = 0;
> +	}
> +}

This doesn't compile.  Also this doesn't handle underflow into the next limb
correctly.  A correct version would be:

		u64 prev = dun[i];

		dun[i] -= dec;
		if (dun[i] > prev)
			dec = 1;
		else
			dec = 0;

- Eric

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 5.20 1/4] block: add bio_rewind() API
  2022-06-26 20:14     ` [dm-devel] " Kent Overstreet
@ 2022-06-27  7:36       ` Ming Lei
  -1 siblings, 0 replies; 76+ messages in thread
From: Ming Lei @ 2022-06-27  7:36 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Jens Axboe, Mike Snitzer, linux-block, dm-devel, Eric Biggers,
	Dmitry Monakhov, Martin K . Petersen

On Sun, Jun 26, 2022 at 04:14:58PM -0400, Kent Overstreet wrote:
> On Fri, Jun 24, 2022 at 10:12:52PM +0800, Ming Lei wrote:
> > Commit 7759eb23fd98 ("block: remove bio_rewind_iter()") removes
> > the similar API because the following reasons:
> > 
> >     ```
> >     It is pointed that bio_rewind_iter() is one very bad API[1]:
> > 
> >     1) bio size may not be restored after rewinding
> > 
> >     2) it causes some bogus change, such as 5151842b9d8732 (block: reset
> >     bi_iter.bi_done after splitting bio)
> > 
> >     3) rewinding really makes things complicated wrt. bio splitting
> > 
> >     4) unnecessary updating of .bi_done in fast path
> > 
> >     [1] https://marc.info/?t=153549924200005&r=1&w=2
> > 
> >     So this patch takes Kent's suggestion to restore one bio into its original
> >     state via saving bio iterator(struct bvec_iter) in bio_integrity_prep(),
> >     given now bio_rewind_iter() is only used by bio integrity code.
> >     ```
> > 
> > However, it isn't easy to restore bio by saving 32 bytes bio->bi_iter, and saving
> > it only can't restore crypto and integrity info.
> > 
> > Add bio_rewind() back for some use cases which may not be same with
> > previous generic case:
> > 
> > 1) most of bio has fixed end sector since bio split is done from front of the bio,
> > if driver just records how many sectors between current bio's start sector and
> > the bio's end sector, the original position can be restored
> > 
> > 2) if one bio's end sector won't change, usually bio_trim() isn't called, user can
> > restore original position by storing sectors from current ->bi_iter.bi_sector to
> > bio's end sector; together by saving bio size, 8 bytes can restore to
> > original bio.
> > 
> > 3) dm's requeue use case: when BLK_STS_DM_REQUEUE happens, dm core needs to
> > restore to the original bio which represents current dm io to be requeued.
> > By storing sectors to the bio's end sector and dm io's size,
> > bio_rewind() can restore such original bio, then dm core code needn't to
> > allocate one bio beforehand just for handling BLK_STS_DM_REQUEUE which
> > is actually one unusual event.
> > 
> > 4) Not like original rewind API, this one needn't to add .bi_done, and no any
> > effect on fast path
> 
> It seems like perhaps the real issue here is that we need a real bio_iter,
> separate from bvec_iter, that also encapsulates iterating over integrity &
> fscrypt. 

Not mention bio_iter, bvec_iter has been 32 bytes, which is too big to
hold in per-io data structure. With this patch, 8bytes is enough
to rewind one bio if the end sector is fixed.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [dm-devel] [PATCH 5.20 1/4] block: add bio_rewind() API
@ 2022-06-27  7:36       ` Ming Lei
  0 siblings, 0 replies; 76+ messages in thread
From: Ming Lei @ 2022-06-27  7:36 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Jens Axboe, Martin K . Petersen, Mike Snitzer, Eric Biggers,
	linux-block, dm-devel, Dmitry Monakhov

On Sun, Jun 26, 2022 at 04:14:58PM -0400, Kent Overstreet wrote:
> On Fri, Jun 24, 2022 at 10:12:52PM +0800, Ming Lei wrote:
> > Commit 7759eb23fd98 ("block: remove bio_rewind_iter()") removes
> > the similar API because the following reasons:
> > 
> >     ```
> >     It is pointed that bio_rewind_iter() is one very bad API[1]:
> > 
> >     1) bio size may not be restored after rewinding
> > 
> >     2) it causes some bogus change, such as 5151842b9d8732 (block: reset
> >     bi_iter.bi_done after splitting bio)
> > 
> >     3) rewinding really makes things complicated wrt. bio splitting
> > 
> >     4) unnecessary updating of .bi_done in fast path
> > 
> >     [1] https://marc.info/?t=153549924200005&r=1&w=2
> > 
> >     So this patch takes Kent's suggestion to restore one bio into its original
> >     state via saving bio iterator(struct bvec_iter) in bio_integrity_prep(),
> >     given now bio_rewind_iter() is only used by bio integrity code.
> >     ```
> > 
> > However, it isn't easy to restore bio by saving 32 bytes bio->bi_iter, and saving
> > it only can't restore crypto and integrity info.
> > 
> > Add bio_rewind() back for some use cases which may not be same with
> > previous generic case:
> > 
> > 1) most of bio has fixed end sector since bio split is done from front of the bio,
> > if driver just records how many sectors between current bio's start sector and
> > the bio's end sector, the original position can be restored
> > 
> > 2) if one bio's end sector won't change, usually bio_trim() isn't called, user can
> > restore original position by storing sectors from current ->bi_iter.bi_sector to
> > bio's end sector; together by saving bio size, 8 bytes can restore to
> > original bio.
> > 
> > 3) dm's requeue use case: when BLK_STS_DM_REQUEUE happens, dm core needs to
> > restore to the original bio which represents current dm io to be requeued.
> > By storing sectors to the bio's end sector and dm io's size,
> > bio_rewind() can restore such original bio, then dm core code needn't to
> > allocate one bio beforehand just for handling BLK_STS_DM_REQUEUE which
> > is actually one unusual event.
> > 
> > 4) Not like original rewind API, this one needn't to add .bi_done, and no any
> > effect on fast path
> 
> It seems like perhaps the real issue here is that we need a real bio_iter,
> separate from bvec_iter, that also encapsulates iterating over integrity &
> fscrypt. 

Not mention bio_iter, bvec_iter has been 32 bytes, which is too big to
hold in per-io data structure. With this patch, 8bytes is enough
to rewind one bio if the end sector is fixed.


Thanks,
Ming
--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [dm-devel] [PATCH 5.20 1/4] block: add bio_rewind() API
  2022-06-26 21:37     ` Eric Biggers
@ 2022-06-27  7:37       ` Ming Lei
  -1 siblings, 0 replies; 76+ messages in thread
From: Ming Lei @ 2022-06-27  7:37 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Jens Axboe, Mike Snitzer, Martin K . Petersen, linux-block,
	dm-devel, Dmitry Monakhov, Kent Overstreet

On Sun, Jun 26, 2022 at 02:37:22PM -0700, Eric Biggers wrote:
> On Fri, Jun 24, 2022 at 10:12:52PM +0800, Ming Lei wrote:
> > diff --git a/block/blk-crypto.c b/block/blk-crypto.c
> > index a496aaef85ba..caae2f429fc7 100644
> > --- a/block/blk-crypto.c
> > +++ b/block/blk-crypto.c
> > @@ -134,6 +134,21 @@ void bio_crypt_dun_increment(u64 dun[BLK_CRYPTO_DUN_ARRAY_SIZE],
> >  	}
> >  }
> >  
> > +/* Decrements @dun by @dec, treating @dun as a multi-limb integer. */
> > +void bio_crypt_dun_decrement(u64 dun[BLK_CRYPTO_DUN_ARRAY_SIZE],
> > +			     unsigned int dec)
> > +{
> > +	int i;
> > +
> > +	for (i = 0; dec && i < BLK_CRYPTO_DUN_ARRAY_SIZE; i++) {
> > +		dun[i] -= dec;
> > +		if (dun[i] > inc)
> > +			dec = 1;
> > +		else
> > +			dec = 0;
> > +	}
> > +}
> 
> This doesn't compile.  Also this doesn't handle underflow into the next limb
> correctly.  A correct version would be:
> 
> 		u64 prev = dun[i];
> 
> 		dun[i] -= dec;
> 		if (dun[i] > prev)
> 			dec = 1;
> 		else
> 			dec = 0;

You are right, thanks for the review!

Thanks,
Ming


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [dm-devel] [PATCH 5.20 1/4] block: add bio_rewind() API
@ 2022-06-27  7:37       ` Ming Lei
  0 siblings, 0 replies; 76+ messages in thread
From: Ming Lei @ 2022-06-27  7:37 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Jens Axboe, Mike Snitzer, Martin K . Petersen, linux-block,
	dm-devel, Dmitry Monakhov, Kent Overstreet

On Sun, Jun 26, 2022 at 02:37:22PM -0700, Eric Biggers wrote:
> On Fri, Jun 24, 2022 at 10:12:52PM +0800, Ming Lei wrote:
> > diff --git a/block/blk-crypto.c b/block/blk-crypto.c
> > index a496aaef85ba..caae2f429fc7 100644
> > --- a/block/blk-crypto.c
> > +++ b/block/blk-crypto.c
> > @@ -134,6 +134,21 @@ void bio_crypt_dun_increment(u64 dun[BLK_CRYPTO_DUN_ARRAY_SIZE],
> >  	}
> >  }
> >  
> > +/* Decrements @dun by @dec, treating @dun as a multi-limb integer. */
> > +void bio_crypt_dun_decrement(u64 dun[BLK_CRYPTO_DUN_ARRAY_SIZE],
> > +			     unsigned int dec)
> > +{
> > +	int i;
> > +
> > +	for (i = 0; dec && i < BLK_CRYPTO_DUN_ARRAY_SIZE; i++) {
> > +		dun[i] -= dec;
> > +		if (dun[i] > inc)
> > +			dec = 1;
> > +		else
> > +			dec = 0;
> > +	}
> > +}
> 
> This doesn't compile.  Also this doesn't handle underflow into the next limb
> correctly.  A correct version would be:
> 
> 		u64 prev = dun[i];
> 
> 		dun[i] -= dec;
> 		if (dun[i] > prev)
> 			dec = 1;
> 		else
> 			dec = 0;

You are right, thanks for the review!

Thanks,
Ming
--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 5.20 1/4] block: add bio_rewind() API
  2022-06-27  7:36       ` [dm-devel] " Ming Lei
@ 2022-06-28  4:20         ` Kent Overstreet
  -1 siblings, 0 replies; 76+ messages in thread
From: Kent Overstreet @ 2022-06-28  4:20 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, Mike Snitzer, linux-block, dm-devel, Eric Biggers,
	Dmitry Monakhov, Martin K . Petersen

On Mon, Jun 27, 2022 at 03:36:22PM +0800, Ming Lei wrote:
> On Sun, Jun 26, 2022 at 04:14:58PM -0400, Kent Overstreet wrote:
> > On Fri, Jun 24, 2022 at 10:12:52PM +0800, Ming Lei wrote:
> > > Commit 7759eb23fd98 ("block: remove bio_rewind_iter()") removes
> > > the similar API because the following reasons:
> > > 
> > >     ```
> > >     It is pointed that bio_rewind_iter() is one very bad API[1]:
> > > 
> > >     1) bio size may not be restored after rewinding
> > > 
> > >     2) it causes some bogus change, such as 5151842b9d8732 (block: reset
> > >     bi_iter.bi_done after splitting bio)
> > > 
> > >     3) rewinding really makes things complicated wrt. bio splitting
> > > 
> > >     4) unnecessary updating of .bi_done in fast path
> > > 
> > >     [1] https://marc.info/?t=153549924200005&r=1&w=2
> > > 
> > >     So this patch takes Kent's suggestion to restore one bio into its original
> > >     state via saving bio iterator(struct bvec_iter) in bio_integrity_prep(),
> > >     given now bio_rewind_iter() is only used by bio integrity code.
> > >     ```
> > > 
> > > However, it isn't easy to restore bio by saving 32 bytes bio->bi_iter, and saving
> > > it only can't restore crypto and integrity info.
> > > 
> > > Add bio_rewind() back for some use cases which may not be same with
> > > previous generic case:
> > > 
> > > 1) most of bio has fixed end sector since bio split is done from front of the bio,
> > > if driver just records how many sectors between current bio's start sector and
> > > the bio's end sector, the original position can be restored
> > > 
> > > 2) if one bio's end sector won't change, usually bio_trim() isn't called, user can
> > > restore original position by storing sectors from current ->bi_iter.bi_sector to
> > > bio's end sector; together by saving bio size, 8 bytes can restore to
> > > original bio.
> > > 
> > > 3) dm's requeue use case: when BLK_STS_DM_REQUEUE happens, dm core needs to
> > > restore to the original bio which represents current dm io to be requeued.
> > > By storing sectors to the bio's end sector and dm io's size,
> > > bio_rewind() can restore such original bio, then dm core code needn't to
> > > allocate one bio beforehand just for handling BLK_STS_DM_REQUEUE which
> > > is actually one unusual event.
> > > 
> > > 4) Not like original rewind API, this one needn't to add .bi_done, and no any
> > > effect on fast path
> > 
> > It seems like perhaps the real issue here is that we need a real bio_iter,
> > separate from bvec_iter, that also encapsulates iterating over integrity &
> > fscrypt. 
> 
> Not mention bio_iter, bvec_iter has been 32 bytes, which is too big to
> hold in per-io data structure. With this patch, 8bytes is enough
> to rewind one bio if the end sector is fixed.

Hold on though, does that check out? Why is that too big for per IO data
structures?

By definition these structures are only for IOs in flight, and we don't _want_
there to ever be very many of these or we're going to run into latency issues
due to queue depth.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [dm-devel] [PATCH 5.20 1/4] block: add bio_rewind() API
@ 2022-06-28  4:20         ` Kent Overstreet
  0 siblings, 0 replies; 76+ messages in thread
From: Kent Overstreet @ 2022-06-28  4:20 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, Martin K . Petersen, Mike Snitzer, Eric Biggers,
	linux-block, dm-devel, Dmitry Monakhov

On Mon, Jun 27, 2022 at 03:36:22PM +0800, Ming Lei wrote:
> On Sun, Jun 26, 2022 at 04:14:58PM -0400, Kent Overstreet wrote:
> > On Fri, Jun 24, 2022 at 10:12:52PM +0800, Ming Lei wrote:
> > > Commit 7759eb23fd98 ("block: remove bio_rewind_iter()") removes
> > > the similar API because the following reasons:
> > > 
> > >     ```
> > >     It is pointed that bio_rewind_iter() is one very bad API[1]:
> > > 
> > >     1) bio size may not be restored after rewinding
> > > 
> > >     2) it causes some bogus change, such as 5151842b9d8732 (block: reset
> > >     bi_iter.bi_done after splitting bio)
> > > 
> > >     3) rewinding really makes things complicated wrt. bio splitting
> > > 
> > >     4) unnecessary updating of .bi_done in fast path
> > > 
> > >     [1] https://marc.info/?t=153549924200005&r=1&w=2
> > > 
> > >     So this patch takes Kent's suggestion to restore one bio into its original
> > >     state via saving bio iterator(struct bvec_iter) in bio_integrity_prep(),
> > >     given now bio_rewind_iter() is only used by bio integrity code.
> > >     ```
> > > 
> > > However, it isn't easy to restore bio by saving 32 bytes bio->bi_iter, and saving
> > > it only can't restore crypto and integrity info.
> > > 
> > > Add bio_rewind() back for some use cases which may not be same with
> > > previous generic case:
> > > 
> > > 1) most of bio has fixed end sector since bio split is done from front of the bio,
> > > if driver just records how many sectors between current bio's start sector and
> > > the bio's end sector, the original position can be restored
> > > 
> > > 2) if one bio's end sector won't change, usually bio_trim() isn't called, user can
> > > restore original position by storing sectors from current ->bi_iter.bi_sector to
> > > bio's end sector; together by saving bio size, 8 bytes can restore to
> > > original bio.
> > > 
> > > 3) dm's requeue use case: when BLK_STS_DM_REQUEUE happens, dm core needs to
> > > restore to the original bio which represents current dm io to be requeued.
> > > By storing sectors to the bio's end sector and dm io's size,
> > > bio_rewind() can restore such original bio, then dm core code needn't to
> > > allocate one bio beforehand just for handling BLK_STS_DM_REQUEUE which
> > > is actually one unusual event.
> > > 
> > > 4) Not like original rewind API, this one needn't to add .bi_done, and no any
> > > effect on fast path
> > 
> > It seems like perhaps the real issue here is that we need a real bio_iter,
> > separate from bvec_iter, that also encapsulates iterating over integrity &
> > fscrypt. 
> 
> Not mention bio_iter, bvec_iter has been 32 bytes, which is too big to
> hold in per-io data structure. With this patch, 8bytes is enough
> to rewind one bio if the end sector is fixed.

Hold on though, does that check out? Why is that too big for per IO data
structures?

By definition these structures are only for IOs in flight, and we don't _want_
there to ever be very many of these or we're going to run into latency issues
due to queue depth.

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 5.20 1/4] block: add bio_rewind() API
  2022-06-27  7:36       ` [dm-devel] " Ming Lei
@ 2022-06-28  4:26         ` Kent Overstreet
  -1 siblings, 0 replies; 76+ messages in thread
From: Kent Overstreet @ 2022-06-28  4:26 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, Mike Snitzer, linux-block, dm-devel, Eric Biggers,
	Dmitry Monakhov, Martin K . Petersen

On Mon, Jun 27, 2022 at 03:36:22PM +0800, Ming Lei wrote:
> Not mention bio_iter, bvec_iter has been 32 bytes, which is too big to
> hold in per-io data structure. With this patch, 8bytes is enough
> to rewind one bio if the end sector is fixed.

And with rewind, you're making an assumption about the state the iterator is
going to be in when the IO has completed.

What if the iterator was never advanced?

So say you check for that by saving some other part of the iterator - but that
may have legitimately changed too, if the bio was redirected (bi_sector changes)
or trimmed (bi_size changes)

I still think this is an inherently buggy interface, the way it's being proposed
to be used.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [dm-devel] [PATCH 5.20 1/4] block: add bio_rewind() API
@ 2022-06-28  4:26         ` Kent Overstreet
  0 siblings, 0 replies; 76+ messages in thread
From: Kent Overstreet @ 2022-06-28  4:26 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, Martin K . Petersen, Mike Snitzer, Eric Biggers,
	linux-block, dm-devel, Dmitry Monakhov

On Mon, Jun 27, 2022 at 03:36:22PM +0800, Ming Lei wrote:
> Not mention bio_iter, bvec_iter has been 32 bytes, which is too big to
> hold in per-io data structure. With this patch, 8bytes is enough
> to rewind one bio if the end sector is fixed.

And with rewind, you're making an assumption about the state the iterator is
going to be in when the IO has completed.

What if the iterator was never advanced?

So say you check for that by saving some other part of the iterator - but that
may have legitimately changed too, if the bio was redirected (bi_sector changes)
or trimmed (bi_size changes)

I still think this is an inherently buggy interface, the way it's being proposed
to be used.

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 5.20 1/4] block: add bio_rewind() API
  2022-06-28  4:20         ` [dm-devel] " Kent Overstreet
@ 2022-06-28  7:42           ` Ming Lei
  -1 siblings, 0 replies; 76+ messages in thread
From: Ming Lei @ 2022-06-28  7:42 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Jens Axboe, Mike Snitzer, linux-block, dm-devel, Eric Biggers,
	Dmitry Monakhov, Martin K . Petersen

On Tue, Jun 28, 2022 at 12:20:16AM -0400, Kent Overstreet wrote:
> On Mon, Jun 27, 2022 at 03:36:22PM +0800, Ming Lei wrote:
> > On Sun, Jun 26, 2022 at 04:14:58PM -0400, Kent Overstreet wrote:
> > > On Fri, Jun 24, 2022 at 10:12:52PM +0800, Ming Lei wrote:
> > > > Commit 7759eb23fd98 ("block: remove bio_rewind_iter()") removes
> > > > the similar API because the following reasons:
> > > > 
> > > >     ```
> > > >     It is pointed that bio_rewind_iter() is one very bad API[1]:
> > > > 
> > > >     1) bio size may not be restored after rewinding
> > > > 
> > > >     2) it causes some bogus change, such as 5151842b9d8732 (block: reset
> > > >     bi_iter.bi_done after splitting bio)
> > > > 
> > > >     3) rewinding really makes things complicated wrt. bio splitting
> > > > 
> > > >     4) unnecessary updating of .bi_done in fast path
> > > > 
> > > >     [1] https://marc.info/?t=153549924200005&r=1&w=2
> > > > 
> > > >     So this patch takes Kent's suggestion to restore one bio into its original
> > > >     state via saving bio iterator(struct bvec_iter) in bio_integrity_prep(),
> > > >     given now bio_rewind_iter() is only used by bio integrity code.
> > > >     ```
> > > > 
> > > > However, it isn't easy to restore bio by saving 32 bytes bio->bi_iter, and saving
> > > > it only can't restore crypto and integrity info.
> > > > 
> > > > Add bio_rewind() back for some use cases which may not be same with
> > > > previous generic case:
> > > > 
> > > > 1) most of bio has fixed end sector since bio split is done from front of the bio,
> > > > if driver just records how many sectors between current bio's start sector and
> > > > the bio's end sector, the original position can be restored
> > > > 
> > > > 2) if one bio's end sector won't change, usually bio_trim() isn't called, user can
> > > > restore original position by storing sectors from current ->bi_iter.bi_sector to
> > > > bio's end sector; together by saving bio size, 8 bytes can restore to
> > > > original bio.
> > > > 
> > > > 3) dm's requeue use case: when BLK_STS_DM_REQUEUE happens, dm core needs to
> > > > restore to the original bio which represents current dm io to be requeued.
> > > > By storing sectors to the bio's end sector and dm io's size,
> > > > bio_rewind() can restore such original bio, then dm core code needn't to
> > > > allocate one bio beforehand just for handling BLK_STS_DM_REQUEUE which
> > > > is actually one unusual event.
> > > > 
> > > > 4) Not like original rewind API, this one needn't to add .bi_done, and no any
> > > > effect on fast path
> > > 
> > > It seems like perhaps the real issue here is that we need a real bio_iter,
> > > separate from bvec_iter, that also encapsulates iterating over integrity &
> > > fscrypt. 
> > 
> > Not mention bio_iter, bvec_iter has been 32 bytes, which is too big to
> > hold in per-io data structure. With this patch, 8bytes is enough
> > to rewind one bio if the end sector is fixed.
> 
> Hold on though, does that check out? Why is that too big for per IO data
> structures?
> 
> By definition these structures are only for IOs in flight, and we don't _want_
> there to ever be very many of these or we're going to run into latency issues
> due to queue depth.

I don't see there is 'queue depth' for bio or bio driver.

32 bytes have been big, and memory footprint is increased too since the data has been
prepared for the future possible rewind. If crypt or integrity is considered, it
can be bigger.



Thanks, 
Ming


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [dm-devel] [PATCH 5.20 1/4] block: add bio_rewind() API
@ 2022-06-28  7:42           ` Ming Lei
  0 siblings, 0 replies; 76+ messages in thread
From: Ming Lei @ 2022-06-28  7:42 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Jens Axboe, Martin K . Petersen, Mike Snitzer, Eric Biggers,
	linux-block, dm-devel, Dmitry Monakhov

On Tue, Jun 28, 2022 at 12:20:16AM -0400, Kent Overstreet wrote:
> On Mon, Jun 27, 2022 at 03:36:22PM +0800, Ming Lei wrote:
> > On Sun, Jun 26, 2022 at 04:14:58PM -0400, Kent Overstreet wrote:
> > > On Fri, Jun 24, 2022 at 10:12:52PM +0800, Ming Lei wrote:
> > > > Commit 7759eb23fd98 ("block: remove bio_rewind_iter()") removes
> > > > the similar API because the following reasons:
> > > > 
> > > >     ```
> > > >     It is pointed that bio_rewind_iter() is one very bad API[1]:
> > > > 
> > > >     1) bio size may not be restored after rewinding
> > > > 
> > > >     2) it causes some bogus change, such as 5151842b9d8732 (block: reset
> > > >     bi_iter.bi_done after splitting bio)
> > > > 
> > > >     3) rewinding really makes things complicated wrt. bio splitting
> > > > 
> > > >     4) unnecessary updating of .bi_done in fast path
> > > > 
> > > >     [1] https://marc.info/?t=153549924200005&r=1&w=2
> > > > 
> > > >     So this patch takes Kent's suggestion to restore one bio into its original
> > > >     state via saving bio iterator(struct bvec_iter) in bio_integrity_prep(),
> > > >     given now bio_rewind_iter() is only used by bio integrity code.
> > > >     ```
> > > > 
> > > > However, it isn't easy to restore bio by saving 32 bytes bio->bi_iter, and saving
> > > > it only can't restore crypto and integrity info.
> > > > 
> > > > Add bio_rewind() back for some use cases which may not be same with
> > > > previous generic case:
> > > > 
> > > > 1) most of bio has fixed end sector since bio split is done from front of the bio,
> > > > if driver just records how many sectors between current bio's start sector and
> > > > the bio's end sector, the original position can be restored
> > > > 
> > > > 2) if one bio's end sector won't change, usually bio_trim() isn't called, user can
> > > > restore original position by storing sectors from current ->bi_iter.bi_sector to
> > > > bio's end sector; together by saving bio size, 8 bytes can restore to
> > > > original bio.
> > > > 
> > > > 3) dm's requeue use case: when BLK_STS_DM_REQUEUE happens, dm core needs to
> > > > restore to the original bio which represents current dm io to be requeued.
> > > > By storing sectors to the bio's end sector and dm io's size,
> > > > bio_rewind() can restore such original bio, then dm core code needn't to
> > > > allocate one bio beforehand just for handling BLK_STS_DM_REQUEUE which
> > > > is actually one unusual event.
> > > > 
> > > > 4) Not like original rewind API, this one needn't to add .bi_done, and no any
> > > > effect on fast path
> > > 
> > > It seems like perhaps the real issue here is that we need a real bio_iter,
> > > separate from bvec_iter, that also encapsulates iterating over integrity &
> > > fscrypt. 
> > 
> > Not mention bio_iter, bvec_iter has been 32 bytes, which is too big to
> > hold in per-io data structure. With this patch, 8bytes is enough
> > to rewind one bio if the end sector is fixed.
> 
> Hold on though, does that check out? Why is that too big for per IO data
> structures?
> 
> By definition these structures are only for IOs in flight, and we don't _want_
> there to ever be very many of these or we're going to run into latency issues
> due to queue depth.

I don't see there is 'queue depth' for bio or bio driver.

32 bytes have been big, and memory footprint is increased too since the data has been
prepared for the future possible rewind. If crypt or integrity is considered, it
can be bigger.



Thanks, 
Ming
--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 5.20 1/4] block: add bio_rewind() API
  2022-06-28  4:26         ` [dm-devel] " Kent Overstreet
@ 2022-06-28  7:49           ` Ming Lei
  -1 siblings, 0 replies; 76+ messages in thread
From: Ming Lei @ 2022-06-28  7:49 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Jens Axboe, Mike Snitzer, linux-block, dm-devel, Eric Biggers,
	Dmitry Monakhov, Martin K . Petersen

On Tue, Jun 28, 2022 at 12:26:10AM -0400, Kent Overstreet wrote:
> On Mon, Jun 27, 2022 at 03:36:22PM +0800, Ming Lei wrote:
> > Not mention bio_iter, bvec_iter has been 32 bytes, which is too big to
> > hold in per-io data structure. With this patch, 8bytes is enough
> > to rewind one bio if the end sector is fixed.
> 
> And with rewind, you're making an assumption about the state the iterator is
> going to be in when the IO has completed.
> 
> What if the iterator was never advanced?

bio_rewind() works as expected if the iterator doesn't advance, since bytes
between the recorded position and the end position isn't changed, same
with the end position.

> 
> So say you check for that by saving some other part of the iterator - but that
> may have legitimately changed too, if the bio was redirected (bi_sector changes)
> or trimmed (bi_size changes)
> 
> I still think this is an inherently buggy interface, the way it's being proposed
> to be used.

The patch did mention that the interface should be for situation in which end
sector of bio won't change.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [dm-devel] [PATCH 5.20 1/4] block: add bio_rewind() API
@ 2022-06-28  7:49           ` Ming Lei
  0 siblings, 0 replies; 76+ messages in thread
From: Ming Lei @ 2022-06-28  7:49 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Jens Axboe, Martin K . Petersen, Mike Snitzer, Eric Biggers,
	linux-block, dm-devel, Dmitry Monakhov

On Tue, Jun 28, 2022 at 12:26:10AM -0400, Kent Overstreet wrote:
> On Mon, Jun 27, 2022 at 03:36:22PM +0800, Ming Lei wrote:
> > Not mention bio_iter, bvec_iter has been 32 bytes, which is too big to
> > hold in per-io data structure. With this patch, 8bytes is enough
> > to rewind one bio if the end sector is fixed.
> 
> And with rewind, you're making an assumption about the state the iterator is
> going to be in when the IO has completed.
> 
> What if the iterator was never advanced?

bio_rewind() works as expected if the iterator doesn't advance, since bytes
between the recorded position and the end position isn't changed, same
with the end position.

> 
> So say you check for that by saving some other part of the iterator - but that
> may have legitimately changed too, if the bio was redirected (bi_sector changes)
> or trimmed (bi_size changes)
> 
> I still think this is an inherently buggy interface, the way it's being proposed
> to be used.

The patch did mention that the interface should be for situation in which end
sector of bio won't change.


Thanks,
Ming
--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 5.20 1/4] block: add bio_rewind() API
  2022-06-28  7:42           ` [dm-devel] " Ming Lei
@ 2022-06-28 16:16             ` Kent Overstreet
  -1 siblings, 0 replies; 76+ messages in thread
From: Kent Overstreet @ 2022-06-28 16:16 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, Mike Snitzer, linux-block, dm-devel, Eric Biggers,
	Dmitry Monakhov, Martin K . Petersen

On Tue, Jun 28, 2022 at 03:42:52PM +0800, Ming Lei wrote:
> I don't see there is 'queue depth' for bio or bio driver.
> 
> 32 bytes have been big, and memory footprint is increased too since the data has been
> prepared for the future possible rewind. If crypt or integrity is considered, it
> can be bigger.

Put some thought into this, Ming. I think you've absorbed some "received
wisdom", some of the implicit assumptions everyone seems to be making without
ever doing any math to check if they make sense.

There may not be a "queue depth" for all drivers, but the same latency
considerations apply - if you have an unbounded number of IOs in flight, you're
going to be having other, much more significant issues.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [dm-devel] [PATCH 5.20 1/4] block: add bio_rewind() API
@ 2022-06-28 16:16             ` Kent Overstreet
  0 siblings, 0 replies; 76+ messages in thread
From: Kent Overstreet @ 2022-06-28 16:16 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, Martin K . Petersen, Mike Snitzer, Eric Biggers,
	linux-block, dm-devel, Dmitry Monakhov

On Tue, Jun 28, 2022 at 03:42:52PM +0800, Ming Lei wrote:
> I don't see there is 'queue depth' for bio or bio driver.
> 
> 32 bytes have been big, and memory footprint is increased too since the data has been
> prepared for the future possible rewind. If crypt or integrity is considered, it
> can be bigger.

Put some thought into this, Ming. I think you've absorbed some "received
wisdom", some of the implicit assumptions everyone seems to be making without
ever doing any math to check if they make sense.

There may not be a "queue depth" for all drivers, but the same latency
considerations apply - if you have an unbounded number of IOs in flight, you're
going to be having other, much more significant issues.

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 5.20 1/4] block: add bio_rewind() API
  2022-06-28  7:49           ` [dm-devel] " Ming Lei
@ 2022-06-28 16:36             ` Kent Overstreet
  -1 siblings, 0 replies; 76+ messages in thread
From: Kent Overstreet @ 2022-06-28 16:36 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, Mike Snitzer, linux-block, dm-devel, Eric Biggers,
	Dmitry Monakhov, Martin K . Petersen

On Tue, Jun 28, 2022 at 03:49:28PM +0800, Ming Lei wrote:
> On Tue, Jun 28, 2022 at 12:26:10AM -0400, Kent Overstreet wrote:
> > On Mon, Jun 27, 2022 at 03:36:22PM +0800, Ming Lei wrote:
> > > Not mention bio_iter, bvec_iter has been 32 bytes, which is too big to
> > > hold in per-io data structure. With this patch, 8bytes is enough
> > > to rewind one bio if the end sector is fixed.
> > 
> > And with rewind, you're making an assumption about the state the iterator is
> > going to be in when the IO has completed.
> > 
> > What if the iterator was never advanced?
> 
> bio_rewind() works as expected if the iterator doesn't advance, since bytes
> between the recorded position and the end position isn't changed, same
> with the end position.
> 
> > 
> > So say you check for that by saving some other part of the iterator - but that
> > may have legitimately changed too, if the bio was redirected (bi_sector changes)
> > or trimmed (bi_size changes)
> > 
> > I still think this is an inherently buggy interface, the way it's being proposed
> > to be used.
> 
> The patch did mention that the interface should be for situation in which end
> sector of bio won't change.

But that's an assumption that you simply can't make!

We allow block device drivers to be stacked in _any_ combination. After a bio is
completed it may have been partially advanced, fully advanced, trimmed, not
trimmed, anything - and bi_sector and thus also bio_end_sector() may have
changed, and will have if there's partition tables involved.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [dm-devel] [PATCH 5.20 1/4] block: add bio_rewind() API
@ 2022-06-28 16:36             ` Kent Overstreet
  0 siblings, 0 replies; 76+ messages in thread
From: Kent Overstreet @ 2022-06-28 16:36 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, Martin K . Petersen, Mike Snitzer, Eric Biggers,
	linux-block, dm-devel, Dmitry Monakhov

On Tue, Jun 28, 2022 at 03:49:28PM +0800, Ming Lei wrote:
> On Tue, Jun 28, 2022 at 12:26:10AM -0400, Kent Overstreet wrote:
> > On Mon, Jun 27, 2022 at 03:36:22PM +0800, Ming Lei wrote:
> > > Not mention bio_iter, bvec_iter has been 32 bytes, which is too big to
> > > hold in per-io data structure. With this patch, 8bytes is enough
> > > to rewind one bio if the end sector is fixed.
> > 
> > And with rewind, you're making an assumption about the state the iterator is
> > going to be in when the IO has completed.
> > 
> > What if the iterator was never advanced?
> 
> bio_rewind() works as expected if the iterator doesn't advance, since bytes
> between the recorded position and the end position isn't changed, same
> with the end position.
> 
> > 
> > So say you check for that by saving some other part of the iterator - but that
> > may have legitimately changed too, if the bio was redirected (bi_sector changes)
> > or trimmed (bi_size changes)
> > 
> > I still think this is an inherently buggy interface, the way it's being proposed
> > to be used.
> 
> The patch did mention that the interface should be for situation in which end
> sector of bio won't change.

But that's an assumption that you simply can't make!

We allow block device drivers to be stacked in _any_ combination. After a bio is
completed it may have been partially advanced, fully advanced, trimmed, not
trimmed, anything - and bi_sector and thus also bio_end_sector() may have
changed, and will have if there's partition tables involved.

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 5.20 1/4] block: add bio_rewind() API
  2022-06-28 16:36             ` [dm-devel] " Kent Overstreet
@ 2022-06-28 17:41               ` Mike Snitzer
  -1 siblings, 0 replies; 76+ messages in thread
From: Mike Snitzer @ 2022-06-28 17:41 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Ming Lei, Jens Axboe, linux-block, dm-devel, Eric Biggers,
	Dmitry Monakhov, Martin K . Petersen

On Tue, Jun 28 2022 at 12:36P -0400,
Kent Overstreet <kent.overstreet@gmail.com> wrote:

> On Tue, Jun 28, 2022 at 03:49:28PM +0800, Ming Lei wrote:
> > On Tue, Jun 28, 2022 at 12:26:10AM -0400, Kent Overstreet wrote:
> > > On Mon, Jun 27, 2022 at 03:36:22PM +0800, Ming Lei wrote:
> > > > Not mention bio_iter, bvec_iter has been 32 bytes, which is too big to
> > > > hold in per-io data structure. With this patch, 8bytes is enough
> > > > to rewind one bio if the end sector is fixed.
> > > 
> > > And with rewind, you're making an assumption about the state the iterator is
> > > going to be in when the IO has completed.
> > > 
> > > What if the iterator was never advanced?
> > 
> > bio_rewind() works as expected if the iterator doesn't advance, since bytes
> > between the recorded position and the end position isn't changed, same
> > with the end position.
> > 
> > > 
> > > So say you check for that by saving some other part of the iterator - but that
> > > may have legitimately changed too, if the bio was redirected (bi_sector changes)
> > > or trimmed (bi_size changes)
> > > 
> > > I still think this is an inherently buggy interface, the way it's being proposed
> > > to be used.
> > 
> > The patch did mention that the interface should be for situation in which end
> > sector of bio won't change.
> 
> But that's an assumption that you simply can't make!

Why not?  There is clear use-case for this API, as demonstrated in the
patchset: DM can/will make use of it for the purposes of enhancing its
more unlikely bio requeuing work (that is needed to make the more
likely bio splitting scenario more efficient).

> We allow block device drivers to be stacked in _any_ combination. After a bio is
> completed it may have been partially advanced, fully advanced, trimmed, not
> trimmed, anything - and bi_sector and thus also bio_end_sector() may have
> changed, and will have if there's partition tables involved.

We don't _need_ this API to cure cancer for all hypothetical block
drivers.

If consumers of the API follow the rule that end sector of the
_original bio_ isn't changed then it is all fine.  It is that simple.

Stacked drivers will work just fine.  The lower layers will be
modifying their bios as needed. Because for DM those bios happen to
be clones, so there is isolation to the broader design flaw you are
trying to say is a major problem. As this patchset demonstrates.

I do concede that policing who can use an API is hard.  But if some
consumer of an API tries something that invalidates rules of the API
they get to keep the N pieces when it breaks.

Mike


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [dm-devel] [PATCH 5.20 1/4] block: add bio_rewind() API
@ 2022-06-28 17:41               ` Mike Snitzer
  0 siblings, 0 replies; 76+ messages in thread
From: Mike Snitzer @ 2022-06-28 17:41 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Jens Axboe, Martin K . Petersen, Eric Biggers, Ming Lei,
	linux-block, dm-devel, Dmitry Monakhov

On Tue, Jun 28 2022 at 12:36P -0400,
Kent Overstreet <kent.overstreet@gmail.com> wrote:

> On Tue, Jun 28, 2022 at 03:49:28PM +0800, Ming Lei wrote:
> > On Tue, Jun 28, 2022 at 12:26:10AM -0400, Kent Overstreet wrote:
> > > On Mon, Jun 27, 2022 at 03:36:22PM +0800, Ming Lei wrote:
> > > > Not mention bio_iter, bvec_iter has been 32 bytes, which is too big to
> > > > hold in per-io data structure. With this patch, 8bytes is enough
> > > > to rewind one bio if the end sector is fixed.
> > > 
> > > And with rewind, you're making an assumption about the state the iterator is
> > > going to be in when the IO has completed.
> > > 
> > > What if the iterator was never advanced?
> > 
> > bio_rewind() works as expected if the iterator doesn't advance, since bytes
> > between the recorded position and the end position isn't changed, same
> > with the end position.
> > 
> > > 
> > > So say you check for that by saving some other part of the iterator - but that
> > > may have legitimately changed too, if the bio was redirected (bi_sector changes)
> > > or trimmed (bi_size changes)
> > > 
> > > I still think this is an inherently buggy interface, the way it's being proposed
> > > to be used.
> > 
> > The patch did mention that the interface should be for situation in which end
> > sector of bio won't change.
> 
> But that's an assumption that you simply can't make!

Why not?  There is clear use-case for this API, as demonstrated in the
patchset: DM can/will make use of it for the purposes of enhancing its
more unlikely bio requeuing work (that is needed to make the more
likely bio splitting scenario more efficient).

> We allow block device drivers to be stacked in _any_ combination. After a bio is
> completed it may have been partially advanced, fully advanced, trimmed, not
> trimmed, anything - and bi_sector and thus also bio_end_sector() may have
> changed, and will have if there's partition tables involved.

We don't _need_ this API to cure cancer for all hypothetical block
drivers.

If consumers of the API follow the rule that end sector of the
_original bio_ isn't changed then it is all fine.  It is that simple.

Stacked drivers will work just fine.  The lower layers will be
modifying their bios as needed. Because for DM those bios happen to
be clones, so there is isolation to the broader design flaw you are
trying to say is a major problem. As this patchset demonstrates.

I do concede that policing who can use an API is hard.  But if some
consumer of an API tries something that invalidates rules of the API
they get to keep the N pieces when it breaks.

Mike

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 5.20 1/4] block: add bio_rewind() API
  2022-06-28 17:41               ` [dm-devel] " Mike Snitzer
@ 2022-06-28 17:52                 ` Kent Overstreet
  -1 siblings, 0 replies; 76+ messages in thread
From: Kent Overstreet @ 2022-06-28 17:52 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Ming Lei, Jens Axboe, linux-block, dm-devel, Eric Biggers,
	Dmitry Monakhov, Martin K . Petersen

On Tue, Jun 28, 2022 at 01:41:12PM -0400, Mike Snitzer wrote:
> On Tue, Jun 28 2022 at 12:36P -0400,
> Kent Overstreet <kent.overstreet@gmail.com> wrote:
> 
> > On Tue, Jun 28, 2022 at 03:49:28PM +0800, Ming Lei wrote:
> > > On Tue, Jun 28, 2022 at 12:26:10AM -0400, Kent Overstreet wrote:
> > > > On Mon, Jun 27, 2022 at 03:36:22PM +0800, Ming Lei wrote:
> > > > > Not mention bio_iter, bvec_iter has been 32 bytes, which is too big to
> > > > > hold in per-io data structure. With this patch, 8bytes is enough
> > > > > to rewind one bio if the end sector is fixed.
> > > > 
> > > > And with rewind, you're making an assumption about the state the iterator is
> > > > going to be in when the IO has completed.
> > > > 
> > > > What if the iterator was never advanced?
> > > 
> > > bio_rewind() works as expected if the iterator doesn't advance, since bytes
> > > between the recorded position and the end position isn't changed, same
> > > with the end position.
> > > 
> > > > 
> > > > So say you check for that by saving some other part of the iterator - but that
> > > > may have legitimately changed too, if the bio was redirected (bi_sector changes)
> > > > or trimmed (bi_size changes)
> > > > 
> > > > I still think this is an inherently buggy interface, the way it's being proposed
> > > > to be used.
> > > 
> > > The patch did mention that the interface should be for situation in which end
> > > sector of bio won't change.
> > 
> > But that's an assumption that you simply can't make!
> 
> Why not?  There is clear use-case for this API, as demonstrated in the
> patchset: DM can/will make use of it for the purposes of enhancing its
> more unlikely bio requeuing work (that is needed to make the more
> likely bio splitting scenario more efficient).
> 
> > We allow block device drivers to be stacked in _any_ combination. After a bio is
> > completed it may have been partially advanced, fully advanced, trimmed, not
> > trimmed, anything - and bi_sector and thus also bio_end_sector() may have
> > changed, and will have if there's partition tables involved.
> 
> We don't _need_ this API to cure cancer for all hypothetical block
> drivers.
> 
> If consumers of the API follow the rule that end sector of the
> _original bio_ isn't changed then it is all fine.  It is that simple.
> 
> Stacked drivers will work just fine.  The lower layers will be
> modifying their bios as needed. Because for DM those bios happen to
> be clones, so there is isolation to the broader design flaw you are
> trying to say is a major problem. As this patchset demonstrates.
> 
> I do concede that policing who can use an API is hard.  But if some
> consumer of an API tries something that invalidates rules of the API
> they get to keep the N pieces when it breaks.

Mike, keep in mind that when bio_rewind() was originally introduced, it
immediately grew users that were _inherently buggy_ (of the "users can break
this trivially") variety, and the whole thing had to be reverted, and I was
really annoyed - mostly at myself, because I would have caught it if I'd been
paying attention to the mailing list more.

And I _guarantee_ you that if this makes it in, we'll have the same thing
happening all over again - we have a lot of different block drivers being
written by a lot of different people, and most of them do not understand all the
subtleties of the block layer and the ways different things can interact, and so
the onus is on us to not add tools that they aren't going to immediately turn
around and slice themselves with.

The 32 bytes you're trying to save is meaningless. Think instead about the weeks
of engineer time that get wasted by bugs like this - chasing the bugs,
babysitting the patches in to fix it, then the _endless_ -stable backports.

_Please_ try to think more defensively when you're writing code.

So, and I'm sorry I have to be the killjoy here, but hard NACK on this patchset.
Hard, hard NACK.

I'll be happy to assist in coming up with alternate, less dangerous solutions
though (and I think introducing a real bio_iter is overdue, so that's probably
the first thing we should look at).

Cheers

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [dm-devel] [PATCH 5.20 1/4] block: add bio_rewind() API
@ 2022-06-28 17:52                 ` Kent Overstreet
  0 siblings, 0 replies; 76+ messages in thread
From: Kent Overstreet @ 2022-06-28 17:52 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Jens Axboe, Martin K . Petersen, Eric Biggers, Ming Lei,
	linux-block, dm-devel, Dmitry Monakhov

On Tue, Jun 28, 2022 at 01:41:12PM -0400, Mike Snitzer wrote:
> On Tue, Jun 28 2022 at 12:36P -0400,
> Kent Overstreet <kent.overstreet@gmail.com> wrote:
> 
> > On Tue, Jun 28, 2022 at 03:49:28PM +0800, Ming Lei wrote:
> > > On Tue, Jun 28, 2022 at 12:26:10AM -0400, Kent Overstreet wrote:
> > > > On Mon, Jun 27, 2022 at 03:36:22PM +0800, Ming Lei wrote:
> > > > > Not mention bio_iter, bvec_iter has been 32 bytes, which is too big to
> > > > > hold in per-io data structure. With this patch, 8bytes is enough
> > > > > to rewind one bio if the end sector is fixed.
> > > > 
> > > > And with rewind, you're making an assumption about the state the iterator is
> > > > going to be in when the IO has completed.
> > > > 
> > > > What if the iterator was never advanced?
> > > 
> > > bio_rewind() works as expected if the iterator doesn't advance, since bytes
> > > between the recorded position and the end position isn't changed, same
> > > with the end position.
> > > 
> > > > 
> > > > So say you check for that by saving some other part of the iterator - but that
> > > > may have legitimately changed too, if the bio was redirected (bi_sector changes)
> > > > or trimmed (bi_size changes)
> > > > 
> > > > I still think this is an inherently buggy interface, the way it's being proposed
> > > > to be used.
> > > 
> > > The patch did mention that the interface should be for situation in which end
> > > sector of bio won't change.
> > 
> > But that's an assumption that you simply can't make!
> 
> Why not?  There is clear use-case for this API, as demonstrated in the
> patchset: DM can/will make use of it for the purposes of enhancing its
> more unlikely bio requeuing work (that is needed to make the more
> likely bio splitting scenario more efficient).
> 
> > We allow block device drivers to be stacked in _any_ combination. After a bio is
> > completed it may have been partially advanced, fully advanced, trimmed, not
> > trimmed, anything - and bi_sector and thus also bio_end_sector() may have
> > changed, and will have if there's partition tables involved.
> 
> We don't _need_ this API to cure cancer for all hypothetical block
> drivers.
> 
> If consumers of the API follow the rule that end sector of the
> _original bio_ isn't changed then it is all fine.  It is that simple.
> 
> Stacked drivers will work just fine.  The lower layers will be
> modifying their bios as needed. Because for DM those bios happen to
> be clones, so there is isolation to the broader design flaw you are
> trying to say is a major problem. As this patchset demonstrates.
> 
> I do concede that policing who can use an API is hard.  But if some
> consumer of an API tries something that invalidates rules of the API
> they get to keep the N pieces when it breaks.

Mike, keep in mind that when bio_rewind() was originally introduced, it
immediately grew users that were _inherently buggy_ (of the "users can break
this trivially") variety, and the whole thing had to be reverted, and I was
really annoyed - mostly at myself, because I would have caught it if I'd been
paying attention to the mailing list more.

And I _guarantee_ you that if this makes it in, we'll have the same thing
happening all over again - we have a lot of different block drivers being
written by a lot of different people, and most of them do not understand all the
subtleties of the block layer and the ways different things can interact, and so
the onus is on us to not add tools that they aren't going to immediately turn
around and slice themselves with.

The 32 bytes you're trying to save is meaningless. Think instead about the weeks
of engineer time that get wasted by bugs like this - chasing the bugs,
babysitting the patches in to fix it, then the _endless_ -stable backports.

_Please_ try to think more defensively when you're writing code.

So, and I'm sorry I have to be the killjoy here, but hard NACK on this patchset.
Hard, hard NACK.

I'll be happy to assist in coming up with alternate, less dangerous solutions
though (and I think introducing a real bio_iter is overdue, so that's probably
the first thing we should look at).

Cheers

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 5.20 1/4] block: add bio_rewind() API
  2022-06-28  4:20         ` [dm-devel] " Kent Overstreet
@ 2022-06-28 18:13           ` Jens Axboe
  -1 siblings, 0 replies; 76+ messages in thread
From: Jens Axboe @ 2022-06-28 18:13 UTC (permalink / raw)
  To: Kent Overstreet, Ming Lei
  Cc: Mike Snitzer, linux-block, dm-devel, Eric Biggers,
	Dmitry Monakhov, Martin K . Petersen

On 6/27/22 10:20 PM, Kent Overstreet wrote:
> On Mon, Jun 27, 2022 at 03:36:22PM +0800, Ming Lei wrote:
>> On Sun, Jun 26, 2022 at 04:14:58PM -0400, Kent Overstreet wrote:
>>> On Fri, Jun 24, 2022 at 10:12:52PM +0800, Ming Lei wrote:
>>>> Commit 7759eb23fd98 ("block: remove bio_rewind_iter()") removes
>>>> the similar API because the following reasons:
>>>>
>>>>     ```
>>>>     It is pointed that bio_rewind_iter() is one very bad API[1]:
>>>>
>>>>     1) bio size may not be restored after rewinding
>>>>
>>>>     2) it causes some bogus change, such as 5151842b9d8732 (block: reset
>>>>     bi_iter.bi_done after splitting bio)
>>>>
>>>>     3) rewinding really makes things complicated wrt. bio splitting
>>>>
>>>>     4) unnecessary updating of .bi_done in fast path
>>>>
>>>>     [1] https://marc.info/?t=153549924200005&r=1&w=2
>>>>
>>>>     So this patch takes Kent's suggestion to restore one bio into its original
>>>>     state via saving bio iterator(struct bvec_iter) in bio_integrity_prep(),
>>>>     given now bio_rewind_iter() is only used by bio integrity code.
>>>>     ```
>>>>
>>>> However, it isn't easy to restore bio by saving 32 bytes bio->bi_iter, and saving
>>>> it only can't restore crypto and integrity info.
>>>>
>>>> Add bio_rewind() back for some use cases which may not be same with
>>>> previous generic case:
>>>>
>>>> 1) most of bio has fixed end sector since bio split is done from front of the bio,
>>>> if driver just records how many sectors between current bio's start sector and
>>>> the bio's end sector, the original position can be restored
>>>>
>>>> 2) if one bio's end sector won't change, usually bio_trim() isn't called, user can
>>>> restore original position by storing sectors from current ->bi_iter.bi_sector to
>>>> bio's end sector; together by saving bio size, 8 bytes can restore to
>>>> original bio.
>>>>
>>>> 3) dm's requeue use case: when BLK_STS_DM_REQUEUE happens, dm core needs to
>>>> restore to the original bio which represents current dm io to be requeued.
>>>> By storing sectors to the bio's end sector and dm io's size,
>>>> bio_rewind() can restore such original bio, then dm core code needn't to
>>>> allocate one bio beforehand just for handling BLK_STS_DM_REQUEUE which
>>>> is actually one unusual event.
>>>>
>>>> 4) Not like original rewind API, this one needn't to add .bi_done, and no any
>>>> effect on fast path
>>>
>>> It seems like perhaps the real issue here is that we need a real bio_iter,
>>> separate from bvec_iter, that also encapsulates iterating over integrity &
>>> fscrypt. 
>>
>> Not mention bio_iter, bvec_iter has been 32 bytes, which is too big to
>> hold in per-io data structure. With this patch, 8bytes is enough
>> to rewind one bio if the end sector is fixed.
> 
> Hold on though, does that check out? Why is that too big for per IO data
> structures?
> 
> By definition these structures are only for IOs in flight, and we don't _want_
> there to ever be very many of these or we're going to run into latency issues
> due to queue depth.

It's much less about using whatever amount of memory for inflight IO,
and much more about not bloating fast path structures (of which the bio
is certainly one). All of this gunk has to be initialized for each IO,
and that's the real issue.

Just look at the recent work for iov_iter and why ITER_UBUF makes sense
to do.

This is not a commentary on this patchset, just a general observation.
Sizes of hot data structures DO matter, and quite a bit so.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [dm-devel] [PATCH 5.20 1/4] block: add bio_rewind() API
@ 2022-06-28 18:13           ` Jens Axboe
  0 siblings, 0 replies; 76+ messages in thread
From: Jens Axboe @ 2022-06-28 18:13 UTC (permalink / raw)
  To: Kent Overstreet, Ming Lei
  Cc: Mike Snitzer, Martin K . Petersen, Eric Biggers, linux-block,
	dm-devel, Dmitry Monakhov

On 6/27/22 10:20 PM, Kent Overstreet wrote:
> On Mon, Jun 27, 2022 at 03:36:22PM +0800, Ming Lei wrote:
>> On Sun, Jun 26, 2022 at 04:14:58PM -0400, Kent Overstreet wrote:
>>> On Fri, Jun 24, 2022 at 10:12:52PM +0800, Ming Lei wrote:
>>>> Commit 7759eb23fd98 ("block: remove bio_rewind_iter()") removes
>>>> the similar API because the following reasons:
>>>>
>>>>     ```
>>>>     It is pointed that bio_rewind_iter() is one very bad API[1]:
>>>>
>>>>     1) bio size may not be restored after rewinding
>>>>
>>>>     2) it causes some bogus change, such as 5151842b9d8732 (block: reset
>>>>     bi_iter.bi_done after splitting bio)
>>>>
>>>>     3) rewinding really makes things complicated wrt. bio splitting
>>>>
>>>>     4) unnecessary updating of .bi_done in fast path
>>>>
>>>>     [1] https://marc.info/?t=153549924200005&r=1&w=2
>>>>
>>>>     So this patch takes Kent's suggestion to restore one bio into its original
>>>>     state via saving bio iterator(struct bvec_iter) in bio_integrity_prep(),
>>>>     given now bio_rewind_iter() is only used by bio integrity code.
>>>>     ```
>>>>
>>>> However, it isn't easy to restore bio by saving 32 bytes bio->bi_iter, and saving
>>>> it only can't restore crypto and integrity info.
>>>>
>>>> Add bio_rewind() back for some use cases which may not be same with
>>>> previous generic case:
>>>>
>>>> 1) most of bio has fixed end sector since bio split is done from front of the bio,
>>>> if driver just records how many sectors between current bio's start sector and
>>>> the bio's end sector, the original position can be restored
>>>>
>>>> 2) if one bio's end sector won't change, usually bio_trim() isn't called, user can
>>>> restore original position by storing sectors from current ->bi_iter.bi_sector to
>>>> bio's end sector; together by saving bio size, 8 bytes can restore to
>>>> original bio.
>>>>
>>>> 3) dm's requeue use case: when BLK_STS_DM_REQUEUE happens, dm core needs to
>>>> restore to the original bio which represents current dm io to be requeued.
>>>> By storing sectors to the bio's end sector and dm io's size,
>>>> bio_rewind() can restore such original bio, then dm core code needn't to
>>>> allocate one bio beforehand just for handling BLK_STS_DM_REQUEUE which
>>>> is actually one unusual event.
>>>>
>>>> 4) Not like original rewind API, this one needn't to add .bi_done, and no any
>>>> effect on fast path
>>>
>>> It seems like perhaps the real issue here is that we need a real bio_iter,
>>> separate from bvec_iter, that also encapsulates iterating over integrity &
>>> fscrypt. 
>>
>> Not mention bio_iter, bvec_iter has been 32 bytes, which is too big to
>> hold in per-io data structure. With this patch, 8bytes is enough
>> to rewind one bio if the end sector is fixed.
> 
> Hold on though, does that check out? Why is that too big for per IO data
> structures?
> 
> By definition these structures are only for IOs in flight, and we don't _want_
> there to ever be very many of these or we're going to run into latency issues
> due to queue depth.

It's much less about using whatever amount of memory for inflight IO,
and much more about not bloating fast path structures (of which the bio
is certainly one). All of this gunk has to be initialized for each IO,
and that's the real issue.

Just look at the recent work for iov_iter and why ITER_UBUF makes sense
to do.

This is not a commentary on this patchset, just a general observation.
Sizes of hot data structures DO matter, and quite a bit so.

-- 
Jens Axboe

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 5.20 1/4] block: add bio_rewind() API
  2022-06-28 18:13           ` [dm-devel] " Jens Axboe
@ 2022-06-28 18:32             ` Kent Overstreet
  -1 siblings, 0 replies; 76+ messages in thread
From: Kent Overstreet @ 2022-06-28 18:32 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Ming Lei, Mike Snitzer, linux-block, dm-devel, Eric Biggers,
	Dmitry Monakhov, Martin K . Petersen

On Tue, Jun 28, 2022 at 12:13:06PM -0600, Jens Axboe wrote:
> It's much less about using whatever amount of memory for inflight IO,
> and much more about not bloating fast path structures (of which the bio
> is certainly one). All of this gunk has to be initialized for each IO,
> and that's the real issue.
> 
> Just look at the recent work for iov_iter and why ITER_UBUF makes sense
> to do.
> 
> This is not a commentary on this patchset, just a general observation.
> Sizes of hot data structures DO matter, and quite a bit so.

Younger me would have definitely been in agreement; initializing these structs
definitely tends to show up in profiles.

These days I'm somewhat less inclined towards that view - profiles naturally
highlight where your cache misses are happening, and initializing a freshly
allocated data structure is always going to be a cache miss. But the difference
between touching 3 and 6 contiguous cachelines is practically nil...  assuming
we aren't doing anything stupid like using memset (despite what Linus wants from
the CPU vendors, rep stos _still_ sucks) and perhaps inserting prefetches where
appropriate.

And I see work going by that makes me really wonder if it was justified - in
particular I _really_ want to know if Christoph's bio initialization change was
justified by actual benchmarks and what those numbers were, vs. looking at
profiles. Wasn't anything in the commit log...

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [dm-devel] [PATCH 5.20 1/4] block: add bio_rewind() API
@ 2022-06-28 18:32             ` Kent Overstreet
  0 siblings, 0 replies; 76+ messages in thread
From: Kent Overstreet @ 2022-06-28 18:32 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Martin K . Petersen, Mike Snitzer, Eric Biggers, Ming Lei,
	linux-block, dm-devel, Dmitry Monakhov

On Tue, Jun 28, 2022 at 12:13:06PM -0600, Jens Axboe wrote:
> It's much less about using whatever amount of memory for inflight IO,
> and much more about not bloating fast path structures (of which the bio
> is certainly one). All of this gunk has to be initialized for each IO,
> and that's the real issue.
> 
> Just look at the recent work for iov_iter and why ITER_UBUF makes sense
> to do.
> 
> This is not a commentary on this patchset, just a general observation.
> Sizes of hot data structures DO matter, and quite a bit so.

Younger me would have definitely been in agreement; initializing these structs
definitely tends to show up in profiles.

These days I'm somewhat less inclined towards that view - profiles naturally
highlight where your cache misses are happening, and initializing a freshly
allocated data structure is always going to be a cache miss. But the difference
between touching 3 and 6 contiguous cachelines is practically nil...  assuming
we aren't doing anything stupid like using memset (despite what Linus wants from
the CPU vendors, rep stos _still_ sucks) and perhaps inserting prefetches where
appropriate.

And I see work going by that makes me really wonder if it was justified - in
particular I _really_ want to know if Christoph's bio initialization change was
justified by actual benchmarks and what those numbers were, vs. looking at
profiles. Wasn't anything in the commit log...

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 5.20 1/4] block: add bio_rewind() API
  2022-06-28 18:13           ` [dm-devel] " Jens Axboe
@ 2022-06-29  0:49             ` Ming Lei
  -1 siblings, 0 replies; 76+ messages in thread
From: Ming Lei @ 2022-06-29  0:49 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Kent Overstreet, Mike Snitzer, linux-block, dm-devel,
	Eric Biggers, Dmitry Monakhov, Martin K . Petersen

On Tue, Jun 28, 2022 at 12:13:06PM -0600, Jens Axboe wrote:
> On 6/27/22 10:20 PM, Kent Overstreet wrote:
> > On Mon, Jun 27, 2022 at 03:36:22PM +0800, Ming Lei wrote:
> >> On Sun, Jun 26, 2022 at 04:14:58PM -0400, Kent Overstreet wrote:
> >>> On Fri, Jun 24, 2022 at 10:12:52PM +0800, Ming Lei wrote:
> >>>> Commit 7759eb23fd98 ("block: remove bio_rewind_iter()") removes
> >>>> the similar API because the following reasons:
> >>>>
> >>>>     ```
> >>>>     It is pointed that bio_rewind_iter() is one very bad API[1]:
> >>>>
> >>>>     1) bio size may not be restored after rewinding
> >>>>
> >>>>     2) it causes some bogus change, such as 5151842b9d8732 (block: reset
> >>>>     bi_iter.bi_done after splitting bio)
> >>>>
> >>>>     3) rewinding really makes things complicated wrt. bio splitting
> >>>>
> >>>>     4) unnecessary updating of .bi_done in fast path
> >>>>
> >>>>     [1] https://marc.info/?t=153549924200005&r=1&w=2
> >>>>
> >>>>     So this patch takes Kent's suggestion to restore one bio into its original
> >>>>     state via saving bio iterator(struct bvec_iter) in bio_integrity_prep(),
> >>>>     given now bio_rewind_iter() is only used by bio integrity code.
> >>>>     ```
> >>>>
> >>>> However, it isn't easy to restore bio by saving 32 bytes bio->bi_iter, and saving
> >>>> it only can't restore crypto and integrity info.
> >>>>
> >>>> Add bio_rewind() back for some use cases which may not be same with
> >>>> previous generic case:
> >>>>
> >>>> 1) most of bio has fixed end sector since bio split is done from front of the bio,
> >>>> if driver just records how many sectors between current bio's start sector and
> >>>> the bio's end sector, the original position can be restored
> >>>>
> >>>> 2) if one bio's end sector won't change, usually bio_trim() isn't called, user can
> >>>> restore original position by storing sectors from current ->bi_iter.bi_sector to
> >>>> bio's end sector; together by saving bio size, 8 bytes can restore to
> >>>> original bio.
> >>>>
> >>>> 3) dm's requeue use case: when BLK_STS_DM_REQUEUE happens, dm core needs to
> >>>> restore to the original bio which represents current dm io to be requeued.
> >>>> By storing sectors to the bio's end sector and dm io's size,
> >>>> bio_rewind() can restore such original bio, then dm core code needn't to
> >>>> allocate one bio beforehand just for handling BLK_STS_DM_REQUEUE which
> >>>> is actually one unusual event.
> >>>>
> >>>> 4) Not like original rewind API, this one needn't to add .bi_done, and no any
> >>>> effect on fast path
> >>>
> >>> It seems like perhaps the real issue here is that we need a real bio_iter,
> >>> separate from bvec_iter, that also encapsulates iterating over integrity &
> >>> fscrypt. 
> >>
> >> Not mention bio_iter, bvec_iter has been 32 bytes, which is too big to
> >> hold in per-io data structure. With this patch, 8bytes is enough
> >> to rewind one bio if the end sector is fixed.
> > 
> > Hold on though, does that check out? Why is that too big for per IO data
> > structures?
> > 
> > By definition these structures are only for IOs in flight, and we don't _want_
> > there to ever be very many of these or we're going to run into latency issues
> > due to queue depth.
> 
> It's much less about using whatever amount of memory for inflight IO,
> and much more about not bloating fast path structures (of which the bio
> is certainly one). All of this gunk has to be initialized for each IO,
> and that's the real issue.

Can't agree more, especially the initialization is just for the unusual
DM_REQUEUE event(bio rewind is needed).


Thanks,
Ming


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [dm-devel] [PATCH 5.20 1/4] block: add bio_rewind() API
@ 2022-06-29  0:49             ` Ming Lei
  0 siblings, 0 replies; 76+ messages in thread
From: Ming Lei @ 2022-06-29  0:49 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Martin K . Petersen, Mike Snitzer, Eric Biggers, linux-block,
	dm-devel, Dmitry Monakhov, Kent Overstreet

On Tue, Jun 28, 2022 at 12:13:06PM -0600, Jens Axboe wrote:
> On 6/27/22 10:20 PM, Kent Overstreet wrote:
> > On Mon, Jun 27, 2022 at 03:36:22PM +0800, Ming Lei wrote:
> >> On Sun, Jun 26, 2022 at 04:14:58PM -0400, Kent Overstreet wrote:
> >>> On Fri, Jun 24, 2022 at 10:12:52PM +0800, Ming Lei wrote:
> >>>> Commit 7759eb23fd98 ("block: remove bio_rewind_iter()") removes
> >>>> the similar API because the following reasons:
> >>>>
> >>>>     ```
> >>>>     It is pointed that bio_rewind_iter() is one very bad API[1]:
> >>>>
> >>>>     1) bio size may not be restored after rewinding
> >>>>
> >>>>     2) it causes some bogus change, such as 5151842b9d8732 (block: reset
> >>>>     bi_iter.bi_done after splitting bio)
> >>>>
> >>>>     3) rewinding really makes things complicated wrt. bio splitting
> >>>>
> >>>>     4) unnecessary updating of .bi_done in fast path
> >>>>
> >>>>     [1] https://marc.info/?t=153549924200005&r=1&w=2
> >>>>
> >>>>     So this patch takes Kent's suggestion to restore one bio into its original
> >>>>     state via saving bio iterator(struct bvec_iter) in bio_integrity_prep(),
> >>>>     given now bio_rewind_iter() is only used by bio integrity code.
> >>>>     ```
> >>>>
> >>>> However, it isn't easy to restore bio by saving 32 bytes bio->bi_iter, and saving
> >>>> it only can't restore crypto and integrity info.
> >>>>
> >>>> Add bio_rewind() back for some use cases which may not be same with
> >>>> previous generic case:
> >>>>
> >>>> 1) most of bio has fixed end sector since bio split is done from front of the bio,
> >>>> if driver just records how many sectors between current bio's start sector and
> >>>> the bio's end sector, the original position can be restored
> >>>>
> >>>> 2) if one bio's end sector won't change, usually bio_trim() isn't called, user can
> >>>> restore original position by storing sectors from current ->bi_iter.bi_sector to
> >>>> bio's end sector; together by saving bio size, 8 bytes can restore to
> >>>> original bio.
> >>>>
> >>>> 3) dm's requeue use case: when BLK_STS_DM_REQUEUE happens, dm core needs to
> >>>> restore to the original bio which represents current dm io to be requeued.
> >>>> By storing sectors to the bio's end sector and dm io's size,
> >>>> bio_rewind() can restore such original bio, then dm core code needn't to
> >>>> allocate one bio beforehand just for handling BLK_STS_DM_REQUEUE which
> >>>> is actually one unusual event.
> >>>>
> >>>> 4) Not like original rewind API, this one needn't to add .bi_done, and no any
> >>>> effect on fast path
> >>>
> >>> It seems like perhaps the real issue here is that we need a real bio_iter,
> >>> separate from bvec_iter, that also encapsulates iterating over integrity &
> >>> fscrypt. 
> >>
> >> Not mention bio_iter, bvec_iter has been 32 bytes, which is too big to
> >> hold in per-io data structure. With this patch, 8bytes is enough
> >> to rewind one bio if the end sector is fixed.
> > 
> > Hold on though, does that check out? Why is that too big for per IO data
> > structures?
> > 
> > By definition these structures are only for IOs in flight, and we don't _want_
> > there to ever be very many of these or we're going to run into latency issues
> > due to queue depth.
> 
> It's much less about using whatever amount of memory for inflight IO,
> and much more about not bloating fast path structures (of which the bio
> is certainly one). All of this gunk has to be initialized for each IO,
> and that's the real issue.

Can't agree more, especially the initialization is just for the unusual
DM_REQUEUE event(bio rewind is needed).


Thanks,
Ming
--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 5.20 1/4] block: add bio_rewind() API
  2022-06-28 16:36             ` [dm-devel] " Kent Overstreet
@ 2022-06-29  1:02               ` Ming Lei
  -1 siblings, 0 replies; 76+ messages in thread
From: Ming Lei @ 2022-06-29  1:02 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Jens Axboe, Mike Snitzer, linux-block, dm-devel, Eric Biggers,
	Dmitry Monakhov, Martin K . Petersen

On Tue, Jun 28, 2022 at 12:36:17PM -0400, Kent Overstreet wrote:
> On Tue, Jun 28, 2022 at 03:49:28PM +0800, Ming Lei wrote:
> > On Tue, Jun 28, 2022 at 12:26:10AM -0400, Kent Overstreet wrote:
> > > On Mon, Jun 27, 2022 at 03:36:22PM +0800, Ming Lei wrote:
> > > > Not mention bio_iter, bvec_iter has been 32 bytes, which is too big to
> > > > hold in per-io data structure. With this patch, 8bytes is enough
> > > > to rewind one bio if the end sector is fixed.
> > > 
> > > And with rewind, you're making an assumption about the state the iterator is
> > > going to be in when the IO has completed.
> > > 
> > > What if the iterator was never advanced?
> > 
> > bio_rewind() works as expected if the iterator doesn't advance, since bytes
> > between the recorded position and the end position isn't changed, same
> > with the end position.
> > 
> > > 
> > > So say you check for that by saving some other part of the iterator - but that
> > > may have legitimately changed too, if the bio was redirected (bi_sector changes)
> > > or trimmed (bi_size changes)
> > > 
> > > I still think this is an inherently buggy interface, the way it's being proposed
> > > to be used.
> > 
> > The patch did mention that the interface should be for situation in which end
> > sector of bio won't change.
> 
> But that's an assumption that you simply can't make!

Of course, we can, at least for this DM's use case, the bio is issued
from FS or split from DM, and it won't be issued to underlying queue any
more, and simply owned by DM core code.

> 
> We allow block device drivers to be stacked in _any_ combination. After a bio is
> completed it may have been partially advanced, fully advanced, trimmed, not
> trimmed, anything - and bi_sector and thus also bio_end_sector() may have
> changed, and will have if there's partition tables involved.

How can bio (partial)advance change bio's end sector?

bio end sector can be changed only when bio->bi_iter.bi_size is changed
manually(include bio_trim), or ->bi_bdev is changed. But inside one
driver, if the bio is owned by this driver(such as the driver is the
finally layer request based driver), the assumption often can make.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [dm-devel] [PATCH 5.20 1/4] block: add bio_rewind() API
@ 2022-06-29  1:02               ` Ming Lei
  0 siblings, 0 replies; 76+ messages in thread
From: Ming Lei @ 2022-06-29  1:02 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Jens Axboe, Martin K . Petersen, Mike Snitzer, Eric Biggers,
	linux-block, dm-devel, Dmitry Monakhov

On Tue, Jun 28, 2022 at 12:36:17PM -0400, Kent Overstreet wrote:
> On Tue, Jun 28, 2022 at 03:49:28PM +0800, Ming Lei wrote:
> > On Tue, Jun 28, 2022 at 12:26:10AM -0400, Kent Overstreet wrote:
> > > On Mon, Jun 27, 2022 at 03:36:22PM +0800, Ming Lei wrote:
> > > > Not mention bio_iter, bvec_iter has been 32 bytes, which is too big to
> > > > hold in per-io data structure. With this patch, 8bytes is enough
> > > > to rewind one bio if the end sector is fixed.
> > > 
> > > And with rewind, you're making an assumption about the state the iterator is
> > > going to be in when the IO has completed.
> > > 
> > > What if the iterator was never advanced?
> > 
> > bio_rewind() works as expected if the iterator doesn't advance, since bytes
> > between the recorded position and the end position isn't changed, same
> > with the end position.
> > 
> > > 
> > > So say you check for that by saving some other part of the iterator - but that
> > > may have legitimately changed too, if the bio was redirected (bi_sector changes)
> > > or trimmed (bi_size changes)
> > > 
> > > I still think this is an inherently buggy interface, the way it's being proposed
> > > to be used.
> > 
> > The patch did mention that the interface should be for situation in which end
> > sector of bio won't change.
> 
> But that's an assumption that you simply can't make!

Of course, we can, at least for this DM's use case, the bio is issued
from FS or split from DM, and it won't be issued to underlying queue any
more, and simply owned by DM core code.

> 
> We allow block device drivers to be stacked in _any_ combination. After a bio is
> completed it may have been partially advanced, fully advanced, trimmed, not
> trimmed, anything - and bi_sector and thus also bio_end_sector() may have
> changed, and will have if there's partition tables involved.

How can bio (partial)advance change bio's end sector?

bio end sector can be changed only when bio->bi_iter.bi_size is changed
manually(include bio_trim), or ->bi_bdev is changed. But inside one
driver, if the bio is owned by this driver(such as the driver is the
finally layer request based driver), the assumption often can make.


Thanks,
Ming
--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 5.20 1/4] block: add bio_rewind() API
  2022-06-28 17:52                 ` [dm-devel] " Kent Overstreet
@ 2022-06-29  6:07                   ` Mike Snitzer
  -1 siblings, 0 replies; 76+ messages in thread
From: Mike Snitzer @ 2022-06-29  6:07 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Ming Lei, Jens Axboe, linux-block, dm-devel, Eric Biggers,
	Dmitry Monakhov, Martin K . Petersen

On Tue, Jun 28 2022 at  1:52P -0400,
Kent Overstreet <kent.overstreet@gmail.com> wrote:

> On Tue, Jun 28, 2022 at 01:41:12PM -0400, Mike Snitzer wrote:
> > On Tue, Jun 28 2022 at 12:36P -0400,
> > Kent Overstreet <kent.overstreet@gmail.com> wrote:
> > 
> > > On Tue, Jun 28, 2022 at 03:49:28PM +0800, Ming Lei wrote:
> > > > On Tue, Jun 28, 2022 at 12:26:10AM -0400, Kent Overstreet wrote:
> > > > > On Mon, Jun 27, 2022 at 03:36:22PM +0800, Ming Lei wrote:
> > > > > > Not mention bio_iter, bvec_iter has been 32 bytes, which is too big to
> > > > > > hold in per-io data structure. With this patch, 8bytes is enough
> > > > > > to rewind one bio if the end sector is fixed.
> > > > > 
> > > > > And with rewind, you're making an assumption about the state the iterator is
> > > > > going to be in when the IO has completed.
> > > > > 
> > > > > What if the iterator was never advanced?
> > > > 
> > > > bio_rewind() works as expected if the iterator doesn't advance, since bytes
> > > > between the recorded position and the end position isn't changed, same
> > > > with the end position.
> > > > 
> > > > > 
> > > > > So say you check for that by saving some other part of the iterator - but that
> > > > > may have legitimately changed too, if the bio was redirected (bi_sector changes)
> > > > > or trimmed (bi_size changes)
> > > > > 
> > > > > I still think this is an inherently buggy interface, the way it's being proposed
> > > > > to be used.
> > > > 
> > > > The patch did mention that the interface should be for situation in which end
> > > > sector of bio won't change.
> > > 
> > > But that's an assumption that you simply can't make!
> > 
> > Why not?  There is clear use-case for this API, as demonstrated in the
> > patchset: DM can/will make use of it for the purposes of enhancing its
> > more unlikely bio requeuing work (that is needed to make the more
> > likely bio splitting scenario more efficient).
> > 
> > > We allow block device drivers to be stacked in _any_ combination. After a bio is
> > > completed it may have been partially advanced, fully advanced, trimmed, not
> > > trimmed, anything - and bi_sector and thus also bio_end_sector() may have
> > > changed, and will have if there's partition tables involved.
> > 
> > We don't _need_ this API to cure cancer for all hypothetical block
> > drivers.
> > 
> > If consumers of the API follow the rule that end sector of the
> > _original bio_ isn't changed then it is all fine.  It is that simple.
> > 
> > Stacked drivers will work just fine.  The lower layers will be
> > modifying their bios as needed. Because for DM those bios happen to
> > be clones, so there is isolation to the broader design flaw you are
> > trying to say is a major problem. As this patchset demonstrates.
> > 
> > I do concede that policing who can use an API is hard.  But if some
> > consumer of an API tries something that invalidates rules of the API
> > they get to keep the N pieces when it breaks.
> 
> Mike, keep in mind that when bio_rewind() was originally introduced, it
> immediately grew users that were _inherently buggy_ (of the "users can break
> this trivially") variety, and the whole thing had to be reverted, and I was
> really annoyed - mostly at myself, because I would have caught it if I'd been
> paying attention to the mailing list more.
> 
> And I _guarantee_ you that if this makes it in, we'll have the same thing
> happening all over again - we have a lot of different block drivers being
> written by a lot of different people, and most of them do not understand all the
> subtleties of the block layer and the ways different things can interact, and so
> the onus is on us to not add tools that they aren't going to immediately turn
> around and slice themselves with.
> 
> The 32 bytes you're trying to save is meaningless. Think instead about the weeks
> of engineer time that get wasted by bugs like this - chasing the bugs,
> babysitting the patches in to fix it, then the _endless_ -stable backports.

While you or others did something approximating all that, really
doesn't mean it applicable here.

> _Please_ try to think more defensively when you're writing code.

Please try to dial down the hyperbole and judgment. Ming wrote this
code. And you haven't been able to point out anything _actually_ wrong
with it (yet).

This patch's header does need editing for clarity, but we can help
improve it and the documentation above bio_rewind() in the code.

> So, and I'm sorry I have to be the killjoy here, but hard NACK on this patchset.
> Hard, hard NACK.

<insert tom-delonge-wtf.gif>

You see this bio_rewind() as history repeating itself, but it isn't
like what you ranted about in the past:
https://marc.info/?l=linux-block&m=153549921116441&w=2

I can certainly see why you think it similar at first glance. But this
patchset shows how bio_rewind() must be used, and how DM benefits from
using it safely (with no impact to struct bio or DM's per-bio-data).

bio_rewind() usage will be as niche as DM's use-case for it. If other
code respects the documented constraint, that the original bio's end
sector be preserved, then they can use it too.

The key is for a driver to maintain enough state to allow this fixed
end be effectively immutable. (DM happens to get this state "for free"
simply because it was already established for its IO accounting of
split bios).

The Linux codebase requires precision. This isn't new.

> I'll be happy to assist in coming up with alternate, less dangerous solutions
> though (and I think introducing a real bio_iter is overdue, so that's probably
> the first thing we should look at).

It isn't dangerous. It is an interface whose constraint needs to be
respected. Just like is documented for a myriad other kernel
interfaces.

Factoring out a bio_iter will bloat struct bio for functionality most
consumers don't need. And gating DM's ability to achieve this
patchset's functionality with some overdue refactoring is really _not_
acceptable.

Mike


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [dm-devel] [PATCH 5.20 1/4] block: add bio_rewind() API
@ 2022-06-29  6:07                   ` Mike Snitzer
  0 siblings, 0 replies; 76+ messages in thread
From: Mike Snitzer @ 2022-06-29  6:07 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Jens Axboe, Martin K . Petersen, Eric Biggers, Ming Lei,
	linux-block, dm-devel, Dmitry Monakhov

On Tue, Jun 28 2022 at  1:52P -0400,
Kent Overstreet <kent.overstreet@gmail.com> wrote:

> On Tue, Jun 28, 2022 at 01:41:12PM -0400, Mike Snitzer wrote:
> > On Tue, Jun 28 2022 at 12:36P -0400,
> > Kent Overstreet <kent.overstreet@gmail.com> wrote:
> > 
> > > On Tue, Jun 28, 2022 at 03:49:28PM +0800, Ming Lei wrote:
> > > > On Tue, Jun 28, 2022 at 12:26:10AM -0400, Kent Overstreet wrote:
> > > > > On Mon, Jun 27, 2022 at 03:36:22PM +0800, Ming Lei wrote:
> > > > > > Not mention bio_iter, bvec_iter has been 32 bytes, which is too big to
> > > > > > hold in per-io data structure. With this patch, 8bytes is enough
> > > > > > to rewind one bio if the end sector is fixed.
> > > > > 
> > > > > And with rewind, you're making an assumption about the state the iterator is
> > > > > going to be in when the IO has completed.
> > > > > 
> > > > > What if the iterator was never advanced?
> > > > 
> > > > bio_rewind() works as expected if the iterator doesn't advance, since bytes
> > > > between the recorded position and the end position isn't changed, same
> > > > with the end position.
> > > > 
> > > > > 
> > > > > So say you check for that by saving some other part of the iterator - but that
> > > > > may have legitimately changed too, if the bio was redirected (bi_sector changes)
> > > > > or trimmed (bi_size changes)
> > > > > 
> > > > > I still think this is an inherently buggy interface, the way it's being proposed
> > > > > to be used.
> > > > 
> > > > The patch did mention that the interface should be for situation in which end
> > > > sector of bio won't change.
> > > 
> > > But that's an assumption that you simply can't make!
> > 
> > Why not?  There is clear use-case for this API, as demonstrated in the
> > patchset: DM can/will make use of it for the purposes of enhancing its
> > more unlikely bio requeuing work (that is needed to make the more
> > likely bio splitting scenario more efficient).
> > 
> > > We allow block device drivers to be stacked in _any_ combination. After a bio is
> > > completed it may have been partially advanced, fully advanced, trimmed, not
> > > trimmed, anything - and bi_sector and thus also bio_end_sector() may have
> > > changed, and will have if there's partition tables involved.
> > 
> > We don't _need_ this API to cure cancer for all hypothetical block
> > drivers.
> > 
> > If consumers of the API follow the rule that end sector of the
> > _original bio_ isn't changed then it is all fine.  It is that simple.
> > 
> > Stacked drivers will work just fine.  The lower layers will be
> > modifying their bios as needed. Because for DM those bios happen to
> > be clones, so there is isolation to the broader design flaw you are
> > trying to say is a major problem. As this patchset demonstrates.
> > 
> > I do concede that policing who can use an API is hard.  But if some
> > consumer of an API tries something that invalidates rules of the API
> > they get to keep the N pieces when it breaks.
> 
> Mike, keep in mind that when bio_rewind() was originally introduced, it
> immediately grew users that were _inherently buggy_ (of the "users can break
> this trivially") variety, and the whole thing had to be reverted, and I was
> really annoyed - mostly at myself, because I would have caught it if I'd been
> paying attention to the mailing list more.
> 
> And I _guarantee_ you that if this makes it in, we'll have the same thing
> happening all over again - we have a lot of different block drivers being
> written by a lot of different people, and most of them do not understand all the
> subtleties of the block layer and the ways different things can interact, and so
> the onus is on us to not add tools that they aren't going to immediately turn
> around and slice themselves with.
> 
> The 32 bytes you're trying to save is meaningless. Think instead about the weeks
> of engineer time that get wasted by bugs like this - chasing the bugs,
> babysitting the patches in to fix it, then the _endless_ -stable backports.

While you or others did something approximating all that, really
doesn't mean it applicable here.

> _Please_ try to think more defensively when you're writing code.

Please try to dial down the hyperbole and judgment. Ming wrote this
code. And you haven't been able to point out anything _actually_ wrong
with it (yet).

This patch's header does need editing for clarity, but we can help
improve it and the documentation above bio_rewind() in the code.

> So, and I'm sorry I have to be the killjoy here, but hard NACK on this patchset.
> Hard, hard NACK.

<insert tom-delonge-wtf.gif>

You see this bio_rewind() as history repeating itself, but it isn't
like what you ranted about in the past:
https://marc.info/?l=linux-block&m=153549921116441&w=2

I can certainly see why you think it similar at first glance. But this
patchset shows how bio_rewind() must be used, and how DM benefits from
using it safely (with no impact to struct bio or DM's per-bio-data).

bio_rewind() usage will be as niche as DM's use-case for it. If other
code respects the documented constraint, that the original bio's end
sector be preserved, then they can use it too.

The key is for a driver to maintain enough state to allow this fixed
end be effectively immutable. (DM happens to get this state "for free"
simply because it was already established for its IO accounting of
split bios).

The Linux codebase requires precision. This isn't new.

> I'll be happy to assist in coming up with alternate, less dangerous solutions
> though (and I think introducing a real bio_iter is overdue, so that's probably
> the first thing we should look at).

It isn't dangerous. It is an interface whose constraint needs to be
respected. Just like is documented for a myriad other kernel
interfaces.

Factoring out a bio_iter will bloat struct bio for functionality most
consumers don't need. And gating DM's ability to achieve this
patchset's functionality with some overdue refactoring is really _not_
acceptable.

Mike

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 5.20 1/4] block: add bio_rewind() API
  2022-06-28 18:32             ` [dm-devel] " Kent Overstreet
@ 2022-06-29 17:16               ` Jens Axboe
  -1 siblings, 0 replies; 76+ messages in thread
From: Jens Axboe @ 2022-06-29 17:16 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Ming Lei, Mike Snitzer, linux-block, dm-devel, Eric Biggers,
	Dmitry Monakhov, Martin K . Petersen

On 6/28/22 12:32 PM, Kent Overstreet wrote:
> On Tue, Jun 28, 2022 at 12:13:06PM -0600, Jens Axboe wrote:
>> It's much less about using whatever amount of memory for inflight IO,
>> and much more about not bloating fast path structures (of which the
>> bio is certainly one). All of this gunk has to be initialized for
>> each IO, and that's the real issue.
>>
>> Just look at the recent work for iov_iter and why ITER_UBUF makes
>> sense to do.
>>
>> This is not a commentary on this patchset, just a general
>> observation. Sizes of hot data structures DO matter, and quite a bit
>> so.
> 
> Younger me would have definitely been in agreement; initializing these
> structs definitely tends to show up in profiles.

Older me still greatly cares :-)

> These days I'm somewhat less inclined towards that view - profiles
> naturally highlight where your cache misses are happening, and
> initializing a freshly allocated data structure is always going to be
> a cache miss. But the difference between touching 3 and 6 contiguous
> cachelines is practically nil...  assuming we aren't doing anything
> stupid like using memset (despite what Linus wants from the CPU
> vendors, rep stos _still_ sucks) and perhaps inserting prefetches
> where appropriate.
> 
> And I see work going by that makes me really wonder if it was
> justified - in particular I _really_ want to know if Christoph's bio
> initialization change was justified by actual benchmarks and what
> those numbers were, vs. looking at profiles. Wasn't anything in the
> commit log...

Not sure what Christoph change you are referring to, but all the ones
that I did to improve the init side were all backed by numbers I ran at
that time (and most/all of the commit messages will have that data). So
yes, it is indeed still very noticeable. Maybe not at 100K IOPS, but at
10M on a core it most certainly is.

I'm all for having solid and maintainable code, obviously, but frivolous
bloating of structures and more expensive setup cannot be hand waved
away with "it doesn't matter if we touch 3 or 6 cachelines" because we
obviously have a disagreement on that.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [dm-devel] [PATCH 5.20 1/4] block: add bio_rewind() API
@ 2022-06-29 17:16               ` Jens Axboe
  0 siblings, 0 replies; 76+ messages in thread
From: Jens Axboe @ 2022-06-29 17:16 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Martin K . Petersen, Mike Snitzer, Eric Biggers, Ming Lei,
	linux-block, dm-devel, Dmitry Monakhov

On 6/28/22 12:32 PM, Kent Overstreet wrote:
> On Tue, Jun 28, 2022 at 12:13:06PM -0600, Jens Axboe wrote:
>> It's much less about using whatever amount of memory for inflight IO,
>> and much more about not bloating fast path structures (of which the
>> bio is certainly one). All of this gunk has to be initialized for
>> each IO, and that's the real issue.
>>
>> Just look at the recent work for iov_iter and why ITER_UBUF makes
>> sense to do.
>>
>> This is not a commentary on this patchset, just a general
>> observation. Sizes of hot data structures DO matter, and quite a bit
>> so.
> 
> Younger me would have definitely been in agreement; initializing these
> structs definitely tends to show up in profiles.

Older me still greatly cares :-)

> These days I'm somewhat less inclined towards that view - profiles
> naturally highlight where your cache misses are happening, and
> initializing a freshly allocated data structure is always going to be
> a cache miss. But the difference between touching 3 and 6 contiguous
> cachelines is practically nil...  assuming we aren't doing anything
> stupid like using memset (despite what Linus wants from the CPU
> vendors, rep stos _still_ sucks) and perhaps inserting prefetches
> where appropriate.
> 
> And I see work going by that makes me really wonder if it was
> justified - in particular I _really_ want to know if Christoph's bio
> initialization change was justified by actual benchmarks and what
> those numbers were, vs. looking at profiles. Wasn't anything in the
> commit log...

Not sure what Christoph change you are referring to, but all the ones
that I did to improve the init side were all backed by numbers I ran at
that time (and most/all of the commit messages will have that data). So
yes, it is indeed still very noticeable. Maybe not at 100K IOPS, but at
10M on a core it most certainly is.

I'm all for having solid and maintainable code, obviously, but frivolous
bloating of structures and more expensive setup cannot be hand waved
away with "it doesn't matter if we touch 3 or 6 cachelines" because we
obviously have a disagreement on that.

-- 
Jens Axboe

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 5.20 1/4] block: add bio_rewind() API
  2022-06-29  6:07                   ` [dm-devel] " Mike Snitzer
@ 2022-06-29 18:11                     ` Kent Overstreet
  -1 siblings, 0 replies; 76+ messages in thread
From: Kent Overstreet @ 2022-06-29 18:11 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Ming Lei, Jens Axboe, linux-block, dm-devel, Eric Biggers,
	Dmitry Monakhov, Martin K . Petersen

On Wed, Jun 29, 2022 at 02:07:08AM -0400, Mike Snitzer wrote:
> Please try to dial down the hyperbole and judgment. Ming wrote this
> code. And you haven't been able to point out anything _actually_ wrong
> with it (yet).
> 
> This patch's header does need editing for clarity, but we can help
> improve it and the documentation above bio_rewind() in the code.
> 
> > So, and I'm sorry I have to be the killjoy here, but hard NACK on this patchset.
> > Hard, hard NACK.
> 
> <insert tom-delonge-wtf.gif>
> 
> You see this bio_rewind() as history repeating itself, but it isn't
> like what you ranted about in the past:
> https://marc.info/?l=linux-block&m=153549921116441&w=2
> 
> I can certainly see why you think it similar at first glance. But this
> patchset shows how bio_rewind() must be used, and how DM benefits from
> using it safely (with no impact to struct bio or DM's per-bio-data).
> 
> bio_rewind() usage will be as niche as DM's use-case for it. If other
> code respects the documented constraint, that the original bio's end
> sector be preserved, then they can use it too.
> 
> The key is for a driver to maintain enough state to allow this fixed
> end be effectively immutable. (DM happens to get this state "for free"
> simply because it was already established for its IO accounting of
> split bios).
> 
> The Linux codebase requires precision. This isn't new.

Mike, that's not justification for making things _more_ dangerous.

> 
> > I'll be happy to assist in coming up with alternate, less dangerous solutions
> > though (and I think introducing a real bio_iter is overdue, so that's probably
> > the first thing we should look at).
> 
> It isn't dangerous. It is an interface whose constraint needs to be
> respected. Just like is documented for a myriad other kernel
> interfaces.
> 
> Factoring out a bio_iter will bloat struct bio for functionality most
> consumers don't need. And gating DM's ability to achieve this
> patchset's functionality with some overdue refactoring is really _not_
> acceptable.

Mike, you're the one who's getting seriously hyperbolic here. You're getting
frustrated because you've got this one thing you really want to get done, and
you feel like you're running into a brick wall when I tell you "no".

And yes, coding in the kernel is a complicated, dangerous environment with many
rules that need to be respected.

That does not mean it's ok to be adding to that complexity, and making it even
more dangerous, without a _really fucking good reason_. This doesn't fly. Maybe
it would if it was some device mapper private thing, but you're acting like it's
only going to be used by device mapper when you're trying to add it to the
public interface for core block layer bio code. _That_ needs real justification.

Also, bio_iter is something we should definitely be considering because of the
way integrity and now crypt has been tacked on to struct bio.

When I originally wrote the modern bvec_iter code, the ability to use an
iterator besides the one in struct bio was an important piece of functionality,
one that's still in use (including in device mapper; see
__bio_for_each_segment()). The fact that we're growing additional data
structures that in theory want to be iterated in lockstep with the main bio
payload but _aren't_ iterated over with bi_iter is, at best, a code smell and a
lurking footgun.

However, I can see that the two of you are not likely take on figuring out how
to clean that up, and truthfully I don't have the time right now either, much as
it pains me.

Here's an alternative approach:

The fundamental problem with bio_rewind() (and I know that you two are super
serious that this is completely safe for your use case and no one else is going
to use it for anything else) is that we're using it to get back to some initial
state, but it's not invariant w.r.t. what's been done to the bio since then, and
the nature of the block layer is that that's a problem.

So here's what you do:

You bring back bi_done: bi_done counts bytes advanced, total, since the start
of the bio. Then we introduce a type:

struct bio_pos {
	unsigned	bi_done;
	unsigned	bi_size;
};

And two new functions:

struct bio_pos bio_get_pos(struct bio *)
{
	...
}

void bio_set_pos(struct bio *, struct bio_pos)
{
	...
}

That gets you the same functionality as bio_rewind(), but it'll be much more
broadly useful.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [dm-devel] [PATCH 5.20 1/4] block: add bio_rewind() API
@ 2022-06-29 18:11                     ` Kent Overstreet
  0 siblings, 0 replies; 76+ messages in thread
From: Kent Overstreet @ 2022-06-29 18:11 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Jens Axboe, Martin K . Petersen, Eric Biggers, Ming Lei,
	linux-block, dm-devel, Dmitry Monakhov

On Wed, Jun 29, 2022 at 02:07:08AM -0400, Mike Snitzer wrote:
> Please try to dial down the hyperbole and judgment. Ming wrote this
> code. And you haven't been able to point out anything _actually_ wrong
> with it (yet).
> 
> This patch's header does need editing for clarity, but we can help
> improve it and the documentation above bio_rewind() in the code.
> 
> > So, and I'm sorry I have to be the killjoy here, but hard NACK on this patchset.
> > Hard, hard NACK.
> 
> <insert tom-delonge-wtf.gif>
> 
> You see this bio_rewind() as history repeating itself, but it isn't
> like what you ranted about in the past:
> https://marc.info/?l=linux-block&m=153549921116441&w=2
> 
> I can certainly see why you think it similar at first glance. But this
> patchset shows how bio_rewind() must be used, and how DM benefits from
> using it safely (with no impact to struct bio or DM's per-bio-data).
> 
> bio_rewind() usage will be as niche as DM's use-case for it. If other
> code respects the documented constraint, that the original bio's end
> sector be preserved, then they can use it too.
> 
> The key is for a driver to maintain enough state to allow this fixed
> end be effectively immutable. (DM happens to get this state "for free"
> simply because it was already established for its IO accounting of
> split bios).
> 
> The Linux codebase requires precision. This isn't new.

Mike, that's not justification for making things _more_ dangerous.

> 
> > I'll be happy to assist in coming up with alternate, less dangerous solutions
> > though (and I think introducing a real bio_iter is overdue, so that's probably
> > the first thing we should look at).
> 
> It isn't dangerous. It is an interface whose constraint needs to be
> respected. Just like is documented for a myriad other kernel
> interfaces.
> 
> Factoring out a bio_iter will bloat struct bio for functionality most
> consumers don't need. And gating DM's ability to achieve this
> patchset's functionality with some overdue refactoring is really _not_
> acceptable.

Mike, you're the one who's getting seriously hyperbolic here. You're getting
frustrated because you've got this one thing you really want to get done, and
you feel like you're running into a brick wall when I tell you "no".

And yes, coding in the kernel is a complicated, dangerous environment with many
rules that need to be respected.

That does not mean it's ok to be adding to that complexity, and making it even
more dangerous, without a _really fucking good reason_. This doesn't fly. Maybe
it would if it was some device mapper private thing, but you're acting like it's
only going to be used by device mapper when you're trying to add it to the
public interface for core block layer bio code. _That_ needs real justification.

Also, bio_iter is something we should definitely be considering because of the
way integrity and now crypt has been tacked on to struct bio.

When I originally wrote the modern bvec_iter code, the ability to use an
iterator besides the one in struct bio was an important piece of functionality,
one that's still in use (including in device mapper; see
__bio_for_each_segment()). The fact that we're growing additional data
structures that in theory want to be iterated in lockstep with the main bio
payload but _aren't_ iterated over with bi_iter is, at best, a code smell and a
lurking footgun.

However, I can see that the two of you are not likely take on figuring out how
to clean that up, and truthfully I don't have the time right now either, much as
it pains me.

Here's an alternative approach:

The fundamental problem with bio_rewind() (and I know that you two are super
serious that this is completely safe for your use case and no one else is going
to use it for anything else) is that we're using it to get back to some initial
state, but it's not invariant w.r.t. what's been done to the bio since then, and
the nature of the block layer is that that's a problem.

So here's what you do:

You bring back bi_done: bi_done counts bytes advanced, total, since the start
of the bio. Then we introduce a type:

struct bio_pos {
	unsigned	bi_done;
	unsigned	bi_size;
};

And two new functions:

struct bio_pos bio_get_pos(struct bio *)
{
	...
}

void bio_set_pos(struct bio *, struct bio_pos)
{
	...
}

That gets you the same functionality as bio_rewind(), but it'll be much more
broadly useful.

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 5.20 1/4] block: add bio_rewind() API
  2022-06-29 17:16               ` [dm-devel] " Jens Axboe
@ 2022-06-29 18:40                 ` Kent Overstreet
  -1 siblings, 0 replies; 76+ messages in thread
From: Kent Overstreet @ 2022-06-29 18:40 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Ming Lei, Mike Snitzer, linux-block, dm-devel, Eric Biggers,
	Dmitry Monakhov, Martin K . Petersen

On Wed, Jun 29, 2022 at 11:16:10AM -0600, Jens Axboe wrote:
> Not sure what Christoph change you are referring to, but all the ones
> that I did to improve the init side were all backed by numbers I ran at
> that time (and most/all of the commit messages will have that data). So
> yes, it is indeed still very noticeable. Maybe not at 100K IOPS, but at
> 10M on a core it most certainly is.

I was referring to 609be1066731fea86436f5f91022f82e592ab456. You signed off on
it, you must remember it...?

> I'm all for having solid and maintainable code, obviously, but frivolous
> bloating of structures and more expensive setup cannot be hand waved
> away with "it doesn't matter if we touch 3 or 6 cachelines" because we
> obviously have a disagreement on that.

I wouldn't propose inflating struct _bio_ like that. But Jens, to be blunt - I
know we have different priorities in the way we write code. Your writeback
throttling code was buggy for _ages_ and I had users hitting deadlocks there
that I pinged you about, and I could not make heads or tails of how that code
was supposed to work and not for lack of time spent trying!

You should be well aware that I care about performance too - I was the one who
pushed through the patches to not separately allocate mempools and biosets, and
a lot of the work I did ages ago was specifically to work towards getting rif of
the counting segments pass (work I believe Ming completed); that was a _major_
chunk of cpu time in any block layer profile I've looked at. So sure, tell me I
don't care about performance enough.

*sigh*

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [dm-devel] [PATCH 5.20 1/4] block: add bio_rewind() API
@ 2022-06-29 18:40                 ` Kent Overstreet
  0 siblings, 0 replies; 76+ messages in thread
From: Kent Overstreet @ 2022-06-29 18:40 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Martin K . Petersen, Mike Snitzer, Eric Biggers, Ming Lei,
	linux-block, dm-devel, Dmitry Monakhov

On Wed, Jun 29, 2022 at 11:16:10AM -0600, Jens Axboe wrote:
> Not sure what Christoph change you are referring to, but all the ones
> that I did to improve the init side were all backed by numbers I ran at
> that time (and most/all of the commit messages will have that data). So
> yes, it is indeed still very noticeable. Maybe not at 100K IOPS, but at
> 10M on a core it most certainly is.

I was referring to 609be1066731fea86436f5f91022f82e592ab456. You signed off on
it, you must remember it...?

> I'm all for having solid and maintainable code, obviously, but frivolous
> bloating of structures and more expensive setup cannot be hand waved
> away with "it doesn't matter if we touch 3 or 6 cachelines" because we
> obviously have a disagreement on that.

I wouldn't propose inflating struct _bio_ like that. But Jens, to be blunt - I
know we have different priorities in the way we write code. Your writeback
throttling code was buggy for _ages_ and I had users hitting deadlocks there
that I pinged you about, and I could not make heads or tails of how that code
was supposed to work and not for lack of time spent trying!

You should be well aware that I care about performance too - I was the one who
pushed through the patches to not separately allocate mempools and biosets, and
a lot of the work I did ages ago was specifically to work towards getting rif of
the counting segments pass (work I believe Ming completed); that was a _major_
chunk of cpu time in any block layer profile I've looked at. So sure, tell me I
don't care about performance enough.

*sigh*

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 5.20 1/4] block: add bio_rewind() API
  2022-06-29 18:40                 ` [dm-devel] " Kent Overstreet
@ 2022-06-29 18:51                   ` Bart Van Assche
  -1 siblings, 0 replies; 76+ messages in thread
From: Bart Van Assche @ 2022-06-29 18:51 UTC (permalink / raw)
  To: Kent Overstreet, Jens Axboe
  Cc: Ming Lei, Mike Snitzer, linux-block, dm-devel, Eric Biggers,
	Dmitry Monakhov, Martin K . Petersen

On 6/29/22 11:40, Kent Overstreet wrote:
> But Jens, to be blunt - I know we have different priorities in the way we write code.

Please stay professional in your communication and focus on the 
technical issues instead of on the people involved.

BTW, I remember that some time ago I bisected a kernel bug to one of 
your commits and that you refused to fix the bug introduced by that 
commit. I had to spend my time on root-causing it and sending a fix 
upstream.

Bart.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [dm-devel] [PATCH 5.20 1/4] block: add bio_rewind() API
@ 2022-06-29 18:51                   ` Bart Van Assche
  0 siblings, 0 replies; 76+ messages in thread
From: Bart Van Assche @ 2022-06-29 18:51 UTC (permalink / raw)
  To: Kent Overstreet, Jens Axboe
  Cc: Martin K . Petersen, Mike Snitzer, Eric Biggers, Ming Lei,
	linux-block, dm-devel, Dmitry Monakhov

On 6/29/22 11:40, Kent Overstreet wrote:
> But Jens, to be blunt - I know we have different priorities in the way we write code.

Please stay professional in your communication and focus on the 
technical issues instead of on the people involved.

BTW, I remember that some time ago I bisected a kernel bug to one of 
your commits and that you refused to fix the bug introduced by that 
commit. I had to spend my time on root-causing it and sending a fix 
upstream.

Bart.

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 5.20 1/4] block: add bio_rewind() API
  2022-06-29 18:40                 ` [dm-devel] " Kent Overstreet
@ 2022-06-29 19:00                   ` Jens Axboe
  -1 siblings, 0 replies; 76+ messages in thread
From: Jens Axboe @ 2022-06-29 19:00 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Ming Lei, Mike Snitzer, linux-block, dm-devel, Eric Biggers,
	Dmitry Monakhov, Martin K . Petersen

On 6/29/22 12:40 PM, Kent Overstreet wrote:
> On Wed, Jun 29, 2022 at 11:16:10AM -0600, Jens Axboe wrote:
>> Not sure what Christoph change you are referring to, but all the ones
>> that I did to improve the init side were all backed by numbers I ran at
>> that time (and most/all of the commit messages will have that data). So
>> yes, it is indeed still very noticeable. Maybe not at 100K IOPS, but at
>> 10M on a core it most certainly is.
> 
> I was referring to 609be1066731fea86436f5f91022f82e592ab456. You
> signed off on it, you must remember it...?

I'm sure we all remember each and every commit that gets applied,
particularly with such a precise description of the change.

>> I'm all for having solid and maintainable code, obviously, but frivolous
>> bloating of structures and more expensive setup cannot be hand waved
>> away with "it doesn't matter if we touch 3 or 6 cachelines" because we
>> obviously have a disagreement on that.
> 
> I wouldn't propose inflating struct _bio_ like that. But Jens, to be
> blunt - I know we have different priorities in the way we write code.
> Your writeback throttling code was buggy for _ages_ and I had users
> hitting deadlocks there that I pinged you about, and I could not make
> heads or tails of how that code was supposed to work and not for lack
> of time spent trying!

OK Kent, you just wasted your 2nd chance here. Plonk. There are many
rebuttals that could be made here, but I won't waste my time on it, nor
would it be appropriate.

Come back when you know how to act professionally. Or don't come back
at all.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [dm-devel] [PATCH 5.20 1/4] block: add bio_rewind() API
@ 2022-06-29 19:00                   ` Jens Axboe
  0 siblings, 0 replies; 76+ messages in thread
From: Jens Axboe @ 2022-06-29 19:00 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Martin K . Petersen, Mike Snitzer, Eric Biggers, Ming Lei,
	linux-block, dm-devel, Dmitry Monakhov

On 6/29/22 12:40 PM, Kent Overstreet wrote:
> On Wed, Jun 29, 2022 at 11:16:10AM -0600, Jens Axboe wrote:
>> Not sure what Christoph change you are referring to, but all the ones
>> that I did to improve the init side were all backed by numbers I ran at
>> that time (and most/all of the commit messages will have that data). So
>> yes, it is indeed still very noticeable. Maybe not at 100K IOPS, but at
>> 10M on a core it most certainly is.
> 
> I was referring to 609be1066731fea86436f5f91022f82e592ab456. You
> signed off on it, you must remember it...?

I'm sure we all remember each and every commit that gets applied,
particularly with such a precise description of the change.

>> I'm all for having solid and maintainable code, obviously, but frivolous
>> bloating of structures and more expensive setup cannot be hand waved
>> away with "it doesn't matter if we touch 3 or 6 cachelines" because we
>> obviously have a disagreement on that.
> 
> I wouldn't propose inflating struct _bio_ like that. But Jens, to be
> blunt - I know we have different priorities in the way we write code.
> Your writeback throttling code was buggy for _ages_ and I had users
> hitting deadlocks there that I pinged you about, and I could not make
> heads or tails of how that code was supposed to work and not for lack
> of time spent trying!

OK Kent, you just wasted your 2nd chance here. Plonk. There are many
rebuttals that could be made here, but I won't waste my time on it, nor
would it be appropriate.

Come back when you know how to act professionally. Or don't come back
at all.

-- 
Jens Axboe

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 5.20 1/4] block: add bio_rewind() API
  2022-06-29 18:51                   ` [dm-devel] " Bart Van Assche
@ 2022-06-29 19:05                     ` Kent Overstreet
  -1 siblings, 0 replies; 76+ messages in thread
From: Kent Overstreet @ 2022-06-29 19:05 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, Ming Lei, Mike Snitzer, linux-block, dm-devel,
	Eric Biggers, Dmitry Monakhov, Martin K . Petersen

On Wed, Jun 29, 2022 at 11:51:27AM -0700, Bart Van Assche wrote:
> On 6/29/22 11:40, Kent Overstreet wrote:
> > But Jens, to be blunt - I know we have different priorities in the way we write code.
> 
> Please stay professional in your communication and focus on the technical
> issues instead of on the people involved.
> 
> BTW, I remember that some time ago I bisected a kernel bug to one of your
> commits and that you refused to fix the bug introduced by that commit. I had
> to spend my time on root-causing it and sending a fix upstream.

I'd be genuinely appreciative if you'd refresh my memory on what it was. Because
yeah, if I did that that was my fuckup and I want to learn from my mistakes.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [dm-devel] [PATCH 5.20 1/4] block: add bio_rewind() API
@ 2022-06-29 19:05                     ` Kent Overstreet
  0 siblings, 0 replies; 76+ messages in thread
From: Kent Overstreet @ 2022-06-29 19:05 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, Martin K . Petersen, Mike Snitzer, Eric Biggers,
	Ming Lei, linux-block, dm-devel, Dmitry Monakhov

On Wed, Jun 29, 2022 at 11:51:27AM -0700, Bart Van Assche wrote:
> On 6/29/22 11:40, Kent Overstreet wrote:
> > But Jens, to be blunt - I know we have different priorities in the way we write code.
> 
> Please stay professional in your communication and focus on the technical
> issues instead of on the people involved.
> 
> BTW, I remember that some time ago I bisected a kernel bug to one of your
> commits and that you refused to fix the bug introduced by that commit. I had
> to spend my time on root-causing it and sending a fix upstream.

I'd be genuinely appreciative if you'd refresh my memory on what it was. Because
yeah, if I did that that was my fuckup and I want to learn from my mistakes.

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 5.20 1/4] block: add bio_rewind() API
  2022-06-29 19:00                   ` [dm-devel] " Jens Axboe
@ 2022-06-29 19:26                     ` Kent Overstreet
  -1 siblings, 0 replies; 76+ messages in thread
From: Kent Overstreet @ 2022-06-29 19:26 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Ming Lei, Mike Snitzer, linux-block, dm-devel, Eric Biggers,
	Dmitry Monakhov, Martin K . Petersen

On Wed, Jun 29, 2022 at 01:00:52PM -0600, Jens Axboe wrote:
> On 6/29/22 12:40 PM, Kent Overstreet wrote:
> > On Wed, Jun 29, 2022 at 11:16:10AM -0600, Jens Axboe wrote:
> >> Not sure what Christoph change you are referring to, but all the ones
> >> that I did to improve the init side were all backed by numbers I ran at
> >> that time (and most/all of the commit messages will have that data). So
> >> yes, it is indeed still very noticeable. Maybe not at 100K IOPS, but at
> >> 10M on a core it most certainly is.
> > 
> > I was referring to 609be1066731fea86436f5f91022f82e592ab456. You
> > signed off on it, you must remember it...?
> 
> I'm sure we all remember each and every commit that gets applied,
> particularly with such a precise description of the change.
> 
> >> I'm all for having solid and maintainable code, obviously, but frivolous
> >> bloating of structures and more expensive setup cannot be hand waved
> >> away with "it doesn't matter if we touch 3 or 6 cachelines" because we
> >> obviously have a disagreement on that.
> > 
> > I wouldn't propose inflating struct _bio_ like that. But Jens, to be
> > blunt - I know we have different priorities in the way we write code.
> > Your writeback throttling code was buggy for _ages_ and I had users
> > hitting deadlocks there that I pinged you about, and I could not make
> > heads or tails of how that code was supposed to work and not for lack
> > of time spent trying!
> 
> OK Kent, you just wasted your 2nd chance here. Plonk. There are many
> rebuttals that could be made here, but I won't waste my time on it, nor
> would it be appropriate.
> 
> Come back when you know how to act professionally. Or don't come back
> at all.

Jens, you're just acting like your code is immune to criticism, and I don't have
an eyeroll big enough for that. We all know how you care about chasing every
last of those 10 million iops - and not much else.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [dm-devel] [PATCH 5.20 1/4] block: add bio_rewind() API
@ 2022-06-29 19:26                     ` Kent Overstreet
  0 siblings, 0 replies; 76+ messages in thread
From: Kent Overstreet @ 2022-06-29 19:26 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Martin K . Petersen, Mike Snitzer, Eric Biggers, Ming Lei,
	linux-block, dm-devel, Dmitry Monakhov

On Wed, Jun 29, 2022 at 01:00:52PM -0600, Jens Axboe wrote:
> On 6/29/22 12:40 PM, Kent Overstreet wrote:
> > On Wed, Jun 29, 2022 at 11:16:10AM -0600, Jens Axboe wrote:
> >> Not sure what Christoph change you are referring to, but all the ones
> >> that I did to improve the init side were all backed by numbers I ran at
> >> that time (and most/all of the commit messages will have that data). So
> >> yes, it is indeed still very noticeable. Maybe not at 100K IOPS, but at
> >> 10M on a core it most certainly is.
> > 
> > I was referring to 609be1066731fea86436f5f91022f82e592ab456. You
> > signed off on it, you must remember it...?
> 
> I'm sure we all remember each and every commit that gets applied,
> particularly with such a precise description of the change.
> 
> >> I'm all for having solid and maintainable code, obviously, but frivolous
> >> bloating of structures and more expensive setup cannot be hand waved
> >> away with "it doesn't matter if we touch 3 or 6 cachelines" because we
> >> obviously have a disagreement on that.
> > 
> > I wouldn't propose inflating struct _bio_ like that. But Jens, to be
> > blunt - I know we have different priorities in the way we write code.
> > Your writeback throttling code was buggy for _ages_ and I had users
> > hitting deadlocks there that I pinged you about, and I could not make
> > heads or tails of how that code was supposed to work and not for lack
> > of time spent trying!
> 
> OK Kent, you just wasted your 2nd chance here. Plonk. There are many
> rebuttals that could be made here, but I won't waste my time on it, nor
> would it be appropriate.
> 
> Come back when you know how to act professionally. Or don't come back
> at all.

Jens, you're just acting like your code is immune to criticism, and I don't have
an eyeroll big enough for that. We all know how you care about chasing every
last of those 10 million iops - and not much else.

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 5.20 1/4] block: add bio_rewind() API
  2022-06-29 19:05                     ` [dm-devel] " Kent Overstreet
@ 2022-06-29 19:37                       ` Bart Van Assche
  -1 siblings, 0 replies; 76+ messages in thread
From: Bart Van Assche @ 2022-06-29 19:37 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Jens Axboe, Ming Lei, Mike Snitzer, linux-block, dm-devel,
	Eric Biggers, Dmitry Monakhov, Martin K . Petersen

On 6/29/22 12:05, Kent Overstreet wrote:
> On Wed, Jun 29, 2022 at 11:51:27AM -0700, Bart Van Assche wrote:
>> On 6/29/22 11:40, Kent Overstreet wrote:
>>> But Jens, to be blunt - I know we have different priorities in the way we write code.
>>
>> Please stay professional in your communication and focus on the technical
>> issues instead of on the people involved.
>>
>> BTW, I remember that some time ago I bisected a kernel bug to one of your
>> commits and that you refused to fix the bug introduced by that commit. I had
>> to spend my time on root-causing it and sending a fix upstream.
> 
> I'd be genuinely appreciative if you'd refresh my memory on what it was. Because
> yeah, if I did that that was my fuckup and I want to learn from my mistakes.

I was referring to the following two conversations from May 2018:
* [PATCH] Revert "block: Add warning for bi_next not NULL in 
bio_endio()" 
(https://lore.kernel.org/linux-block/20180522235505.20937-1-bart.vanassche@wdc.com/)
* [PATCH v2] Revert "block: Add warning for bi_next not NULL in 
bio_endio()" 
(https://lore.kernel.org/linux-block/20180619172640.15246-1-bart.vanassche@wdc.com/)

Bart.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [dm-devel] [PATCH 5.20 1/4] block: add bio_rewind() API
@ 2022-06-29 19:37                       ` Bart Van Assche
  0 siblings, 0 replies; 76+ messages in thread
From: Bart Van Assche @ 2022-06-29 19:37 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Jens Axboe, Martin K . Petersen, Mike Snitzer, Eric Biggers,
	Ming Lei, linux-block, dm-devel, Dmitry Monakhov

On 6/29/22 12:05, Kent Overstreet wrote:
> On Wed, Jun 29, 2022 at 11:51:27AM -0700, Bart Van Assche wrote:
>> On 6/29/22 11:40, Kent Overstreet wrote:
>>> But Jens, to be blunt - I know we have different priorities in the way we write code.
>>
>> Please stay professional in your communication and focus on the technical
>> issues instead of on the people involved.
>>
>> BTW, I remember that some time ago I bisected a kernel bug to one of your
>> commits and that you refused to fix the bug introduced by that commit. I had
>> to spend my time on root-causing it and sending a fix upstream.
> 
> I'd be genuinely appreciative if you'd refresh my memory on what it was. Because
> yeah, if I did that that was my fuckup and I want to learn from my mistakes.

I was referring to the following two conversations from May 2018:
* [PATCH] Revert "block: Add warning for bi_next not NULL in 
bio_endio()" 
(https://lore.kernel.org/linux-block/20180522235505.20937-1-bart.vanassche@wdc.com/)
* [PATCH v2] Revert "block: Add warning for bi_next not NULL in 
bio_endio()" 
(https://lore.kernel.org/linux-block/20180619172640.15246-1-bart.vanassche@wdc.com/)

Bart.

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 5.20 1/4] block: add bio_rewind() API
  2022-06-29 19:37                       ` [dm-devel] " Bart Van Assche
@ 2022-06-29 19:50                         ` Kent Overstreet
  -1 siblings, 0 replies; 76+ messages in thread
From: Kent Overstreet @ 2022-06-29 19:50 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, Ming Lei, Mike Snitzer, linux-block, dm-devel,
	Eric Biggers, Dmitry Monakhov, Martin K . Petersen

On Wed, Jun 29, 2022 at 12:37:59PM -0700, Bart Van Assche wrote:
> On 6/29/22 12:05, Kent Overstreet wrote:
> > On Wed, Jun 29, 2022 at 11:51:27AM -0700, Bart Van Assche wrote:
> > > On 6/29/22 11:40, Kent Overstreet wrote:
> > > > But Jens, to be blunt - I know we have different priorities in the way we write code.
> > > 
> > > Please stay professional in your communication and focus on the technical
> > > issues instead of on the people involved.
> > > 
> > > BTW, I remember that some time ago I bisected a kernel bug to one of your
> > > commits and that you refused to fix the bug introduced by that commit. I had
> > > to spend my time on root-causing it and sending a fix upstream.
> > 
> > I'd be genuinely appreciative if you'd refresh my memory on what it was. Because
> > yeah, if I did that that was my fuckup and I want to learn from my mistakes.
> 
> I was referring to the following two conversations from May 2018:
> * [PATCH] Revert "block: Add warning for bi_next not NULL in bio_endio()" (https://lore.kernel.org/linux-block/20180522235505.20937-1-bart.vanassche@wdc.com/)
> * [PATCH v2] Revert "block: Add warning for bi_next not NULL in bio_endio()" (https://lore.kernel.org/linux-block/20180619172640.15246-1-bart.vanassche@wdc.com/)

Oh yeah, that.

So, we had - still have - a situation where we have a struct bio field, bi_next,
that's used in different ways by different chunks of code, and where that field
being left in an indeterminate state when bios are being handed off across
module boundaries was likely to be a source of bugs. And having debugged one of
those I decided to introduce a check, and then when that broke device mapper (an
unfortunate sitation I agree) you decided to just... revert the check I added.

Like I said, I would have been happy to fix the device mapper code if I'd been
able to get your tests working (did we ever get this added to blk-tests?) - but
I wasn't MIA on the list and I would've been happy to work with you on this.

Was there something you thought I should've done differently?

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [dm-devel] [PATCH 5.20 1/4] block: add bio_rewind() API
@ 2022-06-29 19:50                         ` Kent Overstreet
  0 siblings, 0 replies; 76+ messages in thread
From: Kent Overstreet @ 2022-06-29 19:50 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, Martin K . Petersen, Mike Snitzer, Eric Biggers,
	Ming Lei, linux-block, dm-devel, Dmitry Monakhov

On Wed, Jun 29, 2022 at 12:37:59PM -0700, Bart Van Assche wrote:
> On 6/29/22 12:05, Kent Overstreet wrote:
> > On Wed, Jun 29, 2022 at 11:51:27AM -0700, Bart Van Assche wrote:
> > > On 6/29/22 11:40, Kent Overstreet wrote:
> > > > But Jens, to be blunt - I know we have different priorities in the way we write code.
> > > 
> > > Please stay professional in your communication and focus on the technical
> > > issues instead of on the people involved.
> > > 
> > > BTW, I remember that some time ago I bisected a kernel bug to one of your
> > > commits and that you refused to fix the bug introduced by that commit. I had
> > > to spend my time on root-causing it and sending a fix upstream.
> > 
> > I'd be genuinely appreciative if you'd refresh my memory on what it was. Because
> > yeah, if I did that that was my fuckup and I want to learn from my mistakes.
> 
> I was referring to the following two conversations from May 2018:
> * [PATCH] Revert "block: Add warning for bi_next not NULL in bio_endio()" (https://lore.kernel.org/linux-block/20180522235505.20937-1-bart.vanassche@wdc.com/)
> * [PATCH v2] Revert "block: Add warning for bi_next not NULL in bio_endio()" (https://lore.kernel.org/linux-block/20180619172640.15246-1-bart.vanassche@wdc.com/)

Oh yeah, that.

So, we had - still have - a situation where we have a struct bio field, bi_next,
that's used in different ways by different chunks of code, and where that field
being left in an indeterminate state when bios are being handed off across
module boundaries was likely to be a source of bugs. And having debugged one of
those I decided to introduce a check, and then when that broke device mapper (an
unfortunate sitation I agree) you decided to just... revert the check I added.

Like I said, I would have been happy to fix the device mapper code if I'd been
able to get your tests working (did we ever get this added to blk-tests?) - but
I wasn't MIA on the list and I would've been happy to work with you on this.

Was there something you thought I should've done differently?

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 5.20 1/4] block: add bio_rewind() API
  2022-06-29 19:37                       ` [dm-devel] " Bart Van Assche
@ 2022-06-29 19:59                         ` Kent Overstreet
  -1 siblings, 0 replies; 76+ messages in thread
From: Kent Overstreet @ 2022-06-29 19:59 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, Ming Lei, Mike Snitzer, linux-block, dm-devel,
	Eric Biggers, Dmitry Monakhov, Martin K . Petersen

On Wed, Jun 29, 2022 at 12:37:59PM -0700, Bart Van Assche wrote:
> On 6/29/22 12:05, Kent Overstreet wrote:
> > On Wed, Jun 29, 2022 at 11:51:27AM -0700, Bart Van Assche wrote:
> > > On 6/29/22 11:40, Kent Overstreet wrote:
> > > > But Jens, to be blunt - I know we have different priorities in the way we write code.
> > > 
> > > Please stay professional in your communication and focus on the technical
> > > issues instead of on the people involved.
> > > 
> > > BTW, I remember that some time ago I bisected a kernel bug to one of your
> > > commits and that you refused to fix the bug introduced by that commit. I had
> > > to spend my time on root-causing it and sending a fix upstream.
> > 
> > I'd be genuinely appreciative if you'd refresh my memory on what it was. Because
> > yeah, if I did that that was my fuckup and I want to learn from my mistakes.
> 
> I was referring to the following two conversations from May 2018:
> * [PATCH] Revert "block: Add warning for bi_next not NULL in bio_endio()" (https://lore.kernel.org/linux-block/20180522235505.20937-1-bart.vanassche@wdc.com/)
> * [PATCH v2] Revert "block: Add warning for bi_next not NULL in bio_endio()" (https://lore.kernel.org/linux-block/20180619172640.15246-1-bart.vanassche@wdc.com/)

And also, why did you decide to just revert instead of fixing the issue within
DM? You had a WARN() printed, you shouldn't have needed to bisect, the commit
message explained what was going on - but you never explained that part?

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [dm-devel] [PATCH 5.20 1/4] block: add bio_rewind() API
@ 2022-06-29 19:59                         ` Kent Overstreet
  0 siblings, 0 replies; 76+ messages in thread
From: Kent Overstreet @ 2022-06-29 19:59 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, Martin K . Petersen, Mike Snitzer, Eric Biggers,
	Ming Lei, linux-block, dm-devel, Dmitry Monakhov

On Wed, Jun 29, 2022 at 12:37:59PM -0700, Bart Van Assche wrote:
> On 6/29/22 12:05, Kent Overstreet wrote:
> > On Wed, Jun 29, 2022 at 11:51:27AM -0700, Bart Van Assche wrote:
> > > On 6/29/22 11:40, Kent Overstreet wrote:
> > > > But Jens, to be blunt - I know we have different priorities in the way we write code.
> > > 
> > > Please stay professional in your communication and focus on the technical
> > > issues instead of on the people involved.
> > > 
> > > BTW, I remember that some time ago I bisected a kernel bug to one of your
> > > commits and that you refused to fix the bug introduced by that commit. I had
> > > to spend my time on root-causing it and sending a fix upstream.
> > 
> > I'd be genuinely appreciative if you'd refresh my memory on what it was. Because
> > yeah, if I did that that was my fuckup and I want to learn from my mistakes.
> 
> I was referring to the following two conversations from May 2018:
> * [PATCH] Revert "block: Add warning for bi_next not NULL in bio_endio()" (https://lore.kernel.org/linux-block/20180522235505.20937-1-bart.vanassche@wdc.com/)
> * [PATCH v2] Revert "block: Add warning for bi_next not NULL in bio_endio()" (https://lore.kernel.org/linux-block/20180619172640.15246-1-bart.vanassche@wdc.com/)

And also, why did you decide to just revert instead of fixing the issue within
DM? You had a WARN() printed, you shouldn't have needed to bisect, the commit
message explained what was going on - but you never explained that part?

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 5.20 1/4] block: add bio_rewind() API
  2022-06-29 19:26                     ` [dm-devel] " Kent Overstreet
@ 2022-06-29 20:51                       ` Jens Axboe
  -1 siblings, 0 replies; 76+ messages in thread
From: Jens Axboe @ 2022-06-29 20:51 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Ming Lei, Mike Snitzer, linux-block, dm-devel, Eric Biggers,
	Dmitry Monakhov, Martin K . Petersen

On Jun 29, 2022, at 1:26 PM, Kent Overstreet <kent.overstreet@gmail.com> wrote:
> 
> On Wed, Jun 29, 2022 at 01:00:52PM -0600, Jens Axboe wrote:
>>> On 6/29/22 12:40 PM, Kent Overstreet wrote:
>>> On Wed, Jun 29, 2022 at 11:16:10AM -0600, Jens Axboe wrote:
>>>> Not sure what Christoph change you are referring to, but all the ones
>>>> that I did to improve the init side were all backed by numbers I ran at
>>>> that time (and most/all of the commit messages will have that data). So
>>>> yes, it is indeed still very noticeable. Maybe not at 100K IOPS, but at
>>>> 10M on a core it most certainly is.
>>> 
>>> I was referring to 609be1066731fea86436f5f91022f82e592ab456. You
>>> signed off on it, you must remember it...?
>> 
>> I'm sure we all remember each and every commit that gets applied,
>> particularly with such a precise description of the change.
>> 
>>>> I'm all for having solid and maintainable code, obviously, but frivolous
>>>> bloating of structures and more expensive setup cannot be hand waved
>>>> away with "it doesn't matter if we touch 3 or 6 cachelines" because we
>>>> obviously have a disagreement on that.
>>> 
>>> I wouldn't propose inflating struct _bio_ like that. But Jens, to be
>>> blunt - I know we have different priorities in the way we write code.
>>> Your writeback throttling code was buggy for _ages_ and I had users
>>> hitting deadlocks there that I pinged you about, and I could not make
>>> heads or tails of how that code was supposed to work and not for lack
>>> of time spent trying!
>> 
>> OK Kent, you just wasted your 2nd chance here. Plonk. There are many
>> rebuttals that could be made here, but I won't waste my time on it, nor
>> would it be appropriate.
>> 
>> Come back when you know how to act professionally. Or don't come back
>> at all.
> 
> Jens, you're just acting like your code is immune to criticism, and I don't have
> an eyeroll big enough for that. We all know how you care about chasing every
> last of those 10 million iops - and not much else.

Kent, the time for your unsolicited opinions and attacks have passed. Just go away, not interested in interacting with you. You have no idea what you’re talking about. 





^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [dm-devel] [PATCH 5.20 1/4] block: add bio_rewind() API
@ 2022-06-29 20:51                       ` Jens Axboe
  0 siblings, 0 replies; 76+ messages in thread
From: Jens Axboe @ 2022-06-29 20:51 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Martin K . Petersen, Mike Snitzer, Eric Biggers, Ming Lei,
	linux-block, dm-devel, Dmitry Monakhov

On Jun 29, 2022, at 1:26 PM, Kent Overstreet <kent.overstreet@gmail.com> wrote:
> 
> On Wed, Jun 29, 2022 at 01:00:52PM -0600, Jens Axboe wrote:
>>> On 6/29/22 12:40 PM, Kent Overstreet wrote:
>>> On Wed, Jun 29, 2022 at 11:16:10AM -0600, Jens Axboe wrote:
>>>> Not sure what Christoph change you are referring to, but all the ones
>>>> that I did to improve the init side were all backed by numbers I ran at
>>>> that time (and most/all of the commit messages will have that data). So
>>>> yes, it is indeed still very noticeable. Maybe not at 100K IOPS, but at
>>>> 10M on a core it most certainly is.
>>> 
>>> I was referring to 609be1066731fea86436f5f91022f82e592ab456. You
>>> signed off on it, you must remember it...?
>> 
>> I'm sure we all remember each and every commit that gets applied,
>> particularly with such a precise description of the change.
>> 
>>>> I'm all for having solid and maintainable code, obviously, but frivolous
>>>> bloating of structures and more expensive setup cannot be hand waved
>>>> away with "it doesn't matter if we touch 3 or 6 cachelines" because we
>>>> obviously have a disagreement on that.
>>> 
>>> I wouldn't propose inflating struct _bio_ like that. But Jens, to be
>>> blunt - I know we have different priorities in the way we write code.
>>> Your writeback throttling code was buggy for _ages_ and I had users
>>> hitting deadlocks there that I pinged you about, and I could not make
>>> heads or tails of how that code was supposed to work and not for lack
>>> of time spent trying!
>> 
>> OK Kent, you just wasted your 2nd chance here. Plonk. There are many
>> rebuttals that could be made here, but I won't waste my time on it, nor
>> would it be appropriate.
>> 
>> Come back when you know how to act professionally. Or don't come back
>> at all.
> 
> Jens, you're just acting like your code is immune to criticism, and I don't have
> an eyeroll big enough for that. We all know how you care about chasing every
> last of those 10 million iops - and not much else.

Kent, the time for your unsolicited opinions and attacks have passed. Just go away, not interested in interacting with you. You have no idea what you’re talking about. 




--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 5.20 1/4] block: add bio_rewind() API
  2022-06-29 18:11                     ` [dm-devel] " Kent Overstreet
@ 2022-06-30  0:47                       ` Ming Lei
  -1 siblings, 0 replies; 76+ messages in thread
From: Ming Lei @ 2022-06-30  0:47 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Mike Snitzer, Jens Axboe, linux-block, dm-devel, Eric Biggers,
	Dmitry Monakhov, Martin K . Petersen

On Wed, Jun 29, 2022 at 02:11:54PM -0400, Kent Overstreet wrote:
> On Wed, Jun 29, 2022 at 02:07:08AM -0400, Mike Snitzer wrote:
> > Please try to dial down the hyperbole and judgment. Ming wrote this
> > code. And you haven't been able to point out anything _actually_ wrong
> > with it (yet).
> > 
> > This patch's header does need editing for clarity, but we can help
> > improve it and the documentation above bio_rewind() in the code.
> > 
> > > So, and I'm sorry I have to be the killjoy here, but hard NACK on this patchset.
> > > Hard, hard NACK.
> > 
> > <insert tom-delonge-wtf.gif>
> > 
> > You see this bio_rewind() as history repeating itself, but it isn't
> > like what you ranted about in the past:
> > https://marc.info/?l=linux-block&m=153549921116441&w=2
> > 
> > I can certainly see why you think it similar at first glance. But this
> > patchset shows how bio_rewind() must be used, and how DM benefits from
> > using it safely (with no impact to struct bio or DM's per-bio-data).
> > 
> > bio_rewind() usage will be as niche as DM's use-case for it. If other
> > code respects the documented constraint, that the original bio's end
> > sector be preserved, then they can use it too.
> > 
> > The key is for a driver to maintain enough state to allow this fixed
> > end be effectively immutable. (DM happens to get this state "for free"
> > simply because it was already established for its IO accounting of
> > split bios).
> > 
> > The Linux codebase requires precision. This isn't new.
> 
> Mike, that's not justification for making things _more_ dangerous.
> 
> > 
> > > I'll be happy to assist in coming up with alternate, less dangerous solutions
> > > though (and I think introducing a real bio_iter is overdue, so that's probably
> > > the first thing we should look at).
> > 
> > It isn't dangerous. It is an interface whose constraint needs to be
> > respected. Just like is documented for a myriad other kernel
> > interfaces.
> > 
> > Factoring out a bio_iter will bloat struct bio for functionality most
> > consumers don't need. And gating DM's ability to achieve this
> > patchset's functionality with some overdue refactoring is really _not_
> > acceptable.
> 
> Mike, you're the one who's getting seriously hyperbolic here. You're getting
> frustrated because you've got this one thing you really want to get done, and
> you feel like you're running into a brick wall when I tell you "no".
> 
> And yes, coding in the kernel is a complicated, dangerous environment with many
> rules that need to be respected.
> 
> That does not mean it's ok to be adding to that complexity, and making it even
> more dangerous, without a _really fucking good reason_. This doesn't fly. Maybe
> it would if it was some device mapper private thing, but you're acting like it's
> only going to be used by device mapper when you're trying to add it to the
> public interface for core block layer bio code. _That_ needs real justification.
> 
> Also, bio_iter is something we should definitely be considering because of the
> way integrity and now crypt has been tacked on to struct bio.
> 
> When I originally wrote the modern bvec_iter code, the ability to use an
> iterator besides the one in struct bio was an important piece of functionality,
> one that's still in use (including in device mapper; see
> __bio_for_each_segment()). The fact that we're growing additional data
> structures that in theory want to be iterated in lockstep with the main bio
> payload but _aren't_ iterated over with bi_iter is, at best, a code smell and a
> lurking footgun.
> 
> However, I can see that the two of you are not likely take on figuring out how
> to clean that up, and truthfully I don't have the time right now either, much as
> it pains me.
> 
> Here's an alternative approach:
> 
> The fundamental problem with bio_rewind() (and I know that you two are super
> serious that this is completely safe for your use case and no one else is going
> to use it for anything else) is that we're using it to get back to some initial
> state, but it's not invariant w.r.t. what's been done to the bio since then, and
> the nature of the block layer is that that's a problem.
> 
> So here's what you do:
> 
> You bring back bi_done: bi_done counts bytes advanced, total, since the start
> of the bio. Then we introduce a type:
> 
> struct bio_pos {
> 	unsigned	bi_done;
> 	unsigned	bi_size;
> };
> 
> And two new functions:
> 
> struct bio_pos bio_get_pos(struct bio *)
> {
> 	...
> }
> 
> void bio_set_pos(struct bio *, struct bio_pos)
> {
> 	...
> }
> 
> That gets you the same functionality as bio_rewind(), but it'll be much more
> broadly useful.

What is the difference between bio_set_pos and bio_rewind()? Both have
to restore bio->bi_iter(the sector part and the bvec part).

Also how to update ->bi_done which 'counts bytes advanced'? You meant doing it in
very bio_advance()? then no, why do we have to pay the cost for very unusual
bio_rewind()?

Or if I misunderstood your point, please cook a patch and I am happy to
take a close look, and posting one very raw idea with random data
structure looks not helpful much for this discussion technically.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [dm-devel] [PATCH 5.20 1/4] block: add bio_rewind() API
@ 2022-06-30  0:47                       ` Ming Lei
  0 siblings, 0 replies; 76+ messages in thread
From: Ming Lei @ 2022-06-30  0:47 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Jens Axboe, Martin K . Petersen, Mike Snitzer, Eric Biggers,
	linux-block, dm-devel, Dmitry Monakhov

On Wed, Jun 29, 2022 at 02:11:54PM -0400, Kent Overstreet wrote:
> On Wed, Jun 29, 2022 at 02:07:08AM -0400, Mike Snitzer wrote:
> > Please try to dial down the hyperbole and judgment. Ming wrote this
> > code. And you haven't been able to point out anything _actually_ wrong
> > with it (yet).
> > 
> > This patch's header does need editing for clarity, but we can help
> > improve it and the documentation above bio_rewind() in the code.
> > 
> > > So, and I'm sorry I have to be the killjoy here, but hard NACK on this patchset.
> > > Hard, hard NACK.
> > 
> > <insert tom-delonge-wtf.gif>
> > 
> > You see this bio_rewind() as history repeating itself, but it isn't
> > like what you ranted about in the past:
> > https://marc.info/?l=linux-block&m=153549921116441&w=2
> > 
> > I can certainly see why you think it similar at first glance. But this
> > patchset shows how bio_rewind() must be used, and how DM benefits from
> > using it safely (with no impact to struct bio or DM's per-bio-data).
> > 
> > bio_rewind() usage will be as niche as DM's use-case for it. If other
> > code respects the documented constraint, that the original bio's end
> > sector be preserved, then they can use it too.
> > 
> > The key is for a driver to maintain enough state to allow this fixed
> > end be effectively immutable. (DM happens to get this state "for free"
> > simply because it was already established for its IO accounting of
> > split bios).
> > 
> > The Linux codebase requires precision. This isn't new.
> 
> Mike, that's not justification for making things _more_ dangerous.
> 
> > 
> > > I'll be happy to assist in coming up with alternate, less dangerous solutions
> > > though (and I think introducing a real bio_iter is overdue, so that's probably
> > > the first thing we should look at).
> > 
> > It isn't dangerous. It is an interface whose constraint needs to be
> > respected. Just like is documented for a myriad other kernel
> > interfaces.
> > 
> > Factoring out a bio_iter will bloat struct bio for functionality most
> > consumers don't need. And gating DM's ability to achieve this
> > patchset's functionality with some overdue refactoring is really _not_
> > acceptable.
> 
> Mike, you're the one who's getting seriously hyperbolic here. You're getting
> frustrated because you've got this one thing you really want to get done, and
> you feel like you're running into a brick wall when I tell you "no".
> 
> And yes, coding in the kernel is a complicated, dangerous environment with many
> rules that need to be respected.
> 
> That does not mean it's ok to be adding to that complexity, and making it even
> more dangerous, without a _really fucking good reason_. This doesn't fly. Maybe
> it would if it was some device mapper private thing, but you're acting like it's
> only going to be used by device mapper when you're trying to add it to the
> public interface for core block layer bio code. _That_ needs real justification.
> 
> Also, bio_iter is something we should definitely be considering because of the
> way integrity and now crypt has been tacked on to struct bio.
> 
> When I originally wrote the modern bvec_iter code, the ability to use an
> iterator besides the one in struct bio was an important piece of functionality,
> one that's still in use (including in device mapper; see
> __bio_for_each_segment()). The fact that we're growing additional data
> structures that in theory want to be iterated in lockstep with the main bio
> payload but _aren't_ iterated over with bi_iter is, at best, a code smell and a
> lurking footgun.
> 
> However, I can see that the two of you are not likely take on figuring out how
> to clean that up, and truthfully I don't have the time right now either, much as
> it pains me.
> 
> Here's an alternative approach:
> 
> The fundamental problem with bio_rewind() (and I know that you two are super
> serious that this is completely safe for your use case and no one else is going
> to use it for anything else) is that we're using it to get back to some initial
> state, but it's not invariant w.r.t. what's been done to the bio since then, and
> the nature of the block layer is that that's a problem.
> 
> So here's what you do:
> 
> You bring back bi_done: bi_done counts bytes advanced, total, since the start
> of the bio. Then we introduce a type:
> 
> struct bio_pos {
> 	unsigned	bi_done;
> 	unsigned	bi_size;
> };
> 
> And two new functions:
> 
> struct bio_pos bio_get_pos(struct bio *)
> {
> 	...
> }
> 
> void bio_set_pos(struct bio *, struct bio_pos)
> {
> 	...
> }
> 
> That gets you the same functionality as bio_rewind(), but it'll be much more
> broadly useful.

What is the difference between bio_set_pos and bio_rewind()? Both have
to restore bio->bi_iter(the sector part and the bvec part).

Also how to update ->bi_done which 'counts bytes advanced'? You meant doing it in
very bio_advance()? then no, why do we have to pay the cost for very unusual
bio_rewind()?

Or if I misunderstood your point, please cook a patch and I am happy to
take a close look, and posting one very raw idea with random data
structure looks not helpful much for this discussion technically.


Thanks,
Ming
--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 5.20 1/4] block: add bio_rewind() API
  2022-06-30  0:47                       ` [dm-devel] " Ming Lei
@ 2022-06-30  0:58                         ` Kent Overstreet
  -1 siblings, 0 replies; 76+ messages in thread
From: Kent Overstreet @ 2022-06-30  0:58 UTC (permalink / raw)
  To: Ming Lei
  Cc: Mike Snitzer, Jens Axboe, linux-block, dm-devel, Eric Biggers,
	Dmitry Monakhov, Martin K . Petersen

On Thu, Jun 30, 2022 at 08:47:13AM +0800, Ming Lei wrote:
> What is the difference between bio_set_pos and bio_rewind()? Both have
> to restore bio->bi_iter(the sector part and the bvec part).
> 
> Also how to update ->bi_done which 'counts bytes advanced'? You meant doing it in
> very bio_advance()? then no, why do we have to pay the cost for very unusual
> bio_rewind()?

Yeah, we'll have to add a u32 to bvec_iter, and increment it in bio_advance().

This would us everything we want - you'll be able to restore a bio to an initial
state and you just have to save 8 bytes, not a whole bvec_iter, and unlike
bio_rewind it'll be safe to use after calling submit_bio(), _and_ it solves the
problem that stashing a copy of bvec_iter doesn't save state in integrity or
crypt context.

> Or if I misunderstood your point, please cook a patch and I am happy to
> take a close look, and posting one very raw idea with random data
> structure looks not helpful much for this discussion technically.

I can do that...

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [dm-devel] [PATCH 5.20 1/4] block: add bio_rewind() API
@ 2022-06-30  0:58                         ` Kent Overstreet
  0 siblings, 0 replies; 76+ messages in thread
From: Kent Overstreet @ 2022-06-30  0:58 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, Martin K . Petersen, Mike Snitzer, Eric Biggers,
	linux-block, dm-devel, Dmitry Monakhov

On Thu, Jun 30, 2022 at 08:47:13AM +0800, Ming Lei wrote:
> What is the difference between bio_set_pos and bio_rewind()? Both have
> to restore bio->bi_iter(the sector part and the bvec part).
> 
> Also how to update ->bi_done which 'counts bytes advanced'? You meant doing it in
> very bio_advance()? then no, why do we have to pay the cost for very unusual
> bio_rewind()?

Yeah, we'll have to add a u32 to bvec_iter, and increment it in bio_advance().

This would us everything we want - you'll be able to restore a bio to an initial
state and you just have to save 8 bytes, not a whole bvec_iter, and unlike
bio_rewind it'll be safe to use after calling submit_bio(), _and_ it solves the
problem that stashing a copy of bvec_iter doesn't save state in integrity or
crypt context.

> Or if I misunderstood your point, please cook a patch and I am happy to
> take a close look, and posting one very raw idea with random data
> structure looks not helpful much for this discussion technically.

I can do that...

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 5.20 1/4] block: add bio_rewind() API
  2022-06-30  0:47                       ` [dm-devel] " Ming Lei
@ 2022-06-30  1:14                         ` Kent Overstreet
  -1 siblings, 0 replies; 76+ messages in thread
From: Kent Overstreet @ 2022-06-30  1:14 UTC (permalink / raw)
  To: Ming Lei
  Cc: Mike Snitzer, Jens Axboe, linux-block, dm-devel, Eric Biggers,
	Dmitry Monakhov, Martin K . Petersen

On Thu, Jun 30, 2022 at 08:47:13AM +0800, Ming Lei wrote:
> Or if I misunderstood your point, please cook a patch and I am happy to
> take a close look, and posting one very raw idea with random data
> structure looks not helpful much for this discussion technically.

Based it on your bio_rewind() patch - what do you think of this?

-- >8 --
From: Kent Overstreet <kent.overstreet@gmail.com>
Subject: [PATCH] block: add bio_(get|set)_pos()

Commit 7759eb23fd98 ("block: remove bio_rewind_iter()") removes
the similar API because the following reasons:

    ```
    It is pointed that bio_rewind_iter() is one very bad API[1]:

    1) bio size may not be restored after rewinding

    2) it causes some bogus change, such as 5151842b9d8732 (block: reset
    bi_iter.bi_done after splitting bio)

    3) rewinding really makes things complicated wrt. bio splitting

    4) unnecessary updating of .bi_done in fast path

    [1] https://marc.info/?t=153549924200005&r=1&w=2

    So this patch takes Kent's suggestion to restore one bio into its original
    state via saving bio iterator(struct bvec_iter) in bio_integrity_prep(),
    given now bio_rewind_iter() is only used by bio integrity code.
    ```

However, saving and restoring bi_iter isn't sufficient anymore, because
of integrity and now per-bio crypt context.

This patch implements the same functionality as bio_rewind(), based on a
patch by Ming, but with a different (safer!) interface.

 - bio_get_pos() gets the current state of a a bio, i.e. how far it has
   been advanced and its current (remaining) size
 - bio_set_pos() restores a bio to a previous state, advancing or
   rewinding it as needed

Co-authored-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
---
 block/bio-integrity.c       | 19 +++++++++++++++++++
 block/bio.c                 | 26 ++++++++++++++++++++++++++
 block/blk-crypto-internal.h |  7 +++++++
 block/blk-crypto.c          | 25 +++++++++++++++++++++++++
 include/linux/bio.h         | 22 ++++++++++++++++++++++
 include/linux/blk_types.h   | 19 +++++++++++++++++++
 include/linux/bvec.h        | 36 +++++++++++++++++++++++++++++++++++-
 7 files changed, 153 insertions(+), 1 deletion(-)

diff --git a/block/bio-integrity.c b/block/bio-integrity.c
index 32929c89ba..06c2fe81fd 100644
--- a/block/bio-integrity.c
+++ b/block/bio-integrity.c
@@ -378,6 +378,25 @@ void bio_integrity_advance(struct bio *bio, unsigned int bytes_done)
 	bvec_iter_advance(bip->bip_vec, &bip->bip_iter, bytes);
 }
 
+/**
+ * bio_integrity_rewind - Rewind integrity vector
+ * @bio:	bio whose integrity vector to update
+ * @bytes_done:	number of data bytes to rewind
+ *
+ * Description: This function calculates how many integrity bytes the
+ * number of completed data bytes correspond to and rewind the
+ * integrity vector accordingly.
+ */
+void bio_integrity_rewind(struct bio *bio, unsigned int bytes_done)
+{
+	struct bio_integrity_payload *bip = bio_integrity(bio);
+	struct blk_integrity *bi = blk_get_integrity(bio->bi_bdev->bd_disk);
+	unsigned bytes = bio_integrity_bytes(bi, bytes_done >> 9);
+
+	bip->bip_iter.bi_sector -= bio_integrity_intervals(bi, bytes_done >> 9);
+	bvec_iter_rewind(bip->bip_vec, &bip->bip_iter, bytes);
+}
+
 /**
  * bio_integrity_trim - Trim integrity vector
  * @bio:	bio whose integrity vector to update
diff --git a/block/bio.c b/block/bio.c
index b2425b8d88..bbf8aa4e62 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1329,6 +1329,32 @@ void __bio_advance(struct bio *bio, unsigned bytes)
 }
 EXPORT_SYMBOL(__bio_advance);
 
+/**
+ * bio_set_pos - restore a bio to a previous state, after having been iterated
+ * or trimmed
+ * @bio: bio to reset
+ * @pos: pos to reset it to, from bio_get_pos()
+ */
+void bio_set_pos(struct bio *bio, struct bio_pos pos)
+{
+	int delta = bio->bi_iter.bi_done - pos.bi_done;
+
+	if (delta > 0) {
+		if (bio_integrity(bio))
+			bio_integrity_rewind(bio, delta);
+		bio_crypt_rewind(bio, delta);
+		bio_rewind_iter(bio, &bio->bi_iter, delta);
+	} else {
+		bio_advance(bio, -delta);
+	}
+
+	bio->bi_iter.bi_size = pos.bi_size;
+
+	if (bio_integrity(bio))
+		bio_integrity_trim(bio);
+}
+EXPORT_SYMBOL(bio_set_pos);
+
 void bio_copy_data_iter(struct bio *dst, struct bvec_iter *dst_iter,
 			struct bio *src, struct bvec_iter *src_iter)
 {
diff --git a/block/blk-crypto-internal.h b/block/blk-crypto-internal.h
index e6818ffadd..b723599bbf 100644
--- a/block/blk-crypto-internal.h
+++ b/block/blk-crypto-internal.h
@@ -114,6 +114,13 @@ static inline void bio_crypt_advance(struct bio *bio, unsigned int bytes)
 		__bio_crypt_advance(bio, bytes);
 }
 
+void __bio_crypt_rewind(struct bio *bio, unsigned int bytes);
+static inline void bio_crypt_rewind(struct bio *bio, unsigned int bytes)
+{
+	if (bio_has_crypt_ctx(bio))
+		__bio_crypt_rewind(bio, bytes);
+}
+
 void __bio_crypt_free_ctx(struct bio *bio);
 static inline void bio_crypt_free_ctx(struct bio *bio)
 {
diff --git a/block/blk-crypto.c b/block/blk-crypto.c
index a496aaef85..e3584b5a68 100644
--- a/block/blk-crypto.c
+++ b/block/blk-crypto.c
@@ -134,6 +134,23 @@ void bio_crypt_dun_increment(u64 dun[BLK_CRYPTO_DUN_ARRAY_SIZE],
 	}
 }
 
+/* Decrements @dun by @dec, treating @dun as a multi-limb integer. */
+void bio_crypt_dun_decrement(u64 dun[BLK_CRYPTO_DUN_ARRAY_SIZE],
+			     unsigned int dec)
+{
+	int i;
+
+	for (i = 0; dec && i < BLK_CRYPTO_DUN_ARRAY_SIZE; i++) {
+		u64 prev = dun[i];
+
+		dun[i] -= dec;
+		if (dun[i] > prev)
+			dec = 1;
+		else
+			dec = 0;
+	}
+}
+
 void __bio_crypt_advance(struct bio *bio, unsigned int bytes)
 {
 	struct bio_crypt_ctx *bc = bio->bi_crypt_context;
@@ -142,6 +159,14 @@ void __bio_crypt_advance(struct bio *bio, unsigned int bytes)
 				bytes >> bc->bc_key->data_unit_size_bits);
 }
 
+void __bio_crypt_rewind(struct bio *bio, unsigned int bytes)
+{
+	struct bio_crypt_ctx *bc = bio->bi_crypt_context;
+
+	bio_crypt_dun_decrement(bc->bc_dun,
+				bytes >> bc->bc_key->data_unit_size_bits);
+}
+
 /*
  * Returns true if @bc->bc_dun plus @bytes converted to data units is equal to
  * @next_dun, treating the DUNs as multi-limb integers.
diff --git a/include/linux/bio.h b/include/linux/bio.h
index c11103a872..5fff008913 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -105,6 +105,19 @@ static inline void bio_advance_iter(const struct bio *bio,
 		/* TODO: It is reasonable to complete bio with error here. */
 }
 
+static inline void bio_rewind_iter(const struct bio *bio,
+				    struct bvec_iter *iter, unsigned int bytes)
+{
+	iter->bi_sector -= bytes >> 9;
+
+	/* No advance means no rewind */
+	if (bio_no_advance_iter(bio))
+		iter->bi_size += bytes;
+	else
+		bvec_iter_rewind(bio->bi_io_vec, iter, bytes);
+		/* TODO: It is reasonable to complete bio with error here. */
+}
+
 /* @bytes should be less or equal to bvec[i->bi_idx].bv_len */
 static inline void bio_advance_iter_single(const struct bio *bio,
 					   struct bvec_iter *iter,
@@ -133,6 +146,8 @@ void __bio_advance(struct bio *, unsigned bytes);
  */
 static inline void bio_advance(struct bio *bio, unsigned int nbytes)
 {
+	bio->bi_iter.bi_done += nbytes;
+
 	if (nbytes == bio->bi_iter.bi_size) {
 		bio->bi_iter.bi_size = 0;
 		return;
@@ -707,6 +722,7 @@ extern struct bio_integrity_payload *bio_integrity_alloc(struct bio *, gfp_t, un
 extern int bio_integrity_add_page(struct bio *, struct page *, unsigned int, unsigned int);
 extern bool bio_integrity_prep(struct bio *);
 extern void bio_integrity_advance(struct bio *, unsigned int);
+extern void bio_integrity_rewind(struct bio *, unsigned int);
 extern void bio_integrity_trim(struct bio *);
 extern int bio_integrity_clone(struct bio *, struct bio *, gfp_t);
 extern int bioset_integrity_create(struct bio_set *, int);
@@ -747,6 +763,12 @@ static inline void bio_integrity_advance(struct bio *bio,
 	return;
 }
 
+static inline void bio_integrity_rewind(struct bio *bio,
+					 unsigned int bytes_done)
+{
+	return;
+}
+
 static inline void bio_integrity_trim(struct bio *bio)
 {
 	return;
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 1973ef9bd4..eff756a96f 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -306,6 +306,25 @@ struct bio {
 	struct bio_vec		bi_inline_vecs[];
 };
 
+/*
+ * These are for saving & restoring all the iterators within a bio to a previous
+ * state
+ */
+struct bio_pos {
+	unsigned int	bi_done;
+	unsigned int	bi_size;
+};
+
+static inline struct bio_pos bio_get_pos(struct bio *bio)
+{
+	return (struct bio_pos) {
+		.bi_done	= bio->bi_iter.bi_done,
+		.bi_size	= bio->bi_iter.bi_size,
+	};
+}
+
+extern void bio_set_pos(struct bio *bio, struct bio_pos pos);
+
 #define BIO_RESET_BYTES		offsetof(struct bio, bi_max_vecs)
 #define BIO_MAX_SECTORS		(UINT_MAX >> SECTOR_SHIFT)
 
diff --git a/include/linux/bvec.h b/include/linux/bvec.h
index 35c25dff65..1606ebe1da 100644
--- a/include/linux/bvec.h
+++ b/include/linux/bvec.h
@@ -44,7 +44,8 @@ struct bvec_iter {
 
 	unsigned int            bi_bvec_done;	/* number of bytes completed in
 						   current bvec */
-} __packed;
+	unsigned int		bi_done;	/* number of bytes completed, total */
+};
 
 struct bvec_iter_all {
 	struct bio_vec	bv;
@@ -122,6 +123,39 @@ static inline bool bvec_iter_advance(const struct bio_vec *bv,
 	return true;
 }
 
+static inline bool bvec_iter_rewind(const struct bio_vec *bv,
+				     struct bvec_iter *iter,
+				     unsigned int bytes)
+{
+	int idx;
+
+	iter->bi_size += bytes;
+	if (bytes <= iter->bi_bvec_done) {
+		iter->bi_bvec_done -= bytes;
+		return true;
+	}
+
+	bytes -= iter->bi_bvec_done;
+	idx = iter->bi_idx - 1;
+
+	while (idx >= 0 && bytes && bytes > bv[idx].bv_len) {
+		bytes -= bv[idx].bv_len;
+		idx--;
+	}
+
+	if (WARN_ONCE(idx < 0 && bytes,
+		      "Attempted to rewind iter beyond bvec's boundaries\n")) {
+		iter->bi_size -= bytes;
+		iter->bi_bvec_done = 0;
+		iter->bi_idx = 0;
+		return false;
+	}
+
+	iter->bi_idx = idx;
+	iter->bi_bvec_done = bv[idx].bv_len - bytes;
+	return true;
+}
+
 /*
  * A simpler version of bvec_iter_advance(), @bytes should not span
  * across multiple bvec entries, i.e. bytes <= bv[i->bi_idx].bv_len
-- 
2.36.1


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* Re: [dm-devel] [PATCH 5.20 1/4] block: add bio_rewind() API
@ 2022-06-30  1:14                         ` Kent Overstreet
  0 siblings, 0 replies; 76+ messages in thread
From: Kent Overstreet @ 2022-06-30  1:14 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, Martin K . Petersen, Mike Snitzer, Eric Biggers,
	linux-block, dm-devel, Dmitry Monakhov

On Thu, Jun 30, 2022 at 08:47:13AM +0800, Ming Lei wrote:
> Or if I misunderstood your point, please cook a patch and I am happy to
> take a close look, and posting one very raw idea with random data
> structure looks not helpful much for this discussion technically.

Based it on your bio_rewind() patch - what do you think of this?

-- >8 --
From: Kent Overstreet <kent.overstreet@gmail.com>
Subject: [PATCH] block: add bio_(get|set)_pos()

Commit 7759eb23fd98 ("block: remove bio_rewind_iter()") removes
the similar API because the following reasons:

    ```
    It is pointed that bio_rewind_iter() is one very bad API[1]:

    1) bio size may not be restored after rewinding

    2) it causes some bogus change, such as 5151842b9d8732 (block: reset
    bi_iter.bi_done after splitting bio)

    3) rewinding really makes things complicated wrt. bio splitting

    4) unnecessary updating of .bi_done in fast path

    [1] https://marc.info/?t=153549924200005&r=1&w=2

    So this patch takes Kent's suggestion to restore one bio into its original
    state via saving bio iterator(struct bvec_iter) in bio_integrity_prep(),
    given now bio_rewind_iter() is only used by bio integrity code.
    ```

However, saving and restoring bi_iter isn't sufficient anymore, because
of integrity and now per-bio crypt context.

This patch implements the same functionality as bio_rewind(), based on a
patch by Ming, but with a different (safer!) interface.

 - bio_get_pos() gets the current state of a a bio, i.e. how far it has
   been advanced and its current (remaining) size
 - bio_set_pos() restores a bio to a previous state, advancing or
   rewinding it as needed

Co-authored-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
---
 block/bio-integrity.c       | 19 +++++++++++++++++++
 block/bio.c                 | 26 ++++++++++++++++++++++++++
 block/blk-crypto-internal.h |  7 +++++++
 block/blk-crypto.c          | 25 +++++++++++++++++++++++++
 include/linux/bio.h         | 22 ++++++++++++++++++++++
 include/linux/blk_types.h   | 19 +++++++++++++++++++
 include/linux/bvec.h        | 36 +++++++++++++++++++++++++++++++++++-
 7 files changed, 153 insertions(+), 1 deletion(-)

diff --git a/block/bio-integrity.c b/block/bio-integrity.c
index 32929c89ba..06c2fe81fd 100644
--- a/block/bio-integrity.c
+++ b/block/bio-integrity.c
@@ -378,6 +378,25 @@ void bio_integrity_advance(struct bio *bio, unsigned int bytes_done)
 	bvec_iter_advance(bip->bip_vec, &bip->bip_iter, bytes);
 }
 
+/**
+ * bio_integrity_rewind - Rewind integrity vector
+ * @bio:	bio whose integrity vector to update
+ * @bytes_done:	number of data bytes to rewind
+ *
+ * Description: This function calculates how many integrity bytes the
+ * number of completed data bytes correspond to and rewind the
+ * integrity vector accordingly.
+ */
+void bio_integrity_rewind(struct bio *bio, unsigned int bytes_done)
+{
+	struct bio_integrity_payload *bip = bio_integrity(bio);
+	struct blk_integrity *bi = blk_get_integrity(bio->bi_bdev->bd_disk);
+	unsigned bytes = bio_integrity_bytes(bi, bytes_done >> 9);
+
+	bip->bip_iter.bi_sector -= bio_integrity_intervals(bi, bytes_done >> 9);
+	bvec_iter_rewind(bip->bip_vec, &bip->bip_iter, bytes);
+}
+
 /**
  * bio_integrity_trim - Trim integrity vector
  * @bio:	bio whose integrity vector to update
diff --git a/block/bio.c b/block/bio.c
index b2425b8d88..bbf8aa4e62 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1329,6 +1329,32 @@ void __bio_advance(struct bio *bio, unsigned bytes)
 }
 EXPORT_SYMBOL(__bio_advance);
 
+/**
+ * bio_set_pos - restore a bio to a previous state, after having been iterated
+ * or trimmed
+ * @bio: bio to reset
+ * @pos: pos to reset it to, from bio_get_pos()
+ */
+void bio_set_pos(struct bio *bio, struct bio_pos pos)
+{
+	int delta = bio->bi_iter.bi_done - pos.bi_done;
+
+	if (delta > 0) {
+		if (bio_integrity(bio))
+			bio_integrity_rewind(bio, delta);
+		bio_crypt_rewind(bio, delta);
+		bio_rewind_iter(bio, &bio->bi_iter, delta);
+	} else {
+		bio_advance(bio, -delta);
+	}
+
+	bio->bi_iter.bi_size = pos.bi_size;
+
+	if (bio_integrity(bio))
+		bio_integrity_trim(bio);
+}
+EXPORT_SYMBOL(bio_set_pos);
+
 void bio_copy_data_iter(struct bio *dst, struct bvec_iter *dst_iter,
 			struct bio *src, struct bvec_iter *src_iter)
 {
diff --git a/block/blk-crypto-internal.h b/block/blk-crypto-internal.h
index e6818ffadd..b723599bbf 100644
--- a/block/blk-crypto-internal.h
+++ b/block/blk-crypto-internal.h
@@ -114,6 +114,13 @@ static inline void bio_crypt_advance(struct bio *bio, unsigned int bytes)
 		__bio_crypt_advance(bio, bytes);
 }
 
+void __bio_crypt_rewind(struct bio *bio, unsigned int bytes);
+static inline void bio_crypt_rewind(struct bio *bio, unsigned int bytes)
+{
+	if (bio_has_crypt_ctx(bio))
+		__bio_crypt_rewind(bio, bytes);
+}
+
 void __bio_crypt_free_ctx(struct bio *bio);
 static inline void bio_crypt_free_ctx(struct bio *bio)
 {
diff --git a/block/blk-crypto.c b/block/blk-crypto.c
index a496aaef85..e3584b5a68 100644
--- a/block/blk-crypto.c
+++ b/block/blk-crypto.c
@@ -134,6 +134,23 @@ void bio_crypt_dun_increment(u64 dun[BLK_CRYPTO_DUN_ARRAY_SIZE],
 	}
 }
 
+/* Decrements @dun by @dec, treating @dun as a multi-limb integer. */
+void bio_crypt_dun_decrement(u64 dun[BLK_CRYPTO_DUN_ARRAY_SIZE],
+			     unsigned int dec)
+{
+	int i;
+
+	for (i = 0; dec && i < BLK_CRYPTO_DUN_ARRAY_SIZE; i++) {
+		u64 prev = dun[i];
+
+		dun[i] -= dec;
+		if (dun[i] > prev)
+			dec = 1;
+		else
+			dec = 0;
+	}
+}
+
 void __bio_crypt_advance(struct bio *bio, unsigned int bytes)
 {
 	struct bio_crypt_ctx *bc = bio->bi_crypt_context;
@@ -142,6 +159,14 @@ void __bio_crypt_advance(struct bio *bio, unsigned int bytes)
 				bytes >> bc->bc_key->data_unit_size_bits);
 }
 
+void __bio_crypt_rewind(struct bio *bio, unsigned int bytes)
+{
+	struct bio_crypt_ctx *bc = bio->bi_crypt_context;
+
+	bio_crypt_dun_decrement(bc->bc_dun,
+				bytes >> bc->bc_key->data_unit_size_bits);
+}
+
 /*
  * Returns true if @bc->bc_dun plus @bytes converted to data units is equal to
  * @next_dun, treating the DUNs as multi-limb integers.
diff --git a/include/linux/bio.h b/include/linux/bio.h
index c11103a872..5fff008913 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -105,6 +105,19 @@ static inline void bio_advance_iter(const struct bio *bio,
 		/* TODO: It is reasonable to complete bio with error here. */
 }
 
+static inline void bio_rewind_iter(const struct bio *bio,
+				    struct bvec_iter *iter, unsigned int bytes)
+{
+	iter->bi_sector -= bytes >> 9;
+
+	/* No advance means no rewind */
+	if (bio_no_advance_iter(bio))
+		iter->bi_size += bytes;
+	else
+		bvec_iter_rewind(bio->bi_io_vec, iter, bytes);
+		/* TODO: It is reasonable to complete bio with error here. */
+}
+
 /* @bytes should be less or equal to bvec[i->bi_idx].bv_len */
 static inline void bio_advance_iter_single(const struct bio *bio,
 					   struct bvec_iter *iter,
@@ -133,6 +146,8 @@ void __bio_advance(struct bio *, unsigned bytes);
  */
 static inline void bio_advance(struct bio *bio, unsigned int nbytes)
 {
+	bio->bi_iter.bi_done += nbytes;
+
 	if (nbytes == bio->bi_iter.bi_size) {
 		bio->bi_iter.bi_size = 0;
 		return;
@@ -707,6 +722,7 @@ extern struct bio_integrity_payload *bio_integrity_alloc(struct bio *, gfp_t, un
 extern int bio_integrity_add_page(struct bio *, struct page *, unsigned int, unsigned int);
 extern bool bio_integrity_prep(struct bio *);
 extern void bio_integrity_advance(struct bio *, unsigned int);
+extern void bio_integrity_rewind(struct bio *, unsigned int);
 extern void bio_integrity_trim(struct bio *);
 extern int bio_integrity_clone(struct bio *, struct bio *, gfp_t);
 extern int bioset_integrity_create(struct bio_set *, int);
@@ -747,6 +763,12 @@ static inline void bio_integrity_advance(struct bio *bio,
 	return;
 }
 
+static inline void bio_integrity_rewind(struct bio *bio,
+					 unsigned int bytes_done)
+{
+	return;
+}
+
 static inline void bio_integrity_trim(struct bio *bio)
 {
 	return;
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 1973ef9bd4..eff756a96f 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -306,6 +306,25 @@ struct bio {
 	struct bio_vec		bi_inline_vecs[];
 };
 
+/*
+ * These are for saving & restoring all the iterators within a bio to a previous
+ * state
+ */
+struct bio_pos {
+	unsigned int	bi_done;
+	unsigned int	bi_size;
+};
+
+static inline struct bio_pos bio_get_pos(struct bio *bio)
+{
+	return (struct bio_pos) {
+		.bi_done	= bio->bi_iter.bi_done,
+		.bi_size	= bio->bi_iter.bi_size,
+	};
+}
+
+extern void bio_set_pos(struct bio *bio, struct bio_pos pos);
+
 #define BIO_RESET_BYTES		offsetof(struct bio, bi_max_vecs)
 #define BIO_MAX_SECTORS		(UINT_MAX >> SECTOR_SHIFT)
 
diff --git a/include/linux/bvec.h b/include/linux/bvec.h
index 35c25dff65..1606ebe1da 100644
--- a/include/linux/bvec.h
+++ b/include/linux/bvec.h
@@ -44,7 +44,8 @@ struct bvec_iter {
 
 	unsigned int            bi_bvec_done;	/* number of bytes completed in
 						   current bvec */
-} __packed;
+	unsigned int		bi_done;	/* number of bytes completed, total */
+};
 
 struct bvec_iter_all {
 	struct bio_vec	bv;
@@ -122,6 +123,39 @@ static inline bool bvec_iter_advance(const struct bio_vec *bv,
 	return true;
 }
 
+static inline bool bvec_iter_rewind(const struct bio_vec *bv,
+				     struct bvec_iter *iter,
+				     unsigned int bytes)
+{
+	int idx;
+
+	iter->bi_size += bytes;
+	if (bytes <= iter->bi_bvec_done) {
+		iter->bi_bvec_done -= bytes;
+		return true;
+	}
+
+	bytes -= iter->bi_bvec_done;
+	idx = iter->bi_idx - 1;
+
+	while (idx >= 0 && bytes && bytes > bv[idx].bv_len) {
+		bytes -= bv[idx].bv_len;
+		idx--;
+	}
+
+	if (WARN_ONCE(idx < 0 && bytes,
+		      "Attempted to rewind iter beyond bvec's boundaries\n")) {
+		iter->bi_size -= bytes;
+		iter->bi_bvec_done = 0;
+		iter->bi_idx = 0;
+		return false;
+	}
+
+	iter->bi_idx = idx;
+	iter->bi_bvec_done = bv[idx].bv_len - bytes;
+	return true;
+}
+
 /*
  * A simpler version of bvec_iter_advance(), @bytes should not span
  * across multiple bvec entries, i.e. bytes <= bv[i->bi_idx].bv_len
-- 
2.36.1

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* Re: [dm-devel] [PATCH 5.20 1/4] block: add bio_rewind() API
  2022-06-30  1:14                         ` [dm-devel] " Kent Overstreet
@ 2022-07-01  3:58                           ` Ming Lei
  -1 siblings, 0 replies; 76+ messages in thread
From: Ming Lei @ 2022-07-01  3:58 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Jens Axboe, Martin K . Petersen, Mike Snitzer, Eric Biggers,
	linux-block, dm-devel, Dmitry Monakhov

On Wed, Jun 29, 2022 at 09:14:54PM -0400, Kent Overstreet wrote:
> On Thu, Jun 30, 2022 at 08:47:13AM +0800, Ming Lei wrote:
> > Or if I misunderstood your point, please cook a patch and I am happy to
> > take a close look, and posting one very raw idea with random data
> > structure looks not helpful much for this discussion technically.
> 
> Based it on your bio_rewind() patch - what do you think of this?
> 
> -- >8 --
> From: Kent Overstreet <kent.overstreet@gmail.com>
> Subject: [PATCH] block: add bio_(get|set)_pos()
> 
> Commit 7759eb23fd98 ("block: remove bio_rewind_iter()") removes
> the similar API because the following reasons:
> 
>     ```
>     It is pointed that bio_rewind_iter() is one very bad API[1]:
> 
>     1) bio size may not be restored after rewinding
> 
>     2) it causes some bogus change, such as 5151842b9d8732 (block: reset
>     bi_iter.bi_done after splitting bio)
> 
>     3) rewinding really makes things complicated wrt. bio splitting
> 
>     4) unnecessary updating of .bi_done in fast path
> 
>     [1] https://marc.info/?t=153549924200005&r=1&w=2
> 
>     So this patch takes Kent's suggestion to restore one bio into its original
>     state via saving bio iterator(struct bvec_iter) in bio_integrity_prep(),
>     given now bio_rewind_iter() is only used by bio integrity code.
>     ```
> 
> However, saving and restoring bi_iter isn't sufficient anymore, because
> of integrity and now per-bio crypt context.
> 
> This patch implements the same functionality as bio_rewind(), based on a
> patch by Ming, but with a different (safer!) interface.
> 
>  - bio_get_pos() gets the current state of a a bio, i.e. how far it has
>    been advanced and its current (remaining) size
>  - bio_set_pos() restores a bio to a previous state, advancing or
>    rewinding it as needed
> 
> Co-authored-by: Ming Lei <ming.lei@redhat.com>
> Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
> ---
>  block/bio-integrity.c       | 19 +++++++++++++++++++
>  block/bio.c                 | 26 ++++++++++++++++++++++++++
>  block/blk-crypto-internal.h |  7 +++++++
>  block/blk-crypto.c          | 25 +++++++++++++++++++++++++
>  include/linux/bio.h         | 22 ++++++++++++++++++++++
>  include/linux/blk_types.h   | 19 +++++++++++++++++++
>  include/linux/bvec.h        | 36 +++++++++++++++++++++++++++++++++++-
>  7 files changed, 153 insertions(+), 1 deletion(-)
> 
> diff --git a/block/bio-integrity.c b/block/bio-integrity.c
> index 32929c89ba..06c2fe81fd 100644
> --- a/block/bio-integrity.c
> +++ b/block/bio-integrity.c
> @@ -378,6 +378,25 @@ void bio_integrity_advance(struct bio *bio, unsigned int bytes_done)
>  	bvec_iter_advance(bip->bip_vec, &bip->bip_iter, bytes);
>  }
>  
> +/**
> + * bio_integrity_rewind - Rewind integrity vector
> + * @bio:	bio whose integrity vector to update
> + * @bytes_done:	number of data bytes to rewind
> + *
> + * Description: This function calculates how many integrity bytes the
> + * number of completed data bytes correspond to and rewind the
> + * integrity vector accordingly.
> + */
> +void bio_integrity_rewind(struct bio *bio, unsigned int bytes_done)
> +{
> +	struct bio_integrity_payload *bip = bio_integrity(bio);
> +	struct blk_integrity *bi = blk_get_integrity(bio->bi_bdev->bd_disk);
> +	unsigned bytes = bio_integrity_bytes(bi, bytes_done >> 9);
> +
> +	bip->bip_iter.bi_sector -= bio_integrity_intervals(bi, bytes_done >> 9);
> +	bvec_iter_rewind(bip->bip_vec, &bip->bip_iter, bytes);
> +}
> +
>  /**
>   * bio_integrity_trim - Trim integrity vector
>   * @bio:	bio whose integrity vector to update
> diff --git a/block/bio.c b/block/bio.c
> index b2425b8d88..bbf8aa4e62 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -1329,6 +1329,32 @@ void __bio_advance(struct bio *bio, unsigned bytes)
>  }
>  EXPORT_SYMBOL(__bio_advance);
>  
> +/**
> + * bio_set_pos - restore a bio to a previous state, after having been iterated
> + * or trimmed
> + * @bio: bio to reset
> + * @pos: pos to reset it to, from bio_get_pos()
> + */
> +void bio_set_pos(struct bio *bio, struct bio_pos pos)
> +{
> +	int delta = bio->bi_iter.bi_done - pos.bi_done;
> +
> +	if (delta > 0) {
> +		if (bio_integrity(bio))
> +			bio_integrity_rewind(bio, delta);
> +		bio_crypt_rewind(bio, delta);
> +		bio_rewind_iter(bio, &bio->bi_iter, delta);
> +	} else {
> +		bio_advance(bio, -delta);
> +	}
> +
> +	bio->bi_iter.bi_size = pos.bi_size;
> +
> +	if (bio_integrity(bio))
> +		bio_integrity_trim(bio);
> +}
> +EXPORT_SYMBOL(bio_set_pos);
> +
>  void bio_copy_data_iter(struct bio *dst, struct bvec_iter *dst_iter,
>  			struct bio *src, struct bvec_iter *src_iter)
>  {
> diff --git a/block/blk-crypto-internal.h b/block/blk-crypto-internal.h
> index e6818ffadd..b723599bbf 100644
> --- a/block/blk-crypto-internal.h
> +++ b/block/blk-crypto-internal.h
> @@ -114,6 +114,13 @@ static inline void bio_crypt_advance(struct bio *bio, unsigned int bytes)
>  		__bio_crypt_advance(bio, bytes);
>  }
>  
> +void __bio_crypt_rewind(struct bio *bio, unsigned int bytes);
> +static inline void bio_crypt_rewind(struct bio *bio, unsigned int bytes)
> +{
> +	if (bio_has_crypt_ctx(bio))
> +		__bio_crypt_rewind(bio, bytes);
> +}
> +
>  void __bio_crypt_free_ctx(struct bio *bio);
>  static inline void bio_crypt_free_ctx(struct bio *bio)
>  {
> diff --git a/block/blk-crypto.c b/block/blk-crypto.c
> index a496aaef85..e3584b5a68 100644
> --- a/block/blk-crypto.c
> +++ b/block/blk-crypto.c
> @@ -134,6 +134,23 @@ void bio_crypt_dun_increment(u64 dun[BLK_CRYPTO_DUN_ARRAY_SIZE],
>  	}
>  }
>  
> +/* Decrements @dun by @dec, treating @dun as a multi-limb integer. */
> +void bio_crypt_dun_decrement(u64 dun[BLK_CRYPTO_DUN_ARRAY_SIZE],
> +			     unsigned int dec)
> +{
> +	int i;
> +
> +	for (i = 0; dec && i < BLK_CRYPTO_DUN_ARRAY_SIZE; i++) {
> +		u64 prev = dun[i];
> +
> +		dun[i] -= dec;
> +		if (dun[i] > prev)
> +			dec = 1;
> +		else
> +			dec = 0;
> +	}
> +}
> +
>  void __bio_crypt_advance(struct bio *bio, unsigned int bytes)
>  {
>  	struct bio_crypt_ctx *bc = bio->bi_crypt_context;
> @@ -142,6 +159,14 @@ void __bio_crypt_advance(struct bio *bio, unsigned int bytes)
>  				bytes >> bc->bc_key->data_unit_size_bits);
>  }
>  
> +void __bio_crypt_rewind(struct bio *bio, unsigned int bytes)
> +{
> +	struct bio_crypt_ctx *bc = bio->bi_crypt_context;
> +
> +	bio_crypt_dun_decrement(bc->bc_dun,
> +				bytes >> bc->bc_key->data_unit_size_bits);
> +}
> +
>  /*
>   * Returns true if @bc->bc_dun plus @bytes converted to data units is equal to
>   * @next_dun, treating the DUNs as multi-limb integers.
> diff --git a/include/linux/bio.h b/include/linux/bio.h
> index c11103a872..5fff008913 100644
> --- a/include/linux/bio.h
> +++ b/include/linux/bio.h
> @@ -105,6 +105,19 @@ static inline void bio_advance_iter(const struct bio *bio,
>  		/* TODO: It is reasonable to complete bio with error here. */
>  }
>  
> +static inline void bio_rewind_iter(const struct bio *bio,
> +				    struct bvec_iter *iter, unsigned int bytes)
> +{
> +	iter->bi_sector -= bytes >> 9;
> +
> +	/* No advance means no rewind */
> +	if (bio_no_advance_iter(bio))
> +		iter->bi_size += bytes;
> +	else
> +		bvec_iter_rewind(bio->bi_io_vec, iter, bytes);
> +		/* TODO: It is reasonable to complete bio with error here. */
> +}
> +
>  /* @bytes should be less or equal to bvec[i->bi_idx].bv_len */
>  static inline void bio_advance_iter_single(const struct bio *bio,
>  					   struct bvec_iter *iter,
> @@ -133,6 +146,8 @@ void __bio_advance(struct bio *, unsigned bytes);
>   */
>  static inline void bio_advance(struct bio *bio, unsigned int nbytes)
>  {
> +	bio->bi_iter.bi_done += nbytes;

Why do we need to pay the cost(4 bytes added to bio and the updating
in absolutely fast path) if rewind isn't used? So far, only dm requeue
needs it, and it is one very unusual event.

Assuming fixed bio end sector should cover most of cases, especially
if bio_rewind is only for dm or driver.

I'd suggest to not take this way until turning out bio_rewind() is not
enough for new requirement or usages.

Thanks 
Ming
--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 5.20 1/4] block: add bio_rewind() API
@ 2022-07-01  3:58                           ` Ming Lei
  0 siblings, 0 replies; 76+ messages in thread
From: Ming Lei @ 2022-07-01  3:58 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Mike Snitzer, Jens Axboe, linux-block, dm-devel, Eric Biggers,
	Dmitry Monakhov, Martin K . Petersen

On Wed, Jun 29, 2022 at 09:14:54PM -0400, Kent Overstreet wrote:
> On Thu, Jun 30, 2022 at 08:47:13AM +0800, Ming Lei wrote:
> > Or if I misunderstood your point, please cook a patch and I am happy to
> > take a close look, and posting one very raw idea with random data
> > structure looks not helpful much for this discussion technically.
> 
> Based it on your bio_rewind() patch - what do you think of this?
> 
> -- >8 --
> From: Kent Overstreet <kent.overstreet@gmail.com>
> Subject: [PATCH] block: add bio_(get|set)_pos()
> 
> Commit 7759eb23fd98 ("block: remove bio_rewind_iter()") removes
> the similar API because the following reasons:
> 
>     ```
>     It is pointed that bio_rewind_iter() is one very bad API[1]:
> 
>     1) bio size may not be restored after rewinding
> 
>     2) it causes some bogus change, such as 5151842b9d8732 (block: reset
>     bi_iter.bi_done after splitting bio)
> 
>     3) rewinding really makes things complicated wrt. bio splitting
> 
>     4) unnecessary updating of .bi_done in fast path
> 
>     [1] https://marc.info/?t=153549924200005&r=1&w=2
> 
>     So this patch takes Kent's suggestion to restore one bio into its original
>     state via saving bio iterator(struct bvec_iter) in bio_integrity_prep(),
>     given now bio_rewind_iter() is only used by bio integrity code.
>     ```
> 
> However, saving and restoring bi_iter isn't sufficient anymore, because
> of integrity and now per-bio crypt context.
> 
> This patch implements the same functionality as bio_rewind(), based on a
> patch by Ming, but with a different (safer!) interface.
> 
>  - bio_get_pos() gets the current state of a a bio, i.e. how far it has
>    been advanced and its current (remaining) size
>  - bio_set_pos() restores a bio to a previous state, advancing or
>    rewinding it as needed
> 
> Co-authored-by: Ming Lei <ming.lei@redhat.com>
> Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
> ---
>  block/bio-integrity.c       | 19 +++++++++++++++++++
>  block/bio.c                 | 26 ++++++++++++++++++++++++++
>  block/blk-crypto-internal.h |  7 +++++++
>  block/blk-crypto.c          | 25 +++++++++++++++++++++++++
>  include/linux/bio.h         | 22 ++++++++++++++++++++++
>  include/linux/blk_types.h   | 19 +++++++++++++++++++
>  include/linux/bvec.h        | 36 +++++++++++++++++++++++++++++++++++-
>  7 files changed, 153 insertions(+), 1 deletion(-)
> 
> diff --git a/block/bio-integrity.c b/block/bio-integrity.c
> index 32929c89ba..06c2fe81fd 100644
> --- a/block/bio-integrity.c
> +++ b/block/bio-integrity.c
> @@ -378,6 +378,25 @@ void bio_integrity_advance(struct bio *bio, unsigned int bytes_done)
>  	bvec_iter_advance(bip->bip_vec, &bip->bip_iter, bytes);
>  }
>  
> +/**
> + * bio_integrity_rewind - Rewind integrity vector
> + * @bio:	bio whose integrity vector to update
> + * @bytes_done:	number of data bytes to rewind
> + *
> + * Description: This function calculates how many integrity bytes the
> + * number of completed data bytes correspond to and rewind the
> + * integrity vector accordingly.
> + */
> +void bio_integrity_rewind(struct bio *bio, unsigned int bytes_done)
> +{
> +	struct bio_integrity_payload *bip = bio_integrity(bio);
> +	struct blk_integrity *bi = blk_get_integrity(bio->bi_bdev->bd_disk);
> +	unsigned bytes = bio_integrity_bytes(bi, bytes_done >> 9);
> +
> +	bip->bip_iter.bi_sector -= bio_integrity_intervals(bi, bytes_done >> 9);
> +	bvec_iter_rewind(bip->bip_vec, &bip->bip_iter, bytes);
> +}
> +
>  /**
>   * bio_integrity_trim - Trim integrity vector
>   * @bio:	bio whose integrity vector to update
> diff --git a/block/bio.c b/block/bio.c
> index b2425b8d88..bbf8aa4e62 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -1329,6 +1329,32 @@ void __bio_advance(struct bio *bio, unsigned bytes)
>  }
>  EXPORT_SYMBOL(__bio_advance);
>  
> +/**
> + * bio_set_pos - restore a bio to a previous state, after having been iterated
> + * or trimmed
> + * @bio: bio to reset
> + * @pos: pos to reset it to, from bio_get_pos()
> + */
> +void bio_set_pos(struct bio *bio, struct bio_pos pos)
> +{
> +	int delta = bio->bi_iter.bi_done - pos.bi_done;
> +
> +	if (delta > 0) {
> +		if (bio_integrity(bio))
> +			bio_integrity_rewind(bio, delta);
> +		bio_crypt_rewind(bio, delta);
> +		bio_rewind_iter(bio, &bio->bi_iter, delta);
> +	} else {
> +		bio_advance(bio, -delta);
> +	}
> +
> +	bio->bi_iter.bi_size = pos.bi_size;
> +
> +	if (bio_integrity(bio))
> +		bio_integrity_trim(bio);
> +}
> +EXPORT_SYMBOL(bio_set_pos);
> +
>  void bio_copy_data_iter(struct bio *dst, struct bvec_iter *dst_iter,
>  			struct bio *src, struct bvec_iter *src_iter)
>  {
> diff --git a/block/blk-crypto-internal.h b/block/blk-crypto-internal.h
> index e6818ffadd..b723599bbf 100644
> --- a/block/blk-crypto-internal.h
> +++ b/block/blk-crypto-internal.h
> @@ -114,6 +114,13 @@ static inline void bio_crypt_advance(struct bio *bio, unsigned int bytes)
>  		__bio_crypt_advance(bio, bytes);
>  }
>  
> +void __bio_crypt_rewind(struct bio *bio, unsigned int bytes);
> +static inline void bio_crypt_rewind(struct bio *bio, unsigned int bytes)
> +{
> +	if (bio_has_crypt_ctx(bio))
> +		__bio_crypt_rewind(bio, bytes);
> +}
> +
>  void __bio_crypt_free_ctx(struct bio *bio);
>  static inline void bio_crypt_free_ctx(struct bio *bio)
>  {
> diff --git a/block/blk-crypto.c b/block/blk-crypto.c
> index a496aaef85..e3584b5a68 100644
> --- a/block/blk-crypto.c
> +++ b/block/blk-crypto.c
> @@ -134,6 +134,23 @@ void bio_crypt_dun_increment(u64 dun[BLK_CRYPTO_DUN_ARRAY_SIZE],
>  	}
>  }
>  
> +/* Decrements @dun by @dec, treating @dun as a multi-limb integer. */
> +void bio_crypt_dun_decrement(u64 dun[BLK_CRYPTO_DUN_ARRAY_SIZE],
> +			     unsigned int dec)
> +{
> +	int i;
> +
> +	for (i = 0; dec && i < BLK_CRYPTO_DUN_ARRAY_SIZE; i++) {
> +		u64 prev = dun[i];
> +
> +		dun[i] -= dec;
> +		if (dun[i] > prev)
> +			dec = 1;
> +		else
> +			dec = 0;
> +	}
> +}
> +
>  void __bio_crypt_advance(struct bio *bio, unsigned int bytes)
>  {
>  	struct bio_crypt_ctx *bc = bio->bi_crypt_context;
> @@ -142,6 +159,14 @@ void __bio_crypt_advance(struct bio *bio, unsigned int bytes)
>  				bytes >> bc->bc_key->data_unit_size_bits);
>  }
>  
> +void __bio_crypt_rewind(struct bio *bio, unsigned int bytes)
> +{
> +	struct bio_crypt_ctx *bc = bio->bi_crypt_context;
> +
> +	bio_crypt_dun_decrement(bc->bc_dun,
> +				bytes >> bc->bc_key->data_unit_size_bits);
> +}
> +
>  /*
>   * Returns true if @bc->bc_dun plus @bytes converted to data units is equal to
>   * @next_dun, treating the DUNs as multi-limb integers.
> diff --git a/include/linux/bio.h b/include/linux/bio.h
> index c11103a872..5fff008913 100644
> --- a/include/linux/bio.h
> +++ b/include/linux/bio.h
> @@ -105,6 +105,19 @@ static inline void bio_advance_iter(const struct bio *bio,
>  		/* TODO: It is reasonable to complete bio with error here. */
>  }
>  
> +static inline void bio_rewind_iter(const struct bio *bio,
> +				    struct bvec_iter *iter, unsigned int bytes)
> +{
> +	iter->bi_sector -= bytes >> 9;
> +
> +	/* No advance means no rewind */
> +	if (bio_no_advance_iter(bio))
> +		iter->bi_size += bytes;
> +	else
> +		bvec_iter_rewind(bio->bi_io_vec, iter, bytes);
> +		/* TODO: It is reasonable to complete bio with error here. */
> +}
> +
>  /* @bytes should be less or equal to bvec[i->bi_idx].bv_len */
>  static inline void bio_advance_iter_single(const struct bio *bio,
>  					   struct bvec_iter *iter,
> @@ -133,6 +146,8 @@ void __bio_advance(struct bio *, unsigned bytes);
>   */
>  static inline void bio_advance(struct bio *bio, unsigned int nbytes)
>  {
> +	bio->bi_iter.bi_done += nbytes;

Why do we need to pay the cost(4 bytes added to bio and the updating
in absolutely fast path) if rewind isn't used? So far, only dm requeue
needs it, and it is one very unusual event.

Assuming fixed bio end sector should cover most of cases, especially
if bio_rewind is only for dm or driver.

I'd suggest to not take this way until turning out bio_rewind() is not
enough for new requirement or usages.

Thanks 
Ming


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 5.20 1/4] block: add bio_rewind() API
  2022-07-01  3:58                           ` Ming Lei
@ 2022-07-01 21:09                             ` Kent Overstreet
  -1 siblings, 0 replies; 76+ messages in thread
From: Kent Overstreet @ 2022-07-01 21:09 UTC (permalink / raw)
  To: Ming Lei
  Cc: Mike Snitzer, Jens Axboe, linux-block, dm-devel, Eric Biggers,
	Dmitry Monakhov, Martin K . Petersen

On Fri, Jul 01, 2022 at 11:58:35AM +0800, Ming Lei wrote:
> Why do we need to pay the cost(4 bytes added to bio and the updating
> in absolutely fast path) if rewind isn't used? So far, only dm requeue
> needs it, and it is one very unusual event.
> 
> Assuming fixed bio end sector should cover most of cases, especially
> if bio_rewind is only for dm or driver.
> 
> I'd suggest to not take this way until turning out bio_rewind() is not
> enough for new requirement or usages.

Well, you're proposing add this to the block layer, not just dm, so we should be
looking at potential users and not just this one use case.

Check out block/blk-crypto-fallback.c -> blk_crypto_fallback_decrypt_bio: it's
using __bio_for_each_segment, which is how I found it. It's also using a
bio_crypt_ctx they've stashed in their per-bio bio_fallback_crypt_ctx, in
blk_crypto_fallback_bio_prep, in addition to saving a copy of bio->bi_iter.

bio_crypt_ctx is 40 bytes, struct bvec_iter is currently 20 bytes, so we'd be
replacing 60 bytes of per-bio state with 8 bytes, if we went with bio_pos. And
this code path is used by block/blk-crypto.c as well, not just
blk-crypto-fallback.c.

drivers/md/dm-integrity.c: Again, we're saving a bvec_iter (using
dm_bio_record()) so that after the bio completes we can walk through the bio as
it was when we submitted it. Here, we're also interested in bi_integrity, but
what dm_bio_record() is doing looks highly suspect - we're only saving a copy of
the bio->bi_integrity pointer, but bio_integrity_payload contains another
bvec_iter and that's probably what should be getting saved.

So for this code, switching bio_(get|set)_pos would likely either be eliminating
the need for a tricky workaround I haven't spotted yet, or perhaps fixing an
actual bug - and it'd be saving 20 bytes of state in dm_bio_details.

Side note: according to the comment in dm_bio_details, what that code is trying
to do is actually something rather interesting - fully restore the state of a
bio so it can be resubmitted to another device after an error. This is probably
something that's worth promoting to block/bio.c, so it can be kept in sync with
e.g. bio_init() and bio_reset, and because we have other code in the kernel
that's doing similar stuff and might want to make use of it if it was standard
(btrfs, bcachefs, md-multipath, perhaps aoe).

drivers/block/aoe/aoecmd.c: here we're also saving bvec_iter, but this code is
doing a different trick that it's able to do because it's not submitting the bio
to other block drivers, the terminal handling is in its own code. It's not
modifying bio->bi_iter at all, which is neat because it makes resubmitting after
an error trivial.

This code is smart, and I would consider it a vote in favor of the bio_iter
approach (but see below, I'm not actually advocating we do that). I do something
similar in bcachefs, my read retry path: the original bio may have been
fragmented, and a retry may be for only a fragment of an original bio. Having a
separate bvec_iter means that the retry path can submit a retry for only a
fragment of the original bio, without having to mutate it or race with other
threads that are doing things with other parts of the original bio. It's not
strictly necessary functionality - I could've also solved this by not freeing
the fragment and retrying that - but it kept things simpler in other ways (the
retry may itself end up being fragmented if the extents we're reading from
changed, which means in the normal read path fragment bch_read_bios only ever
point to toplevel bch_read_bios, but retrying fragment bch_read_bios would mean
when doing retries we'd have fragments pointing to fragments pointing to
fragments, and memory allocations bounded only by the maximum number of retries
we could do).

So the struct bio_iter approach - segregating all the things we mutate when
iterating over a bio into a sub-type, i.e. what I was originally doing with
bvec_iter - has some nice and useful properties, but OTOH with integrity and now
bio_crypt_ctx it doesn't look super practical anymore - bio_iter would always
have to contain whatever it is we need for iterating over crypt and integrity if
those features are #ifdef'd in, whereas now an individual bio only pays the
memory overhead for those features if they're in use.

OTOH, looking at actual code, bio_(get|set)_pos gets us 90% of the elegance of a
separate iterator type, and it works for what all the code I've looked at is
trying to do (except for bcachefs, but I have no interest in blk crypt or
integrity so I'm fine with the current situation).

That's just what I found with a bit of perusing - there's other things I'd want
to look at if I wanted to be thorough (anything with multipath in the name for
sure, I haven't looked at all at the nvme multipath code yet).

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [dm-devel] [PATCH 5.20 1/4] block: add bio_rewind() API
@ 2022-07-01 21:09                             ` Kent Overstreet
  0 siblings, 0 replies; 76+ messages in thread
From: Kent Overstreet @ 2022-07-01 21:09 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, Martin K . Petersen, Mike Snitzer, Eric Biggers,
	linux-block, dm-devel, Dmitry Monakhov

On Fri, Jul 01, 2022 at 11:58:35AM +0800, Ming Lei wrote:
> Why do we need to pay the cost(4 bytes added to bio and the updating
> in absolutely fast path) if rewind isn't used? So far, only dm requeue
> needs it, and it is one very unusual event.
> 
> Assuming fixed bio end sector should cover most of cases, especially
> if bio_rewind is only for dm or driver.
> 
> I'd suggest to not take this way until turning out bio_rewind() is not
> enough for new requirement or usages.

Well, you're proposing add this to the block layer, not just dm, so we should be
looking at potential users and not just this one use case.

Check out block/blk-crypto-fallback.c -> blk_crypto_fallback_decrypt_bio: it's
using __bio_for_each_segment, which is how I found it. It's also using a
bio_crypt_ctx they've stashed in their per-bio bio_fallback_crypt_ctx, in
blk_crypto_fallback_bio_prep, in addition to saving a copy of bio->bi_iter.

bio_crypt_ctx is 40 bytes, struct bvec_iter is currently 20 bytes, so we'd be
replacing 60 bytes of per-bio state with 8 bytes, if we went with bio_pos. And
this code path is used by block/blk-crypto.c as well, not just
blk-crypto-fallback.c.

drivers/md/dm-integrity.c: Again, we're saving a bvec_iter (using
dm_bio_record()) so that after the bio completes we can walk through the bio as
it was when we submitted it. Here, we're also interested in bi_integrity, but
what dm_bio_record() is doing looks highly suspect - we're only saving a copy of
the bio->bi_integrity pointer, but bio_integrity_payload contains another
bvec_iter and that's probably what should be getting saved.

So for this code, switching bio_(get|set)_pos would likely either be eliminating
the need for a tricky workaround I haven't spotted yet, or perhaps fixing an
actual bug - and it'd be saving 20 bytes of state in dm_bio_details.

Side note: according to the comment in dm_bio_details, what that code is trying
to do is actually something rather interesting - fully restore the state of a
bio so it can be resubmitted to another device after an error. This is probably
something that's worth promoting to block/bio.c, so it can be kept in sync with
e.g. bio_init() and bio_reset, and because we have other code in the kernel
that's doing similar stuff and might want to make use of it if it was standard
(btrfs, bcachefs, md-multipath, perhaps aoe).

drivers/block/aoe/aoecmd.c: here we're also saving bvec_iter, but this code is
doing a different trick that it's able to do because it's not submitting the bio
to other block drivers, the terminal handling is in its own code. It's not
modifying bio->bi_iter at all, which is neat because it makes resubmitting after
an error trivial.

This code is smart, and I would consider it a vote in favor of the bio_iter
approach (but see below, I'm not actually advocating we do that). I do something
similar in bcachefs, my read retry path: the original bio may have been
fragmented, and a retry may be for only a fragment of an original bio. Having a
separate bvec_iter means that the retry path can submit a retry for only a
fragment of the original bio, without having to mutate it or race with other
threads that are doing things with other parts of the original bio. It's not
strictly necessary functionality - I could've also solved this by not freeing
the fragment and retrying that - but it kept things simpler in other ways (the
retry may itself end up being fragmented if the extents we're reading from
changed, which means in the normal read path fragment bch_read_bios only ever
point to toplevel bch_read_bios, but retrying fragment bch_read_bios would mean
when doing retries we'd have fragments pointing to fragments pointing to
fragments, and memory allocations bounded only by the maximum number of retries
we could do).

So the struct bio_iter approach - segregating all the things we mutate when
iterating over a bio into a sub-type, i.e. what I was originally doing with
bvec_iter - has some nice and useful properties, but OTOH with integrity and now
bio_crypt_ctx it doesn't look super practical anymore - bio_iter would always
have to contain whatever it is we need for iterating over crypt and integrity if
those features are #ifdef'd in, whereas now an individual bio only pays the
memory overhead for those features if they're in use.

OTOH, looking at actual code, bio_(get|set)_pos gets us 90% of the elegance of a
separate iterator type, and it works for what all the code I've looked at is
trying to do (except for bcachefs, but I have no interest in blk crypt or
integrity so I'm fine with the current situation).

That's just what I found with a bit of perusing - there's other things I'd want
to look at if I wanted to be thorough (anything with multipath in the name for
sure, I haven't looked at all at the nvme multipath code yet).

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply	[flat|nested] 76+ messages in thread

end of thread, other threads:[~2022-07-06 15:03 UTC | newest]

Thread overview: 76+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-06-24 14:12 [PATCH 5.20 0/4] block/dm: add bio_rewind for improving dm requeue Ming Lei
2022-06-24 14:12 ` [dm-devel] " Ming Lei
2022-06-24 14:12 ` [PATCH 5.20 1/4] block: add bio_rewind() API Ming Lei
2022-06-24 14:12   ` [dm-devel] " Ming Lei
2022-06-26 20:14   ` Kent Overstreet
2022-06-26 20:14     ` [dm-devel] " Kent Overstreet
2022-06-27  7:36     ` Ming Lei
2022-06-27  7:36       ` [dm-devel] " Ming Lei
2022-06-28  4:20       ` Kent Overstreet
2022-06-28  4:20         ` [dm-devel] " Kent Overstreet
2022-06-28  7:42         ` Ming Lei
2022-06-28  7:42           ` [dm-devel] " Ming Lei
2022-06-28 16:16           ` Kent Overstreet
2022-06-28 16:16             ` [dm-devel] " Kent Overstreet
2022-06-28 18:13         ` Jens Axboe
2022-06-28 18:13           ` [dm-devel] " Jens Axboe
2022-06-28 18:32           ` Kent Overstreet
2022-06-28 18:32             ` [dm-devel] " Kent Overstreet
2022-06-29 17:16             ` Jens Axboe
2022-06-29 17:16               ` [dm-devel] " Jens Axboe
2022-06-29 18:40               ` Kent Overstreet
2022-06-29 18:40                 ` [dm-devel] " Kent Overstreet
2022-06-29 18:51                 ` Bart Van Assche
2022-06-29 18:51                   ` [dm-devel] " Bart Van Assche
2022-06-29 19:05                   ` Kent Overstreet
2022-06-29 19:05                     ` [dm-devel] " Kent Overstreet
2022-06-29 19:37                     ` Bart Van Assche
2022-06-29 19:37                       ` [dm-devel] " Bart Van Assche
2022-06-29 19:50                       ` Kent Overstreet
2022-06-29 19:50                         ` [dm-devel] " Kent Overstreet
2022-06-29 19:59                       ` Kent Overstreet
2022-06-29 19:59                         ` [dm-devel] " Kent Overstreet
2022-06-29 19:00                 ` Jens Axboe
2022-06-29 19:00                   ` [dm-devel] " Jens Axboe
2022-06-29 19:26                   ` Kent Overstreet
2022-06-29 19:26                     ` [dm-devel] " Kent Overstreet
2022-06-29 20:51                     ` Jens Axboe
2022-06-29 20:51                       ` [dm-devel] " Jens Axboe
2022-06-29  0:49           ` Ming Lei
2022-06-29  0:49             ` [dm-devel] " Ming Lei
2022-06-28  4:26       ` Kent Overstreet
2022-06-28  4:26         ` [dm-devel] " Kent Overstreet
2022-06-28  7:49         ` Ming Lei
2022-06-28  7:49           ` [dm-devel] " Ming Lei
2022-06-28 16:36           ` Kent Overstreet
2022-06-28 16:36             ` [dm-devel] " Kent Overstreet
2022-06-28 17:41             ` Mike Snitzer
2022-06-28 17:41               ` [dm-devel] " Mike Snitzer
2022-06-28 17:52               ` Kent Overstreet
2022-06-28 17:52                 ` [dm-devel] " Kent Overstreet
2022-06-29  6:07                 ` Mike Snitzer
2022-06-29  6:07                   ` [dm-devel] " Mike Snitzer
2022-06-29 18:11                   ` Kent Overstreet
2022-06-29 18:11                     ` [dm-devel] " Kent Overstreet
2022-06-30  0:47                     ` Ming Lei
2022-06-30  0:47                       ` [dm-devel] " Ming Lei
2022-06-30  0:58                       ` Kent Overstreet
2022-06-30  0:58                         ` [dm-devel] " Kent Overstreet
2022-06-30  1:14                       ` Kent Overstreet
2022-06-30  1:14                         ` [dm-devel] " Kent Overstreet
2022-07-01  3:58                         ` Ming Lei
2022-07-01  3:58                           ` Ming Lei
2022-07-01 21:09                           ` Kent Overstreet
2022-07-01 21:09                             ` [dm-devel] " Kent Overstreet
2022-06-29  1:02             ` Ming Lei
2022-06-29  1:02               ` [dm-devel] " Ming Lei
2022-06-26 21:37   ` Eric Biggers
2022-06-26 21:37     ` Eric Biggers
2022-06-27  7:37     ` Ming Lei
2022-06-27  7:37       ` Ming Lei
2022-06-24 14:12 ` [PATCH 5.20 2/4] dm: add new helper for handling dm_io requeue Ming Lei
2022-06-24 14:12   ` [dm-devel] " Ming Lei
2022-06-24 14:12 ` [PATCH 5.20 3/4] dm: improve handling for DM_REQUEUE and AGAIN Ming Lei
2022-06-24 14:12   ` [dm-devel] " Ming Lei
2022-06-24 14:12 ` [PATCH 5.20 4/4] dm: add two stage requeue Ming Lei
2022-06-24 14:12   ` [dm-devel] " Ming Lei

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.