linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] block: Revert bio_clone() default behaviour
@ 2013-11-06  3:48 Kent Overstreet
  2013-11-06  5:02 ` Olof Johansson
                   ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Kent Overstreet @ 2013-11-06  3:48 UTC (permalink / raw)
  To: axboe, linux-kernel
  Cc: Kent Overstreet, Chris Mason, Mike Snitzer, NeilBrown, Olof Johansson

This patch reverts the default behaviour introduced by
9fc6286f347d00528adcdcf12396d220f47492ed - bio_clone_biovec() no clonger
shares the source bio's biovec, cloning the biovec is once again the
default.

Instead, we add a new bio_clone_biovec_fast(), which creates a clone
that shares the source's biovec. This patch changes bcache and md to use
__bio_clone_biovec_fast() since they're expecting the new behaviour due
to other refactoring; most of the other uses of bio_clone() should be
same to convert to the _fast() variant but that will be done more
incrementally in other patches (bio_split() in particular).

Note that __bio_clone() isn't being readded - the reason being that with
immutable biovecs allocating the right number of biovecs for the new
clone is no longer trivial so we don't want drivers trying to do that
themselves.

This patch also reverts febca1baea1cfe2d7a0271385d89b03d5fb34f94 -
__bio_clone_fast() should not be setting bi_vcnt for bios that do not
own the biovec (see Documentation/block/biovecs.txt for rationale) - in
short, not setting it might cause bugs in the short term but long term
it's likely to hide nastier more subtle bugs, we don't want code looking
at bi_vcnt at all for bios it does not own. However, this patch
_shouldn't_ cause any regressions because of this since we're reverting
back to the old bio_clone() behaviour.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Chris Mason <chris.mason@fusionio.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: NeilBrown <neilb@suse.de>
Cc: Olof Johansson <olof@lixom.net>
---
Chris, Olaf, can you two in particular test this? I have tested the bounce
buffer code (and bcache), but Jens told me today there was an md bug that I
_still_ can't find any emails about so I'm not sure what to test for that.

 drivers/md/bcache/request.c |   6 +--
 drivers/md/dm.c             |   4 +-
 fs/bio.c                    | 104 +++++++++++++++++++++++++++-----------------
 include/linux/bio.h         |   6 +--
 mm/bounce.c                 |   1 -
 5 files changed, 72 insertions(+), 49 deletions(-)

diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
index 52a1fef..ef44198 100644
--- a/drivers/md/bcache/request.c
+++ b/drivers/md/bcache/request.c
@@ -681,7 +681,7 @@ static void do_bio_hook(struct search *s)
 	struct bio *bio = &s->bio.bio;
 
 	bio_init(bio);
-	__bio_clone(bio, s->orig_bio);
+	__bio_clone_fast(bio, s->orig_bio);
 	bio->bi_end_io		= request_endio;
 	bio->bi_private		= &s->cl;
 
@@ -969,8 +969,8 @@ static void request_write(struct cached_dev *dc, struct search *s)
 	trace_bcache_write(s->orig_bio, s->writeback, s->op.skip);
 
 	if (!s->writeback) {
-		s->op.cache_bio = bio_clone_bioset(bio, GFP_NOIO,
-						   dc->disk.bio_split);
+		s->op.cache_bio = bio_clone_bioset_fast(bio, GFP_NOIO,
+							dc->disk.bio_split);
 
 		closure_bio_submit(bio, cl, s->d);
 	} else {
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 8e6174c..bafe7ed 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1135,7 +1135,7 @@ static void clone_bio(struct dm_target_io *tio, struct bio *bio,
 {
 	struct bio *clone = &tio->clone;
 
-	__bio_clone(clone, bio);
+	__bio_clone_fast(clone, bio);
 
 	if (bio_integrity(bio))
 		bio_integrity_clone(clone, bio, GFP_NOIO);
@@ -1177,7 +1177,7 @@ static void __clone_and_map_simple_bio(struct clone_info *ci,
 	 * ci->bio->bi_max_vecs is BIO_INLINE_VECS anyway, for both flush
 	 * and discard, so no need for concern about wasted bvec allocations.
 	 */
-	 __bio_clone(clone, ci->bio);
+	 __bio_clone_fast(clone, ci->bio);
 	if (len)
 		bio_setup_sector(clone, ci->sector, len);
 
diff --git a/fs/bio.c b/fs/bio.c
index 6046c91..99ff176 100644
--- a/fs/bio.c
+++ b/fs/bio.c
@@ -539,15 +539,17 @@ inline int bio_phys_segments(struct request_queue *q, struct bio *bio)
 EXPORT_SYMBOL(bio_phys_segments);
 
 /**
- * 	__bio_clone	-	clone a bio
+ * 	__bio_clone_fast - clone a bio that shares the original bio's biovec
  * 	@bio: destination bio
  * 	@bio_src: bio to clone
  *
  *	Clone a &bio. Caller will own the returned bio, but not
  *	the actual data it points to. Reference count of returned
  * 	bio will be one.
+ *
+ * 	Caller must ensure that @bio_src is not freed before @bio.
  */
-void __bio_clone(struct bio *bio, struct bio *bio_src)
+void __bio_clone_fast(struct bio *bio, struct bio *bio_src)
 {
 	BUG_ON(bio->bi_pool && BIO_POOL_IDX(bio) != BIO_POOL_NONE);
 
@@ -560,20 +562,19 @@ void __bio_clone(struct bio *bio, struct bio *bio_src)
 	bio->bi_rw = bio_src->bi_rw;
 	bio->bi_iter = bio_src->bi_iter;
 	bio->bi_io_vec = bio_src->bi_io_vec;
-	bio->bi_vcnt = bio_src->bi_vcnt;
 }
-EXPORT_SYMBOL(__bio_clone);
+EXPORT_SYMBOL(__bio_clone_fast);
 
 /**
- *	bio_clone_bioset -	clone a bio
+ *	bio_clone_bioset_fast -	clone a bio that shares the original bio's biovec
  *	@bio: bio to clone
  *	@gfp_mask: allocation priority
  *	@bs: bio_set to allocate from
  *
- * 	Like __bio_clone, only also allocates the returned bio
+ * 	Like __bio_clone_fast, only also allocates the returned bio
  */
-struct bio *bio_clone_bioset(struct bio *bio, gfp_t gfp_mask,
-			     struct bio_set *bs)
+struct bio *bio_clone_bioset_fast(struct bio *bio, gfp_t gfp_mask,
+				  struct bio_set *bs)
 {
 	struct bio *b;
 
@@ -581,7 +582,7 @@ struct bio *bio_clone_bioset(struct bio *bio, gfp_t gfp_mask,
 	if (!b)
 		return NULL;
 
-	__bio_clone(b, bio);
+	__bio_clone_fast(b, bio);
 
 	if (bio_integrity(bio)) {
 		int ret;
@@ -596,53 +597,76 @@ struct bio *bio_clone_bioset(struct bio *bio, gfp_t gfp_mask,
 
 	return b;
 }
-EXPORT_SYMBOL(bio_clone_bioset);
+EXPORT_SYMBOL(bio_clone_bioset_fast);
 
 /**
- * bio_clone_biovec: Given a cloned bio, give the clone its own copy of the
- * biovec
- * @bio: cloned bio
+ * 	bio_clone_bioset - clone a bio
+ * 	@bio_src: bio to clone
+ *	@gfp_mask: allocation priority
+ *	@bs: bio_set to allocate from
  *
- * @bio must have been allocated from a bioset - i.e. returned from
- * bio_clone_bioset()
+ *	Clone a &bio. Caller will own the returned bio, but not
+ *	the actual data it points to. Reference count of returned
+ * 	bio will be one.
  */
-int bio_clone_biovec(struct bio *bio, gfp_t gfp_mask)
+struct bio *bio_clone_bioset(struct bio *bio_src, gfp_t gfp_mask,
+			     struct bio_set *bs)
 {
-	unsigned long idx = BIO_POOL_NONE;
 	unsigned nr_iovecs = 0;
-	struct bio_vec bv, *bvl = NULL;
 	struct bvec_iter iter;
-	int i;
+	struct bio_vec bv;
+	struct bio *bio;
 
-	BUG_ON(!bio->bi_pool);
-	BUG_ON(BIO_POOL_IDX(bio) != BIO_POOL_NONE);
+	/*
+	 * Pre immutable biovecs, __bio_clone() used to just do a memcpy from
+	 * bio_src->bi_io_vec to bio->bi_io_vec.
+	 *
+	 * We can't do that anymore, because:
+	 *
+	 *  - The point of cloning the biovec is to produce a bio with a biovec
+	 *    the caller can modify: bi_idx and bi_bvec_done should be 0.
+	 *
+	 *  - The original bio could've had more than BIO_MAX_PAGES biovecs; if
+	 *    we tried to clone the whole thing bio_alloc_bioset() would fail.
+	 *    But the clone should succeed as long as the number of biovecs we
+	 *    actually need to allocate is fewer than BIO_MAX_PAGES.
+	 *
+	 *  - Lastly, bi_vcnt should not be looked at or relied upon by code
+	 *    that does not own the bio - reason being drivers don't use it for
+	 *    iterating over the biovec anymore, so expecting it to be kept up
+	 *    to date (i.e. for clones that share the parent biovec) is just
+	 *    asking for trouble and would force extra work on
+	 *    __bio_clone_fast() anyways.
+	 */
 
-	bio_for_each_segment(bv, bio, iter)
+	bio_for_each_segment(bv, bio_src, iter)
 		nr_iovecs++;
 
-	if (nr_iovecs > BIO_INLINE_VECS) {
-		bvl = bvec_alloc(gfp_mask, nr_iovecs, &idx,
-				 bio->bi_pool->bvec_pool);
-		if (!bvl)
-			return -ENOMEM;
-	} else if (nr_iovecs) {
-		bvl = bio->bi_inline_vecs;
-	}
+	bio = bio_alloc_bioset(gfp_mask, nr_iovecs, bs);
+	if (!bio)
+		return NULL;
 
-	i = 0;
-	bio_for_each_segment(bv, bio, iter)
-		bvl[i++] = bv;
+	bio->bi_bdev		= bio_src->bi_bdev;
+	bio->bi_rw		= bio_src->bi_rw;
+	bio->bi_iter.bi_sector	= bio_src->bi_iter.bi_sector;
+	bio->bi_iter.bi_size	= bio_src->bi_iter.bi_size;
 
-	bio->bi_io_vec = bvl;
-	bio->bi_iter.bi_idx = 0;
-	bio->bi_iter.bi_bvec_done = 0;
+	bio_for_each_segment(bv, bio_src, iter)
+		bio->bi_io_vec[bio->bi_vcnt++] = bv;
 
-	bio->bi_flags &= BIO_POOL_MASK - 1;
-	bio->bi_flags |= idx << BIO_POOL_OFFSET;
+	if (bio_integrity(bio_src)) {
+		int ret;
 
-	return 0;
+		ret = bio_integrity_clone(bio, bio_src, gfp_mask);
+		if (ret < 0) {
+			bio_put(bio);
+			return NULL;
+		}
+	}
+
+	return bio;
 }
-EXPORT_SYMBOL(bio_clone_biovec);
+EXPORT_SYMBOL(bio_clone_bioset);
 
 /**
  *	bio_get_nr_vecs		- return approx number of vecs
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 204489e..434ac76 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -327,9 +327,9 @@ extern mempool_t *biovec_create_pool(struct bio_set *bs, int pool_entries);
 extern struct bio *bio_alloc_bioset(gfp_t, int, struct bio_set *);
 extern void bio_put(struct bio *);
 
-extern void __bio_clone(struct bio *, struct bio *);
-extern struct bio *bio_clone_bioset(struct bio *, gfp_t, struct bio_set *bs);
-extern int bio_clone_biovec(struct bio *bio, gfp_t gfp_mask);
+extern void __bio_clone_fast(struct bio *, struct bio *);
+extern struct bio *bio_clone_bioset_fast(struct bio *, gfp_t, struct bio_set *);
+extern struct bio *bio_clone_bioset(struct bio *, gfp_t, struct bio_set *);
 
 extern struct bio_set *fs_bio_set;
 
diff --git a/mm/bounce.c b/mm/bounce.c
index d5873f2..523918b 100644
--- a/mm/bounce.c
+++ b/mm/bounce.c
@@ -211,7 +211,6 @@ static void __blk_queue_bounce(struct request_queue *q, struct bio **bio_orig,
 	return;
 bounce:
 	bio = bio_clone_bioset(*bio_orig, GFP_NOIO, fs_bio_set);
-	bio_clone_biovec(bio, GFP_NOIO);
 
 	bio_for_each_segment_all(to, bio, i) {
 		struct page *page = to->bv_page;
-- 
1.8.4.2


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH] block: Revert bio_clone() default behaviour
  2013-11-06  3:48 [PATCH] block: Revert bio_clone() default behaviour Kent Overstreet
@ 2013-11-06  5:02 ` Olof Johansson
  2013-11-06  5:07   ` Kent Overstreet
  2013-11-06 16:11 ` Chris Mason
  2013-11-06 20:31 ` Mike Snitzer
  2 siblings, 1 reply; 15+ messages in thread
From: Olof Johansson @ 2013-11-06  5:02 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Jens Axboe, linux-kernel, Chris Mason, Mike Snitzer, NeilBrown

Hi,

On Tue, Nov 5, 2013 at 7:48 PM, Kent Overstreet <kmo@daterainc.com> wrote:

> Chris, Olaf, can you two in particular test this? I have tested the bounce
> buffer code (and bcache), but Jens told me today there was an md bug that I
> _still_ can't find any emails about so I'm not sure what to test for that.

(I'm guessing you mean me and not some German person here. :-)

What is this expected to apply on top of? It doesn't apply cleanly on
the 20131105 -next tree which is where I had been seeing the issues.


-Olof

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] block: Revert bio_clone() default behaviour
  2013-11-06  5:02 ` Olof Johansson
@ 2013-11-06  5:07   ` Kent Overstreet
  2013-11-06 15:25     ` Olof Johansson
  0 siblings, 1 reply; 15+ messages in thread
From: Kent Overstreet @ 2013-11-06  5:07 UTC (permalink / raw)
  To: Olof Johansson
  Cc: Jens Axboe, linux-kernel, Chris Mason, Mike Snitzer, NeilBrown

On Tue, Nov 05, 2013 at 09:02:19PM -0800, Olof Johansson wrote:
> Hi,
> 
> On Tue, Nov 5, 2013 at 7:48 PM, Kent Overstreet <kmo@daterainc.com> wrote:
> 
> > Chris, Olaf, can you two in particular test this? I have tested the bounce
> > buffer code (and bcache), but Jens told me today there was an md bug that I
> > _still_ can't find any emails about so I'm not sure what to test for that.
> 
> (I'm guessing you mean me and not some German person here. :-)
> 
> What is this expected to apply on top of? It doesn't apply cleanly on
> the 20131105 -next tree which is where I had been seeing the issues.

Sorry - it's on top of Jens' for-3.13/core branch,
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] block: Revert bio_clone() default behaviour
  2013-11-06  5:07   ` Kent Overstreet
@ 2013-11-06 15:25     ` Olof Johansson
  0 siblings, 0 replies; 15+ messages in thread
From: Olof Johansson @ 2013-11-06 15:25 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Jens Axboe, linux-kernel, Chris Mason, Mike Snitzer, NeilBrown

On Tue, Nov 5, 2013 at 9:07 PM, Kent Overstreet <kmo@daterainc.com> wrote:
> On Tue, Nov 05, 2013 at 09:02:19PM -0800, Olof Johansson wrote:
>> Hi,
>>
>> On Tue, Nov 5, 2013 at 7:48 PM, Kent Overstreet <kmo@daterainc.com> wrote:
>>
>> > Chris, Olaf, can you two in particular test this? I have tested the bounce
>> > buffer code (and bcache), but Jens told me today there was an md bug that I
>> > _still_ can't find any emails about so I'm not sure what to test for that.
>>
>> (I'm guessing you mean me and not some German person here. :-)
>>
>> What is this expected to apply on top of? It doesn't apply cleanly on
>> the 20131105 -next tree which is where I had been seeing the issues.
>
> Sorry - it's on top of Jens' for-3.13/core branch,
> git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git

Tested-by: Olof Johansson <olof@lixom.net>

Note that last night's -next seems to have held up as well.


-Olof

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] block: Revert bio_clone() default behaviour
  2013-11-06  3:48 [PATCH] block: Revert bio_clone() default behaviour Kent Overstreet
  2013-11-06  5:02 ` Olof Johansson
@ 2013-11-06 16:11 ` Chris Mason
  2013-11-06 20:02   ` Kent Overstreet
  2013-11-06 20:31 ` Mike Snitzer
  2 siblings, 1 reply; 15+ messages in thread
From: Chris Mason @ 2013-11-06 16:11 UTC (permalink / raw)
  To: Kent Overstreet, axboe, linux-kernel
  Cc: Kent Overstreet, Mike Snitzer, NeilBrown, Olof Johansson

Quoting Kent Overstreet (2013-11-05 22:48:41)
> This patch reverts the default behaviour introduced by
> 9fc6286f347d00528adcdcf12396d220f47492ed - bio_clone_biovec() no clonger
> shares the source bio's biovec, cloning the biovec is once again the
> default.
> 
> Instead, we add a new bio_clone_biovec_fast(), which creates a clone
> that shares the source's biovec. This patch changes bcache and md to use
                                                               ^^^^^
							       dm?

> __bio_clone_biovec_fast() since they're expecting the new behaviour due
> to other refactoring; most of the other uses of bio_clone() should be
> same to convert to the _fast() variant but that will be done more
> incrementally in other patches (bio_split() in particular).

Hi Kent,

I noticed yesterday the _fast variants of bio clone introduce sharing
between the src and the clone, but without any reference counts:

        bio->bi_io_vec = bio_src->bi_io_vec;

Have you audited all of the _fast users to make sure they are not
freeing the src before the clone?  Sorry if this came up already in past
reviews.

> 
> Note that __bio_clone() isn't being readded - the reason being that with
> immutable biovecs allocating the right number of biovecs for the new
> clone is no longer trivial so we don't want drivers trying to do that
> themselves.
> 
> This patch also reverts febca1baea1cfe2d7a0271385d89b03d5fb34f94 -
> __bio_clone_fast() should not be setting bi_vcnt for bios that do not
> own the biovec (see Documentation/block/biovecs.txt for rationale) - in
> short,

I think I see what you mean with tying bi_vcnt to ownership of the bio,
but we're not consistent.  Looking at bio_for_eaach_segment_all:

*
 * drivers should _never_ use the all version - the bio may have been split
 * before it got to the driver and the driver won't own all of it
 */
#define bio_for_each_segment_all(bvl, bio, i)                           \
        for (i = 0, bvl = (bio)->bi_io_vec; i < (bio)->bi_vcnt; i++, bvl++)

bio_for_each_segment_all still trusts bi_vcnt, so any
bio_for_each_segment_all operation on a clone will basically be a noop.

Just looking at MD raid1 make_request()

                mbio = bio_clone_mddev(bio, GFP_NOIO, mddev);
		...
		alloc_behind_pages(mbio, r1_bio); -> bio_for_each_segment_all
		...
		if (r1_bio->behind_bvecs) {
                        bio_for_each_segment_all(bvec, mbio, j)
			...

I didn't test MD without the vcnt fix, but I think any operations in MD
that duplicate data for raid1 turn into noops.  I think we'll end up
writing garbage (or nothing) to the second mirror.

If you look at dm's crypt_free_buffer_pages(), it had similar problems.

> not setting it might cause bugs in the short term but long term
> it's likely to hide nastier more subtle bugs, we don't want code looking
> at bi_vcnt at all for bios it does not own.

I think the concept of bio ownership is still much too weak, at least
for established users like MD and DM.  I don't know how to verify the
sharing of bi_io_vec without some kind of reference counting on the
iovec.

-chris


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] block: Revert bio_clone() default behaviour
  2013-11-06 16:11 ` Chris Mason
@ 2013-11-06 20:02   ` Kent Overstreet
  2013-11-06 20:22     ` Chris Mason
  0 siblings, 1 reply; 15+ messages in thread
From: Kent Overstreet @ 2013-11-06 20:02 UTC (permalink / raw)
  To: Chris Mason; +Cc: axboe, linux-kernel, Mike Snitzer, NeilBrown, Olof Johansson

On Wed, Nov 06, 2013 at 11:11:30AM -0500, Chris Mason wrote:
> Quoting Kent Overstreet (2013-11-05 22:48:41)
> > This patch reverts the default behaviour introduced by
> > 9fc6286f347d00528adcdcf12396d220f47492ed - bio_clone_biovec() no clonger
> > shares the source bio's biovec, cloning the biovec is once again the
> > default.
> > 
> > Instead, we add a new bio_clone_biovec_fast(), which creates a clone
> > that shares the source's biovec. This patch changes bcache and md to use
>                                                                ^^^^^
> 							       dm?
> 
> > __bio_clone_biovec_fast() since they're expecting the new behaviour due
> > to other refactoring; most of the other uses of bio_clone() should be
> > same to convert to the _fast() variant but that will be done more
> > incrementally in other patches (bio_split() in particular).
> 
> Hi Kent,
> 
> I noticed yesterday the _fast variants of bio clone introduce sharing
> between the src and the clone, but without any reference counts:
> 
>         bio->bi_io_vec = bio_src->bi_io_vec;
> 
> Have you audited all of the _fast users to make sure they are not
> freeing the src before the clone?  Sorry if this came up already in past
> reviews.

Yup - that should actually be safe for all the existing bio_clone() users
actually, I audited all of them - because normally you're not going to complete
the original bio until the clone finishes.

> > Note that __bio_clone() isn't being readded - the reason being that with
> > immutable biovecs allocating the right number of biovecs for the new
> > clone is no longer trivial so we don't want drivers trying to do that
> > themselves.
> > 
> > This patch also reverts febca1baea1cfe2d7a0271385d89b03d5fb34f94 -
> > __bio_clone_fast() should not be setting bi_vcnt for bios that do not
> > own the biovec (see Documentation/block/biovecs.txt for rationale) - in
> > short,
> 
> I think I see what you mean with tying bi_vcnt to ownership of the bio,
> but we're not consistent.  Looking at bio_for_eaach_segment_all:
> 
> *
>  * drivers should _never_ use the all version - the bio may have been split
>  * before it got to the driver and the driver won't own all of it
>  */
> #define bio_for_each_segment_all(bvl, bio, i)                           \
>         for (i = 0, bvl = (bio)->bi_io_vec; i < (bio)->bi_vcnt; i++, bvl++)
> 
> bio_for_each_segment_all still trusts bi_vcnt, so any
> bio_for_each_segment_all operation on a clone will basically be a noop.
> 
> Just looking at MD raid1 make_request()
> 
>                 mbio = bio_clone_mddev(bio, GFP_NOIO, mddev);
> 		...
> 		alloc_behind_pages(mbio, r1_bio); -> bio_for_each_segment_all
> 		...
> 		if (r1_bio->behind_bvecs) {
>                         bio_for_each_segment_all(bvec, mbio, j)
> 			...
> 
> I didn't test MD without the vcnt fix, but I think any operations in MD
> that duplicate data for raid1 turn into noops.  I think we'll end up
> writing garbage (or nothing) to the second mirror.
> 
> If you look at dm's crypt_free_buffer_pages(), it had similar problems.

Those are fine actually - in both cases, they're bios they allocated, not the
bios that were submitted to them. Though md _definitely_ shouldn't have been
sharing the original bio's biovec, so looks like this patch will fix a bug
there...

(I remember seeing that code before and I thought I added a bio_clone_biovec()
call to that md code, but apparently that never got commited. Argh.)

> 
> > not setting it might cause bugs in the short term but long term
> > it's likely to hide nastier more subtle bugs, we don't want code looking
> > at bi_vcnt at all for bios it does not own.
> 
> I think the concept of bio ownership is still much too weak, at least
> for established users like MD and DM.  I don't know how to verify the
> sharing of bi_io_vec without some kind of reference counting on the
> iovec.

What's unclear about it? The rule is just - if you didn't allocate the biovec,
don't modify it or use bio_for_each_segment_all() (probably I didn't quite state
it clearly enough before though)

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] block: Revert bio_clone() default behaviour
  2013-11-06 20:02   ` Kent Overstreet
@ 2013-11-06 20:22     ` Chris Mason
  2013-11-06 20:36       ` Mike Snitzer
                         ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Chris Mason @ 2013-11-06 20:22 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: axboe, linux-kernel, Mike Snitzer, NeilBrown, Olof Johansson

Quoting Kent Overstreet (2013-11-06 15:02:22)
> On Wed, Nov 06, 2013 at 11:11:30AM -0500, Chris Mason wrote:
> > Quoting Kent Overstreet (2013-11-05 22:48:41)
> > > This patch reverts the default behaviour introduced by
> > > 9fc6286f347d00528adcdcf12396d220f47492ed - bio_clone_biovec() no clonger
> > > shares the source bio's biovec, cloning the biovec is once again the
> > > default.
> > > 
> > > Instead, we add a new bio_clone_biovec_fast(), which creates a clone
> > > that shares the source's biovec. This patch changes bcache and md to use
> >                                                                ^^^^^
> >                                                              dm?
> > 
> > > __bio_clone_biovec_fast() since they're expecting the new behaviour due
> > > to other refactoring; most of the other uses of bio_clone() should be
> > > same to convert to the _fast() variant but that will be done more
> > > incrementally in other patches (bio_split() in particular).
> > 
> > Hi Kent,
> > 
> > I noticed yesterday the _fast variants of bio clone introduce sharing
> > between the src and the clone, but without any reference counts:
> > 
> >         bio->bi_io_vec = bio_src->bi_io_vec;
> > 
> > Have you audited all of the _fast users to make sure they are not
> > freeing the src before the clone?  Sorry if this came up already in past
> > reviews.
> 
> Yup - that should actually be safe for all the existing bio_clone() users
> actually, I audited all of them - because normally you're not going to complete
> the original bio until the clone finishes.

I'd say we need an ack from Neil before we can switch to _fast.  The
completions look non-trivial and _fast adds new ordering requirements on
free.

> 
> > > Note that __bio_clone() isn't being readded - the reason being that with
> > > immutable biovecs allocating the right number of biovecs for the new
> > > clone is no longer trivial so we don't want drivers trying to do that
> > > themselves.
> > > 
> > > This patch also reverts febca1baea1cfe2d7a0271385d89b03d5fb34f94 -
> > > __bio_clone_fast() should not be setting bi_vcnt for bios that do not
> > > own the biovec (see Documentation/block/biovecs.txt for rationale) - in
> > > short,
> > 
> > I think I see what you mean with tying bi_vcnt to ownership of the bio,
> > but we're not consistent.  Looking at bio_for_eaach_segment_all:
> > 
> > *
> >  * drivers should _never_ use the all version - the bio may have been split
> >  * before it got to the driver and the driver won't own all of it
> >  */
> > #define bio_for_each_segment_all(bvl, bio, i)                           \
> >         for (i = 0, bvl = (bio)->bi_io_vec; i < (bio)->bi_vcnt; i++, bvl++)
> > 
> > bio_for_each_segment_all still trusts bi_vcnt, so any
> > bio_for_each_segment_all operation on a clone will basically be a noop.
> > 
> > Just looking at MD raid1 make_request()
> > 
> >                 mbio = bio_clone_mddev(bio, GFP_NOIO, mddev);
> >               ...
> >               alloc_behind_pages(mbio, r1_bio); -> bio_for_each_segment_all
> >               ...
> >               if (r1_bio->behind_bvecs) {
> >                         bio_for_each_segment_all(bvec, mbio, j)
> >                       ...
> > 
> > I didn't test MD without the vcnt fix, but I think any operations in MD
> > that duplicate data for raid1 turn into noops.  I think we'll end up
> > writing garbage (or nothing) to the second mirror.
> > 
> > If you look at dm's crypt_free_buffer_pages(), it had similar problems.
> 
> Those are fine actually - in both cases, they're bios they allocated, not the
> bios that were submitted to them.

Both are expecting bi_vcnt to be non-zero after a clone, which wasn't
the case before the bi_vcnt patch.  With your current patch (adding
_fast) both should now be correct.

>
> Though md _definitely_ shouldn't have been
> sharing the original bio's biovec, so looks like this patch will fix a bug
> there...
> 
> (I remember seeing that code before and I thought I added a bio_clone_biovec()
> call to that md code, but apparently that never got commited. Argh.)
> 
> > 
> > > not setting it might cause bugs in the short term but long term
> > > it's likely to hide nastier more subtle bugs, we don't want code looking
> > > at bi_vcnt at all for bios it does not own.
> > 
> > I think the concept of bio ownership is still much too weak, at least
> > for established users like MD and DM.  I don't know how to verify the
> > sharing of bi_io_vec without some kind of reference counting on the
> > iovec.
> 
> What's unclear about it? The rule is just - if you didn't allocate the biovec,
> don't modify it or use bio_for_each_segment_all() (probably I didn't quite state
> it clearly enough before though)

That part makes sense.  The new rule that scares me is that we can't
free the src of the clone until all the clones are freed.  If it works
with today's existing users it feels like it is more by accident than
design.  I'm not saying we can't do it, we just need some bigger
flashing warning lights.

-chris


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: block: Revert bio_clone() default behaviour
  2013-11-06  3:48 [PATCH] block: Revert bio_clone() default behaviour Kent Overstreet
  2013-11-06  5:02 ` Olof Johansson
  2013-11-06 16:11 ` Chris Mason
@ 2013-11-06 20:31 ` Mike Snitzer
  2013-11-06 20:40   ` Kent Overstreet
  2 siblings, 1 reply; 15+ messages in thread
From: Mike Snitzer @ 2013-11-06 20:31 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: axboe, linux-kernel, Chris Mason, NeilBrown, Olof Johansson

Hey Kent,

Digging a bit in the LKML archive I think this patch is in response to
this thread: https://lkml.org/lkml/2013/11/6/27

Might be good to give context for which reported problem(s) are being
fixed by this patch.

On Tue, Nov 05 2013 at 10:48pm -0500,
Kent Overstreet <kmo@daterainc.com> wrote:

> This patch reverts the default behaviour introduced by
> 9fc6286f347d00528adcdcf12396d220f47492ed - bio_clone_biovec() no clonger
> shares the source bio's biovec, cloning the biovec is once again the
> default.

Your focus, in terms of revert, seems to be on restoring
bio_clone_bioset, so: s/bio_clone_biovec/bio_clone_bioset/

Maybe best to say "effectively reverts" since you aren't reverting 
9fc6286f347d00528adcdcf12396d220f47492ed ?  Also s/clonger/longer/ typo ^

> Instead, we add a new bio_clone_biovec_fast(), which creates a clone
> that shares the source's biovec. This patch changes bcache and md to use

s/md/dm/

> __bio_clone_biovec_fast() since they're expecting the new behaviour due
> to other refactoring; most of the other uses of bio_clone() should be
> same to convert to the _fast() variant but that will be done more

s/same/safe/

> incrementally in other patches (bio_split() in particular).
> 
> Note that __bio_clone() isn't being readded - the reason being that with
> immutable biovecs allocating the right number of biovecs for the new
> clone is no longer trivial so we don't want drivers trying to do that
> themselves.
> 
> This patch also reverts febca1baea1cfe2d7a0271385d89b03d5fb34f94 -
> __bio_clone_fast() should not be setting bi_vcnt for bios that do not
> own the biovec (see Documentation/block/biovecs.txt for rationale) - in
> short, not setting it might cause bugs in the short term but long term
> it's likely to hide nastier more subtle bugs, we don't want code looking
> at bi_vcnt at all for bios it does not own. However, this patch
> _shouldn't_ cause any regressions because of this since we're reverting
> back to the old bio_clone() behaviour.
> 
> Signed-off-by: Kent Overstreet <kmo@daterainc.com>
> Cc: Jens Axboe <axboe@kernel.dk>
> Cc: Chris Mason <chris.mason@fusionio.com>
> Cc: Mike Snitzer <snitzer@redhat.com>
> Cc: NeilBrown <neilb@suse.de>
> Cc: Olof Johansson <olof@lixom.net>
> ---
> Chris, Olaf, can you two in particular test this? I have tested the bounce
> buffer code (and bcache), but Jens told me today there was an md bug that I
> _still_ can't find any emails about so I'm not sure what to test for that.

/me assumes you really mean md here, given Chris's later reply in this thread.

Relative to DM, this patch looks fine to me:

Acked-by: Mike Snitzer <snitzer@redhat.com>

Thanks,
Mike

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: block: Revert bio_clone() default behaviour
  2013-11-06 20:22     ` Chris Mason
@ 2013-11-06 20:36       ` Mike Snitzer
  2013-11-06 20:49         ` Chris Mason
  2013-11-06 20:57       ` [PATCH] " Kent Overstreet
  2013-11-07  4:59       ` NeilBrown
  2 siblings, 1 reply; 15+ messages in thread
From: Mike Snitzer @ 2013-11-06 20:36 UTC (permalink / raw)
  To: Chris Mason
  Cc: Kent Overstreet, axboe, linux-kernel, NeilBrown, Olof Johansson

On Wed, Nov 06 2013 at  3:22pm -0500,
Chris Mason <chris.mason@fusionio.com> wrote:

> Quoting Kent Overstreet (2013-11-06 15:02:22)
> > On Wed, Nov 06, 2013 at 11:11:30AM -0500, Chris Mason wrote:
> > > 
> > > I think the concept of bio ownership is still much too weak, at least
> > > for established users like MD and DM.  I don't know how to verify the
> > > sharing of bi_io_vec without some kind of reference counting on the
> > > iovec.
> > 
> > What's unclear about it? The rule is just - if you didn't allocate the biovec,
> > don't modify it or use bio_for_each_segment_all() (probably I didn't quite state
> > it clearly enough before though)
> 
> That part makes sense.  The new rule that scares me is that we can't
> free the src of the clone until all the clones are freed.  If it works
> with today's existing users it feels like it is more by accident than
> design.  I'm not saying we can't do it, we just need some bigger
> flashing warning lights.

But we probably don't want those warning lights to come with the cost of
managing extra refcounts in the fast path -- so maybe a debug-only
refcount?

Mike

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: block: Revert bio_clone() default behaviour
  2013-11-06 20:31 ` Mike Snitzer
@ 2013-11-06 20:40   ` Kent Overstreet
  0 siblings, 0 replies; 15+ messages in thread
From: Kent Overstreet @ 2013-11-06 20:40 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: axboe, linux-kernel, Chris Mason, NeilBrown, Olof Johansson

On Wed, Nov 06, 2013 at 03:31:02PM -0500, Mike Snitzer wrote:
> Hey Kent,
> 
> Digging a bit in the LKML archive I think this patch is in response to
> this thread: https://lkml.org/lkml/2013/11/6/27

That thread I saw, Jens told me there was another one though

> Might be good to give context for which reported problem(s) are being
> fixed by this patch.
> 
> On Tue, Nov 05 2013 at 10:48pm -0500,
> Kent Overstreet <kmo@daterainc.com> wrote:
> 
> > This patch reverts the default behaviour introduced by
> > 9fc6286f347d00528adcdcf12396d220f47492ed - bio_clone_biovec() no clonger
> > shares the source bio's biovec, cloning the biovec is once again the
> > default.
> 
> Your focus, in terms of revert, seems to be on restoring
> bio_clone_bioset, so: s/bio_clone_biovec/bio_clone_bioset/
> 
> Maybe best to say "effectively reverts" since you aren't reverting 
> 9fc6286f347d00528adcdcf12396d220f47492ed ?  Also s/clonger/longer/ typo ^
> 
> > Instead, we add a new bio_clone_biovec_fast(), which creates a clone
> > that shares the source's biovec. This patch changes bcache and md to use
> 
> s/md/dm/

Whoops :p

> > __bio_clone_biovec_fast() since they're expecting the new behaviour due
> > to other refactoring; most of the other uses of bio_clone() should be
> > same to convert to the _fast() variant but that will be done more
> 
> s/same/safe/

Thanks

> 
> > incrementally in other patches (bio_split() in particular).
> > 
> > Note that __bio_clone() isn't being readded - the reason being that with
> > immutable biovecs allocating the right number of biovecs for the new
> > clone is no longer trivial so we don't want drivers trying to do that
> > themselves.
> > 
> > This patch also reverts febca1baea1cfe2d7a0271385d89b03d5fb34f94 -
> > __bio_clone_fast() should not be setting bi_vcnt for bios that do not
> > own the biovec (see Documentation/block/biovecs.txt for rationale) - in
> > short, not setting it might cause bugs in the short term but long term
> > it's likely to hide nastier more subtle bugs, we don't want code looking
> > at bi_vcnt at all for bios it does not own. However, this patch
> > _shouldn't_ cause any regressions because of this since we're reverting
> > back to the old bio_clone() behaviour.
> > 
> > Signed-off-by: Kent Overstreet <kmo@daterainc.com>
> > Cc: Jens Axboe <axboe@kernel.dk>
> > Cc: Chris Mason <chris.mason@fusionio.com>
> > Cc: Mike Snitzer <snitzer@redhat.com>
> > Cc: NeilBrown <neilb@suse.de>
> > Cc: Olof Johansson <olof@lixom.net>
> > ---
> > Chris, Olaf, can you two in particular test this? I have tested the bounce
> > buffer code (and bcache), but Jens told me today there was an md bug that I
> > _still_ can't find any emails about so I'm not sure what to test for that.
> 
> /me assumes you really mean md here, given Chris's later reply in this thread.
> 
> Relative to DM, this patch looks fine to me:
> 
> Acked-by: Mike Snitzer <snitzer@redhat.com>

Thanks!

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: block: Revert bio_clone() default behaviour
  2013-11-06 20:36       ` Mike Snitzer
@ 2013-11-06 20:49         ` Chris Mason
  0 siblings, 0 replies; 15+ messages in thread
From: Chris Mason @ 2013-11-06 20:49 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Kent Overstreet, axboe, linux-kernel, NeilBrown, Olof Johansson

Quoting Mike Snitzer (2013-11-06 15:36:40)
> On Wed, Nov 06 2013 at  3:22pm -0500,
> Chris Mason <chris.mason@fusionio.com> wrote:
> 
> > Quoting Kent Overstreet (2013-11-06 15:02:22)
> > > On Wed, Nov 06, 2013 at 11:11:30AM -0500, Chris Mason wrote:
> > > > 
> > > > I think the concept of bio ownership is still much too weak, at least
> > > > for established users like MD and DM.  I don't know how to verify the
> > > > sharing of bi_io_vec without some kind of reference counting on the
> > > > iovec.
> > > 
> > > What's unclear about it? The rule is just - if you didn't allocate the biovec,
> > > don't modify it or use bio_for_each_segment_all() (probably I didn't quite state
> > > it clearly enough before though)
> > 
> > That part makes sense.  The new rule that scares me is that we can't
> > free the src of the clone until all the clones are freed.  If it works
> > with today's existing users it feels like it is more by accident than
> > design.  I'm not saying we can't do it, we just need some bigger
> > flashing warning lights.
> 
> But we probably don't want those warning lights to come with the cost of
> managing extra refcounts in the fast path -- so maybe a debug-only
> refcount?

I'd be happy with some code comments and a few extra SOBs.  In general
I'm happy with the new patches that add _fast and make us explicitly
choose the sharing.  Lots of little chances for bugs, but opt-in is a
much better starting point.

-chris

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] block: Revert bio_clone() default behaviour
  2013-11-06 20:22     ` Chris Mason
  2013-11-06 20:36       ` Mike Snitzer
@ 2013-11-06 20:57       ` Kent Overstreet
  2013-11-06 21:25         ` Chris Mason
  2013-11-07  4:59       ` NeilBrown
  2 siblings, 1 reply; 15+ messages in thread
From: Kent Overstreet @ 2013-11-06 20:57 UTC (permalink / raw)
  To: Chris Mason; +Cc: axboe, linux-kernel, Mike Snitzer, NeilBrown, Olof Johansson

On Wed, Nov 06, 2013 at 03:22:36PM -0500, Chris Mason wrote:
> Quoting Kent Overstreet (2013-11-06 15:02:22)
> > Yup - that should actually be safe for all the existing bio_clone() users
> > actually, I audited all of them - because normally you're not going to complete
> > the original bio until the clone finishes.
> 
> I'd say we need an ack from Neil before we can switch to _fast.  The
> completions look non-trivial and _fast adds new ordering requirements on
> free.

Agreed, I've already broken md enough lately :P

(I'm started to work on a real test suite for the block layer, that'll help with
this stuff...)

> > > If you look at dm's crypt_free_buffer_pages(), it had similar problems.
> > 
> > Those are fine actually - in both cases, they're bios they allocated, not the
> > bios that were submitted to them.
> 
> Both are expecting bi_vcnt to be non-zero after a clone, which wasn't
> the case before the bi_vcnt patch.  With your current patch (adding
> _fast) both should now be correct.

Yeah, but bi_vcnt being zero was just a symptom - the underlying reason the old
code was wrong was that with the clone sharing the parent's biovec, md was
modifying a biovec (including bv_page!) that it didn't own - _that_ was what was
horribly broken about it.

> > What's unclear about it? The rule is just - if you didn't allocate the biovec,
> > don't modify it or use bio_for_each_segment_all() (probably I didn't quite state
> > it clearly enough before though)
> 
> That part makes sense.  The new rule that scares me is that we can't
> free the src of the clone until all the clones are freed.  If it works
> with today's existing users it feels like it is more by accident than
> design.  I'm not saying we can't do it, we just need some bigger
> flashing warning lights.

Yeah, Jens was saying the same thing yesterday, hence this patch.

OTOH - with regards to just the ordering requirements, the more I look at
various code the less accidental the fact that that works seems to be: the best
explanation I've come up with so far is that you already needed to ensure that
the _pages_ the clone points to stick around until the clone completes, and if
you don't own the original bio the only way to do that is to not complete the
original bio until after the clone completes.

So if you're a driver cloning bios that were submitted to you, bio_clone_fast()
introduces no new ordering requirements.

On the third hand - if you're cloning (i.e. splitting) your own bios, in that
case it would be possible to screw up the ordering - I don't know of any code in
the kernel that does this today (except for, sort of, bcache) but my dio rewrite
takes this approach - but if you do the obvious and sane thing with bio_chain()
it's a non issue, it seems to me you'd have to come up with something pretty
contrived and dumb for this to actually be an issue in practice.

Anyways, I haven't come to any definite conclusions, those are just my
observations so far.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] block: Revert bio_clone() default behaviour
  2013-11-06 20:57       ` [PATCH] " Kent Overstreet
@ 2013-11-06 21:25         ` Chris Mason
  2013-11-06 21:51           ` Kent Overstreet
  0 siblings, 1 reply; 15+ messages in thread
From: Chris Mason @ 2013-11-06 21:25 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: axboe, linux-kernel, Mike Snitzer, NeilBrown, Olof Johansson

Quoting Kent Overstreet (2013-11-06 15:57:34)
> On Wed, Nov 06, 2013 at 03:22:36PM -0500, Chris Mason wrote:
> > Quoting Kent Overstreet (2013-11-06 15:02:22)

[ ... nods, thanks! ... ]

> OTOH - with regards to just the ordering requirements, the more I look at
> various code the less accidental the fact that that works seems to be: the best
> explanation I've come up with so far is that you already needed to ensure that
> the _pages_ the clone points to stick around until the clone completes, and if
> you don't own the original bio the only way to do that is to not complete the
> original bio until after the clone completes.
> 
> So if you're a driver cloning bios that were submitted to you, bio_clone_fast()
> introduces no new ordering requirements.
> 
> On the third hand - if you're cloning (i.e. splitting) your own bios, in that
> case it would be possible to screw up the ordering - I don't know of any code in
> the kernel that does this today (except for, sort of, bcache) but my dio rewrite
> takes this approach - but if you do the obvious and sane thing with bio_chain()
> it's a non issue, it seems to me you'd have to come up with something pretty
> contrived and dumb for this to actually be an issue in practice.
> 
> Anyways, I haven't come to any definite conclusions, those are just my
> observations so far.

I do think you're right.  We all seem to have clones doing work on
behalf of the original, and when everyone is done we complete the
original.

But, btrfs does have silly things like this:

        dio_end_io(dio_bio, err); // end and free the original
	bio_put(bio); // free the clone

It's not a bug yet, but given enough time the space between those two
frees will grow new code that kills us all.

Really though, the new stuff is better, thanks.

-chris

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] block: Revert bio_clone() default behaviour
  2013-11-06 21:25         ` Chris Mason
@ 2013-11-06 21:51           ` Kent Overstreet
  0 siblings, 0 replies; 15+ messages in thread
From: Kent Overstreet @ 2013-11-06 21:51 UTC (permalink / raw)
  To: Chris Mason; +Cc: axboe, linux-kernel, Mike Snitzer, NeilBrown, Olof Johansson

On Wed, Nov 06, 2013 at 04:25:45PM -0500, Chris Mason wrote:
> Quoting Kent Overstreet (2013-11-06 15:57:34)
> > On Wed, Nov 06, 2013 at 03:22:36PM -0500, Chris Mason wrote:
> > > Quoting Kent Overstreet (2013-11-06 15:02:22)
> 
> [ ... nods, thanks! ... ]
> 
> > OTOH - with regards to just the ordering requirements, the more I look at
> > various code the less accidental the fact that that works seems to be: the best
> > explanation I've come up with so far is that you already needed to ensure that
> > the _pages_ the clone points to stick around until the clone completes, and if
> > you don't own the original bio the only way to do that is to not complete the
> > original bio until after the clone completes.
> > 
> > So if you're a driver cloning bios that were submitted to you, bio_clone_fast()
> > introduces no new ordering requirements.
> > 
> > On the third hand - if you're cloning (i.e. splitting) your own bios, in that
> > case it would be possible to screw up the ordering - I don't know of any code in
> > the kernel that does this today (except for, sort of, bcache) but my dio rewrite
> > takes this approach - but if you do the obvious and sane thing with bio_chain()
> > it's a non issue, it seems to me you'd have to come up with something pretty
> > contrived and dumb for this to actually be an issue in practice.
> > 
> > Anyways, I haven't come to any definite conclusions, those are just my
> > observations so far.
> 
> I do think you're right.  We all seem to have clones doing work on
> behalf of the original, and when everyone is done we complete the
> original.
> 
> But, btrfs does have silly things like this:
> 
>         dio_end_io(dio_bio, err); // end and free the original
> 	bio_put(bio); // free the clone
> 
> It's not a bug yet, but given enough time the space between those two
> frees will grow new code that kills us all.

Hopefully the DIO situation will get better in the near future, Josef was
telling me btrfs was probably going to end up with its own DIO code (replacing a
good chunk of direct-io.c) and that should get a lot easier and saner for you
guys with my dio rewrite. Need to finish getting that to pass xfstests...

> Really though, the new stuff is better, thanks.

Thanks! :)

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] block: Revert bio_clone() default behaviour
  2013-11-06 20:22     ` Chris Mason
  2013-11-06 20:36       ` Mike Snitzer
  2013-11-06 20:57       ` [PATCH] " Kent Overstreet
@ 2013-11-07  4:59       ` NeilBrown
  2 siblings, 0 replies; 15+ messages in thread
From: NeilBrown @ 2013-11-07  4:59 UTC (permalink / raw)
  To: Chris Mason
  Cc: Kent Overstreet, axboe, linux-kernel, Mike Snitzer, Olof Johansson

[-- Attachment #1: Type: text/plain, Size: 1129 bytes --]

On Wed, 6 Nov 2013 15:22:36 -0500 Chris Mason <chris.mason@fusionio.com>
wrote:


> > Yup - that should actually be safe for all the existing bio_clone() users
> > actually, I audited all of them - because normally you're not going to complete
> > the original bio until the clone finishes.
> 
> I'd say we need an ack from Neil before we can switch to _fast.  The
> completions look non-trivial and _fast adds new ordering requirements on
> free.

I think it should be fairly straight forward to use _fast in md.
As Kent says, the original bio that is cloned is (almost) never going to have
endio called until all the clones have done their thing and finished up.

The only exception is the "behind_pages" stuff in raid1.c.  We clearly need
to be more careful there.
If the  "ordering requirements" you mention are simply that the owner of the
original may free the biovec, so the clone must have finished using it before
that can happen, then that would most easily be fix with
   mbio->bi_io_vec = r1_bio->behind_bvecs;
rather than the current loop which copies all the bv_page pointers.

NeilBrown


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2013-11-07  5:00 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-11-06  3:48 [PATCH] block: Revert bio_clone() default behaviour Kent Overstreet
2013-11-06  5:02 ` Olof Johansson
2013-11-06  5:07   ` Kent Overstreet
2013-11-06 15:25     ` Olof Johansson
2013-11-06 16:11 ` Chris Mason
2013-11-06 20:02   ` Kent Overstreet
2013-11-06 20:22     ` Chris Mason
2013-11-06 20:36       ` Mike Snitzer
2013-11-06 20:49         ` Chris Mason
2013-11-06 20:57       ` [PATCH] " Kent Overstreet
2013-11-06 21:25         ` Chris Mason
2013-11-06 21:51           ` Kent Overstreet
2013-11-07  4:59       ` NeilBrown
2013-11-06 20:31 ` Mike Snitzer
2013-11-06 20:40   ` Kent Overstreet

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).