* [PATCH for-4.2 00/14] block, dm: first batch of changes for 4.2
@ 2015-05-14 21:04 Mike Snitzer
  2015-05-14 21:04 ` [PATCH for-4.2 01/14] block: remove management of bi_remaining when restoring original bi_end_io Mike Snitzer
                   ` (13 more replies)
  0 siblings, 14 replies; 31+ messages in thread
From: Mike Snitzer @ 2015-05-14 21:04 UTC (permalink / raw)
  To: dm-devel; +Cc: Jens Axboe, Mike Snitzer, linux-kernel

Here is the start of the DM changes I've queued for 4.2 (via
linux-dm.git's for-next branch).

I'm posting them to the lists mainly to get feedback on the block
changes (the first 4 patches) and to publish Joe's dm-thinp discard
improvements, which depend on the new blkdev_issue_discard_async
interface.

Jens, if all looks OK please pick up the first 4 block patches.  The
"block: remove management of bi_remaining when restoring original
bi_end_io" patch has been tested with DM; I haven't tested the btrfs,
bcache, etc. callers it touches.

All review/feedback is welcome.  Thanks!
Mike

Christoph Hellwig (1):
  block, dm: don't copy bios for request clones

Joe Thornber (5):
  block: factor out blkdev_issue_discard_async
  dm btree: add dm_btree_remove_leaves()
  dm thin metadata: add dm_thin_find_mapped_range()
  dm thin metadata: add dm_thin_remove_range()
  dm thin: range discard support

Mike Snitzer (8):
  block: remove management of bi_remaining when restoring original bi_end_io
  block: remove export for blk_queue_bio
  dm: do not allocate any mempools for blk-mq request-based DM
  dm: rename methods that requeue requests
  dm: factor out a common cleanup_mapped_device()
  dm thin: cleanup overwrite's endio restore to be centralized
  dm thin: cleanup schedule_zero() to read more logically
  dm thin metadata: remove in-core 'read_only' flag

 block/bio-integrity.c                         |   4 +-
 block/bio.c                                   |  20 --
 block/blk-core.c                              |  99 +-----
 block/blk-lib.c                               |  57 ++-
 drivers/md/bcache/io.c                        |   2 +-
 drivers/md/dm-cache-target.c                  |   6 -
 drivers/md/dm-raid1.c                         |   2 -
 drivers/md/dm-snap.c                          |   1 -
 drivers/md/dm-table.c                         |  29 +-
 drivers/md/dm-thin-metadata.c                 | 117 +++++-
 drivers/md/dm-thin-metadata.h                 |  11 +
 drivers/md/dm-thin.c                          | 493 +++++++++++++++++---------
 drivers/md/dm-verity.c                        |   2 +-
 drivers/md/dm.c                               | 314 +++++++---------
 drivers/md/dm.h                               |   5 +-
 drivers/md/persistent-data/dm-block-manager.c |   6 +
 drivers/md/persistent-data/dm-block-manager.h |   1 +
 drivers/md/persistent-data/dm-btree-remove.c  | 127 +++++++
 drivers/md/persistent-data/dm-btree.h         |   9 +
 fs/btrfs/disk-io.c                            |   2 +-
 fs/btrfs/volumes.c                            |  16 +-
 fs/btrfs/volumes.h                            |   2 -
 include/linux/bio.h                           |   1 -
 include/linux/blk_types.h                     |   3 +
 include/linux/blkdev.h                        |  11 +-
 25 files changed, 806 insertions(+), 534 deletions(-)

-- 
2.3.2 (Apple Git-55)



* [PATCH for-4.2 01/14] block: remove management of bi_remaining when restoring original bi_end_io
  2015-05-14 21:04 [PATCH for-4.2 00/14] block, dm: first batch of changes for 4.2 Mike Snitzer
@ 2015-05-14 21:04 ` Mike Snitzer
  2015-05-18  7:22   ` Jan Kara
  2015-05-18  8:24   ` [dm-devel] [PATCH for-4.2 " Christoph Hellwig
  2015-05-14 21:05 ` [PATCH for-4.2 02/14] block: remove export for blk_queue_bio Mike Snitzer
                   ` (12 subsequent siblings)
  13 siblings, 2 replies; 31+ messages in thread
From: Mike Snitzer @ 2015-05-14 21:04 UTC (permalink / raw)
  To: dm-devel; +Cc: Jens Axboe, Mike Snitzer, linux-kernel, Jan Kara, Chris Mason

Commit c4cf5261 ("bio: skip atomic inc/dec of ->bi_remaining for
non-chains") regressed all existing callers that followed this pattern:
 1) saving a bio's original bi_end_io
 2) wiring up an intermediate bi_end_io
 3) restoring the original bi_end_io from intermediate bi_end_io
 4) calling bio_endio() to execute the restored original bi_end_io

The regression was due to BIO_CHAIN only ever getting set if
bio_inc_remaining() is called.  For the above pattern it isn't set until
step 3 (step 2 would've needed to establish BIO_CHAIN).  As such the
first bio_endio(), which invokes the intermediate bi_end_io wired up in
step 2, never decremented __bi_remaining -- leaving __bi_remaining with
the value 1 instead of 0.  When bio_inc_remaining() occurred during step
3 it brought __bi_remaining to a value of 2.  When the second
bio_endio() was called, in step 4, it should've called the original
bi_end_io but didn't, because an extra reference was never dropped (the
atomic operations having been optimized away since BIO_CHAIN wasn't set
upfront).

Fix this issue by removing the __bi_remaining management complexity for
all callers that use the above pattern -- bio_chain() is the only
interface that _needs_ to be concerned with __bi_remaining.  For the
above pattern callers just expect the bi_end_io they set to get called!
Remove bio_endio_nodec() and also remove all bio_inc_remaining() calls
that aren't associated with the bio_chain() interface.

The bio_inc_remaining() interface has been left exported because it is
still useful for more elaborate uses of bio_chain() -- it will be used
in an upcoming DM commit "dm thin: range discard support".
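
For reference, here is a minimal sketch of the save/restore pattern in
question (illustrative only -- the names below are hypothetical, not
taken from any of the converted callers):

  struct hook_info {
      bio_end_io_t *saved_end_io;
      void *saved_private;
  };

  /* the intermediate bi_end_io wired up in step 2; it is invoked by
   * the first bio_endio() when the I/O completes */
  static void intermediate_end_io(struct bio *bio, int error)
  {
      struct hook_info *h = bio->bi_private;

      /* step 3: restore the original completion handler */
      bio->bi_end_io = h->saved_end_io;
      bio->bi_private = h->saved_private;

      /* step 4: with this patch no bio_inc_remaining() is needed --
       * bio_endio() simply calls the restored bi_end_io */
      bio_endio(bio, error);
  }

  static void hook_bio(struct bio *bio, struct hook_info *h)
  {
      /* steps 1 and 2: save the original, wire up the intermediate */
      h->saved_end_io = bio->bi_end_io;
      h->saved_private = bio->bi_private;
      bio->bi_end_io = intermediate_end_io;
      bio->bi_private = h;
  }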

Fixes: c4cf5261 ("bio: skip atomic inc/dec of ->bi_remaining for non-chains")
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Chris Mason <clm@fb.com>
---
 block/bio-integrity.c        |  4 ++--
 block/bio.c                  | 20 --------------------
 drivers/md/bcache/io.c       |  2 +-
 drivers/md/dm-cache-target.c |  6 ------
 drivers/md/dm-raid1.c        |  2 --
 drivers/md/dm-snap.c         |  1 -
 drivers/md/dm-thin.c         |  9 +++------
 drivers/md/dm-verity.c       |  2 +-
 fs/btrfs/disk-io.c           |  2 +-
 fs/btrfs/volumes.c           | 16 +++++-----------
 fs/btrfs/volumes.h           |  2 --
 include/linux/bio.h          |  1 -
 12 files changed, 13 insertions(+), 54 deletions(-)

diff --git a/block/bio-integrity.c b/block/bio-integrity.c
index 5cbd5d9..0436c21 100644
--- a/block/bio-integrity.c
+++ b/block/bio-integrity.c
@@ -361,7 +361,7 @@ static void bio_integrity_verify_fn(struct work_struct *work)
 
 	/* Restore original bio completion handler */
 	bio->bi_end_io = bip->bip_end_io;
-	bio_endio_nodec(bio, error);
+	bio_endio(bio, error);
 }
 
 /**
@@ -388,7 +388,7 @@ void bio_integrity_endio(struct bio *bio, int error)
 	 */
 	if (error) {
 		bio->bi_end_io = bip->bip_end_io;
-		bio_endio_nodec(bio, error);
+		bio_endio(bio, error);
 
 		return;
 	}
diff --git a/block/bio.c b/block/bio.c
index c2ff8a8..3690f1c 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1809,26 +1809,6 @@ void bio_endio(struct bio *bio, int error)
 EXPORT_SYMBOL(bio_endio);
 
 /**
- * bio_endio_nodec - end I/O on a bio, without decrementing bi_remaining
- * @bio:	bio
- * @error:	error, if any
- *
- * For code that has saved and restored bi_end_io; thing hard before using this
- * function, probably you should've cloned the entire bio.
- **/
-void bio_endio_nodec(struct bio *bio, int error)
-{
-	/*
-	 * If it's not flagged as a chain, we are not going to dec the count
-	 */
-	if (bio_flagged(bio, BIO_CHAIN))
-		bio_inc_remaining(bio);
-
-	bio_endio(bio, error);
-}
-EXPORT_SYMBOL(bio_endio_nodec);
-
-/**
  * bio_split - split a bio
  * @bio:	bio to split
  * @sectors:	number of sectors to split from the front of @bio
diff --git a/drivers/md/bcache/io.c b/drivers/md/bcache/io.c
index fa028fa..cb64e64a 100644
--- a/drivers/md/bcache/io.c
+++ b/drivers/md/bcache/io.c
@@ -55,7 +55,7 @@ static void bch_bio_submit_split_done(struct closure *cl)
 
 	s->bio->bi_end_io = s->bi_end_io;
 	s->bio->bi_private = s->bi_private;
-	bio_endio_nodec(s->bio, 0);
+	bio_endio(s->bio, 0);
 
 	closure_debug_destroy(&s->cl);
 	mempool_free(s, s->p->bio_split_hook);
diff --git a/drivers/md/dm-cache-target.c b/drivers/md/dm-cache-target.c
index 705eb7b..41b2594 100644
--- a/drivers/md/dm-cache-target.c
+++ b/drivers/md/dm-cache-target.c
@@ -86,12 +86,6 @@ static void dm_unhook_bio(struct dm_hook_info *h, struct bio *bio)
 {
 	bio->bi_end_io = h->bi_end_io;
 	bio->bi_private = h->bi_private;
-
-	/*
-	 * Must bump bi_remaining to allow bio to complete with
-	 * restored bi_end_io.
-	 */
-	bio_inc_remaining(bio);
 }
 
 /*----------------------------------------------------------------*/
diff --git a/drivers/md/dm-raid1.c b/drivers/md/dm-raid1.c
index d6a1c09..743fa9b 100644
--- a/drivers/md/dm-raid1.c
+++ b/drivers/md/dm-raid1.c
@@ -1254,8 +1254,6 @@ static int mirror_end_io(struct dm_target *ti, struct bio *bio, int error)
 			dm_bio_restore(bd, bio);
 			bio_record->details.bi_bdev = NULL;
 
-			bio_inc_remaining(bio);
-
 			queue_bio(ms, bio, rw);
 			return DM_ENDIO_INCOMPLETE;
 		}
diff --git a/drivers/md/dm-snap.c b/drivers/md/dm-snap.c
index 8bfeae2..7c82d3c 100644
--- a/drivers/md/dm-snap.c
+++ b/drivers/md/dm-snap.c
@@ -1478,7 +1478,6 @@ out:
 	if (full_bio) {
 		full_bio->bi_end_io = pe->full_bio_end_io;
 		full_bio->bi_private = pe->full_bio_private;
-		bio_inc_remaining(full_bio);
 	}
 	increment_pending_exceptions_done_count();
 
diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
index 342dbda..e852602c 100644
--- a/drivers/md/dm-thin.c
+++ b/drivers/md/dm-thin.c
@@ -793,10 +793,9 @@ static void inc_remap_and_issue_cell(struct thin_c *tc,
 
 static void process_prepared_mapping_fail(struct dm_thin_new_mapping *m)
 {
-	if (m->bio) {
+	if (m->bio)
 		m->bio->bi_end_io = m->saved_bi_end_io;
-		bio_inc_remaining(m->bio);
-	}
+
 	cell_error(m->tc->pool, m->cell);
 	list_del(&m->list);
 	mempool_free(m, m->tc->pool->mapping_pool);
@@ -810,10 +809,8 @@ static void process_prepared_mapping(struct dm_thin_new_mapping *m)
 	int r;
 
 	bio = m->bio;
-	if (bio) {
+	if (bio)
 		bio->bi_end_io = m->saved_bi_end_io;
-		bio_inc_remaining(bio);
-	}
 
 	if (m->err) {
 		cell_error(pool, m->cell);
diff --git a/drivers/md/dm-verity.c b/drivers/md/dm-verity.c
index 66616db..bb9c6a0 100644
--- a/drivers/md/dm-verity.c
+++ b/drivers/md/dm-verity.c
@@ -459,7 +459,7 @@ static void verity_finish_io(struct dm_verity_io *io, int error)
 	bio->bi_end_io = io->orig_bi_end_io;
 	bio->bi_private = io->orig_bi_private;
 
-	bio_endio_nodec(bio, error);
+	bio_endio(bio, error);
 }
 
 static void verity_work(struct work_struct *w)
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 2ef9a4b..fbdbee4 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1745,7 +1745,7 @@ static void end_workqueue_fn(struct btrfs_work *work)
 	bio->bi_private = end_io_wq->private;
 	bio->bi_end_io = end_io_wq->end_io;
 	kmem_cache_free(btrfs_end_io_wq_cache, end_io_wq);
-	bio_endio_nodec(bio, error);
+	bio_endio(bio, error);
 }
 
 static int cleaner_kthread(void *arg)
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 8e8d1d1..dac77d4 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5585,10 +5585,10 @@ int btrfs_rmap_block(struct btrfs_mapping_tree *map_tree,
 
 static inline void btrfs_end_bbio(struct btrfs_bio *bbio, struct bio *bio, int err)
 {
-	if (likely(bbio->flags & BTRFS_BIO_ORIG_BIO_SUBMITTED))
-		bio_endio_nodec(bio, err);
-	else
-		bio_endio(bio, err);
+	bio->bi_private = bbio->private;
+	bio->bi_end_io = bbio->end_io;
+	bio_endio(bio, err);
+
 	btrfs_put_bbio(bbio);
 }
 
@@ -5632,8 +5632,6 @@ static void btrfs_end_bio(struct bio *bio, int err)
 			bio = bbio->orig_bio;
 		}
 
-		bio->bi_private = bbio->private;
-		bio->bi_end_io = bbio->end_io;
 		btrfs_io_bio(bio)->mirror_num = bbio->mirror_num;
 		/* only send an error to the higher layers if it is
 		 * beyond the tolerance of the btrfs bio
@@ -5815,8 +5813,6 @@ static void bbio_error(struct btrfs_bio *bbio, struct bio *bio, u64 logical)
 		/* Shoud be the original bio. */
 		WARN_ON(bio != bbio->orig_bio);
 
-		bio->bi_private = bbio->private;
-		bio->bi_end_io = bbio->end_io;
 		btrfs_io_bio(bio)->mirror_num = bbio->mirror_num;
 		bio->bi_iter.bi_sector = logical >> 9;
 
@@ -5897,10 +5893,8 @@ int btrfs_map_bio(struct btrfs_root *root, int rw, struct bio *bio,
 		if (dev_nr < total_devs - 1) {
 			bio = btrfs_bio_clone(first_bio, GFP_NOFS);
 			BUG_ON(!bio); /* -ENOMEM */
-		} else {
+		} else
 			bio = first_bio;
-			bbio->flags |= BTRFS_BIO_ORIG_BIO_SUBMITTED;
-		}
 
 		submit_stripe_bio(root, bbio, bio,
 				  bbio->stripes[dev_nr].physical, dev_nr, rw,
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index ebc3133..cedae03 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -292,8 +292,6 @@ struct btrfs_bio_stripe {
 struct btrfs_bio;
 typedef void (btrfs_bio_end_io_t) (struct btrfs_bio *bio, int err);
 
-#define BTRFS_BIO_ORIG_BIO_SUBMITTED	(1 << 0)
-
 struct btrfs_bio {
 	atomic_t refs;
 	atomic_t stripes_pending;
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 7486ea1..fe0a88d 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -427,7 +427,6 @@ static inline struct bio *bio_clone_kmalloc(struct bio *bio, gfp_t gfp_mask)
 }
 
 extern void bio_endio(struct bio *, int);
-extern void bio_endio_nodec(struct bio *, int);
 struct request_queue;
 extern int bio_phys_segments(struct request_queue *, struct bio *);
 
-- 
2.3.2 (Apple Git-55)



* [PATCH for-4.2 02/14] block: remove export for blk_queue_bio
  2015-05-14 21:04 [PATCH for-4.2 00/14] block, dm: first batch of changes for 4.2 Mike Snitzer
  2015-05-14 21:04 ` [PATCH for-4.2 01/14] block: remove management of bi_remaining when restoring original bi_end_io Mike Snitzer
@ 2015-05-14 21:05 ` Mike Snitzer
  2015-05-14 21:05 ` [PATCH for-4.2 03/14] block, dm: don't copy bios for request clones Mike Snitzer
                   ` (11 subsequent siblings)
  13 siblings, 0 replies; 31+ messages in thread
From: Mike Snitzer @ 2015-05-14 21:05 UTC (permalink / raw)
  To: dm-devel; +Cc: Jens Axboe, Mike Snitzer, linux-kernel

With commit ff36ab345 ("dm: remove request-based logic from
make_request_fn wrapper") DM no longer calls blk_queue_bio() directly,
so remove its export and make it static.  Doing so requires a forward
declaration in blk-core.c because blk_queue_bio() is referenced before
it is defined.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
 block/blk-core.c       | 5 +++--
 include/linux/blkdev.h | 2 --
 2 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 96b24b0..69035cb 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -735,6 +735,8 @@ blk_init_queue_node(request_fn_proc *rfn, spinlock_t *lock, int node_id)
 }
 EXPORT_SYMBOL(blk_init_queue_node);
 
+static void blk_queue_bio(struct request_queue *q, struct bio *bio);
+
 struct request_queue *
 blk_init_allocated_queue(struct request_queue *q, request_fn_proc *rfn,
 			 spinlock_t *lock)
@@ -1588,7 +1590,7 @@ void init_request_from_bio(struct request *req, struct bio *bio)
 	blk_rq_bio_prep(req->q, req, bio);
 }
 
-void blk_queue_bio(struct request_queue *q, struct bio *bio)
+static void blk_queue_bio(struct request_queue *q, struct bio *bio)
 {
 	const bool sync = !!(bio->bi_rw & REQ_SYNC);
 	struct blk_plug *plug;
@@ -1696,7 +1698,6 @@ out_unlock:
 		spin_unlock_irq(q->queue_lock);
 	}
 }
-EXPORT_SYMBOL_GPL(blk_queue_bio);	/* for device mapper only */
 
 /*
  * If bio->bi_dev is a partition, remap the location
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 2da818a..c8cb30e 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -792,8 +792,6 @@ extern int scsi_cmd_ioctl(struct request_queue *, struct gendisk *, fmode_t,
 extern int sg_scsi_ioctl(struct request_queue *, struct gendisk *, fmode_t,
 			 struct scsi_ioctl_command __user *);
 
-extern void blk_queue_bio(struct request_queue *q, struct bio *bio);
-
 /*
  * A queue has just exitted congestion.  Note this in the global counter of
  * congested queues, and wake up anyone who was waiting for requests to be
-- 
2.3.2 (Apple Git-55)



* [PATCH for-4.2 03/14] block, dm: don't copy bios for request clones
  2015-05-14 21:04 [PATCH for-4.2 00/14] block, dm: first batch of changes for 4.2 Mike Snitzer
  2015-05-14 21:04 ` [PATCH for-4.2 01/14] block: remove management of bi_remaining when restoring original bi_end_io Mike Snitzer
  2015-05-14 21:05 ` [PATCH for-4.2 02/14] block: remove export for blk_queue_bio Mike Snitzer
@ 2015-05-14 21:05 ` Mike Snitzer
  2015-05-14 21:05 ` [PATCH for-4.2 04/14] block: factor out blkdev_issue_discard_async Mike Snitzer
                   ` (10 subsequent siblings)
  13 siblings, 0 replies; 31+ messages in thread
From: Mike Snitzer @ 2015-05-14 21:05 UTC (permalink / raw)
  To: dm-devel; +Cc: Jens Axboe, Christoph Hellwig, linux-kernel, Mike Snitzer

From: Christoph Hellwig <hch@lst.de>

Currently dm-multipath has to clone the bios for every request sent
to the lower devices, which wastes CPU cycles and ties down memory.

This patch instead adds a new REQ_CLONE flag that instructs
req_bio_endio() not to complete bios attached to a request; we set it
on clone requests, mirroring how requests in a flush sequence are
handled.  With this change I/O errors on a path failure only get
propagated to dm-multipath, which can then either resubmit the I/O or
complete the bios on the original request.

I've done some basic testing of this on a Linux target with ALUA support,
and it survives path failures during I/O nicely.
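
To illustrate the resulting completion flow, here is a rough sketch of
a clone's end_io (hypothetical names throughout -- this is not code
from the patch): because the clone shares the original request's bios
and carries REQ_CLONE, completing the clone leaves those bios
untouched, and dm-multipath decides their fate through the original
request.

  struct example_ctx {				/* hypothetical */
      struct request *orig;
  };

  static void example_clone_end_io(struct request *clone, int error)
  {
      struct example_ctx *ctx = clone->end_io_data;

      if (error && path_failed(error))		/* hypothetical helper */
          requeue_on_another_path(ctx->orig);	/* bios still intact */
      else
          blk_end_request_all(ctx->orig, error);	/* ends the bios once */
  }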

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
 block/blk-core.c          |  94 +++----------------------
 drivers/md/dm-table.c     |  25 ++++---
 drivers/md/dm.c           | 171 +++++++++++-----------------------------------
 drivers/md/dm.h           |   5 +-
 include/linux/blk_types.h |   2 +
 include/linux/blkdev.h    |   6 +-
 6 files changed, 73 insertions(+), 230 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 69035cb..8605dce 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -117,7 +117,7 @@ EXPORT_SYMBOL(blk_rq_init);
 static void req_bio_endio(struct request *rq, struct bio *bio,
 			  unsigned int nbytes, int error)
 {
-	if (error)
+	if (error && !(rq->cmd_flags & REQ_CLONE))
 		clear_bit(BIO_UPTODATE, &bio->bi_flags);
 	else if (!test_bit(BIO_UPTODATE, &bio->bi_flags))
 		error = -EIO;
@@ -128,7 +128,8 @@ static void req_bio_endio(struct request *rq, struct bio *bio,
 	bio_advance(bio, nbytes);
 
 	/* don't actually finish bio if it's part of flush sequence */
-	if (bio->bi_iter.bi_size == 0 && !(rq->cmd_flags & REQ_FLUSH_SEQ))
+	if (bio->bi_iter.bi_size == 0 &&
+	    !(rq->cmd_flags & (REQ_FLUSH_SEQ|REQ_CLONE)))
 		bio_endio(bio, error);
 }
 
@@ -2914,95 +2915,22 @@ int blk_lld_busy(struct request_queue *q)
 }
 EXPORT_SYMBOL_GPL(blk_lld_busy);
 
-/**
- * blk_rq_unprep_clone - Helper function to free all bios in a cloned request
- * @rq: the clone request to be cleaned up
- *
- * Description:
- *     Free all bios in @rq for a cloned request.
- */
-void blk_rq_unprep_clone(struct request *rq)
-{
-	struct bio *bio;
-
-	while ((bio = rq->bio) != NULL) {
-		rq->bio = bio->bi_next;
-
-		bio_put(bio);
-	}
-}
-EXPORT_SYMBOL_GPL(blk_rq_unprep_clone);
-
-/*
- * Copy attributes of the original request to the clone request.
- * The actual data parts (e.g. ->cmd, ->sense) are not copied.
- */
-static void __blk_rq_prep_clone(struct request *dst, struct request *src)
+void blk_rq_prep_clone(struct request *dst, struct request *src)
 {
 	dst->cpu = src->cpu;
-	dst->cmd_flags |= (src->cmd_flags & REQ_CLONE_MASK) | REQ_NOMERGE;
+	dst->cmd_flags |= (src->cmd_flags & REQ_CLONE_MASK);
+	dst->cmd_flags |= REQ_NOMERGE | REQ_CLONE;
 	dst->cmd_type = src->cmd_type;
 	dst->__sector = blk_rq_pos(src);
 	dst->__data_len = blk_rq_bytes(src);
 	dst->nr_phys_segments = src->nr_phys_segments;
 	dst->ioprio = src->ioprio;
 	dst->extra_len = src->extra_len;
-}
-
-/**
- * blk_rq_prep_clone - Helper function to setup clone request
- * @rq: the request to be setup
- * @rq_src: original request to be cloned
- * @bs: bio_set that bios for clone are allocated from
- * @gfp_mask: memory allocation mask for bio
- * @bio_ctr: setup function to be called for each clone bio.
- *           Returns %0 for success, non %0 for failure.
- * @data: private data to be passed to @bio_ctr
- *
- * Description:
- *     Clones bios in @rq_src to @rq, and copies attributes of @rq_src to @rq.
- *     The actual data parts of @rq_src (e.g. ->cmd, ->sense)
- *     are not copied, and copying such parts is the caller's responsibility.
- *     Also, pages which the original bios are pointing to are not copied
- *     and the cloned bios just point same pages.
- *     So cloned bios must be completed before original bios, which means
- *     the caller must complete @rq before @rq_src.
- */
-int blk_rq_prep_clone(struct request *rq, struct request *rq_src,
-		      struct bio_set *bs, gfp_t gfp_mask,
-		      int (*bio_ctr)(struct bio *, struct bio *, void *),
-		      void *data)
-{
-	struct bio *bio, *bio_src;
-
-	if (!bs)
-		bs = fs_bio_set;
-
-	__rq_for_each_bio(bio_src, rq_src) {
-		bio = bio_clone_fast(bio_src, gfp_mask, bs);
-		if (!bio)
-			goto free_and_out;
-
-		if (bio_ctr && bio_ctr(bio, bio_src, data))
-			goto free_and_out;
-
-		if (rq->bio) {
-			rq->biotail->bi_next = bio;
-			rq->biotail = bio;
-		} else
-			rq->bio = rq->biotail = bio;
-	}
-
-	__blk_rq_prep_clone(rq, rq_src);
-
-	return 0;
-
-free_and_out:
-	if (bio)
-		bio_put(bio);
-	blk_rq_unprep_clone(rq);
-
-	return -ENOMEM;
+	dst->bio = src->bio;
+	dst->biotail = src->biotail;
+	dst->cmd = src->cmd;
+	dst->cmd_len = src->cmd_len;
+	dst->sense = src->sense;
 }
 EXPORT_SYMBOL_GPL(blk_rq_prep_clone);
 
diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index d9b00b8..3662b2e 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -940,21 +940,28 @@ static int dm_table_alloc_md_mempools(struct dm_table *t, struct mapped_device *
 {
 	unsigned type = dm_table_get_type(t);
 	unsigned per_bio_data_size = 0;
-	struct dm_target *tgt;
 	unsigned i;
 
-	if (unlikely(type == DM_TYPE_NONE)) {
+	switch (type) {
+	case DM_TYPE_BIO_BASED:
+		for (i = 0; i < t->num_targets; i++) {
+			struct dm_target *tgt = t->targets + i;
+
+			per_bio_data_size = max(per_bio_data_size,
+						tgt->per_bio_data_size);
+		}
+		t->mempools = dm_alloc_bio_mempools(t->integrity_supported,
+						    per_bio_data_size);
+		break;
+	case DM_TYPE_REQUEST_BASED:
+	case DM_TYPE_MQ_REQUEST_BASED:
+		t->mempools = dm_alloc_rq_mempools(md, type);
+		break;
+	default:
 		DMWARN("no table type is set, can't allocate mempools");
 		return -EINVAL;
 	}
 
-	if (type == DM_TYPE_BIO_BASED)
-		for (i = 0; i < t->num_targets; i++) {
-			tgt = t->targets + i;
-			per_bio_data_size = max(per_bio_data_size, tgt->per_bio_data_size);
-		}
-
-	t->mempools = dm_alloc_md_mempools(md, type, t->integrity_supported, per_bio_data_size);
 	if (!t->mempools)
 		return -ENOMEM;
 
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index a930b72..38837f8 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -990,57 +990,6 @@ static void clone_endio(struct bio *bio, int error)
 	dec_pending(io, error);
 }
 
-/*
- * Partial completion handling for request-based dm
- */
-static void end_clone_bio(struct bio *clone, int error)
-{
-	struct dm_rq_clone_bio_info *info =
-		container_of(clone, struct dm_rq_clone_bio_info, clone);
-	struct dm_rq_target_io *tio = info->tio;
-	struct bio *bio = info->orig;
-	unsigned int nr_bytes = info->orig->bi_iter.bi_size;
-
-	bio_put(clone);
-
-	if (tio->error)
-		/*
-		 * An error has already been detected on the request.
-		 * Once error occurred, just let clone->end_io() handle
-		 * the remainder.
-		 */
-		return;
-	else if (error) {
-		/*
-		 * Don't notice the error to the upper layer yet.
-		 * The error handling decision is made by the target driver,
-		 * when the request is completed.
-		 */
-		tio->error = error;
-		return;
-	}
-
-	/*
-	 * I/O for the bio successfully completed.
-	 * Notice the data completion to the upper layer.
-	 */
-
-	/*
-	 * bios are processed from the head of the list.
-	 * So the completing bio should always be rq->bio.
-	 * If it's not, something wrong is happening.
-	 */
-	if (tio->orig->bio != bio)
-		DMERR("bio completion is going in the middle of the request");
-
-	/*
-	 * Update the original request.
-	 * Do not use blk_end_request() here, because it may complete
-	 * the original request before the clone, and break the ordering.
-	 */
-	blk_update_request(tio->orig, 0, nr_bytes);
-}
-
 static struct dm_rq_target_io *tio_from_request(struct request *rq)
 {
 	return (rq->q->mq_ops ? blk_mq_rq_to_pdu(rq) : rq->special);
@@ -1089,8 +1038,6 @@ static void free_rq_clone(struct request *clone, bool must_be_mapped)
 
 	WARN_ON_ONCE(must_be_mapped && !clone->q);
 
-	blk_rq_unprep_clone(clone);
-
 	if (md->type == DM_TYPE_MQ_REQUEST_BASED)
 		/* stacked on blk-mq queue(s) */
 		tio->ti->type->release_clone_rq(clone);
@@ -1821,39 +1768,13 @@ static void dm_dispatch_clone_request(struct request *clone, struct request *rq)
 		dm_complete_request(rq, r);
 }
 
-static int dm_rq_bio_constructor(struct bio *bio, struct bio *bio_orig,
-				 void *data)
+static void setup_clone(struct request *clone, struct request *rq,
+		        struct dm_rq_target_io *tio)
 {
-	struct dm_rq_target_io *tio = data;
-	struct dm_rq_clone_bio_info *info =
-		container_of(bio, struct dm_rq_clone_bio_info, clone);
-
-	info->orig = bio_orig;
-	info->tio = tio;
-	bio->bi_end_io = end_clone_bio;
-
-	return 0;
-}
-
-static int setup_clone(struct request *clone, struct request *rq,
-		       struct dm_rq_target_io *tio, gfp_t gfp_mask)
-{
-	int r;
-
-	r = blk_rq_prep_clone(clone, rq, tio->md->bs, gfp_mask,
-			      dm_rq_bio_constructor, tio);
-	if (r)
-		return r;
-
-	clone->cmd = rq->cmd;
-	clone->cmd_len = rq->cmd_len;
-	clone->sense = rq->sense;
+	blk_rq_prep_clone(clone, rq);
 	clone->end_io = end_clone_request;
 	clone->end_io_data = tio;
-
 	tio->clone = clone;
-
-	return 0;
 }
 
 static struct request *clone_rq(struct request *rq, struct mapped_device *md,
@@ -1874,12 +1795,7 @@ static struct request *clone_rq(struct request *rq, struct mapped_device *md,
 		clone = tio->clone;
 
 	blk_rq_init(NULL, clone);
-	if (setup_clone(clone, rq, tio, gfp_mask)) {
-		/* -ENOMEM */
-		if (alloc_clone)
-			free_clone_request(md, clone);
-		return NULL;
-	}
+	setup_clone(clone, rq, tio);
 
 	return clone;
 }
@@ -1973,11 +1889,7 @@ static int map_request(struct dm_rq_target_io *tio, struct request *rq,
 		}
 		if (IS_ERR(clone))
 			return DM_MAPIO_REQUEUE;
-		if (setup_clone(clone, rq, tio, GFP_ATOMIC)) {
-			/* -ENOMEM */
-			ti->type->release_clone_rq(clone);
-			return DM_MAPIO_REQUEUE;
-		}
+		setup_clone(clone, rq, tio);
 	}
 
 	switch (r) {
@@ -2431,8 +2343,6 @@ static void __bind_mempools(struct mapped_device *md, struct dm_table *t)
 		goto out;
 	}
 
-	BUG_ON(!p || md->io_pool || md->rq_pool || md->bs);
-
 	md->io_pool = p->io_pool;
 	p->io_pool = NULL;
 	md->rq_pool = p->rq_pool;
@@ -3536,48 +3446,23 @@ int dm_noflush_suspending(struct dm_target *ti)
 }
 EXPORT_SYMBOL_GPL(dm_noflush_suspending);
 
-struct dm_md_mempools *dm_alloc_md_mempools(struct mapped_device *md, unsigned type,
-					    unsigned integrity, unsigned per_bio_data_size)
+struct dm_md_mempools *dm_alloc_bio_mempools(unsigned integrity,
+					     unsigned per_bio_data_size)
 {
-	struct dm_md_mempools *pools = kzalloc(sizeof(*pools), GFP_KERNEL);
-	struct kmem_cache *cachep = NULL;
-	unsigned int pool_size = 0;
+	struct dm_md_mempools *pools;
+	unsigned int pool_size = dm_get_reserved_bio_based_ios();
 	unsigned int front_pad;
 
+	pools = kzalloc(sizeof(*pools), GFP_KERNEL);
 	if (!pools)
 		return NULL;
 
-	type = filter_md_type(type, md);
+	front_pad = roundup(per_bio_data_size, __alignof__(struct dm_target_io)) +
+		offsetof(struct dm_target_io, clone);
 
-	switch (type) {
-	case DM_TYPE_BIO_BASED:
-		cachep = _io_cache;
-		pool_size = dm_get_reserved_bio_based_ios();
-		front_pad = roundup(per_bio_data_size, __alignof__(struct dm_target_io)) + offsetof(struct dm_target_io, clone);
-		break;
-	case DM_TYPE_REQUEST_BASED:
-		cachep = _rq_tio_cache;
-		pool_size = dm_get_reserved_rq_based_ios();
-		pools->rq_pool = mempool_create_slab_pool(pool_size, _rq_cache);
-		if (!pools->rq_pool)
-			goto out;
-		/* fall through to setup remaining rq-based pools */
-	case DM_TYPE_MQ_REQUEST_BASED:
-		if (!pool_size)
-			pool_size = dm_get_reserved_rq_based_ios();
-		front_pad = offsetof(struct dm_rq_clone_bio_info, clone);
-		/* per_bio_data_size is not used. See __bind_mempools(). */
-		WARN_ON(per_bio_data_size != 0);
-		break;
-	default:
-		BUG();
-	}
-
-	if (cachep) {
-		pools->io_pool = mempool_create_slab_pool(pool_size, cachep);
-		if (!pools->io_pool)
-			goto out;
-	}
+	pools->io_pool = mempool_create_slab_pool(pool_size, _io_cache);
+	if (!pools->io_pool)
+		goto out;
 
 	pools->bs = bioset_create_nobvec(pool_size, front_pad);
 	if (!pools->bs)
@@ -3587,10 +3472,34 @@ struct dm_md_mempools *dm_alloc_md_mempools(struct mapped_device *md, unsigned t
 		goto out;
 
 	return pools;
-
 out:
 	dm_free_md_mempools(pools);
+	return NULL;
+}
+
+struct dm_md_mempools *dm_alloc_rq_mempools(struct mapped_device *md,
+					    unsigned type)
+{
+	unsigned int pool_size = dm_get_reserved_rq_based_ios();
+	struct dm_md_mempools *pools;
 
+	pools = kzalloc(sizeof(*pools), GFP_KERNEL);
+	if (!pools)
+		return NULL;
+
+	if (filter_md_type(type, md) == DM_TYPE_REQUEST_BASED) {
+		pools->rq_pool = mempool_create_slab_pool(pool_size, _rq_cache);
+		if (!pools->rq_pool)
+			goto out;
+	}
+
+	pools->io_pool = mempool_create_slab_pool(pool_size, _rq_tio_cache);
+	if (!pools->io_pool)
+		goto out;
+
+	return pools;
+out:
+	dm_free_md_mempools(pools);
 	return NULL;
 }
 
diff --git a/drivers/md/dm.h b/drivers/md/dm.h
index 6123c2b..e6e66d0 100644
--- a/drivers/md/dm.h
+++ b/drivers/md/dm.h
@@ -222,8 +222,9 @@ void dm_kcopyd_exit(void);
 /*
  * Mempool operations
  */
-struct dm_md_mempools *dm_alloc_md_mempools(struct mapped_device *md, unsigned type,
-					    unsigned integrity, unsigned per_bio_data_size);
+struct dm_md_mempools *dm_alloc_bio_mempools(unsigned integrity,
+					     unsigned per_bio_data_size);
+struct dm_md_mempools *dm_alloc_rq_mempools(struct mapped_device *md, unsigned type);
 void dm_free_md_mempools(struct dm_md_mempools *pools);
 
 /*
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index cc2e85a..745588b 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -195,6 +195,7 @@ enum rq_flag_bits {
 	__REQ_HASHED,		/* on IO scheduler merge hash */
 	__REQ_MQ_INFLIGHT,	/* track inflight for MQ */
 	__REQ_NO_TIMEOUT,	/* requests may never expire */
+	__REQ_CLONE,		/* cloned bios */
 	__REQ_NR_BITS,		/* stops here */
 };
 
@@ -249,5 +250,6 @@ enum rq_flag_bits {
 #define REQ_HASHED		(1ULL << __REQ_HASHED)
 #define REQ_MQ_INFLIGHT		(1ULL << __REQ_MQ_INFLIGHT)
 #define REQ_NO_TIMEOUT		(1ULL << __REQ_NO_TIMEOUT)
+#define REQ_CLONE		(1ULL << __REQ_CLONE)
 
 #endif /* __LINUX_BLK_TYPES_H */
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index c8cb30e..cd14e2d 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -775,11 +775,7 @@ extern void blk_add_request_payload(struct request *rq, struct page *page,
 		unsigned int len);
 extern int blk_rq_check_limits(struct request_queue *q, struct request *rq);
 extern int blk_lld_busy(struct request_queue *q);
-extern int blk_rq_prep_clone(struct request *rq, struct request *rq_src,
-			     struct bio_set *bs, gfp_t gfp_mask,
-			     int (*bio_ctr)(struct bio *, struct bio *, void *),
-			     void *data);
-extern void blk_rq_unprep_clone(struct request *rq);
+extern void blk_rq_prep_clone(struct request *rq, struct request *rq_src);
 extern int blk_insert_cloned_request(struct request_queue *q,
 				     struct request *rq);
 extern void blk_delay_queue(struct request_queue *, unsigned long);
-- 
2.3.2 (Apple Git-55)



* [PATCH for-4.2 04/14] block: factor out blkdev_issue_discard_async
  2015-05-14 21:04 [PATCH for-4.2 00/14] block, dm: first batch of changes for 4.2 Mike Snitzer
                   ` (2 preceding siblings ...)
  2015-05-14 21:05 ` [PATCH for-4.2 03/14] block, dm: don't copy bios for request clones Mike Snitzer
@ 2015-05-14 21:05 ` Mike Snitzer
  2015-05-18  8:27   ` [dm-devel] " Christoph Hellwig
  2015-05-14 21:05 ` [PATCH for-4.2 05/14] dm: do not allocate any mempools for blk-mq request-based DM Mike Snitzer
                   ` (9 subsequent siblings)
  13 siblings, 1 reply; 31+ messages in thread
From: Mike Snitzer @ 2015-05-14 21:05 UTC (permalink / raw)
  To: dm-devel; +Cc: Jens Axboe, Joe Thornber, linux-kernel, Mike Snitzer

From: Joe Thornber <ejt@redhat.com>

Useful for callers that wish to manage the asynchronous completion of
discards themselves, without blocking until all of them complete.

blkdev_issue_discard() is updated to call blkdev_issue_discard_async()
and DM thinp will make use of blkdev_issue_discard_async() in the
upcoming "dm thin: range discard support" commit.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
 block/blk-lib.c           | 57 +++++++++++++++++++++++++++++++++++++----------
 include/linux/blk_types.h |  1 +
 include/linux/blkdev.h    |  3 +++
 3 files changed, 49 insertions(+), 12 deletions(-)

diff --git a/block/blk-lib.c b/block/blk-lib.c
index 7688ee3..2815973 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -26,26 +26,36 @@ static void bio_batch_end_io(struct bio *bio, int err)
 	bio_put(bio);
 }
 
+static void bio_batch_set_completion(struct bio *bio, void *data)
+{
+	struct bio_batch *bb = data;
+
+	bio->bi_end_io = bio_batch_end_io;
+	bio->bi_private = data;
+	atomic_inc(&bb->done);
+}
+
 /**
- * blkdev_issue_discard - queue a discard
+ * blkdev_issue_discard_async - queue a discard with async completion
  * @bdev:	blockdev to issue discard for
  * @sector:	start sector
  * @nr_sects:	number of sectors to discard
  * @gfp_mask:	memory allocation flags (for bio_alloc)
  * @flags:	BLKDEV_IFL_* flags to control behaviour
+ * @set_completion: callback to set completion mechanism for discard bios
+ * @data:       callback data passed to @set_completion
  *
  * Description:
  *    Issue a discard request for the sectors in question.
  */
-int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
-		sector_t nr_sects, gfp_t gfp_mask, unsigned long flags)
+int blkdev_issue_discard_async(struct block_device *bdev, sector_t sector,
+			       sector_t nr_sects, gfp_t gfp_mask, unsigned long flags,
+			       bio_discard_completion_t set_completion, void *data)
 {
-	DECLARE_COMPLETION_ONSTACK(wait);
 	struct request_queue *q = bdev_get_queue(bdev);
 	int type = REQ_WRITE | REQ_DISCARD;
 	unsigned int max_discard_sectors, granularity;
 	int alignment;
-	struct bio_batch bb;
 	struct bio *bio;
 	int ret = 0;
 	struct blk_plug plug;
@@ -77,10 +87,6 @@ int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
 		type |= REQ_SECURE;
 	}
 
-	atomic_set(&bb.done, 1);
-	bb.flags = 1 << BIO_UPTODATE;
-	bb.wait = &wait;
-
 	blk_start_plug(&plug);
 	while (nr_sects) {
 		unsigned int req_sects;
@@ -108,16 +114,15 @@ int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
 			req_sects = end_sect - sector;
 		}
 
+		set_completion(bio, data);
+
 		bio->bi_iter.bi_sector = sector;
-		bio->bi_end_io = bio_batch_end_io;
 		bio->bi_bdev = bdev;
-		bio->bi_private = &bb;
 
 		bio->bi_iter.bi_size = req_sects << 9;
 		nr_sects -= req_sects;
 		sector = end_sect;
 
-		atomic_inc(&bb.done);
 		submit_bio(type, bio);
 
 		/*
@@ -129,6 +134,34 @@ int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
 		cond_resched();
 	}
 	blk_finish_plug(&plug);
+	return ret;
+}
+EXPORT_SYMBOL(blkdev_issue_discard_async);
+
+/**
+ * blkdev_issue_discard - queue a discard
+ * @bdev:	blockdev to issue discard for
+ * @sector:	start sector
+ * @nr_sects:	number of sectors to discard
+ * @gfp_mask:	memory allocation flags (for bio_alloc)
+ * @flags:	BLKDEV_IFL_* flags to control behaviour
+ *
+ * Description:
+ *    Issue a discard request for the sectors in question.
+ */
+int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
+			 sector_t nr_sects, gfp_t gfp_mask, unsigned long flags)
+{
+	int ret;
+	DECLARE_COMPLETION_ONSTACK(wait);
+	struct bio_batch bb;
+
+	atomic_set(&bb.done, 1);
+	bb.flags = 1 << BIO_UPTODATE;
+	bb.wait = &wait;
+
+	ret = blkdev_issue_discard_async(bdev, sector, nr_sects, gfp_mask, flags,
+					 bio_batch_set_completion, &bb);
 
 	/* Wait for bios in-flight */
 	if (!atomic_dec_and_test(&bb.done))
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 745588b..a0c00e8 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -16,6 +16,7 @@ struct io_context;
 struct cgroup_subsys_state;
 typedef void (bio_end_io_t) (struct bio *, int);
 typedef void (bio_destructor_t) (struct bio *);
+typedef void (bio_discard_completion_t) (struct bio *, void *);
 
 /*
  * was unsigned short, but we might as well be ready for > 64kB I/O pages
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index cd14e2d..16beac3 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1128,6 +1128,9 @@ static inline struct request *blk_map_queue_find_tag(struct blk_queue_tag *bqt,
 extern int blkdev_issue_flush(struct block_device *, gfp_t, sector_t *);
 extern int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
 		sector_t nr_sects, gfp_t gfp_mask, unsigned long flags);
+extern int blkdev_issue_discard_async(struct block_device *bdev, sector_t sector,
+	        sector_t nr_sects, gfp_t gfp_mask, unsigned long flags,
+		bio_discard_completion_t set_completion, void *data);
 extern int blkdev_issue_write_same(struct block_device *bdev, sector_t sector,
 		sector_t nr_sects, gfp_t gfp_mask, struct page *page);
 extern int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
-- 
2.3.2 (Apple Git-55)



* [PATCH for-4.2 05/14] dm: do not allocate any mempools for blk-mq request-based DM
  2015-05-14 21:04 [PATCH for-4.2 00/14] block, dm: first batch of changes for 4.2 Mike Snitzer
                   ` (3 preceding siblings ...)
  2015-05-14 21:05 ` [PATCH for-4.2 04/14] block: factor out blkdev_issue_discard_async Mike Snitzer
@ 2015-05-14 21:05 ` Mike Snitzer
  2015-05-14 21:05 ` [PATCH for-4.2 06/14] dm: rename methods that requeue requests Mike Snitzer
                   ` (8 subsequent siblings)
  13 siblings, 0 replies; 31+ messages in thread
From: Mike Snitzer @ 2015-05-14 21:05 UTC (permalink / raw)
  To: dm-devel; +Cc: Jens Axboe, Mike Snitzer

Do not allocate the io_pool mempool for blk-mq request-based DM
(DM_TYPE_MQ_REQUEST_BASED) in dm_alloc_rq_mempools().

Also refine __bind_mempools() to be more precise about which mempools
each type of DM device uses -- this avoids mempool churn when DM
tables are reloaded (particularly for DM_TYPE_REQUEST_BASED).
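
A caller-side sketch of the resulting return convention (condensed
from the dm-table.c hunk below): NULL now means "no mempools needed"
(blk-mq), while allocation failure is reported via ERR_PTR():

  t->mempools = dm_alloc_rq_mempools(md, type);
  if (IS_ERR(t->mempools))
      return PTR_ERR(t->mempools);	/* -ENOMEM on failure */
  /* t->mempools may be NULL here: blk-mq request-based DM needs none */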

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
 drivers/md/dm-table.c |  4 +--
 drivers/md/dm.c       | 69 ++++++++++++++++++++++++++++-----------------------
 2 files changed, 40 insertions(+), 33 deletions(-)

diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 3662b2e..2000fea 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -962,8 +962,8 @@ static int dm_table_alloc_md_mempools(struct dm_table *t, struct mapped_device *
 		return -EINVAL;
 	}
 
-	if (!t->mempools)
-		return -ENOMEM;
+	if (IS_ERR(t->mempools))
+		return PTR_ERR(t->mempools);
 
 	return 0;
 }
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 38837f8..8502da7 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -2317,39 +2317,52 @@ static void free_dev(struct mapped_device *md)
 	kfree(md);
 }
 
+static unsigned filter_md_type(unsigned type, struct mapped_device *md)
+{
+	if (type == DM_TYPE_BIO_BASED)
+		return type;
+
+	return !md->use_blk_mq ? DM_TYPE_REQUEST_BASED : DM_TYPE_MQ_REQUEST_BASED;
+}
+
 static void __bind_mempools(struct mapped_device *md, struct dm_table *t)
 {
 	struct dm_md_mempools *p = dm_table_get_md_mempools(t);
 
-	if (md->bs) {
-		/* The md already has necessary mempools. */
-		if (dm_table_get_type(t) == DM_TYPE_BIO_BASED) {
+	switch (filter_md_type(dm_table_get_type(t), md)) {
+	case DM_TYPE_BIO_BASED:
+		if (md->bs && md->io_pool) {
 			/*
+			 * This bio-based md already has necessary mempools.
 			 * Reload bioset because front_pad may have changed
 			 * because a different table was loaded.
 			 */
 			bioset_free(md->bs);
 			md->bs = p->bs;
 			p->bs = NULL;
+			goto out;
 		}
-		/*
-		 * There's no need to reload with request-based dm
-		 * because the size of front_pad doesn't change.
-		 * Note for future: If you are to reload bioset,
-		 * prep-ed requests in the queue may refer
-		 * to bio from the old bioset, so you must walk
-		 * through the queue to unprep.
-		 */
-		goto out;
+		break;
+	case DM_TYPE_REQUEST_BASED:
+		if (md->rq_pool && md->io_pool)
+			/*
+			 * This request-based md already has necessary mempools.
+			 */
+			goto out;
+		break;
+	case DM_TYPE_MQ_REQUEST_BASED:
+		BUG_ON(p); /* No mempools needed */
+		return;
 	}
 
+	BUG_ON(!p || md->io_pool || md->rq_pool || md->bs);
+
 	md->io_pool = p->io_pool;
 	p->io_pool = NULL;
 	md->rq_pool = p->rq_pool;
 	p->rq_pool = NULL;
 	md->bs = p->bs;
 	p->bs = NULL;
-
 out:
 	/* mempool bind completed, no longer need any mempools in the table */
 	dm_table_free_md_mempools(t);
@@ -2726,14 +2739,6 @@ out_tag_set:
 	return err;
 }
 
-static unsigned filter_md_type(unsigned type, struct mapped_device *md)
-{
-	if (type == DM_TYPE_BIO_BASED)
-		return type;
-
-	return !md->use_blk_mq ? DM_TYPE_REQUEST_BASED : DM_TYPE_MQ_REQUEST_BASED;
-}
-
 /*
  * Setup the DM device's queue based on md's type
  */
@@ -3455,7 +3460,7 @@ struct dm_md_mempools *dm_alloc_bio_mempools(unsigned integrity,
 
 	pools = kzalloc(sizeof(*pools), GFP_KERNEL);
 	if (!pools)
-		return NULL;
+		return ERR_PTR(-ENOMEM);
 
 	front_pad = roundup(per_bio_data_size, __alignof__(struct dm_target_io)) +
 		offsetof(struct dm_target_io, clone);
@@ -3474,24 +3479,26 @@ struct dm_md_mempools *dm_alloc_bio_mempools(unsigned integrity,
 	return pools;
 out:
 	dm_free_md_mempools(pools);
-	return NULL;
+	return ERR_PTR(-ENOMEM);
 }
 
 struct dm_md_mempools *dm_alloc_rq_mempools(struct mapped_device *md,
 					    unsigned type)
 {
-	unsigned int pool_size = dm_get_reserved_rq_based_ios();
+	unsigned int pool_size;
 	struct dm_md_mempools *pools;
 
+	if (filter_md_type(type, md) == DM_TYPE_MQ_REQUEST_BASED)
+		return NULL; /* No mempools needed */
+
+	pool_size = dm_get_reserved_rq_based_ios();
 	pools = kzalloc(sizeof(*pools), GFP_KERNEL);
 	if (!pools)
-		return NULL;
+		return ERR_PTR(-ENOMEM);
 
-	if (filter_md_type(type, md) == DM_TYPE_REQUEST_BASED) {
-		pools->rq_pool = mempool_create_slab_pool(pool_size, _rq_cache);
-		if (!pools->rq_pool)
-			goto out;
-	}
+	pools->rq_pool = mempool_create_slab_pool(pool_size, _rq_cache);
+	if (!pools->rq_pool)
+		goto out;
 
 	pools->io_pool = mempool_create_slab_pool(pool_size, _rq_tio_cache);
 	if (!pools->io_pool)
@@ -3500,7 +3507,7 @@ struct dm_md_mempools *dm_alloc_rq_mempools(struct mapped_device *md,
 	return pools;
 out:
 	dm_free_md_mempools(pools);
-	return NULL;
+	return ERR_PTR(-ENOMEM);
 }
 
 void dm_free_md_mempools(struct dm_md_mempools *pools)
-- 
2.3.2 (Apple Git-55)


* [PATCH for-4.2 06/14] dm: rename methods that requeue requests
  2015-05-14 21:04 [PATCH for-4.2 00/14] block, dm: first batch of changes for 4.2 Mike Snitzer
                   ` (4 preceding siblings ...)
  2015-05-14 21:05 ` [PATCH for-4.2 05/14] dm: do not allocate any mempools for blk-mq request-based DM Mike Snitzer
@ 2015-05-14 21:05 ` Mike Snitzer
  2015-05-18  8:29   ` Christoph Hellwig
  2015-05-14 21:05 ` [PATCH for-4.2 07/14] dm: factor out a common cleanup_mapped_device() Mike Snitzer
                   ` (7 subsequent siblings)
  13 siblings, 1 reply; 31+ messages in thread
From: Mike Snitzer @ 2015-05-14 21:05 UTC (permalink / raw)
  To: dm-devel; +Cc: Jens Axboe, Mike Snitzer

More often than not a request that is requeued _is_ mapped (meaning the
clone request is allocated and clone->q is initialized).  Rename
dm_requeue_unmapped_original_request() to dm_requeue_original_request()
and dm_requeue_unmapped_request() to dm_requeue_request() to avoid the
potential for confusion caused by "unmapped" in the function names.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
 drivers/md/dm.c | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 8502da7..c894d13 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1114,8 +1114,8 @@ static void old_requeue_request(struct request *rq)
 	spin_unlock_irqrestore(q->queue_lock, flags);
 }
 
-static void dm_requeue_unmapped_original_request(struct mapped_device *md,
-						 struct request *rq)
+static void dm_requeue_original_request(struct mapped_device *md,
+					struct request *rq)
 {
 	int rw = rq_data_dir(rq);
 
@@ -1131,11 +1131,11 @@ static void dm_requeue_unmapped_original_request(struct mapped_device *md,
 	rq_completed(md, rw, false);
 }
 
-static void dm_requeue_unmapped_request(struct request *clone)
+static void dm_requeue_request(struct request *clone)
 {
 	struct dm_rq_target_io *tio = clone->end_io_data;
 
-	dm_requeue_unmapped_original_request(tio->md, tio->orig);
+	dm_requeue_original_request(tio->md, tio->orig);
 }
 
 static void old_stop_queue(struct request_queue *q)
@@ -1201,7 +1201,7 @@ static void dm_done(struct request *clone, int error, bool mapped)
 		return;
 	else if (r == DM_ENDIO_REQUEUE)
 		/* The target wants to requeue the I/O */
-		dm_requeue_unmapped_request(clone);
+		dm_requeue_request(clone);
 	else {
 		DMWARN("unimplemented target endio return value: %d", r);
 		BUG();
@@ -1904,7 +1904,7 @@ static int map_request(struct dm_rq_target_io *tio, struct request *rq,
 		break;
 	case DM_MAPIO_REQUEUE:
 		/* The target wants to requeue the I/O */
-		dm_requeue_unmapped_request(clone);
+		dm_requeue_request(clone);
 		break;
 	default:
 		if (r > 0) {
@@ -1927,7 +1927,7 @@ static void map_tio_request(struct kthread_work *work)
 	struct mapped_device *md = tio->md;
 
 	if (map_request(tio, rq, md) == DM_MAPIO_REQUEUE)
-		dm_requeue_unmapped_original_request(md, rq);
+		dm_requeue_original_request(md, rq);
 }
 
 static void dm_start_request(struct mapped_device *md, struct request *orig)
@@ -2682,7 +2682,7 @@ static int dm_mq_queue_rq(struct blk_mq_hw_ctx *hctx,
 	} else {
 		/* Direct call is fine since .queue_rq allows allocations */
 		if (map_request(tio, rq, md) == DM_MAPIO_REQUEUE)
-			dm_requeue_unmapped_original_request(md, rq);
+			dm_requeue_original_request(md, rq);
 	}
 
 	return BLK_MQ_RQ_QUEUE_OK;
-- 
2.3.2 (Apple Git-55)


* [PATCH for-4.2 07/14] dm: factor out a common cleanup_mapped_device()
  2015-05-14 21:04 [PATCH for-4.2 00/14] block, dm: first batch of changes for 4.2 Mike Snitzer
                   ` (5 preceding siblings ...)
  2015-05-14 21:05 ` [PATCH for-4.2 06/14] dm: rename methods that requeue requests Mike Snitzer
@ 2015-05-14 21:05 ` Mike Snitzer
  2015-05-14 21:05 ` [PATCH for-4.2 08/14] dm btree: add dm_btree_remove_leaves() Mike Snitzer
                   ` (6 subsequent siblings)
  13 siblings, 0 replies; 31+ messages in thread
From: Mike Snitzer @ 2015-05-14 21:05 UTC (permalink / raw)
  To: dm-devel; +Cc: Jens Axboe, Mike Snitzer

Introduce a single common function for cleaning up a DM device's
mapped_device.  No functional change, just eliminates duplication of
delicate mapped_device cleanup code.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
 drivers/md/dm.c | 78 +++++++++++++++++++++++++++++++--------------------------
 1 file changed, 43 insertions(+), 35 deletions(-)

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index c894d13..d31f7a3 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -2167,6 +2167,40 @@ static void dm_init_old_md_queue(struct mapped_device *md)
 	blk_queue_bounce_limit(md->queue, BLK_BOUNCE_ANY);
 }
 
+static void cleanup_mapped_device(struct mapped_device *md)
+{
+	cleanup_srcu_struct(&md->io_barrier);
+
+	if (md->wq)
+		destroy_workqueue(md->wq);
+	if (md->kworker_task)
+		kthread_stop(md->kworker_task);
+	if (md->io_pool)
+		mempool_destroy(md->io_pool);
+	if (md->rq_pool)
+		mempool_destroy(md->rq_pool);
+	if (md->bs)
+		bioset_free(md->bs);
+
+	if (md->disk) {
+		spin_lock(&_minor_lock);
+		md->disk->private_data = NULL;
+		spin_unlock(&_minor_lock);
+		if (blk_get_integrity(md->disk))
+			blk_integrity_unregister(md->disk);
+		del_gendisk(md->disk);
+		put_disk(md->disk);
+	}
+
+	if (md->queue)
+		blk_cleanup_queue(md->queue);
+
+	if (md->bdev) {
+		bdput(md->bdev);
+		md->bdev = NULL;
+	}
+}
+
 /*
  * Allocate and initialise a blank device with a given minor.
  */
@@ -2212,13 +2246,13 @@ static struct mapped_device *alloc_dev(int minor)
 
 	md->queue = blk_alloc_queue(GFP_KERNEL);
 	if (!md->queue)
-		goto bad_queue;
+		goto bad;
 
 	dm_init_md_queue(md);
 
 	md->disk = alloc_disk(1);
 	if (!md->disk)
-		goto bad_disk;
+		goto bad;
 
 	atomic_set(&md->pending[0], 0);
 	atomic_set(&md->pending[1], 0);
@@ -2239,11 +2273,11 @@ static struct mapped_device *alloc_dev(int minor)
 
 	md->wq = alloc_workqueue("kdmflush", WQ_MEM_RECLAIM, 0);
 	if (!md->wq)
-		goto bad_thread;
+		goto bad;
 
 	md->bdev = bdget_disk(md->disk, 0);
 	if (!md->bdev)
-		goto bad_bdev;
+		goto bad;
 
 	bio_init(&md->flush_bio);
 	md->flush_bio.bi_bdev = md->bdev;
@@ -2260,15 +2294,8 @@ static struct mapped_device *alloc_dev(int minor)
 
 	return md;
 
-bad_bdev:
-	destroy_workqueue(md->wq);
-bad_thread:
-	del_gendisk(md->disk);
-	put_disk(md->disk);
-bad_disk:
-	blk_cleanup_queue(md->queue);
-bad_queue:
-	cleanup_srcu_struct(&md->io_barrier);
+bad:
+	cleanup_mapped_device(md);
 bad_io_barrier:
 	free_minor(minor);
 bad_minor:
@@ -2285,32 +2312,13 @@ static void free_dev(struct mapped_device *md)
 	int minor = MINOR(disk_devt(md->disk));
 
 	unlock_fs(md);
-	destroy_workqueue(md->wq);
 
-	if (md->kworker_task)
-		kthread_stop(md->kworker_task);
-	if (md->io_pool)
-		mempool_destroy(md->io_pool);
-	if (md->rq_pool)
-		mempool_destroy(md->rq_pool);
-	if (md->bs)
-		bioset_free(md->bs);
+	cleanup_mapped_device(md);
+	if (md->use_blk_mq)
+		blk_mq_free_tag_set(&md->tag_set);
 
-	cleanup_srcu_struct(&md->io_barrier);
 	free_table_devices(&md->table_devices);
 	dm_stats_cleanup(&md->stats);
-
-	spin_lock(&_minor_lock);
-	md->disk->private_data = NULL;
-	spin_unlock(&_minor_lock);
-	if (blk_get_integrity(md->disk))
-		blk_integrity_unregister(md->disk);
-	del_gendisk(md->disk);
-	put_disk(md->disk);
-	blk_cleanup_queue(md->queue);
-	if (md->use_blk_mq)
-		blk_mq_free_tag_set(&md->tag_set);
-	bdput(md->bdev);
 	free_minor(minor);
 
 	module_put(THIS_MODULE);
-- 
2.3.2 (Apple Git-55)


* [PATCH for-4.2 08/14] dm btree: add dm_btree_remove_leaves()
  2015-05-14 21:04 [PATCH for-4.2 00/14] block, dm: first batch of changes for 4.2 Mike Snitzer
                   ` (6 preceding siblings ...)
  2015-05-14 21:05 ` [PATCH for-4.2 07/14] dm: factor out a common cleanup_mapped_device() Mike Snitzer
@ 2015-05-14 21:05 ` Mike Snitzer
  2015-05-14 21:05 ` [PATCH for-4.2 09/14] dm thin metadata: add dm_thin_find_mapped_range() Mike Snitzer
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 31+ messages in thread
From: Mike Snitzer @ 2015-05-14 21:05 UTC (permalink / raw)
  To: dm-devel; +Cc: Jens Axboe, Joe Thornber, Mike Snitzer

From: Joe Thornber <ejt@redhat.com>

Add dm_btree_remove_leaves(), which removes a range of leaf values
from a btree.
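
A hypothetical usage sketch for a single-level btree ('info', 'root',
'first_block' and 'one_past_end_block' are assumed to be set up
elsewhere; the thin-pool mapping tree that will use this is two-level):

  dm_block_t new_root;
  unsigned nr_removed;
  uint64_t key = first_block;	/* note: may be altered by the call */
  int r;

  r = dm_btree_remove_leaves(info, root, &key, one_past_end_block,
                             &new_root, &nr_removed);
  if (r)
      return r;
  /* leaves in [first_block, one_past_end_block) are now removed;
   * 'new_root' must be committed in place of 'root' */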

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
 drivers/md/persistent-data/dm-btree-remove.c | 127 +++++++++++++++++++++++++++
 drivers/md/persistent-data/dm-btree.h        |   9 ++
 2 files changed, 136 insertions(+)

diff --git a/drivers/md/persistent-data/dm-btree-remove.c b/drivers/md/persistent-data/dm-btree-remove.c
index b88757c..e04cfd2 100644
--- a/drivers/md/persistent-data/dm-btree-remove.c
+++ b/drivers/md/persistent-data/dm-btree-remove.c
@@ -590,3 +590,130 @@ int dm_btree_remove(struct dm_btree_info *info, dm_block_t root,
 	return r;
 }
 EXPORT_SYMBOL_GPL(dm_btree_remove);
+
+/*----------------------------------------------------------------*/
+
+static int remove_nearest(struct shadow_spine *s, struct dm_btree_info *info,
+			  struct dm_btree_value_type *vt, dm_block_t root,
+			  uint64_t key, int *index)
+{
+	int i = *index, r;
+	struct btree_node *n;
+
+	for (;;) {
+		r = shadow_step(s, root, vt);
+		if (r < 0)
+			break;
+
+		/*
+		 * We have to patch up the parent node, ugly, but I don't
+		 * see a way to do this automatically as part of the spine
+		 * op.
+		 */
+		if (shadow_has_parent(s)) {
+			__le64 location = cpu_to_le64(dm_block_location(shadow_current(s)));
+			memcpy(value_ptr(dm_block_data(shadow_parent(s)), i),
+			       &location, sizeof(__le64));
+		}
+
+		n = dm_block_data(shadow_current(s));
+
+		if (le32_to_cpu(n->header.flags) & LEAF_NODE) {
+			*index = lower_bound(n, key);
+			return 0;
+		}
+
+		r = rebalance_children(s, info, vt, key);
+		if (r)
+			break;
+
+		n = dm_block_data(shadow_current(s));
+		if (le32_to_cpu(n->header.flags) & LEAF_NODE) {
+			*index = lower_bound(n, key);
+			return 0;
+		}
+
+		i = lower_bound(n, key);
+
+		/*
+		 * We know the key is present, or else
+		 * rebalance_children would have returned
+		 * -ENODATA
+		 */
+		root = value64(n, i);
+	}
+
+	return r;
+}
+
+static int remove_one(struct dm_btree_info *info, dm_block_t root,
+		      uint64_t *keys, uint64_t end_key,
+		      dm_block_t *new_root, unsigned *nr_removed)
+{
+	unsigned level, last_level = info->levels - 1;
+	int index = 0, r = 0;
+	struct shadow_spine spine;
+	struct btree_node *n;
+	uint64_t k;
+
+	init_shadow_spine(&spine, info);
+	for (level = 0; level < last_level; level++) {
+		r = remove_raw(&spine, info, &le64_type,
+			       root, keys[level], (unsigned *) &index);
+		if (r < 0)
+			goto out;
+
+		n = dm_block_data(shadow_current(&spine));
+		root = value64(n, index);
+	}
+
+	r = remove_nearest(&spine, info, &info->value_type,
+			   root, keys[last_level], &index);
+	if (r < 0)
+		goto out;
+
+	n = dm_block_data(shadow_current(&spine));
+
+	if (index < 0)
+		index = 0;
+
+	if (index >= le32_to_cpu(n->header.nr_entries)) {
+		r = -ENODATA;
+		goto out;
+	}
+
+	k = le64_to_cpu(n->keys[index]);
+	if (k >= keys[last_level] && k < end_key) {
+		if (info->value_type.dec)
+			info->value_type.dec(info->value_type.context,
+					     value_ptr(n, index));
+
+		delete_at(n, index);
+
+	} else
+		r = -ENODATA;
+
+out:
+	*new_root = shadow_root(&spine);
+	exit_shadow_spine(&spine);
+
+	return r;
+}
+
+int dm_btree_remove_leaves(struct dm_btree_info *info, dm_block_t root,
+			   uint64_t *first_key, uint64_t end_key,
+			   dm_block_t *new_root, unsigned *nr_removed)
+{
+	int r;
+
+	*nr_removed = 0;
+	do {
+		r = remove_one(info, root, first_key, end_key, &root, nr_removed);
+		if (!r)
+			(*nr_removed)++;
+	} while (!r);
+
+	*new_root = root;
+	return r == -ENODATA ? 0 : r;
+}
+EXPORT_SYMBOL_GPL(dm_btree_remove_leaves);
diff --git a/drivers/md/persistent-data/dm-btree.h b/drivers/md/persistent-data/dm-btree.h
index dacfc34..11d8cf7 100644
--- a/drivers/md/persistent-data/dm-btree.h
+++ b/drivers/md/persistent-data/dm-btree.h
@@ -135,6 +135,15 @@ int dm_btree_remove(struct dm_btree_info *info, dm_block_t root,
 		    uint64_t *keys, dm_block_t *new_root);
 
 /*
+ * Removes values between 'keys' and keys2, where keys2 is keys with the
+ * final key replaced with 'end_key'.  'end_key' is the one-past-the-end
+ * value.  'keys' may be altered.
+ */
+int dm_btree_remove_leaves(struct dm_btree_info *info, dm_block_t root,
+			   uint64_t *keys, uint64_t end_key,
+			   dm_block_t *new_root, unsigned *nr_removed);
+
+/*
  * Returns < 0 on failure.  Otherwise the number of key entries that have
  * been filled out.  Remember trees can have zero entries, and as such have
  * no lowest key.
-- 
2.3.2 (Apple Git-55)

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH for-4.2 09/14] dm thin metadata: add dm_thin_find_mapped_range()
  2015-05-14 21:04 [PATCH for-4.2 00/14] block, dm: first batch of changes for 4.2 Mike Snitzer
                   ` (7 preceding siblings ...)
  2015-05-14 21:05 ` [PATCH for-4.2 08/14] dm btree: add dm_btree_remove_leaves() Mike Snitzer
@ 2015-05-14 21:05 ` Mike Snitzer
  2015-05-14 21:05 ` [PATCH for-4.2 10/14] dm thin metadata: add dm_thin_remove_range() Mike Snitzer
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 31+ messages in thread
From: Mike Snitzer @ 2015-05-14 21:05 UTC (permalink / raw)
  To: dm-devel; +Cc: Jens Axboe, Joe Thornber, Mike Snitzer

From: Joe Thornber <ejt@redhat.com>

Retrieve the next run of contiguously mapped blocks.  Useful for working
out where to break up IO.
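
For illustration, a caller could walk every mapped run in a half-open
interval with a loop like the following (a rough sketch only;
walk_mapped_runs() and visit_run() are made-up names, not part of this
patch):

	/* sketch: visit every mapped run in [begin, end) */
	static int walk_mapped_runs(struct dm_thin_device *td,
				    dm_block_t begin, dm_block_t end)
	{
		int r;
		dm_block_t thin_b, thin_e, pool_b;
		bool maybe_shared;

		while (begin < end) {
			r = dm_thin_find_mapped_range(td, begin, end,
						      &thin_b, &thin_e,
						      &pool_b, &maybe_shared);
			if (r == -ENODATA)
				break;		/* no more mapped runs */
			if (r)
				return r;

			/* [thin_b, thin_e) maps to pool blocks from pool_b */
			visit_run(thin_b, thin_e, pool_b, maybe_shared);

			begin = thin_e;		/* resume after this run */
		}

		return 0;
	}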

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
 drivers/md/dm-thin-metadata.c | 57 +++++++++++++++++++++++++++++++++++++++++++
 drivers/md/dm-thin-metadata.h |  9 +++++++
 2 files changed, 66 insertions(+)

diff --git a/drivers/md/dm-thin-metadata.c b/drivers/md/dm-thin-metadata.c
index 79f6941..edcf4fb 100644
--- a/drivers/md/dm-thin-metadata.c
+++ b/drivers/md/dm-thin-metadata.c
@@ -1419,6 +1419,63 @@ int dm_thin_find_block(struct dm_thin_device *td, dm_block_t block,
 	return r;
 }
 
+/* FIXME: write a more efficient one in btree */
+int dm_thin_find_mapped_range(struct dm_thin_device *td,
+			      dm_block_t begin, dm_block_t end,
+			      dm_block_t *thin_begin, dm_block_t *thin_end,
+			      dm_block_t *pool_begin, bool *maybe_shared)
+{
+	int r;
+	dm_block_t pool_end;
+	struct dm_thin_lookup_result lookup;
+
+	if (end < begin)
+		return -ENODATA;
+
+	/*
+	 * Find first mapped block.
+	 */
+	while (begin < end) {
+		r = dm_thin_find_block(td, begin, true, &lookup);
+		if (r) {
+			if (r != -ENODATA)
+				return r;
+		} else
+			break;
+
+		begin++;
+	}
+
+	if (begin == end)
+		return -ENODATA;
+
+	*thin_begin = begin;
+	*pool_begin = lookup.block;
+	*maybe_shared = lookup.shared;
+
+	begin++;
+	pool_end = *pool_begin + 1;
+	while (begin != end) {
+		r = dm_thin_find_block(td, begin, true, &lookup);
+		if (r) {
+			if (r == -ENODATA)
+				break;
+			else
+				return r;
+		}
+
+		if ((lookup.block != pool_end) ||
+		    (lookup.shared != *maybe_shared))
+			break;
+
+		pool_end++;
+		begin++;
+	}
+
+	*thin_end = begin;
+	return 0;
+}
+
 static int __insert(struct dm_thin_device *td, dm_block_t block,
 		    dm_block_t data_block)
 {
diff --git a/drivers/md/dm-thin-metadata.h b/drivers/md/dm-thin-metadata.h
index fac01a9..f11f140 100644
--- a/drivers/md/dm-thin-metadata.h
+++ b/drivers/md/dm-thin-metadata.h
@@ -147,6 +147,15 @@ int dm_thin_find_block(struct dm_thin_device *td, dm_block_t block,
 		       int can_issue_io, struct dm_thin_lookup_result *result);
 
 /*
+ * Retrieve the next run of contiguously mapped blocks.  Useful for working
+ * out where to break up IO.  Returns 0 on success, < 0 on error.
+ */
+int dm_thin_find_mapped_range(struct dm_thin_device *td,
+			      dm_block_t begin, dm_block_t end,
+			      dm_block_t *thin_begin, dm_block_t *thin_end,
+			      dm_block_t *pool_begin, bool *maybe_shared);
+
+/*
  * Obtain an unused block.
  */
 int dm_pool_alloc_data_block(struct dm_pool_metadata *pmd, dm_block_t *result);
-- 
2.3.2 (Apple Git-55)

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH for-4.2 10/14] dm thin metadata: add dm_thin_remove_range()
  2015-05-14 21:04 [PATCH for-4.2 00/14] block, dm: first batch of changes for 4.2 Mike Snitzer
                   ` (8 preceding siblings ...)
  2015-05-14 21:05 ` [PATCH for-4.2 09/14] dm thin metadata: add dm_thin_find_mapped_range() Mike Snitzer
@ 2015-05-14 21:05 ` Mike Snitzer
  2015-05-14 21:05 ` [PATCH for-4.2 11/14] dm thin: range discard support Mike Snitzer
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 31+ messages in thread
From: Mike Snitzer @ 2015-05-14 21:05 UTC (permalink / raw)
  To: dm-devel; +Cc: Jens Axboe, Joe Thornber, Mike Snitzer

From: Joe Thornber <ejt@redhat.com>

Removes a range of blocks from the btree.
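
An illustrative call (not part of this patch) unmaps a half-open
interval of a thin device's mappings:

	/* hypothetical usage: unmap thin blocks [begin, end) of 'td' */
	r = dm_thin_remove_range(td, begin, end);
	if (r)
		DMERR("dm_thin_remove_range failed: %d", r);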

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
 drivers/md/dm-thin-metadata.c | 54 +++++++++++++++++++++++++++++++++++++++++++
 drivers/md/dm-thin-metadata.h |  2 ++
 2 files changed, 56 insertions(+)

diff --git a/drivers/md/dm-thin-metadata.c b/drivers/md/dm-thin-metadata.c
index edcf4fb..40359e9 100644
--- a/drivers/md/dm-thin-metadata.c
+++ b/drivers/md/dm-thin-metadata.c
@@ -1528,6 +1528,47 @@ static int __remove(struct dm_thin_device *td, dm_block_t block)
 	return 0;
 }
 
+static int __remove_range(struct dm_thin_device *td, dm_block_t begin, dm_block_t end)
+{
+	int r;
+	unsigned count;
+	struct dm_pool_metadata *pmd = td->pmd;
+	dm_block_t keys[1] = { td->id };
+	__le64 value;
+	dm_block_t mapping_root;
+
+	/*
+	 * Find the mapping tree
+	 */
+	r = dm_btree_lookup(&pmd->tl_info, pmd->root, keys, &value);
+	if (r)
+		return r;
+
+	/*
+	 * Remove from the mapping tree, taking care to inc the
+	 * ref count so it doesn't get deleted.
+	 */
+	mapping_root = le64_to_cpu(value);
+	dm_tm_inc(pmd->tm, mapping_root);
+	r = dm_btree_remove(&pmd->tl_info, pmd->root, keys, &pmd->root);
+	if (r)
+		return r;
+
+	r = dm_btree_remove_leaves(&pmd->bl_info, mapping_root, &begin, end, &mapping_root, &count);
+	if (r)
+		return r;
+
+	td->mapped_blocks -= count;
+	td->changed = 1;
+
+	/*
+	 * Reinsert the mapping tree.
+	 */
+	value = cpu_to_le64(mapping_root);
+	__dm_bless_for_disk(&value);
+	return dm_btree_insert(&pmd->tl_info, pmd->root, keys, &value, &pmd->root);
+}
+
 int dm_thin_remove_block(struct dm_thin_device *td, dm_block_t block)
 {
 	int r = -EINVAL;
@@ -1540,6 +1581,19 @@ int dm_thin_remove_block(struct dm_thin_device *td, dm_block_t block)
 	return r;
 }
 
+int dm_thin_remove_range(struct dm_thin_device *td,
+			 dm_block_t begin, dm_block_t end)
+{
+	int r = -EINVAL;
+
+	down_write(&td->pmd->root_lock);
+	if (!td->pmd->fail_io)
+		r = __remove_range(td, begin, end);
+	up_write(&td->pmd->root_lock);
+
+	return r;
+}
+
 int dm_pool_block_is_used(struct dm_pool_metadata *pmd, dm_block_t b, bool *result)
 {
 	int r;
diff --git a/drivers/md/dm-thin-metadata.h b/drivers/md/dm-thin-metadata.h
index f11f140..a938bab 100644
--- a/drivers/md/dm-thin-metadata.h
+++ b/drivers/md/dm-thin-metadata.h
@@ -167,6 +167,8 @@ int dm_thin_insert_block(struct dm_thin_device *td, dm_block_t block,
 			 dm_block_t data_block);
 
 int dm_thin_remove_block(struct dm_thin_device *td, dm_block_t block);
+int dm_thin_remove_range(struct dm_thin_device *td,
+			 dm_block_t begin, dm_block_t end);
 
 /*
  * Queries.
-- 
2.3.2 (Apple Git-55)

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH for-4.2 11/14] dm thin: range discard support
  2015-05-14 21:04 [PATCH for-4.2 00/14] block, dm: first batch of changes for 4.2 Mike Snitzer
                   ` (9 preceding siblings ...)
  2015-05-14 21:05 ` [PATCH for-4.2 10/14] dm thin metadata: add dm_thin_remove_range() Mike Snitzer
@ 2015-05-14 21:05 ` Mike Snitzer
  2015-05-14 21:05 ` [PATCH for-4.2 12/14] dm thin: cleanup overwrite's endio restore to be centralized Mike Snitzer
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 31+ messages in thread
From: Mike Snitzer @ 2015-05-14 21:05 UTC (permalink / raw)
  To: dm-devel; +Cc: Jens Axboe, Joe Thornber, Mike Snitzer

From: Joe Thornber <ejt@redhat.com>

Previously, REQ_DISCARD bios were split into block-sized chunks before
submission to the thin target.  There are a couple of issues with this:

 - If the block size is small, a large discard request can get broken
   up into a great many bios, which is both slow and causes a lot of
   memory pressure.

 - The thin pool block size and the discard granularity for the
   underlying data device need to be compatible if we want to pass
   down the discard.

This patch relaxes the block size granularity for thin devices.  It
makes use of the recent range locking added to the bio_prison to
quiesce a whole range of thin blocks before unmapping them.  Once a
thin range has been unmapped, the discard can then be passed down to
the data device for those sub-ranges where the data blocks are no
longer used (i.e. they weren't shared in the first place).
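
For example, with 128-sector thin blocks a discard of sectors 100-499
only covers the complete blocks 1 and 2: the get_bio_block_range()
helper added below rounds the start up and the end down, yielding the
block range [1, 3); the partial head and tail of the bio are simply
completed without being unmapped.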

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
 drivers/md/dm-thin.c | 463 ++++++++++++++++++++++++++++++++++-----------------
 1 file changed, 314 insertions(+), 149 deletions(-)

diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
index e852602c..aa71bdd 100644
--- a/drivers/md/dm-thin.c
+++ b/drivers/md/dm-thin.c
@@ -111,22 +111,30 @@ DECLARE_DM_KCOPYD_THROTTLE_WITH_MODULE_PARM(snapshot_copy_throttle,
 /*
  * Key building.
  */
-static void build_data_key(struct dm_thin_device *td,
-			   dm_block_t b, struct dm_cell_key *key)
+enum lock_space {
+	VIRTUAL,
+	PHYSICAL
+};
+
+static void build_key(struct dm_thin_device *td, enum lock_space ls,
+		      dm_block_t b, dm_block_t e, struct dm_cell_key *key)
 {
-	key->virtual = 0;
+	key->virtual = (ls == VIRTUAL);
 	key->dev = dm_thin_dev_id(td);
 	key->block_begin = b;
-	key->block_end = b + 1ULL;
+	key->block_end = e;
+}
+
+static void build_data_key(struct dm_thin_device *td, dm_block_t b,
+			   struct dm_cell_key *key)
+{
+	build_key(td, PHYSICAL, b, b + 1llu, key);
 }
 
 static void build_virtual_key(struct dm_thin_device *td, dm_block_t b,
 			      struct dm_cell_key *key)
 {
-	key->virtual = 1;
-	key->dev = dm_thin_dev_id(td);
-	key->block_begin = b;
-	key->block_end = b + 1ULL;
+	build_key(td, VIRTUAL, b, b + 1llu, key);
 }
 
 /*----------------------------------------------------------------*/
@@ -312,6 +320,35 @@ struct thin_c {
 
 /*----------------------------------------------------------------*/
 
+static bool block_size_is_power_of_two(struct pool *pool)
+{
+	return pool->sectors_per_block_shift >= 0;
+}
+
+static sector_t block_to_sectors(struct pool *pool, dm_block_t b)
+{
+	return block_size_is_power_of_two(pool) ?
+		(b << pool->sectors_per_block_shift) :
+		(b * pool->sectors_per_block);
+}
+
+static void dm_thin_discard_set_completion(struct bio *bio, void *data)
+{
+	bio_chain(bio, data);
+}
+
+static int issue_discard(struct thin_c *tc, dm_block_t data_b, dm_block_t data_e,
+			 struct bio *parent_bio)
+{
+	sector_t s = block_to_sectors(tc->pool, data_b);
+	sector_t len = block_to_sectors(tc->pool, data_e - data_b);
+
+	return blkdev_issue_discard_async(tc->pool_dev->bdev, s, len, GFP_NOWAIT, 0,
+					  dm_thin_discard_set_completion, parent_bio);
+}
+
+/*----------------------------------------------------------------*/
+
 /*
  * wake_worker() is used when new work is queued and when pool_resume is
  * ready to continue deferred IO processing.
@@ -461,6 +498,7 @@ struct dm_thin_endio_hook {
 	struct dm_deferred_entry *all_io_entry;
 	struct dm_thin_new_mapping *overwrite_mapping;
 	struct rb_node rb_node;
+	struct dm_bio_prison_cell *cell;
 };
 
 static void __merge_bio_list(struct bio_list *bios, struct bio_list *master)
@@ -541,11 +579,6 @@ static void error_retry_list(struct pool *pool)
  * target.
  */
 
-static bool block_size_is_power_of_two(struct pool *pool)
-{
-	return pool->sectors_per_block_shift >= 0;
-}
-
 static dm_block_t get_bio_block(struct thin_c *tc, struct bio *bio)
 {
 	struct pool *pool = tc->pool;
@@ -559,6 +592,34 @@ static dm_block_t get_bio_block(struct thin_c *tc, struct bio *bio)
 	return block_nr;
 }
 
+/*
+ * Returns the _complete_ blocks that this bio covers.
+ */
+static void get_bio_block_range(struct thin_c *tc, struct bio *bio,
+				dm_block_t *begin, dm_block_t *end)
+{
+	struct pool *pool = tc->pool;
+	sector_t b = bio->bi_iter.bi_sector;
+	sector_t e = b + (bio->bi_iter.bi_size >> SECTOR_SHIFT);
+
+	b += pool->sectors_per_block - 1ull; /* so we round up */
+
+	if (block_size_is_power_of_two(pool)) {
+		b >>= pool->sectors_per_block_shift;
+		e >>= pool->sectors_per_block_shift;
+	} else {
+		(void) sector_div(b, pool->sectors_per_block);
+		(void) sector_div(e, pool->sectors_per_block);
+	}
+
+	if (e < b)
+		/* Can happen if the bio is within a single block. */
+		e = b;
+
+	*begin = b;
+	*end = e;
+}
+
 static void remap(struct thin_c *tc, struct bio *bio, dm_block_t block)
 {
 	struct pool *pool = tc->pool;
@@ -647,7 +708,7 @@ struct dm_thin_new_mapping {
 	struct list_head list;
 
 	bool pass_discard:1;
-	bool definitely_not_shared:1;
+	bool maybe_shared:1;
 
 	/*
 	 * Track quiescing, copying and zeroing preparation actions.  When this
@@ -658,9 +719,9 @@ struct dm_thin_new_mapping {
 
 	int err;
 	struct thin_c *tc;
-	dm_block_t virt_block;
+	dm_block_t virt_begin, virt_end;
 	dm_block_t data_block;
-	struct dm_bio_prison_cell *cell, *cell2;
+	struct dm_bio_prison_cell *cell;
 
 	/*
 	 * If the bio covers the whole area of a block then we can avoid
@@ -822,7 +883,7 @@ static void process_prepared_mapping(struct dm_thin_new_mapping *m)
 	 * Any I/O for this block arriving after this point will get
 	 * remapped to it directly.
 	 */
-	r = dm_thin_insert_block(tc->td, m->virt_block, m->data_block);
+	r = dm_thin_insert_block(tc->td, m->cell->key.block_begin, m->data_block);
 	if (r) {
 		metadata_operation_failed(pool, "dm_thin_insert_block", r);
 		cell_error(pool, m->cell);
@@ -849,50 +910,112 @@ out:
 	mempool_free(m, pool->mapping_pool);
 }
 
-static void process_prepared_discard_fail(struct dm_thin_new_mapping *m)
+/*----------------------------------------------------------------*/
+
+static void free_discard_mapping(struct dm_thin_new_mapping *m)
 {
 	struct thin_c *tc = m->tc;
+	if (m->cell)
+		cell_defer_no_holder(tc, m->cell);
+	mempool_free(m, tc->pool->mapping_pool);
+}
 
+static void process_prepared_discard_fail(struct dm_thin_new_mapping *m)
+{
 	bio_io_error(m->bio);
+	free_discard_mapping(m);
+}
+
+static void process_prepared_discard_success(struct dm_thin_new_mapping *m)
+{
+	bio_endio(m->bio, 0);
+	free_discard_mapping(m);
+}
+
+static void process_prepared_discard_no_passdown(struct dm_thin_new_mapping *m)
+{
+	int r;
+	struct thin_c *tc = m->tc;
+
+	r = dm_thin_remove_range(tc->td, m->cell->key.block_begin, m->cell->key.block_end);
+	if (r) {
+		metadata_operation_failed(tc->pool, "dm_thin_remove_range", r);
+		bio_io_error(m->bio);
+	} else
+		bio_endio(m->bio, 0);
+
 	cell_defer_no_holder(tc, m->cell);
-	cell_defer_no_holder(tc, m->cell2);
 	mempool_free(m, tc->pool->mapping_pool);
 }
 
-static void process_prepared_discard_passdown(struct dm_thin_new_mapping *m)
+static int passdown_double_checking_shared_status(struct dm_thin_new_mapping *m)
 {
+	/*
+	 * We've already unmapped this range of blocks, but before we
+	 * passdown we have to check that these blocks are now unused.
+	 */
+	int r;
+	bool used = true;
 	struct thin_c *tc = m->tc;
+	struct pool *pool = tc->pool;
+	dm_block_t b = m->data_block, e, end = m->data_block + m->virt_end - m->virt_begin;
 
-	inc_all_io_entry(tc->pool, m->bio);
-	cell_defer_no_holder(tc, m->cell);
-	cell_defer_no_holder(tc, m->cell2);
+	while (b != end) {
+		/* find start of unmapped run */
+		for (; b < end; b++) {
+			r = dm_pool_block_is_used(pool->pmd, b, &used);
+			if (r)
+				return r;
 
-	if (m->pass_discard)
-		if (m->definitely_not_shared)
-			remap_and_issue(tc, m->bio, m->data_block);
-		else {
-			bool used = false;
-			if (dm_pool_block_is_used(tc->pool->pmd, m->data_block, &used) || used)
-				bio_endio(m->bio, 0);
-			else
-				remap_and_issue(tc, m->bio, m->data_block);
+			if (!used)
+				break;
 		}
-	else
-		bio_endio(m->bio, 0);
 
-	mempool_free(m, tc->pool->mapping_pool);
+		if (b == end)
+			break;
+
+		/* find end of run */
+		for (e = b + 1; e != end; e++) {
+			r = dm_pool_block_is_used(pool->pmd, e, &used);
+			if (r)
+				return r;
+
+			if (used)
+				break;
+		}
+
+		r = issue_discard(tc, b, e, m->bio);
+		if (r)
+			return r;
+
+		b = e;
+	}
+
+	return 0;
 }
 
-static void process_prepared_discard(struct dm_thin_new_mapping *m)
+static void process_prepared_discard_passdown(struct dm_thin_new_mapping *m)
 {
 	int r;
 	struct thin_c *tc = m->tc;
+	struct pool *pool = tc->pool;
 
-	r = dm_thin_remove_block(tc->td, m->virt_block);
+	r = dm_thin_remove_range(tc->td, m->virt_begin, m->virt_end);
 	if (r)
-		DMERR_LIMIT("dm_thin_remove_block() failed");
+		metadata_operation_failed(pool, "dm_thin_remove_range", r);
 
-	process_prepared_discard_passdown(m);
+	else if (m->maybe_shared)
+		r = passdown_double_checking_shared_status(m);
+	else
+		r = issue_discard(tc, m->data_block, m->data_block + (m->virt_end - m->virt_begin), m->bio);
+
+	/*
+	 * Even if r is set, there could be sub discards in flight that we
+	 * need to wait for.
+	 */
+	bio_endio(m->bio, r);
+	cell_defer_no_holder(tc, m->cell);
+	mempool_free(m, pool->mapping_pool);
 }
 
 static void process_prepared(struct pool *pool, struct list_head *head,
@@ -976,7 +1099,7 @@ static void ll_zero(struct thin_c *tc, struct dm_thin_new_mapping *m,
 }
 
 static void remap_and_issue_overwrite(struct thin_c *tc, struct bio *bio,
-				      dm_block_t data_block,
+				      dm_block_t data_begin,
 				      struct dm_thin_new_mapping *m)
 {
 	struct pool *pool = tc->pool;
@@ -986,7 +1109,7 @@ static void remap_and_issue_overwrite(struct thin_c *tc, struct bio *bio,
 	m->bio = bio;
 	save_and_set_endio(bio, &m->saved_bi_end_io, overwrite_endio);
 	inc_all_io_entry(pool, bio);
-	remap_and_issue(tc, bio, data_block);
+	remap_and_issue(tc, bio, data_begin);
 }
 
 /*
@@ -1003,7 +1126,6 @@ static void schedule_copy(struct thin_c *tc, dm_block_t virt_block,
 	struct dm_thin_new_mapping *m = get_next_mapping(pool);
 
 	m->tc = tc;
-	m->virt_block = virt_block;
 	m->data_block = data_dest;
 	m->cell = cell;
 
@@ -1082,7 +1204,8 @@ static void schedule_zero(struct thin_c *tc, dm_block_t virt_block,
 
 	atomic_set(&m->prepare_actions, 1); /* no need to quiesce */
 	m->tc = tc;
-	m->virt_block = virt_block;
+	m->virt_begin = virt_block;
+	m->virt_end = virt_block + 1u;
 	m->data_block = data_block;
 	m->cell = cell;
 
@@ -1291,99 +1414,134 @@ static void retry_bios_on_resume(struct pool *pool, struct dm_bio_prison_cell *c
 		retry_on_resume(bio);
 }
 
-static void process_discard_cell(struct thin_c *tc, struct dm_bio_prison_cell *cell)
+static void process_discard_cell_no_passdown(struct thin_c *tc,
+					     struct dm_bio_prison_cell *virt_cell)
 {
-	int r;
-	struct bio *bio = cell->holder;
 	struct pool *pool = tc->pool;
-	struct dm_bio_prison_cell *cell2;
-	struct dm_cell_key key2;
-	dm_block_t block = get_bio_block(tc, bio);
-	struct dm_thin_lookup_result lookup_result;
-	struct dm_thin_new_mapping *m;
+	struct dm_thin_new_mapping *m = get_next_mapping(pool);
 
-	if (tc->requeue_mode) {
-		cell_requeue(pool, cell);
-		return;
-	}
+	/*
+	 * We don't need to lock the data blocks, since there's no
+	 * passdown.  We only lock data blocks for allocation and breaking sharing.
+	 */
+	m->tc = tc;
+	m->virt_begin = virt_cell->key.block_begin;
+	m->virt_end = virt_cell->key.block_end;
+	m->cell = virt_cell;
+	m->bio = virt_cell->holder;
 
-	r = dm_thin_find_block(tc->td, block, 1, &lookup_result);
-	switch (r) {
-	case 0:
-		/*
-		 * Check nobody is fiddling with this pool block.  This can
-		 * happen if someone's in the process of breaking sharing
-		 * on this block.
-		 */
-		build_data_key(tc->td, lookup_result.block, &key2);
-		if (bio_detain(tc->pool, &key2, bio, &cell2)) {
-			cell_defer_no_holder(tc, cell);
-			break;
-		}
+	if (!dm_deferred_set_add_work(pool->all_io_ds, &m->list))
+		pool->process_prepared_discard(m);
+}
 
-		if (io_overlaps_block(pool, bio)) {
-			/*
-			 * IO may still be going to the destination block.  We must
-			 * quiesce before we can do the removal.
-			 */
-			m = get_next_mapping(pool);
-			m->tc = tc;
-			m->pass_discard = pool->pf.discard_passdown;
-			m->definitely_not_shared = !lookup_result.shared;
-			m->virt_block = block;
-			m->data_block = lookup_result.block;
-			m->cell = cell;
-			m->cell2 = cell2;
-			m->bio = bio;
-
-			if (!dm_deferred_set_add_work(pool->all_io_ds, &m->list))
-				pool->process_prepared_discard(m);
+static void break_up_discard_bio(struct thin_c *tc, dm_block_t begin, dm_block_t end,
+				 struct bio *bio)
+{
+	struct pool *pool = tc->pool;
 
-		} else {
-			inc_all_io_entry(pool, bio);
-			cell_defer_no_holder(tc, cell);
-			cell_defer_no_holder(tc, cell2);
+	int r;
+	bool maybe_shared;
+	struct dm_cell_key data_key;
+	struct dm_bio_prison_cell *data_cell;
+	struct dm_thin_new_mapping *m;
+	dm_block_t virt_begin, virt_end, data_begin;
 
+	while (begin != end) {
+		r = ensure_next_mapping(pool);
+		if (r)
+			/* we did our best */
+			return;
+
+		r = dm_thin_find_mapped_range(tc->td, begin, end, &virt_begin, &virt_end,
+					      &data_begin, &maybe_shared);
+		if (r)
 			/*
-			 * The DM core makes sure that the discard doesn't span
-			 * a block boundary.  So we submit the discard of a
-			 * partial block appropriately.
+			 * Silently fail, letting any mappings we've
+			 * created complete.
 			 */
-			if ((!lookup_result.shared) && pool->pf.discard_passdown)
-				remap_and_issue(tc, bio, lookup_result.block);
-			else
-				bio_endio(bio, 0);
+			break;
+
+		build_key(tc->td, PHYSICAL, data_begin, data_begin + (virt_end - virt_begin), &data_key);
+		if (bio_detain(tc->pool, &data_key, NULL, &data_cell)) {
+			/* contention, we'll give up with this range */
+			begin = virt_end;
+			continue;
 		}
-		break;
 
-	case -ENODATA:
 		/*
-		 * It isn't provisioned, just forget it.
+		 * IO may still be going to the destination block.  We must
+		 * quiesce before we can do the removal.
 		 */
-		cell_defer_no_holder(tc, cell);
-		bio_endio(bio, 0);
-		break;
+		m = get_next_mapping(pool);
+		m->tc = tc;
+		m->maybe_shared = maybe_shared;
+		m->virt_begin = virt_begin;
+		m->virt_end = virt_end;
+		m->data_block = data_begin;
+		m->cell = data_cell;
+		m->bio = bio;
 
-	default:
-		DMERR_LIMIT("%s: dm_thin_find_block() failed: error = %d",
-			    __func__, r);
-		cell_defer_no_holder(tc, cell);
-		bio_io_error(bio);
-		break;
+		/*
+		 * This per-mapping bi_remaining increment is paired with
+		 * the implicit decrement that occurs via bio_endio() in
+		 * process_prepared_discard_{passdown,no_passdown}.
+		 */
+		bio_inc_remaining(bio);
+		if (!dm_deferred_set_add_work(pool->all_io_ds, &m->list))
+			pool->process_prepared_discard(m);
+
+		begin = virt_end;
 	}
 }
 
+static void process_discard_cell_passdown(struct thin_c *tc, struct dm_bio_prison_cell *virt_cell)
+{
+	struct bio *bio = virt_cell->holder;
+	struct dm_thin_endio_hook *h = dm_per_bio_data(bio, sizeof(struct dm_thin_endio_hook));
+
+	/*
+	 * The virt_cell will only get freed once the origin bio completes.
+	 * This means it will remain locked while all the individual
+	 * passdown bios are in flight.
+	 */
+	h->cell = virt_cell;
+	break_up_discard_bio(tc, virt_cell->key.block_begin, virt_cell->key.block_end, bio);
+
+	/*
+	 * We complete the bio now, knowing that the bi_remaining field
+	 * will prevent completion until the sub range discards have
+	 * completed.
+	 */
+	bio_endio(bio, 0);
+}
+
 static void process_discard_bio(struct thin_c *tc, struct bio *bio)
 {
-	struct dm_bio_prison_cell *cell;
-	struct dm_cell_key key;
-	dm_block_t block = get_bio_block(tc, bio);
+	dm_block_t begin, end;
+	struct dm_cell_key virt_key;
+	struct dm_bio_prison_cell *virt_cell;
 
-	build_virtual_key(tc->td, block, &key);
-	if (bio_detain(tc->pool, &key, bio, &cell))
+	get_bio_block_range(tc, bio, &begin, &end);
+	if (begin == end) {
+		/*
+		 * The discard covers less than a block.
+		 */
+		bio_endio(bio, 0);
+		return;
+	}
+
+	build_key(tc->td, VIRTUAL, begin, end, &virt_key);
+	if (bio_detain(tc->pool, &virt_key, bio, &virt_cell))
+		/*
+		 * Potential starvation issue: We're relying on the
+		 * fs/application being well behaved, and not trying to
+		 * send IO to a region at the same time as discarding it.
+		 * If they do this persistently then it's possible this
+		 * cell will never be granted.
+		 */
 		return;
 
-	process_discard_cell(tc, cell);
+	tc->pool->process_discard_cell(tc, virt_cell);
 }
 
 static void break_sharing(struct thin_c *tc, struct bio *bio, dm_block_t block,
@@ -2099,6 +2257,24 @@ static void notify_of_pool_mode_change(struct pool *pool, const char *new_mode)
 	       dm_device_name(pool->pool_md), new_mode);
 }
 
+static bool passdown_enabled(struct pool_c *pt)
+{
+	return pt->adjusted_pf.discard_passdown;
+}
+
+static void set_discard_callbacks(struct pool *pool)
+{
+	struct pool_c *pt = pool->ti->private;
+
+	if (passdown_enabled(pt)) {
+		pool->process_discard_cell = process_discard_cell_passdown;
+		pool->process_prepared_discard = process_prepared_discard_passdown;
+	} else {
+		pool->process_discard_cell = process_discard_cell_no_passdown;
+		pool->process_prepared_discard = process_prepared_discard_no_passdown;
+	}
+}
+
 static void set_pool_mode(struct pool *pool, enum pool_mode new_mode)
 {
 	struct pool_c *pt = pool->ti->private;
@@ -2150,7 +2326,7 @@ static void set_pool_mode(struct pool *pool, enum pool_mode new_mode)
 		pool->process_cell = process_cell_read_only;
 		pool->process_discard_cell = process_cell_success;
 		pool->process_prepared_mapping = process_prepared_mapping_fail;
-		pool->process_prepared_discard = process_prepared_discard_passdown;
+		pool->process_prepared_discard = process_prepared_discard_success;
 
 		error_retry_list(pool);
 		break;
@@ -2169,9 +2345,8 @@ static void set_pool_mode(struct pool *pool, enum pool_mode new_mode)
 		pool->process_bio = process_bio_read_only;
 		pool->process_discard = process_discard_bio;
 		pool->process_cell = process_cell_read_only;
-		pool->process_discard_cell = process_discard_cell;
 		pool->process_prepared_mapping = process_prepared_mapping;
-		pool->process_prepared_discard = process_prepared_discard;
+		set_discard_callbacks(pool);
 
 		if (!pool->pf.error_if_no_space && no_space_timeout)
 			queue_delayed_work(pool->wq, &pool->no_space_timeout, no_space_timeout);
@@ -2184,9 +2359,8 @@ static void set_pool_mode(struct pool *pool, enum pool_mode new_mode)
 		pool->process_bio = process_bio;
 		pool->process_discard = process_discard_bio;
 		pool->process_cell = process_cell;
-		pool->process_discard_cell = process_discard_cell;
 		pool->process_prepared_mapping = process_prepared_mapping;
-		pool->process_prepared_discard = process_prepared_discard;
+		set_discard_callbacks(pool);
 		break;
 	}
 
@@ -2275,6 +2449,7 @@ static void thin_hook_bio(struct thin_c *tc, struct bio *bio)
 	h->shared_read_entry = NULL;
 	h->all_io_entry = NULL;
 	h->overwrite_mapping = NULL;
+	h->cell = NULL;
 }
 
 /*
@@ -2422,7 +2597,6 @@ static void disable_passdown_if_not_supported(struct pool_c *pt)
 	struct pool *pool = pt->pool;
 	struct block_device *data_bdev = pt->data_dev->bdev;
 	struct queue_limits *data_limits = &bdev_get_queue(data_bdev)->limits;
-	sector_t block_size = pool->sectors_per_block << SECTOR_SHIFT;
 	const char *reason = NULL;
 	char buf[BDEVNAME_SIZE];
 
@@ -2435,12 +2609,6 @@ static void disable_passdown_if_not_supported(struct pool_c *pt)
 	else if (data_limits->max_discard_sectors < pool->sectors_per_block)
 		reason = "max discard sectors smaller than a block";
 
-	else if (data_limits->discard_granularity > block_size)
-		reason = "discard granularity larger than a block";
-
-	else if (!is_factor(block_size, data_limits->discard_granularity))
-		reason = "discard granularity not a factor of block size";
-
 	if (reason) {
 		DMWARN("Data device (%s) %s: Disabling discard passdown.", bdevname(data_bdev, buf), reason);
 		pt->adjusted_pf.discard_passdown = false;
@@ -3573,24 +3741,6 @@ static int pool_merge(struct dm_target *ti, struct bvec_merge_data *bvm,
 	return min(max_size, q->merge_bvec_fn(q, bvm, biovec));
 }
 
-static void set_discard_limits(struct pool_c *pt, struct queue_limits *limits)
-{
-	struct pool *pool = pt->pool;
-	struct queue_limits *data_limits;
-
-	limits->max_discard_sectors = pool->sectors_per_block;
-
-	/*
-	 * discard_granularity is just a hint, and not enforced.
-	 */
-	if (pt->adjusted_pf.discard_passdown) {
-		data_limits = &bdev_get_queue(pt->data_dev->bdev)->limits;
-		limits->discard_granularity = max(data_limits->discard_granularity,
-						  pool->sectors_per_block << SECTOR_SHIFT);
-	} else
-		limits->discard_granularity = pool->sectors_per_block << SECTOR_SHIFT;
-}
-
 static void pool_io_hints(struct dm_target *ti, struct queue_limits *limits)
 {
 	struct pool_c *pt = ti->private;
@@ -3645,14 +3795,17 @@ static void pool_io_hints(struct dm_target *ti, struct queue_limits *limits)
 
 	disable_passdown_if_not_supported(pt);
 
-	set_discard_limits(pt, limits);
+	/*
+	 * The pool uses the same discard limits as the underlying data
+	 * device.  DM core has already set this up.
+	 */
 }
 
 static struct target_type pool_target = {
 	.name = "thin-pool",
 	.features = DM_TARGET_SINGLETON | DM_TARGET_ALWAYS_WRITEABLE |
 		    DM_TARGET_IMMUTABLE,
-	.version = {1, 14, 0},
+	.version = {1, 15, 0},
 	.module = THIS_MODULE,
 	.ctr = pool_ctr,
 	.dtr = pool_dtr,
@@ -3811,8 +3964,7 @@ static int thin_ctr(struct dm_target *ti, unsigned argc, char **argv)
 	if (tc->pool->pf.discard_enabled) {
 		ti->discards_supported = true;
 		ti->num_discard_bios = 1;
-		/* Discard bios must be split on a block boundary */
-		ti->split_discard_bios = true;
+		ti->split_discard_bios = false;
 	}
 
 	mutex_unlock(&dm_thin_pool_table.mutex);
@@ -3899,6 +4051,9 @@ static int thin_endio(struct dm_target *ti, struct bio *bio, int err)
 		}
 	}
 
+	if (h->cell)
+		cell_defer_no_holder(h->tc, h->cell);
+
 	return 0;
 }
 
@@ -4026,9 +4181,18 @@ static int thin_iterate_devices(struct dm_target *ti,
 	return 0;
 }
 
+static void thin_io_hints(struct dm_target *ti, struct queue_limits *limits)
+{
+	struct thin_c *tc = ti->private;
+	struct pool *pool = tc->pool;
+
+	limits->discard_granularity = pool->sectors_per_block << SECTOR_SHIFT;
+	limits->max_discard_sectors = 2048 * 1024 * 16; /* 16G */
+}
+
 static struct target_type thin_target = {
 	.name = "thin",
-	.version = {1, 14, 0},
+	.version = {1, 15, 0},
 	.module	= THIS_MODULE,
 	.ctr = thin_ctr,
 	.dtr = thin_dtr,
@@ -4040,6 +4204,7 @@ static struct target_type thin_target = {
 	.status = thin_status,
 	.merge = thin_merge,
 	.iterate_devices = thin_iterate_devices,
+	.io_hints = thin_io_hints,
 };
 
 /*----------------------------------------------------------------*/
-- 
2.3.2 (Apple Git-55)

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH for-4.2 12/14] dm thin: cleanup overwrite's endio restore to be centralized
  2015-05-14 21:04 [PATCH for-4.2 00/14] block, dm: first batch of changes for 4.2 Mike Snitzer
                   ` (10 preceding siblings ...)
  2015-05-14 21:05 ` [PATCH for-4.2 11/14] dm thin: range discard support Mike Snitzer
@ 2015-05-14 21:05 ` Mike Snitzer
  2015-05-14 21:05 ` [PATCH for-4.2 13/14] dm thin: cleanup schedule_zero() to read more logically Mike Snitzer
  2015-05-14 21:05 ` [PATCH for-4.2 14/14] dm thin metadata: remove in-core 'read_only' flag Mike Snitzer
  13 siblings, 0 replies; 31+ messages in thread
From: Mike Snitzer @ 2015-05-14 21:05 UTC (permalink / raw)
  To: dm-devel; +Cc: Jens Axboe, Mike Snitzer
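
Restore the original bi_end_io in overwrite_endio() itself, instead of
duplicating the restore in process_prepared_mapping() and
process_prepared_mapping_fail(), so the restore happens in exactly one
place.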

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
 drivers/md/dm-thin.c | 11 +++--------
 1 file changed, 3 insertions(+), 8 deletions(-)

diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
index aa71bdd..b316450 100644
--- a/drivers/md/dm-thin.c
+++ b/drivers/md/dm-thin.c
@@ -766,6 +766,8 @@ static void overwrite_endio(struct bio *bio, int err)
 	struct dm_thin_endio_hook *h = dm_per_bio_data(bio, sizeof(struct dm_thin_endio_hook));
 	struct dm_thin_new_mapping *m = h->overwrite_mapping;
 
+	bio->bi_end_io = m->saved_bi_end_io;
+
 	m->err = err;
 	complete_mapping_preparation(m);
 }
@@ -854,9 +856,6 @@ static void inc_remap_and_issue_cell(struct thin_c *tc,
 
 static void process_prepared_mapping_fail(struct dm_thin_new_mapping *m)
 {
-	if (m->bio)
-		m->bio->bi_end_io = m->saved_bi_end_io;
-
 	cell_error(m->tc->pool, m->cell);
 	list_del(&m->list);
 	mempool_free(m, m->tc->pool->mapping_pool);
@@ -866,13 +865,9 @@ static void process_prepared_mapping(struct dm_thin_new_mapping *m)
 {
 	struct thin_c *tc = m->tc;
 	struct pool *pool = tc->pool;
-	struct bio *bio;
+	struct bio *bio = m->bio;
 	int r;
 
-	bio = m->bio;
-	if (bio)
-		bio->bi_end_io = m->saved_bi_end_io;
-
 	if (m->err) {
 		cell_error(pool, m->cell);
 		goto out;
-- 
2.3.2 (Apple Git-55)

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH for-4.2 13/14] dm thin: cleanup schedule_zero() to read more logically
  2015-05-14 21:04 [PATCH for-4.2 00/14] block, dm: first batch of changes for 4.2 Mike Snitzer
                   ` (11 preceding siblings ...)
  2015-05-14 21:05 ` [PATCH for-4.2 12/14] dm thin: cleanup overwrite's endio restore to be centralized Mike Snitzer
@ 2015-05-14 21:05 ` Mike Snitzer
  2015-05-14 21:05 ` [PATCH for-4.2 14/14] dm thin metadata: remove in-core 'read_only' flag Mike Snitzer
  13 siblings, 0 replies; 31+ messages in thread
From: Mike Snitzer @ 2015-05-14 21:05 UTC (permalink / raw)
  To: dm-devel; +Cc: Jens Axboe, Joe Thornber, Mike Snitzer

The overwrite optimization has only ever been about avoiding the need
to zero a block when the entire block is being overwritten.  As such it
is only relevant when zeroing is enabled.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Joe Thornber <ejt@redhat.com>
---
 drivers/md/dm-thin.c | 16 +++++++---------
 1 file changed, 7 insertions(+), 9 deletions(-)

diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
index b316450..4aea876 100644
--- a/drivers/md/dm-thin.c
+++ b/drivers/md/dm-thin.c
@@ -1209,16 +1209,14 @@ static void schedule_zero(struct thin_c *tc, dm_block_t virt_block,
 	 * zeroing pre-existing data, we can issue the bio immediately.
 	 * Otherwise we use kcopyd to zero the data first.
 	 */
-	if (!pool->pf.zero_new_blocks)
+	if (pool->pf.zero_new_blocks) {
+		if (io_overwrites_block(pool, bio))
+			remap_and_issue_overwrite(tc, bio, data_block, m);
+		else
+			ll_zero(tc, m, data_block * pool->sectors_per_block,
+				(data_block + 1) * pool->sectors_per_block);
+	} else
 		process_prepared_mapping(m);
-
-	else if (io_overwrites_block(pool, bio))
-		remap_and_issue_overwrite(tc, bio, data_block, m);
-
-	else
-		ll_zero(tc, m,
-			data_block * pool->sectors_per_block,
-			(data_block + 1) * pool->sectors_per_block);
 }
 
 static void schedule_external_copy(struct thin_c *tc, dm_block_t virt_block,
-- 
2.3.2 (Apple Git-55)

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH for-4.2 14/14] dm thin metadata: remove in-core 'read_only' flag
  2015-05-14 21:04 [PATCH for-4.2 00/14] block, dm: first batch of changes for 4.2 Mike Snitzer
                   ` (12 preceding siblings ...)
  2015-05-14 21:05 ` [PATCH for-4.2 13/14] dm thin: cleanup schedule_zero() to read more logically Mike Snitzer
@ 2015-05-14 21:05 ` Mike Snitzer
  13 siblings, 0 replies; 31+ messages in thread
From: Mike Snitzer @ 2015-05-14 21:05 UTC (permalink / raw)
  To: dm-devel; +Cc: Jens Axboe, Mike Snitzer

Leverage the block manager's read_only flag instead of duplicating it;
access with new dm_bm_is_read_only() method.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
 drivers/md/dm-thin-metadata.c                 | 6 +-----
 drivers/md/persistent-data/dm-block-manager.c | 6 ++++++
 drivers/md/persistent-data/dm-block-manager.h | 1 +
 3 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/drivers/md/dm-thin-metadata.c b/drivers/md/dm-thin-metadata.c
index 40359e9..8b521e3 100644
--- a/drivers/md/dm-thin-metadata.c
+++ b/drivers/md/dm-thin-metadata.c
@@ -184,7 +184,6 @@ struct dm_pool_metadata {
 	uint64_t trans_id;
 	unsigned long flags;
 	sector_t data_block_size;
-	bool read_only:1;
 
 	/*
 	 * Set if a transaction has to be aborted but the attempt to roll back
@@ -836,7 +835,6 @@ struct dm_pool_metadata *dm_pool_metadata_open(struct block_device *bdev,
 	init_rwsem(&pmd->root_lock);
 	pmd->time = 0;
 	INIT_LIST_HEAD(&pmd->thin_devices);
-	pmd->read_only = false;
 	pmd->fail_io = false;
 	pmd->bdev = bdev;
 	pmd->data_block_size = data_block_size;
@@ -880,7 +878,7 @@ int dm_pool_metadata_close(struct dm_pool_metadata *pmd)
 		return -EBUSY;
 	}
 
-	if (!pmd->read_only && !pmd->fail_io) {
+	if (!dm_bm_is_read_only(pmd->bm) && !pmd->fail_io) {
 		r = __commit_transaction(pmd);
 		if (r < 0)
 			DMWARN("%s: __commit_transaction() failed, error = %d",
@@ -1850,7 +1848,6 @@ int dm_pool_resize_metadata_dev(struct dm_pool_metadata *pmd, dm_block_t new_cou
 void dm_pool_metadata_read_only(struct dm_pool_metadata *pmd)
 {
 	down_write(&pmd->root_lock);
-	pmd->read_only = true;
 	dm_bm_set_read_only(pmd->bm);
 	up_write(&pmd->root_lock);
 }
@@ -1858,7 +1855,6 @@ void dm_pool_metadata_read_only(struct dm_pool_metadata *pmd)
 void dm_pool_metadata_read_write(struct dm_pool_metadata *pmd)
 {
 	down_write(&pmd->root_lock);
-	pmd->read_only = false;
 	dm_bm_set_read_write(pmd->bm);
 	up_write(&pmd->root_lock);
 }
diff --git a/drivers/md/persistent-data/dm-block-manager.c b/drivers/md/persistent-data/dm-block-manager.c
index 087411c..4d6c9b6 100644
--- a/drivers/md/persistent-data/dm-block-manager.c
+++ b/drivers/md/persistent-data/dm-block-manager.c
@@ -609,6 +609,12 @@ void dm_bm_prefetch(struct dm_block_manager *bm, dm_block_t b)
 	dm_bufio_prefetch(bm->bufio, b, 1);
 }
 
+bool dm_bm_is_read_only(struct dm_block_manager *bm)
+{
+	return bm->read_only;
+}
+EXPORT_SYMBOL_GPL(dm_bm_is_read_only);
+
 void dm_bm_set_read_only(struct dm_block_manager *bm)
 {
 	bm->read_only = true;
diff --git a/drivers/md/persistent-data/dm-block-manager.h b/drivers/md/persistent-data/dm-block-manager.h
index 1b95dfc..84330f5 100644
--- a/drivers/md/persistent-data/dm-block-manager.h
+++ b/drivers/md/persistent-data/dm-block-manager.h
@@ -123,6 +123,7 @@ void dm_bm_prefetch(struct dm_block_manager *bm, dm_block_t b);
  * Additionally you should not use dm_bm_unlock_move, however no error will
  * be returned if you do.
  */
+bool dm_bm_is_read_only(struct dm_block_manager *bm);
 void dm_bm_set_read_only(struct dm_block_manager *bm);
 void dm_bm_set_read_write(struct dm_block_manager *bm);
 
-- 
2.3.2 (Apple Git-55)

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH for-4.2 01/14] block: remove management of bi_remaining when restoring original bi_end_io
  2015-05-14 21:04 ` [PATCH for-4.2 01/14] block: remove management of bi_remaining when restoring original bi_end_io Mike Snitzer
@ 2015-05-18  7:22   ` Jan Kara
  2015-05-18 13:13     ` Mike Snitzer
  2015-05-18  8:24   ` [dm-devel] [PATCH for-4.2 " Christoph Hellwig
  1 sibling, 1 reply; 31+ messages in thread
From: Jan Kara @ 2015-05-18  7:22 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: dm-devel, Jens Axboe, linux-kernel, Jan Kara, Chris Mason

On Thu 14-05-15 17:04:59, Mike Snitzer wrote:
> Commit c4cf5261 ("bio: skip atomic inc/dec of ->bi_remaining for
> non-chains") regressed all existing callers that followed this pattern:
>  1) saving a bio's original bi_end_io
>  2) wiring up an intermediate bi_end_io
>  3) restoring the original bi_end_io from intermediate bi_end_io
>  4) calling bio_endio() to execute the restored original bi_end_io
> 
> The regression was due to BIO_CHAIN only ever getting set if
> bio_inc_remaining() is called.  For the above pattern it isn't set until
> step 3 above (step 2 would've needed to establish BIO_CHAIN).  As such
> the first bio_endio(), in step 2 above, never decremented __bi_remaining
> before calling the intermediate bi_end_io -- leaving __bi_remaining with
> the value 1 instead of 0.  When bio_inc_remaining() occurred during step
> 3 it brought it to a value of 2.  When the second bio_endio() was
> called, in step 4 above, it should've called the original bi_end_io but
> it didn't because there was an extra reference that wasn't dropped (due
> to atomic operations being optimized away since BIO_CHAIN wasn't set
> upfront).
> 
> Fix this issue by removing the __bi_remaining management complexity for
> all callers that use the above pattern -- bio_chain() is the only
> interface that _needs_ to be concerned with __bi_remaining.  For the
> above pattern callers just expect the bi_end_io they set to get called!
> Remove bio_endio_nodec() and also remove all bio_inc_remaining() calls
> that aren't associated with the bio_chain() interface.
> 
> The bio_inc_remaining() interface has been left exported because it is
> still useful for more elaborate uses of bio_chain() -- it will be used
> in an upcoming DM commit "dm thin: range discard support".
> 
> Fixes: c4cf5261 ("bio: skip atomic inc/dec of ->bi_remaining for non-chains")
> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
> Cc: Jan Kara <jack@suse.cz>
> Cc: Chris Mason <clm@fb.com>
  One question: What happens if you stack dm-thin on top of e.g. dm-linear?
dm-thin will do its thing to a bio and pass it to dm-linear. That will
split & chain the bio, so BIO_CHAIN will be set. And on IO completion you
will have trouble in dm-thinp, as bi_remaining now gets decremented in
bio_endio(). That's the reason why I suggested that we should clear
BIO_CHAIN once bi_remaining hits zero...
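
Something like this rough sketch (the helper name is made up):

	/*
	 * Drop a __bi_remaining reference; once the count hits zero,
	 * clear BIO_CHAIN so a restored bi_end_io sees a non-chained bio.
	 */
	static inline bool bio_remaining_done(struct bio *bio)
	{
		/* non-chained bios complete on the first bio_endio() */
		if (!bio_flagged(bio, BIO_CHAIN))
			return true;

		if (atomic_dec_and_test(&bio->__bi_remaining)) {
			clear_bit(BIO_CHAIN, &bio->bi_flags);
			return true;
		}

		return false;
	}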

								Honza

> ---
>  block/bio-integrity.c        |  4 ++--
>  block/bio.c                  | 20 --------------------
>  drivers/md/bcache/io.c       |  2 +-
>  drivers/md/dm-cache-target.c |  6 ------
>  drivers/md/dm-raid1.c        |  2 --
>  drivers/md/dm-snap.c         |  1 -
>  drivers/md/dm-thin.c         |  9 +++------
>  drivers/md/dm-verity.c       |  2 +-
>  fs/btrfs/disk-io.c           |  2 +-
>  fs/btrfs/volumes.c           | 16 +++++-----------
>  fs/btrfs/volumes.h           |  2 --
>  include/linux/bio.h          |  1 -
>  12 files changed, 13 insertions(+), 54 deletions(-)
> 
> diff --git a/block/bio-integrity.c b/block/bio-integrity.c
> index 5cbd5d9..0436c21 100644
> --- a/block/bio-integrity.c
> +++ b/block/bio-integrity.c
> @@ -361,7 +361,7 @@ static void bio_integrity_verify_fn(struct work_struct *work)
>  
>  	/* Restore original bio completion handler */
>  	bio->bi_end_io = bip->bip_end_io;
> -	bio_endio_nodec(bio, error);
> +	bio_endio(bio, error);
>  }
>  
>  /**
> @@ -388,7 +388,7 @@ void bio_integrity_endio(struct bio *bio, int error)
>  	 */
>  	if (error) {
>  		bio->bi_end_io = bip->bip_end_io;
> -		bio_endio_nodec(bio, error);
> +		bio_endio(bio, error);
>  
>  		return;
>  	}
> diff --git a/block/bio.c b/block/bio.c
> index c2ff8a8..3690f1c 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -1809,26 +1809,6 @@ void bio_endio(struct bio *bio, int error)
>  EXPORT_SYMBOL(bio_endio);
>  
>  /**
> - * bio_endio_nodec - end I/O on a bio, without decrementing bi_remaining
> - * @bio:	bio
> - * @error:	error, if any
> - *
> - * For code that has saved and restored bi_end_io; thing hard before using this
> - * function, probably you should've cloned the entire bio.
> - **/
> -void bio_endio_nodec(struct bio *bio, int error)
> -{
> -	/*
> -	 * If it's not flagged as a chain, we are not going to dec the count
> -	 */
> -	if (bio_flagged(bio, BIO_CHAIN))
> -		bio_inc_remaining(bio);
> -
> -	bio_endio(bio, error);
> -}
> -EXPORT_SYMBOL(bio_endio_nodec);
> -
> -/**
>   * bio_split - split a bio
>   * @bio:	bio to split
>   * @sectors:	number of sectors to split from the front of @bio
> diff --git a/drivers/md/bcache/io.c b/drivers/md/bcache/io.c
> index fa028fa..cb64e64a 100644
> --- a/drivers/md/bcache/io.c
> +++ b/drivers/md/bcache/io.c
> @@ -55,7 +55,7 @@ static void bch_bio_submit_split_done(struct closure *cl)
>  
>  	s->bio->bi_end_io = s->bi_end_io;
>  	s->bio->bi_private = s->bi_private;
> -	bio_endio_nodec(s->bio, 0);
> +	bio_endio(s->bio, 0);
>  
>  	closure_debug_destroy(&s->cl);
>  	mempool_free(s, s->p->bio_split_hook);
> diff --git a/drivers/md/dm-cache-target.c b/drivers/md/dm-cache-target.c
> index 705eb7b..41b2594 100644
> --- a/drivers/md/dm-cache-target.c
> +++ b/drivers/md/dm-cache-target.c
> @@ -86,12 +86,6 @@ static void dm_unhook_bio(struct dm_hook_info *h, struct bio *bio)
>  {
>  	bio->bi_end_io = h->bi_end_io;
>  	bio->bi_private = h->bi_private;
> -
> -	/*
> -	 * Must bump bi_remaining to allow bio to complete with
> -	 * restored bi_end_io.
> -	 */
> -	bio_inc_remaining(bio);
>  }
>  
>  /*----------------------------------------------------------------*/
> diff --git a/drivers/md/dm-raid1.c b/drivers/md/dm-raid1.c
> index d6a1c09..743fa9b 100644
> --- a/drivers/md/dm-raid1.c
> +++ b/drivers/md/dm-raid1.c
> @@ -1254,8 +1254,6 @@ static int mirror_end_io(struct dm_target *ti, struct bio *bio, int error)
>  			dm_bio_restore(bd, bio);
>  			bio_record->details.bi_bdev = NULL;
>  
> -			bio_inc_remaining(bio);
> -
>  			queue_bio(ms, bio, rw);
>  			return DM_ENDIO_INCOMPLETE;
>  		}
> diff --git a/drivers/md/dm-snap.c b/drivers/md/dm-snap.c
> index 8bfeae2..7c82d3c 100644
> --- a/drivers/md/dm-snap.c
> +++ b/drivers/md/dm-snap.c
> @@ -1478,7 +1478,6 @@ out:
>  	if (full_bio) {
>  		full_bio->bi_end_io = pe->full_bio_end_io;
>  		full_bio->bi_private = pe->full_bio_private;
> -		bio_inc_remaining(full_bio);
>  	}
>  	increment_pending_exceptions_done_count();
>  
> diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
> index 342dbda..e852602c 100644
> --- a/drivers/md/dm-thin.c
> +++ b/drivers/md/dm-thin.c
> @@ -793,10 +793,9 @@ static void inc_remap_and_issue_cell(struct thin_c *tc,
>  
>  static void process_prepared_mapping_fail(struct dm_thin_new_mapping *m)
>  {
> -	if (m->bio) {
> +	if (m->bio)
>  		m->bio->bi_end_io = m->saved_bi_end_io;
> -		bio_inc_remaining(m->bio);
> -	}
> +
>  	cell_error(m->tc->pool, m->cell);
>  	list_del(&m->list);
>  	mempool_free(m, m->tc->pool->mapping_pool);
> @@ -810,10 +809,8 @@ static void process_prepared_mapping(struct dm_thin_new_mapping *m)
>  	int r;
>  
>  	bio = m->bio;
> -	if (bio) {
> +	if (bio)
>  		bio->bi_end_io = m->saved_bi_end_io;
> -		bio_inc_remaining(bio);
> -	}
>  
>  	if (m->err) {
>  		cell_error(pool, m->cell);
> diff --git a/drivers/md/dm-verity.c b/drivers/md/dm-verity.c
> index 66616db..bb9c6a0 100644
> --- a/drivers/md/dm-verity.c
> +++ b/drivers/md/dm-verity.c
> @@ -459,7 +459,7 @@ static void verity_finish_io(struct dm_verity_io *io, int error)
>  	bio->bi_end_io = io->orig_bi_end_io;
>  	bio->bi_private = io->orig_bi_private;
>  
> -	bio_endio_nodec(bio, error);
> +	bio_endio(bio, error);
>  }
>  
>  static void verity_work(struct work_struct *w)
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 2ef9a4b..fbdbee4 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -1745,7 +1745,7 @@ static void end_workqueue_fn(struct btrfs_work *work)
>  	bio->bi_private = end_io_wq->private;
>  	bio->bi_end_io = end_io_wq->end_io;
>  	kmem_cache_free(btrfs_end_io_wq_cache, end_io_wq);
> -	bio_endio_nodec(bio, error);
> +	bio_endio(bio, error);
>  }
>  
>  static int cleaner_kthread(void *arg)
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 8e8d1d1..dac77d4 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -5585,10 +5585,10 @@ int btrfs_rmap_block(struct btrfs_mapping_tree *map_tree,
>  
>  static inline void btrfs_end_bbio(struct btrfs_bio *bbio, struct bio *bio, int err)
>  {
> -	if (likely(bbio->flags & BTRFS_BIO_ORIG_BIO_SUBMITTED))
> -		bio_endio_nodec(bio, err);
> -	else
> -		bio_endio(bio, err);
> +	bio->bi_private = bbio->private;
> +	bio->bi_end_io = bbio->end_io;
> +	bio_endio(bio, err);
> +
>  	btrfs_put_bbio(bbio);
>  }
>  
> @@ -5632,8 +5632,6 @@ static void btrfs_end_bio(struct bio *bio, int err)
>  			bio = bbio->orig_bio;
>  		}
>  
> -		bio->bi_private = bbio->private;
> -		bio->bi_end_io = bbio->end_io;
>  		btrfs_io_bio(bio)->mirror_num = bbio->mirror_num;
>  		/* only send an error to the higher layers if it is
>  		 * beyond the tolerance of the btrfs bio
> @@ -5815,8 +5813,6 @@ static void bbio_error(struct btrfs_bio *bbio, struct bio *bio, u64 logical)
>  		/* Shoud be the original bio. */
>  		WARN_ON(bio != bbio->orig_bio);
>  
> -		bio->bi_private = bbio->private;
> -		bio->bi_end_io = bbio->end_io;
>  		btrfs_io_bio(bio)->mirror_num = bbio->mirror_num;
>  		bio->bi_iter.bi_sector = logical >> 9;
>  
> @@ -5897,10 +5893,8 @@ int btrfs_map_bio(struct btrfs_root *root, int rw, struct bio *bio,
>  		if (dev_nr < total_devs - 1) {
>  			bio = btrfs_bio_clone(first_bio, GFP_NOFS);
>  			BUG_ON(!bio); /* -ENOMEM */
> -		} else {
> +		} else
>  			bio = first_bio;
> -			bbio->flags |= BTRFS_BIO_ORIG_BIO_SUBMITTED;
> -		}
>  
>  		submit_stripe_bio(root, bbio, bio,
>  				  bbio->stripes[dev_nr].physical, dev_nr, rw,
> diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
> index ebc3133..cedae03 100644
> --- a/fs/btrfs/volumes.h
> +++ b/fs/btrfs/volumes.h
> @@ -292,8 +292,6 @@ struct btrfs_bio_stripe {
>  struct btrfs_bio;
>  typedef void (btrfs_bio_end_io_t) (struct btrfs_bio *bio, int err);
>  
> -#define BTRFS_BIO_ORIG_BIO_SUBMITTED	(1 << 0)
> -
>  struct btrfs_bio {
>  	atomic_t refs;
>  	atomic_t stripes_pending;
> diff --git a/include/linux/bio.h b/include/linux/bio.h
> index 7486ea1..fe0a88d 100644
> --- a/include/linux/bio.h
> +++ b/include/linux/bio.h
> @@ -427,7 +427,6 @@ static inline struct bio *bio_clone_kmalloc(struct bio *bio, gfp_t gfp_mask)
>  }
>  
>  extern void bio_endio(struct bio *, int);
> -extern void bio_endio_nodec(struct bio *, int);
>  struct request_queue;
>  extern int bio_phys_segments(struct request_queue *, struct bio *);
>  
> -- 
> 2.3.2 (Apple Git-55)
> 
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [dm-devel] [PATCH for-4.2 01/14] block: remove management of bi_remaining when restoring original bi_end_io
  2015-05-14 21:04 ` [PATCH for-4.2 01/14] block: remove management of bi_remaining when restoring original bi_end_io Mike Snitzer
  2015-05-18  7:22   ` Jan Kara
@ 2015-05-18  8:24   ` Christoph Hellwig
  2015-05-18 13:20     ` Mike Snitzer
  1 sibling, 1 reply; 31+ messages in thread
From: Christoph Hellwig @ 2015-05-18  8:24 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: dm-devel, Jens Axboe, Chris Mason, Jan Kara, linux-kernel

In general this looks good.  But as Jan mentioned you need to
clear BIO_CHAIN when bi_remaining reaches zero, and I'd really
prefer that bio_inc_remaining not be left exported but instead folded
into bio_chain so that we prevent new abuses from showing up
and keep the code centralized.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [dm-devel] [PATCH for-4.2 04/14] block: factor out blkdev_issue_discard_async
  2015-05-14 21:05 ` [PATCH for-4.2 04/14] block: factor out blkdev_issue_discard_async Mike Snitzer
@ 2015-05-18  8:27   ` Christoph Hellwig
  2015-05-18 13:32     ` Mike Snitzer
  0 siblings, 1 reply; 31+ messages in thread
From: Christoph Hellwig @ 2015-05-18  8:27 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: dm-devel, Jens Axboe, Joe Thornber, linux-kernel

On Thu, May 14, 2015 at 05:05:02PM -0400, Mike Snitzer wrote:
> From: Joe Thornber <ejt@redhat.com>
> 
> Useful for callers who wish to manage the async completion of discard(s)
> without explicitly blocking waiting for completion.
> 
> blkdev_issue_discard() is updated to call blkdev_issue_discard_async()
> and DM thinp will make use of blkdev_issue_discard_async() in the
> upcoming "dm thin: range discard support" commit.

I think this is the wrong level of interface.  I think dm should just
submit the bios directly, which will also allow it to use bio_chain
properly instead of needing the inc_remaining hack.  Instead export
helpers that properly split up the discard chunk sectors without
touching the bio itself.  And with the on-demand bio splitting work,
even that will hopefully go away soon.
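
E.g. a helper along these lines (a rough sketch; the name is made up):

	/*
	 * Hypothetical helper: how many sectors a single discard bio may
	 * carry on this queue, rounded down to the discard granularity,
	 * computed without building or touching a bio.
	 */
	static unsigned int blk_discard_chunk_sectors(struct request_queue *q)
	{
		unsigned int granularity =
			max(q->limits.discard_granularity >> 9, 1U);
		unsigned int max_sectors =
			min(q->limits.max_discard_sectors, UINT_MAX >> 9);

		return (max_sectors / granularity) * granularity;
	}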


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH for-4.2 06/14] dm: rename methods that requeue requests
  2015-05-14 21:05 ` [PATCH for-4.2 06/14] dm: rename methods that requeue requests Mike Snitzer
@ 2015-05-18  8:29   ` Christoph Hellwig
  2015-05-18 15:44     ` Mike Snitzer
  0 siblings, 1 reply; 31+ messages in thread
From: Christoph Hellwig @ 2015-05-18  8:29 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: Jens Axboe, dm-devel

On Thu, May 14, 2015 at 05:05:04PM -0400, Mike Snitzer wrote:
> More often than not a request that is requeued _is_ mapped (meaning the
> clone request is allocated and clone->q is initialized).  Rename
> dm_requeue_unmapped_original_request() and dm_requeue_unmapped_request()
> to avoid potential confusion due to function names containing "unmapped".

Can you also just kill off dm_requeue_unmapped_request / dm_requeue_request
while you're at it?  Both callers have the dm_rq_target_io at hand,
so it's rather pointless.

Also for the blk-mq case please just return BLK_MQ_RQ_QUEUE_BUSY from
->queue_rq instead of requeueing the request.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH for-4.2 01/14] block: remove management of bi_remaining when restoring original bi_end_io
  2015-05-18  7:22   ` Jan Kara
@ 2015-05-18 13:13     ` Mike Snitzer
  2015-05-18 15:36       ` Jan Kara
  0 siblings, 1 reply; 31+ messages in thread
From: Mike Snitzer @ 2015-05-18 13:13 UTC (permalink / raw)
  To: Jan Kara; +Cc: dm-devel, Jens Axboe, linux-kernel, Chris Mason

On Mon, May 18 2015 at  3:22am -0400,
Jan Kara <jack@suse.cz> wrote:

> On Thu 14-05-15 17:04:59, Mike Snitzer wrote:
> > Commit c4cf5261 ("bio: skip atomic inc/dec of ->bi_remaining for
> > non-chains") regressed all existing callers that followed this pattern:
> >  1) saving a bio's original bi_end_io
> >  2) wiring up an intermediate bi_end_io
> >  3) restoring the original bi_end_io from intermediate bi_end_io
> >  4) calling bio_endio() to execute the restored original bi_end_io
> > 
> > The regression was due to BIO_CHAIN only ever getting set if
> > bio_inc_remaining() is called.  For the above pattern it isn't set until
> > step 3 above (step 2 would've needed to establish BIO_CHAIN).  As such
> > the first bio_endio(), in step 2 above, never decremented __bi_remaining
> > before calling the intermediate bi_end_io -- leaving __bi_remaining with
> > the value 1 instead of 0.  When bio_inc_remaining() occurred during step
> > 3 it brought it to a value of 2.  When the second bio_endio() was
> > called, in step 4 above, it should've called the original bi_end_io but
> > it didn't because there was an extra reference that wasn't dropped (due
> > to atomic operations being optimized away since BIO_CHAIN wasn't set
> > upfront).
> > 
> > Fix this issue by removing the __bi_remaining management complexity for
> > all callers that use the above pattern -- bio_chain() is the only
> > interface that _needs_ to be concerned with __bi_remaining.  For the
> > above pattern callers just expect the bi_end_io they set to get called!
> > Remove bio_endio_nodec() and also remove all bio_inc_remaining() calls
> > that aren't associated with the bio_chain() interface.
> > 
> > The bio_inc_remaining() interface has been left exported because it is
> > still useful for more elaborate uses of bio_chain() -- it will be used
> > in an upcoming DM commit "dm thin: range discard support".
> > 
> > Fixes: c4cf5261 ("bio: skip atomic inc/dec of ->bi_remaining for non-chains")
> > Signed-off-by: Mike Snitzer <snitzer@redhat.com>
> > Cc: Jan Kara <jack@suse.cz>
> > Cc: Chris Mason <clm@fb.com>
>   One question: What happens if you stack dm-thin on top of e.g. dm-linear?
> dm-thin will do its thing to a bio and pass it to dm-linear. That will
> split & chain the bio, so BIO_CHAIN will be set. And on IO completion you
> will have trouble in dm-thinp, as bi_remaining now gets decremented in
> bio_endio(). That's the reason why I suggested that we should clear
> BIO_CHAIN once bi_remaining hits zero...

I think you need to be more precise in explaining the scenario you're
concerned about.  It could be that there is an issue, but I'm not seeing
it yet.

Are you referring to the patch that makes DM thinp use the proposed
blkdev_issue_discard_async() interface?  The bios issued to DM linear
are generated by blkdev_issue_discard_async().  By using bio_chain()
they establish ancestry with the parent DM thinp bio (which has
BIO_CHAIN set even before blkdev_issue_discard_async() is called,
because DM thinp may complete the parent bio before all N generated
discard bios do -- which is why DM thinp itself takes an extra
reference on the parent bio with bio_inc_remaining() first).
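
A minimal sketch of that scheme (not the actual patch code: the
blkdev_issue_discard_async() arguments are elided, since its signature
is defined by patch 04/14, and the wrapper below is invented for
illustration):

static void thin_issue_discards(struct bio *parent /* , ... */)
{
	/*
	 * Take an extra reference so DM thinp can finish its own
	 * processing of the parent before all N discard bios complete.
	 */
	bio_inc_remaining(parent);

	/*
	 * Each bio generated below does bio_chain(child, parent), so
	 * the parent's bi_end_io only runs once every child completes.
	 */
	blkdev_issue_discard_async(/* bdev, range, ..., */ parent);

	/* ... thinp's own bookkeeping for the parent ... */

	bio_endio(parent, 0);	/* drops the reference taken above */
}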

In practice I'm not having any problems with the
device-mapper-test-suite's thin-provisioning tests, e.g.:
dmtest run --suite thin-provisioning -n /discard/

(dmts happily uses dm-linear devices and stacks DM thinp on top.)

Mike

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH for-4.2 01/14] block: remove management of bi_remaining when restoring original bi_end_io
  2015-05-18  8:24   ` [dm-devel] [PATCH for-4.2 " Christoph Hellwig
@ 2015-05-18 13:20     ` Mike Snitzer
  0 siblings, 0 replies; 31+ messages in thread
From: Mike Snitzer @ 2015-05-18 13:20 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: dm-devel, Jens Axboe, Chris Mason, Jan Kara, linux-kernel

On Mon, May 18 2015 at  4:24am -0400,
Christoph Hellwig <hch@infradead.org> wrote:

> In general this looks good.  But as Jan mentioned you need to
> clear BIO_CHAIN when bi_remaining reaches zero

OK, I just replied to Jan -- I don't yet understand why.  That said, I
also don't have a problem with doing what you guys are asking.  I'd just
like to understand the problem you're foreseeing, because in practice
I'm not hitting it in testing.

> and I'd really prefer if bio_inc_remaining would not be left exported
> but instead folded into bio_chain, so that we prevent new abuses from
> showing up and keep the code centralized.

Your desire to make bio_inc_remaining() private is noted, but I think
the proposed blkdev_issue_discard_async() is useful.  In the context of
DM thinp's use of blkdev_issue_discard_async(): yes, it is (ab)using
bio_chain() and bio_inc_remaining() to set up the async IO completion
scheme, but it makes for a pretty clean solution to the problem of
wanting an async interface for discard.
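
For contrast, bio_chain()'s intended use is the split-and-chain pattern,
roughly what drivers/md/linear.c does when a bio straddles a device
boundary (a sketch, not copied from any one driver):

	struct bio *split = bio_split(bio, max_sectors, GFP_NOIO,
				      fs_bio_set);

	bio_chain(split, bio);		/* split now completes into bio */
	generic_make_request(bio);	/* resubmit the remainder */
	bio = split;			/* carry on with the front piece */

The (ab)use in DM thinp differs only in that the parent's extra
reference comes from bio_inc_remaining() instead of from splitting.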

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH for-4.2 04/14] block: factor out blkdev_issue_discard_async
  2015-05-18  8:27   ` [dm-devel] " Christoph Hellwig
@ 2015-05-18 13:32     ` Mike Snitzer
  2015-05-18 16:17       ` [dm-devel] " Christoph Hellwig
  0 siblings, 1 reply; 31+ messages in thread
From: Mike Snitzer @ 2015-05-18 13:32 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: dm-devel, Jens Axboe, Joe Thornber, linux-kernel

On Mon, May 18 2015 at  4:27am -0400,
Christoph Hellwig <hch@infradead.org> wrote:

> On Thu, May 14, 2015 at 05:05:02PM -0400, Mike Snitzer wrote:
> > From: Joe Thornber <ejt@redhat.com>
> > 
> > Useful for callers who wish to manage the async completion of discard(s)
> > without explicitly blocking waiting for completion.
> > 
> > blkdev_issue_discard() is updated to call blkdev_issue_discard_async()
> > and DM thinp will make use of blkdev_issue_discard_async() in the
> > upcoming "dm thin: range discard support" commit.
> 
> I think this is the wrong level of interface.  I think dm should just
> submit the bios directly, which will also allow it to use bio_chain
> properly instead of needing the inc_remaining hack.  Instead export
> helpers that properly split up the discard chunk sectors without
> touching the bio itself.  And with the bio-split-on-demand work, even
> that will hopefully go away soon.

The proposed blkdev_issue_discard_async interface allows DM (or any
caller) to not have to concern itself with how discard(s) gets issued.

It leaves all the details of how large a discard can be, etc to block
core.  The entire point of doing things this way is to _not_ pollute DM
with code that breaks up a discard into N bios based on the discard
limits of the underlying device.
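
(For context: a caller of the existing synchronous interface already
gets all of that for free -- as of this series' base a single call
covers an arbitrarily large range:

	int err = blkdev_issue_discard(bdev, sector, nr_sects,
				       GFP_KERNEL, 0);

and block core internally splits it according to the device's discard
limits.  The async variant keeps that property.)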

What you're suggesting sounds a lot like having DM open code
blkdev_issue_discard() -- blkdev_issue_discard_async() was engineered to
avoid that completely.

I hope we can reach consensus on this, because as it stands I know Jens
will be less inclined to take this blkdev_issue_discard_async() change
given your early disapproval.  That pretty much screws me up for the
coming merge window.  I'm OK with that (and with exploring
alternatives), but I _really_ hope you've explored this carefully (I'm
not getting that vibe yet, given your suggestion appears to be "open
code all of blkdev_issue_discard in DM").

Mike

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH for-4.2 01/14] block: remove management of bi_remaining when restoring original bi_end_io
  2015-05-18 13:13     ` Mike Snitzer
@ 2015-05-18 15:36       ` Jan Kara
  2015-05-18 15:59         ` Mike Snitzer
  0 siblings, 1 reply; 31+ messages in thread
From: Jan Kara @ 2015-05-18 15:36 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: Jan Kara, dm-devel, Jens Axboe, linux-kernel, Chris Mason

On Mon 18-05-15 09:13:59, Mike Snitzer wrote:
> On Mon, May 18 2015 at  3:22am -0400,
> Jan Kara <jack@suse.cz> wrote:
> 
> >   One question: What happens if you stack dm-thin on top of e.g. dm-linear?
> > dm-thin will do its thing to a bio and pass it to dm-linear. That will
> > split & chain the bio, so BIO_CHAIN will be set. And on IO completion you
> > will have trouble in dm-thinp, as bi_remaining now gets decremented in
> > bio_endio(). That's the reason I suggested that we should clear
> > BIO_CHAIN once bi_remaining hits zero...
> 
> I think you need to be more precise in explaining the scenario you're
> concerned about.  Could be there is an issue but I'm not seeing it yet.
> 
> Are you referring to the patch that makes DM thinp use the proposed
> blkdev_issue_discard_async() interface?  The bios issued to DM linear
> are generated by blkdev_issue_discard_async().  By using bio_chain()
> they establish ancestry with the parent DM thinp bio (which has
> BIO_CHAIN set even before blkdev_issue_discard_async() is called,
> because DM thinp may complete the parent bio before all N generated
> discard bios do -- which is why DM thinp itself takes an extra
> reference on the parent bio with bio_inc_remaining() first).

  No, I'm not referring to your proposed interface. I'm referring to the
current kernel + your patch to remove bio_inc_remaining() from all the dm
targets. Ah, after checking again I see where the misunderstanding may have
come from - the device below has to be handled by drivers/md/linear.c, which
is the MD linear driver, not the DM one. I confused those two. Anyway, here
is the failure I envision (and frankly, I don't understand the dm details
much, so I may be completely wrong, but I'd like to understand what prevents
the following from happening; a condensed code sketch follows the list):
* We have dm-thin stacked on top of drivers/md/linear.c
* FS issues bio to dm-thin. remap_and_issue_overwrite() sets bi_end_io to
  overwrite_endio. dm-thin eventually calls generic_make_request(bio).
* Now linear_make_request() gets called and it ends up calling
  bio_chain(split, bio). This sets BIO_CHAIN on bio.
* IO for all chained bios is completed. So bio->bi_remaining is now zero,
  bio still has BIO_CHAIN set and overwrite_endio gets called.
* process_prepared_mapping() will eventually try to call original bi_end_io
  callback but that never happens because bi_remaining is 0 and BIO_CHAIN
  remained set.
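
Condensed, the pattern I mean looks like this (helper and field names
are illustrative, not the actual dm-thin code):

static void overwrite_endio(struct bio *bio, int error)
{
	/* records the result; process_prepared_mapping() runs later */
}

static void remap_and_issue_overwrite(struct dm_thin_new_mapping *m,
				      struct bio *bio)
{
	m->saved_bi_end_io = bio->bi_end_io;	/* save the original */
	bio->bi_end_io = overwrite_endio;	/* hook the intermediate */
	generic_make_request(bio);	/* lower layers may bio_chain() it */
}

static void process_prepared_mapping(struct dm_thin_new_mapping *m)
{
	struct bio *bio = m->bio;

	bio->bi_end_io = m->saved_bi_end_io;	/* restore the original */
	/*
	 * If a lower layer left BIO_CHAIN set after __bi_remaining
	 * already dropped to zero, this call never runs the restored
	 * bi_end_io -- hence the suggestion to clear BIO_CHAIN once
	 * __bi_remaining hits zero.
	 */
	bio_endio(bio, m->err);
}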

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH for-4.2 06/14] dm: rename methods that requeue requests
  2015-05-18  8:29   ` Christoph Hellwig
@ 2015-05-18 15:44     ` Mike Snitzer
  0 siblings, 0 replies; 31+ messages in thread
From: Mike Snitzer @ 2015-05-18 15:44 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jens Axboe, dm-devel

On Mon, May 18 2015 at  4:29am -0400,
Christoph Hellwig <hch@infradead.org> wrote:

> On Thu, May 14, 2015 at 05:05:04PM -0400, Mike Snitzer wrote:
> > More often than not a request that is requeued _is_ mapped (meaning the
> > clone request is allocated and clone->q is initialized).  Rename
> > dm_requeue_unmapped_original_request() and dm_requeue_unmapped_request()
> > to avoid potential confusion due to function names containing "unmapped".
> 
> Can you also just kill off dm_requeue_unmapped_request / dm_requeue_request
> while you're at it?  Both callers have the dm_rq_target_io at hand,
> so it's rather pointless.

Good point, will do.
 
> Also for the blk-mq case please just return BLK_MQ_RQ_QUEUE_BUSY from
> ->queue_rq instead of requeueing the request.

I'll make a note to revisit this (need to audit what, if any,
implications doing so has).
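
The shape of that, against the current blk-mq API, would be roughly as
follows (a sketch with an invented availability check, not the actual
dm code):

static int dm_mq_queue_rq(struct blk_mq_hw_ctx *hctx,
			  const struct blk_mq_queue_data *bd)
{
	if (!clone_available(bd->rq))		/* illustrative check */
		return BLK_MQ_RQ_QUEUE_BUSY;	/* blk-mq re-dispatches */

	/* ... map and dispatch the clone request ... */
	return BLK_MQ_RQ_QUEUE_OK;
}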

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH for-4.2 01/14] block: remove management of bi_remaining when restoring original bi_end_io
  2015-05-18 15:36       ` Jan Kara
@ 2015-05-18 15:59         ` Mike Snitzer
  2015-05-18 20:40           ` [PATCH for-4.2 v2 " Mike Snitzer
  0 siblings, 1 reply; 31+ messages in thread
From: Mike Snitzer @ 2015-05-18 15:59 UTC (permalink / raw)
  To: Jan Kara; +Cc: Jens Axboe, Chris Mason, dm-devel, linux-kernel

On Mon, May 18 2015 at 11:36am -0400,
Jan Kara <jack@suse.cz> wrote:
 
>   No, I'm not referring to your proposed interface. I'm referring to the
> current kernel + your patch to remove bio_inc_remaining() from all the dm
> targets. Ah, after checking again I see where the misunderstanding may have
> come from - the device below has to be handled by drivers/md/linear.c, which
> is the MD linear driver, not the DM one. I confused those two. Anyway, here
> is the failure I envision (and frankly, I don't understand the dm details
> much, so I may be completely wrong, but I'd like to understand what prevents
> the following from happening):
> * We have dm-thin stacked on top of drivers/md/linear.c
> * FS issues bio to dm-thin. remap_and_issue_overwrite() sets bi_end_io to
>   overwrite_endio. dm-thin eventually calls generic_make_request(bio).
> * Now linear_make_request() gets called and it ends up calling
>   bio_chain(split, bio). This sets BIO_CHAIN on bio.
> * IO for all chained bios is completed. So bio->bi_remaining is now zero,
>   bio still has BIO_CHAIN set and overwrite_endio gets called.
> * process_prepared_mapping() will eventually try to call original bi_end_io
>   callback but that never happens because bi_remaining is 0 and BIO_CHAIN
>   remained set.

Makes sense, you have a valid concern (I knew you and hch must have one,
I just didn't understand it).

I'll clear the BIO_CHAIN flag as you suggest and post v2.

Thanks!


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [dm-devel] [PATCH for-4.2 04/14] block: factor out blkdev_issue_discard_async
  2015-05-18 13:32     ` Mike Snitzer
@ 2015-05-18 16:17       ` Christoph Hellwig
  2015-05-18 19:18         ` Mike Snitzer
  0 siblings, 1 reply; 31+ messages in thread
From: Christoph Hellwig @ 2015-05-18 16:17 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Christoph Hellwig, Jens Axboe, dm-devel, Joe Thornber, linux-kernel

On Mon, May 18, 2015 at 09:32:23AM -0400, Mike Snitzer wrote:
> The proposed blkdev_issue_discard_async interface allows DM (or any
> caller) to not have to concern itself with how discard(s) gets issued.
> 
> It leaves all the details of how large a discard can be, etc to block
> core.  The entire point of doing things this way is to _not_ pollute DM
> with code that breaks up a discard into N bios based on the discard
> limits of the underlying device.
> 
> What you're suggesting sounds a lot like having DM open code
> blkdev_issue_discard() -- blkdev_issue_discard_async() was engineered to
> avoid that completely.

Parts of it anyway.  The splitting logic can still be factored into
helpers to keep the nasty details out of DM.  But except for that I
think async discards should be handled exactly like async reads, writes
or flushes.

And besides that generic high-level sentiment, I think the interface
for blkdev_issue_discard_async is simply wrong.  Either you want to keep
the internals private and just expose a completion callback that gets
your private data and an error, or you want to build your own bios as
suggested above.  But not one that is mostly opaque except for allowing
the caller to hook into the submission process and thus take over I/O
completion.
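
Concretely, the first option would have a shape like this (names
invented here for illustration, not taken from the posted series):

typedef void (discard_end_io_t)(void *private, int error);

int blkdev_issue_discard_async(struct block_device *bdev, sector_t sector,
			       sector_t nr_sects, gfp_t gfp_mask,
			       discard_end_io_t *end_io, void *private);

That would keep bio construction, splitting and submission entirely
inside block core.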

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH for-4.2 04/14] block: factor out blkdev_issue_discard_async
  2015-05-18 16:17       ` [dm-devel] " Christoph Hellwig
@ 2015-05-18 19:18         ` Mike Snitzer
  2015-05-19  8:32           ` Christoph Hellwig
  0 siblings, 1 reply; 31+ messages in thread
From: Mike Snitzer @ 2015-05-18 19:18 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jens Axboe, dm-devel, Joe Thornber, linux-kernel

On Mon, May 18 2015 at 12:17pm -0400,
Christoph Hellwig <hch@infradead.org> wrote:

> On Mon, May 18, 2015 at 09:32:23AM -0400, Mike Snitzer wrote:
> > The proposed blkdev_issue_discard_async interface allows DM (or any
> > caller) to not have to concern itself with how discard(s) gets issued.
> > 
> > It leaves all the details of how large a discard can be, etc to block
> > core.  The entire point of doing things this way is to _not_ pollute DM
> > with code that breaks up a discard into N bios based on the discard
> > limits of the underlying device.
> > 
> > What you're suggesting sounds a lot like having DM open code
> > blkdev_issue_discard() -- blkdev_issue_discard_async() was engineered to
> > avoid that completely.
> 
> Parts of it anyway.  The splitting logic can still be factored into
> helpers to keep the nasty details out of DM.  But except for that I
> think async discards should be handled exactly like async reads, writes
> or flushes.

OK.

> And besides that generic high-level sentiment, I think the interface
> for blkdev_issue_discard_async is simply wrong.  Either you want to keep
> the internals private and just expose a completion callback that gets
> your private data and an error, or you want to build your own bios as
> suggested above.  But not one that is mostly opaque except for allowing
> the caller to hook into the submission process and thus take over I/O
> completion.

I'll see what I can come up with.
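
For reference, the "build your own bios" option would look roughly like
this against the current bio API (a sketch only: error handling and,
notably, splitting to the device's discard limits are elided, and the
latter is exactly the part that wants a block-core helper):

static void submit_one_discard(struct block_device *bdev, sector_t sector,
			       sector_t nr_sects, bio_end_io_t *end_io,
			       void *private)
{
	struct bio *bio = bio_alloc(GFP_NOIO, 1);

	bio->bi_iter.bi_sector = sector;
	bio->bi_iter.bi_size = nr_sects << 9;
	bio->bi_bdev = bdev;
	bio->bi_end_io = end_io;	/* caller-owned completion */
	bio->bi_private = private;

	submit_bio(REQ_WRITE | REQ_DISCARD, bio);
}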

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH for-4.2 v2 01/14] block: remove management of bi_remaining when restoring original bi_end_io
  2015-05-18 15:59         ` Mike Snitzer
@ 2015-05-18 20:40           ` Mike Snitzer
  2015-05-19  6:28             ` Christoph Hellwig
  2015-05-19  7:20             ` Jan Kara
  0 siblings, 2 replies; 31+ messages in thread
From: Mike Snitzer @ 2015-05-18 20:40 UTC (permalink / raw)
  To: Jan Kara, hch, Jens Axboe; +Cc: Chris Mason, dm-devel, linux-kernel

Commit c4cf5261 ("bio: skip atomic inc/dec of ->bi_remaining for
non-chains") regressed all existing callers that followed this pattern:
 1) saving a bio's original bi_end_io
 2) wiring up an intermediate bi_end_io
 3) restoring the original bi_end_io from intermediate bi_end_io
 4) calling bio_endio() to execute the restored original bi_end_io

The regression was due to BIO_CHAIN only ever getting set if
bio_inc_remaining() is called.  For the above pattern it isn't set until
step 3 above (step 2 would've needed to establish BIO_CHAIN).  As such
the first bio_endio(), in step 2 above, never decremented __bi_remaining
before calling the intermediate bi_end_io -- leaving __bi_remaining with
the value 1 instead of 0.  When bio_inc_remaining() occurred during step
3 it brought it to a value of 2.  When the second bio_endio() was
called, in step 4 above, it should've called the original bi_end_io but
it didn't because there was an extra reference that wasn't dropped (due
to atomic operations being optimized away since BIO_CHAIN wasn't set
upfront).

Fix this issue by removing the __bi_remaining management complexity for
all callers that use the above pattern -- bio_chain() is the only
interface that _needs_ to be concerned with __bi_remaining.  For the
above pattern callers just expect the bi_end_io they set to get called!
Remove bio_endio_nodec() and also remove all bio_inc_remaining() calls
that aren't associated with the bio_chain() interface.

changes in v2:
Care is now taken to clear the BIO_CHAIN flag once ->__bi_remaining
drops to 0.  Also, the bio_inc_remaining() interface has moved local to
bio.c.

Fixes: c4cf5261 ("bio: skip atomic inc/dec of ->bi_remaining for non-chains")
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
 block/bio-integrity.c        |  4 ++--
 block/bio.c                  | 35 ++++++++++++++---------------------
 drivers/md/bcache/io.c       |  2 +-
 drivers/md/dm-cache-target.c |  6 ------
 drivers/md/dm-raid1.c        |  2 --
 drivers/md/dm-snap.c         |  1 -
 drivers/md/dm-thin.c         |  9 +++------
 drivers/md/dm-verity.c       |  2 +-
 fs/btrfs/disk-io.c           |  2 +-
 fs/btrfs/volumes.c           | 16 +++++-----------
 fs/btrfs/volumes.h           |  2 --
 include/linux/bio.h          | 12 ------------
 12 files changed, 27 insertions(+), 66 deletions(-)

diff --git a/block/bio-integrity.c b/block/bio-integrity.c
index 5cbd5d9..0436c21 100644
--- a/block/bio-integrity.c
+++ b/block/bio-integrity.c
@@ -361,7 +361,7 @@ static void bio_integrity_verify_fn(struct work_struct *work)
 
 	/* Restore original bio completion handler */
 	bio->bi_end_io = bip->bip_end_io;
-	bio_endio_nodec(bio, error);
+	bio_endio(bio, error);
 }
 
 /**
@@ -388,7 +388,7 @@ void bio_integrity_endio(struct bio *bio, int error)
 	 */
 	if (error) {
 		bio->bi_end_io = bip->bip_end_io;
-		bio_endio_nodec(bio, error);
+		bio_endio(bio, error);
 
 		return;
 	}
diff --git a/block/bio.c b/block/bio.c
index c2ff8a8..259197d 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -303,6 +303,17 @@ static void bio_chain_endio(struct bio *bio, int error)
 	bio_put(bio);
 }
 
+/*
+ * Increment chain count for the bio. Make sure the CHAIN flag update
+ * is visible before the raised count.
+ */
+static inline void bio_inc_remaining(struct bio *bio)
+{
+	bio->bi_flags |= (1 << BIO_CHAIN);
+	smp_mb__before_atomic();
+	atomic_inc(&bio->__bi_remaining);
+}
+
 /**
  * bio_chain - chain bio completions
  * @bio: the target bio
@@ -1756,8 +1767,10 @@ static inline bool bio_remaining_done(struct bio *bio)
 
 	BUG_ON(atomic_read(&bio->__bi_remaining) <= 0);
 
-	if (atomic_dec_and_test(&bio->__bi_remaining))
+	if (atomic_dec_and_test(&bio->__bi_remaining)) {
+		clear_bit(BIO_CHAIN, &bio->bi_flags);
 		return true;
+	}
 
 	return false;
 }
@@ -1809,26 +1822,6 @@ void bio_endio(struct bio *bio, int error)
 EXPORT_SYMBOL(bio_endio);
 
 /**
- * bio_endio_nodec - end I/O on a bio, without decrementing bi_remaining
- * @bio:	bio
- * @error:	error, if any
- *
- * For code that has saved and restored bi_end_io; thing hard before using this
- * function, probably you should've cloned the entire bio.
- **/
-void bio_endio_nodec(struct bio *bio, int error)
-{
-	/*
-	 * If it's not flagged as a chain, we are not going to dec the count
-	 */
-	if (bio_flagged(bio, BIO_CHAIN))
-		bio_inc_remaining(bio);
-
-	bio_endio(bio, error);
-}
-EXPORT_SYMBOL(bio_endio_nodec);
-
-/**
  * bio_split - split a bio
  * @bio:	bio to split
  * @sectors:	number of sectors to split from the front of @bio
diff --git a/drivers/md/bcache/io.c b/drivers/md/bcache/io.c
index fa028fa..cb64e64a 100644
--- a/drivers/md/bcache/io.c
+++ b/drivers/md/bcache/io.c
@@ -55,7 +55,7 @@ static void bch_bio_submit_split_done(struct closure *cl)
 
 	s->bio->bi_end_io = s->bi_end_io;
 	s->bio->bi_private = s->bi_private;
-	bio_endio_nodec(s->bio, 0);
+	bio_endio(s->bio, 0);
 
 	closure_debug_destroy(&s->cl);
 	mempool_free(s, s->p->bio_split_hook);
diff --git a/drivers/md/dm-cache-target.c b/drivers/md/dm-cache-target.c
index 705eb7b..41b2594 100644
--- a/drivers/md/dm-cache-target.c
+++ b/drivers/md/dm-cache-target.c
@@ -86,12 +86,6 @@ static void dm_unhook_bio(struct dm_hook_info *h, struct bio *bio)
 {
 	bio->bi_end_io = h->bi_end_io;
 	bio->bi_private = h->bi_private;
-
-	/*
-	 * Must bump bi_remaining to allow bio to complete with
-	 * restored bi_end_io.
-	 */
-	bio_inc_remaining(bio);
 }
 
 /*----------------------------------------------------------------*/
diff --git a/drivers/md/dm-raid1.c b/drivers/md/dm-raid1.c
index d6a1c09..743fa9b 100644
--- a/drivers/md/dm-raid1.c
+++ b/drivers/md/dm-raid1.c
@@ -1254,8 +1254,6 @@ static int mirror_end_io(struct dm_target *ti, struct bio *bio, int error)
 			dm_bio_restore(bd, bio);
 			bio_record->details.bi_bdev = NULL;
 
-			bio_inc_remaining(bio);
-
 			queue_bio(ms, bio, rw);
 			return DM_ENDIO_INCOMPLETE;
 		}
diff --git a/drivers/md/dm-snap.c b/drivers/md/dm-snap.c
index 8bfeae2..7c82d3c 100644
--- a/drivers/md/dm-snap.c
+++ b/drivers/md/dm-snap.c
@@ -1478,7 +1478,6 @@ out:
 	if (full_bio) {
 		full_bio->bi_end_io = pe->full_bio_end_io;
 		full_bio->bi_private = pe->full_bio_private;
-		bio_inc_remaining(full_bio);
 	}
 	increment_pending_exceptions_done_count();
 
diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
index 342dbda..e852602c 100644
--- a/drivers/md/dm-thin.c
+++ b/drivers/md/dm-thin.c
@@ -793,10 +793,9 @@ static void inc_remap_and_issue_cell(struct thin_c *tc,
 
 static void process_prepared_mapping_fail(struct dm_thin_new_mapping *m)
 {
-	if (m->bio) {
+	if (m->bio)
 		m->bio->bi_end_io = m->saved_bi_end_io;
-		bio_inc_remaining(m->bio);
-	}
+
 	cell_error(m->tc->pool, m->cell);
 	list_del(&m->list);
 	mempool_free(m, m->tc->pool->mapping_pool);
@@ -810,10 +809,8 @@ static void process_prepared_mapping(struct dm_thin_new_mapping *m)
 	int r;
 
 	bio = m->bio;
-	if (bio) {
+	if (bio)
 		bio->bi_end_io = m->saved_bi_end_io;
-		bio_inc_remaining(bio);
-	}
 
 	if (m->err) {
 		cell_error(pool, m->cell);
diff --git a/drivers/md/dm-verity.c b/drivers/md/dm-verity.c
index 66616db..bb9c6a0 100644
--- a/drivers/md/dm-verity.c
+++ b/drivers/md/dm-verity.c
@@ -459,7 +459,7 @@ static void verity_finish_io(struct dm_verity_io *io, int error)
 	bio->bi_end_io = io->orig_bi_end_io;
 	bio->bi_private = io->orig_bi_private;
 
-	bio_endio_nodec(bio, error);
+	bio_endio(bio, error);
 }
 
 static void verity_work(struct work_struct *w)
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 2ef9a4b..fbdbee4 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1745,7 +1745,7 @@ static void end_workqueue_fn(struct btrfs_work *work)
 	bio->bi_private = end_io_wq->private;
 	bio->bi_end_io = end_io_wq->end_io;
 	kmem_cache_free(btrfs_end_io_wq_cache, end_io_wq);
-	bio_endio_nodec(bio, error);
+	bio_endio(bio, error);
 }
 
 static int cleaner_kthread(void *arg)
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 8e8d1d1..dac77d4 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5585,10 +5585,10 @@ int btrfs_rmap_block(struct btrfs_mapping_tree *map_tree,
 
 static inline void btrfs_end_bbio(struct btrfs_bio *bbio, struct bio *bio, int err)
 {
-	if (likely(bbio->flags & BTRFS_BIO_ORIG_BIO_SUBMITTED))
-		bio_endio_nodec(bio, err);
-	else
-		bio_endio(bio, err);
+	bio->bi_private = bbio->private;
+	bio->bi_end_io = bbio->end_io;
+	bio_endio(bio, err);
+
 	btrfs_put_bbio(bbio);
 }
 
@@ -5632,8 +5632,6 @@ static void btrfs_end_bio(struct bio *bio, int err)
 			bio = bbio->orig_bio;
 		}
 
-		bio->bi_private = bbio->private;
-		bio->bi_end_io = bbio->end_io;
 		btrfs_io_bio(bio)->mirror_num = bbio->mirror_num;
 		/* only send an error to the higher layers if it is
 		 * beyond the tolerance of the btrfs bio
@@ -5815,8 +5813,6 @@ static void bbio_error(struct btrfs_bio *bbio, struct bio *bio, u64 logical)
 		/* Shoud be the original bio. */
 		WARN_ON(bio != bbio->orig_bio);
 
-		bio->bi_private = bbio->private;
-		bio->bi_end_io = bbio->end_io;
 		btrfs_io_bio(bio)->mirror_num = bbio->mirror_num;
 		bio->bi_iter.bi_sector = logical >> 9;
 
@@ -5897,10 +5893,8 @@ int btrfs_map_bio(struct btrfs_root *root, int rw, struct bio *bio,
 		if (dev_nr < total_devs - 1) {
 			bio = btrfs_bio_clone(first_bio, GFP_NOFS);
 			BUG_ON(!bio); /* -ENOMEM */
-		} else {
+		} else
 			bio = first_bio;
-			bbio->flags |= BTRFS_BIO_ORIG_BIO_SUBMITTED;
-		}
 
 		submit_stripe_bio(root, bbio, bio,
 				  bbio->stripes[dev_nr].physical, dev_nr, rw,
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index ebc3133..cedae03 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -292,8 +292,6 @@ struct btrfs_bio_stripe {
 struct btrfs_bio;
 typedef void (btrfs_bio_end_io_t) (struct btrfs_bio *bio, int err);
 
-#define BTRFS_BIO_ORIG_BIO_SUBMITTED	(1 << 0)
-
 struct btrfs_bio {
 	atomic_t refs;
 	atomic_t stripes_pending;
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 7486ea1..f0291cf 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -427,7 +427,6 @@ static inline struct bio *bio_clone_kmalloc(struct bio *bio, gfp_t gfp_mask)
 }
 
 extern void bio_endio(struct bio *, int);
-extern void bio_endio_nodec(struct bio *, int);
 struct request_queue;
 extern int bio_phys_segments(struct request_queue *, struct bio *);
 
@@ -659,17 +658,6 @@ static inline struct bio *bio_list_get(struct bio_list *bl)
 }
 
 /*
- * Increment chain count for the bio. Make sure the CHAIN flag update
- * is visible before the raised count.
- */
-static inline void bio_inc_remaining(struct bio *bio)
-{
-	bio->bi_flags |= (1 << BIO_CHAIN);
-	smp_mb__before_atomic();
-	atomic_inc(&bio->__bi_remaining);
-}
-
-/*
  * bio_set is used to allow other portions of the IO system to
  * allocate their own private memory pools for bio and iovec structures.
  * These memory pools in turn all allocate from the bio_slab
-- 
2.3.2 (Apple Git-55)


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH for-4.2 v2 01/14] block: remove management of bi_remaining when restoring original bi_end_io
  2015-05-18 20:40           ` [PATCH for-4.2 v2 " Mike Snitzer
@ 2015-05-19  6:28             ` Christoph Hellwig
  2015-05-19  7:20             ` Jan Kara
  1 sibling, 0 replies; 31+ messages in thread
From: Christoph Hellwig @ 2015-05-19  6:28 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Jan Kara, hch, Jens Axboe, Chris Mason, dm-devel, linux-kernel

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH for-4.2 v2 01/14] block: remove management of bi_remaining when restoring original bi_end_io
  2015-05-18 20:40           ` [PATCH for-4.2 v2 " Mike Snitzer
  2015-05-19  6:28             ` Christoph Hellwig
@ 2015-05-19  7:20             ` Jan Kara
  1 sibling, 0 replies; 31+ messages in thread
From: Jan Kara @ 2015-05-19  7:20 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Jan Kara, hch, Jens Axboe, Chris Mason, dm-devel, linux-kernel

On Mon 18-05-15 16:40:17, Mike Snitzer wrote:
> Commit c4cf5261 ("bio: skip atomic inc/dec of ->bi_remaining for
> non-chains") regressed all existing callers that followed this pattern:
>  1) saving a bio's original bi_end_io
>  2) wiring up an intermediate bi_end_io
>  3) restoring the original bi_end_io from intermediate bi_end_io
>  4) calling bio_endio() to execute the restored original bi_end_io
> 
> The regression was due to BIO_CHAIN only ever getting set if
> bio_inc_remaining() is called.  For the above pattern it isn't set until
> step 3 above (step 2 would've needed to establish BIO_CHAIN).  As such
> the first bio_endio(), in step 2 above, never decremented __bi_remaining
> before calling the intermediate bi_end_io -- leaving __bi_remaining with
> the value 1 instead of 0.  When bio_inc_remaining() occurred during step
> 3 it brought it to a value of 2.  When the second bio_endio() was
> called, in step 4 above, it should've called the original bi_end_io but
> it didn't because there was an extra reference that wasn't dropped (due
> to atomic operations being optimized away since BIO_CHAIN wasn't set
> upfront).
> 
> Fix this issue by removing the __bi_remaining management complexity for
> all callers that use the above pattern -- bio_chain() is the only
> interface that _needs_ to be concerned with __bi_remaining.  For the
> above pattern callers just expect the bi_end_io they set to get called!
> Remove bio_endio_nodec() and also remove all bio_inc_remaining() calls
> that aren't associated with the bio_chain() interface.
> 
> changes in v2:
> Care is now taken to clear the BIO_CHAIN flag once ->__bi_remaining
> drops to 0.  Also, the bio_inc_remaining() interface has moved local to
> bio.c.
> 
> Fixes: c4cf5261 ("bio: skip atomic inc/dec of ->bi_remaining for non-chains")
> Signed-off-by: Mike Snitzer <snitzer@redhat.com>

  The patch looks good to me. You can add:
Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH for-4.2 04/14] block: factor out blkdev_issue_discard_async
  2015-05-18 19:18         ` Mike Snitzer
@ 2015-05-19  8:32           ` Christoph Hellwig
  0 siblings, 0 replies; 31+ messages in thread
From: Christoph Hellwig @ 2015-05-19  8:32 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Christoph Hellwig, Jens Axboe, dm-devel, Joe Thornber, linux-kernel

On Mon, May 18, 2015 at 03:18:30PM -0400, Mike Snitzer wrote:
> > And besides that generic high level sentiment I think the interface
> > for blkdev_issue_discard_async is simply wrong.  Either you want to keep
> > the internals private and just expose a completion callback that gets
> > your private data and an error, or you want to build your own bios as
> > suggested above.  But not one that is mostly opaque except for allowing
> > the caller to hook into the submission process and thus taking over I/O
> > completion.
> 
> I'll see what I can come up with.

Another option would be to help with reviewing and testing this
series: http://thread.gmane.org/gmane.linux.kernel/1947453

If we can get that one in, the submitter doesn't have to care
about aligning or splitting discard (or any other) bios.

^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2015-05-19  8:32 UTC | newest]

Thread overview: 31+ messages
2015-05-14 21:04 [PATCH for-4.2 00/14] block, dm: first batch of changes for 4.2 Mike Snitzer
2015-05-14 21:04 ` [PATCH for-4.2 01/14] block: remove management of bi_remaining when restoring original bi_end_io Mike Snitzer
2015-05-18  7:22   ` Jan Kara
2015-05-18 13:13     ` Mike Snitzer
2015-05-18 15:36       ` Jan Kara
2015-05-18 15:59         ` Mike Snitzer
2015-05-18 20:40           ` [PATCH for-4.2 v2 " Mike Snitzer
2015-05-19  6:28             ` Christoph Hellwig
2015-05-19  7:20             ` Jan Kara
2015-05-18  8:24   ` [dm-devel] [PATCH for-4.2 " Christoph Hellwig
2015-05-18 13:20     ` Mike Snitzer
2015-05-14 21:05 ` [PATCH for-4.2 02/14] block: remove export for blk_queue_bio Mike Snitzer
2015-05-14 21:05 ` [PATCH for-4.2 03/14] block, dm: don't copy bios for request clones Mike Snitzer
2015-05-14 21:05 ` [PATCH for-4.2 04/14] block: factor out blkdev_issue_discard_async Mike Snitzer
2015-05-18  8:27   ` [dm-devel] " Christoph Hellwig
2015-05-18 13:32     ` Mike Snitzer
2015-05-18 16:17       ` [dm-devel] " Christoph Hellwig
2015-05-18 19:18         ` Mike Snitzer
2015-05-19  8:32           ` Christoph Hellwig
2015-05-14 21:05 ` [PATCH for-4.2 05/14] dm: do not allocate any mempools for blk-mq request-based DM Mike Snitzer
2015-05-14 21:05 ` [PATCH for-4.2 06/14] dm: rename methods that requeue requests Mike Snitzer
2015-05-18  8:29   ` Christoph Hellwig
2015-05-18 15:44     ` Mike Snitzer
2015-05-14 21:05 ` [PATCH for-4.2 07/14] dm: factor out a common cleanup_mapped_device() Mike Snitzer
2015-05-14 21:05 ` [PATCH for-4.2 08/14] dm btree: add dm_btree_remove_leaves() Mike Snitzer
2015-05-14 21:05 ` [PATCH for-4.2 09/14] dm thin metadata: add dm_thin_find_mapped_range() Mike Snitzer
2015-05-14 21:05 ` [PATCH for-4.2 10/14] dm thin metadata: add dm_thin_remove_range() Mike Snitzer
2015-05-14 21:05 ` [PATCH for-4.2 11/14] dm thin: range discard support Mike Snitzer
2015-05-14 21:05 ` [PATCH for-4.2 12/14] dm thin: cleanup overwrite's endio restore to be centralized Mike Snitzer
2015-05-14 21:05 ` [PATCH for-4.2 13/14] dm thin: cleanup schedule_zero() to read more logically Mike Snitzer
2015-05-14 21:05 ` [PATCH for-4.2 14/14] dm thin metadata: remove in-core 'read_only' flag Mike Snitzer
