* [PATCH 00/13] block: assorted cleanup for bio splitting and cloning.
@ 2017-06-18  4:38 NeilBrown
  2017-06-18  4:38 ` [PATCH 03/13] blk: make the bioset rescue_workqueue optional NeilBrown
                   ` (13 more replies)
  0 siblings, 14 replies; 53+ messages in thread
From: NeilBrown @ 2017-06-18  4:38 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, linux-kernel

This is a resend of my series of patches working
towards removing the bioset work queues.

This set is based on for-4.13/block.

It incorporates the revised versions of all the patches that were
resent following feedback on the last set.

It also includes a minor grammatical improvement to a comment, and
simple changes to compensate for a couple of changes to the block tree
since the last posting.

I hope to eventually get rid of the new BIOSET_NEED_RESCUER flag,
but that needs work in dm and probably bcache first.

Thanks,
NeilBrown


---

NeilBrown (13):
      blk: remove bio_set arg from blk_queue_split()
      blk: replace bioset_create_nobvec() with a flags arg to bioset_create()
      blk: make the bioset rescue_workqueue optional.
      blk: use non-rescuing bioset for q->bio_split.
      block: Improvements to bounce-buffer handling
      rbd: use bio_clone_fast() instead of bio_clone()
      drbd: use bio_clone_fast() instead of bio_clone()
      pktcdvd: use bio_clone_fast() instead of bio_clone()
      lightnvm/pblk-read: use bio_clone_fast()
      xen-blkfront: remove bio splitting.
      bcache: use kmalloc to allocate bio in bch_data_verify()
      block: remove bio_clone() and all references.
      block: don't check for BIO_MAX_PAGES in blk_bio_segment_split()


 Documentation/block/biodoc.txt      |    2 -
 block/bio.c                         |   73 ++++++++++++++++-------------------
 block/blk-core.c                    |    4 +-
 block/blk-merge.c                   |   31 ++-------------
 block/blk-mq.c                      |    2 -
 block/bounce.c                      |   32 ++++++++++++---
 drivers/block/drbd/drbd_int.h       |    3 +
 drivers/block/drbd/drbd_main.c      |   13 ++++++
 drivers/block/drbd/drbd_req.c       |    2 -
 drivers/block/drbd/drbd_req.h       |    2 -
 drivers/block/pktcdvd.c             |   14 +++++--
 drivers/block/ps3vram.c             |    2 -
 drivers/block/rbd.c                 |   16 +++++++-
 drivers/block/rsxx/dev.c            |    2 -
 drivers/block/umem.c                |    2 -
 drivers/block/xen-blkfront.c        |   54 +-------------------------
 drivers/lightnvm/pblk-init.c        |   16 ++++++--
 drivers/lightnvm/pblk-read.c        |    2 -
 drivers/lightnvm/pblk.h             |    1 
 drivers/lightnvm/rrpc.c             |    2 -
 drivers/md/bcache/debug.c           |    2 -
 drivers/md/bcache/super.c           |    8 +++-
 drivers/md/dm-crypt.c               |    3 +
 drivers/md/dm-io.c                  |    3 +
 drivers/md/dm.c                     |    5 +-
 drivers/md/md.c                     |    6 +--
 drivers/md/raid1.c                  |    2 -
 drivers/md/raid10.c                 |    2 -
 drivers/md/raid5-cache.c            |    2 -
 drivers/md/raid5-ppl.c              |    2 -
 drivers/md/raid5.c                  |    2 -
 drivers/s390/block/dcssblk.c        |    2 -
 drivers/s390/block/xpram.c          |    2 -
 drivers/target/target_core_iblock.c |    2 -
 fs/block_dev.c                      |    2 -
 fs/btrfs/extent_io.c                |    3 +
 fs/xfs/xfs_super.c                  |    3 +
 include/linux/bio.h                 |   12 ++----
 include/linux/blkdev.h              |    3 -
 39 files changed, 168 insertions(+), 173 deletions(-)

--
Signature

^ permalink raw reply	[flat|nested] 53+ messages in thread

* [PATCH 01/13] blk: remove bio_set arg from blk_queue_split()
  2017-06-18  4:38 [PATCH 00/13] block: assorted cleanup for bio splitting and cloning NeilBrown
  2017-06-18  4:38 ` [PATCH 03/13] blk: make the bioset rescue_workqueue optional NeilBrown
  2017-06-18  4:38 ` [PATCH 04/13] blk: use non-rescuing bioset for q->bio_split NeilBrown
@ 2017-06-18  4:38 ` NeilBrown
  2017-06-18  4:38 ` [PATCH 02/13] blk: replace bioset_create_nobvec() with a flags arg to bioset_create() NeilBrown
                   ` (10 subsequent siblings)
  13 siblings, 0 replies; 53+ messages in thread
From: NeilBrown @ 2017-06-18  4:38 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, linux-kernel

blk_queue_split() is always called with the last arg being q->bio_split,
where 'q' is the first arg.

Also blk_queue_split() sometimes uses the passed-in 'bs' and sometimes uses
q->bio_split.

This is inconsistent and unnecessary.  Remove the last arg and always use
q->bio_split inside blk_queue_split().
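
As a minimal illustration (not part of this patch) of what a caller
looks like after the change, assuming a hypothetical driver:

	/* Hypothetical make_request_fn, for illustration only.  The bio_set
	 * argument is gone; blk_queue_split() now always uses q->bio_split.
	 */
	static blk_qc_t example_make_request(struct request_queue *q, struct bio *bio)
	{
		blk_queue_split(q, &bio);  /* was: blk_queue_split(q, &bio, q->bio_split) */
		/* ... driver-specific submission of bio ... */
		return BLK_QC_T_NONE;
	}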

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Credit-to: Javier González <jg@lightnvm.io> (Noticed that lightnvm was missed)
Reviewed-by: Javier González <javier@cnexlabs.com>
Tested-by: Javier González <javier@cnexlabs.com>
Signed-off-by: NeilBrown <neilb@suse.com>
---
 block/blk-core.c              |    2 +-
 block/blk-merge.c             |    9 ++++-----
 block/blk-mq.c                |    2 +-
 drivers/block/drbd/drbd_req.c |    2 +-
 drivers/block/pktcdvd.c       |    2 +-
 drivers/block/ps3vram.c       |    2 +-
 drivers/block/rsxx/dev.c      |    2 +-
 drivers/block/umem.c          |    2 +-
 drivers/lightnvm/pblk-init.c  |    4 ++--
 drivers/lightnvm/rrpc.c       |    2 +-
 drivers/md/md.c               |    2 +-
 drivers/s390/block/dcssblk.c  |    2 +-
 drivers/s390/block/xpram.c    |    2 +-
 include/linux/blkdev.h        |    3 +--
 14 files changed, 18 insertions(+), 20 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 8592409db272..31b5ece6b18e 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1723,7 +1723,7 @@ static blk_qc_t blk_queue_bio(struct request_queue *q, struct bio *bio)
 	 */
 	blk_queue_bounce(q, &bio);
 
-	blk_queue_split(q, &bio, q->bio_split);
+	blk_queue_split(q, &bio);
 
 	if (bio_integrity_enabled(bio) && bio_integrity_prep(bio)) {
 		bio->bi_status = BLK_STS_IOERR;
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 3990ae406341..d59074556703 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -202,8 +202,7 @@ static struct bio *blk_bio_segment_split(struct request_queue *q,
 	return do_split ? new : NULL;
 }
 
-void blk_queue_split(struct request_queue *q, struct bio **bio,
-		     struct bio_set *bs)
+void blk_queue_split(struct request_queue *q, struct bio **bio)
 {
 	struct bio *split, *res;
 	unsigned nsegs;
@@ -211,13 +210,13 @@ void blk_queue_split(struct request_queue *q, struct bio **bio,
 	switch (bio_op(*bio)) {
 	case REQ_OP_DISCARD:
 	case REQ_OP_SECURE_ERASE:
-		split = blk_bio_discard_split(q, *bio, bs, &nsegs);
+		split = blk_bio_discard_split(q, *bio, q->bio_split, &nsegs);
 		break;
 	case REQ_OP_WRITE_ZEROES:
-		split = blk_bio_write_zeroes_split(q, *bio, bs, &nsegs);
+		split = blk_bio_write_zeroes_split(q, *bio, q->bio_split, &nsegs);
 		break;
 	case REQ_OP_WRITE_SAME:
-		split = blk_bio_write_same_split(q, *bio, bs, &nsegs);
+		split = blk_bio_write_same_split(q, *bio, q->bio_split, &nsegs);
 		break;
 	default:
 		split = blk_bio_segment_split(q, *bio, q->bio_split, &nsegs);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 359d2dc0d414..4780200fd11b 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1477,7 +1477,7 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
 
 	blk_queue_bounce(q, &bio);
 
-	blk_queue_split(q, &bio, q->bio_split);
+	blk_queue_split(q, &bio);
 
 	if (bio_integrity_enabled(bio) && bio_integrity_prep(bio)) {
 		bio_io_error(bio);
diff --git a/drivers/block/drbd/drbd_req.c b/drivers/block/drbd/drbd_req.c
index fca6b9914948..f6e865b2d543 100644
--- a/drivers/block/drbd/drbd_req.c
+++ b/drivers/block/drbd/drbd_req.c
@@ -1560,7 +1560,7 @@ blk_qc_t drbd_make_request(struct request_queue *q, struct bio *bio)
 	struct drbd_device *device = (struct drbd_device *) q->queuedata;
 	unsigned long start_jif;
 
-	blk_queue_split(q, &bio, q->bio_split);
+	blk_queue_split(q, &bio);
 
 	start_jif = jiffies;
 
diff --git a/drivers/block/pktcdvd.c b/drivers/block/pktcdvd.c
index e8a381161db6..1f363638b453 100644
--- a/drivers/block/pktcdvd.c
+++ b/drivers/block/pktcdvd.c
@@ -2414,7 +2414,7 @@ static blk_qc_t pkt_make_request(struct request_queue *q, struct bio *bio)
 
 	blk_queue_bounce(q, &bio);
 
-	blk_queue_split(q, &bio, q->bio_split);
+	blk_queue_split(q, &bio);
 
 	pd = q->queuedata;
 	if (!pd) {
diff --git a/drivers/block/ps3vram.c b/drivers/block/ps3vram.c
index 6fa2b8197013..e0e81cacd781 100644
--- a/drivers/block/ps3vram.c
+++ b/drivers/block/ps3vram.c
@@ -606,7 +606,7 @@ static blk_qc_t ps3vram_make_request(struct request_queue *q, struct bio *bio)
 
 	dev_dbg(&dev->core, "%s\n", __func__);
 
-	blk_queue_split(q, &bio, q->bio_split);
+	blk_queue_split(q, &bio);
 
 	spin_lock_irq(&priv->lock);
 	busy = !bio_list_empty(&priv->list);
diff --git a/drivers/block/rsxx/dev.c b/drivers/block/rsxx/dev.c
index 0b0a0a902355..4e8bdfa0aa31 100644
--- a/drivers/block/rsxx/dev.c
+++ b/drivers/block/rsxx/dev.c
@@ -151,7 +151,7 @@ static blk_qc_t rsxx_make_request(struct request_queue *q, struct bio *bio)
 	struct rsxx_bio_meta *bio_meta;
 	blk_status_t st = BLK_STS_IOERR;
 
-	blk_queue_split(q, &bio, q->bio_split);
+	blk_queue_split(q, &bio);
 
 	might_sleep();
 
diff --git a/drivers/block/umem.c b/drivers/block/umem.c
index 4b3c947697b1..0677d2514665 100644
--- a/drivers/block/umem.c
+++ b/drivers/block/umem.c
@@ -529,7 +529,7 @@ static blk_qc_t mm_make_request(struct request_queue *q, struct bio *bio)
 		 (unsigned long long)bio->bi_iter.bi_sector,
 		 bio->bi_iter.bi_size);
 
-	blk_queue_split(q, &bio, q->bio_split);
+	blk_queue_split(q, &bio);
 
 	spin_lock_irq(&card->lock);
 	*card->biotail = bio;
diff --git a/drivers/lightnvm/pblk-init.c b/drivers/lightnvm/pblk-init.c
index ae8cd6d5af8b..b3fec8ec55b8 100644
--- a/drivers/lightnvm/pblk-init.c
+++ b/drivers/lightnvm/pblk-init.c
@@ -33,7 +33,7 @@ static int pblk_rw_io(struct request_queue *q, struct pblk *pblk,
 	 * constraint. Writes can be of arbitrary size.
 	 */
 	if (bio_data_dir(bio) == READ) {
-		blk_queue_split(q, &bio, q->bio_split);
+		blk_queue_split(q, &bio);
 		ret = pblk_submit_read(pblk, bio);
 		if (ret == NVM_IO_DONE && bio_flagged(bio, BIO_CLONED))
 			bio_put(bio);
@@ -46,7 +46,7 @@ static int pblk_rw_io(struct request_queue *q, struct pblk *pblk,
 	 * available for user I/O.
 	 */
 	if (unlikely(pblk_get_secs(bio) >= pblk_rl_sysfs_rate_show(&pblk->rl)))
-		blk_queue_split(q, &bio, q->bio_split);
+		blk_queue_split(q, &bio);
 
 	return pblk_write_to_cache(pblk, bio, PBLK_IOTYPE_USER);
 }
diff --git a/drivers/lightnvm/rrpc.c b/drivers/lightnvm/rrpc.c
index 8d3b53bb3307..267f01ae87e4 100644
--- a/drivers/lightnvm/rrpc.c
+++ b/drivers/lightnvm/rrpc.c
@@ -994,7 +994,7 @@ static blk_qc_t rrpc_make_rq(struct request_queue *q, struct bio *bio)
 	struct nvm_rq *rqd;
 	int err;
 
-	blk_queue_split(q, &bio, q->bio_split);
+	blk_queue_split(q, &bio);
 
 	if (bio_op(bio) == REQ_OP_DISCARD) {
 		rrpc_discard(rrpc, bio);
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 833cc7082034..1c4597d1a3dc 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -265,7 +265,7 @@ static blk_qc_t md_make_request(struct request_queue *q, struct bio *bio)
 	unsigned int sectors;
 	int cpu;
 
-	blk_queue_split(q, &bio, q->bio_split);
+	blk_queue_split(q, &bio);
 
 	if (mddev == NULL || mddev->pers == NULL) {
 		bio_io_error(bio);
diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
index 36e5280af3e4..06eb1de52d1c 100644
--- a/drivers/s390/block/dcssblk.c
+++ b/drivers/s390/block/dcssblk.c
@@ -845,7 +845,7 @@ dcssblk_make_request(struct request_queue *q, struct bio *bio)
 	unsigned long source_addr;
 	unsigned long bytes_done;
 
-	blk_queue_split(q, &bio, q->bio_split);
+	blk_queue_split(q, &bio);
 
 	bytes_done = 0;
 	dev_info = bio->bi_bdev->bd_disk->private_data;
diff --git a/drivers/s390/block/xpram.c b/drivers/s390/block/xpram.c
index b9d7e755c8a3..a48f0d40c1d2 100644
--- a/drivers/s390/block/xpram.c
+++ b/drivers/s390/block/xpram.c
@@ -190,7 +190,7 @@ static blk_qc_t xpram_make_request(struct request_queue *q, struct bio *bio)
 	unsigned long page_addr;
 	unsigned long bytes;
 
-	blk_queue_split(q, &bio, q->bio_split);
+	blk_queue_split(q, &bio);
 
 	if ((bio->bi_iter.bi_sector & 7) != 0 ||
 	    (bio->bi_iter.bi_size & 4095) != 0)
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 76b6df862a12..670df402bc51 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -944,8 +944,7 @@ extern blk_status_t blk_insert_cloned_request(struct request_queue *q,
 				     struct request *rq);
 extern int blk_rq_append_bio(struct request *rq, struct bio *bio);
 extern void blk_delay_queue(struct request_queue *, unsigned long);
-extern void blk_queue_split(struct request_queue *, struct bio **,
-			    struct bio_set *);
+extern void blk_queue_split(struct request_queue *, struct bio **);
 extern void blk_recount_segments(struct request_queue *, struct bio *);
 extern int scsi_verify_blk_ioctl(struct block_device *, unsigned int);
 extern int scsi_cmd_blk_ioctl(struct block_device *, fmode_t,

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 02/13] blk: replace bioset_create_nobvec() with a flags arg to bioset_create()
  2017-06-18  4:38 [PATCH 00/13] block: assorted cleanup for bio splitting and cloning NeilBrown
                   ` (2 preceding siblings ...)
  2017-06-18  4:38 ` [PATCH 01/13] blk: remove bio_set arg from blk_queue_split() NeilBrown
@ 2017-06-18  4:38 ` NeilBrown
  2017-06-18  4:38 ` [PATCH 06/13] rbd: use bio_clone_fast() instead of bio_clone() NeilBrown
                   ` (9 subsequent siblings)
  13 siblings, 0 replies; 53+ messages in thread
From: NeilBrown @ 2017-06-18  4:38 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, linux-kernel

"flags" arguments are often seen as good API design as they allow
easy extensibility.
bioset_create_nobvec() is implemented internally as a variation in
flags passed to __bioset_create().

To support future extension, make the internal structure part of the
API.
i.e. add a 'flags' argument to bioset_create() and discard
bioset_create_nobvec().

Note that the bio_split allocations in drivers/md/raid* do not need
the bvec mempool - they should have used bioset_create_nobvec().
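
As a rough before/after sketch (illustration only; the real
conversions are in the diff below):

	/* Illustration only: how existing callers map onto the new flags arg. */
	bs_bvec   = bioset_create(BIO_POOL_SIZE, 0, BIOSET_NEED_BVECS);
			/* was: bioset_create(BIO_POOL_SIZE, 0) */
	bs_nobvec = bioset_create(pool_size, front_pad, 0);
			/* was: bioset_create_nobvec(pool_size, front_pad) */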

Suggested-by: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.com>
---
 block/bio.c                         |   60 +++++++++++++----------------------
 block/blk-core.c                    |    2 +
 drivers/block/drbd/drbd_main.c      |    2 +
 drivers/md/bcache/super.c           |    4 +-
 drivers/md/dm-crypt.c               |    2 +
 drivers/md/dm-io.c                  |    2 +
 drivers/md/dm.c                     |    2 +
 drivers/md/md.c                     |    2 +
 drivers/md/raid1.c                  |    2 +
 drivers/md/raid10.c                 |    2 +
 drivers/md/raid5-cache.c            |    2 +
 drivers/md/raid5-ppl.c              |    2 +
 drivers/md/raid5.c                  |    2 +
 drivers/target/target_core_iblock.c |    2 +
 fs/block_dev.c                      |    2 +
 fs/btrfs/extent_io.c                |    3 +-
 fs/xfs/xfs_super.c                  |    3 +-
 include/linux/bio.h                 |    6 ++--
 18 files changed, 45 insertions(+), 57 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 7a5c8ed27f42..f81545ba44bf 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1921,9 +1921,26 @@ void bioset_free(struct bio_set *bs)
 }
 EXPORT_SYMBOL(bioset_free);
 
-static struct bio_set *__bioset_create(unsigned int pool_size,
-				       unsigned int front_pad,
-				       bool create_bvec_pool)
+/**
+ * bioset_create  - Create a bio_set
+ * @pool_size:	Number of bio and bio_vecs to cache in the mempool
+ * @front_pad:	Number of bytes to allocate in front of the returned bio
+ * @flags:	Flags to modify behavior, currently only %BIOSET_NEED_BVECS
+ *
+ * Description:
+ *    Set up a bio_set to be used with @bio_alloc_bioset. Allows the caller
+ *    to ask for a number of bytes to be allocated in front of the bio.
+ *    Front pad allocation is useful for embedding the bio inside
+ *    another structure, to avoid allocating extra data to go with the bio.
+ *    Note that the bio must be embedded at the END of that structure always,
+ *    or things will break badly.
+ *    If %BIOSET_NEED_BVECS is set in @flags, a separate pool will be allocated
+ *    for allocating iovecs.  This pool is not needed e.g. for bio_clone_fast().
+ *
+ */
+struct bio_set *bioset_create(unsigned int pool_size,
+			      unsigned int front_pad,
+			      int flags)
 {
 	unsigned int back_pad = BIO_INLINE_VECS * sizeof(struct bio_vec);
 	struct bio_set *bs;
@@ -1948,7 +1965,7 @@ static struct bio_set *__bioset_create(unsigned int pool_size,
 	if (!bs->bio_pool)
 		goto bad;
 
-	if (create_bvec_pool) {
+	if (flags & BIOSET_NEED_BVECS) {
 		bs->bvec_pool = biovec_create_pool(pool_size);
 		if (!bs->bvec_pool)
 			goto bad;
@@ -1963,41 +1980,8 @@ static struct bio_set *__bioset_create(unsigned int pool_size,
 	bioset_free(bs);
 	return NULL;
 }
-
-/**
- * bioset_create  - Create a bio_set
- * @pool_size:	Number of bio and bio_vecs to cache in the mempool
- * @front_pad:	Number of bytes to allocate in front of the returned bio
- *
- * Description:
- *    Set up a bio_set to be used with @bio_alloc_bioset. Allows the caller
- *    to ask for a number of bytes to be allocated in front of the bio.
- *    Front pad allocation is useful for embedding the bio inside
- *    another structure, to avoid allocating extra data to go with the bio.
- *    Note that the bio must be embedded at the END of that structure always,
- *    or things will break badly.
- */
-struct bio_set *bioset_create(unsigned int pool_size, unsigned int front_pad)
-{
-	return __bioset_create(pool_size, front_pad, true);
-}
 EXPORT_SYMBOL(bioset_create);
 
-/**
- * bioset_create_nobvec  - Create a bio_set without bio_vec mempool
- * @pool_size:	Number of bio to cache in the mempool
- * @front_pad:	Number of bytes to allocate in front of the returned bio
- *
- * Description:
- *    Same functionality as bioset_create() except that mempool is not
- *    created for bio_vecs. Saving some memory for bio_clone_fast() users.
- */
-struct bio_set *bioset_create_nobvec(unsigned int pool_size, unsigned int front_pad)
-{
-	return __bioset_create(pool_size, front_pad, false);
-}
-EXPORT_SYMBOL(bioset_create_nobvec);
-
 #ifdef CONFIG_BLK_CGROUP
 
 /**
@@ -2112,7 +2096,7 @@ static int __init init_bio(void)
 	bio_integrity_init();
 	biovec_init_slabs();
 
-	fs_bio_set = bioset_create(BIO_POOL_SIZE, 0);
+	fs_bio_set = bioset_create(BIO_POOL_SIZE, 0, BIOSET_NEED_BVECS);
 	if (!fs_bio_set)
 		panic("bio: can't allocate bios\n");
 
diff --git a/block/blk-core.c b/block/blk-core.c
index 31b5ece6b18e..62cf92550512 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -790,7 +790,7 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
 	if (q->id < 0)
 		goto fail_q;
 
-	q->bio_split = bioset_create(BIO_POOL_SIZE, 0);
+	q->bio_split = bioset_create(BIO_POOL_SIZE, 0, BIOSET_NEED_BVECS);
 	if (!q->bio_split)
 		goto fail_id;
 
diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c
index 84455c365f57..b395fe391171 100644
--- a/drivers/block/drbd/drbd_main.c
+++ b/drivers/block/drbd/drbd_main.c
@@ -2165,7 +2165,7 @@ static int drbd_create_mempools(void)
 		goto Enomem;
 
 	/* mempools */
-	drbd_md_io_bio_set = bioset_create(DRBD_MIN_POOL_PAGES, 0);
+	drbd_md_io_bio_set = bioset_create(DRBD_MIN_POOL_PAGES, 0, BIOSET_NEED_BVECS);
 	if (drbd_md_io_bio_set == NULL)
 		goto Enomem;
 
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index fbc4f5412dec..abd6e825b39b 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -782,7 +782,7 @@ static int bcache_device_init(struct bcache_device *d, unsigned block_size,
 
 	minor *= BCACHE_MINORS;
 
-	if (!(d->bio_split = bioset_create(4, offsetof(struct bbio, bio))) ||
+	if (!(d->bio_split = bioset_create(4, offsetof(struct bbio, bio), BIOSET_NEED_BVECS)) ||
 	    !(d->disk = alloc_disk(BCACHE_MINORS))) {
 		ida_simple_remove(&bcache_minor, minor);
 		return -ENOMEM;
@@ -1516,7 +1516,7 @@ struct cache_set *bch_cache_set_alloc(struct cache_sb *sb)
 				sizeof(struct bbio) + sizeof(struct bio_vec) *
 				bucket_pages(c))) ||
 	    !(c->fill_iter = mempool_create_kmalloc_pool(1, iter_size)) ||
-	    !(c->bio_split = bioset_create(4, offsetof(struct bbio, bio))) ||
+	    !(c->bio_split = bioset_create(4, offsetof(struct bbio, bio), BIOSET_NEED_BVECS)) ||
 	    !(c->uuids = alloc_bucket_pages(GFP_KERNEL, c)) ||
 	    !(c->moving_gc_wq = alloc_workqueue("bcache_gc",
 						WQ_MEM_RECLAIM, 0)) ||
diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index 586cef085c6a..237ff8e9752a 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -2677,7 +2677,7 @@ static int crypt_ctr(struct dm_target *ti, unsigned int argc, char **argv)
 		goto bad;
 	}
 
-	cc->bs = bioset_create(MIN_IOS, 0);
+	cc->bs = bioset_create(MIN_IOS, 0, BIOSET_NEED_BVECS);
 	if (!cc->bs) {
 		ti->error = "Cannot allocate crypt bioset";
 		goto bad;
diff --git a/drivers/md/dm-io.c b/drivers/md/dm-io.c
index c8f8f3004085..5c4121024d92 100644
--- a/drivers/md/dm-io.c
+++ b/drivers/md/dm-io.c
@@ -58,7 +58,7 @@ struct dm_io_client *dm_io_client_create(void)
 	if (!client->pool)
 		goto bad;
 
-	client->bios = bioset_create(min_ios, 0);
+	client->bios = bioset_create(min_ios, 0, BIOSET_NEED_BVECS);
 	if (!client->bios)
 		goto bad;
 
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index c4b74f7398ac..3394a311de3d 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -2660,7 +2660,7 @@ struct dm_md_mempools *dm_alloc_md_mempools(struct mapped_device *md, enum dm_qu
 		BUG();
 	}
 
-	pools->bs = bioset_create_nobvec(pool_size, front_pad);
+	pools->bs = bioset_create(pool_size, front_pad, 0);
 	if (!pools->bs)
 		goto out;
 
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 1c4597d1a3dc..7e2b79408e37 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -5428,7 +5428,7 @@ int md_run(struct mddev *mddev)
 	}
 
 	if (mddev->bio_set == NULL) {
-		mddev->bio_set = bioset_create(BIO_POOL_SIZE, 0);
+		mddev->bio_set = bioset_create(BIO_POOL_SIZE, 0, BIOSET_NEED_BVECS);
 		if (!mddev->bio_set)
 			return -ENOMEM;
 	}
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index a1a3cf0293df..98ca2c1d3226 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -2955,7 +2955,7 @@ static struct r1conf *setup_conf(struct mddev *mddev)
 	if (!conf->r1bio_pool)
 		goto abort;
 
-	conf->bio_split = bioset_create(BIO_POOL_SIZE, 0);
+	conf->bio_split = bioset_create(BIO_POOL_SIZE, 0, 0);
 	if (!conf->bio_split)
 		goto abort;
 
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 3178273a7253..57a250fdbbcc 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -3552,7 +3552,7 @@ static struct r10conf *setup_conf(struct mddev *mddev)
 	if (!conf->r10bio_pool)
 		goto out;
 
-	conf->bio_split = bioset_create(BIO_POOL_SIZE, 0);
+	conf->bio_split = bioset_create(BIO_POOL_SIZE, 0, 0);
 	if (!conf->bio_split)
 		goto out;
 
diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 3746c9c27e54..bfa1e907c472 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -3063,7 +3063,7 @@ int r5l_init_log(struct r5conf *conf, struct md_rdev *rdev)
 	if (!log->io_pool)
 		goto io_pool;
 
-	log->bs = bioset_create(R5L_POOL_SIZE, 0);
+	log->bs = bioset_create(R5L_POOL_SIZE, 0, BIOSET_NEED_BVECS);
 	if (!log->bs)
 		goto io_bs;
 
diff --git a/drivers/md/raid5-ppl.c b/drivers/md/raid5-ppl.c
index e709ada0bf09..77cce3573aa8 100644
--- a/drivers/md/raid5-ppl.c
+++ b/drivers/md/raid5-ppl.c
@@ -1150,7 +1150,7 @@ int ppl_init_log(struct r5conf *conf)
 		goto err;
 	}
 
-	ppl_conf->bs = bioset_create(conf->raid_disks, 0);
+	ppl_conf->bs = bioset_create(conf->raid_disks, 0, 0);
 	if (!ppl_conf->bs) {
 		ret = -ENOMEM;
 		goto err;
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 7171bfd48223..62c965be97e1 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -6943,7 +6943,7 @@ static struct r5conf *setup_conf(struct mddev *mddev)
 			goto abort;
 	}
 
-	conf->bio_split = bioset_create(BIO_POOL_SIZE, 0);
+	conf->bio_split = bioset_create(BIO_POOL_SIZE, 0, 0);
 	if (!conf->bio_split)
 		goto abort;
 	conf->mddev = mddev;
diff --git a/drivers/target/target_core_iblock.c b/drivers/target/target_core_iblock.c
index 75373624604b..c05d38016556 100644
--- a/drivers/target/target_core_iblock.c
+++ b/drivers/target/target_core_iblock.c
@@ -93,7 +93,7 @@ static int iblock_configure_device(struct se_device *dev)
 		return -EINVAL;
 	}
 
-	ib_dev->ibd_bio_set = bioset_create(IBLOCK_BIO_POOL_SIZE, 0);
+	ib_dev->ibd_bio_set = bioset_create(IBLOCK_BIO_POOL_SIZE, 0, BIOSET_NEED_BVECS);
 	if (!ib_dev->ibd_bio_set) {
 		pr_err("IBLOCK: Unable to create bioset\n");
 		goto out;
diff --git a/fs/block_dev.c b/fs/block_dev.c
index bcd8e16a34e1..dd91c99e9ba0 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -439,7 +439,7 @@ blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 
 static __init int blkdev_init(void)
 {
-	blkdev_dio_pool = bioset_create(4, offsetof(struct blkdev_dio, bio));
+	blkdev_dio_pool = bioset_create(4, offsetof(struct blkdev_dio, bio), BIOSET_NEED_BVECS);
 	if (!blkdev_dio_pool)
 		return -ENOMEM;
 	return 0;
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 8f66e55e7ba1..19eedf2e630b 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -174,7 +174,8 @@ int __init extent_io_init(void)
 		goto free_state_cache;
 
 	btrfs_bioset = bioset_create(BIO_POOL_SIZE,
-				     offsetof(struct btrfs_io_bio, bio));
+				     offsetof(struct btrfs_io_bio, bio),
+				     BIOSET_NEED_BVECS);
 	if (!btrfs_bioset)
 		goto free_buffer_cache;
 
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 455a575f101d..97df4db13b2e 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1766,7 +1766,8 @@ STATIC int __init
 xfs_init_zones(void)
 {
 	xfs_ioend_bioset = bioset_create(4 * MAX_BUF_PER_PAGE,
-			offsetof(struct xfs_ioend, io_inline_bio));
+			offsetof(struct xfs_ioend, io_inline_bio),
+			BIOSET_NEED_BVECS);
 	if (!xfs_ioend_bioset)
 		goto out;
 
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 9455aada1399..985dc645637e 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -373,8 +373,10 @@ static inline struct bio *bio_next_split(struct bio *bio, int sectors,
 	return bio_split(bio, sectors, gfp, bs);
 }
 
-extern struct bio_set *bioset_create(unsigned int, unsigned int);
-extern struct bio_set *bioset_create_nobvec(unsigned int, unsigned int);
+extern struct bio_set *bioset_create(unsigned int, unsigned int, int flags);
+enum {
+	BIOSET_NEED_BVECS = BIT(0),
+};
 extern void bioset_free(struct bio_set *);
 extern mempool_t *biovec_create_pool(int pool_entries);
 

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 04/13] blk: use non-rescuing bioset for q->bio_split.
  2017-06-18  4:38 [PATCH 00/13] block: assorted cleanup for bio splitting and cloning NeilBrown
  2017-06-18  4:38 ` [PATCH 03/13] blk: make the bioset rescue_workqueue optional NeilBrown
@ 2017-06-18  4:38 ` NeilBrown
  2017-06-18  4:38 ` [PATCH 01/13] blk: remove bio_set arg from blk_queue_split() NeilBrown
                   ` (11 subsequent siblings)
  13 siblings, 0 replies; 53+ messages in thread
From: NeilBrown @ 2017-06-18  4:38 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, linux-kernel

A rescuing bioset is only useful if there might be bios from
that same bioset on the bio_list_on_stack queue at a time
when bio_alloc_bioset() is called.  This never applies to
q->bio_split.

Allocations from q->bio_split are only ever made from
blk_queue_split() which is only ever called early in each of
various make_request_fn()s.  The original bio (call this A)
is then passed to generic_make_request() and is placed on
the bio_list_on_stack queue, and the bio that was allocated
from q->bio_split (B) is processed.

The processing of this may cause other bios to be passed to
generic_make_request() or may even cause the bio B itself to
be passed, possibly after some prefix has been split off
(using some other bioset).

generic_make_request() now guarantees that all of these bios
(B and dependants) will be fully processed before the tail
of the original bio A gets handled.  None of these early bios
can possibly trigger an allocation from the original
q->bio_split as they are either too small to require
splitting or (more likely) are destined for a different queue.

The next time that the original q->bio_split might be used
by this thread is when A is processed again, as it might
still be too big to handle directly.  By this time there
cannot be any other bios allocated from q->bio_split in the
generic_make_request() queue.  So no rescuing will ever be
needed.
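
For reference, a loose sketch (paraphrased, not the exact kernel
source) of the blk_queue_split() behaviour this argument relies on:

	/* Paraphrased: the piece allocated from q->bio_split (B) is handed
	 * back to the caller, and the remainder of the original bio (A) is
	 * chained behind it and parked on bio_list_on_stack by
	 * generic_make_request().
	 */
	if (split) {
		bio_chain(split, *bio);		/* tail of A completes after B */
		generic_make_request(*bio);	/* tail of A waits on bio_list_on_stack */
		*bio = split;			/* caller proceeds with B */
	}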

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.com>
---
 block/blk-core.c |    3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 9bd10c46a538..62cf92550512 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -790,8 +790,7 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
 	if (q->id < 0)
 		goto fail_q;
 
-	q->bio_split = bioset_create(BIO_POOL_SIZE, 0, (BIOSET_NEED_BVECS |
-							BIOSET_NEED_RESCUER));
+	q->bio_split = bioset_create(BIO_POOL_SIZE, 0, BIOSET_NEED_BVECS);
 	if (!q->bio_split)
 		goto fail_id;
 

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 03/13] blk: make the bioset rescue_workqueue optional.
  2017-06-18  4:38 [PATCH 00/13] block: assorted cleanup for bio splitting and cloning NeilBrown
@ 2017-06-18  4:38 ` NeilBrown
  2017-06-18  4:38 ` [PATCH 04/13] blk: use non-rescuing bioset for q->bio_split NeilBrown
                   ` (12 subsequent siblings)
  13 siblings, 0 replies; 53+ messages in thread
From: NeilBrown @ 2017-06-18  4:38 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, linux-kernel

This patch converts bioset_create() to not create a workqueue by
default, so allocations will never trigger punt_bios_to_rescuer().  It
also introduces a new flag BIOSET_NEED_RESCUER which tells
bioset_create() to preserve the old behavior.

All callers of bioset_create() that are inside block device drivers
are given the BIOSET_NEED_RESCUER flag.

biosets used by filesystems or other top-level users do not
need rescuing as the bio can never be queued behind other
bios.  This includes fs_bio_set, blkdev_dio_pool,
btrfs_bioset, xfs_ioend_bioset, and one allocated by
target_core_iblock.c.

biosets used by md/raid do not need rescuing as
their usage was recently audited and revised to never
risk deadlock.

It is hoped that most, if not all, of the remaining biosets
can end up being the non-rescued version.
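
A rough sketch of the resulting usage split (illustrative only; the
actual call sites are converted in the diff below):

	/* Illustration only: a bioset buried inside a block driver keeps the
	 * rescuer for now, while a top-level (filesystem) bioset does not.
	 */
	driver_bs = bioset_create(pool_size, front_pad,
				  BIOSET_NEED_BVECS | BIOSET_NEED_RESCUER);
	fs_bs     = bioset_create(pool_size, front_pad, BIOSET_NEED_BVECS);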

Reviewed-by: Christoph Hellwig <hch@lst.de>
Credit-to: Ming Lei <ming.lei@redhat.com> (minor fixes)
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.com>
---
 block/bio.c                    |   13 +++++++++++--
 block/blk-core.c               |    3 ++-
 drivers/block/drbd/drbd_main.c |    4 +++-
 drivers/md/bcache/super.c      |    8 ++++++--
 drivers/md/dm-crypt.c          |    3 ++-
 drivers/md/dm-io.c             |    3 ++-
 drivers/md/dm.c                |    5 +++--
 include/linux/bio.h            |    1 +
 8 files changed, 30 insertions(+), 10 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index f81545ba44bf..36984bf382c7 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -363,6 +363,8 @@ static void punt_bios_to_rescuer(struct bio_set *bs)
 	struct bio_list punt, nopunt;
 	struct bio *bio;
 
+	if (WARN_ON_ONCE(!bs->rescue_workqueue))
+		return;
 	/*
 	 * In order to guarantee forward progress we must punt only bios that
 	 * were allocated from this bio_set; otherwise, if there was a bio on
@@ -474,7 +476,8 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, unsigned int nr_iovecs,
 
 		if (current->bio_list &&
 		    (!bio_list_empty(&current->bio_list[0]) ||
-		     !bio_list_empty(&current->bio_list[1])))
+		     !bio_list_empty(&current->bio_list[1])) &&
+		    bs->rescue_workqueue)
 			gfp_mask &= ~__GFP_DIRECT_RECLAIM;
 
 		p = mempool_alloc(bs->bio_pool, gfp_mask);
@@ -1925,7 +1928,8 @@ EXPORT_SYMBOL(bioset_free);
  * bioset_create  - Create a bio_set
  * @pool_size:	Number of bio and bio_vecs to cache in the mempool
  * @front_pad:	Number of bytes to allocate in front of the returned bio
- * @flags:	Flags to modify behavior, currently only %BIOSET_NEED_BVECS
+ * @flags:	Flags to modify behavior, currently %BIOSET_NEED_BVECS
+ *              and %BIOSET_NEED_RESCUER
  *
  * Description:
  *    Set up a bio_set to be used with @bio_alloc_bioset. Allows the caller
@@ -1936,6 +1940,8 @@ EXPORT_SYMBOL(bioset_free);
  *    or things will break badly.
  *    If %BIOSET_NEED_BVECS is set in @flags, a separate pool will be allocated
  *    for allocating iovecs.  This pool is not needed e.g. for bio_clone_fast().
+ *    If %BIOSET_NEED_RESCUER is set, a workqueue is created which can be used to
+ *    dispatch queued requests when the mempool runs out of space.
  *
  */
 struct bio_set *bioset_create(unsigned int pool_size,
@@ -1971,6 +1977,9 @@ struct bio_set *bioset_create(unsigned int pool_size,
 			goto bad;
 	}
 
+	if (!(flags & BIOSET_NEED_RESCUER))
+		return bs;
+
 	bs->rescue_workqueue = alloc_workqueue("bioset", WQ_MEM_RECLAIM, 0);
 	if (!bs->rescue_workqueue)
 		goto bad;
diff --git a/block/blk-core.c b/block/blk-core.c
index 62cf92550512..9bd10c46a538 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -790,7 +790,8 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
 	if (q->id < 0)
 		goto fail_q;
 
-	q->bio_split = bioset_create(BIO_POOL_SIZE, 0, BIOSET_NEED_BVECS);
+	q->bio_split = bioset_create(BIO_POOL_SIZE, 0, (BIOSET_NEED_BVECS |
+							BIOSET_NEED_RESCUER));
 	if (!q->bio_split)
 		goto fail_id;
 
diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c
index b395fe391171..bdf51b6977cf 100644
--- a/drivers/block/drbd/drbd_main.c
+++ b/drivers/block/drbd/drbd_main.c
@@ -2165,7 +2165,9 @@ static int drbd_create_mempools(void)
 		goto Enomem;
 
 	/* mempools */
-	drbd_md_io_bio_set = bioset_create(DRBD_MIN_POOL_PAGES, 0, BIOSET_NEED_BVECS);
+	drbd_md_io_bio_set = bioset_create(DRBD_MIN_POOL_PAGES, 0,
+					   BIOSET_NEED_BVECS |
+					   BIOSET_NEED_RESCUER);
 	if (drbd_md_io_bio_set == NULL)
 		goto Enomem;
 
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index abd6e825b39b..8352fad765f6 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -782,7 +782,9 @@ static int bcache_device_init(struct bcache_device *d, unsigned block_size,
 
 	minor *= BCACHE_MINORS;
 
-	if (!(d->bio_split = bioset_create(4, offsetof(struct bbio, bio), BIOSET_NEED_BVECS)) ||
+	if (!(d->bio_split = bioset_create(4, offsetof(struct bbio, bio),
+					   BIOSET_NEED_BVECS |
+					   BIOSET_NEED_RESCUER)) ||
 	    !(d->disk = alloc_disk(BCACHE_MINORS))) {
 		ida_simple_remove(&bcache_minor, minor);
 		return -ENOMEM;
@@ -1516,7 +1518,9 @@ struct cache_set *bch_cache_set_alloc(struct cache_sb *sb)
 				sizeof(struct bbio) + sizeof(struct bio_vec) *
 				bucket_pages(c))) ||
 	    !(c->fill_iter = mempool_create_kmalloc_pool(1, iter_size)) ||
-	    !(c->bio_split = bioset_create(4, offsetof(struct bbio, bio), BIOSET_NEED_BVECS)) ||
+	    !(c->bio_split = bioset_create(4, offsetof(struct bbio, bio),
+					   BIOSET_NEED_BVECS |
+					   BIOSET_NEED_RESCUER)) ||
 	    !(c->uuids = alloc_bucket_pages(GFP_KERNEL, c)) ||
 	    !(c->moving_gc_wq = alloc_workqueue("bcache_gc",
 						WQ_MEM_RECLAIM, 0)) ||
diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index 237ff8e9752a..9e1b72e8f7ef 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -2677,7 +2677,8 @@ static int crypt_ctr(struct dm_target *ti, unsigned int argc, char **argv)
 		goto bad;
 	}
 
-	cc->bs = bioset_create(MIN_IOS, 0, BIOSET_NEED_BVECS);
+	cc->bs = bioset_create(MIN_IOS, 0, (BIOSET_NEED_BVECS |
+					    BIOSET_NEED_RESCUER));
 	if (!cc->bs) {
 		ti->error = "Cannot allocate crypt bioset";
 		goto bad;
diff --git a/drivers/md/dm-io.c b/drivers/md/dm-io.c
index 5c4121024d92..81248a8a8b57 100644
--- a/drivers/md/dm-io.c
+++ b/drivers/md/dm-io.c
@@ -58,7 +58,8 @@ struct dm_io_client *dm_io_client_create(void)
 	if (!client->pool)
 		goto bad;
 
-	client->bios = bioset_create(min_ios, 0, BIOSET_NEED_BVECS);
+	client->bios = bioset_create(min_ios, 0, (BIOSET_NEED_BVECS |
+						  BIOSET_NEED_RESCUER));
 	if (!client->bios)
 		goto bad;
 
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 3394a311de3d..fbd06b9f9467 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1036,7 +1036,8 @@ static void flush_current_bio_list(struct blk_plug_cb *cb, bool from_schedule)
 
 		while ((bio = bio_list_pop(&list))) {
 			struct bio_set *bs = bio->bi_pool;
-			if (unlikely(!bs) || bs == fs_bio_set) {
+			if (unlikely(!bs) || bs == fs_bio_set ||
+			    !bs->rescue_workqueue) {
 				bio_list_add(&current->bio_list[i], bio);
 				continue;
 			}
@@ -2660,7 +2661,7 @@ struct dm_md_mempools *dm_alloc_md_mempools(struct mapped_device *md, enum dm_qu
 		BUG();
 	}
 
-	pools->bs = bioset_create(pool_size, front_pad, 0);
+	pools->bs = bioset_create(pool_size, front_pad, BIOSET_NEED_RESCUER);
 	if (!pools->bs)
 		goto out;
 
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 985dc645637e..32c786baa10a 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -376,6 +376,7 @@ static inline struct bio *bio_next_split(struct bio *bio, int sectors,
 extern struct bio_set *bioset_create(unsigned int, unsigned int, int flags);
 enum {
 	BIOSET_NEED_BVECS = BIT(0),
+	BIOSET_NEED_RESCUER = BIT(1),
 };
 extern void bioset_free(struct bio_set *);
 extern mempool_t *biovec_create_pool(int pool_entries);

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 06/13] rbd: use bio_clone_fast() instead of bio_clone()
  2017-06-18  4:38 [PATCH 00/13] block: assorted cleanup for bio splitting and cloning NeilBrown
                   ` (3 preceding siblings ...)
  2017-06-18  4:38 ` [PATCH 02/13] blk: replace bioset_create_nobvec() with a flags arg to bioset_create() NeilBrown
@ 2017-06-18  4:38 ` NeilBrown
  2017-06-18  4:38 ` [PATCH 08/13] pktcdvd: " NeilBrown
                   ` (8 subsequent siblings)
  13 siblings, 0 replies; 53+ messages in thread
From: NeilBrown @ 2017-06-18  4:38 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, linux-kernel

bio_clone() makes a copy of the bi_io_vec, but rbd never changes that,
so there is no need for a copy.
bio_clone_fast() can be used instead, which avoids making the copy.

This requires that we provide a bio_set.  bio_clone() uses fs_bio_set,
but it isn't, in general, safe to use the same bio_set at different
levels of the stack, as that can lead to deadlocks.  As filesystems
use fs_bio_set, block devices shouldn't.

As rbd never stacks, it is safe to have a single global bio_set for
all rbd devices to use.  So allocate that when the module is
initialised, and use it with bio_clone_fast().
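
The resulting pattern, roughly (the actual change is in the diff
below):

	/* Illustration: one module-wide bio_set, created at module init and
	 * used for every fast clone in place of fs_bio_set.
	 */
	rbd_bio_clone = bioset_create(BIO_POOL_SIZE, 0, 0);      /* module init */
	bio = bio_clone_fast(bio_src, gfpmask, rbd_bio_clone);   /* per-I/O clone */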

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: NeilBrown <neilb@suse.com>
---
 drivers/block/rbd.c |   16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 5420bc40c544..b008b6a98098 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -442,6 +442,8 @@ static DEFINE_SPINLOCK(rbd_client_list_lock);
 static struct kmem_cache	*rbd_img_request_cache;
 static struct kmem_cache	*rbd_obj_request_cache;
 
+static struct bio_set		*rbd_bio_clone;
+
 static int rbd_major;
 static DEFINE_IDA(rbd_dev_id_ida);
 
@@ -1363,7 +1365,7 @@ static struct bio *bio_clone_range(struct bio *bio_src,
 {
 	struct bio *bio;
 
-	bio = bio_clone(bio_src, gfpmask);
+	bio = bio_clone_fast(bio_src, gfpmask, rbd_bio_clone);
 	if (!bio)
 		return NULL;	/* ENOMEM */
 
@@ -6416,8 +6418,16 @@ static int rbd_slab_init(void)
 	if (!rbd_obj_request_cache)
 		goto out_err;
 
+	rbd_assert(!rbd_bio_clone);
+	rbd_bio_clone = bioset_create(BIO_POOL_SIZE, 0, 0);
+	if (!rbd_bio_clone)
+		goto out_err_clone;
+
 	return 0;
 
+out_err_clone:
+	kmem_cache_destroy(rbd_obj_request_cache);
+	rbd_obj_request_cache = NULL;
 out_err:
 	kmem_cache_destroy(rbd_img_request_cache);
 	rbd_img_request_cache = NULL;
@@ -6433,6 +6443,10 @@ static void rbd_slab_exit(void)
 	rbd_assert(rbd_img_request_cache);
 	kmem_cache_destroy(rbd_img_request_cache);
 	rbd_img_request_cache = NULL;
+
+	rbd_assert(rbd_bio_clone);
+	bioset_free(rbd_bio_clone);
+	rbd_bio_clone = NULL;
 }
 
 static int __init rbd_init(void)

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 09/13] lightnvm/pblk-read: use bio_clone_fast()
  2017-06-18  4:38 [PATCH 00/13] block: assorted cleanup for bio splitting and cloning NeilBrown
                   ` (6 preceding siblings ...)
  2017-06-18  4:38 ` [PATCH 05/13] block: Improvements to bounce-buffer handling NeilBrown
@ 2017-06-18  4:38 ` NeilBrown
  2017-06-18  4:38 ` [PATCH 07/13] drbd: use bio_clone_fast() instead of bio_clone() NeilBrown
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 53+ messages in thread
From: NeilBrown @ 2017-06-18  4:38 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, linux-kernel

pblk_submit_read() uses bio_clone_bioset() but doesn't change the
io_vec, so bio_clone_fast() is a better choice.

It also uses fs_bio_set which is intended for filesystems.  Using it
in a device driver can deadlock.
So allocate a new bioset, and use bio_clone_fast().

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Javier González <javier@cnexlabs.com>
Tested-by: Javier González <javier@cnexlabs.com>
Signed-off-by: NeilBrown <neilb@suse.com>
---
 drivers/lightnvm/pblk-init.c |   12 +++++++++++-
 drivers/lightnvm/pblk-read.c |    2 +-
 drivers/lightnvm/pblk.h      |    1 +
 3 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/drivers/lightnvm/pblk-init.c b/drivers/lightnvm/pblk-init.c
index b3fec8ec55b8..aaefbccce30e 100644
--- a/drivers/lightnvm/pblk-init.c
+++ b/drivers/lightnvm/pblk-init.c
@@ -23,6 +23,7 @@
 static struct kmem_cache *pblk_blk_ws_cache, *pblk_rec_cache, *pblk_r_rq_cache,
 					*pblk_w_rq_cache, *pblk_line_meta_cache;
 static DECLARE_RWSEM(pblk_lock);
+struct bio_set *pblk_bio_set;
 
 static int pblk_rw_io(struct request_queue *q, struct pblk *pblk,
 			  struct bio *bio)
@@ -946,11 +947,20 @@ static struct nvm_tgt_type tt_pblk = {
 
 static int __init pblk_module_init(void)
 {
-	return nvm_register_tgt_type(&tt_pblk);
+	int ret;
+
+	pblk_bio_set = bioset_create(BIO_POOL_SIZE, 0, 0);
+	if (!pblk_bio_set)
+		return -ENOMEM;
+	ret = nvm_register_tgt_type(&tt_pblk);
+	if (ret)
+		bioset_free(pblk_bio_set);
+	return ret;
 }
 
 static void pblk_module_exit(void)
 {
+	bioset_free(pblk_bio_set);
 	nvm_unregister_tgt_type(&tt_pblk);
 }
 
diff --git a/drivers/lightnvm/pblk-read.c b/drivers/lightnvm/pblk-read.c
index 762c0b73cb67..74d3fc53022e 100644
--- a/drivers/lightnvm/pblk-read.c
+++ b/drivers/lightnvm/pblk-read.c
@@ -342,7 +342,7 @@ int pblk_submit_read(struct pblk *pblk, struct bio *bio)
 		struct pblk_r_ctx *r_ctx = nvm_rq_to_pdu(rqd);
 
 		/* Clone read bio to deal with read errors internally */
-		int_bio = bio_clone_bioset(bio, GFP_KERNEL, fs_bio_set);
+		int_bio = bio_clone_fast(bio, GFP_KERNEL, pblk_bio_set);
 		if (!int_bio) {
 			pr_err("pblk: could not clone read bio\n");
 			return NVM_IO_ERR;
diff --git a/drivers/lightnvm/pblk.h b/drivers/lightnvm/pblk.h
index 99f3186b5288..95b665f23925 100644
--- a/drivers/lightnvm/pblk.h
+++ b/drivers/lightnvm/pblk.h
@@ -702,6 +702,7 @@ void pblk_write_should_kick(struct pblk *pblk);
 /*
  * pblk read path
  */
+extern struct bio_set *pblk_bio_set;
 int pblk_submit_read(struct pblk *pblk, struct bio *bio);
 int pblk_submit_read_gc(struct pblk *pblk, u64 *lba_list, void *data,
 			unsigned int nr_secs, unsigned int *secs_to_gc,

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 05/13] block: Improvements to bounce-buffer handling
  2017-06-18  4:38 [PATCH 00/13] block: assorted cleanup for bio splitting and cloning NeilBrown
                   ` (5 preceding siblings ...)
  2017-06-18  4:38 ` [PATCH 08/13] pktcdvd: " NeilBrown
@ 2017-06-18  4:38 ` NeilBrown
  2017-06-18  4:38 ` [PATCH 09/13] lightnvm/pblk-read: use bio_clone_fast() NeilBrown
                   ` (6 subsequent siblings)
  13 siblings, 0 replies; 53+ messages in thread
From: NeilBrown @ 2017-06-18  4:38 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, linux-kernel

Since commit 23688bf4f830 ("block: ensure to split after potentially
bouncing a bio") blk_queue_bounce() is called *before*
blk_queue_split().
This means that:
 1/ the comments in blk_queue_split() about bounce buffers are
    irrelevant, and
 2/ a very large bio (more than BIO_MAX_PAGES) will no longer be
    split before it arrives at blk_queue_bounce(), leading to the
    possibility that bio_clone_bioset() will fail and a NULL
    will be dereferenced.

Separately, blk_queue_bounce() shouldn't use fs_bio_set as the bio
being copied could be from the same set, and this could lead to a
deadlock.

So:
 - allocate 2 private biosets for blk_queue_bounce, one for
   splitting enormous bios and one for cloning bios.
 - add code to split a bio that exceeds BIO_MAX_PAGES.
 - Fix up the comments in blk_queue_split()

Credit-to: Ming Lei <tom.leiming@gmail.com> (suggested using single bio_for_each_segment loop)
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: NeilBrown <neilb@suse.com>
---
 block/blk-merge.c |   14 ++++----------
 block/bounce.c    |   32 ++++++++++++++++++++++++++------
 2 files changed, 30 insertions(+), 16 deletions(-)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index d59074556703..51c84540d3bb 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -117,17 +117,11 @@ static struct bio *blk_bio_segment_split(struct request_queue *q,
 		 * each holds at most BIO_MAX_PAGES bvecs because
 		 * bio_clone() can fail to allocate big bvecs.
 		 *
-		 * It should have been better to apply the limit per
-		 * request queue in which bio_clone() is involved,
-		 * instead of globally. The biggest blocker is the
-		 * bio_clone() in bio bounce.
+		 * Those drivers which will need to use bio_clone()
+		 * should tell us in some way.  For now, impose the
+		 * BIO_MAX_PAGES limit on all queues.
 		 *
-		 * If bio is splitted by this reason, we should have
-		 * allowed to continue bios merging, but don't do
-		 * that now for making the change simple.
-		 *
-		 * TODO: deal with bio bounce's bio_clone() gracefully
-		 * and convert the global limit into per-queue limit.
+		 * TODO: handle users of bio_clone() differently.
 		 */
 		if (bvecs++ >= BIO_MAX_PAGES)
 			goto split;
diff --git a/block/bounce.c b/block/bounce.c
index e4703181d97f..17d77613c471 100644
--- a/block/bounce.c
+++ b/block/bounce.c
@@ -26,6 +26,7 @@
 #define POOL_SIZE	64
 #define ISA_POOL_SIZE	16
 
+struct bio_set *bounce_bio_set, *bounce_bio_split;
 static mempool_t *page_pool, *isa_page_pool;
 
 #if defined(CONFIG_HIGHMEM) || defined(CONFIG_NEED_BOUNCE_POOL)
@@ -40,6 +41,14 @@ static __init int init_emergency_pool(void)
 	BUG_ON(!page_pool);
 	pr_info("pool size: %d pages\n", POOL_SIZE);
 
+	bounce_bio_set = bioset_create(BIO_POOL_SIZE, 0, BIOSET_NEED_BVECS);
+	BUG_ON(!bounce_bio_set);
+	if (bioset_integrity_create(bounce_bio_set, BIO_POOL_SIZE))
+		BUG_ON(1);
+
+	bounce_bio_split = bioset_create(BIO_POOL_SIZE, 0, 0);
+	BUG_ON(!bounce_bio_split);
+
 	return 0;
 }
 
@@ -186,15 +195,26 @@ static void __blk_queue_bounce(struct request_queue *q, struct bio **bio_orig,
 	int rw = bio_data_dir(*bio_orig);
 	struct bio_vec *to, from;
 	struct bvec_iter iter;
-	unsigned i;
+	unsigned i = 0;
+	bool bounce = false;
+	int sectors = 0;
 
-	bio_for_each_segment(from, *bio_orig, iter)
+	bio_for_each_segment(from, *bio_orig, iter) {
+		if (i++ < BIO_MAX_PAGES)
+			sectors += from.bv_len >> 9;
 		if (page_to_pfn(from.bv_page) > queue_bounce_pfn(q))
-			goto bounce;
+			bounce = true;
+	}
+	if (!bounce)
+		return;
 
-	return;
-bounce:
-	bio = bio_clone_bioset(*bio_orig, GFP_NOIO, fs_bio_set);
+	if (sectors < bio_sectors(*bio_orig)) {
+		bio = bio_split(*bio_orig, sectors, GFP_NOIO, bounce_bio_split);
+		bio_chain(bio, *bio_orig);
+		generic_make_request(*bio_orig);
+		*bio_orig = bio;
+	}
+	bio = bio_clone_bioset(*bio_orig, GFP_NOIO, bounce_bio_set);
 
 	bio_for_each_segment_all(to, bio, i) {
 		struct page *page = to->bv_page;

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 07/13] drbd: use bio_clone_fast() instead of bio_clone()
  2017-06-18  4:38 [PATCH 00/13] block: assorted cleanup for bio splitting and cloning NeilBrown
                   ` (7 preceding siblings ...)
  2017-06-18  4:38 ` [PATCH 09/13] lightnvm/pblk-read: use bio_clone_fast() NeilBrown
@ 2017-06-18  4:38 ` NeilBrown
  2017-06-18  4:38 ` [PATCH 12/13] block: remove bio_clone() and all references NeilBrown
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 53+ messages in thread
From: NeilBrown @ 2017-06-18  4:38 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, linux-kernel

drbd does not modify the bi_io_vec of the cloned bio,
so there is no need to clone that part.  So bio_clone_fast()
is the better choice.
For bio_clone_fast() we need to specify a bio_set.
We could use fs_bio_set, which bio_clone() uses, or
drbd_md_io_bio_set, which drbd uses for metadata, but it is
generally best to avoid sharing bio_sets unless you can
be certain that there are no interdependencies.

So create a new bio_set, drbd_io_bio_set, and use bio_clone_fast().

Also remove a "XXX cannot fail ???" comment because it definitely
cannot fail - bio_clone_fast() doesn't fail if the GFP flags allow for
sleeping.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: NeilBrown <neilb@suse.com>
---
 drivers/block/drbd/drbd_int.h  |    3 +++
 drivers/block/drbd/drbd_main.c |    9 +++++++++
 drivers/block/drbd/drbd_req.h  |    2 +-
 3 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/drivers/block/drbd/drbd_int.h b/drivers/block/drbd/drbd_int.h
index 76761b4ca13e..d17b6e6393c7 100644
--- a/drivers/block/drbd/drbd_int.h
+++ b/drivers/block/drbd/drbd_int.h
@@ -1441,6 +1441,9 @@ extern struct bio_set *drbd_md_io_bio_set;
 /* to allocate from that set */
 extern struct bio *bio_alloc_drbd(gfp_t gfp_mask);
 
+/* And a bio_set for cloning */
+extern struct bio_set *drbd_io_bio_set;
+
 extern struct mutex resources_mutex;
 
 extern int conn_lowest_minor(struct drbd_connection *connection);
diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c
index bdf51b6977cf..90680034ef57 100644
--- a/drivers/block/drbd/drbd_main.c
+++ b/drivers/block/drbd/drbd_main.c
@@ -128,6 +128,7 @@ mempool_t *drbd_request_mempool;
 mempool_t *drbd_ee_mempool;
 mempool_t *drbd_md_io_page_pool;
 struct bio_set *drbd_md_io_bio_set;
+struct bio_set *drbd_io_bio_set;
 
 /* I do not use a standard mempool, because:
    1) I want to hand out the pre-allocated objects first.
@@ -2098,6 +2099,8 @@ static void drbd_destroy_mempools(void)
 
 	/* D_ASSERT(device, atomic_read(&drbd_pp_vacant)==0); */
 
+	if (drbd_io_bio_set)
+		bioset_free(drbd_io_bio_set);
 	if (drbd_md_io_bio_set)
 		bioset_free(drbd_md_io_bio_set);
 	if (drbd_md_io_page_pool)
@@ -2115,6 +2118,7 @@ static void drbd_destroy_mempools(void)
 	if (drbd_al_ext_cache)
 		kmem_cache_destroy(drbd_al_ext_cache);
 
+	drbd_io_bio_set      = NULL;
 	drbd_md_io_bio_set   = NULL;
 	drbd_md_io_page_pool = NULL;
 	drbd_ee_mempool      = NULL;
@@ -2142,6 +2146,7 @@ static int drbd_create_mempools(void)
 	drbd_pp_pool         = NULL;
 	drbd_md_io_page_pool = NULL;
 	drbd_md_io_bio_set   = NULL;
+	drbd_io_bio_set      = NULL;
 
 	/* caches */
 	drbd_request_cache = kmem_cache_create(
@@ -2165,6 +2170,10 @@ static int drbd_create_mempools(void)
 		goto Enomem;
 
 	/* mempools */
+	drbd_io_bio_set = bioset_create(BIO_POOL_SIZE, 0, BIOSET_NEED_RESCUER);
+	if (drbd_io_bio_set == NULL)
+		goto Enomem;
+
 	drbd_md_io_bio_set = bioset_create(DRBD_MIN_POOL_PAGES, 0,
 					   BIOSET_NEED_BVECS |
 					   BIOSET_NEED_RESCUER);
diff --git a/drivers/block/drbd/drbd_req.h b/drivers/block/drbd/drbd_req.h
index eb49e7f2da91..9e1866ab238f 100644
--- a/drivers/block/drbd/drbd_req.h
+++ b/drivers/block/drbd/drbd_req.h
@@ -263,7 +263,7 @@ enum drbd_req_state_bits {
 static inline void drbd_req_make_private_bio(struct drbd_request *req, struct bio *bio_src)
 {
 	struct bio *bio;
-	bio = bio_clone(bio_src, GFP_NOIO); /* XXX cannot fail?? */
+	bio = bio_clone_fast(bio_src, GFP_NOIO, drbd_io_bio_set);
 
 	req->private_bio = bio;
 

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 08/13] pktcdvd: use bio_clone_fast() instead of bio_clone()
  2017-06-18  4:38 [PATCH 00/13] block: assorted cleanup for bio splitting and cloning NeilBrown
                   ` (4 preceding siblings ...)
  2017-06-18  4:38 ` [PATCH 06/13] rbd: use bio_clone_fast() instead of bio_clone() NeilBrown
@ 2017-06-18  4:38 ` NeilBrown
  2017-06-18  4:38 ` [PATCH 05/13] block: Improvements to bounce-buffer handling NeilBrown
                   ` (7 subsequent siblings)
  13 siblings, 0 replies; 53+ messages in thread
From: NeilBrown @ 2017-06-18  4:38 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, linux-kernel

pktcdvd doesn't change the bi_io_vec of the clone bio,
so it is more efficient to use bio_clone_fast(), and not clone
the bi_io_vec.
This requires providing a bio_set, and it is safest to
provide a dedicated bio_set rather than sharing
fs_bio_set, which filesystems use.
This new bio_set, pkt_bio_set, can also be used for the bio_split()
call, as the two allocations (bio_clone_fast and bio_split) are
independent; neither can block a bio allocated by the other.
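
Roughly, both users draw from the same dedicated set (illustration
only; 'sectors' stands in for the split point computed in the diff
below):

	/* Both allocations come from pkt_bio_set; neither depends on the other. */
	cloned_bio = bio_clone_fast(bio, GFP_NOIO, pkt_bio_set);
	split = bio_split(bio, sectors, GFP_NOIO, pkt_bio_set);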

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: NeilBrown <neilb@suse.com>
---
 drivers/block/pktcdvd.c |   12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/drivers/block/pktcdvd.c b/drivers/block/pktcdvd.c
index 1f363638b453..26c04baae967 100644
--- a/drivers/block/pktcdvd.c
+++ b/drivers/block/pktcdvd.c
@@ -98,6 +98,7 @@ static int write_congestion_on  = PKT_WRITE_CONGESTION_ON;
 static int write_congestion_off = PKT_WRITE_CONGESTION_OFF;
 static struct mutex ctl_mutex;	/* Serialize open/close/setup/teardown */
 static mempool_t *psd_pool;
+static struct bio_set *pkt_bio_set;
 
 static struct class	*class_pktcdvd = NULL;    /* /sys/class/pktcdvd */
 static struct dentry	*pkt_debugfs_root = NULL; /* /sys/kernel/debug/pktcdvd */
@@ -2310,7 +2311,7 @@ static void pkt_end_io_read_cloned(struct bio *bio)
 
 static void pkt_make_request_read(struct pktcdvd_device *pd, struct bio *bio)
 {
-	struct bio *cloned_bio = bio_clone(bio, GFP_NOIO);
+	struct bio *cloned_bio = bio_clone_fast(bio, GFP_NOIO, pkt_bio_set);
 	struct packet_stacked_data *psd = mempool_alloc(psd_pool, GFP_NOIO);
 
 	psd->pd = pd;
@@ -2455,7 +2456,7 @@ static blk_qc_t pkt_make_request(struct request_queue *q, struct bio *bio)
 
 			split = bio_split(bio, last_zone -
 					  bio->bi_iter.bi_sector,
-					  GFP_NOIO, fs_bio_set);
+					  GFP_NOIO, pkt_bio_set);
 			bio_chain(split, bio);
 		} else {
 			split = bio;
@@ -2924,6 +2925,11 @@ static int __init pkt_init(void)
 					sizeof(struct packet_stacked_data));
 	if (!psd_pool)
 		return -ENOMEM;
+	pkt_bio_set = bioset_create(BIO_POOL_SIZE, 0, 0);
+	if (!pkt_bio_set) {
+		mempool_destroy(psd_pool);
+		return -ENOMEM;
+	}
 
 	ret = register_blkdev(pktdev_major, DRIVER_NAME);
 	if (ret < 0) {
@@ -2956,6 +2962,7 @@ static int __init pkt_init(void)
 	unregister_blkdev(pktdev_major, DRIVER_NAME);
 out2:
 	mempool_destroy(psd_pool);
+	bioset_free(pkt_bio_set);
 	return ret;
 }
 
@@ -2969,6 +2976,7 @@ static void __exit pkt_exit(void)
 
 	unregister_blkdev(pktdev_major, DRIVER_NAME);
 	mempool_destroy(psd_pool);
+	bioset_free(pkt_bio_set);
 }
 
 MODULE_DESCRIPTION("Packet writing layer for CD/DVD drives");

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 10/13] xen-blkfront: remove bio splitting.
  2017-06-18  4:38 [PATCH 00/13] block: assorted cleanup for bio splitting and cloning NeilBrown
                   ` (11 preceding siblings ...)
  2017-06-18  4:38 ` [PATCH 13/13] block: don't check for BIO_MAX_PAGES in blk_bio_segment_split() NeilBrown
@ 2017-06-18  4:38 ` NeilBrown
  2017-06-18 18:41 ` [PATCH 00/13] block: assorted cleanup for bio splitting and cloning Jens Axboe
  13 siblings, 0 replies; 53+ messages in thread
From: NeilBrown @ 2017-06-18  4:38 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, linux-kernel

bios that are re-submitted will pass through blk_queue_split() when
blk_queue_bio() is called, and this will split the bio if necessary.
There is no longer any need to do this splitting in xen-blkfront.
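
Roughly, the path a re-queued bio now takes (a sketch of the call
chain from memory, not driver code) is:

	submit_bio(bio);
	  /* -> generic_make_request(bio)
	   *      -> q->make_request_fn(q, bio)
	   *           -> blk_queue_split(q, &bio)
	   * so the bio is cut down to the queue limits before the
	   * driver ever sees it.
	   */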

Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: NeilBrown <neilb@suse.com>
---
 drivers/block/xen-blkfront.c |   54 ++----------------------------------------
 1 file changed, 3 insertions(+), 51 deletions(-)

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index e3be666c2776..ac90093fcb25 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -110,11 +110,6 @@ struct blk_shadow {
 	unsigned long associated_id;
 };
 
-struct split_bio {
-	struct bio *bio;
-	atomic_t pending;
-};
-
 struct blkif_req {
 	int	error;
 };
@@ -2000,28 +1995,13 @@ static int blkfront_probe(struct xenbus_device *dev,
 	return 0;
 }
 
-static void split_bio_end(struct bio *bio)
-{
-	struct split_bio *split_bio = bio->bi_private;
-
-	if (atomic_dec_and_test(&split_bio->pending)) {
-		split_bio->bio->bi_phys_segments = 0;
-		split_bio->bio->bi_status = bio->bi_status;
-		bio_endio(split_bio->bio);
-		kfree(split_bio);
-	}
-	bio_put(bio);
-}
-
 static int blkif_recover(struct blkfront_info *info)
 {
-	unsigned int i, r_index;
+	unsigned int r_index;
 	struct request *req, *n;
 	int rc;
-	struct bio *bio, *cloned_bio;
-	unsigned int segs, offset;
-	int pending, size;
-	struct split_bio *split_bio;
+	struct bio *bio;
+	unsigned int segs;
 
 	blkfront_gather_backend_features(info);
 	/* Reset limits changed by blk_mq_update_nr_hw_queues(). */
@@ -2060,34 +2040,6 @@ static int blkif_recover(struct blkfront_info *info)
 
 	while ((bio = bio_list_pop(&info->bio_list)) != NULL) {
 		/* Traverse the list of pending bios and re-queue them */
-		if (bio_segments(bio) > segs) {
-			/*
-			 * This bio has more segments than what we can
-			 * handle, we have to split it.
-			 */
-			pending = (bio_segments(bio) + segs - 1) / segs;
-			split_bio = kzalloc(sizeof(*split_bio), GFP_NOIO);
-			BUG_ON(split_bio == NULL);
-			atomic_set(&split_bio->pending, pending);
-			split_bio->bio = bio;
-			for (i = 0; i < pending; i++) {
-				offset = (i * segs * XEN_PAGE_SIZE) >> 9;
-				size = min((unsigned int)(segs * XEN_PAGE_SIZE) >> 9,
-					   (unsigned int)bio_sectors(bio) - offset);
-				cloned_bio = bio_clone(bio, GFP_NOIO);
-				BUG_ON(cloned_bio == NULL);
-				bio_trim(cloned_bio, offset, size);
-				cloned_bio->bi_private = split_bio;
-				cloned_bio->bi_end_io = split_bio_end;
-				submit_bio(cloned_bio);
-			}
-			/*
-			 * Now we have to wait for all those smaller bios to
-			 * end, so we can also end the "parent" bio.
-			 */
-			continue;
-		}
-		/* We don't need to split this bio */
 		submit_bio(bio);
 	}
 

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 11/13] bcache: use kmalloc to allocate bio in bch_data_verify()
  2017-06-18  4:38 [PATCH 00/13] block: assorted cleanup for bio splitting and cloning NeilBrown
                   ` (9 preceding siblings ...)
  2017-06-18  4:38 ` [PATCH 12/13] block: remove bio_clone() and all references NeilBrown
@ 2017-06-18  4:38 ` NeilBrown
  2017-06-18  4:38 ` [PATCH 13/13] block: don't check for BIO_MAX_PAGES in blk_bio_segment_split() NeilBrown
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 53+ messages in thread
From: NeilBrown @ 2017-06-18  4:38 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, linux-kernel

This function allocates a bio, then a collection
of pages.  It copes with failure.

It currently uses a mempool (via bio_clone()) to allocate the bio,
but alloc_page() to allocate the pages.  These fail
in different ways, so the usage is inconsistent.

Change the bio_clone() call to bio_clone_kmalloc()
so that no pool is used either for the bio or the pages.
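
Broadly speaking, the mempool-backed clone with GFP_NOIO waits for a
pool entry rather than returning NULL, while alloc_page(GFP_NOIO) can
fail outright.  After this change both allocations fail the same way
(a sketch, not the verbatim bcache code):

	check = bio_clone_kmalloc(bio, GFP_NOIO);	/* may return NULL */
	if (!check)
		return;
	/* ... each alloc_page(GFP_NOIO) for the data pages may also
	 * return NULL, and is handled by the same kind of bail-out. */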

Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Kent Overstreet <kent.overstreet@gmail.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.com>
---
 drivers/md/bcache/debug.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/md/bcache/debug.c b/drivers/md/bcache/debug.c
index 06f55056aaae..35a5a7210e51 100644
--- a/drivers/md/bcache/debug.c
+++ b/drivers/md/bcache/debug.c
@@ -110,7 +110,7 @@ void bch_data_verify(struct cached_dev *dc, struct bio *bio)
 	struct bio_vec bv, cbv;
 	struct bvec_iter iter, citer = { 0 };
 
-	check = bio_clone(bio, GFP_NOIO);
+	check = bio_clone_kmalloc(bio, GFP_NOIO);
 	if (!check)
 		return;
 	check->bi_opf = REQ_OP_READ;

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 13/13] block: don't check for BIO_MAX_PAGES in blk_bio_segment_split()
  2017-06-18  4:38 [PATCH 00/13] block: assorted cleanup for bio splitting and cloning NeilBrown
                   ` (10 preceding siblings ...)
  2017-06-18  4:38 ` [PATCH 11/13] bcache: use kmalloc to allocate bio in bch_data_verify() NeilBrown
@ 2017-06-18  4:38 ` NeilBrown
  2017-06-18  4:38 ` [PATCH 10/13] xen-blkfront: remove bio splitting NeilBrown
  2017-06-18 18:41 ` [PATCH 00/13] block: assorted cleanup for bio splitting and cloning Jens Axboe
  13 siblings, 0 replies; 53+ messages in thread
From: NeilBrown @ 2017-06-18  4:38 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, linux-kernel

blk_bio_segment_split() makes sure bios have no more than
BIO_MAX_PAGES entries in the bi_io_vec.
This was done because bio_clone_bioset() (when given a
mempool bioset) could not handle larger io_vecs.

No driver uses bio_clone_bioset() any more; they all
use bio_clone_fast() if anything, and bio_clone_fast()
doesn't clone the bi_io_vec.

The main user of bio_clone_bioset() at this level
is bounce.c, and bouncing now happens before blk_bio_segment_split(),
so that is not of concern.

So remove the big helpful comment and the code.
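
(The reason the limit can go: a fast clone only borrows the source's
vector.  From memory, the relevant assignments in __bio_clone_fast()
are roughly:

	bio->bi_io_vec = bio_src->bi_io_vec;	/* shared, nothing allocated */
	bio->bi_iter = bio_src->bi_iter;

so a clone never needs a bvec allocation, however many segments the
source bio has.)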

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: NeilBrown <neilb@suse.com>
---
 block/blk-merge.c |   16 ----------------
 1 file changed, 16 deletions(-)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index e7862e9dcc39..cea544ec5d96 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -108,25 +108,9 @@ static struct bio *blk_bio_segment_split(struct request_queue *q,
 	bool do_split = true;
 	struct bio *new = NULL;
 	const unsigned max_sectors = get_max_io_size(q, bio);
-	unsigned bvecs = 0;
 
 	bio_for_each_segment(bv, bio, iter) {
 		/*
-		 * With arbitrary bio size, the incoming bio may be very
-		 * big. We have to split the bio into small bios so that
-		 * each holds at most BIO_MAX_PAGES bvecs because
-		 * bio_clone_bioset() can fail to allocate big bvecs.
-		 *
-		 * Those drivers which will need to use bio_clone_bioset()
-		 * should tell us in some way.  For now, impose the
-		 * BIO_MAX_PAGES limit on all queues.
-		 *
-		 * TODO: handle users of bio_clone_bioset() differently.
-		 */
-		if (bvecs++ >= BIO_MAX_PAGES)
-			goto split;
-
-		/*
 		 * If the queue doesn't support SG gaps and adding this
 		 * offset would create a gap, disallow it.
 		 */

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH 12/13] block: remove bio_clone() and all references.
  2017-06-18  4:38 [PATCH 00/13] block: assorted cleanup for bio splitting and cloning NeilBrown
                   ` (8 preceding siblings ...)
  2017-06-18  4:38 ` [PATCH 07/13] drbd: use bio_clone_fast() instead of bio_clone() NeilBrown
@ 2017-06-18  4:38 ` NeilBrown
  2017-06-18  4:38 ` [PATCH 11/13] bcache: use kmalloc to allocate bio in bch_data_verify() NeilBrown
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 53+ messages in thread
From: NeilBrown @ 2017-06-18  4:38 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, linux-kernel

bio_clone() is no longer used; only bio_clone_bioset() and
bio_clone_fast() remain.
This is for the best, as bio_clone() used fs_bio_set,
and filesystems are unlikely to want to use bio_clone().

So remove bio_clone() and all references.
This includes a fix to some incorrect documentation.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.com>
---
 Documentation/block/biodoc.txt |    2 +-
 block/bio.c                    |    2 +-
 block/blk-merge.c              |    6 +++---
 drivers/md/md.c                |    2 +-
 include/linux/bio.h            |    5 -----
 5 files changed, 6 insertions(+), 11 deletions(-)

diff --git a/Documentation/block/biodoc.txt b/Documentation/block/biodoc.txt
index 01ddeaf64b0f..9490f2845f06 100644
--- a/Documentation/block/biodoc.txt
+++ b/Documentation/block/biodoc.txt
@@ -632,7 +632,7 @@ to i/o submission, if the bio fields are likely to be accessed after the
 i/o is issued (since the bio may otherwise get freed in case i/o completion
 happens in the meantime).
 
-The bio_clone() routine may be used to duplicate a bio, where the clone
+The bio_clone_fast() routine may be used to duplicate a bio, where the clone
 shares the bio_vec_list with the original bio (i.e. both point to the
 same bio_vec_list). This would typically be used for splitting i/o requests
 in lvm or md.
diff --git a/block/bio.c b/block/bio.c
index 36984bf382c7..3f4ac6366d1d 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -547,7 +547,7 @@ EXPORT_SYMBOL(zero_fill_bio);
  *
  * Description:
  *   Put a reference to a &struct bio, either one you have gotten with
- *   bio_alloc, bio_get or bio_clone. The last put of a bio will free it.
+ *   bio_alloc, bio_get or bio_clone_*. The last put of a bio will free it.
  **/
 void bio_put(struct bio *bio)
 {
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 51c84540d3bb..e7862e9dcc39 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -115,13 +115,13 @@ static struct bio *blk_bio_segment_split(struct request_queue *q,
 		 * With arbitrary bio size, the incoming bio may be very
 		 * big. We have to split the bio into small bios so that
 		 * each holds at most BIO_MAX_PAGES bvecs because
-		 * bio_clone() can fail to allocate big bvecs.
+		 * bio_clone_bioset() can fail to allocate big bvecs.
 		 *
-		 * Those drivers which will need to use bio_clone()
+		 * Those drivers which will need to use bio_clone_bioset()
 		 * should tell us in some way.  For now, impose the
 		 * BIO_MAX_PAGES limit on all queues.
 		 *
-		 * TODO: handle users of bio_clone() differently.
+		 * TODO: handle users of bio_clone_bioset() differently.
 		 */
 		if (bvecs++ >= BIO_MAX_PAGES)
 			goto split;
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 7e2b79408e37..d60958c84500 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -185,7 +185,7 @@ static int start_readonly;
 static bool create_on_open = true;
 
 /* bio_clone_mddev
- * like bio_clone, but with a local bio set
+ * like bio_clone_bioset, but with a local bio set
  */
 
 struct bio *bio_alloc_mddev(gfp_t gfp_mask, int nr_iovecs,
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 32c786baa10a..40d054185277 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -395,11 +395,6 @@ static inline struct bio *bio_alloc(gfp_t gfp_mask, unsigned int nr_iovecs)
 	return bio_alloc_bioset(gfp_mask, nr_iovecs, fs_bio_set);
 }
 
-static inline struct bio *bio_clone(struct bio *bio, gfp_t gfp_mask)
-{
-	return bio_clone_bioset(bio, gfp_mask, fs_bio_set);
-}
-
 static inline struct bio *bio_kmalloc(gfp_t gfp_mask, unsigned int nr_iovecs)
 {
 	return bio_alloc_bioset(gfp_mask, nr_iovecs, NULL);

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* Re: [PATCH 00/13] block: assorted cleanup for bio splitting and cloning.
  2017-06-18  4:38 [PATCH 00/13] block: assorted cleanup for bio splitting and cloning NeilBrown
                   ` (12 preceding siblings ...)
  2017-06-18  4:38 ` [PATCH 10/13] xen-blkfront: remove bio splitting NeilBrown
@ 2017-06-18 18:41 ` Jens Axboe
  2017-06-18 21:36   ` NeilBrown
  13 siblings, 1 reply; 53+ messages in thread
From: Jens Axboe @ 2017-06-18 18:41 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-block, linux-kernel

On Sun, Jun 18 2017, NeilBrown wrote:
> This is a resend of my series of patches working
> towards removing the bioset work queues.
> 
> This set is based on for-4.13/block.
> 
> It incorporates the revised versions of all the patches that were
> resent following feedback on the last set.
> 
> It also includes a minor grammatic improvement to a comment, and
> simple changes to compensate for a couple of changes to the block tree
> since the last posting.
> 
> I hope to eventually get rid of the new BIOSET_NEED_RESCUER flag,
> but that needs work in dm and probably bcache first.

Thanks Neil, applied.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 00/13] block: assorted cleanup for bio splitting and cloning.
  2017-06-18 18:41 ` [PATCH 00/13] block: assorted cleanup for bio splitting and cloning Jens Axboe
@ 2017-06-18 21:36   ` NeilBrown
  2017-11-20 16:43     ` Mike Snitzer
  0 siblings, 1 reply; 53+ messages in thread
From: NeilBrown @ 2017-06-18 21:36 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 721 bytes --]

On Sun, Jun 18 2017, Jens Axboe wrote:

> On Sun, Jun 18 2017, NeilBrown wrote:
>> This is a resend of my series of patches working
>> towards removing the bioset work queues.
>> 
>> This set is based on for-4.13/block.
>> 
>> It incorporates the revised versions of all the patches that were
>> resent following feedback on the last set.
>> 
>> It also includes a minor grammatic improvement to a comment, and
>> simple changes to compensate for a couple of changes to the block tree
>> since the last posting.
>> 
>> I hope to eventually get rid of the new BIOSET_NEED_RESCUER flag,
>> but that needs work in dm and probably bcache first.
>
> Thanks Neil, applied.

Thanks a lot Jens.

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 00/13] block: assorted cleanup for bio splitting and cloning.
  2017-06-18 21:36   ` NeilBrown
@ 2017-11-20 16:43     ` Mike Snitzer
  2017-11-21  0:34         ` NeilBrown
  0 siblings, 1 reply; 53+ messages in thread
From: Mike Snitzer @ 2017-11-20 16:43 UTC (permalink / raw)
  To: NeilBrown
  Cc: Jens Axboe, linux-block, linux-kernel, device-mapper development

On Sun, Jun 18, 2017 at 5:36 PM, NeilBrown <neilb@suse.com> wrote:
> On Sun, Jun 18 2017, Jens Axboe wrote:
>
>> On Sun, Jun 18 2017, NeilBrown wrote:
>>> This is a resend of my series of patches working
>>> towards removing the bioset work queues.
>>>
>>> This set is based on for-4.13/block.
>>>
>>> It incorporates the revised versions of all the patches that were
>>> resent following feedback on the last set.
>>>
>>> It also includes a minor grammatic improvement to a comment, and
>>> simple changes to compensate for a couple of changes to the block tree
>>> since the last posting.
>>>
>>> I hope to eventually get rid of the new BIOSET_NEED_RESCUER flag,
>>> but that needs work in dm and probably bcache first.
>>
>> Thanks Neil, applied.
>
> Thanks a lot Jens.

I missed this line of work until now.  Not quite sure why I wasn't
cc'd or my review/ack required for the DM changes in this patchset.

But I've now queued this patch for once Linus gets back (reverts DM
changes from commit 47e0fb461f):
https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=for-next&id=c9fdc42ba23eabd1ba7aef199fb9bb4b4fe5c545

As is, it is my understanding that DM has no need for a bio_set's
rescue_workqueue.  So its removal would appear to only be gated by
bcache?

But I could be mistaken, moving forward please let me know what you
feel needs doing in DM to make it a better citizen.

Thanks,
Mike

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [dm-devel] [PATCH 00/13] block: assorted cleanup for bio splitting and cloning.
  2017-11-20 16:43     ` Mike Snitzer
  2017-11-21  0:34         ` NeilBrown
@ 2017-11-21  0:34         ` NeilBrown
  0 siblings, 0 replies; 53+ messages in thread
From: NeilBrown @ 2017-11-21  0:34 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Jens Axboe, linux-block, device-mapper development, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 3892 bytes --]

On Mon, Nov 20 2017, Mike Snitzer wrote:

> On Sun, Jun 18, 2017 at 5:36 PM, NeilBrown <neilb@suse.com> wrote:
>> On Sun, Jun 18 2017, Jens Axboe wrote:
>>
>>> On Sun, Jun 18 2017, NeilBrown wrote:
>>>> This is a resend of my series of patches working
>>>> towards removing the bioset work queues.
>>>>
>>>> This set is based on for-4.13/block.
>>>>
>>>> It incorporates the revised versions of all the patches that were
>>>> resent following feedback on the last set.
>>>>
>>>> It also includes a minor grammatic improvement to a comment, and
>>>> simple changes to compensate for a couple of changes to the block tree
>>>> since the last posting.
>>>>
>>>> I hope to eventually get rid of the new BIOSET_NEED_RESCUER flag,
>>>> but that needs work in dm and probably bcache first.
>>>
>>> Thanks Neil, applied.
>>
>> Thanks a lot Jens.
>
> I missed this line of work until now.  Not quite sure why I wasn't
> cc'd or my review/ack required for the DM changes in this patchset.

Hi Mike,
 I'm sorry you weren't included on those.  My thinking at the time was
 probably that they were purely cosmetic changes which made no
 functional difference to dm.  That is no excuse though and I do
 apologize.

>
> But I've now queued this patch for once Linus gets back (reverts DM
> changes from commit 47e0fb461f):
> https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=for-next&id=c9fdc42ba23eabd1ba7aef199fb9bb4b4fe5c545

This patch does two things.
1/ It removes the BIOSET_NEED_RESCUER flag from biosets created by dm.
  This is a functional change relative to the code from before my patches.
  Previously, all biosets were given a rescuer thread.
  After my patch set, biosets only got a rescuer thread if
  BIOSET_NEED_RESCUER was passed, and it was passed for all biosets.
  I then removed it from places where I was certain it wasn't needed.
  I didn't remove it from dm because I wasn't certain.  Your
  patch does remove the flags, which I think is incorrect - see below.

2/ It changes flush_current_bio_list() so that bios allocated from a
   bioset that does not have a rescue_workqueue are now added to
   the ->rescue_list for their bio_set, and ->rescue_work is queued
   on the NULL ->rescue_workqueue, resulting in a NULL dereference.
   I suspect you don't want this.
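
To make 2/ concrete, the punt path ends up doing roughly this (a
sketch of the pattern only - the helper name and locking details are
illustrative, not the exact dm.c code):

	static void punt_bio_to_rescuer(struct bio_set *bs, struct bio *bio)
	{
		/* park the bio on the bioset's rescue list ... */
		spin_lock(&bs->rescue_lock);
		bio_list_add(&bs->rescue_list, bio);
		spin_unlock(&bs->rescue_lock);

		/* ... and kick the rescue work.  For a bioset created
		 * without BIOSET_NEED_RESCUER, bs->rescue_workqueue is
		 * NULL, so this queue_work() dereferences a NULL pointer.
		 */
		queue_work(bs->rescue_workqueue, &bs->rescue_work);
	}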

The patch description claims that the patch fixes something, but it
isn't clear to me what it is meant to be fixing.

It makes reference to dbba42d8, which is described as removing an unused
bioset process, though what it actually does is remove a used bioset
(and obviously the process disappears with it).  My patch doesn't change
that behavior.

>
> As is, it is my understanding that DM has no need for a bio_set's
> rescue_workqueue.  So its removal would appear to only be gated by
> bcache?
>
> But I could be mistaken, moving forward please let me know what you
> feel needs doing in DM to make it a better citizen.

I think you are mistaken.
Please see
   https://www.redhat.com/archives/dm-devel/2017-August/msg00310.html
and
   https://www.redhat.com/archives/dm-devel/2017-August/msg00315.html
for which the thread continues:
   https://www.redhat.com/archives/dm-devel/2017-September/msg00001.html

These were sent to you, though most of the conversation happened with
Mikulas.

I think that the patches in those threads explain why dm currently needs
rescuer threads, and show how dm can be changed to no longer need the
rescuer.  I would appreciate your thoughts on these patches.  I can
resend them if that would help.

That would then just leave bcache....  I find it a bit of a challenge to
reason about the code in bcache, but if we can remove
BIOSET_NEED_RESCUER from dm, that will be an extra incentive for me to learn :-)

Thanks,
NeilBrown


>
> Thanks,
> Mike
>
> --
> dm-devel mailing list
> dm-devel@redhat.com
> https://www.redhat.com/mailman/listinfo/dm-devel

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 00/13] block: assorted cleanup for bio splitting and cloning.
  2017-11-21  0:34         ` NeilBrown
@ 2017-11-21  1:35           ` Mike Snitzer
  -1 siblings, 0 replies; 53+ messages in thread
From: Mike Snitzer @ 2017-11-21  1:35 UTC (permalink / raw)
  To: NeilBrown
  Cc: Jens Axboe, linux-block, device-mapper development, linux-kernel,
	Zdenek Kabelac

On Mon, Nov 20 2017 at  7:34pm -0500,
NeilBrown <neilb@suse.com> wrote:

> On Mon, Nov 20 2017, Mike Snitzer wrote:
> 
> > On Sun, Jun 18, 2017 at 5:36 PM, NeilBrown <neilb@suse.com> wrote:
> >> On Sun, Jun 18 2017, Jens Axboe wrote:
> >>
> >>> On Sun, Jun 18 2017, NeilBrown wrote:
> >>>> This is a resend of my series of patches working
> >>>> towards removing the bioset work queues.
> >>>>
> >>>> This set is based on for-4.13/block.
> >>>>
> >>>> It incorporates the revised versions of all the patches that were
> >>>> resent following feedback on the last set.
> >>>>
> >>>> It also includes a minor grammatic improvement to a comment, and
> >>>> simple changes to compensate for a couple of changes to the block tree
> >>>> since the last posting.
> >>>>
> >>>> I hope to eventually get rid of the new BIOSET_NEED_RESCUER flag,
> >>>> but that needs work in dm and probably bcache first.
> >>>
> >>> Thanks Neil, applied.
> >>
> >> Thanks a lot Jens.
> >
> > I missed this line of work until now.  Not quite sure why I wasn't
> > cc'd or my review/ack required for the DM changes in this patchset.
> 
> Hi Mike,
>  I'm sorry you weren't included on those.  My thinking at the time was
>  probably that they were purely cosmetic changes which made no
>  functional difference to dm.  That is no excuse though and I do
>  apologize.
> 
> >
> > But I've now queued this patch for once Linus gets back (reverts DM
> > changes from commit 47e0fb461f):
> > https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=for-next&id=c9fdc42ba23eabd1ba7aef199fb9bb4b4fe5c545
> 
> This patch does two things.
> 1/ It removes the BIOSET_NEED_RESCUER flag from biosets created by dm.
>   This a functional changed over the code from before my patches.
>   Previously, all biosets were given a rescuer thread.
>   After my patch set, biosets only got a rescuer thread if
>   BIOSET_NEED_RESCUER was passed, and it was passed for all biosets.
>   I then removed it from places were I was certain it wasn't needed.
>   I didn't remove it from dm because I wasn't certain.  Your
>   patch does remove the flags, which I think is incorrect - see below.
> 
> 2/ It changes flush_current_bio_list() so that bios allocated from a
>    bioset that does not have a rescue_workqueue are now added to
>    the ->rescue_list for their bio_set, and ->rescue_work is queued
>    on the NULL ->rescue_workqueue, resulting in a NULL dereference.
>    I suspect you don't want this.
> 
> The patch description claims that the patch fixes something, but it
> isn't clear to me what it is meant to be fixing.
> 
> It makes reference to  dbba42d8 which is described as removing an unused
> bioset process, though what it actually does is remove an used bioset
> (and obvious the process disappears with it).  My patch doesn't change
> that behavior.

Well I looked at this because Zdenek reported that with more recent
kernels he is seeing the "bioset" per DM device again (whereas it was
thought to be removed with Mikulas' commit dbba42d8 -- but that commit
removed "bioset" only in terms of q->bio_split).

Looks like the q->bio_split bioset is created with BIOSET_NEED_RESCUER
but DM calls bioset_free() for q->bio_split.  Strikes me as odd that
"bioset" was removed from DM devices until it recently reappeared.
Especially if what you say is accurate (that BIOSET_NEED_RESCUER was
implied with the old code... I believe you!)

I tried to quickly answer how "bioset" is now re-appearing for DM
devices but I obviously need to put more time into it.

(And my goodness does this bioset rescue workqueue need a better name
than "bioset"! Should include the bdevname too)

> > As is, it is my understanding that DM has no need for a bio_set's
> > rescue_workqueue.  So its removal would appear to only be gated by
> > bcache?
> >
> > But I could be mistaken, moving forward please let me know what you
> > feel needs doing in DM to make it a better citizen.
> 
> I think you are mistaken.
> Please see
>    https://www.redhat.com/archives/dm-devel/2017-August/msg00310.html
> and
>    https://www.redhat.com/archives/dm-devel/2017-August/msg00315.html
> for which the thread continues:
>    https://www.redhat.com/archives/dm-devel/2017-September/msg00001.html
> 
> These were sent to you, though most of the conversation happened with
> Mikulas.
> 
> I think that the patches in those threads explain why dm currently needs
> rescuer threads, and shows how dm can be changed to no longer need the
> rescuer.  I would appreciate your thoughts on these patches.  I can
> resend them if that would help.

No need to resend.  I'll work through the old threads.

> That would then just leave bcache....  I find it a bit of a challenge to
> reason about the code in bcache, but if we can remove
> BIOSET_NEED_RESCUER from dm, that will be an extra incentive for me to learn :-)

I'm all for properly removing BIOSET_NEED_RESCUER from DM.

I'll look closer at all of this in the morning (for now I'm backing the
patch I referenced out from linux-next)

Mike

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 00/13] block: assorted cleanup for bio splitting and cloning.
  2017-11-21  1:35           ` Mike Snitzer
@ 2017-11-21 12:10             ` Mike Snitzer
  -1 siblings, 0 replies; 53+ messages in thread
From: Mike Snitzer @ 2017-11-21 12:10 UTC (permalink / raw)
  To: NeilBrown
  Cc: Jens Axboe, linux-block, device-mapper development, linux-kernel,
	Zdenek Kabelac

On Mon, Nov 20 2017 at  8:35pm -0500,
Mike Snitzer <snitzer@redhat.com> wrote:

> On Mon, Nov 20 2017 at  7:34pm -0500,
> NeilBrown <neilb@suse.com> wrote:
> 
> > On Mon, Nov 20 2017, Mike Snitzer wrote:
> > 
> > >
> > > But I've now queued this patch for once Linus gets back (reverts DM
> > > changes from commit 47e0fb461f):
> > > https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=for-next&id=c9fdc42ba23eabd1ba7aef199fb9bb4b4fe5c545
> > 
> > This patch does two things.
> > 1/ It removes the BIOSET_NEED_RESCUER flag from biosets created by dm.
> >   This a functional changed over the code from before my patches.
> >   Previously, all biosets were given a rescuer thread.
> >   After my patch set, biosets only got a rescuer thread if
> >   BIOSET_NEED_RESCUER was passed, and it was passed for all biosets.
> >   I then removed it from places were I was certain it wasn't needed.
> >   I didn't remove it from dm because I wasn't certain.  Your
> >   patch does remove the flags, which I think is incorrect - see below.

Yeap, definitely was incorrect.  I've dropped the patch.

> > 2/ It changes flush_current_bio_list() so that bios allocated from a
> >    bioset that does not have a rescue_workqueue are now added to
> >    the ->rescue_list for their bio_set, and ->rescue_work is queued
> >    on the NULL ->rescue_workqueue, resulting in a NULL dereference.
> >    I suspect you don't want this.

Yes, I see that now.

> > The patch description claims that the patch fixes something, but it
> > isn't clear to me what it is meant to be fixing.
> > 
> > It makes reference to  dbba42d8 which is described as removing an unused
> > bioset process, though what it actually does is remove an used bioset
> > (and obvious the process disappears with it).  My patch doesn't change
> > that behavior.
> 
> Well I looked at this because Zdenek reported that with more recent
> kernels he is seeing the "bioset" per DM device again (whereas it was
> thought to be removed with mikulas' commit dbba42d8 -- but that commit
> removed "bioset" only in terms of q->bio_split.

I think Zdenek triggered a false-positive that DM had magically sprouted
a new "bioset" rescue_workqueue.  Reality is I cannot see how each
bio-based DM device can avoid having one.  And the commit d67a5f4b59
("dm: flush queued bios when process blocks to avoid deadlock") I
referenced earlier very much makes DM depend on it even more.

So apologies for being so off-base (by looking to prematurely revert
DM's use of BIOSET_NEED_RESCUER, etc).

> > Please see
> >    https://www.redhat.com/archives/dm-devel/2017-August/msg00310.html

I'll very likely pick these up for 4.16 shortly.  But I hope to work
through complete removal of DM's use of BIOSET_NEED_RESCUER for 4.16 as
well.

> > and
> >    https://www.redhat.com/archives/dm-devel/2017-August/msg00315.html

This one [1] needs a lot of review and testing.  Particularly against this
test case that Mikulas created to reproduce the snapshot deadlock (same
deadlock that motivated commit dbba42d8):
https://www.redhat.com/archives/dm-devel/2017-January/msg00064.html

> > for which the thread continues:
> >    https://www.redhat.com/archives/dm-devel/2017-September/msg00001.html

Wish I could clone myself (or Kent, the world needs 2 Kents!) and pursue
this: https://www.redhat.com/archives/dm-devel/2014-May/msg00100.html

Short of that, how would you like to proceed?

> > That would then just leave bcache....  I find it a bit of a challenge to
> > reason about the code in bcache, but if we can remove
> > BIOSET_NEED_RESCUER from dm, that will be an extra incentive for me to learn :-)
> 
> I'm all for properly removing BIOSET_NEED_RESCUER from DM.

Should we work to make sure [1] (above) fixes Mikulas' test case?

I'll get started on reviewing and playing with [1] now.

Thanks,
Mike

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 00/13] block: assorted cleanup for bio splitting and cloning.
  2017-11-21 12:10             ` Mike Snitzer
@ 2017-11-21 12:43               ` Mike Snitzer
  -1 siblings, 0 replies; 53+ messages in thread
From: Mike Snitzer @ 2017-11-21 12:43 UTC (permalink / raw)
  To: NeilBrown
  Cc: Jens Axboe, linux-block, device-mapper development, linux-kernel,
	Zdenek Kabelac

On Tue, Nov 21 2017 at  7:10am -0500,
Mike Snitzer <snitzer@redhat.com> wrote:

> On Mon, Nov 20 2017 at  8:35pm -0500,
> Mike Snitzer <snitzer@redhat.com> wrote:
> 
> > On Mon, Nov 20 2017 at  7:34pm -0500,
> > NeilBrown <neilb@suse.com> wrote:
> > 
> > > Please see
> > >    https://www.redhat.com/archives/dm-devel/2017-August/msg00310.html
> 
> I'll very likely pick these up for 4.16 shortly.  But hope to work
> through complete removal of DM's use of BIOSET_NEED_RESCUER for 4.16 as
> well.
> 
> > > and
> > >    https://www.redhat.com/archives/dm-devel/2017-August/msg00315.html
> 
> This one [1] needs a lot of review and testing.  Particularly against this
> test case that Mikulas created to reproduce the snapshot deadlock (same
> deadlock that motivated commit dbba42d8):
> https://www.redhat.com/archives/dm-devel/2017-January/msg00064.html
...
> Short of that, how would you like to proceed?
> 
> > > That would then just leave bcache....  I find it a bit of a challenge to
> > > reason about the code in bcache, but if we can remove
> > > BIOSET_NEED_RESCUER from dm, that will be an extra incentive for me to learn :-)
> > 
> > I'm all for properly removing BIOSET_NEED_RESCUER from DM.
> 
> Should we work to make [1] (above) sure it fixes Mikulas' test case?
> 
> I'll set in on reviewing and playing with [1] now.

Decided it is a better use of my time to review and then hopefully use the
block core's bio splitting infrastructure in DM.  Been meaning to do
that for quite a while anyway.  This mail from you just made it all the
more clear that it needs doing:
https://www.redhat.com/archives/dm-devel/2017-September/msg00098.html

So I will start here on this patch you proposed:
https://www.redhat.com/archives/dm-devel/2017-September/msg00091.html
(of note, this patch slipped through the cracks because I was recovering
from injury when it originally came through).

Once DM is using q->bio_split I'll come back to this patch (aka
"[1]") as a starting point for the follow-on work to remove DM's use of
BIOSET_NEED_RESCUER:
https://www.redhat.com/archives/dm-devel/2017-August/msg00315.html

Mike

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [dm-devel] [PATCH 00/13] block: assorted cleanup for bio splitting and cloning.
  2017-11-21 12:10             ` Mike Snitzer
  (?)
@ 2017-11-21 19:44               ` NeilBrown
  -1 siblings, 0 replies; 53+ messages in thread
From: NeilBrown @ 2017-11-21 19:44 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Jens Axboe, linux-block, device-mapper development, linux-kernel,
	Zdenek Kabelac

[-- Attachment #1: Type: text/plain, Size: 4906 bytes --]

On Tue, Nov 21 2017, Mike Snitzer wrote:

> On Mon, Nov 20 2017 at  8:35pm -0500,
> Mike Snitzer <snitzer@redhat.com> wrote:
>
>> On Mon, Nov 20 2017 at  7:34pm -0500,
>> NeilBrown <neilb@suse.com> wrote:
>> 
>> > On Mon, Nov 20 2017, Mike Snitzer wrote:
>> > 
>> > >
>> > > But I've now queued this patch for once Linus gets back (reverts DM
>> > > changes from commit 47e0fb461f):
>> > > https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=for-next&id=c9fdc42ba23eabd1ba7aef199fb9bb4b4fe5c545
>> > 
>> > This patch does two things.
>> > 1/ It removes the BIOSET_NEED_RESCUER flag from biosets created by dm.
>> >   This a functional changed over the code from before my patches.
>> >   Previously, all biosets were given a rescuer thread.
>> >   After my patch set, biosets only got a rescuer thread if
>> >   BIOSET_NEED_RESCUER was passed, and it was passed for all biosets.
>> >   I then removed it from places were I was certain it wasn't needed.
>> >   I didn't remove it from dm because I wasn't certain.  Your
>> >   patch does remove the flags, which I think is incorrect - see below.
>
> Yeap, definitely was incorrect.  I've dropped the patch.
>
>> > 2/ It changes flush_current_bio_list() so that bios allocated from a
>> >    bioset that does not have a rescue_workqueue are now added to
>> >    the ->rescue_list for their bio_set, and ->rescue_work is queued
>> >    on the NULL ->rescue_workqueue, resulting in a NULL dereference.
>> >    I suspect you don't want this.
>
> Yes, I see that now.
>
>> > The patch description claims that the patch fixes something, but it
>> > isn't clear to me what it is meant to be fixing.
>> > 
>> > It makes reference to  dbba42d8 which is described as removing an unused
>> > bioset process, though what it actually does is remove a used bioset
>> > (and obviously the process disappears with it).  My patch doesn't change
>> > that behavior.
>> 
>> Well I looked at this because Zdenek reported that with more recent
>> kernels he is seeing the "bioset" per DM device again (whereas it was
>> thought to be removed with Mikulas' commit dbba42d8 -- but that commit
>> removed "bioset" only in terms of q->bio_split).
>
> I think Zdenek triggered a false-positive that DM had magically sprouted
> a new "bioset" rescue_workqueue.  Reality is I cannot see how each
> bio-based DM device can avoid having one.  And the commit d67a5f4b59
> ("dm: flush queued bios when process blocks to avoid deadlock") I
> referenced earlier very much makes DM depend on it even more.
>
> So apologies for being so off-base (by looking to prematurely revert
> DM's use of BIOSET_NEED_RESCUER, etc).
>
>> > Please see
>> >    https://www.redhat.com/archives/dm-devel/2017-August/msg00310.html
>
> I'll very likely pick these up for 4.16 shortly.  But hope to work
> through complete removal of DM's use of BIOSET_NEED_RESCUER for 4.16 as
> well.
>
>> > and
>> >    https://www.redhat.com/archives/dm-devel/2017-August/msg00315.html
>
> This one [1] needs a lot of review and testing.  Particularly against this
> test case that Mikulas created to reproduce the snapshot deadlock (same
> deadlock that motivated commit dbba42d8):
> https://www.redhat.com/archives/dm-devel/2017-January/msg00064.html

Thanks for that link.  I'll try to make time to experiment with the test
code and confirm my proposed approach doesn't break it.

>
>> > for which the thread continues:
>> >    https://www.redhat.com/archives/dm-devel/2017-September/msg00001.html
>
> Wish I could clone myself (or Kent, the world needs 2 Kents!) and pursue
> this: https://www.redhat.com/archives/dm-devel/2014-May/msg00100.html

In that email Kent mentions "punt off to a per request_queue workqueue".

That "per request_queue workqueue" is what I'm trying to get rid of.  I
don't think this is a good direction.

>
> Short of that, how would you like to proceed?

I'd like to confirm that my approach
1/ doesn't re-introduce a deadlock
2/ doesn't hurt performance
and then merge it.

Though to be honest, I don't recall exactly what "my approach" is.
Your next email picks out two important patches which probably cover
it.  If/when I get to do the testing I'll let you know how it goes.

Thanks,
NeilBrown


>
>> > That would then just leave bcache....  I find it a bit of a challenge to
>> > reason about the code in bcache, but if we can remove
>> > BIOSET_NEED_RESCUER from dm, that will be an extra incentive for me to learn :-)
>> 
>> I'm all for properly removing BIOSET_NEED_RESCUER from DM.
>
> Should we work to make sure [1] (above) fixes Mikulas' test case?
>
> I'll set to work on reviewing and playing with [1] now.
>
> Thanks,
> Mike
>
> --
> dm-devel mailing list
> dm-devel@redhat.com
> https://www.redhat.com/mailman/listinfo/dm-devel

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 53+ messages in thread

* new patchset to eliminate DM's use of BIOSET_NEED_RESCUER [was: Re: [PATCH 00/13] block: assorted cleanup for bio splitting and cloning.]
  2017-11-21 12:43               ` Mike Snitzer
  (?)
@ 2017-11-21 19:47               ` Mike Snitzer
  2017-11-21 21:23                 ` [dm-devel] " Mikulas Patocka
  -1 siblings, 1 reply; 53+ messages in thread
From: Mike Snitzer @ 2017-11-21 19:47 UTC (permalink / raw)
  To: NeilBrown
  Cc: Jens Axboe, linux-block, device-mapper development, linux-kernel,
	Zdenek Kabelac, Mikulas Patocka

On Tue, Nov 21 2017 at  7:43am -0500,
Mike Snitzer <snitzer@redhat.com> wrote:
 
> Decided it a better use of my time to review and then hopefully use the
> block-core's bio splitting infrastructure in DM.  Been meaning to do
> that for quite a while anyway.  This mail from you just made it all the
> more clear that needs doing:
> https://www.redhat.com/archives/dm-devel/2017-September/msg00098.html
> 
> So I will start here on this patch you proposed:
> https://www.redhat.com/archives/dm-devel/2017-September/msg00091.html
> (of note, this patch slipped through the cracks because I was recovering
> from injury when it originally came through).
> 
> Once DM is using q->bio_split I'll come back to this patch (aka
> "[1]") as a starting point for the follow-on work to remove DM's use of
> BIOSET_NEED_RESCUER:
> https://www.redhat.com/archives/dm-devel/2017-August/msg00315.html

Hey Neil,

Good news!  All your code works ;)

(well after 1 fixup due to a cut-n-paste bug.. the code you added to
dm_wq_work() to process the md->rescued bio_list was operating on
the md->deferred bio_list due to cut-n-paste from code you copied from
just below it)

I split your code out some to make it more reviewable.  I also tweaked
headers accordingly.

Please see this branch (which _will_ get rebased between now and the
4.16 merge window):
https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-4.16

I successfully tested these changes using Mikulas' test program that
reproduces the snapshot deadlock:
https://www.redhat.com/archives/dm-devel/2017-January/msg00064.html

I'll throw various other DM testsuites at it to verify they all look
good (e.g. thinp, cache, multipath).

I'm open to all suggestions about changes you'd like to see (either to
these patches or anything you'd like to layer ontop of them).

Thanks for all your work, much appreciated!
Mike

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 00/13] block: assorted cleanup for bio splitting and cloning.
  2017-11-21 19:44               ` NeilBrown
@ 2017-11-21 19:50                 ` Mike Snitzer
  -1 siblings, 0 replies; 53+ messages in thread
From: Mike Snitzer @ 2017-11-21 19:50 UTC (permalink / raw)
  To: NeilBrown
  Cc: Jens Axboe, linux-block, device-mapper development, linux-kernel,
	Zdenek Kabelac

On Tue, Nov 21 2017 at  2:44pm -0500,
NeilBrown <neilb@suse.com> wrote:

> On Tue, Nov 21 2017, Mike Snitzer wrote:
> 
> > On Mon, Nov 20 2017 at  8:35pm -0500,
> > Mike Snitzer <snitzer@redhat.com> wrote:
> >
> >> On Mon, Nov 20 2017 at  7:34pm -0500,
> >> NeilBrown <neilb@suse.com> wrote:
> >> 
> >> > On Mon, Nov 20 2017, Mike Snitzer wrote:
> >> > 
> >> > >
> >> > > But I've now queued this patch for once Linus gets back (reverts DM
> >> > > changes from commit 47e0fb461f):
> >> > > https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=for-next&id=c9fdc42ba23eabd1ba7aef199fb9bb4b4fe5c545
> >> > 
> >> > This patch does two things.
> >> > 1/ It removes the BIOSET_NEED_RESCUER flag from biosets created by dm.
> >> >   This is a functional change over the code from before my patches.
> >> >   Previously, all biosets were given a rescuer thread.
> >> >   After my patch set, biosets only got a rescuer thread if
> >> >   BIOSET_NEED_RESCUER was passed, and it was passed for all biosets.
> >> >   I then removed it from places where I was certain it wasn't needed.
> >> >   I didn't remove it from dm because I wasn't certain.  Your
> >> >   patch does remove the flags, which I think is incorrect - see below.
> >
> > Yeap, definitely was incorrect.  I've dropped the patch.
> >
> >> > 2/ It changes flush_current_bio_list() so that bios allocated from a
> >> >    bioset that does not have a rescue_workqueue are now added to
> >> >    the ->rescue_list for their bio_set, and ->rescue_work is queued
> >> >    on the NULL ->rescue_workqueue, resulting in a NULL dereference.
> >> >    I suspect you don't want this.
> >
> > Yes, I see that now.
> >
> >> > The patch description claims that the patch fixes something, but it
> >> > isn't clear to me what it is meant to be fixing.
> >> > 
> >> > It makes reference to  dbba42d8 which is described as removing an unused
> >> > bioset process, though what it actually does is remove a used bioset
> >> > (and obviously the process disappears with it).  My patch doesn't change
> >> > that behavior.
> >> 
> >> Well I looked at this because Zdenek reported that with more recent
> >> kernels he is seeing the "bioset" per DM device again (whereas it was
> >> thought to be removed with Mikulas' commit dbba42d8 -- but that commit
> >> removed "bioset" only in terms of q->bio_split).
> >
> > I think Zdenek triggered a false-positive that DM had magically sprouted
> > a new "bioset" rescue_workqueue.  Reality is I cannot see how each
> > bio-based DM device can avoid having one.  And the commit d67a5f4b59
> > ("dm: flush queued bios when process blocks to avoid deadlock") I
> > referenced earlier very much makes DM depend on it even more.
> >
> > So apologies for being so off-base (by looking to prematurely revert
> > DM's use of BIOSET_NEED_RESCUER, etc).
> >
> >> > Please see
> >> >    https://www.redhat.com/archives/dm-devel/2017-August/msg00310.html
> >
> > I'll very likely pick these up for 4.16 shortly.  But hope to work
> > through complete removal of DM's use of BIOSET_NEED_RESCUER for 4.16 as
> > well.
> >
> >> > and
> >> >    https://www.redhat.com/archives/dm-devel/2017-August/msg00315.html
> >
> > This one [1] needs a lot of review and testing.  Particularly against this
> > test case that Mikulas created to reproduce the snapshot deadlock (same
> > deadlock that motivated commit dbba42d8):
> > https://www.redhat.com/archives/dm-devel/2017-January/msg00064.html
> 
> Thanks for that link.  I'll try to make time to experiment with the test
> code and confirm my proposed approach doesn't break it.
> 
> >
> >> > for which the thread continues:
> >> >    https://www.redhat.com/archives/dm-devel/2017-September/msg00001.html
> >
> > Wish I could clone myself (or Kent, the world needs 2 Kents!) and pursue
> > this: https://www.redhat.com/archives/dm-devel/2014-May/msg00100.html
> 
> In that email Kent mentions "punt off to a per request_queue workqueue".
> 
> That "per request_queue workqueue" is what I'm trying to get rid of.  I
> don't think this is a good direction.
> 
> >
> > Short of that, how would you like to proceed?
> 
> I'd like to confirm that my approach
> 1/ doesn't re-introduce a deadlock
> 2/ doesn't hurt performance
> and then merge it.
> 
> Though to be honest, I don't recall exactly what "my approach" is.
> Your next email picks out two important patches which probably cover
> it.  If/when I get to do the testing I'll let you know how it goes.

I _think_ I've done the heavy lifting of what you likely had in mind
( please see: https://lkml.org/lkml/2017/11/21/567 )

Now what is left is another once-over from you to verify you're happy
with the code and patch headers, etc.

Mike

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [dm-devel] new patchset to eliminate DM's use of BIOSET_NEED_RESCUER [was: Re: [PATCH 00/13] block: assorted cleanup for bio splitting and cloning.]
  2017-11-21 19:47               ` new patchset to eliminate DM's use of BIOSET_NEED_RESCUER [was: Re: [PATCH 00/13] block: assorted cleanup for bio splitting and cloning.] Mike Snitzer
@ 2017-11-21 21:23                 ` Mikulas Patocka
  2017-11-21 22:51                   ` new patchset to eliminate DM's use of BIOSET_NEED_RESCUER Mike Snitzer
  2017-11-21 23:03                     ` NeilBrown
  0 siblings, 2 replies; 53+ messages in thread
From: Mikulas Patocka @ 2017-11-21 21:23 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: NeilBrown, Jens Axboe, linux-kernel, linux-block,
	device-mapper development, Zdenek Kabelac



On Tue, 21 Nov 2017, Mike Snitzer wrote:

> On Tue, Nov 21 2017 at  7:43am -0500,
> Mike Snitzer <snitzer@redhat.com> wrote:
>  
> > Decided it a better use of my time to review and then hopefully use the
> > block-core's bio splitting infrastructure in DM.  Been meaning to do
> > that for quite a while anyway.  This mail from you just made it all the
> > more clear that needs doing:
> > https://www.redhat.com/archives/dm-devel/2017-September/msg00098.html
> > 
> > So I will start here on this patch you proposed:
> > https://www.redhat.com/archives/dm-devel/2017-September/msg00091.html
> > (of note, this patch slipped through the cracks because I was recovering
> > from injury when it originally came through).
> > 
> > Once DM is using q->bio_split I'll come back to this patch (aka
> > "[1]") as a starting point for the follow-on work to remove DM's use of
> > BIOSET_NEED_RESCUER:
> > https://www.redhat.com/archives/dm-devel/2017-August/msg00315.html
> 
> Hey Neil,
> 
> Good news!  All your code works ;)
> 
> (well after 1 fixup due to a cut-n-paste bug.. the code you added to
> dm_wq_work() to process the md->rescued bio_list was operating on
> the md->deferred bio_list due to cut-n-paste from code you copied from
> just below it)
> 
> I split your code out some to make it more reviewable.  I also tweaked
> headers accordingly.
> 
> Please see this branch (which _will_ get rebased between now and the
> 4.16 merge window):
> https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-4.16
> 
> I successfully tested these changes using Mikulas' test program that
> reproduces the snapshot deadlock:
> https://www.redhat.com/archives/dm-devel/2017-January/msg00064.html
> 
> I'll throw various other DM testsuites at it to verify they all look
> good (e.g. thinp, cache, multipath).
> 
> I'm open to all suggestions about changes you'd like to see (either to
> these patches or anything you'd like to layer ontop of them).
> 
> Thanks for all your work, much appreciated!
> Mike

This is not correct:

   2206 static void dm_wq_work(struct work_struct *work)
   2207 {
   2208         struct mapped_device *md = container_of(work, struct mapped_device, work);
   2209         struct bio *bio;
   2210         int srcu_idx;
   2211         struct dm_table *map;
   2212
   2213         if (!bio_list_empty(&md->rescued)) {
   2214                 struct bio_list list;
   2215                 spin_lock_irq(&md->deferred_lock);
   2216                 list = md->rescued;
   2217                 bio_list_init(&md->rescued);
   2218                 spin_unlock_irq(&md->deferred_lock);
   2219                 while ((bio = bio_list_pop(&list)))
   2220                         generic_make_request(bio);
   2221         }
   2222
   2223         map = dm_get_live_table(md, &srcu_idx);
   2224
   2225         while (!test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags)) {
   2226                 spin_lock_irq(&md->deferred_lock);
   2227                 bio = bio_list_pop(&md->deferred);
   2228                 spin_unlock_irq(&md->deferred_lock);
   2229
   2230                 if (!bio)
   2231                         break;
   2232
   2233                 if (dm_request_based(md))
   2234                         generic_make_request(bio);
   2235                 else
   2236                         __split_and_process_bio(md, map, bio);
   2237         }
   2238
   2239         dm_put_live_table(md, srcu_idx);
   2240 }

You can see that if we are in dm_wq_work in __split_and_process_bio, we 
will not process md->rescued list.

The processing of md->rescued is also wrong - bios for different devices 
must be offloaded to different helper threads, so that processing a bio 
for a lower device doesn't depend on processing a bio for a higher device. 
If you offload all the bios on current->bio_list to the same thread, the 
bios still depend on each other and the deadlock will still happen.
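
(For reference, this per-device separation is roughly what the existing
per-bioset rescuer provides -- paraphrased from block/bio.c's
punt_bios_to_rescuer(), not quoted verbatim: only bios allocated from
this particular bio_set are punted, and only to that bio_set's own
workqueue:)

/* paraphrase of punt_bios_to_rescuer(struct bio_set *bs) */
struct bio_list punt, nopunt;
struct bio *bio;

bio_list_init(&punt);
bio_list_init(&nopunt);

/* keep bios from other biosets on current->bio_list[0] ... */
while ((bio = bio_list_pop(&current->bio_list[0])))
        bio_list_add(bio->bi_pool == bs ? &punt : &nopunt, bio);
current->bio_list[0] = nopunt;

/* ... and hand bios from this bioset to its own rescue worker */
spin_lock(&bs->rescue_lock);
bio_list_merge(&bs->rescue_list, &punt);
spin_unlock(&bs->rescue_lock);

queue_work(bs->rescue_workqueue, &bs->rescue_work);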

Mikulas

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: new patchset to eliminate DM's use of BIOSET_NEED_RESCUER
  2017-11-21 21:23                 ` [dm-devel] " Mikulas Patocka
@ 2017-11-21 22:51                   ` Mike Snitzer
  2017-11-22  1:21                     ` Mikulas Patocka
  2017-11-21 23:03                     ` NeilBrown
  1 sibling, 1 reply; 53+ messages in thread
From: Mike Snitzer @ 2017-11-21 22:51 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: NeilBrown, Jens Axboe, linux-kernel, linux-block,
	device-mapper development, Zdenek Kabelac

On Tue, Nov 21 2017 at  4:23pm -0500,
Mikulas Patocka <mpatocka@redhat.com> wrote:

> 
> 
> On Tue, 21 Nov 2017, Mike Snitzer wrote:
> 
> > On Tue, Nov 21 2017 at  7:43am -0500,
> > Mike Snitzer <snitzer@redhat.com> wrote:
> >  
> > > Decided it a better use of my time to review and then hopefully use the
> > > block-core's bio splitting infrastructure in DM.  Been meaning to do
> > > that for quite a while anyway.  This mail from you just made it all the
> > > more clear that needs doing:
> > > https://www.redhat.com/archives/dm-devel/2017-September/msg00098.html
> > > 
> > > So I will start here on this patch you proposed:
> > > https://www.redhat.com/archives/dm-devel/2017-September/msg00091.html
> > > (of note, this patch slipped through the cracks because I was recovering
> > > from injury when it originally came through).
> > > 
> > > Once DM is using q->bio_split I'll come back to this patch (aka
> > > "[1]") as a starting point for the follow-on work to remove DM's use of
> > > BIOSET_NEED_RESCUER:
> > > https://www.redhat.com/archives/dm-devel/2017-August/msg00315.html
> > 
> > Hey Neil,
> > 
> > Good news!  All your code works ;)
> > 
> > (well after 1 fixup due to a cut-n-paste bug.. the code you added to
> > dm_wq_work() to process the md->rescued bio_list was operating on
> > the md->deferred bio_list due to cut-n-paste from code you copied from
> > just below it)
> > 
> > I split your code out some to make it more reviewable.  I also tweaked
> > headers accordingly.
> > 
> > Please see this branch (which _will_ get rebased between now and the
> > 4.16 merge window):
> > https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-4.16
> > 
> > I successfully tested these changes using Mikulas' test program that
> > reproduces the snapshot deadlock:
> > https://www.redhat.com/archives/dm-devel/2017-January/msg00064.html
> > 
> > I'll throw various other DM testsuites at it to verify they all look
> > good (e.g. thinp, cache, multipath).
> > 
> > I'm open to all suggestions about changes you'd like to see (either to
> > these patches or anything you'd like to layer ontop of them).
> > 
> > Thanks for all your work, much appreciated!
> > Mike
> 
> This is not correct:
> 
>    2206 static void dm_wq_work(struct work_struct *work)
>    2207 {
>    2208         struct mapped_device *md = container_of(work, struct mapped_device, work);
>    2209         struct bio *bio;
>    2210         int srcu_idx;
>    2211         struct dm_table *map;
>    2212
>    2213         if (!bio_list_empty(&md->rescued)) {
>    2214                 struct bio_list list;
>    2215                 spin_lock_irq(&md->deferred_lock);
>    2216                 list = md->rescued;
>    2217                 bio_list_init(&md->rescued);
>    2218                 spin_unlock_irq(&md->deferred_lock);
>    2219                 while ((bio = bio_list_pop(&list)))
>    2220                         generic_make_request(bio);
>    2221         }
>    2222
>    2223         map = dm_get_live_table(md, &srcu_idx);
>    2224
>    2225         while (!test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags)) {
>    2226                 spin_lock_irq(&md->deferred_lock);
>    2227                 bio = bio_list_pop(&md->deferred);
>    2228                 spin_unlock_irq(&md->deferred_lock);
>    2229
>    2230                 if (!bio)
>    2231                         break;
>    2232
>    2233                 if (dm_request_based(md))
>    2234                         generic_make_request(bio);
>    2235                 else
>    2236                         __split_and_process_bio(md, map, bio);
>    2237         }
>    2238
>    2239         dm_put_live_table(md, srcu_idx);
>    2240 }
> 
> You can see that if we are in dm_wq_work in __split_and_process_bio, we 
> will not process md->rescued list.

Can you elaborate further?  We cannot be "in dm_wq_work in
__split_and_process_bio" simultaneously.  Do you mean as a side-effect
of scheduling away from __split_and_process_bio?

The more detail you can share the better.

> The processing of md->rescued is also wrong - bios for different devices 
> must be offloaded to different helper threads, so that processing a bio 
> for a lower device doesn't depend on processing a bio for a higher device. 
> If you offload all the bios on current->bio_list to the same thread, the 
> bios still depend on each other and the deadlock will still happen.

Commit 325738403 ("dm: revise 'rescue' strategy for bio-based bioset
allocations") speaks to this with:

"Note that only current->bio_list[0] is offloaded.  current->bio_list[1]
contains bios that were scheduled *before* the current one started, so
they must have been submitted from higher up the stack, and we cannot be
waiting for them here (thanks to the "dm: ensure bio submission follows
a depth-first tree walk" commit).  Also, we now rescue *all* bios on the
list as there is nothing to be gained by being more selective."
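
(A minimal sketch of that offload, for illustration only -- not the
code in the dm-4.16 branch, and the helper name is made up; md->rescued,
md->deferred_lock, md->wq and md->work are the fields discussed here:)

static void dm_offload_current_bio_list(struct mapped_device *md)
{
        struct bio_list list;
        unsigned long flags;

        if (!current->bio_list)
                return;

        /* bio_list[1] holds bios queued before we started; leave it alone */
        list = current->bio_list[0];
        bio_list_init(&current->bio_list[0]);

        spin_lock_irqsave(&md->deferred_lock, flags);
        bio_list_merge(&md->rescued, &list);
        spin_unlock_irqrestore(&md->deferred_lock, flags);

        queue_work(md->wq, &md->work);
}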

And again: this patchset passes your dm-snapshot deadlock test.  Is
that test somehow lacking?

Or do you see a hypothetical case where a deadlock is still possible?
That is of less concern.  I'd prefer that we tackle problems for
targets, and associated scenarios, that we currently support.

Either way, happy to review this with you further.  Any fixes are
welcomed too.  But I'd like us to head in a direction that this patchset
is taking us.  Specifically: away from DM relying on BIOSET_NEED_RESCUER.

Thanks,
Mike

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [dm-devel] new patchset to eliminate DM's use of BIOSET_NEED_RESCUER [was: Re: [PATCH 00/13] block: assorted cleanup for bio splitting and cloning.]
  2017-11-21 21:23                 ` [dm-devel] " Mikulas Patocka
@ 2017-11-21 23:03                     ` NeilBrown
  2017-11-21 23:03                     ` NeilBrown
  1 sibling, 0 replies; 53+ messages in thread
From: NeilBrown @ 2017-11-21 23:03 UTC (permalink / raw)
  To: Mikulas Patocka, Mike Snitzer
  Cc: Jens Axboe, linux-kernel, linux-block, device-mapper development,
	Zdenek Kabelac

[-- Attachment #1: Type: text/plain, Size: 5643 bytes --]

On Tue, Nov 21 2017, Mikulas Patocka wrote:

> On Tue, 21 Nov 2017, Mike Snitzer wrote:
>
>> On Tue, Nov 21 2017 at  7:43am -0500,
>> Mike Snitzer <snitzer@redhat.com> wrote:
>>  
>> > Decided it a better use of my time to review and then hopefully use the
>> > block-core's bio splitting infrastructure in DM.  Been meaning to do
>> > that for quite a while anyway.  This mail from you just made it all the
>> > more clear that needs doing:
>> > https://www.redhat.com/archives/dm-devel/2017-September/msg00098.html
>> > 
>> > So I will start here on this patch you proposed:
>> > https://www.redhat.com/archives/dm-devel/2017-September/msg00091.html
>> > (of note, this patch slipped through the cracks because I was recovering
>> > from injury when it originally came through).
>> > 
>> > Once DM is using q->bio_split I'll come back to this patch (aka
>> > "[1]") as a starting point for the follow-on work to remove DM's use of
>> > BIOSET_NEED_RESCUER:
>> > https://www.redhat.com/archives/dm-devel/2017-August/msg00315.html
>> 
>> Hey Neil,
>> 
>> Good news!  All your code works ;)
>> 
>> (well after 1 fixup due to a cut-n-paste bug.. the code you added to
>> dm_wq_work() to process the md->rescued bio_list was operating on
>> the md->deferred bio_list due to cut-n-paste from code you copied from
>> just below it)
>> 
>> I split your code out some to make it more reviewable.  I also tweaked
>> headers accordingly.
>> 
>> Please see this branch (which _will_ get rebased between now and the
>> 4.16 merge window):
>> https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-4.16
>> 
>> I successfully tested these changes using Mikulas' test program that
>> reproduces the snapshot deadlock:
>> https://www.redhat.com/archives/dm-devel/2017-January/msg00064.html
>> 
>> I'll throw various other DM testsuites at it to verify they all look
>> good (e.g. thinp, cache, multipath).
>> 
>> I'm open to all suggestions about changes you'd like to see (either to
>> these patches or anything you'd like to layer ontop of them).
>> 
>> Thanks for all your work, much appreciated!
>> Mike
>
> This is not correct:

Thanks for your review!

>
>    2206 static void dm_wq_work(struct work_struct *work)
>    2207 {
>    2208         struct mapped_device *md = container_of(work, struct mapped_device, work);
>    2209         struct bio *bio;
>    2210         int srcu_idx;
>    2211         struct dm_table *map;
>    2212
>    2213         if (!bio_list_empty(&md->rescued)) {
>    2214                 struct bio_list list;
>    2215                 spin_lock_irq(&md->deferred_lock);
>    2216                 list = md->rescued;
>    2217                 bio_list_init(&md->rescued);
>    2218                 spin_unlock_irq(&md->deferred_lock);
>    2219                 while ((bio = bio_list_pop(&list)))
>    2220                         generic_make_request(bio);
>    2221         }
>    2222
>    2223         map = dm_get_live_table(md, &srcu_idx);
>    2224
>    2225         while (!test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags)) {
>    2226                 spin_lock_irq(&md->deferred_lock);
>    2227                 bio = bio_list_pop(&md->deferred);
>    2228                 spin_unlock_irq(&md->deferred_lock);
>    2229
>    2230                 if (!bio)
>    2231                         break;
>    2232
>    2233                 if (dm_request_based(md))
>    2234                         generic_make_request(bio);
>    2235                 else
>    2236                         __split_and_process_bio(md, map, bio);
>    2237         }
>    2238
>    2239         dm_put_live_table(md, srcu_idx);
>    2240 }
>
> You can see that if we are in dm_wq_work in __split_and_process_bio, we 
> will not process md->rescued list.

Correct, but md->rescued will be empty, or irrelevant.
The first section of dm_wq_work ensures ->rescued is empty.
When __split_and_process_bio() calls generic_make_request() (indirectly
through one or more targets) they will not be recursive calls,
so nothing will be added to current->bio_list[0] and nothing
will be moved to md->rescued.  Each generic_make_request() will
completely submit the request in the lower-level device.

Some other thread could call generic_make_request on this dm device and
result in bios appearing on md->rescued.  These bios could only be a
problem if something that __split_and_process_bio calls might wait for
them.  I don't think that happens (at least I don't think it should...).


>
> The processing of md->rescued is also wrong - bios for different devices 
> must be offloaded to different helper threads, so that processing a bio 
> for a lower device doesn't depend on processing a bio for a higher device. 
> If you offload all the bios on current->bio_list to the same thread, the 
> bios still depend on each other and the deadlock will still happen.

bios on current->bio_list[0] are not allowed to depend on each other
except that later bios can depend on earlier bios.  They are all for a
lower-level device and should be largely independent.
The sorting that generic_make_request now does ensures that a bio for a
higher level device is never processed when a bio for a lower level
device, that it might depend on, is stuck on current->bio_list.
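
(For reference, a simplified sketch of that sorting -- paraphrased from
the generic_make_request() loop as of the 4.13 base this series was
built on, not quoted verbatim:)

/* on entry, bio_list_on_stack[0] holds bios queued by this call level */
bio_list_on_stack[1] = bio_list_on_stack[0];
bio_list_init(&bio_list_on_stack[0]);

ret = q->make_request_fn(q, bio);

/* sort what ->make_request_fn just queued by target queue */
bio_list_init(&lower);
bio_list_init(&same);
while ((bio = bio_list_pop(&bio_list_on_stack[0])) != NULL)
        if (q == bdev_get_queue(bio->bi_bdev))
                bio_list_add(&same, bio);
        else
                bio_list_add(&lower, bio);

/* lower-level bios first, then same-level, then previously queued ones */
bio_list_merge(&bio_list_on_stack[0], &lower);
bio_list_merge(&bio_list_on_stack[0], &same);
bio_list_merge(&bio_list_on_stack[0], &bio_list_on_stack[1]);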

So I don't think there is a problem here.
Do you find this argument at all convincing?

Thanks,
NeilBrown

>
> Mikulas
>
> --
> dm-devel mailing list
> dm-devel@redhat.com
> https://www.redhat.com/mailman/listinfo/dm-devel

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: new patchset to eliminate DM's use of BIOSET_NEED_RESCUER
  2017-11-21 22:51                   ` new patchset to eliminate DM's use of BIOSET_NEED_RESCUER Mike Snitzer
@ 2017-11-22  1:21                     ` Mikulas Patocka
  2017-11-22  2:32                       ` Mike Snitzer
  2017-11-22  4:00                         ` NeilBrown
  0 siblings, 2 replies; 53+ messages in thread
From: Mikulas Patocka @ 2017-11-22  1:21 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: NeilBrown, Jens Axboe, linux-kernel, linux-block,
	device-mapper development, Zdenek Kabelac



On Tue, 21 Nov 2017, Mike Snitzer wrote:

> On Tue, Nov 21 2017 at  4:23pm -0500,
> Mikulas Patocka <mpatocka@redhat.com> wrote:
> 
> > This is not correct:
> > 
> >    2206 static void dm_wq_work(struct work_struct *work)
> >    2207 {
> >    2208         struct mapped_device *md = container_of(work, struct mapped_device, work);
> >    2209         struct bio *bio;
> >    2210         int srcu_idx;
> >    2211         struct dm_table *map;
> >    2212
> >    2213         if (!bio_list_empty(&md->rescued)) {
> >    2214                 struct bio_list list;
> >    2215                 spin_lock_irq(&md->deferred_lock);
> >    2216                 list = md->rescued;
> >    2217                 bio_list_init(&md->rescued);
> >    2218                 spin_unlock_irq(&md->deferred_lock);
> >    2219                 while ((bio = bio_list_pop(&list)))
> >    2220                         generic_make_request(bio);
> >    2221         }
> >    2222
> >    2223         map = dm_get_live_table(md, &srcu_idx);
> >    2224
> >    2225         while (!test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags)) {
> >    2226                 spin_lock_irq(&md->deferred_lock);
> >    2227                 bio = bio_list_pop(&md->deferred);
> >    2228                 spin_unlock_irq(&md->deferred_lock);
> >    2229
> >    2230                 if (!bio)
> >    2231                         break;
> >    2232
> >    2233                 if (dm_request_based(md))
> >    2234                         generic_make_request(bio);
> >    2235                 else
> >    2236                         __split_and_process_bio(md, map, bio);
> >    2237         }
> >    2238
> >    2239         dm_put_live_table(md, srcu_idx);
> >    2240 }
> > 
> > You can see that if we are in dm_wq_work in __split_and_process_bio, we 
> > will not process md->rescued list.
> 
> Can you elaborate further?  We cannot be "in dm_wq_work in
> __split_and_process_bio" simultaneously.  Do you mean as a side-effect
> of scheduling away from __split_and_process_bio?
> 
> The more detail you can share the better.

Suppose this scenario:

* dm_wq_work calls __split_and_process_bio
* __split_and_process_bio eventually reaches the function snapshot_map
* snapshot_map attempts to take the snapshot lock

* the snapshot lock could be released only if some bios submitted by the 
snapshot driver to the underlying device complete
* the bios submitted to the underlying device were already offloaded by 
some other task and they are waiting on the list md->rescued
* the bios waiting on md->rescued are not processed, because dm_wq_work is 
blocked in snapshot_map (called from __split_and_process_bio)

> > The processing of md->rescued is also wrong - bios for different devices 
> > must be offloaded to different helper threads, so that processing a bio 
> > for a lower device doesn't depend on processing a bio for a higher device. 
> > If you offload all the bios on current->bio_list to the same thread, the 
> > bios still depend on each other and the deadlock will still happen.
> 
> Commit 325738403 ("dm: revise 'rescue' strategy for bio-based bioset
> allocations") speaks to this with:
> 
> "Note that only current->bio_list[0] is offloaded.  current->bio_list[1]
> contains bios that were scheduled *before* the current one started, so
> they must have been submitted from higher up the stack, and we cannot be
> waiting for them here (thanks to the "dm: ensure bio submission follows
> a depth-first tree walk" commit).  Also, we now rescue *all* bios on the
> list as there is nothing to be gained by being more selective."

I think you are right - if we only offload current->bio_list[0], then 
mixing of dependent bios on the offloaded list won't happen.

> And again: this patchset passes your dm-snapshot deadlock test.  Is
> that test somehow lacking?

With your patchset, the deadlock would happen only if bios are queued on 
&md->deferred - and that happens only in case of resume or if we are 
processing REQ_PREFLUSH with non-zero data size.

So, the simple test that I wrote doesn't trigger it, but a more complex 
test involving REQ_PREFLUSH could.
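
(Roughly the path being referred to -- paraphrased from dm.c's
dec_pending(), not quoted verbatim: once the empty flush completes, the
data portion of a flush-with-data is requeued onto md->deferred via
queue_io(), so it is only ever issued from dm_wq_work():)

/* in dec_pending(), when the preflush part of a flush-with-data is done */
if ((bio->bi_opf & REQ_PREFLUSH) && bio->bi_iter.bi_size) {
        /* reissue the data portion without REQ_PREFLUSH ... */
        bio->bi_opf &= ~REQ_PREFLUSH;
        /* ... by deferring it to md->deferred for dm_wq_work */
        queue_io(md, bio);
}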

> Or do you see a hypothetical case where a deadlock is still possible?
> That is of less concern.  I'd prefer that we tackle problems for
> targets, and associated scenarios, that we currently support.
> 
> Either way, happy to review this with you further.  Any fixes are
> welcomed too.  But I'd like us to head in a direction that this patchset
> is taking us.  Specifically: away from DM relying on BIOSET_NEED_RESCUER.
> 
> Thanks,
> Mike

Mikulas

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: new patchset to eliminate DM's use of BIOSET_NEED_RESCUER
  2017-11-22  1:21                     ` Mikulas Patocka
@ 2017-11-22  2:32                       ` Mike Snitzer
  2017-11-22  4:00                         ` NeilBrown
  1 sibling, 0 replies; 53+ messages in thread
From: Mike Snitzer @ 2017-11-22  2:32 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: NeilBrown, Jens Axboe, linux-kernel, linux-block,
	device-mapper development, Zdenek Kabelac

On Tue, Nov 21 2017 at  8:21pm -0500,
Mikulas Patocka <mpatocka@redhat.com> wrote:

> 
> 
> On Tue, 21 Nov 2017, Mike Snitzer wrote:
> 
> > On Tue, Nov 21 2017 at  4:23pm -0500,
> > Mikulas Patocka <mpatocka@redhat.com> wrote:
> > 
> > > This is not correct:
> > > 
> > >    2206 static void dm_wq_work(struct work_struct *work)
> > >    2207 {
> > >    2208         struct mapped_device *md = container_of(work, struct mapped_device, work);
> > >    2209         struct bio *bio;
> > >    2210         int srcu_idx;
> > >    2211         struct dm_table *map;
> > >    2212
> > >    2213         if (!bio_list_empty(&md->rescued)) {
> > >    2214                 struct bio_list list;
> > >    2215                 spin_lock_irq(&md->deferred_lock);
> > >    2216                 list = md->rescued;
> > >    2217                 bio_list_init(&md->rescued);
> > >    2218                 spin_unlock_irq(&md->deferred_lock);
> > >    2219                 while ((bio = bio_list_pop(&list)))
> > >    2220                         generic_make_request(bio);
> > >    2221         }
> > >    2222
> > >    2223         map = dm_get_live_table(md, &srcu_idx);
> > >    2224
> > >    2225         while (!test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags)) {
> > >    2226                 spin_lock_irq(&md->deferred_lock);
> > >    2227                 bio = bio_list_pop(&md->deferred);
> > >    2228                 spin_unlock_irq(&md->deferred_lock);
> > >    2229
> > >    2230                 if (!bio)
> > >    2231                         break;
> > >    2232
> > >    2233                 if (dm_request_based(md))
> > >    2234                         generic_make_request(bio);
> > >    2235                 else
> > >    2236                         __split_and_process_bio(md, map, bio);
> > >    2237         }
> > >    2238
> > >    2239         dm_put_live_table(md, srcu_idx);
> > >    2240 }
> > > 
> > > You can see that if we are in dm_wq_work in __split_and_process_bio, we 
> > > will not process md->rescued list.
> > 
> > Can you elaborate further?  We cannot be "in dm_wq_work in
> > __split_and_process_bio" simultaneously.  Do you mean as a side-effect
> > of scheduling away from __split_and_process_bio?
> > 
> > The more detail you can share the better.
> 
> Suppose this scenario:
> 
> * dm_wq_work calls __split_and_process_bio

Right, I later realized this was the call chain you were referring to.
Not sure how I missed it the first time around.

> * __split_and_process_bio eventually reaches the function snapshot_map
> * snapshot_map attempts to take the snapshot lock
> 
> * the snapshot lock could be released only if some bios submitted by the 
> snapshot driver to the underlying device complete
> * the bios submitted to the underlying device were already offloaded by 
> some other task and they are waiting on the list md->rescued
> * the bios waiting on md->rescued are not processed, because dm_wq_work is 
> blocked in snapshot_map (called from __split_and_process_bio)

So you're saying the case that Neil doesn't think should happen:
https://lkml.org/lkml/2017/11/21/658
...can happen.

> > > The processing of md->rescued is also wrong - bios for different devices 
> > > must be offloaded to different helper threads, so that processing a bio 
> > > for a lower device doesn't depend on processing a bio for a higher device. 
> > > If you offload all the bios on current->bio_list to the same thread, the 
> > > bios still depend on each other and the deadlock will still happen.
> > 
> > Commit 325738403 ("dm: revise 'rescue' strategy for bio-based bioset
> > allocations") speaks to this with:
> > 
> > "Note that only current->bio_list[0] is offloaded.  current->bio_list[1]
> > contains bios that were scheduled *before* the current one started, so
> > they must have been submitted from higher up the stack, and we cannot be
> > waiting for them here (thanks to the "dm: ensure bio submission follows
> > a depth-first tree walk" commit).  Also, we now rescue *all* bios on the
> > list as there is nothing to be gained by being more selective."
> 
> I think you are right - if we only offload current->bio_list[0], then 
> mixing of dependent bios on the offloaded list won't happen.
> 
> > And again: this patchset passes your dm-snapshot deadlock test.  Is
> > that test somehow lacking?
> 
> With your patchset, the deadlock would happen only if bios are queued on 
> &md->deferred - and that happens only in case of resume or if we are 
> processing REQ_PREFLUSH with non-zero data size.
> 
> So, the simple test that I wrote doesn't trigger it, but a more complex 
> test involving REQ_PREFLUSH could.

Makes sense.  But I need to think further about _why_ bios submitted to
the snapshot driver's underlying device would end up on md->rescued
(like you suggested above).  Again, Neil thinks it's not possible.  Neil
said:
"they will not be recursive calls, so nothing will be added to
current->bio_list[0] and nothing will be moved to md->rescued.  Each
generic_make_request() will completely submit the request in the lower
level devel."

Mike

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [dm-devel] new patchset to eliminate DM's use of BIOSET_NEED_RESCUER
  2017-11-22  1:21                     ` Mikulas Patocka
  2017-11-22  2:32                       ` Mike Snitzer
@ 2017-11-22  4:00                         ` NeilBrown
  1 sibling, 0 replies; 53+ messages in thread
From: NeilBrown @ 2017-11-22  4:00 UTC (permalink / raw)
  To: Mikulas Patocka, Mike Snitzer
  Cc: Jens Axboe, linux-kernel, linux-block, device-mapper development,
	Zdenek Kabelac

[-- Attachment #1: Type: text/plain, Size: 11093 bytes --]

On Tue, Nov 21 2017, Mikulas Patocka wrote:

> On Tue, 21 Nov 2017, Mike Snitzer wrote:
>
>> On Tue, Nov 21 2017 at  4:23pm -0500,
>> Mikulas Patocka <mpatocka@redhat.com> wrote:
>> 
>> > This is not correct:
>> > 
>> >    2206 static void dm_wq_work(struct work_struct *work)
>> >    2207 {
>> >    2208         struct mapped_device *md = container_of(work, struct mapped_device, work);
>> >    2209         struct bio *bio;
>> >    2210         int srcu_idx;
>> >    2211         struct dm_table *map;
>> >    2212
>> >    2213         if (!bio_list_empty(&md->rescued)) {
>> >    2214                 struct bio_list list;
>> >    2215                 spin_lock_irq(&md->deferred_lock);
>> >    2216                 list = md->rescued;
>> >    2217                 bio_list_init(&md->rescued);
>> >    2218                 spin_unlock_irq(&md->deferred_lock);
>> >    2219                 while ((bio = bio_list_pop(&list)))
>> >    2220                         generic_make_request(bio);
>> >    2221         }
>> >    2222
>> >    2223         map = dm_get_live_table(md, &srcu_idx);
>> >    2224
>> >    2225         while (!test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags)) {
>> >    2226                 spin_lock_irq(&md->deferred_lock);
>> >    2227                 bio = bio_list_pop(&md->deferred);
>> >    2228                 spin_unlock_irq(&md->deferred_lock);
>> >    2229
>> >    2230                 if (!bio)
>> >    2231                         break;
>> >    2232
>> >    2233                 if (dm_request_based(md))
>> >    2234                         generic_make_request(bio);
>> >    2235                 else
>> >    2236                         __split_and_process_bio(md, map, bio);
>> >    2237         }
>> >    2238
>> >    2239         dm_put_live_table(md, srcu_idx);
>> >    2240 }
>> > 
>> > You can see that if we are in dm_wq_work in __split_and_process_bio, we 
>> > will not process md->rescued list.
>> 
>> Can you elaborate further?  We cannot be "in dm_wq_work in
>> __split_and_process_bio" simultaneously.  Do you mean as a side-effect
>> of scheduling away from __split_and_process_bio?
>> 
>> The more detail you can share the better.
>
> Suppose this scenario:
>
> * dm_wq_work calls __split_and_process_bio
> * __split_and_process_bio eventually reaches the function snapshot_map
> * snapshot_map attempts to take the snapshot lock
>
> * the snapshot lock could be released only if some bios submitted by the 
> snapshot driver to the underlying device complete
> * the bios submitted to the underlying device were already offloaded by 
> some other task and they are waiting on the list md->rescued
> * the bios waiting on md->rescued are not processed, because dm_wq_work is 
> blocked in snapshot_map (called from __split_and_process_bio)

Yes, I think you are right. 

I think the solution is to get rid of the dm_offload() infrastructure
and make it not necessary.
i.e. discard my patches
    dm: prepare to discontinue use of BIOSET_NEED_RESCUER
and
    dm: revise 'rescue' strategy for bio-based bioset allocations

And build on "dm: ensure bio submission follows a depth-first tree walk"
which was written after those and already makes dm_offload() less
important.

Since that "depth-first" patch, every request to the dm device, after
the initial splitting, allocates just one dm_target_io structure, and
makes just one __map_bio() call, and so will behave exactly the way
generic_make_request() expects and copes with - thus avoiding awkward
dependencies and deadlocks.  Except....

a/ If any target defines ->num_write_bios() to return >1,
   __clone_and_map_data_bio() will make multiple calls to alloc_tio()
   and __map_bio(), which might need rescuing.
   But no target defines num_write_bios, and none have since it was
   removed from dm-cache 4.5 years ago.
   Can we discard num_write_bios??

b/ If any target sets any of num_{flush,discard,write_same,write_zeroes}_bios
   to a value > 1, then __send_duplicate_bios() will also make multiple
   calls to alloc_tio() and __map_bio().
   Some do.
     dm-cache-target:  flush=2
     dm-snap: flush=2
     dm-stripe: discard, write_same, write_zeroes all set to 'stripes'.

These will only be a problem if the second (or subsequent) alloc_tio()
blocks waiting for an earlier allocation to complete.  This will only
be a problem if multiple threads are each trying to allocate multiple
dm_target_io from the same bioset at the same time.
This is rare and should be easier to address than the current
dm_offload() approach. 
One possibility would be to copy the approach taken by
crypt_alloc_buffer() which needs to allocate multiple entries from a
mempool.
It first tries with GFP_NOWAIT.  If that fails it takes a mutex and
tries with GFP_NOIO.  This means only one thread will try to allocate
multiple bios at once, so there can be no deadlock.
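
Roughly, the pattern would be something like this (a sketch only, with
made-up names - it is not existing DM code; the second RFC patch below
applies the same idea to dm_target_io allocation):

static int alloc_n_from_pool(mempool_t *pool, struct mutex *lock,
                             void **out, unsigned int n)
{
        unsigned int i;
        int try;

        for (try = 0; try < 2; try++) {
                if (try)
                        mutex_lock(lock);
                for (i = 0; i < n; i++) {
                        /* first pass must not sleep; second may wait in the mempool */
                        out[i] = mempool_alloc(pool, try ? GFP_NOIO : GFP_NOWAIT);
                        if (!out[i])
                                break;
                }
                if (try)
                        mutex_unlock(lock);
                if (i == n)
                        return 0;
                /* partial allocation: put everything back, then retry under the lock */
                while (i)
                        mempool_free(out[--i], pool);
        }
        return -ENOMEM;
}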

Below are two RFC patches.  The first removes num_write_bios.
The second is incomplete and makes a stab at allocating multiple bios
at once safely.
A third would be needed to remove dm_offload() etc... but I cannot quite
fit that in today - must be off.

Thanks,
NeilBrown

From: NeilBrown <neilb@suse.com>
Date: Wed, 22 Nov 2017 14:25:18 +1100
Subject: [PATCH] DM: remove num_write_bios target interface.

No target provides num_write_bios and none has done
since 2013.
Having the possibility of num_write_bios > 1 complicates
bio allocation.
So remove the interface and assume there is only one bio
needed.
If a target ever needs more, it must provide a suitable
bioset and allocate itself based on its particular needs.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 drivers/md/dm.c               | 22 ++++------------------
 include/linux/device-mapper.h | 15 ---------------
 2 files changed, 4 insertions(+), 33 deletions(-)

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index b20febd6cbc7..8c1a05609eea 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1323,27 +1323,13 @@ static int __clone_and_map_data_bio(struct clone_info *ci, struct dm_target *ti,
 {
 	struct bio *bio = ci->bio;
 	struct dm_target_io *tio;
-	unsigned target_bio_nr;
-	unsigned num_target_bios = 1;
 	int r = 0;
 
-	/*
-	 * Does the target want to receive duplicate copies of the bio?
-	 */
-	if (bio_data_dir(bio) == WRITE && ti->num_write_bios)
-		num_target_bios = ti->num_write_bios(ti, bio);
-
-	for (target_bio_nr = 0; target_bio_nr < num_target_bios; target_bio_nr++) {
-		tio = alloc_tio(ci, ti, target_bio_nr);
-		tio->len_ptr = len;
-		r = clone_bio(tio, bio, sector, *len);
-		if (r < 0) {
-			free_tio(tio);
-			break;
-		}
+	tio = alloc_tio(ci, ti, 0);
+	tio->len_ptr = len;
+	r = clone_bio(tio, bio, sector, *len);
+	if (r >= 0)
 		__map_bio(tio);
-	}
-
 	return r;
 }
 
diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
index a5538433c927..5a68b366e664 100644
--- a/include/linux/device-mapper.h
+++ b/include/linux/device-mapper.h
@@ -220,14 +220,6 @@ struct target_type {
 #define DM_TARGET_WILDCARD		0x00000008
 #define dm_target_is_wildcard(type)	((type)->features & DM_TARGET_WILDCARD)
 
-/*
- * Some targets need to be sent the same WRITE bio severals times so
- * that they can send copies of it to different devices.  This function
- * examines any supplied bio and returns the number of copies of it the
- * target requires.
- */
-typedef unsigned (*dm_num_write_bios_fn) (struct dm_target *ti, struct bio *bio);
-
 /*
  * A target implements own bio data integrity.
  */
@@ -291,13 +283,6 @@ struct dm_target {
 	 */
 	unsigned per_io_data_size;
 
-	/*
-	 * If defined, this function is called to find out how many
-	 * duplicate bios should be sent to the target when writing
-	 * data.
-	 */
-	dm_num_write_bios_fn num_write_bios;
-
 	/* target specific data */
 	void *private;
 
-- 
2.14.0.rc0.dirty


-----------------------------------
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 8c1a05609eea..8762661df2ef 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1265,8 +1265,7 @@ static int clone_bio(struct dm_target_io *tio, struct bio *bio,
 }
 
 static struct dm_target_io *alloc_tio(struct clone_info *ci,
-				      struct dm_target *ti,
-				      unsigned target_bio_nr)
+				      struct dm_target *ti)
 {
 	struct dm_target_io *tio;
 	struct bio *clone;
@@ -1276,34 +1275,66 @@ static struct dm_target_io *alloc_tio(struct clone_info *ci,
 
 	tio->io = ci->io;
 	tio->ti = ti;
-	tio->target_bio_nr = target_bio_nr;
+	tio->target_bio_nr = 0;
 
 	return tio;
 }
 
-static void __clone_and_map_simple_bio(struct clone_info *ci,
-				       struct dm_target *ti,
-				       unsigned target_bio_nr, unsigned *len)
+static void alloc_multiple_bios(struct bio_list *blist, struct clone_info *ci,
+				struct dm_target *ti, unsigned num_bios)
 {
-	struct dm_target_io *tio = alloc_tio(ci, ti, target_bio_nr);
-	struct bio *clone = &tio->clone;
+	int try;
 
-	tio->len_ptr = len;
+	for (try = 0; try < 2; try++) {
+		int bio_nr;
+		struct bio *bio;
+
+		if (try)
+			mutex_lock(&ci->md->table_devices_lock);
+		for (bio_nr = 0; bio_nr < num_bios; bio_nr++) {
+			bio = bio_alloc_bioset(try ? GFP_NOIO : GFP_NOWAIT,
+					       0, ci->md->bs);
+			if (bio) {
+				struct dm_target_io *tio;
+				bio_list_add(blist, bio);
+				tio = container_of(bio, struct dm_target_io, clone);
 
-	__bio_clone_fast(clone, ci->bio);
-	if (len)
-		bio_setup_sector(clone, ci->sector, *len);
+				tio->io = ci->io;
+				tio->ti = ti;
+				tio->target_bio_nr = bio_nr;
+			} else
+				break;
+		}
+		if (try)
+			mutex_unlock(&ci->md->table_devices_lock);
+		if (bio_nr == num_bios)
+			return;
 
-	__map_bio(tio);
+		while ((bio = bio_list_pop(blist)) != NULL)
+			bio_put(bio);
+	}
 }
 
 static void __send_duplicate_bios(struct clone_info *ci, struct dm_target *ti,
 				  unsigned num_bios, unsigned *len)
 {
-	unsigned target_bio_nr;
+	struct bio_list blist = BIO_EMPTY_LIST;
+	struct bio *bio;
 
-	for (target_bio_nr = 0; target_bio_nr < num_bios; target_bio_nr++)
-		__clone_and_map_simple_bio(ci, ti, target_bio_nr, len);
+	if (num_bios == 1)
+		bio_list_add(&blist, &alloc_tio(ci, ti)->clone);
+	else
+		alloc_multiple_bios(&blist, ci, ti, num_bios);
+
+	while ((bio = bio_list_pop(&blist)) != NULL) {
+		struct dm_target_io *tio = container_of(
+			bio, struct dm_target_io, clone);
+		tio->len_ptr = len;
+		__bio_clone_fast(bio, ci->bio);
+		if (len)
+			bio_setup_sector(bio, ci->sector, *len);
+		__map_bio(tio);
+	}
 }
 
 static int __send_empty_flush(struct clone_info *ci)
@@ -1325,7 +1356,7 @@ static int __clone_and_map_data_bio(struct clone_info *ci, struct dm_target *ti,
 	struct dm_target_io *tio;
 	int r = 0;
 
-	tio = alloc_tio(ci, ti, 0);
+	tio = alloc_tio(ci, ti);
 	tio->len_ptr = len;
 	r = clone_bio(tio, bio, sector, *len);
 	if (r >= 0)

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* Re: new patchset to eliminate DM's use of BIOSET_NEED_RESCUER
  2017-11-22  4:00                         ` NeilBrown
@ 2017-11-22  4:28                           ` Mike Snitzer
  -1 siblings, 0 replies; 53+ messages in thread
From: Mike Snitzer @ 2017-11-22  4:28 UTC (permalink / raw)
  To: NeilBrown
  Cc: Mikulas Patocka, Jens Axboe, linux-kernel, linux-block,
	device-mapper development, Zdenek Kabelac

On Tue, Nov 21 2017 at 11:00pm -0500,
NeilBrown <neilb@suse.com> wrote:

> On Tue, Nov 21 2017, Mikulas Patocka wrote:
> 
> > On Tue, 21 Nov 2017, Mike Snitzer wrote:
> >
> >> On Tue, Nov 21 2017 at  4:23pm -0500,
> >> Mikulas Patocka <mpatocka@redhat.com> wrote:
> >> 
> >> > This is not correct:
> >> > 
> >> >    2206 static void dm_wq_work(struct work_struct *work)
> >> >    2207 {
> >> >    2208         struct mapped_device *md = container_of(work, struct mapped_device, work);
> >> >    2209         struct bio *bio;
> >> >    2210         int srcu_idx;
> >> >    2211         struct dm_table *map;
> >> >    2212
> >> >    2213         if (!bio_list_empty(&md->rescued)) {
> >> >    2214                 struct bio_list list;
> >> >    2215                 spin_lock_irq(&md->deferred_lock);
> >> >    2216                 list = md->rescued;
> >> >    2217                 bio_list_init(&md->rescued);
> >> >    2218                 spin_unlock_irq(&md->deferred_lock);
> >> >    2219                 while ((bio = bio_list_pop(&list)))
> >> >    2220                         generic_make_request(bio);
> >> >    2221         }
> >> >    2222
> >> >    2223         map = dm_get_live_table(md, &srcu_idx);
> >> >    2224
> >> >    2225         while (!test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags)) {
> >> >    2226                 spin_lock_irq(&md->deferred_lock);
> >> >    2227                 bio = bio_list_pop(&md->deferred);
> >> >    2228                 spin_unlock_irq(&md->deferred_lock);
> >> >    2229
> >> >    2230                 if (!bio)
> >> >    2231                         break;
> >> >    2232
> >> >    2233                 if (dm_request_based(md))
> >> >    2234                         generic_make_request(bio);
> >> >    2235                 else
> >> >    2236                         __split_and_process_bio(md, map, bio);
> >> >    2237         }
> >> >    2238
> >> >    2239         dm_put_live_table(md, srcu_idx);
> >> >    2240 }
> >> > 
> >> > You can see that if we are in dm_wq_work in __split_and_process_bio, we 
> >> > will not process md->rescued list.
> >> 
> >> Can you elaborate further?  We cannot be "in dm_wq_work in
> >> __split_and_process_bio" simultaneously.  Do you mean as a side-effect
> >> of scheduling away from __split_and_process_bio?
> >> 
> >> The more detail you can share the better.
> >
> > Suppose this scenario:
> >
> > * dm_wq_work calls __split_and_process_bio
> > * __split_and_process_bio eventually reaches the function snapshot_map
> > * snapshot_map attempts to take the snapshot lock
> >
> > * the snapshot lock could be released only if some bios submitted by the 
> > snapshot driver to the underlying device complete
> > * the bios submitted to the underlying device were already offloaded by 
> > some other task and they are waiting on the list md->rescued
> > * the bios waiting on md->rescued are not processed, because dm_wq_work is 
> > blocked in snapshot_map (called from __split_and_process_bio)
> 
> Yes, I think you are right. 
> 
> I think the solution is to get rid of the dm_offload() infrastructure
> and make it not necessary.
> i.e. discard my patches
>     dm: prepare to discontinue use of BIOSET_NEED_RESCUER
> and
>     dm: revise 'rescue' strategy for bio-based bioset allocations
> 
> And build on "dm: ensure bio submission follows a depth-first tree walk"
> which was written after those and already makes dm_offload() less
> important.
> 
> Since that "depth-first" patch, every request to the dm device, after
> the initial splitting, allocates just one dm_target_io structure, and
> makes just one __map_bio() call, and so will behave exactly the way
> generic_make_request() expects and copes with - thus avoiding awkward
> dependencies and deadlocks.  Except....

Yes, FYI I've also verified that even with just the "depth-first" patch
(and dm_offload disabled) the snapshot deadlock is fixed.
 
> a/ If any target defines ->num_write_bios() to return >1,
>    __clone_and_map_data_bio() will make multiple calls to alloc_tio()
>    and __map_bio(), which might need rescuing.
>    But no target defines num_write_bios, and none have since it was
>    removed from dm-cache 4.5 years ago.
>    Can we discard num_write_bios??

Yes.

> b/ If any target sets any of num_{flush,discard,write_same,write_zeroes}_bios
>    to a value > 1, then __send_duplicate_bios() will also make multiple
>    calls to alloc_tio() and __map_bio().
>    Some do.
>      dm-cache-target:  flush=2
>      dm-snap: flush=2
>      dm-stripe: discard, write_same, write_zeroes all set to 'stripes'.
> 
> These will only be a problem if the second (or subsequent) alloc_tio()
> blocks waiting for an earlier allocation to complete.  This will only
> be a problem if multiple threads are each trying to allocate multiple
> dm_target_io from the same bioset at the same time.
> This is rare and should be easier to address than the current
> dm_offload() approach. 
> One possibility would be to copy the approach taken by
> crypt_alloc_buffer() which needs to allocate multiple entries from a
> mempool.
> It first tries with GFP_NOWAIT.  If that fails it takes a mutex and
> tries with GFP_NOIO.  This means only one thread will try to allocate
> multiple bios at once, so there can be no deadlock.
> 
> Below are two RFC patches.  The first removes num_write_bios.
> The second is incomplete and makes a stab at allocating multiple bios
> at once safely.
> A third would be needed to remove dm_offload() etc... but I cannot quite
> fit that in today - must be off.

Great.

> From: NeilBrown <neilb@suse.com>
> Date: Wed, 22 Nov 2017 14:25:18 +1100
> Subject: [PATCH] DM: remove num_write_bios target interface.
> 
> No target provides num_write_bios and none has done
> since 2013.
> Having the possibility of num_write_bios > 1 complicates
> bio allocation.
> So remove the interface and assume there is only one bio
> needed.
> If a target ever needs more, it must provide a suitable
> bioset and allocate itself based on its particular needs.
> 
> Signed-off-by: NeilBrown <neilb@suse.com>
> ---
>  drivers/md/dm.c               | 22 ++++------------------
>  include/linux/device-mapper.h | 15 ---------------
>  2 files changed, 4 insertions(+), 33 deletions(-)
> 
> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> index b20febd6cbc7..8c1a05609eea 100644
> --- a/drivers/md/dm.c
> +++ b/drivers/md/dm.c
> @@ -1323,27 +1323,13 @@ static int __clone_and_map_data_bio(struct clone_info *ci, struct dm_target *ti,
>  {
>  	struct bio *bio = ci->bio;
>  	struct dm_target_io *tio;
> -	unsigned target_bio_nr;
> -	unsigned num_target_bios = 1;
>  	int r = 0;
>  
> -	/*
> -	 * Does the target want to receive duplicate copies of the bio?
> -	 */
> -	if (bio_data_dir(bio) == WRITE && ti->num_write_bios)
> -		num_target_bios = ti->num_write_bios(ti, bio);
> -
> -	for (target_bio_nr = 0; target_bio_nr < num_target_bios; target_bio_nr++) {
> -		tio = alloc_tio(ci, ti, target_bio_nr);
> -		tio->len_ptr = len;
> -		r = clone_bio(tio, bio, sector, *len);
> -		if (r < 0) {
> -			free_tio(tio);
> -			break;
> -		}
> +	tio = alloc_tio(ci, ti, 0);
> +	tio->len_ptr = len;
> +	r = clone_bio(tio, bio, sector, *len);
> +	if (r >= 0)
>  		__map_bio(tio);
> -	}
> -

This bit is wrong: free_tio() is needed if clone_bio() fails.  I can fix
it up though.

I'll work through your patches tomorrow.

Thanks,
Mike

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [dm-devel] new patchset to eliminate DM's use of BIOSET_NEED_RESCUER
  2017-11-22  4:00                         ` NeilBrown
@ 2017-11-22 18:24                         ` Mikulas Patocka
  2017-11-22 18:49                           ` Mike Snitzer
                                             ` (2 more replies)
  -1 siblings, 3 replies; 53+ messages in thread
From: Mikulas Patocka @ 2017-11-22 18:24 UTC (permalink / raw)
  To: NeilBrown
  Cc: Mike Snitzer, Jens Axboe, linux-kernel, linux-block,
	device-mapper development, Zdenek Kabelac



On Wed, 22 Nov 2017, NeilBrown wrote:

> On Tue, Nov 21 2017, Mikulas Patocka wrote:
> 
> > On Tue, 21 Nov 2017, Mike Snitzer wrote:
> >
> >> On Tue, Nov 21 2017 at  4:23pm -0500,
> >> Mikulas Patocka <mpatocka@redhat.com> wrote:
> >> 
> >> > This is not correct:
> >> > 
> >> >    2206 static void dm_wq_work(struct work_struct *work)
> >> >    2207 {
> >> >    2208         struct mapped_device *md = container_of(work, struct mapped_device, work);
> >> >    2209         struct bio *bio;
> >> >    2210         int srcu_idx;
> >> >    2211         struct dm_table *map;
> >> >    2212
> >> >    2213         if (!bio_list_empty(&md->rescued)) {
> >> >    2214                 struct bio_list list;
> >> >    2215                 spin_lock_irq(&md->deferred_lock);
> >> >    2216                 list = md->rescued;
> >> >    2217                 bio_list_init(&md->rescued);
> >> >    2218                 spin_unlock_irq(&md->deferred_lock);
> >> >    2219                 while ((bio = bio_list_pop(&list)))
> >> >    2220                         generic_make_request(bio);
> >> >    2221         }
> >> >    2222
> >> >    2223         map = dm_get_live_table(md, &srcu_idx);
> >> >    2224
> >> >    2225         while (!test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags)) {
> >> >    2226                 spin_lock_irq(&md->deferred_lock);
> >> >    2227                 bio = bio_list_pop(&md->deferred);
> >> >    2228                 spin_unlock_irq(&md->deferred_lock);
> >> >    2229
> >> >    2230                 if (!bio)
> >> >    2231                         break;
> >> >    2232
> >> >    2233                 if (dm_request_based(md))
> >> >    2234                         generic_make_request(bio);
> >> >    2235                 else
> >> >    2236                         __split_and_process_bio(md, map, bio);
> >> >    2237         }
> >> >    2238
> >> >    2239         dm_put_live_table(md, srcu_idx);
> >> >    2240 }
> >> > 
> >> > You can see that if we are in dm_wq_work in __split_and_process_bio, we 
> >> > will not process md->rescued list.
> >> 
> >> Can you elaborate further?  We cannot be "in dm_wq_work in
> >> __split_and_process_bio" simultaneously.  Do you mean as a side-effect
> >> of scheduling away from __split_and_process_bio?
> >> 
> >> The more detail you can share the better.
> >
> > Suppose this scenario:
> >
> > * dm_wq_work calls __split_and_process_bio
> > * __split_and_process_bio eventually reaches the function snapshot_map
> > * snapshot_map attempts to take the snapshot lock
> >
> > * the snapshot lock could be released only if some bios submitted by the 
> > snapshot driver to the underlying device complete
> > * the bios submitted to the underlying device were already offloaded by 
> > some other task and they are waiting on the list md->rescued
> > * the bios waiting on md->rescued are not processed, because dm_wq_work is 
> > blocked in snapshot_map (called from __split_and_process_bio)
> 
> Yes, I think you are right. 
> 
> I think the solution is to get rid of the dm_offload() infrastructure
> and make it not necessary.
> i.e. discard my patches
>     dm: prepare to discontinue use of BIOSET_NEED_RESCUER
> and
>     dm: revise 'rescue' strategy for bio-based bioset allocations
> 
> And build on "dm: ensure bio submission follows a depth-first tree walk"
> which was written after those and already makes dm_offload() less
> important.
> 
> Since that "depth-first" patch, every request to the dm device, after
> the initial splitting, allocates just one dm_target_io structure, and
> makes just one __map_bio() call, and so will behave exactly the way
> generic_make_request() expects and copes with - thus avoiding awkward
> dependencies and deadlocks.  Except....
> 
> a/ If any target defines ->num_write_bios() to return >1,
>    __clone_and_map_data_bio() will make multiple calls to alloc_tio()
>    and __map_bio(), which might need rescuing.
>    But no target defines num_write_bios, and none have since it was
>    removed from dm-cache 4.5 years ago.
>    Can we discard num_write_bios??
> 
> b/ If any target sets any of num_{flush,discard,write_same,write_zeroes}_bios
>    to a value > 1, then __send_duplicate_bios() will also make multiple
>    calls to alloc_tio() and __map_bio().
>    Some do.
>      dm-cache-target:  flush=2
>      dm-snap: flush=2
>      dm-stripe: discard, write_same, write_zeroes all set to 'stripes'.
> 
> These will only be a problem if the second (or subsequent) alloc_tio()
> blocks waiting for an earlier allocation to complete.  This will only
> be a problem if multiple threads are each trying to allocate multiple
> dm_target_io from the same bioset at the same time.
> This is rare and should be easier to address than the current
> dm_offload() approach. 
> One possibility would be to copy the approach taken by
> crypt_alloc_buffer() which needs to allocate multiple entries from a
> mempool.
> It first tries with GFP_NOWAIT.  If that fails it takes a mutex and
> tries with GFP_NOIO.  This means only one thread will try to allocate
> multiple bios at once, so there can be no deadlock.
> 
> Below are two RFC patches.  The first removes num_write_bios.
> The second is incomplete and makes a stab at allocating multiple bios
> at once safely.
> A third would be needed to remove dm_offload() etc... but I cannot quite
> fit that in today - must be off.
> 
> Thanks,
> NeilBrown

Another problem is this:

struct bio *b = bio_clone_bioset(bio, GFP_NOIO, md->queue->bio_split);
bio_advance(b, (bio_sectors(b) - ci.sector_count) << 9);
bio_chain(b, bio);

What if it blocks because the bioset is exhausted?

The code basically builds a chain of bios of unlimited length (suppose for 
example a case when we are splitting on every sector boundary, so there 
will be one bio for every sector in the original bio), so it could exhaust 
the bioset easily.

It would be better to use mechanism from md-raid that chains all the 
sub-bios to the same master bio and doesn't create long chains of bios:

        if (max_sectors < bio_sectors(bio)) {
                struct bio *split = bio_split(bio, max_sectors,
                                              gfp, conf->bio_split);
                bio_chain(split, bio);
                generic_make_request(bio);
                bio = split;
                r1_bio->master_bio = bio;
                r1_bio->sectors = max_sectors;
        }

Mikulas

^ permalink raw reply	[flat|nested] 53+ messages in thread
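
For illustration, a minimal sketch of the crypt_alloc_buffer()-style ordering
NeilBrown suggests in the quoted text above (the wrapper name and its signature
are hypothetical, not actual dm or dm-crypt code): try each entry with
GFP_NOWAIT first, and fall back to blocking GFP_NOIO allocations only under a
mutex, so at most one thread can be waiting on the mempool for several entries
at a time.

static void alloc_many(mempool_t *pool, struct mutex *lock,
		       void **out, unsigned int n)
{
	unsigned int i;

	/* Fast path: never block while possibly holding other entries. */
	for (i = 0; i < n; i++) {
		out[i] = mempool_alloc(pool, GFP_NOWAIT);
		if (!out[i])
			goto blocking;
	}
	return;

blocking:
	/* Give back what we already took so other threads can make progress. */
	while (i--)
		mempool_free(out[i], pool);

	/*
	 * Slow path, serialized: at most one thread blocks in mempool_alloc()
	 * waiting for several entries, so threads cannot deadlock on each
	 * other's partially-held allocations.
	 */
	mutex_lock(lock);
	for (i = 0; i < n; i++)
		out[i] = mempool_alloc(pool, GFP_NOIO);
	mutex_unlock(lock);
}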

* Re: new patchset to eliminate DM's use of BIOSET_NEED_RESCUER
  2017-11-22 18:24                         ` [dm-devel] " Mikulas Patocka
@ 2017-11-22 18:49                           ` Mike Snitzer
  2017-11-23  5:12                             ` NeilBrown
  2017-11-23 22:52                             ` NeilBrown
  2 siblings, 0 replies; 53+ messages in thread
From: Mike Snitzer @ 2017-11-22 18:49 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: NeilBrown, Jens Axboe, linux-kernel, linux-block,
	device-mapper development, Zdenek Kabelac

On Wed, Nov 22 2017 at  1:24pm -0500,
Mikulas Patocka <mpatocka@redhat.com> wrote:
 
> Another problem is this:
> 
> struct bio *b = bio_clone_bioset(bio, GFP_NOIO, md->queue->bio_split);
> bio_advance(b, (bio_sectors(b) - ci.sector_count) << 9);
> bio_chain(b, bio);
> 
> What if it blocks because the bioset is exhausted?
> 
> The code basically builds a chain of bios of unlimited length (suppose for 
> example a case when we are splitting on every sector boundary, so there 
> will be one bio for every sector in the original bio), it could exhaust 
> the bioset easily.
> 
> It would be better to use mechanism from md-raid that chains all the 
> sub-bios to the same master bio and doesn't create long chains of bios:
> 
>         if (max_sectors < bio_sectors(bio)) {
>                 struct bio *split = bio_split(bio, max_sectors,
>                                               gfp, conf->bio_split);
>                 bio_chain(split, bio);
>                 generic_make_request(bio);
>                 bio = split;
>                 r1_bio->master_bio = bio;
>                 r1_bio->sectors = max_sectors;
>         }

I'd be happy to take an incremental patch that improves on this commit:
https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-4.16&id=b46d6a08f1ae7bf53e4cde28e0ccdf91567d432e

But short of that I'll have to come back to this.

Thanks,
Mike

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: new patchset to eliminate DM's use of BIOSET_NEED_RESCUER
  2017-11-22  4:28                           ` Mike Snitzer
  (?)
@ 2017-11-22 21:18                           ` Mike Snitzer
  -1 siblings, 0 replies; 53+ messages in thread
From: Mike Snitzer @ 2017-11-22 21:18 UTC (permalink / raw)
  To: NeilBrown
  Cc: Mikulas Patocka, Jens Axboe, linux-kernel, linux-block,
	device-mapper development, Zdenek Kabelac

On Tue, Nov 21 2017 at 11:28pm -0500,
Mike Snitzer <snitzer@redhat.com> wrote:

> 
> I'll work through your patches tomorrow.

Please see the top 3 patches on this branch:
https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-4.16

This rebased dm-4.16 branch seems to be working well so far.

Mike

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [dm-devel] new patchset to eliminate DM's use of BIOSET_NEED_RESCUER
  2017-11-22 18:24                         ` [dm-devel] " Mikulas Patocka
@ 2017-11-23  5:12                             ` NeilBrown
  2017-11-23  5:12                             ` NeilBrown
  2017-11-23 22:52                             ` NeilBrown
  2 siblings, 0 replies; 53+ messages in thread
From: NeilBrown @ 2017-11-23  5:12 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Mike Snitzer, Jens Axboe, linux-kernel, linux-block,
	device-mapper development, Zdenek Kabelac

[-- Attachment #1: Type: text/plain, Size: 7367 bytes --]

On Wed, Nov 22 2017, Mikulas Patocka wrote:

> On Wed, 22 Nov 2017, NeilBrown wrote:
>
>> On Tue, Nov 21 2017, Mikulas Patocka wrote:
>> 
>> > On Tue, 21 Nov 2017, Mike Snitzer wrote:
>> >
>> >> On Tue, Nov 21 2017 at  4:23pm -0500,
>> >> Mikulas Patocka <mpatocka@redhat.com> wrote:
>> >> 
>> >> > This is not correct:
>> >> > 
>> >> >    2206 static void dm_wq_work(struct work_struct *work)
>> >> >    2207 {
>> >> >    2208         struct mapped_device *md = container_of(work, struct mapped_device, work);
>> >> >    2209         struct bio *bio;
>> >> >    2210         int srcu_idx;
>> >> >    2211         struct dm_table *map;
>> >> >    2212
>> >> >    2213         if (!bio_list_empty(&md->rescued)) {
>> >> >    2214                 struct bio_list list;
>> >> >    2215                 spin_lock_irq(&md->deferred_lock);
>> >> >    2216                 list = md->rescued;
>> >> >    2217                 bio_list_init(&md->rescued);
>> >> >    2218                 spin_unlock_irq(&md->deferred_lock);
>> >> >    2219                 while ((bio = bio_list_pop(&list)))
>> >> >    2220                         generic_make_request(bio);
>> >> >    2221         }
>> >> >    2222
>> >> >    2223         map = dm_get_live_table(md, &srcu_idx);
>> >> >    2224
>> >> >    2225         while (!test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags)) {
>> >> >    2226                 spin_lock_irq(&md->deferred_lock);
>> >> >    2227                 bio = bio_list_pop(&md->deferred);
>> >> >    2228                 spin_unlock_irq(&md->deferred_lock);
>> >> >    2229
>> >> >    2230                 if (!bio)
>> >> >    2231                         break;
>> >> >    2232
>> >> >    2233                 if (dm_request_based(md))
>> >> >    2234                         generic_make_request(bio);
>> >> >    2235                 else
>> >> >    2236                         __split_and_process_bio(md, map, bio);
>> >> >    2237         }
>> >> >    2238
>> >> >    2239         dm_put_live_table(md, srcu_idx);
>> >> >    2240 }
>> >> > 
>> >> > You can see that if we are in dm_wq_work in __split_and_process_bio, we 
>> >> > will not process md->rescued list.
>> >> 
>> >> Can you elaborate further?  We cannot be "in dm_wq_work in
>> >> __split_and_process_bio" simultaneously.  Do you mean as a side-effect
>> >> of scheduling away from __split_and_process_bio?
>> >> 
>> >> The more detail you can share the better.
>> >
>> > Suppose this scenario:
>> >
>> > * dm_wq_work calls __split_and_process_bio
>> > * __split_and_process_bio eventually reaches the function snapshot_map
>> > * snapshot_map attempts to take the snapshot lock
>> >
>> > * the snapshot lock could be released only if some bios submitted by the 
>> > snapshot driver to the underlying device complete
>> > * the bios submitted to the underlying device were already offloaded by 
>> > some other task and they are waiting on the list md->rescued
>> > * the bios waiting on md->rescued are not processed, because dm_wq_work is 
>> > blocked in snapshot_map (called from __split_and_process_bio)
>> 
>> Yes, I think you are right. 
>> 
>> I think the solution is to get rid of the dm_offload() infrastructure
>> and make it not necessary.
>> i.e. discard my patches
>>     dm: prepare to discontinue use of BIOSET_NEED_RESCUER
>> and
>>     dm: revise 'rescue' strategy for bio-based bioset allocations
>> 
>> And build on "dm: ensure bio submission follows a depth-first tree walk"
>> which was written after those and already makes dm_offload() less
>> important.
>> 
>> Since that "depth-first" patch, every request to the dm device, after
>> the initial splitting, allocates just one dm_target_io structure, and
>> makes just one __map_bio() call, and so will behave exactly the way
>> generic_make_request() expects and copes with - thus avoiding awkward
>> dependencies and deadlocks.  Except....
>> 
>> a/ If any target defines ->num_write_bios() to return >1,
>>    __clone_and_map_data_bio() will make multiple calls to alloc_tio()
>>    and __map_bio(), which might need rescuing.
>>    But no target defines num_write_bios, and none have since it was
>>    removed from dm-cache 4.5 years ago.
>>    Can we discard num_write_bios??
>> 
>> b/ If any target sets any of num_{flush,discard,write_same,write_zeroes}_bios
>>    to a value > 1, then __send_duplicate_bios() will also make multiple
>>    calls to alloc_tio() and __map_bio().
>>    Some do.
>>      dm-cache-target:  flush=2
>>      dm-snap: flush=2
>>      dm-stripe: discard, write_same, write_zeroes all set to 'stripes'.
>> 
>> These will only be a problem if the second (or subsequent) alloc_tio()
>> blocks waiting for an earlier allocation to complete.  This will only
>> be a problem if multiple threads are each trying to allocate multiple
>> dm_target_io from the same bioset at the same time.
>> This is rare and should be easier to address than the current
>> dm_offload() approach. 
>> One possibility would be to copy the approach taken by
>> crypt_alloc_buffer() which needs to allocate multiple entries from a
>> mempool.
>> It first tries with GFP_NOWAIT.  If that fails it takes a mutex and
>> tries with GFP_NOIO.  This means only one thread will try to allocate
>> multiple bios at once, so there can be no deadlock.
>> 
>> Below are two RFC patches.  The first removes num_write_bios.
>> The second is incomplete and makes a stab at allocating multiple bios
>> at once safely.
>> A third would be needed to remove dm_offload() etc... but I cannot quite
>> fit that in today - must be off.
>> 
>> Thanks,
>> NeilBrown
>
> Another problem is this:
>
> struct bio *b = bio_clone_bioset(bio, GFP_NOIO, md->queue->bio_split);
> bio_advance(b, (bio_sectors(b) - ci.sector_count) << 9);
> bio_chain(b, bio);
>
> What if it blocks because the bioset is exhausted?
>
> The code basically builds a chain of bios of unlimited length (suppose for 
> example a case when we are splitting on every sector boundary, so there 
> will be one bio for every sector in the original bio), it could exhaust 
> the bioset easily.
>
> It would be better to use mechanism from md-raid that chains all the 
> sub-bios to the same master bio and doesn't create long chains of bios:
>
>         if (max_sectors < bio_sectors(bio)) {
>                 struct bio *split = bio_split(bio, max_sectors,
>                                               gfp, conf->bio_split);
>                 bio_chain(split, bio);
>                 generic_make_request(bio);
>                 bio = split;
>                 r1_bio->master_bio = bio;
>                 r1_bio->sectors = max_sectors;
>         }
>
> Mikulas

Yes, you are right, something like that would be better.
Also send_changing_extent_only allocates bios in a loop which can cause
problems.
I think we need to get __split_and_process_non_flush(), in all its
branches, to check if len is too large for a single request, and if it
is, create a clone for the prefix, attach that to the ci and map it,
advance the original bio, and call generic_make_request on it.
That shouldn't be too hard, but it is a change that would touch a few
places.

I'll see if I can write something.

Thanks,
NeilBrown


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 53+ messages in thread
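
To make the shape of that plan concrete, a rough sketch follows (the function
name is made up and it simply reuses the clone_info/tio helpers visible in the
quoted diff earlier in the thread; this is not NeilBrown's actual follow-up):
clone and map only a prefix of the current extent, leaving the caller to
advance the original bio and resubmit it with generic_make_request().

static int __map_prefix_only(struct clone_info *ci, struct dm_target *ti,
			     unsigned max_len)
{
	unsigned len = min_t(unsigned, ci->sector_count, max_len);
	struct dm_target_io *tio;
	int r;

	tio = alloc_tio(ci, ti, 0);
	tio->len_ptr = &len;
	r = clone_bio(tio, ci->bio, ci->sector, len);	/* prefix only */
	if (r < 0) {
		free_tio(tio);
		return r;
	}
	__map_bio(tio);

	/* The caller resubmits the remainder of the original bio. */
	ci->sector += len;
	ci->sector_count -= len;
	return 0;
}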

* [PATCH] dm: use cloned bio as head, not remainder, in __split_and_process_bio()
  2017-11-22 18:24                         ` [dm-devel] " Mikulas Patocka
@ 2017-11-23 22:52                             ` NeilBrown
  2017-11-23  5:12                             ` NeilBrown
  2017-11-23 22:52                             ` NeilBrown
  2 siblings, 0 replies; 53+ messages in thread
From: NeilBrown @ 2017-11-23 22:52 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Mike Snitzer, Jens Axboe, linux-kernel, linux-block,
	device-mapper development, Zdenek Kabelac

[-- Attachment #1: Type: text/plain, Size: 2559 bytes --]


When we use bio_clone_bioset() to split off the front part of a bio
and chain the two together and submit the remainder to
generic_make_request(), it is important that the newly allocated
bio is used as the head to be processed immediately, and the original
bio gets "bio_advance()"d and sent to generic_make_request() as the
remainder.

If the newly allocated bio is used as the remainder, and if it then
needs to be split again, then the next bio_clone_bioset() call will
be made while holding a reference to a bio (result of the first clone)
from the same bioset.  This can potentially exhaust the bioset mempool
and result in a memory allocation deadlock.

So the result of the bio_clone_bioset() must be attached to the new
dm_io struct, and the original must be resubmitted.  The current code
is backwards.

Note that there is no race caused by reassigning ci.io->bio after already
calling __map_bio().  This bio will only be dereferenced again after
dec_pending() has found io->io_count to be zero, and this cannot happen
before the dec_pending() call at the end of __split_and_process_bio().

Reported-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.com>
---

Hi,
 I think this should resolve the problem Mikulas noticed that the
 bios form a deep chain instead of a wide tree.

Thanks,
NeilBrown

 drivers/md/dm.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 99ec215f7dcb..2e0e10a1c030 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1482,12 +1482,19 @@ static void __split_and_process_bio(struct mapped_device *md,
 				 * Remainder must be passed to generic_make_request()
 				 * so that it gets handled *after* bios already submitted
 				 * have been completely processed.
+				 * We take a clone of the original to store in
+				 * ci.io->bio to be used by end_io_acct() and
+				 * for dec_pending to use for completion handling.
+				 * As this path is not used for REQ_OP_ZONE_REPORT,
+				 * the usage of io->bio in dm_remap_zone_report()
+				 * won't be affected by this reassignment.
 				 */
 				struct bio *b = bio_clone_bioset(bio, GFP_NOIO,
 								 md->queue->bio_split);
-				bio_advance(b, (bio_sectors(b) - ci.sector_count) << 9);
+				ci.io->bio = b;
+				bio_advance(bio, (bio_sectors(bio) - ci.sector_count) << 9);
 				bio_chain(b, bio);
-				generic_make_request(b);
+				generic_make_request(bio);
 				break;
 			}
 		}
-- 
2.14.0.rc0.dirty


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* Re: dm: use cloned bio as head, not remainder, in __split_and_process_bio()
  2017-11-23 22:52                             ` NeilBrown
  (?)
@ 2017-11-27 14:23                             ` Mike Snitzer
  2017-11-28 22:18                                 ` NeilBrown
  -1 siblings, 1 reply; 53+ messages in thread
From: Mike Snitzer @ 2017-11-27 14:23 UTC (permalink / raw)
  To: NeilBrown
  Cc: Mikulas Patocka, Jens Axboe, linux-kernel, linux-block,
	device-mapper development, Zdenek Kabelac

On Thu, Nov 23 2017 at  5:52pm -0500,
NeilBrown <neilb@suse.com> wrote:

> 
> When we use bio_clone_bioset() to split off the front part of a bio
> and chain the two together and submit the remainder to
> generic_make_request(), it is important that the newly allocated
> bio is used as the head to be processed immediately, and the original
> bio gets "bio_advance()"d and sent to generic_make_request() as the
> remainder.
> 
> If the newly allocated bio is used as the remainder, and if it then
> needs to be split again, then the next bio_clone_bioset() call will
> be made while holding a reference to a bio (result of the first clone)
> from the same bioset.  This can potentially exhaust the bioset mempool
> and result in a memory allocation deadlock.
> 
> So the result of the bio_clone_bioset() must be attached to the new
> dm_io struct, and the original must be resubmitted.  The current code
> is backwards.
> 
> Note that there is no race caused by reassigning ci.io->bio after already
> calling __map_bio().  This bio will only be dereferenced again after
> dec_pending() has found io->io_count to be zero, and this cannot happen
> before the dec_pending() call at the end of __split_and_process_bio().
> 
> Reported-by: Mikulas Patocka <mpatocka@redhat.com>
> Signed-off-by: NeilBrown <neilb@suse.com>
> ---
> 
> Hi,
>  I think this should resolve the problem Mikulas noticed that the
>  bios form a deep chain instead of a wide tree.

I'm inclined to just fold this into the original commit.
I'd update that header to make mention of the details captured in this
header.

Would you be OK with that?

Mike

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [dm-devel] dm: use cloned bio as head, not remainder, in __split_and_process_bio()
  2017-11-27 14:23                             ` Mike Snitzer
@ 2017-11-28 22:18                                 ` NeilBrown
  0 siblings, 0 replies; 53+ messages in thread
From: NeilBrown @ 2017-11-28 22:18 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Jens Axboe, linux-kernel, linux-block, device-mapper development,
	Mikulas Patocka, Zdenek Kabelac

[-- Attachment #1: Type: text/plain, Size: 1802 bytes --]

On Mon, Nov 27 2017, Mike Snitzer wrote:

> On Thu, Nov 23 2017 at  5:52pm -0500,
> NeilBrown <neilb@suse.com> wrote:
>
>> 
>> When we use bio_clone_bioset() to split off the front part of a bio
>> and chain the two together and submit the remainder to
>> generic_make_request(), it is important that the newly allocated
>> bio is used as the head to be processed immediately, and the original
>> bio gets "bio_advance()"d and sent to generic_make_request() as the
>> remainder.
>> 
>> If the newly allocated bio is used as the remainder, and if it then
>> needs to be split again, then the next bio_clone_bioset() call will
>> be made while holding a reference to a bio (result of the first clone)
>> from the same bioset.  This can potentially exhaust the bioset mempool
>> and result in a memory allocation deadlock.
>> 
>> So the result of the bio_clone_bioset() must be attached to the new
>> dm_io struct, and the original must be resubmitted.  The current code
>> is backwards.
>> 
>> Note that there is no race caused by reassigning ci.io->bio after already
>> calling __map_bio().  This bio will only be dereferenced again after
>> dec_pending() has found io->io_count to be zero, and this cannot happen
>> before the dec_pending() call at the end of __split_and_process_bio().
>> 
>> Reported-by: Mikulas Patocka <mpatocka@redhat.com>
>> Signed-off-by: NeilBrown <neilb@suse.com>
>> ---
>> 
>> Hi,
>>  I think this should resolve the problem Mikulas noticed that the
>>  bios form a deep chain instead of a wide tree.
>
> I'm inclined to just fold this into the original commit.
> I'd update that header to make mention of the details captured in this
> header.
>
> Would you be OK with that?

Perfectly OK with that.  Thanks for asking.

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 53+ messages in thread

end of thread, other threads:[~2017-11-28 22:18 UTC | newest]

Thread overview: 53+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-06-18  4:38 [PATCH 00/13] block: assorted cleanup for bio splitting and cloning NeilBrown
2017-06-18  4:38 ` [PATCH 03/13] blk: make the bioset rescue_workqueue optional NeilBrown
2017-06-18  4:38 ` [PATCH 04/13] blk: use non-rescuing bioset for q->bio_split NeilBrown
2017-06-18  4:38 ` [PATCH 01/13] blk: remove bio_set arg from blk_queue_split() NeilBrown
2017-06-18  4:38 ` [PATCH 02/13] blk: replace bioset_create_nobvec() with a flags arg to bioset_create() NeilBrown
2017-06-18  4:38 ` [PATCH 06/13] rbd: use bio_clone_fast() instead of bio_clone() NeilBrown
2017-06-18  4:38 ` [PATCH 08/13] pktcdvd: " NeilBrown
2017-06-18  4:38 ` [PATCH 05/13] block: Improvements to bounce-buffer handling NeilBrown
2017-06-18  4:38 ` [PATCH 09/13] lightnvm/pblk-read: use bio_clone_fast() NeilBrown
2017-06-18  4:38 ` [PATCH 07/13] drbd: use bio_clone_fast() instead of bio_clone() NeilBrown
2017-06-18  4:38 ` [PATCH 12/13] block: remove bio_clone() and all references NeilBrown
2017-06-18  4:38 ` [PATCH 11/13] bcache: use kmalloc to allocate bio in bch_data_verify() NeilBrown
2017-06-18  4:38 ` [PATCH 13/13] block: don't check for BIO_MAX_PAGES in blk_bio_segment_split() NeilBrown
2017-06-18  4:38 ` [PATCH 10/13] xen-blkfront: remove bio splitting NeilBrown
2017-06-18 18:41 ` [PATCH 00/13] block: assorted cleanup for bio splitting and cloning Jens Axboe
2017-06-18 21:36   ` NeilBrown
2017-11-20 16:43     ` Mike Snitzer
2017-11-21  0:34       ` [dm-devel] " NeilBrown
2017-11-21  0:34         ` NeilBrown
2017-11-21  0:34         ` NeilBrown
2017-11-21  1:35         ` Mike Snitzer
2017-11-21  1:35           ` Mike Snitzer
2017-11-21 12:10           ` Mike Snitzer
2017-11-21 12:10             ` Mike Snitzer
2017-11-21 12:43             ` Mike Snitzer
2017-11-21 12:43               ` Mike Snitzer
2017-11-21 19:47               ` new patchset to eliminate DM's use of BIOSET_NEED_RESCUER [was: Re: [PATCH 00/13] block: assorted cleanup for bio splitting and cloning.] Mike Snitzer
2017-11-21 21:23                 ` [dm-devel] " Mikulas Patocka
2017-11-21 22:51                   ` new patchset to eliminate DM's use of BIOSET_NEED_RESCUER Mike Snitzer
2017-11-22  1:21                     ` Mikulas Patocka
2017-11-22  2:32                       ` Mike Snitzer
2017-11-22  4:00                       ` [dm-devel] " NeilBrown
2017-11-22  4:00                         ` NeilBrown
2017-11-22  4:00                         ` NeilBrown
2017-11-22  4:28                         ` Mike Snitzer
2017-11-22  4:28                           ` Mike Snitzer
2017-11-22 21:18                           ` Mike Snitzer
2017-11-22 18:24                         ` [dm-devel] " Mikulas Patocka
2017-11-22 18:49                           ` Mike Snitzer
2017-11-23  5:12                           ` [dm-devel] " NeilBrown
2017-11-23  5:12                             ` NeilBrown
2017-11-23 22:52                           ` [PATCH] dm: use cloned bio as head, not remainder, in __split_and_process_bio() NeilBrown
2017-11-23 22:52                             ` NeilBrown
2017-11-27 14:23                             ` Mike Snitzer
2017-11-28 22:18                               ` [dm-devel] " NeilBrown
2017-11-28 22:18                                 ` NeilBrown
2017-11-21 23:03                   ` [dm-devel] new patchset to eliminate DM's use of BIOSET_NEED_RESCUER [was: Re: [PATCH 00/13] block: assorted cleanup for bio splitting and cloning.] NeilBrown
2017-11-21 23:03                     ` NeilBrown
2017-11-21 19:44             ` [dm-devel] [PATCH 00/13] block: assorted cleanup for bio splitting and cloning NeilBrown
2017-11-21 19:44               ` NeilBrown
2017-11-21 19:44               ` NeilBrown
2017-11-21 19:50               ` Mike Snitzer
2017-11-21 19:50                 ` Mike Snitzer
