* [PATCH v3 0/8] dm: add request-based blk-mq support
From: Mike Snitzer @ 2014-12-17  3:59 UTC
  To: dm-devel, Keith Busch; +Cc: axboe, hch, j-nomura, bvanassche

Hi,

Here is v3 of the request-based DM blk-mq support patchset.  I've also
published a git repo here:
https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-for-3.20-blk-mq

I found quite a few issues with v2 for both blk-mq and old
request-based DM.  I've still attributed the original patches from
Keith to him even though I fixed/rewrote significant portions.  Keith,
I'm happy to leave attribution like this unless you'd prefer I change
it.

In general, I'm not in love with the amount of churn in dm.c's
request-based DM code.  BUT I'm also not opposed to these changes if
others have a pressing need for blk-mq support landing for
dm-multipath.  So input on who cares about this work would be
appreciated.  I'd like to get an idea of the kinds of configurations
people are looking to deploy.

These changes have seen basic dm-multipath testing against both blk-mq
(virtio-blk) and traditional SCSI devices.  Error paths are largely
untested, as is dm-multipath's ability to function properly in the
face of heavy IO with concurrent path failures.  But I'm publishing
this work now in the hopes that others with real blk-mq devices can
test more fully.  Keith, it'd be awesome if you could provide the same
test coverage you did for blk-mq testing of your v2.

I'll work with Red Hat's Storage QE to hammer on the old request-based
side.  But with the holiday fast approaching I likely won't have full
test coverage until after the New Year.

Keith Busch (4):
  block: require blk_rq_prep_clone() be given an initialized clone request
  block: add blk-mq support to blk_insert_cloned_request()
  dm: submit stacked requests in irq enabled context
  dm: allocate requests from target when stacking on blk-mq devices

Mike Snitzer (4):
  block: initialize bio member of blk-mq request to NULL
  block: mark blk-mq devices as stackable
  dm: remove exports for request-based interfaces without external callers
  dm: split request structure out from dm_rq_target_io structure

 block/blk-core.c              |   7 +-
 block/blk-mq.c                |   1 +
 drivers/md/dm-mpath.c         |  53 +++++--
 drivers/md/dm-table.c         |  34 ++++-
 drivers/md/dm-target.c        |  15 +-
 drivers/md/dm.c               | 336 +++++++++++++++++++++++++++++-------------
 drivers/md/dm.h               |   8 +-
 include/linux/blkdev.h        |   1 +
 include/linux/device-mapper.h |  10 +-
 9 files changed, 335 insertions(+), 130 deletions(-)

-- 
1.9.3

* [PATCH v3 1/8] block: require blk_rq_prep_clone() be given an initialized clone request
From: Mike Snitzer @ 2014-12-17  3:59 UTC
  To: dm-devel, Keith Busch; +Cc: axboe, hch, j-nomura, bvanassche

From: Keith Busch <keith.busch@intel.com>

Prepare to allow blk_rq_prep_clone() to accept clone requests that were
allocated from blk-mq request queues.  Such clones are already initialized
by the block layer at allocation time, so blk_rq_prep_clone() must no
longer initialize the clone itself; the caller is now responsible for
initializing the clone request before calling blk_rq_prep_clone().
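For illustration, a sketch of the two caller-side patterns this enables
(assuming clone, rq, q, bs, gfp_mask, bio_ctr and data are in scope):

	/* old-style: the caller allocates a bare clone and must init it */
	blk_rq_init(NULL, clone);
	blk_rq_prep_clone(clone, rq, bs, gfp_mask, bio_ctr, data);

	/* blk-mq: a clone from blk_get_request() arrives initialized */
	clone = blk_get_request(q, rq_data_dir(rq), GFP_KERNEL);
	blk_rq_prep_clone(clone, rq, bs, gfp_mask, bio_ctr, data);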

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
 block/blk-core.c | 2 --
 drivers/md/dm.c  | 1 +
 2 files changed, 1 insertion(+), 2 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 93f9152..b794cd99 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2926,8 +2926,6 @@ int blk_rq_prep_clone(struct request *rq, struct request *rq_src,
 	if (!bs)
 		bs = fs_bio_set;
 
-	blk_rq_init(NULL, rq);
-
 	__rq_for_each_bio(bio_src, rq_src) {
 		bio = bio_clone_fast(bio_src, gfp_mask, bs);
 		if (!bio)
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 4c06585..bef5070 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1719,6 +1719,7 @@ static int setup_clone(struct request *clone, struct request *rq,
 {
 	int r;
 
+	blk_rq_init(NULL, clone);
 	r = blk_rq_prep_clone(clone, rq, tio->md->bs, GFP_ATOMIC,
 			      dm_rq_bio_constructor, tio);
 	if (r)
-- 
1.9.3

* [PATCH v3 2/8] block: initialize bio member of blk-mq request to NULL
From: Mike Snitzer @ 2014-12-17  3:59 UTC
  To: dm-devel, Keith Busch; +Cc: axboe, hch, j-nomura, bvanassche

Otherwise blk_rq_prep_clone() will crash when cloning a blk-mq request,
with: BUG: unable to handle kernel paging request at 00001dfa0d00005e
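For reference, the crash comes from the bio chaining in
blk_rq_prep_clone(), which trusts whatever rq->bio happens to contain
(abridged sketch of the existing loop in block/blk-core.c):

	__rq_for_each_bio(bio_src, rq_src) {
		bio = bio_clone_fast(bio_src, gfp_mask, bs);
		if (!bio)
			goto free_and_out;

		if (rq->bio) {			/* stale on a blk-mq request */
			rq->biotail->bi_next = bio;	/* biotail stale too */
			rq->biotail = bio;
		} else
			rq->bio = rq->biotail = bio;
	}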

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
 block/blk-mq.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 6cd94ba..ff09337 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -183,6 +183,7 @@ static void blk_mq_rq_ctx_init(struct request_queue *q, struct blk_mq_ctx *ctx,
 #if defined(CONFIG_BLK_DEV_INTEGRITY)
 	rq->nr_integrity_segments = 0;
 #endif
+	rq->bio = NULL;
 	rq->special = NULL;
 	/* tag was already set */
 	rq->errors = 0;
-- 
1.9.3

* [PATCH v3 3/8] block: add blk-mq support to blk_insert_cloned_request()
From: Mike Snitzer @ 2014-12-17  3:59 UTC
  To: dm-devel, Keith Busch; +Cc: axboe, hch, j-nomura, bvanassche

From: Keith Busch <keith.busch@intel.com>

If the request passed to blk_insert_cloned_request() was allocated by
a blk-mq device it must be submitted using blk_mq_insert_request().

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
 block/blk-core.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/block/blk-core.c b/block/blk-core.c
index b794cd99..7852844 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2029,6 +2029,11 @@ int blk_insert_cloned_request(struct request_queue *q, struct request *rq)
 	    should_fail_request(&rq->rq_disk->part0, blk_rq_bytes(rq)))
 		return -EIO;
 
+	if (q->mq_ops) {
+		blk_mq_insert_request(rq, false, true, true);
+		return 0;
+	}
+
 	spin_lock_irqsave(q->queue_lock, flags);
 	if (unlikely(blk_queue_dying(q))) {
 		spin_unlock_irqrestore(q->queue_lock, flags);
-- 
1.9.3

* [PATCH v3 4/8] block: mark blk-mq devices as stackable
From: Mike Snitzer @ 2014-12-17  4:00 UTC
  To: dm-devel, Keith Busch; +Cc: axboe, hch, j-nomura, bvanassche

Commit 4ee5eaf4 ("block: add a queue flag for request stacking support")
introduced the concept of "STACKABLE", and blk-mq devices fit the
definition: they are request-based, they just establish q->mq_ops rather
than q->request_fn.  So include QUEUE_FLAG_STACKABLE in
QUEUE_FLAG_MQ_DEFAULT.

While not strictly needed (DM _could_ just check for q->mq_ops to assume
the device is request-based), request-based DM support for blk-mq devices
benefits from the ability to consistently check for QUEUE_FLAG_STACKABLE
before allowing a device to be stacked into a request-based DM table.
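For illustration, this is the kind of uniform gating check it enables
during DM table validation (the dm-table.c hunk in the final patch of
this series does exactly this):

	list_for_each_entry(dd, devices, list) {
		struct request_queue *q = bdev_get_queue(dd->dm_dev->bdev);

		/* now covers both q->request_fn and q->mq_ops devices */
		if (!blk_queue_stackable(q))
			return -EINVAL;
	}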

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
 include/linux/blkdev.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index e5f620c..5ff3f5c 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -516,6 +516,7 @@ struct request_queue {
 				 (1 << QUEUE_FLAG_ADD_RANDOM))
 
 #define QUEUE_FLAG_MQ_DEFAULT	((1 << QUEUE_FLAG_IO_STAT) |		\
+				 (1 << QUEUE_FLAG_STACKABLE)	|	\
 				 (1 << QUEUE_FLAG_SAME_COMP))
 
 static inline void queue_lockdep_assert_held(struct request_queue *q)
-- 
1.9.3

* [PATCH v3 5/8] dm: remove exports for request-based interfaces without external callers
From: Mike Snitzer @ 2014-12-17  4:00 UTC
  To: dm-devel, Keith Busch; +Cc: axboe, hch, j-nomura, bvanassche

Remove exports for dm_dispatch_request, dm_requeue_unmapped_request,
and dm_kill_unmapped_request.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
 drivers/md/dm.c               | 9 +++------
 include/linux/device-mapper.h | 3 ---
 2 files changed, 3 insertions(+), 9 deletions(-)

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index bef5070..1c69fcb 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1062,7 +1062,7 @@ static void dm_unprep_request(struct request *rq)
 /*
  * Requeue the original request of a clone.
  */
-void dm_requeue_unmapped_request(struct request *clone)
+static void dm_requeue_unmapped_request(struct request *clone)
 {
 	int rw = rq_data_dir(clone);
 	struct dm_rq_target_io *tio = clone->end_io_data;
@@ -1079,7 +1079,6 @@ void dm_requeue_unmapped_request(struct request *clone)
 
 	rq_completed(md, rw, 0);
 }
-EXPORT_SYMBOL_GPL(dm_requeue_unmapped_request);
 
 static void __stop_queue(struct request_queue *q)
 {
@@ -1177,7 +1176,7 @@ static void dm_complete_request(struct request *clone, int error)
  * Target's rq_end_io() function isn't called.
  * This may be used when the target's map_rq() function fails.
  */
-void dm_kill_unmapped_request(struct request *clone, int error)
+static void dm_kill_unmapped_request(struct request *clone, int error)
 {
 	struct dm_rq_target_io *tio = clone->end_io_data;
 	struct request *rq = tio->orig;
@@ -1185,7 +1184,6 @@ void dm_kill_unmapped_request(struct request *clone, int error)
 	rq->cmd_flags |= REQ_FAILED;
 	dm_complete_request(clone, error);
 }
-EXPORT_SYMBOL_GPL(dm_kill_unmapped_request);
 
 /*
  * Called with the queue lock held
@@ -1686,7 +1684,7 @@ static void dm_request(struct request_queue *q, struct bio *bio)
 		_dm_request(q, bio);
 }
 
-void dm_dispatch_request(struct request *rq)
+static void dm_dispatch_request(struct request *rq)
 {
 	int r;
 
@@ -1698,7 +1696,6 @@ void dm_dispatch_request(struct request *rq)
 	if (r)
 		dm_complete_request(rq, r);
 }
-EXPORT_SYMBOL_GPL(dm_dispatch_request);
 
 static int dm_rq_bio_constructor(struct bio *bio, struct bio *bio_orig,
 				 void *data)
diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
index ca6d2ac..19296fb 100644
--- a/include/linux/device-mapper.h
+++ b/include/linux/device-mapper.h
@@ -600,9 +600,6 @@ static inline unsigned long to_bytes(sector_t n)
 /*-----------------------------------------------------------------
  * Helper for block layer and dm core operations
  *---------------------------------------------------------------*/
-void dm_dispatch_request(struct request *rq);
-void dm_requeue_unmapped_request(struct request *rq);
-void dm_kill_unmapped_request(struct request *rq, int error);
 int dm_underlying_device_busy(struct request_queue *q);
 
 #endif	/* _LINUX_DEVICE_MAPPER_H */
-- 
1.9.3

* [PATCH v3 6/8] dm: split request structure out from dm_rq_target_io structure
From: Mike Snitzer @ 2014-12-17  4:00 UTC
  To: dm-devel, Keith Busch; +Cc: axboe, hch, j-nomura, bvanassche

Request-based DM support for blk-mq devices requires that
dm_rq_target_io structures not be allocated with an embedded request
structure.  The request-based DM target (e.g. dm-multipath) must instead
allocate the request from the blk-mq device's request_queue using
blk_get_request().

The unfortunate side-effect of this change is that old-style
request-based DM no longer uses contiguous memory for each clone's
dm_rq_target_io and request structures.
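A sketch of where the clone comes from once the target allocates it
(this is what dm-mpath does in the final patch of this series; bdev and
rq are assumed in scope):

	/* allocate the clone from the underlying device's queue ... */
	clone = blk_get_request(bdev_get_queue(bdev), rq_data_dir(rq),
				GFP_KERNEL);

	/* ... and hand it back to that queue when DM is done with it */
	blk_put_request(clone);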

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
 drivers/md/dm.c | 70 +++++++++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 61 insertions(+), 9 deletions(-)

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 1c69fcb..ca5eed2 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -78,7 +78,7 @@ struct dm_io {
 struct dm_rq_target_io {
 	struct mapped_device *md;
 	struct dm_target *ti;
-	struct request *orig, clone;
+	struct request *orig, *clone;
 	int error;
 	union map_info info;
 };
@@ -179,6 +179,7 @@ struct mapped_device {
 	 * io objects are allocated from here.
 	 */
 	mempool_t *io_pool;
+	mempool_t *rq_pool;
 
 	struct bio_set *bs;
 
@@ -214,6 +215,7 @@ struct mapped_device {
  */
 struct dm_md_mempools {
 	mempool_t *io_pool;
+	mempool_t *rq_pool;
 	struct bio_set *bs;
 };
 
@@ -228,6 +230,7 @@ struct table_device {
 #define RESERVED_MAX_IOS		1024
 static struct kmem_cache *_io_cache;
 static struct kmem_cache *_rq_tio_cache;
+static struct kmem_cache *_rq_cache;
 
 /*
  * Bio-based DM's mempools' reserved IOs set by the user.
@@ -285,9 +288,14 @@ static int __init local_init(void)
 	if (!_rq_tio_cache)
 		goto out_free_io_cache;
 
+	_rq_cache = kmem_cache_create("dm_clone_request", sizeof(struct request),
+				      __alignof__(struct request), 0, NULL);
+	if (!_rq_cache)
+		goto out_free_rq_tio_cache;
+
 	r = dm_uevent_init();
 	if (r)
-		goto out_free_rq_tio_cache;
+		goto out_free_rq_cache;
 
 	deferred_remove_workqueue = alloc_workqueue("kdmremove", WQ_UNBOUND, 1);
 	if (!deferred_remove_workqueue) {
@@ -309,6 +317,8 @@ out_free_workqueue:
 	destroy_workqueue(deferred_remove_workqueue);
 out_uevent_exit:
 	dm_uevent_exit();
+out_free_rq_cache:
+	kmem_cache_destroy(_rq_cache);
 out_free_rq_tio_cache:
 	kmem_cache_destroy(_rq_tio_cache);
 out_free_io_cache:
@@ -322,6 +332,7 @@ static void local_exit(void)
 	flush_scheduled_work();
 	destroy_workqueue(deferred_remove_workqueue);
 
+	kmem_cache_destroy(_rq_cache);
 	kmem_cache_destroy(_rq_tio_cache);
 	kmem_cache_destroy(_io_cache);
 	unregister_blkdev(_major, _name);
@@ -574,6 +585,17 @@ static void free_rq_tio(struct dm_rq_target_io *tio)
 	mempool_free(tio, tio->md->io_pool);
 }
 
+static struct request *alloc_clone_request(struct mapped_device *md,
+					   gfp_t gfp_mask)
+{
+	return mempool_alloc(md->rq_pool, gfp_mask);
+}
+
+static void free_clone_request(struct mapped_device *md, struct request *rq)
+{
+	mempool_free(rq, md->rq_pool);
+}
+
 static int md_in_flight(struct mapped_device *md)
 {
 	return atomic_read(&md->pending[READ]) +
@@ -1017,6 +1039,7 @@ static void free_rq_clone(struct request *clone)
 	struct dm_rq_target_io *tio = clone->end_io_data;
 
 	blk_rq_unprep_clone(clone);
+	free_clone_request(tio->md, clone);
 	free_rq_tio(tio);
 }
 
@@ -1712,12 +1735,11 @@ static int dm_rq_bio_constructor(struct bio *bio, struct bio *bio_orig,
 }
 
 static int setup_clone(struct request *clone, struct request *rq,
-		       struct dm_rq_target_io *tio)
+		       struct dm_rq_target_io *tio, gfp_t gfp_mask)
 {
 	int r;
 
-	blk_rq_init(NULL, clone);
-	r = blk_rq_prep_clone(clone, rq, tio->md->bs, GFP_ATOMIC,
+	r = blk_rq_prep_clone(clone, rq, tio->md->bs, gfp_mask,
 			      dm_rq_bio_constructor, tio);
 	if (r)
 		return r;
@@ -1728,9 +1750,29 @@ static int setup_clone(struct request *clone, struct request *rq,
 	clone->end_io = end_clone_request;
 	clone->end_io_data = tio;
 
+	tio->clone = clone;
+
 	return 0;
 }
 
+static struct request *__clone_rq(struct request *rq, struct mapped_device *md,
+				  struct dm_rq_target_io *tio, gfp_t gfp_mask)
+{
+	struct request *clone = alloc_clone_request(md, gfp_mask);
+
+	if (!clone)
+		return NULL;
+
+	blk_rq_init(NULL, clone);
+	if (setup_clone(clone, rq, tio, gfp_mask)) {
+		/* -ENOMEM */
+		free_clone_request(md, clone);
+		return NULL;
+	}
+
+	return clone;
+}
+
 static struct request *clone_rq(struct request *rq, struct mapped_device *md,
 				gfp_t gfp_mask)
 {
@@ -1743,13 +1785,13 @@ static struct request *clone_rq(struct request *rq, struct mapped_device *md,
 
 	tio->md = md;
 	tio->ti = NULL;
+	tio->clone = NULL;
 	tio->orig = rq;
 	tio->error = 0;
 	memset(&tio->info, 0, sizeof(tio->info));
 
-	clone = &tio->clone;
-	if (setup_clone(clone, rq, tio)) {
-		/* -ENOMEM */
+	clone = __clone_rq(rq, md, tio, GFP_ATOMIC);
+	if (!clone) {
 		free_rq_tio(tio);
 		return NULL;
 	}
@@ -2149,6 +2191,8 @@ static void free_dev(struct mapped_device *md)
 	destroy_workqueue(md->wq);
 	if (md->io_pool)
 		mempool_destroy(md->io_pool);
+	if (md->rq_pool)
+		mempool_destroy(md->rq_pool);
 	if (md->bs)
 		bioset_free(md->bs);
 	blk_integrity_unregister(md->disk);
@@ -2195,10 +2239,12 @@ static void __bind_mempools(struct mapped_device *md, struct dm_table *t)
 		goto out;
 	}
 
-	BUG_ON(!p || md->io_pool || md->bs);
+	BUG_ON(!p || md->io_pool || md->rq_pool || md->bs);
 
 	md->io_pool = p->io_pool;
 	p->io_pool = NULL;
+	md->rq_pool = p->rq_pool;
+	p->rq_pool = NULL;
 	md->bs = p->bs;
 	p->bs = NULL;
 
@@ -3129,6 +3175,9 @@ struct dm_md_mempools *dm_alloc_md_mempools(unsigned type, unsigned integrity, u
 	} else if (type == DM_TYPE_REQUEST_BASED) {
 		cachep = _rq_tio_cache;
 		pool_size = dm_get_reserved_rq_based_ios();
+		pools->rq_pool = mempool_create_slab_pool(pool_size, _rq_cache);
+		if (!pools->rq_pool)
+			goto out;
 		front_pad = offsetof(struct dm_rq_clone_bio_info, clone);
 		/* per_bio_data_size is not used. See __bind_mempools(). */
 		WARN_ON(per_bio_data_size != 0);
@@ -3162,6 +3211,9 @@ void dm_free_md_mempools(struct dm_md_mempools *pools)
 	if (pools->io_pool)
 		mempool_destroy(pools->io_pool);
 
+	if (pools->rq_pool)
+		mempool_destroy(pools->rq_pool);
+
 	if (pools->bs)
 		bioset_free(pools->bs);
 
-- 
1.9.3

* [PATCH v3 7/8] dm: submit stacked requests in irq enabled context
From: Mike Snitzer @ 2014-12-17  4:00 UTC
  To: dm-devel, Keith Busch; +Cc: axboe, hch, j-nomura, bvanassche

From: Keith Busch <keith.busch@intel.com>

Switch to having request-based DM enqueue all prep'ed requests into work
processed by another thread.  This allows request-based DM to invoke
block APIs that assume interrupt-enabled context (e.g. blk_get_request)
and is a prerequisite for adding blk-mq support to request-based DM.

The new kernel thread is only initialized for request-based DM devices.
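In outline, the kthread_worker pattern used here (matching the hunks
below):

	/* once, when the request-based queue is initialized */
	init_kthread_worker(&md->kworker);
	md->kworker_task = kthread_run(kthread_worker_fn, &md->kworker,
				       "kdmwork-%s", dm_device_name(md));

	/* per request: work item set at prep, queued from dm_request_fn() */
	init_kthread_work(&tio->work, map_tio_request);
	queue_kthread_work(&md->kworker, &tio->work);

	/* on suspend/destroy: drain outstanding work, then stop the thread */
	flush_kthread_worker(&md->kworker);
	kthread_stop(md->kworker_task);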

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
 drivers/md/dm.c | 45 +++++++++++++++++++++++++++++++++++----------
 1 file changed, 35 insertions(+), 10 deletions(-)

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index ca5eed2..00c9986 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -20,6 +20,7 @@
 #include <linux/hdreg.h>
 #include <linux/delay.h>
 #include <linux/wait.h>
+#include <linux/kthread.h>
 
 #include <trace/events/block.h>
 
@@ -79,6 +80,7 @@ struct dm_rq_target_io {
 	struct mapped_device *md;
 	struct dm_target *ti;
 	struct request *orig, *clone;
+	struct kthread_work work;
 	int error;
 	union map_info info;
 };
@@ -208,6 +210,9 @@ struct mapped_device {
 	struct bio flush_bio;
 
 	struct dm_stats stats;
+
+	struct kthread_worker kworker;
+	struct task_struct *kworker_task;
 };
 
 /*
@@ -1773,6 +1778,8 @@ static struct request *__clone_rq(struct request *rq, struct mapped_device *md,
 	return clone;
 }
 
+static void map_tio_request(struct kthread_work *work);
+
 static struct request *clone_rq(struct request *rq, struct mapped_device *md,
 				gfp_t gfp_mask)
 {
@@ -1789,6 +1796,7 @@ static struct request *clone_rq(struct request *rq, struct mapped_device *md,
 	tio->orig = rq;
 	tio->error = 0;
 	memset(&tio->info, 0, sizeof(tio->info));
+	init_kthread_work(&tio->work, map_tio_request);
 
 	clone = __clone_rq(rq, md, tio, GFP_ATOMIC);
 	if (!clone) {
@@ -1864,6 +1872,13 @@ static int map_request(struct dm_target *ti, struct request *clone,
 	return requeued;
 }
 
+static void map_tio_request(struct kthread_work *work)
+{
+	struct dm_rq_target_io *tio = container_of(work, struct dm_rq_target_io, work);
+
+	map_request(tio->ti, tio->clone, tio->md);
+}
+
 static struct request *dm_start_request(struct mapped_device *md, struct request *orig)
 {
 	struct request *clone;
@@ -1895,6 +1910,7 @@ static void dm_request_fn(struct request_queue *q)
 	struct dm_table *map = dm_get_live_table(md, &srcu_idx);
 	struct dm_target *ti;
 	struct request *rq, *clone;
+	struct dm_rq_target_io *tio;
 	sector_t pos;
 
 	/*
@@ -1930,20 +1946,15 @@ static void dm_request_fn(struct request_queue *q)
 
 		clone = dm_start_request(md, rq);
 
-		spin_unlock(q->queue_lock);
-		if (map_request(ti, clone, md))
-			goto requeued;
-
+		tio = rq->special;
+		/* Establish tio->ti before queuing work (map_tio_request) */
+		tio->ti = ti;
+		queue_kthread_work(&md->kworker, &tio->work);
 		BUG_ON(!irqs_disabled());
-		spin_lock(q->queue_lock);
 	}
 
 	goto out;
 
-requeued:
-	BUG_ON(!irqs_disabled());
-	spin_lock(q->queue_lock);
-
 delay_and_out:
 	blk_delay_queue(q, HZ / 10);
 out:
@@ -2129,6 +2140,7 @@ static struct mapped_device *alloc_dev(int minor)
 	INIT_WORK(&md->work, dm_wq_work);
 	init_waitqueue_head(&md->eventq);
 	init_completion(&md->kobj_holder.completion);
+	md->kworker_task = NULL;
 
 	md->disk->major = _major;
 	md->disk->first_minor = minor;
@@ -2189,6 +2201,9 @@ static void free_dev(struct mapped_device *md)
 	unlock_fs(md);
 	bdput(md->bdev);
 	destroy_workqueue(md->wq);
+
+	if (md->kworker_task)
+		kthread_stop(md->kworker_task);
 	if (md->io_pool)
 		mempool_destroy(md->io_pool);
 	if (md->rq_pool)
@@ -2484,6 +2499,11 @@ static int dm_init_request_based_queue(struct mapped_device *md)
 	blk_queue_prep_rq(md->queue, dm_prep_fn);
 	blk_queue_lld_busy(md->queue, dm_lld_busy);
 
+	/* Also initialize the request-based DM worker thread */
+	init_kthread_worker(&md->kworker);
+	md->kworker_task = kthread_run(kthread_worker_fn, &md->kworker,
+				       "kdmwork-%s", dm_device_name(md));
+
 	elv_register_queue(md->queue);
 
 	return 1;
@@ -2574,6 +2594,9 @@ static void __dm_destroy(struct mapped_device *md, bool wait)
 	set_bit(DMF_FREEING, &md->flags);
 	spin_unlock(&_minor_lock);
 
+	if (dm_request_based(md))
+		flush_kthread_worker(&md->kworker);
+
 	if (!dm_suspended_md(md)) {
 		dm_table_presuspend_targets(map);
 		dm_table_postsuspend_targets(map);
@@ -2817,8 +2840,10 @@ static int __dm_suspend(struct mapped_device *md, struct dm_table *map,
 	 * Stop md->queue before flushing md->wq in case request-based
 	 * dm defers requests to md->wq from md->queue.
 	 */
-	if (dm_request_based(md))
+	if (dm_request_based(md)) {
 		stop_queue(md->queue);
+		flush_kthread_worker(&md->kworker);
+	}
 
 	flush_workqueue(md->wq);
 
-- 
1.9.3

* [PATCH v3 8/8] dm: allocate requests from target when stacking on blk-mq devices
From: Mike Snitzer @ 2014-12-17  4:00 UTC
  To: dm-devel, Keith Busch; +Cc: axboe, hch, j-nomura, bvanassche

From: Keith Busch <keith.busch@intel.com>

For blk-mq request-based DM the responsibility of allocating a cloned
request is transferred from DM core to the target type so that the cloned
request is allocated from the appropriate request_queue's pool and
initialized for the target block device.  The original request's
'special' now points to the dm_rq_target_io because the clone is
allocated later in the block layer rather than in DM core.

Care was taken to preserve compatibility with old-style block request
completion that requires request-based DM _not_ acquire the clone
request's queue lock in the completion path.  As such, there are now 2
different request-based dm_target interfaces:
1) the original .map_rq() interface will continue to be used for
   non-blk-mq devices -- the preallocated clone request is passed in
   from DM core.
2) a new .clone_and_map_rq() and .release_clone_rq() will be used for
   blk-mq devices -- blk_get_request() and blk_put_request() are used
   respectively from these hooks.
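In sketch form, the new hook pair amounts to the following (modeled on
the dm-mpath implementation below; the "example_" names and the
example_choose_queue() helper are hypothetical placeholders, since how
the underlying queue is chosen is target-specific):

	static int example_clone_and_map_rq(struct dm_target *ti,
					    struct request *rq,
					    union map_info *map_context,
					    struct request **clone)
	{
		/* hypothetical helper: pick the underlying device's queue */
		struct request_queue *q = example_choose_queue(ti);

		*clone = blk_get_request(q, rq_data_dir(rq), GFP_KERNEL);
		if (IS_ERR(*clone))
			return DM_MAPIO_REQUEUE;	/* e.g. ENOMEM, retry */
		return DM_MAPIO_REMAPPED;
	}

	static void example_release_clone_rq(struct request *clone)
	{
		blk_put_request(clone);
	}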

dm_table_set_type() was updated to detect whether the request-based
target is being stacked on blk-mq devices; if so, DM_TYPE_MQ_REQUEST_BASED
is set.  DM core disallows switching the DM table's type after it is set,
so non-blk-mq and blk-mq devices cannot be mixed within the same
request-based DM table.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
 drivers/md/dm-mpath.c         |  53 ++++++++--
 drivers/md/dm-table.c         |  34 +++++-
 drivers/md/dm-target.c        |  15 ++-
 drivers/md/dm.c               | 233 +++++++++++++++++++++++++-----------------
 drivers/md/dm.h               |   8 +-
 include/linux/device-mapper.h |   7 ++
 6 files changed, 239 insertions(+), 111 deletions(-)

diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
index 7b6b0f0..df408bc 100644
--- a/drivers/md/dm-mpath.c
+++ b/drivers/md/dm-mpath.c
@@ -11,6 +11,7 @@
 #include "dm-path-selector.h"
 #include "dm-uevent.h"
 
+#include <linux/blkdev.h>
 #include <linux/ctype.h>
 #include <linux/init.h>
 #include <linux/mempool.h>
@@ -378,18 +379,18 @@ static int __must_push_back(struct multipath *m)
 /*
  * Map cloned requests
  */
-static int multipath_map(struct dm_target *ti, struct request *clone,
-			 union map_info *map_context)
+static int __multipath_map(struct dm_target *ti, struct request *clone,
+			   union map_info *map_context,
+			   struct request *rq, struct request **__clone)
 {
 	struct multipath *m = (struct multipath *) ti->private;
 	int r = DM_MAPIO_REQUEUE;
-	size_t nr_bytes = blk_rq_bytes(clone);
-	unsigned long flags;
+	size_t nr_bytes = clone ? blk_rq_bytes(clone) : blk_rq_bytes(rq);
 	struct pgpath *pgpath;
 	struct block_device *bdev;
 	struct dm_mpath_io *mpio;
 
-	spin_lock_irqsave(&m->lock, flags);
+	spin_lock(&m->lock);
 
 	/* Do we need to select a new pgpath? */
 	if (!m->current_pgpath ||
@@ -412,9 +413,21 @@ static int multipath_map(struct dm_target *ti, struct request *clone,
 		goto out_unlock;
 
 	bdev = pgpath->path.dev->bdev;
-	clone->q = bdev_get_queue(bdev);
-	clone->rq_disk = bdev->bd_disk;
-	clone->cmd_flags |= REQ_FAILFAST_TRANSPORT;
+
+	if (clone) {
+		/* Old request-based interface: allocated clone is passed in */
+		clone->q = bdev_get_queue(bdev);
+		clone->rq_disk = bdev->bd_disk;
+		clone->cmd_flags |= REQ_FAILFAST_TRANSPORT;
+	} else {
+		/* blk-mq request-based interface */
+		*__clone = blk_get_request(bdev_get_queue(bdev),
+					   rq_data_dir(rq), GFP_KERNEL);
+		if (IS_ERR(*__clone))
+			goto out_unlock;
+		(*__clone)->cmd_flags |= REQ_FAILFAST_TRANSPORT;
+	}
+
 	mpio = map_context->ptr;
 	mpio->pgpath = pgpath;
 	mpio->nr_bytes = nr_bytes;
@@ -425,11 +438,29 @@ static int multipath_map(struct dm_target *ti, struct request *clone,
 	r = DM_MAPIO_REMAPPED;
 
 out_unlock:
-	spin_unlock_irqrestore(&m->lock, flags);
+	spin_unlock(&m->lock);
 
 	return r;
 }
 
+static int multipath_map(struct dm_target *ti, struct request *clone,
+			 union map_info *map_context)
+{
+	return __multipath_map(ti, clone, map_context, NULL, NULL);
+}
+
+static int multipath_clone_and_map(struct dm_target *ti, struct request *rq,
+				   union map_info *map_context,
+				   struct request **clone)
+{
+	return __multipath_map(ti, NULL, map_context, rq, clone);
+}
+
+static void multipath_release_clone(struct request *clone)
+{
+	blk_put_request(clone);
+}
+
 /*
  * If we run out of usable paths, should we queue I/O or error it?
  */
@@ -1666,11 +1697,13 @@ out:
  *---------------------------------------------------------------*/
 static struct target_type multipath_target = {
 	.name = "multipath",
-	.version = {1, 7, 0},
+	.version = {1, 8, 0},
 	.module = THIS_MODULE,
 	.ctr = multipath_ctr,
 	.dtr = multipath_dtr,
 	.map_rq = multipath_map,
+	.clone_and_map_rq = multipath_clone_and_map,
+	.release_clone_rq = multipath_release_clone,
 	.rq_end_io = multipath_end_io,
 	.presuspend = multipath_presuspend,
 	.postsuspend = multipath_postsuspend,
diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 3afae9e..2d7e373 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -827,6 +827,7 @@ static int dm_table_set_type(struct dm_table *t)
 {
 	unsigned i;
 	unsigned bio_based = 0, request_based = 0, hybrid = 0;
+	bool use_blk_mq = false;
 	struct dm_target *tgt;
 	struct dm_dev_internal *dd;
 	struct list_head *devices;
@@ -872,11 +873,26 @@ static int dm_table_set_type(struct dm_table *t)
 	/* Non-request-stackable devices can't be used for request-based dm */
 	devices = dm_table_get_devices(t);
 	list_for_each_entry(dd, devices, list) {
-		if (!blk_queue_stackable(bdev_get_queue(dd->dm_dev->bdev))) {
-			DMWARN("table load rejected: including"
-			       " non-request-stackable devices");
+		struct request_queue *q = bdev_get_queue(dd->dm_dev->bdev);
+
+		if (!blk_queue_stackable(q)) {
+			DMERR("table load rejected: including"
+			      " non-request-stackable devices");
 			return -EINVAL;
 		}
+
+		if (q->mq_ops)
+			use_blk_mq = true;
+	}
+
+	if (use_blk_mq) {
+		/* verify _all_ devices in the table are blk-mq devices */
+		list_for_each_entry(dd, devices, list)
+			if (!bdev_get_queue(dd->dm_dev->bdev)->mq_ops) {
+				DMERR("table load rejected: not all devices"
+				      " are blk-mq request-stackable");
+				return -EINVAL;
+			}
 	}
 
 	/*
@@ -890,7 +906,7 @@ static int dm_table_set_type(struct dm_table *t)
 		return -EINVAL;
 	}
 
-	t->type = DM_TYPE_REQUEST_BASED;
+	t->type = !use_blk_mq ? DM_TYPE_REQUEST_BASED : DM_TYPE_MQ_REQUEST_BASED;
 
 	return 0;
 }
@@ -907,7 +923,15 @@ struct target_type *dm_table_get_immutable_target_type(struct dm_table *t)
 
 bool dm_table_request_based(struct dm_table *t)
 {
-	return dm_table_get_type(t) == DM_TYPE_REQUEST_BASED;
+	unsigned table_type = dm_table_get_type(t);
+
+	return (table_type == DM_TYPE_REQUEST_BASED ||
+		table_type == DM_TYPE_MQ_REQUEST_BASED);
+}
+
+bool dm_table_mq_request_based(struct dm_table *t)
+{
+	return dm_table_get_type(t) == DM_TYPE_MQ_REQUEST_BASED;
 }
 
 static int dm_table_alloc_md_mempools(struct dm_table *t)
diff --git a/drivers/md/dm-target.c b/drivers/md/dm-target.c
index 242e3ce..925ec1b 100644
--- a/drivers/md/dm-target.c
+++ b/drivers/md/dm-target.c
@@ -137,13 +137,26 @@ static int io_err_map_rq(struct dm_target *ti, struct request *clone,
 	return -EIO;
 }
 
+static int io_err_clone_and_map_rq(struct dm_target *ti, struct request *rq,
+				   union map_info *map_context,
+				   struct request **clone)
+{
+	return -EIO;
+}
+
+static void io_err_release_clone_rq(struct request *clone)
+{
+}
+
 static struct target_type error_target = {
 	.name = "error",
-	.version = {1, 2, 0},
+	.version = {1, 3, 0},
 	.ctr  = io_err_ctr,
 	.dtr  = io_err_dtr,
 	.map  = io_err_map,
 	.map_rq = io_err_map_rq,
+	.clone_and_map_rq = io_err_clone_and_map_rq,
+	.release_clone_rq = io_err_release_clone_rq,
 };
 
 int __init dm_target_init(void)
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 00c9986..1955710 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1016,7 +1016,7 @@ static void end_clone_bio(struct bio *clone, int error)
  * the md may be freed in dm_put() at the end of this function.
  * Or do dm_get() before calling this function and dm_put() later.
  */
-static void rq_completed(struct mapped_device *md, int rw, int run_queue)
+static void rq_completed(struct mapped_device *md, int rw, bool run_queue)
 {
 	atomic_dec(&md->pending[rw]);
 
@@ -1044,13 +1044,17 @@ static void free_rq_clone(struct request *clone)
 	struct dm_rq_target_io *tio = clone->end_io_data;
 
 	blk_rq_unprep_clone(clone);
-	free_clone_request(tio->md, clone);
+	if (clone->q->mq_ops)
+		tio->ti->type->release_clone_rq(clone);
+	else
+		free_clone_request(tio->md, clone);
 	free_rq_tio(tio);
 }
 
 /*
  * Complete the clone and the original request.
- * Must be called without queue lock.
+ * Must be called without clone's queue lock held,
+ * see end_clone_request() for more details.
  */
 static void dm_end_request(struct request *clone, int error)
 {
@@ -1079,23 +1083,23 @@ static void dm_end_request(struct request *clone, int error)
 
 static void dm_unprep_request(struct request *rq)
 {
-	struct request *clone = rq->special;
+	struct dm_rq_target_io *tio = rq->special;
+	struct request *clone = tio->clone;
 
 	rq->special = NULL;
 	rq->cmd_flags &= ~REQ_DONTPREP;
 
-	free_rq_clone(clone);
+	if (clone)
+		free_rq_clone(clone);
 }
 
 /*
  * Requeue the original request of a clone.
  */
-static void dm_requeue_unmapped_request(struct request *clone)
+static void dm_requeue_unmapped_original_request(struct mapped_device *md,
+						 struct request *rq)
 {
-	int rw = rq_data_dir(clone);
-	struct dm_rq_target_io *tio = clone->end_io_data;
-	struct mapped_device *md = tio->md;
-	struct request *rq = tio->orig;
+	int rw = rq_data_dir(rq);
 	struct request_queue *q = rq->q;
 	unsigned long flags;
 
@@ -1105,7 +1109,14 @@ static void dm_requeue_unmapped_request(struct request *clone)
 	blk_requeue_request(q, rq);
 	spin_unlock_irqrestore(q->queue_lock, flags);
 
-	rq_completed(md, rw, 0);
+	rq_completed(md, rw, false);
+}
+
+static void dm_requeue_unmapped_request(struct request *clone)
+{
+	struct dm_rq_target_io *tio = clone->end_io_data;
+
+	dm_requeue_unmapped_original_request(tio->md, tio->orig);
 }
 
 static void __stop_queue(struct request_queue *q)
@@ -1175,8 +1186,15 @@ static void dm_done(struct request *clone, int error, bool mapped)
 static void dm_softirq_done(struct request *rq)
 {
 	bool mapped = true;
-	struct request *clone = rq->completion_data;
-	struct dm_rq_target_io *tio = clone->end_io_data;
+	struct dm_rq_target_io *tio = rq->special;
+	struct request *clone = tio->clone;
+
+	if (!clone) {
+		blk_end_request_all(rq, tio->error);
+		rq_completed(tio->md, rq_data_dir(rq), false);
+		free_rq_tio(tio);
+		return;
+	}
 
 	if (rq->cmd_flags & REQ_FAILED)
 		mapped = false;
@@ -1188,13 +1206,11 @@ static void dm_softirq_done(struct request *rq)
  * Complete the clone and the original request with the error status
  * through softirq context.
  */
-static void dm_complete_request(struct request *clone, int error)
+static void dm_complete_request(struct request *rq, int error)
 {
-	struct dm_rq_target_io *tio = clone->end_io_data;
-	struct request *rq = tio->orig;
+	struct dm_rq_target_io *tio = rq->special;
 
 	tio->error = error;
-	rq->completion_data = clone;
 	blk_complete_request(rq);
 }
 
@@ -1202,39 +1218,40 @@ static void dm_complete_request(struct request *clone, int error)
  * Complete the not-mapped clone and the original request with the error status
  * through softirq context.
  * Target's rq_end_io() function isn't called.
- * This may be used when the target's map_rq() function fails.
+ * This may be used when the target's map_rq() or clone_and_map_rq() functions fail.
  */
-static void dm_kill_unmapped_request(struct request *clone, int error)
+static void dm_kill_unmapped_request(struct request *rq, int error)
 {
-	struct dm_rq_target_io *tio = clone->end_io_data;
-	struct request *rq = tio->orig;
-
 	rq->cmd_flags |= REQ_FAILED;
-	dm_complete_request(clone, error);
+	dm_complete_request(rq, error);
 }
 
 /*
- * Called with the queue lock held
+ * Called with the clone's queue lock held
  */
 static void end_clone_request(struct request *clone, int error)
 {
-	/*
-	 * For just cleaning up the information of the queue in which
-	 * the clone was dispatched.
-	 * The clone is *NOT* freed actually here because it is alloced from
-	 * dm own mempool and REQ_ALLOCED isn't set in clone->cmd_flags.
-	 */
-	__blk_put_request(clone->q, clone);
+	struct dm_rq_target_io *tio = clone->end_io_data;
+
+	if (!clone->q->mq_ops) {
+		/*
+		 * For just cleaning up the information of the queue in which
+		 * the clone was dispatched.
+		 * The clone is *NOT* freed actually here because it is alloced
+		 * from dm own mempool (REQ_ALLOCED isn't set).
+		 */
+		__blk_put_request(clone->q, clone);
+	}
 
 	/*
 	 * Actual request completion is done in a softirq context which doesn't
-	 * hold the queue lock.  Otherwise, deadlock could occur because:
+	 * hold the clone's queue lock.  Otherwise, deadlock could occur because:
 	 *     - another request may be submitted by the upper level driver
 	 *       of the stacking during the completion
 	 *     - the submission which requires queue lock may be done
-	 *       against this queue
+	 *       against this clone's queue
 	 */
-	dm_complete_request(clone, error);
+	dm_complete_request(tio->orig, error);
 }
 
 /*
@@ -1712,16 +1729,17 @@ static void dm_request(struct request_queue *q, struct bio *bio)
 		_dm_request(q, bio);
 }
 
-static void dm_dispatch_request(struct request *rq)
+static void dm_dispatch_clone_request(struct request *clone, struct request *rq)
 {
 	int r;
 
-	if (blk_queue_io_stat(rq->q))
-		rq->cmd_flags |= REQ_IO_STAT;
+	if (blk_queue_io_stat(clone->q))
+		clone->cmd_flags |= REQ_IO_STAT;
 
-	rq->start_time = jiffies;
-	r = blk_insert_cloned_request(rq->q, rq);
+	clone->start_time = jiffies;
+	r = blk_insert_cloned_request(clone->q, clone);
 	if (r)
+		/* must complete clone in terms of original request */
 		dm_complete_request(rq, r);
 }
 
@@ -1760,8 +1778,8 @@ static int setup_clone(struct request *clone, struct request *rq,
 	return 0;
 }
 
-static struct request *__clone_rq(struct request *rq, struct mapped_device *md,
-				  struct dm_rq_target_io *tio, gfp_t gfp_mask)
+static struct request *clone_rq(struct request *rq, struct mapped_device *md,
+				struct dm_rq_target_io *tio, gfp_t gfp_mask)
 {
 	struct request *clone = alloc_clone_request(md, gfp_mask);
 
@@ -1780,11 +1798,12 @@ static struct request *__clone_rq(struct request *rq, struct mapped_device *md,
 
 static void map_tio_request(struct kthread_work *work);
 
-static struct request *clone_rq(struct request *rq, struct mapped_device *md,
-				gfp_t gfp_mask)
+static struct dm_rq_target_io *prep_tio(struct request *rq,
+					struct mapped_device *md, gfp_t gfp_mask)
 {
-	struct request *clone;
 	struct dm_rq_target_io *tio;
+	int srcu_idx;
+	struct dm_table *table;
 
 	tio = alloc_rq_tio(md, gfp_mask);
 	if (!tio)
@@ -1798,13 +1817,17 @@ static struct request *clone_rq(struct request *rq, struct mapped_device *md,
 	memset(&tio->info, 0, sizeof(tio->info));
 	init_kthread_work(&tio->work, map_tio_request);
 
-	clone = __clone_rq(rq, md, tio, GFP_ATOMIC);
-	if (!clone) {
-		free_rq_tio(tio);
-		return NULL;
+	table = dm_get_live_table(md, &srcu_idx);
+	if (!dm_table_mq_request_based(table)) {
+		if (!clone_rq(rq, md, tio, GFP_ATOMIC)) {
+			dm_put_live_table(md, srcu_idx);
+			free_rq_tio(tio);
+			return NULL;
+		}
 	}
+	dm_put_live_table(md, srcu_idx);
 
-	return clone;
+	return tio;
 }
 
 /*
@@ -1813,18 +1836,18 @@ static struct request *clone_rq(struct request *rq, struct mapped_device *md,
 static int dm_prep_fn(struct request_queue *q, struct request *rq)
 {
 	struct mapped_device *md = q->queuedata;
-	struct request *clone;
+	struct dm_rq_target_io *tio;
 
 	if (unlikely(rq->special)) {
 		DMWARN("Already has something in rq->special.");
 		return BLKPREP_KILL;
 	}
 
-	clone = clone_rq(rq, md, GFP_ATOMIC);
-	if (!clone)
+	tio = prep_tio(rq, md, GFP_ATOMIC);
+	if (!tio)
 		return BLKPREP_DEFER;
 
-	rq->special = clone;
+	rq->special = tio;
 	rq->cmd_flags |= REQ_DONTPREP;
 
 	return BLKPREP_OK;
@@ -1832,17 +1855,31 @@ static int dm_prep_fn(struct request_queue *q, struct request *rq)
 
 /*
  * Returns:
- * 0  : the request has been processed (not requeued)
- * !0 : the request has been requeued
+ * 0   : the request has been processed (not requeued)
+ * 1   : the request has been requeued
+ * < 0 : the original request needs to be requeued
  */
-static int map_request(struct dm_target *ti, struct request *clone,
+static int map_request(struct dm_target *ti, struct request *rq,
 		       struct mapped_device *md)
 {
-	int r, requeued = 0;
-	struct dm_rq_target_io *tio = clone->end_io_data;
+	struct request *clone = NULL;
+	int r, r2, requeued = 0;
+	struct dm_rq_target_io *tio = rq->special;
+
+	if (tio->clone) {
+		clone = tio->clone;
+		r = ti->type->map_rq(ti, clone, &tio->info);
+	} else {
+		r = ti->type->clone_and_map_rq(ti, rq, &tio->info, &clone);
+		if (IS_ERR(clone))
+			return PTR_ERR(clone);
+		r2 = setup_clone(clone, rq, tio, GFP_KERNEL);
+		if (r2) {
+			ti->type->release_clone_rq(clone);
+			return r2;
+		}
+	}
 
-	tio->ti = ti;
-	r = ti->type->map_rq(ti, clone, &tio->info);
 	switch (r) {
 	case DM_MAPIO_SUBMITTED:
 		/* The target has taken the I/O to submit by itself later */
@@ -1850,8 +1887,8 @@ static int map_request(struct dm_target *ti, struct request *clone,
 	case DM_MAPIO_REMAPPED:
 		/* The target has remapped the I/O so dispatch it */
 		trace_block_rq_remap(clone->q, clone, disk_devt(dm_disk(md)),
-				     blk_rq_pos(tio->orig));
-		dm_dispatch_request(clone);
+				     blk_rq_pos(rq));
+		dm_dispatch_clone_request(clone, rq);
 		break;
 	case DM_MAPIO_REQUEUE:
 		/* The target wants to requeue the I/O */
@@ -1865,7 +1902,7 @@ static int map_request(struct dm_target *ti, struct request *clone,
 		}
 
 		/* The target wants to complete the I/O */
-		dm_kill_unmapped_request(clone, r);
+		dm_kill_unmapped_request(rq, r);
 		break;
 	}
 
@@ -1875,17 +1912,17 @@ static int map_request(struct dm_target *ti, struct request *clone,
 static void map_tio_request(struct kthread_work *work)
 {
 	struct dm_rq_target_io *tio = container_of(work, struct dm_rq_target_io, work);
+	struct request *rq = tio->orig;
+	struct mapped_device *md = tio->md;
 
-	map_request(tio->ti, tio->clone, tio->md);
+	if (map_request(tio->ti, rq, md) < 0)
+		dm_requeue_unmapped_original_request(md, rq);
 }
 
-static struct request *dm_start_request(struct mapped_device *md, struct request *orig)
+static void dm_start_request(struct mapped_device *md, struct request *orig)
 {
-	struct request *clone;
-
 	blk_start_request(orig);
-	clone = orig->special;
-	atomic_inc(&md->pending[rq_data_dir(clone)]);
+	atomic_inc(&md->pending[rq_data_dir(orig)]);
 
 	/*
 	 * Hold the md reference here for the in-flight I/O.
@@ -1895,8 +1932,6 @@ static struct request *dm_start_request(struct mapped_device *md, struct request
 	 * See the comment in rq_completed() too.
 	 */
 	dm_get(md);
-
-	return clone;
 }
 
 /*
@@ -1909,7 +1944,7 @@ static void dm_request_fn(struct request_queue *q)
 	int srcu_idx;
 	struct dm_table *map = dm_get_live_table(md, &srcu_idx);
 	struct dm_target *ti;
-	struct request *rq, *clone;
+	struct request *rq;
 	struct dm_rq_target_io *tio;
 	sector_t pos;
 
@@ -1932,19 +1967,19 @@ static void dm_request_fn(struct request_queue *q)
 		ti = dm_table_find_target(map, pos);
 		if (!dm_target_is_valid(ti)) {
 			/*
-			 * Must perform setup, that dm_done() requires,
+			 * Must perform setup, that rq_completed() requires,
 			 * before calling dm_kill_unmapped_request
 			 */
 			DMERR_LIMIT("request attempted access beyond the end of device");
-			clone = dm_start_request(md, rq);
-			dm_kill_unmapped_request(clone, -EIO);
+			dm_start_request(md, rq);
+			dm_kill_unmapped_request(rq, -EIO);
 			continue;
 		}
 
 		if (ti->type->busy && ti->type->busy(ti))
 			goto delay_and_out;
 
-		clone = dm_start_request(md, rq);
+		dm_start_request(md, rq);
 
 		tio = rq->special;
 		/* Establish tio->ti before queuing work (map_tio_request) */
@@ -2241,16 +2276,15 @@ static void __bind_mempools(struct mapped_device *md, struct dm_table *t)
 			bioset_free(md->bs);
 			md->bs = p->bs;
 			p->bs = NULL;
-		} else if (dm_table_get_type(t) == DM_TYPE_REQUEST_BASED) {
-			/*
-			 * There's no need to reload with request-based dm
-			 * because the size of front_pad doesn't change.
-			 * Note for future: If you are to reload bioset,
-			 * prep-ed requests in the queue may refer
-			 * to bio from the old bioset, so you must walk
-			 * through the queue to unprep.
-			 */
 		}
+		/*
+		 * There's no need to reload with request-based dm
+		 * because the size of front_pad doesn't change.
+		 * Note for future: If you are to reload bioset,
+		 * prep-ed requests in the queue may refer
+		 * to bio from the old bioset, so you must walk
+		 * through the queue to unprep.
+		 */
 		goto out;
 	}
 
@@ -2462,6 +2496,14 @@ unsigned dm_get_md_type(struct mapped_device *md)
 	return md->type;
 }
 
+static bool dm_md_type_request_based(struct mapped_device *md)
+{
+	unsigned table_type = dm_get_md_type(md);
+
+	return (table_type == DM_TYPE_REQUEST_BASED ||
+		table_type == DM_TYPE_MQ_REQUEST_BASED);
+}
+
 struct target_type *dm_get_immutable_target_type(struct mapped_device *md)
 {
 	return md->immutable_target_type;
@@ -2514,8 +2556,7 @@ static int dm_init_request_based_queue(struct mapped_device *md)
  */
 int dm_setup_md_queue(struct mapped_device *md)
 {
-	if ((dm_get_md_type(md) == DM_TYPE_REQUEST_BASED) &&
-	    !dm_init_request_based_queue(md)) {
+	if (dm_md_type_request_based(md) && !dm_init_request_based_queue(md)) {
 		DMWARN("Cannot initialize queue for request-based mapped device");
 		return -EINVAL;
 	}
@@ -3187,27 +3228,35 @@ struct dm_md_mempools *dm_alloc_md_mempools(unsigned type, unsigned integrity, u
 {
 	struct dm_md_mempools *pools = kzalloc(sizeof(*pools), GFP_KERNEL);
 	struct kmem_cache *cachep;
-	unsigned int pool_size;
+	unsigned int pool_size = 0;
 	unsigned int front_pad;
 
 	if (!pools)
 		return NULL;
 
-	if (type == DM_TYPE_BIO_BASED) {
+	switch (type) {
+	case DM_TYPE_BIO_BASED:
 		cachep = _io_cache;
 		pool_size = dm_get_reserved_bio_based_ios();
 		front_pad = roundup(per_bio_data_size, __alignof__(struct dm_target_io)) + offsetof(struct dm_target_io, clone);
-	} else if (type == DM_TYPE_REQUEST_BASED) {
-		cachep = _rq_tio_cache;
+		break;
+	case DM_TYPE_REQUEST_BASED:
 		pool_size = dm_get_reserved_rq_based_ios();
 		pools->rq_pool = mempool_create_slab_pool(pool_size, _rq_cache);
 		if (!pools->rq_pool)
 			goto out;
+		/* fall through to setup remaining rq-based pools */
+	case DM_TYPE_MQ_REQUEST_BASED:
+		cachep = _rq_tio_cache;
+		if (!pool_size)
+			pool_size = dm_get_reserved_rq_based_ios();
 		front_pad = offsetof(struct dm_rq_clone_bio_info, clone);
 		/* per_bio_data_size is not used. See __bind_mempools(). */
 		WARN_ON(per_bio_data_size != 0);
-	} else
+		break;
+	default:
 		goto out;
+	}
 
 	pools->io_pool = mempool_create_slab_pool(pool_size, cachep);
 	if (!pools->io_pool)
diff --git a/drivers/md/dm.h b/drivers/md/dm.h
index 84b0f9e4..84d7978 100644
--- a/drivers/md/dm.h
+++ b/drivers/md/dm.h
@@ -34,9 +34,10 @@
 /*
  * Type of table and mapped_device's mempool
  */
-#define DM_TYPE_NONE		0
-#define DM_TYPE_BIO_BASED	1
-#define DM_TYPE_REQUEST_BASED	2
+#define DM_TYPE_NONE			0
+#define DM_TYPE_BIO_BASED		1
+#define DM_TYPE_REQUEST_BASED		2
+#define DM_TYPE_MQ_REQUEST_BASED	3
 
 /*
  * List of devices that a metadevice uses and should open/close.
@@ -73,6 +74,7 @@ int dm_table_any_busy_target(struct dm_table *t);
 unsigned dm_table_get_type(struct dm_table *t);
 struct target_type *dm_table_get_immutable_target_type(struct dm_table *t);
 bool dm_table_request_based(struct dm_table *t);
+bool dm_table_mq_request_based(struct dm_table *t);
 void dm_table_free_md_mempools(struct dm_table *t);
 struct dm_md_mempools *dm_table_get_md_mempools(struct dm_table *t);
 
diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
index 19296fb..2646aed 100644
--- a/include/linux/device-mapper.h
+++ b/include/linux/device-mapper.h
@@ -48,6 +48,11 @@ typedef void (*dm_dtr_fn) (struct dm_target *ti);
 typedef int (*dm_map_fn) (struct dm_target *ti, struct bio *bio);
 typedef int (*dm_map_request_fn) (struct dm_target *ti, struct request *clone,
 				  union map_info *map_context);
+typedef int (*dm_clone_and_map_request_fn) (struct dm_target *ti,
+					    struct request *rq,
+					    union map_info *map_context,
+					    struct request **clone);
+typedef void (*dm_release_clone_request_fn) (struct request *clone);
 
 /*
  * Returns:
@@ -143,6 +148,8 @@ struct target_type {
 	dm_dtr_fn dtr;
 	dm_map_fn map;
 	dm_map_request_fn map_rq;
+	dm_clone_and_map_request_fn clone_and_map_rq;
+	dm_release_clone_request_fn release_clone_rq;
 	dm_endio_fn end_io;
 	dm_request_endio_fn rq_end_io;
 	dm_presuspend_fn presuspend;
-- 
1.9.3

* Re: [PATCH v3 0/8] dm: add request-based blk-mq support
From: Keith Busch @ 2014-12-17 21:42 UTC
  To: Mike Snitzer; +Cc: axboe, hch, bvanassche, Keith Busch, dm-devel, j-nomura

On Tue, 16 Dec 2014, Mike Snitzer wrote:
> Here is v3 of the request-based DM blk-mq support patchset.  I've also
> published a git repo here:
> https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-for-3.20-blk-mq
>
> I found quite a few issues with v2 for both blk-mq and old
> request-based DM.  I've still attributed the original patches from
> Keith to him even though I fixed/rewrote significant portions.  Keith,
> I'm happy to leave attribution like this unless you'd prefer I change
> it.

Thanks a bunch, Mike! I'll need to merge your tree with Jens' for nvme
blk-mq, and one patch for an nvme scsi translation fix, then I can test
this out. I get my dual ported nvme controller back tomorrow, so should
have results before the end of the week.

I'm happy to take credit, but if your rewrite is sufficiently different,
I'm okay to just append a Tested-by once that's done. :)

* Re: [PATCH v3 0/8] dm: add request-based blk-mq support
From: Jens Axboe @ 2014-12-17 21:43 UTC
  To: Keith Busch, Mike Snitzer; +Cc: hch, j-nomura, dm-devel, bvanassche

On 12/17/2014 02:42 PM, Keith Busch wrote:
> On Tue, 16 Dec 2014, Mike Snitzer wrote:
> >> Here is v3 of the request-based DM blk-mq support patchset.  I've also
>> published a git repo here:
>> https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-for-3.20-blk-mq
>>
>> I found quite a few issues with v2 for both blk-mq and old
>> request-based DM.  I've still attributed the original patches from
>> Keith to him even though I fixed/rewrote significant portions.  Keith,
>> I'm happy to leave attribution like this unless you'd prefer I change
>> it.
> 
> Thanks a bunch, Mike! I'll need to merge your tree with Jens' for nvme
> blk-mq, and one patch for a nvme scsi translation fix, then I can test
> this out. I get my dual ported nvme controller back tomorrow, so should
> have results before the end of the week.

All the juicy nvme bits are in Linus' tree now, so that should work!

BTW, Mike, do you have any perf numbers? Just curious how far along this is.

-- 
Jens Axboe

* Re: [PATCH v3 8/8] dm: allocate requests from target when stacking on blk-mq devices
From: Mike Snitzer @ 2014-12-17 22:35 UTC
  To: dm-devel, Keith Busch; +Cc: axboe, hch, j-nomura, bvanassche

On Tue, Dec 16 2014 at 11:00pm -0500,
Mike Snitzer <snitzer@redhat.com> wrote:

> From: Keith Busch <keith.busch@intel.com>
> 
> For blk-mq request-based DM the responsibility of allocating a cloned
> request is transfered from DM core to the target type so that the cloned
> request is allocated from the appropriate request_queue's pool and
> initialized for the target block device.  The original request's
> 'special' now points to the dm_rq_target_io because the clone is
> allocated later in the block layer rather than in DM core.
> 
> Care was taken to preserve compatibility with old-style block request
> completion that requires request-based DM _not_ acquire the clone
> request's queue lock in the completion path.  As such, there are now 2
> different request-based dm_target interfaces:
> 1) the original .map_rq() interface will continue to be used for
>    non-blk-mq devices -- the preallocated clone request is passed in
>    from DM core.
> 2) a new .clone_and_map_rq() and .release_clone_rq() will be used for
>    blk-mq devices -- blk_get_request() and blk_put_request() are used
>    respectively from these hooks.
> 
> dm_table_set_type() was updated to detect if the request-based target is
> being stacked on blk-mq devices, if so DM_TYPE_MQ_REQUEST_BASED is set.
> DM core disallows switching the DM table's type after it is set.  This
> means that there is no mixing of non-blk-mq and blk-mq devices within
> the same request-based DM table.
> 
> Signed-off-by: Keith Busch <keith.busch@intel.com>
> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
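
For those reading along, the two interfaces described above boil down to
two sets of target-type hooks.  A sketch paraphrased from this
description (the authoritative prototypes live in the series'
include/linux/device-mapper.h):

    /* old-style: DM core passes in the preallocated clone */
    typedef int (*dm_map_request_fn) (struct dm_target *ti,
                                      struct request *clone,
                                      union map_info *map_context);

    /* blk-mq style: the target allocates and releases the clone itself,
     * via blk_get_request()/blk_put_request() on the underlying queue */
    typedef int (*dm_clone_and_map_request_fn) (struct dm_target *ti,
                                                struct request *rq,
                                                union map_info *map_context,
                                                struct request **clone);
    typedef void (*dm_release_clone_request_fn) (struct request *clone);

A request-based target supplies .map_rq for the non-blk-mq case and the
.clone_and_map_rq/.release_clone_rq pair for the blk-mq case; DM core
invokes one or the other based on the table's type.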

I did some testing using the DM "error" target and found some error path
fixes were needed, so I folded the following changes into this last
patch and pushed the rebased result to the linux-dm.git
'dm-for-3.20-blk-mq' branch:

diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
index df408bc..1fa6f14 100644
--- a/drivers/md/dm-mpath.c
+++ b/drivers/md/dm-mpath.c
@@ -424,6 +424,7 @@ static int __multipath_map(struct dm_target *ti, struct request *clone,
 		*__clone = blk_get_request(bdev_get_queue(bdev),
 					   rq_data_dir(rq), GFP_KERNEL);
 		if (IS_ERR(*__clone))
+			/* ENOMEM, requeue */
 			goto out_unlock;
 		(*__clone)->cmd_flags |= REQ_FAILFAST_TRANSPORT;
 	}
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 612e1c1..19914f6 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1044,7 +1044,7 @@ static void free_rq_clone(struct request *clone)
 	struct dm_rq_target_io *tio = clone->end_io_data;
 
 	blk_rq_unprep_clone(clone);
-	if (clone->q->mq_ops)
+	if (clone->q && clone->q->mq_ops)
 		tio->ti->type->release_clone_rq(clone);
 	else
 		free_clone_request(tio->md, clone);
@@ -1855,15 +1855,15 @@ static int dm_prep_fn(struct request_queue *q, struct request *rq)
 
 /*
  * Returns:
- * 0   : the request has been processed (not requeued)
- * 1   : the request has been requeued
- * < 0 : the original request needs to be requeued
+ * 0                : the request has been processed (not requeued)
+ * 1                : the request has been requeued
+ * DM_MAPIO_REQUEUE : the original request needs to be requeued
  */
 static int map_request(struct dm_target *ti, struct request *rq,
 		       struct mapped_device *md)
 {
 	struct request *clone = NULL;
-	int r, r2, requeued = 0;
+	int r, requeued = 0;
 	struct dm_rq_target_io *tio = rq->special;
 
 	if (tio->clone) {
@@ -1871,12 +1871,17 @@ static int map_request(struct dm_target *ti, struct request *rq,
 		r = ti->type->map_rq(ti, clone, &tio->info);
 	} else {
 		r = ti->type->clone_and_map_rq(ti, rq, &tio->info, &clone);
+		if (r < 0) {
+			/* The target wants to complete the I/O */
+			dm_kill_unmapped_request(rq, r);
+			return r;
+		}
 		if (IS_ERR(clone))
-			return PTR_ERR(clone);
-		r2 = setup_clone(clone, rq, tio, GFP_KERNEL);
-		if (r2) {
+			return DM_MAPIO_REQUEUE;
+		if (setup_clone(clone, rq, tio, GFP_KERNEL)) {
+			/* -ENOMEM */
 			ti->type->release_clone_rq(clone);
-			return r2;
+			return DM_MAPIO_REQUEUE;
 		}
 	}
 
@@ -1915,7 +1920,7 @@ static void map_tio_request(struct kthread_work *work)
 	struct request *rq = tio->orig;
 	struct mapped_device *md = tio->md;
 
-	if (map_request(tio->ti, rq, md) < 0)
+	if (map_request(tio->ti, rq, md) == DM_MAPIO_REQUEUE)
 		dm_requeue_unmapped_original_request(md, rq);
 }
 

^ permalink raw reply related	[flat|nested] 95+ messages in thread

* Re: [PATCH v3 0/8] dm: add request-based blk-mq support
  2014-12-17 21:42 ` [PATCH v3 0/8] dm: add request-based blk-mq support Keith Busch
  2014-12-17 21:43   ` Jens Axboe
@ 2014-12-17 22:51   ` Mike Snitzer
  1 sibling, 0 replies; 95+ messages in thread
From: Mike Snitzer @ 2014-12-17 22:51 UTC (permalink / raw)
  To: Keith Busch; +Cc: axboe, bvanassche, hch, dm-devel, j-nomura

On Wed, Dec 17 2014 at  4:42pm -0500,
Keith Busch <keith.busch@intel.com> wrote:

> On Tue, 16 Dec 2014, Mike Snitzer wrote:
> >Here is v3 of the request-based DM blk-support patchset.  I've also
> >published a git repo here:
> >https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-for-3.20-blk-mq
> >
> >I found quite a few issues with v2 for both blk-mq and old
> >request-based DM.  I've still attributed the original patches from
> >Keith to him even though I fixed/rewrote significant portions.  Keith,
> >I'm happy to leave attribution like this unless you'd prefer I change
> >it.
> 
> Thanks a bunch, Mike! I'll need to merge your tree with Jens' for nvme
> blk-mq, and one patch for a nvme scsi translation fix, then I can test
> this out. I get my dual ported nvme controller back tomorrow, so should
> have results before the end of the week.

No problem.  Please test with the latest 'dm-for-3.20-blk-mq' -- topmost
commit should be 1943bd21bf8df061b3112c945160e2a740a906ba

I don't have any dual ported NVMe hardware or anything so I haven't
_really_ tested all aspects of this code.  I've mainly studied the code
_a lot_ and made sure that basic IO is getting submitted/completed as
expected (for both the old and blk-mq cases).

I'll be interested to understand if these changes are sufficient for
your hardware.  I'm left wondering how/if scsi-mq should fit into
dm-multipath given that I know iSER is at least slated to use scsi-mq.
Also, we have the scsi_dh layer that obviously isn't in the mix here
with direct submission and completion with the blk-mq driver.

> I'm happy to take credit, but if your rewrite is sufficiently different,
> I'm okay to just append a Tested-by once that's done. :)

It is pretty different in that I'm maintaining both the old and blk-mq
paths in DM core's request-based path.  But your v2 series provided a
roadmap for what was needed for the blk-mq path... so you'll still find
your code for sure.  I'd say the 8th patch is what saw the biggest
rewrite so I can go ahead and flip that to be attributed to me if you'd
rather not get all the blame ;)

Mike

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v3 0/8] dm: add request-based blk-mq support
  2014-12-17 21:43   ` Jens Axboe
@ 2014-12-17 23:06     ` Mike Snitzer
  2014-12-18  1:41       ` Keith Busch
  2014-12-19 14:32       ` Bart Van Assche
  0 siblings, 2 replies; 95+ messages in thread
From: Mike Snitzer @ 2014-12-17 23:06 UTC (permalink / raw)
  To: Jens Axboe; +Cc: hch, bvanassche, Keith Busch, dm-devel, j-nomura

On Wed, Dec 17 2014 at  4:43pm -0500,
Jens Axboe <axboe@kernel.dk> wrote:

> On 12/17/2014 02:42 PM, Keith Busch wrote:
> > On Tue, 16 Dec 2014, Mike Snitzer wrote:
> >> Here is v3 of the request-based DM blk-support patchset.  I've also
> >> published a git repo here:
> >> https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-for-3.20-blk-mq
> >>
> >>
> >> I found quite a few issues with v2 for both blk-mq and old
> >> request-based DM.  I've still attributed the original patches from
> >> Keith to him even though I fixed/rewrote significant portions.  Keith,
> >> I'm happy to leave attribution like this unless you'd prefer I change
> >> it.
> > 
> > Thanks a bunch, Mike! I'll need to merge your tree with Jens' for nvme
> > blk-mq, and one patch for a nvme scsi translation fix, then I can test
> > this out. I get my dual ported nvme controller back tomorrow, so should
> > have results before the end of the week.
> 
> All the juicy nvme bits are in Linus' tree now, so that should work!
> 
> BTW, Mike, do you have any perf numbers? Just curious how far along this is.

No, not yet.  I'll be focusing on the old request-based (non-blk-mq)
performance first though to make sure we haven't killed the common case
-- which obviously isn't what you're interested in ;)

The primary concerns for the old request-based path are:
1) does submission through a new dedicated (per rq-based DM device)
   kthread hurt? 
   https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-for-3.20-blk-mq&id=aec254b435c9ee78103b90c229644be810274a33

2) does splitting the request structure out from the dm_rq_target_io
   structure hurt? (see the sketch after the links below)
   https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-for-3.20-blk-mq&id=a46bee4179e804b19e581d1b745e467d0ba946c1

I'm suspecting these changes should be fine but we'll see.
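
(For anyone following along: the second item amounts to the clone no
longer being embedded in DM's per-request bookkeeping struct.  A rough
paraphrase of the resulting layout -- not the exact dm.c definition:

    struct dm_rq_target_io {
            struct mapped_device *md;
            struct dm_target *ti;
            struct request *orig;   /* the original request */
            struct request *clone;  /* previously embedded here; now
                                     * allocated separately -- by DM core
                                     * on the old path, by the target for
                                     * blk-mq */
            struct kthread_work work;  /* submission via the kdmwork kthread */
            int error;
            union map_info info;
    };
)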

(Hannes, Christoph, Bart and/or Junichi: if you have some performant
multipath setups I'd love for you to try this code to make sure they
still perform well).

As for blk-mq support... I don't have access to any NVMe hardware, etc.
I only tested with virtio-blk (backed by a ramdisk or scsi-debug device
on the host) so I'm really going to need to lean on Keith and others to
validate blk-mq performance.

So if you know someone with relevant blk-mq hardware who might benefit
from blk-mq multipathing please point them at this code and have them
report back!

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v3 0/8] dm: add request-based blk-mq support
  2014-12-17 23:06     ` Mike Snitzer
@ 2014-12-18  1:41       ` Keith Busch
  2014-12-18  4:58         ` Mike Snitzer
  2014-12-19 14:32       ` Bart Van Assche
  1 sibling, 1 reply; 95+ messages in thread
From: Keith Busch @ 2014-12-18  1:41 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: Jens Axboe, hch, bvanassche, Keith Busch, dm-devel, j-nomura

On Wed, 17 Dec 2014, Mike Snitzer wrote:
> As for blk-mq support... I don't have access to any NVMe hardware, etc.
> I only tested with virtio-blk (backed by a ramdisk or scsi-debug device
> on the host) so I'm really going to need to lean on Keith and others to
> validate blk-mq performance.

There's a reason no one has multipath capable NVMe drives: they are not
generally available to anyone right now. :) Mine is a prototype so not
a good candidate for performance comparisons.

I was able to get my loaner back a couple of hours ago though, so I
built and tested your tree and I'm happy to say it is very successful.
While running filesystem fio, I simulated alternating path
hot-removal/add sequences and everything worked.  So functionally it
appears great, but I can't speak to performance right now.

One thing with dual ported PCI-e SSDs is each path can be on a different
pci domain local to different NUMA nodes. I think there's performance
to gain if we select the target path closest to the CPU that the thread
is scheduled on. I don't have data to back that up yet, but could such
a path selection algorithm be considered in the future?

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v3 0/8] dm: add request-based blk-mq support
  2014-12-18  1:41       ` Keith Busch
@ 2014-12-18  4:58         ` Mike Snitzer
  0 siblings, 0 replies; 95+ messages in thread
From: Mike Snitzer @ 2014-12-18  4:58 UTC (permalink / raw)
  To: Keith Busch; +Cc: Jens Axboe, bvanassche, hch, dm-devel, j-nomura

On Wed, Dec 17 2014 at  8:41pm -0500,
Keith Busch <keith.busch@intel.com> wrote:

> On Wed, 17 Dec 2014, Mike Snitzer wrote:
> >As for blk-mq support... I don't have access to any NVMe hardware, etc.
> >I only tested with virtio-blk (backed by a ramdisk or scsi-debug device
> >on the host) so I'm really going to need to lean on Keith and others to
> >validate blk-mq performance.
> 
> There's a reason no one has multipath capable NVMe drives: they are not
> generally available to anyone right now. :) Mine is a prototype so not
> a good candidate for performance comparisons.
> 
> I was able to get my loaner back a couple of hours ago though, so I
> built and tested your tree and I'm happy to say it is very successful.
> While running filesystem fio, I simulated alternating path
> hot-removal/add sequences and everything worked.  So functionally it
> appears great, but I can't speak to performance right now.

Great news.
 
> One thing with dual ported PCI-e SSDs is each path can be on a different
> pci domain local to different NUMA nodes. I think there's performance
> to gain if we select the target path closest to the CPU that the thread
> is scheduled on. I don't have data to back that up yet, but could such
> a path selection algorithm be considered in the future?

Definitely, if you look at the comment above
dm-mpath.c:parse_path_selector() you'll see that we have a very generic
mechanism for seeding the path selectors with information that they'll
use as the basis for deciding which path to select.  So in this case
we'd have userspace supply the path to NUMA node mapping, etc.  Not
exactly sure what it'd look like at this point but it should be doable.
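
To make that concrete: a path selector is a small plugin implementing
the hooks in drivers/md/dm-path-selector.h, and it receives per-path
arguments from userspace via .add_path.  Below is a rough sketch of what
a NUMA-aware selector could look like -- the "numa" name, the per-path
node argument, and the fallback logic are all hypothetical, invented
here for illustration:

    #include <linux/topology.h>
    #include "dm-path-selector.h"

    struct numa_path {
            struct dm_path *path;
            int node;       /* NUMA node for this path, from userspace */
    };

    /* .add_path hook: userspace passes the path's home node as an
     * argument on the multipath table line */
    static int numa_add_path(struct path_selector *ps, struct dm_path *path,
                             int argc, char **argv, char **error)
    {
            /* parse argv[0] as the node, allocate a struct numa_path,
             * link it into a per-selector list hanging off ps->context */
            return 0;
    }

    /* .select_path hook: prefer a usable path whose node matches the
     * node of the CPU doing the submission */
    static struct dm_path *numa_select_path(struct path_selector *ps,
                                            unsigned *repeat_count,
                                            size_t nr_bytes)
    {
            int node = numa_node_id();

            /* walk the valid paths; return the first with ->node == node,
             * else fall back to any usable path */
            return NULL;    /* placeholder in this sketch */
    }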

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v3 0/8] dm: add request-based blk-mq support
  2014-12-17 23:06     ` Mike Snitzer
  2014-12-18  1:41       ` Keith Busch
@ 2014-12-19 14:32       ` Bart Van Assche
  2014-12-19 15:38         ` Mike Snitzer
  1 sibling, 1 reply; 95+ messages in thread
From: Bart Van Assche @ 2014-12-19 14:32 UTC (permalink / raw)
  To: Mike Snitzer, Jens Axboe
  Cc: hch, Keith Busch, j-nomura, device-mapper development

On 12/18/14 00:06, Mike Snitzer wrote:
> So if you know someone with relevant blk-mq hardware who might benefit
> from blk-mq multipathing please point them at this code and have them
> report back!

Hello Mike,

Great to see that you are working on blk-mq multipathing. Unfortunately
a test with the SRP initiator and your dm-for-3.20-blk-mq tree merged
with Linus' latest tree was not successful. This is what was reported
when I tried to start multipathd (without call trace, followed by a
hard lockup):

=========================================================
[ INFO: possible irq lock inversion dependency detected ]
3.18.0-debug+ #1 Tainted: G        W     
---------------------------------------------------------
kdmwork-253:0/5347 just changed the state of lock:
 (&(&m->lock)->rlock){+.....}, at: [<ffffffffa080eb80>] __multipath_map.isra.15+0x40/0x1f0 [dm_multipath]
but this lock was taken by another, HARDIRQ-safe lock in the past:
 (&(&q->__queue_lock)->rlock){-.-...}
 
and interrupts could create inverse lock ordering between them.
 
other info that might help us debug this:
 Possible interrupt unsafe locking scenario:

This is how objdump translates the above kernel address into assembler (0x1b80 below):

static int __multipath_map(struct dm_target *ti, struct request *clone,
    1b62:       48 89 55 c8             mov    %rdx,-0x38(%rbp)
    1b66:       4c 89 45 c0             mov    %r8,-0x40(%rbp)
                           union map_info *map_context,
                           struct request *rq, struct request **__clone)
{
        struct multipath *m = (struct multipath *) ti->private;
        int r = DM_MAPIO_REQUEUE;
        size_t nr_bytes = clone ? blk_rq_bytes(clone) : blk_rq_bytes(rq);
    1b6a:       0f 84 50 01 00 00       je     1cc0 <__multipath_map.isra.15+0x180>
    1b70:       44 8b 66 5c             mov    0x5c(%rsi),%r12d
        raw_spin_lock_init(&(_lock)->rlock);            \
} while (0)

static inline void spin_lock(spinlock_t *lock)
{
        raw_spin_lock(&lock->rlock);
    1b74:       49 8d 5e 28             lea    0x28(%r14),%rbx
    1b78:       48 89 df                mov    %rbx,%rdi
    1b7b:       e8 00 00 00 00          callq  1b80 <__multipath_map.isra.15+0x40>
        struct dm_mpath_io *mpio;

        spin_lock(&m->lock);

        /* Do we need to select a new pgpath? */
        if (!m->current_pgpath ||
    1b80:       49 8b 8e d0 00 00 00    mov    0xd0(%r14),%rcx
    1b87:       48 85 c9                test   %rcx,%rcx
    1b8a:       74 55                   je     1be1 <__multipath_map.isra.15+0xa1>
            (!m->queue_io && (m->repeat_count && --m->repeat_count == 0)))
    1b8c:       41 0f b6 96 ec 00 00    movzbl 0xec(%r14),%edx

Bart.

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v3 0/8] dm: add request-based blk-mq support
  2014-12-19 14:32       ` Bart Van Assche
@ 2014-12-19 15:38         ` Mike Snitzer
  2014-12-19 17:14           ` Mike Snitzer
  0 siblings, 1 reply; 95+ messages in thread
From: Mike Snitzer @ 2014-12-19 15:38 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, hch, Keith Busch, device-mapper development, j-nomura

On Fri, Dec 19 2014 at  9:32am -0500,
Bart Van Assche <bvanassche@acm.org> wrote:

> On 12/18/14 00:06, Mike Snitzer wrote:
> > So if you know someone with relevant blk-mq hardware who might benefit
> > from blk-mq multipathing please point them at this code and have them
> > report back!
> 
> Hello Mike,
> 
> Great to see that you are working on blk-mq multipathing. Unfortunately
> a test with the SRP initiator and your dm-for-3.20-blk-mq tree merged
> with Linus' latest tree was not successful. This is what was reported
> when I tried to start multipathd (without call trace, followed by a
> hard lockup):
> 
> =========================================================
> [ INFO: possible irq lock inversion dependency detected ]
> 3.18.0-debug+ #1 Tainted: G        W     
> ---------------------------------------------------------
> kdmwork-253:0/5347 just changed the state of lock:
>  (&(&m->lock)->rlock){+.....}, at: [<ffffffffa080eb80>] __multipath_map.isra.15+0x40/0x1f0 [dm_multipath]
> but this lock was taken by another, HARDIRQ-safe lock in the past:
>  (&(&q->__queue_lock)->rlock){-.-...}
>  
> and interrupts could create inverse lock ordering between them.
>  
> other info that might help us debug this:
>  Possible interrupt unsafe locking scenario:

This "dm: submit stacked requests in irq enabled context" commit
https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-for-3.20-blk-mq&id=1844ba7e2e013fa38c45d646248c517eb363e26c

changed the locking needed in the multipath target.  I altered
__multipath_map but didn't audit elsewhere.  I'll work through it.

I rebuilt my kernel with lockdep enabled and can easily see this too:

[  181.819735] =========================================================
[  181.820046] [ INFO: possible irq lock inversion dependency detected ]
[  181.820046] 3.18.0+ #12 Tainted: G        W
[  181.820046] ---------------------------------------------------------
[  181.820046] swapper/1/0 just changed the state of lock:
[  181.820046]  (&(&q->__queue_lock)->rlock){..-...}, at: [<ffffffff812fb2c4>] blk_end_bidi_request+0x34/0x60
[  181.820046] but this lock took another, SOFTIRQ-unsafe lock in the past:
[  181.820046]  (&(&m->lock)->rlock){+.+...}

and interrupts could create inverse lock ordering between them.

[  181.820046]
[  181.820046] other info that might help us debug this:
[  181.820046]  Possible interrupt unsafe locking scenario:
[  181.820046]
[  181.820046]        CPU0                    CPU1
[  181.820046]        ----                    ----
[  181.820046]   lock(&(&m->lock)->rlock);
[  181.820046]                                local_irq_disable();
[  181.820046]                                lock(&(&q->__queue_lock)->rlock);
[  181.820046]                                lock(&(&m->lock)->rlock);
[  181.820046]   <Interrupt>
[  181.820046]     lock(&(&q->__queue_lock)->rlock);
[  181.820046]
[  181.820046]  *** DEADLOCK ***
[  181.820046]
[  181.820046] no locks held by swapper/1/0.
[  181.820046]
[  181.820046] the shortest dependencies between 2nd lock and 1st lock:
[  181.820046]  -> (&(&m->lock)->rlock){+.+...} ops: 4 {
[  181.820046]     HARDIRQ-ON-W at:
[  181.820046]                       [<ffffffff810c8f46>] __lock_acquire+0x5d6/0x1d40
[  181.820046]                       [<ffffffff810cae37>] lock_acquire+0xb7/0x140
[  181.820046]                       [<ffffffff81694618>] _raw_spin_lock+0x38/0x50
[  181.820046]                       [<ffffffffa00adbd0>] __multipath_map.isra.15+0x40/0x1f0 [dm_multipath]
[  181.820046]                       [<ffffffffa00add9a>] multipath_clone_and_map+0x1a/0x20 [dm_multipath]
[  181.820046]                       [<ffffffffa0206c25>] map_tio_request+0x1d5/0x2b0 [dm_mod]
[  181.820046]                       [<ffffffff8109a4ee>] kthread_worker_fn+0x7e/0x1b0
[  181.820046]                       [<ffffffff8109a3f7>] kthread+0x107/0x120
[  181.820046]                       [<ffffffff8169527c>] ret_from_fork+0x7c/0xb0
[  181.820046]     SOFTIRQ-ON-W at:
[  181.820046]                       [<ffffffff810c8ca0>] __lock_acquire+0x330/0x1d40
[  181.820046]                       [<ffffffff810cae37>] lock_acquire+0xb7/0x140
[  181.820046]                       [<ffffffff81694618>] _raw_spin_lock+0x38/0x50
[  181.820046]                       [<ffffffffa00adbd0>] __multipath_map.isra.15+0x40/0x1f0 [dm_multipath]
[  181.820046]                       [<ffffffffa00add9a>] multipath_clone_and_map+0x1a/0x20 [dm_multipath]
[  181.820046]                       [<ffffffffa0206c25>] map_tio_request+0x1d5/0x2b0 [dm_mod]
[  181.820046]                       [<ffffffff8109a4ee>] kthread_worker_fn+0x7e/0x1b0
[  181.820046]                       [<ffffffff8109a3f7>] kthread+0x107/0x120
[  181.820046]                       [<ffffffff8169527c>] ret_from_fork+0x7c/0xb0
[  181.820046]     INITIAL USE at:
[  181.820046]                      [<ffffffff810c8d2f>] __lock_acquire+0x3bf/0x1d40
[  181.820046]                      [<ffffffff810cae37>] lock_acquire+0xb7/0x140
[  181.820046]                      [<ffffffff81694f50>] _raw_spin_lock_irqsave+0x50/0x70
[  181.820046]                      [<ffffffffa00ac63c>] multipath_resume+0x1c/0x50 [dm_multipath]
[  181.820046]                      [<ffffffffa020bed9>] dm_table_resume_targets+0x99/0xe0 [dm_mod]
[  181.820046]                      [<ffffffffa0209289>] dm_resume+0xd9/0x120 [dm_mod]
[  181.820046]                      [<ffffffffa020e7bb>] dev_suspend+0x12b/0x250 [dm_mod]
[  181.820046]                      [<ffffffffa020f108>] ctl_ioctl+0x278/0x520 [dm_mod]
[  181.820046]                      [<ffffffffa020f3c3>] dm_ctl_ioctl+0x13/0x20 [dm_mod]
[  181.820046]                      [<ffffffff81219488>] do_vfs_ioctl+0x318/0x560
[  181.820046]                      [<ffffffff81219751>] SyS_ioctl+0x81/0xa0
[  181.820046]                      [<ffffffff81695329>] system_call_fastpath+0x12/0x17
[  181.820046]   }
[  181.820046]   ... key      at: [<ffffffffa00b04a0>] __key.33455+0x0/0xffffffffffffeb60 [dm_multipath]
[  181.820046]   ... acquired at:
[  181.820046]    [<ffffffff810cae37>] lock_acquire+0xb7/0x140
[  181.820046]    [<ffffffff81694618>] _raw_spin_lock+0x38/0x50
[  181.820046]    [<ffffffffa020893d>] dm_blk_open+0x1d/0x90 [dm_mod]
[  181.820046]    [<ffffffff812416fe>] __blkdev_get+0xde/0x4e0
[  181.820046]    [<ffffffff81241cf8>] blkdev_get+0x1f8/0x3b0
[  181.820046]    [<ffffffff81241f6f>] blkdev_open+0x5f/0x90
[  181.820046]    [<ffffffff812014cf>] do_dentry_open+0x1ff/0x350
[  181.820046]    [<ffffffff81201789>] vfs_open+0x49/0x50
[  181.820046]    [<ffffffff81211cd2>] do_last+0x682/0x13b0
[  181.820046]    [<ffffffff81214535>] path_openat+0xc5/0x640
[  181.820046]    [<ffffffff81216b69>] do_filp_open+0x49/0xc0
[  181.820046]    [<ffffffff812033a7>] do_sys_open+0x137/0x240
[  181.820046]    [<ffffffff812034ce>] SyS_open+0x1e/0x20
[  181.820046]    [<ffffffff81695329>] system_call_fastpath+0x12/0x17
[  181.820046]
[  181.820046] -> (&(&q->__queue_lock)->rlock){..-...} ops: 71 {
[  181.820046]    IN-SOFTIRQ-W at:
[  181.820046]                     [<ffffffff810c8c25>] __lock_acquire+0x2b5/0x1d40
[  181.820046]                     [<ffffffff810cae37>] lock_acquire+0xb7/0x140
[  181.820046]                     [<ffffffff81694f50>] _raw_spin_lock_irqsave+0x50/0x70
[  181.820046]                     [<ffffffff812fb2c4>] blk_end_bidi_request+0x34/0x60
[  181.820046]                     [<ffffffff812fb3cf>] blk_end_request_all+0x1f/0x30
[  181.820046]                     [<ffffffffa0206359>] dm_softirq_done+0xe9/0x1e0 [dm_mod]
[  181.820046]                     [<ffffffff813023b0>] blk_done_softirq+0xa0/0xd0
[  181.820046]                     [<ffffffff8107e091>] __do_softirq+0x141/0x370
[  181.820046]                     [<ffffffff8107e655>] irq_exit+0x125/0x130
[  181.820046]                     [<ffffffff8104bd05>] smp_call_function_single_interrupt+0x35/0x40
[  181.820046]                     [<ffffffff81696822>] call_function_single_interrupt+0x72/0x80
[  181.820046]                     [<ffffffff810c0084>] cpu_startup_entry+0x194/0x420
[  181.820046]                     [<ffffffff8104c6dd>] start_secondary+0x19d/0x210
[  181.820046]    INITIAL USE at:
[  181.820046]                    [<ffffffff810c8d2f>] __lock_acquire+0x3bf/0x1d40
[  181.820046]                    [<ffffffff810cae37>] lock_acquire+0xb7/0x140
[  181.820046]                    [<ffffffff81694764>] _raw_spin_lock_irq+0x44/0x60
[  181.820046]                    [<ffffffff812f998d>] blk_queue_bypass_start+0x1d/0xb0
[  181.820046]                    [<ffffffff813168c6>] blkcg_activate_policy+0x96/0x340
[  181.820046]                    [<ffffffff8131a04e>] blk_throtl_init+0xee/0x130
[  181.820046]                    [<ffffffff81316c01>] blkcg_init_queue+0x31/0x40
[  181.820046]                    [<ffffffff812f6551>] blk_alloc_queue_node+0x251/0x2c0
[  181.820046]                    [<ffffffff812fa854>] blk_init_queue_node+0x24/0x70
[  181.820046]                    [<ffffffff812fa8b3>] blk_init_queue+0x13/0x20
[  181.820046]                    [<ffffffffa0015364>] virtqueue_get_buf+0x14/0x130 [virtio_ring]
[  181.820046]                    [<ffffffff81002144>] do_one_initcall+0xd4/0x210
[  181.820046]                    [<ffffffff8110b4e2>] load_module+0x17d2/0x1c10
[  181.820046]                    [<ffffffff8110baf6>] SyS_finit_module+0xa6/0xd0
[  181.820046]                    [<ffffffff81695329>] system_call_fastpath+0x12/0x17
[  181.820046]  }
[  181.820046]  ... key      at: [<ffffffff82c21ec0>] __key.42329+0x0/0x8
[  181.820046]  ... acquired at:
[  181.820046]    [<ffffffff810c7909>] check_usage_forwards+0x199/0x1b0
[  181.820046]    [<ffffffff810c81e1>] mark_lock+0x1a1/0x2a0
[  181.820046]    [<ffffffff810c8c25>] __lock_acquire+0x2b5/0x1d40
[  181.820046]    [<ffffffff810cae37>] lock_acquire+0xb7/0x140
[  181.820046]    [<ffffffff81694f50>] _raw_spin_lock_irqsave+0x50/0x70
[  181.820046]    [<ffffffff812fb2c4>] blk_end_bidi_request+0x34/0x60
[  181.820046]    [<ffffffff812fb3cf>] blk_end_request_all+0x1f/0x30
[  181.820046]    [<ffffffffa0206359>] dm_softirq_done+0xe9/0x1e0 [dm_mod]
[  181.820046]    [<ffffffff813023b0>] blk_done_softirq+0xa0/0xd0
[  181.820046]    [<ffffffff8107e091>] __do_softirq+0x141/0x370
[  181.820046]    [<ffffffff8107e655>] irq_exit+0x125/0x130
[  181.820046]    [<ffffffff8104bd05>] smp_call_function_single_interrupt+0x35/0x40
[  181.820046]    [<ffffffff81696822>] call_function_single_interrupt+0x72/0x80
[  181.820046]    [<ffffffff810c0084>] cpu_startup_entry+0x194/0x420
[  181.820046]    [<ffffffff8104c6dd>] start_secondary+0x19d/0x210
[  181.820046]
[  181.820046]
[  181.820046] stack backtrace:
[  181.820046] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G        W      3.18.0+ #12
[  181.820046] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[  181.820046]  0000000000000000 79d40dcd9747394f ffff88011fc83b68 ffffffff8168b840
[  181.820046]  0000000000000000 ffffffff828b2340 ffff88011fc83bb8 ffffffff81685d9f
[  181.820046]  ffff88011fc83be0 ffffffff818d58f4 ffff88011fc83bd4 0000000000000000
[  181.820046] Call Trace:
[  181.820046]  <IRQ>  [<ffffffff8168b840>] dump_stack+0x4c/0x65
[  181.820046]  [<ffffffff81685d9f>] print_irq_inversion_bug.part.37+0x1ae/0x1bd
[  181.820046]  [<ffffffff810c7909>] check_usage_forwards+0x199/0x1b0
[  181.820046]  [<ffffffff810c7770>] ? check_usage_backwards+0x1a0/0x1a0
[  181.820046]  [<ffffffff810c81e1>] mark_lock+0x1a1/0x2a0
[  181.820046]  [<ffffffff810c8c25>] __lock_acquire+0x2b5/0x1d40
[  181.820046]  [<ffffffff81689193>] ? __slab_free+0x11c/0x2b0
[  181.820046]  [<ffffffff810cae37>] lock_acquire+0xb7/0x140
[  181.820046]  [<ffffffff812fb2c4>] ? blk_end_bidi_request+0x34/0x60
[  181.820046]  [<ffffffff81694f50>] _raw_spin_lock_irqsave+0x50/0x70
[  181.820046]  [<ffffffff812fb2c4>] ? blk_end_bidi_request+0x34/0x60
[  181.820046]  [<ffffffff812fb2c4>] blk_end_bidi_request+0x34/0x60
[  181.820046]  [<ffffffff812fb3cf>] blk_end_request_all+0x1f/0x30
[  181.820046]  [<ffffffffa0206359>] dm_softirq_done+0xe9/0x1e0 [dm_mod]
[  181.820046]  [<ffffffff813023b0>] blk_done_softirq+0xa0/0xd0
[  181.820046]  [<ffffffff8107e091>] __do_softirq+0x141/0x370
[  181.820046]  [<ffffffff8107e655>] irq_exit+0x125/0x130
[  181.820046]  [<ffffffff8104bd05>] smp_call_function_single_interrupt+0x35/0x40
[  181.820046]  [<ffffffff81696822>] call_function_single_interrupt+0x72/0x80
[  181.820046]  <EOI>  [<ffffffff810fecc9>] ? tick_nohz_idle_exit+0xc9/0x150
[  181.820046]  [<ffffffff810fecc5>] ? tick_nohz_idle_exit+0xc5/0x150
[  181.820046]  [<ffffffff810c0084>] cpu_startup_entry+0x194/0x420
[  181.820046]  [<ffffffff8104c6dd>] start_secondary+0x19d/0x210

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v3 0/8] dm: add request-based blk-mq support
  2014-12-19 15:38         ` Mike Snitzer
@ 2014-12-19 17:14           ` Mike Snitzer
  2014-12-22 15:28             ` Bart Van Assche
  0 siblings, 1 reply; 95+ messages in thread
From: Mike Snitzer @ 2014-12-19 17:14 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, hch, Keith Busch, device-mapper development, j-nomura

On Fri, Dec 19 2014 at 10:38am -0500,
Mike Snitzer <snitzer@redhat.com> wrote:

> On Fri, Dec 19 2014 at  9:32am -0500,
> Bart Van Assche <bvanassche@acm.org> wrote:
> 
> > On 12/18/14 00:06, Mike Snitzer wrote:
> > > So if you know someone with relevant blk-mq hardware who might benefit
> > > from blk-mq multipathing please point them at this code and have them
> > > report back!
> > 
> > Hello Mike,
> > 
> > Great to see that you are working on blk-mq multipathing. Unfortunately
> > a test with the SRP initiator and your dm-for-3.20-blk-mq tree merged
> > with Linus' latest tree was not successful. This is what was reported
> > when I tried to start multipathd (without call trace, followed by a
> > hard lockup):
> > 
> > =========================================================
> > [ INFO: possible irq lock inversion dependency detected ]
> > 3.18.0-debug+ #1 Tainted: G        W     
> > ---------------------------------------------------------
> > kdmwork-253:0/5347 just changed the state of lock:
> >  (&(&m->lock)->rlock){+.....}, at: [<ffffffffa080eb80>] __multipath_map.isra.15+0x40/0x1f0 [dm_multipath]
> > but this lock was taken by another, HARDIRQ-safe lock in the past:
> >  (&(&q->__queue_lock)->rlock){-.-...}
> >  
> > and interrupts could create inverse lock ordering between them.
> >  
> > other info that might help us debug this:
> >  Possible interrupt unsafe locking scenario:
> 
> This "dm: submit stacked requests in irq enabled context" commit
> https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-for-3.20-blk-mq&id=1844ba7e2e013fa38c45d646248c517eb363e26c
> 
> changed the locking needed in the multipath target.  I altered
> __multipath_map but didn't audit elsewhere.  I'll work through it.

Hi Bart,

This patch silences the lockdep inversion splat on my testbed: it takes
m->lock with interrupts disabled and, more importantly, drops the lock
before the blocking GFP_KERNEL allocation in blk_get_request().  But I'd
really appreciate it if you could see if it works for you since you hit
an actual hang:

diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
index 1fa6f14..4bfa3d9 100644
--- a/drivers/md/dm-mpath.c
+++ b/drivers/md/dm-mpath.c
@@ -390,7 +390,7 @@ static int __multipath_map(struct dm_target *ti, struct request *clone,
 	struct block_device *bdev;
 	struct dm_mpath_io *mpio;
 
-	spin_lock(&m->lock);
+	spin_lock_irq(&m->lock);
 
 	/* Do we need to select a new pgpath? */
 	if (!m->current_pgpath ||
@@ -412,8 +412,14 @@ static int __multipath_map(struct dm_target *ti, struct request *clone,
 		/* ENOMEM, requeue */
 		goto out_unlock;
 
+	mpio = map_context->ptr;
+	mpio->pgpath = pgpath;
+	mpio->nr_bytes = nr_bytes;
+
 	bdev = pgpath->path.dev->bdev;
 
+	spin_unlock_irq(&m->lock);
+
 	if (clone) {
 		/* Old request-based interface: allocated clone is passed in */
 		clone->q = bdev_get_queue(bdev);
@@ -425,21 +431,18 @@ static int __multipath_map(struct dm_target *ti, struct request *clone,
 					   rq_data_dir(rq), GFP_KERNEL);
 		if (IS_ERR(*__clone))
 			/* ENOMEM, requeue */
-			goto out_unlock;
+			return r;
 		(*__clone)->cmd_flags |= REQ_FAILFAST_TRANSPORT;
 	}
 
-	mpio = map_context->ptr;
-	mpio->pgpath = pgpath;
-	mpio->nr_bytes = nr_bytes;
 	if (pgpath->pg->ps.type->start_io)
 		pgpath->pg->ps.type->start_io(&pgpath->pg->ps,
 					      &pgpath->path,
 					      nr_bytes);
-	r = DM_MAPIO_REMAPPED;
+	return DM_MAPIO_REMAPPED;
 
 out_unlock:
-	spin_unlock(&m->lock);
+	spin_unlock_irq(&m->lock);
 
 	return r;
 }

^ permalink raw reply related	[flat|nested] 95+ messages in thread

* Re: [PATCH v3 0/8] dm: add request-based blk-mq support
  2014-12-19 17:14           ` Mike Snitzer
@ 2014-12-22 15:28             ` Bart Van Assche
  2014-12-22 18:49               ` Mike Snitzer
  0 siblings, 1 reply; 95+ messages in thread
From: Bart Van Assche @ 2014-12-22 15:28 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Jens Axboe, hch, Keith Busch, device-mapper development, j-nomura

On 12/19/14 18:14, Mike Snitzer wrote:
> On Fri, Dec 19 2014 at 10:38am -0500,
> Mike Snitzer <snitzer@redhat.com> wrote:
> 
>> On Fri, Dec 19 2014 at  9:32am -0500,
>> Bart Van Assche <bvanassche@acm.org> wrote:
>>
>>> On 12/18/14 00:06, Mike Snitzer wrote:
>>>> So if you know someone with relevant blk-mq hardware who might benefit
>>>> from blk-mq multipathing please point them at this code and have them
>>>> report back!
>>>
>>> Hello Mike,
>>>
>>> Great to see that you are working on blk-mq multipathing. Unfortunately
>>> a test with the SRP initiator and your dm-for-3.20-blk-mq tree merged
>>> with Linus' latest tree was not successful. This is what was reported
>>> when I tried to start multipathd (without call trace, followed by a
>>> hard lockup):
>>>
>>> =========================================================
>>> [ INFO: possible irq lock inversion dependency detected ]
>>> 3.18.0-debug+ #1 Tainted: G        W     
>>> ---------------------------------------------------------
>>> kdmwork-253:0/5347 just changed the state of lock:
>>>  (&(&m->lock)->rlock){+.....}, at: [<ffffffffa080eb80>] __multipath_map.isra.15+0x40/0x1f0 [dm_multipath]
>>> but this lock was taken by another, HARDIRQ-safe lock in the past:
>>>  (&(&q->__queue_lock)->rlock){-.-...}
>>>  
>>> and interrupts could create inverse lock ordering between them.
>>>  
>>> other info that might help us debug this:
>>>  Possible interrupt unsafe locking scenario:
>>
>> This "dm: submit stacked requests in irq enabled context" commit
>> https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-for-3.20-blk-mq&id=1844ba7e2e013fa38c45d646248c517eb363e26c
>>
>> changed the locking needed in the multipath target.  I altered
>> __multipath_map but didn't audit elsewhere.  I'll work through it.
> 
> Hi Bart,
> 
> This patch silences the lockdep inversion splat on my testbed, but I'd
> really appreciate it if you could see if it works for you since you hit
> an actual hang:
> 
> [ ... ]

Hello Mike,

Good news: with this patch my standard SRP multipath test ran fine for
several hours, after which I stopped the test. The only issue I hit
during this test is the one mentioned on
https://lkml.org/lkml/2014/10/29/523 but that's a bug in the e1000
driver that is not related to multipath.

Bart.

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v3 0/8] dm: add request-based blk-mq support
  2014-12-22 15:28             ` Bart Van Assche
@ 2014-12-22 18:49               ` Mike Snitzer
  2014-12-23 16:24                 ` Bart Van Assche
  0 siblings, 1 reply; 95+ messages in thread
From: Mike Snitzer @ 2014-12-22 18:49 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, hch, Keith Busch, device-mapper development, j-nomura

Hi Bart,

On Mon, Dec 22 2014 at 10:28am -0500,
Bart Van Assche <bvanassche@acm.org> wrote:
 
> Hello Mike,
> 
> Good news: with this patch my standard SRP multipath test ran fine for
> several hours, after which I stopped the test.

Great, thanks for testing!  Did you happen to look at the performance of
your testing?  If so, is it comparable/better/worse?

FYI, I went ahead and added your Tested-by: to this commit:
https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=for-next&id=e596370123136b37757f4c7d01fb6a6cba26452b

And I also staged the entire series in linux-next (for 3.20 at
earliest).  Still have much more testing to do on these changes but
it'll be good to let them soak in linux-next over the holiday break.

> The only issue I hit during this test is the one mentioned on
> https://lkml.org/lkml/2014/10/29/523 but that's a bug in the e1000
> driver that is not related to multipath.

OK, hopefully you'll hear back from the maintainers on that thread soon.

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v3 0/8] dm: add request-based blk-mq support
  2014-12-22 18:49               ` Mike Snitzer
@ 2014-12-23 16:24                 ` Bart Van Assche
  2014-12-23 17:13                   ` Mike Snitzer
  0 siblings, 1 reply; 95+ messages in thread
From: Bart Van Assche @ 2014-12-23 16:24 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Jens Axboe, hch, Keith Busch, device-mapper development, j-nomura

On 12/22/14 19:49, Mike Snitzer wrote:
> On Mon, Dec 22 2014 at 10:28am -0500,
> Bart Van Assche <bvanassche@acm.org> wrote:
>> Good news: with this patch my standard SRP multipath test ran fine for
>> several hours, after which I stopped the test.
> 
> Great, thanks for testing!  Did you happen to look at the performance of
> your testing?  If so, is it comparable/better/worse?

Hello Mike,

I have tried to run a performance comparison but after I had finished
the measurements without multipath and when I started multipathd I ran
into the following (transcribed from the console):

BUG: unable to handle kernel NULL pointer dereference at 0000318
IP: scsi_setup_cmnd+0xe8 [scsi_mod]
Workqueue: kblockd blk_mq_run_work_fn
Call Trace:
scsi_queue_rq+0x5a5
__blk_mq_run_hw_queue+0x1cb
blk_mq_run_work_fn+0xd
process_one_work+0x133
worker_thread+0x11b
kthread+0xcd

gdb translates the above address as follows:

(gdb) list *(scsi_setup_cmnd+0xe8)
0x9568 is in scsi_setup_cmnd (include/scsi/scsi_cmnd.h:155).
150     }
151
152     /* make sure not to use it with REQ_TYPE_BLOCK_PC commands */
153     static inline struct scsi_driver *scsi_cmd_to_driver(struct scsi_cmnd *cmd)
154     {
155             return *(struct scsi_driver **)cmd->request->rq_disk->private_data;
156     }
157
158     extern struct scsi_cmnd *scsi_get_command(struct scsi_device *, gfp_t);
159     extern void scsi_put_command(struct scsi_cmnd *);

Bart.

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v3 0/8] dm: add request-based blk-mq support
  2014-12-23 16:24                 ` Bart Van Assche
@ 2014-12-23 17:13                   ` Mike Snitzer
  2014-12-23 21:42                     ` Mike Snitzer
  0 siblings, 1 reply; 95+ messages in thread
From: Mike Snitzer @ 2014-12-23 17:13 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, Keith Busch, snitzer, device-mapper, hch,
	development, j-nomura


Sorry for top post but:

You'll likely fix this if you establish the cloned request's rq_disk member in the blk-mq branch of dm-mpath.c:__multipath_map()
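
(Concretely -- my guess at the one-liner, untested here: in the blk-mq
branch of __multipath_map(), after blk_get_request() succeeds, something
like

    (*__clone)->rq_disk = bdev->bd_disk;

so that scsi_cmd_to_driver() has a gendisk to reach through
cmd->request->rq_disk.)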

Mike


Bart Van Assche <bvanassche@acm.org> wrote:

On 12/22/14 19:49, Mike Snitzer wrote:
> On Mon, Dec 22 2014 at 10:28am -0500,
> Bart Van Assche <bvanassche@acm.org> wrote:
>> Good news: with this patch my standard SRP multipath test ran fine for
>> several hours, after which I stopped the test.
> 
> Great, thanks for testing!  Did you happen to look at the performance of
> your testing?  If so, is it comparable/better/worse?

Hello Mike,

I have tried to run a performance comparison but after I had finished
the measurements without multipath and when I started multipathd I ran
into the following (transcribed from the console):

BUG: unable to handle kernel NULL pointer dereference at 0000318
IP: scsi_setup_cmnd+0xe8 [scsi_mod]
Workqueue: kblockd blk_mq_run_work_fn
Call Trace:
scsi_queue_rq+0x5a5
__blk_mq_run_hw_queue+0x1cb
blk_mq_run_work_fn+0xd
process_one_work+0x133
worker_thread+0x11b
kthread+0xcd

gdb translates the above address as follows:

(gdb) list *(scsi_setup_cmnd+0xe8)
0x9568 is in scsi_setup_cmnd (include/scsi/scsi_cmnd.h:155).
150     }
151
152     /* make sure not to use it with REQ_TYPE_BLOCK_PC commands */
153     static inline struct scsi_driver *scsi_cmd_to_driver(struct scsi_cmnd *cmd)
154     {
155             return *(struct scsi_driver **)cmd->request->rq_disk->private_data;
156     }
157
158     extern struct scsi_cmnd *scsi_get_command(struct scsi_device *, gfp_t);
159     extern void scsi_put_command(struct scsi_cmnd *);

Bart.

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v3 0/8] dm: add request-based blk-mq support
  2014-12-23 17:13                   ` Mike Snitzer
@ 2014-12-23 21:42                     ` Mike Snitzer
  2014-12-24 13:02                       ` Bart Van Assche
  0 siblings, 1 reply; 95+ messages in thread
From: Mike Snitzer @ 2014-12-23 21:42 UTC (permalink / raw)
  To: device-mapper development
  Cc: Jens Axboe, Christoph Hellwig, Bart Van Assche, device-mapper,
	Keith Busch, Jun'ichi Nomura

On Tue, Dec 23, 2014 at 12:13 PM, Mike Snitzer <msnitzer@redhat.com> wrote:
> Sorry for top post but:
>
> You'll likely fix this if you establish the cloned request's rq_disk member
> in the blk-mq branch of dm-mpath.c:__multipath_map()

I've rebased with this fix folded into both the 'for-next' and
'dm-for-3.20-blk-mq' branches of linux-dm.git

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v3 0/8] dm: add request-based blk-mq support
  2014-12-23 21:42                     ` Mike Snitzer
@ 2014-12-24 13:02                       ` Bart Van Assche
  2014-12-24 18:21                         ` Mike Snitzer
  0 siblings, 1 reply; 95+ messages in thread
From: Bart Van Assche @ 2014-12-24 13:02 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Jens Axboe, Keith Busch, device-mapper, Christoph Hellwig,
	device-mapper development, Jun'ichi Nomura

On 12/23/14 22:42, Mike Snitzer wrote:
> I've rebased with this fix folded into both the 'for-next' and
> 'dm-for-3.20-blk-mq' branches of linux-dm.git

Thanks, that's appreciated. However, with this tree I ran into a
different issue:

BUG: unable to handle kernel NULL pointer dereference at 00000000000002a0
IP: [<ffffffff811b2d18>] blk_account_io_completion+0x48/0x90
PGD 8150b6067 PUD 8150b4067 PMD 0 
Oops: 0000 [#1] SMP 
Modules linked in: dm_queue_length dm_multipath ib_srp scsi_transport_srp netconsole configfs fuse ip6table_filter ip6_tables iptable_filter ip_tables ebtable_nat ebtables x_tables 8021q garp bridge stp llc rdma_ucm rdma_cm iw_cm af_packet ib_ipoib ib_cm ib_uverbs ib_umad mlx4_en mlx4_ib ib_sa ib_mad ib_core ib_addr snd_hda_codec_realtek snd_hda_codec_generic snd_hda_codec_hdmi snd_hda_intel snd_hda_controller snd_hda_codec x86_pkg_temp_thermal coretemp snd_hwdep kvm_intel snd_pcm kvm snd_seq snd_seq_device crct10dif_pclmul snd_timer crc32c_intel e1000e mlx4_core sr_mod snd sb_edac xhci_pci edac_core cdrom microcode xhci_hcd pcspkr ptp lpc_ich mfd_core soundcore i2c_i801 pps_core wmi button sg dm_mod autofs4 ext4 crc16 mbcache jbd2 sd_mod hid_generic usbhid hid radeon i2c_algo_bit drm_kms_helper ttm ahci libahci libata drm ehci_pci ehci_hcd usbcore agpgart usb_common processor thermal_sys hwmon scsi_dh_alua scsi_dh scsi_mod
CPU: 10 PID: 0 Comm: swapper/10 Not tainted 3.19.0-rc1+ #1
Hardware name: MSI MS-7737/Big Bang-XPower II (MS-7737), BIOS V1.5 10/16/2012
task: ffff88085c145010 ti: ffff88085c2c4000 task.ti: ffff88085c2c4000
RIP: 0010:[<ffffffff811b2d18>]  [<ffffffff811b2d18>] blk_account_io_completion+0x48/0x90
RSP: 0018:ffff88085fd43df8  EFLAGS: 00010046
RAX: 0000000000000000 RBX: ffff880806c60000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: ffff88085fd43df8 R08: 000000000000000a R09: 0000000000000002
R10: 000000000000004e R11: 0000000000000000 R12: ffff88080844c800
R13: ffff880806c60170 R14: 0000000000000000 R15: ffff880806c60000
FS:  0000000000000000(0000) GS:ffff88085fd40000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000000002a0 CR3: 00000008150b9000 CR4: 00000000000407e0
Stack:
 ffff88085fd43e48 ffffffff811b2d9a ffff88085fd43e08 000000005fd43e08
 ffff88085fd43e48 ffff880806c60000 ffff88080844c800 ffff880806c60170
 0000000000000000 ffff880834fe8008 ffff88085fd43e98 ffffffffa000902f
Call Trace:
 <IRQ> 
 [<ffffffff811b2d9a>] blk_update_request+0x3a/0x300
 [<ffffffffa000902f>] scsi_end_request+0x2f/0x1e0 [scsi_mod]
 [<ffffffffa000b451>] scsi_io_completion+0x101/0x690 [scsi_mod]
 [<ffffffffa000124a>] scsi_finish_command+0xca/0x130 [scsi_mod]
 [<ffffffffa000abff>] scsi_softirq_done+0x12f/0x160 [scsi_mod]
 [<ffffffff811ba1de>] __blk_mq_complete_request_remote+0xe/0x10
 [<ffffffff8109ea6d>] generic_smp_call_function_single_interrupt+0x5d/0x150
 [<ffffffff8102c382>] smp_call_function_single_interrupt+0x22/0x40
 [<ffffffff813e925a>] call_function_single_interrupt+0x6a/0x70
 <EOI> 
 [<ffffffff812f0b15>] ? cpuidle_enter_state+0x55/0xc0
 [<ffffffff812f0b07>] ? cpuidle_enter_state+0x47/0xc0
 [<ffffffff812f0c32>] cpuidle_enter+0x12/0x20
 [<ffffffff81078bcc>] cpu_startup_entry+0x22c/0x2c0
 [<ffffffff8102ca1d>] start_secondary+0x14d/0x170

(gdb) list *(blk_account_io_completion+0x48)
0xffffffff811b2d18 is in blk_account_io_completion (block/blk-core.c:2114).
2109                    struct hd_struct *part;
2110                    int cpu;
2111
2112                    cpu = part_stat_lock();
2113                    part = req->part;
2114                    part_stat_add(cpu, part, sectors[rw], bytes >> 9);
2115                    part_stat_unlock();
2116            }
2117    }
2118

Bart.

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v3 0/8] dm: add request-based blk-mq support
  2014-12-24 13:02                       ` Bart Van Assche
@ 2014-12-24 18:21                         ` Mike Snitzer
  2014-12-24 18:55                           ` Mike Snitzer
  0 siblings, 1 reply; 95+ messages in thread
From: Mike Snitzer @ 2014-12-24 18:21 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, Christoph Hellwig, Keith Busch,
	device-mapper development, Jun'ichi Nomura

On Wed, Dec 24 2014 at  8:02am -0500,
Bart Van Assche <bvanassche@acm.org> wrote:

> On 12/23/14 22:42, Mike Snitzer wrote:
> > I've rebased with this fix folded into both the 'for-next' and
> > 'dm-for-3.20-blk-mq' branches of linux-dm.git
> 
> Thanks, that's appreciated. However, with this tree I ran into a
> different issue:
> 
> BUG: unable to handle kernel NULL pointer dereference at 00000000000002a0
> IP: [<ffffffff811b2d18>] blk_account_io_completion+0x48/0x90
> PGD 8150b6067 PUD 8150b4067 PMD 0 
> Oops: 0000 [#1] SMP 
...
> Call Trace:
>  <IRQ> 
>  [<ffffffff811b2d9a>] blk_update_request+0x3a/0x300
>  [<ffffffffa000902f>] scsi_end_request+0x2f/0x1e0 [scsi_mod]
>  [<ffffffffa000b451>] scsi_io_completion+0x101/0x690 [scsi_mod]
>  [<ffffffffa000124a>] scsi_finish_command+0xca/0x130 [scsi_mod]
>  [<ffffffffa000abff>] scsi_softirq_done+0x12f/0x160 [scsi_mod]
>  [<ffffffff811ba1de>] __blk_mq_complete_request_remote+0xe/0x10
>  [<ffffffff8109ea6d>] generic_smp_call_function_single_interrupt+0x5d/0x150
>  [<ffffffff8102c382>] smp_call_function_single_interrupt+0x22/0x40
>  [<ffffffff813e925a>] call_function_single_interrupt+0x6a/0x70
>  <EOI> 
>  [<ffffffff812f0b15>] ? cpuidle_enter_state+0x55/0xc0
>  [<ffffffff812f0b07>] ? cpuidle_enter_state+0x47/0xc0
>  [<ffffffff812f0c32>] cpuidle_enter+0x12/0x20
>  [<ffffffff81078bcc>] cpu_startup_entry+0x22c/0x2c0
>  [<ffffffff8102ca1d>] start_secondary+0x14d/0x170
> 
> (gdb) list *(blk_account_io_completion+0x48)
> 0xffffffff811b2d18 is in blk_account_io_completion (block/blk-core.c:2114).
> 2109                    struct hd_struct *part;
> 2110                    int cpu;
> 2111
> 2112                    cpu = part_stat_lock();
> 2113                    part = req->part;
> 2114                    part_stat_add(cpu, part, sectors[rw], bytes >> 9);
> 2115                    part_stat_unlock();
> 2116            }
> 2117    }
> 2118

This is odd considering blk-mq defaults to setting QUEUE_FLAG_IO_STAT,
so each request will have REQ_IO_STAT set.

I'm not sure what would account for this NULL pointer (my code appears
to be slightly different than yours but AFAICT req->part is NULL in your
crash, which shouldn't ever happen if IO stats are enabled).

Are you manually enabling/disabling IO stats via sysfs at all?

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v3 0/8] dm: add request-based blk-mq support
  2014-12-24 18:21                         ` Mike Snitzer
@ 2014-12-24 18:55                           ` Mike Snitzer
  2014-12-24 19:26                             ` Mike Snitzer
  0 siblings, 1 reply; 95+ messages in thread
From: Mike Snitzer @ 2014-12-24 18:55 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, Christoph Hellwig, Keith Busch,
	device-mapper development, Jun'ichi Nomura

On Wed, Dec 24 2014 at  1:21pm -0500,
Mike Snitzer <snitzer@redhat.com> wrote:

> This is odd considering blk-mq defaults to setting QUEUE_FLAG_IO_STAT,
> so each request will have REQ_IO_STAT set.
> 
> I'm not sure what would account for this NULL pointer (my code appears
> to be slightly different than yours but AFAICT req->part is NULL in your
> crash, which shouldn't ever happen if IO stats are enabled).
> 
> Are you manually enabling/disabling IO stats via sysfs at all?

Answering my own question: unlikely, considering I was able to
reproduce merely by creating an mpath device on top of a blk-mq device
(now that dm-mpath.c:__multipath_map sets rq->rq_disk).

I'll try to get to the bottom of it (it still seems pretty weird to me).

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v3 0/8] dm: add request-based blk-mq support
  2014-12-24 18:55                           ` Mike Snitzer
@ 2014-12-24 19:26                             ` Mike Snitzer
  2015-01-02 17:53                               ` Bart Van Assche
  0 siblings, 1 reply; 95+ messages in thread
From: Mike Snitzer @ 2014-12-24 19:26 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, Christoph Hellwig, Keith Busch,
	device-mapper development, Jun'ichi Nomura

On Wed, Dec 24 2014 at  1:55pm -0500,
Mike Snitzer <snitzer@redhat.com> wrote:

> On Wed, Dec 24 2014 at  1:21pm -0500,
> Mike Snitzer <snitzer@redhat.com> wrote:
> 
> > This is odd considering blk-mq defaults to setting QUEUE_FLAG_IO_STAT,
> > so each request will have REQ_IO_STAT set.
> > 
> > I'm not sure what would account for this NULL pointer (my code appears
> > to be slightly different than yours but AFAICT req->part is NULL in your
> > crash, which shouldn't ever happen if IO stats are enabled).
> > 
> > Are you manually enabling/disabling IO stats via sysfs at all?
> 
> Answering my own question: unlikely.  Considering I was able to
> reproduce merely by creating an mpath device ontop of a blk-mq device
> (now that dm-mpath.c:__multipath_map sets rq->rq_disk).
> 
> I'll try to get to the bottom of it (it still seems pretty weird to me).

This fixes it.  blk_insert_cloned_request() inserts the clone straight
into the blk-mq queue, bypassing the submission path that would normally
call blk_account_io_start() and set rq->part, so completion accounting
was dereferencing a NULL part:

diff --git a/block/blk-core.c b/block/blk-core.c
index cdd84e9..138ffb2 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2030,6 +2030,8 @@ int blk_insert_cloned_request(struct request_queue *q, struct request *rq)
 		return -EIO;
 
 	if (q->mq_ops) {
+		if (blk_queue_io_stat(rq->q))
+			blk_account_io_start(rq, true);
 		blk_mq_insert_request(rq, false, true, true);
 		return 0;
 	}

I've folded this fix into this commit (and rebased the 'for-next' and
'dm-for-3.20-blk-mq' branches):
https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=for-next&id=1fd5e9c83c4ae6a5144783855e9b29a8f42bdc4a

^ permalink raw reply related	[flat|nested] 95+ messages in thread

* Re: [PATCH v3 0/8] dm: add request-based blk-mq support
  2014-12-24 19:26                             ` Mike Snitzer
@ 2015-01-02 17:53                               ` Bart Van Assche
  2015-01-05 21:35                                 ` Mike Snitzer
  0 siblings, 1 reply; 95+ messages in thread
From: Bart Van Assche @ 2015-01-02 17:53 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Jens Axboe, Christoph Hellwig, Keith Busch,
	device-mapper development, Jun'ichi Nomura

On 12/24/14 20:26, Mike Snitzer wrote:
> This fixes it:
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index cdd84e9..138ffb2 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -2030,6 +2030,8 @@ int blk_insert_cloned_request(struct request_queue *q, struct request *rq)
>  		return -EIO;
>  
>  	if (q->mq_ops) {
> +		if (blk_queue_io_stat(rq->q))
> +			blk_account_io_start(rq, true);
>  		blk_mq_insert_request(rq, false, true, true);
>  		return 0;
>  	}
> 
> I've folded this fix into this commit (and rebased the 'for-next' and
> 'dm-for-3.20-blk-mq' branches):
> https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=for-next&id=1fd5e9c83c4ae6a5144783855e9b29a8f42bdc4a

Hello Mike,

Thanks, my tests confirm that this patch indeed fixes the issue I had
reported. Unfortunately this doesn't mean that the blk-mq multipath code
is already working perfectly. Most of the time I/O requests are
processed within the expected time but sometimes I/O processing takes
much more time than what I expected:

# /usr/bin/time -f %e mkfs.xfs -f /dev/dm-0 >/dev/null
0.02
# /usr/bin/time -f %e mkfs.xfs -f /dev/dm-0 >/dev/null
0.02
# /usr/bin/time -f %e mkfs.xfs -f /dev/dm-0 >/dev/null
8.68

However, if I run the same command on the underlying device it always
completes within the expected time.

Bart.

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v3 0/8] dm: add request-based blk-mq support
  2015-01-02 17:53                               ` Bart Van Assche
@ 2015-01-05 21:35                                 ` Mike Snitzer
  2015-01-06  8:59                                   ` Christoph Hellwig
  2015-01-06  9:31                                   ` Bart Van Assche
  0 siblings, 2 replies; 95+ messages in thread
From: Mike Snitzer @ 2015-01-05 21:35 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, Christoph Hellwig, Keith Busch,
	device-mapper development, Jun'ichi Nomura

On Fri, Jan 02 2015 at 12:53pm -0500,
Bart Van Assche <bvanassche@acm.org> wrote:

> On 12/24/14 20:26, Mike Snitzer wrote:
> > This fixes it:
> > 
> > diff --git a/block/blk-core.c b/block/blk-core.c
> > index cdd84e9..138ffb2 100644
> > --- a/block/blk-core.c
> > +++ b/block/blk-core.c
> > @@ -2030,6 +2030,8 @@ int blk_insert_cloned_request(struct request_queue *q, struct request *rq)
> >  		return -EIO;
> >  
> >  	if (q->mq_ops) {
> > +		if (blk_queue_io_stat(rq->q))
> > +			blk_account_io_start(rq, true);
> >  		blk_mq_insert_request(rq, false, true, true);
> >  		return 0;
> >  	}
> > 
> > I've folded this fix into this commit (and rebased the 'for-next' and
> > 'dm-for-3.20-blk-mq' branches):
> > https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=for-next&id=1fd5e9c83c4ae6a5144783855e9b29a8f42bdc4a
> 
> Hello Mike,
> 
> Thanks, my tests confirm that this patch indeed fixes the issue I had
> reported. Unfortunately this doesn't mean that the blk-mq multipath code
> is already working perfectly. Most of the time I/O requests are
> processed within the expected time but sometimes I/O processing takes
> much more time than what I expected:
>
> # /usr/bin/time -f %e mkfs.xfs -f /dev/dm-0 >/dev/null
> 0.02
> # /usr/bin/time -f %e mkfs.xfs -f /dev/dm-0 >/dev/null
> 0.02
> # /usr/bin/time -f %e mkfs.xfs -f /dev/dm-0 >/dev/null
> 8.68
> 
> However, if I run the same command on the underlying device it always
> completes within the expected time.

I don't have very large blk-mq devices, but I can work on that.
How large is the blk-mq device in question?

Also, how much memory does the system have?  Is memory fragmented at
all?  With this change the requests are cloned using memory allocated
from block core's blk_get_request (rather than a dedicated mempool in DM
core).

Any chance you could use 'perf record' to try to analyze where the
kernel is spending its time?

Thanks for your continued help in testing these changes.

Mike

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v3 0/8] dm: add request-based blk-mq support
  2015-01-05 21:35                                 ` Mike Snitzer
@ 2015-01-06  8:59                                   ` Christoph Hellwig
  2015-01-06  9:31                                   ` Bart Van Assche
  1 sibling, 0 replies; 95+ messages in thread
From: Christoph Hellwig @ 2015-01-06  8:59 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Jens Axboe, Keith Busch, Bart Van Assche, Christoph Hellwig,
	device-mapper development, Jun'ichi Nomura

On Mon, Jan 05, 2015 at 04:35:57PM -0500, Mike Snitzer wrote:
> I don't have very large blk-mq devices, but I can work on that.

Just export a large sparse file over an iSCSI target?

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v3 0/8] dm: add request-based blk-mq support
  2015-01-05 21:35                                 ` Mike Snitzer
  2015-01-06  8:59                                   ` Christoph Hellwig
@ 2015-01-06  9:31                                   ` Bart Van Assche
  2015-01-06 16:05                                     ` blk-mq request allocation stalls [was: Re: [PATCH v3 0/8] dm: add request-based blk-mq support] Mike Snitzer
  1 sibling, 1 reply; 95+ messages in thread
From: Bart Van Assche @ 2015-01-06  9:31 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Jens Axboe, Christoph Hellwig, Keith Busch,
	device-mapper development, Jun'ichi Nomura

On 01/05/15 22:35, Mike Snitzer wrote:
> On Fri, Jan 02 2015 at 12:53pm -0500,
> Bart Van Assche <bvanassche@acm.org> wrote:
>> Thanks, my tests confirm that this patch indeed fixes the issue I had
>> reported. Unfortunately this doesn't mean that the blk-mq multipath code
>> is already working perfectly. Most of the time I/O requests are
>> processed within the expected time but sometimes I/O processing takes
>> much more time than I expected:
>>
>> # /usr/bin/time -f %e mkfs.xfs -f /dev/dm-0 >/dev/null
>> 0.02
>> # /usr/bin/time -f %e mkfs.xfs -f /dev/dm-0 >/dev/null
>> 0.02
>> # /usr/bin/time -f %e mkfs.xfs -f /dev/dm-0 >/dev/null
>> 8.68
>>
>> However, if I run the same command on the underlying device it always
>> completes within the expected time.
> 
> I don't have very large blk-mq devices, but I can work on that.
> How large is the blk-mq device in question?
> 
> Also, how much memory does the system have?  Is memory fragmented at
> all?  With this change the requests are cloned using memory allocated
> from block core's blk_get_request (rather than a dedicated mempool in DM
> core).
> 
> Any chance you could use 'perf record' to try to analyze where the
> kernel is spending its time?

Hello Mike,

The device used in this test was a tmpfs file with a size of 16 MB. That
file had been created as follows: dd if=/dev/zero of=/dev/vdisk bs=1M
count=16. The initiator and target systems did have enough memory to keep
this tmpfs file in RAM all the time (32 GB and 4 GB respectively).

For the runs that took much longer than expected the CPU load was low.
This probably means that the system was waiting for one or another I/O
timer to expire. The output triggered by "echo w > /proc/sysrq-trigger"
during a run that took longer than expected was as follows:

SysRq : Show Blocked State
  task                        PC stack   pid father
kdmwork-253:0   D ffff8807c1fd3b78     0 10396      2 0x00000000
 ffff8807c1fd3b78 ffff88083b6b6cc0 0000000000012ec0 ffff8807c1fd3fd8
 0000000000012ec0 ffff880824225aa0 ffff88083b6b6cc0 ffff88081b0cb2c0
 ffff88085fc537c8 ffff8807c1fd3c98 ffff8807f7a99d70 ffffe8ffffc43bc0
Call Trace:
 [<ffffffff814d5230>] io_schedule+0xa0/0x130
 [<ffffffff8125a3f7>] bt_get+0x117/0x1b0
 [<ffffffff81256580>] ? blk_mq_queue_enter+0x30/0x2a0
 [<ffffffff81094cf0>] ? prepare_to_wait_event+0x110/0x110
 [<ffffffff8125a76f>] blk_mq_get_tag+0x9f/0xd0
 [<ffffffff8125591b>] __blk_mq_alloc_request+0x1b/0x210
 [<ffffffff812571c9>] blk_mq_alloc_request+0x139/0x150
 [<ffffffff8124c16e>] blk_get_request+0x2e/0xe0
 [<ffffffff8109a60d>] ? trace_hardirqs_on+0xd/0x10
 [<ffffffffa07f7d0f>] __multipath_map.isra.15+0x1cf/0x210 [dm_multipath]
 [<ffffffffa07f7d6a>] multipath_clone_and_map+0x1a/0x20 [dm_multipath]
 [<ffffffffa039dbb5>] map_tio_request+0x1d5/0x3a0 [dm_mod]
 [<ffffffff8109a53d>] ? trace_hardirqs_on_caller+0xfd/0x1c0
 [<ffffffff81075cbe>] kthread_worker_fn+0x7e/0x1b0
 [<ffffffff81075c40>] ? __init_kthread_worker+0x60/0x60
 [<ffffffff81075bc8>] kthread+0xf8/0x110
 [<ffffffff81075ad0>] ? kthread_create_on_node+0x210/0x210
 [<ffffffff814dacac>] ret_from_fork+0x7c/0xb0
 [<ffffffff81075ad0>] ? kthread_create_on_node+0x210/0x210
dmraid          D ffff8807f4cafc88     0 25099  25064 0x00000000
 ffff8807f4cafc88 ffff8807c0b52440 0000000000012ec0 ffff8807f4caffd8
 0000000000012ec0 ffffffff81a194e0 ffff8807c0b52440 ffff8807c09ec1c0
 ffff88085fc137c8 ffff88085ff8ce38 ffff8807f4cafd30 0000000000000082
Call Trace:
 [<ffffffff814d5990>] ? bit_wait+0x50/0x50
 [<ffffffff814d5230>] io_schedule+0xa0/0x130
 [<ffffffff814d59bc>] bit_wait_io+0x2c/0x50
 [<ffffffff814d578b>] __wait_on_bit_lock+0x4b/0xb0
 [<ffffffff8113b45a>] __lock_page_killable+0x9a/0xa0
 [<ffffffff81094d30>] ? autoremove_wake_function+0x40/0x40
 [<ffffffff8113da78>] generic_file_read_iter+0x408/0x640
 [<ffffffff8109a60d>] ? trace_hardirqs_on+0xd/0x10
 [<ffffffff811d5f57>] blkdev_read_iter+0x37/0x40
 [<ffffffff8119866e>] new_sync_read+0x7e/0xb0
 [<ffffffff81199858>] __vfs_read+0x18/0x50
 [<ffffffff81199916>] vfs_read+0x86/0x140
 [<ffffffff81199a19>] SyS_read+0x49/0xb0
 [<ffffffff814dad52>] system_call_fastpath+0x12/0x17

Bart.

^ permalink raw reply	[flat|nested] 95+ messages in thread

* blk-mq request allocation stalls [was: Re: [PATCH v3 0/8] dm: add request-based blk-mq support]
  2015-01-06  9:31                                   ` Bart Van Assche
@ 2015-01-06 16:05                                     ` Mike Snitzer
  2015-01-06 16:15                                       ` Jens Axboe
  0 siblings, 1 reply; 95+ messages in thread
From: Mike Snitzer @ 2015-01-06 16:05 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe
  Cc: Christoph Hellwig, Keith Busch, device-mapper development,
	Jun'ichi Nomura

On Tue, Jan 06 2015 at  4:31am -0500,
Bart Van Assche <bvanassche@acm.org> wrote:

> On 01/05/15 22:35, Mike Snitzer wrote:
> > On Fri, Jan 02 2015 at 12:53pm -0500,
> > Bart Van Assche <bvanassche@acm.org> wrote:
> >> Thanks, my tests confirm that this patch indeed fixes the issue I had
> >> reported. Unfortunately this doesn't mean that the blk-mq multipath code
> >> is already working perfectly. Most of the time I/O requests are
> >> processed within the expected time but sometimes I/O processing takes
> >> much more time than I expected:
> >>
> >> # /usr/bin/time -f %e mkfs.xfs -f /dev/dm-0 >/dev/null
> >> 0.02
> >> # /usr/bin/time -f %e mkfs.xfs -f /dev/dm-0 >/dev/null
> >> 0.02
> >> # /usr/bin/time -f %e mkfs.xfs -f /dev/dm-0 >/dev/null
> >> 8.68
> >>
> >> However, if I run the same command on the underlying device it always
> >> completes within the expected time.
> > 
> > I don't have very large blk-mq devices, but I can work on that.
> > How large is the blk-mq device in question?
> > 
> > Also, how much memory does the system have?  Is memory fragmented at
> > all?  With this change the requests are cloned using memory allocated
> > from block core's blk_get_request (rather than a dedicated mempool in DM
> > core).
> > 
> > Any chance you could use 'perf record' to try to analyze where the
> > kernel is spending its time?
> 
> Hello Mike,
> 
> The device used in this test was a tmpfs file with a size of 16 MB. That
> file had been created as follows: dd if=/dev/zero of=/dev/vdisk bs=1M
> count=16. The initiator and target systems did have enough memory to keep
> this tmpfs file in RAM all the time (32 GB and 4 GB respectively).
> 
> For the runs that took much longer than expected the CPU load was low.
> This probably means that the system was waiting for one or another I/O
> timer to expire. The output triggered by "echo w > /proc/sysrq-trigger"
> during a run that took longer than expected was as follows:
> 
> SysRq : Show Blocked State
>   task                        PC stack   pid father
> kdmwork-253:0   D ffff8807c1fd3b78     0 10396      2 0x00000000
>  ffff8807c1fd3b78 ffff88083b6b6cc0 0000000000012ec0 ffff8807c1fd3fd8
>  0000000000012ec0 ffff880824225aa0 ffff88083b6b6cc0 ffff88081b0cb2c0
>  ffff88085fc537c8 ffff8807c1fd3c98 ffff8807f7a99d70 ffffe8ffffc43bc0
> Call Trace:
>  [<ffffffff814d5230>] io_schedule+0xa0/0x130
>  [<ffffffff8125a3f7>] bt_get+0x117/0x1b0
>  [<ffffffff81256580>] ? blk_mq_queue_enter+0x30/0x2a0
>  [<ffffffff81094cf0>] ? prepare_to_wait_event+0x110/0x110
>  [<ffffffff8125a76f>] blk_mq_get_tag+0x9f/0xd0
>  [<ffffffff8125591b>] __blk_mq_alloc_request+0x1b/0x210
>  [<ffffffff812571c9>] blk_mq_alloc_request+0x139/0x150
>  [<ffffffff8124c16e>] blk_get_request+0x2e/0xe0
>  [<ffffffff8109a60d>] ? trace_hardirqs_on+0xd/0x10
>  [<ffffffffa07f7d0f>] __multipath_map.isra.15+0x1cf/0x210 [dm_multipath]
>  [<ffffffffa07f7d6a>] multipath_clone_and_map+0x1a/0x20 [dm_multipath]
>  [<ffffffffa039dbb5>] map_tio_request+0x1d5/0x3a0 [dm_mod]
>  [<ffffffff8109a53d>] ? trace_hardirqs_on_caller+0xfd/0x1c0
>  [<ffffffff81075cbe>] kthread_worker_fn+0x7e/0x1b0
>  [<ffffffff81075c40>] ? __init_kthread_worker+0x60/0x60
>  [<ffffffff81075bc8>] kthread+0xf8/0x110
>  [<ffffffff81075ad0>] ? kthread_create_on_node+0x210/0x210
>  [<ffffffff814dacac>] ret_from_fork+0x7c/0xb0
>  [<ffffffff81075ad0>] ? kthread_create_on_node+0x210/0x210

Jens,

This stack trace confirms my suspicion that switching DM-multipath over
to allocating clone requests via blk_get_request (rather than using a
dedicated mempool in DM core) is the cause of the slowdown that Bart has
experienced.

Given blk_mq_get_tag() looks to be the culprit, is there anything we can
do to speed up blk-mq request allocation?  I'm currently using
GFP_KERNEL when calling blk_get_request().

Mike

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: blk-mq request allocation stalls [was: Re: [PATCH v3 0/8] dm: add request-based blk-mq support]
  2015-01-06 16:05                                     ` blk-mq request allocation stalls [was: Re: [PATCH v3 0/8] dm: add request-based blk-mq support] Mike Snitzer
@ 2015-01-06 16:15                                       ` Jens Axboe
  2015-01-07 10:33                                         ` Bart Van Assche
  0 siblings, 1 reply; 95+ messages in thread
From: Jens Axboe @ 2015-01-06 16:15 UTC (permalink / raw)
  To: Mike Snitzer, Bart Van Assche
  Cc: Christoph Hellwig, Keith Busch, device-mapper development,
	Jun'ichi Nomura

On 01/06/2015 09:05 AM, Mike Snitzer wrote:
> On Tue, Jan 06 2015 at  4:31am -0500,
> Bart Van Assche <bvanassche@acm.org> wrote:
>
>> On 01/05/15 22:35, Mike Snitzer wrote:
>>> On Fri, Jan 02 2015 at 12:53pm -0500,
>>> Bart Van Assche <bvanassche@acm.org> wrote:
>>>> Thanks, my tests confirm that this patch indeed fixes the issue I had
>>>> reported. Unfortunately this doesn't mean that the blk-mq multipath code
>>>> is already working perfectly. Most of the time I/O requests are
>>>> processed within the expected time but sometimes I/O processing takes
>>>> much more time than I expected:
>>>>
>>>> # /usr/bin/time -f %e mkfs.xfs -f /dev/dm-0 >/dev/null
>>>> 0.02
>>>> # /usr/bin/time -f %e mkfs.xfs -f /dev/dm-0 >/dev/null
>>>> 0.02
>>>> # /usr/bin/time -f %e mkfs.xfs -f /dev/dm-0 >/dev/null
>>>> 8.68
>>>>
>>>> However, if I run the same command on the underlying device it always
>>>> completes within the expected time.
>>>
>>> I don't have very large blk-mq devices, but I can work on that.
>>> How large is the blk-mq device in question?
>>>
>>> Also, how much memory does the system have?  Is memory fragmented at
>>> all?  With this change the requests are cloned using memory allocated
>>> from block core's blk_get_request (rather than a dedicated mempool in DM
>>> core).
>>>
>>> Any chance you could use 'perf record' to try to analyze where the
>>> kernel is spending its time?
>>
>> Hello Mike,
>>
>> The device used in this test was a tmpfs file with a size of 16 MB. That
>> file had been created as follows: dd if=/dev/zero of=/dev/vdisk bs=1M
>> count=16. The initiator and target systems did have enough memory to keep
>> this tmpfs file in RAM all the time (32 GB and 4 GB respectively).
>>
>> For the runs that took much longer than expected the CPU load was low.
>> This probably means that the system was waiting for one or another I/O
>> timer to expire. The output triggered by "echo w > /proc/sysrq-trigger"
>> during a run that took longer than expected was as follows:
>>
>> SysRq : Show Blocked State
>>    task                        PC stack   pid father
>> kdmwork-253:0   D ffff8807c1fd3b78     0 10396      2 0x00000000
>>   ffff8807c1fd3b78 ffff88083b6b6cc0 0000000000012ec0 ffff8807c1fd3fd8
>>   0000000000012ec0 ffff880824225aa0 ffff88083b6b6cc0 ffff88081b0cb2c0
>>   ffff88085fc537c8 ffff8807c1fd3c98 ffff8807f7a99d70 ffffe8ffffc43bc0
>> Call Trace:
>>   [<ffffffff814d5230>] io_schedule+0xa0/0x130
>>   [<ffffffff8125a3f7>] bt_get+0x117/0x1b0
>>   [<ffffffff81256580>] ? blk_mq_queue_enter+0x30/0x2a0
>>   [<ffffffff81094cf0>] ? prepare_to_wait_event+0x110/0x110
>>   [<ffffffff8125a76f>] blk_mq_get_tag+0x9f/0xd0
>>   [<ffffffff8125591b>] __blk_mq_alloc_request+0x1b/0x210
>>   [<ffffffff812571c9>] blk_mq_alloc_request+0x139/0x150
>>   [<ffffffff8124c16e>] blk_get_request+0x2e/0xe0
>>   [<ffffffff8109a60d>] ? trace_hardirqs_on+0xd/0x10
>>   [<ffffffffa07f7d0f>] __multipath_map.isra.15+0x1cf/0x210 [dm_multipath]
>>   [<ffffffffa07f7d6a>] multipath_clone_and_map+0x1a/0x20 [dm_multipath]
>>   [<ffffffffa039dbb5>] map_tio_request+0x1d5/0x3a0 [dm_mod]
>>   [<ffffffff8109a53d>] ? trace_hardirqs_on_caller+0xfd/0x1c0
>>   [<ffffffff81075cbe>] kthread_worker_fn+0x7e/0x1b0
>>   [<ffffffff81075c40>] ? __init_kthread_worker+0x60/0x60
>>   [<ffffffff81075bc8>] kthread+0xf8/0x110
>>   [<ffffffff81075ad0>] ? kthread_create_on_node+0x210/0x210
>>   [<ffffffff814dacac>] ret_from_fork+0x7c/0xb0
>>   [<ffffffff81075ad0>] ? kthread_create_on_node+0x210/0x210
>
> Jens,
>
> This stack trace confirms my suspicion that switching DM-multipath over
> to allocating clone requests via blk_get_request (rather than using a
> dedicated mempool in DM core) is the cause of the slowdown that Bart has
> experienced.
>
> Given blk_mq_get_tag() looks to be the culprit, is there anything we can
> do to speed up blk-mq request allocation?  I'm currently using
> GFP_KERNEL when calling blk_get_request().

blk-mq request allocation is pretty much as optimized/fast as it can be. 
The slowdown must be due to one of two reasons:

- A bug related to running out of requests, perhaps a missing queue run 
or something like that.
- A smaller number of available requests, due to the requested queue depth.

Looking at Bart's results, it looks like it's usually fast, but sometimes
very slow. That would seem to indicate it's option #1 above that is the 
issue. Bart, since this seems to wait for quite a bit, would it be 
possible to cat the 'tags' file for that queue when it is stuck like that?

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: blk-mq request allocation stalls [was: Re: [PATCH v3 0/8] dm: add request-based blk-mq support]
  2015-01-06 16:15                                       ` Jens Axboe
@ 2015-01-07 10:33                                         ` Bart Van Assche
  2015-01-07 15:32                                           ` Jens Axboe
  2015-01-07 20:40                                           ` Keith Busch
  0 siblings, 2 replies; 95+ messages in thread
From: Bart Van Assche @ 2015-01-07 10:33 UTC (permalink / raw)
  To: Jens Axboe, Mike Snitzer
  Cc: Christoph Hellwig, Keith Busch, device-mapper development,
	Jun'ichi Nomura

On 01/06/15 17:15, Jens Axboe wrote:
> blk-mq request allocation is pretty much as optimized/fast as it can be.
> The slowdown must be due to one of two reasons:
>
> - A bug related to running out of requests, perhaps a missing queue run
> or something like that.
> - A smaller number of available requests, due to the requested queue depth.
>
> Looking at Barts results, it looks like it's usually fast, but sometimes
> very slow. That would seem to indicate it's option #1 above that is the
> issue. Bart, since this seems to wait for quite a bit, would it be
> possible to cat the 'tags' file for that queue when it is stuck like that?

Hello Jens,

Thanks for the assistance. Is this the output you were looking for ?

# dmsetup table /dev/dm-1
0 256000 multipath 0 0 2 1 service-time 0 1 2 8:32 1 1 service-time 0 1 
2 8:48 1 1

# ls -ld /dev/sd[cd]
brw-rw---- 1 root disk 8, 32 Jan  7 11:16 /dev/sdc
brw-rw---- 1 root disk 8, 48 Jan  7 11:16 /dev/sdd

# time mkfs.xfs -f /dev/dm-1 &
[ ... ]
real    4m12.101s

# for d in sdc sdd; do echo ==== $d; (cd /sys/block/$d/mq &&
   find|cut -c3-|grep tag|xargs grep -aH ''); done
==== sdc
0/tags:nr_tags=62, reserved_tags=0, bits_per_word=3
0/tags:nr_free=62, nr_reserved=0
0/tags:active_queues=0
1/tags:nr_tags=62, reserved_tags=0, bits_per_word=3
1/tags:nr_free=62, nr_reserved=0
1/tags:active_queues=1
2/tags:nr_tags=62, reserved_tags=0, bits_per_word=3
2/tags:nr_free=62, nr_reserved=0
2/tags:active_queues=0
3/tags:nr_tags=62, reserved_tags=0, bits_per_word=3
3/tags:nr_free=62, nr_reserved=0
3/tags:active_queues=0
4/tags:nr_tags=62, reserved_tags=0, bits_per_word=3
4/tags:nr_free=62, nr_reserved=0
4/tags:active_queues=0
5/tags:nr_tags=62, reserved_tags=0, bits_per_word=3
5/tags:nr_free=62, nr_reserved=0
5/tags:active_queues=0
==== sdd
0/tags:nr_tags=62, reserved_tags=0, bits_per_word=3
0/tags:nr_free=62, nr_reserved=0
0/tags:active_queues=0
1/tags:nr_tags=62, reserved_tags=0, bits_per_word=3
1/tags:nr_free=62, nr_reserved=0
1/tags:active_queues=0
2/tags:nr_tags=62, reserved_tags=0, bits_per_word=3
2/tags:nr_free=62, nr_reserved=0
2/tags:active_queues=0
3/tags:nr_tags=62, reserved_tags=0, bits_per_word=3
3/tags:nr_free=62, nr_reserved=0
3/tags:active_queues=0
4/tags:nr_tags=62, reserved_tags=0, bits_per_word=3
4/tags:nr_free=62, nr_reserved=0
4/tags:active_queues=0
5/tags:nr_tags=62, reserved_tags=0, bits_per_word=3
5/tags:nr_free=62, nr_reserved=0
5/tags:active_queues=0

# dmesg -c >/dev/null; echo w >/proc/sysrq-trigger; dmesg -c
SysRq : Show Blocked State
   task                        PC stack   pid father
kdmwork-253:1   D ffff8807f3aafb78     0  3819      2 0x00000000
  ffff8807f3aafb78 ffff880832d9c880 0000000000013080 ffff8807f3aaffd8
  0000000000013080 ffff8807fdfac880 ffff880832d9c880 ffff88080066ea00
  ffff88085fd13988 ffff8807f3aafc98 ffff8807fd553ca0 ffffe8ffffd02f00
Call Trace:
  [<ffffffff814d5330>] io_schedule+0xa0/0x130
  [<ffffffff81259a47>] bt_get+0x117/0x1b0
  [<ffffffff810949f0>] ? prepare_to_wait_event+0x110/0x110
  [<ffffffff81259dbf>] blk_mq_get_tag+0x9f/0xd0
  [<ffffffff81254f7b>] __blk_mq_alloc_request+0x1b/0x210
  [<ffffffff81256819>] blk_mq_alloc_request+0x139/0x150
  [<ffffffff8124b7ee>] blk_get_request+0x2e/0xe0
  [<ffffffff8109a28d>] ? trace_hardirqs_on+0xd/0x10
  [<ffffffffa0671d0f>] __multipath_map.isra.15+0x1cf/0x210 [dm_multipath]
  [<ffffffffa0671d6a>] multipath_clone_and_map+0x1a/0x20 [dm_multipath]
  [<ffffffffa0354bb5>] map_tio_request+0x1d5/0x3a0 [dm_mod]
  [<ffffffff8109a1bd>] ? trace_hardirqs_on_caller+0xfd/0x1c0
  [<ffffffff81075b76>] kthread_worker_fn+0x86/0x1b0
  [<ffffffff81075af0>] ? __init_kthread_worker+0x60/0x60
  [<ffffffff81075a6f>] kthread+0xef/0x110
  [<ffffffff81075980>] ? kthread_create_on_node+0x210/0x210
  [<ffffffff814dad6c>] ret_from_fork+0x7c/0xb0
  [<ffffffff81075980>] ? kthread_create_on_node+0x210/0x210
systemd-udevd   D ffff880835e13c88     0  5352    449 0x00000000
  ffff880835e13c88 ffff880832d9a440 0000000000013080 ffff880835e13fd8
  0000000000013080 ffff8808331bdaa0 ffff880832d9a440 ffff8807fc921dc0
  ffff88085fc13988 ffff88085ffd8438 ffff880835e13d30 0000000000000082
Call Trace:
  [<ffffffff814d5a90>] ? bit_wait+0x50/0x50
  [<ffffffff814d5330>] io_schedule+0xa0/0x130
  [<ffffffff814d5abc>] bit_wait_io+0x2c/0x50
  [<ffffffff814d588b>] __wait_on_bit_lock+0x4b/0xb0
  [<ffffffff8113ae0a>] __lock_page_killable+0x9a/0xa0
  [<ffffffff81094a30>] ? autoremove_wake_function+0x40/0x40
  [<ffffffff8113d428>] generic_file_read_iter+0x408/0x640
  [<ffffffff811d56c7>] blkdev_read_iter+0x37/0x40
  [<ffffffff81197e6e>] new_sync_read+0x7e/0xb0
  [<ffffffff81199058>] __vfs_read+0x18/0x50
  [<ffffffff81199116>] vfs_read+0x86/0x140
  [<ffffffff81199219>] SyS_read+0x49/0xb0
  [<ffffffff814dae12>] system_call_fastpath+0x12/0x17
mkfs.xfs        D ffff8807fd6c3a48     0  5355   2301 0x00000000
  ffff8807fd6c3a48 ffff8808351ddaa0 0000000000013080 ffff8807fd6c3fd8
  0000000000013080 ffffffff81a194e0 ffff8808351ddaa0 ffff8807fc921c00
  ffff88085fc13988 0000000000000000 0000000000000000 ffff88081aebfb40
Call Trace:
  [<ffffffff814d5330>] io_schedule+0xa0/0x130
  [<ffffffff811d9a12>] do_blockdev_direct_IO+0x1982/0x26d0
  [<ffffffff811d4b50>] ? I_BDEV+0x10/0x10
  [<ffffffff811da7ac>] __blockdev_direct_IO+0x4c/0x50
  [<ffffffff811d4b50>] ? I_BDEV+0x10/0x10
  [<ffffffff811d519e>] blkdev_direct_IO+0x4e/0x50
  [<ffffffff811d4b50>] ? I_BDEV+0x10/0x10
  [<ffffffff8113d709>] generic_file_direct_write+0xa9/0x170
  [<ffffffff8113da76>] __generic_file_write_iter+0x2a6/0x350
  [<ffffffff811d561f>] blkdev_write_iter+0x2f/0xa0
  [<ffffffff81197f21>] new_sync_write+0x81/0xb0
  [<ffffffff811986a7>] vfs_write+0xb7/0x1f0
  [<ffffffff811b857e>] ? __fget_light+0xbe/0xe0
  [<ffffffff81199452>] SyS_pwrite64+0x72/0xb0
  [<ffffffff814dae12>] system_call_fastpath+0x12/0x17

Bart.

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: blk-mq request allocation stalls [was: Re: [PATCH v3 0/8] dm: add request-based blk-mq support]
  2015-01-07 10:33                                         ` Bart Van Assche
@ 2015-01-07 15:32                                           ` Jens Axboe
  2015-01-07 16:15                                             ` Mike Snitzer
  2015-01-07 20:40                                           ` Keith Busch
  1 sibling, 1 reply; 95+ messages in thread
From: Jens Axboe @ 2015-01-07 15:32 UTC (permalink / raw)
  To: Bart Van Assche, Mike Snitzer
  Cc: Christoph Hellwig, Keith Busch, device-mapper development,
	Jun'ichi Nomura

On 01/07/2015 03:33 AM, Bart Van Assche wrote:
> On 01/06/15 17:15, Jens Axboe wrote:
>> blk-mq request allocation is pretty much as optimized/fast as it can be.
>> The slowdown must be due to one of two reasons:
>>
>> - A bug related to running out of requests, perhaps a missing queue run
>> or something like that.
>> - A smaller number of available requests, due to the requested queue
>> depth.
>>
>> Looking at Bart's results, it looks like it's usually fast, but sometimes
>> very slow. That would seem to indicate it's option #1 above that is the
>> issue. Bart, since this seems to wait for quite a bit, would it be
>> possible to cat the 'tags' file for that queue when it is stuck like
>> that?
>
> Hello Jens,
>
> Thanks for the assistance. Is this the output you were looking for ?
>
> # dmsetup table /dev/dm-1
> 0 256000 multipath 0 0 2 1 service-time 0 1 2 8:32 1 1 service-time 0 1
> 2 8:48 1 1
>
> # ls -ld /dev/sd[cd]
> brw-rw---- 1 root disk 8, 32 Jan  7 11:16 /dev/sdc
> brw-rw---- 1 root disk 8, 48 Jan  7 11:16 /dev/sdd
>
> # time mkfs.xfs -f /dev/dm-1 &
> [ ... ]
> real    4m12.101s
>
> # for d in sdc sdd; do echo ==== $d; (cd /sys/block/$d/mq &&
>    find|cut -c3-|grep tag|xargs grep -aH ''); done

You forgot dm-1, that's what mkfs is sleeping on. But given that sdc/sdd 
look idle, it still looks like a case of dm-1 not appropriately running 
the queues after insertion.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: blk-mq request allocation stalls [was: Re: [PATCH v3 0/8] dm: add request-based blk-mq support]
  2015-01-07 15:32                                           ` Jens Axboe
@ 2015-01-07 16:15                                             ` Mike Snitzer
  2015-01-07 16:18                                               ` Jens Axboe
  2015-01-07 16:22                                               ` Mike Snitzer
  0 siblings, 2 replies; 95+ messages in thread
From: Mike Snitzer @ 2015-01-07 16:15 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, Keith Busch, device-mapper development,
	Bart Van Assche, Jun'ichi Nomura

On Wed, Jan 07 2015 at 10:32am -0500,
Jens Axboe <axboe@kernel.dk> wrote:

> On 01/07/2015 03:33 AM, Bart Van Assche wrote:
> >On 01/06/15 17:15, Jens Axboe wrote:
> >>blk-mq request allocation is pretty much as optimized/fast as it can be.
> >>The slowdown must be due to one of two reasons:
> >>
> >>- A bug related to running out of requests, perhaps a missing queue run
> >>or something like that.
> >>- A smaller number of available requests, due to the requested queue
> >>depth.
> >>
> >>Looking at Bart's results, it looks like it's usually fast, but sometimes
> >>very slow. That would seem to indicate it's option #1 above that is the
> >>issue. Bart, since this seems to wait for quite a bit, would it be
> >>possible to cat the 'tags' file for that queue when it is stuck like
> >>that?
> >
> >Hello Jens,
> >
> >Thanks for the assistance. Is this the output you were looking for ?
> >
> ># dmsetup table /dev/dm-1
> >0 256000 multipath 0 0 2 1 service-time 0 1 2 8:32 1 1 service-time 0 1
> >2 8:48 1 1
> >
> ># ls -ld /dev/sd[cd]
> >brw-rw---- 1 root disk 8, 32 Jan  7 11:16 /dev/sdc
> >brw-rw---- 1 root disk 8, 48 Jan  7 11:16 /dev/sdd
> >
> ># time mkfs.xfs -f /dev/dm-1 &
> >[ ... ]
> >real    4m12.101s
> >
> ># for d in sdc sdd; do echo ==== $d; (cd /sys/block/$d/mq &&
> >   find|cut -c3-|grep tag|xargs grep -aH ''); done
> 
> You forgot dm-1, that's what mkfs is sleeping on. But given that
> sdc/sdd look idle, it still looks like a case of dm-1 not
> appropriately running the queues after insertion.

DM never directly runs the queues of the underlying SCSI devices
(e.g. sdc, sdd).

request-based DM runs the DM device's queue on completion of a clone
request:

dm_end_request -> rq_completed -> blk_run_queue_async

Which ultimately does seem to be the wrong way around (like you say:
queue should run after insertion).
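
For reference, the completion-side code in dm.c is essentially this
(trimmed to the queue-run logic):

	static void rq_completed(struct mapped_device *md, int rw, bool run_queue)
	{
		...
		/* run the DM device's queue now that a clone completed */
		if (run_queue)
			blk_run_queue_async(md->queue);
		...
	}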

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: blk-mq request allocation stalls [was: Re: [PATCH v3 0/8] dm: add request-based blk-mq support]
  2015-01-07 16:15                                             ` Mike Snitzer
@ 2015-01-07 16:18                                               ` Jens Axboe
  2015-01-07 16:22                                               ` Mike Snitzer
  1 sibling, 0 replies; 95+ messages in thread
From: Jens Axboe @ 2015-01-07 16:18 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Christoph Hellwig, Keith Busch, device-mapper development,
	Bart Van Assche, Jun'ichi Nomura

On 01/07/2015 09:15 AM, Mike Snitzer wrote:
> On Wed, Jan 07 2015 at 10:32am -0500,
> Jens Axboe <axboe@kernel.dk> wrote:
>
>> On 01/07/2015 03:33 AM, Bart Van Assche wrote:
>>> On 01/06/15 17:15, Jens Axboe wrote:
>>>> blk-mq request allocation is pretty much as optimized/fast as it can be.
>>>> The slowdown must be due to one of two reasons:
>>>>
>>>> - A bug related to running out of requests, perhaps a missing queue run
>>>> or something like that.
>>>> - A smaller number of available requests, due to the requested queue
>>>> depth.
>>>>
>>>> Looking at Bart's results, it looks like it's usually fast, but sometimes
>>>> very slow. That would seem to indicate it's option #1 above that is the
>>>> issue. Bart, since this seems to wait for quite a bit, would it be
>>>> possible to cat the 'tags' file for that queue when it is stuck like
>>>> that?
>>>
>>> Hello Jens,
>>>
>>> Thanks for the assistance. Is this the output you were looking for ?
>>>
>>> # dmsetup table /dev/dm-1
>>> 0 256000 multipath 0 0 2 1 service-time 0 1 2 8:32 1 1 service-time 0 1
>>> 2 8:48 1 1
>>>
>>> # ls -ld /dev/sd[cd]
>>> brw-rw---- 1 root disk 8, 32 Jan  7 11:16 /dev/sdc
>>> brw-rw---- 1 root disk 8, 48 Jan  7 11:16 /dev/sdd
>>>
>>> # time mkfs.xfs -f /dev/dm-1 &
>>> [ ... ]
>>> real    4m12.101s
>>>
>>> # for d in sdc sdd; do echo ==== $d; (cd /sys/block/$d/mq &&
>>>    find|cut -c3-|grep tag|xargs grep -aH ''); done
>>
>> You forgot dm-1, that's what mkfs is sleeping on. But given that
>> sdc/sdd look idle, it still looks like a case of dm-1 not
>> appropriately running the queues after insertion.
>
> DM never directly runs the queues of the underlying SCSI devices
> (e.g. sdc, sdd).
>
> request-based DM runs the DM device's queue on completion of a clone
> request:
>
> dm_end_request -> rq_completed -> blk_run_queue_async
>
> Which ultimately does seem to be the wrong way around (like you say:
> queue should run after insertion).

That does seem backwards. You'd normally run the queue at completion if 
you decided to stop it (or stop issuing) due to some resource 
constraints. What runs the queue after insertion of a clone to the 
underlying device?

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: blk-mq request allocation stalls [was: Re: [PATCH v3 0/8] dm: add request-based blk-mq support]
  2015-01-07 16:15                                             ` Mike Snitzer
  2015-01-07 16:18                                               ` Jens Axboe
@ 2015-01-07 16:22                                               ` Mike Snitzer
  2015-01-07 16:24                                                 ` Jens Axboe
  1 sibling, 1 reply; 95+ messages in thread
From: Mike Snitzer @ 2015-01-07 16:22 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, Keith Busch, device-mapper development,
	Bart Van Assche, Jun'ichi Nomura

On Wed, Jan 07 2015 at 11:15am -0500,
Mike Snitzer <snitzer@redhat.com> wrote:

> On Wed, Jan 07 2015 at 10:32am -0500,
> Jens Axboe <axboe@kernel.dk> wrote:
> 
> > You forgot dm-1, that's what mkfs is sleeping on. But given that
> > sdc/sdd look idle, it still looks like a case of dm-1 not
> > appropriately running the queues after insertion.
> 
> DM never directly runs the queues of the underlying SCSI devices
> (e.g. sdc, sdd).
> 
> request-based DM runs the DM device's queue on completion of a clone
> request:
> 
> dm_end_request -> rq_completed -> blk_run_queue_async
> 
> Which ultimately does seem to be the wrong way around (like you say:
> queue should run after insertion).

Hmm, for q->mq_ops blk_insert_cloned_request() should already be running
the queue.

blk_insert_cloned_request is calling blk_mq_insert_request(rq, false, true, true);

Third arg being @run_queue which results in blk_mq_run_hw_queue() being
called.
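
Quoting the relevant hunk from the fix earlier in this thread, the mq
branch of blk_insert_cloned_request() now reads:

	if (q->mq_ops) {
		if (blk_queue_io_stat(rq->q))
			blk_account_io_start(rq, true);
		/* @at_head=false, @run_queue=true, @async=true */
		blk_mq_insert_request(rq, false, true, true);
		return 0;
	}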

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: blk-mq request allocation stalls [was: Re: [PATCH v3 0/8] dm: add request-based blk-mq support]
  2015-01-07 16:22                                               ` Mike Snitzer
@ 2015-01-07 16:24                                                 ` Jens Axboe
  2015-01-07 17:18                                                   ` Mike Snitzer
  0 siblings, 1 reply; 95+ messages in thread
From: Jens Axboe @ 2015-01-07 16:24 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Christoph Hellwig, Keith Busch, device-mapper development,
	Bart Van Assche, Jun'ichi Nomura

On 01/07/2015 09:22 AM, Mike Snitzer wrote:
> On Wed, Jan 07 2015 at 11:15am -0500,
> Mike Snitzer <snitzer@redhat.com> wrote:
>
>> On Wed, Jan 07 2015 at 10:32am -0500,
>> Jens Axboe <axboe@kernel.dk> wrote:
>>
>>> You forgot dm-1, that's what mkfs is sleeping on. But given that
>>> sdc/sdd look idle, it still looks like a case of dm-1 not
>>> appropriately running the queues after insertion.
>>
>> DM never directly runs the queues of the underlying SCSI devices
>> (e.g. sdc, sdd).
>>
>> request-based DM runs the DM device's queue on completion of a clone
>> request:
>>
>> dm_end_request -> rq_completed -> blk_run_queue_async
>>
>> Which ultimately does seem to be the wrong way around (like you say:
>> queue should run after insertion).
>
> Hmm, for q->mq_ops blk_insert_cloned_request() should already be running
> the queue.
>
> blk_insert_cloned_request is calling blk_mq_insert_request(rq, false, true, true);
>
> Third arg being @run_queue which results in blk_mq_run_hw_queue() being
> called.

OK, that should be fine then. In that case, it's probably a missing 
queue run in some other condition... Which does make more sense, since 
"most" of the runs Bart did looked fine, it's just a slow one every now 
and then.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: blk-mq request allocation stalls [was: Re: [PATCH v3 0/8] dm: add request-based blk-mq support]
  2015-01-07 16:24                                                 ` Jens Axboe
@ 2015-01-07 17:18                                                   ` Mike Snitzer
  2015-01-07 17:35                                                     ` Jens Axboe
  0 siblings, 1 reply; 95+ messages in thread
From: Mike Snitzer @ 2015-01-07 17:18 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, Keith Busch, device-mapper development,
	Bart Van Assche, Jun'ichi Nomura

On Wed, Jan 07 2015 at 11:24am -0500,
Jens Axboe <axboe@kernel.dk> wrote:

> On 01/07/2015 09:22 AM, Mike Snitzer wrote:
> >On Wed, Jan 07 2015 at 11:15am -0500,
> >Mike Snitzer <snitzer@redhat.com> wrote:
> >
> >>On Wed, Jan 07 2015 at 10:32am -0500,
> >>Jens Axboe <axboe@kernel.dk> wrote:
> >>
> >>>You forgot dm-1, that's what mkfs is sleeping on. But given that
> >>>sdc/sdd look idle, it still looks like a case of dm-1 not
> >>>appropriately running the queues after insertion.
> >>
> >>DM never directly runs the queues of the underlying SCSI devices
> >>(e.g. sdc, sdd).
> >>
> >>request-based DM runs the DM device's queue on completion of a clone
> >>request:
> >>
> >>dm_end_request -> rq_completed -> blk_run_queue_async
> >>
> >>Which ultimately does seem to be the wrong way around (like you say:
> >>queue should run after insertion).
> >
> >Hmm, for q->mq_ops blk_insert_cloned_request() should already be running
> >the queue.
> >
> >blk_insert_cloned_request is calling blk_mq_insert_request(rq, false, true, true);
> >
> >Third arg being @run_queue which results in blk_mq_run_hw_queue() being
> >called.
> 
> OK, that should be fine then. In that case, it's probably a missing
> queue run in some other condition... Which does make more sense,
> since "most" of the runs Bart did looked fine, it's just a slow one
> every now and then.

The one case that looks suspect (missing queue run) is if/when the
multipath target's mapping function returns DM_MAPIO_REQUEUE.  It only
returns this if memory allocation failed (e.g. blk_get_request returns
ENOMEM).  Not seeing why DM core wouldn't want to re-run the DM device's
queue in this case given it just called blk_requeue_request().  But that
seems like something I need to revisit and not the ultimate problem.

Looking closer, why not have blk_run_queue() (and blk_run_queue_async)
call blk_mq_start_stopped_hw_queues() if q->mq_ops?  As it stands,
scsi_run_queue() open-codes it.

Actually, that is likely the ultimate problem: blk_run_queue* aren't
trained for q->mq_ops.  As such DM would need to open code a call to
blk_mq_start_stopped_hw_queues.

I think DM needs something like the following _untested_ patch, pretty
glaring oversight on my part (I thought the appropriate old block
methods would do the right thing for q->mq_ops):

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 549b815..d70a665 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -14,6 +14,7 @@
 #include <linux/moduleparam.h>
 #include <linux/blkpg.h>
 #include <linux/bio.h>
+#include <linux/blk-mq.h>
 #include <linux/mempool.h>
 #include <linux/slab.h>
 #include <linux/idr.h>
@@ -1012,6 +1013,46 @@ static void end_clone_bio(struct bio *clone, int error)
 }
 
 /*
+ * request-based DM queue management functions.
+ */
+static void dm_stop_queue(struct request_queue *q)
+{
+	unsigned long flags;
+
+	if (q->mq_ops) {
+		blk_mq_stop_hw_queues(q);
+		return;
+	}
+
+	spin_lock_irqsave(q->queue_lock, flags);
+	blk_stop_queue(q);
+	spin_unlock_irqrestore(q->queue_lock, flags);
+}
+
+static void dm_start_queue(struct request_queue *q)
+{
+	unsigned long flags;
+
+	if (q->mq_ops) {
+		blk_mq_start_stopped_hw_queues(q, true);
+		return;
+	}
+
+	spin_lock_irqsave(q->queue_lock, flags);
+	if (blk_queue_stopped(q))
+		blk_start_queue(q);
+	spin_unlock_irqrestore(q->queue_lock, flags);
+}
+
+static void dm_run_queue(struct request_queue *q)
+{
+	if (q->mq_ops)
+		blk_mq_start_stopped_hw_queues(q, true);
+	else
+		blk_run_queue_async(q);
+}
+
+/*
  * Don't touch any member of the md after calling this function because
  * the md may be freed in dm_put() at the end of this function.
  * Or do dm_get() before calling this function and dm_put() later.
@@ -1031,7 +1072,7 @@ static void rq_completed(struct mapped_device *md, int rw, bool run_queue)
 	 * queue lock again.
 	 */
 	if (run_queue)
-		blk_run_queue_async(md->queue);
+		dm_run_queue(md->queue);
 
 	/*
 	 * dm_put() must be at the end of this function. See the comment above
@@ -1119,35 +1160,6 @@ static void dm_requeue_unmapped_request(struct request *clone)
 	dm_requeue_unmapped_original_request(tio->md, tio->orig);
 }
 
-static void __stop_queue(struct request_queue *q)
-{
-	blk_stop_queue(q);
-}
-
-static void stop_queue(struct request_queue *q)
-{
-	unsigned long flags;
-
-	spin_lock_irqsave(q->queue_lock, flags);
-	__stop_queue(q);
-	spin_unlock_irqrestore(q->queue_lock, flags);
-}
-
-static void __start_queue(struct request_queue *q)
-{
-	if (blk_queue_stopped(q))
-		blk_start_queue(q);
-}
-
-static void start_queue(struct request_queue *q)
-{
-	unsigned long flags;
-
-	spin_lock_irqsave(q->queue_lock, flags);
-	__start_queue(q);
-	spin_unlock_irqrestore(q->queue_lock, flags);
-}
-
 static void dm_done(struct request *clone, int error, bool mapped)
 {
 	int r = error;
@@ -2419,7 +2431,7 @@ static struct dm_table *__bind(struct mapped_device *md, struct dm_table *t,
 	 * because request-based dm may be run just after the setting.
 	 */
 	if (dm_table_request_based(t) && !blk_queue_stopped(q))
-		stop_queue(q);
+		dm_stop_queue(q);
 
 	__bind_mempools(md, t);
 
@@ -2886,7 +2898,7 @@ static int __dm_suspend(struct mapped_device *md, struct dm_table *map,
 	 * dm defers requests to md->wq from md->queue.
 	 */
 	if (dm_request_based(md)) {
-		stop_queue(md->queue);
+		dm_stop_queue(md->queue);
 		flush_kthread_worker(&md->kworker);
 	}
 
@@ -2909,7 +2921,7 @@ static int __dm_suspend(struct mapped_device *md, struct dm_table *map,
 		dm_queue_flush(md);
 
 		if (dm_request_based(md))
-			start_queue(md->queue);
+			dm_start_queue(md->queue);
 
 		unlock_fs(md);
 		dm_table_presuspend_undo_targets(map);
@@ -2988,7 +3000,7 @@ static int __dm_resume(struct mapped_device *md, struct dm_table *map)
 	 * Request-based dm is queueing the deferred I/Os in its request_queue.
 	 */
 	if (dm_request_based(md))
-		start_queue(md->queue);
+		dm_start_queue(md->queue);
 
 	unlock_fs(md);
 

^ permalink raw reply related	[flat|nested] 95+ messages in thread

* Re: blk-mq request allocation stalls [was: Re: [PATCH v3 0/8] dm: add request-based blk-mq support]
  2015-01-07 17:18                                                   ` Mike Snitzer
@ 2015-01-07 17:35                                                     ` Jens Axboe
  2015-01-07 20:09                                                       ` Mike Snitzer
  0 siblings, 1 reply; 95+ messages in thread
From: Jens Axboe @ 2015-01-07 17:35 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Christoph Hellwig, Keith Busch, device-mapper development,
	Bart Van Assche, Jun'ichi Nomura

On 01/07/2015 10:18 AM, Mike Snitzer wrote:
> On Wed, Jan 07 2015 at 11:24am -0500,
> Jens Axboe <axboe@kernel.dk> wrote:
>
>> On 01/07/2015 09:22 AM, Mike Snitzer wrote:
>>> On Wed, Jan 07 2015 at 11:15am -0500,
>>> Mike Snitzer <snitzer@redhat.com> wrote:
>>>
>>>> On Wed, Jan 07 2015 at 10:32am -0500,
>>>> Jens Axboe <axboe@kernel.dk> wrote:
>>>>
>>>>> You forgot dm-1, that's what mkfs is sleeping on. But given that
>>>>> sdc/sdd look idle, it still looks like a case of dm-1 not
>>>>> appropriately running the queues after insertion.
>>>>
>>>> DM never directly runs the queues of the underlying SCSI devices
>>>> (e.g. sdc, sdd).
>>>>
>>>> request-based DM runs the DM device's queue on completion of a clone
>>>> request:
>>>>
>>>> dm_end_request -> rq_completed -> blk_run_queue_async
>>>>
>>>> Which ultimately does seem to be the wrong way around (like you say:
>>>> queue should run after insertion).
>>>
>>> Hmm, for q->mq_ops blk_insert_cloned_request() should already be running
>>> the queue.
>>>
>>> blk_insert_cloned_request is calling blk_mq_insert_request(rq, false, true, true);
>>>
>>> Third arg being @run_queue which results in blk_mq_run_hw_queue() being
>>> called.
>>
>> OK, that should be fine then. In that case, it's probably a missing
>> queue run in some other condition... Which does make more sense,
>> since "most" of the runs Bart did looked fine, it's just a slow one
>> every now and then.
>
> The one case that looks suspect (missing queue run) is if/when the
> multipath target's mapping function returns DM_MAPIO_REQUEUE.  It only
> returns this if memory allocation failed (e.g. blk_get_request returns
> ENOMEM).  Not seeing why DM core wouldn't want to re-run the DM device's
> queue in this case given it just called blk_requeue_request().  But that
> seems like something I need to revisit and not the ultimate problem.
>
> Looking closer, why not have blk_run_queue() (and blk_run_queue_async)
> call blk_mq_start_stopped_hw_queues() if q->mq_ops?  As is
> scsi_run_queue() open-codes it.
>
> Actually, that is likely the ultimate problem: blk_run_queue* aren't
> trained for q->mq_ops.  As such DM would need to open code a call to
> blk_mq_start_stopped_hw_queues.

It's not completely parallel to how SCSI uses it. blk_run_queue(), for 
legacy request_fn, does not start stopped queues. For the mq path, 
scsi-mq decided to do that. So if we embed 
blk_mq_start_stopped_hw_queues() in blk_run_queue*(), then we'd have 
different behavior between mq and non-mq. We could have 
blk_start_queue() do the right thing, but that would require different 
contexts between mq and non-mq, as non-mq requires that to be called 
with the queue lock held and ints disabled.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: blk-mq request allocation stalls [was: Re: [PATCH v3 0/8] dm: add request-based blk-mq support]
  2015-01-07 17:35                                                     ` Jens Axboe
@ 2015-01-07 20:09                                                       ` Mike Snitzer
  0 siblings, 0 replies; 95+ messages in thread
From: Mike Snitzer @ 2015-01-07 20:09 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, Keith Busch, device-mapper development,
	Bart Van Assche, Jun'ichi Nomura

On Wed, Jan 07 2015 at 12:35pm -0500,
Jens Axboe <axboe@kernel.dk> wrote:

> On 01/07/2015 10:18 AM, Mike Snitzer wrote:
> >
> >Looking closer, why not have blk_run_queue() (and blk_run_queue_async)
> >call blk_mq_start_stopped_hw_queues() if q->mq_ops?  As it stands,
> >scsi_run_queue() open-codes it.
> >
> >Actually, that is likely the ultimate problem: blk_run_queue* aren't
> >trained for q->mq_ops.  As such DM would need to open code a call to
> >blk_mq_start_stopped_hw_queues.

Turns out that concern was bogus (as was the patch I shared),
request-based DM's request_queue isn't using blk-mq even if the
underlying devices are.
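
In other words the stack is asymmetric: the top-level dm-N queue is
still a legacy request_fn queue and only the paths underneath are
blk-mq.  Schematically (an illustration, not actual dm.c code):

	/* top-level DM device: legacy request-based, q->mq_ops == NULL */
	blk_init_allocated_queue(md->queue, dm_request_fn, NULL);

	/*
	 * Underlying paths (e.g. blk-mq virtio-blk): q->mq_ops != NULL,
	 * but blk_run_queue*() is only ever called on md->queue, so it
	 * never needs to know about blk-mq hw queues.
	 */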

I'm not sure what is causing Bart's slowdown; I cannot reproduce any
hang using a 100MB scsi_debug (ramdisk) device on the host that is
exported over virtio-blk into a guest that then layers a multipath
device on the blk-mq virtio-blk device.
 
> It's not completely parallel how SCSI uses it. blk_run_queue(), for
> legacy request_fn, does not start stopped queues. For the mq path,
> scsi-mq decided to do that. So if we embed
> blk_mq_start_stopped_hw_queues() in blk_run_queue*(), then we'd have
> different behavior between mq and non-mq. We could have
> blk_start_queue() do the right thing, but that would require
> different contexts between mq and non-mq, as non-mq requires that to
> be called with the queue lock held and ints disabled.

New wrappers could be introduced and drivers converted to use them, but
it's best to just leave well enough alone.

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: blk-mq request allocation stalls [was: Re: [PATCH v3 0/8] dm: add request-based blk-mq support]
  2015-01-07 10:33                                         ` Bart Van Assche
  2015-01-07 15:32                                           ` Jens Axboe
@ 2015-01-07 20:40                                           ` Keith Busch
  2015-01-09 19:49                                             ` Mike Snitzer
  1 sibling, 1 reply; 95+ messages in thread
From: Keith Busch @ 2015-01-07 20:40 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, Keith Busch, Mike Snitzer, Christoph Hellwig,
	device-mapper development, Jun'ichi Nomura

On Wed, 7 Jan 2015, Bart Van Assche wrote:
> On 01/06/15 17:15, Jens Axboe wrote:
>> blk-mq request allocation is pretty much as optimized/fast as it can be.
>> The slowdown must be due to one of two reasons:
>> 
>> - A bug related to running out of requests, perhaps a missing queue run
>> or something like that.
>> - A smaller number of available requests, due to the requested queue depth.
>> 
>> Looking at Bart's results, it looks like it's usually fast, but sometimes
>> very slow. That would seem to indicate it's option #1 above that is the
>> issue. Bart, since this seems to wait for quite a bit, would it be
>> possible to cat the 'tags' file for that queue when it is stuck like that?
>
> Hello Jens,
>
> Thanks for the assistance. Is this the output you were looking for

I'm a little confused by the later comments given the below data. It says
multipath_clone_and_map() is stuck at bt_get, but that doesn't block
unless there are no tags available. The tags should be coming from one
of dm-1's path queues, and I'm assuming these queues are provided by sdc
and sdd. All their tags are free, so that looks like a missing wake_up
when the queue idles.
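
To spell out why: bt_get() only sleeps after __bt_get() fails to find a
free tag, and it relies on a wake_up from the tag-freeing side to retry.
Schematically (a simplified sketch of the current loop, details omitted):

	tag = __bt_get(hctx, bt, last_tag);
	if (tag != -1)
		return tag;

	do {
		prepare_to_wait(&bs->wait, &wait, TASK_UNINTERRUPTIBLE);
		tag = __bt_get(hctx, bt, last_tag);
		if (tag != -1)
			break;
		io_schedule();	/* woken when a tag is freed */
	} while (1);

With nr_free at the maximum on every hctx, staying blocked here means a
wakeup went missing.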

> # dmsetup table /dev/dm-1
> 0 256000 multipath 0 0 2 1 service-time 0 1 2 8:32 1 1 service-time 0 1 2 
> 8:48 1 1
>
> # ls -ld /dev/sd[cd]
> brw-rw---- 1 root disk 8, 32 Jan  7 11:16 /dev/sdc
> brw-rw---- 1 root disk 8, 48 Jan  7 11:16 /dev/sdd
>
> # time mkfs.xfs -f /dev/dm-1 &
> [ ... ]
> real    4m12.101s
>
> # for d in sdc sdd; do echo ==== $d; (cd /sys/block/$d/mq &&
>  find|cut -c3-|grep tag|xargs grep -aH ''); done
> ==== sdc
> 0/tags:nr_tags=62, reserved_tags=0, bits_per_word=3
> 0/tags:nr_free=62, nr_reserved=0
> 0/tags:active_queues=0
> 1/tags:nr_tags=62, reserved_tags=0, bits_per_word=3
> 1/tags:nr_free=62, nr_reserved=0
> 1/tags:active_queues=1
> 2/tags:nr_tags=62, reserved_tags=0, bits_per_word=3
> 2/tags:nr_free=62, nr_reserved=0
> 2/tags:active_queues=0
> 3/tags:nr_tags=62, reserved_tags=0, bits_per_word=3
> 3/tags:nr_free=62, nr_reserved=0
> 3/tags:active_queues=0
> 4/tags:nr_tags=62, reserved_tags=0, bits_per_word=3
> 4/tags:nr_free=62, nr_reserved=0
> 4/tags:active_queues=0
> 5/tags:nr_tags=62, reserved_tags=0, bits_per_word=3
> 5/tags:nr_free=62, nr_reserved=0
> 5/tags:active_queues=0
> ==== sdd
> 0/tags:nr_tags=62, reserved_tags=0, bits_per_word=3
> 0/tags:nr_free=62, nr_reserved=0
> 0/tags:active_queues=0
> 1/tags:nr_tags=62, reserved_tags=0, bits_per_word=3
> 1/tags:nr_free=62, nr_reserved=0
> 1/tags:active_queues=0
> 2/tags:nr_tags=62, reserved_tags=0, bits_per_word=3
> 2/tags:nr_free=62, nr_reserved=0
> 2/tags:active_queues=0
> 3/tags:nr_tags=62, reserved_tags=0, bits_per_word=3
> 3/tags:nr_free=62, nr_reserved=0
> 3/tags:active_queues=0
> 4/tags:nr_tags=62, reserved_tags=0, bits_per_word=3
> 4/tags:nr_free=62, nr_reserved=0
> 4/tags:active_queues=0
> 5/tags:nr_tags=62, reserved_tags=0, bits_per_word=3
> 5/tags:nr_free=62, nr_reserved=0
> 5/tags:active_queues=0
>
> # dmesg -c >/dev/null; echo w >/proc/sysrq-trigger; dmesg -c
> SysRq : Show Blocked State
>  task                        PC stack   pid father
> kdmwork-253:1   D ffff8807f3aafb78     0  3819      2 0x00000000
> ffff8807f3aafb78 ffff880832d9c880 0000000000013080 ffff8807f3aaffd8
> 0000000000013080 ffff8807fdfac880 ffff880832d9c880 ffff88080066ea00
> ffff88085fd13988 ffff8807f3aafc98 ffff8807fd553ca0 ffffe8ffffd02f00
> Call Trace:
> [<ffffffff814d5330>] io_schedule+0xa0/0x130
> [<ffffffff81259a47>] bt_get+0x117/0x1b0
> [<ffffffff810949f0>] ? prepare_to_wait_event+0x110/0x110
> [<ffffffff81259dbf>] blk_mq_get_tag+0x9f/0xd0
> [<ffffffff81254f7b>] __blk_mq_alloc_request+0x1b/0x210
> [<ffffffff81256819>] blk_mq_alloc_request+0x139/0x150
> [<ffffffff8124b7ee>] blk_get_request+0x2e/0xe0
> [<ffffffff8109a28d>] ? trace_hardirqs_on+0xd/0x10
> [<ffffffffa0671d0f>] __multipath_map.isra.15+0x1cf/0x210 [dm_multipath]
> [<ffffffffa0671d6a>] multipath_clone_and_map+0x1a/0x20 [dm_multipath]
> [<ffffffffa0354bb5>] map_tio_request+0x1d5/0x3a0 [dm_mod]
> [<ffffffff8109a1bd>] ? trace_hardirqs_on_caller+0xfd/0x1c0
> [<ffffffff81075b76>] kthread_worker_fn+0x86/0x1b0
> [<ffffffff81075af0>] ? __init_kthread_worker+0x60/0x60
> [<ffffffff81075a6f>] kthread+0xef/0x110
> [<ffffffff81075980>] ? kthread_create_on_node+0x210/0x210
> [<ffffffff814dad6c>] ret_from_fork+0x7c/0xb0
> [<ffffffff81075980>] ? kthread_create_on_node+0x210/0x210
> systemd-udevd   D ffff880835e13c88     0  5352    449 0x00000000
> ffff880835e13c88 ffff880832d9a440 0000000000013080 ffff880835e13fd8
> 0000000000013080 ffff8808331bdaa0 ffff880832d9a440 ffff8807fc921dc0
> ffff88085fc13988 ffff88085ffd8438 ffff880835e13d30 0000000000000082
> Call Trace:
> [<ffffffff814d5a90>] ? bit_wait+0x50/0x50
> [<ffffffff814d5330>] io_schedule+0xa0/0x130
> [<ffffffff814d5abc>] bit_wait_io+0x2c/0x50
> [<ffffffff814d588b>] __wait_on_bit_lock+0x4b/0xb0
> [<ffffffff8113ae0a>] __lock_page_killable+0x9a/0xa0
> [<ffffffff81094a30>] ? autoremove_wake_function+0x40/0x40
> [<ffffffff8113d428>] generic_file_read_iter+0x408/0x640
> [<ffffffff811d56c7>] blkdev_read_iter+0x37/0x40
> [<ffffffff81197e6e>] new_sync_read+0x7e/0xb0
> [<ffffffff81199058>] __vfs_read+0x18/0x50
> [<ffffffff81199116>] vfs_read+0x86/0x140
> [<ffffffff81199219>] SyS_read+0x49/0xb0
> [<ffffffff814dae12>] system_call_fastpath+0x12/0x17
> mkfs.xfs        D ffff8807fd6c3a48     0  5355   2301 0x00000000
> ffff8807fd6c3a48 ffff8808351ddaa0 0000000000013080 ffff8807fd6c3fd8
> 0000000000013080 ffffffff81a194e0 ffff8808351ddaa0 ffff8807fc921c00
> ffff88085fc13988 0000000000000000 0000000000000000 ffff88081aebfb40
> Call Trace:
> [<ffffffff814d5330>] io_schedule+0xa0/0x130
> [<ffffffff811d9a12>] do_blockdev_direct_IO+0x1982/0x26d0
> [<ffffffff811d4b50>] ? I_BDEV+0x10/0x10
> [<ffffffff811da7ac>] __blockdev_direct_IO+0x4c/0x50
> [<ffffffff811d4b50>] ? I_BDEV+0x10/0x10
> [<ffffffff811d519e>] blkdev_direct_IO+0x4e/0x50
> [<ffffffff811d4b50>] ? I_BDEV+0x10/0x10
> [<ffffffff8113d709>] generic_file_direct_write+0xa9/0x170
> [<ffffffff8113da76>] __generic_file_write_iter+0x2a6/0x350
> [<ffffffff811d561f>] blkdev_write_iter+0x2f/0xa0
> [<ffffffff81197f21>] new_sync_write+0x81/0xb0
> [<ffffffff811986a7>] vfs_write+0xb7/0x1f0
> [<ffffffff811b857e>] ? __fget_light+0xbe/0xe0
> [<ffffffff81199452>] SyS_pwrite64+0x72/0xb0
> [<ffffffff814dae12>] system_call_fastpath+0x12/0x17
>
> Bart.
>

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: blk-mq request allocation stalls [was: Re: [PATCH v3 0/8] dm: add request-based blk-mq support]
  2015-01-07 20:40                                           ` Keith Busch
@ 2015-01-09 19:49                                             ` Mike Snitzer
  2015-01-09 21:07                                               ` Jens Axboe
  0 siblings, 1 reply; 95+ messages in thread
From: Mike Snitzer @ 2015-01-09 19:49 UTC (permalink / raw)
  To: Keith Busch, Bart Van Assche, Jens Axboe
  Cc: Christoph Hellwig, Jun'ichi Nomura, device-mapper development

[-- Attachment #1: Type: text/plain, Size: 9291 bytes --]

On Wed, Jan 07 2015 at  3:40pm -0500,
Keith Busch <keith.busch@intel.com> wrote:

> On Wed, 7 Jan 2015, Bart Van Assche wrote:
> >On 01/06/15 17:15, Jens Axboe wrote:
> >>blk-mq request allocation is pretty much as optimized/fast as it can be.
> >>The slowdown must be due to one of two reasons:
> >>
> >>- A bug related to running out of requests, perhaps a missing queue run
> >>or something like that.
> >>- A smaller number of available requests, due to the requested queue depth.
> >>
> >>Looking at Bart's results, it looks like it's usually fast, but sometimes
> >>very slow. That would seem to indicate it's option #1 above that is the
> >>issue. Bart, since this seems to wait for quite a bit, would it be
> >>possible to cat the 'tags' file for that queue when it is stuck like that?
> >
> >Hello Jens,
> >
> >Thanks for the assistance. Is this the output you were looking for
> 
> I'm a little confused by the later comments given the below data. It says
> multipath_clone_and_map() is stuck at bt_get, but that doesn't block
> unless there are no tags available. The tags should be coming from one
> of dm-1's path queues, and I'm assuming these queues are provided by sdc
> and sdd. All their tags are free, so that looks like a missing wake_up
> when the queue idles.

Like I said in an earlier email, I cannot reproduce Bart's hangs running
mkfs.xfs against a multipath device that is built on top of a virtio
device in a KVM guest.

But I can hit __bt_get() failures on the virtio-blk device that I'm
using for the root device on this guest.  Bart I'd be interested to see
what you get when running the attached debug patch (likely will just
echo the same type of info you've already provided).

There does appear to be something weird going on with bt_get().  With
the debug patch I'm seeing the following when I simply run "make install" 
of the kernel (it'll run dracut to build the initramfs, etc):

You'll note that in all instances where __bt_get() returns -1, nr_free isn't 0.

[  441.332886] bt_get: __bt_get() returned -1
[  441.336246] queue_num=0, nr_tags=128, reserved_tags=0, bits_per_word=5
[  441.338368] nr_free=128, nr_reserved=0
[  441.340076] active_queues=0
[  441.341636] CPU: 2 PID: 190 Comm: jbd2/vda3-8 Tainted: G        W      3.18.0+ #18
[  441.343897] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[  441.345904]  0000000000000000 0000000012f1ab3d ffff8801198cb938 ffffffff8168b8fb
[  441.348283]  0000000000000000 00000000ffffffff ffff8801198cb9b8 ffffffff81307f53
[  441.350645]  0000000000000000 ffffffff81304bd8 0000000000000000 ffff8800358e43b0
[  441.352996] Call Trace:
[  441.354544]  [<ffffffff8168b8fb>] dump_stack+0x4c/0x65
[  441.356472]  [<ffffffff81307f53>] bt_get+0xc3/0x210
[  441.358346]  [<ffffffff81304bd8>] ? blk_mq_queue_enter+0x38/0x220
[  441.360409]  [<ffffffff810bfc20>] ? prepare_to_wait_event+0x110/0x110
[  441.362524]  [<ffffffff813083af>] blk_mq_get_tag+0xbf/0xf0
[  441.364480]  [<ffffffff81303a4b>] __blk_mq_alloc_request+0x1b/0x210
[  441.366536]  [<ffffffff813054f0>] blk_mq_map_request+0xd0/0x240
[  441.368542]  [<ffffffff81306b2a>] blk_sq_make_request+0x9a/0x3b0
[  441.370564]  [<ffffffff812f6f44>] ? generic_make_request_checks+0x2a4/0x410
[  441.372719]  [<ffffffff8118bb95>] ? mempool_alloc_slab+0x15/0x20
[  441.374725]  [<ffffffff812f7190>] generic_make_request+0xe0/0x130
[  441.376746]  [<ffffffff812f7257>] submit_bio+0x77/0x150
[  441.378642]  [<ffffffff812f1e96>] ? bio_alloc_bioset+0x1d6/0x330
[  441.380648]  [<ffffffff8123d6e9>] _submit_bh+0x119/0x170
[  441.382552]  [<ffffffff8123d750>] submit_bh+0x10/0x20
[  441.384460]  [<ffffffffa015ab34>] jbd2_journal_commit_transaction+0x774/0x1b90 [jbd2]
[  441.386763]  [<ffffffff816948f6>] ? _raw_spin_unlock_irqrestore+0x36/0x70
[  441.388903]  [<ffffffff810eccd5>] ? del_timer_sync+0x5/0xf0
[  441.390864]  [<ffffffffa015fa24>] kjournald2+0xd4/0x2a0 [jbd2]
[  441.392860]  [<ffffffff810bfc20>] ? prepare_to_wait_event+0x110/0x110
[  441.394952]  [<ffffffffa015f950>] ? commit_timeout+0x10/0x10 [jbd2]
[  441.397054]  [<ffffffff8109a3f7>] kthread+0x107/0x120
[  441.398948]  [<ffffffff8109a2f0>] ? kthread_create_on_node+0x240/0x240
[  441.401053]  [<ffffffff8169533c>] ret_from_fork+0x7c/0xb0
[  441.403003]  [<ffffffff8109a2f0>] ? kthread_create_on_node+0x240/0x240
[  446.719847]
[  446.719847] bt_get: __bt_get() returned -1
[  446.724288] queue_num=0, nr_tags=128, reserved_tags=0, bits_per_word=5
[  446.727624] nr_free=128, nr_reserved=0
[  446.730070] active_queues=0
[  446.732347] CPU: 2 PID: 190 Comm: jbd2/vda3-8 Tainted: G        W      3.18.0+ #18
[  446.735856] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[  446.738912]  0000000000000000 0000000012f1ab3d ffff8801198cb938 ffffffff8168b8fb
[  446.742712]  0000000000000000 00000000ffffffff ffff8801198cb9b8 ffffffff81307f53
[  446.746464]  0000000000000000 ffffffff81304bd8 0000000000000000 ffff8800358e43b0
[  446.750217] Call Trace:
[  446.752287]  [<ffffffff8168b8fb>] dump_stack+0x4c/0x65
[  446.755089]  [<ffffffff81307f53>] bt_get+0xc3/0x210
[  446.757844]  [<ffffffff81304bd8>] ? blk_mq_queue_enter+0x38/0x220
[  446.760964]  [<ffffffff810bfc20>] ? prepare_to_wait_event+0x110/0x110
[  446.764105]  [<ffffffff813083af>] blk_mq_get_tag+0xbf/0xf0
[  446.766956]  [<ffffffff81303a4b>] __blk_mq_alloc_request+0x1b/0x210
[  446.770087]  [<ffffffff813054f0>] blk_mq_map_request+0xd0/0x240
[  446.773151]  [<ffffffff81306b2a>] blk_sq_make_request+0x9a/0x3b0
[  446.776208]  [<ffffffff812f6f44>] ? generic_make_request_checks+0x2a4/0x410
[  446.779576]  [<ffffffff8118bb95>] ? mempool_alloc_slab+0x15/0x20
[  446.782655]  [<ffffffff812f7190>] generic_make_request+0xe0/0x130
[  446.785727]  [<ffffffff812f7257>] submit_bio+0x77/0x150
[  446.788477]  [<ffffffff812f1e96>] ? bio_alloc_bioset+0x1d6/0x330
[  446.791533]  [<ffffffff8123d6e9>] _submit_bh+0x119/0x170
[  446.794317]  [<ffffffff8123d750>] submit_bh+0x10/0x20
[  446.797051]  [<ffffffffa015ab34>] jbd2_journal_commit_transaction+0x774/0x1b90 [jbd2]
[  446.800564]  [<ffffffff816948f6>] ? _raw_spin_unlock_irqrestore+0x36/0x70
[  446.803908]  [<ffffffff810eccd5>] ? del_timer_sync+0x5/0xf0
[  446.806922]  [<ffffffffa015fa24>] kjournald2+0xd4/0x2a0 [jbd2]
[  446.810019]  [<ffffffff810bfc20>] ? prepare_to_wait_event+0x110/0x110
[  446.813177]  [<ffffffffa015f950>] ? commit_timeout+0x10/0x10 [jbd2]
[  446.816362]  [<ffffffff8109a3f7>] kthread+0x107/0x120
[  446.819165]  [<ffffffff8109a2f0>] ? kthread_create_on_node+0x240/0x240
[  446.822359]  [<ffffffff8169533c>] ret_from_fork+0x7c/0xb0
[  446.825195]  [<ffffffff8109a2f0>] ? kthread_create_on_node+0x240/0x240
[  471.843384]
[  471.843384] bt_get: __bt_get() returned -1
[  471.847400] queue_num=0, nr_tags=128, reserved_tags=0, bits_per_word=5
[  471.849502] nr_free=127, nr_reserved=0
[  471.851226] active_queues=0
[  471.852765] CPU: 3 PID: 6 Comm: kworker/u8:0 Tainted: G        W      3.18.0+ #18
[  471.855007] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[  471.857018] Workqueue: writeback bdi_writeback_workfn (flush-253:0)
[  471.859133]  0000000000000000 00000000bbb0d8f3 ffff88011ab8b6c8 ffffffff8168b8fb
[  471.861481]  0000000000000000 00000000ffffffff ffff88011ab8b748 ffffffff81307f53
[  471.863826]  0000000000000000 ffffffff81304bd8 0000000000000000 ffff88011ab80000
[  471.866181] Call Trace:
[  471.867703]  [<ffffffff8168b8fb>] dump_stack+0x4c/0x65
[  471.869628]  [<ffffffff81307f53>] bt_get+0xc3/0x210
[  471.871494]  [<ffffffff81304bd8>] ? blk_mq_queue_enter+0x38/0x220
[  471.873539]  [<ffffffff810bfc20>] ? prepare_to_wait_event+0x110/0x110
[  471.875640]  [<ffffffff813083af>] blk_mq_get_tag+0xbf/0xf0
[  471.877598]  [<ffffffff81303a4b>] __blk_mq_alloc_request+0x1b/0x210
[  471.879653]  [<ffffffff813054f0>] blk_mq_map_request+0xd0/0x240
[  471.881645]  [<ffffffff81306b2a>] blk_sq_make_request+0x9a/0x3b0
[  471.883653]  [<ffffffff812f6f44>] ? generic_make_request_checks+0x2a4/0x410
[  471.885798]  [<ffffffff812f7190>] generic_make_request+0xe0/0x130
[  471.887807]  [<ffffffff812f7257>] submit_bio+0x77/0x150
[  471.889699]  [<ffffffffa017ebb5>] ? ext4_writepages+0xa75/0xde0 [ext4]
[  471.891792]  [<ffffffffa01824c9>] ext4_io_submit+0x29/0x50 [ext4]
[  471.893819]  [<ffffffffa017eaab>] ext4_writepages+0x96b/0xde0 [ext4]
[  471.895879]  [<ffffffff81195251>] do_writepages+0x21/0x50
[  471.897795]  [<ffffffff81230b75>] __writeback_single_inode+0x55/0x2c0
[  471.899866]  [<ffffffff81231b36>] writeback_sb_inodes+0x2a6/0x4a0
[  471.901887]  [<ffffffff81207ed4>] ? grab_super_passive+0x44/0x90
[  471.903898]  [<ffffffff81231dcf>] __writeback_inodes_wb+0x9f/0xd0
[  471.905919]  [<ffffffff81232733>] wb_writeback+0x323/0x350
[  471.907845]  [<ffffffff81234b3d>] bdi_writeback_workfn+0x32d/0x4f0
[  471.909878]  [<ffffffff81094471>] process_one_work+0x1d1/0x520
[  471.911858]  [<ffffffff810943fa>] ? process_one_work+0x15a/0x520
[  471.913867]  [<ffffffff810948db>] worker_thread+0x11b/0x4a0
[  471.915809]  [<ffffffff810947c0>] ? process_one_work+0x520/0x520
[  471.917820]  [<ffffffff8109a3f7>] kthread+0x107/0x120
[  471.919677]  [<ffffffff8109a2f0>] ? kthread_create_on_node+0x240/0x240
[  471.921740]  [<ffffffff8169533c>] ret_from_fork+0x7c/0xb0
[  471.923630]  [<ffffffff8109a2f0>] ? kthread_create_on_node+0x240/0x240

[-- Attachment #2: blk-mq-debug-bt_get.patch --]
[-- Type: text/plain, Size: 1659 bytes --]

 block/blk-mq-tag.c | 28 ++++++++++++++++++++++++++++
 1 file changed, 28 insertions(+)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 32e8dbb..8d4b4f0 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -232,6 +232,27 @@ static struct bt_wait_state *bt_wait_ptr(struct blk_mq_bitmap_tags *bt,
 	return bs;
 }
 
+static unsigned int bt_unused_tags(struct blk_mq_bitmap_tags *bt);
+
+static void print_hctx_tags_usage(struct blk_mq_hw_ctx *hctx)
+{
+	unsigned int free, res;
+	struct blk_mq_tags *tags = hctx->tags;
+
+	if (!tags)
+		return;
+
+	printk("queue_num=%d, nr_tags=%u, reserved_tags=%u, bits_per_word=%u\n",
+	       hctx->queue_num, tags->nr_tags, tags->nr_reserved_tags,
+	       tags->bitmap_tags.bits_per_word);
+
+	free = bt_unused_tags(&tags->bitmap_tags);
+	res = bt_unused_tags(&tags->breserved_tags);
+
+	printk("nr_free=%u, nr_reserved=%u\n", free, res);
+	printk("active_queues=%u\n", atomic_read(&tags->active_queues));
+}
+
 static int bt_get(struct blk_mq_alloc_data *data,
 		struct blk_mq_bitmap_tags *bt,
 		struct blk_mq_hw_ctx *hctx,
@@ -245,6 +266,10 @@ static int bt_get(struct blk_mq_alloc_data *data,
 	if (tag != -1)
 		return tag;
 
+	printk("\n%s: __bt_get() returned -1\n", __func__);
+	print_hctx_tags_usage(hctx);
+	dump_stack();
+
 	if (!(data->gfp & __GFP_WAIT))
 		return -1;
 
@@ -256,6 +281,9 @@ static int bt_get(struct blk_mq_alloc_data *data,
 		if (tag != -1)
 			break;
 
+		printk("%s: __bt_get() _still_ returned -1\n", __func__);
+		print_hctx_tags_usage(hctx);
+
 		/*
 		 * We're out of tags on this hardware queue, kick any
 		 * pending IO submits before going to sleep waiting for

[-- Attachment #3: Type: text/plain, Size: 0 bytes --]



^ permalink raw reply related	[flat|nested] 95+ messages in thread

* Re: blk-mq request allocation stalls [was: Re: [PATCH v3 0/8] dm: add request-based blk-mq support]
  2015-01-09 19:49                                             ` Mike Snitzer
@ 2015-01-09 21:07                                               ` Jens Axboe
  2015-01-09 21:11                                                 ` Jens Axboe
  0 siblings, 1 reply; 95+ messages in thread
From: Jens Axboe @ 2015-01-09 21:07 UTC (permalink / raw)
  To: Mike Snitzer, Keith Busch, Bart Van Assche
  Cc: Christoph Hellwig, Jun'ichi Nomura, device-mapper development

[-- Attachment #1: Type: text/plain, Size: 2240 bytes --]

On 01/09/2015 12:49 PM, Mike Snitzer wrote:
> On Wed, Jan 07 2015 at  3:40pm -0500,
> Keith Busch <keith.busch@intel.com> wrote:
>
>> On Wed, 7 Jan 2015, Bart Van Assche wrote:
>>> On 01/06/15 17:15, Jens Axboe wrote:
>>>> blk-mq request allocation is pretty much as optimized/fast as it can be.
>>>> The slowdown must be due to one of two reasons:
>>>>
>>>> - A bug related to running out of requests, perhaps a missing queue run
>>>> or something like that.
>>>> - A smaller number of available requests, due to the requested queue depth.
>>>>
>>>> Looking at Barts results, it looks like it's usually fast, but sometimes
>>>> very slow. That would seem to indicate it's option #1 above that is the
>>>> issue. Bart, since this seems to wait for quite a bit, would it be
>>>> possible to cat the 'tags' file for that queue when it is stuck like that?
>>>
>>> Hello Jens,
>>>
>>> Thanks for the assistance. Is this the output you were looking for
>>
>> I'm a little confused by the later comments given the below data. It says
>> multipath_clone_and_map() is stuck at bt_get, but that doesn't block
>> unless there are no tags available. The tags should be coming from one
>> of dm-1's path queues, and I'm assuming these queues are provided by sdc
>> and sdd. All their tags are free, so that looks like a missing wake_up
>> when the queue idles.
>
> Like I said in an earlier email, I cannot reproduce Bart's hangs running
> mkfs.xfs against a multipath device that is built on top of a virtio
> device in a KVM guest.
>
> But I can hit __bt_get() failures on the virtio-blk device that I'm
> using for the root device on this guest.  Bart, I'd be interested to see
> what you get when running the attached debug patch (likely will just
> echo the same type of info you've already provided).
>
> There does appear to be something weird going on with bt_get().  With
> the debug patch I'm seeing the following when I simply run "make install"
> of the kernel (it'll run dracut to build the initramfs, etc):
>
> You'll note that in all instances where __bt_get() returns -1, nr_free isn't 0.

Yeah, that doesn't look good. Can you try with this patch? The second 
hunk is the interesting bit, the first is more of a cleanup.
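
As a reference point, here is a minimal user-space model of the
wrap-once scan that __bt_get_word() is meant to implement -- a
simplified, single-threaded sketch rather than the kernel code, with
one 64-bit word standing in for a blk_align_bitmap and a plain
read-modify-write standing in for test_and_set_bit():

#include <stdint.h>

/*
 * Find and claim a zero bit in 'word' (depth <= 64 in this model),
 * starting the search at 'last' and wrapping to bit 0 at most once.
 * Single-threaded sketch; the kernel has to retry test_and_set_bit()
 * because another CPU can claim the bit between the find and the set.
 */
static int scan_word(uint64_t *word, unsigned int depth, unsigned int last)
{
	unsigned int org_last = last;	/* where the first pass started */
	unsigned int end = depth;
	unsigned int tag;

	for (;;) {
		for (tag = last; tag < end; tag++) {
			if (!(*word & (1ULL << tag))) {
				*word |= 1ULL << tag;	/* claim it */
				return tag;
			}
		}
		if (org_last) {
			/* wrap once: rescan the bits before org_last */
			end = org_last;
			last = org_last = 0;
			continue;
		}
		return -1;		/* every bit really was set */
	}
}

The invariants are the ones the patch is trying to restore: scan from
the cached offset, wrap to bit 0 exactly once, stop the wrapped pass
at the original offset, and never let the retry offset escape the
[0, depth) range.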

-- 
Jens Axboe


[-- Attachment #2: tag.patch --]
[-- Type: text/x-patch, Size: 1165 bytes --]

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 60c9d4a93fe4..363d32d4bae6 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -143,7 +143,6 @@ static inline bool hctx_may_queue(struct blk_mq_hw_ctx *hctx,
 static int __bt_get_word(struct blk_align_bitmap *bm, unsigned int last_tag)
 {
 	int tag, org_last_tag, end;
-	bool wrap = last_tag != 0;
 
 	org_last_tag = last_tag;
 	end = bm->depth;
@@ -155,15 +154,16 @@ restart:
 			 * We started with an offset, start from 0 to
 			 * exhaust the map.
 			 */
-			if (wrap) {
-				wrap = false;
+			if (org_last_tag) {
 				end = org_last_tag;
-				last_tag = 0;
+				last_tag = org_last_tag = 0;
 				goto restart;
 			}
 			return -1;
 		}
 		last_tag = tag + 1;
+		if (last_tag >= bm->depth - 1)
+			last_tag = 0;
 	} while (test_and_set_bit(tag, &bm->word));
 
 	return tag;
@@ -199,9 +199,13 @@ static int __bt_get(struct blk_mq_hw_ctx *hctx, struct blk_mq_bitmap_tags *bt,
 			goto done;
 		}
 
-		last_tag = 0;
-		if (++index >= bt->map_nr)
+		index++;
+		last_tag += (index << bt->bits_per_word);
+
+		if (index >= bt->map_nr) {
 			index = 0;
+			last_tag = 0;
+		}
 	}
 
 	*tag_cache = 0;

[-- Attachment #3: Type: text/plain, Size: 0 bytes --]



^ permalink raw reply related	[flat|nested] 95+ messages in thread

* Re: blk-mq request allocation stalls [was: Re: [PATCH v3 0/8] dm: add request-based blk-mq support]
  2015-01-09 21:07                                               ` Jens Axboe
@ 2015-01-09 21:11                                                 ` Jens Axboe
  2015-01-09 21:40                                                   ` Mike Snitzer
  0 siblings, 1 reply; 95+ messages in thread
From: Jens Axboe @ 2015-01-09 21:11 UTC (permalink / raw)
  To: Mike Snitzer, Keith Busch, Bart Van Assche
  Cc: Christoph Hellwig, Jun'ichi Nomura, device-mapper development

[-- Attachment #1: Type: text/plain, Size: 2456 bytes --]

On 01/09/2015 02:07 PM, Jens Axboe wrote:
> On 01/09/2015 12:49 PM, Mike Snitzer wrote:
>> On Wed, Jan 07 2015 at  3:40pm -0500,
>> Keith Busch <keith.busch@intel.com> wrote:
>>
>>> On Wed, 7 Jan 2015, Bart Van Assche wrote:
>>>> On 01/06/15 17:15, Jens Axboe wrote:
>>>>> blk-mq request allocation is pretty much as optimized/fast as it
>>>>> can be.
>>>>> The slowdown must be due to one of two reasons:
>>>>>
>>>>> - A bug related to running out of requests, perhaps a missing queue
>>>>> run
>>>>> or something like that.
>>>>> - A smaller number of available requests, due to the requested
>>>>> queue depth.
>>>>>
>>>>> Looking at Barts results, it looks like it's usually fast, but
>>>>> sometimes
>>>>> very slow. That would seem to indicate it's option #1 above that is
>>>>> the
>>>>> issue. Bart, since this seems to wait for quite a bit, would it be
>>>>> possible to cat the 'tags' file for that queue when it is stuck
>>>>> like that?
>>>>
>>>> Hello Jens,
>>>>
>>>> Thanks for the assistance. Is this the output you were looking for
>>>
>>> I'm a little confused by the later comments given the below data. It
>>> says
>>> multipath_clone_and_map() is stuck at bt_get, but that doesn't block
>>> unless there are no tags available. The tags should be coming from one
>>> of dm-1's path queues, and I'm assuming these queues are provided by sdc
>>> and sdd. All their tags are free, so that looks like a missing wake_up
>>> when the queue idles.
>>
>> Like I said in an earlier email, I cannot reproduce Bart's hangs running
>> mkfs.xfs against a multipath device that is built on top of a virtio
>> device in a KVM guest.
>>
>> But I can hit __bt_get() failures on the virtio-blk device that I'm
>> using for the root device on this guest.  Bart, I'd be interested to see
>> what you get when running the attached debug patch (likely will just
>> echo the same type of info you've already provided).
>>
>> There does appear to be something weird going on with bt_get().  With
>> the debug patch I'm seeing the following when I simply run "make install"
>> of the kernel (it'll run dracut to build the initramfs, etc):
>>
>> You'll note that in all instances where __bt_get() returns -1, nr_free
>> isn't 0.
>
> Yeah, that doesn't look good. Can you try with this patch? The second
> hunk is the interesting bit, the first is more of a cleanup.

Actually, try this one instead, it should be a bit more precise than the 
first.


-- 
Jens Axboe


[-- Attachment #2: tag-v2.patch --]
[-- Type: text/x-patch, Size: 1164 bytes --]

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 60c9d4a93fe4..2e38cd118c1d 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -143,7 +143,6 @@ static inline bool hctx_may_queue(struct blk_mq_hw_ctx *hctx,
 static int __bt_get_word(struct blk_align_bitmap *bm, unsigned int last_tag)
 {
 	int tag, org_last_tag, end;
-	bool wrap = last_tag != 0;
 
 	org_last_tag = last_tag;
 	end = bm->depth;
@@ -155,15 +154,16 @@ restart:
 			 * We started with an offset, start from 0 to
 			 * exhaust the map.
 			 */
-			if (wrap) {
-				wrap = false;
+			if (org_last_tag) {
 				end = org_last_tag;
-				last_tag = 0;
+				last_tag = org_last_tag = 0;
 				goto restart;
 			}
 			return -1;
 		}
 		last_tag = tag + 1;
+		if (last_tag >= bm->depth - 1)
+			last_tag = 0;
 	} while (test_and_set_bit(tag, &bm->word));
 
 	return tag;
@@ -199,9 +199,13 @@ static int __bt_get(struct blk_mq_hw_ctx *hctx, struct blk_mq_bitmap_tags *bt,
 			goto done;
 		}
 
-		last_tag = 0;
-		if (++index >= bt->map_nr)
+		index++;
+		last_tag = (index << bt->bits_per_word);
+
+		if (index >= bt->map_nr) {
 			index = 0;
+			last_tag = 0;
+		}
 	}
 
 	*tag_cache = 0;

[-- Attachment #3: Type: text/plain, Size: 0 bytes --]



^ permalink raw reply related	[flat|nested] 95+ messages in thread

* Re: blk-mq request allocation stalls [was: Re: [PATCH v3 0/8] dm: add request-based blk-mq support]
  2015-01-09 21:11                                                 ` Jens Axboe
@ 2015-01-09 21:40                                                   ` Mike Snitzer
  2015-01-09 21:56                                                     ` Jens Axboe
  0 siblings, 1 reply; 95+ messages in thread
From: Mike Snitzer @ 2015-01-09 21:40 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Keith Busch, Christoph Hellwig, device-mapper development,
	Bart Van Assche, Jun'ichi Nomura

On Fri, Jan 09 2015 at  4:11pm -0500,
Jens Axboe <axboe@kernel.dk> wrote:

> 
> Actually, try this one instead, it should be a bit more precise than
> the first.
> 

Thanks for the test patch.

I'm still seeing failures that look wrong (could last_tag=127 be an edge
condition that isn't handled properly?):

[   14.254632] __bt_get: values before for loop: last_tag=127, index=3
[   14.255841] __bt_get: values after  for loop: last_tag=64, index=2
[   14.257036]
[   14.257036] bt_get: __bt_get() returned -1
[   14.258051] queue_num=0, nr_tags=128, reserved_tags=0, bits_per_word=5
[   14.259246] nr_free=128, nr_reserved=0
[   14.259963] active_queues=0

[  213.115997] __bt_get: values before for loop: last_tag=127, index=3
[  213.117115] __bt_get: values after  for loop: last_tag=96, index=3
[  213.118200]
[  213.118200] bt_get: __bt_get() returned -1
[  213.121593] queue_num=0, nr_tags=128, reserved_tags=0, bits_per_word=5
[  213.123960] nr_free=128, nr_reserved=0
[  213.125880] active_queues=0

[  239.158079] __bt_get: values before for loop: last_tag=8, index=0
[  239.160363] __bt_get: values after  for loop: last_tag=0, index=0
[  239.162896]
[  239.162896] bt_get: __bt_get() returned -1
[  239.166284] queue_num=0, nr_tags=128, reserved_tags=0, bits_per_word=5
[  239.168623] nr_free=127, nr_reserved=0
[  239.170508] active_queues=0

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: blk-mq request allocation stalls [was: Re: [PATCH v3 0/8] dm: add request-based blk-mq support]
  2015-01-09 21:40                                                   ` Mike Snitzer
@ 2015-01-09 21:56                                                     ` Jens Axboe
  2015-01-09 22:25                                                       ` Mike Snitzer
  0 siblings, 1 reply; 95+ messages in thread
From: Jens Axboe @ 2015-01-09 21:56 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Keith Busch, Christoph Hellwig, device-mapper development,
	Bart Van Assche, Jun'ichi Nomura

[-- Attachment #1: Type: text/plain, Size: 1509 bytes --]

On 01/09/2015 02:40 PM, Mike Snitzer wrote:
> On Fri, Jan 09 2015 at  4:11pm -0500,
> Jens Axboe <axboe@kernel.dk> wrote:
>
>>
>> Actually, try this one instead, it should be a bit more precise than
>> the first.
>>
>
> Thanks for the test patch.
>
> I'm still seeing failures that look wrong (could last_tag=127 be an edge
> condition that isn't handled properly?):
>
> [   14.254632] __bt_get: values before for loop: last_tag=127, index=3
> [   14.255841] __bt_get: values after  for loop: last_tag=64, index=2
> [   14.257036]
> [   14.257036] bt_get: __bt_get() returned -1
> [   14.258051] queue_num=0, nr_tags=128, reserved_tags=0, bits_per_word=5
> [   14.259246] nr_free=128, nr_reserved=0
> [   14.259963] active_queues=0
>
> [  213.115997] __bt_get: values before for loop: last_tag=127, index=3
> [  213.117115] __bt_get: values after  for loop: last_tag=96, index=3
> [  213.118200]
> [  213.118200] bt_get: __bt_get() returned -1
> [  213.121593] queue_num=0, nr_tags=128, reserved_tags=0, bits_per_word=5
> [  213.123960] nr_free=128, nr_reserved=0
> [  213.125880] active_queues=0
>
> [  239.158079] __bt_get: values before for loop: last_tag=8, index=0
> [  239.160363] __bt_get: values after  for loop: last_tag=0, index=0
> [  239.162896]
> [  239.162896] bt_get: __bt_get() returned -1
> [  239.166284] queue_num=0, nr_tags=128, reserved_tags=0, bits_per_word=5
> [  239.168623] nr_free=127, nr_reserved=0
> [  239.170508] active_queues=0

Thanks for testing, can you try this one?

-- 
Jens Axboe


[-- Attachment #2: tag-v3.patch --]
[-- Type: text/x-patch, Size: 1399 bytes --]

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 60c9d4a93fe4..4130c2bdc6c0 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -143,28 +143,31 @@ static inline bool hctx_may_queue(struct blk_mq_hw_ctx *hctx,
 static int __bt_get_word(struct blk_align_bitmap *bm, unsigned int last_tag)
 {
 	int tag, org_last_tag, end;
-	bool wrap = last_tag != 0;
 
 	org_last_tag = last_tag;
 	end = bm->depth;
-	do {
-restart:
-		tag = find_next_zero_bit(&bm->word, end, last_tag);
+	while (1) {
+		tag = find_next_zero_bit(&bm->word, bm->depth, last_tag);
 		if (unlikely(tag >= end)) {
 			/*
 			 * We started with an offset, start from 0 to
 			 * exhaust the map.
 			 */
-			if (wrap) {
-				wrap = false;
+			if (org_last_tag) {
 				end = org_last_tag;
-				last_tag = 0;
-				goto restart;
+				last_tag = org_last_tag = 0;
+				continue;
 			}
 			return -1;
 		}
+
+		if (!test_and_set_bit(tag, &bm->word))
+			break;
+
 		last_tag = tag + 1;
-	} while (test_and_set_bit(tag, &bm->word));
+		if (last_tag >= bm->depth - 1)
+			last_tag = 0;
+	}
 
 	return tag;
 }
@@ -199,9 +202,13 @@ static int __bt_get(struct blk_mq_hw_ctx *hctx, struct blk_mq_bitmap_tags *bt,
 			goto done;
 		}
 
-		last_tag = 0;
-		if (++index >= bt->map_nr)
+		index++;
+		last_tag = (index << bt->bits_per_word);
+
+		if (index >= bt->map_nr) {
 			index = 0;
+			last_tag = 0;
+		}
 	}
 
 	*tag_cache = 0;

[-- Attachment #3: Type: text/plain, Size: 0 bytes --]



^ permalink raw reply related	[flat|nested] 95+ messages in thread

* Re: blk-mq request allocation stalls [was: Re: [PATCH v3 0/8] dm: add request-based blk-mq support]
  2015-01-09 21:56                                                     ` Jens Axboe
@ 2015-01-09 22:25                                                       ` Mike Snitzer
  2015-01-10  0:27                                                         ` Jens Axboe
  0 siblings, 1 reply; 95+ messages in thread
From: Mike Snitzer @ 2015-01-09 22:25 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Keith Busch, Christoph Hellwig, device-mapper development,
	Bart Van Assche, Jun'ichi Nomura

On Fri, Jan 09 2015 at  4:56pm -0500,
Jens Axboe <axboe@kernel.dk> wrote:

> On 01/09/2015 02:40 PM, Mike Snitzer wrote:
> >On Fri, Jan 09 2015 at  4:11pm -0500,
> >Jens Axboe <axboe@kernel.dk> wrote:
> >
> >>
> >>Actually, try this one instead, it should be a bit more precise than
> >>the first.
> >>
> >
> >Thanks for the test patch.
> >
> >I'm still seeing failures that look wrong (could last_tag=127 be an edge
> >condition that isn't handled properly?):
> >
> >[   14.254632] __bt_get: values before for loop: last_tag=127, index=3
> >[   14.255841] __bt_get: values after  for loop: last_tag=64, index=2
> >[   14.257036]
> >[   14.257036] bt_get: __bt_get() returned -1
> >[   14.258051] queue_num=0, nr_tags=128, reserved_tags=0, bits_per_word=5
> >[   14.259246] nr_free=128, nr_reserved=0
> >[   14.259963] active_queues=0
> >
> >[  213.115997] __bt_get: values before for loop: last_tag=127, index=3
> >[  213.117115] __bt_get: values after  for loop: last_tag=96, index=3
> >[  213.118200]
> >[  213.118200] bt_get: __bt_get() returned -1
> >[  213.121593] queue_num=0, nr_tags=128, reserved_tags=0, bits_per_word=5
> >[  213.123960] nr_free=128, nr_reserved=0
> >[  213.125880] active_queues=0
> >
> >[  239.158079] __bt_get: values before for loop: last_tag=8, index=0
> >[  239.160363] __bt_get: values after  for loop: last_tag=0, index=0
> >[  239.162896]
> >[  239.162896] bt_get: __bt_get() returned -1
> >[  239.166284] queue_num=0, nr_tags=128, reserved_tags=0, bits_per_word=5
> >[  239.168623] nr_free=127, nr_reserved=0
> >[  239.170508] active_queues=0
> 
> Thanks for testing, can you try this one?

Huh, at least now we're seeing some nr_free=0... but the last 3
failures below still look like they shouldn't happen, e.g. the
last_tag=127 case:

[   13.895265] __bt_get: values before for loop: last_tag=59, index=1
[   13.895265] __bt_get: values after  for loop: last_tag=32, index=1
[   13.895266] 
[   13.895266] bt_get: __bt_get() returned -1
[   13.895267] queue_num=0, nr_tags=128, reserved_tags=0, bits_per_word=5
[   13.895267] nr_free=0, nr_reserved=0
[   13.895268] active_queues=0

[   13.895269] __bt_get: values before for loop: last_tag=0, index=0
[   13.895270] __bt_get: values after  for loop: last_tag=0, index=0
[   13.895270] 
[   13.895270] bt_get: __bt_get() returned -1
[   13.895271] queue_num=0, nr_tags=128, reserved_tags=0, bits_per_word=5
[   13.895272] nr_free=0, nr_reserved=0
[   13.895272] active_queues=0
[   13.895324] __bt_get: values before for loop: last_tag=0, index=0
[   13.895324] __bt_get: values after  for loop: last_tag=0, index=0
[   13.895325] bt_get: __bt_get() _still_ returned -1
[   13.895325] queue_num=0, nr_tags=128, reserved_tags=0, bits_per_word=5
[   13.895326] nr_free=0, nr_reserved=0
[   13.895326] active_queues=0

[   18.931425] __bt_get: values before for loop: last_tag=127, index=3
[   18.933317] __bt_get: values after  for loop: last_tag=0, index=0
[   18.935140] 
[   18.935140] bt_get: __bt_get() returned -1
[   18.936807] queue_num=0, nr_tags=128, reserved_tags=0, bits_per_word=5
[   18.938772] nr_free=128, nr_reserved=0
[   18.939927] active_queues=0

[  489.119597] __bt_get: values before for loop: last_tag=95, index=2
[  489.120621] __bt_get: values after  for loop: last_tag=96, index=3
[  489.121624]
[  489.121624] bt_get: __bt_get() returned -1
[  489.122532] queue_num=0, nr_tags=128, reserved_tags=0, bits_per_word=5
[  489.123581] nr_free=128, nr_reserved=0
[  489.124206] active_queues=0

[  494.705758] __bt_get: values before for loop: last_tag=127, index=3
[  494.707797] __bt_get: values after  for loop: last_tag=0, index=0
[  494.709696] 
[  494.709696] bt_get: __bt_get() returned -1
[  494.712459] queue_num=0, nr_tags=128, reserved_tags=0, bits_per_word=5
[  494.714403] nr_free=128, nr_reserved=0
[  494.715955] active_queues=0

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: blk-mq request allocation stalls [was: Re: [PATCH v3 0/8] dm: add request-based blk-mq support]
  2015-01-09 22:25                                                       ` Mike Snitzer
@ 2015-01-10  0:27                                                         ` Jens Axboe
  2015-01-10  1:48                                                           ` Mike Snitzer
  0 siblings, 1 reply; 95+ messages in thread
From: Jens Axboe @ 2015-01-10  0:27 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Keith Busch, Christoph Hellwig, device-mapper development,
	Bart Van Assche, Jun'ichi Nomura

[-- Attachment #1: Type: text/plain, Size: 4249 bytes --]

On 01/09/2015 03:25 PM, Mike Snitzer wrote:
> On Fri, Jan 09 2015 at  4:56pm -0500,
> Jens Axboe <axboe@kernel.dk> wrote:
> 
>> On 01/09/2015 02:40 PM, Mike Snitzer wrote:
>>> On Fri, Jan 09 2015 at  4:11pm -0500,
>>> Jens Axboe <axboe@kernel.dk> wrote:
>>>
>>>>
>>>> Actually, try this one instead, it should be a bit more precise than
>>>> the first.
>>>>
>>>
>>> Thanks for the test patch.
>>>
>>> I'm still seeing failures that look wrong (could last_tag=127 be an edge
>>> condition that isn't handled properly?):
>>>
>>> [   14.254632] __bt_get: values before for loop: last_tag=127, index=3
>>> [   14.255841] __bt_get: values after  for loop: last_tag=64, index=2
>>> [   14.257036]
>>> [   14.257036] bt_get: __bt_get() returned -1
>>> [   14.258051] queue_num=0, nr_tags=128, reserved_tags=0, bits_per_word=5
>>> [   14.259246] nr_free=128, nr_reserved=0
>>> [   14.259963] active_queues=0
>>>
>>> [  213.115997] __bt_get: values before for loop: last_tag=127, index=3
>>> [  213.117115] __bt_get: values after  for loop: last_tag=96, index=3
>>> [  213.118200]
>>> [  213.118200] bt_get: __bt_get() returned -1
>>> [  213.121593] queue_num=0, nr_tags=128, reserved_tags=0, bits_per_word=5
>>> [  213.123960] nr_free=128, nr_reserved=0
>>> [  213.125880] active_queues=0
>>>
>>> [  239.158079] __bt_get: values before for loop: last_tag=8, index=0
>>> [  239.160363] __bt_get: values after  for loop: last_tag=0, index=0
>>> [  239.162896]
>>> [  239.162896] bt_get: __bt_get() returned -1
>>> [  239.166284] queue_num=0, nr_tags=128, reserved_tags=0, bits_per_word=5
>>> [  239.168623] nr_free=127, nr_reserved=0
>>> [  239.170508] active_queues=0
>>
>> Thanks for testing, can you try this one?
> 
> Huh, at least now we're seeing some nr_free=0... but the last 3
> failures below still look like they shouldn't happen, e.g. the
> last_tag=127 case:
> 
> [   13.895265] __bt_get: values before for loop: last_tag=59, index=1
> [   13.895265] __bt_get: values after  for loop: last_tag=32, index=1
> [   13.895266] 
> [   13.895266] bt_get: __bt_get() returned -1
> [   13.895267] queue_num=0, nr_tags=128, reserved_tags=0, bits_per_word=5
> [   13.895267] nr_free=0, nr_reserved=0
> [   13.895268] active_queues=0
> 
> [   13.895269] __bt_get: values before for loop: last_tag=0, index=0
> [   13.895270] __bt_get: values after  for loop: last_tag=0, index=0
> [   13.895270] 
> [   13.895270] bt_get: __bt_get() returned -1
> [   13.895271] queue_num=0, nr_tags=128, reserved_tags=0, bits_per_word=5
> [   13.895272] nr_free=0, nr_reserved=0
> [   13.895272] active_queues=0
> [   13.895324] __bt_get: values before for loop: last_tag=0, index=0
> [   13.895324] __bt_get: values after  for loop: last_tag=0, index=0
> [   13.895325] bt_get: __bt_get() _still_ returned -1
> [   13.895325] queue_num=0, nr_tags=128, reserved_tags=0, bits_per_word=5
> [   13.895326] nr_free=0, nr_reserved=0
> [   13.895326] active_queues=0
> 
> [   18.931425] __bt_get: values before for loop: last_tag=127, index=3
> [   18.933317] __bt_get: values after  for loop: last_tag=0, index=0
> [   18.935140] 
> [   18.935140] bt_get: __bt_get() returned -1
> [   18.936807] queue_num=0, nr_tags=128, reserved_tags=0, bits_per_word=5
> [   18.938772] nr_free=128, nr_reserved=0
> [   18.939927] active_queues=0
> 
> [  489.119597] __bt_get: values before for loop: last_tag=95, index=2
> [  489.120621] __bt_get: values after  for loop: last_tag=96, index=3
> [  489.121624]
> [  489.121624] bt_get: __bt_get() returned -1
> [  489.122532] queue_num=0, nr_tags=128, reserved_tags=0, bits_per_word=5
> [  489.123581] nr_free=128, nr_reserved=0
> [  489.124206] active_queues=0
> 
> [  494.705758] __bt_get: values before for loop: last_tag=127, index=3
> [  494.707797] __bt_get: values after  for loop: last_tag=0, index=0
> [  494.709696] 
> [  494.709696] bt_get: __bt_get() returned -1
> [  494.712459] queue_num=0, nr_tags=128, reserved_tags=0, bits_per_word=5
> [  494.714403] nr_free=128, nr_reserved=0
> [  494.715955] active_queues=0

I sent out the half-done v3, unfortunately. Can you try this? Both the
cases with substantial nr_free are at the end of an index.
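
To unpack that: with nr_tags=128 and bits_per_word=5 each word holds
32 tags, so words 0-3 cover tags 0-31, 32-63, 64-95 and 96-127.  The
nr_free=128 failures above all start from last_tag=95 or last_tag=127,
i.e. bit 31 -- the very last bit -- of a word.  A retry offset of
tag + 1 there lands one past the word, where find_next_zero_bit() has
nothing left to look at, so a stale cached offset sitting on a word
boundary can plausibly turn into a spurious -1 on an otherwise empty
map.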

If this one doesn't solve it, I'll reproduce it myself to save the
ping-pong effort :-)


-- 
Jens Axboe


[-- Attachment #2: tag-v4.patch --]
[-- Type: text/x-patch, Size: 1474 bytes --]

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 60c9d4a93fe4..1ce031a56080 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -142,29 +142,29 @@ static inline bool hctx_may_queue(struct blk_mq_hw_ctx *hctx,
 
 static int __bt_get_word(struct blk_align_bitmap *bm, unsigned int last_tag)
 {
-	int tag, org_last_tag, end;
-	bool wrap = last_tag != 0;
+	int tag, org_last_tag = last_tag;
 
-	org_last_tag = last_tag;
-	end = bm->depth;
-	do {
-restart:
-		tag = find_next_zero_bit(&bm->word, end, last_tag);
-		if (unlikely(tag >= end)) {
+	while (1) {
+		tag = find_next_zero_bit(&bm->word, bm->depth, last_tag);
+		if (unlikely(tag >= bm->depth)) {
 			/*
 			 * We started with an offset, start from 0 to
 			 * exhaust the map.
 			 */
-			if (wrap) {
-				wrap = false;
-				end = org_last_tag;
-				last_tag = 0;
-				goto restart;
+			if (org_last_tag) {
+				last_tag = org_last_tag = 0;
+				continue;
 			}
 			return -1;
 		}
+
+		if (!test_and_set_bit(tag, &bm->word))
+			break;
+
 		last_tag = tag + 1;
-	} while (test_and_set_bit(tag, &bm->word));
+		if (last_tag >= bm->depth - 1)
+			last_tag = 0;
+	}
 
 	return tag;
 }
@@ -199,9 +199,13 @@ static int __bt_get(struct blk_mq_hw_ctx *hctx, struct blk_mq_bitmap_tags *bt,
 			goto done;
 		}
 
-		last_tag = 0;
-		if (++index >= bt->map_nr)
+		index++;
+		last_tag = (index << bt->bits_per_word);
+
+		if (index >= bt->map_nr) {
 			index = 0;
+			last_tag = 0;
+		}
 	}
 
 	*tag_cache = 0;

[-- Attachment #3: Type: text/plain, Size: 0 bytes --]



^ permalink raw reply related	[flat|nested] 95+ messages in thread

* Re: blk-mq request allocation stalls [was: Re: [PATCH v3 0/8] dm: add request-based blk-mq support]
  2015-01-10  0:27                                                         ` Jens Axboe
@ 2015-01-10  1:48                                                           ` Mike Snitzer
  2015-01-10  1:59                                                             ` Jens Axboe
  0 siblings, 1 reply; 95+ messages in thread
From: Mike Snitzer @ 2015-01-10  1:48 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Keith Busch, Christoph Hellwig, device-mapper development,
	Bart Van Assche, Jun'ichi Nomura

[-- Attachment #1: Type: text/plain, Size: 1936 bytes --]

On Fri, Jan 09 2015 at  7:27pm -0500,
Jens Axboe <axboe@kernel.dk> wrote:

> I sent out the half-done v3, unfortunately. Can you try this? Both the
> cases with substantial nr_free are at the end of an index.

I initially thought it was fixed since I didn't see any failures on boot
(which I normally do see 3-4).  I then ran the kernel "make install" to
this virtio-blk root device and also didn't see any failures on the
first run.  But the 2nd run triggered these:

[   83.711724] __bt_get: values before for loop: last_tag=55, index=1
[   83.713395] __bt_get: values after  for loop: last_tag=32, index=1
[   83.714464] bt_get: __bt_get() returned -1
[   83.715183] queue_num=0, nr_tags=128, reserved_tags=0, bits_per_word=5
[   83.716297] nr_free=128, nr_reserved=0
[   83.716940] active_queues=0

[   88.716241] __bt_get: values before for loop: last_tag=15, index=0
[   88.717890] __bt_get: values after  for loop: last_tag=0, index=0
[   88.718956] bt_get: __bt_get() returned -1
[   88.719682] queue_num=0, nr_tags=128, reserved_tags=0, bits_per_word=5
[   88.720866] nr_free=128, nr_reserved=0
[   88.721536] active_queues=0

A third "make install" resulted in:

[  543.711782] __bt_get: values before for loop: last_tag=114, index=3
[  543.713411] __bt_get: values after  for loop: last_tag=96, index=3
[  543.714495] bt_get: __bt_get() returned -1
[  543.715222] queue_num=0, nr_tags=128, reserved_tags=0, bits_per_word=5
[  543.716351] nr_free=128, nr_reserved=0
[  543.717016] active_queues=0

(things definitely do seem better, e.g. failures are less frequent and
I no longer see the last_tag=127 case)

> If this one doesn't solve it, I'll reproduce it myself to save the
> ping-pong effort :-)

I don't mind testing it since it is really quick.  But OK.

I've attached the debug patch I've been using in case you'd like to use it.

But feel free to send me additional versions for me to test off-list if
you'd like.

Mike

[-- Attachment #2: blk-mq-bt_get-debug-v2.patch --]
[-- Type: text/plain, Size: 2496 bytes --]

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 32e8dbb..4f11e7c 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -180,12 +180,16 @@ static int __bt_get(struct blk_mq_hw_ctx *hctx, struct blk_mq_bitmap_tags *bt,
 	unsigned int last_tag, org_last_tag;
 	int index, i, tag;
 
-	if (!hctx_may_queue(hctx, bt))
+	if (!hctx_may_queue(hctx, bt)) {
+		printk("!hctx_may_queue() __bt_get returning -1\n");
 		return -1;
+	}
 
 	last_tag = org_last_tag = *tag_cache;
 	index = TAG_TO_INDEX(bt, last_tag);
 
+	WARN_ON(last_tag > 127);
+
 	for (i = 0; i < bt->map_nr; i++) {
 		tag = __bt_get_word(&bt->map[index], TAG_TO_BIT(bt, last_tag));
 		if (tag != -1) {
@@ -198,6 +202,11 @@ static int __bt_get(struct blk_mq_hw_ctx *hctx, struct blk_mq_bitmap_tags *bt,
 			index = 0;
 	}
 
+	printk("\n%s: values before for loop: last_tag=%u, index=%d\n", __func__,
+	       *tag_cache,  TAG_TO_INDEX(bt, *tag_cache));
+	printk("%s: values after  for loop: last_tag=%u, index=%d\n", __func__,
+	       last_tag, index);
+
 	*tag_cache = 0;
 	return -1;
 
@@ -232,6 +241,27 @@ static struct bt_wait_state *bt_wait_ptr(struct blk_mq_bitmap_tags *bt,
 	return bs;
 }
 
+static unsigned int bt_unused_tags(struct blk_mq_bitmap_tags *bt);
+
+static void print_hctx_tags_usage(struct blk_mq_hw_ctx *hctx)
+{
+	unsigned int free, res;
+	struct blk_mq_tags *tags = hctx->tags;
+
+	if (!tags)
+		return;
+
+	printk("queue_num=%d, nr_tags=%u, reserved_tags=%u, bits_per_word=%u\n",
+	       hctx->queue_num, tags->nr_tags, tags->nr_reserved_tags,
+	       tags->bitmap_tags.bits_per_word);
+
+	free = bt_unused_tags(&tags->bitmap_tags);
+	res = bt_unused_tags(&tags->breserved_tags);
+
+	printk("nr_free=%u, nr_reserved=%u\n", free, res);
+	printk("active_queues=%u\n", atomic_read(&tags->active_queues));
+}
+
 static int bt_get(struct blk_mq_alloc_data *data,
 		struct blk_mq_bitmap_tags *bt,
 		struct blk_mq_hw_ctx *hctx,
@@ -245,6 +275,10 @@ static int bt_get(struct blk_mq_alloc_data *data,
 	if (tag != -1)
 		return tag;
 
+	printk("%s: __bt_get() returned -1\n", __func__);
+	print_hctx_tags_usage(hctx);
+	//dump_stack();
+
 	if (!(data->gfp & __GFP_WAIT))
 		return -1;
 
@@ -256,6 +290,9 @@ static int bt_get(struct blk_mq_alloc_data *data,
 		if (tag != -1)
 			break;
 
+		printk("%s: __bt_get() _still_ returned -1\n", __func__);
+		print_hctx_tags_usage(hctx);
+
 		/*
 		 * We're out of tags on this hardware queue, kick any
 		 * pending IO submits before going to sleep waiting for

[-- Attachment #3: Type: text/plain, Size: 0 bytes --]



^ permalink raw reply related	[flat|nested] 95+ messages in thread

* Re: blk-mq request allocation stalls [was: Re: [PATCH v3 0/8] dm: add request-based blk-mq support]
  2015-01-10  1:48                                                           ` Mike Snitzer
@ 2015-01-10  1:59                                                             ` Jens Axboe
  2015-01-10  3:10                                                               ` Mike Snitzer
  0 siblings, 1 reply; 95+ messages in thread
From: Jens Axboe @ 2015-01-10  1:59 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Keith Busch, Christoph Hellwig, device-mapper development,
	Bart Van Assche, Jun'ichi Nomura

On 01/09/2015 06:48 PM, Mike Snitzer wrote:
> On Fri, Jan 09 2015 at  7:27pm -0500,
> Jens Axboe <axboe@kernel.dk> wrote:
>
>> I sent out the half-done v3, unfortunately. Can you try this? Both the
>> cases with substantial nr_free are at the end of an index.
>
> I initially thought it was fixed since I didn't see any failures on boot
> (which I normally do see 3-4).  I then ran the kernel "make install" to
> this virtio-blk root device and also didn't see any failures on the
> first run.  But the 2nd run triggered these:
>
> [   83.711724] __bt_get: values before for loop: last_tag=55, index=1
> [   83.713395] __bt_get: values after  for loop: last_tag=32, index=1
> [   83.714464] bt_get: __bt_get() returned -1
> [   83.715183] queue_num=0, nr_tags=128, reserved_tags=0, bits_per_word=5
> [   83.716297] nr_free=128, nr_reserved=0
> [   83.716940] active_queues=0
>
> [   88.716241] __bt_get: values before for loop: last_tag=15, index=0
> [   88.717890] __bt_get: values after  for loop: last_tag=0, index=0
> [   88.718956] bt_get: __bt_get() returned -1
> [   88.719682] queue_num=0, nr_tags=128, reserved_tags=0, bits_per_word=5
> [   88.720866] nr_free=128, nr_reserved=0
> [   88.721536] active_queues=0
>
> A third "make install" resulted in:
>
> [  543.711782] __bt_get: values before for loop: last_tag=114, index=3
> [  543.713411] __bt_get: values after  for loop: last_tag=96, index=3
> [  543.714495] bt_get: __bt_get() returned -1
> [  543.715222] queue_num=0, nr_tags=128, reserved_tags=0, bits_per_word=5
> [  543.716351] nr_free=128, nr_reserved=0
> [  543.717016] active_queues=0
>
> (things definitely do seem better, e.g. failures are less frequent and
> I no longer see the last_tag=127 case)

So if we end up freeing tags in batches, it's not totally unlikely that 
we could hit the case where all tags were busy, and they got freed in 
between. Does seem a bit peculiar, though. The dump above, is that for 
the first failure case of invoking __bt_get()? I don't see the:

_still_ returned -1

which would seem to back up the theory, though. So I think this might 
actually be good, even if you hit that case.
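
Roughly the interleaving I have in mind (hypothetical timing, two
CPUs touching the same hctx):

  CPU0 (allocating)                  CPU1 (completing)
  --------------------------------   --------------------------------
  scans the bitmap, every bit is
  still set, __bt_get() returns -1
                                     batch of completions runs
                                     clear_bit() on many tags
  debug printk computes nr_free and
  sees a large (even full) count

So a single -1 followed by a high nr_free can be legitimate; only a
repeated failure right after the queue kick would point at a real
scan bug.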

Bart, could you try the patch (the -v4) and your DM hang and see if it 
solves it for you?

>
>> If this one doesn't solve it, I'll reproduce it myself to save the
>> ping-pong effort :-)
>
> I don't mind testing it since it is really quick.  But OK.

OK, then we can stick to that. Let me know if you hit the case of 
both the initial -1 and the following -1, since that would indicate it's 
not fixed.


-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: blk-mq request allocation stalls [was: Re: [PATCH v3 0/8] dm: add request-based blk-mq support]
  2015-01-10  1:59                                                             ` Jens Axboe
@ 2015-01-10  3:10                                                               ` Mike Snitzer
  2015-01-12 14:46                                                                 ` blk-mq request allocation stalls Bart Van Assche
  0 siblings, 1 reply; 95+ messages in thread
From: Mike Snitzer @ 2015-01-10  3:10 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Keith Busch, Christoph Hellwig, device-mapper development,
	Bart Van Assche, Jun'ichi Nomura

On Fri, Jan 09 2015 at  8:59pm -0500,
Jens Axboe <axboe@kernel.dk> wrote:

> On 01/09/2015 06:48 PM, Mike Snitzer wrote:
> >
> >A third "make install" resulted in:
> >
> >[  543.711782] __bt_get: values before for loop: last_tag=114, index=3
> >[  543.713411] __bt_get: values after  for loop: last_tag=96, index=3
> >[  543.714495] bt_get: __bt_get() returned -1
> >[  543.715222] queue_num=0, nr_tags=128, reserved_tags=0, bits_per_word=5
> >[  543.716351] nr_free=128, nr_reserved=0
> >[  543.717016] active_queues=0
> >
> >(things definitely do seem better, e.g. failures are less frequent and
> >I no longer see the last_tag=127 case)
> 
> So if we end up freeing tags in batches, it's not totally unlikely
> that we could hit the case where all tags were busy, and they got
> freed in between. Does seem a bit peculiar, though. The dump above,
> is that for the first failure case of invoking __bt_get()? I don't see the:
> 
> _still_ returned -1
> 
> which would seem to back up the theory, though. So I think this
> might actually be good, even if you hit that case.

Right, I'm not seeing the double failure case ("_still_ returned -1")
but I did see it in the previous 3 patches, albeit infrequently.
 
> Bart, could you try the patch (the -v4) and your DM hang and see if
> it solves it for you?

Yes, I'm interested to hear from Bart on v4 too.

> >>If this one doesn't solve it, I'll reproduce it myself to save the
> >>ping-pong effort :-)
> >
> >I don't mind testing it since it is really quick.  But OK.
> 
> OK, then we can stick to that. Let me know if you hit the case of it
> both the initial -1 and the following -1, since that would indicate
> it's not fixed.

Will do.

Thanks for all your help.

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: blk-mq request allocation stalls
  2015-01-10  3:10                                                               ` Mike Snitzer
@ 2015-01-12 14:46                                                                 ` Bart Van Assche
  2015-01-12 15:42                                                                   ` Jens Axboe
  0 siblings, 1 reply; 95+ messages in thread
From: Bart Van Assche @ 2015-01-12 14:46 UTC (permalink / raw)
  To: Mike Snitzer, Jens Axboe
  Cc: Keith Busch, Christoph Hellwig, device-mapper development,
	Jun'ichi Nomura

On 01/10/15 04:10, Mike Snitzer wrote:
> On Fri, Jan 09 2015 at  8:59pm -0500,
> Jens Axboe <axboe@kernel.dk> wrote:
>> Bart, could you try the patch (the -v4) and your DM hang and see if
>> it solves it for you?
> 
> Yes, I'm interested to hear from Bart on v4 too.

Hello Mike and Jens,

Sorry, but even with v4 applied, filesystem creation still takes too long.
The kernel I have been testing with was generated as follows:
* Started from Mike's dm-for-3.20-blk-mq branch.
* Merged v3.19-rc4 with this branch.
* Applied Jens' blk-mq tag patch and Mike's debug patch on top.
* Modified Mike's patch to make it print the blk-mq "may_queue" state
  (hctx_may_queue(hctx, bt)).

Here are the results without multipath:

# systemctl disable multipathd
# systemctl stop multipathd
# dmsetup remove_all
# rmmod dm_service_time
# rmmod dm_multipath
# rmmod dm_mod
# time mkfs.xfs -f /dev/sdc >/dev/null
real    0m0.037s
user    0m0.000s
sys     0m0.020s
# time mkfs.xfs -f /dev/sdd >/dev/null
real    0m0.030s
user    0m0.010s
sys     0m0.010s

With multipath:

# ls -l /dev/sd[cd]
brw-rw---- 1 root disk 8, 32 Jan 12 15:09 /dev/sdc
brw-rw---- 1 root disk 8, 48 Jan 12 15:11 /dev/sdd
# systemctl start multipathd
# dmsetup table /dev/dm-0
0 256000 multipath 3 queue_if_no_path pg_init_retries 50 0 1 1
service-time 0 2 2 8:48 1 1 8:32 1 1
# time mkfs.xfs -f /dev/dm-0 >/dev/null
real    0m8.845s
user    0m0.000s
sys     0m0.020s
# time mkfs.xfs -f /dev/dm-0 >/dev/null
real    0m14.905s
user    0m0.000s
sys     0m0.020s

What is remarkable is that Mike's debug patch started to report
"bt_get() returned -1" as soon as multipathd was started. The first of
many identical call traces printed by this debug patch was as follows:

bt_get: __bt_get() returned -1
queue_num=2, nr_tags=62, reserved_tags=0, bits_per_word=3
nr_free=62, nr_reserved=0, may_queue=0
active_queues=8
CPU: 2 PID: 6316 Comm: kdmwork-253:2 Tainted: G        W
3.19.0-rc4-debug+ #1
Call Trace:
 [<ffffffff814d26ea>] dump_stack+0x4c/0x65
 [<ffffffff81259c91>] bt_get+0xa1/0x1f0
 [<ffffffff8125a0bf>] blk_mq_get_tag+0x9f/0xd0
 [<ffffffff8125522b>] __blk_mq_alloc_request+0x1b/0x210
 [<ffffffff81256a30>] blk_mq_alloc_request+0xa0/0x150
 [<ffffffff8124ba9e>] blk_get_request+0x2e/0xe0
 [<ffffffffa06c6d0f>] __multipath_map.isra.15+0x1cf/0x210 [dm_multipath]
 [<ffffffffa06c6d6a>] multipath_clone_and_map+0x1a/0x20 [dm_multipath]
 [<ffffffffa07bfbb5>] map_tio_request+0x1d5/0x3a0 [dm_mod]
 [<ffffffff81075d16>] kthread_worker_fn+0x86/0x1b0
 [<ffffffff81075c0f>] kthread+0xef/0x110
 [<ffffffff814db46c>] ret_from_fork+0x7c/0xb0

Bart.

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: blk-mq request allocation stalls
  2015-01-12 14:46                                                                 ` blk-mq request allocation stalls Bart Van Assche
@ 2015-01-12 15:42                                                                   ` Jens Axboe
  2015-01-12 16:12                                                                     ` Bart Van Assche
  0 siblings, 1 reply; 95+ messages in thread
From: Jens Axboe @ 2015-01-12 15:42 UTC (permalink / raw)
  To: Bart Van Assche, Mike Snitzer
  Cc: Keith Busch, Christoph Hellwig, device-mapper development,
	Jun'ichi Nomura

On 01/12/2015 07:46 AM, Bart Van Assche wrote:
> On 01/10/15 04:10, Mike Snitzer wrote:
>> On Fri, Jan 09 2015 at  8:59pm -0500,
>> Jens Axboe <axboe@kernel.dk> wrote:
>>> Bart, could you try the patch (the -v4) and your DM hang and see if
>>> it solves it for you?
>>
>> Yes, I'm interested to hear from Bart on v4 too.
>
> Hello Mike and Jens,
>
> Sorry, but even with v4 applied, filesystem creation still takes too long.
> The kernel I have been testing with was generated as follows:
> * Started from Mike's dm-for-3.20-blk-mq branch.
> * Merged v3.19-rc4 with this branch.
> * Applied Jens' blk-mq tag patch and Mike's debug patch on top.
> * Modified Mike's patch to make it print the blk-mq "may_queue" state
>    (hctx_may_queue(hctx, bt)).
>
> Here are the results without multipath:
>
> # systemctl disable multipathd
> # systemctl stop multipathd
> # dmsetup remove_all
> # rmmod dm_service_time
> # rmmod dm_multipath
> # rmmod dm_mod
> # time mkfs.xfs -f /dev/sdc >/dev/null
> real    0m0.037s
> user    0m0.000s
> sys     0m0.020s
> # time mkfs.xfs -f /dev/sdd >/dev/null
> real    0m0.030s
> user    0m0.010s
> sys     0m0.010s
>
> With multipath:
>
> # ls -l /dev/sd[cd]
> brw-rw---- 1 root disk 8, 32 Jan 12 15:09 /dev/sdc
> brw-rw---- 1 root disk 8, 48 Jan 12 15:11 /dev/sdd
> # systemctl start multipathd
> # dmsetup table /dev/dm-0
> 0 256000 multipath 3 queue_if_no_path pg_init_retries 50 0 1 1
> service-time 0 2 2 8:48 1 1 8:32 1 1
> # time mkfs.xfs -f /dev/dm-0 >/dev/null
> real    0m8.845s
> user    0m0.000s
> sys     0m0.020s
> # time mkfs.xfs -f /dev/dm-0 >/dev/null
> real    0m14.905s
> user    0m0.000s
> sys     0m0.020s
>
> What is remarkable is that Mike's debug patch started to report
> "bt_get() returned -1" as soon as multipathd was started. The first of
> many identical call traces printed by this debug patch was as follows:
>
> bt_get: __bt_get() returned -1
> queue_num=2, nr_tags=62, reserved_tags=0, bits_per_word=3
> nr_free=62, nr_reserved=0, may_queue=0
> active_queues=8

Can you add dumping of hctx->nr_active when this fails? Your case is that 
the may_queue logic says no-can-do, so it smells like the nr_active 
accounting is wonky since you have supposedly no allocated tags, yet it 
clearly thinks that you do.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: blk-mq request allocation stalls
  2015-01-12 15:42                                                                   ` Jens Axboe
@ 2015-01-12 16:12                                                                     ` Bart Van Assche
  2015-01-12 16:34                                                                       ` Jens Axboe
  0 siblings, 1 reply; 95+ messages in thread
From: Bart Van Assche @ 2015-01-12 16:12 UTC (permalink / raw)
  To: Jens Axboe, Mike Snitzer
  Cc: Keith Busch, Christoph Hellwig, device-mapper development,
	Jun'ichi Nomura

On 01/12/15 16:42, Jens Axboe wrote:
> On 01/12/2015 07:46 AM, Bart Van Assche wrote:
>> bt_get: __bt_get() returned -1
>> queue_num=2, nr_tags=62, reserved_tags=0, bits_per_word=3
>> nr_free=62, nr_reserved=0, may_queue=0
>> active_queues=8
> 
> Can you add dumping of hctx->nr_active when this fails? Your case is that 
> the may_queue logic says no-can-do, so it smells like the nr_active 
> accounting is wonky since you have supposedly no allocated tags, yet it 
> clearly thinks that you do.

Hello Jens,

The requested output is as follows:

bt_get: __bt_get() returned -1
queue_num=0, nr_tags=62, reserved_tags=0, bits_per_word=3
nr_free=62, nr_reserved=0, hctx->tags->active_queues=7,
hctx->nr_active=9, hctx_may_queue()=0
active_queues=7
CPU: 0 PID: 3111 Comm: kdmwork-253:6 Tainted: G        W
3.19.0-rc4-debug+ #1
Call Trace:
 [<ffffffff814d2763>] dump_stack+0x4c/0x65
 [<ffffffff81259dc7>] bt_get+0x97/0x1e0
 [<ffffffff8125a0af>] blk_mq_get_tag+0x9f/0xd0
 [<ffffffff8125522b>] __blk_mq_alloc_request+0x1b/0x210
 [<ffffffff81256a30>] blk_mq_alloc_request+0xa0/0x150
 [<ffffffff8124ba9e>] blk_get_request+0x2e/0xe0
 [<ffffffffa07d9d0f>] __multipath_map.isra.15+0x1cf/0x210 [dm_multipath]
 [<ffffffffa07d9d6a>] multipath_clone_and_map+0x1a/0x20 [dm_multipath]
 [<ffffffffa0350bb5>] map_tio_request+0x1d5/0x3a0 [dm_mod]
 [<ffffffff81075d16>] kthread_worker_fn+0x86/0x1b0
 [<ffffffff81075c0f>] kthread+0xef/0x110
 [<ffffffff814db4ec>] ret_from_fork+0x7c/0xb0

Bart.

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: blk-mq request allocation stalls
  2015-01-12 16:12                                                                     ` Bart Van Assche
@ 2015-01-12 16:34                                                                       ` Jens Axboe
  2015-01-12 16:58                                                                         ` Mike Snitzer
  2015-01-12 17:04                                                                         ` Bart Van Assche
  0 siblings, 2 replies; 95+ messages in thread
From: Jens Axboe @ 2015-01-12 16:34 UTC (permalink / raw)
  To: Bart Van Assche, Mike Snitzer
  Cc: Keith Busch, Christoph Hellwig, device-mapper development,
	Jun'ichi Nomura

On 01/12/2015 09:12 AM, Bart Van Assche wrote:
> On 01/12/15 16:42, Jens Axboe wrote:
>> On 01/12/2015 07:46 AM, Bart Van Assche wrote:
>>> bt_get: __bt_get() returned -1
>>> queue_num=2, nr_tags=62, reserved_tags=0, bits_per_word=3
>>> nr_free=62, nr_reserved=0, may_queue=0
>>> active_queues=8
>>
>> Can you add dumping of hctx->nr_active when this fails? Your case is that
>> the may_queue logic says no-can-do, so it smells like the nr_active
>> accounting is wonky since you have supposedly no allocated tags, yet it
>> clearly thinks that you do.
>
> Hello Jens,
>
> The requested output is as follows:
>
> bt_get: __bt_get() returned -1
> queue_num=0, nr_tags=62, reserved_tags=0, bits_per_word=3
> nr_free=62, nr_reserved=0, hctx->tags->active_queues=7,
> hctx->nr_active=9, hctx_may_queue()=0
> active_queues=7

So that does look a bit off, we have (supposedly) 9 active requests, but 
nothing allocated. When the mkfs is done and things are idle, can you 
try and cat the 'active' file in the mq directory? I want to see if it 
drops to zero or stays elevated.
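
For context, the check that is failing in Bart's trace is roughly
this (quoting the 3.19-era hctx_may_queue() from memory, so treat it
as a sketch):

static inline bool hctx_may_queue(struct blk_mq_hw_ctx *hctx,
				  struct blk_mq_bitmap_tags *bt)
{
	unsigned int depth, users;

	/* fairness only matters for shared tag maps */
	if (!hctx || !(hctx->flags & BLK_MQ_F_TAG_SHARED))
		return true;
	if (!test_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
		return true;

	users = atomic_read(&hctx->tags->active_queues);
	if (!users)
		return true;

	/* cap each active queue at its fair share of the depth */
	depth = max((bt->depth + users - 1) / users, 4U);
	return atomic_read(&hctx->nr_active) < depth;
}

Plugging in Bart's numbers: depth = max((62 + 7 - 1) / 7, 4U) = 9,
and hctx->nr_active is 9, so 9 < 9 fails and may_queue comes out 0
even though every tag is free.  That's why a leaked nr_active (or a
queue that never transitions back to idle) would be enough to stall
allocation indefinitely.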

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: blk-mq request allocation stalls
  2015-01-12 16:34                                                                       ` Jens Axboe
@ 2015-01-12 16:58                                                                         ` Mike Snitzer
  2015-01-12 16:59                                                                           ` Jens Axboe
  2015-01-12 17:04                                                                         ` Bart Van Assche
  1 sibling, 1 reply; 95+ messages in thread
From: Mike Snitzer @ 2015-01-12 16:58 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Keith Busch, Bart Van Assche, device-mapper development,
	Jun'ichi Nomura, Christoph Hellwig

On Mon, Jan 12 2015 at 11:34am -0500,
Jens Axboe <axboe@kernel.dk> wrote:

> On 01/12/2015 09:12 AM, Bart Van Assche wrote:
> >On 01/12/15 16:42, Jens Axboe wrote:
> >>On 01/12/2015 07:46 AM, Bart Van Assche wrote:
> >>>bt_get: __bt_get() returned -1
> >>>queue_num=2, nr_tags=62, reserved_tags=0, bits_per_word=3
> >>>nr_free=62, nr_reserved=0, may_queue=0
> >>>active_queues=8
> >>
> >>Can you add dumping of hctx->nr_active when this fails? Your case is that
> >>the may_queue logic says no-can-do, so it smells like the nr_active
> >>accounting is wonky since you have supposedly no allocated tags, yet it
> >>clearly thinks that you do.
> >
> >Hello Jens,
> >
> >The requested output is as follows:
> >
> >bt_get: __bt_get() returned -1
> >queue_num=0, nr_tags=62, reserved_tags=0, bits_per_word=3
> >nr_free=62, nr_reserved=0, hctx->tags->active_queues=7,
> >hctx->nr_active=9, hctx_may_queue()=0
> >active_queues=7
> 
> So that does look a bit off, we have (supposedly) 9 active requests,
> but nothing allocated. When the mkfs is done and things are idle,
> can you try and cat the 'active' file in the mq directory? I want to
> see if it drops to zero or stays elevated.

Could this be something flawed in the iSCSI blk-mq implementation?  I
haven't ever been able to replicate this problem with virtio-blk.

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: blk-mq request allocation stalls
  2015-01-12 16:58                                                                         ` Mike Snitzer
@ 2015-01-12 16:59                                                                           ` Jens Axboe
  0 siblings, 0 replies; 95+ messages in thread
From: Jens Axboe @ 2015-01-12 16:59 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Keith Busch, Bart Van Assche, device-mapper development,
	Jun'ichi Nomura, Christoph Hellwig

On 01/12/2015 09:58 AM, Mike Snitzer wrote:
> On Mon, Jan 12 2015 at 11:34am -0500,
> Jens Axboe <axboe@kernel.dk> wrote:
>
>> On 01/12/2015 09:12 AM, Bart Van Assche wrote:
>>> On 01/12/15 16:42, Jens Axboe wrote:
>>>> On 01/12/2015 07:46 AM, Bart Van Assche wrote:
>>>>> bt_get: __bt_get() returned -1
>>>>> queue_num=2, nr_tags=62, reserved_tags=0, bits_per_word=3
>>>>> nr_free=62, nr_reserved=0, may_queue=0
>>>>> active_queues=8
>>>>
>>>> Can you add dumping of hctx->nr_active when this fails? Your case is that
>>>> the may_queue logic says no-can-do, so it smells like the nr_active
>>>> accounting is wonky since you have supposedly no allocated tags, yet it
>>>> clearly thinks that you do.
>>>
>>> Hello Jens,
>>>
>>> The requested output is as follows:
>>>
>>> bt_get: __bt_get() returned -1
>>> queue_num=0, nr_tags=62, reserved_tags=0, bits_per_word=3
>>> nr_free=62, nr_reserved=0, hctx->tags->active_queues=7,
>>> hctx->nr_active=9, hctx_may_queue()=0
>>> active_queues=7
>>
>> So that does look a bit off, we have (supposedly) 9 active requests,
>> but nothing allocated. When the mkfs is done and things are idle,
>> can you try and cat the 'active' file in the mq directory? I want to
>> see if it drops to zero or stays elevated.
>
> Could this be something flawed in the iSCSI blk-mq implementation?  I
> haven't ever been able to replicate this problem with virtio-blk.

It's related to a shared tag map, which only happens on scsi-mq. Other 
devices generally don't share tag maps.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: blk-mq request allocation stalls
  2015-01-12 16:34                                                                       ` Jens Axboe
  2015-01-12 16:58                                                                         ` Mike Snitzer
@ 2015-01-12 17:04                                                                         ` Bart Van Assche
  2015-01-12 17:09                                                                           ` Jens Axboe
  2015-01-12 18:19                                                                           ` Mike Snitzer
  1 sibling, 2 replies; 95+ messages in thread
From: Bart Van Assche @ 2015-01-12 17:04 UTC (permalink / raw)
  To: Jens Axboe, Mike Snitzer
  Cc: Keith Busch, Christoph Hellwig, device-mapper development,
	Jun'ichi Nomura

On 01/12/15 17:34, Jens Axboe wrote:
> On 01/12/2015 09:12 AM, Bart Van Assche wrote:
>> On 01/12/15 16:42, Jens Axboe wrote:
>>> On 01/12/2015 07:46 AM, Bart Van Assche wrote:
>>>> bt_get: __bt_get() returned -1
>>>> queue_num=2, nr_tags=62, reserved_tags=0, bits_per_word=3
>>>> nr_free=62, nr_reserved=0, may_queue=0
>>>> active_queues=8
>>>
>>> Can you add dumping of hctx->nr_active when this fails? Your case is that
>>> the may_queue logic says no-can-do, so it smells like the nr_active
>>> accounting is wonky since you have supposedly no allocated tags, yet it
>>> clearly thinks that you do.
>>
>> Hello Jens,
>>
>> The requested output is as follows:
>>
>> bt_get: __bt_get() returned -1
>> queue_num=0, nr_tags=62, reserved_tags=0, bits_per_word=3
>> nr_free=62, nr_reserved=0, hctx->tags->active_queues=7,
>> hctx->nr_active=9, hctx_may_queue()=0
>> active_queues=7
> 
> So that does look a bit off, we have (supposedly) 9 active requests, but 
> nothing allocated. When the mkfs is done and things are idle, can you 
> try and cat the 'active' file in the mq directory? I want to see if it 
> drops to zero or stays elevated.

Hello Jens,

Sorry that I hadn't been more clear, but the __bt_get() data in the
previous e-mail was gathered when multipathd started instead of when
mkfs was started. That was the time at which Mike's debug patch reported
for the first time that __bt_get() returned -1. What is also remarkable
is that all "__bt_get() returned -1" reports that I checked referred to
the /dev/dm-0 device and none to any of the underlying devices
(/dev/sd[cd]).

The tag state after having stopped multipathd (systemctl stop
multipathd) is as follows:
# dmsetup table /dev/dm-0
0 256000 multipath 3 queue_if_no_path pg_init_retries 50 0 1 1
service-time 0 2 2 8:48 1 1 8:32 1 1
# ls -l /dev/sd[cd]
brw-rw---- 1 root disk 8, 32 Jan 12 17:47 /dev/sdc
brw-rw---- 1 root disk 8, 48 Jan 12 17:47 /dev/sdd
# for d in sdc sdd dm-0; do echo ==== $d; (cd /sys/block/$d/mq &&
  find|cut -c3-|grep active|xargs grep -aH ''); done
==== sdc
0/active:10
1/active:14
2/active:7
3/active:13
4/active:6
5/active:10
==== sdd
0/active:17
1/active:8
2/active:9
3/active:13
4/active:5
5/active:10
==== dm-0
-bash: cd: /sys/block/dm-0/mq: No such file or directory

Bart.

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: blk-mq request allocation stalls
  2015-01-12 17:04                                                                         ` Bart Van Assche
@ 2015-01-12 17:09                                                                           ` Jens Axboe
  2015-01-12 17:53                                                                             ` Keith Busch
  2015-01-12 18:19                                                                           ` Mike Snitzer
  1 sibling, 1 reply; 95+ messages in thread
From: Jens Axboe @ 2015-01-12 17:09 UTC (permalink / raw)
  To: Bart Van Assche, Mike Snitzer
  Cc: Keith Busch, Christoph Hellwig, device-mapper development,
	Jun'ichi Nomura

On 01/12/2015 10:04 AM, Bart Van Assche wrote:
> On 01/12/15 17:34, Jens Axboe wrote:
>> On 01/12/2015 09:12 AM, Bart Van Assche wrote:
>>> On 01/12/15 16:42, Jens Axboe wrote:
>>>> On 01/12/2015 07:46 AM, Bart Van Assche wrote:
>>>>> bt_get: __bt_get() returned -1
>>>>> queue_num=2, nr_tags=62, reserved_tags=0, bits_per_word=3
>>>>> nr_free=62, nr_reserved=0, may_queue=0
>>>>> active_queues=8
>>>>
> >>>> Can you add dumping of hctx->nr_active when this fails? Your case is that
>>>> the may_queue logic says no-can-do, so it smells like the nr_active
>>>> accounting is wonky since you have supposedly no allocated tags, yet it
>>>> clearly thinks that you do.
>>>
>>> Hello Jens,
>>>
>>> The requested output is as follows:
>>>
>>> bt_get: __bt_get() returned -1
>>> queue_num=0, nr_tags=62, reserved_tags=0, bits_per_word=3
>>> nr_free=62, nr_reserved=0, hctx->tags->active_queues=7,
>>> hctx->nr_active=9, hctx_may_queue()=0
>>> active_queues=7
>>
>> So that does look a bit off, we have (supposedly) 9 active requests, but
>> nothing allocated. When the mkfs is done and things are idle, can you
>> try and cat the 'active' file in the mq directory? I want to see if it
>> drops to zero or stays elevated.
>
> Hello Jens,
>
> Sorry that I hadn't been more clear, but the __bt_get() data in the
> previous e-mail was gathered when multipathd started instead of when
> mkfs was started. That was the time at which Mike's debug patch reported
> for the first time that __bt_get() returned -1. What is also remarkable
> is that all "__bt_get() returned -1" reports that I checked referred to
> the /dev/dm-0 device and none to any of the underlying devices
> (/dev/sd[cd]).
>
> The tag state after having stopped multipathd (systemctl stop
> multipathd) is as follows:
> # dmsetup table /dev/dm-0
> 0 256000 multipath 3 queue_if_no_path pg_init_retries 50 0 1 1
> service-time 0 2 2 8:48 1 1 8:32 1 1
> # ls -l /dev/sd[cd]
> brw-rw---- 1 root disk 8, 32 Jan 12 17:47 /dev/sdc
> brw-rw---- 1 root disk 8, 48 Jan 12 17:47 /dev/sdd
> # for d in sdc sdd dm-0; do echo ==== $d; (cd /sys/block/$d/mq &&
>    find|cut -c3-|grep active|xargs grep -aH ''); done
> ==== sdc
> 0/active:10
> 1/active:14
> 2/active:7
> 3/active:13
> 4/active:6
> 5/active:10
> ==== sdd
> 0/active:17
> 1/active:8
> 2/active:9
> 3/active:13
> 4/active:5
> 5/active:10
> ==== dm-0
> -bash: cd: /sys/block/dm-0/mq: No such file or directory

OK, so it's definitely leaking, but only partially - the requests are 
freed, yet the active count isn't decremented. I wonder if we're losing 
that flag along the way. It's numbered high enough that a cast to int 
will drop it, perhaps the cmd_flags is being copied/passed around as an 
int and not the appropriate u64? We've had bugs like that before.
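
As a standalone illustration of that failure mode (the bit position here
is made up; the point is only that it sits above bit 31):

#include <stdio.h>
#include <stdint.h>

#define EXAMPLE_HIGH_FLAG (1ULL << 35)	/* stand-in for a high rq flag */

int main(void)
{
	uint64_t cmd_flags = EXAMPLE_HIGH_FLAG;
	int copy = cmd_flags;		/* bug: an int copy drops bits >= 32 */

	printf("u64 keeps the flag: %d\n",
	       !!(cmd_flags & EXAMPLE_HIGH_FLAG));
	printf("int copy keeps the flag: %d\n",
	       !!((uint64_t)copy & EXAMPLE_HIGH_FLAG));
	return 0;
}

That prints 1 then 0 - the int copy silently loses the flag.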

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: blk-mq request allocation stalls
  2015-01-12 17:09                                                                           ` Jens Axboe
@ 2015-01-12 17:53                                                                             ` Keith Busch
  2015-01-12 18:12                                                                               ` Jens Axboe
  0 siblings, 1 reply; 95+ messages in thread
From: Keith Busch @ 2015-01-12 17:53 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, Mike Snitzer, Keith Busch,
	device-mapper development, Jun'ichi Nomura, Bart Van Assche

On Mon, 12 Jan 2015, Jens Axboe wrote:
> On 01/12/2015 10:04 AM, Bart Van Assche wrote:
>> The tag state after having stopped multipathd (systemctl stop
>> multipathd) is as follows:
>> # dmsetup table /dev/dm-0
>> 0 256000 multipath 3 queue_if_no_path pg_init_retries 50 0 1 1
>> service-time 0 2 2 8:48 1 1 8:32 1 1
>> # ls -l /dev/sd[cd]
>> brw-rw---- 1 root disk 8, 32 Jan 12 17:47 /dev/sdc
>> brw-rw---- 1 root disk 8, 48 Jan 12 17:47 /dev/sdd
>> # for d in sdc sdd dm-0; do echo ==== $d; (cd /sys/block/$d/mq &&
>>    find|cut -c3-|grep active|xargs grep -aH ''); done
>> ==== sdc
>> 0/active:10
>> 1/active:14
>> 2/active:7
>> 3/active:13
>> 4/active:6
>> 5/active:10
>> ==== sdd
>> 0/active:17
>> 1/active:8
>> 2/active:9
>> 3/active:13
>> 4/active:5
>> 5/active:10
>> ==== dm-0
>> -bash: cd: /sys/block/dm-0/mq: No such file or directory
>
> OK, so it's definitely leaking, but only partially - the requests are freed, 
> yet the active count isn't decremented. I wonder if we're losing that flag 
> along the way. It's numbered high enough that a cast to int will drop it, 
> perhaps the cmd_flags is being copied/passed around as an int and not the 
> appropriate u64? We've had bugs like that before.

Is the nr_active count correct prior to starting the mkfs test? Trying
to see if someone is calling "blk_mq_alloc_tag_set()" twice on the same
set. It might be good to add a WARN if this is detected anyway.
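
Untested sketch of such a guard, assuming a set that already went through
blk_mq_alloc_tag_set() can be recognized by set->tags being non-NULL:

	/* e.g. at the top of blk_mq_alloc_tag_set(): */
	if (WARN_ON_ONCE(set->tags))	/* set initialized twice? */
		return -EINVAL;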

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: blk-mq request allocation stalls
  2015-01-12 17:53                                                                             ` Keith Busch
@ 2015-01-12 18:12                                                                               ` Jens Axboe
  2015-01-12 18:22                                                                                 ` Keith Busch
  2015-01-12 19:07                                                                                 ` Mike Snitzer
  0 siblings, 2 replies; 95+ messages in thread
From: Jens Axboe @ 2015-01-12 18:12 UTC (permalink / raw)
  To: Keith Busch
  Cc: Christoph Hellwig, Bart Van Assche, device-mapper development,
	Jun'ichi Nomura, Mike Snitzer

On 01/12/2015 10:53 AM, Keith Busch wrote:
> On Mon, 12 Jan 2015, Jens Axboe wrote:
>> On 01/12/2015 10:04 AM, Bart Van Assche wrote:
>>> The tag state after having stopped multipathd (systemctl stop
>>> multipathd) is as follows:
>>> # dmsetup table /dev/dm-0
>>> 0 256000 multipath 3 queue_if_no_path pg_init_retries 50 0 1 1
>>> service-time 0 2 2 8:48 1 1 8:32 1 1
>>> # ls -l /dev/sd[cd]
>>> brw-rw---- 1 root disk 8, 32 Jan 12 17:47 /dev/sdc
>>> brw-rw---- 1 root disk 8, 48 Jan 12 17:47 /dev/sdd
>>> # for d in sdc sdd dm-0; do echo ==== $d; (cd /sys/block/$d/mq &&
>>>    find|cut -c3-|grep active|xargs grep -aH ''); done
>>> ==== sdc
>>> 0/active:10
>>> 1/active:14
>>> 2/active:7
>>> 3/active:13
>>> 4/active:6
>>> 5/active:10
>>> ==== sdd
>>> 0/active:17
>>> 1/active:8
>>> 2/active:9
>>> 3/active:13
>>> 4/active:5
>>> 5/active:10
>>> ==== dm-0
>>> -bash: cd: /sys/block/dm-0/mq: No such file or directory
>>
>> OK, so it's definitely leaking, but only partially - the requests are
>> freed, yet the active count isn't decremented. I wonder if we're
>> losing that flag along the way. It's numbered high enough that a cast
>> to int will drop it, perhaps the cmd_flags is being copied/passed
>> around as an int and not the appropriate u64? We've had bugs like that
>> before.
>
> Is the nr_active count correct prior to starting the mkfs test? Trying
> to see if someone is calling "blk_mq_alloc_tag_set()" twice on the same
> set. It might be good to add a WARN if this is detected anyway.

That might be a good debug aid, I agree. But the above doesn't look like 
it's corrupted. If you add the values, you get 60 and 62 for the two 
cases, which seems to indicate that we did bump the values correctly, 
but for some reason we never did the decrement on completion. Hence we 
stabilize around the queue depth of the device, which will be 62 +/- a 
bit due to the sharing.

I'm not familiar with how rq based dm works. We clone the original 
request (which has the REQ_MQ_INFLIGHT flag set), then we issue the
clone(s) to the underlying device(s)? And when that completes, we 
complete the original? That would work fine with the flag on the 
original request. Maybe I'm missing something, and I'll let more 
knowledgeable people discuss that.
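
If I read it right, that flow is roughly (a sketch using the real
block-layer entry points but none of dm's error handling; names like
bdev_q are illustrative):

	/* map: clone the original request onto an underlying queue */
	clone = blk_get_request(bdev_q, rq_data_dir(orig), GFP_ATOMIC);
	blk_rq_prep_clone(clone, orig, bs, GFP_ATOMIC, NULL, NULL);
	blk_insert_cloned_request(bdev_q, clone);

	/* completion: when the clone finishes, finish the original */
	blk_rq_unprep_clone(clone);
	blk_end_request_all(orig, error);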

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: blk-mq request allocation stalls
  2015-01-12 17:04                                                                         ` Bart Van Assche
  2015-01-12 17:09                                                                           ` Jens Axboe
@ 2015-01-12 18:19                                                                           ` Mike Snitzer
  1 sibling, 0 replies; 95+ messages in thread
From: Mike Snitzer @ 2015-01-12 18:19 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, Keith Busch, device-mapper development,
	Jun'ichi Nomura, Christoph Hellwig

On Mon, Jan 12 2015 at 12:04pm -0500,
Bart Van Assche <bart.vanassche@sandisk.com> wrote:

> On 01/12/15 17:34, Jens Axboe wrote:
> > 
> > So that does look a bit off, we have (supposedly) 9 active requests, but 
> > nothing allocated. When the mkfs is done and things are idle, can you 
> > try and cat the 'active' file in the mq directory? I want to see if it 
> > drops to zero or stays elevated.
> 
> Hello Jens,
> 
> Sorry that I hadn't been more clear, but the __bt_get() data in the
> previous e-mail was gathered when multipathd started instead of when
> mkfs was started. That was the time at which Mike's debug patch reported
> for the first time that __bt_get() returned -1. What is also remarkable
> is that all "__bt_get() returned -1" reports that I checked referred to
> the /dev/dm-0 device and none to any of the underlying devices
> (/dev/sd[cd]).

That cannot be, considering the request-based DM device (dm-0) isn't a
blk-mq device (so it never calls into __bt_get).

How did you draw that conclusion?

> The tag state after having stopped multipathd (systemctl stop
> multipathd) is as follows:
> # dmsetup table /dev/dm-0
> 0 256000 multipath 3 queue_if_no_path pg_init_retries 50 0 1 1
> service-time 0 2 2 8:48 1 1 8:32 1 1
> # ls -l /dev/sd[cd]
> brw-rw---- 1 root disk 8, 32 Jan 12 17:47 /dev/sdc
> brw-rw---- 1 root disk 8, 48 Jan 12 17:47 /dev/sdd
> # for d in sdc sdd dm-0; do echo ==== $d; (cd /sys/block/$d/mq &&
>   find|cut -c3-|grep active|xargs grep -aH ''); done
> ==== sdc
> 0/active:10
> 1/active:14
> 2/active:7
> 3/active:13
> 4/active:6
> 5/active:10
> ==== sdd
> 0/active:17
> 1/active:8
> 2/active:9
> 3/active:13
> 4/active:5
> 5/active:10
> ==== dm-0
> -bash: cd: /sys/block/dm-0/mq: No such file or directory

Confirms dm-0 isn't a blk-mq device.  Making request-based DM use blk-mq
is certainly something I intend to look at.  But the current
dm-multipath is using an old request-based queue (dm-0) in front of
underlying blk-mq devices (sdc, sdd) as the first step.

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: blk-mq request allocation stalls
  2015-01-12 18:12                                                                               ` Jens Axboe
@ 2015-01-12 18:22                                                                                 ` Keith Busch
  2015-01-12 18:35                                                                                   ` Keith Busch
  2015-01-12 19:05                                                                                   ` blk-mq request allocation stalls Jens Axboe
  2015-01-12 19:07                                                                                 ` Mike Snitzer
  1 sibling, 2 replies; 95+ messages in thread
From: Keith Busch @ 2015-01-12 18:22 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, Mike Snitzer, Keith Busch,
	device-mapper development, Jun'ichi Nomura, Bart Van Assche

On Mon, 12 Jan 2015, Jens Axboe wrote:
> On 01/12/2015 10:53 AM, Keith Busch wrote:
>> Is the nr_active count correct prior to starting the mkfs test? Trying
>> to see if someone is calling "blk_mq_alloc_tag_set()" twice on the same
>> set. It might be good to add a WARN if this is detected anyway.
>
> That might be a good debug aid, I agree. But the above doesn't look like it's 
> corrupted. If you add the values, you get 60 and 62 for the two cases, which 
> seems to indicate that we did bump the values correctly, but for some reason 
> we never did the decrement on completion. Hence we stabilize around the queue 
> depth of the device, which will be 62 +/- a bit due to the sharing.
>
> I'm not familiar with how rq based dm works. We clone the original request 
> (which has the RQ_MQ_INFLIGHT flag set), then we issue the clone(s) to the 
> underlying device(s)? And when that completes, we complete the original? That 
> would work fine with the flag on the original request. Maybe I'm missing 
> something, and I'll let more knowledgeable people discuss that.

Oh, let's look at "__blk_rq_prep_clone". dm calls that after
blk_get_request() for the blk-mq based multipath types and overrides the
destination's cmd_flags with the source's even though the source was not
allocated from a blk-mq based queue, much less a shared tag.

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: blk-mq request allocation stalls
  2015-01-12 18:22                                                                                 ` Keith Busch
@ 2015-01-12 18:35                                                                                   ` Keith Busch
  2015-01-12 19:11                                                                                     ` Mike Snitzer
  2015-01-13 14:59                                                                                     ` blk-mq request allocation stalls Jens Axboe
  2015-01-12 19:05                                                                                   ` blk-mq request allocation stalls Jens Axboe
  1 sibling, 2 replies; 95+ messages in thread
From: Keith Busch @ 2015-01-12 18:35 UTC (permalink / raw)
  To: Keith Busch
  Cc: Jens Axboe, Mike Snitzer, Christoph Hellwig,
	device-mapper development, Jun'ichi Nomura, Bart Van Assche

On Mon, 12 Jan 2015, Keith Busch wrote:
> Oh, let's look at "__blk_rq_prep_clone". dm calls that after
> blk_get_request() for the blk-mq based multipath types and overrides the
> destination's cmd_flags with the source's even though the source was not
> allocated from a blk-mq based queue, much less a shared tag.

Untested patch. This will also preserve the failfast cmd_flag dm-mpath
set after allocating.

---
diff --git a/block/blk-core.c b/block/blk-core.c
index 7e78931..6201090 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2895,7 +2895,10 @@ EXPORT_SYMBOL_GPL(blk_rq_unprep_clone);
  static void __blk_rq_prep_clone(struct request *dst, struct request *src)
  {
  	dst->cpu = src->cpu;
-	dst->cmd_flags = (src->cmd_flags & REQ_CLONE_MASK) | REQ_NOMERGE;
+	if (dst->q->mq_ops)
+		dst->cmd_flags |= (src->cmd_flags & REQ_CLONE_MASK) | REQ_NOMERGE;
+	else
+		dst->cmd_flags = (src->cmd_flags & REQ_CLONE_MASK) | REQ_NOMERGE;
  	dst->cmd_type = src->cmd_type;
  	dst->__sector = blk_rq_pos(src);
  	dst->__data_len = blk_rq_bytes(src);
--

^ permalink raw reply related	[flat|nested] 95+ messages in thread

* Re: blk-mq request allocation stalls
  2015-01-12 18:22                                                                                 ` Keith Busch
  2015-01-12 18:35                                                                                   ` Keith Busch
@ 2015-01-12 19:05                                                                                   ` Jens Axboe
  1 sibling, 0 replies; 95+ messages in thread
From: Jens Axboe @ 2015-01-12 19:05 UTC (permalink / raw)
  To: Keith Busch
  Cc: Christoph Hellwig, Bart Van Assche, device-mapper development,
	Jun'ichi Nomura, Mike Snitzer

On 01/12/2015 11:22 AM, Keith Busch wrote:
> On Mon, 12 Jan 2015, Jens Axboe wrote:
>> On 01/12/2015 10:53 AM, Keith Busch wrote:
>>> Is the nr_active count correct prior to starting the mkfs test? Trying
>>> to see if someone is calling "blk_mq_alloc_tag_set()" twice on the same
>>> set. It might be good to add a WARN if this is detected anyway.
>>
>> That might be a good debug aid, I agree. But the above doesn't look
>> like it's corrupted. If you add the values, you get 60 and 62 for the
>> two cases, which seems to indicate that we did bump the values
>> correctly, but for some reason we never did the decrement on
>> completion. Hence we stabilize around the queue depth of the device,
>> which will be 62 +/- a bit due to the sharing.
>>
>> I'm not familiar with how rq based dm works. We clone the original
>> request (which has the REQ_MQ_INFLIGHT flag set), then we issue the
>> clone(s) to the underlying device(s)? And when that completes, we
>> complete the original? That would work fine with the flag on the
>> original request. Maybe I'm missing something, and I'll let more
>> knowledgeable people discuss that.
>
> Oh, let's look at "__blk_rq_prep_clone". dm calls that after
> blk_get_request() for the blk-mq based multipath types and overrides the
> destination's cmd_flags with the source's even though the source was not
> allocated from a blk-mq based queue, much less a shared tag.

Heh, I suck, I had read that but read it as |=. So yes, that would seem 
to back up my missing flag theory.


-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: blk-mq request allocation stalls
  2015-01-12 18:12                                                                               ` Jens Axboe
  2015-01-12 18:22                                                                                 ` Keith Busch
@ 2015-01-12 19:07                                                                                 ` Mike Snitzer
  1 sibling, 0 replies; 95+ messages in thread
From: Mike Snitzer @ 2015-01-12 19:07 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Keith Busch, Bart Van Assche, device-mapper development,
	Jun'ichi Nomura, Christoph Hellwig

On Mon, Jan 12 2015 at  1:12pm -0500,
Jens Axboe <axboe@kernel.dk> wrote:

> On 01/12/2015 10:53 AM, Keith Busch wrote:
> >On Mon, 12 Jan 2015, Jens Axboe wrote:
> >>On 01/12/2015 10:04 AM, Bart Van Assche wrote:
> >>>The tag state after having stopped multipathd (systemctl stop
> >>>multipathd) is as follows:
> >>># dmsetup table /dev/dm-0
> >>>0 256000 multipath 3 queue_if_no_path pg_init_retries 50 0 1 1
> >>>service-time 0 2 2 8:48 1 1 8:32 1 1
> >>># ls -l /dev/sd[cd]
> >>>brw-rw---- 1 root disk 8, 32 Jan 12 17:47 /dev/sdc
> >>>brw-rw---- 1 root disk 8, 48 Jan 12 17:47 /dev/sdd
> >>># for d in sdc sdd dm-0; do echo ==== $d; (cd /sys/block/$d/mq &&
> >>>   find|cut -c3-|grep active|xargs grep -aH ''); done
> >>>==== sdc
> >>>0/active:10
> >>>1/active:14
> >>>2/active:7
> >>>3/active:13
> >>>4/active:6
> >>>5/active:10
> >>>==== sdd
> >>>0/active:17
> >>>1/active:8
> >>>2/active:9
> >>>3/active:13
> >>>4/active:5
> >>>5/active:10
> >>>==== dm-0
> >>>-bash: cd: /sys/block/dm-0/mq: No such file or directory
> >>
> >>OK, so it's definitely leaking, but only partially - the requests are
> >>freed, yet the active count isn't decremented. I wonder if we're
> >>losing that flag along the way. It's numbered high enough that a cast
> >>to int will drop it, perhaps the cmd_flags is being copied/passed
> >>around as an int and not the appropriate u64? We've had bugs like that
> >>before.
> >
> >Is the nr_active count correct prior to starting the mkfs test? Trying
> >to see if someone is calling "blk_mq_alloc_tag_set()" twice on the same
> >set. It might be good to add a WARN if this is detected anyway.
> 
> That might be a good debug aid, I agree. But the above doesn't look
> like it's corrupted. If you add the values, you get 60 and 62 for
> the two cases, which seems to indicate that we did bump the values
> correctly, but for some reason we never did the decrement on
> completion. Hence we stabilize around the queue depth of the device,
> which will be 62 +/- a bit due to the sharing.
> 
> I'm not familiar with how rq based dm works. We clone the original
> request (which has the REQ_MQ_INFLIGHT flag set), then we issue the
> clone(s) to the underlying device(s)?

No, the original request comes from the old request-based path (like I
said in my previous reply to Bart).  So REQ_MQ_INFLIGHT will _not_ have
been set in the original request.  It only gets set in the blk-mq
blk_get_request() path.

Unfortunately any flag changes that blk_get_request() does would get
thrown away very quickly via __blk_rq_prep_clone(), which establishes
the flags with:
  dst->cmd_flags = (src->cmd_flags & REQ_CLONE_MASK) | REQ_NOMERGE;

The current call sequence is:
1) blk_get_request() -- via dm-mpath.c:__multipath_map()
2) __blk_mq_alloc_request() possibly sets REQ_MQ_INFLIGHT
3) blk_rq_prep_clone() copies cmd_flags to the clone; overwriting the
   clone's cmd_flags!

So the problem must be that REQ_MQ_INFLIGHT is getting dropped on the
floor in step 3.

Coping with the clone's allocation establishing flags in the clone,
before the original request's flags are copied over, is a new
requirement introduced by blk-mq.

Should __blk_rq_prep_clone() be updated to preserve REQ_MQ_INFLIGHT in
the cloned request too?  E.g. patch at the end of this mail?

> And when that completes, we complete the original? That would work
> fine with the flag on the original request. Maybe I'm missing
> something, and I'll let more knowledgeable people discuss that.

Yes, once the blk-mq requests issued to the underlying blk-mq devices
complete, the original (old) request is completed.

 block/blk-core.c          | 3 ++-
 include/linux/blk_types.h | 1 +
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 7e78931..40071de 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2895,7 +2895,8 @@ EXPORT_SYMBOL_GPL(blk_rq_unprep_clone);
 static void __blk_rq_prep_clone(struct request *dst, struct request *src)
 {
 	dst->cpu = src->cpu;
-	dst->cmd_flags = (src->cmd_flags & REQ_CLONE_MASK) | REQ_NOMERGE;
+	dst->cmd_flags = (dst->cmd_flags & REQ_PRESERVE_CLONE_MASK) |
+		(src->cmd_flags & REQ_CLONE_MASK) | REQ_NOMERGE;
 	dst->cmd_type = src->cmd_type;
 	dst->__sector = blk_rq_pos(src);
 	dst->__data_len = blk_rq_bytes(src);
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 445d592..f5ac72d 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -212,6 +212,7 @@ enum rq_flag_bits {
 	 REQ_DISCARD | REQ_WRITE_SAME | REQ_NOIDLE | REQ_FLUSH | REQ_FUA | \
 	 REQ_SECURE | REQ_INTEGRITY)
 #define REQ_CLONE_MASK		REQ_COMMON_MASK
+#define REQ_PRESERVE_CLONE_MASK		REQ_MQ_INFLIGHT
 
 #define BIO_NO_ADVANCE_ITER_MASK	(REQ_DISCARD|REQ_WRITE_SAME)
 

^ permalink raw reply related	[flat|nested] 95+ messages in thread

* Re: blk-mq request allocation stalls
  2015-01-12 18:35                                                                                   ` Keith Busch
@ 2015-01-12 19:11                                                                                     ` Mike Snitzer
  2015-01-12 20:21                                                                                       ` Mike Snitzer
  2015-01-13 14:59                                                                                     ` blk-mq request allocation stalls Jens Axboe
  1 sibling, 1 reply; 95+ messages in thread
From: Mike Snitzer @ 2015-01-12 19:11 UTC (permalink / raw)
  To: Keith Busch
  Cc: Jens Axboe, Bart Van Assche, device-mapper development,
	Jun'ichi Nomura, Christoph Hellwig

On Mon, Jan 12 2015 at  1:35pm -0500,
Keith Busch <keith.busch@intel.com> wrote:

> On Mon, 12 Jan 2015, Keith Busch wrote:
> >Oh, let's look at "__blk_rq_prep_clone". dm calls that after
> >blk_get_request() for the blk-mq based multipath types and overrides the
> >destination's cmd_flags with the source's even though the source was not
> >allocated from a blk-mq based queue, much less a shared tag.
> 
> Untested patch. This will also preserve the failfast cmd_flag dm-mpath
> set after allocating.

Ah, good point.  The failfast flag would get cleared with the patch I
just proposed (unless REQ_FAILFAST_TRANSPORT was added to
REQ_PRESERVE_CLONE_MASK).

Anyway, I'm happy to see this implemented however you guys think is
best.  I think I like Keith's patch better than mine.

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: blk-mq request allocation stalls
  2015-01-12 19:11                                                                                     ` Mike Snitzer
@ 2015-01-12 20:21                                                                                       ` Mike Snitzer
  2015-01-13 12:29                                                                                         ` Bart Van Assche
  0 siblings, 1 reply; 95+ messages in thread
From: Mike Snitzer @ 2015-01-12 20:21 UTC (permalink / raw)
  To: Keith Busch
  Cc: Jens Axboe, Bart Van Assche, device-mapper development,
	Jun'ichi Nomura, Christoph Hellwig

On Mon, Jan 12 2015 at  2:11pm -0500,
Mike Snitzer <snitzer@redhat.com> wrote:

> On Mon, Jan 12 2015 at  1:35pm -0500,
> Keith Busch <keith.busch@intel.com> wrote:
> 
> > On Mon, 12 Jan 2015, Keith Busch wrote:
> > >Oh, let's look at "__blk_rq_prep_clone". dm calls that after
> > >blk_get_request() for the blk-mq based multipath types and overrides the
> > >destination's cmd_flags with the source's even though the source was not
> > >allocated from a blk-mq based queue, much less a shared tag.
> > 
> > Untested patch. This will also preserve the failfast cmd_flag dm-mpath
> > set after allocating.
> 
> Ah, good point.  The failfast flag would get cleared with the patch I
> just proposed (unless REQ_FAILFAST_TRANSPORT was added to
> REQ_PRESERVE_CLONE_MASK).
> 
> Anyway, I'm happy to see this implemented however you guys think is
> best.  I think I like Keith's patch better than mine.

FYI, I staged Keith's patch here:
https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-for-3.20-blk-mq&id=7004ddf2462df38c6e3232ac020ed6ff655cc07e

Bart, this is the tip of the linux-dm.git "dm-for-3.20-blk-mq" branch.
Please test, it should hopefully take care of the stall you've been
seeing.

Mike

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: blk-mq request allocation stalls
  2015-01-12 20:21                                                                                       ` Mike Snitzer
@ 2015-01-13 12:29                                                                                         ` Bart Van Assche
  2015-01-13 14:17                                                                                           ` Mike Snitzer
  0 siblings, 1 reply; 95+ messages in thread
From: Bart Van Assche @ 2015-01-13 12:29 UTC (permalink / raw)
  To: Mike Snitzer, Keith Busch
  Cc: Jens Axboe, Christoph Hellwig, device-mapper development,
	Jun'ichi Nomura

On 01/12/15 21:22, Mike Snitzer wrote:
> FYI, I staged Keith's patch here:
> https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-for-3.20-blk-mq&id=7004ddf2462df38c6e3232ac020ed6ff655cc07e
> 
> Bart, this is the tip of the linux-dm.git "dm-for-3.20-blk-mq" branch.
> Please test, it should hopefully take care of the stall you've been
> seeing.

Hello Mike,

In the quick test I ran, the I/O stalls were indeed gone. Thanks :-)

However, I hit another issue while running I/O on top of a multipath
device (on a kernel with lockdep and SLUB memory poisoning enabled):

NMI watchdog: BUG: soft lockup - CPU#7 stuck for 23s! [kdmwork-253:0:3116]
CPU: 7 PID: 3116 Comm: kdmwork-253:0 Tainted: G        W      3.19.0-rc4-debug+ #1
Call Trace:
 [<ffffffff8118e4be>] kmem_cache_alloc+0x28e/0x2c0
 [<ffffffff81346aca>] alloc_iova_mem+0x1a/0x20
 [<ffffffff81342c8e>] alloc_iova+0x2e/0x250
 [<ffffffff81344b65>] intel_alloc_iova+0x95/0xd0
 [<ffffffff81348a15>] intel_map_sg+0xc5/0x260
 [<ffffffffa07e0661>] srp_queuecommand+0xa11/0xc30 [ib_srp]
 [<ffffffffa001698e>] scsi_dispatch_cmd+0xde/0x5a0 [scsi_mod]
 [<ffffffffa0017480>] scsi_queue_rq+0x630/0x700 [scsi_mod]
 [<ffffffff8125683d>] __blk_mq_run_hw_queue+0x1dd/0x370
 [<ffffffff81256aae>] blk_mq_alloc_request+0xde/0x150
 [<ffffffff8124bade>] blk_get_request+0x2e/0xe0
 [<ffffffffa07ebd0f>] __multipath_map.isra.15+0x1cf/0x210 [dm_multipath]
 [<ffffffffa07ebd6a>] multipath_clone_and_map+0x1a/0x20 [dm_multipath]
 [<ffffffffa044abb5>] map_tio_request+0x1d5/0x3a0 [dm_mod]
 [<ffffffff81075d16>] kthread_worker_fn+0x86/0x1b0
 [<ffffffff81075c0f>] kthread+0xef/0x110
 [<ffffffff814db42c>] ret_from_fork+0x7c/0xb0

Bart.

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: blk-mq request allocation stalls
  2015-01-13 12:29                                                                                         ` Bart Van Assche
@ 2015-01-13 14:17                                                                                           ` Mike Snitzer
  2015-01-13 14:28                                                                                             ` dm + blk-mq soft lockup complaint Bart Van Assche
  0 siblings, 1 reply; 95+ messages in thread
From: Mike Snitzer @ 2015-01-13 14:17 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Keith Busch, Jens Axboe, device-mapper development,
	Jun'ichi Nomura, Christoph Hellwig

On Tue, Jan 13 2015 at  7:29am -0500,
Bart Van Assche <bart.vanassche@sandisk.com> wrote:

> On 01/12/15 21:22, Mike Snitzer wrote:
> > FYI, I staged Keith's patch here:
> > https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-for-3.20-blk-mq&id=7004ddf2462df38c6e3232ac020ed6ff655cc07e
> > 
> > Bart, this is the tip of the linux-dm.git "dm-for-3.20-blk-mq" branch.
> > Please test, it should hopefully take care of the stall you've been
> > seeing.
> 
> Hello Mike,
> 
> In the quick test I ran, the I/O stalls were indeed gone. Thanks :-)

Good news, followed by a new mole rearing its head ;)
 
> However, I hit another issue while running I/O on top of a multipath
> device (on a kernel with lockdep and SLUB memory poisoning enabled):
>
> NMI watchdog: BUG: soft lockup - CPU#7 stuck for 23s! [kdmwork-253:0:3116]
> CPU: 7 PID: 3116 Comm: kdmwork-253:0 Tainted: G        W      3.19.0-rc4-debug+ #1
> Call Trace:
>  [<ffffffff8118e4be>] kmem_cache_alloc+0x28e/0x2c0
>  [<ffffffff81346aca>] alloc_iova_mem+0x1a/0x20
>  [<ffffffff81342c8e>] alloc_iova+0x2e/0x250
>  [<ffffffff81344b65>] intel_alloc_iova+0x95/0xd0
>  [<ffffffff81348a15>] intel_map_sg+0xc5/0x260
>  [<ffffffffa07e0661>] srp_queuecommand+0xa11/0xc30 [ib_srp]
>  [<ffffffffa001698e>] scsi_dispatch_cmd+0xde/0x5a0 [scsi_mod]
>  [<ffffffffa0017480>] scsi_queue_rq+0x630/0x700 [scsi_mod]
>  [<ffffffff8125683d>] __blk_mq_run_hw_queue+0x1dd/0x370
>  [<ffffffff81256aae>] blk_mq_alloc_request+0xde/0x150
>  [<ffffffff8124bade>] blk_get_request+0x2e/0xe0
>  [<ffffffffa07ebd0f>] __multipath_map.isra.15+0x1cf/0x210 [dm_multipath]
>  [<ffffffffa07ebd6a>] multipath_clone_and_map+0x1a/0x20 [dm_multipath]
>  [<ffffffffa044abb5>] map_tio_request+0x1d5/0x3a0 [dm_mod]
>  [<ffffffff81075d16>] kthread_worker_fn+0x86/0x1b0
>  [<ffffffff81075c0f>] kthread+0xef/0x110
>  [<ffffffff814db42c>] ret_from_fork+0x7c/0xb0

Unfortunate.  Is this still with a 16MB backing device or is it real
hardware?  Can you share the workload so that Keith and/or I could
try to reproduce?

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: dm + blk-mq soft lockup complaint
  2015-01-13 14:17                                                                                           ` Mike Snitzer
@ 2015-01-13 14:28                                                                                             ` Bart Van Assche
  2015-01-13 16:20                                                                                               ` Mike Snitzer
  0 siblings, 1 reply; 95+ messages in thread
From: Bart Van Assche @ 2015-01-13 14:28 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Keith Busch, Jens Axboe, device-mapper development,
	Jun'ichi Nomura, Christoph Hellwig

On 01/13/15 15:18, Mike Snitzer wrote:
> On Tue, Jan 13 2015 at  7:29am -0500,
> Bart Van Assche <bart.vanassche@sandisk.com> wrote:
>> However, I hit another issue while running I/O on top of a multipath
>> device (on a kernel with lockdep and SLUB memory poisoning enabled):
>>
>> NMI watchdog: BUG: soft lockup - CPU#7 stuck for 23s! [kdmwork-253:0:3116]
>> CPU: 7 PID: 3116 Comm: kdmwork-253:0 Tainted: G        W      3.19.0-rc4-debug+ #1
>> Call Trace:
>>  [<ffffffff8118e4be>] kmem_cache_alloc+0x28e/0x2c0
>>  [<ffffffff81346aca>] alloc_iova_mem+0x1a/0x20
>>  [<ffffffff81342c8e>] alloc_iova+0x2e/0x250
>>  [<ffffffff81344b65>] intel_alloc_iova+0x95/0xd0
>>  [<ffffffff81348a15>] intel_map_sg+0xc5/0x260
>>  [<ffffffffa07e0661>] srp_queuecommand+0xa11/0xc30 [ib_srp]
>>  [<ffffffffa001698e>] scsi_dispatch_cmd+0xde/0x5a0 [scsi_mod]
>>  [<ffffffffa0017480>] scsi_queue_rq+0x630/0x700 [scsi_mod]
>>  [<ffffffff8125683d>] __blk_mq_run_hw_queue+0x1dd/0x370
>>  [<ffffffff81256aae>] blk_mq_alloc_request+0xde/0x150
>>  [<ffffffff8124bade>] blk_get_request+0x2e/0xe0
>>  [<ffffffffa07ebd0f>] __multipath_map.isra.15+0x1cf/0x210 [dm_multipath]
>>  [<ffffffffa07ebd6a>] multipath_clone_and_map+0x1a/0x20 [dm_multipath]
>>  [<ffffffffa044abb5>] map_tio_request+0x1d5/0x3a0 [dm_mod]
>>  [<ffffffff81075d16>] kthread_worker_fn+0x86/0x1b0
>>  [<ffffffff81075c0f>] kthread+0xef/0x110
>>  [<ffffffff814db42c>] ret_from_fork+0x7c/0xb0
> 
> Unfortunate.  Is this still with a 16MB backing device or is it real
> hardware?  Can you share the workload so that Keith and/or I could
> try to reproduce?
 
Hello Mike,

This is still with a 16MB RAM disk as backing device. The fio job I
used to trigger this was as follows:

dev=/dev/sdc
fio --bs=4K --ioengine=libaio --rw=randread --buffered=0 --numjobs=12   \
    --iodepth=128 --iodepth_batch=64 --iodepth_batch_complete=64        \
    --thread --norandommap --loops=$((2**31)) --runtime=60              \
    --group_reporting --gtod_reduce=1 --name=$dev --filename=$dev       \
    --invalidate=1

Bart.
 

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: blk-mq request allocation stalls
  2015-01-12 18:35                                                                                   ` Keith Busch
  2015-01-12 19:11                                                                                     ` Mike Snitzer
@ 2015-01-13 14:59                                                                                     ` Jens Axboe
  2015-01-13 15:11                                                                                       ` Keith Busch
                                                                                                         ` (2 more replies)
  1 sibling, 3 replies; 95+ messages in thread
From: Jens Axboe @ 2015-01-13 14:59 UTC (permalink / raw)
  To: Keith Busch
  Cc: Christoph Hellwig, Bart Van Assche, device-mapper development,
	Jun'ichi Nomura, Mike Snitzer

On 01/12/2015 11:35 AM, Keith Busch wrote:
> On Mon, 12 Jan 2015, Keith Busch wrote:
>> Oh, let's look at "__blk_rq_prep_clone". dm calls that after
>> blk_get_request() for the blk-mq based multipath types and overrides the
>> destination's cmd_flags with the source's even though the source was not
>> allocated from a blk-mq based queue, much less a shared tag.
>
> Untested patch. This will also preserve the failfast cmd_flag dm-mpath
> set after allocating.
>
> ---
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 7e78931..6201090 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -2895,7 +2895,10 @@ EXPORT_SYMBOL_GPL(blk_rq_unprep_clone);
>   static void __blk_rq_prep_clone(struct request *dst, struct request *src)
>   {
>       dst->cpu = src->cpu;
> -    dst->cmd_flags = (src->cmd_flags & REQ_CLONE_MASK) | REQ_NOMERGE;
> +    if (dst->q->mq_ops)
> +        dst->cmd_flags |= (src->cmd_flags & REQ_CLONE_MASK) | REQ_NOMERGE;
> +    else
> +        dst->cmd_flags = (src->cmd_flags & REQ_CLONE_MASK) | REQ_NOMERGE;

Making the two cases different is a bit... nonsensical. We should do 
this for both cases, if safe, or move the MQ_INFLIGHT flag and expand 
the CLONE_MASK.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: blk-mq request allocation stalls
  2015-01-13 14:59                                                                                     ` blk-mq request allocation stalls Jens Axboe
@ 2015-01-13 15:11                                                                                       ` Keith Busch
  2015-01-13 15:27                                                                                         ` Keith Busch
  2015-01-13 15:41                                                                                         ` Mike Snitzer
  2015-01-13 15:14                                                                                       ` Mike Snitzer
  2015-01-27 18:42                                                                                       ` blk-mq DM changes for 3.20 [was: Re: blk-mq request allocation stalls] Mike Snitzer
  2 siblings, 2 replies; 95+ messages in thread
From: Keith Busch @ 2015-01-13 15:11 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, Mike Snitzer, Keith Busch,
	device-mapper development, Jun'ichi Nomura, Bart Van Assche

On Tue, 13 Jan 2015, Jens Axboe wrote:
> On 01/12/2015 11:35 AM, Keith Busch wrote:
>> On Mon, 12 Jan 2015, Keith Busch wrote:
>>> Oh, let's look at "__blk_rq_prep_clone". dm calls that after
>>> blk_get_request() for the blk-mq based multipath types and overrides the
>>> destination's cmd_flags with the source's even though the source was not
>>> allocated from a blk-mq based queue, much less a shared tag.
>> 
>> Untested patch. This will also preserve the failfast cmd_flag dm-mpath
>> set after allocating.
>> 
>> ---
>> diff --git a/block/blk-core.c b/block/blk-core.c
>> index 7e78931..6201090 100644
>> --- a/block/blk-core.c
>> +++ b/block/blk-core.c
>> @@ -2895,7 +2895,10 @@ EXPORT_SYMBOL_GPL(blk_rq_unprep_clone);
>>   static void __blk_rq_prep_clone(struct request *dst, struct request *src)
>>   {
>>       dst->cpu = src->cpu;
>> -    dst->cmd_flags = (src->cmd_flags & REQ_CLONE_MASK) | REQ_NOMERGE;
>> +    if (dst->q->mq_ops)
>> +        dst->cmd_flags |= (src->cmd_flags & REQ_CLONE_MASK) | REQ_NOMERGE;
>> +    else
>> +        dst->cmd_flags = (src->cmd_flags & REQ_CLONE_MASK) | REQ_NOMERGE;
>
> Making the two cases different is a bit... nonsensical. We should do this for 
> both cases, if safe, or move the MQ_INFLIGHT flag and expand the CLONE_MASK.

Expanding the clone mask won't do any good since the src doesn't come
from blk-mq and wouldn't ever have MQ_INFLIGHT set. Blk-mq initializes
the cmd_flags when you get one so I assumed OR'ing was safe. For the
non-blk-mq case, we have no guarantees how the req was initialized and
could have nonsense cmd_flags.

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: blk-mq request allocation stalls
  2015-01-13 14:59                                                                                     ` blk-mq request allocation stalls Jens Axboe
  2015-01-13 15:11                                                                                       ` Keith Busch
@ 2015-01-13 15:14                                                                                       ` Mike Snitzer
  2015-01-27 18:42                                                                                       ` blk-mq DM changes for 3.20 [was: Re: blk-mq request allocation stalls] Mike Snitzer
  2 siblings, 0 replies; 95+ messages in thread
From: Mike Snitzer @ 2015-01-13 15:14 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Keith Busch, Bart Van Assche, device-mapper development,
	Jun'ichi Nomura, Christoph Hellwig

On Tue, Jan 13 2015 at  9:59am -0500,
Jens Axboe <axboe@kernel.dk> wrote:

> On 01/12/2015 11:35 AM, Keith Busch wrote:
> >On Mon, 12 Jan 2015, Keith Busch wrote:
> >>Oh, let's look at "__blk_rq_prep_clone". dm calls that after
> >>blk_get_request() for the blk-mq based multipath types and overrides the
> >>destination's cmd_flags with the source's even though the source was not
> >>allocated from a blk-mq based queue, much less a shared tag.
> >
> >Untested patch. This will also preserve the failfast cmd_flag dm-mpath
> >set after allocating.
> >
> >---
> >diff --git a/block/blk-core.c b/block/blk-core.c
> >index 7e78931..6201090 100644
> >--- a/block/blk-core.c
> >+++ b/block/blk-core.c
> >@@ -2895,7 +2895,10 @@ EXPORT_SYMBOL_GPL(blk_rq_unprep_clone);
> >  static void __blk_rq_prep_clone(struct request *dst, struct request *src)
> >  {
> >      dst->cpu = src->cpu;
> >-    dst->cmd_flags = (src->cmd_flags & REQ_CLONE_MASK) | REQ_NOMERGE;
> >+    if (dst->q->mq_ops)
> >+        dst->cmd_flags |= (src->cmd_flags & REQ_CLONE_MASK) | REQ_NOMERGE;
> >+    else
> >+        dst->cmd_flags = (src->cmd_flags & REQ_CLONE_MASK) | REQ_NOMERGE;
> 
> Making the two cases different is a bit... nonsensical. We should do
> this for both cases, if safe, or move the MQ_INFLIGHT flag and
> expand the CLONE_MASK.

OK, I'll work through it.
k

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: blk-mq request allocation stalls
  2015-01-13 15:11                                                                                       ` Keith Busch
@ 2015-01-13 15:27                                                                                         ` Keith Busch
  2015-01-13 15:41                                                                                         ` Mike Snitzer
  1 sibling, 0 replies; 95+ messages in thread
From: Keith Busch @ 2015-01-13 15:27 UTC (permalink / raw)
  To: Keith Busch
  Cc: Jens Axboe, Mike Snitzer, Christoph Hellwig,
	device-mapper development, Jun'ichi Nomura, Bart Van Assche

On Tue, 13 Jan 2015, Keith Busch wrote:
> On Tue, 13 Jan 2015, Jens Axboe wrote:
>> Making the two cases different is a bit... nonsensical. We should do this 
>> for both cases, if safe, or move the MQ_INFLIGHT flag and expand the 
>> CLONE_MASK.
>
> Expanding the clone mask won't do any good since the src doesn't come
> from blk-mq and wouldn't ever have MQ_INFLIGHT set. Blk-mq initializes
> the cmd_flags when you get one so I assumed OR'ing was safe. For the
> non-blk-mq case, we have no guarantees how the req was initialized and
> could have nonsense cmd_flags.

I take back the last part. We require that blk_rq_init() be called on the
request prior to calling blk_rq_prep_clone(), making it safe to OR in all
the time.

diff --git a/block/blk-core.c b/block/blk-core.c
index 7e78931..b40b5d2 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2895,7 +2895,7 @@ EXPORT_SYMBOL_GPL(blk_rq_unprep_clone);
  static void __blk_rq_prep_clone(struct request *dst, struct request *src)
  {
  	dst->cpu = src->cpu;
-	dst->cmd_flags = (src->cmd_flags & REQ_CLONE_MASK) | REQ_NOMERGE;
+	dst->cmd_flags |= (src->cmd_flags & REQ_CLONE_MASK) | REQ_NOMERGE;
  	dst->cmd_type = src->cmd_type;
  	dst->__sector = blk_rq_pos(src);
  	dst->__data_len = blk_rq_bytes(src);

^ permalink raw reply related	[flat|nested] 95+ messages in thread

* Re: blk-mq request allocation stalls
  2015-01-13 15:11                                                                                       ` Keith Busch
  2015-01-13 15:27                                                                                         ` Keith Busch
@ 2015-01-13 15:41                                                                                         ` Mike Snitzer
  1 sibling, 0 replies; 95+ messages in thread
From: Mike Snitzer @ 2015-01-13 15:41 UTC (permalink / raw)
  To: Keith Busch
  Cc: Jens Axboe, Bart Van Assche, device-mapper development,
	Jun'ichi Nomura, Christoph Hellwig

On Tue, Jan 13 2015 at 10:11am -0500,
Keith Busch <keith.busch@intel.com> wrote:

> On Tue, 13 Jan 2015, Jens Axboe wrote:
> >On 01/12/2015 11:35 AM, Keith Busch wrote:
> >>On Mon, 12 Jan 2015, Keith Busch wrote:
> >>>Oh, let's look at "__blk_rq_prep_clone". dm calls that after
> >>>blk_get_request() for the blk-mq based multipath types and overrides the
> >>>destination's cmd_flags with the source's even though the source was not
> >>>allocated from a blk-mq based queue, much less a shared tag.
> >>
> >>Untested patch. This will also preserve the failfast cmd_flag dm-mpath
> >>set after allocating.
> >>
> >>---
> >>diff --git a/block/blk-core.c b/block/blk-core.c
> >>index 7e78931..6201090 100644
> >>--- a/block/blk-core.c
> >>+++ b/block/blk-core.c
> >>@@ -2895,7 +2895,10 @@ EXPORT_SYMBOL_GPL(blk_rq_unprep_clone);
> >>  static void __blk_rq_prep_clone(struct request *dst, struct request *src)
> >>  {
> >>      dst->cpu = src->cpu;
> >>-    dst->cmd_flags = (src->cmd_flags & REQ_CLONE_MASK) | REQ_NOMERGE;
> >>+    if (dst->q->mq_ops)
> >>+        dst->cmd_flags |= (src->cmd_flags & REQ_CLONE_MASK) | REQ_NOMERGE;
> >>+    else
> >>+        dst->cmd_flags = (src->cmd_flags & REQ_CLONE_MASK) | REQ_NOMERGE;
> >
> >Making the two cases different is a bit... nonsensical. We should
> >do this for both cases, if safe, or move the MQ_INFLIGHT flag and
> >expand the CLONE_MASK.
> 
> Expanding the clone mask won't do any good since the src doesn't come
> from blk-mq and wouldn't ever have MQ_INFLIGHT set. Blk-mq initializes
> the cmd_flags when you get one so I assumed OR'ing was safe. For the
> non-blk-mq case, we have no guarantees how the req was initialized and
> could have nonsense cmd_flags.

It _could_, but the only consumer of blk_rq_prep_clone() is request-based
DM, and the non-blk-mq case uses blk_rq_init() prior to calling
blk_rq_prep_clone(), so the cmd_flags will be 0.
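
In other words, because blk_rq_init() memsets the request, the two forms
coincide for the old path (sketch):

	blk_rq_init(NULL, clone);	/* zeroes it: clone->cmd_flags == 0 */
	/* so this: */
	clone->cmd_flags |= (src->cmd_flags & REQ_CLONE_MASK) | REQ_NOMERGE;
	/* behaves exactly like the old assignment:
	 * clone->cmd_flags  = (src->cmd_flags & REQ_CLONE_MASK) | REQ_NOMERGE;
	 */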

I've revised the patch to just do it for both cases since it will be
safe, see:
https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-for-3.20-blk-mq&id=6a55c9861326dc2a731c7978d93567dd4e62d2f7

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: dm + blk-mq soft lockup complaint
  2015-01-13 14:28                                                                                             ` dm + blk-mq soft lockup complaint Bart Van Assche
@ 2015-01-13 16:20                                                                                               ` Mike Snitzer
  2015-01-14  9:16                                                                                                   ` Bart Van Assche
  0 siblings, 1 reply; 95+ messages in thread
From: Mike Snitzer @ 2015-01-13 16:20 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Keith Busch, Jens Axboe, Christoph Hellwig,
	device-mapper development, Jun'ichi Nomura, linux-scsi

On Tue, Jan 13 2015 at  9:28am -0500,
Bart Van Assche <bart.vanassche@sandisk.com> wrote:

> On 01/13/15 15:18, Mike Snitzer wrote:
> > On Tue, Jan 13 2015 at  7:29am -0500,
> > Bart Van Assche <bart.vanassche@sandisk.com> wrote:
> >> However, I hit another issue while running I/O on top of a multipath
> >> device (on a kernel with lockdep and SLUB memory poisoning enabled):
> >>
> >> NMI watchdog: BUG: soft lockup - CPU#7 stuck for 23s! [kdmwork-253:0:3116]
> >> CPU: 7 PID: 3116 Comm: kdmwork-253:0 Tainted: G        W      3.19.0-rc4-debug+ #1
> >> Call Trace:
> >>  [<ffffffff8118e4be>] kmem_cache_alloc+0x28e/0x2c0
> >>  [<ffffffff81346aca>] alloc_iova_mem+0x1a/0x20
> >>  [<ffffffff81342c8e>] alloc_iova+0x2e/0x250
> >>  [<ffffffff81344b65>] intel_alloc_iova+0x95/0xd0
> >>  [<ffffffff81348a15>] intel_map_sg+0xc5/0x260
> >>  [<ffffffffa07e0661>] srp_queuecommand+0xa11/0xc30 [ib_srp]
> >>  [<ffffffffa001698e>] scsi_dispatch_cmd+0xde/0x5a0 [scsi_mod]
> >>  [<ffffffffa0017480>] scsi_queue_rq+0x630/0x700 [scsi_mod]
> >>  [<ffffffff8125683d>] __blk_mq_run_hw_queue+0x1dd/0x370
> >>  [<ffffffff81256aae>] blk_mq_alloc_request+0xde/0x150
> >>  [<ffffffff8124bade>] blk_get_request+0x2e/0xe0
> >>  [<ffffffffa07ebd0f>] __multipath_map.isra.15+0x1cf/0x210 [dm_multipath]
> >>  [<ffffffffa07ebd6a>] multipath_clone_and_map+0x1a/0x20 [dm_multipath]
> >>  [<ffffffffa044abb5>] map_tio_request+0x1d5/0x3a0 [dm_mod]
> >>  [<ffffffff81075d16>] kthread_worker_fn+0x86/0x1b0
> >>  [<ffffffff81075c0f>] kthread+0xef/0x110
> >>  [<ffffffff814db42c>] ret_from_fork+0x7c/0xb0
> > 
> > Unfortunate.  Is this still with a 16MB backing device or is it real
> > hardware?  Can you share the workload so that Keith and/or I could
> > try to reproduce?
>  
> Hello Mike,
> 
> This is still with a 16MB RAM disk as backing device. The fio job I
> used to trigger this was as follows:
> 
> dev=/dev/sdc
> fio --bs=4K --ioengine=libaio --rw=randread --buffered=0 --numjobs=12   \
>     --iodepth=128 --iodepth_batch=64 --iodepth_batch_complete=64        \
>     --thread --norandommap --loops=$((2**31)) --runtime=60              \
>     --group_reporting --gtod_reduce=1 --name=$dev --filename=$dev       \
>     --invalidate=1

OK, I assume you specified the mpath device for the test that failed.

This test works fine on my 100MB scsi_debug device with 4 paths exported
over virtio-blk to a guest that assembles the mpath device.

Could be a hang that is unique to scsi-mq.

Any chance you'd be willing to provide a HOWTO for setting up your
SRP/iSCSI configuration?

Are you carrying any related changes that are not upstream?  (I can hunt
down the email in this thread where you describe your kernel tree...)

I'll try to reproduce but this info could be useful to others that are
more scsi-mq inclined who might need to chase this too.

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: dm + blk-mq soft lockup complaint
  2015-01-13 16:20                                                                                               ` Mike Snitzer
@ 2015-01-14  9:16                                                                                                   ` Bart Van Assche
  0 siblings, 0 replies; 95+ messages in thread
From: Bart Van Assche @ 2015-01-14  9:16 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Keith Busch, Jens Axboe, Christoph Hellwig,
	device-mapper development, Jun'ichi Nomura, linux-scsi

[-- Attachment #1: Type: text/plain, Size: 2555 bytes --]

On 01/13/15 17:21, Mike Snitzer wrote:
> OK, I assume you specified the mpath device for the test that failed.

Yes, of course ...

> This test works fine on my 100MB scsi_debug device with 4 paths exported
> over virtio-blk to a guest that assembles the mpath device.
> 
> Could be a hang that is unique to scsi-mq.
> 
> Any chance you'd be willing to provide a HOWTO for setting up your
> SRP/iscsi configuration?
> 
> Are you carrying any related changes that are not upstream?  (I can hunt
> down the email in this thread where you describe your kernel tree...)
> 
> I'll try to reproduce but this info could be useful to others that are
> more scsi-mq inclined who might need to chase this too.

The four patches I had used in my tests at the initiator side and that
are not yet in v3.19-rc4 have been attached to this e-mail (I have not
yet had the time to post all of these patches for review).

This is how I had configured the initiator system:
* If the version of the srptools package supplied by your distro is
lower than 1.0.2, build and install the latest version from the source
code available at git://git.openfabrics.org/~bvanassche/srptools.git/.git.
* Install the latest version of lsscsi
(http://sg.danny.cz/scsi/lsscsi.html). This version has SRP transport
support but is not yet in any distro AFAIK.
* Build and install a kernel >= v3.19-rc4 that includes the dm patches
at the start of this e-mail thread.
* Check whether the IB links are up (should display "State: Active"):
ibstat | grep State:
* Spread completion interrupts statically over CPU cores, e.g. via the
attached script (spread-mlx4-ib-interrupts).
* Check whether the SRP target system is visible from the SRP initiator
system - the command below should print at least one line:
ibsrpdm -c
* Enable blk-mq:
echo Y > /sys/module/scsi_mod/parameters/use_blk_mq
* Configure the SRP kernel module parameters as follows:
echo 'options ib_srp cmd_sg_entries=255 dev_loss_tmo=60 ch_count=6' >
/etc/modprobe.d/ib_srp.conf
* Unload and reload the SRP initiator kernel module to apply these
parameters:
rmmod ib_srp; modprobe ib_srp
* Start srpd and wait until SRP login has finished:
systemctl start srpd
while ! lsscsi -t | grep -q srp:; do sleep 1; done
* Start multipathd and check the table it has built:
systemctl start multipathd
dmsetup table /dev/dm-0
* Set the I/O scheduler to noop, disable add_random and set rq_affinity
to 2 for all SRP and dm block devices.
* Run the I/O load of your preference.

Please let me know if you need any further information.

Bart.

[-- Attachment #2: 0001-e1000-Avoid-that-e1000_netpoll-triggers-a-kernel-war.patch --]
[-- Type: text/x-patch, Size: 4609 bytes --]

From 664b7adce6c09b9c939b4983f7f32b7539497ef4 Mon Sep 17 00:00:00 2001
From: Bart Van Assche <bvanassche@acm.org>
Date: Fri, 2 Jan 2015 14:52:07 +0100
Subject: [PATCH 1/4] e1000: Avoid that e1000_netpoll() triggers a kernel
 warning

console_cont_flush(), which is called by console_unlock(), calls
call_console_drivers() and hence also the netconsole function
write_msg() with local interrupts disabled. This means that it is
not allowed to call disable_irq() from inside a netpoll callback
function. Hence eliminate the disable_irq() / enable_irq() pair
from the e1000 netpoll function. This patch avoids that the e1000
networking driver triggers the following complaint:

BUG: sleeping function called from invalid context at kernel/irq/manage.c:104

Call Trace:
 [<ffffffff814d1ec5>] dump_stack+0x4c/0x65
 [<ffffffff8107bcc5>] ___might_sleep+0x175/0x230
 [<ffffffff8107bdba>] __might_sleep+0x3a/0xa0
 [<ffffffff810a78c8>] synchronize_irq+0x38/0xa0
 [<ffffffff810a7a20>] disable_irq+0x20/0x30
 [<ffffffffa04b4442>] e1000_netpoll+0x102/0x130 [e1000e]
 [<ffffffff813ffff2>] netpoll_poll_dev+0x72/0x350
 [<ffffffff81400489>] netpoll_send_skb_on_dev+0x1b9/0x2b0
 [<ffffffff81400842>] netpoll_send_udp+0x2c2/0x430
 [<ffffffffa058187f>] write_msg+0xcf/0x120 [netconsole]
 [<ffffffff810a4682>] call_console_drivers.constprop.25+0xc2/0x250
 [<ffffffff810a5588>] console_unlock+0x328/0x4c0
 [<ffffffff810a59f0>] vprintk_emit+0x2d0/0x570
 [<ffffffff810a5def>] vprintk_default+0x1f/0x30
 [<ffffffff814cf680>] printk+0x46/0x48

See also "[RFC PATCH net-next 00/11] net: remove disable_irq() from
->ndo_poll_controller" (http://thread.gmane.org/gmane.linux.network/342096).

See also patch "sched/wait: Add might_sleep() checks" (kernel v3.19-rc1;
commit e22b886a8a43).

Reported-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: David S. Miller <davem@davemloft.net>
Cc: Peter Zijlstra <peterz@infradead.org>
---
 drivers/net/ethernet/intel/e1000/e1000.h      |  5 +++++
 drivers/net/ethernet/intel/e1000/e1000_main.c | 27 ++++++++++++++++++++++-----
 2 files changed, 27 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000/e1000.h b/drivers/net/ethernet/intel/e1000/e1000.h
index 6970710..d85d19f 100644
--- a/drivers/net/ethernet/intel/e1000/e1000.h
+++ b/drivers/net/ethernet/intel/e1000/e1000.h
@@ -323,6 +323,11 @@ struct e1000_adapter {
 	struct delayed_work watchdog_task;
 	struct delayed_work fifo_stall_task;
 	struct delayed_work phy_info_task;
+
+#ifdef CONFIG_NET_POLL_CONTROLLER
+	/* Used to serialize e1000 interrupts and the e1000 netpoll callback. */
+	spinlock_t netpoll_lock;
+#endif
 };
 
 enum e1000_state_t {
diff --git a/drivers/net/ethernet/intel/e1000/e1000_main.c b/drivers/net/ethernet/intel/e1000/e1000_main.c
index 83140cb..e5866f1 100644
--- a/drivers/net/ethernet/intel/e1000/e1000_main.c
+++ b/drivers/net/ethernet/intel/e1000/e1000_main.c
@@ -1313,6 +1313,9 @@ static int e1000_sw_init(struct e1000_adapter *adapter)
 	e1000_irq_disable(adapter);
 
 	spin_lock_init(&adapter->stats_lock);
+#ifdef CONFIG_NET_POLL_CONTROLLER
+	spin_lock_init(&adapter->netpoll_lock);
+#endif
 
 	set_bit(__E1000_DOWN, &adapter->flags);
 
@@ -3747,10 +3750,8 @@ void e1000_update_stats(struct e1000_adapter *adapter)
  * @irq: interrupt number
  * @data: pointer to a network interface device structure
  **/
-static irqreturn_t e1000_intr(int irq, void *data)
+static irqreturn_t __e1000_intr(int irq, struct e1000_adapter *adapter)
 {
-	struct net_device *netdev = data;
-	struct e1000_adapter *adapter = netdev_priv(netdev);
 	struct e1000_hw *hw = &adapter->hw;
 	u32 icr = er32(ICR);
 
@@ -3792,6 +3793,24 @@ static irqreturn_t e1000_intr(int irq, void *data)
 	return IRQ_HANDLED;
 }
 
+static irqreturn_t e1000_intr(int irq, void *data)
+{
+	struct net_device *netdev = data;
+	struct e1000_adapter *adapter = netdev_priv(netdev);
+	irqreturn_t ret;
+#ifdef CONFIG_NET_POLL_CONTROLLER
+	unsigned long flags;
+
+	spin_lock_irqsave(&adapter->netpoll_lock, flags);
+	ret = __e1000_intr(irq, adapter);
+	spin_unlock_irqrestore(&adapter->netpoll_lock, flags);
+#else
+	ret = __e1000_intr(irq, adapter);
+#endif
+
+	return ret;
+}
+
 /**
  * e1000_clean - NAPI Rx polling callback
  * @adapter: board private structure
@@ -5216,9 +5235,7 @@ static void e1000_netpoll(struct net_device *netdev)
 {
 	struct e1000_adapter *adapter = netdev_priv(netdev);
 
-	disable_irq(adapter->pdev->irq);
 	e1000_intr(adapter->pdev->irq, netdev);
-	enable_irq(adapter->pdev->irq);
 }
 #endif
 
-- 
2.1.2


[-- Attachment #3: 0002-e1000e-Avoid-that-e1000_netpoll-triggers-a-kernel-wa.patch --]
[-- Type: text/x-patch, Size: 5267 bytes --]

From af2c97b3882a73f9b5a098e6aca322efb341ce6d Mon Sep 17 00:00:00 2001
From: Bart Van Assche <bvanassche@acm.org>
Date: Mon, 5 Jan 2015 11:40:23 +0100
Subject: [PATCH 2/4] e1000e: Avoid that e1000_netpoll() triggers a kernel
 warning

---
 drivers/net/ethernet/intel/e1000e/e1000.h  |  5 ++
 drivers/net/ethernet/intel/e1000e/netdev.c | 73 ++++++++++++++++++++++++------
 2 files changed, 65 insertions(+), 13 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000e/e1000.h b/drivers/net/ethernet/intel/e1000e/e1000.h
index 7785240..e89b80f 100644
--- a/drivers/net/ethernet/intel/e1000e/e1000.h
+++ b/drivers/net/ethernet/intel/e1000e/e1000.h
@@ -344,6 +344,11 @@ struct e1000_adapter {
 	struct ptp_clock_info ptp_clock_info;
 
 	u16 eee_advert;
+
+#ifdef CONFIG_NET_POLL_CONTROLLER
+	/* Used to serialize e1000 interrupts and the e1000 netpoll callback. */
+	spinlock_t netpoll_lock;
+#endif
 };
 
 struct e1000_info {
diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
index e14fd85..c6d0ffb 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -1761,11 +1761,10 @@ static void e1000e_downshift_workaround(struct work_struct *work)
  * @irq: interrupt number
  * @data: pointer to a network interface device structure
  **/
-static irqreturn_t e1000_intr_msi(int __always_unused irq, void *data)
+static irqreturn_t __e1000_intr_msi(struct e1000_adapter *adapter)
 {
-	struct net_device *netdev = data;
-	struct e1000_adapter *adapter = netdev_priv(netdev);
 	struct e1000_hw *hw = &adapter->hw;
+	struct net_device *netdev = adapter->netdev;
 	u32 icr = er32(ICR);
 
 	/* read ICR disables interrupts using IAM */
@@ -1823,16 +1822,32 @@ static irqreturn_t e1000_intr_msi(int __always_unused irq, void *data)
 	return IRQ_HANDLED;
 }
 
+static irqreturn_t e1000_intr_msi(int __always_unused irq, void *data)
+{
+	struct e1000_adapter *adapter = netdev_priv(data);
+#ifdef CONFIG_NET_POLL_CONTROLLER
+	unsigned long flags;
+	int ret;
+
+	spin_lock_irqsave(&adapter->netpoll_lock, flags);
+	ret = __e1000_intr_msi(adapter);
+	spin_unlock_irqrestore(&adapter->netpoll_lock, flags);
+
+	return ret;
+#else
+	return __e1000_intr_msi(adapter);
+#endif
+}
+
 /**
  * e1000_intr - Interrupt Handler
  * @irq: interrupt number
  * @data: pointer to a network interface device structure
  **/
-static irqreturn_t e1000_intr(int __always_unused irq, void *data)
+static irqreturn_t __e1000_intr(struct e1000_adapter *adapter)
 {
-	struct net_device *netdev = data;
-	struct e1000_adapter *adapter = netdev_priv(netdev);
 	struct e1000_hw *hw = &adapter->hw;
+	struct net_device *netdev = adapter->netdev;
 	u32 rctl, icr = er32(ICR);
 
 	if (!icr || test_bit(__E1000_DOWN, &adapter->state))
@@ -1903,6 +1918,23 @@ static irqreturn_t e1000_intr(int __always_unused irq, void *data)
 	return IRQ_HANDLED;
 }
 
+static irqreturn_t e1000_intr(int __always_unused irq, void *data)
+{
+	struct e1000_adapter *adapter = netdev_priv(data);
+#ifdef CONFIG_NET_POLL_CONTROLLER
+	unsigned long flags;
+	int ret;
+
+	spin_lock_irqsave(&adapter->netpoll_lock, flags);
+	ret = __e1000_intr(adapter);
+	spin_unlock_irqrestore(&adapter->netpoll_lock, flags);
+
+	return ret;
+#else
+	return __e1000_intr(adapter);
+#endif
+}
+
 static irqreturn_t e1000_msix_other(int __always_unused irq, void *data)
 {
 	struct net_device *netdev = data;
@@ -4180,6 +4212,9 @@ static int e1000_sw_init(struct e1000_adapter *adapter)
 	adapter->rx_ring_count = E1000_DEFAULT_RXD;
 
 	spin_lock_init(&adapter->stats64_lock);
+#ifdef CONFIG_NET_POLL_CONTROLLER
+	spin_lock_init(&adapter->netpoll_lock);
+#endif
 
 	e1000e_set_interrupt_capability(adapter);
 
@@ -6437,10 +6472,9 @@ static void e1000_shutdown(struct pci_dev *pdev)
 
 #ifdef CONFIG_NET_POLL_CONTROLLER
 
-static irqreturn_t e1000_intr_msix(int __always_unused irq, void *data)
+static irqreturn_t __e1000_intr_msix(struct e1000_adapter *adapter)
 {
-	struct net_device *netdev = data;
-	struct e1000_adapter *adapter = netdev_priv(netdev);
+	struct net_device *netdev = adapter->netdev;
 
 	if (adapter->msix_entries) {
 		int vector, msix_irq;
@@ -6467,6 +6501,23 @@ static irqreturn_t e1000_intr_msix(int __always_unused irq, void *data)
 	return IRQ_HANDLED;
 }
 
+static irqreturn_t e1000_intr_msix(int __always_unused irq, void *data)
+{
+	struct e1000_adapter *adapter = netdev_priv(data);
+#ifdef CONFIG_NET_POLL_CONTROLLER
+	int ret;
+	unsigned long flags;
+
+	spin_lock_irqsave(&adapter->netpoll_lock, flags);
+	ret = __e1000_intr_msix(adapter);
+	spin_unlock_irqrestore(&adapter->netpoll_lock, flags);
+
+	return ret;
+#else
+	return __e1000_intr_msix(adapter);
+#endif
+}
+
 /**
  * e1000_netpoll
  * @netdev: network interface device structure
@@ -6484,14 +6535,10 @@ static void e1000_netpoll(struct net_device *netdev)
 		e1000_intr_msix(adapter->pdev->irq, netdev);
 		break;
 	case E1000E_INT_MODE_MSI:
-		disable_irq(adapter->pdev->irq);
 		e1000_intr_msi(adapter->pdev->irq, netdev);
-		enable_irq(adapter->pdev->irq);
 		break;
 	default:		/* E1000E_INT_MODE_LEGACY */
-		disable_irq(adapter->pdev->irq);
 		e1000_intr(adapter->pdev->irq, netdev);
-		enable_irq(adapter->pdev->irq);
 		break;
 	}
 }
-- 
2.1.2


[-- Attachment #4: 0003-Avoid-that-sd_shutdown-triggers-a-kernel-warning.patch --]
[-- Type: text/x-patch, Size: 10717 bytes --]

From 54e10ead0a0e98bdf39a631c65c0a585211ffa22 Mon Sep 17 00:00:00 2001
From: Bart Van Assche <bvanassche@acm.org>
Date: Mon, 5 Jan 2015 10:51:13 +0100
Subject: [PATCH 3/4] Avoid that sd_shutdown() triggers a kernel warning

Since kernel v3.19-rc1 module_refcount() returns 1 instead of 0
when called from inside module_exit(). This breaks the
module_refcount() test in scsi_device_put() and hence causes the
following kernel warning to be reported when unloading the ib_srp
kernel module:

WARNING: CPU: 5 PID: 228 at kernel/module.c:954 module_put+0x207/0x220()

Call Trace:
 [<ffffffff814d1fcf>] dump_stack+0x4c/0x65
 [<ffffffff81053ada>] warn_slowpath_common+0x8a/0xc0
 [<ffffffff81053bca>] warn_slowpath_null+0x1a/0x20
 [<ffffffff810d0507>] module_put+0x207/0x220
 [<ffffffffa000bea8>] scsi_device_put+0x48/0x50 [scsi_mod]
 [<ffffffffa03676d2>] scsi_disk_put+0x32/0x50 [sd_mod]
 [<ffffffffa0368d4c>] sd_shutdown+0x8c/0x150 [sd_mod]
 [<ffffffffa0368e79>] sd_remove+0x69/0xc0 [sd_mod]
 [<ffffffff813457ef>] __device_release_driver+0x7f/0xf0
 [<ffffffff81345885>] device_release_driver+0x25/0x40
 [<ffffffff81345134>] bus_remove_device+0x124/0x1b0
 [<ffffffff8134189e>] device_del+0x13e/0x250
 [<ffffffffa001cdcd>] __scsi_remove_device+0xcd/0xe0 [scsi_mod]
 [<ffffffffa001b39f>] scsi_forget_host+0x6f/0x80 [scsi_mod]
 [<ffffffffa000d5f6>] scsi_remove_host+0x86/0x140 [scsi_mod]
 [<ffffffffa07d5c0b>] srp_remove_work+0x9b/0x210 [ib_srp]
 [<ffffffff8106fd28>] process_one_work+0x1d8/0x780
 [<ffffffff810703eb>] worker_thread+0x11b/0x4a0
 [<ffffffff81075a6f>] kthread+0xef/0x110
 [<ffffffff814dad6c>] ret_from_fork+0x7c/0xb0

See also patch "module: Remove stop_machine from module unloading"
(Masami Hiramatsu; commit e513cc1c07e2; kernel v3.19-rc1).

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
---
 drivers/scsi/scsi.c        | 63 ++++++++++++++++++++++++++++++++--------------
 drivers/scsi/sd.c          | 44 +++++++++++++++++---------------
 include/scsi/scsi_device.h |  2 ++
 3 files changed, 70 insertions(+), 39 deletions(-)

diff --git a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c
index e028854..2cae46b 100644
--- a/drivers/scsi/scsi.c
+++ b/drivers/scsi/scsi.c
@@ -973,30 +973,63 @@ int scsi_report_opcode(struct scsi_device *sdev, unsigned char *buffer,
 EXPORT_SYMBOL(scsi_report_opcode);
 
 /**
- * scsi_device_get  -  get an additional reference to a scsi_device
+ * scsi_dev_get - get an additional reference to a scsi_device
  * @sdev:	device to get a reference to
+ * @get_lld:    whether or not to increase the LLD kernel module refcount
  *
- * Description: Gets a reference to the scsi_device and increments the use count
- * of the underlying LLDD module.  You must hold host_lock of the
- * parent Scsi_Host or already have a reference when calling this.
+ * Description: Gets a reference to the scsi_device and optionally increments
+ * the use count of the associated LLDD module. You must hold host_lock of
+ * the parent Scsi_Host or already have a reference when calling this.
  */
-int scsi_device_get(struct scsi_device *sdev)
+int scsi_dev_get(struct scsi_device *sdev, bool get_lld)
 {
 	if (sdev->sdev_state == SDEV_DEL)
 		return -ENXIO;
 	if (!get_device(&sdev->sdev_gendev))
 		return -ENXIO;
-	/* We can fail this if we're doing SCSI operations
-	 * from module exit (like cache flush) */
-	try_module_get(sdev->host->hostt->module);
+	/* Can fail if invoked during module exit (like cache flush) */
+	if (get_lld && !try_module_get(sdev->host->hostt->module)) {
+		put_device(&sdev->sdev_gendev);
+		return -ENXIO;
+	}
 
 	return 0;
 }
+EXPORT_SYMBOL(scsi_dev_get);
+
+/**
+ * scsi_dev_put - release a reference to a scsi_device
+ * @sdev:	device to release a reference on
+ * @put_lld:    whether or not to decrease the LLD kernel module refcount
+ *
+ * Description: Release a reference to the scsi_device. The device is freed
+ * once the last user vanishes.
+ */
+void scsi_dev_put(struct scsi_device *sdev, bool put_lld)
+{
+	if (put_lld)
+		module_put(sdev->host->hostt->module);
+	put_device(&sdev->sdev_gendev);
+}
+EXPORT_SYMBOL(scsi_dev_put);
+
+/**
+ * scsi_device_get - get an additional reference to a scsi_device
+ * @sdev:	device to get a reference to
+ *
+ * Description: Gets a reference to the scsi_device and increments the use count
+ * of the underlying LLDD module.  You must hold host_lock of the
+ * parent Scsi_Host or already have a reference when calling this.
+ */
+int scsi_device_get(struct scsi_device *sdev)
+{
+	return scsi_dev_get(sdev, true);
+}
 EXPORT_SYMBOL(scsi_device_get);
 
 /**
- * scsi_device_put  -  release a reference to a scsi_device
- * @sdev:	device to release a reference on.
+ * scsi_device_put - release a reference to a scsi_device
+ * @sdev:	device to release a reference on
  *
  * Description: Release a reference to the scsi_device and decrements the use
  * count of the underlying LLDD module.  The device is freed once the last
@@ -1004,15 +1037,7 @@ EXPORT_SYMBOL(scsi_device_get);
  */
 void scsi_device_put(struct scsi_device *sdev)
 {
-#ifdef CONFIG_MODULE_UNLOAD
-	struct module *module = sdev->host->hostt->module;
-
-	/* The module refcount will be zero if scsi_device_get()
-	 * was called from a module removal routine */
-	if (module && module_refcount(module) != 0)
-		module_put(module);
-#endif
-	put_device(&sdev->sdev_gendev);
+	scsi_dev_put(sdev, true);
 }
 EXPORT_SYMBOL(scsi_device_put);
 
diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 3995169..bd641a8 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -564,13 +564,13 @@ static int sd_major(int major_idx)
 	}
 }
 
-static struct scsi_disk *__scsi_disk_get(struct gendisk *disk)
+static struct scsi_disk *__scsi_disk_get(struct gendisk *disk, bool get_lld)
 {
 	struct scsi_disk *sdkp = NULL;
 
 	if (disk->private_data) {
 		sdkp = scsi_disk(disk);
-		if (scsi_device_get(sdkp->device) == 0)
+		if (scsi_dev_get(sdkp->device, get_lld) == 0)
 			get_device(&sdkp->dev);
 		else
 			sdkp = NULL;
@@ -578,35 +578,36 @@ static struct scsi_disk *__scsi_disk_get(struct gendisk *disk)
 	return sdkp;
 }
 
-static struct scsi_disk *scsi_disk_get(struct gendisk *disk)
+static struct scsi_disk *scsi_disk_get(struct gendisk *disk, bool get_lld)
 {
 	struct scsi_disk *sdkp;
 
 	mutex_lock(&sd_ref_mutex);
-	sdkp = __scsi_disk_get(disk);
+	sdkp = __scsi_disk_get(disk, get_lld);
 	mutex_unlock(&sd_ref_mutex);
 	return sdkp;
 }
 
-static struct scsi_disk *scsi_disk_get_from_dev(struct device *dev)
+static struct scsi_disk *scsi_disk_get_from_dev(struct device *dev,
+						bool get_lld)
 {
 	struct scsi_disk *sdkp;
 
 	mutex_lock(&sd_ref_mutex);
 	sdkp = dev_get_drvdata(dev);
 	if (sdkp)
-		sdkp = __scsi_disk_get(sdkp->disk);
+		sdkp = __scsi_disk_get(sdkp->disk, get_lld);
 	mutex_unlock(&sd_ref_mutex);
 	return sdkp;
 }
 
-static void scsi_disk_put(struct scsi_disk *sdkp)
+static void scsi_disk_put(struct scsi_disk *sdkp, bool put_lld)
 {
 	struct scsi_device *sdev = sdkp->device;
 
 	mutex_lock(&sd_ref_mutex);
 	put_device(&sdkp->dev);
-	scsi_device_put(sdev);
+	scsi_dev_put(sdev, put_lld);
 	mutex_unlock(&sd_ref_mutex);
 }
 
@@ -1184,7 +1185,7 @@ static void sd_uninit_command(struct scsi_cmnd *SCpnt)
  **/
 static int sd_open(struct block_device *bdev, fmode_t mode)
 {
-	struct scsi_disk *sdkp = scsi_disk_get(bdev->bd_disk);
+	struct scsi_disk *sdkp = scsi_disk_get(bdev->bd_disk, true);
 	struct scsi_device *sdev;
 	int retval;
 
@@ -1239,7 +1240,7 @@ static int sd_open(struct block_device *bdev, fmode_t mode)
 	return 0;
 
 error_out:
-	scsi_disk_put(sdkp);
+	scsi_disk_put(sdkp, true);
 	return retval;	
 }
 
@@ -1273,7 +1274,7 @@ static void sd_release(struct gendisk *disk, fmode_t mode)
 	 * XXX is followed by a "rmmod sd_mod"?
 	 */
 
-	scsi_disk_put(sdkp);
+	scsi_disk_put(sdkp, true);
 }
 
 static int sd_getgeo(struct block_device *bdev, struct hd_geometry *geo)
@@ -1525,11 +1526,11 @@ static int sd_sync_cache(struct scsi_disk *sdkp)
 
 static void sd_rescan(struct device *dev)
 {
-	struct scsi_disk *sdkp = scsi_disk_get_from_dev(dev);
+	struct scsi_disk *sdkp = scsi_disk_get_from_dev(dev, true);
 
 	if (sdkp) {
 		revalidate_disk(sdkp->disk);
-		scsi_disk_put(sdkp);
+		scsi_disk_put(sdkp, true);
 	}
 }
 
@@ -3143,11 +3144,14 @@ static int sd_start_stop_device(struct scsi_disk *sdkp, int start)
 /*
  * Send a SYNCHRONIZE CACHE instruction down to the device through
  * the normal SCSI command structure.  Wait for the command to
- * complete.
+ * complete.  Since this function can be called during SCSI LLD kernel
+ * module unload and since try_module_get() fails after kernel module
+ * unload has started this function must not try to increase the SCSI
+ * LLD kernel module refcount.
  */
 static void sd_shutdown(struct device *dev)
 {
-	struct scsi_disk *sdkp = scsi_disk_get_from_dev(dev);
+	struct scsi_disk *sdkp = scsi_disk_get_from_dev(dev, false);
 
 	if (!sdkp)
 		return;         /* this can happen */
@@ -3166,12 +3170,12 @@ static void sd_shutdown(struct device *dev)
 	}
 
 exit:
-	scsi_disk_put(sdkp);
+	scsi_disk_put(sdkp, false);
 }
 
 static int sd_suspend_common(struct device *dev, bool ignore_stop_errors)
 {
-	struct scsi_disk *sdkp = scsi_disk_get_from_dev(dev);
+	struct scsi_disk *sdkp = scsi_disk_get_from_dev(dev, true);
 	int ret = 0;
 
 	if (!sdkp)
@@ -3197,7 +3201,7 @@ static int sd_suspend_common(struct device *dev, bool ignore_stop_errors)
 	}
 
 done:
-	scsi_disk_put(sdkp);
+	scsi_disk_put(sdkp, true);
 	return ret;
 }
 
@@ -3213,7 +3217,7 @@ static int sd_suspend_runtime(struct device *dev)
 
 static int sd_resume(struct device *dev)
 {
-	struct scsi_disk *sdkp = scsi_disk_get_from_dev(dev);
+	struct scsi_disk *sdkp = scsi_disk_get_from_dev(dev, true);
 	int ret = 0;
 
 	if (!sdkp->device->manage_start_stop)
@@ -3223,7 +3227,7 @@ static int sd_resume(struct device *dev)
 	ret = sd_start_stop_device(sdkp, 1);
 
 done:
-	scsi_disk_put(sdkp);
+	scsi_disk_put(sdkp, true);
 	return ret;
 }
 
diff --git a/include/scsi/scsi_device.h b/include/scsi/scsi_device.h
index 3a4edd1..a4cb852 100644
--- a/include/scsi/scsi_device.h
+++ b/include/scsi/scsi_device.h
@@ -330,6 +330,8 @@ extern void scsi_remove_device(struct scsi_device *);
 extern int scsi_unregister_device_handler(struct scsi_device_handler *scsi_dh);
 void scsi_attach_vpd(struct scsi_device *sdev);
 
+extern int scsi_dev_get(struct scsi_device *, bool get_lld);
+extern void scsi_dev_put(struct scsi_device *, bool put_lld);
 extern int scsi_device_get(struct scsi_device *);
 extern void scsi_device_put(struct scsi_device *);
 extern struct scsi_device *scsi_device_lookup(struct Scsi_Host *,
-- 
2.1.2


[-- Attachment #5: 0004-IB-srp-Process-REQ_PREEMPT-requests-correctly.patch --]
[-- Type: text/x-patch, Size: 1166 bytes --]

From 6f593a0e9fcfd9b6c99fd24ac981450ed6eb0a0f Mon Sep 17 00:00:00 2001
From: Bart Van Assche <bvanassche@acm.org>
Date: Thu, 8 Jan 2015 09:42:45 +0100
Subject: [PATCH 4/4] IB/srp: Process REQ_PREEMPT requests correctly

Reported-by: Max Gurtuvoy <maxg@mellanox.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 drivers/infiniband/ulp/srp/ib_srp.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c
index 0747c05..77a7a2f 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.c
+++ b/drivers/infiniband/ulp/srp/ib_srp.c
@@ -2003,8 +2003,13 @@ static int srp_queuecommand(struct Scsi_Host *shost, struct scsi_cmnd *scmnd)
 	if (in_scsi_eh)
 		mutex_lock(&rport->mutex);
 
+	/*
+	 * The "blocked" state of SCSI devices is ignored by the SCSI core for
+	 * REQ_PREEMPT requests. Hence the explicit check below for the SCSI
+	 * device state.
+	 */
 	scmnd->result = srp_chkready(target->rport);
-	if (unlikely(scmnd->result))
+	if (unlikely(scmnd->result != 0 || scsi_device_blocked(scmnd->device)))
 		goto err;
 
 	WARN_ON_ONCE(scmnd->request->tag < 0);
-- 
2.1.2


[-- Attachment #6: spread-mlx4-ib-interrupts --]
[-- Type: text/plain, Size: 1671 bytes --]

#!/bin/awk -f
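#
# Statically spread the mlx4-ib completion IRQs over CPU cores: for
# each HCA port, split its interrupt vectors across the NUMA nodes and
# pin each vector to one CPU by writing /proc/irq/<n>/smp_affinity_list.
# Must be run as root.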

BEGIN {
  "ls -1d /sys/devices/system/node/node* 2>&1 | wc -l" | getline nodes
  if (nodes > 1) {
    for (i = 0; i < nodes; i++) {
      cpus_per_node = 0
      while (("cd /sys/devices/system/cpu && ls -d cpu*/node" i " | sed 's/^cpu//;s,/.*,,'|sort -n" | getline j) > 0) {
        #print "[" i ", " cpus_per_node "]: " j
        cpu[i, cpus_per_node++] = j
      }
    }
  } else {
      cpus_per_node = 0
      while (("cd /sys/devices/system/cpu && ls -d cpu[0-9]* | sed 's/^cpu//'|sort -n" | getline j) > 0) {
        #print "[0, " cpus_per_node "]: " j
        cpu[0, cpus_per_node++] = j
      }
  }
  for (i = 0; i < nodes; i++)
      nextcpu[i] = 0
  while (("sed -n 's/.*mlx4-ib-\\([0-9]*\\)-[0-9]*@\\(.*\\)$/\\1 \\2/p' /proc/interrupts | uniq" | getline) > 0) {
    port = $1
    bus = substr($0, length($1) + 2)
    #print "port = " port "; bus = " bus
    irqcount = 0
    while (("sed -n 's/^[[:blank:]]*\\([0-9]*\\):[0-9[:blank:]]*[^[:blank:]]*[[:blank:]]*\\(mlx4-ib-" port "-[0-9]*@" bus "\\)$/\\1 \\2/p' </proc/interrupts" | getline) > 0) {
      irq[irqcount] = $1
      irqname[irqcount] = substr($0, length($1) + 2)
      irqcount++
    }
    for (i = 0; i < nodes; i++) {
      ch_start = i * irqcount / nodes
      ch_end = (i + 1) * irqcount / nodes
      for (ch = ch_start; ch < ch_end; ch++) {
        c = cpu[i, nextcpu[i]++ % cpus_per_node]
        if (nodes > 1)
            nodetxt =  " (node " i ")"
        else
            nodetxt = ""
        print "IRQ " irq[ch] " (" irqname[ch] "): CPU " c nodetxt
        cmd="echo " c " >/proc/irq/" irq[ch] "/smp_affinity_list"
	#print cmd
	system(cmd)
      }
    }
  }
  exit 0
}

^ permalink raw reply related	[flat|nested] 95+ messages in thread

* Re: dm + blk-mq soft lockup complaint
  2015-01-14  9:16                                                                                                   ` Bart Van Assche
@ 2015-01-14 18:59                                                                                                   ` Mike Snitzer
  2015-01-15  8:11                                                                                                     ` Bart Van Assche
  0 siblings, 1 reply; 95+ messages in thread
From: Mike Snitzer @ 2015-01-14 18:59 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Keith Busch, Jens Axboe, Christoph Hellwig,
	device-mapper development, Jun'ichi Nomura, linux-scsi

On Wed, Jan 14 2015 at  4:16am -0500,
Bart Van Assche <bart.vanassche@sandisk.com> wrote:

> On 01/13/15 17:21, Mike Snitzer wrote:
> > OK, I assume you specified the mpath device for the test that failed.
> 
> Yes, of course ...
> 
> > This test works fine on my 100MB scsi_debug device with 4 paths exported
> > over virtio-blk to a guest that assembles the mpath device.
> > 
> > Could be a hang that is unique to scsi-mq.
> > 
> > Any chance you'd be willing to provide a HOWTO for setting up your
> > SRP/iscsi configuration?
> > 
> > Are you carrying any related changes that are not upstream?  (I can hunt
> > down the email in this thread where you describe your kernel tree...)
> > 
> > I'll try to reproduce but this info could be useful to others that are
> > more scsi-mq inclined who might need to chase this too.
> 
> The four patches I had used in my tests at the initiator side and that
> are not yet in v3.19-rc4 have been attached to this e-mail (I have not
> yet had the time to post all of these patches for review).
> 
> This is how I had configured the initiator system:
> * If the version of the srptools package supplied by your distro is
> lower than 1.0.2, build and install the latest version from the source
> code available at git://git.openfabrics.org/~bvanassche/srptools.git/.git.
> * Install the latest version of lsscsi
> (http://sg.danny.cz/scsi/lsscsi.html). This version has SRP transport
> support but is not yet in any distro AFAIK.
> * Build and install a kernel >= v3.19-rc4 that includes the dm patches
> at the start of this e-mail thread.
> * Check whether the IB links are up (should display "State: Active"):
> ibstat | grep State:
> * Spread completion interrupts statically over CPU cores, e.g. via the
> attached script (spread-mlx4-ib-interrupts).
> * Check whether the SRP target system is visible from the SRP initiator
> system - the command below should print at least one line:
> ibsrpdm -c
> * Enable blk-mq:
> echo Y > /sys/module/scsi_mod/parameters/use_blk_mq
> * Configure the SRP kernel module parameters as follows:
> echo 'options ib_srp cmd_sg_entries=255 dev_loss_tmo=60 ch_count=6' >
> /etc/modprobe.d/ib_srp.conf
> * Unload and reload the SRP initiator kernel module to apply these
> parameters:
> rmmod ib_srp; modprobe ib_srp
> * Start srpd and wait until SRP login has finished:
> systemctl start srpd
> while ! lsscsi -t | grep -q srp:; do sleep 1; done
> * Start multipathd and check the table it has built:
> systemctl start multipathd
> dmsetup table /dev/dm-0
> * Set the I/O scheduler to noop, disable add_random and set rq_affinity
> to 2 for all SRP and dm block devices.
> * Run the I/O load of your preference.

Thanks for all this info.  But I don't have an IB setup readily
available to test with.  We are setting up an IB testbed in the lab and
can hopefully work through your setup in the coming weeks.

IB aside, I haven't been following along closely enough on scsi-mq
developments, but does a regular iscsi initiator have support for
scsi-mq?  I'd like to validate scsi-mq devices with the dm-mpath
changes.

Mike

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: dm + blk-mq soft lockup complaint
  2015-01-14 18:59                                                                                                   ` Mike Snitzer
@ 2015-01-15  8:11                                                                                                     ` Bart Van Assche
  2015-01-15 15:43                                                                                                       ` Mike Snitzer
  0 siblings, 1 reply; 95+ messages in thread
From: Bart Van Assche @ 2015-01-15  8:11 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Keith Busch, Jens Axboe, Christoph Hellwig,
	device-mapper development, Jun'ichi Nomura, linux-scsi

On 01/14/15 20:00, Mike Snitzer wrote:
> IB aside, I haven't been following along close enough on scsi-mq
> developments, but does a regular iscsi initiator have support for
> scsi-mq?  I'd like to validate scsi-mq devices with the dm-mpath
> changes.

Hello Mike,

scsi-mq support for the iSCSI initiator is being considered but has not
yet been implemented. For more information, see also Sagi Grimberg,
[LSF/MM TOPIC] iSCSI MQ adoption via MCS discussion, linux-scsi mailing
list, January 7, 2015 (http://thread.gmane.org/gmane.linux.scsi/98199).

Bart.


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: dm + blk-mq soft lockup complaint
  2015-01-15  8:11                                                                                                     ` Bart Van Assche
@ 2015-01-15 15:43                                                                                                       ` Mike Snitzer
  2015-01-15 15:55                                                                                                         ` Bart Van Assche
  0 siblings, 1 reply; 95+ messages in thread
From: Mike Snitzer @ 2015-01-15 15:43 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Keith Busch, Jens Axboe, Christoph Hellwig,
	device-mapper development, Jun'ichi Nomura, linux-scsi

On Thu, Jan 15 2015 at  3:11am -0500,
Bart Van Assche <bart.vanassche@sandisk.com> wrote:

> On 01/14/15 20:00, Mike Snitzer wrote:
> > IB aside, I haven't been following along closely enough on scsi-mq
> > developments, but does a regular iSCSI initiator have support for
> > scsi-mq?  I'd like to validate scsi-mq devices with the dm-mpath
> > changes.
> 
> Hello Mike,
> 
> scsi-mq support for the iSCSI initiator is being considered but has not
> yet been implemented. For more information, see also Sagi Grimberg,
> [LSF/MM TOPIC] iSCSI MQ adoption via MCS discussion, linux-scsi mailing
> list, January 7, 2015 (http://thread.gmane.org/gmane.linux.scsi/98199).

OK, thanks.

FYI, I just used your fio test against an mpath device that is using 4
virtio-scsi devices with blk-mq enabled and all worked fine.

How easily could you reproduce your SRP hang?  Did it happen every run
or did it take multiple runs to see it?

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: dm + blk-mq soft lockup complaint
  2015-01-15 15:43                                                                                                       ` Mike Snitzer
@ 2015-01-15 15:55                                                                                                         ` Bart Van Assche
  0 siblings, 0 replies; 95+ messages in thread
From: Bart Van Assche @ 2015-01-15 15:55 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Keith Busch, Jens Axboe, Christoph Hellwig,
	device-mapper development, Jun'ichi Nomura, linux-scsi

On 01/15/15 16:44, Mike Snitzer wrote:
> How easily could you reproduce your SRP hang?  Did it happen every run
> or did it take multiple runs to see it?

Hello Mike,

This complaint was easy to trigger. It took less than one minute before
the soft lockup complaint appeared. The system did not hang after that
complaint appeared. Maybe this complaint means that a cond_resched()
call was missing from a loop?
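
(For what it's worth, the canonical fix for this class of complaint is
to drop a cond_resched() into the offending loop -- a minimal sketch
below; the requeue list and dispatch helper are hypothetical, not taken
from the actual dm code:)

	/*
	 * A kthread draining work in a tight loop can keep a CPU busy
	 * for tens of seconds without ever scheduling, which is what
	 * the soft lockup watchdog complains about.
	 */
	while (!list_empty(&requeue_list)) {
		struct request *rq = list_first_entry(&requeue_list,
					struct request, queuelist);

		list_del_init(&rq->queuelist);
		dispatch_request(rq);	/* hypothetical helper */
		cond_resched();		/* yield so the watchdog stays fed */
	}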

Bart.

^ permalink raw reply	[flat|nested] 95+ messages in thread

* blk-mq DM changes for 3.20 [was: Re: blk-mq request allocation stalls]
  2015-01-13 14:59                                                                                     ` blk-mq request allocation stalls Jens Axboe
  2015-01-13 15:11                                                                                       ` Keith Busch
  2015-01-13 15:14                                                                                       ` Mike Snitzer
@ 2015-01-27 18:42                                                                                       ` Mike Snitzer
  2015-01-28 16:42                                                                                         ` Jens Axboe
  2 siblings, 1 reply; 95+ messages in thread
From: Mike Snitzer @ 2015-01-27 18:42 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Keith Busch, Bart Van Assche, device-mapper development,
	Jun'ichi Nomura, Christoph Hellwig

Hey Jens,

I _think_ we've resolved the issues Bart raised for request-based DM's
support for blk-mq devices (anything remaining seems specific to iSER's
blk-mq support which is in development).  Though Keith did have that one
additional patch for that block scatter gather attribute that we still
need to review closer.

Anyway, I think what we have is a solid start and see no reason to hold
these changes back further.  So I've rebased the 'dm-for-3.20' branch of
linux-dm.git on top of 3.19-rc6 and reordered the required block changes
to be at the front of the series, see:
https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-for-3.20

(these changes have been in Linux next for a month, via linux-dm.git
'for-next')

With your OK, I'd be happy to carry the required block changes and
ultimately request Linus pull them for 3.20 (I can backfill your Acks if
you approve).  BUT I also have no problem with you picking up the block
changes to submit via your block tree (I'd just have to rebase on top of
your 3.20 branch once you pull them in).

Let me know what you think, thanks.
Mike

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: blk-mq DM changes for 3.20 [was: Re: blk-mq request allocation stalls]
  2015-01-27 18:42                                                                                       ` blk-mq DM changes for 3.20 [was: Re: blk-mq request allocation stalls] Mike Snitzer
@ 2015-01-28 16:42                                                                                         ` Jens Axboe
  2015-01-28 17:44                                                                                           ` Mike Snitzer
  0 siblings, 1 reply; 95+ messages in thread
From: Jens Axboe @ 2015-01-28 16:42 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Keith Busch, Bart Van Assche, device-mapper development,
	Jun'ichi Nomura, Christoph Hellwig

On 01/27/2015 11:42 AM, Mike Snitzer wrote:
> Hey Jens,
>
> I _think_ we've resolved the issues Bart raised for request-based DM's
> support for blk-mq devices (anything remaining seems specific to iSER's
> blk-mq support which is in development).  Though Keith did have that one
> additional patch for that block scatter gather attribute that we still
> need to review closer.
>
> Anyway, I think what we have is a solid start and see no reason to hold
> these changes back further.  So I've rebased the 'dm-for-3.20' branch of
> linux-dm.git on top of 3.19-rc6 and reordered the required block changes
> to be at the front of the series, see:
> https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-for-3.20
>
> (these changes have been in Linux next for a month, via linux-dm.git
> 'for-next')
>
> With your OK, I'd be happy to carry the required block changes and
> ultimately request Linus pull them for 3.20 (I can backfill your Acks if
> you approve).  BUT I also have no problem with you picking up the block
> changes to submit via your block tree (I'd just have to rebase on top of
> your 3.20 branch once you pull them in).

I'd prefer to take these prep patches through the block tree. Only one I 
don't really like is this one:

https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-for-3.20&id=23556c2461407495099d1eb20b0de43432dc727d

I prefer keeping the alloc path as lean as possible, normal allocs 
always initialize ->bio since they need to associate a bio with it. Do
you have the oops trace from this one? Just curious if we can get rid of 
it, depending on how deep in the caller this is.
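
(For reference, the patch amounts to a single store in blk-mq's request
constructor -- sketched from the patch subject, assuming it lands in
blk_mq_rq_ctx_init(), not quoted from the commit itself:)

	/*
	 * blk-mq recycles request memory out of a per-queue tag pool,
	 * so without an explicit store a freshly allocated request can
	 * still carry the previous user's ->bio pointer.
	 */
	rq->bio = NULL;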

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: blk-mq DM changes for 3.20 [was: Re: blk-mq request allocation stalls]
  2015-01-28 16:42                                                                                         ` Jens Axboe
@ 2015-01-28 17:44                                                                                           ` Mike Snitzer
  2015-01-28 17:49                                                                                             ` Jens Axboe
  0 siblings, 1 reply; 95+ messages in thread
From: Mike Snitzer @ 2015-01-28 17:44 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Keith Busch, Bart Van Assche, device-mapper development,
	Jun'ichi Nomura, Christoph Hellwig

On Wed, Jan 28 2015 at 11:42am -0500,
Jens Axboe <axboe@kernel.dk> wrote:

> On 01/27/2015 11:42 AM, Mike Snitzer wrote:
> >Hey Jens,
> >
> >I _think_ we've resolved the issues Bart raised for request-based DM's
> >support for blk-mq devices (anything remaining seems specific to iSER's
> >blk-mq support which is in development).  Though Keith did have that one
> >additional patch for that block scatter gather attribute that we still
> >need to review closer.
> >
> >Anyway, I think what we have is a solid start and see no reason to hold
> >these changes back further.  So I've rebased the 'dm-for-3.20' branch of
> >linux-dm.git on top of 3.19-rc6 and reordered the required block changes
> >to be at the front of the series, see:
> >https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-for-3.20
> >
> >(these changes have been in Linux next for a month, via linux-dm.git
> >'for-next')
> >
> >With your OK, I'd be happy to carry the required block changes and
> >ultimately request Linus pull them for 3.20 (I can backfill your Acks if
> >you approve).  BUT I also have no problem with you picking up the block
> >changes to submit via your block tree (I'd just have to rebase on top of
> >your 3.20 branch once you pull them in).
> 
> I'd prefer to take these prep patches through the block tree.

Great, should I send the patches or can you cherry-pick?

> Only one I don't really like is this one:
> 
> https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-for-3.20&id=23556c2461407495099d1eb20b0de43432dc727d
> 
> I prefer keeping the alloc path as lean as possible, normal allocs
> always initialize ->bio since they need to associate a bio with it.

Would be very surprised if this initialization were measurable but..
I could push this initialization into the DM-mpath driver (just after
blk_get_request, like Keith opted for) but that seemed really gross.

> Do you have the oops trace from this one? Just curious if we can get
> rid of it, depending on how deep in the caller this is.

I didn't, but it was easy enough to recreate:

[    3.112949] BUG: unable to handle kernel NULL pointer dereference at           (null)
[    3.113416] IP: [<ffffffff812f6734>] blk_rq_prep_clone+0x44/0x160
[    3.113416] PGD 0
[    3.113416] Oops: 0002 [#1] SMP
[    3.113416] Modules linked in: dm_service_time crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel glue_helper lrw gf128mul ablk_helper cryptd serio_raw pcspkr virtio_balloon 8139too i2c_piix4 nfsd auth_rpcgss nfs_acl lockd grace sunrpc dm_multipath dm_mod ext4 mbcache jbd2 sd_mod ata_generic cirrus pata_acpi syscopyarea sysfillrect sysimgblt drm_kms_helper ttm drm virtio_scsi virtio_blk 8139cp virtio_pci mii i2c_core virtio_ring ata_piix virtio libata floppy
[    3.113416] CPU: 0 PID: 483 Comm: kdmwork-252:3 Tainted: G        W      3.18.0+ #29
[    3.113416] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[    3.113416] task: ffff880035c1ad20 ti: ffff8800d6900000 task.ti: ffff8800d6900000
[    3.113416] RIP: 0010:[<ffffffff812f6734>]  [<ffffffff812f6734>] blk_rq_prep_clone+0x44/0x160
[    3.113416] RSP: 0000:ffff8800d6903d48  EFLAGS: 00010286
[    3.113416] RAX: 0000000000000000 RBX: ffffffffa0208500 RCX: 0000000000000001
[    3.113416] RDX: ffff8800d7a3b0a0 RSI: ffff880035d0ab00 RDI: ffff880119f8f510
[    3.113416] RBP: ffff8800d6903d98 R08: 00000000000185a0 R09: 00000000000000d0
[    3.113416] R10: ffff8800d7547680 R11: ffff880035c1b8c8 R12: ffff8800d83d7900
[    3.113416] R13: ffff880035d0ab00 R14: ffff880119f8f510 R15: ffff8800d7547680
[    3.113416] FS:  0000000000000000(0000) GS:ffff88011fc00000(0000) knlGS:0000000000000000
[    3.113416] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    3.113416] CR2: 0000000000000000 CR3: 00000000daeec000 CR4: 00000000000407f0
[    3.113416] Stack:
[    3.113416]  ffff8800d6903db8 ffff8800d71502e0 ffff8800d7a3b0a0 000000d0d71502e0
[    3.113416]  ffff8800d6e89800 ffff8800d71502e0 ffff8800d6e89800 0000000000000001
[    3.113416]  ffff8800d7a3b0a0 ffffc90000998040 ffff8800d6903df8 ffffffffa0209c69
[    3.113416] Call Trace:
[    3.113416]  [<ffffffffa0209c69>] map_tio_request+0x219/0x2b0 [dm_mod]
[    3.113416]  [<ffffffff8109a4ee>] kthread_worker_fn+0x7e/0x1b0
[    3.113416]  [<ffffffff8109a470>] ? __init_kthread_worker+0x60/0x60
[    3.113416]  [<ffffffff8109a3f7>] kthread+0x107/0x120
[    3.113416]  [<ffffffff8109a2f0>] ? kthread_create_on_node+0x240/0x240
[    3.113416]  [<ffffffff816952bc>] ret_from_fork+0x7c/0xb0
[    3.113416]  [<ffffffff8109a2f0>] ? kthread_create_on_node+0x240/0x240
[    3.113416] Code: 89 c3 48 83 ec 28 4c 8b 6e 68 48 85 d2 4c 0f 44 25 22 b7 92 01 48 89 75 b8 89 4d cc 4c 89 4d c0 4d 85 ed 75 16 eb 60 49 8b 47 70 <4c> 89 30 4d 89
77 70 4d 8b 6d 00 4d 85 ed 74 4c 8b 75 cc 4c 89
[    3.113416] RIP  [<ffffffff812f6734>] blk_rq_prep_clone+0x44/0x160
[    3.113416]  RSP <ffff8800d6903d48>
[    3.113416] CR2: 0000000000000000
[    3.113416] ---[ end trace 9b3bb6dd6cc4435d ]---

crash> dis -l blk_rq_prep_clone+0x44
/home/snitm/git/linux/block/blk-core.c: 2945
0xffffffff812f6734 <blk_rq_prep_clone+0x44>:    mov    %r14,(%rax)

crash> l /home/snitm/git/linux/block/blk-core.c: 2945
2940    
2941                    if (bio_ctr && bio_ctr(bio, bio_src, data))
2942                            goto free_and_out;
2943    
2944                    if (rq->bio) {
2945                            rq->biotail->bi_next = bio;
2946                            rq->biotail = bio;
2947                    } else
2948                            rq->bio = rq->biotail = bio;
2949            }

Given that the NULL pointer fault occurs when dereferencing
rq->biotail, a revised check of "if (rq->bio && rq->biotail)" should
suffice, but I unfortunately then get:

[    2.801634] general protection fault: 0000 [#1] SMP
[    2.802504] Modules linked in: dm_service_time crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel glue_helper lrw gf128mul ablk_helper cryptd pcspkr serio_raw 8139too virtio_balloon i2c_piix4 nfsd auth_rpcgss nfs_acl lockd grace sunrpc dm_multipath dm_mod ext4 mbcache jbd2 ata_generic sd_mod pata_acpi cirrus syscopyarea sysfillrect sysimgblt drm_kms_helper ttm virtio_scsi virtio_blk drm virtio_pci virtio_ring ata_piix 8139cp libata mii i2c_core virtio floppy
[    2.802504] CPU: 0 PID: 474 Comm: kdmwork-252:1 Tainted: G        W      3.18.0+ #30
[    2.802504] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[    2.802504] task: ffff8801194b1690 ti: ffff880119abc000 task.ti: ffff880119abc000
[    2.802504] RIP: 0010:[<ffffffff812f6739>]  [<ffffffff812f6739>] blk_rq_prep_clone+0x49/0x160
[    2.802504] RSP: 0018:ffff880119abfd48  EFLAGS: 00010206
[    2.802504] RAX: 6de900000000e800 RBX: ffffffffa0218500 RCX: 0000000000000001
[    2.802504] RDX: ffff8800daca30a0 RSI: ffff880119dcaf00 RDI: ffff880119dca310
[    2.802504] RBP: ffff880119abfd98 R08: 00000000000185a0 R09: 00000000000000d0
[    2.802504] R10: ffff880035937680 R11: ffff8801194b2238 R12: ffff880035876900
[    2.802504] R13: ffff880119dcaf00 R14: ffff880119dca310 R15: ffff880035937680
[    2.802504] FS:  0000000000000000(0000) GS:ffff88011fc00000(0000) knlGS:0000000000000000
[    2.802504] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    2.802504] CR2: 00007f45598c2350 CR3: 000000003614a000 CR4: 00000000000407f0
[    2.802504] Stack:
[    2.802504]  ffff880119abfdb8 ffff8800dae602e0 ffff8800daca30a0 000000d0dae602e0
[    2.802504]  ffff8800dad6a000 ffff8800dae602e0 ffff8800dad6a000 0000000000000001
[    2.802504]  ffff8800daca30a0 ffffc9000097d040 ffff880119abfdf8 ffffffffa0219c69
[    2.802504] Call Trace:
[    2.802504]  [<ffffffffa0219c69>] map_tio_request+0x219/0x2b0 [dm_mod]
[    2.802504]  [<ffffffff8109a4ee>] kthread_worker_fn+0x7e/0x1b0
[    2.802504]  [<ffffffff8109a470>] ? __init_kthread_worker+0x60/0x60
[    2.802504]  [<ffffffff8109a3f7>] kthread+0x107/0x120
[    2.802504]  [<ffffffff8109a2f0>] ? kthread_create_on_node+0x240/0x240
[    2.802504]  [<ffffffff816952bc>] ret_from_fork+0x7c/0xb0
[    2.802504]  [<ffffffff8109a2f0>] ? kthread_create_on_node+0x240/0x240
[    2.802504] Code: 28 4c 8b 6e 68 48 85 d2 4c 0f 44 25 22 b7 92 01 48 89 75 b8 89 4d cc 4c 89 4d c0 4d 85 ed 75 1b eb 64 49 8b 47 70 48 85 c0 74 4a <4c> 89 30 4d 89
77 70 4d 8b 6d 00 4d 85 ed 74 4b 8b 75 cc 4c 89
[    2.802504] RIP  [<ffffffff812f6739>] blk_rq_prep_clone+0x49/0x160
[    2.802504]  RSP <ffff880119abfd48>
[    2.802386] general protection fault: 0000 [#2] [    2.893050] ---[ end trace 20d230269dc05eca ]---

Not sure what to make of this (other than rq->biotail is pointing at
crap too, which is actually likely if rq->bio is):

crash> dis -l blk_rq_prep_clone+0x49
/home/snitm/git/linux/block/blk-core.c: 2945
0xffffffff812f6739 <blk_rq_prep_clone+0x49>:    mov    %r14,(%rax)

crash> l /home/snitm/git/linux/block/blk-core.c: 2945
2940    
2941                    if (bio_ctr && bio_ctr(bio, bio_src, data))
2942                            goto free_and_out;
2943    
2944                    if (rq->bio && rq->biotail) {
2945                            rq->biotail->bi_next = bio;
2946                            rq->biotail = bio;
2947                    } else
2948                            rq->bio = rq->biotail = bio;
2949            }

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: blk-mq DM changes for 3.20 [was: Re: blk-mq request allocation stalls]
  2015-01-28 17:44                                                                                           ` Mike Snitzer
@ 2015-01-28 17:49                                                                                             ` Jens Axboe
  2015-01-28 18:10                                                                                               ` Mike Snitzer
  2015-01-29 22:43                                                                                               ` blk-mq DM changes for 3.20 [was: Re: blk-mq request allocation stalls]X Keith Busch
  0 siblings, 2 replies; 95+ messages in thread
From: Jens Axboe @ 2015-01-28 17:49 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Keith Busch, Bart Van Assche, device-mapper development,
	Jun'ichi Nomura, Christoph Hellwig

On 01/28/2015 10:44 AM, Mike Snitzer wrote:
> On Wed, Jan 28 2015 at 11:42am -0500,
> Jens Axboe <axboe@kernel.dk> wrote:
>
>> On 01/27/2015 11:42 AM, Mike Snitzer wrote:
>>> Hey Jens,
>>>
>>> I _think_ we've resolved the issues Bart raised for request-based DM's
>>> support for blk-mq devices (anything remaining seems specific to iSER's
>>> blk-mq support which is in development).  Though Keith did have that one
>>> additional patch for that block scatter gather attribute that we still
>>> need to review closer.
>>>
>>> Anyway, I think what we have is a solid start and see no reason to hold
>>> these changes back further.  So I've rebased the 'dm-for-3.20' branch of
>>> linux-dm.git on top of 3.19-rc6 and reordered the required block changes
>>> to be at the front of the series, see:
>>> https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-for-3.20
>>>
>>> (these changes have been in Linux next for a month, via linux-dm.git
>>> 'for-next')
>>>
>>> With your OK, I'd be happy to carry the required block changes and
>>> ultimately request Linus pull them for 3.20 (I can backfill your Acks if
>>> you approve).  BUT I also have no problem with you picking up the block
>>> changes to submit via your block tree (I'd just have to rebase on top of
>>> your 3.20 branch once you pull them in).
>>
>> I'd prefer to take these prep patches through the block tree.
>
> Great, should I send the patches or can you cherry-pick?

I already cherry picked them, they are in the for-3.20/core branch.

>> Only one I don't really like is this one:
>>
>> https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-for-3.20&id=23556c2461407495099d1eb20b0de43432dc727d
>>
>> I prefer keeping the alloc path as lean as possible, normal allocs
>> always initialize ->bio since they need to associate a bio with it.
>
> Would be very surprised if this initialization were measurable but..

That's what people always say, and then keep piling more crap in...

> I could push this initialization into the DM-mpath driver (just after
> blk_get_request, like Keith opted for) but that seemed really gross.

It's already doing blk_rq_init() now, so not a huge change and not that 
nasty.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: blk-mq DM changes for 3.20 [was: Re: blk-mq request allocation stalls]
  2015-01-28 17:49                                                                                             ` Jens Axboe
@ 2015-01-28 18:10                                                                                               ` Mike Snitzer
  2015-01-29 22:43                                                                                               ` blk-mq DM changes for 3.20 [was: Re: blk-mq request allocation stalls]X Keith Busch
  1 sibling, 0 replies; 95+ messages in thread
From: Mike Snitzer @ 2015-01-28 18:10 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Keith Busch, Bart Van Assche, device-mapper development,
	Jun'ichi Nomura, Christoph Hellwig

On Wed, Jan 28 2015 at 12:49pm -0500,
Jens Axboe <axboe@kernel.dk> wrote:

> On 01/28/2015 10:44 AM, Mike Snitzer wrote:
> >On Wed, Jan 28 2015 at 11:42am -0500,
> >Jens Axboe <axboe@kernel.dk> wrote:
> >
> >>On 01/27/2015 11:42 AM, Mike Snitzer wrote:
> >>>Hey Jens,
> >>>
> >>>I _think_ we've resolved the issues Bart raised for request-based DM's
> >>>support for blk-mq devices (anything remaining seems specific to iSER's
> >>>blk-mq support which is in development).  Though Keith did have that one
> >>>additional patch for that block scatter gather attribute that we still
> >>>need to review closer.
> >>>
> >>>Anyway, I think what we have is a solid start and see no reason to hold
> >>>these changes back further.  So I've rebased the 'dm-for-3.20' branch of
> >>>linux-dm.git on top of 3.19-rc6 and reordered the required block changes
> >>>to be at the front of the series, see:
> >>>https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-for-3.20
> >>>
> >>>(these changes have been in Linux next for a month, via linux-dm.git
> >>>'for-next')
> >>>
> >>>With your OK, I'd be happy to carry the required block changes and
> >>>ultimately request Linus pull them for 3.20 (I can backfill your Acks if
> >>>you approve).  BUT I also have no problem with you picking up the block
> >>>changes to submit via your block tree (I'd just have to rebase on top of
> >>>your 3.20 branch once you pull them in).
> >>
> >>I'd prefer to take these prep patches through the block tree.
> >
> >Great, should I send the patches or can you cherry-pick?
> 
> I already cherry picked them, they are in the for-3.20/core branch.
> 
> >>Only one I don't really like is this one:
> >>
> >>https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-for-3.20&id=23556c2461407495099d1eb20b0de43432dc727d
> >>
> >>I prefer keeping the alloc path as lean as possible, normal allocs
> >>always initialize ->bio since they need to associate a bio with it.
> >
> >Would be very surprised if this initialization were measurable but..
> 
> That's what people always say, and then keep piling more crap in...

Uh huh ;)

> >I could push this initialization into the DM-mpath driver (just after
> >blk_get_request, like Keith opted for) but that seemed really gross.
> 
> It's already doing blk_rq_init() now, so not a huge change and not
> that nasty.

"It" being drivers/md/dm.c:clone_rq?  That is only for the old request
path not for blk-mq.  blk_rq_init() cannot be called after
blk_get_request() for blk-mq case because it'll destroy all the relevant
initialization that was already done.
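
(To illustrate: blk_rq_init() begins with a memset() of the whole
request, so running it after blk_get_request() would wipe the tag and
mq context that blk-mq assigned at allocation time.  A trimmed sketch,
not the full function:)

	void blk_rq_init(struct request_queue *q, struct request *rq)
	{
		memset(rq, 0, sizeof(*rq));	/* clobbers ->tag, ->mq_ctx, ... */

		INIT_LIST_HEAD(&rq->queuelist);
		rq->q = q;
		rq->tag = -1;			/* blk-mq stored a real tag here */
		/* ... remaining reinitialization trimmed ... */
	}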

Anyway, I'll just hack around this in the blk-mq request allocation path
of the mpath driver with:

diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
index 913b5b4..863fc8c 100644
--- a/drivers/md/dm-mpath.c
+++ b/drivers/md/dm-mpath.c
@@ -432,6 +432,7 @@ static int __multipath_map(struct dm_target *ti, struct request *clone,
 		if (IS_ERR(*__clone))
 			/* ENOMEM, requeue */
 			return r;
+		(*__clone)->bio = (*__clone)->biotail = NULL;
 		(*__clone)->rq_disk = bdev->bd_disk;
 		(*__clone)->cmd_flags |= REQ_FAILFAST_TRANSPORT;
 	}

Thanks for pulling in the other changes!

^ permalink raw reply related	[flat|nested] 95+ messages in thread

* Re: blk-mq DM changes for 3.20 [was: Re: blk-mq request allocation stalls]X
  2015-01-28 17:49                                                                                             ` Jens Axboe
  2015-01-28 18:10                                                                                               ` Mike Snitzer
@ 2015-01-29 22:43                                                                                               ` Keith Busch
  2015-01-29 23:09                                                                                                 ` Mike Snitzer
  1 sibling, 1 reply; 95+ messages in thread
From: Keith Busch @ 2015-01-29 22:43 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, Mike Snitzer, Keith Busch,
	device-mapper development, Jun'ichi Nomura, Bart Van Assche

On Wed, 28 Jan 2015, Jens Axboe wrote:
> On 01/28/2015 10:44 AM, Mike Snitzer wrote:
>> On Wed, Jan 28 2015 at 11:42am -0500,
>> Jens Axboe <axboe@kernel.dk> wrote:
>>> I'd prefer to take these prep patches through the block tree.
>> 
>> Great, should I send the patches or can you cherry-pick?
>
> I already cherry picked them, they are in the for-3.20/core branch.

I might be getting ahead of myself in trying this right now (sorry if
that's the case), but 'for-3.20/core' is missing necessary parts of the
series and hits a BUG_ON at blk-core.c:2333.  The original request is
initialized when setting up the clone, and in this branch that happens
in the prep_fn, before the request is dequeued, so the request is no
longer on the queuelist when it is started.

One of my commits relocated the initialization, but I didn't realize it
had a hard dependency on the follow-on commit.  Should we reorder that
part of the series?
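
(If blk-core.c:2333 is the queuelist check in blk_dequeue_request(),
the failure mode would be roughly the following sketch:)

	void blk_dequeue_request(struct request *rq)
	{
		/*
		 * Fires if blk_rq_init() was run on a request that is
		 * still queued: its memset() + INIT_LIST_HEAD() re-init
		 * rq->queuelist without unlinking it from the queue.
		 */
		BUG_ON(list_empty(&rq->queuelist));
		/* ... dequeue proper trimmed ... */
	}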

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: blk-mq DM changes for 3.20 [was: Re: blk-mq request allocation stalls]X
  2015-01-29 22:43                                                                                               ` blk-mq DM changes for 3.20 [was: Re: blk-mq request allocation stalls]X Keith Busch
@ 2015-01-29 23:09                                                                                                 ` Mike Snitzer
  2015-01-29 23:44                                                                                                   ` Keith Busch
  0 siblings, 1 reply; 95+ messages in thread
From: Mike Snitzer @ 2015-01-29 23:09 UTC (permalink / raw)
  To: Keith Busch
  Cc: Jens Axboe, Bart Van Assche, device-mapper development,
	Jun'ichi Nomura, Christoph Hellwig

On Thu, Jan 29 2015 at  5:43pm -0500,
Keith Busch <keith.busch@intel.com> wrote:

> On Wed, 28 Jan 2015, Jens Axboe wrote:
> >On 01/28/2015 10:44 AM, Mike Snitzer wrote:
> >>On Wed, Jan 28 2015 at 11:42am -0500,
> >>Jens Axboe <axboe@kernel.dk> wrote:
> >>>I'd prefer to take these prep patches through the block tree.
> >>
> >>Great, should I send the patches or can you cherry-pick?
> >
> >I already cherry picked them, they are in the for-3.20/core branch.
> 
> I might be getting ahead of myself for trying this right now (sorry if
> that's the case) but 'for-3.20/core' is missing necessary parts of the
> series and hits a BUG_ON at blk-core.c:2333. The original request is
> initialized when setting up the clone, and in this branch, that happens
> in prep_fn before the request was dequeued so it's not in the queuelist
> when it started.
> 
> One of my commits relocated the initialization, but I didn't realize it
> had a hard dependency on the follow-on commit.  Should we reorder that
> part of the series?

Which follow on commit are you referring to?  Please be specific about
which commits you think are out of order.

Also, what are you testing... are you using the linux-dm.git tree's
'dm-for-3.20' branch which builds on Jens' 'for-3.20/core'?

If not please test that and see if you still have problems (without the
associated DM changes all the block preparation changes that Jens
cherry-picked shouldn't cause any problems at all).

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: blk-mq DM changes for 3.20 [was: Re: blk-mq request allocation stalls]X
  2015-01-29 23:09                                                                                                 ` Mike Snitzer
@ 2015-01-29 23:44                                                                                                   ` Keith Busch
  2015-01-30  0:32                                                                                                     ` Mike Snitzer
  0 siblings, 1 reply; 95+ messages in thread
From: Keith Busch @ 2015-01-29 23:44 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Jens Axboe, Christoph Hellwig, Keith Busch,
	device-mapper development, Jun'ichi Nomura, Bart Van Assche

On Thu, 29 Jan 2015, Mike Snitzer wrote:
> On Thu, Jan 29 2015 at  5:43pm -0500,
>> One of my commits relocated the initialization, but I didn't realize it
>> had a hard dependency on the follow-on commit.  Should we reorder that
>> part of the series?
>
> Which follow on commit are you referring to?  Please be specific about
> which commits you think are out of order.
>
> Also, what are you testing... are you using the linux-dm.git tree's
> 'dm-for-3.20' branch which builds on Jens' 'for-3.20/core'?
>
> If not please test that and see if you still have problems (without the
> associated DM changes all the block preparation changes that Jens
> cherry-picked shouldn't cause any problems at all).

I'm using Jens' linux-block for-3.20/core branch. The last dm commit
is this:

  commit febf71588c2a750e04dc2a8b0824ce120c48bd9e
  Author: Keith Busch <keith.busch@intel.com>
  Date:   Fri Oct 17 17:46:35 2014 -0600

     block: require blk_rq_prep_clone() be given an initialized clone request

Looking at this again, the above was incorrect in the first place: it
initialized the original, but the intent was to initialize the clone. This
slipped by me since the next part of the series fixed it. In your linux-dm
dm-for-3.20, it's this commit:

  commit 102e38b1030e883efc022dfdc7b7e7a3de70d1c5
  Author: Mike Snitzer <snitzer@redhat.com>
  Date:   Fri Dec 5 17:11:05 2014 -0500

      dm: split request structure out from dm_rq_target_io structure


So I was confused earlier, there's no need to reorder anything. I just
need to fix the broken part.

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: blk-mq DM changes for 3.20 [was: Re: blk-mq request allocation stalls]X
  2015-01-29 23:44                                                                                                   ` Keith Busch
@ 2015-01-30  0:32                                                                                                     ` Mike Snitzer
  0 siblings, 0 replies; 95+ messages in thread
From: Mike Snitzer @ 2015-01-30  0:32 UTC (permalink / raw)
  To: Keith Busch
  Cc: Jens Axboe, Bart Van Assche, device-mapper development,
	Jun'ichi Nomura, Christoph Hellwig

On Thu, Jan 29 2015 at  6:44pm -0500,
Keith Busch <keith.busch@intel.com> wrote:

> On Thu, 29 Jan 2015, Mike Snitzer wrote:
> >On Thu, Jan 29 2015 at  5:43pm -0500,
> >>One of my commits relocated the initialization, but I didn't realize it
> >>had a hard dependency on the follow-on commit.  Should we reorder that
> >>part of the series?
> >
> >Which follow on commit are you referring to?  Please be specific about
> >which commits you think are out of order.
> >
> >Also, what are you testing... are you using the linux-dm.git tree's
> >'dm-for-3.20' branch which builds on Jens' 'for-3.20/core'?
> >
> >If not please test that and see if you still have problems (without the
> >associated DM changes all the block preparation changes that Jens
> >cherry-picked shouldn't cause any problems at all).
> 
> I'm using Jens' linux-block for-3.20/core branch. The last dm commit
> is this:
> 
>  commit febf71588c2a750e04dc2a8b0824ce120c48bd9e
>  Author: Keith Busch <keith.busch@intel.com>
>  Date:   Fri Oct 17 17:46:35 2014 -0600
> 
>     block: require blk_rq_prep_clone() be given an initialized clone request
> 
> Looking at this again, the above was incorrect in the first place: it
> initialized the original, but the intent was to initialize the clone. This
> slipped by me since the next part of the series fixed it. In your linux-dm
> dm-for-3.20, it's this commit:
> 
>  commit 102e38b1030e883efc022dfdc7b7e7a3de70d1c5
>  Author: Mike Snitzer <snitzer@redhat.com>
>  Date:   Fri Dec 5 17:11:05 2014 -0500
> 
>      dm: split request structure out from dm_rq_target_io structure
> 
> 
> So I was confused earlier, there's no need to reorder anything. I just
> need to fix the broken part.

Oof, yeah that first commit should be using: blk_rq_init(NULL, clone);

Jens, any chance you could rebase commit febf71588c2a with s/rq/clone/
in dm.c:clone_rq's call to blk_rq_init?
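
(That is, sketched against dm.c:clone_rq -- illustrative, not the
final commit text:)

	-	blk_rq_init(NULL, rq);		/* re-initializes the original */
	+	blk_rq_init(NULL, clone);	/* the clone was the intent */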

^ permalink raw reply	[flat|nested] 95+ messages in thread

end of thread, other threads:[~2015-01-30  0:32 UTC | newest]

Thread overview: 95+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-12-17  3:59 [PATCH v3 0/8] dm: add request-based blk-mq support Mike Snitzer
2014-12-17  3:59 ` [PATCH v3 1/8] block: require blk_rq_prep_clone() be given an initialized clone request Mike Snitzer
2014-12-17  3:59 ` [PATCH v3 2/8] block: initialize bio member of blk-mq request to NULL Mike Snitzer
2014-12-17  3:59 ` [PATCH v3 3/8] block: add blk-mq support to blk_insert_cloned_request() Mike Snitzer
2014-12-17  4:00 ` [PATCH v3 4/8] block: mark blk-mq devices as stackable Mike Snitzer
2014-12-17  4:00 ` [PATCH v3 5/8] dm: remove exports for request-based interfaces without external callers Mike Snitzer
2014-12-17  4:00 ` [PATCH v3 6/8] dm: split request structure out from dm_rq_target_io structure Mike Snitzer
2014-12-17  4:00 ` [PATCH v3 7/8] dm: submit stacked requests in irq enabled context Mike Snitzer
2014-12-17  4:00 ` [PATCH v3 8/8] dm: allocate requests from target when stacking on blk-mq devices Mike Snitzer
2014-12-17 22:35   ` Mike Snitzer
2014-12-17 21:42 ` [PATCH v3 0/8] dm: add request-based blk-mq support Keith Busch
2014-12-17 21:43   ` Jens Axboe
2014-12-17 23:06     ` Mike Snitzer
2014-12-18  1:41       ` Keith Busch
2014-12-18  4:58         ` Mike Snitzer
2014-12-19 14:32       ` Bart Van Assche
2014-12-19 15:38         ` Mike Snitzer
2014-12-19 17:14           ` Mike Snitzer
2014-12-22 15:28             ` Bart Van Assche
2014-12-22 18:49               ` Mike Snitzer
2014-12-23 16:24                 ` Bart Van Assche
2014-12-23 17:13                   ` Mike Snitzer
2014-12-23 21:42                     ` Mike Snitzer
2014-12-24 13:02                       ` Bart Van Assche
2014-12-24 18:21                         ` Mike Snitzer
2014-12-24 18:55                           ` Mike Snitzer
2014-12-24 19:26                             ` Mike Snitzer
2015-01-02 17:53                               ` Bart Van Assche
2015-01-05 21:35                                 ` Mike Snitzer
2015-01-06  8:59                                   ` Christoph Hellwig
2015-01-06  9:31                                   ` Bart Van Assche
2015-01-06 16:05                                     ` blk-mq request allocation stalls [was: Re: [PATCH v3 0/8] dm: add request-based blk-mq support] Mike Snitzer
2015-01-06 16:15                                       ` Jens Axboe
2015-01-07 10:33                                         ` Bart Van Assche
2015-01-07 15:32                                           ` Jens Axboe
2015-01-07 16:15                                             ` Mike Snitzer
2015-01-07 16:18                                               ` Jens Axboe
2015-01-07 16:22                                               ` Mike Snitzer
2015-01-07 16:24                                                 ` Jens Axboe
2015-01-07 17:18                                                   ` Mike Snitzer
2015-01-07 17:35                                                     ` Jens Axboe
2015-01-07 20:09                                                       ` Mike Snitzer
2015-01-07 20:40                                           ` Keith Busch
2015-01-09 19:49                                             ` Mike Snitzer
2015-01-09 21:07                                               ` Jens Axboe
2015-01-09 21:11                                                 ` Jens Axboe
2015-01-09 21:40                                                   ` Mike Snitzer
2015-01-09 21:56                                                     ` Jens Axboe
2015-01-09 22:25                                                       ` Mike Snitzer
2015-01-10  0:27                                                         ` Jens Axboe
2015-01-10  1:48                                                           ` Mike Snitzer
2015-01-10  1:59                                                             ` Jens Axboe
2015-01-10  3:10                                                               ` Mike Snitzer
2015-01-12 14:46                                                                 ` blk-mq request allocation stalls Bart Van Assche
2015-01-12 15:42                                                                   ` Jens Axboe
2015-01-12 16:12                                                                     ` Bart Van Assche
2015-01-12 16:34                                                                       ` Jens Axboe
2015-01-12 16:58                                                                         ` Mike Snitzer
2015-01-12 16:59                                                                           ` Jens Axboe
2015-01-12 17:04                                                                         ` Bart Van Assche
2015-01-12 17:09                                                                           ` Jens Axboe
2015-01-12 17:53                                                                             ` Keith Busch
2015-01-12 18:12                                                                               ` Jens Axboe
2015-01-12 18:22                                                                                 ` Keith Busch
2015-01-12 18:35                                                                                   ` Keith Busch
2015-01-12 19:11                                                                                     ` Mike Snitzer
2015-01-12 20:21                                                                                       ` Mike Snitzer
2015-01-13 12:29                                                                                         ` Bart Van Assche
2015-01-13 14:17                                                                                           ` Mike Snitzer
2015-01-13 14:28                                                                                             ` dm + blk-mq soft lockup complaint Bart Van Assche
2015-01-13 16:20                                                                                               ` Mike Snitzer
2015-01-14  9:16                                                                                                 ` Bart Van Assche
2015-01-14  9:16                                                                                                   ` Bart Van Assche
2015-01-14 18:59                                                                                                   ` Mike Snitzer
2015-01-15  8:11                                                                                                     ` Bart Van Assche
2015-01-15 15:43                                                                                                       ` Mike Snitzer
2015-01-15 15:55                                                                                                         ` Bart Van Assche
2015-01-13 14:59                                                                                     ` blk-mq request allocation stalls Jens Axboe
2015-01-13 15:11                                                                                       ` Keith Busch
2015-01-13 15:27                                                                                         ` Keith Busch
2015-01-13 15:41                                                                                         ` Mike Snitzer
2015-01-13 15:14                                                                                       ` Mike Snitzer
2015-01-27 18:42                                                                                       ` blk-mq DM changes for 3.20 [was: Re: blk-mq request allocation stalls] Mike Snitzer
2015-01-28 16:42                                                                                         ` Jens Axboe
2015-01-28 17:44                                                                                           ` Mike Snitzer
2015-01-28 17:49                                                                                             ` Jens Axboe
2015-01-28 18:10                                                                                               ` Mike Snitzer
2015-01-29 22:43                                                                                               ` blk-mq DM changes for 3.20 [was: Re: blk-mq request allocation stalls]X Keith Busch
2015-01-29 23:09                                                                                                 ` Mike Snitzer
2015-01-29 23:44                                                                                                   ` Keith Busch
2015-01-30  0:32                                                                                                     ` Mike Snitzer
2015-01-12 19:05                                                                                   ` blk-mq request allocation stalls Jens Axboe
2015-01-12 19:07                                                                                 ` Mike Snitzer
2015-01-12 18:19                                                                           ` Mike Snitzer
2014-12-17 22:51   ` [PATCH v3 0/8] dm: add request-based blk-mq support Mike Snitzer
