* [PATCH 0/5] blk-mq & dm: fix IO hang and deal with one performance issue
@ 2018-01-22  3:35 Ming Lei
  2018-01-22  3:35 ` [PATCH 1/5] blk-mq: introduce BLK_STS_DEV_RESOURCE Ming Lei
                   ` (4 more replies)
  0 siblings, 5 replies; 31+ messages in thread
From: Ming Lei @ 2018-01-22  3:35 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig, Mike Snitzer
  Cc: dm-devel, Ming Lei

Hi Jens,

The 1st patch introduces BLK_STS_DEV_RESOURCE for dealing with the IO hang
that happens when a non-device resource runs out while the queue is idle;
this approach was suggested by you.

The 2nd and 3rd patches clean up dm-rq a bit and prepare for applying
blk_get_request_notify().

The 4th patch introduces blk_get_request_notify(), also suggested by you.

The last patch applies blk_get_request_notify() and avoids a DM-specific
performance issue which is introduced by the 1st patch.

These 5 patches are against both the dm tree (dm-4.16) and the block tree
(for-4.16/block).

Ming Lei (5):
  blk-mq: introduce BLK_STS_DEV_RESOURCE
  dm-rq: handle dispatch exception in dm_dispatch_clone_request()
  dm-rq: return BLK_STS_* from map_request()
  blk-mq: introduce blk_get_request_notify
  dm-mpath: use blk_mq_alloc_request_notify for allocating blk-mq req

 block/blk-core.c             |  1 +
 block/blk-mq-tag.c           | 27 +++++++++++++++++++++-
 block/blk-mq.c               | 43 +++++++++++++++++++++++++++++------
 block/blk-mq.h               |  1 +
 drivers/block/null_blk.c     |  2 +-
 drivers/block/virtio_blk.c   |  2 +-
 drivers/block/xen-blkfront.c |  2 +-
 drivers/md/dm-mpath.c        | 50 ++++++++++++++++++++++++++++++++++++++++-
 drivers/md/dm-rq.c           | 53 +++++++++++++++++++++++++++-----------------
 drivers/md/dm.c              |  1 +
 drivers/scsi/scsi_lib.c      |  6 ++---
 include/linux/blk-mq.h       |  5 +++++
 include/linux/blk_types.h    |  7 ++++++
 13 files changed, 165 insertions(+), 35 deletions(-)

-- 
2.9.5

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH 1/5] blk-mq: introduce BLK_STS_DEV_RESOURCE
  2018-01-22  3:35 [PATCH 0/5] blk-mq & dm: fix IO hang and deal with one performance issue Ming Lei
@ 2018-01-22  3:35 ` Ming Lei
  2018-01-22 16:32   ` Christoph Hellwig
  2018-01-22 16:49     ` Bart Van Assche
  2018-01-22  3:35   ` Ming Lei
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 31+ messages in thread
From: Ming Lei @ 2018-01-22  3:35 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig, Mike Snitzer
  Cc: dm-devel, Ming Lei, Laurence Oberman, Bart Van Assche,
	James E . J . Bottomley, Martin K . Petersen, linux-scsi

This status is returned from a driver to the block layer if a device-related
resource runs out, but the driver can guarantee that IO dispatch will be
triggered again in the future once the resource is available.

This patch converts some drivers to use this return value. Meanwhile, if a
driver returns BLK_STS_RESOURCE and BLK_MQ_S_SCHED_RESTART is set, the queue
is run again after 10ms to avoid an IO hang.
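
For illustration only, a driver's ->queue_rq() is expected to pick between
the two busy statuses roughly as below (the foo_* helpers are made up for
the example and are not part of this patch):

static blk_status_t foo_queue_rq(struct blk_mq_hw_ctx *hctx,
				 const struct blk_mq_queue_data *bd)
{
	struct foo_device *foo = hctx->queue->queuedata;

	/*
	 * Device resource exhausted: a completion from this device is
	 * guaranteed, and the driver reruns the queue from it, so no
	 * extra rerun from blk-mq is needed.
	 */
	if (foo_ring_full(foo))
		return BLK_STS_DEV_RESOURCE;

	/*
	 * Non-device resource exhausted: no completion of this queue is
	 * guaranteed to free it, so blk-mq reruns the queue after 10ms
	 * when SCHED_RESTART is set.
	 */
	if (!foo_get_host_buffer(foo))
		return BLK_STS_RESOURCE;

	foo_submit(foo, bd->rq);
	return BLK_STS_OK;
}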

Suggested-by: Jens Axboe <axboe@kernel.dk>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Laurence Oberman <loberman@redhat.com>
Cc: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: James E.J. Bottomley <jejb@linux.vnet.ibm.com>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: linux-scsi@vger.kernel.org
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-core.c             |  1 +
 block/blk-mq.c               | 19 +++++++++++++++----
 drivers/block/null_blk.c     |  2 +-
 drivers/block/virtio_blk.c   |  2 +-
 drivers/block/xen-blkfront.c |  2 +-
 drivers/md/dm-rq.c           |  2 +-
 drivers/scsi/scsi_lib.c      |  6 +++---
 include/linux/blk_types.h    |  7 +++++++
 8 files changed, 30 insertions(+), 11 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index a2005a485335..ac4789c5e329 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -145,6 +145,7 @@ static const struct {
 	[BLK_STS_MEDIUM]	= { -ENODATA,	"critical medium" },
 	[BLK_STS_PROTECTION]	= { -EILSEQ,	"protection" },
 	[BLK_STS_RESOURCE]	= { -ENOMEM,	"kernel resource" },
+	[BLK_STS_DEV_RESOURCE]	= { -ENOMEM,	"device resource" },
 	[BLK_STS_AGAIN]		= { -EAGAIN,	"nonblocking retry" },
 
 	/* device mapper special case, should not leak out: */
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 01f271d40825..e91d688792a9 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1169,6 +1169,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 	struct request *rq, *nxt;
 	bool no_tag = false;
 	int errors, queued;
+	blk_status_t ret = BLK_STS_OK;
 
 	if (list_empty(list))
 		return false;
@@ -1181,7 +1182,6 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 	errors = queued = 0;
 	do {
 		struct blk_mq_queue_data bd;
-		blk_status_t ret;
 
 		rq = list_first_entry(list, struct request, queuelist);
 		if (!blk_mq_get_driver_tag(rq, &hctx, false)) {
@@ -1226,7 +1226,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 		}
 
 		ret = q->mq_ops->queue_rq(hctx, &bd);
-		if (ret == BLK_STS_RESOURCE) {
+		if ((ret == BLK_STS_RESOURCE) || (ret == BLK_STS_DEV_RESOURCE)) {
 			/*
 			 * If an I/O scheduler has been configured and we got a
 			 * driver tag for the next request already, free it
@@ -1257,6 +1257,8 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 	 * that is where we will continue on next queue run.
 	 */
 	if (!list_empty(list)) {
+		bool needs_restart;
+
 		spin_lock(&hctx->lock);
 		list_splice_init(list, &hctx->dispatch);
 		spin_unlock(&hctx->lock);
@@ -1280,10 +1282,18 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 		 * - Some but not all block drivers stop a queue before
 		 *   returning BLK_STS_RESOURCE. Two exceptions are scsi-mq
 		 *   and dm-rq.
+		 *
+		 * If drivers return BLK_STS_RESOURCE and S_SCHED_RESTART
+		 * bit is set, run queue after 10ms for avoiding IO hang
+		 * because the queue may be idle and the RESTART mechanism
+		 * can't work any more.
 		 */
-		if (!blk_mq_sched_needs_restart(hctx) ||
+		needs_restart = blk_mq_sched_needs_restart(hctx);
+		if (!needs_restart ||
 		    (no_tag && list_empty_careful(&hctx->dispatch_wait.entry)))
 			blk_mq_run_hw_queue(hctx, true);
+		else if (needs_restart && (ret == BLK_STS_RESOURCE))
+			blk_mq_delay_run_hw_queue(hctx, 10);
 	}
 
 	return (queued + errors) != 0;
@@ -1764,6 +1774,7 @@ static blk_status_t __blk_mq_issue_directly(struct blk_mq_hw_ctx *hctx,
 		*cookie = new_cookie;
 		break;
 	case BLK_STS_RESOURCE:
+	case BLK_STS_DEV_RESOURCE:
 		__blk_mq_requeue_request(rq);
 		break;
 	default:
@@ -1826,7 +1837,7 @@ static void blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
 	hctx_lock(hctx, &srcu_idx);
 
 	ret = __blk_mq_try_issue_directly(hctx, rq, cookie, false);
-	if (ret == BLK_STS_RESOURCE)
+	if ((ret == BLK_STS_RESOURCE) || (ret == BLK_STS_DEV_RESOURCE))
 		blk_mq_sched_insert_request(rq, false, true, false);
 	else if (ret != BLK_STS_OK)
 		blk_mq_end_request(rq, ret);
diff --git a/drivers/block/null_blk.c b/drivers/block/null_blk.c
index 6655893a3a7a..287a09611c0f 100644
--- a/drivers/block/null_blk.c
+++ b/drivers/block/null_blk.c
@@ -1230,7 +1230,7 @@ static blk_status_t null_handle_cmd(struct nullb_cmd *cmd)
 				return BLK_STS_OK;
 			} else
 				/* requeue request */
-				return BLK_STS_RESOURCE;
+				return BLK_STS_DEV_RESOURCE;
 		}
 	}
 
diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 68846897d213..79908e6ddbf2 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -276,7 +276,7 @@ static blk_status_t virtio_queue_rq(struct blk_mq_hw_ctx *hctx,
 		/* Out of mem doesn't actually happen, since we fall back
 		 * to direct descriptors */
 		if (err == -ENOMEM || err == -ENOSPC)
-			return BLK_STS_RESOURCE;
+			return BLK_STS_DEV_RESOURCE;
 		return BLK_STS_IOERR;
 	}
 
diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 891265acb10e..e126e4cac2ca 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -911,7 +911,7 @@ static blk_status_t blkif_queue_rq(struct blk_mq_hw_ctx *hctx,
 out_busy:
 	blk_mq_stop_hw_queue(hctx);
 	spin_unlock_irqrestore(&rinfo->ring_lock, flags);
-	return BLK_STS_RESOURCE;
+	return BLK_STS_DEV_RESOURCE;
 }
 
 static void blkif_complete_rq(struct request *rq)
diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index d8519ddd7e1a..e0bb56723b9b 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -500,7 +500,7 @@ static int map_request(struct dm_rq_target_io *tio)
 		trace_block_rq_remap(clone->q, clone, disk_devt(dm_disk(md)),
 				     blk_rq_pos(rq));
 		ret = dm_dispatch_clone_request(clone, rq);
-		if (ret == BLK_STS_RESOURCE) {
+		if (ret == BLK_STS_RESOURCE || ret == BLK_STS_DEV_RESOURCE) {
 			blk_rq_unprep_clone(clone);
 			tio->ti->type->release_clone_rq(clone);
 			tio->clone = NULL;
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index d9ca1dfab154..55be2550c555 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -2030,9 +2030,9 @@ static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx,
 	case BLK_STS_OK:
 		break;
 	case BLK_STS_RESOURCE:
-		if (atomic_read(&sdev->device_busy) == 0 &&
-		    !scsi_device_blocked(sdev))
-			blk_mq_delay_run_hw_queue(hctx, SCSI_QUEUE_DELAY);
+		if (atomic_read(&sdev->device_busy) ||
+		    scsi_device_blocked(sdev))
+			ret = BLK_STS_DEV_RESOURCE;
 		break;
 	default:
 		/*
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index c5d3db0d83f8..d76f1744a7ca 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -39,6 +39,13 @@ typedef u8 __bitwise blk_status_t;
 
 #define BLK_STS_AGAIN		((__force blk_status_t)12)
 
+/*
+ * This status is returned from driver to block layer if device related
+ * resource is run out of, but driver can guarantee that IO dispatch is
+ * triggered in future when the resource is available.
+ */
+#define BLK_STS_DEV_RESOURCE	((__force blk_status_t)13)
+
 /**
  * blk_path_error - returns true if error may be path related
  * @error: status the request was completed with
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 2/5] dm-rq: handle dispatch exception in dm_dispatch_clone_request()
  2018-01-22  3:35 [PATCH 0/5] blk-mq & dm: fix IO hang and deal with one performance issue Ming Lei
@ 2018-01-22  3:35   ` Ming Lei
  2018-01-22  3:35   ` Ming Lei
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 31+ messages in thread
From: Ming Lei @ 2018-01-22  3:35 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig, Mike Snitzer
  Cc: dm-devel, Ming Lei, Laurence Oberman, Bart Van Assche

So we can clean up map_request() a bit; no functional change.

Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Laurence Oberman <loberman@redhat.com>
Cc: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/md/dm-rq.c | 23 ++++++++++++++++-------
 1 file changed, 16 insertions(+), 7 deletions(-)

diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index e0bb56723b9b..1408e6664c16 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -12,6 +12,9 @@
 
 #define DM_MSG_PREFIX "core-rq"
 
+#define  dispatch_busy(r)  \
+	(((r) == BLK_STS_RESOURCE) || ((r) == BLK_STS_DEV_RESOURCE))
+
 #define DM_MQ_NR_HW_QUEUES 1
 #define DM_MQ_QUEUE_DEPTH 2048
 static unsigned dm_mq_nr_hw_queues = DM_MQ_NR_HW_QUEUES;
@@ -399,7 +402,8 @@ static void end_clone_request(struct request *clone, blk_status_t error)
 	dm_complete_request(tio->orig, error);
 }
 
-static blk_status_t dm_dispatch_clone_request(struct request *clone, struct request *rq)
+static blk_status_t dm_dispatch_clone_request(struct dm_rq_target_io *tio,
+		struct request *clone, struct request *rq)
 {
 	blk_status_t r;
 
@@ -408,9 +412,17 @@ static blk_status_t dm_dispatch_clone_request(struct request *clone, struct requ
 
 	clone->start_time = jiffies;
 	r = blk_insert_cloned_request(clone->q, clone);
-	if (r != BLK_STS_OK && r != BLK_STS_RESOURCE)
+	if (dispatch_busy(r)) {
+		blk_rq_unprep_clone(clone);
+		tio->ti->type->release_clone_rq(clone);
+		tio->clone = NULL;
+	} else if (r != BLK_STS_OK) {
 		/* must complete clone in terms of original request */
 		dm_complete_request(rq, r);
+
+		/* we handled the failure, so return OK */
+		r = BLK_STS_OK;
+	}
 	return r;
 }
 
@@ -499,11 +511,8 @@ static int map_request(struct dm_rq_target_io *tio)
 		/* The target has remapped the I/O so dispatch it */
 		trace_block_rq_remap(clone->q, clone, disk_devt(dm_disk(md)),
 				     blk_rq_pos(rq));
-		ret = dm_dispatch_clone_request(clone, rq);
-		if (ret == BLK_STS_RESOURCE || ret == BLK_STS_DEV_RESOURCE) {
-			blk_rq_unprep_clone(clone);
-			tio->ti->type->release_clone_rq(clone);
-			tio->clone = NULL;
+		ret = dm_dispatch_clone_request(tio, clone, rq);
+		if (dispatch_busy(ret)) {
 			if (!rq->q->mq_ops)
 				r = DM_MAPIO_DELAY_REQUEUE;
 			else
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 3/5] dm-rq: return BLK_STS_* from map_request()
  2018-01-22  3:35 [PATCH 0/5] blk-mq & dm: fix IO hang and deal with one performance issue Ming Lei
@ 2018-01-22  3:35   ` Ming Lei
  2018-01-22  3:35   ` Ming Lei
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 31+ messages in thread
From: Ming Lei @ 2018-01-22  3:35 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig, Mike Snitzer
  Cc: dm-devel, Ming Lei, Laurence Oberman, Bart Van Assche

Except for DM_MAPIO_REQUEUE, map_request() already handles the other
dispatch exceptions, so return BLK_STS_* from map_request() directly.

Another change is that if dm_dispatch_clone_request() returns
BLK_STS_DEV_RESOURCE from the underlying queue, this status is returned
to blk-mq too, since the underlying queue's RESTART can handle dm-rq's
RESTART in this case.

Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Laurence Oberman <loberman@redhat.com>
Cc: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/md/dm-rq.c | 30 +++++++++++++++++-------------
 1 file changed, 17 insertions(+), 13 deletions(-)

diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index 1408e6664c16..830e1ccfbb44 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -482,9 +482,10 @@ static void init_tio(struct dm_rq_target_io *tio, struct request *rq,
 
 /*
  * Returns:
- * DM_MAPIO_*       : the request has been processed as indicated
- * DM_MAPIO_REQUEUE : the original request needs to be immediately requeued
- * < 0              : the request was completed due to failure
+ * BLK_STS_OK       : the request has been processed aready, no need to
+ * 		ask block layer to handle it any more
+ * BLK_STS_RESOURCE : the original request needs to be immediately requeued
+ * BLK_STS_DEV_RESOURCE : same with BLK_STS_RESOURCE, but blk-mq need this info
  */
 static int map_request(struct dm_rq_target_io *tio)
 {
@@ -493,7 +494,7 @@ static int map_request(struct dm_rq_target_io *tio)
 	struct mapped_device *md = tio->md;
 	struct request *rq = tio->orig;
 	struct request *clone = NULL;
-	blk_status_t ret;
+	blk_status_t ret, result = BLK_STS_OK;
 
 	r = ti->type->clone_and_map_rq(ti, rq, &tio->info, &clone);
 check_again:
@@ -513,15 +514,16 @@ static int map_request(struct dm_rq_target_io *tio)
 				     blk_rq_pos(rq));
 		ret = dm_dispatch_clone_request(tio, clone, rq);
 		if (dispatch_busy(ret)) {
-			if (!rq->q->mq_ops)
+			if (!rq->q->mq_ops) {
 				r = DM_MAPIO_DELAY_REQUEUE;
-			else
-				r = DM_MAPIO_REQUEUE;
-			goto check_again;
+				goto check_again;
+			}
+			result = ret;
 		}
 		break;
 	case DM_MAPIO_REQUEUE:
 		/* The target wants to requeue the I/O */
+		result = BLK_STS_RESOURCE;
 		break;
 	case DM_MAPIO_DELAY_REQUEUE:
 		/* The target wants to requeue the I/O after a delay */
@@ -536,7 +538,7 @@ static int map_request(struct dm_rq_target_io *tio)
 		BUG();
 	}
 
-	return r;
+	return result;
 }
 
 static void dm_start_request(struct mapped_device *md, struct request *orig)
@@ -599,7 +601,7 @@ static void map_tio_request(struct kthread_work *work)
 {
 	struct dm_rq_target_io *tio = container_of(work, struct dm_rq_target_io, work);
 
-	if (map_request(tio) == DM_MAPIO_REQUEUE)
+	if (dispatch_busy(map_request(tio)))
 		dm_requeue_original_request(tio, false);
 }
 
@@ -754,6 +756,7 @@ static blk_status_t dm_mq_queue_rq(struct blk_mq_hw_ctx *hctx,
 	struct dm_rq_target_io *tio = blk_mq_rq_to_pdu(rq);
 	struct mapped_device *md = tio->md;
 	struct dm_target *ti = md->immutable_target;
+	blk_status_t ret;
 
 	if (unlikely(!ti)) {
 		int srcu_idx;
@@ -777,14 +780,15 @@ static blk_status_t dm_mq_queue_rq(struct blk_mq_hw_ctx *hctx,
 	tio->ti = ti;
 
 	/* Direct call is fine since .queue_rq allows allocations */
-	if (map_request(tio) == DM_MAPIO_REQUEUE) {
+	ret = map_request(tio);
+	if (dispatch_busy(ret)) {
 		/* Undo dm_start_request() before requeuing */
 		rq_end_stats(md, rq);
 		rq_completed(md, rq_data_dir(rq), false);
-		return BLK_STS_RESOURCE;
+		return ret;
 	}
 
-	return BLK_STS_OK;
+	return ret;
 }
 
 static const struct blk_mq_ops dm_mq_ops = {
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 4/5] blk-mq: introduce blk_get_request_notify
  2018-01-22  3:35 [PATCH 0/5] blk-mq & dm: fix IO hang and deal with one performance issue Ming Lei
                   ` (2 preceding siblings ...)
  2018-01-22  3:35   ` Ming Lei
@ 2018-01-22  3:35 ` Ming Lei
  2018-01-22 10:19   ` Ming Lei
  2018-01-22 17:13     ` Bart Van Assche
  2018-01-22  3:35 ` [PATCH 5/5] dm-mpath: use blk_mq_alloc_request_notify for allocating blk-mq req Ming Lei
  4 siblings, 2 replies; 31+ messages in thread
From: Ming Lei @ 2018-01-22  3:35 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig, Mike Snitzer
  Cc: dm-devel, Ming Lei, Laurence Oberman, Bart Van Assche

DM-MPATH needs to allocate requests from the underlying queue, but when
the allocation fails, there is no way for the underlying queue's RESTART
mechanism to restart DM's queue.

This patch introduces blk_get_request_notify() for this purpose. The
caller needs to pass a properly initialized 'wait_queue_entry_t' to this
function, so that after the current allocation fails, DM gets notified
when a request becomes available from the underlying queue.

This approach was suggested by Jens and has been used in the blk-mq
dispatch path for a while, see blk_mq_mark_tag_wait().
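
For reference, the expected caller-side usage (simplified from the dm-mpath
change in patch 5; the my_* names are only for illustration) looks roughly
like this:

struct my_notifier {
	struct request_queue *q;	/* upper (dm) queue to rerun */
	wait_queue_entry_t wait;
};

static int my_notify(wait_queue_entry_t *wait, unsigned mode, int flags,
		     void *key)
{
	struct my_notifier *n = container_of(wait, struct my_notifier, wait);

	list_del_init(&wait->entry);
	blk_mq_run_hw_queues(n->q, true);
	return 1;
}

static void my_notifier_init(struct my_notifier *n, struct request_queue *q)
{
	n->q = q;
	init_waitqueue_func_entry(&n->wait, my_notify);
	INIT_LIST_HEAD(&n->wait.entry);
}

static struct request *my_alloc_clone(struct request_queue *q, unsigned int op,
				      struct my_notifier *n)
{
	/* on failure, my_notify() reruns n->q once a tag becomes free */
	return blk_mq_alloc_request_notify(q, op,
			BLK_MQ_REQ_NOWAIT | BLK_MQ_REQ_ALLOC_NOTIFY,
			&n->wait);
}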

Suggested-by: Jens Axboe <axboe@kernel.dk>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Laurence Oberman <loberman@redhat.com>
Cc: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq-tag.c     | 27 ++++++++++++++++++++++++++-
 block/blk-mq.c         | 24 +++++++++++++++++++++---
 block/blk-mq.h         |  1 +
 include/linux/blk-mq.h |  5 +++++
 4 files changed, 53 insertions(+), 4 deletions(-)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 336dde07b230..911fc9bd1bab 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -128,10 +128,35 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
 	if (tag != -1)
 		goto found_tag;
 
-	if (data->flags & BLK_MQ_REQ_NOWAIT)
+	if ((data->flags & BLK_MQ_REQ_NOWAIT) && !(data->notifier &&
+			(data->flags & BLK_MQ_REQ_ALLOC_NOTIFY)))
 		return BLK_MQ_TAG_FAIL;
 
 	ws = bt_wait_ptr(bt, data->hctx);
+
+	/*
+	 * If caller requires notification when tag is available, add
+	 * wait entry of 'data->notifier' to the wait queue.
+	 */
+	if (data->flags & BLK_MQ_REQ_NOWAIT) {
+		bool added = false;
+
+		spin_lock_irq(&ws->wait.lock);
+		if (list_empty(&data->notifier->entry))
+			__add_wait_queue(&ws->wait, data->notifier);
+		else
+			added = true;
+		spin_unlock_irq(&ws->wait.lock);
+
+		if (added)
+			return BLK_MQ_TAG_FAIL;
+
+		tag = __blk_mq_get_tag(data, bt);
+		if (tag != -1)
+			goto found_tag;
+		return BLK_MQ_TAG_FAIL;
+	}
+
 	drop_ctx = data->ctx == NULL;
 	do {
 		/*
diff --git a/block/blk-mq.c b/block/blk-mq.c
index e91d688792a9..a71e44f17e7f 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -384,10 +384,14 @@ static struct request *blk_mq_get_request(struct request_queue *q,
 	return rq;
 }
 
-struct request *blk_mq_alloc_request(struct request_queue *q, unsigned int op,
-		blk_mq_req_flags_t flags)
+static struct request *__blk_mq_alloc_request(struct request_queue *q,
+		unsigned int op, blk_mq_req_flags_t flags,
+		wait_queue_entry_t *notifier)
 {
-	struct blk_mq_alloc_data alloc_data = { .flags = flags };
+	struct blk_mq_alloc_data alloc_data = {
+		.flags = flags,
+		.notifier = notifier,
+	};
 	struct request *rq;
 	int ret;
 
@@ -408,8 +412,22 @@ struct request *blk_mq_alloc_request(struct request_queue *q, unsigned int op,
 	rq->bio = rq->biotail = NULL;
 	return rq;
 }
+
+struct request *blk_mq_alloc_request(struct request_queue *q, unsigned int op,
+		blk_mq_req_flags_t flags)
+{
+	return __blk_mq_alloc_request(q, op, flags, NULL);
+}
 EXPORT_SYMBOL(blk_mq_alloc_request);
 
+struct request *blk_mq_alloc_request_notify(struct request_queue *q,
+		unsigned int op, blk_mq_req_flags_t flags,
+		wait_queue_entry_t *notifier)
+{
+	return __blk_mq_alloc_request(q, op, flags, notifier);
+}
+EXPORT_SYMBOL(blk_mq_alloc_request_notify);
+
 struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
 	unsigned int op, blk_mq_req_flags_t flags, unsigned int hctx_idx)
 {
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 88c558f71819..bec2f675f8f1 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -160,6 +160,7 @@ struct blk_mq_alloc_data {
 	struct request_queue *q;
 	blk_mq_req_flags_t flags;
 	unsigned int shallow_depth;
+	wait_queue_entry_t *notifier;
 
 	/* input & output parameter */
 	struct blk_mq_ctx *ctx;
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 8efcf49796a3..335c7dace5a7 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -219,10 +219,15 @@ enum {
 	BLK_MQ_REQ_INTERNAL	= (__force blk_mq_req_flags_t)(1 << 2),
 	/* set RQF_PREEMPT */
 	BLK_MQ_REQ_PREEMPT	= (__force blk_mq_req_flags_t)(1 << 3),
+	/* notify when new rq is available */
+	BLK_MQ_REQ_ALLOC_NOTIFY	= (__force blk_mq_req_flags_t)(1 << 4),
 };
 
 struct request *blk_mq_alloc_request(struct request_queue *q, unsigned int op,
 		blk_mq_req_flags_t flags);
+struct request *blk_mq_alloc_request_notify(struct request_queue *q,
+		unsigned int op, blk_mq_req_flags_t flags,
+		wait_queue_entry_t *notifier);
 struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
 		unsigned int op, blk_mq_req_flags_t flags,
 		unsigned int hctx_idx);
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 5/5] dm-mpath: use blk_mq_alloc_request_notify for allocating blk-mq req
  2018-01-22  3:35 [PATCH 0/5] blk-mq & dm: fix IO hang and deal with one performance issue Ming Lei
                   ` (3 preceding siblings ...)
  2018-01-22  3:35 ` [PATCH 4/5] blk-mq: introduce blk_get_request_notify Ming Lei
@ 2018-01-22  3:35 ` Ming Lei
  4 siblings, 0 replies; 31+ messages in thread
From: Ming Lei @ 2018-01-22  3:35 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig, Mike Snitzer
  Cc: dm-devel, Ming Lei, Laurence Oberman, Bart Van Assche

When blk_get_request() fails to allocate a request, there is no way for
the underlying queue's RESTART mechanism to restart DM's queue. This
patch applies the newly introduced blk_mq_alloc_request_notify() to deal
with this issue.

The following issues can be addressed:

1) When blk_get_request() fails, dm-rq finally returns BLK_STS_RESOURCE
to blk-mq, and blk-mq then runs the queue after 10ms; this delay may
degrade dm-rq performance a lot when any underlying device attached to
the same host is accessed directly.

2) There is no way to figure out a perfect delay for running the queue
in situation 1), and if the delay is too small, requests can be dequeued
from the DM queue too quickly, which hurts sequential IO performance.

This approach was suggested by Jens and has been used in the blk-mq
dispatch path for a while, see blk_mq_mark_tag_wait().

Suggested-by: Jens Axboe <axboe@kernel.dk>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Laurence Oberman <loberman@redhat.com>
Cc: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/md/dm-mpath.c | 50 +++++++++++++++++++++++++++++++++++++++++++++++++-
 drivers/md/dm.c       |  1 +
 2 files changed, 50 insertions(+), 1 deletion(-)

diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
index 7d3e572072f5..f7753a2e9229 100644
--- a/drivers/md/dm-mpath.c
+++ b/drivers/md/dm-mpath.c
@@ -11,6 +11,7 @@
 #include "dm-bio-record.h"
 #include "dm-path-selector.h"
 #include "dm-uevent.h"
+#include "dm.h"
 
 #include <linux/blkdev.h>
 #include <linux/ctype.h>
@@ -30,6 +31,15 @@
 #define DM_PG_INIT_DELAY_MSECS 2000
 #define DM_PG_INIT_DELAY_DEFAULT ((unsigned) -1)
 
+/*
+ * When running out of request from underlying request, use this
+ * structure to get notified from blk-mq
+ */
+struct underlying_notifier {
+	struct request_queue *q;
+	wait_queue_entry_t wait;
+};
+
 /* Path properties */
 struct pgpath {
 	struct list_head list;
@@ -40,6 +50,7 @@ struct pgpath {
 	struct dm_path path;
 	struct delayed_work activate_path;
 
+	struct underlying_notifier notifier;
 	bool is_active:1;		/* Path status */
 };
 
@@ -125,6 +136,17 @@ static void process_queued_bios(struct work_struct *work);
  * Allocation routines
  *-----------------------------------------------*/
 
+static int req_notify(wait_queue_entry_t *wait, unsigned mode, int flags,
+		void *key)
+{
+	struct underlying_notifier *notifier;
+
+	notifier = container_of(wait, struct underlying_notifier, wait);
+	list_del_init(&notifier->wait.entry);
+	blk_mq_run_hw_queues(notifier->q, true);
+	return 1;
+}
+
 static struct pgpath *alloc_pgpath(void)
 {
 	struct pgpath *pgpath = kzalloc(sizeof(*pgpath), GFP_KERNEL);
@@ -490,6 +512,27 @@ static bool must_push_back_bio(struct multipath *m)
 	return __must_push_back(m, flags);
 }
 
+static struct request *
+multipath_get_req(struct pgpath *pgpath, struct request_queue *q,
+		  struct request *rq)
+{
+	if (!q->mq_ops)
+		return blk_get_request(q, rq->cmd_flags | REQ_NOMERGE,
+				GFP_ATOMIC);
+
+	/*
+	 * Even BLK_MQ_REQ_ALLOC_NOTIFY is passed, we can't return
+	 * BLK_STS_DEV_RESOUCE to blk-mq if this allocation fails because
+	 * the notification can come before adding dm's request to
+	 * dispatch list, so still need to return BLK_STS_RESOUCE to
+	 * blk-mq, and the notification will run queue and cancel the
+	 * delay caused by BLK_STS_RESOUCE immediately.
+	 */
+	return blk_mq_alloc_request_notify(q, rq->cmd_flags | REQ_NOMERGE,
+                                BLK_MQ_REQ_NOWAIT | BLK_MQ_REQ_ALLOC_NOTIFY,
+				&pgpath->notifier.wait);
+}
+
 /*
  * Map cloned requests (request-based multipath)
  */
@@ -526,7 +569,7 @@ static int multipath_clone_and_map(struct dm_target *ti, struct request *rq,
 
 	bdev = pgpath->path.dev->bdev;
 	q = bdev_get_queue(bdev);
-	clone = blk_get_request(q, rq->cmd_flags | REQ_NOMERGE, GFP_ATOMIC);
+	clone = multipath_get_req(pgpath, q, rq);
 	if (IS_ERR(clone)) {
 		/* EBUSY, ENODEV or EWOULDBLOCK: requeue */
 		if (blk_queue_dying(q)) {
@@ -873,6 +916,7 @@ static struct pgpath *parse_path(struct dm_arg_set *as, struct path_selector *ps
 	int r;
 	struct pgpath *p;
 	struct multipath *m = ti->private;
+	struct mapped_device *md = dm_table_get_md((m)->ti->table);
 
 	/* we need at least a path arg */
 	if (as->argc < 1) {
@@ -906,6 +950,10 @@ static struct pgpath *parse_path(struct dm_arg_set *as, struct path_selector *ps
 		goto bad;
 	}
 
+	p->notifier.q = dm_get_md_queue(md);
+	init_waitqueue_func_entry(&p->notifier.wait, req_notify);
+	INIT_LIST_HEAD(&p->notifier.wait.entry);
+
 	return p;
  bad:
 	free_pgpath(p);
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index d6de00f367ef..ff06efcaf36e 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -445,6 +445,7 @@ struct request_queue *dm_get_md_queue(struct mapped_device *md)
 {
 	return md->queue;
 }
+EXPORT_SYMBOL_GPL(dm_get_md_queue);
 
 struct dm_stats *dm_get_stats(struct mapped_device *md)
 {
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/5] dm-rq: return BLK_STS_* from map_request()
  2018-01-22  3:35   ` Ming Lei
  (?)
@ 2018-01-22  5:35   ` Ming Lei
  -1 siblings, 0 replies; 31+ messages in thread
From: Ming Lei @ 2018-01-22  5:35 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig, Mike Snitzer
  Cc: dm-devel, Laurence Oberman, Bart Van Assche

On Mon, Jan 22, 2018 at 11:35:48AM +0800, Ming Lei wrote:
> Except for DM_MAPIO_REQUEUE, map_request() already handles the other
> dispatch exceptions, so return BLK_STS_* from map_request() directly.
> 
> Another change is that if dm_dispatch_clone_request() returns
> BLK_STS_DEV_RESOURCE from the underlying queue, this status is returned
> to blk-mq too, since the underlying queue's RESTART can handle dm-rq's
> RESTART in this case.

Hmm, this is obviously wrong, :-(

-- 
Ming

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 4/5] blk-mq: introduce blk_get_request_notify
  2018-01-22  3:35 ` [PATCH 4/5] blk-mq: introduce blk_get_request_notify Ming Lei
@ 2018-01-22 10:19   ` Ming Lei
  2018-01-22 17:13     ` Bart Van Assche
  1 sibling, 0 replies; 31+ messages in thread
From: Ming Lei @ 2018-01-22 10:19 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig, Mike Snitzer
  Cc: dm-devel, Laurence Oberman, Bart Van Assche

On Mon, Jan 22, 2018 at 11:35:49AM +0800, Ming Lei wrote:
> DM-MPATH needs to allocate requests from the underlying queue, but when
> the allocation fails, there is no way for the underlying queue's RESTART
> mechanism to restart DM's queue.
> 
> This patch introduces blk_get_request_notify() for this purpose. The
> caller needs to pass a properly initialized 'wait_queue_entry_t' to this
> function, so that after the current allocation fails, DM gets notified
> when a request becomes available from the underlying queue.
> 
> This approach was suggested by Jens and has been used in the blk-mq
> dispatch path for a while, see blk_mq_mark_tag_wait().
> 
> Suggested-by: Jens Axboe <axboe@kernel.dk>
> Cc: Mike Snitzer <snitzer@redhat.com>
> Cc: Laurence Oberman <loberman@redhat.com>
> Cc: Bart Van Assche <bart.vanassche@sandisk.com>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  block/blk-mq-tag.c     | 27 ++++++++++++++++++++++++++-
>  block/blk-mq.c         | 24 +++++++++++++++++++++---
>  block/blk-mq.h         |  1 +
>  include/linux/blk-mq.h |  5 +++++
>  4 files changed, 53 insertions(+), 4 deletions(-)
> 
> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> index 336dde07b230..911fc9bd1bab 100644
> --- a/block/blk-mq-tag.c
> +++ b/block/blk-mq-tag.c
> @@ -128,10 +128,35 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
>  	if (tag != -1)
>  		goto found_tag;
>  
> -	if (data->flags & BLK_MQ_REQ_NOWAIT)
> +	if ((data->flags & BLK_MQ_REQ_NOWAIT) && !(data->notifier &&
> +			(data->flags & BLK_MQ_REQ_ALLOC_NOTIFY)))
>  		return BLK_MQ_TAG_FAIL;
>  
>  	ws = bt_wait_ptr(bt, data->hctx);
> +
> +	/*
> +	 * If caller requires notification when tag is available, add
> +	 * wait entry of 'data->notifier' to the wait queue.
> +	 */
> +	if (data->flags & BLK_MQ_REQ_NOWAIT) {
> +		bool added = false;
> +
> +		spin_lock_irq(&ws->wait.lock);
> +		if (list_empty(&data->notifier->entry))
> +			__add_wait_queue(&ws->wait, data->notifier);
> +		else
> +			added = true;
> +		spin_unlock_irq(&ws->wait.lock);

The above needs a per-notifier lock too, since the same notifier may
be added to different wait queues at the same time.
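
Something along these lines (an untested sketch; 'notifier_lock' is a
hypothetical per-notifier lock passed in alongside the wait entry) could
serialize the adders:

	spin_lock_irq(data->notifier_lock);
	spin_lock(&ws->wait.lock);
	if (list_empty(&data->notifier->entry))
		__add_wait_queue(&ws->wait, data->notifier);
	else
		added = true;
	spin_unlock(&ws->wait.lock);
	spin_unlock_irq(data->notifier_lock);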

-- 
Ming

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/5] blk-mq: introduce BLK_STS_DEV_RESOURCE
  2018-01-22  3:35 ` [PATCH 1/5] blk-mq: introduce BLK_STS_DEV_RESOURCE Ming Lei
@ 2018-01-22 16:32   ` Christoph Hellwig
  2018-01-22 16:49     ` Bart Van Assche
  1 sibling, 0 replies; 31+ messages in thread
From: Christoph Hellwig @ 2018-01-22 16:32 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Mike Snitzer,
	dm-devel, Laurence Oberman, Bart Van Assche,
	James E . J . Bottomley, Martin K . Petersen, linux-scsi

> -		if (ret == BLK_STS_RESOURCE) {
> +		if ((ret == BLK_STS_RESOURCE) || (ret == BLK_STS_DEV_RESOURCE)) {

No need for the inner braces here.

> +	if ((ret == BLK_STS_RESOURCE) || (ret == BLK_STS_DEV_RESOURCE))

Same here.

> +/*
> + * This status is returned from driver to block layer if device related
> + * resource is run out of, but driver can guarantee that IO dispatch is
> + * triggered in future when the resource is available.
> + */
> +#define BLK_STS_DEV_RESOURCE	((__force blk_status_t)13)

We'll need some good explanation on BLK_STS_RESOURCE vs
BLK_STS_DEV_RESOURCE I think.

Except for these nitpicks this looks fine to me.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/5] blk-mq: introduce BLK_STS_DEV_RESOURCE
@ 2018-01-22 16:49     ` Bart Van Assche
  0 siblings, 0 replies; 31+ messages in thread
From: Bart Van Assche @ 2018-01-22 16:49 UTC (permalink / raw)
  To: hch, linux-block, snitzer, ming.lei, axboe
  Cc: dm-devel, jejb, linux-scsi, martin.petersen, loberman

On Mon, 2018-01-22 at 11:35 +0800, Ming Lei wrote:
> @@ -1280,10 +1282,18 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
>  		 * - Some but not all block drivers stop a queue before
>  		 *   returning BLK_STS_RESOURCE. Two exceptions are scsi-mq
>  		 *   and dm-rq.
> +		 *
> +		 * If drivers return BLK_STS_RESOURCE and S_SCHED_RESTART
> +		 * bit is set, run queue after 10ms for avoiding IO hang
> +		 * because the queue may be idle and the RESTART mechanism
> +		 * can't work any more.
>  		 */
> -		if (!blk_mq_sched_needs_restart(hctx) ||
> +		needs_restart = blk_mq_sched_needs_restart(hctx);
> +		if (!needs_restart ||
>  		    (no_tag && list_empty_careful(&hctx->dispatch_wait.entry)))
>  			blk_mq_run_hw_queue(hctx, true);
> +		else if (needs_restart && (ret == BLK_STS_RESOURCE))
> +			blk_mq_delay_run_hw_queue(hctx, 10);
>  	}

In my opinion there are two problems with the above changes:
* Only the block driver author can know what a good choice is for the time
  after which to rerun the queue. So I think moving the rerun delay (10 ms)
  constant from block drivers into the core is a step backwards instead of a
  step forwards.
* The purpose of the BLK_MQ_S_SCHED_RESTART flag is to detect whether or not
  any of the queue runs triggered by freeing a tag happened concurrently. I
  don't think that there is any relationship between queue runs happening all
  or not concurrently and the chance that driver resources become available.
  So deciding whether or not a queue should be rerun based on the value of
  the BLK_MQ_S_SCHED_RESTART flag seems wrong to me.

> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> index d9ca1dfab154..55be2550c555 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -2030,9 +2030,9 @@ static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx,
>  	case BLK_STS_OK:
>  		break;
>  	case BLK_STS_RESOURCE:
> -		if (atomic_read(&sdev->device_busy) == 0 &&
> -		    !scsi_device_blocked(sdev))
> -			blk_mq_delay_run_hw_queue(hctx, SCSI_QUEUE_DELAY);
> +		if (atomic_read(&sdev->device_busy) ||
> +		    scsi_device_blocked(sdev))
> +			ret = BLK_STS_DEV_RESOURCE;
>  		break;
>  	default:
>  		/*

The above introduces two changes that have not been mentioned in the
description of this patch:
- The queue rerunning delay is changed from 3 ms into 10 ms. Where is the
  explanation of this change? Does this change have a positive or negative
  performance impact?
- The above modifies a guaranteed queue rerun into a queue rerun that
  may or may not happen, depending on whether or not multiple tags get freed
  concurrently (return BLK_STS_DEV_RESOURCE). Sorry but I think that's wrong.

Bart.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 4/5] blk-mq: introduce blk_get_request_notify
@ 2018-01-22 17:13     ` Bart Van Assche
  0 siblings, 0 replies; 31+ messages in thread
From: Bart Van Assche @ 2018-01-22 17:13 UTC (permalink / raw)
  To: hch, linux-block, snitzer, ming.lei, axboe; +Cc: dm-devel, loberman

On Mon, 2018-01-22 at 11:35 +0800, Ming Lei wrote:
> DM-MPATH needs to allocate requests from the underlying queue, but when
> the allocation fails, there is no way for the underlying queue's RESTART
> mechanism to restart DM's queue.
> 
> This patch introduces blk_get_request_notify() for this purpose. The
> caller needs to pass a properly initialized 'wait_queue_entry_t' to this
> function, so that after the current allocation fails, DM gets notified
> when a request becomes available from the underlying queue.

Please mention that this is only a partial solution because the case when
e.g. blk_insert_cloned_request() returns BLK_STS_RESOURCE is not handled.
This could help for drivers that support a very low queue depth (lpfc) but
probably won't be that useful for other drivers.

> +	/*
> +	 * If caller requires notification when tag is available, add
> +	 * wait entry of 'data->notifier' to the wait queue.
> +	 */
> +	if (data->flags & BLK_MQ_REQ_NOWAIT) {
> +		bool added = false;
> +
> +		spin_lock_irq(&ws->wait.lock);
> +		if (list_empty(&data->notifier->entry))
> +			__add_wait_queue(&ws->wait, data->notifier);
> +		else
> +			added = true;
> +		spin_unlock_irq(&ws->wait.lock);
> +
> +		if (added)
> +			return BLK_MQ_TAG_FAIL;
> +
> +		tag = __blk_mq_get_tag(data, bt);
> +		if (tag != -1)
> +			goto found_tag;
> +		return BLK_MQ_TAG_FAIL;
> +	}

Sorry but I don't like this approach. Adding "data->notifier" to the wait
queue creates a link between two request queues, e.g. a dm-mpath queue and
one of the paths that is a member of that dm-mpath queue. This creates the
potential for ugly races between e.g. "data->notifier" being triggered and
removal of the dm-mpath queue.

> diff --git a/block/blk-mq.h b/block/blk-mq.h
> index 88c558f71819..bec2f675f8f1 100644
> --- a/block/blk-mq.h
> +++ b/block/blk-mq.h
> @@ -160,6 +160,7 @@ struct blk_mq_alloc_data {
>  	struct request_queue *q;
>  	blk_mq_req_flags_t flags;
>  	unsigned int shallow_depth;
> +	wait_queue_entry_t *notifier;

If others would agree with the approach of this patch please use another name
than "notifier". In the context of the Linux kernel a notifier is an instance
of struct notifier_block. The above "notifier" member is not a notifier but a
wait queue entry.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/5] blk-mq: introduce BLK_STS_DEV_RESOURCE
  2018-01-22 16:49     ` Bart Van Assche
  (?)
@ 2018-01-23  0:57     ` Ming Lei
  2018-01-23 16:17       ` Bart Van Assche
  -1 siblings, 1 reply; 31+ messages in thread
From: Ming Lei @ 2018-01-23  0:57 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: hch, linux-block, snitzer, axboe, dm-devel, jejb, linux-scsi,
	martin.petersen, loberman

On Mon, Jan 22, 2018 at 04:49:54PM +0000, Bart Van Assche wrote:
> On Mon, 2018-01-22 at 11:35 +0800, Ming Lei wrote:
> > @@ -1280,10 +1282,18 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
> >  		 * - Some but not all block drivers stop a queue before
> >  		 *   returning BLK_STS_RESOURCE. Two exceptions are scsi-mq
> >  		 *   and dm-rq.
> > +		 *
> > +		 * If drivers return BLK_STS_RESOURCE and S_SCHED_RESTART
> > +		 * bit is set, run queue after 10ms for avoiding IO hang
> > +		 * because the queue may be idle and the RESTART mechanism
> > +		 * can't work any more.
> >  		 */
> > -		if (!blk_mq_sched_needs_restart(hctx) ||
> > +		needs_restart = blk_mq_sched_needs_restart(hctx);
> > +		if (!needs_restart ||
> >  		    (no_tag && list_empty_careful(&hctx->dispatch_wait.entry)))
> >  			blk_mq_run_hw_queue(hctx, true);
> > +		else if (needs_restart && (ret == BLK_STS_RESOURCE))
> > +			blk_mq_delay_run_hw_queue(hctx, 10);
> >  	}
> 
> In my opinion there are two problems with the above changes:
> * Only the block driver author can know what a good choice is for the time

The reality is that no one knew that before; if you look at the facts,
most BLK_STS_RESOURCE returns need that, but no one dealt with it at all.
Both Jens and I concluded that it is a generic issue, which needs a
generic solution.

On the contrary, it is much easier to evaluate whether one STS_RESOURCE
can be converted to STS_DEV_RESOURCE; as you saw, there aren't many,
because now we call .queue_rq() after getting the budget and driver tag.

>   after which to rerun the queue. So I think moving the rerun delay (10 ms)
>   constant from block drivers into the core is a step backwards instead of a
>   step forwards.

As I mentioned before, if a resource runs out inside .queue_rq(), the
request is still out of blk-mq's control, and scheduling a queue run from
inside .queue_rq() isn't correct, because when __blk_mq_run_hw_queue()
runs, the request may not be visible for dispatch yet.

Even though the RCU lock is held during dispatch, preemption or an
interrupt can happen too, so it is simply wrong to depend on timing to
make sure __blk_mq_run_hw_queue() can see the request in this situation.

Alternatively, the driver could reinsert the request into the scheduler
queue, but doing that in the driver involves a much bigger change and
introduces unnecessary cost as well.

> * The purpose of the BLK_MQ_S_SCHED_RESTART flag is to detect whether or not
>   any of the queue runs triggered by freeing a tag happened concurrently. I
>   don't think that there is any relationship between queue runs happening all
>   or not concurrently and the chance that driver resources become available.
>   So deciding whether or not a queue should be rerun based on the value of
>   the BLK_MQ_S_SCHED_RESTART flag seems wrong to me.

The fact is simple: once BLK_MQ_S_SCHED_RESTART is set,
blk_mq_dispatch_rq_list() won't call blk_mq_run_hw_queue() any more, and
depends on request completion to rerun the queue. If the driver can't make
sure the queue will be restarted in the future when it returns
STS_RESOURCE, we have to call blk_mq_delay_run_hw_queue(hctx, 10) to
avoid an IO hang.

> 
> > diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> > index d9ca1dfab154..55be2550c555 100644
> > --- a/drivers/scsi/scsi_lib.c
> > +++ b/drivers/scsi/scsi_lib.c
> > @@ -2030,9 +2030,9 @@ static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx,
> >  	case BLK_STS_OK:
> >  		break;
> >  	case BLK_STS_RESOURCE:
> > -		if (atomic_read(&sdev->device_busy) == 0 &&
> > -		    !scsi_device_blocked(sdev))
> > -			blk_mq_delay_run_hw_queue(hctx, SCSI_QUEUE_DELAY);
> > +		if (atomic_read(&sdev->device_busy) ||
> > +		    scsi_device_blocked(sdev))
> > +			ret = BLK_STS_DEV_RESOURCE;
> >  		break;
> >  	default:
> >  		/*
> 
> The above introduces two changes that have not been mentioned in the
> description of this patch:
> - The queue rerunning delay is changed from 3 ms into 10 ms. Where is the
>   explanation of this change? Does this change have a positive or negative
>   performance impact?

How can that be an issue for SCSI? The rerun delay is only triggered
when there aren't any in-flight requests in the SCSI queue.

> - The above modifies a guaranteed queue rerun into a queue rerun that
>   may or may not happen, depending on whether or not multiple tags get freed
>   concurrently (return BLK_STS_DEV_RESOURCE). Sorry but I think that's wrong.

This patch only moves the rerun delay from SCSI's .queue_rq() into
blk-mq; I don't understand what you mean above, please explain a bit.

I know there is a race here in SCSI, but it has nothing to do with this patch.


-- 
Ming

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 4/5] blk-mq: introduce blk_get_request_notify
  2018-01-22 17:13     ` Bart Van Assche
  (?)
@ 2018-01-23  1:29     ` Ming Lei
  -1 siblings, 0 replies; 31+ messages in thread
From: Ming Lei @ 2018-01-23  1:29 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: hch, linux-block, snitzer, axboe, dm-devel, loberman

On Mon, Jan 22, 2018 at 05:13:03PM +0000, Bart Van Assche wrote:
> On Mon, 2018-01-22 at 11:35 +0800, Ming Lei wrote:
> > DM-MPATH need to allocate request from underlying queue, but when the
> > allocation fails, there is no way to make underlying queue's RESTART
> > to restart DM's queue.
> > 
> > This patch introduces blk_get_request_notify() for this purpose, and
> > caller need to pass 'wait_queue_entry_t' to this function, and make
> > sure it is initialized well, so after the current allocation fails,
> > DM will get notified when there is request available from underlying
> > queue.
> 
> Please mention that this is only a partial solution because the case when
> e.g. blk_insert_cloned_request() returns BLK_STS_RESOURCE is not handled.

I don't think it is necessary to mention that in the commit log, because
one patch can't deal with everything.

But there will definitely be a patch that deals with that in V2;
otherwise the performance issue can't be solved completely.

> This could help for drivers that support a very low queue depth (lpfc) but
> probably won't be that useful for other drivers.

Up to now, this is a dm-mpath specific performance issue in one
specific situation: IO is submitted to the dm-mpath device and to the
underlying queue at the same time, and the underlying queue depth is
very low, such as 1.

> 
> > +	/*
> > +	 * If caller requires notification when tag is available, add
> > +	 * wait entry of 'data->notifier' to the wait queue.
> > +	 */
> > +	if (data->flags & BLK_MQ_REQ_NOWAIT) {
> > +		bool added = false;
> > +
> > +		spin_lock_irq(&ws->wait.lock);
> > +		if (list_empty(&data->notifier->entry))
> > +			__add_wait_queue(&ws->wait, data->notifier);
> > +		else
> > +			added = true;
> > +		spin_unlock_irq(&ws->wait.lock);
> > +
> > +		if (added)
> > +			return BLK_MQ_TAG_FAIL;
> > +
> > +		tag = __blk_mq_get_tag(data, bt);
> > +		if (tag != -1)
> > +			goto found_tag;
> > +		return BLK_MQ_TAG_FAIL;
> > +	}
> 
> Sorry but I don't like this approach. Adding "data->notifier" to the wait
> queue creates a link between two request queues, e.g. a dm-mpath queue and
> one of the paths that is a member of that dm-mpath queue. This creates the
> potential for ugly races between e.g. "data->notifier" being triggered and
> removal of the dm-mpath queue.

I thought there was no such race because the dm request is still on the
dispatch list before the 'notifier' is handled. Now it looks like this isn't
true, but the issue can be dealt with easily by calling blk_queue_enter_live()
before the allocation and releasing the references once the notifier is
triggered.
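
A rough sketch of that approach (the blk_get_request_notify() signature and
the mpio->wait field are assumptions here, and blk_queue_enter_live() is
currently private to block/, so it would need exporting or an equivalent
helper):

	/* inside multipath_clone_and_map(), after choosing the path */
	blk_queue_enter_live(q);	/* pin the underlying queue */
	clone = blk_get_request_notify(q, rq->cmd_flags | REQ_NOWAIT,
				       BLK_MQ_REQ_NOWAIT, &mpio->wait);
	if (IS_ERR(clone)) {
		/*
		 * The wait entry stays on the tag wait queue; hold the queue
		 * reference until the wait entry fires, then drop it with
		 * blk_queue_exit(q) before rerunning the dm queue.
		 */
		return DM_MAPIO_REQUEUE;
	}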

Anyway, this interface needs to be well documented.

> 
> > diff --git a/block/blk-mq.h b/block/blk-mq.h
> > index 88c558f71819..bec2f675f8f1 100644
> > --- a/block/blk-mq.h
> > +++ b/block/blk-mq.h
> > @@ -160,6 +160,7 @@ struct blk_mq_alloc_data {
> >  	struct request_queue *q;
> >  	blk_mq_req_flags_t flags;
> >  	unsigned int shallow_depth;
> > +	wait_queue_entry_t *notifier;
> 
> If others would agree with the approach of this patch please use another name
> than "notifier". In the context of the Linux kernel a notifier is an instance
> of struct notifier_block. The above "notifier" member is not a notifier but a
> wait queue entry.

OK, I will change it to 'wait', similar to 'dispatch_wait'.

-- 
Ming

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/5] blk-mq: introduce BLK_STS_DEV_RESOURCE
  2018-01-23  0:57     ` Ming Lei
@ 2018-01-23 16:17       ` Bart Van Assche
  2018-01-23 16:26         ` Ming Lei
  0 siblings, 1 reply; 31+ messages in thread
From: Bart Van Assche @ 2018-01-23 16:17 UTC (permalink / raw)
  To: Ming Lei
  Cc: hch, linux-block, snitzer, axboe, dm-devel, jejb, linux-scsi,
	martin.petersen, loberman



On 01/22/18 16:57, Ming Lei wrote:
> Even though RCU lock is held during dispatch, preemption or interrupt
> can happen too, so it is simply wrong to depend on the timing to make
> sure __blk_mq_run_hw_queue() can see the request in this situation.

It is very unlikely that this race will ever be hit because that race 
exists for less than one microsecond and the smallest timeout that can 
be specified for delayed queue rerunning is one millisecond. Let's 
address this race if anyone ever finds a way to hit it.

>>> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
>>> index d9ca1dfab154..55be2550c555 100644
>>> --- a/drivers/scsi/scsi_lib.c
>>> +++ b/drivers/scsi/scsi_lib.c
>>> @@ -2030,9 +2030,9 @@ static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx,
>>>   	case BLK_STS_OK:
>>>   		break;
>>>   	case BLK_STS_RESOURCE:
>>> -		if (atomic_read(&sdev->device_busy) == 0 &&
>>> -		    !scsi_device_blocked(sdev))
>>> -			blk_mq_delay_run_hw_queue(hctx, SCSI_QUEUE_DELAY);
>>> +		if (atomic_read(&sdev->device_busy) ||
>>> +		    scsi_device_blocked(sdev))
>>> +			ret = BLK_STS_DEV_RESOURCE;
>>>   		break;
>>>   	default:
>>>   		/*
>>
>> The above introduces two changes that have not been mentioned in the
>> description of this patch:
>> - The queue rerunning delay is changed from 3 ms into 10 ms. Where is the
>>    explanation of this change? Does this change have a positive or negative
>>    performance impact?
> 
> How can that be a issue for SCSI? The rerunning delay is only triggered
> when there isn't any in-flight requests in SCSI queue.

That change will result in more scsi_queue_rq() calls and hence in 
higher CPU usage. How much the CPU usage increases will depend on 
the CPU time required by the LLD's .queuecommand() callback if that 
function gets invoked.

Bart.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/5] blk-mq: introduce BLK_STS_DEV_RESOURCE
  2018-01-23 16:17       ` Bart Van Assche
@ 2018-01-23 16:26         ` Ming Lei
  2018-01-23 16:37           ` Bart Van Assche
  0 siblings, 1 reply; 31+ messages in thread
From: Ming Lei @ 2018-01-23 16:26 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: hch, linux-block, snitzer, axboe, dm-devel, jejb, linux-scsi,
	martin.petersen, loberman

On Tue, Jan 23, 2018 at 08:17:02AM -0800, Bart Van Assche wrote:
> 
> 
> On 01/22/18 16:57, Ming Lei wrote:
> > Even though RCU lock is held during dispatch, preemption or interrupt
> > can happen too, so it is simply wrong to depend on the timing to make
> > sure __blk_mq_run_hw_queue() can see the request in this situation.
> 
> It is very unlikely that this race will ever be hit because that race exists
> for less than one microsecond and the smallest timeout that can be specified
> for delayed queue rerunning is one millisecond. Let's address this race if
> anyone ever finds a way to hit it.

Please don't depend on timing, which is often fragile, when we can make
this correct with a generic approach. Also we should avoid making every
driver follow this way of dealing with most STS_RESOURCE returns, right?

> 
> > > > diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> > > > index d9ca1dfab154..55be2550c555 100644
> > > > --- a/drivers/scsi/scsi_lib.c
> > > > +++ b/drivers/scsi/scsi_lib.c
> > > > @@ -2030,9 +2030,9 @@ static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx,
> > > >   	case BLK_STS_OK:
> > > >   		break;
> > > >   	case BLK_STS_RESOURCE:
> > > > -		if (atomic_read(&sdev->device_busy) == 0 &&
> > > > -		    !scsi_device_blocked(sdev))
> > > > -			blk_mq_delay_run_hw_queue(hctx, SCSI_QUEUE_DELAY);
> > > > +		if (atomic_read(&sdev->device_busy) ||
> > > > +		    scsi_device_blocked(sdev))
> > > > +			ret = BLK_STS_DEV_RESOURCE;
> > > >   		break;
> > > >   	default:
> > > >   		/*
> > > 
> > > The above introduces two changes that have not been mentioned in the
> > > description of this patch:
> > > - The queue rerunning delay is changed from 3 ms into 10 ms. Where is the
> > >    explanation of this change? Does this change have a positive or negative
> > >    performance impact?
> > 
> > How can that be a issue for SCSI? The rerunning delay is only triggered
> > when there isn't any in-flight requests in SCSI queue.
> 
> That change will result in more scsi_queue_rq() calls and hence in higher
> CPU usage. By how much the CPU usage is increased will depend on the CPU
> time required by the LLD .queuecommand() callback if that function gets
> invoked.

No, this patch won't increase CPU usage for SCSI; the only change is to move
the blk_mq_delay_run_hw_queue() call out of SCSI's .queue_rq(), and the delay
becomes 10ms.

Thanks,
Ming

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/5] blk-mq: introduce BLK_STS_DEV_RESOURCE
  2018-01-23 16:26         ` Ming Lei
@ 2018-01-23 16:37           ` Bart Van Assche
  2018-01-23 16:41             ` Ming Lei
  0 siblings, 1 reply; 31+ messages in thread
From: Bart Van Assche @ 2018-01-23 16:37 UTC (permalink / raw)
  To: Ming Lei
  Cc: hch, linux-block, snitzer, axboe, dm-devel, jejb, linux-scsi,
	martin.petersen, loberman

On 01/23/18 08:26, Ming Lei wrote:
> On Tue, Jan 23, 2018 at 08:17:02AM -0800, Bart Van Assche wrote:
>> On 01/22/18 16:57, Ming Lei wrote:
>>> Even though RCU lock is held during dispatch, preemption or interrupt
>>> can happen too, so it is simply wrong to depend on the timing to make
>>> sure __blk_mq_run_hw_queue() can see the request in this situation.
>>
>> It is very unlikely that this race will ever be hit because that race exists
>> for less than one microsecond and the smallest timeout that can be specified
>> for delayed queue rerunning is one millisecond. Let's address this race if
>> anyone ever finds a way to hit it.
> 
> Please don't depend on the timing which is often fragile, as we can make it
> correct in a generic approach. Also we should avoid to make every driver to
> follow this way for dealing with most of STS_RESOURCE, right?

Wouldn't it be better to fix that race without changing the block layer 
API, e.g. by using call_rcu() for delayed queue runs? As you know, 
call_rcu() only calls the specified function after a grace period. 
Since pushing requests back onto the dispatch list happens with the RCU 
lock held, using call_rcu() for delayed queue runs would be sufficient to 
address this race.
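
A minimal sketch of that change, assuming an rcu_head and a saved delay are
added to struct blk_mq_hw_ctx (neither field exists today, and re-arming
while a previous call_rcu() is still pending would need extra care):

static void blk_mq_rcu_run_hw_queue(struct rcu_head *head)
{
	struct blk_mq_hw_ctx *hctx =
		container_of(head, struct blk_mq_hw_ctx, rcu_run);

	/* runs after a grace period, so requeued requests are visible */
	kblockd_mod_delayed_work_on(blk_mq_hctx_next_cpu(hctx),
				    &hctx->run_work, hctx->run_delay);
}

void blk_mq_delay_run_hw_queue(struct blk_mq_hw_ctx *hctx,
			       unsigned long msecs)
{
	hctx->run_delay = msecs_to_jiffies(msecs);
	call_rcu(&hctx->rcu_run, blk_mq_rcu_run_hw_queue);
}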

Bart.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/5] blk-mq: introduce BLK_STS_DEV_RESOURCE
  2018-01-23 16:37           ` Bart Van Assche
@ 2018-01-23 16:41             ` Ming Lei
  2018-01-23 16:47                 ` Bart Van Assche
  0 siblings, 1 reply; 31+ messages in thread
From: Ming Lei @ 2018-01-23 16:41 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: hch, linux-block, snitzer, axboe, dm-devel, jejb, linux-scsi,
	martin.petersen, loberman

On Tue, Jan 23, 2018 at 08:37:26AM -0800, Bart Van Assche wrote:
> On 01/23/18 08:26, Ming Lei wrote:
> > On Tue, Jan 23, 2018 at 08:17:02AM -0800, Bart Van Assche wrote:
> > > On 01/22/18 16:57, Ming Lei wrote:
> > > > Even though RCU lock is held during dispatch, preemption or interrupt
> > > > can happen too, so it is simply wrong to depend on the timing to make
> > > > sure __blk_mq_run_hw_queue() can see the request in this situation.
> > > 
> > > It is very unlikely that this race will ever be hit because that race exists
> > > for less than one microsecond and the smallest timeout that can be specified
> > > for delayed queue rerunning is one millisecond. Let's address this race if
> > > anyone ever finds a way to hit it.
> > 
> > Please don't depend on the timing which is often fragile, as we can make it
> > correct in a generic approach. Also we should avoid to make every driver to
> > follow this way for dealing with most of STS_RESOURCE, right?
> 
> Wouldn't it be better to fix that race without changing the block layer API,
> e.g. by using call_rcu() for delayed queue runs? As you know call_rcu() will

Could you explain where call_rcu() would be called? call_rcu() can't be used
in the IO path at all.

> only call the specified function after a grace period. Since pushing back
> requests onto the dispatch list happens with the RCU lock held using
> call_rcu() for delayed queue runs would be sufficient to address this race.


-- 
Ming

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/5] blk-mq: introduce BLK_STS_DEV_RESOURCE
@ 2018-01-23 16:47                 ` Bart Van Assche
  0 siblings, 0 replies; 31+ messages in thread
From: Bart Van Assche @ 2018-01-23 16:47 UTC (permalink / raw)
  To: ming.lei
  Cc: linux-block, hch, snitzer, martin.petersen, axboe, linux-scsi,
	jejb, loberman, dm-devel

On Wed, 2018-01-24 at 00:41 +0800, Ming Lei wrote:
> Could you explain where to call call_rcu()?  call_rcu() can't be used in
> IO path at all.

Can you explain what makes you think that call_rcu() can't be used in the
I/O path? As you know call_rcu() invokes a function asynchronously. From
Documentation/RCU/Design/Requirements/Requirements.html:

  The <tt>call_rcu()</tt> function may be used in a number of
  situations where neither <tt>synchronize_rcu()</tt> nor
  <tt>synchronize_rcu_expedited()</tt> would be legal,
  including within preempt-disable code, <tt>local_bh_disable()</tt> code,
  interrupt-disable code, and interrupt handlers.

Bart.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/5] blk-mq: introduce BLK_STS_DEV_RESOURCE
  2018-01-23 16:47                 ` Bart Van Assche
  (?)
@ 2018-01-23 16:49                 ` Ming Lei
  2018-01-23 16:54                     ` Bart Van Assche
  -1 siblings, 1 reply; 31+ messages in thread
From: Ming Lei @ 2018-01-23 16:49 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: linux-block, hch, snitzer, martin.petersen, axboe, linux-scsi,
	jejb, loberman, dm-devel

On Tue, Jan 23, 2018 at 04:47:11PM +0000, Bart Van Assche wrote:
> On Wed, 2018-01-24 at 00:41 +0800, Ming Lei wrote:
> > Could you explain where to call call_rcu()?  call_rcu() can't be used in
> > IO path at all.
> 
> Can you explain what makes you think that call_rcu() can't be used in the
> I/O path? As you know call_rcu() invokes a function asynchronously. From
> Documentation/RCU/Design/Requirements/Requirements.html:
> 
>   The <tt>call_rcu()</tt> function may be used in a number of
>   situations where neither <tt>synchronize_rcu()</tt> nor
>   <tt>synchronize_rcu_expedited()</tt> would be legal,
>   including within preempt-disable code, <tt>local_bh_disable()</tt> code,
>   interrupt-disable code, and interrupt handlers.

OK, supposing that is true, do you want to convert most of the STS_RESOURCE
uses in all drivers to this approach? Why can't we use the generic and simple
approach in this patch?

-- 
Ming

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/5] blk-mq: introduce BLK_STS_DEV_RESOURCE
@ 2018-01-23 16:54                     ` Bart Van Assche
  0 siblings, 0 replies; 31+ messages in thread
From: Bart Van Assche @ 2018-01-23 16:54 UTC (permalink / raw)
  To: ming.lei
  Cc: linux-block, hch, snitzer, martin.petersen, axboe, linux-scsi,
	jejb, loberman, dm-devel

On Wed, 2018-01-24 at 00:49 +0800, Ming Lei wrote:
> On Tue, Jan 23, 2018 at 04:47:11PM +0000, Bart Van Assche wrote:
> > On Wed, 2018-01-24 at 00:41 +0800, Ming Lei wrote:
> > > Could you explain where to call call_rcu()?  call_rcu() can't be used in
> > > IO path at all.
> > 
> > Can you explain what makes you think that call_rcu() can't be used in the
> > I/O path? As you know call_rcu() invokes a function asynchronously. From
> > Documentation/RCU/Design/Requirements/Requirements.html:
> > 
> >   The <tt>call_rcu()</tt> function may be used in a number of
> >   situations where neither <tt>synchronize_rcu()</tt> nor
> >   <tt>synchronize_rcu_expedited()</tt> would be legal,
> >   including within preempt-disable code, <tt>local_bh_disable()</tt> code,
> >   interrupt-disable code, and interrupt handlers.
> 
> OK, suppose it is true, do you want to change most of STS_RESOURCE in
> all drivers to this way? Why can't we use the generic and simple approach
> in this patch?

Please reread my proposal. I did not propose to change any block drivers.
What I proposed is to change the blk_mq_delay_run_hw_queue() implementation
such that it uses call_rcu() instead of kblockd_mod_delayed_work_on().
That's a generic and simple approach.

Bart.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/5] blk-mq: introduce BLK_STS_DEV_RESOURCE
  2018-01-23 16:54                     ` Bart Van Assche
@ 2018-01-23 16:59                       ` Ming Lei
  -1 siblings, 0 replies; 31+ messages in thread
From: Ming Lei @ 2018-01-23 16:59 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: linux-block, hch, snitzer, martin.petersen, axboe, linux-scsi,
	jejb, loberman, dm-devel

On Tue, Jan 23, 2018 at 04:54:02PM +0000, Bart Van Assche wrote:
> On Wed, 2018-01-24 at 00:49 +0800, Ming Lei wrote:
> > On Tue, Jan 23, 2018 at 04:47:11PM +0000, Bart Van Assche wrote:
> > > On Wed, 2018-01-24 at 00:41 +0800, Ming Lei wrote:
> > > > Could you explain where to call call_rcu()?  call_rcu() can't be used in
> > > > IO path at all.
> > > 
> > > Can you explain what makes you think that call_rcu() can't be used in the
> > > I/O path? As you know call_rcu() invokes a function asynchronously. From
> > > Documentation/RCU/Design/Requirements/Requirements.html:
> > > 
> > >   The <tt>call_rcu()</tt> function may be used in a number of
> > >   situations where neither <tt>synchronize_rcu()</tt> nor
> > >   <tt>synchronize_rcu_expedited()</tt> would be legal,
> > >   including within preempt-disable code, <tt>local_bh_disable()</tt> code,
> > >   interrupt-disable code, and interrupt handlers.
> > 
> > OK, suppose it is true, do you want to change most of STS_RESOURCE in
> > all drivers to this way? Why can't we use the generic and simple approach
> > in this patch?
> 
> Please reread my proposal. I did not propose to change any block drivers.
> What I proposed is to change the blk_mq_delay_run_hw_queue() implementation
> such that it uses call_rcu() instead of kblockd_mod_delayed_work_on().
> That's a generic and simple approach.

How is that enough to fix the IO hang when a driver returns STS_RESOURCE
and the queue is idle? If you want to follow dm-rq's previous approach of
calling blk_mq_delay_run_hw_queue() in .queue_rq(), the same trick needs
to be applied to other drivers too, right?

Unfortunately most STS_RESOURCE users don't apply this trick, but they
still need to be handled.

The patch 'blk-mq: introduce BLK_STS_DEV_RESOURCE' can fix all of these
cases.

-- 
Ming

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/5] blk-mq: introduce BLK_STS_DEV_RESOURCE
@ 2018-01-23 22:01                         ` Bart Van Assche
  0 siblings, 0 replies; 31+ messages in thread
From: Bart Van Assche @ 2018-01-23 22:01 UTC (permalink / raw)
  To: ming.lei
  Cc: linux-block, hch, snitzer, martin.petersen, axboe, linux-scsi,
	jejb, loberman, dm-devel

On Wed, 2018-01-24 at 00:59 +0800, Ming Lei wrote:
> How is that enough to fix the IO hang when driver returns STS_RESOURCE
> and the queue is idle? If you want to follow previous dm-rq's way of
> call blk_mq_delay_run_hw_queue() in .queue_rq(), the same trick need
> to be applied to other drivers too, right?
> 
> Unfortunately most of STS_RESOURCE don't use this trick, but they need
> to be handled.
> 
> The patch of 'blk-mq: introduce BLK_STS_DEV_RESOURCE' can fix all these
> cases.

The goal of my proposal was only to address the race between running the queue
and adding requests back to the dispatch queue. Regarding the I/O hangs that
can occur if a block driver returns BLK_STS_RESOURCE: since all of these can
be addressed by inserting blk_mq_delay_run_hw_queue() calls in the affected
block drivers, I prefer to modify the block drivers instead of making the
blk-mq core even more complicated.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/5] blk-mq: introduce BLK_STS_DEV_RESOURCE
  2018-01-23 22:01                         ` Bart Van Assche
  (?)
@ 2018-01-24  2:31                         ` Ming Lei
  -1 siblings, 0 replies; 31+ messages in thread
From: Ming Lei @ 2018-01-24  2:31 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: linux-block, hch, snitzer, martin.petersen, axboe, linux-scsi,
	jejb, loberman, dm-devel

On Tue, Jan 23, 2018 at 10:01:37PM +0000, Bart Van Assche wrote:
> On Wed, 2018-01-24 at 00:59 +0800, Ming Lei wrote:
> > How is that enough to fix the IO hang when driver returns STS_RESOURCE
> > and the queue is idle? If you want to follow previous dm-rq's way of
> > call blk_mq_delay_run_hw_queue() in .queue_rq(), the same trick need
> > to be applied to other drivers too, right?
> > 
> > Unfortunately most of STS_RESOURCE don't use this trick, but they need
> > to be handled.
> > 
> > The patch of 'blk-mq: introduce BLK_STS_DEV_RESOURCE' can fix all these
> > cases.
> 
> The goal of my proposal was to address the race between running the queue and
> adding requests back to the dispatch queue only. Regarding the I/O hangs that
> can occur if a block driver returns BLK_STS_RESOURCE: since all of these can
> be addressed by inserting blk_mq_delay_run_hw_queue() calls in the affected
> block drivers I prefer to modify the block drivers instead of making the
> blk-mq core even more complicated.

IMO, this change doesn't make the blk-mq code more complicated, and it is
well documented; see the change in the block layer:

 block/blk-core.c             |  1 +
 block/blk-mq.c               | 19 +++++++++++++++----
 include/linux/blk_types.h    | 18 ++++++++++++++++++

Also, 21 of those lines are comments, so only 17 lines of code change
are needed in the block layer.

If blk_mq_delay_run_hw_queue() has to be inserted into each driver, the
change can be a bit complicated: since call_rcu() has to be used, you
need to figure out a way to provide the callback and its parameter, and
you would even have to document the change in each driver.

[ming@ming linux]$ git grep -n BLK_STS_RESOURCE drivers/ | wc -l
42

There are at least 42 uses of BLK_STS_RESOURCE in drivers; in theory a
call_rcu()-based blk_mq_delay_run_hw_queue() would have to be inserted at
every BLK_STS_RESOURCE return in those drivers.

I leave the decision to Jens and the driver maintainers.

Thanks,
Ming

^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread

Thread overview: 31+ messages
2018-01-22  3:35 [PATCH 0/5] blk-mq & dm: fix IO hang and deal with one performance issue Ming Lei
2018-01-22  3:35 ` [PATCH 1/5] blk-mq: introduce BLK_STS_DEV_RESOURCE Ming Lei
2018-01-22 16:32   ` Christoph Hellwig
2018-01-22 16:49   ` Bart Van Assche
2018-01-22 16:49     ` Bart Van Assche
2018-01-23  0:57     ` Ming Lei
2018-01-23 16:17       ` Bart Van Assche
2018-01-23 16:26         ` Ming Lei
2018-01-23 16:37           ` Bart Van Assche
2018-01-23 16:41             ` Ming Lei
2018-01-23 16:47               ` Bart Van Assche
2018-01-23 16:47                 ` Bart Van Assche
2018-01-23 16:49                 ` Ming Lei
2018-01-23 16:54                   ` Bart Van Assche
2018-01-23 16:54                     ` Bart Van Assche
2018-01-23 16:59                     ` Ming Lei
2018-01-23 16:59                       ` Ming Lei
2018-01-23 22:01                       ` Bart Van Assche
2018-01-23 22:01                         ` Bart Van Assche
2018-01-24  2:31                         ` Ming Lei
2018-01-22  3:35 ` [PATCH 2/5] dm-rq: handle dispatch exception in dm_dispatch_clone_request() Ming Lei
2018-01-22  3:35   ` Ming Lei
2018-01-22  3:35 ` [PATCH 3/5] dm-rq: return BLK_STS_* from map_request() Ming Lei
2018-01-22  3:35   ` Ming Lei
2018-01-22  5:35   ` Ming Lei
2018-01-22  3:35 ` [PATCH 4/5] blk-mq: introduce blk_get_request_notify Ming Lei
2018-01-22 10:19   ` Ming Lei
2018-01-22 17:13   ` Bart Van Assche
2018-01-22 17:13     ` Bart Van Assche
2018-01-23  1:29     ` Ming Lei
2018-01-22  3:35 ` [PATCH 5/5] dm-mpath: use blk_mq_alloc_request_notify for allocating blk-mq req Ming Lei
