* [PATCH 0/6] blk-mq: introduce congestion control
@ 2017-07-11 18:20 Ming Lei
  2017-07-11 18:20 ` [PATCH 1/6] xen-blkfront: avoid to use start/stop queue Ming Lei
                   ` (6 more replies)
  0 siblings, 7 replies; 43+ messages in thread
From: Ming Lei @ 2017-07-11 18:20 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig
  Cc: Bart Van Assche, Sagi Grimberg, Ming Lei

Hi,

The 1st patch converts two start/stop queue uses into quiesce/unquiesce
in xen-blkfront.

The 2nd patch uses blk_mq_run_hw_queues() in SCSI because the queue
won't be stopped at all.

The 3rd patch prepares for introducing congestion control for blk-mq.

The 4th patch uses EWMA to estimate congestion threshold.

The 5th patch uses the estimated congestion threshold to detect
congestion. Once congestion is found, the queue is stopped, and it
will be restarted once the congestion condition no longer holds.

The 6th patch unexports several APIs for starting/stopping queues,
so the STOPPED state of a queue becomes an internal state that
is now invisible to drivers.
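
In short, the idea behind the 4th and 5th patches can be sketched as
below (a simplified standalone sketch, not the exact code in the
patches; the weight of 8 and the 1/8 restart window follow the
hardcoded values used there):

	struct hctx_sketch {
		unsigned long avg_busy_threshold; /* EWMA of queue depth seen on BUSY */
		bool stopped;                     /* queue stopped because of congestion */
	};

	/* integer EWMA with weight 8: new = (7 * old + sample) / 8 */
	static unsigned long ewma8(unsigned long old, unsigned long sample)
	{
		return (old * 7 + sample) / 8;
	}

	/* dispatch sees BLK_STS_RESOURCE: learn the "busy" depth, maybe stop */
	static void sketch_on_busy(struct hctx_sketch *h, unsigned int depth)
	{
		h->avg_busy_threshold = ewma8(h->avg_busy_threshold, depth);
		if (depth >= h->avg_busy_threshold)
			h->stopped = true;
	}

	/* request completion: restart once depth drops 1/8 below the threshold */
	static void sketch_on_complete(struct hctx_sketch *h, unsigned int depth)
	{
		unsigned long t = h->avg_busy_threshold;

		if (h->stopped && depth < t && t - depth >= (t >> 3)) {
			h->stopped = false;
			/* run the hw queue again */
		}
	}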

This patchset is based on Sagi Grimberg's patchset "[PATCH v3 0/8]
correct quiescing in several block drivers", which can be found
at:
	http://marc.info/?t=149927415900006&r=1&w=2

Any comments are welcome!

Ming Lei (6):
  xen-blkfront: avoid to use start/stop queue
  SCSI: use blk_mq_run_hw_queues() in scsi_kick_queue()
  blk-mq: send the request to dispatch list if direct issue returns busy
  blk-mq: use EWMA to estimate congestion threshold
  blk-mq: introduce basic congestion control
  blk-mq: unexport APIs for start/stop queues

 block/blk-mq.c               | 143 ++++++++++++++++++-------------------------
 block/blk-mq.h               |  11 ++++
 drivers/block/virtio_blk.c   |   7 ---
 drivers/block/xen-blkfront.c |  28 ++-------
 drivers/md/dm-rq.c           |   1 -
 drivers/nvme/host/fc.c       |   4 --
 drivers/scsi/scsi_lib.c      |   5 +-
 include/linux/blk-mq.h       |  10 +--
 8 files changed, 79 insertions(+), 130 deletions(-)

-- 
2.9.4

^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCH 1/6] xen-blkfront: avoid to use start/stop queue
  2017-07-11 18:20 [PATCH 0/6] blk-mq: introduce congestion control Ming Lei
  2017-07-11 18:20 ` [PATCH 1/6] xen-blkfront: avoid to use start/stop queue Ming Lei
@ 2017-07-11 18:20 ` Ming Lei
  2017-07-11 18:41   ` Konrad Rzeszutek Wilk
                     ` (5 more replies)
  2017-07-11 18:20 ` [PATCH 2/6] SCSI: use blk_mq_run_hw_queues() in scsi_kick_queue() Ming Lei
                   ` (4 subsequent siblings)
  6 siblings, 6 replies; 43+ messages in thread
From: Ming Lei @ 2017-07-11 18:20 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig
  Cc: Bart Van Assche, Sagi Grimberg, Ming Lei, Konrad Rzeszutek Wilk,
	Roger Pau Monné,
	Boris Ostrovsky, Juergen Gross, xen-devel

These interfaces will be removed soon, so use quiesce and
unquiesce instead, which should be safer.

The only remaining usage will be removed in the following
congestion control patches.

Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: "Roger Pau Monné" <roger.pau@citrix.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: xen-devel@lists.xenproject.org
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/block/xen-blkfront.c | 22 ++++++++--------------
 1 file changed, 8 insertions(+), 14 deletions(-)

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index c852ed3c01d5..1578befda635 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -1187,7 +1187,7 @@ static void xlvbd_release_gendisk(struct blkfront_info *info)
 		return;
 
 	/* No more blkif_request(). */
-	blk_mq_stop_hw_queues(info->rq);
+	blk_mq_quiesce_queue(info->rq);
 
 	for (i = 0; i < info->nr_rings; i++) {
 		struct blkfront_ring_info *rinfo = &info->rinfo[i];
@@ -1216,8 +1216,10 @@ static void xlvbd_release_gendisk(struct blkfront_info *info)
 /* Already hold rinfo->ring_lock. */
 static inline void kick_pending_request_queues_locked(struct blkfront_ring_info *rinfo)
 {
-	if (!RING_FULL(&rinfo->ring))
+	if (!RING_FULL(&rinfo->ring)) {
 		blk_mq_start_stopped_hw_queues(rinfo->dev_info->rq, true);
+		blk_mq_kick_requeue_list(rinfo->dev_info->rq);
+	}
 }
 
 static void kick_pending_request_queues(struct blkfront_ring_info *rinfo)
@@ -1225,7 +1227,8 @@ static void kick_pending_request_queues(struct blkfront_ring_info *rinfo)
 	unsigned long flags;
 
 	spin_lock_irqsave(&rinfo->ring_lock, flags);
-	kick_pending_request_queues_locked(rinfo);
+	if (!RING_FULL(&rinfo->ring))
+		blk_mq_run_hw_queues(rinfo->dev_info->rq, true);
 	spin_unlock_irqrestore(&rinfo->ring_lock, flags);
 }
 
@@ -1346,7 +1349,7 @@ static void blkif_free(struct blkfront_info *info, int suspend)
 		BLKIF_STATE_SUSPENDED : BLKIF_STATE_DISCONNECTED;
 	/* No more blkif_request(). */
 	if (info->rq)
-		blk_mq_stop_hw_queues(info->rq);
+		blk_mq_quiesce_queue(info->rq);
 
 	for (i = 0; i < info->nr_rings; i++)
 		blkif_free_ring(&info->rinfo[i]);
@@ -2018,22 +2021,13 @@ static int blkif_recover(struct blkfront_info *info)
 	/* Now safe for us to use the shared ring */
 	info->connected = BLKIF_STATE_CONNECTED;
 
-	for (r_index = 0; r_index < info->nr_rings; r_index++) {
-		struct blkfront_ring_info *rinfo;
-
-		rinfo = &info->rinfo[r_index];
-		/* Kick any other new requests queued since we resumed */
-		kick_pending_request_queues(rinfo);
-	}
-
 	list_for_each_entry_safe(req, n, &info->requests, queuelist) {
 		/* Requeue pending requests (flush or discard) */
 		list_del_init(&req->queuelist);
 		BUG_ON(req->nr_phys_segments > segs);
 		blk_mq_requeue_request(req, false);
 	}
-	blk_mq_start_stopped_hw_queues(info->rq, true);
-	blk_mq_kick_requeue_list(info->rq);
+	blk_mq_unquiesce_queue(info->rq);
 
 	while ((bio = bio_list_pop(&info->bio_list)) != NULL) {
 		/* Traverse the list of pending bios and re-queue them */
-- 
2.9.4

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 1/6] xen-blkfront: avoid to use start/stop queue
  2017-07-11 18:20 [PATCH 0/6] blk-mq: introduce congestion control Ming Lei
@ 2017-07-11 18:20 ` Ming Lei
  2017-07-11 18:20 ` Ming Lei
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 43+ messages in thread
From: Ming Lei @ 2017-07-11 18:20 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig
  Cc: Juergen Gross, Sagi Grimberg, Ming Lei, xen-devel,
	Bart Van Assche, Boris Ostrovsky, Roger Pau Monné

These interfaces will be removed soon, so use quiesce and
unquiesce instead, which should be safer.

The only remaining usage will be removed in the following
congestion control patches.

Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: "Roger Pau Monné" <roger.pau@citrix.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: xen-devel@lists.xenproject.org
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/block/xen-blkfront.c | 22 ++++++++--------------
 1 file changed, 8 insertions(+), 14 deletions(-)

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index c852ed3c01d5..1578befda635 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -1187,7 +1187,7 @@ static void xlvbd_release_gendisk(struct blkfront_info *info)
 		return;
 
 	/* No more blkif_request(). */
-	blk_mq_stop_hw_queues(info->rq);
+	blk_mq_quiesce_queue(info->rq);
 
 	for (i = 0; i < info->nr_rings; i++) {
 		struct blkfront_ring_info *rinfo = &info->rinfo[i];
@@ -1216,8 +1216,10 @@ static void xlvbd_release_gendisk(struct blkfront_info *info)
 /* Already hold rinfo->ring_lock. */
 static inline void kick_pending_request_queues_locked(struct blkfront_ring_info *rinfo)
 {
-	if (!RING_FULL(&rinfo->ring))
+	if (!RING_FULL(&rinfo->ring)) {
 		blk_mq_start_stopped_hw_queues(rinfo->dev_info->rq, true);
+		blk_mq_kick_requeue_list(rinfo->dev_info->rq);
+	}
 }
 
 static void kick_pending_request_queues(struct blkfront_ring_info *rinfo)
@@ -1225,7 +1227,8 @@ static void kick_pending_request_queues(struct blkfront_ring_info *rinfo)
 	unsigned long flags;
 
 	spin_lock_irqsave(&rinfo->ring_lock, flags);
-	kick_pending_request_queues_locked(rinfo);
+	if (!RING_FULL(&rinfo->ring))
+		blk_mq_run_hw_queues(rinfo->dev_info->rq, true);
 	spin_unlock_irqrestore(&rinfo->ring_lock, flags);
 }
 
@@ -1346,7 +1349,7 @@ static void blkif_free(struct blkfront_info *info, int suspend)
 		BLKIF_STATE_SUSPENDED : BLKIF_STATE_DISCONNECTED;
 	/* No more blkif_request(). */
 	if (info->rq)
-		blk_mq_stop_hw_queues(info->rq);
+		blk_mq_quiesce_queue(info->rq);
 
 	for (i = 0; i < info->nr_rings; i++)
 		blkif_free_ring(&info->rinfo[i]);
@@ -2018,22 +2021,13 @@ static int blkif_recover(struct blkfront_info *info)
 	/* Now safe for us to use the shared ring */
 	info->connected = BLKIF_STATE_CONNECTED;
 
-	for (r_index = 0; r_index < info->nr_rings; r_index++) {
-		struct blkfront_ring_info *rinfo;
-
-		rinfo = &info->rinfo[r_index];
-		/* Kick any other new requests queued since we resumed */
-		kick_pending_request_queues(rinfo);
-	}
-
 	list_for_each_entry_safe(req, n, &info->requests, queuelist) {
 		/* Requeue pending requests (flush or discard) */
 		list_del_init(&req->queuelist);
 		BUG_ON(req->nr_phys_segments > segs);
 		blk_mq_requeue_request(req, false);
 	}
-	blk_mq_start_stopped_hw_queues(info->rq, true);
-	blk_mq_kick_requeue_list(info->rq);
+	blk_mq_unquiesce_queue(info->rq);
 
 	while ((bio = bio_list_pop(&info->bio_list)) != NULL) {
 		/* Traverse the list of pending bios and re-queue them */
-- 
2.9.4


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 2/6] SCSI: use blk_mq_run_hw_queues() in scsi_kick_queue()
  2017-07-11 18:20 [PATCH 0/6] blk-mq: introduce congestion control Ming Lei
  2017-07-11 18:20 ` [PATCH 1/6] xen-blkfront: avoid to use start/stop queue Ming Lei
  2017-07-11 18:20 ` Ming Lei
@ 2017-07-11 18:20 ` Ming Lei
  2017-07-11 19:57     ` Bart Van Assche
  2017-07-11 18:21 ` [PATCH 3/6] blk-mq: send the request to dispatch list if direct issue returns busy Ming Lei
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 43+ messages in thread
From: Ming Lei @ 2017-07-11 18:20 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig
  Cc: Bart Van Assche, Sagi Grimberg, Ming Lei, James E.J. Bottomley,
	Martin K. Petersen, linux-scsi

Now SCSI won't stop the queue, and it isn't necessary to use
blk_mq_start_hw_queues(), so switch to blk_mq_run_hw_queues()
instead.

Cc: "James E.J. Bottomley" <jejb@linux.vnet.ibm.com>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: linux-scsi@vger.kernel.org
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/scsi/scsi_lib.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index f6097b89d5d3..91d890356b78 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -333,7 +333,7 @@ void scsi_device_unbusy(struct scsi_device *sdev)
 static void scsi_kick_queue(struct request_queue *q)
 {
 	if (q->mq_ops)
-		blk_mq_start_hw_queues(q);
+		blk_mq_run_hw_queues(q, false);
 	else
 		blk_run_queue(q);
 }
-- 
2.9.4

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 3/6] blk-mq: send the request to dispatch list if direct issue returns busy
  2017-07-11 18:20 [PATCH 0/6] blk-mq: introduce congestion control Ming Lei
                   ` (2 preceding siblings ...)
  2017-07-11 18:20 ` [PATCH 2/6] SCSI: use blk_mq_run_hw_queues() in scsi_kick_queue() Ming Lei
@ 2017-07-11 18:21 ` Ming Lei
  2017-07-11 20:18   ` Bart Van Assche
  2017-07-11 18:21 ` [PATCH 4/6] blk-mq: use EWMA to estimate congestion threshold Ming Lei
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 43+ messages in thread
From: Ming Lei @ 2017-07-11 18:21 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig
  Cc: Bart Van Assche, Sagi Grimberg, Ming Lei

Before the mq IO scheduler was introduced, we always sent the request
to the dispatch list if .queue_rq() returned busy. After the mq IO
scheduler was introduced, we only do this on direct issue when a
scheduler is used. Actually we can do that when a scheduler
isn't used too.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c | 18 +++++++++++++++---
 1 file changed, 15 insertions(+), 3 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 041f7b7fa0d6..6e0fc80aa151 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1463,6 +1463,16 @@ static blk_qc_t request_to_qc_t(struct blk_mq_hw_ctx *hctx, struct request *rq)
 	return blk_tag_to_qc_t(rq->internal_tag, hctx->queue_num, true);
 }
 
+static void blk_mq_direct_dispatch(struct blk_mq_hw_ctx *hctx,
+				   struct request *rq)
+{
+	spin_lock(&hctx->lock);
+	list_add(&rq->queuelist, &hctx->dispatch);
+	spin_unlock(&hctx->lock);
+
+	blk_mq_run_hw_queue(hctx, false);
+}
+
 static void __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
 					struct request *rq,
 					blk_qc_t *cookie, bool may_sleep)
@@ -1499,15 +1509,17 @@ static void __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
 	switch (ret) {
 	case BLK_STS_OK:
 		*cookie = new_cookie;
-		return;
+		break;
 	case BLK_STS_RESOURCE:
 		__blk_mq_requeue_request(rq);
-		goto insert;
+		blk_mq_direct_dispatch(hctx, rq);
+		break;
 	default:
 		*cookie = BLK_QC_T_NONE;
 		blk_mq_end_request(rq, ret);
-		return;
+		break;
 	}
+	return;
 
 insert:
 	blk_mq_sched_insert_request(rq, false, run_queue, false, may_sleep);
-- 
2.9.4

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 4/6] blk-mq: use EWMA to estimate congestion threshold
  2017-07-11 18:20 [PATCH 0/6] blk-mq: introduce congestion control Ming Lei
                   ` (3 preceding siblings ...)
  2017-07-11 18:21 ` [PATCH 3/6] blk-mq: send the request to dispatch list if direct issue returns busy Ming Lei
@ 2017-07-11 18:21 ` Ming Lei
  2017-07-11 18:25   ` Jens Axboe
                     ` (2 more replies)
  2017-07-11 18:21 ` [PATCH 5/6] blk-mq: introduce basic congestion control Ming Lei
  2017-07-11 18:21 ` [PATCH 6/6] blk-mq: unexport APIs for start/stop queues Ming Lei
  6 siblings, 3 replies; 43+ messages in thread
From: Ming Lei @ 2017-07-11 18:21 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig
  Cc: Bart Van Assche, Sagi Grimberg, Ming Lei

When .queue_rq() returns BLK_STS_RESOURCE(BUSY), we can
consider that there is congestion in either low level
driver or hardware.

This patch uses EWMA to estimate this congestion threshold,
then this threshold can be used to detect/avoid congestion.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c         | 14 ++++++++++++++
 block/blk-mq.h         |  9 +++++++++
 include/linux/blk-mq.h |  2 ++
 3 files changed, 25 insertions(+)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 6e0fc80aa151..da50c187c508 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -976,6 +976,18 @@ static bool blk_mq_dispatch_wait_add(struct blk_mq_hw_ctx *hctx)
 	return true;
 }
 
+static void blk_mq_update_req_dispatch_busy(struct blk_mq_hw_ctx *hctx)
+{
+	struct sbitmap_queue *sbq;
+	unsigned depth;
+
+	sbq = &hctx->tags->bitmap_tags;
+	depth = sbitmap_weight(&sbq->sb);
+
+	/* use EWMA to estimate a threshold for detecting congestion */
+	ewma_add(hctx->avg_busy_threshold, depth, 8, 0);
+}
+
 bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list)
 {
 	struct blk_mq_hw_ctx *hctx;
@@ -1064,6 +1076,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list)
 
 		spin_lock(&hctx->lock);
 		list_splice_init(list, &hctx->dispatch);
+		blk_mq_update_req_dispatch_busy(hctx);
 		spin_unlock(&hctx->lock);
 
 		/*
@@ -1468,6 +1481,7 @@ static void blk_mq_direct_dispatch(struct blk_mq_hw_ctx *hctx,
 {
 	spin_lock(&hctx->lock);
 	list_add(&rq->queuelist, &hctx->dispatch);
+	blk_mq_update_req_dispatch_busy(hctx);
 	spin_unlock(&hctx->lock);
 
 	blk_mq_run_hw_queue(hctx, false);
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 60b01c0309bc..c4516d2a2d2c 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -133,4 +133,13 @@ static inline bool blk_mq_hw_queue_mapped(struct blk_mq_hw_ctx *hctx)
 	return hctx->nr_ctx && hctx->tags;
 }
 
+/* borrowed from bcache */
+#define ewma_add(ewma, val, weight, factor)                             \
+({                                                                      \
+        (ewma) *= (weight) - 1;                                         \
+        (ewma) += (val) << factor;                                      \
+        (ewma) /= (weight);                                             \
+        (ewma) >> factor;                                               \
+})
+
 #endif
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 14542308d25b..8694fb39cd80 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -22,6 +22,8 @@ struct blk_mq_hw_ctx {
 
 	unsigned long		flags;		/* BLK_MQ_F_* flags */
 
+	unsigned long		avg_busy_threshold;
+
 	void			*sched_data;
 	struct request_queue	*queue;
 	struct blk_flush_queue	*fq;
-- 
2.9.4

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 5/6] blk-mq: introduce basic congestion control
  2017-07-11 18:20 [PATCH 0/6] blk-mq: introduce congestion control Ming Lei
                   ` (4 preceding siblings ...)
  2017-07-11 18:21 ` [PATCH 4/6] blk-mq: use EWMA to estimate congestion threshold Ming Lei
@ 2017-07-11 18:21 ` Ming Lei
  2017-07-11 18:21 ` [PATCH 6/6] blk-mq: unexport APIs for start/stop queues Ming Lei
  6 siblings, 0 replies; 43+ messages in thread
From: Ming Lei @ 2017-07-11 18:21 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig
  Cc: Bart Van Assche, Sagi Grimberg, Ming Lei

Now we have removed all usages of start/stop queue except
for the BLK_STS_RESOURCE case.

This patch implements basic congestion control so that
we can stop the queue when congestion is detected, and
restart it once the congestion condition no longer holds.
The congestion threshold was introduced in the last
patch; if the queue depth is bigger than this threshold,
this patch treats it as congestion.

There are at least two advantages to doing congestion control:

- avoid wasting CPU on dispatching requests to the hardware/driver when
it might be busy

- avoid making drivers handle this, so we can clean up drivers
and unexport the interfaces for restarting/stopping queues, which have
caused enough trouble already

One simple sequential read test (libaio, bs:4k, direct io, queue
depth:64, 8 jobs) on virtio-scsi shows that:
	- CPU utilization decreases ~20%
	- IOPS increases by ~10%

With this congestion control approach in the blk-mq framework, we
can remove that handling from drivers. Actually the handling in
drivers isn't good enough either, for example:

	- virtio-blk/xen-blkfront stop the queue if one rq can't
	be handled, and restart the queue again once one request
	is completed; this may hurt dispatching requests in
	batches, and makes queue stop/restart too
	frequent

	- virtio-scsi won't stop the queue, and just delays handling
	new requests in this case; this wastes
	CPU and may affect both latency and throughput

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c               | 31 +++++++++++++++++++++++++++++++
 drivers/block/virtio_blk.c   |  7 -------
 drivers/block/xen-blkfront.c | 12 ------------
 drivers/md/dm-rq.c           |  1 -
 drivers/nvme/host/fc.c       |  4 ----
 drivers/scsi/scsi_lib.c      |  3 ---
 6 files changed, 31 insertions(+), 27 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index da50c187c508..d994449c154b 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -416,6 +416,29 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
 }
 EXPORT_SYMBOL_GPL(blk_mq_alloc_request_hctx);
 
+static void blk_mq_update_req_complete(struct blk_mq_hw_ctx *hctx)
+{
+	struct sbitmap_queue *sbq;
+	unsigned threshold, depth;
+
+	if (!blk_mq_hctx_stopped(hctx))
+		return;
+
+	sbq = &hctx->tags->bitmap_tags;
+	depth = sbitmap_weight(&sbq->sb);
+	threshold = READ_ONCE(hctx->avg_busy_threshold);
+
+	/*
+	 * TODO: replace the 1/8 hardcode window with one
+	 * intelligent way, such as exponential backoff
+	 */
+	if ((depth < threshold) &&
+			(threshold - depth) >= (threshold >> 3)) {
+		clear_bit(BLK_MQ_S_STOPPED, &hctx->state);
+		blk_mq_run_hw_queue(hctx, true);
+	}
+}
+
 void blk_mq_free_request(struct request *rq)
 {
 	struct request_queue *q = rq->q;
@@ -447,6 +470,8 @@ void blk_mq_free_request(struct request *rq)
 		blk_mq_put_tag(hctx, hctx->sched_tags, ctx, sched_tag);
 	blk_mq_sched_restart(hctx);
 	blk_queue_exit(q);
+
+	blk_mq_update_req_complete(hctx);
 }
 EXPORT_SYMBOL_GPL(blk_mq_free_request);
 
@@ -986,6 +1011,12 @@ static void blk_mq_update_req_dispatch_busy(struct blk_mq_hw_ctx *hctx)
 
 	/* use EWMA to estimate a threshold for detecting congestion */
 	ewma_add(hctx->avg_busy_threshold, depth, 8, 0);
+
+	if (blk_mq_hctx_stopped(hctx))
+		return;
+
+	if (depth >= hctx->avg_busy_threshold)
+		set_bit(BLK_MQ_S_STOPPED, &hctx->state);
 }
 
 bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list)
diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 4e02aa5fdac0..8014017f7f69 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -189,7 +189,6 @@ static inline void virtblk_request_done(struct request *req)
 static void virtblk_done(struct virtqueue *vq)
 {
 	struct virtio_blk *vblk = vq->vdev->priv;
-	bool req_done = false;
 	int qid = vq->index;
 	struct virtblk_req *vbr;
 	unsigned long flags;
@@ -202,15 +201,10 @@ static void virtblk_done(struct virtqueue *vq)
 			struct request *req = blk_mq_rq_from_pdu(vbr);
 
 			blk_mq_complete_request(req);
-			req_done = true;
 		}
 		if (unlikely(virtqueue_is_broken(vq)))
 			break;
 	} while (!virtqueue_enable_cb(vq));
-
-	/* In case queue is stopped waiting for more buffers. */
-	if (req_done)
-		blk_mq_start_stopped_hw_queues(vblk->disk->queue, true);
 	spin_unlock_irqrestore(&vblk->vqs[qid].lock, flags);
 }
 
@@ -271,7 +265,6 @@ static blk_status_t virtio_queue_rq(struct blk_mq_hw_ctx *hctx,
 		err = virtblk_add_req(vblk->vqs[qid].vq, vbr, vbr->sg, num);
 	if (err) {
 		virtqueue_kick(vblk->vqs[qid].vq);
-		blk_mq_stop_hw_queue(hctx);
 		spin_unlock_irqrestore(&vblk->vqs[qid].lock, flags);
 		/* Out of mem doesn't actually happen, since we fall back
 		 * to direct descriptors */
diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 1578befda635..9893abac4e0f 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -907,7 +907,6 @@ static blk_status_t blkif_queue_rq(struct blk_mq_hw_ctx *hctx,
 
 out_busy:
 	spin_unlock_irqrestore(&rinfo->ring_lock, flags);
-	blk_mq_stop_hw_queue(hctx);
 	return BLK_STS_RESOURCE;
 }
 
@@ -1213,15 +1212,6 @@ static void xlvbd_release_gendisk(struct blkfront_info *info)
 	info->gd = NULL;
 }
 
-/* Already hold rinfo->ring_lock. */
-static inline void kick_pending_request_queues_locked(struct blkfront_ring_info *rinfo)
-{
-	if (!RING_FULL(&rinfo->ring)) {
-		blk_mq_start_stopped_hw_queues(rinfo->dev_info->rq, true);
-		blk_mq_kick_requeue_list(rinfo->dev_info->rq);
-	}
-}
-
 static void kick_pending_request_queues(struct blkfront_ring_info *rinfo)
 {
 	unsigned long flags;
@@ -1659,8 +1649,6 @@ static irqreturn_t blkif_interrupt(int irq, void *dev_id)
 	} else
 		rinfo->ring.sring->rsp_event = i + 1;
 
-	kick_pending_request_queues_locked(rinfo);
-
 	spin_unlock_irqrestore(&rinfo->ring_lock, flags);
 
 	return IRQ_HANDLED;
diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index c6ebc5b1e00e..71422cea1c4a 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -761,7 +761,6 @@ static blk_status_t dm_mq_queue_rq(struct blk_mq_hw_ctx *hctx,
 		/* Undo dm_start_request() before requeuing */
 		rq_end_stats(md, rq);
 		rq_completed(md, rq_data_dir(rq), false);
-		blk_mq_delay_run_hw_queue(hctx, 100/*ms*/);
 		return BLK_STS_RESOURCE;
 	}
 
diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
index a8d06aa09660..cc8c68dd6c9b 100644
--- a/drivers/nvme/host/fc.c
+++ b/drivers/nvme/host/fc.c
@@ -1970,10 +1970,6 @@ nvme_fc_start_fcp_op(struct nvme_fc_ctrl *ctrl, struct nvme_fc_queue *queue,
 		if (ret != -EBUSY)
 			return BLK_STS_IOERR;
 
-		if (op->rq) {
-			blk_mq_stop_hw_queues(op->rq->q);
-			blk_mq_delay_queue(queue->hctx, NVMEFC_QUEUE_DELAY);
-		}
 		return BLK_STS_RESOURCE;
 	}
 
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 91d890356b78..ac030c64fa5f 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1976,9 +1976,6 @@ static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx,
 	case BLK_STS_OK:
 		break;
 	case BLK_STS_RESOURCE:
-		if (atomic_read(&sdev->device_busy) == 0 &&
-		    !scsi_device_blocked(sdev))
-			blk_mq_delay_run_hw_queue(hctx, SCSI_QUEUE_DELAY);
 		break;
 	default:
 		/*
-- 
2.9.4

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 6/6] blk-mq: unexport APIs for start/stop queues
  2017-07-11 18:20 [PATCH 0/6] blk-mq: introduce congestion control Ming Lei
                   ` (5 preceding siblings ...)
  2017-07-11 18:21 ` [PATCH 5/6] blk-mq: introduce basic congestion control Ming Lei
@ 2017-07-11 18:21 ` Ming Lei
  6 siblings, 0 replies; 43+ messages in thread
From: Ming Lei @ 2017-07-11 18:21 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig
  Cc: Bart Van Assche, Sagi Grimberg, Ming Lei

Now no one uses these APIs anymore, so unexport them.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c         | 80 --------------------------------------------------
 block/blk-mq.h         |  2 ++
 include/linux/blk-mq.h |  8 -----
 3 files changed, 2 insertions(+), 88 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index d994449c154b..e8a3486dc8a2 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1208,12 +1208,6 @@ static void __blk_mq_delay_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async,
 					 msecs_to_jiffies(msecs));
 }
 
-void blk_mq_delay_run_hw_queue(struct blk_mq_hw_ctx *hctx, unsigned long msecs)
-{
-	__blk_mq_delay_run_hw_queue(hctx, true, msecs);
-}
-EXPORT_SYMBOL(blk_mq_delay_run_hw_queue);
-
 void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
 {
 	__blk_mq_delay_run_hw_queue(hctx, async, 0);
@@ -1255,60 +1249,6 @@ bool blk_mq_queue_stopped(struct request_queue *q)
 }
 EXPORT_SYMBOL(blk_mq_queue_stopped);
 
-/*
- * This function is often used for pausing .queue_rq() by driver when
- * there isn't enough resource or some conditions aren't satisfied, and
- * BLK_MQ_RQ_QUEUE_BUSY is usually returned.
- *
- * We do not guarantee that dispatch can be drained or blocked
- * after blk_mq_stop_hw_queue() returns. Please use
- * blk_mq_quiesce_queue() for that requirement.
- */
-void blk_mq_stop_hw_queue(struct blk_mq_hw_ctx *hctx)
-{
-	cancel_delayed_work(&hctx->run_work);
-
-	set_bit(BLK_MQ_S_STOPPED, &hctx->state);
-}
-EXPORT_SYMBOL(blk_mq_stop_hw_queue);
-
-/*
- * This function is often used for pausing .queue_rq() by driver when
- * there isn't enough resource or some conditions aren't satisfied, and
- * BLK_MQ_RQ_QUEUE_BUSY is usually returned.
- *
- * We do not guarantee that dispatch can be drained or blocked
- * after blk_mq_stop_hw_queues() returns. Please use
- * blk_mq_quiesce_queue() for that requirement.
- */
-void blk_mq_stop_hw_queues(struct request_queue *q)
-{
-	struct blk_mq_hw_ctx *hctx;
-	int i;
-
-	queue_for_each_hw_ctx(q, hctx, i)
-		blk_mq_stop_hw_queue(hctx);
-}
-EXPORT_SYMBOL(blk_mq_stop_hw_queues);
-
-void blk_mq_start_hw_queue(struct blk_mq_hw_ctx *hctx)
-{
-	clear_bit(BLK_MQ_S_STOPPED, &hctx->state);
-
-	blk_mq_run_hw_queue(hctx, false);
-}
-EXPORT_SYMBOL(blk_mq_start_hw_queue);
-
-void blk_mq_start_hw_queues(struct request_queue *q)
-{
-	struct blk_mq_hw_ctx *hctx;
-	int i;
-
-	queue_for_each_hw_ctx(q, hctx, i)
-		blk_mq_start_hw_queue(hctx);
-}
-EXPORT_SYMBOL(blk_mq_start_hw_queues);
-
 void blk_mq_start_stopped_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
 {
 	if (!blk_mq_hctx_stopped(hctx))
@@ -1317,7 +1257,6 @@ void blk_mq_start_stopped_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
 	clear_bit(BLK_MQ_S_STOPPED, &hctx->state);
 	blk_mq_run_hw_queue(hctx, async);
 }
-EXPORT_SYMBOL_GPL(blk_mq_start_stopped_hw_queue);
 
 void blk_mq_start_stopped_hw_queues(struct request_queue *q, bool async)
 {
@@ -1327,7 +1266,6 @@ void blk_mq_start_stopped_hw_queues(struct request_queue *q, bool async)
 	queue_for_each_hw_ctx(q, hctx, i)
 		blk_mq_start_stopped_hw_queue(hctx, async);
 }
-EXPORT_SYMBOL(blk_mq_start_stopped_hw_queues);
 
 static void blk_mq_run_work_fn(struct work_struct *work)
 {
@@ -1352,24 +1290,6 @@ static void blk_mq_run_work_fn(struct work_struct *work)
 }
 
 
-void blk_mq_delay_queue(struct blk_mq_hw_ctx *hctx, unsigned long msecs)
-{
-	if (WARN_ON_ONCE(!blk_mq_hw_queue_mapped(hctx)))
-		return;
-
-	/*
-	 * Stop the hw queue, then modify currently delayed work.
-	 * This should prevent us from running the queue prematurely.
-	 * Mark the queue as auto-clearing STOPPED when it runs.
-	 */
-	blk_mq_stop_hw_queue(hctx);
-	set_bit(BLK_MQ_S_START_ON_RUN, &hctx->state);
-	kblockd_mod_delayed_work_on(blk_mq_hctx_next_cpu(hctx),
-					&hctx->run_work,
-					msecs_to_jiffies(msecs));
-}
-EXPORT_SYMBOL(blk_mq_delay_queue);
-
 static inline void __blk_mq_insert_req_list(struct blk_mq_hw_ctx *hctx,
 					    struct request *rq,
 					    bool at_head)
diff --git a/block/blk-mq.h b/block/blk-mq.h
index c4516d2a2d2c..9bc51d43155f 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -25,6 +25,8 @@ struct blk_mq_ctx {
 	struct kobject		kobj;
 } ____cacheline_aligned_in_smp;
 
+void blk_mq_start_stopped_hw_queue(struct blk_mq_hw_ctx *hctx, bool async);
+void blk_mq_start_stopped_hw_queues(struct request_queue *q, bool async);
 void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async);
 void blk_mq_freeze_queue(struct request_queue *q);
 void blk_mq_free_queue(struct request_queue *q);
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 8694fb39cd80..a539e38f5a76 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -243,18 +243,10 @@ void blk_mq_delay_kick_requeue_list(struct request_queue *q, unsigned long msecs
 void blk_mq_complete_request(struct request *rq);
 
 bool blk_mq_queue_stopped(struct request_queue *q);
-void blk_mq_stop_hw_queue(struct blk_mq_hw_ctx *hctx);
-void blk_mq_start_hw_queue(struct blk_mq_hw_ctx *hctx);
-void blk_mq_stop_hw_queues(struct request_queue *q);
-void blk_mq_start_hw_queues(struct request_queue *q);
-void blk_mq_start_stopped_hw_queue(struct blk_mq_hw_ctx *hctx, bool async);
-void blk_mq_start_stopped_hw_queues(struct request_queue *q, bool async);
 void blk_mq_quiesce_queue(struct request_queue *q);
 void blk_mq_unquiesce_queue(struct request_queue *q);
-void blk_mq_delay_run_hw_queue(struct blk_mq_hw_ctx *hctx, unsigned long msecs);
 void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async);
 void blk_mq_run_hw_queues(struct request_queue *q, bool async);
-void blk_mq_delay_queue(struct blk_mq_hw_ctx *hctx, unsigned long msecs);
 void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
 		busy_tag_iter_fn *fn, void *priv);
 void blk_mq_freeze_queue(struct request_queue *q);
-- 
2.9.4

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [PATCH 4/6] blk-mq: use EWMA to estimate congestion threshold
  2017-07-11 18:21 ` [PATCH 4/6] blk-mq: use EWMA to estimate congestion threshold Ming Lei
@ 2017-07-11 18:25   ` Jens Axboe
  2017-07-12  2:30     ` Ming Lei
  2017-07-11 18:39   ` Jens Axboe
  2017-07-11 21:02   ` Bart Van Assche
  2 siblings, 1 reply; 43+ messages in thread
From: Jens Axboe @ 2017-07-11 18:25 UTC (permalink / raw)
  To: Ming Lei, linux-block, Christoph Hellwig; +Cc: Bart Van Assche, Sagi Grimberg

On 07/11/2017 12:21 PM, Ming Lei wrote:
> When .queue_rq() returns BLK_STS_RESOURCE(BUSY), we can
> consider that there is congestion in either low level
> driver or hardware.
> 
> This patch uses EWMA to estimate this congestion threshold,
> then this threshold can be used to detect/avoid congestion.

This whole patch set lacks some sort of reasoning why this
makes sense. I'm assuming you want to reduce unnecessary
restarts of the queue? I would much rather ensure that we only
start when we absolutely have to to begin with, I'm pretty sure
we have a number of cases where that is not so.

What happens with fluid congestion boundaries, with shared tags?

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 4/6] blk-mq: use EWMA to estimate congestion threshold
  2017-07-11 18:21 ` [PATCH 4/6] blk-mq: use EWMA to estimate congestion threshold Ming Lei
  2017-07-11 18:25   ` Jens Axboe
@ 2017-07-11 18:39   ` Jens Axboe
  2017-07-12  3:20     ` Ming Lei
  2017-07-11 21:02   ` Bart Van Assche
  2 siblings, 1 reply; 43+ messages in thread
From: Jens Axboe @ 2017-07-11 18:39 UTC (permalink / raw)
  To: Ming Lei, linux-block, Christoph Hellwig; +Cc: Bart Van Assche, Sagi Grimberg

On 07/11/2017 12:21 PM, Ming Lei wrote:
> diff --git a/block/blk-mq.h b/block/blk-mq.h
> index 60b01c0309bc..c4516d2a2d2c 100644
> --- a/block/blk-mq.h
> +++ b/block/blk-mq.h
> @@ -133,4 +133,13 @@ static inline bool blk_mq_hw_queue_mapped(struct blk_mq_hw_ctx *hctx)
>  	return hctx->nr_ctx && hctx->tags;
>  }
>  
> +/* borrowed from bcache */
> +#define ewma_add(ewma, val, weight, factor)                             \
> +({                                                                      \
> +        (ewma) *= (weight) - 1;                                         \
> +        (ewma) += (val) << factor;                                      \
> +        (ewma) /= (weight);                                             \
> +        (ewma) >> factor;                                               \
> +})

Just put that in blk_mq_update_req_dispatch_busy(), or at least make it
a static function in blk-mq.c above that function. You don't use factor
at all.
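
Something like the below, I'd think (untested sketch; it also drops
the unused factor):

	static void blk_mq_update_req_dispatch_busy(struct blk_mq_hw_ctx *hctx)
	{
		struct sbitmap_queue *sbq = &hctx->tags->bitmap_tags;
		unsigned int depth = sbitmap_weight(&sbq->sb);

		/* open-coded EWMA, weight 8: new = (7 * old + depth) / 8 */
		hctx->avg_busy_threshold =
			(hctx->avg_busy_threshold * 7 + depth) / 8;
	}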

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 1/6] xen-blkfront: avoid to use start/stop queue
  2017-07-11 18:20 ` Ming Lei
  2017-07-11 18:41   ` Konrad Rzeszutek Wilk
@ 2017-07-11 18:41   ` Konrad Rzeszutek Wilk
  2017-07-12  2:52     ` Ming Lei
  2017-07-12  2:52     ` Ming Lei
  2017-07-11 18:41   ` Bart Van Assche
                     ` (3 subsequent siblings)
  5 siblings, 2 replies; 43+ messages in thread
From: Konrad Rzeszutek Wilk @ 2017-07-11 18:41 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Bart Van Assche,
	Sagi Grimberg, Roger Pau Monné,
	Boris Ostrovsky, Juergen Gross, xen-devel

On Wed, Jul 12, 2017 at 02:20:58AM +0800, Ming Lei wrote:
> These interfaces will be removed soon, so use quiesce and
> unquiesce instead, which should be safer.

'should be'? That does not sound encouraging?

> 
> The only remaining usage will be removed in the following
> congestion control patches.
> 
> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> Cc: "Roger Pau Monn�" <roger.pau@citrix.com>
> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
> Cc: Juergen Gross <jgross@suse.com>
> Cc: xen-devel@lists.xenproject.org
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  drivers/block/xen-blkfront.c | 22 ++++++++--------------
>  1 file changed, 8 insertions(+), 14 deletions(-)
> 
> diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
> index c852ed3c01d5..1578befda635 100644
> --- a/drivers/block/xen-blkfront.c
> +++ b/drivers/block/xen-blkfront.c
> @@ -1187,7 +1187,7 @@ static void xlvbd_release_gendisk(struct blkfront_info *info)
>  		return;
>  
>  	/* No more blkif_request(). */
> -	blk_mq_stop_hw_queues(info->rq);
> +	blk_mq_quiesce_queue(info->rq);
>  
>  	for (i = 0; i < info->nr_rings; i++) {
>  		struct blkfront_ring_info *rinfo = &info->rinfo[i];
> @@ -1216,8 +1216,10 @@ static void xlvbd_release_gendisk(struct blkfront_info *info)
>  /* Already hold rinfo->ring_lock. */
>  static inline void kick_pending_request_queues_locked(struct blkfront_ring_info *rinfo)
>  {
> -	if (!RING_FULL(&rinfo->ring))
> +	if (!RING_FULL(&rinfo->ring)) {
>  		blk_mq_start_stopped_hw_queues(rinfo->dev_info->rq, true);
> +		blk_mq_kick_requeue_list(rinfo->dev_info->rq);
> +	}
>  }
>  
>  static void kick_pending_request_queues(struct blkfront_ring_info *rinfo)
> @@ -1225,7 +1227,8 @@ static void kick_pending_request_queues(struct blkfront_ring_info *rinfo)
>  	unsigned long flags;
>  
>  	spin_lock_irqsave(&rinfo->ring_lock, flags);
> -	kick_pending_request_queues_locked(rinfo);
> +	if (!RING_FULL(&rinfo->ring))
> +		blk_mq_run_hw_queues(rinfo->dev_info->rq, true);
>  	spin_unlock_irqrestore(&rinfo->ring_lock, flags);
>  }
>  
> @@ -1346,7 +1349,7 @@ static void blkif_free(struct blkfront_info *info, int suspend)
>  		BLKIF_STATE_SUSPENDED : BLKIF_STATE_DISCONNECTED;
>  	/* No more blkif_request(). */
>  	if (info->rq)
> -		blk_mq_stop_hw_queues(info->rq);
> +		blk_mq_quiesce_queue(info->rq);
>  
>  	for (i = 0; i < info->nr_rings; i++)
>  		blkif_free_ring(&info->rinfo[i]);
> @@ -2018,22 +2021,13 @@ static int blkif_recover(struct blkfront_info *info)
>  	/* Now safe for us to use the shared ring */
>  	info->connected = BLKIF_STATE_CONNECTED;
>  
> -	for (r_index = 0; r_index < info->nr_rings; r_index++) {
> -		struct blkfront_ring_info *rinfo;
> -
> -		rinfo = &info->rinfo[r_index];
> -		/* Kick any other new requests queued since we resumed */
> -		kick_pending_request_queues(rinfo);
> -	}
> -
>  	list_for_each_entry_safe(req, n, &info->requests, queuelist) {
>  		/* Requeue pending requests (flush or discard) */
>  		list_del_init(&req->queuelist);
>  		BUG_ON(req->nr_phys_segments > segs);
>  		blk_mq_requeue_request(req, false);
>  	}
> -	blk_mq_start_stopped_hw_queues(info->rq, true);
> -	blk_mq_kick_requeue_list(info->rq);
> +	blk_mq_unquiesce_queue(info->rq);
>  
>  	while ((bio = bio_list_pop(&info->bio_list)) != NULL) {
>  		/* Traverse the list of pending bios and re-queue them */
> -- 
> 2.9.4
> 

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 1/6] xen-blkfront: avoid to use start/stop queue
  2017-07-11 18:20 ` Ming Lei
@ 2017-07-11 18:41   ` Konrad Rzeszutek Wilk
  2017-07-11 18:41   ` Konrad Rzeszutek Wilk
                     ` (4 subsequent siblings)
  5 siblings, 0 replies; 43+ messages in thread
From: Konrad Rzeszutek Wilk @ 2017-07-11 18:41 UTC (permalink / raw)
  To: Ming Lei
  Cc: Juergen Gross, linux-block, Sagi Grimberg, Christoph Hellwig,
	Jens Axboe, xen-devel, Bart Van Assche, Boris Ostrovsky,
	Roger Pau Monné

On Wed, Jul 12, 2017 at 02:20:58AM +0800, Ming Lei wrote:
> These interfaces will be removed soon, so use quiesce and
> unquiesce instead, which should be safer.

'should be'? That does not sound encouraging?

> 
> The only remaining usage will be removed in the following
> congestion control patches.
> 
> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> Cc: "Roger Pau Monné" <roger.pau@citrix.com>
> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
> Cc: Juergen Gross <jgross@suse.com>
> Cc: xen-devel@lists.xenproject.org
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  drivers/block/xen-blkfront.c | 22 ++++++++--------------
>  1 file changed, 8 insertions(+), 14 deletions(-)
> 
> diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
> index c852ed3c01d5..1578befda635 100644
> --- a/drivers/block/xen-blkfront.c
> +++ b/drivers/block/xen-blkfront.c
> @@ -1187,7 +1187,7 @@ static void xlvbd_release_gendisk(struct blkfront_info *info)
>  		return;
>  
>  	/* No more blkif_request(). */
> -	blk_mq_stop_hw_queues(info->rq);
> +	blk_mq_quiesce_queue(info->rq);
>  
>  	for (i = 0; i < info->nr_rings; i++) {
>  		struct blkfront_ring_info *rinfo = &info->rinfo[i];
> @@ -1216,8 +1216,10 @@ static void xlvbd_release_gendisk(struct blkfront_info *info)
>  /* Already hold rinfo->ring_lock. */
>  static inline void kick_pending_request_queues_locked(struct blkfront_ring_info *rinfo)
>  {
> -	if (!RING_FULL(&rinfo->ring))
> +	if (!RING_FULL(&rinfo->ring)) {
>  		blk_mq_start_stopped_hw_queues(rinfo->dev_info->rq, true);
> +		blk_mq_kick_requeue_list(rinfo->dev_info->rq);
> +	}
>  }
>  
>  static void kick_pending_request_queues(struct blkfront_ring_info *rinfo)
> @@ -1225,7 +1227,8 @@ static void kick_pending_request_queues(struct blkfront_ring_info *rinfo)
>  	unsigned long flags;
>  
>  	spin_lock_irqsave(&rinfo->ring_lock, flags);
> -	kick_pending_request_queues_locked(rinfo);
> +	if (!RING_FULL(&rinfo->ring))
> +		blk_mq_run_hw_queues(rinfo->dev_info->rq, true);
>  	spin_unlock_irqrestore(&rinfo->ring_lock, flags);
>  }
>  
> @@ -1346,7 +1349,7 @@ static void blkif_free(struct blkfront_info *info, int suspend)
>  		BLKIF_STATE_SUSPENDED : BLKIF_STATE_DISCONNECTED;
>  	/* No more blkif_request(). */
>  	if (info->rq)
> -		blk_mq_stop_hw_queues(info->rq);
> +		blk_mq_quiesce_queue(info->rq);
>  
>  	for (i = 0; i < info->nr_rings; i++)
>  		blkif_free_ring(&info->rinfo[i]);
> @@ -2018,22 +2021,13 @@ static int blkif_recover(struct blkfront_info *info)
>  	/* Now safe for us to use the shared ring */
>  	info->connected = BLKIF_STATE_CONNECTED;
>  
> -	for (r_index = 0; r_index < info->nr_rings; r_index++) {
> -		struct blkfront_ring_info *rinfo;
> -
> -		rinfo = &info->rinfo[r_index];
> -		/* Kick any other new requests queued since we resumed */
> -		kick_pending_request_queues(rinfo);
> -	}
> -
>  	list_for_each_entry_safe(req, n, &info->requests, queuelist) {
>  		/* Requeue pending requests (flush or discard) */
>  		list_del_init(&req->queuelist);
>  		BUG_ON(req->nr_phys_segments > segs);
>  		blk_mq_requeue_request(req, false);
>  	}
> -	blk_mq_start_stopped_hw_queues(info->rq, true);
> -	blk_mq_kick_requeue_list(info->rq);
> +	blk_mq_unquiesce_queue(info->rq);
>  
>  	while ((bio = bio_list_pop(&info->bio_list)) != NULL) {
>  		/* Traverse the list of pending bios and re-queue them */
> -- 
> 2.9.4
> 

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 1/6] xen-blkfront: avoid to use start/stop queue
  2017-07-11 18:20 ` Ming Lei
  2017-07-11 18:41   ` Konrad Rzeszutek Wilk
  2017-07-11 18:41   ` Konrad Rzeszutek Wilk
@ 2017-07-11 18:41   ` Bart Van Assche
  2017-07-12  2:59     ` Ming Lei
                       ` (3 more replies)
  2017-07-11 18:41   ` Bart Van Assche
                     ` (2 subsequent siblings)
  5 siblings, 4 replies; 43+ messages in thread
From: Bart Van Assche @ 2017-07-11 18:41 UTC (permalink / raw)
  To: hch, linux-block, axboe, ming.lei
  Cc: boris.ostrovsky, roger.pau, xen-devel, jgross, konrad.wilk, sagi

On Wed, 2017-07-12 at 02:20 +0800, Ming Lei wrote:
> These interfaces will be removed soon, so use quiesce and
> unquiesce instead, which should be safer.
> 
> The only remaining usage will be removed in the following
> congestion control patches.

Hello Ming,

The title of this patch is misleading since this patch does not touch the calls
related to queue congestion (blk_mq_stop_hw_queue() and
blk_mq_start_stopped_hw_queues()). I assume that you meant that this patch avoids
that the xen-blkfront driver uses blk_mq_(start|stop)_hw_queues() (with queues in
plural form)? Can you please reflect that in the subject of this and related
patches?

Additionally, it's probably a good idea that this is not just an interface change
but that this kind of patches fix a (hard to trigger?) race condition.

>  static inline void kick_pending_request_queues_locked(struct blkfront_ring_info *rinfo)
>  {
> -	if (!RING_FULL(&rinfo->ring))
> +	if (!RING_FULL(&rinfo->ring)) {
>  		blk_mq_start_stopped_hw_queues(rinfo->dev_info->rq, true);
> +		blk_mq_kick_requeue_list(rinfo->dev_info->rq);
> +	}
>  }
> 
>  static void kick_pending_request_queues(struct blkfront_ring_info *rinfo)
> @@ -1225,7 +1227,8 @@ static void kick_pending_request_queues(struct blkfront_ring_info *rinfo)
>  	unsigned long flags;
> 
>  	spin_lock_irqsave(&rinfo->ring_lock, flags);
> -	kick_pending_request_queues_locked(rinfo);
> +	if (!RING_FULL(&rinfo->ring))
> +		blk_mq_run_hw_queues(rinfo->dev_info->rq, true);
>  	spin_unlock_irqrestore(&rinfo->ring_lock, flags);
>  }

Why do some kick_pending_request_queues_locked() kick the requeue list and why
has the above kick_pending_request_queues_locked() call been converted into a
blk_mq_run_hw_queues() call and thereby ignores the requeue list?

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 1/6] xen-blkfront: avoid to use start/stop queue
  2017-07-11 18:20 ` Ming Lei
                     ` (2 preceding siblings ...)
  2017-07-11 18:41   ` Bart Van Assche
@ 2017-07-11 18:41   ` Bart Van Assche
  2017-07-11 21:24   ` Roger Pau Monné
  2017-07-11 21:24   ` Roger Pau Monné
  5 siblings, 0 replies; 43+ messages in thread
From: Bart Van Assche @ 2017-07-11 18:41 UTC (permalink / raw)
  To: hch, linux-block, axboe, ming.lei
  Cc: jgross, sagi, xen-devel, boris.ostrovsky, roger.pau

On Wed, 2017-07-12 at 02:20 +0800, Ming Lei wrote:
> These interfaces will be removed soon, so use quiesce and
> unquiesce instead, which should be safer.
> 
> The only remaining usage will be removed in the following
> congestion control patches.

Hello Ming,

The title of this patch is misleading since this patch does not touch the calls
related to queue congestion (blk_mq_stop_hw_queue() and
blk_mq_start_stopped_hw_queues()). I assume that you meant that this patch avoids
that the xen-blkfront driver uses blk_mq_(start|stop)_hw_queues() (with queues in
plural form)? Can you please reflect that in the subject of this and related
patches?

Additionally, it's probably a good idea that this is not just an interface change
but that this kind of patches fix a (hard to trigger?) race condition.

>  static inline void kick_pending_request_queues_locked(struct blkfront_ring_info *rinfo)
>  {
> -	if (!RING_FULL(&rinfo->ring))
> +	if (!RING_FULL(&rinfo->ring)) {
>  		blk_mq_start_stopped_hw_queues(rinfo->dev_info->rq, true);
> +		blk_mq_kick_requeue_list(rinfo->dev_info->rq);
> +	}
>  }
>  
>  static void kick_pending_request_queues(struct blkfront_ring_info *rinfo)
> @@ -1225,7 +1227,8 @@ static void kick_pending_request_queues(struct blkfront_ring_info *rinfo)
>  	unsigned long flags;
>  
>  	spin_lock_irqsave(&rinfo->ring_lock, flags);
> -	kick_pending_request_queues_locked(rinfo);
> +	if (!RING_FULL(&rinfo->ring))
> +		blk_mq_run_hw_queues(rinfo->dev_info->rq, true);
>  	spin_unlock_irqrestore(&rinfo->ring_lock, flags);
>  }

Why do some kick_pending_request_queues_locked() kick the requeue list and why
has the above kick_pending_request_queues_locked() call been converted into a
blk_mq_run_hw_queues() call and thereby ignores the requeue list?

Thanks,

Bart.
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/6] SCSI: use blk_mq_run_hw_queues() in scsi_kick_queue()
  2017-07-11 18:20 ` [PATCH 2/6] SCSI: use blk_mq_run_hw_queues() in scsi_kick_queue() Ming Lei
@ 2017-07-11 19:57     ` Bart Van Assche
  0 siblings, 0 replies; 43+ messages in thread
From: Bart Van Assche @ 2017-07-11 19:57 UTC (permalink / raw)
  To: hch, linux-block, axboe, ming.lei; +Cc: jejb, linux-scsi, martin.petersen, sagi

On Wed, 2017-07-12 at 02:20 +0800, Ming Lei wrote:
> Now SCSI won't stop the queue, and it isn't necessary to use
> blk_mq_start_hw_queues(), so switch to blk_mq_run_hw_queues()
> instead.
> 
> Cc: "James E.J. Bottomley" <jejb@linux.vnet.ibm.com>
> Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
> Cc: linux-scsi@vger.kernel.org
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  drivers/scsi/scsi_lib.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> index f6097b89d5d3..91d890356b78 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -333,7 +333,7 @@ void scsi_device_unbusy(struct scsi_device *sdev)
>  static void scsi_kick_queue(struct request_queue *q)
>  {
>  	if (q->mq_ops)
> -		blk_mq_start_hw_queues(q);
> +		blk_mq_run_hw_queues(q, false);
>  	else
>  		blk_run_queue(q);
>  }

Hello Ming,

Now that we have separate flags to represent the "stopped" and "quiesced"
states, wouldn't it be better not to modify scsi_kick_queue() but instead to
stop a SCSI hardware queue again if scsi_queue_rq() returns BLK_STS_RESOURCE?
See also commits 36e3cf273977 ("scsi: Avoid that SCSI queues get stuck") and
commit 52d7f1b5c2f3 ("blk-mq: Avoid that requeueing starts stopped queues").

Bart.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/6] SCSI: use blk_mq_run_hw_queues() in scsi_kick_queue()
@ 2017-07-11 19:57     ` Bart Van Assche
  0 siblings, 0 replies; 43+ messages in thread
From: Bart Van Assche @ 2017-07-11 19:57 UTC (permalink / raw)
  To: hch, linux-block, axboe, ming.lei; +Cc: jejb, linux-scsi, martin.petersen, sagi

On Wed, 2017-07-12 at 02:20 +0800, Ming Lei wrote:
> Now SCSI won't stop the queue, and it isn't necessary to use
> blk_mq_start_hw_queues(), so switch to blk_mq_run_hw_queues()
> instead.
> 
> Cc: "James E.J. Bottomley" <jejb@linux.vnet.ibm.com>
> Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
> Cc: linux-scsi@vger.kernel.org
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  drivers/scsi/scsi_lib.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> index f6097b89d5d3..91d890356b78 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -333,7 +333,7 @@ void scsi_device_unbusy(struct scsi_device *sdev)
>  static void scsi_kick_queue(struct request_queue *q)
>  {
>  	if (q->mq_ops)
> -		blk_mq_start_hw_queues(q);
> +		blk_mq_run_hw_queues(q, false);
>  	else
>  		blk_run_queue(q);
>  }

Hello Ming,

Now that we have separate flags to represent the "stopped" and "quiesced"
states, wouldn't it be better not to modify scsi_kick_queue() but instead to
stop a SCSI hardware queue again if scsi_queue_rq() returns BLK_STS_RESOURCE?
See also commits 36e3cf273977 ("scsi: Avoid that SCSI queues get stuck") and
commit 52d7f1b5c2f3 ("blk-mq: Avoid that requeueing starts stopped queues").

Bart.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 3/6] blk-mq: send the request to dispatch list if direct issue returns busy
  2017-07-11 18:21 ` [PATCH 3/6] blk-mq: send the request to dispatch list if direct issue returns busy Ming Lei
@ 2017-07-11 20:18   ` Bart Van Assche
  2017-07-12  3:45     ` Ming Lei
  0 siblings, 1 reply; 43+ messages in thread
From: Bart Van Assche @ 2017-07-11 20:18 UTC (permalink / raw)
  To: hch, linux-block, axboe, ming.lei; +Cc: sagi

On Wed, 2017-07-12 at 02:21 +0800, Ming Lei wrote:
> Before the mq IO scheduler was introduced, we always sent the request
> to the dispatch list if .queue_rq() returned busy. After the mq IO
> scheduler was introduced, we only do this on direct issue when a
> scheduler is used. Actually we can do that when a scheduler
> isn't used too.
> 
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  block/blk-mq.c | 18 +++++++++++++++---
>  1 file changed, 15 insertions(+), 3 deletions(-)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 041f7b7fa0d6..6e0fc80aa151 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1463,6 +1463,16 @@ static blk_qc_t request_to_qc_t(struct blk_mq_hw_ctx *hctx, struct request *rq)
>  	return blk_tag_to_qc_t(rq->internal_tag, hctx->queue_num, true);
>  }
> 
> +static void blk_mq_direct_dispatch(struct blk_mq_hw_ctx *hctx,
> +				   struct request *rq)
> +{
> +	spin_lock(&hctx->lock);
> +	list_add(&rq->queuelist, &hctx->dispatch);
> +	spin_unlock(&hctx->lock);
> +
> +	blk_mq_run_hw_queue(hctx, false);
> +}
> +
>  static void __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
>  					struct request *rq,
>  					blk_qc_t *cookie, bool may_sleep)
> @@ -1499,15 +1509,17 @@ static void __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
>  	switch (ret) {
>  	case BLK_STS_OK:
>  		*cookie = new_cookie;
> -		return;
> +		break;
>  	case BLK_STS_RESOURCE:
>  		__blk_mq_requeue_request(rq);
> -		goto insert;
> +		blk_mq_direct_dispatch(hctx, rq);
> +		break;
>  	default:
>  		*cookie = BLK_QC_T_NONE;
>  		blk_mq_end_request(rq, ret);
> -		return;
> +		break;
>  	}
> +	return;
> 
>  insert:
>  	blk_mq_sched_insert_request(rq, false, run_queue, false, may_sleep);

Hello Ming,

This patch changes the behavior of blk_mq_try_issue_directly() if a scheduler
has been configured and .queue_rq() returns BLK_STS_RESOURCE, namely by skipping
the e->type->ops.mq.insert_requests() call. Sorry but I don't think this is what
we want.
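
That is, the two BLK_STS_RESOURCE paths look roughly like this
(fragments taken from the hunks above):

	/* before: back through the scheduler insert path */
	blk_mq_sched_insert_request(rq, false, run_queue, false, may_sleep);

	/* after this patch: always straight onto the hctx dispatch list */
	spin_lock(&hctx->lock);
	list_add(&rq->queuelist, &hctx->dispatch);
	spin_unlock(&hctx->lock);
	blk_mq_run_hw_queue(hctx, false);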

Bart.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 4/6] blk-mq: use EWMA to estimate congestion threshold
  2017-07-11 18:21 ` [PATCH 4/6] blk-mq: use EWMA to estimate congestion threshold Ming Lei
  2017-07-11 18:25   ` Jens Axboe
  2017-07-11 18:39   ` Jens Axboe
@ 2017-07-11 21:02   ` Bart Van Assche
  2017-07-12  3:43     ` Ming Lei
  2 siblings, 1 reply; 43+ messages in thread
From: Bart Van Assche @ 2017-07-11 21:02 UTC (permalink / raw)
  To: hch, linux-block, axboe, ming.lei; +Cc: sagi

On Wed, 2017-07-12 at 02:21 +0800, Ming Lei wrote:
> When .queue_rq() returns BLK_STS_RESOURCE(BUSY), we can
> consider that there is congestion in either low level
> driver or hardware.
> 
> This patch uses EWMA to estimate this congestion threshold,
> then this threshold can be used to detect/avoid congestion.

Hello Ming,

Does EWMA stand for "exponentially weighted moving average" in the context of
this patch? If so, please mention this.

> +static void blk_mq_update_req_dispatch_busy(struct blk_mq_hw_ctx *hctx)
> +{
> +	struct sbitmap_queue *sbq;
> +	unsigned depth;
> +
> +	sbq = &hctx->tags->bitmap_tags;
> +	depth = sbitmap_weight(&sbq->sb);
> +
> +	/* use EWMA to estimate a threshold for detecting congestion */
> +	ewma_add(hctx->avg_busy_threshold, depth, 8, 0);
> +}

This function has been named after the context it is called from. Wouldn't it
be more clear to change the name of this function into something that refers to
what this function does, e.g. blk_mq_update_avg_busy_threshold()?

Additionally, I think that the behavior of e.g. the SCSI and dm-mpath drivers
is too complicated for this approach to be effective. If you want to proceed
with this approach I think it should be possible for block drivers to opt out
of the mechanism introduced in the next patch.
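
For example, something along these lines (the flag name below is made
up, just to illustrate the opt-out idea):

	static void blk_mq_update_req_dispatch_busy(struct blk_mq_hw_ctx *hctx)
	{
		/* hypothetical per-hctx opt-out flag, name invented for illustration */
		if (hctx->flags & BLK_MQ_F_NO_CONGESTION_CTRL)
			return;

		/* ... EWMA / threshold update as in this patch ... */
	}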

> diff --git a/block/blk-mq.h b/block/blk-mq.h
> index 60b01c0309bc..c4516d2a2d2c 100644
> --- a/block/blk-mq.h
> +++ b/block/blk-mq.h
> @@ -133,4 +133,13 @@ static inline bool blk_mq_hw_queue_mapped(struct blk_mq_hw_ctx *hctx)
>  	return hctx->nr_ctx && hctx->tags;
>  }
> 
> +/* borrowed from bcache */
> +#define ewma_add(ewma, val, weight, factor)                             \
> +({                                                                      \
> +        (ewma) *= (weight) - 1;                                         \
> +        (ewma) += (val) << factor;                                      \
> +        (ewma) /= (weight);                                             \
> +        (ewma) >> factor;                                               \
> +})

Sorry but this does not match how others define an exponentially weighted moving
average. As far as I know the ewma values should be updated as follows:

   new_ewma = w * val + (1 - w) * current_ewma

where 0 < w <= 1 is a rational number (typically 0.05 <= w <= 0.3). See also
https://en.wikipedia.org/wiki/EWMA_chart.
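
In code, the conventional update rule is simply (standalone
illustration):

	/* conventional EWMA: new = w * val + (1 - w) * current, 0 < w <= 1 */
	static double ewma_update(double current_ewma, double val, double w)
	{
		return w * val + (1.0 - w) * current_ewma;
	}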

Bart.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 1/6] xen-blkfront: avoid to use start/stop queue
  2017-07-11 18:20 ` Ming Lei
                     ` (3 preceding siblings ...)
  2017-07-11 18:41   ` Bart Van Assche
@ 2017-07-11 21:24   ` Roger Pau Monné
  2017-07-12  3:12     ` Ming Lei
  2017-07-12  3:12     ` Ming Lei
  2017-07-11 21:24   ` Roger Pau Monné
  5 siblings, 2 replies; 43+ messages in thread
From: Roger Pau Monné @ 2017-07-11 21:24 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Bart Van Assche,
	Sagi Grimberg, Konrad Rzeszutek Wilk, Boris Ostrovsky,
	Juergen Gross, xen-devel

On Wed, Jul 12, 2017 at 02:20:58AM +0800, Ming Lei wrote:
> This interfaces will be removed soon, so use quiesce and
> unquiesce instead, which should be more safe.
> 
> The only one usage will be removed in the following
> congestion control patches.

Would it be better to simply fix blk_mq_{start/stop}_stopped_hw_queues
rather than introducing a new interface?

Roger.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 1/6] xen-blkfront: avoid to use start/stop queue
  2017-07-11 18:20 ` Ming Lei
                     ` (4 preceding siblings ...)
  2017-07-11 21:24   ` Roger Pau Monné
@ 2017-07-11 21:24   ` Roger Pau Monné
  5 siblings, 0 replies; 43+ messages in thread
From: Roger Pau Monné @ 2017-07-11 21:24 UTC (permalink / raw)
  To: Ming Lei
  Cc: Juergen Gross, linux-block, Sagi Grimberg, Jens Axboe, xen-devel,
	Christoph Hellwig, Bart Van Assche, Boris Ostrovsky

On Wed, Jul 12, 2017 at 02:20:58AM +0800, Ming Lei wrote:
> This interfaces will be removed soon, so use quiesce and
> unquiesce instead, which should be more safe.
> 
> The only one usage will be removed in the following
> congestion control patches.

Would it be better to simply fix blk_mq_{start/stop}_stopped_hw_queues
rather than introducing a new interface?

Roger.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 4/6] blk-mq: use EWMA to estimate congestion threshold
  2017-07-11 18:25   ` Jens Axboe
@ 2017-07-12  2:30     ` Ming Lei
  2017-07-12 15:39       ` Bart Van Assche
  0 siblings, 1 reply; 43+ messages in thread
From: Ming Lei @ 2017-07-12  2:30 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, Christoph Hellwig, Bart Van Assche, Sagi Grimberg

On Tue, Jul 11, 2017 at 12:25:16PM -0600, Jens Axboe wrote:
> On 07/11/2017 12:21 PM, Ming Lei wrote:
> > When .queue_rq() returns BLK_STS_RESOURCE(BUSY), we can
> > consider that there is congestion in either low level
> > driver or hardware.
> > 
> > This patch uses EWMA to estimate this congestion threshold,
> > then this threshold can be used to detect/avoid congestion.
> 
> This whole patch set lacks some sort of reasoning why this
> makes sense. I'm assuming you want to reduce unnecessary
> restarts of the queue?

When a low-level driver returns BLK_STS_RESOURCE, it means
either the driver or the hardware can't handle the incoming requests.
The issue is actually a congestion control problem, IMO.

There are a few ways drivers deal with the issue today:

	1) both virtio-blk and xen-blkfront just stop the queue in this
	case, then restart the queue when one request is completed.

	2) virtio-scsi chooses not to stop the queue

To reach good IO performance, we often do the following two
things:
	- try to keep the queue pipeline as full as possible by increasing queue depth
	- run several tasks doing IO at the same time

So approach 1) above can't be efficient, because when one request
completes and the queue is restarted, there are lots of requests in the queue,
and dispatching them will just cause lots of BLK_STS_RESOURCE returns
too. The dispatch isn't cheap at all: we need to acquire the sw queue
lock/scheduler lock/hw queue lock/low level queue lock in the whole
path, so we should try to avoid dispatching to a busy queue.

Approach 2) simply doesn't stop the queue, so unnecessary
dispatching to a busy queue just wastes CPU, and it has the same issue
as 1) too. As you can see in the commit log of patch 5, a big
improvement is observed with this simple implementation:

	sequential read test(libaio, bs:4k, direct io, queue depth:64, 8 jobs)
	on virtio-scsi shows that:
	        - CPU utilization decreases ~20%
	        - IOPS increases by ~10%

EWMA is used to estimate an average queue depth at which
the queue tends to become busy. In this patchset, that average queue
depth at which the queue becomes busy is called the congestion threshold.
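
To make the idea concrete, here is a rough user-space sketch of that
estimation (names are made up and this is not the actual blk-mq code; in
the patches the state lives in the hctx and the busy samples come from
.queue_rq() returning BLK_STS_RESOURCE):

	/*
	 * Standalone sketch only: track an EWMA of the in-flight depth
	 * observed whenever the driver reports BUSY, then use it as a hint
	 * for when further dispatching is likely to fail again.
	 */
	#include <stdbool.h>
	#include <stdio.h>

	struct hw_queue {
		unsigned int inflight;       /* requests currently owned by the driver */
		unsigned int busy_threshold; /* EWMA of 'inflight' sampled at BUSY time */
	};

	/* new = (old * 7 + sample) / 8, i.e. EWMA with weight 8, factor 0 */
	static void ewma_update(unsigned int *avg, unsigned int sample)
	{
		*avg = (*avg * 7 + sample) / 8;
	}

	/* called when the driver would return BUSY: remember how deep we were */
	static void note_busy(struct hw_queue *q)
	{
		ewma_update(&q->busy_threshold, q->inflight);
	}

	/* hint: is dispatching more requests right now likely to fail again? */
	static bool likely_congested(const struct hw_queue *q)
	{
		return q->busy_threshold && q->inflight >= q->busy_threshold;
	}

	int main(void)
	{
		struct hw_queue q = { 0, 0 };
		unsigned int i;

		/* pretend the device keeps reporting BUSY around a depth of ~32 */
		for (i = 0; i < 32; i++) {
			q.inflight = 30 + (i % 5);
			note_busy(&q);
		}
		q.inflight = 40;
		printf("estimated threshold %u, congested at depth 40: %d\n",
		       q.busy_threshold, likely_congested(&q));
		return 0;
	}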

I chose EWMA because it is a usual/common way to compute a
weighted average value, and it is already used in several parts of the
kernel (wifi rate control, sequential I/O estimation in bcache,
networking, ...). Or is there a better way to do it? I am happy
to take it; in the future we may even support more than one approach
to address the issue.

Another motivation is to unexport the APIs for starting/stopping queues.

> I would much rather ensure that we only
> start when we absolutely have to to begin with, I'm pretty sure
> we have a number of cases where that is not so.

Could you share these cases with us? I believe we can integrate different
congestion control approaches to address different requirements.

> 
> What happens with fluid congestion boundaries, with shared tags?

The approach in this patch should work, but the threshold may not
be accurate in this way, one simple method is to use the average
tag weight in EWMA, like this:

	sbitmap_weight() / hctx->tags->active_queues

-- 
Ming

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 1/6] xen-blkfront: avoid to use start/stop queue
  2017-07-11 18:41   ` Konrad Rzeszutek Wilk
  2017-07-12  2:52     ` Ming Lei
@ 2017-07-12  2:52     ` Ming Lei
  1 sibling, 0 replies; 43+ messages in thread
From: Ming Lei @ 2017-07-12  2:52 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Bart Van Assche,
	Sagi Grimberg, Roger Pau Monné,
	Boris Ostrovsky, Juergen Gross, xen-devel

On Tue, Jul 11, 2017 at 02:41:05PM -0400, Konrad Rzeszutek Wilk wrote:
> On Wed, Jul 12, 2017 at 02:20:58AM +0800, Ming Lei wrote:
> > This interfaces will be removed soon, so use quiesce and
> > unquiesce instead, which should be more safe.
> 
> 'should be'? That does not sound encouraging?

Sorry for the mistake, I will fix it in V2. Quiesce is definitely
the preferred interface; also, stopping the queue may not do what you
want, as you can see from the comment of blk_mq_stop_hw_queues,
and it has caused much trouble already.

And I'd appreciate it if you guys could review this patch itself.

BTW, this patch should have been included in Sagi Grimberg's
patchset of "[PATCH v3 0/8] correct quiescing in several block drivers":

        http://marc.info/?t=149927415900006&r=1&w=2

Thanks,
Ming

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 1/6] xen-blkfront: avoid to use start/stop queue
  2017-07-11 18:41   ` Konrad Rzeszutek Wilk
@ 2017-07-12  2:52     ` Ming Lei
  2017-07-12  2:52     ` Ming Lei
  1 sibling, 0 replies; 43+ messages in thread
From: Ming Lei @ 2017-07-12  2:52 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Juergen Gross, linux-block, Sagi Grimberg, Christoph Hellwig,
	Jens Axboe, xen-devel, Bart Van Assche, Boris Ostrovsky,
	Roger Pau Monné

On Tue, Jul 11, 2017 at 02:41:05PM -0400, Konrad Rzeszutek Wilk wrote:
> On Wed, Jul 12, 2017 at 02:20:58AM +0800, Ming Lei wrote:
> > This interfaces will be removed soon, so use quiesce and
> > unquiesce instead, which should be more safe.
> 
> 'should be'? That does not sound encouraging?

Sorry for the mistake, I will fix it in V2. Quiesce is definitely
the preferred interface; also, stopping the queue may not do what you
want, as you can see from the comment of blk_mq_stop_hw_queues,
and it has caused much trouble already.

And I'd appreciate it if you guys could review this patch itself.

BTW, this patch should have been included in Sagi Grimberg's
patchset of "[PATCH v3 0/8] correct quiescing in several block drivers":

        http://marc.info/?t=149927415900006&r=1&w=2

Thanks,
Ming

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 1/6] xen-blkfront: avoid to use start/stop queue
  2017-07-11 18:41   ` Bart Van Assche
  2017-07-12  2:59     ` Ming Lei
@ 2017-07-12  2:59     ` Ming Lei
  2017-07-12  3:05     ` Ming Lei
  2017-07-12  3:05     ` Ming Lei
  3 siblings, 0 replies; 43+ messages in thread
From: Ming Lei @ 2017-07-12  2:59 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: hch, linux-block, axboe, boris.ostrovsky, roger.pau, xen-devel,
	jgross, konrad.wilk, sagi

On Tue, Jul 11, 2017 at 06:41:29PM +0000, Bart Van Assche wrote:
> On Wed, 2017-07-12 at 02:20 +0800, Ming Lei wrote:
> > This interfaces will be removed soon, so use quiesce and
> > unquiesce instead, which should be more safe.
> > 
> > The only one usage will be removed in the following
> > congestion control patches.
> 
> Hello Ming,
> 
> The title of this patch is misleading since this patch does not touch the calls
> related to queue congestion (blk_mq_stop_hw_queue() and
> blk_mq_start_stopped_hw_queues()). I assume that you meant that this patch avoids
> that the xen-blkfront driver uses blk_mq_(start|stop)_hw_queues() (with queues in
> plural form)? Can you please reflect that in the subject of this and related
> patches?

OK, will do it in V2.

> 
> Additionally, it's probably a good idea that this is not just an interface change
> but that this kind of patches fix a (hard to trigger?) race condition.
> 
> >  static inline void kick_pending_request_queues_locked(struct blkfront_ring_info *rinfo)
> >  {
> > -	if (!RING_FULL(&rinfo->ring))
> > +	if (!RING_FULL(&rinfo->ring)) {
> >  		blk_mq_start_stopped_hw_queues(rinfo->dev_info->rq, true);
> > +		blk_mq_kick_requeue_list(rinfo->dev_info->rq);
> > +	}
> >  }
> >  
> >  static void kick_pending_request_queues(struct blkfront_ring_info *rinfo)
> > @@ -1225,7 +1227,8 @@ static void kick_pending_request_queues(struct blkfront_ring_info *rinfo)
> >  	unsigned long flags;
> >  
> >  	spin_lock_irqsave(&rinfo->ring_lock, flags);
> > -	kick_pending_request_queues_locked(rinfo);
> > +	if (!RING_FULL(&rinfo->ring))
> > +		blk_mq_run_hw_queues(rinfo->dev_info->rq, true);
> >  	spin_unlock_irqrestore(&rinfo->ring_lock, flags);
> >  }
> 
> Why do some kick_pending_request_queues_locked() kick the requeue list and why
> has the above kick_pending_request_queues_locked() call been converted into a
> blk_mq_run_hw_queues() call and thereby ignores the requeue list?

kick_pending_request_queues_locked() is used in the request completion path,
which belongs to congestion control, so this patch doesn't touch
kick_pending_request_queues_locked(); it will be switched to
the generic congestion control in patch 5.

For kick_pending_request_queues(), this patch replaces
blk_mq_start_stopped_hw_queues() with blk_mq_run_hw_queues()
because running the queue is often the real purpose of this function,
especially in the non-congestion-control path.


Thanks,
Ming

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 1/6] xen-blkfront: avoid to use start/stop queue
  2017-07-11 18:41   ` Bart Van Assche
@ 2017-07-12  2:59     ` Ming Lei
  2017-07-12  2:59     ` Ming Lei
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 43+ messages in thread
From: Ming Lei @ 2017-07-12  2:59 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: jgross, linux-block, sagi, axboe, hch, xen-devel,
	boris.ostrovsky, roger.pau

On Tue, Jul 11, 2017 at 06:41:29PM +0000, Bart Van Assche wrote:
> On Wed, 2017-07-12 at 02:20 +0800, Ming Lei wrote:
> > This interfaces will be removed soon, so use quiesce and
> > unquiesce instead, which should be more safe.
> > 
> > The only one usage will be removed in the following
> > congestion control patches.
> 
> Hello Ming,
> 
> The title of this patch is misleading since this patch does not touch the calls
> related to queue congestion (blk_mq_stop_hw_queue() and
> blk_mq_start_stopped_hw_queues()). I assume that you meant that this patch avoids
> that the xen-blkfront driver uses blk_mq_(start|stop)_hw_queues() (with queues in
> plural form)? Can you please reflect that in the subject of this and related
> patches?

OK, will do it in V2.

> 
> Additionally, it's probably a good idea that this is not just an interface change
> but that this kind of patches fix a (hard to trigger?) race condition.
> 
> >  static inline void kick_pending_request_queues_locked(struct blkfront_ring_info *rinfo)
> >  {
> > -	if (!RING_FULL(&rinfo->ring))
> > +	if (!RING_FULL(&rinfo->ring)) {
> >  		blk_mq_start_stopped_hw_queues(rinfo->dev_info->rq, true);
> > +		blk_mq_kick_requeue_list(rinfo->dev_info->rq);
> > +	}
> >  }
> >  
> >  static void kick_pending_request_queues(struct blkfront_ring_info *rinfo)
> > @@ -1225,7 +1227,8 @@ static void kick_pending_request_queues(struct blkfront_ring_info *rinfo)
> >  	unsigned long flags;
> >  
> >  	spin_lock_irqsave(&rinfo->ring_lock, flags);
> > -	kick_pending_request_queues_locked(rinfo);
> > +	if (!RING_FULL(&rinfo->ring))
> > +		blk_mq_run_hw_queues(rinfo->dev_info->rq, true);
> >  	spin_unlock_irqrestore(&rinfo->ring_lock, flags);
> >  }
> 
> Why do some kick_pending_request_queues_locked() kick the requeue list and why
> has the above kick_pending_request_queues_locked() call been converted into a
> blk_mq_run_hw_queues() call and thereby ignores the requeue list?

kick_pending_request_queues_locked() is used in the request completion path,
which belongs to congestion control, so this patch doesn't touch
kick_pending_request_queues_locked(); it will be switched to
the generic congestion control in patch 5.

For kick_pending_request_queues(), this patch replaces
blk_mq_start_stopped_hw_queues() with blk_mq_run_hw_queues()
because running the queue is often the real purpose of this function,
especially in the non-congestion-control path.


Thanks,
Ming

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 1/6] xen-blkfront: avoid to use start/stop queue
  2017-07-11 18:41   ` Bart Van Assche
                       ` (2 preceding siblings ...)
  2017-07-12  3:05     ` Ming Lei
@ 2017-07-12  3:05     ` Ming Lei
  3 siblings, 0 replies; 43+ messages in thread
From: Ming Lei @ 2017-07-12  3:05 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: hch, linux-block, axboe, boris.ostrovsky, roger.pau, xen-devel,
	jgross, konrad.wilk, sagi

On Tue, Jul 11, 2017 at 06:41:29PM +0000, Bart Van Assche wrote:
> On Wed, 2017-07-12 at 02:20 +0800, Ming Lei wrote:
> > This interfaces will be removed soon, so use quiesce and
> > unquiesce instead, which should be more safe.
> > 
> > The only one usage will be removed in the following
> > congestion control patches.
> 
> Hello Ming,
> 
> The title of this patch is misleading since this patch does not touch the calls
> related to queue congestion (blk_mq_stop_hw_queue() and
> blk_mq_start_stopped_hw_queues()). I assume that you meant that this patch avoids
> that the xen-blkfront driver uses blk_mq_(start|stop)_hw_queues() (with queues in
> plural form)? Can you please reflect that in the subject of this and related
> patches?
> 
> Additionally, it's probably a good idea that this is not just an interface change
> but that this kind of patches fix a (hard to trigger?) race condition.
> 
> >  static inline void kick_pending_request_queues_locked(struct blkfront_ring_info *rinfo)
> >  {
> > -	if (!RING_FULL(&rinfo->ring))
> > +	if (!RING_FULL(&rinfo->ring)) {
> >  		blk_mq_start_stopped_hw_queues(rinfo->dev_info->rq, true);
> > +		blk_mq_kick_requeue_list(rinfo->dev_info->rq);
> > +	}
> >  }
> >  
> >  static void kick_pending_request_queues(struct blkfront_ring_info *rinfo)
> > @@ -1225,7 +1227,8 @@ static void kick_pending_request_queues(struct blkfront_ring_info *rinfo)
> >  	unsigned long flags;
> >  
> >  	spin_lock_irqsave(&rinfo->ring_lock, flags);
> > -	kick_pending_request_queues_locked(rinfo);
> > +	if (!RING_FULL(&rinfo->ring))
> > +		blk_mq_run_hw_queues(rinfo->dev_info->rq, true);
> >  	spin_unlock_irqrestore(&rinfo->ring_lock, flags);
> >  }
> 
> Why do some kick_pending_request_queues_locked() kick the requeue list and why
> has the above kick_pending_request_queues_locked() call been converted into a
> blk_mq_run_hw_queues() call and thereby ignores the requeue list?

Looks like I forgot to reply to the question about the requeue list.

Actually blk_mq_kick_requeue_list() is only needed where the
queue is restarted, so this patch moves it after
blk_mq_start_stopped_hw_queues().

In the other path, we don't stop the queue anymore, so there is no need to
kick the requeue list.

-- 
Ming

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 1/6] xen-blkfront: avoid to use start/stop queue
  2017-07-11 18:41   ` Bart Van Assche
  2017-07-12  2:59     ` Ming Lei
  2017-07-12  2:59     ` Ming Lei
@ 2017-07-12  3:05     ` Ming Lei
  2017-07-12  3:05     ` Ming Lei
  3 siblings, 0 replies; 43+ messages in thread
From: Ming Lei @ 2017-07-12  3:05 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: jgross, linux-block, sagi, axboe, hch, xen-devel,
	boris.ostrovsky, roger.pau

On Tue, Jul 11, 2017 at 06:41:29PM +0000, Bart Van Assche wrote:
> On Wed, 2017-07-12 at 02:20 +0800, Ming Lei wrote:
> > This interfaces will be removed soon, so use quiesce and
> > unquiesce instead, which should be more safe.
> > 
> > The only one usage will be removed in the following
> > congestion control patches.
> 
> Hello Ming,
> 
> The title of this patch is misleading since this patch does not touch the calls
> related to queue congestion (blk_mq_stop_hw_queue() and
> blk_mq_start_stopped_hw_queues()). I assume that you meant that this patch avoids
> that the xen-blkfront driver uses blk_mq_(start|stop)_hw_queues() (with queues in
> plural form)? Can you please reflect that in the subject of this and related
> patches?
> 
> Additionally, it's probably a good idea that this is not just an interface change
> but that this kind of patches fix a (hard to trigger?) race condition.
> 
> >  static inline void kick_pending_request_queues_locked(struct blkfront_ring_info *rinfo)
> >  {
> > -	if (!RING_FULL(&rinfo->ring))
> > +	if (!RING_FULL(&rinfo->ring)) {
> >  		blk_mq_start_stopped_hw_queues(rinfo->dev_info->rq, true);
> > +		blk_mq_kick_requeue_list(rinfo->dev_info->rq);
> > +	}
> >  }
> >  
> >  static void kick_pending_request_queues(struct blkfront_ring_info *rinfo)
> > @@ -1225,7 +1227,8 @@ static void kick_pending_request_queues(struct blkfront_ring_info *rinfo)
> >  	unsigned long flags;
> >  
> >  	spin_lock_irqsave(&rinfo->ring_lock, flags);
> > -	kick_pending_request_queues_locked(rinfo);
> > +	if (!RING_FULL(&rinfo->ring))
> > +		blk_mq_run_hw_queues(rinfo->dev_info->rq, true);
> >  	spin_unlock_irqrestore(&rinfo->ring_lock, flags);
> >  }
> 
> Why do some kick_pending_request_queues_locked() kick the requeue list and why
> has the above kick_pending_request_queues_locked() call been converted into a
> blk_mq_run_hw_queues() call and thereby ignores the requeue list?

Looks like I forgot to reply to the question about the requeue list.

Actually blk_mq_kick_requeue_list() is only needed where the
queue is restarted, so this patch moves it after
blk_mq_start_stopped_hw_queues().

In the other path, we don't stop the queue anymore, so there is no need to
kick the requeue list.

-- 
Ming

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 1/6] xen-blkfront: avoid to use start/stop queue
  2017-07-11 21:24   ` Roger Pau Monné
@ 2017-07-12  3:12     ` Ming Lei
  2017-07-12  3:12     ` Ming Lei
  1 sibling, 0 replies; 43+ messages in thread
From: Ming Lei @ 2017-07-12  3:12 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Bart Van Assche,
	Sagi Grimberg, Konrad Rzeszutek Wilk, Boris Ostrovsky,
	Juergen Gross, xen-devel

On Tue, Jul 11, 2017 at 11:24:44PM +0200, Roger Pau Monné wrote:
> On Wed, Jul 12, 2017 at 02:20:58AM +0800, Ming Lei wrote:
> > This interfaces will be removed soon, so use quiesce and
> > unquiesce instead, which should be more safe.
> > 
> > The only one usage will be removed in the following
> > congestion control patches.
> 
> Would it be better to simply fix blk_mq_{start/stop}_stopped_hw_queues
> rather than introducing a new interface?

No, we do not want to expose the start/stop state to drivers any more, since it
has caused enough trouble already, so these APIs need to be removed, and
quiesce/unquiesce is the preferred way for this purpose, as you can see
from the work done by Sagi Grimberg:


Thanks,
Ming

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 1/6] xen-blkfront: avoid to use start/stop queue
  2017-07-11 21:24   ` Roger Pau Monné
  2017-07-12  3:12     ` Ming Lei
@ 2017-07-12  3:12     ` Ming Lei
  1 sibling, 0 replies; 43+ messages in thread
From: Ming Lei @ 2017-07-12  3:12 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Juergen Gross, linux-block, Sagi Grimberg, Jens Axboe, xen-devel,
	Christoph Hellwig, Bart Van Assche, Boris Ostrovsky

On Tue, Jul 11, 2017 at 11:24:44PM +0200, Roger Pau Monné wrote:
> On Wed, Jul 12, 2017 at 02:20:58AM +0800, Ming Lei wrote:
> > This interfaces will be removed soon, so use quiesce and
> > unquiesce instead, which should be more safe.
> > 
> > The only one usage will be removed in the following
> > congestion control patches.
> 
> Would it be better to simply fix blk_mq_{start/stop}_stopped_hw_queues
> rather than introducing a new interface?

No, we do not want to expose the start/stop state to drivers any more, since it
has caused enough trouble already, so these APIs need to be removed, and
quiesce/unquiesce is the preferred way for this purpose, as you can see
from the work done by Sagi Grimberg:


Thanks,
Ming

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/6] SCSI: use blk_mq_run_hw_queues() in scsi_kick_queue()
  2017-07-11 19:57     ` Bart Van Assche
  (?)
@ 2017-07-12  3:15     ` Ming Lei
  2017-07-12 15:12         ` Bart Van Assche
  -1 siblings, 1 reply; 43+ messages in thread
From: Ming Lei @ 2017-07-12  3:15 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: hch, linux-block, axboe, jejb, linux-scsi, martin.petersen, sagi

On Tue, Jul 11, 2017 at 07:57:53PM +0000, Bart Van Assche wrote:
> On Wed, 2017-07-12 at 02:20 +0800, Ming Lei wrote:
> > Now SCSI won't stop queue, and not necessary to use
> > blk_mq_start_hw_queues(), so switch to blk_mq_run_hw_queues()
> > instead.
> > 
> > Cc: "James E.J. Bottomley" <jejb@linux.vnet.ibm.com>
> > Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
> > Cc: linux-scsi@vger.kernel.org
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > ---
> >  drivers/scsi/scsi_lib.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> > index f6097b89d5d3..91d890356b78 100644
> > --- a/drivers/scsi/scsi_lib.c
> > +++ b/drivers/scsi/scsi_lib.c
> > @@ -333,7 +333,7 @@ void scsi_device_unbusy(struct scsi_device *sdev)
> >  static void scsi_kick_queue(struct request_queue *q)
> >  {
> >  	if (q->mq_ops)
> > -		blk_mq_start_hw_queues(q);
> > +		blk_mq_run_hw_queues(q, false);
> >  	else
> >  		blk_run_queue(q);
> >  }
> 
> Hello Ming,
> 
> Now that we have separate flags to represent the "stopped" and "quiesced"
> states, wouldn't it be better not to modify scsi_kick_queue() but instead to
> stop a SCSI hardware queue again if scsi_queue_rq() returns BLK_STS_RESOURCE?
> See also commits 36e3cf273977 ("scsi: Avoid that SCSI queues get stuck") and
> commit 52d7f1b5c2f3 ("blk-mq: Avoid that requeueing starts stopped queues").

As you can see in the following patches, all stop/start queue APIs will
be removed, and the 'stopped' state will become an internal state.

It doesn't make sense to clear 'stopped' for scsi anymore, since the
queue won't actually be stopped. So we need this change.

-- 
Ming

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 4/6] blk-mq: use EWMA to estimate congestion threshold
  2017-07-11 18:39   ` Jens Axboe
@ 2017-07-12  3:20     ` Ming Lei
  0 siblings, 0 replies; 43+ messages in thread
From: Ming Lei @ 2017-07-12  3:20 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, Christoph Hellwig, Bart Van Assche, Sagi Grimberg

On Tue, Jul 11, 2017 at 12:39:35PM -0600, Jens Axboe wrote:
> On 07/11/2017 12:21 PM, Ming Lei wrote:
> > diff --git a/block/blk-mq.h b/block/blk-mq.h
> > index 60b01c0309bc..c4516d2a2d2c 100644
> > --- a/block/blk-mq.h
> > +++ b/block/blk-mq.h
> > @@ -133,4 +133,13 @@ static inline bool blk_mq_hw_queue_mapped(struct blk_mq_hw_ctx *hctx)
> >  	return hctx->nr_ctx && hctx->tags;
> >  }
> >  
> > +/* borrowed from bcache */
> > +#define ewma_add(ewma, val, weight, factor)                             \
> > +({                                                                      \
> > +        (ewma) *= (weight) - 1;                                         \
> > +        (ewma) += (val) << factor;                                      \
> > +        (ewma) /= (weight);                                             \
> > +        (ewma) >> factor;                                               \
> > +})
> 
> Just put that in blk_mq_update_req_dispatch_busy(), or at least make it
> a static function in blk-mq.c above that function. You don't use factor
> at all.

OK, will do it in V2.


-- 
Ming

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 4/6] blk-mq: use EWMA to estimate congestion threshold
  2017-07-11 21:02   ` Bart Van Assche
@ 2017-07-12  3:43     ` Ming Lei
  0 siblings, 0 replies; 43+ messages in thread
From: Ming Lei @ 2017-07-12  3:43 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: hch, linux-block, axboe, sagi

On Tue, Jul 11, 2017 at 09:02:13PM +0000, Bart Van Assche wrote:
> On Wed, 2017-07-12 at 02:21 +0800, Ming Lei wrote:
> > When .queue_rq() returns BLK_STS_RESOURCE(BUSY), we can
> > consider that there is congestion in either low level
> > driver or hardware.
> > 
> > This patch uses EWMA to estimate this congestion threshold,
> > then this threshold can be used to detect/avoid congestion.
> 
> Hello Ming,
> 
> Does EWMA stand for "exponentially weighted moving average" in the context of
> this patch? If so, please mention this.

Yes and OK.

> 
> > +static void blk_mq_update_req_dispatch_busy(struct blk_mq_hw_ctx *hctx)
> > +{
> > +	struct sbitmap_queue *sbq;
> > +	unsigned depth;
> > +
> > +	sbq = &hctx->tags->bitmap_tags;
> > +	depth = sbitmap_weight(&sbq->sb);
> > +
> > +	/* use EWMA to estimate a threshold for detecting congestion */
> > +	ewma_add(hctx->avg_busy_threshold, depth, 8, 0);
> > +}
> 
> This function has been named after the context it is called from. Wouldn't it
> be more clear to change the name of this function into something that refers to
> what this function does, e.g. blk_mq_update_avg_busy_threshold()?

In the next patch, more things will be done in this function.

> 
> Additionally, I think that the behavior of e.g. the SCSI and dm-mpath drivers
> is too complicated for this approach to be effective. If you want to proceed
> with this approach I think it should be possible for block drivers to opt out
> of the mechanism introduced in the next patch.

dm might be a bit special, but for SCSI I suggest using this since I see
an obvious improvement with virtio-scsi.

But it depends on performance: if there isn't any perf loss, I'd rather
do it for all (including dm), and we can develop other, smarter ways for
special requirements if there are any.

> 
> > diff --git a/block/blk-mq.h b/block/blk-mq.h
> > index 60b01c0309bc..c4516d2a2d2c 100644
> > --- a/block/blk-mq.h
> > +++ b/block/blk-mq.h
> > @@ -133,4 +133,13 @@ static inline bool blk_mq_hw_queue_mapped(struct blk_mq_hw_ctx *hctx)
> >  	return hctx->nr_ctx && hctx->tags;
> >  }
> >  
> > +/* borrowed from bcache */
> > +#define ewma_add(ewma, val, weight, factor)                             \
> > +({                                                                      \
> > +        (ewma) *= (weight) - 1;                                         \
> > +        (ewma) += (val) << factor;                                      \
> > +        (ewma) /= (weight);                                             \
> > +        (ewma) >> factor;                                               \
> > +})
> 
> Sorry but this does not match how others define an exponentially weighted moving
> average. As far as I know the ewma values should be updated as follows:
> 
>    new_ewma = w * val + (1 - w) * current_ewma
> 
> where 0 < w <= 1 is a rational number (typically 0.05 <= w <= 0.3). See also
> https://en.wikipedia.org/wiki/EWMA_chart.

Yes. For the way in this patch, w is 1/8 and factor is zero; it is written
that way just so the computer can do it efficiently with integer arithmetic.
There is no big difference from the textbook definition, and as you can see,
ewma_add() is borrowed from bcache.
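
As a quick standalone check (plain C outside the kernel, sample values made
up), the integer form used by ewma_add() with weight 8 and factor 0 tracks
the textbook formula with w = 1/8, up to integer truncation:

	#include <stdio.h>

	int main(void)
	{
		unsigned long ewma = 0;   /* integer form: weight = 8, factor = 0 */
		double ref = 0.0;         /* textbook form: w = 1/8 */
		unsigned long samples[] = { 64, 64, 48, 56, 64, 32, 64, 64 };
		unsigned int i;

		for (i = 0; i < sizeof(samples) / sizeof(samples[0]); i++) {
			/* same arithmetic as ewma_add(ewma, val, 8, 0) */
			ewma = (ewma * (8 - 1) + samples[i]) / 8;
			ref = 0.125 * samples[i] + (1.0 - 0.125) * ref;
			printf("val=%lu  integer ewma=%lu  reference=%.2f\n",
			       samples[i], ewma, ref);
		}
		return 0;
	}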

-- 
Ming

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 3/6] blk-mq: send the request to dispatch list if direct issue returns busy
  2017-07-11 20:18   ` Bart Van Assche
@ 2017-07-12  3:45     ` Ming Lei
  0 siblings, 0 replies; 43+ messages in thread
From: Ming Lei @ 2017-07-12  3:45 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: hch, linux-block, axboe, sagi

On Tue, Jul 11, 2017 at 08:18:51PM +0000, Bart Van Assche wrote:
> On Wed, 2017-07-12 at 02:21 +0800, Ming Lei wrote:
> > Before mq IO scheduler is in, we always send the request to
> > dispatch list if .queue_rq() return busy. After mq IO scheduler
> > is introduced, we only do this way when scheduler is used in
> > case of direct issue. Actually we can do that when scheduler
> > isn't used too.
> > 
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > ---
> >  block/blk-mq.c | 18 +++++++++++++++---
> >  1 file changed, 15 insertions(+), 3 deletions(-)
> > 
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index 041f7b7fa0d6..6e0fc80aa151 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -1463,6 +1463,16 @@ static blk_qc_t request_to_qc_t(struct blk_mq_hw_ctx *hctx, struct request *rq)
> >  	return blk_tag_to_qc_t(rq->internal_tag, hctx->queue_num, true);
> >  }
> >  
> > +static void blk_mq_direct_dispatch(struct blk_mq_hw_ctx *hctx,
> > +				   struct request *rq)
> > +{
> > +	spin_lock(&hctx->lock);
> > +	list_add(&rq->queuelist, &hctx->dispatch);
> > +	spin_unlock(&hctx->lock);
> > +
> > +	blk_mq_run_hw_queue(hctx, false);
> > +}
> > +
> >  static void __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
> >  					struct request *rq,
> >  					blk_qc_t *cookie, bool may_sleep)
> > @@ -1499,15 +1509,17 @@ static void __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
> >  	switch (ret) {
> >  	case BLK_STS_OK:
> >  		*cookie = new_cookie;
> > -		return;
> > +		break;
> >  	case BLK_STS_RESOURCE:
> >  		__blk_mq_requeue_request(rq);
> > -		goto insert;
> > +		blk_mq_direct_dispatch(hctx, rq);
> > +		break;
> >  	default:
> >  		*cookie = BLK_QC_T_NONE;
> >  		blk_mq_end_request(rq, ret);
> > -		return;
> > +		break;
> >  	}
> > +	return;
> >  
> >  insert:
> >  	blk_mq_sched_insert_request(rq, false, run_queue, false, may_sleep);
> 
> Hello Ming,
> 
> This patch changes the behavior of blk_mq_try_issue_directly() if a scheduler
> has been configured and .queue_rq() returns BLK_STS_RESOURCE, namely by skipping
> the e->type->ops.mq.insert_requests() call. Sorry but I don't think this is what
> we want.

It does not change the behaviour when a scheduler is used; please see
blk_mq_sched_bypass_insert(), which is called from
blk_mq_sched_insert_request().

-- 
Ming

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/6] SCSI: use blk_mq_run_hw_queues() in scsi_kick_queue()
  2017-07-12  3:15     ` Ming Lei
@ 2017-07-12 15:12         ` Bart Van Assche
  0 siblings, 0 replies; 43+ messages in thread
From: Bart Van Assche @ 2017-07-12 15:12 UTC (permalink / raw)
  To: ming.lei; +Cc: jejb, hch, linux-block, sagi, axboe, linux-scsi, martin.petersen

On Wed, 2017-07-12 at 11:15 +0800, Ming Lei wrote:
> On Tue, Jul 11, 2017 at 07:57:53PM +0000, Bart Van Assche wrote:
> > On Wed, 2017-07-12 at 02:20 +0800, Ming Lei wrote:
> > > Now SCSI won't stop queue, and not necessary to use
> > > blk_mq_start_hw_queues(), so switch to blk_mq_run_hw_queues()
> > > instead.
> > > 
> > > Cc: "James E.J. Bottomley" <jejb@linux.vnet.ibm.com>
> > > Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
> > > Cc: linux-scsi@vger.kernel.org
> > > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > > ---
> > >  drivers/scsi/scsi_lib.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > 
> > > diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> > > index f6097b89d5d3..91d890356b78 100644
> > > --- a/drivers/scsi/scsi_lib.c
> > > +++ b/drivers/scsi/scsi_lib.c
> > > @@ -333,7 +333,7 @@ void scsi_device_unbusy(struct scsi_device *sdev)
> > >  static void scsi_kick_queue(struct request_queue *q)
> > >  {
> > >  	if (q->mq_ops)
> > > -		blk_mq_start_hw_queues(q);
> > > +		blk_mq_run_hw_queues(q, false);
> > >  	else
> > >  		blk_run_queue(q);
> > >  }
> > 
> > Hello Ming,
> > 
> > Now that we have separate flags to represent the "stopped" and "quiesced"
> > states, wouldn't it be better not to modify scsi_kick_queue() but instead to
> > stop a SCSI hardware queue again if scsi_queue_rq() returns BLK_STS_RESOURCE?
> > See also commits 36e3cf273977 ("scsi: Avoid that SCSI queues get stuck") and
> > commit 52d7f1b5c2f3 ("blk-mq: Avoid that requeueing starts stopped queues").
> 
> As you can see in the following patches, all stop/start queue APIs will
> be removed, and the 'stopped' state will become a internal state.
> 
> It doesn't make sense to clear 'stopped' for scsi anymore, since the
> queue won't be stopped actually. So we need this change.

Hello Ming,

As Jens already noticed, this approach won't work properly for concurrent I/O
to multiple LUNs associated with the same SCSI host. This approach won't work
properly for dm-mpath either. Sorry but I'm not convinced that it's possible
to replace the stop/start queue API for all block drivers by an algorithm
that is based on estimating the queue depth.

Bart.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/6] SCSI: use blk_mq_run_hw_queues() in scsi_kick_queue()
@ 2017-07-12 15:12         ` Bart Van Assche
  0 siblings, 0 replies; 43+ messages in thread
From: Bart Van Assche @ 2017-07-12 15:12 UTC (permalink / raw)
  To: ming.lei; +Cc: jejb, hch, linux-block, sagi, axboe, linux-scsi, martin.petersen

On Wed, 2017-07-12 at 11:15 +0800, Ming Lei wrote:
> On Tue, Jul 11, 2017 at 07:57:53PM +0000, Bart Van Assche wrote:
> > On Wed, 2017-07-12 at 02:20 +0800, Ming Lei wrote:
> > > Now SCSI won't stop queue, and not necessary to use
> > > blk_mq_start_hw_queues(), so switch to blk_mq_run_hw_queues()
> > > instead.
> > > 
> > > Cc: "James E.J. Bottomley" <jejb@linux.vnet.ibm.com>
> > > Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
> > > Cc: linux-scsi@vger.kernel.org
> > > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > > ---
> > >  drivers/scsi/scsi_lib.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > 
> > > diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> > > index f6097b89d5d3..91d890356b78 100644
> > > --- a/drivers/scsi/scsi_lib.c
> > > +++ b/drivers/scsi/scsi_lib.c
> > > @@ -333,7 +333,7 @@ void scsi_device_unbusy(struct scsi_device *sdev)
> > >  static void scsi_kick_queue(struct request_queue *q)
> > >  {
> > >  	if (q->mq_ops)
> > > -		blk_mq_start_hw_queues(q);
> > > +		blk_mq_run_hw_queues(q, false);
> > >  	else
> > >  		blk_run_queue(q);
> > >  }
> > 
> > Hello Ming,
> > 
> > Now that we have separate flags to represent the "stopped" and "quiesced"
> > states, wouldn't it be better not to modify scsi_kick_queue() but instead to
> > stop a SCSI hardware queue again if scsi_queue_rq() returns BLK_STS_RESOURCE?
> > See also commits 36e3cf273977 ("scsi: Avoid that SCSI queues get stuck") and
> > commit 52d7f1b5c2f3 ("blk-mq: Avoid that requeueing starts stopped queues").
> 
> As you can see in the following patches, all stop/start queue APIs will
> be removed, and the 'stopped' state will become a internal state.
> 
> It doesn't make sense to clear 'stopped' for scsi anymore, since the
> queue won't be stopped actually. So we need this change.

Hello Ming,

As Jens already noticed, this approach won't work properly for concurrent I/O
to multiple LUNs associated with the same SCSI host. This approach won't work
properly for dm-mpath either. Sorry but I'm not convinced that it's possible
to replace the stop/start queue API for all block drivers by an algorithm
that is based on estimating the queue depth.

Bart.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 4/6] blk-mq: use EWMA to estimate congestion threshold
  2017-07-12  2:30     ` Ming Lei
@ 2017-07-12 15:39       ` Bart Van Assche
  2017-07-13 10:43         ` Ming Lei
  0 siblings, 1 reply; 43+ messages in thread
From: Bart Van Assche @ 2017-07-12 15:39 UTC (permalink / raw)
  To: ming.lei, axboe; +Cc: Bart Van Assche, hch, linux-block, sagi

On Wed, 2017-07-12 at 10:30 +0800, Ming Lei wrote:
> On Tue, Jul 11, 2017 at 12:25:16PM -0600, Jens Axboe wrote:
> > What happens with fluid congestion boundaries, with shared tags?
> 
> The approach in this patch should work, but the threshold may not
> be accurate in this way, one simple method is to use the average
> tag weight in EWMA, like this:
> 
> 	sbitmap_weight() / hctx->tags->active_queues

Hello Ming,

That approach would result in a severe performance degradation. "active_queues"
namely represents the number of queues against which I/O ever has been queued.
If e.g. 64 LUNs would be associated with a single SCSI host and all 64 LUNs are
responding and if the queue depth would also be 64 then the approach you
proposed will reduce the effective queue depth per LUN from 64 to 1.

Bart.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/6] SCSI: use blk_mq_run_hw_queues() in scsi_kick_queue()
  2017-07-12 15:12         ` Bart Van Assche
  (?)
@ 2017-07-13 10:23         ` Ming Lei
  2017-07-13 17:44             ` Bart Van Assche
  -1 siblings, 1 reply; 43+ messages in thread
From: Ming Lei @ 2017-07-13 10:23 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: jejb, hch, linux-block, sagi, axboe, linux-scsi, martin.petersen

On Wed, Jul 12, 2017 at 03:12:37PM +0000, Bart Van Assche wrote:
> On Wed, 2017-07-12 at 11:15 +0800, Ming Lei wrote:
> > On Tue, Jul 11, 2017 at 07:57:53PM +0000, Bart Van Assche wrote:
> > > On Wed, 2017-07-12 at 02:20 +0800, Ming Lei wrote:
> > > > Now SCSI won't stop queue, and not necessary to use
> > > > blk_mq_start_hw_queues(), so switch to blk_mq_run_hw_queues()
> > > > instead.
> > > > 
> > > > Cc: "James E.J. Bottomley" <jejb@linux.vnet.ibm.com>
> > > > Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
> > > > Cc: linux-scsi@vger.kernel.org
> > > > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > > > ---
> > > >  drivers/scsi/scsi_lib.c | 2 +-
> > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > > 
> > > > diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> > > > index f6097b89d5d3..91d890356b78 100644
> > > > --- a/drivers/scsi/scsi_lib.c
> > > > +++ b/drivers/scsi/scsi_lib.c
> > > > @@ -333,7 +333,7 @@ void scsi_device_unbusy(struct scsi_device *sdev)
> > > >  static void scsi_kick_queue(struct request_queue *q)
> > > >  {
> > > >  	if (q->mq_ops)
> > > > -		blk_mq_start_hw_queues(q);
> > > > +		blk_mq_run_hw_queues(q, false);
> > > >  	else
> > > >  		blk_run_queue(q);
> > > >  }
> > > 
> > > Hello Ming,
> > > 
> > > Now that we have separate flags to represent the "stopped" and "quiesced"
> > > states, wouldn't it be better not to modify scsi_kick_queue() but instead to
> > > stop a SCSI hardware queue again if scsi_queue_rq() returns BLK_STS_RESOURCE?
> > > See also commits 36e3cf273977 ("scsi: Avoid that SCSI queues get stuck") and
> > > commit 52d7f1b5c2f3 ("blk-mq: Avoid that requeueing starts stopped queues").
> > 
> > As you can see in the following patches, all stop/start queue APIs will
> > be removed, and the 'stopped' state will become a internal state.
> > 
> > It doesn't make sense to clear 'stopped' for scsi anymore, since the
> > queue won't be stopped actually. So we need this change.
> 
> Hello Ming,
> 
> As Jens already noticed, this approach won't work properly for concurrent I/O
> to multiple LUNs associated with the same SCSI host. This approach won't work
> properly for dm-mpath either. Sorry but I'm not convinced that it's possible

We can deal with special cases by passing a flag, so that we don't need
to expose the 'stopped' flag to drivers. Again, it has caused lots of
trouble.

If you check Linus' tree, most of the start/stop queue APIs have been replaced
with quiesce/unquiesce. We can remove the others too.

> to replace the stop/start queue API for all block drivers by an algorithm
> that is based on estimating the queue depth.

Anyway this patch is correct, and it doesn't make sense to clear 'stopped'
in SCSI since SCSI doesn't stop the queue at all, even without introducing
congestion control, right?

Please discuss the congestion control in patch 4 and 5.

-- 
Ming

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 4/6] blk-mq: use EWMA to estimate congestion threshold
  2017-07-12 15:39       ` Bart Van Assche
@ 2017-07-13 10:43         ` Ming Lei
  2017-07-13 14:56           ` Bart Van Assche
  0 siblings, 1 reply; 43+ messages in thread
From: Ming Lei @ 2017-07-13 10:43 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: axboe, hch, linux-block, sagi

On Wed, Jul 12, 2017 at 03:39:14PM +0000, Bart Van Assche wrote:
> On Wed, 2017-07-12 at 10:30 +0800, Ming Lei wrote:
> > On Tue, Jul 11, 2017 at 12:25:16PM -0600, Jens Axboe wrote:
> > > What happens with fluid congestion boundaries, with shared tags?
> > 
> > The approach in this patch should work, but the threshold may not
> > be accurate in this way, one simple method is to use the average
> > tag weight in EWMA, like this:
> > 
> > 	sbitmap_weight() / hctx->tags->active_queues
> 
> Hello Ming,
> 
> That approach would result in a severe performance degradation. "active_queues"
> namely represents the number of queues against which I/O ever has been queued.
> If e.g. 64 LUNs would be associated with a single SCSI host and all 64 LUNs are
> responding and if the queue depth would also be 64 then the approach you
> proposed will reduce the effective queue depth per LUN from 64 to 1.

No, this approach does _not_ reduce the effective queue depth, it only
stops the queue for a while when the queue is busy enough.

In this case, there may not be any congestion because blk-mq at most allows
queue_depth/active_queues tags to be assigned to each LUN, please see hctx_may_queue().
Then get_driver_tag() can return at most one pending tag to that
request_queue (LUN).

The algorithm in this patch only starts to work when congestion happens,
that is, it is only run when BLK_STS_RESOURCE is returned from .queue_rq().
This approach avoids dispatching requests to a busy queue
unnecessarily, so that we don't burn CPU unnecessarily, and
merging gets improved in the meantime.
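
For reference, the fair sharing arithmetic referred to above looks roughly
like the following standalone sketch (simplified and with made-up names,
not the kernel's exact hctx_may_queue() code):

	#include <stdio.h>

	/* roughly: each active queue may use about total_depth / active_queues tags */
	static unsigned int per_queue_depth(unsigned int total_depth,
					    unsigned int active_queues)
	{
		unsigned int depth;

		if (active_queues <= 1)
			return total_depth;
		depth = total_depth / active_queues;
		return depth ? depth : 1;	/* never less than one tag per queue */
	}

	int main(void)
	{
		/* the 64-LUN example discussed in this thread, host queue depth 64 */
		printf("64 tags shared by 64 LUNs -> %u tag(s) per LUN\n",
		       per_queue_depth(64, 64));
		printf("64 tags shared by  4 LUNs -> %u tag(s) per LUN\n",
		       per_queue_depth(64, 4));
		printf("64 tags shared by  1 LUN  -> %u tag(s) per LUN\n",
		       per_queue_depth(64, 1));
		return 0;
	}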

-- 
Ming

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 4/6] blk-mq: use EWMA to estimate congestion threshold
  2017-07-13 10:43         ` Ming Lei
@ 2017-07-13 14:56           ` Bart Van Assche
  2017-07-13 15:32             ` Ming Lei
  0 siblings, 1 reply; 43+ messages in thread
From: Bart Van Assche @ 2017-07-13 14:56 UTC (permalink / raw)
  To: ming.lei; +Cc: hch, linux-block, axboe, sagi

On Thu, 2017-07-13 at 18:43 +0800, Ming Lei wrote:
> On Wed, Jul 12, 2017 at 03:39:14PM +0000, Bart Van Assche wrote:
> > On Wed, 2017-07-12 at 10:30 +0800, Ming Lei wrote:
> > > On Tue, Jul 11, 2017 at 12:25:16PM -0600, Jens Axboe wrote:
> > > > What happens with fluid congestion boundaries, with shared tags?
> > > 
> > > The approach in this patch should work, but the threshold may not
> > > be accurate in this way, one simple method is to use the average
> > > tag weight in EWMA, like this:
> > > 
> > > 	sbitmap_weight() / hctx->tags->active_queues
> > 
> > Hello Ming,
> > 
> > That approach would result in a severe performance degradation. "active_queues"
> > namely represents the number of queues against which I/O ever has been queued.
> > If e.g. 64 LUNs would be associated with a single SCSI host and all 64 LUNs are
> > responding and if the queue depth would also be 64 then the approach you
> > proposed will reduce the effective queue depth per LUN from 64 to 1.
> 
> No, this approach does _not_ reduce the effective queue depth, it only
> stops the queue for a while when the queue is busy enough.
>
> In this case, there may not have congestion because for blk-mq at most allows
> to assign queue_depth/active_queues tags to each LUN, please see hctx_may_queue().

Hello Ming,

hctx_may_queue() severely limits the queue depth if many LUNs are associated
with the same SCSI host. I think that this is a performance regression
compared to scsi-sq and that this performance regression should be fixed.

Bart.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 4/6] blk-mq: use EWMA to estimate congestion threshold
  2017-07-13 14:56           ` Bart Van Assche
@ 2017-07-13 15:32             ` Ming Lei
  2017-07-13 17:35               ` Bart Van Assche
  0 siblings, 1 reply; 43+ messages in thread
From: Ming Lei @ 2017-07-13 15:32 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: hch, linux-block, axboe, sagi

On Thu, Jul 13, 2017 at 02:56:38PM +0000, Bart Van Assche wrote:
> On Thu, 2017-07-13 at 18:43 +0800, Ming Lei wrote:
> > On Wed, Jul 12, 2017 at 03:39:14PM +0000, Bart Van Assche wrote:
> > > On Wed, 2017-07-12 at 10:30 +0800, Ming Lei wrote:
> > > > On Tue, Jul 11, 2017 at 12:25:16PM -0600, Jens Axboe wrote:
> > > > > What happens with fluid congestion boundaries, with shared tags?
> > > > 
> > > > The approach in this patch should work, but the threshold may not
> > > > be accurate in this way, one simple method is to use the average
> > > > tag weight in EWMA, like this:
> > > > 
> > > > 	sbitmap_weight() / hctx->tags->active_queues
> > > 
> > > Hello Ming,
> > > 
> > > That approach would result in a severe performance degradation. "active_queues"
> > > namely represents the number of queues against which I/O ever has been queued.
> > > If e.g. 64 LUNs would be associated with a single SCSI host and all 64 LUNs are
> > > responding and if the queue depth would also be 64 then the approach you
> > > proposed will reduce the effective queue depth per LUN from 64 to 1.
> > 
> > No, this approach does _not_ reduce the effective queue depth, it only
> > stops the queue for a while when the queue is busy enough.
> >
> > In this case, there may not have congestion because for blk-mq at most allows
> > to assign queue_depth/active_queues tags to each LUN, please see hctx_may_queue().
> 
> Hello Ming,
> 
> hctx_may_queue() severely limits the queue depth if many LUNs are associated
> with the same SCSI host. I think that this is a performance regression
> compared to scsi-sq and that this performance regression should be fixed.

IMO, it is hard to evaluate/compare perf between scsi-mq and scsi-sq:

	- how many LUNs do you run IO on concurrently?
	- do you evaluate the perf on a single LUN or on multiple LUNs?

BTW, active_queues is a runtime variable which accounts for the actual active
queues in use.

-- 
Ming

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 4/6] blk-mq: use EWMA to estimate congestion threshold
  2017-07-13 15:32             ` Ming Lei
@ 2017-07-13 17:35               ` Bart Van Assche
  0 siblings, 0 replies; 43+ messages in thread
From: Bart Van Assche @ 2017-07-13 17:35 UTC (permalink / raw)
  To: ming.lei; +Cc: hch, linux-block, axboe, sagi

On Thu, 2017-07-13 at 23:32 +0800, Ming Lei wrote:
> On Thu, Jul 13, 2017 at 02:56:38PM +0000, Bart Van Assche wrote:
> > hctx_may_queue() severely limits the queue depth if many LUNs are associated
> > with the same SCSI host. I think that this is a performance regression
> > compared to scsi-sq and that this performance regression should be fixed.
> 
> IMO, it is hard to evaluate/compare perf between scsi-mq vs scsi-sq:
> 
> 	- how many LUNs do you run IO on concurrently?
> 	- evaluate the perf on single LUN or multi LUN?
> 
> BTW, active_queues is a runtime variable which accounts the actual active
> queues in use.

Hello Ming,

What I described can be verified easily by running fio against a single LUN
of a SCSI host with which a large number of LUNs are associated.

BTW, something I overlooked in my analysis of the active_queues variable is
that it is not only decremented if a queue is destroyed but also if a queue
is idle during (block layer request timeout) seconds. So the problem I
described will only occur if some software submits I/O requests periodically
to all SCSI LUNs with a shorter interval than the I/O timeout. I'm not sure
any Linux software does this today. As an example, the time between path
checks by the multipathd TUR checker typically exceeds the timeout of a
single TUR check.

Bart.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/6] SCSI: use blk_mq_run_hw_queues() in scsi_kick_queue()
  2017-07-13 10:23         ` Ming Lei
@ 2017-07-13 17:44             ` Bart Van Assche
  0 siblings, 0 replies; 43+ messages in thread
From: Bart Van Assche @ 2017-07-13 17:44 UTC (permalink / raw)
  To: ming.lei; +Cc: jejb, hch, linux-scsi, linux-block, axboe, sagi, martin.petersen

On Thu, 2017-07-13 at 18:23 +0800, Ming Lei wrote:
> Please discuss the congestion control in patch 4 and 5.

Hello Ming,

Since there is a significant risk that this patch series will introduce
performance and/or functional regressions, I will wait with reviewing this
patch series further until it has been shown that there is no less risky
alternative to realize the same performance improvement. I think Jens had
already suggested a possible alternative, namely by only starting a queue
if it really has to be started.

Bart.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/6] SCSI: use blk_mq_run_hw_queues() in scsi_kick_queue()
@ 2017-07-13 17:44             ` Bart Van Assche
  0 siblings, 0 replies; 43+ messages in thread
From: Bart Van Assche @ 2017-07-13 17:44 UTC (permalink / raw)
  To: ming.lei; +Cc: jejb, hch, linux-scsi, linux-block, axboe, sagi, martin.petersen

On Thu, 2017-07-13 at 18:23 +0800, Ming Lei wrote:
> Please discuss the congestion control in patch 4 and 5.

Hello Ming,

Since there is a significant risk that this patch series will introduce
performance and/or functional regressions, I will wait with reviewing this
patch series further until it has been shown that there is no less risky
alternative to realize the same performance improvement. I think Jens had
already suggested a possible alternative, namely by only starting a queue
if it really has to be started.

Bart.

^ permalink raw reply	[flat|nested] 43+ messages in thread

end of thread, other threads:[~2017-07-13 17:44 UTC | newest]

Thread overview: 43+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-07-11 18:20 [PATCH 0/6] blk-mq: introduce congestion control Ming Lei
2017-07-11 18:20 ` [PATCH 1/6] xen-blkfront: avoid to use start/stop queue Ming Lei
2017-07-11 18:20 ` Ming Lei
2017-07-11 18:41   ` Konrad Rzeszutek Wilk
2017-07-11 18:41   ` Konrad Rzeszutek Wilk
2017-07-12  2:52     ` Ming Lei
2017-07-12  2:52     ` Ming Lei
2017-07-11 18:41   ` Bart Van Assche
2017-07-12  2:59     ` Ming Lei
2017-07-12  2:59     ` Ming Lei
2017-07-12  3:05     ` Ming Lei
2017-07-12  3:05     ` Ming Lei
2017-07-11 18:41   ` Bart Van Assche
2017-07-11 21:24   ` Roger Pau Monné
2017-07-12  3:12     ` Ming Lei
2017-07-12  3:12     ` Ming Lei
2017-07-11 21:24   ` Roger Pau Monné
2017-07-11 18:20 ` [PATCH 2/6] SCSI: use blk_mq_run_hw_queues() in scsi_kick_queue() Ming Lei
2017-07-11 19:57   ` Bart Van Assche
2017-07-11 19:57     ` Bart Van Assche
2017-07-12  3:15     ` Ming Lei
2017-07-12 15:12       ` Bart Van Assche
2017-07-12 15:12         ` Bart Van Assche
2017-07-13 10:23         ` Ming Lei
2017-07-13 17:44           ` Bart Van Assche
2017-07-13 17:44             ` Bart Van Assche
2017-07-11 18:21 ` [PATCH 3/6] blk-mq: send the request to dispatch list if direct issue returns busy Ming Lei
2017-07-11 20:18   ` Bart Van Assche
2017-07-12  3:45     ` Ming Lei
2017-07-11 18:21 ` [PATCH 4/6] blk-mq: use EWMA to estimate congestion threshold Ming Lei
2017-07-11 18:25   ` Jens Axboe
2017-07-12  2:30     ` Ming Lei
2017-07-12 15:39       ` Bart Van Assche
2017-07-13 10:43         ` Ming Lei
2017-07-13 14:56           ` Bart Van Assche
2017-07-13 15:32             ` Ming Lei
2017-07-13 17:35               ` Bart Van Assche
2017-07-11 18:39   ` Jens Axboe
2017-07-12  3:20     ` Ming Lei
2017-07-11 21:02   ` Bart Van Assche
2017-07-12  3:43     ` Ming Lei
2017-07-11 18:21 ` [PATCH 5/6] blk-mq: introduce basic congestion control Ming Lei
2017-07-11 18:21 ` [PATCH 6/6] blk-mq: unexport APIs for start/stop queues Ming Lei

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.