* [PATCH v4 0/6] Avoid that scsi-mq and dm-mq queue processing stalls sporadically
@ 2017-04-07 18:16 ` Bart Van Assche
From: Bart Van Assche @ 2017-04-07 18:16 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, linux-scsi, Bart Van Assche

Hello Jens,

The six patches in this series fix the queue lockup I recently
reported on the linux-block mailing list. Please consider these
patches for inclusion in the upstream kernel.

Thanks,

Bart.

Changes between v3 and v4:
- Addressed the review comments on v3 of this series about the patch that
  makes it safe to use RCU to iterate over .tag_list, and also about the
  runtime performance and the use of short variable names in patch 2/5
  (v3 numbering).
- Clarified the description of the patch that fixes the scsi-mq stall.
- Added a patch to fix a dm-mq queue stall.

Changes between v2 and v3:
- Removed the blk_mq_ops.restart_hctx function pointer again.
- Modified blk_mq_sched_restart_queues() such that only a single hardware
  queue is restarted, instead of every hardware queue, if hardware queues
  are shared.
- Introduced a new function in the block layer, namely
  blk_mq_delay_run_hw_queue().

Changes between v1 and v2:
- Reworked scsi_restart_queues() such that it no longer takes the SCSI
  host lock.
- Added two patches - one for exporting blk_mq_sched_restart_hctx() and
  another one to make iterating with RCU over blk_mq_tag_set.tag_list safe.

Bart Van Assche (6):
  blk-mq: Make it safe to use RCU to iterate over
    blk_mq_tag_set.tag_list
  blk-mq: Restart a single queue if tag sets are shared
  blk-mq: Clarify comments in blk_mq_dispatch_rq_list()
  blk-mq: Introduce blk_mq_delay_run_hw_queue()
  scsi: Avoid that SCSI queues get stuck
  dm rq: Avoid that request processing stalls sporadically

 block/blk-mq-sched.c    | 63 +++++++++++++++++++++++++++++++++++-------
 block/blk-mq-sched.h    | 16 +----------
 block/blk-mq.c          | 73 +++++++++++++++++++++++++++++++++++++++----------
 drivers/md/dm-rq.c      |  1 +
 drivers/scsi/scsi_lib.c |  6 ++--
 include/linux/blk-mq.h  |  2 ++
 include/linux/blkdev.h  |  1 -
 7 files changed, 118 insertions(+), 44 deletions(-)

-- 
2.12.0


* [PATCH v4 1/6] blk-mq: Make it safe to use RCU to iterate over blk_mq_tag_set.tag_list
@ 2017-04-07 18:16   ` Bart Van Assche
From: Bart Van Assche @ 2017-04-07 18:16 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, linux-scsi, Bart Van Assche, Christoph Hellwig,
	Hannes Reinecke

Since the next patch in this series will use RCU to iterate over
tag_list, make such iteration safe. Add lockdep_assert_held()
statements in functions that iterate over tag_list to document that
using list_for_each_entry() instead of list_for_each_entry_rcu() is
fine in these functions.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.com>
---
 block/blk-mq.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index f7cd3208bcdf..c26464f9649a 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2076,6 +2076,8 @@ static void blk_mq_update_tag_set_depth(struct blk_mq_tag_set *set, bool shared)
 {
 	struct request_queue *q;
 
+	lockdep_assert_held(&set->tag_list_lock);
+
 	list_for_each_entry(q, &set->tag_list, tag_set_list) {
 		blk_mq_freeze_queue(q);
 		queue_set_hctx_shared(q, shared);
@@ -2088,7 +2090,8 @@ static void blk_mq_del_queue_tag_set(struct request_queue *q)
 	struct blk_mq_tag_set *set = q->tag_set;
 
 	mutex_lock(&set->tag_list_lock);
-	list_del_init(&q->tag_set_list);
+	list_del_rcu(&q->tag_set_list);
+	INIT_LIST_HEAD(&q->tag_set_list);
 	if (list_is_singular(&set->tag_list)) {
 		/* just transitioned to unshared */
 		set->flags &= ~BLK_MQ_F_TAG_SHARED;
@@ -2096,6 +2099,8 @@ static void blk_mq_del_queue_tag_set(struct request_queue *q)
 		blk_mq_update_tag_set_depth(set, false);
 	}
 	mutex_unlock(&set->tag_list_lock);
+
+	synchronize_rcu();
 }
 
 static void blk_mq_add_queue_tag_set(struct blk_mq_tag_set *set,
@@ -2113,7 +2118,7 @@ static void blk_mq_add_queue_tag_set(struct blk_mq_tag_set *set,
 	}
 	if (set->flags & BLK_MQ_F_TAG_SHARED)
 		queue_set_hctx_shared(q, true);
-	list_add_tail(&q->tag_set_list, &set->tag_list);
+	list_add_tail_rcu(&q->tag_set_list, &set->tag_list);
 
 	mutex_unlock(&set->tag_list_lock);
 }
@@ -2601,6 +2606,8 @@ void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues)
 {
 	struct request_queue *q;
 
+	lockdep_assert_held(&set->tag_list_lock);
+
 	if (nr_hw_queues > nr_cpu_ids)
 		nr_hw_queues = nr_cpu_ids;
 	if (nr_hw_queues < 1 || nr_hw_queues == set->nr_hw_queues)
-- 
2.12.0


* [PATCH v4 2/6] blk-mq: Restart a single queue if tag sets are shared
@ 2017-04-07 18:16   ` Bart Van Assche
From: Bart Van Assche @ 2017-04-07 18:16 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, linux-scsi, Bart Van Assche, Christoph Hellwig,
	Hannes Reinecke

To improve scalability, if hardware queues are shared, restart
a single hardware queue in round-robin fashion instead of all of
them. Rename blk_mq_sched_restart_queues() to reflect the new
semantics. Remove blk_mq_sched_mark_restart_queue() because it
has no callers. Remove the QUEUE_FLAG_RESTART flag because this
patch removes the last code that used it.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.com>
---
 block/blk-mq-sched.c   | 63 ++++++++++++++++++++++++++++++++++++++++++--------
 block/blk-mq-sched.h   | 16 +------------
 block/blk-mq.c         |  2 +-
 include/linux/blkdev.h |  1 -
 4 files changed, 55 insertions(+), 27 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 09af8ff18719..a5c683a6429c 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -317,25 +317,68 @@ static bool blk_mq_sched_bypass_insert(struct blk_mq_hw_ctx *hctx,
 	return true;
 }
 
-static void blk_mq_sched_restart_hctx(struct blk_mq_hw_ctx *hctx)
+static bool blk_mq_sched_restart_hctx(struct blk_mq_hw_ctx *hctx)
 {
 	if (test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state)) {
 		clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
-		if (blk_mq_hctx_has_pending(hctx))
+		if (blk_mq_hctx_has_pending(hctx)) {
 			blk_mq_run_hw_queue(hctx, true);
+			return true;
+		}
 	}
+	return false;
 }
 
-void blk_mq_sched_restart_queues(struct blk_mq_hw_ctx *hctx)
-{
-	struct request_queue *q = hctx->queue;
-	unsigned int i;
+/**
+ * list_for_each_entry_rcu_rr - iterate in a round-robin fashion over rcu list
+ * @pos:    loop cursor.
+ * @skip:   the list element that will not be examined. Iteration starts at
+ *          @skip->next.
+ * @head:   head of the list to examine. This list must have at least one
+ *          element, namely @skip.
+ * @member: name of the list_head structure within typeof(*pos).
+ */
+#define list_for_each_entry_rcu_rr(pos, skip, head, member)		\
+	for ((pos) = (skip);						\
+	     (pos = (pos)->member.next != (head) ? list_entry_rcu(	\
+			(pos)->member.next, typeof(*pos), member) :	\
+	      list_entry_rcu((pos)->member.next->next, typeof(*pos), member)), \
+	     (pos) != (skip); )
 
-	if (test_bit(QUEUE_FLAG_RESTART, &q->queue_flags)) {
-		if (test_and_clear_bit(QUEUE_FLAG_RESTART, &q->queue_flags)) {
-			queue_for_each_hw_ctx(q, hctx, i)
-				blk_mq_sched_restart_hctx(hctx);
+/*
+ * Called after a driver tag has been freed to check whether a hctx needs to
+ * be restarted. Restarts @hctx if its tag set is not shared. Restarts hardware
+ * queues in a round-robin fashion if the tag set of @hctx is shared with other
+ * hardware queues.
+ */
+void blk_mq_sched_restart(struct blk_mq_hw_ctx *const hctx)
+{
+	struct blk_mq_tags *const tags = hctx->tags;
+	struct blk_mq_tag_set *const set = hctx->queue->tag_set;
+	struct request_queue *const queue = hctx->queue, *q;
+	struct blk_mq_hw_ctx *hctx2;
+	unsigned int i, j;
+
+	if (set->flags & BLK_MQ_F_TAG_SHARED) {
+		rcu_read_lock();
+		list_for_each_entry_rcu_rr(q, queue, &set->tag_list,
+					   tag_set_list) {
+			queue_for_each_hw_ctx(q, hctx2, i)
+				if (hctx2->tags == tags &&
+				    blk_mq_sched_restart_hctx(hctx2))
+					goto done;
+		}
+		j = hctx->queue_num + 1;
+		for (i = 0; i < queue->nr_hw_queues; i++, j++) {
+			if (j == queue->nr_hw_queues)
+				j = 0;
+			hctx2 = queue->queue_hw_ctx[j];
+			if (hctx2->tags == tags &&
+			    blk_mq_sched_restart_hctx(hctx2))
+				break;
 		}
+done:
+		rcu_read_unlock();
 	} else {
 		blk_mq_sched_restart_hctx(hctx);
 	}
diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index a75b16b123f7..4e3fc2a40207 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -19,7 +19,7 @@ bool blk_mq_sched_try_merge(struct request_queue *q, struct bio *bio,
 				struct request **merged_request);
 bool __blk_mq_sched_bio_merge(struct request_queue *q, struct bio *bio);
 bool blk_mq_sched_try_insert_merge(struct request_queue *q, struct request *rq);
-void blk_mq_sched_restart_queues(struct blk_mq_hw_ctx *hctx);
+void blk_mq_sched_restart(struct blk_mq_hw_ctx *hctx);
 
 void blk_mq_sched_insert_request(struct request *rq, bool at_head,
 				 bool run_queue, bool async, bool can_block);
@@ -131,20 +131,6 @@ static inline void blk_mq_sched_mark_restart_hctx(struct blk_mq_hw_ctx *hctx)
 		set_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
 }
 
-/*
- * Mark a hardware queue and the request queue it belongs to as needing a
- * restart.
- */
-static inline void blk_mq_sched_mark_restart_queue(struct blk_mq_hw_ctx *hctx)
-{
-	struct request_queue *q = hctx->queue;
-
-	if (!test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state))
-		set_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
-	if (!test_bit(QUEUE_FLAG_RESTART, &q->queue_flags))
-		set_bit(QUEUE_FLAG_RESTART, &q->queue_flags);
-}
-
 static inline bool blk_mq_sched_needs_restart(struct blk_mq_hw_ctx *hctx)
 {
 	return test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index c26464f9649a..dba34eb79a08 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -351,7 +351,7 @@ void __blk_mq_finish_request(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
 		blk_mq_put_tag(hctx, hctx->tags, ctx, rq->tag);
 	if (sched_tag != -1)
 		blk_mq_sched_completed_request(hctx, rq);
-	blk_mq_sched_restart_queues(hctx);
+	blk_mq_sched_restart(hctx);
 	blk_queue_exit(q);
 }
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 3cf241b0814d..dc6c8d39d462 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -615,7 +615,6 @@ struct request_queue {
 #define QUEUE_FLAG_FLUSH_NQ    25	/* flush not queueuable */
 #define QUEUE_FLAG_DAX         26	/* device supports DAX */
 #define QUEUE_FLAG_STATS       27	/* track rq completion times */
-#define QUEUE_FLAG_RESTART     28	/* queue needs restart at completion */
 #define QUEUE_FLAG_POLL_STATS  29	/* collecting stats for hybrid polling */
 #define QUEUE_FLAG_REGISTERED  30	/* queue has been registered to a disk */
 
-- 
2.12.0


* [PATCH v4 3/6] blk-mq: Clarify comments in blk_mq_dispatch_rq_list()
@ 2017-04-07 18:16   ` Bart Van Assche
From: Bart Van Assche @ 2017-04-07 18:16 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, linux-scsi, Bart Van Assche, Omar Sandoval,
	Christoph Hellwig, Hannes Reinecke

The blk_mq_dispatch_rq_list() implementation got modified several
times, but the comments in that function were not updated every
time. Since what is going on in that function is nontrivial, update
the comments in blk_mq_dispatch_rq_list().

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
---
 block/blk-mq.c | 28 ++++++++++++++++++----------
 1 file changed, 18 insertions(+), 10 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index dba34eb79a08..aff85d41cea3 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1063,8 +1063,8 @@ bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *hctx, struct list_head *list)
 	 */
 	if (!list_empty(list)) {
 		/*
-		 * If we got a driver tag for the next request already,
-		 * free it again.
+		 * If an I/O scheduler has been configured and we got a driver
+		 * tag for the next request already, free it again.
 		 */
 		rq = list_first_entry(list, struct request, queuelist);
 		blk_mq_put_driver_tag(rq);
@@ -1074,16 +1074,24 @@ bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *hctx, struct list_head *list)
 		spin_unlock(&hctx->lock);
 
 		/*
-		 * the queue is expected stopped with BLK_MQ_RQ_QUEUE_BUSY, but
-		 * it's possible the queue is stopped and restarted again
-		 * before this. Queue restart will dispatch requests. And since
-		 * requests in rq_list aren't added into hctx->dispatch yet,
-		 * the requests in rq_list might get lost.
+		 * If SCHED_RESTART was set by the caller of this function and
+		 * it is no longer set that means that it was cleared by another
+		 * thread and hence that a queue rerun is needed.
 		 *
-		 * blk_mq_run_hw_queue() already checks the STOPPED bit
+		 * If TAG_WAITING is set that means that an I/O scheduler has
+		 * been configured and another thread is waiting for a driver
+		 * tag. To guarantee fairness, do not rerun this hardware queue
+		 * but let the other thread grab the driver tag.
 		 *
-		 * If RESTART or TAG_WAITING is set, then let completion restart
-		 * the queue instead of potentially looping here.
+		 * If no I/O scheduler has been configured it is possible that
+		 * the hardware queue got stopped and restarted before requests
+		 * were pushed back onto the dispatch list. Rerun the queue to
+		 * avoid starvation. Notes:
+		 * - blk_mq_run_hw_queue() checks whether or not a queue has
+		 *   been stopped before rerunning a queue.
+		 * - Some but not all block drivers stop a queue before
+		 *   returning BLK_MQ_RQ_QUEUE_BUSY. Two exceptions are scsi-mq
+		 *   and dm-rq.
 		 */
 		if (!blk_mq_sched_needs_restart(hctx) &&
 		    !test_bit(BLK_MQ_S_TAG_WAITING, &hctx->state))
-- 
2.12.0


* [PATCH v4 4/6] blk-mq: Introduce blk_mq_delay_run_hw_queue()
@ 2017-04-07 18:16   ` Bart Van Assche
From: Bart Van Assche @ 2017-04-07 18:16 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, linux-scsi, Bart Van Assche, Christoph Hellwig,
	Hannes Reinecke, Long Li, K . Y . Srinivasan

Introduce a function that runs a hardware queue unconditionally
after a delay. Note: there is already a function that stops and
restarts a hardware queue after a delay, namely blk_mq_delay_queue().

This function will be used in the next patch in this series.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Long Li <longli@microsoft.com>
Cc: K. Y. Srinivasan <kys@microsoft.com>
---
 block/blk-mq.c         | 32 ++++++++++++++++++++++++++++++--
 include/linux/blk-mq.h |  2 ++
 2 files changed, 32 insertions(+), 2 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index aff85d41cea3..836e3a17da54 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1146,7 +1146,8 @@ static int blk_mq_hctx_next_cpu(struct blk_mq_hw_ctx *hctx)
 	return hctx->next_cpu;
 }
 
-void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
+static void __blk_mq_delay_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async,
+					unsigned long msecs)
 {
 	if (unlikely(blk_mq_hctx_stopped(hctx) ||
 		     !blk_mq_hw_queue_mapped(hctx)))
@@ -1163,7 +1164,24 @@ void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
 		put_cpu();
 	}
 
-	kblockd_schedule_work_on(blk_mq_hctx_next_cpu(hctx), &hctx->run_work);
+	if (msecs == 0)
+		kblockd_schedule_work_on(blk_mq_hctx_next_cpu(hctx),
+					 &hctx->run_work);
+	else
+		kblockd_schedule_delayed_work_on(blk_mq_hctx_next_cpu(hctx),
+						 &hctx->delayed_run_work,
+						 msecs_to_jiffies(msecs));
+}
+
+void blk_mq_delay_run_hw_queue(struct blk_mq_hw_ctx *hctx, unsigned long msecs)
+{
+	__blk_mq_delay_run_hw_queue(hctx, true, msecs);
+}
+EXPORT_SYMBOL(blk_mq_delay_run_hw_queue);
+
+void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
+{
+	__blk_mq_delay_run_hw_queue(hctx, async, 0);
 }
 
 void blk_mq_run_hw_queues(struct request_queue *q, bool async)
@@ -1266,6 +1284,15 @@ static void blk_mq_run_work_fn(struct work_struct *work)
 	__blk_mq_run_hw_queue(hctx);
 }
 
+static void blk_mq_delayed_run_work_fn(struct work_struct *work)
+{
+	struct blk_mq_hw_ctx *hctx;
+
+	hctx = container_of(work, struct blk_mq_hw_ctx, delayed_run_work.work);
+
+	__blk_mq_run_hw_queue(hctx);
+}
+
 static void blk_mq_delay_work_fn(struct work_struct *work)
 {
 	struct blk_mq_hw_ctx *hctx;
@@ -1866,6 +1893,7 @@ static int blk_mq_init_hctx(struct request_queue *q,
 		node = hctx->numa_node = set->numa_node;
 
 	INIT_WORK(&hctx->run_work, blk_mq_run_work_fn);
+	INIT_DELAYED_WORK(&hctx->delayed_run_work, blk_mq_delayed_run_work_fn);
 	INIT_DELAYED_WORK(&hctx->delay_work, blk_mq_delay_work_fn);
 	spin_lock_init(&hctx->lock);
 	INIT_LIST_HEAD(&hctx->dispatch);
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index bdea90d75274..b90c3d5766cd 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -51,6 +51,7 @@ struct blk_mq_hw_ctx {
 
 	atomic_t		nr_active;
 
+	struct delayed_work	delayed_run_work;
 	struct delayed_work	delay_work;
 
 	struct hlist_node	cpuhp_dead;
@@ -236,6 +237,7 @@ void blk_mq_stop_hw_queues(struct request_queue *q);
 void blk_mq_start_hw_queues(struct request_queue *q);
 void blk_mq_start_stopped_hw_queue(struct blk_mq_hw_ctx *hctx, bool async);
 void blk_mq_start_stopped_hw_queues(struct request_queue *q, bool async);
+void blk_mq_delay_run_hw_queue(struct blk_mq_hw_ctx *hctx, unsigned long msecs);
 void blk_mq_run_hw_queues(struct request_queue *q, bool async);
 void blk_mq_delay_queue(struct blk_mq_hw_ctx *hctx, unsigned long msecs);
 void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
-- 
2.12.0

^ permalink raw reply related	[flat|nested] 55+ messages in thread


* [PATCH v4 5/6] scsi: Avoid that SCSI queues get stuck
  2017-04-07 18:16 ` Bart Van Assche
@ 2017-04-07 18:16   ` Bart Van Assche
  -1 siblings, 0 replies; 55+ messages in thread
From: Bart Van Assche @ 2017-04-07 18:16 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, linux-scsi, Bart Van Assche, Martin K . Petersen,
	James Bottomley, Christoph Hellwig, Hannes Reinecke,
	Sagi Grimberg, Long Li, K . Y . Srinivasan

If a .queue_rq() function returns BLK_MQ_RQ_QUEUE_BUSY then the block
driver that implements that function is responsible for rerunning the
hardware queue once requests can be queued again successfully.

commit 52d7f1b5c2f3 ("blk-mq: Avoid that requeueing starts stopped
queues") removed the blk_mq_stop_hw_queue() call from scsi_queue_rq()
for the BLK_MQ_RQ_QUEUE_BUSY case. Hence change all calls to functions
that are intended to rerun a busy queue so that they examine all
hardware queues instead of only stopped queues.

Since no functions other than scsi_internal_device_block() and
scsi_internal_device_unblock() should ever stop or restart a SCSI
queue, change the blk_mq_delay_queue() call into a
blk_mq_delay_run_hw_queue() call.

Fixes: commit 52d7f1b5c2f3 ("blk-mq: Avoid that requeueing starts stopped queues")
Fixes: commit 7e79dadce222 ("blk-mq: stop hardware queue in blk_mq_delay_queue()")
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Long Li <longli@microsoft.com>
Cc: K. Y. Srinivasan <kys@microsoft.com>
---
 drivers/scsi/scsi_lib.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 11972d1075f1..7bc4513bf4e4 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -496,7 +496,7 @@ static void scsi_run_queue(struct request_queue *q)
 		scsi_starved_list_run(sdev->host);
 
 	if (q->mq_ops)
-		blk_mq_start_stopped_hw_queues(q, false);
+		blk_mq_run_hw_queues(q, false);
 	else
 		blk_run_queue(q);
 }
@@ -667,7 +667,7 @@ static bool scsi_end_request(struct request *req, int error,
 		    !list_empty(&sdev->host->starved_list))
 			kblockd_schedule_work(&sdev->requeue_work);
 		else
-			blk_mq_start_stopped_hw_queues(q, true);
+			blk_mq_run_hw_queues(q, true);
 	} else {
 		unsigned long flags;
 
@@ -1974,7 +1974,7 @@ static int scsi_queue_rq(struct blk_mq_hw_ctx *hctx,
 	case BLK_MQ_RQ_QUEUE_BUSY:
 		if (atomic_read(&sdev->device_busy) == 0 &&
 		    !scsi_device_blocked(sdev))
-			blk_mq_delay_queue(hctx, SCSI_QUEUE_DELAY);
+			blk_mq_delay_run_hw_queue(hctx, SCSI_QUEUE_DELAY);
 		break;
 	case BLK_MQ_RQ_QUEUE_ERROR:
 		/*
-- 
2.12.0

^ permalink raw reply related	[flat|nested] 55+ messages in thread


* [PATCH v4 6/6] dm rq: Avoid that request processing stalls sporadically
  2017-04-07 18:16 ` Bart Van Assche
@ 2017-04-07 18:16   ` Bart Van Assche
  -1 siblings, 0 replies; 55+ messages in thread
From: Bart Van Assche @ 2017-04-07 18:16 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, linux-scsi, Bart Van Assche, Mike Snitzer, dm-devel

While running the srp-test software I noticed that request
processing stalls sporadically at the beginning of a test, namely
when mkfs is run against a dm-mpath device. Every time that happened,
the following command was sufficient to resume request
processing:

    echo run >/sys/kernel/debug/block/dm-0/state

This patch avoids that such request processing stalls occur. The
test I ran is as follows:

    while srp-test/run_tests -d -r 30 -t 02-mq; do :; done

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: dm-devel@redhat.com
---
 drivers/md/dm-rq.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index 6886bf160fb2..d19af1d21f4c 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -755,6 +755,7 @@ static int dm_mq_queue_rq(struct blk_mq_hw_ctx *hctx,
 		/* Undo dm_start_request() before requeuing */
 		rq_end_stats(md, rq);
 		rq_completed(md, rq_data_dir(rq), false);
+		blk_mq_delay_run_hw_queue(hctx, 100/*ms*/);
 		return BLK_MQ_RQ_QUEUE_BUSY;
 	}
 
-- 
2.12.0

^ permalink raw reply related	[flat|nested] 55+ messages in thread


* Re: [PATCH v4 0/6] Avoid that scsi-mq and dm-mq queue processing stalls sporadically
  2017-04-07 18:16 ` Bart Van Assche
                   ` (6 preceding siblings ...)
  (?)
@ 2017-04-07 18:23 ` Jens Axboe
  2017-04-07 18:33     ` Bart Van Assche
  -1 siblings, 1 reply; 55+ messages in thread
From: Jens Axboe @ 2017-04-07 18:23 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: linux-block, linux-scsi

On 04/07/2017 12:16 PM, Bart Van Assche wrote:
> Hello Jens,
> 
> The six patches in this patch series fix the queue lockup I reported
> recently on the linux-block mailing list. Please consider these patches
> for inclusion in the upstream kernel.

Some of this we need in 4.11, but not all of it. I can't be applying patches
that "improve scalability" at this point.

4-6 looks like what we want for 4.11, I'll see if those apply directly. Then
we can put 1-3 on top in 4.12, with the others pulled in first.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 55+ messages in thread


* Re: [PATCH v4 0/6] Avoid that scsi-mq and dm-mq queue processing stalls sporadically
@ 2017-04-07 18:33     ` Bart Van Assche
  0 siblings, 0 replies; 55+ messages in thread
From: Bart Van Assche @ 2017-04-07 18:33 UTC (permalink / raw)
  To: axboe; +Cc: linux-scsi, linux-block

On Fri, 2017-04-07 at 12:23 -0600, Jens Axboe wrote:
> On 04/07/2017 12:16 PM, Bart Van Assche wrote:
> > Hello Jens,
> > 
> > The six patches in this patch series fix the queue lockup I reported
> > recently on the linux-block mailing list. Please consider these patches
> > for inclusion in the upstream kernel.
> 
> Some of this we need in 4.11, but not all of it. I can't be applying patches
> that "improve scalability" at this point.
> 
> 4-6 looks like what we want for 4.11, I'll see if those apply directly. Then
> we can put 1-3 on top in 4.12, with the others pulled in first.

Hello Jens,

Please note that patch 2/6 is a bug fix. The current implementation of
blk_mq_sched_restart_queues() only considers hardware queues associated with
the same request queue as the hardware queue that has been passed as an
argument. If a tag set is shared across request queues - as is the case for
SCSI - then all request queues that share a tag set with the hctx argument
must be considered.

Bart.

^ permalink raw reply	[flat|nested] 55+ messages in thread


* Re: [PATCH v4 0/6] Avoid that scsi-mq and dm-mq queue processing stalls sporadically
@ 2017-04-07 18:39       ` Bart Van Assche
  0 siblings, 0 replies; 55+ messages in thread
From: Bart Van Assche @ 2017-04-07 18:39 UTC (permalink / raw)
  To: axboe; +Cc: linux-scsi, linux-block

On Fri, 2017-04-07 at 11:33 -0700, Bart Van Assche wrote:
> On Fri, 2017-04-07 at 12:23 -0600, Jens Axboe wrote:
> > On 04/07/2017 12:16 PM, Bart Van Assche wrote:
> > > Hello Jens,
> > > 
> > > The six patches in this patch series fix the queue lockup I reported
> > > recently on the linux-block mailing list. Please consider these patches
> > > for inclusion in the upstream kernel.
> > 
> > Some of this we need in 4.11, but not all of it. I can't be applying patches
> > that "improve scalability" at this point.
> > 
> > 4-6 looks like what we want for 4.11, I'll see if those apply directly. Then
> > we can put 1-3 on top in 4.12, with the others pulled in first.
> 
> Hello Jens,
> 
> Please note that patch 2/6 is a bug fix. The current implementation of
> blk_mq_sched_restart_queues() only considers hardware queues associated with
> the same request queue as the hardware queue that has been passed as an
> argument. If a tag set is shared across request queues - as is the case for
> SCSI - then all request queues that share a tag set with the hctx argument
> must be considered.

(replying to my own e-mail)

Hello Jens,

If you want I can split that patch into two patches - one that runs all hardware
queues with which the tag set is shared and one that switches from rerunning
all hardware queues to one hardware queue.

Bart.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v4 0/6] Avoid that scsi-mq and dm-mq queue processing stalls sporadically
  2017-04-07 18:39       ` Bart Van Assche
  (?)
@ 2017-04-07 18:51       ` Jens Axboe
  -1 siblings, 0 replies; 55+ messages in thread
From: Jens Axboe @ 2017-04-07 18:51 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: linux-scsi, linux-block

On 04/07/2017 12:39 PM, Bart Van Assche wrote:
> On Fri, 2017-04-07 at 11:33 -0700, Bart Van Assche wrote:
>> On Fri, 2017-04-07 at 12:23 -0600, Jens Axboe wrote:
>>> On 04/07/2017 12:16 PM, Bart Van Assche wrote:
>>>> Hello Jens,
>>>>
>>>> The six patches in this patch series fix the queue lockup I reported
>>>> recently on the linux-block mailing list. Please consider these patches
>>>> for inclusion in the upstream kernel.
>>>
>>> Some of this we need in 4.11, but not all of it. I can't be applying patches
>>> that "improve scalability" at this point.
>>>
>>> 4-6 looks like what we want for 4.11, I'll see if those apply directly. Then
>>> we can put 1-3 on top in 4.12, with the others pulled in first.
>>
>> Hello Jens,
>>
>> Please note that patch 2/6 is a bug fix. The current implementation of
>> blk_mq_sched_restart_queues() only considers hardware queues associated with
>> the same request queue as the hardware queue that has been passed as an
>> argument. If a tag set is shared across request queues - as is the case for
>> SCSI - then all request queues that share a tag set with the hctx argument
>> must be considered.
> 
> (replying to my own e-mail)
> 
> Hello Jens,
> 
> If you want I can split that patch into two patches - one that runs all hardware
> queues with which the tag set is shared and one that switches from rerunning
> all hardware queues to one hardware queue.

I already put it in, but this is getting very awkward. We're at -rc5 time, patches
going into mainline should be TINY. And now I'm sitting on this, that I have to
justify:

 15 files changed, 281 insertions(+), 164 deletions(-)

and where one of the patches reads like it's a performance improvement, when
in reality it's fixing a hang. So yes, the patch should have been split in
two, and the series should have been ordered so that the first patches could
go into 4.11, and the rest on top of that in 4.12. Did we really need a
patch clarifying comments in that series? Probably not.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v4 1/6] blk-mq: Make it safe to use RCU to iterate over blk_mq_tag_set.tag_list
  2017-04-07 18:16   ` Bart Van Assche
  (?)
@ 2017-04-10  7:10   ` Christoph Hellwig
  -1 siblings, 0 replies; 55+ messages in thread
From: Christoph Hellwig @ 2017-04-10  7:10 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, linux-scsi, Christoph Hellwig, Hannes Reinecke

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v4 2/6] blk-mq: Restart a single queue if tag sets are shared
  2017-04-07 18:16   ` Bart Van Assche
  (?)
@ 2017-04-10  7:11   ` Christoph Hellwig
  -1 siblings, 0 replies; 55+ messages in thread
From: Christoph Hellwig @ 2017-04-10  7:11 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, linux-scsi, Christoph Hellwig, Hannes Reinecke

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v4 3/6] blk-mq: Clarify comments in blk_mq_dispatch_rq_list()
  2017-04-07 18:16   ` Bart Van Assche
  (?)
@ 2017-04-10  7:11   ` Christoph Hellwig
  -1 siblings, 0 replies; 55+ messages in thread
From: Christoph Hellwig @ 2017-04-10  7:11 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, linux-scsi, Omar Sandoval,
	Christoph Hellwig, Hannes Reinecke

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v4 4/6] blk-mq: Introduce blk_mq_delay_run_hw_queue()
  2017-04-07 18:16   ` Bart Van Assche
  (?)
@ 2017-04-10  7:12   ` Christoph Hellwig
  2017-04-10 15:02     ` Jens Axboe
  -1 siblings, 1 reply; 55+ messages in thread
From: Christoph Hellwig @ 2017-04-10  7:12 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, linux-scsi, Christoph Hellwig,
	Hannes Reinecke, Long Li, K . Y . Srinivasan

> +	if (msecs == 0)
> +		kblockd_schedule_work_on(blk_mq_hctx_next_cpu(hctx),
> +					 &hctx->run_work);
> +	else
> +		kblockd_schedule_delayed_work_on(blk_mq_hctx_next_cpu(hctx),
> +						 &hctx->delayed_run_work,
> +						 msecs_to_jiffies(msecs));
> +}

I'd rather make run_work a delayed_work (again) and use
kblockd_schedule_delayed_work_on with a timeout of zero for the immediate
run case instead of having two competing work structs.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v4 5/6] scsi: Avoid that SCSI queues get stuck
  2017-04-07 18:16   ` Bart Van Assche
  (?)
@ 2017-04-10  7:12   ` Christoph Hellwig
  -1 siblings, 0 replies; 55+ messages in thread
From: Christoph Hellwig @ 2017-04-10  7:12 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, linux-scsi, Martin K . Petersen,
	James Bottomley, Christoph Hellwig, Hannes Reinecke,
	Sagi Grimberg, Long Li, K . Y . Srinivasan

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v4 4/6] blk-mq: Introduce blk_mq_delay_run_hw_queue()
  2017-04-10  7:12   ` Christoph Hellwig
@ 2017-04-10 15:02     ` Jens Axboe
  0 siblings, 0 replies; 55+ messages in thread
From: Jens Axboe @ 2017-04-10 15:02 UTC (permalink / raw)
  To: Christoph Hellwig, Bart Van Assche
  Cc: linux-block, linux-scsi, Hannes Reinecke, Long Li, K . Y . Srinivasan

On 04/10/2017 01:12 AM, Christoph Hellwig wrote:
>> +	if (msecs == 0)
>> +		kblockd_schedule_work_on(blk_mq_hctx_next_cpu(hctx),
>> +					 &hctx->run_work);
>> +	else
>> +		kblockd_schedule_delayed_work_on(blk_mq_hctx_next_cpu(hctx),
>> +						 &hctx->delayed_run_work,
>> +						 msecs_to_jiffies(msecs));
>> +}
> 
> I'd rather make run_work a delayed_work (again) and use
> kblockd_schedule_delayed_work_on with a timeout of zero for the immediate
> run case instead of having two competing work structs.

Yeah that's a good point, it'd have to be an incremental patch at this
point though. Also note that blk_mq_stop_hw_queue() isn't currently
canceling the new ->delayed_run_work; that looks like a bug.

And looking at it, right now we have 3 (three!) work items in the
hardware queue. The two delayed items differ in that one of them only
runs the queue if it was previously stopped; that's it. The non-delayed
one is identical to the delayed variant that does not check the stopped
state.

I'll send out a patch.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v4 6/6] dm rq: Avoid that request processing stalls sporadically
  2017-04-07 18:16   ` Bart Van Assche
@ 2017-04-11 16:09     ` Mike Snitzer
  -1 siblings, 0 replies; 55+ messages in thread
From: Mike Snitzer @ 2017-04-11 16:09 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: Jens Axboe, linux-block, linux-scsi, dm-devel

On Fri, Apr 07 2017 at  2:16pm -0400,
Bart Van Assche <bart.vanassche@sandisk.com> wrote:

> While running the srp-test software I noticed that request
> processing stalls sporadically at the beginning of a test, namely
> when mkfs is run against a dm-mpath device. Every time when that
> happened the following command was sufficient to resume request
> processing:
> 
>     echo run >/sys/kernel/debug/block/dm-0/state
> 
> This patch avoids that such request processing stalls occur. The
> test I ran is as follows:
> 
>     while srp-test/run_tests -d -r 30 -t 02-mq; do :; done
> 
> Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
> Cc: Mike Snitzer <snitzer@redhat.com>
> Cc: dm-devel@redhat.com
> ---
>  drivers/md/dm-rq.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
> index 6886bf160fb2..d19af1d21f4c 100644
> --- a/drivers/md/dm-rq.c
> +++ b/drivers/md/dm-rq.c
> @@ -755,6 +755,7 @@ static int dm_mq_queue_rq(struct blk_mq_hw_ctx *hctx,
>  		/* Undo dm_start_request() before requeuing */
>  		rq_end_stats(md, rq);
>  		rq_completed(md, rq_data_dir(rq), false);
> +		blk_mq_delay_run_hw_queue(hctx, 100/*ms*/);
>  		return BLK_MQ_RQ_QUEUE_BUSY;
>  	}
>  
> -- 
> 2.12.0
> 

I really appreciate your hard work Bart but this looks like a cheap
hack.

I'm clearly too late to stop this from going in (given Jens got it
merged for -rc6) but: this has no place in dm-mq (or any blk-mq
driver).  If it is needed it should be elevated to blk-mq core to
trigger blk_mq_delay_run_hw_queue() when BLK_MQ_RQ_QUEUE_BUSY is
returned from blk_mq_ops' .queue_rq.

If this dm-mq specific commit is justified, the case certainly isn't
spelled out in the commit header.

^ permalink raw reply	[flat|nested] 55+ messages in thread


* Re: [PATCH v4 6/6] dm rq: Avoid that request processing stalls sporadically
@ 2017-04-11 16:26       ` Bart Van Assche
  0 siblings, 0 replies; 55+ messages in thread
From: Bart Van Assche @ 2017-04-11 16:26 UTC (permalink / raw)
  To: snitzer; +Cc: linux-scsi, dm-devel, linux-block, axboe

On Tue, 2017-04-11 at 12:09 -0400, Mike Snitzer wrote:
> This has no place in dm-mq (or any blk-mq
> driver).  If it is needed it should be elevated to blk-mq core to
> trigger blk_mq_delay_run_hw_queue() when BLK_MQ_RQ_QUEUE_BUSY is
> returned from blk_mq_ops' .queue_rq.

Hello Mike,

If the blk-mq core had to figure out whether or not a queue is still busy
without any cooperation from the blk-mq driver, all it could do is attempt
to rerun that queue from time to time. But at which intervals should the
blk-mq core check whether or not a queue is still busy? Would it be possible
to choose polling intervals that work well for all block drivers? Consider
e.g. the dm-mpath driver: multipath_busy() returns true as long as path
initialization is in progress, and path initialization can take a long time.
The (indirect) call to blk_mq_run_queue() from pg_init_done() resumes request
processing immediately after path initialization has finished. Sorry, but I
don't think it is possible to invent an algorithm for the blk-mq core that
both guarantees that a queue is rerun as soon as it is no longer busy and
avoids wasting plenty of CPU cycles on checking whether a queue is still
busy.

Bart.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v4 6/6] dm rq: Avoid that request processing stalls sporadically
  2017-04-11 16:26       ` Bart Van Assche
  (?)
@ 2017-04-11 17:47       ` Mike Snitzer
  2017-04-11 17:51           ` Bart Van Assche
  -1 siblings, 1 reply; 55+ messages in thread
From: Mike Snitzer @ 2017-04-11 17:47 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: linux-scsi, dm-devel, linux-block, axboe

On Tue, Apr 11 2017 at 12:26pm -0400,
Bart Van Assche <Bart.VanAssche@sandisk.com> wrote:

> On Tue, 2017-04-11 at 12:09 -0400, Mike Snitzer wrote:
> > This has no place in dm-mq (or any blk-mq
> > driver).  If it is needed it should be elevated to blk-mq core to
> > trigger blk_mq_delay_run_hw_queue() when BLK_MQ_RQ_QUEUE_BUSY is
> > returned from blk_mq_ops' .queue_rq.
> 
> Hello Mike,
> 
> If the blk-mq core had to figure out whether or not a queue is still busy
> without any cooperation from the blk-mq driver, all it could do is attempt
> to rerun that queue from time to time. But at which intervals should the
> blk-mq core check whether or not a queue is still busy? Would it be possible
> to choose polling intervals that work well for all block drivers? Consider
> e.g. the dm-mpath driver: multipath_busy() returns true as long as path
> initialization is in progress, and path initialization can take a long time.
> The (indirect) call to blk_mq_run_queue() from pg_init_done() resumes request
> processing immediately after path initialization has finished. Sorry, but I
> don't think it is possible to invent an algorithm for the blk-mq core that
> both guarantees that a queue is rerun as soon as it is no longer busy and
> avoids wasting plenty of CPU cycles on checking whether a queue is still
> busy.

Sorry but that isn't a very strong argument for what you've done.

I mean, I do appreciate your point that the two BLK_MQ_RQ_QUEUE_BUSY
returns in dm_mq_queue_rq() are not equal, but that could easily be
conveyed using a new return value.

Anyway, the point is that no blk-mq driver should need to be concerned
about whether its request will get resubmitted (and the associated hw
queue re-run) if it returns BLK_MQ_RQ_QUEUE_BUSY.

Your change is a means to an end but it just solves the problem in a
very hackish way.  Other drivers will very likely be caught out by
this blk-mq quirk in the future.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v4 6/6] dm rq: Avoid that request processing stalls sporadically
@ 2017-04-11 17:51           ` Bart Van Assche
  0 siblings, 0 replies; 55+ messages in thread
From: Bart Van Assche @ 2017-04-11 17:51 UTC (permalink / raw)
  To: snitzer; +Cc: linux-scsi, dm-devel, linux-block, axboe

On Tue, 2017-04-11 at 13:47 -0400, Mike Snitzer wrote:
> Other drivers will very likely be caught out by
> this blk-mq quirk in the future.

Hello Mike,

Are you aware that the requirement that blk-mq drivers rerun the queue after
having returned BLK_MQ_RQ_QUEUE_BUSY is shared with traditional block
drivers? From dm_old_request_fn():

	if (... || (ti->type->busy && ti->type->busy(ti))) {
		blk_delay_queue(q, 10);
		return;
	}

Bart.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v4 6/6] dm rq: Avoid that request processing stalls sporadically
@ 2017-04-11 18:03             ` Mike Snitzer
  0 siblings, 0 replies; 55+ messages in thread
From: Mike Snitzer @ 2017-04-11 18:03 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: linux-block, axboe, dm-devel, linux-scsi

On Tue, Apr 11 2017 at  1:51pm -0400,
Bart Van Assche <Bart.VanAssche@sandisk.com> wrote:

> On Tue, 2017-04-11 at 13:47 -0400, Mike Snitzer wrote:
> > Other drivers will very likely be caught out by
> > this blk-mq quirk in the future.
> 
> Hello Mike,
> 
> Are you aware that the requirement that blk-mq drivers rerun the queue after
> having returned BLK_MQ_RQ_QUEUE_BUSY is shared with traditional block
> drivers? From dm_old_request_fn():
> 
> 	if (... || (ti->type->busy && ti->type->busy(ti))) {
> 		blk_delay_queue(q, 10);
> 		return;
> 	}

No, and pointing to DM code that does something with the old .request_fn
case to justify why blk-mq requires the same is pretty specious.

Rather than working so hard to use DM code against me, your argument
should be: "blk-mq drivers X, Y and Z rerun the hw queue; this is a well
established pattern"

I see drivers/nvme/host/fc.c:nvme_fc_start_fcp_op() does.  But that is
only one other driver out of ~20 BLK_MQ_RQ_QUEUE_BUSY returns
tree-wide.

Could be there are some others, but hardly a well-established pattern.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v4 6/6] dm rq: Avoid that request processing stalls sporadically
@ 2017-04-11 18:18               ` Bart Van Assche
  0 siblings, 0 replies; 55+ messages in thread
From: Bart Van Assche @ 2017-04-11 18:18 UTC (permalink / raw)
  To: snitzer; +Cc: linux-scsi, dm-devel, linux-block, axboe

On Tue, 2017-04-11 at 14:03 -0400, Mike Snitzer wrote:
> Rather than working so hard to use DM code against me, your argument
> should be: "blk-mq drivers X, Y and Z rerun the hw queue; this is a well
> established pattern"
> 
> I see drivers/nvme/host/fc.c:nvme_fc_start_fcp_op() does.  But that is
> only one other driver out of ~20 BLK_MQ_RQ_QUEUE_BUSY returns
> tree-wide.
> 
> Could be there are some others, but hardly a well-established pattern.

Hello Mike,

Several blk-mq drivers that can return BLK_MQ_RQ_QUEUE_BUSY from their
.queue_rq() implementation stop the request queue (blk_mq_stop_hw_queue())
before returning "busy" and restart the queue after the busy condition has
been cleared (blk_mq_start_stopped_hw_queues()). Examples are virtio_blk and
xen-blkfront. However, this approach is not appropriate for the dm-mq core
nor for the scsi core since both drivers already use the "stopped" state for
another purpose than tracking whether or not a hardware queue is busy. Hence
the blk_mq_delay_run_hw_queue() and blk_mq_run_hw_queue() calls in these last
two drivers to rerun a hardware queue after the busy state has been cleared.

Bart.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v4 6/6] dm rq: Avoid that request processing stalls sporadically
@ 2017-04-12  3:42                 ` Ming Lei
  0 siblings, 0 replies; 55+ messages in thread
From: Ming Lei @ 2017-04-12  3:42 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: snitzer, linux-scsi, dm-devel, linux-block, axboe

On Tue, Apr 11, 2017 at 06:18:36PM +0000, Bart Van Assche wrote:
> On Tue, 2017-04-11 at 14:03 -0400, Mike Snitzer wrote:
> > Rather than working so hard to use DM code against me, your argument
> > should be: "blk-mq drivers X, Y and Z rerun the hw queue; this is a well
> > established pattern"
> > 
> > I see drivers/nvme/host/fc.c:nvme_fc_start_fcp_op() does.  But that is
> > only one other driver out of ~20 BLK_MQ_RQ_QUEUE_BUSY returns
> > tree-wide.
> > 
> > Could be there are some others, but hardly a well-established pattern.
> 
> Hello Mike,
> 
> Several blk-mq drivers that can return BLK_MQ_RQ_QUEUE_BUSY from their
> .queue_rq() implementation stop the request queue (blk_mq_stop_hw_queue())
> before returning "busy" and restart the queue after the busy condition has
> been cleared (blk_mq_start_stopped_hw_queues()). Examples are virtio_blk and
> xen-blkfront. However, this approach is not appropriate for the dm-mq core
> nor for the scsi core since both drivers already use the "stopped" state for
> another purpose than tracking whether or not a hardware queue is busy. Hence
> the blk_mq_delay_run_hw_queue() and blk_mq_run_hw_queue() calls in these last
> two drivers to rerun a hardware queue after the busy state has been cleared.

But it looks like this patch just reruns the hw queue after 100ms, which
isn't necessarily after the busy state has been cleared, right?

Actually, if BLK_MQ_RQ_QUEUE_BUSY is returned from .queue_rq(), blk-mq
will buffer this request into hctx->dispatch and run the hw queue again,
so at first impression blk_mq_delay_run_hw_queue() shouldn't be needed in
this situation. Or maybe Bart has more background on this usage; it would
be better to document it in a comment?

Thanks,
Ming

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v4 0/6] Avoid that scsi-mq and dm-mq queue processing stalls sporadically
@ 2017-04-12 10:55   ` Benjamin Block
  0 siblings, 0 replies; 55+ messages in thread
From: Benjamin Block @ 2017-04-12 10:55 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: Jens Axboe, linux-block, linux-scsi

On Fri, Apr 07, 2017 at 11:16:48AM -0700, Bart Van Assche wrote:
> Hello Jens,
> 
> The six patches in this patch series fix the queue lockup I reported
> recently on the linux-block mailing list. Please consider these patches
> for inclusion in the upstream kernel.
> 

Hey Bart,

just out of curiosity. Is this maybe related to similar stuff happening
when CPUs are hot plugged - at least in that the stack gets stuck? Like
in this thread here:
https://www.mail-archive.com/linux-block@vger.kernel.org/msg06057.html

Would be interesting, because we recently saw similar stuff happening.


                                                    Beste Grüße / Best regards,
                                                      - Benjamin Block
-- 
Linux on z Systems Development         /         IBM Systems & Technology Group
		  IBM Deutschland Research & Development GmbH 
Vorsitz. AufsR.: Martina Koederitz     /        Geschäftsführung: Dirk Wittkopp
Sitz der Gesellschaft: Böblingen / Registergericht: AmtsG Stuttgart, HRB 243294

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v4 0/6] Avoid that scsi-mq and dm-mq queue processing stalls sporadically
@ 2017-04-12 18:11     ` Bart Van Assche
  0 siblings, 0 replies; 55+ messages in thread
From: Bart Van Assche @ 2017-04-12 18:11 UTC (permalink / raw)
  To: bblock; +Cc: linux-scsi, linux-block, axboe

On Wed, 2017-04-12 at 12:55 +0200, Benjamin Block wrote:
> On Fri, Apr 07, 2017 at 11:16:48AM -0700, Bart Van Assche wrote:
> > The six patches in this patch series fix the queue lockup I reported
> > recently on the linux-block mailing list. Please consider these patches
> > for inclusion in the upstream kernel.
> 
> just out of curiosity. Is this maybe related to similar stuff happening
> when CPUs are hot plugged - at least in that the stack gets stuck? Like
> in this thread here:
> https://www.mail-archive.com/linux-block@vger.kernel.org/msg06057.html
> 
> Would be interesting, because we recently saw similar stuff happening.

Hello Benjamin,

My proposal is to repeat that test with Jens' for-next branch. If the issue
still occurs with that tree then please check the contents of
/sys/kernel/debug/block/*/mq/*/{dispatch,*/rq_list}. That will allow you to
determine whether or not any block layer requests are still pending. If
running the command below resolves the deadlock then it means that a
trigger to run a block layer queue is still missing somewhere:

for a in /sys/kernel/debug/block/*/mq/state; do echo run >$a; done

See also git://git.kernel.dk/linux-block.git.

Bart.

^ permalink raw reply	[flat|nested] 55+ messages in thread


* Re: [PATCH v4 6/6] dm rq: Avoid that request processing stalls sporadically
  2017-04-12 18:38                   ` Bart Van Assche
@ 2017-04-13  2:20                     ` Ming Lei
  -1 siblings, 0 replies; 55+ messages in thread
From: Ming Lei @ 2017-04-13  2:20 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: linux-scsi, dm-devel, linux-block, snitzer, axboe

On Wed, Apr 12, 2017 at 06:38:07PM +0000, Bart Van Assche wrote:
> On Wed, 2017-04-12 at 11:42 +0800, Ming Lei wrote:
> > On Tue, Apr 11, 2017 at 06:18:36PM +0000, Bart Van Assche wrote:
> > > On Tue, 2017-04-11 at 14:03 -0400, Mike Snitzer wrote:
> > > > Rather than working so hard to use DM code against me, your argument
> > > > should be: "blk-mq drivers X, Y and Z rerun the hw queue; this is a well
> > > > established pattern"
> > > > 
> > > > I see drivers/nvme/host/fc.c:nvme_fc_start_fcp_op() does.  But that is
> > > > only one other driver out of ~20 BLK_MQ_RQ_QUEUE_BUSY returns
> > > > tree-wide.
> > > > 
> > > > Could be there are some others, but hardly a well-established pattern.
> > > 
> > > Hello Mike,
> > > 
> > > Several blk-mq drivers that can return BLK_MQ_RQ_QUEUE_BUSY from their
> > > .queue_rq() implementation stop the request queue (blk_mq_stop_hw_queue())
> > > before returning "busy" and restart the queue after the busy condition has
> > > been cleared (blk_mq_start_stopped_hw_queues()). Examples are virtio_blk and
> > > xen-blkfront. However, this approach is not appropriate for the dm-mq core
> > > nor for the scsi core since both drivers already use the "stopped" state for
> > > another purpose than tracking whether or not a hardware queue is busy. Hence
> > > the blk_mq_delay_run_hw_queue() and blk_mq_run_hw_queue() calls in these last
> > > two drivers to rerun a hardware queue after the busy state has been cleared.
> > 
> > But it looks like this patch just reruns the hw queue after 100 ms, which is
> > not necessarily after the busy state has been cleared, right?
> 
> Hello Ming,
> 
> That patch can be considered a first step that can be refined further,
> namely by modifying the dm-rq code such that dm-rq queues are only rerun
> after the busy condition has been cleared. The patch at the start of this
> thread is easier to review and easier to test than any patch that would
> only rerun dm-rq queues after the busy condition has been cleared.

OK, got it. It would have been better to add a comment about this change,
since rerunning the queue after 100 ms is actually a workaround instead of
a final solution.

> 
> > Actually if BLK_MQ_RQ_QUEUE_BUSY is returned from .queue_rq(), blk-mq
> > will buffer this request into hctx->dispatch and run the hw queue again,
> > so at first impression it looks like blk_mq_delay_run_hw_queue() shouldn't
> > be needed in this situation.
> 
> If the blk-mq core would always rerun a hardware queue if a block driver
> returns BLK_MQ_RQ_QUEUE_BUSY then that would cause 100% of a single CPU core

It won't cause 100% CPU utilization, since we restart the queue in the
completion path, and at that time at least one tag is available, so progress
can be made.

> to be busy with polling a hardware queue until the "busy" condition has been
> cleared. One can see easily that that's not what the blk-mq core does. From
> blk_mq_sched_dispatch_requests():
> 
> 	if (!list_empty(&rq_list)) {
> 		blk_mq_sched_mark_restart_hctx(hctx);
> 		did_work = blk_mq_dispatch_rq_list(q, &rq_list);
> 	}
> 
> From the end of blk_mq_dispatch_rq_list():
> 
> 	if (!list_empty(list)) {
> 		[ ... ]
> 		if (!blk_mq_sched_needs_restart(hctx) &&
> 		    !test_bit(BLK_MQ_S_TAG_WAITING, &hctx->state))
> 			blk_mq_run_hw_queue(hctx, true);
> 	}

That is exactly what I meant: blk-mq already provides this mechanism to
rerun the queue automatically in case of BLK_MQ_RQ_QUEUE_BUSY. If the
mechanism doesn't work well, we need to fix it there, so why burden
drivers with working around it?

> 
> In other words, the BLK_MQ_S_SCHED_RESTART flag is set before the dispatch list
> is examined and only if that flag gets cleared while blk_mq_dispatch_rq_list()
> is in progress by a concurrent blk_mq_sched_restart_hctx() call then the
> dispatch list will be rerun after a block driver returned BLK_MQ_RQ_QUEUE_BUSY.

Yes, the queue is rerun either in the completion path when
BLK_MQ_S_SCHED_RESTART is set, or right after .queue_rq() returns _BUSY
if the flag is cleared concurrently from the completion path.

So in theory we can make sure the queue will be run again if _BUSY
happens. What then is the root cause that forces us to add
blk_mq_delay_run_hw_queue(hctx, 100) in dm's .queue_rq()?

Thanks,
Ming

^ permalink raw reply	[flat|nested] 55+ messages in thread


* Re: [PATCH v4 0/6] Avoid that scsi-mq and dm-mq queue processing stalls sporadically
  2017-04-12 18:11     ` Bart Van Assche
@ 2017-04-13 12:23       ` Benjamin Block
  -1 siblings, 0 replies; 55+ messages in thread
From: Benjamin Block @ 2017-04-13 12:23 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: linux-scsi, linux-block, axboe

On Wed, Apr 12, 2017 at 06:11:25PM +0000, Bart Van Assche wrote:
> On Wed, 2017-04-12 at 12:55 +0200, Benjamin Block wrote:
> > On Fri, Apr 07, 2017 at 11:16:48AM -0700, Bart Van Assche wrote:
> > > The six patches in this patch series fix the queue lockup I reported
> > > recently on the linux-block mailing list. Please consider these patches
> > > for inclusion in the upstream kernel.
> >
> > just out of curiosity. Is this maybe related to similar stuff happening
> > when CPUs are hot plugged - at least in that the stack gets stuck? Like
> > in this thread here:
> > https://www.mail-archive.com/linux-block@vger.kernel.org/msg06057.html
> >
> > Would be interesting, because we recently saw similar stuff happening.
>
> Hello Benjamin,
>
> My proposal is to repeat that test with Jens' for-next branch. If the issue
> still occurs with that tree then please check the contents of
> /sys/kernel/debug/block/*/mq/*/{dispatch,*/rq_list}. That will allow to
> determine whether or not any block layer requests are still pending. If
> running the command below resolves the deadlock then it means that a
> trigger to run a block layer queue is still missing somewhere:
>
> for a in /sys/kernel/debug/block/*/mq/state; do echo run >$a; done
>
> See also git://git.kernel.dk/linux-block.git.
>

Thx for the hint! I'll forward that and see if the affected folks are
willing to reproduce.


                                                    Beste Grüße / Best regards,
                                                      - Benjamin Block
--
Linux on z Systems Development         /         IBM Systems & Technology Group
		  IBM Deutschland Research & Development GmbH
Vorsitz. AufsR.: Martina Koederitz     /        Geschäftsführung: Dirk Wittkopp
Sitz der Gesellschaft: Böblingen / Registergericht: AmtsG Stuttgart, HRB 243294

^ permalink raw reply	[flat|nested] 55+ messages in thread


* Re: [PATCH v4 6/6] dm rq: Avoid that request processing stalls sporadically
  2017-04-13  2:20                     ` Ming Lei
  (?)
@ 2017-04-13 16:59                     ` Bart Van Assche
  2017-04-14  1:13                       ` Ming Lei
  -1 siblings, 1 reply; 55+ messages in thread
From: Bart Van Assche @ 2017-04-13 16:59 UTC (permalink / raw)
  To: Ming Lei; +Cc: linux-scsi, dm-devel, linux-block, snitzer, axboe

On 04/12/17 19:20, Ming Lei wrote:
> On Wed, Apr 12, 2017 at 06:38:07PM +0000, Bart Van Assche wrote:
>> If the blk-mq core would always rerun a hardware queue if a block driver
>> returns BLK_MQ_RQ_QUEUE_BUSY then that would cause 100% of a single CPU core
> 
> It won't cause 100% CPU utilization since we restart queue in completion
> path and at that time at least one tag is available, then progress can be
> made.

Hello Ming,

Sorry but you are wrong. If .queue_rq() returns BLK_MQ_RQ_QUEUE_BUSY
then it's likely that calling .queue_rq() again after only a few
microseconds will cause it to return BLK_MQ_RQ_QUEUE_BUSY again. If you
don't believe me, change "if (!blk_mq_sched_needs_restart(hctx) &&
!test_bit(BLK_MQ_S_TAG_WAITING, &hctx->state)) blk_mq_run_hw_queue(hctx,
true);" into "blk_mq_run_hw_queue(hctx, true);", trigger a busy
condition for either a SCSI LLD or a dm-rq driver, run top and you will
see that the CPU usage of a kworker thread jumps up to 100%.

Bart.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v4 6/6] dm rq: Avoid that request processing stalls sporadically
  2017-04-13 16:59                     ` Bart Van Assche
@ 2017-04-14  1:13                       ` Ming Lei
  2017-04-14 17:12                           ` Bart Van Assche
  0 siblings, 1 reply; 55+ messages in thread
From: Ming Lei @ 2017-04-14  1:13 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: linux-scsi, dm-devel, linux-block, snitzer, axboe

On Thu, Apr 13, 2017 at 09:59:57AM -0700, Bart Van Assche wrote:
> On 04/12/17 19:20, Ming Lei wrote:
> > On Wed, Apr 12, 2017 at 06:38:07PM +0000, Bart Van Assche wrote:
> >> If the blk-mq core would always rerun a hardware queue if a block driver
> >> returns BLK_MQ_RQ_QUEUE_BUSY then that would cause 100% of a single CPU core
> > 
> > It won't cause 100% CPU utilization since we restart queue in completion
> > path and at that time at least one tag is available, then progress can be
> > made.
> 
> Hello Ming,
> 
> Sorry but you are wrong. If .queue_rq() returns BLK_MQ_RQ_QUEUE_BUSY
> then it's likely that calling .queue_rq() again after only a few
> microseconds will cause it to return BLK_MQ_RQ_QUEUE_BUSY again. If you
> don't believe me, change "if (!blk_mq_sched_needs_restart(hctx) &&
> !test_bit(BLK_MQ_S_TAG_WAITING, &hctx->state)) blk_mq_run_hw_queue(hctx,
> true);" into "blk_mq_run_hw_queue(hctx, true);", trigger a busy

Yes, that can be true, but I mean it is still OK to run the queue again
with

	if (!blk_mq_sched_needs_restart(hctx) &&
	    !test_bit(BLK_MQ_S_TAG_WAITING, &hctx->state))
			blk_mq_run_hw_queue(hctx, true);

and restarting the queue in __blk_mq_finish_request() when
BLK_MQ_RQ_QUEUE_BUSY is returned from .queue_rq(). Both are part of the
current blk-mq implementation.

Then why do we need blk_mq_delay_run_hw_queue(hctx, 100/*ms*/) in dm?

Thanks,
Ming

^ permalink raw reply	[flat|nested] 55+ messages in thread


* Re: [PATCH v4 6/6] dm rq: Avoid that request processing stalls sporadically
@ 2017-04-14 17:12                           ` Bart Van Assche
  0 siblings, 0 replies; 55+ messages in thread
From: Bart Van Assche @ 2017-04-14 17:12 UTC (permalink / raw)
  To: ming.lei; +Cc: linux-scsi, dm-devel, linux-block, snitzer, axboe

On Fri, 2017-04-14 at 09:13 +0800, Ming Lei wrote:
> On Thu, Apr 13, 2017 at 09:59:57AM -0700, Bart Van Assche wrote:
> > On 04/12/17 19:20, Ming Lei wrote:
> > > On Wed, Apr 12, 2017 at 06:38:07PM +0000, Bart Van Assche wrote:
> > > > If the blk-mq core would always rerun a hardware queue if a block driver
> > > > returns BLK_MQ_RQ_QUEUE_BUSY then that would cause 100% of a single CPU core
> > > 
> > > It won't cause 100% CPU utilization since we restart queue in completion
> > > path and at that time at least one tag is available, then progress can be
> > > made.
> > 
> > Hello Ming,
> > 
> > Sorry but you are wrong. If .queue_rq() returns BLK_MQ_RQ_QUEUE_BUSY
> > then it's likely that calling .queue_rq() again after only a few
> > microseconds will cause it to return BLK_MQ_RQ_QUEUE_BUSY again. If you
> > don't believe me, change "if (!blk_mq_sched_needs_restart(hctx) &&
> > !test_bit(BLK_MQ_S_TAG_WAITING, &hctx->state)) blk_mq_run_hw_queue(hctx,
> > true);" into "blk_mq_run_hw_queue(hctx, true);", trigger a busy
> 
> Yes, that can be true, but I mean it is still OK to run the queue again
> with
> 
> 	if (!blk_mq_sched_needs_restart(hctx) &&
> 	    !test_bit(BLK_MQ_S_TAG_WAITING, &hctx->state))
> 			blk_mq_run_hw_queue(hctx, true);
> 
> and restarting queue in __blk_mq_finish_request() when
> BLK_MQ_RQ_QUEUE_BUSY is returned from .queue_rq(). And both are in current
> blk-mq implementation.
> 
> Then why do we need blk_mq_delay_run_hw_queue(hctx, 100/*ms*/) in dm?

Because if dm_mq_queue_rq() returns BLK_MQ_RQ_QUEUE_BUSY, there is no
guarantee that __blk_mq_finish_request() will be called later on for the
same queue. dm_mq_queue_rq() can e.g. return BLK_MQ_RQ_QUEUE_BUSY while no
dm requests are in progress because the SCSI error handler is active for
all underlying paths. See also scsi_lld_busy() and scsi_host_in_recovery().

Bart.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v4 6/6] dm rq: Avoid that request processing stalls sporadically
  2017-04-14 17:12                           ` Bart Van Assche
  (?)
@ 2017-04-16 10:21                           ` Ming Lei
  -1 siblings, 0 replies; 55+ messages in thread
From: Ming Lei @ 2017-04-16 10:21 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: linux-scsi, dm-devel, linux-block, snitzer, axboe

On Fri, Apr 14, 2017 at 05:12:50PM +0000, Bart Van Assche wrote:
> On Fri, 2017-04-14 at 09:13 +0800, Ming Lei wrote:
> > On Thu, Apr 13, 2017 at 09:59:57AM -0700, Bart Van Assche wrote:
> > > On 04/12/17 19:20, Ming Lei wrote:
> > > > On Wed, Apr 12, 2017 at 06:38:07PM +0000, Bart Van Assche wrote:
> > > > > If the blk-mq core would always rerun a hardware queue if a block driver
> > > > > returns BLK_MQ_RQ_QUEUE_BUSY then that would cause 100% of a single CPU core
> > > > 
> > > > It won't cause 100% CPU utilization since we restart queue in completion
> > > > path and at that time at least one tag is available, then progress can be
> > > > made.
> > > 
> > > Hello Ming,
> > > 
> > > Sorry but you are wrong. If .queue_rq() returns BLK_MQ_RQ_QUEUE_BUSY
> > > then it's likely that calling .queue_rq() again after only a few
> > > microseconds will cause it to return BLK_MQ_RQ_QUEUE_BUSY again. If you
> > > don't believe me, change "if (!blk_mq_sched_needs_restart(hctx) &&
> > > !test_bit(BLK_MQ_S_TAG_WAITING, &hctx->state)) blk_mq_run_hw_queue(hctx,
> > > true);" into "blk_mq_run_hw_queue(hctx, true);", trigger a busy
> > 
> > Yes, that can be true, but I mean it is still OK to run the queue again
> > with
> > 
> > 	if (!blk_mq_sched_needs_restart(hctx) &&
> > 	    !test_bit(BLK_MQ_S_TAG_WAITING, &hctx->state))
> > 			blk_mq_run_hw_queue(hctx, true);
> > 
> > and restarting queue in __blk_mq_finish_request() when
> > BLK_MQ_RQ_QUEUE_BUSY is returned from .queue_rq(). And both are in current
> > blk-mq implementation.
> > 
> > Then why do we need blk_mq_delay_run_hw_queue(hctx, 100/*ms*/) in dm?
> 
> Because if dm_mq_queue_rq() returns BLK_MQ_RQ_QUEUE_BUSY, there is no
> guarantee that __blk_mq_finish_request() will be called later on for the
> same queue. dm_mq_queue_rq() can e.g. return BLK_MQ_RQ_QUEUE_BUSY while no
> dm requests are in progress because the SCSI error handler is active for
> all underlying paths. See also scsi_lld_busy() and scsi_host_in_recovery().

OK, thanks Bart for the explanation.

This looks like a very interesting BLK_MQ_RQ_QUEUE_BUSY case which isn't
caused by too many pending I/Os; I will study this case further.


Thanks,
Ming

^ permalink raw reply	[flat|nested] 55+ messages in thread

end of thread, other threads:[~2017-04-16 10:21 UTC | newest]

Thread overview: 55+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-04-07 18:16 [PATCH v4 0/6] Avoid that scsi-mq and dm-mq queue processing stalls sporadically Bart Van Assche
2017-04-07 18:16 ` Bart Van Assche
2017-04-07 18:16 ` [PATCH v4 1/6] blk-mq: Make it safe to use RCU to iterate over blk_mq_tag_set.tag_list Bart Van Assche
2017-04-07 18:16   ` Bart Van Assche
2017-04-10  7:10   ` Christoph Hellwig
2017-04-07 18:16 ` [PATCH v4 2/6] blk-mq: Restart a single queue if tag sets are shared Bart Van Assche
2017-04-07 18:16   ` Bart Van Assche
2017-04-10  7:11   ` Christoph Hellwig
2017-04-07 18:16 ` [PATCH v4 3/6] blk-mq: Clarify comments in blk_mq_dispatch_rq_list() Bart Van Assche
2017-04-07 18:16   ` Bart Van Assche
2017-04-10  7:11   ` Christoph Hellwig
2017-04-07 18:16 ` [PATCH v4 4/6] blk-mq: Introduce blk_mq_delay_run_hw_queue() Bart Van Assche
2017-04-07 18:16   ` Bart Van Assche
2017-04-10  7:12   ` Christoph Hellwig
2017-04-10 15:02     ` Jens Axboe
2017-04-07 18:16 ` [PATCH v4 5/6] scsi: Avoid that SCSI queues get stuck Bart Van Assche
2017-04-07 18:16   ` Bart Van Assche
2017-04-10  7:12   ` Christoph Hellwig
2017-04-07 18:16 ` [PATCH v4 6/6] dm rq: Avoid that request processing stalls sporadically Bart Van Assche
2017-04-07 18:16   ` Bart Van Assche
2017-04-11 16:09   ` Mike Snitzer
2017-04-11 16:09     ` Mike Snitzer
2017-04-11 16:26     ` Bart Van Assche
2017-04-11 16:26       ` Bart Van Assche
2017-04-11 17:47       ` Mike Snitzer
2017-04-11 17:51         ` Bart Van Assche
2017-04-11 17:51           ` Bart Van Assche
2017-04-11 18:03           ` Mike Snitzer
2017-04-11 18:03             ` Mike Snitzer
2017-04-11 18:18             ` Bart Van Assche
2017-04-11 18:18               ` Bart Van Assche
2017-04-12  3:42               ` Ming Lei
2017-04-12  3:42                 ` Ming Lei
2017-04-12 18:38                 ` Bart Van Assche
2017-04-12 18:38                   ` Bart Van Assche
2017-04-12 18:38                   ` Bart Van Assche
2017-04-13  2:20                   ` Ming Lei
2017-04-13  2:20                     ` Ming Lei
2017-04-13 16:59                     ` Bart Van Assche
2017-04-14  1:13                       ` Ming Lei
2017-04-14 17:12                         ` Bart Van Assche
2017-04-14 17:12                           ` Bart Van Assche
2017-04-16 10:21                           ` Ming Lei
2017-04-07 18:23 ` [PATCH v4 0/6] Avoid that scsi-mq and dm-mq queue " Jens Axboe
2017-04-07 18:33   ` Bart Van Assche
2017-04-07 18:33     ` Bart Van Assche
2017-04-07 18:39     ` Bart Van Assche
2017-04-07 18:39       ` Bart Van Assche
2017-04-07 18:51       ` Jens Axboe
2017-04-12 10:55 ` Benjamin Block
2017-04-12 10:55   ` Benjamin Block
2017-04-12 18:11   ` Bart Van Assche
2017-04-12 18:11     ` Bart Van Assche
2017-04-13 12:23     ` Benjamin Block
2017-04-13 12:23       ` Benjamin Block
